A lazy reference to a Spark DataFrame.
DataFrame structs hold a session reference and an internal plan representation.
Transforms build up the plan; actions (collect/1, count/1, etc.) execute it
via the Spark Connect server.
Transforms (lazy)
- select/2 — project columns
- filter/2 — filter rows by condition
- with_column/3 — add or replace a column
- drop/2 — drop columns
- order_by/2 — sort rows
- limit/2 — limit number of rows
- group_by/2 — group by columns (returns SparkEx.GroupedData)
- join/4 — join two DataFrames
- distinct/1 — deduplicate all rows
- union/2 — union two DataFrames
Actions (execute)
Summary
Functions
Aggregate without grouping.
Aliases this DataFrame for use in subqueries.
Computes approximate quantiles. Delegates to SparkEx.DataFrame.Stat.approx_quantile/4.
Performs an as-of join between two DataFrames.
Returns this DataFrame as a table-argument wrapper.
Alias for checkpoint/2 (PySpark checkpoint).
Materializes this DataFrame as a cached relation.
Reduces the number of partitions without shuffling data.
Returns a column reference bound to this DataFrame plan.
Selects columns by regex.
Collects all rows from the DataFrame as a list of maps.
Collects rows into a map using the first column as key and second column as value.
Returns a list of column names.
Computes Pearson correlation. Delegates to SparkEx.DataFrame.Stat.corr/4.
Returns the row count of the DataFrame.
Computes covariance. Delegates to SparkEx.DataFrame.Stat.cov/3.
Creates a global temporary view with the given name.
Creates or replaces a global temporary view with the given name.
Creates or replaces a temporary view with the given name.
Creates a temporary view with the given name.
Cross join — shorthand for join(df, other, [], :cross).
Computes crosstab. Delegates to SparkEx.DataFrame.Stat.crosstab/3.
Groups by cube of the specified columns.
Describes basic statistics. Delegates to SparkEx.DataFrame.Stat.describe/2.
Returns a new DataFrame with duplicate rows removed.
Drops the specified columns.
Drops duplicate rows based on a subset of columns.
Drops duplicate rows within the watermark window.
Drops a global temporary view by name.
Drops a local temporary view by name.
Drops rows with null values. Delegates to SparkEx.DataFrame.NA.drop/2.
Returns list of {column_name, type_string} tuples.
Returns rows in this DataFrame that are not in the other DataFrame.
Returns rows in this DataFrame that are not in the other, preserving duplicates
(equivalent to SQL EXCEPT ALL).
Returns execution metrics from the last action on the session.
Alias for execution_info/1 (PySpark executionInfo).
Returns this DataFrame as an EXISTS subquery expression.
Returns the explain string for the DataFrame's plan.
Fills null values. Delegates to SparkEx.DataFrame.NA.fill/3.
Filters rows based on a boolean condition.
Returns the first row as a map, or nil if empty.
Applies a function to each row on the driver.
Applies a function to each partition (driver-side shim).
Finds frequent items. Delegates to SparkEx.DataFrame.Stat.freq_items/3.
Groups the DataFrame by the given columns, returning a SparkEx.GroupedData.
Alias for group_by/2 (PySpark groupby).
Groups by grouping sets.
Returns the first row as a map, or nil if empty.
Returns the first n rows as a list of maps.
Adds a query optimization hint.
Returns an HTML string representation of the DataFrame.
Returns this DataFrame as an IN subquery expression.
Returns the input files for the plan.
Returns rows in this DataFrame that are also in the other DataFrame.
Returns rows common to both DataFrames, preserving duplicates
(equivalent to SQL INTERSECT ALL).
Returns true if the DataFrame is cached (storage level is not NONE).
Alias for is_cached/1 (PySpark is_cached).
Returns true if the DataFrame has no rows.
Alias for is_empty/1.
Checks if the plan is local (can be computed without Spark).
Checks if the plan represents a streaming query.
Alias for is_streaming/1.
Joins this DataFrame with another on the given condition.
Performs a lateral join between two DataFrames.
Limits the number of rows.
Alias for local_checkpoint/2 with default options.
Materializes this DataFrame as a local (non-reliable) checkpoint.
Alias for local_checkpoint/2 (PySpark localCheckpoint).
Creates a SparkEx.MergeIntoWriter for MERGE INTO operations.
Returns a metadata column expression by name.
Returns the DataFrame for use with SparkEx.DataFrame.NA functions.
Observes metrics during query execution.
Skips the first n rows.
Sorts the DataFrame by the given columns or sort orders.
Parses string columns in the DataFrame as CSV or JSON.
Persists the DataFrame with optional storage level.
Convenience wrapper for SparkEx.GroupedData.pivot/3 so grouped pipelines can stay under DataFrame.
Prints the schema tree, mirroring PySpark printSchema.
Randomly splits the DataFrame into multiple DataFrames using normalized weights.
Returns the number of partitions in the underlying RDD.
Alias for create_or_replace_temp_view/3 (PySpark registerTempTable).
Alias for create_or_replace_temp_view/3 (PySpark registerTempTable).
Repartitions the DataFrame.
Repartitions by partition ID.
Repartitions the DataFrame by range using sort order expressions without specifying partitions.
Repartitions the DataFrame by range using sort order expressions.
Replaces values. Delegates to SparkEx.DataFrame.NA.replace/4.
Groups by rollup of the specified columns.
Checks if this DataFrame has the same semantics as another.
Returns a random sample of rows.
Returns stratified sample. Delegates to SparkEx.DataFrame.Stat.sample_by/4.
Returns this DataFrame as a scalar subquery expression.
Returns the schema of the DataFrame via AnalyzePlan.
Projects a set of columns or expressions.
Projects columns using SQL expression strings.
Returns the semantic hash of the plan.
Returns a formatted string representation of the DataFrame (like PySpark's show()).
Alias for order_by/2 (PySpark sort).
Sorts within each partition by the given columns.
Returns the parent Spark session.
Alias for spark_session/1 (PySpark sparkSession).
Returns the DataFrame for use with SparkEx.DataFrame.Stat functions.
Returns the storage level of a persisted DataFrame.
Alias for except/2 (EXCEPT DISTINCT, matching PySpark subtract).
Computes summary statistics. Delegates to SparkEx.DataFrame.Stat.summary/2.
Creates a DataFrame from a table-valued function (TVF) call.
Tags the DataFrame with an operation tag for interrupt targeting.
Returns the last n rows as a list of maps.
Returns a lazy DataFrame relation for the last n rows.
Returns up to n rows from the DataFrame as a list of maps.
Casts this DataFrame to the given schema.
Materializes the DataFrame as a raw Arrow IPC binary.
Renames all columns in the DataFrame.
Materializes the DataFrame as an Explorer.DataFrame.
Converts each row to a JSON string, returning a single-column DataFrame.
Returns a lazy enumerable over collected rows.
Applies a transformation function to the DataFrame.
Transposes the DataFrame.
Returns the tree-string representation of the plan.
Returns a new DataFrame with the union of rows from both DataFrames.
Alias for union/2.
Union by column name rather than position.
Returns a new DataFrame with the union of rows, removing duplicates
(equivalent to SQL UNION).
Alias for union/2 (PySpark unionAll).
Unpersists the DataFrame.
Unpivots a DataFrame from wide to long format.
Alias for filter/2.
Adds or replaces a column with the given name and expression.
Renames a single column.
Adds or replaces multiple columns at once.
Renames multiple columns using a map of old -> new names.
Adds or replaces metadata for an existing column.
Adds a watermark for streaming event-time processing.
Returns a SparkEx.Writer builder for this DataFrame.
Returns a SparkEx.StreamWriter builder for this streaming DataFrame.
Returns a SparkEx.WriterV2 builder for this DataFrame targeting the given table.
Types
@type plan() :: term()
@type t() :: %SparkEx.DataFrame{ plan: plan(), session: GenServer.server(), tags: [String.t()] }
Functions
@spec agg( t() | SparkEx.GroupedData.t(), SparkEx.Column.t() | [SparkEx.Column.t()] | map() ) :: t()
Aggregate without grouping.
Shortcut for df |> group_by([]) |> GroupedData.agg(exprs).
Examples
df |> DataFrame.agg([Functions.count(Functions.col("id"))])
Aliases this DataFrame for use in subqueries.
Examples
df |> DataFrame.alias("t")
@spec approx_quantile(t(), String.t() | [String.t()], [float()], float()) :: {:ok, [float()] | [[float()]]} | {:error, term()}
Computes approximate quantiles. Delegates to SparkEx.DataFrame.Stat.approx_quantile/4.
@spec as_of_join( t(), t(), SparkEx.Column.t() | String.t(), SparkEx.Column.t() | String.t(), keyword() ) :: t()
Performs an as-of join between two DataFrames.
Options
- :tolerance — optional tolerance expression (e.g. Functions.lit(10))
- :allow_exact_matches — whether exact matches are allowed (default: true)
- :direction — join direction string (default: "backward")
Examples
import SparkEx.Functions, only: [col: 1, lit: 1]
DataFrame.as_of_join(df1, df2, col("t1"), col("t2"), on: col("id"), tolerance: lit(5))
@spec as_table(t()) :: SparkEx.TableArg.t()
Returns this DataFrame as a table-argument wrapper.
Alias for persist/2 with default storage level (PySpark cache).
Alias for checkpoint/2 (PySpark checkpoint).
Materializes this DataFrame as a cached relation.
Options
- :eager — whether to checkpoint eagerly (default: true)
@spec coalesce(t(), pos_integer()) :: t()
Reduces the number of partitions without shuffling data.
Examples
df |> DataFrame.coalesce(1)
@spec col(t(), String.t() | atom()) :: SparkEx.Column.t()
Returns a column reference bound to this DataFrame plan.
Useful for disambiguating columns across joins/subqueries.
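A minimal sketch of such disambiguation, assuming two DataFrames that both carry an id column (df1 and df2 are placeholders):

```elixir
# Bind a column reference to each DataFrame's plan so Spark can tell
# the two "id" columns apart in the join condition.
left_id = SparkEx.DataFrame.col(df1, "id")
right_id = SparkEx.DataFrame.col(df2, "id")

joined = SparkEx.DataFrame.join(df1, df2, SparkEx.Column.eq(left_id, right_id), :inner)
```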
@spec col_regex(t(), String.t()) :: SparkEx.Column.t()
Selects columns by regex.
Examples
df |> DataFrame.col_regex("^name_.*")
Collects all rows from the DataFrame as a list of maps.
Options
- :timeout — gRPC call timeout in ms (default: 60_000)
Collects rows into a map using the first column as key and second column as value.
The DataFrame must have exactly two columns.
Returns a list of column names.
Computes Pearson correlation. Delegates to SparkEx.DataFrame.Stat.corr/4.
@spec count(t()) :: {:ok, non_neg_integer()} | {:error, term()}
Returns the row count of the DataFrame.
Computes covariance. Delegates to SparkEx.DataFrame.Stat.cov/3.
Creates a global temporary view with the given name.
Global temp views are accessible across sessions within the same Spark application
and are available in the global_temp database.
Creates or replaces a global temporary view with the given name.
Creates or replaces a temporary view with the given name.
Creates a temporary view with the given name.
Raises an error if a view with this name already exists.
Cross join — shorthand for join(df, other, [], :cross).
Computes crosstab. Delegates to SparkEx.DataFrame.Stat.crosstab/3.
@spec cube( t(), SparkEx.Column.t() | String.t() | atom() | [SparkEx.Column.t() | String.t() | atom()] ) :: SparkEx.GroupedData.t()
Groups by cube of the specified columns.
Describes basic statistics. Delegates to SparkEx.DataFrame.Stat.describe/2.
Returns a new DataFrame with duplicate rows removed.
Examples
df |> SparkEx.DataFrame.distinct()
@spec drop( t(), [SparkEx.Column.t() | String.t() | atom()] | SparkEx.Column.t() | String.t() | atom() ) :: t()
Drops the specified columns.
Accepts a list of column names as strings or atoms.
Examples
df |> SparkEx.DataFrame.drop(["temp_col", "debug_col"])
df |> SparkEx.DataFrame.drop([:temp_col])
@spec drop_duplicates(t(), [SparkEx.Column.t() | String.t() | atom()] | nil) :: t()
Drops duplicate rows based on a subset of columns.
When subset is empty, deduplicates on all columns (like distinct/1).
Examples
df |> DataFrame.drop_duplicates(["id", "name"])
@spec drop_duplicates_within_watermark(t(), [SparkEx.Column.t() | String.t() | atom()]) :: t()
Drops duplicate rows within the watermark window.
@spec drop_global_temp_view(GenServer.server(), String.t()) :: {:ok, boolean()} | {:error, term()}
Drops a global temporary view by name.
@spec drop_temp_view(GenServer.server(), String.t()) :: {:ok, boolean()} | {:error, term()}
Drops a local temporary view by name.
Drops rows with null values. Delegates to SparkEx.DataFrame.NA.drop/2.
@spec dtypes(t() | {:ok, t()} | {:error, term()}) :: {:ok, [{String.t(), String.t()}]} | {:error, term()}
Returns list of {column_name, type_string} tuples.
Returns rows in this DataFrame that are not in the other DataFrame.
Examples
DataFrame.except(df1, df2)
Returns rows in this DataFrame that are not in the other, preserving duplicates
(equivalent to SQL EXCEPT ALL).
Returns execution metrics from the last action on the session.
Alias for execution_info/1 (PySpark executionInfo).
@spec exists(t()) :: SparkEx.Column.t()
Returns this DataFrame as an EXISTS subquery expression.
Returns the explain string for the DataFrame's plan.
Modes: :simple, :extended, :codegen, :cost, :formatted
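A sketch of inspecting a plan, assuming explain/2 takes the mode atom as its second argument and returns the explain string in an :ok tuple:

```elixir
# Request the formatted physical/logical plan for the DataFrame.
{:ok, plan_text} = SparkEx.DataFrame.explain(df, :formatted)
IO.puts(plan_text)
```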
Fills null values. Delegates to SparkEx.DataFrame.NA.fill/3.
@spec filter(t(), SparkEx.Column.t() | String.t()) :: t()
Filters rows based on a boolean condition.
Examples
import SparkEx.Functions, only: [col: 1, lit: 1]
df |> SparkEx.DataFrame.filter(col("age") |> SparkEx.Column.gt(lit(18)))
Returns the first row as a map, or nil if empty.
Applies a function to each row on the driver.
@spec foreach_partition(t(), (Enumerable.t() -> term()), keyword()) :: :ok | {:error, term()}
Applies a function to each partition (driver-side shim).
Finds frequent items. Delegates to SparkEx.DataFrame.Stat.freq_items/3.
@spec group_by( t(), SparkEx.Column.t() | String.t() | atom() | [SparkEx.Column.t() | String.t() | atom()] ) :: SparkEx.GroupedData.t()
Groups the DataFrame by the given columns, returning a SparkEx.GroupedData.
Use SparkEx.GroupedData.agg/2 to apply aggregate functions.
Accepts a list of column names (strings or atoms) or Column structs.
Examples
import SparkEx.Functions
df
|> DataFrame.group_by(["department"])
|> SparkEx.GroupedData.agg([sum(col("salary"))])
@spec groupby(t(), [SparkEx.Column.t() | String.t() | atom()]) :: SparkEx.GroupedData.t()
Alias for group_by/2 (PySpark groupby).
@spec grouping_sets(t(), [[SparkEx.Column.t() | String.t() | atom()]], [ SparkEx.Column.t() | String.t() | atom() ]) :: SparkEx.GroupedData.t()
Groups by grouping sets.
Accepts a list of column lists, and an optional list of explicit grouping columns. When grouping columns are provided, they are used as the grouping expressions instead of being derived from the sets.
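For instance, a sketch that aggregates over the sets {dept}, {region}, and the grand total, with explicit grouping columns (column names here are illustrative):

```elixir
import SparkEx.Functions, only: [col: 1, sum: 1]

# Three grouping sets: by dept, by region, and the empty set (grand total),
# with dept and region passed as the explicit grouping columns.
df
|> SparkEx.DataFrame.grouping_sets([["dept"], ["region"], []], ["dept", "region"])
|> SparkEx.GroupedData.agg([sum(col("salary"))])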
Returns the first row as a map, or nil if empty.
Equivalent to PySpark's head() behavior.
@spec head( t(), keyword() ) :: {:ok, map() | nil} | {:error, term()}
@spec head(t(), non_neg_integer()) :: {:ok, [map()]} | {:error, term()}
Returns the first n rows as a list of maps.
Equivalent to take/3 but follows PySpark naming for head(n).
@spec head(t(), non_neg_integer(), keyword()) :: {:ok, [map()]} | {:error, term()}
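A quick usage sketch of the head(n) form:

```elixir
# Eagerly fetch the first five rows as maps.
{:ok, rows} = SparkEx.DataFrame.head(df, 5)
Enum.each(rows, &IO.inspect/1)
```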
Adds a query optimization hint.
Supports primitive values, Columns, and lists of primitive values/columns.
Returns an HTML string representation of the DataFrame.
Options
- :num_rows — number of rows (default: 20)
- :truncate — column width truncation (default: 20)
@spec in_subquery(t(), [SparkEx.Column.t()]) :: SparkEx.Column.t()
@spec in_subquery(SparkEx.Column.t(), t()) :: SparkEx.Column.t()
Returns this DataFrame as an IN subquery expression.
Accepts a list of expressions to compare against the subquery values.
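A sketch of the column-against-subquery form, assuming ids_df is a one-column DataFrame of allowed ids:

```elixir
import SparkEx.Functions, only: [col: 1]

# Keep only rows whose id appears in the ids_df subquery.
filtered =
  df
  |> SparkEx.DataFrame.filter(SparkEx.DataFrame.in_subquery(col("id"), ids_df))
```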
Returns the input files for the plan.
Returns rows in this DataFrame that are also in the other DataFrame.
Examples
DataFrame.intersect(df1, df2)
Returns rows common to both DataFrames, preserving duplicates
(equivalent to SQL INTERSECT ALL).
Returns true if the DataFrame is cached (storage level is not NONE).
Alias for is_cached/1 (PySpark is_cached).
Returns true if the DataFrame has no rows.
Alias for is_empty/1.
Checks if the plan is local (can be computed without Spark).
Checks if the plan represents a streaming query.
Alias for is_streaming/1.
@spec join( t(), t(), SparkEx.Column.t() | String.t() | atom() | [SparkEx.Column.t() | String.t() | atom()], atom() | String.t() ) :: t()
Joins this DataFrame with another on the given condition.
Join types
- :inner (default)
- :left — left outer join
- :right — right outer join
- :full — full outer join
- :cross — cross join (no condition needed)
- :left_semi — left semi join
- :left_anti — left anti join
Join conditions
The on parameter can be:
- A Column struct representing the join condition expression
- A list of column name strings for a USING join
Examples
import SparkEx.Functions, only: [col: 1]
DataFrame.join(df1, df2, Column.eq(col("df1.id"), col("df2.id")), :inner)
DataFrame.join(df1, df2, ["id"], :inner)
Performs a lateral join between two DataFrames.
The right plan is expected to reference columns from the left plan where supported.
@spec limit(t(), non_neg_integer()) :: t()
Limits the number of rows.
Examples
df |> SparkEx.DataFrame.limit(100)
Alias for local_checkpoint/2 with default options.
Materializes this DataFrame as a local (non-reliable) checkpoint.
Options
- :eager — whether to checkpoint eagerly (default: true)
- :storage_level — optional storage level struct (see SparkEx.Types.storage_level/0)
Alias for local_checkpoint/2 (PySpark localCheckpoint).
@spec melt( t(), [SparkEx.Column.t() | String.t() | atom()], [SparkEx.Column.t() | String.t() | atom()] | nil, String.t(), String.t() ) :: t()
Alias for unpivot/5.
@spec merge_into(t(), String.t()) :: SparkEx.MergeIntoWriter.t()
Creates a SparkEx.MergeIntoWriter for MERGE INTO operations.
Examples
df
|> DataFrame.merge_into("target_table")
|> MergeIntoWriter.on(col("source.id") |> Column.eq(col("target.id")))
|> MergeIntoWriter.when_matched_update_all()
|> MergeIntoWriter.when_not_matched_insert_all()
|> MergeIntoWriter.merge()
@spec merge_into(t(), String.t(), SparkEx.Column.t()) :: SparkEx.MergeIntoWriter.t()
@spec metadata_column(t(), String.t()) :: SparkEx.Column.t()
Returns a metadata column expression by name.
Examples
df |> DataFrame.metadata_column("_metadata")
Returns the DataFrame for use with SparkEx.DataFrame.NA functions.
@spec observe(t(), SparkEx.Observation.t() | String.t(), [SparkEx.Column.t()]) :: t()
Observes metrics during query execution.
Accepts a SparkEx.Observation struct or a name string, plus a list of Column expressions.
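A sketch using the name-string form (the metric name and expression are illustrative):

```elixir
import SparkEx.Functions, only: [lit: 1, count: 1]

# Attach a named metric that Spark computes alongside the query;
# results surface in the session's execution metrics.
observed = SparkEx.DataFrame.observe(df, "row_metrics", [count(lit(1))])
```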
@spec offset(t(), non_neg_integer()) :: t()
Skips the first n rows.
Examples
df |> DataFrame.offset(10)
@spec order_by( t(), SparkEx.Column.t() | String.t() | atom() | [SparkEx.Column.t() | String.t() | atom()], keyword() ) :: t()
Sorts the DataFrame by the given columns or sort orders.
Accepts a list of:
- SparkEx.Column structs (with optional .asc()/.desc())
- strings (ascending by default)
- atoms (ascending by default)
Examples
import SparkEx.Functions, only: [col: 1]
df |> SparkEx.DataFrame.order_by([col("age") |> SparkEx.Column.desc()])
df |> SparkEx.DataFrame.order_by(["name"])
@spec parse( t(), :csv | :json, String.t() | SparkEx.Types.struct_type() | nil, map() | nil ) :: t()
Parses string columns in the DataFrame as CSV or JSON.
Parameters
- format — :csv or :json
- schema — DDL string or struct type for the output schema (optional)
- options — map of parse options (optional)
Examples
df |> DataFrame.parse(:csv, "a INT, b STRING")
df |> DataFrame.parse(:json, "a INT, b STRING", %{"mode" => "FAILFAST"})
Persists the DataFrame with optional storage level.
Options
- :storage_level — a storage level struct (see SparkEx.Types.storage_level/0)
@spec pivot(SparkEx.GroupedData.t(), SparkEx.Column.t() | String.t(), [term()] | nil) :: SparkEx.GroupedData.t()
Convenience wrapper for SparkEx.GroupedData.pivot/3 so grouped pipelines can stay under DataFrame.
Prints the schema tree, mirroring PySpark printSchema.
Options
- :level — tree depth level (optional)
Randomly splits the DataFrame into multiple DataFrames using normalized weights.
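A hypothetical call shape, assuming a random_split/2 that takes the weight list as its second argument (name and arity are assumptions, not confirmed by this page):

```elixir
# Weights are normalized, so [0.8, 0.2] and [8, 2] produce the same split.
[train_df, test_df] = SparkEx.DataFrame.random_split(df, [0.8, 0.2])
```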
@spec rdd_num_partitions(t()) :: {:ok, non_neg_integer()} | {:error, term()}
Returns the number of partitions in the underlying RDD.
Alias for create_or_replace_temp_view/3 (PySpark registerTempTable).
Alias for create_or_replace_temp_view/3 (PySpark registerTempTable).
@spec repartition(t(), pos_integer() | [SparkEx.Column.t() | String.t() | atom()], [ SparkEx.Column.t() | String.t() | atom() ]) :: t()
Repartitions the DataFrame.
When called with an integer, does a hash repartition to num_partitions.
When called with an integer and columns, repartitions by those expressions.
When called with only columns (list), repartitions by those expressions
with default partition count.
Examples
df |> DataFrame.repartition(10)
df |> DataFrame.repartition(10, [col("key")])
df |> DataFrame.repartition([col("key")])
@spec repartition_by_id( t(), pos_integer() | nil, SparkEx.Column.t() | String.t() | atom() ) :: t()
Repartitions by partition ID.
Examples
df |> DataFrame.repartition_by_id(col("partition_col"))
@spec repartition_by_range(t(), [SparkEx.Column.t() | String.t() | atom()]) :: t()
Repartitions the DataFrame by range using sort order expressions without specifying partitions.
@spec repartition_by_range(t(), pos_integer(), [ SparkEx.Column.t() | String.t() | atom() ]) :: t()
Repartitions the DataFrame by range using sort order expressions.
This uses the RepartitionByExpression relation with sort-order expressions.
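A usage sketch with an explicit partition count (the ts column name is illustrative):

```elixir
import SparkEx.Functions, only: [col: 1]

# Range-partition into 4 partitions, ordered by the ts column.
ranged = SparkEx.DataFrame.repartition_by_range(df, 4, [col("ts")])
```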
Replaces values. Delegates to SparkEx.DataFrame.NA.replace/4.
@spec rollup( t(), SparkEx.Column.t() | String.t() | atom() | [SparkEx.Column.t() | String.t() | atom()] ) :: SparkEx.GroupedData.t()
Groups by rollup of the specified columns.
Checks if this DataFrame has the same semantics as another.
Returns a random sample of rows.
Options
- :with_replacement — sample with replacement (default: false)
- :seed — random seed (default: nil)
Examples
df |> DataFrame.sample(0.1)
df |> DataFrame.sample(0.5, with_replacement: true, seed: 42)
Returns stratified sample. Delegates to SparkEx.DataFrame.Stat.sample_by/4.
@spec scalar(t()) :: SparkEx.Column.t()
Returns this DataFrame as a scalar subquery expression.
Returns the schema of the DataFrame via AnalyzePlan.
@spec select( t(), SparkEx.Column.t() | String.t() | atom() | [SparkEx.Column.t() | String.t() | atom()] ) :: t()
Projects a set of columns or expressions.
Accepts a list of:
- SparkEx.Column structs
- strings (interpreted as column names)
- atoms (interpreted as column names)
Examples
import SparkEx.Functions, only: [col: 1, lit: 1]
df |> SparkEx.DataFrame.select([col("name"), col("age")])
df |> SparkEx.DataFrame.select(["name", "age"])
df |> SparkEx.DataFrame.select([:name, :age])
Projects columns using SQL expression strings.
Each string is parsed as a SQL expression by Spark.
Examples
df |> DataFrame.select_expr(["name", "age + 1 AS age_plus"])
Returns the semantic hash of the plan.
Returns a formatted string representation of the DataFrame (like PySpark's show()).
Options
- :num_rows — number of rows to show (default: 20)
- :truncate — column width truncation (default: 20, 0 for no truncation)
- :vertical — vertical display format (default: false)
@spec sort(t(), [SparkEx.Column.t() | String.t() | atom()]) :: t()
Alias for order_by/2 (PySpark sort).
@spec sort_within_partitions( t(), [SparkEx.Column.t() | String.t() | atom() | integer()], keyword() ) :: t()
Sorts within each partition by the given columns.
Examples
df |> DataFrame.sort_within_partitions(["key"])
@spec spark_session(t()) :: GenServer.server()
Returns the parent Spark session.
@spec sparkSession(t()) :: GenServer.server()
Alias for spark_session/1 (PySpark sparkSession).
Returns the DataFrame for use with SparkEx.DataFrame.Stat functions.
@spec storage_level(t()) :: {:ok, SparkEx.Types.storage_level()} | {:error, term()}
Returns the storage level of a persisted DataFrame.
Alias for except/2 (EXCEPT DISTINCT, matching PySpark subtract).
Computes summary statistics. Delegates to SparkEx.DataFrame.Stat.summary/2.
@spec table_function(GenServer.server(), String.t(), [SparkEx.Column.t() | term()]) :: t()
Creates a DataFrame from a table-valued function (TVF) call.
TVFs are built-in Spark functions that return tables (e.g. range, explode).
Arguments can be Column structs or literal values.
Examples
DataFrame.table_function(session, "range", [lit(0), lit(10)])
Tags the DataFrame with an operation tag for interrupt targeting.
Tags are propagated to the ExecutePlanRequest when the DataFrame is
executed. Multiple tags can be added by calling tag/2 multiple times.
Examples
df = SparkEx.sql(session, "SELECT * FROM big_table")
|> DataFrame.tag("etl-job-42")
# Later, from another process:
SparkEx.interrupt_tag(session, "etl-job-42")
@spec tail(t(), non_neg_integer()) :: {:ok, [map()]} | {:error, term()}
Returns the last n rows as a list of maps.
Mirrors PySpark tail(n) eager behavior.
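A quick usage sketch:

```elixir
# Eagerly fetch the last three rows as maps.
{:ok, last_rows} = SparkEx.DataFrame.tail(df, 3)
```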
@spec tail_df(t(), non_neg_integer()) :: t()
Returns a lazy DataFrame relation for the last n rows.
This preserves the previous lazy tail behavior when needed.
@spec take(t(), non_neg_integer(), keyword()) :: {:ok, [map()]} | {:error, term()}
Returns up to n rows from the DataFrame as a list of maps.
@spec to( t(), SparkEx.Types.data_type_proto() | String.t() | SparkEx.Types.struct_type() ) :: t()
Casts this DataFrame to the given schema.
Accepts a Spark Connect DataType, a DDL string, or a SparkEx.Types struct type.
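A sketch of the DDL-string form (the schema shown is illustrative):

```elixir
# Cast the frame to an explicit schema given as a DDL string.
typed = SparkEx.DataFrame.to(df, "id BIGINT, name STRING")
```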
Materializes the DataFrame as a raw Arrow IPC binary.
Returns an Arrow.Table if the :arrow dependency is available.
Renames all columns in the DataFrame.
Examples
df |> DataFrame.to_df(["id", "full_name", "years"])
@spec to_explorer( t(), keyword() ) :: {:ok, Explorer.DataFrame.t()} | {:error, term()}
Materializes the DataFrame as an Explorer.DataFrame.
By default, injects a LIMIT of max_rows into the Spark plan to prevent
unbounded collection. Pass unsafe: true to skip the limit injection.
Local decoder limits still apply unless you explicitly set max_rows: :infinity
and/or max_bytes: :infinity.
Options
- :max_rows — maximum number of rows (default: 10_000)
- :max_bytes — maximum Arrow data bytes (default: 64 MB)
- :unsafe — skip LIMIT injection only (default: false)
- :timeout — gRPC timeout in ms (default: 60_000)
Examples
{:ok, explorer_df} = DataFrame.to_explorer(df)
{:ok, explorer_df} = DataFrame.to_explorer(df, max_rows: 1_000)
{:ok, explorer_df} = DataFrame.to_explorer(df, unsafe: true)
Converts each row to a JSON string, returning a single-column DataFrame.
Equivalent to PySpark's DataFrame.toJSON().
Examples
df |> DataFrame.to_json_rows()
@spec to_local_iterator( t(), keyword() ) :: {:ok, Enumerable.t()} | {:error, term()}
Returns a lazy enumerable over collected rows.
Applies a transformation function to the DataFrame.
The function receives the DataFrame and must return a DataFrame.
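A sketch, assuming a transform/2 that threads the DataFrame through the given function (the helper and column names are hypothetical):

```elixir
# A reusable pipeline step: the function takes and returns a DataFrame.
add_audit_col = fn d ->
  SparkEx.DataFrame.with_column(d, "ingested_at", SparkEx.Functions.lit("2024-01-01"))
end

df |> SparkEx.DataFrame.transform(add_audit_col)
```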
@spec transpose(t(), keyword() | SparkEx.Column.t() | String.t()) :: t()
Transposes the DataFrame.
Options
- :index_column — column(s) to use as index (default: nil)
Returns the tree-string representation of the plan.
Options
- :level — tree depth level (optional)
Returns a new DataFrame with the union of rows from both DataFrames.
Both DataFrames must have the same schema. Duplicates are preserved
(equivalent to SQL UNION ALL).
Examples
DataFrame.union(df1, df2)
Alias for union/2.
Union by column name rather than position.
Options
- :allow_missing — if true, missing columns are filled with nulls (default: false)
Examples
DataFrame.union_by_name(df1, df2)
DataFrame.union_by_name(df1, df2, allow_missing: true)
Returns a new DataFrame with the union of rows, removing duplicates
(equivalent to SQL UNION).
Examples
DataFrame.union_distinct(df1, df2)
Alias for union/2 (PySpark unionAll).
Unpersists the DataFrame.
Options
- :blocking — whether to block until unpersisted (default: false)
@spec unpivot( t(), [SparkEx.Column.t() | String.t() | atom()], [ SparkEx.Column.t() | String.t() | atom() | {String.t() | atom(), String.t() | atom()} ] | nil, String.t(), String.t() ) :: t()
Unpivots a DataFrame from wide to long format.
Parameters
- ids — columns to keep as identifier columns
- values — columns to unpivot (nil for all non-id columns)
- variable_column_name — name for the variable column
- value_column_name — name for the value column
Examples
df |> DataFrame.unpivot(["id"], ["col1", "col2"], "variable", "value")
@spec where(t(), SparkEx.Column.t()) :: t()
Alias for filter/2.
@spec with_column(t(), String.t(), SparkEx.Column.t()) :: t()
Adds or replaces a column with the given name and expression.
Examples
import SparkEx.Functions, only: [col: 1, lit: 1]
df |> SparkEx.DataFrame.with_column("doubled", col("value") |> SparkEx.Column.multiply(lit(2)))
Renames a single column.
Examples
df |> DataFrame.with_column_renamed("old_name", "new_name")
@spec with_columns(t(), [{String.t(), SparkEx.Column.t()}] | map()) :: t()
Adds or replaces multiple columns at once.
Accepts a list of {name, column} tuples or a map of names to Column expressions.
Examples
import SparkEx.Functions, only: [col: 1, lit: 1]
df |> DataFrame.with_columns([
{"doubled", Column.multiply(col("x"), lit(2))},
{"const", lit(42)}
])
@spec with_columns_renamed( t(), %{required(String.t()) => String.t()} | (String.t() -> String.t()) ) :: t()
Renames multiple columns using a map of old -> new names.
When called with a function, makes an eager schema RPC to discover column names. The map variant is fully lazy.
Examples
df |> DataFrame.with_columns_renamed(%{"old1" => "new1", "old2" => "new2"})
df |> DataFrame.with_columns_renamed(&String.upcase/1)
Adds or replaces metadata for an existing column.
Adds a watermark for streaming event-time processing.
Examples
df |> DataFrame.with_watermark("event_time", "10 minutes")
@spec write(t()) :: SparkEx.Writer.t()
Returns a SparkEx.Writer builder for this DataFrame.
Examples
df
|> DataFrame.write()
|> SparkEx.Writer.format("parquet")
|> SparkEx.Writer.mode(:overwrite)
|> SparkEx.Writer.save("/data/output.parquet")
@spec write_stream(t()) :: SparkEx.StreamWriter.t()
Returns a SparkEx.StreamWriter builder for this streaming DataFrame.
Examples
df
|> DataFrame.write_stream()
|> SparkEx.StreamWriter.format("console")
|> SparkEx.StreamWriter.output_mode("append")
|> SparkEx.StreamWriter.start()
@spec write_v2(t(), String.t()) :: SparkEx.WriterV2.t()
Returns a SparkEx.WriterV2 builder for this DataFrame targeting the given table.
Examples
df
|> DataFrame.write_v2("catalog.db.my_table")
|> SparkEx.WriterV2.using("parquet")
|> SparkEx.WriterV2.create()