SparkEx.GroupedData (SparkEx v0.1.0)


Represents a grouped DataFrame, created by SparkEx.DataFrame.group_by/2.

Aggregation functions like agg/2 can be applied to produce a new DataFrame. The convenience functions count/2, min/2, max/2, sum/2, avg/2, and mean/2 apply common aggregations, and pivot/3 enables pivot-style aggregation.

Note: the numeric convenience functions (sum/1, avg/1, min/1, max/1, mean/1), when called without explicit columns, make an eager schema RPC to discover the numeric columns. The schema is cached per process, so repeated calls on the same GroupedData avoid redundant RPCs. Pass explicit column names to skip the schema lookup entirely.
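The difference between the two call styles described in the note can be sketched as follows. This is a minimal illustration, not taken from the library's own docs: the module name AvgSketch, its function names, and the "department" and "salary" columns are hypothetical, and it assumes explicit columns are passed as a list of name strings.

```elixir
defmodule AvgSketch do
  alias SparkEx.{DataFrame, GroupedData}

  # `df` is assumed to be an existing SparkEx.DataFrame with
  # "department" and "salary" columns (hypothetical names).

  # No columns given: per the note above, this triggers an eager
  # schema RPC so the numeric columns can be discovered.
  def avg_all_numeric(df) do
    df
    |> DataFrame.group_by(["department"])
    |> GroupedData.avg()
  end

  # Explicit column names: per the note above, the schema lookup is skipped.
  def avg_salary(df) do
    df
    |> DataFrame.group_by(["department"])
    |> GroupedData.avg(["salary"])
  end
end
```

When the columns of interest are known up front, the second form avoids the extra round trip.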

Summary

Functions

agg(gd, agg_columns)
Applies aggregate expressions to the grouped data, returning a new DataFrame.

avg(gd, cols \\ [])
Computes the average for each group.

count(gd, cols \\ [])
Counts the number of records for each group.

max(gd, cols \\ [])
Computes the maximum value for each group.

mean(gd, cols \\ [])
Computes the mean (alias for avg) for each group.

min(gd, cols \\ [])
Computes the minimum value for each group.

pivot(gd, pivot_col, values \\ nil)
Pivots on a column, enabling pivot-style aggregation.

sum(gd, cols \\ [])
Computes the sum for each group.
Types

t()

@type t() :: %SparkEx.GroupedData{
  cached_schema: term() | nil,
  group_type: atom(),
  grouping_exprs: [SparkEx.Column.expr()],
  grouping_sets: [[SparkEx.Column.expr()]] | nil,
  pivot_col: SparkEx.Column.expr() | nil,
  pivot_values: [term()] | nil,
  plan: term(),
  session: GenServer.server()
}

Functions

agg(gd, agg_columns)

@spec agg(t(), [SparkEx.Column.t()] | map()) :: SparkEx.DataFrame.t()

Applies aggregate expressions to the grouped data, returning a new DataFrame.

Accepts a list of SparkEx.Column structs representing aggregate expressions (e.g. Functions.sum(col("amount")), Functions.count(col("id"))). The typespec also allows a map in place of the list.

Examples

import SparkEx.Functions
alias SparkEx.DataFrame

# Total salary and average age per department.
df
|> DataFrame.group_by(["department"])
|> SparkEx.GroupedData.agg([sum(col("salary")), avg(col("age"))])

avg(gd, cols \\ [])

Computes the average for each group.

count(gd, cols \\ [])

Counts the number of records for each group.

max(gd, cols \\ [])

Computes the maximum value for each group.

mean(gd, cols \\ [])

Computes the mean (alias for avg) for each group.

min(gd, cols \\ [])

Computes the minimum value for each group.

pivot(gd, pivot_col, values \\ nil)

@spec pivot(t(), SparkEx.Column.t() | String.t(), [term()] | nil) :: t()

Pivots on a column, enabling pivot-style aggregation.

After calling pivot/3, use agg/2 to specify the aggregation.

Examples

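import SparkEx.Functions
alias SparkEx.DataFrame

# Sum earnings per year, with one output column per listed course.
df
|> DataFrame.group_by(["year"])
|> SparkEx.GroupedData.pivot("course", ["dotNET", "Java"])
|> SparkEx.GroupedData.agg([sum(col("earnings"))])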
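Because values defaults to nil, pivot can also be called with only the pivot column. The sketch below shows that call shape. PivotSketch and earnings_by_course/1 are hypothetical names, and this page does not say how the pivot values are determined when the list is omitted. In Apache Spark's own API, omitting the values makes Spark compute the distinct values of the pivot column first, so listing them explicitly is usually cheaper.

```elixir
defmodule PivotSketch do
  import SparkEx.Functions, only: [sum: 1, col: 1]
  alias SparkEx.{DataFrame, GroupedData}

  # `df` is assumed to have "year", "course", and "earnings" columns.
  def earnings_by_course(df) do
    df
    |> DataFrame.group_by(["year"])
    # Third argument omitted, so `values` takes its default of nil.
    |> GroupedData.pivot("course")
    |> GroupedData.agg([sum(col("earnings"))])
  end
end
```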

sum(gd, cols \\ [])

Computes the sum for each group.
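For comparison, the same per-group sum can be written with the convenience function or with agg/2. This is a sketch under assumptions: SumSketch, its function names, and the "department" and "salary" columns are hypothetical, explicit columns are assumed to be a list of name strings, and this page does not state whether the two forms name their result columns identically.

```elixir
defmodule SumSketch do
  import SparkEx.Functions, only: [sum: 1, col: 1]
  alias SparkEx.{DataFrame, GroupedData}

  # Convenience form: sum/2 with an explicit column name.
  def via_convenience(df) do
    df
    |> DataFrame.group_by(["department"])
    |> GroupedData.sum(["salary"])
  end

  # Explicit form: agg/2 with an aggregate Column expression.
  def via_agg(df) do
    df
    |> DataFrame.group_by(["department"])
    |> GroupedData.agg([sum(col("salary"))])
  end
end
```

The agg/2 form is the one to reach for when several different aggregates are needed in a single pass.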