Statistical operations sub-API for DataFrames.
Provides descriptive statistics, correlation, covariance, crosstab, frequency items, approximate quantiles, and stratified sampling.
Most functions return lazy DataFrames. The value-returning functions
(corr/4, cov/3, approx_quantile/4) execute eagerly and return {:ok, value} or {:error, reason} tuples.
Summary
Functions
approx_quantile/4: Computes approximate quantiles for one or more columns.
corr/4: Computes the Pearson correlation coefficient between two columns.
cov/3: Computes the sample covariance between two columns.
crosstab/3: Computes a contingency table (crosstab) of two columns.
describe/2: Computes basic statistics (count, mean, stddev, min, max) for selected columns.
freq_items/3: Finds all items which have a frequency greater than or equal to support.
sample_by/4: Returns a stratified sample of the DataFrame.
summary/2: Computes specified statistics for numeric and string columns.
Functions
@spec approx_quantile(
        SparkEx.DataFrame.t(),
        String.t() | [String.t()],
        [float()],
        float()
      ) :: {:ok, [float()] | [[float()]]} | {:error, term()}
Computes approximate quantiles for one or more columns.
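Takes a list of probabilities and an optional relative error (0.01 in the second example below).
Returns {:ok, [float]} for a single column, {:ok, [[float]]} (one inner list per column) for multiple columns, or {:error, reason}.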
Examples
{:ok, quantiles} = DataFrame.Stat.approx_quantile(df, "age", [0.25, 0.5, 0.75])
{:ok, quantiles} = DataFrame.Stat.approx_quantile(df, ["age", "salary"], [0.5], 0.01)
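Because the success value is a flat list for one column and a nested list for several, callers may want to normalize the two shapes. The sketch below is a hypothetical helper (QuantileResult.by_column/2 is not part of SparkEx); it works only on the documented return tuples and does not call Spark.

```elixir
defmodule QuantileResult do
  # Hypothetical helper, not part of SparkEx: turns either documented
  # result shape into {:ok, %{column_name => quantiles}}.

  # Multiple columns: one inner list of quantiles per requested column.
  def by_column({:ok, [first | _] = nested}, cols)
      when is_list(cols) and is_list(first) do
    {:ok, cols |> Enum.zip(nested) |> Map.new()}
  end

  # Single column: a flat list of quantiles.
  def by_column({:ok, quantiles}, col) when is_binary(col) and is_list(quantiles) do
    {:ok, %{col => quantiles}}
  end

  # Errors pass through unchanged.
  def by_column({:error, _reason} = error, _cols), do: error
end
```

Passing the same column argument that was given to approx_quantile keeps the keys aligned with the order of the inner lists.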
@spec corr(SparkEx.DataFrame.t(), String.t(), String.t(), String.t()) :: {:ok, float()} | {:error, term()}
Computes the Pearson correlation coefficient between two columns.
Returns {:ok, float} or {:error, reason}.
Examples
{:ok, r} = DataFrame.Stat.corr(df, "height", "weight")
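For reference, the Pearson coefficient can be computed over two in-memory lists as sketched below. PearsonSketch is a hypothetical module in plain Elixir; it illustrates the formula and makes no claim about how SparkEx or Spark evaluates corr.

```elixir
defmodule PearsonSketch do
  # Hypothetical illustration, not part of SparkEx.
  # r = sum((x - mx) * (y - my)) / sqrt(sum((x - mx)^2) * sum((y - my)^2))
  # Raises ArithmeticError if either list is constant (zero variance).
  def corr(xs, ys) when length(xs) == length(ys) and length(xs) > 1 do
    n = length(xs)
    mx = Enum.sum(xs) / n
    my = Enum.sum(ys) / n

    # Accumulate the cross-product and both squared-deviation sums in one pass.
    {sxy, sxx, syy} =
      xs
      |> Enum.zip(ys)
      |> Enum.reduce({0.0, 0.0, 0.0}, fn {x, y}, {sxy, sxx, syy} ->
        dx = x - mx
        dy = y - my
        {sxy + dx * dy, sxx + dx * dx, syy + dy * dy}
      end)

    sxy / :math.sqrt(sxx * syy)
  end
end
```

The result always lies between -1.0 and 1.0; values near either end indicate a strong linear relationship.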
@spec cov(SparkEx.DataFrame.t(), String.t(), String.t()) :: {:ok, float()} | {:error, term()}
Computes the sample covariance between two columns.
Returns {:ok, float} or {:error, reason}.
Examples
{:ok, c} = DataFrame.Stat.cov(df, "height", "weight")
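"Sample" covariance means the sum of cross-deviations is divided by n - 1 rather than n. A minimal plain-Elixir sketch of that definition follows; CovSketch is a hypothetical module for illustration and is not part of SparkEx.

```elixir
defmodule CovSketch do
  # Hypothetical illustration, not part of SparkEx.
  # Sample covariance: sum((x - mx) * (y - my)) / (n - 1)
  def sample_cov(xs, ys) when length(xs) == length(ys) and length(xs) > 1 do
    n = length(xs)
    mx = Enum.sum(xs) / n
    my = Enum.sum(ys) / n

    cross =
      xs
      |> Enum.zip(ys)
      |> Enum.map(fn {x, y} -> (x - mx) * (y - my) end)
      |> Enum.sum()

    # Divide by n - 1 (Bessel's correction), not n.
    cross / (n - 1)
  end
end
```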
@spec crosstab(SparkEx.DataFrame.t(), String.t(), String.t()) :: SparkEx.DataFrame.t()
Computes a contingency table (crosstab) of two columns.
Returns a DataFrame with the frequency of each combination of values.
Examples
DataFrame.Stat.crosstab(df, "department", "gender")
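To make "frequency of each combination" concrete, the sketch below counts value pairs over a list of maps in plain Elixir. CrosstabSketch and the row layout are assumptions for illustration; the sketch returns a sparse map of pair counts, which is not the DataFrame layout that crosstab/3 produces.

```elixir
defmodule CrosstabSketch do
  # Hypothetical illustration, not part of SparkEx.
  # Counts how often each {value_of_key1, value_of_key2} pair occurs.
  # Pairs that never occur are simply absent from the result.
  def crosstab(rows, key1, key2) do
    Enum.frequencies_by(rows, fn row ->
      {Map.fetch!(row, key1), Map.fetch!(row, key2)}
    end)
  end
end

rows = [
  %{department: "eng", gender: "f"},
  %{department: "eng", gender: "m"},
  %{department: "eng", gender: "f"},
  %{department: "ops", gender: "m"}
]

counts = CrosstabSketch.crosstab(rows, :department, :gender)
```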
@spec describe(SparkEx.DataFrame.t(), String.t() | [String.t()]) :: SparkEx.DataFrame.t()
Computes basic statistics (count, mean, stddev, min, max) for selected columns.
If no columns are given, describes all columns.
Examples
DataFrame.Stat.describe(df)
DataFrame.Stat.describe(df, ["age", "salary"])
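The five statistics can be sketched for a single in-memory numeric column as shown below. DescribeSketch is a hypothetical module, and the use of the sample standard deviation (dividing by n - 1) is an assumption; the source does not say which variant stddev refers to.

```elixir
defmodule DescribeSketch do
  # Hypothetical illustration, not part of SparkEx.
  # Assumption: stddev is the sample standard deviation (n - 1 denominator).
  def describe(values) when length(values) > 1 do
    n = length(values)
    mean = Enum.sum(values) / n

    # Sum of squared deviations from the mean.
    ss = values |> Enum.map(fn v -> (v - mean) * (v - mean) end) |> Enum.sum()

    %{
      count: n,
      mean: mean,
      stddev: :math.sqrt(ss / (n - 1)),
      min: Enum.min(values),
      max: Enum.max(values)
    }
  end
end

stats = DescribeSketch.describe([2, 4, 4, 4, 5, 5, 7, 9])
```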
@spec freq_items(SparkEx.DataFrame.t(), [String.t()], float() | keyword()) :: SparkEx.DataFrame.t()
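Finds the items in each of the given columns whose frequency is greater than or equal to the support threshold.
The optional third argument is either that threshold as a float (0.05 in the second example below) or a keyword list of options.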
Examples
DataFrame.Stat.freq_items(df, ["category", "status"])
DataFrame.Stat.freq_items(df, ["category"], 0.05)
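The support criterion can be illustrated with an exact count over one in-memory column, as in the sketch below. FreqItemsSketch is a hypothetical module, and treating support as a fraction of the row count is an assumption drawn from the 0.05 example; the sketch does not reproduce whatever algorithm runs behind freq_items/3.

```elixir
defmodule FreqItemsSketch do
  # Hypothetical illustration, not part of SparkEx.
  # Keeps every item whose relative frequency (count / total rows)
  # is at least `support`. Exact counting; results are sorted.
  def freq_items(values, support) when support > 0 and support <= 1 do
    n = length(values)

    values
    |> Enum.frequencies()
    |> Enum.filter(fn {_item, count} -> count / n >= support end)
    |> Enum.map(fn {item, _count} -> item end)
    |> Enum.sort()
  end
end

# 10 values: "a" occurs 7 times, "b" twice, "c" once.
column = ["a", "a", "a", "b", "c", "a", "b", "a", "a", "a"]
```

With this column, a support of 0.2 keeps "a" and "b", while 0.5 keeps only "a".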
@spec sample_by(
        SparkEx.DataFrame.t(),
        SparkEx.Column.t() | String.t(),
        map(),
        integer() | keyword() | nil
      ) :: SparkEx.DataFrame.t()
Returns a stratified sample of the DataFrame.
Parameters
col: column name (string) or Column used for stratification.
fractions: map of %{stratum_value => sampling_fraction}.
seed: optional random seed. Auto-generated if not provided; pass an explicit seed for reproducibility.
Examples
DataFrame.Stat.sample_by(df, "label", %{0 => 0.1, 1 => 0.5})
DataFrame.Stat.sample_by(df, "label", %{0 => 0.1, 1 => 0.5}, 42)
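The roles of fractions and seed can be sketched over an in-memory list of maps, as below. SampleBySketch is a hypothetical module; the per-row random draw and the choice to drop rows whose stratum is missing from fractions are assumptions for illustration, not documented SparkEx behavior.

```elixir
defmodule SampleBySketch do
  # Hypothetical illustration, not part of SparkEx.
  # Keeps each row with the probability listed for its stratum.
  # Assumption: strata missing from `fractions` get probability 0.0.
  def sample_by(rows, key, fractions, seed) when is_integer(seed) do
    # Seeding the process-local generator makes the draw reproducible.
    # Note: this replaces the calling process's :rand state.
    :rand.seed(:exsss, {seed, seed, seed})

    Enum.filter(rows, fn row ->
      fraction = Map.get(fractions, Map.fetch!(row, key), 0.0)
      # :rand.uniform/0 returns a float in [0.0, 1.0).
      :rand.uniform() < fraction
    end)
  end
end

rows = for id <- 1..100, do: %{id: id, label: rem(id, 2)}

sample_a = SampleBySketch.sample_by(rows, :label, %{0 => 0.1, 1 => 0.5}, 42)
sample_b = SampleBySketch.sample_by(rows, :label, %{0 => 0.1, 1 => 0.5}, 42)
```

Because both calls use the same seed, sample_a and sample_b contain the same rows; the sample size itself varies with the seed, since each row is drawn independently.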
@spec summary(SparkEx.DataFrame.t(), String.t() | [String.t()]) :: SparkEx.DataFrame.t()
Computes specified statistics for numeric and string columns.
Statistics can include: "count", "mean", "stddev", "min", "max", and percentiles like "25%", "50%", "75%".
Examples
DataFrame.Stat.summary(df)
DataFrame.Stat.summary(df, ["count", "min", "max"])