SparkEx.DataFrame.Stat (SparkEx v0.1.0)


Statistical operations sub-API for DataFrames.

Provides descriptive statistics, correlation, covariance, crosstab, frequency items, approximate quantiles, and stratified sampling.

Most functions return lazy DataFrames. Functions that return values instead of DataFrames (corr/4, cov/3, approx_quantile/4) execute eagerly and return {:ok, result} or {:error, reason}.

Summary

Functions

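approx_quantile(df, col, probabilities, relative_error \\ 0.0)

Computes approximate quantiles for one or more columns.

corr(df, col1, col2, method \\ "pearson")

Computes the Pearson correlation coefficient between two columns.

cov(df, col1, col2)

Computes the sample covariance between two columns.

crosstab(df, col1, col2)

Computes a contingency table (crosstab) of two columns.

describe(df, cols \\ [])

Computes basic statistics (count, mean, stddev, min, max) for selected columns.

freq_items(df, cols, support \\ 0.01)

Finds all items which have a frequency greater than or equal to support.

sample_by(df, col, fractions, seed \\ nil)

Returns a stratified sample of the DataFrame.

summary(df, statistics \\ [])

Computes specified statistics for numeric and string columns.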

Functions

approx_quantile(df, col, probabilities, relative_error \\ 0.0)

@spec approx_quantile(
  SparkEx.DataFrame.t(),
  String.t() | [String.t()],
  [float()],
  float()
) :: {:ok, [float()] | [[float()]]} | {:error, term()}

Computes approximate quantiles for one or more columns.

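Returns {:ok, [float]} for a single column, {:ok, [[float]]} for a list of columns, or {:error, reason} on failure. relative_error defaults to 0.0, which in Spark's approxQuantile requests exact quantiles and can be expensive on large data.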

Examples

{:ok, quantiles} = DataFrame.Stat.approx_quantile(df, "age", [0.25, 0.5, 0.75])
{:ok, quantiles} = DataFrame.Stat.approx_quantile(df, ["age", "salary"], [0.5], 0.01)
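The two return shapes can be awkward to consume directly. The following is a minimal sketch of a helper that pairs each value with its column and probability. `QuantileShapeSketch` and `label/3` are hypothetical names that are not part of SparkEx, the numbers are placeholders rather than real results, and the sketch assumes that the multi-column result holds one inner list per column, in the order the columns were passed.

```elixir
defmodule QuantileShapeSketch do
  # Hypothetical helper, not part of SparkEx.
  # Single column: `values` is a flat list, one value per probability.
  def label(col, probs, values) when is_binary(col) do
    %{col => Map.new(Enum.zip(probs, values))}
  end

  # Several columns: `values` is assumed to be a list of lists,
  # one inner list per column, in the order the columns were given.
  def label(cols, probs, values) when is_list(cols) do
    cols
    |> Enum.zip(values)
    |> Map.new(fn {col, col_values} -> {col, Map.new(Enum.zip(probs, col_values))} end)
  end
end

# Placeholder values standing in for the payload of an {:ok, quantiles} result.
IO.inspect(QuantileShapeSketch.label("age", [0.25, 0.5, 0.75], [30.0, 41.0, 55.0]))
IO.inspect(QuantileShapeSketch.label(["age", "salary"], [0.5], [[41.0], [52_000.0]]))
```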

corr(df, col1, col2, method \\ "pearson")

@spec corr(SparkEx.DataFrame.t(), String.t(), String.t(), String.t()) ::
  {:ok, float()} | {:error, term()}

Computes the Pearson correlation coefficient between two columns. The method argument defaults to "pearson".

Returns {:ok, float} or {:error, reason}.

Examples

{:ok, r} = DataFrame.Stat.corr(df, "height", "weight")
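For intuition about the returned value, the Pearson coefficient can be sketched in plain Elixir over two in-memory lists. This illustrates the textbook formula only, not how SparkEx or Spark computes it. `PearsonSketch` is a hypothetical module and the sample lists are made up.

```elixir
defmodule PearsonSketch do
  # Pearson r for two equal-length lists of numbers.
  # Raises ArithmeticError if either list is constant (zero variance).
  def corr(xs, ys) when length(xs) == length(ys) and length(xs) > 1 do
    n = length(xs)
    mean_x = Enum.sum(xs) / n
    mean_y = Enum.sum(ys) / n

    # Accumulate the sum of cross-deviations and the two sums of squared deviations.
    {sxy, sxx, syy} =
      xs
      |> Enum.zip(ys)
      |> Enum.reduce({0.0, 0.0, 0.0}, fn {x, y}, {sxy, sxx, syy} ->
        dx = x - mean_x
        dy = y - mean_y
        {sxy + dx * dy, sxx + dx * dx, syy + dy * dy}
      end)

    sxy / :math.sqrt(sxx * syy)
  end
end

heights = [150.0, 160.0, 170.0, 180.0]
weights = [50.0, 60.0, 70.0, 80.0]
IO.inspect(PearsonSketch.corr(heights, weights))
```

A perfectly linear increasing relationship, as in the made-up data above, gives a coefficient of 1; reversing one of the lists gives -1.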

cov(df, col1, col2)

@spec cov(SparkEx.DataFrame.t(), String.t(), String.t()) ::
  {:ok, float()} | {:error, term()}

Computes the sample covariance between two columns.

Returns {:ok, float} or {:error, reason}.

Examples

{:ok, c} = DataFrame.Stat.cov(df, "height", "weight")
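"Sample" covariance means the sum of cross-deviations is divided by n - 1 rather than n. A minimal plain-Elixir sketch of that formula follows. `CovSketch` is a hypothetical module used only for illustration, and the input lists are made up.

```elixir
defmodule CovSketch do
  # Sample covariance of two equal-length lists (needs at least two points).
  def sample_cov(xs, ys) when length(xs) == length(ys) and length(xs) > 1 do
    n = length(xs)
    mean_x = Enum.sum(xs) / n
    mean_y = Enum.sum(ys) / n

    cross_sum =
      xs
      |> Enum.zip(ys)
      |> Enum.map(fn {x, y} -> (x - mean_x) * (y - mean_y) end)
      |> Enum.sum()

    # Divide by n - 1 (sample covariance), not n (population covariance).
    cross_sum / (n - 1)
  end
end

IO.inspect(CovSketch.sample_cov([1, 2, 3], [2, 4, 6]))
```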

crosstab(df, col1, col2)

Computes a contingency table (crosstab) of two columns.

Returns a DataFrame with the frequency of each combination of values.

Examples

DataFrame.Stat.crosstab(df, "department", "gender")
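To show what a contingency table contains, here is a small in-memory sketch over a list of maps. It is not the SparkEx implementation and the layout of the returned DataFrame may differ. `CrosstabSketch` and the sample rows are hypothetical.

```elixir
defmodule CrosstabSketch do
  # Returns %{col1_value => %{col2_value => count}}, filling missing pairs with 0.
  def crosstab(rows, col1, col2) do
    # Count each (col1, col2) pair once.
    counts =
      Enum.frequencies_by(rows, fn row ->
        {Map.fetch!(row, col1), Map.fetch!(row, col2)}
      end)

    col2_values = rows |> Enum.map(&Map.fetch!(&1, col2)) |> Enum.uniq()

    rows
    |> Enum.map(&Map.fetch!(&1, col1))
    |> Enum.uniq()
    |> Map.new(fn v1 ->
      {v1, Map.new(col2_values, fn v2 -> {v2, Map.get(counts, {v1, v2}, 0)} end)}
    end)
  end
end

rows = [
  %{"department" => "eng", "gender" => "f"},
  %{"department" => "eng", "gender" => "m"},
  %{"department" => "eng", "gender" => "f"},
  %{"department" => "ops", "gender" => "m"}
]

IO.inspect(CrosstabSketch.crosstab(rows, "department", "gender"))
```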

describe(df, cols \\ [])

@spec describe(SparkEx.DataFrame.t(), String.t() | [String.t()]) ::
  SparkEx.DataFrame.t()

Computes basic statistics (count, mean, stddev, min, max) for selected columns.

cols may be a single column name or a list of names. If no columns are given, all columns are described.

Examples

DataFrame.Stat.describe(df)
DataFrame.Stat.describe(df, ["age", "salary"])
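The five statistics can be sketched for a single numeric column held in memory. This is illustrative only. `DescribeSketch` is a hypothetical module, the values are made up, and the sketch assumes stddev means the sample standard deviation (n - 1 denominator), as in Spark's describe.

```elixir
defmodule DescribeSketch do
  # Basic statistics for a list of numbers (needs at least two values).
  def describe(values) when length(values) > 1 do
    n = length(values)
    mean = Enum.sum(values) / n

    squared_deviations =
      values
      |> Enum.map(fn v -> (v - mean) * (v - mean) end)
      |> Enum.sum()

    %{
      count: n,
      mean: mean,
      # Assumption: sample standard deviation, dividing by n - 1.
      stddev: :math.sqrt(squared_deviations / (n - 1)),
      min: Enum.min(values),
      max: Enum.max(values)
    }
  end
end

IO.inspect(DescribeSketch.describe([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```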

freq_items(df, cols, support \\ 0.01)

@spec freq_items(SparkEx.DataFrame.t(), [String.t()], float() | keyword()) ::
  SparkEx.DataFrame.t()

Finds the items in each of the given columns whose frequency is greater than or equal to support (default 0.01).

Examples

DataFrame.Stat.freq_items(df, ["category", "status"])
DataFrame.Stat.freq_items(df, ["category"], 0.05)
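The support threshold is a fraction of the row count. The sketch below applies it exactly to one in-memory column. `FreqItemsSketch` is a hypothetical module and the values are made up. Assuming SparkEx delegates to Spark's freqItems, the real result comes from an approximate algorithm and may include false positives, which this exact version does not reproduce.

```elixir
defmodule FreqItemsSketch do
  # Exact version: keep values whose share of the rows is at least `support`.
  def freq_items(values, support) when support > 0 and support <= 1 do
    total = length(values)

    values
    |> Enum.frequencies()
    |> Enum.filter(fn {_value, count} -> count / total >= support end)
    |> Enum.map(fn {value, _count} -> value end)
    |> Enum.sort()
  end
end

# "a" covers 3/6 of the rows, "c" 2/6, and "b" 1/6.
IO.inspect(FreqItemsSketch.freq_items(["a", "a", "a", "b", "c", "c"], 0.3))
```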

sample_by(df, col, fractions, seed \\ nil)

@spec sample_by(
  SparkEx.DataFrame.t(),
  SparkEx.Column.t() | String.t(),
  map(),
  integer() | keyword() | nil
) :: SparkEx.DataFrame.t()

Returns a stratified sample of the DataFrame.

Parameters

  • col — column name (string) or Column used for stratification.
  • fractions — map of %{stratum_value => sampling_fraction}.
  • seed — optional random seed. Auto-generated if not provided; pass an explicit seed for reproducibility.

Examples

DataFrame.Stat.sample_by(df, "label", %{0 => 0.1, 1 => 0.5})
DataFrame.Stat.sample_by(df, "label", %{0 => 0.1, 1 => 0.5}, 42)
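The parameters above can be illustrated with a small in-memory sketch: each row is kept independently with the fraction assigned to its stratum, so sample sizes are approximate rather than exact. `SampleBySketch` is a hypothetical module, not the SparkEx implementation. Treating strata that are missing from fractions as fraction 0.0 is an assumption of this sketch, and the seeded generator here will not reproduce Spark's samples.

```elixir
defmodule SampleBySketch do
  # Keeps each row with the probability assigned to its stratum.
  # Note: seeds the calling process's :rand state as a side effect.
  def sample_by(rows, key, fractions, seed) when is_integer(seed) do
    :rand.seed(:exsss, {seed, seed + 1, seed + 2})

    Enum.filter(rows, fn row ->
      # Assumption: strata not listed in `fractions` are never sampled.
      fraction = Map.get(fractions, Map.fetch!(row, key), 0.0)
      # :rand.uniform/0 returns a float in [0.0, 1.0).
      :rand.uniform() < fraction
    end)
  end
end

rows = for i <- 1..20, do: %{"id" => i, "label" => rem(i, 2)}

sample = SampleBySketch.sample_by(rows, "label", %{0 => 0.1, 1 => 0.5}, 42)
IO.inspect(Enum.map(sample, & &1["id"]))
```

Calling the sketch twice with the same seed returns the same rows, which mirrors why an explicit seed is needed for reproducibility.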

summary(df, statistics \\ [])

@spec summary(SparkEx.DataFrame.t(), String.t() | [String.t()]) ::
  SparkEx.DataFrame.t()

Computes specified statistics for numeric and string columns.

Statistics can include: "count", "mean", "stddev", "min", "max", and percentiles like "25%", "50%", "75%".

Examples

DataFrame.Stat.summary(df)
DataFrame.Stat.summary(df, ["count", "min", "max"])