ExDataCheck.Drift (ExDataCheck v0.2.1)

Data drift detection for ML model monitoring.

Drift detection identifies when the distribution of production data differs significantly from that of the training data. Such a shift can degrade model performance.

Drift Detection Methods

  • Kolmogorov-Smirnov (KS): For continuous numerical features
  • Chi-Square: For categorical features
  • Population Stability Index (PSI): Distribution-shift metric commonly used in credit scoring and ML monitoring
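
As a rough illustration of the categorical case, the sketch below computes a Pearson chi-square statistic from category counts. The module name ChiSquareSketch and its statistic/2 function are hypothetical and are not part of ExDataCheck; how the library itself implements its chi-square check is not shown in this documentation. The sketch assumes every baseline category has a non-zero count and ignores categories that appear only in the current data.

```elixir
defmodule ChiSquareSketch do
  # Hypothetical helper, not part of ExDataCheck.
  # Pearson chi-square statistic: sum over categories of
  # (observed - expected)^2 / expected, where the expected counts are the
  # baseline proportions scaled to the size of the current sample.
  @spec statistic(map(), map()) :: number()
  def statistic(baseline_counts, current_counts) do
    baseline_total = baseline_counts |> Map.values() |> Enum.sum()
    current_total = current_counts |> Map.values() |> Enum.sum()

    baseline_counts
    |> Enum.map(fn {category, baseline_count} ->
      # Assumes baseline_count > 0, otherwise expected would be zero.
      expected = baseline_count / baseline_total * current_total
      observed = Map.get(current_counts, category, 0)
      (observed - expected) * (observed - expected) / expected
    end)
    |> Enum.sum()
  end
end
```

Identical counts give a statistic of zero; larger values indicate a larger gap between the current category counts and what the baseline would predict.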

Workflow

  1. Create baseline from training/reference data
  2. Detect drift in production/current data
  3. Monitor drift scores over time
  4. Retrain the model when significant drift is detected

Examples

# Create baseline from training data
baseline = ExDataCheck.Drift.create_baseline(training_data)

# Check production data for drift
drift_result = ExDataCheck.Drift.detect(production_data, baseline)

case drift_result do
  %{drifted: true} ->
    IO.puts("Drift detected")
    # trigger_model_retraining/0 is a placeholder for your own retraining hook
    trigger_model_retraining()

  _ ->
    :ok
end

Types

baseline()

@type baseline() :: %{optional(atom() | String.t()) => baseline_column()}

Complete baseline for all columns.

baseline_column()

@type baseline_column() :: map()

Baseline distribution for a column.

For numeric columns:

  • :type - :numeric
  • :values - List of baseline values
  • :mean - Baseline mean
  • :stdev - Baseline standard deviation

For categorical columns:

  • :type - :categorical
  • :frequencies - Map of value frequencies
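
To make the two shapes concrete, here is a minimal sketch that builds a map with the documented keys from a non-empty list of column values. BaselineSketch is a hypothetical module, not ExDataCheck's implementation. Two details are assumptions: the use of the sample standard deviation (n - 1 denominator), and storing :frequencies as raw counts rather than proportions.

```elixir
defmodule BaselineSketch do
  # Hypothetical helper, not part of ExDataCheck.
  # Builds a baseline_column-shaped map for one non-empty column.
  def column([_ | _] = values) do
    if Enum.all?(values, &is_number/1) do
      numeric(values)
    else
      categorical(values)
    end
  end

  defp numeric(values) do
    n = length(values)
    mean = Enum.sum(values) / n

    # Assumption: sample variance (n - 1); a single value yields 0.0.
    variance =
      if n > 1 do
        squared_diffs = Enum.map(values, fn v -> (v - mean) * (v - mean) end)
        Enum.sum(squared_diffs) / (n - 1)
      else
        0.0
      end

    %{type: :numeric, values: values, mean: mean, stdev: :math.sqrt(variance)}
  end

  defp categorical(values) do
    # Assumption: frequencies are stored as counts per distinct value.
    %{type: :categorical, frequencies: Enum.frequencies(values)}
  end
end
```

For example, BaselineSketch.column([25, 30, 35]) produces a :numeric map, while BaselineSketch.column(["a", "b", "a"]) produces a :categorical one.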

Functions

create_baseline(dataset)

@spec create_baseline([map()]) :: baseline()

Creates a baseline distribution from a reference dataset.

The baseline captures the distribution of each column for later comparison.

Parameters

  • dataset - Reference dataset (typically training data)

Returns

Map of column names to baseline statistics.

Examples

iex> dataset = [%{age: 25}, %{age: 30}, %{age: 35}]
iex> baseline = ExDataCheck.Drift.create_baseline(dataset)
iex> baseline[:age].type
:numeric

detect(dataset, baseline, opts \\ [])

@spec detect([map()], baseline(), keyword()) :: ExDataCheck.DriftResult.t()

Detects drift between current data and baseline.

Compares the distribution of current data against the baseline and identifies columns that have drifted significantly.

Parameters

  • dataset - Current dataset to check for drift
  • baseline - Baseline created from reference data
  • opts - Options
    • :threshold - Drift score threshold (default: 0.05)
    • :method - Detection method: :auto, :ks, or :psi (default: :auto)

Returns

An ExDataCheck.DriftResult struct containing the drift detection results.

Examples

iex> baseline = ExDataCheck.Drift.create_baseline(training_data)
iex> result = ExDataCheck.Drift.detect(production_data, baseline)
iex> result.drifted
false

ks_test(dist1, dist2)

@spec ks_test([number()], [number()]) :: {float(), float()}

Performs two-sample Kolmogorov-Smirnov test.

Tests whether two samples come from the same distribution.

Returns

Tuple of {ks_statistic, p_value}.

Examples

iex> dist1 = [1, 2, 3, 4, 5]
iex> dist2 = [1, 2, 3, 4, 5]
iex> {stat, _p} = ExDataCheck.Drift.ks_test(dist1, dist2)
iex> stat
0.0
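
For intuition about the statistic itself, the following sketch computes the two-sample KS statistic as the largest absolute gap between the two empirical CDFs. KsSketch is a hypothetical module, not the library's code. It does not compute a p-value, it requires both samples to be non-empty, and it favors clarity over speed (it is quadratic in the sample size).

```elixir
defmodule KsSketch do
  # Hypothetical helper, not part of ExDataCheck.
  # Two-sample KS statistic: the maximum over all observed points x of
  # |F1(x) - F2(x)|, where F1 and F2 are the empirical CDFs.
  def statistic(sample1, sample2) when sample1 != [] and sample2 != [] do
    n1 = length(sample1)
    n2 = length(sample2)

    (sample1 ++ sample2)
    |> Enum.uniq()
    |> Enum.map(fn x ->
      # Fraction of each sample that is less than or equal to x.
      cdf1 = Enum.count(sample1, &(&1 <= x)) / n1
      cdf2 = Enum.count(sample2, &(&1 <= x)) / n2
      abs(cdf1 - cdf2)
    end)
    |> Enum.max()
  end
end
```

Identical samples give 0.0, and samples with no overlap, such as [1, 2, 3] and [4, 5, 6], give 1.0.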

psi(baseline_dist, current_dist)

@spec psi(map(), map()) :: float()

Calculates Population Stability Index (PSI) between two distributions.

PSI measures distribution shift and is commonly used in credit scoring and ML monitoring.

PSI Interpretation:

  • PSI < 0.1: No significant shift
  • 0.1 <= PSI < 0.2: Moderate shift
  • PSI >= 0.2: Significant shift

Parameters

  • baseline_dist - Map of categories to baseline proportions
  • current_dist - Map of categories to current proportions

Examples

iex> baseline = %{"A" => 0.5, "B" => 0.5}
iex> current = %{"A" => 0.5, "B" => 0.5}
iex> ExDataCheck.Drift.psi(baseline, current)
0.0
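
The standard PSI definition sums (current - baseline) * ln(current / baseline) over all categories. The sketch below applies that formula to two proportion maps and maps the result onto the interpretation bands listed above. PsiSketch, its functions, and the returned atoms are hypothetical and not part of ExDataCheck; the small floor used to avoid taking the logarithm of zero is an assumption, and the library may handle missing or zero proportions differently.

```elixir
defmodule PsiSketch do
  # Hypothetical helper, not part of ExDataCheck.
  # Assumption: proportions are floored at this value to avoid log(0)
  # and division by zero for categories missing from one side.
  @epsilon 1.0e-4

  def psi(baseline_dist, current_dist) do
    categories = Enum.uniq(Map.keys(baseline_dist) ++ Map.keys(current_dist))

    categories
    |> Enum.map(fn category ->
      b = max(Map.get(baseline_dist, category, 0.0), @epsilon)
      c = max(Map.get(current_dist, category, 0.0), @epsilon)
      (c - b) * :math.log(c / b)
    end)
    |> Enum.sum()
  end

  # Bands follow the interpretation table in this section.
  def interpret(psi) when is_number(psi) and psi < 0.1, do: :no_significant_shift
  def interpret(psi) when is_number(psi) and psi < 0.2, do: :moderate_shift
  def interpret(psi) when is_number(psi), do: :significant_shift
end
```

With a baseline of %{"A" => 0.5, "B" => 0.5}, an identical current distribution falls in the first band, while a current distribution of %{"A" => 0.9, "B" => 0.1} falls in the significant-shift band.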