ExDataCheck.Expectations.ML (ExDataCheck v0.2.1)

ML-specific expectations for machine learning workflows.

Provides expectations tailored for ML use cases:

  • Label balance checking for classification tasks
  • Label cardinality validation
  • Feature correlation analysis
  • Missing value detection (critical for ML)
  • Dataset size validation

Examples

# Check label balance for classification
expect_label_balance(:target, min_ratio: 0.2)

# Ensure reasonable number of classes
expect_label_cardinality(:target, min: 2, max: 10)

# Detect highly correlated features
expect_feature_correlation(:feature1, :feature2, max: 0.95)

# Ensure no missing values (critical for many ML algorithms)
expect_no_missing_values(:features)

# Validate dataset size
expect_table_row_count_to_be_between(1000, 1_000_000)

Design Principles

  • ML-Aware: Designed specifically for ML pipeline needs
  • Practical Defaults: Sensible defaults based on ML best practices
  • Detailed Diagnostics: Provides actionable information for fixing issues

Summary

Functions

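expect_feature_correlation(column1, column2, opts \\ [])
  Expects correlation between two features to be within bounds.

expect_label_balance(column, opts \\ [])
  Expects labels to be reasonably balanced across classes.

expect_label_cardinality(column, opts \\ [])
  Expects the number of unique labels to fall within a specified range.

expect_no_data_drift(column, baseline, opts \\ [])
  Expects no significant data drift from baseline distribution.

expect_no_missing_values(column)
  Expects no missing (nil) values in a column.

expect_table_row_count_to_be_between(min, max)
  Expects the dataset to have a row count within a specified range.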

Functions

expect_feature_correlation(column1, column2, opts \\ [])

@spec expect_feature_correlation(atom() | String.t(), atom() | String.t(), keyword()) ::
  ExDataCheck.Expectation.t()

Expects correlation between two features to be within bounds.

Helps detect feature redundancy or collinearity issues in ML models.

Parameters

  • column1 - First feature column
  • column2 - Second feature column
  • opts - Options
    • :max - Maximum absolute correlation (default: 0.95)
    • :min - Minimum absolute correlation (default: nil)

Examples

iex> dataset = [%{f1: 1, f2: 10}, %{f1: 2, f2: 12}, %{f1: 3, f2: 35}]
iex> expectation = ExDataCheck.Expectations.ML.expect_feature_correlation(:f1, :f2, max: 0.95)
iex> result = expectation.validator.(dataset)
iex> is_boolean(result.success)
true
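
To make the bound concrete, here is a minimal sketch of such a check in plain Elixir. It assumes the Pearson correlation coefficient; this page does not state which coefficient the library computes, and `CorrelationSketch`, `pearson/3` and `within_bounds?/2` are hypothetical names used only for illustration, not part of ExDataCheck.

```elixir
defmodule CorrelationSketch do
  # Hypothetical helper, not part of ExDataCheck.
  # Pearson correlation between two numeric columns of a non-empty
  # list-of-maps dataset. Returns nil when either column has zero variance.
  def pearson(dataset, col1, col2) do
    xs = Enum.map(dataset, &Map.fetch!(&1, col1))
    ys = Enum.map(dataset, &Map.fetch!(&1, col2))
    n = length(xs)
    mean_x = Enum.sum(xs) / n
    mean_y = Enum.sum(ys) / n

    # Sum of products of deviations (unnormalised covariance).
    cov =
      xs
      |> Enum.zip(ys)
      |> Enum.map(fn {x, y} -> (x - mean_x) * (y - mean_y) end)
      |> Enum.sum()

    # Sums of squared deviations (unnormalised variances).
    ss_x = xs |> Enum.map(fn x -> (x - mean_x) * (x - mean_x) end) |> Enum.sum()
    ss_y = ys |> Enum.map(fn y -> (y - mean_y) * (y - mean_y) end) |> Enum.sum()

    denom = :math.sqrt(ss_x * ss_y)
    if denom == 0.0, do: nil, else: cov / denom
  end

  # Compares |r| against the bounds, mirroring the documented defaults
  # (:max 0.95, :min nil). Treating a nil :min as "no lower bound" is an
  # assumption.
  def within_bounds?(r, opts \\ []) do
    max_abs = Keyword.get(opts, :max, 0.95)
    min_abs = Keyword.get(opts, :min)
    abs_r = abs(r)
    abs_r <= max_abs and (is_nil(min_abs) or abs_r >= min_abs)
  end
end

dataset = [%{f1: 1, f2: 10}, %{f1: 2, f2: 12}, %{f1: 3, f2: 35}]
r = CorrelationSketch.pearson(dataset, :f1, :f2)
# r is approximately 0.90, so it passes with max: 0.95 and fails with max: 0.5
IO.inspect(CorrelationSketch.within_bounds?(r, max: 0.95))
IO.inspect(CorrelationSketch.within_bounds?(r, max: 0.5))
```

Comparing the absolute value means a strong negative correlation is flagged the same way as a strong positive one, which matches the "absolute correlation" wording of the options.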

expect_label_balance(column, opts \\ [])

@spec expect_label_balance(
  atom() | String.t(),
  keyword()
) :: ExDataCheck.Expectation.t()

Expects labels to be reasonably balanced across classes.

Important for classification tasks, where a heavily under-represented class can bias a model toward the majority class. Checks that the smallest class accounts for at least :min_ratio of the dataset.

Parameters

  • column - Label column name
  • opts - Options
    • :min_ratio - Minimum share of the dataset that the smallest class must represent (default: 0.1, i.e. 10%)

Examples

iex> dataset = [%{target: 0}, %{target: 1}, %{target: 0}, %{target: 1}]
iex> expectation = ExDataCheck.Expectations.ML.expect_label_balance(:target, min_ratio: 0.4)
iex> result = expectation.validator.(dataset)
iex> result.success
true
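
The smallest-class ratio described above can be sketched in plain Elixir as follows. This is an illustration of the rule only, not the library's implementation; `LabelBalanceSketch` and `check/3` are hypothetical names, and the sketch assumes a non-empty dataset.

```elixir
defmodule LabelBalanceSketch do
  # Hypothetical helper, not part of ExDataCheck.
  # Returns {smallest_class_ratio, balanced?} for a non-empty dataset.
  def check(dataset, column, opts \\ []) do
    min_ratio = Keyword.get(opts, :min_ratio, 0.1)
    total = length(dataset)

    # Count rows per label and take the size of the rarest class.
    smallest =
      dataset
      |> Enum.frequencies_by(&Map.get(&1, column))
      |> Map.values()
      |> Enum.min()

    ratio = smallest / total
    {ratio, ratio >= min_ratio}
  end
end

# Nine rows of class 0 and one row of class 1: the smallest class is 10%.
imbalanced = List.duplicate(%{target: 0}, 9) ++ [%{target: 1}]
IO.inspect(LabelBalanceSketch.check(imbalanced, :target, min_ratio: 0.2))
# prints {0.1, false}
```

Note that a class with zero rows never appears in the frequency map, so a check of this shape cannot detect a label that is missing entirely; the label cardinality expectation covers that case.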

expect_label_cardinality(column, opts \\ [])

@spec expect_label_cardinality(
  atom() | String.t(),
  keyword()
) :: ExDataCheck.Expectation.t()

Expects the number of unique labels to fall within a specified range.

Useful for validating that a classification task has a reasonable number of classes.

Parameters

  • column - Label column name
  • opts - Options
    • :min - Minimum number of unique labels (default: 2)
    • :max - Maximum number of unique labels (default: 100)

Examples

iex> dataset = [%{target: "A"}, %{target: "B"}, %{target: "C"}]
iex> expectation = ExDataCheck.Expectations.ML.expect_label_cardinality(:target, min: 2, max: 5)
iex> result = expectation.validator.(dataset)
iex> result.success
true
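
As a rough illustration of the range check, the following plain-Elixir sketch counts distinct labels and compares the count against inclusive bounds. `CardinalitySketch` and `check/3` are hypothetical names; whether the library treats the bounds as inclusive, or counts nil as a label, is an assumption here.

```elixir
defmodule CardinalitySketch do
  # Hypothetical helper, not part of ExDataCheck.
  # Returns {unique_label_count, within_range?}.
  # Assumptions: bounds are inclusive, and nil counts as its own label.
  def check(dataset, column, opts \\ []) do
    min_count = Keyword.get(opts, :min, 2)
    max_count = Keyword.get(opts, :max, 100)

    count =
      dataset
      |> Enum.map(&Map.get(&1, column))
      |> Enum.uniq()
      |> length()

    {count, count >= min_count and count <= max_count}
  end
end

# A single-class "classification" dataset fails the default minimum of 2.
IO.inspect(CardinalitySketch.check([%{target: "A"}, %{target: "A"}], :target))
# prints {1, false}
```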

expect_no_data_drift(column, baseline, opts \\ [])

@spec expect_no_data_drift(
  atom() | String.t(),
  ExDataCheck.Drift.baseline(),
  keyword()
) ::
  ExDataCheck.Expectation.t()

Expects no significant data drift from baseline distribution.

Compares current data against a baseline (typically training data) to detect distribution changes that could degrade model performance.

Parameters

  • column - Column name to check for drift
  • baseline - Baseline distribution created with ExDataCheck.Drift.create_baseline/1
  • opts - Options
    • :threshold - Drift score threshold (default: 0.05)
    • :method - Detection method (default: :auto)

Examples

# Create baseline from training data
training_data = [%{feature: 25}, %{feature: 30}, %{feature: 35}]
baseline = ExDataCheck.Drift.create_baseline(training_data)

# Check production data
production_data = [%{feature: 26}, %{feature: 31}, %{feature: 36}]
expectation = ExDataCheck.Expectations.ML.expect_no_data_drift(:feature, baseline)
result = expectation.validator.(production_data)

expect_no_missing_values(column)

@spec expect_no_missing_values(atom() | String.t()) :: ExDataCheck.Expectation.t()

Expects no missing (nil) values in a column.

Alias for expect_column_values_to_not_be_null/1 with an ML-specific name. Many ML algorithms cannot handle missing values.

Parameters

  • column - Column name

Examples

iex> dataset = [%{features: [1, 2, 3]}, %{features: [4, 5, 6]}]
iex> expectation = ExDataCheck.Expectations.ML.expect_no_missing_values(:features)
iex> result = expectation.validator.(dataset)
iex> result.success
true
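
A nil check of this kind can be sketched in plain Elixir as below. `MissingValuesSketch` and `check/2` are hypothetical names, and treating a row that lacks the key entirely as missing is an assumption rather than documented behaviour.

```elixir
defmodule MissingValuesSketch do
  # Hypothetical helper, not part of ExDataCheck.
  # Returns {missing_count, no_missing?}.
  # Assumption: a row without the key counts as missing, because
  # Map.get/2 returns nil for absent keys.
  def check(dataset, column) do
    missing = Enum.count(dataset, fn row -> is_nil(Map.get(row, column)) end)
    {missing, missing == 0}
  end
end

dataset = [%{age: 31}, %{age: nil}, %{name: "no age key"}]
IO.inspect(MissingValuesSketch.check(dataset, :age))
# prints {2, false}
```

Only a top-level nil is detected in this sketch. A list-valued column such as `features: [1, nil, 3]` would pass, so nested values need a separate check.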

expect_table_row_count_to_be_between(min, max)

@spec expect_table_row_count_to_be_between(non_neg_integer(), non_neg_integer()) ::
  ExDataCheck.Expectation.t()

Expects the dataset to have a row count within a specified range.

Important for ML to ensure sufficient training data while avoiding computational issues with very large datasets.

Parameters

  • min - Minimum row count
  • max - Maximum row count

Examples

iex> dataset = Enum.map(1..500, fn i -> %{id: i} end)
iex> expectation = ExDataCheck.Expectations.ML.expect_table_row_count_to_be_between(100, 1000)
iex> result = expectation.validator.(dataset)
iex> result.success
true
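
For comparison, the row-count rule reduces to a length check. The sketch below assumes both bounds are inclusive, which this page does not state; `RowCountSketch` and `check/3` are hypothetical names.

```elixir
defmodule RowCountSketch do
  # Hypothetical helper, not part of ExDataCheck.
  # Returns {row_count, within_range?}, treating both bounds as inclusive
  # (an assumption).
  def check(dataset, min, max)
      when is_integer(min) and is_integer(max) and min <= max do
    count = length(dataset)
    {count, count >= min and count <= max}
  end
end

dataset = Enum.map(1..500, fn i -> %{id: i} end)
IO.inspect(RowCountSketch.check(dataset, 1000, 1_000_000))
# prints {500, false}
```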