ExDataCheck.Expectations.ML (ExDataCheck v0.2.1)

ML-specific expectations for machine learning workflows.

Provides expectations tailored for ML use cases:

- Label balance checking for classification tasks
- Label cardinality validation
- Feature correlation analysis
- Missing value detection (critical for ML)
- Dataset size validation

Examples

# Check label balance for classification
expect_label_balance(:target, min_ratio: 0.2)

# Ensure reasonable number of classes
expect_label_cardinality(:target, min: 2, max: 10)

# Detect highly correlated features
expect_feature_correlation(:feature1, :feature2, max: 0.95)

# Ensure no missing values (critical for many ML algorithms)
expect_no_missing_values(:features)

# Validate dataset size
expect_table_row_count_to_be_between(1000, 1_000_000)

Design Principles

- ML-Aware: Designed specifically for ML pipeline needs
- Practical Defaults: Sensible defaults based on ML best practices
- Detailed Diagnostics: Provides actionable information for fixing issues

Summary

Functions

expect_feature_correlation(column1, column2, opts \\ [])

Expects correlation between two features to be within bounds.

expect_label_balance(column, opts \\ [])

Expects labels to be reasonably balanced across classes.

expect_label_cardinality(column, opts \\ [])

Expects the number of unique labels to fall within a specified range.

expect_no_data_drift(column, baseline, opts \\ [])

Expects no significant data drift from baseline distribution.

expect_no_missing_values(column)

Expects no missing (nil) values in a column.

expect_table_row_count_to_be_between(min, max)

Expects the dataset to have a row count within a specified range.

Functions

expect_feature_correlation(column1, column2, opts \\ [])

@spec expect_feature_correlation(atom() | String.t(), atom() | String.t(), keyword()) ::
  ExDataCheck.Expectation.t()

Expects correlation between two features to be within bounds.

Helps detect feature redundancy or collinearity issues in ML models.

Parameters

column1 - First feature column
column2 - Second feature column
opts - Options
- :max - Maximum absolute correlation (default: 0.95)
- :min - Minimum absolute correlation (default: nil)

Examples

iex> dataset = [%{f1: 1, f2: 10}, %{f1: 2, f2: 12}, %{f1: 3, f2: 35}]
iex> expectation = ExDataCheck.Expectations.ML.expect_feature_correlation(:f1, :f2, max: 0.95)
iex> result = expectation.validator.(dataset)
iex> is_boolean(result.success)
true
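To make "absolute correlation" concrete, here is a minimal sketch in plain Elixir. It assumes the Pearson coefficient, which the documentation above does not confirm. `CorrelationSketch` and its functions are hypothetical names for illustration and are not part of ExDataCheck.

```elixir
defmodule CorrelationSketch do
  # Hypothetical helper, NOT part of ExDataCheck.
  # Assumption: correlation means the Pearson coefficient.
  # Raises ArithmeticError if either list is constant (zero variance).
  def pearson(xs, ys) when length(xs) == length(ys) and length(xs) > 1 do
    n = length(xs)
    mean_x = Enum.sum(xs) / n
    mean_y = Enum.sum(ys) / n

    # Deviations from the mean for each feature
    dxs = Enum.map(xs, &(&1 - mean_x))
    dys = Enum.map(ys, &(&1 - mean_y))

    cov = Enum.zip(dxs, dys) |> Enum.reduce(0.0, fn {dx, dy}, acc -> acc + dx * dy end)
    var_x = Enum.reduce(dxs, 0.0, fn dx, acc -> acc + dx * dx end)
    var_y = Enum.reduce(dys, 0.0, fn dy, acc -> acc + dy * dy end)

    cov / :math.sqrt(var_x * var_y)
  end

  # True when the absolute correlation does not exceed `max`.
  def within_max?(xs, ys, max \\ 0.95), do: abs(pearson(xs, ys)) <= max
end

# Same values as the f1/f2 columns in the example above
r = CorrelationSketch.pearson([1, 2, 3], [10, 12, 35])
IO.inspect(Float.round(r, 2))  # roughly 0.9
IO.inspect(CorrelationSketch.within_max?([1, 2, 3], [10, 12, 35]))
```

Comparing the absolute value against `:max` means a strongly negative correlation counts as redundancy just as a strongly positive one does.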

expect_label_balance(column, opts \\ [])

@spec expect_label_balance(
  atom() | String.t(),
  keyword()
) :: ExDataCheck.Expectation.t()

Expects labels to be reasonably balanced across classes.

Important for classification tasks, where a heavily under-represented class can bias the model toward the majority classes. Checks that the smallest class makes up at least min_ratio of the dataset.

Parameters

column - Label column name
opts - Options
- :min_ratio - Minimum share of the dataset that the smallest class must hold (default: 0.1, i.e. 10%)

Examples

iex> dataset = [%{target: 0}, %{target: 1}, %{target: 0}, %{target: 1}]
iex> expectation = ExDataCheck.Expectations.ML.expect_label_balance(:target, min_ratio: 0.4)
iex> result = expectation.validator.(dataset)
iex> result.success
true
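The min_ratio check described above can be sketched in plain Elixir as follows. This is an illustration under stated assumptions, not the library's implementation: `LabelBalanceSketch` is a hypothetical module, and it assumes the ratio is the smallest class count divided by the total row count.

```elixir
defmodule LabelBalanceSketch do
  # Hypothetical helper, NOT part of ExDataCheck.
  # Returns the share of rows held by the least frequent label.
  # Raises on an empty dataset, because there is no smallest class.
  def min_class_ratio(dataset, column) do
    counts =
      dataset
      |> Enum.map(&Map.get(&1, column))
      |> Enum.frequencies()

    smallest = counts |> Map.values() |> Enum.min()
    smallest / length(dataset)
  end

  # True when the smallest class holds at least `min_ratio` of the rows.
  def balanced?(dataset, column, min_ratio \\ 0.1) do
    min_class_ratio(dataset, column) >= min_ratio
  end
end

dataset = [%{target: 0}, %{target: 0}, %{target: 0}, %{target: 1}]

# Class 1 holds 1 of 4 rows
IO.inspect(LabelBalanceSketch.min_class_ratio(dataset, :target))  # 0.25
IO.inspect(LabelBalanceSketch.balanced?(dataset, :target, 0.4))   # false
```

With a 3:1 split the smallest class holds 25% of the rows, so a min_ratio of 0.4 fails while the default of 0.1 passes.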

expect_label_cardinality(column, opts \\ [])

@spec expect_label_cardinality(
  atom() | String.t(),
  keyword()
) :: ExDataCheck.Expectation.t()

Expects the number of unique labels to fall within a specified range.

Useful for validating that a classification task has a reasonable number of classes.

Parameters

column - Label column name
opts - Options
- :min - Minimum number of unique labels (default: 2)
- :max - Maximum number of unique labels (default: 100)

Examples

iex> dataset = [%{target: "A"}, %{target: "B"}, %{target: "C"}]
iex> expectation = ExDataCheck.Expectations.ML.expect_label_cardinality(:target, min: 2, max: 5)
iex> result = expectation.validator.(dataset)
iex> result.success
true
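As a rough illustration of the range check, the sketch below counts distinct label values and compares the count against the documented defaults. `CardinalitySketch` is a hypothetical module, not part of ExDataCheck, and whether the library counts nil as a label is not stated in the source. This sketch does count it.

```elixir
defmodule CardinalitySketch do
  # Hypothetical helper, NOT part of ExDataCheck.
  # Counts distinct values in `column`. Assumption: nil counts as a value.
  def unique_labels(dataset, column) do
    dataset
    |> Enum.map(&Map.get(&1, column))
    |> Enum.uniq()
    |> length()
  end

  # True when the distinct label count lies in min..max (inclusive).
  def within?(dataset, column, min \\ 2, max \\ 100) do
    n = unique_labels(dataset, column)
    n >= min and n <= max
  end
end

dataset = [%{target: "A"}, %{target: "B"}, %{target: "C"}, %{target: "A"}]
IO.inspect(CardinalitySketch.unique_labels(dataset, :target))  # 3
IO.inspect(CardinalitySketch.within?(dataset, :target, 2, 5))  # true
```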

expect_no_data_drift(column, baseline, opts \\ [])

@spec expect_no_data_drift(
  atom() | String.t(),
  ExDataCheck.Drift.baseline(),
  keyword()
) ::
  ExDataCheck.Expectation.t()

Expects no significant data drift from baseline distribution.

Compares current data against a baseline (typically training data) to detect distribution changes that could degrade model performance.

Parameters

column - Column name to check for drift
baseline - Baseline distribution created with ExDataCheck.Drift.create_baseline/1
opts - Options
- :threshold - Drift score threshold (default: 0.05)
- :method - Detection method (default: :auto)

Examples

# Create baseline from training data
training_data = [%{feature: 25}, %{feature: 30}, %{feature: 35}]
baseline = ExDataCheck.Drift.create_baseline(training_data)

# Check production data
production_data = [%{feature: 26}, %{feature: 31}, %{feature: 36}]
expectation = ExDataCheck.Expectations.ML.expect_no_data_drift(:feature, baseline)
result = expectation.validator.(production_data)
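The source does not say which detection method `:auto` selects or how `:threshold` is interpreted. As one possible illustration of a drift score for a numeric column, the sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs). `DriftSketch` is a hypothetical module and is not a description of what ExDataCheck does internally.

```elixir
defmodule DriftSketch do
  # Hypothetical helper, NOT part of ExDataCheck.
  # Two-sample Kolmogorov-Smirnov statistic: the largest absolute difference
  # between the empirical CDFs of two numeric samples. Ranges from 0.0
  # (identical distributions) to 1.0 (no overlap). Quadratic in sample size,
  # which is fine for a sketch.
  def ks_statistic(baseline, current) do
    (baseline ++ current)
    |> Enum.uniq()
    |> Enum.map(fn x -> abs(ecdf(baseline, x) - ecdf(current, x)) end)
    |> Enum.max()
  end

  # Empirical CDF: fraction of values in `sample` that are <= x.
  defp ecdf(sample, x), do: Enum.count(sample, &(&1 <= x)) / length(sample)
end

# Same values as the :feature column in the example above
training = [25, 30, 35]
production = [26, 31, 36]
IO.inspect(DriftSketch.ks_statistic(training, production))  # about 0.33

# Completely separated samples give the maximum score
IO.inspect(DriftSketch.ks_statistic([1, 2, 3], [10, 11, 12]))  # 1.0
```

With only three points per sample the statistic is coarse. Real baselines would normally use far more rows.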

expect_no_missing_values(column)

@spec expect_no_missing_values(atom() | String.t()) :: ExDataCheck.Expectation.t()

Expects no missing (nil) values in a column.

Alias for expect_column_values_to_not_be_null/1 under an ML-specific name. Many ML algorithms cannot handle missing values.

Parameters

column - Column name

Examples

iex> dataset = [%{features: [1, 2, 3]}, %{features: [4, 5, 6]}]
iex> expectation = ExDataCheck.Expectations.ML.expect_no_missing_values(:features)
iex> result = expectation.validator.(dataset)
iex> result.success
true
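A minimal sketch of what such a check has to count, written in plain Elixir. `MissingValuesSketch` is a hypothetical module, not part of ExDataCheck. It assumes that a row lacking the key entirely is treated the same as a row whose value is nil, which the source does not confirm.

```elixir
defmodule MissingValuesSketch do
  # Hypothetical helper, NOT part of ExDataCheck.
  # Counts rows where `column` is nil. Assumption: an absent key also counts
  # as missing, because Map.get/2 returns nil for it.
  def missing_count(dataset, column) do
    Enum.count(dataset, fn row -> is_nil(Map.get(row, column)) end)
  end

  # True when no row is missing a value for `column`.
  def complete?(dataset, column), do: missing_count(dataset, column) == 0
end

dataset = [%{age: 31}, %{age: nil}, %{name: "no age key"}]
IO.inspect(MissingValuesSketch.missing_count(dataset, :age))  # 2
IO.inspect(MissingValuesSketch.complete?(dataset, :age))      # false
```

Reporting the count rather than only a boolean matches the "Detailed Diagnostics" principle stated earlier: knowing how many rows are affected helps decide between imputing and dropping them.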

expect_table_row_count_to_be_between(min, max)

@spec expect_table_row_count_to_be_between(non_neg_integer(), non_neg_integer()) ::
  ExDataCheck.Expectation.t()

Expects the dataset to have a row count within a specified range.

Useful in ML pipelines for ensuring there is enough training data while guarding against datasets too large to process comfortably.

Parameters

min - Minimum row count
max - Maximum row count

Examples

iex> dataset = Enum.map(1..500, fn i -> %{id: i} end)
iex> expectation = ExDataCheck.Expectations.ML.expect_table_row_count_to_be_between(100, 1000)
iex> result = expectation.validator.(dataset)
iex> result.success
true