ExDataCheck.Expectations.ML (ExDataCheck v0.2.1)
ML-specific expectations for machine learning workflows.
Provides expectations tailored for ML use cases:
- Label balance checking for classification tasks
- Label cardinality validation
- Feature correlation analysis
- Missing value detection (critical for ML)
- Dataset size validation
Examples
# Check label balance for classification
expect_label_balance(:target, min_ratio: 0.2)
# Ensure reasonable number of classes
expect_label_cardinality(:target, min: 2, max: 10)
# Detect highly correlated features
expect_feature_correlation(:feature1, :feature2, max: 0.95)
# Ensure no missing values (critical for many ML algorithms)
expect_no_missing_values(:features)
# Validate dataset size
expect_table_row_count_to_be_between(1000, 1_000_000)
Design Principles
- ML-Aware: Designed specifically for ML pipeline needs
- Practical Defaults: Sensible defaults based on ML best practices
- Detailed Diagnostics: Provides actionable information for fixing issues
Summary
Functions
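- expect_feature_correlation/3: Expects correlation between two features to be within bounds.
- expect_label_balance/2: Expects labels to be reasonably balanced across classes.
- expect_label_cardinality/2: Expects the number of unique labels to fall within a specified range.
- expect_no_data_drift/3: Expects no significant data drift from baseline distribution.
- expect_no_missing_values/1: Expects no missing (nil) values in a column.
- expect_table_row_count_to_be_between/2: Expects the dataset to have a row count within a specified range.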
Functions
@spec expect_feature_correlation(atom() | String.t(), atom() | String.t(), keyword()) :: ExDataCheck.Expectation.t()
Expects correlation between two features to be within bounds.
Helps detect feature redundancy or collinearity issues in ML models.
Parameters
- column1 - First feature column
- column2 - Second feature column
- opts - Options:
  - :max - Maximum absolute correlation (default: 0.95)
  - :min - Minimum absolute correlation (default: nil)
Examples
iex> dataset = [%{f1: 1, f2: 10}, %{f1: 2, f2: 12}, %{f1: 3, f2: 35}]
iex> expectation = ExDataCheck.Expectations.ML.expect_feature_correlation(:f1, :f2, max: 0.95)
iex> result = expectation.validator.(dataset)
iex> is_boolean(result.success)
true
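The page does not say which correlation measure the validator uses. As a rough illustration of the kind of check involved, the sketch below assumes the Pearson coefficient and compares its absolute value against the bounds. The module `CorrelationSketch` and its functions are hypothetical helpers written for this example; they are not part of ExDataCheck.

```elixir
defmodule CorrelationSketch do
  # Hypothetical helper, not part of ExDataCheck.
  # Pearson correlation coefficient of two equal-length numeric lists.
  # Raises ArithmeticError if either list has zero variance.
  def pearson(xs, ys) when length(xs) == length(ys) and length(xs) > 1 do
    n = length(xs)
    mean_x = Enum.sum(xs) / n
    mean_y = Enum.sum(ys) / n

    {cov, var_x, var_y} =
      xs
      |> Enum.zip(ys)
      |> Enum.reduce({0.0, 0.0, 0.0}, fn {x, y}, {c, vx, vy} ->
        dx = x - mean_x
        dy = y - mean_y
        {c + dx * dy, vx + dx * dx, vy + dy * dy}
      end)

    cov / :math.sqrt(var_x * var_y)
  end

  # Mirrors the documented options: :max defaults to 0.95, :min to nil.
  # Assumption: both bounds apply to the absolute value of the coefficient.
  def within_bounds?(dataset, col1, col2, opts \\ []) do
    max_r = Keyword.get(opts, :max, 0.95)
    min_r = Keyword.get(opts, :min)
    xs = Enum.map(dataset, &Map.fetch!(&1, col1))
    ys = Enum.map(dataset, &Map.fetch!(&1, col2))
    r = abs(pearson(xs, ys))
    r <= max_r and (is_nil(min_r) or r >= min_r)
  end
end

dataset = [%{f1: 1, f2: 10}, %{f1: 2, f2: 12}, %{f1: 3, f2: 35}]
# Pearson r for this data is roughly 0.90, which is below the 0.95 ceiling.
CorrelationSketch.within_bounds?(dataset, :f1, :f2, max: 0.95)
```

Under this assumption, two perfectly linear features such as `[1, 2, 3]` and `[2, 4, 6]` give a coefficient of 1.0 and would exceed the default `:max`.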
@spec expect_label_balance(atom() | String.t(), keyword()) :: ExDataCheck.Expectation.t()
Expects labels to be reasonably balanced across classes.
Important for classification tasks to avoid model bias. Checks that the
smallest class represents at least min_ratio of the dataset.
Parameters
- column - Label column name
- opts - Options:
  - :min_ratio - Minimum ratio for smallest class (default: 0.1 or 10%)
Examples
iex> dataset = [%{target: 0}, %{target: 1}, %{target: 0}, %{target: 1}]
iex> expectation = ExDataCheck.Expectations.ML.expect_label_balance(:target, min_ratio: 0.4)
iex> result = expectation.validator.(dataset)
iex> result.success
true
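The balance rule described above (the smallest class must make up at least `min_ratio` of the rows) can be sketched in plain Elixir. `LabelBalanceSketch` is a hypothetical module for illustration only; how the library itself counts classes or handles empty datasets is not documented here.

```elixir
defmodule LabelBalanceSketch do
  # Hypothetical helper, not part of ExDataCheck.
  # Share of rows that belong to the least frequent class.
  # Raises Enum.EmptyError on an empty dataset.
  def smallest_class_ratio(dataset, column) do
    counts =
      dataset
      |> Enum.map(&Map.get(&1, column))
      |> Enum.frequencies()

    smallest = counts |> Map.values() |> Enum.min()
    smallest / length(dataset)
  end

  # Default of 0.1 follows the documented :min_ratio default.
  def balanced?(dataset, column, min_ratio \\ 0.1) do
    smallest_class_ratio(dataset, column) >= min_ratio
  end
end

balanced = [%{target: 0}, %{target: 1}, %{target: 0}, %{target: 1}]
# Each class holds half of the rows, so the ratio is 0.5.
LabelBalanceSketch.balanced?(balanced, :target, 0.4)

# Nine rows of class 0 and one row of class 1: the ratio drops to 0.1.
skewed = List.duplicate(%{target: 0}, 9) ++ [%{target: 1}]
LabelBalanceSketch.balanced?(skewed, :target, 0.2)
```

The first call returns `true` and the second returns `false`, since a 10% minority class falls short of a 20% floor.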
@spec expect_label_cardinality(atom() | String.t(), keyword()) :: ExDataCheck.Expectation.t()
Expects the number of unique labels to fall within a specified range.
Useful for validating that a classification task has a reasonable number of classes.
Parameters
- column - Label column name
- opts - Options:
  - :min - Minimum number of unique labels (default: 2)
  - :max - Maximum number of unique labels (default: 100)
Examples
iex> dataset = [%{target: "A"}, %{target: "B"}, %{target: "C"}]
iex> expectation = ExDataCheck.Expectations.ML.expect_label_cardinality(:target, min: 2, max: 5)
iex> result = expectation.validator.(dataset)
iex> result.success
true
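A minimal sketch of the cardinality check, assuming it simply counts distinct values in the label column and compares the count against an inclusive range. `LabelCardinalitySketch` is a hypothetical module; whether the library counts `nil` as a label is not stated, and this sketch does count it.

```elixir
defmodule LabelCardinalitySketch do
  # Hypothetical helper, not part of ExDataCheck.
  # Number of distinct values found in `column` (nil counts as a value here).
  def unique_label_count(dataset, column) do
    dataset
    |> Enum.map(&Map.get(&1, column))
    |> Enum.uniq()
    |> length()
  end

  # Defaults follow the documented options: min 2, max 100, both inclusive
  # (inclusivity is an assumption).
  def within_range?(dataset, column, opts \\ []) do
    min_labels = Keyword.get(opts, :min, 2)
    max_labels = Keyword.get(opts, :max, 100)
    count = unique_label_count(dataset, column)
    count >= min_labels and count <= max_labels
  end
end

dataset = [%{target: "A"}, %{target: "B"}, %{target: "C"}]
# Three distinct labels fall inside the range 2..5.
LabelCardinalitySketch.within_range?(dataset, :target, min: 2, max: 5)
```

A column with a single distinct label would fail the default lower bound of 2, which is a common sign of a mislabeled or filtered dataset.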
@spec expect_no_data_drift(atom() | String.t(), ExDataCheck.Drift.baseline(), keyword()) :: ExDataCheck.Expectation.t()
Expects no significant data drift from baseline distribution.
Compares current data against a baseline (typically training data) to detect distribution changes that could degrade model performance.
Parameters
- column - Column name to check for drift
- baseline - Baseline distribution created with Drift.create_baseline/1
- opts - Options:
  - :threshold - Drift score threshold (default: 0.05)
  - :method - Detection method (default: :auto)
Examples
# Create baseline from training data
training_data = [%{feature: 25}, %{feature: 30}, %{feature: 35}]
baseline = ExDataCheck.Drift.create_baseline(training_data)
# Check production data
production_data = [%{feature: 26}, %{feature: 31}, %{feature: 36}]
expectation = ExDataCheck.Expectations.ML.expect_no_data_drift(:feature, baseline)
result = expectation.validator.(production_data)
@spec expect_no_missing_values(atom() | String.t()) :: ExDataCheck.Expectation.t()
Expects no missing (nil) values in a column.
Alias for expect_column_values_to_not_be_null/1 but with ML-specific
naming. Many ML algorithms cannot handle missing values.
Parameters
- column - Column name
Examples
iex> dataset = [%{features: [1, 2, 3]}, %{features: [4, 5, 6]}]
iex> expectation = ExDataCheck.Expectations.ML.expect_no_missing_values(:features)
iex> result = expectation.validator.(dataset)
iex> result.success
true
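To show what a failing case might look like, the sketch below counts `nil` values in a column using plain Elixir. `MissingValuesSketch` is a hypothetical module, not the library's implementation. It treats an absent key the same as `nil`, which is an assumption, and it does not look for `nil` nested inside list values such as the `features` lists above.

```elixir
defmodule MissingValuesSketch do
  # Hypothetical helper, not part of ExDataCheck.
  # Counts rows whose value for `column` is nil. Map.get/2 also returns nil
  # when the key is absent, so absent keys are counted as missing here.
  def missing_count(dataset, column) do
    Enum.count(dataset, fn row -> is_nil(Map.get(row, column)) end)
  end

  def no_missing?(dataset, column), do: missing_count(dataset, column) == 0
end

rows = [%{age: 31}, %{age: nil}, %{height: 180}]
# One explicit nil plus one row without the :age key gives a count of 2.
MissingValuesSketch.missing_count(rows, :age)
```

Reporting the count rather than only a boolean makes it easier to decide between dropping rows and imputing values.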
@spec expect_table_row_count_to_be_between(non_neg_integer(), non_neg_integer()) :: ExDataCheck.Expectation.t()
Expects the dataset to have a row count within a specified range.
Important for ML to ensure sufficient training data while avoiding computational issues with very large datasets.
Parameters
- min - Minimum row count
- max - Maximum row count
Examples
iex> dataset = Enum.map(1..500, fn i -> %{id: i} end)
iex> expectation = ExDataCheck.Expectations.ML.expect_table_row_count_to_be_between(100, 1000)
iex> result = expectation.validator.(dataset)
iex> result.success
true