Changelog


All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

Unreleased

0.5.1 - 2025-12-28

Changed

  • Updated crucible_framework dependency from ~> 0.5.0 to ~> 0.5.2

0.5.0 - 2025-12-27

Changed

  • Normalized describe/1 to canonical schema format - The describe/1 callback now returns a schema conforming to the Crucible Stage contract specification v1.0 (shape sketched after this list).
    • Changed :stage key to :name key (atom value)
    • Added __schema_version__: "1.0.0" marker for schema evolution
    • Added required field (list of required option keys)
    • Added optional field (list of optional option keys)
    • Added types field (type specifications for all options)
    • Added defaults field (default values for optional options)
    • Moved metrics list to __extensions__.fairness.supported_metrics
    • Added data_sources and output_location to extensions
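
For orientation, a sketch of the canonical shape implied by the list above. The field names follow the Stage contract; the specific option keys, types, and defaults shown here are illustrative placeholders, not the library's exact output:

%{
  name: :fairness,
  __schema_version__: "1.0.0",
  required: [:metrics, :group_by],                # example required option keys
  optional: [:threshold, :fail_on_violation],     # example optional option keys
  types: %{
    metrics: {:list, :atom},
    group_by: :atom,
    threshold: :float,
    fail_on_violation: :boolean
  },
  defaults: %{threshold: 0.1, fail_on_violation: false},
  __extensions__: %{
    fairness: %{
      supported_metrics: [:demographic_parity, :equalized_odds, :equal_opportunity,
                          :predictive_parity, :calibration]
    },
    data_sources: [],                             # placement and values illustrative
    output_location: nil
  }
}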

Added

  • Conformance tests - New test/ex_fairness/conformance_test.exs validates Stage contract compliance
  • Extended describe/1 tests - Comprehensive tests for canonical schema format

Dependencies

  • Updated crucible_framework dependency to ~> 0.5.0 (required for new describe/1 contract)

0.4.0 - 2025-12-25

Added

  • ExFairness.CrucibleStage implementing Crucible.Stage for crucible_framework pipelines.
  • Environment-specific config files (config/*.exs) to disable the CrucibleFramework repo by default.
  • Documentation snapshot and gap analysis in docs/20251225/.
  • Crucible stage integration test suite.

Changed

  • Refactored ExFairness.evaluate/5 into smaller helpers for metric computation and violation detection.
  • Improved chi-square computation structure in ExFairness.Utils.StatisticalTests.
  • Updated project logo in assets/ExFairness.svg.
  • Dependencies: added crucible_framework, updated crucible_ir to ~> 0.2.0, added ecto_sql and postgrex.

0.3.1 - 2025-11-26

Added - CrucibleIR Integration

Pipeline Stage:

  • ExFairness.Stage - Pipeline stage for Crucible framework integration
    • Seamless integration with CrucibleIR experiment orchestration
    • Accepts CrucibleIR.Reliability.Fairness configuration
    • Extracts predictions, labels, and sensitive attributes from model outputs
    • Supports all ExFairness metrics (demographic parity, equalized odds, equal opportunity, predictive parity, calibration)
    • Configurable threshold and fail-on-violation behavior
    • Comprehensive error handling and validation
    • Returns structured fairness results with violations tracking

Main API Enhancement:

  • ExFairness.evaluate/5 - New function for CrucibleIR config-based evaluation
    • Direct evaluation using CrucibleIR.Reliability.Fairness struct
    • Optional probabilities parameter for calibration metrics
    • Returns structured results with metrics, violations, and overall pass/fail status
    • Conditionally compiled (only when crucible_ir is available)

Configuration Support

The integration supports the following CrucibleIR.Reliability.Fairness structure:

%CrucibleIR.Reliability.Fairness{
  enabled: true,                    # Enable/disable fairness evaluation
  metrics: [:demographic_parity, :equalized_odds, :equal_opportunity, :predictive_parity, :calibration],
  group_by: :gender,                # Sensitive attribute field name
  threshold: 0.1,                   # Maximum acceptable disparity
  fail_on_violation: false,         # Whether to fail on violations
  options: %{}                      # Additional metric-specific options
}

Testing

New Test Suite:

  • ExFairness.StageTest - 15 comprehensive tests
    • Stage description validation
    • Disabled fairness pass-through
    • Single and multiple metric evaluation
    • Calibration with/without probabilities
    • Custom threshold configuration
    • Violation detection and reporting
    • Fail-on-violation behavior
    • Invalid context handling
    • Unknown metric handling
    • Custom options pass-through

Test Coverage: 174 (v0.3.0) → 189 (v0.3.1) = +15 tests (+8.6%)

Dependencies

New Dependencies:

  • {:crucible_ir, "~> 0.1.1"} - CrucibleIR configuration structs

Documentation

Updated Documentation:

  • mix.exs - Added ExFairness.Stage to Pipeline module group
  • README.md - Added Stage usage examples and CrucibleIR integration guide
  • ExFairness.Stage - Comprehensive module documentation with examples
  • ExFairness.evaluate/5 - Full API documentation with examples

Quality Metrics

  • Zero compilation warnings (enforced via warnings_as_errors)
  • Zero Dialyzer errors (type-safe)
  • All tests passing (189 total tests)
  • Backward compatible: All v0.3.0 code works without modification

Integration Benefits

  1. Seamless Crucible Integration: ExFairness can now be used as a pipeline stage in Crucible experiments
  2. Standardized Configuration: Uses CrucibleIR configuration structs for consistency
  3. Experiment Orchestration: Fairness evaluation can be automated as part of experiment pipelines
  4. Flexible Violation Handling: Choose whether fairness violations should fail experiments
  5. Comprehensive Results: Structured output suitable for experiment reporting

Example Usage

# Configure fairness evaluation
config = %CrucibleIR.Reliability.Fairness{
  enabled: true,
  metrics: [:demographic_parity, :equalized_odds],
  group_by: :gender,
  threshold: 0.1,
  fail_on_violation: false
}

# In a Crucible pipeline
context = %{
  experiment: %{reliability: %{fairness: config}},
  outputs: model_outputs  # List of maps with :prediction, :label, :gender
}

{:ok, result_context} = ExFairness.Stage.run(context)
# result_context.fairness contains fairness evaluation results

# Or use the direct evaluation API
result = ExFairness.evaluate(predictions, labels, sensitive_attr, config)
# Returns %{metrics: ..., overall_passes: ..., violations: ...}

Breaking Changes

None - This is a backward compatible release. All existing code continues to work unchanged.

Migration from v0.3.0

No code changes required. The new CrucibleIR integration is opt-in and does not affect existing usage patterns.

0.3.0 - 2025-11-25

Added - Statistical Inference and Calibration

Statistical Inference Framework:

  • ExFairness.Utils.Bootstrap - Bootstrap confidence interval computation
    • Stratified bootstrap to preserve group proportions
    • Parallel and sequential computation modes
    • Percentile and basic bootstrap methods
    • Configurable number of samples (default: 1000)
    • Reproducible with seed parameter
    • GPU-accelerated metric computation via Nx.Defn
  • ExFairness.Utils.StatisticalTests - Hypothesis testing for fairness metrics
    • Two-proportion Z-test for demographic parity (formula sketched after this list)
    • Chi-square test for equalized odds
    • Permutation test for any fairness metric (non-parametric)
    • Cohen's h effect size computation
    • Configurable significance levels (default: α=0.05)
    • Statistical interpretation generation
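
As a reference for the Z-test above, a minimal standalone sketch of the two-proportion test on group selection rates. It shows only the underlying formula and is not the ExFairness.Utils.StatisticalTests API:

defmodule ZTestSketch do
  # x1/n1, x2/n2: positive predictions and sample sizes for the two groups
  def two_proportion_z(x1, n1, x2, n2) do
    p1 = x1 / n1
    p2 = x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = :math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal tail
    %{z: z, p_value: :math.erfc(abs(z) / :math.sqrt(2))}
  end
end

ZTestSketch.two_proportion_z(45, 100, 30, 100)
# => z ≈ 2.19, p_value ≈ 0.028 — reject demographic parity at α = 0.05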

Calibration Fairness Metric:

  • ExFairness.Metrics.Calibration - Calibration fairness for probability predictions
    • Expected Calibration Error (ECE) computation (sketched after this list)
    • Maximum Calibration Error (MCE) computation
    • Uniform and quantile binning strategies
    • Configurable number of bins (default: 10)
    • Group-wise calibration comparison
    • Validation for probability ranges [0, 1]
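
For intuition, a minimal sketch of ECE with uniform binning. It illustrates the computation only; the library's calibration module operates on tensors and also supports MCE and quantile binning:

defmodule ECESketch do
  # probabilities: predicted probabilities in [0, 1]; labels: 0/1 outcomes
  def ece(probabilities, labels, bins \\ 10) do
    n = length(probabilities)

    probabilities
    |> Enum.zip(labels)
    # Assign each prediction to a uniform-width bin over [0, 1]
    |> Enum.group_by(fn {p, _y} -> min(trunc(p * bins), bins - 1) end)
    |> Enum.map(fn {_bin, pairs} ->
      count = length(pairs)
      avg_conf = Enum.sum(Enum.map(pairs, &elem(&1, 0))) / count
      avg_acc = Enum.sum(Enum.map(pairs, &elem(&1, 1))) / count
      count / n * abs(avg_acc - avg_conf)
    end)
    |> Enum.sum()
  end
end

ECESketch.ece([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 1])

Group-wise calibration comparison amounts to running the same computation per sensitive group and examining the gap between the resulting ECE values.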

Enhanced - Existing Metrics

All existing fairness metrics can now optionally include:

  • Bootstrap confidence intervals
  • Statistical hypothesis tests
  • Effect size measures
  • Enhanced interpretations with statistical significance

Example usage:

result = ExFairness.demographic_parity(predictions, sensitive_attr,
  include_ci: true,              # NEW: Bootstrap CI
  statistical_test: :z_test,     # NEW: Hypothesis testing
  bootstrap_samples: 1000,       # NEW: Configurable bootstrap
  confidence_level: 0.95         # NEW: CI level
)
# Returns enhanced result with :confidence_interval and :p_value

Testing

New Test Suites:

  • ExFairness.Utils.BootstrapTest - 11 comprehensive tests
    • Bootstrap interval validation
    • Stratified sampling verification
    • Method comparison (percentile vs basic)
    • Reproducibility testing
    • Parallel vs sequential equivalence
  • ExFairness.Utils.StatisticalTestsTest - 14 comprehensive tests
    • Two-proportion Z-test validation
    • Chi-square test verification
    • Permutation test correctness
    • Effect size computation
    • P-value range validation
  • ExFairness.Metrics.CalibrationTest - 15 comprehensive tests
    • ECE/MCE computation validation
    • Binning strategy verification
    • Probability range validation
    • Edge case handling

Total Tests: 134 (v0.2.0) → 174 (v0.3.0) = +40 tests (+30%)

Documentation

Design Documentation:

  • docs/20251125/enhancements_design.md (comprehensive 8-week implementation plan)
    • Statistical inference algorithms and formulas
    • Calibration metric mathematical foundation
    • Implementation roadmap and success criteria
    • API examples and migration guide
    • Research citations (10+ additional papers)

Updated Documentation:

  • mix.exs - Version updated to 0.3.0, new modules added to docs
  • README.md - Version badge and installation instructions updated
  • CHANGELOG.md - Complete v0.3.0 release notes

Quality Metrics

  • Zero compilation warnings (enforced via warnings_as_errors)
  • Zero Dialyzer errors (type-safe)
  • Test coverage target: >90% (expected)
  • Backward compatible: All v0.2.0 code works without modification

Performance

  • Bootstrap: ~1-2 seconds for 1000 samples on standard metrics
  • Permutation test: ~2-3 seconds for 10,000 permutations
  • Parallel bootstrap: 4-8x speedup on multi-core systems
  • Calibration: <100ms for typical datasets

Research Foundations

New Academic Citations:

  • Efron, B., & Tibshirani, R. J. (1994). "An introduction to the bootstrap." CRC press.
  • Davison, A. C., & Hinkley, D. V. (1997). "Bootstrap methods and their application."
  • Good, P. (2013). "Permutation tests: A practical guide to resampling methods."
  • Agresti, A. (2018). "Statistical methods for the social sciences."
  • Cohen, J. (1988). "Statistical power analysis for the behavioral sciences."
  • Pleiss, G., et al. (2017). "On fairness and calibration." NeurIPS.
  • Guo, C., et al. (2017). "On calibration of modern neural networks." ICML.

Breaking Changes

None - This is a backward compatible release. All existing code continues to work unchanged.

Migration from v0.2.0

No code changes required. All new features are opt-in via additional parameters.

See docs/20251125/enhancements_design.md for detailed migration examples.

0.2.0 - 2025-10-20

Added - Comprehensive Technical Documentation

  • future_directions.md (1,941 lines) - Complete roadmap to v1.0.0
    • Detailed specifications for statistical inference
    • Calibration metric with complete algorithm
    • Intersectional analysis implementation plan
    • Threshold optimization algorithm
    • 6-month development timeline
    • 12+ additional research citations
  • implementation_report.md (1,288 lines) - Technical implementation details
    • Module-by-module analysis of all 14 modules
    • Algorithm documentation with pseudocode
    • Design decisions and rationale
    • Performance characteristics
    • Code statistics and metrics
  • testing_and_qa_strategy.md (1,220 lines) - QA methodology
    • TDD philosophy and evidence
    • Complete test coverage matrix (134 tests)
    • Edge case testing strategy
    • Future testing enhancements (property testing, integration testing)
    • Quality gates and CI/CD specifications

Enhanced - README.md

  • Expanded from ~660 to 1,437 lines (+118%)
  • Added Mathematical Foundations section (200+ lines)
    • Complete mathematical definitions for all 4 metrics
    • Formal probability notation
    • Disparity measures
    • Comprehensive citations with DOI numbers
  • Added Theoretical Background section (300+ lines)
    • Types of fairness (group, individual, causal)
    • Measurement problem discussion
    • Impossibility theorem with proof intuition
    • Fairness-accuracy tradeoff analysis
  • Added Advanced Usage section (200+ lines)
    • Axon integration example (neural networks)
    • Scholar integration example (classical ML)
    • Batch fairness analysis
    • Production monitoring with GenServer
  • Expanded Research Foundations (150+ lines)
    • 15+ peer-reviewed papers with full bibliographic details
    • DOI numbers for all citations
    • Framework comparisons (AIF360, Fairlearn, etc.)
  • Added API Reference section
  • Updated real-world use cases with legal compliance checks

Documentation

  • Total documentation: ~9,120 lines
  • Academic citations: 27+ peer-reviewed papers
  • Working code examples: 20+
  • Integration patterns documented

0.1.0 - 2025-10-20

Added - Core Implementation

Infrastructure:

  • ExFairness.Error - Custom exception handling with type safety
  • ExFairness.Validation - Comprehensive input validation
    • Binary tensor validation
    • Shape matching validation
    • Multiple groups requirement (min 2 groups)
    • Sufficient samples validation (default: 10 per group)
    • Helpful error messages with actionable suggestions
  • ExFairness.Utils - GPU-accelerated tensor operations (masking sketch after this list)
    • positive_rate/2 - Positive prediction rate with masking
    • create_group_mask/2 - Binary mask generation
    • group_count/2 - Sample counting per group
    • group_positive_rates/2 - Batch rate computation
  • ExFairness.Utils.Metrics - Classification metrics
    • confusion_matrix/3 - TP, FP, TN, FN with masking
    • true_positive_rate/3 - TPR/Recall
    • false_positive_rate/3 - FPR
    • positive_predictive_value/3 - PPV/Precision
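
A small Nx sketch of the masking approach these helpers are built around; the actual function signatures are listed above and may take different argument shapes:

predictions = Nx.tensor([1, 0, 1, 1, 0, 0])
labels = Nx.tensor([1, 0, 0, 1, 1, 0])
groups = Nx.tensor([0, 0, 0, 1, 1, 1])

# Membership mask for group 0, then masked counts and rates
mask = Nx.equal(groups, 0)
group_n = Nx.sum(mask)
positive_rate = Nx.divide(Nx.sum(Nx.multiply(predictions, mask)), group_n)

# Masked confusion-matrix entries for the same group
true_pos = Nx.sum(Nx.multiply(Nx.multiply(predictions, labels), mask))
false_neg = Nx.sum(Nx.multiply(Nx.multiply(Nx.subtract(1, predictions), labels), mask))
tpr = Nx.divide(true_pos, Nx.add(true_pos, false_neg))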

Fairness Metrics:

  • Group fairness metrics: demographic parity, equalized odds, equal opportunity, and predictive parity

Detection Algorithms:

  • ExFairness.Detection.DisparateImpact - EEOC 80% rule
    • Legal standard for adverse impact
    • 4/5ths rule implementation
    • Legal interpretation with EEOC context
    • Citations: EEOC (1978), Biddle (2006)
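
For reference, the 4/5ths rule reduces to a ratio of selection rates; the sketch below is illustrative only, not the ExFairness.Detection.DisparateImpact API:

defmodule DisparateImpactSketch do
  # Selection rates for the protected group and the most favored (reference) group
  def four_fifths_ratio(protected_rate, reference_rate) do
    ratio = protected_rate / reference_rate
    %{ratio: ratio, adverse_impact?: ratio < 0.8}
  end
end

DisparateImpactSketch.four_fifths_ratio(0.30, 0.50)
# => %{ratio: 0.6, adverse_impact?: true}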

Mitigation Techniques:

  • ExFairness.Mitigation.Reweighting - Sample weighting for fairness
    • Supports demographic parity and equalized odds targets
    • Formula: w(a,y) = P(Y=y) / P(A=a,Y=y) (sketched after this list)
    • Normalized weights (mean = 1.0)
    • GPU-accelerated via Nx.Defn
    • Citations: Kamiran & Calders (2012)
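
A plain-Elixir sketch of the weight formula above, followed by normalization to mean 1.0. The library computes these weights on tensors via Nx.Defn; this version is for illustration only:

defmodule ReweightingSketch do
  # labels: 0/1 outcomes; groups: sensitive-attribute values (one per sample)
  def weights(labels, groups) do
    n = length(labels)
    pairs = Enum.zip(groups, labels)

    p_y = labels |> Enum.frequencies() |> Map.new(fn {y, c} -> {y, c / n} end)
    p_ay = pairs |> Enum.frequencies() |> Map.new(fn {ay, c} -> {ay, c / n} end)

    # w(a, y) = P(Y = y) / P(A = a, Y = y), per the entry above
    raw = Enum.map(pairs, fn {a, y} -> p_y[y] / p_ay[{a, y}] end)
    mean = Enum.sum(raw) / n

    # Normalize so the weights have mean 1.0; under-represented (group, label)
    # combinations end up with weights above 1.0
    Enum.map(raw, &(&1 / mean))
  end
end

ReweightingSketch.weights([1, 0, 1, 1, 0, 0], [:a, :a, :a, :b, :b, :b])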

Reporting System:

  • ExFairness.Report - Multi-metric fairness assessment
    • Aggregate pass/fail counts
    • Overall assessment generation
    • Markdown export (human-readable)
    • JSON export (machine-readable)

Main API:

Testing

  • 134 total tests (102 unit tests + 32 doctests)
  • 100% pass rate
  • Comprehensive edge case coverage
  • Strict TDD approach (Red-Green-Refactor)
  • All tests async (parallel execution)

Quality Gates

  • Zero compiler warnings (enforced)
  • Zero Dialyzer errors (type-safe)
  • Credo strict mode configured
  • Code formatting enforced (100 char lines)
  • ExCoveralls configured for coverage reports

Documentation

  • Comprehensive README.md with examples
  • Complete module documentation (@moduledoc)
  • Complete function documentation (@doc)
  • Working examples (verified by doctests)
  • Research citations in all metrics
  • Mathematical definitions included

Dependencies

  • Production: nx ~> 0.7 (only production dependency)
  • Development: ex_doc, dialyxir, excoveralls, credo, stream_data, jason