ExFairness.Metrics.Calibration (ExFairness v0.5.1)


Calibration fairness metric.

Measures whether predicted probabilities are well-calibrated across groups. A model is calibrated if events predicted with probability p actually occur with relative frequency p.

Mathematical Definition

For predicted probability ŝ(x) and outcome y:

P(Y = 1 | ŝ(X) = s, A = a) = s    for all s, a

Fairness requires that calibration hold across all groups.
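To make the condition concrete, here is a minimal sketch (illustrative data and names, not part of this module's API) that estimates the empirical positive rate among samples scored near a given s within one group:

probs = Nx.tensor([0.3, 0.3, 0.3, 0.3, 0.7, 0.7])
labels = Nx.tensor([0, 0, 0, 1, 1, 1])

s = 0.3
# Samples whose predicted probability falls near s.
mask = Nx.less(Nx.abs(Nx.subtract(probs, s)), 0.05)
# Observed positive rate among those samples; calibration requires this ≈ s.
rate = Nx.divide(Nx.sum(Nx.multiply(labels, mask)), Nx.sum(mask))

Here the observed rate is 0.25 against a predicted 0.3, so this group is close to, but not exactly, calibrated at s = 0.3.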

Expected Calibration Error (ECE)

ECE measures the weighted average of calibration error across bins (a runnable sketch follows the definitions below):

ECE = Σ_b (n_b / n) · |acc(b) - conf(b)|

where:

  • b = bin index
  • n_b = number of samples in bin b
  • acc(b) = accuracy in bin b
  • conf(b) = average confidence in bin b
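The following is a minimal sketch of this formula in plain Elixir, using lists for clarity (ECESketch and its names are illustrative, not this library's implementation; the module itself operates on Nx tensors):

defmodule ECESketch do
  def ece(probs, labels, n_bins \\ 10) do
    n = length(probs)

    probs
    |> Enum.zip(labels)
    # Assign each {prob, label} pair to a uniform bin index in 0..n_bins-1.
    |> Enum.group_by(fn {p, _y} -> min(trunc(p * n_bins), n_bins - 1) end)
    |> Enum.map(fn {_bin, pairs} ->
      n_b = length(pairs)
      # acc(b): fraction of positives in the bin.
      acc = Enum.count(pairs, fn {_p, y} -> y == 1 end) / n_b
      # conf(b): mean predicted probability in the bin.
      conf = Enum.sum(Enum.map(pairs, fn {p, _y} -> p end)) / n_b
      n_b / n * abs(acc - conf)
    end)
    |> Enum.sum()
  end
end

For probs [0.2, 0.2, 0.8, 0.8] and labels [0, 1, 1, 1], the two occupied bins contribute 0.15 and 0.10, so ECESketch.ece/2 returns an ECE of 0.25.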

Group Fairness

Calibration fairness requires similar ECE across groups:

Δ_ECE = |ECE_A - ECE_B|
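Continuing the sketch above (the split helper is hypothetical; compute/4 below does this for you on tensors):

# Partition a list by a binary sensitive attribute.
split = fn xs, sa, g -> for {x, ^g} <- Enum.zip(xs, sa), do: x end

probs = [0.2, 0.8, 0.3, 0.7]
labels = [0, 1, 0, 1]
sensitive = [0, 0, 1, 1]

ece_a = ECESketch.ece(split.(probs, sensitive, 0), split.(labels, sensitive, 0))
ece_b = ECESketch.ece(split.(probs, sensitive, 1), split.(labels, sensitive, 1))
disparity = abs(ece_a - ece_b)  # ≈ 0.1 on this toy data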

Use Cases

  • Medical risk scores (predicted risk should match actual risk)
  • Credit scoring (approval probability should match default rate)
  • Hiring (interview likelihood should match success rate)
  • Any application where users rely on prediction confidence

References

  • Kleinberg, J., et al. (2017). "Inherent trade-offs in algorithmic fairness." ITCS.
  • Pleiss, G., et al. (2017). "On fairness and calibration." NeurIPS.
  • Chouldechova, A. (2017). "Fair prediction with disparate impact." Big Data.
  • Guo, C., et al. (2017). "On calibration of modern neural networks." ICML.

Examples

iex> # Two groups with identical score/label distributions (zero disparity)
iex> probs = Nx.tensor([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
iex> labels = Nx.tensor([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
iex> sensitive = Nx.tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
iex> result = ExFairness.Metrics.Calibration.compute(probs, labels, sensitive, n_bins: 5)
iex> is_float(result.disparity)
true

Summary

Functions

compute(probabilities, labels, sensitive_attr, opts \\ [])

Computes calibration fairness disparity between groups.

reliability_diagram(probabilities, labels, sensitive_attr, opts \\ [])

Generates reliability diagram data for calibration plotting.

Types

result()

@type result() :: %{
  group_a_ece: float(),
  group_b_ece: float(),
  disparity: float(),
  passes: boolean(),
  threshold: float(),
  group_a_mce: float(),
  group_b_mce: float(),
  n_bins: integer(),
  strategy: :uniform | :quantile,
  interpretation: String.t()
}

Functions

compute(probabilities, labels, sensitive_attr, opts \\ [])

@spec compute(Nx.Tensor.t(), Nx.Tensor.t(), Nx.Tensor.t(), keyword()) :: result()

Computes calibration fairness disparity between groups.

Parameters

  • probabilities - Predicted probabilities (0.0 to 1.0)
  • labels - Binary labels (0 or 1)
  • sensitive_attr - Binary sensitive attribute (0 or 1)
  • opts:
    • :n_bins - Number of probability bins (default: 10)
    • :strategy - Binning strategy (:uniform or :quantile, default: :uniform)
    • :threshold - Max acceptable ECE disparity (default: 0.1)
    • :min_per_group - Minimum samples per group (default: 5)

Returns

Map with ECE for each group, disparity, and detailed calibration metrics:

  • :group_a_ece - Expected Calibration Error for group A
  • :group_b_ece - Expected Calibration Error for group B
  • :disparity - Absolute difference in ECE
  • :passes - Whether disparity is within threshold
  • :threshold - Threshold used
  • :group_a_mce - Maximum Calibration Error for group A
  • :group_b_mce - Maximum Calibration Error for group B
  • :n_bins - Number of bins used
  • :strategy - Binning strategy used
  • :interpretation - Plain language explanation

Examples

iex> probs = Nx.tensor([0.1, 0.3, 0.6, 0.9, 0.2, 0.4, 0.7, 0.8, 0.5, 0.3, 0.1, 0.3, 0.6, 0.9, 0.2, 0.4, 0.7, 0.8, 0.5, 0.3])
iex> labels = Nx.tensor([0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0])
iex> sensitive = Nx.tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
iex> result = ExFairness.Metrics.Calibration.compute(probs, labels, sensitive, n_bins: 5)
iex> result.n_bins
5
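The options compose; the sketch below is illustrative (not a doctest), reuses the tensors above, and relies only on the result keys documented in Returns:

result =
  ExFairness.Metrics.Calibration.compute(probs, labels, sensitive,
    n_bins: 5,
    strategy: :quantile,
    threshold: 0.05
  )

result.strategy  #=> :quantile
result.passes    # true only if the ECE disparity is within 0.05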

reliability_diagram(probabilities, labels, sensitive_attr, opts \\ [])

@spec reliability_diagram(Nx.Tensor.t(), Nx.Tensor.t(), Nx.Tensor.t(), keyword()) ::
  %{
    bins: [map()],
    n_bins: integer(),
    strategy: :uniform | :quantile
  }

Generates reliability diagram data for calibration plotting.

Returns bin-level accuracy, confidence, and counts per group using the same binning strategy as compute/4.
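For example, an illustrative sketch, assuming reliability_diagram/4 accepts the same :n_bins and :strategy options as compute/4 (which the shared return fields suggest); the exact keys inside each bin map are not documented here, so this only inspects the returned shape:

diagram =
  ExFairness.Metrics.Calibration.reliability_diagram(probs, labels, sensitive,
    n_bins: 10,
    strategy: :uniform
  )

diagram.n_bins    #=> 10
diagram.strategy  #=> :uniform
Enum.each(diagram.bins, &IO.inspect/1)  # one map per bin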