ExFairness.Utils.StatisticalTests (ExFairness v0.5.1)

Hypothesis testing for fairness metrics.

Provides parametric and non-parametric tests to assess the statistical significance of observed disparities in fairness metrics.

Statistical Tests

Two-Proportion Z-Test: Tests demographic parity differences
Chi-Square Test: Tests whether confusion-matrix counts are independent of group membership (equalized odds)
Permutation Test: Non-parametric test for any metric

References

Agresti, A. (2018). "Statistical methods for the social sciences."
Good, P. (2013). "Permutation tests: A practical guide to resampling methods for testing hypotheses."
Cohen, J. (1988). "Statistical power analysis for the behavioral sciences."

Examples

iex> predictions = Nx.tensor([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
iex> sensitive = Nx.tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
iex> result = ExFairness.Utils.StatisticalTests.two_proportion_test(predictions, sensitive)
iex> is_float(result.p_value) and result.p_value >= 0.0 and result.p_value <= 1.0
true

Types

test_result()

@type test_result() :: %{
  statistic: float(),
  p_value: float(),
  significant: boolean(),
  alpha: float(),
  effect_size: float() | nil,
  test_name: String.t(),
  interpretation: String.t()
}

Functions

chi_square_test(predictions, labels, sensitive_attr, opts \\ [])

@spec chi_square_test(Nx.Tensor.t(), Nx.Tensor.t(), Nx.Tensor.t(), keyword()) ::
  test_result()

Chi-square test for equalized odds.

Tests whether confusion matrices are independent of group membership.

Hypotheses

H₀: Confusion matrix is independent of sensitive attribute
H₁: Confusion matrix depends on sensitive attribute

Test Statistic

χ² = Σ (O_ij - E_ij)² / E_ij

where O_ij is the observed count in cell (i, j), and E_ij = (row i total × column j total) / N is the expected count under independence
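The statistic above can be sketched for any contingency table. This minimal, hypothetical `ChiSquareSketch` module is not ExFairness's code. The documentation does not say how `chi_square_test/4` builds its table from `predictions`, `labels`, and `sensitive_attr`, so the sketch assumes the table is already given as a list of rows of counts. It also stops at the statistic, because turning χ² into a p-value requires the chi-square CDF.

```elixir
defmodule ChiSquareSketch do
  # Hypothetical helper, not part of ExFairness: Pearson's chi-square
  # statistic for a contingency table given as a list of rows of counts.
  def statistic(table) do
    row_totals = Enum.map(table, &Enum.sum/1)

    col_totals =
      table
      |> Enum.zip()
      |> Enum.map(fn col -> col |> Tuple.to_list() |> Enum.sum() end)

    n = Enum.sum(row_totals)

    table
    |> Enum.zip(row_totals)
    |> Enum.flat_map(fn {row, row_total} ->
      row
      |> Enum.zip(col_totals)
      |> Enum.map(fn {observed, col_total} ->
        # Expected count under independence: E_ij = row_i total * col_j total / N
        expected = row_total * col_total / n
        diff = observed - expected
        diff * diff / expected
      end)
    end)
    |> Enum.sum()
  end

  # Degrees of freedom for an r x c table: (r - 1) * (c - 1).
  def degrees_of_freedom(table) do
    (length(table) - 1) * (length(hd(table)) - 1)
  end
end
```

For a 2 × 2 table such as `[[10, 20], [20, 10]]`, every expected count is 15, so χ² = 4 · 25/15 ≈ 6.67 with 1 degree of freedom. That value exceeds 3.841, the χ² critical value for df = 1 at α = 0.05.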

Parameters

predictions - Binary predictions tensor
labels - Binary labels tensor
sensitive_attr - Binary sensitive attribute tensor
opts:
- :alpha - Significance level (default: 0.05)

Returns

Test result map with chi-square statistic and p-value.

Examples

iex> predictions = Nx.tensor([1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1])
iex> labels = Nx.tensor([1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1])
iex> sensitive = Nx.tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
iex> result = ExFairness.Utils.StatisticalTests.chi_square_test(predictions, labels, sensitive)
iex> result.test_name
"Chi-Square Test"

cohens_h(p1, p2)

@spec cohens_h(float(), float()) :: float()

Computes Cohen's h effect size for two proportions.

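Cohen's h is the difference between two proportions after the variance-stabilizing arcsine transformation φ = 2 · arcsin(√p). It is signed (positive when p₁ > p₂), so the guidelines below apply to |h|.

Effect Size Guidelines

Small: |h| ≈ 0.2
Medium: |h| ≈ 0.5
Large: |h| ≈ 0.8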

Formula

h = 2 * (arcsin(√p₁) - arcsin(√p₂))
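The formula translates directly into Erlang's `:math` functions. `CohensHSketch` below is a hypothetical illustration, not the library's `cohens_h/2`. Its `label/1` helper is an assumption as well: it treats the 0.2 / 0.5 / 0.8 benchmarks as lower cut-offs on |h| and calls anything below 0.2 "negligible".

```elixir
defmodule CohensHSketch do
  # Hypothetical sketch of h = 2 * (arcsin(sqrt(p1)) - arcsin(sqrt(p2))),
  # not ExFairness's implementation. Both inputs must be proportions.
  def h(p1, p2) when p1 >= 0 and p1 <= 1 and p2 >= 0 and p2 <= 1 do
    2 * (:math.asin(:math.sqrt(p1)) - :math.asin(:math.sqrt(p2)))
  end

  # Assumption: Cohen's benchmarks used as lower cut-offs on |h|.
  def label(h) when is_number(h) do
    magnitude = abs(h)

    cond do
      magnitude < 0.2 -> :negligible
      magnitude < 0.5 -> :small
      magnitude < 0.8 -> :medium
      true -> :large
    end
  end
end
```

With this sketch, `CohensHSketch.h(0.5, 0.3)` is about 0.41, consistent with the doctest below, and swapping the arguments flips the sign. The largest possible magnitude is π, reached at `h(1.0, 0.0)`.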

Examples

iex> h = ExFairness.Utils.StatisticalTests.cohens_h(0.5, 0.3)
iex> h > 0.4 and h < 0.5
true

permutation_test(data, metric_fn, opts \\ [])

@spec permutation_test([Nx.Tensor.t()], function(), keyword()) :: test_result()

Permutation test for any fairness metric.

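A non-parametric test: instead of assuming a normal distribution, it builds the null distribution of the metric by repeatedly shuffling the sensitive attribute.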

Algorithm

1. Compute the observed metric on the actual data.
2. For i = 1 to n_permutations:
   a. Randomly permute the sensitive attribute.
   b. Compute the metric on the permuted data.
   c. Store the result as permuted_statistics[i].
3. P-value = proportion of permuted statistics ≥ the observed statistic.
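The three steps above can be sketched on plain lists. `PermutationSketch` is a hypothetical module, not ExFairness's `permutation_test/3`. The library's `metric_fn` receives the data list, but here the metric takes `(values, groups)` directly. The default seed of 42 is also the sketch's own assumption. Only the group labels are shuffled, which breaks any association between group membership and the values while keeping the group sizes fixed.

```elixir
defmodule PermutationSketch do
  # Hypothetical sketch, not ExFairness's implementation. metric_fn receives
  # (values, groups) and returns a number; larger means more disparity.
  def p_value(values, groups, metric_fn, opts \\ []) do
    n_permutations = Keyword.get(opts, :n_permutations, 10_000)
    seed = Keyword.get(opts, :seed, 42)

    # Seed the process-local :rand state so Enum.shuffle/1 is reproducible.
    :rand.seed(:exsss, {seed, seed, seed})

    # Step 1: metric on the actual data.
    observed = metric_fn.(values, groups)

    # Step 2: recompute the metric with shuffled group labels and count
    # how often the permuted statistic reaches the observed one.
    at_least_as_extreme =
      Enum.count(1..n_permutations, fn _ ->
        metric_fn.(values, Enum.shuffle(groups)) >= observed
      end)

    # Step 3: proportion of permuted statistics >= observed.
    at_least_as_extreme / n_permutations
  end

  # Example metric: absolute gap in positive rates between groups 0 and 1.
  def rate_gap(values, groups) do
    pairs = Enum.zip(values, groups)
    abs(rate(pairs, 0) - rate(pairs, 1))
  end

  defp rate(pairs, group) do
    members = for {v, ^group} <- pairs, do: v
    Enum.sum(members) / length(members)
  end
end
```

Because `rate_gap/2` is an absolute difference, comparing with `>=` already behaves like a two-sided test. With predictions that are identical across groups, the observed gap is 0 and every permutation reaches it, so the p-value is 1.0. Some implementations report `(count + 1) / (n_permutations + 1)` instead, which keeps the p-value strictly positive.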

Parameters

data - List of data tensors: [predictions, sensitive_attr] or [predictions, labels, sensitive_attr]
metric_fn - Function that receives the data list and returns the metric as a number
opts:
- :n_permutations - Number of permutations (default: 10000)
- :alpha - Significance level (default: 0.05)
- :alternative - Test direction (:two_sided, :greater, :less)
- :seed - Random seed for reproducibility

Returns

Test result map with permutation statistics and p-value.

Examples

iex> predictions = Nx.tensor([1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
iex> sensitive = Nx.tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
iex> metric_fn = fn [preds, sens] ->
...>   result = ExFairness.demographic_parity(preds, sens)
...>   result.disparity
...> end
iex> result = ExFairness.Utils.StatisticalTests.permutation_test(
...>   [predictions, sensitive],
...>   metric_fn,
...>   n_permutations: 100,
...>   seed: 42
...> )
iex> result.test_name
"Permutation Test"

two_proportion_test(predictions, sensitive_attr, opts \\ [])

@spec two_proportion_test(Nx.Tensor.t(), Nx.Tensor.t(), keyword()) :: test_result()

Two-proportion Z-test for demographic parity.

Tests whether positive prediction rates differ significantly between groups.

Hypotheses

H₀: p_A = p_B (no disparity between groups)
H₁: p_A ≠ p_B (disparity exists)

Test Statistic

Under H₀, the standard error is:

SE = sqrt(p̂ * (1 - p̂) * (1/n_A + 1/n_B))

where p̂ = (n_A p_A + n_B p_B) / (n_A + n_B)

Z-statistic:

Z = (p_A - p_B) / SE

P-value (two-tailed):

p = 2 * P(Z > |z_observed|) = P(|Z| > |z_observed|), where Z ~ N(0, 1)
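The formulas above can be followed step by step in a short sketch. `TwoProportionSketch` is hypothetical and is not ExFairness's `two_proportion_test/3`. It takes plain lists of 0/1 values, assumes both groups are non-empty, and computes only the two-sided p-value, using the identity 2 · (1 − Φ(|z|)) = erfc(|z| / √2) with Erlang's `:math.erfc/1`. Returning p = 1.0 when every prediction is identical (zero standard error) is the sketch's own convention.

```elixir
defmodule TwoProportionSketch do
  # Hypothetical sketch of the pooled two-proportion Z-test, not ExFairness's
  # implementation. predictions and groups are equal-length lists of 0/1.
  def z_test(predictions, groups) do
    pairs = Enum.zip(predictions, groups)
    a = for {p, 0} <- pairs, do: p
    b = for {p, 1} <- pairs, do: p

    {n_a, n_b} = {length(a), length(b)}
    {p_a, p_b} = {Enum.sum(a) / n_a, Enum.sum(b) / n_b}

    # Pooled proportion under H0: p_A = p_B
    pooled = (n_a * p_a + n_b * p_b) / (n_a + n_b)
    se = :math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))

    if se == 0.0 do
      # All predictions identical: no variance and no evidence of a gap (assumption).
      %{statistic: 0.0, p_value: 1.0}
    else
      z = (p_a - p_b) / se
      # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
      %{statistic: z, p_value: :math.erfc(abs(z) / :math.sqrt(2))}
    end
  end
end
```

On the data from the example below (positive rates 0.8 and 0.0, ten observations per group), p̂ = 0.4, SE = √0.048 ≈ 0.219, Z ≈ 3.65, and the two-sided p-value is well below 0.001.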

Assumptions

Large sample sizes (n_A, n_B > 30 recommended)
Independent observations
n·p > 5 and n·(1 − p) > 5 within each group
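The rules of thumb above are simple to check before trusting a Z-test. `ZTestAssumptions` is a hypothetical helper, not part of ExFairness. It applies the listed thresholds to each group's size n and observed positive rate p. It cannot check independence, which depends on how the data were collected.

```elixir
defmodule ZTestAssumptions do
  # Hypothetical helper, not part of ExFairness. Applies the rules of thumb
  # listed above to one group of size n with observed positive rate p.
  def check(n, p) when is_integer(n) and n > 0 and p >= 0 and p <= 1 do
    %{
      large_sample: n > 30,
      enough_positives: n * p > 5,
      enough_negatives: n * (1 - p) > 5
    }
  end

  # True only if both groups satisfy every rule.
  def all_ok?(n_a, p_a, n_b, p_b) do
    [check(n_a, p_a), check(n_b, p_b)]
    |> Enum.flat_map(&Map.values/1)
    |> Enum.all?()
  end
end
```

The examples in this module use 10 observations per group, below the n > 30 recommendation. They demonstrate the API rather than a setting where the normal approximation is reliable. In that setting, the permutation test is an alternative.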

Parameters

predictions - Binary predictions tensor (0 or 1)
sensitive_attr - Binary sensitive attribute tensor (0 or 1)
opts:
- :alpha - Significance level (default: 0.05)
- :alternative - Test direction (:two_sided, :greater, :less)

Returns

Test result map with statistic, p-value, and significance.

Examples

iex> predictions = Nx.tensor([1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
iex> sensitive = Nx.tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
iex> result = ExFairness.Utils.StatisticalTests.two_proportion_test(predictions, sensitive)
iex> result.test_name
"Two-Proportion Z-Test"