Nasty.Semantic.Coreference.Evaluator (Nasty v0.3.0)


Coreference resolution evaluation metrics.

Implements standard coreference evaluation metrics:

  • MUC (Vilain et al., 1995) - Link-based
  • B³ (Bagga & Baldwin, 1998) - Mention-based
  • CEAF (Luo, 2005) - Entity-based with optimal alignment
  • CoNLL F1 - Average of MUC, B³, and CEAF

Example

# Evaluate predictions
metrics = Evaluator.evaluate(gold_chains, predicted_chains)

# Access individual metrics
muc_f1 = metrics.muc.f1
b3_f1 = metrics.b3.f1
ceaf_f1 = metrics.ceaf.f1
conll_f1 = metrics.conll_f1

References

  • MUC: Vilain et al. (1995). "A model-theoretic coreference scoring scheme"
  • B³: Bagga & Baldwin (1998). "Algorithms for scoring coreference chains"
  • CEAF: Luo (2005). "On coreference resolution performance metrics"
  • CoNLL: Pradhan et al. (2012). "CoNLL-2012 shared task"

Summary

Functions

compute_b3(gold_chains, predicted_chains) - Compute B³ metric (mention-based).

compute_ceaf(gold_chains, predicted_chains) - Compute CEAF metric (entity-based with optimal alignment).

compute_muc(gold_chains, predicted_chains) - Compute MUC metric (link-based).

conll_f1(gold_chains, predicted_chains) - Compute CoNLL F1 score.

evaluate(gold_chains, predicted_chains) - Evaluate predicted coreference chains against gold standard.

format_results(metrics) - Format evaluation results as a string.

Types

evaluation()

@type evaluation() :: %{
  muc: metric(),
  b3: metric(),
  ceaf: metric(),
  conll_f1: float()
}

metric()

@type metric() :: %{precision: float(), recall: float(), f1: float()}

Functions

compute_b3(gold_chains, predicted_chains)

Compute B³ metric (mention-based).

B³ computes precision and recall for each mention individually, then averages them across all mentions.

Parameters

  • gold_chains - Gold standard chains
  • predicted_chains - Predicted chains

Returns

Map with precision, recall, and F1
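The per-mention computation can be sketched language-agnostically (Python here, purely to illustrate the math, not this Elixir module's implementation; chains are assumed to be collections of hashable mention identifiers, and each side's chains are assumed disjoint):

```python
def b3(gold_chains, predicted_chains):
    """B-cubed: per-mention precision and recall, averaged over mentions."""
    gold = [set(c) for c in gold_chains]
    pred = [set(c) for c in predicted_chains]

    def score(system, reference):
        # For each mention in each system chain: the fraction of that
        # chain which also lies in the mention's reference chain.
        total, n = 0.0, 0
        for chain in system:
            for m in chain:
                ref = next((r for r in reference if m in r), set())
                total += len(chain & ref) / len(chain)
                n += 1
        return total / n if n else 0.0

    p = score(pred, gold)  # precision: iterate over predicted mentions
    r = score(gold, pred)  # recall: iterate over gold mentions
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r, "f1": f1}
```

For example, splitting the gold chain {1, 2, 3} into {1, 2} and {3} keeps precision at 1.0 (every predicted chain is pure) but lowers recall, since each mention's predicted chain covers only part of its gold chain.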

compute_ceaf(gold_chains, predicted_chains)

Compute CEAF metric (entity-based with optimal alignment).

CEAF finds the optimal alignment between gold and predicted chains using the Kuhn-Munkres algorithm (Hungarian algorithm).

Parameters

  • gold_chains - Gold standard chains
  • predicted_chains - Predicted chains

Returns

Map with precision, recall, and F1
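The alignment idea can be sketched as follows (Python, for illustration only; this uses Luo's φ₄ entity similarity and brute-forces the optimal one-to-one alignment, whereas a real scorer, like the Kuhn-Munkres step this module describes, solves the assignment in polynomial time):

```python
from itertools import permutations

def phi4(a, b):
    # Luo (2005) entity similarity: 2|A ∩ B| / (|A| + |B|)
    return 2 * len(a & b) / (len(a) + len(b))

def ceaf(gold_chains, predicted_chains):
    gold = [set(c) for c in gold_chains]
    pred = [set(c) for c in predicted_chains]
    if not gold or not pred:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    # Optimal one-to-one alignment by brute force over the larger side.
    # Fine for tiny inputs; the Hungarian algorithm scales this up.
    small, large = (gold, pred) if len(gold) <= len(pred) else (pred, gold)
    best = max(
        sum(phi4(chain, large[j]) for chain, j in zip(small, perm))
        for perm in permutations(range(len(large)), len(small))
    )
    p = best / len(pred)  # total similarity over number of predicted entities
    r = best / len(gold)  # total similarity over number of gold entities
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r, "f1": f1}
```

Merging two gold chains into one predicted chain is penalized on both sides: the single predicted entity can align with only one gold entity, so the unaligned gold chain contributes nothing to recall.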

compute_muc(gold_chains, predicted_chains)

Compute MUC metric (link-based).

MUC scores chains by the minimum number of links needed to connect the mentions in each cluster, counting how many of those links the other side recovers.

Parameters

  • gold_chains - Gold standard chains
  • predicted_chains - Predicted chains

Returns

Map with precision, recall, and F1
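The link-counting can be sketched like this (Python, illustrating the Vilain et al. formula rather than this module's Elixir code; chains are assumed to be disjoint collections of hashable mention identifiers):

```python
def muc(gold_chains, predicted_chains):
    gold = [set(c) for c in gold_chains]
    pred = [set(c) for c in predicted_chains]

    def score(key, response):
        # Links recovered / minimum links needed, summed over key chains.
        num = den = 0
        for chain in key:
            covered = set()
            parts = 0
            for r in response:
                overlap = chain & r
                if overlap:
                    parts += 1
                    covered |= overlap
            parts += len(chain - covered)  # unresolved mentions are singletons
            num += len(chain) - parts      # a chain split into k parts keeps |chain| - k links
            den += len(chain) - 1          # |chain| - 1 links connect the chain
        return num / den if den else 0.0

    r = score(gold, pred)  # recall: partition gold chains by predictions
    p = score(pred, gold)  # precision: partition predictions by gold chains
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r, "f1": f1}
```

Note the denominator: a chain of n mentions contributes n − 1 links, so singleton chains contribute nothing, which is why MUC cannot reward correctly leaving a mention unlinked.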

conll_f1(gold_chains, predicted_chains)

Compute CoNLL F1 score.

CoNLL F1 is the average of MUC, B³, and CEAF F1 scores.

Parameters

  • gold_chains - Gold standard chains
  • predicted_chains - Predicted chains

Returns

CoNLL F1 score (0.0 to 1.0)
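The combination itself is just an unweighted mean (shown in Python for illustration; the inputs are the three F1 scores computed above):

```python
def conll_f1(muc_f1, b3_f1, ceaf_f1):
    # CoNLL-2012 official score: unweighted mean of the three F1 values.
    return (muc_f1 + b3_f1 + ceaf_f1) / 3
```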

evaluate(gold_chains, predicted_chains)

Evaluate predicted coreference chains against gold standard.

Parameters

  • gold_chains - Gold standard coreference chains
  • predicted_chains - Predicted coreference chains

Returns

Map with all evaluation metrics

format_results(metrics)

@spec format_results(evaluation()) :: String.t()

Format evaluation results as a string.

Parameters

  • metrics - Evaluation metrics

Returns

Formatted string with all metrics