Phase 2: End-to-End Span-Based Coreference Resolution

This document describes the end-to-end (E2E) span-based coreference resolution system implemented in Phase 2. This architecture jointly learns mention detection and coreference resolution, achieving higher accuracy than the pipelined Phase 1 approach.

Overview

Key Differences from Phase 1

Aspect               Phase 1 (Pipelined)                     Phase 2 (End-to-End)
Architecture         Two separate stages                     Single joint model
Mention Detection    Rule-based, pre-defined                 Learned span scoring
Optimization         Stages optimized separately             Joint end-to-end optimization
Error Propagation    Detection errors → resolution errors    No error propagation
Expected F1          75-80%                                  82-85%
Training Time        ~2-3 hours                              ~4-6 hours
Inference Speed      ~50-100 ms/doc                          ~80-120 ms/doc

Architecture Diagram

Text → Token Embeddings → BiLSTM Encoder
                 ↓
    Span Enumeration (all possible spans)
                 ↓
    Span Scoring (mention detection head)
                 ↓
    Top-K Pruning (keep best spans)
                 ↓
    Pairwise Scoring (coreference head)
                 ↓
    Clustering → Coreference Chains

Quick Start

Training

mix nasty.train.e2e_coref \
  --corpus data/ontonotes/train \
  --dev data/ontonotes/dev \
  --output priv/models/en/e2e_coref \
  --epochs 25 \
  --batch-size 16 \
  --learning-rate 0.0005

Evaluation

mix nasty.eval.e2e_coref \
  --model priv/models/en/e2e_coref \
  --test data/ontonotes/test \
  --baseline

The --baseline flag compares E2E results with Phase 1 models.

Usage in Code

# Load trained E2E models
{:ok, models, params, vocab} = 
  Nasty.Semantic.Coreference.Neural.E2ETrainer.load_models(
    "priv/models/en/e2e_coref"
  )

# Resolve coreferences in a document
{:ok, resolved_doc} = 
  Nasty.Semantic.Coreference.Neural.E2EResolver.resolve(
    document, 
    models, 
    params, 
    vocab
  )

# Or use auto-loading convenience function
{:ok, resolved_doc} = 
  Nasty.Semantic.Coreference.Neural.E2EResolver.resolve_auto(
    document,
    "priv/models/en/e2e_coref"
  )

# Access coreference chains
chains = resolved_doc.coref_chains

Model Components

1. Span Enumeration (SpanEnumeration)

Generates all possible spans up to a maximum length, then prunes to top-K candidates.

Key Functions:

  • enumerate_spans/2 - Generate all spans up to max_length
  • enumerate_and_prune/2 - Score and prune to top-K
  • span_representation/4 - Compute span embedding

Span Representation:

span_repr = [start_state, end_state, attention_over_span, width_embedding]

Configuration:

  • max_length: 10 tokens (default)
  • top_k: 50 spans per sentence (default)
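The enumerate-then-prune step can be sketched language-agnostically. The following Python sketch (not the project's Elixir API; the scores stand in for the learned span scorer's output) enumerates all spans up to max_length and keeps the top-K by score:

```python
def enumerate_spans(n_tokens, max_length):
    """All (start, end) spans with inclusive bounds, up to max_length tokens wide."""
    return [(s, e)
            for s in range(n_tokens)
            for e in range(s, min(s + max_length, n_tokens))]

def prune_top_k(spans, scores, k):
    """Keep the k highest-scoring candidate spans."""
    ranked = sorted(zip(spans, scores), key=lambda pair: pair[1], reverse=True)
    return [span for span, _ in ranked[:k]]

# 5 tokens, spans of width <= 3: 5 + 4 + 3 = 12 candidates
spans = enumerate_spans(5, max_length=3)
```

Enumeration is O(n · max_length) in the number of candidates, which is why the subsequent top-K pruning is needed to keep pairwise scoring tractable.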

2. Span Model (SpanModel)

Joint architecture with shared encoder and two task-specific heads.

Components:

  • Shared Encoder: BiLSTM (256 hidden units) processes entire document
  • Span Scorer Head: Feedforward network [256, 128] → Sigmoid (mention detection)
  • Pair Scorer Head: Feedforward network [512, 256] → Sigmoid (coreference)

Loss Function:

total_loss = 0.3 * span_loss + 0.7 * coref_loss

Both use binary cross-entropy.
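A minimal sketch of the weighted joint objective, with binary cross-entropy written out in plain Python (the 0.3/0.7 weights mirror the defaults above; predictions are assumed to be post-sigmoid probabilities):

```python
import math

def bce(pred, gold, eps=1e-7):
    """Binary cross-entropy for one probability/label pair, clipped for stability."""
    pred = min(max(pred, eps), 1 - eps)
    return -(gold * math.log(pred) + (1 - gold) * math.log(1 - pred))

def joint_loss(span_preds, span_gold, coref_preds, coref_gold,
               span_weight=0.3, coref_weight=0.7):
    """total_loss = 0.3 * span_loss + 0.7 * coref_loss, each a mean BCE."""
    span_loss = sum(bce(p, g) for p, g in zip(span_preds, span_gold)) / len(span_preds)
    coref_loss = sum(bce(p, g) for p, g in zip(coref_preds, coref_gold)) / len(coref_preds)
    return span_weight * span_loss + coref_weight * coref_loss
```

Because both heads share the encoder, gradients from the coreference loss also shape the span representations, which is the core benefit of joint optimization.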

3. E2E Trainer (E2ETrainer)

Training pipeline with joint optimization and early stopping.

Training Process:

  1. Load training and dev data (OntoNotes format)
  2. Build vocabulary from all documents
  3. Initialize models with random weights
  4. Train for N epochs with Adam optimizer
  5. Evaluate on dev set after each epoch
  6. Early stopping when dev F1 stops improving

Hyperparameters:

  • Epochs: 25
  • Batch size: 16
  • Learning rate: 0.0005 (lower than Phase 1)
  • Dropout: 0.3
  • Patience: 3 epochs
  • Span loss weight: 0.3
  • Coref loss weight: 0.7

4. E2E Resolver (E2EResolver)

Inference using trained models.

Resolution Steps:

  1. Extract tokens from document AST
  2. Convert tokens to IDs using vocabulary
  3. Encode with BiLSTM
  4. Enumerate and score candidate spans
  5. Filter spans by score threshold (default: 0.5)
  6. Score all span pairs for coreference
  7. Build chains using greedy clustering
  8. Attach chains to document

Clustering Algorithm: Greedy left-to-right antecedent selection:

  • For each span, find best previous span with score > threshold
  • Merge spans into same cluster if score exceeds threshold
  • Results in transitively-closed coreference chains
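The greedy clustering step above can be sketched as follows (plain Python; pair_scores stands in for the trained pair scorer's output, keyed by (antecedent, span) tuples):

```python
def greedy_cluster(spans, pair_scores, threshold=0.5):
    """Greedy left-to-right antecedent selection.

    spans       -- candidate mention spans in document order
    pair_scores -- dict mapping (antecedent, span) -> coreference score
    Returns the multi-mention clusters (coreference chains).
    """
    cluster_of = {}   # span -> index into clusters
    clusters = []
    for i, span in enumerate(spans):
        # Best-scoring previous span strictly above the threshold, if any.
        best, best_score = None, threshold
        for prev in spans[:i]:
            score = pair_scores.get((prev, span), 0.0)
            if score > best_score:
                best, best_score = prev, score
        if best is None:
            cluster_of[span] = len(clusters)
            clusters.append([span])
        else:
            # Joining the antecedent's cluster gives transitive closure for free.
            idx = cluster_of[best]
            cluster_of[span] = idx
            clusters[idx].append(span)
    return [c for c in clusters if len(c) > 1]
```

Because each span links to at most one antecedent, chains form by transitivity through shared clusters rather than by explicit pairwise merging.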

Data Preparation

Span Training Data

The E2E model requires two types of training data:

1. Span Detection Data (create_span_training_data/2):

  • Generates (span, label) pairs
  • Label = 1 if span is a mention, 0 otherwise
  • Enumerates all spans up to max_width
  • Samples negative spans at 3:1 ratio

2. Antecedent Data (create_antecedent_data/2):

  • Generates (mention, antecedent, label) triples
  • Label = 1 if coreferent, 0 otherwise
  • Considers previous N mentions as antecedent candidates
  • Samples negative antecedents at 1.5:1 ratio
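A sketch of how both datasets might be assembled (plain Python; gold_mentions, gold_links, and the window parameter are illustrative stand-ins for the annotations and candidate window, while the 3:1 and 1.5:1 ratios mirror the defaults above):

```python
import random

def create_span_training_data(spans, gold_mentions, neg_ratio=3.0, seed=0):
    """(span, label) pairs: all gold mentions plus sampled negative spans."""
    rng = random.Random(seed)
    positives = [(s, 1) for s in spans if s in gold_mentions]
    negatives = [(s, 0) for s in spans if s not in gold_mentions]
    n_neg = min(len(negatives), int(neg_ratio * len(positives)))
    return positives + rng.sample(negatives, n_neg)

def create_antecedent_data(mentions, gold_links, window=10,
                           neg_ratio=1.5, seed=0):
    """(mention, antecedent, label) triples over a sliding candidate window."""
    rng = random.Random(seed)
    pos, neg = [], []
    for i, mention in enumerate(mentions):
        for prev in mentions[max(0, i - window):i]:
            label = 1 if (prev, mention) in gold_links else 0
            (pos if label else neg).append((mention, prev, label))
    n_neg = min(len(neg), int(neg_ratio * len(pos)))
    return pos + rng.sample(neg, n_neg)
```

Negative sampling keeps the class balance manageable: without it, almost every enumerated span (and almost every candidate pair) would carry label 0.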

Training Options

All training options with defaults:

--corpus <path>              # Required
--dev <path>                 # Required
--output <path>              # Required
--epochs 25                  # Training epochs
--batch-size 16              # Batch size
--learning-rate 0.0005       # Learning rate
--hidden-dim 256             # LSTM hidden dimension
--dropout 0.3                # Dropout rate
--patience 3                 # Early stopping patience
--max-span-width 10          # Maximum span width
--top-k-spans 50             # Spans to keep per sentence
--span-loss-weight 0.3       # Weight for span detection
--coref-loss-weight 0.7      # Weight for coreference

Evaluation Options

--model <path>               # Required
--test <path>                # Required
--baseline                   # Compare with Phase 1
--max-span-length 10         # Maximum span length
--top-k-spans 50             # Top K spans to keep
--min-span-score 0.5         # Minimum span score
--min-coref-score 0.5        # Minimum coref score

Performance

Expected Results

E2E Model (Phase 2):

  • CoNLL F1: 82-85%
  • MUC F1: 83-86%
  • B³ F1: 80-83%
  • CEAF F1: 82-85%

Improvement over Phase 1:

  • +5-10 F1 points overall
  • Better recall on singletons
  • Fewer spurious mention detections
  • More accurate pronoun resolution

Speed Benchmarks

  • Training: ~4-6 hours on OntoNotes (single GPU)
  • Encoding: ~80 mentions/sec
  • Span enumeration: ~500 spans/sec
  • Pairwise scoring: ~800 pairs/sec
  • End-to-end: ~80-120ms per document

Advantages of E2E Approach

  1. No Error Propagation: mentions are soft-scored rather than fixed by an upstream stage, so detection mistakes are not locked in before resolution
  2. Joint Optimization: Both tasks optimized together for best overall performance
  3. Learned Mention Detection: Model learns what constitutes a mention
  4. Better Boundaries: Span enumeration finds correct mention boundaries
  5. Global Context: BiLSTM encoder captures document-level context

Limitations

  1. Computational Cost: Enumerating all spans is expensive
  2. Memory Usage: Requires storing representations for all candidate spans
  3. Max Span Length: Limited to spans of 10 tokens or less
  4. Pruning Errors: Top-K pruning may discard valid mentions

Troubleshooting

Out of Memory

  • Reduce --batch-size to 8
  • Reduce --top-k-spans to 30
  • Reduce --hidden-dim to 128

Low Span Detection Recall

  • Increase --span-loss-weight to 0.5
  • Lower --min-span-score to 0.3
  • Increase --top-k-spans to 100

Low Coreference Accuracy

  • Increase --coref-loss-weight to 0.8
  • Train for more epochs: --epochs 30
  • Add more training data

Slow Training

  • Increase --batch-size to 32 (if memory allows)
  • Reduce --top-k-spans to 30
  • Use GPU acceleration (EXLA)

Module Reference

SpanEnumeration

enumerate_and_prune(lstm_outputs, opts) :: {:ok, [span()]}
enumerate_spans(lstm_outputs, max_length) :: [{start, end}]
span_representation(lstm_outputs, start, end, width_emb) :: tensor
build_span_scorer(opts) :: Axon.t()

SpanModel

build_model(opts) :: models()
build_encoder(vocab_size, embed_dim, hidden_dim, dropout) :: Axon.t()
build_span_scorer(span_dim, hidden_layers, dropout) :: Axon.t()
build_pair_scorer(pair_dim, hidden_layers, dropout) :: Axon.t()
extract_pair_features(span1, span2, tokens) :: tensor
forward(models, params, token_ids, spans) :: {span_scores, coref_scores}
compute_loss(span_scores, coref_scores, gold_span_labels, gold_coref_labels, opts) :: scalar

E2ETrainer

train(train_data, dev_data, vocab, opts) :: {:ok, models(), params(), history()}
save_models(models, params, vocab, path) :: :ok
load_models(path) :: {:ok, models(), params(), vocab()}

E2EResolver

resolve(document, models, params, vocab, opts) :: {:ok, Document.t()}
resolve_auto(document, model_path, opts) :: {:ok, Document.t()}

References

  • Lee et al. (2017). "End-to-end Neural Coreference Resolution"
  • Lee et al. (2018). "Higher-order Coreference Resolution with Coarse-to-fine Inference"
  • Joshi et al. (2019). "SpanBERT: Improving Pre-training by Representing and Predicting Spans"

See Also