E2E_COREFERENCE
View SourcePhase 2: End-to-End Span-Based Coreference Resolution
This document describes the end-to-end (E2E) span-based coreference resolution system implemented in Phase 2. This architecture jointly learns mention detection and coreference resolution, achieving higher accuracy than the pipelined Phase 1 approach.
Overview
Key Differences from Phase 1
| Aspect | Phase 1 (Pipelined) | Phase 2 (End-to-End) |
|---|---|---|
| Architecture | Two separate stages | Single joint model |
| Mention Detection | Rule-based, pre-defined | Learned span scoring |
| Optimization | Stages optimized separately | Joint end-to-end optimization |
| Error Propagation | Detection errors → resolution errors | No error propagation |
| Expected F1 | 75-80% | 82-85% |
| Training Time | ~2-3 hours | ~4-6 hours |
| Inference Speed | ~50-100ms/doc | ~80-120ms/doc |
Architecture Diagram
Text → Token Embeddings → BiLSTM Encoder
↓
Span Enumeration (all possible spans)
↓
Span Scoring (mention detection head)
↓
Top-K Pruning (keep best spans)
↓
Pairwise Scoring (coreference head)
↓
Clustering → Coreference ChainsQuick Start
Training
mix nasty.train.e2e_coref \
--corpus data/ontonotes/train \
--dev data/ontonotes/dev \
--output priv/models/en/e2e_coref \
--epochs 25 \
--batch-size 16 \
--learning-rate 0.0005
Evaluation
mix nasty.eval.e2e_coref \
--model priv/models/en/e2e_coref \
--test data/ontonotes/test \
--baseline
The --baseline flag compares E2E results with Phase 1 models.
Usage in Code
# Load trained E2E models
{:ok, models, params, vocab} =
Nasty.Semantic.Coreference.Neural.E2ETrainer.load_models(
"priv/models/en/e2e_coref"
)
# Resolve coreferences in a document
{:ok, resolved_doc} =
Nasty.Semantic.Coreference.Neural.E2EResolver.resolve(
document,
models,
params,
vocab
)
# Or use auto-loading convenience function
{:ok, resolved_doc} =
Nasty.Semantic.Coreference.Neural.E2EResolver.resolve_auto(
document,
"priv/models/en/e2e_coref"
)
# Access coreference chains
chains = resolved_doc.coref_chainsModel Components
1. Span Enumeration (SpanEnumeration)
Generates all possible spans up to a maximum length, then prunes to top-K candidates.
Key Functions:
enumerate_spans/2- Generate all spans up to max_lengthenumerate_and_prune/2- Score and prune to top-Kspan_representation/4- Compute span embedding
Span Representation:
span_repr = [start_state, end_state, attention_over_span, width_embedding]Configuration:
max_length: 10 tokens (default)top_k: 50 spans per sentence (default)
2. Span Model (SpanModel)
Joint architecture with shared encoder and two task-specific heads.
Components:
- Shared Encoder: BiLSTM (256 hidden units) processes entire document
- Span Scorer Head: Feedforward network [256, 128] → Sigmoid (mention detection)
- Pair Scorer Head: Feedforward network [512, 256] → Sigmoid (coreference)
Loss Function:
total_loss = 0.3 * span_loss + 0.7 * coref_lossBoth use binary cross-entropy.
3. E2E Trainer (E2ETrainer)
Training pipeline with joint optimization and early stopping.
Training Process:
- Load training and dev data (OntoNotes format)
- Build vocabulary from all documents
- Initialize models with random weights
- Train for N epochs with Adam optimizer
- Evaluate on dev set after each epoch
- Early stopping when dev F1 stops improving
Hyperparameters:
- Epochs: 25
- Batch size: 16
- Learning rate: 0.0005 (lower than Phase 1)
- Dropout: 0.3
- Patience: 3 epochs
- Span loss weight: 0.3
- Coref loss weight: 0.7
4. E2E Resolver (E2EResolver)
Inference using trained models.
Resolution Steps:
- Extract tokens from document AST
- Convert tokens to IDs using vocabulary
- Encode with BiLSTM
- Enumerate and score candidate spans
- Filter spans by score threshold (default: 0.5)
- Score all span pairs for coreference
- Build chains using greedy clustering
- Attach chains to document
Clustering Algorithm: Greedy left-to-right antecedent selection:
- For each span, find best previous span with score > threshold
- Merge spans into same cluster if score exceeds threshold
- Results in transitively-closed coreference chains
Data Preparation
Span Training Data
The E2E model requires two types of training data:
1. Span Detection Data (create_span_training_data/2):
- Generates (span, label) pairs
- Label = 1 if span is a mention, 0 otherwise
- Enumerates all spans up to max_width
- Samples negative spans at 3:1 ratio
2. Antecedent Data (create_antecedent_data/2):
- Generates (mention, antecedent, label) triples
- Label = 1 if coreferent, 0 otherwise
- Considers previous N mentions as antecedent candidates
- Samples negative antecedents at 1.5:1 ratio
Training Options
All training options with defaults:
--corpus <path> # Required
--dev <path> # Required
--output <path> # Required
--epochs 25 # Training epochs
--batch-size 16 # Batch size
--learning-rate 0.0005 # Learning rate
--hidden-dim 256 # LSTM hidden dimension
--dropout 0.3 # Dropout rate
--patience 3 # Early stopping patience
--max-span-width 10 # Maximum span width
--top-k-spans 50 # Spans to keep per sentence
--span-loss-weight 0.3 # Weight for span detection
--coref-loss-weight 0.7 # Weight for coreference
Evaluation Options
--model <path> # Required
--test <path> # Required
--baseline # Compare with Phase 1
--max-span-length 10 # Maximum span length
--top-k-spans 50 # Top K spans to keep
--min-span-score 0.5 # Minimum span score
--min-coref-score 0.5 # Minimum coref score
Performance
Expected Results
E2E Model (Phase 2):
- CoNLL F1: 82-85%
- MUC F1: 83-86%
- B³ F1: 80-83%
- CEAF F1: 82-85%
Improvement over Phase 1:
- +5-10 F1 points overall
- Better recall on singletons
- Fewer spurious mention detections
- More accurate pronoun resolution
Speed Benchmarks
- Training: ~4-6 hours on OntoNotes (single GPU)
- Encoding: ~80 mentions/sec
- Span enumeration: ~500 spans/sec
- Pairwise scoring: ~800 pairs/sec
- End-to-end: ~80-120ms per document
Advantages of E2E Approach
- No Error Propagation: Mention detection errors don't affect coreference
- Joint Optimization: Both tasks optimized together for best overall performance
- Learned Mention Detection: Model learns what constitutes a mention
- Better Boundaries: Span enumeration finds correct mention boundaries
- Global Context: BiLSTM encoder captures document-level context
Limitations
- Computational Cost: Enumerating all spans is expensive
- Memory Usage: Requires storing representations for all candidate spans
- Max Span Length: Limited to spans of 10 tokens or less
- Pruning Errors: Top-K pruning may discard valid mentions
Troubleshooting
Out of Memory
- Reduce
--batch-sizeto 8 - Reduce
--top-k-spansto 30 - Reduce
--hidden-dimto 128
Low Span Detection
- Increase
--span-loss-weightto 0.5 - Lower
--min-span-scoreto 0.3 - Increase
--top-k-spansto 100
Low Coreference Accuracy
- Increase
--coref-loss-weightto 0.8 - Train for more epochs:
--epochs 30 - Add more training data
Slow Training
- Increase
--batch-sizeto 32 (if memory allows) - Reduce
--top-k-spansto 30 - Use GPU acceleration (EXLA)
Module Reference
SpanEnumeration
enumerate_and_prune(lstm_outputs, opts) :: {:ok, [span()]}
enumerate_spans(lstm_outputs, max_length) :: [{start, end}]
span_representation(lstm_outputs, start, end, width_emb) :: tensor
build_span_scorer(opts) :: Axon.t()SpanModel
build_model(opts) :: models()
build_encoder(vocab_size, embed_dim, hidden_dim, dropout) :: Axon.t()
build_span_scorer(span_dim, hidden_layers, dropout) :: Axon.t()
build_pair_scorer(pair_dim, hidden_layers, dropout) :: Axon.t()
extract_pair_features(span1, span2, tokens) :: tensor
forward(models, params, token_ids, spans) :: {span_scores, coref_scores}
compute_loss(span_scores, coref_scores, gold_span_labels, gold_coref_labels, opts) :: scalarE2ETrainer
train(train_data, dev_data, vocab, opts) :: {:ok, models(), params(), history()}
save_models(models, params, vocab, path) :: :ok
load_models(path) :: {:ok, models(), params(), vocab()}E2EResolver
resolve(document, models, params, vocab, opts) :: {:ok, Document.t()}
resolve_auto(document, model_path, opts) :: {:ok, Document.t()}References
- Lee et al. (2017). "End-to-end Neural Coreference Resolution"
- Lee et al. (2018). "Higher-order Coreference Resolution with Coarse-to-fine Inference"
- Joshi et al. (2019). "SpanBERT: Improving Pre-training by Representing and Predicting Spans"
See Also
- NEURAL_COREFERENCE.md - Phase 1 documentation
- COREFERENCE_TRAINING.md - Detailed training guide
- API.md - Full API reference