Statistical Models Guide
This document provides a comprehensive guide to the advanced statistical models implemented in Nasty: PCFG (Probabilistic Context-Free Grammar) for parsing and CRF (Conditional Random Fields) for sequence labeling and NER.
Overview
Nasty implements two major classes of statistical models:
- PCFG Parser - For probabilistic phrase structure parsing with ambiguity resolution
- CRF-based NER - For context-aware named entity recognition using sequence labeling
Both models follow the Nasty.Statistics.Model behaviour, providing consistent interfaces for training, prediction, and persistence.
PCFG (Probabilistic Context-Free Grammar)
What is PCFG?
PCFG extends traditional context-free grammars with probabilities on production rules. This allows the parser to:
- Resolve syntactic ambiguities probabilistically
- Score different parse trees and select the most likely one
- Handle rare constructions gracefully through smoothing
Architecture
Core Modules:
- `Nasty.Statistics.Parsing.Grammar` - Rule representation and CNF conversion
- `Nasty.Statistics.Parsing.CYKParser` - CYK parsing algorithm
- `Nasty.Statistics.Parsing.PCFG` - Main model implementing the `Model` behaviour
Data Flow:
flowchart TD
A["Training Data (Treebank)"]
B["Extract Grammar Rules + Probabilities"]
C["CNF Conversion"]
D["Trained PCFG Model"]
E["CYK Parser (Viterbi)"]
F["Parse Tree with Probability"]
A --> B
B --> C
C --> D
D --> E
E --> F

Grammar Rules
PCFG uses production rules with probabilities:
NP → Det Noun [0.35]
NP → PropN [0.25]
VP → Verb NP [0.45]

The sum of the probabilities of all rules with the same left-hand side equals 1.0.
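Rule probabilities are typically estimated from treebank counts by relative frequency, with add-k smoothing for robustness. The following is an illustrative Python sketch of that estimation (the library itself is Elixir; the function name and tuple layout here are assumptions for illustration, though the default k matches the `smoothing: 0.001` option shown later):

```python
from collections import defaultdict

def estimate_rule_probs(rule_counts, k=0.001):
    """Relative-frequency estimates with add-k smoothing.

    rule_counts: list of (lhs, rhs, count) tuples, e.g. ("NP", ("Det", "Noun"), 350).
    Returns {(lhs, rhs): probability}, normalized per left-hand side.
    """
    by_lhs = defaultdict(list)
    for lhs, rhs, count in rule_counts:
        by_lhs[lhs].append((rhs, count))

    probs = {}
    for lhs, rules in by_lhs.items():
        # Add k to every rule count; the denominator grows by k per rule,
        # so the probabilities for each LHS still sum to 1.0.
        total = sum(c for _, c in rules) + k * len(rules)
        for rhs, count in rules:
            probs[(lhs, rhs)] = (count + k) / total
    return probs

counts = [
    ("NP", ("Det", "Noun"), 350),
    ("NP", ("PropN",), 250),
    ("VP", ("Verb", "NP"), 450),
]
probs = estimate_rule_probs(counts)
# The two NP rules sum to 1.0; the lone VP rule gets probability 1.0.
```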
CYK Algorithm
The Cocke-Younger-Kasami algorithm:
- Requires grammar in Chomsky Normal Form (CNF)
- Uses dynamic programming (O(n³) complexity)
- Fills a chart bottom-up
- Extracts highest probability parse tree
Complexity:
- Time: O(n³ × |G|) where n = sentence length, |G| = grammar size
- Space: O(n² × |G|)
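The bottom-up chart filling described above can be sketched as a minimal probabilistic CYK (Viterbi) parser. This is an illustrative Python version over a toy CNF grammar, not the library's Elixir `CYKParser`:

```python
def cyk_viterbi(words, lexical, binary):
    """Most-probable derivation score for each (span, symbol).

    lexical: {(sym, word): prob}; binary: {(sym, (left, right)): prob}, both in CNF.
    chart[(i, j)][sym] = (prob, backpointer) for the span words[i:j].
    """
    n = len(words)
    chart = {}
    # Width-1 spans come from lexical rules.
    for i, w in enumerate(words):
        chart[(i, i + 1)] = {
            sym: (p, w) for (sym, word), p in lexical.items() if word == w
        }
    # Wider spans: try every split point and every binary rule, keep the best.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = {}
            for split in range(i + 1, j):
                for (sym, (l, r)), p in binary.items():
                    if l in chart.get((i, split), {}) and r in chart.get((split, j), {}):
                        prob = p * chart[(i, split)][l][0] * chart[(split, j)][r][0]
                        if prob > cell.get(sym, (0.0,))[0]:
                            cell[sym] = (prob, (split, l, r))
            chart[(i, j)] = cell
    return chart

lexical = {("Det", "the"): 1.0, ("Noun", "cat"): 0.6}
binary = {("NP", ("Det", "Noun")): 0.35}
chart = cyk_viterbi(["the", "cat"], lexical, binary)
# chart[(0, 2)]["NP"] holds probability 0.35 * 1.0 * 0.6 plus a backpointer
```

The backpointers stored in each cell are what allow the highest-probability tree to be reconstructed after the chart is filled.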
Training
Train PCFG from Universal Dependencies treebanks or raw grammar rules:
# From raw rules
training_data = [
  {:np, [:det, :noun], 350},   # Count: 350 occurrences
  {:np, [:propn], 250},
  {:vp, [:verb, :np], 450}
]
model = PCFG.new()
{:ok, trained} = PCFG.train(model, training_data, smoothing: 0.001)
:ok = PCFG.save(trained, "priv/models/en/pcfg.model")

Prediction
Parse sentences to get probabilistic parse trees:
{:ok, model} = PCFG.load("priv/models/en/pcfg.model")
tokens = [%Token{text: "the", pos_tag: :det}, %Token{text: "cat", pos_tag: :noun}]
{:ok, parse_tree} = PCFG.predict(model, tokens)
# Parse tree contains:
# - label: :np
# - probability: 0.0245
# - children: [...]
# - span: {0, 1}

N-Best Parsing
Get multiple parse hypotheses:
{:ok, trees} = PCFG.predict(model, tokens, n_best: 5)
# Returns the top 5 parse trees sorted by probability

Evaluation
Compute bracketing precision/recall/F1:
test_data = [{tokens, gold_tree}, ...]
metrics = PCFG.evaluate(model, test_data)
# %{precision: 0.87, recall: 0.85, f1: 0.86, exact_match: 0.42}

Mix Tasks
# Train PCFG from UD treebank
mix nasty.train.pcfg \
--corpus data/en_ewt-ud-train.conllu \
--output priv/models/en/pcfg.model \
--smoothing 0.001 \
--test data/en_ewt-ud-test.conllu
# Evaluate PCFG
mix nasty.eval \
--model priv/models/en/pcfg.model \
--test data/en_ewt-ud-test.conllu \
--type pcfg
CRF (Conditional Random Fields)
What is CRF?
CRFs are discriminative models for sequence labeling that consider:
- Rich feature sets (lexical, orthographic, contextual)
- Label dependencies (transition probabilities)
- Global normalization (partition function)
Unlike HMMs, CRFs model the conditional distribution P(y|x) directly, so they can use overlapping, correlated features without making independence assumptions about the observations.
Architecture
Core Modules:
- `Nasty.Statistics.SequenceLabeling.Features` - Feature extraction
- `Nasty.Statistics.SequenceLabeling.Viterbi` - Decoding algorithm
- `Nasty.Statistics.SequenceLabeling.Optimizer` - Gradient descent training
- `Nasty.Statistics.SequenceLabeling.CRF` - Main model implementing the `Model` behaviour
Data Flow:
flowchart TD
A["Tokens + Labels (Training)"]
B["Feature Extraction"]
C["Forward-Backward (Gradient Computation)"]
D["Gradient Descent Optimization"]
E["Trained CRF Model"]
F["Viterbi Decoding"]
G["Label Sequence"]
A --> B
B --> C
C --> D
D --> E
E --> F
F --> G

Feature Extraction
CRFs use rich feature sets extracted from tokens:
Lexical Features:
- word, word_lower, lemma
Orthographic Features:
- capitalized, all_caps, word_shape (Xxxx, XXX, ddd)
- has_digit, has_hyphen, has_punctuation
POS Features:
- pos_tag
Context Features:
- prev_word, next_word
- prev_pos, next_pos
- is_first, is_last
Affix Features:
- prefix-1, prefix-2, ..., prefix-4
- suffix-1, suffix-2, ..., suffix-4
Gazetteer Features:
- in_gazetteer=person/place/org
Pattern Features:
- pattern=all_digits, pattern=year, pattern=acronym
- short_word, long_word
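A few of the features above can be sketched in Python. The feature-string format and the `<BOS>`/`<EOS>` boundary markers are illustrative assumptions; the library's Elixir `Features` module may encode them differently:

```python
import re

def word_shape(word):
    """Map characters to class symbols: 'John' -> 'Xxxx', 'NYC' -> 'XXX', '1984' -> 'dddd'."""
    return re.sub(r"[A-Z]", "X", re.sub(r"[a-z]", "x", re.sub(r"[0-9]", "d", word)))

def extract_features(tokens, i):
    """Feature strings for position i, combining lexical, orthographic,
    context, and affix cues."""
    w = tokens[i]
    feats = [
        f"word={w}",
        f"word_lower={w.lower()}",
        f"shape={word_shape(w)}",
        f"has_digit={any(c.isdigit() for c in w)}",
        f"prev_word={tokens[i - 1] if i > 0 else '<BOS>'}",
        f"next_word={tokens[i + 1] if i < len(tokens) - 1 else '<EOS>'}",
    ]
    # Prefixes and suffixes of length 1..4 (bounded by the word length)
    feats += [f"prefix{n}={w[:n]}" for n in range(1, min(4, len(w)) + 1)]
    feats += [f"suffix{n}={w[-n:]}" for n in range(1, min(4, len(w)) + 1)]
    return feats

feats = extract_features(["John", "lives", "in", "NYC"], 0)
# includes "shape=Xxxx", "prev_word=<BOS>", "suffix2=hn"
```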
Model
Linear-chain CRF:
P(y|x) = exp(score(x, y)) / Z(x)
score(x, y) = Σ feature_weights + Σ transition_weights

Where:
- `feature_weights`: Map of (feature, label) → weight
- `transition_weights`: Map of (prev_label, curr_label) → weight
- `Z(x)`: Partition function (normalization)
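The score decomposes position by position into emission and transition terms. A hedged Python sketch of that decomposition, using plain dicts for the two weight maps (the brute-force partition function is for checking only; the library's Elixir CRF uses forward-backward instead):

```python
import math
from itertools import product

def sequence_score(features_per_pos, labels, feature_w, transition_w):
    """Unnormalized log-score of one labeling: sum of emission and transition weights."""
    score = 0.0
    prev = "<START>"
    for feats, label in zip(features_per_pos, labels):
        score += sum(feature_w.get((f, label), 0.0) for f in feats)
        score += transition_w.get((prev, label), 0.0)
        prev = label
    return score

def log_partition_brute_force(features_per_pos, label_set, feature_w, transition_w):
    """log Z(x) by enumerating all label sequences (exponential; sanity checks only)."""
    return math.log(sum(
        math.exp(sequence_score(features_per_pos, seq, feature_w, transition_w))
        for seq in product(label_set, repeat=len(features_per_pos))
    ))

feats = [["word=John"], ["word=lives"]]
fw = {("word=John", "B-PER"): 2.0}
tw = {("B-PER", "O"): 0.5}
s = sequence_score(feats, ["B-PER", "O"], fw, tw)  # 2.0 emission + 0.5 transition
# P(y|x) = exp(s - log Z(x))
```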
Training
Train CRF on BIO-tagged sequences:
# BIO tagging: B-PER, I-PER, B-GPE, I-GPE, B-ORG, I-ORG, O
training_data = [
  {
    [%Token{text: "John"}, %Token{text: "Smith"}],
    [:b_per, :i_per]
  },
  ...
]
model = CRF.new(labels: [:b_per, :i_per, :b_gpe, :i_gpe, :b_org, :i_org, :o])
{:ok, trained} = CRF.train(model, training_data,
  iterations: 100,
  learning_rate: 0.1,
  regularization: 1.0
)
:ok = CRF.save(trained, "priv/models/en/crf_ner.model")

Training Options:
- `:iterations` - Maximum training iterations (default: 100)
- `:learning_rate` - Initial learning rate (default: 0.1)
- `:regularization` - L2 regularization strength (default: 1.0)
- `:method` - `:sgd`, `:momentum`, or `:adagrad` (default: `:momentum`)
- `:convergence_threshold` - Gradient norm threshold (default: 0.01)
Prediction
Label sequences using Viterbi decoding:
{:ok, model} = CRF.load("priv/models/en/crf_ner.model")
tokens = [%Token{text: "John"}, %Token{text: "lives"}, %Token{text: "in"}, %Token{text: "NYC"}]
{:ok, labels} = CRF.predict(model, tokens)
# [:b_per, :o, :o, :b_gpe]

Viterbi Algorithm
Find most likely label sequence:
- Initialize scores for first position
- For each subsequent position:
- Compute emission score (from features)
- Compute transition score (from previous label)
- Track best previous label (backpointer)
- Backtrack from best final label
Complexity:
- Time: O(n × L²) where n = sequence length, L = number of labels
- Space: O(n × L)
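The recurrence above can be sketched in log-space Python. Emission scores here come from precomputed feature sums per position; this is an illustrative decoder, not the library's Elixir `Viterbi` module:

```python
def viterbi(emission, transition, labels):
    """Best label sequence.

    emission[t][y] and transition[(prev, y)] are log-scores; missing entries are 0.0.
    delta[t][y] = best log-score of any label prefix ending in y at position t.
    """
    n = len(emission)
    delta = [{y: emission[0].get(y, 0.0) for y in labels}]
    back = []
    for t in range(1, n):
        scores, ptrs = {}, {}
        for y in labels:
            # Best previous label for y: transition + accumulated score
            best_prev = max(
                labels,
                key=lambda p: delta[t - 1][p] + transition.get((p, y), 0.0),
            )
            scores[y] = (delta[t - 1][best_prev]
                         + transition.get((best_prev, y), 0.0)
                         + emission[t].get(y, 0.0))
            ptrs[y] = best_prev  # backpointer
        delta.append(scores)
        back.append(ptrs)
    # Backtrack from the best final label
    best = max(labels, key=lambda y: delta[-1][y])
    path = [best]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return list(reversed(path))

emission = [{"B-PER": 2.0, "O": 0.1}, {"O": 1.5, "I-PER": 0.2}]
transition = {("B-PER", "I-PER"): 1.0}
labels = ["B-PER", "I-PER", "O"]
# viterbi(emission, transition, labels) -> ["B-PER", "O"]
```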
Forward-Backward Algorithm
Used during training to compute gradients:
- Forward: α_t(y) - total (unnormalized) score of all label prefixes ending in label y at position t
- Backward: β_t(y) - total score of all label suffixes following label y at position t
- Partition Function: Z(x) = Σ_y α_n(y), the sum over all label sequences
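The forward pass uses the same dynamic program as Viterbi, with log-sum-exp in place of max. A hedged Python sketch computing log Z(x) (illustrative; the library's Elixir implementation may organize this differently):

```python
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_partition(emission, transition, labels):
    """log Z(x) via the forward algorithm.

    alpha[y] accumulates (in log-space) the scores of all label prefixes
    ending in y; summing over all prefixes replaces Viterbi's max.
    """
    alpha = {y: emission[0].get(y, 0.0) for y in labels}
    for t in range(1, len(emission)):
        alpha = {
            y: emission[t].get(y, 0.0) + log_sum_exp(
                [alpha[p] + transition.get((p, y), 0.0) for p in labels]
            )
            for y in labels
        }
    return log_sum_exp(list(alpha.values()))

# Two positions, two labels, no transition weights:
# Z factorizes as (e^1 + e^0) * (e^2 + e^0)
emission = [{"A": 1.0, "B": 0.0}, {"A": 2.0, "B": 0.0}]
lz = log_partition(emission, {}, ["A", "B"])
```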
Optimization
Gradient descent with momentum:
Gradient = Observed Features - Expected Features
Weight Update: w := w + learning_rate * (gradient - regularization * w)
Momentum: v := momentum * v + gradient; w := w + learning_rate * v

Mix Tasks
# Train CRF NER
mix nasty.train.crf \
--corpus data/ner_train.conllu \
--output priv/models/en/crf_ner.model \
--task ner \
--iterations 100 \
--learning-rate 0.1 \
--regularization 1.0 \
--test data/ner_test.conllu
# Evaluate NER
mix nasty.eval \
--model priv/models/en/crf_ner.model \
--test data/ner_test.conllu \
--type crf \
--task ner
Integration with English Pipeline
Both models integrate seamlessly with the existing English module:
PCFG Integration
The PCFG parser is integrated into Nasty.Language.English.SentenceParser:
# Use PCFG parsing
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, document} = English.parse(tagged, model: :pcfg)
# Or directly with SentenceParser
alias Nasty.Language.English.SentenceParser
{:ok, sentences} = SentenceParser.parse_sentences(tokens, model: :pcfg)
# With specific model (bypasses registry lookup)
{:ok, pcfg_model} = PCFG.load("path/to/model.pcfg")
{:ok, sentences} = SentenceParser.parse_sentences(tokens,
  model: :pcfg,
  pcfg_model: pcfg_model
)
# Falls back to rule-based if :model option not specified or model unavailable
{:ok, document} = English.parse(tagged) # Uses rule-based parsing

CRF Integration
The CRF model is integrated into Nasty.Language.English.EntityRecognizer:
# Use CRF for NER
alias Nasty.Language.English.EntityRecognizer
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
# CRF-based entity recognition
entities = EntityRecognizer.recognize(tagged, model: :crf)
# With specific model (bypasses registry lookup)
{:ok, crf_model} = CRF.load("path/to/model.crf")
entities = EntityRecognizer.recognize(tagged,
  model: :crf,
  crf_model: crf_model
)
# Falls back to rule-based if :model option not specified or model unavailable
entities = EntityRecognizer.recognize(tagged) # Uses rule-based NER

Model Registry
Both integrations use Nasty.Statistics.ModelLoader to automatically load models:
# Models are loaded from the registry by task and language
# PCFG: ModelLoader.load_latest(:en, :pcfg)
# CRF: ModelLoader.load_latest(:en, :ner_crf)
# Models should be saved with proper naming:
# priv/models/en/pcfg_v1.model
# priv/models/en/ner_crf_v1.model

Performance Expectations
PCFG Parser
Accuracy:
- Bracketing F1: 85-90% on UD test sets
- Higher than rule-based parsing for ambiguous structures
Speed:
- ~50-100ms per sentence (CPU)
- Depends on sentence length and grammar size
Memory:
- ~50-100MB model file
- O(n²) space during parsing
CRF-based NER
Accuracy:
- Entity-level F1: 92-95% (vs 70-80% rule-based)
- Proper boundary detection
- Better handling of unseen entities
Speed:
- ~20-30ms per sentence (CPU)
- Linear in sequence length
Memory:
- ~20-50MB model file (depends on feature set)
- O(n) space during decoding
Comparison with Other Approaches
PCFG vs Rule-based Parsing
| Aspect | PCFG | Rule-based |
|---|---|---|
| Ambiguity | Probabilistic resolution | Greedy heuristics |
| Unknown structures | Graceful degradation | May fail |
| Training | Requires treebank | None needed |
| Speed | Slower (O(n³)) | Faster (O(n)) |
| Accuracy | Higher on complex sentences | Good for simple sentences |
CRF vs Rule-based NER
| Aspect | CRF | Rule-based |
|---|---|---|
| Context | Global sequence context | Local patterns |
| Features | Rich feature sets | Limited to POS + patterns |
| Boundaries | Learned from data | Heuristic rules |
| Training | Requires annotated data | None needed |
| Unseen entities | Better generalization | Pattern matching only |
| Accuracy | 92-95% F1 | 70-80% F1 |
Best Practices
For PCFG
- Training Data: Use high-quality treebanks (UD, Penn Treebank)
- Smoothing: Use add-k smoothing (k=0.001) for unseen rules
- CNF Conversion: Always convert to CNF before parsing
- Beam Search: Use beam width 10-20 for efficiency
- Evaluation: Report bracketing F1, not just accuracy
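The binarization half of CNF conversion splits n-ary rules into chains of binary rules. A hedged Python sketch (the `LHS|rest` naming for intermediate symbols is one common convention, not necessarily the one the Elixir `Grammar` module uses):

```python
def binarize(rules):
    """Split n-ary rules: A -> B C D [p] becomes A -> B A|C_D [p], A|C_D -> C D [1.0].

    Probability mass stays on the first rule of each chain; the deterministic
    chain rules get probability 1.0, so the tree's total probability is unchanged.
    """
    out = []
    for lhs, rhs, prob in rules:
        while len(rhs) > 2:
            new_sym = f"{lhs}|{'_'.join(rhs[1:])}"
            out.append((lhs, (rhs[0], new_sym), prob))
            lhs, rhs, prob = new_sym, rhs[1:], 1.0
        out.append((lhs, tuple(rhs), prob))
    return out

rules = [("VP", ("Verb", "NP", "PP"), 0.2)]
binarize(rules)
# [("VP", ("Verb", "VP|NP_PP"), 0.2), ("VP|NP_PP", ("NP", "PP"), 1.0)]
```

Full CNF conversion also removes unary and empty rules; binarization alone is what makes the O(n³) CYK recurrence applicable.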
For CRF
- Features: Start with full feature set, prune if needed
- Regularization: Use L2 (λ=1.0) to prevent overfitting
- Learning Rate: Start with 0.1, decay if not converging
- BIO Tagging: Always use BIO scheme for proper boundaries
- Gazetteers: Include domain-specific entity lists
- Iterations: 100-200 iterations usually sufficient
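BIO consistency matters because entity spans are recovered by grouping adjacent tags. A sketch of that span extraction (illustrative Python; tag strings follow the B-/I-/O scheme above, and treating a stray I- tag as a span start is one common repair for inconsistent sequences):

```python
def bio_to_spans(labels):
    """Group BIO tags into (entity_type, start, end_exclusive) spans.

    An I- tag whose type doesn't continue the current span starts a new one.
    """
    spans, start, etype = [], None, None
    for i, label in enumerate(labels):
        if label.startswith("B-") or (label.startswith("I-") and etype != label[2:]):
            if etype is not None:
                spans.append((etype, start, i))  # close the previous span
            start, etype = i, label[2:]
        elif label == "O":
            if etype is not None:
                spans.append((etype, start, i))
            start, etype = None, None
    if etype is not None:
        spans.append((etype, start, len(labels)))  # span running to the end
    return spans

bio_to_spans(["B-PER", "I-PER", "O", "O", "B-GPE"])
# [("PER", 0, 2), ("GPE", 4, 5)]
```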
Troubleshooting
PCFG Issues
Problem: Parse fails (returns :error)
- Solution: Check if all words have lexical rules; add unknown word handling
Problem: Low parsing F1
- Solution: Increase training data; adjust smoothing; check CNF conversion
Problem: Slow parsing
- Solution: Reduce beam width; prune low-probability rules
CRF Issues
Problem: Training doesn't converge
- Solution: Reduce learning rate; increase regularization; check gradient computation
Problem: Low NER F1
- Solution: Add more features; increase training data; check BIO tagging consistency
Problem: Slow training
- Solution: Reduce feature set; use AdaGrad; parallelize if possible
Future Enhancements
PCFG
- Lexicalized PCFG (head-driven)
- Latent variable PCFG
- Neural PCFG with embeddings
- Dependency conversion from CFG parse
CRF
- Higher-order CRF (beyond linear-chain)
- Semi-Markov CRF for multi-token entities
- Structured perceptron as alternative
- Neural CRF with BiLSTM features
References
PCFG
- Charniak, E. (1997). Statistical Parsing with a Context-Free Grammar and Word Statistics
- Klein & Manning (2003). Accurate Unlexicalized Parsing
- Petrov et al. (2006). Learning Accurate, Compact, and Interpretable Tree Annotation
CRF
- Lafferty et al. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
- Sutton & McCallum (2012). An Introduction to Conditional Random Fields
- Tjong Kim Sang & De Meulder (2003). Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
Related Documentation
- ROADMAP.md - Development roadmap and priorities
- NEURAL_MODELS.md - Neural network architectures (BiLSTM-CRF)
- TRAINING_NEURAL.md - Neural model training guide
- PARSING_GUIDE.md - Comprehensive parsing documentation
- WARP.md - Command reference for training and evaluation