Statistical Models Guide


This document provides a comprehensive guide to the advanced statistical models implemented in Nasty: PCFG (Probabilistic Context-Free Grammar) for parsing and CRF (Conditional Random Fields) for sequence labeling/NER.

Overview

Nasty implements two major classes of statistical models:

  1. PCFG Parser - For probabilistic phrase structure parsing with ambiguity resolution
  2. CRF-based NER - For context-aware named entity recognition using sequence labeling

Both models follow the Nasty.Statistics.Model behaviour, providing consistent interfaces for training, prediction, and persistence.

PCFG (Probabilistic Context-Free Grammar)

What is PCFG?

PCFG extends traditional context-free grammars with probabilities on production rules. This allows the parser to:

  • Resolve syntactic ambiguities probabilistically
  • Score different parse trees and select the most likely one
  • Handle rare constructions gracefully through smoothing

Architecture

Core Modules:

Data Flow:

flowchart TD
    A["Training Data (Treebank)"]
    B["Extract Grammar Rules + Probabilities"]
    C["CNF Conversion"]
    D["Trained PCFG Model"]
    E["CYK Parser (Viterbi)"]
    F["Parse Tree with Probability"]
    
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F

Grammar Rules

PCFG uses production rules with probabilities:

NP → Det Noun     [0.35]
NP → PropN        [0.25]
VP → Verb NP      [0.45]

For each left-hand side, the probabilities of all its rules sum to 1.0 (the NP rules shown above are only a subset of the full grammar).
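Rule probabilities are recovered from counts by relative frequency. A minimal Python sketch (illustrative names, not Nasty's API), including optional add-k smoothing of the kind used in training below:

```python
from collections import defaultdict

def estimate_rule_probs(rule_counts, k=0.0):
    """Relative-frequency estimates P(lhs -> rhs) = count / total(lhs),
    with optional add-k smoothing over the observed rules for each lhs."""
    by_lhs = defaultdict(list)
    for (lhs, rhs), count in rule_counts.items():
        by_lhs[lhs].append((rhs, count))
    probs = {}
    for lhs, rules in by_lhs.items():
        total = sum(c for _, c in rules) + k * len(rules)
        for rhs, c in rules:
            probs[(lhs, rhs)] = (c + k) / total
    return probs

counts = {
    ("NP", ("Det", "Noun")): 350,
    ("NP", ("PropN",)): 250,
    ("VP", ("Verb", "NP")): 450,
}
probs = estimate_rule_probs(counts)
# The two NP rules now sum to 1.0: 350/600 + 250/600
```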

CYK Algorithm

The Cocke-Younger-Kasami algorithm:

  1. Requires grammar in Chomsky Normal Form (CNF)
  2. Uses dynamic programming (O(n³) complexity)
  3. Fills a chart bottom-up
  4. Extracts highest probability parse tree

Complexity:

  • Time: O(n³ × |G|) where n = sentence length, |G| = grammar size
  • Space: O(n² × |G|)
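The chart-filling steps above can be sketched compactly. A minimal Python version of probabilistic CYK with Viterbi (max) scoring over a toy CNF grammar; this is an illustration, not Nasty's implementation:

```python
import math

def cyk_viterbi(words, lexical, binary):
    """Probabilistic CYK over a CNF grammar.
    lexical: {(A, word): prob}; binary: {(A, B, C): prob} for rules A -> B C.
    chart[i][j] maps each nonterminal to the best log probability for words[i:j];
    back[i][j] stores backpointers for tree extraction."""
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    back = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    # Base case: lexical rules fill the diagonal (spans of length 1).
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                chart[i][i + 1][A] = math.log(p)
                back[i][i + 1][A] = w
    # Bottom-up over increasing span lengths; try every split point m.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for m in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    if B in chart[i][m] and C in chart[m][j]:
                        score = math.log(p) + chart[i][m][B] + chart[m][j][C]
                        if score > chart[i][j].get(A, -math.inf):
                            chart[i][j][A] = score
                            back[i][j][A] = (m, B, C)
    return chart, back

lexical = {("Det", "the"): 1.0, ("Noun", "cat"): 1.0}
binary = {("NP", "Det", "Noun"): 0.35}
chart, back = cyk_viterbi(["the", "cat"], lexical, binary)
# chart[0][2]["NP"] holds log(0.35)
```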

Training

Train PCFG from Universal Dependencies treebanks or raw grammar rules:

# From raw rules
training_data = [
  {:np, [:det, :noun], 350},  # Count: 350 occurrences
  {:np, [:propn], 250},
  {:vp, [:verb, :np], 450}
]

model = PCFG.new()
{:ok, trained} = PCFG.train(model, training_data, smoothing: 0.001)
:ok = PCFG.save(trained, "priv/models/en/pcfg.model")

Prediction

Parse sentences to get probabilistic parse trees:

{:ok, model} = PCFG.load("priv/models/en/pcfg.model")
tokens = [%Token{text: "the", pos_tag: :det}, %Token{text: "cat", pos_tag: :noun}]
{:ok, parse_tree} = PCFG.predict(model, tokens)

# Parse tree contains:
# - label: :np
# - probability: 0.0245
# - children: [...]
# - span: {0, 1}

N-Best Parsing

Get multiple parse hypotheses:

{:ok, trees} = PCFG.predict(model, tokens, n_best: 5)
# Returns top 5 parse trees sorted by probability

Evaluation

Compute bracketing precision/recall/F1:

test_data = [{tokens, gold_tree}, ...]
metrics = PCFG.evaluate(model, test_data)
# %{precision: 0.87, recall: 0.85, f1: 0.86, exact_match: 0.42}
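Bracketing scores compare the labeled spans of predicted and gold trees. A minimal sketch of the metric, assuming spans are represented as (label, start, end) triples:

```python
def bracket_scores(gold_spans, pred_spans):
    """Labeled bracketing precision/recall/F1 over (label, start, end) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

gold = [("NP", 0, 2), ("VP", 2, 5), ("S", 0, 5)]
pred = [("NP", 0, 2), ("VP", 3, 5), ("S", 0, 5)]  # one wrong VP boundary
scores = bracket_scores(gold, pred)
# precision = recall = 2/3
```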

Mix Tasks

# Train PCFG from UD treebank
mix nasty.train.pcfg \
  --corpus data/en_ewt-ud-train.conllu \
  --output priv/models/en/pcfg.model \
  --smoothing 0.001 \
  --test data/en_ewt-ud-test.conllu

# Evaluate PCFG
mix nasty.eval \
  --model priv/models/en/pcfg.model \
  --test data/en_ewt-ud-test.conllu \
  --type pcfg

CRF (Conditional Random Fields)

What is CRF?

CRFs are discriminative models for sequence labeling that consider:

  • Rich feature sets (lexical, orthographic, contextual)
  • Label dependencies (transition probabilities)
  • Global normalization (partition function)

Unlike HMMs, CRFs model P(y|x) directly, so they can use overlapping, non-independent features without the observation-independence assumptions a generative model requires.

Architecture

Core Modules:

Data Flow:

flowchart TD
    A["Tokens + Labels (Training)"]
    B["Feature Extraction"]
    C["Forward-Backward (Gradient Computation)"]
    D["Gradient Descent Optimization"]
    E["Trained CRF Model"]
    F["Viterbi Decoding"]
    G["Label Sequence"]
    
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G

Feature Extraction

CRFs use rich feature sets extracted from tokens:

Lexical Features:

  • word, word_lower, lemma

Orthographic Features:

  • capitalized, all_caps, word_shape (Xxxx, XXX, ddd)
  • has_digit, has_hyphen, has_punctuation

POS Features:

  • pos_tag

Context Features:

  • prev_word, next_word
  • prev_pos, next_pos
  • is_first, is_last

Affix Features:

  • prefix-1, prefix-2, ..., prefix-4
  • suffix-1, suffix-2, ..., suffix-4

Gazetteer Features:

  • in_gazetteer=person/place/org

Pattern Features:

  • pattern=all_digits, pattern=year, pattern=acronym
  • short_word, long_word
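Several of the feature families above (shapes, affixes, context) are straightforward to compute. A minimal Python illustration; the names are illustrative, not Nasty's feature templates:

```python
import re

def word_shape(word):
    """Collapse characters into a shape: uppercase -> X, lowercase -> x, digit -> d."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"\d", "d", shape)

def token_features(words, i):
    """Lexical, orthographic, context, and affix features for position i."""
    w = words[i]
    feats = {
        "word": w,
        "word_lower": w.lower(),
        "capitalized": w[:1].isupper(),
        "all_caps": w.isupper(),
        "has_digit": any(c.isdigit() for c in w),
        "shape": word_shape(w),
        "is_first": i == 0,
        "is_last": i == len(words) - 1,
        "prev_word": words[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": words[i + 1].lower() if i < len(words) - 1 else "<EOS>",
    }
    for n in range(1, 5):  # prefix-1..4, suffix-1..4
        feats[f"prefix-{n}"] = w[:n]
        feats[f"suffix-{n}"] = w[-n:]
    return feats

feats = token_features(["John", "lives", "in", "NYC"], 0)
# feats["shape"] == "Xxxx"; feats["prev_word"] == "<BOS>"
```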

Model

Linear-chain CRF:

P(y|x) = exp(score(x, y)) / Z(x)

score(x, y) = Σ_t [ Σ_f feature_weights[(f, y_t)] + transition_weights[(y_{t-1}, y_t)] ]

Where:

  • feature_weights: Map of (feature, label) → weight
  • transition_weights: Map of (prev_label, curr_label) → weight
  • Z(x): Partition function (normalization)
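For a tiny input, the globally normalized probability can be checked by brute force. A sketch with illustrative names; Z(x) is enumerated exactly here, which is only feasible for toy examples (the real model uses forward-backward):

```python
import math
from itertools import product

def sequence_score(feats_per_pos, labels, feat_w, trans_w):
    """score(x, y): sum of feature weights plus transition weights along the chain."""
    s = 0.0
    for t, (feats, y) in enumerate(zip(feats_per_pos, labels)):
        s += sum(feat_w.get((f, y), 0.0) for f in feats)
        if t > 0:
            s += trans_w.get((labels[t - 1], y), 0.0)
    return s

def prob(feats_per_pos, labels, label_set, feat_w, trans_w):
    """P(y | x) with Z(x) computed by exhaustive enumeration."""
    num = math.exp(sequence_score(feats_per_pos, labels, feat_w, trans_w))
    Z = sum(math.exp(sequence_score(feats_per_pos, list(y), feat_w, trans_w))
            for y in product(label_set, repeat=len(feats_per_pos)))
    return num / Z

feats_per_pos = [["word=john", "capitalized"], ["word=lives"]]
feat_w = {("capitalized", "b_per"): 1.5, ("word=lives", "o"): 0.8}
trans_w = {("b_per", "o"): 0.5}
label_set = ("b_per", "o")
total = sum(prob(feats_per_pos, list(y), label_set, feat_w, trans_w)
            for y in product(label_set, repeat=2))
# total == 1.0 up to floating point: the distribution is globally normalized
```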

Training

Train CRF on BIO-tagged sequences:

# BIO tagging: B-PER, I-PER, B-GPE, I-GPE, B-ORG, I-ORG, O
training_data = [
  {
    [%Token{text: "John"}, %Token{text: "Smith"}],
    [:b_per, :i_per]
  },
  ...
]

model = CRF.new(labels: [:b_per, :i_per, :b_gpe, :i_gpe, :b_org, :i_org, :o])
{:ok, trained} = CRF.train(model, training_data,
  iterations: 100,
  learning_rate: 0.1,
  regularization: 1.0
)
:ok = CRF.save(trained, "priv/models/en/crf_ner.model")

Training Options:

  • :iterations - Maximum iterations (default: 100)
  • :learning_rate - Initial learning rate (default: 0.1)
  • :regularization - L2 regularization (default: 1.0)
  • :method - :sgd, :momentum, :adagrad (default: :momentum)
  • :convergence_threshold - Gradient norm threshold (default: 0.01)

Prediction

Label sequences using Viterbi decoding:

{:ok, model} = CRF.load("priv/models/en/crf_ner.model")
tokens = [%Token{text: "John"}, %Token{text: "lives"}, %Token{text: "in"}, %Token{text: "NYC"}]
{:ok, labels} = CRF.predict(model, tokens)
# [:b_per, :o, :o, :b_gpe]
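The BIO labels returned above are typically collapsed into entity spans downstream. A minimal sketch (labels written as strings rather than Elixir atoms; not Nasty's API):

```python
def bio_to_entities(labels):
    """Collapse a BIO label sequence into (entity_type, start, end) spans,
    end exclusive. 'b_per'/'i_per' mirror the :b_per/:i_per atoms in the docs."""
    entities, start, etype = [], None, None
    for i, label in enumerate(labels + ["o"]):  # sentinel flushes a final entity
        tag, _, kind = label.partition("_")
        if tag != "i" or kind != etype:
            if etype is not None:
                entities.append((etype, start, i))
                etype = None
            if tag == "b":
                start, etype = i, kind
    return entities

spans = bio_to_entities(["b_per", "o", "o", "b_gpe"])
# -> [("per", 0, 1), ("gpe", 3, 4)]
```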

Viterbi Algorithm

Find most likely label sequence:

  1. Initialize scores for first position
  2. For each subsequent position:
    • Compute emission score (from features)
    • Compute transition score (from previous label)
    • Track best previous label (backpointer)
  3. Backtrack from best final label

Complexity:

  • Time: O(n × L²) where n = sequence length, L = number of labels
  • Space: O(n × L)
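The three steps above can be written out directly. A minimal Python Viterbi over precomputed emission scores (illustrative, not Nasty's implementation):

```python
def viterbi(emissions, trans_w, labels):
    """emissions: one {label: score} dict per position; trans_w: {(prev, cur): score}.
    Returns the highest-scoring label sequence in O(n * L^2) time."""
    n = len(emissions)
    score = [{y: emissions[0].get(y, 0.0) for y in labels}]  # step 1: init
    back = [{}]
    for t in range(1, n):  # step 2: extend, tracking backpointers
        score.append({})
        back.append({})
        for y in labels:
            best_prev = max(labels,
                            key=lambda p: score[t - 1][p] + trans_w.get((p, y), 0.0))
            score[t][y] = (score[t - 1][best_prev]
                           + trans_w.get((best_prev, y), 0.0)
                           + emissions[t].get(y, 0.0))
            back[t][y] = best_prev
    last = max(labels, key=lambda y: score[n - 1][y])  # step 3: backtrack
    path = [last]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

emissions = [{"b_per": 2.0, "o": 0.5}, {"i_per": 1.0, "o": 0.9}]
trans_w = {("b_per", "i_per"): 1.0}
path = viterbi(emissions, trans_w, ["b_per", "i_per", "o"])
# -> ["b_per", "i_per"]: the transition weight tips the second position to i_per
```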

Forward-Backward Algorithm

Used during training to compute gradients:

  • Forward (α): total unnormalized score of all partial label sequences ending in a given label at position t

  • Backward (β): total unnormalized score of all partial label sequences continuing from a given label at position t

  • Partition Function: Z(x) = Σ_y exp(score(x, y)), summed over all label sequences; label marginals are proportional to α × β
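The forward recursion yields Z(x) in O(n × L²) instead of enumerating all Lⁿ sequences; for toy inputs the two agree exactly. A sketch (numerically naive exp/log; a real implementation would use log-sum-exp throughout):

```python
import math
from itertools import product

def log_z_forward(emissions, trans_w, labels):
    """log Z(x) via the forward recursion: alpha[y] accumulates the log of the
    summed scores of all label prefixes ending in label y at the current position."""
    alpha = {y: emissions[0].get(y, 0.0) for y in labels}
    for t in range(1, len(emissions)):
        alpha = {
            y: emissions[t].get(y, 0.0) + math.log(sum(
                math.exp(alpha[p] + trans_w.get((p, y), 0.0)) for p in labels))
            for y in labels
        }
    return math.log(sum(math.exp(v) for v in alpha.values()))

emissions = [{"a": 1.0, "b": 0.5}, {"a": 0.2, "b": 0.7}]
trans_w = {("a", "b"): 0.3}
labels = ("a", "b")
lz = log_z_forward(emissions, trans_w, labels)

# Brute-force check: sum exp(score) over every label sequence.
brute = math.log(sum(
    math.exp(sum(emissions[t].get(y, 0.0) for t, y in enumerate(seq))
             + sum(trans_w.get((seq[t - 1], seq[t]), 0.0)
                   for t in range(1, len(seq))))
    for seq in product(labels, repeat=len(emissions))))
# lz == brute
```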

Optimization

Gradient descent with momentum:

Gradient (of the log-likelihood) = Observed Features - Expected Features
Momentum: v := momentum * v + gradient - regularization * w
Weight Update: w := w + learning_rate * v
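One update step over sparse weight maps can be sketched as follows (a hypothetical helper, not Nasty's optimizer; the gradient is the ascent direction, observed minus expected, with L2 regularization pulling weights toward zero):

```python
def momentum_step(w, v, grad, lr=0.1, momentum=0.9, reg=1.0):
    """One SGD-with-momentum step on sparse {feature: weight} maps."""
    keys = set(w) | set(grad) | set(v)
    v_new = {k: momentum * v.get(k, 0.0) + grad.get(k, 0.0) - reg * w.get(k, 0.0)
             for k in keys}
    w_new = {k: w.get(k, 0.0) + lr * v_new[k] for k in keys}
    return w_new, v_new

w, v = {"capitalized|b_per": 0.0}, {}
grad = {"capitalized|b_per": 1.0}  # observed minus expected for this feature
w, v = momentum_step(w, v, grad)
# weight moves up by lr * gradient = 0.1 on the first step
```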

Mix Tasks

# Train CRF NER
mix nasty.train.crf \
  --corpus data/ner_train.conllu \
  --output priv/models/en/crf_ner.model \
  --task ner \
  --iterations 100 \
  --learning-rate 0.1 \
  --regularization 1.0 \
  --test data/ner_test.conllu

# Evaluate NER
mix nasty.eval \
  --model priv/models/en/crf_ner.model \
  --test data/ner_test.conllu \
  --type crf \
  --task ner

Integration with English Pipeline

Both models integrate seamlessly with the existing English module:

PCFG Integration

The PCFG parser is integrated into Nasty.Language.English.SentenceParser:

# Use PCFG parsing
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, document} = English.parse(tagged, model: :pcfg)

# Or directly with SentenceParser
alias Nasty.Language.English.SentenceParser
{:ok, sentences} = SentenceParser.parse_sentences(tokens, model: :pcfg)

# With specific model (bypasses registry lookup)
{:ok, pcfg_model} = PCFG.load("path/to/model.pcfg")
{:ok, sentences} = SentenceParser.parse_sentences(tokens, 
  model: :pcfg, 
  pcfg_model: pcfg_model
)

# Falls back to rule-based if :model option not specified or model unavailable
{:ok, document} = English.parse(tagged)  # Uses rule-based parsing

CRF Integration

The CRF model is integrated into Nasty.Language.English.EntityRecognizer:

# Use CRF for NER
alias Nasty.Language.English.EntityRecognizer

{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)

# CRF-based entity recognition
entities = EntityRecognizer.recognize(tagged, model: :crf)

# With specific model (bypasses registry lookup)
{:ok, crf_model} = CRF.load("path/to/model.crf")
entities = EntityRecognizer.recognize(tagged, 
  model: :crf,
  crf_model: crf_model
)

# Falls back to rule-based if :model option not specified or model unavailable
entities = EntityRecognizer.recognize(tagged)  # Uses rule-based NER

Model Registry

Both integrations use Nasty.Statistics.ModelLoader to automatically load models:

# Models are loaded from the registry by task and language
# PCFG: ModelLoader.load_latest(:en, :pcfg)
# CRF:  ModelLoader.load_latest(:en, :ner_crf)

# Models should be saved with proper naming:
# priv/models/en/pcfg_v1.model
# priv/models/en/ner_crf_v1.model

Performance Expectations

PCFG Parser

Accuracy:

  • Bracketing F1: 85-90% on UD test sets
  • Higher than rule-based parsing for ambiguous structures

Speed:

  • ~50-100ms per sentence (CPU)
  • Depends on sentence length and grammar size

Memory:

  • ~50-100MB model file
  • O(n²) space during parsing

CRF-based NER

Accuracy:

  • Entity-level F1: 92-95% (vs 70-80% rule-based)
  • Proper boundary detection
  • Better handling of unseen entities

Speed:

  • ~20-30ms per sentence (CPU)
  • Linear in sequence length

Memory:

  • ~20-50MB model file (depends on feature set)
  • O(n) space during decoding

Comparison with Other Approaches

PCFG vs Rule-based Parsing

Aspect             | PCFG                        | Rule-based
-------------------|-----------------------------|---------------------------
Ambiguity          | Probabilistic resolution    | Greedy heuristics
Unknown structures | Graceful degradation        | May fail
Training           | Requires treebank           | None needed
Speed              | Slower (O(n³))              | Faster (O(n))
Accuracy           | Higher on complex sentences | Good for simple sentences

CRF vs Rule-based NER

Aspect          | CRF                      | Rule-based
----------------|--------------------------|---------------------------
Context         | Global sequence context  | Local patterns
Features        | Rich feature sets        | Limited to POS + patterns
Boundaries      | Learned from data        | Heuristic rules
Training        | Requires annotated data  | None needed
Unseen entities | Better generalization    | Pattern matching only
Accuracy        | 92-95% F1                | 70-80% F1

Best Practices

For PCFG

  1. Training Data: Use high-quality treebanks (UD, Penn Treebank)
  2. Smoothing: Use add-k smoothing (k=0.001) for unseen rules
  3. CNF Conversion: Always convert to CNF before parsing
  4. Beam Search: Use beam width 10-20 for efficiency
  5. Evaluation: Report bracketing F1, not just accuracy

For CRF

  1. Features: Start with full feature set, prune if needed
  2. Regularization: Use L2 (λ=1.0) to prevent overfitting
  3. Learning Rate: Start with 0.1, decay if not converging
  4. BIO Tagging: Always use BIO scheme for proper boundaries
  5. Gazetteers: Include domain-specific entity lists
  6. Iterations: 100-200 iterations usually sufficient

Troubleshooting

PCFG Issues

Problem: Parse fails (returns :error)

  • Solution: Check if all words have lexical rules; add unknown word handling

Problem: Low parsing F1

  • Solution: Increase training data; adjust smoothing; check CNF conversion

Problem: Slow parsing

  • Solution: Reduce beam width; prune low-probability rules

CRF Issues

Problem: Training doesn't converge

  • Solution: Reduce learning rate; increase regularization; check gradient computation

Problem: Low NER F1

  • Solution: Add more features; increase training data; check BIO tagging consistency

Problem: Slow training

  • Solution: Reduce feature set; use AdaGrad; parallelize if possible

Future Enhancements

PCFG

  • Lexicalized PCFG (head-driven)
  • Latent variable PCFG
  • Neural PCFG with embeddings
  • Dependency conversion from CFG parse

CRF

  • Higher-order CRF (beyond linear-chain)
  • Semi-Markov CRF for multi-token entities
  • Structured perceptron as alternative
  • Neural CRF with BiLSTM features

References

PCFG

  • Charniak, E. (1997). Statistical Parsing with a Context-Free Grammar and Word Statistics
  • Klein & Manning (2003). Accurate Unlexicalized Parsing
  • Petrov et al. (2006). Learning Accurate, Compact, and Interpretable Tree Annotation

CRF

  • Lafferty et al. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
  • Sutton & McCallum (2012). An Introduction to Conditional Random Fields
  • Tjong Kim Sang & De Meulder (2003). Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition