Statistical Models Guide


This document provides a comprehensive guide to the advanced statistical models implemented in Nasty: PCFG (Probabilistic Context-Free Grammar) for parsing and CRF (Conditional Random Fields) for sequence labeling/NER.

Overview

Nasty implements two major classes of statistical models:

  1. PCFG Parser - For probabilistic phrase structure parsing with ambiguity resolution
  2. CRF-based NER - For context-aware named entity recognition using sequence labeling

Both models follow the Nasty.Statistics.Model behaviour, providing consistent interfaces for training, prediction, and persistence.

PCFG (Probabilistic Context-Free Grammar)

What is PCFG?

PCFG extends traditional context-free grammars with probabilities on production rules. This allows the parser to:

  • Resolve syntactic ambiguities probabilistically
  • Score different parse trees and select the most likely one
  • Handle rare constructions gracefully through smoothing

Architecture

Core Modules:

Data Flow:

flowchart TD
    A["Training Data (Treebank)"]
    B["Extract Grammar Rules + Probabilities"]
    C["CNF Conversion"]
    D["Trained PCFG Model"]
    E["CYK Parser (Viterbi)"]
    F["Parse Tree with Probability"]
    
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F

Grammar Rules

PCFG uses production rules with probabilities:

NP → Det Noun     [0.35]
NP → PropN        [0.25]
VP → Verb NP      [0.45]

For each left-hand side, the probabilities of all its rules sum to 1.0 (the NP rules shown above are only a subset of the full grammar).
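Rule probabilities are recovered from counts by relative frequency. A minimal Python sketch (illustrative names, not Nasty's API), including optional add-k smoothing of the kind used in training below:

```python
from collections import defaultdict

def estimate_rule_probs(rule_counts, k=0.0):
    """Relative-frequency estimates P(lhs -> rhs) = count / total(lhs),
    with optional add-k smoothing over the observed rules for each lhs."""
    by_lhs = defaultdict(list)
    for (lhs, rhs), count in rule_counts.items():
        by_lhs[lhs].append((rhs, count))
    probs = {}
    for lhs, rules in by_lhs.items():
        total = sum(c for _, c in rules) + k * len(rules)
        for rhs, c in rules:
            probs[(lhs, rhs)] = (c + k) / total
    return probs

counts = {
    ("NP", ("Det", "Noun")): 350,
    ("NP", ("PropN",)): 250,
    ("VP", ("Verb", "NP")): 450,
}
probs = estimate_rule_probs(counts)
# The two NP rules now sum to 1.0: 350/600 + 250/600
```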

CYK Algorithm

The Cocke-Younger-Kasami algorithm:

  1. Requires grammar in Chomsky Normal Form (CNF)
  2. Uses dynamic programming (O(n³) complexity)
  3. Fills a chart bottom-up
  4. Extracts highest probability parse tree

Complexity:

  • Time: O(n³ × |G|) where n = sentence length, |G| = grammar size
  • Space: O(n² × |G|)
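The chart-filling steps above can be sketched compactly. A minimal Python version of probabilistic CYK with Viterbi (max) scoring over a toy CNF grammar; this is an illustration, not Nasty's implementation:

```python
import math

def cyk_viterbi(words, lexical, binary):
    """Probabilistic CYK over a CNF grammar.
    lexical: {(A, word): prob}; binary: {(A, B, C): prob} for rules A -> B C.
    chart[i][j] maps each nonterminal to the best log probability for words[i:j];
    back[i][j] stores backpointers for tree extraction."""
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    back = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    # Base case: lexical rules fill the diagonal (spans of length 1).
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                chart[i][i + 1][A] = math.log(p)
                back[i][i + 1][A] = w
    # Bottom-up over increasing span lengths; try every split point m.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for m in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    if B in chart[i][m] and C in chart[m][j]:
                        score = math.log(p) + chart[i][m][B] + chart[m][j][C]
                        if score > chart[i][j].get(A, -math.inf):
                            chart[i][j][A] = score
                            back[i][j][A] = (m, B, C)
    return chart, back

lexical = {("Det", "the"): 1.0, ("Noun", "cat"): 1.0}
binary = {("NP", "Det", "Noun"): 0.35}
chart, back = cyk_viterbi(["the", "cat"], lexical, binary)
# chart[0][2]["NP"] holds log(0.35)
```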

Training

Train PCFG from Universal Dependencies treebanks or raw grammar rules:

# From raw rules
training_data = [
  {:np, [:det, :noun], 350},  # Count: 350 occurrences
  {:np, [:propn], 250},
  {:vp, [:verb, :np], 450}
]

model = PCFG.new()
{:ok, trained} = PCFG.train(model, training_data, smoothing: 0.001)
:ok = PCFG.save(trained, "priv/models/en/pcfg.model")

Prediction

Parse sentences to get probabilistic parse trees:

{:ok, model} = PCFG.load("priv/models/en/pcfg.model")
tokens = [%Token{text: "the", pos_tag: :det}, %Token{text: "cat", pos_tag: :noun}]
{:ok, parse_tree} = PCFG.predict(model, tokens)

# Parse tree contains:
# - label: :np
# - probability: 0.0245
# - children: [...]
# - span: {0, 1}

N-Best Parsing

Get multiple parse hypotheses:

{:ok, trees} = PCFG.predict(model, tokens, n_best: 5)
# Returns top 5 parse trees sorted by probability

Evaluation

Compute bracketing precision/recall/F1:

test_data = [{tokens, gold_tree}, ...]
metrics = PCFG.evaluate(model, test_data)
# %{precision: 0.87, recall: 0.85, f1: 0.86, exact_match: 0.42}
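Bracketing scores compare the labeled spans of predicted and gold trees. A minimal sketch of the metric, assuming spans are represented as (label, start, end) triples:

```python
def bracket_scores(gold_spans, pred_spans):
    """Labeled bracketing precision/recall/F1 over (label, start, end) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

gold = [("NP", 0, 2), ("VP", 2, 5), ("S", 0, 5)]
pred = [("NP", 0, 2), ("VP", 3, 5), ("S", 0, 5)]  # one wrong VP boundary
scores = bracket_scores(gold, pred)
# precision = recall = 2/3
```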

Mix Tasks

# Train PCFG from UD treebank
mix nasty.train.pcfg \
  --corpus data/en_ewt-ud-train.conllu \
  --output priv/models/en/pcfg.model \
  --smoothing 0.001 \
  --test data/en_ewt-ud-test.conllu

# Evaluate PCFG
mix nasty.eval \
  --model priv/models/en/pcfg.model \
  --test data/en_ewt-ud-test.conllu \
  --type pcfg

CRF (Conditional Random Fields)

What is CRF?

CRFs are discriminative models for sequence labeling that consider:

  • Rich feature sets (lexical, orthographic, contextual)
  • Label dependencies (transition probabilities)
  • Global normalization (partition function)

Unlike HMMs, CRFs model P(y|x) directly, so they can use overlapping, non-independent features without the observation-independence assumptions a generative model requires.

Architecture

Core Modules:

Data Flow:

flowchart TD
    A["Tokens + Labels (Training)"]
    B["Feature Extraction"]
    C["Forward-Backward (Gradient Computation)"]
    D["Gradient Descent Optimization"]
    E["Trained CRF Model"]
    F["Viterbi Decoding"]
    G["Label Sequence"]
    
    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
    F --> G

Feature Extraction

CRFs use rich feature sets extracted from tokens:

Lexical Features:

  • word, word_lower, lemma

Orthographic Features:

  • capitalized, all_caps, word_shape (Xxxx, XXX, ddd)
  • has_digit, has_hyphen, has_punctuation

POS Features:

  • pos_tag

Context Features:

  • prev_word, next_word
  • prev_pos, next_pos
  • is_first, is_last

Affix Features:

  • prefix-1, prefix-2, ..., prefix-4
  • suffix-1, suffix-2, ..., suffix-4

Gazetteer Features:

  • in_gazetteer=person/place/org

Pattern Features:

  • pattern=all_digits, pattern=year, pattern=acronym
  • short_word, long_word
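Several of the feature families above (shapes, affixes, context) are straightforward to compute. A minimal Python illustration; the names are illustrative, not Nasty's feature templates:

```python
import re

def word_shape(word):
    """Collapse characters into a shape: uppercase -> X, lowercase -> x, digit -> d."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"\d", "d", shape)

def token_features(words, i):
    """Lexical, orthographic, context, and affix features for position i."""
    w = words[i]
    feats = {
        "word": w,
        "word_lower": w.lower(),
        "capitalized": w[:1].isupper(),
        "all_caps": w.isupper(),
        "has_digit": any(c.isdigit() for c in w),
        "shape": word_shape(w),
        "is_first": i == 0,
        "is_last": i == len(words) - 1,
        "prev_word": words[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": words[i + 1].lower() if i < len(words) - 1 else "<EOS>",
    }
    for n in range(1, 5):  # prefix-1..4, suffix-1..4
        feats[f"prefix-{n}"] = w[:n]
        feats[f"suffix-{n}"] = w[-n:]
    return feats

feats = token_features(["John", "lives", "in", "NYC"], 0)
# feats["shape"] == "Xxxx"; feats["prev_word"] == "<BOS>"
```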

Model

Linear-chain CRF:

P(y|x) = exp(score(x, y)) / Z(x)

score(x, y) = Σ_t [ Σ_f feature_weights[(f, y_t)] + transition_weights[(y_{t-1}, y_t)] ]

Where:

  • feature_weights: Map of (feature, label) → weight
  • transition_weights: Map of (prev_label, curr_label) → weight
  • Z(x): Partition function (normalization)
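For a tiny input, the globally normalized probability can be checked by brute force. A sketch with illustrative names; Z(x) is enumerated exactly here, which is only feasible for toy examples (the real model uses forward-backward):

```python
import math
from itertools import product

def sequence_score(feats_per_pos, labels, feat_w, trans_w):
    """score(x, y): sum of feature weights plus transition weights along the chain."""
    s = 0.0
    for t, (feats, y) in enumerate(zip(feats_per_pos, labels)):
        s += sum(feat_w.get((f, y), 0.0) for f in feats)
        if t > 0:
            s += trans_w.get((labels[t - 1], y), 0.0)
    return s

def prob(feats_per_pos, labels, label_set, feat_w, trans_w):
    """P(y | x) with Z(x) computed by exhaustive enumeration."""
    num = math.exp(sequence_score(feats_per_pos, labels, feat_w, trans_w))
    Z = sum(math.exp(sequence_score(feats_per_pos, list(y), feat_w, trans_w))
            for y in product(label_set, repeat=len(feats_per_pos)))
    return num / Z

feats_per_pos = [["word=john", "capitalized"], ["word=lives"]]
feat_w = {("capitalized", "b_per"): 1.5, ("word=lives", "o"): 0.8}
trans_w = {("b_per", "o"): 0.5}
label_set = ("b_per", "o")
total = sum(prob(feats_per_pos, list(y), label_set, feat_w, trans_w)
            for y in product(label_set, repeat=2))
# total == 1.0 up to floating point: the distribution is globally normalized
```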

Training

Train CRF on BIO-tagged sequences:

# BIO tagging: B-PER, I-PER, B-GPE, I-GPE, B-ORG, I-ORG, O
training_data = [
  {
    [%Token{text: "John"}, %Token{text: "Smith"}],
    [:b_per, :i_per]
  },
  ...
]

model = CRF.new(labels: [:b_per, :i_per, :b_gpe, :i_gpe, :b_org, :i_org, :o])
{:ok, trained} = CRF.train(model, training_data,
  iterations: 100,
  learning_rate: 0.1,
  regularization: 1.0
)
:ok = CRF.save(trained, "priv/models/en/crf_ner.model")

Training Options:

  • :iterations - Maximum iterations (default: 100)
  • :learning_rate - Initial learning rate (default: 0.1)
  • :regularization - L2 regularization (default: 1.0)
  • :method - :sgd, :momentum, :adagrad (default: :momentum)
  • :convergence_threshold - Gradient norm threshold (default: 0.01)

Prediction

Label sequences using Viterbi decoding:

{:ok, model} = CRF.load("priv/models/en/crf_ner.model")
tokens = [%Token{text: "John"}, %Token{text: "lives"}, %Token{text: "in"}, %Token{text: "NYC"}]
{:ok, labels} = CRF.predict(model, tokens)
# [:b_per, :o, :o, :b_gpe]
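The BIO labels returned above are typically collapsed into entity spans downstream. A minimal sketch (labels written as strings rather than Elixir atoms; not Nasty's API):

```python
def bio_to_entities(labels):
    """Collapse a BIO label sequence into (entity_type, start, end) spans,
    end exclusive. 'b_per'/'i_per' mirror the :b_per/:i_per atoms in the docs."""
    entities, start, etype = [], None, None
    for i, label in enumerate(labels + ["o"]):  # sentinel flushes a final entity
        tag, _, kind = label.partition("_")
        if tag != "i" or kind != etype:
            if etype is not None:
                entities.append((etype, start, i))
                etype = None
            if tag == "b":
                start, etype = i, kind
    return entities

spans = bio_to_entities(["b_per", "o", "o", "b_gpe"])
# -> [("per", 0, 1), ("gpe", 3, 4)]
```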

Viterbi Algorithm

Find most likely label sequence:

  1. Initialize scores for first position
  2. For each subsequent position:
    • Compute emission score (from features)
    • Compute transition score (from previous label)
    • Track best previous label (backpointer)
  3. Backtrack from best final label

Complexity:

  • Time: O(n × L²) where n = sequence length, L = number of labels
  • Space: O(n × L)
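The three steps above can be written out directly. A minimal Python Viterbi over precomputed emission scores (illustrative, not Nasty's implementation):

```python
def viterbi(emissions, trans_w, labels):
    """emissions: one {label: score} dict per position; trans_w: {(prev, cur): score}.
    Returns the highest-scoring label sequence in O(n * L^2) time."""
    n = len(emissions)
    score = [{y: emissions[0].get(y, 0.0) for y in labels}]  # step 1: init
    back = [{}]
    for t in range(1, n):  # step 2: extend, tracking backpointers
        score.append({})
        back.append({})
        for y in labels:
            best_prev = max(labels,
                            key=lambda p: score[t - 1][p] + trans_w.get((p, y), 0.0))
            score[t][y] = (score[t - 1][best_prev]
                           + trans_w.get((best_prev, y), 0.0)
                           + emissions[t].get(y, 0.0))
            back[t][y] = best_prev
    last = max(labels, key=lambda y: score[n - 1][y])  # step 3: backtrack
    path = [last]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

emissions = [{"b_per": 2.0, "o": 0.5}, {"i_per": 1.0, "o": 0.9}]
trans_w = {("b_per", "i_per"): 1.0}
path = viterbi(emissions, trans_w, ["b_per", "i_per", "o"])
# -> ["b_per", "i_per"]: the transition weight tips the second position to i_per
```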

Forward-Backward Algorithm

Used during training to compute gradients:

  • Forward (α): total unnormalized score of all partial label sequences ending in a given label at position t

  • Backward (β): total unnormalized score of all partial label sequences continuing from a given label at position t

  • Partition Function: Z(x) = Σ_y exp(score(x, y)), summed over all label sequences; label marginals are proportional to α × β
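The forward recursion yields Z(x) in O(n × L²) instead of enumerating all Lⁿ sequences; for toy inputs the two agree exactly. A sketch (numerically naive exp/log; a real implementation would use log-sum-exp throughout):

```python
import math
from itertools import product

def log_z_forward(emissions, trans_w, labels):
    """log Z(x) via the forward recursion: alpha[y] accumulates the log of the
    summed scores of all label prefixes ending in label y at the current position."""
    alpha = {y: emissions[0].get(y, 0.0) for y in labels}
    for t in range(1, len(emissions)):
        alpha = {
            y: emissions[t].get(y, 0.0) + math.log(sum(
                math.exp(alpha[p] + trans_w.get((p, y), 0.0)) for p in labels))
            for y in labels
        }
    return math.log(sum(math.exp(v) for v in alpha.values()))

emissions = [{"a": 1.0, "b": 0.5}, {"a": 0.2, "b": 0.7}]
trans_w = {("a", "b"): 0.3}
labels = ("a", "b")
lz = log_z_forward(emissions, trans_w, labels)

# Brute-force check: sum exp(score) over every label sequence.
brute = math.log(sum(
    math.exp(sum(emissions[t].get(y, 0.0) for t, y in enumerate(seq))
             + sum(trans_w.get((seq[t - 1], seq[t]), 0.0)
                   for t in range(1, len(seq))))
    for seq in product(labels, repeat=len(emissions))))
# lz == brute
```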

Optimization

Gradient descent with momentum:

Gradient (of the log-likelihood) = Observed Features - Expected Features
Momentum: v := momentum * v + gradient - regularization * w
Weight Update: w := w + learning_rate * v
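One update step over sparse weight maps can be sketched as follows (a hypothetical helper, not Nasty's optimizer; the gradient is the ascent direction, observed minus expected, with L2 regularization pulling weights toward zero):

```python
def momentum_step(w, v, grad, lr=0.1, momentum=0.9, reg=1.0):
    """One SGD-with-momentum step on sparse {feature: weight} maps."""
    keys = set(w) | set(grad) | set(v)
    v_new = {k: momentum * v.get(k, 0.0) + grad.get(k, 0.0) - reg * w.get(k, 0.0)
             for k in keys}
    w_new = {k: w.get(k, 0.0) + lr * v_new[k] for k in keys}
    return w_new, v_new

w, v = {"capitalized|b_per": 0.0}, {}
grad = {"capitalized|b_per": 1.0}  # observed minus expected for this feature
w, v = momentum_step(w, v, grad)
# weight moves up by lr * gradient = 0.1 on the first step
```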

Mix Tasks

# Train CRF NER
mix nasty.train.crf \
  --corpus data/ner_train.conllu \
  --output priv/models/en/crf_ner.model \
  --task ner \
  --iterations 100 \
  --learning-rate 0.1 \
  --regularization 1.0 \
  --test data/ner_test.conllu

# Evaluate NER
mix nasty.eval \
  --model priv/models/en/crf_ner.model \
  --test data/ner_test.conllu \
  --type crf \
  --task ner

Integration with English Pipeline

Both models integrate seamlessly with the existing English module:

PCFG Integration

The PCFG parser is integrated into Nasty.Language.English.SentenceParser:

# Use PCFG parsing
{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)
{:ok, document} = English.parse(tagged, model: :pcfg)

# Or directly with SentenceParser
alias Nasty.Language.English.SentenceParser
{:ok, sentences} = SentenceParser.parse_sentences(tokens, model: :pcfg)

# With specific model (bypasses registry lookup)
{:ok, pcfg_model} = PCFG.load("path/to/model.pcfg")
{:ok, sentences} = SentenceParser.parse_sentences(tokens, 
  model: :pcfg, 
  pcfg_model: pcfg_model
)

# Falls back to rule-based if :model option not specified or model unavailable
{:ok, document} = English.parse(tagged)  # Uses rule-based parsing

CRF Integration

The CRF model is integrated into Nasty.Language.English.EntityRecognizer:

# Use CRF for NER
alias Nasty.Language.English.EntityRecognizer

{:ok, tokens} = English.tokenize(text)
{:ok, tagged} = English.tag_pos(tokens)

# CRF-based entity recognition
entities = EntityRecognizer.recognize(tagged, model: :crf)

# With specific model (bypasses registry lookup)
{:ok, crf_model} = CRF.load("path/to/model.crf")
entities = EntityRecognizer.recognize(tagged, 
  model: :crf,
  crf_model: crf_model
)

# Falls back to rule-based if :model option not specified or model unavailable
entities = EntityRecognizer.recognize(tagged)  # Uses rule-based NER

Model Registry

Both integrations use Nasty.Statistics.ModelLoader to automatically load models:

# Models are loaded from the registry by task and language
# PCFG: ModelLoader.load_latest(:en, :pcfg)
# CRF:  ModelLoader.load_latest(:en, :ner_crf)

# Models should be saved with proper naming:
# priv/models/en/pcfg_v1.model
# priv/models/en/ner_crf_v1.model

Performance Expectations

PCFG Parser

Accuracy:

  • Bracketing F1: 85-90% on UD test sets
  • Higher than rule-based parsing for ambiguous structures

Speed:

  • ~50-100ms per sentence (CPU)
  • Depends on sentence length and grammar size

Memory:

  • ~50-100MB model file
  • O(n²) space during parsing

CRF-based NER

Accuracy:

  • Entity-level F1: 92-95% (vs 70-80% rule-based)
  • Proper boundary detection
  • Better handling of unseen entities

Speed:

  • ~20-30ms per sentence (CPU)
  • Linear in sequence length

Memory:

  • ~20-50MB model file (depends on feature set)
  • O(n) space during decoding

Comparison with Other Approaches

PCFG vs Rule-based Parsing

Aspect             | PCFG                        | Rule-based
-------------------|-----------------------------|---------------------------
Ambiguity          | Probabilistic resolution    | Greedy heuristics
Unknown structures | Graceful degradation        | May fail
Training           | Requires treebank           | None needed
Speed              | Slower (O(n³))              | Faster (O(n))
Accuracy           | Higher on complex sentences | Good for simple sentences

CRF vs Rule-based NER

Aspect          | CRF                      | Rule-based
----------------|--------------------------|---------------------------
Context         | Global sequence context  | Local patterns
Features        | Rich feature sets        | Limited to POS + patterns
Boundaries      | Learned from data        | Heuristic rules
Training        | Requires annotated data  | None needed
Unseen entities | Better generalization    | Pattern matching only
Accuracy        | 92-95% F1                | 70-80% F1

Best Practices

For PCFG

  1. Training Data: Use high-quality treebanks (UD, Penn Treebank)
  2. Smoothing: Use add-k smoothing (k=0.001) for unseen rules
  3. CNF Conversion: Always convert to CNF before parsing
  4. Beam Search: Use beam width 10-20 for efficiency
  5. Evaluation: Report bracketing F1, not just accuracy

For CRF

  1. Features: Start with full feature set, prune if needed
  2. Regularization: Use L2 (λ=1.0) to prevent overfitting
  3. Learning Rate: Start with 0.1, decay if not converging
  4. BIO Tagging: Always use BIO scheme for proper boundaries
  5. Gazetteers: Include domain-specific entity lists
  6. Iterations: 100-200 iterations usually sufficient

Troubleshooting

PCFG Issues

Problem: Parse fails (returns :error)

  • Solution: Check if all words have lexical rules; add unknown word handling

Problem: Low parsing F1

  • Solution: Increase training data; adjust smoothing; check CNF conversion

Problem: Slow parsing

  • Solution: Reduce beam width; prune low-probability rules

CRF Issues

Problem: Training doesn't converge

  • Solution: Reduce learning rate; increase regularization; check gradient computation

Problem: Low NER F1

  • Solution: Add more features; increase training data; check BIO tagging consistency

Problem: Slow training

  • Solution: Reduce feature set; use AdaGrad; parallelize if possible

Future Enhancements

PCFG

  • Lexicalized PCFG (head-driven)
  • Latent variable PCFG
  • Neural PCFG with embeddings
  • Dependency conversion from CFG parse

CRF

  • Higher-order CRF (beyond linear-chain)
  • Semi-Markov CRF for multi-token entities
  • Structured perceptron as alternative
  • Neural CRF with BiLSTM features

References

PCFG

  • Charniak, E. (1997). Statistical Parsing with a Context-Free Grammar and Word Statistics
  • Klein & Manning (2003). Accurate Unlexicalized Parsing
  • Petrov et al. (2006). Learning Accurate, Compact, and Interpretable Tree Annotation

CRF

  • Lafferty et al. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
  • Sutton & McCallum (2012). An Introduction to Conditional Random Fields
  • Tjong Kim Sang & De Meulder (2003). Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition