Nasty.Statistics.Parsing.PCFG (Nasty v0.3.0)
Probabilistic Context-Free Grammar (PCFG) model for parsing.
Implements the Nasty.Statistics.Model behaviour for statistical parsing
with grammar rules learned from treebanks.
Training
PCFG models are trained on annotated treebanks (e.g., Universal Dependencies). The training process extracts grammar rules and estimates their probabilities from phrase structure trees.
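As a rough sketch of the estimation step (hypothetical helper names, not this module's implementation), rule probabilities can be obtained as smoothed relative frequencies of counted rules, grouped by their left-hand side:

# Hypothetical illustration of add-k rule probability estimation.
# Rules are {lhs, rhs, count} triples; P(lhs -> rhs) = (count + k) / (lhs_total + k * n_rules).
defmodule RuleEstimationSketch do
  def estimate(rule_counts, k \\ 0.001) do
    rule_counts
    |> Enum.group_by(fn {lhs, _rhs, _count} -> lhs end)
    |> Enum.flat_map(fn {lhs, rules} ->
      total = rules |> Enum.map(fn {_, _, count} -> count end) |> Enum.sum()
      n = length(rules)

      Enum.map(rules, fn {^lhs, rhs, count} ->
        {lhs, rhs, (count + k) / (total + k * n)}
      end)
    end)
  end
end

RuleEstimationSketch.estimate([{:np, [:det, :n], 8}, {:np, [:n], 2}])
# => probabilities of roughly 0.80 and 0.20 for the two :np rules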
Parsing
Uses the CYK algorithm to find the most likely parse tree for a sentence. The grammar is automatically converted to Chomsky Normal Form (CNF) for efficient parsing.
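For orientation, probabilistic CYK fills a chart of spans bottom-up, combining the best-scoring analyses of adjacent sub-spans through binary rules. The sketch below runs on a toy CNF grammar and uses hypothetical data shapes, not the module's internal grammar or chart representation:

# Toy probabilistic CYK sketch (illustration only).
# binary:  %{{b, c} => [{a, prob}]}  -- rules a -> b c
# lexical: %{word => [{pos, prob}]}  -- rules pos -> word
defmodule CYKSketch do
  def parse(words, binary, lexical, start) do
    n = length(words)

    # Width-1 spans come from the lexical rules.
    chart =
      words
      |> Enum.with_index()
      |> Map.new(fn {word, i} ->
        {{i, i + 1}, Map.new(lexical[word] || [], fn {pos, p} -> {pos, {p, word}} end)}
      end)

    # Fill wider spans from the best-scoring split of each span.
    chart =
      for span <- 2..n//1, i <- 0..(n - span), reduce: chart do
        chart ->
          j = i + span

          cell =
            for k <- (i + 1)..(j - 1),
                {b, {pb, _}} <- chart[{i, k}],
                {c, {pc, _}} <- chart[{k, j}],
                {a, pr} <- binary[{b, c}] || [],
                reduce: %{} do
              cell ->
                p = pr * pb * pc

                case cell[a] do
                  {best, _} when best >= p -> cell
                  _ -> Map.put(cell, a, {p, {k, b, c}})
                end
            end

          Map.put(chart, {i, j}, cell)
      end

    # Best analysis of the whole sentence rooted in the start symbol (nil if none).
    chart[{0, n}][start]
  end
end

binary = %{{:det, :n} => [{:np, 1.0}]}
lexical = %{"the" => [{:det, 1.0}], "cat" => [{:n, 1.0}]}
CYKSketch.parse(["the", "cat"], binary, lexical, :np)
# => {1.0, {1, :det, :n}}  (probability and backpointer for the best :np over the span)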
Examples
# Training
training_data = load_treebank("data/train.conllu")
model = PCFG.new()
{:ok, trained} = PCFG.train(model, training_data, smoothing: 0.001)
:ok = PCFG.save(trained, "priv/models/en/pcfg.model")
# Parsing
{:ok, model} = PCFG.load("priv/models/en/pcfg.model")
tokens = [%Token{text: "the"}, %Token{text: "cat"}]
{:ok, parse_tree} = PCFG.predict(model, tokens, [])
Summary
Functions
Evaluates the model's parsing accuracy on test data.
Loads a trained PCFG model from disk.
Returns model metadata.
Creates a new untrained PCFG model.
Parses a sequence of tokens using the trained PCFG.
Saves the trained PCFG model to disk.
Trains the PCFG model on annotated phrase structure data.
Types
@type t() :: %Nasty.Statistics.Parsing.PCFG{
        language: atom(),
        lexicon: %{required(String.t()) => [atom()]},
        metadata: map(),
        non_terminals: MapSet.t(),
        rule_index: %{required(atom()) => [Nasty.Statistics.Parsing.Grammar.Rule.t()]},
        rules: [Nasty.Statistics.Parsing.Grammar.Rule.t()],
        smoothing_k: float(),
        start_symbol: atom()
      }
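The struct fields can be read directly from a trained model, for example:

# Assumes `trained` is a trained model as in the Examples above.
trained.start_symbol                 # root symbol, :s by default
length(trained.rules)                # number of learned grammar rules
MapSet.size(trained.non_terminals)   # number of distinct non-terminals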
Functions
Evaluates the model's parsing accuracy on test data.
Computes bracketing precision, recall, and F1 score.
Parameters
- model - Trained PCFG model
- test_data - List of {tokens, gold_tree} tuples
- opts - Options passed to parser
Returns
Map with evaluation metrics:
- :precision - Bracketing precision
- :recall - Bracketing recall
- :f1 - Bracketing F1 score
- :exact_match - Percentage of exact matches
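As a point of reference (hypothetical helper, not this module's code), bracketing metrics compare the labelled spans of each predicted tree against the gold tree:

# Hypothetical sketch: brackets are {label, start, stop} spans extracted from a tree.
defmodule BracketingSketch do
  def scores(predicted_brackets, gold_brackets) do
    pred = MapSet.new(predicted_brackets)
    gold = MapSet.new(gold_brackets)
    matched = MapSet.size(MapSet.intersection(pred, gold))

    precision = matched / max(MapSet.size(pred), 1)
    recall = matched / max(MapSet.size(gold), 1)

    f1 =
      if precision + recall > 0 do
        2 * precision * recall / (precision + recall)
      else
        0.0
      end

    %{precision: precision, recall: recall, f1: f1}
  end
end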
Loads a trained PCFG model from disk.
Returns model metadata.
Creates a new untrained PCFG model.
Options
- :start_symbol - Root symbol (default: :s)
- :smoothing_k - Smoothing constant (default: 0.001)
- :language - Language code (default: :en)
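For example, assuming the options are passed as a keyword list:

# Assumes new/1 accepts the documented options as a keyword list.
model = PCFG.new(start_symbol: :s, smoothing_k: 0.01, language: :en)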
@spec predict(t(), [Nasty.AST.Token.t()], keyword()) :: {:ok, term()} | {:error, term()}
Parses a sequence of tokens using the trained PCFG.
Parameters
- model - Trained PCFG model
- tokens - List of %Token{} structs (should have POS tags)
- opts - Options:
  - :beam_width - Beam search width (default: 10)
  - :start_symbol - Root symbol (default: model's start symbol)
  - :n_best - Return n-best parses (default: 1)
Returns
- {:ok, parse_tree} - Best parse tree
- {:ok, [parse_tree]} - Multiple parse trees if :n_best > 1
- {:error, reason} - Parsing failed
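For example, requesting several candidate parses (illustrative token values; assumes the tokens already carry POS tags and that model was loaded as in the Examples above):

# Illustrative values only; real tokens should include POS tags.
tokens = [%Token{text: "the"}, %Token{text: "cat"}, %Token{text: "sleeps"}]
{:ok, parse_trees} = PCFG.predict(model, tokens, n_best: 3, beam_width: 20)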
Saves the trained PCFG model to disk.
Trains the PCFG model on annotated phrase structure data.
Training Data Format
Training data should be a list of {tokens, parse_tree} tuples where:
- tokens is a list of %Token{} structs
- parse_tree is a hierarchical structure representing the syntax tree
Alternatively, accepts raw grammar rules as [{lhs, rhs, count}, ...].
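An illustrative shape for the raw-rule alternative (the symbols and counts here are made up, and the right-hand side is assumed to be a list of symbols):

# Made-up counts for illustration; {lhs, rhs, count} with rhs as a list of symbols.
raw_rules = [
  {:s, [:np, :vp], 120},
  {:np, [:det, :n], 95},
  {:vp, [:v, :np], 60}
]

{:ok, trained} = PCFG.train(PCFG.new(), raw_rules)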
Options
- :smoothing - Smoothing constant (overrides model setting)
- :cnf - Convert to CNF (default: true)
Returns
{:ok, trained_model} with learned grammar rules