Nasty.Statistics.Parsing.PCFG (Nasty v0.3.0)


Probabilistic Context-Free Grammar (PCFG) model for parsing.

Implements the Nasty.Statistics.Model behaviour for statistical parsing with grammar rules learned from treebanks.

Training

PCFG models are trained on annotated treebanks (e.g., Universal Dependencies). The training process extracts grammar rules and estimates their probabilities from phrase structure trees.

Parsing

Uses the CYK algorithm to find the most likely parse tree for a sentence. The grammar is automatically converted to Chomsky Normal Form (CNF) for efficient parsing.
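The core of the CNF conversion is binarization: any rule whose right-hand side has more than two symbols is split into a chain of binary rules via intermediate non-terminals. As an illustrative sketch (not the library's actual implementation; `CNFSketch` and its naming scheme for intermediate symbols are assumptions):

```elixir
defmodule CNFSketch do
  # Splits a rule {lhs, rhs, prob} with |rhs| > 2 into binary rules by
  # introducing intermediate symbols. The original probability stays on
  # the first rule; intermediates get probability 1.0, so the product of
  # probabilities along the chain is unchanged.
  def binarize({lhs, rhs, prob}) when length(rhs) <= 2, do: [{lhs, rhs, prob}]

  def binarize({lhs, [first | rest], prob}) do
    mid = :"#{lhs}_#{Enum.join(rest, "_")}"
    [{lhs, [first, mid], prob} | binarize({mid, rest, 1.0})]
  end
end
```

For example, `S -> NP VP PP` becomes `S -> NP S_VP_PP` plus `S_VP_PP -> VP PP`, after which CYK's two-way cell combination applies directly.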

Examples

# Training
training_data = load_treebank("data/train.conllu")
model = PCFG.new()
{:ok, trained} = PCFG.train(model, training_data, smoothing: 0.001)
:ok = PCFG.save(trained, "priv/models/en/pcfg.model")

# Parsing
{:ok, model} = PCFG.load("priv/models/en/pcfg.model")
tokens = [%Token{text: "the"}, %Token{text: "cat"}]
{:ok, parse_tree} = PCFG.predict(model, tokens, [])

Summary

Functions

evaluate(model, test_data, opts \\ [])
  Evaluates the model's parsing accuracy on test data.

load(path)
  Loads a trained PCFG model from disk.

metadata(model)
  Returns model metadata.

new(opts \\ [])
  Creates a new untrained PCFG model.

predict(model, tokens, opts \\ [])
  Parses a sequence of tokens using the trained PCFG.

save(model, path)
  Saves the trained PCFG model to disk.

train(model, training_data, opts \\ [])
  Trains the PCFG model on annotated phrase structure data.

Types

t()

@type t() :: %Nasty.Statistics.Parsing.PCFG{
  language: atom(),
  lexicon: %{required(String.t()) => [atom()]},
  metadata: map(),
  non_terminals: MapSet.t(),
  rule_index: %{required(atom()) => [Nasty.Statistics.Parsing.Grammar.Rule.t()]},
  rules: [Nasty.Statistics.Parsing.Grammar.Rule.t()],
  smoothing_k: float(),
  start_symbol: atom()
}

Functions

evaluate(model, test_data, opts \\ [])

@spec evaluate(t(), list(), keyword()) :: map()

Evaluates the model's parsing accuracy on test data.

Computes bracketing precision, recall, and F1 score.

Parameters

  • model - Trained PCFG model
  • test_data - List of {tokens, gold_tree} tuples
  • opts - Options passed to parser

Returns

Map with evaluation metrics:

  • :precision - Bracketing precision
  • :recall - Bracketing recall
  • :f1 - Bracketing F1 score
  • :exact_match - Percentage of exact matches
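Bracketing metrics compare the labeled spans of the predicted and gold trees as sets. A minimal, self-contained sketch of the computation (the `BracketScore` module and the `{label, start, stop}` span representation are assumptions for illustration):

```elixir
defmodule BracketScore do
  # Each constituent is a {label, start, stop} span over token indices.
  # precision = |pred ∩ gold| / |pred|, recall = |pred ∩ gold| / |gold|,
  # f1 = harmonic mean of the two.
  def score(predicted, gold) do
    pred_set = MapSet.new(predicted)
    gold_set = MapSet.new(gold)
    correct = pred_set |> MapSet.intersection(gold_set) |> MapSet.size()

    precision = correct / max(MapSet.size(pred_set), 1)
    recall = correct / max(MapSet.size(gold_set), 1)

    f1 =
      if precision + recall > 0.0,
        do: 2 * precision * recall / (precision + recall),
        else: 0.0

    %{precision: precision, recall: recall, f1: f1}
  end
end
```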

load(path)

@spec load(Path.t()) :: {:ok, t()} | {:error, term()}

Loads a trained PCFG model from disk.

metadata(model)

@spec metadata(t()) :: map()

Returns model metadata.

new(opts \\ [])

@spec new(keyword()) :: t()

Creates a new untrained PCFG model.

Options

  • :start_symbol - Root symbol (default: :s)
  • :smoothing_k - Smoothing constant (default: 0.001)
  • :language - Language code (default: :en)

predict(model, tokens, opts \\ [])

@spec predict(t(), [Nasty.AST.Token.t()], keyword()) ::
  {:ok, term()} | {:error, term()}

Parses a sequence of tokens using the trained PCFG.

Parameters

  • model - Trained PCFG model
  • tokens - List of %Token{} structs (should have POS tags)
  • opts - Options:
    • :beam_width - Beam search width (default: 10)
    • :start_symbol - Root symbol (default: model's start symbol)
    • :n_best - Return n-best parses (default: 1)

Returns

  • {:ok, parse_tree} - Best parse tree
  • {:ok, [parse_tree]} - Multiple parse trees if :n_best > 1
  • {:error, reason} - Parsing failed

save(model, path)

@spec save(t(), Path.t()) :: :ok | {:error, term()}

Saves the trained PCFG model to disk.

train(model, training_data, opts \\ [])

@spec train(t(), list(), keyword()) :: {:ok, t()} | {:error, term()}

Trains the PCFG model on annotated phrase structure data.

Training Data Format

Training data should be a list of {tokens, parse_tree} tuples where:

  • tokens is a list of %Token{} structs
  • parse_tree is a hierarchical structure representing the syntax tree

Alternatively, accepts raw grammar rules as [{lhs, rhs, count}, ...].

Options

  • :smoothing - Smoothing constant (overrides model setting)
  • :cnf - Convert to CNF (default: true)

Returns

{:ok, trained_model} with learned grammar rules
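Estimating rule probabilities from raw {lhs, rhs, count} triples with add-k smoothing can be sketched as follows. This is illustrative only: `RuleProbs` is a hypothetical module, and the assumption is that probabilities are normalized per left-hand side, as is standard for PCFGs.

```elixir
defmodule RuleProbs do
  # P(rule) = (count + k) / (lhs_total + k * n_rules_for_lhs),
  # normalized over all rules sharing the same left-hand side.
  def estimate(counted_rules, k \\ 0.001) do
    counted_rules
    |> Enum.group_by(fn {lhs, _rhs, _count} -> lhs end)
    |> Enum.flat_map(fn {_lhs, rules} ->
      total = rules |> Enum.map(fn {_, _, c} -> c end) |> Enum.sum()
      n = length(rules)

      Enum.map(rules, fn {lhs, rhs, c} ->
        {lhs, rhs, (c + k) / (total + k * n)}
      end)
    end)
  end
end
```

With k = 0 this reduces to plain relative-frequency (maximum likelihood) estimation; a small positive k keeps rare rules from receiving zero probability.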