Nasty.Statistics.Parsing.Grammar (Nasty v0.3.0)

Grammar rule representation and manipulation for PCFGs.

Provides data structures and operations for working with probabilistic context-free grammar rules.

Grammar Rules

A rule represents a production: A → α where:

A is a non-terminal (left-hand side)
α is a sequence of terminals and/or non-terminals (right-hand side)
Each rule has an associated probability

Examples

iex> rule = Grammar.Rule.new(:np, [:det, :noun], 0.35)
%Grammar.Rule{lhs: :np, rhs: [:det, :noun], probability: 0.35}

iex> Grammar.lexical_rule?(rule)
false  # Not a lexical rule (no terminal symbols)

Summary

Functions

apply_smoothing(rules, k \\ 0.001)

Applies add-k smoothing to grammar rules.

binary_rule?(arg1)

Checks if a rule is binary (A → B C).

index_by_lhs(rules)

Builds an index of rules by their left-hand side for fast lookup.

lexical_rule?(arg1)

Checks if a rule is lexical (produces a terminal/word).

non_terminals(rules)

Extracts all non-terminals used in the grammar.

normalize_probabilities(rules)

Normalizes rule probabilities for rules with the same LHS.

terminals(rules)

Extracts all terminals (words) from lexical rules.

to_cnf(rules)

Converts the grammar to Chomsky Normal Form (CNF).

unary_rule?(arg1)

Checks if a rule is unary (A → B).

Functions

apply_smoothing(rules, k \\ 0.001)

@spec apply_smoothing([Nasty.Statistics.Parsing.Grammar.Rule.t()], float()) :: [
  Nasty.Statistics.Parsing.Grammar.Rule.t()
]

Applies add-k smoothing to grammar rules.

Adds a small constant k to each rule count before normalization to handle unseen productions.

Examples

iex> rules = [Rule.new(:np, [:det, :noun], 0.7), Rule.new(:np, [:pron], 0.3)]
iex> smoothed = Grammar.apply_smoothing(rules, 0.001)
iex> Enum.all?(smoothed, fn r -> r.probability > 0 end)
true

binary_rule?(arg1)

@spec binary_rule?(Nasty.Statistics.Parsing.Grammar.Rule.t()) :: boolean()

Checks if a rule is binary (A → B C).

Examples

iex> Grammar.binary_rule?(Rule.new(:np, [:det, :noun], 0.35))
true

iex> Grammar.binary_rule?(Rule.new(:np, [:pron], 0.25))
false

index_by_lhs(rules)

@spec index_by_lhs([Nasty.Statistics.Parsing.Grammar.Rule.t()]) :: %{
  required(atom()) => [Nasty.Statistics.Parsing.Grammar.Rule.t()]
}

Builds an index of rules by their left-hand side for fast lookup.

Examples

iex> rules = [
...>   Rule.new(:np, [:det, :noun], 0.35),
...>   Rule.new(:np, [:pron], 0.25),
...>   Rule.new(:vp, [:verb, :np], 0.45)
...> ]
iex> index = Grammar.index_by_lhs(rules)
iex> length(index[:np])
2

lexical_rule?(arg1)

@spec lexical_rule?(Nasty.Statistics.Parsing.Grammar.Rule.t()) :: boolean()

Checks if a rule is lexical (produces a terminal/word).

A rule is lexical if its RHS contains exactly one element that is a string.

Examples

iex> Grammar.lexical_rule?(Rule.new(:noun, ["cat"], 0.01))
true

iex> Grammar.lexical_rule?(Rule.new(:np, [:det, :noun], 0.35))
false

non_terminals(rules)

@spec non_terminals([Nasty.Statistics.Parsing.Grammar.Rule.t()]) :: MapSet.t()

Extracts all non-terminals used in the grammar.

Examples

iex> rules = [
...>   Rule.new(:np, [:det, :noun], 0.35),
...>   Rule.new(:vp, [:verb, :np], 0.45)
...> ]
iex> Grammar.non_terminals(rules)
MapSet.new([:np, :vp, :det, :noun, :verb])

normalize_probabilities(rules)

@spec normalize_probabilities([Nasty.Statistics.Parsing.Grammar.Rule.t()]) :: [
  Nasty.Statistics.Parsing.Grammar.Rule.t()
]

Normalizes rule probabilities for rules with the same LHS.

Ensures that P(A → α) for all rules with LHS = A sums to 1.0.

Examples

iex> rules = [
...>   Rule.new(:np, [:det, :noun], 35),  # Raw counts
...>   Rule.new(:np, [:pron], 25),
...>   Rule.new(:np, [:propn], 40)
...> ]
iex> normalized = Grammar.normalize_probabilities(rules)
iex> Enum.reduce(normalized, 0.0, fn r, acc -> acc + r.probability end)
1.0

terminals(rules)

@spec terminals([Nasty.Statistics.Parsing.Grammar.Rule.t()]) :: MapSet.t()

Extracts all terminals (words) from lexical rules.

Examples

iex> rules = [
...>   Rule.new(:det, ["the"], 0.5),
...>   Rule.new(:noun, ["cat"], 0.3)
...> ]
iex> Grammar.terminals(rules)
MapSet.new(["the", "cat"])

to_cnf(rules)

@spec to_cnf([Nasty.Statistics.Parsing.Grammar.Rule.t()]) :: [
  Nasty.Statistics.Parsing.Grammar.Rule.t()
]

Converts the grammar to Chomsky Normal Form (CNF).

CNF requires:

All rules are either binary (A → B C) or lexical (A → word)
No unary rules (except to terminals)
No epsilon productions

This is required for CYK parsing.

Implementation

Eliminate epsilon productions
Eliminate unary rules (A → B) by substitution
Convert long rules (A → B C D) into binary: A → B X, X → C D
Convert mixed terminal/non-terminal rules

Examples

iex> rules = [Rule.new(:s, [:np, :vp, :pp], 0.8)]
iex> cnf_rules = Grammar.to_cnf(rules)
iex> Enum.all?(cnf_rules, fn r -> binary_rule?(r) or lexical_rule?(r) end)
true

unary_rule?(arg1)

@spec unary_rule?(Nasty.Statistics.Parsing.Grammar.Rule.t()) :: boolean()

Checks if a rule is unary (A → B).

Examples

iex> Grammar.unary_rule?(Rule.new(:np, [:pron], 0.25))
true

iex> Grammar.unary_rule?(Rule.new(:np, [:det, :noun], 0.35))
false