Nasty.Statistics.Parsing.CYKParser (Nasty v0.3.0)

View Source

CYK (Cocke-Younger-Kasami) parsing algorithm for PCFGs.

Implements bottom-up chart parsing with dynamic programming to find the highest probability parse tree for a given sentence.

Algorithm

The CYK algorithm uses a chart (2D table) where chart[i][j] stores all possible parse trees for the span from word i to word j.

  1. Initialize bottom row with lexical rules (words)
  2. For each span length (2 to n):
    • For each possible split point k:
      • Try combining trees from [i,k] and [k+1,j]
      • If a binary rule A → B C exists, create new tree for A
  3. Return highest probability tree spanning entire sentence

Complexity

Time: O(n³ × |G|) where n = sentence length, |G| = grammar size Space: O(n² × |G|)

Examples

iex> grammar = PCFG.new(...)
iex> tokens = [%Token{text: "the"}, %Token{text: "cat"}]
iex> {:ok, parse_tree} = CYKParser.parse(grammar, tokens)

Summary

Functions

Builds the CYK parsing chart using dynamic programming.

Extracts bracket pairs from a parse tree for evaluation.

Extracts all possible parse trees from the chart.

Computes log probability of a parse tree.

Parses a sequence of tokens using the CYK algorithm.

Converts a parse tree to a bracketed string representation.

Types

chart()

@type chart() :: %{
  required({non_neg_integer(), non_neg_integer()}) => [chart_entry()]
}

chart_entry()

@type chart_entry() :: {atom(), parse_tree()}

parse_tree()

@type parse_tree() :: %{
  label: atom(),
  probability: float(),
  children: [parse_tree() | Nasty.AST.Token.t()],
  span: {non_neg_integer(), non_neg_integer()},
  rule: Nasty.Statistics.Parsing.Grammar.Rule.t() | nil
}

Functions

build_chart(grammar, tokens, beam_width)

@spec build_chart(map(), [Nasty.AST.Token.t()], non_neg_integer()) :: chart()

Builds the CYK parsing chart using dynamic programming.

Implementation

  1. Fill diagonal with lexical rules (single words)
  2. For increasing span lengths:
    • Try all split points
    • Apply binary rules to combine smaller spans
  3. Keep only highest probability parses (beam search)

extract_brackets(arg1)

@spec extract_brackets(parse_tree()) :: [
  {atom(), non_neg_integer(), non_neg_integer()}
]

Extracts bracket pairs from a parse tree for evaluation.

Returns a list of {label, start_pos, end_pos} tuples representing all constituents in the tree.

Examples

iex> tree = parse_tree_for("the cat sat")
iex> CYKParser.extract_brackets(tree)
[{:s, 0, 2}, {:np, 0, 1}, {:vp, 2, 2}, ...]

get_n_best_parses(chart, label, i, j, n \\ 1)

@spec get_n_best_parses(
  chart(),
  atom(),
  non_neg_integer(),
  non_neg_integer(),
  pos_integer()
) :: [
  parse_tree()
]

Extracts all possible parse trees from the chart.

Returns all trees (not just the best one) for the given span and label. Useful for n-best parsing.

Parameters

  • chart - Completed CYK chart
  • label - Non-terminal to extract
  • i - Start position
  • j - End position
  • n - Number of best parses to return (default: 1)

log_probability(map)

@spec log_probability(parse_tree()) :: float()

Computes log probability of a parse tree.

Sum of log probabilities of all rules used in the tree. More numerically stable than multiplying probabilities.

Examples

iex> tree = %{probability: 0.001, children: [...]}
iex> CYKParser.log_probability(tree)
-6.907755278982137  # log(0.001)

parse(grammar, tokens, opts \\ [])

@spec parse(map(), [Nasty.AST.Token.t()], keyword()) ::
  {:ok, parse_tree()} | {:error, term()}

Parses a sequence of tokens using the CYK algorithm.

Returns the highest probability parse tree, or an error if no parse exists.

Parameters

  • grammar - PCFG model with rules in CNF
  • tokens - List of tokens to parse
  • opts - Options:
    • :start_symbol - Root non-terminal (default: :s)
    • :beam_width - Max entries per chart cell (default: 10, 0 = unlimited)

Returns

  • {:ok, parse_tree} - Successful parse
  • {:error, reason} - No valid parse found

to_brackets(arg1)

@spec to_brackets(parse_tree()) :: String.t()

Converts a parse tree to a bracketed string representation.

Useful for evaluation and debugging.

Examples

iex> tree = %{label: :np, children: [...]}
iex> CYKParser.to_brackets(tree)
"(NP (DET the) (NOUN cat))"