Nasty.Statistics.SequenceLabeling.CRF (Nasty v0.3.0)

Conditional Random Field (CRF) for sequence labeling.

Implements a linear-chain CRF with feature-based modeling for tasks such as named entity recognition (NER) and part-of-speech (POS) tagging.

Model

A linear-chain CRF models the conditional probability of a label sequence y given an input sequence x:

P(y|x) = exp(score(x, y)) / Z(x)

Where:

score(x, y) = Σ feature_weights + Σ transition_weights
Z(x) is the partition function (normalizer)
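
In the standard linear-chain formulation, both sums run over positions in the sequence and Z(x) sums over every candidate label sequence. The notation below (feature functions f_k, feature weights w_k, transition weights θ) is an assumption for illustration, not a description of how this module stores or indexes its weights:

```latex
\[
\mathrm{score}(x, y) = \sum_{t=1}^{T} \sum_{k} w_k \, f_k(x, t, y_t) + \sum_{t=2}^{T} \theta_{y_{t-1},\, y_t},
\qquad
Z(x) = \sum_{y'} \exp\bigl(\mathrm{score}(x, y')\bigr)
\]
```

Presumably the feature_weights and transition_weights fields of the model struct play the roles of w and θ respectively.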

Training

Uses the forward-backward algorithm to compute gradients and, by default, gradient descent with momentum for optimization (see the :method option of train/3).
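
The forward half of forward-backward also yields log Z(x), which the gradient computation needs. The following is a minimal sketch of that pass in log space, assuming per-position emission scores and pairwise transition scores are already available as plain maps. ForwardSketch and its data layout are hypothetical and are not part of Nasty:

```elixir
defmodule ForwardSketch do
  @moduledoc false

  # Numerically stable log(sum(exp(v))) over a non-empty list of floats.
  def logsumexp(values) do
    m = Enum.max(values)
    m + :math.log(Enum.reduce(values, 0.0, fn v, acc -> acc + :math.exp(v - m) end))
  end

  # emissions:   one map per position, label => emission score
  # transitions: map of {prev_label, label} => score (missing pairs score 0.0)
  # Returns log Z(x) for the whole sequence.
  def log_partition([first | rest], transitions, labels) do
    # alpha[y] = log of the summed exp-scores of all prefixes ending in label y
    init = Map.new(labels, fn y -> {y, Map.get(first, y, 0.0)} end)

    rest
    |> Enum.reduce(init, fn emit, alpha ->
      Map.new(labels, fn y ->
        incoming =
          Enum.map(labels, fn prev ->
            Map.fetch!(alpha, prev) + Map.get(transitions, {prev, y}, 0.0)
          end)

        {y, Map.get(emit, y, 0.0) + logsumexp(incoming)}
      end)
    end)
    |> Map.values()
    |> logsumexp()
  end
end
```

Working in log space avoids overflow when scores are large; the cost is linear in sequence length and quadratic in the number of labels, instead of exponential for explicit enumeration.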

Prediction

Uses the Viterbi algorithm to find the most likely label sequence.
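
To illustrate the idea, here is a minimal Viterbi sketch over plain maps of scores. It assumes emission scores per position and pairwise transition scores have already been computed. ViterbiSketch and its input layout are hypothetical and do not reflect this module's internals:

```elixir
defmodule ViterbiSketch do
  @moduledoc false

  # emissions:   one map per position, label => emission score
  # transitions: map of {prev_label, label} => score (missing pairs score 0.0)
  # Returns the highest-scoring label sequence.
  def decode([], _transitions, _labels), do: []

  def decode([first | rest], transitions, labels) do
    # best[y] = {score of the best path ending in y, that path in reverse order}
    init = Map.new(labels, fn y -> {y, {Map.get(first, y, 0.0), [y]}} end)

    final =
      Enum.reduce(rest, init, fn emit, best ->
        Map.new(labels, fn y ->
          {prev_score, prev_path} =
            labels
            |> Enum.map(fn prev ->
              {score, path} = Map.fetch!(best, prev)
              {score + Map.get(transitions, {prev, y}, 0.0), path}
            end)
            |> Enum.max_by(fn {score, _path} -> score end)

          {y, {prev_score + Map.get(emit, y, 0.0), [y | prev_path]}}
        end)
      end)

    {_score, path} = final |> Map.values() |> Enum.max_by(fn {score, _path} -> score end)
    Enum.reverse(path)
  end
end
```

Unlike picking the best label at each position independently, this keeps the best-scoring path into every label at every step, so transition scores can override a locally attractive label.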

Examples

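alias Nasty.Statistics.SequenceLabeling.CRF

# Training
model = CRF.new(labels: [:person, :gpe, :org, :none])
# load_annotated_data/0 is a placeholder that returns a list of {tokens, labels} tuples
training_data = load_annotated_data()
{:ok, trained} = CRF.train(model, training_data, iterations: 100)

# Prediction (tokens is a list of %Nasty.AST.Token{} structs)
{:ok, labels} = CRF.predict(trained, tokens, [])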

Summary

Types

t()

Functions

load(path)

Loads a trained CRF model from disk.

metadata(model)

Returns model metadata.

new(opts \\ [])

Creates a new untrained CRF model.

predict(model, tokens, opts \\ [])

Predicts labels for a sequence of tokens using Viterbi decoding.

save(model, path)

Saves the trained CRF model to disk.

train(model, training_data, opts \\ [])

Trains the CRF model on annotated sequence data.

Types

t()

@type t() :: %Nasty.Statistics.SequenceLabeling.CRF{
  feature_weights: map(),
  label_set: MapSet.t(),
  labels: [atom()],
  language: atom(),
  metadata: map(),
  transition_weights: map()
}

Functions

load(path)

@spec load(Path.t()) :: {:ok, t()} | {:error, term()}

Loads a trained CRF model from disk.

metadata(model)

@spec metadata(t()) :: map()

Returns model metadata.

new(opts \\ [])

@spec new(keyword()) :: t()

Creates a new untrained CRF model.

Options

:labels - List of possible labels (required)
:language - Language code (default: :en)

predict(model, tokens, opts \\ [])

@spec predict(t(), [Nasty.AST.Token.t()], keyword()) ::
  {:ok, [atom()]} | {:error, term()}

Predicts labels for a sequence of tokens using Viterbi decoding.

Parameters

model - Trained CRF model
tokens - List of %Token{} structs
opts - Options (currently unused)

Returns

{:ok, labels} - Predicted label sequence, one label per token
{:error, reason} - If prediction fails

save(model, path)

@spec save(t(), Path.t()) :: :ok | {:error, term()}

Saves the trained CRF model to disk.

train(model, training_data, opts \\ [])

@spec train(t(), [{[Nasty.AST.Token.t()], [atom()]}], keyword()) ::
  {:ok, t()} | {:error, term()}

Trains the CRF model on annotated sequence data.

Training Data Format

List of {tokens, labels} tuples where:

tokens is a list of %Token{} structs
labels is a list of label atoms (same length as tokens)
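
Because each labels list must match its tokens list in length, it can be worth checking the data before training. A minimal sketch follows; TrainingDataCheck is a hypothetical helper, not part of Nasty, and it only inspects list lengths and label membership, so any term can stand in for a token:

```elixir
defmodule TrainingDataCheck do
  @moduledoc false

  # Returns :ok, or {:error, {index, reason}} for the first malformed example.
  # `labels` is the list of labels the model was created with.
  def validate(training_data, labels) do
    allowed = MapSet.new(labels)

    training_data
    |> Enum.with_index()
    |> Enum.find_value(:ok, fn {{tokens, tags}, index} ->
      cond do
        length(tokens) != length(tags) ->
          {:error, {index, :length_mismatch}}

        not Enum.all?(tags, &MapSet.member?(allowed, &1)) ->
          {:error, {index, :unknown_label}}

        true ->
          # nil tells find_value/3 to keep scanning
          nil
      end
    end)
  end
end
```

Whether train/3 performs a similar check itself and returns {:error, reason} is not stated here, so validating up front is a defensive choice rather than a requirement.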

Options

:iterations - Maximum training iterations (default: 100)
:learning_rate - Initial learning rate (default: 0.1)
:regularization - L2 regularization strength (default: 1.0)
:method - Optimization method (:sgd, :momentum, :adagrad) (default: :momentum)
:convergence_threshold - Gradient norm threshold below which training is considered converged (default: 0.01)
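
To make the roles of these options concrete, here is a sketch of how :learning_rate, :regularization and :convergence_threshold typically enter a momentum update. It is not taken from the Nasty source: MomentumSketch, the :momentum coefficient of 0.9, and the exact form of the L2 term are all assumptions:

```elixir
defmodule MomentumSketch do
  @moduledoc false

  # weights, gradient, velocity: maps of feature key => float.
  # `gradient` is the gradient of the loss being minimized.
  # Returns {new_weights, new_velocity} after one step.
  def step(weights, gradient, velocity, opts \\ []) do
    lr = Keyword.get(opts, :learning_rate, 0.1)
    l2 = Keyword.get(opts, :regularization, 1.0)
    # Assumption: a fixed momentum coefficient; the documented options do not expose one.
    mu = Keyword.get(opts, :momentum, 0.9)

    keys = Enum.uniq(Map.keys(weights) ++ Map.keys(gradient))

    Enum.reduce(keys, {weights, velocity}, fn key, {w_acc, v_acc} ->
      w = Map.get(weights, key, 0.0)
      # Loss gradient plus the gradient of the L2 penalty (l2 / 2) * w^2
      g = Map.get(gradient, key, 0.0) + l2 * w
      v = mu * Map.get(velocity, key, 0.0) - lr * g
      {Map.put(w_acc, key, w + v), Map.put(v_acc, key, v)}
    end)
  end

  # Euclidean norm of the gradient.
  def gradient_norm(gradient) do
    gradient
    |> Map.values()
    |> Enum.reduce(0.0, fn g, acc -> acc + g * g end)
    |> :math.sqrt()
  end

  # True once the gradient norm drops below the threshold.
  def converged?(gradient, threshold \\ 0.01), do: gradient_norm(gradient) < threshold
end
```

In a loop of this shape, training would stop at whichever comes first: the :iterations limit or the convergence check succeeding.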

Returns

{:ok, trained_model} - Model with learned feature and transition weights
{:error, reason} - If training fails