Nasty.Statistics.Neural.Architectures.BiLSTMCRF (Nasty v0.3.0)


Bidirectional LSTM with Conditional Random Field (CRF) layer for sequence tagging.

This is a strong, widely used architecture for sequence labeling tasks such as POS tagging and NER, typically reaching 97-98% token accuracy on standard POS benchmarks (see Expected Performance below).

Architecture

Input (word IDs + optional character IDs)
   |
   v
Embedding Layer (word embeddings + optional char CNN)
   |
   v
BiLSTM Layer 1 (forward + backward)
   |
   v
Dropout
   |
   v
BiLSTM Layer 2 (optional, forward + backward)
   |
   v
Dropout
   |
   v
Dense Layer (project to tag space)
   |
   v
CRF Layer (structured prediction with transition matrix)
   |
   v
Output (tag sequence)

Key Features

  • Bidirectional context: Captures both left and right context
  • CRF decoding: Models transition probabilities between tags
  • Character embeddings: Handles out-of-vocabulary words
  • Dropout: Prevents overfitting
  • Flexible depth: 1-3 LSTM layers

Expected Performance

  • POS Tagging: 97-98% accuracy on Penn Treebank / UD
  • NER: 88-92% F1 on CoNLL-2003
  • Speed: ~1000-5000 tokens/second (CPU), 10000+ (GPU)

Usage

# Build model
model = BiLSTMCRF.build(
  vocab_size: 10000,
  num_tags: 17,
  embedding_dim: 300,
  hidden_size: 256,
  num_layers: 2
)

# Train
{:ok, trained_state} = Trainer.train(
  fn -> model end,
  training_data,
  validation_data,
  epochs: 10
)

# Predict
{:ok, tags} = BiLSTMCRF.predict(model, trained_state, word_ids)

Summary

Functions

build(opts)
Builds a BiLSTM-CRF model.

build_with_crf(opts)
Builds a complete BiLSTM-CRF model with CRF layer.

crf_forward(emissions, transitions)
CRF forward pass - returns normalized probabilities.

crf_gold_score(emissions, tags, transitions, mask \\ nil)
Computes the score of the gold (true) tag sequence.

crf_layer(logits, num_tags)
Adds a CRF layer to the model.

crf_partition_function(emissions, transitions, mask \\ nil)
Computes the partition function using the forward algorithm.

default_config(opts \\ [])
Returns default configuration for BiLSTM-CRF.

dependency_parsing_config(opts)
Returns dependency-parsing-specific configuration.

ner_config(opts)
Returns NER-specific configuration.

pos_tagging_config(opts)
Returns POS-tagging-specific configuration.

reverse_sequence(layer)
Helper to reverse a sequence along the time axis.

training_config(task, dataset_size)
Example training configuration for BiLSTM-CRF.

Functions

build(opts)

@spec build(keyword() | map()) :: Axon.t()

Builds a BiLSTM-CRF model.

Options

  • :vocab_size - Vocabulary size (required)
  • :num_tags - Number of output tags (required)
  • :embedding_dim - Word embedding dimension (default: 300)
  • :hidden_size - LSTM hidden size (default: 256)
  • :num_layers - Number of BiLSTM layers (default: 2)
  • :dropout - Dropout rate (default: 0.3)
  • :use_char_cnn - Add character-level CNN (default: false)
  • :char_vocab_size - Character vocabulary size (default: 100)
  • :char_embedding_dim - Character embedding dimension (default: 30)
  • :char_filters - Character CNN filter sizes (default: [3, 4, 5])
  • :char_num_filters - Number of filters per size (default: 30)
  • :pretrained_embeddings - Pre-trained embedding matrix (default: nil)
  • :freeze_embeddings - Freeze embedding weights (default: false)

Returns

An %Axon{} model ready for training.

build_bilstm_stack(input, hidden_size, num_layers, dropout)

Builds the BiLSTM stack.

Parameters

  • input - Input tensor
  • hidden_size - LSTM hidden size
  • num_layers - Number of layers
  • dropout - Dropout rate

Returns

Axon layer representing the BiLSTM stack.
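Conceptually, each bidirectional layer runs one LSTM left-to-right and a second LSTM over the time-reversed sequence, then concatenates the two output sequences along the feature axis. A minimal Axon sketch of that idea (a hypothetical helper, not the library's implementation; assumes `[batch, seq, features]` input):

```elixir
# Hypothetical single bidirectional layer (illustrative only).
# Axon.lstm/2 returns {output_sequence, state}; we keep the outputs.
def bilstm_layer(input, hidden_size) do
  {forward, _state} = Axon.lstm(input, hidden_size)

  # Backward direction: reverse time, run the LSTM, reverse back.
  {backward, _state} =
    input
    |> Axon.nx(&Nx.reverse(&1, axes: [1]))
    |> Axon.lstm(hidden_size)

  backward = Axon.nx(backward, &Nx.reverse(&1, axes: [1]))

  # Output shape: [batch, seq, 2 * hidden_size]
  Axon.concatenate(forward, backward, axis: -1)
end
```

Stacking `num_layers` of such layers with dropout in between gives the stack shown in the architecture diagram.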

build_char_cnn(char_input, vocab_size, embedding_dim, filter_sizes, num_filters)

Builds character-level CNN.

Parameters

  • char_input - Character ID input [batch, seq, char_seq]
  • vocab_size - Character vocabulary size
  • embedding_dim - Character embedding dimension
  • filter_sizes - List of filter sizes (e.g., [3, 4, 5])
  • num_filters - Number of filters per size

Returns

Axon layer with character-level features.

build_with_crf(opts)

@spec build_with_crf(keyword()) :: Axon.t()

Builds a complete BiLSTM-CRF model with CRF layer.

This is a more advanced version that includes proper CRF decoding. Requires custom Axon layers for CRF forward-backward and Viterbi.

Options

Same as build/1, plus:

  • :use_crf - Use full CRF layer (default: false, uses softmax instead)
  • :transition_init - Transition matrix initialization (default: :random)

Returns

An %Axon{} model with CRF output layer.

crf_forward(emissions, transitions)

CRF forward pass - returns normalized probabilities.

Parameters

  • emissions - Emission scores [batch, seq, num_tags]
  • transitions - Transition matrix [num_tags, num_tags]

Returns

Normalized CRF scores [batch, seq, num_tags]

crf_gold_score(emissions, tags, transitions, mask \\ nil)

Computes the score of the gold (true) tag sequence.

Parameters

  • emissions - Emission scores [batch, seq, num_tags]
  • tags - True tag sequence [batch, seq]
  • transitions - Transition matrix [num_tags, num_tags]
  • mask - Sequence mask [batch, seq] (optional)

Returns

Gold sequence scores [batch]
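For a single unbatched sequence, the gold score is just the emissions picked out by the gold tags plus the transitions between consecutive gold tags. A plain-Elixir sketch of that arithmetic (illustrative only: nested lists instead of tensors, no masking):

```elixir
# Gold score for one sequence (illustrative, unbatched, no masking).
# emissions: one score list per timestep; transitions: num_tags x num_tags.
gold_score = fn emissions, tags, transitions ->
  # Sum of emission scores along the gold path.
  emission_score =
    Enum.zip(emissions, tags)
    |> Enum.map(fn {scores, tag} -> Enum.at(scores, tag) end)
    |> Enum.sum()

  # Sum of transition scores between consecutive gold tags.
  transition_score =
    tags
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.map(fn [from, to] -> transitions |> Enum.at(from) |> Enum.at(to) end)
    |> Enum.sum()

  emission_score + transition_score
end

# Two timesteps, gold path tag 1 -> tag 1:
# emissions 2.0 + 3.0, transition[1][1] = 0.3
gold_score.([[1.0, 2.0], [0.5, 3.0]], [1, 1], [[0.0, 0.1], [0.2, 0.3]])
# => 5.3
```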

crf_layer(logits, num_tags)

Adds a CRF layer to the model.

This layer learns tag transition probabilities and uses them during inference to produce globally optimal tag sequences.

Parameters

  • logits - Emission scores [batch, seq, num_tags]
  • num_tags - Number of tags

Returns

CRF layer output

crf_loss(logits, targets, transition_matrix, opts \\ [])

CRF loss function.

Computes the negative log-likelihood for a CRF layer. This considers transition probabilities between tags.

Parameters

  • logits - Model output logits [batch, seq, num_tags]
  • targets - True tag indices [batch, seq]
  • transition_matrix - Tag transition probabilities [num_tags, num_tags]
  • opts - Loss options

Returns

Scalar loss value.

Note

This is a simplified version. A full CRF implementation would include:

  • Forward-backward algorithm for partition function
  • Viterbi decoding for inference
  • Handling of variable-length sequences with masking
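The quantity being minimized ties the CRF helpers in this module together: the negative log-likelihood of the gold path is the log partition function minus the gold sequence score. Assuming `logits`, `targets`, `transition_matrix`, and `mask` are bound as in the parameters above, a sketch:

```elixir
# Per-batch CRF negative log-likelihood (sketch, not the library's
# exact implementation). Both terms are [batch]-shaped tensors.
log_z = crf_partition_function(logits, transition_matrix, mask)  # log Z(x)
gold = crf_gold_score(logits, targets, transition_matrix, mask)  # gold path score

loss = Nx.mean(Nx.subtract(log_z, gold))
```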

crf_partition_function(emissions, transitions, mask \\ nil)

Computes the partition function using the forward algorithm.

Uses log-space computation for numerical stability.

Parameters

  • emissions - Emission scores [batch, seq, num_tags]
  • transitions - Transition matrix [num_tags, num_tags]
  • mask - Sequence mask [batch, seq] (optional)

Returns

Log partition function [batch]
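Unbatched, the forward recursion is: alpha_0 = emissions_0, then alpha_t[j] = emissions_t[j] + logsumexp_i(alpha_{t-1}[i] + transitions[i][j]), and log Z is the logsumexp over the final alpha. A plain-Elixir sketch (illustrative only: nested lists instead of tensors, no masking):

```elixir
# Numerically stable log-sum-exp over a list of floats.
log_sum_exp = fn xs ->
  m = Enum.max(xs)
  m + :math.log(Enum.sum(Enum.map(xs, fn x -> :math.exp(x - m) end)))
end

# Log-space forward algorithm for one sequence (illustrative, unbatched).
partition = fn [first | rest], transitions ->
  final_alpha =
    Enum.reduce(rest, first, fn emit_t, alpha ->
      # alpha_t[j] = emit_t[j] + logsumexp_i(alpha[i] + transitions[i][j])
      emit_t
      |> Enum.with_index()
      |> Enum.map(fn {e_j, j} ->
        scores =
          alpha
          |> Enum.with_index()
          |> Enum.map(fn {a_i, i} -> a_i + (transitions |> Enum.at(i) |> Enum.at(j)) end)

        e_j + log_sum_exp.(scores)
      end)
    end)

  # log Z = logsumexp over the final alpha
  log_sum_exp.(final_alpha)
end
```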

default_config(opts \\ [])

@spec default_config(keyword()) :: map()

Returns default configuration for BiLSTM-CRF.

Parameters

  • opts - Optional overrides

Returns

Map with default configuration.

dependency_parsing_config(opts)

@spec dependency_parsing_config(keyword()) :: map()

Returns dependency-parsing-specific configuration.

Parameters

  • opts - Required and optional parameters

Returns

Map with dependency parsing configuration.

ner_config(opts)

@spec ner_config(keyword()) :: map()

Returns NER-specific configuration.

Parameters

  • opts - Required and optional parameters

Returns

Map with NER configuration.

pos_tagging_config(opts)

@spec pos_tagging_config(keyword()) :: map()

Returns POS-tagging-specific configuration.

Parameters

  • opts - Required and optional parameters

Returns

Map with POS tagging configuration.

reverse_sequence(layer)

Helper to reverse a sequence along the time axis.

This is used for backward LSTM processing.
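Assuming the conventional `[batch, seq, features]` layout, this amounts to flipping axis 1, e.g. with `Nx.reverse/2`:

```elixir
# Flip the time axis (axis 1); other axes are untouched.
t = Nx.iota({1, 3, 1})      # [[[0], [1], [2]]]
Nx.reverse(t, axes: [1])    # [[[2], [1], [0]]]
```

Note that for padded batches a plain reverse also moves padding to the front of the sequence, so variable-length handling still needs masking downstream.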

training_config(task, dataset_size)

@spec training_config(atom(), pos_integer()) :: map()

Example training configuration for BiLSTM-CRF.

Returns recommended hyperparameters based on task and dataset size.

Parameters

  • task - Task type: :pos_tagging, :ner, :chunking
  • dataset_size - Number of training examples

Returns

Map of recommended hyperparameters.
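A typical call might look like the following (the task atom and dataset size are illustrative; the exact keys of the returned map depend on the library's heuristics, so none are shown):

```elixir
config = BiLSTMCRF.training_config(:ner, 15_000)
```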

viterbi_decode(emission_scores, transition_matrix, opts \\ [])

Viterbi decoding for CRF inference.

Finds the most likely tag sequence given emission scores and transitions.

Parameters

  • emission_scores - Emission scores [batch, seq, num_tags]
  • transition_matrix - Transition matrix [num_tags, num_tags]
  • opts - Decoding options

Returns

Most likely tag sequence [batch, seq].

Note

This is a placeholder. Full implementation requires:

  • Dynamic programming for Viterbi algorithm
  • Handling of variable-length sequences
  • Efficient batched computation
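The dynamic program is the forward algorithm with max in place of logsumexp, plus backpointers for recovering the argmax path. A plain-Elixir sketch for a single sequence (illustrative only: unbatched, no masking, nested lists instead of tensors):

```elixir
# Viterbi for one sequence (illustrative, unbatched).
# emissions: one score list per timestep; transitions: num_tags x num_tags.
viterbi = fn [first | rest], transitions ->
  {final_scores, backpointers} =
    Enum.reduce(rest, {first, []}, fn emit_t, {scores, bps} ->
      # For each next tag j, pick the best previous tag i.
      {new_scores, bp_t} =
        emit_t
        |> Enum.with_index()
        |> Enum.map(fn {e_j, j} ->
          {best_score, best_i} =
            scores
            |> Enum.with_index()
            |> Enum.map(fn {s_i, i} ->
              {s_i + (transitions |> Enum.at(i) |> Enum.at(j)), i}
            end)
            |> Enum.max_by(&elem(&1, 0))

          {e_j + best_score, best_i}
        end)
        |> Enum.unzip()

      {new_scores, [bp_t | bps]}
    end)

  # Backtrack from the best final tag; bps head is the last timestep.
  last = final_scores |> Enum.with_index() |> Enum.max_by(&elem(&1, 0)) |> elem(1)

  Enum.reduce(backpointers, [last], fn bp_t, [tag | _] = path ->
    [Enum.at(bp_t, tag) | path]
  end)
end
```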