Nasty.Statistics.Neural.Architectures.BiLSTMCRF (Nasty v0.3.0)
Bidirectional LSTM with a Conditional Random Field (CRF) layer for sequence tagging.
This is a strong architecture for sequence labeling tasks such as POS tagging and NER, typically reaching 97-98% accuracy on standard POS tagging benchmarks (see Expected Performance below).
Architecture
Input (word IDs + optional character IDs)
|
v
Embedding Layer (word embeddings + optional char CNN)
|
v
BiLSTM Layer 1 (forward + backward)
|
v
Dropout
|
v
BiLSTM Layer 2 (optional, forward + backward)
|
v
Dropout
|
v
Dense Layer (project to tag space)
|
v
CRF Layer (structured prediction with transition matrix)
|
v
Output (tag sequence)
Key Features
- Bidirectional context: Captures both left and right context
- CRF decoding: Models transition probabilities between tags
- Character embeddings: Handles out-of-vocabulary words
- Dropout: Prevents overfitting
- Flexible depth: 1-3 LSTM layers
Expected Performance
- POS Tagging: 97-98% accuracy on Penn Treebank / UD
- NER: 88-92% F1 on CoNLL-2003
- Speed: roughly 1,000-5,000 tokens/second on CPU, 10,000+ on GPU
Usage
# Build model
model = BiLSTMCRF.build(
  vocab_size: 10_000,
  num_tags: 17,
  embedding_dim: 300,
  hidden_size: 256,
  num_layers: 2
)
# Train
{:ok, trained_state} = Trainer.train(
  fn -> model end,
  training_data,
  validation_data,
  epochs: 10
)
# Predict
{:ok, tags} = BiLSTMCRF.predict(model, trained_state, word_ids)
Summary
Functions
Builds a BiLSTM-CRF model.
Builds the BiLSTM stack.
Builds character-level CNN.
Builds a complete BiLSTM-CRF model with CRF layer.
CRF forward pass - returns normalized probabilities.
Computes the score of the gold (true) tag sequence.
Adds a CRF layer to the model.
CRF loss function.
Computes the partition function using forward algorithm.
Returns default configuration for BiLSTM-CRF.
Returns dependency parsing specific configuration.
Returns NER specific configuration.
Returns POS tagging specific configuration.
Helper to reverse sequence along time axis.
Example training configuration for BiLSTM-CRF.
Viterbi decoding for CRF inference.
Functions
Builds a BiLSTM-CRF model.
Options
- :vocab_size - Vocabulary size (required)
- :num_tags - Number of output tags (required)
- :embedding_dim - Word embedding dimension (default: 300)
- :hidden_size - LSTM hidden size (default: 256)
- :num_layers - Number of BiLSTM layers (default: 2)
- :dropout - Dropout rate (default: 0.3)
- :use_char_cnn - Add character-level CNN (default: false)
- :char_vocab_size - Character vocabulary size (default: 100)
- :char_embedding_dim - Character embedding dimension (default: 30)
- :char_filters - Character CNN filter sizes (default: [3, 4, 5])
- :char_num_filters - Number of filters per size (default: 30)
- :pretrained_embeddings - Pre-trained embedding matrix (default: nil)
- :freeze_embeddings - Freeze embedding weights (default: false)
Returns
An %Axon{} model ready for training.
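For illustration, a hedged call that also enables the character CNN; the option names follow the list above and the values are placeholders:
# Illustrative: word-level model plus character-level CNN features.
model = BiLSTMCRF.build(
  vocab_size: 10_000,
  num_tags: 17,
  use_char_cnn: true,
  char_filters: [3, 4, 5],
  char_num_filters: 30
)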
Builds the BiLSTM stack.
Parameters
- input - Input tensor
- hidden_size - LSTM hidden size
- num_layers - Number of layers
- dropout - Dropout rate
Returns
Axon layer representing the BiLSTM stack.
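Conceptually, each bidirectional layer runs one LSTM left-to-right and a second one over the reversed sequence, then concatenates the two output sequences. A minimal sketch of a single layer, assuming Axon's lstm and concatenate layers; reverse_sequence/1 stands in for the module's helper documented further below, lifted into the graph (e.g. via Axon.nx):
# One bidirectional layer: a forward LSTM plus an LSTM over the
# reversed input, with outputs concatenated along the feature axis.
{fwd, _state} = Axon.lstm(input, hidden_size, name: "lstm_fwd")

{bwd, _state} =
  input
  |> reverse_sequence()
  |> Axon.lstm(hidden_size, name: "lstm_bwd")

bilstm = Axon.concatenate([fwd, reverse_sequence(bwd)], axis: -1)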
Builds character-level CNN.
Parameters
- char_input - Character ID input [batch, seq, char_seq]
- vocab_size - Character vocabulary size
- embedding_dim - Character embedding dimension
- filter_sizes - List of filter sizes (e.g., [3, 4, 5])
- num_filters - Number of filters per size
Returns
Axon layer with character-level features.
Builds a complete BiLSTM-CRF model with CRF layer.
This is a more advanced version that includes proper CRF decoding. Requires custom Axon layers for CRF forward-backward and Viterbi.
Options
Same as build/1, plus:
- :use_crf - Use full CRF layer (default: false, uses softmax instead)
- :transition_init - Transition matrix initialization (default: :random)
Returns
An %Axon{} model with CRF output layer.
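Assuming this variant is exposed as build_with_crf/1 (the exact name may differ), a hedged call:
# Hypothetical: build with a full CRF head instead of a softmax output.
model = BiLSTMCRF.build_with_crf(
  vocab_size: 10_000,
  num_tags: 17,
  use_crf: true,
  transition_init: :random
)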
CRF forward pass - returns normalized probabilities.
Parameters
- emissions - Emission scores [batch, seq, num_tags]
- transitions - Transition matrix [num_tags, num_tags]
Returns
Normalized CRF scores [batch, seq, num_tags]
Computes the score of the gold (true) tag sequence.
Parameters
- emissions - Emission scores [batch, seq, num_tags]
- tags - True tag sequence [batch, seq]
- transitions - Transition matrix [num_tags, num_tags]
- mask - Sequence mask [batch, seq] (optional)
Returns
Gold sequence scores [batch]
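As a rough Nx sketch for a single sequence (batch dimension and mask omitted), where emissions is [seq, num_tags] and tags is [seq]:
# Emission part: pick each step's score for its gold tag.
emission_score =
  emissions
  |> Nx.take_along_axis(Nx.new_axis(tags, -1), axis: 1)
  |> Nx.sum()

# Transition part: score of each consecutive (prev, next) tag pair.
transition_score =
  transitions
  |> Nx.gather(Nx.stack([tags[0..-2//1], tags[1..-1//1]], axis: -1))
  |> Nx.sum()

gold = Nx.add(emission_score, transition_score)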
Adds a CRF layer to the model.
This layer learns tag transition probabilities and uses them during inference to produce globally optimal tag sequences.
Parameters
- logits - Emission scores [batch, seq, num_tags]
- num_tags - Number of tags
Returns
CRF layer output
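A heavily hedged Axon sketch, assuming Axon.param/2 and Axon.layer/3 for custom layers (check your Axon version's exact signatures); crf_decode/2 is a hypothetical stand-in for the CRF computation:
# Assumption: learn the [num_tags, num_tags] transition matrix as a
# trainable parameter and apply it in a custom layer.
transitions = Axon.param("transitions", {num_tags, num_tags})

crf =
  Axon.layer(
    fn logits, transitions, _opts -> crf_decode(logits, transitions) end,
    [logits, transitions],
    name: "crf"
  )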
CRF loss function.
Computes the negative log-likelihood for a CRF layer. This considers transition probabilities between tags.
Parameters
- logits - Model output logits [batch, seq, num_tags]
- targets - True tag indices [batch, seq]
- transition_matrix - Tag transition probabilities [num_tags, num_tags]
- opts - Loss options
Returns
Scalar loss value.
Note
This is a simplified version. A full CRF implementation would include:
- Forward-backward algorithm for partition function
- Viterbi decoding for inference
- Handling of variable-length sequences with masking
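In code, the negative log-likelihood is the log partition function minus the gold path score, averaged over the batch. A minimal sketch, where partition/2 and gold_score/3 are hypothetical wrappers around the two scoring functions documented on this page:
# NLL = log Z(x) - score(x, y), averaged over the batch.
log_z = partition(emissions, transitions)          # [batch]
gold = gold_score(emissions, tags, transitions)    # [batch]
loss = Nx.mean(Nx.subtract(log_z, gold))           # scalar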
Computes the partition function using forward algorithm.
Uses log-space computation for numerical stability.
Parameters
- emissions - Emission scores [batch, seq, num_tags]
- transitions - Transition matrix [num_tags, num_tags]
- mask - Sequence mask [batch, seq] (optional)
Returns
Log partition function [batch]
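A compact sketch of the forward recursion for a single sequence (batching and masking omitted), with an inline log-sum-exp for numerical stability; emissions is [seq, num_tags] and transitions is [num_tags, num_tags]:
# alpha[j] accumulates the log-sum of all paths ending in tag j.
log_sum_exp = fn t, axis ->
  max = Nx.reduce_max(t, axes: [axis], keep_axes: true)

  t
  |> Nx.subtract(max)
  |> Nx.exp()
  |> Nx.sum(axes: [axis])
  |> Nx.log()
  |> Nx.add(Nx.squeeze(max, axes: [axis]))
end

seq_len = Nx.axis_size(emissions, 0)

alpha =
  Enum.reduce(1..(seq_len - 1)//1, emissions[0], fn t, alpha ->
    # scores[i][j] = alpha[i] + transitions[i][j] + emissions[t][j]
    scores =
      alpha
      |> Nx.new_axis(-1)
      |> Nx.add(transitions)
      |> Nx.add(Nx.new_axis(emissions[t], 0))

    log_sum_exp.(scores, 0)
  end)

log_partition = log_sum_exp.(alpha, 0)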
Returns default configuration for BiLSTM-CRF.
Parameters
- opts - Optional overrides
Returns
Map with default configuration.
Returns dependency parsing specific configuration.
Parameters
- opts - Required and optional parameters
Returns
Map with dependency parsing configuration.
Returns NER specific configuration.
Parameters
- opts - Required and optional parameters
Returns
Map with NER configuration.
Returns POS tagging specific configuration.
Parameters
- opts - Required and optional parameters
Returns
Map with POS tagging configuration.
Helper to reverse sequence along time axis.
This is used for backward LSTM processing.
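For fixed-length batches this amounts to flipping axis 1 of a [batch, seq, features] tensor; a one-line Nx sketch:
# Reverse the time axis so a second LSTM can read right-to-left.
reversed = Nx.reverse(tensor, axes: [1])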
@spec training_config(atom(), pos_integer()) :: map()
Example training configuration for BiLSTM-CRF.
Returns recommended hyperparameters based on task and dataset size.
Parameters
- task - Task type: :pos_tagging, :ner, or :chunking
- dataset_size - Number of training examples
Returns
Map of recommended hyperparameters.
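A hedged usage example (the returned values are task-tuned recommendations, not shown here):
# Recommended hyperparameters for NER with ~15k training examples.
config = BiLSTMCRF.training_config(:ner, 15_000)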
Viterbi decoding for CRF inference.
Finds the most likely tag sequence given emission scores and transitions.
Parameters
- emission_scores - Emission probabilities [batch, seq, num_tags]
- transition_matrix - Transition probabilities [num_tags, num_tags]
- opts - Decoding options
Returns
Most likely tag sequence [batch, seq].
Note
This is a placeholder. Full implementation requires:
- Dynamic programming for Viterbi algorithm
- Handling of variable-length sequences
- Efficient batched computation
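For reference, a single-sequence Viterbi sketch in Nx/Elixir (batching and masking omitted), tracking best scores and backpointers; emissions is [seq, num_tags] and transitions is [num_tags, num_tags]:
# delta[j] = best log-score of any path ending in tag j at step t;
# backpointers record the argmax predecessor for each tag.
seq_len = Nx.axis_size(emissions, 0)

{delta, backpointers} =
  Enum.reduce(1..(seq_len - 1)//1, {emissions[0], []}, fn t, {delta, bps} ->
    # scores[i][j] = delta[i] + transitions[i][j]
    scores = Nx.add(Nx.new_axis(delta, -1), transitions)
    best_prev = Nx.argmax(scores, axis: 0)
    best_score = Nx.reduce_max(scores, axes: [0])
    {Nx.add(best_score, emissions[t]), [best_prev | bps]}
  end)

# Backtrack: start from the best final tag and follow stored pointers.
last_tag = delta |> Nx.argmax() |> Nx.to_number()

{path, _} =
  Enum.reduce(backpointers, {[last_tag], last_tag}, fn bp, {path, tag} ->
    prev = bp[tag] |> Nx.to_number()
    {[prev | path], prev}
  end)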