Training Neural Models Guide

This guide provides detailed instructions for training neural models in Nasty, from data preparation to deployment.

Table of Contents

  1. Prerequisites
  2. Data Preparation
  3. Training POS Tagging Models
  4. Advanced Training Options
  5. Model Evaluation
  6. Troubleshooting

Prerequisites

System Requirements

  • Memory: Minimum 4GB RAM for training, 8GB+ recommended
  • CPU: Multi-core CPU (4+ cores recommended)
  • GPU: Optional but highly recommended (10-100x speedup with EXLA)
  • Storage: 500MB-2GB for models and training data

Dependencies

All neural dependencies are included in mix.exs:

{:axon, "~> 0.7"},
{:nx, "~> 0.9"},
{:exla, "~> 0.9"},
{:bumblebee, "~> 0.6"}

Install with:

mix deps.get

Enable GPU Acceleration (Optional)

Set environment variable for EXLA to use GPU:

export XLA_TARGET=cuda120  # or cuda118, rocm, etc.
mix deps.compile
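
You may also want to make EXLA the default Nx backend; this is a standard Nx/EXLA setting rather than anything Nasty-specific:

# config/config.exs
import Config

config :nx, default_backend: EXLA.Backend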

Data Preparation

CoNLL-U Format

Neural models train on CoNLL-U formatted data. Each sentence is separated by blank lines, with one token per line:

1	The	the	DET	DT	_	2	det	_	_
2	cat	cat	NOUN	NN	_	3	nsubj	_	_
3	sat	sit	VERB	VBD	_	0	root	_	_

1	Dogs	dog	NOUN	NNS	_	2	nsubj	_	_
2	run	run	VERB	VBP	_	0	root	_	_

Columns (tab-separated):

  1. Index
  2. Word form
  3. Lemma
  4. UPOS tag (used for training)
  5. XPOS tag
  6. Features
  7. Head
  8. Dependency relation
  9-10. Additional annotations
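
To make the column layout concrete, here is a small illustrative Elixir sketch that pulls out the word form (column 2) and UPOS tag (column 4) from CoNLL-U text. In practice DataLoader.load_conllu_file (used later in this guide) handles this for you; the module name here is hypothetical.

defmodule ConlluSketch do
  # Split the text into sentences on blank lines, skip comment lines,
  # and keep only {form, upos} pairs from each token line.
  def sentences(text) do
    text
    |> String.split(~r/\n\s*\n/, trim: true)
    |> Enum.map(&parse_sentence/1)
  end

  defp parse_sentence(block) do
    block
    |> String.split("\n", trim: true)
    |> Enum.reject(&String.starts_with?(&1, "#"))
    |> Enum.map(fn line ->
      [_id, form, _lemma, upos | _rest] = String.split(line, "\t")
      {form, upos}
    end)
  end
end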

Where to Get Training Data

Universal Dependencies corpora are the standard source: https://universaldependencies.org/

Download and extract:

cd data
git clone https://github.com/UniversalDependencies/UD_English-EWT

Data Split Recommendations

  • Training: 80% (or use provided train split)
  • Validation: 10% (or use provided dev split)
  • Test: 10% (or use provided test split)

The training pipeline handles splitting automatically if you provide a single file.

Training POS Tagging Models

Quick Start - CLI Training

The easiest way to train is using the Mix task:

mix nasty.train.neural_pos \
  --corpus data/UD_English-EWT/en_ewt-ud-train.conllu \
  --output models/pos_neural_v1.axon \
  --epochs 10 \
  --batch-size 32

CLI Options Reference

mix nasty.train.neural_pos [options]

Required:
  --corpus PATH          Path to CoNLL-U training corpus

Optional:
  --output PATH          Model save path (default: pos_neural.axon)
  --validation PATH      Path to validation corpus (auto-split if not provided)
  --epochs N             Number of training epochs (default: 10)
  --batch-size N         Batch size (default: 32)
  --learning-rate F      Learning rate (default: 0.001)
  --hidden-size N        LSTM hidden size (default: 256)
  --embedding-dim N      Word embedding dimension (default: 300)
  --num-layers N         Number of LSTM layers (default: 2)
  --dropout F            Dropout rate (default: 0.3)
  --use-char-cnn         Enable character CNN (default: enabled; disable with --no-char-cnn)
  --char-embedding-dim N Character embedding dim (default: 50)
  --optimizer NAME       Optimizer: adam, sgd, adamw (default: adam)
  --early-stopping N     Early stopping patience (default: 3)
  --checkpoint-dir PATH  Save checkpoints during training
  --min-freq N           Min word frequency for vocab (default: 1)
  --validation-split F   Validation split fraction (default: 0.1)

Programmatic Training

For more control, train programmatically:

alias Nasty.Statistics.POSTagging.NeuralTagger
alias Nasty.Statistics.Neural.DataLoader

# Load training data
{:ok, sentences} = DataLoader.load_conllu_file("data/train.conllu")

# Split into train/validation
{train_data, valid_data} = DataLoader.split_data(sentences, validation_split: 0.1)

# Create and configure tagger
tagger = NeuralTagger.new(training_data: train_data)

# Train with custom options
{:ok, trained_tagger} = NeuralTagger.train(tagger, train_data,
  epochs: 20,
  batch_size: 32,
  learning_rate: 0.001,
  hidden_size: 512,
  embedding_dim: 300,
  num_lstm_layers: 3,
  dropout: 0.5,
  use_char_cnn: true,
  validation_data: valid_data,
  early_stopping_patience: 5
)

# Save trained model
:ok = NeuralTagger.save(trained_tagger, "models/pos_advanced.axon")

Advanced Training Options

Hyperparameter Tuning

Hidden Size (--hidden-size):

  • Small (128-256): Faster training, less memory, slightly lower accuracy
  • Medium (256-512): Balanced performance (default: 256)
  • Large (512-1024): Best accuracy, requires more memory/time

Embedding Dimension (--embedding-dim):

  • Small (50-100): Fast, low memory
  • Medium (300): Good balance (default, matches GloVe)
  • Large (300-1024): For very large corpora

Number of LSTM Layers (--num-layers):

  • 1 layer: Fast, simple patterns
  • 2 layers: Balanced (default, recommended)
  • 3+ layers: Complex patterns, risk overfitting

Dropout (--dropout):

  • 0.0: No regularization (risk overfitting)
  • 0.3: Good default
  • 0.5: Strong regularization for small datasets

Batch Size (--batch-size):

  • Small (8-16): Better generalization, slower
  • Medium (32): Good balance (default)
  • Large (64-128): Faster training, needs more memory
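
These flags map onto the keyword options accepted by NeuralTagger.train in the programmatic example above. As an illustration only, a larger configuration expressed programmatically (option names taken from that example) might look like:

# Illustrative: a "large model" configuration for NeuralTagger.train
large_opts = [
  epochs: 15,
  batch_size: 64,
  learning_rate: 1.0e-3,
  hidden_size: 512,
  embedding_dim: 300,
  num_lstm_layers: 3,
  dropout: 0.3,
  use_char_cnn: true
]

{:ok, trained_tagger} = NeuralTagger.train(tagger, train_data, large_opts)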

Character CNN Configuration

Character-level CNN helps with out-of-vocabulary words:

mix nasty.train.neural_pos \
  --corpus data/train.conllu \
  --use-char-cnn \
  --char-embedding-dim 50 \
  --char-vocab-size 150

Disable if training is too slow:

mix nasty.train.neural_pos \
  --corpus data/train.conllu \
  --no-char-cnn

Using Pre-trained Embeddings

Load GloVe embeddings for better initialization:

alias Nasty.Statistics.Neural.Embeddings

# Load GloVe vectors
glove_embeddings = Embeddings.load_glove("data/glove.6B.300d.txt", word_vocab)

# Train with pre-trained embeddings
{:ok, tagger} = NeuralTagger.train(base_tagger, train_data,
  pretrained_embeddings: glove_embeddings,
  freeze_embeddings: false  # Allow fine-tuning
)

Note: GloVe loading is currently a placeholder. Full implementation coming soon.

Optimizer Selection

Adam (default):

  • Adaptive learning rates
  • Works well out-of-the-box
  • Good for most use cases

SGD:

  • Simple, stable
  • May need learning rate scheduling
  • Good baseline

AdamW:

  • Adam with weight decay
  • Better generalization
  • Recommended for large models

mix nasty.train.neural_pos \
  --corpus data/train.conllu \
  --optimizer adamw \
  --learning-rate 0.0001
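
How the --optimizer flag is wired internally is not shown here; as background, Axon training loops accept optimizers built with the Polaris library, so a hypothetical helper mapping the three names onto Polaris constructors could look like:

defmodule OptimizerChoice do
  # Hypothetical helper (not part of Nasty): builds a Polaris optimizer
  # of the kind an Axon training loop accepts.
  def build(:adam, lr), do: Polaris.Optimizers.adam(learning_rate: lr)
  def build(:sgd, lr), do: Polaris.Optimizers.sgd(learning_rate: lr)
  def build(:adamw, lr), do: Polaris.Optimizers.adamw(learning_rate: lr)
end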

Early Stopping

Automatically stop training when validation performance plateaus:

mix nasty.train.neural_pos \
  --corpus data/train.conllu \
  --validation data/dev.conllu \
  --early-stopping 5  # Stop after 5 epochs without improvement

Checkpointing

Save model checkpoints during training:

mix nasty.train.neural_pos \
  --corpus data/train.conllu \
  --checkpoint-dir checkpoints/ \
  --checkpoint-frequency 2  # Save every 2 epochs

Checkpoints are named: checkpoint_epoch_001.axon, checkpoint_epoch_002.axon, etc.

Model Evaluation

During Training

The training task prints per-tag metrics:

Epoch 1/10
  Loss: 0.456
  Accuracy: 0.923
  
Per-tag accuracy:
  NOUN: 0.957
  VERB: 0.942
  DET: 0.989
  ...
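
The same per-tag breakdown can be computed offline; a minimal sketch, assuming parallel lists gold_tags and pred_tags for a tagged corpus:

# Per-tag accuracy from parallel gold/predicted tag lists (illustrative)
per_tag =
  gold_tags
  |> Enum.zip(pred_tags)
  |> Enum.group_by(fn {gold, _pred} -> gold end)
  |> Map.new(fn {tag, pairs} ->
    correct = Enum.count(pairs, fn {gold, pred} -> gold == pred end)
    {tag, correct / length(pairs)}
  end)

Enum.each(per_tag, fn {tag, acc} -> IO.puts("  #{tag}: #{Float.round(acc, 3)}") end)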

Post-Training Evaluation

Evaluate on test set:

mix nasty.eval.neural_pos \
  --model models/pos_neural_v1.axon \
  --test data/en_ewt-ud-test.conllu

Or programmatically:

{:ok, model} = NeuralTagger.load("models/pos_neural_v1.axon")
{:ok, test_sentences} = DataLoader.load_conllu_file("data/test.conllu")

# Evaluate: accumulate counts with Enum.reduce (rebinding inside a `for`
# comprehension would not carry across iterations)
{correct, total} =
  Enum.reduce(test_sentences, {0, 0}, fn {words, gold_tags}, {correct, total} ->
    {:ok, pred_tags} = NeuralTagger.predict(model, words, [])

    matches = Enum.count(Enum.zip(pred_tags, gold_tags), fn {p, g} -> p == g end)
    {correct + matches, total + length(gold_tags)}
  end)

accuracy = correct / total
IO.puts("Accuracy: #{Float.round(accuracy * 100, 2)}%")

Metrics to Track

  • Overall Accuracy: Percentage of correctly tagged tokens
  • Per-Tag Accuracy: Accuracy for each POS tag
  • Per-Tag Precision/Recall: For detailed error analysis
  • OOV Accuracy: Performance on out-of-vocabulary words (see the sketch after this list)
  • Training Time: Total time and time per epoch
  • Convergence: Number of epochs to best validation score
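
OOV accuracy needs a little extra bookkeeping. A minimal sketch, assuming train_vocab is a MapSet of word forms seen during training and test_sentences yields {words, gold_tags} pairs as above:

# OOV accuracy (illustrative): only count tokens whose form was not seen in training
{oov_correct, oov_total} =
  Enum.reduce(test_sentences, {0, 0}, fn {words, gold_tags}, acc ->
    {:ok, pred_tags} = NeuralTagger.predict(model, words, [])

    words
    |> Enum.zip(Enum.zip(gold_tags, pred_tags))
    |> Enum.reduce(acc, fn {word, {gold, pred}}, {correct, total} ->
      cond do
        MapSet.member?(train_vocab, word) -> {correct, total}
        gold == pred -> {correct + 1, total + 1}
        true -> {correct, total + 1}
      end
    end)
  end)

IO.puts("OOV accuracy: #{Float.round(oov_correct / oov_total * 100, 2)}%")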

Troubleshooting

Out of Memory

Symptoms: Process crashes with memory error

Solutions:

  1. Reduce batch size: --batch-size 16 or --batch-size 8
  2. Reduce hidden size: --hidden-size 128
  3. Reduce embedding dimension: --embedding-dim 100
  4. Disable character CNN: --no-char-cnn
  5. Use smaller training corpus subset

Training Too Slow

Symptoms: Hours per epoch

Solutions:

  1. Enable EXLA GPU support (see Prerequisites)
  2. Increase batch size: --batch-size 64
  3. Disable character CNN if not needed
  4. Use fewer LSTM layers: --num-layers 1
  5. Reduce hidden size: --hidden-size 128

Overfitting

Symptoms: High training accuracy, low validation accuracy

Solutions:

  1. Increase dropout: --dropout 0.5
  2. Use more training data
  3. Enable early stopping: --early-stopping 3
  4. Reduce model complexity (fewer layers, smaller hidden size)
  5. Add L2 regularization

Underfitting

Symptoms: Low training and validation accuracy

Solutions:

  1. Increase model capacity: --hidden-size 512 --num-layers 3
  2. Train longer: --epochs 20
  3. Lower dropout: --dropout 0.2
  4. Increase learning rate: --learning-rate 0.01
  5. Check data quality (wrong labels, formatting issues)

Validation Loss Not Decreasing

Symptoms: Validation loss stays flat or increases

Solutions:

  1. Lower learning rate: --learning-rate 0.0001
  2. Add early stopping
  3. Check for data issues (train/validation overlap, different distributions)
  4. Try different optimizer: --optimizer adamw

CoNLL-U Loading Errors

Symptoms: Parser errors, wrong tag counts

Solutions:

  1. Verify file format (tab-separated, 10 columns; see the sanity check after this list)
  2. Check for empty lines between sentences
  3. Ensure UTF-8 encoding
  4. Remove or fix malformed lines
  5. Validate with UD validator: https://universaldependencies.org/tools.html
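
A quick way to spot malformed token lines (illustrative; adjust the path):

# Print the first few non-comment lines without exactly 10 tab-separated columns
"data/train.conllu"
|> File.stream!()
|> Stream.map(&String.trim_trailing(&1, "\n"))
|> Stream.reject(&(&1 == "" or String.starts_with?(&1, "#")))
|> Stream.filter(fn line -> length(String.split(line, "\t")) != 10 end)
|> Enum.take(5)
|> Enum.each(&IO.puts("Malformed: #{&1}"))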

Model Not Learning

Symptoms: Loss stays constant, accuracy at baseline

Solutions:

  1. Check data quality (are labels correct?)
  2. Verify vocabulary is being built correctly
  3. Increase learning rate: --learning-rate 0.01
  4. Remove or reduce dropout initially
  5. Check for bugs in data preprocessing

Best Practices

For Small Datasets (<5K sentences)

mix nasty.train.neural_pos \
  --corpus data/small_corpus.conllu \
  --epochs 20 \
  --batch-size 16 \
  --hidden-size 128 \
  --embedding-dim 100 \
  --dropout 0.5 \
  --early-stopping 5 \
  --no-char-cnn

For Medium Datasets (5K-50K sentences)

mix nasty.train.neural_pos \
  --corpus data/medium_corpus.conllu \
  --epochs 15 \
  --batch-size 32 \
  --hidden-size 256 \
  --embedding-dim 300 \
  --dropout 0.3 \
  --use-char-cnn \
  --early-stopping 3

For Large Datasets (50K+ sentences)

mix nasty.train.neural_pos \
  --corpus data/large_corpus.conllu \
  --epochs 10 \
  --batch-size 64 \
  --hidden-size 512 \
  --embedding-dim 300 \
  --num-layers 3 \
  --dropout 0.3 \
  --use-char-cnn \
  --optimizer adamw \
  --learning-rate 0.0001

Production Deployment

After training, deploy your model:

  1. Save the trained model:

    # Model is already saved by training task
    ls -lh models/pos_neural_v1.axon
    
  2. Load in production:

    {:ok, model} = NeuralTagger.load("models/pos_neural_v1.axon")

  3. Integrate with POSTagger:

    # Use neural mode
    {:ok, ast} = Nasty.parse(text, language: :en, model: :neural, neural_model: model)
    
    # Or use ensemble mode
    {:ok, ast} = Nasty.parse(text, language: :en, model: :neural_ensemble, neural_model: model)

  4. Monitor performance:

    • Track accuracy on representative sample
    • Monitor latency (should be <100ms per sentence on CPU; see the sketch below)
    • Watch memory usage
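
A rough latency check (illustrative sketch, reusing model and test_sentences from the evaluation section):

# Average per-sentence prediction latency over a small sample
sample = Enum.take(test_sentences, 100)

{micros, _} =
  :timer.tc(fn ->
    Enum.each(sample, fn {words, _gold_tags} ->
      {:ok, _tags} = NeuralTagger.predict(model, words, [])
    end)
  end)

IO.puts("Avg latency: #{Float.round(micros / 1000 / length(sample), 1)} ms/sentence")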

Next Steps