Fine-tuning Transformers Guide
Complete guide to fine-tuning pre-trained transformer models on custom datasets in Nasty.
Overview
Fine-tuning adapts a pre-trained transformer (BERT, RoBERTa, etc.) to your specific NLP task. Instead of training from scratch, you:
- Start with a model trained on billions of tokens
- Train for a few epochs on your task-specific data (1000+ examples)
- Achieve state-of-the-art accuracy in minutes/hours instead of days/weeks
Benefits:
- 98-99% POS tagging accuracy (vs 97-98% BiLSTM-CRF)
- 93-95% NER F1 score (vs 75-80% rule-based)
- 10-100x less training data required
- Transfer learning from massive pre-training
Quick Start
# Fine-tune RoBERTa for POS tagging
mix nasty.fine_tune.pos \
--model roberta_base \
--train data/en_ewt-ud-train.conllu \
--validation data/en_ewt-ud-dev.conllu \
--output models/pos_finetuned \
--epochs 3 \
--batch-size 16
# Fine-tune time: 10-30 minutes (CPU), 2-5 minutes (GPU)
# Result: 98-99% accuracy on UD English
Prerequisites
System Requirements
- Memory: 8GB+ RAM (16GB recommended)
- Storage: 2GB for models and data
- GPU: Optional but highly recommended (10-30x speedup with EXLA)
- Time: 10-30 minutes per run (CPU), 2-5 minutes (GPU)
Required Data
Training data must be in CoNLL-U format:
1 The the DET DT _ 2 det _ _
2 cat cat NOUN NN _ 3 nsubj _ _
3 sat sit VERB VBD _ 0 root _ _
1 Dogs dog NOUN NNS _ 2 nsubj _ _
2 run run VERB VBP _ 0 root _ _
Download Universal Dependencies corpora:
- English: UD_English-EWT
- Spanish: UD_Spanish-GSD
- More: Universal Dependencies
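Nasty's DataLoader (used later in this guide) loads CoNLL-U files for you; the sketch below only illustrates the format. It assumes the standard tab-separated, 10-column CoNLL-U layout and keeps just the fields relevant to POS tagging; the ConlluSketch module name is made up for this example.
# Minimal CoNLL-U reader sketch (use DataLoader.load_conllu_file/1 in practice).
# Comment lines start with "#", token lines have 10 tab-separated fields,
# and blank lines separate sentences (ignored here).
defmodule ConlluSketch do
  def read_tokens(path) do
    path
    |> File.stream!()
    |> Stream.map(&String.trim_trailing/1)
    |> Stream.reject(fn line -> line == "" or String.starts_with?(line, "#") end)
    |> Enum.map(fn line ->
      [_id, form, lemma, upos | _rest] = String.split(line, "\t")
      %{form: form, lemma: lemma, upos: upos}
    end)
  end
end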
POS Tagging Fine-tuning
Basic Usage
mix nasty.fine_tune.pos \
--model roberta_base \
--train data/train.conllu \
--epochs 3
Full Configuration
mix nasty.fine_tune.pos \
--model bert_base_cased \
--train data/en_ewt-ud-train.conllu \
--validation data/en_ewt-ud-dev.conllu \
--output models/pos_bert_finetuned \
--epochs 5 \
--batch-size 32 \
--learning-rate 0.00002 \
--max-length 512 \
--eval-steps 500
Options Reference
| Option | Description | Default |
|---|---|---|
| --model | Base transformer (required) | - |
| --train | Training CoNLL-U file (required) | - |
| --validation | Validation file | None |
| --output | Output directory | priv/models/finetuned |
| --epochs | Training epochs | 3 |
| --batch-size | Batch size | 16 |
| --learning-rate | Learning rate | 3e-5 |
| --max-length | Max sequence length | 512 |
| --eval-steps | Evaluate every N steps | 500 |
Supported Models
English Models
bert-base-cased (110M params):
- Best for: Case-sensitive tasks, proper nouns
- Memory: ~500MB
- Speed: Medium
roberta-base (125M params):
- Best for: General purpose, highest accuracy
- Memory: ~550MB
- Speed: Medium
- Recommended for most tasks
distilbert-base (66M params):
- Best for: Fast inference, lower memory
- Memory: ~300MB
- Speed: Fast
- Accuracy: ~97% (vs 98% full BERT)
Multilingual Models
xlm-roberta-base (270M params):
- Languages: 100 languages
- Best for: Spanish, Catalan, multilingual
- Memory: ~1.1GB
- Cross-lingual transfer: 90-95% of monolingual
bert-base-multilingual-cased (110M params):
- Languages: 104 languages
- Good baseline for many languages
- Memory: ~500MB
Data Preparation
Minimum Dataset Size
| Task | Minimum | Recommended | Optimal |
|---|---|---|---|
| POS Tagging | 1,000 sentences | 5,000 sentences | 10,000+ sentences |
| NER | 500 sentences | 2,000 sentences | 5,000+ sentences |
| Classification | 100 examples/class | 500 examples/class | 1,000+ examples/class |
Data Splitting
Standard split ratios:
Total data: 12,000 sentences
Training: 9,600 (80%)
Validation: 1,200 (10%)
Test: 1,200 (10%)
Data Quality Checklist
- [ ] Consistent annotation scheme (use Universal Dependencies)
- [ ] Balanced representation across domains (news, social media, technical)
- [ ] Clean text (no encoding errors, proper Unicode)
- [ ] No data leakage (train/val/test are disjoint)
- [ ] Representative of production data
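To produce a disjoint 80/10/10 split like the one above (and avoid the data-leakage problem flagged in the checklist), a plain Elixir sketch, assuming the sentences are already loaded into a list named sentences, is:
# Shuffle once, then take disjoint 80/10/10 slices.
:rand.seed(:exsss, {42, 42, 42})   # fixed seed for a reproducible split
shuffled = Enum.shuffle(sentences)
n = length(shuffled)
{train, rest} = Enum.split(shuffled, div(n * 8, 10))
{validation, test} = Enum.split(rest, div(n, 10))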
Hyperparameter Tuning
Learning Rate
Most important hyperparameter!
# Too high: Model doesn't converge
--learning-rate 0.001 # DON'T USE
# Too low: Learning is very slow
--learning-rate 0.000001 # DON'T USE
# Good defaults:
--learning-rate 0.00003 # RoBERTa, BERT (3e-5)
--learning-rate 0.00002 # DistilBERT (2e-5)
--learning-rate 0.00005 # XLM-RoBERTa (5e-5)
Batch Size
Balance between speed and memory:
# Small dataset or low memory
--batch-size 8
# Balanced (recommended)
--batch-size 16
# Large dataset, lots of memory
--batch-size 32
# Very large dataset, GPU
--batch-size 64
Memory usage by batch size:
- Batch 8: ~2GB GPU memory
- Batch 16: ~4GB GPU memory
- Batch 32: ~8GB GPU memory
- Batch 64: ~16GB GPU memory
Number of Epochs
# Small dataset (1K-5K examples)
--epochs 5
# Medium dataset (5K-20K examples)
--epochs 3
# Large dataset (20K+ examples)
--epochs 2
Rule of thumb: Stop when validation loss plateaus (use validation set!)
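One way to apply that rule programmatically (a generic sketch, not an option of the mix task): keep the validation loss from each evaluation and stop once the best value has not improved for a few evaluations.
# Stop once validation loss has not improved for `patience` evaluations.
# `losses` holds validation losses in the order they were observed.
early_stop? = fn losses, patience ->
  case losses do
    [] ->
      false

    _ ->
      best = Enum.min(losses)
      recent = Enum.take(losses, -patience)
      length(losses) > patience and Enum.all?(recent, &(&1 > best))
  end
end

early_stop?.([0.32, 0.19, 0.15, 0.151, 0.152, 0.153], 3)
# => true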
Max Sequence Length
# Short texts (tweets, titles)
--max-length 128 # Faster, uses less memory
# Normal texts (sentences, paragraphs)
--max-length 512 # Default, good balance
# Long texts (documents)
--max-length 1024 # Slower, uses more memory; note that standard BERT/RoBERTa checkpoints only support 512 positions
Programmatic Fine-tuning
For more control, use the API directly:
alias Nasty.Statistics.Neural.Transformers.{Loader, FineTuner, DataPreprocessor}
alias Nasty.Statistics.Neural.DataLoader
# Load base model
{:ok, base_model} = Loader.load_model(:roberta_base)
# Load training data
{:ok, train_sentences} = DataLoader.load_conllu_file("data/train.conllu")
# Prepare examples
training_data =
Enum.map(train_sentences, fn sentence ->
tokens = sentence.tokens
labels = Enum.map(tokens, & &1.pos)
{tokens, labels}
end)
# Create label map (UPOS tags)
label_map = %{
0 => "ADJ", 1 => "ADP", 2 => "ADV", 3 => "AUX",
4 => "CCONJ", 5 => "DET", 6 => "INTJ", 7 => "NOUN",
8 => "NUM", 9 => "PART", 10 => "PRON", 11 => "PROPN",
12 => "PUNCT", 13 => "SCONJ", 14 => "SYM", 15 => "VERB", 16 => "X"
}
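# Alternatively (illustrative sketch, not part of the documented API), the label
# map can be derived from the training data itself. The index order is arbitrary
# but must stay fixed between training and inference, and its size must match
# num_labels below.
derived_label_map =
  training_data
  |> Enum.flat_map(fn {_tokens, labels} -> labels end)
  |> Enum.uniq()
  |> Enum.sort()
  |> Enum.with_index()
  |> Map.new(fn {label, index} -> {index, label} end)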
# Fine-tune
{:ok, finetuned} = FineTuner.fine_tune(
base_model,
training_data,
:pos_tagging,
num_labels: 17,
label_map: label_map,
epochs: 3,
batch_size: 16,
learning_rate: 3.0e-5
)
# Save
File.write!("models/pos_finetuned.axon", :erlang.term_to_binary(finetuned))
Evaluation
During Training
The CLI automatically evaluates on the validation set:
Fine-tuning POS tagger
Model: roberta_base
Training data: data/train.conllu
Output: models/pos_finetuned
Loading base model...
Model loaded: roberta_base
Loading training data...
Training examples: 8,724
Validation examples: 1,091
Number of POS tags: 17
Starting fine-tuning...
Epoch 1/3, Iteration 100: loss=0.3421, accuracy=0.891
Epoch 1/3, Iteration 200: loss=0.2156, accuracy=0.934
Epoch 1 completed. validation_loss: 0.1842, validation_accuracy: 0.951
Epoch 2/3, Iteration 100: loss=0.1523, accuracy=0.963
Epoch 2/3, Iteration 200: loss=0.1298, accuracy=0.971
Epoch 2 completed. validation_loss: 0.0921, validation_accuracy: 0.979
Epoch 3/3, Iteration 100: loss=0.0876, accuracy=0.981
Epoch 3/3, Iteration 200: loss=0.0745, accuracy=0.985
Epoch 3 completed. validation_loss: 0.0654, validation_accuracy: 0.987
Fine-tuning completed successfully!
Model saved to: models/pos_finetuned
Evaluating on validation set...
Validation Results:
Accuracy: 98.72%
Total predictions: 16,427
Correct predictions: 16,217
Post-training Evaluation
Test on held-out test set:
mix nasty.eval \
--model models/pos_finetuned.axon \
--test data/en_ewt-ud-test.conllu \
--type pos_tagging
Troubleshooting
Out of Memory
Symptoms: Process crashes, CUDA out of memory
Solutions:
- Reduce batch size: --batch-size 8
- Reduce max length: --max-length 256
- Use smaller model: distilbert-base instead of roberta-base
- Use gradient accumulation (API only)
Training Too Slow
Symptoms: Hours per epoch
Solutions:
- Enable GPU: set the XLA_TARGET=cuda env var
- Increase batch size: --batch-size 32
- Reduce max length: --max-length 256
- Use DistilBERT instead of BERT
Poor Accuracy
Symptoms: Validation accuracy <95%
Solutions:
- Train longer: --epochs 5
- Increase dataset size (need 5K+ sentences)
- Lower learning rate: --learning-rate 0.00001
- Check data quality (annotation errors?)
- Try different model: RoBERTa instead of BERT
Overfitting
Symptoms: High training accuracy, low validation accuracy
Solutions:
- More training data
- Fewer epochs: --epochs 2
- Higher learning rate: --learning-rate 0.00005
- Use validation set for early stopping
Model Not Learning
Symptoms: Loss stays constant
Solutions:
- Higher learning rate: --learning-rate 0.0001
- Check data format (is it loading correctly?)
- Verify labels are correct
- Try different optimizer (edit FineTuner code)
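Swapping the optimizer means editing the FineTuner module itself. As a rough sketch only: if its training loop is built on Axon.Loop, a different optimizer could be wired in as below. Recent Axon versions take optimizers from the Polaris library (older ones use Axon.Optimizers), and model and train_batches are placeholders for whatever FineTuner already constructs.
# Illustrative only: swap the optimizer in an Axon.Loop-based training loop.
optimizer = Polaris.Optimizers.adamw(learning_rate: 3.0e-5)

model
|> Axon.Loop.trainer(:categorical_cross_entropy, optimizer)
|> Axon.Loop.run(train_batches, %{}, epochs: 3)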
Best Practices
1. Always Use Validation Set
# GOOD: Monitor validation performance
mix nasty.fine_tune.pos \
--train data/train.conllu \
--validation data/dev.conllu
# BAD: No way to detect overfitting
mix nasty.fine_tune.pos \
--train data/train.conllu
2. Start with Defaults
Don't tune hyperparameters until you see the baseline:
# First run: Use defaults
mix nasty.fine_tune.pos --model roberta_base --train data/train.conllu
# Then: Tune if needed
3. Use RoBERTa for Best Accuracy
# Highest accuracy
--model roberta_base
# Not: BERT or DistilBERT (unless you need speed/size)
4. Save Intermediate Checkpoints
Models are saved automatically to the output directory. Keep multiple versions:
models/
pos_epoch1.axon
pos_epoch2.axon
pos_epoch3.axon
pos_final.axon # Best model
5. Document Your Configuration
Keep a log of what worked:
# models/pos_finetuned/README.md
Model: RoBERTa-base
Training data: UD_English-EWT (8,724 sentences)
Epochs: 3
Batch size: 16
Learning rate: 3e-5
Final accuracy: 98.7%
Training time: 15 minutes (GPU)
Production Deployment
After fine-tuning, deploy to production:
1. Quantize for Efficiency
mix nasty.quantize \
--model models/pos_finetuned.axon \
--calibration data/calibration.conllu \
--output models/pos_finetuned_int8.axon
Result: 4x smaller, 2-3x faster, <1% accuracy loss
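A quick sanity check of the size reduction, using only the standard library and the output paths from the command above:
# Compare on-disk sizes of the original and quantized model files.
original = File.stat!("models/pos_finetuned.axon").size
quantized = File.stat!("models/pos_finetuned_int8.axon").size
IO.puts("#{original} -> #{quantized} bytes (#{Float.round(original / quantized, 1)}x smaller)")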
2. Load in Production
# Load quantized model
{:ok, model} = INT8.load("models/pos_finetuned_int8.axon")
# Use for inference (apply_model/2 is a placeholder for your application's inference call)
def tag_sentence(text) do
{:ok, tokens} = Nasty.parse(text, language: :en)
{:ok, tagged} = apply_model(model, tokens)
tagged
end
3. Monitor Performance
Track key metrics:
- Accuracy on representative samples (weekly)
- Inference latency (should be <100ms per sentence)
- Memory usage (should be stable)
- Error rate by domain/source
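For the latency metric, :timer.tc/1 gives microsecond timings. A minimal sketch, assuming a tag_sentence/1 helper like the one above and a list of representative sample_texts:
# Measure per-sentence inference latency in milliseconds and report the median.
latencies_ms =
  for text <- sample_texts do
    {micros, _tagged} = :timer.tc(fn -> tag_sentence(text) end)
    micros / 1000
  end

median = latencies_ms |> Enum.sort() |> Enum.at(div(length(latencies_ms), 2))
IO.puts("median latency: #{Float.round(median, 1)} ms (target: <100 ms)")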
Advanced Topics
Few-shot Learning
Fine-tune with minimal data (100-500 examples):
FineTuner.few_shot_fine_tune(
base_model,
small_dataset,
:pos_tagging,
epochs: 10,
learning_rate: 1.0e-5,
data_augmentation: true
)
Domain Adaptation
Fine-tune on domain-specific data:
# Medical text
mix nasty.fine_tune.pos \
--model roberta_base \
--train data/medical_train.conllu
# Legal text
mix nasty.fine_tune.pos \
--model roberta_base \
--train data/legal_train.conllu
Multilingual Fine-tuning
Use XLM-RoBERTa for multiple languages:
mix nasty.fine_tune.pos \
--model xlm_roberta_base \
--train data/multilingual_train.conllu # Mix of en, es, ca
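The training file here is a placeholder; since CoNLL-U separates sentences with blank lines, per-language files can simply be concatenated into one mixed file (illustrative paths):
# Join per-language CoNLL-U files with a blank line between them.
mixed =
  ["data/en_train.conllu", "data/es_train.conllu", "data/ca_train.conllu"]
  |> Enum.map_join("\n\n", &String.trim(File.read!(&1)))

File.write!("data/multilingual_train.conllu", mixed <> "\n")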
See Also
- QUANTIZATION.md - Optimize fine-tuned models
- ZERO_SHOT.md - Classification without training
- CROSS_LINGUAL.md - Transfer across languages
- NEURAL_MODELS.md - Neural architecture details