# Model Quantization Guide
Complete guide to quantizing neural models in Nasty for deployment optimization.
## Overview
Model quantization reduces model size and inference time by converting Float32 weights to lower-precision representations (INT8, INT4). This enables:
- 4x smaller models (400MB → 100MB)
- 2-3x faster inference on CPU
- 40-60% lower memory usage
- Minimal accuracy loss (<1% with proper calibration)
- Mobile and edge deployment with reduced resource requirements
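
Under the hood, INT8 quantization maps each Float32 value onto an 8-bit integer range through a scale factor, which is where the 4x size reduction comes from (4 bytes per weight down to 1). A minimal round-trip sketch using Nx (illustrative only, not Nasty's internal implementation):

```elixir
# Quantize: map Float32 values onto the INT8 range [-127, 127]
w = Nx.tensor([0.42, -1.37, 2.05, -0.01])
scale = w |> Nx.abs() |> Nx.reduce_max() |> Nx.divide(127.0)
q = w |> Nx.divide(scale) |> Nx.round() |> Nx.as_type(:s8)

# Dequantize: recover approximate Float32 values. The rounding error
# introduced here is the source of the small accuracy loss.
w_approx = q |> Nx.as_type(:f32) |> Nx.multiply(scale)
```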
## Quantization Methods
Nasty supports three quantization approaches:
### 1. INT8 Post-Training Quantization (Recommended)
Convert trained Float32 models to INT8 after training.
Advantages:
- No retraining required
- Fast conversion (minutes)
- <1% accuracy degradation
- Works with any trained model
Use when:
- You have a trained model ready for deployment
- You need quick optimization
- Accuracy requirements are not extremely strict (a target around 97% is acceptable)
```elixir
alias Nasty.Statistics.Neural.Quantization.INT8

# Load trained model
{:ok, model} = NeuralTagger.load("models/pos_tagger.axon")

# Prepare calibration data (100-1000 representative samples)
calibration_data = load_calibration_samples("data/calibration.conllu", limit: 500)

# Quantize
{:ok, quantized} = INT8.quantize(model,
  calibration_data: calibration_data,
  calibration_method: :percentile,  # More robust than :minmax
  target_accuracy_loss: 0.01        # Max 1% loss
)

# Save
INT8.save(quantized, "models/pos_tagger_int8.axon")
```

### 2. Dynamic Quantization
Quantize weights at load time, keep activations in Float32.
Advantages:
- No calibration data needed
- Faster than static quantization
- Easy to apply
Disadvantages:
- Slower inference than static INT8 (activations remain Float32)
- Only ~50% size reduction (vs. 75% for static INT8)
Use when:
- You don't have calibration data
- You need quick wins without accuracy concerns
- Memory is more constrained than compute
```elixir
alias Nasty.Statistics.Neural.Quantization.Dynamic

{:ok, model} = NeuralTagger.load("models/pos_tagger.axon")

# Quantize dynamically
{:ok, quantized} = Dynamic.quantize(model)

# Use immediately - no saving needed
{:ok, predictions} = Dynamic.predict(quantized, tokens)
```
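
Conceptually, dynamic quantization stores each weight tensor as INT8 plus a Float32 scale and dequantizes on the fly before each matrix multiply, which is why activations stay in Float32 and inference is slower than fully static INT8. A rough sketch of the idea (illustrative, not the library's internals):

```elixir
defmodule DynamicSketch do
  # At load time: compress a Float32 weight tensor to {INT8, scale}
  def quantize_weights(w) do
    scale = w |> Nx.abs() |> Nx.reduce_max() |> Nx.divide(127.0)
    q = w |> Nx.divide(scale) |> Nx.round() |> Nx.as_type(:s8)
    {q, scale}
  end

  # At inference time: dequantize just before the Float32 matmul
  def dense({q, scale}, x) do
    w = q |> Nx.as_type(:f32) |> Nx.multiply(scale)
    Nx.dot(x, w)
  end
end
```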
### 3. Quantization-Aware Training (QAT)

Train model with quantization simulation from the start.
Advantages:
- Best accuracy (no degradation)
- Handles quantization errors during training
- Optimal for production
Disadvantages:
- Requires retraining
- Longer training time (1.5-2x)
- More complex setup
Use when:
- Accuracy is critical (medical, legal, finance)
- You're training from scratch anyway
- You have time for proper training
```elixir
alias Nasty.Statistics.Neural.Quantization.QAT
alias Nasty.Statistics.Neural.Transformers.FineTuner

# Fine-tune with QAT enabled
{:ok, model} = FineTuner.fine_tune(
  base_model,
  training_data,
  :pos_tagging,
  epochs: 5,
  quantization_aware: true,  # Enable QAT
  qat_opts: [
    bits: 8,
    fake_quantize: true
  ]
)

# Model is already quantization-ready
QAT.save(model, "models/pos_tagger_qat_int8.axon")
```
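
The `fake_quantize: true` option simulates quantization in the forward pass: weights are quantized and immediately dequantized, so the network learns to tolerate INT8 rounding error while training itself still runs in Float32. A sketch of that operation (illustrative, not FineTuner's internals):

```elixir
defmodule FakeQuantSketch do
  def fake_quantize(w, bits \\ 8) do
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    scale = w |> Nx.abs() |> Nx.reduce_max() |> Nx.divide(qmax)

    w
    |> Nx.divide(scale)
    |> Nx.round()
    |> Nx.clip(-qmax - 1, qmax)
    |> Nx.multiply(scale)  # back to Float32, rounding error baked in
  end
end
```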
## Calibration Data

Calibration determines optimal quantization ranges for activations.
### Requirements
- Size: 100-1000 samples (more is better, diminishing returns after 1000)
- Representativeness: Must cover typical input distributions
- Format: Same as training data (tokens, sentences, etc.)
### Preparing Calibration Data
```elixir
# From CoNLL-U file
defmodule CalibrationLoader do
  def load_samples(path, opts \\ []) do
    limit = Keyword.get(opts, :limit, 500)

    path
    |> DataLoader.load_conllu_file()
    |> elem(1)
    |> Enum.take(limit)
    |> Enum.map(fn sentence ->
      # Convert to format expected by model
      %{
        input_ids: sentence.input_ids,
        attention_mask: sentence.attention_mask
      }
    end)
  end
end

calibration_data = CalibrationLoader.load_samples("data/dev.conllu", limit: 500)
```

### Calibration Methods
**MinMax** (`:minmax`):
- Uses absolute min/max of activations
- Fast but sensitive to outliers
- Default method
```elixir
INT8.quantize(model, calibration_data: data, calibration_method: :minmax)
```

**Percentile** (`:percentile`):
- Uses 99.99th percentile instead of absolute max
- More robust to outliers
- Recommended for production
```elixir
INT8.quantize(model,
  calibration_data: data,
  calibration_method: :percentile,
  percentile: 99.99
)
```

**Entropy** (`:entropy`):
- Minimizes KL divergence between FP32 and INT8
- Best accuracy but slowest
- Use for critical applications
```elixir
INT8.quantize(model,
  calibration_data: data,
  calibration_method: :entropy
)
```
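
The practical difference between the methods is how the clipping range is chosen from the calibration activations. A simplified sketch (illustrative; assumes `activations` is an Nx tensor collected from calibration batches):

```elixir
sorted = activations |> Nx.flatten() |> Nx.sort()
n = Nx.size(sorted)

# :minmax takes the absolute extremes - one outlier stretches the range
minmax_hi = sorted[n - 1]

# :percentile clips at e.g. the 99.99th percentile, ignoring outliers
pct_hi = sorted[floor(0.9999 * (n - 1))]
```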
## Model Comparison

### Before Quantization
```bash
# Original Float32 model
ls -lh models/pos_tagger.axon
# => 412M

# Inference time (CPU)
mix nasty.benchmark --model pos_tagger.axon
# => 45ms per sentence
```
### After INT8 Quantization
```bash
# Quantized INT8 model
ls -lh models/pos_tagger_int8.axon
# => 108M (3.8x smaller)

# Inference time (CPU)
mix nasty.benchmark --model pos_tagger_int8.axon
# => 18ms per sentence (2.5x faster)
```
### Accuracy Comparison
```bash
# Evaluate both models
mix nasty.eval --model models/pos_tagger.axon --test data/test.conllu
# => Accuracy: 97.8%

mix nasty.eval --model models/pos_tagger_int8.axon --test data/test.conllu
# => Accuracy: 97.4% (0.4% degradation)
```
## Mix Tasks
### Quantize Existing Model
```bash
mix nasty.quantize \
  --model models/pos_tagger.axon \
  --calibration data/calibration.conllu \
  --method percentile \
  --output models/pos_tagger_int8.axon
```
### Evaluate Quantized Model
```bash
mix nasty.quantize.eval \
  --original models/pos_tagger.axon \
  --quantized models/pos_tagger_int8.axon \
  --test data/test.conllu
```
Output:
```
Comparing models on 2000 test examples:

Original (Float32):
  Accuracy: 97.84%
  Memory: 412MB
  Avg inference: 45.3ms

Quantized (INT8):
  Accuracy: 97.41%
  Memory: 108MB
  Avg inference: 18.2ms

Summary:
  Size reduction: 3.8x
  Speed improvement: 2.5x
  Accuracy loss: 0.43%
```

### Estimate Size Reduction
```bash
mix nasty.quantize.estimate --model models/pos_tagger.axon
```
Output:
```
Model: models/pos_tagger.axon
Parameters: 125,000,000

Estimated sizes:
  Float32 (current): 412 MB
  INT8: 108 MB (3.8x smaller)
  INT4: 58 MB (7.1x smaller)

Memory usage:
  Float32: ~1.2 GB (with activations)
  INT8: ~350 MB (70% reduction)
```

## Advanced Options
### Per-Channel Quantization
Quantize each output channel separately for better accuracy:
```elixir
INT8.quantize(model,
  calibration_data: data,
  per_channel: true  # Default
)
```
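
The difference is one scale for the entire weight tensor versus one scale per output channel, so channels with small weights keep finer resolution. A sketch (illustrative, assuming a `{out, in}` weight matrix):

```elixir
w_abs = Nx.abs(weights)

# Per-tensor: a single scale shared by every channel
per_tensor_scale = w_abs |> Nx.reduce_max() |> Nx.divide(127.0)

# Per-channel: one scale per output row (shape {out})
per_channel_scale = w_abs |> Nx.reduce_max(axes: [1]) |> Nx.divide(127.0)
```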
### Symmetric vs Asymmetric

**Symmetric** (default, faster):
```elixir
INT8.quantize(model, symmetric: true)
# Range: [-127, 127], zero_point = 0
```

**Asymmetric** (better accuracy):
```elixir
INT8.quantize(model, symmetric: false)
# Range: [-128, 127], zero_point = computed
```
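
Asymmetric mode maps the full observed `[min, max]` interval onto `[-128, 127]` with a computed zero point, which helps for skewed distributions such as ReLU activations. A sketch of the mapping (illustrative):

```elixir
min = t |> Nx.reduce_min() |> Nx.to_number()
max = t |> Nx.reduce_max() |> Nx.to_number()

scale = (max - min) / 255
zero_point = -128 - round(min / scale)

# Quantized value: round(x / scale) + zero_point
```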
### Selective Quantization

Quantize only certain layers:
```elixir
INT8.quantize(model,
  calibration_data: data,
  skip_layers: ["embedding", "output"]  # Keep these in Float32
)
```

## Deployment Strategies
### CPU Deployment
INT8 quantization provides maximum speedup on CPU:
```elixir
# Production inference with the quantized model
defmodule ProductionTagger do
  def tag_text(model, text) do
    {:ok, tokens} = Tokenizer.tokenize(text)
    {:ok, tagged} = INT8.predict(model, tokens)
    tagged
  end
end

{:ok, model} = INT8.load("models/pos_tagger_int8.axon")
tagged = ProductionTagger.tag_text(model, "The quick brown fox jumps.")
```

### GPU Deployment
Limited benefits on GPU (GPUs are optimized for Float32):
```elixir
# Use Float32 on GPU, INT8 on CPU
model =
  if gpu_available?() do
    {:ok, m} = NeuralTagger.load("models/pos_tagger.axon")
    m
  else
    {:ok, m} = INT8.load("models/pos_tagger_int8.axon")
    m
  end
```
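
The `gpu_available?/0` helper above is not part of Nasty; one hypothetical implementation (assuming the EXLA backend) checks whether a CUDA client has been configured:

```elixir
defmodule Deployment do
  # Hypothetical helper: treat the node as GPU-capable when EXLA
  # has a CUDA client configured.
  def gpu_available?() do
    :exla
    |> Application.get_env(:clients, [])
    |> Keyword.has_key?(:cuda)
  end
end
```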
### Mobile/Edge Deployment

Essential for resource-constrained devices:
```elixir
# Aggressive quantization for mobile
{:ok, model} = INT8.quantize(full_model,
  calibration_data: data,
  calibration_method: :percentile,
  per_channel: true,
  compress: true  # Additional gzip compression
)

# Further optimize
{:ok, pruned} = Pruner.prune(model, sparsity: 0.3)
{:ok, distilled} = Distiller.distill(pruned, student_size: 0.5)
```

## Troubleshooting
### High Accuracy Loss
Problem: Accuracy drops >2% after quantization
Solutions:
- Use more calibration data (increase from 100 to 1000 samples)
- Switch to percentile method with higher percentile (99.99)
- Use asymmetric quantization
- Skip quantizing sensitive layers (embedding, output)
- Try QAT for best accuracy
```elixir
# Better calibration
INT8.quantize(model,
  calibration_data: more_samples,  # 1000 instead of 100
  calibration_method: :percentile,
  percentile: 99.99,
  symmetric: false
)
```

### Slow Quantization
Problem: Calibration takes too long
Solutions:
- Reduce calibration sample size
- Use minmax instead of entropy method
- Disable per-channel quantization
```elixir
# Faster quantization
INT8.quantize(model,
  calibration_data: fewer_samples,  # 100 instead of 1000
  calibration_method: :minmax,
  per_channel: false
)
```

### Large Model Size
Problem: INT8 model still too large
Solutions:
- Apply model pruning first
- Use knowledge distillation
- Consider INT4 quantization (more aggressive)
```elixir
# Aggressive optimization pipeline
{:ok, pruned} = Pruner.prune(model, sparsity: 0.4)
{:ok, quantized} = INT8.quantize(pruned, calibration_data: data)
{:ok, compressed} = Compressor.compress(quantized, method: :gzip)
```

## Best Practices
### 1. Always Validate Accuracy
```elixir
# Validate before deploying
{:ok, quantized} = INT8.quantize(model,
  calibration_data: data,
  target_accuracy_loss: 0.01  # Fail if >1% loss
)
```

### 2. Use Representative Calibration Data
```elixir
# BAD: Only formal text
calibration_data = load_samples("formal_documents.txt")

# GOOD: Mixed domains matching production
calibration_data =
  load_samples("news.txt", 100) ++
  load_samples("social_media.txt", 100) ++
  load_samples("technical.txt", 100)
```

### 3. Benchmark in Production Environment
```bash
# Test on actual deployment hardware
mix nasty.benchmark \
  --model models/pos_tagger_int8.axon \
  --environment production \
  --samples 1000
```
### 4. Version Your Quantized Models
```
models/
  pos_tagger_v1_fp32.axon             # Original
  pos_tagger_v1_int8_minmax.axon      # Quick quantization
  pos_tagger_v1_int8_percentile.axon  # Production quantization
  pos_tagger_v1_qat.axon              # Quantization-aware trained
```

## Performance Metrics
### POS Tagging (UD English)
| Model | Size | Inference (CPU) | Accuracy | Use Case |
|---|---|---|---|---|
| Float32 | 412MB | 45ms | 97.8% | GPU servers |
| INT8 (minmax) | 108MB | 19ms | 97.2% | Fast deployment |
| INT8 (percentile) | 108MB | 18ms | 97.4% | Production |
| INT8 QAT | 108MB | 18ms | 97.8% | Critical apps |
### NER (CoNLL-2003)
| Model | Size | Inference (CPU) | F1 Score | Use Case |
|---|---|---|---|---|
| Float32 | 380MB | 52ms | 94.2% | Research |
| INT8 | 98MB | 21ms | 93.5% | Production |
## See Also
- NEURAL_MODELS.md - Neural architecture details
- FINE_TUNING.md - Training custom models
- PRETRAINED_MODELS.md - Using transformers
- Model Compression Papers