# Model Quantization Guide
Complete guide to quantizing neural models in Nasty for deployment optimization.
## Overview
Model quantization reduces model size and inference time by converting Float32 weights to lower-precision representations (INT8, INT4). This enables:
- 4x smaller models (400MB → 100MB)
- 2-3x faster inference on CPU
- 40-60% lower memory usage
- Minimal accuracy loss (<1% with proper calibration)
- Mobile and edge deployment with reduced resource requirements
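
Under the hood, INT8 quantization maps each Float32 value onto an 8-bit integer range through a scale factor, which is where the 4x size reduction comes from (4 bytes per weight down to 1). A minimal round-trip sketch using Nx (illustrative only, not Nasty's internal implementation):

```elixir
# Quantize: map Float32 values onto the INT8 range [-127, 127]
w = Nx.tensor([0.42, -1.37, 2.05, -0.01])
scale = w |> Nx.abs() |> Nx.reduce_max() |> Nx.divide(127.0)
q = w |> Nx.divide(scale) |> Nx.round() |> Nx.as_type(:s8)

# Dequantize: recover approximate Float32 values. The rounding error
# introduced here is the source of the small accuracy loss.
w_approx = q |> Nx.as_type(:f32) |> Nx.multiply(scale)
```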
## Quantization Methods
Nasty supports three quantization approaches:
### 1. INT8 Post-Training Quantization (Recommended)
Convert trained Float32 models to INT8 after training.
Advantages:
- No retraining required
- Fast conversion (minutes)
- <1% accuracy degradation
- Works with any trained model
Use when:
- You have a trained model ready for deployment
- You need quick optimization
- Accuracy requirements are not extremely strict (a target around 97% is acceptable)
```elixir
alias Nasty.Statistics.Neural.Quantization.INT8

# Load trained model
{:ok, model} = NeuralTagger.load("models/pos_tagger.axon")

# Prepare calibration data (100-1000 representative samples)
calibration_data = load_calibration_samples("data/calibration.conllu", limit: 500)

# Quantize
{:ok, quantized} = INT8.quantize(model,
  calibration_data: calibration_data,
  calibration_method: :percentile,  # More robust than :minmax
  target_accuracy_loss: 0.01        # Max 1% loss
)

# Save
INT8.save(quantized, "models/pos_tagger_int8.axon")
```

### 2. Dynamic Quantization
Quantize weights at load time, keep activations in Float32.
Advantages:
- No calibration data needed
- Faster than static quantization
- Easy to apply
Disadvantages:
- Slower inference than static INT8 (activations remain Float32)
- Only ~50% size reduction (vs. 75% for static INT8)
Use when:
- You don't have calibration data
- You need quick wins without accuracy concerns
- Memory is more constrained than compute
```elixir
alias Nasty.Statistics.Neural.Quantization.Dynamic

{:ok, model} = NeuralTagger.load("models/pos_tagger.axon")

# Quantize dynamically
{:ok, quantized} = Dynamic.quantize(model)

# Use immediately - no saving needed
{:ok, predictions} = Dynamic.predict(quantized, tokens)
```
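
Conceptually, dynamic quantization stores each weight tensor as INT8 plus a Float32 scale and dequantizes on the fly before each matrix multiply, which is why activations stay in Float32 and inference is slower than fully static INT8. A rough sketch of the idea (illustrative, not the library's internals):

```elixir
defmodule DynamicSketch do
  # At load time: compress a Float32 weight tensor to {INT8, scale}
  def quantize_weights(w) do
    scale = w |> Nx.abs() |> Nx.reduce_max() |> Nx.divide(127.0)
    q = w |> Nx.divide(scale) |> Nx.round() |> Nx.as_type(:s8)
    {q, scale}
  end

  # At inference time: dequantize just before the Float32 matmul
  def dense({q, scale}, x) do
    w = q |> Nx.as_type(:f32) |> Nx.multiply(scale)
    Nx.dot(x, w)
  end
end
```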
### 3. Quantization-Aware Training (QAT)

Train model with quantization simulation from the start.
Advantages:
- Best accuracy (no degradation)
- Handles quantization errors during training
- Optimal for production
Disadvantages:
- Requires retraining
- Longer training time (1.5-2x)
- More complex setup
Use when:
- Accuracy is critical (medical, legal, finance)
- You're training from scratch anyway
- You have time for proper training
```elixir
alias Nasty.Statistics.Neural.Quantization.QAT
alias Nasty.Statistics.Neural.Transformers.FineTuner

# Fine-tune with QAT enabled
{:ok, model} = FineTuner.fine_tune(
  base_model,
  training_data,
  :pos_tagging,
  epochs: 5,
  quantization_aware: true,  # Enable QAT
  qat_opts: [
    bits: 8,
    fake_quantize: true
  ]
)

# Model is already quantization-ready
QAT.save(model, "models/pos_tagger_qat_int8.axon")
```
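
The `fake_quantize: true` option simulates quantization in the forward pass: weights are quantized and immediately dequantized, so the network learns to tolerate INT8 rounding error while training itself still runs in Float32. A sketch of that operation (illustrative, not FineTuner's internals):

```elixir
defmodule FakeQuantSketch do
  def fake_quantize(w, bits \\ 8) do
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    scale = w |> Nx.abs() |> Nx.reduce_max() |> Nx.divide(qmax)

    w
    |> Nx.divide(scale)
    |> Nx.round()
    |> Nx.clip(-qmax - 1, qmax)
    |> Nx.multiply(scale)  # back to Float32, rounding error baked in
  end
end
```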
## Calibration Data

Calibration determines optimal quantization ranges for activations.
### Requirements
- Size: 100-1000 samples (more is better, diminishing returns after 1000)
- Representativeness: Must cover typical input distributions
- Format: Same as training data (tokens, sentences, etc.)
### Preparing Calibration Data
```elixir
# From CoNLL-U file
defmodule CalibrationLoader do
  def load_samples(path, opts \\ []) do
    limit = Keyword.get(opts, :limit, 500)

    path
    |> DataLoader.load_conllu_file()
    |> elem(1)
    |> Enum.take(limit)
    |> Enum.map(fn sentence ->
      # Convert to format expected by model
      %{
        input_ids: sentence.input_ids,
        attention_mask: sentence.attention_mask
      }
    end)
  end
end

calibration_data = CalibrationLoader.load_samples("data/dev.conllu", limit: 500)
```

### Calibration Methods
**MinMax** (`:minmax`):
- Uses absolute min/max of activations
- Fast but sensitive to outliers
- Default method
```elixir
INT8.quantize(model, calibration_data: data, calibration_method: :minmax)
```

**Percentile** (`:percentile`):
- Uses 99.99th percentile instead of absolute max
- More robust to outliers
- Recommended for production
```elixir
INT8.quantize(model,
  calibration_data: data,
  calibration_method: :percentile,
  percentile: 99.99
)
```

**Entropy** (`:entropy`):
- Minimizes KL divergence between FP32 and INT8
- Best accuracy but slowest
- Use for critical applications
```elixir
INT8.quantize(model,
  calibration_data: data,
  calibration_method: :entropy
)
```
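
The practical difference between the methods is how the clipping range is chosen from the calibration activations. A simplified sketch (illustrative; assumes `activations` is an Nx tensor collected from calibration batches):

```elixir
sorted = activations |> Nx.flatten() |> Nx.sort()
n = Nx.size(sorted)

# :minmax takes the absolute extremes - one outlier stretches the range
minmax_hi = sorted[n - 1]

# :percentile clips at e.g. the 99.99th percentile, ignoring outliers
pct_hi = sorted[floor(0.9999 * (n - 1))]
```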
## Model Comparison

### Before Quantization
```bash
# Original Float32 model
ls -lh models/pos_tagger.axon
# => 412M

# Inference time (CPU)
mix nasty.benchmark --model pos_tagger.axon
# => 45ms per sentence
```
### After INT8 Quantization
```bash
# Quantized INT8 model
ls -lh models/pos_tagger_int8.axon
# => 108M (3.8x smaller)

# Inference time (CPU)
mix nasty.benchmark --model pos_tagger_int8.axon
# => 18ms per sentence (2.5x faster)
```
### Accuracy Comparison
```bash
# Evaluate both models
mix nasty.eval --model models/pos_tagger.axon --test data/test.conllu
# => Accuracy: 97.8%

mix nasty.eval --model models/pos_tagger_int8.axon --test data/test.conllu
# => Accuracy: 97.4% (0.4% degradation)
```
## Mix Tasks
### Quantize Existing Model
```bash
mix nasty.quantize \
  --model models/pos_tagger.axon \
  --calibration data/calibration.conllu \
  --method percentile \
  --output models/pos_tagger_int8.axon
```
### Evaluate Quantized Model
```bash
mix nasty.quantize.eval \
  --original models/pos_tagger.axon \
  --quantized models/pos_tagger_int8.axon \
  --test data/test.conllu
```
Output:
```
Comparing models on 2000 test examples:

Original (Float32):
  Accuracy: 97.84%
  Memory: 412MB
  Avg inference: 45.3ms

Quantized (INT8):
  Accuracy: 97.41%
  Memory: 108MB
  Avg inference: 18.2ms

Summary:
  Size reduction: 3.8x
  Speed improvement: 2.5x
  Accuracy loss: 0.43%
```

### Estimate Size Reduction
```bash
mix nasty.quantize.estimate --model models/pos_tagger.axon
```
Output:
```
Model: models/pos_tagger.axon
Parameters: 125,000,000

Estimated sizes:
  Float32 (current): 412 MB
  INT8: 108 MB (3.8x smaller)
  INT4: 58 MB (7.1x smaller)

Memory usage:
  Float32: ~1.2 GB (with activations)
  INT8: ~350 MB (70% reduction)
```

## Advanced Options
### Per-Channel Quantization
Quantize each output channel separately for better accuracy:
```elixir
INT8.quantize(model,
  calibration_data: data,
  per_channel: true  # Default
)
```
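
The difference is one scale for the entire weight tensor versus one scale per output channel, so channels with small weights keep finer resolution. A sketch (illustrative, assuming a `{out, in}` weight matrix):

```elixir
w_abs = Nx.abs(weights)

# Per-tensor: a single scale shared by every channel
per_tensor_scale = w_abs |> Nx.reduce_max() |> Nx.divide(127.0)

# Per-channel: one scale per output row (shape {out})
per_channel_scale = w_abs |> Nx.reduce_max(axes: [1]) |> Nx.divide(127.0)
```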
### Symmetric vs Asymmetric

**Symmetric** (default, faster):
```elixir
INT8.quantize(model, symmetric: true)
# Range: [-127, 127], zero_point = 0
```

**Asymmetric** (better accuracy):
```elixir
INT8.quantize(model, symmetric: false)
# Range: [-128, 127], zero_point = computed
```
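
Asymmetric mode maps the full observed `[min, max]` interval onto `[-128, 127]` with a computed zero point, which helps for skewed distributions such as ReLU activations. A sketch of the mapping (illustrative):

```elixir
min = t |> Nx.reduce_min() |> Nx.to_number()
max = t |> Nx.reduce_max() |> Nx.to_number()

scale = (max - min) / 255
zero_point = -128 - round(min / scale)

# Quantized value: round(x / scale) + zero_point
```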
### Selective Quantization

Quantize only certain layers:
```elixir
INT8.quantize(model,
  calibration_data: data,
  skip_layers: ["embedding", "output"]  # Keep these in Float32
)
```

## Deployment Strategies
### CPU Deployment
INT8 quantization provides maximum speedup on CPU:
```elixir
# Production inference with the quantized model
defmodule ProductionTagger do
  def tag_text(model, text) do
    {:ok, tokens} = Tokenizer.tokenize(text)
    {:ok, tagged} = INT8.predict(model, tokens)
    tagged
  end
end

{:ok, model} = INT8.load("models/pos_tagger_int8.axon")
tagged = ProductionTagger.tag_text(model, "The quick brown fox jumps.")
```

### GPU Deployment
Limited benefits on GPU (GPUs are optimized for Float32):
```elixir
# Use Float32 on GPU, INT8 on CPU
model =
  if gpu_available?() do
    {:ok, m} = NeuralTagger.load("models/pos_tagger.axon")
    m
  else
    {:ok, m} = INT8.load("models/pos_tagger_int8.axon")
    m
  end
```
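
The `gpu_available?/0` helper above is not part of Nasty; one hypothetical implementation (assuming the EXLA backend) checks whether a CUDA client has been configured:

```elixir
defmodule Deployment do
  # Hypothetical helper: treat the node as GPU-capable when EXLA
  # has a CUDA client configured.
  def gpu_available?() do
    :exla
    |> Application.get_env(:clients, [])
    |> Keyword.has_key?(:cuda)
  end
end
```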
### Mobile/Edge Deployment

Essential for resource-constrained devices:
```elixir
# Aggressive quantization for mobile
{:ok, model} = INT8.quantize(full_model,
  calibration_data: data,
  calibration_method: :percentile,
  per_channel: true,
  compress: true  # Additional gzip compression
)

# Further optimize
{:ok, pruned} = Pruner.prune(model, sparsity: 0.3)
{:ok, distilled} = Distiller.distill(pruned, student_size: 0.5)
```

## Troubleshooting
### High Accuracy Loss
Problem: Accuracy drops >2% after quantization
Solutions:
- Use more calibration data (increase from 100 to 1000 samples)
- Switch to percentile method with higher percentile (99.99)
- Use asymmetric quantization
- Skip quantizing sensitive layers (embedding, output)
- Try QAT for best accuracy
```elixir
# Better calibration
INT8.quantize(model,
  calibration_data: more_samples,  # 1000 instead of 100
  calibration_method: :percentile,
  percentile: 99.99,
  symmetric: false
)
```

### Slow Quantization
Problem: Calibration takes too long
Solutions:
- Reduce calibration sample size
- Use minmax instead of entropy method
- Disable per-channel quantization
```elixir
# Faster quantization
INT8.quantize(model,
  calibration_data: fewer_samples,  # 100 instead of 1000
  calibration_method: :minmax,
  per_channel: false
)
```

### Large Model Size
Problem: INT8 model still too large
Solutions:
- Apply model pruning first
- Use knowledge distillation
- Consider INT4 quantization (more aggressive)
```elixir
# Aggressive optimization pipeline
{:ok, pruned} = Pruner.prune(model, sparsity: 0.4)
{:ok, quantized} = INT8.quantize(pruned, calibration_data: data)
{:ok, compressed} = Compressor.compress(quantized, method: :gzip)
```

## Best Practices
### 1. Always Validate Accuracy
```elixir
# Validate before deploying
{:ok, quantized} = INT8.quantize(model,
  calibration_data: data,
  target_accuracy_loss: 0.01  # Fail if >1% loss
)
```

### 2. Use Representative Calibration Data
```elixir
# BAD: Only formal text
calibration_data = load_samples("formal_documents.txt")

# GOOD: Mixed domains matching production
calibration_data =
  load_samples("news.txt", 100) ++
  load_samples("social_media.txt", 100) ++
  load_samples("technical.txt", 100)
```

### 3. Benchmark in Production Environment
```bash
# Test on actual deployment hardware
mix nasty.benchmark \
  --model models/pos_tagger_int8.axon \
  --environment production \
  --samples 1000
```
### 4. Version Your Quantized Models
```
models/
  pos_tagger_v1_fp32.axon             # Original
  pos_tagger_v1_int8_minmax.axon      # Quick quantization
  pos_tagger_v1_int8_percentile.axon  # Production quantization
  pos_tagger_v1_qat.axon              # Quantization-aware trained
```

## Performance Metrics
### POS Tagging (UD English)
| Model | Size | Inference (CPU) | Accuracy | Use Case |
|---|---|---|---|---|
| Float32 | 412MB | 45ms | 97.8% | GPU servers |
| INT8 (minmax) | 108MB | 19ms | 97.2% | Fast deployment |
| INT8 (percentile) | 108MB | 18ms | 97.4% | Production |
| INT8 QAT | 108MB | 18ms | 97.8% | Critical apps |
### NER (CoNLL-2003)
| Model | Size | Inference (CPU) | F1 Score | Use Case |
|---|---|---|---|---|
| Float32 | 380MB | 52ms | 94.2% | Research |
| INT8 | 98MB | 21ms | 93.5% | Production |
## See Also
- NEURAL_MODELS.md - Neural architecture details
- FINE_TUNING.md - Training custom models
- PRETRAINED_MODELS.md - Using transformers
- Model Compression Papers