Table of Contents
- Understanding Gradient Computation
- Choosing an Optimizer
- Learning Rate Strategies
- Gradient Clipping
- Weight Decay
- Gradient Accumulation
- Batch Size Selection
- Memory Optimization
- Common Problems and Solutions
- Performance Benchmarks
Understanding Gradient Computation
Current Limitation: Numerical Gradients
ExBurn v0.1.0 uses numerical differentiation (finite differences) to compute gradients. This is the main performance bottleneck.
Central differences: ∂L/∂w ≈ (L(w + ε) - L(w - ε)) / (2ε)
One-sided: ∂L/∂w ≈ (L(w + ε) - L(w)) / ε
Impact: For a model with N scalar parameters, central differences require 2N forward passes per mini-batch. A 100K-parameter model needs 200K forward passes per batch.
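The two formulas can be tried on a toy loss in plain Elixir. This is a minimal sketch, not ExBurn's implementation: the `FiniteDiffSketch` module, its function names, and the default `eps` of 1.0e-4 are assumptions made for illustration, and parameters are a flat list of floats rather than tensors.

```elixir
defmodule FiniteDiffSketch do
  # Hypothetical helper for illustration; not part of ExBurn.

  # Central differences: two loss evaluations per parameter, error O(eps^2).
  def central(loss_fn, params, eps \\ 1.0e-4) do
    params
    |> Enum.with_index()
    |> Enum.map(fn {w, i} ->
      plus = loss_fn.(List.replace_at(params, i, w + eps))
      minus = loss_fn.(List.replace_at(params, i, w - eps))
      (plus - minus) / (2 * eps)
    end)
  end

  # One-sided differences: one shared baseline evaluation plus one
  # evaluation per parameter, error O(eps).
  def one_sided(loss_fn, params, eps \\ 1.0e-4) do
    base = loss_fn.(params)

    params
    |> Enum.with_index()
    |> Enum.map(fn {w, i} ->
      (loss_fn.(List.replace_at(params, i, w + eps)) - base) / eps
    end)
  end
end

# Toy loss L(w) = w0^2 + 3*w1; its exact gradient at [2.0, 1.0] is [4.0, 3.0].
loss = fn [w0, w1] -> w0 * w0 + 3 * w1 end
IO.inspect(FiniteDiffSketch.central(loss, [2.0, 1.0]))
IO.inspect(FiniteDiffSketch.one_sided(loss, [2.0, 1.0]))
```

On this loss the central estimate matches the exact gradient up to rounding, while the one-sided estimate of the first component is off by roughly `eps`, because the loss is quadratic in `w0`.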
Choosing a Gradient Method
# Default: central differences (more accurate, slower)
grads = ExBurn.Training.compute_gradients(model, {x, y}, grad_method: :numerical)
# Faster: one-sided differences (less accurate, ~2x faster)
grads = ExBurn.Training.compute_gradients(model, {x, y}, grad_method: :numerical_batch)

| Method | Forward Passes | Error Order | When to Use |
|---|---|---|---|
| :numerical | 2N | O(ε²) | Small models, high accuracy needed |
| :numerical_batch | N+1 | O(ε) | Larger models, speed matters more |
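To make the cost difference concrete, the forward-pass counts from the table can be written as a small function. The `PassCount` module is a hypothetical helper for this guide, not an ExBurn API.

```elixir
defmodule PassCount do
  # Forward passes per mini-batch for a model with n scalar parameters,
  # using the 2N and N+1 counts given for the two gradient methods.
  def forward_passes(:numerical, n), do: 2 * n
  def forward_passes(:numerical_batch, n), do: n + 1
end

IO.inspect(PassCount.forward_passes(:numerical, 100_000))        # 200000
IO.inspect(PassCount.forward_passes(:numerical_batch, 100_000))  # 100001
```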
When Autodiff Arrives (v0.3.0)
Burn's Autodiff backend will compute exact gradients with a single backward pass per mini-batch, regardless of parameter count. This removes the per-parameter cost of finite differences:
Numerical (v0.1.0): 200K forward passes per mini-batch for 100K params
Autodiff (v0.3.0): 1 backward pass for any model size
Recommendation: For now, keep models small (< 50K params) for training. Use larger models only for inference.
Choosing an Optimizer
Adam (Default)
Best general-purpose optimizer. Adapts learning rates per-parameter.
ExBurn.Model.compile(model, optimizer: :adam, learning_rate: 0.001)
# beta1=0.9, beta2=0.999, epsilon=1e-8
When to use: Default choice for most tasks. Works well with default hyperparameters.
Tips:
- learning_rate: 0.001 is a good starting point
- Reduce to 0.0001 if training is unstable
- Increase to 0.01 if convergence is very slow
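For readers who want to see what "adapts learning rates per-parameter" means, here is a minimal sketch of one standard Adam step over flat lists of floats, using the default hyperparameters quoted above. The `AdamSketch` module and its option names are assumptions for illustration; ExBurn's internal implementation may differ.

```elixir
defmodule AdamSketch do
  # Hypothetical helper; not part of ExBurn.
  # One Adam update. m and v are the first and second moment estimates
  # (the "m + v" optimizer state); t is the 1-based step count.
  def step(params, grads, m, v, t, opts \\ []) do
    lr = Keyword.get(opts, :learning_rate, 0.001)
    b1 = Keyword.get(opts, :beta1, 0.9)
    b2 = Keyword.get(opts, :beta2, 0.999)
    eps = Keyword.get(opts, :epsilon, 1.0e-8)

    # Exponential moving averages of the gradient and the squared gradient.
    new_m = Enum.zip_with(m, grads, fn mi, g -> b1 * mi + (1 - b1) * g end)
    new_v = Enum.zip_with(v, grads, fn vi, g -> b2 * vi + (1 - b2) * g * g end)

    new_params =
      Enum.zip_with([params, new_m, new_v], fn [p, mi, vi] ->
        # Bias correction compensates for the zero-initialized moments.
        m_hat = mi / (1 - :math.pow(b1, t))
        v_hat = vi / (1 - :math.pow(b2, t))
        p - lr * m_hat / (:math.sqrt(v_hat) + eps)
      end)

    {new_params, new_m, new_v}
  end
end

{params, _m, _v} = AdamSketch.step([1.0, -2.0], [0.5, -0.5], [0.0, 0.0], [0.0, 0.0], 1)
IO.inspect(params)
```

On the first step each parameter moves by roughly `learning_rate` against the sign of its gradient, whatever the gradient's magnitude, which is one reason the default of 0.001 works across many tasks.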
SGD with Momentum
Can achieve better generalization than Adam with proper tuning.
ExBurn.Model.compile(model, optimizer: :sgd, learning_rate: 0.01)
# momentum=0.9
When to use: When you need maximum generalization and have time to tune.
Tips:
- Requires higher learning rate than Adam (typically 0.01–0.1)
- Use Nesterov momentum for faster convergence:
ExBurn.Training.fit(model, data, nesterov: true)
- Combine with cosine annealing LR schedule for best results
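The difference between classical and Nesterov momentum is easiest to see in the update itself. The sketch below uses one common formulation (the one PyTorch's SGD uses); whether ExBurn applies exactly this variant is an assumption, and the `MomentumSketch` module is hypothetical.

```elixir
defmodule MomentumSketch do
  # Hypothetical helper; not part of ExBurn.
  # One SGD step with momentum over flat lists of floats.
  def step(params, grads, velocity, opts \\ []) do
    lr = Keyword.get(opts, :learning_rate, 0.01)
    mu = Keyword.get(opts, :momentum, 0.9)
    nesterov = Keyword.get(opts, :nesterov, false)

    # The velocity accumulates a decaying sum of past gradients.
    new_velocity = Enum.zip_with(velocity, grads, fn v, g -> mu * v + g end)

    new_params =
      Enum.zip_with([params, grads, new_velocity], fn [p, g, v] ->
        # Nesterov adds a look-ahead term on top of the current gradient.
        update = if nesterov, do: g + mu * v, else: v
        p - lr * update
      end)

    {new_params, new_velocity}
  end
end

IO.inspect(MomentumSketch.step([1.0], [1.0], [0.0]))
IO.inspect(MomentumSketch.step([1.0], [1.0], [0.0], nesterov: true))
```

The single velocity list is the "1x params" of optimizer state listed for SGD with momentum.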
RMSprop
Good for recurrent networks and non-stationary objectives.
ExBurn.Model.compile(model, optimizer: :rmsprop, learning_rate: 0.001)
# decay=0.9, epsilon=1e-8
When to use: RNNs, LSTMs, or when Adam diverges.
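As a rough illustration of the defaults above, a standard RMSprop step divides each gradient by a running root-mean-square of its recent values. The `RmspropSketch` module is a hypothetical name, and this textbook form is an assumption about what ExBurn does internally.

```elixir
defmodule RmspropSketch do
  # Hypothetical helper; not part of ExBurn.
  # One RMSprop step over flat lists of floats. cache holds the running
  # average of squared gradients (the single extra copy of optimizer state).
  def step(params, grads, cache, opts \\ []) do
    lr = Keyword.get(opts, :learning_rate, 0.001)
    decay = Keyword.get(opts, :decay, 0.9)
    eps = Keyword.get(opts, :epsilon, 1.0e-8)

    new_cache =
      Enum.zip_with(cache, grads, fn c, g -> decay * c + (1 - decay) * g * g end)

    new_params =
      Enum.zip_with([params, grads, new_cache], fn [p, g, c] ->
        p - lr * g / (:math.sqrt(c) + eps)
      end)

    {new_params, new_cache}
  end
end

IO.inspect(RmspropSketch.step([1.0], [2.0], [0.0]))
```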
Optimizer Comparison
| Optimizer | Convergence Speed | Generalization | Tuning Effort | Memory |
|---|---|---|---|---|
| Adam | Fast | Good | Low | 2x params (m + v) |
| SGD + Momentum | Medium | Best | High | 1x params (velocity) |
| RMSprop | Medium | Good | Medium | 1x params (cache) |
Learning Rate Strategies
Fixed Learning Rate
# No schedule — use constant learning rate
ExBurn.Model.compile(model, learning_rate: 0.001)
Step Decay
Reduce LR by a factor every N epochs. Good for long training runs.
# Halve the learning rate every 10 epochs
lr_schedule: {:step, 0.001, 10, 0.5}
Exponential Decay
Smooth decay. Good for medium-length training.
# Multiply LR by 0.95 each epoch
lr_schedule: {:exponential, 0.001, 0.95}
Cosine Annealing
Smoothly decay from base_lr to min_lr following a cosine curve. Often gives the best results.
# Decay from 0.001 to 0.00001 over the training run
lr_schedule: {:cosine, 0.001, 1.0e-5}
Learning Rate Schedule Comparison
[Figure: learning rate (0.001 down to 0.00001) versus epochs, comparing step decay (sudden drops) with cosine annealing (smooth curve).]
Tips
- Start with Adam + cosine annealing for best results
- If loss oscillates, reduce the base learning rate
- If convergence is too slow, increase the base learning rate
- Learning rate warmup (planned, not yet available) is intended for large batch sizes
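The three schedules in this section can be sketched as plain functions of the epoch number. The tuple shapes mirror the `lr_schedule` options shown earlier, but the `ScheduleSketch` module, the zero-based `epoch` argument, and the explicit `total` epoch count for cosine annealing are assumptions; ExBurn may index epochs or derive the run length differently.

```elixir
defmodule ScheduleSketch do
  # Hypothetical helper; not part of ExBurn.
  # epoch is zero-based; total is the number of epochs in the run.

  # Step decay: multiply by factor once every `every` epochs.
  def lr({:step, base, every, factor}, epoch, _total),
    do: base * :math.pow(factor, div(epoch, every))

  # Exponential decay: multiply by gamma each epoch.
  def lr({:exponential, base, gamma}, epoch, _total),
    do: base * :math.pow(gamma, epoch)

  # Cosine annealing: half a cosine period from base down to min.
  def lr({:cosine, base, min}, epoch, total),
    do: min + 0.5 * (base - min) * (1 + :math.cos(:math.pi() * epoch / total))
end

for epoch <- [0, 10, 25, 50] do
  IO.inspect({
    epoch,
    ScheduleSketch.lr({:step, 0.001, 10, 0.5}, epoch, 50),
    ScheduleSketch.lr({:exponential, 0.001, 0.95}, epoch, 50),
    ScheduleSketch.lr({:cosine, 0.001, 1.0e-5}, epoch, 50)
  })
end
```

Printing a few epochs side by side shows the step schedule holding flat between drops while the cosine schedule changes slowly at the start and end of the run and fastest in the middle.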
Gradient Clipping
Prevents exploding gradients, which cause NaN loss.
Clip by Norm
Scales all gradients so their total norm doesn't exceed a threshold:
# If ||gradients||_2 > 1.0, scale them down
clip_norm: 1.0
When to use: Always enable for recurrent networks. Recommended for deep networks.
Clip by Value
Clips each gradient element to a range:
# Clip each gradient to [-5.0, 5.0]
clip_value: 5.0
When to use: As a safety net alongside norm clipping.
Tips
- clip_norm: 1.0 is a good default
- If you see NaN loss, enable clipping immediately
- Clipping doesn't prevent vanishing gradients — use residual connections for that
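The two clipping modes differ in what they preserve: norm clipping rescales the whole gradient and keeps its direction, while value clipping limits each element independently. A minimal sketch over a flat list of floats, with the hypothetical `ClipSketch` module standing in for whatever ExBurn does internally:

```elixir
defmodule ClipSketch do
  # Hypothetical helper; not part of ExBurn.

  # Rescale the gradient so its L2 norm is at most max_norm.
  def by_norm(grads, max_norm) do
    norm = grads |> Enum.map(&(&1 * &1)) |> Enum.sum() |> :math.sqrt()

    if norm > max_norm do
      scale = max_norm / norm
      Enum.map(grads, &(&1 * scale))
    else
      grads
    end
  end

  # Clamp each element to the range [-limit, limit].
  def by_value(grads, limit) do
    Enum.map(grads, fn g -> g |> max(-limit) |> min(limit) end)
  end
end

IO.inspect(ClipSketch.by_norm([3.0, 4.0], 1.0))
IO.inspect(ClipSketch.by_value([-7.0, 2.0, 9.0], 5.0))
```

In the first call the gradient has norm 5, so both elements are scaled by 0.2 and their ratio is unchanged. In the second, only the out-of-range elements change, which alters the gradient's direction.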
Weight Decay
L2 regularization that penalizes large weights, improving generalization:
ExBurn.Model.compile(model, weight_decay: 1.0e-4)
This adds weight_decay * param to each gradient before the optimizer step.
Tips
- 1.0e-4 is a good default for most tasks
- 1.0e-5 for small datasets (less regularization)
- 1.0e-3 for large models that overfit
- With standard Adam, this L2-style weight decay is rescaled by the adaptive learning rate, so it does not behave like decoupled weight decay (AdamW, not yet implemented)
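The gradient adjustment described in this section is a single element-wise operation. A minimal sketch, assuming flat lists of floats and a hypothetical `WeightDecaySketch` module:

```elixir
defmodule WeightDecaySketch do
  # Hypothetical helper; not part of ExBurn.
  # Adds weight_decay * param to each gradient before the optimizer step.
  def apply_decay(grads, params, weight_decay) do
    Enum.zip_with(grads, params, fn g, p -> g + weight_decay * p end)
  end
end

IO.inspect(WeightDecaySketch.apply_decay([0.5, -0.1], [2.0, -4.0], 1.0e-4))
```

Because the decay term is proportional to the parameter, large weights are pulled toward zero harder than small ones.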
Gradient Accumulation
Simulates a larger batch size by accumulating gradients across multiple mini-batches:
# Effective batch size = 32 * 4 = 128
ExBurn.Training.fit(model, data,
batch_size: 32,
accumulate_gradients: 4
)
When to Use
- GPU memory limits your batch size
- You want the stability of large batches but can't fit them in memory
- Training on mobile devices with limited RAM
Tips
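- Scale the learning rate with the effective batch size: linear scaling (4x accumulation → 4x LR) is the common rule, and square-root scaling (4x accumulation → 2x LR) is a more conservative alternative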
- Batch normalization (when available) will still see the small mini-batch statistics
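The idea behind accumulation is that averaging the gradients of several equally sized mini-batches gives the gradient of one large batch, provided the loss is a per-batch mean. The sketch below assumes that averaging convention; `AccumulateSketch` and `grad_fn` are hypothetical names, and ExBurn may sum rather than average.

```elixir
defmodule AccumulateSketch do
  # Hypothetical helper; not part of ExBurn.
  # grad_fn takes one mini-batch and returns a flat list of gradients.
  # Only a running sum is kept, so memory stays at one gradient's worth.
  def accumulated_gradient([first | rest] = mini_batches, grad_fn) do
    summed =
      Enum.reduce(rest, grad_fn.(first), fn batch, acc ->
        Enum.zip_with(acc, grad_fn.(batch), &(&1 + &2))
      end)

    Enum.map(summed, &(&1 / length(mini_batches)))
  end
end

# Toy gradient: the mean of the batch, standing in for a mean-reduced loss.
grad_fn = fn batch -> [Enum.sum(batch) / length(batch)] end

IO.inspect(AccumulateSketch.accumulated_gradient([[1.0, 2.0], [3.0, 4.0]], grad_fn))
IO.inspect(grad_fn.([1.0, 2.0, 3.0, 4.0]))
```

Both calls print the same gradient, which is the sense in which two accumulated mini-batches of size 2 behave like one batch of size 4.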
Batch Size Selection
| Batch Size | Pros | Cons |
|---|---|---|
| 8–16 | Better generalization, less memory | Noisy gradients, slower training |
| 32–64 | Good default | Balanced |
| 128–256 | Faster training, stable gradients | May generalize worse, more memory |
| 512+ | Very stable gradients | Often worse generalization, high memory |
Tips
- Start with 32 and increase if you have memory headroom
- If you increase batch size, increase learning rate proportionally
- Use gradient accumulation to simulate large batches on memory-constrained devices
Memory Optimization
On Desktop (CUDA/Metal)
# Use f16 for 2x memory reduction
# (convert parameters to f16 before training)
# Use gradient accumulation to reduce per-batch memory
accumulate_gradients: 4
On Mobile (iOS/Android)
# Keep models small (< 10M params)
# Use CPU for training (GPU autodiff is memory-intensive)
ExBurn.Model.compile(model, device: :cpu)
# Free intermediate tensors explicitly
ExBurn.Tensor.free(intermediate_tensor)
Memory-Saving Tips
- Reduce batch size — the single biggest lever
- Use gradient accumulation — same effective batch, less memory
- Free tensors explicitly — don't wait for GC
- Use f16 precision — halves memory for tensors
- Avoid storing all intermediate activations — use gradient checkpointing (planned)
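A back-of-the-envelope estimate helps when deciding between precisions and optimizers. The sketch below counts parameters, gradients, and optimizer state only; it excludes activations and assumes that gradients and optimizer state are stored in the same precision as the parameters. The state multipliers come from the optimizer comparison table, and `MemorySketch` is a hypothetical helper.

```elixir
defmodule MemorySketch do
  # Hypothetical helper; not part of ExBurn.
  @bytes_per_value %{f32: 4, f16: 2}
  # Optimizer state as a multiple of the parameter count.
  @state_copies %{adam: 2, sgd: 1, rmsprop: 1}

  # Bytes for parameters + gradients + optimizer state (no activations).
  def training_bytes(param_count, precision, optimizer) do
    bytes = Map.fetch!(@bytes_per_value, precision)
    copies = 1 + 1 + Map.fetch!(@state_copies, optimizer)
    param_count * bytes * copies
  end
end

IO.inspect(MemorySketch.training_bytes(100_000, :f32, :adam))  # 1600000
IO.inspect(MemorySketch.training_bytes(100_000, :f16, :adam))  # 800000
IO.inspect(MemorySketch.training_bytes(100_000, :f32, :sgd))   # 1200000
```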
Common Problems and Solutions
Loss is NaN
Causes: Exploding gradients, too high learning rate, numerical instability
Solutions:
# 1. Enable gradient clipping
clip_norm: 1.0
# 2. Reduce learning rate
learning_rate: 0.0001
# 3. Use :numerical_batch gradient method (more stable)
grad_method: :numerical_batch
Loss Doesn't Decrease
Causes: Too low learning rate, bad initialization, wrong loss function
Solutions:
# 1. Increase learning rate
learning_rate: 0.01
# 2. Check loss function matches task
# Classification → :cross_entropy
# Regression → :mse
# Binary → :binary_cross_entropy
# 3. Verify data preprocessing (normalization, etc.)
Loss Oscillates
Causes: Learning rate too high, batch size too small
Solutions:
# 1. Reduce learning rate
learning_rate: 0.0005
# 2. Increase batch size or use gradient accumulation
accumulate_gradients: 4
# 3. Use learning rate schedule
lr_schedule: {:cosine, 0.001, 1.0e-6}
Overfitting
Causes: Model too complex, not enough data, no regularization
Solutions:
# 1. Add weight decay
weight_decay: 1.0e-3
# 2. Add dropout in the Axon model
|> Axon.dropout(rate: 0.5)
# 3. Freeze early layers
model = ExBurn.Model.freeze(model, ["hidden1"])
# 4. Use early stopping
callbacks: [ExBurn.Training.EarlyStoppingCallback.wait(5)]
Training is Very Slow
Causes: Numerical gradients on large model, too many epochs
Solutions:
# 1. Use faster gradient method
grad_method: :numerical_batch
# 2. Reduce model size
# 3. Use fewer epochs with early stopping
callbacks: [ExBurn.Training.EarlyStoppingCallback.wait(3)]
# 4. Increase batch size (fewer optimizer steps)
batch_size: 128
Performance Benchmarks
Approximate training times per epoch on synthetic data (will vary by hardware):
| Model Size | Params | Batch | Method | Time/Epoch |
|---|---|---|---|---|
| Tiny MLP | 1K | 32 | :numerical | ~2s |
| Small MLP | 10K | 32 | :numerical | ~15s |
| Small MLP | 10K | 32 | :numerical_batch | ~8s |
| Medium MLP | 100K | 32 | :numerical | ~3min |
| Medium MLP | 100K | 32 | :numerical_batch | ~1.5min |
Key takeaway: With numerical gradients, training time scales linearly with parameter count. Keep models under 50K parameters for interactive training, or switch to inference-only for larger models until autodiff arrives in v0.3.0.
Quick Reference: Recommended Settings
For Quick Experiments
compiled = ExBurn.Model.compile(model,
loss: :cross_entropy,
optimizer: :adam,
learning_rate: 0.001
)
ExBurn.Training.fit(compiled, data,
epochs: 10,
batch_size: 32,
verbose: true
)
For Best Results
compiled = ExBurn.Model.compile(model,
loss: :cross_entropy,
optimizer: :adam,
learning_rate: 0.001,
weight_decay: 1.0e-4
)
ExBurn.Training.fit(compiled, data,
epochs: 50,
batch_size: 64,
shuffle: true,
validation_data: val_data,
lr_schedule: {:cosine, 0.001, 1.0e-6},
clip_norm: 1.0,
accuracy: true,
callbacks: [
&ExBurn.Training.LoggingCallback.log/1,
ExBurn.Training.EarlyStoppingCallback.wait(10, 1.0e-5),
ExBurn.Training.CheckpointCallback.every(10, "/checkpoints")
]
)
For Memory-Constrained Devices
compiled = ExBurn.Model.compile(model,
loss: :cross_entropy,
optimizer: :adam,
learning_rate: 0.0005,
device: :cpu
)
ExBurn.Training.fit(compiled, data,
epochs: 20,
batch_size: 16,
accumulate_gradients: 4,
clip_norm: 1.0
)