Overview
ExBurn compiles models for mobile deployment via Burn's CubeCL backend:
- iOS: Metal via CubeCL
- Android: Vulkan via CubeCL
The typical workflow is: train on a desktop GPU → save the model → load and run inference on mobile.
Training and Saving on Desktop
# Train on desktop (CUDA or Metal)
model =
  Axon.input("input", shape: {nil, 784})
  |> Axon.dense(128, activation: :relu)
  |> Axon.dropout(rate: 0.2)
  |> Axon.dense(10)
compiled = ExBurn.Model.compile(model,
  loss: :cross_entropy,
  optimizer: :adam,
  learning_rate: 0.001
)
trained = ExBurn.Training.fit(compiled, {train_x, train_y},
  epochs: 20,
  batch_size: 64
)
# Save for deployment
ExBurn.Model.save(trained, "model.bin")
Loading and Inference on Mobile
# Load the model on the mobile device
{:ok, model} = ExBurn.Model.load(compiled, "model.bin")
# Run inference
{:ok, output} = ExBurn.Model.predict(model, input_tensor)
Using ExBurn.Serving for Batched Inference
For production inference with concurrent batching:
serving = ExBurn.Serving.build(model,
  batch_size: 32,
  batch_timeout: 50,
  partitions: System.schedulers_online()
)
output = Nx.Serving.run(serving, input_tensor)
Cross-Compilation
iOS (Metal)
# Add the iOS target
rustup target add aarch64-apple-ios
# Build the NIF for iOS
cd native/ex_burn_nif
cargo build --target aarch64-apple-ios --features metal --no-default-features --release
Android (Vulkan)
# Add the Android target
rustup target add aarch64-linux-android
# Build the NIF for Android
cd native/ex_burn_nif
cargo build --target aarch64-linux-android --features vulkan --no-default-features --release
CPU-only Fallback
cd native/ex_burn_nif
cargo build --no-default-features --release
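The three build commands above differ only in the target triple and the feature flag, so a build script can derive them from one table. The following is a minimal sketch of that idea; `NifBuildSketch` is a hypothetical helper and not part of ExBurn, and it only assembles the argument lists shown above without running them.

```elixir
defmodule NifBuildSketch do
  @moduledoc "Hypothetical helper (not part of ExBurn) that mirrors the cargo commands above."

  # Target triple and cargo feature per platform, copied from the commands above.
  @targets %{
    ios: {"aarch64-apple-ios", "metal"},
    android: {"aarch64-linux-android", "vulkan"}
  }

  @doc "Returns the cargo argument list for :ios, :android, or :cpu."
  def cargo_args(:cpu), do: ["build", "--no-default-features", "--release"]

  def cargo_args(platform) do
    case Map.fetch(@targets, platform) do
      {:ok, {triple, feature}} ->
        ["build", "--target", triple, "--features", feature,
         "--no-default-features", "--release"]

      :error ->
        raise ArgumentError, "unknown platform: #{inspect(platform)}"
    end
  end
end

# Print the command that would be run for Android.
IO.puts(Enum.join(["cargo" | NifBuildSketch.cargo_args(:android)], " "))
```

To run a build from Elixir, the list could be passed to `System.cmd("cargo", args, cd: "native/ex_burn_nif")`, assuming the matching `rustup` target is already installed.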
Model Optimization for Mobile
1. Use f16 Precision
Halves memory usage with minimal accuracy loss on inference:
# Convert parameters to f16
# (planned; for now, use Nx's built-in type conversion)
2. Reduce Model Size
| Model Size | Feasibility on Mobile |
|---|---|
| < 1M params | ✅ Comfortable on all modern devices |
| 1M – 10M params | ✅ Fine for inference, training may run out of memory |
| 10M – 50M params | ⚠️ Inference only, may need quantization |
| > 50M params | ❌ Not recommended for mobile |
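To place a model in the table above, it helps to count its parameters and estimate the weight storage. The sketch below does this for a stack of dense layers such as the 784 → 128 → 10 network from the training example (dropout adds no parameters). `ParamBudget` is a hypothetical helper and not part of ExBurn or Axon. It counts weights and biases only, at 4 bytes per f32 value and 2 bytes per f16 value.

```elixir
defmodule ParamBudget do
  @moduledoc "Hypothetical helper (not part of ExBurn or Axon) for rough size estimates."

  # A dense layer holds in * out weights plus out biases.
  def dense_params(in_features, out_features),
    do: in_features * out_features + out_features

  # Sum the dense-layer parameters over consecutive widths, e.g. [784, 128, 10].
  def mlp_params(widths) do
    widths
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.map(fn [i, o] -> dense_params(i, o) end)
    |> Enum.sum()
  end

  # Raw weight storage in bytes for the given precision.
  def weight_bytes(params, :f32), do: params * 4
  def weight_bytes(params, :f16), do: params * 2
end

params = ParamBudget.mlp_params([784, 128, 10])
IO.puts("parameters:  #{params}")
IO.puts("f32 weights: #{ParamBudget.weight_bytes(params, :f32)} bytes")
IO.puts("f16 weights: #{ParamBudget.weight_bytes(params, :f16)} bytes")
```

For this network the count is 101,770 parameters, well inside the first row of the table. That is about 407 KB of weights in f32 and half that in f16.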
3. Use ExCubecl Pipelines
Chain multiple GPU kernels without CPU round-trips:
{:ok, pipeline} = ExBurn.CubeclBridge.pipeline()
ExBurn.CubeclBridge.pipeline_add(pipeline, "dense", [input_buf, weight_buf, bias_buf], output_buf)
ExBurn.CubeclBridge.pipeline_add(pipeline, "relu", [output_buf], output_buf)
{:ok, _} = ExBurn.CubeclBridge.pipeline_run(pipeline)
4. Batch Inference
Process multiple inputs together for better GPU utilization:
serving = ExBurn.Serving.build(model, batch_size: 16, batch_timeout: 100)
Supported Operations
| Operation | iOS (Metal) | Android (Vulkan) | Notes |
|---|---|---|---|
| Dense / Linear | ✅ | ✅ | |
| Conv2D | ✅ | ✅ | |
| ReLU | ✅ | ✅ | |
| Sigmoid | ✅ | ✅ | |
| Softmax | ✅ | ✅ | |
| Dropout | ✅ | ✅ | No-op during inference |
| LayerNorm | ✅ | ✅ | |
| MatMul | ✅ | ✅ | |
| Transpose | ✅ | ✅ | |
| Reshape | ✅ | ✅ | |
| Concatenate | ✅ | ✅ | |
| Slice | ✅ | ✅ | |
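Before shipping a model, it can be useful to check its layer list against the table above. The sketch below shows one way to do that. The `OpCheck` module and the atom names are illustrative assumptions, not ExBurn identifiers, and the list of operations passed in is assumed to have been gathered by hand.

```elixir
defmodule OpCheck do
  @moduledoc "Hypothetical helper; atom names are illustrative, not ExBurn identifiers."

  # Operations listed as supported on both Metal and Vulkan in the table above.
  @supported [
    :dense, :conv2d, :relu, :sigmoid, :softmax, :dropout,
    :layer_norm, :matmul, :transpose, :reshape, :concatenate, :slice
  ]

  # Returns the distinct operations that do not appear in the table.
  def unsupported(ops) do
    ops
    |> Enum.reject(&(&1 in @supported))
    |> Enum.uniq()
  end
end

# The MLP from the training example uses only supported operations.
IO.inspect(OpCheck.unsupported([:dense, :relu, :dropout, :dense]))

# An operation that is absent from the table is reported.
IO.inspect(OpCheck.unsupported([:conv2d, :gelu, :dense]))
```

The first call prints `[]` and the second prints `[:gelu]`.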
Memory Considerations
- Burn's Autodiff backend is memory-intensive. Training on mobile is only feasible for small models (< 10M parameters).
- Inference is the primary use case for mobile deployment.
- Recommended minimum hardware: 4 GB RAM and an A12 or newer chip (iOS) or a Snapdragon 700-series or newer chip (Android).
- Use gradient checkpointing (planned for v0.3.0) to reduce training memory.
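The gap between inference and training memory can be illustrated with a back-of-the-envelope estimate. The sketch below assumes f32 values and the Adam optimizer, which keeps two moment estimates per parameter. Training state is therefore at least four times the weight storage (weights, gradients, and two moments). `MemoryEstimate` is a hypothetical helper, and the figures are lower bounds: they exclude the activations that the Autodiff backend retains for the backward pass, as well as any runtime overhead.

```elixir
defmodule MemoryEstimate do
  @moduledoc "Hypothetical helper for rough lower-bound estimates; not part of ExBurn."

  @bytes_per_f32 4

  # Inference needs at least the weights themselves.
  def inference_bytes(params), do: params * @bytes_per_f32

  # Adam training keeps weights, gradients, and first and second moments.
  def adam_training_bytes(params), do: params * @bytes_per_f32 * 4

  # Convert bytes to MiB, rounded to one decimal place.
  def mib(bytes), do: Float.round(bytes / (1024 * 1024), 1)
end

for params <- [1_000_000, 10_000_000, 50_000_000] do
  inference = MemoryEstimate.mib(MemoryEstimate.inference_bytes(params))
  training = MemoryEstimate.mib(MemoryEstimate.adam_training_bytes(params))

  IO.puts("#{params} params: weights ~#{inference} MiB, Adam training state ~#{training} MiB")
end
```

Activation memory grows with batch size and network depth and often exceeds these figures during training, so they are best read as a floor rather than a budget.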
Precompiled NIFs (v0.2.0)
Starting with v0.2.0, precompiled NIF binaries are distributed via rustler_precompiled, so end users no longer need the Rust toolchain. The correct binary for the target platform is downloaded automatically at compile time.