Mobile Deployment with ExBurn


Overview

ExBurn compiles trained models for mobile deployment via Burn's CubeCL backend. The pipeline optimizes models for the target GPU backend:

  • iOS: Metal via CubeCL
  • Android: Vulkan via CubeCL

ExBurn is designed as a library: it provides the Nx backend and GPU acceleration layer that other frameworks can build on.

Compiling a Model

# Define a model with Axon
model =
  Axon.input("input", shape: {nil, 784})
  |> Axon.dense(128, activation: :relu)
  |> Axon.dropout(rate: 0.2)
  |> Axon.dense(10)

# Compile for training/inference
compiled = ExBurn.Model.compile(model,
  loss: :cross_entropy,
  optimizer: :adam,
  learning_rate: 0.001
)

# Run inference
{:ok, output} = ExBurn.Model.predict(compiled, input_tensor)

# Save for deployment
ExBurn.Model.save(compiled, "model.bin")

# Load
{:ok, loaded} = ExBurn.Model.load(compiled, "model.bin")

Using ExCubecl for GPU Inference

ExBurn integrates with ExCubecl for GPU buffer management and kernel execution:

# Create GPU buffers via ExCubecl
{:ok, input_buf} = ExCubecl.buffer([1.0, 2.0, 3.0], [3], :f32)
{:ok, output_buf} = ExCubecl.buffer([0.0, 0.0, 0.0], [3], :f32)

# Run a kernel
ExCubecl.run_kernel("elementwise_add", [input_buf, input_buf], output_buf)

# Read results back
{:ok, data} = ExCubecl.read(output_buf)
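
When wiring up a kernel like the one above, it can help to compare the readback against a CPU reference. The sketch below is plain Elixir and makes no ExCubecl calls. It assumes that "elementwise_add" adds its two input buffers position by position, and the `data` value is a stand-in for whatever `ExCubecl.read/1` returns. The name `close_enough?` and the tolerance are illustrative choices, not part of either library.

```elixir
# Same values as the ExCubecl example; the kernel receives input_buf twice.
input = [1.0, 2.0, 3.0]

# CPU reference (assumption: the kernel adds operands position by position).
expected = Enum.zip_with(input, input, fn a, b -> a + b end)

# Hypothetical helper: compare two float lists within a tolerance, since
# float results can differ slightly between GPU backends.
close_enough? = fn actual, reference, tolerance ->
  same_length = length(actual) == length(reference)

  within_tolerance =
    actual
    |> Enum.zip(reference)
    |> Enum.all?(fn {a, r} -> abs(a - r) <= tolerance end)

  same_length and within_tolerance
end

# Stand-in for the list read back from the output buffer.
data = [2.0, 4.0, 6.0]

matches = close_enough?.(data, expected, 1.0e-6)
IO.puts("GPU result matches CPU reference: #{matches}")
```

Running the same check on each target device is a cheap way to catch backend-specific differences early.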

Using ExBurn.Serving for Batched Inference

For production inference with concurrent batching:

# Build a serving from a compiled model
serving = ExBurn.Serving.build(compiled,
  batch_size: 32,
  batch_timeout: 50
)

# Run batched inference
output = Nx.Serving.run(serving, input_tensor)
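
`Nx.Serving.run/2` executes in the calling process, so requests from different callers are not merged. Batching across concurrent callers in Nx normally means starting the serving as a supervised process and calling `Nx.Serving.batched_run/2`. The sketch below shows that pattern using only Nx. It assumes `ExBurn.Serving.build/2` returns an `Nx.Serving` struct (as the call above suggests) and substitutes a trivial `Nx.Serving.jit/1` serving for it. `DemoServing` is an illustrative name.

```elixir
Mix.install([{:nx, "~> 0.7"}])

# Stand-in for the serving returned by ExBurn.Serving.build/2 (assumption).
serving = Nx.Serving.jit(fn x -> Nx.multiply(x, 2) end)

# Start the serving as a process. Requests arriving within batch_timeout
# milliseconds are grouped, up to batch_size entries per batch.
{:ok, _supervisor} =
  Supervisor.start_link(
    [
      {Nx.Serving,
       serving: serving, name: DemoServing, batch_size: 32, batch_timeout: 50}
    ],
    strategy: :one_for_one
  )

# Simulate four concurrent callers, each submitting a one-entry batch.
results =
  1..4
  |> Task.async_stream(fn i ->
    batch = Nx.Batch.stack([Nx.tensor([i, i, i])])
    Nx.Serving.batched_run(DemoServing, batch)
  end)
  |> Enum.map(fn {:ok, tensor} -> Nx.to_list(tensor) end)

IO.inspect(results, label: "per-caller results")
```

Each caller gets back only its own slice of the combined batch, so client code does not need to know how requests were grouped.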

Model Optimization Tips

  1. Use f16 quantization: Halves weight memory relative to f32, usually with minimal accuracy loss
  2. Reduce model size: Target under 10 MB for mobile apps
  3. Batch inference: Process multiple inputs together for better throughput
  4. Use ExCubecl pipelines: Chain multiple GPU kernels without CPU round-trips
  5. Profile on device: Benchmark on the target hardware before deploying
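
To make the first two tips concrete, the following sketch estimates the weight footprint of the 784 -> 128 -> 10 model defined earlier, in both f32 and f16. It is plain Elixir arithmetic and calls no ExBurn functions. It assumes each dense layer has a bias (the Axon default) and counts raw weights only; the size of a saved model file may differ because of format overhead.

```elixir
# Dense layer sizes from the Axon model above: 784 -> 128 -> 10.
# Dropout has no trainable parameters, so it is left out.
layers = [{784, 128}, {128, 10}]

# Each dense layer stores an inputs x outputs kernel plus one bias per output.
param_count =
  layers
  |> Enum.map(fn {inputs, outputs} -> inputs * outputs + outputs end)
  |> Enum.sum()

# Bytes needed to store one parameter in each precision.
bytes_per_param = %{f32: 4, f16: 2}

sizes =
  Map.new(bytes_per_param, fn {type, bytes} -> {type, param_count * bytes} end)

IO.puts("parameters: #{param_count}")
IO.puts("f32 weights: #{sizes.f32} bytes")
IO.puts("f16 weights: #{sizes.f16} bytes")
```

For this model the count is 101,770 parameters, or roughly 407 KB in f32 and 204 KB in f16, well under the 10 MB target.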

Supported Operations

Operation    iOS (Metal)    Android (Vulkan)
Dense
Conv2D
ReLU
Sigmoid
Softmax
Dropout
LayerNorm