A TFLite "delegate" is the runtime that actually executes the model graph. TFLite ships several; picking the right one for your platform + model combination is the single biggest perf lever.
This guide lays out the decision matrix and explains the quirks we discovered while measuring on real hardware.
Quick decision tree

    Are you on iOS (any iPhone/iPad)?
    ├── Model is FP16 or FP32? → delegate: "coreml"  (24 ms iPhone SE A15)
    └── Model is INT8?         → delegate: "xnnpack" (27 ms iPhone SE A15)
                                 (NOT coreml: 0/256 nodes delegate)

    Are you on Android?
    ├── MediaTek Dimensity?    → delegate: "nnapi", accelerator: "mtk-gpu_shim"
    ├── Qualcomm Snapdragon?   → delegate: "nnapi", accelerator: "qti-gpu"
    ├── Pixel?                 → delegate: "nnapi", accelerator: "google-edgetpu"
    ├── Unknown OEM?           → delegate: "xnnpack" (always works)
    └── Want to discover?      → see "Discovering NNAPI accelerators" below

    Mac / Linux dev host?      → delegate: "xnnpack"

Per-delegate detail
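As a bridge into the per-delegate sections, the quick decision tree can be sketched as a pure function from a device description to delegate options. The `DelegateChoice` module and the `:os`, `:soc`, and `:precision` keys are hypothetical names chosen for illustration; only the `delegate:` and `accelerator:` values come from the tree itself.

```elixir
defmodule DelegateChoice do
  # Hypothetical helper (not part of the library): encodes the quick
  # decision tree as pattern-matched clauses, checked top to bottom.

  # iOS: Core ML for float models, XNNPACK for INT8.
  def options(%{os: :ios, precision: p}) when p in [:fp16, :fp32],
    do: [delegate: "coreml"]

  def options(%{os: :ios, precision: :int8}),
    do: [delegate: "xnnpack"]

  # Android: always name the accelerator explicitly.
  def options(%{os: :android, soc: :mediatek_dimensity}),
    do: [delegate: "nnapi", accelerator: "mtk-gpu_shim"]

  def options(%{os: :android, soc: :qualcomm_snapdragon}),
    do: [delegate: "nnapi", accelerator: "qti-gpu"]

  def options(%{os: :android, soc: :google_pixel}),
    do: [delegate: "nnapi", accelerator: "google-edgetpu"]

  # Unknown OEMs and Mac / Linux dev hosts: XNNPACK always works.
  def options(_device), do: [delegate: "xnnpack"]
end

IO.inspect(DelegateChoice.options(%{os: :ios, precision: :int8}))
```

Keeping the mapping in one function makes the fallback explicit: anything the clauses do not recognise ends up on XNNPACK.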
xnnpack — CPU+SIMD
Bundled into TFLite. Cross-platform. Default when no other delegate is set explicitly.
Highly optimised: tuned ARM NEON / Intel AVX kernels, with FP32 and quantized-INT8 paths.
| Pro | Con |
|---|---|
| Works everywhere TFLite runs | No GPU / NPU |
| Reproducible numbers (less thermal throttling than GPU / NPU paths) | Slower than accelerator paths when those work |
| No vendor-driver dependencies | |
Options:
- `num_threads:` sets the CPU thread count (default 6). Raising it up to the physical core count helps; oversubscription hurts.
On the Moto G Power 5G (Dimensity 7020), XNNPACK matched the GPU delegate at 77 ms for YOLOv8n. On modern phones with strong CPU cores, XNNPACK is a competitive default.
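One way to follow the no-oversubscription advice is to cap the requested thread count before passing it as `num_threads:`. This is a minimal sketch, assuming `System.schedulers_online/0` (the logical processors visible to the BEAM) is an acceptable stand-in for the physical core count; `XnnpackThreads` is a hypothetical helper, not part of the library.

```elixir
defmodule XnnpackThreads do
  # Hypothetical helper: never ask for more threads than the runtime
  # reports as available. `available` can be overridden, e.g. in tests.
  def clamp(requested, available \\ System.schedulers_online())
      when is_integer(requested) and requested > 0 and
             is_integer(available) and available > 0 do
    min(requested, available)
  end
end

# Build xnnpack options with at most 4 threads on this machine.
opts = [delegate: "xnnpack", num_threads: XnnpackThreads.clamp(4)]
IO.inspect(opts)
```

On big.LITTLE phones the online scheduler count includes the small cores, so a lower cap may still be worth benchmarking.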
nnapi — Android Neural Networks API
Android's neural-net dispatch layer. Each device's vendor ships a
HAL driver (libmtk-gpu-shim.so, libqti-gpu.so, etc.); NNAPI picks
one based on the accelerator name (or by default, badly — see
below).
Options:
- `accelerator:` is the vendor HAL name string. Always pass this explicitly: on at least one MediaTek device, NNAPI's auto-selection picks the NPU, which is up to 5× slower than the GPU for YOLO-class models.
- `allow_fp16:` lets the HAL run FP32 ops at FP16 precision (default `true`). Lossy but typically fine for inference.
Discovering NNAPI accelerators on a connected device
The standalone bench CLI (in scripts/bench_android/) has a list-nnapi mode:

    adb push scripts/bench_android/bench /data/local/tmp/
    adb push ~/.mob/cache/tflite-2.16.1-android_arm64/jni/arm64-v8a/libtensorflowlite_jni.so /data/local/tmp/
    adb shell 'cd /data/local/tmp && LD_LIBRARY_PATH=. ./bench list-nnapi'

Output is a list of accelerator names available on this device:

    mtk-gpu_shim
    mtk-neuron_shim
    nnapi-reference  (CPU emulation — slow)

Known accelerator perf rankings (YOLOv8n on Moto G Power 5G)
| Accelerator | Median |
|---|---|
| mtk-gpu_shim | 75-117 ms (best; MediaTek's PowerVR HAL) |
| xnnpack CPU | 77-91 ms (tied with GPU; deterministic) |
| mtk-neuron_shim | 355 ms (NPU; slower because YOLO post-processing falls back to CPU) |
| nnapi-reference | 358 ms (CPU emulation; never use) |
| nnapi (no accelerator) | 358 ms (defaults to mtk-neuron_shim; never do this) |
The NPU loses despite being a "real" neural-net accelerator because
YOLO has concat + reshape ops in its post-processing that aren't
in the NPU's supported set. TFLite falls back to CPU for those ops
mid-graph, with cross-device buffer transfers between each fallback.
The transfer overhead swamps any per-op NPU speedup.
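To see why a handful of unsupported ops is so costly, it helps to count how often execution crosses the accelerator/CPU boundary. The sketch below uses a made-up op sequence (not the real YOLOv8n graph) and a hypothetical `GraphPartition` module; each transition stands for one cross-device buffer transfer.

```elixir
defmodule GraphPartition do
  # ops is a list of {op_name, supported_by_accelerator?} tuples in
  # execution order. A transition is any adjacent pair where one op runs
  # on the accelerator and the next on the CPU, or the other way round.
  def transitions(ops) do
    ops
    |> Enum.map(fn {_name, supported?} -> supported? end)
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.count(fn [a, b] -> a != b end)
  end
end

# Illustrative only: two unsupported ops in the middle of a conv chain.
ops = [
  {:conv, true},
  {:conv, true},
  {:reshape, false},
  {:conv, true},
  {:concat, false},
  {:conv, true}
]

IO.inspect(GraphPartition.transitions(ops))
```

Two unsupported ops out of six already force four device switches here, which is the pattern that hurts the NPU path.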
A model designed end-to-end for the NPU (no reshape/concat in the
inference graph) would likely run much faster on mtk-neuron_shim. YOLOv8n
as exported doesn't fit.
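The discovery output and the rankings can be combined into a small selection step. This is a sketch under two assumptions: that each line of the `list-nnapi` output starts with the accelerator name (any trailing annotation is ignored), and that the GPU/TPU names from the decision tree are the only ones worth preferring. `NnapiPick` is a hypothetical module; `mtk-neuron_shim` and `nnapi-reference` are deliberately left out of the preference list.

```elixir
defmodule NnapiPick do
  # Accelerators worth choosing explicitly, in order of preference.
  @preferred ["mtk-gpu_shim", "qti-gpu", "google-edgetpu"]

  # Extract accelerator names: first whitespace-delimited token per line.
  def parse(output) do
    output
    |> String.split("\n", trim: true)
    |> Enum.map(&String.trim/1)
    |> Enum.reject(&(&1 == ""))
    |> Enum.map(fn line -> line |> String.split() |> hd() end)
  end

  # Pick the first preferred accelerator present, else fall back to CPU.
  def options(names) do
    case Enum.find(@preferred, &(&1 in names)) do
      nil -> [delegate: "xnnpack"]
      name -> [delegate: "nnapi", accelerator: name]
    end
  end
end

sample = """
mtk-gpu_shim
mtk-neuron_shim
nnapi-reference  (CPU emulation, slow)
"""

sample |> NnapiPick.parse() |> NnapiPick.options() |> IO.inspect()
```

Falling back to XNNPACK rather than to bare `nnapi` avoids the auto-selection path that landed on the NPU.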
coreml — Apple Core ML
Routes the delegated portion of the graph through Apple's Core ML framework, which internally schedules supported ops onto the Apple Neural Engine (ANE) on devices where Core ML can reach one (A12 and later; the A11 has a Neural Engine, but Core ML cannot use it).
Options:
- `coreml_ane_only:` when `true`, `load_module/2` returns `{:error, _}` instead of falling back to CPU on devices without an ANE. Useful for "ANE-only or skip" logic. Default `false`.
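The "ANE-only or skip" logic might look like the sketch below. To stay self-contained it takes the loader as a function argument; the assumptions are that `load_module/2` accepts a model path plus a keyword list of options and returns `{:ok, handle}` on success, neither of which is confirmed here beyond the `{:error, _}` case. `AneOrSkip` is a hypothetical module.

```elixir
defmodule AneOrSkip do
  # loader is any 2-arity function shaped like the assumed
  # load_module(path, opts). Try Core ML with ANE required first, then
  # fall back to XNNPACK explicitly so the caller knows which path won.
  def load(loader, path) when is_function(loader, 2) do
    case loader.(path, delegate: "coreml", coreml_ane_only: true) do
      {:ok, handle} ->
        {:ok, :coreml, handle}

      {:error, _reason} ->
        case loader.(path, delegate: "xnnpack") do
          {:ok, handle} -> {:ok, :xnnpack, handle}
          {:error, _} = err -> err
        end
    end
  end
end

# Stub loader simulating a device without an ANE.
no_ane = fn _path, opts ->
  if opts[:coreml_ane_only], do: {:error, :no_ane}, else: {:ok, :cpu_handle}
end

IO.inspect(AneOrSkip.load(no_ane, "model.tflite"))
```

Returning the chosen delegate alongside the handle makes it easy to log which path a given device actually took.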
Op-coverage caveat — the INT8 trap
Don't use Core ML with INT8 models. The Ultralytics
yolov8n_full_integer_quant.tflite export uses INT8 quantization ops
that Core ML's tooling doesn't translate to ANE primitives. The
result: 0 out of 256 nodes delegated, and the whole model falls
back to CPU which is slower than just running XNNPACK directly.
For Core ML you want the FP16 or FP32 model variant:
| Model | Core ML delegation rate | Latency (iPhone SE A15) |
|---|---|---|
| INT8 | 0/256 (0%) — full CPU fallback | 45 ms (don't use) |
| FP16 | 214/385 (56%) | 23-25 ms |
| FP32 | 214/254 (84%) | 24-25 ms |
FP16 and FP32 hit the same wall-clock because the delegated portion is the same (214 conv-shaped ops). FP16 wins on bundle size (~6 MB vs ~12 MB).
The 44% of nodes that fall back to CPU on FP16 (171 of 385) are the post-processing ops (concat / reshape / NMS-prep), the same shape as the Android NPU problem. Core ML handles the boundary more gracefully than the NNAPI NPU path does (cheap shared-memory transitions on Apple silicon), which is why this works at all.
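The delegation percentages follow directly from the node counts in the table, so they are easy to recompute when a new export changes the graph. A small sketch with a hypothetical `DelegationRate` module:

```elixir
defmodule DelegationRate do
  # Share of nodes handed to the delegate, as a rounded percentage.
  def delegated_pct(delegated, total) when total > 0 and delegated <= total,
    do: round(delegated * 100 / total)

  # Share of nodes left on the CPU, as a rounded percentage.
  def cpu_pct(delegated, total) when total > 0 and delegated <= total,
    do: round((total - delegated) * 100 / total)
end

# Node counts from the table above.
IO.inspect(fp16: {DelegationRate.delegated_pct(214, 385), DelegationRate.cpu_pct(214, 385)})
IO.inspect(fp32: {DelegationRate.delegated_pct(214, 254), DelegationRate.cpu_pct(214, 254)})
IO.inspect(int8: {DelegationRate.delegated_pct(0, 256), DelegationRate.cpu_pct(0, 256)})
```

The same 214 delegated nodes give 56% on FP16 and 84% on FP32 only because the FP16 graph has more nodes in total.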
metal — Apple Metal GPU (planned)
TFLite ships TensorFlowLiteCMetal.xcframework with a Metal GPU
delegate, but the current NIF doesn't expose it as a delegate:
option. PR welcome.
Core ML is usually faster than Metal on Apple Silicon since it can pick ANE for supported ops + Metal as a fallback. Metal-only is mainly useful for older devices without an ANE.
Comparing the paths on the same device
Same iPhone SE 3rd gen A15, same .tflite model files, varying the
delegate:
| Variant | Delegate | Delegation | Min / Median / Max |
|---|---|---|---|
| INT8 | xnnpack | n/a (CPU+NEON) | 27 / 36 / 37 ms |
| INT8 | coreml | 0/256 (full fallback) | 36 / 39 / 42 ms |
| FP16 | xnnpack | n/a (CPU+NEON) | 86 / 98 / 265 ms |
| FP16 | coreml | 214/385 (56%) | 23 / 25 / 26 ms |
| FP32 | coreml | 214/254 (84%) | 24 / 24 / 25 ms |
The standout: FP16 + Core ML is the best overall at 25 ms median, with half the bundle size of FP32 and near-identical wall-clock (FP32 is 24 ms). The INT8 XNNPACK path (CPU+NEON) is impressive at 36 ms median; for context, our standalone bench measurements show it consistently within 30 ms of the GPU/ANE paths on modern phones.
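For anyone reproducing the Min / Median / Max column, the summary can be computed from raw per-run latencies as sketched below. `BenchStats` is a hypothetical helper, and the sample values are invented for illustration rather than taken from the measurements above.

```elixir
defmodule BenchStats do
  # Summarise a non-empty list of latencies (ms) as min / median / max.
  # For an even number of samples the median is the mean of the two
  # middle values (and therefore a float).
  def summary(samples) when is_list(samples) and samples != [] do
    sorted = Enum.sort(samples)
    n = length(sorted)
    mid = div(n, 2)

    median =
      if rem(n, 2) == 1 do
        Enum.at(sorted, mid)
      else
        (Enum.at(sorted, mid - 1) + Enum.at(sorted, mid)) / 2
      end

    %{min: hd(sorted), median: median, max: List.last(sorted)}
  end
end

# Invented samples, in ms.
IO.inspect(BenchStats.summary([31, 28, 40, 29, 30]))
```

Reporting the median rather than the mean keeps a single slow outlier run from skewing the comparison.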
Composing with Nx backends
TFLite delegates handle the model graph. The pre/post-processing in your Elixir code is separate compute that you can route to a different backend:

    # Input prep on EMLX (Metal GPU on iOS): useful for batch
    # transformations, scaling, normalization.
    # Assumes camera_bytes already holds 640*640*3 f32 values in 0..255;
    # for raw 8-bit RGB, read it as :u8 instead.
    input_bytes =
      camera_bytes
      |> Nx.from_binary(:f32, backend: {EMLX.Backend, device: :gpu})
      |> Nx.reshape({1, 640, 640, 3})
      |> Nx.divide(255.0)
      |> Nx.to_binary()

    # Model inference on TFLite + Core ML → ANE
    {:ok, [out]} = NxTfliteMob.call(handle, [input_bytes])

    # Output decode on EMLX again
    out
    |> Nx.from_binary(:f32, backend: {EMLX.Backend, device: :gpu})
    |> Nx.reshape({1, 84, 8400})
    |> ...

Two distinct compute paths, one screen. The TFLite delegate doesn't care what backend your Nx code uses; it sees only the bytes you hand to call/2.
When xnnpack is the right answer even when GPU/NPU is available
- Deterministic numbers (CPU paths don't thermal-throttle as aggressively)
- Cold-start (delegate init for Core ML / NNAPI is 100-500 ms)
- Tiny models (the delegate dispatch overhead dominates inference for sub-ms models)
- Cross-platform parity for tests
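The cold-start point can be made concrete with a break-even estimate: how many inferences it takes before the per-inference saving repays the delegate's init cost. This is a rough sketch that assumes XNNPACK's own init cost is negligible and that the medians quoted in this guide hold; `ColdStart` is a hypothetical module.

```elixir
defmodule ColdStart do
  # Number of inferences after which an accelerator delegate with a
  # one-off init cost beats the CPU path. Returns :never when the
  # accelerator is not faster per inference.
  def break_even(init_ms, cpu_ms, accel_ms)
      when init_ms >= 0 and accel_ms < cpu_ms do
    ceil(init_ms / (cpu_ms - accel_ms))
  end

  def break_even(_init_ms, _cpu_ms, _accel_ms), do: :never
end

# INT8 xnnpack at 36 ms median vs FP16 coreml at 25 ms median,
# with delegate init at the low and high end of 100-500 ms.
IO.inspect(ColdStart.break_even(100, 36, 25))
IO.inspect(ColdStart.break_even(500, 36, 25))
```

With those figures the Core ML path pays for itself after roughly 10 to 46 inferences, so one-shot or rarely used models may be better off on XNNPACK.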