NxTfliteMob (nx_tflite_mob v0.0.4)


Call TensorFlow Lite models from Elixir / BEAM, with full vendor accelerator access on phones — Apple Neural Engine on iOS, MediaTek / Qualcomm GPU+NPU HALs on Android.

This is NOT an Nx.Backend

NxTfliteMob does not replace Nx.BinaryBackend, EMLX.Backend, NxVulkan.Backend, etc. There is no NxTfliteMob.Backend module to set via Nx.global_default_backend/1.

TFLite executes pre-compiled model graphs (.tflite files) end-to-end through vendor-optimised delegates. The whole graph stays opaque so the delegate can fuse + schedule it for ANE / GPU / NPU. You don't compose your own ops here — you call a pre-trained model.

Use NxTfliteMob when you have a pre-trained model to run. Use Nx backends when you're writing arbitrary tensor math in Elixir.

Both can coexist in the same app.

API surface

Three functions: load_module/2, call/2, release_module/1.

iex> tflite = File.read!("priv/yolov8n_float16.tflite")
iex> {:ok, handle} = NxTfliteMob.load_module(tflite,
...>   delegate: "coreml", coreml_ane_only: false)
iex> {:ok, [output_bytes]} = NxTfliteMob.call(handle, [input_bytes])
iex> NxTfliteMob.release_module(handle)
:ok

See the YOLO walkthrough for a complete end-to-end example with input prep, inference, and output decode.

Delegate options

The delegate opt selects how the model graph runs. Per-platform recommendations (see also the delegates guide):

Android — delegate: "nnapi"

NNAPI is Android's neural-net dispatch API. It picks a vendor HAL driver based on the accelerator name:

| accelerator: value | What it routes to |
| --- | --- |
| "mtk-gpu_shim" | MediaTek's GPU HAL; fastest for YOLO on Dimensity chips |
| "mtk-neuron_shim" | MediaTek's APU/NPU; only worthwhile if your graph is pure conv (no concat/reshape post-processing: TFLite falls back to CPU for those, and transfer overhead dominates) |
| "qti-gpu" | Qualcomm Snapdragon GPU |
| "google-edgetpu" | Pixel TPU |
| nil (no key) | NNAPI auto-picks; often the WRONG choice for YOLO (defaults to NPU on MediaTek, which is 5× slower) |

Discover available accelerators on a connected device with adb shell + the standalone bench CLI's list-nnapi mode (see the package's scripts/bench_android/).

Other Android opts:

  • num_threads: — XNNPACK CPU thread count (default 6)
  • allow_fp16: — let NNAPI run FP32 ops in FP16 (default true)

iOS — delegate: "coreml"

The Core ML delegate routes the delegated portion of the graph through Apple's Core ML framework, which internally schedules supported ops on the Apple Neural Engine. For YOLOv8n FP16, ~56% of nodes delegate to the ANE on an iPhone SE 3rd gen (A15), and the rest fall to CPU via XNNPACK. Inference takes 24 ms.

Caveats:

  • INT8 + Core ML doesn't work. Core ML's tooling doesn't understand the Ultralytics INT8 quant flavour — 0/256 nodes delegate. Use the FP16 model variant for Core ML.
  • coreml_ane_only: (default false) — when true, the delegate returns nil instead of falling back to CPU on devices without an ANE. Useful for "ANE-only or skip" logic; irrelevant on A11+ devices where the ANE is always present.

iOS — delegate: "metal" (planned)

TFLite ships TensorFlowLiteCMetal.xcframework for Metal GPU inference, but the current NIF doesn't expose it as a delegate: option yet. PRs welcome. Core ML is usually faster anyway on Apple Silicon devices (Core ML can pick the GPU when ANE ops are unsupported).

XNNPACK CPU — delegate: "xnnpack" (default)

Bundled into TFLite. Highly-optimised CPU+SIMD path. Default when no other delegate is set. Surprisingly competitive on modern phones — ~77 ms on the Moto G Power 5G (tied with the GPU path) and 27-37 ms on iPhone SE 3rd gen A15. Use this when:

  • You're on a device without GPU/NPU acceleration
  • The vendor delegate fails to delegate (e.g. INT8 + Core ML)
  • You want deterministic, reproducible numbers (CPU paths don't thermal-throttle as aggressively as GPUs)
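The "vendor delegate first, XNNPACK otherwise" idea can be sketched as a small helper that tries a list of option sets in order. This is a minimal sketch, not part of NxTfliteMob: the module name `DelegateFallback`, the function `load_first/3`, and the injectable `impl` argument are hypothetical, and it only reacts to `{:error, _}` from load_module/2. A delegate that loads successfully but delegates 0 nodes (as with INT8 + Core ML) would not be caught by it.

```elixir
defmodule DelegateFallback do
  @moduledoc "Hypothetical helper: try several load_module/2 option sets in order."

  # `impl` defaults to NxTfliteMob; it is injectable so the logic can be
  # exercised without the NIF (an assumption made for this sketch).
  def load_first(model_bytes, candidates, impl \\ NxTfliteMob) do
    Enum.reduce_while(candidates, {:error, "no candidates given"}, fn opts, _last ->
      case impl.load_module(model_bytes, opts) do
        # First option set that loads wins; also return which opts were used.
        {:ok, handle} -> {:halt, {:ok, handle, opts}}
        # Remember the most recent error and try the next candidate.
        {:error, _} = err -> {:cont, err}
      end
    end)
  end
end
```

A call might look like `DelegateFallback.load_first(model_bytes, [[delegate: "coreml", coreml_ane_only: true], [delegate: "xnnpack"]])`, which returns the handle together with the option set that succeeded.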

Input + output byte layout

call/2 is raw-bytes-in, raw-bytes-out. The byte layout is model-specific — you have to match what the .tflite model expects.

Inspect a model's expected shape/dtype via mix Python helpers or the FlatBuffers flatc tool. Alternatively, use :erlang.load_nif/2 to load an inspector NIF built against TFLite's TfLiteInterpreterGetInputTensor. Exposing this in the Elixir API is on the roadmap.

Common shapes:

| Model | Input | Output |
| --- | --- | --- |
| YOLOv8n INT8 (Ultralytics full_integer_quant) | 1×640×640×3 INT8 NHWC (1228800 bytes) | 1×84×8400 INT8 (705600 bytes) |
| YOLOv8n FP16 (Ultralytics float16) | 1×640×640×3 FP32 NHWC (4915200 bytes; the FP16 model accepts FP32 input that's cast internally) | 1×84×8400 FP32 normalised (2822400 bytes) |
| YOLOv8n FP32 | 1×640×640×3 FP32 NHWC | 1×84×8400 FP32 |
| MobileNetV2 (ImageNet) | 1×224×224×3 FP32 NHWC | 1×1001 FP32 (class logits) |
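The byte counts above are just the product of the shape dimensions times the element size. A minimal sketch of that arithmetic follows; `TensorBytes` and `expected_bytes/2` are hypothetical names for illustration, not part of the NxTfliteMob API.

```elixir
defmodule TensorBytes do
  @moduledoc "Sketch: expected byte size of a dense tensor from its shape and dtype."

  # Bytes per element for the dtypes mentioned in this guide.
  @bytes_per_element %{s8: 1, u8: 1, f16: 2, f32: 4}

  @spec expected_bytes(tuple(), atom()) :: pos_integer()
  def expected_bytes(shape, dtype) when is_tuple(shape) do
    elements = shape |> Tuple.to_list() |> Enum.product()
    elements * Map.fetch!(@bytes_per_element, dtype)
  end
end

# 1 * 640 * 640 * 3 elements at 1 byte each
TensorBytes.expected_bytes({1, 640, 640, 3}, :s8)
# => 1228800
```

Comparing `byte_size(input)` against this value before calling call/2 gives an early, local error instead of a failure inside the NIF.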

See the YOLO walkthrough for the layout-aware decoder we use in production (pure-BEAM, 13 ms for the full INT8 NMS pass).

Where Nx fits in (optionally)

You CAN use Nx tensors on either side of call/2. It's optional — bytes-in/bytes-out is the canonical interface.

Input prep with Nx:

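input_bytes =
  camera_frame_f32_binary
  |> Nx.from_binary(:f32)
  |> Nx.reshape({1, 640, 640, 3})
  # Quantize: q = round(x / scale) + zero_point (params from the model's input tensor)
  |> Nx.divide(input_scale)
  |> Nx.round()
  |> Nx.add(input_zero_point)
  |> Nx.clip(-128, 127)
  |> Nx.as_type(:s8)
  |> Nx.to_binary()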

{:ok, [out]} = NxTfliteMob.call(handle, [input_bytes])

Output decode with Nx:

detections =
  out
  |> Nx.from_binary(:s8)
  |> Nx.reshape({1, 84, 8400})
  |> Nx.as_type(:f32)
  # Dequantize: x = (q - zero_point) * scale
  |> Nx.subtract(zero_point)
  |> Nx.multiply(scale)
  |> extract_detections()
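For reference, the affine INT8 mapping used by TFLite quantized tensors is real = (q - zero_point) * scale, and the inverse is q = round(real / scale) + zero_point clamped to the INT8 range. The sketch below shows both directions in plain Elixir on raw binaries, without Nx. `AffineQuant` and its functions are hypothetical names, and the scale and zero point are assumed to come from the model's tensor metadata.

```elixir
defmodule AffineQuant do
  @moduledoc "Sketch of affine INT8 quantization on raw binaries (no Nx)."

  # real = (q - zero_point) * scale, one float per signed byte.
  def dequantize(bin, scale, zero_point) when is_binary(bin) do
    for <<q::signed-8 <- bin>>, do: (q - zero_point) * scale
  end

  # q = round(real / scale) + zero_point, clamped to -128..127.
  def quantize(floats, scale, zero_point) when is_list(floats) do
    for x <- floats, into: <<>> do
      q = (round(x / scale) + zero_point) |> max(-128) |> min(127)
      <<q::signed-8>>
    end
  end
end
```

With scale 0.5 and zero point 3, the bytes `<<5, 4, 2>>` dequantize to `[1.0, 0.5, -0.5]`, and quantizing that list gives the same three bytes back.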

In practice we bypass Nx for performance-critical decoding: an argmax across {80, 8400} takes about 1700 ms on Nx.BinaryBackend, while a pure-BEAM :binary.at/2 loop takes 13 ms (130× faster). See NxeigenProbe.LiveYoloScreen for the pure-BEAM decoder pattern.
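To illustrate the idea (this is not the decoder from NxeigenProbe.LiveYoloScreen), here is a minimal sketch of a per-anchor class argmax using :binary.at/2 directly on the output bytes. It assumes a row-major {rows, num_anchors} INT8 layout where the class-score rows follow the box rows (4 box rows and 80 class rows for the 1×84×8400 YOLOv8n output). The module and function names are hypothetical.

```elixir
defmodule ArgmaxSketch do
  @moduledoc "Sketch: class argmax for one anchor over a raw INT8 output binary."

  # Returns {best_class_index, best_raw_score} for the given anchor column.
  # `first_class_row` is the number of rows preceding the class scores
  # (4 box-coordinate rows for YOLOv8).
  def best_class(out, anchor, num_anchors, num_classes, first_class_row \\ 4) do
    Enum.reduce(0..(num_classes - 1), {-1, -129}, fn class, {_, best} = acc ->
      # :binary.at/2 returns an unsigned byte (0..255).
      raw = :binary.at(out, (first_class_row + class) * num_anchors + anchor)
      # Reinterpret the unsigned byte as a signed INT8 value.
      score = if raw > 127, do: raw - 256, else: raw
      if score > best, do: {class, score}, else: acc
    end)
  end
end
```

For the YOLOv8n INT8 output this would be called as `ArgmaxSketch.best_class(out, anchor, 8400, 80)` for each anchor in `0..8399`. The returned score is still quantized, so thresholds can be compared in the INT8 domain and only surviving candidates need dequantizing.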

Using with Mob

If you're building a Mob app, the easiest path is mob_dev's Igniter task:

mix mob.enable tflite

This adds the dependency, generates a per-platform default-opts helper, and registers the NIF in mob_dev's static-NIF table. Requires mob_dev >= 0.5.9. See mob_dev's mob.enable docs for details.

After mix mob.enable tflite, the auto-generated helper picks delegate opts per platform:

{:ok, h} = NxTfliteMob.load_module(model_bytes,
             MyApp.TfliteInit.default_opts())
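The generated helper itself is not shown here. As a rough illustration of what a per-platform option picker could look like, the sketch below maps a platform atom to the option sets recommended earlier on this page. The module name `DelegateOpts` and the platform atoms are hypothetical, and how the platform is detected is left to the caller.

```elixir
defmodule DelegateOpts do
  @moduledoc "Hypothetical per-platform load_module/2 options, taken from this guide."

  # iOS: Core ML, allowing CPU fallback when no ANE is available.
  def for_platform(:ios), do: [delegate: "coreml", coreml_ane_only: false]

  # Android on MediaTek: NNAPI routed to the GPU HAL rather than the NPU.
  def for_platform(:android_mediatek),
    do: [delegate: "nnapi", accelerator: "mtk-gpu_shim", allow_fp16: true]

  # Android on Qualcomm: NNAPI routed to the Snapdragon GPU.
  def for_platform(:android_qualcomm),
    do: [delegate: "nnapi", accelerator: "qti-gpu", allow_fp16: true]

  # Anything else: the bundled XNNPACK CPU path.
  def for_platform(_other), do: [delegate: "xnnpack"]
end
```

The result can be passed straight to load_module/2, for example `NxTfliteMob.load_module(model_bytes, DelegateOpts.for_platform(:ios))`.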

Building from source (non-Mob)

See the package's Makefile for the targets android, ios_device, ios_sim, and mac. Each requires a platform-appropriate TFLite distribution (cached at ~/.mob/cache/ by mob_dev's downloader, or per-target overrides for standalone builds).

Mac builds require building libtensorflowlite_c.dylib from TF source first — TFLite has no Mac arm64 prebuilt. See docs/build_mac_tflite.md in the repo.

Summary

Types

module_handle()
Opaque handle to a loaded TFLite model. Pass to call/2 and free with release_module/1. Handles that are never released are also freed when garbage collected, but explicit release is recommended for short-lived inferences.

Functions

call(handle, inputs)
Run inference on a loaded model.

load_module(model_bytes, opts \\ [])
Load a TFLite model from raw .tflite FlatBuffer bytes.

release_module(handle)
Free the model + delegate + interpreter held by handle.

Types

module_handle()

@type module_handle() :: reference()

Opaque handle to a loaded TFLite model. Pass to call/2 and free with release_module/1. Handles that are never released are also freed when garbage collected, but explicit release is recommended for short-lived inferences.

Functions

call(handle, inputs)

@spec call(module_handle(), [binary()]) ::
  {:ok, [binary()]} | {:error, String.t() | charlist()}

Run inference on a loaded model.

inputs is a list of binaries — one per input tensor in the model's declared input order. Each binary must match the model's expected shape × dtype byte layout exactly (1×640×640×3 INT8 = 1228800 bytes for YOLOv8n full_integer_quant, for example).

Returns {:ok, outputs} where outputs is a list of binaries — one per output tensor, also in declared order. Decode each according to the model's documented output layout.

Examples

# YOLOv8n INT8: 1×640×640×3 INT8 input, 1×84×8400 INT8 output
input = :binary.copy(<<0>>, 1_228_800)  # placeholder: 1228800 zeroed INT8 bytes
{:ok, [output]} = NxTfliteMob.call(handle, [input])
true = byte_size(output) == 705600

Errors

Returns {:error, message} for:

  • Input list length doesn't match the model's input-tensor count
  • Any input binary's size doesn't match the model's expected size
  • The model's TfLiteInterpreterInvoke returns non-OK status
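The first two error cases can also be caught on the Elixir side before crossing into the NIF. The following is a minimal sketch, assuming the caller already knows the expected byte size of each input tensor. `InputCheck.validate/2` is a hypothetical helper, not part of NxTfliteMob, and its error strings are not the ones the library returns.

```elixir
defmodule InputCheck do
  @moduledoc "Sketch: validate input count and byte sizes before calling call/2."

  @spec validate([binary()], [non_neg_integer()]) :: :ok | {:error, String.t()}
  def validate(inputs, expected_sizes) when length(inputs) != length(expected_sizes) do
    {:error, "expected #{length(expected_sizes)} input(s), got #{length(inputs)}"}
  end

  def validate(inputs, expected_sizes) do
    inputs
    |> Enum.zip(expected_sizes)
    |> Enum.with_index()
    # Return the first size mismatch, or :ok when every input matches.
    |> Enum.find_value(:ok, fn {{bin, size}, idx} ->
      if byte_size(bin) != size do
        {:error, "input #{idx}: expected #{size} bytes, got #{byte_size(bin)}"}
      end
    end)
  end
end
```

For the YOLOv8n INT8 model, `InputCheck.validate([input], [1_228_800])` would be run before `NxTfliteMob.call(handle, [input])`.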

load_module(model_bytes, opts \\ [])

@spec load_module(
  binary(),
  keyword()
) :: {:ok, module_handle()} | {:error, String.t() | charlist()}

Load a TFLite model from raw .tflite FlatBuffer bytes.

Returns {:ok, handle} on success or {:error, message} if the bytes aren't a valid TFLite model or delegate creation fails.

Options

All options are documented in detail in the moduledoc:

  • :delegate (string) — "xnnpack" (default), "nnapi" (Android), "coreml" (iOS)
  • :accelerator (string) — vendor accelerator name for NNAPI (e.g. "mtk-gpu_shim")
  • :num_threads (integer) — XNNPACK CPU thread count (default 6)
  • :allow_fp16 (boolean) — let NNAPI run FP32 ops in reduced FP16 precision (default true)
  • :coreml_ane_only (boolean) — Core ML requires ANE (default false — falls back to CPU/GPU)

Examples

# XNNPACK CPU (cross-platform default)
{:ok, h} = NxTfliteMob.load_module(tflite_bytes, [])

# Android NNAPI → MediaTek GPU HAL
{:ok, h} = NxTfliteMob.load_module(tflite_bytes,
             delegate: "nnapi",
             accelerator: "mtk-gpu_shim",
             allow_fp16: true)

# iOS Core ML → ANE
{:ok, h} = NxTfliteMob.load_module(tflite_bytes,
             delegate: "coreml",
             coreml_ane_only: false)

release_module(handle)

@spec release_module(module_handle()) :: :ok

Free the model + delegate + interpreter held by handle.

Idempotent: calling it on an already-released handle returns :ok (the underlying resource is zeroed, so re-releasing is a no-op).

Resources are also freed on GC if release_module/1 isn't called, but explicit release is recommended for tight loops or short-lived inferences to keep memory predictable.
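One way to make explicit release hard to forget is to scope the handle to a function call with try/after. This is a minimal sketch under stated assumptions: `ModelScope.with_model/4` is a hypothetical wrapper, not part of NxTfliteMob, and the `impl` argument exists only so the pattern can be exercised without the NIF.

```elixir
defmodule ModelScope do
  @moduledoc "Hypothetical wrapper: load a model, run a function, always release."

  # Returns whatever `fun` returns, or the {:error, message} from load_module/2.
  def with_model(model_bytes, opts, fun, impl \\ NxTfliteMob) when is_function(fun, 1) do
    with {:ok, handle} <- impl.load_module(model_bytes, opts) do
      try do
        fun.(handle)
      after
        # Runs even if `fun` raises, so the handle is not left to the GC.
        impl.release_module(handle)
      end
    end
  end
end
```

Usage would look like `ModelScope.with_model(model_bytes, [delegate: "xnnpack"], fn h -> NxTfliteMob.call(h, [input]) end)`. Because release_module/1 is idempotent, releasing again inside `fun` is harmless.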