whisper_cpp usage rules

These rules are for agents and humans writing code against whisper_cpp. They ship with the Hex package so downstream consumers can opt in to a consistent set of conventions.

Loading models

Pass a path to a .bin or .gguf whisper.cpp checkpoint to WhisperCpp.load_model/2. Download checkpoints from https://huggingface.co/ggerganov/whisper.cpp.
Cache the %WhisperCpp.Model{} for the process lifetime; loading is expensive and the underlying NIF resource is safe to share across BEAM processes - concurrent transcribe/3 calls do not serialise.
Prefer device: :auto (the default). Explicit device selection that does not match the installed NIF artefact returns :invalid_request.
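One way to follow the caching rule is to load once and park the handle in :persistent_term. The sketch below is a hypothetical helper, not part of whisper_cpp: ModelCache and its loader argument are illustrative names, and it assumes load_model/2 returns {:ok, model} on success and {:error, _} otherwise.
```elixir
defmodule ModelCache do
  @moduledoc """
  Hypothetical helper: loads a model once and keeps the handle in
  :persistent_term so every BEAM process reuses the same one.
  """

  # `loader` is a 1-arity function returning {:ok, model} | {:error, reason}.
  # In an application it might be (return shape is an assumption):
  #   fn path -> WhisperCpp.load_model(path, device: :auto) end
  def fetch(path, loader) when is_binary(path) and is_function(loader, 1) do
    key = {__MODULE__, path}

    case :persistent_term.get(key, :missing) do
      :missing ->
        # First caller pays the load cost; failures are returned and not cached.
        with {:ok, model} <- loader.(path) do
          :persistent_term.put(key, model)
          {:ok, model}
        end

      model ->
        {:ok, model}
    end
  end
end
```
Two processes that race on the first fetch may both load the model; if that matters, run the first fetch from a supervised start-up step before any transcription work begins.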

Audio input

transcribe/3 accepts exactly one shape: {:pcm_f32, binary()}, where the binary is little-endian IEEE-754 f32 samples, mono, 16 kHz, normalised to [-1.0, 1.0].
This library does not decode audio file formats. Decode WAV, MP3, FLAC, M4A, Opus, etc. upstream and hand the PCM in. Standard recipe with ffmpeg:
```
ffmpeg -i input.mp3 -f f32le -ac 1 -ar 16000 input.pcm
```
In Elixir: pcm = File.read!("input.pcm"), then WhisperCpp.transcribe(model, {:pcm_f32, pcm}, ...).
Bare binaries (without the {:pcm_f32, _} wrapper) and file paths are rejected with :invalid_request. A mistyped path could previously be treated as raw samples and turn into garbage PCM; requiring the wrapper surfaces the bug instead.
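A cheap sanity check before wrapping can catch some malformed input early. The sketch below is a hypothetical helper (PcmInput is not part of whisper_cpp): it only verifies that the byte count is a non-zero multiple of 4 and derives the duration from the 16 kHz mono f32 layout described above. It cannot tell whether the bytes are actually normalised audio.
```elixir
defmodule PcmInput do
  @sample_rate 16_000
  @bytes_per_sample 4

  # Wraps raw f32le bytes in the {:pcm_f32, _} tuple, but only when the
  # binary holds a whole, non-zero number of 4-byte samples.
  def wrap(pcm)
      when is_binary(pcm) and byte_size(pcm) > 0 and
             rem(byte_size(pcm), @bytes_per_sample) == 0 do
    {:ok, {:pcm_f32, pcm}}
  end

  def wrap(pcm) when is_binary(pcm), do: {:error, :not_f32_pcm}

  # Duration in seconds implied by the byte count (mono, 16 kHz, f32).
  def duration_s({:pcm_f32, pcm}) do
    byte_size(pcm) / (@bytes_per_sample * @sample_rate)
  end
end
```
For example, PcmInput.wrap(File.read!("input.pcm")) yields the tagged tuple for the ffmpeg output above, while a 9-byte string such as "input.mp3" is rejected because its length is not a multiple of 4.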

Slicing PCM

Use WhisperCpp.transcribe_slice/4 to transcribe a [start_s, end_s) window of an already-decoded master PCM buffer. It handles the byte math, runs whisper.cpp on the slice, and shifts segment/word times back into the absolute timeline.
Slices shorter than 0.3 s return an empty transcription. whisper.cpp pads short inputs and hallucinates into the padding; do not pass unfiltered VAD output.
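To make the byte math and the 0.3 s floor concrete, here is a minimal sketch of filtering VAD windows and converting a window to a byte range. SliceWindows is a hypothetical helper, and the rounding to the nearest sample is an assumption; transcribe_slice/4 may round differently internally.
```elixir
defmodule SliceWindows do
  @sample_rate 16_000
  @bytes_per_sample 4
  @min_slice_s 0.3

  # Keeps only {start_s, end_s} windows at least 0.3 s long.
  def usable(windows) do
    Enum.filter(windows, fn {start_s, end_s} -> end_s - start_s >= @min_slice_s end)
  end

  # Converts a [start_s, end_s) window into {byte_offset, byte_length}
  # for a mono 16 kHz f32 buffer, rounding to the nearest sample.
  def byte_range({start_s, end_s}) when start_s >= 0 and end_s >= start_s do
    first = round(start_s * @sample_rate)
    last = round(end_s * @sample_rate)
    {first * @bytes_per_sample, (last - first) * @bytes_per_sample}
  end
end
```
The resulting pair can be fed to binary_part/3 when inspecting a slice by hand; for actual transcription, the library call remains the better choice because it also shifts timestamps back into the absolute timeline.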

Cancellation and progress

For cancellable transcribes, mint a %WhisperCpp.AbortHandle{} via WhisperCpp.AbortHandle.new/0 and pass it via :abort_handle. Signal cancellation from another process with WhisperCpp.AbortHandle.abort/1. The call returns {:ok, partial_transcription} with whatever segments completed before whisper.cpp's next abort poll.
For progress, pass :progress_pid (commonly self() inside a Task). The pid receives {:whisper_progress, percent} messages (0..100) as work advances; duplicate percentages are coalesced.
Both hooks are zero-cost when omitted.
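The receiving side of :progress_pid can be sketched as a small receive loop. ProgressWatch is a hypothetical helper, and it assumes that percent is an integer and that a final 100 is delivered on completion; neither detail is confirmed above, so the idle timeout acts as a fallback.
```elixir
defmodule ProgressWatch do
  # Consumes {:whisper_progress, percent} messages from the calling process's
  # mailbox, passing each percent to on_progress. Returns :done once 100
  # arrives, or :stalled if no progress message shows up within idle_ms.
  def await(on_progress, idle_ms \\ 5_000) when is_function(on_progress, 1) do
    receive do
      {:whisper_progress, 100} ->
        on_progress.(100)
        :done

      {:whisper_progress, percent} when percent in 0..99 ->
        on_progress.(percent)
        await(on_progress, idle_ms)
    after
      idle_ms -> :stalled
    end
  end
end
```
A typical arrangement would run transcribe/3 inside a Task with progress_pid: set to the parent and abort_handle: set to a fresh handle, call ProgressWatch.await/2 in the parent, and on :stalled call WhisperCpp.AbortHandle.abort/1 before awaiting the task to collect the partial transcription.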

Options and errors

Pass options as keyword lists. Unknown keys and out-of-range values fail with {:error, %WhisperCpp.Error{reason: :invalid_request}} before reaching the NIF - rely on this for input validation.
Match %WhisperCpp.Error{} (or its :reason field) rather than inspecting message strings.
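Matching on :reason might look like the sketch below. TranscribeResult and the atoms it returns are hypothetical. Application code with the dependency compiled in would normally write %WhisperCpp.Error{reason: reason}; the plain map pattern used here also matches that struct and keeps the sketch compilable on its own.
```elixir
defmodule TranscribeResult do
  # Success passes through untouched.
  def handle({:ok, transcription}), do: {:ok, transcription}

  # :invalid_request means the options or the input were rejected before
  # reaching the NIF, so this is a caller bug rather than a runtime failure.
  def handle({:error, %{reason: :invalid_request}}) do
    {:error, :bad_options_or_input}
  end

  # Any other reason is surfaced as-is; message strings are never inspected.
  def handle({:error, %{reason: reason}}) do
    {:error, {:whisper_failed, reason}}
  end
end
```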

Performance

:n_threads defaults to 4. On dedicated nodes, set it to the number of physical cores.
Word timestamps add one DTW pass; enable :word_timestamps only when you need them.
For latency-sensitive workloads, prefer :single_segment on short clips to skip the segment-split pass.
Beam search (:beam_size > 1) is roughly 2-3x slower than greedy, but worth it when you need the lowest WER on long-form audio; for short slices, greedy is usually fine.
A single loaded model handle is safe to share: parallel transcribe calls do not serialise on the context lock, so saturating a GPU or multi-core CPU from many BEAM processes is the expected pattern.
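The trade-offs above can be folded into a small option builder. This is a sketch under stated assumptions: TranscribeOpts, the 30 s threshold, and the beam width of 5 are illustrative choices, :single_segment is assumed to take a boolean, and greedy decoding is assumed to apply when :beam_size is omitted.
```elixir
defmodule TranscribeOpts do
  # Hypothetical cut-off between "short clip" and "long-form" audio.
  @short_clip_s 30.0

  # Builds a keyword list of options from the clip duration and the number
  # of physical cores on a dedicated node.
  def for_duration(duration_s, physical_cores)
      when is_number(duration_s) and is_integer(physical_cores) and physical_cores > 0 do
    base = [n_threads: physical_cores]

    if duration_s <= @short_clip_s do
      # Short clip: skip the segment-split pass, stay on greedy decoding.
      base ++ [single_segment: true]
    else
      # Long-form audio: accept the slower beam search for lower WER.
      base ++ [beam_size: 5]
    end
  end
end
```
Add word_timestamps: to the list only at call sites that actually consume word-level timing, since it costs an extra DTW pass.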
