whisper_cpp usage rules

Copy Markdown View Source

For agents and humans writing code against whisper_cpp. These rules are shipped with the Hex package so downstream consumers can opt in to a consistent set of conventions.

Loading models

  • Pass a path to a .bin or .gguf whisper.cpp checkpoint to WhisperCpp.load_model/2. Download checkpoints from https://huggingface.co/ggerganov/whisper.cpp.
  • Cache the %WhisperCpp.Model{} for the process lifetime; loading is expensive and the underlying NIF resource is safe to share across BEAM processes - concurrent transcribe/3 calls do not serialise.
  • Prefer device: :auto (the default). Explicit device selection that does not match the installed NIF artefact returns :invalid_request.

Audio input

  • transcribe/3 accepts exactly one shape: {:pcm_f32, binary()}, where the binary is little-endian IEEE-754 f32 samples, mono, 16 kHz, normalised to [-1.0, 1.0].

  • This library does not decode audio file formats. Decode WAV, MP3, FLAC, M4A, Opus, etc. upstream and hand the PCM in. Standard recipe with ffmpeg:

    ffmpeg -i input.mp3 -f f32le -ac 1 -ar 16000 input.pcm
    

    In Elixir: pcm = File.read!("input.pcm"), then WhisperCpp.transcribe(model, {:pcm_f32, pcm}, ...).

  • Bare binaries (without the {:pcm_f32, _} wrapper) and file paths are rejected with :invalid_request. A typo'd path used to turn into garbage PCM; the wrapper surfaces the bug instead.

Slicing PCM

  • Use WhisperCpp.transcribe_slice/4 to transcribe a [start_s, end_s) window of an already-decoded master PCM buffer. It handles the byte math, runs whisper.cpp on the slice, and shifts segment/word times back into the absolute timeline.
  • Slices shorter than 0.3 s return an empty transcription. whisper.cpp pads short inputs and hallucinates into the padding; do not pass unfiltered VAD output.

Cancellation and progress

  • For cancellable transcribes, mint a %WhisperCpp.AbortHandle{} via WhisperCpp.AbortHandle.new/0 and pass it via :abort_handle. Signal cancellation from another process with WhisperCpp.AbortHandle.abort/1. The call returns {:ok, partial_transcription} with whatever segments completed before whisper.cpp's next abort poll.
  • For progress, pass :progress_pid (commonly self() inside a Task). The pid receives {:whisper_progress, percent} messages (0..100) as work advances; duplicate percentages are coalesced.
  • Both hooks are zero-cost when omitted.

Options and errors

  • Pass options as keyword lists. Unknown keys and out-of-range values fail with {:error, %WhisperCpp.Error{reason: :invalid_request}} before reaching the NIF - rely on this for input validation.
  • Match %WhisperCpp.Error{} (or its :reason field) rather than inspecting message strings.

Performance

  • :n_threads defaults to 4. On dedicated nodes, set it to the number of physical cores.
  • Word timestamps add one DTW pass; enable :word_timestamps only when you need them.
  • For latency-sensitive workloads, prefer :single_segment on short clips to skip the segment-split pass.
  • Beam search (:beam_size > 1) is roughly 2-3x slower than greedy and worth it for the lowest WER on long-form audio; for short slices, greedy is usually fine.
  • A single loaded model handle is safe to share: parallel transcribe calls do not serialise on the context lock, so saturating a GPU or multi-core CPU from many BEAM processes is the expected pattern.