whisper_ct2 usage rules

Copy Markdown View Source

Rules for LLM coding agents using whisper_ct2 in a consumer project. Published per the usage_rules convention; sync into your project with mix usage_rules.sync.

Load the model once, reuse the struct

WhisperCt2.load_model/2 returns {:ok, %WhisperCt2.Model{ref: ref}} where ref is a NIF resource pointing at the live CTranslate2 model. The model stays in memory as long as some process holds the struct. Do not call load_model/2 per request - hold it in a long-lived process.

defmodule MyApp.Whisper do
  use GenServer

  def start_link(path), do: GenServer.start_link(__MODULE__, path, name: __MODULE__)
  def transcribe(audio, opts \\ []),
    do: GenServer.call(__MODULE__, {:transcribe, audio, opts}, :infinity)

  @impl true
  def init(path) do
    {:ok, model} = WhisperCt2.load_model(path)
    {:ok, model}
  end

  @impl true
  def handle_call({:transcribe, audio, opts}, _from, model) do
    {:reply, WhisperCt2.transcribe(model, audio, opts), model}
  end
end

Put it under your supervision tree. When the process dies the NIF resource is freed, so let the supervisor reload it.

Parallelism: one Model serialises inside ct2rs

A single %Model{} processes calls serially through the NIF. For real concurrency across multiple callers, load N replicas (one per process) and pool them - e.g. with :poolboy, nimble_pool, or a Registry-keyed set of GenServers. Increasing :max_queued_batches only deepens the queue, not the worker count.

Do not share the same model across OS threads expecting parallel inference; share it across BEAM processes for fan-in, not fan-out.

Batched transcribe collapses per-call overhead

WhisperCt2.transcribe_batch(model, [audio1, audio2, ...], opts) stacks every chunk of every input into one mel batch and runs the encoder once across the whole thing. For diarization-driven workflows (one call per turn, dozens to hundreds of turns) this is materially faster than looping transcribe/3 because CTranslate2 amortises the encoder forward pass across the batch.

:language applies to every audio in the batch; pass nil to auto-detect per-audio (only meaningful on multilingual checkpoints).

For carving sub-windows out of an already-decoded buffer, use WhisperCt2.Pcm.slice(samples, sample_rate, start_s, duration_s) - it does the f32 byte math (4 bytes/sample) and bounds-checks against the buffer size, so a slice past the end fails loudly instead of decoding garbage.

pcm = File.read!("call.pcm")          # f32 LE, 16 kHz mono, prepared upstream
{:ok, turn} = WhisperCt2.Pcm.slice(pcm, model.sampling_rate, 12.3, 4.7)
WhisperCt2.transcribe(model, {:pcm_f32, turn}, language: "en")

Word-level timestamps are opt-in

Pass word_timestamps: true to attach %WhisperCt2.Word{text, start, end, probability} entries to each segment. Implementation reuses the encoder output from generate and runs one extra batched align call (DTW over decoder attention) across every chunk in the batch. Cost is on the order of the alignment pass itself, not a second encoder forward. Use it for caption alignment or diarization-aware splicing; skip it when you only need segment timing.

Segment timestamps: :with_timestamps

:with_timestamps defaults to true: the prompt asks Whisper to emit <|t_..|> tokens that split each 30 s chunk into sub-segments. Leave it on for stock OpenAI / Systran/faster-whisper-* checkpoints.

Set with_timestamps: false for fine-tunes that ignore the timestamp instruction or were trained to emit plain text (e.g. some domain fine-tunes). The chunk's full text then becomes one segment spanning [0, chunk_duration_s) instead of being silently dropped.

:word_timestamps implicitly forces :with_timestamps back to true - the DTW alignment needs the timestamp scaffolding. Don't combine with_timestamps: false with word_timestamps: true and expect the former to win.

Initial prompt and prefix

  • :initial_prompt - free-text conditioning prepended via <|startofprev|>. Bias the decoder toward domain vocabulary, names, or speaker style ("Discussion of CTranslate2 internals", "Dialogue between Alice and Bob"). Same role as in faster-whisper.
  • :prefix - forced text the generation must start with. Useful when the first words are already known (caption corrections, fixed intro lines).

Both are tokenised inside the NIF without special-token expansion, so control tokens in the strings are not interpreted.

Pass :language when you know it

:language defaults to nil, which makes Whisper auto-detect from the first chunk. Auto-detection adds latency and can misfire on short or noisy clips (English-only fine-tunes still sometimes guess :cy or :fr). Always pass language: "en" (or the relevant ISO code) when the source language is known.

model.multilingual tells you whether the loaded checkpoint can do anything other than English - faster-whisper-*.en variants are monolingual and ignore :language. Branch on model.multilingual if your code supports both.

Result shape

{:ok, %WhisperCt2.Transcription{text, segments, language, duration_s}}:

  • text - all segment texts joined by " " and String.trim/1'd. Use this for display or downstream NLP.
  • segments - list of %WhisperCt2.Segment{}, each carrying absolute :start / :end seconds, :no_speech_prob, :avg_logprob, the underlying text-token IDs (:tokens), and :words (nil unless :word_timestamps was set).
  • language - resolved ISO code (auto-detected when not pinned).
  • duration_s - input audio length in seconds.

Segment timestamps are real fields, not embedded tokens - do not regex the text for <|t_..|>. Boundaries are produced by Whisper's own timestamp tokens, parsed inside the NIF.

:no_speech_prob and :avg_logprob are always populated; filter hallucination with e.g. seg.avg_logprob < -1.0 or seg.no_speech_prob > 0.6.

Model struct fields are part of the API

Illustrative shape (:ref and :path omitted for brevity; both are also part of the struct):

%WhisperCt2.Model{
  sampling_rate: 16_000,   # always 16 kHz for published Whisper
  n_samples: 480_000,      # samples in one Whisper window (30 s)
  multilingual: true,      # false for *.en variants
  device: :cpu,            # resolved (never :auto)
  compute_type: :int8,     # resolved (never :default / :auto)
  ...
}

Read these at runtime instead of hardcoding. device and compute_type are the resolved values - :auto and :default are normalised at load time.

Audio contract is strict

CTranslate2 wants mono f32 PCM at the model's sample rate (always 16 kHz for published Whisper checkpoints), normalised to -1.0..1.0. transcribe/3 and transcribe_batch/3 accept exactly one shape:

  • {:pcm_f32, binary} - little-endian f32 samples at the model's sample rate.

There is no built-in decoder. Paths, raw bare binaries, WAV bytes, MP3, etc. are all rejected at the boundary with a clear :invalid_request error. Decoding, downmixing, and resampling are the caller's job; use ffmpeg, Membrane, or your platform audio stack upstream.

ffmpeg -i input.mp3 -ar 16000 -ac 1 -f f32le output.pcm
pcm = File.read!("output.pcm")
WhisperCt2.transcribe(model, {:pcm_f32, pcm}, language: "en")

For microphone or streaming sources, build the f32 buffer yourself and pass {:pcm_f32, binary}. The authoritative sample rate is model.sampling_rate on the loaded struct, not a hard-coded 16_000.

Audio longer than 30 s is split into Whisper-window chunks automatically; per-chunk text is in transcription.segments.

Return shape: never raises on the happy path

Every public function returns {:ok, _} | {:error, %WhisperCt2.Error{}}. The error struct implements Exception, so raise/1 works if you want let-it-crash behaviour, but do not write case clauses that assume an {:ok, _} pattern only. Error.reason is one of:

  • :invalid_request - bad options or audio shape; rejected before the NIF
  • :load_error - model directory missing or unreadable
  • :inference_error - CTranslate2 raised during transcription
  • :runtime_error - other ct2rs-side failure
  • :nif_panic - Rust panic caught by the panic boundary
  • :native_error - fallback for unrecognised native errors

Device and compute_type selection

Probe before deciding:

WhisperCt2.available_devices()
#=> {:ok, %{cpu: 1, cuda: 1, cuda_supported: true}}
  • device: :auto (default) picks CUDA when the artefact was built with it and at least one device is visible; otherwise CPU. Use this unless you have a reason not to.
  • device: :cuda returns {:error, %WhisperCt2.Error{reason: :invalid_request}} if CUDA is unavailable - do not assume it succeeds.
  • compute_type: :default keeps the stored quantisation of the model (recommended for Systran/faster-whisper-* int8 builds). compute_type: :auto lets ct2rs pick the fastest supported on-device.

Do not hardcode :float16 / :int8_float16 unless you know the target hardware supports it - mismatches raise :load_error.

Model files

load_model/2 needs a directory containing:

model.bin
config.json
tokenizer.json
vocabulary.txt
preprocessor_config.json

Systran/faster-whisper-* ships the first four. preprocessor_config.json must be copied from any openai/whisper-* repo (all sizes share the file). A missing preprocessor_config.json is the most common :load_error cause; check this first when load fails.

Backend selection at install time

The published Hex package picks the right precompiled NIF from your target triple automatically. Two consumer-facing knobs:

  • WHISPER_CT2_VARIANT=mkl on x86_64-unknown-linux-gnu selects the Intel MKL artefact instead of oneDNN. Only set this on Intel-only fleets.
  • WHISPER_CT2_BUILD=1 (or config :rustler_precompiled, :force_build, whisper_ct2: true) forces a source build. First build of CTranslate2 takes ~10 minutes and needs Rust, CMake, and a C++17 toolchain. Do not enable this in CI unless you understand the cost.

x86_64 macOS and Windows are not shipped - source build only.

Do not

  • Do not call load_model/2 per transcription.
  • Do not pass .wav (or any other file) paths, raw bare binaries, encoded WAV bytes, mp3, opus, or non-16 kHz audio to transcribe/3 - the audio contract is {:pcm_f32, binary} only. Decode and resample upstream (ffmpeg -ar 16000 -ac 1 -f f32le).
  • Do not assume device: :cuda succeeds; check available_devices/0 or use :auto.
  • Do not share a single %Model{} to get parallel inference; pool replicas.
  • Do not catch :nif_panic and retry blindly - it indicates a bug worth reporting.
  • Do not hardcode 16_000 as the sample rate - read model.sampling_rate.
  • Do not pass :language to a *.en checkpoint and expect anything but English; check model.multilingual if the language is dynamic.
  • Do not regex segment text for <|t_..|> tokens - segment timestamps are real fields (:start, :end) populated from the model output.
  • Do not loop transcribe/3 over a list of short clips when transcribe_batch/3 would batch them through one encoder pass.
  • Do not pass control tokens like <|en|> inside :initial_prompt or :prefix; they are tokenised as plain text and will not behave as special tokens.