Native Elixir bindings for Whisper speech-to-text via CTranslate2.
Calls ct2rs::sys::Whisper directly through a Rustler NIF: no Python, no
HTTP gateway. The NIF owns the mel spectrogram, tokenizer, and prompt
construction, so structured per-segment results, :initial_prompt /
:prefix biasing, word-level timestamps, and batched transcription across
multiple audio inputs are all first-class.
Quickstart
{:ok, model} = WhisperCt2.load_model("/path/to/faster-whisper-tiny")
pcm = File.read!("audio.pcm")
{:ok, %WhisperCt2.Transcription{text: text, segments: segs}} =
  WhisperCt2.transcribe(model, {:pcm_f32, pcm}, language: "en")
IO.puts(text)
for s <- segs, do: IO.puts("[#{s.start}-#{s.end}] #{s.text}")
Audio contract
CTranslate2 expects mono f32 PCM samples at the model's
:sampling_rate (16 kHz for every published Whisper checkpoint),
normalised to the -1.0..1.0 range. transcribe/3 and
transcribe_batch/3 accept exactly one shape:
{:pcm_f32, binary} - little-endian f32 samples at the model's sample rate.
Decoding .wav, resampling, downmixing, and any other format
conversion is the caller's job. There is no bundled audio decoder;
use ffmpeg, a dedicated library, or your platform's audio stack
before calling in.
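As one way to meet this contract, the sketch below prepares audio in the expected layout. `AudioPrep` and both of its functions are hypothetical helpers, not part of WhisperCt2. The second function assumes an `ffmpeg` binary on the PATH and a model that samples at 16 kHz.

```elixir
defmodule AudioPrep do
  # Hypothetical helpers for illustration; not part of WhisperCt2.

  # Converts signed 16-bit little-endian PCM to little-endian f32 samples.
  def s16le_to_f32le(pcm) when is_binary(pcm) do
    for <<sample::little-signed-integer-size(16) <- pcm>>, into: <<>> do
      # Dividing by 32768 maps the s16 range onto -1.0..1.0.
      <<(sample / 32768.0)::little-float-size(32)>>
    end
  end

  # Decodes any file ffmpeg understands to mono 16 kHz f32le PCM on stdout.
  # Raises if the ffmpeg executable cannot be found.
  def decode_with_ffmpeg(path) do
    args = [
      "-loglevel", "error",
      "-i", path,
      "-f", "f32le",
      "-ac", "1",
      "-ar", "16000",
      "pipe:1"
    ]

    case System.cmd("ffmpeg", args) do
      {pcm, 0} -> {:ok, {:pcm_f32, pcm}}
      {_output, status} -> {:error, {:ffmpeg_exit, status}}
    end
  end
end
```

The `{:pcm_f32, pcm}` tuple returned by `decode_with_ffmpeg/1` has the shape that transcribe/3 accepts. Mono downmixing and resampling are delegated to ffmpeg here rather than done in Elixir.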
Audio longer than the Whisper 30 s window is chunked internally; the
encoder runs once across every chunk in the batch. Diarization-driven
workflows that need many short splices should use
transcribe_batch/3.
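For the diarization case, a minimal sketch of cutting one recording into splices might look like the following. `Splices` is a hypothetical module, the `{start_s, end_s}` turn list stands in for whatever a diarizer produces, and 16 kHz is assumed as the default sample rate.

```elixir
defmodule Splices do
  # Hypothetical helper for illustration; not part of WhisperCt2.
  @bytes_per_sample 4

  # Cuts the [start_s, end_s) window out of an f32 PCM binary,
  # clamping both ends to the available samples.
  def slice(pcm, start_s, end_s, rate \\ 16_000) when is_binary(pcm) do
    total = div(byte_size(pcm), @bytes_per_sample)
    first = (start_s * rate) |> round() |> max(0) |> min(total)
    last = (end_s * rate) |> round() |> max(first) |> min(total)
    binary_part(pcm, first * @bytes_per_sample, (last - first) * @bytes_per_sample)
  end

  # Builds the list of audio tuples that a batched call takes.
  def to_batch(pcm, turns, rate \\ 16_000) do
    for {start_s, end_s} <- turns do
      {:pcm_f32, slice(pcm, start_s, end_s, rate)}
    end
  end
end

# Usage, assuming `model` is loaded and `pcm` holds the full recording:
#
#   turns = [{0.0, 2.5}, {2.5, 7.0}]
#   {:ok, results} =
#     WhisperCt2.transcribe_batch(model, Splices.to_batch(pcm, turns), language: "en")
```

Slicing on sample boundaries (multiples of 4 bytes) keeps every splice a valid f32 binary.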
Types
@type audio() :: {:pcm_f32, binary()}
Audio sources accepted by transcribe/3 and transcribe_batch/3.
@type load_opt() ::
        {:device, :cpu | :cuda | :auto}
        | {:compute_type, WhisperCt2.Model.compute_type()}
        | {:device_indices, [non_neg_integer(), ...]}
        | {:num_threads_per_replica, non_neg_integer()}
        | {:max_queued_batches, integer()}
        | {:cpu_core_offset, integer()}
Options accepted by load_model/2.
@type transcribe_opt() ::
        {:language, String.t() | nil}
        | {:initial_prompt, String.t() | nil}
        | {:prefix, String.t() | nil}
        | {:word_timestamps, boolean()}
        | {:with_timestamps, boolean()}
        | {:beam_size, pos_integer()}
        | {:patience, float()}
        | {:length_penalty, float()}
        | {:repetition_penalty, float()}
        | {:no_repeat_ngram_size, non_neg_integer()}
        | {:sampling_temperature, float()}
        | {:sampling_topk, pos_integer()}
        | {:suppress_blank, boolean()}
        | {:max_length, pos_integer()}
        | {:num_hypotheses, pos_integer()}
        | {:max_initial_timestamp_index, non_neg_integer()}
        | {:suppress_tokens, [integer()]}
Options accepted by transcribe/3 / transcribe_batch/3.
Functions
@spec available_devices() :: {:ok, %{cpu: non_neg_integer(), cuda: non_neg_integer(), cuda_supported: boolean()}} | {:error, WhisperCt2.Error.t()}
Reports CTranslate2 device support for this build.
Returns {:ok, %{cpu: n, cuda: n, cuda_supported: bool}} on success.
cuda_supported reflects compile-time CUDA features (build with
WHISPER_CT2_FEATURES=cuda-dynamic mix compile to enable). cuda is the
count of NVIDIA GPU devices visible at runtime, or 0 when CUDA is not
built in.
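A short sketch of consuming the returned map follows. `DeviceChoice` is a hypothetical module, and its `pick/1` rule (prefer CUDA when it is compiled in and at least one GPU is visible) is an illustrative assumption rather than the library's own selection logic.

```elixir
defmodule DeviceChoice do
  # Hypothetical helper; it only relies on the documented map shape.

  # Prefers CUDA when it is compiled in and at least one GPU is visible.
  def pick(%{cuda_supported: true, cuda: n}) when n > 0, do: :cuda
  def pick(%{cpu: _}), do: :cpu

  # Queries the build and returns {:ok, :cpu | :cuda} or the original error.
  def detect do
    case WhisperCt2.available_devices() do
      {:ok, info} -> {:ok, pick(info)}
      {:error, _} = error -> error
    end
  end
end
```

An explicit check like this is mainly useful for logging or for failing fast when a GPU is expected but absent.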
@spec load_model(Path.t(), [load_opt()]) :: {:ok, WhisperCt2.Model.t()} | {:error, WhisperCt2.Error.t()}
Loads a CTranslate2 Whisper model from a directory.
See the WhisperCt2 module doc for required model files.
Options
:device - :cpu, :cuda, or :auto (default). :auto picks CUDA when the
  binary was built with CUDA support and a device is visible; otherwise
  CPU.
:compute_type - precision used at inference. :default keeps the model's
  stored quantisation; :auto picks the fastest supported on this device.
:device_indices - non-empty list of GPU indices (default [0]).
:num_threads_per_replica - intra-op threads. 0 lets CTranslate2 pick.
:max_queued_batches, :cpu_core_offset - passed through to CTranslate2.
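The options above could be combined as in the sketch below. `LoadExample` is a hypothetical wrapper, and the specific option values are illustrative choices drawn from the list above, not recommended settings.

```elixir
defmodule LoadExample do
  # Hypothetical wrapper for illustration; option names come from load_opt().

  # One option set per target device. Values are illustrative.
  def opts(:cuda), do: [device: :cuda, device_indices: [0], compute_type: :auto]
  def opts(:cpu), do: [device: :cpu, compute_type: :default, num_threads_per_replica: 0]

  # Returns {:ok, model} or {:error, error}, exactly as load_model/2 does.
  def load(path, target) when target in [:cpu, :cuda] do
    WhisperCt2.load_model(path, opts(target))
  end
end

# Usage, with the same placeholder path as the Quickstart:
#
#   {:ok, model} = LoadExample.load("/path/to/faster-whisper-tiny", :cpu)
```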
@spec transcribe(WhisperCt2.Model.t(), audio(), [transcribe_opt()]) :: {:ok, WhisperCt2.Transcription.t()} | {:error, WhisperCt2.Error.t()}
Transcribes audio using model.
Returns {:ok, %WhisperCt2.Transcription{}} whose :segments carry
absolute start/end times, no_speech_prob, avg_logprob, the
underlying text tokens, and (when :word_timestamps is set) per-word
timing. no_speech_prob and avg_logprob are always populated.
Options
:language - ISO code ("en"). nil (default) auto-detects.
:initial_prompt - free-text context prepended via <|startofprev|> to
  bias decoding.
:prefix - forced text the generation must start with.
:word_timestamps - when true, attaches :words to each segment via one
  extra batched DTW alignment pass. Default false.
:with_timestamps - when true (default) the prompt asks the model to
  emit <|t_..|> timestamp tokens that split the output into
  sub-segments. Set to false for fine-tunes that emit text without
  timestamps; the chunk's full text becomes one segment spanning
  [0, chunk_duration_s). Implicitly forced to true whenever
  :word_timestamps is enabled because alignment needs the timestamp
  scaffolding.
Decoding knobs forwarded to CTranslate2: :beam_size, :patience,
  :length_penalty, :repetition_penalty, :no_repeat_ngram_size,
  :sampling_temperature, :sampling_topk, :suppress_blank, :max_length,
  :num_hypotheses, :max_initial_timestamp_index, :suppress_tokens.
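A hedged sketch of a biased, word-aligned run follows. `TranscribeExample` is a hypothetical module, the option values and the 0.6 threshold are illustrative, and the code assumes each segment exposes :no_speech_prob as a field, as the return description suggests.

```elixir
defmodule TranscribeExample do
  # Hypothetical helpers for illustration; not part of WhisperCt2.

  # Options for a vocabulary-biased run with per-word timing.
  def opts(vocabulary_hint) do
    [language: "en", initial_prompt: vocabulary_hint, word_timestamps: true, beam_size: 5]
  end

  # Drops segments the model scored as likely silence.
  def speech_only(segments, threshold \\ 0.6) do
    Enum.reject(segments, fn segment -> segment.no_speech_prob > threshold end)
  end

  # Transcribes and keeps only the segments that pass the filter.
  def run(model, pcm, vocabulary_hint) do
    with {:ok, result} <- WhisperCt2.transcribe(model, {:pcm_f32, pcm}, opts(vocabulary_hint)) do
      {:ok, speech_only(result.segments)}
    end
  end
end
```

Because :word_timestamps is set, :with_timestamps is forced to true for this call regardless of what the caller passes.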
@spec transcribe_batch(WhisperCt2.Model.t(), [audio()], [transcribe_opt()]) :: {:ok, [WhisperCt2.Transcription.t()]} | {:error, WhisperCt2.Error.t()}
Transcribes a list of audios in one batched generate call. Every
chunk of every input shares a single encoder forward pass; output
preserves input order.
Options are the same as transcribe/3. :language applies to every
audio in the batch; pass nil to auto-detect per-audio.
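Since output order matches input order, results can be paired back with caller-side identifiers by position. The sketch below assumes a hypothetical `BatchExample` module and a `clips` list of `{id, {:pcm_f32, binary}}` tuples; neither is part of WhisperCt2.

```elixir
defmodule BatchExample do
  # Hypothetical helper. It relies only on the documented guarantee that
  # transcribe_batch/3 returns results in input order.

  # Pairs ids with transcriptions by position and keeps each text.
  def label(ids, transcriptions) when length(ids) == length(transcriptions) do
    ids
    |> Enum.zip(transcriptions)
    |> Map.new(fn {id, transcription} -> {id, transcription.text} end)
  end

  # `clips` is a list of {id, {:pcm_f32, binary}} tuples.
  def run(model, clips) do
    {ids, audios} = Enum.unzip(clips)

    # language: nil asks for per-audio language detection.
    with {:ok, results} <- WhisperCt2.transcribe_batch(model, audios, language: nil) do
      {:ok, label(ids, results)}
    end
  end
end
```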