WhisperCt2 (whisper_ct2 v0.5.0)


Native Elixir bindings for Whisper speech-to-text via CTranslate2.

Calls ct2rs::sys::Whisper directly through a Rustler NIF: no Python, no HTTP gateway. The NIF owns the mel spectrogram, tokenizer, and prompt construction, so structured per-segment results, :initial_prompt / :prefix biasing, word-level timestamps, and batched transcription across multiple audio inputs are all first-class.

Quickstart

{:ok, model} = WhisperCt2.load_model("/path/to/faster-whisper-tiny")
pcm = File.read!("audio.pcm")

{:ok, %WhisperCt2.Transcription{text: text, segments: segs}} =
  WhisperCt2.transcribe(model, {:pcm_f32, pcm}, language: "en")

IO.puts(text)
for s <- segs, do: IO.puts("[#{s.start}-#{s.end}] #{s.text}")

Audio contract

CTranslate2 expects mono f32 PCM samples at the model's :sampling_rate (16 kHz for every published Whisper checkpoint), normalised to the -1.0..1.0 range. transcribe/3 and transcribe_batch/3 accept exactly one shape:

  • {:pcm_f32, binary} - little-endian f32 samples at the model's sample rate.

Decoding .wav, resampling, downmixing, and any other format conversion is the caller's job. There is no bundled audio decoder; use ffmpeg, a dedicated library, or your platform's audio stack before calling in.
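As one illustration of that caller-side conversion, here is a minimal sketch that turns signed 16-bit little-endian PCM (a common raw output of tools such as ffmpeg) into the f32 binary described above. The PcmConvert module and its function name are hypothetical and not part of WhisperCt2; the input is assumed to already be mono at the model's sample rate.

```elixir
defmodule PcmConvert do
  @moduledoc """
  Hypothetical helper, not part of WhisperCt2: converts signed 16-bit
  little-endian PCM into little-endian f32 samples in roughly -1.0..1.0.
  """

  # Dividing by 32768 maps the s16 range -32768..32767 onto -1.0..(just under) 1.0.
  @scale 32_768.0

  @spec s16le_to_f32le(binary()) :: binary()
  def s16le_to_f32le(pcm) when is_binary(pcm) and rem(byte_size(pcm), 2) == 0 do
    # Walk the input two bytes at a time and emit four bytes per sample.
    for <<sample::little-signed-integer-size(16) <- pcm>>, into: <<>> do
      <<(sample / @scale)::little-float-size(32)>>
    end
  end
end
```

The result can be wrapped as {:pcm_f32, f32} and passed to transcribe/3. Resampling and downmixing are not covered by this sketch and still have to happen beforehand.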

Audio longer than Whisper's 30 s window is split into chunks internally, and the encoder runs once over all chunks in the batch. Diarization-driven workflows that need many short splices should use transcribe_batch/3.
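To get a feel for how an input relates to the 30 s window, the following sketch computes the duration of a {:pcm_f32, binary} input and a naive window count. The AudioWindow module is hypothetical, it hard-codes the 16 kHz rate quoted above, and it assumes plain non-overlapping windows; the library's actual chunk boundaries are not specified here and may differ.

```elixir
defmodule AudioWindow do
  @moduledoc "Hypothetical helper, not part of WhisperCt2: sizes a PCM input."

  # Assumption: 16 kHz, the rate quoted for published Whisper checkpoints.
  @sample_rate 16_000
  # Each f32 sample occupies four bytes.
  @bytes_per_sample 4
  # Length of one Whisper window in seconds.
  @window_seconds 30

  # Duration in seconds, as a float.
  def duration_s({:pcm_f32, bin}) when is_binary(bin) do
    div(byte_size(bin), @bytes_per_sample) / @sample_rate
  end

  # Number of non-overlapping 30 s windows needed to cover the input.
  def window_count({:pcm_f32, bin}) when is_binary(bin) do
    samples = div(byte_size(bin), @bytes_per_sample)
    per_window = @sample_rate * @window_seconds
    # Integer ceiling division.
    div(samples + per_window - 1, per_window)
  end
end
```

A 31 s recording, for example, would span two such windows under these assumptions.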

Types

audio()

@type audio() :: {:pcm_f32, binary()}

Audio sources accepted by transcribe/3 and transcribe_batch/3.

load_opt()

@type load_opt() ::
  {:device, :cpu | :cuda | :auto}
  | {:compute_type, WhisperCt2.Model.compute_type()}
  | {:device_indices, [non_neg_integer(), ...]}
  | {:num_threads_per_replica, non_neg_integer()}
  | {:max_queued_batches, integer()}
  | {:cpu_core_offset, integer()}

Options accepted by load_model/2.

transcribe_opt()

@type transcribe_opt() ::
  {:language, String.t() | nil}
  | {:initial_prompt, String.t() | nil}
  | {:prefix, String.t() | nil}
  | {:word_timestamps, boolean()}
  | {:with_timestamps, boolean()}
  | {:beam_size, pos_integer()}
  | {:patience, float()}
  | {:length_penalty, float()}
  | {:repetition_penalty, float()}
  | {:no_repeat_ngram_size, non_neg_integer()}
  | {:sampling_temperature, float()}
  | {:sampling_topk, pos_integer()}
  | {:suppress_blank, boolean()}
  | {:max_length, pos_integer()}
  | {:num_hypotheses, pos_integer()}
  | {:max_initial_timestamp_index, non_neg_integer()}
  | {:suppress_tokens, [integer()]}

Options accepted by transcribe/3 and transcribe_batch/3.

Functions

available_devices()

@spec available_devices() ::
  {:ok,
   %{cpu: non_neg_integer(), cuda: non_neg_integer(), cuda_supported: boolean()}}
  | {:error, WhisperCt2.Error.t()}

Reports CTranslate2 device support for this build.

Returns {:ok, %{cpu: n, cuda: n, cuda_supported: bool}} on success. cuda_supported reflects compile-time CUDA features (build with WHISPER_CT2_FEATURES=cuda-dynamic mix compile to enable). cuda is the count of NVIDIA GPU devices visible at runtime, or 0 when CUDA is not built in.
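The returned map can be used to choose a device explicitly. The sketch below is a hypothetical helper (DevicePick is not part of WhisperCt2) that mirrors the rule documented for the :auto device option of load_model/2: prefer CUDA only when it is compiled in and at least one GPU is visible.

```elixir
defmodule DevicePick do
  @moduledoc "Hypothetical helper, not part of WhisperCt2."

  # Accepts the map found inside {:ok, map} from available_devices/0.
  # CUDA is chosen only when it is built in AND a device is visible.
  def pick(%{cuda_supported: true, cuda: n}) when is_integer(n) and n > 0, do: :cuda
  def pick(%{cpu: _}), do: :cpu

  # Queries the build's device support, then loads the model on the
  # chosen device. Any {:error, _} from either call is passed through.
  def load(path) do
    with {:ok, report} <- WhisperCt2.available_devices() do
      WhisperCt2.load_model(path, device: pick(report))
    end
  end
end
```

Making the choice explicit like this is mainly useful for logging which device was selected; passing device: :auto is described as doing the same selection internally.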

load_model(path, opts \\ [])

@spec load_model(Path.t(), [load_opt()]) ::
  {:ok, WhisperCt2.Model.t()} | {:error, WhisperCt2.Error.t()}

Loads a CTranslate2 Whisper model from a directory.

See the WhisperCt2 module doc for required model files.

Options

  • :device - :cpu, :cuda, or :auto (default). :auto picks CUDA when the binary was built with CUDA support and a device is visible; otherwise CPU.
  • :compute_type - precision used at inference. :default keeps the model's stored quantisation; :auto picks the fastest supported on this device.
  • :device_indices - non-empty list of GPU indices (default [0]).
  • :num_threads_per_replica - intra-op threads. 0 lets CTranslate2 pick.
  • :max_queued_batches, :cpu_core_offset - passed through to CTranslate2.

transcribe(model, audio, opts \\ [])

@spec transcribe(WhisperCt2.Model.t(), audio(), [transcribe_opt()]) ::
  {:ok, WhisperCt2.Transcription.t()} | {:error, WhisperCt2.Error.t()}

Transcribes audio using model.

Returns {:ok, %WhisperCt2.Transcription{}} whose :segments carry absolute start/end times, no_speech_prob, avg_logprob, the underlying text tokens, and (when :word_timestamps is set) per-word timing. no_speech_prob and avg_logprob are always populated.

Options

  • :language - ISO code ("en"). nil (default) auto-detects.
  • :initial_prompt - free-text context prepended via <|startofprev|> to bias decoding.
  • :prefix - forced text the generation must start with.
  • :word_timestamps - when true, attaches :words to each segment via one extra batched DTW alignment pass. Default false.
  • :with_timestamps - when true (default) the prompt asks the model to emit <|t_..|> timestamp tokens that split the output into sub-segments. Set to false for fine-tunes that emit text without timestamps; the chunk's full text becomes one segment spanning [0, chunk_duration_s). Implicitly forced to true whenever :word_timestamps is enabled because alignment needs the timestamp scaffolding.
  • Decoding knobs forwarded to CTranslate2: :beam_size, :patience, :length_penalty, :repetition_penalty, :no_repeat_ngram_size, :sampling_temperature, :sampling_topk, :suppress_blank, :max_length, :num_hypotheses, :max_initial_timestamp_index, :suppress_tokens.
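Because no_speech_prob and avg_logprob are always populated, callers can filter out segments that are probably silence or noise. The following is a minimal sketch: SegmentFilter is a hypothetical module, the default thresholds (0.6 and -1.0) are illustrative assumptions to tune for your audio, and segments are assumed to expose those two values as the fields :no_speech_prob and :avg_logprob.

```elixir
defmodule SegmentFilter do
  @moduledoc "Hypothetical post-processing helper, not part of WhisperCt2."

  # Drops a segment only when BOTH signals agree: the model thinks there is
  # probably no speech, and it was not confident in the text it produced.
  def keep_speech(segments, max_no_speech_prob \\ 0.6, min_avg_logprob \\ -1.0) do
    Enum.reject(segments, fn %{no_speech_prob: nsp, avg_logprob: lp} ->
      nsp > max_no_speech_prob and lp < min_avg_logprob
    end)
  end

  # Transcribes and returns only the segments that pass the filter.
  def transcribe_speech(model, audio, opts \\ []) do
    with {:ok, transcription} <- WhisperCt2.transcribe(model, audio, opts) do
      {:ok, keep_speech(transcription.segments)}
    end
  end
end
```

Requiring both conditions keeps quiet but confidently decoded speech; a stricter filter could reject on no_speech_prob alone.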

transcribe_batch(model, audios, opts \\ [])

@spec transcribe_batch(WhisperCt2.Model.t(), [audio()], [transcribe_opt()]) ::
  {:ok, [WhisperCt2.Transcription.t()]} | {:error, WhisperCt2.Error.t()}

Transcribes a list of audios in one batched generate call. Every chunk of every input shares a single encoder forward pass; output preserves input order.

Options are the same as transcribe/3. :language applies to every audio in the batch; pass nil to auto-detect per-audio.
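For the diarization-style use case, the splices can be cut from one decoded recording and sent in a single call. This is a sketch under stated assumptions: Splices is a hypothetical module, the time ranges are {start_s, end_s} tuples in seconds supplied by the caller, and the audio is f32 PCM at 16 kHz.

```elixir
defmodule Splices do
  @moduledoc "Hypothetical helper, not part of WhisperCt2."

  # Assumption: 16 kHz model sample rate, four bytes per f32 sample.
  @sample_rate 16_000
  @bytes_per_sample 4

  # Cuts each {start_s, end_s} range out of pcm and wraps it as an audio()
  # tuple. Ranges are clamped to the recording, so an out-of-bounds range
  # yields an empty splice rather than raising.
  def cut(pcm, ranges) when is_binary(pcm) do
    total = div(byte_size(pcm), @bytes_per_sample)

    for {start_s, end_s} <- ranges do
      first = to_sample(start_s, total)
      last = to_sample(end_s, total)
      count = max(last - first, 0)
      {:pcm_f32, binary_part(pcm, first * @bytes_per_sample, count * @bytes_per_sample)}
    end
  end

  # One batched call; results come back in the same order as ranges.
  def transcribe_ranges(model, pcm, ranges, opts \\ []) do
    WhisperCt2.transcribe_batch(model, cut(pcm, ranges), opts)
  end

  # Converts seconds to a sample index clamped to 0..total.
  defp to_sample(seconds, total) do
    (seconds * @sample_rate) |> trunc() |> max(0) |> min(total)
  end
end
```

Since output preserves input order, the returned transcriptions can be zipped back onto the original ranges. How the library treats an empty splice is not stated here, so callers may prefer to validate ranges before batching.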