WhisperCpp (whisper_cpp v0.2.0)

Native Elixir bindings for whisper.cpp.

WhisperCpp is a thin wrapper around the whisper-rs crate that calls whisper.cpp's C API through a Rustler NIF. It needs no whisper-cli subprocess, no Python, and no temporary files. It provides structured per-segment results, :initial_prompt biasing, word-level timestamps, and CUDA / ROCm (hipBLAS) / Metal / CPU backends.

Quickstart

{:ok, model} = WhisperCpp.load_model("models/ggml-large-v3.bin")

{:ok, %WhisperCpp.Transcription{text: text, segments: segs}} =
  WhisperCpp.transcribe(model, {:pcm_f32, samples}, language: "en")

IO.puts(text)
for s <- segs, do: IO.puts("[#{s.start}-#{s.end}] #{s.text}")

Audio contract

transcribe/3 accepts exactly one input shape:

{:pcm_f32, binary()}

where binary is little-endian IEEE-754 f32 samples, mono, 16 kHz, normalised to [-1.0, 1.0]. Decode audio file formats (WAV, MP3, FLAC, M4A, Opus, ...) upstream with ffmpeg or similar:

ffmpeg -i input.mp3 -f f32le -ac 1 -ar 16000 -
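
As a rough sketch of the audio contract, the module below packs a list of floats into the expected binary, computes a buffer's duration, and shells out to ffmpeg for file decoding. PcmSketch and its functions are hypothetical helpers, not part of WhisperCpp. The ffmpeg flags are the ones shown above plus -loglevel error, and the sketch assumes an ffmpeg executable is on PATH.

```elixir
defmodule PcmSketch do
  @moduledoc "Hypothetical helpers for building {:pcm_f32, binary()} input."

  @sample_rate 16_000
  @bytes_per_sample 4

  # Packs numbers into little-endian f32 samples, clamping to [-1.0, 1.0].
  def from_floats(samples) when is_list(samples) do
    pcm =
      for s <- samples, into: <<>> do
        clamped = s |> max(-1.0) |> min(1.0)
        <<clamped::float-32-little>>
      end

    {:pcm_f32, pcm}
  end

  # Duration in seconds of a mono 16 kHz f32 buffer.
  def duration_s({:pcm_f32, pcm}) when rem(byte_size(pcm), @bytes_per_sample) == 0 do
    byte_size(pcm) / (@bytes_per_sample * @sample_rate)
  end

  # Decodes any ffmpeg-readable file to mono 16 kHz f32le read from stdout.
  # System.cmd/2 raises if the ffmpeg executable cannot be found.
  def decode_file(path) do
    args = ["-loglevel", "error", "-i", path, "-f", "f32le", "-ac", "1", "-ar", "16000", "-"]

    case System.cmd("ffmpeg", args) do
      {pcm, 0} -> {:ok, {:pcm_f32, pcm}}
      {_output, status} -> {:error, {:ffmpeg_exit, status}}
    end
  end
end
```

The tuple returned by PcmSketch.decode_file/1 can be passed to transcribe/3 as the audio argument.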

Use transcribe_slice/4 to transcribe a [start_s, end_s) window of an already-decoded master PCM buffer; the returned segment / word times are shifted back into the original audio timeline.

Summary

Types

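  • audio() - Audio input accepted by transcribe/3.
  • load_opt() - Options accepted by load_model/2.
  • transcribe_opt() - Options accepted by transcribe/3 / transcribe_slice/4.

Functions

  • available_devices() - Reports the runtime backends compiled into this NIF artefact.
  • load_model(path, opts \\ []) - Loads a GGUF or GGML whisper.cpp model file.
  • transcribe(model, audio, opts \\ []) - Transcribes audio using model.
  • transcribe_slice(model, samples, range, opts \\ []) - Transcribes a [start_s, end_s) slice of samples and shifts the returned segment/word timestamps to absolute seconds in the original audio.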

Types

audio()

@type audio() :: {:pcm_f32, binary()}

Audio input accepted by transcribe/3.

load_opt()

@type load_opt() ::
  {:device, WhisperCpp.Model.device() | :auto} | {:use_gpu, boolean()}

Options accepted by load_model/2.

transcribe_opt()

@type transcribe_opt() ::
  {:language, String.t() | nil}
  | {:translate, boolean()}
  | {:initial_prompt, String.t() | nil}
  | {:word_timestamps, boolean()}
  | {:beam_size, pos_integer()}
  | {:best_of, pos_integer()}
  | {:temperature, float()}
  | {:n_threads, pos_integer()}
  | {:n_max_text_ctx, non_neg_integer()}
  | {:offset_ms, non_neg_integer()}
  | {:duration_ms, non_neg_integer()}
  | {:no_speech_thold, float()}
  | {:logprob_thold, float()}
  | {:suppress_blank, boolean()}
  | {:suppress_non_speech_tokens, boolean()}
  | {:single_segment, boolean()}
  | {:print_progress, boolean()}
  | {:abort_handle, WhisperCpp.AbortHandle.t() | nil}
  | {:progress_pid, pid() | nil}

Options accepted by transcribe/3 / transcribe_slice/4.

Functions

available_devices()

@spec available_devices() ::
  {:ok, %{backends: [atom()], gpu_supported: boolean()}}
  | {:error, WhisperCpp.Error.t()}

Reports the runtime backends compiled into this NIF artefact.

Returns {:ok, %{backends: [...], gpu_supported: bool}}. The backends list reflects compile-time cargo features (e.g. [:cpu, :cuda] on a WHISPER_CPP_VARIANT=cuda build).

Build the artefact from source with a specific backend via:

WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=cuda       mix compile  # NVIDIA
WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=hipblas    mix compile  # AMD ROCm
WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=vulkan     mix compile  # cross-vendor
WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=metal      mix compile  # Apple Silicon
WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=coreml     mix compile  # Apple ANE
WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=intel-sycl mix compile  # Intel Arc/Xe
WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=openblas   mix compile  # CPU + OpenBLAS
WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=openmp     mix compile  # CPU + OpenMP

Pick one accelerator per build; the backend is baked into the artefact.
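
The map returned by available_devices/0 can drive an explicit :device choice. The sketch below is one way to do that. DeviceSketch, its functions, and the preference order are hypothetical, and it assumes the atoms in :backends match the :device atoms accepted by load_model/2, which the example [:cpu, :cuda] suggests but does not guarantee.

```elixir
defmodule DeviceSketch do
  # GPU-class backends in an arbitrary preference order.
  # This ordering is an assumption for illustration, not library policy.
  @preferred [:cuda, :hipblas, :metal, :vulkan, :coreml, :intel_sycl]

  # Picks a :device value from the map returned by available_devices/0.
  def pick(%{backends: backends, gpu_supported: gpu?}) do
    if gpu? do
      # First preferred backend that was compiled in, else :cpu.
      Enum.find(@preferred, :cpu, &(&1 in backends))
    else
      :cpu
    end
  end

  # Queries the NIF and returns {:ok, device} or the library's error tuple.
  def pick_for_this_build do
    with {:ok, info} <- WhisperCpp.available_devices() do
      {:ok, pick(info)}
    end
  end
end
```

In most cases device: :auto makes this unnecessary; an explicit pick is mainly useful for logging which backend a deployment ended up with.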

load_model(path, opts \\ [])

@spec load_model(Path.t(), [load_opt()]) ::
  {:ok, WhisperCpp.Model.t()} | {:error, WhisperCpp.Error.t()}

Loads a GGUF or GGML whisper.cpp model file.

Pass a path to a .bin (legacy GGML) or .gguf file. Download official weights from https://huggingface.co/ggerganov/whisper.cpp.

Options

  • :device - one of :cpu, :cuda, :hipblas, :vulkan, :metal, :coreml, :intel_sycl, or :auto (default). :auto picks the GPU backend when the artefact was built with one; otherwise CPU. Requesting a backend that was not compiled in returns {:error, %WhisperCpp.Error{reason: :invalid_request}}.
  • :use_gpu - shortcut: false forces device: :cpu. Default true.
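
Because requesting a backend that was not compiled in returns an :invalid_request error, a caller may want to retry on CPU. The following is a minimal sketch of that pattern. LoadSketch and load_with_fallback/3 are hypothetical names, and the loader argument exists only so the fallback logic can be exercised without a model file.

```elixir
defmodule LoadSketch do
  # Tries the requested device first and retries on :cpu when the library
  # reports :invalid_request. `loader` defaults to WhisperCpp.load_model/2.
  def load_with_fallback(path, device, loader \\ &WhisperCpp.load_model/2) do
    case loader.(path, device: device) do
      {:ok, model} ->
        {:ok, model, device}

      # Matching on the :reason key also matches %WhisperCpp.Error{} structs.
      {:error, %{reason: :invalid_request}} when device != :cpu ->
        with {:ok, model} <- loader.(path, device: :cpu) do
          {:ok, model, :cpu}
        end

      {:error, _reason} = error ->
        error
    end
  end
end
```

The third tuple element reports which device was actually used, so callers can log a silent downgrade to CPU.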

transcribe(model, audio, opts \\ [])

@spec transcribe(WhisperCpp.Model.t(), audio(), [transcribe_opt()]) ::
  {:ok, WhisperCpp.Transcription.t()} | {:error, WhisperCpp.Error.t()}

Transcribes audio using model.

Returns {:ok, %WhisperCpp.Transcription{}} whose :segments carry absolute start/end times, no_speech_prob, avg_logprob, the underlying text tokens, and (when :word_timestamps is set) per-word timing.
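
To illustrate consuming the returned segments, the sketch below renders them as SubRip (SRT) cues. SrtSketch is a hypothetical helper, and it assumes that each segment exposes :start, :end, and :text (as in the Quickstart) and that the times are in seconds.

```elixir
defmodule SrtSketch do
  # Formats seconds as an SRT timestamp, e.g. 3.5 becomes "00:00:03,500".
  def timestamp(seconds) when is_number(seconds) and seconds >= 0 do
    total_ms = round(seconds * 1000)
    {h, rest} = {div(total_ms, 3_600_000), rem(total_ms, 3_600_000)}
    {m, rest} = {div(rest, 60_000), rem(rest, 60_000)}
    {s, ms} = {div(rest, 1000), rem(rest, 1000)}
    "#{pad(h, 2)}:#{pad(m, 2)}:#{pad(s, 2)},#{pad(ms, 3)}"
  end

  # Renders segments (any map or struct with :start, :end, :text) as SRT cues.
  def render(segments) do
    segments
    |> Enum.with_index(1)
    |> Enum.map_join("\n", fn {seg, index} ->
      "#{index}\n#{timestamp(seg.start)} --> #{timestamp(seg.end)}\n#{String.trim(seg.text)}\n"
    end)
  end

  defp pad(n, width), do: n |> Integer.to_string() |> String.pad_leading(width, "0")
end
```

With a successful result bound as {:ok, t}, SrtSketch.render(t.segments) would produce the subtitle text.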

Options

  • :language - ISO code ("en"). nil (default) auto-detects on multilingual models; auto-detect on monolingual models always returns "en".
  • :translate - translate to English instead of transcribing.
  • :initial_prompt - free-text context prepended via <|startofprev|> to bias decoding (max ~224 tokens).
  • :word_timestamps - attach per-word timing. Default false.
  • :beam_size - beam-search width. Default 5.
  • :best_of - greedy candidates kept when beam_size <= 1.
  • :temperature - sampling temperature (0.0 = greedy/beam).
  • :n_threads - intra-op threads. Default 4.
  • :n_max_text_ctx - cap decoder context tokens.
  • :offset_ms, :duration_ms - clip the audio window.
  • :no_speech_thold - silence detection threshold. Default 0.6.
  • :logprob_thold - reject segments with avg_logprob below this.
  • :suppress_blank, :suppress_non_speech_tokens - decoder suppressions.
  • :single_segment - force a single segment for the whole audio.
  • :print_progress - print whisper.cpp progress to stderr.
  • :abort_handle - %WhisperCpp.AbortHandle{} whose abort/1 cancels in-flight inference. The call returns {:ok, partial_transcription} with whatever segments completed before the abort took effect.
  • :progress_pid - pid that receives {:whisper_progress, percent} messages (0..100) as work advances; duplicate percentages are coalesced.
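
The :progress_pid option pairs naturally with running the transcription in a separate process. The sketch below shows one possible arrangement. ProgressSketch and its functions are hypothetical, and it assumes a final {:whisper_progress, 100} message usually arrives; the timeout covers the case where it does not.

```elixir
defmodule ProgressSketch do
  # Collects {:whisper_progress, percent} messages from the caller's mailbox
  # until 100 arrives or no message shows up within `timeout_ms`.
  def collect(timeout_ms \\ 5_000, acc \\ []) do
    receive do
      {:whisper_progress, pct} when pct >= 100 ->
        Enum.reverse([pct | acc])

      {:whisper_progress, pct} ->
        collect(timeout_ms, [pct | acc])
    after
      timeout_ms -> Enum.reverse(acc)
    end
  end

  # Runs transcribe/3 in a Task while the caller gathers progress updates.
  # Returns {transcribe_result, percentages_seen}.
  def transcribe_with_progress(model, audio, opts \\ []) do
    caller = self()

    task =
      Task.async(fn ->
        WhisperCpp.transcribe(model, audio, Keyword.put(opts, :progress_pid, caller))
      end)

    seen = collect()
    {Task.await(task, :infinity), seen}
  end
end
```

If the gap between two progress messages exceeds the timeout, collect/2 simply returns early and the caller still waits for the final result through Task.await/2.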

transcribe_slice(model, samples, range, opts \\ [])

@spec transcribe_slice(WhisperCpp.Model.t(), binary(), {number(), number()}, [
  transcribe_opt()
]) ::
  {:ok, WhisperCpp.Transcription.t()} | {:error, WhisperCpp.Error.t()}

Transcribes a [start_s, end_s) slice of samples and shifts the returned segment/word timestamps to absolute seconds in the original audio.

Slices the f32 PCM buffer, runs whisper.cpp on the slice, and rewrites local segment times back into the absolute timeline. Returns {:ok, %WhisperCpp.Transcription{}} with absolute timings, or {:error, WhisperCpp.Error.t()}.

Slices shorter than 0.3 s return an empty transcription, because whisper.cpp pads short inputs and hallucinates into the padding.
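
The arithmetic behind slicing can be sketched as follows. This is an illustration of the idea under the documented audio contract (mono, 16 kHz, 4 bytes per sample), not the library's actual implementation; SliceSketch and its functions are hypothetical, and segments are assumed to expose :start and :end in seconds.

```elixir
defmodule SliceSketch do
  @sample_rate 16_000
  @bytes_per_sample 4
  @min_slice_s 0.3

  # Cuts [start_s, end_s) out of a mono 16 kHz f32 buffer, clamped to its bounds.
  def cut(pcm, {start_s, end_s}) when is_binary(pcm) and start_s <= end_s do
    total = div(byte_size(pcm), @bytes_per_sample)
    first = start_s |> to_sample() |> min(total)
    last = end_s |> to_sample() |> min(total)
    binary_part(pcm, first * @bytes_per_sample, (last - first) * @bytes_per_sample)
  end

  # True when the window is below the 0.3 s threshold mentioned above.
  def too_short?({start_s, end_s}), do: end_s - start_s < @min_slice_s

  # Shifts slice-local segment times back onto the original timeline.
  def shift(segments, offset_s) do
    Enum.map(segments, fn seg ->
      %{seg | start: seg.start + offset_s, end: seg.end + offset_s}
    end)
  end

  # Converts seconds to a non-negative sample index.
  defp to_sample(seconds), do: max(round(seconds * @sample_rate), 0)
end
```

For example, the window {0.5, 1.5} covers samples 8000 up to 24000, which is 64000 bytes of PCM.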