# `WhisperCpp`
[🔗](https://github.com/rubas/whisper_cpp/blob/v0.2.0/lib/whisper_cpp.ex#L1)

Native Elixir bindings for [whisper.cpp](https://github.com/ggerganov/whisper.cpp).

A thin wrapper around the
[`whisper-rs`](https://codeberg.org/tazz4843/whisper-rs) crate that calls
whisper.cpp's C API through a Rustler NIF: no `whisper-cli` subprocess,
no Python, no temporary files. It provides structured per-segment
results, `:initial_prompt` biasing, word-level timestamps, and CUDA /
ROCm (hipBLAS) / Metal / CPU backends.

## Quickstart

    {:ok, model} = WhisperCpp.load_model("models/ggml-large-v3.bin")

    {:ok, %WhisperCpp.Transcription{text: text, segments: segs}} =
      WhisperCpp.transcribe(model, {:pcm_f32, samples}, language: "en")

    IO.puts(text)
    for s <- segs, do: IO.puts("[#{s.start}-#{s.end}] #{s.text}")

## Audio contract

`transcribe/3` accepts exactly one input shape:

    {:pcm_f32, binary()}

where `binary` is little-endian IEEE-754 `f32` samples, mono, 16 kHz,
normalised to `[-1.0, 1.0]`. Decode audio file formats (WAV, MP3,
FLAC, M4A, Opus, ...) upstream with ffmpeg or similar:

    ffmpeg -i input.mp3 -f f32le -ac 1 -ar 16000 - | …

Use `transcribe_slice/4` to transcribe a `[start_s, end_s)` window of an
already-decoded master PCM buffer; the returned segment / word times
are shifted back into the original audio timeline.
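
The contract above can be met from Elixir in two ways, sketched below under stated assumptions. `PcmSketch` is a hypothetical helper module, not part of `WhisperCpp`. `decode_file/1` assumes an `ffmpeg` executable on `PATH` and reuses the flags from the command above; `from_floats/1` packs samples that are already in memory.

```elixir
defmodule PcmSketch do
  # Hypothetical helper, not part of WhisperCpp.
  @sample_rate 16_000
  @bytes_per_sample 4

  # Decode any ffmpeg-readable file to raw f32le mono 16 kHz PCM on stdout.
  # System.cmd/2 raises if the ffmpeg executable cannot be found.
  def decode_file(path) do
    args = ["-v", "error", "-i", path, "-f", "f32le", "-ac", "1", "-ar", "16000", "-"]

    case System.cmd("ffmpeg", args) do
      {pcm, 0} -> {:ok, {:pcm_f32, pcm}}
      {_out, status} -> {:error, {:ffmpeg_exit, status}}
    end
  end

  # Pack a list of numbers into little-endian f32, clamped to [-1.0, 1.0].
  def from_floats(samples) do
    pcm = for s <- samples, into: <<>>, do: <<clamp(s)::float-32-little>>
    {:pcm_f32, pcm}
  end

  # Duration in seconds; only matches buffers holding whole f32 samples.
  def duration_s({:pcm_f32, pcm}) when rem(byte_size(pcm), @bytes_per_sample) == 0 do
    div(byte_size(pcm), @bytes_per_sample) / @sample_rate
  end

  defp clamp(s), do: s |> max(-1.0) |> min(1.0)
end
```

Either result can be passed as the second argument of `transcribe/3`. Checking that `byte_size/1` is a multiple of 4 before calling the NIF is a cheap way to catch a buffer that was truncated or decoded with the wrong sample format.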

# `audio`

```elixir
@type audio() :: {:pcm_f32, binary()}
```

Audio input accepted by `transcribe/3`.

# `load_opt`

```elixir
@type load_opt() ::
  {:device, WhisperCpp.Model.device() | :auto} | {:use_gpu, boolean()}
```

Options accepted by `load_model/2`.

# `transcribe_opt`

```elixir
@type transcribe_opt() ::
  {:language, String.t() | nil}
  | {:translate, boolean()}
  | {:initial_prompt, String.t() | nil}
  | {:word_timestamps, boolean()}
  | {:beam_size, pos_integer()}
  | {:best_of, pos_integer()}
  | {:temperature, float()}
  | {:n_threads, pos_integer()}
  | {:n_max_text_ctx, non_neg_integer()}
  | {:offset_ms, non_neg_integer()}
  | {:duration_ms, non_neg_integer()}
  | {:no_speech_thold, float()}
  | {:logprob_thold, float()}
  | {:suppress_blank, boolean()}
  | {:suppress_non_speech_tokens, boolean()}
  | {:single_segment, boolean()}
  | {:print_progress, boolean()}
  | {:abort_handle, WhisperCpp.AbortHandle.t() | nil}
  | {:progress_pid, pid() | nil}
```

Options accepted by `transcribe/3` / `transcribe_slice/4`.

# `available_devices`

```elixir
@spec available_devices() ::
  {:ok, %{backends: [atom()], gpu_supported: boolean()}}
  | {:error, WhisperCpp.Error.t()}
```

Reports the runtime backends compiled into this NIF artefact.

Returns `{:ok, %{backends: [...], gpu_supported: bool}}`. The
`backends` list reflects compile-time cargo features (e.g.
`[:cpu, :cuda]` on a `WHISPER_CPP_VARIANT=cuda` build).

Build the artefact from source with a specific backend (GPU or accelerated CPU) via:

    WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=cuda       mix compile  # NVIDIA
    WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=hipblas    mix compile  # AMD ROCm
    WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=vulkan     mix compile  # cross-vendor
    WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=metal      mix compile  # Apple Silicon
    WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=coreml     mix compile  # Apple ANE
    WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=intel-sycl mix compile  # Intel Arc/Xe
    WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=openblas   mix compile  # CPU + OpenBLAS
    WHISPER_CPP_BUILD=1 WHISPER_CPP_FEATURES=openmp     mix compile  # CPU + OpenMP

Pick one accelerator per build; the backend is baked into the artefact.
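
Because the backend is fixed at build time, a caller may want to choose a `:device` from what `available_devices/0` reports. The following is a minimal sketch: `DeviceSketch` and its preference order are hypothetical, and it assumes the atoms in `backends` match the values accepted by the `:device` option of `load_model/2`, which is shown here only for `:cpu` and `:cuda`.

```elixir
defmodule DeviceSketch do
  # Hypothetical preference order; adjust for your deployment.
  @preferred [:cuda, :hipblas, :metal, :vulkan, :cpu]

  # Takes the map returned inside {:ok, ...} by available_devices/0
  # and returns the first preferred backend that was compiled in.
  def pick(%{backends: backends}) do
    Enum.find(@preferred, :cpu, &(&1 in backends))
  end
end

# Usage against the real NIF (not executed here):
#   {:ok, info} = WhisperCpp.available_devices()
#   {:ok, model} = WhisperCpp.load_model(path, device: DeviceSketch.pick(info))
```

Passing `device: :auto` achieves a similar effect without any caller-side logic; an explicit pick is mainly useful for logging which backend is in use.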

# `load_model`

```elixir
@spec load_model(Path.t(), [load_opt()]) ::
  {:ok, WhisperCpp.Model.t()} | {:error, WhisperCpp.Error.t()}
```

Loads a GGUF or GGML whisper.cpp model file.

Pass a path to a `.bin` (legacy GGML) or `.gguf` file. Download official
weights from <https://huggingface.co/ggerganov/whisper.cpp>.

## Options

- `:device` - one of `:cpu`, `:cuda`, `:hipblas`, `:vulkan`, `:metal`,
  `:coreml`, `:intel_sycl`, or `:auto` (default). `:auto` picks the GPU
  backend when the artefact was built with one; otherwise CPU.
  Requesting a backend that was not compiled in returns
  `{:error, %WhisperCpp.Error{reason: :invalid_request}}`.
- `:use_gpu` - shortcut: `false` forces `device: :cpu`. Default `true`.
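
One way to handle the `:invalid_request` error for a backend that was not compiled in is to retry on CPU. This is a sketch only: `LoadSketch` is a hypothetical module, the three-element success tuple is its own convention, and it assumes that retrying on `:cpu` is acceptable whenever `:invalid_request` is returned. The loader function is injectable so the logic can be exercised without a model file.

```elixir
defmodule LoadSketch do
  # `loader` defaults to the real entry point; pass a stub in tests.
  def load_with_fallback(path, device, loader \\ &WhisperCpp.load_model/2) do
    case loader.(path, device: device) do
      {:ok, model} ->
        # Report which device was actually used alongside the model.
        {:ok, model, device}

      # A map pattern also matches %WhisperCpp.Error{reason: :invalid_request}.
      {:error, %{reason: :invalid_request}} when device != :cpu ->
        load_with_fallback(path, :cpu, loader)

      {:error, _} = error ->
        error
    end
  end
end
```

A call such as `LoadSketch.load_with_fallback("models/ggml-large-v3.bin", :cuda)` would then return the CPU-loaded model on a CPU-only artefact instead of an error.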

# `transcribe`

```elixir
@spec transcribe(WhisperCpp.Model.t(), audio(), [transcribe_opt()]) ::
  {:ok, WhisperCpp.Transcription.t()} | {:error, WhisperCpp.Error.t()}
```

Transcribes `audio` using `model`.

Returns `{:ok, %WhisperCpp.Transcription{}}` whose `:segments` carry
absolute start/end times, `no_speech_prob`, `avg_logprob`, the
underlying text tokens, and (when `:word_timestamps` is set) per-word
timing.
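
The per-segment `no_speech_prob` and `avg_logprob` values can be used to post-filter results. The sketch below assumes those are the field names on each segment, as the description above suggests; `SegmentFilterSketch` is hypothetical and its thresholds are illustrative rather than library defaults.

```elixir
defmodule SegmentFilterSketch do
  # Keep segments that look like confident speech.
  # Threshold defaults are illustrative assumptions.
  def confident(segments, max_no_speech \\ 0.6, min_avg_logprob \\ -1.0) do
    Enum.filter(segments, fn seg ->
      seg.no_speech_prob < max_no_speech and seg.avg_logprob > min_avg_logprob
    end)
  end
end

# Usage with a real result (not executed here):
#   {:ok, t} = WhisperCpp.transcribe(model, audio)
#   kept = SegmentFilterSketch.confident(t.segments)
```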

## Options

- `:language` - ISO language code (`"en"`). `nil` (default) auto-detects
  on multilingual models; on monolingual (English-only) models
  auto-detect always returns `"en"`.
- `:translate` - translate to English instead of transcribing.
- `:initial_prompt` - free-text context prepended via `<|startofprev|>`
  to bias decoding (max ~224 tokens).
- `:word_timestamps` - attach per-word timing. Default `false`.
- `:beam_size` - beam-search width. Default `5`.
- `:best_of` - greedy candidates kept when `beam_size <= 1`.
- `:temperature` - sampling temperature (`0.0` = greedy/beam).
- `:n_threads` - intra-op threads. Default `4`.
- `:n_max_text_ctx` - cap decoder context tokens.
- `:offset_ms`, `:duration_ms` - clip the audio window.
- `:no_speech_thold` - silence detection threshold. Default `0.6`.
- `:logprob_thold` - reject segments with `avg_logprob` below this.
- `:suppress_blank`, `:suppress_non_speech_tokens` - decoder suppressions.
- `:single_segment` - force a single segment for the whole audio.
- `:print_progress` - whisper.cpp progress to stderr.
- `:abort_handle` - `%WhisperCpp.AbortHandle{}` whose `abort/1` cancels
  in-flight inference. The call returns `{:ok, partial_transcription}`
  with whatever segments completed before the abort took effect.
- `:progress_pid` - pid that receives `{:whisper_progress, percent}`
  messages (0..100) as work advances; duplicate percentages are
  coalesced.
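
The `:progress_pid` messages can be consumed with a plain `receive` loop. The sketch below is hypothetical (`ProgressSketch` is not part of the library) and assumes only the `{:whisper_progress, percent}` message shape described above. Whether a final `100` is always delivered is not stated here, so the loop also stops after a quiet period.

```elixir
defmodule ProgressSketch do
  # Collect progress percentages from the mailbox until 100 arrives
  # or no message is received for `timeout_ms` milliseconds.
  def collect(timeout_ms \\ 5_000, acc \\ []) do
    receive do
      {:whisper_progress, 100} ->
        Enum.reverse([100 | acc])

      {:whisper_progress, pct} when pct in 0..99 ->
        collect(timeout_ms, [pct | acc])
    after
      timeout_ms -> Enum.reverse(acc)
    end
  end
end

# Usage (not executed here): run inference in a task and watch from the caller.
#   parent = self()
#   task = Task.async(fn -> WhisperCpp.transcribe(model, audio, progress_pid: parent) end)
#   ProgressSketch.collect()
#   Task.await(task, :infinity)
```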

# `transcribe_slice`

```elixir
@spec transcribe_slice(WhisperCpp.Model.t(), binary(), {number(), number()}, [
  transcribe_opt()
]) ::
  {:ok, WhisperCpp.Transcription.t()} | {:error, WhisperCpp.Error.t()}
```

Transcribes a `[start_s, end_s)` slice of `samples` and shifts the
returned segment/word timestamps to absolute seconds in the original
audio.

`samples` is the raw f32 PCM binary itself, not the `{:pcm_f32, binary}`
tuple, and the window is given as `{start_s, end_s}` in seconds. The
function slices the buffer, runs whisper.cpp on the slice, and rewrites
the slice-local segment times into the absolute timeline. Returns
`{:ok, %Transcription{}}` with absolute timings, or
`{:error, Error.t()}`. Slices shorter than 0.3 s return an empty
transcription, because whisper.cpp pads short inputs and hallucinates
into the padding.
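
To make the arithmetic concrete, the sketch below shows how a `[start_s, end_s)` window maps onto byte offsets in a 16 kHz mono f32 buffer, and how a slice-local time maps back to the original timeline. It illustrates the idea only and is not the library's implementation: `SliceSketch` is hypothetical, and clamping the window to the buffer length is an assumption of this sketch.

```elixir
defmodule SliceSketch do
  @sample_rate 16_000
  @bytes_per_sample 4
  @min_slice_s 0.3

  # Cut the [start_s, end_s) window out of a raw f32 PCM binary.
  def window(pcm, {start_s, end_s})
      when is_binary(pcm) and start_s >= 0 and end_s >= start_s do
    total = div(byte_size(pcm), @bytes_per_sample)
    first = min(round(start_s * @sample_rate), total)
    last = min(round(end_s * @sample_rate), total)
    binary_part(pcm, first * @bytes_per_sample, (last - first) * @bytes_per_sample)
  end

  # Windows below the documented 0.3 s minimum yield an empty transcription.
  def too_short?({start_s, end_s}), do: end_s - start_s < @min_slice_s

  # Shift a time measured inside the slice back to the original timeline.
  def to_absolute(local_s, {start_s, _end_s}), do: local_s + start_s
end
```

For example, a one-second window starting at 0.5 s covers 16,000 samples, or 64,000 bytes, and a word reported at 0.25 s inside a slice that starts at 10 s lands at 10.25 s in the original audio.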

---

*Consult [api-reference.md](api-reference.md) for complete listing*