# `WhisperCt2`
[🔗](https://github.com/rubas/whisper_ct2/blob/v0.5.0/lib/whisper_ct2.ex#L1)

Native Elixir bindings for Whisper speech-to-text via CTranslate2.

Calls `ct2rs::sys::Whisper` directly through a Rustler NIF: no Python, no
HTTP gateway. The NIF owns the mel spectrogram, tokenizer, and prompt
construction, so structured per-segment results, `:initial_prompt` /
`:prefix` biasing, word-level timestamps, and batched transcribe across
multiple audios are all first-class.

## Quickstart

    {:ok, model} = WhisperCt2.load_model("/path/to/faster-whisper-tiny")
    pcm = File.read!("audio.pcm")

    {:ok, %WhisperCt2.Transcription{text: text, segments: segs}} =
      WhisperCt2.transcribe(model, {:pcm_f32, pcm}, language: "en")

    IO.puts(text)
    for s <- segs, do: IO.puts("[#{s.start}-#{s.end}] #{s.text}")

## Audio contract

CTranslate2 expects **mono `f32` PCM samples** at the model's
`:sampling_rate` (16 kHz for every published Whisper checkpoint),
normalised to the `-1.0..1.0` range. `transcribe/3` and
`transcribe_batch/3` accept exactly one shape:

- `{:pcm_f32, binary}` - little-endian f32 samples at the model's
  sample rate.

Decoding `.wav`, resampling, downmixing, and any other format
conversion is the caller's job. There is no bundled audio decoder;
use `ffmpeg`, a dedicated library, or your platform's audio stack
before calling in.
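
As one illustration of that caller-side conversion, the sketch below turns signed 16-bit little-endian mono PCM into the little-endian `f32` layout described above. The module and function names (`PcmConvert`, `s16le_to_f32le/1`) are hypothetical and not part of this library, and the input is assumed to be mono and already at the model's sample rate.

```elixir
defmodule PcmConvert do
  @moduledoc false

  # Hypothetical helper, not part of WhisperCt2.
  # Converts signed 16-bit little-endian mono PCM into little-endian f32
  # samples. Dividing by 32768 maps the s16 range into -1.0..1.0.
  @spec s16le_to_f32le(binary()) :: binary()
  def s16le_to_f32le(pcm) when is_binary(pcm) and rem(byte_size(pcm), 2) == 0 do
    for <<sample::little-signed-integer-size(16) <- pcm>>, into: <<>> do
      <<(sample / 32_768)::little-float-size(32)>>
    end
  end
end
```

The result can then be wrapped as `{:pcm_f32, PcmConvert.s16le_to_f32le(raw)}`. When `ffmpeg` is available, a command along the lines of `ffmpeg -i input.wav -ac 1 -ar 16000 -f f32le out.pcm` should produce the same layout directly, which avoids the conversion step in Elixir.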

Audio longer than Whisper's 30 s window is split into chunks internally, and
the encoder runs once across all chunks in the batch. Diarization-driven
workflows that need many short splices should use
`transcribe_batch/3`.
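
To get a feel for how much work a given input implies, the following sketch estimates the duration and the number of 30 s windows from the byte size of a `:pcm_f32` binary. It assumes 16 kHz audio, 4 bytes per sample, and simple non-overlapping windows; the NIF's actual chunk boundaries may differ, and `WindowMath` is a hypothetical name.

```elixir
defmodule WindowMath do
  @moduledoc false

  # Assumptions: 16 kHz mono f32 PCM and non-overlapping 30 s windows.
  @sample_rate 16_000
  @window_s 30
  @bytes_per_sample 4

  # Duration of the audio in seconds.
  def duration_s(pcm) when is_binary(pcm) do
    byte_size(pcm) / (@bytes_per_sample * @sample_rate)
  end

  # Number of 30 s windows needed to cover the audio (ceiling division).
  def window_count(pcm) when is_binary(pcm) do
    samples = div(byte_size(pcm), @bytes_per_sample)
    window = @sample_rate * @window_s
    div(samples + window - 1, window)
  end
end
```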

# `audio`

```elixir
@type audio() :: {:pcm_f32, binary()}
```

Audio sources accepted by `transcribe/3` and `transcribe_batch/3`.

# `load_opt`

```elixir
@type load_opt() ::
  {:device, :cpu | :cuda | :auto}
  | {:compute_type, WhisperCt2.Model.compute_type()}
  | {:device_indices, [non_neg_integer(), ...]}
  | {:num_threads_per_replica, non_neg_integer()}
  | {:max_queued_batches, integer()}
  | {:cpu_core_offset, integer()}
```

Options accepted by `load_model/2`.

# `transcribe_opt`

```elixir
@type transcribe_opt() ::
  {:language, String.t() | nil}
  | {:initial_prompt, String.t() | nil}
  | {:prefix, String.t() | nil}
  | {:word_timestamps, boolean()}
  | {:with_timestamps, boolean()}
  | {:beam_size, pos_integer()}
  | {:patience, float()}
  | {:length_penalty, float()}
  | {:repetition_penalty, float()}
  | {:no_repeat_ngram_size, non_neg_integer()}
  | {:sampling_temperature, float()}
  | {:sampling_topk, pos_integer()}
  | {:suppress_blank, boolean()}
  | {:max_length, pos_integer()}
  | {:num_hypotheses, pos_integer()}
  | {:max_initial_timestamp_index, non_neg_integer()}
  | {:suppress_tokens, [integer()]}
```

Options accepted by `transcribe/3` and `transcribe_batch/3`.

# `available_devices`

```elixir
@spec available_devices() ::
  {:ok,
   %{cpu: non_neg_integer(), cuda: non_neg_integer(), cuda_supported: boolean()}}
  | {:error, WhisperCt2.Error.t()}
```

Reports CTranslate2 device support for this build.

Returns `{:ok, %{cpu: cpu_count, cuda: gpu_count, cuda_supported: bool}}` on success.
`cuda_supported` reflects compile-time CUDA features (build with
`WHISPER_CT2_FEATURES=cuda-dynamic mix compile` to enable). `cuda` is the
count of NVIDIA GPU devices visible at runtime, or `0` when CUDA is not
built in.
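
A minimal sketch of acting on that map: the hypothetical `DevicePick.pick/1` below selects `:cuda` only when the build supports CUDA and at least one GPU is visible, and falls back to `:cpu` otherwise. This mirrors how `:auto` is described for `load_model/2`; making the choice explicit is mainly useful for logging or configuration checks.

```elixir
defmodule DevicePick do
  @moduledoc false

  # Hypothetical helper, not part of WhisperCt2.
  # Takes the map returned inside {:ok, map} by available_devices/0.
  def pick(%{cuda_supported: true, cuda: gpus}) when is_integer(gpus) and gpus > 0, do: :cuda
  def pick(%{cpu: _}), do: :cpu
end
```

A caller might pattern-match `{:ok, devices} = WhisperCt2.available_devices()` and then pass `DevicePick.pick(devices)` as the `:device` option.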

# `load_model`

```elixir
@spec load_model(Path.t(), [load_opt()]) ::
  {:ok, WhisperCt2.Model.t()} | {:error, WhisperCt2.Error.t()}
```

Loads a CTranslate2 Whisper model from a directory.

See the `WhisperCt2` module doc for required model files.

## Options

- `:device` - `:cpu`, `:cuda`, or `:auto` (default). `:auto` picks CUDA
  when the binary was built with CUDA support and a device is visible;
  otherwise CPU.
- `:compute_type` - precision used at inference. `:default` keeps the
  model's stored quantisation; `:auto` picks the fastest supported on
  this device.
- `:device_indices` - non-empty list of GPU indices (default `[0]`).
- `:num_threads_per_replica` - intra-op threads. `0` lets CTranslate2 pick.
- `:max_queued_batches`, `:cpu_core_offset` - passed through to
  CTranslate2.
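
The options above might be assembled per device as in the sketch below. `ModelLoader.load_opts/2` is a hypothetical helper and the particular combinations are illustrative choices, not recommendations from the library.

```elixir
defmodule ModelLoader do
  @moduledoc false

  # Hypothetical helper, not part of WhisperCt2.
  # Builds a keyword list suitable for load_model/2.

  # GPU: pin the given device indices and let :auto pick the compute type.
  def load_opts(:cuda, gpu_indices) when is_list(gpu_indices) and gpu_indices != [] do
    [device: :cuda, device_indices: gpu_indices, compute_type: :auto]
  end

  # CPU: keep the stored quantisation; 0 lets CTranslate2 choose the thread count.
  def load_opts(:cpu, _gpu_indices) do
    [device: :cpu, compute_type: :default, num_threads_per_replica: 0]
  end
end
```

The result is passed as the second argument, for example `WhisperCt2.load_model(path, ModelLoader.load_opts(:cpu, [0]))`.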

# `transcribe`

```elixir
@spec transcribe(WhisperCt2.Model.t(), audio(), [transcribe_opt()]) ::
  {:ok, WhisperCt2.Transcription.t()} | {:error, WhisperCt2.Error.t()}
```

Transcribes `audio` using `model`.

Returns `{:ok, %WhisperCt2.Transcription{}}` whose `:segments` carry
absolute start/end times, `no_speech_prob`, `avg_logprob`, the
underlying text tokens, and (when `:word_timestamps` is set) per-word
timing. `no_speech_prob` and `avg_logprob` are always populated.

## Options

- `:language` - ISO language code, e.g. `"en"`. `nil` (default) auto-detects.
- `:initial_prompt` - free-text context prepended via `<|startofprev|>`
  to bias decoding.
- `:prefix` - forced text the generation must start with.
- `:word_timestamps` - when `true`, attaches `:words` to each segment
  via one extra batched DTW alignment pass. Default `false`.
- `:with_timestamps` - when `true` (default) the prompt asks the model
  to emit `<|t_..|>` timestamp tokens that split the output into
  sub-segments. Set to `false` for fine-tunes that emit text without
  timestamps; the chunk's full text becomes one segment spanning
  `[0, chunk_duration_s)`. Implicitly forced to `true` whenever
  `:word_timestamps` is enabled because alignment needs the timestamp
  scaffolding.
- Decoding knobs forwarded to CTranslate2: `:beam_size`, `:patience`,
  `:length_penalty`, `:repetition_penalty`, `:no_repeat_ngram_size`,
  `:sampling_temperature`, `:sampling_topk`, `:suppress_blank`,
  `:max_length`, `:num_hypotheses`, `:max_initial_timestamp_index`,
  `:suppress_tokens`.
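
Because `no_speech_prob` is always populated, one plausible post-processing step is to drop segments the model considers likely silence and format the rest. The sketch below works on any list of maps or structs exposing `:start`, `:end`, `:text`, and `:no_speech_prob`; `SegmentFilter` is a hypothetical name, the `0.6` threshold is an illustrative value, and the times are assumed to be numeric seconds.

```elixir
defmodule SegmentFilter do
  @moduledoc false

  # Hypothetical helper, not part of WhisperCt2.
  # Keeps segments whose no_speech_prob does not exceed the threshold.
  def speech_only(segments, max_no_speech \\ 0.6) do
    Enum.filter(segments, fn seg -> seg.no_speech_prob <= max_no_speech end)
  end

  # Renders one segment as "[start-end] text" with two decimals.
  def format(seg) do
    "[#{fmt(seg.start)}-#{fmt(seg.end)}] #{String.trim(seg.text)}"
  end

  # Multiplying by 1.0 lets integer times pass through float formatting.
  defp fmt(seconds), do: :erlang.float_to_binary(seconds * 1.0, decimals: 2)
end
```

Given a successful result, `transcription.segments |> SegmentFilter.speech_only() |> Enum.map(&SegmentFilter.format/1)` yields printable lines.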

# `transcribe_batch`

```elixir
@spec transcribe_batch(WhisperCt2.Model.t(), [audio()], [transcribe_opt()]) ::
  {:ok, [WhisperCt2.Transcription.t()]} | {:error, WhisperCt2.Error.t()}
```

Transcribes a list of audios in one batched `generate` call. Every
chunk of every input shares a single encoder forward pass; output
preserves input order.

Options are the same as `transcribe/3`. `:language` applies to every
audio in the batch; pass `nil` to auto-detect per-audio.
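
For the diarization-style case, the inputs to a batch are often time ranges cut from one recording. The sketch below slices a mono `f32` PCM binary into `audio()` tuples by `{start_s, stop_s}` ranges, assuming 16 kHz audio and 4 bytes per sample. `Splice` is a hypothetical helper; ranges that run past the end of the audio are clamped.

```elixir
defmodule Splice do
  @moduledoc false

  # Assumptions: 16 kHz mono f32 PCM (4 bytes per sample).
  @sample_rate 16_000
  @bytes_per_sample 4

  # Cuts each {start_s, stop_s} range out of `pcm` and wraps it as
  # {:pcm_f32, binary}, preserving the order of `ranges`.
  def to_batch(pcm, ranges) when is_binary(pcm) do
    for {start_s, stop_s} <- ranges do
      first = min(offset(start_s), byte_size(pcm))
      last = min(offset(stop_s), byte_size(pcm))
      {:pcm_f32, binary_part(pcm, first, max(last - first, 0))}
    end
  end

  # Byte offset of a time in seconds, aligned to a whole sample.
  defp offset(seconds), do: round(seconds * @sample_rate) * @bytes_per_sample
end
```

The resulting list can be handed to `WhisperCt2.transcribe_batch(model, Splice.to_batch(pcm, ranges), language: "en")`; since output preserves input order, results can be zipped back against `ranges`.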

---

*Consult [api-reference.md](api-reference.md) for the complete listing.*