Bumblebee.Audio (Bumblebee v0.2.0)

High-level tasks related to audio processing.

Summary

speech_to_text(model_info, featurizer, tokenizer, opts \\ [])

Builds serving for speech-to-text generation.

Types


@type speech_to_text_input() :: Nx.t() | {:file, String.t()}

A term representing audio.

Can be either of:

  • a 1-dimensional Nx.Tensor with audio samples

  • {:file, path} with path to an audio file (note that this requires ffmpeg installed)
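As a rough sketch, the two accepted forms could be constructed as follows. The 16 kHz sampling rate and the silent one-second clip are illustrative assumptions rather than details from this page; in practice the samples should use the sampling rate that the featurizer expects.

```elixir
# Form 1: a 1-dimensional tensor of audio samples.
# Assumption: one second of silence sampled at 16 kHz, used purely as a placeholder.
samples = Nx.broadcast(0.0, {16_000})

# Form 2: a path to an audio file on disk.
# The file is decoded with ffmpeg, so ffmpeg must be installed on the system.
file_input = {:file, "/path/to/audio.wav"}
```

Either value can be passed to the serving as a single input.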

@type speech_to_text_output() :: %{results: [speech_to_text_result()]}


@type speech_to_text_result() :: %{text: String.t()}

Functions

speech_to_text(model_info, featurizer, tokenizer, opts \\ [])


Builds serving for speech-to-text generation.

The serving accepts speech_to_text_input/0 and returns speech_to_text_output/0. A list of inputs is also supported.

Note that either :max_new_tokens or :max_length must be specified. Generation should generally finish based on the audio input; however, you still need to specify an upper limit.

Options



  • :max_new_tokens - the maximum number of tokens to be generated, ignoring the number of tokens in the prompt

  • :compile - compiles all computations for predefined input shapes during serving initialization. Should be a keyword list with the following keys:

    • :batch_size - the maximum batch size of the input. Inputs are optionally padded to always match this batch size

    It is advised to set this option in production and also configure a defn compiler using :defn_options to maximally reduce inference time.

  • :defn_options - the options for JIT compilation. Defaults to []

Also accepts all the other options of Bumblebee.Text.Generation.build_generate/3.

Examples



{:ok, whisper} = Bumblebee.load_model({:hf, "openai/whisper-tiny"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-tiny"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-tiny"})

serving =
  Bumblebee.Audio.speech_to_text(whisper, featurizer, tokenizer,
    max_new_tokens: 100,
    defn_options: [compiler: EXLA]
  )

Nx.Serving.run(serving, {:file, "/path/to/audio.wav"})
#=> %{results: [%{text: "There is a cat outside the window."}]}
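
The :compile option and list inputs described above can be combined as in the following sketch. It reuses whisper, featurizer and tokenizer from the example above. The batch size of 2 and the file paths are placeholders, and the shape of the returned list (one output map per input) is an assumption based on the single-input output shown above.

```elixir
# Compile the computation up front for batches of up to 2 inputs.
serving =
  Bumblebee.Audio.speech_to_text(whisper, featurizer, tokenizer,
    max_new_tokens: 100,
    compile: [batch_size: 2],
    defn_options: [compiler: EXLA]
  )

# A list of inputs is also supported (hypothetical paths).
inputs = [{:file, "/path/to/first.wav"}, {:file, "/path/to/second.wav"}]
outputs = Nx.Serving.run(serving, inputs)

# Assumption: one %{results: [...]} map is returned per input, in order.
texts =
  for %{results: [%{text: text} | _]} <- outputs do
    text
  end
```

Compiling for a fixed batch size moves the compilation cost to serving initialization, which is why the option is advised for production use.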