Edifice.Audio.VALLE (Edifice v0.2.0)

VALL-E: Neural Codec Language Models for Zero-Shot Text-to-Speech.

VALL-E treats TTS as conditional language modeling over neural codec tokens (from EnCodec). Given text and a 3-second audio prompt, VALL-E generates speech that preserves the speaker's voice characteristics.

Architecture

VALL-E uses a two-stage approach with separate models for coarse and fine tokens:

Text tokens [batch, text_len]
Audio prompt tokens [batch, num_codebooks, prompt_len]
      |
+------------------------------------------+
| AR Model (Autoregressive)                |
|                                          |
| Decoder-only transformer                 |
| Generates codebook 0 (coarse tokens)     |
| Left-to-right, causal attention          |
+------------------------------------------+
      |
Coarse tokens [batch, seq_len]  (codebook 0)
      |
+------------------------------------------+
| NAR Model (Non-Autoregressive)           |
|                                          |
| Bidirectional transformer                |
| Generates codebooks 1-7 (fine tokens)    |
| Processes all positions in parallel      |
+------------------------------------------+
      |
Fine tokens [batch, 7, seq_len]  (codebooks 1-7)
      |
Full tokens [batch, 8, seq_len] -> EnCodec decoder -> waveform

Two-Stage Generation

  1. AR Stage: Given text + audio prompt, autoregressively generate the coarse (codebook 0) tokens. This captures the overall prosody and content.

  2. NAR Stage: Given text + coarse tokens, predict fine (codebooks 1-7) tokens in parallel. Each codebook level is predicted conditioned on all previous levels. This adds acoustic detail.
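At inference time the two stages can be chained as sketched below. This is illustrative only: `sample_next/1`, `append/2`, `eos?/1`, and `@max_len` are hypothetical helpers, not part of this module's API.

```elixir
defmodule TwoStage do
  alias Edifice.Audio.VALLE

  @max_len 1500

  # Hypothetical inference loop; sample_next/1, append/2, and eos?/1 are
  # illustrative helpers that must be supplied by the caller.
  def generate(ar_fn, nar_fn, ar_params, nar_params, text, prompt, audio0) do
    # Stage 1 (AR): extend codebook-0 tokens left to right.
    coarse =
      Enum.reduce_while(1..@max_len, audio0, fn _step, audio ->
        logits = VALLE.ar_forward(ar_fn, ar_params, text, prompt, audio)
        next = sample_next(logits)
        if eos?(next), do: {:halt, audio}, else: {:cont, append(audio, next)}
      end)

    # Stage 2 (NAR): predict codebooks 1..7, each level conditioned on the
    # previous one; all positions are decoded in parallel.
    fine =
      Enum.reduce(1..7, [coarse], fn cb, [prev | _] = acc ->
        logits = VALLE.nar_forward(nar_fn, nar_params, text, coarse, prev, cb)
        [Nx.argmax(logits, axis: -1) | acc]
      end)

    # Stack into [batch, 8, seq_len], ready for the EnCodec decoder.
    fine |> Enum.reverse() |> Nx.stack(axis: 1)
  end
end
```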

Zero-Shot Capability

The 3-second audio prompt provides speaker characteristics (timbre, accent, speaking style). VALL-E learns to preserve these while generating new content from the text. No per-speaker fine-tuning is needed.

Usage

# Build full VALL-E
{ar_model, nar_model} = VALLE.build(
  text_vocab_size: 256,
  audio_vocab_size: 1024,
  num_layers: 12,
  hidden_dim: 1024
)

# AR forward pass
coarse_logits = VALLE.ar_forward(ar_fn, params, text_tokens, prompt_tokens, audio_tokens)

# NAR forward pass
fine_logits = VALLE.nar_forward(nar_fn, params, text_tokens, coarse_tokens, prev_tokens, 1)

Summary

Types

Options for build/1 and related functions.

Functions

AR model forward pass for coarse token generation.

Build a complete VALL-E model (AR + NAR).

Build the autoregressive (AR) model for coarse token generation.

Build the non-autoregressive (NAR) model for fine token generation.

Compute cross-entropy loss for language modeling.

Get the output vocabulary size.

Compute VALL-E combined loss (AR + NAR cross-entropy).

Types

build_opt()

@type build_opt() ::
  {:text_vocab_size, pos_integer()}
  | {:audio_vocab_size, pos_integer()}
  | {:hidden_dim, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:num_heads, pos_integer()}
  | {:num_codebooks, pos_integer()}
  | {:dropout, float()}

Options for build/1 and related functions.

Functions

ar_forward(ar_fn, params, text_tokens, prompt_tokens, audio_tokens)

@spec ar_forward(function(), map(), Nx.Tensor.t(), Nx.Tensor.t(), Nx.Tensor.t()) ::
  Nx.Tensor.t()

AR model forward pass for coarse token generation.

Parameters

  • ar_fn - Compiled AR model prediction function
  • params - AR model parameters
  • text_tokens - Text/phoneme tokens [batch, text_len]
  • prompt_tokens - Audio prompt tokens [batch, num_codebooks, prompt_len]
  • audio_tokens - Generated audio tokens so far [batch, audio_len]

Returns

Logits [batch, total_len, audio_vocab_size] for next token prediction.

build(opts \\ [])

@spec build([build_opt()]) :: {Axon.t(), Axon.t()}

Build a complete VALL-E model (AR + NAR).

Options

  • :text_vocab_size - Text vocabulary size (default: 256 for BPE/phonemes)
  • :audio_vocab_size - Audio codebook size (default: 1024, matches EnCodec)
  • :hidden_dim - Transformer hidden dimension (default: 1024)
  • :num_layers - Number of transformer layers (default: 12)
  • :num_heads - Number of attention heads (default: 16)
  • :num_codebooks - Number of EnCodec codebooks (default: 8)
  • :dropout - Dropout rate (default: 0.1)

Returns

A tuple {ar_model, nar_model} of Axon models.

  • AR model: autoregressive for coarse token generation
  • NAR model: non-autoregressive for fine token generation

build_ar(opts \\ [])

@spec build_ar([build_opt()]) :: Axon.t()

Build the autoregressive (AR) model for coarse token generation.

The AR model is a decoder-only transformer that generates codebook 0 tokens autoregressively. It conditions on:

  • Text tokens (phonemes or BPE)
  • Audio prompt tokens (3 seconds of reference audio, all codebooks)

Options

  • :text_vocab_size - Text vocabulary size (default: 256)
  • :audio_vocab_size - Audio codebook size (default: 1024)
  • :hidden_dim - Transformer dimension (default: 1024)
  • :num_layers - Number of decoder layers (default: 12)
  • :num_heads - Number of attention heads (default: 16)
  • :dropout - Dropout rate (default: 0.1)

Returns

An Axon model with inputs:

  • "text_tokens": [batch, text_len]
  • "prompt_tokens": [batch, num_codebooks, prompt_len]
  • "audio_tokens": [batch, audio_len] (codebook 0 tokens being generated)

Output: logits [batch, total_len, audio_vocab_size]
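A single forward pass with dummy tokens might look like the following sketch. All shapes are illustrative; the 225-frame prompt assumes roughly 75 codec frames per second for a 3-second prompt, which is an assumption about the EnCodec configuration, not something this module specifies.

```elixir
text   = Nx.broadcast(0, {1, 16})      # [batch, text_len]
prompt = Nx.broadcast(0, {1, 8, 225})  # [batch, num_codebooks, prompt_len]
audio  = Nx.broadcast(0, {1, 1})       # codebook-0 tokens generated so far

# ar_fn/ar_params come from Axon.build/1 and the init function.
logits =
  ar_fn.(ar_params, %{
    "text_tokens" => text,
    "prompt_tokens" => prompt,
    "audio_tokens" => audio
  })
# logits: [batch, total_len, audio_vocab_size]
```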

build_nar(opts \\ [])

@spec build_nar([build_opt()]) :: Axon.t()

Build the non-autoregressive (NAR) model for fine token generation.

The NAR model generates codebooks 1-7 given the coarse tokens (codebook 0). It uses bidirectional attention since fine tokens are predicted in parallel.

Options

Same as build_ar/1.

Returns

An Axon model with inputs:

  • "text_tokens": [batch, text_len]
  • "coarse_tokens": [batch, seq_len] (codebook 0)
  • "prev_codebook_tokens": [batch, seq_len] (tokens from codebook i-1)
  • "codebook_idx": scalar indicating which codebook (1-7) to predict

Output: logits [batch, seq_len, audio_vocab_size]

cross_entropy_loss(logits, targets)

@spec cross_entropy_loss(Nx.Tensor.t(), Nx.Tensor.t()) :: Nx.Tensor.t()

Compute cross-entropy loss for language modeling.

Parameters

  • logits - Predicted logits [batch, seq_len, vocab_size]
  • targets - Target token IDs [batch, seq_len]

Returns

Cross-entropy loss scalar.
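For intuition, this plausibly reduces to Axon's sparse categorical cross-entropy over raw logits; the sketch below is not necessarily the module's exact implementation (it may mask padding or use a different reduction).

```elixir
import Nx.Defn

# Sketch: mean sparse cross-entropy from raw logits over [batch, seq_len].
defn sketch_cross_entropy(logits, targets) do
  Axon.Losses.categorical_cross_entropy(targets, logits,
    from_logits: true,
    sparse: true,
    reduction: :mean
  )
end
```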

nar_forward(nar_fn, params, text_tokens, coarse_tokens, prev_tokens, codebook_idx)

@spec nar_forward(
  function(),
  map(),
  Nx.Tensor.t(),
  Nx.Tensor.t(),
  Nx.Tensor.t(),
  integer()
) ::
  Nx.Tensor.t()

NAR model forward pass for fine token generation.

Parameters

  • nar_fn - Compiled NAR model prediction function
  • params - NAR model parameters
  • text_tokens - Text tokens [batch, text_len]
  • coarse_tokens - Coarse (codebook 0) tokens [batch, seq_len]
  • prev_tokens - Previous codebook tokens [batch, seq_len]
  • codebook_idx - Which codebook to predict (1-7)

Returns

Logits [batch, seq_len, audio_vocab_size] for the selected codebook.
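Since the previous level for codebook 1 is codebook 0, the first fine level is predicted from the coarse tokens themselves. A greedy prediction sketch:

```elixir
# For codebook_idx = 1, prev_tokens is codebook 0, i.e. the coarse tokens.
logits =
  VALLE.nar_forward(nar_fn, nar_params, text_tokens, coarse_tokens, coarse_tokens, 1)

cb1 = Nx.argmax(logits, axis: -1)  # [batch, seq_len]
```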

output_size(opts \\ [])

@spec output_size(keyword()) :: pos_integer()

Get the output vocabulary size.

valle_loss(ar_logits, ar_targets, nar_logits, nar_targets, opts \\ [])

@spec valle_loss(
  Nx.Tensor.t(),
  Nx.Tensor.t(),
  Nx.Tensor.t(),
  Nx.Tensor.t(),
  keyword()
) :: Nx.Tensor.t()

Compute VALL-E combined loss (AR + NAR cross-entropy).

Parameters

  • ar_logits - AR model logits [batch, seq_len, vocab_size]
  • ar_targets - Target coarse tokens [batch, seq_len]
  • nar_logits - NAR model logits [batch, seq_len, vocab_size]
  • nar_targets - Target fine tokens [batch, seq_len]

Options

  • :ar_weight - Weight for AR loss (default: 1.0)
  • :nar_weight - Weight for NAR loss (default: 1.0)

Returns

Combined loss scalar.
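Given the documented options, the combined loss plausibly decomposes into the two weighted cross-entropies. A sketch of that assumed composition:

```elixir
# Assumed decomposition: ar_weight * CE(ar) + nar_weight * CE(nar).
ar_weight = 1.0
nar_weight = 1.0

manual =
  VALLE.cross_entropy_loss(ar_logits, ar_targets)
  |> Nx.multiply(ar_weight)
  |> Nx.add(Nx.multiply(nar_weight, VALLE.cross_entropy_loss(nar_logits, nar_targets)))

# Should agree with the library call if the decomposition above holds.
combined =
  VALLE.valle_loss(ar_logits, ar_targets, nar_logits, nar_targets,
    ar_weight: 1.0,
    nar_weight: 1.0
  )
```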