VALL-E: Neural Codec Language Models for Zero-Shot Text-to-Speech.
VALL-E treats TTS as conditional language modeling over neural codec tokens (from EnCodec). Given text and a 3-second audio prompt, VALL-E generates speech that preserves the speaker's voice characteristics.
Architecture
VALL-E uses a two-stage approach with separate models for coarse and fine tokens:
```
Text tokens [batch, text_len]
Audio prompt tokens [batch, num_codebooks, prompt_len]
                    |
+------------------------------------------+
|        AR Model (Autoregressive)         |
|                                          |
|  Decoder-only transformer                |
|  Generates codebook 0 (coarse tokens)    |
|  Left-to-right, causal attention         |
+------------------------------------------+
                    |
    Coarse tokens [batch, seq_len] (codebook 0)
                    |
+------------------------------------------+
|      NAR Model (Non-Autoregressive)      |
|                                          |
|  Bidirectional transformer               |
|  Generates codebooks 1-7 (fine tokens)   |
|  Processes all positions in parallel     |
+------------------------------------------+
                    |
    Fine tokens [batch, 7, seq_len] (codebooks 1-7)
                    |
Full tokens [batch, 8, seq_len] -> EnCodec decoder -> waveform
```
Two-Stage Generation
AR Stage: Given text + audio prompt, autoregressively generate the coarse (codebook 0) tokens. This captures the overall prosody and content.
NAR Stage: Given text + coarse tokens, predict fine (codebooks 1-7) tokens in parallel. Each codebook level is predicted conditioned on all previous levels. This adds acoustic detail.
Zero-Shot Capability
The 3-second audio prompt provides speaker characteristics (timbre, accent, speaking style). VALL-E learns to preserve these while generating new content from the text. No per-speaker fine-tuning is needed.
Usage
```elixir
# Build full VALL-E
{ar_model, nar_model} = VALLE.build(
  text_vocab_size: 256,
  audio_vocab_size: 1024,
  num_layers: 12,
  hidden_dim: 1024
)

# AR forward pass
coarse_logits = VALLE.ar_forward(ar_fn, params, text_tokens, prompt_tokens, audio_tokens)

# NAR forward pass
fine_logits = VALLE.nar_forward(nar_fn, params, text_tokens, coarse_tokens, prev_tokens, 1)
```
References
- Wang et al., "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers" (Microsoft, 2023) — https://arxiv.org/abs/2301.02111
- VALL-E X (multilingual): https://arxiv.org/abs/2303.03926
Summary
Functions
- ar_forward/5 - AR model forward pass for coarse token generation.
- build/1 - Build a complete VALL-E model (AR + NAR).
- build_ar/1 - Build the autoregressive (AR) model for coarse token generation.
- build_nar/1 - Build the non-autoregressive (NAR) model for fine token generation.
- cross_entropy_loss/2 - Compute cross-entropy loss for language modeling.
- nar_forward/6 - NAR model forward pass for fine token generation.
- output_size/1 - Get the output vocabulary size.
- valle_loss/5 - Compute VALL-E combined loss (AR + NAR cross-entropy).
Types
@type build_opt() :: {:text_vocab_size, pos_integer()} | {:audio_vocab_size, pos_integer()} | {:hidden_dim, pos_integer()} | {:num_layers, pos_integer()} | {:num_heads, pos_integer()} | {:num_codebooks, pos_integer()} | {:dropout, float()}
Options for build/1 and related functions.
Functions
@spec ar_forward(function(), map(), Nx.Tensor.t(), Nx.Tensor.t(), Nx.Tensor.t()) :: Nx.Tensor.t()
AR model forward pass for coarse token generation.
Parameters
- ar_fn - Compiled AR model prediction function
- params - AR model parameters
- text_tokens - Text/phoneme tokens [batch, text_len]
- prompt_tokens - Audio prompt tokens [batch, num_codebooks, prompt_len]
- audio_tokens - Generated audio tokens so far [batch, audio_len]
Returns
Logits [batch, total_len, audio_vocab_size] for next token prediction.
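A decoding loop calls this forward pass once per step and appends the chosen token. A minimal greedy step might look like the sketch below (greedy argmax is an assumption — the paper samples from the predicted distribution — and the end-of-sequence check is left out):

```elixir
# One AR decoding step (sketch). Assumes ar_fn/params come from VALLE.build/1
# and that the token inputs are Nx tensors.
logits = VALLE.ar_forward(ar_fn, params, text_tokens, prompt_tokens, audio_tokens)

last_pos = Nx.axis_size(logits, 1) - 1

next_token =
  logits
  |> Nx.slice_along_axis(last_pos, 1, axis: 1)  # logits at the final position
  |> Nx.squeeze(axes: [1])                      # [batch, audio_vocab_size]
  |> Nx.argmax(axis: -1)                        # [batch]
  |> Nx.new_axis(-1)                            # [batch, 1]

# Append the new token and feed the extended sequence back in on the next step
audio_tokens = Nx.concatenate([audio_tokens, next_token], axis: 1)
```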
Build a complete VALL-E model (AR + NAR).
Options
- :text_vocab_size - Text vocabulary size (default: 256 for BPE/phonemes)
- :audio_vocab_size - Audio codebook size (default: 1024, matches EnCodec)
- :hidden_dim - Transformer hidden dimension (default: 1024)
- :num_layers - Number of transformer layers (default: 12)
- :num_heads - Number of attention heads (default: 16)
- :num_codebooks - Number of EnCodec codebooks (default: 8)
- :dropout - Dropout rate (default: 0.1)
Returns
A tuple {ar_model, nar_model} of Axon models.
- AR model: autoregressive for coarse token generation
- NAR model: non-autoregressive for fine token generation
Build the autoregressive (AR) model for coarse token generation.
The AR model is a decoder-only transformer that generates codebook 0 tokens autoregressively. It conditions on:
- Text tokens (phonemes or BPE)
- Audio prompt tokens (3 seconds of reference audio, all codebooks)
Options
- :text_vocab_size - Text vocabulary size (default: 256)
- :audio_vocab_size - Audio codebook size (default: 1024)
- :hidden_dim - Transformer dimension (default: 1024)
- :num_layers - Number of decoder layers (default: 12)
- :num_heads - Number of attention heads (default: 16)
- :dropout - Dropout rate (default: 0.1)
Returns
An Axon model with inputs:
- "text_tokens": [batch, text_len]
- "prompt_tokens": [batch, num_codebooks, prompt_len]
- "audio_tokens": [batch, audio_len] (codebook 0 tokens being generated)
Output: logits [batch, total_len, audio_vocab_size]
Build the non-autoregressive (NAR) model for fine token generation.
The NAR model generates codebooks 1-7 given the coarse tokens (codebook 0). It uses bidirectional attention since fine tokens are predicted in parallel.
Options
Same as build_ar/1.
Returns
An Axon model with inputs:
- "text_tokens": [batch, text_len]
- "coarse_tokens": [batch, seq_len] (codebook 0)
- "prev_codebook_tokens": [batch, seq_len] (tokens from codebook i-1)
- "codebook_idx": scalar indicating which codebook (1-7) to predict
Output: logits [batch, seq_len, audio_vocab_size]
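Because each level conditions on the level below it, inference walks the codebooks in order 1 through 7. A sketch of that loop, assuming greedy argmax decoding and the positional codebook index argument from the nar_forward @spec:

```elixir
# Sketch: decode codebooks 1-7 level by level, each conditioned on the
# previous level's tokens. Assumes nar_fn/params come from VALLE.build/1.
{_last, fine_levels} =
  Enum.reduce(1..7, {coarse_tokens, []}, fn idx, {prev_tokens, acc} ->
    tokens =
      VALLE.nar_forward(nar_fn, params, text_tokens, coarse_tokens, prev_tokens, idx)
      |> Nx.argmax(axis: -1)  # [batch, seq_len]

    {tokens, [tokens | acc]}
  end)

# [batch, 7, seq_len]: codebooks 1-7 in order, ready to stack with codebook 0
fine_tokens = fine_levels |> Enum.reverse() |> Nx.stack(axis: 1)
```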
@spec cross_entropy_loss(Nx.Tensor.t(), Nx.Tensor.t()) :: Nx.Tensor.t()
Compute cross-entropy loss for language modeling.
Parameters
- logits - Predicted logits [batch, seq_len, vocab_size]
- targets - Target token IDs [batch, seq_len]
Returns
Cross-entropy loss scalar.
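For reference, a standalone version of this loss can be written directly with Nx. This is a sketch, not necessarily the module's implementation — for example, it applies no padding mask:

```elixir
defmodule LossSketch do
  import Nx.Defn

  # logits [batch, seq_len, vocab_size], targets [batch, seq_len]
  defn cross_entropy(logits, targets) do
    # numerically stable log-softmax over the vocab axis
    max = Nx.reduce_max(logits, axes: [-1], keep_axes: true)
    shifted = logits - max
    log_probs = shifted - Nx.log(Nx.sum(Nx.exp(shifted), axes: [-1], keep_axes: true))

    # pick out the log-probability assigned to each target token
    nll = Nx.negate(Nx.take_along_axis(log_probs, Nx.new_axis(targets, -1), axis: 2))

    Nx.mean(nll)
  end
end
```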
@spec nar_forward(function(), map(), Nx.Tensor.t(), Nx.Tensor.t(), Nx.Tensor.t(), integer()) :: Nx.Tensor.t()
NAR model forward pass for fine token generation.
Parameters
- nar_fn - Compiled NAR model prediction function
- params - NAR model parameters
- text_tokens - Text tokens [batch, text_len]
- coarse_tokens - Coarse (codebook 0) tokens [batch, seq_len]
- prev_tokens - Previous codebook tokens [batch, seq_len]
- codebook_idx - Which codebook to predict (1-7)
Returns
Logits [batch, seq_len, audio_vocab_size], matching the coarse token length.
@spec output_size(keyword()) :: pos_integer()
Get the output vocabulary size.
@spec valle_loss(Nx.Tensor.t(), Nx.Tensor.t(), Nx.Tensor.t(), Nx.Tensor.t(), keyword()) :: Nx.Tensor.t()
Compute VALL-E combined loss (AR + NAR cross-entropy).
Parameters
- ar_logits - AR model logits [batch, seq_len, vocab_size]
- ar_targets - Target coarse tokens [batch, seq_len]
- nar_logits - NAR model logits [batch, seq_len, vocab_size]
- nar_targets - Target fine tokens [batch, seq_len]
Options
- :ar_weight - Weight for AR loss (default: 1.0)
- :nar_weight - Weight for NAR loss (default: 1.0)
Returns
Combined loss scalar.
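Given the options above, the combined loss is plausibly just a weighted sum of the two cross-entropies. A hypothetical equivalent, expressed via cross_entropy_loss/2 (the exact reduction details are assumptions):

```elixir
# Sketch of what valle_loss/5 computes; not the module's actual implementation.
def combined_loss(ar_logits, ar_targets, nar_logits, nar_targets, opts \\ []) do
  ar_weight = Keyword.get(opts, :ar_weight, 1.0)
  nar_weight = Keyword.get(opts, :nar_weight, 1.0)

  ar_loss = VALLE.cross_entropy_loss(ar_logits, ar_targets)
  nar_loss = VALLE.cross_entropy_loss(nar_logits, nar_targets)

  # weighted sum of the two scalar losses
  Nx.add(Nx.multiply(ar_weight, ar_loss), Nx.multiply(nar_weight, nar_loss))
end
```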