VAR: Visual Autoregressive Modeling via Next-Scale Prediction.
Implements the VAR architecture from "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction" (Tian et al., NeurIPS 2024 Best Paper). Instead of generating images token-by-token (like traditional AR), VAR generates images scale-by-scale: 1×1 → 2×2 → 4×4 → ... → N×N.
Key Innovation: Next-Scale Prediction
Traditional autoregressive image generation flattens images to 1D sequences and predicts pixels/tokens one at a time. This is slow and ignores spatial structure. VAR instead:
- Encodes images at multiple resolutions via a multi-scale VQ tokenizer
- Autoregressively predicts each scale given all coarser scales
- Each scale prediction is parallel (all tokens at that scale at once)
Scale 1 (1×1):  [tok]              → Predict via GPT
Scale 2 (2×2):  [tok tok]          → Predict via GPT given Scale 1
                [tok tok]
Scale 3 (4×4):  [tok tok tok tok]  → Predict via GPT given Scales 1-2
                [tok tok tok tok]
                [tok tok tok tok]
                [tok tok tok tok]
...
Scale K (N×N):  full resolution

Architecture
Image [batch, H, W, C]
|
v
+---------------------------+
| Multi-Scale VQ Tokenizer | (encode at scales 1, 2, 4, 8, 16...)
+---------------------------+
|
v
[Scale tokens at each resolution]
|
v
+---------------------------+
| GPT-2 Backbone | (autoregressive over scale sequence)
| (decoder_only pattern) |
+---------------------------+
|
v
| Predict next scale tokens |
|
v
+---------------------------+
| Multi-Scale VQ Decoder | (decode all scales to image)
+---------------------------+
|
v
Output Image [batch, H, W, C]

Usage
# Build tokenizer for encoding/decoding (returns an {encoder, decoder} pair)
{encoder, decoder} = VAR.build_tokenizer(
  image_size: 256,
  scales: [1, 2, 4, 8, 16],
  codebook_size: 1024,
  embed_dim: 256
)

# Build the full VAR model (GPT backbone for next-scale prediction)
model = VAR.build(
  hidden_size: 512,
  num_layers: 12,
  num_heads: 8,
  scales: [1, 2, 4, 8, 16],
  codebook_size: 1024
)

Reference
- Paper: "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction"
- Authors: Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, Liwei Wang
- arXiv: https://arxiv.org/abs/2404.02905
- Award: NeurIPS 2024 Best Paper
Summary
Functions
- build/1 - Build the VAR model (GPT-2 style backbone for next-scale prediction).
- build_tokenizer/1 - Build a multi-scale VQ tokenizer for VAR.
- next_scale_prediction/4 - Perform next-scale prediction given current scale tokens.
- recommended_defaults/0 - Get recommended defaults for VAR.
- total_tokens/1 - Get the total number of tokens across all scales.
Types
@type build_opt() :: {:codebook_size, pos_integer()} | {:dropout, float()} | {:hidden_size, pos_integer()} | {:mlp_ratio, float()} | {:num_heads, pos_integer()} | {:num_layers, pos_integer()} | {:scales, [pos_integer()]}
Options for build/1.
@type tokenizer_opt() :: {:codebook_size, pos_integer()} | {:embed_dim, pos_integer()} | {:image_size, pos_integer()} | {:in_channels, pos_integer()} | {:scales, [pos_integer()]}
Options for build_tokenizer/1.
Functions
Build the VAR model (GPT-2 style backbone for next-scale prediction).
Options
- :hidden_size - Transformer hidden dimension (default: 512)
- :num_layers - Number of transformer layers (default: 12)
- :num_heads - Number of attention heads (default: 8)
- :mlp_ratio - MLP expansion ratio (default: 4.0)
- :scales - List of scale factors (default: [1, 2, 4, 8, 16])
- :codebook_size - Vocabulary size per scale (default: 1024)
- :dropout - Dropout rate (default: 0.1)
Returns
An Axon model that takes scale token embeddings and predicts next-scale logits.
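Since the return value is a plain Axon graph, it can be compiled and initialized with Axon's standard two-step build. The sketch below is illustrative, not part of the library: the input template shape is an assumption that the model expects one 512-dimensional embedding per token across all five default scales.

```elixir
# Sketch: build with defaults, then initialize parameters.
# The template {1, 341, 512} assumes batch 1, 341 tokens
# (1 + 4 + 16 + 64 + 256 across scales [1, 2, 4, 8, 16]), hidden size 512.
model = VAR.build(hidden_size: 512, scales: [1, 2, 4, 8, 16])

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 341, 512}, :f32), %{})
```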
@spec build_tokenizer([tokenizer_opt()]) :: {Axon.t(), Axon.t()}
Build a multi-scale VQ tokenizer for VAR.
The tokenizer encodes images at multiple resolutions, each with its own codebook. This enables the coarse-to-fine generation strategy.
Options
- :image_size - Target image size (default: 256)
- :scales - List of scale factors [1, 2, 4, ...] (default: [1, 2, 4, 8, 16])
- :codebook_size - Number of codes per scale (default: 1024)
- :embed_dim - Embedding dimension (default: 256)
- :in_channels - Input image channels (default: 3)
Returns
A tuple {encoder, decoder} where:
- encoder maps images to multi-scale token indices
- decoder maps multi-scale tokens back to images
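Both halves of the tuple are ordinary Axon graphs, so the encoder can be initialized and run like any other model. This is a hedged sketch for shape-checking only: the all-grey test image is an assumption, and in practice the parameters come from training rather than fresh initialization.

```elixir
{encoder, _decoder} = VAR.build_tokenizer(image_size: 256, scales: [1, 2, 4, 8, 16])

# Untrained parameters, useful only for exercising the pipeline's shapes.
{init_fn, predict_fn} = Axon.build(encoder)
params = init_fn.(Nx.template({1, 256, 256, 3}, :f32), %{})

# A dummy [batch, H, W, C] image; real inputs would be normalized photos.
image  = Nx.broadcast(0.5, {1, 256, 256, 3})
tokens = predict_fn.(params, image)
```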
@spec next_scale_prediction(Axon.t(), map(), Nx.Tensor.t(), non_neg_integer()) :: Nx.Tensor.t()
Perform next-scale prediction given current scale tokens.
This function is used during inference to autoregressively generate each scale conditioned on all previous scales.
Parameters
- model - The VAR model
- params - Model parameters
- current_tokens - Token indices for scales 1..k
- scale_idx - Which scale to predict (0-indexed)
Returns
Logits for the next scale's tokens.
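An inference loop over this function walks the scales coarse to fine. The sketch below assumes greedy argmax decoding; the starting tokens and the way each scale's predictions are appended to the running sequence are assumptions about the model's token layout, not the library's prescribed API.

```elixir
scales = [1, 2, 4, 8, 16]

# `start_tokens` (e.g. a learned start-of-sequence index) is model-specific
# and assumed to exist here.
tokens =
  scales
  |> Enum.with_index()
  |> Enum.reduce(start_tokens, fn {_scale, idx}, acc ->
    logits = VAR.next_scale_prediction(model, params, acc, idx)

    # All positions at this scale are decoded in parallel; greedy argmax
    # shown here, though temperature sampling is equally valid.
    next = Nx.argmax(logits, axis: -1)

    Nx.concatenate([acc, next], axis: 1)
  end)
```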
@spec recommended_defaults() :: keyword()
Get recommended defaults for VAR.
@spec total_tokens([pos_integer()]) :: pos_integer()
Get the total number of tokens across all scales.
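Each scale k contributes k² tokens, so the sequence length is the sum of squares of the scales. A minimal stand-alone sketch of that arithmetic (not the library function itself):

```elixir
# Sum of k^2 over all scales.
total_tokens = fn scales -> scales |> Enum.map(&(&1 * &1)) |> Enum.sum() end

total_tokens.([1, 2, 4, 8, 16])
# => 341
```

With the default scales this gives 341 positions, yet only 5 autoregressive steps, versus 256 steps for a flat 16×16 raster-order model.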