Edifice.Generative.VAR (Edifice v0.2.0)


VAR: Visual Autoregressive Modeling via Next-Scale Prediction.

Implements the VAR architecture from "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction" (Tian et al., NeurIPS 2024 Best Paper). Instead of generating images token by token, as traditional autoregressive models do, VAR generates them scale by scale: 1×1 → 2×2 → 4×4 → ... → N×N.

Key Innovation: Next-Scale Prediction

Traditional autoregressive image generation flattens images to 1D sequences and predicts pixels/tokens one at a time. This is slow and ignores spatial structure. VAR instead:

  1. Encodes images at multiple resolutions via a multi-scale VQ tokenizer
  2. Autoregressively predicts each scale given all coarser scales
  3. Each scale prediction is parallel (all tokens at that scale at once)
Scale 1 (1×1):   [tok]            Predict via GPT
Scale 2 (2×2):   [tok tok]        Predict via GPT given Scale 1
                 [tok tok]
Scale 3 (4×4):   [tok tok tok tok]    Predict via GPT given Scales 1-2
                 [tok tok tok tok]
                 [tok tok tok tok]
                 [tok tok tok tok]
...
Scale K (N×N):   Full resolution
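The coarse-to-fine ordering above can be made concrete as an attention pattern: a token may attend to every token at its own scale and at all coarser scales, but never to finer ones. A minimal sketch in Python of that block-wise mask (the helper names are illustrative, not part of this module):

```python
# Sketch: the block-wise attention mask implied by next-scale prediction.
# Scale s is an s x s token grid, so it contributes s*s sequence positions.

def scale_lengths(scales):
    return [s * s for s in scales]

def block_causal_mask(scales):
    lengths = scale_lengths(scales)
    # scale_of[i] = index of the scale that sequence position i belongs to
    scale_of = [k for k, n in enumerate(lengths) for _ in range(n)]
    total = sum(lengths)
    # mask[i][j] == 1 -> position i may attend to position j
    return [[1 if scale_of[j] <= scale_of[i] else 0 for j in range(total)]
            for i in range(total)]

mask = block_causal_mask([1, 2])  # sequence length 1 + 4 = 5
```

With scales [1, 2], the single scale-1 token attends only to itself, while all four scale-2 tokens attend to everything at scales 1 and 2 at once, which is what makes each scale's prediction parallel.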

Architecture

Image [batch, H, W, C]
      |
      v
+---------------------------+
| Multi-Scale VQ Tokenizer  |  (encode at scales 1, 2, 4, 8, 16...)
+---------------------------+
      |
      v
[Scale tokens at each resolution]
      |
      v
+---------------------------+
| GPT-2 Backbone            |  (autoregressive over scale sequence)
| (decoder_only pattern)    |
+---------------------------+
      |
      v
+---------------------------+
| Predict next scale tokens |
+---------------------------+
      |
      v
+---------------------------+
| Multi-Scale VQ Decoder    |  (decode all scales to image)
+---------------------------+
      |
      v
Output Image [batch, H, W, C]

Usage

# Build tokenizer for encoding/decoding
tokenizer = VAR.build_tokenizer(
  image_size: 256,
  scales: [1, 2, 4, 8, 16],
  codebook_size: 1024,
  embed_dim: 256
)

# Build the full VAR model (GPT backbone for next-scale prediction)
model = VAR.build(
  hidden_size: 512,
  num_layers: 12,
  num_heads: 8,
  scales: [1, 2, 4, 8, 16],
  codebook_size: 1024
)

Reference

  • Paper: "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction"
  • Authors: Tian, Keyu, et al.
  • arXiv: https://arxiv.org/abs/2404.02905
  • Award: NeurIPS 2024 Best Paper

Summary

Functions

  • Build the VAR model (GPT-2 style backbone for next-scale prediction).
  • Build a multi-scale VQ tokenizer for VAR.
  • Perform next-scale prediction given current scale tokens.
  • Get recommended defaults for VAR.
  • Get the total number of tokens across all scales.

Types

build_opt()

@type build_opt() ::
  {:codebook_size, pos_integer()}
  | {:dropout, float()}
  | {:hidden_size, pos_integer()}
  | {:mlp_ratio, float()}
  | {:num_heads, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:scales, [pos_integer()]}

Options for build/1.

tokenizer_opt()

@type tokenizer_opt() ::
  {:codebook_size, pos_integer()}
  | {:embed_dim, pos_integer()}
  | {:image_size, pos_integer()}
  | {:in_channels, pos_integer()}
  | {:scales, [pos_integer()]}

Options for build_tokenizer/1.

Functions

build(opts \\ [])

@spec build([build_opt()]) :: Axon.t()

Build the VAR model (GPT-2 style backbone for next-scale prediction).

Options

  • :hidden_size - Transformer hidden dimension (default: 512)
  • :num_layers - Number of transformer layers (default: 12)
  • :num_heads - Number of attention heads (default: 8)
  • :mlp_ratio - MLP expansion ratio (default: 4.0)
  • :scales - List of scale factors (default: [1, 2, 4, 8, 16])
  • :codebook_size - Vocabulary size per scale (default: 1024)
  • :dropout - Dropout rate (default: 0.1)

Returns

An Axon model that takes scale token embeddings and predicts next-scale logits.

build_tokenizer(opts \\ [])

@spec build_tokenizer([tokenizer_opt()]) :: {Axon.t(), Axon.t()}

Build a multi-scale VQ tokenizer for VAR.

The tokenizer encodes images at multiple resolutions, each with its own codebook. This enables the coarse-to-fine generation strategy.

Options

  • :image_size - Target image size (default: 256)
  • :scales - List of scale factors [1, 2, 4, ...] (default: [1, 2, 4, 8, 16])
  • :codebook_size - Number of codes per scale (default: 1024)
  • :embed_dim - Embedding dimension (default: 256)
  • :in_channels - Input image channels (default: 3)

Returns

A tuple {encoder, decoder} where:

  • encoder maps images to multi-scale token indices
  • decoder maps multi-scale tokens back to images
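Conceptually, the encoder pools the input down to each s×s grid and snaps every cell to its nearest codebook entry. A hedged, language-agnostic sketch in Python (real tokenizers quantize learned feature maps, not raw pixels; the helper names are illustrative, not Edifice's implementation):

```python
# Sketch: multi-scale VQ encoding on a toy single-channel "image".

def avg_pool(img, s):
    # img: H x W grid of numbers, pooled down to an s x s grid of means
    h, w = len(img), len(img[0])
    bh, bw = h // s, w // s
    return [[sum(img[i][j] for i in range(r * bh, (r + 1) * bh)
                           for j in range(c * bw, (c + 1) * bw)) / (bh * bw)
             for c in range(s)] for r in range(s)]

def encode_scale(img, s, codebook):
    # token index = nearest codebook entry for each pooled cell
    pooled = avg_pool(img, s)
    return [[min(range(len(codebook)), key=lambda k: abs(codebook[k] - v))
             for v in row] for row in pooled]

def encode_multiscale(img, scales, codebooks):
    # one grid of token indices per scale, coarse to fine
    return [encode_scale(img, s, cb) for s, cb in zip(scales, codebooks)]

img = [[0, 0, 1, 1] for _ in range(4)]          # 4 x 4, left half 0, right half 1
tokens = encode_multiscale(img, [1, 2], [[0.0, 1.0], [0.0, 1.0]])
```

Here `tokens[1]` recovers the left/right split as `[[0, 1], [0, 1]]`, while the 1×1 scale collapses the whole image to a single token.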

next_scale_prediction(model, params, current_embeddings, scale_idx)

@spec next_scale_prediction(Axon.t(), map(), Nx.Tensor.t(), non_neg_integer()) ::
  Nx.Tensor.t()

Perform next-scale prediction given current scale tokens.

This function is used during inference to autoregressively generate each scale conditioned on all previous scales.

Parameters

  • model - The VAR model
  • params - Model parameters
  • current_embeddings - Token embeddings for scales 1..k
  • scale_idx - Which scale to predict (0-indexed)

Returns

Logits for the next scale's tokens.
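The inference loop this function supports can be sketched as follows. This is a hedged, language-agnostic illustration: `predict_logits` stands in for next_scale_prediction/4, and greedy argmax stands in for whatever sampling strategy is actually used.

```python
# Sketch: coarse-to-fine generation. Each scale k is predicted in one
# parallel step, conditioned on all previously generated scales.

def generate(scales, predict_logits):
    tokens_per_scale = []
    for k, s in enumerate(scales):
        # one logit row per token position at scale k: shape (s*s, vocab)
        logits = predict_logits(tokens_per_scale, k)
        toks = [max(range(len(row)), key=row.__getitem__) for row in logits]
        tokens_per_scale.append(toks)
    return tokens_per_scale

# Illustrative stand-in for the real model call: always favors token 1.
def fake_logits(prev_scales, k):
    s = [1, 2][k]
    return [[0.0, 1.0]] * (s * s)

out = generate([1, 2], fake_logits)  # -> [[1], [1, 1, 1, 1]]
```

Note that the loop makes one model call per scale rather than one per token, which is the source of VAR's speedup over flat token-by-token decoding.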

total_tokens(scales \\ [1, 2, 4, 8, 16])

@spec total_tokens([pos_integer()]) :: pos_integer()

Get the total number of tokens across all scales. Each scale s contributes an s×s grid, so this is the sum of s² over the scale list (341 for the default [1, 2, 4, 8, 16]).
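The arithmetic, as a one-line Python sketch (function name mirrors this module's total_tokens/1; the Python itself is illustrative):

```python
# Scale s is an s x s grid, so the sequence length is the sum of s**2.
def total_tokens(scales=(1, 2, 4, 8, 16)):
    return sum(s * s for s in scales)
```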