Edifice.Generative.VAR (Edifice v0.2.0)


VAR: Visual Autoregressive Modeling via Next-Scale Prediction.

Implements the VAR architecture from "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction" (Tian et al., NeurIPS 2024 Best Paper). Instead of generating images token by token, as traditional autoregressive models do, VAR generates them scale by scale: 1×1 → 2×2 → 4×4 → ... → N×N.

Key Innovation: Next-Scale Prediction

Traditional autoregressive image generation flattens images to 1D sequences and predicts pixels/tokens one at a time. This is slow and ignores spatial structure. VAR instead:

  1. Encodes images at multiple resolutions via a multi-scale VQ tokenizer
  2. Autoregressively predicts each scale given all coarser scales
  3. Each scale prediction is parallel (all tokens at that scale at once)
Scale 1 (1×1):   [tok]            Predict via GPT
Scale 2 (2×2):   [tok tok]        Predict via GPT given Scale 1
                 [tok tok]
Scale 3 (4×4):   [tok tok tok tok]    Predict via GPT given Scales 1-2
                 [tok tok tok tok]
                 [tok tok tok tok]
                 [tok tok tok tok]
...
Scale K (N×N):   Full resolution
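The coarse-to-fine ordering above can be made concrete as an attention pattern: a token may attend to every token at its own scale and at all coarser scales, but never to finer ones. A minimal sketch in Python of that block-wise mask (the helper names are illustrative, not part of this module):

```python
# Sketch: the block-wise attention mask implied by next-scale prediction.
# Scale s is an s x s token grid, so it contributes s*s sequence positions.

def scale_lengths(scales):
    return [s * s for s in scales]

def block_causal_mask(scales):
    lengths = scale_lengths(scales)
    # scale_of[i] = index of the scale that sequence position i belongs to
    scale_of = [k for k, n in enumerate(lengths) for _ in range(n)]
    total = sum(lengths)
    # mask[i][j] == 1 -> position i may attend to position j
    return [[1 if scale_of[j] <= scale_of[i] else 0 for j in range(total)]
            for i in range(total)]

mask = block_causal_mask([1, 2])  # sequence length 1 + 4 = 5
```

With scales [1, 2], the single scale-1 token attends only to itself, while all four scale-2 tokens attend to everything at scales 1 and 2 at once, which is what makes each scale's prediction parallel.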

Architecture

Image [batch, H, W, C]
      |
      v
+---------------------------+
| Multi-Scale VQ Tokenizer  |  (encode at scales 1, 2, 4, 8, 16...)
+---------------------------+
      |
      v
[Scale tokens at each resolution]
      |
      v
+---------------------------+
| GPT-2 Backbone            |  (autoregressive over scale sequence)
| (decoder_only pattern)    |
+---------------------------+
      |
      v
+---------------------------+
| Predict next scale tokens |
+---------------------------+
      |
      v
+---------------------------+
| Multi-Scale VQ Decoder    |  (decode all scales to image)
+---------------------------+
      |
      v
Output Image [batch, H, W, C]

Usage

# Build tokenizer for encoding/decoding
tokenizer = VAR.build_tokenizer(
  image_size: 256,
  scales: [1, 2, 4, 8, 16],
  codebook_size: 1024,
  embed_dim: 256
)

# Build the full VAR model (GPT backbone for next-scale prediction)
model = VAR.build(
  hidden_size: 512,
  num_layers: 12,
  num_heads: 8,
  scales: [1, 2, 4, 8, 16],
  codebook_size: 1024
)

Reference

  • Paper: "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction"
  • Authors: Tian, Keyu, et al.
  • arXiv: https://arxiv.org/abs/2404.02905
  • Award: NeurIPS 2024 Best Paper

Summary

Functions

  • Build the VAR model (GPT-2 style backbone for next-scale prediction).
  • Build a multi-scale VQ tokenizer for VAR.
  • Perform next-scale prediction given current scale tokens.
  • Get recommended defaults for VAR.
  • Get the total number of tokens across all scales.

Types

build_opt()

@type build_opt() ::
  {:codebook_size, pos_integer()}
  | {:dropout, float()}
  | {:hidden_size, pos_integer()}
  | {:mlp_ratio, float()}
  | {:num_heads, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:scales, [pos_integer()]}

Options for build/1.

tokenizer_opt()

@type tokenizer_opt() ::
  {:codebook_size, pos_integer()}
  | {:embed_dim, pos_integer()}
  | {:image_size, pos_integer()}
  | {:in_channels, pos_integer()}
  | {:scales, [pos_integer()]}

Options for build_tokenizer/1.

Functions

build(opts \\ [])

@spec build([build_opt()]) :: Axon.t()

Build the VAR model (GPT-2 style backbone for next-scale prediction).

Options

  • :hidden_size - Transformer hidden dimension (default: 512)
  • :num_layers - Number of transformer layers (default: 12)
  • :num_heads - Number of attention heads (default: 8)
  • :mlp_ratio - MLP expansion ratio (default: 4.0)
  • :scales - List of scale factors (default: [1, 2, 4, 8, 16])
  • :codebook_size - Vocabulary size per scale (default: 1024)
  • :dropout - Dropout rate (default: 0.1)

Returns

An Axon model that takes scale token embeddings and predicts next-scale logits.

build_tokenizer(opts \\ [])

@spec build_tokenizer([tokenizer_opt()]) :: {Axon.t(), Axon.t()}

Build a multi-scale VQ tokenizer for VAR.

The tokenizer encodes images at multiple resolutions, each with its own codebook. This enables the coarse-to-fine generation strategy.

Options

  • :image_size - Target image size (default: 256)
  • :scales - List of scale factors [1, 2, 4, ...] (default: [1, 2, 4, 8, 16])
  • :codebook_size - Number of codes per scale (default: 1024)
  • :embed_dim - Embedding dimension (default: 256)
  • :in_channels - Input image channels (default: 3)

Returns

A tuple {encoder, decoder} where:

  • encoder maps images to multi-scale token indices
  • decoder maps multi-scale tokens back to images
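Conceptually, the encoder pools the input down to each s×s grid and snaps every cell to its nearest codebook entry. A hedged, language-agnostic sketch in Python (real tokenizers quantize learned feature maps, not raw pixels; the helper names are illustrative, not Edifice's implementation):

```python
# Sketch: multi-scale VQ encoding on a toy single-channel "image".

def avg_pool(img, s):
    # img: H x W grid of numbers, pooled down to an s x s grid of means
    h, w = len(img), len(img[0])
    bh, bw = h // s, w // s
    return [[sum(img[i][j] for i in range(r * bh, (r + 1) * bh)
                           for j in range(c * bw, (c + 1) * bw)) / (bh * bw)
             for c in range(s)] for r in range(s)]

def encode_scale(img, s, codebook):
    # token index = nearest codebook entry for each pooled cell
    pooled = avg_pool(img, s)
    return [[min(range(len(codebook)), key=lambda k: abs(codebook[k] - v))
             for v in row] for row in pooled]

def encode_multiscale(img, scales, codebooks):
    # one grid of token indices per scale, coarse to fine
    return [encode_scale(img, s, cb) for s, cb in zip(scales, codebooks)]

img = [[0, 0, 1, 1] for _ in range(4)]          # 4 x 4, left half 0, right half 1
tokens = encode_multiscale(img, [1, 2], [[0.0, 1.0], [0.0, 1.0]])
```

Here `tokens[1]` recovers the left/right split as `[[0, 1], [0, 1]]`, while the 1×1 scale collapses the whole image to a single token.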

next_scale_prediction(model, params, current_embeddings, scale_idx)

@spec next_scale_prediction(Axon.t(), map(), Nx.Tensor.t(), non_neg_integer()) ::
  Nx.Tensor.t()

Perform next-scale prediction given current scale tokens.

This function is used during inference to autoregressively generate each scale conditioned on all previous scales.

Parameters

  • model - The VAR model
  • params - Model parameters
  • current_embeddings - Token embeddings for scales 1..k
  • scale_idx - Which scale to predict (0-indexed)

Returns

Logits for the next scale's tokens.
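The inference loop this function supports can be sketched as follows. This is a hedged, language-agnostic illustration: `predict_logits` stands in for next_scale_prediction/4, and greedy argmax stands in for whatever sampling strategy is actually used.

```python
# Sketch: coarse-to-fine generation. Each scale k is predicted in one
# parallel step, conditioned on all previously generated scales.

def generate(scales, predict_logits):
    tokens_per_scale = []
    for k, s in enumerate(scales):
        # one logit row per token position at scale k: shape (s*s, vocab)
        logits = predict_logits(tokens_per_scale, k)
        toks = [max(range(len(row)), key=row.__getitem__) for row in logits]
        tokens_per_scale.append(toks)
    return tokens_per_scale

# Illustrative stand-in for the real model call: always favors token 1.
def fake_logits(prev_scales, k):
    s = [1, 2][k]
    return [[0.0, 1.0]] * (s * s)

out = generate([1, 2], fake_logits)  # -> [[1], [1, 1, 1, 1]]
```

Note that the loop makes one model call per scale rather than one per token, which is the source of VAR's speedup over flat token-by-token decoding.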

total_tokens(scales \\ [1, 2, 4, 8, 16])

@spec total_tokens([pos_integer()]) :: pos_integer()

Get the total number of tokens across all scales. Each scale s contributes an s×s grid, so this is the sum of s² over the scale list (341 for the default [1, 2, 4, 8, 16]).
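The arithmetic, as a one-line Python sketch (function name mirrors this module's total_tokens/1; the Python itself is illustrative):

```python
# Scale s is an s x s grid, so the sequence length is the sum of s**2.
def total_tokens(scales=(1, 2, 4, 8, 16)):
    return sum(s * s for s in scales)
```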