# `Edifice.Generative.VAR`
[🔗](https://github.com/blasphemetheus/edifice/blob/main/lib/edifice/generative/var.ex#L1)

VAR: Visual Autoregressive Modeling via Next-Scale Prediction.

Implements the VAR architecture from "Visual Autoregressive Modeling:
Scalable Image Generation via Next-Scale Prediction" (Tian et al., NeurIPS 2024
Best Paper). Instead of generating images token-by-token (like traditional AR),
VAR generates images scale-by-scale: 1×1 → 2×2 → 4×4 → ... → N×N.

## Key Innovation: Next-Scale Prediction

Traditional autoregressive image generation flattens an image into a 1D sequence
and predicts pixels/tokens one at a time, which is slow (one forward pass per
token) and discards the image's 2D spatial structure. VAR instead:

1. Encodes images at multiple resolutions via a multi-scale VQ tokenizer
2. Autoregressively predicts each scale given all coarser scales
3. Each scale prediction is parallel (all tokens at that scale at once)

```
Scale 1 (1×1):   [tok]           → Predict via GPT
Scale 2 (2×2):   [tok tok]       → Predict via GPT given Scale 1
                 [tok tok]
Scale 3 (4×4):   [tok tok tok tok]   → Predict via GPT given Scales 1-2
                 [tok tok tok tok]
                 [tok tok tok tok]
                 [tok tok tok tok]
...
Scale K (N×N):   Full resolution
```
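To make the efficiency gain concrete, here is the arithmetic for the default scale list (a quick illustration in Python; the scale list matches the defaults used throughout this module):

```python
# Per-scale token counts for VAR's next-scale prediction.
# A scale s contributes an s x s grid of tokens, all predicted in parallel.
scales = [1, 2, 4, 8, 16]

tokens_per_scale = [s * s for s in scales]
print(tokens_per_scale)  # [1, 4, 16, 64, 256]

total = sum(tokens_per_scale)
print(total)  # 341

# Token-by-token AR needs one forward pass per token (341 passes here);
# next-scale prediction needs one pass per scale (5 passes).
print(len(scales))  # 5
```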

## Architecture

```
Image [batch, H, W, C]
      |
      v
+---------------------------+
| Multi-Scale VQ Tokenizer  |  (encode at scales 1, 2, 4, 8, 16...)
+---------------------------+
      |
      v
[Scale tokens at each resolution]
      |
      v
+---------------------------+
| GPT-2 Backbone            |  (autoregressive over scale sequence)
| (decoder_only pattern)    |
+---------------------------+
      |
      v
+---------------------------+
| Predict next-scale tokens |
+---------------------------+
      |
      v
+---------------------------+
| Multi-Scale VQ Decoder    |  (decode all scales to image)
+---------------------------+
      |
      v
Output Image [batch, H, W, C]
```

## Usage

    # Build tokenizer for encoding/decoding
    tokenizer = VAR.build_tokenizer(
      image_size: 256,
      scales: [1, 2, 4, 8, 16],
      codebook_size: 1024,
      embed_dim: 256
    )

    # Build the full VAR model (GPT backbone for next-scale prediction)
    model = VAR.build(
      hidden_size: 512,
      num_layers: 12,
      num_heads: 8,
      scales: [1, 2, 4, 8, 16],
      codebook_size: 1024
    )

## Reference

- Paper: "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction"
- Authors: Keyu Tian et al.
- arXiv: https://arxiv.org/abs/2404.02905
- Award: NeurIPS 2024 Best Paper

# `build_opt`

```elixir
@type build_opt() ::
  {:codebook_size, pos_integer()}
  | {:dropout, float()}
  | {:hidden_size, pos_integer()}
  | {:mlp_ratio, float()}
  | {:num_heads, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:scales, [pos_integer()]}
```

Options for `build/1`.

# `tokenizer_opt`

```elixir
@type tokenizer_opt() ::
  {:codebook_size, pos_integer()}
  | {:embed_dim, pos_integer()}
  | {:image_size, pos_integer()}
  | {:in_channels, pos_integer()}
  | {:scales, [pos_integer()]}
```

Options for `build_tokenizer/1`.

# `build`

```elixir
@spec build([build_opt()]) :: Axon.t()
```

Build the VAR model (GPT-2 style backbone for next-scale prediction).

## Options

  - `:hidden_size` - Transformer hidden dimension (default: 512)
  - `:num_layers` - Number of transformer layers (default: 12)
  - `:num_heads` - Number of attention heads (default: 8)
  - `:mlp_ratio` - MLP expansion ratio (default: 4.0)
  - `:scales` - List of scale factors (default: [1, 2, 4, 8, 16])
  - `:codebook_size` - Vocabulary size per scale (default: 1024)
  - `:dropout` - Dropout rate (default: 0.1)

## Returns

  An Axon model that takes scale token embeddings and predicts next-scale logits.

# `build_tokenizer`

```elixir
@spec build_tokenizer([tokenizer_opt()]) :: {Axon.t(), Axon.t()}
```

Build a multi-scale VQ tokenizer for VAR.

The tokenizer encodes images at multiple resolutions, each with its own
codebook. This enables the coarse-to-fine generation strategy.

## Options

  - `:image_size` - Target image size (default: 256)
  - `:scales` - List of scale factors [1, 2, 4, ...] (default: [1, 2, 4, 8, 16])
  - `:codebook_size` - Number of codes per scale (default: 1024)
  - `:embed_dim` - Embedding dimension (default: 256)
  - `:in_channels` - Input image channels (default: 3)

## Returns

  A tuple `{encoder, decoder}` where:
  - `encoder` maps images to multi-scale token indices
  - `decoder` maps multi-scale tokens back to images
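A round trip through the tokenizer pair might look like the following sketch (illustrative only: `enc_params` and `dec_params` stand in for trained tokenizer parameters, and `images` for a `[batch, H, W, C]` tensor):

```elixir
# Sketch of encoding an image batch to multi-scale tokens and back.
{encoder, decoder} = VAR.build_tokenizer(image_size: 256, scales: [1, 2, 4, 8, 16])

# Axon.predict/3 runs a built model with the given parameters.
tokens = Axon.predict(encoder, enc_params, images)
reconstruction = Axon.predict(decoder, dec_params, tokens)
```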

# `next_scale_prediction`

```elixir
@spec next_scale_prediction(Axon.t(), map(), Nx.Tensor.t(), non_neg_integer()) ::
  Nx.Tensor.t()
```

Perform next-scale prediction given current scale tokens.

This function is used during inference to autoregressively generate
each scale conditioned on all previous scales.

## Parameters

  - `model` - The VAR model
  - `params` - Model parameters
  - `current_tokens` - Token indices for scales 1..k
  - `scale_idx` - Which scale to predict (0-indexed)

## Returns

  Logits for the next scale's tokens.
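The inference loop built on this function might look like the following sketch (illustrative only, not runnable as-is: `sample_tokens/1` and `append_scale/2` are hypothetical helpers for sampling token indices from logits and extending the token sequence, and `initial_tokens` holds the starting tokens for the coarsest scale):

```elixir
# Coarse-to-fine generation: predict each scale in turn,
# conditioned on all coarser scales generated so far.
scales = [1, 2, 4, 8, 16]

tokens =
  Enum.reduce(0..(length(scales) - 1), initial_tokens, fn scale_idx, current ->
    logits = VAR.next_scale_prediction(model, params, current, scale_idx)
    # sample_tokens/1: hypothetical helper that argmaxes or samples
    # token indices from the logits for this scale.
    new_tokens = sample_tokens(logits)
    append_scale(current, new_tokens)
  end)
```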

# `recommended_defaults`

```elixir
@spec recommended_defaults() :: keyword()
```

Get recommended default options for VAR as a keyword list.

# `total_tokens`

```elixir
@spec total_tokens([pos_integer()]) :: pos_integer()
```

Get the total number of tokens across all scales, i.e. the sum of s² over the
scale list (341 for the default `[1, 2, 4, 8, 16]`).

---

*Consult [api-reference.md](api-reference.md) for the complete listing*
