# `Edifice.Vision.MetaFormer`
[🔗](https://github.com/blasphemetheus/edifice/blob/main/lib/edifice/vision/metaformer.ex#L1)

MetaFormer: The general architecture behind ViT's success.

Implements the MetaFormer framework from "MetaFormer is Actually What You
Need for Vision" (Yu et al., 2022) and CAFormer from "MetaFormer Baselines
for Vision" (Yu et al., 2023). The key insight: ViT's power comes from the
overall block structure (norm → token mixer → residual → norm → FFN → residual),
not from the specific choice of self-attention as the token mixer.

## Key Insight

Even replacing attention with **average pooling** (PoolFormer) achieves
competitive results. This proves the MetaFormer architecture itself is the
main contributor to performance, not the specific token mixer.

## MetaFormer Block

```
Input
  |
  v
+---------------------+
| LayerNorm           |
| Token Mixer (any)   |  ← pooling, conv, attention, etc.
| + Residual          |
+---------------------+
  |
  v
+---------------------+
| LayerNorm           |
| FFN (MLP)           |
| + Residual          |
+---------------------+
  |
  v
Output
```
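
In Axon terms, one block can be sketched roughly like this. This is a minimal
illustrative sketch, not the module's actual internals; the module name, the
4× FFN expansion, and the GELU activation are assumptions following the
paper's defaults:

```elixir
defmodule MetaFormerBlockSketch do
  # One MetaFormer block: only the mixer varies between variants.
  # `mixer_fn` is any (Axon.t(), keyword()) -> Axon.t() function.
  def block(input, mixer_fn, dim, opts \\ []) do
    # Token-mixing sub-block: norm -> mixer -> residual
    mixed =
      input
      |> Axon.layer_norm()
      |> mixer_fn.(opts)

    x = Axon.add(input, mixed)

    # Channel-mixing sub-block: norm -> FFN (MLP) -> residual
    ffn =
      x
      |> Axon.layer_norm()
      |> Axon.dense(dim * 4)
      |> Axon.activation(:gelu)
      |> Axon.dense(dim)

    Axon.add(x, ffn)
  end
end
```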

## CAFormer (Conv-Attention Former)

The best-performing MetaFormer variant, using a mixer suited to each stage:
- Stages 1-2: Depthwise separable convolution (good for local patterns)
- Stages 3-4: Self-attention (good for global patterns)

```
Image → PatchEmbed → [Conv×3] → [Conv×3] → [Attn×9] → [Attn×3] → Pool → Head
                     Stage 1     Stage 2     Stage 3     Stage 4
                     dim=64      dim=128     dim=320     dim=512
```

## Token Mixers

- `:pooling` — Average pooling (PoolFormer)
- `:conv` — Depthwise separable convolution
- `:attention` — Standard self-attention
- Custom function — Any `(Axon.t(), keyword()) -> Axon.t()`
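
A custom mixer is just a function from an Axon node and options to an Axon
node. A minimal sketch of that form (note the `metaformer_opt` typespec below
types `:token_mixer` as `atom()`, so passing a function here is an assumption
based on the list above):

```elixir
# Hypothetical custom token mixer: a 3x3 depthwise convolution.
dw_mixer = fn x, _opts ->
  Axon.depthwise_conv(x, 1, kernel_size: 3, padding: :same)
end

model = MetaFormer.build_metaformer(token_mixer: dw_mixer)
```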

## Usage

    # Generic MetaFormer with any mixer
    model = MetaFormer.build_metaformer(
      image_size: 224,
      patch_size: 4,
      depths: [3, 3, 9, 3],
      dims: [64, 128, 320, 512],
      token_mixer: :attention
    )

    # CAFormer: conv stages then attention stages
    model = MetaFormer.build_caformer(
      image_size: 224,
      patch_size: 4,
      depths: [3, 3, 9, 3],
      dims: [64, 128, 320, 512]
    )

## References

- "MetaFormer is Actually What You Need for Vision" (Yu et al., CVPR 2022)
- "MetaFormer Baselines for Vision" (Yu et al., TPAMI 2023)
- https://arxiv.org/abs/2210.13452

# `caformer_opt`

```elixir
@type caformer_opt() ::
  {:depths, [pos_integer()]}
  | {:dims, [pos_integer()]}
  | {:image_size, pos_integer()}
  | {:in_channels, pos_integer()}
  | {:num_classes, pos_integer() | nil}
  | {:patch_size, pos_integer()}
```

Options for `build_caformer/1`.

# `metaformer_opt`

```elixir
@type metaformer_opt() ::
  {:depths, [pos_integer()]}
  | {:dims, [pos_integer()]}
  | {:image_size, pos_integer()}
  | {:in_channels, pos_integer()}
  | {:num_classes, pos_integer() | nil}
  | {:patch_size, pos_integer()}
  | {:pool_size, pos_integer()}
  | {:token_mixer, atom()}
```

Options for `build_metaformer/1`.

# `build`

```elixir
@spec build(keyword()) :: Axon.t()
```

Build via `Edifice.build/2`. Dispatches to `build_metaformer/1` or
`build_caformer/1` based on the `:variant` option.
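
For example (the `:metaformer` architecture key and the `:variant` value
here are assumptions about how `Edifice.build/2` names this module):

```elixir
# Assumed dispatch: :variant picks the builder.
model = Edifice.build(:metaformer, variant: :caformer, image_size: 224)
```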

# `build_caformer`

```elixir
@spec build_caformer([caformer_opt()]) :: Axon.t()
```

Build a CAFormer model (Conv stages + Attention stages).

CAFormer uses depthwise separable convolution for the first two stages
(local patterns) and self-attention for the last two stages (global patterns).

## Options

  - `:image_size` - Input image size, square (default: 224)
  - `:patch_size` - Initial patch size (default: 4)
  - `:in_channels` - Number of input channels (default: 3)
  - `:depths` - Number of blocks per stage (default: [3, 3, 9, 3])
  - `:dims` - Hidden dimension per stage (default: [64, 128, 320, 512])
  - `:num_classes` - Number of output classes (optional)

## Returns

  An Axon model. Without `:num_classes`, outputs `[batch, last_dim]`.
  With `:num_classes`, outputs `[batch, num_classes]`.
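
A quick shape check with Axon (the NCHW input layout is an assumption here;
confirm against the module's patch embedding):

```elixir
model = MetaFormer.build_caformer(num_classes: 1000)

{init_fn, predict_fn} = Axon.build(model)
# On newer Axon versions, pass Axon.ModelState.empty() instead of %{}.
params = init_fn.(Nx.template({1, 3, 224, 224}, :f32), %{})
logits = predict_fn.(params, Nx.iota({1, 3, 224, 224}, type: :f32))
# Nx.shape(logits) => {1, 1000}
```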

# `build_metaformer`

```elixir
@spec build_metaformer([metaformer_opt()]) :: Axon.t()
```

Build a MetaFormer model with a configurable token mixer.

## Options

  - `:image_size` - Input image size, square (default: 224)
  - `:patch_size` - Initial patch size (default: 4)
  - `:in_channels` - Number of input channels (default: 3)
  - `:depths` - Number of blocks per stage (default: [3, 3, 9, 3])
  - `:dims` - Hidden dimension per stage (default: [64, 128, 320, 512])
  - `:token_mixer` - Token mixer type: `:pooling`, `:conv`, `:attention` (default: `:pooling`)
  - `:pool_size` - Pooling kernel size when mixer is `:pooling` (default: 3)
  - `:num_classes` - Number of output classes (optional)

## Returns

  An Axon model. Without `:num_classes`, outputs `[batch, last_dim]`.
  With `:num_classes`, outputs `[batch, num_classes]`.
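
For instance, a headless PoolFormer-style feature extractor (defaults
assumed from the options above):

```elixir
# No :num_classes, so the model outputs [batch, 512] features.
features = MetaFormer.build_metaformer(token_mixer: :pooling, pool_size: 3)
```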

# `output_size`

```elixir
@spec output_size(keyword()) :: pos_integer()
```

Get the output size of a MetaFormer model.
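
Presumably `num_classes` when a classification head is attached, otherwise
the last stage dimension, mirroring the builders' Returns sections (the
exact precedence is an assumption):

```elixir
MetaFormer.output_size(num_classes: 1000)          #=> 1000
MetaFormer.output_size(dims: [64, 128, 320, 512])  #=> 512 (last stage dim)
```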

---

*Consult [api-reference.md](api-reference.md) for the complete listing*
