Edifice.Vision.MetaFormer (Edifice v0.2.0)


MetaFormer: The general architecture behind ViT's success.

Implements the MetaFormer framework from "MetaFormer is Actually What You Need for Vision" (Yu et al., CVPR 2022) and CAFormer from "MetaFormer Baselines for Vision" (Yu et al., TPAMI 2023). The key insight: ViT's power comes from the overall architecture (norm → token mixer → residual → norm → FFN → residual), not from the specific choice of self-attention as the token mixer.

Key Insight

Even replacing attention with simple average pooling (PoolFormer) achieves competitive results. This shows that the MetaFormer architecture itself, not the specific token mixer, is the main contributor to performance.

MetaFormer Block

Input
  |
  v
+---------------------+
| LayerNorm           |
| Token Mixer (any)   |   pooling, conv, attention, etc.
| + Residual          |
+---------------------+
  |
  v
+---------------------+
| LayerNorm           |
| FFN (MLP)           |
| + Residual          |
+---------------------+
  |
  v
Output
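The block diagram above maps naturally onto Axon's graph API. Below is a minimal sketch of one block, not the module's actual implementation: the helper name, the 4x FFN expansion, and the `token_mixer` calling convention are assumptions for illustration.

```elixir
# Sketch of a single MetaFormer block (hypothetical helper, not the
# Edifice implementation). `token_mixer` is any (Axon.t() -> Axon.t()).
defmodule MetaFormerBlockSketch do
  def block(input, dim, token_mixer) do
    # Sub-block 1: norm -> token mixer -> residual
    mixed =
      input
      |> Axon.layer_norm()
      |> token_mixer.()

    x = Axon.add(input, mixed)

    # Sub-block 2: norm -> FFN (MLP, assumed 4x expansion) -> residual
    ffn =
      x
      |> Axon.layer_norm()
      |> Axon.dense(dim * 4, activation: :gelu)
      |> Axon.dense(dim)

    Axon.add(x, ffn)
  end
end
```

Swapping `token_mixer` between pooling, convolution, and attention while keeping this skeleton fixed is exactly the experiment the MetaFormer papers run.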

CAFormer (Conv-Attention Former)

Best-performing MetaFormer variant using the optimal mixer for each stage:

  • Stages 1-2: Depthwise separable convolution (good for local patterns)
  • Stages 3-4: Self-attention (good for global patterns)
Image  PatchEmbed  [Conv×3]  [Conv×3]  [Attn×9]  [Attn×3]  Pool  Head
                   Stage 1    Stage 2    Stage 3    Stage 4
                   dim=64     dim=128    dim=320    dim=512

Token Mixers

  • :pooling — Average pooling (PoolFormer)
  • :conv — Depthwise separable convolution
  • :attention — Standard self-attention
  • Custom function — Any (Axon.t(), keyword()) -> Axon.t()
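Per the last bullet, any function of type `(Axon.t(), keyword()) -> Axon.t()` can serve as the mixer. A hedged sketch of passing a custom depthwise-convolution mixer, assuming `build_metaformer/1` invokes the function with the current node and its options:

```elixir
# Hypothetical custom token mixer: a 3x3 depthwise convolution.
# Assumes build_metaformer/1 calls it as mixer.(x, opts).
dwconv_mixer = fn x, _opts ->
  Axon.depthwise_conv(x, 1, kernel_size: 3, padding: :same)
end

model =
  MetaFormer.build_metaformer(
    image_size: 224,
    patch_size: 4,
    depths: [3, 3, 9, 3],
    dims: [64, 128, 320, 512],
    token_mixer: dwconv_mixer
  )
```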

Usage

# Generic MetaFormer with any mixer
model = MetaFormer.build_metaformer(
  image_size: 224,
  patch_size: 4,
  depths: [3, 3, 9, 3],
  dims: [64, 128, 320, 512],
  token_mixer: :attention
)

# CAFormer: conv stages then attention stages
model = MetaFormer.build_caformer(
  image_size: 224,
  patch_size: 4,
  depths: [3, 3, 9, 3],
  dims: [64, 128, 320, 512]
)
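To sanity-check shapes, the built model can be initialized and run like any Axon graph. A sketch under assumptions: the input layout `{batch, height, width, channels}` and single unnamed input are guesses; check the module's patch-embed layer for the actual input spec.

```elixir
# Build, initialize, and run (input shape/layout is an assumption).
{init_fn, predict_fn} = Axon.build(model)

input = Nx.broadcast(0.0, {1, 224, 224, 3})
params = init_fn.(Nx.to_template(input), %{})

output = predict_fn.(params, input)
# Without :num_classes, the output should have shape {1, 512}
# (batch by last stage dim).
```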

References

  • "MetaFormer is Actually What You Need for Vision" (Yu et al., CVPR 2022)
  • "MetaFormer Baselines for Vision" (Yu et al., TPAMI 2023)
  • https://arxiv.org/abs/2210.13452

Summary

Functions

build(opts \\ [])
Build via Edifice.build/2. Dispatches to build_metaformer/1 or build_caformer/1 based on :variant option.

build_caformer(opts \\ [])
Build a CAFormer model (Conv stages + Attention stages).

build_metaformer(opts \\ [])
Build a MetaFormer model with a configurable token mixer.

output_size(opts \\ [])
Get the output size of a MetaFormer model.

Types

caformer_opt()

@type caformer_opt() ::
  {:depths, [pos_integer()]}
  | {:dims, [pos_integer()]}
  | {:image_size, pos_integer()}
  | {:in_channels, pos_integer()}
  | {:num_classes, pos_integer() | nil}
  | {:patch_size, pos_integer()}

Options for build_caformer/1.

metaformer_opt()

@type metaformer_opt() ::
  {:depths, [pos_integer()]}
  | {:dims, [pos_integer()]}
  | {:image_size, pos_integer()}
  | {:in_channels, pos_integer()}
  | {:num_classes, pos_integer() | nil}
  | {:patch_size, pos_integer()}
  | {:pool_size, pos_integer()}
  | {:token_mixer, atom()}

Options for build_metaformer/1.

Functions

build(opts \\ [])

@spec build(keyword()) :: Axon.t()

Build via Edifice.build/2. Dispatches to build_metaformer/1 or build_caformer/1 based on :variant option.

build_caformer(opts \\ [])

@spec build_caformer([caformer_opt()]) :: Axon.t()

Build a CAFormer model (Conv stages + Attention stages).

CAFormer uses depthwise separable convolution for the first two stages (local patterns) and self-attention for the last two stages (global patterns).

Options

  • :image_size - Input image size, square (default: 224)
  • :patch_size - Initial patch size (default: 4)
  • :in_channels - Number of input channels (default: 3)
  • :depths - Number of blocks per stage (default: [3, 3, 9, 3])
  • :dims - Hidden dimension per stage (default: [64, 128, 320, 512])
  • :num_classes - Number of output classes (optional)

Returns

An Axon model. Without :num_classes, outputs [batch, last_dim]. With :num_classes, outputs [batch, num_classes].

build_metaformer(opts \\ [])

@spec build_metaformer([metaformer_opt()]) :: Axon.t()

Build a MetaFormer model with a configurable token mixer.

Options

  • :image_size - Input image size, square (default: 224)
  • :patch_size - Initial patch size (default: 4)
  • :in_channels - Number of input channels (default: 3)
  • :depths - Number of blocks per stage (default: [3, 3, 9, 3])
  • :dims - Hidden dimension per stage (default: [64, 128, 320, 512])
  • :token_mixer - Token mixer type: :pooling, :conv, :attention (default: :pooling)
  • :pool_size - Pooling kernel size when mixer is :pooling (default: 3)
  • :num_classes - Number of output classes (optional)

Returns

An Axon model. Without :num_classes, outputs [batch, last_dim]. With :num_classes, outputs [batch, num_classes].

output_size(opts \\ [])

@spec output_size(keyword()) :: pos_integer()

Get the output size of a MetaFormer model.
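`output_size/1` is useful for wiring a MetaFormer backbone into a larger model, e.g. sizing a downstream head. A hedged sketch, assuming it returns the last entry of `:dims` when `:num_classes` is unset:

```elixir
# Presumably the feature width of the backbone's output (512 here).
feat = MetaFormer.output_size(dims: [64, 128, 320, 512])

# Hypothetical downstream head sized from the backbone output.
head =
  Axon.input("features", shape: {nil, feat})
  |> Axon.dense(10)
```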