Edifice.Vision.MetaFormer (Edifice v0.2.0)


MetaFormer: The general architecture behind ViT's success.

Implements the MetaFormer framework from "MetaFormer is Actually What You Need for Vision" (Yu et al., CVPR 2022) and CAFormer from "MetaFormer Baselines for Vision" (Yu et al., TPAMI 2023). The key insight: ViT's power comes from the overall architecture (norm → token mixer → residual → norm → FFN → residual), not from the specific choice of self-attention as the token mixer.

Key Insight

Even replacing attention with simple average pooling (PoolFormer) achieves competitive results. This shows that the MetaFormer architecture itself, not the specific token mixer, is the main contributor to performance.

MetaFormer Block

Input
  |
  v
+---------------------+
| LayerNorm           |
| Token Mixer (any)   |   pooling, conv, attention, etc.
| + Residual          |
+---------------------+
  |
  v
+---------------------+
| LayerNorm           |
| FFN (MLP)           |
| + Residual          |
+---------------------+
  |
  v
Output
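The block diagram above maps naturally onto Axon's graph API. Below is a minimal sketch of one block, not the module's actual implementation: the helper name, the 4x FFN expansion, and the `token_mixer` calling convention are assumptions for illustration.

```elixir
# Sketch of a single MetaFormer block (hypothetical helper, not the
# Edifice implementation). `token_mixer` is any (Axon.t() -> Axon.t()).
defmodule MetaFormerBlockSketch do
  def block(input, dim, token_mixer) do
    # Sub-block 1: norm -> token mixer -> residual
    mixed =
      input
      |> Axon.layer_norm()
      |> token_mixer.()

    x = Axon.add(input, mixed)

    # Sub-block 2: norm -> FFN (MLP, assumed 4x expansion) -> residual
    ffn =
      x
      |> Axon.layer_norm()
      |> Axon.dense(dim * 4, activation: :gelu)
      |> Axon.dense(dim)

    Axon.add(x, ffn)
  end
end
```

Swapping `token_mixer` between pooling, convolution, and attention while keeping this skeleton fixed is exactly the experiment the MetaFormer papers run.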

CAFormer (Conv-Attention Former)

Best-performing MetaFormer variant using the optimal mixer for each stage:

  • Stages 1-2: Depthwise separable convolution (good for local patterns)
  • Stages 3-4: Self-attention (good for global patterns)
Image  PatchEmbed  [Conv×3]  [Conv×3]  [Attn×9]  [Attn×3]  Pool  Head
                   Stage 1    Stage 2    Stage 3    Stage 4
                   dim=64     dim=128    dim=320    dim=512

Token Mixers

  • :pooling — Average pooling (PoolFormer)
  • :conv — Depthwise separable convolution
  • :attention — Standard self-attention
  • Custom function — Any (Axon.t(), keyword()) -> Axon.t()
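Per the last bullet, any function of type `(Axon.t(), keyword()) -> Axon.t()` can serve as the mixer. A hedged sketch of passing a custom depthwise-convolution mixer, assuming `build_metaformer/1` invokes the function with the current node and its options:

```elixir
# Hypothetical custom token mixer: a 3x3 depthwise convolution.
# Assumes build_metaformer/1 calls it as mixer.(x, opts).
dwconv_mixer = fn x, _opts ->
  Axon.depthwise_conv(x, 1, kernel_size: 3, padding: :same)
end

model =
  MetaFormer.build_metaformer(
    image_size: 224,
    patch_size: 4,
    depths: [3, 3, 9, 3],
    dims: [64, 128, 320, 512],
    token_mixer: dwconv_mixer
  )
```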

Usage

# Generic MetaFormer with any mixer
model = MetaFormer.build_metaformer(
  image_size: 224,
  patch_size: 4,
  depths: [3, 3, 9, 3],
  dims: [64, 128, 320, 512],
  token_mixer: :attention
)

# CAFormer: conv stages then attention stages
model = MetaFormer.build_caformer(
  image_size: 224,
  patch_size: 4,
  depths: [3, 3, 9, 3],
  dims: [64, 128, 320, 512]
)
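To sanity-check shapes, the built model can be initialized and run like any Axon graph. A sketch under assumptions: the input layout `{batch, height, width, channels}` and single unnamed input are guesses; check the module's patch-embed layer for the actual input spec.

```elixir
# Build, initialize, and run (input shape/layout is an assumption).
{init_fn, predict_fn} = Axon.build(model)

input = Nx.broadcast(0.0, {1, 224, 224, 3})
params = init_fn.(Nx.to_template(input), %{})

output = predict_fn.(params, input)
# Without :num_classes, the output should have shape {1, 512}
# (batch by last stage dim).
```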

References

  • "MetaFormer is Actually What You Need for Vision" (Yu et al., CVPR 2022)
  • "MetaFormer Baselines for Vision" (Yu et al., TPAMI 2023)
  • https://arxiv.org/abs/2210.13452

Summary

Functions

build(opts \\ [])
Build via Edifice.build/2. Dispatches to build_metaformer/1 or build_caformer/1 based on :variant option.

build_caformer(opts \\ [])
Build a CAFormer model (Conv stages + Attention stages).

build_metaformer(opts \\ [])
Build a MetaFormer model with a configurable token mixer.

output_size(opts \\ [])
Get the output size of a MetaFormer model.

Types

caformer_opt()

@type caformer_opt() ::
  {:depths, [pos_integer()]}
  | {:dims, [pos_integer()]}
  | {:image_size, pos_integer()}
  | {:in_channels, pos_integer()}
  | {:num_classes, pos_integer() | nil}
  | {:patch_size, pos_integer()}

Options for build_caformer/1.

metaformer_opt()

@type metaformer_opt() ::
  {:depths, [pos_integer()]}
  | {:dims, [pos_integer()]}
  | {:image_size, pos_integer()}
  | {:in_channels, pos_integer()}
  | {:num_classes, pos_integer() | nil}
  | {:patch_size, pos_integer()}
  | {:pool_size, pos_integer()}
  | {:token_mixer, atom()}

Options for build_metaformer/1.

Functions

build(opts \\ [])

@spec build(keyword()) :: Axon.t()

Build via Edifice.build/2. Dispatches to build_metaformer/1 or build_caformer/1 based on :variant option.

build_caformer(opts \\ [])

@spec build_caformer([caformer_opt()]) :: Axon.t()

Build a CAFormer model (Conv stages + Attention stages).

CAFormer uses depthwise separable convolution for the first two stages (local patterns) and self-attention for the last two stages (global patterns).

Options

  • :image_size - Input image size, square (default: 224)
  • :patch_size - Initial patch size (default: 4)
  • :in_channels - Number of input channels (default: 3)
  • :depths - Number of blocks per stage (default: [3, 3, 9, 3])
  • :dims - Hidden dimension per stage (default: [64, 128, 320, 512])
  • :num_classes - Number of output classes (optional)

Returns

An Axon model. Without :num_classes, outputs [batch, last_dim]. With :num_classes, outputs [batch, num_classes].

build_metaformer(opts \\ [])

@spec build_metaformer([metaformer_opt()]) :: Axon.t()

Build a MetaFormer model with a configurable token mixer.

Options

  • :image_size - Input image size, square (default: 224)
  • :patch_size - Initial patch size (default: 4)
  • :in_channels - Number of input channels (default: 3)
  • :depths - Number of blocks per stage (default: [3, 3, 9, 3])
  • :dims - Hidden dimension per stage (default: [64, 128, 320, 512])
  • :token_mixer - Token mixer type: :pooling, :conv, :attention (default: :pooling)
  • :pool_size - Pooling kernel size when mixer is :pooling (default: 3)
  • :num_classes - Number of output classes (optional)

Returns

An Axon model. Without :num_classes, outputs [batch, last_dim]. With :num_classes, outputs [batch, num_classes].

output_size(opts \\ [])

@spec output_size(keyword()) :: pos_integer()

Get the output size of a MetaFormer model.
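`output_size/1` is useful for wiring a MetaFormer backbone into a larger model, e.g. sizing a downstream head. A hedged sketch, assuming it returns the last entry of `:dims` when `:num_classes` is unset:

```elixir
# Presumably the feature width of the backbone's output (512 here).
feat = MetaFormer.output_size(dims: [64, 128, 320, 512])

# Hypothetical downstream head sized from the backbone output.
head =
  Axon.input("features", shape: {nil, feat})
  |> Axon.dense(10)
```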