Edifice.Vision.PoolFormer (Edifice v0.2.0)

PoolFormer: MetaFormer with average pooling as token mixer (Yu et al., 2022).

Demonstrates that the general MetaFormer architecture (norm -> token_mixer -> residual -> norm -> FFN -> residual) matters more than the specific token mixer. PoolFormer replaces self-attention with simple average pooling, which has no learnable parameters, and stays competitive with attention-based models at much lower computational cost.

Architecture

Image [batch, channels, height, width]
      |
+-----v---------------------+
| Patch Embedding           |  Split into P x P patches, linear project
+---------------------------+
      |
      v
[batch, num_patches, hidden_size]
      |
+-----v---------------------+
| PoolFormer Block x N      |
|                           |
| Token Mixing:             |
|   LN -> AvgPool - x       |
|   + Residual              |
|                           |
| Channel Mixing:           |
|   LN -> Dense(4*h)        |
|   -> GELU                 |
|   -> Dense(h)             |
|   + Residual              |
+---------------------------+
      |
      v
+---------------------------+
| LayerNorm -> Mean Pool    |
+---------------------------+
      |
      v
[batch, hidden_size]
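
The shapes in the diagram follow from a little arithmetic on the options. The sketch below traces them in plain Elixir. `PoolFormerShapes` is a hypothetical helper, not part of Edifice, and the divisibility check is an assumption about how patches tile the image.

```elixir
defmodule PoolFormerShapes do
  # Hypothetical helper, not part of Edifice: traces the tensor shapes
  # from the architecture diagram for a given option set.
  def trace(opts \\ []) do
    image_size = Keyword.get(opts, :image_size, 224)
    patch_size = Keyword.get(opts, :patch_size, 16)
    in_channels = Keyword.get(opts, :in_channels, 3)
    hidden_size = Keyword.get(opts, :hidden_size, 256)

    # Assumption: the image must tile evenly into P x P patches.
    if rem(image_size, patch_size) != 0 do
      raise ArgumentError, "image_size must be divisible by patch_size"
    end

    patches_per_side = div(image_size, patch_size)
    num_patches = patches_per_side * patches_per_side

    %{
      num_patches: num_patches,
      # Values in one flattened patch before the linear projection.
      patch_dim: patch_size * patch_size * in_channels,
      # Shape after patch embedding, excluding the batch axis.
      tokens: {num_patches, hidden_size},
      # Width of the Dense(4*h) layer in channel mixing.
      ffn_inner: 4 * hidden_size
    }
  end
end

PoolFormerShapes.trace()
```

With the documented defaults this gives 14 x 14 = 196 patches of 768 values each, a token tensor of {196, 256} per image, and an inner FFN width of 1024.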

Key Insight

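The pooling token mixer computes AvgPool(x) - x: each token becomes the mean of its local neighborhood minus the token itself. Because the block's residual connection adds the input back afterwards, the mixer only has to supply the difference from the local average. This gives a simple form of local context aggregation with no learnable parameters and a cost that grows linearly with the number of tokens, rather than quadratically as in self-attention, while maintaining competitive accuracy.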
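
To make the operation concrete, here is a minimal sketch of a pooling token mixer over a 1-D list of scalar tokens. It is illustrative only and is not how Edifice implements the layer: `PoolMixerSketch` is a hypothetical module, and the stride of 1, the odd kernel size, and the edge handling (windows are clipped so padding does not count toward the mean) are all assumptions.

```elixir
defmodule PoolMixerSketch do
  # Illustrative only: a 1-D pooling token mixer over a list of scalar tokens.
  # Assumptions: stride 1, odd pool_size, and windows clipped at the sequence
  # edges so padding does not count toward the mean.
  def mix(tokens, pool_size \\ 3) when rem(pool_size, 2) == 1 do
    radius = div(pool_size, 2)
    last = length(tokens) - 1

    tokens
    |> Enum.with_index()
    |> Enum.map(fn {x, i} ->
      lo = max(i - radius, 0)
      hi = min(i + radius, last)
      window = Enum.slice(tokens, lo..hi)
      # Local mean minus the token itself.
      Enum.sum(window) / length(window) - x
    end)
  end
end

PoolMixerSketch.mix([1.0, 2.0, 6.0])
# => [0.5, 1.0, -2.0]

PoolMixerSketch.mix([5.0, 5.0, 5.0, 5.0])
# => [0.0, 0.0, 0.0, 0.0]
```

A constant sequence yields all zeros, so in that case the token-mixing branch contributes nothing and the residual path passes the input through unchanged.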

Usage

alias Edifice.Vision.PoolFormer

# Build a 4-block PoolFormer classifier for 224 x 224 images
model = PoolFormer.build(
  image_size: 224,
  patch_size: 16,
  hidden_size: 256,
  num_layers: 4,
  num_classes: 1000
)

Types

build_opt()

@type build_opt() ::
  {:hidden_size, pos_integer()}
  | {:image_size, pos_integer()}
  | {:in_channels, pos_integer()}
  | {:num_classes, pos_integer() | nil}
  | {:num_layers, pos_integer()}
  | {:patch_size, pos_integer()}
  | {:pool_size, pos_integer()}

Options for build/1.

Functions

build(opts \\ [])

@spec build([build_opt()]) :: Axon.t()

Build a PoolFormer model.

Options

  • :image_size - Input image size, square (default: 224)
  • :patch_size - Patch size, square (default: 16)
  • :in_channels - Number of input channels (default: 3)
  • :hidden_size - Hidden dimension per patch (default: 256)
  • :num_layers - Number of PoolFormer blocks (default: 4)
  • :pool_size - Pooling kernel size for token mixer (default: 3)
  • :num_classes - Number of output classes (optional; when omitted, the model outputs pooled features of size :hidden_size)
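
As a rough illustration of how the defaults in the option list above combine with caller-supplied options, the sketch below resolves a full option set with `Keyword.validate!/2` (Elixir 1.13 or later). `PoolFormerOptsSketch` is a hypothetical helper and not part of Edifice. The `nil` default for `:num_classes` is an assumption, and `build/1` itself may not reject unknown keys the way this sketch does.

```elixir
defmodule PoolFormerOptsSketch do
  # Hypothetical helper, not part of Edifice. Defaults are copied from
  # the documented option list; :num_classes is assumed to default to nil.
  @defaults [
    image_size: 224,
    patch_size: 16,
    in_channels: 3,
    hidden_size: 256,
    num_layers: 4,
    pool_size: 3,
    num_classes: nil
  ]

  def resolve(opts \\ []) do
    # Keyword.validate!/2 fills in missing defaults and raises
    # ArgumentError on keys that are not listed above.
    opts |> Keyword.validate!(@defaults) |> Map.new()
  end
end

resolved = PoolFormerOptsSketch.resolve(hidden_size: 384)
resolved.hidden_size
# => 384
resolved.pool_size
# => 3
```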

Returns

An Axon model. Without :num_classes, outputs [batch, hidden_size]. With :num_classes, outputs [batch, num_classes].

output_size(opts \\ [])

@spec output_size(keyword()) :: pos_integer()

Get the output size of a PoolFormer model.

Returns :num_classes if set, otherwise :hidden_size.
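
The rule can be restated in a few lines of plain Elixir. This is a sketch of the documented behavior, not Edifice's source: `OutputSizeSketch` is a hypothetical module, and the fallback of 256 is taken from the documented `:hidden_size` default.

```elixir
defmodule OutputSizeSketch do
  # Hypothetical restatement of the documented rule, not Edifice's source.
  def output_size(opts \\ []) do
    # Use :num_classes when it is set and non-nil; otherwise fall back to
    # :hidden_size, whose documented default is 256.
    Keyword.get(opts, :num_classes) || Keyword.get(opts, :hidden_size, 256)
  end
end

OutputSizeSketch.output_size(num_classes: 1000)
# => 1000

OutputSizeSketch.output_size(hidden_size: 128)
# => 128
```

This is useful for sizing a downstream layer without building the model first.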