PoolFormer: MetaFormer with average pooling as token mixer (Yu et al., 2022).
Demonstrates that the general MetaFormer architecture (norm -> token_mixer -> residual -> norm -> FFN -> residual) matters more than the choice of token mixer. PoolFormer replaces self-attention with simple average pooling, which has no learnable parameters, and still achieves competitive performance at much lower computational cost.
Architecture
Image [batch, channels, height, width]
      |
+-----v---------------------+
| Patch Embedding           |  Split into P x P patches, linear project
+---------------------------+
      |
      v
[batch, num_patches, hidden_size]
      |
+-----v---------------------+
| PoolFormer Block x N      |
|                           |
|   Token Mixing:           |
|     LN -> AvgPool - x     |
|     + Residual            |
|                           |
|   Channel Mixing:         |
|     LN -> Dense(4*h)      |
|     -> GELU               |
|     -> Dense(h)           |
|     + Residual            |
+---------------------------+
      |
      v
+---------------------------+
| LayerNorm -> Mean Pool    |
+---------------------------+
      |
      v
[batch, hidden_size]
Key Insight
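The pooling token mixer computes AvgPool(x) - x: it subtracts the input from its average-pooled version. The subtraction offsets the residual connection that adds the input back, so the net effect of the block is local context aggregation over neighboring tokens. Pooling has no learnable parameters and its cost grows linearly with the number of tokens, versus quadratically for self-attention, while accuracy remains competitive.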
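As a rough illustration of the subtraction described above, the following sketch applies the rule to a one-dimensional list of scalar tokens in plain Elixir. The module name `PoolMixerSketch`, its functions, and the edge handling (border windows average only the neighbors that exist) are assumptions made for this example. The actual model pools tensors inside an Axon graph, and normalization is left out here.

```elixir
defmodule PoolMixerSketch do
  @moduledoc false

  # Hypothetical helper: average each token with its neighbours inside a
  # window of `pool_size`, then subtract the token itself.
  # Tokens are plain floats; in a real model each token is a vector and the
  # same rule would apply per channel.
  def mix(tokens, pool_size \\ 3) when rem(pool_size, 2) == 1 do
    radius = div(pool_size, 2)
    last = length(tokens) - 1

    tokens
    |> Enum.with_index()
    |> Enum.map(fn {x, i} ->
      # Clamp the window to the sequence bounds (assumed edge handling).
      lo = max(i - radius, 0)
      hi = min(i + radius, last)
      window = Enum.slice(tokens, lo..hi)
      Enum.sum(window) / length(window) - x
    end)
  end

  # Adding the residual back: x + (avg_pool(x) - x) = avg_pool(x)
  # when normalization is ignored.
  def mix_with_residual(tokens, pool_size \\ 3) do
    tokens
    |> mix(pool_size)
    |> Enum.zip(tokens)
    |> Enum.map(fn {m, x} -> x + m end)
  end
end
```

For the input `[1.0, 2.0, 3.0]` with the default window of 3, `mix/2` returns `[0.5, 0.0, -0.5]`, and `mix_with_residual/2` returns the plain local averages `[1.5, 2.0, 2.5]`.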
Usage
model = PoolFormer.build(
  image_size: 224,
  patch_size: 16,
  hidden_size: 256,
  num_layers: 4,
  num_classes: 1000
)
References
- Yu et al., "MetaFormer is Actually What You Need for Vision" (CVPR 2022)
- https://arxiv.org/abs/2111.11418
Summary
Types
@type build_opt() ::
        {:hidden_size, pos_integer()}
        | {:image_size, pos_integer()}
        | {:in_channels, pos_integer()}
        | {:num_classes, pos_integer() | nil}
        | {:num_layers, pos_integer()}
        | {:patch_size, pos_integer()}
        | {:pool_size, pos_integer()}
Options for build/1.
Functions
Build a PoolFormer model.
Options
- :image_size - Input image size, square (default: 224)
- :patch_size - Patch size, square (default: 16)
- :in_channels - Number of input channels (default: 3)
- :hidden_size - Hidden dimension per patch (default: 256)
- :num_layers - Number of PoolFormer blocks (default: 4)
- :pool_size - Pooling kernel size for token mixer (default: 3)
- :num_classes - Number of output classes (optional)
Returns
An Axon model. Without :num_classes, outputs [batch, hidden_size].
With :num_classes, outputs [batch, num_classes].
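To make the shapes concrete, the sketch below traces the documented tensor shapes from the options, without building any model. `ShapeTraceSketch` and `trace/2` are hypothetical names introduced for illustration. The defaults are taken from the option list above, and the code assumes that :image_size is an exact multiple of :patch_size.

```elixir
defmodule ShapeTraceSketch do
  @moduledoc false

  # Defaults copied from the documented options.
  @defaults [
    image_size: 224,
    patch_size: 16,
    in_channels: 3,
    hidden_size: 256,
    num_classes: nil
  ]

  # Returns the expected shape at each stage for a given batch size.
  def trace(batch, opts \\ []) do
    opts = Keyword.merge(@defaults, opts)

    # A square image split into P x P patches gives (image_size / P)^2 patches.
    side = div(opts[:image_size], opts[:patch_size])
    num_patches = side * side
    hidden = opts[:hidden_size]

    [
      input: {batch, opts[:in_channels], opts[:image_size], opts[:image_size]},
      patches: {batch, num_patches, hidden},
      blocks: {batch, num_patches, hidden},
      pooled: {batch, hidden},
      # With :num_classes set, the final shape uses it instead of hidden_size.
      output: {batch, opts[:num_classes] || hidden}
    ]
  end
end
```

With the defaults, a 224 x 224 image and 16 x 16 patches give 14 * 14 = 196 patches, so `ShapeTraceSketch.trace(2)[:patches]` is `{2, 196, 256}`. Passing `num_classes: 1000` changes only the final entry, to `{2, 1000}`.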
@spec output_size(keyword()) :: pos_integer()
Get the output size of a PoolFormer model.
Returns :num_classes if set, otherwise :hidden_size.
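The fallback rule can be sketched in a few lines. This is an illustration of the documented behavior under stated assumptions, not the library's source: `OutputSizeSketch` is a hypothetical module, and the default of 256 for :hidden_size is taken from the option list for build/1.

```elixir
defmodule OutputSizeSketch do
  @moduledoc false

  # Assumed default, matching the documented :hidden_size default.
  @default_hidden_size 256

  # Prefer :num_classes when it is set (non-nil); otherwise fall back to
  # :hidden_size, or to the assumed default if that is missing too.
  def output_size(opts \\ []) do
    Keyword.get(opts, :num_classes) ||
      Keyword.get(opts, :hidden_size, @default_hidden_size)
  end
end
```

For example, `OutputSizeSketch.output_size(num_classes: 1000)` returns `1000`, while `OutputSizeSketch.output_size(hidden_size: 128)` returns `128`.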