FocalNet: Focal Modulation Networks for vision (Yang et al., 2022).
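Replaces self-attention with focal modulation, which aggregates context at multiple levels of granularity using a stack of depthwise convolutions (each level sees a larger neighborhood than the one before) followed by gated aggregation. Because it relies on convolutions rather than pairwise token interactions, its cost grows linearly with the number of patches while still capturing both local and global context.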
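To make the mechanism concrete, here is a toy sketch in plain Elixir (no Nx) of the simplified recipe shown in the Architecture diagram, applied to a single-channel 1-D sequence. It is not this module's implementation. Several assumptions are not taken from the source: the scalar weights `w_q` and `w_g` stand in for the Dense layers, fixed box (averaging) kernels stand in for learned depthwise convolutions, and level `l` uses kernel size `focal_kernel + 2 * l`, following the reference FocalNet code. Each level convolves the previous level's output, which is what makes the context hierarchical.

```elixir
defmodule FocalModulationSketch do
  @moduledoc false
  # Hypothetical toy: single channel, 1-D, plain lists. Scalar weights stand
  # in for Dense layers; box kernels stand in for learned depthwise convs.

  # GELU, tanh approximation (:math provides tanh, so no Nx is needed).
  def gelu(x) do
    0.5 * x * (1.0 + :math.tanh(:math.sqrt(2.0 / :math.pi()) * (x + 0.044715 * x * x * x)))
  end

  def sigmoid(x), do: 1.0 / (1.0 + :math.exp(-x))

  # "Same"-length 1-D cross-correlation with zero padding (odd kernel length).
  def conv1d(xs, kernel) do
    n = length(xs)
    half = div(length(kernel), 2)
    t = List.to_tuple(xs)

    for i <- 0..(n - 1)//1 do
      kernel
      |> Enum.with_index()
      |> Enum.reduce(0.0, fn {w, j}, acc ->
        pos = i + j - half
        if pos >= 0 and pos < n, do: acc + w * elem(t, pos), else: acc
      end)
    end
  end

  # Assumed kernel schedule: level l uses focal_kernel + 2 * l.
  def focal_modulation(xs, opts \\ []) do
    levels = Keyword.get(opts, :focal_levels, 3)
    base_kernel = Keyword.get(opts, :focal_kernel, 3)
    w_q = Keyword.get(opts, :w_q, 1.0)
    w_g = Keyword.get(opts, :w_g, 1.0)

    # q = Dense(x) and gate = sigmoid(Dense(x)), with scalar "Dense" weights
    q = Enum.map(xs, &(&1 * w_q))
    gate = Enum.map(xs, &sigmoid(&1 * w_g))

    # Hierarchical context: each level convolves the previous level's output,
    # so the receptive field grows; ctx accumulates gelu(conv_l(...)).
    zeros = List.duplicate(0.0, length(xs))

    {_last, ctx} =
      Enum.reduce(0..(levels - 1)//1, {xs, zeros}, fn level, {h, ctx} ->
        k = base_kernel + 2 * level
        h = h |> conv1d(List.duplicate(1.0 / k, k)) |> Enum.map(&gelu/1)
        {h, Enum.zip_with(ctx, h, &+/2)}
      end)

    # out = q * gate * ctx, elementwise
    Enum.zip_with([q, gate, ctx], fn [a, b, c] -> a * b * c end)
  end
end
```

Under the assumed schedule, the defaults (3 levels, base kernel 3) stack kernels of size 3, 5 and 7, so the last level has a receptive field of 1 + 2 + 4 + 6 = 13 positions. This is how later levels see wider context than earlier ones without any attention matrix.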
Architecture
```
Image [batch, channels, height, width]
      |
+-----v------------------------+
| Patch Embedding              |  Split into P x P patches, linear projection
+------------------------------+
      |
      v
[batch, num_patches, hidden_size]
      |
+-----v------------------------+
| FocalNet Block x N           |
|                              |
|  Focal Modulation:           |
|    q = Dense(x)              |
|    For each level l:         |
|      ctx += gelu(conv_l)     |
|    gate = sigmoid(Dense(x))  |
|    out = q * gate * ctx      |
|  + Residual                  |
|                              |
|  FFN:                        |
|    Dense(4*h) -> GELU        |
|    -> Dense(h)               |
|  + Residual                  |
+------------------------------+
      |
      v
+------------------------------+
| LayerNorm -> Mean Pool       |
+------------------------------+
      |
      v
[batch, hidden_size]
```

Usage
```elixir
model = FocalNet.build(
  image_size: 224,
  patch_size: 16,
  hidden_size: 256,
  num_layers: 4,
  focal_levels: 3,
  num_classes: 1000
)
```

References
- Yang et al., "Focal Modulation Networks" (NeurIPS 2022)
- https://arxiv.org/abs/2203.11926
Summary
Types
```elixir
@type build_opt() ::
        {:focal_kernel, pos_integer()}
        | {:focal_levels, pos_integer()}
        | {:hidden_size, pos_integer()}
        | {:image_size, pos_integer()}
        | {:in_channels, pos_integer()}
        | {:num_classes, pos_integer() | nil}
        | {:num_layers, pos_integer()}
        | {:patch_size, pos_integer()}
```
Options for build/1.
Functions
Build a FocalNet model.
Options
- :image_size - Input image size, square (default: 224)
- :patch_size - Patch size, square (default: 16)
- :in_channels - Number of input channels (default: 3)
- :hidden_size - Hidden dimension per patch (default: 256)
- :num_layers - Number of FocalNet blocks (default: 4)
- :focal_levels - Number of focal context levels (default: 3)
- :focal_kernel - Base kernel size for focal convolutions (default: 3)
- :num_classes - Number of output classes (optional)
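A minimal sketch of how these options and their defaults could be resolved, using Elixir's standard `Keyword.validate!/2`. The module name `FocalNetOptsSketch` is hypothetical, and the divisibility check is an assumption (the source only says the image is split into P x P patches); neither is the module's actual source. The derived `:num_patches` is `(image_size / patch_size)^2`, which is 196 for the defaults.

```elixir
defmodule FocalNetOptsSketch do
  @moduledoc false
  # Hypothetical option handling for build/1. Defaults are copied from the
  # documented option list; this is not the library's implementation.
  @defaults [
    image_size: 224,
    patch_size: 16,
    in_channels: 3,
    hidden_size: 256,
    num_layers: 4,
    focal_levels: 3,
    focal_kernel: 3,
    num_classes: nil
  ]

  def resolve(opts \\ []) do
    # Fills in defaults; raises ArgumentError on any key not listed above.
    opts = Keyword.validate!(opts, @defaults)
    image_size = Keyword.fetch!(opts, :image_size)
    patch_size = Keyword.fetch!(opts, :patch_size)

    # Assumed constraint: P x P patches must tile the square image exactly.
    if rem(image_size, patch_size) != 0 do
      raise ArgumentError,
            "image_size #{image_size} is not divisible by patch_size #{patch_size}"
    end

    grid = div(image_size, patch_size)
    Keyword.put(opts, :num_patches, grid * grid)
  end
end
```

For example, `resolve(image_size: 32, patch_size: 4)` would give an 8 x 8 grid, i.e. 64 patches.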
Returns
An Axon model. Without :num_classes, outputs [batch, hidden_size].
With :num_classes, outputs [batch, num_classes].
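The shape contract can be traced stage by stage through the Architecture diagram. The helper below is a hypothetical sketch, not part of the module. It assumes `image_size` is divisible by `patch_size` and that each FocalNet block preserves the token shape, which its residual connections require. With the Usage settings and a batch of 8, the tokens are `{8, 196, 256}`, each FFN widens them to `{8, 196, 1024}` internally, and the classifier output is `{8, 1000}`.

```elixir
defmodule FocalNetShapeSketch do
  @moduledoc false
  # Hypothetical helper: traces tensor shapes through the stages of the
  # Architecture diagram. Assumes image_size is divisible by patch_size.
  def trace(batch, opts \\ []) do
    image_size = Keyword.get(opts, :image_size, 224)
    patch_size = Keyword.get(opts, :patch_size, 16)
    in_channels = Keyword.get(opts, :in_channels, 3)
    hidden = Keyword.get(opts, :hidden_size, 256)
    num_classes = Keyword.get(opts, :num_classes)

    grid = div(image_size, patch_size)
    tokens = {batch, grid * grid, hidden}
    pooled = {batch, hidden}

    [
      input: {batch, in_channels, image_size, image_size},
      patch_embedding: tokens,
      # Inside each FFN: Dense(4*h) widens the tokens, Dense(h) narrows them back
      ffn_hidden: {batch, grid * grid, 4 * hidden},
      # Residual blocks keep the token shape unchanged
      blocks: tokens,
      mean_pool: pooled,
      output: if(num_classes, do: {batch, num_classes}, else: pooled)
    ]
  end
end
```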
@spec output_size(keyword()) :: pos_integer()
Get the output size of a FocalNet model.
Returns :num_classes if set, otherwise :hidden_size.