EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction.
Implements EfficientViT from "EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction" (Cai et al., 2023). Achieves O(n) attention complexity instead of O(n²) by using linear attention, combined with cascaded group attention across heads.
Key Innovations
- Linear attention: Uses the kernel trick to avoid materializing the full attention matrix; Q and K are passed through a feature map φ so that φ(K)^T × V can be computed first, giving O(n) complexity.
- Cascaded group attention (CGA): Different heads see different channel splits of the input, enforcing head diversity and reducing redundancy.
- Multi-scale: Progressive downsampling stages, each with its own dimension.
- Depthwise conv in FFN: Adds local context between the FFN's linear layers (sketched right after this list).
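A minimal Axon sketch of that FFN pattern (the module and function names, the GELU activation, and the channels-last layout are illustrative assumptions, not this library's internals):

defmodule DWConvFFNSketch do
  # Hypothetical block, not the module's API. Two pointwise (1x1) convs play the
  # role of the FFN's linear layers; a depthwise 3x3 conv between them adds
  # local context. Operates on channels-last maps: {batch, height, width, channels}.
  def block(input, channels, mlp_ratio \\ 4.0) do
    hidden = trunc(channels * mlp_ratio)

    input
    |> Axon.conv(hidden, kernel_size: 1)
    |> Axon.conv(hidden,
      kernel_size: 3,
      padding: :same,
      # feature_group_size == channel count makes this a depthwise conv
      feature_group_size: hidden
    )
    |> Axon.activation(:gelu)
    |> Axon.conv(channels, kernel_size: 1)
  end
end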
Architecture
Image [batch, channels, height, width]
|
v
+--------------------------+
| Patch Embedding |
+--------------------------+
|
v
+==========================+
| Stage 1 (depth[0] blocks) |
| CGA Linear Attention |
| DW-Conv FFN |
+==========================+
| (downsample)
v
+==========================+
| Stage 2 (depth[1] blocks) |
| CGA Linear Attention |
| DW-Conv FFN |
+==========================+
| (downsample)
v
+==========================+
| Stage 3 (depth[2] blocks) |
| CGA Linear Attention |
| DW-Conv FFN |
+==========================+
|
v
+--------------------------+
| LayerNorm + Global Pool |
+--------------------------+
|
v
[batch, last_dim]

Cascaded Group Attention
Input: [batch, seq, dim]
|
Split into num_heads groups along dim
|
Head 0: [batch, seq, dim/heads]        → Q₀, K₀, V₀ → LinearAttn → out₀
Head 1: [batch, seq, dim/heads] + out₀ → Q₁, K₁, V₁ → LinearAttn → out₁
Head 2: [batch, seq, dim/heads] + out₁ → Q₂, K₂, V₂ → LinearAttn → out₂
...
|
Concatenate all head outputs
|
Output projection

Each head sees a different channel slice of the input, with the previous head's output folded into the next head's input. This forces diverse attention patterns across heads and reduces redundancy.
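Below is a rough Nx sketch of the cascade (the module name, the identity Q/K/V projections, and the omitted output projection and normalization are all simplifications for illustration, not this module's internals):

defmodule CGASketch do
  import Nx.Defn

  # ELU + 1 keeps features positive so the kernel trick stays valid
  defnp phi(x), do: Nx.select(x > 0, x + 1, Nx.exp(x))

  # One head of O(n) linear attention: build the d×d matrix φ(K)^T × V first,
  # then multiply by φ(Q); normalization is omitted for brevity.
  defn linear_attn(q, k, v) do
    kv = Nx.dot(phi(k), [1], [0], v, [1], [0])
    Nx.dot(phi(q), [2], [0], kv, [1], [0])
  end

  # Cascaded group attention over x: [batch, seq, dim] with `heads` channel groups.
  # Illustrative only: identity Q/K/V projections and no output projection.
  def cga(x, heads) do
    {_batch, _seq, dim} = Nx.shape(x)
    head_dim = div(dim, heads)

    {outs, _last} =
      Enum.map_reduce(0..(heads - 1), nil, fn h, prev ->
        slice = Nx.slice_along_axis(x, h * head_dim, head_dim, axis: 2)
        # cascade: add the previous head's output to this head's input slice
        input = if prev, do: Nx.add(slice, prev), else: slice
        out = linear_attn(input, input, input)
        {out, out}
      end)

    Nx.concatenate(outs, axis: 2)
  end
end

x = Nx.iota({2, 16, 32}, type: :f32)
CGASketch.cga(x, 4) |> Nx.shape()  # => {2, 16, 32}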
Linear Attention
Standard attention: O(n²)
Attn = softmax(QK^T / √d) × V

Linear attention: O(n)

Attn = φ(Q) × (φ(K)^T × V), where φ = ELU + 1

By computing φ(K)^T × V first (a d×d matrix), we avoid the n×n attention matrix entirely.
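Associativity is the whole trick. A small Nx sketch (the tensor shapes and the ELU + 1 feature map are assumptions for illustration) showing that both association orders agree, while only the quadratic one materializes the n×n matrix:

n = 4096
d = 64
key = Nx.Random.key(0)
{q, key} = Nx.Random.normal(key, 0.0, 1.0, shape: {n, d})
{k, key} = Nx.Random.normal(key, 0.0, 1.0, shape: {n, d})
{v, _key} = Nx.Random.normal(key, 0.0, 1.0, shape: {n, d})

# ELU + 1 feature map, written with explicit Nx calls since this runs outside defn
phi = fn x -> Nx.select(Nx.greater(x, 0), Nx.add(x, 1), Nx.exp(x)) end
q = phi.(q)
k = phi.(k)

# quadratic order: (φ(Q) × φ(K)^T) × V materializes an n×n (4096×4096) matrix
quadratic = q |> Nx.dot(Nx.transpose(k)) |> Nx.dot(v)

# linear order: φ(Q) × (φ(K)^T × V) never builds anything larger than d×d (64×64)
linear = Nx.dot(q, Nx.dot(Nx.transpose(k), v))

Nx.all_close(quadratic, linear) |> Nx.to_number()  # => 1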
Usage
model = EfficientViT.build(
image_size: 224,
patch_size: 16,
embed_dim: 64,
depths: [1, 2, 3],
num_heads: [4, 4, 4]
)

References
- Paper: "EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction"
- arXiv: https://arxiv.org/abs/2205.14756
Summary
Functions
build/1 - Build an EfficientViT model with linear attention.
output_size/1 - Get the output size of an EfficientViT model.
Types
@type build_opt() ::
        {:depths, [pos_integer()]}
        | {:embed_dim, pos_integer()}
        | {:image_size, pos_integer()}
        | {:in_channels, pos_integer()}
        | {:mlp_ratio, float()}
        | {:num_classes, pos_integer() | nil}
        | {:num_heads, [pos_integer()]}
        | {:patch_size, pos_integer()}
Options for build/1.
Functions
build/1

Build an EfficientViT model with linear attention.
Options
- :image_size - Input image size, square (default: 224)
- :patch_size - Patch size, square (default: 16)
- :in_channels - Number of input channels (default: 3)
- :embed_dim - Initial embedding dimension (default: 64)
- :depths - Number of blocks per stage (default: [1, 2, 3])
- :num_heads - Number of attention heads per stage (default: [4, 4, 4])
- :mlp_ratio - MLP expansion ratio (default: 4.0)
- :num_classes - Number of output classes (optional)
Returns
An Axon model. Without :num_classes, outputs [batch, last_dim].
With :num_classes, outputs [batch, num_classes].
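A hedged end-to-end example (assuming the model has a single image input in [batch, channels, height, width] layout, so a bare tensor can be passed to the compiled functions):

model =
  EfficientViT.build(
    image_size: 224,
    patch_size: 16,
    embed_dim: 64,
    depths: [1, 2, 3],
    num_heads: [4, 4, 4],
    num_classes: 1000
  )

{init_fn, predict_fn} = Axon.build(model)

image = Nx.broadcast(0.0, {1, 3, 224, 224})
params = init_fn.(image, %{})
predict_fn.(params, image) |> Nx.shape()  # => {1, 1000}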
output_size/1

@spec output_size(keyword()) :: pos_integer()

Get the output size of an EfficientViT model.
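Assuming output_size/1 takes the same options as build/1 (its spec above only says keyword()), it can be used to size a downstream head:

last_dim = EfficientViT.output_size(embed_dim: 64, depths: [1, 2, 3])

# e.g. attach a custom 10-class head on top of the pooled features
Axon.input("features", shape: {nil, last_dim}) |> Axon.dense(10)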