Mixture of Depths: per-token routing where only top-C% tokens are processed.
Implements the Mixture-of-Depths approach from "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" (Raposo et al., 2024). A learned router scores each token, and only the top-scoring fraction C (the capacity ratio) is processed through the full transformer block; the remaining tokens bypass the block through the residual connection.
Architecture
Input [batch, seq, hidden]
      |
      v
+-----------------------------+
| Per Layer:                  |
|   Router: dense -> sigmoid  |
|     -> soft gate per token  |
|   Transformer block on all  |
|   output = gate*block +     |
|            (1-gate)*input   |
+-----------------------------+
      | (repeat num_layers)
      v
Final Norm -> Last Timestep
Output [batch, hidden_size]

How It Works
For each layer, a router network produces a per-token score in [0, 1]. A top-C selection mechanism identifies which tokens should receive full processing. In this Axon-compatible implementation, all tokens pass through the transformer block, but the router gate controls how much of the block output vs. the residual input each token uses:
output_t = gate_t * block(input_t) + (1 - gate_t) * input_t

Tokens with low router scores effectively skip the block via residual.
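The gating formula above can be illustrated with a minimal sketch in plain Elixir lists rather than Axon or Nx tensors. The module name `GateSketch` and its functions are hypothetical and are not part of this library; the sketch only shows how a sigmoid score in (0, 1) interpolates between the block output and the residual input for a single token.

```elixir
defmodule GateSketch do
  # Logistic sigmoid: maps a raw router logit to a score in (0, 1).
  def sigmoid(x), do: 1.0 / (1.0 + :math.exp(-x))

  # output = gate * block_out + (1 - gate) * input,
  # applied element-wise over one token's hidden vector.
  def blend(gate, block_out, input) do
    Enum.zip_with(block_out, input, fn b, x -> gate * b + (1.0 - gate) * x end)
  end
end

# Gate 0.0 returns the residual input unchanged; gate 1.0 returns the block output.
IO.inspect(GateSketch.blend(0.0, [5.0, 5.0], [1.0, 2.0]))
IO.inspect(GateSketch.blend(1.0, [5.0, 5.0], [1.0, 2.0]))

# A router logit of 0.0 gives a gate of 0.5, an even mix of both paths.
IO.inspect(GateSketch.sigmoid(0.0))
```

Because the gate is continuous, every token still pays the cost of the block; a low gate only reduces how much of the block output reaches the next layer.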
Usage
model = MixtureOfDepths.build(
  embed_dim: 287,
  hidden_size: 256,
  num_heads: 4,
  capacity_ratio: 0.5,
  num_layers: 4
)

References
- Raposo et al., "Mixture-of-Depths" (2024)
- https://arxiv.org/abs/2404.02258
Summary
Functions
build/1 - Build a Mixture of Depths model.
output_size/1 - Get the output size of a MixtureOfDepths model.
recommended_defaults/0 - Get recommended defaults for MixtureOfDepths.
Types
@type build_opt() :: {:capacity_ratio, float()} | {:dropout, float()} | {:embed_dim, pos_integer()} | {:hidden_size, pos_integer()} | {:num_heads, pos_integer()} | {:num_layers, pos_integer()} | {:window_size, pos_integer()}
Options for build/1.
Functions
Build a Mixture of Depths model.
Options
- :embed_dim - Input embedding dimension (required)
- :hidden_size - Internal hidden dimension (default: 256)
- :num_heads - Number of attention heads (default: 4)
- :capacity_ratio - Fraction of tokens to process (default: 0.5)
- :num_layers - Number of transformer layers (default: 4)
- :dropout - Dropout rate (default: 0.1)
- :window_size - Sequence length (default: 60)
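To make :capacity_ratio concrete, the following sketch shows how a ratio could translate into a per-sequence token budget and a hard top-C mask. It uses plain Elixir lists; the module `CapacitySketch`, its function names, and the choice to round the budget up are assumptions for illustration, and this page does not state how the library computes or applies the selection internally.

```elixir
defmodule CapacitySketch do
  # Tokens selected per sequence, assuming the budget is
  # capacity_ratio * seq_len rounded up (the rounding mode is an assumption).
  def token_budget(seq_len, capacity_ratio), do: ceil(capacity_ratio * seq_len)

  # 1.0 for the k highest-scoring tokens, 0.0 for the rest.
  # The sort is stable, so ties go to the earlier position.
  def top_c_mask(scores, capacity_ratio) do
    k = token_budget(length(scores), capacity_ratio)

    keep =
      scores
      |> Enum.with_index()
      |> Enum.sort_by(fn {score, _idx} -> score end, :desc)
      |> Enum.take(k)
      |> MapSet.new(fn {_score, idx} -> idx end)

    scores
    |> Enum.with_index()
    |> Enum.map(fn {_score, idx} -> if MapSet.member?(keep, idx), do: 1.0, else: 0.0 end)
  end
end

# With the documented defaults (window_size 60, capacity_ratio 0.5): 30 tokens.
IO.inspect(CapacitySketch.token_budget(60, 0.5))

# Four tokens at ratio 0.5: the two highest scores are kept.
IO.inspect(CapacitySketch.top_c_mask([0.9, 0.1, 0.7, 0.3], 0.5))
```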
Returns
An Axon model outputting [batch, hidden_size].
@spec output_size(keyword()) :: pos_integer()
Get the output size of a MixtureOfDepths model.
@spec recommended_defaults() :: keyword()
Get recommended defaults for MixtureOfDepths.
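As a rough sketch of how the defaults and the output size might be combined, the snippet below merges caller overrides into a keyword list. The literal `defaults` list is a stand-in copied from the option defaults documented for build/1; the actual return value of recommended_defaults/0 is not shown on this page, and reading the output size from :hidden_size is an assumption based on the documented output shape [batch, hidden_size].

```elixir
# Assumed stand-in for recommended_defaults/0, taken from the documented defaults.
defaults = [
  hidden_size: 256,
  num_heads: 4,
  capacity_ratio: 0.5,
  num_layers: 4,
  dropout: 0.1,
  window_size: 60
]

# Caller-supplied values win; :embed_dim has no default and must be given.
opts = Keyword.merge(defaults, embed_dim: 287, num_layers: 6)

# Assumption: the model's final feature dimension equals :hidden_size.
expected_output_size = Keyword.get(opts, :hidden_size, 256)

IO.inspect(opts[:num_layers])
IO.inspect(expected_output_size)
```

In real use, the merged `opts` would be passed to build/1 in place of a hand-written keyword list.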