Edifice.Meta.MixtureOfDepths (Edifice v0.2.0)

Mixture of Depths: per-token routing where only the top-C% of tokens are fully processed by each block.

Implements the Mixture-of-Depths approach from "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" (Raposo et al., 2024). A learned router scores each token, and only the top-C% of tokens (C is set by the capacity ratio) are processed through the full transformer block; the remaining tokens bypass the block through the residual connection.
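
The top-C% selection described above can be sketched in plain Elixir lists. This is only an illustration of the routing idea, not this module's code: the module name TopCapacitySketch, the function names, and the choice to round the capacity and keep at least one token are all assumptions made for the example.

```elixir
defmodule TopCapacitySketch do
  # Number of tokens that receive full processing for a given sequence length.
  # Rounding to the nearest integer and the minimum of 1 are assumptions of
  # this sketch, not behavior documented for Edifice.
  def capacity(seq_len, ratio), do: max(1, round(seq_len * ratio))

  # Returns one boolean per token: true if the token's router score is among
  # the top-k scores, where k = capacity(length(scores), ratio).
  def select(scores, ratio) do
    k = capacity(length(scores), ratio)

    chosen =
      scores
      |> Enum.with_index()
      |> Enum.sort_by(fn {score, _idx} -> score end, :desc)
      |> Enum.take(k)
      |> MapSet.new(fn {_score, idx} -> idx end)

    scores
    |> Enum.with_index()
    |> Enum.map(fn {_score, idx} -> MapSet.member?(chosen, idx) end)
  end
end
```

With four tokens and a ratio of 0.5, the two highest-scoring tokens are marked for processing and the other two would skip the block.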

Architecture

Input [batch, seq, hidden]
      |
      v
+-----------------------------+
| Per Layer:                  |
|   Router: dense -> sigmoid  |
|   -> soft gate per token    |
|   Transformer block on all  |
|   output = gate*block +     |
|            (1-gate)*input   |
+-----------------------------+
      | (repeat num_layers)
      v
Final Norm -> Last Timestep
Output [batch, hidden_size]
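
The data flow in the diagram can be traced with a small list-based sketch for a single sequence (the batch dimension and the final normalization are left out). The names LayerStackSketch, gated_layer, and forward, and the router and block arguments, are hypothetical stand-ins and not part of the Edifice API.

```elixir
defmodule LayerStackSketch do
  # One gated layer over a sequence of token vectors (a list of lists).
  # `router` maps a token vector to a gate value; `block` maps the whole
  # sequence to a sequence of the same shape. Both are stand-in functions.
  def gated_layer(tokens, router, block) do
    processed = block.(tokens)

    Enum.zip_with(tokens, processed, fn x, y ->
      g = router.(x)
      Enum.zip_with(x, y, fn xi, yi -> g * yi + (1.0 - g) * xi end)
    end)
  end

  # Apply `num_layers` gated layers in sequence, then keep only the last
  # timestep, turning [seq, hidden] into [hidden].
  # The `//1` step keeps the range empty when num_layers is 0.
  def forward(tokens, router, block, num_layers) do
    1..num_layers//1
    |> Enum.reduce(tokens, fn _layer, acc -> gated_layer(acc, router, block) end)
    |> List.last()
  end
end
```

The sketch only shows the shape change from a [seq, hidden] sequence to a single [hidden] vector; a real model would use learned router and block functions in place of the stand-ins.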

How It Works

For each layer, a router network (a dense layer followed by a sigmoid) produces a per-token score between 0 and 1. A top-C selection mechanism identifies which tokens should receive full processing. In this Axon-compatible implementation, however, all tokens pass through the transformer block, and the router gate controls how much of the block output versus the residual input each token uses:

output_t = gate_t * block(input_t) + (1 - gate_t) * input_t

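Tokens with low router scores (gate_t close to 0) keep almost all of their input and so effectively skip the block via the residual path; tokens with high scores (gate_t close to 1) take almost all of the block output.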
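
The gating formula can be checked by hand with a minimal sketch over plain lists. It assumes the block outputs are already computed and uses hypothetical names (SoftGateSketch, sigmoid, blend_token, blend); it is not how the module is implemented in Axon.

```elixir
defmodule SoftGateSketch do
  # Logistic sigmoid: maps a router logit to a gate value between 0 and 1.
  def sigmoid(x), do: 1.0 / (1.0 + :math.exp(-x))

  # Blend one token element-wise: gate * block_out + (1 - gate) * input.
  def blend_token(gate, block_out, input) do
    Enum.zip_with(block_out, input, fn b, x -> gate * b + (1.0 - gate) * x end)
  end

  # Blend a whole sequence; each argument is a list with one entry per token.
  def blend(gates, block_outs, inputs) do
    Enum.zip_with([gates, block_outs, inputs], fn [g, b, x] ->
      blend_token(g, b, x)
    end)
  end
end
```

A gate of 0.0 returns the input unchanged, a gate of 1.0 returns the block output, and a gate of 0.5 returns their midpoint, which is the soft version of the skip-or-process decision.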

Usage

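alias Edifice.Meta.MixtureOfDepths

model = MixtureOfDepths.build(
  embed_dim: 287,
  hidden_size: 256,
  num_heads: 4,
  capacity_ratio: 0.5,  # C: fraction of tokens targeted for full processing
  num_layers: 4         # number of routed layers stacked
)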

References

Raposo et al., "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" (2024)
https://arxiv.org/abs/2404.02258

Summary

Types

build_opt()

Options for build/1.

Functions

build(opts \\ [])

Build a Mixture of Depths model.

output_size(opts \\ [])

Get the output size of a MixtureOfDepths model.

recommended_defaults()

Get recommended defaults for MixtureOfDepths.