Edifice.Meta.MixtureOfDepths (Edifice v0.2.0)


Mixture of Depths: per-token routing where only the top-C% of tokens are processed.

Implements the Mixture-of-Depths approach from "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" (Raposo et al., 2024). A learned router scores each token and only the top-C% (by capacity ratio) are processed through the full transformer block; the rest skip via residual.

Architecture

Input [batch, seq, hidden]
      |
      v
+-----------------------------+
| Per Layer:                  |
|   Router: dense -> sigmoid  |
|   -> soft gate per token    |
|   Transformer block on all  |
|   output = gate*block +     |
|            (1-gate)*input   |
+-----------------------------+
      | (repeat num_layers)
      v
Final Norm -> Last Timestep
Output [batch, hidden_size]
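
The data flow in the diagram can be traced with a small plain-Elixir sketch for a single sequence (no batch dimension, no Axon). The module name `DepthStackSketch` and the `router` and `block` functions are hypothetical stand-ins, and the final norm is left out; this is an illustration of the shapes and the layer loop, not Edifice's implementation.

```elixir
defmodule DepthStackSketch do
  # Hypothetical stand-ins: `router` maps a token (list of floats) to a gate
  # in [0, 1]; `block` maps a token to a list of the same length.

  # One layer: every token goes through `block`, then is blended with its
  # own input according to the router gate.
  def layer(tokens, router, block) do
    Enum.map(tokens, fn token ->
      gate = router.(token)

      Enum.zip_with(block.(token), token, fn b, x ->
        gate * b + (1.0 - gate) * x
      end)
    end)
  end

  # Repeat the layer `num_layers` times, then keep only the last timestep,
  # so [seq, hidden] becomes [hidden]. The final norm is omitted here.
  def run(tokens, num_layers, router, block) do
    1..num_layers
    |> Enum.reduce(tokens, fn _i, acc -> layer(acc, router, block) end)
    |> List.last()
  end
end

# Toy sequence of 3 tokens with hidden size 2.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
router = fn _token -> 0.5 end
block = fn token -> Enum.map(token, &(&1 * 2.0)) end

IO.inspect(DepthStackSketch.run(tokens, 2, router, block))
```

With a constant gate of 0.0 the same call returns the last input token unchanged, which is the residual-skip path in the diagram.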

How It Works

For each layer, a router network produces a per-token score in [0, 1]. A top-C selection mechanism identifies which tokens should receive full processing. In this Axon-compatible implementation, all tokens pass through the transformer block, but the router gate controls how much of the block output vs. the residual input each token uses:

output_t = gate_t * block(input_t) + (1 - gate_t) * input_t

Tokens with low router scores effectively skip the block via residual.
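
To make the gating formula concrete, here is a minimal sketch for a single token in plain Elixir. `SoftGateSketch` and the example numbers are made up for illustration, and the block output is a placeholder list rather than the result of a real transformer block.

```elixir
defmodule SoftGateSketch do
  # Logistic sigmoid: maps a raw router logit to a gate in (0, 1).
  def sigmoid(logit), do: 1.0 / (1.0 + :math.exp(-logit))

  # output_t = gate_t * block(input_t) + (1 - gate_t) * input_t,
  # applied element-wise over the hidden dimension of one token.
  def blend(gate, block_out, input) do
    Enum.zip_with(block_out, input, fn b, x -> gate * b + (1.0 - gate) * x end)
  end
end

input = [1.0, 2.0]
# Placeholder for the transformer block output (assumed values).
block_out = [10.0, 20.0]

# Strongly negative logit: gate near 0, output stays close to the input.
low = SoftGateSketch.blend(SoftGateSketch.sigmoid(-6.0), block_out, input)

# Strongly positive logit: gate near 1, output is close to the block output.
high = SoftGateSketch.blend(SoftGateSketch.sigmoid(6.0), block_out, input)

IO.inspect({low, high})
```

Because the blend is continuous in the gate, the router can be trained with ordinary gradients, at the cost of still running the block on every token.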

Usage

model = MixtureOfDepths.build(
  embed_dim: 287,
  hidden_size: 256,
  num_heads: 4,
  capacity_ratio: 0.5,
  num_layers: 4
)

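References

  • Raposo et al., "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" (2024)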

Summary

Types

Options for build/1.

Functions

Build a Mixture of Depths model.

Get the output size of a MixtureOfDepths model.

Get recommended defaults for MixtureOfDepths.

Types

build_opt()

@type build_opt() ::
  {:capacity_ratio, float()}
  | {:dropout, float()}
  | {:embed_dim, pos_integer()}
  | {:hidden_size, pos_integer()}
  | {:num_heads, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:window_size, pos_integer()}

Options for build/1.

Functions

build(opts \\ [])

@spec build([build_opt()]) :: Axon.t()

Build a Mixture of Depths model.

Options

  • :embed_dim - Input embedding dimension (required)
  • :hidden_size - Internal hidden dimension (default: 256)
  • :num_heads - Number of attention heads (default: 4)
  • :capacity_ratio - Fraction of tokens to process (default: 0.5)
  • :num_layers - Number of transformer layers (default: 4)
  • :dropout - Dropout rate (default: 0.1)
  • :window_size - Sequence length (default: 60)
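
As a rough illustration of what `:capacity_ratio` means under the top-C description, the sketch below converts a ratio into a token count and picks the highest-scoring positions. The module `TopCSketch`, the rounding rule (`ceil`), and the ordering of the result are assumptions for illustration only; the documentation does not say how Edifice rounds or selects, and this implementation gates softly rather than dropping tokens.

```elixir
defmodule TopCSketch do
  # Number of tokens that fit in the capacity. Rounding up is an assumption.
  def capacity(seq_len, capacity_ratio), do: ceil(seq_len * capacity_ratio)

  # Indices of the `capacity` highest router scores, returned in sequence order.
  def select(scores, capacity_ratio) do
    k = capacity(length(scores), capacity_ratio)

    scores
    |> Enum.with_index()
    |> Enum.sort_by(fn {score, _idx} -> score end, :desc)
    |> Enum.take(k)
    |> Enum.map(fn {_score, idx} -> idx end)
    |> Enum.sort()
  end
end

# Defaults from the option list: window_size 60, capacity_ratio 0.5.
IO.inspect(TopCSketch.capacity(60, 0.5))

# Four router scores, half capacity: the two highest-scoring positions.
IO.inspect(TopCSketch.select([0.9, 0.1, 0.7, 0.3], 0.5))
```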

Returns

An Axon model outputting [batch, hidden_size].

output_size(opts \\ [])

@spec output_size(keyword()) :: pos_integer()

Get the output size of a MixtureOfDepths model.