# `Edifice.Meta.MixtureOfDepths`

Mixture of Depths: per-token routing where only the top-C% of tokens are processed.

Implements the Mixture-of-Depths approach from "Mixture-of-Depths: Dynamically
allocating compute in transformer-based language models" (Raposo et al., 2024).
A learned router scores each token, and only the top-C% of tokens (C is set by
the capacity ratio) are processed through the full transformer block; the rest
skip the block via the residual connection.
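
The top-C% selection described above can be sketched in plain Elixir. This is an illustration only, not the code used by this module: the `TopCSketch` module and its function names are hypothetical, and rounding the capacity up (with a minimum of one token) is an assumption, since the rounding rule is not stated here.

```elixir
defmodule TopCSketch do
  @moduledoc "Illustrative hard top-C token selection (hypothetical, not the Edifice implementation)."

  # Number of tokens kept for a sequence of `seq_len` tokens.
  # Assumption: round up and always keep at least one token.
  def capacity(seq_len, capacity_ratio) do
    max(1, ceil(seq_len * capacity_ratio))
  end

  # Returns a 0/1 mask with 1 at the positions of the highest router scores.
  def top_c_mask(scores, capacity_ratio) do
    k = capacity(length(scores), capacity_ratio)

    # Indices of the k highest-scoring tokens.
    keep =
      scores
      |> Enum.with_index()
      |> Enum.sort_by(fn {score, _idx} -> score end, :desc)
      |> Enum.take(k)
      |> MapSet.new(fn {_score, idx} -> idx end)

    # Walk the tokens in their original order and mark the kept ones.
    scores
    |> Enum.with_index()
    |> Enum.map(fn {_score, idx} -> if MapSet.member?(keep, idx), do: 1, else: 0 end)
  end
end
```

With four tokens scored `[0.1, 0.9, 0.3, 0.7]` and a capacity ratio of `0.5`, `TopCSketch.top_c_mask/2` returns `[0, 1, 0, 1]`: the two highest-scoring tokens are marked for processing and the other two would take the residual path.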

## Architecture

```
Input [batch, seq, hidden]
      |
      v
+-----------------------------+
| Per Layer:                  |
|   Router: dense -> sigmoid  |
|   -> soft gate per token    |
|   Transformer block on all  |
|   output = gate*block +     |
|            (1-gate)*input   |
+-----------------------------+
      | (repeat num_layers)
      v
Final Norm -> Last Timestep
Output [batch, hidden_size]
```
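
The last step of the diagram reduces `[batch, seq, hidden]` to `[batch, hidden_size]` by keeping only the final timestep of each sequence. A minimal sketch of that shape change, using nested lists in place of tensors (the `LastTimestepSketch` module is hypothetical and only illustrates the indexing):

```elixir
defmodule LastTimestepSketch do
  # `hidden` is a nested list shaped [batch][seq][hidden].
  # Keeping only the final token of each sequence gives [batch][hidden].
  def last_timestep(hidden), do: Enum.map(hidden, &List.last/1)
end

# Two sequences of two tokens each, with hidden size 2.
hidden = [
  [[1.0, 2.0], [3.0, 4.0]],
  [[5.0, 6.0], [7.0, 8.0]]
]

IO.inspect(LastTimestepSketch.last_timestep(hidden))
```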

## How It Works

For each layer, a router network produces a per-token score in [0, 1].
A top-C selection mechanism identifies which tokens should receive full
processing. In this Axon-compatible implementation, all tokens pass through
the transformer block, and the router gate sets how each token's output is
blended between the block output and the residual input:

    output_t = gate_t * block(input_t) + (1 - gate_t) * input_t

Tokens with low router scores effectively skip the block via residual.
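
The gating formula can be worked through for a single token in plain Elixir. This is a sketch under stated assumptions, not the module's implementation: `SoftGateSketch`, `weights`, and `bias` are hypothetical names standing in for the learned router, and lists of floats stand in for tensors.

```elixir
defmodule SoftGateSketch do
  # Logistic sigmoid, mapping any real number into (0, 1).
  def sigmoid(x), do: 1.0 / (1.0 + :math.exp(-x))

  # Router score for one token: sigmoid(w . x + b).
  # `weights` and `bias` are hypothetical stand-ins for the learned dense layer.
  def router_score(token, weights, bias) do
    dot =
      token
      |> Enum.zip(weights)
      |> Enum.reduce(0.0, fn {x, w}, acc -> acc + x * w end)

    sigmoid(dot + bias)
  end

  # output_t = gate_t * block(input_t) + (1 - gate_t) * input_t, element-wise.
  def mix(gate, block_out, input) do
    block_out
    |> Enum.zip(input)
    |> Enum.map(fn {b, x} -> gate * b + (1.0 - gate) * x end)
  end
end
```

For example, `SoftGateSketch.mix(0.5, [4.0, 8.0], [2.0, 4.0])` returns `[3.0, 6.0]`, halfway between the block output and the input, while a gate of `0.0` returns the input unchanged, which is the residual skip.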

## Usage

    alias Edifice.Meta.MixtureOfDepths

    model = MixtureOfDepths.build(
      embed_dim: 287,
      hidden_size: 256,
      num_heads: 4,
      capacity_ratio: 0.5,
      num_layers: 4
    )

## References
- Raposo et al., "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" (2024). https://arxiv.org/abs/2404.02258

# `build_opt`

```elixir
@type build_opt() ::
  {:capacity_ratio, float()}
  | {:dropout, float()}
  | {:embed_dim, pos_integer()}
  | {:hidden_size, pos_integer()}
  | {:num_heads, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:window_size, pos_integer()}
```

Options for `build/1`.

# `build`

```elixir
@spec build([build_opt()]) :: Axon.t()
```

Build a Mixture of Depths model.

## Options
  - `:embed_dim` - Input embedding dimension (required)
  - `:hidden_size` - Internal hidden dimension (default: 256)
  - `:num_heads` - Number of attention heads (default: 4)
  - `:capacity_ratio` - Fraction of tokens per layer targeted for full processing, between 0 and 1 (default: 0.5)
  - `:num_layers` - Number of transformer layers (default: 4)
  - `:dropout` - Dropout rate (default: 0.1)
  - `:window_size` - Sequence length (default: 60)

## Returns
  An Axon model outputting `[batch, hidden_size]`.

# `output_size`

```elixir
@spec output_size(keyword()) :: pos_integer()
```

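Returns the size of the last output dimension of a MixtureOfDepths model built
with the given options. `build/1` outputs `[batch, hidden_size]`, so this
corresponds to `:hidden_size`.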

# `recommended_defaults`

```elixir
@spec recommended_defaults() :: keyword()
```

Returns a keyword list of recommended default options for MixtureOfDepths.
