Edifice.Meta.SwitchMoE (Edifice v0.2.0)

Switch Transformer - Top-1 Expert Routing.

The Switch Transformer simplifies Mixture-of-Experts (MoE) routing by selecting a single expert per token (top-1) rather than several (top-k). This reduces computation and communication costs while preserving the parameter count that gives the model its capacity. Each token is routed to exactly one expert according to learned routing weights.
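As a compact restatement of top-1 routing, the rule can be written in the notation of Fedus et al. for a token representation x and N experts. The symbols W_r (router weights), p (router probabilities), and E_i (the i-th expert FFN) follow the paper and are not identifiers from this module.

```latex
p(x) = \mathrm{softmax}(W_r\, x), \qquad
i^\ast = \arg\max_{i \in \{1,\dots,N\}} p_i(x), \qquad
y = p_{i^\ast}(x)\, E_{i^\ast}(x)
```

Scaling the selected expert's output by its router probability is what lets gradients reach the router weights even though only one expert is chosen.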

Architecture

Input [batch, seq_len, embed_dim]
      |
      v
+------------------------------------+
| Input Projection                   |
+------------------------------------+
      |
      v
+------------------------------------+
| Switch Block 1:                    |
|   Pre-Norm -> Router (top-1)       |
|   -> Selected Expert FFN           |
|   + Residual                       |
+------------------------------------+
      |  (repeat N times)
      v
+------------------------------------+
| Final Norm + Last Timestep         |
+------------------------------------+
      |
      v
Output [batch, hidden_size]
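The data flow in the diagram can be sketched in plain Elixir for a single batch element. This is a minimal illustration on lists of floats, not the module's Axon implementation: the `SwitchBlockSketch` module and its function names are hypothetical, the input projection is omitted (the token width is assumed to already equal `hidden_size`), the layer norm has no learned scale or shift, and each `ffn` function stands in for the routed expert FFN.

```elixir
defmodule SwitchBlockSketch do
  # Layer normalization of one token vector (no learned scale/shift; an assumption).
  def layer_norm(x, eps \\ 1.0e-5) do
    n = length(x)
    mean = Enum.sum(x) / n
    var = Enum.sum(Enum.map(x, fn v -> (v - mean) * (v - mean) end)) / n
    Enum.map(x, fn v -> (v - mean) / :math.sqrt(var + eps) end)
  end

  # One block, applied to every token: x + ffn(norm(x)).
  def block(tokens, ffn) do
    Enum.map(tokens, fn x ->
      y = x |> layer_norm() |> ffn.()
      Enum.zip_with(x, y, fn a, b -> a + b end)
    end)
  end

  # N blocks in sequence, then final norm, then keep only the last timestep.
  def forward(tokens, ffns) do
    ffns
    |> Enum.reduce(tokens, fn ffn, acc -> block(acc, ffn) end)
    |> Enum.map(&layer_norm/1)
    |> List.last()
  end
end

# [seq_len = 3, hidden_size = 4] for one batch element.
tokens = [
  [1.0, 2.0, 3.0, 4.0],
  [0.5, -0.5, 1.5, 2.5],
  [2.0, 0.0, -1.0, 1.0]
]

# Two stand-in "expert" functions, one per block (hypothetical).
ffns = [
  fn x -> Enum.map(x, fn v -> v * 0.5 end) end,
  fn x -> Enum.map(x, fn v -> v + 1.0 end) end
]

output = SwitchBlockSketch.forward(tokens, ffns)
IO.inspect(length(output), label: "output width")
```

The result is a single vector of width `hidden_size`, which matches the `[batch, hidden_size]` output shape once a batch dimension is added.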

Router Design

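The router computes softmax probabilities over the experts and selects the highest-scoring expert for each token. Because Axon builds a static computation graph, every expert is evaluated for every token, and the router's choice is applied as a weighted combination of the expert outputs using a peaked (near-one-hot) distribution. The result approximates hard top-1 routing rather than skipping the unselected experts.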
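A minimal sketch of this dense routing scheme, written in plain Elixir on lists rather than Nx tensors. The `SwitchRouterSketch` module, its function names, and the example experts are hypothetical. The use of a low softmax temperature to obtain the peaked distribution is also an assumption, since the text above does not say how the module sharpens the router output.

```elixir
defmodule SwitchRouterSketch do
  # Temperature-scaled softmax over a list of logits.
  # A small temperature pushes the result toward one-hot (an assumed mechanism).
  def softmax(logits, temperature \\ 1.0) do
    scaled = Enum.map(logits, fn l -> l / temperature end)
    peak = Enum.max(scaled)
    exps = Enum.map(scaled, fn s -> :math.exp(s - peak) end)
    total = Enum.sum(exps)
    Enum.map(exps, fn e -> e / total end)
  end

  # Index of the highest-probability expert (the hard top-1 choice).
  def top1(probs) do
    {_p, idx} = probs |> Enum.with_index() |> Enum.max_by(fn {p, _i} -> p end)
    idx
  end

  # Dense routing: run every expert, scale each output by its router
  # probability, and sum the scaled outputs element-wise.
  def dense_route(token, experts, probs) do
    experts
    |> Enum.zip(probs)
    |> Enum.map(fn {expert, p} -> Enum.map(expert.(token), fn v -> v * p end) end)
    |> Enum.zip_with(&Enum.sum/1)
  end
end

# Two toy experts standing in for expert FFNs.
experts = [
  fn x -> Enum.map(x, fn v -> v * 2.0 end) end,
  fn x -> Enum.map(x, fn v -> v + 10.0 end) end
]

token = [1.0, 2.0]
probs = SwitchRouterSketch.softmax([0.2, 1.5], 0.05)
selected = SwitchRouterSketch.top1(probs)
mixed = SwitchRouterSketch.dense_route(token, experts, probs)

IO.inspect(selected, label: "selected expert")
IO.inspect(mixed, label: "mixed output")
```

With a sharply peaked distribution the mixed output is numerically very close to the selected expert's output alone, which is how a static graph can mimic a hard top-1 selection.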

Usage

alias Edifice.Meta.SwitchMoE

model = SwitchMoE.build(
  embed_dim: 256,
  hidden_size: 256,
  num_experts: 8,
  expert_size: 512,
  num_layers: 4
)

References

Fedus et al., "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity" (JMLR 2022)
https://arxiv.org/abs/2101.03961

Summary

Types

build_opt()

Options for build/1.

Functions

build(opts \\ [])

Build a Switch Transformer model.

output_size(opts \\ [])

Get the output size of a Switch MoE model.

switch_block(input, hidden_size, opts \\ [])

Single Switch block: pre-norm -> top-1 routed expert FFN -> residual.