Edifice.Meta.SoftMoE (Edifice v0.2.0)

Soft Mixture of Experts (Puigcerver et al., 2024).

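Unlike hard-routing MoE (Switch, top-k), which sends each token to only a few experts, Soft MoE feeds every expert a softmax-weighted average of all tokens and returns each token's output as a weighted combination of all expert outputs. Because no token is ever assigned discretely, this avoids token dropping, load-balancing problems, and routing instability while keeping the capacity benefits of MoE.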
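
To make the contrast with hard routing concrete, here is a minimal sketch that compares top-1 assignment with soft dispatch weights on the same router scores. It does not use Edifice: the logits are made-up values, the variable names are hypothetical, and Nx is assumed to be installable via Mix.install.

```elixir
# Standalone sketch: run with `elixir routing_contrast.exs`.
# Assumption: Nx is fetched via Mix.install; Edifice itself is not used here.
Mix.install([{:nx, "~> 0.7"}])

# Hypothetical router logits: 4 tokens (rows) scored against 2 experts (columns).
logits = Nx.tensor([[2.0, 0.0], [1.5, 0.5], [3.0, 1.0], [0.2, 0.1]])

# Hard top-1 routing: each token goes to its highest-scoring expert only.
hard_assignment = logits |> Nx.argmax(axis: 1) |> Nx.to_flat_list()
tokens_per_expert = Enum.frequencies(hard_assignment)

# Soft dispatch: softmax over the token axis, so each expert's weights sum to 1.
# The logits are small, so the usual max-subtraction for stability is omitted.
exp = Nx.exp(logits)
dispatch = Nx.divide(exp, Nx.sum(exp, axes: [0], keep_axes: true))
weight_per_expert = Nx.sum(dispatch, axes: [0])

IO.inspect(tokens_per_expert, label: "hard top-1 tokens per expert")
IO.inspect(weight_per_expert, label: "soft dispatch weight per expert")
```

With these scores every token prefers expert 0, so hard top-1 routing leaves expert 1 with no tokens, whereas the soft dispatch weights give each expert a total weight of 1 regardless of how the scores are distributed.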

Architecture

Input [batch, seq_len, embed_dim]
      |
      v
+------------------------------------+
| Input Projection                   |
+------------------------------------+
      |
      v
+------------------------------------+
| SoftMoE Block:                     |
|   1. Compute dispatch weights      |
|      D = softmax(X * Phi)          |
|   2. Compute expert inputs         |
|      X_e = D^T * X                 |
|   3. Run all experts               |
|      Y_e = Expert_e(X_e)           |
|   4. Combine outputs               |
|      Y = C * stack(Y_e)            |
|   + Residual                       |
+------------------------------------+
      |  (repeat N times)
      v
Output [batch, hidden_size]
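
The four steps in the SoftMoE block can be sketched on plain Nx tensors as follows. This is an illustration under stated assumptions, not the module's implementation: it follows the definitions in Puigcerver et al., where D is a softmax over the token axis and C is a softmax over the expert (slot) axis of the same logits X * Phi, it uses one slot per expert, and it leaves out the batch dimension, the input projection, and the residual. The module name SoftMoESketch, the function names, and the toy experts are all hypothetical.

```elixir
# Standalone sketch: run with `elixir soft_moe_sketch.exs`.
# Assumption: Nx is fetched via Mix.install; Edifice itself is not used here.
Mix.install([{:nx, "~> 0.7"}])

defmodule SoftMoESketch do
  # Numerically stable softmax along one axis.
  def softmax(t, axis) do
    peak = Nx.reduce_max(t, axes: [axis], keep_axes: true)
    e = Nx.exp(Nx.subtract(t, peak))
    Nx.divide(e, Nx.sum(e, axes: [axis], keep_axes: true))
  end

  # x: {tokens, dim}, phi: {dim, num_experts},
  # experts: list of functions, each mapping a {dim} tensor to a {dim} tensor.
  def route(x, phi, experts) do
    logits = Nx.dot(x, phi)

    # 1. Dispatch weights (normalised over tokens) and, as assumed from the
    #    paper, combine weights (normalised over experts).
    dispatch = softmax(logits, 0)
    combine = softmax(logits, 1)

    # 2. Expert inputs: X_e = D^T * X, one row per expert.
    expert_inputs = Nx.dot(Nx.transpose(dispatch), x)

    # 3. Run every expert on its own row.
    expert_outputs =
      experts
      |> Enum.with_index()
      |> Enum.map(fn {expert, i} -> expert.(expert_inputs[i]) end)
      |> Nx.stack()

    # 4. Combine: Y = C * stack(Y_e), giving one output row per token.
    {dispatch, combine, Nx.dot(combine, expert_outputs)}
  end
end

# Toy data: 3 tokens of dimension 2, routed across 2 experts.
x = Nx.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
phi = Nx.tensor([[0.5, -0.5], [0.25, 1.0]])
experts = [&Nx.multiply(&1, 2.0), &Nx.add(&1, 1.0)]

{d, c, y} = SoftMoESketch.route(x, phi, experts)
IO.inspect(Nx.shape(y), label: "output shape")
```

The output has one row per input token, here shape {3, 2}. Each column of d sums to 1 (every expert sees a convex combination of tokens) and each row of c sums to 1 (every token receives a convex combination of expert outputs), which is why no token can be dropped.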

Usage

model = SoftMoE.build(
  embed_dim: 256,
  hidden_size: 256,
  num_experts: 4,
  num_layers: 4
)

References

Puigcerver et al., "From Sparse to Soft Mixtures of Experts" (ICLR 2024)
https://arxiv.org/abs/2308.00951

Summary

Types

build_opt()

Options for build/1.

Functions

build(opts \\ [])

Build a Soft MoE model.

output_size(opts \\ [])

Get the output size of a Soft MoE model.

soft_moe_block(input, hidden_size, opts \\ [])

Single Soft MoE block with dispatch-combine routing.