SwiGLU / GeGLU / ReGLU gated feed-forward networks.
Gated Linear Units with various gate activations, as used in the feed-forward blocks of modern transformers (LLaMA, PaLM, Mistral). Compared to a standard dense-plus-activation block, the multiplicative gate improves gradient flow and gives the layer more expressive power.
Formula
SwiGLU(x) = (xW1 * SiLU(xV)) W2
GeGLU(x) = (xW1 * GELU(xV)) W2
ReGLU(x) = (xW1 * ReLU(xV)) W2
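To make the formulas concrete, here is a minimal sketch of the SwiGLU case in plain Nx; the tensors x, w1, v, and w2 are hypothetical stand-ins for the layer input and parameters, not part of this module:

# Minimal sketch of SwiGLU(x) = (xW1 * SiLU(xV)) W2 in plain Nx.
# x, w1, v, and w2 are hypothetical stand-ins, not this module's parameters.
key = Nx.Random.key(42)
{x, key} = Nx.Random.normal(key, shape: {2, 256})     # [batch, hidden_size]
{w1, key} = Nx.Random.normal(key, shape: {256, 1024}) # up projection
{v, key} = Nx.Random.normal(key, shape: {256, 1024})  # gate projection
{w2, _key} = Nx.Random.normal(key, shape: {1024, 256}) # down projection

# SiLU(t) = t * sigmoid(t)
silu = fn t -> Nx.multiply(t, Nx.sigmoid(t)) end

x
|> Nx.dot(w1)                       # xW1
|> Nx.multiply(silu.(Nx.dot(x, v))) # * SiLU(xV)
|> Nx.dot(w2)                       # W2 -> [batch, hidden_size]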
Architecture

Input [batch, ..., dim]
          |
 +--------+--------+
 |                 |
Dense W1       Dense V (gate)
 |                 |
 |          Activation (SiLU/GELU/ReLU)
 |                 |
 +---> Multiply <--+
          |
       Dense W2
          |
Output [batch, ..., dim]
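For illustration, the same dataflow can be wired up by hand with Axon's graph combinators. This is a sketch of the diagram above, not this module's internal implementation; the layer names are illustrative:

# Hand-wired SwiGLU block following the diagram (illustrative names).
input = Axon.input("features", shape: {nil, 256})

# Left branch: plain up projection.
up = Axon.dense(input, 1024, use_bias: false, name: "w1")

# Right branch: gate projection followed by the gate activation.
gate =
  input
  |> Axon.dense(1024, use_bias: false, name: "v")
  |> Axon.activation(:silu)

# Element-wise gate, then down projection back to the model dimension.
ffn =
  Axon.multiply(up, gate)
  |> Axon.dense(256, use_bias: false, name: "w2")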
Usage

ffn = SwiGLU.layer(input, hidden_size: 256, inner_size: 1024)
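A slightly fuller sketch of building and running the block, assuming standard Axon/Nx build APIs (the input name "features" and the shapes are illustrative):

# Build and run the block; shapes and the input name are assumptions.
input = Axon.input("features", shape: {nil, 256})
model = SwiGLU.layer(input, hidden_size: 256, inner_size: 1024)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 256}, :f32), %{})
predict_fn.(params, Nx.broadcast(0.5, {8, 256}))
# The block preserves the trailing dimension, so the result has shape {8, 256}.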
- "GLU Variants Improve Transformer" (Shazeer, 2020)
- https://arxiv.org/abs/2002.05202
Functions

layer(input, opts)

Build a SwiGLU feed-forward block as an Axon layer.
Options

- :hidden_size - Input/output dimension (required)
- :inner_size - Intermediate dimension (default: hidden_size * 2.667, rounded to a multiple of 8)
- :activation - Gate activation: :silu, :gelu, or :relu (default: :silu)
- :dropout - Dropout rate (default: 0.0)
- :name - Layer name prefix (default: "swiglu")
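For example, a GeGLU variant with dropout could be requested as below (a sketch assuming the options above; inner_size is left at its default):

geglu =
  SwiGLU.layer(input,
    hidden_size: 512,
    activation: :gelu, # GELU gate activation -> GeGLU
    dropout: 0.1,
    name: "ffn"
  )
# inner_size defaults to 512 * 2.667, rounded to a multiple of 8.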