SwiGLU / GeGLU / ReGLU gated feed-forward networks.
Gated Linear Units with various gate activations, as used in the feed-forward blocks of modern transformers (LLaMA, PaLM, Mistral). Compared to a standard dense-plus-activation block, the multiplicative gate improves gradient flow and gives the layer more expressive power.
Formula
SwiGLU(x) = (xW1 * SiLU(xV)) W2
GeGLU(x) = (xW1 * GELU(xV)) W2
ReGLU(x) = (xW1 * ReLU(xV)) W2
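To make the formulas concrete, here is a minimal sketch of the SwiGLU case in plain Nx; the tensors x, w1, v, and w2 are hypothetical stand-ins for the layer input and parameters, not part of this module:

# Minimal sketch of SwiGLU(x) = (xW1 * SiLU(xV)) W2 in plain Nx.
# x, w1, v, and w2 are hypothetical stand-ins, not this module's parameters.
key = Nx.Random.key(42)
{x, key} = Nx.Random.normal(key, shape: {2, 256})     # [batch, hidden_size]
{w1, key} = Nx.Random.normal(key, shape: {256, 1024}) # up projection
{v, key} = Nx.Random.normal(key, shape: {256, 1024})  # gate projection
{w2, _key} = Nx.Random.normal(key, shape: {1024, 256}) # down projection

# SiLU(t) = t * sigmoid(t)
silu = fn t -> Nx.multiply(t, Nx.sigmoid(t)) end

x
|> Nx.dot(w1)                       # xW1
|> Nx.multiply(silu.(Nx.dot(x, v))) # * SiLU(xV)
|> Nx.dot(w2)                       # W2 -> [batch, hidden_size]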
Architecture

Input [batch, ..., dim]
          |
 +--------+--------+
 |                 |
Dense W1       Dense V (gate)
 |                 |
 |          Activation (SiLU/GELU/ReLU)
 |                 |
 +---> Multiply <--+
          |
       Dense W2
          |
Output [batch, ..., dim]
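For illustration, the same dataflow can be wired up by hand with Axon's graph combinators. This is a sketch of the diagram above, not this module's internal implementation; the layer names are illustrative:

# Hand-wired SwiGLU block following the diagram (illustrative names).
input = Axon.input("features", shape: {nil, 256})

# Left branch: plain up projection.
up = Axon.dense(input, 1024, use_bias: false, name: "w1")

# Right branch: gate projection followed by the gate activation.
gate =
  input
  |> Axon.dense(1024, use_bias: false, name: "v")
  |> Axon.activation(:silu)

# Element-wise gate, then down projection back to the model dimension.
ffn =
  Axon.multiply(up, gate)
  |> Axon.dense(256, use_bias: false, name: "w2")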
Usage

ffn = SwiGLU.layer(input, hidden_size: 256, inner_size: 1024)
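A slightly fuller sketch of building and running the block, assuming standard Axon/Nx build APIs (the input name "features" and the shapes are illustrative):

# Build and run the block; shapes and the input name are assumptions.
input = Axon.input("features", shape: {nil, 256})
model = SwiGLU.layer(input, hidden_size: 256, inner_size: 1024)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 256}, :f32), %{})
predict_fn.(params, Nx.broadcast(0.5, {8, 256}))
# The block preserves the trailing dimension, so the result has shape {8, 256}.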
- "GLU Variants Improve Transformer" (Shazeer, 2020)
- https://arxiv.org/abs/2002.05202
Functions

layer(input, opts)

Build a SwiGLU feed-forward block as an Axon layer.
Options

- :hidden_size - Input/output dimension (required)
- :inner_size - Intermediate dimension (default: hidden_size * 2.667, rounded to a multiple of 8)
- :activation - Gate activation: :silu, :gelu, or :relu (default: :silu)
- :dropout - Dropout rate (default: 0.0)
- :name - Layer name prefix (default: "swiglu")
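For example, a GeGLU variant with dropout could be requested as below (a sketch assuming the options above; inner_size is left at its default):

geglu =
  SwiGLU.layer(input,
    hidden_size: 512,
    activation: :gelu, # GELU gate activation -> GeGLU
    dropout: 0.1,
    name: "ffn"
  )
# inner_size defaults to 512 * 2.667, rounded to a multiple of 8.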