Gated DeltaNet - Linear Attention with Gated Delta Rule.
Extends DeltaNet with a data-dependent gating mechanism that modulates the state matrix between timesteps. Where vanilla DeltaNet always retains all of S_{t-1} (modulated only by the delta correction), Gated DeltaNet introduces a forget gate alpha_t that controls how much of the previous state to retain before applying the delta update.
This gives the model explicit control over memory erasure, which is critical for tasks that require forgetting stale associations.
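The effect of the forget gate can be seen in a small numerical sketch (NumPy, for illustration only; `gated_delta_step` is a hypothetical helper, not part of this library):

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One step of the gated delta rule: S_t = alpha_t * S_{t-1} + beta_t * (v_t - S_{t-1} k_t) k_t^T."""
    return alpha * S + beta * np.outer(v - S @ k, k)

S = np.zeros((4, 4))                      # state matrix: maps keys -> values
k1, k2 = np.eye(4)[0], np.eye(4)[1]       # two orthonormal keys
v_old, v_new, v2 = np.eye(4)[0], np.eye(4)[1], np.eye(4)[2]

# Vanilla behaviour (alpha = 1): the delta correction replaces the value
# stored under k1 without touching anything else.
S = gated_delta_step(S, k1, v_old, alpha=1.0, beta=1.0)
S = gated_delta_step(S, k1, v_new, alpha=1.0, beta=1.0)
print(S @ k1)        # retrieves v_new, not v_old

# With alpha < 1 the whole previous state decays before the write, so the
# association under k1 fades as new content arrives under other keys.
S = gated_delta_step(S, k2, v2, alpha=0.1, beta=1.0)
print(S @ k1)        # 0.1 * v_new: the stale association is mostly erased
```

With `alpha = 1` the rule only corrects the slot addressed by the current key; a small `alpha` erases everything else too, which is the explicit memory-erasure control described above.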
Key Innovations
- Gated state transition: S_t = alpha_t * S_{t-1} + beta_t * (v_t - S_{t-1} k_t) k_t^T
- Data-dependent forgetting: alpha_t = sigmoid(W_alpha x_t) controls memory decay
- Short convolution: Optional causal convolution before Q/K/V projections for local context
- Swish gate on output: Gated output projection for expressivity
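Putting these pieces together, a single head over a few timesteps can be sketched in NumPy (random stand-ins for the learned projection weights; toy dimensions, not this library's defaults):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
d, d_k, d_v = 8, 4, 4                  # toy sizes

# Random stand-ins for the learned W_q, W_k, W_v, W_beta, W_alpha, W_g.
W_q, W_k, W_v = rng.normal(size=(3, d_k, d))
W_beta, W_alpha = rng.normal(size=(2, d))
W_g = rng.normal(size=(d_v, d))

S = np.zeros((d_v, d_k))               # state matrix, carried across timesteps
for x in rng.normal(size=(6, d)):      # 6 timesteps of input
    q, v = W_q @ x, W_v @ x
    k = W_k @ x
    k /= np.linalg.norm(k)                         # L2-normalized key
    beta = sigmoid(W_beta @ x)                     # update gate (write strength)
    alpha = sigmoid(W_alpha @ x)                   # forget gate (retention)
    S = alpha * S + beta * np.outer(v - S @ k, k)  # gated delta rule
    g = W_g @ x
    o = g * sigmoid(g) * (S @ q)                   # swish-gated output
```

The optional short convolution is omitted here; it would mix each `x` with a few preceding inputs before the projections.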
Equations
q_t = W_q x_t # Query projection
k_t = W_k x_t # Key projection (L2 normalized)
v_t = W_v x_t # Value projection
beta_t = sigmoid(W_beta x_t) # Update gate (write strength)
alpha_t = sigmoid(W_alpha x_t) # Forget gate (retention)
S_t = alpha_t * S_{t-1} + beta_t * (v_t - S_{t-1} k_t) * k_t^T # Gated delta rule
o_t = swish(W_g x_t) * (S_t q_t) # Gated output

Architecture
Input [batch, seq_len, embed_dim]
|
v
[Input Projection] -> hidden_size
|
v
+----------------------------------------------+
| Gated DeltaNet Layer |
| Short Conv (optional) for local context |
| Project to Q, K, V, beta, alpha, gate |
| For each timestep: |
| S = alpha * S + beta * (v - S@k) * k^T |
| output = swish(gate) * (S @ q) |
+----------------------------------------------+
| (repeat num_layers)
v
[Layer Norm] -> [Last Timestep]
|
v
Output [batch, hidden_size]

Compared to DeltaNet
| Aspect | DeltaNet | Gated DeltaNet |
|---|---|---|
| State update | S + beta * (v - S k) * k^T | alpha * S + beta * (v - S k) * k^T |
| Forgetting | Implicit (via delta correction) | Explicit (alpha gate) |
| Output gating | None | Swish gate |
| Local context | None | Optional short convolution |
| Expressivity | Lower | Higher (data-dependent dynamics) |
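The first table row can be checked numerically: with alpha fixed at 1, the gated update reduces exactly to the vanilla DeltaNet update (NumPy sketch; `delta_step` and `gated_delta_step` are illustrative names, not this library's API):

```python
import numpy as np

def delta_step(S, k, v, beta):                 # vanilla DeltaNet update
    return S + beta * np.outer(v - S @ k, k)

def gated_delta_step(S, k, v, alpha, beta):    # gated variant
    return alpha * S + beta * np.outer(v - S @ k, k)

rng = np.random.default_rng(1)
S = rng.normal(size=(4, 4))
k = rng.normal(size=4)
k /= np.linalg.norm(k)                         # keys are L2-normalized
v, beta = rng.normal(size=4), 0.7

# alpha = 1 recovers the vanilla delta rule exactly ...
assert np.allclose(delta_step(S, k, v, beta),
                   gated_delta_step(S, k, v, alpha=1.0, beta=beta))

# ... while alpha < 1 uniformly damps the previous state before the write.
S_gated = gated_delta_step(S, k, v, alpha=0.5, beta=beta)
```

So DeltaNet is the special case alpha_t = 1; the gate buys the extra, data-dependent degree of freedom listed in the table.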
Usage
model = GatedDeltaNet.build(
embed_dim: 287,
hidden_size: 256,
num_layers: 4,
use_short_conv: true,
dropout: 0.1
)

References
- "Gated Delta Networks: Improving Mamba2 with Delta Rule" (Yang et al., 2024)
- https://arxiv.org/abs/2412.06464
- Adopted by Qwen3-Next and Kimi Linear (Moonshot AI)
Summary

Functions
- Build a Gated DeltaNet model for sequence processing.
- Build a single Gated DeltaNet block that can be used as a backbone layer in hybrid architectures.
- Default dropout rate
- Default hidden dimension
- Default number of attention heads
- Default number of layers
- Get the output size of a Gated DeltaNet model.

Types

@type build_opt() ::
        {:conv_size, pos_integer()}
        | {:dropout, float()}
        | {:embed_dim, pos_integer()}
        | {:hidden_size, pos_integer()}
        | {:num_heads, pos_integer()}
        | {:num_layers, pos_integer()}
        | {:seq_len, pos_integer()}
        | {:use_short_conv, boolean()}
        | {:window_size, pos_integer()}

Options for build/1.
Functions
Build a Gated DeltaNet model for sequence processing.
Options
- `:embed_dim` - Size of input embedding per frame (required)
- `:hidden_size` - Internal hidden dimension (default: 256)
- `:num_heads` - Number of independent gated delta rule heads (default: 4)
- `:num_layers` - Number of Gated DeltaNet layers (default: 4)
- `:dropout` - Dropout rate between layers (default: 0.1)
- `:use_short_conv` - Use short causal convolution before projections (default: true)
- `:conv_size` - Kernel size for short convolution (default: 4)
- `:window_size` - Expected sequence length (default: 60)
- `:seq_len` - Alias for `:window_size`
Returns
An Axon model that processes sequences and outputs the last hidden state.
Build a single Gated DeltaNet block that can be used as a backbone layer in hybrid architectures.
Takes input of shape [batch, seq_len, hidden_size] and returns the same shape. Includes pre-norm and residual connection.
Options
- `:hidden_size` - Hidden dimension (default: 256)
- `:num_heads` - Number of heads (default: 4)
- `:use_short_conv` - Use short causal convolution (default: true)
- `:conv_size` - Convolution kernel size (default: 4)
- `:dropout` - Dropout rate (default: 0.1)
- `:name` - Layer name prefix (default: "gated_delta_net_block")
@spec default_dropout() :: float()
Default dropout rate
@spec default_num_heads() :: pos_integer()
Default number of attention heads
@spec default_num_layers() :: pos_integer()
Default number of layers
@spec output_size(keyword()) :: non_neg_integer()
Get the output size of a Gated DeltaNet model.