DeltaNet - Linear Attention with Delta Rule.
Implements linear attention with the delta rule update from "Linear Transformers Are Secretly Fast Weight Programmers" (Schlag et al., 2021) and subsequent work.
DeltaNet maintains an associative memory matrix S that is updated using the delta rule, which corrects previous associations rather than blindly accumulating them. This gives it superior retrieval accuracy compared to standard linear attention.
Key Innovations
- Delta rule update: S_t = S_{t-1} + beta_t * (v_t - S_{t-1} k_t) k_t^T
- Error-correcting: Subtracts the current retrieval S_{t-1} k_t before adding
- Learnable beta: Controls update rate per-token via a gate
- Linear complexity: O(n) time in sequence length with a fixed O(d^2) state, vs O(n^2) time and an O(n*d) KV cache for softmax attention
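The error-correcting property can be seen in a minimal NumPy sketch (illustrative only, not the module's code): with beta = 1 and a unit-norm key, the delta rule overwrites a stale association exactly, while plain additive accumulation (S += v k^T) blends old and new values.

```python
import numpy as np

d = 4
k = np.array([1.0, 0.0, 0.0, 0.0])      # unit-norm key
v_old = np.array([1.0, 2.0, 3.0, 4.0])  # first value bound to k
v_new = np.array([9.0, 8.0, 7.0, 6.0])  # updated value for the same key

# Delta rule: subtract the current retrieval S @ k before writing.
S = np.zeros((d, d))
for v in (v_old, v_new):
    beta = 1.0
    S += beta * np.outer(v - S @ k, k)
# S @ k now equals v_new exactly: the stale binding was corrected.

# Plain additive linear attention: S += v k^T accumulates blindly.
S_add = np.zeros((d, d))
for v in (v_old, v_new):
    S_add += np.outer(v, k)
# S_add @ k equals v_old + v_new: the old association still leaks in.
```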
Equations
q_t = W_q x_t # Query projection
k_t = W_k x_t # Key projection (L2 normalized)
v_t = W_v x_t # Value projection
beta_t = sigmoid(W_beta x_t) # Update gate
S_t = S_{t-1} + beta_t * (v_t - S_{t-1} k_t) * k_t^T # Delta rule
o_t = S_t q_t # Output retrieval

Architecture
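The recurrence above can be sketched end-to-end in NumPy (a reference implementation of the math only, not the Axon code; the projection weights here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 5
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
w_beta = rng.standard_normal(d) / np.sqrt(d)
xs = rng.standard_normal((T, d))  # one d-dim input per timestep

S = np.zeros((d, d))  # associative memory state
outputs = []
for x in xs:
    q, k, v = W_q @ x, W_k @ x, W_v @ x
    k = k / (np.linalg.norm(k) + 1e-6)            # L2-normalize the key
    beta = 1.0 / (1.0 + np.exp(-(w_beta @ x)))    # sigmoid update gate
    S = S + beta * np.outer(v - S @ k, k)         # delta rule update
    outputs.append(S @ q)                         # output retrieval
```

Note that the state S stays a fixed d-by-d matrix regardless of sequence length, which is the source of the linear time complexity.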
Input [batch, seq_len, embed_dim]
|
v
[Input Projection] -> hidden_size
|
v
+----------------------------------+
| DeltaNet Layer |
| Project to Q, K, V, beta |
| For each timestep: |
| error = v - S @ k |
| S += beta * error * k^T |
| output = S @ q |
+----------------------------------+
| (repeat num_layers)
v
[Layer Norm] -> [Last Timestep]
|
v
Output [batch, hidden_size]

Usage
model = DeltaNet.build(
embed_dim: 287,
hidden_size: 256,
num_layers: 4,
dropout: 0.1
)

References
- Paper: https://arxiv.org/abs/2102.11174
- Delta rule RNNs: https://arxiv.org/abs/2310.01655
Summary

Functions
- Build a DeltaNet model for sequence processing.
- Build a single DeltaNet block that can be used as a backbone layer in hybrid architectures.
- Default dropout rate
- Default hidden dimension
- Default number of attention heads
- Default number of layers
- Epsilon for normalization
- Get the output size of a DeltaNet model.
Types
@type build_opt() :: {:dropout, float()} | {:embed_dim, pos_integer()} | {:hidden_size, pos_integer()} | {:num_heads, pos_integer()} | {:num_layers, pos_integer()} | {:seq_len, pos_integer()} | {:window_size, pos_integer()}
Options for build/1.
Functions
Build a DeltaNet model for sequence processing.
Options
- :embed_dim - Size of input embedding per frame (required)
- :hidden_size - Internal hidden dimension (default: 256)
- :num_heads - Number of independent delta rule heads (default: 4)
- :num_layers - Number of DeltaNet layers (default: 4)
- :dropout - Dropout rate between layers (default: 0.1)
- :window_size - Expected sequence length (default: 60)
Returns
An Axon model that processes sequences and outputs the last hidden state.
Build a single DeltaNet block that can be used as a backbone layer in hybrid architectures.
Takes input of shape [batch, seq_len, hidden_size] and returns the same shape. Includes pre-norm and residual connection.
Options
- :hidden_size - Hidden dimension (default: 256)
- :num_heads - Number of heads (default: 4)
- :name - Layer name prefix (default: "delta_net_block")
@spec default_dropout() :: float()
Default dropout rate
@spec default_num_heads() :: pos_integer()
Default number of attention heads
@spec default_num_layers() :: pos_integer()
Default number of layers
@spec norm_eps() :: float()
Epsilon for normalization
@spec output_size(keyword()) :: non_neg_integer()
Get the output size of a DeltaNet model.