Test-Time Training (TTT) Layers.
Implements TTT layers from "Learning to (Learn at Test Time): RNNs with Expressive Hidden States" (Sun et al., 2024). In TTT, the hidden state is itself a model (a linear layer or small MLP) that is updated via a self-supervised gradient step at each token.
Key Innovations
- Hidden state IS a model: Instead of a vector, the hidden state is the weight matrix of a small inner model
- Self-supervised updates: At each step, the inner model does a gradient step on a reconstruction loss
- Equivalent to linear attention: TTT-Linear is mathematically equivalent to linear attention with the delta rule when the inner model is linear (see the worked update after this list)
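To make the equivalence concrete: with a linear inner model and the reconstruction loss ||W_{t-1} @ k_t - v_t||^2, a single gradient step gives

W_t = W_{t-1} - eta_t * (W_{t-1} @ k_t - v_t) @ k_t^T   # delta-rule update

which is exactly the delta-rule form of linear attention. Vanilla linear attention instead accumulates W_t = W_{t-1} + v_t @ k_t^T, with no prediction-error term.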
Paper-Faithful Implementation
Follows the official TTT-Linear implementation (ttt-lm-pytorch) with these key stability mechanisms:
- W_0 ~ N(0, 0.02): Small initialization keeps early predictions near zero, preventing gradient explosion in the first steps.
- eta / head_dim scaling: The inner learning rate is scaled by 1/d (d = inner_size), keeping each weight update small. Without this, an eta in [0, 1] would be ~64x too large at the default inner_size of 64 (see the sketch after this list).
- Inner LayerNorm: Learnable LayerNorm on inner model predictions before computing reconstruction error. Prevents prediction magnitudes from drifting.
- Output gating: Sigmoid gate on output (like SwiGLU) for smoother gradients.
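A minimal Nx sketch of the first two safeguards (illustrative only; the variable names and the example logit below are not from the module):

key = Nx.Random.key(0)
inner_size = 64

# W_0 ~ N(0, 0.02): the inner model starts out predicting values near zero.
{w0, _key} = Nx.Random.normal(key, 0.0, 0.02, shape: {inner_size, inner_size})

# eta comes out of a sigmoid, so it lies in [0, 1]; dividing by inner_size
# (64 here) keeps each per-token weight update small.
eta = Nx.divide(Nx.sigmoid(Nx.tensor(0.3)), inner_size)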
Equations (TTT-Linear)
# Project inputs
q_t = W_q x_t # Query
k_t = W_k x_t # Key
v_t = W_v x_t # Value (reconstruction target)
eta_t = sigmoid(W_eta x_t) / d # Learning rate gate (scaled by 1/head_dim)
# Inner model forward + LayerNorm
pred_t = LayerNorm(W_{t-1} @ k_t)
# Self-supervised gradient update
error_t = pred_t - v_t
grad_W = error_t @ k_t^T
W_t = W_{t-1} - eta_t * grad_W
# Gated output using updated model
o_t = W_t @ q_t * sigmoid(gate_t)
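As a concrete illustration of the recurrence above, here is a minimal Nx sketch of one inner-model step. It is not the module's actual implementation (the TTTStep name is a placeholder); it assumes q, k, v, and gate are already-projected vectors of size d, eta is a per-token scalar already divided by d, and it uses the simplified gradient shown in the equations (the inner LayerNorm is treated as identity in the backward pass):

defmodule TTTStep do
  # Learnable LayerNorm applied to the inner model's prediction.
  def inner_layer_norm(x, gamma, beta) do
    mean = Nx.mean(x)
    var = Nx.variance(x)

    x
    |> Nx.subtract(mean)
    |> Nx.divide(Nx.sqrt(Nx.add(var, 1.0e-6)))
    |> Nx.multiply(gamma)
    |> Nx.add(beta)
  end

  # One token: predict, take a gradient step on the reconstruction error,
  # then emit the gated output from the updated weights.
  def step(w, q, k, v, eta, gate, gamma, beta) do
    pred = inner_layer_norm(Nx.dot(w, k), gamma, beta)
    error = Nx.subtract(pred, v)
    grad_w = Nx.outer(error, k)
    w_new = Nx.subtract(w, Nx.multiply(eta, grad_w))
    out = Nx.multiply(Nx.dot(w_new, q), Nx.sigmoid(gate))
    {w_new, out}
  end
end

For a full sequence, fold step/8 over the timesteps with W_0 as the initial accumulator (e.g. via Enum.reduce/3), collecting the per-token outputs.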
Architecture

Input [batch, seq_len, embed_dim]
        |
        v
[Input Projection] -> hidden_size
        |
        v
+--------------------------------------+
| TTT Layer                            |
|   Project to Q, K, V, eta, gate      |
|   For each timestep:                 |
|     pred = LayerNorm(W @ k)          |
|     error = pred - v                 |
|     W -= (eta/d) * error * k^T       |
|     output = (W @ q) * sigmoid(gate) |
+--------------------------------------+
        | (repeat num_layers)
        v
[Layer Norm] -> [Last Timestep]
        |
        v
Output [batch, hidden_size]

Usage
model = TTT.build(
  embed_dim: 287,
  hidden_size: 256,
  num_layers: 4,
  inner_size: 64,
  dropout: 0.1
)
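The result is a plain Axon graph, so it can be initialized and run with Axon's standard build/predict flow. A minimal sketch (the dummy input below uses the default window_size of 60; the exact init call can vary between Axon versions, and if a bare tensor is not accepted, pass a map keyed by the model's input name):

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 60, 287}, :f32), %{})

output = predict_fn.(params, Nx.broadcast(0.0, {1, 60, 287}))
# output has shape {1, 256}, i.e. [batch, hidden_size]

References
- Sun et al. (2024). "Learning to (Learn at Test Time): RNNs with Expressive Hidden States."
- Official reference implementation: ttt-lm-pytorch.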
Summary

Functions
- Build a TTT model for sequence processing.
- Default dropout rate
- Default hidden dimension
- Default inner model dimension (key/value size)
- Default number of layers
- Get the output size of a TTT model.
Types
@type build_opt() ::
        {:dropout, float()}
        | {:embed_dim, pos_integer()}
        | {:hidden_size, pos_integer()}
        | {:inner_size, pos_integer()}
        | {:num_layers, pos_integer()}
        | {:output_gate, boolean()}
        | {:seq_len, pos_integer()}
        | {:variant, :linear | :mlp}
        | {:window_size, pos_integer()}
Options for build/1.
Functions
Build a TTT model for sequence processing.
Options
- :embed_dim - Size of input embedding per frame (required)
- :hidden_size - Internal hidden dimension (default: 256)
- :inner_size - Inner model key/value dimension (default: 64)
- :num_layers - Number of TTT layers (default: 4)
- :dropout - Dropout rate between layers (default: 0.1)
- :window_size - Expected sequence length (default: 60)
- :variant - Inner model variant: :linear (default) or :mlp. The :mlp variant applies SiLU activation to keys and queries before the inner model, making the hidden state a 2-layer MLP instead of a single linear layer.
- :output_gate - Apply sigmoid output gate (default: true). Provides smoother gradients by gating the TTT output before the residual.
Returns
An Axon model that processes sequences and outputs the last hidden state.
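For example, building the MLP variant with the output gate disabled, using the options above (the option values here are illustrative):

model =
  TTT.build(
    embed_dim: 287,
    hidden_size: 256,
    variant: :mlp,
    output_gate: false
  )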
@spec default_dropout() :: float()
Default dropout rate
@spec default_inner_size() :: pos_integer()
Default inner model dimension (key/value size)
@spec default_num_layers() :: pos_integer()
Default number of layers
@spec output_size(keyword()) :: non_neg_integer()
Get the output size of a TTT model.
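A usage sketch; the return value is assumed to mirror the :hidden_size option, matching the [batch, hidden_size] output shape documented above:

# Assumed behavior: returns the configured hidden size.
TTT.output_size(hidden_size: 256)
#=> 256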