Differential Transformer V2: simplified noise-cancelling attention.
Instead of a single softmax attention map per head, the Differential Transformer computes two independent attention maps and subtracts them. Shared noise patterns (tokens that attract attention regardless of relevance) cancel out, improving the signal-to-noise ratio, analogous to a differential amplifier in electronics.
Key Innovation
For each head, Q and K are split into two halves:
A1 = softmax(Q1 @ K1^T / sqrt(d/2))
A2 = softmax(Q2 @ K2^T / sqrt(d/2))
DiffAttn = (A1 - lambda * A2) @ V

where lambda is a single learnable scalar (initialized to a layer-dependent value).
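As a rough illustration, here is a minimal plain-Nx sketch of the per-head computation. The module and function names, shapes, and the way lambda is passed are illustrative assumptions, not this library's internals:

defmodule DiffAttnSketch do
  import Nx.Defn

  # Numerically stable softmax over the last axis.
  defn softmax(scores) do
    max = Nx.reduce_max(scores, axes: [-1], keep_axes: true)
    exp = Nx.exp(scores - max)
    exp / Nx.sum(exp, axes: [-1], keep_axes: true)
  end

  # q1, q2, k1, k2: {seq_len, d_half}; v: {seq_len, d_head}; lambda: scalar.
  defn diff_attention(q1, q2, k1, k2, v, lambda) do
    scale = Nx.sqrt(Nx.axis_size(q1, 1))

    a1 = softmax(Nx.dot(q1, [1], k1, [1]) / scale)
    a2 = softmax(Nx.dot(q2, [1], k2, [1]) / scale)

    # Attention mass shared by both maps cancels in the subtraction.
    Nx.dot(a1 - lambda * a2, v)
  end
end

In the full multi-head case the same operation runs per head, with batch and head dimensions added.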
V2 Simplifications
- Lambda is now a single learnable scalar per layer (vs 4-vector parameterization)
- Removed per-head GroupNorm/SubLayerNorm
- Uses simple RMSNorm before output projection
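RMSNorm rescales each vector by the root mean square of its elements and applies a learnable gain, with no mean subtraction or bias. A plain-Nx sketch (module name, epsilon, and the gamma handling are illustrative assumptions, not this module's implementation):

defmodule RMSNormSketch do
  import Nx.Defn

  @eps 1.0e-6

  # x: {..., dim}; gamma: {dim} learnable per-dimension scale.
  defn rms_norm(x, gamma) do
    mean_square = Nx.mean(x * x, axes: [-1], keep_axes: true)
    x / Nx.sqrt(mean_square + @eps) * gamma
  end
end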
Architecture
Input [batch, seq_len, embed_dim]
|
Input projection to hidden_size
|
+--------------------------------------+
| DiffTransformer Block (x N) |
| |
| LayerNorm -> Diff Attention |
| Q -> [Q1, Q2], K -> [K1, K2] |
| A1 = softmax(Q1K1^T/s) |
| A2 = softmax(Q2K2^T/s) |
| out = (A1 - lambda*A2) @ V |
| RMSNorm |
| -> Residual |
| LayerNorm -> FFN -> Residual |
+--------------------------------------+
|
Final LayerNorm
|
Last timestep -> [batch, hidden_size]

Usage
model = DiffTransformer.build(
  embed_dim: 287,
  hidden_size: 256,
  num_heads: 4,
  num_layers: 6
)
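Running the built model follows the usual Axon workflow. The shapes below (batch of 8, 60 timesteps, 287 input features) are illustrative, and the model is assumed to take a single input so a bare tensor can be passed to the predict function:

{init_fn, predict_fn} = Axon.build(model)

input = Nx.broadcast(0.0, {8, 60, 287})
params = init_fn.(Nx.template({8, 60, 287}, :f32), %{})

output = predict_fn.(params, input)
# output shape: {8, 256}, i.e. [batch, hidden_size]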
- "Differential Transformer" (Ye et al., Microsoft Research, 2024)
- "Differential Transformer V2" (2025) - simplified parameterization
Summary
Functions
Build a Differential Transformer V2 model.
Build the differential attention layer (V2 simplified).
Get the output dimension for a model configuration.
Recommended default configuration.
Types
@type build_opt() ::
        {:embed_dim, pos_integer()}
        | {:hidden_size, pos_integer()}
        | {:num_heads, pos_integer()}
        | {:num_layers, pos_integer()}
        | {:dropout, float()}
        | {:window_size, pos_integer()}
Options for build/1.
Functions
Build a Differential Transformer V2 model.
Options
- :embed_dim - Size of input embedding per timestep (required)
- :hidden_size - Internal hidden dimension (default: 256)
- :num_heads - Number of differential attention heads (default: 4)
- :num_layers - Number of transformer blocks (default: 6)
- :dropout - Dropout rate (default: 0.1)
- :window_size - Expected sequence length for JIT optimization (default: 60)
Returns
An Axon model that outputs [batch, hidden_size] from the last position.
Build the differential attention layer (V2 simplified).
Projects the input to Q, K, and V, splits Q and K into two halves, computes two softmax attention maps and subtracts them with the learnable lambda scalar, then applies RMSNorm.
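The halving of Q (and likewise K) can be pictured with Nx slicing along the feature axis; the tensor and sizes below are toy values, not taken from the implementation:

q = Nx.iota({4, 8}, type: :f32)                   # toy per-head Q: {seq_len, d_head}
half = div(Nx.axis_size(q, 1), 2)
q1 = Nx.slice_along_axis(q, 0, half, axis: 1)     # first half  -> {4, 4}
q2 = Nx.slice_along_axis(q, half, half, axis: 1)  # second half -> {4, 4}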
@spec output_size(keyword()) :: non_neg_integer()
Get the output dimension for a model configuration.
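Since build/1 outputs [batch, hidden_size], the returned value presumably tracks :hidden_size; a hypothetical call:

DiffTransformer.output_size(hidden_size: 256)
#=> 256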
@spec recommended_defaults() :: keyword()
Recommended default configuration.
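A hypothetical call, assuming the returned keyword list mirrors the option defaults documented for build/1:

DiffTransformer.recommended_defaults()
#=> [hidden_size: 256, num_heads: 4, num_layers: 6, dropout: 0.1, window_size: 60]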