Differential Transformer V2: simplified noise-cancelling attention.
Instead of a single softmax attention map per head, the Differential Transformer computes two independent attention maps and subtracts them. Shared noise patterns (tokens that attract attention regardless of relevance) cancel out, improving the signal-to-noise ratio, analogous to a differential amplifier in electronics.
Key Innovation
For each head, Q and K are split into two halves:
A1 = softmax(Q1 @ K1^T / sqrt(d/2))
A2 = softmax(Q2 @ K2^T / sqrt(d/2))
DiffAttn = (A1 - lambda * A2) @ V

where lambda is a single learnable scalar (initialized to a layer-dependent value).
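As a rough illustration, here is a minimal plain-Nx sketch of the per-head computation. The module and function names, shapes, and the way lambda is passed are illustrative assumptions, not this library's internals:

defmodule DiffAttnSketch do
  import Nx.Defn

  # Numerically stable softmax over the last axis.
  defn softmax(scores) do
    max = Nx.reduce_max(scores, axes: [-1], keep_axes: true)
    exp = Nx.exp(scores - max)
    exp / Nx.sum(exp, axes: [-1], keep_axes: true)
  end

  # q1, q2, k1, k2: {seq_len, d_half}; v: {seq_len, d_head}; lambda: scalar.
  defn diff_attention(q1, q2, k1, k2, v, lambda) do
    scale = Nx.sqrt(Nx.axis_size(q1, 1))

    a1 = softmax(Nx.dot(q1, [1], k1, [1]) / scale)
    a2 = softmax(Nx.dot(q2, [1], k2, [1]) / scale)

    # Attention mass shared by both maps cancels in the subtraction.
    Nx.dot(a1 - lambda * a2, v)
  end
end

In the full multi-head case the same operation runs per head, with batch and head dimensions added.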
V2 Simplifications
- Lambda is now a single learnable scalar per layer (vs 4-vector parameterization)
- Removed per-head GroupNorm/SubLayerNorm
- Uses simple RMSNorm before output projection
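RMSNorm rescales each vector by the root mean square of its elements and applies a learnable gain, with no mean subtraction or bias. A plain-Nx sketch (module name, epsilon, and the gamma handling are illustrative assumptions, not this module's implementation):

defmodule RMSNormSketch do
  import Nx.Defn

  @eps 1.0e-6

  # x: {..., dim}; gamma: {dim} learnable per-dimension scale.
  defn rms_norm(x, gamma) do
    mean_square = Nx.mean(x * x, axes: [-1], keep_axes: true)
    x / Nx.sqrt(mean_square + @eps) * gamma
  end
end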
Architecture
Input [batch, seq_len, embed_dim]
|
Input projection to hidden_size
|
+--------------------------------------+
| DiffTransformer Block (x N) |
| |
| LayerNorm -> Diff Attention |
| Q -> [Q1, Q2], K -> [K1, K2] |
| A1 = softmax(Q1K1^T/s) |
| A2 = softmax(Q2K2^T/s) |
| out = (A1 - lambda*A2) @ V |
| RMSNorm |
| -> Residual |
| LayerNorm -> FFN -> Residual |
+--------------------------------------+
|
Final LayerNorm
|
Last timestep -> [batch, hidden_size]

Usage
model = DiffTransformer.build(
  embed_dim: 287,
  hidden_size: 256,
  num_heads: 4,
  num_layers: 6
)
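Running the built model follows the usual Axon workflow. The shapes below (batch of 8, 60 timesteps, 287 input features) are illustrative, and the model is assumed to take a single input so a bare tensor can be passed to the predict function:

{init_fn, predict_fn} = Axon.build(model)

input = Nx.broadcast(0.0, {8, 60, 287})
params = init_fn.(Nx.template({8, 60, 287}, :f32), %{})

output = predict_fn.(params, input)
# output shape: {8, 256}, i.e. [batch, hidden_size]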
- "Differential Transformer" (Ye et al., Microsoft Research, 2024)
- "Differential Transformer V2" (2025) - simplified parameterization
Summary
Functions
Build a Differential Transformer V2 model.
Build the differential attention layer (V2 simplified).
Get the output dimension for a model configuration.
Recommended default configuration.
Types
@type build_opt() ::
        {:embed_dim, pos_integer()}
        | {:hidden_size, pos_integer()}
        | {:num_heads, pos_integer()}
        | {:num_layers, pos_integer()}
        | {:dropout, float()}
        | {:window_size, pos_integer()}
Options for build/1.
Functions
Build a Differential Transformer V2 model.
Options
- :embed_dim - Size of input embedding per timestep (required)
- :hidden_size - Internal hidden dimension (default: 256)
- :num_heads - Number of differential attention heads (default: 4)
- :num_layers - Number of transformer blocks (default: 6)
- :dropout - Dropout rate (default: 0.1)
- :window_size - Expected sequence length for JIT optimization (default: 60)
Returns
An Axon model that outputs [batch, hidden_size] from the last position.
Build the differential attention layer (V2 simplified).
Projects the input to Q, K, and V, splits Q and K into two halves, computes two softmax attention maps and subtracts them with the learnable lambda scalar, then applies RMSNorm.
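The halving of Q (and likewise K) can be pictured with Nx slicing along the feature axis; the tensor and sizes below are toy values, not taken from the implementation:

q = Nx.iota({4, 8}, type: :f32)                   # toy per-head Q: {seq_len, d_head}
half = div(Nx.axis_size(q, 1), 2)
q1 = Nx.slice_along_axis(q, 0, half, axis: 1)     # first half  -> {4, 4}
q2 = Nx.slice_along_axis(q, half, half, axis: 1)  # second half -> {4, 4}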
@spec output_size(keyword()) :: non_neg_integer()
Get the output dimension for a model configuration.
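Since build/1 outputs [batch, hidden_size], the returned value presumably tracks :hidden_size; a hypothetical call:

DiffTransformer.output_size(hidden_size: 256)
#=> 256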
@spec recommended_defaults() :: keyword()
Recommended default configuration.
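A hypothetical call, assuming the returned keyword list mirrors the option defaults documented for build/1:

DiffTransformer.recommended_defaults()
#=> [hidden_size: 256, num_heads: 4, num_layers: 6, dropout: 0.1, window_size: 60]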