# `Edifice.Attention.DiffTransformer`
[🔗](https://github.com/blasphemetheus/edifice/blob/main/lib/edifice/attention/diff_transformer.ex#L1)

Differential Transformer V2: simplified noise-cancelling attention.

Instead of a single softmax attention map per head, the Differential Transformer
computes two independent attention maps and subtracts them. Shared noise patterns
(tokens that attract attention universally) cancel out, improving the
signal-to-noise ratio, analogous to common-mode rejection in differential amplifiers.

## Key Innovation

For each head, Q and K are split into two halves:

```
A1 = softmax(Q1 @ K1^T / sqrt(d/2))
A2 = softmax(Q2 @ K2^T / sqrt(d/2))

DiffAttn = (A1 - lambda * A2) @ V
```

Here `lambda` is a single learnable scalar, initialized to a layer-dependent value.
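
As a toy illustration of the cancellation (values invented for this sketch,
with `lambda = 1.0`): two attention rows that share mass on a common "noise"
token subtract to zero there, leaving only the positions they disagree on.

```elixir
# Both rows place 0.3 of their mass on token 1 (the shared "noise");
# the difference zeroes it out and keeps only the opposing signal.
a1 = Nx.tensor([0.6, 0.3, 0.1])
a2 = Nx.tensor([0.1, 0.3, 0.6])
Nx.subtract(a1, a2)
#=> approximately [0.5, 0.0, -0.5]
```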

## V2 Simplifications

- Lambda is now a single learnable scalar per layer (vs the 4-vector parameterization); see the sketch after this list
- Removed per-head GroupNorm/SubLayerNorm
- Uses simple RMSNorm before output projection
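
A minimal sketch of the scalar-lambda simplification. The names, shapes, and
initializer here are illustrative assumptions, not the module's actual
internals: a per-layer scalar can be declared as an Axon parameter and
threaded into a custom layer.

```elixir
# Hypothetical sketch: a learnable scalar "lambda" attached to a custom
# Axon layer. The real layer computes differential attention; this only
# demonstrates the scalar-parameter mechanics.
input = Axon.input("x", shape: {nil, 16})
lambda = Axon.param("lambda", {}, initializer: :ones)

scaled =
  Axon.layer(
    fn x, lam, _opts -> Nx.multiply(x, lam) end,
    [input, lambda],
    name: "lambda_scale"
  )
```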

## Architecture

```
Input [batch, seq_len, embed_dim]
      |
Input projection to hidden_size
      |
+--------------------------------------+
|   DiffTransformer Block (x N)        |
|                                      |
|   LayerNorm -> Diff Attention        |
|     Q -> [Q1, Q2], K -> [K1, K2]     |
|     A1 = softmax(Q1K1^T/s)           |
|     A2 = softmax(Q2K2^T/s)           |
|     out = (A1 - lambda*A2) @ V       |
|     RMSNorm                          |
|   -> Residual                        |
|   LayerNorm -> FFN -> Residual       |
+--------------------------------------+
      |
Final LayerNorm
      |
Last timestep -> [batch, hidden_size]
```

## Usage

    alias Edifice.Attention.DiffTransformer

    model = DiffTransformer.build(
      embed_dim: 287,
      hidden_size: 256,
      num_heads: 4,
      num_layers: 6
    )

## References

- "Differential Transformer" (Ye et al., Microsoft Research, 2024)
- "Differential Transformer V2" (2025) - simplified parameterization

# `build_opt`

```elixir
@type build_opt() ::
  {:embed_dim, pos_integer()}
  | {:hidden_size, pos_integer()}
  | {:num_heads, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:dropout, float()}
  | {:window_size, pos_integer()}
```

Options for `build/1`.

# `build`

```elixir
@spec build([build_opt()]) :: Axon.t()
```

Build a Differential Transformer V2 model.

## Options

  - `:embed_dim` - Size of input embedding per timestep (required)
  - `:hidden_size` - Internal hidden dimension (default: 256)
  - `:num_heads` - Number of differential attention heads (default: 4)
  - `:num_layers` - Number of transformer blocks (default: 6)
  - `:dropout` - Dropout rate (default: 0.1)
  - `:window_size` - Expected sequence length for JIT optimization (default: 60)

## Returns

  An Axon model that outputs `[batch, hidden_size]` from the last position.
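
A sketch of initializing and running the returned model, continuing the Usage
example above. The shapes follow the documented `[batch, seq_len, embed_dim]`
contract; passing a bare tensor assumes the model exposes a single input
(otherwise pass a `%{name => tensor}` map).

```elixir
{init_fn, predict_fn} = Axon.build(model)

# Template matches [batch, seq_len, embed_dim] with window_size = 60
params = init_fn.(Nx.template({1, 60, 287}, :f32), %{})

input = Nx.iota({8, 60, 287}, type: :f32)
output = predict_fn.(params, input)
# output has shape {8, 256}: [batch, hidden_size] from the last position
```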

# `build_diff_attention`

```elixir
@spec build_diff_attention(
  Axon.t(),
  keyword()
) :: Axon.t()
```

Build the differential attention layer (V2 simplified).

Projects the input to Q, K, and V, splits Q and K into two halves, computes
two softmax attention maps, subtracts them weighted by the learnable lambda
scalar, and applies RMSNorm before the output projection.
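
A minimal Nx sketch of the core computation for a single head, under assumed
shapes: `q`, `k`, `v` as `{batch, seq, head_dim}` with an even `head_dim`,
and `lambda` a scalar tensor. The function names are illustrative, not the
module's internals; the real layer adds the multi-head split, masking,
dropout, and the RMSNorm before the output projection.

```elixir
import Nx.Defn

# Illustrative sketch (not the module's internals): the differential
# attention core for one head.
defn diff_attention(q, k, v, lambda) do
  half = div(Nx.axis_size(q, 2), 2)
  scale = Nx.sqrt(half)

  q1 = Nx.slice_along_axis(q, 0, half, axis: 2)
  q2 = Nx.slice_along_axis(q, half, half, axis: 2)
  k1 = Nx.slice_along_axis(k, 0, half, axis: 2)
  k2 = Nx.slice_along_axis(k, half, half, axis: 2)

  # Two independent attention maps: {batch, seq, seq}
  a1 = softmax(Nx.dot(q1, [2], [0], k1, [2], [0]) / scale)
  a2 = softmax(Nx.dot(q2, [2], [0], k2, [2], [0]) / scale)

  # Shared noise cancels in the subtraction
  Nx.dot(a1 - lambda * a2, [2], [0], v, [1], [0])
end

defnp softmax(t) do
  t = t - Nx.reduce_max(t, axes: [-1], keep_axes: true)
  e = Nx.exp(t)
  e / Nx.sum(e, axes: [-1], keep_axes: true)
end
```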

# `output_size`

```elixir
@spec output_size(keyword()) :: non_neg_integer()
```

Get the output dimension for a model configuration.
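
Given the documented `[batch, hidden_size]` output, this presumably echoes
the configured hidden size; the return value below is an assumption for
illustration.

```elixir
DiffTransformer.output_size(hidden_size: 512)
#=> 512 (assumed: mirrors :hidden_size)
```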

# `recommended_defaults`

```elixir
@spec recommended_defaults() :: keyword()
```

Recommended default configuration.
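
A sketch of composing the defaults with the one required option, assuming the
returned keyword list matches `build_opt()` and can be passed straight to
`build/1`:

```elixir
opts = Keyword.merge(DiffTransformer.recommended_defaults(), embed_dim: 287)
model = DiffTransformer.build(opts)
```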

---

*Consult [api-reference.md](api-reference.md) for the complete listing*
