Multi-Head Latent Attention (MLA) from DeepSeek-V2/V3.
MLA compresses key-value representations into low-rank latent vectors, dramatically reducing KV cache memory while maintaining attention quality. It also uses decoupled Rotary Position Embedding (RoPE) to keep position information separate from compressed content.
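For a sense of scale using this module's defaults (num_heads: 4, head_dim: 64, kv_latent_dim: 64, rope_dim: 32): standard multi-head attention caches 2 * 4 * 64 = 512 values per token per layer (K and V for every head), while MLA caches only the shared latent plus the shared RoPE key, 64 + 32 = 96 values, roughly a 5x reduction.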
Key Innovations
- KV compression: Instead of caching full K,V per head, compress to a low-rank latent c_KV and reconstruct K,V on the fly during attention (see the sketch after this list)
- Q compression: The query is also compressed through a low-rank bottleneck
- Decoupled RoPE: Position information is encoded via separate RoPE dimensions that are concatenated with content dimensions, not mixed
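To make the projection chain concrete, here is a minimal single-head sketch of the content path in plain Nx. It is not this module's actual implementation: all weight names (w_dkv, w_uk, w_uv, w_dq, w_uq) are illustrative, shapes follow the documented defaults (hidden_size: 256, kv_latent_dim: 64, q_latent_dim: 192, head_dim: 64), and the decoupled RoPE path is omitted.

key = Nx.Random.key(0)
{h, key}     = Nx.Random.normal(key, shape: {60, 256})    # input, [seq, hidden]
{w_dkv, key} = Nx.Random.normal(key, shape: {256, 64})    # KV down-projection
{w_uk, key}  = Nx.Random.normal(key, shape: {64, 64})     # K up-projection
{w_uv, key}  = Nx.Random.normal(key, shape: {64, 64})     # V up-projection
{w_dq, key}  = Nx.Random.normal(key, shape: {256, 192})   # Q down-projection
{w_uq, _key} = Nx.Random.normal(key, shape: {192, 64})    # Q up-projection

c_kv = Nx.dot(h, w_dkv)               # [60, 64] -- the only KV state cached
k_c  = Nx.dot(c_kv, w_uk)             # [60, 64] content keys, rebuilt on the fly
v    = Nx.dot(c_kv, w_uv)             # [60, 64] values, rebuilt on the fly
q_c  = Nx.dot(Nx.dot(h, w_dq), w_uq)  # [60, 64] queries via the Q bottleneck

scores  = Nx.dot(q_c, Nx.transpose(k_c)) |> Nx.divide(Nx.sqrt(64))
weights = Nx.exp(Nx.subtract(scores, Nx.reduce_max(scores, axes: [-1], keep_axes: true)))
weights = Nx.divide(weights, Nx.sum(weights, axes: [-1], keep_axes: true))
out     = Nx.dot(weights, v)          # [60, 64] per-head attention output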
Architecture
Input [batch, seq_len, embed_dim]
|
v
+--------------------------+
| MLA Block x N |
| LayerNorm |
| MLA Attention: |
| h -> W_DKV -> c_KV | (KV latent)
| c_KV -> W_UK -> K_c | (content keys)
| c_KV -> W_UV -> V | (values)
| h -> W_DQ -> c_Q | (Q latent)
| c_Q -> W_UQ -> Q_c | (content queries)
| c_Q -> W_QR -> RoPE | (query rope)
| h -> W_KR -> RoPE | (key rope, shared)
| Q = [Q_c ; Q_r] |
| K = [K_c ; K_r] |
| score = softmax(QK^T/sqrt(d)) | (d = head_dim + rope_dim)
| Residual |
| LayerNorm -> FFN |
| Residual |
+--------------------------+
|
v
[batch, hidden_size] (last timestep)

Usage
model = MLA.build(
embed_dim: 287,
hidden_size: 256,
num_heads: 4,
kv_latent_dim: 64,
num_layers: 4
)
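To sanity-check the output shape, the model can be built and run with the standard Axon API (the batch size and zero-valued input below are arbitrary; this assumes the model exposes a single input):

{init_fn, predict_fn} = Axon.build(model)
template = Nx.template({8, 60, 287}, :f32)   # [batch, seq_len, embed_dim]
params = init_fn.(template, %{})
output = predict_fn.(params, Nx.broadcast(0.0, {8, 60, 287}))
Nx.shape(output)   #=> {8, 256}

References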
- "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model" (DeepSeek-AI, 2024)
- arXiv: https://arxiv.org/abs/2405.04434
Summary

Functions
- Build an MLA model for sequence processing.
- Build a single MLA transformer block.
- Get the output size of an MLA model.
- Get recommended defaults.

Types

@type build_opt() ::
        {:dropout, float()}
        | {:embed_dim, pos_integer()}
        | {:head_dim, pos_integer()}
        | {:hidden_size, pos_integer()}
        | {:kv_latent_dim, pos_integer()}
        | {:num_heads, pos_integer()}
        | {:num_layers, pos_integer()}
        | {:q_latent_dim, pos_integer()}
        | {:rope_dim, pos_integer()}
        | {:seq_len, pos_integer()}
        | {:window_size, pos_integer()}

Options for build/1.
Functions
Build an MLA model for sequence processing.
Options
- :embed_dim - Size of input embedding per frame (required)
- :hidden_size - Internal hidden dimension (default: 256)
- :num_heads - Number of attention heads (default: 4)
- :head_dim - Dimension per head for content (default: 64)
- :kv_latent_dim - Compressed KV latent dimension (default: hidden_size / 4)
- :q_latent_dim - Compressed Q latent dimension (default: hidden_size * 3 / 4)
- :rope_dim - Decoupled RoPE dimension per head (default: 32)
- :num_layers - Number of MLA blocks (default: 4)
- :dropout - Dropout rate (default: 0.1)
- :seq_len - Expected sequence length (default: 60)
- :window_size - Alias for seq_len (default: 60)
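A hedged example of overriding the latent sizes explicitly (the values shown simply restate the documented defaults for hidden_size: 256):

model = MLA.build(
  embed_dim: 287,
  hidden_size: 256,
  num_heads: 4,
  kv_latent_dim: 64,   # hidden_size / 4
  q_latent_dim: 192,   # hidden_size * 3 / 4
  rope_dim: 32,
  num_layers: 4
)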
Returns
An Axon model that outputs [batch, hidden_size] from the last position.
Build a single MLA transformer block.
@spec output_size(keyword()) :: non_neg_integer()
Get the output size of an MLA model.
@spec recommended_defaults() :: keyword()
Get recommended defaults.
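A sketch of how these helpers might compose, assuming recommended_defaults/0 returns a keyword list compatible with build/1:

opts  = MLA.recommended_defaults() |> Keyword.put(:embed_dim, 287)
model = MLA.build(opts)
MLA.output_size(opts)   #=> 256 (matches :hidden_size)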