Lightning Attention — hybrid linear/softmax block attention.
Splits the sequence into fixed-size blocks and uses two complementary attention mechanisms:
- Intra-block: standard softmax attention within each block (O(B²·d) per block)
- Inter-block: linear attention via a cumulative KV state carried across blocks (O(B·d²) per block)
With a fixed block size B, the overall cost scales linearly in sequence length while retaining the expressivity of softmax attention at the local level.
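To make the two pathways concrete, here is a minimal single-head Nx sketch. The module and function names are illustrative, not the library's internals; the 1/sqrt(d) scaling, head/batch dimensions, and the linear-attention normalizer are omitted for brevity.

```elixir
defmodule LightningSketch do
  import Nx.Defn

  # Numerically stable softmax over the last axis.
  defn softmax(t) do
    m = Nx.reduce_max(t, axes: [-1], keep_axes: true)
    e = Nx.exp(t - m)
    e / Nx.sum(e, axes: [-1], keep_axes: true)
  end

  # q, k, v: {num_blocks, block_size, d}
  defn block_attention(q, k, v) do
    # Intra-block: exact softmax attention inside each block.
    scores = Nx.dot(q, [2], [0], k, [2], [0])
    intra = Nx.dot(softmax(scores), [2], [0], v, [2], [0])

    # Inter-block: each block reads the K^T V state accumulated over all
    # *previous* blocks (exclusive prefix sum, so the current block is
    # handled only by the softmax path above).
    kv = Nx.dot(k, [1], [0], v, [1], [0])
    state = Nx.cumulative_sum(kv, axis: 0) - kv
    inter = Nx.dot(q, [2], [0], state, [1], [0])

    intra + inter
  end
end
```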
Architecture
```
Input [batch, seq_len, embed_dim]
        |
Input Projection to hidden_size
        |
+----------------------------------------------+
| Lightning Attention Block (x num_layers)     |
|                                              |
| LayerNorm -> Q,K,V projections               |
| Reshape to [batch, heads, blocks, B, d]      |
|                                              |
| Intra-block: softmax(Q_b @ K_b^T) @ V_b      |
| Inter-block: Q_b @ cumsum(K_j^T V_j), j<b    |
| Output = intra + inter                       |
|                                              |
| -> Residual                                  |
| LayerNorm -> FFN -> Residual                 |
+----------------------------------------------+
        |
Final LayerNorm
        |
Last timestep -> [batch, hidden_size]
```

Constraints
seq_len must be divisible by block_size.
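A quick illustration of the rule (the check itself is illustrative; the library's actual validation may differ):

```elixir
valid? = fn seq_len, block_size -> rem(seq_len, block_size) == 0 end
valid?.(128, 64) #=> true
valid?.(60, 64)  #=> false, so the default seq_len of 60 needs an explicit override
```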
Usage
```elixir
model = LightningAttention.build(
  embed_dim: 287,
  hidden_size: 256,
  num_heads: 8,
  num_layers: 4,
  block_size: 64
)
```
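An end-to-end sketch of running the built model. `Axon.build/1` and the returned init/predict functions are standard Axon API, but the exact input template the model expects is an assumption; `seq_len: 128` is passed explicitly so it divides evenly by `block_size: 64`.

```elixir
model =
  LightningAttention.build(
    embed_dim: 287,
    hidden_size: 256,
    block_size: 64,
    seq_len: 128
  )

{init_fn, predict_fn} = Axon.build(model)
input = Nx.iota({2, 128, 287}, type: :f32)    # {batch, seq_len, embed_dim}
params = init_fn.(Nx.to_template(input), %{})
predict_fn.(params, input)                    # assumed output shape: {2, 256}
```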
References

- Qin et al., "Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models" (2024)
Summary
Functions
Build a Lightning Attention model.
Build the lightning attention sublayer.
Get the output size of the model.
Types
@type build_opt() ::
        {:embed_dim, pos_integer()}
        | {:hidden_size, pos_integer()}
        | {:num_heads, pos_integer()}
        | {:num_layers, pos_integer()}
        | {:block_size, pos_integer()}
        | {:dropout, float()}
Options for build/1.
Functions
Build a Lightning Attention model.
Options
- :embed_dim - Input embedding dimension (required)
- :hidden_size - Internal hidden dimension (default: 256)
- :num_heads - Number of attention heads (default: 8)
- :num_layers - Number of Lightning Attention blocks (default: 4)
- :block_size - Block size B for chunked attention (default: 64). seq_len must be divisible by this value.
- :dropout - Dropout rate (default: 0.1)
- :seq_len / :window_size - Expected sequence length (default: 60)
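For example, since :window_size is listed as an alias for :seq_len, either key should set the expected sequence length (assumed interchangeable here). Note that the documented defaults (seq_len 60, block_size 64) do not satisfy the divisibility constraint, so pass compatible values explicitly:

```elixir
model = LightningAttention.build(
  embed_dim: 287,
  window_size: 128,  # alias for :seq_len; 128 is divisible by block_size 64
  block_size: 64
)
```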
Returns
An Axon model outputting [batch, hidden_size].
Build the lightning attention sublayer.
This creates the core attention mechanism with both intra-block (softmax) and inter-block (linear) attention pathways.
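Concretely, the per-block recurrence implied by that description is (plain-text sketch; the 1/sqrt(d) scaling is the standard convention and is assumed rather than stated here):

```
KV_0 = 0
KV_b = KV_{b-1} + K_b^T @ V_b
O_b  = softmax(Q_b @ K_b^T / sqrt(d)) @ V_b  +  Q_b @ KV_{b-1}
```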
@spec output_size(keyword()) :: pos_integer()
Get the output size of the model.
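A hedged usage sketch: given that build/1 returns a model outputting [batch, hidden_size], output_size/1 presumably reports that trailing dimension (assumed behavior, inferred from the spec above):

```elixir
LightningAttention.output_size(hidden_size: 256)
#=> 256 (assumed: the configured or default hidden size)
```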