Lightning Attention — hybrid linear/softmax block attention.
Splits the sequence into fixed-size blocks and uses two complementary attention mechanisms:
- Intra-block: standard softmax attention within each block (O(B²·d) per block)
- Inter-block: linear attention via a cumulative KV state carried across blocks (O(B·d²) per block)
With a fixed block size B, the overall cost scales linearly in sequence length while retaining the expressivity of softmax attention at the local level.
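To make the two pathways concrete, here is a minimal single-head Nx sketch. The module and function names are illustrative, not the library's internals; the 1/sqrt(d) scaling, head/batch dimensions, and the linear-attention normalizer are omitted for brevity.

```elixir
defmodule LightningSketch do
  import Nx.Defn

  # Numerically stable softmax over the last axis.
  defn softmax(t) do
    m = Nx.reduce_max(t, axes: [-1], keep_axes: true)
    e = Nx.exp(t - m)
    e / Nx.sum(e, axes: [-1], keep_axes: true)
  end

  # q, k, v: {num_blocks, block_size, d}
  defn block_attention(q, k, v) do
    # Intra-block: exact softmax attention inside each block.
    scores = Nx.dot(q, [2], [0], k, [2], [0])
    intra = Nx.dot(softmax(scores), [2], [0], v, [2], [0])

    # Inter-block: each block reads the K^T V state accumulated over all
    # *previous* blocks (exclusive prefix sum, so the current block is
    # handled only by the softmax path above).
    kv = Nx.dot(k, [1], [0], v, [1], [0])
    state = Nx.cumulative_sum(kv, axis: 0) - kv
    inter = Nx.dot(q, [2], [0], state, [1], [0])

    intra + inter
  end
end
```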
Architecture
```
Input [batch, seq_len, embed_dim]
        |
Input Projection to hidden_size
        |
+----------------------------------------------+
| Lightning Attention Block (x num_layers)     |
|                                              |
| LayerNorm -> Q,K,V projections               |
| Reshape to [batch, heads, blocks, B, d]      |
|                                              |
| Intra-block: softmax(Q_b @ K_b^T) @ V_b      |
| Inter-block: Q_b @ cumsum(K_j^T V_j), j<b    |
| Output = intra + inter                       |
|                                              |
| -> Residual                                  |
| LayerNorm -> FFN -> Residual                 |
+----------------------------------------------+
        |
Final LayerNorm
        |
Last timestep -> [batch, hidden_size]
```

Constraints
seq_len must be divisible by block_size.
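A quick illustration of the rule (the check itself is illustrative; the library's actual validation may differ):

```elixir
valid? = fn seq_len, block_size -> rem(seq_len, block_size) == 0 end
valid?.(128, 64) #=> true
valid?.(60, 64)  #=> false, so the default seq_len of 60 needs an explicit override
```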
Usage
```elixir
model = LightningAttention.build(
  embed_dim: 287,
  hidden_size: 256,
  num_heads: 8,
  num_layers: 4,
  block_size: 64
)
```
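An end-to-end sketch of running the built model. `Axon.build/1` and the returned init/predict functions are standard Axon API, but the exact input template the model expects is an assumption; `seq_len: 128` is passed explicitly so it divides evenly by `block_size: 64`.

```elixir
model =
  LightningAttention.build(
    embed_dim: 287,
    hidden_size: 256,
    block_size: 64,
    seq_len: 128
  )

{init_fn, predict_fn} = Axon.build(model)
input = Nx.iota({2, 128, 287}, type: :f32)    # {batch, seq_len, embed_dim}
params = init_fn.(Nx.to_template(input), %{})
predict_fn.(params, input)                    # assumed output shape: {2, 256}
```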
References

- Qin et al., "Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models" (2024)
Summary
Functions
Build a Lightning Attention model.
Build the lightning attention sublayer.
Get the output size of the model.
Types
@type build_opt() ::
        {:embed_dim, pos_integer()}
        | {:hidden_size, pos_integer()}
        | {:num_heads, pos_integer()}
        | {:num_layers, pos_integer()}
        | {:block_size, pos_integer()}
        | {:dropout, float()}
Options for build/1.
Functions
Build a Lightning Attention model.
Options
- :embed_dim - Input embedding dimension (required)
- :hidden_size - Internal hidden dimension (default: 256)
- :num_heads - Number of attention heads (default: 8)
- :num_layers - Number of Lightning Attention blocks (default: 4)
- :block_size - Block size B for chunked attention (default: 64). seq_len must be divisible by this value.
- :dropout - Dropout rate (default: 0.1)
- :seq_len / :window_size - Expected sequence length (default: 60)
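For example, since :window_size is listed as an alias for :seq_len, either key should set the expected sequence length (assumed interchangeable here). Note that the documented defaults (seq_len 60, block_size 64) do not satisfy the divisibility constraint, so pass compatible values explicitly:

```elixir
model = LightningAttention.build(
  embed_dim: 287,
  window_size: 128,  # alias for :seq_len; 128 is divisible by block_size 64
  block_size: 64
)
```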
Returns
An Axon model outputting [batch, hidden_size].
Build the lightning attention sublayer.
This creates the core attention mechanism with both intra-block (softmax) and inter-block (linear) attention pathways.
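Concretely, the per-block recurrence implied by that description is (plain-text sketch; the 1/sqrt(d) scaling is the standard convention and is assumed rather than stated here):

```
KV_0 = 0
KV_b = KV_{b-1} + K_b^T @ V_b
O_b  = softmax(Q_b @ K_b^T / sqrt(d)) @ V_b  +  Q_b @ KV_{b-1}
```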
@spec output_size(keyword()) :: pos_integer()
Get the output size of the model.
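A hedged usage sketch: given that build/1 returns a model outputting [batch, hidden_size], output_size/1 presumably reports that trailing dimension (assumed behavior, inferred from the spec above):

```elixir
LightningAttention.output_size(hidden_size: 256)
#=> 256 (assumed: the configured or default hidden size)
```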