Dual Chunk Attention — context extension via intra-chunk and inter-chunk attention.
DeepSeek's method for handling long sequences (used in Qwen2.5-128K): the sequence is split into fixed-size chunks, standard attention is computed within each chunk, and a separate mechanism attends across chunk summaries. This reduces peak attention memory from O(seq_len^2) to O(chunk_size^2 + num_chunks^2) (see Memory Complexity below for the total cost).
Key Innovation
Instead of computing full quadratic attention over long sequences:
- Intra-chunk attention: Standard multi-head attention within each chunk (local patterns)
- Inter-chunk attention: Attention over chunk summaries (global context)
- Combination: Learnable blending of local and global representations
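A minimal, single-head sketch of this flow in Nx (assumptions: chunk summaries are mean-pooled and the gate is a fixed scalar here, whereas the layer built by this module uses learned projections and a learnable gate; DualChunkSketch and its function names are illustrative, not part of this module's API):

defmodule DualChunkSketch do
  # x: {seq_len, hidden}; seq_len must be divisible by chunk_size.
  def dual_chunk_attention(x, chunk_size, gate \\ 0.5) do
    {seq_len, hidden} = Nx.shape(x)
    num_chunks = div(seq_len, chunk_size)

    # Split into chunks: {num_chunks, chunk_size, hidden}
    chunks = Nx.reshape(x, {num_chunks, chunk_size, hidden})

    # Intra-chunk: scaled dot-product attention inside each chunk (local patterns)
    intra = attention(chunks, chunks, chunks)

    # Inter-chunk: attend over mean-pooled chunk summaries (global context),
    # then expand each summary back over its chunk's positions
    inter =
      chunks
      |> Nx.mean(axes: [1])                             # {num_chunks, hidden}
      |> Nx.new_axis(0)                                 # {1, num_chunks, hidden}
      |> then(&attention(&1, &1, &1))
      |> Nx.squeeze(axes: [0])                          # {num_chunks, hidden}
      |> Nx.new_axis(1)                                 # {num_chunks, 1, hidden}
      |> Nx.broadcast({num_chunks, chunk_size, hidden})

    # Combine local and global pathways, then restore the original shape
    gate
    |> Nx.multiply(inter)
    |> Nx.add(Nx.multiply(1 - gate, intra))
    |> Nx.reshape({seq_len, hidden})
  end

  # Batched scaled dot-product attention over the leading axis: {b, n, d} -> {b, n, d}
  defp attention(q, k, v) do
    {_b, _n, d} = Nx.shape(q)

    q
    |> Nx.dot([2], [0], k, [2], [0])                    # scores: {b, n, n}
    |> Nx.divide(Nx.sqrt(d))
    |> softmax()
    |> Nx.dot([2], [0], v, [1], [0])
  end

  defp softmax(t) do
    exp = Nx.exp(Nx.subtract(t, Nx.reduce_max(t, axes: [2], keep_axes: true)))
    Nx.divide(exp, Nx.sum(exp, axes: [2], keep_axes: true))
  end
end

For example, DualChunkSketch.dual_chunk_attention(Nx.iota({128, 16}, type: :f32), 32) returns a {128, 16} tensor in which every position blends its chunk-local context with a summary of all chunks.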
Architecture
Input [batch, seq_len, embed_dim]
|
Input projection to hidden_size
|
+------------------------------------------------+
| Dual Chunk Attention Block (x num_layers)      |
|                                                |
| LayerNorm -> Dual Chunk Attention              |
|   Split into chunks:                           |
|     [batch, num_chunks, chunk_size, hidden]    |
|   Intra-chunk: MultiHead per chunk             |
|   Inter-chunk: Summarize -> Attend -> Expand   |
|   Combine: gate * inter + (1-gate) * intra     |
| -> Residual                                    |
| LayerNorm -> FFN -> Residual                   |
+------------------------------------------------+
|
Final LayerNorm
|
Last timestep -> [batch, hidden_size]
Memory Complexity
For seq_len = N and chunk_size = C:
- Standard attention: O(N^2)
- Dual Chunk: O((N/C) C^2 + (N/C)^2) = O(NC + N^2/C^2)
With C = sqrt(N), this becomes O(N^1.5) — subquadratic!
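As a quick sanity check on these formulas (a back-of-the-envelope count of attention score entries, ignoring heads and constant factors), take seq_len = 4096 and chunk_size = 64:

n = 4096
c = 64
standard = n * n                      # 16_777_216
dual_chunk = n * c + div(n, c) ** 2   # 262_144 + 4_096 = 266_240
# standard / dual_chunk => roughly 63x fewer score entries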
Usage
model = DualChunk.build(
  embed_dim: 287,
  hidden_size: 256,
  num_heads: 8,
  num_layers: 4,
  chunk_size: 64
)
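A sketch of initializing and running the built model with Axon's standard build/init/predict flow (the batch size of 2 and seq_len of 64 are illustrative, and a bare tensor is assumed to be accepted for the model's single input; seq_len must match what the model was built for and be divisible by chunk_size):

{init_fn, predict_fn} = Axon.build(model)

input = Nx.iota({2, 64, 287}, type: :f32)                # [batch, seq_len, embed_dim]
params = init_fn.(Nx.template({2, 64, 287}, :f32), %{})
output = predict_fn.(params, input)                      # [2, 256] = [batch, hidden_size]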
Constraints
seq_len must be divisible by chunk_size.
References
- DeepSeek context extension methods (2024)
- Qwen2.5 long-context training (Alibaba, 2024)
- "Efficient Long-Range Transformers" survey literature
Summary
Functions
Build a Dual Chunk Attention model.
Build the dual chunk attention sublayer.
Get the output size of the model.
Types
@type build_opt() ::
  {:embed_dim, pos_integer()}
  | {:hidden_size, pos_integer()}
  | {:num_heads, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:chunk_size, pos_integer()}
  | {:dropout, float()}
Options for build/1.
Functions
Build a Dual Chunk Attention model.
Options
- :embed_dim - Input embedding dimension (required)
- :hidden_size - Internal hidden dimension (default: 256)
- :num_heads - Number of attention heads (default: 8)
- :num_layers - Number of Dual Chunk Attention blocks (default: 4)
- :chunk_size - Chunk size C for chunked attention (default: 64). seq_len must be divisible by this value.
- :dropout - Dropout rate (default: 0.1)
- :seq_len / :window_size - Expected sequence length (default: 64)
Returns
An Axon model outputting [batch, hidden_size].
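Because the result is an ordinary Axon graph, it can be extended with task-specific layers. For example, a hypothetical 10-class classification head (not part of this module):

classifier =
  model
  |> Axon.dropout(rate: 0.1)
  |> Axon.dense(10)
  |> Axon.softmax()        # [batch, 10] class probabilities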
Build the dual chunk attention sublayer.
This creates the core attention mechanism with both intra-chunk and inter-chunk attention pathways.
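For orientation, the learnable blending of the two pathways could be expressed with standard Axon combinators roughly as below (a sketch only; intra, inter, and hidden_size stand for the two pathway outputs and the hidden dimension, and the actual sublayer may wire the gate differently):

# Gate computed from both pathways, then used for a convex combination
gate =
  Axon.concatenate(intra, inter)
  |> Axon.dense(hidden_size)
  |> Axon.sigmoid()

one_minus_gate = Axon.nx(gate, fn g -> Nx.subtract(1.0, g) end)

combined =
  Axon.add(
    Axon.multiply(gate, inter),
    Axon.multiply(one_minus_gate, intra)
  )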
@spec output_size(keyword()) :: pos_integer()
Get the output size of the model.