NSA: Native Sparse Attention (DeepSeek-AI, 2025).
A hardware-aligned, three-path sparse attention mechanism for efficient long-context modeling. It combines global context, fine-grained retrieval, and local attention in parallel paths.
Key Innovation: Hardware-Aligned Sparse Attention
Instead of standard full quadratic attention, NSA uses three complementary sparse attention patterns that can be computed efficiently on modern hardware:
- Compressed Tokens: Global context via pooled/compressed sequences
- Top-k Blocks: Fine-grained retrieval of most relevant key-value blocks
- Sliding Window: Local attention for recent context
Architecture
Input [batch, seq_len, embed_dim]
                      |
                      v
+-------------------------------------------+
|          Native Sparse Attention          |
|                                           |
|          Q, K, V = Linear(input)          |
|                                           |
| +-----------+ +-------------+ +---------+ |
| | Compress  | | Top-k       | | Sliding | |
| | (global)  | | Blocks      | | Window  | |
| |           | | (retrieval) | | (local) | |
| +-----------+ +-------------+ +---------+ |
|       |              |             |      |
|       v              v             v      |
|    attn_c         attn_b        attn_w    |
|       |              |             |      |
|       +--------------+-------------+      |
|                      |                    |
|                      v                    |
|          gate_weights (learnable)         |
|                      |                    |
|                      v                    |
|   weighted_sum(attn_c, attn_b, attn_w)    |
+-------------------------------------------+
                      |
                      v
[batch, seq_len, embed_dim] or [batch, hidden_size]
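The gated merge at the bottom of the diagram can be sketched in plain Nx. This is a minimal illustration, not this library's internal code: the module, the parameter name gate_w (one logit per path), and the softmax normalization over the three paths are all assumptions made for the sketch.

defmodule NSASketch.Gate do
  import Nx.Defn

  # x:                      [batch, seq, hidden] input to the gate projection
  # attn_c, attn_b, attn_w: [batch, seq, hidden] outputs of the three paths
  # gate_w:                 [hidden, 3] assumed learned gate projection
  defn combine(x, attn_c, attn_b, attn_w, gate_w) do
    logits = Nx.dot(x, gate_w)                              # [batch, seq, 3]
    max = Nx.reduce_max(logits, axes: [2], keep_axes: true)
    exp = Nx.exp(logits - max)
    gates = exp / Nx.sum(exp, axes: [2], keep_axes: true)   # softmax over paths

    g_c = Nx.slice_along_axis(gates, 0, 1, axis: 2)         # [batch, seq, 1]
    g_b = Nx.slice_along_axis(gates, 1, 1, axis: 2)
    g_w = Nx.slice_along_axis(gates, 2, 1, axis: 2)

    # weighted_sum(attn_c, attn_b, attn_w); gates broadcast over hidden
    g_c * attn_c + g_b * attn_b + g_w * attn_w
  end
end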
Three Paths
1. Compressed Tokens (Global Context)
Pool Q/K/V into fewer tokens using a strided convolution with stride compression_ratio. Softmax attention over the compressed sequence then runs over n/r positions instead of n, where r is the compression ratio.
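A minimal sketch of the compression step, with plain strided mean-pooling standing in for the learned strided convolution (module and function names are hypothetical):

defmodule NSASketch.Compress do
  import Nx.Defn

  # t: [batch, seq, hidden]; seq must divide evenly by :ratio in this sketch
  defn compress(t, opts \\ []) do
    opts = keyword!(opts, ratio: 4)
    r = opts[:ratio]
    {batch, seq, hidden} = Nx.shape(t)

    t
    |> Nx.reshape({batch, div(seq, r), r, hidden})
    |> Nx.mean(axes: [2])                          # [batch, seq / r, hidden]
  end
end

Applying compress/2 to Q, K, and V shortens the sequence the global path attends over by the compression ratio.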
2. Top-k Blocks (Fine-grained Retrieval)
- Divide K/V into blocks of block_size
- Compute block-level scores: dot(Q, mean(K_block))
- Select top num_selected_blocks blocks
- Compute attention within the selected blocks (see the sketch below)
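A sketch of the scoring and selection steps in Nx (hypothetical names; gathering K/V from the selected blocks and attending within them is omitted for brevity):

defmodule NSASketch.TopKBlocks do
  import Nx.Defn

  # q, k: [batch, seq, hidden]; seq must divide evenly by :block_size here
  defn select_blocks(q, k, opts \\ []) do
    opts = keyword!(opts, block_size: 16, num_selected_blocks: 8)
    b = opts[:block_size]
    {batch, seq, hidden} = Nx.shape(k)

    # Block summaries: mean over each key block -> [batch, seq / b, hidden]
    k_blocks =
      k
      |> Nx.reshape({batch, div(seq, b), b, hidden})
      |> Nx.mean(axes: [2])

    # Block-level scores: dot(Q, mean(K_block)) -> [batch, seq, seq / b]
    scores = Nx.dot(q, [2], [0], k_blocks, [2], [0])

    # Per-query indices of the highest-scoring blocks
    {_values, indices} = Nx.top_k(scores, k: opts[:num_selected_blocks])
    indices                                # [batch, seq, num_selected_blocks]
  end
end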
3. Sliding Window (Local)
Standard local attention over the last window_size tokens for recent context.
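A sketch of the corresponding mask, assuming a causal window in which query i sees keys i - window_size + 1 .. i (module name hypothetical):

defmodule NSASketch.Window do
  import Nx.Defn

  defn mask(opts \\ []) do
    opts = keyword!(opts, seq_len: 256, window_size: 64)
    n = opts[:seq_len]
    w = opts[:window_size]

    i = Nx.iota({n, 1})                 # query positions
    j = Nx.iota({1, n})                 # key positions

    # 1 where key j lies in the causal window (i - w, i] of query i
    Nx.logical_and(j <= i, j > i - w)   # [seq, seq] 0/1 mask
  end
end

Before the softmax, positions where the mask is 0 are typically set to a large negative value, e.g. via Nx.select/3.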
Usage
model = NSA.build(
embed_dim: 287,
hidden_size: 256,
num_heads: 8,
head_dim: 32,
window_size: 64,
block_size: 16,
num_selected_blocks: 8,
compression_ratio: 4,
num_layers: 4
)
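The returned value is an ordinary Axon graph, so it can be initialized and run in the usual way. A hypothetical end-to-end snippet (exact init/predict calling conventions vary slightly across Axon versions):

model = NSA.build(embed_dim: 287, num_layers: 4)

{init_fn, predict_fn} = Axon.build(model)

# Initialize parameters from an input template: [batch, seq_len, embed_dim]
params = init_fn.(Nx.template({1, 256, 287}, :f32), %{})

# Forward pass on dummy input; output is [batch, hidden_size] = {1, 256}
input = Nx.broadcast(0.0, {1, 256, 287})
output = predict_fn.(params, input)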
References
- Paper: "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" (arXiv:2502.11089)
- Authors: Yuan et al., DeepSeek-AI (2025)
Summary
Functions
Build an NSA model for sequence processing.
Build the NSA attention layer with three parallel sparse paths.
Build a single NSA transformer block.
Get the output size of an NSA model.
Calculate approximate parameter count for an NSA model.
Get recommended defaults.
Types
@type build_opt() ::
  {:block_size, pos_integer()}
  | {:compression_ratio, pos_integer()}
  | {:dropout, float()}
  | {:embed_dim, pos_integer()}
  | {:head_dim, pos_integer()}
  | {:hidden_size, pos_integer()}
  | {:num_heads, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:num_selected_blocks, pos_integer()}
  | {:seq_len, pos_integer()}
  | {:window_size, pos_integer()}
Options for build/1.
Functions
Build an NSA model for sequence processing.
Options
- :embed_dim - Size of input embedding per timestep (required)
- :hidden_size - Internal hidden dimension (default: 256)
- :num_heads - Number of attention heads (default: 8)
- :head_dim - Dimension per head (default: 32)
- :window_size - Sliding window size for local attention (default: 64)
- :block_size - Block size for top-k selection (default: 16)
- :num_selected_blocks - Number of blocks to select per query (default: 8)
- :compression_ratio - Compression ratio for global path (default: 4)
- :num_layers - Number of NSA blocks (default: 4)
- :dropout - Dropout rate (default: 0.1)
- :seq_len - Expected sequence length (default: 256)
Returns
An Axon model that outputs [batch, hidden_size] from the last position.
Build the NSA attention layer with three parallel sparse paths.
Build a single NSA transformer block.
@spec output_size(keyword()) :: non_neg_integer()
Get the output size of an NSA model.
@spec param_count(keyword()) :: non_neg_integer()
Calculate approximate parameter count for an NSA model.
@spec recommended_defaults() :: keyword()
Get recommended defaults.
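For illustration, the helpers above compose as follows (return values are placeholders, not measured numbers):

# Start from the recommended defaults and set the required :embed_dim
opts = NSA.recommended_defaults() |> Keyword.put(:embed_dim, 287)

NSA.output_size(opts)   # => the model's output width (hidden_size)
NSA.param_count(opts)   # => approximate number of parameters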