NSA: Native Sparse Attention (DeepSeek-AI, 2025).
A hardware-aligned, three-path sparse attention mechanism for efficient long-context modeling. It combines global context, fine-grained retrieval, and local attention in parallel paths.
Key Innovation: Hardware-Aligned Sparse Attention
Instead of standard full quadratic attention, NSA uses three complementary sparse attention patterns that can be computed efficiently on modern hardware:
- Compressed Tokens: Global context via pooled/compressed sequences
- Top-k Blocks: Fine-grained retrieval of most relevant key-value blocks
- Sliding Window: Local attention for recent context
Architecture
Input [batch, seq_len, embed_dim]
                      |
                      v
+-------------------------------------------+
|          Native Sparse Attention          |
|                                           |
|          Q, K, V = Linear(input)          |
|                                           |
| +-----------+ +-------------+ +---------+ |
| | Compress  | | Top-k       | | Sliding | |
| | (global)  | | Blocks      | | Window  | |
| |           | | (retrieval) | | (local) | |
| +-----------+ +-------------+ +---------+ |
|       |              |             |      |
|       v              v             v      |
|    attn_c         attn_b        attn_w    |
|       |              |             |      |
|       +--------------+-------------+      |
|                      |                    |
|                      v                    |
|          gate_weights (learnable)         |
|                      |                    |
|                      v                    |
|   weighted_sum(attn_c, attn_b, attn_w)    |
+-------------------------------------------+
                      |
                      v
[batch, seq_len, embed_dim] or [batch, hidden_size]
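The gated merge at the bottom of the diagram can be sketched in plain Nx. This is a minimal illustration, not this library's internal code: the module, the parameter name gate_w (one logit per path), and the softmax normalization over the three paths are all assumptions made for the sketch.

defmodule NSASketch.Gate do
  import Nx.Defn

  # x:                      [batch, seq, hidden] input to the gate projection
  # attn_c, attn_b, attn_w: [batch, seq, hidden] outputs of the three paths
  # gate_w:                 [hidden, 3] assumed learned gate projection
  defn combine(x, attn_c, attn_b, attn_w, gate_w) do
    logits = Nx.dot(x, gate_w)                              # [batch, seq, 3]
    max = Nx.reduce_max(logits, axes: [2], keep_axes: true)
    exp = Nx.exp(logits - max)
    gates = exp / Nx.sum(exp, axes: [2], keep_axes: true)   # softmax over paths

    g_c = Nx.slice_along_axis(gates, 0, 1, axis: 2)         # [batch, seq, 1]
    g_b = Nx.slice_along_axis(gates, 1, 1, axis: 2)
    g_w = Nx.slice_along_axis(gates, 2, 1, axis: 2)

    # weighted_sum(attn_c, attn_b, attn_w); gates broadcast over hidden
    g_c * attn_c + g_b * attn_b + g_w * attn_w
  end
end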
Three Paths
1. Compressed Tokens (Global Context)
Pool Q/K/V into fewer tokens using a strided convolution with stride compression_ratio. Softmax attention over the compressed sequence then runs over n/r positions instead of n, where r is the compression ratio.
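A minimal sketch of the compression step, with plain strided mean-pooling standing in for the learned strided convolution (module and function names are hypothetical):

defmodule NSASketch.Compress do
  import Nx.Defn

  # t: [batch, seq, hidden]; seq must divide evenly by :ratio in this sketch
  defn compress(t, opts \\ []) do
    opts = keyword!(opts, ratio: 4)
    r = opts[:ratio]
    {batch, seq, hidden} = Nx.shape(t)

    t
    |> Nx.reshape({batch, div(seq, r), r, hidden})
    |> Nx.mean(axes: [2])                          # [batch, seq / r, hidden]
  end
end

Applying compress/2 to Q, K, and V shortens the sequence the global path attends over by the compression ratio.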
2. Top-k Blocks (Fine-grained Retrieval)
- Divide K/V into blocks of block_size
- Compute block-level scores: dot(Q, mean(K_block))
- Select top num_selected_blocks blocks
- Compute attention within the selected blocks (see the sketch below)
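A sketch of the scoring and selection steps in Nx (hypothetical names; gathering K/V from the selected blocks and attending within them is omitted for brevity):

defmodule NSASketch.TopKBlocks do
  import Nx.Defn

  # q, k: [batch, seq, hidden]; seq must divide evenly by :block_size here
  defn select_blocks(q, k, opts \\ []) do
    opts = keyword!(opts, block_size: 16, num_selected_blocks: 8)
    b = opts[:block_size]
    {batch, seq, hidden} = Nx.shape(k)

    # Block summaries: mean over each key block -> [batch, seq / b, hidden]
    k_blocks =
      k
      |> Nx.reshape({batch, div(seq, b), b, hidden})
      |> Nx.mean(axes: [2])

    # Block-level scores: dot(Q, mean(K_block)) -> [batch, seq, seq / b]
    scores = Nx.dot(q, [2], [0], k_blocks, [2], [0])

    # Per-query indices of the highest-scoring blocks
    {_values, indices} = Nx.top_k(scores, k: opts[:num_selected_blocks])
    indices                                # [batch, seq, num_selected_blocks]
  end
end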
3. Sliding Window (Local)
Standard local attention over the last window_size tokens for recent context.
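A sketch of the corresponding mask, assuming a causal window in which query i sees keys i - window_size + 1 .. i (module name hypothetical):

defmodule NSASketch.Window do
  import Nx.Defn

  defn mask(opts \\ []) do
    opts = keyword!(opts, seq_len: 256, window_size: 64)
    n = opts[:seq_len]
    w = opts[:window_size]

    i = Nx.iota({n, 1})                 # query positions
    j = Nx.iota({1, n})                 # key positions

    # 1 where key j lies in the causal window (i - w, i] of query i
    Nx.logical_and(j <= i, j > i - w)   # [seq, seq] 0/1 mask
  end
end

Before the softmax, positions where the mask is 0 are typically set to a large negative value, e.g. via Nx.select/3.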
Usage
model = NSA.build(
embed_dim: 287,
hidden_size: 256,
num_heads: 8,
head_dim: 32,
window_size: 64,
block_size: 16,
num_selected_blocks: 8,
compression_ratio: 4,
num_layers: 4
)
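The returned value is an ordinary Axon graph, so it can be initialized and run in the usual way. A hypothetical end-to-end snippet (exact init/predict calling conventions vary slightly across Axon versions):

model = NSA.build(embed_dim: 287, num_layers: 4)

{init_fn, predict_fn} = Axon.build(model)

# Initialize parameters from an input template: [batch, seq_len, embed_dim]
params = init_fn.(Nx.template({1, 256, 287}, :f32), %{})

# Forward pass on dummy input; output is [batch, hidden_size] = {1, 256}
input = Nx.broadcast(0.0, {1, 256, 287})
output = predict_fn.(params, input)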
References
- Paper: "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" (arXiv:2502.11089)
- Authors: Yuan et al., DeepSeek-AI (2025)
Summary
Functions
Build an NSA model for sequence processing.
Build the NSA attention layer with three parallel sparse paths.
Build a single NSA transformer block.
Get the output size of an NSA model.
Calculate approximate parameter count for an NSA model.
Get recommended defaults.
Types
@type build_opt() ::
  {:block_size, pos_integer()}
  | {:compression_ratio, pos_integer()}
  | {:dropout, float()}
  | {:embed_dim, pos_integer()}
  | {:head_dim, pos_integer()}
  | {:hidden_size, pos_integer()}
  | {:num_heads, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:num_selected_blocks, pos_integer()}
  | {:seq_len, pos_integer()}
  | {:window_size, pos_integer()}
Options for build/1.
Functions
Build an NSA model for sequence processing.
Options
- :embed_dim - Size of input embedding per timestep (required)
- :hidden_size - Internal hidden dimension (default: 256)
- :num_heads - Number of attention heads (default: 8)
- :head_dim - Dimension per head (default: 32)
- :window_size - Sliding window size for local attention (default: 64)
- :block_size - Block size for top-k selection (default: 16)
- :num_selected_blocks - Number of blocks to select per query (default: 8)
- :compression_ratio - Compression ratio for global path (default: 4)
- :num_layers - Number of NSA blocks (default: 4)
- :dropout - Dropout rate (default: 0.1)
- :seq_len - Expected sequence length (default: 256)
Returns
An Axon model that outputs [batch, hidden_size] from the last position.
Build the NSA attention layer with three parallel sparse paths.
Build a single NSA transformer block.
@spec output_size(keyword()) :: non_neg_integer()
Get the output size of an NSA model.
@spec param_count(keyword()) :: non_neg_integer()
Calculate approximate parameter count for an NSA model.
@spec recommended_defaults() :: keyword()
Get recommended defaults.
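For illustration, the helpers above compose as follows (return values are placeholders, not measured numbers):

# Start from the recommended defaults and set the required :embed_dim
opts = NSA.recommended_defaults() |> Keyword.put(:embed_dim, 287)

NSA.output_size(opts)   # => the model's output width (hidden_size)
NSA.param_count(opts)   # => approximate number of parameters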