Edifice.Attention.NSA (Edifice v0.2.0)


NSA: Native Sparse Attention (DeepSeek-V3/V4).

A hardware-aligned, three-path sparse attention mechanism for efficient long-context processing: global context, fine-grained retrieval, and local attention are computed in parallel paths and combined.

Key Innovation: Hardware-Aligned Sparse Attention

Instead of standard full quadratic attention, NSA uses three complementary sparse attention patterns that can be computed efficiently on modern hardware:

  1. Compressed Tokens: Global context via pooled/compressed sequences
  2. Top-k Blocks: Fine-grained retrieval of most relevant key-value blocks
  3. Sliding Window: Local attention for recent context

Architecture

Input [batch, seq_len, embed_dim]
      |
      v
+------------------------------------------------+
|             Native Sparse Attention            |
|                                                |
|  Q, K, V = Linear(input)                       |
|                                                |
|  +-----------+  +-------------+  +-----------+ |
|  | Compress  |  | Top-k       |  | Sliding   | |
|  | (global)  |  | Blocks      |  | Window    | |
|  |           |  | (retrieval) |  | (local)   | |
|  +-----------+  +-------------+  +-----------+ |
|        |               |               |       |
|        v               v               v       |
|      attn_c          attn_b          attn_w    |
|        |               |               |       |
|        +---------------+---------------+       |
|                        |                       |
|                        v                       |
|           gate_weights (learnable)             |
|                        |                       |
|                        v                       |
|      weighted_sum(attn_c, attn_b, attn_w)      |
+------------------------------------------------+
      |
      v
[batch, seq_len, embed_dim] or [batch, hidden_size]
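
The three path outputs are mixed by a learned, per-token gate. The following is a minimal sketch of that combination step only, not the module's actual layer graph: it assumes path outputs attn_c, attn_b, attn_w of shape [batch, seq_len, hidden], a layer input x of the same shape, and an illustrative learned [hidden, 3] gate projection gate_kernel.

defmodule GateSketch do
  import Nx.Defn

  # Sketch: per-token softmax gate over the three path outputs.
  defn combine(x, attn_c, attn_b, attn_w, gate_kernel) do
    # Gate logits per token over the three paths: [batch, seq, 3]
    logits = Nx.dot(x, [2], gate_kernel, [0])

    # Softmax over the path axis
    max = Nx.reduce_max(logits, axes: [-1], keep_axes: true)
    exp = Nx.exp(logits - max)
    gates = exp / Nx.sum(exp, axes: [-1], keep_axes: true)

    # Broadcast each gate weight over the hidden dimension and mix
    g_c = Nx.slice_along_axis(gates, 0, 1, axis: 2)
    g_b = Nx.slice_along_axis(gates, 1, 1, axis: 2)
    g_w = Nx.slice_along_axis(gates, 2, 1, axis: 2)

    g_c * attn_c + g_b * attn_b + g_w * attn_w
  end
end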

Three Paths

1. Compressed Tokens (Global Context)

Pool Q/K/V into fewer tokens using a strided convolution with stride compression_ratio, then compute softmax attention over the compressed sequence. Attention cost on this path therefore scales with the compressed length (seq_len / compression_ratio) rather than the full sequence length.
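
As a rough illustration of the compression step only (mean pooling stands in for the module's learned strided convolution), non-overlapping groups of compression_ratio tokens are collapsed into one; seq_len is assumed to be divisible by the ratio:

# k: [batch, seq, dim] -> [batch, div(seq, r), dim]
compress = fn k, r ->
  {batch, seq, dim} = Nx.shape(k)

  k
  |> Nx.reshape({batch, div(seq, r), r, dim})
  |> Nx.mean(axes: [2])
end

# usage: compressed_k = compress.(k, 4)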

2. Top-k Blocks (Fine-grained Retrieval)

  • Divide K/V into blocks of block_size
  • Compute block-level scores: dot(Q, mean(K_block))
  • Select top num_selected_blocks blocks
  • Compute attention within the selected blocks (see the sketch below)
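
A minimal eager-mode sketch of the selection logic for a single query vector (shapes and the helper name are illustrative; the module batches this across heads and positions):

# q: [dim], k: [seq, dim]; returns the indices of the highest-scoring blocks
select_blocks = fn q, k, block_size, num_selected ->
  {seq, dim} = Nx.shape(k)
  num_blocks = div(seq, block_size)

  # Block summaries: mean of the keys in each block -> [num_blocks, dim]
  block_means =
    k
    |> Nx.reshape({num_blocks, block_size, dim})
    |> Nx.mean(axes: [1])

  # Score each block by dot(q, mean(K_block)) and keep the top-k indices
  scores = Nx.dot(block_means, q)

  scores
  |> Nx.argsort(direction: :desc)
  |> Nx.slice_along_axis(0, num_selected, axis: 0)
end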

3. Sliding Window (Local)

Standard causal local attention in which each position attends to the most recent window_size tokens, capturing short-range context.
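
Equivalently, position i attends to positions max(i - window_size + 1, 0)..i. A sketch of the corresponding boolean {seq, seq} attention mask (names are illustrative):

window_mask = fn seq, window_size ->
  rows = Nx.iota({seq, seq}, axis: 0)
  cols = Nx.iota({seq, seq}, axis: 1)

  # causal (col <= row) and within the trailing window (col > row - window_size)
  Nx.logical_and(
    Nx.less_equal(cols, rows),
    Nx.greater(cols, Nx.subtract(rows, window_size))
  )
end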

Usage

model = NSA.build(
  embed_dim: 287,
  hidden_size: 256,
  num_heads: 8,
  head_dim: 32,
  window_size: 64,
  block_size: 16,
  num_selected_blocks: 8,
  compression_ratio: 4,
  num_layers: 4
)
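
The result is a regular Axon graph, so it can be initialized and run with the usual Axon workflow. A sketch, assuming an input of shape [batch, seq_len, embed_dim] matching the options above:

{init_fn, predict_fn} = Axon.build(model)

params = init_fn.(Nx.template({1, 256, 287}, :f32), %{})
output = predict_fn.(params, Nx.broadcast(0.0, {1, 256, 287}))
# output: {1, 256}, i.e. [batch, hidden_size] taken from the last position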

References

  • Paper: "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
  • Authors: DeepSeek-AI (2025)
  • Used in: DeepSeek-V3, DeepSeek-V4

Summary

Types

Options for build/1.

Functions

Build an NSA model for sequence processing.

Build the NSA attention layer with three parallel sparse paths.

Build a single NSA transformer block.

Get the output size of an NSA model.

Calculate approximate parameter count for an NSA model.

Get recommended defaults.

Types

build_opt()

@type build_opt() ::
  {:block_size, pos_integer()}
  | {:compression_ratio, pos_integer()}
  | {:dropout, float()}
  | {:embed_dim, pos_integer()}
  | {:head_dim, pos_integer()}
  | {:hidden_size, pos_integer()}
  | {:num_heads, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:num_selected_blocks, pos_integer()}
  | {:seq_len, pos_integer()}
  | {:window_size, pos_integer()}

Options for build/1.

Functions

build(opts \\ [])

@spec build([build_opt()]) :: Axon.t()

Build an NSA model for sequence processing.

Options

  • :embed_dim - Size of input embedding per timestep (required)
  • :hidden_size - Internal hidden dimension (default: 256)
  • :num_heads - Number of attention heads (default: 8)
  • :head_dim - Dimension per head (default: 32)
  • :window_size - Sliding window size for local attention (default: 64)
  • :block_size - Block size for top-k selection (default: 16)
  • :num_selected_blocks - Number of blocks to select per query (default: 8)
  • :compression_ratio - Compression ratio for global path (default: 4)
  • :num_layers - Number of NSA blocks (default: 4)
  • :dropout - Dropout rate (default: 0.1)
  • :seq_len - Expected sequence length (default: 256)

Returns

An Axon model that outputs [batch, hidden_size] from the last position.

build_nsa_attention(input, opts)

@spec build_nsa_attention(
  Axon.t(),
  keyword()
) :: Axon.t()

Build the NSA attention layer with three parallel sparse paths.
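
An illustrative composition (the option names here mirror build/1 and are assumptions about this function's keyword list):

input = Axon.input("sequence", shape: {nil, 256, 256})

attn =
  NSA.build_nsa_attention(input,
    hidden_size: 256,
    num_heads: 8,
    head_dim: 32,
    window_size: 64,
    block_size: 16,
    num_selected_blocks: 8,
    compression_ratio: 4
  )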

build_nsa_block(input, opts)

@spec build_nsa_block(
  Axon.t(),
  keyword()
) :: Axon.t()

Build a single NSA transformer block.

output_size(opts \\ [])

@spec output_size(keyword()) :: non_neg_integer()

Get the output size of an NSA model.
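
Since build/1 returns a model with output shape [batch, hidden_size], the output size presumably tracks :hidden_size; for example (illustrative):

NSA.output_size(hidden_size: 512)
#=> 512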

param_count(opts)

@spec param_count(keyword()) :: non_neg_integer()

Calculate approximate parameter count for an NSA model.