# `Edifice.Attention.NSA`
[🔗](https://github.com/blasphemetheus/edifice/blob/main/lib/edifice/attention/nsa.ex#L1)

NSA: Native Sparse Attention (DeepSeek-V3/V4).

<!-- verified: true, date: 2026-02-23 -->

Hardware-aligned three-path sparse attention mechanism that achieves
efficient long-context attention by combining global context, fine-grained
retrieval, and local attention in parallel paths.

## Key Innovation: Hardware-Aligned Sparse Attention

Instead of standard full quadratic attention, NSA uses three complementary
sparse attention patterns that can be computed efficiently on modern hardware:

1. **Compressed Tokens**: Global context via pooled/compressed sequences
2. **Top-k Blocks**: Fine-grained retrieval of most relevant key-value blocks
3. **Sliding Window**: Local attention for recent context

## Architecture

```
Input [batch, seq_len, embed_dim]
      |
      v
+--------------------------------------------+
|          Native Sparse Attention           |
|                                            |
|  Q, K, V = Linear(input)                   |
|                                            |
|  +-----------+  +------------+  +--------+ |
|  | Compress  |  | Top-k      |  | Slide  | |
|  | (global)  |  | Blocks     |  | Window | |
|  |           |  | (retrieval)|  | (local)| |
|  +-----------+  +------------+  +--------+ |
|        |               |             |     |
|        v               v             v     |
|      attn_c          attn_b        attn_w  |
|        |               |             |     |
|        +---------------+-------------+     |
|                        |                   |
|                        v                   |
|             gate_weights (learnable)       |
|                        |                   |
|                        v                   |
|       weighted_sum(attn_c, attn_b, attn_w) |
+--------------------------------------------+
      |
      v
[batch, seq_len, embed_dim] or [batch, hidden_size]
```
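
The combination at the bottom of the diagram is a learnable gate: each path's output is scaled by a normalized per-path weight and the results are summed. A minimal Nx sketch of that step, using illustrative names that are not the module's internals (the real gate may be richer, e.g. computed per token):

```elixir
# Sketch: gate three path outputs [batch, seq_len, hidden] with a learnable
# 3-element logit vector. Names are illustrative, not the library's internals.
combine = fn attn_c, attn_b, attn_w, gate_logits ->
  # Softmax over the three gate logits so the path weights sum to 1
  gates = Nx.divide(Nx.exp(gate_logits), Nx.sum(Nx.exp(gate_logits)))

  attn_c
  |> Nx.multiply(gates[0])
  |> Nx.add(Nx.multiply(attn_b, gates[1]))
  |> Nx.add(Nx.multiply(attn_w, gates[2]))
end
```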

## Three Paths

### 1. Compressed Tokens (Global Context)
Pool Q/K/V into fewer tokens using a strided convolution controlled by
compression_ratio, then compute softmax attention over the compressed
sequence. With ratio r, the attended length drops from n to roughly n/r.
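
For intuition, the compression can be pictured as averaging non-overlapping windows along the sequence axis. A hedged Nx sketch (the actual layer uses a strided convolution, which may be learned rather than a plain mean):

```elixir
# Sketch: compress a [batch, seq_len, dim] tensor along the sequence axis by
# averaging non-overlapping windows of length `ratio` (the compression_ratio).
# Assumes seq_len is divisible by ratio; the real layer may learn this pooling.
compress = fn tensor, ratio ->
  {batch, seq_len, dim} = Nx.shape(tensor)

  tensor
  |> Nx.reshape({batch, div(seq_len, ratio), ratio, dim})
  |> Nx.mean(axes: [2])
end

keys = Nx.iota({1, 16, 4}, type: :f32)
compress.(keys, 4)
#=> shape {1, 4, 4}
```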

### 2. Top-k Blocks (Fine-grained Retrieval)
- Divide K/V into blocks of block_size
- Compute block-level scores: dot(Q, mean(K_block))
- Select top num_selected_blocks blocks
- Compute attention within the selected blocks (see the sketch after this list)
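
A hedged Nx sketch of the selection step: score each block by the dot product between a query and the block's mean key, then keep the top `num_selected_blocks` indices. Names and shapes are illustrative and may differ from the implementation:

```elixir
# Sketch: pick the most relevant key blocks for one query vector.
# `query` is [dim], `keys` is [seq_len, dim]; shapes/names are illustrative.
select_blocks = fn query, keys, block_size, num_selected ->
  {seq_len, dim} = Nx.shape(keys)
  num_blocks = div(seq_len, block_size)

  # Mean key per block: [num_blocks, dim]
  block_means =
    keys
    |> Nx.reshape({num_blocks, block_size, dim})
    |> Nx.mean(axes: [1])

  # Block-level scores: dot(Q, mean(K_block)) -> [num_blocks]
  scores = Nx.dot(block_means, query)

  # Indices of the highest-scoring blocks
  {_values, indices} = Nx.top_k(scores, k: num_selected)
  indices
end

query = Nx.iota({8}, type: :f32)
keys = Nx.iota({64, 8}, type: :f32)
select_blocks.(query, keys, 16, 2)
```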

### 3. Sliding Window (Local)
Standard local attention over the last window_size tokens for recent context.
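
The sliding-window pattern can be expressed as a mask over the score matrix: query position i may attend to key position j only if j is not in the future and lies within the last window_size positions. A minimal Nx sketch of that mask (illustrative only):

```elixir
# Sketch: boolean mask where query position i may attend key position j
# only if j <= i and i - j < window_size.
window_mask = fn seq_len, window_size ->
  rows = Nx.iota({seq_len, 1})
  cols = Nx.iota({1, seq_len})

  causal = Nx.greater_equal(rows, cols)
  recent = Nx.less(Nx.subtract(rows, cols), window_size)

  Nx.logical_and(causal, recent)
end

window_mask.(6, 3)
```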

## Usage

    model = NSA.build(
      embed_dim: 287,
      hidden_size: 256,
      num_heads: 8,
      head_dim: 32,
      window_size: 64,
      block_size: 16,
      num_selected_blocks: 8,
      compression_ratio: 4,
      num_layers: 4
    )
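
To run the built model, compile it with `Axon.build/1` as usual. The sketch below assumes the model exposes a single input, so a bare tensor can be passed; if it uses a named input, supply a map keyed by that name instead.

```elixir
# Sketch: compile and run an NSA model. The template shape and the single-input
# assumption are illustrative; check the module for the actual input spec.
model =
  NSA.build(
    embed_dim: 287,
    hidden_size: 256,
    num_heads: 8,
    seq_len: 64
  )

{init_fn, predict_fn} = Axon.build(model)

template = Nx.template({1, 64, 287}, :f32)
params = init_fn.(template, %{})

input = Nx.broadcast(0.0, {1, 64, 287})
predict_fn.(params, input)
#=> tensor of shape {1, 256} ([batch, hidden_size])
```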

## References

- Paper: "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
- Authors: DeepSeek-AI (2025)
- Used in: DeepSeek-V3, DeepSeek-V4

# `build_opt`

```elixir
@type build_opt() ::
  {:block_size, pos_integer()}
  | {:compression_ratio, pos_integer()}
  | {:dropout, float()}
  | {:embed_dim, pos_integer()}
  | {:head_dim, pos_integer()}
  | {:hidden_size, pos_integer()}
  | {:num_heads, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:num_selected_blocks, pos_integer()}
  | {:seq_len, pos_integer()}
  | {:window_size, pos_integer()}
```

Options for `build/1`.

# `build`

```elixir
@spec build([build_opt()]) :: Axon.t()
```

Build an NSA model for sequence processing.

## Options

  - `:embed_dim` - Size of input embedding per timestep (required)
  - `:hidden_size` - Internal hidden dimension (default: 256)
  - `:num_heads` - Number of attention heads (default: 8)
  - `:head_dim` - Dimension per head (default: 32)
  - `:window_size` - Sliding window size for local attention (default: 64)
  - `:block_size` - Block size for top-k selection (default: 16)
  - `:num_selected_blocks` - Number of blocks to select per query (default: 8)
  - `:compression_ratio` - Compression ratio for global path (default: 4)
  - `:num_layers` - Number of NSA blocks (default: 4)
  - `:dropout` - Dropout rate (default: 0.1)
  - `:seq_len` - Expected sequence length (default: 256)

## Returns

  An Axon model that outputs `[batch, hidden_size]` from the last position.

# `build_nsa_attention`

```elixir
@spec build_nsa_attention(
  Axon.t(),
  keyword()
) :: Axon.t()
```

Build the NSA attention layer with three parallel sparse paths.

# `build_nsa_block`

```elixir
@spec build_nsa_block(
  Axon.t(),
  keyword()
) :: Axon.t()
```

Build a single NSA transformer block.
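
For composing blocks manually rather than calling `build/1`, something like the following should work; the input name, feature size, and exact option keys are assumptions, not checked against the implementation:

```elixir
# Sketch: stack one NSA block on top of an Axon input node.
# The input name "state", the feature size, and the option keys are assumptions.
input = Axon.input("state", shape: {nil, 256, 256})

block =
  NSA.build_nsa_block(input,
    hidden_size: 256,
    num_heads: 8,
    head_dim: 32,
    window_size: 64,
    block_size: 16,
    num_selected_blocks: 8,
    compression_ratio: 4
  )
```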

# `output_size`

```elixir
@spec output_size(keyword()) :: non_neg_integer()
```

Get the output size of an NSA model.

# `param_count`

```elixir
@spec param_count(keyword()) :: non_neg_integer()
```

Calculate approximate parameter count for an NSA model.
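
`output_size/1` and `param_count/1` accept the same options as `build/1`, so they can be used to size downstream layers or sanity-check a configuration before building, e.g.:

```elixir
opts = [embed_dim: 287, hidden_size: 256, num_heads: 8, num_layers: 4]

# Feature size of the model's output (the hidden size for this architecture)
NSA.output_size(opts)

# Approximate parameter count for the same configuration
NSA.param_count(opts)
```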

# `recommended_defaults`

```elixir
@spec recommended_defaults() :: keyword()
```

Get recommended defaults.
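
A common pattern is to start from the recommended defaults and override only what the task needs; a hedged sketch, assuming the returned keyword list uses the same option names as `build/1`:

```elixir
# Merge recommended defaults with task-specific overrides before building.
# Assumes the returned keyword list uses the same keys as build/1's options.
opts =
  NSA.recommended_defaults()
  |> Keyword.merge(embed_dim: 287, num_layers: 2)

model = NSA.build(opts)
```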

---

*Consult [api-reference.md](api-reference.md) for the complete listing*
