# `Edifice.Attention.DualChunk`
[🔗](https://github.com/blasphemetheus/edifice/blob/main/lib/edifice/attention/dual_chunk.ex#L1)

Dual Chunk Attention — context extension via intra-chunk and inter-chunk attention.

DeepSeek's method for handling long sequences (used in Qwen2.5-128K): splits sequences
into fixed-size chunks, computes standard attention within each chunk, then uses a
separate mechanism for attending across chunk summaries. This reduces attention-score memory
from O(seq^2) to O(num_chunks * chunk^2 + num_chunks^2); see Memory Complexity below.

## Key Innovation

Instead of computing full quadratic attention over the whole sequence, the layer works in three steps (sketched in code after the list):
1. **Intra-chunk attention**: Standard multi-head attention within each chunk (local patterns)
2. **Inter-chunk attention**: Attention over chunk summaries (global context)
3. **Combination**: Learnable blending of local and global representations
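
A minimal, single-head, unbatched sketch of these three steps in plain Nx (illustration only, not this module's implementation; the mean-pooled summaries, the fixed 0.5 gate, and the `DualChunkSketch` name are assumptions made for clarity):

```elixir
defmodule DualChunkSketch do
  # x: [seq_len, dim]; seq_len must be divisible by chunk_size.
  def dual_chunk_attention(x, chunk_size) do
    {seq_len, dim} = Nx.shape(x)
    num_chunks = div(seq_len, chunk_size)
    chunks = Nx.reshape(x, {num_chunks, chunk_size, dim})

    # 1. Intra-chunk: full attention inside each chunk (local patterns),
    #    O(num_chunks * chunk_size^2) score entries.
    intra_scores =
      Nx.dot(chunks, [2], [0], chunks, [2], [0]) |> Nx.divide(:math.sqrt(dim))

    intra = Nx.dot(softmax(intra_scores), [2], [0], chunks, [1], [0])

    # 2. Inter-chunk: attention over mean-pooled chunk summaries (global context),
    #    O(num_chunks^2) score entries.
    summaries = Nx.mean(chunks, axes: [1])
    inter_scores = Nx.dot(summaries, [1], summaries, [1]) |> Nx.divide(:math.sqrt(dim))
    inter = Nx.dot(softmax(inter_scores), [1], summaries, [0])
    inter = Nx.broadcast(Nx.new_axis(inter, 1), {num_chunks, chunk_size, dim})

    # 3. Combine: fixed 0.5 / 0.5 blend here; the real layer learns a gate.
    Nx.multiply(intra, 0.5)
    |> Nx.add(Nx.multiply(inter, 0.5))
    |> Nx.reshape({seq_len, dim})
  end

  defp softmax(t) do
    m = Nx.reduce_max(t, axes: [-1], keep_axes: true)
    e = Nx.exp(Nx.subtract(t, m))
    Nx.divide(e, Nx.sum(e, axes: [-1], keep_axes: true))
  end
end
```

The actual layer additionally works per head and per batch element and learns the gate, as shown in the Architecture diagram below.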

## Architecture

```
Input [batch, seq_len, embed_dim]
      |
Input projection to hidden_size
      |
+-----------------------------------------------+
|   Dual Chunk Attention Block (x num_layers)   |
|                                               |
|   LayerNorm -> Dual Chunk Attention           |
|     Split into chunks:                        |
|       [batch, num_chunks, chunk_size, hidden] |
|     Intra-chunk: MultiHead per chunk          |
|     Inter-chunk: Summarize -> Attend -> Expand|
|     Combine: gate * inter + (1-gate) * intra  |
|   -> Residual                                 |
|   LayerNorm -> FFN -> Residual                |
+-----------------------------------------------+
      |
Final LayerNorm
      |
Last timestep -> [batch, hidden_size]
```

## Memory Complexity

For seq_len = N and chunk_size = C:
- Standard attention: O(N^2)
- Dual Chunk: O((N/C) * C^2 + (N/C)^2) = O(N*C + N^2/C^2)

With C = sqrt(N), this becomes O(N^1.5) — subquadratic!
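
For example, with N = 4096 and C = 64 (so C = sqrt(N)), standard attention materializes N^2 = 16,777,216 score entries, while the dual chunk form needs N*C + (N/C)^2 = 262,144 + 4,096 = 266,240, roughly a 63x reduction.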

## Usage

    model = DualChunk.build(
      embed_dim: 287,
      hidden_size: 256,
      num_heads: 8,
      num_layers: 4,
      chunk_size: 64
    )
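
Running the built model follows the usual Axon flow. A hedged end-to-end sketch (it assumes the graph exposes a single input, so a bare tensor can be passed directly; `:seq_len` is set to 128 here so that chunking is non-trivial):

```elixir
model =
  DualChunk.build(
    embed_dim: 287,
    hidden_size: 256,
    num_heads: 8,
    num_layers: 4,
    chunk_size: 64,
    # 128 / 64 = 2 chunks; seq_len must stay a multiple of chunk_size
    seq_len: 128
  )

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 128, 287}, :f32), %{})

input = Nx.iota({1, 128, 287}, type: :f32)
output = predict_fn.(params, input)
# Nx.shape(output) => {1, 256}
```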

## Constraints

`seq_len` must be divisible by `chunk_size`.
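
Sequences that do not satisfy this must be padded (or truncated) before reaching the model. A hypothetical helper, not provided by this library, that zero-pads the time axis up to the next multiple of `chunk_size`:

```elixir
# Hypothetical helper, not part of Edifice: zero-pad the time axis of a
# [batch, seq_len, embed_dim] tensor up to the next multiple of chunk_size.
pad_to_chunk = fn tensor, chunk_size ->
  {_batch, seq_len, _dim} = Nx.shape(tensor)
  pad = rem(chunk_size - rem(seq_len, chunk_size), chunk_size)
  Nx.pad(tensor, 0.0, [{0, 0, 0}, {0, pad, 0}, {0, 0, 0}])
end

padded = pad_to_chunk.(Nx.iota({1, 100, 287}, type: :f32), 64)
# Nx.shape(padded) => {1, 128, 287}
```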

## References

- DeepSeek context extension methods (2024)
- Qwen2.5 long-context training (Alibaba, 2024)
- "Efficient Long-Range Transformers" survey literature

# `build_opt`

```elixir
@type build_opt() ::
  {:embed_dim, pos_integer()}
  | {:hidden_size, pos_integer()}
  | {:num_heads, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:chunk_size, pos_integer()}
  | {:dropout, float()}
```

Options for `build/1`.

# `build`

```elixir
@spec build([build_opt()]) :: Axon.t()
```

Build a Dual Chunk Attention model.

## Options

  - `:embed_dim` - Input embedding dimension (required)
  - `:hidden_size` - Internal hidden dimension (default: 256)
  - `:num_heads` - Number of attention heads (default: 8)
  - `:num_layers` - Number of Dual Chunk Attention blocks (default: 4)
  - `:chunk_size` - Chunk size C for chunked attention (default: 64).
    `seq_len` must be divisible by this value.
  - `:dropout` - Dropout rate (default: 0.1)
  - `:seq_len` / `:window_size` - Expected sequence length (default: 64)

## Returns

  An Axon model outputting `[batch, hidden_size]`.

# `build_dual_chunk_attention`

```elixir
@spec build_dual_chunk_attention(
  Axon.t(),
  keyword()
) :: Axon.t()
```

Build the dual chunk attention sublayer.

This creates the core attention mechanism with both intra-chunk and inter-chunk
attention pathways.
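
A hedged composition sketch showing how the sublayer could be wired into one block of the Architecture diagram (whether `build_dual_chunk_attention/2` accepts exactly these options is an assumption; they mirror `build/1`):

```elixir
# Sketch: pre-norm residual block using the dual chunk attention sublayer.
input = Axon.input("features", shape: {nil, 128, 256})

attended =
  input
  |> Axon.layer_norm()
  |> DualChunk.build_dual_chunk_attention(
    hidden_size: 256,
    num_heads: 8,
    chunk_size: 64
  )

after_attn = Axon.add(input, attended)

ffn =
  after_attn
  |> Axon.layer_norm()
  |> Axon.dense(1024, activation: :gelu)
  |> Axon.dense(256)

block = Axon.add(after_attn, ffn)
```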

# `output_size`

```elixir
@spec output_size(keyword()) :: pos_integer()
```

Get the output size of the model.
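
Since `build/1` outputs `[batch, hidden_size]`, this presumably reports the configured `:hidden_size`; a hedged example:

```elixir
# Assumption: output_size/1 returns the configured hidden size (default 256).
DualChunk.output_size(hidden_size: 256)
#=> 256
```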

---

*Consult [api-reference.md](api-reference.md) for complete listing*
