Edifice.Attention.DualChunk (Edifice v0.2.0)


Dual Chunk Attention — context extension via intra-chunk and inter-chunk attention.

DeepSeek's method for handling long sequences (used in Qwen2.5-128K): the sequence is split into fixed-size chunks, standard attention is computed within each chunk, and a separate mechanism attends across chunk summaries. This reduces peak attention memory from O(seq_len^2) to O(chunk_size^2 + num_chunks^2).

Key Innovation

Instead of computing full quadratic attention over long sequences:

  1. Intra-chunk attention: Standard multi-head attention within each chunk (local patterns)
  2. Inter-chunk attention: Attention over chunk summaries (global context)
  3. Combination: Learnable blending of local and global representations

Architecture

Input [batch, seq_len, embed_dim]
      |
Input projection to hidden_size
      |
+--------------------------------------------------------------+
|   Dual Chunk Attention Block (x num_layers)                  |
|                                                              |
|   LayerNorm -> Dual Chunk Attention                          |
|     Split into chunks [batch, num_chunks, chunk_size, hidden]|
|     Intra-chunk: MultiHead per chunk                         |
|     Inter-chunk: Summarize -> Attend -> Expand               |
|     Combine: gate * inter + (1-gate) * intra                 |
|   -> Residual                                                |
|   LayerNorm -> FFN -> Residual                               |
+--------------------------------------------------------------+
      |
Final LayerNorm
      |
Last timestep -> [batch, hidden_size]
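
The two tensor manipulations above (the chunk split and the gated combination) can be illustrated with a small Nx sketch. The shapes and values below are illustrative assumptions, not the library's internals:

# Split [batch, seq_len, hidden] into [batch, num_chunks, chunk_size, hidden]
x = Nx.iota({2, 256, 128}, type: :f32)
chunk_size = 64
{batch, seq_len, hidden} = Nx.shape(x)
chunked = Nx.reshape(x, {batch, div(seq_len, chunk_size), chunk_size, hidden})
# chunked has shape {2, 4, 64, 128}

# Gated combination of the two pathways (dummy tensors stand in for the
# intra-chunk output, the expanded inter-chunk output, and a learned gate)
intra = Nx.broadcast(0.0, {2, 256, 128})
inter = Nx.broadcast(1.0, {2, 256, 128})
gate = Nx.broadcast(0.5, {2, 256, 128})
combined = Nx.add(Nx.multiply(gate, inter), Nx.multiply(Nx.subtract(1.0, gate), intra))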

Memory Complexity

For seq_len = N and chunk_size = C:

  • Standard attention: O(N^2)
  • Dual Chunk: O((N/C) C^2 + (N/C)^2) = O(NC + N^2/C^2)

With C = sqrt(N), this becomes O(N^1.5) — subquadratic!
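
As a rough worked example (counting attention-matrix entries, assuming N = 4096 and C = 64): standard attention materializes 4096^2 ≈ 16.8M scores, while the chunked form needs (4096/64) * 64^2 + (4096/64)^2 = 262,144 + 4,096 ≈ 266K, roughly 60x fewer.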

Usage

alias Edifice.Attention.DualChunk

model = DualChunk.build(
  embed_dim: 287,
  hidden_size: 256,
  num_heads: 8,
  num_layers: 4,
  chunk_size: 64
)
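
A follow-up sketch of initializing and running the built model with Axon. The batch size, the random input, and passing a bare tensor (valid for a single-input model) are assumptions; the sequence length of 64 matches the default:

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 64, 287}, :f32), %{})
output = predict_fn.(params, Nx.iota({1, 64, 287}, type: :f32))
# output has shape {1, 256}: the last timestep at hidden_size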

Constraints

seq_len must be divisible by chunk_size.

References

  • DeepSeek context extension methods (2024)
  • Qwen2.5 long-context training (Alibaba, 2024)
  • "Efficient Long-Range Transformers" survey literature

Summary

Types

build_opt()

Options for build/1.

Functions

build(opts \\ [])

Build a Dual Chunk Attention model.

build_dual_chunk_attention(input, opts)

Build the dual chunk attention sublayer.

output_size(opts \\ [])

Get the output size of the model.

Types

build_opt()

@type build_opt() ::
  {:embed_dim, pos_integer()}
  | {:hidden_size, pos_integer()}
  | {:num_heads, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:chunk_size, pos_integer()}
  | {:dropout, float()}

Options for build/1.

Functions

build(opts \\ [])

@spec build([build_opt()]) :: Axon.t()

Build a Dual Chunk Attention model.

Options

  • :embed_dim - Input embedding dimension (required)
  • :hidden_size - Internal hidden dimension (default: 256)
  • :num_heads - Number of attention heads (default: 8)
  • :num_layers - Number of Dual Chunk Attention blocks (default: 4)
  • :chunk_size - Chunk size C for chunked attention (default: 64). seq_len must be divisible by this value.
  • :dropout - Dropout rate (default: 0.1)
  • :seq_len / :window_size - Expected sequence length (default: 64)

Returns

An Axon model outputting [batch, hidden_size].
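
For example, a configuration with non-default values (illustrative numbers; 128 remains divisible by the chunk size of 32):

model = DualChunk.build(
  embed_dim: 287,
  hidden_size: 512,
  num_heads: 8,
  num_layers: 2,
  chunk_size: 32,
  seq_len: 128,
  dropout: 0.2
)
# Outputs [batch, 512]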

build_dual_chunk_attention(input, opts)

@spec build_dual_chunk_attention(
  Axon.t(),
  keyword()
) :: Axon.t()

Build the dual chunk attention sublayer.

This creates the core attention mechanism with both intra-chunk and inter-chunk attention pathways.
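
A hedged sketch of wiring the sublayer into a custom Axon graph. The input shape assumes an already-projected [batch, seq_len, hidden_size] tensor (per the architecture above), and the option keys shown simply mirror build/1; they are assumptions, not a confirmed signature:

input = Axon.input("sequence", shape: {nil, 64, 256})

attn =
  DualChunk.build_dual_chunk_attention(input,
    hidden_size: 256,
    num_heads: 8,
    chunk_size: 64
  )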

output_size(opts \\ [])

@spec output_size(keyword()) :: pos_integer()

Get the output size of the model.
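
Given that build/1 outputs [batch, hidden_size], this presumably reports the configured hidden size (a hedged example):

DualChunk.output_size(hidden_size: 256)
#=> 256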