Based: Linear attention with Taylor expansion feature map.
Replaces quadratic softmax(QK^T) attention with a linear-time approximation built on Taylor-expanded feature maps. Instead of materializing the full n x n attention matrix, Based projects Q and K through a polynomial feature map phi(x) and computes attention in time linear in the sequence length.
Key Innovation
The Taylor feature map approximates softmax attention by expanding exp(q . k):
- phi(x) = [1, x, (x ⊗ x)/sqrt(2!), ...] up to Taylor order N, where x ⊗ x is the flattened outer product, so that phi(q) . phi(k) = 1 + q . k + (q . k)^2/2! + ... ≈ exp(q . k)
- Linear attention: output = phi(Q) @ (phi(K)^T @ V) / (phi(Q) @ sum(phi(K)))
- This avoids the O(n^2) softmax(QK^T) computation: phi(K)^T @ V and sum(phi(K)) are built once and reused for every query (a minimal sketch follows below)
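A minimal Nx sketch of the idea (illustrative only, not this module's internals: BasedSketch, taylor_phi/1, and linear_attention/3 are hypothetical names, and the causal variant used for language modeling, which replaces the sums with cumulative sums over positions, is omitted):

defmodule BasedSketch do
  # Order-2 Taylor feature map: phi(x) = [1, x, (x ⊗ x)/sqrt(2!)], flattened,
  # so that dot(phi(q), phi(k)) = 1 + q.k + (q.k)^2/2 ~= exp(q.k).
  def taylor_phi(x) do
    {b, n, d} = Nx.shape(x)

    x2 =
      Nx.multiply(Nx.new_axis(x, -1), Nx.new_axis(x, -2))
      |> Nx.divide(:math.sqrt(2))
      |> Nx.reshape({b, n, d * d})

    Nx.concatenate([Nx.broadcast(1.0, {b, n, 1}), x, x2], axis: -1)
  end

  # Non-causal linear attention in the kernel form above.
  def linear_attention(q, k, v) do
    phi_q = taylor_phi(q)                           # {b, n, f}
    phi_k = taylor_phi(k)                           # {b, n, f}
    # The {b, f, dv} state phi(K)^T @ V is built once, then reused per query.
    kv = Nx.dot(phi_k, [1], [0], v, [1], [0])
    num = Nx.dot(phi_q, [2], [0], kv, [1], [0])     # {b, n, dv}
    den = Nx.dot(phi_q, [2], [0], Nx.sum(phi_k, axes: [1]), [1], [0])  # {b, n}
    Nx.divide(num, Nx.new_axis(den, -1))
  end
end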
Architecture
Input [batch, seq_len, embed_dim]
|
Input projection to hidden_size
|
+--------------------------------------+
| Based Block (x num_layers) |
| |
| LayerNorm -> Based Linear Attn |
| Q, K projections + Taylor phi() |
| Linear attention via phi(Q/K) |
| -> Residual |
| LayerNorm -> FFN -> Residual |
+--------------------------------------+
|
Final LayerNorm
|
Last timestep -> [batch, hidden_size]
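How one such block might be assembled in Axon (a hedged sketch, not the module's actual code: based_block/3 is a hypothetical name, linear_attention/3 is the sketch from Key Innovation, the 4x FFN expansion and :gelu activation are assumptions, and head splitting, which keeps the per-head dimension small enough for the outer-product feature map, is omitted):

def based_block(input, hidden_size, dropout) do
  # Pre-norm attention sub-layer: LayerNorm -> Q/K/V projections
  normed = Axon.layer_norm(input)
  q = Axon.dense(normed, hidden_size)
  k = Axon.dense(normed, hidden_size)
  v = Axon.dense(normed, hidden_size)

  # Custom Axon layer wrapping the Nx linear-attention sketch above
  attn =
    Axon.layer(
      fn qt, kt, vt, _opts -> BasedSketch.linear_attention(qt, kt, vt) end,
      [q, k, v]
    )

  x = Axon.add(input, attn)

  # Pre-norm FFN sub-layer with residual
  ffn =
    x
    |> Axon.layer_norm()
    |> Axon.dense(hidden_size * 4, activation: :gelu)
    |> Axon.dropout(rate: dropout)
    |> Axon.dense(hidden_size)

  Axon.add(x, ffn)
end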
Complexity
| Mechanism | Time | Space |
|---|---|---|
| Softmax attention | O(n^2 d) | O(n^2) |
| Based (Taylor) | O(n d^2 p) | O(d^2 p) |
Where n is the sequence length, d is the head dimension, and p is the Taylor order (typically 2-3).
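Following directly from the feature map above, the feature dimension for Taylor order p on head dimension d is f = 1 + d + ... + d^p; for example, with d = 16 and p = 2 that is 273:

# f = sum of d^i for i in 0..p (e.g. 1 + 16 + 256 = 273)
feature_dim = fn d, p -> Enum.sum(for i <- 0..p, do: Integer.pow(d, i)) end
feature_dim.(16, 2)
#=> 273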
Usage
model = Based.build(
  embed_dim: 287,
  hidden_size: 256,
  num_heads: 4,
  taylor_order: 2,
  num_layers: 4
)
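Running the built model follows the standard Axon build/predict flow (a hedged sketch; shapes assume the options above, the default window_size of 60, and a batch of 8):

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 60, 287}, :f32), %{})
input = Nx.broadcast(0.0, {8, 60, 287})   # [batch, seq_len, embed_dim]
output = predict_fn.(params, input)       # [batch, hidden_size] = {8, 256}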
- "Simple linear attention language models balance the recall-throughput tradeoff" (Arora et al., 2024)
Summary
Functions
Build a Based linear attention model.
Build the Based linear attention layer with Taylor feature map.
Get the output size of a Based model.
Types
@type build_opt() ::
        {:embed_dim, pos_integer()}
        | {:hidden_size, pos_integer()}
        | {:num_heads, pos_integer()}
        | {:taylor_order, pos_integer()}
        | {:num_layers, pos_integer()}
        | {:dropout, float()}
        | {:window_size, pos_integer()}
Options for build/1.
Functions
Build a Based linear attention model.
Options
- :embed_dim - Size of input embedding per timestep (required)
- :hidden_size - Internal hidden dimension (default: 256)
- :num_heads - Number of attention heads (default: 4)
- :taylor_order - Order of Taylor expansion for feature map (default: 2)
- :num_layers - Number of transformer blocks (default: 4)
- :dropout - Dropout rate (default: 0.1)
- :window_size - Expected sequence length for JIT optimization (default: 60)
Returns
An Axon model that outputs [batch, hidden_size] from the last position.
Build the Based linear attention layer with Taylor feature map.
Projects the input to Q, K, and V, applies the Taylor feature map to Q and K, then computes attention in linear time.
@spec output_size(keyword()) :: pos_integer()
Get the output size of a Based model.
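A hedged usage sketch, assuming output_size/1 simply reflects the configured :hidden_size (the build docs above state the model outputs [batch, hidden_size]):

Based.output_size(hidden_size: 256)
#=> 256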