Edifice.Attention.LinearTransformer (Edifice v0.2.0)

Linear Transformer: Linear attention using kernel feature maps.

Replaces softmax attention with a kernel-based linear attention mechanism, reducing complexity from O(N^2) to O(N) by avoiding explicit computation of the N x N attention matrix.

Key Innovation: Kernel Feature Maps

Standard attention computes: Attn(Q,K,V) = softmax(QK^T/sqrt(d)) * V

Linear attention rewrites this using a feature map phi:

Attn(Q,K,V) = phi(Q) * (phi(K)^T * V) / (phi(Q) * sum_j phi(K_j))

By computing phi(K)^T * V first (a d x d matrix, summed over the sequence positions j), we avoid the N x N attention matrix entirely. The feature map phi(x) = ELU(x) + 1 keeps attention weights non-negative.
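
For concreteness, here is a minimal Nx sketch of the non-causal form of this computation for a single head, without the learned projections. The module and function names are illustrative, not this library's internals.

defmodule LinearAttentionSketch do
  import Nx.Defn

  # phi(x) = ELU(x) + 1: exp(x) for x <= 0, x + 1 for x > 0 (always positive).
  defn feature_map(x) do
    Nx.select(Nx.greater(x, 0), x + 1, Nx.exp(x))
  end

  # q, k, v: [batch, seq_len, d]
  defn linear_attention(q, k, v) do
    phi_q = feature_map(q)
    phi_k = feature_map(k)

    # KV summary: phi(K)^T * V contracted over the sequence axis -> [batch, d, d]
    kv = Nx.dot(phi_k, [1], [0], v, [1], [0])

    # Normalizer: sum_j phi(k_j) -> [batch, d]
    z = Nx.sum(phi_k, axes: [1])

    # Numerator: phi(Q) * KV -> [batch, seq_len, d]
    num = Nx.dot(phi_q, [2], [0], kv, [1], [0])

    # Denominator: phi(q_i) . z for each position -> [batch, seq_len, 1]
    den = Nx.sum(phi_q * Nx.new_axis(z, 1), axes: [2], keep_axes: true)

    num / den
  end
end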

Architecture

Input [batch, seq_len, embed_dim]
      |
      v
+--------------------------------------+
|  Linear Transformer Block            |
|                                      |
|  LayerNorm                           |
|    -> Q, K, V projections            |
|    -> phi(Q), phi(K) feature maps    |
|    -> KV = phi(K)^T * V   [d x d]    |
|    -> out = phi(Q) * KV   [N x d]    |
|    -> normalize by phi(Q)*sum(phi(K))|
|  -> Residual                         |
|                                      |
|  LayerNorm -> FFN -> Residual        |
+--------------------------------------+
      | (repeat for num_layers)
      v
Last timestep -> [batch, hidden_size]
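
For orientation, a hedged Axon sketch of how one such block could be wired. The module name, the 4x FFN expansion, and the single-head layout are assumptions for illustration, not this module's actual internals; linear_attention/3 is the Nx function sketched above.

defmodule LinearBlockSketch do
  def block(input, hidden_size, dropout) do
    normed = Axon.layer_norm(input)

    # Q, K, V projections feeding the kernelized attention.
    q = Axon.dense(normed, hidden_size)
    k = Axon.dense(normed, hidden_size)
    v = Axon.dense(normed, hidden_size)

    attn =
      Axon.layer(
        fn qp, kp, vp, _opts -> LinearAttentionSketch.linear_attention(qp, kp, vp) end,
        [q, k, v]
      )
      |> Axon.dropout(rate: dropout)

    x = Axon.add(input, attn)

    # Position-wise feed-forward network with its own residual connection.
    ffn =
      x
      |> Axon.layer_norm()
      |> Axon.dense(hidden_size * 4, activation: :gelu)
      |> Axon.dense(hidden_size)
      |> Axon.dropout(rate: dropout)

    Axon.add(x, ffn)
  end
end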

Complexity

Operation     Standard       Linear
Attention     O(N^2 * d)     O(N * d^2)
Memory        O(N^2)         O(N * d)
Best when     N < d          N > d

Linear attention is most beneficial when sequence length N exceeds the head dimension d.
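
As a concrete back-of-the-envelope check (the numbers are illustrative, not benchmarks):

n = 1_024            # sequence length
d = 64               # per-head dimension
standard = n * n * d # 67_108_864 ops: materializes the N x N attention matrix
linear = n * d * d   # 4_194_304 ops: materializes only the d x d KV summary
standard / linear    #=> 16.0, i.e. roughly the N / d ratio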

Usage

alias Edifice.Attention.LinearTransformer

model = LinearTransformer.build(
  embed_dim: 287,
  hidden_size: 256,
  num_layers: 4,
  dropout: 0.1
)
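
A hedged sketch of running the built model, assuming the model's single input accepts a bare [batch, seq_len, embed_dim] tensor; the shapes follow the dimensions above.

{init_fn, predict_fn} = Axon.build(model)

input = Nx.broadcast(0.0, {8, 60, 287})   # [batch, window_size, embed_dim]
params = init_fn.(input, %{})

output = predict_fn.(params, input)
# output has shape {8, 256}, i.e. [batch, hidden_size]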

References

  • Paper: "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" (Katharopoulos et al., 2020)
  • Feature map: ELU+1 from the original paper

Summary

Types

Options for build/1.

Functions

Build a Linear Transformer model for sequence processing.

Get the output size of a Linear Transformer model.

Calculate approximate parameter count for a Linear Transformer model.

Recommended default configuration for sequence processing.

Types

build_opt()

@type build_opt() ::
  {:dropout, float()}
  | {:hidden_size, pos_integer()}
  | {:num_heads, pos_integer()}
  | {:num_layers, pos_integer()}

Options for build/1.

Functions

build(opts \\ [])

@spec build([build_opt()]) :: Axon.t()

Build a Linear Transformer model for sequence processing.

Options

  • :embed_dim - Size of input embedding per timestep (required)
  • :hidden_size - Internal hidden dimension (default: 256)
  • :num_layers - Number of transformer blocks (default: 4)
  • :num_heads - Number of attention heads (default: 4)
  • :dropout - Dropout rate (default: 0.1)
  • :window_size - Expected sequence length for JIT optimization (default: 60)

Returns

An Axon model that outputs [batch, hidden_size] from the last position.

output_size(opts \\ [])

@spec output_size(keyword()) :: non_neg_integer()

Get the output size of a Linear Transformer model.
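
Because the model emits [batch, hidden_size], the returned size presumably tracks the :hidden_size option (256 by default), e.g.:

LinearTransformer.output_size(hidden_size: 512)
#=> 512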

param_count(opts)

@spec param_count(keyword()) :: non_neg_integer()

Calculate approximate parameter count for a Linear Transformer model.