Edifice.Attention.Performer (Edifice v0.2.0)


Performer: Fast Attention Via Positive Orthogonal Random Features (FAVOR+).

Performer approximates softmax attention using random feature maps, achieving O(N) time and space complexity. The FAVOR+ mechanism uses orthogonal random features to approximate the exponential kernel.

Key Innovation: FAVOR+ Random Feature Attention

Standard attention: softmax(QK^T / sqrt(d)) * V -- O(N^2) time and memory in sequence length N

FAVOR+ approximates exp(QK^T) using random features:

exp(q^T k) ~ phi(q)^T phi(k)   (unbiased: equality holds in expectation over the w_i)

Where phi(x) = exp(-||x||^2 / 2) / sqrt(m) * [exp(w_1^T x), ..., exp(w_m^T x)]
w_1, ..., w_m ~ iid N(0, I) (orthogonalized)
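The approximation above can be checked numerically. The following is an illustrative NumPy sketch (not the library's Elixir code): with enough random features, phi(q)^T phi(k) concentrates around exp(q^T k).

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 4096  # head dimension, number of random features (large m for a tight estimate)

def phi(x, W):
    # Positive random features for the exponential kernel:
    # phi(x) = exp(-||x||^2 / 2) / sqrt(m) * [exp(w_1^T x), ..., exp(w_m^T x)]
    return np.exp(x @ W.T - np.dot(x, x) / 2.0) / np.sqrt(W.shape[0])

W = rng.standard_normal((m, d))   # w_i ~ iid N(0, I); orthogonalization omitted in this sketch
q = 0.3 * rng.standard_normal(d)
k = 0.3 * rng.standard_normal(d)

exact = np.exp(q @ k)
approx = phi(q, W) @ phi(k, W)    # Monte Carlo estimate of exp(q^T k)
```

Because every feature is positive, the estimate stays positive as well, which is what makes the downstream normalization by D stable.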

This allows rewriting attention as:

Attn(Q,K,V) = D^{-1} * phi(Q) * (phi(K)^T * V)
D = diag(phi(Q) * phi(K)^T * 1)

Computing KV = phi(K)^T V first costs O(N m d) and yields an [m, d] matrix, so the whole attention is O(N m d) instead of O(N^2 d).
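The quadratic and linear orderings of this product are algebraically identical; only the cost differs. An illustrative NumPy sketch (variable names are assumptions, not library identifiers):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, m = 50, 8, 32

def phi(x, W):
    # Row-wise positive feature map, x: [N, d] -> [N, m]
    sq = (x * x).sum(axis=-1, keepdims=True)
    return np.exp(x @ W.T - sq / 2.0) / np.sqrt(W.shape[0])

W = rng.standard_normal((m, d))
Q, K, V = (0.3 * rng.standard_normal((N, d)) for _ in range(3))
pQ, pK = phi(Q, W), phi(K, W)

# Quadratic order: materialize the [N, N] attention matrix
A = pQ @ pK.T
out_quadratic = (A @ V) / A.sum(axis=1, keepdims=True)

# Linear order: phi(K)^T V is only [m, d]; no [N, N] matrix is ever formed
KV = pK.T @ V                        # O(N m d)
D = (pQ @ pK.sum(axis=0))[:, None]   # D = phi(Q) (phi(K)^T 1)
out_linear = (pQ @ KV) / D
```

The two results agree to floating-point precision; the reordering is pure matrix-multiplication associativity.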

Architecture

Input [batch, seq_len, embed_dim]
      |
      v
+--------------------------------------+
|  Performer Block                     |
|                                      |
|  LayerNorm                           |
|    -> Q, K, V projections            |
|    -> Random feature map phi(Q,K)    |
|       (orthogonal random features)   |
|    -> KV = phi(K)^T * V    [m, d]    |
|    -> out = phi(Q) * KV    [N, d]    |
|    -> normalize by D                 |
|  -> Residual                         |
|                                      |
|  LayerNorm -> FFN -> Residual        |
+--------------------------------------+
      | (repeat for num_layers)
      v
Last timestep -> [batch, hidden_size]
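The diagram can be condensed into a minimal single-head block sketch in NumPy (all parameter names and initializations here are illustrative assumptions; the actual module is built with Axon in Elixir):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, m, d_ff = 10, 32, 16, 64  # seq len, model/head dim, random features, FFN dim

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def phi(x, W):
    sq = (x * x).sum(-1, keepdims=True)
    return np.exp(x @ W.T - sq / 2.0) / np.sqrt(W.shape[0])

def performer_block(x, Wq, Wk, Wv, W_feat, W1, W2):
    # Pre-norm attention sub-layer with FAVOR+ linear attention
    h = layer_norm(x)
    pQ, pK = phi(h @ Wq, W_feat), phi(h @ Wk, W_feat)  # [N, m]
    V = h @ Wv
    KV = pK.T @ V                                      # [m, d]
    D = (pQ @ pK.sum(axis=0))[:, None]                 # normalizer
    x = x + (pQ @ KV) / D                              # residual
    # Pre-norm FFN sub-layer
    h = layer_norm(x)
    return x + np.maximum(h @ W1, 0.0) @ W2            # ReLU FFN + residual

Wq, Wk, Wv = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
W_feat = rng.standard_normal((m, d))
W1 = 0.1 * rng.standard_normal((d, d_ff))
W2 = 0.1 * rng.standard_normal((d_ff, d))
x = rng.standard_normal((N, d))
y = performer_block(x, Wq, Wk, Wv, W_feat, W1, W2)     # [N, d]
```

Stacking `num_layers` such blocks and taking the last timestep yields the `[batch, hidden_size]` output described above.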

Complexity

| Component       | Standard     | Performer      |
| --------------- | ------------ | -------------- |
| Time            | O(N^2 d)     | O(N d m)       |
| Space           | O(N^2 + N d) | O(N (d + m))   |
| Random features | --           | m (default 64) |

Where m = num_features controls approximation quality vs speed tradeoff.
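For a concrete sense of scale (illustrative numbers, not benchmarks from the library), the attention FLOP counts compare as:

```python
# FLOP-count comparison at N = 4096, d = 64, m = 64 (the default num_features)
N, d, m = 4096, 64, 64
standard = N * N * d         # O(N^2 d): materializing the attention matrix
performer = N * d * m        # O(N d m): linear attention
ratio = standard / performer # speedup factor = N / m -> 64.0
```

The speedup grows linearly with sequence length, since the ratio is simply N / m.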

Usage

alias Edifice.Attention.Performer

model = Performer.build(
  embed_dim: 287,
  hidden_size: 256,
  num_features: 64,
  num_layers: 4,
  dropout: 0.1
)

References

  • Paper: "Rethinking Attention with Performers" (Choromanski et al., ICLR 2021)
  • FAVOR+: Fast Attention Via positive Orthogonal Random features

Summary

Types

Options for build/1.

Functions

Build a Performer model for sequence processing.

Generate orthogonal random features for FAVOR+ via QR decomposition.

Get the output size of a Performer model.

Calculate approximate parameter count for a Performer model.

Recommended default configuration for sequence processing.

Types

build_opt()

@type build_opt() ::
  {:dropout, float()}
  | {:embed_dim, pos_integer()}
  | {:hidden_size, pos_integer()}
  | {:num_features, pos_integer()}
  | {:num_heads, pos_integer()}
  | {:num_layers, pos_integer()}
  | {:window_size, pos_integer()}

Options for build/1.

Functions

build(opts \\ [])

@spec build([build_opt()]) :: Axon.t()

Build a Performer model for sequence processing.

Options

  • :embed_dim - Size of input embedding per timestep (required)
  • :hidden_size - Internal hidden dimension (default: 256)
  • :num_features - Number of random features m for FAVOR+ (default: 64)
  • :num_layers - Number of Performer blocks (default: 4)
  • :num_heads - Number of attention heads (default: 4)
  • :dropout - Dropout rate (default: 0.1)
  • :window_size - Expected sequence length for JIT optimization (default: 60)

Returns

An Axon model that outputs [batch, hidden_size] from the last position.

generate_orthogonal_features(head_dim, num_features, opts \\ [])

@spec generate_orthogonal_features(pos_integer(), pos_integer(), keyword()) ::
  Nx.Tensor.t()

Generate orthogonal random features for FAVOR+ via QR decomposition.

Returns a [num_features, head_dim] matrix with orthogonal rows (within blocks of size head_dim). Multiple orthogonal blocks are concatenated if num_features > head_dim.
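The block-wise QR scheme can be sketched in NumPy as follows (an illustrative reimplementation, not the library's Elixir code; rescaling rows to Gaussian-like norms is one common FAVOR+ choice and is an assumption here):

```python
import numpy as np

rng = np.random.default_rng(3)

def orthogonal_features(head_dim, num_features):
    # Stack square orthogonal blocks until num_features rows are collected
    rows = []
    remaining = num_features
    while remaining > 0:
        G = rng.standard_normal((head_dim, head_dim))
        Q, _ = np.linalg.qr(G)                 # square orthogonal matrix
        rows.append(Q[:min(remaining, head_dim)])
        remaining -= head_dim
    W = np.vstack(rows)                        # [num_features, head_dim], unit rows
    # Rescale rows so their norms match those of iid Gaussian vectors
    norms = np.linalg.norm(rng.standard_normal((num_features, head_dim)), axis=1)
    return W * norms[:, None]

W = orthogonal_features(8, 20)                 # three blocks: 8 + 8 + 4 rows
```

Rows are exactly orthogonal within each block of `head_dim`; rows from different blocks are only orthogonal in expectation.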

output_size(opts \\ [])

@spec output_size(keyword()) :: non_neg_integer()

Get the output size of a Performer model.

param_count(opts)

@spec param_count(keyword()) :: non_neg_integer()

Calculate approximate parameter count for a Performer model.