FNet: Replacing Attention with the Fourier Transform
FNet replaces the self-attention sublayer in Transformers with an unparameterized Fourier transform, mixing tokens with no learnable attention parameters. The paper achieves O(N log N) mixing via the FFT; this implementation substitutes an O(N^2) real-valued DFT matrix multiply (see the note under Complexity).
Key Innovation: FFT Mixing
Instead of computing attention weights, FNet applies FFT along the sequence axis to mix token information. This is parameter-free and achieves surprisingly competitive performance:
Standard Transformer: LayerNorm -> Self-Attention -> Residual
FNet: LayerNorm -> FFT Mixing -> Residual

The Fourier transform provides global token mixing: the DFT is a dense linear transform along the sequence axis, so every output position depends on every input token.
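For intuition, one mixing sublayer could be wired in Axon like this (a sketch, not this module's actual code; mixing_block is a hypothetical helper, and fourier_mixing_real/3 and the DFT matrices are documented below):

    # Sketch: LayerNorm -> Fourier mixing -> residual
    defp mixing_block(input, dft_seq, dft_hidden) do
      input
      |> Axon.layer_norm()
      |> Axon.nx(&FNet.fourier_mixing_real(&1, dft_seq, dft_hidden))
      |> Axon.add(input)
    end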
Architecture
Input [batch, seq_len, embed_dim]
|
v
+-------------------------------------+
| FNet Block |
| |
| LayerNorm |
| -> FFT along seq axis |
| -> Take real part |
| -> Residual |
| |
| LayerNorm |
| -> Dense(hidden * 4) |
| -> GeLU |
| -> Dense(hidden) |
| -> Residual |
+-------------------------------------+
| (repeat for num_layers)
v
Last timestep -> [batch, hidden_size]

Complexity
| Component | Transformer | FNet |
|---|---|---|
| Token mixing | O(N^2) | O(N^2)* |
| Parameters | Q,K,V weights | None (DFT) |
| Training speed | Baseline | ~7x faster |
| Quality | Baseline | 92-97% of BERT |
*Note: This implementation uses a real-valued DFT matrix multiply instead of Nx.fft because EXLA's autodiff through complex FFT outputs triggers Nx.less/2 errors in LayerNorm's backward pass. For typical seq_len (30-128) and hidden_size (256-512), the O(N^2) matrix multiply is small compared to the FFN layers.
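As a rough check with illustrative values N = 60 (seq_len) and h = 256 (hidden): the sequence-axis matmul costs N^2 * h ≈ 0.92M multiply-adds per example and the hidden-axis matmul N * h^2 ≈ 3.9M, while each block's FFN costs N * 8h^2 ≈ 31.5M, so the DFT mixing stays well below the FFN cost.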
Usage
    model = FNet.build(
      embed_dim: 287,
      hidden_size: 256,
      num_layers: 4,
      dropout: 0.1
    )
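To sanity-check shapes, a minimal sketch with Axon's build API (batch size and sequence length are illustrative; assumes a single-input model):

    {init_fn, predict_fn} = Axon.build(model)
    params = init_fn.(Nx.template({8, 60, 287}, :f32), %{})
    out = predict_fn.(params, Nx.broadcast(0.0, {8, 60, 287}))
    # out shape: {8, 256} = [batch, hidden_size]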
References

- Paper: "FNet: Mixing Tokens with Fourier Transforms" (Lee-Thorp et al., Google Research, 2021). arXiv:2105.03824
Summary
Functions
Build an FNet model for sequence processing.
Build a real-valued DFT matrix: DFT[k, n] = cos(2π k n / N).
Apply Fourier mixing using real-valued DFT matrix multiply.
Get the output size of an FNet model.
Calculate approximate parameter count for an FNet model.
Recommended default configuration for sequence processing.
Types
@type build_opt() :: {:embed_dim, pos_integer()} | {:hidden_size, pos_integer()} | {:num_layers, pos_integer()} | {:dropout, float()} | {:window_size, pos_integer()}
Options for build/1.
Functions
Build an FNet model for sequence processing.
Options
- :embed_dim - Size of input embedding per timestep (required)
- :hidden_size - Internal hidden dimension (default: 256)
- :num_layers - Number of FNet blocks (default: 4)
- :dropout - Dropout rate (default: 0.1)
- :window_size - Expected sequence length for JIT optimization (default: 60)
Returns
An Axon model that outputs [batch, hidden_size] from the last position.
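For example, a build call exercising every documented option (values illustrative):

    model =
      FNet.build(
        embed_dim: 287,
        hidden_size: 256,
        num_layers: 4,
        dropout: 0.1,
        window_size: 60
      )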
@spec dft_real_matrix(pos_integer()) :: Nx.Tensor.t()
Build a real-valued DFT matrix: DFT[k, n] = cos(2π k n / N).
For real inputs, Real(FFT(x)) = x @ DFT_real, avoiding complex arithmetic.
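A minimal sketch of this construction with Nx (the broadcasting body is an assumption; only the formula above is from the docs):

    # DFT_real[k, n] = cos(2 * pi * k * n / N), built by broadcasting
    # a column of k indices against a row of n indices
    def dft_real_matrix(n) do
      k = Nx.iota({n, 1})
      j = Nx.iota({1, n})
      Nx.cos(Nx.multiply(2.0 * :math.pi() / n, Nx.multiply(k, j)))
    end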
@spec fourier_mixing_real(Nx.Tensor.t(), Nx.Tensor.t(), Nx.Tensor.t()) :: Nx.Tensor.t()
Apply Fourier mixing using real-valued DFT matrix multiply.
Mixes along both the sequence and feature axes using precomputed cosine DFT matrices, a real-valued stand-in for Real(FFT2(x)). This avoids Nx.fft entirely, preventing complex-number issues in EXLA's backward pass.
Parameters
- tensor - Input tensor [batch, seq_len, hidden_dim]
- dft_seq - Precomputed DFT matrix [seq_len, seq_len]
- dft_hidden - Precomputed DFT matrix [hidden_dim, hidden_dim]
Returns
DFT-mixed tensor [batch, seq_len, hidden_dim]
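A sketch of the two-axis contraction with Nx.dot/4 (the axis pairings are assumptions; since the cosine matrices are symmetric, which side is contracted does not matter):

    import Nx.Defn

    defn fourier_mixing_real(tensor, dft_seq, dft_hidden) do
      tensor
      # contract the seq axis with dft_seq: [batch, seq, hidden] -> [batch, hidden, seq]
      |> Nx.dot([1], dft_seq, [0])
      # contract the hidden axis with dft_hidden: -> [batch, seq, hidden]
      |> Nx.dot([1], dft_hidden, [0])
    end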
@spec output_size(keyword()) :: non_neg_integer()
Get the output size of an FNet model.
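Presumably this mirrors the configured :hidden_size, since the model outputs [batch, hidden_size] (an assumption, not confirmed by the docs):

    FNet.output_size(hidden_size: 256)
    #=> 256 (assumed)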
@spec param_count(keyword()) :: non_neg_integer()
Calculate approximate parameter count for an FNet model.
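As a rough cross-check, the FFN dominates each block's count; a sketch assuming the block layout from the Architecture section (LayerNorm and any input-projection terms omitted):

    hidden = 256
    # Dense(hidden -> 4*hidden) + Dense(4*hidden -> hidden), weights and biases
    ffn_per_block = hidden * 4 * hidden + 4 * hidden + 4 * hidden * hidden + hidden
    #=> 525_568 per block, roughly 2.1M across num_layers = 4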
@spec recommended_defaults() :: keyword()
Recommended default configuration for sequence processing.
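Since :embed_dim is required and data-dependent, the defaults presumably get merged in at the call site; a hedged sketch:

    opts = Keyword.merge(FNet.recommended_defaults(), embed_dim: 287)
    model = FNet.build(opts)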