Building Blocks

Composable primitives -- normalization, position encoding, gating, patching, and cross-attention -- that combine to form complete architectures.

Overview

The Blocks family contains the fundamental layers that appear repeatedly across transformers, vision models, diffusion systems, and state-space architectures. Rather than reimplementing these primitives inside every architecture, Edifice factors them into standalone, tested modules with a consistent layer/2 or functional API.

These blocks fall into five categories: normalization (RMSNorm, AdaptiveNorm), position encoding (SinusoidalPE, RoPE, ALiBi), activation gating (SwiGLU), input tokenization (PatchEmbed), and cross-sequence attention (CrossAttention). Each is designed to slot into an Axon computation graph as a composable node, so higher-level architectures can mix and match without coupling to a specific implementation.

Understanding when to use each block is often more important than understanding their internals. A position encoding choice, for example, determines whether a model can generalize to longer sequences than it saw during training. A normalization choice affects training stability and throughput. This guide provides the conceptual grounding to make those decisions.

Conceptual Foundation

All blocks in this family serve the same meta-purpose: constrain or inject structure into an otherwise generic sequence of linear projections and nonlinearities. Without normalization, activations drift and training diverges. Without position encoding, attention treats sequences as unordered bags. Without gating, feed-forward blocks lack the multiplicative interactions that improve gradient flow.

The key equation underlying most of these blocks is the transformer's attention computation, into which position encodings and normalizations are inserted:

Attention(Q, K, V) = softmax( (QK^T + bias) / sqrt(d_k) ) V

where bias may come from ALiBi slopes, the Q and K may be rotated by RoPE, the inputs may be RMSNorm'd, and the feed-forward that follows may use SwiGLU gating.

Architecture Evolution

2017  Vaswani et al. "Attention Is All You Need"
  |     - LayerNorm, Sinusoidal PE, standard FFN
  |
2019  Zhang & Sennrich
  |     - RMSNorm: drop mean centering, ~50% faster normalization
  |
2020  Shazeer "GLU Variants Improve Transformer"
  |     - SwiGLU/GeGLU: gated FFN with multiplicative interactions
  |
2021  Su et al. "RoFormer"                    Dosovitskiy et al. "ViT"
  |     - RoPE: rotary embeddings,              - PatchEmbed: images
  |       relative position via rotation           as patch sequences
  |
2022  Press et al. "Train Short, Test Long"
  |     - ALiBi: no learned params,
  |       linear bias for extrapolation
  |
2023  Peebles & Xie "DiT"
  |     - AdaptiveNorm (AdaLN-Zero):
  |       condition-dependent scale/shift/gate
  |
2024+ Modern LLMs (LLaMA, Mistral, Mamba-2)
        - RMSNorm + RoPE + SwiGLU as standard stack

When to Use What

Normalization

Scenario	Module	Rationale
General transformer layers	`RMSNorm`	Faster than LayerNorm, no mean subtraction, standard in modern LLMs
Axon built-in suffices	`Axon.layer_norm`	When mean centering matters (e.g., small models, batch norm alternative)
Conditional generation (diffusion, class-conditional)	`AdaptiveNorm`	Scale/shift/gate predicted from conditioning signal
Diffusion Transformers (DiT)	`AdaptiveNorm` with `:adaln_zero` mode	Zero-initialized gate for stable early training

Position Encoding

Scenario	Module	Rationale
Fixed-length, no extrapolation needed	`SinusoidalPE`	Simplest, deterministic, no learned params
Long-context LLMs, extrapolation required	`RoPE`	Relative position via rotation, smooth extrapolation with NTK-aware scaling
Maximum extrapolation, minimal complexity	`ALiBi`	Zero learned params, linear bias, best length generalization
Vision transformers	`SinusoidalPE` or learned	2D grid positions, extrapolation less critical
Encoder-decoder (translation, summarization)	`SinusoidalPE` or `RoPE`	Both work; RoPE preferred for modern systems

Feed-Forward / Gating

Scenario	Module	Rationale
Modern transformer FFN	`SwiGLU`	3 projections (gate, up, down) with SiLU gating; better quality per FLOP
Legacy / simple models	`Axon.dense` + activation	Standard 2-projection FFN when parameter budget is tight
GeGLU or ReGLU variant needed	`SwiGLU` with `:activation` option	Same module, different gate activation

Tokenization and Cross-Attention

Scenario	Module	Rationale
Image input to transformer	`PatchEmbed`	Splits image into non-overlapping patches, projects to embedding dim
Multimodal fusion (text + image)	`CrossAttention`	Queries from one modality, keys/values from another
Encoder-decoder architectures	`CrossAttention`	Decoder attends to encoder outputs
U-Net conditioning (Stable Diffusion)	`CrossAttention`	Inject text conditioning into image generation

Key Concepts

Normalization: LayerNorm vs RMSNorm vs AdaptiveNorm

LayerNorm (Axon built-in) normalizes by subtracting the mean and dividing by the standard deviation across the feature dimension, then applies learned scale (gamma) and shift (beta). This two-step process -- centering then scaling -- costs roughly twice the compute of RMSNorm.

RMSNorm drops the mean subtraction entirely. It divides by the root mean square of the activations and applies only a learned scale. Empirically, the centering step contributes little to training stability in large models, making RMSNorm the better default. LLaMA, Mistral, Mamba-2, and most 2024+ architectures use RMSNorm exclusively.

RMSNorm(x) = (x / sqrt(mean(x^2) + eps)) * gamma

AdaptiveNorm (AdaLN / AdaLN-Zero) replaces fixed gamma and beta with values predicted from a conditioning signal. In diffusion models, the conditioning signal is typically a timestep embedding. AdaLN-Zero additionally predicts a gating factor alpha initialized to zero, which ensures that at initialization each transformer block acts as an identity function -- critical for stable training of deep generative models.

                  LayerNorm              RMSNorm              AdaptiveNorm
                  ---------              -------              ------------
Centering:        Yes (subtract mean)    No                   Via predicted beta
Scaling:          Yes (learned gamma)    Yes (learned gamma)  Via predicted gamma
Extra gate:       No                     No                   Optional (AdaLN-Zero)
Parameters:       2 * hidden_size        hidden_size          Predicted from condition
Typical use:      Legacy, small models   Modern LLMs/SSMs     Conditional generation

Position Encoding: Absolute, Relative, and None

Position encoding determines how a model distinguishes token positions. The three approaches in Edifice represent distinct philosophies:

SinusoidalPE (absolute) assigns each position a unique vector using sine and cosine waves at geometrically spaced frequencies. It is added to the input embeddings once. The model must learn that these additive signals encode position. Extrapolation beyond the training length is theoretically possible but unreliable in practice.

RoPE (relative, multiplicative) rotates query and key vectors by position-dependent angles. Because rotation is multiplicative, the inner product between Q at position m and K at position n depends only on (m - n), making it inherently relative. RoPE handles extrapolation better than sinusoidal PE and is the standard for modern LLMs. The base frequency (default 10,000) can be scaled for longer contexts.

Position encoding injection points in a transformer block:

  Input --[+ SinusoidalPE]--> LayerNorm --> Q,K,V projection
                                               |
                                         [RoPE rotates Q,K]
                                               |
                                         [ALiBi biases QK^T scores]
                                               |
                                            softmax --> output

ALiBi (relative, additive bias) adds no parameters at all. Instead, it adds a linear penalty to attention scores based on the distance between query and key positions. Each head gets a different slope from a geometric schedule, so some heads attend locally (steep slope) and others globally (gentle slope). ALiBi provides the strongest length extrapolation because it never saw position-specific parameters during training -- the linear bias generalizes naturally.

Activation Gating: SwiGLU

Standard transformer FFN blocks use two linear projections with a nonlinearity:

FFN(x) = W2 * activation(W1 * x)

SwiGLU adds a third projection that acts as a gate:

SwiGLU(x) = W2 * (SiLU(V * x) * W1 * x)

The element-wise multiplication between the gate path and the linear path creates richer gradient dynamics. The SiLU (x * sigmoid(x)) activation on the gate provides smooth gating. Despite having 50% more parameters in the projections, SwiGLU achieves better quality per FLOP because the gating improves gradient flow through depth. The default expansion factor (2.667x, rounded to a multiple of 8 for tensor core alignment) is calibrated to match the parameter count of a standard 4x FFN.

Patch Embedding for Vision

PatchEmbed bridges the gap between spatial image data and the sequence format that transformers expect. It divides an image into a grid of non-overlapping patches, flattens each patch into a vector, and projects it to the model's embedding dimension.

For a 224x224 RGB image with 16x16 patches: 196 patches, each originally 768-dimensional (16 16 3), projected to the target embedding dimension. This is equivalent to a single convolution with kernel size and stride both equal to the patch size, but Edifice implements it as reshape + dense for clarity and compatibility with the Axon graph.

Smaller patches (8x8, 14x14) give finer spatial resolution at the cost of longer sequences. Larger patches (32x32) are more efficient but lose spatial detail. The 16x16 default from ViT remains the most common choice.

Complexity Comparison

Module	Params per Layer	Time Complexity	Space Complexity	Notes
`RMSNorm`	hidden_size	O(n * d)	O(d)	Single learnable scale vector
`SinusoidalPE`	0	O(n * d)	O(n * d) table	Precomputed, no learned params
`RoPE`	0	O(n * d)	O(n * d/2) table	Rotation angles precomputed
`ALiBi`	0	O(n^2 * h)	O(h n n) bias	Slopes fixed by geometric schedule
`SwiGLU`	3 d d_inner	O(n d d_inner)	O(d_inner)	Three projections vs two in standard FFN
`PatchEmbed`	P^2 C d_embed	O(N_patches P^2 C * d)	O(N_patches * d)	One-time input tokenization
`AdaptiveNorm`	cond_dim (2 or 3) d	O(n * d)	O(d)	Params in conditioning projection
`CrossAttention`	4 * d^2	O(n_q n_kv d)	O(n_q * n_kv)	Standard attention across two sequences

Where n = sequence length, d = hidden dimension, h = number of heads, P = patch size, C = channels.

Module Reference

Edifice.Blocks.RMSNorm -- Root mean square normalization without mean centering; standard for LLaMA, Mistral, Mamba-2
Edifice.Blocks.SwiGLU -- Gated feed-forward with SiLU/GELU/ReLU gate; 3-projection FFN used in modern transformers
Edifice.Blocks.RoPE -- Rotary position embedding via pairwise dimension rotation; relative position without learned params
Edifice.Blocks.ALiBi -- Attention bias from head-specific linear slopes; zero learned parameters, best extrapolation
Edifice.Blocks.PatchEmbed -- Image-to-patch tokenization for vision transformers; reshape + linear projection
Edifice.Blocks.SinusoidalPE -- Fixed sinusoidal position encoding from the original transformer; deterministic baseline
Edifice.Blocks.AdaptiveNorm -- Condition-dependent normalization (AdaLN/AdaLN-Zero) for diffusion and conditional generation
Edifice.Blocks.CrossAttention -- Cross-attention between two sequences; used in encoder-decoder and multimodal models

Cross-References

attention_mechanisms.md -- Position encodings (RoPE, ALiBi, SinusoidalPE) plug directly into attention score computation
vision_architectures.md -- PatchEmbed is the standard input layer for ViT, DeiT, Swin, and MAE; AdaptiveNorm appears in DiT
generative_models.md -- AdaLN-Zero is central to Diffusion Transformers (DiT) for conditional image generation
state_space_models.md -- RMSNorm and SwiGLU appear in Mamba blocks alongside selective state-space layers