A guided tour through Edifice's 19 architecture families -- what to learn first, what builds on what, and where to go deep.

How to Use This Guide

Edifice has 90+ architectures across 19 families. That's overwhelming if you try to learn them all at once. This guide gives you a structured path through the families, organized so that each step builds naturally on the previous ones. At each step, you get:

  • What it is and why it matters
  • Prerequisites -- what you should understand first
  • Key ideas to focus on
  • Try it -- a runnable Edifice example
  • Go deeper -- links to the detailed architecture guide

You don't need to follow this linearly. If you know what problem you're solving, jump to the relevant section using the Problem Landscape as your map. But if you're here to learn ML architectures from the ground up, start at Phase 1 and work through.

Phase 1: The Fundamentals

These are the building blocks that everything else is built on. Learn these first.

Step 1: Feedforward Networks (MLP)

What: The simplest neural network -- stacked dense layers with activations. Input goes in, passes through layers, output comes out. No recurrence, no attention, no fancy routing.

Why it matters: MLPs appear inside almost every other architecture. The feed-forward block in a transformer? An MLP. The classification head at the end of a vision model? An MLP. The expert networks in Mixture of Experts? MLPs. Understanding MLPs means understanding the fundamental building block.

Key ideas:

  • Dense layers (matrix multiply + bias + activation)
  • How depth (more layers) and width (more neurons per layer) affect capacity
  • Dropout for regularization
  • Residual connections for training deep networks
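To make the "matrix multiply + bias + activation" idea concrete, here is a minimal sketch of a single dense layer in plain Elixir lists. The module name `DenseSketch` and the toy weights are made up for illustration; this is not how Edifice or Axon implement dense layers, which operate on Nx tensors.

```elixir
defmodule DenseSketch do
  # ReLU activation: negative values are clipped to 0.0.
  def relu(x), do: max(x, 0.0)

  # One dense layer. `weights` holds one row per output neuron, so each
  # output is relu(dot(row, input) + bias).
  def dense(input, weights, biases) do
    weights
    |> Enum.zip(biases)
    |> Enum.map(fn {row, bias} ->
      row
      |> Enum.zip(input)
      |> Enum.reduce(bias, fn {w, x}, acc -> acc + w * x end)
      |> relu()
    end)
  end
end

# 3 inputs -> 2 outputs (hypothetical weights)
weights = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
biases = [0.0, 1.0]

DenseSketch.dense([2.0, 4.0, 6.0], weights, biases)
# => [0.0, 7.0]  -- the first neuron's pre-activation is -4.0, so ReLU zeroes it
```

Width corresponds to the number of rows in `weights`; depth corresponds to feeding one layer's output into the next.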

Try it:

# A simple 3-layer MLP for tabular data
model = Edifice.Feedforward.MLP.build(
  input_size: 32,
  hidden_sizes: [128, 64, 16],
  activation: :relu,
  dropout: 0.1
)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 32}, :f32), Axon.ModelState.empty())
output = predict_fn.(params, Nx.broadcast(0.5, {4, 32}))
# => {4, 16}  -- 4 samples, 16 features from the last hidden layer

Also explore: KAN (Kolmogorov-Arnold Networks -- learnable activation functions instead of fixed ones) and TabNet (attention-based feature selection for tabular data).

Go deeper: The MLP module docs cover residual connections and layer normalization options.


Step 2: Convolutional Networks

Prerequisites: Step 1 (understand dense layers and activations)

What: Networks that use shared filters sliding across the input. Instead of every neuron connecting to every input (dense), a convolutional filter looks at a small local region and reuses the same weights across all positions.

Why it matters: Convolutions encode the insight that local patterns matter and can appear anywhere in the input. This is why they became the default choice for image processing and remain widely used for sequence modeling (TCN). The concept of "receptive field" -- how much input context a layer can see -- carries over to understanding attention and SSMs.

Key ideas:

  • Filters/kernels: small weight matrices that slide across input
  • Feature maps: the output of applying a filter
  • Stride and padding: how the filter moves and handles edges
  • Residual blocks (ResNet): the skip connection pattern that enables very deep networks
  • Depthwise separable convolutions (MobileNet): factoring convolutions for efficiency
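The sliding-filter idea can be sketched in a few lines for the 1D case. This is an illustrative toy (the module `ConvSketch` and the example filter are assumptions, not Edifice code): it uses no padding and a single channel, and it computes the cross-correlation form that deep learning libraries usually call "convolution".

```elixir
defmodule ConvSketch do
  # Slide `kernel` across `input`, moving `stride` positions per step, and
  # take a dot product at every position. Incomplete windows at the right
  # edge are dropped (no padding).
  def conv1d(input, kernel, stride \\ 1) do
    k = length(kernel)

    input
    |> Enum.chunk_every(k, stride, :discard)
    |> Enum.map(fn window ->
      window
      |> Enum.zip(kernel)
      |> Enum.reduce(0.0, fn {x, w}, acc -> acc + x * w end)
    end)
  end
end

signal = [1.0, 2.0, 4.0, 7.0, 11.0]
# A hypothetical 2-tap filter that responds to the change between neighbors.
edge_filter = [-1.0, 1.0]

ConvSketch.conv1d(signal, edge_filter)
# => [1.0, 2.0, 3.0, 4.0]  -- the feature map

ConvSketch.conv1d(signal, edge_filter, 2)
# => [1.0, 3.0]  -- stride 2 skips every other position
```

The same two weights are reused at every position, which is the weight sharing that separates a convolution from a dense layer.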

Try it:

# ResNet for image-like data
model = Edifice.Convolutional.ResNet.build(
  input_channels: 3,
  num_classes: 10,
  depth: 18
)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 32, 32, 3}, :f32), Axon.ModelState.empty())
output = predict_fn.(params, Nx.broadcast(0.5, {2, 32, 32, 3}))
# => {2, 10}  -- 2 images, 10 class probabilities

# TCN for temporal sequences (causal convolutions)
model = Edifice.Convolutional.TCN.build(
  embed_size: 64,
  hidden_size: 128,
  num_layers: 4,
  kernel_size: 3,
  window_size: 100
)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 100, 64}, :f32), Axon.ModelState.empty())
output = predict_fn.(params, Nx.broadcast(0.5, {1, 100, 64}))
# => {1, 128}

Go deeper: Convolutional Networks guide


Step 3: Building Blocks

Prerequisites: Steps 1-2

What: The composable primitives -- normalization (RMSNorm), position encoding (RoPE, ALiBi), gating (SwiGLU), and patching (PatchEmbed) -- that appear inside transformers, SSMs, and vision models.

Why it matters: When you read about a transformer using "pre-RMSNorm with RoPE and SwiGLU FFN," you need to know what each of those pieces does. These blocks are the vocabulary of modern architecture design.

Key ideas:

  • Why normalization is essential (training stability)
  • RMSNorm vs LayerNorm (speed vs mean centering)
  • Position encoding: how networks know where tokens are in a sequence
  • RoPE (rotary): relative position via rotation, good extrapolation
  • ALiBi: no learned position parameters; adds a linear distance penalty to attention scores
  • SwiGLU: gated feed-forward with multiplicative interactions
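The "speed vs mean centering" tradeoff between RMSNorm and LayerNorm is easiest to see side by side. The sketch below is a simplified version of both, assuming no learned scale or shift parameters and an epsilon of `1.0e-6`; the module name `NormSketch` is hypothetical and this is not Edifice's implementation.

```elixir
defmodule NormSketch do
  @eps 1.0e-6

  # LayerNorm (learned scale/shift omitted): subtract the mean, then divide
  # by the standard deviation.
  def layer_norm(xs) do
    n = length(xs)
    mean = Enum.sum(xs) / n
    var = Enum.sum(Enum.map(xs, fn x -> (x - mean) * (x - mean) end)) / n
    Enum.map(xs, fn x -> (x - mean) / :math.sqrt(var + @eps) end)
  end

  # RMSNorm (learned scale omitted): divide by the root-mean-square only.
  # There is no mean subtraction, which saves one pass over the data.
  def rms_norm(xs) do
    n = length(xs)
    mean_sq = Enum.sum(Enum.map(xs, fn x -> x * x end)) / n
    Enum.map(xs, fn x -> x / :math.sqrt(mean_sq + @eps) end)
  end
end

xs = [1.0, 2.0, 3.0, 4.0]
NormSketch.layer_norm(xs)  # centered: the outputs sum to roughly 0
NormSketch.rms_norm(xs)    # rescaled only: every output stays positive here
```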

Go deeper: Building Blocks guide


Phase 2: Sequence Processing

The three major approaches to processing ordered data. Understanding the tradeoffs between these families is one of the most important skills in modern ML.

Step 4: Recurrent Networks

Prerequisites: Phase 1

What: Networks that maintain a hidden state which is updated at each timestep. They process sequences one token at a time, carrying forward a compressed summary of everything seen so far.

Why it matters: Recurrence is the most intuitive approach to sequences -- it mirrors how you might mentally process a sentence word by word. Modern recurrent architectures (xLSTM, MinGRU, Titans) have closed much of the gap with transformers while retaining constant-memory inference.

Key ideas:

  • Hidden state: the network's running memory
  • Gates: mechanisms that control what to remember and what to forget
  • LSTM/GRU: the classic gated architectures
  • The vanishing gradient problem and how gates mitigate it
  • Parallel scan: how MinGRU/MinLSTM make recurrence parallelizable for training
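Hidden state and gating can be sketched with a scalar toy. The update below follows the MinGRU-style rule as commonly described, h_t = (1 - z_t) * h_{t-1} + z_t * candidate_t, where both the gate and the candidate depend only on the current input. The module `GateSketch` and the parameter names `w_z` and `w_h` are hypothetical, and the loop here is sequential rather than a parallel scan.

```elixir
defmodule GateSketch do
  def sigmoid(x), do: 1.0 / (1.0 + :math.exp(-x))

  # One timestep: the gate z in (0, 1) decides how much of the old state to
  # overwrite with the new candidate value.
  def step(x, h_prev, %{w_z: w_z, w_h: w_h}) do
    z = sigmoid(w_z * x)
    candidate = w_h * x
    (1.0 - z) * h_prev + z * candidate
  end

  # Run over a whole sequence, returning the hidden state after each step.
  def run(xs, params, h0 \\ 0.0) do
    Enum.scan(xs, h0, fn x, h -> step(x, h, params) end)
  end
end

# w_z: 0.0 pins the gate at 0.5 so the arithmetic is easy to follow.
GateSketch.run([2.0, 4.0, 6.0], %{w_z: 0.0, w_h: 1.0})
# => [1.0, 2.5, 4.25]
```

Because the gate does not depend on the previous hidden state, each step is a linear function of h_{t-1}, and that is the property that lets this kind of recurrence be evaluated with a parallel scan during training.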

Try it:

# Classic LSTM
model = Edifice.build(:lstm,
  embed_size: 64,
  hidden_size: 128,
  num_layers: 2,
  window_size: 60
)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 60, 64}, :f32), Axon.ModelState.empty())
output = predict_fn.(params, Nx.broadcast(0.5, {1, 60, 64}))
# => {1, 128}

# Modern minimal GRU (parallel-scannable)
model = Edifice.build(:min_gru,
  embed_size: 64,
  hidden_size: 128,
  num_layers: 4,
  window_size: 60
)

Go deeper: Recurrent Networks guide


Step 5: Attention Mechanisms

Prerequisites: Phase 1, especially Building Blocks (Step 3)

What: A mechanism where each position in a sequence computes relevance scores against all other positions, then aggregates information based on those scores. This is the core of the transformer architecture.

Why it matters: Transformers (built on attention) are the dominant architecture for language models and increasingly for other domains. Understanding attention -- and its quadratic cost -- is essential context for understanding why SSMs, linear attention, and hybrid architectures exist.

Key ideas:

  • Queries, keys, and values: the three projections
  • Scaled dot-product attention: the core computation
  • Multi-head attention: parallel attention with different learned perspectives
  • The quadratic bottleneck: O(L^2) in sequence length
  • Linear attention variants: approximating attention in O(L)
  • Retention and RWKV: recurrence-based alternatives
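Here is a minimal single-head sketch of scaled dot-product attention on plain lists, assuming the queries, keys, and values have already been projected. The module name `AttentionSketch` is made up for illustration, and there is no masking, batching, or multi-head splitting; treat it as a reading aid, not as Edifice's attention layer.

```elixir
defmodule AttentionSketch do
  def dot(a, b) do
    a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
  end

  # Numerically stable softmax over a list of scores.
  def softmax(scores) do
    m = Enum.max(scores)
    exps = Enum.map(scores, fn s -> :math.exp(s - m) end)
    total = Enum.sum(exps)
    Enum.map(exps, fn e -> e / total end)
  end

  # queries, keys, values: lists of vectors, one per position.
  # Returns {outputs, weights}. `weights` has one row per query and one
  # column per key, which is where the O(L^2) cost comes from.
  def attend(queries, keys, values) do
    scale = :math.sqrt(length(hd(keys)))

    weights =
      Enum.map(queries, fn q ->
        keys |> Enum.map(fn k -> dot(q, k) / scale end) |> softmax()
      end)

    outputs =
      Enum.map(weights, fn row ->
        row
        |> Enum.zip(values)
        |> Enum.map(fn {w, v} -> Enum.map(v, fn x -> x * w end) end)
        |> Enum.zip_with(&Enum.sum/1)
      end)

    {outputs, weights}
  end
end

# Two identical keys get equal weight, so the output is the mean of the values.
AttentionSketch.attend([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [4.0, 2.0]])
# => {[[3.0, 1.0]], [[0.5, 0.5]]}
```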

Try it:

# Standard multi-head attention transformer
model = Edifice.build(:attention,
  embed_size: 128,
  hidden_size: 256,
  num_heads: 8,
  num_layers: 4,
  window_size: 60
)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 60, 128}, :f32), Axon.ModelState.empty())
output = predict_fn.(params, Nx.broadcast(0.5, {1, 60, 128}))
# => {1, 256}

# RetNet: attention-like quality with recurrent inference
model = Edifice.build(:retnet,
  embed_size: 128,
  hidden_size: 256,
  num_heads: 4,
  num_layers: 4,
  window_size: 60
)

Go deeper: Attention Mechanisms guide


Step 6: State Space Models

Prerequisites: Steps 4 and 5 (understand both recurrence and attention tradeoffs)

What: Models that treat sequences as discretized continuous-time dynamical systems. A hidden state evolves according to learned dynamics, combining the parallel training of convolutions with the constant-memory inference of recurrence.

Why it matters: SSMs (especially Mamba) are among the strongest alternatives to transformers for sequence modeling. Their compute scales linearly with sequence length, and they are competitive with transformers on many tasks. Understanding the SSM-attention tradeoff is crucial for architecture selection.

Key ideas:

  • State space equations: the continuous-time formulation
  • Discretization: converting continuous to discrete for digital computation
  • Selective SSMs (Mamba): input-dependent parameters
  • Parallel scan: how linear recurrences train in parallel
  • Hybrid models (Jamba, Zamba): SSM + attention for the best of both
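For reference, the continuous-time formulation and its discretization are usually written as follows. The symbols are an assumption based on common notation in the S4/Mamba literature (h for the hidden state, x for the input, Δ for the step size) and may differ from the names used in Edifice's source; zero-order hold is shown as one standard discretization choice.

```latex
\begin{aligned}
\text{continuous:}\quad & \dot{h}(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t) \\
\text{zero-order hold with step } \Delta\text{:}\quad & \bar{A} = \exp(\Delta A), \qquad
  \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B \\
\text{discrete recurrence:}\quad & h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \qquad y_k = C\,h_k
\end{aligned}
```

In a selective SSM such as Mamba, Δ, B, and C are computed from the current input rather than being fixed, so the recurrence can decide per token what to keep and what to forget.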

Try it:

# Mamba: the flagship SSM
model = Edifice.build(:mamba,
  embed_size: 128,
  hidden_size: 256,
  state_size: 16,
  num_layers: 4,
  window_size: 60
)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 60, 128}, :f32), Axon.ModelState.empty())
output = predict_fn.(params, Nx.broadcast(0.5, {1, 60, 128}))
# => {1, 256}

# Hybrid: Mamba + Attention
model = Edifice.build(:jamba,
  embed_size: 128,
  hidden_size: 256,
  num_layers: 6,
  attention_ratio: 0.33,  # 1/3 of layers use attention
  num_heads: 4,
  window_size: 60
)

Go deeper: State Space Models guide


Phase 3: Specialized Domains

With the sequence processing fundamentals down, branch out into domain-specific families.

Step 7: Vision Architectures

Prerequisites: Steps 2 (convolutions), 3 (building blocks), 5 (attention)

What: Architectures designed for image understanding, from Vision Transformers (ViT) that treat images as sequences of patches, to U-Net for pixel-level segmentation.

Key ideas:

  • Patch embedding: converting images into token sequences
  • ViT: applying the transformer to vision
  • Swin: hierarchical vision transformer with shifted windows
  • U-Net: encoder-decoder with skip connections for dense prediction
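Patch embedding starts with simple arithmetic: how many tokens does an image become, and how large is each one before projection? The helper below is a hypothetical sketch (`PatchSketch.patch_grid/3` is not an Edifice function), assuming square images and non-overlapping square patches.

```elixir
defmodule PatchSketch do
  # Number of tokens and per-token size when an image is cut into
  # non-overlapping square patches, before the learned linear projection.
  def patch_grid(image_size, patch_size, channels) do
    if rem(image_size, patch_size) != 0 do
      raise ArgumentError, "image_size must be divisible by patch_size"
    end

    per_side = div(image_size, patch_size)

    %{
      num_patches: per_side * per_side,
      patch_dim: patch_size * patch_size * channels
    }
  end
end

# A 32x32 RGB image with 8x8 patches becomes a sequence of 16 tokens,
# each holding 192 raw pixel values.
PatchSketch.patch_grid(32, 8, 3)
# => %{num_patches: 16, patch_dim: 192}
```

Halving the patch size quadruples the sequence length, which matters because attention cost grows quadratically with the number of tokens.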

Try it:

# Vision Transformer
model = Edifice.Vision.ViT.build(
  image_size: 32,
  patch_size: 8,
  num_channels: 3,
  embed_size: 256,
  num_heads: 8,
  num_layers: 6,
  num_classes: 10
)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 32, 32, 3}, :f32), Axon.ModelState.empty())
output = predict_fn.(params, Nx.broadcast(0.5, {2, 32, 32, 3}))
# => {2, 10}

Go deeper: Vision Architectures guide


Step 8: Graph & Set Networks

Prerequisites: Step 1 (MLPs), basic understanding of attention helps

What: Architectures for data with relational structure (graphs) or no ordering (sets). Graph networks propagate information along edges between nodes. Set networks process unordered collections with permutation-invariant operations.

Key ideas:

  • Message passing: nodes aggregate information from their neighbors
  • Adjacency matrices: how graph structure is represented as tensors
  • Permutation invariance: the output shouldn't change if you reorder the nodes/elements
  • Pooling: going from node-level to graph-level predictions
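One round of message passing can be sketched with scalar node features and a dense adjacency matrix. This toy uses mean aggregation with no learned weights; the module `MessagePassingSketch` and the example graph are assumptions for illustration and do not reflect how Edifice's GCN or GAT layers are written.

```elixir
defmodule MessagePassingSketch do
  # One round of mean-aggregation message passing.
  # `adjacency` is a list of rows; a 1 at row i, column j means node j sends
  # a message to node i.
  def propagate(features, adjacency) do
    Enum.map(adjacency, fn row ->
      neighbor_values =
        row
        |> Enum.zip(features)
        |> Enum.filter(fn {edge, _value} -> edge == 1 end)
        |> Enum.map(fn {_edge, value} -> value end)

      case neighbor_values do
        [] -> 0.0
        vals -> Enum.sum(vals) / length(vals)
      end
    end)
  end

  # Graph-level readout: mean pooling over nodes. Reordering the nodes does
  # not change the result.
  def mean_pool(features), do: Enum.sum(features) / length(features)
end

# Path graph 0 - 1 - 2 with self-loops, so each node also keeps its own value.
adjacency = [
  [1, 1, 0],
  [1, 1, 1],
  [0, 1, 1]
]

MessagePassingSketch.propagate([1.0, 2.0, 6.0], adjacency)
# => [1.5, 3.0, 4.0]
```

A real graph layer would apply a learned transformation before or after the aggregation, and stacking rounds lets information travel more than one hop.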

Try it:

# Graph attention network
model = Edifice.Graph.GAT.build_classifier(
  input_dim: 16,
  hidden_dims: [64, 64],
  num_classes: 3,
  num_heads: 4,
  pool: :mean
)

# DeepSets: process unordered collections
model = Edifice.Sets.DeepSets.build(
  input_dim: 3,
  hidden_dim: 64,
  output_dim: 10,
  pool: :mean
)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({4, 20, 3}, :f32), Axon.ModelState.empty())
output = predict_fn.(params, Nx.broadcast(0.5, {4, 20, 3}))
# => {4, 10}  -- set-level predictions for 4 sets of 20 points

Go deeper: Graph & Set Networks guide


Step 9: Generative Models

Prerequisites: Steps 1 and 5 (MLPs and attention); probability concepts from Core Vocabulary

What: Architectures that learn to create new data. Instead of predicting a label, they learn the distribution of the training data and can generate new samples from it.

Key ideas:

  • Latent space and the encoder-decoder pattern (VAE)
  • The reparameterization trick: making sampling differentiable
  • Adversarial training (GAN): generator vs. discriminator
  • Diffusion: adding noise then learning to reverse it
  • Flow matching: learning ODE trajectories from noise to data
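The reparameterization trick rewrites "sample z from N(mu, sigma^2)" as z = mu + sigma * eps with eps drawn from N(0, 1), so gradients can flow through mu and sigma. The sketch below takes eps as an explicit argument to keep it deterministic; `ReparamSketch` is a hypothetical module, and its three-argument function is deliberately different from any library helper.

```elixir
defmodule ReparamSketch do
  # z = mu + sigma * eps, with sigma = exp(0.5 * log_var).
  # `eps` is passed in so the function is deterministic; during training it
  # would be drawn from a standard normal distribution.
  def reparameterize(mu, log_var, eps) do
    Enum.zip_with([mu, log_var, eps], fn [m, lv, e] ->
      m + :math.exp(0.5 * lv) * e
    end)
  end
end

mu = [0.0, 1.0]
log_var = [0.0, 0.0]   # variance 1.0, so sigma = 1.0
eps = [0.5, -2.0]      # stand-ins for samples from N(0, 1)

ReparamSketch.reparameterize(mu, log_var, eps)
# => [0.5, -1.0]
```

Predicting the log-variance rather than the variance keeps sigma positive without needing a constraint on the encoder output.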

Try it:

# Variational Autoencoder
{encoder, decoder} = Edifice.Generative.VAE.build(
  input_size: 784,
  latent_size: 32,
  encoder_sizes: [512, 256],
  decoder_sizes: [256, 512]
)

# Build encoder
{enc_init, enc_predict} = Axon.build(encoder)
enc_params = enc_init.(Nx.template({1, 784}, :f32), Axon.ModelState.empty())
%{mu: mu, log_var: log_var} = enc_predict.(enc_params, Nx.broadcast(0.5, {4, 784}))

# Sample from latent space
z = Edifice.Generative.VAE.reparameterize(mu, log_var)
# z shape: {4, 32}  -- 4 latent vectors

Go deeper: Generative Models guide


Step 10: Contrastive & Self-Supervised Learning

Prerequisites: Steps 1-2 (feedforward + convolutions), basic understanding of encoders

What: Methods that learn useful representations without labeled data. They create their own training signal by comparing different views of the same data (contrastive) or by reconstructing masked portions (self-supervised).

Key ideas:

  • Positive and negative pairs: what to pull together, what to push apart
  • Projection heads: small networks that transform representations for the contrastive objective
  • Momentum encoders (BYOL): a slowly-updating copy of the network
  • Masked autoencoders (MAE): reconstruct what you can't see
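A momentum encoder is typically maintained as an exponential moving average of the online network's parameters. The sketch below shows that update on a flat list of parameters; the module `MomentumSketch` and the name `tau` for the momentum coefficient are assumptions for illustration.

```elixir
defmodule MomentumSketch do
  # target <- tau * target + (1 - tau) * online, applied per parameter.
  # A tau close to 1.0 means the target network changes slowly.
  def ema_update(target, online, tau) do
    Enum.zip_with(target, online, fn t, o -> tau * t + (1.0 - tau) * o end)
  end
end

# tau = 0.5 is unrealistically small but keeps the numbers readable.
MomentumSketch.ema_update([0.0, 1.0], [1.0, 1.0], 0.5)
# => [0.5, 1.0]
```

The target network receives no gradients; it only follows the online network through this update, which gives the online network a stable signal to match.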

Go deeper: Contrastive Learning guide


Phase 4: Advanced and Specialized

These families build on the foundations and address specific needs.

Step 11: Meta-Learning (MoE, LoRA, Adapter, Capsules)

Prerequisites: Phase 2

What: Techniques that modify or compose other architectures: Mixture of Experts routes different inputs to different sub-networks; LoRA and Adapters add small trainable modules to frozen pretrained models; Capsules encode part-whole relationships.

Why it matters: MoE is how many large language models scale to hundreds of billions of parameters while keeping inference cost manageable: only a few experts run for each token. LoRA is how you fine-tune those models on your specific task with limited compute.
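To see why LoRA needs so little compute, compare trainable parameter counts for one weight matrix. LoRA freezes the original d_out x d_in matrix and trains two low-rank factors of shapes d_out x r and r x d_in. The helper names below are hypothetical and the layer sizes are only an example.

```elixir
defmodule LoraSketch do
  # Trainable parameters when fine-tuning a d_out x d_in weight matrix directly.
  def full_params(d_in, d_out), do: d_in * d_out

  # Trainable parameters with LoRA: B is d_out x rank, A is rank x d_in.
  def lora_params(d_in, d_out, rank), do: rank * (d_in + d_out)
end

LoraSketch.full_params(1024, 1024)
# => 1048576

LoraSketch.lora_params(1024, 1024, 8)
# => 16384  -- 64x fewer trainable parameters for this layer
```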

Go deeper: Meta-Learning guide


Step 12: Dynamic & Continuous Systems (NeuralODE, Liquid, Energy)

Prerequisites: Phase 2, comfort with the idea of differential equations

What: Architectures that model continuous dynamics: Neural ODEs define depth as a continuous variable, Liquid Networks use ODE-based cells that adapt over time, Energy-Based Models learn energy landscapes, and Hopfield networks provide associative memory.
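"Depth as a continuous variable" means the hidden state follows dh/dt = f(h, t) and an ODE solver replaces the stack of layers. A minimal sketch with a fixed-step Euler solver is shown below. Practical Neural ODE implementations generally use adaptive solvers and a neural network for f, so both the solver choice and the module name `OdeSketch` are assumptions here.

```elixir
defmodule OdeSketch do
  # Fixed-step Euler integration of dh/dt = f(h, t), starting at (h0, t0)
  # and taking `steps` steps of size `dt`.
  def euler(f, h0, t0, dt, steps) do
    {h, _t} =
      Enum.reduce(1..steps, {h0, t0}, fn _step, {h, t} ->
        {h + dt * f.(h, t), t + dt}
      end)

    h
  end
end

# dh/dt = -h has the exact solution h(t) = h0 * exp(-t), which makes it a
# convenient check. In a Neural ODE, `decay` would be a learned network.
decay = fn h, _t -> -h end

OdeSketch.euler(decay, 1.0, 0.0, 0.001, 1000)
# close to :math.exp(-1.0), i.e. about 0.368
```

Each Euler step has the same shape as a residual block (h + dt * f(h)), which is the connection between ResNets and Neural ODEs.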

Go deeper: Dynamic & Continuous guide


Step 13: Uncertainty & Memory (Bayesian, NTM, Evidential)

Prerequisites: Phase 1, basic probability

What: Networks that know what they don't know (Bayesian, MC Dropout, Evidential) and networks with external memory banks (NTM, Memory Networks) for tasks requiring storage and retrieval.
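With MC Dropout, "knowing what you don't know" comes down to running several stochastic forward passes and looking at their spread. The sketch below only shows the summarization step; the sample values are invented stand-ins for model outputs, and `UncertaintySketch` is a hypothetical module.

```elixir
defmodule UncertaintySketch do
  # Summarize repeated stochastic predictions for one input: the mean is the
  # prediction and the variance is a simple uncertainty estimate.
  def summarize(samples) do
    n = length(samples)
    mean = Enum.sum(samples) / n
    variance = Enum.sum(Enum.map(samples, fn s -> (s - mean) * (s - mean) end)) / n
    %{mean: mean, variance: variance}
  end
end

# Invented outputs of 4 forward passes with dropout left on at inference.
consistent = [0.5, 0.5, 0.5, 0.5]
scattered = [0.0, 1.0, 0.25, 0.75]

UncertaintySketch.summarize(consistent)
# => %{mean: 0.5, variance: 0.0}

UncertaintySketch.summarize(scattered)
# => %{mean: 0.5, variance: 0.15625}
```

Both inputs get the same mean prediction, but only the variance shows that the second one should be trusted less.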

Go deeper: Uncertainty & Memory guide


Step 14: Neuromorphic (SNN, ANN2SNN)

Prerequisites: Phase 2

What: Spiking neural networks that communicate through discrete spikes rather than continuous activations, inspired by biological neurons. ANN2SNN converts conventional networks to spiking equivalents for deployment on neuromorphic hardware.
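A leaky integrate-and-fire neuron is a common starting point for spiking models, and it shows what "discrete spikes" means in practice. The sketch below is a simplified toy under assumed settings (multiplicative leak of 0.5, threshold of 1.0, reset to zero); `SpikeSketch` is a hypothetical module and not Edifice's SNN layer.

```elixir
defmodule SpikeSketch do
  # Leaky integrate-and-fire: at each step the membrane potential decays by
  # `leak`, accumulates the input current, and emits a spike (1) and resets
  # to 0.0 once it reaches `threshold`. Otherwise it emits 0.
  def run(currents, leak \\ 0.5, threshold \\ 1.0) do
    {spikes, _v} =
      Enum.map_reduce(currents, 0.0, fn current, v ->
        v = leak * v + current
        if v >= threshold, do: {1, 0.0}, else: {0, v}
      end)

    spikes
  end
end

SpikeSketch.run([0.6, 0.6, 0.6, 0.0, 0.6])
# => [0, 0, 1, 0, 0]  -- the potential builds up over three steps, then fires
```

The output is a train of 0s and 1s rather than continuous activations, which is the property neuromorphic hardware is built to exploit.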


The Dependency Graph

A visual summary of what builds on what:

Phase 1: Fundamentals
  1. MLP (foundation)    2. Convolutional (local pattern)    3. Blocks (components)
          |
          v
Phase 2: Sequence Processing
  4. Recurrent (sequential)    5. Attention (parallel)    6. SSM (best of both)
          |
          v
Phase 3: Specialized Domains
  7. Vision    8. Graph    9. Generative    10. Contrastive
          |
          v
Phase 4: Advanced
  11. Meta (MoE)    12. ODE/Energy    13. Uncertainty & Memory    14. Neuromorphic (Spiking)

Suggested Exercises

For each phase, a concrete exercise to test your understanding:

Phase 1: Build an MLP classifier for a simple dataset (e.g., Iris or synthetic data). Train it using Axon's training API. Add dropout and observe the effect on overfitting.

Phase 2: Take the same dataset but structure it as sequences. Compare LSTM, attention, and Mamba on the same task. Measure parameter counts and observe output shapes.

Phase 3: Pick a domain that interests you:

  • Vision: classify MNIST digits with both ResNet and ViT
  • Graphs: classify synthetic graph structures with GCN
  • Generation: train a VAE on a simple distribution and sample from it

Phase 4: Take your Phase 2 model and apply LoRA fine-tuning. Compare training a full model from scratch vs. fine-tuning with LoRA -- observe the difference in trainable parameter count.

Quick Reference: All 19 Families

#  | Family          | Guide | Key Architecture      | Core Idea
1  | Feedforward     | -     | MLP                   | Stacked dense layers
2  | Convolutional   | Guide | ResNet                | Shared local filters + skip connections
3  | Building Blocks | Guide | RMSNorm, RoPE, SwiGLU | Composable primitives
4  | Recurrent       | Guide | LSTM, xLSTM           | Sequential hidden state
5  | Attention       | Guide | Multi-Head, GQA       | Pairwise relevance scoring
6  | State Space     | Guide | Mamba                 | Discretized dynamical system
7  | Vision          | Guide | ViT, Swin             | Images as patch sequences
8  | Graph           | Guide | GCN, GAT              | Message passing on edges
9  | Sets            | Guide | DeepSets              | Permutation-invariant aggregation
10 | Generative      | Guide | VAE, Diffusion        | Learn and sample from p(data)
11 | Contrastive     | Guide | SimCLR, BYOL          | Learn representations without labels
12 | Energy          | Guide | EBM, Hopfield         | Energy landscape minimization
13 | Liquid          | Guide | LiquidNN              | Continuous-time ODE cells
14 | Probabilistic   | Guide | Bayesian, Evidential  | Calibrated uncertainty
15 | Memory          | Guide | NTM                   | External differentiable memory
16 | Meta            | Guide | MoE, LoRA             | Compose and adapt architectures
17 | Neuromorphic    | -     | SNN                   | Spike-based communication
18 | Capsule         | Guide | Capsule               | Part-whole relationships
19 | Hypernetwork    | Guide | Hypernetwork          | Networks generating networks