Learning Path

A guided tour through Edifice's 19 architecture families -- what to learn first, what builds on what, and where to go deep.

How to Use This Guide

Edifice has 90+ architectures across 19 families. That's overwhelming if you try to learn them all at once. This guide gives you a structured path through the families, organized so that each step builds naturally on the previous ones. At each step, you get:

What it is and why it matters
Prerequisites -- what you should understand first
Key ideas to focus on
Try it -- a runnable Edifice example
Go deeper -- links to the detailed architecture guide

You don't need to follow this linearly. If you know what problem you're solving, jump to the relevant section using the Problem Landscape as your map. But if you're here to learn ML architectures from the ground up, start at Phase 1 and work through.

Phase 1: The Fundamentals

These are the building blocks that everything else is built on. Learn these first.

Step 1: Feedforward Networks (MLP)

What: The simplest neural network -- stacked dense layers with activations. Input goes in, passes through layers, output comes out. No recurrence, no attention, no fancy routing.

Why it matters: MLPs appear inside almost every other architecture. The feed-forward block in a transformer? An MLP. The classification head at the end of a vision model? An MLP. The expert networks in Mixture of Experts? MLPs. Understanding MLPs means understanding the fundamental building block.

Key ideas:

Dense layers (matrix multiply + bias + activation)
How depth (more layers) and width (more neurons per layer) affect capacity
Dropout for regularization
Residual connections for training deep networks
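To make "matrix multiply + bias + activation" concrete, here is a minimal sketch of one dense layer in plain Elixir. Lists stand in for tensors so it runs without dependencies; `DenseSketch` and its functions are illustrative names, not part of Edifice or Axon.

```elixir
# Illustrative only: one dense layer computed by hand.
defmodule DenseSketch do
  # x: list of input features
  # weights: one row of weights per output unit
  # biases: one bias per output unit
  def dense(x, weights, biases, activation) do
    weights
    |> Enum.zip(biases)
    |> Enum.map(fn {row, b} ->
      row
      |> Enum.zip(x)
      |> Enum.map(fn {w, xi} -> w * xi end)
      |> Enum.sum()
      |> Kernel.+(b)
      |> activation.()
    end)
  end

  # ReLU keeps positive values and clamps negatives to zero.
  def relu(v), do: max(v, 0.0)
end

x = [1.0, -2.0, 0.5]
weights = [[0.5, 0.25, 1.0], [-1.0, 0.5, 0.0]]
biases = [0.0, 0.5]

IO.inspect(DenseSketch.dense(x, weights, biases, &DenseSketch.relu/1))
# => [0.5, 0.0]
```

Width is the number of rows in `weights`; depth is how many times you feed the output of one `dense/4` call into the next.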

Try it:

# A simple 3-layer MLP for tabular data
model = Edifice.Feedforward.MLP.build(
  input_size: 32,
  hidden_sizes: [128, 64, 16],
  activation: :relu,
  dropout: 0.1
)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 32}, :f32), Axon.ModelState.empty())
output = predict_fn.(params, Nx.broadcast(0.5, {4, 32}))
# => {4, 16}  -- 4 samples, 16 features from the last hidden layer

Also explore: KAN (Kolmogorov-Arnold Networks -- learnable activation functions instead of fixed ones) and TabNet (attention-based feature selection for tabular data).

Go deeper: The MLP module docs cover residual connections and layer normalization options.

Step 2: Convolutional Networks

Prerequisites: Step 1 (understand dense layers and activations)

What: Networks that use shared filters sliding across the input. Instead of every neuron connecting to every input (dense), a convolutional filter looks at a small local region and reuses the same weights across all positions.

Why it matters: Convolutions encode the insight that local patterns matter and can appear anywhere in the input. This is why they have long been the default for image processing and are also widely used for sequence modeling (TCN). The concept of "receptive field" -- how much input context a layer can see -- carries over to understanding attention and SSMs.

Key ideas:

Filters/kernels: small weight matrices that slide across input
Feature maps: the output of applying a filter
Stride and padding: how the filter moves and handles edges
Residual blocks (ResNet): the skip connection pattern that enables very deep networks
Depthwise separable convolutions (MobileNet): factoring convolutions for efficiency
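The effect of kernel size, stride and padding on the output is easiest to see in one dimension. The sketch below slides a single shared kernel over a list; `ConvSketch` is a hypothetical helper written for this guide, not an Edifice module.

```elixir
# Illustrative only: 1-D convolution with zero padding and stride.
defmodule ConvSketch do
  def conv1d(signal, kernel, opts \\ []) do
    stride = Keyword.get(opts, :stride, 1)
    padding = Keyword.get(opts, :padding, 0)

    # Zero-pad both ends so the kernel can be centered on edge elements.
    zeros = List.duplicate(0.0, padding)
    padded = zeros ++ signal ++ zeros

    # Every window is multiplied by the SAME kernel weights.
    padded
    |> Enum.chunk_every(length(kernel), stride, :discard)
    |> Enum.map(fn window ->
      window
      |> Enum.zip(kernel)
      |> Enum.map(fn {v, w} -> v * w end)
      |> Enum.sum()
    end)
  end
end

signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [-1.0, 0.0, 1.0]

IO.inspect(ConvSketch.conv1d(signal, kernel))
# => [2.0, 2.0, 2.0]
IO.inspect(ConvSketch.conv1d(signal, kernel, padding: 1))
# => [2.0, 2.0, 2.0, 2.0, -4.0]
IO.inspect(ConvSketch.conv1d(signal, kernel, padding: 1, stride: 2))
# => [2.0, 2.0, -4.0]
```

The output length follows floor((n + 2 * padding - k) / stride) + 1, where n is the input length and k the kernel size.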

Try it:

# ResNet for image-like data
model = Edifice.Convolutional.ResNet.build(
  input_channels: 3,
  num_classes: 10,
  depth: 18
)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 32, 32, 3}, :f32), Axon.ModelState.empty())
output = predict_fn.(params, Nx.broadcast(0.5, {2, 32, 32, 3}))
# => {2, 10}  -- 2 images, one score per class (10 classes)

# TCN for temporal sequences (causal convolutions)
model = Edifice.Convolutional.TCN.build(
  embed_size: 64,
  hidden_size: 128,
  num_layers: 4,
  kernel_size: 3,
  window_size: 100
)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 100, 64}, :f32), Axon.ModelState.empty())
output = predict_fn.(params, Nx.broadcast(0.5, {1, 100, 64}))
# => {1, 128}

Go deeper: Convolutional Networks guide

Step 3: Building Blocks

Prerequisites: Steps 1-2

What: The composable primitives -- normalization (RMSNorm), position encoding (RoPE, ALiBi), gating (SwiGLU), and patching (PatchEmbed) -- that appear inside transformers, SSMs, and vision models.

Why it matters: When you read about a transformer using "pre-RMSNorm with RoPE and SwiGLU FFN," you need to know what each of those pieces does. These blocks are the vocabulary of modern architecture design.

Key ideas:

Why normalization is essential (training stability)
RMSNorm vs LayerNorm (speed vs mean centering)
Position encoding: how networks know where tokens are in a sequence
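RoPE (rotary): relative position via rotation of queries and keys; extends to longer contexts with frequency scaling
ALiBi: no learned parameters; a linear distance penalty added to attention scores, designed for length extrapolation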
SwiGLU: gated feed-forward with multiplicative interactions
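The "speed vs mean centering" difference between the two normalizations can be sketched in a few lines. This is a simplified version in plain Elixir that omits the learnable gain (and, for LayerNorm, the bias); `NormSketch` is an illustrative name, not the Edifice implementation.

```elixir
# Illustrative only: RMSNorm and LayerNorm over one feature vector.
defmodule NormSketch do
  @eps 1.0e-6

  # RMSNorm: divide by the root mean square. No mean is computed or subtracted.
  def rms_norm(xs) do
    mean_sq = Enum.sum(Enum.map(xs, &(&1 * &1))) / length(xs)
    scale = :math.sqrt(mean_sq + @eps)
    Enum.map(xs, &(&1 / scale))
  end

  # LayerNorm: subtract the mean first, then divide by the standard deviation.
  def layer_norm(xs) do
    n = length(xs)
    mean = Enum.sum(xs) / n
    var = Enum.sum(Enum.map(xs, &((&1 - mean) * (&1 - mean)))) / n
    scale = :math.sqrt(var + @eps)
    Enum.map(xs, &((&1 - mean) / scale))
  end
end

xs = [1.0, 2.0, 3.0, 4.0]
IO.inspect(NormSketch.rms_norm(xs), label: "rms_norm")
IO.inspect(NormSketch.layer_norm(xs), label: "layer_norm")
```

The LayerNorm output is centered on zero, while the RMSNorm output keeps its offset and only has its scale fixed. Skipping the mean and variance passes is where RMSNorm saves work.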

Go deeper: Building Blocks guide

Phase 2: Sequence Processing

The three major approaches to processing ordered data. Understanding the tradeoffs between these families is one of the most important skills in modern ML.

Step 4: Recurrent Networks

Prerequisites: Phase 1

What: Networks that maintain a hidden state which is updated at each timestep. They process sequences one token at a time, carrying forward a compressed summary of everything seen so far.

Why it matters: Recurrence is the most intuitive approach to sequences -- it mirrors how you might mentally process a sentence word by word. Modern recurrent architectures (xLSTM, MinGRU, Titans) have closed much of the gap with transformers while retaining constant-memory inference.

Key ideas:

Hidden state: the network's running memory
Gates: mechanisms that control what to remember and what to forget
LSTM/GRU: the classic gated architectures
The vanishing gradient problem and how gates mitigate it
Parallel scan: how MinGRU/MinLSTM make recurrence parallelizable for training
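A gate is just a number between 0 and 1 that blends the old hidden state with a new candidate. The sketch below is a scalar, simplified minGRU-style update in which the gate depends only on the current input; the parameter names `w_z` and `w_h` are assumptions made for illustration, and this is not how Edifice implements its recurrent layers.

```elixir
# Illustrative only: a scalar gated recurrence.
defmodule GateSketch do
  def sigmoid(v), do: 1.0 / (1.0 + :math.exp(-v))

  # One timestep: z near 0 keeps the old state, z near 1 overwrites it.
  def step(x, h_prev, %{w_z: w_z, w_h: w_h}) do
    z = sigmoid(w_z * x)
    candidate = w_h * x
    (1.0 - z) * h_prev + z * candidate
  end

  # Returns the hidden state after every timestep.
  def run(xs, params, h0 \\ 0.0) do
    Enum.scan(xs, h0, fn x, h -> step(x, h, params) end)
  end
end

# With w_z = 0 the gate is always 0.5, so each step averages old and new.
IO.inspect(GateSketch.run([2.0, 2.0], %{w_z: 0.0, w_h: 1.0}))
# => [1.0, 1.5]
```

Because the gate and the candidate here do not depend on the previous hidden state, the update is a linear recurrence in `h`, which is the property that makes a parallel scan possible.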

Try it:

# Classic LSTM
model = Edifice.build(:lstm,
  embed_size: 64,
  hidden_size: 128,
  num_layers: 2,
  window_size: 60
)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 60, 64}, :f32), Axon.ModelState.empty())
output = predict_fn.(params, Nx.broadcast(0.5, {1, 60, 64}))
# => {1, 128}

# Modern minimal GRU (parallel-scannable)
model = Edifice.build(:min_gru,
  embed_size: 64,
  hidden_size: 128,
  num_layers: 4,
  window_size: 60
)

Go deeper: Recurrent Networks guide

Step 5: Attention Mechanisms

Prerequisites: Phase 1, especially Building Blocks (Step 3)

What: A mechanism where each position in a sequence computes relevance scores against all other positions, then aggregates information based on those scores. This is the core of the transformer architecture.

Why it matters: Transformers (built on attention) are the dominant architecture for language models and increasingly for other domains. Understanding attention -- and its quadratic cost -- is essential context for understanding why SSMs, linear attention, and hybrid architectures exist.

Key ideas:

Queries, keys, and values: the three projections
Scaled dot-product attention: the core computation
Multi-head attention: parallel attention with different learned perspectives
The quadratic bottleneck: O(L^2) in sequence length
Linear attention variants: approximating attention in O(L)
Retention and RWKV: recurrence-based alternatives
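Scaled dot-product attention fits in a short function once queries, keys and values are given. The sketch below uses plain lists of row vectors, a single head, and no masking; `AttentionSketch` is an illustrative module, not the Edifice attention layer.

```elixir
# Illustrative only: single-head scaled dot-product attention.
defmodule AttentionSketch do
  def dot(a, b) do
    a |> Enum.zip(b) |> Enum.map(fn {x, y} -> x * y end) |> Enum.sum()
  end

  # Numerically stable softmax: subtract the largest score before exponentiating.
  def softmax(scores) do
    peak = Enum.max(scores)
    exps = Enum.map(scores, &:math.exp(&1 - peak))
    total = Enum.sum(exps)
    Enum.map(exps, &(&1 / total))
  end

  # queries, keys, values: one row vector per sequence position.
  def attend(queries, keys, values) do
    scale = :math.sqrt(length(hd(keys)))

    Enum.map(queries, fn q ->
      # Relevance of every key to this query, normalized to sum to 1.
      weights = keys |> Enum.map(&(dot(q, &1) / scale)) |> softmax()

      # Weighted sum of the value rows.
      values
      |> Enum.zip(weights)
      |> Enum.map(fn {v, w} -> Enum.map(v, &(&1 * w)) end)
      |> Enum.zip_with(&Enum.sum/1)
    end)
  end
end

queries = [[1.0, 0.0]]
keys = [[1.0, 1.0], [1.0, 1.0]]
values = [[2.0, 0.0], [4.0, 2.0]]

# Identical keys get equal weight, so the output is the mean of the values.
IO.inspect(AttentionSketch.attend(queries, keys, values))
# => [[3.0, 1.0]]
```

Every query is scored against every key, so a sequence of length L needs L * L dot products. That is the quadratic bottleneck.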

Try it:

# Standard multi-head attention transformer
model = Edifice.build(:attention,
  embed_size: 128,
  hidden_size: 256,
  num_heads: 8,
  num_layers: 4,
  window_size: 60
)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 60, 128}, :f32), Axon.ModelState.empty())
output = predict_fn.(params, Nx.broadcast(0.5, {1, 60, 128}))
# => {1, 256}

# RetNet: attention-like quality with recurrent inference
model = Edifice.build(:retnet,
  embed_size: 128,
  hidden_size: 256,
  num_heads: 4,
  num_layers: 4,
  window_size: 60
)

Go deeper: Attention Mechanisms guide

Step 6: State Space Models

Prerequisites: Steps 4 and 5 (understand both recurrence and attention tradeoffs)

What: Models that treat sequences as discretized continuous-time dynamical systems. A hidden state evolves according to learned dynamics, combining the parallel training of convolutions with the constant-memory inference of recurrence.

Why it matters: SSMs (especially Mamba) are among the strongest alternatives to transformers for sequence modeling. They scale linearly with sequence length and are reported to match transformer quality on many tasks. Understanding the SSM-attention tradeoff is crucial for architecture selection.

Key ideas:

State space equations: the continuous-time formulation
Discretization: converting continuous to discrete for digital computation
Selective SSMs (Mamba): input-dependent parameters
Parallel scan: how linear recurrences train in parallel
Hybrid models (Jamba, Zamba): SSM + attention for the best of both
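Discretization and the scan trick can both be shown on a scalar state space model. The sketch below assumes zero-order-hold discretization of h'(t) = a * h(t) + b * x(t) and a zero initial state; `SSMSketch` is an illustrative module, and the `scan/2` function here is still evaluated left to right. Its point is that `combine/2` is associative, which is what lets real implementations evaluate it as a parallel tree.

```elixir
# Illustrative only: scalar SSM, sequential form vs. scan form.
defmodule SSMSketch do
  # Zero-order hold: turn continuous (a, b) into per-step (a_bar, b_bar).
  def discretize(a, b, dt) do
    a_bar = :math.exp(dt * a)
    b_bar = (a_bar - 1.0) / a * b
    {a_bar, b_bar}
  end

  # Sequential recurrence: h_t = a_bar * h_{t-1} + b_bar * x_t
  def recur(xs, {a_bar, b_bar}, h0 \\ 0.0) do
    Enum.scan(xs, h0, fn x, h -> a_bar * h + b_bar * x end)
  end

  # Each step is the map h -> a * h + c. Two such maps compose into one.
  def combine({a1, c1}, {a2, c2}), do: {a1 * a2, a2 * c1 + c2}

  # Prefix-combine the per-step maps; with h0 = 0 the state is the c component.
  def scan(xs, {a_bar, b_bar}) do
    xs
    |> Enum.map(fn x -> {a_bar, b_bar * x} end)
    |> Enum.scan(&combine(&2, &1))
    |> Enum.map(fn {_a, c} -> c end)
  end
end

params = SSMSketch.discretize(-1.0, 1.0, 0.1)
xs = [1.0, 0.0, 2.0]

IO.inspect(SSMSketch.recur(xs, params), label: "recurrence")
IO.inspect(SSMSketch.scan(xs, params), label: "scan")
```

Both forms produce the same states. Selective SSMs such as Mamba make the per-step parameters depend on the input, which this fixed-parameter sketch does not show.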

Try it:

# Mamba: the flagship SSM
model = Edifice.build(:mamba,
  embed_size: 128,
  hidden_size: 256,
  state_size: 16,
  num_layers: 4,
  window_size: 60
)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 60, 128}, :f32), Axon.ModelState.empty())
output = predict_fn.(params, Nx.broadcast(0.5, {1, 60, 128}))
# => {1, 256}

# Hybrid: Mamba + Attention
model = Edifice.build(:jamba,
  embed_size: 128,
  hidden_size: 256,
  num_layers: 6,
  attention_ratio: 0.33,  # 1/3 of layers use attention
  num_heads: 4,
  window_size: 60
)

Go deeper: State Space Models guide

Phase 3: Specialized Domains

With the sequence processing fundamentals down, branch out into domain-specific families.

Step 7: Vision Architectures

Prerequisites: Steps 2 (convolutions), 3 (building blocks), and 5 (attention)

What: Architectures designed for image understanding, from Vision Transformers (ViT) that treat images as sequences of patches, to U-Net for pixel-level segmentation.

Key ideas:

Patch embedding: converting images into token sequences
ViT: applying the transformer to vision
Swin: hierarchical vision transformer with shifted windows
U-Net: encoder-decoder with skip connections for dense prediction
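Patch embedding starts by cutting the image into non-overlapping squares and flattening each one into a vector. A minimal sketch on a single-channel 4x4 image with 2x2 patches follows; `PatchSketch.patchify/2` is a hypothetical helper, and the learned linear projection that normally follows is left out.

```elixir
# Illustrative only: split a 2-D image (list of rows) into flattened patches.
defmodule PatchSketch do
  def patchify(image, patch) do
    image
    # Group the rows into horizontal bands that are `patch` rows tall.
    |> Enum.chunk_every(patch)
    |> Enum.flat_map(fn band ->
      band
      # Cut every row of the band into chunks that are `patch` columns wide.
      |> Enum.map(&Enum.chunk_every(&1, patch))
      # Gather the chunks that sit in the same column position into one patch.
      |> Enum.zip_with(&List.flatten/1)
    end)
  end
end

image = [
  [1, 2, 3, 4],
  [5, 6, 7, 8],
  [9, 10, 11, 12],
  [13, 14, 15, 16]
]

IO.inspect(PatchSketch.patchify(image, 2))
# => [[1, 2, 5, 6], [3, 4, 7, 8], [9, 10, 13, 14], [11, 12, 15, 16]]
```

By the same arithmetic, a 32x32 image with 8x8 patches gives (32 / 8) * (32 / 8) = 16 tokens, so attention operates on 16 positions instead of 1024 pixels.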

Try it:

# Vision Transformer
model = Edifice.Vision.ViT.build(
  image_size: 32,
  patch_size: 8,
  num_channels: 3,
  embed_size: 256,
  num_heads: 8,
  num_layers: 6,
  num_classes: 10
)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({1, 32, 32, 3}, :f32), Axon.ModelState.empty())
output = predict_fn.(params, Nx.broadcast(0.5, {2, 32, 32, 3}))
# => {2, 10}

Go deeper: Vision Architectures guide

Step 8: Graph & Set Networks

Prerequisites: Step 1 (MLPs), basic understanding of attention helps

What: Architectures for data with relational structure (graphs) or no ordering (sets). Graph networks propagate information along edges between nodes. Set networks process unordered collections with permutation-invariant operations.

Key ideas:

Message passing: nodes aggregate information from their neighbors
Adjacency matrices: how graph structure is represented as tensors
Permutation invariance: the output shouldn't change if you reorder the nodes/elements
Pooling: going from node-level to graph-level predictions
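One round of message passing and a permutation-invariant readout can be sketched with an adjacency matrix and one scalar feature per node. `GraphSketch` is an illustrative module; real graph layers also apply learned weights before or after aggregation, which is omitted here.

```elixir
# Illustrative only: mean-aggregation message passing plus mean pooling.
defmodule GraphSketch do
  # adjacency: list of rows with 1 where an edge exists, 0 otherwise.
  # features: one number per node.
  def propagate(adjacency, features) do
    Enum.map(adjacency, fn row ->
      # Keep only the features of nodes this node is connected to.
      neighbours = for {1, f} <- Enum.zip(row, features), do: f
      Enum.sum(neighbours) / max(length(neighbours), 1)
    end)
  end

  # Graph-level readout: the mean does not depend on node order.
  def mean_pool(features), do: Enum.sum(features) / length(features)
end

# Path graph: node 0 -- node 1 -- node 2
adjacency = [
  [0, 1, 0],
  [1, 0, 1],
  [0, 1, 0]
]

features = [1.0, 2.0, 6.0]
updated = GraphSketch.propagate(adjacency, features)

IO.inspect(updated)
# => [2.0, 3.5, 2.0]
IO.inspect(GraphSketch.mean_pool(updated))
# => 2.5
```

Reordering the nodes reorders `updated` but leaves the pooled value unchanged, which is the permutation invariance described above.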

Try it:

# Graph attention network
model = Edifice.Graph.GAT.build_classifier(
  input_dim: 16,
  hidden_dims: [64, 64],
  num_classes: 3,
  num_heads: 4,
  pool: :mean
)

# DeepSets: process unordered collections
model = Edifice.Sets.DeepSets.build(
  input_dim: 3,
  hidden_dim: 64,
  output_dim: 10,
  pool: :mean
)

{init_fn, predict_fn} = Axon.build(model)
params = init_fn.(Nx.template({4, 20, 3}, :f32), Axon.ModelState.empty())
output = predict_fn.(params, Nx.broadcast(0.5, {4, 20, 3}))
# => {4, 10}  -- set-level predictions for 4 sets of 20 points

Go deeper: Graph & Set Networks guide

Step 9: Generative Models

Prerequisites: Steps 1 and 5 (MLPs and attention); probability concepts from Core Vocabulary

What: Architectures that learn to create new data. Instead of predicting a label, they learn the distribution of the training data and can generate new samples from it.

Key ideas:

Latent space and the encoder-decoder pattern (VAE)
The reparameterization trick: making sampling differentiable
Adversarial training (GAN): generator vs. discriminator
Diffusion: adding noise then learning to reverse it
Flow matching: learning ODE trajectories from noise to data
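The reparameterization trick rewrites "sample z from N(mu, sigma^2)" as a deterministic function of mu, log-variance, and independent noise, so gradients can flow through mu and log-variance. The sketch below shows only that arithmetic in plain Elixir; `ReparamSketch` is an illustrative module and should not be read as the implementation behind any Edifice function.

```elixir
# Illustrative only: z = mu + sigma * eps, with sigma = exp(0.5 * log_var).
defmodule ReparamSketch do
  # eps is passed in explicitly so the randomness stays outside the formula.
  def sample(mu, log_var, eps) do
    Enum.zip_with([mu, log_var, eps], fn [m, lv, e] ->
      m + :math.exp(0.5 * lv) * e
    end)
  end

  # Standard normal noise, one value per latent dimension.
  def noise(n), do: for(_ <- 1..n, do: :rand.normal())
end

mu = [1.0, -1.0]
log_var = [0.0, 0.0]

# With log_var = 0 the standard deviation is 1, so z = mu + eps.
IO.inspect(ReparamSketch.sample(mu, log_var, [0.5, 2.0]))
# => [1.5, 1.0]

# A random draw; the values differ on every run.
IO.inspect(ReparamSketch.sample(mu, log_var, ReparamSketch.noise(2)))
```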

Try it:

# Variational Autoencoder
{encoder, decoder} = Edifice.Generative.VAE.build(
  input_size: 784,
  latent_size: 32,
  encoder_sizes: [512, 256],
  decoder_sizes: [256, 512]
)

# Build encoder
{enc_init, enc_predict} = Axon.build(encoder)
enc_params = enc_init.(Nx.template({1, 784}, :f32), Axon.ModelState.empty())
%{mu: mu, log_var: log_var} = enc_predict.(enc_params, Nx.broadcast(0.5, {4, 784}))

# Sample from latent space
z = Edifice.Generative.VAE.reparameterize(mu, log_var)
# z shape: {4, 32}  -- 4 latent vectors

Go deeper: Generative Models guide

Step 10: Contrastive & Self-Supervised Learning

Prerequisites: Steps 1-2 (feedforward + convolutions), basic understanding of encoders

What: Methods that learn useful representations without labeled data. They create their own training signal, either by comparing different views of the same data (contrastive) or by reconstructing masked portions of the input (masked modeling).

Key ideas:

Positive and negative pairs: what to pull together, what to push apart
Projection heads: small networks that transform representations for the contrastive objective
Momentum encoders (BYOL): a slowly updating copy of the network
Masked autoencoders (MAE): reconstruct what you can't see
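A momentum encoder is usually maintained as an exponential moving average of the online network's parameters. The sketch below shows that update on a flat list of parameters; the name `tau` for the momentum coefficient and the `MomentumSketch` module are assumptions for illustration.

```elixir
# Illustrative only: exponential-moving-average update of target parameters.
defmodule MomentumSketch do
  # target <- tau * target + (1 - tau) * online, applied per parameter.
  def update(target, online, tau) do
    Enum.zip_with(target, online, fn t, o -> tau * t + (1.0 - tau) * o end)
  end
end

online = [1.0, 1.0]
target = [0.0, 1.0]

IO.inspect(MomentumSketch.update(target, online, 0.5))
# => [0.5, 1.0]

# With tau close to 1 the target trails the online network slowly.
after_100 =
  Enum.reduce(1..100, target, fn _step, t ->
    MomentumSketch.update(t, online, 0.99)
  end)

IO.inspect(after_100)
```

After 100 updates with tau = 0.99 the first parameter has covered only about 63% of the distance to the online value.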

Go deeper: Contrastive Learning guide

Phase 4: Advanced and Specialized

These families build on the foundations and address specific needs.

Step 11: Meta-Learning (MoE, LoRA, Adapter, Capsules)

Prerequisites: Phase 2

What: Techniques that modify or compose other architectures: Mixture of Experts routes different inputs to different sub-networks; LoRA and Adapters add small trainable modules to frozen pretrained models; Capsules encode part-whole relationships.

Why it matters: MoE is how many modern large language models scale to hundreds of billions of parameters while keeping inference cost manageable, because only a few experts run for each token. LoRA is how you fine-tune those models on your specific task with limited compute.
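The compute saving from LoRA comes from simple arithmetic: instead of training a full d_out x d_in weight matrix, you train two thin matrices of rank r. The sketch below only counts parameters; `LoRASketch` and the 256-wide layer are illustrative choices, not values taken from Edifice.

```elixir
# Illustrative only: trainable parameter counts, full fine-tuning vs. LoRA.
defmodule LoRASketch do
  # Updating the whole d_out x d_in weight matrix.
  def full_params(d_in, d_out), do: d_in * d_out

  # Rank-r update B * A, with A of shape r x d_in and B of shape d_out x r.
  def lora_params(d_in, d_out, rank), do: rank * (d_in + d_out)
end

full = LoRASketch.full_params(256, 256)
lora = LoRASketch.lora_params(256, 256, 8)

IO.inspect({full, lora})
# => {65536, 4096}
IO.puts("trainable fraction: #{lora / full}")
# prints "trainable fraction: 0.0625"
```

The saving grows with layer width, since the full count scales with d_in * d_out while the LoRA count scales with d_in + d_out.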

Go deeper: Meta-Learning guide

Step 12: Dynamic & Continuous Systems (NeuralODE, Liquid, Energy)

Prerequisites: Phase 2, comfort with the idea of differential equations

What: Architectures that model continuous dynamics: Neural ODEs define depth as a continuous variable, Liquid Networks use ODE-based cells that adapt over time, Energy-Based Models learn energy landscapes, and Hopfield networks provide associative memory.
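"Depth as a continuous variable" is easier to picture with a fixed-step solver. A residual block that computes h + f(h) is one Euler step of dh/dt = f(h) with step size 1, and a Neural ODE replaces that with many smaller steps (or an adaptive solver). The sketch below integrates a hand-written dynamics function; `ODESketch` is illustrative and uses no learned parameters.

```elixir
# Illustrative only: fixed-step Euler integration of dh/dt = f.(h).
defmodule ODESketch do
  def euler(h0, f, t_end, steps) do
    dt = t_end / steps
    Enum.reduce(1..steps, h0, fn _step, h -> h + dt * f.(h) end)
  end
end

# Exponential decay: the exact solution at t = 1 is exp(-1).
decay = fn h -> -h end

IO.inspect(ODESketch.euler(1.0, decay, 1.0, 10), label: "10 steps")
IO.inspect(ODESketch.euler(1.0, decay, 1.0, 1000), label: "1000 steps")
IO.inspect(:math.exp(-1.0), label: "exact")
```

More steps bring the result closer to the exact solution, which is the accuracy-versus-compute dial that a fixed layer count does not offer.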

Go deeper: Dynamic & Continuous guide

Step 13: Uncertainty & Memory (Bayesian, NTM, Evidential)

Prerequisites: Phase 1, basic probability

What: Networks that know what they don't know (Bayesian, MC Dropout, Evidential) and networks with external memory banks (NTM, Memory Networks) for tasks requiring storage and retrieval.

Go deeper: Uncertainty & Memory guide

Step 14: Neuromorphic (SNN, ANN2SNN)

Prerequisites: Phase 2

What: Spiking neural networks that communicate through discrete spikes rather than continuous activations, inspired by biological neurons. ANN2SNN converts conventional networks to spiking equivalents for deployment on neuromorphic hardware.
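To show what "discrete spikes rather than continuous activations" means, here is a minimal leaky integrate-and-fire neuron. The leak factor, threshold and reset-to-zero rule are assumptions chosen for illustration; `SpikeSketch` is not the Edifice SNN implementation.

```elixir
# Illustrative only: a leaky integrate-and-fire neuron over a list of inputs.
defmodule SpikeSketch do
  def run(inputs, leak \\ 0.9, threshold \\ 1.0) do
    {spikes, _final_potential} =
      Enum.map_reduce(inputs, 0.0, fn input, v ->
        # The membrane potential decays, then integrates the new input.
        v = leak * v + input

        # Crossing the threshold emits a spike and resets the potential.
        if v >= threshold, do: {1, 0.0}, else: {0, v}
      end)

    spikes
  end
end

IO.inspect(SpikeSketch.run([0.5, 0.5, 0.5, 0.0]))
# => [0, 0, 1, 0]

# Without leak the same input reaches the threshold one step earlier.
IO.inspect(SpikeSketch.run([0.5, 0.5, 0.5, 0.0], 1.0))
# => [0, 1, 0, 0]
```

The output is a train of 0s and 1s, and information is carried by when the spikes occur rather than by activation magnitude.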

The Dependency Graph

A visual summary of what builds on what:

Phase 1: Fundamentals
┌──────────────┐   ┌──────────────────┐   ┌─────────────────┐
│  1. MLP      │──→│ 2. Convolutional │──→│ 3. Blocks       │
│  (foundation)│   │   (local pattern)│   │   (components)  │
└──────┬───────┘   └────────┬─────────┘   └────────┬────────┘
       │                    │                       │
       ▼                    ▼                       ▼
Phase 2: Sequence Processing
┌──────────────┐   ┌──────────────────┐   ┌─────────────────┐
│ 4. Recurrent │   │ 5. Attention     │   │ 6. SSM          │
│  (sequential)│──→│  (parallel)      │──→│  (best of both) │
└──────────────┘   └──────────────────┘   └─────────────────┘
       │                    │                       │
       └────────────┬───────┴───────────────────────┘
                    ▼
Phase 3: Specialized Domains
┌──────────┐ ┌──────────┐ ┌──────────────┐ ┌─────────────┐
│ 7. Vision│ │ 8. Graph │ │ 9. Generative│ │10. Contrast.│
└──────────┘ └──────────┘ └──────────────┘ └─────────────┘
                    │
                    ▼
Phase 4: Advanced
┌──────────┐ ┌──────────┐ ┌──────────────┐ ┌─────────────┐
│11. Meta  │ │12. ODE/  │ │13. Uncertain.│ │14. Neuromor.│
│  (MoE)   │ │  Energy  │ │  & Memory    │ │  (Spiking)  │
└──────────┘ └──────────┘ └──────────────┘ └─────────────┘

Suggested Exercises

For each phase, a concrete exercise to test your understanding:

Phase 1: Build an MLP classifier for a simple dataset (e.g., Iris or synthetic data). Train it using Axon's training API. Add dropout and observe the effect on overfitting.

Phase 2: Take the same dataset but structure it as sequences. Compare LSTM, attention, and Mamba on the same task. Measure parameter counts and observe output shapes.

Phase 3: Pick a domain that interests you:

Vision: classify MNIST digits with both ResNet and ViT
Graphs: classify synthetic graph structures with GCN
Generation: train a VAE on a simple distribution and sample from it

Phase 4: Take your Phase 2 model and apply LoRA fine-tuning. Compare training a full model from scratch vs. fine-tuning with LoRA -- observe the difference in trainable parameter count.

Quick Reference: All 19 Families

#	Family	Guide	Key Architecture	Core Idea
1	Feedforward	-	MLP	Stacked dense layers
2	Convolutional	Guide	ResNet	Shared local filters + skip connections
3	Building Blocks	Guide	RMSNorm, RoPE, SwiGLU	Composable primitives
4	Recurrent	Guide	LSTM, xLSTM	Sequential hidden state
5	Attention	Guide	Multi-Head, GQA	Pairwise relevance scoring
6	State Space	Guide	Mamba	Discretized dynamical system
7	Vision	Guide	ViT, Swin	Images as patch sequences
8	Graph	Guide	GCN, GAT	Message passing on edges
9	Sets	Guide	DeepSets	Permutation-invariant aggregation
10	Generative	Guide	VAE, Diffusion	Learn and sample from p(data)
11	Contrastive	Guide	SimCLR, BYOL	Learn representations without labels
12	Energy	Guide	EBM, Hopfield	Energy landscape minimization
13	Liquid	Guide	LiquidNN	Continuous-time ODE cells
14	Probabilistic	Guide	Bayesian, Evidential	Calibrated uncertainty
15	Memory	Guide	NTM	External differentiable memory
16	Meta	Guide	MoE, LoRA	Compose and adapt architectures
17	Neuromorphic	-	SNN	Spike-based communication
18	Capsule	Guide	Capsule	Part-whole relationships
19	Hypernetwork	Guide	Hypernetwork	Networks generating networks
