ML Foundations

What neural networks are, how they learn, and why any of this works at all.

Why This Guide Exists

Edifice gives you 90+ neural network architectures. Before you explore them, you need a mental model of what a neural network actually is and what it means for one to "learn." This guide builds that foundation from scratch. No prior ML knowledge is assumed -- just basic comfort with the idea that numbers go in and numbers come out.

If you already know what a loss function is and can explain backpropagation at a high level, skip to Reading Edifice or the Learning Path.

What Is a Neural Network?

A neural network is a function made of simple, stacked building blocks. Each block takes numbers in, transforms them, and passes them forward. Stack enough of these blocks together with the right transformations, and the network can approximate remarkably complex patterns.

The fundamental unit is the neuron (also called a node or unit). A neuron does three things:

1. Multiply each input by a weight                 (how important is this input?)
2. Add the weighted inputs together, plus a bias   (combine the evidence)
3. Apply an activation function                    (introduce non-linearity)

Concretely:
  output = activation( w1*x1 + w2*x2 + ... + wN*xN + bias )

The weights and bias are the neuron's parameters -- the knobs the network adjusts during learning. The activation function is what makes neural networks more powerful than simple linear regression: without it, stacking layers would just produce another linear function, no matter how many layers you add.
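
The formula above can be sketched in a few lines. Edifice itself is Elixir built on Nx; the snippet below uses plain Python purely for illustration, and the function names (`sigmoid`, `neuron`) and the example values are assumptions made for this sketch, not part of any library.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)

# Hypothetical values, chosen so the weighted sum is exactly 0:
# 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0
output = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(output)  # sigmoid(0) = 0.5
```

Swapping `sigmoid` for a function that returns its input unchanged turns the neuron back into a plain weighted sum, which is the linear case the paragraph above warns about.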

Layers

Neurons are organized into layers. A layer is just a group of neurons that all process the same inputs and produce their outputs together:

Input         Layer 1         Layer 2         Output
[x1] ──────→ [n1] ──────→ [n5] ──────→ [prediction]
[x2] ──╲ ╱→ [n2] ──╲ ╱→ [n6] ──────→
       ╳       ╳
[x3] ──╱ ╲→ [n3] ──╱ ╲→ [n7]
         ╲→ [n4] ──╱

The key insight: each neuron in a layer connects to every neuron in the next layer (in a standard "dense" or "fully connected" layer). This means a layer with 4 neurons connecting to a layer with 3 neurons has 4 × 3 = 12 weight parameters, plus 3 biases.
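
The parameter arithmetic is easy to check mechanically. This is a small Python sketch for illustration only; `dense_param_count` and `network_param_count` are hypothetical helper names, and the layer sizes `[3, 4, 3, 1]` are an assumed reading of the diagram above (3 inputs, 4 neurons, 3 neurons, 1 output).

```python
def dense_param_count(n_inputs, n_outputs):
    """Parameters in one fully connected layer: one weight per
    input/output pair, plus one bias per output neuron."""
    weights = n_inputs * n_outputs
    biases = n_outputs
    return weights + biases

def network_param_count(layer_sizes):
    """Total parameters for a stack of dense layers given as a list of widths."""
    pairs = zip(layer_sizes, layer_sizes[1:])
    return sum(dense_param_count(n_in, n_out) for n_in, n_out in pairs)

print(dense_param_count(4, 3))            # 12 weights + 3 biases = 15
print(network_param_count([3, 4, 3, 1]))  # 16 + 15 + 4 = 35
```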

Three terms you'll see everywhere:

- Input layer: the raw data entering the network (not really a "layer" of neurons -- just the data)
- Hidden layers: the intermediate layers where the network builds up internal representations
- Output layer: the final layer that produces the prediction

A network with many hidden layers is called a deep neural network -- that's where "deep learning" comes from. Depth is a large part of what gives these networks their power: early layers tend to learn simple features, and later layers combine those into increasingly abstract representations.

The Forward Pass

When data flows from input to output through the network, that's called the forward pass. Nothing mysterious -- it's just function composition. The output of layer 1 becomes the input to layer 2, and so on:

input → layer_1(input) → layer_2(...) → layer_3(...) → prediction

Every architecture in Edifice -- whether it's a simple MLP, a transformer, a Mamba SSM, or a graph network -- ultimately performs a forward pass. What differs is the structure of those intermediate transformations. Some architectures look at sequences one token at a time (recurrent networks). Some let every token attend to every other token (transformers). Some model the data as continuous dynamical systems (state space models). But the forward pass concept is universal.
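
To make "it's just function composition" concrete, here is a minimal Python sketch of a two-layer forward pass. It is illustrative only and not how Edifice or Nx represent models; `dense`, `forward`, `relu`, and all weight values are assumptions invented for the example.

```python
from functools import reduce

def relu(z):
    """A common activation: negative values become 0."""
    return max(0.0, z)

def dense(weight_rows, biases, activation):
    """Build a layer function. Each row of weights belongs to one neuron."""
    def layer(inputs):
        return [
            activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weight_rows, biases)
        ]
    return layer

def forward(layers, inputs):
    """Feed the output of each layer into the next one."""
    return reduce(lambda activations, layer: layer(activations), layers, inputs)

layers = [
    dense([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),  # 2 inputs -> 2 neurons
    dense([[1.0, 2.0]], [0.5], relu),                    # 2 inputs -> 1 neuron
]
prediction = forward(layers, [2.0, 1.0])
print(prediction)  # [4.5]
```

The first layer produces `[1.0, 1.5]`, and the second turns that into `1.0 * 1.0 + 2.0 * 1.5 + 0.5 = 4.5`. More elaborate architectures change what happens inside each layer, not the fact that outputs feed forward.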

What Does "Learning" Mean?

A neural network starts with random parameters. Its predictions are garbage. Learning is the process of adjusting those parameters so the predictions get better.

This requires three ingredients:

1. A Loss Function

The loss function (also called a cost function or objective) measures how wrong the network's predictions are. It takes the network's output and the correct answer, and produces a single number: the loss. Lower is better.

                    ┌────────────────┐
  network output ──→│  Loss Function  │──→ single number (the loss)
  correct answer ──→│                 │
                    └────────────────┘

Examples:
  - Predicting a number?  Loss = (predicted - actual)²
  - Classifying images?   Loss = -log(probability of correct class)

The choice of loss function tells the network what "better" means. Different problems use different loss functions, and this choice shapes how the network learns.
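
Both example losses fit in a few lines of Python. This is a sketch for intuition rather than library code: `squared_error` and `cross_entropy` are names chosen here, and the probabilities are made up.

```python
import math

def squared_error(predicted, actual):
    """Loss for predicting a number: the squared difference."""
    return (predicted - actual) ** 2

def cross_entropy(probabilities, correct_class):
    """Loss for classification: negative log of the probability
    the network assigned to the correct class."""
    return -math.log(probabilities[correct_class])

print(squared_error(3.0, 5.0))  # 4.0

# Same correct class (index 1), two different levels of confidence.
confident = cross_entropy([0.1, 0.8, 0.1], correct_class=1)
unsure = cross_entropy([0.4, 0.3, 0.3], correct_class=1)
print(confident < unsure)  # True: more probability on the right answer, lower loss
```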

2. Gradient Descent

Once we have a loss, we need a way to reduce it. Gradient descent is the core algorithm. The idea is intuitive: the gradient tells you which direction increases the loss fastest, so you step in the opposite direction.

Think of it like descending a mountain in fog:
  - You can't see the valley, but you can feel the slope under your feet
  - At each step, you move in the steepest downhill direction
  - Eventually you reach a low point

The "slope" is the gradient -- the derivative of the loss with respect to each parameter.
The "step size" is the learning rate -- how far you move each update.

  new_weight = old_weight - learning_rate × gradient

A small learning rate means slow, careful progress. A large learning rate means faster movement but with the risk of overshooting the valley entirely. Choosing the right learning rate is one of the most impactful decisions in training.
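
The update rule and the effect of the learning rate can be seen on a one-parameter toy problem. The sketch below is plain Python for illustration; the loss `(w - 3)²`, its hand-written gradient, and the two learning rates are assumptions picked to make the behavior obvious.

```python
def gradient_descent(gradient_fn, start, learning_rate, steps):
    """Repeatedly apply: new_weight = old_weight - learning_rate * gradient."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * gradient_fn(w)
    return w

def loss(w):
    """A toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def loss_gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

careful = gradient_descent(loss_gradient, start=0.0, learning_rate=0.1, steps=100)
too_large = gradient_descent(loss_gradient, start=0.0, learning_rate=1.1, steps=100)

print(abs(careful - 3.0) < 1e-6)  # True: settles at the minimum
print(abs(too_large - 3.0) > 1e3)  # True: each step overshoots further than the last
```

With a learning rate of 0.1 the distance to the minimum shrinks by a factor of 0.8 per step. With 1.1 it grows by a factor of 1.2 per step, which is the "overshooting the valley" failure described above.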

3. Backpropagation

Backpropagation is how the network figures out each parameter's gradient. It's just the chain rule from calculus, applied systematically backward through the network:

Forward:   input → layer_1 → layer_2 → layer_3 → prediction → loss
Backward:  input ← layer_1 ← layer_2 ← layer_3 ← prediction ← loss
                                                                 ↑
                                                         "how does each
                                                          weight affect
                                                          this loss?"

For each weight in the network, backpropagation computes: "if I increase this weight by a tiny amount, how much does the loss change?" Weights that contribute a lot to the error get large gradients (big updates). Weights that barely affect the loss get small gradients (small updates). This is what makes learning efficient -- the network focuses its adjustments where they matter most.

You don't need to implement backpropagation yourself. Nx (the numerical computing library under Edifice) handles this automatically through automatic differentiation. You define the forward pass, and Nx computes all the gradients for you.
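
For intuition about what automatic differentiation does on your behalf, here is the chain rule worked by hand on a two-weight network, then checked against the "nudge the weight and see what happens" definition. This is a plain Python sketch, not Nx code; the tiny network (`prediction = w2 * (w1 * x)`) and all values are assumptions for the example.

```python
def forward(w1, w2, x, y):
    """Two stacked one-weight 'layers' followed by a squared-error loss."""
    hidden = w1 * x
    prediction = w2 * hidden
    loss = (prediction - y) ** 2
    return hidden, prediction, loss

def backward(w1, w2, x, y):
    """Chain rule, applied from the loss back toward the input."""
    hidden, prediction, _ = forward(w1, w2, x, y)
    d_prediction = 2.0 * (prediction - y)  # d loss / d prediction
    d_w2 = d_prediction * hidden           # d loss / d w2
    d_hidden = d_prediction * w2           # d loss / d hidden
    d_w1 = d_hidden * x                    # d loss / d w1
    return d_w1, d_w2

def numerical_gradient(w1, w2, x, y, eps=1e-6):
    """Estimate the same gradients by nudging each weight a tiny amount."""
    loss_at = lambda a, b: forward(a, b, x, y)[2]
    d_w1 = (loss_at(w1 + eps, w2) - loss_at(w1 - eps, w2)) / (2 * eps)
    d_w2 = (loss_at(w1, w2 + eps) - loss_at(w1, w2 - eps)) / (2 * eps)
    return d_w1, d_w2

analytic = backward(0.5, -1.5, x=2.0, y=1.0)
numeric = numerical_gradient(0.5, -1.5, x=2.0, y=1.0)
print(analytic)  # (15.0, -5.0)
```

The nudging approach needs extra forward passes for every single weight, while the backward pass reuses intermediate results to get all gradients in one sweep. That reuse is why backpropagation scales to networks with many parameters.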

The Training Loop

Training a neural network is a repetitive cycle:

repeat until good enough:
  1. Forward pass:    feed a batch of data through the network
  2. Compute loss:    measure how wrong the predictions are
  3. Backward pass:   compute gradients via backpropagation
  4. Update weights:  adjust parameters in the direction that reduces loss

One pass through the entire training dataset is called an epoch. In practice, you don't feed the whole dataset at once -- you split it into batches (typically 32-512 samples) and update weights after each batch. This is called mini-batch gradient descent, and it's what virtually everyone uses because:

- Full-dataset gradient computation is too expensive for large datasets
- The noise from random batches can help the optimizer escape shallow local minima
- It enables training on data that doesn't fit in memory

A typical training run might be 10-100 epochs, with hundreds or thousands of batch updates per epoch.
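
The four-step cycle can be sketched end to end on a model small enough to differentiate by hand. The Python below is illustrative only and is not how training is written with Edifice or Nx: the `train` function, the synthetic data (points on the line `y = 2x + 1`), and the hyperparameters are all assumptions. For simplicity it starts from zero rather than random parameters.

```python
import random

def train(data, epochs, batch_size, learning_rate, seed=0):
    """Fit prediction = w * x + b with mini-batch gradient descent."""
    rng = random.Random(seed)
    samples = list(data)   # copy, so the caller's list is not reordered
    w, b = 0.0, 0.0
    history = []           # mean batch loss for each epoch
    for _ in range(epochs):
        rng.shuffle(samples)   # a fresh random batching every epoch
        epoch_loss, n_batches = 0.0, 0
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            # 1. Forward pass: prediction error for every sample in the batch
            errors = [(w * x + b) - y for x, y in batch]
            # 2. Compute loss: mean squared error over the batch
            loss = sum(e * e for e in errors) / len(batch)
            # 3. Backward pass: gradients written out by hand for this model
            grad_w = sum(2.0 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
            grad_b = sum(2.0 * e for e in errors) / len(batch)
            # 4. Update weights: step against the gradient
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
            epoch_loss += loss
            n_batches += 1
        history.append(epoch_loss / n_batches)
    return w, b, history

# Synthetic, noise-free data: 21 points on the line y = 2x + 1.
data = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(-10, 11)]
w, b, history = train(data, epochs=200, batch_size=4, learning_rate=0.1)
print(w, b)  # should end up close to 2.0 and 1.0
```

In a real project, step 3 is where automatic differentiation replaces the hand-written gradient lines.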

Tensors and Shapes

Neural networks operate on tensors -- multi-dimensional arrays of numbers. If you know what a matrix is, a tensor is just the generalization to any number of dimensions:

Scalar:     42                          shape: {}         0 dimensions
Vector:     [1, 2, 3]                   shape: {3}        1 dimension
Matrix:     [[1, 2], [3, 4], [5, 6]]    shape: {3, 2}     2 dimensions
3D Tensor:  a stack of matrices         shape: {4, 3, 2}  3 dimensions

In Edifice and Nx, shapes are written as tuples. The most common shapes you'll encounter:

{batch_size, features}                     Tabular data or network output
{batch_size, sequence_length, features}    Sequences (text, time series, game frames)
{batch_size, height, width, channels}      Images

The batch dimension (always first) is how many samples the network processes simultaneously. Processing samples in batches is more efficient than one at a time because modern hardware (GPUs especially) is optimized for parallel operations on large blocks of numbers.

Understanding shapes is critical for working with Edifice. When you see something like {1, 60, 256}, that means: 1 sample, 60 timesteps, 256 features per timestep. A Mamba model with embed_size: 256 and window_size: 60 expects exactly that input shape.
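
If shapes feel abstract, it can help to compute one by hand. The sketch below uses nested Python lists as stand-in tensors; `shape` is a hypothetical helper written for this example (it assumes rectangular, non-empty lists), and Python prints tuples with parentheses where Nx uses braces.

```python
def shape(tensor):
    """Shape of a rectangular, non-empty nested list, as a tuple."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]   # descend into the first element of each level
    return tuple(dims)

scalar = 42
vector = [1, 2, 3]
matrix = [[1, 2], [3, 4], [5, 6]]
# One sample, 60 timesteps, 256 features per timestep.
batch = [[[0.0] * 256 for _ in range(60)] for _ in range(1)]

print(shape(scalar))  # ()
print(shape(vector))  # (3,)
print(shape(matrix))  # (3, 2)
print(shape(batch))   # (1, 60, 256)
```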

Generalization: The Actual Goal

The point of training isn't to memorize the training data -- it's to learn patterns that apply to new, unseen data. This is called generalization, and it's the central challenge of machine learning.

Two failure modes:

Underfitting                          Overfitting
────────────                          ───────────
Network is too simple or              Network memorizes the training data
undertrained to capture               but fails on new data.
the underlying pattern.

Training loss: high                   Training loss: very low
Test loss: high                       Test loss: high
"Can't learn the pattern"             "Learned the noise, not the signal"

Think of it like studying for a test. Underfitting is not studying enough -- you don't know the material. Overfitting is memorizing specific practice problems without understanding the concepts -- you ace the practice test but fail the real one.

Techniques for fighting overfitting (called regularization) include:

- Dropout: randomly zeroing out neurons during training, so the network cannot rely on any single neuron
- Weight decay: penalizing large weights, encouraging simpler solutions
- Early stopping: halting training when performance on a held-out validation set starts to degrade
- Data augmentation: artificially expanding training data through transformations
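
Of these techniques, early stopping is the simplest to sketch. The Python below is one common way to phrase it, shown for illustration only: the `patience` parameter (how many epochs without improvement to tolerate) and the validation losses are assumptions, not something Edifice prescribes.

```python
def early_stopping_epoch(validation_losses, patience):
    """Return the index of the epoch with the best validation loss seen
    before training would be halted, or None if there were no epochs."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss has stopped improving
    return best_epoch

# Hypothetical validation losses, one per epoch.
losses = [0.9, 0.6, 0.5, 0.55, 0.6, 0.4]
print(early_stopping_epoch(losses, patience=2))  # 2
```

Training halts after epoch 4, so the later value of 0.4 is never seen. That is the trade-off in choosing `patience`: too small and you stop during a temporary plateau, too large and you keep training while the network overfits.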

Why Architecture Matters

If all neural networks do the same basic thing (forward pass, loss, gradient descent), why do we need so many different architectures?

Because structure encodes assumptions about the data. The right architecture builds in the right biases for your problem:

Data Type          Key Property              Architecture Bias
─────────          ────────────              ─────────────────
Images             Spatial locality          Convolutions: share filters
                                             across positions

Sequences          Temporal ordering         Recurrence or attention:
                                             model dependencies over time

Graphs             Relational structure      Message passing: aggregate
                                             information from neighbors

Sets               Permutation invariance    Symmetric aggregation:
                                             order doesn't matter

A convolutional network "knows" that a cat's ear looks the same regardless of where it appears in an image. A recurrent network "knows" that word order matters. A graph network "knows" that nodes interact through edges. These structural biases mean the network needs less data and less training to learn the pattern, because the architecture already encodes part of the answer.
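
Two of these biases are small enough to demonstrate directly. The Python sketch below is illustrative only: `conv1d` is a minimal one-dimensional convolution written for this example (no padding or stride), and the signals and kernel are made up.

```python
def conv1d(signal, kernel):
    """Slide one shared kernel across the signal (valid positions only)."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

kernel = [1.0, -1.0]  # the same two weights are reused at every position

# The same "feature" (a spike of 5.0) at two different positions.
pattern_early = [0.0, 5.0, 0.0, 0.0, 0.0]
pattern_late = [0.0, 0.0, 0.0, 5.0, 0.0]

early = conv1d(pattern_early, kernel)
late = conv1d(pattern_late, kernel)
print(early)  # [-5.0, 5.0, 0.0, 0.0]
print(late)   # [0.0, 0.0, -5.0, 5.0]

# Symmetric aggregation: the order of the set's elements does not matter.
print(sum([1.0, 2.0, 3.0]) == sum([3.0, 1.0, 2.0]))  # True
```

The convolution responds to the spike identically wherever it appears; only the position of the response moves. A dense layer would need separate weights, and separate training examples, to learn the same thing at each position.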

This is why Edifice has 19 families -- each family encodes a different set of assumptions about what the data looks like and how it should be processed.

What's Next

With these foundations in place, you're ready for:

- Core Vocabulary -- the precise terminology used across all Edifice guides
- Problem Landscape -- how different ML problems map to different architecture families
- Reading Edifice -- understanding the code patterns in this library
- Learning Path -- a guided tour through the 19 architecture families
