
Neural Networks & Backprop

Neural networks are universal function approximators — given enough capacity, they can represent any continuous function. The magic that makes them trainable is backpropagation: an efficient application of the chain rule that propagates error signals from output to input, telling every weight how to change.

Theory

Computation Graph
[Figure: computation graph. Inputs x₁, x₂ flow through hidden units h₁, h₂, h₃ to the output ŷ and loss L. The forward pass computes values; the backward pass propagates gradients.]

A neural network is a composition of functions — each layer transforms its input a little. Backpropagation answers one question: how should each weight change to reduce the error? It works backwards from the output, tracing each weight's contribution to the mistake via the chain rule.
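As a concrete toy example of this bookkeeping, here is a single sigmoid neuron with an illustrative squared loss (values chosen arbitrarily), with the chain-rule gradient checked against a finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy composition: a = sigmoid(w*x + b), L = a^2 / 2 (illustrative values)
w, b, x = 0.5, -0.2, 1.5

# Forward pass: compute and cache intermediates
z = w * x + b
a = sigmoid(z)
L = 0.5 * a ** 2

# Backward pass: chain rule from output to input,
# dL/dw = (dL/da) * (da/dz) * (dz/dw)
dL_dw = a * (a * (1 - a)) * x

# Sanity check against a finite difference
eps = 1e-6
L_plus = 0.5 * sigmoid((w + eps) * x + b) ** 2
assert abs(dL_dw - (L_plus - L) / eps) < 1e-5
```

The same pattern (cache on the way forward, multiply local derivatives on the way back) is what backprop does at scale.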

The Neuron Model

A single neuron computes a weighted sum of inputs, adds a bias, then applies a nonlinearity:

$$z = Wx + b, \quad a = \sigma(z)$$

where $W \in \mathbb{R}^{d_{out} \times d_{in}}$, $x \in \mathbb{R}^{d_{in}}$, $b \in \mathbb{R}^{d_{out}}$, and $\sigma$ is an activation function.
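A minimal sketch of this single-neuron computation, with sizes chosen arbitrrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 3, 2                  # illustrative sizes

W = rng.normal(size=(d_out, d_in))  # weights
b = np.zeros(d_out)                 # bias
x = rng.normal(size=d_in)           # input

z = W @ x + b                       # weighted sum plus bias
a = 1.0 / (1.0 + np.exp(-z))        # sigmoid nonlinearity

assert a.shape == (d_out,)
assert np.all((0 < a) & (a < 1))    # sigmoid outputs lie in (0, 1)
```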

Activation Functions

The choice of activation function fundamentally shapes what a network can learn and how gradients flow.

Sigmoid maps any real value to $(0, 1)$, historically used for binary outputs:

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \quad \sigma'(z) = \sigma(z)(1 - \sigma(z))$$

The derivative is at most $0.25$, which causes vanishing gradients in deep networks — multiplying many values $< 1$ drives the gradient to zero.

Tanh is zero-centered (unlike sigmoid), with range $(-1, 1)$:

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}, \quad \tanh'(z) = 1 - \tanh^2(z)$$

Still saturates, but the zero-centered outputs mean downstream layers receive gradients with both signs.
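Both saturation claims are easy to verify numerically; this sketch checks that sigmoid's derivative never exceeds 0.25 while tanh's reaches 1 at zero:

```python
import numpy as np

z = np.linspace(-10, 10, 10001)

sig = 1.0 / (1.0 + np.exp(-z))
d_sig = sig * (1 - sig)             # sigma'(z) = sigma(z)(1 - sigma(z))
d_tanh = 1 - np.tanh(z) ** 2        # tanh'(z) = 1 - tanh^2(z)

assert d_sig.max() <= 0.25 + 1e-12           # peaks at 0.25 at z = 0
assert abs(d_tanh.max() - 1.0) < 1e-6        # peaks at 1 at z = 0
assert d_sig[0] < 1e-4 and d_sig[-1] < 1e-4  # both tails saturate
```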

ReLU (Rectified Linear Unit) is the modern default:

$$\text{ReLU}(z) = \max(0, z), \quad \text{ReLU}'(z) = \begin{cases} 1 & z > 0 \\ 0 & z \leq 0 \end{cases}$$

No saturation for positive inputs means gradients flow freely. The trade-off is "dying ReLU": neurons that receive consistently negative inputs have zero gradient and never recover. Leaky ReLU uses $\alpha z$ for $z < 0$ (typically $\alpha = 0.01$) to address this.

💡Intuition

Think of ReLU as a gate: it passes positive signals unchanged and blocks negative ones. This sparsity — roughly 50% of neurons firing on any input — is computationally efficient and acts as an implicit regularizer.
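The ~50% figure follows from assuming zero-centered pre-activations; a quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 256))    # zero-mean pre-activations (assumed)
a = np.maximum(0, z)                # ReLU

frac_active = (a > 0).mean()
assert 0.47 < frac_active < 0.53    # roughly half the units fire
```

In a trained network the pre-activation distribution shifts, so the actual sparsity varies by layer.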

Forward Pass: 2-Layer MLP

For a network with one hidden layer of width $H$ and $C$ output classes:

$$z^{(1)} = W^{(1)} x + b^{(1)} \in \mathbb{R}^H$$
$$a^{(1)} = \text{ReLU}(z^{(1)}) \in \mathbb{R}^H$$
$$z^{(2)} = W^{(2)} a^{(1)} + b^{(2)} \in \mathbb{R}^C$$
$$\hat{y} = \text{softmax}(z^{(2)}), \quad \hat{y}_c = \frac{e^{z^{(2)}_c}}{\sum_{j} e^{z^{(2)}_j}}$$

The cross-entropy loss for true class $y$:

$$\mathcal{L} = -\log \hat{y}_y = -z^{(2)}_y + \log \sum_j e^{z^{(2)}_j}$$
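The two forms of this loss are algebraically identical, which a quick numerical check confirms (logits chosen arbitrarily):

```python
import numpy as np

z = np.array([2.0, -1.0, 0.5])      # example logits z^(2)
y = 0                               # true class index

y_hat = np.exp(z) / np.exp(z).sum()           # softmax
lhs = -np.log(y_hat[y])                       # -log y_hat_y
rhs = -z[y] + np.log(np.exp(z).sum())         # -z_y + log-sum-exp

assert abs(lhs - rhs) < 1e-12
```

The log-sum-exp form is the one implemented in practice because it avoids computing a log of a tiny softmax probability.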

Backpropagation through this composition is driven by the chain rule, which decomposes $\frac{\partial \mathcal{L}}{\partial w}$ into per-layer factors without redundantly recomputing shared intermediate values. A naive finite-difference approach would require two forward passes per parameter; the chain rule reduces this to one forward pass and one backward pass regardless of network depth.

Backpropagation via Chain Rule

Backprop is the chain rule applied systematically. Starting from the output:

Output layer gradient — for cross-entropy + softmax, the combined gradient simplifies beautifully:

$$\frac{\partial \mathcal{L}}{\partial z^{(2)}} = \hat{y} - e_y$$

where $e_y$ is the one-hot vector for class $y$. This vector $\delta^{(2)} \in \mathbb{R}^C$ is the output "delta."
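This simplification is worth verifying once by finite differences (toy logits, not taken from the network):

```python
import numpy as np

def ce_loss(z, y):
    # Cross-entropy of softmax(z) against true class y, log-sum-exp form
    return -z[y] + np.log(np.exp(z).sum())

z = np.array([1.0, 0.2, -0.5])
y = 2

y_hat = np.exp(z) / np.exp(z).sum()
grad = y_hat.copy()
grad[y] -= 1.0                      # analytic gradient: y_hat - e_y

eps = 1e-6
for i in range(len(z)):
    zp = z.copy()
    zp[i] += eps
    fd = (ce_loss(zp, y) - ce_loss(z, y)) / eps
    assert abs(fd - grad[i]) < 1e-5
```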

Weight gradients at layer 2:

$$\frac{\partial \mathcal{L}}{\partial W^{(2)}} = \delta^{(2)} (a^{(1)})^\top, \quad \frac{\partial \mathcal{L}}{\partial b^{(2)}} = \delta^{(2)}$$

Backprop through layer 1 — propagate the delta through $W^{(2)}$ and through the ReLU:

$$\delta^{(1)} = \underbrace{(W^{(2)})^\top \delta^{(2)}}_{\text{linear backprop}} \odot \underbrace{\mathbf{1}[z^{(1)} > 0]}_{\text{ReLU gate}}$$

Weight gradients at layer 1:

$$\frac{\partial \mathcal{L}}{\partial W^{(1)}} = \delta^{(1)} x^\top, \quad \frac{\partial \mathcal{L}}{\partial b^{(1)}} = \delta^{(1)}$$

The general pattern: the delta at layer $l$ equals $(W^{(l+1)})^\top \delta^{(l+1)}$ multiplied elementwise by the activation derivative $\sigma'(z^{(l)})$. This is the delta rule.
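The delta rule extends to any number of layers. The sketch below (a hypothetical bias-free ReLU stack with invented sizes) implements the generic backward loop and checks one gradient entry against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 5, 3]                   # toy net: 4 -> 5 (ReLU) -> 3 (logits)
Ws = [rng.normal(0, 0.5, (sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]

def forward_stack(Ws, x):
    """Cache layer inputs xs and pre-activations zs for the backward pass."""
    xs, zs = [x], []
    for l, W in enumerate(Ws):
        z = W @ xs[-1]
        zs.append(z)
        a = np.maximum(0, z) if l < len(Ws) - 1 else z  # ReLU hidden, linear output
        xs.append(a)
    return xs, zs

def backward_stack(Ws, xs, zs, delta):
    """Delta rule: dW_l = delta_l x_l^T; delta_{l-1} = W_l^T delta_l * 1[z_{l-1} > 0]."""
    dWs = [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        dWs[l] = np.outer(delta, xs[l])
        if l > 0:
            delta = (Ws[l].T @ delta) * (zs[l - 1] > 0)
    return dWs

def ce_loss(z, y):
    return -z[y] + np.log(np.exp(z).sum())

x, y = rng.normal(size=4), 1
xs, zs = forward_stack(Ws, x)
y_hat = np.exp(zs[-1]) / np.exp(zs[-1]).sum()
delta = y_hat.copy()
delta[y] -= 1.0                     # output delta for softmax + cross-entropy

dWs = backward_stack(Ws, xs, zs, delta)

# Finite-difference check on one entry of W^(1)
eps = 1e-6
Wp = [W.copy() for W in Ws]
Wp[0][0, 0] += eps
_, zs_p = forward_stack(Wp, x)
fd = (ce_loss(zs_p[-1], y) - ce_loss(zs[-1], y)) / eps
assert abs(fd - dWs[0][0, 0]) < 1e-4
```

Real implementations add biases and batching, but the backward loop keeps exactly this shape.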

ℹ️Note

Backprop is $O(n)$ in the number of parameters — the same asymptotic cost as a forward pass. The naive alternative (finite differences for each weight) would require $O(n)$ forward passes. For GPT-3 with 175B parameters, backprop makes training feasible where finite differences would not.

Walkthrough

Dataset: MNIST Digit Classification

The Modified National Institute of Standards and Technology database (MNIST) contains 70,000 grayscale images ($28 \times 28 = 784$ pixels each) of handwritten digits 0–9. The task: classify each image into one of 10 classes.

  • Training set: 60,000 images
  • Test set: 10,000 images
  • Baseline (majority class): ~10% accuracy
  • Human-level: ~99.8%
  • A 2-layer Multi-Layer Perceptron (MLP) trained here: ~98.1%

NumPy from Scratch

Understanding the raw computation before using frameworks is essential:

python
import numpy as np
 
# X_train: (60000, 784), normalized to [0, 1]
# y_train: (60000,), integer labels 0-9
X_train = X_train / 255.0
X_test  = X_test  / 255.0
 
n_input, n_hidden, n_output = 784, 256, 10
 
# Xavier initialization: scale by 1/sqrt(fan_in)
rng = np.random.default_rng(42)
W1 = rng.normal(0, 1/np.sqrt(n_input),  (n_hidden, n_input))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 1/np.sqrt(n_hidden), (n_output, n_hidden))
b2 = np.zeros(n_output)
 
def relu(z):
    return np.maximum(0, z)
 
def softmax(z):
    # Subtract max for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
 
def forward(X):
    z1    = X @ W1.T + b1          # (batch, 256)
    a1    = relu(z1)                # (batch, 256)
    z2    = a1 @ W2.T + b2         # (batch, 10)
    y_hat = softmax(z2)             # (batch, 10)
    return z1, a1, z2, y_hat
 
def backward(X, y, z1, a1, z2, y_hat, lr=0.01):
    global W1, b1, W2, b2
    n = len(y)
 
    # Output delta: softmax + cross-entropy gradient
    d2 = y_hat.copy()
    d2[np.arange(n), y] -= 1
    d2 /= n                        # (batch, 10)
 
    dW2 = d2.T @ a1               # (10, 256)
    db2 = d2.sum(axis=0)          # (10,)
 
    # Backprop through ReLU
    d1 = d2 @ W2                  # (batch, 256)
    d1 *= (z1 > 0)                # ReLU gate
 
    dW1 = d1.T @ X               # (256, 784)
    db1 = d1.sum(axis=0)         # (256,)
 
    W2 -= lr * dW2;  b2 -= lr * db2
    W1 -= lr * dW1;  b1 -= lr * db1
 
# Training loop — 20 epochs, batch size 128
batch_size = 128
for epoch in range(20):
    idx = np.random.permutation(len(X_train))
    Xs, ys = X_train[idx], y_train[idx]
    for i in range(0, len(Xs), batch_size):
        z1, a1, z2, y_hat = forward(Xs[i:i+batch_size])
        backward(Xs[i:i+batch_size], ys[i:i+batch_size], z1, a1, z2, y_hat)
    _, _, _, yh = forward(X_test)
    acc = (yh.argmax(1) == y_test).mean()
    print(f"Epoch {epoch+1:2d} | acc={acc:.4f}")
# Epoch  1 | acc=0.9234
# Epoch  5 | acc=0.9562
# Epoch 20 | acc=0.9743

PyTorch Version

PyTorch handles the computation graph and backprop automatically:

python
import torch
import torch.nn as nn
 
class MLP(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, num_classes=10, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes),
        )
 
    def forward(self, x):
        return self.net(x.view(x.size(0), -1))
 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MLP().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

Code Implementation

The production training pipeline with MLflow tracking, config-driven hyperparameters, and artifact management:

train.py
python
"""MLP Training Pipeline — MNIST digit classification with MLflow tracking."""
import argparse
import json
from pathlib import Path
 
import mlflow
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
 
 
class MLP(nn.Module):
    """Two-layer MLP with batch norm and dropout."""
    def __init__(self, input_dim=784, hidden_dim=256, num_classes=10, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes),
        )
 
    def forward(self, x):
        return self.net(x)
 
 
def train(cfg: dict) -> None:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    hp = cfg["hyperparams"]
 
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,)),  # MNIST mean/std
    ])
    train_ds = datasets.MNIST("./data", train=True,  download=True, transform=transform)
    test_ds  = datasets.MNIST("./data", train=False, transform=transform)
    train_loader = DataLoader(train_ds, batch_size=hp["batch_size"], shuffle=True,
                              num_workers=4, pin_memory=True)
    test_loader  = DataLoader(test_ds,  batch_size=256)
 
    model = MLP(hidden_dim=hp["hidden_dim"], dropout=hp["dropout"]).to(device)
    optimizer = optim.Adam(model.parameters(), lr=hp["learning_rate"],
                           weight_decay=hp["weight_decay"])
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=hp["epochs"])
    criterion = nn.CrossEntropyLoss()
 
    mlflow.set_experiment("mlp-mnist")
    with mlflow.start_run():
        mlflow.log_params(hp)
        best_acc = 0.0
 
        for epoch in range(hp["epochs"]):
            model.train()
            total_loss = 0.0
            for X, y in train_loader:
                X, y = X.to(device), y.to(device)
                optimizer.zero_grad()
                loss = criterion(model(X), y)
                loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                optimizer.step()
                total_loss += loss.item()
            scheduler.step()
 
            # Evaluation
            model.eval()
            correct = 0
            with torch.no_grad():
                for X, y in test_loader:
                    X, y = X.to(device), y.to(device)
                    correct += (model(X).argmax(1) == y).sum().item()
            acc = correct / len(test_ds)
 
            mlflow.log_metrics({
                "train_loss": total_loss / len(train_loader),
                "test_acc": acc,
            }, step=epoch)
            print(f"Epoch {epoch+1:3d} | loss={total_loss/len(train_loader):.4f} | acc={acc:.4f}")
 
            if acc > best_acc:
                best_acc = acc
                torch.save(model.state_dict(), "best_model.pt")
 
        mlflow.log_artifact("best_model.pt")
        print(f"Best test accuracy: {best_acc:.4f}")
        # Expected output: Best test accuracy: 0.9812
 
 
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="../../config.yaml")
    args = parser.parse_args()

    import yaml  # local import: only needed when run as a script
    with open(args.config) as f:
        train(yaml.safe_load(f))
 
 
if __name__ == "__main__":
    main()

Analysis & Evaluation

Where Your Intuition Breaks

Deeper networks seem strictly more powerful — and they are, but only when gradients can flow. With saturating activations, multiplying many small derivatives (at most 0.25 for sigmoid, and below 1 almost everywhere for tanh) through 10+ layers drives the gradient to near zero. Early layers receive no useful signal and stop learning. Depth only helps when paired with non-saturating activations (ReLU, GELU) or skip connections.

Learning Curves

Typical training dynamics for the 2-layer MLP on MNIST (hidden_dim=256, lr=1e-3, Adam optimizer):

| Epoch | Train Loss | Test Accuracy |
|------:|-----------:|--------------:|
| 1     | 0.3821     | 95.23%        |
| 5     | 0.0891     | 97.41%        |
| 10    | 0.0523     | 97.89%        |
| 20    | 0.0312     | 98.12%        |

The loss drops sharply in the first epoch because the network quickly learns the most salient features (stroke direction, enclosed regions). Subsequent epochs refine decision boundaries on harder examples.

Activation Function Comparison

Testing identical architectures with different activations (hidden_dim=256, 20 epochs, MNIST):

| Activation | Test Acc | Epoch 1 Acc | Notes |
|------------|----------|-------------|-------|
| ReLU       | 98.12%   | 95.23%      | Fast convergence, standard choice |
| Tanh       | 97.83%   | 93.71%      | Slightly slower, zero-centered |
| Sigmoid    | 96.21%   | 88.14%      | Vanishing gradients slow early training |
| Leaky ReLU | 98.19%   | 95.31%      | Marginal improvement, no dying neurons |
💡Intuition

The sigmoid's poor early accuracy (88%) versus ReLU's (95%) after just one epoch illustrates the vanishing gradient problem. The gradient through a sigmoid is at most $0.25$. Stack 5 layers and the signal at the input is attenuated by $0.25^5 \approx 0.001$, making the first few layers learn essentially nothing in early training.
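This attenuation can be reproduced in a few lines: compose sigmoid with itself five times (weights omitted for simplicity) and multiply the local derivatives along the chain:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Chain-rule gradient of sigmoid composed 5 times, starting at z = 0
# (the point where each sigma' is largest).
g, z = 1.0, 0.0
for _ in range(5):
    s = sigmoid(z)
    g *= s * (1 - s)    # multiply in the local derivative sigma'(z)
    z = s               # this output feeds the next "layer"

assert g <= 0.25 ** 5   # each factor is at most 0.25
assert g < 1e-3         # signal attenuated by roughly 1000x
```

With weight matrices in the chain the picture changes somewhat, but the bound of 0.25 per sigmoid factor still applies.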

Gradient Magnitude Analysis

Monitor gradient health during training to detect vanishing or exploding gradients:

python
# After loss.backward(), before optimizer.step()
for name, param in model.named_parameters():
    if param.grad is not None:
        grad_norm = param.grad.norm().item()
        print(f"{name}: grad_norm={grad_norm:.6f}")
 
# Healthy ReLU network:
# net.1.weight: grad_norm=0.082100
# net.4.weight: grad_norm=0.063400
# net.7.weight: grad_norm=0.120300
 
# Unhealthy sigmoid network (deep):
# net.1.weight: grad_norm=0.000300  <- vanishing!
# net.4.weight: grad_norm=0.008900
# net.7.weight: grad_norm=0.118700
⚠️Warning

If gradient norms in early layers are consistently below $10^{-4}$, those layers are not learning. Solutions: switch to ReLU, add batch normalization, use residual connections (skip connections bypass the problem entirely by providing a gradient highway), or reduce network depth.

Production-Ready Code

FastAPI Serving with PyTorch Model

python
"""serve_api/app.py — Production inference endpoint for the trained MLP."""
import io
from pathlib import Path
 
import torch
import torch.nn as nn
from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.responses import JSONResponse
from PIL import Image
from torchvision import transforms
 
app = FastAPI(title="MNIST MLP Inference API", version="1.0.0")
 
 
class MLP(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, num_classes=10, dropout=0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, num_classes),
        )
 
    def forward(self, x):
        return self.net(x)
 
 
# Load once at startup — not per request
model = MLP(hidden_dim=256, dropout=0.0)
model.load_state_dict(torch.load("best_model.pt", map_location="cpu"))
model.eval()
 
transform = transforms.Compose([
    transforms.Grayscale(),
    transforms.Resize((28, 28)),
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])
 
 
@app.get("/health")
def health():
    return {"status": "ok"}
 
 
@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    if not file.content_type.startswith("image/"):
        raise HTTPException(400, "File must be an image")
 
    img = Image.open(io.BytesIO(await file.read())).convert("L")
    tensor = transform(img).unsqueeze(0)  # (1, 1, 28, 28)
 
    with torch.inference_mode():
        logits = model(tensor)            # (1, 10)
        probs  = torch.softmax(logits, dim=1).squeeze()
 
    pred = probs.argmax().item()
    return JSONResponse({
        "prediction": pred,
        "confidence": round(probs[pred].item(), 4),
        "probabilities": {str(i): round(p.item(), 4) for i, p in enumerate(probs)},
    })
 
# Run: uvicorn app:app --host 0.0.0.0 --port 8000
# Test: curl -X POST "http://localhost:8000/predict" -F "file=@digit.png"
🚀Production

Set model.eval() at startup and use torch.inference_mode() instead of torch.no_grad() — it disables autograd entirely for a roughly 10% speedup. For high-throughput services, implement request batching: accumulate incoming requests for 5-10ms and serve them as a single batch. For a 256-wide MLP, batch size 32 is typically optimal on CPU (4ms vs 0.8ms per image = 10× throughput improvement).
