
Bridge: Adam, Learning Rate Theory & Neural Loss Landscape Analysis

The optimization theory of Module 05 connects directly to the practical choices made in every deep learning training run: why Adam adapts step sizes per parameter, why warmup is necessary for large learning rates, why flat minima generalize better, and why sharpness-aware minimization (SAM) outperforms SGD on test accuracy despite no improvement in training loss. This lesson closes the loop from theory to practice.

Concepts

Every practical training choice in deep learning — step size, momentum, weight decay schedule, batch size — has a precise optimization-theoretic interpretation grounded in the theory of this module. Adam's adaptive step sizes are a diagonal Fisher approximation. Learning rate warmup avoids the "catapult phase" instability. SAM's second forward pass computes a gradient at the worst nearby point. Understanding why each works — not just that it works — lets you debug training failures systematically rather than by trial and error.

Adam as Approximate Natural Gradient

The Adam optimizer (Kingma & Ba, 2015):

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\nabla\mathcal{L}_t && \text{(first moment, momentum)} \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)(\nabla\mathcal{L}_t)^2 && \text{(second moment, adaptive scaling)} \\
\hat{m}_t &= m_t / (1-\beta_1^t), \quad \hat{v}_t = v_t/(1-\beta_2^t) && \text{(bias correction)} \\
\theta_t &= \theta_{t-1} - \eta \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \varepsilon)
\end{aligned}
$$

with defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$.
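The update above can be sketched in a few lines of NumPy. This is an illustrative toy implementation, not a reference one, run on an ill-conditioned quadratic where per-parameter scaling pays off:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-indexed step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad       # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2    # second moment (adaptive scaling)
    m_hat = m / (1 - beta1**t)               # bias-corrected moments
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize the ill-conditioned quadratic f(x) = x_1^2 + 100 x_2^2.
theta = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 2001):
    grad = np.array([2.0 * theta[0], 200.0 * theta[1]])
    theta, m, v = adam_step(theta, grad, m, v, t)
```

Despite the 100:1 conditioning, both coordinates make progress at comparable rates, because each is normalized by its own gradient RMS.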

Connection to natural gradient. As shown in Module 04, the natural gradient step uses $\mathcal{I}(\theta)^{-1}\nabla\mathcal{L}$, where $\mathcal{I}(\theta)$ is the Fisher information matrix. The diagonal of $\mathcal{I}(\theta)$ is:

$$
\mathcal{I}(\theta)_{jj} = \mathbb{E}\!\left[\left(\frac{\partial \log p(y\mid x;\theta)}{\partial\theta_j}\right)^2\right] \approx \mathbb{E}\big[(\nabla_j\mathcal{L})^2\big].
$$

The Adam denominator $\sqrt{\hat{v}_t}$ approximates $\sqrt{\mathcal{I}(\theta)_{jj}}$ — so $\hat{m}_t/(\sqrt{\hat{v}_t}+\varepsilon)$ is approximately the natural gradient with a diagonal Fisher approximation, plus momentum. Each parameter gets a step size adapted to its gradient magnitude, automatically correcting for different scales in different directions.

Interpretation. A parameter with consistently large gradients (high Fisher diagonal) gets a small effective step size (it's already well-informed). A parameter with small gradients (low Fisher) gets a large effective step size (it needs more nudging). This is why Adam is dramatically better than plain SGD on ill-conditioned problems. The diagonal Fisher approximation is the cheapest possible natural gradient — using only $O(n)$ memory versus $O(n^2)$ for the full Fisher — and for most architectures it captures the dominant conditioning problem (different layers having different gradient magnitudes) while discarding the off-diagonal correlations that are expensive to maintain.

Convergence. For convex problems with $G$-bounded gradients, Adam achieves $O(G\log T/\sqrt{T})$ regret — the same rate as AdaGrad, but with better constants due to the exponential moving average. For non-convex problems with a fixed $\eta$ there is no general guarantee: there exist counterexamples, even convex ones (Reddi et al., 2018), on which Adam fails to converge.

AdamW (weight decay decoupled). Standard Adam with $L_2$ regularization adds the regularizer gradient to $\nabla\mathcal{L}$ before computing moments — this entangles the regularizer with the adaptive scaling. AdamW decouples weight decay by directly subtracting from the parameters:

$$
\theta_t \leftarrow \theta_{t-1} - \eta\left(\hat{m}_t/(\sqrt{\hat{v}_t}+\varepsilon) + \lambda\theta_{t-1}\right).
$$

This applies weight decay without distorting the adaptive step sizes (for adaptive methods, decoupled decay and $L_2$ regularization are not equivalent). Modern LLM training uses AdamW almost exclusively.
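A minimal sketch of the decoupled update (illustrative, with a hypothetical function name `adamw_step`): the decay term bypasses the moment estimates entirely, so with a zero loss gradient one step is pure multiplicative shrinkage.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """AdamW: moments use the raw loss gradient; decay acts on parameters directly."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # L2-style Adam would instead add weight_decay * theta to `grad` above,
    # letting the adaptive scaling distort the regularizer.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

# With grad = 0, the step reduces to theta <- theta * (1 - lr * weight_decay).
theta, m, v = adamw_step(np.array([1.0]), np.zeros(1), np.zeros(1), np.zeros(1), t=1)
```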

Learning Rate Schedules and the Catapult Phase

Why learning rate schedules matter. From convergence theory: for a strongly convex, $L$-smooth problem, the constant step size $\alpha = 1/L$ is optimal for GD. But in practice, deep networks benefit from:

  1. Starting with a large learning rate (explores the landscape, finds flat basins)
  2. Decaying the learning rate (converges tightly within a basin)

Linear warmup. Large initial learning rates cause instability at the start — the loss can spike or diverge because the gradients are misaligned with the loss landscape. Linearly increasing $\eta$ from 0 to $\eta_{\max}$ over $T_\text{warm}$ steps ("warmup") stabilizes early training.

Theoretical grounding (Lewkowycz et al., 2020 — "catapult phase"). For learning rates above a threshold $\eta_c \approx 2/\lambda_{\max}(\nabla^2\mathcal{L})$ (the classical stability threshold), gradient descent enters the "catapult phase" — the sharpness (max Hessian eigenvalue) initially increases (progressive sharpening), then stabilizes at $2/\eta$ (edge of stability). The iterate is "launched" into flatter regions of the landscape. This is why large learning rates find flatter minima.

Cosine annealing. The schedule $\eta_t = \eta_{\max}\cdot\frac{1+\cos(\pi t/T)}{2}$ for $t \in [0,T]$ provides smooth decay from $\eta_{\max}$ to 0. Used with restarts (SGDR, Loshchilov & Hutter, 2017):

$$
\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max}-\eta_{\min})\left(1+\cos\!\left(\frac{\pi\,(t \bmod T_i)}{T_i}\right)\right),
$$

where $T_i$ doubles after each restart. The periodically reset learning rate escapes local basins and samples multiple solutions — weight averaging across restarts often improves generalization.

OneCycleLR (super-convergence, Smith, 2018): a single cycle with a large peak learning rate — training can converge in $1/10$ the usual epochs. The peak learning rate is set near the boundary of instability.
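The SGDR formula with period doubling can be sketched as a plain function (a toy sketch; library schedulers such as PyTorch's `CosineAnnealingWarmRestarts` track this state incrementally instead):

```python
import math

def sgdr_lr(t, eta_min=0.0, eta_max=0.1, T0=10):
    """Cosine annealing with warm restarts; cycle length T_i doubles each restart."""
    Ti, start = T0, 0
    while t >= start + Ti:   # locate the cycle containing step t
        start += Ti
        Ti *= 2
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * (t - start) / Ti))

# eta resets to eta_max at t = 0, 10, 30, 70, ... and decays within each cycle.
```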

Edge of Stability

Edge of Stability (EoS) (Cohen et al., 2021): in practice, gradient descent with a fixed learning rate $\eta$ drives the sharpness $\lambda_{\max}(\nabla^2\mathcal{L})$ up to about $2/\eta$ and holds it there — at, and even slightly beyond, the threshold where classical smoothness-based analysis predicts divergence (formally the "unstable" regime).

Why? When sharpness exceeds $2/\eta$, GD makes oscillatory steps that happen to reduce sharpness — a self-stabilizing feedback loop. The iterate oscillates within a ball while the ball drifts toward lower loss.

Implication. For a given step size $\eta$, training converges to a region where $\lambda_{\max}(\nabla^2\mathcal{L}) \approx 2/\eta$. Larger $\eta$ → smaller sharpness → flatter minimum → better generalization. This gives formal support for the empirical observation that SGD with large step sizes generalizes better.
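The self-stabilizing feedback requires curvature that varies across the landscape, so a quadratic cannot reproduce EoS itself — but it does isolate the $2/\eta$ threshold. A toy check on $f(x) = \tfrac{1}{2}\lambda x^2$, where GD contracts iff $\eta < 2/\lambda$:

```python
def gd_final_abs(eta, lam=10.0, x0=1.0, steps=50):
    """GD on f(x) = 0.5 * lam * x^2: each step multiplies x by (1 - eta*lam),
    so |1 - eta*lam| < 1, i.e. eta < 2/lam, is the classical stability condition."""
    x = x0
    for _ in range(steps):
        x = (1 - eta * lam) * x
    return abs(x)

# Sharpness lam = 10 gives the threshold eta_c = 2/10 = 0.2.
stable = gd_final_abs(0.19)     # just below: oscillates but contracts
unstable = gd_final_abs(0.21)   # just above: oscillates and blows up
```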

Sharpness-Aware Minimization (SAM)

SAM (Foret et al., 2021) directly minimizes the worst-case loss in a ball around the current parameters:

$$
\min_\theta \max_{\|\epsilon\| \leq \rho} \mathcal{L}(\theta + \epsilon).
$$

Inner maximization (approximate). The maximizer over the $\rho$-ball, to first order:

$$
\hat\epsilon(\theta) = \arg\max_{\|\epsilon\|\leq\rho} \mathcal{L}(\theta+\epsilon) \approx \rho\cdot\frac{\nabla_\theta\mathcal{L}(\theta)}{\|\nabla_\theta\mathcal{L}(\theta)\|}.
$$

SAM update:

  1. Compute the gradient at the perturbed point: $g = \nabla_\theta\mathcal{L}(\theta + \hat\epsilon(\theta))$
  2. Update: $\theta \leftarrow \theta - \eta g$

This requires two forward-backward passes per step (one to compute $\hat\epsilon$, one to compute $g$) — doubling compute cost. In return, SAM consistently finds flatter minima with lower $\lambda_{\max}(\nabla^2\mathcal{L})$ and better test performance.
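The two-pass update can be sketched directly (a toy sketch with an illustrative `sam_step`; `loss_grad` stands in for a framework's backward pass), exercised on a simple quadratic just to show the mechanics:

```python
import numpy as np

def sam_step(theta, loss_grad, lr=0.1, rho=0.05):
    """One SAM step: ascend to the approximate worst point in the rho-ball,
    then descend from theta using the gradient measured there."""
    g0 = loss_grad(theta)                          # pass 1: gradient at theta
    eps = rho * g0 / (np.linalg.norm(g0) + 1e-12)  # first-order inner maximizer
    g_sam = loss_grad(theta + eps)                 # pass 2: gradient at theta + eps
    return theta - lr * g_sam

# Mechanics check on L(x) = 0.5 * ||x||^2, whose gradient is x itself.
theta = np.array([1.0, -2.0])
for _ in range(100):
    theta = sam_step(theta, lambda x: x)
```

On this convex toy the perturbation simply overshoots slightly; the flatness-seeking behavior only shows up on landscapes with minima of different sharpness.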

Connection to PAC-Bayes. The SAM objective is motivated by a PAC-Bayes generalization bound of the form:

$$
\mathcal{L}_{\text{test}} \lesssim \mathcal{L}_{\text{train}} + \sqrt{\frac{\operatorname{tr}(\nabla^2\mathcal{L})}{n_{\text{data}}}} \lesssim \mathcal{L}_{\text{train}} + O\!\left(\frac{\lambda_{\max}}{\sqrt{n_{\text{data}}}}\right).
$$

Minimizing the SAM objective directly attacks the right-hand side's sharpness term.

Optimizer Comparison Table

| Optimizer | Step-size adaptation | Convergence guarantee | Practical default? |
|---|---|---|---|
| SGD | None (uniform) | $O(1/\sqrt{T})$ convex | For CV with LR tuning |
| SGD+Momentum | Momentum (Polyak) | Same as SGD | Wide use in CV |
| AdaGrad | Cumulative $\sum g_t^2$ | $O(\log T/\sqrt{T})$ | Sparse data, NLP (old) |
| RMSProp | EMA of $g_t^2$ | Heuristic | Hidden layers in RNNs |
| Adam | EMA of $g_t$, $g_t^2$ | $O(\log T/\sqrt{T})$ | Default for LLMs/NLP |
| AdamW | Adam + decoupled decay | Same as Adam | LLM pre-training |
| SAM+Adam | Adam + flatness penalty | None (non-convex) | SOTA image classif. |

Worked Example

Example 1: Adam Convergence Bound

For a convex problem with $G$-bounded gradients ($\|\nabla f_t\| \leq G$), Adam's regret after $T$ rounds satisfies:

$$
\text{Regret}(T) = \sum_{t=1}^T f_t(x_t) - \min_x \sum_{t=1}^T f_t(x) \leq \frac{d\sqrt{T}\,G^2\,\eta\,\sqrt{1-\beta_2^T}}{\sqrt{1-\beta_2}\,(1-\beta_1)\,\varepsilon} + O\!\left(\frac{G^2}{\eta(1-\beta_1)}\sum_j \|g_{1:T,j}\|_2\right).
$$

In the adaptive case where gradients are sparse (many $g_{t,j}$ are zero), the second term can be much smaller than for uniform SGD — this is why Adam is particularly effective for embeddings and attention layers with sparse gradient patterns.

Example 2: Cosine Annealing Schedule

For a 300-epoch training run with $\eta_{\max} = 0.1$ and a 5-epoch warmup, the learning rate at epoch $t$:

$$
\eta_t = \begin{cases} 0.1 \cdot t/5 & t \leq 5 \text{ (linear warmup)} \\ 0.1 \cdot \dfrac{1+\cos(\pi(t-5)/295)}{2} & t > 5 \text{ (cosine decay)} \end{cases}
$$

At $t = 5$: $\eta = 0.1$. At $t = 150$: $\eta \approx 0.1\cdot(1+\cos(\pi/2))/2 = 0.05$. At $t = 300$: $\eta = 0.1\cdot(1+\cos\pi)/2 = 0$.

The gradual decay prevents the oscillatory behavior of a step schedule (which abruptly drops $\eta$, causing the model to "jolt" into a sharper basin) while also fully converging to a stationary point as $\eta \to 0$.
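The piecewise schedule in this example can be checked numerically (a small sketch; the name `lr_at` is illustrative):

```python
import math

def lr_at(t, eta_max=0.1, warm=5, total=300):
    """Linear warmup for `warm` epochs, then cosine decay over the remaining epochs."""
    if t <= warm:
        return eta_max * t / warm
    return eta_max * (1 + math.cos(math.pi * (t - warm) / (total - warm))) / 2

# lr_at(5) -> 0.1 (peak); lr_at(150) -> ~0.051 (near midpoint); lr_at(300) -> 0.0
```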

Example 3: SAM Two-Step Update

For a batch loss $\mathcal{L}(\theta)$ with $\rho = 0.05$:

Step 1 (perturbation): Compute $g_0 = \nabla_\theta\mathcal{L}(\theta)$. Set $\hat\epsilon = \rho\cdot g_0/\|g_0\|$ (the gradient direction, normalized to the $\rho$-sphere).

Step 2 (gradient at perturbed point): Compute $g_{\text{SAM}} = \nabla_\theta\mathcal{L}(\theta + \hat\epsilon)$. This is the gradient at the worst-case nearby point.

Update: $\theta \leftarrow \theta - \eta g_{\text{SAM}}$.

The key difference from vanilla GD: $g_{\text{SAM}}$ is the direction of steepest ascent at the perturbed point $\theta + \hat\epsilon$, not at $\theta$. Near a sharp minimum, $\hat\epsilon$ pushes toward the sharp ridge, and $g_{\text{SAM}}$ then moves away from it — the net effect is to seek flatter regions.

Connections

Where Your Intuition Breaks

AdamW is so ubiquitous that it's tempting to treat it as theoretically principled for all settings. But AdamW has no convergence guarantee for non-convex objectives with a fixed learning rate — there exist pathological (if contrived) non-convex problems where Adam oscillates without converging. The reason it works in practice for neural networks is not covered by any general theorem; it's empirical robustness on the specific geometry of neural loss surfaces. More practically: AdamW's bias-correction terms ($1-\beta_1^t$, $1-\beta_2^t$) are critical during the first few hundred steps. At step $t=1$ with $\beta_2 = 0.999$, the uncorrected $v_1 = 0.001 \cdot g_1^2$ — a near-zero denominator that would cause enormous steps without the $/(1-\beta_2^t)$ correction. This is why learning rate warmup is not optional when using AdamW at large scale: the optimizer is not theoretically stable at full learning rate from step 1.
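The magnitude gap is easy to verify (assuming a unit-magnitude first gradient for illustration):

```python
beta2 = 0.999
g1 = 1.0                      # assume the first gradient has unit magnitude

v1 = (1 - beta2) * g1**2      # uncorrected second moment after step 1: 0.001
v1_hat = v1 / (1 - beta2**1)  # bias-corrected: exactly g1^2 = 1.0

# sqrt(v1) ~ 0.032 vs sqrt(v1_hat) = 1.0: without the correction the first
# step would be ~31x too large for the same learning rate.
```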

💡Intuition

Adam's step size adaptation is diagonal natural gradient descent. The update $\hat{m}_t/\sqrt{\hat{v}_t}$ divides the (smoothed) gradient by the (smoothed) RMS of past gradients. This is exactly the natural gradient with a diagonal Fisher approximation: each parameter's step size is scaled inversely to the information it carries about the loss. Parameters with high gradient variance (high information) take small steps; parameters with low gradient variance take large steps. This is why Adam handles different parameter scales automatically, whereas SGD with a global learning rate struggles on problems with parameters of very different magnitudes.

💡Intuition

The edge of stability connects theory to practice. The finding that GD with step size $\eta$ drives sharpness to $2/\eta$ has a beautiful implication: the learning rate sets the sharpness of the minimum found! A larger $\eta$ gives a flatter minimum (lower sharpness), which empirically corresponds to better generalization. This gives a mechanistic explanation for the folklore that "larger learning rates generalize better" — it's not just exploration, it's that larger LR directly imposes a flatness constraint on the found solution.

⚠️Warning

SAM's double compute cost is not always worth it. SAM requires two backward passes per step. At billion-parameter scale, this is often prohibitive. Efficient SAM variants (ASAM, mSAM, sparse SAM) reduce the cost by applying perturbations only to a subset of parameters or normalizing by parameter magnitude. For models trained at limited compute budgets (most academic research), the SAM overhead may outweigh its generalization benefit relative to simply training longer with standard Adam and better data augmentation. The generalization benefit of SAM is most pronounced for small datasets and without strong augmentation.
