Non-Convex Landscapes: Saddle Points, Spurious Minima & Escape
Most deep learning objectives are non-convex, yet gradient descent reliably finds good solutions. Understanding why requires a careful look at the structure of non-convex landscapes: where saddle points dominate over local minima in high dimensions, how gradient descent can escape saddle points with noise, and why overparameterized networks seem to lack bad local minima despite the absence of any general theorem guaranteeing it.
Concepts
A mountain pass is a saddle point: you're at a minimum if you look left-right across the pass, but at a maximum if you look along the ridge. In one dimension these are just peaks and valleys. In ten million dimensions — the parameter count of a typical neural network — almost every "stuck point" where the gradient vanishes is a saddle, not a local minimum. And at most saddle points in high-dimensional neural networks, at least one escape direction has lower loss, so gradient descent with any noise will eventually fall off. This is why neural networks are trainable despite being non-convex: the landscape is dominated by benign saddles, not bad local minima.
Types of Critical Points
A point $x^*$ is a critical point (stationary point) of $f$ if $\nabla f(x^*) = 0$. Three kinds:
Local minimum. $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$ (all eigenvalues positive). The function curves up in every direction.
Local maximum. $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \prec 0$ (all eigenvalues negative). The function curves down in every direction.
Saddle point. $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ has both positive and negative eigenvalues. The function curves up in some directions and down in others.
Higher-order: $\nabla^2 f(x^*)$ has some zero eigenvalues ("flat directions") — degenerate critical points that require higher-order analysis.
The Hessian classification is not arbitrary: for a smooth function, the second-order Taylor expansion is the first term that distinguishes a minimum (all curvatures positive) from a saddle (mixed curvatures) from a maximum (all curvatures negative). The gradient alone ($\nabla f(x^*) = 0$) cannot distinguish them — you need the curvature information in $\nabla^2 f(x^*)$. This is why second-order optimization methods matter: they can check the Hessian's sign structure and avoid saddle points, while first-order methods must infer it from trajectory shape or use noise.
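As a quick numerical companion, the classification reduces to checking the signs of the Hessian's eigenvalues. A minimal sketch (the function name and tolerance are illustrative, not from any particular library):

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-8):
    """Classify a critical point by the sign structure of its Hessian eigenvalues."""
    eigvals = np.linalg.eigvalsh(hessian)  # symmetric Hessian -> real eigenvalues
    if np.any(np.abs(eigvals) <= tol):
        return "degenerate"          # zero eigenvalues: need higher-order analysis
    if np.all(eigvals > tol):
        return "local minimum"       # curves up in every direction
    if np.all(eigvals < -tol):
        return "local maximum"       # curves down in every direction
    return "strict saddle"           # mixed curvature

# f(x, y) = x^2 - y^2 at the origin has Hessian diag(2, -2):
print(classify_critical_point(np.diag([2.0, -2.0])))  # strict saddle
```

In ten million dimensions one would never form the dense Hessian; practical methods estimate extreme eigenvalues with Hessian-vector products instead.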
Saddle Points Dominate in High Dimensions
Statistical physics intuition. For a random function of $n$ variables, a critical point with $k$ negative Hessian eigenvalues (an index-$k$ saddle) occurs with probability on the order of $\binom{n}{k} 2^{-n}$, if the eigenvalue signs behave like independent coin flips. In high dimensions:
- Fraction of critical points that are local minima: exponentially small, roughly $2^{-n}$ (all eigenvalues positive requires all $n$ random signs to come up positive).
- Fraction that are saddle points: overwhelmingly dominant.
For deep networks, empirical studies (Dauphin et al., 2014; Goodfellow et al., 2015) find that almost all critical points encountered during training are saddle points — and most of these saddle points have loss values close to that of the global minimum. Very few "bad" local minima are observed.
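The coin-flip heuristic can be checked empirically by sampling random symmetric "Hessians" and measuring how often all eigenvalues come out positive. A sketch, with GOE-style random matrices standing in for a generic random Hessian (sample counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_positive_definite(n, trials=2000):
    """Fraction of random symmetric n x n matrices with all eigenvalues > 0."""
    count = 0
    for _ in range(trials):
        a = rng.standard_normal((n, n))
        h = (a + a.T) / 2            # symmetrize: a random "Hessian"
        if np.all(np.linalg.eigvalsh(h) > 0):
            count += 1
    return count / trials

for n in [1, 2, 5, 10]:
    print(n, frac_positive_definite(n))  # fraction collapses rapidly with n
```

Already at $n = 10$ the all-positive event essentially never occurs; for eigenvalues that repel (as in real random matrices) the decay is even faster than $2^{-n}$.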
Strict saddle property. A critical point $x^*$ is a strict saddle if $\lambda_{\min}(\nabla^2 f(x^*)) < 0$ — there exists a direction of strictly negative curvature. Many non-convex objectives arising in ML satisfy the strict saddle property: every critical point is either a local minimum or a strict saddle. For such objectives, gradient descent with noise converges to a local minimum.
Escaping Saddle Points
Near a strict saddle with $\lambda_{\min}(\nabla^2 f(x^*)) = -\gamma < 0$, the landscape is locally quadratic:
$$f(x^* + \delta) \approx f(x^*) + \tfrac{1}{2}\, \delta^\top \nabla^2 f(x^*)\, \delta.$$
In the negative-curvature direction $v$ (the eigenvector of $\nabla^2 f(x^*)$ with eigenvalue $-\gamma$): $f(x^* + t v) \approx f(x^*) - \tfrac{\gamma}{2} t^2$. So any displacement in the $v$ direction decreases $f$ — the function "falls off" the saddle.
Problem. Gradient descent stalls near a saddle because $\|\nabla f(x)\|$ is small near any stationary point. GD makes negligible progress there — in the worst case it can take exponentially many steps to escape via the gradient alone.
Perturbed gradient descent (Jin et al., 2017). When progress is slow (gradient is small), add a random perturbation:
```
if ‖∇f(xₖ)‖ ≤ ε:
    xₖ ← xₖ + ξ,   where ξ ~ Uniform(B(0, r))
```
With probability $1 - \delta$, after $\tilde{O}(\log^4(d)/\epsilon^2)$ iterations, this finds an $\epsilon$-approximate second-order stationary point (SOSP): $\|\nabla f(x)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 f(x)) \ge -\sqrt{\rho \epsilon}$, where $\rho$ is the Hessian Lipschitz constant.
Key insight. Random perturbations almost surely land in the "escape subspace" — the subspace of negative curvature. The probability of landing exactly in the "flat" subspace (which would be stuck) is zero for random perturbations.
SGD escapes for free. Stochastic gradient noise acts as automatic perturbation. Under standard assumptions, SGD converges to an SOSP without explicit perturbation — gradient noise provides implicit escaping.
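A minimal sketch of the perturbation idea (the actual algorithm of Jin et al. also spaces out perturbations and monitors function decrease; the function name, constants, and step counts here are illustrative):

```python
import numpy as np

def perturbed_gd(grad, x0, lr=0.05, eps=1e-3, radius=1e-2, steps=500, seed=0):
    """Gradient descent that adds a small random kick whenever the gradient is tiny."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        if np.linalg.norm(g) <= eps:
            # Sample a uniform point in the ball B(0, radius)
            xi = rng.standard_normal(x.shape)
            xi *= radius * rng.random() ** (1 / x.size) / np.linalg.norm(xi)
            x = x + xi
        else:
            x = x - lr * g
    return x

# Strict saddle f(x, y) = x^2 - y^2, started on the stable manifold (y = 0):
x_final = perturbed_gd(lambda x: np.array([2 * x[0], -2 * x[1]]), [1.0, 0.0])
print(x_final)  # |y| component has grown large: the iterate escaped the saddle
```

Plain GD from the same start would stay on the $y = 0$ axis forever; the random kick almost surely has a nonzero component along the negative-curvature direction, which then grows geometrically.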
Spurious Local Minima
Question. Do deep neural networks have "bad" local minima — points where $\nabla L(\theta^*) = 0$ and $\nabla^2 L(\theta^*) \succeq 0$, but $L(\theta^*)$ is far above the global minimum?
Deep linear networks. For linear networks $f(x) = W_L W_{L-1} \cdots W_1 x$ trained with squared loss:
Theorem (Baldi & Hornik, 1989; Laurent & von Brecht, 2018). Every local minimum of a deep linear network is a global minimum (under mild rank assumptions). All other critical points are saddles with at least one negative curvature direction. Deep linear networks have no spurious local minima.
Nonlinear networks. No general theorem establishes absence of spurious local minima for nonlinear deep networks. However:
- For two-layer networks with enough hidden units (overparameterized) and random initialization: the loss converges to zero for training data (Du et al., 2019; Li & Liang, 2018).
- For networks in the Neural Tangent Kernel (NTK) regime (infinite width): dynamics are equivalent to kernel regression, which is convex (Jacot et al., 2018).
- Empirically: modern deep networks trained with SGD almost never get stuck at bad local minima — loss consistently decreases to near zero on the training set.
Linear Mode Connectivity
Observation. Two solutions $\theta_1, \theta_2$ found by training the same network with different random seeds often have nearly the same loss value, and the linear interpolation $\theta(\alpha) = (1 - \alpha)\theta_1 + \alpha\theta_2$, $\alpha \in [0, 1]$, maintains near-constant loss — no barriers in between.
Linear mode connectivity (Frankle et al., 2020; Entezari et al., 2022): most independently trained networks are linearly connected in parameter space after permutation alignment (reordering neurons to match up equivalent representations). This suggests the loss landscape has a single well-connected basin for sufficiently large networks.
Implication. Averaging the weights of two independently trained (and aligned) networks gives a model competitive with either individual model — this is model merging or weight averaging (stochastic weight averaging, Izmailov et al., 2018; model soups, Wortsman et al., 2022).
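The barrier along the interpolation path can be measured directly. A sketch (function names are illustrative), checked on a convex toy loss where no barrier can exist; for real networks one would plug in the training loss of two permutation-aligned checkpoints:

```python
import numpy as np

def loss_barrier(loss, theta1, theta2, num=21):
    """Max loss increase along the linear path, relative to the worse endpoint."""
    alphas = np.linspace(0.0, 1.0, num)
    path = [loss((1 - a) * theta1 + a * theta2) for a in alphas]
    return max(path) - max(loss(theta1), loss(theta2))

# Convex least-squares toy: interpolating two near-solutions is barrier-free.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((10, 4)), rng.standard_normal(10)
loss = lambda th: 0.5 * np.sum((A @ th - b) ** 2)
theta_star = np.linalg.lstsq(A, b, rcond=None)[0]
theta1 = theta_star + 0.01 * rng.standard_normal(4)
theta2 = theta_star + 0.01 * rng.standard_normal(4)
print(loss_barrier(loss, theta1, theta2))  # <= 0: no barrier
```

For a convex loss the path value never exceeds the endpoints, so the barrier is non-positive by construction; the empirical surprise is that aligned neural networks behave the same way.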
Loss Landscape Geometry: Key Facts
Batchnorm smooths the landscape. Batch normalization decouples the effective learning rate from layer scale, reducing the effective condition number of the Hessian. Empirically, training without BatchNorm often shows sharp ridges and cliffs in the landscape.
Residual connections (skip connections) reduce effective depth. A residual network is effectively a superposition of paths of different depths. Shallow paths dominate early training, making the effective landscape approximately convex near initialization.
Sharpness and generalization. Empirically, sharp minima (large $\lambda_{\max}(\nabla^2 L)$) generalize worse than flat minima (small $\lambda_{\max}$). SGD with small batch sizes finds flatter minima than large-batch training (Keskar et al., 2017). The intuition: flat minima are insensitive to small weight perturbations, making the model robust to distribution shift.
Worked Example
Example 1: Saddle Point of $f(x, y) = x^2 - y^2$
Critical point analysis. $\nabla f = (2x, -2y) = 0$ only at $(0, 0)$. Hessian: $\nabla^2 f = \operatorname{diag}(2, -2)$ — one positive ($+2$) and one negative ($-2$) eigenvalue. This is a strict saddle.
GD behavior. With step size $\eta$, the updates decouple: $x_{k+1} = (1 - 2\eta) x_k$ and $y_{k+1} = (1 + 2\eta) y_k$. The $x$-direction curves up, so $x$ decays geometrically toward 0. The $y$-direction is the negative-curvature direction, so $|y|$ grows geometrically — unless $y_0 = 0$ exactly. Starting from $(x_0, 0)$ on the stable manifold, the gradient $(2x_k, 0)$ never generates a $y$-component: GD converges to the origin and stalls at the saddle. Even for tiny $|y_0| > 0$, escape takes on the order of $\log(1/|y_0|)$ steps.
Perturbed GD. Near the saddle, adding $\xi \sim \mathrm{Uniform}(B(0, r))$: with probability 1, the perturbation has a nonzero component in the $y$-direction, which then grows geometrically and kicks the trajectory away from the saddle into the region $|y| > |x|$, where $f$ decreases without bound.
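The two behaviors are easy to see numerically (a sketch; step size and iteration count are arbitrary):

```python
# GD on f(x, y) = x^2 - y^2: grad f = (2x, -2y), so with step size lr the
# updates decouple into x <- (1 - 2*lr) x and y <- (1 + 2*lr) y.
def gd(x0, y0, lr=0.05, steps=200):
    x, y = x0, y0
    for _ in range(steps):
        x, y = x - lr * 2 * x, y + lr * 2 * y
    return x, y

print(gd(1.0, 0.0))   # y stays exactly 0: GD converges to the saddle and stalls
print(gd(1.0, 1e-6))  # |y| has grown huge: any nonzero y0 escapes geometrically
```

The second run shows why noise suffices: a $y$-displacement of only $10^{-6}$ is amplified by a factor $(1 + 2\eta)$ every step.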
Example 2: Matrix Factorization — No Spurious Minima
Problem. For a rank-$r$ matrix $M \in \mathbb{R}^{n \times n}$ with $r \ll n$, minimize over factors $U \in \mathbb{R}^{n \times r}$, $V \in \mathbb{R}^{n \times r}$:
$$f(U, V) = \tfrac{1}{2} \| U V^\top - M \|_F^2.$$
Theorem. If $\operatorname{rank}(M) \le r$, then every local minimum of $f$ satisfies $U V^\top = M$ (is a global minimum). All other critical points are saddle points.
Proof sketch. At a critical point: $\nabla_U f = (U V^\top - M) V = 0$ and $\nabla_V f = (U V^\top - M)^\top U = 0$. If $U V^\top \neq M$, construct a direction of descent from the top singular vector pair of the residual $M - U V^\top$ — the residual always provides a descent direction.
Significance. Matrix factorization / matrix completion / PCA / dictionary learning all have this no-spurious-minima property. Gradient descent reliably finds the global optimum, explaining why these problems work well in practice without convex relaxation.
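A quick numerical check of the no-spurious-minima claim (dimensions, step size, and iteration count chosen arbitrarily for illustration): gradient descent on the factorized objective from a small random initialization reaches the global minimum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-r target M and small random initialization of the factors
n, r = 20, 3
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))

U = 0.1 * rng.standard_normal((n, r))
V = 0.1 * rng.standard_normal((n, r))

lr = 0.01
for _ in range(10000):
    R = U @ V.T - M                       # residual
    U, V = U - lr * R @ V, V - lr * R.T @ U   # grad_U = R V, grad_V = R^T U
print(np.linalg.norm(U @ V.T - M))        # near 0: a global minimum
```

Different seeds land at different factor pairs $(U, V)$ (the factorization is not unique) but all with residual near zero, consistent with every local minimum being global.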
Example 3: NTK Regime and Convexity
For a network $f(x; \theta)$ near initialization $\theta_0$, in the limit of infinite width (and appropriate scaling), the second-order term of the Taylor expansion vanishes and the network is linear in $\theta$:
$$f(x; \theta) \approx f(x; \theta_0) + \phi(x)^\top (\theta - \theta_0),$$
where $\phi(x) = \nabla_\theta f(x; \theta_0)$ is a fixed feature map. The training loss becomes:
$$L(\theta) = \tfrac{1}{2} \sum_i \big( f(x_i; \theta_0) + \phi(x_i)^\top (\theta - \theta_0) - y_i \big)^2,$$
which is a convex quadratic in $\theta$. GD finds the global minimum in $O(\kappa \log(1/\epsilon))$ steps, where $\kappa$ is the condition number of the kernel Gram matrix.
In finite-width networks, the NTK regime holds approximately when overparameterization is extreme and the learning rate is very small. This explains why overparameterized networks are easy to train — but they also underfit (less expressive than finite-width networks that move far from initialization).
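Under the linearization assumption, training is ordinary least squares on the fixed features $\phi(x_i)$. A sketch with random features standing in for the network Jacobian (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Rows of Phi play the role of the fixed feature vectors phi(x_i)
Phi = rng.standard_normal((50, 10))
y = rng.standard_normal(50)

K = Phi.T @ Phi
kappa = np.linalg.cond(K)               # condition number governs GD's rate

theta = np.zeros(10)
lr = 1.0 / np.linalg.eigvalsh(K).max()  # stable step size for the quadratic
for _ in range(2000):
    theta -= lr * Phi.T @ (Phi @ theta - y)   # gradient of the least-squares loss

theta_star = np.linalg.lstsq(Phi, y, rcond=None)[0]
print(kappa, np.linalg.norm(theta - theta_star))  # GD matches the global minimizer
```

Because the loss is a convex quadratic, GD converges linearly at rate roughly $(1 - 1/\kappa)$ per step, with no saddles or local minima to worry about.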
Connections
Where Your Intuition Breaks
The "no bad local minima" story for overparameterized neural networks is empirically compelling but theoretically incomplete. The results that say neural networks lack bad local minima typically require very specific conditions: infinite-width networks (NTK regime), linear networks, networks with very special architectures. Real neural networks of finite width trained with realistic data can absolutely have bad local minima — they are just apparently rare in practice for well-designed architectures and initialization strategies. When you hear "gradient descent always finds good solutions because there are no bad local minima," the correct reading is: "empirically, for the architecture/dataset combinations we've studied, bad local minima appear to be rare enough not to cause problems." It's not a theorem. This is one of the central open problems in the theory of deep learning.
In high dimensions, almost all critical points are saddle points. For a random quadratic function in $n$ variables, the Hessian eigenvalues are roughly half positive and half negative. The probability that all eigenvalues are positive (a local minimum) is roughly $2^{-n}$ — exponentially small. This is the spin-glass intuition from statistical physics: bad local minima are exponentially rare, but saddle points are everywhere. For practical ML objectives, empirical evidence confirms this: training runs consistently find solutions of similar quality regardless of initialization, suggesting no deep basins around bad local minima.
Linear mode connectivity reveals the geometry of the solution manifold. When two independently trained networks are permutation-aligned and linearly interpolated without a loss barrier, it suggests all good solutions lie in a single connected manifold — a "flat basin" in parameter space. This manifold has high intrinsic dimension (many directions of low curvature), which is consistent with the flatness hypothesis for generalization. Weight averaging exploits this: the average of two solutions on the same manifold is also on (or near) the manifold.
The NTK regime is not how large networks actually learn. In practice, feature learning — where the network's internal representations change significantly during training — is essential for performance. NTK theory describes infinite-width "lazy" networks that barely move their representations. Real networks operate in an intermediate regime: somewhat overparameterized but not infinitely so, moving features substantially. The NTK explains trainability (global convergence for sufficiently wide networks) but not generalization (why test error is low), which requires understanding feature learning, not just optimization.