
Non-Convex Landscapes: Saddle Points, Spurious Minima & Escape

Most deep learning objectives are non-convex, yet gradient descent reliably finds good solutions. Understanding why requires a careful look at the structure of non-convex landscapes: where saddle points dominate over local minima in high dimensions, how gradient descent can escape saddle points with noise, and why overparameterized networks seem to lack bad local minima despite the absence of any general theorem guaranteeing it.

Concepts

[Interactive demo: contour plot of $f(x,y) = x^2 - y^2$ with a saddle point at the origin; blue contours $f > 0$, red contours $f < 0$. GD converges to the saddle ($\nabla f = 0$ but not a minimum — steps shrink as the gradient vanishes). Perturbed GD adds random noise near the saddle, which kicks the trajectory into the negative-curvature direction and escapes.]

A mountain pass is a saddle point: you're at a minimum if you look left-right across the pass, but at a maximum if you look along the ridge. In one dimension these are just peaks and valleys. In ten million dimensions — the parameter count of a typical neural network — almost every "stuck point" where the gradient vanishes is a saddle, not a local minimum. And at most saddle points in high-dimensional neural networks, at least one escape direction has lower loss, so gradient descent with any noise will eventually fall off. This is why neural networks are trainable despite being non-convex: the landscape is dominated by benign saddles, not bad local minima.

Types of Critical Points

A point $x^*$ is a critical point (stationary point) of $f$ if $\nabla f(x^*) = 0$. Three kinds:

Local minimum. $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \succ 0$ (all eigenvalues positive). The function curves up in every direction.

Local maximum. $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*) \prec 0$ (all eigenvalues negative). The function curves down in every direction.

Saddle point. $\nabla f(x^*) = 0$ and $\nabla^2 f(x^*)$ has both positive and negative eigenvalues. The function curves up in some directions and down in others.

Higher-order: $\nabla^2 f(x^*)$ has some zero eigenvalues ("flat directions") — degenerate critical points that require higher-order analysis.

The Hessian classification is not arbitrary: for a smooth function, the second-order Taylor expansion is the first term that distinguishes a minimum (all curvatures positive) from a saddle (mixed curvatures) from a maximum (all curvatures negative). The gradient alone ($\nabla f = 0$) cannot distinguish them — you need the curvature information in $\nabla^2 f$. This is why second-order optimization methods matter: they can check the Hessian's sign structure and avoid saddle points, while first-order methods must infer it from trajectory shape or use noise.
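This classification is mechanical enough to code. Here is a small NumPy sketch (the helper names `hessian` and `classify` are our own, not from any library) that estimates the Hessian by central finite differences and reads the critical-point type off the signs of its eigenvalues:

```python
import numpy as np

def hessian(f, x, h=1e-4):
    """Central-finite-difference Hessian of a scalar function f at x."""
    n = x.size
    H = np.zeros((n, n))
    I = np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x + h*I[i] + h*I[j]) - f(x + h*I[i] - h*I[j])
                       - f(x - h*I[i] + h*I[j]) + f(x - h*I[i] - h*I[j])) / (4 * h * h)
    return H

def classify(f, x, tol=1e-6):
    """Classify a critical point x of f by the Hessian's eigenvalue signs."""
    lam = np.linalg.eigvalsh(hessian(f, x))   # eigenvalues in ascending order
    if lam[0] > tol:
        return "local minimum"
    if lam[-1] < -tol:
        return "local maximum"
    if lam[0] < -tol and lam[-1] > tol:
        return "saddle"
    return "degenerate"

print(classify(lambda p: p[0]**2 - p[1]**2, np.zeros(2)))   # saddle
```

The `tol` cutoff matters: eigenvalues within `tol` of zero are treated as flat directions, landing in the "degenerate" bucket rather than being misclassified by numerical noise.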

Saddle Points Dominate in High Dimensions

Statistical physics intuition. For a random function of $n$ variables, a critical point with $k$ negative eigenvalues (an index-$k$ saddle) has probability $\sim \binom{n}{k}/2^n$ of occurring. In high dimensions:

  • Fraction of critical points that are local minima: exponentially small (all $n$ eigenvalues positive requires all $n$ random signs to come up positive).
  • Fraction that are saddle points: overwhelmingly dominant.
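Under this random-sign model the index distribution is just binomial, so the suppression is easy to quantify in a few lines (the model is a caricature for intuition, not a statement about any particular network):

```python
from math import comb

n = 30                                 # number of dimensions
total = 2 ** n                         # equally likely sign patterns of the eigenvalues
p_min = comb(n, 0) / total             # all n eigenvalues positive: a local minimum
p_saddle = sum(comb(n, k) for k in range(1, n)) / total   # mixed signs: a saddle

print(p_min)      # 2^-30, about 9.3e-10
print(p_saddle)   # everything except the two pure-sign patterns
```

Already at $n = 30$, local minima and maxima together account for under two parts per billion of the critical points; at neural-network scale ($n \sim 10^7$) the fraction is astronomically small.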

For deep networks, empirical studies (Dauphin et al., 2014) find that almost all critical points encountered during training are saddle points — and the local minima that do exist tend to have loss values close to the global minimum. There are very few "bad" local minima.

Strict saddle property. A critical point $x^*$ is a strict saddle if $\lambda_{\min}(\nabla^2 f(x^*)) < 0$ — there exists a direction of strictly negative curvature. Many non-convex objectives arising in ML satisfy the strict saddle property: every critical point is either a local minimum or a strict saddle. For such objectives, gradient descent with noise converges to a local minimum.

Escaping Saddle Points

Near a strict saddle $x^*$ with $\lambda_{\min}(\nabla^2 f(x^*)) = -\gamma < 0$, the landscape is locally:

$$f(x^* + v) \approx f(x^*) + \frac{1}{2}v^T \nabla^2 f(x^*)\, v.$$

In the negative-curvature direction $u$ (the eigenvector of $\lambda_{\min}$): $f(x^* + \epsilon u) \approx f(x^*) - \frac{\gamma}{2}\epsilon^2 < f(x^*)$. So any displacement along $u$ decreases $f$ — the function "falls off" the saddle.
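This drop is easy to verify numerically on the running example $f(x,y) = x^2 - y^2$, where $\gamma = 2$:

```python
import numpy as np

f = lambda p: p[0]**2 - p[1]**2     # strict saddle at the origin
H = np.diag([2.0, -2.0])            # Hessian at the saddle; lambda_min = -gamma = -2
lam, Q = np.linalg.eigh(H)          # eigenvalues in ascending order
u = Q[:, 0]                         # unit eigenvector of lambda_min
eps = 0.1
drop = f(eps * u) - f(np.zeros(2))  # predicted: -(gamma / 2) * eps^2 = -0.01
print(drop)
```

Stepping a distance $\epsilon = 0.1$ along the bottom eigenvector lowers $f$ by $\frac{\gamma}{2}\epsilon^2 = 0.01$, exactly as the quadratic expansion predicts (for this quadratic $f$ the expansion is exact).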

Problem. Gradient descent stalls near a saddle because $\|\nabla f\|$ is small near any stationary point, so steps become tiny. Worse, in the worst case plain GD can take exponentially many steps to escape using the gradient alone (Du et al., 2017) — the iterate can approach the saddle along a stable direction long before the negative curvature takes over.

Perturbed gradient descent (Jin et al., 2017). When progress is slow (gradient is small), add a random perturbation:

If ‖∇f(xₖ)‖ ≤ ε:
   xₖ ← xₖ + ξ, where ξ ~ Uniform(B(0,r))

With probability $\geq 1 - \delta$, after $\tilde{O}\!\left(\frac{L\,\Delta}{\epsilon^2}\right)$ iterations (where $\Delta = f(x_0) - f^*$, $L$ is the gradient Lipschitz constant, and $\tilde{O}$ hides factors polylogarithmic in $n$, $1/\delta$, and $1/\epsilon$), this finds an $(\epsilon, \sqrt{\rho\epsilon})$-approximate second-order stationary point (SOSP): $\|\nabla f\| \leq \epsilon$ and $\lambda_{\min}(\nabla^2 f) \geq -\sqrt{\rho\epsilon}$, where $\rho$ is the Hessian Lipschitz constant. This matches the first-order rate of plain GD up to polylogarithmic factors.
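A minimal sketch of this scheme (simplified relative to Jin et al., whose algorithm also rate-limits perturbations and adds a termination test; the test function and every constant below are our own illustrative choices):

```python
import numpy as np

def grad(p):
    # f(x, y) = (x^2 - 1)^2 / 4 + y^2 / 2: strict saddle at (0, 0), minima at (+-1, 0)
    x, y = p
    return np.array([x * (x**2 - 1), y])

def perturbed_gd(p0, eta=0.1, eps=1e-3, r=1e-2, steps=500, seed=0):
    rng = np.random.default_rng(seed)
    p = np.asarray(p0, dtype=float)
    for _ in range(steps):
        g = grad(p)
        if np.linalg.norm(g) > eps:
            p = p - eta * g                   # ordinary gradient step
        else:
            d = rng.normal(size=p.size)       # stalled: random kick of length r
            p = p + r * d / np.linalg.norm(d)
    return p

p = perturbed_gd(np.zeros(2))   # start exactly at the saddle
```

Plain GD started at the saddle stays there forever (the gradient is exactly zero), whereas the perturbed version ends up near one of the two minima $(\pm 1, 0)$: the kick gives the iterate a component along the negative-curvature direction, and gradient steps amplify it.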

Key insight. A random perturbation almost surely has a nonzero component in the "escape subspace" — the span of the negative-curvature directions — and that component grows geometrically under subsequent GD steps. The set of starting points that remain stuck (the saddle's stable manifold) has measure zero, so a random perturbation lands on it with probability zero.

SGD escapes for free. Stochastic gradient noise acts as automatic perturbation. Under standard assumptions, SGD converges to an SOSP without explicit perturbation — gradient noise provides implicit escaping.

Spurious Local Minima

Question. Do deep neural networks have "bad" local minima — points where $\nabla f = 0$, $\nabla^2 f \succ 0$, but $f$ is far from the global minimum?

Deep linear networks. For linear networks $f(W_1,\ldots,W_k) = \|W_k\cdots W_1 X - Y\|_F^2$:

Theorem (Baldi & Hornik, 1989; Kawaguchi, 2016; Laurent & von Brecht, 2018). Every local minimum of a deep linear network is a global minimum (under mild rank assumptions); every other critical point is a saddle. Deep linear networks have no spurious local minima. (Caveat: for depth $\geq 3$, some of those saddles are degenerate, with no strictly negative Hessian eigenvalue.)

Nonlinear networks. No general theorem establishes absence of spurious local minima for nonlinear deep networks. However:

  • For two-layer networks with enough hidden units (overparameterized) and random initialization: the loss converges to zero for training data (Du et al., 2019; Li & Liang, 2018).
  • For networks in the Neural Tangent Kernel (NTK) regime (infinite width): dynamics are equivalent to kernel regression, which is convex (Jacot et al., 2018).
  • Empirically: modern deep networks trained with SGD almost never get stuck at bad local minima — loss consistently decreases to near zero on the training set.

Linear Mode Connectivity

Observation. Two solutions $\theta_1, \theta_2$ found by training the same network with different random seeds often have nearly the same loss value, and the linear interpolation $\theta(\lambda) = (1-\lambda)\theta_1 + \lambda\theta_2$ maintains near-constant loss — no barriers in between.

Linear mode connectivity (Frankle et al., 2020; Entezari et al., 2022): most independently trained networks are linearly connected in parameter space after permutation alignment (reordering neurons to match up equivalent representations). This suggests the loss landscape has a single well-connected basin for sufficiently large networks.

Implication. Averaging the weights of two independently trained (and aligned) networks gives a model competitive with either individual model — this is model merging or weight averaging (model soups, Wortsman et al., 2022; cf. Stochastic Weight Averaging, Izmailov et al., 2018).

Loss Landscape Geometry: Key Facts

Batchnorm smooths the landscape. Batch normalization decouples the effective learning rate from layer scale, reducing the effective condition number of the Hessian. Empirically, training without BatchNorm often shows sharp ridges and cliffs in the landscape.

Residual connections (skip connections) reduce effective depth. A residual network $x + F(x)$ is effectively a superposition of paths of different depths. Shallow paths dominate early training, making the effective landscape approximately convex near initialization.

Sharpness and generalization. Empirically, sharp minima (large $\lambda_{\max}(\nabla^2 f)$) tend to generalize worse than flat minima (small $\lambda_{\max}$). SGD with small batch sizes finds flatter minima than large-batch training (Keskar et al., 2017). The intuition: a flat minimum is insensitive to small weight perturbations, so its low loss survives the small shift of the loss surface between training and test data.

Worked Example

Example 1: Saddle Point of $f(x,y) = x^2 - y^2$

Critical point analysis. $\nabla f = (2x, -2y) = 0$ at $(x,y) = (0,0)$. Hessian: $\nabla^2 f = \mathrm{diag}(2, -2)$ — one positive ($+2$) and one negative ($-2$) eigenvalue. This is a strict saddle.

GD behavior. With step size $\eta$, the update decouples: $x \leftarrow (1-2\eta)x$ contracts toward $0$ (positive curvature), while $y \leftarrow (1+2\eta)y$ expands away from $0$ (negative curvature). Started exactly on the $x$-axis, e.g. at $(1.8, 0)$, the gradient $(2x, 0)$ has no $y$-component, so GD converges to the saddle and stalls. Started at $(1.8, 10^{-6})$, the iterate first collapses toward the origin, hugs the saddle while the tiny $y$-component grows geometrically, and escapes only after roughly $\log(1/y_0)$ steps — small components along the escape direction mean long plateaus.

Perturbed GD. Near the saddle, adding $\xi \sim \mathcal{N}(0, \sigma^2 I)$ gives the iterate a $y$-component of size $\sim\sigma$ almost surely; subsequent GD steps grow that component by a factor $(1+2\eta)$ each, so the trajectory escapes into the $y \to \pm\infty$ region, where $f$ decreases without bound.
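The plateau is easy to see numerically (the step size and starting point are arbitrary choices):

```python
import numpy as np

eta = 0.1
x, y = 1.8, 1e-6        # tiny component along the escape (y) direction
xs, ys = [x], [y]
for _ in range(100):
    x, y = (1 - 2 * eta) * x, (1 + 2 * eta) * y   # GD on f(x, y) = x^2 - y^2
    xs.append(x)
    ys.append(y)
# x contracts by 0.8 per step, y expands by 1.2 per step: the iterate hugs
# the saddle for dozens of steps before the escape direction takes over
```

After 50 steps, $|x| \approx 2.6\cdot 10^{-5}$ and $y \approx 9\cdot 10^{-3}$ — the iterate is parked at the saddle with a tiny gradient. Only around step 76 does $y$ exceed 1 and the trajectory visibly falls off.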

Example 2: Matrix Factorization — No Spurious Minima

Problem. For a rank-$r$ matrix $M^* \in \mathbb{R}^{m\times n}$ with $r \leq \min(m,n)$, minimize over factors $U \in \mathbb{R}^{m\times r}$, $V \in \mathbb{R}^{n\times r}$:

$$f(U,V) = \frac{1}{2}\|UV^T - M^*\|_F^2.$$

Theorem. If $\mathrm{rank}(M^*) = r$, then every local minimum of $f$ satisfies $UV^T = M^*$ (is a global minimum). All other critical points are saddle points.

Proof sketch. At a critical point: $\nabla_U f = (UV^T - M^*)V = 0$ and $\nabla_V f = (UV^T - M^*)^T U = 0$. If $UV^T \neq M^*$, construct a direction of descent using the SVD of the residual $UV^T - M^*$ — the residual always provides a descent direction.
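A quick numerical illustration (sizes, step size, iteration count, and seed are arbitrary choices — a sketch of the phenomenon, not a tuned solver): gradient descent from small random factors drives the residual to zero, consistent with the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2
M = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))  # rank-r target M*

U = 0.1 * rng.normal(size=(m, r))   # small random factors: start near the
V = 0.1 * rng.normal(size=(n, r))   # saddle at (0, 0), then escape it
eta = 0.005
for _ in range(100_000):
    R = U @ V.T - M                                   # residual UV^T - M*
    U, V = U - eta * (R @ V), V - eta * (R.T @ U)     # simultaneous gradient step

print(np.linalg.norm(U @ V.T - M))   # residual driven toward 0: a global minimum
```

Note that $(U, V) = (0, 0)$ is itself a strict saddle of $f$, so the small random initialization is doing double duty: it breaks the symmetry and starts the iterate off the saddle's stable manifold.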

Significance. Matrix factorization / matrix completion / PCA / dictionary learning all have this no-spurious-minima property. Gradient descent reliably finds the global optimum, explaining why these problems work well in practice without convex relaxation.

Example 3: NTK Regime and Convexity

For a network $f(x;\theta) = f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^T(\theta-\theta_0) + O(\|\theta-\theta_0\|^2)$ near initialization, in the limit of infinite width (and appropriate scaling) the second-order term vanishes and the network is linear in $\delta\theta = \theta - \theta_0$:

$$f(x;\theta) \approx f_0(x) + K(x, X)(\theta-\theta_0),$$

where $K(x, X) = \nabla_\theta f(x;\theta_0) \in \mathbb{R}^{1\times p}$ is a fixed feature map. The training loss becomes:

$$L(\theta) = \frac{1}{n}\|K(\theta-\theta_0) - \text{residual}\|^2,$$

which is a convex quadratic in $\theta$. GD finds the global minimum in $O(\kappa_K \log(1/\epsilon))$ steps, where $\kappa_K = \lambda_{\max}(KK^T)/\lambda_{\min}(KK^T)$ is the condition number of the kernel Gram matrix.
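In the linearized model, training really is least squares, which we can check directly (the random matrix `J` below is a stand-in for the stacked Jacobian at initialization — an illustration of the convergence rate, not an actual NTK computation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 100                            # n training points, p >> n parameters
J = rng.normal(size=(n, p)) / np.sqrt(p)  # stand-in Jacobian of the network at init
target = rng.normal(size=n)               # residual the linearized model must fit

G = J @ J.T                               # kernel Gram matrix K K^T
lam = np.linalg.eigvalsh(G)               # eigenvalues in ascending order
kappa = lam[-1] / lam[0]                  # condition number kappa_K

dtheta = np.zeros(p)
eta = 1.0 / lam[-1]                       # step size 1 / lambda_max
steps = int(kappa * np.log(1e6)) + 1      # O(kappa log(1/eps)) with eps = 1e-6
for _ in range(steps):
    r = J @ dtheta - target               # current residual
    dtheta -= eta * (J.T @ r)             # gradient step on the convex quadratic
```

Because $p > n$, the Gram matrix is full rank almost surely, and each residual mode decays by a factor $(1 - \lambda_i/\lambda_{\max})$ per step — after $\kappa_K \log(1/\epsilon)$ steps the slowest mode has shrunk by $\epsilon$.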

In finite-width networks, the NTK regime holds approximately when overparameterization is extreme and the learning rate is very small. This explains why such networks are easy to train — but lazy (NTK-regime) training can also underperform: the network behaves as a fixed kernel and never adapts its features, unlike networks that move far from initialization.

Connections

Where Your Intuition Breaks

The "no bad local minima" story for overparameterized neural networks is empirically compelling but theoretically incomplete. The results that say neural networks lack bad local minima typically require very specific conditions: infinite-width networks (NTK regime), linear networks, networks with very special architectures. Real neural networks of finite width trained with realistic data can absolutely have bad local minima — they are just apparently rare in practice for well-designed architectures and initialization strategies. When you hear "gradient descent always finds good solutions because there are no bad local minima," the correct reading is: "empirically, for the architecture/dataset combinations we've studied, bad local minima appear to be rare enough not to cause problems." It's not a theorem. This is one of the central open problems in the theory of deep learning.

💡Intuition

In high dimensions, almost all critical points are saddle points. For a random quadratic function in $n$ variables, the Hessian eigenvalues are roughly half positive and half negative. The probability that all $n$ eigenvalues are positive (a local minimum) is $2^{-n}$ — exponentially small. This is the spin-glass intuition from statistical physics: bad local minima are exponentially rare, but saddle points are everywhere. For practical ML objectives, empirical evidence confirms this: training runs consistently find solutions of similar quality regardless of initialization, suggesting no deep basins around bad local minima.

💡Intuition

Linear mode connectivity reveals the geometry of the solution manifold. When two independently trained networks are permutation-aligned and linearly interpolated without a loss barrier, it suggests all good solutions lie in a single connected manifold — a "flat basin" in parameter space. This manifold has high intrinsic dimension (many directions of low curvature), which is consistent with the flatness hypothesis for generalization. Weight averaging exploits this: the average of two solutions on the same manifold is also on (or near) the manifold.

⚠️Warning

The NTK regime is not how large networks actually learn. In practice, feature learning — where the network's internal representations change significantly during training — is essential for performance. NTK theory describes infinite-width "lazy" networks that barely move their representations. Real networks operate in an intermediate regime: somewhat overparameterized but not infinitely so, moving features substantially. The NTK explains trainability (global convergence for sufficiently wide networks) but not generalization (why test error is low), which requires understanding feature learning, not just optimization.
