
Gradient Methods: Convergence Rates & Information-Theoretic Lower Bounds

Gradient descent is simple to state but subtle to analyze. Its convergence rate depends on two constants — the smoothness L and the strong convexity μ — and these determine both the optimal step size and the number of iterations to reach a target accuracy. Nesterov's acceleration achieves an O(1/k²) rate for convex functions, provably optimal among all first-order methods by an information-theoretic lower bound.

Concepts

[Interactive figure: convergence rates f(x_k) − f* vs iteration k on a log scale (μ = 0.10, L = 2.0, κ = L/μ = 20.0). Increasing the condition number κ = L/μ shows how poorly-conditioned problems slow gradient descent relative to Nesterov acceleration.]

You've watched gradient descent in practice: sometimes it converges in dozens of steps, sometimes it takes tens of thousands, sometimes it zigzags. The two numbers that fully explain this behavior are the smoothness L (how fast the gradient changes) and the strong convexity μ (how bowl-shaped the function is). Their ratio κ = L/μ is the condition number, and it determines whether gradient descent sprints or crawls. Understanding convergence rates is understanding why some optimization problems are easy and others are not — before you even run the algorithm.

Gradient Descent and the Descent Lemma

The gradient descent update:

x_{k+1} = x_k - \alpha \nabla f(x_k), \quad \alpha > 0.

Descent lemma. If f is L-smooth, then for any x, y:

f(y) \leq f(x) + \nabla f(x)^T(y-x) + \frac{L}{2}\|y-x\|^2.

Applying this to y = x_{k+1} = x_k − α∇f(x_k):

f(x_{k+1}) \leq f(x_k) - \alpha\left(1 - \frac{L\alpha}{2}\right)\|\nabla f(x_k)\|^2.

With step size α = 1/L:

f(x_{k+1}) \leq f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2.

Each step decreases f by at least ‖∇f(x_k)‖²/(2L). This is the workhorse of all convergence proofs. The 1/L step size is not a heuristic — it is the largest step that the descent lemma guarantees will not overshoot. Going larger risks increasing the loss; going smaller is safe but wastes progress. The L-smoothness condition is precisely what lets you set the step size optimally without knowing the loss landscape in advance.
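The sufficient-decrease guarantee can be checked numerically. A minimal sketch, using an illustrative diagonal quadratic (the matrix and starting point are not from the notes):

```python
import numpy as np

# Verify f(x+) <= f(x) - ||grad f(x)||^2 / (2L) at every GD step with alpha = 1/L.
A = np.diag([0.1, 2.0])   # eigenvalues: mu = 0.1, L = 2.0 (illustrative)
L = 2.0

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

x = np.array([3.0, -1.5])
for _ in range(10):
    g = grad(x)
    x_next = x - g / L    # step size alpha = 1/L
    # descent lemma guarantee: decrease by at least ||g||^2 / (2L)
    assert f(x_next) <= f(x) - (g @ g) / (2 * L) + 1e-12
    x = x_next
```

For a quadratic the decrease can be computed exactly, so the inequality holds with room to spare in the low-curvature direction.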

Convergence for L-Smooth Convex Functions

Theorem. For f convex and L-smooth, gradient descent with α = 1/L satisfies:

f(x_k) - f^* \leq \frac{L\|x_0 - x^*\|^2}{2k}.

Rate: O(1/k) — sublinear convergence. To achieve accuracy ε, you need k ≥ L‖x_0 − x*‖²/(2ε) iterations.

Proof sketch. Sum the descent lemma over iterations, use the first-order convexity condition f(x*) ≥ f(x_k) + ∇f(x_k)ᵀ(x* − x_k), and telescope.

Linear Convergence for Strongly Convex Functions

Theorem. For f that is μ-strongly convex and L-smooth, gradient descent with α = 1/L satisfies:

\|x_k - x^*\|^2 \leq \left(1 - \frac{\mu}{L}\right)^k \|x_0 - x^*\|^2.

Rate: O((1 − 1/κ)^k) where κ = L/μ is the condition number. This is linear convergence (exponentially fast). To achieve ε accuracy: k ≥ κ log(‖x_0 − x*‖²/ε).

Proof. Strong convexity gives f(x) − f* ≥ (μ/2)‖x − x*‖², so ‖∇f(x)‖² ≥ 2μ(f(x) − f*). Combined with the descent lemma and μ-strong convexity, one obtains the geometric decay.

Interpretation. The condition number κ = L/μ controls convergence: for κ = 1 (perfectly conditioned), one step suffices. For κ = 1000 (ill-conditioned, e.g., long thin valleys), you need ∼1000 log(1/ε) steps. Preconditioning reduces κ.
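The geometric bound is easy to verify numerically. A sketch on an illustrative diagonal quadratic with μ = 0.1, L = 2.0 (so κ = 20), matching the constants in the figure above:

```python
import numpy as np

# Check ||x_k - x*||^2 <= (1 - mu/L)^k ||x0 - x*||^2 on a diagonal quadratic (x* = 0).
mu, L = 0.1, 2.0
A = np.diag([mu, L])
x = np.array([1.0, 1.0])
x0_sq = float(x @ x)
errs = []
for k in range(1, 51):
    x = x - (A @ x) / L    # GD step with alpha = 1/L
    errs.append(float(x @ x))
    # the bound from the theorem, checked at every iteration
    assert x @ x <= (1 - mu / L) ** k * x0_sq + 1e-12
```

The slow eigendirection (eigenvalue μ) is exactly what makes the bound tight up to squaring; the fast direction is annihilated in one step.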

Nesterov Accelerated Gradient

Nesterov's momentum method (1983). Two sequences:

\begin{aligned} y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k) \\ x_{k+1} &= y_{k+1} + \frac{k-1}{k+2}(y_{k+1} - y_k) \end{aligned}

The second step adds a momentum term: the update "overshoots" using recent history. The momentum coefficient (k − 1)/(k + 2) is precisely calibrated.

Convergence (convex case):

f(x_k) - f^* \leq \frac{2L\|x_0 - x^*\|^2}{(k+1)^2}.

Rate: O(1/k²) — a quadratic improvement over plain GD's O(1/k).

Convergence (strongly convex case): with μ > 0, the convergence rate is (1 − √(μ/L))^k, compared to (1 − μ/L)^k for plain GD. Since √(μ/L) ≫ μ/L when κ ≫ 1, acceleration gives a factor-√κ improvement in iteration count.

Why does momentum help? Nesterov's key insight: gradient descent wastes information — it only uses the gradient at the current point. Momentum accumulates gradient information from previous steps in a way that cancels the "zigzag" behavior in ill-conditioned landscapes. The estimate sequence technique provides the formal proof.
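The two-sequence update above can be compared against plain GD directly. A sketch on an ill-conditioned quadratic (the dimension, κ = 1000, and iteration count are illustrative choices):

```python
import numpy as np

# Plain GD vs Nesterov momentum on f(x) = 0.5 x^T A x with kappa = 1000.
mu, L = 1e-3, 1.0
A = np.diag(np.linspace(mu, L, 50))

def f(x):
    return 0.5 * x @ A @ x

x_gd = np.ones(50)
x_nag = np.ones(50)
y_prev = np.ones(50)
for k in range(500):
    # plain gradient descent
    x_gd = x_gd - (A @ x_gd) / L
    # Nesterov: gradient step from x, then momentum extrapolation
    y = x_nag - (A @ x_nag) / L
    x_nag = y + (k / (k + 3)) * (y - y_prev)   # coefficient (k-1)/(k+2), k counted from 1
    y_prev = y
```

After 500 iterations the slow eigendirections dominate plain GD's error, while the momentum iterate has damped them by orders of magnitude.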

Information-Theoretic Lower Bounds

Theorem (Nemirovski & Yudin, 1983). For any first-order method (using only ∇f evaluations) and any k < n/2, there exists a convex L-smooth function f such that:

f(x_k) - f^* \geq \frac{3L\|x_0-x^*\|^2}{32(k+1)^2}.

This is a minimax lower bound — no first-order method can beat O(1/k²) in the worst case. Nesterov's method achieves this bound, making it optimal.

Similarly for the strongly convex case: no first-order method can beat the rate (1 − √(μ/L))^k. Nesterov's strongly convex variant achieves this.

Proof idea. Construct a "worst-case" function as a tridiagonal quadratic engineered so that, starting from x₀ = 0, each gradient evaluation reveals only one new coordinate. After k gradient steps the iterate lies in the span of the first k coordinate directions, while x* has mass in all n coordinates, leaving an irreducible error.

Subgradient Method for Non-Smooth Functions

When f is not differentiable (e.g., f(x) = ‖x‖₁), replace gradients with subgradients g ∈ ∂f(x):

x_{k+1} = x_k - \alpha_k g_k, \quad g_k \in \partial f(x_k).

The descent lemma no longer holds (subgradient steps can increase f). Instead, track the best iterate x̄_k = argmin_{i ≤ k} f(x_i).

Convergence (convex, G-bounded subgradients, step α_k = R/(G√(k+1))):

f(\bar{x}_k) - f^* \leq \frac{RG}{\sqrt{k+1}}.

Rate: O(1/√k) — sublinear, and significantly slower than smooth GD. The non-smoothness fundamentally limits first-order methods.
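A sketch of the subgradient method on f(x) = ‖x‖₁ in ℝ², using the step size above with R = ‖x₀‖ and G = √2 (the starting point is an illustrative choice; x* = 0 here):

```python
import numpy as np

# Subgradient method on f(x) = ||x||_1 with alpha_k = R / (G * sqrt(k+1)).
x = np.array([2.0, -3.0])
R = float(np.linalg.norm(x))      # x* = 0, so R = ||x0 - x*||
G = np.sqrt(2.0)                  # subgradients of the l1 norm in R^2 have norm <= sqrt(2)
best = float(np.abs(x).sum())     # track the best iterate's value, since steps can increase f
for k in range(2000):
    g = np.sign(x)                # a subgradient of ||.||_1
    x = x - (R / (G * np.sqrt(k + 1))) * g
    best = min(best, float(np.abs(x).sum()))
```

The iterates oscillate around the minimizer with amplitude shrinking like the step size, so only the best-iterate value decreases reliably — exactly why the theorem is stated for x̄_k.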

Stochastic Gradient Descent

SGD replaces the full gradient with a noisy estimate:

x_{k+1} = x_k - \alpha_k \tilde{\nabla} f(x_k), \quad \mathbb{E}[\tilde{\nabla} f(x)] = \nabla f(x).

For finite-sum objectives f(x) = (1/n)∑ᵢ fᵢ(x) (standard in ML), the stochastic gradient is ∇̃f = ∇fᵢ for a uniformly sampled i.

Convergence (convex, G-bounded variance, α_k = C/√k):

\mathbb{E}[f(\bar{x}_k)] - f^* \leq O(G/\sqrt{k}).

Same O(1/√k) rate as the subgradient method, but each step costs O(1) vs O(n) for full GD. Trade-off: SGD is n times cheaper per step but has a worse constant in the convergence bound.

Variance reduction. SGD's noise prevents it from converging faster than O(1/√k) for general stochastic objectives. But for finite sums f = (1/n)∑fᵢ, algorithms like SVRG and SAGA periodically compute full gradients to reduce variance, achieving linear convergence at O((n + κ)log(1/ε)) total gradient complexity — exponentially faster than SGD.
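A sketch of SGD with the α_k = C/√(k+1) schedule on a synthetic finite-sum least-squares problem (data, dimensions, and the constant C are illustrative):

```python
import numpy as np

# SGD on f(x) = (1/n) sum_i 0.5 (a_i^T x - b_i)^2, sampling one term per step.
rng = np.random.default_rng(0)
n, d = 200, 5
Amat = rng.normal(size=(n, d))
b = Amat @ rng.normal(size=d)            # noiseless targets, so f* = 0

def f(x):
    return 0.5 * float(np.mean((Amat @ x - b) ** 2))

x = np.zeros(d)
C = 0.1
for k in range(5000):
    i = rng.integers(n)                  # sample one component uniformly
    g = (Amat[i] @ x - b[i]) * Amat[i]   # gradient of the sampled f_i
    x = x - (C / np.sqrt(k + 1)) * g
```

Each step touches one row of the data (O(1) work) instead of all n rows, which is the whole appeal of SGD at scale.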

The Polyak-Łojasiewicz (PL) Condition

A strictly weaker condition than strong convexity that still gives linear convergence:

\frac{1}{2}\|\nabla f(x)\|^2 \geq \mu(f(x) - f^*) \quad \forall x.

Under PL, gradient descent converges linearly: f(x_k) − f* ≤ (1 − μα)^k (f(x_0) − f*). PL holds for overparameterized networks on training data (when the network can interpolate), explaining why GD finds global minima in practice despite non-convexity.
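A sketch on f(x) = x² + 3 sin²(x), a standard example (from Karimi, Nocedal & Schmidt) of a nonconvex function that satisfies PL; L = 8 is a valid smoothness constant since f″(x) = 2 + 6 cos(2x) lies in [−4, 8]:

```python
import numpy as np

# GD on a nonconvex PL function: the function-value gap still decays linearly.
def f(x):
    return x ** 2 + 3.0 * np.sin(x) ** 2

def grad(x):
    return 2.0 * x + 3.0 * np.sin(2.0 * x)

x = 4.0
gaps = [f(x)]                 # f* = 0 at the global minimum x = 0
for _ in range(200):
    x = x - grad(x) / 8.0     # alpha = 1/L with L = 8
    gaps.append(f(x))
```

Despite non-convexity, the gap is monotonically decreasing (the descent lemma still applies) and GD reaches the global minimum.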

Worked Example

Example 1: Convergence Rate on a Quadratic

Let f(x) = ½xᵀAx for positive definite A with eigenvalues λ₁ ≤ … ≤ λₙ. We have μ = λ₁, L = λₙ, κ = λₙ/λ₁.

Gradient descent with α = 1/λₙ: the error in the eigenbasis direction vᵢ decreases as (1 − λᵢ/λₙ)^k. The slowest direction is v₁ (smallest eigenvalue), giving the overall rate (1 − 1/κ)^k.

For κ = 100: you need ∼100 log(1/ε) iterations to reach accuracy ε. For κ = 10000: ∼10000 log(1/ε). On ImageNet-sized networks the condition number can be in the millions — this is why adaptive methods like Adam (which apply an implicit diagonal preconditioner) are used instead of plain SGD.
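The per-eigendirection decay is easy to see numerically on a diagonal quadratic (the eigenvalues, giving κ = 100, are an illustrative choice):

```python
import numpy as np

# Each coordinate decays as (1 - lambda_i / lambda_max)^k under GD with alpha = 1/lambda_max.
lams = np.array([0.01, 0.1, 1.0])        # eigenvalues, kappa = 100
x = np.ones(3)                           # x* = 0
for _ in range(100):
    x = x - lams * x / lams[-1]          # GD step, elementwise in the eigenbasis
expected = (1.0 - lams / lams[-1]) ** 100
```

The largest-eigenvalue direction is eliminated in one step, while the smallest-eigenvalue direction still retains most of its error after 100 iterations.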

Example 2: The Descent Lemma Proof

Claim: For L-smooth f and step α = 1/L, f(x_{k+1}) ≤ f(x_k) − ‖∇f(x_k)‖²/(2L).

Proof. Set y = x_k − (1/L)∇f(x_k) in the descent lemma:

f(y) \leq f(x_k) + \nabla f(x_k)^T\left(-\frac{1}{L}\nabla f(x_k)\right) + \frac{L}{2}\cdot\frac{1}{L^2}\|\nabla f(x_k)\|^2

= f(x_k) - \frac{1}{L}\|\nabla f(x_k)\|^2 + \frac{1}{2L}\|\nabla f(x_k)\|^2 = f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2.

Each iteration reduces f by at least ‖∇f‖²/(2L). Summing over k = 0, …, K−1, telescoping, and using f(x_K) ≥ f*:

\sum_{k=0}^{K-1}\|\nabla f(x_k)\|^2 \leq 2L(f(x_0) - f^*).

So min_k ‖∇f(x_k)‖² ≤ 2L(f(x_0) − f*)/K, implying convergence to a stationary point.

Example 3: SGD vs GD Sample Complexity

For f = (1/n)∑ᵢ₌₁ⁿ fᵢ with each fᵢ convex and L-smooth, and f μ-strongly convex:

  • GD: needs O(κ log(1/ε)) iterations, each costing O(n) gradient evaluations. Total: O(nκ log(1/ε)).
  • SGD: needs O(1/(με)) iterations, each costing O(1). Total: O(1/(με)).
  • SVRG: needs O((n + κ) log(1/ε)) total gradient evaluations.

For large datasets at moderate-to-high accuracy, variance reduction is the method of choice: SVRG's (n + κ) log(1/ε) beats GD's nκ log(1/ε) outright, and beats SGD's 1/(με) whenever the target accuracy ε is small.
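The SVRG inner/outer loop can be sketched on a synthetic least-squares finite sum (step size, epoch length, and epoch count are illustrative, not tuned):

```python
import numpy as np

# SVRG: periodic full gradients anchor a variance-reduced stochastic gradient.
rng = np.random.default_rng(1)
n, d = 100, 5
Amat = rng.normal(size=(n, d))
b = Amat @ rng.normal(size=d)

def f(x):
    return 0.5 * float(np.mean((Amat @ x - b) ** 2))

def grad_i(x, i):
    return (Amat[i] @ x - b[i]) * Amat[i]

x = np.zeros(d)
eta = 0.01
for epoch in range(50):
    x_snap = x.copy()
    mu_snap = Amat.T @ (Amat @ x_snap - b) / n   # full gradient at the snapshot
    for _ in range(n):                           # inner loop: n cheap stochastic steps
        i = rng.integers(n)
        # variance-reduced gradient: unbiased, with variance shrinking as x -> x_snap -> x*
        v = grad_i(x, i) - grad_i(x_snap, i) + mu_snap
        x = x - eta * v
```

The correction term grad_i(x_snap, i) − mu_snap has zero mean, so v is unbiased like plain SGD, but its variance vanishes as both x and the snapshot approach x*, which is what restores linear convergence.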

Connections

Where Your Intuition Breaks

Nesterov acceleration achieves the optimal O(1/k²) rate — so you might expect it to dominate in deep learning. It doesn't. The issue is that the convergence theory requires a fixed, global Lipschitz constant L. Neural network loss surfaces have wildly varying curvature: the Hessian eigenvalues can differ by a factor of 10⁶ across different regions and directions. The optimal step size 1/L uses the largest curvature to set a safe step, which is extremely conservative in the directions with small curvature. Adaptive methods (Adam, Adagrad) instead use per-parameter step sizes, effectively running a different gradient descent in each coordinate with a step size adapted to that coordinate's observed curvature. They have no convergence theory as clean as Nesterov's, but they are dramatically more effective in practice because they solve the conditioning problem that Nesterov assumes away.

💡Intuition

Why O(1/k²) is a fundamental limit. The lower bound proof constructs a "hard" convex function where each gradient evaluation reveals at most one new dimension of the problem. After k gradient steps from x₀ = 0, the method can only find the optimal value in the first k coordinates — the remaining n − k coordinates contribute irreducible error proportional to 1/k². Nesterov's method is optimal because it "covers" new dimensions as fast as possible — each gradient evaluation is maximally informative. No oracle-based first-order method can escape this bound.

💡Intuition

The condition number is the enemy. The number of iterations of GD scales as O(κ) for strongly convex problems — so a 10× increase in condition number means 10× more iterations. In deep learning, the condition number of the loss Hessian at a trained model can reach 10⁶ or more. This is why: (1) momentum (Nesterov) reduces the dependence to O(√κ), (2) second-order methods (Newton) eliminate κ entirely but at cost O(n³) per step, and (3) adaptive methods (Adam) approximate a preconditioner to reduce the effective κ.

⚠️Warning

SGD's noise floor. Even for strongly convex objectives, SGD with a constant step size does not converge to x* — it oscillates in a ball of radius O(ασ²/μ) around x*, where σ² is the gradient noise variance. Decreasing the step size as α_k = C/k eliminates the noise floor but slows convergence. In practice, learning rate warmup followed by cosine decay is a heuristic that starts large (fast progress) and ends small (tight convergence) without a formal convergence guarantee — yet it empirically outperforms theory-optimal schedules on large neural networks.
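The noise floor is visible in a one-dimensional simulation. A sketch on f(x) = ½x² with additive gradient noise (the noise scale σ = 1 and both step-size schedules are illustrative):

```python
import numpy as np

# Constant-step SGD stalls at a noise floor; a 1/k step keeps shrinking the error.
rng = np.random.default_rng(2)

def noisy_grad(x):
    return x + rng.normal()              # true gradient x, plus noise with sigma = 1

x_const, x_decay = 5.0, 5.0
const_tail, decay_tail = [], []
for k in range(20000):
    x_const -= 0.1 * noisy_grad(x_const)             # constant alpha = 0.1
    x_decay -= (1.0 / (k + 10)) * noisy_grad(x_decay)  # decaying alpha_k ~ C/k
    if k >= 19000:                       # record the last 1000 iterates
        const_tail.append(abs(x_const))
        decay_tail.append(abs(x_decay))
```

The constant-step iterate keeps bouncing at a distance ~ασ²/μ from the optimum no matter how long it runs, while the decaying-step iterate's error keeps shrinking.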
