Gradient Methods: Convergence Rates & Information-Theoretic Lower Bounds
Gradient descent is simple to state but subtle to analyze. Its convergence rate depends on two constants, the smoothness $L$ and the strong convexity $\mu$, and these determine both the optimal step size and the number of iterations to reach a target accuracy. Nesterov's acceleration achieves an $O(1/k^2)$ rate for convex functions, provably optimal among all first-order methods by an information-theoretic lower bound.
Concepts
You've watched gradient descent in practice: sometimes it converges in dozens of steps, sometimes it takes tens of thousands, sometimes it zigzags. The two numbers that fully explain this behavior are the smoothness $L$ (how fast the gradient changes) and the strong convexity $\mu$ (how bowl-shaped the function is). Their ratio $\kappa = L/\mu$ is the condition number, and it determines whether gradient descent sprints or crawls. Understanding convergence rates is understanding why some optimization problems are easy and others are not, before you even run the algorithm.
Gradient Descent and the Descent Lemma
The gradient descent update:

$$x_{k+1} = x_k - \eta \nabla f(x_k)$$
Descent lemma. If $f$ is $L$-smooth, then for any $x, y$:

$$f(y) \le f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2}\|y - x\|^2$$
Applying this to $y = x_{k+1} = x_k - \eta \nabla f(x_k)$:

$$f(x_{k+1}) \le f(x_k) - \eta\left(1 - \frac{L\eta}{2}\right)\|\nabla f(x_k)\|^2$$
With step size $\eta = 1/L$:

$$f(x_{k+1}) \le f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2$$
Each step decreases $f$ by at least $\frac{1}{2L}\|\nabla f(x_k)\|^2$. This is the workhorse of all convergence proofs. The step size $\eta = 1/L$ is not a heuristic: it is the largest step that the descent lemma guarantees will not overshoot. Going larger risks increasing the loss; going smaller is safe but wastes progress. The $L$-smoothness condition is precisely what lets you set the step size optimally without knowing the loss landscape in advance.
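The descent-lemma guarantee is easy to check numerically. A minimal sketch on a toy 2-D quadratic (the matrix and constants below are illustrative, not from the text): with $\eta = 1/L$ one step decreases $f$ by at least $\|\nabla f(x)\|^2 / 2L$, while a step beyond $2/L$ increases the loss.

```python
import numpy as np

# Toy quadratic f(x) = 0.5 * x^T A x with diagonal A; its gradient is A x
# and its smoothness constant L is the largest eigenvalue of A.
A = np.diag([1.0, 10.0])
L = 10.0

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

x = np.array([1.0, 1.0])
g = grad(x)

# Step eta = 1/L: decrease of at least ||g||^2 / (2L), per the descent lemma.
x_safe = x - (1.0 / L) * g
guaranteed_drop = (g @ g) / (2 * L)

# Step eta = 2.5/L, beyond the 2/L stability threshold: the loss goes up,
# and repeating such steps would diverge along the steep eigendirection.
x_big = x - (2.5 / L) * g
```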
Convergence for $L$-Smooth Convex Functions
Theorem. For $f$ convex and $L$-smooth, gradient descent with $\eta = 1/L$ satisfies:

$$f(x_k) - f(x^*) \le \frac{L\|x_0 - x^*\|^2}{2k}$$
Rate: $O(1/k)$, sublinear convergence. To achieve accuracy $\epsilon$, need $O(1/\epsilon)$ iterations.
Proof sketch. Sum the descent lemma over iterations, use the first-order convexity condition $f(x^*) \ge f(x_k) + \nabla f(x_k)^\top (x^* - x_k)$, and telescope.
Linear Convergence for Strongly Convex Functions
Theorem. For $f$ that is $\mu$-strongly convex and $L$-smooth, gradient descent with $\eta = 1/L$ satisfies:

$$f(x_k) - f^* \le \left(1 - \frac{\mu}{L}\right)^k (f(x_0) - f^*)$$
Rate: $(1 - 1/\kappa)^k$ where $\kappa = L/\mu$ is the condition number. This is linear convergence (exponentially fast). To achieve accuracy $\epsilon$: $O(\kappa \log(1/\epsilon))$ iterations.
Proof. Strong convexity gives $f(x) - f^* \le \frac{1}{2\mu}\|\nabla f(x)\|^2$, so $\|\nabla f(x_k)\|^2 \ge 2\mu(f(x_k) - f^*)$. Combined with the descent lemma's per-step decrease of $\frac{1}{2L}\|\nabla f(x_k)\|^2$, one obtains the geometric decay $f(x_{k+1}) - f^* \le (1 - \mu/L)(f(x_k) - f^*)$.
Interpretation. The condition number $\kappa$ controls convergence: for $\kappa = 1$ (perfectly conditioned), one step suffices. For $\kappa \gg 1$ (ill-conditioned, e.g., long thin valleys), need $O(\kappa)$ steps. Preconditioning reduces $\kappa$.
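The effect of conditioning can be seen concretely. A small sketch (synthetic 2-D quadratic with illustrative values): plain GD on a $\kappa = 100$ problem is still far from the optimum after 100 steps, while a single step with the exact preconditioner $P = A$, which drives the effective condition number to 1, lands on it.

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T A x with kappa = 100.
A = np.diag([1.0, 100.0])
x0 = np.array([1.0, 1.0])

def run_gd(P_inv, steps, eta):
    x = x0.copy()
    for _ in range(steps):
        x = x - eta * P_inv @ (A @ x)   # (preconditioned) gradient step
    return np.linalg.norm(x)            # distance to the optimum x* = 0

# Plain GD with eta = 1/L: the slow eigendirection contracts by 0.99/step.
plain_err = run_gd(np.eye(2), steps=100, eta=1.0 / 100.0)

# Exact preconditioner P = A (so P_inv = A^{-1}): one Newton-like step.
precond_err = run_gd(np.diag([1.0, 0.01]), steps=1, eta=1.0)
```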
Nesterov Accelerated Gradient
Nesterov's momentum method (1983). Two sequences:

$$y_{k+1} = x_k - \frac{1}{L}\nabla f(x_k), \qquad x_{k+1} = y_{k+1} + \frac{k-1}{k+2}(y_{k+1} - y_k)$$
The second step adds a momentum term: the update "overshoots" $y_{k+1}$ using recent history. The momentum coefficient $\frac{k-1}{k+2}$ is precisely calibrated.
Convergence (convex case):

$$f(y_k) - f^* \le \frac{2L\|x_0 - x^*\|^2}{(k+1)^2}$$
Rate: $O(1/k^2)$, a quadratic improvement over plain GD's $O(1/k)$.
Convergence (strongly convex case): with momentum coefficient $\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$, convergence rate $(1 - 1/\sqrt{\kappa})^k$, compared to $(1 - 1/\kappa)^k$ for plain GD. Since $\sqrt{\kappa} \ll \kappa$ when $\kappa \gg 1$, acceleration gives a $\sqrt{\kappa}$ factor improvement in iteration count.
Why does momentum help? Nesterov's key insight: gradient descent wastes information — it only uses the gradient at the current point. Momentum accumulates gradient information from previous steps in a way that cancels the "zigzag" behavior in ill-conditioned landscapes. The estimate sequence technique provides the formal proof.
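A quick numerical comparison on a synthetic ill-conditioned quadratic ($\kappa = 100$; the eigenvalue spread and iteration count are illustrative). The constant momentum $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$ is the strongly convex variant described above; after the same number of steps it is far ahead of plain GD.

```python
import numpy as np

A = np.diag(np.linspace(1.0, 100.0, 50))   # eigenvalues 1..100, so kappa = 100
L = 100.0
x0 = np.ones(50)

def f(x):
    return 0.5 * x @ A @ x

def gd(steps):
    x = x0.copy()
    for _ in range(steps):
        x = x - (1.0 / L) * (A @ x)
    return f(x)

def nag(steps):
    beta = (10.0 - 1.0) / (10.0 + 1.0)     # (sqrt(kappa) - 1) / (sqrt(kappa) + 1)
    x, y_prev = x0.copy(), x0.copy()
    for _ in range(steps):
        y = x - (1.0 / L) * (A @ x)        # gradient step
        x = y + beta * (y - y_prev)        # momentum extrapolation
        y_prev = y
    return f(y_prev)

gd_val, nag_val = gd(200), nag(200)
```

The zigzag cancellation is visible in the numbers: the slow eigendirection contracts like $(1 - 1/\kappa)$ per GD step but roughly $(1 - 1/\sqrt{\kappa})$ per accelerated step.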
Information-Theoretic Lower Bounds
Theorem (Nemirovski & Yudin, 1983). For any first-order method (using only gradient and function evaluations) and any $k \le (d-1)/2$, there exists a convex $L$-smooth function $f$ on $\mathbb{R}^d$ such that:

$$f(x_k) - f^* \ge \frac{3L\|x_0 - x^*\|^2}{32(k+1)^2}$$
This is a minimax lower bound: no first-order method can beat $O(1/k^2)$ in the worst case. Nesterov's method achieves this bound, making it optimal.
Similarly for strongly convex: no first-order method can beat the rate $\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{2k}$ on $\|x_k - x^*\|^2$. Nesterov's strongly convex method achieves this rate up to constants.
Proof idea. Construct a "worst-case" function as a tridiagonal quadratic whose gradient at $x$ depends only on the first nonzero coordinates of $x$. Starting from $x_0 = 0$, after $k$ gradient steps only the first $k$ coordinates of $x^*$ can be found, leaving an irreducible error in the remaining coordinates.
Subgradient Method for Non-Smooth Functions
When $f$ is not differentiable (e.g., $f(x) = \|x\|_1$), replace gradients with subgradients $g_k \in \partial f(x_k)$:

$$x_{k+1} = x_k - \eta_k g_k$$
The descent lemma no longer holds (subgradient steps can increase $f$). Instead, track the best iterate $f_k^{\text{best}} = \min_{i \le k} f(x_i)$.
Convergence (convex, $G$-bounded subgradients, step $\eta_k = \frac{R}{G\sqrt{k}}$ with $R = \|x_0 - x^*\|$):

$$f_k^{\text{best}} - f^* \le O\!\left(\frac{GR}{\sqrt{k}}\right)$$
Rate: $O(1/\sqrt{k})$, sublinear and significantly slower than smooth GD; accuracy $\epsilon$ requires $O(1/\epsilon^2)$ iterations. The non-smoothness fundamentally limits first-order methods.
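A sketch of the subgradient method on the non-smooth $f(x) = \|x\|_1$ (the starting point and step schedule are illustrative). The iterates oscillate around the kink at the optimum with amplitude roughly the current step size, so we track the best value seen:

```python
import numpy as np

def f(x):
    return np.sum(np.abs(x))           # non-smooth wherever a coordinate is 0

x = np.array([3.0, -2.0])
best = f(x)                            # best objective value seen so far
for k in range(1, 2001):
    g = np.sign(x)                     # a valid subgradient of ||.||_1 at x
    x = x - (1.0 / np.sqrt(k)) * g     # classic 1/sqrt(k) step size
    best = min(best, f(x))             # individual steps may increase f
```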
Stochastic Gradient Descent
SGD replaces the full gradient with a noisy estimate:

$$x_{k+1} = x_k - \eta_k \hat{g}_k, \qquad \mathbb{E}[\hat{g}_k] = \nabla f(x_k)$$
For finite-sum objectives $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$ (standard in ML), the stochastic gradient is $\hat{g}_k = \nabla f_i(x_k)$ for uniformly sampled $i \in \{1, \dots, n\}$.
Convergence (convex, $\sigma^2$-bounded variance, $\eta_k \propto 1/\sqrt{k}$), for the averaged iterate $\bar{x}_k$:

$$\mathbb{E}[f(\bar{x}_k)] - f^* \le O\!\left(\frac{1}{\sqrt{k}}\right)$$
Same rate as the subgradient method, but each step costs $O(1)$ vs $O(n)$ for full GD. Trade-off: SGD is $n$ times cheaper per step but has a worse constant in the convergence bound.
Variance reduction. SGD's noise prevents it from converging faster than $O(1/\sqrt{k})$ for general stochastic objectives. But for finite-sum $f = \frac{1}{n}\sum_i f_i$, algorithms like SVRG and SAGA periodically compute full gradients to reduce variance, achieving linear convergence at $O((n + \kappa)\log(1/\epsilon))$ total gradient complexity, exponentially faster in $\epsilon$ than SGD.
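A minimal SVRG sketch on a synthetic least-squares problem (the data, step size, and epoch counts are illustrative assumptions). The inner update uses $\nabla f_i(x) - \nabla f_i(\tilde{x}) + \nabla f(\tilde{x})$ for a snapshot $\tilde{x}$: still unbiased, but with variance that vanishes as both points approach the optimum, which is what restores linear convergence.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d)) / np.sqrt(d)        # row norms ~ 1 keeps per-sample L_i moderate
b = rng.normal(size=n)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # exact optimum of the average loss

def grad_i(x, i):                   # gradient of the i-th term 0.5*(a_i.x - b_i)^2
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    return A.T @ (A @ x - b) / n

eta = 0.1
x = np.zeros(d)
for epoch in range(60):
    snap, g_snap = x.copy(), full_grad(x)       # snapshot and its full gradient
    for _ in range(n):
        i = rng.integers(n)
        v = grad_i(x, i) - grad_i(snap, i) + g_snap   # variance-reduced estimate
        x = x - eta * v
svrg_err = np.linalg.norm(x - x_star)
```

Note that plain SGD with this constant step size would stall at a noise floor; the control variate is what lets the constant step converge.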
The Polyak-Łojasiewicz (PL) Condition
A strictly weaker condition than strong convexity that still gives linear convergence:

$$\frac{1}{2}\|\nabla f(x)\|^2 \ge \mu \,(f(x) - f^*) \quad \text{for all } x$$
Under PL, gradient descent with $\eta = 1/L$ converges linearly: $f(x_k) - f^* \le (1 - \mu/L)^k (f(x_0) - f^*)$. PL holds for overparameterized networks on training data (when the network can interpolate), explaining why GD finds global minima in practice despite non-convexity.
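A sketch of a function that is PL but not strongly convex (example chosen for illustration): $f(x, y) = \frac{1}{2}(x+y)^2$ has a singular Hessian, with a flat valley along $x + y = 0$, yet it satisfies the PL inequality with $\mu = 2$, and GD still drives $f$ down geometrically. A deliberately small step $\eta = 1/(2L)$ is used so the decay is visible over several iterations rather than exact in one step.

```python
import numpy as np

def f(v):
    return 0.5 * (v[0] + v[1]) ** 2    # minimized on the whole line x + y = 0

def grad(v):
    s = v[0] + v[1]
    return np.array([s, s])

L = 2.0                                # largest Hessian eigenvalue
mu_pl = 2.0                            # PL constant (verified in the loop); f* = 0
v = np.array([3.0, -1.0])
vals = [f(v)]
for _ in range(20):
    g = grad(v)
    assert 0.5 * (g @ g) >= mu_pl * f(v) - 1e-9   # PL inequality holds at every iterate
    v = v - (1.0 / (2 * L)) * g        # small step so the geometric decay is visible
    vals.append(f(v))
```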
Worked Example
Example 1: Convergence Rate on a Quadratic
Let $f(x) = \frac{1}{2}x^\top A x$ for $A$ positive definite with eigenvalues $\lambda_1 \ge \cdots \ge \lambda_d > 0$. We have $L = \lambda_1$, $\mu = \lambda_d$, $\kappa = \lambda_1/\lambda_d$.
Gradient descent with $\eta = 1/L$: the error in eigenbasis direction $i$ decreases as $(1 - \lambda_i/\lambda_1)^k$. The slowest direction is $\lambda_d$ (smallest eigenvalue), giving overall rate $(1 - 1/\kappa)^k$.
To reach accuracy $\epsilon$, need roughly $\kappa \ln(1/\epsilon)$ iterations: for small $\kappa$, a few dozen; for $\kappa = 10^6$, millions. On ImageNet-sized networks, the condition number can be in the millions; this is why Adam (approximate natural gradient with implicit diagonal preconditioning) is used instead of plain SGD.
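The eigenbasis claim can be verified directly. A sketch on a diagonal quadratic (illustrative eigenvalues): each eigen-coordinate contracts by exactly $(1 - \lambda_i/L)$ per step, so after $k$ steps the iterate equals the predicted per-direction factors.

```python
import numpy as np

lam = np.array([1.0, 4.0, 10.0])    # eigenvalues of diagonal A; L = 10, kappa = 10
L = lam.max()
k = 50
x = np.ones(3)
for _ in range(k):
    x = x - (1.0 / L) * lam * x     # gradient step on f(x) = 0.5 * sum(lam * x**2)
predicted = (1.0 - lam / L) ** k    # per-eigendirection contraction factors
```

The slowest coordinate (eigenvalue 1) still carries error $0.9^{50}$, while the stiffest (eigenvalue 10) is annihilated in one step: the condition number story in miniature.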
Example 2: The Descent Lemma Proof
Claim: For $f$ $L$-smooth (not necessarily convex) and step $\eta = 1/L$, $\min_{k < K}\|\nabla f(x_k)\|^2 \le \frac{2L(f(x_0) - f^*)}{K}$.
Proof. Set $y = x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k)$ in the descent lemma:

$$f(x_{k+1}) \le f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2$$
Each iteration reduces $f$ by at least $\frac{1}{2L}\|\nabla f(x_k)\|^2$. Summing over $k = 0, \dots, K-1$ and using $f(x_K) \ge f^*$:

$$\sum_{k=0}^{K-1}\|\nabla f(x_k)\|^2 \le 2L\,(f(x_0) - f^*)$$
So $\min_{k < K}\|\nabla f(x_k)\|^2 \le \frac{2L(f(x_0) - f^*)}{K} \to 0$, implying convergence to a stationary point.
Example 3: SGD vs GD Sample Complexity
For $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$ with each $f_i$ convex and $L$-smooth, and $f$ strongly convex with parameter $\mu$:
- GD: Needs $O(\kappa \log(1/\epsilon))$ iterations, each costing $n$ gradient evaluations. Total: $O(n\kappa \log(1/\epsilon))$.
- SGD: Needs $O(1/(\mu\epsilon))$ iterations, each costing $O(1)$. Total: $O(1/(\mu\epsilon))$.
- SVRG: Needs $O((n + \kappa)\log(1/\epsilon))$ total gradient evaluations.
SVRG dominates when $(n + \kappa)\log(1/\epsilon) \ll 1/(\mu\epsilon)$: for large datasets (many examples, well-conditioned problem), variance reduction is the method of choice.
Connections
Where Your Intuition Breaks
Nesterov acceleration achieves the optimal $O(1/k^2)$ rate, so you might expect it to dominate in deep learning. It doesn't. The issue is that the convergence theory requires a fixed, global Lipschitz constant $L$. Neural network loss surfaces have wildly varying curvature: the Hessian eigenvalues can differ by many orders of magnitude across different regions and directions. The optimal step size $\eta = 1/L$ uses the largest curvature to set a safe step, which is extremely conservative in the directions with small curvature. Adaptive methods (Adam, Adagrad) instead use per-parameter step sizes, effectively running a different gradient descent in each coordinate with a step size adapted to that coordinate's observed curvature. They have no convergence theory as clean as Nesterov's, but they are dramatically more effective in practice because they solve the conditioning problem that Nesterov assumes away.
Why $O(1/k^2)$ is a fundamental limit. The lower bound proof constructs a "hard" convex function where each gradient evaluation reveals at most one new dimension of the problem. After $k$ gradient steps from $x_0 = 0$, the method can only find the optimal value in the first $k$ coordinates; the remaining coordinates contribute irreducible error proportional to $1/k^2$. Nesterov's method is optimal because it "covers" new dimensions as fast as physically possible: each gradient evaluation is maximally informative. No oracle-based first-order method can escape this bound.
The condition number is the enemy. The number of iterations of GD scales as $\kappa$ for strongly convex problems, so a $10\times$ increase in condition number means $10\times$ more iterations. In deep learning, the condition number of the loss Hessian at a trained model can reach $10^6$ or more. This is why: (1) momentum (Nesterov) reduces the dependence to $\sqrt{\kappa}$, (2) second-order methods (Newton) eliminate $\kappa$ entirely but at $O(d^3)$ cost per step, and (3) adaptive methods (Adam) approximate a diagonal preconditioner to reduce the effective $\kappa$.
SGD's noise floor. Even for strongly convex objectives, SGD with a constant step size $\eta$ does not converge to $x^*$; it oscillates in a ball of radius $O(\sqrt{\eta\sigma^2/\mu})$ around $x^*$, where $\sigma^2$ is the gradient noise variance. Decreasing the step size as $\eta_k \propto 1/k$ eliminates the noise floor but slows convergence. In practice, learning rate warmup followed by cosine decay is a heuristic that starts large (fast progress) and ends small (tight convergence) without a formal convergence guarantee, yet empirically it outperforms theory-optimal schedules on large neural networks.
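The noise floor is easy to reproduce on a 1-D toy problem (noise level, step sizes, and iteration counts below are illustrative): with a constant step the error stalls at a noise-dependent radius, while an $O(1/k)$ decaying step keeps shrinking it.

```python
import numpy as np

# SGD on f(x) = 0.5 * x^2 (true gradient is x) with additive unit-variance
# gradient noise, comparing a constant step size to a 1/k decay.
rng = np.random.default_rng(1)

def run_sgd(step_fn, steps=5000):
    x = 5.0
    for k in range(1, steps + 1):
        g = x + rng.normal()          # noisy gradient, sigma^2 = 1
        x -= step_fn(k) * g
    return abs(x)                     # final distance to x* = 0

# Average over repeated runs to smooth out the randomness.
const_err = np.mean([run_sgd(lambda k: 0.1) for _ in range(20)])
decay_err = np.mean([run_sgd(lambda k: 1.0 / (k + 10)) for _ in range(20)])
```

With $\eta = 0.1$ and $\sigma^2 = \mu = 1$, the predicted stall radius is on the order of $\sqrt{\eta\sigma^2/\mu} \approx 0.3$, while the decaying schedule's error keeps falling with the iteration budget.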