
Gradient Methods: Convergence Rates & Information-Theoretic Lower Bounds

Gradient descent is simple to state but subtle to analyze. Its convergence rate depends on two constants — the smoothness L and the strong convexity μ — and these determine both the optimal step size and the number of iterations to reach a target accuracy. Nesterov's acceleration achieves an O(1/k²) rate for convex functions, provably optimal among all first-order methods by an information-theoretic lower bound.

Concepts

[Interactive figure: convergence rates f(x_k) − f* vs iteration k on a log scale (μ = 0.10, L = 2.0, κ = L/μ = 20.0). Increasing the condition number κ = L/μ shows how poorly-conditioned problems slow gradient descent relative to Nesterov acceleration.]

You've watched gradient descent in practice: sometimes it converges in dozens of steps, sometimes it takes tens of thousands, sometimes it zigzags. The two numbers that fully explain this behavior are the smoothness L (how fast the gradient changes) and the strong convexity μ (how bowl-shaped the function is). Their ratio κ = L/μ is the condition number, and it determines whether gradient descent sprints or crawls. Understanding convergence rates is understanding why some optimization problems are easy and others are not — before you even run the algorithm.

Gradient Descent and the Descent Lemma

The gradient descent update:

x_{k+1} = x_k - \alpha \nabla f(x_k), \quad \alpha > 0.

Descent lemma. If f is L-smooth, then for any x, y:

f(y) \leq f(x) + \nabla f(x)^T(y-x) + \frac{L}{2}\|y-x\|^2.

Applying this to y = x_{k+1} = x_k − α∇f(x_k):

f(x_{k+1}) \leq f(x_k) - \alpha\left(1 - \frac{L\alpha}{2}\right)\|\nabla f(x_k)\|^2.

With step size α = 1/L:

f(x_{k+1}) \leq f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2.

Each step decreases f by at least ‖∇f(x_k)‖²/(2L). This is the workhorse of all convergence proofs. The 1/L step size is not a heuristic — it is the largest step that the descent lemma guarantees will not overshoot. Going larger risks increasing the loss; going smaller is safe but wastes progress. The L-smoothness condition is precisely what lets you set the step size optimally without knowing the loss landscape in advance.
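The sufficient-decrease guarantee can be checked numerically. A minimal sketch, using an illustrative diagonal quadratic (the matrix and starting point are not from the notes):

```python
import numpy as np

# Verify f(x+) <= f(x) - ||grad f(x)||^2 / (2L) at every GD step with alpha = 1/L.
A = np.diag([0.1, 2.0])   # eigenvalues: mu = 0.1, L = 2.0 (illustrative)
L = 2.0

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

x = np.array([3.0, -1.5])
for _ in range(10):
    g = grad(x)
    x_next = x - g / L    # step size alpha = 1/L
    # descent lemma guarantee: decrease by at least ||g||^2 / (2L)
    assert f(x_next) <= f(x) - (g @ g) / (2 * L) + 1e-12
    x = x_next
```

For a quadratic the decrease can be computed exactly, so the inequality holds with room to spare in the low-curvature direction.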

Convergence for L-Smooth Convex Functions

Theorem. For f convex and L-smooth, gradient descent with α = 1/L satisfies:

f(x_k) - f^* \leq \frac{L\|x_0 - x^*\|^2}{2k}.

Rate: O(1/k) — sublinear convergence. To achieve accuracy ε, you need k ≥ L‖x_0 − x*‖²/(2ε) iterations.

Proof sketch. Sum the descent lemma over iterations, use the first-order convexity condition f(x*) ≥ f(x_k) + ∇f(x_k)ᵀ(x* − x_k), and telescope.

Linear Convergence for Strongly Convex Functions

Theorem. For f that is μ-strongly convex and L-smooth, gradient descent with α = 1/L satisfies:

\|x_k - x^*\|^2 \leq \left(1 - \frac{\mu}{L}\right)^k \|x_0 - x^*\|^2.

Rate: O((1 − 1/κ)^k) where κ = L/μ is the condition number. This is linear convergence (exponentially fast). To achieve ε accuracy: k ≥ κ log(‖x_0 − x*‖²/ε).

Proof. Strong convexity gives f(x) − f* ≥ (μ/2)‖x − x*‖², so ‖∇f(x)‖² ≥ 2μ(f(x) − f*). Combined with the descent lemma and μ-strong convexity, one obtains the geometric decay.

Interpretation. The condition number κ = L/μ controls convergence: for κ = 1 (perfectly conditioned), one step suffices. For κ = 1000 (ill-conditioned, e.g., long thin valleys), you need ∼1000 log(1/ε) steps. Preconditioning reduces κ.
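The geometric bound is easy to verify numerically. A sketch on an illustrative diagonal quadratic with μ = 0.1, L = 2.0 (so κ = 20), matching the constants in the figure above:

```python
import numpy as np

# Check ||x_k - x*||^2 <= (1 - mu/L)^k ||x0 - x*||^2 on a diagonal quadratic (x* = 0).
mu, L = 0.1, 2.0
A = np.diag([mu, L])
x = np.array([1.0, 1.0])
x0_sq = float(x @ x)
errs = []
for k in range(1, 51):
    x = x - (A @ x) / L    # GD step with alpha = 1/L
    errs.append(float(x @ x))
    # the bound from the theorem, checked at every iteration
    assert x @ x <= (1 - mu / L) ** k * x0_sq + 1e-12
```

The slow eigendirection (eigenvalue μ) is exactly what makes the bound tight up to squaring; the fast direction is annihilated in one step.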

Nesterov Accelerated Gradient

Nesterov's momentum method (1983). Two sequences:

\begin{aligned} y_{k+1} &= x_k - \frac{1}{L}\nabla f(x_k) \\ x_{k+1} &= y_{k+1} + \frac{k-1}{k+2}(y_{k+1} - y_k) \end{aligned}

The second step adds a momentum term: the update "overshoots" using recent history. The momentum coefficient (k − 1)/(k + 2) is precisely calibrated.

Convergence (convex case):

f(x_k) - f^* \leq \frac{2L\|x_0 - x^*\|^2}{(k+1)^2}.

Rate: O(1/k²) — a quadratic improvement over plain GD's O(1/k).

Convergence (strongly convex case): with μ > 0, the convergence rate is (1 − √(μ/L))^k, compared to (1 − μ/L)^k for plain GD. Since √(μ/L) ≫ μ/L when κ ≫ 1, acceleration gives a factor-√κ improvement in iteration count.

Why does momentum help? Nesterov's key insight: gradient descent wastes information — it only uses the gradient at the current point. Momentum accumulates gradient information from previous steps in a way that cancels the "zigzag" behavior in ill-conditioned landscapes. The estimate sequence technique provides the formal proof.
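The two-sequence update above can be compared against plain GD directly. A sketch on an ill-conditioned quadratic (the dimension, κ = 1000, and iteration count are illustrative choices):

```python
import numpy as np

# Plain GD vs Nesterov momentum on f(x) = 0.5 x^T A x with kappa = 1000.
mu, L = 1e-3, 1.0
A = np.diag(np.linspace(mu, L, 50))

def f(x):
    return 0.5 * x @ A @ x

x_gd = np.ones(50)
x_nag = np.ones(50)
y_prev = np.ones(50)
for k in range(500):
    # plain gradient descent
    x_gd = x_gd - (A @ x_gd) / L
    # Nesterov: gradient step from x, then momentum extrapolation
    y = x_nag - (A @ x_nag) / L
    x_nag = y + (k / (k + 3)) * (y - y_prev)   # coefficient (k-1)/(k+2), k counted from 1
    y_prev = y
```

After 500 iterations the slow eigendirections dominate plain GD's error, while the momentum iterate has damped them by orders of magnitude.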

Information-Theoretic Lower Bounds

Theorem (Nemirovski & Yudin, 1983). For any first-order method (using only ∇f evaluations) and any k < n/2, there exists a convex L-smooth function f such that:

f(x_k) - f^* \geq \frac{3L\|x_0-x^*\|^2}{32(k+1)^2}.

This is a minimax lower bound — no first-order method can beat O(1/k²) in the worst case. Nesterov's method achieves this bound, making it optimal.

Similarly for the strongly convex case: no first-order method can beat the rate (1 − √(μ/L))^k. Nesterov's strongly convex variant achieves this.

Proof idea. Construct a "worst-case" function as a tridiagonal quadratic engineered so that, starting from x₀ = 0, each gradient evaluation reveals only one new coordinate. After k gradient steps the iterate lies in the span of the first k coordinate directions, while x* has mass in all n coordinates, leaving an irreducible error.

Subgradient Method for Non-Smooth Functions

When f is not differentiable (e.g., f(x) = ‖x‖₁), replace gradients with subgradients g ∈ ∂f(x):

x_{k+1} = x_k - \alpha_k g_k, \quad g_k \in \partial f(x_k).

The descent lemma no longer holds (subgradient steps can increase f). Instead, track the best iterate x̄_k = argmin_{i ≤ k} f(x_i).

Convergence (convex, G-bounded subgradients, step α_k = R/(G√(k+1))):

f(\bar{x}_k) - f^* \leq \frac{RG}{\sqrt{k+1}}.

Rate: O(1/√k) — sublinear, and significantly slower than smooth GD. The non-smoothness fundamentally limits first-order methods.
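A sketch of the subgradient method on f(x) = ‖x‖₁ in ℝ², using the step size above with R = ‖x₀‖ and G = √2 (the starting point is an illustrative choice; x* = 0 here):

```python
import numpy as np

# Subgradient method on f(x) = ||x||_1 with alpha_k = R / (G * sqrt(k+1)).
x = np.array([2.0, -3.0])
R = float(np.linalg.norm(x))      # x* = 0, so R = ||x0 - x*||
G = np.sqrt(2.0)                  # subgradients of the l1 norm in R^2 have norm <= sqrt(2)
best = float(np.abs(x).sum())     # track the best iterate's value, since steps can increase f
for k in range(2000):
    g = np.sign(x)                # a subgradient of ||.||_1
    x = x - (R / (G * np.sqrt(k + 1))) * g
    best = min(best, float(np.abs(x).sum()))
```

The iterates oscillate around the minimizer with amplitude shrinking like the step size, so only the best-iterate value decreases reliably — exactly why the theorem is stated for x̄_k.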

Stochastic Gradient Descent

SGD replaces the full gradient with a noisy estimate:

x_{k+1} = x_k - \alpha_k \tilde{\nabla} f(x_k), \quad \mathbb{E}[\tilde{\nabla} f(x)] = \nabla f(x).

For finite-sum objectives f(x) = (1/n)∑ᵢ fᵢ(x) (standard in ML), the stochastic gradient is ∇̃f = ∇fᵢ for a uniformly sampled i.

Convergence (convex, G-bounded variance, α_k = C/√k):

\mathbb{E}[f(\bar{x}_k)] - f^* \leq O(G/\sqrt{k}).

Same O(1/√k) rate as the subgradient method, but each step costs O(1) vs O(n) for full GD. Trade-off: SGD is n times cheaper per step but has a worse constant in the convergence bound.

Variance reduction. SGD's noise prevents it from converging faster than O(1/√k) for general stochastic objectives. But for finite sums f = (1/n)∑fᵢ, algorithms like SVRG and SAGA periodically compute full gradients to reduce variance, achieving linear convergence at O((n + κ)log(1/ε)) total gradient complexity — exponentially faster than SGD.
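A sketch of SGD with the α_k = C/√(k+1) schedule on a synthetic finite-sum least-squares problem (data, dimensions, and the constant C are illustrative):

```python
import numpy as np

# SGD on f(x) = (1/n) sum_i 0.5 (a_i^T x - b_i)^2, sampling one term per step.
rng = np.random.default_rng(0)
n, d = 200, 5
Amat = rng.normal(size=(n, d))
b = Amat @ rng.normal(size=d)            # noiseless targets, so f* = 0

def f(x):
    return 0.5 * float(np.mean((Amat @ x - b) ** 2))

x = np.zeros(d)
C = 0.1
for k in range(5000):
    i = rng.integers(n)                  # sample one component uniformly
    g = (Amat[i] @ x - b[i]) * Amat[i]   # gradient of the sampled f_i
    x = x - (C / np.sqrt(k + 1)) * g
```

Each step touches one row of the data (O(1) work) instead of all n rows, which is the whole appeal of SGD at scale.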

The Polyak-Łojasiewicz (PL) Condition

A strictly weaker condition than strong convexity that still gives linear convergence:

\frac{1}{2}\|\nabla f(x)\|^2 \geq \mu(f(x) - f^*) \quad \forall x.

Under PL, gradient descent converges linearly: f(x_k) − f* ≤ (1 − μα)^k (f(x_0) − f*). PL holds for overparameterized networks on training data (when the network can interpolate), explaining why GD finds global minima in practice despite non-convexity.
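A sketch on f(x) = x² + 3 sin²(x), a standard example (from Karimi, Nocedal & Schmidt) of a nonconvex function that satisfies PL; L = 8 is a valid smoothness constant since f″(x) = 2 + 6 cos(2x) lies in [−4, 8]:

```python
import numpy as np

# GD on a nonconvex PL function: the function-value gap still decays linearly.
def f(x):
    return x ** 2 + 3.0 * np.sin(x) ** 2

def grad(x):
    return 2.0 * x + 3.0 * np.sin(2.0 * x)

x = 4.0
gaps = [f(x)]                 # f* = 0 at the global minimum x = 0
for _ in range(200):
    x = x - grad(x) / 8.0     # alpha = 1/L with L = 8
    gaps.append(f(x))
```

Despite non-convexity, the gap is monotonically decreasing (the descent lemma still applies) and GD reaches the global minimum.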

Worked Example

Example 1: Convergence Rate on a Quadratic

Let f(x) = ½xᵀAx for positive definite A with eigenvalues λ₁ ≤ … ≤ λₙ. We have μ = λ₁, L = λₙ, κ = λₙ/λ₁.

Gradient descent with α = 1/λₙ: the error in the eigenbasis direction vᵢ decreases as (1 − λᵢ/λₙ)^k. The slowest direction is v₁ (smallest eigenvalue), giving the overall rate (1 − 1/κ)^k.

For κ = 100: you need ∼100 log(1/ε) iterations to reach accuracy ε. For κ = 10000: ∼10000 log(1/ε). On ImageNet-sized networks the condition number can be in the millions — this is why adaptive methods like Adam (which apply an implicit diagonal preconditioner) are used instead of plain SGD.
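The per-eigendirection decay is easy to see numerically on a diagonal quadratic (the eigenvalues, giving κ = 100, are an illustrative choice):

```python
import numpy as np

# Each coordinate decays as (1 - lambda_i / lambda_max)^k under GD with alpha = 1/lambda_max.
lams = np.array([0.01, 0.1, 1.0])        # eigenvalues, kappa = 100
x = np.ones(3)                           # x* = 0
for _ in range(100):
    x = x - lams * x / lams[-1]          # GD step, elementwise in the eigenbasis
expected = (1.0 - lams / lams[-1]) ** 100
```

The largest-eigenvalue direction is eliminated in one step, while the smallest-eigenvalue direction still retains most of its error after 100 iterations.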

Example 2: The Descent Lemma Proof

Claim: For L-smooth f and step α = 1/L, f(x_{k+1}) ≤ f(x_k) − ‖∇f(x_k)‖²/(2L).

Proof. Set y = x_k − (1/L)∇f(x_k) in the descent lemma:

f(y) \leq f(x_k) + \nabla f(x_k)^T\left(-\frac{1}{L}\nabla f(x_k)\right) + \frac{L}{2}\cdot\frac{1}{L^2}\|\nabla f(x_k)\|^2

= f(x_k) - \frac{1}{L}\|\nabla f(x_k)\|^2 + \frac{1}{2L}\|\nabla f(x_k)\|^2 = f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2.

Each iteration reduces f by at least ‖∇f‖²/(2L). Summing over k = 0, …, K−1, telescoping, and using f(x_K) ≥ f*:

\sum_{k=0}^{K-1}\|\nabla f(x_k)\|^2 \leq 2L(f(x_0) - f^*).

So min_k ‖∇f(x_k)‖² ≤ 2L(f(x_0) − f*)/K, implying convergence to a stationary point.

Example 3: SGD vs GD Sample Complexity

For f = (1/n)∑ᵢ₌₁ⁿ fᵢ with each fᵢ convex and L-smooth, and f μ-strongly convex:

  • GD: needs O(κ log(1/ε)) iterations, each costing O(n) gradient evaluations. Total: O(nκ log(1/ε)).
  • SGD: needs O(1/(με)) iterations, each costing O(1). Total: O(1/(με)).
  • SVRG: needs O((n + κ) log(1/ε)) total gradient evaluations.

For large datasets at moderate-to-high accuracy, variance reduction is the method of choice: SVRG's (n + κ) log(1/ε) beats GD's nκ log(1/ε) outright, and beats SGD's 1/(με) whenever the target accuracy ε is small.
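The SVRG inner/outer loop can be sketched on a synthetic least-squares finite sum (step size, epoch length, and epoch count are illustrative, not tuned):

```python
import numpy as np

# SVRG: periodic full gradients anchor a variance-reduced stochastic gradient.
rng = np.random.default_rng(1)
n, d = 100, 5
Amat = rng.normal(size=(n, d))
b = Amat @ rng.normal(size=d)

def f(x):
    return 0.5 * float(np.mean((Amat @ x - b) ** 2))

def grad_i(x, i):
    return (Amat[i] @ x - b[i]) * Amat[i]

x = np.zeros(d)
eta = 0.01
for epoch in range(50):
    x_snap = x.copy()
    mu_snap = Amat.T @ (Amat @ x_snap - b) / n   # full gradient at the snapshot
    for _ in range(n):                           # inner loop: n cheap stochastic steps
        i = rng.integers(n)
        # variance-reduced gradient: unbiased, with variance shrinking as x -> x_snap -> x*
        v = grad_i(x, i) - grad_i(x_snap, i) + mu_snap
        x = x - eta * v
```

The correction term grad_i(x_snap, i) − mu_snap has zero mean, so v is unbiased like plain SGD, but its variance vanishes as both x and the snapshot approach x*, which is what restores linear convergence.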

Connections

Where Your Intuition Breaks

Nesterov acceleration achieves the optimal O(1/k²) rate — so you might expect it to dominate in deep learning. It doesn't. The issue is that the convergence theory requires a fixed, global Lipschitz constant L. Neural network loss surfaces have wildly varying curvature: the Hessian eigenvalues can differ by a factor of 10⁶ across different regions and directions. The optimal step size 1/L uses the largest curvature to set a safe step, which is extremely conservative in the directions with small curvature. Adaptive methods (Adam, Adagrad) instead use per-parameter step sizes, effectively running a different gradient descent in each coordinate with a step size adapted to that coordinate's observed curvature. They have no convergence theory as clean as Nesterov's, but they are dramatically more effective in practice because they solve the conditioning problem that Nesterov assumes away.

💡Intuition

Why O(1/k²) is a fundamental limit. The lower bound proof constructs a "hard" convex function where each gradient evaluation reveals at most one new dimension of the problem. After k gradient steps from x₀ = 0, the method can only find the optimal value in the first k coordinates — the remaining n − k coordinates contribute irreducible error proportional to 1/k². Nesterov's method is optimal because it "covers" new dimensions as fast as possible — each gradient evaluation is maximally informative. No oracle-based first-order method can escape this bound.

💡Intuition

The condition number is the enemy. The number of iterations of GD scales as O(κ) for strongly convex problems — so a 10× increase in condition number means 10× more iterations. In deep learning, the condition number of the loss Hessian at a trained model can reach 10⁶ or more. This is why: (1) momentum (Nesterov) reduces the dependence to O(√κ), (2) second-order methods (Newton) eliminate κ entirely but at cost O(n³) per step, and (3) adaptive methods (Adam) approximate a preconditioner to reduce the effective κ.

⚠️Warning

SGD's noise floor. Even for strongly convex objectives, SGD with a constant step size does not converge to x* — it oscillates in a ball of radius O(ασ²/μ) around x*, where σ² is the gradient noise variance. Decreasing the step size as α_k = C/k eliminates the noise floor but slows convergence. In practice, learning rate warmup followed by cosine decay is a heuristic that starts large (fast progress) and ends small (tight convergence) without a formal convergence guarantee — yet it empirically outperforms theory-optimal schedules on large neural networks.
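The noise floor is visible in a one-dimensional simulation. A sketch on f(x) = ½x² with additive gradient noise (the noise scale σ = 1 and both step-size schedules are illustrative):

```python
import numpy as np

# Constant-step SGD stalls at a noise floor; a 1/k step keeps shrinking the error.
rng = np.random.default_rng(2)

def noisy_grad(x):
    return x + rng.normal()              # true gradient x, plus noise with sigma = 1

x_const, x_decay = 5.0, 5.0
const_tail, decay_tail = [], []
for k in range(20000):
    x_const -= 0.1 * noisy_grad(x_const)             # constant alpha = 0.1
    x_decay -= (1.0 / (k + 10)) * noisy_grad(x_decay)  # decaying alpha_k ~ C/k
    if k >= 19000:                       # record the last 1000 iterates
        const_tail.append(abs(x_const))
        decay_tail.append(abs(x_decay))
```

The constant-step iterate keeps bouncing at a distance ~ασ²/μ from the optimum no matter how long it runs, while the decaying-step iterate's error keeps shrinking.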
