Iterative Solvers: Conjugate Gradient & Krylov Methods
When $n$ is large (millions of variables), direct solvers require $O(n^3)$ time and $O(n^2)$ memory — infeasible. Krylov methods solve $Ax = b$ using only matrix-vector products $v \mapsto Av$, never storing $A$ explicitly. Conjugate gradient is optimal for SPD systems; GMRES handles nonsymmetric problems. Preconditioning is the key to practical performance.
Concepts
For a sparse matrix $A \in \mathbb{R}^{n \times n}$ — common in physics simulations, graph problems, and kernel methods — direct solvers require $O(n^3)$ operations and $O(n^2)$ memory. Krylov methods sidestep this: they solve $Ax = b$ using only matrix-vector products $v \mapsto Av$, never accessing $A$ explicitly. After $k$ products, the iterate lives in a $k$-dimensional subspace, and the method finds the best approximation there.
Krylov Subspaces
The Krylov subspace generated by $A$ and initial residual $r_0 = b - Ax_0$ is:

$$\mathcal{K}_k(A, r_0) = \mathrm{span}\{r_0, Ar_0, A^2 r_0, \dots, A^{k-1} r_0\}$$
Krylov methods compute iterates $x_k \in x_0 + \mathcal{K}_k(A, r_0)$ satisfying an optimality condition over the Krylov subspace. Each iteration requires one matrix-vector product with $A$.
Why this works: the Cayley-Hamilton theorem states $p(A) = 0$ for the characteristic polynomial $p$ of degree $n$. So $A^{-1}$ is a polynomial in $A$ of degree at most $n-1$, meaning $x^* = A^{-1}b \in x_0 + \mathcal{K}_n(A, r_0)$. The Krylov subspace contains the exact solution at iteration $n$; the method finds the best approximation within $\mathcal{K}_k$ for $k < n$.
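A tiny numerical check of this claim (an illustrative sketch, not part of the original notes): build the full Krylov basis for a small SPD system and verify that the exact solution lies in $\mathcal{K}_n(A, r_0)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
# Random SPD matrix and right-hand side
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)
b = rng.standard_normal(n)

# Krylov basis {b, Ab, ..., A^{n-1} b} (taking x0 = 0, so r0 = b)
K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(n)])

# Find x = K c in the Krylov subspace with A x = b
c, *_ = np.linalg.lstsq(A @ K, b, rcond=None)
x_krylov = K @ c
x_exact = np.linalg.solve(A, b)
print(np.allclose(x_krylov, x_exact))  # the exact solution lies in K_n(A, b)
```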
The polynomial interpretation is why Krylov methods converge fast when eigenvalues cluster: the CG error at step $k$ is $e_k = p_k(A)\,e_0$ for some polynomial $p_k$ of degree $k$ with $p_k(0) = 1$, and CG finds the $p_k$ that is smallest on the spectrum of $A$. When eigenvalues are few or tightly clustered, a low-degree polynomial can be small on the entire spectrum — making $k \ll n$ sufficient for convergence.
Conjugate Gradient (CG)
For symmetric positive definite $A$, the CG method minimizes the $A$-norm error over the Krylov subspace:

$$x_k = \arg\min_{x \in x_0 + \mathcal{K}_k(A, r_0)} \|x - x^*\|_A, \qquad \|e\|_A = \sqrt{e^\top A e}$$
Algorithm (starting from $x_0$, $r_0 = b - Ax_0$, $p_0 = r_0$):
- $\alpha_k = \dfrac{r_k^\top r_k}{p_k^\top A p_k}$ (step size)
- $x_{k+1} = x_k + \alpha_k p_k$ (update)
- $r_{k+1} = r_k - \alpha_k A p_k$ (residual update)
- $\beta_k = \dfrac{r_{k+1}^\top r_{k+1}}{r_k^\top r_k}$ (momentum)
- $p_{k+1} = r_{k+1} + \beta_k p_k$ (new search direction)
Each iteration: 1 matrix-vector product, 2 dot products, 3 vector updates — $O(\mathrm{nnz}(A))$ operations for sparse $A$.
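The five updates above can be written as a short matrix-free routine (a minimal sketch, not a production solver — the function name and test matrix are illustrative):

```python
import numpy as np

def conjugate_gradient(A_mv, b, tol=1e-10, maxiter=None):
    """Plain CG for SPD systems. A_mv(v) returns A @ v (matrix-free)."""
    n = b.shape[0]
    maxiter = maxiter or n
    x = np.zeros(n)
    r = b.copy()                     # r0 = b - A x0 with x0 = 0
    p = r.copy()                     # first search direction
    rs = r @ r
    for _ in range(maxiter):
        Ap = A_mv(p)
        alpha = rs / (p @ Ap)        # step size
        x += alpha * p               # iterate update
        r -= alpha * Ap              # residual update
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p    # beta = rs_new / rs, new direction
        rs = rs_new
    return x

# Sanity check on a small, well-conditioned SPD system
rng = np.random.default_rng(1)
M = rng.standard_normal((50, 50))
A = M @ M.T + 50 * np.eye(50)
b = rng.standard_normal(50)
x = conjugate_gradient(lambda v: A @ v, b)
print(np.linalg.norm(A @ x - b))     # tiny residual
```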
Convergence rate: the error satisfies:

$$\|x_k - x^*\|_A \le 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^k \|x_0 - x^*\|_A$$

where $\kappa = \lambda_{\max}(A)/\lambda_{\min}(A)$. For tolerance $\epsilon$: $O(\sqrt{\kappa}\,\log(1/\epsilon))$ iterations — better than gradient descent's $O(\kappa\,\log(1/\epsilon))$ by the square root.
Finite termination: in exact arithmetic, CG terminates in at most $n$ steps (not in practice due to floating point).
Preconditioning
Preconditioned CG: solve $M^{-1}Ax = M^{-1}b$ where $M \approx A$ is easy to invert. The convergence depends on $\kappa(M^{-1}A)$ instead of $\kappa(A)$.
Common preconditioners:
| Preconditioner | Description | Quality |
|---|---|---|
| Jacobi (diagonal) | $M = \mathrm{diag}(A)$ | Simple; poor for general systems |
| Incomplete Cholesky (IC) | Cholesky restricted to the sparsity pattern of $A$ | Good for SPD systems |
| Algebraic multigrid (AMG) | Hierarchy of coarse grids | Excellent for PDEs |
| Block diagonal | $M = \mathrm{diag}(A_{11}, A_{22}, \dots)$ | Good for block structure |
Perfect preconditioner: $M = A$ gives $\kappa(M^{-1}A) = 1$ and CG converges in 1 step — but inverting $A$ is the original problem. Practical preconditioners trade accuracy for cheapness.
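As an illustration of this trade-off, here is a sketch using SciPy's `cg` with a Jacobi preconditioner on a system whose diagonal spans several orders of magnitude (the matrix and sizes are made up for the demo):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg, LinearOperator

# SPD tridiagonal system whose diagonal spans six orders of magnitude,
# so kappa(A) is huge but diag(A) captures most of the ill-conditioning.
n = 500
d = np.logspace(0, 6, n)
off = -0.1 * np.ones(n - 1)
A = diags([d, off, off], [0, 1, -1]).tocsc()
b = np.ones(n)

counts = {"plain": 0, "jacobi": 0}

def counter(key):
    def cb(xk):
        counts[key] += 1
    return cb

x_plain, _ = cg(A, b, maxiter=5000, callback=counter("plain"))

# Jacobi preconditioner: applying M^{-1} is just a diagonal scaling.
M = LinearOperator((n, n), matvec=lambda v: v / d)
x_prec, _ = cg(A, b, M=M, maxiter=5000, callback=counter("jacobi"))

print(counts)  # the preconditioned run needs far fewer iterations
```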
Other Krylov Methods
GMRES (Generalized Minimal Residual): for non-symmetric $A$, minimizes the residual norm $\|b - Ax_k\|_2$ over $x_0 + \mathcal{K}_k(A, r_0)$. Requires storing all $k$ basis vectors: $O(kn)$ memory. Restarted GMRES($m$): restart every $m$ steps to bound memory; may not converge for all problems.
BiCGSTAB (Biconjugate Gradient Stabilized): for non-symmetric systems, $O(\mathrm{nnz}(A))$ work per iteration and $O(n)$ storage. Less stable than GMRES but more memory-efficient.
Lanczos/Arnoldi iteration: builds an orthonormal basis for $\mathcal{K}_k(A, r_0)$. Lanczos (for symmetric $A$) produces a tridiagonal reduction; Arnoldi (general $A$) produces a Hessenberg reduction. Both find eigenvalue approximations as a byproduct.
MINRES: for symmetric indefinite $A$ (not positive definite), minimizes the 2-norm of the residual. Always converges; preferred over CG when $A$ is indefinite.
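A quick sketch of MINRES on a symmetric indefinite system (the spectrum and sizes are arbitrary choices for the demo):

```python
import numpy as np
from scipy.sparse.linalg import minres

rng = np.random.default_rng(2)
n = 100
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
# Symmetric indefinite: eigenvalues of both signs, so CG's A-norm is undefined
lam = np.concatenate([np.linspace(-5, -1, n // 2), np.linspace(1, 5, n // 2)])
A = (Q * lam) @ Q.T          # A = Q diag(lam) Q^T
A = (A + A.T) / 2            # enforce exact symmetry after round-off
b = rng.standard_normal(n)

x, info = minres(A, b)
print(info)                  # info == 0 means converged
```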
Worked Example
Example 1: CG on a Poisson System
Solve the 2D discrete Poisson equation on an $n \times n$ grid: $A$ is the 5-point Laplacian with $N = n^2$ unknowns, $\kappa(A) = O(n^2)$. For $n = 1000$: $N = 10^6$, $\kappa \approx 4 \times 10^5$.

Without preconditioning: $O(\sqrt{\kappa}) \approx 600$ iterations.

With multigrid preconditioner: $\kappa(M^{-1}A) = O(1)$, so $O(1)$ iterations. The speedup makes trillion-cell CFD simulations practical.
Cost per iteration: $A$ is stored implicitly as the 5-point stencil — applying $A$ costs $5N$ multiplications. No matrix is ever stored. Total storage: $O(N)$, a few vectors of length $N$.
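The stencil application described above can be sketched matrix-free (grid size chosen small for the demo; boundary values are held at zero, i.e. Dirichlet conditions):

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

n = 64                          # grid is n x n, so N = n^2 unknowns
N = n * n

def apply_laplacian(v):
    """Apply the 5-point Laplacian stencil without forming a matrix."""
    u = v.reshape(n, n)
    out = 4.0 * u
    out[1:, :] -= u[:-1, :]     # subtract north neighbor
    out[:-1, :] -= u[1:, :]     # south
    out[:, 1:] -= u[:, :-1]     # west
    out[:, :-1] -= u[:, 1:]     # east
    return out.ravel()

A = LinearOperator((N, N), matvec=apply_laplacian)
b = np.ones(N)

iters = [0]
def count(xk):
    iters[0] += 1

x, info = cg(A, b, maxiter=5000, callback=count)
print(info, iters[0])           # iteration count grows like O(n), matching kappa = O(n^2)
```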
Example 2: CG for Kernel Methods at Scale
Solve $K\alpha = y$ where $K \in \mathbb{R}^{n \times n}$ is a kernel matrix, $n = 10^6$ (too large for Cholesky: $O(n^3) = 10^{18}$ flops). Use CG with matrix-vector product $v \mapsto Kv$:
- Exact kernel: $O(n^2)$ per multiplication — $O(kn^2)$ total for $k$ CG iterations. Still expensive for $n = 10^6$.
- Random Fourier Features (Bochner): approximate $K \approx ZZ^\top$ with $Z \in \mathbb{R}^{n \times D}$, $D \ll n$. Then $Kv \approx Z(Z^\top v)$, costing $O(nD)$ per multiplication — $O(knD)$ total.
- GPyTorch's blackbox matrix-matrix multiplication (BBMM) applies this idea to scale GP regression to millions of data points.
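A sketch of the RFF trick for the RBF kernel (the feature count `D` and ridge term `lam` are illustrative choices, not from the original example):

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

rng = np.random.default_rng(3)
n, d, D = 10_000, 5, 200           # n points, d input dims, D random features

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Random Fourier features for the RBF kernel k(x, x') = exp(-||x - x'||^2 / 2):
# z(x) = sqrt(2/D) * cos(W^T x + phi),  W ~ N(0, I),  phi ~ Uniform[0, 2*pi)
W = rng.standard_normal((d, D))
phi = rng.uniform(0, 2 * np.pi, D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W + phi)   # n x D feature matrix, K ~ Z Z^T

lam = 1e-1                                   # ridge/jitter term keeps the system SPD

def matvec(v):
    # (Z Z^T + lam I) v in O(nD) instead of O(n^2)
    return Z @ (Z.T @ v) + lam * v

K_op = LinearOperator((n, n), matvec=matvec)
alpha, info = cg(K_op, y, maxiter=1000)
print(info)                                  # 0 on convergence
```

Note the eigenvalue clustering at work: $ZZ^\top + \lambda I$ has at most $D + 1$ distinct eigenvalue clusters, so CG converges in far fewer than $n$ iterations.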
Example 3: GMRES for Neural Network Implicit Differentiation
Implicit differentiation of a fixed-point condition $z^* = f(z^*, x)$ requires solving the linear system $\left(I - \frac{\partial f}{\partial z^*}\right)^{\!\top} v = \frac{\partial \ell}{\partial z^*}$. For deep equilibrium models (DEQs), the Jacobian $\partial f/\partial z^*$ is never explicitly formed — only matrix-vector products via automatic differentiation.

GMRES applies: compute $\left(I - \frac{\partial f}{\partial z^*}\right)^{\!\top} v$ using one forward pass + one backward pass (a vector-Jacobian product via autodiff). GMRES solves the system in $k \ll n$ iterations, each costing one autodiff pass. This is the key algorithm in DEQs, PINNs, and neural ODEs — solving linear systems without explicit Jacobians.
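A sketch of the matrix-free solve with SciPy's `gmres`; here an explicit contraction matrix stands in for the autodiff Jacobian products, which the real DEQ setting would supply instead:

```python
import numpy as np
from scipy.sparse.linalg import gmres, LinearOperator

rng = np.random.default_rng(4)
n = 300

# Stand-in for the Jacobian J = df/dz at the fixed point: a contraction
# (||J|| = 0.9 < 1), as required for a well-posed fixed point. In a real DEQ,
# J @ v would come from autodiff and J would never be materialized.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
J = 0.9 * Q

def matvec(v):
    return v - J @ v              # (I - J) v: one "Jacobian product" plus a subtraction

g = rng.standard_normal(n)        # gradient of the loss at the fixed point
op = LinearOperator((n, n), matvec=matvec)
v_sol, info = gmres(op, g, restart=50, maxiter=100)
print(info)                       # 0 on convergence
```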
Connections
Where Your Intuition Breaks
CG terminates in at most $n$ steps in exact arithmetic, giving the exact solution. In floating-point arithmetic, rounding errors cause the Krylov vectors to lose orthogonality — the search directions are no longer truly conjugate — and convergence stagnates well before $n$ steps. The standard fix is restarting (GMRES($k$): restart every $k$ iterations) or explicit re-orthogonalization. For ill-conditioned problems, CG may appear to converge (small residual $\|r_k\|$) while the solution error is still large — because the residual and the error are related by $e_k = A^{-1} r_k$, so the relative error can be up to $\kappa(A)$ times the relative residual. In practice, monitor the relative residual AND compare with a known reference when possible; the polynomial convergence bound is a guide, not a guarantee for finite-precision iteration.
CG is the optimal polynomial method for SPD systems. At step $k$, CG computes the best polynomial $p_k$ of degree $k$ (with $p_k(0) = 1$) such that $e_k = p_k(A) e_0$ minimizes the $A$-norm error. The convergence bound comes from the scaled Chebyshev polynomial that is smallest in magnitude on $[\lambda_{\min}, \lambda_{\max}]$. No other algorithm using only matrix-vector products can do better than CG's error bound at step $k$. Preconditioning is the only way to improve the exponent.
The spectrum of $A$ determines iteration count, not just $\kappa$. The convergence bound uses only $\lambda_{\min}$ and $\lambda_{\max}$, but CG is sensitive to the full spectrum. If $A$ has $m$ distinct eigenvalues, CG converges in exactly $m$ steps (in exact arithmetic). For matrices with a few large eigenvalues and the rest clustered near 1, CG converges much faster than the $\sqrt{\kappa}$ bound suggests. This is why CG with an AMG preconditioner achieves $O(1)$ iterations — the preconditioned spectrum has only a few distinct eigenvalue clusters.
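A quick demonstration of the distinct-eigenvalue claim (the sizes and eigenvalues are arbitrary choices):

```python
import numpy as np

# CG on a matrix with only 4 distinct eigenvalues: count iterations to convergence.
rng = np.random.default_rng(5)
n = 200
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
lam = np.repeat([1.0, 2.0, 10.0, 50.0], n // 4)   # 4 distinct eigenvalues
A = (Q * lam) @ Q.T                               # A = Q diag(lam) Q^T, SPD
b = rng.standard_normal(n)

x = np.zeros(n)
r = b.copy()
p = r.copy()
rs = r @ r
iters = 0
while np.sqrt(rs) > 1e-10 * np.linalg.norm(b) and iters < n:
    Ap = A @ p
    alpha = rs / (p @ Ap)
    x += alpha * p
    r -= alpha * Ap
    rs_new = r @ r
    p = r + (rs_new / rs) * p
    rs = rs_new
    iters += 1

print(iters)   # 4 in exact arithmetic; a handful in float64 — far below sqrt(kappa) * log(1/eps)
```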
CG in float32 loses orthogonality and can fail to converge. In exact arithmetic, successive CG directions are mutually $A$-conjugate. In floating-point, rounding errors destroy conjugacy after enough iterations — much sooner in float32 (unit roundoff $\approx 10^{-7}$) than in float64 ($\approx 10^{-16}$). For well-conditioned systems, convergence happens long before this matters. For ill-conditioned systems ($\kappa \gtrsim 1/\epsilon_{\mathrm{mach}}$), CG stagnates: the recurrence residual keeps shrinking while the true residual $\|b - Ax_k\|$ stalls. The fix: explicit re-orthogonalization (expensive), or restarts. For production ML solvers in float32, always use a preconditioner to keep $\kappa(M^{-1}A)$ well below $1/\epsilon_{\mathrm{mach}} \approx 10^7$.