
Iterative Solvers: Conjugate Gradient & Krylov Methods

When $n$ is large (millions of variables), direct solvers require $O(n^3)$ time and $O(n^2)$ memory, which is infeasible. Krylov methods solve $Ax = b$ using only matrix-vector products $v \mapsto Av$, never storing $A$ explicitly. Conjugate gradient is optimal for SPD systems; GMRES handles nonsymmetric problems. Preconditioning is the key to practical performance.

Concepts

For a $10^6 \times 10^6$ sparse matrix, common in physics simulations, graph problems, and kernel methods, direct solvers require $O(n^3) = 10^{18}$ operations and $O(n^2)$ memory. Krylov methods sidestep this: they solve $Ax = b$ using only matrix-vector products $v \mapsto Av$, never accessing $A$ explicitly. After $k$ products, the iterate lives in a $k$-dimensional subspace, and the method finds the best approximation there.

Krylov Subspaces

The Krylov subspace generated by $A$ and initial residual $r_0 = b - Ax_0$ is:

$$\mathcal{K}_k(A, r_0) = \text{span}\{r_0, Ar_0, A^2r_0, \ldots, A^{k-1}r_0\}.$$

Krylov methods compute iterates $x_k \in x_0 + \mathcal{K}_k(A, r_0)$ satisfying an optimality condition over the Krylov subspace. Each iteration requires one matrix-vector product $Av$.

Why this works: the Cayley-Hamilton theorem states that $p(A) = 0$ for the characteristic polynomial $p$ of degree $n$. So $A^{-1}$ is a polynomial in $A$ of degree at most $n-1$, meaning $x_* = A^{-1}b \in x_0 + \mathcal{K}_n(A, r_0)$. The Krylov subspace contains the exact solution at iteration $n$; the method finds the best approximation within $\mathcal{K}_k$ for $k \leq n$.

The polynomial interpretation is why Krylov methods converge fast when eigenvalues cluster: CG at step $k$ finds the degree-$k$ polynomial $p_k$ that best approximates $1/\lambda$ on the spectrum of $A$. When eigenvalues are few or tightly clustered, a low-degree polynomial can approximate $1/\lambda$ accurately on the entire spectrum, making $k \ll n$ sufficient for convergence.
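The subspace claim is easy to verify numerically. The sketch below (sizes and eigenvalue placement are arbitrary choices, not from the notes) builds an SPD matrix with exactly six distinct eigenvalues and computes the best least-squares solution restricted to $\mathcal{K}_k$; the error collapses once $k$ reaches the number of distinct eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# SPD matrix with exactly 6 distinct eigenvalues
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
eigs = np.concatenate([np.full(n - 5, 1.0), [10.0, 20.0, 30.0, 40.0, 50.0]])
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

def best_in_krylov(A, b, k):
    """Best least-squares solution restricted to K_k(A, b) = span{b, Ab, ..., A^{k-1}b}.
    The basis is orthonormalized for numerical stability."""
    V = np.zeros((len(b), k))
    v = b / np.linalg.norm(b)
    for j in range(k):
        V[:, j] = v
        v = A @ v
        v = v - V[:, :j + 1] @ (V[:, :j + 1].T @ v)  # Gram-Schmidt step
        v = v / np.linalg.norm(v)
    # Minimize ||A(Vy) - b|| over coefficients y
    y, *_ = np.linalg.lstsq(A @ V, b, rcond=None)
    return V @ y

for k in (2, 4, 6):
    print(k, np.linalg.norm(best_in_krylov(A, b, k) - x_star))
```

Note this minimizes the residual over the subspace (a MINRES-style condition) rather than CG's $A$-norm error; either way, the error drops to machine precision at $k = 6$ while $n = 50$.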

Conjugate Gradient (CG)

For symmetric positive definite $A$, the CG method minimizes the $A$-norm error $\|x_k - x_*\|_A = \sqrt{(x_k - x_*)^T A(x_k - x_*)}$:

$$x_k = \arg\min_{x \in x_0 + \mathcal{K}_k} \|x - x_*\|_A.$$

Algorithm (starting from $x_0 = 0$, $r_0 = b$, $p_0 = r_0$):

  1. $\alpha_k = \frac{r_k^T r_k}{p_k^T A p_k}$ (step size)
  2. $x_{k+1} = x_k + \alpha_k p_k$ (update)
  3. $r_{k+1} = r_k - \alpha_k A p_k$ (residual update)
  4. $\beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$ (momentum)
  5. $p_{k+1} = r_{k+1} + \beta_k p_k$ (new search direction)

Each iteration: 1 matrix-vector product, 2 dot products, 3 vector updates (steps 2, 3, and 5), so $O(n)$ operations for sparse $A$.
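Steps 1-5 transcribe almost directly into numpy. A minimal sketch (the function name and the test matrix are illustrative choices, not from the notes):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Textbook CG for SPD A, following steps 1-5 above.
    A can be anything supporting A @ v: dense, sparse, or a LinearOperator."""
    n = len(b)
    max_iter = max_iter or n
    x = np.zeros(n)            # x_0 = 0
    r = b.copy()               # r_0 = b
    p = r.copy()               # p_0 = r_0
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p                     # the single mat-vec per iteration
        alpha = rs / (p @ Ap)          # step 1: step size
        x += alpha * p                 # step 2: update iterate
        r -= alpha * Ap                # step 3: update residual
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            break
        beta = rs_new / rs             # step 4: momentum
        p = r + beta * p               # step 5: new search direction
        rs = rs_new
    return x

# usage on a small, well-conditioned SPD system
rng = np.random.default_rng(1)
M = rng.standard_normal((100, 100))
A = M @ M.T + 100 * np.eye(100)
b = rng.standard_normal(100)
x = conjugate_gradient(A, b)
print(np.linalg.norm(A @ x - b))   # residual norm, near machine precision
```

Note that $r_{k+1}$ is updated recursively from $Ap_k$ (step 3) rather than recomputed as $b - Ax_{k+1}$; this saves a matrix-vector product but is exactly where finite-precision drift enters, as discussed later.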

Convergence rate: the error satisfies:

$$\frac{\|x_k - x_*\|_A}{\|x_0 - x_*\|_A} \leq 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^k$$

where $\kappa = \kappa(A) = \lambda_{\max}/\lambda_{\min}$. For tolerance $\varepsilon$: $k = O(\sqrt{\kappa}\log(1/\varepsilon))$ iterations, better than gradient descent's $O(\kappa\log(1/\varepsilon))$ by a square root.

Finite termination: in exact arithmetic, CG terminates in at most $n$ steps. In practice, floating-point rounding breaks this guarantee (see the warning below).
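The $\sqrt{\kappa}$ dependence is visible empirically. For the 1D Laplacian, $\kappa \approx 4n^2/\pi^2$, so $\sqrt{\kappa}$ grows linearly with $n$, and so (roughly) does the CG iteration count. A SciPy sketch (grid sizes are arbitrary):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

def cg_iters(n):
    """CG iteration count on the 1D Laplacian, where kappa ~ 4n^2/pi^2."""
    A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
    b = np.ones(n)
    count = [0]
    cg(A, b, maxiter=10 * n,
       callback=lambda xk: count.__setitem__(0, count[0] + 1))
    return count[0]

for n in (100, 200, 400):
    print(n, cg_iters(n))   # counts grow roughly linearly with n
```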

Preconditioning

Preconditioned CG: solve $P^{-1}Ax = P^{-1}b$ where $P \approx A$ is easy to invert. The convergence depends on $\kappa(P^{-1}A)$ instead of $\kappa(A)$.

Common preconditioners:

| Preconditioner | Description | Quality |
| --- | --- | --- |
| Jacobi (diagonal) | $P = \text{diag}(A)$ | Simple; poor for general systems |
| Incomplete Cholesky (IC) | Cholesky restricted to the sparsity pattern of $A$ | Good for SPD |
| Algebraic multigrid (AMG) | Hierarchy of coarse grids | Excellent for PDEs |
| Block diagonal | $P = \text{blkdiag}(A_{11}, A_{22}, \ldots)$ | Good for block structure |

Perfect preconditioner: $P = A$ gives $\kappa(P^{-1}A) = 1$ and CG converges in one step, but applying $P^{-1}$ then amounts to inverting $A$, which is the original problem. Practical preconditioners trade accuracy for cheapness.
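The tradeoff is easy to see with SciPy. The sketch below (sizes and scaling range are made-up choices) compares plain CG against Jacobi-preconditioned CG on an SPD system whose ill-conditioning comes from bad diagonal scaling, which is exactly the situation Jacobi fixes:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg, LinearOperator

rng = np.random.default_rng(0)
n = 200
# Badly scaled SPD system A = S L S: L is the 1D Laplacian, S a diagonal scaling
L = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
s = 10.0 ** rng.uniform(0.0, 2.0, n)       # scales spanning two orders of magnitude
A = (diags(s) @ L @ diags(s)).tocsr()
b = np.ones(n)

counts = {"plain": 0, "jacobi": 0}
cg(A, b, maxiter=50000,
   callback=lambda xk: counts.__setitem__("plain", counts["plain"] + 1))

# Jacobi preconditioner: apply 1/diag(A), one O(n) multiply per iteration
inv_diag = 1.0 / A.diagonal()
M = LinearOperator((n, n), matvec=lambda v: inv_diag * np.ravel(v),
                   dtype=np.float64)
cg(A, b, maxiter=50000, M=M,
   callback=lambda xk: counts.__setitem__("jacobi", counts["jacobi"] + 1))
print(counts)   # the Jacobi run needs far fewer iterations
```

SciPy's `M` argument is the approximate inverse: it should apply $P^{-1}$, not $P$.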

Other Krylov Methods

GMRES (Generalized Minimal Residual): for nonsymmetric $A$, minimizes $\|b - Ax_k\|_2$ over $\mathcal{K}_k$. Requires storing all basis vectors: $O(kn)$ memory. Restarted GMRES($m$): restart every $m$ steps to bound memory; may stagnate on some problems.

BiCGSTAB (Biconjugate Gradient Stabilized): for nonsymmetric systems, $O(n)$ per iteration and $O(n)$ storage. Less stable than GMRES but more memory-efficient.

Lanczos/Arnoldi iteration: builds an orthonormal basis for $\mathcal{K}_k$. Lanczos (for symmetric $A$) produces a tridiagonal reduction; Arnoldi (general $A$) produces a Hessenberg reduction. Both find eigenvalue approximations as a byproduct.

MINRES: for symmetric indefinite $A$ (symmetric but not positive definite), minimizes the 2-norm of the residual. Its iterations are well defined even when CG would break down on an indefinite system, so it is the preferred choice there.
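SciPy exposes these solvers behind a shared interface. A brief usage sketch (both matrices are toy stand-ins chosen so that each solver is the appropriate one):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import gmres, minres

rng = np.random.default_rng(1)
n = 300
b = rng.standard_normal(n)

# Nonsymmetric, well-conditioned system: CG does not apply, GMRES does
A_ns = np.eye(n) + 0.5 * rng.standard_normal((n, n)) / np.sqrt(n)
x_g, info_g = gmres(A_ns, b, maxiter=2000)

# Symmetric indefinite system: CG can break down, MINRES minimizes the residual
d = np.concatenate([np.full(n // 2, -1.0), np.full(n - n // 2, 3.0)])
A_ind = diags([d], [0], format="csr")
x_m, info_m = minres(A_ind, b)

print(info_g, np.linalg.norm(A_ns @ x_g - b))   # info == 0 means converged
print(info_m, np.linalg.norm(A_ind @ x_m - b))
```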

Worked Example

Example 1: CG on a Poisson System

Solve the 2D discrete Poisson equation $Au = f$ on an $n \times n$ grid: $A$ is the 5-point Laplacian, $A \in \mathbb{R}^{N \times N}$, $N = n^2$. For $n = 1000$: $N = 10^6$, $\kappa(A) \approx 4N/\pi^2 \approx 4 \times 10^5$.

Without preconditioning: $O(\sqrt{\kappa}\log(1/\varepsilon)) = O(\sqrt{4\times10^5}\log(10^6)) \approx O(630 \times 14) \approx 9000$ iterations.

With a multigrid preconditioner: $\kappa(P^{-1}A) = O(1)$, so $O(\log(1/\varepsilon)) = O(14)$ iterations. A speedup of roughly $600\times$ in iteration count is what makes very large CFD simulations practical.

Cost per iteration: $A$ is stored implicitly as the 5-point stencil; applying $A$ costs $5N$ multiplications. No $N \times N$ matrix is ever stored. Total storage: $O(N)$.
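The matrix-free setup fits in a few lines of SciPy: the stencil function below is the entire representation of $A$ (the grid is shrunk to $64 \times 64$ to keep the example fast):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

n = 64                   # grid is n x n, so N = n^2 unknowns
N = n * n

def apply_laplacian(u_flat):
    """5-point Laplacian with homogeneous Dirichlet boundaries, applied
    directly on the grid: the N x N matrix is never formed."""
    u = np.ravel(u_flat).reshape(n, n)
    Au = 4.0 * u
    Au[1:, :] -= u[:-1, :]    # subtract north neighbor
    Au[:-1, :] -= u[1:, :]    # subtract south neighbor
    Au[:, 1:] -= u[:, :-1]    # subtract west neighbor
    Au[:, :-1] -= u[:, 1:]    # subtract east neighbor
    return Au.ravel()

A = LinearOperator((N, N), matvec=apply_laplacian, dtype=np.float64)
f = np.ones(N)                    # constant source term
u, info = cg(A, f, maxiter=5000)
print(info, np.linalg.norm(apply_laplacian(u) - f))
```

`LinearOperator` is SciPy's idiom for "a matrix defined only by its action"; CG never asks for anything beyond `matvec`.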

Example 2: CG for Kernel Methods at Scale

Solve $(K + \sigma^2 I)\alpha = y$ where $K \in \mathbb{R}^{N \times N}$ is a kernel matrix, $N = 10^5$ (too large for Cholesky: $\sim 10^{15}$ flops). Use CG with the matrix-vector product $v \mapsto Kv$:

  • Exact kernel: $O(N^2)$ per multiplication, so $O(N^2\sqrt{\kappa})$ total for CG (roughly $O(N^{2.5})$ when $\kappa$ grows like $N$). Still expensive for $N = 10^5$.
  • Random Fourier Features (Bochner's theorem): approximate $K \approx ZZ^T$ with $Z \in \mathbb{R}^{N \times D}$, $D = 10^3$. Then $(K + \sigma^2 I)v \approx Z(Z^Tv) + \sigma^2 v$, costing $O(ND)$ per multiplication and $O(ND\sqrt{\kappa})$ total.
  • GPyTorch's blackbox matrix-matrix multiplication (BBMM) applies this idea to scale GP regression to millions of data points.
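A minimal sketch of the RFF bullet above, assuming an RBF kernel with unit lengthscale (the dataset, sizes, and $\sigma^2$ are made up for illustration):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(2)
N, d, D = 5000, 5, 300        # data points, input dim, random features
sigma2 = 1.0                  # noise variance (illustrative value)
X = rng.standard_normal((N, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(N)

# Random Fourier Features for the RBF kernel (Bochner's theorem):
# k(x, x') ~ z(x)^T z(x') with z(x) = sqrt(2/D) * cos(W^T x + u)
W = rng.standard_normal((d, D))
u = rng.uniform(0.0, 2.0 * np.pi, D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W + u)

# (K + sigma^2 I) v ~ Z (Z^T v) + sigma^2 v: O(ND) per product, K never formed
def matvec(v):
    v = np.ravel(v)
    return Z @ (Z.T @ v) + sigma2 * v

A = LinearOperator((N, N), matvec=matvec, dtype=np.float64)
alpha, info = cg(A, y, maxiter=2000)
print(info, np.linalg.norm(matvec(alpha) - y))
```

The parenthesization `Z @ (Z.T @ v)` is the whole trick: two $O(ND)$ products instead of ever materializing the $N \times N$ matrix $ZZ^T$.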

Example 3: GMRES for Neural Network Implicit Differentiation

Implicit differentiation of a fixed-point condition $F(\theta, z(\theta)) = 0$ requires solving the linear system $\frac{\partial F}{\partial z} v = u$. For deep equilibrium models (DEQs), $\frac{\partial F}{\partial z}$ is never explicitly formed; only matrix-vector products are available via automatic differentiation.

GMRES applies directly: compute $J_F v$ with one forward pass plus one backward pass (a Jacobian-vector product via autodiff), so each GMRES iteration costs $O(\text{forward pass})$. The iteration count depends on the spectrum of the Jacobian and is typically far smaller than the dimension of $z$. Solving linear systems without explicit Jacobians is the key computational pattern behind DEQs and neural ODEs.
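A numpy sketch of the pattern (here a central finite difference stands in for the autodiff JVP, and the fixed-point map is a made-up contraction; everything except the GMRES call is an illustrative assumption):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

rng = np.random.default_rng(3)
n = 100
# Toy fixed-point map f(z) = tanh(Wz + c), contractive since ||W|| < 1
W = 0.3 * rng.standard_normal((n, n)) / np.sqrt(n)
c = rng.standard_normal(n)
f = lambda z: np.tanh(W @ z + c)

z = np.zeros(n)                 # find z* = f(z*) by fixed-point iteration
for _ in range(200):
    z = f(z)

# Implicit differentiation solves (I - J_f) v = u with J_f never formed.
# A finite-difference JVP stands in here for an autodiff jvp call.
def jvp(v, eps=1e-6):
    return (f(z + eps * v) - f(z - eps * v)) / (2.0 * eps)

A = LinearOperator((n, n),
                   matvec=lambda v: np.ravel(v) - jvp(np.ravel(v)),
                   dtype=np.float64)
u = rng.standard_normal(n)
v, info = gmres(A, u, maxiter=500)
print(info, np.linalg.norm(v - jvp(v) - u))
```

In a real DEQ the `jvp` call would be one autodiff forward/backward pass through the network; GMRES only ever sees the black-box `matvec`.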

Connections

Where Your Intuition Breaks

CG terminates in at most $n$ steps in exact arithmetic, giving the exact solution. In floating-point arithmetic, rounding errors cause the Krylov vectors to lose orthogonality (the search directions are no longer truly conjugate) and convergence can stagnate well before $n$ steps. The standard fixes are restarting (GMRES($m$): restart every $m$ iterations) and explicit re-orthogonalization. For ill-conditioned problems, CG may appear to converge (small residual $\|r_k\|$) while the solution error $\|x_k - x_*\|$ is still large, because the error is bounded by the residual only up to a factor of $\kappa(A)$. In practice, monitor the relative residual AND compare with a known reference when possible; the polynomial convergence bound is a guide, not a guarantee, in finite precision.

💡Intuition

CG is the optimal polynomial method for SPD systems. At step $k$, write $x_k = x_0 + q_{k-1}(A)r_0$ for a polynomial $q_{k-1}$ of degree $\leq k-1$; the residual is then $r_k = p_k(A)r_0$ with $p_k(\lambda) = 1 - \lambda q_{k-1}(\lambda)$, so $p_k(0) = 1$. CG chooses the polynomial that minimizes the $A$-norm error over all such choices. The convergence bound $((\sqrt\kappa-1)/(\sqrt\kappa+1))^k$ comes from bounding this minimum with Chebyshev polynomials on $[\lambda_{\min}, \lambda_{\max}]$. No other Krylov method using only matrix-vector products can beat CG's $A$-norm error bound at step $k$. Preconditioning is the only way to improve the exponent.

💡Intuition

The spectrum of $A$ determines the iteration count, not just $\kappa$. The convergence bound uses only $\lambda_{\min}$ and $\lambda_{\max}$, but CG responds to the full spectrum. If $A$ has $m$ distinct eigenvalues, CG converges in at most $m$ steps (in exact arithmetic). For matrices with a few large eigenvalues and the rest clustered near 1, CG converges much faster than the bound suggests. This is why CG with an AMG preconditioner needs only a handful of iterations: the preconditioned spectrum has $O(1)$ tight eigenvalue clusters.
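The distinct-eigenvalue claim is easy to check numerically (sizes and eigenvalues below are arbitrary choices):

```python
import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(4)
n, m = 400, 8       # 400 x 400 SPD matrix with only 8 distinct eigenvalues
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
values = np.linspace(1.0, 100.0, m)
eigs = values[rng.integers(0, m, n)]
eigs[:m] = values                  # make sure every value actually occurs
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(n)

count = [0]
x, info = cg(A, b, maxiter=n,
             callback=lambda xk: count.__setitem__(0, count[0] + 1))
print("iterations:", count[0])     # roughly m, despite n = 400
```

The crude bound based on $\kappa = 100$ would predict on the order of $\sqrt{\kappa} = 10$ times more iterations than CG actually takes here; the full spectrum, not $\kappa$ alone, is what matters.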

⚠️Warning

CG in float32 loses orthogonality and can fail to converge. In exact arithmetic, successive CG directions are mutually $A$-conjugate. In floating point, rounding errors degrade conjugacy; trouble appears once $\kappa \cdot \varepsilon_{\text{mach}}$ is no longer small (for float32, $\varepsilon_{\text{mach}} \approx 10^{-7}$). For well-conditioned systems, convergence happens long before this matters. For ill-conditioned systems ($\kappa \sim 1/\varepsilon_{\text{mach}}$), CG stagnates: the recursively updated residual $r_k$ keeps shrinking while the true residual $b - Ax_k$ levels off. The fixes: explicit re-orthogonalization (expensive), or restarts. For production ML solvers in float32, always use a preconditioner to reduce $\kappa$ well below $\sim 10^5$.
