
Iterative Solvers: Conjugate Gradient & Krylov Methods

When $n$ is large (millions of variables), direct solvers require $O(n^3)$ time and $O(n^2)$ memory, which is infeasible. Krylov methods solve $Ax = b$ using only matrix-vector products $v \mapsto Av$, never storing $A$ explicitly. Conjugate gradient is optimal for SPD systems; GMRES handles nonsymmetric problems. Preconditioning is the key to practical performance.

Concepts

For a $10^6 \times 10^6$ sparse matrix, common in physics simulations, graph problems, and kernel methods, direct solvers require $O(n^3) = 10^{18}$ operations and $O(n^2)$ memory. Krylov methods sidestep this: they solve $Ax = b$ using only matrix-vector products $v \mapsto Av$, never accessing $A$ explicitly. After $k$ products, the iterate lives in a $k$-dimensional subspace, and the method finds the best approximation there.

Krylov Subspaces

The Krylov subspace generated by $A$ and initial residual $r_0 = b - Ax_0$ is:

$$\mathcal{K}_k(A, r_0) = \text{span}\{r_0, Ar_0, A^2r_0, \ldots, A^{k-1}r_0\}.$$

Krylov methods compute iterates $x_k \in x_0 + \mathcal{K}_k(A, r_0)$ satisfying an optimality condition over the Krylov subspace. Each iteration requires one matrix-vector product $Av$.

Why this works: the Cayley-Hamilton theorem states that $p(A) = 0$ for the characteristic polynomial $p$ of degree $n$. So $A^{-1}$ is a polynomial in $A$ of degree at most $n-1$, meaning $x_* = A^{-1}b \in x_0 + \mathcal{K}_n(A, r_0)$. The Krylov subspace contains the exact solution at iteration $n$; the method finds the best approximation within $\mathcal{K}_k$ for $k \leq n$.

The polynomial interpretation is why Krylov methods converge fast when eigenvalues cluster: CG at step $k$ finds the degree-$k$ polynomial $p_k$ that best approximates $1/\lambda$ on the spectrum of $A$. When eigenvalues are few or tightly clustered, a low-degree polynomial can approximate $1/\lambda$ accurately on the entire spectrum, making $k \ll n$ sufficient for convergence.
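The subspace claim is easy to verify numerically. The sketch below (sizes and eigenvalue placement are arbitrary choices, not from the notes) builds an SPD matrix with exactly six distinct eigenvalues and computes the best least-squares solution restricted to $\mathcal{K}_k$; the error collapses once $k$ reaches the number of distinct eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# SPD matrix with exactly 6 distinct eigenvalues
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
eigs = np.concatenate([np.full(n - 5, 1.0), [10.0, 20.0, 30.0, 40.0, 50.0]])
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

def best_in_krylov(A, b, k):
    """Best least-squares solution restricted to K_k(A, b) = span{b, Ab, ..., A^{k-1}b}.
    The basis is orthonormalized for numerical stability."""
    V = np.zeros((len(b), k))
    v = b / np.linalg.norm(b)
    for j in range(k):
        V[:, j] = v
        v = A @ v
        v = v - V[:, :j + 1] @ (V[:, :j + 1].T @ v)  # Gram-Schmidt step
        v = v / np.linalg.norm(v)
    # Minimize ||A(Vy) - b|| over coefficients y
    y, *_ = np.linalg.lstsq(A @ V, b, rcond=None)
    return V @ y

for k in (2, 4, 6):
    print(k, np.linalg.norm(best_in_krylov(A, b, k) - x_star))
```

Note this minimizes the residual over the subspace (a MINRES-style condition) rather than CG's $A$-norm error; either way, the error drops to machine precision at $k = 6$ while $n = 50$.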

Conjugate Gradient (CG)

For symmetric positive definite $A$, the CG method minimizes the $A$-norm error $\|x_k - x_*\|_A = \sqrt{(x_k - x_*)^T A(x_k - x_*)}$:

$$x_k = \arg\min_{x \in x_0 + \mathcal{K}_k} \|x - x_*\|_A.$$

Algorithm (starting from $x_0 = 0$, $r_0 = b$, $p_0 = r_0$):

  1. $\alpha_k = \frac{r_k^T r_k}{p_k^T A p_k}$ (step size)
  2. $x_{k+1} = x_k + \alpha_k p_k$ (update)
  3. $r_{k+1} = r_k - \alpha_k A p_k$ (residual update)
  4. $\beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}$ (momentum)
  5. $p_{k+1} = r_{k+1} + \beta_k p_k$ (new search direction)

Each iteration: 1 matrix-vector product, 2 dot products, 3 vector updates (steps 2, 3, and 5), so $O(n)$ operations for sparse $A$.
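Steps 1-5 transcribe almost directly into numpy. A minimal sketch (the function name and the test matrix are illustrative choices, not from the notes):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Textbook CG for SPD A, following steps 1-5 above.
    A can be anything supporting A @ v: dense, sparse, or a LinearOperator."""
    n = len(b)
    max_iter = max_iter or n
    x = np.zeros(n)            # x_0 = 0
    r = b.copy()               # r_0 = b
    p = r.copy()               # p_0 = r_0
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p                     # the single mat-vec per iteration
        alpha = rs / (p @ Ap)          # step 1: step size
        x += alpha * p                 # step 2: update iterate
        r -= alpha * Ap                # step 3: update residual
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            break
        beta = rs_new / rs             # step 4: momentum
        p = r + beta * p               # step 5: new search direction
        rs = rs_new
    return x

# usage on a small, well-conditioned SPD system
rng = np.random.default_rng(1)
M = rng.standard_normal((100, 100))
A = M @ M.T + 100 * np.eye(100)
b = rng.standard_normal(100)
x = conjugate_gradient(A, b)
print(np.linalg.norm(A @ x - b))   # residual norm, near machine precision
```

Note that $r_{k+1}$ is updated recursively from $Ap_k$ (step 3) rather than recomputed as $b - Ax_{k+1}$; this saves a matrix-vector product but is exactly where finite-precision drift enters, as discussed later.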

Convergence rate: the error satisfies:

$$\frac{\|x_k - x_*\|_A}{\|x_0 - x_*\|_A} \leq 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^k$$

where $\kappa = \kappa(A) = \lambda_{\max}/\lambda_{\min}$. For tolerance $\varepsilon$: $k = O(\sqrt{\kappa}\log(1/\varepsilon))$ iterations, better than gradient descent's $O(\kappa\log(1/\varepsilon))$ by a square root.

Finite termination: in exact arithmetic, CG terminates in at most $n$ steps. In practice, floating-point rounding breaks this guarantee (see the warning below).
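The $\sqrt{\kappa}$ dependence is visible empirically. For the 1D Laplacian, $\kappa \approx 4n^2/\pi^2$, so $\sqrt{\kappa}$ grows linearly with $n$, and so (roughly) does the CG iteration count. A SciPy sketch (grid sizes are arbitrary):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

def cg_iters(n):
    """CG iteration count on the 1D Laplacian, where kappa ~ 4n^2/pi^2."""
    A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
    b = np.ones(n)
    count = [0]
    cg(A, b, maxiter=10 * n,
       callback=lambda xk: count.__setitem__(0, count[0] + 1))
    return count[0]

for n in (100, 200, 400):
    print(n, cg_iters(n))   # counts grow roughly linearly with n
```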

Preconditioning

Preconditioned CG: solve $P^{-1}Ax = P^{-1}b$ where $P \approx A$ is easy to invert. The convergence depends on $\kappa(P^{-1}A)$ instead of $\kappa(A)$.

Common preconditioners:

| Preconditioner | Description | Quality |
| --- | --- | --- |
| Jacobi (diagonal) | $P = \text{diag}(A)$ | Simple; poor for general systems |
| Incomplete Cholesky (IC) | Cholesky restricted to the sparsity pattern of $A$ | Good for SPD |
| Algebraic multigrid (AMG) | Hierarchy of coarse grids | Excellent for PDEs |
| Block diagonal | $P = \text{blkdiag}(A_{11}, A_{22}, \ldots)$ | Good for block structure |

Perfect preconditioner: $P = A$ gives $\kappa(P^{-1}A) = 1$ and CG converges in one step, but applying $P^{-1}$ then amounts to inverting $A$, which is the original problem. Practical preconditioners trade accuracy for cheapness.
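The tradeoff is easy to see with SciPy. The sketch below (sizes and scaling range are made-up choices) compares plain CG against Jacobi-preconditioned CG on an SPD system whose ill-conditioning comes from bad diagonal scaling, which is exactly the situation Jacobi fixes:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg, LinearOperator

rng = np.random.default_rng(0)
n = 200
# Badly scaled SPD system A = S L S: L is the 1D Laplacian, S a diagonal scaling
L = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
s = 10.0 ** rng.uniform(0.0, 2.0, n)       # scales spanning two orders of magnitude
A = (diags(s) @ L @ diags(s)).tocsr()
b = np.ones(n)

counts = {"plain": 0, "jacobi": 0}
cg(A, b, maxiter=50000,
   callback=lambda xk: counts.__setitem__("plain", counts["plain"] + 1))

# Jacobi preconditioner: apply 1/diag(A), one O(n) multiply per iteration
inv_diag = 1.0 / A.diagonal()
M = LinearOperator((n, n), matvec=lambda v: inv_diag * np.ravel(v),
                   dtype=np.float64)
cg(A, b, maxiter=50000, M=M,
   callback=lambda xk: counts.__setitem__("jacobi", counts["jacobi"] + 1))
print(counts)   # the Jacobi run needs far fewer iterations
```

SciPy's `M` argument is the approximate inverse: it should apply $P^{-1}$, not $P$.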

Other Krylov Methods

GMRES (Generalized Minimal Residual): for nonsymmetric $A$, minimizes $\|b - Ax_k\|_2$ over $\mathcal{K}_k$. Requires storing all basis vectors: $O(kn)$ memory. Restarted GMRES($m$): restart every $m$ steps to bound memory; may stagnate on some problems.

BiCGSTAB (Biconjugate Gradient Stabilized): for nonsymmetric systems, $O(n)$ per iteration and $O(n)$ storage. Less stable than GMRES but more memory-efficient.

Lanczos/Arnoldi iteration: builds an orthonormal basis for $\mathcal{K}_k$. Lanczos (for symmetric $A$) produces a tridiagonal reduction; Arnoldi (general $A$) produces a Hessenberg reduction. Both find eigenvalue approximations as a byproduct.

MINRES: for symmetric indefinite $A$ (symmetric but not positive definite), minimizes the 2-norm of the residual. Its iterations are well defined even when CG would break down on an indefinite system, so it is the preferred choice there.
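SciPy exposes these solvers behind a shared interface. A brief usage sketch (both matrices are toy stand-ins chosen so that each solver is the appropriate one):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import gmres, minres

rng = np.random.default_rng(1)
n = 300
b = rng.standard_normal(n)

# Nonsymmetric, well-conditioned system: CG does not apply, GMRES does
A_ns = np.eye(n) + 0.5 * rng.standard_normal((n, n)) / np.sqrt(n)
x_g, info_g = gmres(A_ns, b, maxiter=2000)

# Symmetric indefinite system: CG can break down, MINRES minimizes the residual
d = np.concatenate([np.full(n // 2, -1.0), np.full(n - n // 2, 3.0)])
A_ind = diags([d], [0], format="csr")
x_m, info_m = minres(A_ind, b)

print(info_g, np.linalg.norm(A_ns @ x_g - b))   # info == 0 means converged
print(info_m, np.linalg.norm(A_ind @ x_m - b))
```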

Worked Example

Example 1: CG on a Poisson System

Solve the 2D discrete Poisson equation $Au = f$ on an $n \times n$ grid: $A$ is the 5-point Laplacian, $A \in \mathbb{R}^{N \times N}$, $N = n^2$. For $n = 1000$: $N = 10^6$, $\kappa(A) \approx 4N/\pi^2 \approx 4 \times 10^5$.

Without preconditioning: $O(\sqrt{\kappa}\log(1/\varepsilon)) = O(\sqrt{4\times10^5}\log(10^6)) \approx O(630 \times 14) \approx 9000$ iterations.

With a multigrid preconditioner: $\kappa(P^{-1}A) = O(1)$, so $O(\log(1/\varepsilon)) = O(14)$ iterations. A speedup of roughly $600\times$ in iteration count is what makes very large CFD simulations practical.

Cost per iteration: $A$ is stored implicitly as the 5-point stencil; applying $A$ costs $5N$ multiplications. No $N \times N$ matrix is ever stored. Total storage: $O(N)$.
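The matrix-free setup fits in a few lines of SciPy: the stencil function below is the entire representation of $A$ (the grid is shrunk to $64 \times 64$ to keep the example fast):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

n = 64                   # grid is n x n, so N = n^2 unknowns
N = n * n

def apply_laplacian(u_flat):
    """5-point Laplacian with homogeneous Dirichlet boundaries, applied
    directly on the grid: the N x N matrix is never formed."""
    u = np.ravel(u_flat).reshape(n, n)
    Au = 4.0 * u
    Au[1:, :] -= u[:-1, :]    # subtract north neighbor
    Au[:-1, :] -= u[1:, :]    # subtract south neighbor
    Au[:, 1:] -= u[:, :-1]    # subtract west neighbor
    Au[:, :-1] -= u[:, 1:]    # subtract east neighbor
    return Au.ravel()

A = LinearOperator((N, N), matvec=apply_laplacian, dtype=np.float64)
f = np.ones(N)                    # constant source term
u, info = cg(A, f, maxiter=5000)
print(info, np.linalg.norm(apply_laplacian(u) - f))
```

`LinearOperator` is SciPy's idiom for "a matrix defined only by its action"; CG never asks for anything beyond `matvec`.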

Example 2: CG for Kernel Methods at Scale

Solve $(K + \sigma^2 I)\alpha = y$ where $K \in \mathbb{R}^{N \times N}$ is a kernel matrix, $N = 10^5$ (too large for Cholesky: $\sim 10^{15}$ flops). Use CG with the matrix-vector product $v \mapsto Kv$:

  • Exact kernel: $O(N^2)$ per multiplication, so $O(N^2\sqrt{\kappa})$ total for CG (roughly $O(N^{2.5})$ when $\kappa$ grows like $N$). Still expensive for $N = 10^5$.
  • Random Fourier Features (Bochner's theorem): approximate $K \approx ZZ^T$ with $Z \in \mathbb{R}^{N \times D}$, $D = 10^3$. Then $(K + \sigma^2 I)v \approx Z(Z^Tv) + \sigma^2 v$, costing $O(ND)$ per multiplication and $O(ND\sqrt{\kappa})$ total.
  • GPyTorch's blackbox matrix-matrix multiplication (BBMM) applies this idea to scale GP regression to millions of data points.
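A minimal sketch of the RFF bullet above, assuming an RBF kernel with unit lengthscale (the dataset, sizes, and $\sigma^2$ are made up for illustration):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(2)
N, d, D = 5000, 5, 300        # data points, input dim, random features
sigma2 = 1.0                  # noise variance (illustrative value)
X = rng.standard_normal((N, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(N)

# Random Fourier Features for the RBF kernel (Bochner's theorem):
# k(x, x') ~ z(x)^T z(x') with z(x) = sqrt(2/D) * cos(W^T x + u)
W = rng.standard_normal((d, D))
u = rng.uniform(0.0, 2.0 * np.pi, D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W + u)

# (K + sigma^2 I) v ~ Z (Z^T v) + sigma^2 v: O(ND) per product, K never formed
def matvec(v):
    v = np.ravel(v)
    return Z @ (Z.T @ v) + sigma2 * v

A = LinearOperator((N, N), matvec=matvec, dtype=np.float64)
alpha, info = cg(A, y, maxiter=2000)
print(info, np.linalg.norm(matvec(alpha) - y))
```

The parenthesization `Z @ (Z.T @ v)` is the whole trick: two $O(ND)$ products instead of ever materializing the $N \times N$ matrix $ZZ^T$.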

Example 3: GMRES for Neural Network Implicit Differentiation

Implicit differentiation of a fixed-point condition $F(\theta, z(\theta)) = 0$ requires solving the linear system $\frac{\partial F}{\partial z} v = u$. For deep equilibrium models (DEQs), $\frac{\partial F}{\partial z}$ is never explicitly formed; only matrix-vector products are available via automatic differentiation.

GMRES applies directly: compute $J_F v$ with one forward pass plus one backward pass (a Jacobian-vector product via autodiff), so each GMRES iteration costs $O(\text{forward pass})$. The iteration count depends on the spectrum of the Jacobian and is typically far smaller than the dimension of $z$. Solving linear systems without explicit Jacobians is the key computational pattern behind DEQs and neural ODEs.
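A numpy sketch of the pattern (here a central finite difference stands in for the autodiff JVP, and the fixed-point map is a made-up contraction; everything except the GMRES call is an illustrative assumption):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

rng = np.random.default_rng(3)
n = 100
# Toy fixed-point map f(z) = tanh(Wz + c), contractive since ||W|| < 1
W = 0.3 * rng.standard_normal((n, n)) / np.sqrt(n)
c = rng.standard_normal(n)
f = lambda z: np.tanh(W @ z + c)

z = np.zeros(n)                 # find z* = f(z*) by fixed-point iteration
for _ in range(200):
    z = f(z)

# Implicit differentiation solves (I - J_f) v = u with J_f never formed.
# A finite-difference JVP stands in here for an autodiff jvp call.
def jvp(v, eps=1e-6):
    return (f(z + eps * v) - f(z - eps * v)) / (2.0 * eps)

A = LinearOperator((n, n),
                   matvec=lambda v: np.ravel(v) - jvp(np.ravel(v)),
                   dtype=np.float64)
u = rng.standard_normal(n)
v, info = gmres(A, u, maxiter=500)
print(info, np.linalg.norm(v - jvp(v) - u))
```

In a real DEQ the `jvp` call would be one autodiff forward/backward pass through the network; GMRES only ever sees the black-box `matvec`.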

Connections

Where Your Intuition Breaks

CG terminates in at most $n$ steps in exact arithmetic, giving the exact solution. In floating-point arithmetic, rounding errors cause the Krylov vectors to lose orthogonality (the search directions are no longer truly conjugate) and convergence can stagnate well before $n$ steps. The standard fixes are restarting (GMRES($m$): restart every $m$ iterations) and explicit re-orthogonalization. For ill-conditioned problems, CG may appear to converge (small residual $\|r_k\|$) while the solution error $\|x_k - x_*\|$ is still large, because the error is bounded by the residual only up to a factor of $\kappa(A)$. In practice, monitor the relative residual AND compare with a known reference when possible; the polynomial convergence bound is a guide, not a guarantee, in finite precision.

💡Intuition

CG is the optimal polynomial method for SPD systems. At step $k$, write $x_k = x_0 + q_{k-1}(A)r_0$ for a polynomial $q_{k-1}$ of degree $\leq k-1$; the residual is then $r_k = p_k(A)r_0$ with $p_k(\lambda) = 1 - \lambda q_{k-1}(\lambda)$, so $p_k(0) = 1$. CG chooses the polynomial that minimizes the $A$-norm error over all such choices. The convergence bound $((\sqrt\kappa-1)/(\sqrt\kappa+1))^k$ comes from bounding this minimum with Chebyshev polynomials on $[\lambda_{\min}, \lambda_{\max}]$. No other Krylov method using only matrix-vector products can beat CG's $A$-norm error bound at step $k$. Preconditioning is the only way to improve the exponent.

💡Intuition

The spectrum of $A$ determines the iteration count, not just $\kappa$. The convergence bound uses only $\lambda_{\min}$ and $\lambda_{\max}$, but CG responds to the full spectrum. If $A$ has $m$ distinct eigenvalues, CG converges in at most $m$ steps (in exact arithmetic). For matrices with a few large eigenvalues and the rest clustered near 1, CG converges much faster than the bound suggests. This is why CG with an AMG preconditioner needs only a handful of iterations: the preconditioned spectrum has $O(1)$ tight eigenvalue clusters.
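The distinct-eigenvalue claim is easy to check numerically (sizes and eigenvalues below are arbitrary choices):

```python
import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(4)
n, m = 400, 8       # 400 x 400 SPD matrix with only 8 distinct eigenvalues
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
values = np.linspace(1.0, 100.0, m)
eigs = values[rng.integers(0, m, n)]
eigs[:m] = values                  # make sure every value actually occurs
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(n)

count = [0]
x, info = cg(A, b, maxiter=n,
             callback=lambda xk: count.__setitem__(0, count[0] + 1))
print("iterations:", count[0])     # roughly m, despite n = 400
```

The crude bound based on $\kappa = 100$ would predict on the order of $\sqrt{\kappa} = 10$ times more iterations than CG actually takes here; the full spectrum, not $\kappa$ alone, is what matters.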

⚠️Warning

CG in float32 loses orthogonality and can fail to converge. In exact arithmetic, successive CG directions are mutually $A$-conjugate. In floating point, rounding errors degrade conjugacy; trouble appears once $\kappa \cdot \varepsilon_{\text{mach}}$ is no longer small (for float32, $\varepsilon_{\text{mach}} \approx 10^{-7}$). For well-conditioned systems, convergence happens long before this matters. For ill-conditioned systems ($\kappa \sim 1/\varepsilon_{\text{mach}}$), CG stagnates: the recursively updated residual $r_k$ keeps shrinking while the true residual $b - Ax_k$ levels off. The fixes: explicit re-orthogonalization (expensive), or restarts. For production ML solvers in float32, always use a preconditioner to reduce $\kappa$ well below $\sim 10^5$.
