
Direct Linear Solvers: LU, Cholesky & QR Factorizations

Factorizing a matrix decomposes an expensive problem (solving $Ax = b$) into two cheap triangular solves. LU factorization handles general square matrices; Cholesky exploits positive definiteness for a 2× speedup; QR handles rectangular systems and overdetermined least-squares problems. Every numerical linear algebra routine — LAPACK, NumPy, PyTorch's torch.linalg — is built on these three factorizations.

Concepts

Solving $Ax = b$ by computing $A^{-1}$ explicitly is both slower and less accurate than a direct solve. The key insight is that you never need $A^{-1}$ itself: you need the product $A^{-1}b$. Factorizing $A$ into triangular factors reduces the problem to two substitution passes (one forward, one backward), each costing $O(n^2)$. The $O(n^3)$ factorization is paid once and amortized over every right-hand side thereafter.

LU Factorization

Gaussian elimination reduces $A \in \mathbb{R}^{n \times n}$ to upper triangular form $U$ via a sequence of row operations, encoded as a unit lower triangular matrix $L$:

$$PA = LU$$

where $P$ is a permutation matrix (partial pivoting: swap rows to put the largest-magnitude element in the current column on the diagonal). The factorization exists and is unique when $A$ is nonsingular, once the permutation is chosen.

The permutation $P$ is not a convenience — it is required for numerical stability. Without pivoting, Gaussian elimination divides by pivot elements that can be arbitrarily small, causing multipliers $L_{ik} = A_{ik}/A_{kk}$ to blow up and amplifying rounding errors catastrophically. Partial pivoting bounds all multipliers by 1, making LU stable in practice for virtually every matrix that arises in applications.
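The blow-up is easy to reproduce. A minimal sketch in NumPy, using the standard textbook $2 \times 2$ example with a tiny leading pivot (the matrix is illustrative, not from these notes); elimination without pivoting destroys the answer, while LAPACK's pivoted solve recovers it:

```python
import numpy as np

# Without pivoting: the multiplier is A[1,0]/A[0,0] = 1e20, and
# U[1,1] = 1 - 1e20 rounds to -1e20, wiping out the right-hand side.
A = np.array([[1e-20, 1.], [1., 1.]])
b = np.array([1., 2.])

m = A[1, 0] / A[0, 0]                       # multiplier 1e20
U22 = A[1, 1] - m * A[0, 1]                 # rounds to -1e20
y2 = b[1] - m * b[0]                        # rounds to -1e20
x2 = y2 / U22                               # 1.0
x1 = (b[0] - x2) / A[0, 0]                  # 0.0 — catastrophically wrong
x_bad = np.array([x1, x2])

x_good = np.linalg.solve(A, b)              # pivoted LU: x ≈ (1, 1)
print(x_bad, x_good)
```

The true solution is approximately $(1, 1)$; the unpivoted elimination returns $(0, 1)$.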

Algorithm: for $k = 1, \ldots, n-1$: choose pivot row $r \geq k$ with $|A_{rk}| = \max_{i \geq k} |A_{ik}|$; swap rows $r$ and $k$; compute multipliers $L_{ik} = A_{ik}/A_{kk}$; eliminate column $k$ below the pivot. Cost: $\frac{2}{3}n^3$ flops.

Solving $Ax = b$: (1) compute $PA = LU$; (2) forward substitution: solve $Ly = Pb$ in $O(n^2)$; (3) backward substitution: solve $Ux = y$ in $O(n^2)$. The $O(n^3)$ cost is front-loaded in the factorization — multiple right-hand sides reuse $L, U$.
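The three steps can be sketched with SciPy's triangular-solve routines (note that `scipy.linalg.lu` returns the convention $A = PLU$, so the forward pass solves $Ly = P^T b$):

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
b = rng.standard_normal(5)

P, L, U = lu(A)                                # step 1: factor, O(n^3)
y = solve_triangular(L, P.T @ b, lower=True)   # step 2: forward substitution
x = solve_triangular(U, y)                     # step 3: backward substitution
print(np.max(np.abs(A @ x - b)))               # residual near machine precision
```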

Stability: partial pivoting bounds the growth factor by $\rho \leq 2^{n-1}$ (theoretical worst case), but in practice $\rho = O(n^{2/3})$. Complete pivoting gives tighter bounds but is rarely needed.

Cholesky Factorization

For a symmetric positive definite matrix $A$ (all eigenvalues $> 0$), the unique Cholesky factorization is:

$$A = LL^T$$

where $L$ is lower triangular with positive diagonal. Algorithm: for $j = 1, \ldots, n$:

$$L_{jj} = \sqrt{A_{jj} - \sum_{k=1}^{j-1} L_{jk}^2}, \qquad L_{ij} = \frac{1}{L_{jj}}\left(A_{ij} - \sum_{k=1}^{j-1} L_{ik}L_{jk}\right), \quad i > j.$$
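The column-by-column recurrence translates directly into code. A textbook sketch (production code would call `np.linalg.cholesky`, which wraps LAPACK's blocked routine):

```python
import numpy as np

def cholesky(A):
    """Unblocked textbook Cholesky: returns lower triangular L with A = L L^T."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        s = A[j, j] - L[j, :j] @ L[j, :j]           # A_jj - sum_k L_jk^2
        L[j, j] = np.sqrt(s)                        # negative s => A not PD
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L

M = np.array([[4., 2.], [2., 3.]])
L = cholesky(M)
print(L)
```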

Cost: $\frac{1}{3}n^3$ flops — exactly half of LU. Storage: only the lower triangle, so $n(n+1)/2$ entries.

Stability: Cholesky is unconditionally backward stable without pivoting (no growth factor). The diagonal entries $L_{jj}$ are always positive for PD matrices, so no zero-pivot failures occur.

Test for positive definiteness: the Cholesky factorization succeeds iff $A$ is symmetric positive definite. If it fails (a negative argument under the square root for some $L_{jj}$), $A$ is indefinite or singular.
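This test is one try/except in practice. A small helper sketch (`is_spd` is a hypothetical name, not a library function):

```python
import numpy as np

def is_spd(A):
    """Cheapest practical SPD test: check symmetry, then attempt Cholesky."""
    if not np.allclose(A, A.T):
        return False
    try:
        np.linalg.cholesky(A)                 # LAPACK raises if A is not PD
        return True
    except np.linalg.LinAlgError:
        return False

print(is_spd(np.array([[2., 1.], [1., 2.]])))   # eigenvalues 3, 1: PD
print(is_spd(np.array([[1., 2.], [2., 1.]])))   # eigenvalues 3, -1: indefinite
```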

QR Factorization

For $A \in \mathbb{R}^{m \times n}$ with $m \geq n$, the thin QR factorization is:

$$A = QR$$

where $Q \in \mathbb{R}^{m \times n}$ has orthonormal columns and $R \in \mathbb{R}^{n \times n}$ is upper triangular. The full QR extends $Q$ to an $m \times m$ orthogonal matrix.

Computation via Householder reflections: reflect each column to zero out its below-diagonal entries. Numerically superior to Gram-Schmidt, which accumulates errors. Cost: $2mn^2 - \frac{2}{3}n^3$ flops.

Least squares via QR: minimize $\|Ax - b\|_2$ by computing $A = QR$ and solving $Rx = Q^T b$ via backward substitution. This is the standard LAPACK dgels algorithm — more numerically stable than the normal equations $A^T A x = A^T b$, which square the condition number $\kappa(A)$.
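The two-line version with NumPy's (Householder-based) thin QR, checked against the library least-squares solver:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 4))              # overdetermined: 20 rows, 4 cols
b = rng.standard_normal(20)

Q, R = np.linalg.qr(A)                        # thin QR: Q is 20x4, R is 4x4
x = solve_triangular(R, Q.T @ b)              # back-substitute Rx = Q^T b
print(x)
```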

Rank-revealing QR: with column pivoting, $AP = QR$ where large diagonal entries of $R$ correspond to the most important columns. If $|R_{kk}|/|R_{11}| < \varepsilon_{\text{mach}}$, columns $k, \ldots, n$ are numerically rank-deficient. This gives an approximate rank and basis.
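A sketch of rank detection via `scipy.linalg.qr(..., pivoting=True)`, on a matrix built to have rank 3 (the threshold $10^{-10}$ is a looser, illustrative choice than $\varepsilon_{\text{mach}}$):

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(2)
B = rng.standard_normal((30, 3))
A = np.hstack([B, B @ rng.standard_normal((3, 2))])  # 5 columns, rank 3

Q, R, piv = qr(A, pivoting=True)              # AP = QR, column pivoting
d = np.abs(np.diag(R))                        # non-increasing by construction
rank = int(np.sum(d > 1e-10 * d[0]))          # numerical rank from the decay
print(rank, piv[:rank])                       # rank and most informative columns
```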

Comparison and When to Use Each

| Problem | Matrix structure | Use |
|---|---|---|
| Square system $Ax = b$ | General | LU with partial pivoting |
| Square system $Ax = b$ | SPD | Cholesky (2× faster) |
| Least squares $\min \lVert Ax - b \rVert$ | Thin rectangular | QR (stable) or normal equations + Cholesky (fast, less stable) |
| Symmetric indefinite | $A = A^T$, not PD | $LDL^T$ factorization (diagonal $D$, block-$2 \times 2$ pivots) |
| Diagonal + low-rank | $A = D + UV^T$ | Woodbury identity: $(D + UV^T)^{-1} = D^{-1} - D^{-1}U(I + V^T D^{-1}U)^{-1}V^T D^{-1}$ |

Multiple right-hand sides: factor once, then solve each $b_i$ with $O(n^2)$ triangular solves. For $k$ right-hand sides: $O(n^3 + kn^2)$ total vs $O(kn^3)$ if re-factorizing.
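The amortization pattern in SciPy is `lu_factor` once, then `lu_solve` per right-hand side:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(3)
n, k = 200, 50
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, k))               # k right-hand sides

factors = lu_factor(A)                        # O(n^3), paid exactly once
X = np.column_stack([lu_solve(factors, B[:, i]) for i in range(k)])
print(X.shape)                                # (200, 50)
```

(`lu_solve` also accepts the whole matrix `B` at once; the loop just makes the per-solve reuse explicit.)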

Worked Example

Example 1: LU for a $3 \times 3$ System

Solve $\begin{pmatrix}2&1&1\\4&3&3\\8&7&9\end{pmatrix}x = \begin{pmatrix}1\\1\\1\end{pmatrix}$ (pivoting skipped here to keep the arithmetic clean).

Step 1 (eliminate column 1): multipliers $L_{21} = 2$, $L_{31} = 4$. After elimination, $A^{(2)} = \begin{pmatrix}2&1&1\\0&1&1\\0&3&5\end{pmatrix}$.

Step 2 (eliminate column 2): multiplier $L_{32} = 3$. $U = \begin{pmatrix}2&1&1\\0&1&1\\0&0&2\end{pmatrix}$.

$L = \begin{pmatrix}1&0&0\\2&1&0\\4&3&1\end{pmatrix}$.

Forward sub: $Ly = b$ gives $y_1 = 1$, $y_2 = 1 - 2y_1 = -1$, $y_3 = 1 - 4y_1 - 3y_2 = 0$, so $y = (1, -1, 0)$.

Backward sub: $Ux = y$ gives $x_3 = 0$, $x_2 = -1$, $x_1 = 1$. Verify: row 1 of $Ax$ is $2(1) + 1(-1) + 1(0) = 1$ ✓, and rows 2 and 3 check out the same way.
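A quick numerical check of the factorization and the two substitution passes:

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[2., 1., 1.], [4., 3., 3.], [8., 7., 9.]])
L = np.array([[1., 0., 0.], [2., 1., 0.], [4., 3., 1.]])
U = np.array([[2., 1., 1.], [0., 1., 1.], [0., 0., 2.]])
b = np.ones(3)

y = solve_triangular(L, b, lower=True)        # forward substitution
x = solve_triangular(U, y)                    # backward substitution
print(y, x)
```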

Example 2: Cholesky for a Gaussian Process

GP posterior requires $(K + \sigma_n^2 I)^{-1} y$ where $K$ is PSD. Cholesky: $K + \sigma_n^2 I = LL^T$. Then $(K + \sigma_n^2 I)^{-1} y = L^{-T}(L^{-1}y)$ — two triangular solves.

For $n = 1000$: Cholesky costs $\frac{1}{3}(10^3)^3 = \frac{10^9}{3} \approx 3 \times 10^8$ flops. Each additional right-hand side (e.g., predicting at new test points) costs only $2n^2 = 2 \times 10^6$ flops. Factoring once and re-using is essential for GP hyperparameter optimization (which requires $\sim 10$ gradient evaluations, each needing $L^{-1} y$).
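SciPy packages this pattern as `cho_factor`/`cho_solve`. A small sketch with a made-up RBF kernel matrix (the kernel, noise level, and sizes are illustrative, not from the notes):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(4)
X = rng.standard_normal((50, 1))
K = np.exp(-0.5 * (X - X.T) ** 2)             # RBF kernel matrix, PSD
A = K + 1e-2 * np.eye(50)                     # K + sigma_n^2 I

c = cho_factor(A)                             # O(n^3), done once
alphas = [cho_solve(c, y)                     # two O(n^2) triangular solves each
          for y in rng.standard_normal((3, 50))]
print(len(alphas))
```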

Example 3: Rank-Revealing QR for Model Selection

Given a design matrix $X \in \mathbb{R}^{100 \times 50}$ (100 samples, 50 features) where some features are nearly collinear: compute a rank-revealing QR with threshold $\varepsilon = 10^{-10}$.

The diagonal of $R$ reveals: $R_{11} = 10.2$, $R_{22} = 8.7$, $\ldots$, $R_{40,40} = 0.1$, $R_{41,41} = 10^{-12}$. Numerical rank $= 40$: the last 10 columns are linearly dependent. The rank-revealing QR identifies the 40 most informative feature columns — equivalent to subset selection but with a numerically stable algorithm.

Connections

Where Your Intuition Breaks

QR factorization is presented as the numerically stable way to solve least-squares problems, correcting the instability of the normal equations $A^T A x = A^T b$ (which square the condition number). This is true — but only for Householder QR, not Gram-Schmidt. The classical Gram-Schmidt algorithm computes QR via sequential orthogonalization, but loses orthogonality rapidly when columns of $A$ are nearly linearly dependent: the computed $Q$ satisfies $\|Q^T Q - I\| \approx \varepsilon_{\text{mach}} \cdot \kappa(A)$, not $\varepsilon_{\text{mach}}$. Modified Gram-Schmidt is better but still unstable for ill-conditioned $A$. Householder QR applies orthogonal reflections — backward stable regardless of conditioning. In practice, all numerical libraries use Householder; Gram-Schmidt appears in textbooks and can silently corrupt results.

💡Intuition

Factorization is a one-time cost that amortizes over many solves. In scientific computing and machine learning, the same matrix $A$ often appears with different right-hand sides: multiple observations $y_1, y_2, \ldots$ in GP regression, multiple Newton steps in optimization, or multiple variance queries $x^T (K + \sigma^2 I)^{-1} x$ in Bayesian design. Factor once for $O(n^3)$, then each solve costs $O(n^2)$. This principle drives the design of LAPACK's two-stage API: getrf (factor) + getrs (solve). PyTorch's torch.linalg.solve calls this internally but exposes only a single call — the amortization is hidden.

💡Intuition

Cholesky failure is an SPD test. Attempting Cholesky on a matrix $A$ and hitting a negative argument under the square root (when computing $A_{jj} - \sum_k L_{jk}^2$) proves $A$ is not positive definite — useful for debugging covariance matrices in probabilistic models. If Cholesky fails on a kernel matrix that should be PSD, the culprit is usually: (1) a kernel function that is not actually PSD (e.g., a non-Mercer kernel), (2) rounding error making small eigenvalues negative, or (3) a bug in the kernel computation. A standard fix is a nugget/jitter diagonal $K \leftarrow K + \varepsilon I$ for small $\varepsilon \approx 10^{-6}$.
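The jitter fix is usually wrapped in a retry loop. A sketch (`safe_cholesky` and the escalation schedule are illustrative choices, not a library API):

```python
import numpy as np

def safe_cholesky(K, jitter=1e-6, max_tries=5):
    """Attempt Cholesky; on failure, retry with increasing diagonal jitter."""
    eps = jitter
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(K + eps * np.eye(K.shape[0]))
        except np.linalg.LinAlgError:
            eps *= 10                         # escalate: 1e-6, 1e-5, ...
    raise np.linalg.LinAlgError("matrix is far from positive definite")

K = np.ones((3, 3))                           # PSD but singular: plain Cholesky fails
L = safe_cholesky(K)
print(L)
```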

⚠️Warning

Never explicitly form $A^{-1}$. Computing $A^{-1}$ costs $O(n^3)$ and is less accurate than a direct solve. Every use of $A^{-1} b$ should be replaced with solve(A, b). In PyTorch, torch.linalg.inv(A) @ b is numerically inferior to torch.linalg.solve(A, b) — the latter uses LU directly, the former computes the inverse matrix first. Similarly, for SPD $A = LL^T$, $\operatorname{tr}(A^{-1} B)$ should be computed as $\operatorname{tr}(L^{-T}(L^{-1}B))$ via two triangular solves, not by explicitly inverting. Most numerical analysis bugs in ML code trace back to explicit matrix inversion.
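The trace trick in NumPy/SciPy, assuming an SPD $A$; the explicit-inverse version appears here only as a reference to check against, not as a pattern to copy:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(5)
M = rng.standard_normal((30, 30))
A = M @ M.T + 30 * np.eye(30)                 # SPD by construction
B = rng.standard_normal((30, 30))

c = cho_factor(A)                             # A = L L^T
tr = np.trace(cho_solve(c, B))                # tr(A^{-1} B), no inverse formed
tr_ref = np.trace(np.linalg.inv(A) @ B)      # the anti-pattern, for comparison
print(tr, tr_ref)
```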
