Direct Linear Solvers: LU, Cholesky & QR Factorizations
Factorizing a matrix decomposes an expensive problem (solving $Ax = b$) into two cheap triangular solves. LU factorization handles general square matrices; Cholesky exploits positive definiteness for a 2× speedup; QR handles rectangular systems and overdetermined least-squares problems. Every major numerical linear algebra library — LAPACK, NumPy, PyTorch's torch.linalg — is built on these three factorizations.
Concepts
Solving $Ax = b$ by computing $A^{-1}$ explicitly is both slower and less accurate than a factorization-based solve. The key insight is that you never need $A^{-1}$ itself: you need the solution $x = A^{-1}b$. Factorizing $A$ into triangular factors reduces the problem to two substitution passes, each costing $O(n^2)$. The factorization is paid once and amortized over every right-hand side thereafter.
LU Factorization
Gaussian elimination reduces $A$ to an upper triangular matrix $U$ via a sequence of row operations, encoded in a unit lower triangular matrix $L$:

$$PA = LU$$

where $P$ is a permutation matrix (partial pivoting: swap rows to put the largest remaining element in the current column on the diagonal). The factorization exists and is unique when $A$ is nonsingular, with the permutation fixed by the pivoting rule.
The permutation is not a convenience — it is required for numerical stability. Without pivoting, Gaussian elimination divides by pivot elements that can be arbitrarily small, causing multipliers to blow up and amplifying rounding errors catastrophically. Partial pivoting bounds all multipliers by 1, making LU stable in practice for virtually every matrix that arises in applications.
Algorithm: for $k = 1, \dots, n-1$: choose pivot row $p = \arg\max_{i \ge k} |a_{ik}|$; swap rows $k$ and $p$; compute multipliers $\ell_{ik} = a_{ik}/a_{kk}$ for $i > k$; eliminate column $k$ below the pivot. Cost: $\tfrac{2}{3}n^3$ flops.
Solving $Ax = b$: (1) compute $Pb$; (2) forward substitution: solve $Ly = Pb$ in $O(n^2)$; (3) backward substitution: solve $Ux = y$ in $O(n^2)$. The cost is front-loaded in the factorization — multiple right-hand sides reuse $L$ and $U$.
Stability: partial pivoting bounds the growth factor by $2^{n-1}$ (theoretical worst case), but in practice the growth is almost always modest. Complete pivoting gives tighter bounds but is rarely needed.
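In SciPy, this factor-then-solve pipeline maps directly onto lu_factor/lu_solve, which wrap the LAPACK routines. A minimal sketch (the 3×3 matrix and right-hand side are illustrative):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[2.0, 1.0, 1.0],
              [4.0, 3.0, 3.0],
              [8.0, 7.0, 9.0]])
b = np.array([4.0, 10.0, 24.0])

# Factor once: returns the packed L\U matrix and the pivot indices.
lu, piv = lu_factor(A)          # PA = LU under the hood (LAPACK getrf)
x = lu_solve((lu, piv), b)      # two triangular solves (LAPACK getrs)

assert np.allclose(A @ x, b)    # residual check: x solves the system
```

Because the factorization is stored, every further right-hand side goes through lu_solve alone at $O(n^2)$ cost.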
Cholesky Factorization
For a symmetric positive definite matrix $A$ (all eigenvalues $\lambda_i > 0$), the unique Cholesky factorization is:

$$A = LL^T$$

where $L$ is lower triangular with positive diagonal. Algorithm: for $j = 1, \dots, n$:

$$\ell_{jj} = \sqrt{a_{jj} - \sum_{k=1}^{j-1} \ell_{jk}^2}, \qquad \ell_{ij} = \frac{a_{ij} - \sum_{k=1}^{j-1} \ell_{ik}\,\ell_{jk}}{\ell_{jj}} \quad (i > j)$$
Cost: $\tfrac{1}{3}n^3$ flops — exactly half of LU. Storage: only the lower triangle, so $n(n+1)/2$ entries.
Stability: Cholesky is unconditionally backward stable without pivoting (no growth factor). The diagonal entries are always positive for PD matrices, so no zero-pivot failures occur.
Test for positive definiteness: the Cholesky factorization succeeds iff $A$ is symmetric positive definite. If it fails (a negative argument under a square root, which would make $\ell_{jj}$ imaginary), $A$ is indefinite or singular.
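As a sketch of this test in NumPy: np.linalg.cholesky raises LinAlgError on a non-positive-definite input, so the SPD check is a try/except.

```python
import numpy as np

def is_spd(A):
    """SPD test: Cholesky succeeds iff the symmetric matrix is positive definite."""
    if not np.allclose(A, A.T):
        return False
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

assert is_spd(np.array([[2.0, 1.0], [1.0, 2.0]]))        # eigenvalues 1, 3
assert not is_spd(np.array([[1.0, 2.0], [2.0, 1.0]]))    # eigenvalues 3, -1
```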
QR Factorization
For $A \in \mathbb{R}^{m \times n}$ with $m \ge n$, the thin QR factorization is:

$$A = QR$$

where $Q \in \mathbb{R}^{m \times n}$ has orthonormal columns and $R \in \mathbb{R}^{n \times n}$ is upper triangular. The full QR extends $Q$ to an $m \times m$ orthogonal matrix.
Computation via Householder reflections: reflect each column to zero out its below-diagonal entries. Numerically superior to Gram-Schmidt, which accumulates orthogonality errors. Cost: $2mn^2 - \tfrac{2}{3}n^3$ flops.
Least squares via QR: minimize $\|Ax - b\|_2$ by computing $A = QR$ and solving $Rx = Q^T b$ via backward substitution. This is the approach behind LAPACK's QR-based least-squares driver dgels — more numerically stable than the normal equations because it avoids squaring the condition number.
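A sketch of that recipe with NumPy and SciPy (random illustrative data); it should agree with the library least-squares solver:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))    # tall: 100 equations, 5 unknowns
b = rng.standard_normal(100)

# Thin QR, then back-substitution on R x = Q^T b.
Q, R = np.linalg.qr(A)               # Householder QR, reduced form
x = solve_triangular(R, Q.T @ b)     # R is upper triangular

x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_ref)         # same minimizer as the library routine
```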
Rank-revealing QR: $A\Pi = QR$ with column pivoting, where the diagonal entries of $R$ decrease in magnitude and the largest ones correspond to the most important columns. If $|r_{kk}| < \tau\,|r_{11}|$ for a tolerance $\tau$, columns $k, \dots, n$ are numerically rank-deficient. This gives an approximate rank and basis.
Comparison and When to Use Each
| Problem | Matrix structure | Use |
|---|---|---|
| Square system | General | LU with partial pivoting |
| Square system | SPD | Cholesky (2× faster) |
| Least squares | Thin rectangular | QR (stable) or normal equations + Cholesky (fast, less stable) |
| Square system | Symmetric indefinite (not PD) | $LDL^T$ factorization ($D$ diagonal with 1×1/2×2 pivot blocks) |
| Diagonal + low-rank update | $A = D + UV^T$ | Woodbury identity: $(D + UV^T)^{-1} = D^{-1} - D^{-1}U(I + V^T D^{-1} U)^{-1}V^T D^{-1}$ |
Multiple right-hand sides: factor once, solve each with triangular solves. For $k$ right-hand sides: $\tfrac{2}{3}n^3 + 2kn^2$ flops total vs $\tfrac{2}{3}kn^3$ if re-factorizing.
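The Woodbury identity from the table can be checked numerically. A sketch with random illustrative data: the $n \times n$ solve is replaced by a $k \times k$ one plus diagonal scalings.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 5
d = rng.uniform(1.0, 2.0, n)                 # diagonal of D
U = rng.standard_normal((n, k)) / np.sqrt(n) # low-rank factors (scaled for
V = rng.standard_normal((n, k)) / np.sqrt(n) # a well-conditioned example)
b = rng.standard_normal(n)

# Woodbury: (D + U V^T)^{-1} b via a k-by-k "capacitance" system only.
Dinv_b = b / d
Dinv_U = U / d[:, None]
cap = np.eye(k) + V.T @ Dinv_U
x = Dinv_b - Dinv_U @ np.linalg.solve(cap, V.T @ Dinv_b)

# Compare against the dense n-by-n solve.
x_ref = np.linalg.solve(np.diag(d) + U @ V.T, b)
assert np.allclose(x, x_ref)
```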
Worked Example
Example 1: LU for a 3×3 System
Solve $Ax = b$ with (illustrative values)
$$A = \begin{pmatrix} 4 & 3 & 2 \\ 2 & 4 & 1 \\ 2 & 1 & 3 \end{pmatrix}, \qquad b = \begin{pmatrix} 8 \\ 9 \\ 1 \end{pmatrix}.$$
Each pivot here is already the largest entry in its column, so partial pivoting performs no swaps and $P = I$.
Step 1 (eliminate column 1): multipliers $\ell_{21} = 2/4 = 0.5$, $\ell_{31} = 2/4 = 0.5$. After elimination, rows 2 and 3 become $(0, 2.5, 0)$ and $(0, -0.5, 2)$.
Step 2 (eliminate column 2): multiplier $\ell_{32} = -0.5/2.5 = -0.2$. Row 3 becomes $(0, 0, 2)$.
$$L = \begin{pmatrix} 1 & 0 & 0 \\ 0.5 & 1 & 0 \\ 0.5 & -0.2 & 1 \end{pmatrix}, \qquad U = \begin{pmatrix} 4 & 3 & 2 \\ 0 & 2.5 & 0 \\ 0 & 0 & 2 \end{pmatrix}.$$
Forward sub: $Ly = b$ gives $y = (8, 5, -2)$.
Backward sub: $Ux = y$ gives $x = (1, 2, -1)$. Verify: $Ax = (4+6-2,\; 2+8-1,\; 2+2-3) = (8, 9, 1)$ ✓.
Example 2: Cholesky for a Gaussian Process
GP posterior requires $\alpha = (K + \sigma^2 I)^{-1} y$ where $K$ is PSD. Cholesky: $K + \sigma^2 I = LL^T$. Then $\alpha = L^{-T}(L^{-1} y)$ — two triangular solves.
For a training set of size $n$: Cholesky costs $\tfrac{1}{3}n^3$ flops. Each additional right-hand side (e.g., predicting at new test points) costs only $O(n^2)$ flops. Factoring once and re-using $L$ is essential for GP hyperparameter optimization, where every gradient evaluation needs fresh solves against $K + \sigma^2 I$.
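A sketch of the factor-once pattern with SciPy's cho_factor/cho_solve (the kernel, data, and hyperparameters below are illustrative):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(2)
X = rng.uniform(0, 5, 50)                       # 1-D training inputs
y = np.sin(X) + 0.1 * rng.standard_normal(50)   # noisy targets
sigma2 = 0.01                                   # noise variance

# Squared-exponential kernel matrix (unit lengthscale, illustrative).
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)

# Factor K + sigma^2 I once; reuse the factor for every subsequent solve.
c, low = cho_factor(K + sigma2 * np.eye(50))
alpha = cho_solve((c, low), y)                  # (K + sigma^2 I)^{-1} y

assert np.allclose((K + sigma2 * np.eye(50)) @ alpha, y)
```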
Example 3: Rank-Revealing QR for Model Selection
Given a design matrix $X \in \mathbb{R}^{100 \times 50}$ (100 samples, 50 features) where some features are nearly collinear: compute the rank-revealing QR with a small tolerance $\tau$.
The diagonal of $R$ reveals a sharp drop in magnitude: $|r_{kk}|$ falls below the tolerance after the 40th entry. Numerical rank $r = 40$: the last 10 columns are numerically linearly dependent on the others. The rank-revealing QR identifies the 40 most informative feature columns — equivalent to subset selection but with a numerically stable algorithm.
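A sketch of this experiment with SciPy's column-pivoted QR, on a synthetic matrix built to have rank exactly 40:

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(3)
# 100 x 50 design matrix with true rank 40: the last 10 columns are
# linear combinations of the first 40.
B = rng.standard_normal((100, 40))
C = rng.standard_normal((40, 10))
X = np.hstack([B, B @ C])

# Column-pivoted (rank-revealing) QR: X[:, piv] = Q @ R.
Q, R, piv = qr(X, mode='economic', pivoting=True)

d = np.abs(np.diag(R))
rank = int(np.sum(d > 1e-10 * d[0]))   # count diagonals above the tolerance
assert rank == 40
```

The pivot array piv lists the columns in order of importance; its first 40 entries are the selected features.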
Connections
Where Your Intuition Breaks
QR factorization is presented as the numerically stable way to solve least-squares problems, correcting the instability of the normal equations (which square the condition number). This is true — but only for Householder QR, not Gram-Schmidt. The classical Gram-Schmidt algorithm computes QR via orthogonalization, but loses orthogonality rapidly when columns of $A$ are nearly linearly dependent: the computed $Q$ satisfies $\|Q^T Q - I\| = O(\kappa(A)^2 \varepsilon)$, not $O(\varepsilon)$. Modified Gram-Schmidt is better ($O(\kappa(A)\varepsilon)$) but still unstable for ill-conditioned $A$. Householder QR applies orthogonal reflections — backward stable regardless of conditioning. In practice, all major numerical libraries use Householder; Gram-Schmidt appears in textbooks and can silently corrupt results.
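The loss of orthogonality is easy to see numerically. A sketch comparing classical Gram-Schmidt (implemented below for illustration) against NumPy's Householder-based QR on an ill-conditioned Hilbert matrix:

```python
import numpy as np

def classical_gram_schmidt(A):
    """Classical Gram-Schmidt QR; unstable for ill-conditioned A."""
    m, n = A.shape
    Q = np.zeros((m, n))
    for j in range(n):
        # Project out all previous columns at once, from the original a_j.
        v = A[:, j] - Q[:, :j] @ (Q[:, :j].T @ A[:, j])
        Q[:, j] = v / np.linalg.norm(v)
    return Q

# Hilbert matrix: condition number grows explosively with n.
n = 10
H = 1.0 / (np.arange(n)[:, None] + np.arange(n)[None, :] + 1)

Q_cgs = classical_gram_schmidt(H)
Q_house, _ = np.linalg.qr(H)        # Householder under the hood

I = np.eye(n)
err_cgs = np.linalg.norm(Q_cgs.T @ Q_cgs - I)
err_house = np.linalg.norm(Q_house.T @ Q_house - I)

assert err_house < 1e-12            # machine-precision orthogonality
assert err_cgs > 1e-3               # orthogonality destroyed
```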
Factorization is a one-time cost that amortizes over many solves. In scientific computing and machine learning, the same matrix often appears with different right-hand sides: multiple observations in GP regression, multiple Newton steps in optimization, or multiple variance queries in Bayesian design. Factor once for $O(n^3)$, then each solve costs $O(n^2)$. This principle drives the design of LAPACK's two-stage API: getrf (factor) + getrs (solve). PyTorch's torch.linalg.solve calls this internally but exposes only a single call — the amortization is hidden.
Cholesky failure is an SPD test. Attempting Cholesky on a matrix and hitting a negative argument under a square root (while computing $\ell_{jj}$) proves $A$ is not positive definite — useful for debugging covariance matrices in probabilistic models. If Cholesky fails on a kernel matrix that should be PSD, the culprit is usually: (1) a kernel function that is not actually PSD (e.g., a non-Mercer kernel), (2) rounding error pushing small eigenvalues slightly negative, or (3) a bug in the kernel computation. A standard fix is a nugget/jitter: add $\epsilon I$ to the diagonal for a small $\epsilon$ (e.g., $10^{-6}$).
Never explicitly form $A^{-1}$. Computing the inverse costs roughly three times as much as an LU factorization and introduces extra rounding error. Every use of $A^{-1}b$ should be replaced with solve(A, b). In PyTorch, torch.linalg.inv(A) @ b is numerically inferior to torch.linalg.solve(A, b) — the latter uses LU directly, the former computes the inverse matrix first. Similarly, quantities like $A^{-1}B$ should be computed via triangular solves with the factors of $A$, not by explicitly inverting. Most numerical linear algebra bugs in ML code trace back to explicit matrix inversion.
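A quick comparison of the two routes (the Hilbert matrix is chosen only because it is ill-conditioned enough to make the difference visible):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8
# Hilbert matrix: ill-conditioned, so rounding differences show up.
A = 1.0 / (np.arange(n)[:, None] + np.arange(n)[None, :] + 1)
b = rng.standard_normal(n)

x_solve = np.linalg.solve(A, b)     # LU with partial pivoting (LAPACK gesv)
x_inv = np.linalg.inv(A) @ b        # explicit inverse: slower, less accurate

r_solve = np.linalg.norm(A @ x_solve - b)
r_inv = np.linalg.norm(A @ x_inv - b)
print(f"residual via solve: {r_solve:.2e}, via inv: {r_inv:.2e}")

# solve() is backward stable: its residual stays tiny despite the conditioning.
assert r_solve < 1e-4
```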