
Direct Linear Solvers: LU, Cholesky & QR Factorizations

Factorizing a matrix decomposes an expensive problem (solving $Ax = b$) into two cheap triangular solves. LU factorization handles general square matrices; Cholesky exploits positive definiteness for a 2× speedup; QR handles rectangular systems and overdetermined least-squares problems. Every numerical linear algebra routine — LAPACK, NumPy, PyTorch's torch.linalg — is built on these three factorizations.

Concepts

Solving $Ax = b$ by computing $A^{-1}$ explicitly is both slower and less accurate than a direct solve. The key insight is that you never need $A^{-1}$ itself: you need the product $A^{-1}b$. Factorizing $A$ into triangular factors reduces the problem to two substitution passes (one forward, one backward), each costing $O(n^2)$. The $O(n^3)$ factorization is paid once and amortized over every right-hand side thereafter.

LU Factorization

Gaussian elimination reduces $A \in \mathbb{R}^{n \times n}$ to upper triangular form $U$ via a sequence of row operations, encoded as a unit lower triangular matrix $L$:

$$PA = LU$$

where $P$ is a permutation matrix (partial pivoting: swap rows to put the largest-magnitude element in the current column on the diagonal). The factorization exists and is unique when $A$ is nonsingular, once the permutation is chosen.

The permutation $P$ is not a convenience — it is required for numerical stability. Without pivoting, Gaussian elimination divides by pivot elements that can be arbitrarily small, causing multipliers $L_{ik} = A_{ik}/A_{kk}$ to blow up and amplifying rounding errors catastrophically. Partial pivoting bounds all multipliers by 1, making LU stable in practice for virtually every matrix that arises in applications.
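The blow-up is easy to reproduce. A minimal sketch in NumPy, using the standard textbook $2 \times 2$ example with a tiny leading pivot (the matrix is illustrative, not from these notes); elimination without pivoting destroys the answer, while LAPACK's pivoted solve recovers it:

```python
import numpy as np

# Without pivoting: the multiplier is A[1,0]/A[0,0] = 1e20, and
# U[1,1] = 1 - 1e20 rounds to -1e20, wiping out the right-hand side.
A = np.array([[1e-20, 1.], [1., 1.]])
b = np.array([1., 2.])

m = A[1, 0] / A[0, 0]                       # multiplier 1e20
U22 = A[1, 1] - m * A[0, 1]                 # rounds to -1e20
y2 = b[1] - m * b[0]                        # rounds to -1e20
x2 = y2 / U22                               # 1.0
x1 = (b[0] - x2) / A[0, 0]                  # 0.0 — catastrophically wrong
x_bad = np.array([x1, x2])

x_good = np.linalg.solve(A, b)              # pivoted LU: x ≈ (1, 1)
print(x_bad, x_good)
```

The true solution is approximately $(1, 1)$; the unpivoted elimination returns $(0, 1)$.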

Algorithm: for $k = 1, \ldots, n-1$: choose pivot row $r \geq k$ with $|A_{rk}| = \max_{i \geq k} |A_{ik}|$; swap rows $r$ and $k$; compute multipliers $L_{ik} = A_{ik}/A_{kk}$; eliminate column $k$ below the pivot. Cost: $\frac{2}{3}n^3$ flops.

Solving $Ax = b$: (1) compute $PA = LU$; (2) forward substitution: solve $Ly = Pb$ in $O(n^2)$; (3) backward substitution: solve $Ux = y$ in $O(n^2)$. The $O(n^3)$ cost is front-loaded in the factorization — multiple right-hand sides reuse $L, U$.
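The three steps can be sketched with SciPy's triangular-solve routines (note that `scipy.linalg.lu` returns the convention $A = PLU$, so the forward pass solves $Ly = P^T b$):

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
b = rng.standard_normal(5)

P, L, U = lu(A)                                # step 1: factor, O(n^3)
y = solve_triangular(L, P.T @ b, lower=True)   # step 2: forward substitution
x = solve_triangular(U, y)                     # step 3: backward substitution
print(np.max(np.abs(A @ x - b)))               # residual near machine precision
```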

Stability: partial pivoting bounds the growth factor by $\rho \leq 2^{n-1}$ (theoretical worst case), but in practice $\rho = O(n^{2/3})$. Complete pivoting gives tighter bounds but is rarely needed.

Cholesky Factorization

For a symmetric positive definite matrix $A$ (all eigenvalues $> 0$), the unique Cholesky factorization is:

$$A = LL^T$$

where $L$ is lower triangular with positive diagonal. Algorithm: for $j = 1, \ldots, n$:

$$L_{jj} = \sqrt{A_{jj} - \sum_{k=1}^{j-1} L_{jk}^2}, \qquad L_{ij} = \frac{1}{L_{jj}}\left(A_{ij} - \sum_{k=1}^{j-1} L_{ik}L_{jk}\right), \quad i > j.$$
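The column-by-column recurrence translates directly into code. A textbook sketch (production code would call `np.linalg.cholesky`, which wraps LAPACK's blocked routine):

```python
import numpy as np

def cholesky(A):
    """Unblocked textbook Cholesky: returns lower triangular L with A = L L^T."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        s = A[j, j] - L[j, :j] @ L[j, :j]           # A_jj - sum_k L_jk^2
        L[j, j] = np.sqrt(s)                        # negative s => A not PD
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L

M = np.array([[4., 2.], [2., 3.]])
L = cholesky(M)
print(L)
```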

Cost: $\frac{1}{3}n^3$ flops — exactly half of LU. Storage: only the lower triangle, so $n(n+1)/2$ entries.

Stability: Cholesky is unconditionally backward stable without pivoting (no growth factor). The diagonal entries $L_{jj}$ are always positive for PD matrices, so no zero-pivot failures occur.

Test for positive definiteness: the Cholesky factorization succeeds iff $A$ is symmetric positive definite. If it fails (a negative argument under the square root for some $L_{jj}$), $A$ is indefinite or singular.
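This test is one try/except in practice. A small helper sketch (`is_spd` is a hypothetical name, not a library function):

```python
import numpy as np

def is_spd(A):
    """Cheapest practical SPD test: check symmetry, then attempt Cholesky."""
    if not np.allclose(A, A.T):
        return False
    try:
        np.linalg.cholesky(A)                 # LAPACK raises if A is not PD
        return True
    except np.linalg.LinAlgError:
        return False

print(is_spd(np.array([[2., 1.], [1., 2.]])))   # eigenvalues 3, 1: PD
print(is_spd(np.array([[1., 2.], [2., 1.]])))   # eigenvalues 3, -1: indefinite
```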

QR Factorization

For $A \in \mathbb{R}^{m \times n}$ with $m \geq n$, the thin QR factorization is:

$$A = QR$$

where $Q \in \mathbb{R}^{m \times n}$ has orthonormal columns and $R \in \mathbb{R}^{n \times n}$ is upper triangular. The full QR extends $Q$ to an $m \times m$ orthogonal matrix.

Computation via Householder reflections: reflect each column to zero out its below-diagonal entries. Numerically superior to Gram-Schmidt, which accumulates errors. Cost: $2mn^2 - \frac{2}{3}n^3$ flops.

Least squares via QR: minimize $\|Ax - b\|_2$ by computing $A = QR$ and solving $Rx = Q^T b$ via backward substitution. This is the standard LAPACK dgels algorithm — more numerically stable than the normal equations $A^T A x = A^T b$, which square the condition number $\kappa(A)$.
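The two-line version with NumPy's (Householder-based) thin QR, checked against the library least-squares solver:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 4))              # overdetermined: 20 rows, 4 cols
b = rng.standard_normal(20)

Q, R = np.linalg.qr(A)                        # thin QR: Q is 20x4, R is 4x4
x = solve_triangular(R, Q.T @ b)              # back-substitute Rx = Q^T b
print(x)
```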

Rank-revealing QR: with column pivoting, $AP = QR$ where large diagonal entries of $R$ correspond to the most important columns. If $|R_{kk}|/|R_{11}| < \varepsilon_{\text{mach}}$, columns $k, \ldots, n$ are numerically rank-deficient. This gives an approximate rank and basis.
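A sketch of rank detection via `scipy.linalg.qr(..., pivoting=True)`, on a matrix built to have rank 3 (the threshold $10^{-10}$ is a looser, illustrative choice than $\varepsilon_{\text{mach}}$):

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(2)
B = rng.standard_normal((30, 3))
A = np.hstack([B, B @ rng.standard_normal((3, 2))])  # 5 columns, rank 3

Q, R, piv = qr(A, pivoting=True)              # AP = QR, column pivoting
d = np.abs(np.diag(R))                        # non-increasing by construction
rank = int(np.sum(d > 1e-10 * d[0]))          # numerical rank from the decay
print(rank, piv[:rank])                       # rank and most informative columns
```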

Comparison and When to Use Each

| Problem | Matrix structure | Use |
|---|---|---|
| Square system $Ax = b$ | General | LU with partial pivoting |
| Square system $Ax = b$ | SPD | Cholesky (2× faster) |
| Least squares $\min \lVert Ax - b \rVert$ | Thin rectangular | QR (stable) or normal equations + Cholesky (fast, less stable) |
| Symmetric indefinite | $A = A^T$, not PD | $LDL^T$ factorization (diagonal $D$, block-$2 \times 2$ pivots) |
| Diagonal + low-rank | $A = D + UV^T$ | Woodbury identity: $(D + UV^T)^{-1} = D^{-1} - D^{-1}U(I + V^T D^{-1}U)^{-1}V^T D^{-1}$ |

Multiple right-hand sides: factor once, then solve each $b_i$ with $O(n^2)$ triangular solves. For $k$ right-hand sides: $O(n^3 + kn^2)$ total vs $O(kn^3)$ if re-factorizing.
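The amortization pattern in SciPy is `lu_factor` once, then `lu_solve` per right-hand side:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(3)
n, k = 200, 50
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, k))               # k right-hand sides

factors = lu_factor(A)                        # O(n^3), paid exactly once
X = np.column_stack([lu_solve(factors, B[:, i]) for i in range(k)])
print(X.shape)                                # (200, 50)
```

(`lu_solve` also accepts the whole matrix `B` at once; the loop just makes the per-solve reuse explicit.)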

Worked Example

Example 1: LU for a $3 \times 3$ System

Solve $\begin{pmatrix}2&1&1\\4&3&3\\8&7&9\end{pmatrix}x = \begin{pmatrix}1\\1\\1\end{pmatrix}$ (pivoting skipped here to keep the arithmetic clean).

Step 1 (eliminate column 1): multipliers $L_{21} = 2$, $L_{31} = 4$. After elimination, $A^{(2)} = \begin{pmatrix}2&1&1\\0&1&1\\0&3&5\end{pmatrix}$.

Step 2 (eliminate column 2): multiplier $L_{32} = 3$. $U = \begin{pmatrix}2&1&1\\0&1&1\\0&0&2\end{pmatrix}$.

$L = \begin{pmatrix}1&0&0\\2&1&0\\4&3&1\end{pmatrix}$.

Forward sub: $Ly = b$ gives $y_1 = 1$, $y_2 = 1 - 2y_1 = -1$, $y_3 = 1 - 4y_1 - 3y_2 = 0$, so $y = (1, -1, 0)$.

Backward sub: $Ux = y$ gives $x_3 = 0$, $x_2 = -1$, $x_1 = 1$. Verify: row 1 of $Ax$ is $2(1) + 1(-1) + 1(0) = 1$ ✓, and rows 2 and 3 check out the same way.
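A quick numerical check of the factorization and the two substitution passes:

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[2., 1., 1.], [4., 3., 3.], [8., 7., 9.]])
L = np.array([[1., 0., 0.], [2., 1., 0.], [4., 3., 1.]])
U = np.array([[2., 1., 1.], [0., 1., 1.], [0., 0., 2.]])
b = np.ones(3)

y = solve_triangular(L, b, lower=True)        # forward substitution
x = solve_triangular(U, y)                    # backward substitution
print(y, x)
```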

Example 2: Cholesky for a Gaussian Process

GP posterior requires $(K + \sigma_n^2 I)^{-1} y$ where $K$ is PSD. Cholesky: $K + \sigma_n^2 I = LL^T$. Then $(K + \sigma_n^2 I)^{-1} y = L^{-T}(L^{-1}y)$ — two triangular solves.

For $n = 1000$: Cholesky costs $\frac{1}{3}(10^3)^3 = \frac{10^9}{3} \approx 3 \times 10^8$ flops. Each additional right-hand side (e.g., predicting at new test points) costs only $2n^2 = 2 \times 10^6$ flops. Factoring once and re-using is essential for GP hyperparameter optimization (which requires $\sim 10$ gradient evaluations, each needing $L^{-1} y$).
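SciPy packages this pattern as `cho_factor`/`cho_solve`. A small sketch with a made-up RBF kernel matrix (the kernel, noise level, and sizes are illustrative, not from the notes):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(4)
X = rng.standard_normal((50, 1))
K = np.exp(-0.5 * (X - X.T) ** 2)             # RBF kernel matrix, PSD
A = K + 1e-2 * np.eye(50)                     # K + sigma_n^2 I

c = cho_factor(A)                             # O(n^3), done once
alphas = [cho_solve(c, y)                     # two O(n^2) triangular solves each
          for y in rng.standard_normal((3, 50))]
print(len(alphas))
```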

Example 3: Rank-Revealing QR for Model Selection

Given a design matrix $X \in \mathbb{R}^{100 \times 50}$ (100 samples, 50 features) where some features are nearly collinear: compute a rank-revealing QR with threshold $\varepsilon = 10^{-10}$.

The diagonal of $R$ reveals: $R_{11} = 10.2$, $R_{22} = 8.7$, $\ldots$, $R_{40,40} = 0.1$, $R_{41,41} = 10^{-12}$. Numerical rank $= 40$: the last 10 columns are linearly dependent. The rank-revealing QR identifies the 40 most informative feature columns — equivalent to subset selection but with a numerically stable algorithm.

Connections

Where Your Intuition Breaks

QR factorization is presented as the numerically stable way to solve least-squares problems, correcting the instability of the normal equations $A^T A x = A^T b$ (which square the condition number). This is true — but only for Householder QR, not Gram-Schmidt. The classical Gram-Schmidt algorithm computes QR via sequential orthogonalization, but loses orthogonality rapidly when columns of $A$ are nearly linearly dependent: the computed $Q$ satisfies $\|Q^T Q - I\| \approx \varepsilon_{\text{mach}} \cdot \kappa(A)$, not $\varepsilon_{\text{mach}}$. Modified Gram-Schmidt is better but still unstable for ill-conditioned $A$. Householder QR applies orthogonal reflections — backward stable regardless of conditioning. In practice, all numerical libraries use Householder; Gram-Schmidt appears in textbooks and can silently corrupt results.

💡Intuition

Factorization is a one-time cost that amortizes over many solves. In scientific computing and machine learning, the same matrix $A$ often appears with different right-hand sides: multiple observations $y_1, y_2, \ldots$ in GP regression, multiple Newton steps in optimization, or multiple variance queries $x^T (K + \sigma^2 I)^{-1} x$ in Bayesian design. Factor once for $O(n^3)$, then each solve costs $O(n^2)$. This principle drives the design of LAPACK's two-stage API: getrf (factor) + getrs (solve). PyTorch's torch.linalg.solve calls this internally but exposes only a single call — the amortization is hidden.

💡Intuition

Cholesky failure is an SPD test. Attempting Cholesky on a matrix $A$ and hitting a negative argument under the square root (when computing $A_{jj} - \sum_k L_{jk}^2$) proves $A$ is not positive definite — useful for debugging covariance matrices in probabilistic models. If Cholesky fails on a kernel matrix that should be PSD, the culprit is usually: (1) a kernel function that is not actually PSD (e.g., a non-Mercer kernel), (2) rounding error making small eigenvalues negative, or (3) a bug in the kernel computation. A standard fix is a nugget/jitter diagonal $K \leftarrow K + \varepsilon I$ for small $\varepsilon \approx 10^{-6}$.
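The jitter fix is usually wrapped in a retry loop. A sketch (`safe_cholesky` and the escalation schedule are illustrative choices, not a library API):

```python
import numpy as np

def safe_cholesky(K, jitter=1e-6, max_tries=5):
    """Attempt Cholesky; on failure, retry with increasing diagonal jitter."""
    eps = jitter
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(K + eps * np.eye(K.shape[0]))
        except np.linalg.LinAlgError:
            eps *= 10                         # escalate: 1e-6, 1e-5, ...
    raise np.linalg.LinAlgError("matrix is far from positive definite")

K = np.ones((3, 3))                           # PSD but singular: plain Cholesky fails
L = safe_cholesky(K)
print(L)
```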

⚠️Warning

Never explicitly form $A^{-1}$. Computing $A^{-1}$ costs $O(n^3)$ and is less accurate than a direct solve. Every use of $A^{-1} b$ should be replaced with solve(A, b). In PyTorch, torch.linalg.inv(A) @ b is numerically inferior to torch.linalg.solve(A, b) — the latter uses LU directly, the former computes the inverse matrix first. Similarly, for SPD $A = LL^T$, $\operatorname{tr}(A^{-1} B)$ should be computed as $\operatorname{tr}(L^{-T}(L^{-1}B))$ via two triangular solves, not by explicitly inverting. Most numerical analysis bugs in ML code trace back to explicit matrix inversion.
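The trace trick in NumPy/SciPy, assuming an SPD $A$; the explicit-inverse version appears here only as a reference to check against, not as a pattern to copy:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(5)
M = rng.standard_normal((30, 30))
A = M @ M.T + 30 * np.eye(30)                 # SPD by construction
B = rng.standard_normal((30, 30))

c = cho_factor(A)                             # A = L L^T
tr = np.trace(cho_solve(c, B))                # tr(A^{-1} B), no inverse formed
tr_ref = np.trace(np.linalg.inv(A) @ B)      # the anti-pattern, for comparison
print(tr, tr_ref)
```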
