SVD, QR & LU Decompositions
Matrix decompositions are the central algorithmic tool of numerical linear algebra: they rewrite a matrix as a product of structured factors (orthogonal, triangular, diagonal) that make subsequent operations — solving systems, computing inverses, finding least-squares solutions, approximating rank — cheap, stable, and interpretable. The Singular Value Decomposition (SVD) in particular is the universal factorization, applicable to any matrix regardless of shape or rank, and underlies dimensionality reduction, low-rank approximation, pseudoinverses, and the analysis of neural network weight matrices.
Concepts
Figure: SVD Geometry — A = U Σ Vᵀ (rotate → scale → rotate). Interactive figure: each panel shows the unit circle after applying one stage (Vᵀ, then Σ, then U); the final panel equals Ax. The example matrix is full rank — two nonzero singular values, two rotation stages — with σ₁ = 2.558, σ₂ = 0.977 (κ = σ₁/σ₂ = 2.62), v₁ = (0.768, 0.641), v₂ = (−0.641, 0.768), u₁ = (0.851, 0.526), u₂ = (−0.526, 0.851).
Every matrix — whether it compresses images, encodes user preferences, or transforms activations in a neural network — does the same thing geometrically: rotate, scale, rotate. The SVD makes this explicit: any matrix decomposes as A = UΣVᵀ, two rotations and a diagonal scaling. This universality is why SVD underlies PCA, low-rank approximation, pseudoinverses, and virtually every other dimension-reduction algorithm.
The Singular Value Decomposition (SVD)
Theorem (SVD). For any matrix A ∈ ℝᵐˣⁿ, there exist orthogonal matrices U ∈ ℝᵐˣᵐ, V ∈ ℝⁿˣⁿ, and a matrix Σ ∈ ℝᵐˣⁿ with nonnegative entries σ₁ ≥ σ₂ ≥ ⋯ ≥ 0 on the diagonal (and zeros elsewhere) such that A = UΣVᵀ.
Two separate rotation matrices U and V are needed rather than one because the domain (ℝⁿ) and the range (ℝᵐ) are different spaces — they cannot share a basis. A single orthogonal matrix would conflate input and output coordinate systems. The two rotations plus a diagonal are the minimal factorization that separates which directions matter (geometry) from how much each direction matters (magnitude).
The diagonal entries σᵢ are the singular values of A. The columns of U are the left singular vectors, the columns of V are the right singular vectors.
Geometric interpretation. Every linear map x ↦ Ax decomposes into three stages:
- Vᵀ: rotate in the domain (an isometry)
- Σ: scale each axis by σᵢ (stretching/compression)
- U: rotate in the range (an isometry)
A unit sphere in ℝⁿ is mapped to an ellipsoid in ℝᵐ with semi-axes σ₁, …, σᵣ (where r = rank(A)).
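The three stages can be verified directly in numpy; the 2×2 matrix here is an arbitrary illustration:

```python
import numpy as np

# An illustrative 2x2 matrix; any matrix would do.
A = np.array([[2.0, 1.0],
              [1.0, 1.0]])

U, s, Vt = np.linalg.svd(A)

# The factorization reconstructs A: rotate (V^T), scale (diag(s)), rotate (U).
assert np.allclose(U @ np.diag(s) @ Vt, A)

# Both rotation factors are orthogonal: Q^T Q = I.
assert np.allclose(U.T @ U, np.eye(2))
assert np.allclose(Vt @ Vt.T, np.eye(2))

# Applying the stages one at a time equals applying A directly.
x = np.array([1.0, 0.0])
y = U @ (np.diag(s) @ (Vt @ x))
assert np.allclose(y, A @ x)
```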
Existence proof sketch. Let σ₁ = max{‖Ax‖ : ‖x‖ = 1} and let v₁ be a maximizer (exists by compactness). Set u₁ = Av₁/σ₁. Extend u₁ and v₁ to orthonormal bases and proceed by induction on the orthogonal complement — the same argument as the Spectral Theorem proof, applied to AᵀA.
Thin SVD and Rank
Full SVD: U ∈ ℝᵐˣᵐ, Σ ∈ ℝᵐˣⁿ, V ∈ ℝⁿˣⁿ.
Thin (economy) SVD: for m ≥ n, keep only the first n columns of U and the top-left n×n block of Σ: A = U₁Σ₁Vᵀ with U₁ ∈ ℝᵐˣⁿ, Σ₁ ∈ ℝⁿˣⁿ.
Rank: number of nonzero singular values. In the outer-product form: A = σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + ⋯ + σᵣuᵣvᵣᵀ.
Connection to eigenvalues: the σᵢ² are the nonzero eigenvalues of both AᵀA (right singular vectors = eigenvectors) and AAᵀ (left singular vectors = eigenvectors). For symmetric PSD matrices, σᵢ = λᵢ and U = V — the SVD and spectral decomposition coincide.
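A quick numpy check of the thin-SVD shapes and the σᵢ² = λᵢ(AᵀA) connection, using a random illustrative matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))          # tall matrix: m = 6, n = 3

# Thin (economy) SVD: U is 6x3, S has 3 entries, Vt is 3x3.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
assert U.shape == (6, 3) and Vt.shape == (3, 3)

# Rank = number of nonzero singular values (3, almost surely, for random A).
rank = int(np.sum(S > 1e-12 * S[0]))

# sigma_i^2 equal the eigenvalues of A^T A (sorted descending).
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
assert np.allclose(eigvals, S**2)
```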
Eckart-Young Theorem (Best Low-Rank Approximation)
Theorem. Among all rank-k matrices B: ‖A − B‖₂ ≥ ‖A − Aₖ‖₂ = σₖ₊₁ and ‖A − B‖_F ≥ ‖A − Aₖ‖_F = √(σₖ₊₁² + ⋯ + σᵣ²),
where Aₖ = σ₁u₁v₁ᵀ + ⋯ + σₖuₖvₖᵀ is the truncated SVD.
This is the theoretical foundation for:
- PCA: project data onto the top-k right singular vectors of the data matrix
- Image compression: store the rank-k SVD factors instead of the full image
- Collaborative filtering: approximate user-item matrix with low-rank factors
- Word embeddings: truncated SVD of a PMI co-occurrence matrix gives GloVe-like embeddings
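The Eckart-Young error formulas are easy to confirm numerically (random illustrative matrix; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 6))
U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
Ak = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]     # truncated SVD A_k

# Spectral-norm error equals sigma_{k+1}.
err2 = np.linalg.norm(A - Ak, ord=2)
assert np.isclose(err2, S[k])

# Frobenius-norm error equals sqrt of the sum of discarded sigma_i^2.
errF = np.linalg.norm(A - Ak, ord='fro')
assert np.isclose(errF, np.sqrt(np.sum(S[k:]**2)))
```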
The Pseudoinverse
For a system Ax = b that may be over- or under-determined, the Moore-Penrose pseudoinverse is A⁺ = VΣ⁺Uᵀ, where Σ⁺ inverts each nonzero singular value (σᵢ ↦ 1/σᵢ) and transposes the shape.
Then x⁺ = A⁺b is the minimum-norm least-squares solution: it minimizes ‖x‖₂ among all x minimizing ‖Ax − b‖₂.
For a full-rank tall matrix (m > n, rank(A) = n): A⁺ = (AᵀA)⁻¹Aᵀ — the ordinary least squares formula.
Regularized pseudoinverse (truncated SVD). When some σᵢ ≈ 0 (near-singular), inverting them amplifies noise. Truncate at a threshold τ: keep 1/σᵢ only for σᵢ > τ and zero out the rest.
The smooth counterpart replaces 1/σᵢ with σᵢ/(σᵢ² + λ) — this is Tikhonov regularization in SVD form, equivalent to ridge regression; truncation is its hard-threshold analogue.
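A sketch of the pseudoinverse built directly from the SVD, checked against numpy.linalg.pinv (the overdetermined system here is illustrative):

```python
import numpy as np

# Overdetermined system: 5 equations, 2 unknowns (illustrative data).
A = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.], [1., 4.]])
b = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

# Pseudoinverse via the SVD: A+ = V Sigma+ U^T.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / S) @ U.T
assert np.allclose(A_pinv, np.linalg.pinv(A))

# x = A+ b is the least-squares solution; for full-rank tall A it
# matches the normal-equations formula (A^T A)^{-1} A^T b.
x = A_pinv @ b
assert np.allclose(x, np.linalg.solve(A.T @ A, A.T @ b))
```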
The QR Decomposition
Theorem (QR). For any A ∈ ℝᵐˣⁿ with m ≥ n, there exists an orthogonal Q ∈ ℝᵐˣᵐ and upper triangular R ∈ ℝᵐˣⁿ (with zero rows below row n) such that A = QR.
If A has full column rank, R is uniquely determined up to signs of diagonal entries, and the thin form A = Q₁R₁ with Q₁ ∈ ℝᵐˣⁿ, R₁ ∈ ℝⁿˣⁿ upper triangular is unique (with positive diagonal).
Relation to Gram-Schmidt. The QR decomposition is the matrix form of Gram-Schmidt orthogonalization: the columns of Q are the orthonormalized columns of A, and R encodes the change-of-basis coefficients.
Solving least squares via QR. For full-rank A, the minimizer of ‖Ax − b‖₂ satisfies Rx = Qᵀb (thin QR).
Since R is upper triangular, solve by back-substitution in O(n²) (after paying O(mn²) for the QR factorization).
Why QR over normal equations? The normal equations square the condition number: κ(AᵀA) = κ(A)². For ill-conditioned A, this causes catastrophic loss of numerical precision. QR avoids squaring the condition number and is the standard method in production numerical solvers.
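A minimal QR least-squares solve, assuming numpy and scipy are available (the design matrix and observations are illustrative):

```python
import numpy as np
from scipy.linalg import solve_triangular

# Illustrative design matrix (intercept + slope columns) and observations.
A = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.]])
b = np.array([0.1, 1.1, 1.9, 3.2])

Q, R = np.linalg.qr(A)                  # thin QR by default: Q is 4x2, R is 2x2

# Least squares: R x = Q^T b, solved by back-substitution (R is triangular).
x = solve_triangular(R, Q.T @ b, lower=False)

# Agrees with the library least-squares solver.
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
```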
Householder vs Gram-Schmidt:
| Method | Stability | Cost (flops) | Use case |
|---|---|---|---|
| Gram-Schmidt (classical) | Unstable for ill-cond. A | ≈ 2mn² | Conceptual |
| Modified Gram-Schmidt | Stable | ≈ 2mn² | Small matrices |
| Householder reflections | Backward stable | ≈ 2mn² − 2n³/3 | Standard in practice |
| Givens rotations | Backward stable | ≈ 3mn² (dense) | Sparse A, updating QR |
The LU Decomposition
Theorem (LU with pivoting). For any A ∈ ℝⁿˣⁿ, there exists a permutation matrix P, a unit lower triangular L, and upper triangular U such that PA = LU.
Gaussian elimination computes L and U, with partial pivoting (swapping rows to put the largest-magnitude entry in the pivot position at each step — the permutation matrix P records these swaps).
Solving Ax = b via LU:
- Apply P: Ax = b becomes LUx = Pb
- Forward substitution: solve Ly = Pb in O(n²)
- Back substitution: solve Ux = y in O(n²)
- Total: O(n³) (about 2n³/3 flops) for the factorization, then O(n²) per right-hand side
Why LU for square systems, not QR? LU is roughly twice as fast as QR for square systems (same O(n³) but half the constant). For rectangular least-squares problems, QR is preferable.
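The factor-once, solve-many pattern can be sketched with scipy's LU routines (random illustrative system):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
b1 = rng.standard_normal(4)
b2 = rng.standard_normal(4)

# Factor once: O(n^3). PA = LU with partial pivoting; piv records row swaps.
lu, piv = lu_factor(A)

# Each subsequent solve is only O(n^2): forward then back substitution.
x1 = lu_solve((lu, piv), b1)
x2 = lu_solve((lu, piv), b2)
assert np.allclose(A @ x1, b1)
assert np.allclose(A @ x2, b2)
```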
Cholesky decomposition. For symmetric positive definite A: A = LLᵀ,
where L is lower triangular with positive diagonal. This is Cholesky factorization — twice as fast as LU (exploits symmetry), always stable (no pivoting needed since positive definiteness guarantees positive pivots), and the standard method for solving Gaussian process regression, computing covariance-weighted least squares, and sampling from multivariate Gaussians.
Condition Number and Numerical Stability
The condition number of A (in the 2-norm) is: κ₂(A) = σ_max/σ_min.
It measures how much relative error in b can be amplified in the solution to Ax = b: ‖δx‖/‖x‖ ≤ κ(A) · ‖δb‖/‖b‖.
In floating-point arithmetic with machine epsilon ε ≈ 2.2 × 10⁻¹⁶ (double precision), a condition number of 10ᵏ means you lose roughly k decimal digits of accuracy. When κ(A) ≳ 10¹⁶, the linear system is essentially unsolvable at double precision.
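The digit-loss rule of thumb can be observed on Hilbert matrices, a classic family of ill-conditioned matrices (scipy assumed):

```python
import numpy as np
from scipy.linalg import hilbert

# Hilbert matrices H_n are famously ill-conditioned; accuracy degrades with n.
for n in (4, 8, 12):
    H = hilbert(n)
    kappa = np.linalg.cond(H)               # sigma_max / sigma_min
    x_true = np.ones(n)
    x = np.linalg.solve(H, H @ x_true)      # solve a system with known answer
    lost_digits = np.log10(kappa)           # rule of thumb: digits lost
    err = np.linalg.norm(x - x_true)
    print(f"n={n:2d}  cond={kappa:.1e}  ~{lost_digits:.0f} digits lost  error={err:.1e}")
```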
Worked Example
Example 1: SVD of a 2×3 Matrix
Let A = [[3, 2, 2], [2, 3, −2]].
Step 1: Compute AᵀA = [[13, 12, 2], [12, 13, −2], [2, −2, 8]].
Step 2: Eigenvalues of AᵀA (= σᵢ²). The characteristic polynomial gives λ₁ = 25, λ₂ = 9, λ₃ = 0, so σ₁ = 5, σ₂ = 3, σ₃ = 0.
The zero singular value confirms rank(A) = 2, consistent with A being 2×3 with full row rank.
Step 3: Right singular vectors = eigenvectors of AᵀA: v₁ = (1/√2)(1, 1, 0), v₂ = (1/√18)(1, −1, 4).
Step 4: Left singular vectors uᵢ = Avᵢ/σᵢ: u₁ = (1/√2)(1, 1), u₂ = (1/√2)(1, −1).
Low-rank approximation: A₁ = σ₁u₁v₁ᵀ captures σ₁²/(σ₁² + σ₂²) = 25/34 ≈ 74% of the Frobenius energy.
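These steps can be checked numerically; the block below uses the standard 2×3 example A = [[3, 2, 2], [2, 3, −2]], which has σ = (5, 3):

```python
import numpy as np

# A 2x3 full-row-rank matrix; a standard worked example with sigma = (5, 3).
A = np.array([[3., 2., 2.],
              [2., 3., -2.]])

U, S, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(S, [5.0, 3.0])

# A^T A is 3x3 with eigenvalues 25, 9, 0 — the zero confirms rank 2.
eigs = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
assert np.allclose(eigs, [25.0, 9.0, 0.0])

# Rank-1 truncation captures sigma_1^2 / (sigma_1^2 + sigma_2^2) of the energy.
energy = S[0]**2 / np.sum(S**2)
assert np.isclose(energy, 25 / 34)
```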
Example 2: QR and Least Squares
Given data points (xᵢ, yᵢ), i = 1, …, m, fit a line y = c₀ + c₁x. The design matrix A has rows (1, xᵢ).
As long as the xᵢ are not all equal, rank(A) = 2, so QR gives a unique least-squares solution: factor A = QR (thin) and solve Rĉ = Qᵀy by back-substitution.
Example 3: Cholesky for Gaussian Sampling
To sample x ∼ 𝒩(μ, Σ): compute the Cholesky factor Σ = LLᵀ, draw z ∼ 𝒩(0, I), then x = μ + Lz.
This is how multivariate Gaussian samples are generated in PyTorch (torch.distributions.MultivariateNormal uses Cholesky internally), GPyTorch, Stan, and TensorFlow Probability. The Cholesky factor also computes the log-determinant for the multivariate Gaussian log-likelihood: log det Σ = 2 Σᵢ log Lᵢᵢ.
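A sketch of Cholesky-based sampling and the log-determinant identity (the mean and covariance here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])          # symmetric positive definite

L = np.linalg.cholesky(Sigma)           # Sigma = L L^T, L lower triangular
assert np.allclose(L @ L.T, Sigma)

# Sampling: x = mu + L z with z ~ N(0, I); empirical covariance -> Sigma.
z = rng.standard_normal((100_000, 2))
x = mu + z @ L.T
emp_cov = np.cov(x, rowvar=False)
assert np.allclose(emp_cov, Sigma, atol=0.05)

# log det Sigma = 2 * sum(log diag(L)) — used in the Gaussian log-likelihood.
logdet = 2.0 * np.sum(np.log(np.diag(L)))
assert np.isclose(logdet, np.log(np.linalg.det(Sigma)))
```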
Connections
Where Your Intuition Breaks
The SVD is always expensive — it is an O(mn²) operation. For the full SVD this is true, but in practice you almost never need the full decomposition. Randomized SVD (Halko, Martinsson, Tropp 2011) computes a rank-k approximation in O(mn log k + (m + n)k²) — essentially linear in the number of matrix entries, which is cheap when the matrix has low intrinsic rank. For large sparse matrices (graph Laplacians, word co-occurrence matrices), Lanczos-based methods compute the top singular vectors without ever materializing the full matrix. The correct mental model is: "computing the full SVD is expensive; computing the top-k SVD is affordable and is what you almost always want."
Decomposition Comparison Table
| Decomposition | Form | Shape of A | Rank needed | Primary use |
|---|---|---|---|---|
| SVD | A = UΣVᵀ | Any | Any | Low-rank approx, pseudoinverse, PCA |
| Spectral | A = QΛQᵀ | Square symmetric | Any | Eigenanalysis, quadratic forms |
| QR | A = QR | Any (m ≥ n) | Any | Least squares, Gram-Schmidt |
| LU (+pivot) | PA = LU | Square | Full | Linear systems |
| Cholesky | A = LLᵀ | Square sym PD | Full | Covariance systems, Gaussian sampling |
| Eigendecomp | A = XΛX⁻¹ | Square | — | Diagonalizable matrices only |
When the Normal Equations Are Dangerous
The normal equations AᵀAx = Aᵀb are algebraically equivalent to the QR solution but numerically inferior: κ(AᵀA) = κ(A)².
Example. Take A with κ(A) = 10⁸ (comfortably solvable in double precision). Then κ(A) = 10⁸ but κ(AᵀA) = 10¹⁶ ≈ 1/ε. The normal equations are at the numerical precision limit even though the original system is well-posed. QR avoids this by never forming AᵀA.
Never form AᵀA explicitly when solving least-squares problems numerically. Use numpy.linalg.lstsq, scipy.linalg.lstsq, or QR-based solvers. The only exception is when A is very tall and skinny (m ≫ n) and well-conditioned, and even then, use the rcond parameter to handle near-rank-deficiency.
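The squaring effect can be demonstrated by constructing a matrix with a prescribed condition number from its SVD (an illustrative sketch; the dimensions and spectrum are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 50, 5

# Build A with prescribed singular values via a random SVD: kappa(A) = 1e8.
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -8, n)                  # singular values from 1 down to 1e-8
A = U @ np.diag(s) @ V.T

x_true = np.ones(n)
b = A @ x_true

# QR route: kappa(A) = 1e8, so roughly 8 digits lost — still accurate.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# Normal equations: kappa(A^T A) = 1e16 — essentially no digits left.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

print("cond(A)    :", np.linalg.cond(A))
print("cond(A^T A):", np.linalg.cond(A.T @ A))
print("QR error   :", np.linalg.norm(x_qr - x_true))
print("NE error   :", np.linalg.norm(x_ne - x_true))
```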
SVD is PCA. If X ∈ ℝⁿˣᵈ is a centered data matrix (rows = observations), then the SVD X = UΣVᵀ gives: right singular vectors (columns of V) = principal directions, σᵢ²/(n − 1) = eigenvalues of the sample covariance, UΣ = principal component scores. PCA via SVD is preferred over the covariance-eigendecomposition route when d > n (more features than samples), because the n×d matrix X is cheaper to decompose than the d×d covariance XᵀX/(n − 1).
Randomized SVD for large matrices. For an m×n matrix with m, n ≫ k, computing the full SVD is O(mn · min(m, n)). Randomized SVD (Halko, Martinsson, Tropp 2011) computes a rank-k approximation in O(mn log k + (m + n)k²): sketch the column space with a random Gaussian matrix Ω ∈ ℝⁿˣᵏ, form Y = AΩ, QR-decompose Y = QR, then compute the SVD of the small matrix QᵀA. This is essentially what sklearn.decomposition.TruncatedSVD and torch.svd_lowrank implement.
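A minimal sketch of the randomized SVD recipe just described (the function name and oversampling parameter are this sketch's choices, not a library API):

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Rank-k SVD approximation via random sketching (Halko et al. 2011)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # 1. Sketch the column space: Y = A @ Omega with a random Gaussian Omega.
    Omega = rng.standard_normal((n, k + oversample))
    Y = A @ Omega
    # 2. Orthonormalize the sketch.
    Q, _ = np.linalg.qr(Y)
    # 3. SVD of the small (k + oversample) x n matrix B = Q^T A.
    B = Q.T @ A
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], S[:k], Vt[:k, :]

# Test matrix with intrinsic rank exactly 5: recovery is (numerically) exact.
rng = np.random.default_rng(5)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 300))
U, S, Vt = randomized_svd(A, k=5)
assert np.allclose(U @ np.diag(S) @ Vt, A, atol=1e-8)
```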
SVD in Neural Network Analysis
Recent interpretability research uses SVD to analyze weight matrices in transformers:
- Effective rank: number of singular values above a threshold, measuring how "compressed" a weight matrix is
- Intrinsic dimensionality: the rank-k approximation quality as a function of k — a steep drop indicates the weight matrix has low intrinsic rank
- LoRA (Low-Rank Adaptation): instead of fine-tuning W ∈ ℝᵈˣᵏ directly, parameterize the update as W + ΔW = W + BA where B ∈ ℝᵈˣʳ, A ∈ ℝʳˣᵏ, r ≪ min(d, k). The claim (motivated by SVD analysis) is that fine-tuning directions live in a low-dimensional subspace of the full weight space.
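A shape-level sketch of the LoRA parameterization (dimensions and init scale are illustrative; real LoRA initializes B = 0 so that ΔW starts at zero, whereas here B is random so the rank bound is visible):

```python
import numpy as np

rng = np.random.default_rng(6)
d, k, r = 64, 64, 4                  # weight is d x k, adapter rank r

W0 = rng.standard_normal((d, k))     # frozen pretrained weight (illustrative)
# LoRA factors; real implementations initialize B = 0.
B = rng.standard_normal((d, r)) * 0.01
A = rng.standard_normal((r, k)) * 0.01

delta_W = B @ A                      # rank(delta_W) <= r by construction
W = W0 + delta_W                     # effective weight in the forward pass

n_lora = r * (d + k)                 # trainable parameters with LoRA
n_full = d * k                       # trainable parameters in full fine-tuning
assert np.linalg.matrix_rank(delta_W) <= r
assert n_lora < n_full
```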