
SVD, QR & LU Decompositions

Matrix decompositions are the central algorithmic tool of numerical linear algebra: they rewrite a matrix as a product of structured factors (orthogonal, triangular, diagonal) that make subsequent operations — solving systems, computing inverses, finding least-squares solutions, approximating rank — cheap, stable, and interpretable. The Singular Value Decomposition (SVD) in particular is the universal factorization, applicable to any matrix regardless of shape or rank, and underlies dimensionality reduction, low-rank approximation, pseudoinverses, and the analysis of neural network weight matrices.

Concepts

SVD Geometry — $A = U\Sigma V^T$ (rotate → scale → rotate)

[Interactive figure: the unit circle is shown after each stage of the SVD — the input circle, $V^T\mathbf{x}$ (rotate in the domain), $\Sigma V^T\mathbf{x}$ (scale the axes), and $U\Sigma V^T\mathbf{x}$ (rotate in the range), which equals $A\mathbf{x}$. For the example matrix: singular values $\sigma_1 = 2.558$, $\sigma_2 = 0.977$, $\kappa = \sigma_1/\sigma_2 = 2.62$; singular vectors $\mathbf{v}_1 = (0.768, 0.641)$, $\mathbf{v}_2 = (-0.641, 0.768)$, $\mathbf{u}_1 = (0.851, 0.526)$, $\mathbf{u}_2 = (-0.526, 0.851)$. Full rank: two nonzero singular values, two rotation stages.]

Every matrix — whether it compresses images, encodes user preferences, or transforms activations in a neural network — does the same thing geometrically: rotate, scale, rotate. The SVD makes this explicit: any matrix decomposes as $A = U\Sigma V^T$, two rotations and a diagonal scaling. This universality is why SVD underlies PCA, low-rank approximation, pseudoinverses, and virtually every other dimension-reduction algorithm.

The Singular Value Decomposition (SVD)

Theorem (SVD). For any matrix $A \in \mathbb{R}^{m \times n}$, there exist orthogonal matrices $U \in \mathbb{R}^{m \times m}$, $V \in \mathbb{R}^{n \times n}$ and a matrix $\Sigma \in \mathbb{R}^{m \times n}$ with nonnegative entries on the diagonal (and zeros elsewhere) such that

$$A = U\Sigma V^T.$$

Two separate rotation matrices $U$ and $V$ are needed rather than one because the domain ($\mathbb{R}^n$) and the range ($\mathbb{R}^m$) are different spaces — they cannot share a basis. A single orthogonal matrix would conflate input and output coordinate systems. The two rotations plus a diagonal $\Sigma$ are the minimal factorization that separates which directions matter (geometry) from how much each direction matters (magnitude).

The diagonal entries $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(m,n)} \geq 0$ are the singular values of $A$. The columns of $U$ are the left singular vectors, the columns of $V$ are the right singular vectors.

Geometric interpretation. Every linear map $A : \mathbb{R}^n \to \mathbb{R}^m$ decomposes into three stages:

  1. $V^T$: rotate in the domain (an isometry)
  2. $\Sigma$: scale each axis by $\sigma_i$ (stretching/compression)
  3. $U$: rotate in the range (an isometry)

A unit sphere in $\mathbb{R}^n$ is mapped to an ellipsoid in $\mathbb{R}^m$ with semi-axes $\sigma_1, \ldots, \sigma_r$ (where $r = \operatorname{rank}(A)$).

Existence proof sketch. Let $\sigma_1 = \|A\|_2 = \max_{\|\mathbf{x}\|=1} \|A\mathbf{x}\|$ and let $\mathbf{v}_1$ be a maximizer (one exists by compactness). Set $\mathbf{u}_1 = A\mathbf{v}_1/\sigma_1$. Extend to orthonormal bases and proceed by induction on the orthogonal complement — the same argument as the Spectral Theorem proof, applied to $A^TA$.
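A quick numerical check of the theorem's claims — orthogonal factors, sorted nonnegative singular values, exact reconstruction — as a sketch assuming NumPy:

```python
import numpy as np

# Verify the SVD theorem numerically for a random matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A)          # full SVD: U is 5x5, Vt is 3x3
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)

# U and V are orthogonal; singular values are sorted and nonnegative.
assert np.allclose(U.T @ U, np.eye(5))
assert np.allclose(Vt @ Vt.T, np.eye(3))
assert np.all(s[:-1] >= s[1:]) and np.all(s >= 0)

# A = U Sigma V^T up to floating-point roundoff.
assert np.allclose(A, U @ Sigma @ Vt)
```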

Thin SVD and Rank

Full SVD: $U \in \mathbb{R}^{m \times m}$, $\Sigma \in \mathbb{R}^{m \times n}$, $V \in \mathbb{R}^{n \times n}$.

Thin (economy) SVD: for $m \geq n$, keep only the first $n$ columns of $U$ and the top-left $n \times n$ block of $\Sigma$:

$$A = U_n \Sigma_n V^T, \qquad U_n \in \mathbb{R}^{m \times n}, \quad \Sigma_n \in \mathbb{R}^{n \times n}, \quad V \in \mathbb{R}^{n \times n}.$$

Rank: $\operatorname{rank}(A)$ = number of nonzero singular values. In the outer-product form:

$$A = \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i^T, \qquad r = \operatorname{rank}(A).$$

Connection to eigenvalues: the $\sigma_i^2$ are the nonzero eigenvalues of both $A^TA$ (right singular vectors = eigenvectors) and $AA^T$ (left singular vectors = eigenvectors). For symmetric PSD matrices, $\sigma_i = \lambda_i$ and $U = V = Q$ — the SVD and spectral decomposition coincide.
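The eigenvalue connection and the thin/full distinction can both be checked directly (a NumPy sketch):

```python
import numpy as np

# sigma_i^2 equal the eigenvalues of A^T A.
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))

s = np.linalg.svd(A, compute_uv=False)              # descending singular values
evals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]  # descending eigenvalues
assert np.allclose(s**2, evals)

# Thin (economy) SVD keeps only the first n columns of U.
U_thin, s_thin, Vt = np.linalg.svd(A, full_matrices=False)
assert U_thin.shape == (6, 4) and Vt.shape == (4, 4)
assert np.allclose(A, U_thin @ np.diag(s_thin) @ Vt)

# rank = number of nonzero singular values (A is full rank almost surely).
assert np.linalg.matrix_rank(A) == np.sum(s > 1e-12)
```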

Eckart-Young Theorem (Best Low-Rank Approximation)

Theorem. Among all rank-$k$ matrices $B \in \mathbb{R}^{m \times n}$:

$$\|A - A_k\|_2 = \min_{\operatorname{rank}(B) \leq k} \|A - B\|_2 = \sigma_{k+1},$$
$$\|A - A_k\|_F = \min_{\operatorname{rank}(B) \leq k} \|A - B\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2},$$

where $A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^T$ is the truncated SVD.

This is the theoretical foundation for:

  • PCA: project data onto the top-$k$ singular vectors of the data matrix
  • Image compression: store the rank-$k$ SVD instead of the full image
  • Collaborative filtering: approximate the user-item matrix with low-rank factors
  • Word embeddings: truncated SVD of a PMI co-occurrence matrix gives GloVe-like embeddings
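The Eckart-Young error formulas are easy to verify numerically — a small NumPy sketch:

```python
import numpy as np

# The truncated SVD attains the Eckart-Young error bounds.
rng = np.random.default_rng(2)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation

# Spectral-norm error is exactly sigma_{k+1}; Frobenius error is the tail norm.
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])
assert np.isclose(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:]**2)))
```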

The Pseudoinverse

For a system $A\mathbf{x} = \mathbf{b}$ that may be over- or under-determined, the Moore-Penrose pseudoinverse is:

$$A^+ = V\Sigma^+ U^T, \qquad \Sigma^+ = \operatorname{diag}\!\left(\frac{1}{\sigma_1}, \ldots, \frac{1}{\sigma_r}, 0, \ldots, 0\right).$$

Then $\mathbf{x}^+ = A^+\mathbf{b}$ is the minimum-norm least-squares solution: it minimizes $\|\mathbf{x}\|$ among all solutions minimizing $\|A\mathbf{x} - \mathbf{b}\|$.

For a full-rank tall matrix ($m > n$, $\operatorname{rank} = n$): $A^+ = (A^TA)^{-1}A^T$ — the ordinary least squares formula.

Regularized pseudoinverse (truncated SVD). When some $\sigma_i \approx 0$ (near-singular), inverting them amplifies noise. Truncate at a threshold $\tau$:

$$A^+_\tau = \sum_{\sigma_i > \tau} \frac{1}{\sigma_i} \mathbf{v}_i \mathbf{u}_i^T.$$

This truncation plays the same role as Tikhonov regularization (ridge regression): both damp the contribution of small singular values — truncation by a hard cutoff, ridge by the smooth filter factor $\sigma_i/(\sigma_i^2 + \lambda)$.
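A sketch of the pseudoinverse built directly from the SVD, assuming NumPy (for a full-column-rank $A$, no truncation is needed; `np.linalg.pinv`'s `rcond` parameter implements the cutoff):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)

# A^+ = V Sigma^+ U^T (here all sigma_i > 0, so Sigma^+ just inverts them).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1 / s) @ U.T
assert np.allclose(A_pinv, np.linalg.pinv(A))   # pinv zeroes sigma_i <= rcond*sigma_1

# x+ = A^+ b is the least-squares solution (unique for full column rank).
x = A_pinv @ b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_lstsq)
```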

The QR Decomposition

Theorem (QR). For any $A \in \mathbb{R}^{m \times n}$ with $m \geq n$, there exist an orthogonal $Q \in \mathbb{R}^{m \times m}$ and an upper triangular $R \in \mathbb{R}^{m \times n}$ (with zero rows below row $n$) such that

$$A = QR.$$

If $A$ has full column rank, $R$ is uniquely determined up to signs of its diagonal entries, and the thin form $A = Q_1 R_1$ with $Q_1 \in \mathbb{R}^{m \times n}$, $R_1 \in \mathbb{R}^{n \times n}$ upper triangular is unique (with positive diagonal).

Relation to Gram-Schmidt. The QR decomposition is the matrix form of Gram-Schmidt orthogonalization: the columns of $Q_1$ are the orthonormalized columns of $A$, and $R_1$ encodes the change-of-basis coefficients.

Solving least squares via QR. For $m > n$ and full-rank $A$:

$$\min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|^2 \iff A^TA\mathbf{x} = A^T\mathbf{b} \iff R_1\mathbf{x} = Q_1^T\mathbf{b},$$

since $A^TA = R_1^T Q_1^T Q_1 R_1 = R_1^T R_1$ and $R_1^T$ is invertible.

Since $R_1$ is upper triangular, solve by back-substitution in $O(n^2)$ (after paying $O(mn^2)$ for the QR factorization).
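The recipe above as a runnable sketch (NumPy only; the back-substitution loop is written out to make the $O(n^2)$ step explicit):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)

Q1, R1 = np.linalg.qr(A, mode='reduced')  # thin QR: Q1 is 10x3, R1 is 3x3
c = Q1.T @ b

# Back-substitution on the upper-triangular system R1 x = c: O(n^2).
n = R1.shape[0]
x = np.zeros(n)
for i in range(n - 1, -1, -1):
    x[i] = (c[i] - R1[i, i+1:] @ x[i+1:]) / R1[i, i]

# Agrees with the library least-squares solver.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_ref)
```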

Why QR over normal equations? The normal equations $A^TA\mathbf{x} = A^T\mathbf{b}$ square the condition number: $\kappa(A^TA) = \kappa(A)^2$. For ill-conditioned $A$, this causes catastrophic loss of numerical precision. QR avoids squaring the condition number and is the standard method in production numerical solvers.

Householder vs Gram-Schmidt:

| Method | Stability | Cost | Use case |
|---|---|---|---|
| Gram-Schmidt (classical) | Unstable for ill-conditioned $A$ | $O(mn^2)$ | Conceptual |
| Modified Gram-Schmidt | Stable | $O(mn^2)$ | Small matrices |
| Householder reflections | Backward stable | $O(mn^2)$ | Standard in practice |
| Givens rotations | Backward stable | $O(mn^2)$ | Sparse $A$, updating QR |

The LU Decomposition

Theorem (LU with pivoting). For any $A \in \mathbb{R}^{n \times n}$, there exist a permutation matrix $P$, a unit lower triangular $L$, and an upper triangular $U$ such that

$$PA = LU.$$

Gaussian elimination computes $L$ and $U$ with partial pivoting (swapping rows to put the largest remaining entry in the pivot position at each step — the permutation matrix $P$ records these swaps).

Solving $A\mathbf{x} = \mathbf{b}$ via LU:

  1. $PA\mathbf{x} = P\mathbf{b}$ → $LU\mathbf{x} = P\mathbf{b}$
  2. Forward substitution: solve $L\mathbf{y} = P\mathbf{b}$ in $O(n^2)$
  3. Back substitution: solve $U\mathbf{x} = \mathbf{y}$ in $O(n^2)$
  4. Total: $O(n^3)$ for the factorization, then $O(n^2)$ per right-hand side

Why LU for square systems, not QR? LU is roughly twice as fast as QR for square systems: the same $O(n^3)$ scaling, but half the constant ($\tfrac{2}{3}n^3$ flops versus $\tfrac{4}{3}n^3$ for Householder QR). For rectangular least-squares problems, QR is preferable.
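The factor-once, solve-many pattern from the steps above, sketched with SciPy's LU routines:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Factor once (O(n^3)), then solve several right-hand sides (O(n^2) each).
rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
lu, piv = lu_factor(A)            # PA = LU with partial pivoting

for _ in range(3):
    b = rng.standard_normal(4)
    x = lu_solve((lu, piv), b)    # forward + back substitution
    assert np.allclose(A @ x, b)
```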

Cholesky decomposition. For symmetric positive definite $A = A^T \succ 0$:

$$A = LL^T$$

where $L$ is lower triangular with positive diagonal. This is the Cholesky factorization — twice as fast as LU (it exploits symmetry), always stable (no pivoting needed, since $A \succ 0$ guarantees positive pivots), and the standard method for solving Gaussian process regression, computing covariance-weighted least squares, and sampling from multivariate Gaussians.

Condition Number and Numerical Stability

The condition number of $A$ (in the 2-norm) is:

$$\kappa_2(A) = \|A\|_2 \|A^{-1}\|_2 = \frac{\sigma_1}{\sigma_n}.$$

It measures how much relative error in $\mathbf{b}$ can be amplified in the solution $\mathbf{x}$ of $A\mathbf{x} = \mathbf{b}$:

$$\frac{\|\delta\mathbf{x}\|}{\|\mathbf{x}\|} \leq \kappa(A) \cdot \frac{\|\delta\mathbf{b}\|}{\|\mathbf{b}\|}.$$

In floating-point arithmetic with machine epsilon $\epsilon_{\text{mach}} \approx 10^{-16}$ (double precision), a condition number of $\kappa$ means you lose roughly $\log_{10}(\kappa)$ decimal digits of accuracy. When $\kappa > 10^{12}$, the linear system is essentially unsolvable at double precision.
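The digits-lost rule can be seen on Hilbert matrices, a classic ill-conditioned family (a NumPy sketch; the `hilbert` helper is defined locally for illustration):

```python
import numpy as np

def hilbert(n):
    """Hilbert matrix H[i, j] = 1 / (i + j + 1)."""
    i = np.arange(n)
    return 1.0 / (1 + i[:, None] + i[None, :])

for n in (4, 8, 12):
    H = hilbert(n)
    kappa = np.linalg.cond(H)                # sigma_1 / sigma_n
    x_true = np.ones(n)
    x_hat = np.linalg.solve(H, H @ x_true)   # would be exact in exact arithmetic
    err = np.linalg.norm(x_hat - x_true)
    # Roughly log10(kappa) digits are lost; by n = 12, kappa is near 1e16
    # and the computed solution can be wrong in every digit.
    print(f"n={n:2d}  cond={kappa:.1e}  error={err:.1e}")
```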

Worked Example

Example 1: SVD of a $2 \times 3$ Matrix

Let $A = \begin{pmatrix}1 & 1 & 0 \\ 0 & 1 & 1\end{pmatrix}$.

Step 1: Compute $A^TA$.

$$A^TA = \begin{pmatrix}1&0\\1&1\\0&1\end{pmatrix}\begin{pmatrix}1&1&0\\0&1&1\end{pmatrix} = \begin{pmatrix}1&1&0\\1&2&1\\0&1&1\end{pmatrix}.$$

Step 2: Eigenvalues of $A^TA$ (these are the $\sigma_i^2$). The characteristic polynomial gives $\lambda = 3, 1, 0$, so $\sigma_1 = \sqrt{3}$, $\sigma_2 = 1$, $\sigma_3 = 0$.

The zero singular value confirms $\operatorname{rank}(A) = 2$, consistent with $A$ being $2 \times 3$ with full row rank.

Step 3: Right singular vectors = eigenvectors of $A^TA$:

$$\mathbf{v}_1 = \frac{1}{\sqrt{6}}\begin{pmatrix}1\\2\\1\end{pmatrix}, \quad \mathbf{v}_2 = \frac{1}{\sqrt{2}}\begin{pmatrix}1\\0\\-1\end{pmatrix}, \quad \mathbf{v}_3 = \frac{1}{\sqrt{3}}\begin{pmatrix}1\\-1\\1\end{pmatrix}.$$

Step 4: Left singular vectors $\mathbf{u}_i = A\mathbf{v}_i / \sigma_i$:

$$\mathbf{u}_1 = \frac{1}{\sqrt{2}}\begin{pmatrix}1\\1\end{pmatrix}, \quad \mathbf{u}_2 = \frac{1}{\sqrt{2}}\begin{pmatrix}1\\-1\end{pmatrix}.$$

Low-rank approximation: $A \approx \sqrt{3}\,\mathbf{u}_1\mathbf{v}_1^T$ captures $\sigma_1^2/(\sigma_1^2+\sigma_2^2) = 3/4 = 75\%$ of the squared Frobenius norm.
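The worked example can be confirmed with NumPy:

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])

# sigma_1 = sqrt(3), sigma_2 = 1, as computed by hand.
s = np.linalg.svd(A, compute_uv=False)
assert np.allclose(s, [np.sqrt(3), 1.0])

# u_1 = A v_1 / sigma_1 with v_1 = (1, 2, 1)/sqrt(6).
v1 = np.array([1.0, 2.0, 1.0]) / np.sqrt(6)
u1 = A @ v1 / np.sqrt(3)
assert np.allclose(u1, np.array([1.0, 1.0]) / np.sqrt(2))

# Rank-1 truncation captures sigma_1^2 / (sigma_1^2 + sigma_2^2) = 75% of energy.
assert np.isclose(s[0]**2 / np.sum(s**2), 0.75)
```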

Example 2: QR and Least Squares

Given the data points $(1,1), (2,2.1), (3,2.9), (4,4.2)$, fit a line $y = \alpha + \beta x$.

$$A = \begin{pmatrix}1&1\\1&2\\1&3\\1&4\end{pmatrix}, \quad \mathbf{b} = \begin{pmatrix}1\\2.1\\2.9\\4.2\end{pmatrix}.$$

$\operatorname{rank}(A) = 2$, so the least-squares solution is unique. The thin QR factorization $A = Q_1 R_1$ yields $\hat{\mathbf{x}} = R_1^{-1}Q_1^T\mathbf{b} = \begin{pmatrix}\hat{\alpha}\\\hat{\beta}\end{pmatrix} = \begin{pmatrix}-0.05\\1.04\end{pmatrix}$ — the line $y = -0.05 + 1.04x$.
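Checking the fit numerically (for these data the least-squares solution is exactly $\alpha = -0.05$, $\beta = 1.04$):

```python
import numpy as np

A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
b = np.array([1.0, 2.1, 2.9, 4.2])

# Thin QR, then solve the triangular system R1 x = Q1^T b.
Q1, R1 = np.linalg.qr(A, mode='reduced')
x = np.linalg.solve(R1, Q1.T @ b)

assert np.allclose(x, [-0.05, 1.04])       # alpha = -0.05, beta = 1.04
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
```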

Example 3: Cholesky for Gaussian Sampling

To sample $\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$: compute the Cholesky factor $\Sigma = LL^T$, draw $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)$, then set $\mathbf{z} = \boldsymbol{\mu} + L\boldsymbol{\epsilon}$.

This is how multivariate Gaussian samples are generated in PyTorch (torch.distributions.MultivariateNormal uses Cholesky internally), GPyTorch, Stan, and TensorFlow Probability. The Cholesky factor also gives the log-determinant needed for the multivariate Gaussian log-likelihood:

$$\log |\Sigma| = 2 \sum_{i=1}^n \log L_{ii}.$$
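A minimal sampling sketch, assuming NumPy (the values of $\boldsymbol{\mu}$ and $\Sigma$ are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])            # symmetric positive definite

L = np.linalg.cholesky(Sigma)             # Sigma = L L^T, L lower triangular
eps = rng.standard_normal((100_000, 2))   # rows ~ N(0, I)
z = mu + eps @ L.T                        # rows ~ N(mu, Sigma)

# Empirical moments approach the true ones.
assert np.allclose(z.mean(axis=0), mu, atol=0.05)
assert np.allclose(np.cov(z, rowvar=False), Sigma, atol=0.05)

# log|Sigma| = 2 * sum(log diag(L)).
assert np.isclose(2 * np.log(np.diag(L)).sum(), np.linalg.slogdet(Sigma)[1])
```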

Connections

Where Your Intuition Breaks

The SVD is always expensive — it is an $O(\min(mn^2, m^2n))$ operation. For the full SVD this is true, but in practice you almost never need the full decomposition. Randomized SVD (Halko, Martinsson & Tropp 2011) computes a rank-$k$ approximation in $O(mnk)$ — linear in $k$, which is small when the matrix has low intrinsic rank. For large sparse matrices (graph Laplacians, word co-occurrence matrices), Lanczos-based methods compute the top $k$ singular vectors using only matrix-vector products, without ever densifying the matrix. The correct mental model is: "computing the full SVD is expensive; computing the top-$k$ SVD is affordable and is usually what you want."

Decomposition Comparison Table

| Decomposition | Form | Shape of $A$ | Rank needed | Primary use |
|---|---|---|---|---|
| SVD | $U\Sigma V^T$ | Any $m \times n$ | Any | Low-rank approx, pseudoinverse, PCA |
| Spectral | $Q\Lambda Q^T$ | Square symmetric | Any | Eigenanalysis, quadratic forms |
| QR | $QR$ | $m \geq n$ | Any | Least squares, Gram-Schmidt |
| LU (+ pivot) | $PA = LU$ | Square | Full | Linear systems |
| Cholesky | $LL^T$ | Square symmetric PD | Full | Covariance systems, Gaussian sampling |
| Eigendecomp | $PDP^{-1}$ | Square | — | Diagonalizable matrices only |

When the Normal Equations Are Dangerous

The normal equations $A^TA\mathbf{x} = A^T\mathbf{b}$ are algebraically equivalent to the QR solution but numerically inferior:

Example. Take $A = \begin{pmatrix}1 & 1 \\ \epsilon & 0 \\ 0 & \epsilon\end{pmatrix}$ with $\epsilon = 10^{-8}$ (double precision). Then $\kappa(A) \approx \sqrt{2}/\epsilon \approx 1.4 \times 10^8$, but $\kappa(A^TA) \approx \kappa(A)^2 \approx 2 \times 10^{16} \approx 1/\epsilon_{\text{mach}}$. The normal equations are at the numerical precision limit even though the original system is well-posed. QR avoids this by never forming $A^TA$.
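The example can be reproduced in a few lines (in float64, $1 + \epsilon^2$ rounds to $1$, so the computed $A^TA$ is exactly singular):

```python
import numpy as np

# Forming A^T A squares the condition number and can destroy it.
eps = 1e-8
A = np.array([[1.0, 1.0],
              [eps, 0.0],
              [0.0, eps]])

kappa_A = np.linalg.cond(A)
kappa_AtA = np.linalg.cond(A.T @ A)

assert 1e7 < kappa_A < 1e9            # about sqrt(2)/eps ~ 1.4e8
assert kappa_AtA > 1e15               # at (or beyond) the double-precision limit
```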

⚠️Warning

Never form $A^TA$ explicitly when solving least-squares problems numerically. Use numpy.linalg.lstsq, scipy.linalg.lstsq, or QR-based solvers. The only exception is when $A$ is very tall and skinny ($m \gg n$) and well-conditioned — and even then, use the rcond parameter to handle near-rank-deficiency.

💡Intuition

SVD is PCA. If $X \in \mathbb{R}^{n \times d}$ is a centered data matrix (rows = observations), the SVD $X = U\Sigma V^T$ gives: right singular vectors $V$ = principal directions, $\Sigma^2/(n-1)$ = eigenvalues of the sample covariance, and $XV = U\Sigma$ = principal component scores. PCA via SVD is preferred over the covariance-eigendecomposition route when $d \gg n$ (more features than samples), because $XX^T \in \mathbb{R}^{n \times n}$ is cheaper to decompose than $X^TX \in \mathbb{R}^{d \times d}$.
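The correspondence can be checked on synthetic data (a NumPy sketch; the data matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))  # correlated data
Xc = X - X.mean(axis=0)                    # center the columns

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# s^2 / (n - 1) are the eigenvalues of the sample covariance.
n = Xc.shape[0]
cov = np.cov(Xc, rowvar=False)
evals = np.sort(np.linalg.eigvalsh(cov))[::-1]
assert np.allclose(s**2 / (n - 1), evals)

# Principal component scores: X V = U Sigma.
scores = Xc @ Vt.T
assert np.allclose(scores, U * s)
```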

💡Intuition

Randomized SVD for large matrices. For $A \in \mathbb{R}^{m \times n}$ with $m, n \gg k$, computing the full SVD costs $O(\min(mn^2, m^2n))$. Randomized SVD (Halko, Martinsson & Tropp 2011) computes a rank-$k$ approximation in $O(mnk)$: sketch the column space with a random Gaussian matrix $\Omega \in \mathbb{R}^{n \times k}$, form $Y = A\Omega$, QR-decompose $Y = QR$, then compute the SVD of the small matrix $Q^TA \in \mathbb{R}^{k \times n}$. This is how sklearn.decomposition.TruncatedSVD and torch.svd_lowrank work.
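A compact sketch of the algorithm just described (the `randomized_svd` helper name and the oversampling default are illustrative, not a library API):

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Rank-k SVD approximation via a random range sketch (Halko et al.)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + oversample))  # random sketch matrix
    Q, _ = np.linalg.qr(A @ Omega)                    # orthonormal range basis
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k, :]

# On an exactly low-rank matrix, the sketch recovers it (up to roundoff).
rng = np.random.default_rng(8)
A = rng.standard_normal((300, 8)) @ rng.standard_normal((8, 200))  # rank <= 8
U, s, Vt = randomized_svd(A, k=8)
assert np.allclose(A, (U * s) @ Vt)
```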

SVD in Neural Network Analysis

Recent interpretability research uses SVD to analyze weight matrices in transformers:

  • Effective rank: the number of singular values above a threshold, measuring how "compressed" a weight matrix is
  • Intrinsic dimensionality: rank-$k$ approximation quality as a function of $k$ — a steep drop in the singular-value spectrum indicates the weight matrix has low intrinsic rank
  • LoRA (Low-Rank Adaptation): instead of fine-tuning $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ directly, parameterize the update as $\Delta W = BA$ where $B \in \mathbb{R}^{d_{\text{out}} \times r}$, $A \in \mathbb{R}^{r \times d_{\text{in}}}$, $r \ll \min(d_{\text{in}}, d_{\text{out}})$. The claim (motivated by SVD analysis) is that fine-tuning updates live in a low-dimensional subspace of the full weight space.
