
SVD, QR & LU Decompositions

Matrix decompositions are the central algorithmic tool of numerical linear algebra: they rewrite a matrix as a product of structured factors (orthogonal, triangular, diagonal) that make subsequent operations — solving systems, computing inverses, finding least-squares solutions, approximating rank — cheap, stable, and interpretable. The Singular Value Decomposition (SVD) in particular is the universal factorization, applicable to any matrix regardless of shape or rank, and underlies dimensionality reduction, low-rank approximation, pseudoinverses, and the analysis of neural network weight matrices.

Concepts

SVD Geometry — $A = U\Sigma V^T$ (rotate → scale → rotate)

[Interactive figure: the unit circle is shown after each stage of the SVD — the input circle, $V^T\mathbf{x}$ (rotate in the domain), $\Sigma V^T\mathbf{x}$ (scale the axes), and $U\Sigma V^T\mathbf{x}$ (rotate in the range), which equals $A\mathbf{x}$. For the example matrix: singular values $\sigma_1 = 2.558$, $\sigma_2 = 0.977$, $\kappa = \sigma_1/\sigma_2 = 2.62$; singular vectors $\mathbf{v}_1 = (0.768, 0.641)$, $\mathbf{v}_2 = (-0.641, 0.768)$, $\mathbf{u}_1 = (0.851, 0.526)$, $\mathbf{u}_2 = (-0.526, 0.851)$. Full rank: two nonzero singular values, two rotation stages.]

Every matrix — whether it compresses images, encodes user preferences, or transforms activations in a neural network — does the same thing geometrically: rotate, scale, rotate. The SVD makes this explicit: any matrix decomposes as $A = U\Sigma V^T$, two rotations and a diagonal scaling. This universality is why SVD underlies PCA, low-rank approximation, pseudoinverses, and virtually every other dimension-reduction algorithm.

The Singular Value Decomposition (SVD)

Theorem (SVD). For any matrix $A \in \mathbb{R}^{m \times n}$, there exist orthogonal matrices $U \in \mathbb{R}^{m \times m}$, $V \in \mathbb{R}^{n \times n}$ and a matrix $\Sigma \in \mathbb{R}^{m \times n}$ with nonnegative entries on the diagonal (and zeros elsewhere) such that

$$A = U\Sigma V^T.$$

Two separate rotation matrices $U$ and $V$ are needed rather than one because the domain ($\mathbb{R}^n$) and the range ($\mathbb{R}^m$) are different spaces — they cannot share a basis. A single orthogonal matrix would conflate input and output coordinate systems. The two rotations plus a diagonal $\Sigma$ are the minimal factorization that separates which directions matter (geometry) from how much each direction matters (magnitude).

The diagonal entries $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min(m,n)} \geq 0$ are the singular values of $A$. The columns of $U$ are the left singular vectors, the columns of $V$ are the right singular vectors.

Geometric interpretation. Every linear map $A : \mathbb{R}^n \to \mathbb{R}^m$ decomposes into three stages:

  1. $V^T$: rotate in the domain (an isometry)
  2. $\Sigma$: scale each axis by $\sigma_i$ (stretching/compression)
  3. $U$: rotate in the range (an isometry)

A unit sphere in $\mathbb{R}^n$ is mapped to an ellipsoid in $\mathbb{R}^m$ with semi-axes $\sigma_1, \ldots, \sigma_r$ (where $r = \operatorname{rank}(A)$).

Existence proof sketch. Let $\sigma_1 = \|A\|_2 = \max_{\|\mathbf{x}\|=1} \|A\mathbf{x}\|$ and let $\mathbf{v}_1$ be a maximizer (one exists by compactness). Set $\mathbf{u}_1 = A\mathbf{v}_1/\sigma_1$. Extend to orthonormal bases and proceed by induction on the orthogonal complement — the same argument as the Spectral Theorem proof, applied to $A^TA$.
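A quick numerical check of the theorem's claims — orthogonal factors, sorted nonnegative singular values, exact reconstruction — as a sketch assuming NumPy:

```python
import numpy as np

# Verify the SVD theorem numerically for a random matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A)          # full SVD: U is 5x5, Vt is 3x3
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)

# U and V are orthogonal; singular values are sorted and nonnegative.
assert np.allclose(U.T @ U, np.eye(5))
assert np.allclose(Vt @ Vt.T, np.eye(3))
assert np.all(s[:-1] >= s[1:]) and np.all(s >= 0)

# A = U Sigma V^T up to floating-point roundoff.
assert np.allclose(A, U @ Sigma @ Vt)
```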

Thin SVD and Rank

Full SVD: $U \in \mathbb{R}^{m \times m}$, $\Sigma \in \mathbb{R}^{m \times n}$, $V \in \mathbb{R}^{n \times n}$.

Thin (economy) SVD: for $m \geq n$, keep only the first $n$ columns of $U$ and the top-left $n \times n$ block of $\Sigma$:

$$A = U_n \Sigma_n V^T, \qquad U_n \in \mathbb{R}^{m \times n}, \quad \Sigma_n \in \mathbb{R}^{n \times n}, \quad V \in \mathbb{R}^{n \times n}.$$

Rank: $\operatorname{rank}(A)$ = number of nonzero singular values. In the outer-product form:

$$A = \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i^T, \qquad r = \operatorname{rank}(A).$$

Connection to eigenvalues: the $\sigma_i^2$ are the nonzero eigenvalues of both $A^TA$ (right singular vectors = eigenvectors) and $AA^T$ (left singular vectors = eigenvectors). For symmetric PSD matrices, $\sigma_i = \lambda_i$ and $U = V = Q$ — the SVD and spectral decomposition coincide.
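The eigenvalue connection and the thin/full distinction can both be checked directly (a NumPy sketch):

```python
import numpy as np

# sigma_i^2 equal the eigenvalues of A^T A.
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))

s = np.linalg.svd(A, compute_uv=False)              # descending singular values
evals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]  # descending eigenvalues
assert np.allclose(s**2, evals)

# Thin (economy) SVD keeps only the first n columns of U.
U_thin, s_thin, Vt = np.linalg.svd(A, full_matrices=False)
assert U_thin.shape == (6, 4) and Vt.shape == (4, 4)
assert np.allclose(A, U_thin @ np.diag(s_thin) @ Vt)

# rank = number of nonzero singular values (A is full rank almost surely).
assert np.linalg.matrix_rank(A) == np.sum(s > 1e-12)
```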

Eckart-Young Theorem (Best Low-Rank Approximation)

Theorem. Among all rank-$k$ matrices $B \in \mathbb{R}^{m \times n}$:

$$\|A - A_k\|_2 = \min_{\operatorname{rank}(B) \leq k} \|A - B\|_2 = \sigma_{k+1},$$
$$\|A - A_k\|_F = \min_{\operatorname{rank}(B) \leq k} \|A - B\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_r^2},$$

where $A_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^T$ is the truncated SVD.

This is the theoretical foundation for:

  • PCA: project data onto the top-$k$ singular vectors of the data matrix
  • Image compression: store the rank-$k$ SVD instead of the full image
  • Collaborative filtering: approximate the user-item matrix with low-rank factors
  • Word embeddings: truncated SVD of a PMI co-occurrence matrix gives GloVe-like embeddings
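The Eckart-Young error formulas are easy to verify numerically — a small NumPy sketch:

```python
import numpy as np

# The truncated SVD attains the Eckart-Young error bounds.
rng = np.random.default_rng(2)
A = rng.standard_normal((8, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation

# Spectral-norm error is exactly sigma_{k+1}; Frobenius error is the tail norm.
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])
assert np.isclose(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:]**2)))
```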

The Pseudoinverse

For a system $A\mathbf{x} = \mathbf{b}$ that may be over- or under-determined, the Moore-Penrose pseudoinverse is:

$$A^+ = V\Sigma^+ U^T, \qquad \Sigma^+ = \operatorname{diag}\!\left(\frac{1}{\sigma_1}, \ldots, \frac{1}{\sigma_r}, 0, \ldots, 0\right).$$

Then $\mathbf{x}^+ = A^+\mathbf{b}$ is the minimum-norm least-squares solution: it minimizes $\|\mathbf{x}\|$ among all solutions minimizing $\|A\mathbf{x} - \mathbf{b}\|$.

For a full-rank tall matrix ($m > n$, $\operatorname{rank} = n$): $A^+ = (A^TA)^{-1}A^T$ — the ordinary least squares formula.

Regularized pseudoinverse (truncated SVD). When some $\sigma_i \approx 0$ (near-singular), inverting them amplifies noise. Truncate at a threshold $\tau$:

$$A^+_\tau = \sum_{\sigma_i > \tau} \frac{1}{\sigma_i} \mathbf{v}_i \mathbf{u}_i^T.$$

This truncation plays the same role as Tikhonov regularization (ridge regression): both damp the contribution of small singular values — truncation by a hard cutoff, ridge by the smooth filter factor $\sigma_i/(\sigma_i^2 + \lambda)$.
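A sketch of the pseudoinverse built directly from the SVD, assuming NumPy (for a full-column-rank $A$, no truncation is needed; `np.linalg.pinv`'s `rcond` parameter implements the cutoff):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)

# A^+ = V Sigma^+ U^T (here all sigma_i > 0, so Sigma^+ just inverts them).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1 / s) @ U.T
assert np.allclose(A_pinv, np.linalg.pinv(A))   # pinv zeroes sigma_i <= rcond*sigma_1

# x+ = A^+ b is the least-squares solution (unique for full column rank).
x = A_pinv @ b
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_lstsq)
```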

The QR Decomposition

Theorem (QR). For any $A \in \mathbb{R}^{m \times n}$ with $m \geq n$, there exist an orthogonal $Q \in \mathbb{R}^{m \times m}$ and an upper triangular $R \in \mathbb{R}^{m \times n}$ (with zero rows below row $n$) such that

$$A = QR.$$

If $A$ has full column rank, $R$ is uniquely determined up to signs of its diagonal entries, and the thin form $A = Q_1 R_1$ with $Q_1 \in \mathbb{R}^{m \times n}$, $R_1 \in \mathbb{R}^{n \times n}$ upper triangular is unique (with positive diagonal).

Relation to Gram-Schmidt. The QR decomposition is the matrix form of Gram-Schmidt orthogonalization: the columns of $Q_1$ are the orthonormalized columns of $A$, and $R_1$ encodes the change-of-basis coefficients.

Solving least squares via QR. For $m > n$ and full-rank $A$:

$$\min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|^2 \iff A^TA\mathbf{x} = A^T\mathbf{b} \iff R_1\mathbf{x} = Q_1^T\mathbf{b},$$

since $A^TA = R_1^T Q_1^T Q_1 R_1 = R_1^T R_1$ and $R_1^T$ is invertible.

Since $R_1$ is upper triangular, solve by back-substitution in $O(n^2)$ (after paying $O(mn^2)$ for the QR factorization).
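The recipe above as a runnable sketch (NumPy only; the back-substitution loop is written out to make the $O(n^2)$ step explicit):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)

Q1, R1 = np.linalg.qr(A, mode='reduced')  # thin QR: Q1 is 10x3, R1 is 3x3
c = Q1.T @ b

# Back-substitution on the upper-triangular system R1 x = c: O(n^2).
n = R1.shape[0]
x = np.zeros(n)
for i in range(n - 1, -1, -1):
    x[i] = (c[i] - R1[i, i+1:] @ x[i+1:]) / R1[i, i]

# Agrees with the library least-squares solver.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x, x_ref)
```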

Why QR over normal equations? The normal equations $A^TA\mathbf{x} = A^T\mathbf{b}$ square the condition number: $\kappa(A^TA) = \kappa(A)^2$. For ill-conditioned $A$, this causes catastrophic loss of numerical precision. QR avoids squaring the condition number and is the standard method in production numerical solvers.

Householder vs Gram-Schmidt:

| Method | Stability | Cost | Use case |
|---|---|---|---|
| Gram-Schmidt (classical) | Unstable for ill-conditioned $A$ | $O(mn^2)$ | Conceptual |
| Modified Gram-Schmidt | Stable | $O(mn^2)$ | Small matrices |
| Householder reflections | Backward stable | $O(mn^2)$ | Standard in practice |
| Givens rotations | Backward stable | $O(mn^2)$ | Sparse $A$, updating QR |

The LU Decomposition

Theorem (LU with pivoting). For any $A \in \mathbb{R}^{n \times n}$, there exist a permutation matrix $P$, a unit lower triangular $L$, and an upper triangular $U$ such that

$$PA = LU.$$

Gaussian elimination computes $L$ and $U$ with partial pivoting (swapping rows to put the largest remaining entry in the pivot position at each step — the permutation matrix $P$ records these swaps).

Solving $A\mathbf{x} = \mathbf{b}$ via LU:

  1. $PA\mathbf{x} = P\mathbf{b}$ → $LU\mathbf{x} = P\mathbf{b}$
  2. Forward substitution: solve $L\mathbf{y} = P\mathbf{b}$ in $O(n^2)$
  3. Back substitution: solve $U\mathbf{x} = \mathbf{y}$ in $O(n^2)$
  4. Total: $O(n^3)$ for the factorization, then $O(n^2)$ per right-hand side

Why LU for square systems, not QR? LU is roughly twice as fast as QR for square systems: the same $O(n^3)$ scaling, but half the constant ($\tfrac{2}{3}n^3$ flops versus $\tfrac{4}{3}n^3$ for Householder QR). For rectangular least-squares problems, QR is preferable.
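The factor-once, solve-many pattern from the steps above, sketched with SciPy's LU routines:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# Factor once (O(n^3)), then solve several right-hand sides (O(n^2) each).
rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
lu, piv = lu_factor(A)            # PA = LU with partial pivoting

for _ in range(3):
    b = rng.standard_normal(4)
    x = lu_solve((lu, piv), b)    # forward + back substitution
    assert np.allclose(A @ x, b)
```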

Cholesky decomposition. For symmetric positive definite $A = A^T \succ 0$:

$$A = LL^T$$

where $L$ is lower triangular with positive diagonal. This is the Cholesky factorization — twice as fast as LU (it exploits symmetry), always stable (no pivoting needed, since $A \succ 0$ guarantees positive pivots), and the standard method for solving Gaussian process regression, computing covariance-weighted least squares, and sampling from multivariate Gaussians.

Condition Number and Numerical Stability

The condition number of $A$ (in the 2-norm) is:

$$\kappa_2(A) = \|A\|_2 \|A^{-1}\|_2 = \frac{\sigma_1}{\sigma_n}.$$

It measures how much relative error in $\mathbf{b}$ can be amplified in the solution $\mathbf{x}$ of $A\mathbf{x} = \mathbf{b}$:

$$\frac{\|\delta\mathbf{x}\|}{\|\mathbf{x}\|} \leq \kappa(A) \cdot \frac{\|\delta\mathbf{b}\|}{\|\mathbf{b}\|}.$$

In floating-point arithmetic with machine epsilon $\epsilon_{\text{mach}} \approx 10^{-16}$ (double precision), a condition number of $\kappa$ means you lose roughly $\log_{10}(\kappa)$ decimal digits of accuracy. When $\kappa > 10^{12}$, the linear system is essentially unsolvable at double precision.
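The digits-lost rule can be seen on Hilbert matrices, a classic ill-conditioned family (a NumPy sketch; the `hilbert` helper is defined locally for illustration):

```python
import numpy as np

def hilbert(n):
    """Hilbert matrix H[i, j] = 1 / (i + j + 1)."""
    i = np.arange(n)
    return 1.0 / (1 + i[:, None] + i[None, :])

for n in (4, 8, 12):
    H = hilbert(n)
    kappa = np.linalg.cond(H)                # sigma_1 / sigma_n
    x_true = np.ones(n)
    x_hat = np.linalg.solve(H, H @ x_true)   # would be exact in exact arithmetic
    err = np.linalg.norm(x_hat - x_true)
    # Roughly log10(kappa) digits are lost; by n = 12, kappa is near 1e16
    # and the computed solution can be wrong in every digit.
    print(f"n={n:2d}  cond={kappa:.1e}  error={err:.1e}")
```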

Worked Example

Example 1: SVD of a $2 \times 3$ Matrix

Let $A = \begin{pmatrix}1 & 1 & 0 \\ 0 & 1 & 1\end{pmatrix}$.

Step 1: Compute $A^TA$.

$$A^TA = \begin{pmatrix}1&0\\1&1\\0&1\end{pmatrix}\begin{pmatrix}1&1&0\\0&1&1\end{pmatrix} = \begin{pmatrix}1&1&0\\1&2&1\\0&1&1\end{pmatrix}.$$

Step 2: Eigenvalues of $A^TA$ (these are the $\sigma_i^2$). The characteristic polynomial gives $\lambda = 3, 1, 0$, so $\sigma_1 = \sqrt{3}$, $\sigma_2 = 1$, $\sigma_3 = 0$.

The zero singular value confirms $\operatorname{rank}(A) = 2$, consistent with $A$ being $2 \times 3$ with full row rank.

Step 3: Right singular vectors = eigenvectors of $A^TA$:

$$\mathbf{v}_1 = \frac{1}{\sqrt{6}}\begin{pmatrix}1\\2\\1\end{pmatrix}, \quad \mathbf{v}_2 = \frac{1}{\sqrt{2}}\begin{pmatrix}1\\0\\-1\end{pmatrix}, \quad \mathbf{v}_3 = \frac{1}{\sqrt{3}}\begin{pmatrix}1\\-1\\1\end{pmatrix}.$$

Step 4: Left singular vectors $\mathbf{u}_i = A\mathbf{v}_i / \sigma_i$:

$$\mathbf{u}_1 = \frac{1}{\sqrt{2}}\begin{pmatrix}1\\1\end{pmatrix}, \quad \mathbf{u}_2 = \frac{1}{\sqrt{2}}\begin{pmatrix}1\\-1\end{pmatrix}.$$

Low-rank approximation: $A \approx \sqrt{3}\,\mathbf{u}_1\mathbf{v}_1^T$ captures $\sigma_1^2/(\sigma_1^2+\sigma_2^2) = 3/4 = 75\%$ of the squared Frobenius norm.
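The worked example can be confirmed with NumPy:

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])

# sigma_1 = sqrt(3), sigma_2 = 1, as computed by hand.
s = np.linalg.svd(A, compute_uv=False)
assert np.allclose(s, [np.sqrt(3), 1.0])

# u_1 = A v_1 / sigma_1 with v_1 = (1, 2, 1)/sqrt(6).
v1 = np.array([1.0, 2.0, 1.0]) / np.sqrt(6)
u1 = A @ v1 / np.sqrt(3)
assert np.allclose(u1, np.array([1.0, 1.0]) / np.sqrt(2))

# Rank-1 truncation captures sigma_1^2 / (sigma_1^2 + sigma_2^2) = 75% of energy.
assert np.isclose(s[0]**2 / np.sum(s**2), 0.75)
```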

Example 2: QR and Least Squares

Given the data points $(1,1), (2,2.1), (3,2.9), (4,4.2)$, fit a line $y = \alpha + \beta x$.

$$A = \begin{pmatrix}1&1\\1&2\\1&3\\1&4\end{pmatrix}, \quad \mathbf{b} = \begin{pmatrix}1\\2.1\\2.9\\4.2\end{pmatrix}.$$

$\operatorname{rank}(A) = 2$, so the least-squares solution is unique. The thin QR factorization $A = Q_1 R_1$ yields $\hat{\mathbf{x}} = R_1^{-1}Q_1^T\mathbf{b} = \begin{pmatrix}\hat{\alpha}\\\hat{\beta}\end{pmatrix} = \begin{pmatrix}-0.05\\1.04\end{pmatrix}$ — the line $y = -0.05 + 1.04x$.
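Checking the fit numerically (for these data the least-squares solution is exactly $\alpha = -0.05$, $\beta = 1.04$):

```python
import numpy as np

A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
b = np.array([1.0, 2.1, 2.9, 4.2])

# Thin QR, then solve the triangular system R1 x = Q1^T b.
Q1, R1 = np.linalg.qr(A, mode='reduced')
x = np.linalg.solve(R1, Q1.T @ b)

assert np.allclose(x, [-0.05, 1.04])       # alpha = -0.05, beta = 1.04
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])
```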

Example 3: Cholesky for Gaussian Sampling

To sample $\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$: compute the Cholesky factor $\Sigma = LL^T$, draw $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)$, then set $\mathbf{z} = \boldsymbol{\mu} + L\boldsymbol{\epsilon}$.

This is how multivariate Gaussian samples are generated in PyTorch (torch.distributions.MultivariateNormal uses Cholesky internally), GPyTorch, Stan, and TensorFlow Probability. The Cholesky factor also gives the log-determinant needed for the multivariate Gaussian log-likelihood:

$$\log |\Sigma| = 2 \sum_{i=1}^n \log L_{ii}.$$
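A minimal sampling sketch, assuming NumPy (the values of $\boldsymbol{\mu}$ and $\Sigma$ are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])            # symmetric positive definite

L = np.linalg.cholesky(Sigma)             # Sigma = L L^T, L lower triangular
eps = rng.standard_normal((100_000, 2))   # rows ~ N(0, I)
z = mu + eps @ L.T                        # rows ~ N(mu, Sigma)

# Empirical moments approach the true ones.
assert np.allclose(z.mean(axis=0), mu, atol=0.05)
assert np.allclose(np.cov(z, rowvar=False), Sigma, atol=0.05)

# log|Sigma| = 2 * sum(log diag(L)).
assert np.isclose(2 * np.log(np.diag(L)).sum(), np.linalg.slogdet(Sigma)[1])
```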

Connections

Where Your Intuition Breaks

The SVD is always expensive — it is an $O(\min(mn^2, m^2n))$ operation. For the full SVD this is true, but in practice you almost never need the full decomposition. Randomized SVD (Halko, Martinsson & Tropp 2011) computes a rank-$k$ approximation in $O(mnk)$ — linear in $k$, which is small when the matrix has low intrinsic rank. For large sparse matrices (graph Laplacians, word co-occurrence matrices), Lanczos-based methods compute the top $k$ singular vectors using only matrix-vector products, without ever densifying the matrix. The correct mental model is: "computing the full SVD is expensive; computing the top-$k$ SVD is affordable and is usually what you want."

Decomposition Comparison Table

| Decomposition | Form | Shape of $A$ | Rank needed | Primary use |
|---|---|---|---|---|
| SVD | $U\Sigma V^T$ | Any $m \times n$ | Any | Low-rank approx, pseudoinverse, PCA |
| Spectral | $Q\Lambda Q^T$ | Square symmetric | Any | Eigenanalysis, quadratic forms |
| QR | $QR$ | $m \geq n$ | Any | Least squares, Gram-Schmidt |
| LU (+ pivot) | $PA = LU$ | Square | Full | Linear systems |
| Cholesky | $LL^T$ | Square symmetric PD | Full | Covariance systems, Gaussian sampling |
| Eigendecomp | $PDP^{-1}$ | Square | — | Diagonalizable matrices only |

When the Normal Equations Are Dangerous

The normal equations $A^TA\mathbf{x} = A^T\mathbf{b}$ are algebraically equivalent to the QR solution but numerically inferior:

Example. Take $A = \begin{pmatrix}1 & 1 \\ \epsilon & 0 \\ 0 & \epsilon\end{pmatrix}$ with $\epsilon = 10^{-8}$ (double precision). Then $\kappa(A) \approx \sqrt{2}/\epsilon \approx 1.4 \times 10^8$, but $\kappa(A^TA) \approx \kappa(A)^2 \approx 2 \times 10^{16} \approx 1/\epsilon_{\text{mach}}$. The normal equations are at the numerical precision limit even though the original system is well-posed. QR avoids this by never forming $A^TA$.
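The example can be reproduced in a few lines (in float64, $1 + \epsilon^2$ rounds to $1$, so the computed $A^TA$ is exactly singular):

```python
import numpy as np

# Forming A^T A squares the condition number and can destroy it.
eps = 1e-8
A = np.array([[1.0, 1.0],
              [eps, 0.0],
              [0.0, eps]])

kappa_A = np.linalg.cond(A)
kappa_AtA = np.linalg.cond(A.T @ A)

assert 1e7 < kappa_A < 1e9            # about sqrt(2)/eps ~ 1.4e8
assert kappa_AtA > 1e15               # at (or beyond) the double-precision limit
```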

⚠️Warning

Never form $A^TA$ explicitly when solving least-squares problems numerically. Use numpy.linalg.lstsq, scipy.linalg.lstsq, or QR-based solvers. The only exception is when $A$ is very tall and skinny ($m \gg n$) and well-conditioned — and even then, use the rcond parameter to handle near-rank-deficiency.

💡Intuition

SVD is PCA. If $X \in \mathbb{R}^{n \times d}$ is a centered data matrix (rows = observations), the SVD $X = U\Sigma V^T$ gives: right singular vectors $V$ = principal directions, $\Sigma^2/(n-1)$ = eigenvalues of the sample covariance, and $XV = U\Sigma$ = principal component scores. PCA via SVD is preferred over the covariance-eigendecomposition route when $d \gg n$ (more features than samples), because $XX^T \in \mathbb{R}^{n \times n}$ is cheaper to decompose than $X^TX \in \mathbb{R}^{d \times d}$.
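The correspondence can be checked on synthetic data (a NumPy sketch; the data matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))  # correlated data
Xc = X - X.mean(axis=0)                    # center the columns

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# s^2 / (n - 1) are the eigenvalues of the sample covariance.
n = Xc.shape[0]
cov = np.cov(Xc, rowvar=False)
evals = np.sort(np.linalg.eigvalsh(cov))[::-1]
assert np.allclose(s**2 / (n - 1), evals)

# Principal component scores: X V = U Sigma.
scores = Xc @ Vt.T
assert np.allclose(scores, U * s)
```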

💡Intuition

Randomized SVD for large matrices. For $A \in \mathbb{R}^{m \times n}$ with $m, n \gg k$, computing the full SVD costs $O(\min(mn^2, m^2n))$. Randomized SVD (Halko, Martinsson & Tropp 2011) computes a rank-$k$ approximation in $O(mnk)$: sketch the column space with a random Gaussian matrix $\Omega \in \mathbb{R}^{n \times k}$, form $Y = A\Omega$, QR-decompose $Y = QR$, then compute the SVD of the small matrix $Q^TA \in \mathbb{R}^{k \times n}$. This is how sklearn.decomposition.TruncatedSVD and torch.svd_lowrank work.
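A compact sketch of the algorithm just described (the `randomized_svd` helper name and the oversampling default are illustrative, not a library API):

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Rank-k SVD approximation via a random range sketch (Halko et al.)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + oversample))  # random sketch matrix
    Q, _ = np.linalg.qr(A @ Omega)                    # orthonormal range basis
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k, :]

# On an exactly low-rank matrix, the sketch recovers it (up to roundoff).
rng = np.random.default_rng(8)
A = rng.standard_normal((300, 8)) @ rng.standard_normal((8, 200))  # rank <= 8
U, s, Vt = randomized_svd(A, k=8)
assert np.allclose(A, (U * s) @ Vt)
```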

SVD in Neural Network Analysis

Recent interpretability research uses SVD to analyze weight matrices in transformers:

  • Effective rank: the number of singular values above a threshold, measuring how "compressed" a weight matrix is
  • Intrinsic dimensionality: rank-$k$ approximation quality as a function of $k$ — a steep drop in the singular-value spectrum indicates the weight matrix has low intrinsic rank
  • LoRA (Low-Rank Adaptation): instead of fine-tuning $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ directly, parameterize the update as $\Delta W = BA$ where $B \in \mathbb{R}^{d_{\text{out}} \times r}$, $A \in \mathbb{R}^{r \times d_{\text{in}}}$, $r \ll \min(d_{\text{in}}, d_{\text{out}})$. The claim (motivated by SVD analysis) is that fine-tuning updates live in a low-dimensional subspace of the full weight space.
