
Linear Systems & Least Squares

Linear systems $A\mathbf{x} = \mathbf{b}$ and their generalization to overdetermined systems via least squares are two of the most ubiquitous computational problems in science and engineering. The geometric insight — that the least-squares solution projects $\mathbf{b}$ orthogonally onto the column space of $A$ — unifies the normal equations, the pseudoinverse, and QR-based solvers into a single picture. Ridge regression, weighted least squares, and the Gauss-Markov theorem all follow directly from this geometry.

Concepts

Least Squares — drag points vertically to update the OLS fit

[Interactive demo. The panel reports the fitted line $\hat{y} = \hat{\alpha} + \hat{\beta}x$, the SSR, and $R^2$, plus a normal-equations check: $\sum_i e_i = 0$ and $\sum_i x_i e_i = 0$, since the residuals are orthogonal to the columns of $X$.]

Red squares visualize squared residuals — OLS minimizes their total area. The fit line is the projection of $\mathbf{b}$ onto the column space of $X$.
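The normal-equations check can be reproduced in a few lines. A minimal sketch, assuming numpy and using made-up data points (not the demo's exact values):

```python
import numpy as np

# Illustrative data: five points roughly on y = x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# OLS fit; lstsq uses a numerically stable factorization internally
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta  # residuals

# Residuals are orthogonal to every column of X
print(np.sum(e))      # ~0  (intercept column)
print(np.sum(x * e))  # ~0  (slope column)
```

Both sums vanish to machine precision for any data, because orthogonality of the residual to the columns of $X$ is exactly the condition defining the OLS minimizer.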

When you fit a regression line to data, there is usually no line that passes through every point — so you find the line that minimizes the total squared distance from points to line. That minimization is the least-squares problem, and its solution has a geometric interpretation: the best-fit prediction $A\hat{\mathbf{x}}$ is the orthogonal projection of the target $\mathbf{b}$ onto the column space of $A$. Every linear regression algorithm — OLS, ridge, weighted — is a variant of this single projection.

Four Fundamental Subspaces

For $A \in \mathbb{R}^{m \times n}$, the four fundamental subspaces and their relationships:

| Subspace | Definition | Dimension | Contains |
| --- | --- | --- | --- |
| Column space $\operatorname{col}(A)$ | $\{A\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\}$ | $r = \operatorname{rank}(A)$ | All achievable $A\mathbf{x}$ |
| Null space $\operatorname{null}(A)$ | $\{\mathbf{x} : A\mathbf{x} = \mathbf{0}\}$ | $n - r$ | Solutions to the homogeneous system |
| Row space $\operatorname{row}(A)$ | $\operatorname{col}(A^T)$ | $r$ | All $A^T\mathbf{y}$ |
| Left null space $\operatorname{null}(A^T)$ | $\{\mathbf{y} : A^T\mathbf{y} = \mathbf{0}\}$ | $m - r$ | Left zeros of $A$ |

Fundamental Theorem of Linear Algebra (Strang). $\operatorname{col}(A) \perp \operatorname{null}(A^T)$ and $\operatorname{row}(A) \perp \operatorname{null}(A)$. Every $\mathbf{b} \in \mathbb{R}^m$ decomposes uniquely as $\mathbf{b} = \mathbf{b}_{\text{col}} + \mathbf{b}_{\text{null}}$ with $\mathbf{b}_{\text{col}} \in \operatorname{col}(A)$ and $\mathbf{b}_{\text{null}} \in \operatorname{null}(A^T)$.

Consistency. $A\mathbf{x} = \mathbf{b}$ has a solution $\iff \mathbf{b} \in \operatorname{col}(A) \iff \mathbf{b}_{\text{null}} = \mathbf{0}$.
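In code, consistency can be tested by comparing $\operatorname{rank}(A)$ with $\operatorname{rank}([A \mid \mathbf{b}])$: the system is solvable exactly when appending $\mathbf{b}$ does not raise the rank. A sketch, assuming numpy (the matrix here is illustrative):

```python
import numpy as np

def is_consistent(A, b, tol=1e-10):
    """Ax = b is solvable iff appending b does not raise the rank."""
    rank_A = np.linalg.matrix_rank(A, tol=tol)
    rank_Ab = np.linalg.matrix_rank(np.column_stack([A, b]), tol=tol)
    return rank_A == rank_Ab

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
print(is_consistent(A, np.array([1.0, 2.0, 3.0])))  # True: b = 1*col1 + 2*col2
print(is_consistent(A, np.array([1.0, 2.0, 4.0])))  # False: b leaves col(A)
```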

Solution Structure of Linear Systems

For a consistent system $A\mathbf{x} = \mathbf{b}$:

$$\mathbf{x} = \mathbf{x}_p + \mathbf{x}_h, \qquad \mathbf{x}_p \text{ a particular solution}, \quad \mathbf{x}_h \in \operatorname{null}(A).$$

Uniqueness: the solution is unique iff $\operatorname{null}(A) = \{\mathbf{0}\}$, i.e. iff $\operatorname{rank}(A) = n$ (full column rank).

Three cases:

| Case | $\operatorname{rank}(A)$ | Solutions | ML context |
| --- | --- | --- | --- |
| Square ($m = n$), full rank | $r = m = n$ | Unique | Square invertible systems |
| Tall ($m > n$), full column rank | $r = n$ | Overdetermined → least squares | Regression ($m$ data points, $n$ features) |
| Wide ($m < n$), full row rank | $r = m$ | Underdetermined → minimum norm | Compressed sensing ($n \gg m$) |

The Least-Squares Problem

For overdetermined $A \in \mathbb{R}^{m \times n}$ ($m > n$) with $\mathbf{b} \notin \operatorname{col}(A)$, there is no exact solution. Instead, minimize the residual:

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|^2.$$

Geometric interpretation. The minimum is achieved when $A\hat{\mathbf{x}}$ is the orthogonal projection of $\mathbf{b}$ onto $\operatorname{col}(A)$. The residual $\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}$ must be orthogonal to every column of $A$:

$$A^T(\mathbf{b} - A\hat{\mathbf{x}}) = \mathbf{0} \implies A^TA\hat{\mathbf{x}} = A^T\mathbf{b}.$$

These are the normal equations. When $A$ has full column rank, $A^TA \succ 0$ is invertible and:

$$\hat{\mathbf{x}} = (A^TA)^{-1}A^T\mathbf{b} = A^+\mathbf{b}.$$

The matrix $A^+ = (A^TA)^{-1}A^T$ is the (left) pseudoinverse. The projection matrix onto $\operatorname{col}(A)$ is $P = AA^+ = A(A^TA)^{-1}A^T$.
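Under the full-column-rank assumption, the pseudoinverse and projection matrix can be checked directly. A sketch, assuming numpy, with a random tall matrix (full column rank with probability 1):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))   # tall, full column rank
b = rng.standard_normal(6)

A_pinv = np.linalg.pinv(A)        # equals (A^T A)^{-1} A^T here
x_hat = A_pinv @ b                # least-squares solution
P = A @ A_pinv                    # projection onto col(A)

r = b - A @ x_hat                 # residual
print(np.allclose(P @ P, P))      # True: P is idempotent
print(np.allclose(A.T @ r, 0))    # True: residual is orthogonal to col(A)
```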

The normal equations $A^TA\hat{\mathbf{x}} = A^T\mathbf{b}$ arise necessarily from the orthogonality condition. Setting the residual perpendicular to every column of $A$ is not a design choice — it is the only condition that locates the minimum of $\|A\mathbf{x} - \mathbf{b}\|^2$. Every other linear system solver for least squares (QR, SVD, gradient descent on the normal equations) is solving the same geometric problem, just via a different computational route.

Weighted Least Squares

When observations have different noise levels, minimize a weighted residual:

$$\hat{\mathbf{x}}_W = \arg\min_{\mathbf{x}} \|W^{1/2}(A\mathbf{x} - \mathbf{b})\|^2 = \arg\min_{\mathbf{x}} (A\mathbf{x} - \mathbf{b})^T W (A\mathbf{x} - \mathbf{b}),$$

where $W \succ 0$ is a diagonal weight matrix. The normal equations become:

$$A^TWA\,\hat{\mathbf{x}}_W = A^TW\mathbf{b} \implies \hat{\mathbf{x}}_W = (A^TWA)^{-1}A^TW\mathbf{b}.$$
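Since the weighted problem is just ordinary least squares on rows rescaled by $\sqrt{w_i}$, one practical implementation reuses a stable OLS solver rather than forming $A^TWA$. A sketch, assuming numpy and random illustrative data:

```python
import numpy as np

def wls(A, b, w):
    """Weighted LS: minimize sum_i w_i * (a_i . x - b_i)^2.
    Scale each row by sqrt(w_i), then solve ordinary least squares."""
    sw = np.sqrt(w)
    return np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)[0]

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3))
b = rng.standard_normal(20)
w = rng.uniform(0.5, 2.0, size=20)

x_w = wls(A, b, w)
# Agrees with the normal-equations formula (A^T W A)^{-1} A^T W b
W = np.diag(w)
x_ref = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)
print(np.allclose(x_w, x_ref))  # True
```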

Gauss-Markov Theorem. If the true model is $\mathbf{b} = A\mathbf{x}^* + \boldsymbol{\epsilon}$ with $\mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}$ and $\operatorname{Cov}(\boldsymbol{\epsilon}) = \sigma^2 W^{-1}$, then the weighted least-squares estimator is the Best Linear Unbiased Estimator (BLUE) — it has the smallest variance among all linear unbiased estimators. With $W = I$ (homoscedastic noise), ordinary OLS is BLUE.

Ridge Regression and Tikhonov Regularization

When $A^TA$ is ill-conditioned or singular, add a regularization term:

$$\hat{\mathbf{x}}_\alpha = \arg\min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|^2 + \alpha\|\mathbf{x}\|^2 = (A^TA + \alpha I)^{-1}A^T\mathbf{b}.$$

Effect on singular values. Using the SVD $A = U\Sigma V^T$:

$$\hat{\mathbf{x}}_\alpha = \sum_{i=1}^r \frac{\sigma_i}{\sigma_i^2 + \alpha} \mathbf{v}_i \mathbf{u}_i^T \mathbf{b}.$$

Compare to the unregularized pseudoinverse solution, $\sum_i \frac{1}{\sigma_i} \mathbf{v}_i \mathbf{u}_i^T \mathbf{b}$. Ridge replaces $1/\sigma_i$ with $\sigma_i/(\sigma_i^2 + \alpha)$, which shrinks the contribution of small singular values — exactly the right behavior when the directions with small $\sigma_i$ are dominated by noise.
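The shrinkage is easy to see numerically by tabulating the ridge filter factors $\sigma_i/(\sigma_i^2 + \alpha)$ against $1/\sigma_i$. A sketch with illustrative singular values:

```python
import numpy as np

sigma = np.array([10.0, 1.0, 0.01])   # singular values, one tiny
alpha = 0.1

pinv_weights = 1.0 / sigma            # unregularized amplification
ridge_weights = sigma / (sigma**2 + alpha)

for s, p, r in zip(sigma, pinv_weights, ridge_weights):
    print(f"sigma={s:6.2f}  1/sigma={p:8.2f}  ridge={r:8.4f}")
# For sigma = 0.01: 1/sigma = 100, but the ridge factor is ~0.1 —
# the noisy direction's 100x amplification is suppressed.
```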

Bias-variance tradeoff: ridge introduces bias ($\hat{\mathbf{x}}_\alpha \neq \mathbf{x}^*$ even in the limit of infinite data) but reduces variance. The optimal $\alpha$ minimizes the mean squared prediction error.

Condition number improvement: $\kappa(A^TA + \alpha I) = (\sigma_1^2 + \alpha)/(\sigma_n^2 + \alpha) < \kappa(A^TA) = \sigma_1^2/\sigma_n^2$ for $\alpha > 0$.

Numerical Methods for Least Squares

Three standard approaches:

1. QR decomposition (recommended for dense $A$).

Compute $A = Q_1 R_1$ (thin QR). Then:

$$\hat{\mathbf{x}} = R_1^{-1} Q_1^T \mathbf{b}.$$

Cost: $O(mn^2)$ for the QR factorization, $O(mn + n^2)$ per right-hand side. The condition number of the system is $\kappa(A)$, not $\kappa(A)^2$. This is the production-standard method.
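A sketch of the thin-QR solve, assuming scipy is available; note that $R_1^{-1}$ is never formed explicitly — back-substitution is used instead:

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 4))   # illustrative tall matrix
b = rng.standard_normal(50)

Q1, R1 = qr(A, mode="economic")         # thin QR: Q1 is 50x4, R1 is 4x4
x_hat = solve_triangular(R1, Q1.T @ b)  # back-substitution on upper-triangular R1

print(np.allclose(x_hat, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```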

2. SVD (most robust; handles rank-deficient $A$).

Compute $A = U\Sigma V^T$. Truncate small singular values at a threshold $\tau$:

$$\hat{\mathbf{x}}^+ = \sum_{\sigma_i > \tau} \frac{1}{\sigma_i} \mathbf{v}_i (\mathbf{u}_i^T \mathbf{b}).$$

Cost: $O(mn^2)$ for the SVD (with a larger constant than QR). Use when numerical rank determination is needed.
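A sketch of the truncated-SVD solver, assuming numpy; the matrix below is deliberately rank-deficient (the third column is the sum of the first two), and the threshold `tau` is a user choice:

```python
import numpy as np

def tsvd_solve(A, b, tau=1e-10):
    """Minimum-norm least-squares solution, dropping sigma_i <= tau."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > tau
    return Vt[keep].T @ ((U[:, keep].T @ b) / s[keep])

# Rank-deficient example: third column = first + second
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 0.0, 2.0]])
b = np.array([1.0, 1.0, 2.0, 2.0])   # lies in col(A): b = col1 + col2

x = tsvd_solve(A, b, tau=1e-8)
print(np.allclose(A @ x, b))  # True: consistent system solved exactly
```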

3. Normal equations (fast but numerically inferior).

Solve $A^TA\hat{\mathbf{x}} = A^T\mathbf{b}$ via Cholesky. Cost: $O(mn^2/2 + n^3/6)$. The condition number is $\kappa(A)^2$. Avoid unless $m \gg n$ and $\kappa(A)$ is small.

Iterative Solvers for Large Systems

For large sparse $A$ (e.g., graph Laplacians, finite-difference matrices), direct methods ($O(n^3)$) are too slow. Iterative methods:

| Method | Per-iteration cost | Convergence rate | Use case |
| --- | --- | --- | --- |
| Conjugate Gradient (CG) | $O(\mathrm{nnz})$ | $\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k$ | $A$ symmetric PD |
| LSQR | $O(\mathrm{nnz})$ | Similar to CG on $A^TA$ | General $A$, least squares |
| MINRES | $O(\mathrm{nnz})$ | — | $A$ symmetric indefinite |
| Randomized Kaczmarz | $O(n)$ | Depends on row norms | Very large $m$, stochastic |

Preconditioning: replace $A\mathbf{x} = \mathbf{b}$ with $M^{-1}A\mathbf{x} = M^{-1}\mathbf{b}$, where $M \approx A$ is easy to invert. Good preconditioners achieve $\kappa(M^{-1}A) \ll \kappa(A)$, dramatically accelerating convergence.
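A sketch of preconditioned CG on a sparse SPD system, assuming scipy. The Jacobi (diagonal) preconditioner shown here is the simplest choice and does little for this constant-diagonal example; real problems would use something stronger, such as incomplete Cholesky:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, LinearOperator

# 1-D Laplacian (tridiagonal, SPD) — a standard sparse test matrix
n = 200
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A = sp.diags([off, main, off], [-1, 0, 1], format="csr")
b = np.ones(n)

# Jacobi preconditioner: apply M^{-1} = diag(A)^{-1} as a LinearOperator
inv_diag = 1.0 / A.diagonal()
M = LinearOperator((n, n), matvec=lambda v: inv_diag * v)

x, info = cg(A, b, M=M)
print(info == 0)                               # True: converged
print(np.linalg.norm(A @ x - b) < 1e-2)        # True: small residual
```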

Worked Examples

Example 1: Full-Rank Least Squares — Polynomial Regression

Fit $y = \beta_0 + \beta_1 x + \beta_2 x^2$ to data $(x_i, y_i)_{i=1}^5$. The design matrix:

$$A = \begin{pmatrix}1 & x_1 & x_1^2 \\ \vdots & \vdots & \vdots \\ 1 & x_5 & x_5^2\end{pmatrix} \in \mathbb{R}^{5 \times 3}.$$

$A$ has rank 3 when at least three of the $x_i$ are distinct. Normal equations: $(A^TA)\hat{\boldsymbol{\beta}} = A^T\mathbf{y}$. In practice, never form $A^TA$ — use numpy.linalg.lstsq(A, y), which solves the problem through a stable factorization instead.

Condition number warning. For Vandermonde matrices (columns $1, x, x^2, \ldots, x^{d-1}$), $\kappa(A)$ grows exponentially in $d$. At degree 10 with $x_i \in [0,1]$, $\kappa(A) \sim 10^{13}$ — at the edge of double-precision reliability. Use orthogonal polynomial bases (Chebyshev, Legendre) instead.
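The conditioning gap between the monomial and Chebyshev bases can be measured directly; a sketch using numpy's polynomial helpers, with 50 equispaced sample points as an illustrative choice:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 50)
deg = 10

# Monomial (Vandermonde) design matrix vs Chebyshev basis on [-1, 1]
V = np.vander(x, deg + 1, increasing=True)
C = np.polynomial.chebyshev.chebvander(2 * x - 1, deg)

print(f"kappa(Vandermonde) ~ {np.linalg.cond(V):.2e}")
print(f"kappa(Chebyshev)   ~ {np.linalg.cond(C):.2e}")
# The Chebyshev design matrix is orders of magnitude better conditioned.
```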

Example 2: Underdetermined System — Minimum-Norm Solution

For $A = \begin{pmatrix}1 & 1 & 0 \\ 0 & 1 & 1\end{pmatrix}$, $\mathbf{b} = \begin{pmatrix}1 \\ 1\end{pmatrix}$: $m = 2 < n = 3$, so there are infinitely many solutions.

The null space of $A$: $(1, -1, 1)^T$ spans $\operatorname{null}(A)$ (verify: $A(1,-1,1)^T = \mathbf{0}$).

Minimum-norm solution $\mathbf{x}^+ = A^+\mathbf{b} = A^T(AA^T)^{-1}\mathbf{b}$: this is the unique solution with $\mathbf{x}^+ \in \operatorname{row}(A)$ (orthogonal to the null space).

$$AA^T = \begin{pmatrix}2 & 1 \\ 1 & 2\end{pmatrix}, \quad (AA^T)^{-1} = \frac{1}{3}\begin{pmatrix}2 & -1 \\ -1 & 2\end{pmatrix}, \quad \mathbf{x}^+ = A^T(AA^T)^{-1}\mathbf{b} = \frac{1}{3}\begin{pmatrix}1\\2\\1\end{pmatrix}.$$
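This hand computation can be checked with numpy's pseudoinverse:

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])

x_plus = np.linalg.pinv(A) @ b
print(x_plus)                       # [1/3, 2/3, 1/3]
print(np.allclose(A @ x_plus, b))   # True: an exact solution
null_dir = np.array([1.0, -1.0, 1.0])   # spans null(A)
print(abs(x_plus @ null_dir) < 1e-12)   # True: orthogonal to the null space
```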

In compressed sensing, the minimum-norm solution of an underdetermined system is the starting point — but we additionally want the sparsest solution, which requires $\ell^1$ minimization (LASSO), not $\ell^2$.

Example 3: Ridge Regression as Bayesian MAP

Suppose $\mathbf{b} = A\mathbf{x} + \boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2 I)$ and prior $\mathbf{x} \sim \mathcal{N}(0, \tau^2 I)$.

The MAP (maximum a posteriori) estimate maximizes $p(\mathbf{x}|\mathbf{b}) \propto p(\mathbf{b}|\mathbf{x})\,p(\mathbf{x})$, which is equivalent to minimizing:

$$-\log p(\mathbf{x}|\mathbf{b}) = \frac{1}{2\sigma^2}\|A\mathbf{x} - \mathbf{b}\|^2 + \frac{1}{2\tau^2}\|\mathbf{x}\|^2 + \text{const}.$$

Setting $\alpha = \sigma^2/\tau^2$ recovers exactly ridge regression. Ridge regression is MAP estimation under a Gaussian prior with variance $\tau^2 = \sigma^2/\alpha$. Larger $\alpha$ (stronger regularization) corresponds to a tighter prior (less variance in the prior on $\mathbf{x}$).
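The equivalence can be sanity-checked numerically: the ridge closed form with $\alpha = \sigma^2/\tau^2$ should zero the gradient of the negative log-posterior. A sketch with random illustrative data:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((30, 5))
b = rng.standard_normal(30)
sigma2, tau2 = 0.5, 2.0
alpha = sigma2 / tau2

# Ridge closed form
x_hat = np.linalg.solve(A.T @ A + alpha * np.eye(5), A.T @ b)

# Gradient of the negative log-posterior should vanish at x_hat:
# grad = (1/sigma^2) A^T (A x - b) + x / tau^2
grad = (A.T @ (A @ x_hat - b)) / sigma2 + x_hat / tau2
print(np.allclose(grad, 0))  # True
```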

Connections

Where Your Intuition Breaks

Solving the normal equations $A^TA\hat{\mathbf{x}} = A^T\mathbf{b}$ looks like the obvious way to do least squares. Algebraically it is correct, but numerically it is often dangerous. Forming $A^TA$ squares the condition number: $\kappa(A^TA) = \kappa(A)^2$. For a moderately ill-conditioned regression matrix with $\kappa(A) = 10^4$, the normal equations have $\kappa(A^TA) \approx 10^8$ — near the limit of double-precision arithmetic. QR-based solvers avoid this by never forming $A^TA$, maintaining the original condition number. In practice, always use numpy.linalg.lstsq or scipy.linalg.lstsq; never solve the normal equations explicitly.
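The squaring is easy to demonstrate by constructing a matrix with a known condition number from its SVD factors; a sketch assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 100, 5

# Build A = U diag(s) V^T with singular values from 1 down to 1e-4,
# so kappa(A) = 1e4 by construction
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -4, n)
A = (U * s) @ V.T

print(f"kappa(A)     ~ {np.linalg.cond(A):.1e}")        # ~1e4
print(f"kappa(A^T A) ~ {np.linalg.cond(A.T @ A):.1e}")  # ~1e8
```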

Geometry of Projection

The projection matrix $P = A(A^TA)^{-1}A^T$ onto $\operatorname{col}(A)$ satisfies:

  • $P^2 = P$ (idempotent)
  • $P^T = P$ (symmetric)
  • $\|P\|_2 = 1$ (projections don't expand)
  • $I - P$ is the complementary projection onto $\operatorname{null}(A^T)$

The first two properties (idempotence plus symmetry) completely characterize orthogonal projections — any matrix satisfying them projects orthogonally onto some subspace, and the remaining two properties follow. This characterization is used in hypothesis testing (the hat matrix in regression), signal processing (projection onto signal subspaces), and control theory.
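All four properties can be verified for a concrete projection; a sketch with a random tall matrix, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((8, 3))
P = A @ np.linalg.inv(A.T @ A) @ A.T   # projection onto col(A)

print(np.allclose(P @ P, P))                   # idempotent
print(np.allclose(P.T, P))                     # symmetric
print(np.isclose(np.linalg.norm(P, 2), 1.0))   # spectral norm 1
print(np.allclose(A.T @ (np.eye(8) - P), 0))   # (I-P)b lands in null(A^T)
```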

💡Intuition

The hat matrix and leverage. In linear regression, $\hat{\mathbf{y}} = P\mathbf{y} = H\mathbf{y}$ where $H = X(X^TX)^{-1}X^T$ is the hat matrix (it "puts a hat on" $\mathbf{y}$). The diagonal entries $h_{ii} = \mathbf{x}_i^T(X^TX)^{-1}\mathbf{x}_i$ are the leverage scores — they measure how much the $i$-th observation influences its own prediction. High leverage ($h_{ii} \approx 1$) means the model is forced to pass near point $i$. Outliers with high leverage are the most dangerous in linear regression.
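Leverage scores are the diagonal of the hat matrix. A sketch with random illustrative data, which also shows a useful sanity check: since $H$ is a projection, its trace equals its rank, so the leverages sum to the number of parameters:

```python
import numpy as np

rng = np.random.default_rng(6)
# 20 observations, intercept + 2 features -> p = 3 parameters
X = np.column_stack([np.ones(20), rng.standard_normal((20, 2))])
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix

leverage = np.diag(H)
print(np.all((leverage >= 0) & (leverage <= 1)))  # True: each h_ii in [0, 1]
print(np.isclose(leverage.sum(), X.shape[1]))     # True: sum = p = 3
```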

⚠️Warning

Multicollinearity and the condition number. When two columns of $A$ are nearly parallel (nearly linearly dependent), $\kappa(A^TA)$ explodes. This doesn't mean least squares is wrong — the minimum of $\|A\mathbf{x}-\mathbf{b}\|$ is still well-defined — but the minimizer $\hat{\mathbf{x}}$ becomes extremely sensitive to perturbations in $\mathbf{b}$. A 1% change in the data can flip the sign of a coefficient. This is why variance inflation factors (VIF) are monitored in regression diagnostics and why ridge regression is standard practice in multicollinear settings.

💡Intuition

LASSO vs Ridge: geometry. Consider OLS constrained to a norm ball: $\min \|A\mathbf{x}-\mathbf{b}\|^2$ s.t. $\|\mathbf{x}\|_p \leq t$. For $p=2$ (ridge), the $\ell^2$ ball has no corners, so the constrained optimum rarely lies on an axis — ridge shrinks all coefficients but doesn't zero any. For $p=1$ (LASSO), the $\ell^1$ ball has corners on the coordinate axes — the constrained optimum frequently hits a corner, setting some coefficients to exactly zero (sparsity). The geometry of norm balls (Lesson 2 of this module) directly explains why LASSO produces sparse solutions.

When to Use Each Solver

| Scenario | Recommended solver |
| --- | --- |
| Dense $A$, well-conditioned | np.linalg.lstsq (SVD-based LAPACK driver internally) |
| Dense $A$, rank-deficient or ill-conditioned | np.linalg.lstsq with an rcond threshold |
| Symmetric PD system | Cholesky (scipy.linalg.cho_solve) |
| Large sparse $A$ | CG or LSQR with preconditioning |
| Large, low-rank structure in $A$ | Randomized SVD + truncated pseudoinverse |
| Ridge regression | sklearn.linear_model.Ridge (Cholesky or SVD solver) |
| LASSO | Coordinate descent (sklearn.linear_model.Lasso) |
