
Linear Systems & Least Squares

Linear systems $A\mathbf{x} = \mathbf{b}$ and their generalization to overdetermined systems via least squares are two of the most ubiquitous computational problems in science and engineering. The geometric insight — that the least-squares solution projects $\mathbf{b}$ orthogonally onto the column space of $A$ — unifies the normal equations, the pseudoinverse, and QR-based solvers into a single picture. Ridge regression, weighted least squares, and the Gauss-Markov theorem all follow directly from this geometry.

Concepts

Least Squares — drag points vertically to update the OLS fit

[Interactive demo. The panel reports the fitted line $\hat{y} = \hat{\alpha} + \hat{\beta}x$, the SSR, and $R^2$, plus a normal-equations check: $\sum_i e_i = 0$ and $\sum_i x_i e_i = 0$, since the residuals are orthogonal to the columns of $X$.]

Red squares visualize squared residuals — OLS minimizes their total area. The fit line is the projection of $\mathbf{b}$ onto the column space of $X$.
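The normal-equations check can be reproduced in a few lines. A minimal sketch, assuming numpy and using made-up data points (not the demo's exact values):

```python
import numpy as np

# Illustrative data: five points roughly on y = x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# OLS fit; lstsq uses a numerically stable factorization internally
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta  # residuals

# Residuals are orthogonal to every column of X
print(np.sum(e))      # ~0  (intercept column)
print(np.sum(x * e))  # ~0  (slope column)
```

Both sums vanish to machine precision for any data, because orthogonality of the residual to the columns of $X$ is exactly the condition defining the OLS minimizer.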

When you fit a regression line to data, there is usually no line that passes through every point — so you find the line that minimizes the total squared distance from points to line. That minimization is the least-squares problem, and its solution has a geometric interpretation: the best-fit prediction $A\hat{\mathbf{x}}$ is the orthogonal projection of the target $\mathbf{b}$ onto the column space of $A$. Every linear regression algorithm — OLS, ridge, weighted — is a variant of this single projection.

Four Fundamental Subspaces

For $A \in \mathbb{R}^{m \times n}$, the four fundamental subspaces and their relationships:

| Subspace | Definition | Dimension | Contains |
| --- | --- | --- | --- |
| Column space $\operatorname{col}(A)$ | $\{A\mathbf{x} : \mathbf{x} \in \mathbb{R}^n\}$ | $r = \operatorname{rank}(A)$ | All achievable $A\mathbf{x}$ |
| Null space $\operatorname{null}(A)$ | $\{\mathbf{x} : A\mathbf{x} = \mathbf{0}\}$ | $n - r$ | Solutions to the homogeneous system |
| Row space $\operatorname{row}(A)$ | $\operatorname{col}(A^T)$ | $r$ | All $A^T\mathbf{y}$ |
| Left null space $\operatorname{null}(A^T)$ | $\{\mathbf{y} : A^T\mathbf{y} = \mathbf{0}\}$ | $m - r$ | Left zeros of $A$ |

Fundamental Theorem of Linear Algebra (Strang). $\operatorname{col}(A) \perp \operatorname{null}(A^T)$ and $\operatorname{row}(A) \perp \operatorname{null}(A)$. Every $\mathbf{b} \in \mathbb{R}^m$ decomposes uniquely as $\mathbf{b} = \mathbf{b}_{\text{col}} + \mathbf{b}_{\text{null}}$ with $\mathbf{b}_{\text{col}} \in \operatorname{col}(A)$ and $\mathbf{b}_{\text{null}} \in \operatorname{null}(A^T)$.

Consistency. $A\mathbf{x} = \mathbf{b}$ has a solution $\iff \mathbf{b} \in \operatorname{col}(A) \iff \mathbf{b}_{\text{null}} = \mathbf{0}$.
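In code, consistency can be tested by comparing $\operatorname{rank}(A)$ with $\operatorname{rank}([A \mid \mathbf{b}])$: the system is solvable exactly when appending $\mathbf{b}$ does not raise the rank. A sketch, assuming numpy (the matrix here is illustrative):

```python
import numpy as np

def is_consistent(A, b, tol=1e-10):
    """Ax = b is solvable iff appending b does not raise the rank."""
    rank_A = np.linalg.matrix_rank(A, tol=tol)
    rank_Ab = np.linalg.matrix_rank(np.column_stack([A, b]), tol=tol)
    return rank_A == rank_Ab

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
print(is_consistent(A, np.array([1.0, 2.0, 3.0])))  # True: b = 1*col1 + 2*col2
print(is_consistent(A, np.array([1.0, 2.0, 4.0])))  # False: b leaves col(A)
```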

Solution Structure of Linear Systems

For a consistent system $A\mathbf{x} = \mathbf{b}$:

$$\mathbf{x} = \mathbf{x}_p + \mathbf{x}_h, \qquad \mathbf{x}_p \text{ a particular solution}, \quad \mathbf{x}_h \in \operatorname{null}(A).$$

Uniqueness: the solution is unique iff $\operatorname{null}(A) = \{\mathbf{0}\}$, i.e. iff $\operatorname{rank}(A) = n$ (full column rank).

Three cases:

| Case | $\operatorname{rank}(A)$ | Solutions | ML context |
| --- | --- | --- | --- |
| Square ($m = n$), full rank | $r = m = n$ | Unique | Square invertible systems |
| Tall ($m > n$), full column rank | $r = n$ | Overdetermined → least squares | Regression ($m$ data points, $n$ features) |
| Wide ($m < n$), full row rank | $r = m$ | Underdetermined → minimum norm | Compressed sensing ($n \gg m$) |

The Least-Squares Problem

For overdetermined $A \in \mathbb{R}^{m \times n}$ ($m > n$) with $\mathbf{b} \notin \operatorname{col}(A)$, there is no exact solution. Instead, minimize the residual:

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|^2.$$

Geometric interpretation. The minimum is achieved when $A\hat{\mathbf{x}}$ is the orthogonal projection of $\mathbf{b}$ onto $\operatorname{col}(A)$. The residual $\mathbf{r} = \mathbf{b} - A\hat{\mathbf{x}}$ must be orthogonal to every column of $A$:

$$A^T(\mathbf{b} - A\hat{\mathbf{x}}) = \mathbf{0} \implies A^TA\hat{\mathbf{x}} = A^T\mathbf{b}.$$

These are the normal equations. When $A$ has full column rank, $A^TA \succ 0$ is invertible and:

$$\hat{\mathbf{x}} = (A^TA)^{-1}A^T\mathbf{b} = A^+\mathbf{b}.$$

The matrix $A^+ = (A^TA)^{-1}A^T$ is the (left) pseudoinverse. The projection matrix onto $\operatorname{col}(A)$ is $P = AA^+ = A(A^TA)^{-1}A^T$.
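Under the full-column-rank assumption, the pseudoinverse and projection matrix can be checked directly. A sketch, assuming numpy, with a random tall matrix (full column rank with probability 1):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))   # tall, full column rank
b = rng.standard_normal(6)

A_pinv = np.linalg.pinv(A)        # equals (A^T A)^{-1} A^T here
x_hat = A_pinv @ b                # least-squares solution
P = A @ A_pinv                    # projection onto col(A)

r = b - A @ x_hat                 # residual
print(np.allclose(P @ P, P))      # True: P is idempotent
print(np.allclose(A.T @ r, 0))    # True: residual is orthogonal to col(A)
```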

The normal equations $A^TA\hat{\mathbf{x}} = A^T\mathbf{b}$ arise necessarily from the orthogonality condition. Setting the residual perpendicular to every column of $A$ is not a design choice — it is the only condition that locates the minimum of $\|A\mathbf{x} - \mathbf{b}\|^2$. Every other linear system solver for least squares (QR, SVD, gradient descent on the normal equations) is solving the same geometric problem, just via a different computational route.

Weighted Least Squares

When observations have different noise levels, minimize a weighted residual:

$$\hat{\mathbf{x}}_W = \arg\min_{\mathbf{x}} \|W^{1/2}(A\mathbf{x} - \mathbf{b})\|^2 = \arg\min_{\mathbf{x}} (A\mathbf{x} - \mathbf{b})^T W (A\mathbf{x} - \mathbf{b}),$$

where $W \succ 0$ is a diagonal weight matrix. The normal equations become:

$$A^TWA\,\hat{\mathbf{x}}_W = A^TW\mathbf{b} \implies \hat{\mathbf{x}}_W = (A^TWA)^{-1}A^TW\mathbf{b}.$$
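Since the weighted problem is just ordinary least squares on rows rescaled by $\sqrt{w_i}$, one practical implementation reuses a stable OLS solver rather than forming $A^TWA$. A sketch, assuming numpy and random illustrative data:

```python
import numpy as np

def wls(A, b, w):
    """Weighted LS: minimize sum_i w_i * (a_i . x - b_i)^2.
    Scale each row by sqrt(w_i), then solve ordinary least squares."""
    sw = np.sqrt(w)
    return np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)[0]

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3))
b = rng.standard_normal(20)
w = rng.uniform(0.5, 2.0, size=20)

x_w = wls(A, b, w)
# Agrees with the normal-equations formula (A^T W A)^{-1} A^T W b
W = np.diag(w)
x_ref = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)
print(np.allclose(x_w, x_ref))  # True
```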

Gauss-Markov Theorem. If the true model is $\mathbf{b} = A\mathbf{x}^* + \boldsymbol{\epsilon}$ with $\mathbb{E}[\boldsymbol{\epsilon}] = \mathbf{0}$ and $\operatorname{Cov}(\boldsymbol{\epsilon}) = \sigma^2 W^{-1}$, then the weighted least-squares estimator is the Best Linear Unbiased Estimator (BLUE) — it has the smallest variance among all linear unbiased estimators. With $W = I$ (homoscedastic noise), ordinary OLS is BLUE.

Ridge Regression and Tikhonov Regularization

When $A^TA$ is ill-conditioned or singular, add a regularization term:

$$\hat{\mathbf{x}}_\alpha = \arg\min_{\mathbf{x}} \|A\mathbf{x} - \mathbf{b}\|^2 + \alpha\|\mathbf{x}\|^2 = (A^TA + \alpha I)^{-1}A^T\mathbf{b}.$$

Effect on singular values. Using the SVD $A = U\Sigma V^T$:

$$\hat{\mathbf{x}}_\alpha = \sum_{i=1}^r \frac{\sigma_i}{\sigma_i^2 + \alpha} \mathbf{v}_i \mathbf{u}_i^T \mathbf{b}.$$

Compare to the unregularized pseudoinverse solution, $\sum_i \frac{1}{\sigma_i} \mathbf{v}_i \mathbf{u}_i^T \mathbf{b}$. Ridge replaces $1/\sigma_i$ with $\sigma_i/(\sigma_i^2 + \alpha)$, which shrinks the contribution of small singular values — exactly the right behavior when the directions with small $\sigma_i$ are dominated by noise.
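The shrinkage is easy to see numerically by tabulating the ridge filter factors $\sigma_i/(\sigma_i^2 + \alpha)$ against $1/\sigma_i$. A sketch with illustrative singular values:

```python
import numpy as np

sigma = np.array([10.0, 1.0, 0.01])   # singular values, one tiny
alpha = 0.1

pinv_weights = 1.0 / sigma            # unregularized amplification
ridge_weights = sigma / (sigma**2 + alpha)

for s, p, r in zip(sigma, pinv_weights, ridge_weights):
    print(f"sigma={s:6.2f}  1/sigma={p:8.2f}  ridge={r:8.4f}")
# For sigma = 0.01: 1/sigma = 100, but the ridge factor is ~0.1 —
# the noisy direction's 100x amplification is suppressed.
```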

Bias-variance tradeoff: ridge introduces bias ($\hat{\mathbf{x}}_\alpha \neq \mathbf{x}^*$ even in the limit of infinite data) but reduces variance. The optimal $\alpha$ minimizes the mean squared prediction error.

Condition number improvement: $\kappa(A^TA + \alpha I) = (\sigma_1^2 + \alpha)/(\sigma_n^2 + \alpha) < \kappa(A^TA) = \sigma_1^2/\sigma_n^2$ for $\alpha > 0$.

Numerical Methods for Least Squares

Three standard approaches:

1. QR decomposition (recommended for dense $A$).

Compute $A = Q_1 R_1$ (thin QR). Then:

$$\hat{\mathbf{x}} = R_1^{-1} Q_1^T \mathbf{b}.$$

Cost: $O(mn^2)$ for the QR factorization, $O(mn + n^2)$ per right-hand side. The condition number of the system is $\kappa(A)$, not $\kappa(A)^2$. This is the production-standard method.
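A sketch of the thin-QR solve, assuming scipy is available; note that $R_1^{-1}$ is never formed explicitly — back-substitution is used instead:

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 4))   # illustrative tall matrix
b = rng.standard_normal(50)

Q1, R1 = qr(A, mode="economic")         # thin QR: Q1 is 50x4, R1 is 4x4
x_hat = solve_triangular(R1, Q1.T @ b)  # back-substitution on upper-triangular R1

print(np.allclose(x_hat, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```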

2. SVD (most robust; handles rank-deficient $A$).

Compute $A = U\Sigma V^T$. Truncate small singular values at a threshold $\tau$:

$$\hat{\mathbf{x}}^+ = \sum_{\sigma_i > \tau} \frac{1}{\sigma_i} \mathbf{v}_i (\mathbf{u}_i^T \mathbf{b}).$$

Cost: $O(mn^2)$ for the SVD (with a larger constant than QR). Use when numerical rank determination is needed.
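A sketch of the truncated-SVD solver, assuming numpy; the matrix below is deliberately rank-deficient (the third column is the sum of the first two), and the threshold `tau` is a user choice:

```python
import numpy as np

def tsvd_solve(A, b, tau=1e-10):
    """Minimum-norm least-squares solution, dropping sigma_i <= tau."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > tau
    return Vt[keep].T @ ((U[:, keep].T @ b) / s[keep])

# Rank-deficient example: third column = first + second
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 0.0, 2.0]])
b = np.array([1.0, 1.0, 2.0, 2.0])   # lies in col(A): b = col1 + col2

x = tsvd_solve(A, b, tau=1e-8)
print(np.allclose(A @ x, b))  # True: consistent system solved exactly
```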

3. Normal equations (fast but numerically inferior).

Solve $A^TA\hat{\mathbf{x}} = A^T\mathbf{b}$ via Cholesky. Cost: $O(mn^2/2 + n^3/6)$. The condition number is $\kappa(A)^2$. Avoid unless $m \gg n$ and $\kappa(A)$ is small.

Iterative Solvers for Large Systems

For large sparse $A$ (e.g., graph Laplacians, finite-difference matrices), direct methods ($O(n^3)$) are too slow. Iterative methods:

| Method | Per-iteration cost | Convergence rate | Use case |
| --- | --- | --- | --- |
| Conjugate Gradient (CG) | $O(\mathrm{nnz})$ | $\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k$ | $A$ symmetric PD |
| LSQR | $O(\mathrm{nnz})$ | Similar to CG on $A^TA$ | General $A$, least squares |
| MINRES | $O(\mathrm{nnz})$ | — | $A$ symmetric indefinite |
| Randomized Kaczmarz | $O(n)$ | Depends on row norms | Very large $m$, stochastic |

Preconditioning: replace $A\mathbf{x} = \mathbf{b}$ with $M^{-1}A\mathbf{x} = M^{-1}\mathbf{b}$, where $M \approx A$ is easy to invert. Good preconditioners achieve $\kappa(M^{-1}A) \ll \kappa(A)$, dramatically accelerating convergence.
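A sketch of preconditioned CG on a sparse SPD system, assuming scipy. The Jacobi (diagonal) preconditioner shown here is the simplest choice and does little for this constant-diagonal example; real problems would use something stronger, such as incomplete Cholesky:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, LinearOperator

# 1-D Laplacian (tridiagonal, SPD) — a standard sparse test matrix
n = 200
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A = sp.diags([off, main, off], [-1, 0, 1], format="csr")
b = np.ones(n)

# Jacobi preconditioner: apply M^{-1} = diag(A)^{-1} as a LinearOperator
inv_diag = 1.0 / A.diagonal()
M = LinearOperator((n, n), matvec=lambda v: inv_diag * v)

x, info = cg(A, b, M=M)
print(info == 0)                               # True: converged
print(np.linalg.norm(A @ x - b) < 1e-2)        # True: small residual
```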

Worked Examples

Example 1: Full-Rank Least Squares — Polynomial Regression

Fit $y = \beta_0 + \beta_1 x + \beta_2 x^2$ to data $(x_i, y_i)_{i=1}^5$. The design matrix:

$$A = \begin{pmatrix}1 & x_1 & x_1^2 \\ \vdots & \vdots & \vdots \\ 1 & x_5 & x_5^2\end{pmatrix} \in \mathbb{R}^{5 \times 3}.$$

$A$ has rank 3 when at least three of the $x_i$ are distinct. Normal equations: $(A^TA)\hat{\boldsymbol{\beta}} = A^T\mathbf{y}$. In practice, never form $A^TA$ — use numpy.linalg.lstsq(A, y), which solves the problem through a stable factorization instead.

Condition number warning. For Vandermonde matrices (columns $1, x, x^2, \ldots, x^{d-1}$), $\kappa(A)$ grows exponentially in $d$. At degree 10 with $x_i \in [0,1]$, $\kappa(A) \sim 10^{13}$ — at the edge of double-precision reliability. Use orthogonal polynomial bases (Chebyshev, Legendre) instead.
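The conditioning gap between the monomial and Chebyshev bases can be measured directly; a sketch using numpy's polynomial helpers, with 50 equispaced sample points as an illustrative choice:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 50)
deg = 10

# Monomial (Vandermonde) design matrix vs Chebyshev basis on [-1, 1]
V = np.vander(x, deg + 1, increasing=True)
C = np.polynomial.chebyshev.chebvander(2 * x - 1, deg)

print(f"kappa(Vandermonde) ~ {np.linalg.cond(V):.2e}")
print(f"kappa(Chebyshev)   ~ {np.linalg.cond(C):.2e}")
# The Chebyshev design matrix is orders of magnitude better conditioned.
```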

Example 2: Underdetermined System — Minimum-Norm Solution

For $A = \begin{pmatrix}1 & 1 & 0 \\ 0 & 1 & 1\end{pmatrix}$, $\mathbf{b} = \begin{pmatrix}1 \\ 1\end{pmatrix}$: $m = 2 < n = 3$, so there are infinitely many solutions.

The null space of $A$: $(1, -1, 1)^T$ spans $\operatorname{null}(A)$ (verify: $A(1,-1,1)^T = \mathbf{0}$).

Minimum-norm solution $\mathbf{x}^+ = A^+\mathbf{b} = A^T(AA^T)^{-1}\mathbf{b}$: this is the unique solution with $\mathbf{x}^+ \in \operatorname{row}(A)$ (orthogonal to the null space).

$$AA^T = \begin{pmatrix}2 & 1 \\ 1 & 2\end{pmatrix}, \quad (AA^T)^{-1} = \frac{1}{3}\begin{pmatrix}2 & -1 \\ -1 & 2\end{pmatrix}, \quad \mathbf{x}^+ = A^T(AA^T)^{-1}\mathbf{b} = \frac{1}{3}\begin{pmatrix}1\\2\\1\end{pmatrix}.$$
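This hand computation can be checked with numpy's pseudoinverse:

```python
import numpy as np

A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])

x_plus = np.linalg.pinv(A) @ b
print(x_plus)                       # [1/3, 2/3, 1/3]
print(np.allclose(A @ x_plus, b))   # True: an exact solution
null_dir = np.array([1.0, -1.0, 1.0])   # spans null(A)
print(abs(x_plus @ null_dir) < 1e-12)   # True: orthogonal to the null space
```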

In compressed sensing, the minimum-norm solution of an underdetermined system is the starting point — but we additionally want the sparsest solution, which requires $\ell^1$ minimization (LASSO), not $\ell^2$.

Example 3: Ridge Regression as Bayesian MAP

Suppose $\mathbf{b} = A\mathbf{x} + \boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(0, \sigma^2 I)$ and prior $\mathbf{x} \sim \mathcal{N}(0, \tau^2 I)$.

The MAP (maximum a posteriori) estimate maximizes $p(\mathbf{x}|\mathbf{b}) \propto p(\mathbf{b}|\mathbf{x})\,p(\mathbf{x})$, which is equivalent to minimizing:

$$-\log p(\mathbf{x}|\mathbf{b}) = \frac{1}{2\sigma^2}\|A\mathbf{x} - \mathbf{b}\|^2 + \frac{1}{2\tau^2}\|\mathbf{x}\|^2 + \text{const}.$$

Setting $\alpha = \sigma^2/\tau^2$ recovers exactly ridge regression. Ridge regression is MAP estimation under a Gaussian prior with variance $\tau^2 = \sigma^2/\alpha$. Larger $\alpha$ (stronger regularization) corresponds to a tighter prior (less variance in the prior on $\mathbf{x}$).
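The equivalence can be sanity-checked numerically: the ridge closed form with $\alpha = \sigma^2/\tau^2$ should zero the gradient of the negative log-posterior. A sketch with random illustrative data:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((30, 5))
b = rng.standard_normal(30)
sigma2, tau2 = 0.5, 2.0
alpha = sigma2 / tau2

# Ridge closed form
x_hat = np.linalg.solve(A.T @ A + alpha * np.eye(5), A.T @ b)

# Gradient of the negative log-posterior should vanish at x_hat:
# grad = (1/sigma^2) A^T (A x - b) + x / tau^2
grad = (A.T @ (A @ x_hat - b)) / sigma2 + x_hat / tau2
print(np.allclose(grad, 0))  # True
```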

Connections

Where Your Intuition Breaks

Solving the normal equations $A^TA\hat{\mathbf{x}} = A^T\mathbf{b}$ looks like the obvious way to do least squares. Algebraically it is correct, but numerically it is often dangerous. Forming $A^TA$ squares the condition number: $\kappa(A^TA) = \kappa(A)^2$. For a moderately ill-conditioned regression matrix with $\kappa(A) = 10^4$, the normal equations have $\kappa(A^TA) \approx 10^8$ — near the limit of double-precision arithmetic. QR-based solvers avoid this by never forming $A^TA$, maintaining the original condition number. In practice, always use numpy.linalg.lstsq or scipy.linalg.lstsq; never solve the normal equations explicitly.
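The squaring is easy to demonstrate by constructing a matrix with a known condition number from its SVD factors; a sketch assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 100, 5

# Build A = U diag(s) V^T with singular values from 1 down to 1e-4,
# so kappa(A) = 1e4 by construction
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.logspace(0, -4, n)
A = (U * s) @ V.T

print(f"kappa(A)     ~ {np.linalg.cond(A):.1e}")        # ~1e4
print(f"kappa(A^T A) ~ {np.linalg.cond(A.T @ A):.1e}")  # ~1e8
```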

Geometry of Projection

The projection matrix $P = A(A^TA)^{-1}A^T$ onto $\operatorname{col}(A)$ satisfies:

  • $P^2 = P$ (idempotent)
  • $P^T = P$ (symmetric)
  • $\|P\|_2 = 1$ (projections don't expand)
  • $I - P$ is the complementary projection onto $\operatorname{null}(A^T)$

The first two properties (idempotence plus symmetry) completely characterize orthogonal projections — any matrix satisfying them projects orthogonally onto some subspace, and the remaining two properties follow. This characterization is used in hypothesis testing (the hat matrix in regression), signal processing (projection onto signal subspaces), and control theory.
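All four properties can be verified for a concrete projection; a sketch with a random tall matrix, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((8, 3))
P = A @ np.linalg.inv(A.T @ A) @ A.T   # projection onto col(A)

print(np.allclose(P @ P, P))                   # idempotent
print(np.allclose(P.T, P))                     # symmetric
print(np.isclose(np.linalg.norm(P, 2), 1.0))   # spectral norm 1
print(np.allclose(A.T @ (np.eye(8) - P), 0))   # (I-P)b lands in null(A^T)
```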

💡Intuition

The hat matrix and leverage. In linear regression, $\hat{\mathbf{y}} = P\mathbf{y} = H\mathbf{y}$ where $H = X(X^TX)^{-1}X^T$ is the hat matrix (it "puts a hat on" $\mathbf{y}$). The diagonal entries $h_{ii} = \mathbf{x}_i^T(X^TX)^{-1}\mathbf{x}_i$ are the leverage scores — they measure how much the $i$-th observation influences its own prediction. High leverage ($h_{ii} \approx 1$) means the model is forced to pass near point $i$. Outliers with high leverage are the most dangerous in linear regression.
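Leverage scores are the diagonal of the hat matrix. A sketch with random illustrative data, which also shows a useful sanity check: since $H$ is a projection, its trace equals its rank, so the leverages sum to the number of parameters:

```python
import numpy as np

rng = np.random.default_rng(6)
# 20 observations, intercept + 2 features -> p = 3 parameters
X = np.column_stack([np.ones(20), rng.standard_normal((20, 2))])
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix

leverage = np.diag(H)
print(np.all((leverage >= 0) & (leverage <= 1)))  # True: each h_ii in [0, 1]
print(np.isclose(leverage.sum(), X.shape[1]))     # True: sum = p = 3
```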

⚠️Warning

Multicollinearity and the condition number. When two columns of $A$ are nearly parallel (nearly linearly dependent), $\kappa(A^TA)$ explodes. This doesn't mean least squares is wrong — the minimum of $\|A\mathbf{x}-\mathbf{b}\|$ is still well-defined — but the minimizer $\hat{\mathbf{x}}$ becomes extremely sensitive to perturbations in $\mathbf{b}$. A 1% change in the data can flip the sign of a coefficient. This is why variance inflation factors (VIF) are monitored in regression diagnostics and why ridge regression is standard practice in multicollinear settings.

💡Intuition

LASSO vs Ridge: geometry. Consider OLS constrained to a norm ball: $\min \|A\mathbf{x}-\mathbf{b}\|^2$ s.t. $\|\mathbf{x}\|_p \leq t$. For $p=2$ (ridge), the $\ell^2$ ball has no corners, so the constrained optimum rarely lies on an axis — ridge shrinks all coefficients but doesn't zero any. For $p=1$ (LASSO), the $\ell^1$ ball has corners on the coordinate axes — the constrained optimum frequently hits a corner, setting some coefficients to exactly zero (sparsity). The geometry of norm balls (Lesson 2 of this module) directly explains why LASSO produces sparse solutions.

When to Use Each Solver

| Scenario | Recommended solver |
| --- | --- |
| Dense $A$, well-conditioned | np.linalg.lstsq (SVD-based LAPACK driver internally) |
| Dense $A$, rank-deficient or ill-conditioned | np.linalg.lstsq with an rcond threshold |
| Symmetric PD system | Cholesky (scipy.linalg.cho_solve) |
| Large sparse $A$ | CG or LSQR with preconditioning |
| Large, low-rank structure in $A$ | Randomized SVD + truncated pseudoinverse |
| Ridge regression | sklearn.linear_model.Ridge (Cholesky or SVD solver) |
| LASSO | Coordinate descent (sklearn.linear_model.Lasso) |
