Linear Systems & Least Squares
Linear systems and their generalization to overdetermined systems via least squares are the two most ubiquitous computational problems in science and engineering. The geometric insight — that the least-squares solution projects orthogonally onto the column space of $A$ — unifies the normal equations, the pseudoinverse, and QR-based solvers into a single picture. Ridge regression, weighted least squares, and the Gauss-Markov theorem all follow directly from this geometry.
Concepts
*[Interactive demo: Least Squares — drag points vertically to update the OLS fit. Red squares visualize squared residuals — OLS minimizes their total area. The fit line is the projection of $b$ onto the column space of $X$. Readouts report $\hat{\alpha}$, $\hat{\beta}$, SSR, and $R^2$, plus a normal-equations check: $\sum_i e_i = 0$ and $\sum_i x_i e_i = 0$, i.e. residuals $\perp$ columns of $X$.]*
When you fit a regression line to data, there is usually no line that passes through every point — so you find the line that minimizes the total squared distance from points to line. That minimization is the least-squares problem, and its solution has a geometric interpretation: the best-fit prediction is the orthogonal projection of the target $b$ onto the column space of $A$. Every linear regression algorithm — OLS, ridge, weighted — is a variant of this single projection.
Four Fundamental Subspaces
For $A \in \mathbb{R}^{m \times n}$ with $\operatorname{rank}(A) = r$, the four fundamental subspaces and their relationships:

| Subspace | Definition | Dimension | Contains |
|---|---|---|---|
| Column space $C(A) \subseteq \mathbb{R}^m$ | $\{Ax : x \in \mathbb{R}^n\}$ | $r$ | All achievable outputs $Ax$ |
| Null space $N(A) \subseteq \mathbb{R}^n$ | $\{x : Ax = 0\}$ | $n - r$ | Solutions to the homogeneous system |
| Row space $C(A^T) \subseteq \mathbb{R}^n$ | $\{A^T y : y \in \mathbb{R}^m\}$ | $r$ | All combinations of the rows of $A$ |
| Left null space $N(A^T) \subseteq \mathbb{R}^m$ | $\{y : A^T y = 0\}$ | $m - r$ | Left zeros of $A$: $y^T A = 0$ |
Fundamental Theorem of Linear Algebra (Strang). $C(A^T) \perp N(A)$ and $C(A) \perp N(A^T)$. Every $x \in \mathbb{R}^n$ decomposes uniquely as $x = x_r + x_n$ with $x_r \in C(A^T)$ and $x_n \in N(A)$.
Consistency. $Ax = b$ has a solution $\iff b \in C(A) \iff \operatorname{rank}([A \mid b]) = \operatorname{rank}(A)$.
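The rank test can be checked directly in NumPy; the rank-1 matrix below is a made-up illustration, not from the text:

```python
import numpy as np

# Consistency: Ax = b is solvable iff rank(A) == rank([A | b]).
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])            # rank 1: second column = 2 * first
b_in = A @ np.array([1.0, 1.0])       # lies in C(A) by construction
b_out = np.array([1.0, 0.0, 0.0])     # generic vector, not in C(A)

def is_consistent(A, b):
    """Rank test for solvability of Ax = b."""
    augmented = np.column_stack([A, b])
    return np.linalg.matrix_rank(A) == np.linalg.matrix_rank(augmented)

print(is_consistent(A, b_in))   # True
print(is_consistent(A, b_out))  # False
```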
Solution Structure of Linear Systems
For a consistent system $Ax = b$:

Uniqueness: The solution is unique iff $N(A) = \{0\}$ iff $\operatorname{rank}(A) = n$ (full column rank).
Three cases:
| Case | Solutions | ML context |
|---|---|---|
| Square ($m = n$), full rank | Unique: $x = A^{-1} b$ | Square invertible systems |
| Tall ($m > n$), full col. rank | Overdetermined → least squares | Regression ($m$ data points, $n$ features) |
| Wide ($m < n$), full row rank | Underdetermined → min-norm solution | Compressed sensing, $n \gg m$ |
The Least-Squares Problem
For an overdetermined system ($m > n$) with $b \notin C(A)$, there is no exact solution. Instead, minimize the residual:

$$\hat{x} = \arg\min_{x} \|Ax - b\|_2^2$$

Geometric interpretation. The minimum is achieved when $A\hat{x}$ is the orthogonal projection of $b$ onto $C(A)$. The residual $r = b - A\hat{x}$ must be orthogonal to every column of $A$:

$$A^T (b - A\hat{x}) = 0 \quad\Longleftrightarrow\quad A^T A \hat{x} = A^T b$$

These are the normal equations. When $A$ has full column rank, $A^T A$ is invertible and:

$$\hat{x} = (A^T A)^{-1} A^T b$$

The matrix $A^+ = (A^T A)^{-1} A^T$ is the (left) pseudoinverse. The projection matrix onto $C(A)$ is $P = A (A^T A)^{-1} A^T = A A^+$.

The normal equations arise necessarily from the orthogonality condition. Setting the residual perpendicular to every column of $A$ is not a design choice — it is the only condition that locates the minimum of $\|Ax - b\|_2^2$. Every other linear system solver for least squares (QR, SVD, gradient descent on the normal equations) is solving the same geometric problem, just via a different computational route.
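A quick numerical check of the orthogonality condition and the projection matrix, using randomly generated data as a stand-in for a real design matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))   # tall, full column rank
b = rng.standard_normal(50)

x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
r = b - A @ x_hat

# Orthogonality: the residual is perpendicular to every column of A.
print(np.allclose(A.T @ r, 0))        # True

# Projection matrix P = A (A^T A)^{-1} A^T reproduces the fitted values.
P = A @ np.linalg.inv(A.T @ A) @ A.T
print(np.allclose(P @ b, A @ x_hat))  # True
```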
Weighted Least Squares
When observations have different noise levels, minimize a weighted residual:

$$\hat{x} = \arg\min_x (Ax - b)^T W (Ax - b) = \arg\min_x \sum_i w_i (a_i^T x - b_i)^2$$

where $W = \operatorname{diag}(w_1, \dots, w_m)$ is a diagonal weight matrix. The normal equations become:

$$A^T W A \hat{x} = A^T W b$$

Gauss-Markov Theorem. If the true model is $b = Ax + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\operatorname{Cov}(\varepsilon) = \sigma^2 W^{-1}$, then the weighted least-squares estimator is the Best Linear Unbiased Estimator (BLUE) — it has the smallest variance among all linear unbiased estimators. With $W = I$ (homoscedastic noise), ordinary OLS is BLUE.
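A sketch of weighted least squares in NumPy. The noise levels and true coefficients below are invented for illustration; the key trick is that WLS reduces to OLS on rows scaled by the square roots of the weights:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 100
A = np.column_stack([np.ones(m), rng.uniform(0, 10, m)])
x_true = np.array([1.0, 2.0])
sigma = rng.uniform(0.1, 3.0, m)          # heteroscedastic noise levels
b = A @ x_true + sigma * rng.standard_normal(m)

# Weights w_i = 1/sigma_i^2: solve (sqrt(W) A) x ≈ sqrt(W) b with an
# ordinary least-squares solver.
w = 1.0 / sigma**2
sw = np.sqrt(w)
x_wls, *_ = np.linalg.lstsq(sw[:, None] * A, sw * b, rcond=None)

# Equivalent to the weighted normal equations A^T W A x = A^T W b.
x_ne = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * b))
print(np.allclose(x_wls, x_ne))  # True
```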
Ridge Regression and Tikhonov Regularization
When $A^T A$ is ill-conditioned or singular, add a regularization term:

$$\hat{x}_\lambda = \arg\min_x \|Ax - b\|_2^2 + \lambda \|x\|_2^2 = (A^T A + \lambda I)^{-1} A^T b$$

Effect on singular values. Using the SVD $A = U \Sigma V^T$:

$$\hat{x}_\lambda = \sum_i \frac{\sigma_i}{\sigma_i^2 + \lambda}\, (u_i^T b)\, v_i$$

Compare to the unregularized pseudoinverse: $\hat{x} = \sum_i \frac{1}{\sigma_i} (u_i^T b)\, v_i$. Ridge replaces $1/\sigma_i$ with $\sigma_i / (\sigma_i^2 + \lambda)$, which shrinks the contribution of small singular values — exactly the right thing when small $\sigma_i$ are noisy.

Bias-variance tradeoff: Ridge introduces bias ($\mathbb{E}[\hat{x}_\lambda] \neq x$ even in the limit of infinite data) but reduces variance. The optimal $\lambda$ minimizes the mean squared prediction error.

Condition number improvement: $\kappa(A^T A + \lambda I) = \dfrac{\sigma_{\max}^2 + \lambda}{\sigma_{\min}^2 + \lambda} < \kappa(A^T A)$ for $\lambda > 0$.
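The SVD filter-factor formula can be verified against the direct solve; the matrix and $\lambda$ below are arbitrary test values:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 5))
b = rng.standard_normal(30)
lam = 0.5

# Direct ridge solution: (A^T A + lam I)^{-1} A^T b
x_direct = np.linalg.solve(A.T @ A + lam * np.eye(5), A.T @ b)

# SVD form: filter factors sigma_i / (sigma_i^2 + lam) replace 1/sigma_i.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((s / (s**2 + lam)) * (U.T @ b))
print(np.allclose(x_direct, x_svd))  # True
```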
Numerical Methods for Least Squares
Three standard approaches:
1. QR decomposition (recommended for dense $A$).

Compute $A = QR$ (thin QR: $Q \in \mathbb{R}^{m \times n}$ with orthonormal columns, $R$ upper triangular). Then:

$$R \hat{x} = Q^T b$$

solved by back substitution. Cost: $O(mn^2)$ for the factorization, $O(mn + n^2)$ per right-hand side. The condition number of the triangular system is $\kappa(A)$, not $\kappa(A)^2$. This is the production-standard method.
2. SVD (most robust, handles rank-deficient $A$).

Compute $A = U \Sigma V^T$. Truncate small singular values at a threshold $\tau$:

$$\hat{x} = \sum_{\sigma_i > \tau} \frac{u_i^T b}{\sigma_i}\, v_i$$

Cost: $O(mn^2)$ with a larger constant than QR. Use when numerical rank determination is needed.
3. Normal equations (fast but numerically inferior).

Solve $A^T A \hat{x} = A^T b$ via Cholesky. Cost: $O(mn^2)$ to form $A^T A$ plus $O(n^3)$ for the factorization. Condition number is $\kappa(A)^2$. Avoid unless $A$ is well-conditioned and $n$ is small.
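A small experiment (with an invented ill-conditioned Vandermonde design) showing the accuracy gap between the QR route and the normal equations:

```python
import numpy as np

# Ill-conditioned design: monomial (Vandermonde) columns 1, t, ..., t^9.
t = np.linspace(0.0, 1.0, 50)
A = np.vander(t, 10, increasing=True)
x_true = np.ones(10)
b = A @ x_true

# QR route: solve R x = Q^T b; accuracy governed by kappa(A).
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# Normal equations: accuracy governed by kappa(A)^2 — much worse.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

print(np.linalg.norm(x_qr - x_true))  # small
print(np.linalg.norm(x_ne - x_true))  # typically orders of magnitude larger
```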
Iterative Solvers for Large Systems
For large sparse $A$ (e.g., graph Laplacians, finite-difference matrices), direct methods ($O(n^3)$) are too slow. Iterative methods:

| Method | Per-iteration cost | Convergence rate | Use case |
|---|---|---|---|
| Conjugate Gradient (CG) | $O(\mathrm{nnz}(A))$ | $O(\sqrt{\kappa})$ iterations | $A$ symmetric PD |
| LSQR | $O(\mathrm{nnz}(A))$ | Similar to CG on $A^T A$ | General $A$, least squares |
| MINRES | $O(\mathrm{nnz}(A))$ | — | $A$ symmetric indefinite |
| Randomized Kaczmarz | $O(n)$ per row | Depends on row norms | Very large $m$, stochastic |
Preconditioning: Replace $Ax = b$ with $M^{-1} A x = M^{-1} b$ where $M \approx A$ is easy to invert. Good preconditioners reduce $\kappa(M^{-1} A)$, dramatically accelerating convergence.
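A minimal sketch of preconditioned CG using SciPy, with a 1-D finite-difference Laplacian as the sparse SPD matrix and a simple Jacobi (diagonal) preconditioner:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg, LinearOperator

# 1-D finite-difference Laplacian: sparse, symmetric positive definite.
n = 200
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# Jacobi preconditioner M = diag(A). Trivial here, but the calling
# pattern is the same for serious preconditioners (ILU, multigrid, ...).
d = A.diagonal()
M = LinearOperator((n, n), matvec=lambda v: v / d)

x, info = cg(A, b, M=M)
print(info == 0)                         # 0 means converged
print(np.allclose(A @ x, b, atol=1e-3))
```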
Worked Examples
Example 1: Full-Rank Least Squares — Polynomial Regression
Fit $y = \beta_0 + \beta_1 x + \beta_2 x^2$ to data $(x_i, y_i)$, $i = 1, \dots, m$. The design matrix:

$$A = \begin{bmatrix} 1 & x_1 & x_1^2 \\ \vdots & \vdots & \vdots \\ 1 & x_m & x_m^2 \end{bmatrix}$$

has rank 3 when at least three of the $x_i$ are distinct. Normal equations: $A^T A \beta = A^T y$. In practice, never form $A^T A$ — use `numpy.linalg.lstsq(A, y)`, which internally uses SVD or QR.

Condition number warning. For Vandermonde matrices (columns $1, x, x^2, \dots, x^d$), $\kappa(A)$ grows exponentially in the degree $d$. By degree 10 on a typical interval, $\kappa(A)$ is already large enough to sit at the edge of double-precision reliability. Use orthogonal polynomial bases (Chebyshev, Legendre) instead.
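The conditioning gap between the monomial and Chebyshev bases is easy to observe numerically (degree and grid chosen arbitrarily for illustration):

```python
import numpy as np
from numpy.polynomial import chebyshev

x = np.linspace(0.0, 1.0, 50)
deg = 10

# Monomial basis 1, x, ..., x^10: condition number grows exponentially.
A_mono = np.vander(x, deg + 1, increasing=True)

# Chebyshev basis, with x mapped to [-1, 1]: same fits, far better conditioned.
A_cheb = chebyshev.chebvander(2.0 * x - 1.0, deg)

print(f"monomial:  {np.linalg.cond(A_mono):.2e}")
print(f"chebyshev: {np.linalg.cond(A_cheb):.2e}")
```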
Example 2: Underdetermined System — Minimum-Norm Solution
For a wide system ($m < n$) with full row rank, $Ax = b$ is consistent for every $b$ and has infinitely many solutions: if $x_p$ is any particular solution, the full solution set is the affine subspace $x_p + N(A)$ of dimension $n - m$.

The null space of $A$ has dimension $n - m$, and adding any null-space vector $z$ to a solution gives another solution (verify: $A(x_p + z) = A x_p = b$).

Minimum-norm solution: $x^+ = A^T (A A^T)^{-1} b = A^+ b$. This is the unique solution lying in the row space $C(A^T)$ (orthogonal to the null space), and it has the smallest Euclidean norm among all solutions.
In compressed sensing, the minimum-norm solution of an underdetermined system is the starting point — but we additionally want the sparsest solution, which requires $\ell_1$ minimization (LASSO), not $\ell_2$.
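A concrete check of the minimum-norm construction, using a small hypothetical 2×3 system (not the example from the original text):

```python
import numpy as np

# A wide system: 2 equations, 3 unknowns, full row rank.
A = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 3.0]])
b = np.array([6.0, 14.0])

# Minimum-norm solution x+ = A^T (A A^T)^{-1} b.
x_min = A.T @ np.linalg.solve(A @ A.T, b)
print(np.allclose(A @ x_min, b))          # an exact solution

# It lies in the row space: orthogonal to the null space,
# which here is spanned by (1, -2, 1).
null_vec = np.array([1.0, -2.0, 1.0])
print(np.allclose(A @ null_vec, 0))
print(np.isclose(x_min @ null_vec, 0))

# Any other solution (add a null-space component) has larger norm.
x_other = x_min + 0.5 * null_vec
print(np.linalg.norm(x_other) > np.linalg.norm(x_min))
```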
Example 3: Ridge Regression as Bayesian MAP
Suppose $b = Ax + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ and prior $x \sim \mathcal{N}(0, \tau^2 I)$.

The MAP (Maximum A Posteriori) estimate maximizes $p(x \mid b) \propto p(b \mid x)\, p(x)$, which is equivalent to minimizing:

$$\|Ax - b\|_2^2 + \frac{\sigma^2}{\tau^2} \|x\|_2^2$$

Setting $\lambda = \sigma^2 / \tau^2$ recovers exactly ridge regression. Ridge regression is MAP estimation under a Gaussian prior with variance $\tau^2$. Larger $\lambda$ (stronger regularization) corresponds to a tighter prior (smaller variance $\tau^2$ in the prior on $x$).
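The equivalence can be confirmed numerically by minimizing the negative log posterior directly and comparing against the closed-form ridge solution (random data for illustration):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
A = rng.standard_normal((40, 4))
b = rng.standard_normal(40)
sigma2, tau2 = 1.0, 0.25          # noise variance and prior variance

# Closed-form ridge with lambda = sigma^2 / tau^2.
lam = sigma2 / tau2
x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(4), A.T @ b)

# Negative log posterior (up to additive constants):
# Gaussian likelihood term + Gaussian prior term.
def neg_log_post(x):
    return (np.sum((A @ x - b) ** 2) / (2.0 * sigma2)
            + np.sum(x ** 2) / (2.0 * tau2))

x_map = minimize(neg_log_post, np.zeros(4)).x
print(np.allclose(x_map, x_ridge, atol=1e-4))  # True
```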
Connections
Where Your Intuition Breaks
Using the normal equations $A^T A x = A^T b$ is the standard textbook way to solve least squares. Algebraically this is correct, but numerically it is often dangerous. Forming $A^T A$ squares the condition number: $\kappa(A^T A) = \kappa(A)^2$. For a moderately ill-conditioned regression matrix with $\kappa(A) \approx 10^8$, the normal equations have $\kappa \approx 10^{16}$ — near the limit of double-precision arithmetic. QR-based solvers avoid this by never forming $A^T A$, maintaining the original condition number. In practice, always use `numpy.linalg.lstsq` or `scipy.linalg.lstsq`; never solve the normal equations explicitly.
Geometry of Projection
The projection matrix $P = A (A^T A)^{-1} A^T$ onto $C(A)$ satisfies:
- $P^2 = P$ (idempotent)
- $P^T = P$ (symmetric)
- $\|Pv\|_2 \le \|v\|_2$ (projections don't expand)
- $I - P$ is the complementary projection onto $C(A)^\perp = N(A^T)$

The first two properties (idempotent symmetric matrix) completely characterize orthogonal projections — any matrix satisfying them projects orthogonally onto some subspace. This characterization is used in hypothesis testing (hat matrix in regression), signal processing, and control theory.
The hat matrix and leverage. In linear regression, $\hat{y} = X \hat{\beta} = H y$, where $H = X (X^T X)^{-1} X^T$ is the hat matrix (it "puts a hat on" $y$). The diagonal entries $h_{ii} \in [0, 1]$ are the leverage scores — they measure how much the $i$-th observation influences its own prediction. High leverage ($h_{ii}$ close to 1) means the model is forced to pass near point $i$. Outliers with high leverage are the most dangerous in linear regression.
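A short demonstration of the hat matrix and leverage, with one deliberately outlying $x$ value (synthetic data):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, 20)
x[0] = 5.0                               # one far-out point: high leverage
X = np.column_stack([np.ones(20), x])

# Hat matrix H = X (X^T X)^{-1} X^T; fitted values are y_hat = H y.
H = X @ np.linalg.solve(X.T @ X, X.T)

# Idempotent and symmetric: an orthogonal projection.
print(np.allclose(H @ H, H), np.allclose(H, H.T))

# Leverage scores: diagonal of H; they sum to the number of parameters.
leverage = np.diag(H)
print(np.isclose(leverage.sum(), 2.0))
print(leverage[0] > leverage[1:].max())  # the outlying x has highest leverage
```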
Multicollinearity and the condition number. When two columns of $X$ are nearly parallel (nearly linearly dependent), $\kappa(X)$ explodes. This doesn't mean least squares is wrong — the minimum is still well-defined — but the minimizer becomes extremely sensitive to perturbations in the data. A 1% change in data can flip the sign of a coefficient. This is why variance inflation factors (VIF) are monitored in regression diagnostics and why ridge regression is standard practice in multicollinear settings.
LASSO vs Ridge: geometry. OLS constrained to a norm ball: $\min_\beta \|X\beta - y\|_2^2$ s.t. $\|\beta\|_p \le t$. For $p = 2$ (ridge), the ball has no corners, so the constrained optimum rarely lies on an axis — Ridge shrinks all coefficients but doesn't zero any. For $p = 1$ (LASSO), the ball has corners on the coordinate axes — the constrained optimum frequently hits a corner, setting some coefficients to exactly zero (sparsity). The geometry of norm balls (Lesson 2 of this module) directly explains why LASSO produces sparse solutions.
When to Use Each Solver
| Scenario | Recommended solver |
|---|---|
| Dense $A$, well-conditioned | `np.linalg.lstsq` (SVD-based LAPACK driver) |
| Dense $A$, rank-deficient or ill-conditioned | `np.linalg.lstsq` with `rcond` threshold |
| Symmetric PD system | Cholesky (`scipy.linalg.cho_solve`) |
| Large sparse $A$ | CG or LSQR with preconditioning |
| Large, low-rank structure in $A$ | Randomized SVD + truncated pseudoinverse |
| Ridge regression | `sklearn.linear_model.Ridge` (uses Cholesky or SVD) |
| LASSO | Coordinate descent (`sklearn.linear_model.Lasso`) |