Inner Products, Norms & Orthogonality
A vector space tells you what you can add and scale; an inner product tells you what it means for two vectors to be perpendicular and how long they are. These two concepts — angle and length — are what make least-squares, PCA, attention, and kernel methods work. The dot product that appears in every linear layer is an inner product; the regularization term is a squared norm; the cosine similarity in embedding search is the normalized inner product. This lesson builds the full framework from axioms, proves the fundamental inequality (Cauchy-Schwarz), and shows how orthogonality makes projections the most natural computational primitive in all of applied mathematics.
Concepts
*Interactive figure: the Lp unit ball at p = 2 (L², Euclidean). The yellow vector v = (0.6, 0.8) lies on the L² unit sphere (‖v‖₂ = 1); its Lp norm changes as p varies. The circle is perfectly symmetric: the only Lp ball invariant under rotation. ML use: ridge regression, weight decay, cosine similarity.*
The dot product that appears in every linear layer, the penalty in weight decay, and the cosine similarity in embedding search are all the same idea: measuring angle and length between vectors. An inner product is the formal generalization of the familiar dot product to any vector space, and a norm is the corresponding notion of length. Everything about projection, least squares, and attention reduces to these two primitives.
Inner Product Spaces
An inner product on a real vector space $V$ is a function $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}$ satisfying:

| # | Property | Statement |
|---|---|---|
| 1 | Symmetry | $\langle u, v \rangle = \langle v, u \rangle$ |
| 2 | Linearity in first argument | $\langle \alpha u + \beta v, w \rangle = \alpha \langle u, w \rangle + \beta \langle v, w \rangle$ |
| 3 | Positive definiteness | $\langle v, v \rangle \ge 0$, with $\langle v, v \rangle = 0 \iff v = 0$ |
A vector space equipped with an inner product is an inner product space.
The three axioms are exactly what is needed to define angle and length consistently. Symmetry ensures the angle from $u$ to $v$ equals the angle from $v$ to $u$; positive definiteness ensures no nonzero vector has zero length; linearity ensures that projections are well-defined and that the Gram-Schmidt process works. Any weaker set of axioms would break at least one of these geometric properties.
For complex vector spaces, symmetry becomes conjugate symmetry, $\langle u, v \rangle = \overline{\langle v, u \rangle}$, and linearity becomes sesquilinearity (linear in the first argument, conjugate-linear in the second). This matters for complex-valued neural networks and quantum ML, but throughout this module we work over $\mathbb{R}$.
Canonical examples:
- Euclidean inner product on $\mathbb{R}^n$: $\langle u, v \rangle = u^\top v = \sum_{i=1}^n u_i v_i$ — the default in all of ML
- Weighted inner product: $\langle u, v \rangle_A = u^\top A v$ for any positive definite matrix $A$ — arises in Mahalanobis distance and natural gradient descent
- Frobenius inner product on $\mathbb{R}^{m \times n}$: $\langle A, B \rangle_F = \operatorname{tr}(A^\top B) = \sum_{i,j} A_{ij} B_{ij}$ — treats matrices as flattened vectors; appears in low-rank regularization
- $L^2$ function inner product: $\langle f, g \rangle = \int_a^b f(x)\, g(x)\, dx$ on continuous functions on $[a, b]$ — the infinite-dimensional version; appears in Fourier analysis and reproducing kernel Hilbert spaces (Module 13)
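Each of the finite-dimensional examples is one line of NumPy. A quick sketch (the specific vectors and matrices below are arbitrary illustrations):

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

# Euclidean inner product <u, v> = u^T v
euclid = u @ v                      # 1*3 + 2*(-1) = 1.0

# Weighted inner product <u, v>_A = u^T A v for a positive definite A
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # symmetric, positive definite
weighted = u @ A @ v

# Frobenius inner product <M, N>_F = tr(M^T N) = sum of entrywise products
M = np.arange(4.0).reshape(2, 2)
N = np.ones((2, 2))
frob = np.trace(M.T @ N)
assert np.isclose(frob, (M * N).sum())  # two equivalent formulas agree
```

Note that the Frobenius inner product really is the Euclidean inner product on the flattened entries, which the final assertion checks.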
The Cauchy-Schwarz Inequality
Theorem. For any inner product space $V$ and any $u, v \in V$:
$$|\langle u, v \rangle| \le \|u\| \, \|v\|,$$
with equality if and only if $u$ and $v$ are linearly dependent.
Proof. If $v = 0$, both sides equal zero. Otherwise, for any $t \in \mathbb{R}$:
$$0 \le \|u - tv\|^2 = \|u\|^2 - 2t \langle u, v \rangle + t^2 \|v\|^2.$$
This quadratic in $t$ is non-negative everywhere, so its discriminant is non-positive:
$$4 \langle u, v \rangle^2 - 4 \|u\|^2 \|v\|^2 \le 0.$$
Taking square roots gives $|\langle u, v \rangle| \le \|u\| \, \|v\|$. Equality holds iff the quadratic touches zero, i.e. $u = tv$ for some $t$.
Geometric consequence. Cauchy-Schwarz guarantees that the ratio $\frac{\langle u, v \rangle}{\|u\| \, \|v\|}$ always lies in $[-1, 1]$, so the angle $\theta$ between $u$ and $v$ is well-defined:
$$\cos \theta = \frac{\langle u, v \rangle}{\|u\| \, \|v\|}.$$
Cosine similarity in embedding retrieval is exactly this — angle measured as inner product after normalization to unit norm.
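A minimal sketch of this connection: Cauchy-Schwarz is exactly what guarantees the cosine similarity stays in $[-1, 1]$, which we can verify empirically on random vectors.

```python
import numpy as np

def cosine_similarity(u, v):
    """<u, v> / (||u|| ||v||) -- bounded in [-1, 1] by Cauchy-Schwarz."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
for _ in range(1000):
    u, v = rng.normal(size=8), rng.normal(size=8)
    # Cauchy-Schwarz: |<u, v>| <= ||u|| ||v||, so the ratio is in [-1, 1]
    assert -1.0 <= cosine_similarity(u, v) <= 1.0

# linearly dependent vectors achieve the equality case
assert np.isclose(cosine_similarity(u, 3.0 * u), 1.0)
```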
Norms
An inner product induces a norm via $\|v\| = \sqrt{\langle v, v \rangle}$. More generally, a norm on $V$ is any function $\|\cdot\| : V \to \mathbb{R}$ satisfying:
- Positive definiteness: $\|v\| \ge 0$, with $\|v\| = 0 \iff v = 0$
- Absolute homogeneity: $\|\alpha v\| = |\alpha| \, \|v\|$
- Triangle inequality: $\|u + v\| \le \|u\| + \|v\|$
Not every norm comes from an inner product (the $\ell^1$ and $\ell^\infty$ norms do not), but every inner product norm satisfies an additional identity, the parallelogram law:
$$\|u + v\|^2 + \|u - v\|^2 = 2\|u\|^2 + 2\|v\|^2.$$
A norm satisfies the parallelogram law if and only if it arises from an inner product (the Jordan-von Neumann theorem).
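The parallelogram law is easy to test numerically; a sketch on concrete vectors (chosen so the $\ell^1$ failure is visible):

```python
import numpy as np

u = np.array([1.0, 0.0])
v = np.array([0.0, 2.0])

def parallelogram_holds(norm):
    lhs = norm(u + v) ** 2 + norm(u - v) ** 2
    rhs = 2 * norm(u) ** 2 + 2 * norm(v) ** 2
    return np.isclose(lhs, rhs)

# l2 comes from the Euclidean inner product, so the law holds
assert parallelogram_holds(lambda x: np.linalg.norm(x, 2))
# l1 does not come from any inner product, and the law fails here: 18 != 10
assert not parallelogram_holds(lambda x: np.linalg.norm(x, 1))
```

This is a one-sided check: a single failing pair rules out an inner product, while confirming the law requires it to hold for all pairs.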
$\ell^p$ norms on $\mathbb{R}^n$ (for $p \ge 1$):
$$\|v\|_p = \left( \sum_{i=1}^n |v_i|^p \right)^{1/p}.$$
The limit $p \to \infty$ gives $\|v\|_\infty = \max_i |v_i|$. The geometric objects — the unit balls illustrated in the diagram at the top of this lesson — reveal the character of each norm.
The shape of the unit ball explains sparsity. The $\ell^1$ ball has corners exactly on the coordinate axes. When you minimize a loss subject to $\|w\|_1 \le t$, the constrained optimum tends to land at a corner — meaning most coordinates of $w$ are exactly zero. The round $\ell^2$ ball has no corners, so the optimum lands on the smooth boundary with all coordinates nonzero. This geometric accident is why Lasso produces sparse models and ridge regression does not.
Matrix norms used throughout ML:

| Norm | Formula | Interpretation | ML use |
|---|---|---|---|
| Frobenius | $\lVert A \rVert_F = \sqrt{\sum_{i,j} A_{ij}^2}$ | Euclidean norm of entries | Weight decay, LoRA regularization |
| Spectral | $\lVert A \rVert_2 = \sigma_{\max}(A)$ (largest singular value) | Maximum stretching factor | Lipschitz bounds, spectral normalization |
| Nuclear | $\lVert A \rVert_* = \sum_i \sigma_i(A)$ | $\ell^1$ of singular values | Promotes low-rank structure; matrix completion |
| $L_{2,1}$ | $\lVert A \rVert_{2,1} = \sum_j \lVert a_j \rVert_2$ | Sum of column norms | Group sparsity; structured pruning |
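The first three of these are available directly through `np.linalg.norm`; a sketch on an arbitrary matrix, also checking the standard ordering spectral ≤ Frobenius ≤ nuclear:

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])
s = np.linalg.svd(A, compute_uv=False)    # singular values of A

frobenius = np.linalg.norm(A, 'fro')      # sqrt of sum of squared entries
spectral  = np.linalg.norm(A, 2)          # largest singular value
nuclear   = np.linalg.norm(A, 'nuc')      # sum of singular values

assert np.isclose(spectral, s.max())
assert np.isclose(nuclear, s.sum())
# Frobenius is the l2 norm of the singular values
assert np.isclose(frobenius, np.sqrt((s ** 2).sum()))
# general ordering: sigma_max <= sqrt(sum sigma^2) <= sum sigma
assert spectral <= frobenius <= nuclear
```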
Norm equivalence. On any finite-dimensional space, all norms are equivalent: for any two norms $\|\cdot\|_a$, $\|\cdot\|_b$ on $\mathbb{R}^n$, there exist constants $0 < c \le C$ such that
$$c \, \|v\|_b \le \|v\|_a \le C \, \|v\|_b \quad \text{for all } v.$$
Concretely:
$$\|v\|_2 \le \|v\|_1 \le \sqrt{n} \, \|v\|_2, \qquad \|v\|_\infty \le \|v\|_2 \le \sqrt{n} \, \|v\|_\infty.$$
Norm equivalence means convergence in one norm implies convergence in every norm — so the choice of norm is a statistical or computational preference, not a topological one.
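The concrete equivalence bounds can be stress-tested on random vectors; a sketch with $n = 10$:

```python
import numpy as np

n = 10
rng = np.random.default_rng(2)
for _ in range(1000):
    v = rng.normal(size=n)
    l1   = np.linalg.norm(v, 1)
    l2   = np.linalg.norm(v, 2)
    linf = np.linalg.norm(v, np.inf)
    # ||v||_2 <= ||v||_1 <= sqrt(n) ||v||_2
    assert l2 <= l1 <= np.sqrt(n) * l2 + 1e-12
    # ||v||_inf <= ||v||_2 <= sqrt(n) ||v||_inf
    assert linf <= l2 <= np.sqrt(n) * linf + 1e-12
```

The constant vector $v = (1, \ldots, 1)$ makes the upper bound $\|v\|_1 = \sqrt{n}\,\|v\|_2$ tight; a one-hot vector makes the lower bound tight.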
Orthogonality
Definition. Vectors $u, v$ are orthogonal if $\langle u, v \rangle = 0$, written $u \perp v$. A set $\{v_1, \ldots, v_k\}$ is:
- Orthogonal if $\langle v_i, v_j \rangle = 0$ for all $i \ne j$
- Orthonormal if additionally $\|v_i\| = 1$ for all $i$
Theorem. Every orthogonal set of nonzero vectors is linearly independent.
Proof. Suppose $c_1 v_1 + \cdots + c_k v_k = 0$. Take the inner product with $v_j$:
$$0 = \left\langle \sum_i c_i v_i, \, v_j \right\rangle = c_j \|v_j\|^2.$$
Since $v_j \ne 0$, we get $c_j = 0$ for all $j$.
Pythagorean Theorem. If $u \perp v$, then $\|u + v\|^2 = \|u\|^2 + \|v\|^2$.
Proof. $\|u + v\|^2 = \langle u + v, u + v \rangle = \|u\|^2 + 2\langle u, v \rangle + \|v\|^2 = \|u\|^2 + \|v\|^2$.
Orthogonal complement. For any subspace $W \subseteq V$:
$$W^\perp = \{ v \in V : \langle v, w \rangle = 0 \text{ for all } w \in W \}.$$
$W^\perp$ is itself a subspace, and $V = W \oplus W^\perp$ — every vector decomposes uniquely as $v = w + w^\perp$ with $w \in W$ and $w^\perp \in W^\perp$. This is the orthogonal direct sum decomposition.
Gram-Schmidt Orthogonalization and QR
Given linearly independent vectors $v_1, \ldots, v_k$, the Gram-Schmidt process constructs an orthonormal basis $u_1, \ldots, u_k$ for the same span:
$$u_1 = \frac{v_1}{\|v_1\|}, \qquad \tilde{u}_k = v_k - \sum_{j=1}^{k-1} \langle v_k, u_j \rangle \, u_j, \qquad u_k = \frac{\tilde{u}_k}{\|\tilde{u}_k\|}.$$
Each step subtracts the component of $v_k$ already explained by the previous $u_j$'s.
QR decomposition. Applying Gram-Schmidt to the columns of $A \in \mathbb{R}^{m \times n}$ (with $m \ge n$, linearly independent columns) yields $A = QR$ where:
- $Q \in \mathbb{R}^{m \times n}$ has orthonormal columns: $Q^\top Q = I_n$
- $R \in \mathbb{R}^{n \times n}$ is upper triangular with positive diagonal: $R_{jk} = \langle a_k, q_j \rangle$ for $j \le k$
The entry $R_{jk}$ records how much of column $a_k$ was projected onto $q_j$. QR is the numerical backbone of least-squares solvers and is more numerically stable than forming the normal equations $A^\top A x = A^\top b$.
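The process can be sketched in a few lines of NumPy (classical Gram-Schmidt for clarity; production code should prefer `np.linalg.qr`, which uses Householder reflections and is more stable). The test matrix here is an illustrative choice, not taken from the lesson:

```python
import numpy as np

def gram_schmidt_qr(A):
    """Thin QR of A (m x n, independent columns) via classical Gram-Schmidt."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        u = A[:, k].copy()
        for j in range(k):
            R[j, k] = Q[:, j] @ A[:, k]   # component of a_k along q_j
            u -= R[j, k] * Q[:, j]        # subtract what is already explained
        R[k, k] = np.linalg.norm(u)       # positive diagonal
        Q[:, k] = u / R[k, k]
    return Q, R

A = np.array([[3.0, 1.0],
              [4.0, 2.0]])
Q, R = gram_schmidt_qr(A)
assert np.allclose(Q.T @ Q, np.eye(2))   # orthonormal columns
assert np.allclose(Q @ R, A)             # exact reconstruction
assert np.allclose(np.triu(R), R)        # upper triangular
```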
Orthogonal Projections and the Best Approximation Theorem
Theorem (Best Approximation). For a subspace $W \subseteq V$ and any $v \in V$, there exists a unique $\hat{w} \in W$ minimizing $\|v - w\|$ over all $w \in W$. This minimizer satisfies:
$$v - \hat{w} \perp W.$$
Proof. For any $w \in W$, write $v - w = (v - \hat{w}) + (\hat{w} - w)$. Since $\hat{w} - w \in W$ and $v - \hat{w} \perp W$:
$$\|v - w\|^2 = \|v - \hat{w}\|^2 + \|\hat{w} - w\|^2 \ge \|v - \hat{w}\|^2,$$
with equality iff $w = \hat{w}$.
*Interactive figure: orthogonal projection (drag the vector tip; rotate the subspace W). The residual v − P(v) is always perpendicular to W (right-angle box). This is the Best Approximation Theorem: P(v) is the closest point in W to v.*
Drag the vector in the diagram above. The green vector $P(v)$ is always the foot of the perpendicular from $v$ to the subspace $W$ — the right-angle box confirms orthogonality. The orange dashed line is the residual, and its length is the approximation error. Best Approximation says: no other point in $W$ is closer to $v$.
Projection matrix. For $W = \operatorname{col}(A)$ with $A \in \mathbb{R}^{m \times n}$ having linearly independent columns:
$$P = A (A^\top A)^{-1} A^\top.$$
If the basis is already orthonormal (columns of $Q$), this simplifies to $P = Q Q^\top$.
Characterization. A matrix $P$ is an orthogonal projection onto its column space if and only if it is:
- Idempotent: $P^2 = P$ (projecting twice is the same as once)
- Symmetric: $P^\top = P$
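Both the formula and the characterization are easy to verify numerically; a sketch with an arbitrary basis for a 2D subspace of $\mathbb{R}^3$:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])             # independent columns spanning a plane
P = A @ np.linalg.inv(A.T @ A) @ A.T   # P = A (A^T A)^{-1} A^T

assert np.allclose(P @ P, P)           # idempotent
assert np.allclose(P.T, P)             # symmetric

v = np.array([1.0, 2.0, 3.0])
p = P @ v                              # projection onto col(A)
r = v - p                              # residual
assert np.allclose(A.T @ r, 0)         # residual orthogonal to col(A)
# Pythagoras: ||v||^2 = ||p||^2 + ||r||^2
assert np.isclose(v @ v, p @ p + r @ r)
```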
Gram Matrix and Kernels
Given vectors $v_1, \ldots, v_n$, the Gram matrix $G \in \mathbb{R}^{n \times n}$ has entries:
$$G_{ij} = \langle v_i, v_j \rangle.$$
Properties: $G$ is always symmetric positive semidefinite. Its rank equals $\dim \operatorname{span}\{v_1, \ldots, v_n\}$.
Replacing $\langle v_i, v_j \rangle$ by $k(x_i, x_j)$ for a positive definite kernel $k$ gives a kernel matrix — the Gram matrix of an implicit feature map $\phi$ satisfying $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$. This is why kernel methods can operate in infinite-dimensional feature spaces without ever computing $\phi$ explicitly. Mercer's theorem (Module 13, Lesson 4) makes this precise.
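A sketch of these properties on three illustrative points in $\mathbb{R}^2$, using the linear kernel and, as an assumed example, the polynomial kernel $k(x, y) = (1 + \langle x, y \rangle)^2$:

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])                 # three points (rows) in R^2

G = X @ X.T                                # linear-kernel Gram matrix: G_ij = <x_i, x_j>
K = (1.0 + X @ X.T) ** 2                   # polynomial kernel k(x, y) = (1 + <x, y>)^2

for M in (G, K):
    assert np.allclose(M, M.T)                       # symmetric
    assert np.linalg.eigvalsh(M).min() >= -1e-10     # positive semidefinite

# three points in R^2 span at most a 2-dimensional space
assert np.linalg.matrix_rank(G) == 2
```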
Worked Examples
Example 1: Gram-Schmidt on Two Vectors
Let $v_1 = (3, 4)$, $v_2 = (1, 2)$.
Step 1. Normalize $v_1$:
$$u_1 = \frac{v_1}{\|v_1\|} = \frac{1}{5}(3, 4) = (0.6, 0.8).$$
Step 2. Subtract the $u_1$-component of $v_2$:
$$\langle v_2, u_1 \rangle = 0.6 + 1.6 = 2.2, \qquad \tilde{u}_2 = v_2 - 2.2\, u_1 = (-0.32, 0.24), \qquad u_2 = \frac{\tilde{u}_2}{0.4} = (-0.8, 0.6).$$
Verify orthonormality: $\langle u_1, u_2 \rangle = (0.6)(-0.8) + (0.8)(0.6) = 0$ and $\|u_1\| = \|u_2\| = 1$.
The QR factorization is $A = QR$ with:
$$Q = \begin{pmatrix} 0.6 & -0.8 \\ 0.8 & 0.6 \end{pmatrix}, \qquad R = \begin{pmatrix} 5 & 2.2 \\ 0 & 0.4 \end{pmatrix}.$$
Example 2: Projection and Least Squares
Find the point in $W = \operatorname{span}\{(1, 1, 0), (0, 1, 1)\}$ closest to $b = (1, 2, 3)$.
Let $A$ have columns $a_1 = (1, 1, 0)$ and $a_2 = (0, 1, 1)$. Form the normal equations $A^\top A \hat{x} = A^\top b$:
$$\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \hat{x} = \begin{pmatrix} 3 \\ 5 \end{pmatrix}.$$
Solving: $\hat{x} = \left( \tfrac{1}{3}, \tfrac{7}{3} \right)$, so the closest point and residual are
$$\hat{p} = A \hat{x} = \left( \tfrac{1}{3}, \tfrac{8}{3}, \tfrac{7}{3} \right), \qquad r = b - \hat{p} = \left( \tfrac{2}{3}, -\tfrac{2}{3}, \tfrac{2}{3} \right).$$
Verify residual orthogonality:
$$\langle r, a_1 \rangle = \tfrac{2}{3} - \tfrac{2}{3} + 0 = 0 \quad \text{and} \quad \langle r, a_2 \rangle = 0 - \tfrac{2}{3} + \tfrac{2}{3} = 0.$$
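A least-squares computation of this shape can be checked numerically. The subspace and target below are illustrative assumptions ($W = \operatorname{span}\{(1,1,0), (0,1,1)\}$, $b = (1,2,3)$); `np.linalg.lstsq` solves the same problem via a more stable factorization:

```python
import numpy as np

# illustrative data: W = span{(1,1,0), (0,1,1)}, target b = (1,2,3)
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])

# solve the normal equations A^T A x = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)
assert np.allclose(x_hat, [1/3, 7/3])

p = A @ x_hat                    # closest point in W
r = b - p                        # residual
assert np.allclose(A.T @ r, 0)   # residual orthogonal to both columns

# lstsq solves the same minimization without forming A^T A
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_hat, x_lstsq)
```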
Example 3: Gram Matrix as Kernel Matrix
Let $x_1 = (1, 0)$, $x_2 = (0, 1)$, $x_3 = (1, 1)$. The linear kernel $k(x, y) = \langle x, y \rangle$ gives:
$$G = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{pmatrix}.$$
$G$ has rank 2 (three points in $\mathbb{R}^2$ span at most a 2-dimensional space). The polynomial kernel $k(x, y) = (1 + \langle x, y \rangle)^2$ implicitly maps $\mathbb{R}^2$ to a 6-dimensional feature space and gives:
$$K = \begin{pmatrix} 4 & 1 & 4 \\ 1 & 4 & 4 \\ 4 & 4 & 9 \end{pmatrix}.$$
Both $G$ and $K$ are symmetric positive semidefinite — the defining property of any valid kernel matrix.
Connections
Where Your Intuition Breaks
The $\ell^2$ norm is always the right choice. In fact, the choice of norm is a modeling decision that encodes what you want to penalize. The $\ell^2$ norm penalizes coefficients quadratically — large weights cost disproportionately more. The $\ell^1$ norm penalizes all coefficients at the same rate — it is indifferent to whether a weight is 0.1 or 0.01, but aggressively pushes small weights to exactly zero. The $\ell^\infty$ norm penalizes only the single largest weight, making it the right choice when you care about worst-case behavior. There is no universally correct norm; the right norm is the one whose geometry matches the problem's structure.
Norms in ML: A Decision Guide
| Choice | When to use | Why |
|---|---|---|
| $\ell^2$ norm | Default for vectors, weight decay | Rotation-invariant; smooth gradient everywhere |
| $\ell^1$ norm | Sparsity in weights or activations | Corners of unit ball encourage exact zeros |
| $\ell^\infty$ norm | Adversarial robustness | Controls worst-case coordinate deviation |
| Frobenius norm | Matrix regularization, LoRA updates | Treats all entries equally; differentiable |
| Nuclear norm | Low-rank matrix recovery, collaborative filtering | $\ell^1$ on singular values promotes low rank |
| Spectral norm | GAN discriminator, Lipschitz constraints | Controls maximum gradient magnitude |
Orthogonality as an Engineering Tool
Orthogonal initialization preserves the norm of activations through the forward pass: $\|Wx\| = \|x\|$ when $W$ is orthogonal. This prevents exploding and vanishing gradients in deep networks and is the default in several initialization schemes.
Attention as inner products. The scaled dot-product attention score is an inner product with a variance-stabilizing denominator: $\operatorname{score}(q, k) = \frac{q^\top k}{\sqrt{d_k}}$. For random unit-variance $q, k \in \mathbb{R}^{d_k}$, the raw inner product $q^\top k$ has variance $d_k$, so dividing by $\sqrt{d_k}$ restores unit variance and prevents the softmax from saturating in high-dimensional spaces.
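The variance argument can be checked by simulation; a sketch assuming Gaussian queries and keys with $d_k = 64$ (both are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
d_k = 64
n = 100_000

# unit-variance random queries and keys
q = rng.normal(size=(n, d_k))
k = rng.normal(size=(n, d_k))

raw = np.sum(q * k, axis=1)          # raw inner products q^T k
scaled = raw / np.sqrt(d_k)          # scaled dot-product attention scores

# the raw variance grows like d_k; dividing by sqrt(d_k) restores ~1
assert abs(raw.var() / d_k - 1.0) < 0.05
assert abs(scaled.var() - 1.0) < 0.05
```

Without the scaling, scores of magnitude $\sim \sqrt{d_k}$ would push the softmax into its saturated regime, where gradients nearly vanish.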
Neural style transfer represents image style as the Gram matrix of feature maps: $G_{ij} = \sum_p F_{ip} F_{jp}$ measures correlation between feature channels $i$ and $j$, capturing texture without spatial information.
Common Pitfalls
Confusing $\|w\|_2$ and $\|w\|_2^2$. Weight decay adds $\lambda \|w\|_2^2$, giving gradient $2\lambda w$. The squared norm is smooth everywhere; $\|w\|_2$ itself is not differentiable at $w = 0$. For $\ell^1$ regularization, $\|w\|_1$ is non-differentiable at any zero coordinate — requiring subdifferentials or proximal operators.
Applying orthonormal formulas to orthogonal (but not orthonormal) bases. If $Q$ has orthonormal columns, $Q^\top Q = I$ and $P = QQ^\top$. If the columns are orthogonal but not unit-norm, these simplifications fail — divide each column by its norm first.
Gram matrix rank versus sample count. $\operatorname{rank}(G) \le \dim \operatorname{span}\{v_1, \ldots, v_n\}$. If $n$ points live in a 50-dimensional subspace, $G$ has rank at most 50. Kernel methods cannot distinguish points that differ only in the null space of the feature map, regardless of how large $n$ is.
Every result in this lesson reduces to one move: decompose a vector into a component along a subspace and a component orthogonal to it. Gram-Schmidt builds an orthonormal basis so this decomposition is numerically clean. The projection matrix $P$ executes it in one matrix-vector product. The normal equations find the best linear fit by projecting $b$ onto the column space of $A$. Kernel methods replace explicit inner products with a kernel function, but the Gram matrix structure — symmetric, PSD, rank equal to the intrinsic dimension — is identical.