
Inner Products, Norms & Orthogonality

A vector space tells you what you can add and scale; an inner product tells you what it means for two vectors to be perpendicular and how long they are. These two concepts — angle and length — are what make least-squares, PCA, attention, and kernel methods work. The dot product $\mathbf{x}^\top \mathbf{y}$ that appears in every linear layer is an inner product; the $\ell_2$ regularization term $\lambda\|\mathbf{w}\|_2^2$ is a squared norm; the cosine similarity in embedding search is the normalized inner product. This lesson builds the full framework from axioms, proves the fundamental inequality (Cauchy-Schwarz), and shows how orthogonality makes projections the most natural computational primitive in all of applied mathematics.

Concepts

[Interactive figure: the $\ell^p$ unit ball as $p$ varies, shown here at $p = 2$ (Euclidean). The unit ball is a circle: perfectly symmetric, and the only $\ell^p$ ball invariant under rotation. The yellow vector $v = (0.6, 0.8)$ lies on the $L^2$ unit sphere with $\|v\|_2 = 1.000$; its $\ell^p$ norm changes as $p$ varies. ML use: ridge regression, weight decay, cosine similarity.]

The dot product $\mathbf{x}^\top \mathbf{y}$ that appears in every linear layer, the $\ell_2$ penalty $\lambda\|\mathbf{w}\|_2^2$ in weight decay, and the cosine similarity in embedding search are all the same idea: measuring angle and length between vectors. An inner product is the formal generalization of the familiar dot product to any vector space, and a norm is the corresponding notion of length. Everything about projection, least squares, and attention reduces to these two primitives.

Inner Product Spaces

An inner product on a real vector space $V$ is a function $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}$ satisfying:

| # | Property | Statement |
|---|----------|-----------|
| 1 | Symmetry | $\langle u, v \rangle = \langle v, u \rangle$ |
| 2 | Linearity in the first argument | $\langle \alpha u + \beta w, v \rangle = \alpha\langle u, v \rangle + \beta\langle w, v \rangle$ |
| 3 | Positive definiteness | $\langle v, v \rangle \geq 0$, with $\langle v, v \rangle = 0 \iff v = \mathbf{0}$ |

A vector space equipped with an inner product is an inner product space.

The three axioms are exactly what is needed to define angle and length consistently. Symmetry ensures $\cos(\mathbf{u},\mathbf{v}) = \cos(\mathbf{v},\mathbf{u})$; positive definiteness ensures no nonzero vector has zero length; linearity ensures that projections are well-defined and that the Gram-Schmidt process works. Any weaker set of axioms would break at least one of these geometric properties.

⚠️Warning

For complex vector spaces, symmetry becomes conjugate symmetry: $\langle u, v \rangle = \overline{\langle v, u \rangle}$, and linearity becomes sesquilinearity (linear in the first argument, conjugate-linear in the second). This matters for complex-valued neural networks and quantum ML, but throughout this module we work over $\mathbb{R}$.

Canonical examples:

  • Euclidean inner product on $\mathbb{R}^n$: $\langle x, y \rangle = x^\top y = \sum_{i=1}^n x_i y_i$ — the default in all of ML

  • Weighted inner product: $\langle x, y \rangle_W = x^\top W y$ for any positive definite matrix $W \succ 0$ — arises in Mahalanobis distance and natural gradient descent

  • Frobenius inner product on $\mathbb{R}^{m \times n}$: $\langle A, B \rangle_F = \mathrm{tr}(A^\top B) = \sum_{i,j} A_{ij} B_{ij}$ — treats matrices as flattened vectors; appears in low-rank regularization

  • $L^2$ function inner product: $\langle f, g \rangle = \int_a^b f(x)\,g(x)\,dx$ on continuous functions on $[a, b]$ — the infinite-dimensional version; appears in Fourier analysis and reproducing kernel Hilbert spaces (Module 13)
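The three finite-dimensional examples are easy to check numerically. A quick sketch with NumPy (array shapes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(4), rng.standard_normal(4)

# Euclidean inner product: <x, y> = x^T y = sum_i x_i y_i
assert np.isclose(x @ y, sum(xi * yi for xi, yi in zip(x, y)))

# Weighted inner product <x, y>_W = x^T W y with W positive definite
M = rng.standard_normal((4, 4))
W = M @ M.T + 4 * np.eye(4)          # construct a PD matrix
assert np.isclose(x @ W @ y, y @ W @ x)   # symmetric because W = W^T
assert x @ W @ x > 0                      # positive on nonzero x

# Frobenius inner product: tr(A^T B) equals the entrywise sum
A, B = rng.standard_normal((3, 2)), rng.standard_normal((3, 2))
assert np.isclose(np.trace(A.T @ B), (A * B).sum())
```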

The Cauchy-Schwarz Inequality

Theorem. For any inner product space and any $u, v \in V$:
$$|\langle u, v \rangle| \leq \|u\| \cdot \|v\|$$
with equality if and only if $u$ and $v$ are linearly dependent.

Proof. If $v = \mathbf{0}$, both sides equal zero. Otherwise, for any $t \in \mathbb{R}$:
$$0 \leq \langle u - tv,\, u - tv \rangle = \|u\|^2 - 2t\langle u, v \rangle + t^2 \|v\|^2$$
This quadratic in $t$ is non-negative everywhere, so its discriminant is non-positive:
$$4\langle u, v \rangle^2 - 4\|u\|^2\|v\|^2 \leq 0$$
Taking square roots gives $|\langle u, v \rangle| \leq \|u\|\|v\|$. Equality holds iff the quadratic touches zero, i.e. $u = tv$. $\square$

Geometric consequence. Cauchy-Schwarz guarantees that the ratio $\frac{\langle u, v \rangle}{\|u\|\|v\|}$ always lies in $[-1, 1]$, so the angle between $u$ and $v$ is well-defined:
$$\cos\theta_{u,v} = \frac{\langle u, v \rangle}{\|u\|\|v\|}$$

Cosine similarity in embedding retrieval is exactly this — angle measured as inner product after normalization to unit norm.
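Cauchy-Schwarz is what makes cosine similarity well-defined in code: the ratio can never leave $[-1, 1]$. A minimal sketch (helper function names are ours):

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def cosine_similarity(u, v):
    """Inner product after normalization to unit norm."""
    return dot(u, v) / (norm(u) * norm(v))

random.seed(0)
for _ in range(1000):
    u = [random.gauss(0, 1) for _ in range(8)]
    v = [random.gauss(0, 1) for _ in range(8)]
    # Cauchy-Schwarz: |<u, v>| <= ||u|| ||v||, so cosine lies in [-1, 1]
    assert abs(dot(u, v)) <= norm(u) * norm(v) + 1e-12
    assert -1 - 1e-12 <= cosine_similarity(u, v) <= 1 + 1e-12

# Equality case: v = 3u is linearly dependent on u, so cosine is exactly 1
u = [0.6, 0.8]
assert math.isclose(cosine_similarity(u, [3 * x for x in u]), 1.0)
```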

Norms

An inner product induces a norm via $\|v\| = \sqrt{\langle v, v \rangle}$. More generally, a norm on $V$ is any function $\|\cdot\| : V \to \mathbb{R}_{\geq 0}$ satisfying:

  1. Positive definiteness: $\|v\| \geq 0$, with $\|v\| = 0 \iff v = \mathbf{0}$
  2. Absolute homogeneity: $\|\alpha v\| = |\alpha|\,\|v\|$
  3. Triangle inequality: $\|u + v\| \leq \|u\| + \|v\|$

Not every norm comes from an inner product (the $\ell_1$ and $\ell_\infty$ norms do not), but every inner product norm satisfies an additional identity:

$$\text{Parallelogram law:} \quad \|u + v\|^2 + \|u - v\|^2 = 2\|u\|^2 + 2\|v\|^2$$

A norm satisfies the parallelogram law if and only if it arises from an inner product (the Jordan-von Neumann theorem).
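The parallelogram law therefore gives a concrete test for whether a norm can come from an inner product. A quick check that $\ell_2$ passes and $\ell_1$ fails (helper names are ours):

```python
import math

def norm_p(v, p):
    return sum(abs(x) ** p for x in v) ** (1 / p)

def parallelogram_gap(u, v, p):
    """LHS minus RHS of the parallelogram law under the l^p norm."""
    s = [a + b for a, b in zip(u, v)]
    d = [a - b for a, b in zip(u, v)]
    lhs = norm_p(s, p) ** 2 + norm_p(d, p) ** 2
    rhs = 2 * norm_p(u, p) ** 2 + 2 * norm_p(v, p) ** 2
    return lhs - rhs

u, v = [1.0, 0.0], [0.0, 1.0]
# l2: the law holds (gap is zero), consistent with the Euclidean inner product
assert math.isclose(parallelogram_gap(u, v, 2), 0.0, abs_tol=1e-12)
# l1: lhs = 2^2 + 2^2 = 8 but rhs = 4, so the law fails -> no inner product
assert abs(parallelogram_gap(u, v, 1)) > 0.5
```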

$\ell^p$ norms on $\mathbb{R}^n$ (for $p \geq 1$):
$$\|x\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}$$

The limit $p \to \infty$ gives $\|x\|_\infty = \max_i |x_i|$. The geometric objects $\{x : \|x\|_p \leq 1\}$ — the unit balls illustrated in the diagram at the top of this lesson — reveal the character of each norm:

💡Intuition

The shape of the unit ball explains sparsity. The $\ell_1$ ball has corners exactly on the coordinate axes. When you minimize a loss subject to $\|w\|_1 \leq c$, the constrained optimum tends to land at a corner — meaning most coordinates of $w$ are exactly zero. The round $\ell_2$ ball has no corners, so the optimum lands on the smooth boundary with all coordinates nonzero. This geometric accident is why Lasso produces sparse models and ridge regression does not.
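The vector $v = (0.6, 0.8)$ from the diagram makes the trend concrete: for a fixed vector, $\|v\|_p$ shrinks as $p$ grows. A small sketch:

```python
def lp_norm(v, p):
    """l^p norm (sum |x_i|^p)^(1/p) for p >= 1."""
    return sum(abs(x) ** p for x in v) ** (1 / p)

v = (0.6, 0.8)
# l2 norm is exactly 1: v lies on the Euclidean unit circle
assert abs(lp_norm(v, 2) - 1.0) < 1e-12
# l1 is larger (1.4), while higher p approaches max|x_i| = 0.8 from above
assert abs(lp_norm(v, 1) - 1.4) < 1e-12
assert 0.8 < lp_norm(v, 4) < 1.0
assert max(abs(x) for x in v) == 0.8   # the p -> infinity limit
```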

Matrix norms used throughout ML:

| Norm | Formula | Interpretation | ML use |
|------|---------|----------------|--------|
| Frobenius $\lVert A\rVert_F$ | $\sqrt{\mathrm{tr}(A^\top A)}$ | Euclidean norm of entries | Weight decay, LoRA regularization |
| Spectral $\lVert A\rVert_2$ | $\sigma_1(A)$ (largest singular value) | Maximum stretching factor | Lipschitz bounds, spectral normalization |
| Nuclear $\lVert A\rVert_*$ | $\sum_i \sigma_i(A)$ | $\ell_1$ of singular values | Promotes low-rank structure; matrix completion |
| $\ell_{2,1}$ norm $\lVert A\rVert_{2,1}$ | $\sum_j \lVert A_{:,j}\rVert_2$ | Sum of column norms | Group sparsity; structured pruning |
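All four norms in the table are one-liners in NumPy; a sketch verifying their singular-value characterizations on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
sigma = np.linalg.svd(A, compute_uv=False)   # singular values, descending

fro      = np.linalg.norm(A, "fro")          # sqrt(tr(A^T A))
spectral = np.linalg.norm(A, 2)              # sigma_1(A)
nuclear  = np.linalg.norm(A, "nuc")          # sum of singular values
l21      = np.linalg.norm(A, axis=0).sum()   # sum of column l2 norms

assert np.isclose(fro, np.sqrt(np.trace(A.T @ A)))
assert np.isclose(fro, np.sqrt((sigma ** 2).sum()))  # Frobenius = l2 of sigma
assert np.isclose(spectral, sigma[0])
assert np.isclose(nuclear, sigma.sum())
# Standard ordering: spectral <= Frobenius <= nuclear
assert spectral <= fro <= nuclear
```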

Norm equivalence. On any finite-dimensional space, all norms are equivalent: for any two norms $\|\cdot\|_\alpha$, $\|\cdot\|_\beta$ on $\mathbb{R}^n$, there exist $c_1, c_2 > 0$ such that
$$c_1 \|x\|_\beta \leq \|x\|_\alpha \leq c_2 \|x\|_\beta \quad \forall x \in \mathbb{R}^n$$

Concretely:
$$\|x\|_2 \leq \|x\|_1 \leq \sqrt{n}\,\|x\|_2 \qquad \|x\|_\infty \leq \|x\|_2 \leq \sqrt{n}\,\|x\|_\infty$$

Norm equivalence means convergence in one norm implies convergence in every norm — so the choice of norm is a statistical or computational preference, not a topological one.
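The concrete bounds above are easy to spot-check by random sampling:

```python
import math
import random

def l1(v):   return sum(abs(x) for x in v)
def l2(v):   return math.sqrt(sum(x * x for x in v))
def linf(v): return max(abs(x) for x in v)

random.seed(0)
n = 16
root_n = math.sqrt(n)
for _ in range(1000):
    x = [random.uniform(-5, 5) for _ in range(n)]
    # ||x||_2 <= ||x||_1 <= sqrt(n) ||x||_2
    assert l2(x) <= l1(x) <= root_n * l2(x) + 1e-9
    # ||x||_inf <= ||x||_2 <= sqrt(n) ||x||_inf
    assert linf(x) <= l2(x) <= root_n * linf(x) + 1e-9
```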

Orthogonality

Definition. Vectors $u, v \in V$ are orthogonal if $\langle u, v \rangle = 0$, written $u \perp v$. A set $\{v_1, \ldots, v_k\}$ is:

  • Orthogonal if $\langle v_i, v_j \rangle = 0$ for all $i \neq j$
  • Orthonormal if additionally $\|v_i\| = 1$ for all $i$

Theorem. Every orthogonal set of nonzero vectors is linearly independent.

Proof. Suppose $\sum_i \alpha_i v_i = \mathbf{0}$. Take the inner product with $v_j$:
$$0 = \left\langle \sum_i \alpha_i v_i,\, v_j \right\rangle = \sum_i \alpha_i \langle v_i, v_j \rangle = \alpha_j \|v_j\|^2$$
Since $\|v_j\|^2 > 0$, we get $\alpha_j = 0$ for all $j$. $\square$

Pythagorean Theorem. If $u \perp v$, then $\|u + v\|^2 = \|u\|^2 + \|v\|^2$.

Proof. $\|u + v\|^2 = \langle u+v,\, u+v \rangle = \|u\|^2 + 2\langle u, v \rangle + \|v\|^2 = \|u\|^2 + \|v\|^2$. $\square$

Orthogonal complement. For any subspace $W \subseteq V$:
$$W^\perp = \{v \in V : \langle v, w \rangle = 0 \text{ for all } w \in W\}$$

$W^\perp$ is itself a subspace, and $V = W \oplus W^\perp$ — every vector $v$ decomposes uniquely as $v = \hat{v} + (v - \hat{v})$ with $\hat{v} \in W$ and $(v - \hat{v}) \in W^\perp$. This is the orthogonal direct sum decomposition.

Gram-Schmidt Orthogonalization and QR

Given linearly independent vectors $\{a_1, \ldots, a_k\} \subset V$, the Gram-Schmidt process constructs an orthonormal basis $\{q_1, \ldots, q_k\}$ for the same span:

$$\tilde{q}_j = a_j - \sum_{i=1}^{j-1} \langle a_j, q_i \rangle\, q_i \qquad q_j = \frac{\tilde{q}_j}{\|\tilde{q}_j\|}$$

Each step subtracts the component of $a_j$ already explained by the previous $q_i$'s.

QR decomposition. Applying Gram-Schmidt to the columns $a_1, \ldots, a_n$ of $A \in \mathbb{R}^{m \times n}$ (with $m \geq n$ and linearly independent columns) yields $A = QR$ where:

  • $Q \in \mathbb{R}^{m \times n}$ has orthonormal columns: $Q^\top Q = I_n$
  • $R \in \mathbb{R}^{n \times n}$ is upper triangular with positive diagonal: $R_{jj} = \|\tilde{q}_j\|$

The entry $R_{ij} = \langle a_j, q_i \rangle$ records how much of $a_j$ was projected onto $q_i$. QR is the numerical backbone of least-squares solvers and is more numerically stable than forming the normal equations $A^\top A$.
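The update rule translates directly into code. A minimal classical Gram-Schmidt sketch (production solvers prefer modified Gram-Schmidt or Householder reflections for numerical stability):

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt QR of A (m x n, independent columns).
    Returns Q with orthonormal columns and upper-triangular R with A = QR."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        q_tilde = A[:, j].copy()
        for i in range(j):
            R[i, j] = A[:, j] @ Q[:, i]      # R_ij = <a_j, q_i>
            q_tilde -= R[i, j] * Q[:, i]     # subtract the explained component
        R[j, j] = np.linalg.norm(q_tilde)    # R_jj = ||q_tilde_j||
        Q[:, j] = q_tilde / R[j, j]          # normalize
    return Q, R

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
Q, R = gram_schmidt_qr(A)
assert np.allclose(Q.T @ Q, np.eye(4))                       # orthonormal columns
assert np.allclose(A, Q @ R)                                 # reconstruction
assert np.allclose(R, np.triu(R)) and np.all(np.diag(R) > 0) # upper triangular
```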

Orthogonal Projections and the Best Approximation Theorem

Theorem (Best Approximation). For a subspace $W \subseteq V$ and any $v \in V$, there exists a unique $\hat{v} \in W$ minimizing $\|v - w\|$ over all $w \in W$. This minimizer satisfies:
$$v - \hat{v} \perp W \quad \text{(i.e., } \langle v - \hat{v}, w \rangle = 0 \text{ for all } w \in W\text{)}$$

Proof. For any $w \in W$, write $v - w = (v - \hat{v}) + (\hat{v} - w)$. Since $v - \hat{v} \perp W$ and $\hat{v} - w \in W$:
$$\|v - w\|^2 = \|v - \hat{v}\|^2 + \|\hat{v} - w\|^2 \geq \|v - \hat{v}\|^2$$
with equality iff $w = \hat{v}$. $\square$

Orthogonal Projection (interactive: drag the vector tip, rotate the subspace $W$)

[Interactive figure. Example state at angle $= 30°$: $v = (0.50, 0.90)$, $P(v) = (0.76, 0.44)$, $v - P(v) = (-0.26, 0.46)$, with $\|v\| = 1.030$, $\|P(v)\| = 0.883$, $\|v - P(v)\| = 0.529$, $\cos\theta = 0.858$, $\theta = 30.9°$.]

The residual $v - P(v)$ is always perpendicular to $W$ (right-angle box). This is the Best Approximation Theorem: $P(v)$ is the closest point in $W$ to $v$.

💡Intuition

Drag the vector $v$ in the diagram above. The green vector $P(v)$ is always the foot of the perpendicular from $v$ to the subspace $W$ — the right-angle box confirms orthogonality. The orange dashed line $v - P(v)$ is the residual, and its length is the approximation error. Best Approximation says: no other point in $W$ is closer to $v$.

Projection matrix. For $W = \mathrm{col}(A)$ with $A \in \mathbb{R}^{m \times n}$ having linearly independent columns:
$$P_W = A(A^\top A)^{-1}A^\top$$

If $A = Q$ already has orthonormal columns, this simplifies to $P_W = QQ^\top$.

Characterization. A matrix $P$ is an orthogonal projection onto its column space if and only if it is:

  • Idempotent: $P^2 = P$ (projecting twice is the same as once)
  • Symmetric: $P^\top = P$
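Both characterizations, together with the residual orthogonality from the Best Approximation Theorem, can be checked in a few lines (a random subspace chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))           # W = col(A), a 2D subspace of R^5
P = A @ np.linalg.inv(A.T @ A) @ A.T      # P_W = A (A^T A)^{-1} A^T

assert np.allclose(P @ P, P)              # idempotent
assert np.allclose(P, P.T)                # symmetric

v = rng.standard_normal(5)
v_hat = P @ v                             # closest point in W to v
residual = v - v_hat
assert np.allclose(A.T @ residual, 0)     # residual is orthogonal to W
# Pythagoras: ||v||^2 = ||v_hat||^2 + ||residual||^2
assert np.isclose(v @ v, v_hat @ v_hat + residual @ residual)

# With an orthonormal basis Q for W, the formula simplifies to Q Q^T
Q, _ = np.linalg.qr(A)
assert np.allclose(P, Q @ Q.T)
```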

Gram Matrix and Kernels

Given vectors $x_1, \ldots, x_n \in V$, the Gram matrix $G \in \mathbb{R}^{n \times n}$ has entries:
$$G_{ij} = \langle x_i, x_j \rangle$$

Properties: $G$ is always symmetric positive semidefinite. Its rank equals $\dim(\mathrm{span}\{x_1, \ldots, x_n\})$.

Replacing $\langle x_i, x_j \rangle$ by $k(x_i, x_j)$ for a positive definite kernel $k$ gives a kernel matrix — the Gram matrix of an implicit feature map $\phi$ satisfying $k(x, y) = \langle \phi(x), \phi(y) \rangle$. This is why kernel methods can operate in infinite-dimensional feature spaces without ever computing $\phi$ explicitly. Mercer's theorem (Module 13, Lesson 4) makes this precise.

Worked Example

Example 1: Gram-Schmidt on Two Vectors

Let $a_1 = (1, 1, 0)^\top$, $a_2 = (1, 0, 1)^\top \in \mathbb{R}^3$.

Step 1. Normalize $a_1$:
$$\tilde{q}_1 = a_1 = (1, 1, 0)^\top \qquad q_1 = \frac{(1,1,0)^\top}{\sqrt{2}} = \left(\tfrac{1}{\sqrt{2}},\, \tfrac{1}{\sqrt{2}},\, 0\right)^\top$$

Step 2. Subtract the $q_1$-component of $a_2$:
$$\langle a_2, q_1 \rangle = \tfrac{1}{\sqrt{2}} \cdot 1 + \tfrac{1}{\sqrt{2}} \cdot 0 + 0 \cdot 1 = \tfrac{1}{\sqrt{2}}$$
$$\tilde{q}_2 = a_2 - \tfrac{1}{\sqrt{2}}\,q_1 = (1,0,1)^\top - \tfrac{1}{2}(1,1,0)^\top = \left(\tfrac{1}{2},\, -\tfrac{1}{2},\, 1\right)^\top$$
$$\|\tilde{q}_2\| = \sqrt{\tfrac{1}{4} + \tfrac{1}{4} + 1} = \tfrac{\sqrt{6}}{2} \qquad q_2 = \left(\tfrac{1}{\sqrt{6}},\, -\tfrac{1}{\sqrt{6}},\, \tfrac{2}{\sqrt{6}}\right)^\top$$

Verify orthonormality:
$$\langle q_1, q_2 \rangle = \tfrac{1}{\sqrt{2}} \cdot \tfrac{1}{\sqrt{6}} + \tfrac{1}{\sqrt{2}} \cdot \left(-\tfrac{1}{\sqrt{6}}\right) + 0 = 0 \checkmark$$

The QR factorization is $A = QR$ with:
$$R = \begin{bmatrix} \sqrt{2} & \tfrac{1}{\sqrt{2}} \\ 0 & \tfrac{\sqrt{6}}{2} \end{bmatrix}$$
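The hand computation checks out numerically:

```python
import numpy as np

s2, s6 = np.sqrt(2), np.sqrt(6)
A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
# q1, q2 from the worked example as columns of Q
Q = np.array([[1/s2,  1/s6],
              [1/s2, -1/s6],
              [0.0,   2/s6]])
R = np.array([[s2,  1/s2],
              [0.0, s6/2]])

assert np.allclose(Q.T @ Q, np.eye(2))   # orthonormal columns
assert np.allclose(Q @ R, A)             # A = QR
```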

Example 2: Projection and Least Squares

Find the point in $W = \mathrm{span}\{(1,1,0)^\top,\,(0,1,1)^\top\}$ closest to $b = (1, 2, 3)^\top$.

Let $A = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 0 & 1 \end{bmatrix}$. Form the normal equations $A^\top A \hat{x} = A^\top b$:

$$A^\top A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \qquad A^\top b = \begin{bmatrix} 3 \\ 5 \end{bmatrix}$$

Solving:
$$\hat{x} = \tfrac{1}{3}\begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}\begin{bmatrix} 3 \\ 5 \end{bmatrix} = \begin{bmatrix} \tfrac{1}{3} \\ \tfrac{7}{3} \end{bmatrix}$$

$$\hat{b} = A\hat{x} = \begin{bmatrix} \tfrac{1}{3} \\ \tfrac{8}{3} \\ \tfrac{7}{3} \end{bmatrix}$$

Verify residual orthogonality: $b - \hat{b} = \left(\tfrac{2}{3},\, -\tfrac{2}{3},\, \tfrac{2}{3}\right)^\top$

$$\langle b - \hat{b},\,(1,1,0)^\top \rangle = \tfrac{2}{3} - \tfrac{2}{3} = 0 \checkmark \qquad \langle b - \hat{b},\,(0,1,1)^\top \rangle = -\tfrac{2}{3} + \tfrac{2}{3} = 0 \checkmark$$
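The same projection in NumPy, both via the normal equations and via a least-squares solver:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])

# Normal equations: A^T A x = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)
assert np.allclose(x_hat, [1/3, 7/3])

b_hat = A @ x_hat
assert np.allclose(b_hat, [1/3, 8/3, 7/3])
assert np.allclose(A.T @ (b - b_hat), 0)   # residual orthogonal to col(A)

# np.linalg.lstsq solves the same problem via a factorization,
# which is numerically preferred over forming A^T A explicitly
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.allclose(x_lstsq, x_hat)
```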

Example 3: Gram Matrix as Kernel Matrix

Let $x_1 = (1,0)^\top$, $x_2 = (0,1)^\top$, $x_3 = (1,1)^\top$. The linear kernel $k(x, y) = x^\top y$ gives:
$$G = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{bmatrix}$$

$G$ has rank 2 (three points in $\mathbb{R}^2$ span at most a 2-dimensional space). The polynomial kernel $k(x, y) = (1 + x^\top y)^2$ implicitly maps to a 6-dimensional feature space and gives:
$$G' = \begin{bmatrix} 4 & 1 & 4 \\ 1 & 4 & 4 \\ 4 & 4 & 9 \end{bmatrix}$$

Both $G$ and $G'$ are symmetric positive semidefinite — the defining property of any valid kernel matrix.

Connections

Where Your Intuition Breaks

The $\ell_2$ norm is always the right choice. In fact, the choice of norm is a modeling decision that encodes what you want to penalize. The $\ell_2$ norm penalizes large weights disproportionately: doubling a weight quadruples its squared-norm cost. The $\ell_1$ norm penalizes in proportion to magnitude, with constant marginal cost: the gradient of the penalty is the same whether a weight is 0.1 or 0.01, which is exactly what lets it push small weights all the way to zero. The $\ell_\infty$ norm penalizes only the single largest weight, making it the right choice when you care about worst-case behavior. There is no universally correct norm; the right norm is the one whose geometry matches the problem's structure.

Norms in ML: A Decision Guide

| Choice | When to use | Why |
|--------|-------------|-----|
| $\ell_2$ norm | Default for vectors, weight decay | Rotation-invariant; smooth gradient away from zero |
| $\ell_1$ norm | Sparsity in weights or activations | Corners of unit ball encourage exact zeros |
| $\ell_\infty$ norm | Adversarial robustness | Controls worst-case coordinate deviation |
| Frobenius norm | Matrix regularization, LoRA updates | Treats all entries equally; differentiable |
| Nuclear norm | Low-rank matrix recovery, collaborative filtering | $\ell_1$ on singular values promotes low rank |
| Spectral norm | GAN discriminators, Lipschitz constraints | Controls maximum gradient magnitude |

Orthogonality as an Engineering Tool

Orthogonal initialization preserves the $\ell_2$ norm of activations through the forward pass: $\|Wx\|_2 = \|x\|_2$ when $W$ is orthogonal. This prevents exploding and vanishing gradients in deep networks and is the default in several initialization schemes.

Attention as inner products. The scaled dot-product attention score $q^\top k / \sqrt{d_k}$ is an inner product with a variance-stabilizing denominator: for random $q, k \in \mathbb{R}^{d_k}$ with independent zero-mean, unit-variance entries, the raw inner product has variance $d_k$, so dividing by $\sqrt{d_k}$ restores unit variance and prevents the softmax from saturating in high-dimensional spaces.
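The variance claim is easy to verify empirically (dimensions and sample count chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n = 64, 50_000

# n independent query/key pairs with zero-mean, unit-variance entries
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))

raw = np.einsum("nd,nd->n", q, k)    # raw scores q^T k
scaled = raw / np.sqrt(d_k)          # scaled dot-product attention scores

# Raw variance grows like d_k; dividing by sqrt(d_k) restores unit variance
assert abs(raw.var() / d_k - 1.0) < 0.05
assert abs(scaled.var() - 1.0) < 0.05
```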

Neural style transfer represents image style as the Gram matrix of feature maps: $G_{ij} = \langle F_i, F_j \rangle$ measures correlation between feature channels $i$ and $j$, capturing texture without spatial information.

Common Pitfalls

Confusing $\|x\|_2$ and $\|x\|_2^2$. Weight decay adds $\frac{\lambda}{2}\|w\|_2^2$, giving gradient $\lambda w$. The squared norm is smooth everywhere; $\|w\|_2$ itself is not differentiable at $\mathbf{0}$. For $\ell_1$, $\|w\|_1$ is non-differentiable at any zero coordinate — requiring subdifferentials or proximal operators.

Applying orthonormal formulas to orthogonal (but not orthonormal) bases. If $Q$ has orthonormal columns, $P = QQ^\top$, and when $Q$ is square, $Q^{-1} = Q^\top$. If the columns are orthogonal but not unit-norm, these simplifications fail — divide each column by its norm first.

Gram matrix rank versus sample count. $\mathrm{rank}(G) = \dim(\mathrm{span}\{x_1, \ldots, x_n\})$. If $n = 10{,}000$ points live in a $d = 50$-dimensional subspace, $G$ has rank 50. Kernel methods cannot distinguish points that differ only in the null space of the feature map, regardless of how large $n$ is.

💡The unifying theme

Every result in this lesson reduces to: decompose vv into a component along a subspace and a component orthogonal to it. Gram-Schmidt builds an orthonormal basis so this decomposition is numerically clean. The projection matrix executes it in one matrix-vector product. The normal equations find the best linear fit by projecting bb onto the column space of AA. Kernel methods replace explicit inner products with a kernel function, but the Gram matrix structure — symmetric, PSD, rank equals intrinsic dimension — is identical.
