
Inner Products, Norms & Orthogonality

A vector space tells you what you can add and scale; an inner product tells you what it means for two vectors to be perpendicular and how long they are. These two concepts — angle and length — are what make least-squares, PCA, attention, and kernel methods work. The dot product $\mathbf{x}^\top \mathbf{y}$ that appears in every linear layer is an inner product; the $\ell_2$ regularization term $\lambda\|\mathbf{w}\|_2^2$ is a squared norm; the cosine similarity in embedding search is the normalized inner product. This lesson builds the full framework from axioms, proves the fundamental inequality (Cauchy-Schwarz), and shows how orthogonality makes projections the most natural computational primitive in all of applied mathematics.

Concepts

[Interactive figure: the $\ell^p$ unit ball as $p$ varies, shown here at $p = 2$ (Euclidean). The unit ball is a circle: perfectly symmetric, and the only $\ell^p$ ball invariant under rotation. The yellow vector $v = (0.6, 0.8)$ lies on the $L^2$ unit sphere with $\|v\|_2 = 1.000$; its $\ell^p$ norm changes as $p$ varies. ML use: ridge regression, weight decay, cosine similarity.]

The dot product $\mathbf{x}^\top \mathbf{y}$ that appears in every linear layer, the $\ell_2$ penalty $\lambda\|\mathbf{w}\|_2^2$ in weight decay, and the cosine similarity in embedding search are all the same idea: measuring angle and length between vectors. An inner product is the formal generalization of the familiar dot product to any vector space, and a norm is the corresponding notion of length. Everything about projection, least squares, and attention reduces to these two primitives.

Inner Product Spaces

An inner product on a real vector space $V$ is a function $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}$ satisfying:

| # | Property | Statement |
|---|----------|-----------|
| 1 | Symmetry | $\langle u, v \rangle = \langle v, u \rangle$ |
| 2 | Linearity in the first argument | $\langle \alpha u + \beta w, v \rangle = \alpha\langle u, v \rangle + \beta\langle w, v \rangle$ |
| 3 | Positive definiteness | $\langle v, v \rangle \geq 0$, with $\langle v, v \rangle = 0 \iff v = \mathbf{0}$ |

A vector space equipped with an inner product is an inner product space.

The three axioms are exactly what is needed to define angle and length consistently. Symmetry ensures $\cos(\mathbf{u},\mathbf{v}) = \cos(\mathbf{v},\mathbf{u})$; positive definiteness ensures no nonzero vector has zero length; linearity ensures that projections are well-defined and that the Gram-Schmidt process works. Any weaker set of axioms would break at least one of these geometric properties.

⚠️Warning

For complex vector spaces, symmetry becomes conjugate symmetry: $\langle u, v \rangle = \overline{\langle v, u \rangle}$, and linearity becomes sesquilinearity (linear in the first argument, conjugate-linear in the second). This matters for complex-valued neural networks and quantum ML, but throughout this module we work over $\mathbb{R}$.

Canonical examples:

  • Euclidean inner product on $\mathbb{R}^n$: $\langle x, y \rangle = x^\top y = \sum_{i=1}^n x_i y_i$ — the default in all of ML

  • Weighted inner product: $\langle x, y \rangle_W = x^\top W y$ for any positive definite matrix $W \succ 0$ — arises in Mahalanobis distance and natural gradient descent

  • Frobenius inner product on $\mathbb{R}^{m \times n}$: $\langle A, B \rangle_F = \mathrm{tr}(A^\top B) = \sum_{i,j} A_{ij} B_{ij}$ — treats matrices as flattened vectors; appears in low-rank regularization

  • $L^2$ function inner product: $\langle f, g \rangle = \int_a^b f(x)\,g(x)\,dx$ on continuous functions on $[a, b]$ — the infinite-dimensional version; appears in Fourier analysis and reproducing kernel Hilbert spaces (Module 13)
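The three finite-dimensional examples are easy to check numerically. A quick sketch with NumPy (array shapes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(4), rng.standard_normal(4)

# Euclidean inner product: <x, y> = x^T y = sum_i x_i y_i
assert np.isclose(x @ y, sum(xi * yi for xi, yi in zip(x, y)))

# Weighted inner product <x, y>_W = x^T W y with W positive definite
M = rng.standard_normal((4, 4))
W = M @ M.T + 4 * np.eye(4)          # construct a PD matrix
assert np.isclose(x @ W @ y, y @ W @ x)   # symmetric because W = W^T
assert x @ W @ x > 0                      # positive on nonzero x

# Frobenius inner product: tr(A^T B) equals the entrywise sum
A, B = rng.standard_normal((3, 2)), rng.standard_normal((3, 2))
assert np.isclose(np.trace(A.T @ B), (A * B).sum())
```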

The Cauchy-Schwarz Inequality

Theorem. For any inner product space and any $u, v \in V$:
$$|\langle u, v \rangle| \leq \|u\| \cdot \|v\|$$
with equality if and only if $u$ and $v$ are linearly dependent.

Proof. If $v = \mathbf{0}$, both sides equal zero. Otherwise, for any $t \in \mathbb{R}$:
$$0 \leq \langle u - tv,\, u - tv \rangle = \|u\|^2 - 2t\langle u, v \rangle + t^2 \|v\|^2$$
This quadratic in $t$ is non-negative everywhere, so its discriminant is non-positive:
$$4\langle u, v \rangle^2 - 4\|u\|^2\|v\|^2 \leq 0$$
Taking square roots gives $|\langle u, v \rangle| \leq \|u\|\|v\|$. Equality holds iff the quadratic touches zero, i.e. $u = tv$. $\square$

Geometric consequence. Cauchy-Schwarz guarantees that the ratio $\frac{\langle u, v \rangle}{\|u\|\|v\|}$ always lies in $[-1, 1]$, so the angle between $u$ and $v$ is well-defined:
$$\cos\theta_{u,v} = \frac{\langle u, v \rangle}{\|u\|\|v\|}$$

Cosine similarity in embedding retrieval is exactly this — angle measured as inner product after normalization to unit norm.
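Cauchy-Schwarz is what makes cosine similarity well-defined in code: the ratio can never leave $[-1, 1]$. A minimal sketch (helper function names are ours):

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def cosine_similarity(u, v):
    """Inner product after normalization to unit norm."""
    return dot(u, v) / (norm(u) * norm(v))

random.seed(0)
for _ in range(1000):
    u = [random.gauss(0, 1) for _ in range(8)]
    v = [random.gauss(0, 1) for _ in range(8)]
    # Cauchy-Schwarz: |<u, v>| <= ||u|| ||v||, so cosine lies in [-1, 1]
    assert abs(dot(u, v)) <= norm(u) * norm(v) + 1e-12
    assert -1 - 1e-12 <= cosine_similarity(u, v) <= 1 + 1e-12

# Equality case: v = 3u is linearly dependent on u, so cosine is exactly 1
u = [0.6, 0.8]
assert math.isclose(cosine_similarity(u, [3 * x for x in u]), 1.0)
```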

Norms

An inner product induces a norm via $\|v\| = \sqrt{\langle v, v \rangle}$. More generally, a norm on $V$ is any function $\|\cdot\| : V \to \mathbb{R}_{\geq 0}$ satisfying:

  1. Positive definiteness: $\|v\| \geq 0$, with $\|v\| = 0 \iff v = \mathbf{0}$
  2. Absolute homogeneity: $\|\alpha v\| = |\alpha|\,\|v\|$
  3. Triangle inequality: $\|u + v\| \leq \|u\| + \|v\|$

Not every norm comes from an inner product (the $\ell_1$ and $\ell_\infty$ norms do not), but every inner product norm satisfies an additional identity:

$$\text{Parallelogram law:} \quad \|u + v\|^2 + \|u - v\|^2 = 2\|u\|^2 + 2\|v\|^2$$

A norm satisfies the parallelogram law if and only if it arises from an inner product (the Jordan-von Neumann theorem).
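The parallelogram law therefore gives a concrete test for whether a norm can come from an inner product. A quick check that $\ell_2$ passes and $\ell_1$ fails (helper names are ours):

```python
import math

def norm_p(v, p):
    return sum(abs(x) ** p for x in v) ** (1 / p)

def parallelogram_gap(u, v, p):
    """LHS minus RHS of the parallelogram law under the l^p norm."""
    s = [a + b for a, b in zip(u, v)]
    d = [a - b for a, b in zip(u, v)]
    lhs = norm_p(s, p) ** 2 + norm_p(d, p) ** 2
    rhs = 2 * norm_p(u, p) ** 2 + 2 * norm_p(v, p) ** 2
    return lhs - rhs

u, v = [1.0, 0.0], [0.0, 1.0]
# l2: the law holds (gap is zero), consistent with the Euclidean inner product
assert math.isclose(parallelogram_gap(u, v, 2), 0.0, abs_tol=1e-12)
# l1: lhs = 2^2 + 2^2 = 8 but rhs = 4, so the law fails -> no inner product
assert abs(parallelogram_gap(u, v, 1)) > 0.5
```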

$\ell^p$ norms on $\mathbb{R}^n$ (for $p \geq 1$):
$$\|x\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}$$

The limit $p \to \infty$ gives $\|x\|_\infty = \max_i |x_i|$. The geometric objects $\{x : \|x\|_p \leq 1\}$ — the unit balls illustrated in the diagram at the top of this lesson — reveal the character of each norm:

💡Intuition

The shape of the unit ball explains sparsity. The $\ell_1$ ball has corners exactly on the coordinate axes. When you minimize a loss subject to $\|w\|_1 \leq c$, the constrained optimum tends to land at a corner — meaning most coordinates of $w$ are exactly zero. The round $\ell_2$ ball has no corners, so the optimum lands on the smooth boundary with all coordinates nonzero. This geometric accident is why Lasso produces sparse models and ridge regression does not.
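The vector $v = (0.6, 0.8)$ from the diagram makes the trend concrete: for a fixed vector, $\|v\|_p$ shrinks as $p$ grows. A small sketch:

```python
def lp_norm(v, p):
    """l^p norm (sum |x_i|^p)^(1/p) for p >= 1."""
    return sum(abs(x) ** p for x in v) ** (1 / p)

v = (0.6, 0.8)
# l2 norm is exactly 1: v lies on the Euclidean unit circle
assert abs(lp_norm(v, 2) - 1.0) < 1e-12
# l1 is larger (1.4), while higher p approaches max|x_i| = 0.8 from above
assert abs(lp_norm(v, 1) - 1.4) < 1e-12
assert 0.8 < lp_norm(v, 4) < 1.0
assert max(abs(x) for x in v) == 0.8   # the p -> infinity limit
```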

Matrix norms used throughout ML:

| Norm | Formula | Interpretation | ML use |
|------|---------|----------------|--------|
| Frobenius $\lVert A\rVert_F$ | $\sqrt{\mathrm{tr}(A^\top A)}$ | Euclidean norm of entries | Weight decay, LoRA regularization |
| Spectral $\lVert A\rVert_2$ | $\sigma_1(A)$ (largest singular value) | Maximum stretching factor | Lipschitz bounds, spectral normalization |
| Nuclear $\lVert A\rVert_*$ | $\sum_i \sigma_i(A)$ | $\ell_1$ of singular values | Promotes low-rank structure; matrix completion |
| $\ell_{2,1}$ norm $\lVert A\rVert_{2,1}$ | $\sum_j \lVert A_{:,j}\rVert_2$ | Sum of column norms | Group sparsity; structured pruning |
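All four norms in the table are one-liners in NumPy; a sketch verifying their singular-value characterizations on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
sigma = np.linalg.svd(A, compute_uv=False)   # singular values, descending

fro      = np.linalg.norm(A, "fro")          # sqrt(tr(A^T A))
spectral = np.linalg.norm(A, 2)              # sigma_1(A)
nuclear  = np.linalg.norm(A, "nuc")          # sum of singular values
l21      = np.linalg.norm(A, axis=0).sum()   # sum of column l2 norms

assert np.isclose(fro, np.sqrt(np.trace(A.T @ A)))
assert np.isclose(fro, np.sqrt((sigma ** 2).sum()))  # Frobenius = l2 of sigma
assert np.isclose(spectral, sigma[0])
assert np.isclose(nuclear, sigma.sum())
# Standard ordering: spectral <= Frobenius <= nuclear
assert spectral <= fro <= nuclear
```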

Norm equivalence. On any finite-dimensional space, all norms are equivalent: for any two norms $\|\cdot\|_\alpha$, $\|\cdot\|_\beta$ on $\mathbb{R}^n$, there exist $c_1, c_2 > 0$ such that
$$c_1 \|x\|_\beta \leq \|x\|_\alpha \leq c_2 \|x\|_\beta \quad \forall x \in \mathbb{R}^n$$

Concretely:
$$\|x\|_2 \leq \|x\|_1 \leq \sqrt{n}\,\|x\|_2 \qquad \|x\|_\infty \leq \|x\|_2 \leq \sqrt{n}\,\|x\|_\infty$$

Norm equivalence means convergence in one norm implies convergence in every norm — so the choice of norm is a statistical or computational preference, not a topological one.
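The concrete bounds above are easy to spot-check by random sampling:

```python
import math
import random

def l1(v):   return sum(abs(x) for x in v)
def l2(v):   return math.sqrt(sum(x * x for x in v))
def linf(v): return max(abs(x) for x in v)

random.seed(0)
n = 16
root_n = math.sqrt(n)
for _ in range(1000):
    x = [random.uniform(-5, 5) for _ in range(n)]
    # ||x||_2 <= ||x||_1 <= sqrt(n) ||x||_2
    assert l2(x) <= l1(x) <= root_n * l2(x) + 1e-9
    # ||x||_inf <= ||x||_2 <= sqrt(n) ||x||_inf
    assert linf(x) <= l2(x) <= root_n * linf(x) + 1e-9
```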

Orthogonality

Definition. Vectors $u, v \in V$ are orthogonal if $\langle u, v \rangle = 0$, written $u \perp v$. A set $\{v_1, \ldots, v_k\}$ is:

  • Orthogonal if $\langle v_i, v_j \rangle = 0$ for all $i \neq j$
  • Orthonormal if additionally $\|v_i\| = 1$ for all $i$

Theorem. Every orthogonal set of nonzero vectors is linearly independent.

Proof. Suppose $\sum_i \alpha_i v_i = \mathbf{0}$. Take the inner product with $v_j$:
$$0 = \left\langle \sum_i \alpha_i v_i,\, v_j \right\rangle = \sum_i \alpha_i \langle v_i, v_j \rangle = \alpha_j \|v_j\|^2$$
Since $\|v_j\|^2 > 0$, we get $\alpha_j = 0$ for all $j$. $\square$

Pythagorean Theorem. If $u \perp v$, then $\|u + v\|^2 = \|u\|^2 + \|v\|^2$.

Proof. $\|u + v\|^2 = \langle u+v,\, u+v \rangle = \|u\|^2 + 2\langle u, v \rangle + \|v\|^2 = \|u\|^2 + \|v\|^2$. $\square$

Orthogonal complement. For any subspace $W \subseteq V$:
$$W^\perp = \{v \in V : \langle v, w \rangle = 0 \text{ for all } w \in W\}$$

$W^\perp$ is itself a subspace, and $V = W \oplus W^\perp$ — every vector $v$ decomposes uniquely as $v = \hat{v} + (v - \hat{v})$ with $\hat{v} \in W$ and $(v - \hat{v}) \in W^\perp$. This is the orthogonal direct sum decomposition.

Gram-Schmidt Orthogonalization and QR

Given linearly independent vectors $\{a_1, \ldots, a_k\} \subset V$, the Gram-Schmidt process constructs an orthonormal basis $\{q_1, \ldots, q_k\}$ for the same span:

$$\tilde{q}_j = a_j - \sum_{i=1}^{j-1} \langle a_j, q_i \rangle\, q_i \qquad q_j = \frac{\tilde{q}_j}{\|\tilde{q}_j\|}$$

Each step subtracts the component of $a_j$ already explained by the previous $q_i$'s.

QR decomposition. Applying Gram-Schmidt to the columns $a_1, \ldots, a_n$ of $A \in \mathbb{R}^{m \times n}$ (with $m \geq n$ and linearly independent columns) yields $A = QR$ where:

  • $Q \in \mathbb{R}^{m \times n}$ has orthonormal columns: $Q^\top Q = I_n$
  • $R \in \mathbb{R}^{n \times n}$ is upper triangular with positive diagonal: $R_{jj} = \|\tilde{q}_j\|$

The entry $R_{ij} = \langle a_j, q_i \rangle$ records how much of $a_j$ was projected onto $q_i$. QR is the numerical backbone of least-squares solvers and is more numerically stable than forming the normal equations $A^\top A$.
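The update rule translates directly into code. A minimal classical Gram-Schmidt sketch (production solvers prefer modified Gram-Schmidt or Householder reflections for numerical stability):

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt QR of A (m x n, independent columns).
    Returns Q with orthonormal columns and upper-triangular R with A = QR."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        q_tilde = A[:, j].copy()
        for i in range(j):
            R[i, j] = A[:, j] @ Q[:, i]      # R_ij = <a_j, q_i>
            q_tilde -= R[i, j] * Q[:, i]     # subtract the explained component
        R[j, j] = np.linalg.norm(q_tilde)    # R_jj = ||q_tilde_j||
        Q[:, j] = q_tilde / R[j, j]          # normalize
    return Q, R

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
Q, R = gram_schmidt_qr(A)
assert np.allclose(Q.T @ Q, np.eye(4))                       # orthonormal columns
assert np.allclose(A, Q @ R)                                 # reconstruction
assert np.allclose(R, np.triu(R)) and np.all(np.diag(R) > 0) # upper triangular
```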

Orthogonal Projections and the Best Approximation Theorem

Theorem (Best Approximation). For a subspace $W \subseteq V$ and any $v \in V$, there exists a unique $\hat{v} \in W$ minimizing $\|v - w\|$ over all $w \in W$. This minimizer satisfies:
$$v - \hat{v} \perp W \quad \text{(i.e., } \langle v - \hat{v}, w \rangle = 0 \text{ for all } w \in W\text{)}$$

Proof. For any $w \in W$, write $v - w = (v - \hat{v}) + (\hat{v} - w)$. Since $v - \hat{v} \perp W$ and $\hat{v} - w \in W$:
$$\|v - w\|^2 = \|v - \hat{v}\|^2 + \|\hat{v} - w\|^2 \geq \|v - \hat{v}\|^2$$
with equality iff $w = \hat{v}$. $\square$

Orthogonal Projection (interactive: drag the vector tip, rotate the subspace $W$)

[Interactive figure. Example state at angle $= 30°$: $v = (0.50, 0.90)$, $P(v) = (0.76, 0.44)$, $v - P(v) = (-0.26, 0.46)$, with $\|v\| = 1.030$, $\|P(v)\| = 0.883$, $\|v - P(v)\| = 0.529$, $\cos\theta = 0.858$, $\theta = 30.9°$.]

The residual $v - P(v)$ is always perpendicular to $W$ (right-angle box). This is the Best Approximation Theorem: $P(v)$ is the closest point in $W$ to $v$.

💡Intuition

Drag the vector $v$ in the diagram above. The green vector $P(v)$ is always the foot of the perpendicular from $v$ to the subspace $W$ — the right-angle box confirms orthogonality. The orange dashed line $v - P(v)$ is the residual, and its length is the approximation error. Best Approximation says: no other point in $W$ is closer to $v$.

Projection matrix. For $W = \mathrm{col}(A)$ with $A \in \mathbb{R}^{m \times n}$ having linearly independent columns:
$$P_W = A(A^\top A)^{-1}A^\top$$

If $A = Q$ already has orthonormal columns, this simplifies to $P_W = QQ^\top$.

Characterization. A matrix $P$ is an orthogonal projection onto its column space if and only if it is:

  • Idempotent: $P^2 = P$ (projecting twice is the same as once)
  • Symmetric: $P^\top = P$
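Both characterizations, together with the residual orthogonality from the Best Approximation Theorem, can be checked in a few lines (a random subspace chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))           # W = col(A), a 2D subspace of R^5
P = A @ np.linalg.inv(A.T @ A) @ A.T      # P_W = A (A^T A)^{-1} A^T

assert np.allclose(P @ P, P)              # idempotent
assert np.allclose(P, P.T)                # symmetric

v = rng.standard_normal(5)
v_hat = P @ v                             # closest point in W to v
residual = v - v_hat
assert np.allclose(A.T @ residual, 0)     # residual is orthogonal to W
# Pythagoras: ||v||^2 = ||v_hat||^2 + ||residual||^2
assert np.isclose(v @ v, v_hat @ v_hat + residual @ residual)

# With an orthonormal basis Q for W, the formula simplifies to Q Q^T
Q, _ = np.linalg.qr(A)
assert np.allclose(P, Q @ Q.T)
```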

Gram Matrix and Kernels

Given vectors $x_1, \ldots, x_n \in V$, the Gram matrix $G \in \mathbb{R}^{n \times n}$ has entries:
$$G_{ij} = \langle x_i, x_j \rangle$$

Properties: $G$ is always symmetric positive semidefinite. Its rank equals $\dim(\mathrm{span}\{x_1, \ldots, x_n\})$.

Replacing $\langle x_i, x_j \rangle$ by $k(x_i, x_j)$ for a positive definite kernel $k$ gives a kernel matrix — the Gram matrix of an implicit feature map $\phi$ satisfying $k(x, y) = \langle \phi(x), \phi(y) \rangle$. This is why kernel methods can operate in infinite-dimensional feature spaces without ever computing $\phi$ explicitly. Mercer's theorem (Module 13, Lesson 4) makes this precise.

Worked Example

Example 1: Gram-Schmidt on Two Vectors

Let $a_1 = (1, 1, 0)^\top$, $a_2 = (1, 0, 1)^\top \in \mathbb{R}^3$.

Step 1. Normalize $a_1$:
$$\tilde{q}_1 = a_1 = (1, 1, 0)^\top \qquad q_1 = \frac{(1,1,0)^\top}{\sqrt{2}} = \left(\tfrac{1}{\sqrt{2}},\, \tfrac{1}{\sqrt{2}},\, 0\right)^\top$$

Step 2. Subtract the $q_1$-component of $a_2$:
$$\langle a_2, q_1 \rangle = \tfrac{1}{\sqrt{2}} \cdot 1 + \tfrac{1}{\sqrt{2}} \cdot 0 + 0 \cdot 1 = \tfrac{1}{\sqrt{2}}$$
$$\tilde{q}_2 = a_2 - \tfrac{1}{\sqrt{2}}\,q_1 = (1,0,1)^\top - \tfrac{1}{2}(1,1,0)^\top = \left(\tfrac{1}{2},\, -\tfrac{1}{2},\, 1\right)^\top$$
$$\|\tilde{q}_2\| = \sqrt{\tfrac{1}{4} + \tfrac{1}{4} + 1} = \tfrac{\sqrt{6}}{2} \qquad q_2 = \left(\tfrac{1}{\sqrt{6}},\, -\tfrac{1}{\sqrt{6}},\, \tfrac{2}{\sqrt{6}}\right)^\top$$

Verify orthonormality:
$$\langle q_1, q_2 \rangle = \tfrac{1}{\sqrt{2}} \cdot \tfrac{1}{\sqrt{6}} + \tfrac{1}{\sqrt{2}} \cdot \left(-\tfrac{1}{\sqrt{6}}\right) + 0 = 0 \checkmark$$

The QR factorization is $A = QR$ with:
$$R = \begin{bmatrix} \sqrt{2} & \tfrac{1}{\sqrt{2}} \\ 0 & \tfrac{\sqrt{6}}{2} \end{bmatrix}$$
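The hand computation checks out numerically:

```python
import numpy as np

s2, s6 = np.sqrt(2), np.sqrt(6)
A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
# q1, q2 from the worked example as columns of Q
Q = np.array([[1/s2,  1/s6],
              [1/s2, -1/s6],
              [0.0,   2/s6]])
R = np.array([[s2,  1/s2],
              [0.0, s6/2]])

assert np.allclose(Q.T @ Q, np.eye(2))   # orthonormal columns
assert np.allclose(Q @ R, A)             # A = QR
```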

Example 2: Projection and Least Squares

Find the point in $W = \mathrm{span}\{(1,1,0)^\top,\,(0,1,1)^\top\}$ closest to $b = (1, 2, 3)^\top$.

Let $A = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 0 & 1 \end{bmatrix}$. Form the normal equations $A^\top A \hat{x} = A^\top b$:

$$A^\top A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \qquad A^\top b = \begin{bmatrix} 3 \\ 5 \end{bmatrix}$$

Solving:
$$\hat{x} = \tfrac{1}{3}\begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix}\begin{bmatrix} 3 \\ 5 \end{bmatrix} = \begin{bmatrix} \tfrac{1}{3} \\ \tfrac{7}{3} \end{bmatrix}$$

$$\hat{b} = A\hat{x} = \begin{bmatrix} \tfrac{1}{3} \\ \tfrac{8}{3} \\ \tfrac{7}{3} \end{bmatrix}$$

Verify residual orthogonality: $b - \hat{b} = \left(\tfrac{2}{3},\, -\tfrac{2}{3},\, \tfrac{2}{3}\right)^\top$

$$\langle b - \hat{b},\,(1,1,0)^\top \rangle = \tfrac{2}{3} - \tfrac{2}{3} = 0 \checkmark \qquad \langle b - \hat{b},\,(0,1,1)^\top \rangle = -\tfrac{2}{3} + \tfrac{2}{3} = 0 \checkmark$$
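The same projection in NumPy, both via the normal equations and via a least-squares solver:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])

# Normal equations: A^T A x = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)
assert np.allclose(x_hat, [1/3, 7/3])

b_hat = A @ x_hat
assert np.allclose(b_hat, [1/3, 8/3, 7/3])
assert np.allclose(A.T @ (b - b_hat), 0)   # residual orthogonal to col(A)

# np.linalg.lstsq solves the same problem via a factorization,
# which is numerically preferred over forming A^T A explicitly
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
assert np.allclose(x_lstsq, x_hat)
```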

Example 3: Gram Matrix as Kernel Matrix

Let $x_1 = (1,0)^\top$, $x_2 = (0,1)^\top$, $x_3 = (1,1)^\top$. The linear kernel $k(x, y) = x^\top y$ gives:
$$G = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{bmatrix}$$

$G$ has rank 2 (three points in $\mathbb{R}^2$ span at most a 2-dimensional space). The polynomial kernel $k(x, y) = (1 + x^\top y)^2$ implicitly maps to a 6-dimensional feature space and gives:
$$G' = \begin{bmatrix} 4 & 1 & 4 \\ 1 & 4 & 4 \\ 4 & 4 & 9 \end{bmatrix}$$

Both $G$ and $G'$ are symmetric positive semidefinite — the defining property of any valid kernel matrix.

Connections

Where Your Intuition Breaks

The $\ell_2$ norm is always the right choice. In fact, the choice of norm is a modeling decision that encodes what you want to penalize. The $\ell_2$ norm penalizes large weights disproportionately: doubling a weight quadruples its squared-norm cost. The $\ell_1$ norm penalizes in proportion to magnitude, with constant marginal cost: the gradient of the penalty is the same whether a weight is 0.1 or 0.01, which is exactly what lets it push small weights all the way to zero. The $\ell_\infty$ norm penalizes only the single largest weight, making it the right choice when you care about worst-case behavior. There is no universally correct norm; the right norm is the one whose geometry matches the problem's structure.

Norms in ML: A Decision Guide

| Choice | When to use | Why |
|--------|-------------|-----|
| $\ell_2$ norm | Default for vectors, weight decay | Rotation-invariant; smooth gradient away from zero |
| $\ell_1$ norm | Sparsity in weights or activations | Corners of unit ball encourage exact zeros |
| $\ell_\infty$ norm | Adversarial robustness | Controls worst-case coordinate deviation |
| Frobenius norm | Matrix regularization, LoRA updates | Treats all entries equally; differentiable |
| Nuclear norm | Low-rank matrix recovery, collaborative filtering | $\ell_1$ on singular values promotes low rank |
| Spectral norm | GAN discriminators, Lipschitz constraints | Controls maximum gradient magnitude |

Orthogonality as an Engineering Tool

Orthogonal initialization preserves the $\ell_2$ norm of activations through the forward pass: $\|Wx\|_2 = \|x\|_2$ when $W$ is orthogonal. This prevents exploding and vanishing gradients in deep networks and is the default in several initialization schemes.

Attention as inner products. The scaled dot-product attention score $q^\top k / \sqrt{d_k}$ is an inner product with a variance-stabilizing denominator: for random $q, k \in \mathbb{R}^{d_k}$ with independent zero-mean, unit-variance entries, the raw inner product has variance $d_k$, so dividing by $\sqrt{d_k}$ restores unit variance and prevents the softmax from saturating in high-dimensional spaces.
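The variance claim is easy to verify empirically (dimensions and sample count chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n = 64, 50_000

# n independent query/key pairs with zero-mean, unit-variance entries
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))

raw = np.einsum("nd,nd->n", q, k)    # raw scores q^T k
scaled = raw / np.sqrt(d_k)          # scaled dot-product attention scores

# Raw variance grows like d_k; dividing by sqrt(d_k) restores unit variance
assert abs(raw.var() / d_k - 1.0) < 0.05
assert abs(scaled.var() - 1.0) < 0.05
```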

Neural style transfer represents image style as the Gram matrix of feature maps: $G_{ij} = \langle F_i, F_j \rangle$ measures correlation between feature channels $i$ and $j$, capturing texture without spatial information.

Common Pitfalls

Confusing $\|x\|_2$ and $\|x\|_2^2$. Weight decay adds $\frac{\lambda}{2}\|w\|_2^2$, giving gradient $\lambda w$. The squared norm is smooth everywhere; $\|w\|_2$ itself is not differentiable at $\mathbf{0}$. For $\ell_1$, $\|w\|_1$ is non-differentiable at any zero coordinate — requiring subdifferentials or proximal operators.

Applying orthonormal formulas to orthogonal (but not orthonormal) bases. If $Q$ has orthonormal columns, $P = QQ^\top$, and when $Q$ is square, $Q^{-1} = Q^\top$. If the columns are orthogonal but not unit-norm, these simplifications fail — divide each column by its norm first.

Gram matrix rank versus sample count. $\mathrm{rank}(G) = \dim(\mathrm{span}\{x_1, \ldots, x_n\})$. If $n = 10{,}000$ points live in a $d = 50$-dimensional subspace, $G$ has rank 50. Kernel methods cannot distinguish points that differ only in the null space of the feature map, regardless of how large $n$ is.

💡The unifying theme

Every result in this lesson reduces to: decompose vv into a component along a subspace and a component orthogonal to it. Gram-Schmidt builds an orthonormal basis so this decomposition is numerically clean. The projection matrix executes it in one matrix-vector product. The normal equations find the best linear fit by projecting bb onto the column space of AA. Kernel methods replace explicit inner products with a kernel function, but the Gram matrix structure — symmetric, PSD, rank equals intrinsic dimension — is identical.
