Neural-Path/Notes

Vector Spaces & Linear Maps

Every machine learning model operates on vectors. When we write $\mathbf{w}^\top \mathbf{x}$, we are implicitly using the axioms of a vector space — the rules that make addition and scalar multiplication behave predictably. The gradient update $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla \mathcal{L}$ stays in the same space as $\mathbf{w}$ because that space is closed under exactly these operations. Understanding the axiomatic foundation makes every downstream theorem in linear algebra inevitable rather than arbitrary: spectral decompositions, least-squares solutions, and backpropagation all follow from the same eight rules you will see in this lesson.

Concepts

[Interactive figure: Linear Map Visualization — the basis vectors e₁ and e₂ of the input space ℝ² are mapped by $A = \begin{bmatrix} 0.71 & -0.71 \\ 0.71 & 0.71 \end{bmatrix}$ to Ae₁ and Ae₂ in the output space ℝ². The map is an isometry: the grid rotates while lengths and angles are preserved.]

When you multiply a NumPy array by a scalar or add two arrays elementwise, you're already using vector space operations — the rules that make these behave predictably. The eight axioms below are not arbitrary: they are the minimal conditions that make span, basis, and dimension well-defined. Any set of objects that respects these rules — polynomials, matrices, functions — is a vector space and inherits the full machinery of linear algebra.
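These identities are exactly what NumPy's elementwise arithmetic obeys. A minimal sketch (the values are illustrative, not part of the lesson):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([-4.0, 0.5, 2.0])
alpha, beta = 2.0, -3.0

assert np.allclose(u + v, v + u)                              # Axiom 1: commutativity
assert np.allclose(v + (-v), np.zeros(3))                     # Axiom 4: additive inverse
assert np.allclose(alpha * (u + v), alpha * u + alpha * v)    # Axiom 7: distributivity I
assert np.allclose((alpha + beta) * v, alpha * v + beta * v)  # Axiom 8: distributivity II
```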

Fields

A field $\mathbb{F}$ is a set equipped with addition and multiplication satisfying commutativity, associativity, distributivity, and the existence of additive and multiplicative identities and inverses (with the exception that zero has no multiplicative inverse). The two fields relevant throughout this module are the real numbers $\mathbb{R}$ and the complex numbers $\mathbb{C}$. Unless otherwise stated, assume $\mathbb{F} = \mathbb{R}$.

Vector Space Axioms

A vector space over a field $\mathbb{F}$ is a set $V$ equipped with two operations:

  • Addition: $V \times V \to V$, written $(u, v) \mapsto u + v$
  • Scalar multiplication: $\mathbb{F} \times V \to V$, written $(\alpha, v) \mapsto \alpha v$

satisfying all eight axioms:

| # | Axiom | Statement |
|---|-------|-----------|
| 1 | Commutativity | $u + v = v + u$ |
| 2 | Associativity (addition) | $(u + v) + w = u + (v + w)$ |
| 3 | Additive identity | $\exists\, \mathbf{0} \in V$ such that $v + \mathbf{0} = v$ |
| 4 | Additive inverse | $\forall\, v \in V$, $\exists\, (-v)$ such that $v + (-v) = \mathbf{0}$ |
| 5 | Multiplicative identity | $1 \cdot v = v$ |
| 6 | Associativity (scalar) | $(\alpha \beta)v = \alpha(\beta v)$ |
| 7 | Distributivity I | $\alpha(u + v) = \alpha u + \alpha v$ |
| 8 | Distributivity II | $(\alpha + \beta)v = \alpha v + \beta v$ |

Elements of $V$ are called vectors; elements of $\mathbb{F}$ are scalars.

The eight axioms are the minimum needed to guarantee that span, basis, and dimension are well-defined. Without distributivity (Axioms 7–8) you could not decompose a vector into basis components; without the additive inverse (Axiom 4) you could not subtract. Each axiom closes exactly one loophole that would otherwise break one of the geometric concepts linear algebra is built on.

⚠️Warning

The word "vector" does not mean "arrow in 2D or 3D space." It means any element of any set that satisfies these eight axioms. Polynomials, matrices, and functions are all vectors in the appropriate spaces.

Canonical examples:

  • $\mathbb{R}^n$: tuples of $n$ real numbers with coordinatewise addition and scalar multiplication — the workhorse of ML
  • $\mathbb{R}^{m \times n}$: $m \times n$ real matrices with elementwise addition and scalar multiplication; $\dim = mn$
  • $\mathbb{P}_n$: polynomials of degree at most $n$ with coefficients in $\mathbb{R}$; $\dim = n+1$
  • The zero space $\{0\}$: the trivial vector space with a single element; $\dim = 0$

Non-examples: $\mathbb{Z}^n$ with real scalar multiplication is not a vector space because scalar multiplication is not closed (e.g. $\frac{1}{2} \cdot 1 \notin \mathbb{Z}$), so the operation $\mathbb{F} \times V \to V$ is not even well-defined. The unit sphere $S^{n-1} \subset \mathbb{R}^n$ is not a vector space because it is not closed under addition.

Subspaces

A nonempty subset $W \subseteq V$ is a subspace of $V$ if it is itself a vector space under the inherited operations. The subspace test reduces this to three conditions:

$$W \text{ is a subspace} \iff \begin{cases} \mathbf{0} \in W \\ u, v \in W \Rightarrow u + v \in W \\ \alpha \in \mathbb{F},\, v \in W \Rightarrow \alpha v \in W \end{cases}$$

Conditions 2 and 3 together mean $W$ is closed under linear combinations. They imply Condition 1 for any nonempty $W$ (take $\alpha = 0$ in Condition 3).

Key examples that appear throughout ML:

$$\text{col}(A) = \{Ax : x \in \mathbb{R}^n\} \subseteq \mathbb{R}^m \qquad \text{(column space of } A \in \mathbb{R}^{m \times n}\text{)}$$

$$\text{null}(A) = \{x \in \mathbb{R}^n : Ax = \mathbf{0}\} \subseteq \mathbb{R}^n \qquad \text{(null space of } A\text{)}$$

Both satisfy the subspace test (verify: $A(x_1 + x_2) = Ax_1 + Ax_2 = \mathbf{0}$ for null space closure under addition).
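The closure argument can also be checked numerically. A small sketch (the matrix is illustrative):

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])          # col(A) is a 2D subspace of R^3

# Two reachable outputs...
y1 = A @ np.array([1., 0.])
y2 = A @ np.array([0., 1.])

# ...whose sum is itself reachable: A x1 + A x2 = A (x1 + x2)
assert np.allclose(y1 + y2, A @ np.array([1., 1.]))
```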

Span, Linear Independence, and Bases

Span: The span of a set $S = \{v_1, \ldots, v_k\} \subset V$ is the set of all finite linear combinations:

$$\text{span}(S) = \left\{ \sum_{i=1}^k \alpha_i v_i : \alpha_i \in \mathbb{F} \right\}$$

$\text{span}(S)$ is always a subspace of $V$ — the smallest subspace containing $S$.

Linear independence: Vectors $v_1, \ldots, v_k$ are linearly independent if the only solution to $\sum_{i=1}^k \alpha_i v_i = \mathbf{0}$ is $\alpha_1 = \cdots = \alpha_k = 0$. Equivalently, no $v_i$ lies in the span of the others.

Basis: A set $\mathcal{B} = \{b_1, \ldots, b_n\}$ is a basis for $V$ if it is linearly independent and $\text{span}(\mathcal{B}) = V$. Equivalently, every $v \in V$ has a unique representation $v = \sum_{i=1}^n \alpha_i b_i$.

Theorem (dimension is well-defined, proved via the Steinitz exchange lemma): Any two bases of a finite-dimensional vector space have the same cardinality. This common cardinality is the dimension of $V$, written $\dim(V)$.

$$\dim(\mathbb{R}^n) = n \qquad \dim(\mathbb{R}^{m \times n}) = mn \qquad \dim(\mathbb{P}_n) = n+1$$
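Linear independence is mechanical to check in code: stack the candidate vectors as columns and compare the matrix rank to the number of vectors. A sketch (the example vectors are my own):

```python
import numpy as np

# Independent: full column rank equals the number of vectors
B = np.column_stack([[1., 0., 1.], [0., 1., 1.], [1., 1., 0.]])
assert np.linalg.matrix_rank(B) == 3   # a basis for R^3

# Dependent: third vector = first + second, so the rank drops
C = np.column_stack([[1., 0., 1.], [0., 1., 1.], [1., 1., 2.]])
assert np.linalg.matrix_rank(C) == 2
```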

Linear Maps

A function $T: V \to W$ between vector spaces over the same field is linear (a linear map or linear transformation) if for all $u, v \in V$ and $\alpha, \beta \in \mathbb{F}$:

$$T(\alpha u + \beta v) = \alpha T(u) + \beta T(v)$$

Linearity has two immediate consequences: $T(\mathbf{0}_V) = \mathbf{0}_W$, and $T$ is completely determined by its values on any basis.

Canonical examples:

  • $T(\mathbf{x}) = A\mathbf{x}$ for $A \in \mathbb{R}^{m \times n}$: matrix multiplication is the canonical linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$
  • Differentiation $D: \mathbb{P}_n \to \mathbb{P}_{n-1}$, $D(p) = p'$: linear because $(p+q)' = p' + q'$ and $(\alpha p)' = \alpha p'$
  • Orthogonal projection onto a subspace $W$: linear by the linearity of the projection formula

Note on neural networks: A neural network layer $\mathbf{x} \mapsto A\mathbf{x} + \mathbf{b}$ with $\mathbf{b} \neq \mathbf{0}$ is affine, not linear (it fails $T(\mathbf{0}) = \mathbf{0}$). The distinction matters when analyzing expressivity and composability.
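Both failures of linearity are easy to observe numerically. A sketch with random weights (shapes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
b = rng.standard_normal(3)
layer = lambda x: W @ x + b        # affine, not linear, since b != 0

u, v = rng.standard_normal(4), rng.standard_normal(4)

# T(0) = 0 fails: the zero input maps to b, not to the zero vector
assert not np.allclose(layer(np.zeros(4)), 0)
# Additivity fails: layer(u + v) and layer(u) + layer(v) differ by exactly b
assert not np.allclose(layer(u + v), layer(u) + layer(v))
assert np.allclose(layer(u + v) + b, layer(u) + layer(v))
```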

Kernel and image: For a linear map $T: V \to W$:

$$\ker(T) = \{v \in V : T(v) = \mathbf{0}_W\} \subseteq V \qquad \text{im}(T) = \{T(v) : v \in V\} \subseteq W$$

Both are subspaces (verify by the subspace test). The kernel measures how much information $T$ collapses; the image measures what outputs $T$ can produce.

💡Projection as a linear map

Imagine the transformation that takes any 2D vector and collapses it onto the x-axis: $(x, y) \mapsto (x, 0)$. The kernel of this map is the entire y-axis — all vectors $(0, y)$ map to zero. The image is just the x-axis. This is a rank-1 linear map: it destroys one dimension of information irreversibly.

The Rank-Nullity Theorem

Theorem (Rank-Nullity / Dimension Theorem): For any linear map $T: V \to W$ with $\dim(V) = n < \infty$:

$$\dim(\ker T) + \dim(\operatorname{im} T) = \dim(V) \qquad \text{equivalently} \qquad \operatorname{nullity}(T) + \operatorname{rank}(T) = n$$

Proof. Let $k = \dim(\ker T)$ and choose a basis $\{e_1, \ldots, e_k\}$ for $\ker(T)$. By the basis extension theorem, extend it to a basis $\{e_1, \ldots, e_k, f_1, \ldots, f_r\}$ for $V$, so $k + r = n$.

We claim $\{T(f_1), \ldots, T(f_r)\}$ is a basis for $\operatorname{im}(T)$.

Spanning: For any $w \in \operatorname{im}(T)$, write $w = T(v)$ for some $v = \sum \alpha_i e_i + \sum \beta_j f_j$. Then $w = T(v) = \sum \alpha_i T(e_i) + \sum \beta_j T(f_j) = \sum \beta_j T(f_j)$, since $e_i \in \ker(T)$.

Independence: Suppose $\sum \beta_j T(f_j) = \mathbf{0}$. Then $T\!\left(\sum \beta_j f_j\right) = \mathbf{0}$, so $\sum \beta_j f_j \in \ker(T)$. Writing $\sum \beta_j f_j = \sum \alpha_i e_i$ and using independence of the full basis gives all $\alpha_i = \beta_j = 0$.

Therefore $\dim(\operatorname{im} T) = r$, and $k + r = n$. $\square$

💡Intuition

Rank-nullity is a dimension budget. A linear map cannot create new dimensions: every dimension lost to the kernel is a dimension denied to the image. If $T: \mathbb{R}^{512} \to \mathbb{R}^{10}$ is a classification layer, rank-nullity says at least $512 - 10 = 502$ dimensions of input are discarded.
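Rank-nullity can be verified directly with the SVD, whose trailing right-singular vectors span the null space. A sketch with an illustrative random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
# Construct a 4x5 matrix of rank 2 as a product of thin factors
A = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 5))

rank = np.linalg.matrix_rank(A)
nullity = A.shape[1] - rank

# Rows of Vh beyond the rank span null(A)
_, _, Vh = np.linalg.svd(A)
N = Vh[rank:].T                          # columns form a null-space basis

assert rank == 2
assert rank + nullity == A.shape[1]      # the dimension budget: 2 + 3 = 5
assert np.allclose(A @ N, 0)             # every null-space basis vector maps to zero
```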

Matrix Representations and Change of Basis

Given ordered bases $\mathcal{B} = (b_1, \ldots, b_n)$ for $V$ and $\mathcal{C} = (c_1, \ldots, c_m)$ for $W$, the matrix of $T$ with respect to $(\mathcal{B}, \mathcal{C})$ is the $m \times n$ matrix whose $j$-th column is the coordinate vector of $T(b_j)$ in basis $\mathcal{C}$:

$$[T]_{\mathcal{B}}^{\mathcal{C}} = \begin{bmatrix} [T(b_1)]_{\mathcal{C}} & [T(b_2)]_{\mathcal{C}} & \cdots & [T(b_n)]_{\mathcal{C}} \end{bmatrix}$$

Change of basis. If $P$ is the invertible change-of-basis matrix from $\mathcal{B}$ to the standard basis (its columns are the basis vectors $b_i$ expressed in standard coordinates), then the matrix of $T$ in $\mathcal{B}$ is related to the standard-basis matrix by:

$$[T]_{\mathcal{B}} = P^{-1} [T]_{\text{std}} P$$

This is matrix similarity. Two matrices $A$ and $B$ are similar ($A \sim B$) if $B = P^{-1}AP$ for some invertible $P$. Similar matrices represent the same linear map in different bases and share eigenvalues, determinant, trace, and rank.
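The shared invariants can be confirmed numerically. A sketch (the matrices are illustrative; a random $P$ is invertible with probability 1):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
P = rng.standard_normal((4, 4))
B = np.linalg.inv(P) @ A @ P            # B is similar to A

# Similar matrices share eigenvalues (compared here by sorted magnitude),
# trace, determinant, and rank
assert np.allclose(np.sort(np.abs(np.linalg.eigvals(A))),
                   np.sort(np.abs(np.linalg.eigvals(B))))
assert np.isclose(np.trace(A), np.trace(B))
assert np.isclose(np.linalg.det(A), np.linalg.det(B))
assert np.linalg.matrix_rank(A) == np.linalg.matrix_rank(B)
```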

💡Why eigendecomposition matters

For a symmetric matrix $A$, the spectral theorem (Lesson 4) guarantees the existence of an orthonormal eigenbasis. In that basis, $[A]_\mathcal{B} = \Lambda$ is diagonal — the simplest possible matrix representation. Change of basis is the mechanism by which this simplification happens.

Worked Example

Example 1: Verifying the Null Space

Let $A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$. Find $\text{null}(A)$ and apply the subspace test.

Row reduction:

$$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \xrightarrow{R_2 \leftarrow R_2 - 4R_1} \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \end{bmatrix} \xrightarrow{R_2 \leftarrow -\frac{1}{3}R_2} \begin{bmatrix} 1 & 2 & 3 \\ 0 & 1 & 2 \end{bmatrix} \xrightarrow{R_1 \leftarrow R_1 - 2R_2} \begin{bmatrix} 1 & 0 & -1 \\ 0 & 1 & 2 \end{bmatrix}$$

Free variable: $x_3 = t$. Back-substitution: $x_2 = -2t$, $x_1 = t$. So:

$$\text{null}(A) = \text{span}\!\left\{\begin{bmatrix} 1 \\ -2 \\ 1 \end{bmatrix}\right\}$$

Subspace test:

  1. $A\mathbf{0} = \mathbf{0}$
  2. If $A\mathbf{u} = A\mathbf{v} = \mathbf{0}$ then $A(\mathbf{u}+\mathbf{v}) = A\mathbf{u} + A\mathbf{v} = \mathbf{0}$
  3. If $A\mathbf{u} = \mathbf{0}$ then $A(c\mathbf{u}) = cA\mathbf{u} = \mathbf{0}$
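A one-line check that the basis vector found by row reduction really lies in $\text{null}(A)$:

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])
v = np.array([1., -2., 1.])        # basis vector from the row reduction

assert np.allclose(A @ v, 0)       # v is in null(A)
for t in (-3.0, 0.5, 7.0):         # closure under scaling (condition 3)
    assert np.allclose(A @ (t * v), 0)
```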

Example 2: Rank-Nullity in Action

For $A \in \mathbb{R}^{2 \times 3}$: two pivot columns give $\text{rank}(A) = 2$; one free variable gives $\text{nullity}(A) = 1$.

$$\text{rank}(A) + \text{nullity}(A) = 2 + 1 = 3 = \dim(\mathbb{R}^3) \checkmark$$

Since $\text{rank}(A) = 2 = m$, the column space is all of $\mathbb{R}^2$: the map is surjective, and $A\mathbf{x} = \mathbf{b}$ has at least one solution for every $\mathbf{b} \in \mathbb{R}^2$. Since the null space is nontrivial, the solution is not unique.
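The existence-without-uniqueness structure shows up directly when solving the system. A sketch (the right-hand side $\mathbf{b}$ is illustrative; any $\mathbf{b}$ works since the map is surjective):

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])
b = np.array([6., 15.])

x0, *_ = np.linalg.lstsq(A, b, rcond=None)    # one particular solution
v = np.array([1., -2., 1.])                   # spans null(A)

assert np.allclose(A @ x0, b)                 # a solution exists
assert np.allclose(A @ (x0 + 2.7 * v), b)     # shifting along null(A) gives more
```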

Example 3: Change of Basis for a 2D Rotation

Let $T: \mathbb{R}^2 \to \mathbb{R}^2$ be rotation by $\theta = 45^\circ$. In the standard basis:

$$[T]_{\text{std}} = \begin{bmatrix} \cos 45^\circ & -\sin 45^\circ \\ \sin 45^\circ & \cos 45^\circ \end{bmatrix} = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}$$

Choose the orthonormal basis $\mathcal{B} = \left(\frac{1}{\sqrt{2}}\begin{bmatrix}1\\1\end{bmatrix},\ \frac{1}{\sqrt{2}}\begin{bmatrix}-1\\1\end{bmatrix}\right)$.

The change-of-basis matrix and its inverse:

$$P = \frac{1}{\sqrt{2}}\begin{bmatrix}1 & -1 \\ 1 & 1\end{bmatrix}, \qquad P^{-1} = P^\top = \frac{1}{\sqrt{2}}\begin{bmatrix}1 & 1 \\ -1 & 1\end{bmatrix}$$

(Since $P$ is orthogonal, $P^{-1} = P^\top$.) Computing $[T]_\mathcal{B} = P^\top [T]_\text{std} P$:

$$P^\top [T]_\text{std} P = \frac{1}{2\sqrt{2}}\begin{bmatrix}1 & 1 \\ -1 & 1\end{bmatrix}\begin{bmatrix}1 & -1 \\ 1 & 1\end{bmatrix}\begin{bmatrix}1 & -1 \\ 1 & 1\end{bmatrix} = \frac{1}{\sqrt{2}}\begin{bmatrix}1 & -1 \\ 1 & 1\end{bmatrix} = [T]_\text{std}$$

The matrix is unchanged: $P$ is itself a rotation by $45^\circ$, and plane rotations commute, so $P^{-1}[T]_\text{std}P = [T]_\text{std}$. A rotation looks the same in every orthonormal basis with the same orientation.

Note that no basis change can diagonalize this map over $\mathbb{R}$, since a rotation has no real eigenvectors. This previews the spectral theorem: for symmetric matrices, one can always choose $\mathcal{B}$ to be an eigenbasis, making $[A]_\mathcal{B}$ diagonal.
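A quick numerical check of the change-of-basis computation (here $P$ happens to equal the rotation itself, which is why the matrix comes out unchanged):

```python
import numpy as np

theta = np.pi / 4
T_std = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
P = (1 / np.sqrt(2)) * np.array([[1., -1.],
                                 [1.,  1.]])    # columns are the basis B

T_B = P.T @ T_std @ P        # P orthogonal => P^{-1} = P^T
assert np.allclose(T_B, T_std)
```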

Connections

Where Your Intuition Breaks

"A higher-dimensional vector space always has more expressive power." Not quite: the ambient dimension and the effective rank of a linear map are entirely different things. A weight matrix $W \in \mathbb{R}^{512 \times 512}$ acts on 512-dimensional inputs, but if $\operatorname{rank}(W) = 10$, the map erases 502 input directions — inputs that differ only in those directions produce identical outputs. LoRA exploits this directly: the low-rank update $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$ and $r \ll d$ adds at most $r$ new expressive directions (and generically exactly $r$) regardless of how large $d$ is. Rank is the true measure of capacity; dimension is just the ambient container.
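The rank bound on a LoRA-style update is easy to see numerically. A sketch (the dimensions and random factors are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 512, 10
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, d))
delta_W = B @ A                    # low-rank update, ambient size d x d

assert delta_W.shape == (d, d)
assert np.linalg.matrix_rank(delta_W) == r   # only r expressive directions
```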

Invertibility and the Four Fundamental Subspaces

For $A \in \mathbb{R}^{m \times n}$ with rank $r$, the four fundamental subspaces decompose the domain and codomain into orthogonal pieces:

| Subspace | Lives in | Dimension | Role |
|----------|----------|-----------|------|
| Column space $\text{col}(A)$ | $\mathbb{R}^m$ | $r$ | Reachable outputs |
| Left null space $\text{null}(A^\top)$ | $\mathbb{R}^m$ | $m - r$ | Unreachable directions |
| Row space $\text{col}(A^\top)$ | $\mathbb{R}^n$ | $r$ | Inputs that matter |
| Null space $\text{null}(A)$ | $\mathbb{R}^n$ | $n - r$ | Inputs that are erased |

The fundamental theorem of linear algebra states that these pairs are orthogonal complements: $\text{col}(A^\top) = \text{null}(A)^\perp$ in $\mathbb{R}^n$ and $\text{col}(A) = \text{null}(A^\top)^\perp$ in $\mathbb{R}^m$.
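The orthogonality of row space and null space falls straight out of the SVD, since the rows of $V^\top$ are orthonormal. A sketch with an illustrative rank-2 matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 2)) @ rng.standard_normal((2, 5))   # 3x5, rank 2

r = np.linalg.matrix_rank(A)
_, _, Vh = np.linalg.svd(A)
row_space = Vh[:r]         # rows span col(A^T)
null_space = Vh[r:]        # rows span null(A)

# Every row-space direction is orthogonal to every null-space direction
assert np.allclose(row_space @ null_space.T, 0)
```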

Invertibility conditions:

| Condition | Geometric meaning | Consequence |
|-----------|-------------------|-------------|
| $r = n = m$ (square, full rank) | Bijective — no information lost or missed | $A^{-1}$ exists; $Ax = b$ has a unique solution |
| $r = m < n$ (wide, full row rank) | Surjective — all outputs reachable | $Ax = b$ has solutions; infinitely many |
| $r = n < m$ (tall, full column rank) | Injective — no kernel | $Ax = b$ has at most one solution; may have none |
| $r < \min(m,n)$ (rank deficient) | Neither injective nor surjective | Kernel is nontrivial; image is a proper subspace |

Common Pitfalls

Confusing linear with affine. Every neural network layer $f(\mathbf{x}) = W\mathbf{x} + \mathbf{b}$ is affine when $\mathbf{b} \neq \mathbf{0}$, and an affine map with nonzero offset is not linear: it fails $f(\mathbf{0}) = \mathbf{0}$. The composition of affine maps is again affine, which is why stacking layers without nonlinearities collapses to a single affine map. This distinction matters for analyzing equivariance, scaling laws, and composability proofs.

Assuming all subspaces are hyperplanes. A subspace of $\mathbb{R}^n$ can have any dimension from 0 to $n$. In particular, $\mathbb{R}^n$ is a subspace of itself.

Rank deficiency as a failure mode. If a weight matrix $W \in \mathbb{R}^{d \times d}$ becomes rank deficient during training, the model can no longer distinguish inputs that differ only in $\text{null}(W)$. This is one motivation for rank regularization and for LoRA fine-tuning: the low-rank update $\Delta W = BA$ explicitly constrains $\text{rank}(\Delta W) \leq r$.

Rank-Nullity as Algorithm Design

The rank-nullity theorem is a design constraint, not just a theorem. When building a compression layer $f: \mathbb{R}^{d_{\text{in}}} \to \mathbb{R}^{d_{\text{out}}}$ with $d_{\text{out}} < d_{\text{in}}$, you are choosing to discard at least $d_{\text{in}} - d_{\text{out}}$ dimensions. The question is which dimensions. PCA (Lesson 5) answers this optimally in a least-squares sense; attention (Module 13, Lesson 7) answers it adaptively based on context.

💡A mental model for the whole module

Every result in this module is an answer to the same question: given a linear map $T$, what is the most useful basis to express it in? The spectral theorem (Lesson 4) answers this for symmetric matrices. SVD (Lesson 5) answers it for arbitrary matrices. Matrix calculus (Lesson 8) asks how the answer changes when we perturb the map.
