Neural-Path/Notes

Vector Spaces & Linear Maps

Every machine learning model operates on vectors. When we write $\mathbf{w}^\top \mathbf{x}$, we are implicitly using the axioms of a vector space — the rules that make addition and scalar multiplication behave predictably. The gradient update $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla \mathcal{L}$ stays in the same space as $\mathbf{w}$ because that space is closed under exactly these operations. Understanding the axiomatic foundation makes every downstream theorem in linear algebra inevitable rather than arbitrary: spectral decompositions, least-squares solutions, and backpropagation all follow from the same eight rules you will see in this lesson.

Concepts

[Interactive figure: Linear Map Visualization — the basis vectors e₁ and e₂ of the input space ℝ² are mapped by $A = \begin{bmatrix} 0.71 & -0.71 \\ 0.71 & 0.71 \end{bmatrix}$ to Ae₁ and Ae₂ in the output space ℝ². The map is an isometry: the grid rotates while lengths and angles are preserved.]

When you multiply a NumPy array by a scalar or add two arrays elementwise, you're already using vector space operations — the rules that make these behave predictably. The eight axioms below are not arbitrary: they are the minimal conditions that make span, basis, and dimension well-defined. Any set of objects that respects these rules — polynomials, matrices, functions — is a vector space and inherits the full machinery of linear algebra.
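These identities are exactly what NumPy's elementwise arithmetic obeys. A minimal sketch (the values are illustrative, not part of the lesson):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([-4.0, 0.5, 2.0])
alpha, beta = 2.0, -3.0

assert np.allclose(u + v, v + u)                              # Axiom 1: commutativity
assert np.allclose(v + (-v), np.zeros(3))                     # Axiom 4: additive inverse
assert np.allclose(alpha * (u + v), alpha * u + alpha * v)    # Axiom 7: distributivity I
assert np.allclose((alpha + beta) * v, alpha * v + beta * v)  # Axiom 8: distributivity II
```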

Fields

A field $\mathbb{F}$ is a set equipped with addition and multiplication satisfying commutativity, associativity, distributivity, and the existence of additive and multiplicative identities and inverses (with the exception that zero has no multiplicative inverse). The two fields relevant throughout this module are the real numbers $\mathbb{R}$ and the complex numbers $\mathbb{C}$. Unless otherwise stated, assume $\mathbb{F} = \mathbb{R}$.

Vector Space Axioms

A vector space over a field $\mathbb{F}$ is a set $V$ equipped with two operations:

  • Addition: $V \times V \to V$, written $(u, v) \mapsto u + v$
  • Scalar multiplication: $\mathbb{F} \times V \to V$, written $(\alpha, v) \mapsto \alpha v$

satisfying all eight axioms:

| # | Axiom | Statement |
|---|-------|-----------|
| 1 | Commutativity | $u + v = v + u$ |
| 2 | Associativity (addition) | $(u + v) + w = u + (v + w)$ |
| 3 | Additive identity | $\exists\, \mathbf{0} \in V$ such that $v + \mathbf{0} = v$ |
| 4 | Additive inverse | $\forall\, v \in V$, $\exists\, (-v)$ such that $v + (-v) = \mathbf{0}$ |
| 5 | Multiplicative identity | $1 \cdot v = v$ |
| 6 | Associativity (scalar) | $(\alpha \beta)v = \alpha(\beta v)$ |
| 7 | Distributivity I | $\alpha(u + v) = \alpha u + \alpha v$ |
| 8 | Distributivity II | $(\alpha + \beta)v = \alpha v + \beta v$ |

Elements of $V$ are called vectors; elements of $\mathbb{F}$ are scalars.

The eight axioms are the minimum needed to guarantee that span, basis, and dimension are well-defined. Without distributivity (Axioms 7–8) you could not decompose a vector into basis components; without the additive inverse (Axiom 4) you could not subtract. Each axiom closes exactly one loophole that would otherwise break one of the geometric concepts linear algebra is built on.

⚠️Warning

The word "vector" does not mean "arrow in 2D or 3D space." It means any element of any set that satisfies these eight axioms. Polynomials, matrices, and functions are all vectors in the appropriate spaces.

Canonical examples:

  • $\mathbb{R}^n$: tuples of $n$ real numbers with coordinatewise addition and scalar multiplication — the workhorse of ML
  • $\mathbb{R}^{m \times n}$: $m \times n$ real matrices with elementwise addition and scalar multiplication; $\dim = mn$
  • $\mathbb{P}_n$: polynomials of degree at most $n$ with coefficients in $\mathbb{R}$; $\dim = n+1$
  • The zero space $\{0\}$: the trivial vector space with a single element; $\dim = 0$

Non-examples: $\mathbb{Z}^n$ with real scalar multiplication is not a vector space because scalar multiplication is not closed (e.g. $\frac{1}{2} \cdot 1 \notin \mathbb{Z}$), so the operation $\mathbb{F} \times V \to V$ is not even well-defined. The unit sphere $S^{n-1} \subset \mathbb{R}^n$ is not a vector space because it is not closed under addition.

Subspaces

A nonempty subset $W \subseteq V$ is a subspace of $V$ if it is itself a vector space under the inherited operations. The subspace test reduces this to three conditions:

$$W \text{ is a subspace} \iff \begin{cases} \mathbf{0} \in W \\ u, v \in W \Rightarrow u + v \in W \\ \alpha \in \mathbb{F},\, v \in W \Rightarrow \alpha v \in W \end{cases}$$

Conditions 2 and 3 together mean $W$ is closed under linear combinations. They imply Condition 1 for any nonempty $W$ (take $\alpha = 0$ in Condition 3).

Key examples that appear throughout ML:

$$\text{col}(A) = \{Ax : x \in \mathbb{R}^n\} \subseteq \mathbb{R}^m \qquad \text{(column space of } A \in \mathbb{R}^{m \times n}\text{)}$$

$$\text{null}(A) = \{x \in \mathbb{R}^n : Ax = \mathbf{0}\} \subseteq \mathbb{R}^n \qquad \text{(null space of } A\text{)}$$

Both satisfy the subspace test (verify: $A(x_1 + x_2) = Ax_1 + Ax_2 = \mathbf{0}$ for null space closure under addition).
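The closure argument can also be checked numerically. A small sketch (the matrix is illustrative):

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])          # col(A) is a 2D subspace of R^3

# Two reachable outputs...
y1 = A @ np.array([1., 0.])
y2 = A @ np.array([0., 1.])

# ...whose sum is itself reachable: A x1 + A x2 = A (x1 + x2)
assert np.allclose(y1 + y2, A @ np.array([1., 1.]))
```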

Span, Linear Independence, and Bases

Span: The span of a set $S = \{v_1, \ldots, v_k\} \subset V$ is the set of all finite linear combinations:

$$\text{span}(S) = \left\{ \sum_{i=1}^k \alpha_i v_i : \alpha_i \in \mathbb{F} \right\}$$

$\text{span}(S)$ is always a subspace of $V$ — the smallest subspace containing $S$.

Linear independence: Vectors $v_1, \ldots, v_k$ are linearly independent if the only solution to $\sum_{i=1}^k \alpha_i v_i = \mathbf{0}$ is $\alpha_1 = \cdots = \alpha_k = 0$. Equivalently, no $v_i$ lies in the span of the others.

Basis: A set $\mathcal{B} = \{b_1, \ldots, b_n\}$ is a basis for $V$ if it is linearly independent and $\text{span}(\mathcal{B}) = V$. Equivalently, every $v \in V$ has a unique representation $v = \sum_{i=1}^n \alpha_i b_i$.

Theorem (dimension is well-defined, proved via the Steinitz exchange lemma): Any two bases of a finite-dimensional vector space have the same cardinality. This common cardinality is the dimension of $V$, written $\dim(V)$.

$$\dim(\mathbb{R}^n) = n \qquad \dim(\mathbb{R}^{m \times n}) = mn \qquad \dim(\mathbb{P}_n) = n+1$$
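Linear independence is mechanical to check in code: stack the candidate vectors as columns and compare the matrix rank to the number of vectors. A sketch (the example vectors are my own):

```python
import numpy as np

# Independent: full column rank equals the number of vectors
B = np.column_stack([[1., 0., 1.], [0., 1., 1.], [1., 1., 0.]])
assert np.linalg.matrix_rank(B) == 3   # a basis for R^3

# Dependent: third vector = first + second, so the rank drops
C = np.column_stack([[1., 0., 1.], [0., 1., 1.], [1., 1., 2.]])
assert np.linalg.matrix_rank(C) == 2
```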

Linear Maps

A function $T: V \to W$ between vector spaces over the same field is linear (a linear map or linear transformation) if for all $u, v \in V$ and $\alpha, \beta \in \mathbb{F}$:

$$T(\alpha u + \beta v) = \alpha T(u) + \beta T(v)$$

Linearity has two immediate consequences: $T(\mathbf{0}_V) = \mathbf{0}_W$, and $T$ is completely determined by its values on any basis.

Canonical examples:

  • $T(\mathbf{x}) = A\mathbf{x}$ for $A \in \mathbb{R}^{m \times n}$: matrix multiplication is the canonical linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$
  • Differentiation $D: \mathbb{P}_n \to \mathbb{P}_{n-1}$, $D(p) = p'$: linear because $(p+q)' = p' + q'$ and $(\alpha p)' = \alpha p'$
  • Orthogonal projection onto a subspace $W$: linear by the linearity of the projection formula

Note on neural networks: A neural network layer $\mathbf{x} \mapsto A\mathbf{x} + \mathbf{b}$ with $\mathbf{b} \neq \mathbf{0}$ is affine, not linear (it fails $T(\mathbf{0}) = \mathbf{0}$). The distinction matters when analyzing expressivity and composability.
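Both failures of linearity are easy to observe numerically. A sketch with random weights (shapes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
b = rng.standard_normal(3)
layer = lambda x: W @ x + b        # affine, not linear, since b != 0

u, v = rng.standard_normal(4), rng.standard_normal(4)

# T(0) = 0 fails: the zero input maps to b, not to the zero vector
assert not np.allclose(layer(np.zeros(4)), 0)
# Additivity fails: layer(u + v) and layer(u) + layer(v) differ by exactly b
assert not np.allclose(layer(u + v), layer(u) + layer(v))
assert np.allclose(layer(u + v) + b, layer(u) + layer(v))
```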

Kernel and image: For a linear map $T: V \to W$:

$$\ker(T) = \{v \in V : T(v) = \mathbf{0}_W\} \subseteq V \qquad \text{im}(T) = \{T(v) : v \in V\} \subseteq W$$

Both are subspaces (verify by the subspace test). The kernel measures how much information $T$ collapses; the image measures what outputs $T$ can produce.

💡Projection as a linear map

Imagine the transformation that takes any 2D vector and collapses it onto the x-axis: $(x, y) \mapsto (x, 0)$. The kernel of this map is the entire y-axis — all vectors $(0, y)$ map to zero. The image is just the x-axis. This is a rank-1 linear map: it destroys one dimension of information irreversibly.

The Rank-Nullity Theorem

Theorem (Rank-Nullity / Dimension Theorem): For any linear map $T: V \to W$ with $\dim(V) = n < \infty$:

$$\dim(\ker T) + \dim(\operatorname{im} T) = \dim(V) \qquad \text{equivalently} \qquad \operatorname{nullity}(T) + \operatorname{rank}(T) = n$$

Proof. Let $k = \dim(\ker T)$ and choose a basis $\{e_1, \ldots, e_k\}$ for $\ker(T)$. By the basis extension theorem, extend it to a basis $\{e_1, \ldots, e_k, f_1, \ldots, f_r\}$ for $V$, so $k + r = n$.

We claim $\{T(f_1), \ldots, T(f_r)\}$ is a basis for $\operatorname{im}(T)$.

Spanning: For any $w \in \operatorname{im}(T)$, write $w = T(v)$ for some $v = \sum \alpha_i e_i + \sum \beta_j f_j$. Then $w = T(v) = \sum \alpha_i T(e_i) + \sum \beta_j T(f_j) = \sum \beta_j T(f_j)$, since $e_i \in \ker(T)$.

Independence: Suppose $\sum \beta_j T(f_j) = \mathbf{0}$. Then $T\!\left(\sum \beta_j f_j\right) = \mathbf{0}$, so $\sum \beta_j f_j \in \ker(T)$. Writing $\sum \beta_j f_j = \sum \alpha_i e_i$ and using independence of the full basis gives all $\alpha_i = \beta_j = 0$.

Therefore $\dim(\operatorname{im} T) = r$, and $k + r = n$. $\square$

💡Intuition

Rank-nullity is a dimension budget. A linear map cannot create new dimensions: every dimension lost to the kernel is a dimension denied to the image. If $T: \mathbb{R}^{512} \to \mathbb{R}^{10}$ is a classification layer, rank-nullity says at least $512 - 10 = 502$ dimensions of input are discarded.
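Rank-nullity can be verified directly with the SVD, whose trailing right-singular vectors span the null space. A sketch with an illustrative random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
# Construct a 4x5 matrix of rank 2 as a product of thin factors
A = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 5))

rank = np.linalg.matrix_rank(A)
nullity = A.shape[1] - rank

# Rows of Vh beyond the rank span null(A)
_, _, Vh = np.linalg.svd(A)
N = Vh[rank:].T                          # columns form a null-space basis

assert rank == 2
assert rank + nullity == A.shape[1]      # the dimension budget: 2 + 3 = 5
assert np.allclose(A @ N, 0)             # every null-space basis vector maps to zero
```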

Matrix Representations and Change of Basis

Given ordered bases $\mathcal{B} = (b_1, \ldots, b_n)$ for $V$ and $\mathcal{C} = (c_1, \ldots, c_m)$ for $W$, the matrix of $T$ with respect to $(\mathcal{B}, \mathcal{C})$ is the $m \times n$ matrix whose $j$-th column is the coordinate vector of $T(b_j)$ in basis $\mathcal{C}$:

$$[T]_{\mathcal{B}}^{\mathcal{C}} = \begin{bmatrix} [T(b_1)]_{\mathcal{C}} & [T(b_2)]_{\mathcal{C}} & \cdots & [T(b_n)]_{\mathcal{C}} \end{bmatrix}$$

Change of basis. If $P$ is the invertible change-of-basis matrix from $\mathcal{B}$ to the standard basis (its columns are the basis vectors $b_i$ expressed in standard coordinates), then the matrix of $T$ in $\mathcal{B}$ is related to the standard-basis matrix by:

$$[T]_{\mathcal{B}} = P^{-1} [T]_{\text{std}} P$$

This is matrix similarity. Two matrices $A$ and $B$ are similar ($A \sim B$) if $B = P^{-1}AP$ for some invertible $P$. Similar matrices represent the same linear map in different bases and share eigenvalues, determinant, trace, and rank.
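The shared invariants can be confirmed numerically. A sketch (the matrices are illustrative; a random $P$ is invertible with probability 1):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
P = rng.standard_normal((4, 4))
B = np.linalg.inv(P) @ A @ P            # B is similar to A

# Similar matrices share eigenvalues (compared here by sorted magnitude),
# trace, determinant, and rank
assert np.allclose(np.sort(np.abs(np.linalg.eigvals(A))),
                   np.sort(np.abs(np.linalg.eigvals(B))))
assert np.isclose(np.trace(A), np.trace(B))
assert np.isclose(np.linalg.det(A), np.linalg.det(B))
assert np.linalg.matrix_rank(A) == np.linalg.matrix_rank(B)
```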

💡Why eigendecomposition matters

For a symmetric matrix $A$, the spectral theorem (Lesson 4) guarantees the existence of an orthonormal eigenbasis. In that basis, $[A]_\mathcal{B} = \Lambda$ is diagonal — the simplest possible matrix representation. Change of basis is the mechanism by which this simplification happens.

Worked Example

Example 1: Verifying the Null Space

Let $A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$. Find $\text{null}(A)$ and apply the subspace test.

Row reduction:

$$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \xrightarrow{R_2 \leftarrow R_2 - 4R_1} \begin{bmatrix} 1 & 2 & 3 \\ 0 & -3 & -6 \end{bmatrix} \xrightarrow{R_2 \leftarrow -\frac{1}{3}R_2} \begin{bmatrix} 1 & 2 & 3 \\ 0 & 1 & 2 \end{bmatrix} \xrightarrow{R_1 \leftarrow R_1 - 2R_2} \begin{bmatrix} 1 & 0 & -1 \\ 0 & 1 & 2 \end{bmatrix}$$

Free variable: $x_3 = t$. Back-substitution: $x_2 = -2t$, $x_1 = t$. So:

$$\text{null}(A) = \text{span}\!\left\{\begin{bmatrix} 1 \\ -2 \\ 1 \end{bmatrix}\right\}$$

Subspace test:

  1. $A\mathbf{0} = \mathbf{0}$
  2. If $A\mathbf{u} = A\mathbf{v} = \mathbf{0}$ then $A(\mathbf{u}+\mathbf{v}) = A\mathbf{u} + A\mathbf{v} = \mathbf{0}$
  3. If $A\mathbf{u} = \mathbf{0}$ then $A(c\mathbf{u}) = cA\mathbf{u} = \mathbf{0}$
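A one-line check that the basis vector found by row reduction really lies in $\text{null}(A)$:

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])
v = np.array([1., -2., 1.])        # basis vector from the row reduction

assert np.allclose(A @ v, 0)       # v is in null(A)
for t in (-3.0, 0.5, 7.0):         # closure under scaling (condition 3)
    assert np.allclose(A @ (t * v), 0)
```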

Example 2: Rank-Nullity in Action

For $A \in \mathbb{R}^{2 \times 3}$: two pivot columns give $\text{rank}(A) = 2$; one free variable gives $\text{nullity}(A) = 1$.

$$\text{rank}(A) + \text{nullity}(A) = 2 + 1 = 3 = \dim(\mathbb{R}^3) \checkmark$$

Since $\text{rank}(A) = 2 = m$, the column space is all of $\mathbb{R}^2$: the map is surjective, and $A\mathbf{x} = \mathbf{b}$ has at least one solution for every $\mathbf{b} \in \mathbb{R}^2$. Since the null space is nontrivial, the solution is not unique.
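The existence-without-uniqueness structure shows up directly when solving the system. A sketch (the right-hand side $\mathbf{b}$ is illustrative; any $\mathbf{b}$ works since the map is surjective):

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])
b = np.array([6., 15.])

x0, *_ = np.linalg.lstsq(A, b, rcond=None)    # one particular solution
v = np.array([1., -2., 1.])                   # spans null(A)

assert np.allclose(A @ x0, b)                 # a solution exists
assert np.allclose(A @ (x0 + 2.7 * v), b)     # shifting along null(A) gives more
```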

Example 3: Change of Basis for a 2D Rotation

Let $T: \mathbb{R}^2 \to \mathbb{R}^2$ be rotation by $\theta = 45^\circ$. In the standard basis:

$$[T]_{\text{std}} = \begin{bmatrix} \cos 45^\circ & -\sin 45^\circ \\ \sin 45^\circ & \cos 45^\circ \end{bmatrix} = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}$$

Choose the orthonormal basis $\mathcal{B} = \left(\frac{1}{\sqrt{2}}\begin{bmatrix}1\\1\end{bmatrix},\ \frac{1}{\sqrt{2}}\begin{bmatrix}-1\\1\end{bmatrix}\right)$.

The change-of-basis matrix and its inverse:

$$P = \frac{1}{\sqrt{2}}\begin{bmatrix}1 & -1 \\ 1 & 1\end{bmatrix}, \qquad P^{-1} = P^\top = \frac{1}{\sqrt{2}}\begin{bmatrix}1 & 1 \\ -1 & 1\end{bmatrix}$$

(Since $P$ is orthogonal, $P^{-1} = P^\top$.) Computing $[T]_\mathcal{B} = P^\top [T]_\text{std} P$:

$$P^\top [T]_\text{std} P = \frac{1}{2\sqrt{2}}\begin{bmatrix}1 & 1 \\ -1 & 1\end{bmatrix}\begin{bmatrix}1 & -1 \\ 1 & 1\end{bmatrix}\begin{bmatrix}1 & -1 \\ 1 & 1\end{bmatrix} = \frac{1}{\sqrt{2}}\begin{bmatrix}1 & -1 \\ 1 & 1\end{bmatrix} = [T]_\text{std}$$

The matrix is unchanged: $P$ is itself a rotation by $45^\circ$, and plane rotations commute, so $P^{-1}[T]_\text{std}P = [T]_\text{std}$. A rotation looks the same in every orthonormal basis with the same orientation.

Note that no basis change can diagonalize this map over $\mathbb{R}$, since a rotation has no real eigenvectors. This previews the spectral theorem: for symmetric matrices, one can always choose $\mathcal{B}$ to be an eigenbasis, making $[A]_\mathcal{B}$ diagonal.
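A quick numerical check of the change-of-basis computation (here $P$ happens to equal the rotation itself, which is why the matrix comes out unchanged):

```python
import numpy as np

theta = np.pi / 4
T_std = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
P = (1 / np.sqrt(2)) * np.array([[1., -1.],
                                 [1.,  1.]])    # columns are the basis B

T_B = P.T @ T_std @ P        # P orthogonal => P^{-1} = P^T
assert np.allclose(T_B, T_std)
```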

Connections

Where Your Intuition Breaks

"A higher-dimensional vector space always has more expressive power." Not quite: the ambient dimension and the effective rank of a linear map are entirely different things. A weight matrix $W \in \mathbb{R}^{512 \times 512}$ acts on 512-dimensional inputs, but if $\operatorname{rank}(W) = 10$, the map erases 502 input directions — inputs that differ only in those directions produce identical outputs. LoRA exploits this directly: the low-rank update $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$ and $r \ll d$ adds at most $r$ new expressive directions (and generically exactly $r$) regardless of how large $d$ is. Rank is the true measure of capacity; dimension is just the ambient container.
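The rank bound on a LoRA-style update is easy to see numerically. A sketch (the dimensions and random factors are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 512, 10
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, d))
delta_W = B @ A                    # low-rank update, ambient size d x d

assert delta_W.shape == (d, d)
assert np.linalg.matrix_rank(delta_W) == r   # only r expressive directions
```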

Invertibility and the Four Fundamental Subspaces

For $A \in \mathbb{R}^{m \times n}$ with rank $r$, the four fundamental subspaces decompose the domain and codomain into orthogonal pieces:

| Subspace | Lives in | Dimension | Role |
|----------|----------|-----------|------|
| Column space $\text{col}(A)$ | $\mathbb{R}^m$ | $r$ | Reachable outputs |
| Left null space $\text{null}(A^\top)$ | $\mathbb{R}^m$ | $m - r$ | Unreachable directions |
| Row space $\text{col}(A^\top)$ | $\mathbb{R}^n$ | $r$ | Inputs that matter |
| Null space $\text{null}(A)$ | $\mathbb{R}^n$ | $n - r$ | Inputs that are erased |

The fundamental theorem of linear algebra states that these pairs are orthogonal complements: $\text{col}(A^\top) = \text{null}(A)^\perp$ in $\mathbb{R}^n$ and $\text{col}(A) = \text{null}(A^\top)^\perp$ in $\mathbb{R}^m$.
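The orthogonality of row space and null space falls straight out of the SVD, since the rows of $V^\top$ are orthonormal. A sketch with an illustrative rank-2 matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 2)) @ rng.standard_normal((2, 5))   # 3x5, rank 2

r = np.linalg.matrix_rank(A)
_, _, Vh = np.linalg.svd(A)
row_space = Vh[:r]         # rows span col(A^T)
null_space = Vh[r:]        # rows span null(A)

# Every row-space direction is orthogonal to every null-space direction
assert np.allclose(row_space @ null_space.T, 0)
```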

Invertibility conditions:

| Condition | Geometric meaning | Consequence |
|-----------|-------------------|-------------|
| $r = n = m$ (square, full rank) | Bijective — no information lost or missed | $A^{-1}$ exists; $Ax = b$ has a unique solution |
| $r = m < n$ (wide, full row rank) | Surjective — all outputs reachable | $Ax = b$ has solutions; infinitely many |
| $r = n < m$ (tall, full column rank) | Injective — no kernel | $Ax = b$ has at most one solution; may have none |
| $r < \min(m,n)$ (rank deficient) | Neither injective nor surjective | Kernel is nontrivial; image is a proper subspace |

Common Pitfalls

Confusing linear with affine. Every neural network layer $f(\mathbf{x}) = W\mathbf{x} + \mathbf{b}$ is affine when $\mathbf{b} \neq \mathbf{0}$, and an affine map with nonzero offset is not linear: it fails $f(\mathbf{0}) = \mathbf{0}$. The composition of affine maps is again affine, which is why stacking layers without nonlinearities collapses to a single affine map. This distinction matters for analyzing equivariance, scaling laws, and composability proofs.

Assuming all subspaces are hyperplanes. A subspace of $\mathbb{R}^n$ can have any dimension from 0 to $n$. In particular, $\mathbb{R}^n$ is a subspace of itself.

Rank deficiency as a failure mode. If a weight matrix $W \in \mathbb{R}^{d \times d}$ becomes rank deficient during training, the model can no longer distinguish inputs that differ only in $\text{null}(W)$. This is one motivation for rank regularization and for LoRA fine-tuning: the low-rank update $\Delta W = BA$ explicitly constrains $\text{rank}(\Delta W) \leq r$.

Rank-Nullity as Algorithm Design

The rank-nullity theorem is a design constraint, not just a theorem. When building a compression layer $f: \mathbb{R}^{d_{\text{in}}} \to \mathbb{R}^{d_{\text{out}}}$ with $d_{\text{out}} < d_{\text{in}}$, you are choosing to discard at least $d_{\text{in}} - d_{\text{out}}$ dimensions. The question is which dimensions. PCA (Lesson 5) answers this optimally in a least-squares sense; attention (Module 13, Lesson 7) answers it adaptively based on context.

💡A mental model for the whole module

Every result in this module is an answer to the same question: given a linear map $T$, what is the most useful basis to express it in? The spectral theorem (Lesson 4) answers this for symmetric matrices. SVD (Lesson 5) answers it for arbitrary matrices. Matrix calculus (Lesson 8) asks how the answer changes when we perturb the map.
