
Topology Primer: Metric Spaces, Continuity & Compactness

Topology gives precise meaning to the intuitive notions of "nearby," "approaching a limit," and "continuously varying" — ideas that are invoked constantly in machine learning yet rarely defined with precision. Understanding metric spaces and their topology illuminates why gradient descent converges, why Lipschitz continuity is a stability guarantee, why compactness implies optimization problems have solutions, and why the choice of distance function shapes the geometry of embedding spaces.

Concepts

*Interactive figure — Lipschitz continuity (drag the point to move the cone).* The example $f(x) = x$ has Lipschitz constant $L = 1$: every chord has slope $\leq 1$, and the function is smooth and well-behaved everywhere. A function is $L$-Lipschitz if its graph stays inside the cone of slope $\pm L$ at every point — the cone bounds how fast the function can change. ML relevance: spectral normalization enforces $L = 1$ for discriminator weights in GANs.

When you write loss.backward() in PyTorch, the optimizer takes a step of size lr * gradient. For this to work reliably — not diverge, not oscillate — the loss function must not change too fast relative to the step size. That "not too fast" condition is Lipschitz continuity: the constraint that a fixed multiple of your input move bounds how much the output can move. Metric spaces are the language for making "distance between inputs" and "distance between outputs" precise across all settings — whether your space is $\mathbb{R}^n$, a space of probability distributions, or the set of all possible sentences.

Metric Spaces

A metric space $(X, d)$ is a set $X$ together with a metric $d : X \times X \to \mathbb{R}_{\geq 0}$ satisfying:

  1. Non-negativity: $d(x, y) \geq 0$, with $d(x, y) = 0 \iff x = y$
  2. Symmetry: $d(x, y) = d(y, x)$
  3. Triangle inequality: $d(x, z) \leq d(x, y) + d(y, z)$

These three axioms are the minimum needed to make "nearby" mean something consistent. Non-negativity ensures distances are comparable; symmetry ensures going from $x$ to $y$ costs the same as going from $y$ to $x$; and the triangle inequality — the deepest of the three — ensures you can't shorten a path by routing it through a third point. Without the triangle inequality, the notion of convergence breaks: a sequence could get "close" to a limit through a chain of shortcuts without ever actually approaching it.
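
The three axioms can be spot-checked numerically — a small Python sketch (a sanity check on random sample points, not a proof):

```python
import itertools
import math
import random

def euclidean(x, y):
    """Euclidean (l2) distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

random.seed(0)
points = [tuple(random.uniform(-1, 1) for _ in range(3)) for _ in range(50)]

# Axiom 1: non-negativity, with d(x, x) = 0
assert all(euclidean(x, y) >= 0 for x, y in itertools.product(points, points))
assert all(euclidean(x, x) == 0 for x in points)

# Axiom 2: symmetry
assert all(math.isclose(euclidean(x, y), euclidean(y, x))
           for x, y in itertools.combinations(points, 2))

# Axiom 3: triangle inequality (tiny slack for float round-off)
assert all(euclidean(x, z) <= euclidean(x, y) + euclidean(y, z) + 1e-12
           for x, y, z in itertools.combinations(points, 3))
```

Swapping in a candidate "distance" that violates an axiom (say, squared Euclidean distance, which breaks the triangle inequality) makes the corresponding assertion fail.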

Common metrics on $\mathbb{R}^n$:

| Name | Formula | Unit ball shape | Use in ML |
|---|---|---|---|
| Euclidean ($\ell^2$) | $\sqrt{\sum_i (x_i-y_i)^2}$ | Sphere | Default; PCA, $k$-means |
| Manhattan ($\ell^1$) | $\sum_i \lvert x_i - y_i \rvert$ | Cross-polytope (diamond in 2D) | Lasso; robust to outliers |
| Chebyshev ($\ell^\infty$) | $\max_i \lvert x_i - y_i \rvert$ | Cube | Worst-coordinate error bounds |
| Mahalanobis | $\sqrt{(\mathbf{x}-\mathbf{y})^T\Sigma^{-1}(\mathbf{x}-\mathbf{y})}$ | Ellipsoid | Accounts for covariance |
| Cosine distance | $1 - \frac{\mathbf{x}\cdot\mathbf{y}}{\lVert\mathbf{x}\rVert\,\lVert\mathbf{y}\rVert}$ | — (not a true metric) | NLP embeddings |
| Edit distance | Min. insertions/deletions/substitutions | — | Strings, sequences |
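
The formulas in the table can be evaluated side by side on one pair of points; a NumPy sketch (the covariance matrix `Sigma` is a made-up example for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))   # l2: sqrt(1 + 4 + 0)
manhattan = np.sum(np.abs(x - y))           # l1: 1 + 2 + 0
chebyshev = np.max(np.abs(x - y))           # l-infinity: max(1, 2, 0)
cosine_dist = 1 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Mahalanobis with a hypothetical diagonal covariance;
# Sigma = I would recover the Euclidean distance
Sigma = np.diag([1.0, 4.0, 1.0])
diff = x - y
mahalanobis = np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)

print(euclidean, manhattan, chebyshev, mahalanobis)
```

Note how the larger variance in the second coordinate shrinks its contribution to the Mahalanobis distance relative to the Euclidean one.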

Open and closed balls. For $x \in X$ and $r > 0$:

$$B(x, r) = \{y \in X : d(x, y) < r\} \quad \text{(open ball)}, \qquad \bar{B}(x, r) = \{y \in X : d(x, y) \leq r\} \quad \text{(closed ball)}.$$

Topology: Open Sets and Convergence

A set $U \subseteq X$ is open if for every $x \in U$ there exists $r > 0$ such that $B(x, r) \subseteq U$ — every point has room to breathe.

A set $C \subseteq X$ is closed if its complement $X \setminus C$ is open — equivalently, $C$ contains all its limit points.

Convergence. A sequence $(x_n)_{n=1}^\infty$ converges to $x$ (written $x_n \to x$) if for every $\varepsilon > 0$ there exists $N$ such that $n \geq N \implies d(x_n, x) < \varepsilon$.

Cauchy sequence. $(x_n)$ is Cauchy if for every $\varepsilon > 0$ there exists $N$ such that $m, n \geq N \implies d(x_m, x_n) < \varepsilon$. Every convergent sequence is Cauchy; the converse holds in complete metric spaces.

Complete metric spaces. A metric space in which every Cauchy sequence converges. Examples: $\mathbb{R}^n$ (Euclidean), $\ell^2$ (square-summable sequences), $L^2$ (square-integrable functions). Non-example: $\mathbb{Q}$ (rationals) — the sequence $3, 3.1, 3.14, 3.141, \ldots$ is Cauchy, but its limit $\pi$ lies outside $\mathbb{Q}$.
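
The rational approximations of $\pi$ can be checked with exact rational arithmetic via the standard-library `fractions.Fraction`; a small sketch:

```python
import math
from fractions import Fraction

# Truncated decimal expansions of pi: 3, 3.1, 3.14, ... (all rational)
digits = "31415926535"
approx = [Fraction(int(digits[: k + 1]), 10 ** k) for k in range(len(digits))]

# Cauchy in Q: consecutive terms differ by less than a shrinking bound
for n in range(1, len(approx)):
    assert abs(approx[n] - approx[n - 1]) < Fraction(1, 10 ** (n - 1))

# ...but the gap to pi shrinks toward a point outside Q: the sequence
# has no limit among the rationals
gaps = [float(abs(math.pi - a)) for a in approx]
assert all(g2 < g1 for g1, g2 in zip(gaps, gaps[1:]))
```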

Continuity and Its Characterizations

A function $f : X \to Y$ between metric spaces is continuous at $x$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that

$$d_X(x', x) < \delta \implies d_Y(f(x'), f(x)) < \varepsilon.$$

Equivalent characterizations:

  • $f$ is continuous $\iff$ for every open $U \subseteq Y$, the preimage $f^{-1}(U)$ is open in $X$
  • $f$ is continuous $\iff$ $x_n \to x$ implies $f(x_n) \to f(x)$ (sequential continuity)

Lipschitz Continuity

$f : X \to Y$ is $L$-Lipschitz (or has Lipschitz constant $L$) if for all $x, x' \in X$:

$$d_Y(f(x), f(x')) \leq L \cdot d_X(x, x').$$

Lipschitz continuity is uniform continuity with an explicit rate. The best (smallest) Lipschitz constant is $L = \sup_{x \neq x'} d_Y(f(x), f(x')) / d_X(x, x')$.

For differentiable $f : \mathbb{R}^n \to \mathbb{R}^m$: $f$ is $L$-Lipschitz $\iff$ $\|J_f(\mathbf{x})\|_2 \leq L$ for all $\mathbf{x}$ — the spectral (operator) norm of the Jacobian is bounded by $L$ everywhere.
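
The supremum of chord slopes can be estimated by sampling pairs of points — this gives a lower bound on the true Lipschitz constant. A NumPy sketch:

```python
import numpy as np

def chord_slope_sup(f, xs):
    """Largest chord slope |f(x) - f(x')| / |x - x'| over sampled pairs --
    a lower bound on the true Lipschitz constant of f on this interval."""
    xs = np.asarray(xs, dtype=float)
    fx = f(xs)
    dx = np.abs(xs[:, None] - xs[None, :])
    df = np.abs(fx[:, None] - fx[None, :])
    mask = dx > 0
    return np.max(df[mask] / dx[mask])

xs = np.linspace(-2, 2, 401)
print(chord_slope_sup(np.sin, xs))           # close to 1 (|cos x| <= 1)
print(chord_slope_sup(lambda x: x ** 2, xs)) # close to 4 (|2x| <= 4 on [-2, 2])
```

For differentiable functions the estimate approaches the maximum of $|f'|$ on the interval as the grid is refined, matching the Jacobian characterization above.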

$L$-smooth functions. A differentiable $f : \mathbb{R}^n \to \mathbb{R}$ is $L$-smooth if $\nabla f$ is $L$-Lipschitz:

$$\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \leq L\|\mathbf{x} - \mathbf{y}\|.$$

Equivalently, for twice-differentiable $f$, $-LI \preceq \nabla^2 f \preceq LI$ everywhere (Hessian eigenvalues bounded by $L$ in absolute value). This is the standard condition for gradient descent convergence: step size $\eta \leq 1/L$ guarantees descent.
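
The step-size rule can be illustrated on a quadratic, where $L$ is the largest Hessian eigenvalue; a minimal NumPy sketch (on a quadratic, the exact divergence threshold is $\eta > 2/L$):

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T A x is L-smooth with L = largest eigenvalue of A.
A = np.diag([1.0, 10.0])
L = np.max(np.linalg.eigvalsh(A))     # L = 10

def gd(step, iters=100):
    """Run gradient descent from (1, 1); return the final distance to 0."""
    x = np.array([1.0, 1.0])
    for _ in range(iters):
        x = x - step * (A @ x)        # gradient of f is A x
    return np.linalg.norm(x)

print(gd(step=1.0 / L))   # shrinks toward 0: the safe step size converges
print(gd(step=2.5 / L))   # beyond 2/L the iterates blow up
```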

Compactness

Definition. A set $K \subseteq X$ is compact if every open cover of $K$ has a finite subcover — informally, $K$ is "small enough" that you can't escape to infinity or to the boundary without being caught.

Heine-Borel Theorem. In $\mathbb{R}^n$, a set $K$ is compact $\iff$ $K$ is closed and bounded.

Sequential compactness. In a metric space, $K$ is compact $\iff$ every sequence in $K$ has a convergent subsequence (with limit in $K$).

Why compactness matters for optimization:

Weierstrass Extreme Value Theorem. If $f : K \to \mathbb{R}$ is continuous and $K$ is compact, then $f$ achieves its maximum and minimum on $K$.

In ML: why does ERM (empirical risk minimization) have a solution? Because we restrict to a compact hypothesis class (e.g., weights in a closed, bounded set), and the loss is continuous. Without compactness, you might chase a sequence of models approaching the infimum without ever reaching it.

Topological Spaces and Beyond

A topological space $(X, \mathcal{T})$ generalizes metric spaces by axiomatizing the collection of open sets directly. Every metric space induces a topology (open sets = arbitrary unions of open balls), but not every topology is metrizable.

Key topological properties:

| Property | Definition | ML relevance |
|---|---|---|
| Hausdorff | Distinct points have disjoint open neighborhoods | Limits are unique; standard in analysis |
| Connected | Not a disjoint union of two nonempty open sets | Convex sets are connected |
| Path-connected | Any two points joined by a continuous path | Manifold hypothesis: data lives on a connected manifold |
| Second-countable | Countable base for the topology | Required for manifold theory |

Worked Example

Example 1: Convergence in Different Metrics

The sequence $\mathbf{x}_n = (1/n, 1/n)$ in $\mathbb{R}^2$:

  • $d_2(\mathbf{x}_n, \mathbf{0}) = \sqrt{2}/n \to 0$ — converges in $\ell^2$
  • $d_1(\mathbf{x}_n, \mathbf{0}) = 2/n \to 0$ — converges in $\ell^1$
  • $d_\infty(\mathbf{x}_n, \mathbf{0}) = 1/n \to 0$ — converges in $\ell^\infty$

All three metrics give the same convergent sequences in $\mathbb{R}^n$: they are equivalent metrics, each bounded above and below by constant multiples of the others. For the $k$-NN algorithm, however, the choice of metric dramatically changes which neighbors are "nearest" in high dimensions.
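
The $k$-NN point can be made concrete: with a hand-picked pair of candidate neighbors, which one is "nearest" depends on the metric. A NumPy sketch:

```python
import numpy as np

query = np.zeros(2)
points = np.array([[2.1, 0.0],    # candidate 0
                   [1.5, 1.5]])   # candidate 1

def nearest(p, pts, ord):
    """Index of the point in pts closest to p under the given l^p norm."""
    d = np.linalg.norm(pts - p, ord=ord, axis=1)
    return int(np.argmin(d))

print(nearest(query, points, ord=2))       # 0: 2.1 < sqrt(4.5) ~ 2.12
print(nearest(query, points, ord=1))       # 0: 2.1 < 3.0
print(nearest(query, points, ord=np.inf))  # 1: 1.5 < 2.1
```

The same two candidates swap roles as the norm changes, even though all three metrics are topologically equivalent.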

Example 2: The Lipschitz Constant of a Neural Network Layer

Consider a fully-connected layer $\mathbf{h} = \sigma(W\mathbf{x})$ with $W \in \mathbb{R}^{m \times n}$ and $\sigma$ the ReLU.

Lipschitz constant:

  • The map $\mathbf{x} \mapsto W\mathbf{x}$ is $\|W\|_2$-Lipschitz, where $\|W\|_2 = \sigma_{\max}(W)$ is the largest singular value.
  • ReLU is 1-Lipschitz: $|\text{relu}(a) - \text{relu}(b)| \leq |a - b|$.
  • Composition: the layer is $\|W\|_2$-Lipschitz.

Spectral normalization (Miyato et al., 2018) divides $W$ by its largest singular value:

$$\tilde{W} = W / \sigma_{\max}(W),$$

making each layer 1-Lipschitz and the full network 1-Lipschitz (since a composition of 1-Lipschitz maps is 1-Lipschitz). This stabilizes GAN training by preventing the discriminator from varying too rapidly, making its gradients more reliable.
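
Spectral normalization is one line once $\sigma_{\max}(W)$ is known; a NumPy sketch on a random weight matrix (the paper estimates $\sigma_{\max}$ cheaply with power iteration, whereas here we compute the norm exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))          # a random "weight matrix"

sigma_max = np.linalg.norm(W, ord=2)   # largest singular value of W
W_sn = W / sigma_max                   # spectrally normalized weight

# The normalized linear map x -> W_sn x is 1-Lipschitz in l2
assert np.isclose(np.linalg.norm(W_sn, ord=2), 1.0)

# Spot-check the Lipschitz bound on a random input pair
x, y = rng.normal(size=32), rng.normal(size=32)
assert np.linalg.norm(W_sn @ x - W_sn @ y) <= np.linalg.norm(x - y) + 1e-9
```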

Example 3: Compactness Failure and Optimization

Consider minimizing $f(x) = e^{-x}$ over $\mathbb{R}$. The infimum is 0, but it is never achieved: $\inf_{x \in \mathbb{R}} f(x) = 0$, yet $f(x) > 0$ for all $x$.

This is compactness failure: $\mathbb{R}$ is closed but not bounded (hence not compact), so Weierstrass doesn't apply.

Fix: Restrict to $[-M, M]$ for some $M > 0$ (compact). Now the minimum of $f$ on $[-M, M]$ is $f(M) = e^{-M} > 0$ — achieved, but larger than the unconstrained infimum.

In ML: minimizing a neural network loss over all of weight space carries no guarantee that a minimum is achieved (the loss might asymptotically approach its infimum as weights → ∞). Weight decay / $\ell^2$ regularization makes the objective grow at large weights, effectively restricting the search to a compact ball and ensuring the regularized problem has a solution.
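
A sketch of the same fix in code: the regularized objective $g(x) = e^{-x} + \lambda x^2$ is convex and grows at infinity, so its minimum is attained and a simple ternary search finds it (the choice $\lambda = 0.01$ is arbitrary, for illustration):

```python
import math

def f(x):
    return math.exp(-x)           # inf over R is 0, never attained

lam = 0.01
def g(x):
    return f(x) + lam * x * x     # coercive: g(x) -> inf as |x| -> inf

# g is convex (g'' = e^{-x} + 2*lam > 0), so ternary search on a
# bracketing interval converges to its attained minimizer
lo, hi = 0.0, 100.0
for _ in range(200):
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if g(m1) < g(m2):
        hi = m2
    else:
        lo = m1
x_star = (lo + hi) / 2
print(x_star, g(x_star))   # a finite minimizer exists for g, unlike for f
```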

Connections

Where Your Intuition Breaks

The most common misconception: all reasonable distances are essentially the same in $\mathbb{R}^n$. It's true that the $\ell^1$, $\ell^2$, and $\ell^\infty$ metrics are equivalent (each bounded by a constant multiple of the others), meaning they define the same open sets and the same convergent sequences. But "equivalent" does not mean "interchangeable for algorithms." In high dimensions, $k$-NN with $\ell^2$ distance assigns nearly equal distances to all points (the curse of dimensionality), while $\ell^1$ distance concentrates differently. And the Mahalanobis and cosine metrics are not equivalent to Euclidean — they encode qualitatively different geometry. Metric equivalence tells you about topology; it says nothing about which distances are meaningful for your specific learning problem.

Metric Spaces vs Normed Spaces vs Inner Product Spaces

| Structure | Additional axioms | Gives you |
|---|---|---|
| Metric space | $d \geq 0$, symmetry, triangle inequality | Convergence, continuity |
| Normed space | $\lVert \alpha x \rVert = \lvert\alpha\rvert\,\lVert x \rVert$, $\lVert x + y \rVert \leq \lVert x \rVert + \lVert y \rVert$, $\lVert x \rVert = 0 \iff x = 0$ | Lengths compatible with vector space structure |
| Inner product space | $\langle x, y \rangle$ bilinear, symmetric, positive definite | Angles, projections, Gram-Schmidt |
| Banach space | Normed + complete | Convergent series, fixed-point theorems |
| Hilbert space | Inner product + complete | Projection theorem, spectral theory |

Every inner product space → normed space → metric space (via $d(x, y) = \lVert x - y \rVert$), but not conversely.

💡Intuition

Why topology, not just calculus? Classical calculus assumes you're working in $\mathbb{R}$ or $\mathbb{R}^n$ with the Euclidean metric. But many ML objects live in other spaces: probability distributions (infinite-dimensional), strings (discrete metric), graphs (graph edit distance), neural network weight spaces (high-dimensional, non-Euclidean geometry near flat regions). Topology provides the language to discuss continuity and convergence in all these settings uniformly.

💡Intuition

The Manifold Hypothesis. High-dimensional data (images, text, audio) doesn't fill $\mathbb{R}^d$ uniformly — it concentrates near a low-dimensional manifold embedded in $\mathbb{R}^d$. If this manifold has dimension $k \ll d$, then distances between data points that respect the manifold geometry (geodesic distances) are far more meaningful than Euclidean distances, which cut through the ambient space. Topology formalizes what "manifold" means; Lesson 4 develops the machinery.

⚠️Warning

The curse of dimensionality is a topological phenomenon. In high dimensions, the volume of a ball is concentrated near its surface: for $B(0, 1) \subset \mathbb{R}^d$, the shell $\{r \leq \lVert x \rVert \leq 1\}$ contains a $1 - r^d$ fraction of the volume — for $r = 0.9$ and $d = 100$, this is $1 - 0.9^{100} \approx 1$. Almost all the volume is in the outer shell. This means the notion of "nearby" in high-dimensional Euclidean space is misleading: everything is approximately equally far from everything else, making distance-based algorithms (nearest neighbors, $k$-means) degrade.
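
Both claims — the shell-volume formula and distance concentration — can be checked empirically; a NumPy sketch (sampling uniformly from the unit cube is an illustrative choice):

```python
import numpy as np

# Volume fraction of the unit ball beyond radius r in dimension d: 1 - r^d
for d in (2, 10, 100):
    print(d, 1 - 0.9 ** d)        # 0.19, ~0.65, ~0.99997

def rel_spread(d, n=200, seed=0):
    """Relative spread (std/mean) of pairwise l2 distances between
    n points drawn uniformly from the cube [0, 1]^d."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, d))
    G = X @ X.T                   # Gram matrix for the squared-distance trick
    sq = np.diag(G)
    d2 = np.clip(sq[:, None] + sq[None, :] - 2 * G, 0, None)
    vals = np.sqrt(d2[np.triu_indices(n, k=1)])
    return vals.std() / vals.mean()

print(rel_spread(2), rel_spread(1000))  # the spread collapses as d grows
```

The shrinking ratio is exactly the "everything is equally far from everything else" effect that degrades nearest-neighbor methods.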
