
Topology Primer: Metric Spaces, Continuity & Compactness

Topology gives precise meaning to the intuitive notions of "nearby," "approaching a limit," and "continuously varying" — ideas that are invoked constantly in machine learning yet rarely defined with precision. Understanding metric spaces and their topology illuminates why gradient descent converges, why Lipschitz continuity is a stability guarantee, why compactness implies optimization problems have solutions, and why the choice of distance function shapes the geometry of embedding spaces.

Concepts

*Interactive figure — Lipschitz continuity (drag the point to move the cone).* The example $f(x) = x$ has Lipschitz constant $L = 1$: every chord has slope $\leq 1$, and the function is smooth and well-behaved everywhere. A function is $L$-Lipschitz if its graph stays inside the cone of slope $\pm L$ at every point — the cone bounds how fast the function can change. ML relevance: spectral normalization enforces $L = 1$ for discriminator weights in GANs.

When you write loss.backward() in PyTorch, the optimizer takes a step of size lr * gradient. For this to work reliably — not diverge, not oscillate — the loss function must not change too fast relative to the step size. That "not too fast" condition is Lipschitz continuity: the constraint that a fixed multiple of your input move bounds how much the output can move. Metric spaces are the language for making "distance between inputs" and "distance between outputs" precise across all settings — whether your space is $\mathbb{R}^n$, a space of probability distributions, or the set of all possible sentences.

Metric Spaces

A metric space $(X, d)$ is a set $X$ together with a metric $d : X \times X \to \mathbb{R}_{\geq 0}$ satisfying:

  1. Non-negativity: $d(x, y) \geq 0$, with $d(x, y) = 0 \iff x = y$
  2. Symmetry: $d(x, y) = d(y, x)$
  3. Triangle inequality: $d(x, z) \leq d(x, y) + d(y, z)$

These three axioms are the minimum needed to make "nearby" mean something consistent. Non-negativity ensures distances are comparable; symmetry ensures going from $x$ to $y$ costs the same as going from $y$ to $x$; and the triangle inequality — the deepest of the three — ensures you can't shorten a path by routing it through a third point. Without the triangle inequality, the notion of convergence breaks: a sequence could get "close" to a limit through a chain of shortcuts without ever actually approaching it.
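
The three axioms can be spot-checked numerically — a small Python sketch (a sanity check on random sample points, not a proof):

```python
import itertools
import math
import random

def euclidean(x, y):
    """Euclidean (l2) distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

random.seed(0)
points = [tuple(random.uniform(-1, 1) for _ in range(3)) for _ in range(50)]

# Axiom 1: non-negativity, with d(x, x) = 0
assert all(euclidean(x, y) >= 0 for x, y in itertools.product(points, points))
assert all(euclidean(x, x) == 0 for x in points)

# Axiom 2: symmetry
assert all(math.isclose(euclidean(x, y), euclidean(y, x))
           for x, y in itertools.combinations(points, 2))

# Axiom 3: triangle inequality (tiny slack for float round-off)
assert all(euclidean(x, z) <= euclidean(x, y) + euclidean(y, z) + 1e-12
           for x, y, z in itertools.combinations(points, 3))
```

Swapping in a candidate "distance" that violates an axiom (say, squared Euclidean distance, which breaks the triangle inequality) makes the corresponding assertion fail.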

Common metrics on $\mathbb{R}^n$:

| Name | Formula | Unit ball shape | Use in ML |
|---|---|---|---|
| Euclidean ($\ell^2$) | $\sqrt{\sum_i (x_i-y_i)^2}$ | Sphere | Default; PCA, $k$-means |
| Manhattan ($\ell^1$) | $\sum_i \lvert x_i - y_i \rvert$ | Cross-polytope (diamond in 2D) | Lasso; robust to outliers |
| Chebyshev ($\ell^\infty$) | $\max_i \lvert x_i - y_i \rvert$ | Cube | Worst-coordinate error bounds |
| Mahalanobis | $\sqrt{(\mathbf{x}-\mathbf{y})^T\Sigma^{-1}(\mathbf{x}-\mathbf{y})}$ | Ellipsoid | Accounts for covariance |
| Cosine distance | $1 - \frac{\mathbf{x}\cdot\mathbf{y}}{\lVert\mathbf{x}\rVert\,\lVert\mathbf{y}\rVert}$ | — (not a true metric) | NLP embeddings |
| Edit distance | Min. insertions/deletions/substitutions | — | Strings, sequences |
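
The formulas in the table can be evaluated side by side on one pair of points; a NumPy sketch (the covariance matrix `Sigma` is a made-up example for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))   # l2: sqrt(1 + 4 + 0)
manhattan = np.sum(np.abs(x - y))           # l1: 1 + 2 + 0
chebyshev = np.max(np.abs(x - y))           # l-infinity: max(1, 2, 0)
cosine_dist = 1 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Mahalanobis with a hypothetical diagonal covariance;
# Sigma = I would recover the Euclidean distance
Sigma = np.diag([1.0, 4.0, 1.0])
diff = x - y
mahalanobis = np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)

print(euclidean, manhattan, chebyshev, mahalanobis)
```

Note how the larger variance in the second coordinate shrinks its contribution to the Mahalanobis distance relative to the Euclidean one.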

Open and closed balls. For $x \in X$ and $r > 0$:

$$B(x, r) = \{y \in X : d(x, y) < r\} \quad \text{(open ball)}, \qquad \bar{B}(x, r) = \{y \in X : d(x, y) \leq r\} \quad \text{(closed ball)}.$$

Topology: Open Sets and Convergence

A set $U \subseteq X$ is open if for every $x \in U$ there exists $r > 0$ such that $B(x, r) \subseteq U$ — every point has room to breathe.

A set $C \subseteq X$ is closed if its complement $X \setminus C$ is open — equivalently, $C$ contains all its limit points.

Convergence. A sequence $(x_n)_{n=1}^\infty$ converges to $x$ (written $x_n \to x$) if for every $\varepsilon > 0$ there exists $N$ such that $n \geq N \implies d(x_n, x) < \varepsilon$.

Cauchy sequence. $(x_n)$ is Cauchy if for every $\varepsilon > 0$ there exists $N$ such that $m, n \geq N \implies d(x_m, x_n) < \varepsilon$. Every convergent sequence is Cauchy; the converse holds in complete metric spaces.

Complete metric spaces. A metric space in which every Cauchy sequence converges. Examples: $\mathbb{R}^n$ (Euclidean), $\ell^2$ (square-summable sequences), $L^2$ (square-integrable functions). Non-example: $\mathbb{Q}$ (rationals) — the sequence $3, 3.1, 3.14, 3.141, \ldots$ is Cauchy, but its limit $\pi$ lies outside $\mathbb{Q}$.
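
The rational approximations of $\pi$ can be checked with exact rational arithmetic via the standard-library `fractions.Fraction`; a small sketch:

```python
import math
from fractions import Fraction

# Truncated decimal expansions of pi: 3, 3.1, 3.14, ... (all rational)
digits = "31415926535"
approx = [Fraction(int(digits[: k + 1]), 10 ** k) for k in range(len(digits))]

# Cauchy in Q: consecutive terms differ by less than a shrinking bound
for n in range(1, len(approx)):
    assert abs(approx[n] - approx[n - 1]) < Fraction(1, 10 ** (n - 1))

# ...but the gap to pi shrinks toward a point outside Q: the sequence
# has no limit among the rationals
gaps = [float(abs(math.pi - a)) for a in approx]
assert all(g2 < g1 for g1, g2 in zip(gaps, gaps[1:]))
```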

Continuity and Its Characterizations

A function $f : X \to Y$ between metric spaces is continuous at $x$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that

$$d_X(x', x) < \delta \implies d_Y(f(x'), f(x)) < \varepsilon.$$

Equivalent characterizations:

  • $f$ is continuous $\iff$ for every open $U \subseteq Y$, the preimage $f^{-1}(U)$ is open in $X$
  • $f$ is continuous $\iff$ $x_n \to x$ implies $f(x_n) \to f(x)$ (sequential continuity)

Lipschitz Continuity

$f : X \to Y$ is $L$-Lipschitz (or has Lipschitz constant $L$) if for all $x, x' \in X$:

$$d_Y(f(x), f(x')) \leq L \cdot d_X(x, x').$$

Lipschitz continuity is uniform continuity with an explicit rate. The best (smallest) Lipschitz constant is $L = \sup_{x \neq x'} d_Y(f(x), f(x')) / d_X(x, x')$.

For differentiable $f : \mathbb{R}^n \to \mathbb{R}^m$: $f$ is $L$-Lipschitz $\iff$ $\|J_f(\mathbf{x})\|_2 \leq L$ for all $\mathbf{x}$ — the spectral (operator) norm of the Jacobian is bounded by $L$ everywhere.
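
The supremum of chord slopes can be estimated by sampling pairs of points — this gives a lower bound on the true Lipschitz constant. A NumPy sketch:

```python
import numpy as np

def chord_slope_sup(f, xs):
    """Largest chord slope |f(x) - f(x')| / |x - x'| over sampled pairs --
    a lower bound on the true Lipschitz constant of f on this interval."""
    xs = np.asarray(xs, dtype=float)
    fx = f(xs)
    dx = np.abs(xs[:, None] - xs[None, :])
    df = np.abs(fx[:, None] - fx[None, :])
    mask = dx > 0
    return np.max(df[mask] / dx[mask])

xs = np.linspace(-2, 2, 401)
print(chord_slope_sup(np.sin, xs))           # close to 1 (|cos x| <= 1)
print(chord_slope_sup(lambda x: x ** 2, xs)) # close to 4 (|2x| <= 4 on [-2, 2])
```

For differentiable functions the estimate approaches the maximum of $|f'|$ on the interval as the grid is refined, matching the Jacobian characterization above.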

$L$-smooth functions. A differentiable $f : \mathbb{R}^n \to \mathbb{R}$ is $L$-smooth if $\nabla f$ is $L$-Lipschitz:

$$\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \leq L\|\mathbf{x} - \mathbf{y}\|.$$

Equivalently, for twice-differentiable $f$, $-LI \preceq \nabla^2 f \preceq LI$ everywhere (Hessian eigenvalues bounded by $L$ in absolute value). This is the standard condition for gradient descent convergence: step size $\eta \leq 1/L$ guarantees descent.
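
The step-size rule can be illustrated on a quadratic, where $L$ is the largest Hessian eigenvalue; a minimal NumPy sketch (on a quadratic, the exact divergence threshold is $\eta > 2/L$):

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T A x is L-smooth with L = largest eigenvalue of A.
A = np.diag([1.0, 10.0])
L = np.max(np.linalg.eigvalsh(A))     # L = 10

def gd(step, iters=100):
    """Run gradient descent from (1, 1); return the final distance to 0."""
    x = np.array([1.0, 1.0])
    for _ in range(iters):
        x = x - step * (A @ x)        # gradient of f is A x
    return np.linalg.norm(x)

print(gd(step=1.0 / L))   # shrinks toward 0: the safe step size converges
print(gd(step=2.5 / L))   # beyond 2/L the iterates blow up
```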

Compactness

Definition. A set $K \subseteq X$ is compact if every open cover of $K$ has a finite subcover — informally, $K$ is "small enough" that you can't escape to infinity or to the boundary without being caught.

Heine-Borel Theorem. In $\mathbb{R}^n$, a set $K$ is compact $\iff$ $K$ is closed and bounded.

Sequential compactness. In a metric space, $K$ is compact $\iff$ every sequence in $K$ has a convergent subsequence (with limit in $K$).

Why compactness matters for optimization:

Weierstrass Extreme Value Theorem. If $f : K \to \mathbb{R}$ is continuous and $K$ is compact, then $f$ achieves its maximum and minimum on $K$.

In ML: why does ERM (empirical risk minimization) have a solution? Because we restrict to a compact hypothesis class (e.g., weights in a closed, bounded set), and the loss is continuous. Without compactness, you might chase a sequence of models approaching the infimum without ever reaching it.

Topological Spaces and Beyond

A topological space $(X, \mathcal{T})$ generalizes metric spaces by axiomatizing the collection of open sets directly. Every metric space induces a topology (open sets = arbitrary unions of open balls), but not every topology is metrizable.

Key topological properties:

| Property | Definition | ML relevance |
|---|---|---|
| Hausdorff | Distinct points have disjoint open neighborhoods | Limits are unique; standard in analysis |
| Connected | Not a disjoint union of two nonempty open sets | Convex sets are connected |
| Path-connected | Any two points joined by a continuous path | Manifold hypothesis: data lives on a connected manifold |
| Second-countable | Countable base for the topology | Required for manifold theory |

Worked Example

Example 1: Convergence in Different Metrics

The sequence $\mathbf{x}_n = (1/n, 1/n)$ in $\mathbb{R}^2$:

  • $d_2(\mathbf{x}_n, \mathbf{0}) = \sqrt{2}/n \to 0$ — converges in $\ell^2$
  • $d_1(\mathbf{x}_n, \mathbf{0}) = 2/n \to 0$ — converges in $\ell^1$
  • $d_\infty(\mathbf{x}_n, \mathbf{0}) = 1/n \to 0$ — converges in $\ell^\infty$

All three metrics give the same convergent sequences in $\mathbb{R}^n$: they are equivalent metrics, each bounded above and below by constant multiples of the others. For the $k$-NN algorithm, however, the choice of metric dramatically changes which neighbors are "nearest" in high dimensions.
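
The $k$-NN point can be made concrete: with a hand-picked pair of candidate neighbors, which one is "nearest" depends on the metric. A NumPy sketch:

```python
import numpy as np

query = np.zeros(2)
points = np.array([[2.1, 0.0],    # candidate 0
                   [1.5, 1.5]])   # candidate 1

def nearest(p, pts, ord):
    """Index of the point in pts closest to p under the given l^p norm."""
    d = np.linalg.norm(pts - p, ord=ord, axis=1)
    return int(np.argmin(d))

print(nearest(query, points, ord=2))       # 0: 2.1 < sqrt(4.5) ~ 2.12
print(nearest(query, points, ord=1))       # 0: 2.1 < 3.0
print(nearest(query, points, ord=np.inf))  # 1: 1.5 < 2.1
```

The same two candidates swap roles as the norm changes, even though all three metrics are topologically equivalent.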

Example 2: The Lipschitz Constant of a Neural Network Layer

Consider a fully-connected layer $\mathbf{h} = \sigma(W\mathbf{x})$ with $W \in \mathbb{R}^{m \times n}$ and $\sigma$ the ReLU.

Lipschitz constant:

  • The map $\mathbf{x} \mapsto W\mathbf{x}$ is $\|W\|_2$-Lipschitz, where $\|W\|_2 = \sigma_{\max}(W)$ is the largest singular value.
  • ReLU is 1-Lipschitz: $|\text{relu}(a) - \text{relu}(b)| \leq |a - b|$.
  • Composition: the layer is $\|W\|_2$-Lipschitz.

Spectral normalization (Miyato et al., 2018) divides $W$ by its largest singular value:

$$\tilde{W} = W / \sigma_{\max}(W),$$

making each layer 1-Lipschitz and the full network 1-Lipschitz (since a composition of 1-Lipschitz maps is 1-Lipschitz). This stabilizes GAN training by preventing the discriminator from varying too rapidly, making its gradients more reliable.
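
Spectral normalization is one line once $\sigma_{\max}(W)$ is known; a NumPy sketch on a random weight matrix (the paper estimates $\sigma_{\max}$ cheaply with power iteration, whereas here we compute the norm exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))          # a random "weight matrix"

sigma_max = np.linalg.norm(W, ord=2)   # largest singular value of W
W_sn = W / sigma_max                   # spectrally normalized weight

# The normalized linear map x -> W_sn x is 1-Lipschitz in l2
assert np.isclose(np.linalg.norm(W_sn, ord=2), 1.0)

# Spot-check the Lipschitz bound on a random input pair
x, y = rng.normal(size=32), rng.normal(size=32)
assert np.linalg.norm(W_sn @ x - W_sn @ y) <= np.linalg.norm(x - y) + 1e-9
```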

Example 3: Compactness Failure and Optimization

Consider minimizing $f(x) = e^{-x}$ over $\mathbb{R}$. The infimum is 0, but it is never achieved: $\inf_{x \in \mathbb{R}} f(x) = 0$, yet $f(x) > 0$ for all $x$.

This is compactness failure: $\mathbb{R}$ is closed but not bounded (hence not compact), so Weierstrass doesn't apply.

Fix: Restrict to $[-M, M]$ for some $M > 0$ (compact). Now the minimum of $f$ on $[-M, M]$ is $f(M) = e^{-M} > 0$ — achieved, but larger than the unconstrained infimum.

In ML: minimizing a neural network loss over all of weight space carries no guarantee that a minimum is achieved (the loss might asymptotically approach its infimum as weights → ∞). Weight decay / $\ell^2$ regularization makes the objective grow at large weights, effectively restricting the search to a compact ball and ensuring the regularized problem has a solution.
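
A sketch of the same fix in code: the regularized objective $g(x) = e^{-x} + \lambda x^2$ is convex and grows at infinity, so its minimum is attained and a simple ternary search finds it (the choice $\lambda = 0.01$ is arbitrary, for illustration):

```python
import math

def f(x):
    return math.exp(-x)           # inf over R is 0, never attained

lam = 0.01
def g(x):
    return f(x) + lam * x * x     # coercive: g(x) -> inf as |x| -> inf

# g is convex (g'' = e^{-x} + 2*lam > 0), so ternary search on a
# bracketing interval converges to its attained minimizer
lo, hi = 0.0, 100.0
for _ in range(200):
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if g(m1) < g(m2):
        hi = m2
    else:
        lo = m1
x_star = (lo + hi) / 2
print(x_star, g(x_star))   # a finite minimizer exists for g, unlike for f
```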

Connections

Where Your Intuition Breaks

The most common misconception: all reasonable distances are essentially the same in $\mathbb{R}^n$. It's true that the $\ell^1$, $\ell^2$, and $\ell^\infty$ metrics are equivalent (each bounded by a constant multiple of the others), meaning they define the same open sets and the same convergent sequences. But "equivalent" does not mean "interchangeable for algorithms." In high dimensions, $k$-NN with $\ell^2$ distance assigns nearly equal distances to all points (the curse of dimensionality), while $\ell^1$ distance concentrates differently. And the Mahalanobis and cosine metrics are not equivalent to Euclidean — they encode qualitatively different geometry. Metric equivalence tells you about topology; it says nothing about which distances are meaningful for your specific learning problem.

Metric Spaces vs Normed Spaces vs Inner Product Spaces

| Structure | Additional axioms | Gives you |
|---|---|---|
| Metric space | $d \geq 0$, symmetry, triangle inequality | Convergence, continuity |
| Normed space | $\lVert \alpha x \rVert = \lvert\alpha\rvert\,\lVert x \rVert$, $\lVert x + y \rVert \leq \lVert x \rVert + \lVert y \rVert$, $\lVert x \rVert = 0 \iff x = 0$ | Lengths compatible with vector space structure |
| Inner product space | $\langle x, y \rangle$ bilinear, symmetric, positive definite | Angles, projections, Gram-Schmidt |
| Banach space | Normed + complete | Convergent series, fixed-point theorems |
| Hilbert space | Inner product + complete | Projection theorem, spectral theory |

Every inner product space → normed space → metric space (via $d(x, y) = \lVert x - y \rVert$), but not conversely.

💡Intuition

Why topology, not just calculus? Classical calculus assumes you're working in $\mathbb{R}$ or $\mathbb{R}^n$ with the Euclidean metric. But many ML objects live in other spaces: probability distributions (infinite-dimensional), strings (discrete metric), graphs (graph edit distance), neural network weight spaces (high-dimensional, non-Euclidean geometry near flat regions). Topology provides the language to discuss continuity and convergence in all these settings uniformly.

💡Intuition

The Manifold Hypothesis. High-dimensional data (images, text, audio) doesn't fill $\mathbb{R}^d$ uniformly — it concentrates near a low-dimensional manifold embedded in $\mathbb{R}^d$. If this manifold has dimension $k \ll d$, then distances between data points that respect the manifold geometry (geodesic distances) are far more meaningful than Euclidean distances, which cut through the ambient space. Topology formalizes what "manifold" means; Lesson 4 develops the machinery.

⚠️Warning

The curse of dimensionality is a topological phenomenon. In high dimensions, the volume of a ball is concentrated near its surface: for $B(0, 1) \subset \mathbb{R}^d$, the shell $\{r \leq \lVert x \rVert \leq 1\}$ contains a $1 - r^d$ fraction of the volume — for $r = 0.9$ and $d = 100$, this is $1 - 0.9^{100} \approx 1$. Almost all the volume is in the outer shell. This means the notion of "nearby" in high-dimensional Euclidean space is misleading: everything is approximately equally far from everything else, making distance-based algorithms (nearest neighbors, $k$-means) degrade.
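
Both claims — the shell-volume formula and distance concentration — can be checked empirically; a NumPy sketch (sampling uniformly from the unit cube is an illustrative choice):

```python
import numpy as np

# Volume fraction of the unit ball beyond radius r in dimension d: 1 - r^d
for d in (2, 10, 100):
    print(d, 1 - 0.9 ** d)        # 0.19, ~0.65, ~0.99997

def rel_spread(d, n=200, seed=0):
    """Relative spread (std/mean) of pairwise l2 distances between
    n points drawn uniformly from the cube [0, 1]^d."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n, d))
    G = X @ X.T                   # Gram matrix for the squared-distance trick
    sq = np.diag(G)
    d2 = np.clip(sq[:, None] + sq[None, :] - 2 * G, 0, None)
    vals = np.sqrt(d2[np.triu_indices(n, k=1)])
    return vals.std() / vals.mean()

print(rel_spread(2), rel_spread(1000))  # the spread collapses as d grows
```

The shrinking ratio is exactly the "everything is equally far from everything else" effect that degrades nearest-neighbor methods.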
