Topology Primer: Metric Spaces, Continuity & Compactness
Topology gives precise meaning to the intuitive notions of "nearby," "approaching a limit," and "continuously varying" — ideas that are invoked constantly in machine learning yet rarely defined with precision. Understanding metric spaces and their topology illuminates why gradient descent converges, why Lipschitz continuity is a stability guarantee, why compactness implies optimization problems have solutions, and why the choice of distance function shapes the geometry of embedding spaces.
Concepts
[Interactive demo: Lipschitz continuity. Drag a point along f(x) = x and watch the cone; with Lipschitz constant L = 1, every chord has slope ≤ 1, so the graph stays inside the cone of slope ±1.]
A function is L-Lipschitz if the graph stays inside the yellow cone at every point. The cone has slope ±L — bounding how fast the function can change. ML relevance: spectral normalization enforces L=1 for discriminator weights in GANs.
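The cone condition can be checked numerically by sampling chord slopes. A minimal sketch (the helper `estimate_lipschitz` is illustrative, not from any library):

```python
import numpy as np

def estimate_lipschitz(f, lo=-5.0, hi=5.0, n=2000, seed=0):
    """Lower-bound Lip(f) by the steepest chord among random pairs of points."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, n)
    y = rng.uniform(lo, hi, n)
    mask = x != y
    slopes = np.abs(f(x[mask]) - f(y[mask])) / np.abs(x[mask] - y[mask])
    return slopes.max()

print(estimate_lipschitz(lambda t: t))    # 1.0 — every chord of f(x) = x has slope 1
print(estimate_lipschitz(np.sin) <= 1.0 + 1e-9)   # True — sin is 1-Lipschitz
```

Sampling can only underestimate the true constant (it sees finitely many chords), which is why the sin estimate approaches 1 from below.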
When you write loss.backward() in PyTorch, the optimizer takes a step of size lr * gradient. For this to work reliably — not diverge, not oscillate — the loss function must not change too fast relative to the step size. That "not too fast" condition is Lipschitz continuity: the guarantee that the output moves by at most a fixed multiple of how far the input moves. Metric spaces are the language for making "distance between inputs" and "distance between outputs" precise across all settings — whether your space is $\mathbb{R}^n$, a space of probability distributions, or the set of all possible sentences.
Metric Spaces
A metric space is a set $X$ together with a metric $d : X \times X \to \mathbb{R}$ satisfying:
- Non-negativity: $d(x, y) \ge 0$, with $d(x, y) = 0 \iff x = y$
- Symmetry: $d(x, y) = d(y, x)$
- Triangle inequality: $d(x, z) \le d(x, y) + d(y, z)$
These three axioms are the minimum needed to make "nearby" mean something consistent. Non-negativity ensures distances are comparable; symmetry ensures going from $x$ to $y$ costs the same as going from $y$ to $x$; and the triangle inequality — the deepest of the three — ensures you can't shortcut a path by going through a third point. Without the triangle inequality, the notion of convergence breaks: "$x$ close to $y$" and "$y$ close to $z$" would no longer imply "$x$ close to $z$", so a sequence could hop through "nearby" points without ever actually approaching its limit.
Common metrics on $\mathbb{R}^n$:
| Name | Formula | Unit ball shape | Use in ML |
|---|---|---|---|
| Euclidean ($\ell_2$) | $\sqrt{\sum_i (x_i - y_i)^2}$ | Sphere | Default; PCA, $k$-means |
| Manhattan ($\ell_1$) | $\sum_i \lvert x_i - y_i \rvert$ | Cross-polytope (diamond) | Sparsity; lasso |
| Chebyshev ($\ell_\infty$) | $\max_i \lvert x_i - y_i \rvert$ | Cube | Worst-case (max) error |
| Mahalanobis | $\sqrt{(x - y)^\top \Sigma^{-1} (x - y)}$ | Ellipsoid | Accounts for covariance |
| Cosine distance | $1 - \dfrac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$ | — | NLP embeddings |
| Edit distance | Min. insertions/deletions/substitutions | — | Strings, sequences |
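The table's vector metrics in code — a sketch with made-up vectors and a hypothetical covariance matrix:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

d2   = np.sqrt(np.sum((x - y) ** 2))    # Euclidean (l2)
d1   = np.sum(np.abs(x - y))            # Manhattan (l1)
dinf = np.max(np.abs(x - y))            # Chebyshev (l-infinity)
dcos = 1 - (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))  # cosine distance

Sigma = np.diag([1.0, 4.0, 9.0])        # hypothetical covariance matrix
dmah = np.sqrt((x - y) @ np.linalg.inv(Sigma) @ (x - y))

print(d1, d2, dinf)   # 5.0  3.605...  3.0 — note d_inf <= d_2 <= d_1
```

The ordering $d_\infty \le d_2 \le d_1$ holds for every pair of vectors, which is the seed of the metric-equivalence discussion below.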
Open and closed balls. For $x \in X$ and $r > 0$:

$$B(x, r) = \{\, y \in X : d(x, y) < r \,\}, \qquad \bar{B}(x, r) = \{\, y \in X : d(x, y) \le r \,\}$$
Topology: Open Sets and Convergence
A set $U \subseteq X$ is open if for every $x \in U$ there exists $r > 0$ such that $B(x, r) \subseteq U$ — every point has room to breathe.
A set $C \subseteq X$ is closed if its complement is open — equivalently, $C$ contains all its limit points.
Convergence. A sequence $(x_n)$ converges to $x$ (written $x_n \to x$) if for every $\varepsilon > 0$ there exists $N$ such that $d(x_n, x) < \varepsilon$ for all $n \ge N$.
Cauchy sequence. $(x_n)$ is Cauchy if for every $\varepsilon > 0$ there exists $N$ such that $d(x_m, x_n) < \varepsilon$ for all $m, n \ge N$. Every convergent sequence is Cauchy; the converse holds in complete metric spaces.
Complete metric spaces. A metric space in which every Cauchy sequence converges. Examples: $\mathbb{R}^n$ (Euclidean), $\ell^2$ (square-summable sequences), $L^2$ (square-integrable functions). Non-example: $\mathbb{Q}$ (rationals) — the sequence of decimal truncations $3,\ 3.1,\ 3.14,\ 3.141,\ \dots$ is Cauchy but converges to $\pi \notin \mathbb{Q}$.
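A quick sketch of the non-example, representing decimal truncations of $\pi$ as exact rationals with Python's `fractions`:

```python
from fractions import Fraction
import math

# Decimal truncations of pi: each x_k is a rational number (3, 3.1, 3.14, ...)
xs = [Fraction(int(math.pi * 10**k), 10**k) for k in range(8)]

# Cauchy: consecutive gaps shrink below any epsilon...
gaps = [float(abs(xs[i + 1] - xs[i])) for i in range(len(xs) - 1)]
print(gaps)

# ...yet the limit, pi, lies outside the rationals
print(float(xs[-1]))   # 3.1415926
```

Every term lives in $\mathbb{Q}$ and the gaps shrink geometrically, but no rational number is the limit — the "hole" at $\pi$ is what completeness rules out.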
Continuity and Its Characterizations
A function $f : (X, d_X) \to (Y, d_Y)$ between metric spaces is continuous at $x_0$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that

$$d_X(x, x_0) < \delta \implies d_Y(f(x), f(x_0)) < \varepsilon$$
Equivalent characterizations:
- $f$ is continuous $\iff$ for every open $V \subseteq Y$, the preimage $f^{-1}(V)$ is open in $X$
- $f$ is continuous $\iff$ $x_n \to x$ implies $f(x_n) \to f(x)$ (sequential continuity)
Lipschitz Continuity
$f : X \to Y$ is $L$-Lipschitz (or has Lipschitz constant $L$) if for all $x, x' \in X$:

$$d_Y(f(x), f(x')) \le L \, d_X(x, x')$$

Lipschitz continuity is uniform continuity with an explicit rate. The Lipschitz constant is the smallest such $L$:

$$\operatorname{Lip}(f) = \sup_{x \ne x'} \frac{d_Y(f(x), f(x'))}{d_X(x, x')}$$

For differentiable $f : \mathbb{R}^n \to \mathbb{R}$: $f$ is $L$-Lipschitz $\iff$ $\lVert \nabla f(x) \rVert \le L$ for all $x$.
$\beta$-smooth functions. A differentiable $f$ is $\beta$-smooth if $\nabla f$ is $\beta$-Lipschitz:

$$\lVert \nabla f(x) - \nabla f(y) \rVert \le \beta \lVert x - y \rVert$$

Equivalently, $\nabla^2 f(x) \preceq \beta I$ everywhere (Hessian eigenvalues bounded by $\beta$). This is the standard condition for gradient descent convergence: step size $\eta \le 1/\beta$ guarantees descent.
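A minimal sketch of that step-size rule on a toy quadratic (the matrix $A$ and the starting point are made up; $\beta$ is the top eigenvalue of $A$, the Hessian of $f(x) = \tfrac12 x^\top A x$):

```python
import numpy as np

# Toy objective f(x) = 0.5 * x^T A x; its Hessian is A everywhere,
# so beta is the largest eigenvalue of A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
beta = np.linalg.eigvalsh(A).max()

def f(x):
    return 0.5 * x @ A @ x

def grad(x):
    return A @ x

def run(eta, steps=50):
    x = np.array([1.0, -1.0])
    losses = [f(x)]
    for _ in range(steps):
        x = x - eta * grad(x)
        losses.append(f(x))
    return losses

good = run(1.0 / beta)   # eta <= 1/beta: guaranteed descent
bad  = run(2.5 / beta)   # eta > 2/beta: diverges
print(all(a >= b for a, b in zip(good, good[1:])))  # True — loss never increases
print(bad[-1] > bad[0])                             # True — loss blew up
```

For a quadratic, each gradient step multiplies the $i$-th eigencomponent by $1 - \eta \lambda_i$; with $\eta = 1/\beta$ every factor lies in $[0, 1)$, while $\eta > 2/\beta$ makes the top factor exceed 1 in magnitude.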
Compactness
Definition. A set $K$ is compact if every open cover of $K$ has a finite subcover — informally, $K$ is "small enough" that you can't escape to infinity or to the boundary without being caught.
Heine-Borel Theorem. In $\mathbb{R}^n$, a set $K$ is compact $\iff$ $K$ is closed and bounded.
Sequential compactness. In a metric space, $K$ is compact $\iff$ every sequence in $K$ has a convergent subsequence (with limit in $K$).
Why compactness matters for optimization:
Weierstrass Extreme Value Theorem. If $f : K \to \mathbb{R}$ is continuous and $K$ is compact, then $f$ achieves its maximum and minimum on $K$.
In ML: why does ERM (empirical risk minimization) have a solution? Because we restrict to a compact hypothesis class (e.g., weights in a bounded set), and the loss is continuous. Without compactness, you might chase a sequence of models approaching the infimum without ever reaching it.
Topological Spaces and Beyond
A topological space generalizes metric spaces by axiomatizing the collection of open sets directly. Every metric space induces a topology (open sets = arbitrary unions of open balls), but not every topology is metrizable.
Key topological properties:
| Property | Definition | ML relevance |
|---|---|---|
| Hausdorff | Distinct points have disjoint open neighborhoods | Limits are unique; standard in analysis |
| Connected | Cannot be split into two disjoint nonempty open sets | Convex sets are connected |
| Path-connected | Any two points joined by a continuous path | Manifold hypothesis: data lives on a connected manifold |
| Second-countable | Countable base for the topology | Required for manifold theory |
Worked Examples
Example 1: Convergence in Different Metrics
Consider the sequence $x_n = (1/n, 1/n)$ in $\mathbb{R}^2$, with limit $\mathbf{0}$:
- $d_2(x_n, \mathbf{0}) = \sqrt{2}/n \to 0$ — converges in $\ell_2$
- $d_1(x_n, \mathbf{0}) = 2/n \to 0$ — converges in $\ell_1$
- $d_\infty(x_n, \mathbf{0}) = 1/n \to 0$ — converges in $\ell_\infty$
All three metrics give the same convergent sequences in $\mathbb{R}^n$ (they are equivalent metrics — each is bounded above and below by constant multiples of the others). For the $k$-NN algorithm, however, the choice of metric dramatically changes which neighbors are "nearest" in high dimensions.
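Both halves of that observation can be sketched numerically (the data sizes below are arbitrary): the equivalence bounds $d_\infty \le d_2 \le d_1 \le n\,d_\infty$ hold for every pair of points, yet the identity of the nearest neighbor can change with the metric.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 16))   # 200 candidate points in R^16
q = rng.normal(size=16)          # query point

diff = X - q
d1   = np.abs(diff).sum(axis=1)
d2   = np.sqrt((diff ** 2).sum(axis=1))
dinf = np.abs(diff).max(axis=1)

# Equivalence bounds hold pointwise...
print(np.all(dinf <= d2 + 1e-12) and np.all(d2 <= d1 + 1e-12))  # True
print(np.all(d1 <= 16 * dinf + 1e-12))                          # True

# ...but the "nearest" point's index may differ per metric
print(int(d1.argmin()), int(d2.argmin()), int(dinf.argmin()))
```

Equivalence is a statement about which sequences converge, not about which point ends up closest.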
Example 2: The Lipschitz Constant of a Neural Network Layer
A fully-connected layer $f(x) = \sigma(Wx + b)$ with $W \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $\sigma$ being ReLU.
Lipschitz constant:
- The affine map $x \mapsto Wx + b$ is $\sigma_{\max}(W)$-Lipschitz, where $\sigma_{\max}(W)$ is the largest singular value of $W$.
- ReLU is 1-Lipschitz: $\lvert \max(0, a) - \max(0, b) \rvert \le \lvert a - b \rvert$.
- Composition: the layer is $\sigma_{\max}(W)$-Lipschitz (Lipschitz constants multiply under composition).
Spectral normalization (Miyato et al., 2018) divides $W$ by its largest singular value:

$$W_{\mathrm{SN}} = \frac{W}{\sigma_{\max}(W)},$$
making each layer 1-Lipschitz and the full network 1-Lipschitz (since a composition of 1-Lipschitz maps is 1-Lipschitz). This stabilizes GAN training by preventing the discriminator from varying too rapidly, making its gradients more reliable.
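A sketch of the underlying computation: estimating $\sigma_{\max}(W)$ by power iteration, as spectral normalization does (the paper keeps one persistent iteration vector per layer; here we simply iterate on a test matrix constructed to have known singular values):

```python
import numpy as np

def spectral_norm(W, n_iters=100, seed=0):
    """Estimate sigma_max(W) by alternating power iteration on W and W^T."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=W.shape[0])
    v = None
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return u @ W @ v   # converges to the top singular value

# Build a test matrix with known singular values 4 > 2 > 1
rng = np.random.default_rng(42)
U, _, Vt = np.linalg.svd(rng.normal(size=(5, 3)), full_matrices=False)
W = U @ np.diag([4.0, 2.0, 1.0]) @ Vt

sigma = spectral_norm(W)
W_sn = W / sigma          # spectrally normalized: sigma_max(W_sn) = 1
print(round(sigma, 6))    # 4.0
```

Power iteration converges geometrically at rate $(\sigma_2/\sigma_1)^2$, which is why a single iteration per training step suffices in practice once the vector has warmed up.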
Example 3: Compactness Failure and Optimization
Consider minimizing $f(x) = e^{-x}$ over $[0, \infty)$. The infimum is 0, but it is never achieved: $e^{-x} \to 0$ as $x \to \infty$, but $e^{-x} > 0$ for all $x$.
This is compactness failure: $[0, \infty)$ is closed but not bounded (hence not compact), so Weierstrass doesn't apply.
Fix: Restrict to $[0, M]$ for some $M > 0$ (compact). Now the minimum of $e^{-x}$ on $[0, M]$ is $e^{-M}$ — achieved, but larger than the unconstrained infimum.
In ML: neural network loss minimization over all of weight space has no guarantee of achieving a minimum (the loss might asymptotically approach zero as weights → ∞). Weight decay / regularization effectively restricts the search to a compact ball, ensuring the regularized problem has a solution.
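A toy numerical version of that fix: minimizing $e^{-x}$ on a grid, with and without a quadratic penalty $\lambda x^2$ (the grid and the strength $\lambda = 10^{-3}$ are made up for illustration):

```python
import numpy as np

lam = 1e-3                               # made-up weight-decay strength
xs = np.linspace(0.0, 50.0, 500001)
f  = np.exp(-xs)                         # infimum 0, never attained
g  = np.exp(-xs) + lam * xs**2           # regularized objective

print(f.min())                           # tiny but positive: the inf is chased, not reached
i = int(g.argmin())
print(xs[i], g[i])                       # an attained minimum in the interior
```

The unregularized minimum just slides to the edge of whatever grid you use; the penalty turns the objective coercive, so its minimum sits at a finite point regardless of how far the grid extends.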
Connections
Where Your Intuition Breaks
The most common misconception: all reasonable distances are essentially the same in . It's true that , , and metrics are equivalent (each bounded by a constant multiple of the others), meaning they define the same open sets and the same convergent sequences. But "equivalent" does not mean "interchangeable for algorithms." In high dimensions, -NN with distance assigns nearly equal distances to all points (the curse of dimensionality), while distance concentrates differently. And the Mahalanobis and cosine metrics are not equivalent to Euclidean — they encode qualitatively different geometry. Metric equivalence tells you about topology; it says nothing about which distances are meaningful for your specific learning problem.
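The concentration effect is easy to witness: the relative spread of $\ell_2$ distances from a random query to random points collapses as the dimension grows (sample sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
spreads = {}
for d in (2, 100, 10000):
    X = rng.uniform(size=(500, d))       # 500 random points in the unit cube
    q = rng.uniform(size=d)              # random query
    dist = np.linalg.norm(X - q, axis=1)
    spreads[d] = (dist.max() - dist.min()) / dist.min()
    print(d, round(spreads[d], 3))       # relative spread shrinks with d
```

In low dimension the farthest point is many times farther than the nearest; in high dimension all 500 distances cluster tightly around $\sqrt{d/6}$, so "nearest" carries little signal.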
Metric Spaces vs Normed Spaces vs Inner Product Spaces
| Structure | Additional axioms | Gives you |
|---|---|---|
| Metric space | $d \ge 0$, symmetry, triangle inequality | Convergence, continuity |
| Normed space | $\lVert \alpha x \rVert = \lvert \alpha \rvert \lVert x \rVert$, $\lVert x + y \rVert \le \lVert x \rVert + \lVert y \rVert$, $\lVert x \rVert = 0 \iff x = 0$ | Lengths; vector-space structure |
| Inner product space | bilinear, symmetric, PD | Angles, projections, Gram-Schmidt |
| Banach space | Normed + complete | Convergent series, fixed-point theorems |
| Hilbert space | Inner product + complete | Projection theorem, spectral theory |
Every inner product space → normed space → metric space (via $\lVert x \rVert = \sqrt{\langle x, x \rangle}$ and $d(x, y) = \lVert x - y \rVert$), but not conversely.
Why topology, not just calculus? Classical calculus assumes you're working in or with the Euclidean metric. But many ML objects live in other spaces: probability distributions (infinite-dimensional), strings (discrete metric), graphs (graph edit distance), neural network weight spaces (high-dimensional, non-Euclidean geometry near flat regions). Topology provides the language to discuss continuity and convergence in all these settings uniformly.
The Manifold Hypothesis. High-dimensional data (images, text, audio) doesn't fill $\mathbb{R}^D$ uniformly — it concentrates near a low-dimensional manifold embedded in $\mathbb{R}^D$. If this manifold has dimension $d \ll D$, then distances between data points that respect the manifold geometry (geodesic distances) are far more meaningful than Euclidean distances, which cut through the ambient space. Topology formalizes what "manifold" means; Lesson 4 develops the machinery.
The curse of dimensionality is a topological phenomenon. In high dimensions, the volume of a ball is concentrated near its surface: for the unit ball in $\mathbb{R}^n$, the shell between radius $1 - \varepsilon$ and $1$ contains fraction $1 - (1 - \varepsilon)^n$ of the volume — for $n = 100$ and $\varepsilon = 0.1$, this is $1 - 0.9^{100} \approx 0.99997$. Almost all the volume is in the outer shell. This means the notion of "nearby" in high-dimensional Euclidean space is misleading: everything is approximately equally far from everything else, making distance-based algorithms (nearest neighbors, $k$-means) degrade.
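The shell-fraction formula in code, with a Monte Carlo sanity check in the plane (the sample count is arbitrary):

```python
import numpy as np

def shell_fraction(n, eps=0.1):
    """Fraction of the unit ball's volume within eps of its surface."""
    return 1 - (1 - eps) ** n

for n in (2, 10, 100, 1000):
    print(n, shell_fraction(n))          # 0.19, 0.651..., 0.99997..., ~1.0

# Monte Carlo check for n = 2: rejection-sample the unit disk
rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(200000, 2))
r = np.linalg.norm(pts, axis=1)
r = r[r <= 1]
print(np.mean(r > 0.9))                  # close to 0.19
```

Even in the plane, a tenth of the radius already holds nearly a fifth of the area; the exponent $n$ does the rest as dimension grows.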