Smooth Manifolds & Tangent Spaces
A smooth manifold is a space that locally looks like flat Euclidean space R^n but globally can be curved and topologically nontrivial. The tangent space at each point is the linear approximation — the "flat world" visible from that location. Understanding manifolds is essential for Riemannian optimization (descent on curved parameter spaces), the manifold hypothesis in representation learning, and the geometry of probability distributions.
Concepts
Interactive figure: Manifold & Tangent Space — drag a slider to move a point p along the unit circle. At parameter θ ≈ 0.785, p ≈ (0.707, 0.707) and the tangent direction at p is (−0.707, 0.707).
The unit circle is a 1D manifold embedded in R². Every point has a tangent line — the 1D tangent space T_p(S¹).
The yellow line is the tangent space — a local linear approximation to the curved manifold. As you move p, the tangent "rolls" along the curve. A chart maps the manifold locally to flat R^k.
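The figure above can be reproduced numerically: at parameter θ the point on S¹ is (cos θ, sin θ) and its unit tangent is (−sin θ, cos θ), which is always orthogonal to the radius. A minimal sketch:

```python
import numpy as np

# A point p on the unit circle S^1 and its 1D tangent space T_p(S^1).
# Parameter 0.785 (~pi/4) is the slider position from the figure above.
theta = 0.785
p = np.array([np.cos(theta), np.sin(theta)])   # point on S^1
t = np.array([-np.sin(theta), np.cos(theta)])  # unit tangent direction at p

print(p)             # ~ (0.707, 0.707)
print(t)             # ~ (-0.707, 0.707)
print(np.dot(p, t))  # tangent is orthogonal to the radius: ~0
```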
The surface of the Earth is a 2D manifold — locally it looks flat (you can use a city map without spherical corrections), but globally it is curved and finite. This is the manifold idea: a space that is locally Euclidean but globally non-trivial. The tangent space at each point is the flat map that works locally — the "city map" for that neighborhood. In machine learning, the manifold hypothesis says that high-dimensional data (images, text embeddings) lives near a low-dimensional curved surface inside the ambient space, which is why autoencoders and dimensionality reduction work at all.
Smooth Manifolds
Definition. A smooth n-manifold M is a topological space with an atlas — a collection of charts (U_α, φ_α) where:
- {U_α} is an open cover of M
- Each φ_α : U_α → R^n is a homeomorphism onto an open subset of R^n (continuous bijection with continuous inverse)
- Smoothness condition: whenever U_α ∩ U_β ≠ ∅, the transition map φ_β ∘ φ_α⁻¹ : φ_α(U_α ∩ U_β) → φ_β(U_α ∩ U_β) is a smooth (C^∞) diffeomorphism
Each chart provides local coordinates — a coordinate system valid in the neighborhood U_α. The atlas axioms ensure these coordinate systems glue together consistently. The smoothness condition on transition maps is exactly what you need: without it, you could have a space that looks flat in each chart individually but with incompatible coordinate systems across charts — calculus would be undefined at the seams. Requiring C^∞ transition maps is the minimum condition to make differentiation well-defined globally, independent of which chart you choose to work in.
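To make the transition-map condition concrete, here is a sketch for the circle S¹ with two stereographic charts (from the north pole N = (0, 1) and south pole S = (0, −1)); the function names phi_N / phi_S are illustrative, not a standard API. On the overlap, the transition map works out to u ↦ 1/u, which is smooth wherever both charts are defined (u ≠ 0):

```python
import numpy as np

# Two stereographic charts on S^1 and their transition map.
def phi_N(p):       # chart from the north pole, defined on S^1 \ {N}
    x, y = p
    return x / (1 - y)

def phi_S(p):       # chart from the south pole, defined on S^1 \ {S}
    x, y = p
    return x / (1 + y)

def phi_N_inv(u):   # inverse chart: R -> S^1 \ {N}
    return np.array([2*u, u**2 - 1]) / (u**2 + 1)

# On the overlap (both poles removed), phi_S o phi_N^{-1} is u |-> 1/u —
# a smooth map away from u = 0, as the atlas axioms require.
for u in [0.5, -2.0, 3.7]:
    assert np.isclose(phi_S(phi_N_inv(u)), 1/u)
print("transition map phi_S o phi_N^{-1}(u) = 1/u verified")
```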
Examples:
| Manifold | Dimension | Description |
|---|---|---|
| R^n | n | Trivial: one chart, identity map |
| S^n (sphere) | n | Two charts: stereographic projection from North and South poles |
| SO(n) (rotation matrices) | n(n−1)/2 | Lie group of orthogonal matrices with det=1 |
| S^n_{++} (PD matrices) | n(n+1)/2 | Open cone in the space of symmetric matrices |
| St(n, k) (Stiefel manifold) | nk − k(k+1)/2 | n×k matrices with orthonormal columns |
| Gr(n, k) (Grassmannian) | k(n−k) | k-dimensional subspaces of R^n |
Tangent Spaces
The tangent space T_p M at a point p ∈ M is the set of all velocity vectors of smooth curves through p — it is the best linear approximation to M at p.
Formal definition. A tangent vector at p is an equivalence class of smooth curves γ : (−ε, ε) → M with γ(0) = p, where γ₁ ~ γ₂ if they have the same velocity in any local chart: (φ ∘ γ₁)'(0) = (φ ∘ γ₂)'(0).
In coordinates (x¹, …, x^n) near p: T_p M ≅ R^n, with basis {∂/∂x¹, …, ∂/∂x^n} (partial derivative operators). A tangent vector v = Σᵢ vⁱ ∂/∂xⁱ acts on smooth functions by directional differentiation: v(f) = Σᵢ vⁱ ∂f/∂xⁱ.
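The two views of a tangent vector — components acting on partial derivatives, and the velocity of a curve — agree numerically. A small sketch with an arbitrary toy function f(x, y) = x² sin y:

```python
import numpy as np

# A tangent vector acting on a function by directional differentiation:
# v(f) = sum_i v^i df/dx^i, checked against the velocity of f along a
# curve gamma(t) = p + t*v with gamma(0) = p, gamma'(0) = v.
def f(x):
    return x[0]**2 * np.sin(x[1])

p = np.array([1.0, 0.5])
v = np.array([2.0, -1.0])   # tangent vector components v^i

# coordinate formula: contract v with the analytic partials of f
grad = np.array([2*p[0]*np.sin(p[1]), p[0]**2 * np.cos(p[1])])
vf_coords = v @ grad

# curve formula: central finite difference of f along the curve
h = 1e-6
vf_curve = (f(p + h*v) - f(p - h*v)) / (2*h)
print(vf_coords, vf_curve)  # the two values agree
```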
Tangent bundle. TM = ⊔_{p∈M} T_p M — the disjoint union of all tangent spaces. A vector field is a smooth section of TM: an assignment p ↦ X(p) ∈ T_p M varying smoothly with p.
Smooth Maps and the Differential
A map F : M → N between smooth manifolds is smooth if it is smooth in every pair of charts. The differential (or pushforward) of F at p is the linear map:
dF_p : T_p M → T_{F(p)} N, defined by dF_p(γ'(0)) = (F ∘ γ)'(0). In local coordinates, the differential is represented by the Jacobian matrix of the coordinate expression of F.
Chain rule on manifolds. For F : M → N and G : N → P:
d(G ∘ F)_p = dG_{F(p)} ∘ dF_p
This is the manifold version of the matrix chain rule.
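In coordinates the chain rule is literally Jacobian multiplication. A sketch with two arbitrary toy maps F : R² → R³ and G : R³ → R², differentiated numerically:

```python
import numpy as np

# Chain rule d(G o F)_p = dG_{F(p)} o dF_p, realized in coordinates as
# a product of Jacobian matrices.
def F(x):
    return np.array([x[0]*x[1], np.sin(x[0]), x[1]**2])

def G(y):
    return np.array([y[0] + y[2], y[1]*y[2]])

def jacobian(f, x, h=1e-6):
    # numerical Jacobian by central differences, one column per input dim
    cols = []
    for i in range(len(x)):
        e = np.zeros(len(x)); e[i] = h
        cols.append((f(x + e) - f(x - e)) / (2*h))
    return np.stack(cols, axis=1)

p = np.array([0.3, 1.2])
J_comp = jacobian(lambda x: G(F(x)), p)        # d(G o F)_p directly
J_chain = jacobian(G, F(p)) @ jacobian(F, p)   # dG_{F(p)} . dF_p
print(np.allclose(J_comp, J_chain, atol=1e-5)) # True
```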
Lie Groups and Lie Algebras
A Lie group is both a smooth manifold and a group, with group operations (multiplication and inversion) being smooth maps.
The Lie algebra 𝔤 = T_e G (the tangent space at the identity e) carries the infinitesimal group structure. The exponential map exp : 𝔤 → G gives a canonical way to "integrate" a Lie algebra element to a group element, generalizing the matrix exponential exp(A) = Σ_{k≥0} A^k/k! for matrix Lie groups.
Matrix Lie groups and their Lie algebras:
| Lie group | Lie algebra | Elements |
|---|---|---|
| GL(n) (invertible) | 𝔤𝔩(n) | All n×n matrices |
| SO(n) (rotations) | 𝔰𝔬(n) | Skew-symmetric: Aᵀ = −A |
| U(n) (unitary) | 𝔲(n) | Skew-Hermitian: A† = −A |
| SL(n) (det=1) | 𝔰𝔩(n) | Traceless: tr(A) = 0 |
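A quick numerical check of the SO(n) row: exponentiating any skew-symmetric matrix lands in the rotation group — the result is orthogonal with determinant +1. A sketch using scipy's matrix exponential:

```python
import numpy as np
from scipy.linalg import expm

# exp maps the Lie algebra so(3) (skew-symmetric matrices) into SO(3).
rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
A = B - B.T                      # skew-symmetric: A^T = -A
R = expm(A)                      # matrix exponential

print(np.allclose(R.T @ R, np.eye(3)))    # orthogonal: True
print(np.isclose(np.linalg.det(R), 1.0))  # det = +1: True
```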
Optimization on Lie groups. Gradient descent on a Lie group G uses the Lie algebra: compute the gradient in 𝔤 (the tangent space at the identity), map it to the current point via left-translation, and update. This is Riemannian gradient descent on Lie groups.
The Manifold Hypothesis
Hypothesis. High-dimensional data (images, text, audio) lies near or on a low-dimensional smooth manifold M of dimension d embedded in the ambient space R^D, where d ≪ D.
Evidence: Interpolation in latent spaces (e.g., walking between two images in a VAE's latent space produces realistic intermediate images), intrinsic dimensionality estimation showing d is much smaller than D, and the success of compressed representations.
Consequences:
- Learning reduces to estimating the manifold or a function on it
- Dimensionality reduction is compression without information loss (if the data truly lies on a d-dimensional manifold)
- Geodesic distance on the manifold is more meaningful than Euclidean distance through the ambient space
- Autoencoders implicitly learn to map data onto a low-dimensional manifold (the latent space)
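The geodesic-distance point above can be made concrete with a 1D manifold in R² — a spiral is a stand-in for curled-up data, an illustrative choice. Two points on adjacent turns are close through the ambient plane but far apart along the manifold:

```python
import numpy as np

# Geodesic vs. Euclidean distance on a spiral (a 1D manifold in R^2).
def spiral(t):
    return 0.5 * t * np.array([np.cos(t), np.sin(t)])

t1, t2 = 2*np.pi, 4*np.pi                 # same direction, adjacent turns
p, q = spiral(t1), spiral(t2)

euclid = np.linalg.norm(p - q)            # straight line through ambient R^2
ts = np.linspace(t1, t2, 10000)           # polyline approximation of arc length
pts = np.array([spiral(t) for t in ts])
geodesic = np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))

print(euclid, geodesic)                   # geodesic is several times larger
```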
Worked Example
Example 1: Charts for S² (the 2-sphere)
The 2-sphere S² = {x ∈ R³ : ‖x‖ = 1} is a 2-manifold.
Stereographic projection from the North Pole N = (0, 0, 1):
φ_N(x, y, z) = (x/(1−z), y/(1−z))
This maps S² \ {N} bijectively to R². A second chart φ_S from the South Pole covers the North Pole. Together they form an atlas for S².
Tangent space at p ∈ S²: the tangent vectors are all vectors in R³ orthogonal to the outward normal p. So T_p S² = {v ∈ R³ : ⟨v, p⟩ = 0}.
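A sketch of this example in code: the standard inverse of the north-pole chart, a round-trip check, and a verification that coordinate velocity vectors are tangent to the sphere (orthogonal to p):

```python
import numpy as np

# Stereographic chart from the north pole N=(0,0,1) and the tangent
# space of S^2.
def phi_N(p):                    # S^2 \ {N} -> R^2
    x, y, z = p
    return np.array([x, y]) / (1 - z)

def phi_N_inv(u):                # R^2 -> S^2 \ {N}
    s = u @ u
    return np.array([2*u[0], 2*u[1], s - 1]) / (s + 1)

u = np.array([0.3, -1.1])
p = phi_N_inv(u)
print(np.isclose(p @ p, 1.0))    # lands on the sphere: True
print(np.allclose(phi_N(p), u))  # chart round trip: True

# T_p S^2 = {v : <v, p> = 0}: push the two coordinate directions forward
# by finite differences; the resulting velocities are orthogonal to p.
h = 1e-6
for e in (np.array([h, 0.0]), np.array([0.0, h])):
    v = (phi_N_inv(u + e) - phi_N_inv(u - e)) / (2*h)
    assert abs(v @ p) < 1e-6
print("coordinate velocity vectors lie in T_p S^2")
```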
Example 2: SO(3) and Rotation Representation
SO(3) is the manifold of 3D rotations — a 3-dimensional manifold. Its Lie algebra 𝔰𝔬(3) consists of skew-symmetric 3×3 matrices, identified with R³ via:
ω = (ω₁, ω₂, ω₃) ↦ [ω]_× = [[0, −ω₃, ω₂], [ω₃, 0, −ω₁], [−ω₂, ω₁, 0]]
The exponential map exp([ω]_×) gives the Rodrigues rotation formula — a rotation by angle θ = ‖ω‖ around the axis ω/‖ω‖.
In 3D deep learning (point cloud processing, robotics), parameterizing rotations via 𝔰𝔬(3) (axis-angle / exponential coordinates) avoids the gimbal lock and discontinuities of Euler angles.
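The Rodrigues formula from this example can be sketched directly and checked against a truncated power series for the matrix exponential:

```python
import numpy as np

# Rodrigues formula for exp: so(3) -> SO(3).
def hat(w):                      # R^3 -> so(3), the [w]_x map
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def exp_so3(w):                  # rotation by |w| around w/|w|
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    K = hat(w / theta)
    return np.eye(3) + np.sin(theta)*K + (1 - np.cos(theta))*(K @ K)

w = np.array([0.4, -0.2, 0.9])
R = exp_so3(w)

# compare against the truncated series sum_k A^k / k!
A, S, term = hat(w), np.eye(3), np.eye(3)
for k in range(1, 30):
    term = term @ A / k
    S = S + term

print(np.allclose(R, S))                # Rodrigues matches the series: True
print(np.allclose(R.T @ R, np.eye(3)))  # R is a rotation: True
```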
Example 3: Optimization on the Stiefel Manifold
Train a network where the weight matrix W ∈ R^{n×k} must have orthonormal columns (WᵀW = I_k) — a constraint appearing in subspace learning, PCA-like objectives, and orthogonal RNNs (for gradient stability).
The Stiefel manifold St(n, k) = {W ∈ R^{n×k} : WᵀW = I_k} is an (nk − k(k+1)/2)-dimensional manifold.
Riemannian gradient: take the Euclidean gradient G = ∇f(W), project it to the tangent space T_W St(n, k):
grad f(W) = G − W sym(WᵀG),  where sym(A) = (A + Aᵀ)/2
Then retract back to the manifold using QR decomposition (orthogonalize the updated W). This is how orthogonal gradient descent works — it never leaves the manifold.
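One full project-step-retract iteration can be sketched in a few lines. The objective f(W) = −½ tr(WᵀAW) is a toy PCA-style choice for illustration, not part of the example above:

```python
import numpy as np

# One Riemannian gradient step on the Stiefel manifold St(n, k):
# project the Euclidean gradient to the tangent space, step, retract via QR.
rng = np.random.default_rng(1)
n, k, lr = 8, 3, 0.1
A = rng.standard_normal((n, n)); A = A + A.T      # symmetric matrix
W, _ = np.linalg.qr(rng.standard_normal((n, k)))  # start on the manifold

def sym(M):
    return 0.5 * (M + M.T)

G = -A @ W                            # Euclidean gradient of f(W) = -tr(W^T A W)/2
rgrad = G - W @ sym(W.T @ G)          # projection onto T_W St(n, k)
W_new, _ = np.linalg.qr(W - lr*rgrad) # retraction: QR re-orthogonalizes

print(np.allclose(W_new.T @ W_new, np.eye(k)))  # still on St(n, k): True
```

The tangent-space projection guarantees the first-order update direction is feasible; the QR retraction then repairs the second-order drift off the manifold.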
Where Your Intuition Breaks
The manifold hypothesis is widely cited but rarely scrutinized: real data is rarely on a smooth manifold in any strict mathematical sense. Training images contain noise (a random perturbation moves you off any smooth surface), the distribution has positive measure in the full ambient space, and "the manifold" is not a single connected component — there are separate clusters for different classes. The manifold hypothesis is better understood as a claim that data has low intrinsic dimensionality: a few degrees of freedom explain most of the variation. This weaker version is empirically supported (intrinsic dimension estimates for image datasets come out far below the pixel-space dimension) and is what actually justifies dimensionality reduction and generative modeling — but it doesn't require a literally smooth manifold. When you see failures of latent space interpolation producing blurry or unrealistic in-between points, you're hitting the gaps in the manifold approximation.
Why charts matter. Neural networks processing 3D data often parameterize rotations as quaternions, rotation matrices, or axis-angle vectors. Each is a different chart on SO(3). Rotation matrices are the most natural (no singularities, group structure is explicit), but expensive to store. Quaternions are compact but have a two-to-one coverage (q and −q represent the same rotation). The choice of chart affects numerical stability, interpolation quality, and what "gradient descent" means.
The tangent space IS the linearization. When you compute gradients via backpropagation, you are working in the tangent space of the parameter manifold (usually all of R^n, where the tangent space at every point is R^n itself — trivial). For constrained optimization on manifolds, the gradient must be projected to the tangent space of the constraint set before updating, because only the tangent component is feasible. This is why projected gradient descent, Frank-Wolfe, and Riemannian SGD all include a projection/retraction step.