
Smooth Manifolds & Tangent Spaces

A smooth manifold is a space that locally looks like $\mathbb{R}^n$ but globally can be curved and topologically nontrivial. The tangent space at each point is the linear approximation — the "flat world" visible from that location. Understanding manifolds is essential for Riemannian optimization (descent on curved parameter spaces), the manifold hypothesis in representation learning, and the geometry of probability distributions.

Concepts

Interactive: Manifold & Tangent Space — a slider moves $p$ along the manifold. At position parameter $\theta = 0.785 \approx \pi/4$, the point is $p = (\cos\theta, \sin\theta) \approx (0.707, 0.707)$ and the tangent space $T_p M$ is spanned by the direction $(-0.707, 0.707)$.

The unit circle is a 1D manifold embedded in $\mathbb{R}^2$. Every point has a tangent line — the 1D tangent space $T_p(S^1)$.

The yellow line is the tangent space — a local linear approximation to the curved manifold. As you move $p$, the tangent "rolls" along the curve. A chart maps the manifold locally to flat $\mathbb{R}^k$.

The surface of the Earth is a 2D manifold — locally it looks flat (you can use a city map without spherical corrections), but globally it is curved and finite. This is the manifold idea: a space that is locally Euclidean but globally non-trivial. The tangent space at each point is the flat map that works locally — the "city map" for that neighborhood. In machine learning, the manifold hypothesis says that high-dimensional data (images, text embeddings) lives near a low-dimensional curved surface inside the ambient space, which is why autoencoders and dimensionality reduction work at all.

Smooth Manifolds

Definition. A smooth $n$-manifold $M$ is a topological space with an atlas — a collection of charts $\{(U_\alpha, \phi_\alpha)\}$ where:

  • $\{U_\alpha\}$ is an open cover of $M$
  • Each $\phi_\alpha : U_\alpha \to V_\alpha \subset \mathbb{R}^n$ is a homeomorphism (continuous bijection with continuous inverse)
  • Smoothness condition: whenever $U_\alpha \cap U_\beta \neq \emptyset$, the transition map $\phi_\beta \circ \phi_\alpha^{-1} : \phi_\alpha(U_\alpha \cap U_\beta) \to \phi_\beta(U_\alpha \cap U_\beta)$ is a smooth ($C^\infty$) diffeomorphism

Each chart $\phi_\alpha$ provides local coordinates — a coordinate system valid in the neighborhood $U_\alpha$. The atlas axioms ensure these coordinate systems glue together consistently. The smoothness condition on transition maps is exactly what you need: without it, you could have a space that looks flat in each chart individually but with incompatible coordinate systems across charts — calculus would be undefined at the seams. Requiring $C^\infty$ transition maps is the minimum condition to make differentiation well-defined globally, independent of which chart you choose to work in.

Examples:

| Manifold | Dimension | Description |
| --- | --- | --- |
| $\mathbb{R}^n$ | $n$ | Trivial: one chart, the identity map |
| $S^n$ (sphere) | $n$ | Two charts: stereographic projection from North and South poles |
| $SO(n)$ (rotation matrices) | $n(n-1)/2$ | Lie group of orthogonal matrices with $\det = 1$ |
| $\operatorname{Sym}^+(n)$ (PD matrices) | $n(n+1)/2$ | Open cone in symmetric matrices |
| Stiefel manifold $\mathcal{V}_{k,n}$ | $nk - k(k+1)/2$ | Rectangular matrices with orthonormal columns |
| Grassmannian $\operatorname{Gr}(k,n)$ | $k(n-k)$ | $k$-dimensional subspaces of $\mathbb{R}^n$ |

Tangent Spaces

The tangent space $T_p M$ at a point $p \in M$ is the set of all velocity vectors of smooth curves through $p$ — it is the best linear approximation to $M$ at $p$.

Formal definition. A tangent vector at $p$ is an equivalence class of smooth curves $\gamma : (-\varepsilon, \varepsilon) \to M$ with $\gamma(0) = p$, where $\gamma_1 \sim \gamma_2$ if they have the same velocity in some (equivalently, every) local chart.

In coordinates $(x^1, \ldots, x^n)$ near $p$: $T_p M \cong \mathbb{R}^n$, with basis $\{\partial/\partial x^i\}$ (partial derivative operators). A tangent vector $\mathbf{v} = v^i\,\partial/\partial x^i$ acts on smooth functions by directional differentiation: $\mathbf{v}[f] = v^i\,\partial f/\partial x^i$.
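
A tangent vector's action as a directional derivative can be checked numerically; a minimal sketch, where the function $f$, the point, and the vector are illustrative choices:

```python
import numpy as np

def f(x):
    # a smooth function on R^2
    return x[0] ** 2 + np.sin(x[1])

def directional_derivative(f, p, v, h=1e-6):
    # central finite difference of f along the curve t -> p + t v at t = 0
    return (f(p + h * v) - f(p - h * v)) / (2 * h)

p = np.array([1.0, 0.5])
v = np.array([2.0, -1.0])               # v = 2 ∂/∂x^1 - 1 ∂/∂x^2

# analytic: v[f] = v^i ∂f/∂x^i = 2 * (2 x^1) + (-1) * cos(x^2)
analytic = 2 * (2 * p[0]) + (-1) * np.cos(p[1])
numeric = directional_derivative(f, p, v)
print(abs(analytic - numeric) < 1e-6)   # True
```

The same number comes out however the curve through $p$ with velocity $\mathbf{v}$ is chosen, which is why the equivalence-class definition is well posed.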

Tangent bundle. $TM = \bigsqcup_{p \in M} T_p M$ — the disjoint union of all tangent spaces. A vector field is a smooth section of $TM$: an assignment $p \mapsto X(p) \in T_p M$ varying smoothly with $p$.

Smooth Maps and the Differential

A map $F : M \to N$ between smooth manifolds is smooth if it is smooth in every pair of charts. The differential (or pushforward) of $F$ at $p$ is the linear map:

$$dF_p : T_p M \to T_{F(p)} N,$$

defined by $(dF_p)(\mathbf{v})[g] = \mathbf{v}[g \circ F]$. In local coordinates, the differential is represented by the Jacobian matrix of the coordinate expression of $F$.

Chain rule on manifolds. For $G : N \to P$ and $F : M \to N$:

$$d(G \circ F)_p = dG_{F(p)} \circ dF_p.$$

This is the manifold version of the matrix chain rule.
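
The chain rule can be verified numerically for maps between Euclidean spaces, where differentials are Jacobian matrices; $F$, $G$, and the base point below are illustrative choices:

```python
import numpy as np

def F(x):   # F : R^2 -> R^3
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def G(y):   # G : R^3 -> R^2
    return np.array([y[0] + y[2], y[1] * y[2]])

def jacobian(f, p, h=1e-6):
    # finite-difference Jacobian of f at p, one column per input coordinate
    p = np.asarray(p, dtype=float)
    cols = []
    for i in range(len(p)):
        e = np.zeros_like(p)
        e[i] = h
        cols.append((f(p + e) - f(p - e)) / (2 * h))
    return np.stack(cols, axis=1)

p = np.array([0.7, -1.2])
lhs = jacobian(lambda x: G(F(x)), p)        # d(G∘F)_p directly
rhs = jacobian(G, F(p)) @ jacobian(F, p)    # dG_{F(p)} ∘ dF_p as a matrix product
print(np.allclose(lhs, rhs, atol=1e-5))     # True
```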

Lie Groups and Lie Algebras

A Lie group $G$ is both a smooth manifold and a group, with the group operations (multiplication and inversion) being smooth maps.

The Lie algebra $\mathfrak{g} = T_e G$ (the tangent space at the identity $e$) carries the infinitesimal group structure. The exponential map $\exp : \mathfrak{g} \to G$ gives a canonical way to "integrate" a Lie algebra element to a group element, generalizing $e^{tA}$ for matrix Lie groups.

Matrix Lie groups and their Lie algebras:

| Lie group $G$ | Lie algebra $\mathfrak{g}$ | Elements |
| --- | --- | --- |
| $GL(n)$ (invertible) | $\mathfrak{gl}(n)$ | All $n \times n$ matrices |
| $SO(n)$ (rotations) | $\mathfrak{so}(n)$ | Skew-symmetric: $A + A^T = 0$ |
| $U(n)$ (unitary) | $\mathfrak{u}(n)$ | Skew-Hermitian: $A + A^* = 0$ |
| $SL(n)$ ($\det = 1$) | $\mathfrak{sl}(n)$ | Traceless: $\operatorname{tr}(A) = 0$ |

Optimization on Lie groups. Gradient descent on $G$ uses the Lie algebra: compute the gradient in $\mathfrak{g}$ (the tangent space at the identity), map it to the current point via left-translation, and update. This is Riemannian gradient descent on Lie groups.
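
This recipe can be sketched for $SO(3)$; a minimal example, where the objective, step size, and target rotation are all illustrative choices, and `scipy.linalg.expm` supplies the matrix exponential:

```python
import numpy as np
from scipy.linalg import expm

def hat(w):
    # so(3) element (skew-symmetric matrix) from an axis-angle vector
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def skew(A):
    # projection of a matrix onto the skew-symmetric matrices
    return 0.5 * (A - A.T)

R_target = expm(hat([0.3, -0.5, 0.4]))   # a fixed, modest target rotation

R = np.eye(3)
eta = 0.1
for _ in range(200):
    G = 2 * (R - R_target)    # Euclidean gradient of L(R) = ||R - R_target||_F^2
    xi = skew(R.T @ G)        # pull the gradient back to the Lie algebra so(3)
    R = R @ expm(-eta * xi)   # move along the group via the exponential map

print(np.allclose(R.T @ R, np.eye(3)))       # the iterate never leaves SO(3)
print(np.linalg.norm(R - R_target) < 1e-3)   # and converges to the target
```

Because every step multiplies by the exponential of a Lie algebra element, orthogonality is preserved by construction rather than enforced after the fact.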

The Manifold Hypothesis

Hypothesis. High-dimensional data (images, text, audio) lies near or on a low-dimensional smooth manifold $\mathcal{M} \subset \mathbb{R}^d$, where $\dim \mathcal{M} = k \ll d$.

Evidence: interpolation in latent spaces (e.g., walking between two images in a VAE's latent space produces realistic intermediate images), intrinsic dimensionality estimation showing $k$ is much smaller than $d$, and the success of compressed representations.

Consequences:

  • Learning reduces to estimating the manifold or a function on it
  • Dimensionality reduction compresses with little information loss (if $k \ll d$)
  • Geodesic distance on the manifold is more meaningful than Euclidean distance through the ambient space
  • Autoencoders implicitly learn to map data onto a low-dimensional manifold (the latent space)
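
The low-intrinsic-dimension claim can be illustrated with PCA on synthetic data; everything below (the sizes, the 2-parameter generating surface, the noise level) is invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 2000, 50, 2

# sample k = 2 latent degrees of freedom, embed them in R^50 via a
# quadratic map; the image spans at most 2k = 4 linear directions
t = rng.uniform(-1, 1, size=(n, k))
A = rng.standard_normal((k, d))
B = rng.standard_normal((k, d))
X = t @ A + (t ** 2) @ B
X += 1e-3 * rng.standard_normal((n, d))   # small off-manifold noise

# PCA: the singular-value spectrum collapses after a few components
X = X - X.mean(axis=0)
s = np.linalg.svd(X, compute_uv=False)
print(s[:6] / s[0])   # only the first few ratios are non-negligible
```

An intrinsic-dimension estimator would report $k \approx 2$ here even though the linear span is 4-dimensional and the ambient space is $\mathbb{R}^{50}$: linear methods see an upper bound, not the curved surface itself.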

Worked Example

Example 1: Charts for $S^2$ (the 2-sphere)

The 2-sphere $S^2 = \{(x,y,z) \in \mathbb{R}^3 : x^2 + y^2 + z^2 = 1\}$ is a 2-manifold.

Stereographic projection from the North Pole $(0,0,1)$:

$$\phi_N(x,y,z) = \left(\frac{x}{1-z}, \frac{y}{1-z}\right) \in \mathbb{R}^2.$$

This maps $S^2 \setminus \{N\}$ bijectively to $\mathbb{R}^2$. A second chart from the South Pole covers the North Pole. Together they form an atlas for $S^2$.
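
As a concrete check of the smoothness condition, the transition map between the two charts can be computed explicitly (a standard calculation, with $\phi_S(x,y,z) = (x/(1+z),\, y/(1+z))$ the South Pole chart):

$$\phi_S \circ \phi_N^{-1}(u) = \frac{u}{\|u\|^2}, \qquad u \in \mathbb{R}^2 \setminus \{0\},$$

which is $C^\infty$ on the overlap $\phi_N(S^2 \setminus \{N, S\}) = \mathbb{R}^2 \setminus \{0\}$, so the two charts are smoothly compatible.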

Tangent space at $p = (1,0,0)$: the tangent vectors are all vectors in $\mathbb{R}^3$ orthogonal to the outward normal $\mathbf{n} = p = (1,0,0)$. So $T_p S^2 = \{(v_1, v_2, v_3) : v_1 = 0\} = \operatorname{span}\{(0,1,0), (0,0,1)\} \cong \mathbb{R}^2$.
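
A quick numeric check of this example; the inverse-chart formula below is the standard one for stereographic projection:

```python
import numpy as np

def phi_N(p):
    # stereographic projection from the North Pole (0,0,1)
    x, y, z = p
    return np.array([x / (1 - z), y / (1 - z)])

def phi_N_inv(u):
    # inverse chart: R^2 -> S^2 \ {N}
    s = u @ u
    return np.array([2 * u[0], 2 * u[1], s - 1]) / (s + 1)

p = np.array([1.0, 0.0, 0.0])
u = phi_N(p)
print(np.allclose(phi_N_inv(u), p))        # the chart round-trips: True

# tangent space at p: vectors orthogonal to the normal n = p
basis = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])        # spans T_p S^2
print(np.allclose(basis @ p, 0))           # both basis vectors ⟂ p: True
```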

Example 2: $SO(3)$ and Rotation Representation

$SO(3)$ is the manifold of 3D rotations — a 3-dimensional manifold. Its Lie algebra $\mathfrak{so}(3)$ consists of $3 \times 3$ skew-symmetric matrices, identified with $\mathbb{R}^3$ via:

$$\hat{\boldsymbol{\omega}} = \begin{pmatrix} 0 & -\omega_3 & \omega_2 \\ \omega_3 & 0 & -\omega_1 \\ -\omega_2 & \omega_1 & 0 \end{pmatrix} \leftrightarrow \boldsymbol{\omega} = (\omega_1, \omega_2, \omega_3)^T.$$

The exponential map $\exp(\hat{\boldsymbol{\omega}}) = R \in SO(3)$ gives the Rodrigues rotation formula — a rotation by angle $\|\boldsymbol{\omega}\|$ around the axis $\boldsymbol{\omega}/\|\boldsymbol{\omega}\|$.

In 3D deep learning (point cloud processing, robotics), parameterizing rotations via $\mathfrak{so}(3)$ avoids the gimbal lock and discontinuities of Euler angles.
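
The hat map and exponential can be checked numerically against the closed-form Rodrigues formula; a sketch, with an illustrative axis-angle vector and `scipy.linalg.expm` for the matrix exponential:

```python
import numpy as np
from scipy.linalg import expm

def hat(w):
    # so(3): axis-angle vector in R^3 -> 3x3 skew-symmetric matrix
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def rodrigues(w):
    # closed-form exp(hat(w)): rotation by |w| about w/|w|
    theta = np.linalg.norm(w)
    K = hat(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

w = np.array([0.3, -0.5, 0.4])
R = expm(hat(w))

print(np.allclose(R, rodrigues(w)))        # matrix exp matches Rodrigues
print(np.allclose(R.T @ R, np.eye(3)))     # R is orthogonal
print(np.isclose(np.linalg.det(R), 1.0))   # det = +1, so R ∈ SO(3)
```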

Example 3: Optimization on the Stiefel Manifold

Train a network where the weight matrix $W \in \mathbb{R}^{d \times k}$ must have orthonormal columns ($W^T W = I_k$) — a constraint appearing in subspace learning, PCA-like objectives, and orthogonal RNNs (for gradient stability).

The Stiefel manifold $\mathcal{V}_{k,d} = \{W \in \mathbb{R}^{d \times k} : W^T W = I_k\}$ is a $(dk - k(k+1)/2)$-dimensional manifold.

Riemannian gradient: take the Euclidean gradient $G = \nabla L(W)$ and project it onto the tangent space $T_W \mathcal{V}$:

$$\operatorname{grad}_W L = G - W\,\frac{G^T W + W^T G}{2}.$$

Then retract back to the manifold using QR decomposition (orthogonalize the updated WW). This is how orthogonal gradient descent works — it never leaves the manifold.
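
One full step of this procedure might look like the following sketch; the loss, the matrix sizes, and the QR sign convention are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 5

# start from a random point with orthonormal columns
W, _ = np.linalg.qr(rng.standard_normal((d, k)))
A = rng.standard_normal((d, d))
A = A + A.T                           # symmetric, for a quadratic loss

def euclidean_grad(W):
    # gradient of L(W) = -tr(W^T A W)
    return -2 * A @ W

def project_tangent(W, G):
    # projection onto T_W V: G - W * (W^T G + G^T W) / 2
    sym = 0.5 * (W.T @ G + G.T @ W)
    return G - W @ sym

def qr_retract(W):
    # retraction: map back to the manifold by orthonormalizing the columns
    Q, R = np.linalg.qr(W)
    return Q * np.sign(np.diag(R))    # fix column signs for uniqueness

eta = 0.01
G = euclidean_grad(W)
xi = project_tangent(W, G)            # feasible direction in T_W V
W_new = qr_retract(W - eta * xi)      # step, then return to the manifold

print(np.allclose(W_new.T @ W_new, np.eye(k)))   # still on the manifold: True
```

The projection matches the formula above (the second term is $W \cdot \operatorname{sym}(W^T G)$), and the QR step is the retraction mentioned in the text.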

Connections

Where Your Intuition Breaks

The manifold hypothesis is widely cited but rarely scrutinized: real data is rarely on a smooth manifold in any strict mathematical sense. Training images contain noise (a random perturbation moves you off any smooth surface), the distribution has positive measure in the full ambient space, and "the manifold" is not a single connected component — there are separate clusters for different classes. The manifold hypothesis is better understood as a claim that data has low intrinsic dimensionality: a few degrees of freedom explain most of the variation. This weaker version is empirically supported (intrinsic dimension estimates give $k \ll d$ for image datasets) and is what actually justifies dimensionality reduction and generative modeling — but it doesn't require a literally smooth manifold. When latent-space interpolation fails, producing blurry or unrealistic in-between points, you are hitting the gaps in the manifold approximation.

💡Intuition

Why charts matter. Neural networks processing 3D data often parameterize rotations as quaternions, rotation matrices, or axis-angle vectors. Each is a different chart on $SO(3)$. Rotation matrices are the most natural (no singularities, the group structure is explicit) but expensive to store. Quaternions are compact but cover $SO(3)$ two-to-one ($q$ and $-q$ represent the same rotation). The choice of chart affects numerical stability, interpolation quality, and what "gradient descent" means.

💡Intuition

The tangent space IS the linearization. When you compute gradients via backpropagation, you are working in the tangent space of the parameter manifold (usually $\mathbb{R}^n$ — trivial). For constrained optimization on manifolds, the gradient must be projected onto the tangent space of the constraint set before updating, because only the tangent component is feasible. This is why projected gradient descent, Frank-Wolfe, and Riemannian SGD all include a projection/retraction step.
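
The project-then-retract pattern can be illustrated on the simplest constraint set, the unit sphere $\|x\| = 1$; the loss and step size below are invented for the sketch:

```python
import numpy as np

A = np.diag([3.0, 1.0, 0.5])    # minimize x^T A x subject to ||x|| = 1
x = np.ones(3) / np.sqrt(3)
eta = 0.1

for _ in range(500):
    g = 2 * A @ x                 # Euclidean gradient
    g_tan = g - (g @ x) * x       # project out the normal component
    x = x - eta * g_tan           # step in the tangent space only
    x = x / np.linalg.norm(x)     # retraction: renormalize onto the sphere

# the constrained minimum of x^T A x is the smallest eigenvalue of A
print(np.isclose(x @ A @ x, 0.5, atol=1e-6))   # True
```

Dropping the projection would waste part of each step pushing against the constraint; dropping the retraction would let the iterate drift off the sphere.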
