
Smooth Manifolds & Tangent Spaces

A smooth manifold is a space that locally looks like $\mathbb{R}^n$ but globally can be curved and topologically nontrivial. The tangent space at each point is the linear approximation — the "flat world" visible from that location. Understanding manifolds is essential for Riemannian optimization (descent on curved parameter spaces), the manifold hypothesis in representation learning, and the geometry of probability distributions.

Concepts

Interactive: Manifold & Tangent Space — a slider moves $p$ along the manifold. At position parameter $\theta = 0.785 \approx \pi/4$, the point is $p = (\cos\theta, \sin\theta) \approx (0.707, 0.707)$ and the tangent space $T_p M$ is spanned by the direction $(-0.707, 0.707)$.

The unit circle is a 1D manifold embedded in $\mathbb{R}^2$. Every point has a tangent line — the 1D tangent space $T_p(S^1)$.

The yellow line is the tangent space — a local linear approximation to the curved manifold. As you move $p$, the tangent "rolls" along the curve. A chart maps the manifold locally to flat $\mathbb{R}^k$.

The surface of the Earth is a 2D manifold — locally it looks flat (you can use a city map without spherical corrections), but globally it is curved and finite. This is the manifold idea: a space that is locally Euclidean but globally non-trivial. The tangent space at each point is the flat map that works locally — the "city map" for that neighborhood. In machine learning, the manifold hypothesis says that high-dimensional data (images, text embeddings) lives near a low-dimensional curved surface inside the ambient space, which is why autoencoders and dimensionality reduction work at all.

Smooth Manifolds

Definition. A smooth $n$-manifold $M$ is a topological space with an atlas — a collection of charts $\{(U_\alpha, \phi_\alpha)\}$ where:

  • $\{U_\alpha\}$ is an open cover of $M$
  • Each $\phi_\alpha : U_\alpha \to V_\alpha \subset \mathbb{R}^n$ is a homeomorphism (continuous bijection with continuous inverse)
  • Smoothness condition: whenever $U_\alpha \cap U_\beta \neq \emptyset$, the transition map $\phi_\beta \circ \phi_\alpha^{-1} : \phi_\alpha(U_\alpha \cap U_\beta) \to \phi_\beta(U_\alpha \cap U_\beta)$ is a smooth ($C^\infty$) diffeomorphism

Each chart $\phi_\alpha$ provides local coordinates — a coordinate system valid in the neighborhood $U_\alpha$. The atlas axioms ensure these coordinate systems glue together consistently. The smoothness condition on transition maps is exactly what you need: without it, you could have a space that looks flat in each chart individually but with incompatible coordinate systems across charts — calculus would be undefined at the seams. Requiring $C^\infty$ transition maps is the minimum condition to make differentiation well-defined globally, independent of which chart you choose to work in.

Examples:

| Manifold | Dimension | Description |
| --- | --- | --- |
| $\mathbb{R}^n$ | $n$ | Trivial: one chart, the identity map |
| $S^n$ (sphere) | $n$ | Two charts: stereographic projection from North and South poles |
| $SO(n)$ (rotation matrices) | $n(n-1)/2$ | Lie group of orthogonal matrices with $\det = 1$ |
| $\operatorname{Sym}^+(n)$ (PD matrices) | $n(n+1)/2$ | Open cone in symmetric matrices |
| Stiefel manifold $\mathcal{V}_{k,n}$ | $nk - k(k+1)/2$ | Rectangular matrices with orthonormal columns |
| Grassmannian $\operatorname{Gr}(k,n)$ | $k(n-k)$ | $k$-dimensional subspaces of $\mathbb{R}^n$ |

Tangent Spaces

The tangent space $T_p M$ at a point $p \in M$ is the set of all velocity vectors of smooth curves through $p$ — it is the best linear approximation to $M$ at $p$.

Formal definition. A tangent vector at $p$ is an equivalence class of smooth curves $\gamma : (-\varepsilon, \varepsilon) \to M$ with $\gamma(0) = p$, where $\gamma_1 \sim \gamma_2$ if they have the same velocity in some (equivalently, every) local chart.

In coordinates $(x^1, \ldots, x^n)$ near $p$: $T_p M \cong \mathbb{R}^n$, with basis $\{\partial/\partial x^i\}$ (partial derivative operators). A tangent vector $\mathbf{v} = v^i\,\partial/\partial x^i$ acts on smooth functions by directional differentiation: $\mathbf{v}[f] = v^i\,\partial f/\partial x^i$.
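
A tangent vector's action as a directional derivative can be checked numerically; a minimal sketch, where the function $f$, the point, and the vector are illustrative choices:

```python
import numpy as np

def f(x):
    # a smooth function on R^2
    return x[0] ** 2 + np.sin(x[1])

def directional_derivative(f, p, v, h=1e-6):
    # central finite difference of f along the curve t -> p + t v at t = 0
    return (f(p + h * v) - f(p - h * v)) / (2 * h)

p = np.array([1.0, 0.5])
v = np.array([2.0, -1.0])               # v = 2 ∂/∂x^1 - 1 ∂/∂x^2

# analytic: v[f] = v^i ∂f/∂x^i = 2 * (2 x^1) + (-1) * cos(x^2)
analytic = 2 * (2 * p[0]) + (-1) * np.cos(p[1])
numeric = directional_derivative(f, p, v)
print(abs(analytic - numeric) < 1e-6)   # True
```

The same number comes out however the curve through $p$ with velocity $\mathbf{v}$ is chosen, which is why the equivalence-class definition is well posed.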

Tangent bundle. $TM = \bigsqcup_{p \in M} T_p M$ — the disjoint union of all tangent spaces. A vector field is a smooth section of $TM$: an assignment $p \mapsto X(p) \in T_p M$ varying smoothly with $p$.

Smooth Maps and the Differential

A map $F : M \to N$ between smooth manifolds is smooth if it is smooth in every pair of charts. The differential (or pushforward) of $F$ at $p$ is the linear map:

$$dF_p : T_p M \to T_{F(p)} N,$$

defined by $(dF_p)(\mathbf{v})[g] = \mathbf{v}[g \circ F]$. In local coordinates, the differential is represented by the Jacobian matrix of the coordinate expression of $F$.

Chain rule on manifolds. For $G : N \to P$ and $F : M \to N$:

$$d(G \circ F)_p = dG_{F(p)} \circ dF_p.$$

This is the manifold version of the matrix chain rule.
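
The chain rule can be verified numerically for maps between Euclidean spaces, where differentials are Jacobian matrices; $F$, $G$, and the base point below are illustrative choices:

```python
import numpy as np

def F(x):   # F : R^2 -> R^3
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def G(y):   # G : R^3 -> R^2
    return np.array([y[0] + y[2], y[1] * y[2]])

def jacobian(f, p, h=1e-6):
    # finite-difference Jacobian of f at p, one column per input coordinate
    p = np.asarray(p, dtype=float)
    cols = []
    for i in range(len(p)):
        e = np.zeros_like(p)
        e[i] = h
        cols.append((f(p + e) - f(p - e)) / (2 * h))
    return np.stack(cols, axis=1)

p = np.array([0.7, -1.2])
lhs = jacobian(lambda x: G(F(x)), p)        # d(G∘F)_p directly
rhs = jacobian(G, F(p)) @ jacobian(F, p)    # dG_{F(p)} ∘ dF_p as a matrix product
print(np.allclose(lhs, rhs, atol=1e-5))     # True
```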

Lie Groups and Lie Algebras

A Lie group $G$ is both a smooth manifold and a group, with the group operations (multiplication and inversion) being smooth maps.

The Lie algebra $\mathfrak{g} = T_e G$ (the tangent space at the identity $e$) carries the infinitesimal group structure. The exponential map $\exp : \mathfrak{g} \to G$ gives a canonical way to "integrate" a Lie algebra element to a group element, generalizing $e^{tA}$ for matrix Lie groups.

Matrix Lie groups and their Lie algebras:

| Lie group $G$ | Lie algebra $\mathfrak{g}$ | Elements |
| --- | --- | --- |
| $GL(n)$ (invertible) | $\mathfrak{gl}(n)$ | All $n \times n$ matrices |
| $SO(n)$ (rotations) | $\mathfrak{so}(n)$ | Skew-symmetric: $A + A^T = 0$ |
| $U(n)$ (unitary) | $\mathfrak{u}(n)$ | Skew-Hermitian: $A + A^* = 0$ |
| $SL(n)$ ($\det = 1$) | $\mathfrak{sl}(n)$ | Traceless: $\operatorname{tr}(A) = 0$ |

Optimization on Lie groups. Gradient descent on $G$ uses the Lie algebra: compute the gradient in $\mathfrak{g}$ (the tangent space at the identity), map it to the current point via left-translation, and update. This is Riemannian gradient descent on Lie groups.
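
This recipe can be sketched for $SO(3)$; a minimal example, where the objective, step size, and target rotation are all illustrative choices, and `scipy.linalg.expm` supplies the matrix exponential:

```python
import numpy as np
from scipy.linalg import expm

def hat(w):
    # so(3) element (skew-symmetric matrix) from an axis-angle vector
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def skew(A):
    # projection of a matrix onto the skew-symmetric matrices
    return 0.5 * (A - A.T)

R_target = expm(hat([0.3, -0.5, 0.4]))   # a fixed, modest target rotation

R = np.eye(3)
eta = 0.1
for _ in range(200):
    G = 2 * (R - R_target)    # Euclidean gradient of L(R) = ||R - R_target||_F^2
    xi = skew(R.T @ G)        # pull the gradient back to the Lie algebra so(3)
    R = R @ expm(-eta * xi)   # move along the group via the exponential map

print(np.allclose(R.T @ R, np.eye(3)))       # the iterate never leaves SO(3)
print(np.linalg.norm(R - R_target) < 1e-3)   # and converges to the target
```

Because every step multiplies by the exponential of a Lie algebra element, orthogonality is preserved by construction rather than enforced after the fact.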

The Manifold Hypothesis

Hypothesis. High-dimensional data (images, text, audio) lies near or on a low-dimensional smooth manifold $\mathcal{M} \subset \mathbb{R}^d$, where $\dim \mathcal{M} = k \ll d$.

Evidence: interpolation in latent spaces (e.g., walking between two images in a VAE's latent space produces realistic intermediate images), intrinsic dimensionality estimation showing $k$ is much smaller than $d$, and the success of compressed representations.

Consequences:

  • Learning reduces to estimating the manifold or a function on it
  • Dimensionality reduction compresses with little information loss (if $k \ll d$)
  • Geodesic distance on the manifold is more meaningful than Euclidean distance through the ambient space
  • Autoencoders implicitly learn to map data onto a low-dimensional manifold (the latent space)
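
The low-intrinsic-dimension claim can be illustrated with PCA on synthetic data; everything below (the sizes, the 2-parameter generating surface, the noise level) is invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 2000, 50, 2

# sample k = 2 latent degrees of freedom, embed them in R^50 via a
# quadratic map; the image spans at most 2k = 4 linear directions
t = rng.uniform(-1, 1, size=(n, k))
A = rng.standard_normal((k, d))
B = rng.standard_normal((k, d))
X = t @ A + (t ** 2) @ B
X += 1e-3 * rng.standard_normal((n, d))   # small off-manifold noise

# PCA: the singular-value spectrum collapses after a few components
X = X - X.mean(axis=0)
s = np.linalg.svd(X, compute_uv=False)
print(s[:6] / s[0])   # only the first few ratios are non-negligible
```

An intrinsic-dimension estimator would report $k \approx 2$ here even though the linear span is 4-dimensional and the ambient space is $\mathbb{R}^{50}$: linear methods see an upper bound, not the curved surface itself.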

Worked Example

Example 1: Charts for $S^2$ (the 2-sphere)

The 2-sphere $S^2 = \{(x,y,z) \in \mathbb{R}^3 : x^2 + y^2 + z^2 = 1\}$ is a 2-manifold.

Stereographic projection from the North Pole $(0,0,1)$:

$$\phi_N(x,y,z) = \left(\frac{x}{1-z}, \frac{y}{1-z}\right) \in \mathbb{R}^2.$$

This maps $S^2 \setminus \{N\}$ bijectively to $\mathbb{R}^2$. A second chart from the South Pole covers the North Pole. Together they form an atlas for $S^2$.
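
As a concrete check of the smoothness condition, the transition map between the two charts can be computed explicitly (a standard calculation, with $\phi_S(x,y,z) = (x/(1+z),\, y/(1+z))$ the South Pole chart):

$$\phi_S \circ \phi_N^{-1}(u) = \frac{u}{\|u\|^2}, \qquad u \in \mathbb{R}^2 \setminus \{0\},$$

which is $C^\infty$ on the overlap $\phi_N(S^2 \setminus \{N, S\}) = \mathbb{R}^2 \setminus \{0\}$, so the two charts are smoothly compatible.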

Tangent space at $p = (1,0,0)$: the tangent vectors are all vectors in $\mathbb{R}^3$ orthogonal to the outward normal $\mathbf{n} = p = (1,0,0)$. So $T_p S^2 = \{(v_1, v_2, v_3) : v_1 = 0\} = \operatorname{span}\{(0,1,0), (0,0,1)\} \cong \mathbb{R}^2$.
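
A quick numeric check of this example; the inverse-chart formula below is the standard one for stereographic projection:

```python
import numpy as np

def phi_N(p):
    # stereographic projection from the North Pole (0,0,1)
    x, y, z = p
    return np.array([x / (1 - z), y / (1 - z)])

def phi_N_inv(u):
    # inverse chart: R^2 -> S^2 \ {N}
    s = u @ u
    return np.array([2 * u[0], 2 * u[1], s - 1]) / (s + 1)

p = np.array([1.0, 0.0, 0.0])
u = phi_N(p)
print(np.allclose(phi_N_inv(u), p))        # the chart round-trips: True

# tangent space at p: vectors orthogonal to the normal n = p
basis = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])        # spans T_p S^2
print(np.allclose(basis @ p, 0))           # both basis vectors ⟂ p: True
```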

Example 2: $SO(3)$ and Rotation Representation

$SO(3)$ is the manifold of 3D rotations — a 3-dimensional manifold. Its Lie algebra $\mathfrak{so}(3)$ consists of $3 \times 3$ skew-symmetric matrices, identified with $\mathbb{R}^3$ via:

$$\hat{\boldsymbol{\omega}} = \begin{pmatrix} 0 & -\omega_3 & \omega_2 \\ \omega_3 & 0 & -\omega_1 \\ -\omega_2 & \omega_1 & 0 \end{pmatrix} \leftrightarrow \boldsymbol{\omega} = (\omega_1, \omega_2, \omega_3)^T.$$

The exponential map $\exp(\hat{\boldsymbol{\omega}}) = R \in SO(3)$ gives the Rodrigues rotation formula — a rotation by angle $\|\boldsymbol{\omega}\|$ around the axis $\boldsymbol{\omega}/\|\boldsymbol{\omega}\|$.

In 3D deep learning (point cloud processing, robotics), parameterizing rotations via $\mathfrak{so}(3)$ avoids the gimbal lock and discontinuities of Euler angles.
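
The hat map and exponential can be checked numerically against the closed-form Rodrigues formula; a sketch, with an illustrative axis-angle vector and `scipy.linalg.expm` for the matrix exponential:

```python
import numpy as np
from scipy.linalg import expm

def hat(w):
    # so(3): axis-angle vector in R^3 -> 3x3 skew-symmetric matrix
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def rodrigues(w):
    # closed-form exp(hat(w)): rotation by |w| about w/|w|
    theta = np.linalg.norm(w)
    K = hat(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

w = np.array([0.3, -0.5, 0.4])
R = expm(hat(w))

print(np.allclose(R, rodrigues(w)))        # matrix exp matches Rodrigues
print(np.allclose(R.T @ R, np.eye(3)))     # R is orthogonal
print(np.isclose(np.linalg.det(R), 1.0))   # det = +1, so R ∈ SO(3)
```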

Example 3: Optimization on the Stiefel Manifold

Train a network where the weight matrix $W \in \mathbb{R}^{d \times k}$ must have orthonormal columns ($W^T W = I_k$) — a constraint appearing in subspace learning, PCA-like objectives, and orthogonal RNNs (for gradient stability).

The Stiefel manifold $\mathcal{V}_{k,d} = \{W \in \mathbb{R}^{d \times k} : W^T W = I_k\}$ is a $(dk - k(k+1)/2)$-dimensional manifold.

Riemannian gradient: take the Euclidean gradient $G = \nabla L(W)$ and project it onto the tangent space $T_W \mathcal{V}$:

$$\operatorname{grad}_W L = G - W\,\frac{G^T W + W^T G}{2}.$$

Then retract back to the manifold using QR decomposition (orthogonalize the updated WW). This is how orthogonal gradient descent works — it never leaves the manifold.
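
One full step of this procedure might look like the following sketch; the loss, the matrix sizes, and the QR sign convention are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 5

# start from a random point with orthonormal columns
W, _ = np.linalg.qr(rng.standard_normal((d, k)))
A = rng.standard_normal((d, d))
A = A + A.T                           # symmetric, for a quadratic loss

def euclidean_grad(W):
    # gradient of L(W) = -tr(W^T A W)
    return -2 * A @ W

def project_tangent(W, G):
    # projection onto T_W V: G - W * (W^T G + G^T W) / 2
    sym = 0.5 * (W.T @ G + G.T @ W)
    return G - W @ sym

def qr_retract(W):
    # retraction: map back to the manifold by orthonormalizing the columns
    Q, R = np.linalg.qr(W)
    return Q * np.sign(np.diag(R))    # fix column signs for uniqueness

eta = 0.01
G = euclidean_grad(W)
xi = project_tangent(W, G)            # feasible direction in T_W V
W_new = qr_retract(W - eta * xi)      # step, then return to the manifold

print(np.allclose(W_new.T @ W_new, np.eye(k)))   # still on the manifold: True
```

The projection matches the formula above (the second term is $W \cdot \operatorname{sym}(W^T G)$), and the QR step is the retraction mentioned in the text.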

Connections

Where Your Intuition Breaks

The manifold hypothesis is widely cited but rarely scrutinized: real data is rarely on a smooth manifold in any strict mathematical sense. Training images contain noise (a random perturbation moves you off any smooth surface), the distribution has positive measure in the full ambient space, and "the manifold" is not a single connected component — there are separate clusters for different classes. The manifold hypothesis is better understood as a claim that data has low intrinsic dimensionality: a few degrees of freedom explain most of the variation. This weaker version is empirically supported (intrinsic dimension estimates give $k \ll d$ for image datasets) and is what actually justifies dimensionality reduction and generative modeling — but it doesn't require a literally smooth manifold. When latent-space interpolation fails, producing blurry or unrealistic in-between points, you are hitting the gaps in the manifold approximation.

💡Intuition

Why charts matter. Neural networks processing 3D data often parameterize rotations as quaternions, rotation matrices, or axis-angle vectors. Each is a different chart on $SO(3)$. Rotation matrices are the most natural (no singularities, the group structure is explicit) but expensive to store. Quaternions are compact but cover $SO(3)$ two-to-one ($q$ and $-q$ represent the same rotation). The choice of chart affects numerical stability, interpolation quality, and what "gradient descent" means.

💡Intuition

The tangent space IS the linearization. When you compute gradients via backpropagation, you are working in the tangent space of the parameter manifold (usually $\mathbb{R}^n$ — trivial). For constrained optimization on manifolds, the gradient must be projected onto the tangent space of the constraint set before updating, because only the tangent component is feasible. This is why projected gradient descent, Frank-Wolfe, and Riemannian SGD all include a projection/retraction step.
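
The project-then-retract pattern can be illustrated on the simplest constraint set, the unit sphere $\|x\| = 1$; the loss and step size below are invented for the sketch:

```python
import numpy as np

A = np.diag([3.0, 1.0, 0.5])    # minimize x^T A x subject to ||x|| = 1
x = np.ones(3) / np.sqrt(3)
eta = 0.1

for _ in range(500):
    g = 2 * A @ x                 # Euclidean gradient
    g_tan = g - (g @ x) * x       # project out the normal component
    x = x - eta * g_tan           # step in the tangent space only
    x = x / np.linalg.norm(x)     # retraction: renormalize onto the sphere

# the constrained minimum of x^T A x is the smallest eigenvalue of A
print(np.isclose(x @ A @ x, 0.5, atol=1e-6))   # True
```

Dropping the projection would waste part of each step pushing against the constraint; dropping the retraction would let the iterate drift off the sphere.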
