
Riemannian Geometry: Metrics, Geodesics & Curvature

Riemannian geometry equips each point of a manifold with an inner product on its tangent space, giving rise to notions of distance, angle, and curvature that vary from point to point. This is the geometry underlying natural gradient descent (which uses the Fisher information metric), optimal transport (Wasserstein distances), hyperbolic embeddings of hierarchical data, and the curvature analysis of neural network loss surfaces.

Concepts

The sign of the sectional curvature $K$ determines how geodesics (shortest paths) behave relative to each other. Two geodesics start close together; curvature determines what happens next:

  • $K > 0$ (sphere): geodesics converge and meet again — no parallel postulate.
  • $K = 0$ (flat): geodesics stay parallel — Euclid's parallel postulate holds.
  • $K < 0$ (hyperbolic): geodesics diverge — infinitely many parallels through any point.

When you fly from New York to London, the shortest route arcs northward over Greenland — not because pilots are inefficient, but because the Earth's surface curves. That arc is a geodesic: the shortest path on a curved surface. Riemannian geometry assigns each point of a space its own inner product, so "length" and "angle" are defined locally and can vary from place to place. In machine learning, this matters whenever you want to measure distances in spaces that aren't flat: the space of probability distributions (Fisher-Rao metric), the space of covariance matrices, or the curved loss surface of a neural network.

Riemannian Metrics

Definition. A Riemannian metric on a smooth manifold $M$ is a smooth assignment of an inner product $g_p : T_p M \times T_p M \to \mathbb{R}$ to each tangent space $T_p M$, varying smoothly with $p$.

In local coordinates $(x^1, \ldots, x^n)$, the metric is represented by the metric tensor $g = (g_{ij})$, where

$$g_{ij}(p) = g_p\!\left(\frac{\partial}{\partial x^i}, \frac{\partial}{\partial x^j}\right).$$

The length of a curve $\gamma : [a,b] \to M$ is

$$L(\gamma) = \int_a^b \sqrt{g_{\gamma(t)}\!\left(\dot\gamma(t), \dot\gamma(t)\right)} \, dt = \int_a^b \sqrt{g_{ij}(\gamma(t))\,\dot\gamma^i(t)\,\dot\gamma^j(t)} \, dt.$$
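The length functional can be approximated numerically by sampling the curve. A minimal sketch, with illustrative function names and a finite-difference velocity (not from any library):

```python
import numpy as np

def curve_length(gamma, metric, a=0.0, b=1.0, n=1000):
    """Approximate L(gamma) = ∫ sqrt(g(γ', γ')) dt with the midpoint rule.

    gamma:  callable t -> point in R^d (coordinates in a chart)
    metric: callable p -> d x d positive-definite matrix g(p)
    """
    ts = np.linspace(a, b, n + 1)
    mids = 0.5 * (ts[:-1] + ts[1:])
    dt = (b - a) / n
    h = 1e-6
    length = 0.0
    for t in mids:
        v = (gamma(t + h) - gamma(t - h)) / (2 * h)  # central-difference velocity
        g = metric(gamma(t))
        length += np.sqrt(v @ g @ v) * dt
    return length

# Sanity check with the flat Euclidean metric: a straight line
# from (0, 0) to (3, 4) should have length 5.
euclid = lambda p: np.eye(2)
line = lambda t: t * np.array([3.0, 4.0])
print(curve_length(line, euclid))  # ≈ 5.0
```

Swapping in a different `metric` (for example the Poincaré metric below) reuses the same integrator unchanged.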

Examples of Riemannian manifolds:

| Manifold | Metric $g$ | Geometry |
|---|---|---|
| $\mathbb{R}^n$ | $g_{ij} = \delta_{ij}$ (identity) | Flat Euclidean |
| $S^n$ (sphere, radius $r$) | Induced from $\mathbb{R}^{n+1}$ | Constant positive curvature $1/r^2$ |
| Hyperbolic space $\mathbb{H}^n$ | $g_{ij} = \delta_{ij}/x_n^2$ (Poincaré half-space model) | Constant negative curvature |
| $\operatorname{Sym}^+(n)$ | $g_\Sigma(A,B) = \operatorname{tr}(\Sigma^{-1}A\Sigma^{-1}B)$ | Affine-invariant metric |
| Statistical manifold | Fisher information metric $g_{ij} = \mathbb{E}[\partial_i\ell\,\partial_j\ell]$ | Information geometry |

Geodesics

The geodesic distance between $p, q \in M$ is

$$d(p,q) = \inf_{\gamma : p \to q} L(\gamma),$$

the infimum over all smooth curves connecting $p$ and $q$.

A geodesic is a locally length-minimizing curve — the Riemannian analogue of a straight line. Geodesics satisfy the geodesic equation

$$\ddot\gamma^k + \Gamma^k_{ij}\,\dot\gamma^i\dot\gamma^j = 0,$$

where $\Gamma^k_{ij}$ are the Christoffel symbols (encoding how the metric changes across the manifold):

$$\Gamma^k_{ij} = \frac{1}{2}g^{kl}\left(\partial_i g_{jl} + \partial_j g_{il} - \partial_l g_{ij}\right).$$

The Christoffel symbols encode how basis vectors rotate as you move across the manifold — they vanish on flat $\mathbb{R}^n$ (no rotation needed) and are nonzero wherever the metric changes. They appear in the geodesic equation because "staying in the same direction" on a curved surface requires continuously correcting for the bending of the coordinate system — the same reason a ship's navigator must keep adjusting heading even while following a geodesic.

On $\mathbb{R}^n$: the Christoffel symbols vanish ($g_{ij} = \delta_{ij}$ is constant), and the geodesic equation reduces to $\ddot\gamma = 0$ — straight lines.

On $S^2$: geodesics are great circles — intersections of the sphere with planes through the origin.
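The geodesic equation can be integrated directly. On $S^2$ in spherical coordinates $(\theta, \phi)$ the metric is $\mathrm{diag}(1, \sin^2\theta)$ and the only nonzero Christoffel symbols are $\Gamma^\theta_{\phi\phi} = -\sin\theta\cos\theta$ and $\Gamma^\phi_{\theta\phi} = \cot\theta$. A rough sketch with a plain Euler step (illustrative, not production code):

```python
import numpy as np

def sphere_geodesic_step(state, dt):
    """One Euler step of the geodesic equation on S^2 in spherical
    coordinates (theta, phi). Using the Christoffel symbols above:
        θ̈ = sinθ cosθ φ̇²,   φ̈ = -2 cotθ θ̇ φ̇.
    state = [theta, phi, d(theta)/dt, d(phi)/dt]."""
    th, ph, dth, dph = state
    ddth = np.sin(th) * np.cos(th) * dph**2
    ddph = -2.0 / np.tan(th) * dth * dph
    return np.array([th + dt*dth, ph + dt*dph, dth + dt*ddth, dph + dt*ddph])

# Start on the equator heading due east. The equator is a great circle,
# so the integrated geodesic should stay at θ = π/2 while φ advances.
state = np.array([np.pi/2, 0.0, 0.0, 1.0])
for _ in range(1000):
    state = sphere_geodesic_step(state, 1e-3)
print(state[:2])  # θ stays ≈ π/2, φ ≈ 1.0
```

A latitude circle other than the equator fails this test: holding $\theta \ne \pi/2$ constant leaves the $\ddot\theta$ term unbalanced, which is exactly why such circles are not geodesics.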

The Exponential and Logarithm Maps

At each point $p \in M$, the exponential map $\exp_p : T_p M \to M$ "shoots out" from $p$ in the direction and magnitude of a tangent vector:

$$\exp_p(\mathbf{v}) = \gamma(1), \quad \text{where } \gamma \text{ is the geodesic with } \gamma(0)=p,\ \dot\gamma(0)=\mathbf{v}.$$

The logarithm map $\log_p : M \to T_p M$ is the local inverse of $\exp_p$ (defined on a neighborhood of $p$).

These maps are the bridge between curved and flat:

  • Optimization: natural gradient descent updates as $p_{t+1} = \exp_{p_t}(-\eta\,\operatorname{grad} f(p_t))$, replacing the flat Euclidean step with a geodesic step.
  • Averaging: the Fréchet mean (Riemannian mean) of points $\{p_i\}$ is $\bar p = \arg\min_p \sum_i d(p, p_i)^2$, computed iteratively via log maps.
  • Interpolation: geodesic interpolation $\gamma(t) = \exp_p(t \log_p q)$ for $t \in [0,1]$.
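The Fréchet-mean recipe can be made concrete on $S^2$, where $\exp_p$ and $\log_p$ have closed forms. A minimal sketch with illustrative function names:

```python
import numpy as np

def exp_map(p, v):
    """Exponential map on the unit sphere S^2 ⊂ R^3 (v tangent at p)."""
    n = np.linalg.norm(v)
    if n < 1e-12:
        return p
    return np.cos(n) * p + np.sin(n) * v / n

def log_map(p, q):
    """Logarithm map on S^2: tangent vector at p pointing toward q,
    with length equal to the geodesic distance arccos(p·q)."""
    c = np.clip(p @ q, -1.0, 1.0)
    theta = np.arccos(c)
    if theta < 1e-12:
        return np.zeros(3)
    return theta / np.sin(theta) * (q - c * p)

def frechet_mean(points, iters=50):
    """Riemannian mean: average the log maps, shoot back with exp, repeat."""
    p = points[0] / np.linalg.norm(points[0])
    for _ in range(iters):
        v = np.mean([log_map(p, q) for q in points], axis=0)
        p = exp_map(p, v)
    return p

# Three points placed symmetrically around the north pole:
# by symmetry their Fréchet mean is the pole itself.
pts = [np.array([np.sin(0.3)*np.cos(a), np.sin(0.3)*np.sin(a), np.cos(0.3)])
       for a in (0.0, 2*np.pi/3, 4*np.pi/3)]
print(frechet_mean(pts))  # ≈ [0, 0, 1]
```

Geodesic interpolation falls out of the same two maps: `exp_map(p, t * log_map(p, q))` traces the great-circle arc from `p` to `q`.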

Curvature

The Riemann curvature tensor $R(X,Y)Z$ measures the non-commutativity of covariant differentiation — how much parallel transport around a small loop rotates a tangent vector. It encodes all local curvature information.

The sectional curvature $K(\sigma)$ of a 2D tangent plane $\sigma = \operatorname{span}\{X, Y\}$ is

$$K(\sigma) = \frac{g(R(X,Y)Y,X)}{g(X,X)\,g(Y,Y) - g(X,Y)^2}.$$

| Sign of $K$ | Geometry | Example |
|---|---|---|
| $K > 0$ | Positive curvature — geodesics converge | Sphere |
| $K = 0$ | Flat — geodesics stay parallel | Euclidean $\mathbb{R}^n$ |
| $K < 0$ | Negative curvature — geodesics diverge | Hyperbolic space |

The scalar curvature $R = g^{ij}R_{ij}$ (trace of the Ricci tensor) is a single number per point summarizing the average sectional curvature. In general relativity, $R$ appears in Einstein's field equations.

Information Geometry: The Fisher-Rao Metric

A parametric family of distributions $\{p(\mathbf{x};\boldsymbol{\theta}) : \boldsymbol{\theta} \in \Theta\}$ forms a statistical manifold. The Fisher information matrix defines a natural Riemannian metric:

$$g_{ij}(\boldsymbol{\theta}) = \mathcal{I}(\boldsymbol{\theta})_{ij} = \mathbb{E}_{\mathbf{x}\sim p(\cdot;\boldsymbol{\theta})}\!\left[\frac{\partial \log p(\mathbf{x};\boldsymbol{\theta})}{\partial\theta_i}\,\frac{\partial\log p(\mathbf{x};\boldsymbol{\theta})}{\partial\theta_j}\right].$$

This is the Fisher-Rao metric. The geodesic distance it induces is a measure of statistical distinguishability: points far apart are easy to distinguish by hypothesis testing.
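One way to sanity-check this definition is to estimate the Fisher matrix by Monte Carlo, as the expected outer product of the score. The sketch below (illustrative names, not a library API) does this for a univariate Gaussian and can be compared against the closed form $\mathrm{diag}(1/\sigma^2,\ 2/\sigma^2)$ derived later in the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

def fisher_gaussian_mc(mu, sigma, n=200_000):
    """Monte Carlo estimate of the Fisher matrix of N(mu, sigma^2) in
    coordinates (mu, sigma): E[∇logp ∇logp^T] under x ~ p."""
    x = rng.normal(mu, sigma, size=n)
    score_mu = (x - mu) / sigma**2                      # ∂ log p / ∂μ
    score_sigma = (x - mu)**2 / sigma**3 - 1.0 / sigma  # ∂ log p / ∂σ
    s = np.stack([score_mu, score_sigma])
    return s @ s.T / n

# For σ = 2 the exact Fisher matrix is [[0.25, 0], [0, 0.5]].
print(fisher_gaussian_mc(0.0, 2.0))  # ≈ [[0.25, 0], [0, 0.5]]
```

The same estimator applies to any family where you can sample and differentiate the log-density, which is how Fisher matrices are typically approximated for neural networks.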

Natural gradient. The standard gradient $\nabla_\theta L$ ignores the Riemannian structure. The natural gradient is

$$\tilde\nabla_\theta L = \mathcal{I}(\theta)^{-1} \nabla_\theta L,$$

the gradient in the coordinate system defined by the Fisher metric. Natural gradient descent updates parameters in the direction that is steepest in probability space (as measured by KL divergence), not parameter space. It is invariant to reparameterization — changing coordinates from $\theta$ to $\phi(\theta)$ yields the same update.

K-FAC (Kronecker-Factored Approximate Curvature) and its variants approximate $\mathcal{I}(\theta)^{-1}$ cheaply using Kronecker-factored block approximations, enabling natural gradient descent at scale.

Worked Example

Example 1: Geodesics on S2S^2

Two cities lie at latitudes $\phi_1, \phi_2$ and longitudes $\lambda_1, \lambda_2$. The great-circle distance (geodesic on $S^2$) is

$$d = r \cdot \Delta\sigma, \qquad \Delta\sigma = \arccos\bigl(\sin\phi_1\sin\phi_2 + \cos\phi_1\cos\phi_2\cos(\lambda_2-\lambda_1)\bigr).$$

This is greater than the straight-line Euclidean distance through the Earth's interior (the chord), but less than navigating along latitude lines (which are not geodesics on $S^2$ except at the equator).

Why great circles? Because the sphere has positive curvature, geodesics (great circles) "converge" — two geodesics starting at the same point in different directions will meet again at the antipodal point. This is why there are no parallel lines on a sphere.
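The great-circle formula is a few lines of code. A minimal sketch (the city coordinates are approximate):

```python
import numpy as np

def great_circle_km(lat1, lon1, lat2, lon2, r=6371.0):
    """Geodesic (great-circle) distance on a sphere of radius r km,
    via the spherical law of cosines. Inputs in degrees."""
    p1, p2, dl = map(np.radians, (lat1, lat2, lon2 - lon1))
    cos_ds = (np.sin(p1) * np.sin(p2)
              + np.cos(p1) * np.cos(p2) * np.cos(dl))
    return r * np.arccos(np.clip(cos_ds, -1.0, 1.0))

# New York (≈ 40.7° N, 74.0° W) to London (≈ 51.5° N, 0.1° W).
print(great_circle_km(40.7, -74.0, 51.5, -0.1))  # ≈ 5570 km
```

The `clip` guards against floating-point values just outside $[-1, 1]$, which would otherwise make `arccos` return NaN for nearly coincident or nearly antipodal points.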

Example 2: Natural Gradient for Gaussian

For a Gaussian family $\mathcal{N}(\mu, \sigma^2)$ parameterized by $(\mu, \sigma)$:

$$\mathcal{I}(\mu, \sigma) = \begin{pmatrix}1/\sigma^2 & 0 \\ 0 & 2/\sigma^2\end{pmatrix}.$$

The natural gradient of the negative log-likelihood $L = \frac{1}{2}\log\sigma^2 + \frac{(x-\mu)^2}{2\sigma^2}$ is

$$\tilde\nabla_{(\mu,\sigma)} L = \mathcal{I}^{-1}\nabla L = \begin{pmatrix}\sigma^2 & 0 \\ 0 & \sigma^2/2\end{pmatrix}\begin{pmatrix}-(x-\mu)/\sigma^2 \\ 1/\sigma - (x-\mu)^2/\sigma^3\end{pmatrix}.$$

The $\sigma^2$ factor rescales the $\mu$-gradient — effectively using a different step size for each parameter, adapted to the information content. For small $\sigma$ (a sharp distribution), raw gradients in $\mu$ are large; the natural gradient corrects for this.
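The rescaling is easy to see numerically. A sketch computing both gradients for this family, using $\partial L/\partial\mu = -(x-\mu)/\sigma^2$ and $\partial L/\partial\sigma = 1/\sigma - (x-\mu)^2/\sigma^3$ with the closed-form Fisher inverse (the function name is illustrative):

```python
import numpy as np

def grads(x, mu, sigma):
    """Vanilla and natural gradients of the per-sample NLL of N(mu, sigma^2)
    in coordinates (mu, sigma)."""
    vanilla = np.array([-(x - mu) / sigma**2,                   # ∂L/∂μ
                        1.0/sigma - (x - mu)**2 / sigma**3])    # ∂L/∂σ
    fisher_inv = np.diag([sigma**2, sigma**2 / 2.0])            # I(μ,σ)^{-1}
    return vanilla, fisher_inv @ vanilla

# Sharp distribution (small σ): the raw μ-gradient explodes,
# while the natural gradient rescales it by σ².
vanilla, natural = grads(x=1.0, mu=0.0, sigma=0.1)
print(vanilla[0], natural[0])  # ≈ -100 vs ≈ -1
```

With $\sigma = 0.1$, a fixed learning rate applied to the vanilla gradient would overshoot in $\mu$ by a factor of about 100; the Fisher preconditioner absorbs exactly that factor.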

Example 3: Hyperbolic Embeddings of Trees

Trees have exponential growth — the number of nodes at depth $d$ grows as $\beta^d$ for branching factor $\beta$. Euclidean space cannot embed deep trees without distortion because Euclidean volume grows only polynomially with radius.

Hyperbolic space $\mathbb{H}^n$ (constant negative curvature $-1$) has volume growing exponentially with distance — matched to tree geometry. The Poincaré disk model is $D^n = \{x \in \mathbb{R}^n : \|x\| < 1\}$ with metric

$$g_x = \frac{4}{(1-\|x\|^2)^2}\,\delta_{ij}.$$

Geodesic distance: $d(x,y) = \operatorname{arccosh}\left(1 + \dfrac{2\|x-y\|^2}{(1-\|x\|^2)(1-\|y\|^2)}\right)$.
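A sketch of this distance function, using $d(x,y) = \operatorname{arccosh}\bigl(1 + 2\|x-y\|^2 / ((1-\|x\|^2)(1-\|y\|^2))\bigr)$ for the Poincaré disk of curvature $-1$:

```python
import numpy as np

def poincare_dist(x, y):
    """Geodesic distance in the Poincaré disk model (inputs with norm < 1)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    num = 2.0 * np.sum((x - y)**2)
    den = (1.0 - np.sum(x**2)) * (1.0 - np.sum(y**2))
    return np.arccosh(1.0 + num / den)

# Distances blow up near the boundary: the same Euclidean step of 0.1
# costs far more hyperbolic distance at radius 0.89 than at the origin.
print(poincare_dist([0.0, 0.0], [0.1, 0.0]))    # small
print(poincare_dist([0.89, 0.0], [0.99, 0.0]))  # much larger
```

That boundary blow-up is the mechanism behind tree embeddings: placing children progressively closer to the boundary gives each generation exponentially more room.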

Hierarchical structures (WordNet, biological taxonomies, knowledge graphs) embed into $\mathbb{H}^n$ with low distortion using far fewer dimensions than Euclidean space requires. Poincaré embeddings (Nickel & Kiela, 2017) demonstrate this on word hierarchies.

Connections

Where Your Intuition Breaks

The natural gradient sounds like a strict improvement over vanilla gradient descent — it corrects for the geometry and is invariant to reparameterization. But computing it requires inverting the Fisher information matrix, which costs $O(n^2)$ memory and $O(n^3)$ time for $n$ parameters. GPT-2 has roughly 117M parameters; the full Fisher matrix would be $117\text{M} \times 117\text{M}$ — completely infeasible. K-FAC and related approximations make it tractable for specific architectures, but they sacrifice the exact invariance. The reason Adam and Adagrad dominate in practice is not that geometry doesn't matter — it's that their diagonal preconditioners act as crude diagonal curvature approximations, capturing the most important rescaling at negligible cost. The full Riemannian machinery is theoretically correct but practically expensive.

💡Intuition

Why the Fisher metric is canonical. By Chentsov's theorem, the Fisher-Rao metric is (up to scale) the only Riemannian metric on the space of probability distributions that is invariant under sufficient statistics — it doesn't change when you reparameterize the distribution or discard irrelevant information. The same Fisher information also sets the fundamental limit on estimation variance via the Cramér-Rao bound. The natural gradient uses this canonical geometry.

💡Intuition

Curvature of the loss landscape. The Hessian of a neural network loss at a critical point measures local curvature in parameter space. The loss landscape is far from flat — its curvature varies from region to region. Minima whose Hessians have many small eigenvalues (flat minima, low effective rank) tend to generalize better than sharp minima with many large eigenvalues. SGD with small batch sizes implicitly prefers flat minima via its noisy trajectory — a connection to the Riemannian geometry of the loss manifold that remains an active research area.
