Riemannian Geometry: Metrics, Geodesics & Curvature
Riemannian geometry equips each point of a manifold with an inner product on its tangent space, giving rise to notions of distance, angle, and curvature that vary from point to point. This is the geometry underlying natural gradient descent (which uses the Fisher information metric), optimal transport (Wasserstein distances), hyperbolic embeddings of hierarchical data, and the curvature analysis of neural network loss surfaces.
Concepts

[Interactive figure: two geodesics start close together; the sign of the sectional curvature $K$ determines whether they converge, stay parallel, or diverge.]
When you fly from New York to London, the shortest route arcs northward over Greenland — not because pilots are inefficient, but because the Earth's surface curves. That arc is a geodesic: the shortest path on a curved surface. Riemannian geometry assigns each point of a space its own inner product, so "length" and "angle" are defined locally and can vary from place to place. In machine learning, this matters whenever you want to measure distances in spaces that aren't flat: the space of probability distributions (Fisher-Rao metric), the space of covariance matrices, or the curved loss surface of a neural network.
Riemannian Metrics
Definition. A Riemannian metric $g$ on a smooth manifold $M$ is a smooth assignment of an inner product $g_p : T_pM \times T_pM \to \mathbb{R}$ to each tangent space $T_pM$, varying smoothly with $p$.
In local coordinates $(x^1, \dots, x^n)$, the metric is represented by the metric tensor $g_{ij}(x)$ where

$$g_{ij}(x) = g_p\!\left(\frac{\partial}{\partial x^i}, \frac{\partial}{\partial x^j}\right).$$
The length of a curve $\gamma : [a, b] \to M$ is:

$$L(\gamma) = \int_a^b \sqrt{g_{\gamma(t)}\big(\dot\gamma(t), \dot\gamma(t)\big)}\, dt.$$
Examples of Riemannian manifolds:
| Manifold | Metric | Geometry |
|---|---|---|
| $\mathbb{R}^n$ | $g_{ij} = \delta_{ij}$ (identity) | Flat Euclidean |
| $S^n$ (sphere, radius $r$) | Induced from $\mathbb{R}^{n+1}$ | Constant positive curvature |
| Hyperbolic space $\mathbb{H}^n$ | $ds^2 = \frac{dx^2 + dy^2}{y^2}$ (Poincaré half-plane) | Constant negative curvature |
| SPD matrices $S_{++}^n$ | Affine-invariant metric | Nonpositive curvature |
| Statistical manifold | Fisher information metric | Information geometry |
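To make the length functional concrete, here is a small numerical sketch (assuming NumPy; `curve_length` and `half_plane` are illustrative names of my own) that integrates $L(\gamma) = \int \sqrt{g(\dot\gamma, \dot\gamma)}\, dt$ under the Poincaré half-plane metric from the table:

```python
import numpy as np

def curve_length(gamma, metric, n=10_000):
    """Approximate L(gamma) = ∫ sqrt(g_{gamma(t)}(dγ/dt, dγ/dt)) dt over t in [0, 1]."""
    t = np.linspace(0.0, 1.0, n)
    pts = np.array([gamma(ti) for ti in t])      # sampled points along the curve
    vel = np.gradient(pts, t, axis=0)            # finite-difference velocities
    speeds = np.array([np.sqrt(v @ metric(p) @ v) for p, v in zip(pts, vel)])
    return float(np.sum(0.5 * (speeds[1:] + speeds[:-1]) * np.diff(t)))  # trapezoid rule

# Poincaré half-plane metric from the table: g = I / y^2.
half_plane = lambda p: np.eye(2) / p[1] ** 2

# Vertical segment from (0, 1) to (0, e): hyperbolic length is exactly
# ∫_1^e dy/y = 1, so the numerical estimate should be close to 1.
vertical = lambda t: np.array([0.0, np.exp(t)])  # y(t) = e^t traces (0,1) → (0,e)
print(curve_length(vertical, half_plane))        # ≈ 1.0
```

Swapping in a different `metric` function reuses the same quadrature for any 2D Riemannian metric given in coordinates.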
Geodesics
The geodesic distance between $p, q \in M$ is:

$$d(p, q) = \inf_{\gamma} L(\gamma),$$

the infimum over all smooth curves $\gamma$ connecting $p$ and $q$.
A geodesic is a locally length-minimizing curve, the Riemannian analogue of a straight line. Geodesics satisfy the geodesic equation:

$$\ddot\gamma^k + \Gamma^k_{ij}\, \dot\gamma^i \dot\gamma^j = 0,$$

where $\Gamma^k_{ij}$ are the Christoffel symbols (encoding how the metric changes across the manifold):

$$\Gamma^k_{ij} = \tfrac{1}{2}\, g^{kl}\left(\partial_i g_{jl} + \partial_j g_{il} - \partial_l g_{ij}\right).$$
The Christoffel symbols encode how basis vectors rotate as you move across the manifold: they vanish on flat $\mathbb{R}^n$ (no rotation needed) and are nonzero wherever the metric changes. They appear in the geodesic equation because "staying in the same direction" on a curved surface requires continuously correcting for the bending of the coordinate system, the same reason a ship's navigator must keep adjusting heading even while traveling along a geodesic path.
On $\mathbb{R}^n$: the Christoffel symbols vanish ($g_{ij} = \delta_{ij}$ is constant), and the geodesic equation reduces to $\ddot\gamma = 0$: straight lines.

On $S^2$: geodesics are great circles, the intersections of the sphere with planes through the origin.
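The geodesic equation can be checked numerically. The sketch below (NumPy assumed; function names are mine) integrates the $S^2$ geodesic ODE in spherical coordinates $(\theta, \phi)$ with an RK4 stepper and verifies that the trajectory stays in a single plane through the origin, i.e. on a great circle:

```python
import numpy as np

def geodesic_step(state, dt):
    """One RK4 step of the geodesic equation on the unit sphere S^2 in
    spherical coordinates (theta, phi):
        theta'' =  sin(theta) cos(theta) phi'^2
        phi''   = -2 cot(theta) theta' phi'
    These are the Christoffel-symbol terms for g = diag(1, sin^2 theta)."""
    def f(s):
        th, ph, dth, dph = s
        return np.array([dth, dph,
                         np.sin(th) * np.cos(th) * dph ** 2,
                         -2.0 / np.tan(th) * dth * dph])
    k1 = f(state); k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2); k4 = f(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def embed(th, ph):
    """Spherical coordinates -> point on S^2 in R^3."""
    return np.array([np.sin(th) * np.cos(ph), np.sin(th) * np.sin(ph), np.cos(th)])

# Start away from the poles with a generic initial direction.
state = np.array([np.pi / 3, 0.0, 0.4, 0.6])
th, ph, dth, dph = state
p0 = embed(th, ph)
# Embedded velocity, defining the plane of the expected great circle.
v0 = dth * np.array([np.cos(th) * np.cos(ph), np.cos(th) * np.sin(ph), -np.sin(th)]) \
   + dph * np.array([-np.sin(th) * np.sin(ph), np.sin(th) * np.cos(ph), 0.0])
normal = np.cross(p0, v0)

max_dev = 0.0
for _ in range(2000):
    state = geodesic_step(state, 1e-3)
    max_dev = max(max_dev, abs(normal @ embed(state[0], state[1])) / np.linalg.norm(normal))
print(max_dev)   # stays near 0: the geodesic lies in a plane through the origin
```

The planarity check is exactly the "planes through the origin" characterization above, so any bug in the Christoffel terms would show up as a drifting `max_dev`.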
The Exponential and Logarithm Maps
At each point $p \in M$, the exponential map $\exp_p : T_pM \to M$ "shoots out" from $p$ in the direction and magnitude of a tangent vector $v$:

$$\exp_p(v) = \gamma_v(1), \qquad \text{where } \gamma_v \text{ is the geodesic with } \gamma_v(0) = p,\ \dot\gamma_v(0) = v.$$

The logarithm map $\log_p = \exp_p^{-1}$ is the local inverse of $\exp_p$ (defined on a neighborhood of $p$).
These maps are the bridge between curved and flat:
- Optimization: natural gradient descent updates as $\theta_{t+1} = \exp_{\theta_t}\!\big(-\eta\, \tilde\nabla L(\theta_t)\big)$, replacing the flat Euclidean step with a geodesic step.
- Averaging: the Fréchet mean (Riemannian mean) of points $x_1, \dots, x_N$ is $\bar{x} = \arg\min_x \sum_i d(x, x_i)^2$, computed iteratively via log maps.
- Interpolation: geodesic interpolation $\gamma(t) = \exp_p\!\big(t \log_p(q)\big)$ for $t \in [0, 1]$.
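On the unit sphere these maps have closed forms, which makes the exp/log machinery easy to test (a NumPy sketch; the helper names are my own):

```python
import numpy as np

def sphere_exp(p, v):
    """Exponential map on the unit sphere: follow the great circle from p
    in the direction of tangent vector v (v ⟂ p) for arc length ||v||."""
    n = np.linalg.norm(v)
    if n < 1e-12:
        return p
    return np.cos(n) * p + np.sin(n) * v / n

def sphere_log(p, q):
    """Logarithm map: the tangent vector at p that exp_p maps to q."""
    c = np.clip(p @ q, -1.0, 1.0)
    theta = np.arccos(c)                # geodesic distance d(p, q)
    if theta < 1e-12:
        return np.zeros_like(p)
    return theta * (q - c * p) / np.linalg.norm(q - c * p)

p = np.array([1.0, 0.0, 0.0])
q = np.array([0.0, 1.0, 0.0])

v = sphere_log(p, q)
print(np.linalg.norm(v))                       # π/2: a quarter great circle
print(sphere_exp(p, v))                        # recovers q
print(sphere_exp(p, 0.5 * sphere_log(p, q)))   # geodesic midpoint γ(1/2)
```

The last line is exactly the interpolation formula $\gamma(t) = \exp_p(t \log_p(q))$ with $t = 1/2$; iterating "average the log vectors, then exp" gives a Fréchet-mean solver.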
Curvature
The Riemann curvature tensor

$$R(X, Y)Z = \nabla_X \nabla_Y Z - \nabla_Y \nabla_X Z - \nabla_{[X, Y]} Z$$

measures the non-commutativity of covariant differentiation: how much parallel transport around a small loop rotates a tangent vector. It encodes all local curvature information.

Sectional curvature for a 2D tangent plane $\sigma = \mathrm{span}(u, v) \subset T_pM$:

$$K(u, v) = \frac{\langle R(u, v)v,\, u \rangle}{\|u\|^2 \|v\|^2 - \langle u, v \rangle^2}.$$
| Sign of $K$ | Geometry | Example |
|---|---|---|
| $K > 0$ | Positive curvature: geodesics converge | Sphere $S^n$ |
| $K = 0$ | Flat: geodesics stay parallel | Euclidean $\mathbb{R}^n$ |
| $K < 0$ | Negative curvature: geodesics diverge | Hyperbolic space $\mathbb{H}^n$ |
The scalar curvature $S = g^{ij} R_{ij}$ (the trace of the Ricci tensor) is a single number per point summarizing the average sectional curvature. In general relativity, $S$ appears in Einstein's field equations.
Information Geometry: The Fisher-Rao Metric
A parametric family of distributions $\{p_\theta : \theta \in \Theta\}$ forms a statistical manifold. The Fisher information matrix defines a natural Riemannian metric:

$$F_{ij}(\theta) = \mathbb{E}_{x \sim p_\theta}\!\left[\partial_i \log p_\theta(x)\, \partial_j \log p_\theta(x)\right].$$
This is the Fisher-Rao metric. The geodesic distance it induces is a measure of statistical distinguishability: points far apart are easy to distinguish by hypothesis testing.
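The expectation defining the Fisher metric can be estimated by Monte Carlo. For a Bernoulli($p$) family the closed form is $F(p) = 1/\big(p(1-p)\big)$, so a sampling estimate is easy to validate (NumPy sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
x = rng.binomial(1, p, size=200_000)          # samples from Bernoulli(p)

# Score function ∂/∂p log p_θ(x) = x/p - (1-x)/(1-p)
score = x / p - (1 - x) / (1 - p)
fisher_mc = np.mean(score ** 2)               # Monte Carlo estimate of F(p)
fisher_exact = 1.0 / (p * (1 - p))            # closed form for Bernoulli

print(fisher_mc, fisher_exact)                # both ≈ 4.76
```

The same outer-product-of-scores recipe is how the empirical Fisher matrix is estimated for neural networks, where no closed form exists.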
Natural gradient. The standard gradient $\nabla L(\theta)$ ignores the Riemannian structure. The natural gradient is:

$$\tilde\nabla L(\theta) = F(\theta)^{-1}\, \nabla L(\theta),$$

the gradient in the coordinate system defined by the Fisher metric. Natural gradient descent updates parameters in the direction that is steepest in probability space (as measured by KL divergence), not parameter space. It is invariant to reparameterization: changing coordinates from $\theta$ to $\phi(\theta)$ gives the same update.
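The invariance claim can be checked directly for Bernoulli($p$) under the logit reparameterization $s = \log\frac{p}{1-p}$: the natural gradient transforms as a tangent vector (both coordinate systems describe the same geometric update direction), while the raw gradient does not. A small sketch, using the expected NLL in closed form:

```python
import numpy as np

# Expected NLL of Bernoulli(p) against data with mean m:
#   L(p) = -(m log p + (1 - m) log(1 - p))
m, p = 0.8, 0.3
s = np.log(p / (1 - p))        # logit coordinate (only its derivatives matter below)

# In p-coordinates: gradient, Fisher, natural gradient.
grad_p = (p - m) / (p * (1 - p))
F_p = 1.0 / (p * (1 - p))
nat_p = grad_p / F_p                    # = p - m

# In s-coordinates (p = sigmoid(s)): chain rule gives dp/ds = p(1-p).
dp_ds = p * (1 - p)
grad_s = grad_p * dp_ds                 # gradient pulls back with dp/ds
F_s = F_p * dp_ds ** 2                  # Fisher transforms as a metric tensor
nat_s = grad_s / F_s

# A tangent vector v_p pushes forward to v_s = (ds/dp) v_p = v_p / dp_ds.
print(nat_s, nat_p / dp_ds)             # equal: same geometric direction
print(grad_s, grad_p / dp_ds)           # different: raw gradients disagree
```

The two natural-gradient numbers match exactly; the two raw-gradient numbers differ by a factor of $(dp/ds)^2$, which is precisely the distortion the Fisher metric removes.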
K-FAC and its variants approximate $F^{-1}$ cheaply using Kronecker-factored approximations, enabling natural gradient descent at scale.
Worked Examples
Example 1: Geodesics on $S^2$

Two cities at latitudes $\phi_1, \phi_2$ and longitudes $\lambda_1, \lambda_2$. The great-circle distance (geodesic on $S^2$ of radius $r$):

$$d = r \arccos\!\big(\sin\phi_1 \sin\phi_2 + \cos\phi_1 \cos\phi_2 \cos(\lambda_1 - \lambda_2)\big).$$
This is greater than the straight-line (chord) distance through the Earth's interior, but shorter than navigating along latitude circles (which are not geodesics on $S^2$ except at the equator).
Why great circles? Because the sphere has positive curvature, geodesics (great circles) "converge" — two geodesics starting at the same point in different directions will meet again at the antipodal point. This is why there are no parallel lines on a sphere.
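A quick computation of the formula above (standard-library Python; the city coordinates are approximate) also shows how the geodesic distance compares to the straight chord through the interior:

```python
import math

def great_circle(lat1, lon1, lat2, lon2, r=6371.0):
    """Geodesic distance on a sphere of radius r (km), spherical law of cosines."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dl = math.radians(lon2 - lon1)
    return r * math.acos(math.sin(p1) * math.sin(p2)
                         + math.cos(p1) * math.cos(p2) * math.cos(dl))

# Approximate coordinates for New York and London.
d = great_circle(40.71, -74.01, 51.51, -0.13)
print(round(d))       # ≈ 5570 km along the great circle

# Chord through the Earth's interior subtending the same central angle d/r.
chord = 2 * 6371.0 * math.sin(d / (2 * 6371.0))
print(round(chord))   # ≈ 5393 km: the chord is shorter, as it must be
```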
Example 2: Natural Gradient for Gaussian
For a Gaussian family $\mathcal{N}(\mu, \sigma^2)$ parameterized by $\theta = (\mu, \sigma)$:

$$F(\theta) = \begin{pmatrix} 1/\sigma^2 & 0 \\ 0 & 2/\sigma^2 \end{pmatrix}.$$

The natural gradient of the negative log-likelihood is:

$$\tilde\nabla L = F^{-1} \nabla L = \begin{pmatrix} \sigma^2\, \partial L/\partial\mu \\ (\sigma^2/2)\, \partial L/\partial\sigma \end{pmatrix}.$$
The factor $\sigma^2$ rescales the gradient, effectively using a different step size for each parameter, adapted to the information content. For small $\sigma$ (a sharp distribution), gradients in $\mu$ are large; the natural gradient corrects for this.
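A toy comparison (NumPy sketch; learning rates chosen by hand) makes the rescaling visible: with the closed-form $F^{-1} = \mathrm{diag}(\sigma^2, \sigma^2/2)$, natural updates reach the maximum-likelihood estimate in a few dozen steps while vanilla updates crawl:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(2.0, 0.1, size=5000)   # sharp distribution: small sigma

def grads(mu, sigma):
    """Gradient of the average Gaussian NLL with respect to (mu, sigma)."""
    g_mu = (mu - data.mean()) / sigma ** 2
    g_sigma = 1.0 / sigma - np.mean((data - mu) ** 2) / sigma ** 3
    return np.array([g_mu, g_sigma])

def step_vanilla(theta, lr=1e-3):
    return theta - lr * grads(*theta)

def step_natural(theta, lr=0.5):
    mu, sigma = theta
    F_inv = np.diag([sigma ** 2, sigma ** 2 / 2.0])   # closed-form F^{-1}
    return theta - lr * F_inv @ grads(*theta)

theta_v = np.array([0.0, 1.0])
theta_n = np.array([0.0, 1.0])
for _ in range(100):
    theta_v = step_vanilla(theta_v)
    theta_n = step_natural(theta_n)
print(theta_n)   # close to the MLE: mu ≈ 2.0, sigma ≈ 0.1
print(theta_v)   # still far from the MLE after the same iteration count
```

The vanilla run is not hopeless, just badly scaled: its $\mu$-steps shrink with $\sigma^2$ exactly where the natural gradient enlarges them.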
Example 3: Hyperbolic Embeddings of Trees
Trees have exponential growth: the number of nodes at depth $d$ grows as $b^d$ for branching factor $b$. Euclidean space cannot embed trees without distortion because Euclidean volumes grow only polynomially with radius.
Hyperbolic space (constant negative curvature $K = -1$) has volume growing exponentially with distance, matched to tree geometry. The Poincaré disk model: $\mathbb{D} = \{x \in \mathbb{R}^2 : \|x\| < 1\}$ with metric:

$$ds^2 = \frac{4\, \|dx\|^2}{(1 - \|x\|^2)^2}.$$

Geodesic distance: $d(u, v) = \operatorname{arcosh}\!\left(1 + \dfrac{2\|u - v\|^2}{(1 - \|u\|^2)(1 - \|v\|^2)}\right)$.
Hierarchical structures (WordNet, biological taxonomies, knowledge graphs) embed into hyperbolic space with low distortion using far fewer dimensions than Euclidean space requires. Poincaré embeddings (Nickel & Kiela, 2017) demonstrate this on word hierarchies.
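The distance formula above is two lines of code, and it already shows the exponential behavior that makes the disk tree-friendly (NumPy sketch; `poincare_dist` is my own name):

```python
import numpy as np

def poincare_dist(u, v):
    """Geodesic distance in the Poincaré disk (curvature K = -1)."""
    num = 2 * np.sum((u - v) ** 2)
    den = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + num / den)

origin = np.zeros(2)
# A Euclidean step of 0.09 toward the boundary adds huge hyperbolic distance:
print(poincare_dist(origin, np.array([0.9, 0.0])))    # ≈ 2.94
print(poincare_dist(origin, np.array([0.99, 0.0])))   # ≈ 5.29

# Two near-boundary points a Euclidean hair apart (sibling leaves in a tree
# embedding) are still far apart hyperbolically:
u = np.array([0.99, 0.0])
v = 0.99 * np.array([np.cos(0.1), np.sin(0.1)])
print(poincare_dist(u, v))                            # ≈ 4.61
```

In an embedding, the root sits near the origin and leaves crowd the boundary: there is exponentially more room out there, exactly matching the $b^d$ growth of a tree.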
Where Your Intuition Breaks
The natural gradient sounds like a strict improvement over vanilla gradient descent: it corrects for the geometry and is invariant to reparameterization. But computing it requires inverting the Fisher information matrix, which costs $O(n^2)$ memory and $O(n^3)$ time for $n$ parameters. A GPT-2 model has ~117M parameters; the full Fisher matrix would have roughly $10^{16}$ entries, completely infeasible. K-FAC and related approximations make it tractable for specific architectures, but they sacrifice the exact invariance. The reason Adam and Adagrad dominate in practice is not that geometry doesn't matter; it's that diagonal Fisher approximations (which these optimizers implicitly use) capture the most important rescaling at negligible cost. The full Riemannian geometry is theoretically correct but practically expensive.
Why the Fisher metric is canonical. The Fisher-Rao metric is the only Riemannian metric on the space of probability distributions that is invariant under sufficient statistics — i.e., it doesn't change when you reparameterize the distribution or discard irrelevant information. This invariance under sufficient statistics is the information-geometric content of the Cramér-Rao bound: the Fisher information sets the fundamental limit on estimation variance. The natural gradient uses this canonical geometry.
Curvature of the loss landscape. The Hessian of a neural network loss at a critical point measures local curvature in parameter space. This curvature is far from uniform: it varies from region to region of the landscape. Points near a flat minimum (many small Hessian eigenvalues, low effective rank) tend to generalize better than sharp minima (many large eigenvalues). SGD with small batch sizes implicitly prefers flat minima via its noisy trajectory, a connection to the Riemannian geometry of the loss manifold that is an active research area.