
Riemannian Geometry: Metrics, Geodesics & Curvature

Riemannian geometry equips each point of a manifold with an inner product on its tangent space, giving rise to notions of distance, angle, and curvature that vary from point to point. This is the geometry underlying natural gradient descent (which uses the Fisher information metric), optimal transport (Wasserstein distances), hyperbolic embeddings of hierarchical data, and the curvature analysis of neural network loss surfaces.

Concepts

The sign of the sectional curvature $K$ determines how geodesics (shortest paths) behave relative to each other. Two geodesics start close together; curvature determines what happens next:

  • $K > 0$ (sphere): geodesics converge and meet again — no parallel postulate.
  • $K = 0$ (flat): geodesics stay parallel — Euclid's parallel postulate holds.
  • $K < 0$ (hyperbolic): geodesics diverge — infinitely many parallels through any point.

When you fly from New York to London, the shortest route arcs northward over Greenland — not because pilots are inefficient, but because the Earth's surface curves. That arc is a geodesic: the shortest path on a curved surface. Riemannian geometry assigns each point of a space its own inner product, so "length" and "angle" are defined locally and can vary from place to place. In machine learning, this matters whenever you want to measure distances in spaces that aren't flat: the space of probability distributions (Fisher-Rao metric), the space of covariance matrices, or the curved loss surface of a neural network.

Riemannian Metrics

Definition. A Riemannian metric on a smooth manifold $M$ is a smooth assignment of an inner product $g_p : T_p M \times T_p M \to \mathbb{R}$ to each tangent space $T_p M$, varying smoothly with $p$.

In local coordinates $(x^1, \ldots, x^n)$, the metric is represented by the metric tensor $g = (g_{ij})$, where

$$g_{ij}(p) = g_p\!\left(\frac{\partial}{\partial x^i}, \frac{\partial}{\partial x^j}\right).$$

The length of a curve $\gamma : [a,b] \to M$ is

$$L(\gamma) = \int_a^b \sqrt{g_{\gamma(t)}\!\left(\dot\gamma(t), \dot\gamma(t)\right)} \, dt = \int_a^b \sqrt{g_{ij}(\gamma(t))\,\dot\gamma^i(t)\,\dot\gamma^j(t)} \, dt.$$
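The length functional can be approximated numerically by sampling the curve. A minimal sketch, with illustrative function names and a finite-difference velocity (not from any library):

```python
import numpy as np

def curve_length(gamma, metric, a=0.0, b=1.0, n=1000):
    """Approximate L(gamma) = ∫ sqrt(g(γ', γ')) dt with the midpoint rule.

    gamma:  callable t -> point in R^d (coordinates in a chart)
    metric: callable p -> d x d positive-definite matrix g(p)
    """
    ts = np.linspace(a, b, n + 1)
    mids = 0.5 * (ts[:-1] + ts[1:])
    dt = (b - a) / n
    h = 1e-6
    length = 0.0
    for t in mids:
        v = (gamma(t + h) - gamma(t - h)) / (2 * h)  # central-difference velocity
        g = metric(gamma(t))
        length += np.sqrt(v @ g @ v) * dt
    return length

# Sanity check with the flat Euclidean metric: a straight line
# from (0, 0) to (3, 4) should have length 5.
euclid = lambda p: np.eye(2)
line = lambda t: t * np.array([3.0, 4.0])
print(curve_length(line, euclid))  # ≈ 5.0
```

Swapping in a different `metric` (for example the Poincaré metric below) reuses the same integrator unchanged.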

Examples of Riemannian manifolds:

| Manifold | Metric $g$ | Geometry |
|---|---|---|
| $\mathbb{R}^n$ | $g_{ij} = \delta_{ij}$ (identity) | Flat Euclidean |
| $S^n$ (sphere, radius $r$) | Induced from $\mathbb{R}^{n+1}$ | Constant positive curvature $1/r^2$ |
| Hyperbolic space $\mathbb{H}^n$ | $g_{ij} = \delta_{ij}/x_n^2$ (Poincaré half-space model) | Constant negative curvature |
| $\operatorname{Sym}^+(n)$ | $g_\Sigma(A,B) = \operatorname{tr}(\Sigma^{-1}A\Sigma^{-1}B)$ | Affine-invariant metric |
| Statistical manifold | Fisher information metric $g_{ij} = \mathbb{E}[\partial_i\ell\,\partial_j\ell]$ | Information geometry |

Geodesics

The geodesic distance between $p, q \in M$ is

$$d(p,q) = \inf_{\gamma : p \to q} L(\gamma),$$

the infimum over all smooth curves connecting $p$ and $q$.

A geodesic is a locally length-minimizing curve — the Riemannian analogue of a straight line. Geodesics satisfy the geodesic equation

$$\ddot\gamma^k + \Gamma^k_{ij}\,\dot\gamma^i\dot\gamma^j = 0,$$

where $\Gamma^k_{ij}$ are the Christoffel symbols (encoding how the metric changes across the manifold):

$$\Gamma^k_{ij} = \frac{1}{2}g^{kl}\left(\partial_i g_{jl} + \partial_j g_{il} - \partial_l g_{ij}\right).$$

The Christoffel symbols encode how basis vectors rotate as you move across the manifold — they vanish on flat $\mathbb{R}^n$ (no rotation needed) and are nonzero wherever the metric changes. They appear in the geodesic equation because "staying in the same direction" on a curved surface requires continuously correcting for the bending of the coordinate system — the same reason a ship's navigator must keep adjusting heading even while following a geodesic.

On $\mathbb{R}^n$: the Christoffel symbols vanish ($g_{ij} = \delta_{ij}$ is constant), and the geodesic equation reduces to $\ddot\gamma = 0$ — straight lines.

On $S^2$: geodesics are great circles — intersections of the sphere with planes through the origin.
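The geodesic equation can be integrated directly. On $S^2$ in spherical coordinates $(\theta, \phi)$ the metric is $\mathrm{diag}(1, \sin^2\theta)$ and the only nonzero Christoffel symbols are $\Gamma^\theta_{\phi\phi} = -\sin\theta\cos\theta$ and $\Gamma^\phi_{\theta\phi} = \cot\theta$. A rough sketch with a plain Euler step (illustrative, not production code):

```python
import numpy as np

def sphere_geodesic_step(state, dt):
    """One Euler step of the geodesic equation on S^2 in spherical
    coordinates (theta, phi). Using the Christoffel symbols above:
        θ̈ = sinθ cosθ φ̇²,   φ̈ = -2 cotθ θ̇ φ̇.
    state = [theta, phi, d(theta)/dt, d(phi)/dt]."""
    th, ph, dth, dph = state
    ddth = np.sin(th) * np.cos(th) * dph**2
    ddph = -2.0 / np.tan(th) * dth * dph
    return np.array([th + dt*dth, ph + dt*dph, dth + dt*ddth, dph + dt*ddph])

# Start on the equator heading due east. The equator is a great circle,
# so the integrated geodesic should stay at θ = π/2 while φ advances.
state = np.array([np.pi/2, 0.0, 0.0, 1.0])
for _ in range(1000):
    state = sphere_geodesic_step(state, 1e-3)
print(state[:2])  # θ stays ≈ π/2, φ ≈ 1.0
```

A latitude circle other than the equator fails this test: holding $\theta \ne \pi/2$ constant leaves the $\ddot\theta$ term unbalanced, which is exactly why such circles are not geodesics.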

The Exponential and Logarithm Maps

At each point $p \in M$, the exponential map $\exp_p : T_p M \to M$ "shoots out" from $p$ in the direction and magnitude of a tangent vector:

$$\exp_p(\mathbf{v}) = \gamma(1), \quad \text{where } \gamma \text{ is the geodesic with } \gamma(0)=p,\ \dot\gamma(0)=\mathbf{v}.$$

The logarithm map $\log_p : M \to T_p M$ is the local inverse of $\exp_p$ (defined on a neighborhood of $p$).

These maps are the bridge between curved and flat:

  • Optimization: natural gradient descent updates as $p_{t+1} = \exp_{p_t}(-\eta\,\operatorname{grad} f(p_t))$, replacing the flat Euclidean step with a geodesic step.
  • Averaging: the Fréchet mean (Riemannian mean) of points $\{p_i\}$ is $\bar p = \arg\min_p \sum_i d(p, p_i)^2$, computed iteratively via log maps.
  • Interpolation: geodesic interpolation $\gamma(t) = \exp_p(t \log_p q)$ for $t \in [0,1]$.
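The Fréchet-mean recipe can be made concrete on $S^2$, where $\exp_p$ and $\log_p$ have closed forms. A minimal sketch with illustrative function names:

```python
import numpy as np

def exp_map(p, v):
    """Exponential map on the unit sphere S^2 ⊂ R^3 (v tangent at p)."""
    n = np.linalg.norm(v)
    if n < 1e-12:
        return p
    return np.cos(n) * p + np.sin(n) * v / n

def log_map(p, q):
    """Logarithm map on S^2: tangent vector at p pointing toward q,
    with length equal to the geodesic distance arccos(p·q)."""
    c = np.clip(p @ q, -1.0, 1.0)
    theta = np.arccos(c)
    if theta < 1e-12:
        return np.zeros(3)
    return theta / np.sin(theta) * (q - c * p)

def frechet_mean(points, iters=50):
    """Riemannian mean: average the log maps, shoot back with exp, repeat."""
    p = points[0] / np.linalg.norm(points[0])
    for _ in range(iters):
        v = np.mean([log_map(p, q) for q in points], axis=0)
        p = exp_map(p, v)
    return p

# Three points placed symmetrically around the north pole:
# by symmetry their Fréchet mean is the pole itself.
pts = [np.array([np.sin(0.3)*np.cos(a), np.sin(0.3)*np.sin(a), np.cos(0.3)])
       for a in (0.0, 2*np.pi/3, 4*np.pi/3)]
print(frechet_mean(pts))  # ≈ [0, 0, 1]
```

Geodesic interpolation falls out of the same two maps: `exp_map(p, t * log_map(p, q))` traces the great-circle arc from `p` to `q`.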

Curvature

The Riemann curvature tensor $R(X,Y)Z$ measures the non-commutativity of covariant differentiation — how much parallel transport around a small loop rotates a tangent vector. It encodes all local curvature information.

The sectional curvature $K(\sigma)$ of a 2D tangent plane $\sigma = \operatorname{span}\{X, Y\}$ is

$$K(\sigma) = \frac{g(R(X,Y)Y,X)}{g(X,X)\,g(Y,Y) - g(X,Y)^2}.$$

| Sign of $K$ | Geometry | Example |
|---|---|---|
| $K > 0$ | Positive curvature — geodesics converge | Sphere |
| $K = 0$ | Flat — geodesics stay parallel | Euclidean $\mathbb{R}^n$ |
| $K < 0$ | Negative curvature — geodesics diverge | Hyperbolic space |

The scalar curvature $R = g^{ij}R_{ij}$ (trace of the Ricci tensor) is a single number per point summarizing the average sectional curvature. In general relativity, $R$ appears in Einstein's field equations.

Information Geometry: The Fisher-Rao Metric

A parametric family of distributions $\{p(\mathbf{x};\boldsymbol{\theta}) : \boldsymbol{\theta} \in \Theta\}$ forms a statistical manifold. The Fisher information matrix defines a natural Riemannian metric:

$$g_{ij}(\boldsymbol{\theta}) = \mathcal{I}(\boldsymbol{\theta})_{ij} = \mathbb{E}_{\mathbf{x}\sim p(\cdot;\boldsymbol{\theta})}\!\left[\frac{\partial \log p(\mathbf{x};\boldsymbol{\theta})}{\partial\theta_i}\,\frac{\partial\log p(\mathbf{x};\boldsymbol{\theta})}{\partial\theta_j}\right].$$

This is the Fisher-Rao metric. The geodesic distance it induces is a measure of statistical distinguishability: points far apart are easy to distinguish by hypothesis testing.
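One way to sanity-check this definition is to estimate the Fisher matrix by Monte Carlo, as the expected outer product of the score. The sketch below (illustrative names, not a library API) does this for a univariate Gaussian and can be compared against the closed form $\mathrm{diag}(1/\sigma^2,\ 2/\sigma^2)$ derived later in the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

def fisher_gaussian_mc(mu, sigma, n=200_000):
    """Monte Carlo estimate of the Fisher matrix of N(mu, sigma^2) in
    coordinates (mu, sigma): E[∇logp ∇logp^T] under x ~ p."""
    x = rng.normal(mu, sigma, size=n)
    score_mu = (x - mu) / sigma**2                      # ∂ log p / ∂μ
    score_sigma = (x - mu)**2 / sigma**3 - 1.0 / sigma  # ∂ log p / ∂σ
    s = np.stack([score_mu, score_sigma])
    return s @ s.T / n

# For σ = 2 the exact Fisher matrix is [[0.25, 0], [0, 0.5]].
print(fisher_gaussian_mc(0.0, 2.0))  # ≈ [[0.25, 0], [0, 0.5]]
```

The same estimator applies to any family where you can sample and differentiate the log-density, which is how Fisher matrices are typically approximated for neural networks.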

Natural gradient. The standard gradient $\nabla_\theta L$ ignores the Riemannian structure. The natural gradient is

$$\tilde\nabla_\theta L = \mathcal{I}(\theta)^{-1} \nabla_\theta L,$$

the gradient in the coordinate system defined by the Fisher metric. Natural gradient descent updates parameters in the direction that is steepest in probability space (as measured by KL divergence), not parameter space. It is invariant to reparameterization — changing coordinates from $\theta$ to $\phi(\theta)$ yields the same update.

K-FAC (Kronecker-Factored Approximate Curvature) and its variants approximate $\mathcal{I}(\theta)^{-1}$ cheaply using Kronecker-factored block approximations, enabling natural gradient descent at scale.

Worked Example

Example 1: Geodesics on S2S^2

Two cities lie at latitudes $\phi_1, \phi_2$ and longitudes $\lambda_1, \lambda_2$. The great-circle distance (geodesic on $S^2$) is

$$d = r \cdot \Delta\sigma, \qquad \Delta\sigma = \arccos\bigl(\sin\phi_1\sin\phi_2 + \cos\phi_1\cos\phi_2\cos(\lambda_2-\lambda_1)\bigr).$$

This is greater than the straight-line Euclidean distance through the Earth's interior (the chord), but less than navigating along latitude lines (which are not geodesics on $S^2$ except at the equator).

Why great circles? Because the sphere has positive curvature, geodesics (great circles) "converge" — two geodesics starting at the same point in different directions will meet again at the antipodal point. This is why there are no parallel lines on a sphere.
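The great-circle formula is a few lines of code. A minimal sketch (the city coordinates are approximate):

```python
import numpy as np

def great_circle_km(lat1, lon1, lat2, lon2, r=6371.0):
    """Geodesic (great-circle) distance on a sphere of radius r km,
    via the spherical law of cosines. Inputs in degrees."""
    p1, p2, dl = map(np.radians, (lat1, lat2, lon2 - lon1))
    cos_ds = (np.sin(p1) * np.sin(p2)
              + np.cos(p1) * np.cos(p2) * np.cos(dl))
    return r * np.arccos(np.clip(cos_ds, -1.0, 1.0))

# New York (≈ 40.7° N, 74.0° W) to London (≈ 51.5° N, 0.1° W).
print(great_circle_km(40.7, -74.0, 51.5, -0.1))  # ≈ 5570 km
```

The `clip` guards against floating-point values just outside $[-1, 1]$, which would otherwise make `arccos` return NaN for nearly coincident or nearly antipodal points.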

Example 2: Natural Gradient for Gaussian

For a Gaussian family $\mathcal{N}(\mu, \sigma^2)$ parameterized by $(\mu, \sigma)$:

$$\mathcal{I}(\mu, \sigma) = \begin{pmatrix}1/\sigma^2 & 0 \\ 0 & 2/\sigma^2\end{pmatrix}.$$

The natural gradient of the negative log-likelihood $L = \frac{1}{2}\log\sigma^2 + \frac{(x-\mu)^2}{2\sigma^2}$ is

$$\tilde\nabla_{(\mu,\sigma)} L = \mathcal{I}^{-1}\nabla L = \begin{pmatrix}\sigma^2 & 0 \\ 0 & \sigma^2/2\end{pmatrix}\begin{pmatrix}-(x-\mu)/\sigma^2 \\ 1/\sigma - (x-\mu)^2/\sigma^3\end{pmatrix}.$$

The $\sigma^2$ factor rescales the $\mu$-gradient — effectively using a different step size for each parameter, adapted to the information content. For small $\sigma$ (a sharp distribution), raw gradients in $\mu$ are large; the natural gradient corrects for this.
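The rescaling is easy to see numerically. A sketch computing both gradients for this family, using $\partial L/\partial\mu = -(x-\mu)/\sigma^2$ and $\partial L/\partial\sigma = 1/\sigma - (x-\mu)^2/\sigma^3$ with the closed-form Fisher inverse (the function name is illustrative):

```python
import numpy as np

def grads(x, mu, sigma):
    """Vanilla and natural gradients of the per-sample NLL of N(mu, sigma^2)
    in coordinates (mu, sigma)."""
    vanilla = np.array([-(x - mu) / sigma**2,                   # ∂L/∂μ
                        1.0/sigma - (x - mu)**2 / sigma**3])    # ∂L/∂σ
    fisher_inv = np.diag([sigma**2, sigma**2 / 2.0])            # I(μ,σ)^{-1}
    return vanilla, fisher_inv @ vanilla

# Sharp distribution (small σ): the raw μ-gradient explodes,
# while the natural gradient rescales it by σ².
vanilla, natural = grads(x=1.0, mu=0.0, sigma=0.1)
print(vanilla[0], natural[0])  # ≈ -100 vs ≈ -1
```

With $\sigma = 0.1$, a fixed learning rate applied to the vanilla gradient would overshoot in $\mu$ by a factor of about 100; the Fisher preconditioner absorbs exactly that factor.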

Example 3: Hyperbolic Embeddings of Trees

Trees have exponential growth — the number of nodes at depth $d$ grows as $\beta^d$ for branching factor $\beta$. Euclidean space cannot embed deep trees without distortion because Euclidean volume grows only polynomially with radius.

Hyperbolic space $\mathbb{H}^n$ (constant negative curvature $-1$) has volume growing exponentially with distance — matched to tree geometry. The Poincaré disk model is $D^n = \{x \in \mathbb{R}^n : \|x\| < 1\}$ with metric

$$g_x = \frac{4}{(1-\|x\|^2)^2}\,\delta_{ij}.$$

Geodesic distance: $d(x,y) = \operatorname{arccosh}\left(1 + \dfrac{2\|x-y\|^2}{(1-\|x\|^2)(1-\|y\|^2)}\right)$.
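A sketch of this distance function, using $d(x,y) = \operatorname{arccosh}\bigl(1 + 2\|x-y\|^2 / ((1-\|x\|^2)(1-\|y\|^2))\bigr)$ for the Poincaré disk of curvature $-1$:

```python
import numpy as np

def poincare_dist(x, y):
    """Geodesic distance in the Poincaré disk model (inputs with norm < 1)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    num = 2.0 * np.sum((x - y)**2)
    den = (1.0 - np.sum(x**2)) * (1.0 - np.sum(y**2))
    return np.arccosh(1.0 + num / den)

# Distances blow up near the boundary: the same Euclidean step of 0.1
# costs far more hyperbolic distance at radius 0.89 than at the origin.
print(poincare_dist([0.0, 0.0], [0.1, 0.0]))    # small
print(poincare_dist([0.89, 0.0], [0.99, 0.0]))  # much larger
```

That boundary blow-up is the mechanism behind tree embeddings: placing children progressively closer to the boundary gives each generation exponentially more room.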

Hierarchical structures (WordNet, biological taxonomies, knowledge graphs) embed into $\mathbb{H}^n$ with low distortion using far fewer dimensions than Euclidean space requires. Poincaré embeddings (Nickel & Kiela, 2017) demonstrate this on word hierarchies.

Connections

Where Your Intuition Breaks

The natural gradient sounds like a strict improvement over vanilla gradient descent — it corrects for the geometry and is invariant to reparameterization. But computing it requires inverting the Fisher information matrix, which costs $O(n^2)$ memory and $O(n^3)$ time for $n$ parameters. GPT-2 has roughly 117M parameters; the full Fisher matrix would be $117\text{M} \times 117\text{M}$ — completely infeasible. K-FAC and related approximations make it tractable for specific architectures, but they sacrifice the exact invariance. The reason Adam and Adagrad dominate in practice is not that geometry doesn't matter — it's that their diagonal preconditioners act as crude diagonal curvature approximations, capturing the most important rescaling at negligible cost. The full Riemannian machinery is theoretically correct but practically expensive.

💡Intuition

Why the Fisher metric is canonical. By Chentsov's theorem, the Fisher-Rao metric is (up to scale) the only Riemannian metric on the space of probability distributions that is invariant under sufficient statistics — it doesn't change when you reparameterize the distribution or discard irrelevant information. The same Fisher information also sets the fundamental limit on estimation variance via the Cramér-Rao bound. The natural gradient uses this canonical geometry.

💡Intuition

Curvature of the loss landscape. The Hessian of a neural network loss at a critical point measures local curvature in parameter space. The loss landscape is far from flat — its curvature varies from region to region. Minima whose Hessians have many small eigenvalues (flat minima, low effective rank) tend to generalize better than sharp minima with many large eigenvalues. SGD with small batch sizes implicitly prefers flat minima via its noisy trajectory — a connection to the Riemannian geometry of the loss manifold that remains an active research area.
