Neural-Path/Notes
45 min

Discrete-Time Markov Chains: Stationarity, Ergodicity & Mixing Times

A Markov chain encodes a process where the future depends only on the present — the memoryless property that makes stochastic systems analytically tractable. Understanding how chains converge to equilibrium reveals why MCMC algorithms work and how mixing times bound their computational cost.

Concepts

A 3-state Markov chain converges to its stationary distribution π regardless of starting state. Adjust transition probabilities to reshape π, then run the chain and watch empirical frequencies converge to the new π.

[Interactive demo: a 3-state chain with transitions A→B 0.70, A→C 0.30, B→A 0.40, B→C 0.60, C→A 0.50, C→B 0.50; stationary distribution $\pi \approx (\pi_A, \pi_B, \pi_C) = (0.308, 0.374, 0.317)$. Solid bars show empirical frequency, dashed bars show $\pi$; they converge by the ergodic theorem.]

Markov chains model any system where the next state depends only on the present — weather patterns, web page ranks, MCMC samplers, and protein folding all follow the same rule. The transition matrix encodes this memorylessness precisely: one row per state, each row a probability distribution over where the chain goes next.

The Markov Property and Transition Matrix

A discrete-time Markov chain on state space $\mathcal{S}$ is a sequence of random variables $X_0, X_1, X_2, \ldots$ satisfying the Markov property:

$$P(X_{t+1} = j \mid X_0, X_1, \ldots, X_t) = P(X_{t+1} = j \mid X_t).$$

For a homogeneous (time-invariant) chain, the transition matrix $P$ has entries $P_{ij} = P(X_{t+1} = j \mid X_t = i)$. Each row sums to one: $\sum_j P_{ij} = 1$, making $P$ a stochastic matrix.

The stochastic matrix structure — non-negative entries, rows summing to one — is not an assumption about the chain but a consequence of probability being defined over all next states. Once the Markov property is encoded in a single matrix, all long-run analysis reduces to eigenvectors of a finite object: the $t$-step distribution $\mu_t = \mu_0 P^t$ follows from one matrix power.

The $n$-step transition probabilities are given by the matrix power: $P^n_{ij} = P(X_{t+n} = j \mid X_t = i)$.

If the chain starts in distribution $\mu_0$ (a row vector), then after $t$ steps the distribution is $\mu_t = \mu_0 P^t$.
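As a concrete check, $\mu_t = \mu_0 P^t$ can be computed by repeated right-multiplication. A minimal sketch, using transition probabilities matching the 3-state demo above (the matrix and function names are illustrative):

```python
import numpy as np

# Transition matrix of the demo chain (one row per state, rows sum to 1):
# from A: B w.p. 0.7, C w.p. 0.3; from B: A 0.4, C 0.6; from C: A 0.5, B 0.5.
P = np.array([
    [0.0, 0.7, 0.3],
    [0.4, 0.0, 0.6],
    [0.5, 0.5, 0.0],
])

def distribution_after(mu0, P, t):
    """Return mu_t = mu_0 P^t via t right-multiplications."""
    mu = np.asarray(mu0, dtype=float)
    for _ in range(t):
        mu = mu @ P
    return mu

mu0 = np.array([1.0, 0.0, 0.0])        # start deterministically in state A
mu50 = distribution_after(mu0, P, 50)
print(mu50)  # close to pi = (0.308, 0.374, 0.317), regardless of mu0
```

Starting from state B or C instead gives the same limit, which is the convergence the demo visualizes.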

Stationary Distributions

A distribution $\pi$ is stationary (or invariant) if $\pi P = \pi$, i.e., $\pi_j = \sum_i \pi_i P_{ij}$ for all $j$. Equivalently, $\pi$ is a left eigenvector of $P$ with eigenvalue 1.

Detailed balance (sufficient for stationarity): if $\pi_i P_{ij} = \pi_j P_{ji}$ for all $i, j$, then $\pi P = \pi$. Any chain satisfying detailed balance is reversible. The Metropolis-Hastings algorithm is designed to satisfy detailed balance by construction.
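Both facts can be checked numerically: $\pi$ falls out of an eigendecomposition, and detailed balance is a symmetry test on the probability flows $\pi_i P_{ij}$. A sketch for the demo chain (probabilities assumed from the widget; this particular chain turns out to be stationary but not reversible):

```python
import numpy as np

P = np.array([
    [0.0, 0.7, 0.3],
    [0.4, 0.0, 0.6],
    [0.5, 0.5, 0.0],
])

# pi is a LEFT eigenvector of P with eigenvalue 1: solve P^T v = v.
eigvals, eigvecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(eigvals - 1.0))   # index of the eigenvalue closest to 1
pi = np.real(eigvecs[:, k])
pi = pi / pi.sum()                     # normalize to a probability distribution

assert np.allclose(pi @ P, pi)         # stationarity: pi P = pi

# Detailed balance check: is pi_i P_ij == pi_j P_ji for all i, j?
flows = pi[:, None] * P                # flows[i, j] = pi_i P_ij
reversible = bool(np.allclose(flows, flows.T))
print(pi, reversible)                  # stationary, yet NOT reversible
```

That `reversible` comes out false here illustrates the point in the text: detailed balance is sufficient for stationarity, not necessary.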

Ergodic Theorem: Existence and Uniqueness

Irreducibility: a chain is irreducible if every state is reachable from every other state — for each pair $i, j$ there exists $n$ such that $P^n_{ij} > 0$.

Aperiodicity: state $i$ has period $d_i = \gcd\{n \geq 1 : P^n_{ii} > 0\}$. A chain is aperiodic if $d_i = 1$ for all $i$.

Ergodic theorem: a chain that is irreducible and aperiodic on a finite state space has a unique stationary distribution $\pi$, and:

$$\lim_{t \to \infty} P^t_{ij} = \pi_j \quad \text{for all } i, j.$$

Moreover, by the law of large numbers for Markov chains:

$$\frac{1}{T}\sum_{t=0}^{T-1} f(X_t) \xrightarrow{a.s.} \sum_i \pi_i f(i) = \mathbb{E}_\pi[f]$$

for any bounded function $f$. This is the basis for MCMC estimation.
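The ergodic theorem can be watched in action: simulate one long trajectory and compare state-visit frequencies against $\pi$. A sketch for the demo chain (transition probabilities assumed; the fixed seed is only for reproducibility):

```python
import random

# Demo chain as per-state lists of (next_state, probability).
P = {
    "A": [("B", 0.7), ("C", 0.3)],
    "B": [("A", 0.4), ("C", 0.6)],
    "C": [("A", 0.5), ("B", 0.5)],
}

def simulate(start, steps, seed=0):
    """Run the chain and return empirical state-visit frequencies."""
    rng = random.Random(seed)
    counts = {s: 0 for s in P}
    state = start
    for _ in range(steps):
        counts[state] += 1
        r, acc = rng.random(), 0.0
        for nxt, p in P[state]:
            acc += p
            if r < acc:
                state = nxt
                break
        else:
            state = P[state][-1][0]   # guard against float round-off
    return {s: c / steps for s, c in counts.items()}

freqs = simulate("A", 200_000)
print(freqs)  # time averages approach pi = (0.308, 0.374, 0.317)
```

This is exactly the MCMC estimator: a single trajectory's time average standing in for the expectation under $\pi$.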

Proof sketch: by Perron-Frobenius, a non-negative irreducible matrix has a unique maximal eigenvalue $\rho$. For a stochastic matrix $\rho = 1$, and aperiodicity ensures no eigenvalue has $|\lambda| = 1$ except $\lambda = 1$ itself. The remaining eigenvalues satisfy $|\lambda_k| < 1$, so $P^t \to \mathbf{1}\pi$ geometrically.

Mixing Times and the Spectral Gap

Define the total variation distance between distributions $\mu$ and $\nu$:

$$\|\mu - \nu\|_{\text{TV}} = \frac{1}{2}\sum_i |\mu_i - \nu_i| = \max_{A \subseteq \mathcal{S}} |\mu(A) - \nu(A)|.$$

The $\varepsilon$-mixing time is $t_{\text{mix}}(\varepsilon) = \min\{t : \max_i \|P^t(i,\cdot) - \pi\|_{\text{TV}} \leq \varepsilon\}$.

For a reversible chain, the eigenvalues of $P$ satisfy $1 = \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_{|\mathcal{S}|} \geq -1$. The spectral gap is $\gamma = 1 - \lambda_2$ (the second-largest eigenvalue controls convergence):

$$\|P^t(i,\cdot) - \pi\|_{\text{TV}} \leq \sqrt{\frac{1}{\pi_{\min}}}\,(1 - \gamma)^t.$$

The mixing time scales as $t_{\text{mix}}(\varepsilon) = O\!\left(\frac{1}{\gamma}\log\frac{1}{\varepsilon \pi_{\min}}\right)$.
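For a small chain, $t_{\text{mix}}(\varepsilon)$ can simply be computed by iterating matrix powers and measuring the worst-case TV distance. A sketch using the demo chain (values assumed from the widget):

```python
import numpy as np

P = np.array([[0.0, 0.7, 0.3],
              [0.4, 0.0, 0.6],
              [0.5, 0.5, 0.0]])

# Stationary distribution via the left eigenvector for eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
pi /= pi.sum()

def tv(mu, nu):
    """Total variation distance: half the L1 distance."""
    return 0.5 * np.abs(mu - nu).sum()

def mixing_time(P, pi, eps=0.25):
    """Smallest t with max_i ||P^t(i,.) - pi||_TV <= eps."""
    Pt = np.eye(len(pi))   # P^0
    t = 0
    while max(tv(Pt[i], pi) for i in range(len(pi))) > eps:
        Pt = Pt @ P
        t += 1
    return t

t_mix = mixing_time(P, pi)
print(t_mix)  # very small here: the demo chain mixes in a couple of steps
```

The demo chain's second eigenvalue has modulus about 0.52, so the worst-case TV distance roughly halves each step.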

Conductance provides a combinatorial bound: the conductance (Cheeger constant) of the chain is

$$\Phi = \min_{S : \pi(S) \leq 1/2} \frac{\sum_{i \in S,\, j \notin S} \pi_i P_{ij}}{\pi(S)}.$$

The Cheeger inequality relates it to the spectral gap: $\Phi^2/2 \leq \gamma \leq 2\Phi$.
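On a small reversible chain, $\Phi$ can be brute-forced over all subsets and checked against the Cheeger inequality. A sketch using a hypothetical example not from the text — the lazy symmetric walk on a 4-state path, whose stationary distribution is uniform:

```python
import numpy as np
from itertools import combinations

# Lazy symmetric walk on a 4-state path (symmetric matrix => reversible,
# pi uniform): interior states move left/right w.p. 1/4 each, else hold.
P = np.array([[0.75, 0.25, 0.0,  0.0 ],
              [0.25, 0.50, 0.25, 0.0 ],
              [0.0,  0.25, 0.50, 0.25],
              [0.0,  0.0,  0.25, 0.75]])
pi = np.full(4, 0.25)

def conductance(P, pi):
    """Brute-force the Cheeger constant over all S with pi(S) <= 1/2."""
    n = len(pi)
    best = float("inf")
    for size in range(1, n):
        for S in combinations(range(n), size):
            piS = sum(pi[i] for i in S)
            if piS > 0.5:
                continue
            flow = sum(pi[i] * P[i, j] for i in S for j in range(n) if j not in S)
            best = min(best, flow / piS)
    return best

Phi = conductance(P, pi)
gamma = 1.0 - sorted(np.linalg.eigvalsh(P))[-2]   # 1 - lambda_2 (P symmetric)
assert Phi**2 / 2 <= gamma <= 2 * Phi             # Cheeger inequality holds
print(Phi, gamma)
```

The minimizing cut splits the path in the middle — the bottleneck the conductance is designed to detect.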

Hitting Times and Recurrence

The expected hitting time from $i$ to $j$ is $h_{ij} = \mathbb{E}_i[\min\{t \geq 1 : X_t = j\}]$. For an ergodic chain, $h_{jj} = 1/\pi_j$ — the expected return time to state $j$ equals the reciprocal of its stationary probability.
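The return-time identity $h_{jj} = 1/\pi_j$ is easy to test by simulation. A sketch on the demo chain (probabilities assumed), estimating the mean return time to state A, which should be about $1/0.308 \approx 3.24$:

```python
import random

# Demo chain; row i gives P(next = A), P(next = B), P(next = C).
STATES = ("A", "B", "C")
P = {"A": (0.0, 0.7, 0.3), "B": (0.4, 0.0, 0.6), "C": (0.5, 0.5, 0.0)}

def step(state, rng):
    r, acc = rng.random(), 0.0
    for nxt, p in zip(STATES, P[state]):
        acc += p
        if r < acc:
            return nxt
    return STATES[-1]   # guard against float round-off

def mean_return_time(target, trials, seed=1):
    """Average number of steps to return to `target`, starting there."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        state, t = step(target, rng), 1
        while state != target:
            state, t = step(state, rng), t + 1
        total += t
    return total / trials

h_AA = mean_return_time("A", 100_000)
print(h_AA)   # about 1 / pi_A = 1 / 0.308 ≈ 3.24
```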

Recurrence vs transience (relevant for countably infinite state spaces): state $i$ is recurrent if $P(\text{return to } i) = 1$, transient if $P(\text{return to } i) < 1$. For finite irreducible chains, all states are recurrent.

Random walk on $\mathbb{Z}^d$: the symmetric nearest-neighbor random walk is recurrent for $d \leq 2$ (Pólya's theorem) and transient for $d \geq 3$. In $d = 2$, the walk returns to the origin with probability 1 but the expected return time is infinite (null recurrence).

Worked Example

Example 1: PageRank as Stationary Distribution

Google's PageRank computes the stationary distribution of a Markov chain on web pages. The transition matrix is:

$$P_{ij} = (1-\alpha)\,\frac{A_{ij}}{\text{out-degree}(i)} + \frac{\alpha}{N}$$

where $A$ is the adjacency matrix of the web graph, $\alpha = 0.15$ is the teleportation probability, and $N$ is the number of pages.

The teleportation term makes $P$ irreducible (all entries strictly positive) and aperiodic. The stationary distribution $\pi$ gives each page's rank. Power iteration computes $\pi$: starting from $\pi_0 = (1/N, \ldots, 1/N)$, iterate $\pi_{t+1} = \pi_t P$ until convergence. The second eigenvalue satisfies $|\lambda_2| \leq 1 - \alpha = 0.85$, which bounds the convergence rate — roughly $\log(1/\varepsilon)/|\log 0.85| \approx 6.2\log(1/\varepsilon)$ iterations.

For a web graph with $N = 10^9$ pages, power iteration converges in $\sim 50$ iterations, exploiting the sparse structure of $A$.
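The same computation fits in a few lines at toy scale. A sketch on a hypothetical 4-page link graph (the adjacency matrix is invented for illustration; real implementations must also handle dangling pages with no outlinks, which this toy graph avoids):

```python
import numpy as np

alpha = 0.15           # teleportation probability, as in the text
N = 4                  # a tiny hypothetical web graph
A = np.array([         # A[i, j] = 1 if page i links to page j
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
])

out_deg = A.sum(axis=1, keepdims=True)       # every page here has outlinks
P = (1 - alpha) * A / out_deg + alpha / N    # PageRank transition matrix

pi = np.full(N, 1.0 / N)                     # start uniform
for _ in range(300):                         # power iteration: pi <- pi P
    new = pi @ P
    if np.abs(new - pi).sum() < 1e-12:
        break
    pi = new

print(pi)   # each page's rank; page 2, the most linked-to, ranks highest
```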

Example 2: Metropolis Chain on $\{1, \ldots, n\}$

Target $\pi_i \propto e^{-\beta E_i}$ (Boltzmann distribution). Proposal: propose $j$ uniformly from $\{i-1, i+1\}$ (nearest neighbors). Accept with probability $\min(1, \pi_j/\pi_i)$.

The resulting chain has transition probabilities:

$$P_{ij} = \begin{cases} \frac{1}{2}\min\left(1, e^{-\beta(E_j - E_i)}\right) & |i-j| = 1 \\ 1 - \sum_{k \neq i} P_{ik} & i = j. \end{cases}$$

At high temperature ($\beta$ small): accepts most moves, mixes quickly ($\gamma$ large). At low temperature ($\beta$ large): rarely accepts uphill moves, mixes slowly. The mixing time scales exponentially in the energy barrier height — this is why simulated annealing slowly increases $\beta$ (lowers the temperature).
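The chain above is a few lines of code. A sketch with a hypothetical five-level energy landscape (the `E` values are invented; off-edge proposals are rejected so the chain stays on $\{0, \ldots, n-1\}$):

```python
import math
import random

n = 5
E = [0.0, 1.0, 0.3, 2.0, 0.5]      # hypothetical energy levels
beta = 1.0

def metropolis_step(i, rng):
    """Propose a uniform neighbor, accept w.p. min(1, e^{-beta * dE})."""
    j = i + rng.choice((-1, 1))
    if j < 0 or j >= n:
        return i                    # proposal off the edge: reject, stay put
    if rng.random() < min(1.0, math.exp(-beta * (E[j] - E[i]))):
        return j
    return i

rng = random.Random(0)
counts = [0] * n
state, steps = 0, 500_000
for _ in range(steps):
    state = metropolis_step(state, rng)
    counts[state] += 1

Z = sum(math.exp(-beta * e) for e in E)
target = [math.exp(-beta * e) / Z for e in E]
emp = [c / steps for c in counts]
print(emp, target)   # empirical frequencies approach the Boltzmann weights
```

Raising `beta` concentrates `target` on the low-energy states and visibly slows the chain's exploration — the temperature effect described above.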

Example 3: Mixing Time Computation

For a random walk on a cycle $C_n$ (states $0, 1, \ldots, n-1$, steps $\pm 1 \bmod n$): the eigenvalues are $\lambda_k = \cos(2\pi k/n)$ for $k = 0, 1, \ldots, n-1$. The spectral gap is $\gamma = 1 - \cos(2\pi/n) \approx 2\pi^2/n^2$ for large $n$. (For even $n$ the walk is periodic — $\lambda_{n/2} = -1$ — so one analyzes the lazy variant; the $\Theta(n^2)$ scaling is unchanged.)

The mixing time is $t_{\text{mix}}(1/4) = \Theta(n^2)$ — the walk must diffuse distance $n/2$, which takes $O((n/2)^2) = O(n^2)$ steps. This is why random walk on a long cycle is slow: the spectral gap $\gamma = \Theta(1/n^2)$ gives the matching $\Omega(n^2)$ lower bound, while the conductance $\Phi = \Theta(1/n)$ recovers the $n^2$ scaling as an upper bound (up to log factors) via Cheeger's $\gamma \geq \Phi^2/2$.
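The eigenvalue formula and the $2\pi^2/n^2$ approximation are quick to verify numerically. A sketch for $n = 64$ (the size is arbitrary):

```python
import numpy as np

n = 64
# Transition matrix of the +-1 random walk on the cycle C_n.
P = np.zeros((n, n))
for i in range(n):
    P[i, (i - 1) % n] = 0.5
    P[i, (i + 1) % n] = 0.5

eig = np.sort(np.linalg.eigvalsh(P))[::-1]   # P is symmetric; sort descending
gap = 1.0 - eig[1]                           # spectral gap 1 - lambda_2
approx = 2 * np.pi**2 / n**2                 # small-angle approximation
print(gap, approx)                           # agree to within ~0.1%
```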

Connections

Where Your Intuition Breaks

The stationary distribution $\pi$ exists and is unique for any irreducible, aperiodic chain — but uniqueness does not imply fast convergence. A two-cluster chain where inter-cluster transition probabilities are $\varepsilon \ll 1$ has a unique stationary distribution assigning equal weight to both clusters, yet mixing time scales as $1/\varepsilon$: the chain spends $O(1/\varepsilon)$ steps in one cluster before crossing to the other. Knowing that $\pi$ exists tells you the long-run average; the spectral gap $\gamma$ is what tells you whether "long run" means thousands or billions of steps.
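The two-cluster slowdown shows up already in the smallest caricature: collapse each cluster to a single state, with crossing probability $\varepsilon$. The stationary distribution is uniform for every $\varepsilon$, but the gap shrinks with it:

```python
import numpy as np

# Two "clusters" collapsed to single states; eps is the crossing probability.
for eps in (1e-1, 1e-2, 1e-3):
    P = np.array([[1 - eps, eps],
                  [eps, 1 - eps]])
    lam2 = np.sort(np.linalg.eigvalsh(P))[0]   # eigenvalues are 1 and 1 - 2*eps
    gap = 1.0 - lam2
    print(eps, gap)   # gap = 2*eps, so mixing time grows like 1/eps
```

Same $\pi = (1/2, 1/2)$ in every case; only the gap — and hence the mixing time — changes.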

💡Intuition

The spectral gap is the single most important quantity for MCMC. The mixing time is roughly $1/\gamma$ where $\gamma = 1 - \lambda_2$. A good MCMC sampler is one with a large spectral gap. Strategies to enlarge $\gamma$: (1) use non-local moves (Hamiltonian Monte Carlo uses gradient information to propose long-distance moves), (2) run multiple chains in parallel with swaps (parallel tempering), (3) use data-driven proposals that approximate the target. The Metropolis algorithm with local proposals has $\gamma = O(1/n^2)$ for many targets — HMC achieves $\gamma = O(1)$ under mild conditions.

💡Intuition

Detailed balance is a design principle, not a theorem. Many textbooks present detailed balance as if it is the definition of MCMC. In reality, detailed balance is a convenient sufficient condition for stationarity — the Metropolis-Hastings algorithm satisfies it by construction. But detailed balance is not necessary. Non-reversible Markov chains can converge faster: they can have larger spectral gaps than any reversible chain with the same stationary distribution. Lifted Markov chains, persistent MCMC, and irreversible perturbations can reduce mixing times.

⚠️Warning

"The chain converged" is not the same as "the chain mixed." Convergence diagnostics (Gelman-Rubin R^\hat{R}, trace plot stationarity) can give false confidence when the chain is stuck in a mode. A chain can appear converged in total variation while exploring only a small region of high probability mass. High-dimensional targets with multiple separated modes require explicit multi-modal sampling strategies (parallel tempering, replica exchange, annealed importance sampling) — standard MCMC with local proposals will not mix across modes in reasonable time.
