Discrete-Time Markov Chains: Stationarity, Ergodicity & Mixing Times
A Markov chain encodes a process where the future depends only on the present — the memoryless property that makes stochastic systems analytically tractable. Understanding how chains converge to equilibrium reveals why MCMC algorithms work and how mixing times bound their computational cost.
Concepts
[Interactive demo] A 3-state Markov chain converges to its stationary distribution π regardless of starting state: adjust the transition probabilities to reshape π, then run the chain and watch the empirical frequencies converge to the new π. Solid bars show empirical frequency; dashed lines show the stationary π. They converge by the ergodic theorem.
Markov chains model any system where the next state depends only on the present — weather patterns, web page ranks, MCMC samplers, and protein folding all follow the same rule. The transition matrix encodes this memorylessness precisely: one row per state, each row a probability distribution over where the chain goes next.
The Markov Property and Transition Matrix
A discrete-time Markov chain on a finite state space $S$ is a sequence of random variables $X_0, X_1, X_2, \dots$ satisfying the Markov property:

$$\Pr(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \dots, X_0 = i_0) = \Pr(X_{n+1} = j \mid X_n = i).$$
For a homogeneous (time-invariant) chain, the transition matrix $P$ has entries $P_{ij} = \Pr(X_{n+1} = j \mid X_n = i)$. Each row sums to one: $\sum_j P_{ij} = 1$, making $P$ a stochastic matrix.
The stochastic matrix structure — non-negative entries, rows summing to one — is not an assumption about the chain but a consequence of probability being defined over all next states. Once the Markov property is encoded in a single matrix, all long-run analysis reduces to eigenvectors of a finite object: the $n$-step distribution follows from one matrix power.
The $n$-step transition probabilities are given by the matrix power: $\Pr(X_{m+n} = j \mid X_m = i) = (P^n)_{ij}$.
If the chain starts in distribution $\mu_0$ (a row vector), then after $n$ steps the distribution is $\mu_n = \mu_0 P^n$.
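A quick numerical check of $\mu_n = \mu_0 P^n$; the 3-state transition matrix below is an illustrative choice, not one fixed by the text:

```python
import numpy as np

# A 3-state chain: each row is a probability distribution over next states.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

mu = np.array([1.0, 0.0, 0.0])   # start deterministically in state 0

# mu_n = mu_0 P^n via repeated right-multiplication by P
for n in range(100):
    mu = mu @ P

print(mu)   # after many steps, mu barely changes: it has reached pi
```

After enough steps `mu @ P` equals `mu` to machine precision: the distribution has converged to the stationary π discussed next.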
Stationary Distributions
A distribution $\pi$ is stationary (or invariant) if $\pi P = \pi$, i.e., $\sum_i \pi_i P_{ij} = \pi_j$ for all $j$. Equivalently, $\pi$ is a left eigenvector of $P$ with eigenvalue 1.
Detailed balance (sufficient for stationarity): if $\pi_i P_{ij} = \pi_j P_{ji}$ for all $i, j$, then $\pi P = \pi$. Any chain satisfying detailed balance is reversible. The Metropolis–Hastings algorithm is designed to satisfy detailed balance by construction.
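Both definitions are directly computable. A sketch that extracts $\pi$ as a left eigenvector and then tests detailed balance (same illustrative matrix as above; this particular chain turns out not to be reversible):

```python
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Stationary pi: left eigenvector of P with eigenvalue 1
# (right eigenvectors of P.T are left eigenvectors of P).
vals, vecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(vals - 1.0))
pi = np.real(vecs[:, k])
pi = pi / pi.sum()               # normalize to a probability vector

# Detailed balance: does pi_i P_ij == pi_j P_ji for every pair?
flows = pi[:, None] * P          # flows[i, j] = pi_i * P_ij
reversible = np.allclose(flows, flows.T)
print(pi, reversible)
```

Stationarity ($\pi P = \pi$) holds for every ergodic chain; detailed balance is the stronger, pairwise condition and fails here.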
Ergodic Theorem: Existence and Uniqueness
Irreducibility: a chain is irreducible if every state is reachable from every other state — for every pair $i, j$ there exists $n \ge 1$ such that $(P^n)_{ij} > 0$.
Aperiodicity: state $i$ has period $d(i) = \gcd\{n \ge 1 : (P^n)_{ii} > 0\}$. A chain is aperiodic if $d(i) = 1$ for all $i$.
Ergodic theorem: a chain that is irreducible and aperiodic on a finite state space has a unique stationary distribution $\pi$, and:

$$\lim_{n \to \infty} (P^n)_{ij} = \pi_j \quad \text{for every starting state } i.$$
Moreover, by the law of large numbers for Markov chains:

$$\frac{1}{n} \sum_{t=1}^{n} f(X_t) \xrightarrow{\text{a.s.}} \sum_{i} \pi_i f(i)$$

for any bounded function $f$. This is the basis for MCMC estimation.
Proof sketch: by Perron–Frobenius, a non-negative irreducible matrix has a unique maximal eigenvalue equal to its spectral radius. For a stochastic matrix that eigenvalue is $\lambda_1 = 1$, and aperiodicity ensures no eigenvalue other than $\lambda_1$ has $|\lambda| = 1$. The remaining eigenvalues satisfy $|\lambda_i| < 1$, so $P^n \to \mathbf{1}\pi$ geometrically.
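The law of large numbers above can be checked empirically by simulating one long trajectory (illustrative matrix again; with $f$ the indicator of each state, the ergodic average is just the empirical visit frequency):

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
n_steps = 100_000

# Simulate one long trajectory and count state visits.
x = 0
counts = np.zeros(3)
for _ in range(n_steps):
    x = rng.choice(3, p=P[x])
    counts[x] += 1

empirical = counts / n_steps
print(empirical)   # time averages approach the stationary distribution
```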
Mixing Times and the Spectral Gap
Define the total variation distance between distributions $\mu$ and $\nu$:

$$\|\mu - \nu\|_{\mathrm{TV}} = \frac{1}{2} \sum_{i} |\mu_i - \nu_i| = \max_{A \subseteq S} |\mu(A) - \nu(A)|.$$
The $\varepsilon$-mixing time is $t_{\mathrm{mix}}(\varepsilon) = \min\{t : \max_i \|P^t(i, \cdot) - \pi\|_{\mathrm{TV}} \le \varepsilon\}$.
For a reversible chain, the eigenvalues of $P$ are real and satisfy $1 = \lambda_1 > \lambda_2 \ge \dots \ge \lambda_n \ge -1$. The spectral gap is $\gamma = 1 - \lambda_2$ (the second-largest eigenvalue controls convergence):

$$\|P^t(i, \cdot) - \pi\|_{\mathrm{TV}} \le \frac{1}{2\sqrt{\pi_i}} \, \lambda_*^t, \qquad \lambda_* = \max(|\lambda_2|, |\lambda_n|).$$
The mixing time scales as $t_{\mathrm{mix}}(\varepsilon) \le \frac{1}{\gamma_*} \log\frac{1}{\varepsilon\,\pi_{\min}}$, where $\gamma_* = 1 - \lambda_*$.
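The bound $t_{\mathrm{mix}}(\varepsilon) \le \gamma_*^{-1}\log(1/(\varepsilon\,\pi_{\min}))$ can be checked numerically on a small reversible chain — here a lazy random walk on a 6-cycle, chosen purely for illustration:

```python
import numpy as np

# Lazy random walk on a 6-cycle: reversible, uniform stationary distribution.
n = 6
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = 0.5
    P[i, (i - 1) % n] = 0.25
    P[i, (i + 1) % n] = 0.25

pi = np.full(n, 1.0 / n)
eigs = np.sort(np.linalg.eigvalsh(P))[::-1]   # P symmetric: eigvalsh applies
lam_star = max(abs(eigs[1]), abs(eigs[-1]))
gap = 1.0 - lam_star

# Exact worst-case TV distance at each step, vs. the spectral bound.
eps = 0.01
Pt = np.eye(n)
t = 0
while 0.5 * np.max(np.abs(Pt - pi).sum(axis=1)) > eps:
    Pt = Pt @ P
    t += 1

bound = np.log(1.0 / (eps * pi.min())) / gap
print(t, bound)   # the empirical mixing time respects the spectral-gap bound
```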
Conductance provides a combinatorial bound: the conductance (Cheeger constant) of the chain is:

$$\Phi = \min_{S \,:\, \pi(S) \le 1/2} \frac{\sum_{i \in S,\, j \notin S} \pi_i P_{ij}}{\pi(S)}.$$
The Cheeger inequality relates it to the spectral gap: $\frac{\Phi^2}{2} \le \gamma \le 2\Phi$.
Hitting Times and Recurrence
The expected hitting time from $i$ to $j$ is $h_{ij} = \mathbb{E}[\min\{t \ge 0 : X_t = j\} \mid X_0 = i]$. For an ergodic chain, the return time $\tau_i^+ = \min\{t \ge 1 : X_t = i\}$ satisfies $\mathbb{E}_i[\tau_i^+] = 1/\pi_i$ — the expected return time to state $i$ equals the reciprocal of its stationary probability.
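The return-time identity $\mathbb{E}_i[\tau_i^+] = 1/\pi_i$ is easy to check by simulation. A sketch with the illustrative 3-state matrix used earlier, whose stationary distribution has $\pi_0 = 10/43$, so $1/\pi_0 = 4.3$:

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Average the gaps between successive visits to state 0.
n_returns = 20_000
x = 0
total_steps = 0
returns = 0
while returns < n_returns:
    x = rng.choice(3, p=P[x])
    total_steps += 1
    if x == 0:
        returns += 1

mean_return = total_steps / n_returns
print(mean_return)   # close to 1 / pi_0 = 4.3
```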
Recurrence vs. transience (relevant for countably infinite state spaces): state $i$ is recurrent if $\sum_{n=1}^{\infty} (P^n)_{ii} = \infty$, transient if the sum is finite. For finite irreducible chains, all states are recurrent.
Random walk on $\mathbb{Z}^d$: the symmetric nearest-neighbor random walk is recurrent for $d \le 2$ (Pólya's theorem) and transient for $d \ge 3$. In $\mathbb{Z}$ and $\mathbb{Z}^2$, the walk returns to the origin with probability 1 but the expected return time is infinite (null recurrence).
Worked Example
Example 1: PageRank as Stationary Distribution
Google's PageRank computes the stationary distribution of a Markov chain on web pages. The transition matrix is:

$$P = (1 - \alpha)\, D^{-1} A + \frac{\alpha}{N} \mathbf{1}\mathbf{1}^\top,$$

where $A$ is the adjacency matrix of the web graph, $D$ is the diagonal matrix of out-degrees, $\alpha$ is the teleportation probability, and $N$ is the number of pages.
The teleportation term makes $P$ irreducible (all entries strictly positive) and aperiodic. The stationary distribution $\pi$ gives each page's rank. Power iteration computes $\pi$: starting from $\pi^{(0)} = \frac{1}{N}\mathbf{1}$, iterate $\pi^{(k+1)} = \pi^{(k)} P$ until convergence. The second eigenvalue $|\lambda_2| \le 1 - \alpha$ determines the convergence rate — roughly $\log(1/\varepsilon) / \log\big(1/(1-\alpha)\big)$ iterations.
For a realistic web graph ($N$ in the billions), power iteration converges in a few dozen iterations, since the rate is governed by $\alpha$ rather than $N$, and each iteration exploits the sparse structure of $A$.
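A minimal power-iteration sketch on a hypothetical 4-page graph; the adjacency matrix and $\alpha = 0.15$ are illustrative assumptions, not data from the text:

```python
import numpy as np

# Tiny hypothetical web graph: A[i, j] = 1 if page i links to page j.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
alpha, N = 0.15, 4

# P = (1 - alpha) D^{-1} A + (alpha / N) 11^T; every row has out-links here,
# so no dangling-node correction is needed.
out_deg = A.sum(axis=1, keepdims=True)
P = (1 - alpha) * A / out_deg + alpha / N   # all entries > 0

# Power iteration from the uniform distribution.
pi = np.full(N, 1.0 / N)
for _ in range(100):
    pi = pi @ P

print(pi)   # the PageRank vector; page 2 collects the most link mass
```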
Example 2: Metropolis Chain on $\mathbb{Z}$
Target: $\pi(x) \propto e^{-\beta E(x)}$ (Boltzmann distribution). Proposal: from $x$, propose $x' \in \{x-1, x+1\}$ uniformly (nearest neighbors). Accept with probability $\min\big(1, \pi(x')/\pi(x)\big) = \min\big(1, e^{-\beta(E(x') - E(x))}\big)$.
The resulting chain has transition probabilities:

$$P(x, x \pm 1) = \tfrac{1}{2} \min\big(1, e^{-\beta(E(x \pm 1) - E(x))}\big), \qquad P(x, x) = 1 - P(x, x+1) - P(x, x-1).$$
At high temperature ($\beta$ small): accepts most moves, mixes quickly ($\gamma$ large). At low temperature ($\beta$ large): rarely accepts uphill moves, mixes slowly. The mixing time scales as $e^{\beta \Delta E}$ — exponentially in the energy barrier height $\Delta E$ — which is why simulated annealing decreases $T$ slowly.
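A sketch of this Metropolis chain using a hypothetical double-well energy with minima at $x = \pm 5$; the energy function, temperatures, and step counts are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)

def energy(x):
    # Hypothetical double well on the integers: minima at x = -5 and x = +5,
    # barrier of height 5 at x = 0.
    return (abs(x) - 5) ** 2 / 5.0

def metropolis(beta, n_steps, x0=5):
    x = x0
    samples = np.empty(n_steps, dtype=int)
    for t in range(n_steps):
        prop = x + rng.choice([-1, 1])     # nearest-neighbor proposal
        # Accept with probability min(1, exp(-beta * (E(x') - E(x)))).
        if rng.random() < np.exp(-beta * (energy(prop) - energy(x))):
            x = prop
        samples[t] = x
    return samples

hot = metropolis(beta=0.2, n_steps=50_000)
cold = metropolis(beta=5.0, n_steps=50_000)
# High temperature: the chain crosses between the wells at +-5 many times.
# Low temperature: it stays trapped in the well where it started (x0 = 5).
print((hot < 0).mean(), (cold < 0).mean())
```

The trapped cold chain is the exponential-barrier effect in miniature: the crossing rate is suppressed by roughly $e^{-\beta \Delta E}$.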
Example 3: Mixing Time Computation
For a random walk on a cycle $\mathbb{Z}_n$ (states $0, \dots, n-1$, steps $\pm 1$ with probability $1/2$ each): the eigenvalues are $\lambda_k = \cos(2\pi k / n)$ for $k = 0, \dots, n-1$. The spectral gap is $\gamma = 1 - \cos(2\pi/n) \approx 2\pi^2/n^2$ for large $n$.
The mixing time is $t_{\mathrm{mix}} = \Theta(n^2)$ — the walk must diffuse distance $\sim n$, which takes $\sim n^2$ steps. This is why random walk on a long path is slow: the conductance is $\Phi = \Theta(1/n)$, and the Cheeger bound $\gamma \ge \Phi^2/2 = \Theta(1/n^2)$ is tight here, confirming $\gamma = \Theta(1/n^2)$.
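Both the eigenvalue formula and the gap approximation can be verified numerically ($n = 200$ is an illustrative choice):

```python
import numpy as np

# Simple random walk on the n-cycle: step +1 or -1 with probability 1/2 each.
n = 200
k = np.arange(n)
analytic = np.cos(2 * np.pi * k / n)   # claimed eigenvalues lambda_k

P = np.zeros((n, n))
for i in range(n):
    P[i, (i - 1) % n] = 0.5
    P[i, (i + 1) % n] = 0.5

numeric = np.linalg.eigvalsh(P)        # P is symmetric
assert np.allclose(np.sort(numeric), np.sort(analytic), atol=1e-8)

gap = 1.0 - np.cos(2 * np.pi / n)
print(gap, 2 * np.pi**2 / n**2)        # agree to leading order in 1/n
```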
Connections
Where Your Intuition Breaks
The stationary distribution exists and is unique for any irreducible, aperiodic chain — but uniqueness does not imply fast convergence. A two-cluster chain where inter-cluster transition probabilities are $\varepsilon$ has a unique stationary distribution assigning equal weight to both clusters, yet mixing time scales as $1/\varepsilon$: the chain spends $\sim 1/\varepsilon$ steps in one cluster before crossing to the other. Knowing that $\pi$ exists tells you the long-run average; the spectral gap is what tells you whether "long run" means thousands or billions of steps.
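The two-cluster effect is easy to exhibit. For the symmetric 4-state chain below (an illustrative construction: two 2-state clusters mixing quickly internally, crossing with probability $\varepsilon$), the spectral gap works out to $2\varepsilon$, so the mixing time grows as $1/\varepsilon$ even though $\pi$ stays uniform:

```python
import numpy as np

def two_cluster_chain(eps):
    # States {0, 1} form one cluster, {2, 3} the other; within-cluster
    # transitions have probability 0.5, crossings have probability eps.
    return np.array([[0.5 - eps, 0.5,       eps,       0.0],
                     [0.5,       0.5 - eps, 0.0,       eps],
                     [eps,       0.0,       0.5 - eps, 0.5],
                     [0.0,       eps,       0.5,       0.5 - eps]])

for eps in [1e-2, 1e-3, 1e-4]:
    P = two_cluster_chain(eps)
    eigs = np.sort(np.linalg.eigvalsh(P))[::-1]   # symmetric: pi is uniform
    print(eps, 1.0 - eigs[1])                      # gap shrinks linearly in eps
```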
The spectral gap is the single most important quantity for MCMC. The mixing time is roughly $\gamma^{-1} \log(1/\varepsilon)$ where $\gamma = 1 - \lambda_2$. A good MCMC sampler is one with a large spectral gap. Strategies to enlarge $\gamma$: (1) use non-local moves (Hamiltonian Monte Carlo uses gradient information to propose long-distance moves), (2) run multiple chains in parallel with swaps (parallel tempering), (3) use data-driven proposals that approximate the target. Random-walk Metropolis with local proposals needs on the order of $d$ iterations per effective sample for many $d$-dimensional targets — HMC achieves $O(d^{1/4})$ under mild conditions.
Detailed balance is a design principle, not a theorem. Many textbooks present detailed balance as if it is the definition of MCMC. In reality, detailed balance is a convenient sufficient condition for stationarity — the Metropolis-Hastings algorithm satisfies it by construction. But detailed balance is not necessary. Non-reversible Markov chains can converge faster: they can have larger spectral gaps than any reversible chain with the same stationary distribution. Lifted Markov chains, persistent MCMC, and irreversible perturbations can reduce mixing times.
"The chain converged" is not the same as "the chain mixed." Convergence diagnostics (Gelman-Rubin , trace plot stationarity) can give false confidence when the chain is stuck in a mode. A chain can appear converged in total variation while exploring only a small region of high probability mass. High-dimensional targets with multiple separated modes require explicit multi-modal sampling strategies (parallel tempering, replica exchange, annealed importance sampling) — standard MCMC with local proposals will not mix across modes in reasonable time.