Neural-Path/Notes
45 min

Discrete-Time Markov Chains: Stationarity, Ergodicity & Mixing Times

A Markov chain encodes a process where the future depends only on the present — the memoryless property that makes stochastic systems analytically tractable. Understanding how chains converge to equilibrium reveals why MCMC algorithms work and how mixing times bound their computational cost.

Concepts

A 3-state Markov chain converges to its stationary distribution π regardless of starting state. Adjust transition probabilities to reshape π, then run the chain and watch empirical frequencies converge to the new π.

[Interactive demo: a 3-state chain with transitions A→B 0.70, A→C 0.30, B→A 0.40, B→C 0.60, C→A 0.50, C→B 0.50; stationary distribution $\pi \approx (\pi_A, \pi_B, \pi_C) = (0.308, 0.374, 0.317)$. Solid bars show empirical frequency, dashed bars show $\pi$; they converge by the ergodic theorem.]

Markov chains model any system where the next state depends only on the present — weather patterns, web page ranks, MCMC samplers, and protein folding all follow the same rule. The transition matrix encodes this memorylessness precisely: one row per state, each row a probability distribution over where the chain goes next.

The Markov Property and Transition Matrix

A discrete-time Markov chain on state space $\mathcal{S}$ is a sequence of random variables $X_0, X_1, X_2, \ldots$ satisfying the Markov property:

$$P(X_{t+1} = j \mid X_0, X_1, \ldots, X_t) = P(X_{t+1} = j \mid X_t).$$

For a homogeneous (time-invariant) chain, the transition matrix $P$ has entries $P_{ij} = P(X_{t+1} = j \mid X_t = i)$. Each row sums to one: $\sum_j P_{ij} = 1$, making $P$ a stochastic matrix.

The stochastic matrix structure — non-negative entries, rows summing to one — is not an assumption about the chain but a consequence of probability being defined over all next states. Once the Markov property is encoded in a single matrix, all long-run analysis reduces to eigenvectors of a finite object: the $t$-step distribution $\mu_t = \mu_0 P^t$ follows from one matrix power.

The $n$-step transition probabilities are given by the matrix power: $P^n_{ij} = P(X_{t+n} = j \mid X_t = i)$.

If the chain starts in distribution $\mu_0$ (a row vector), then after $t$ steps the distribution is $\mu_t = \mu_0 P^t$.
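As a concrete check, $\mu_t = \mu_0 P^t$ can be computed by repeated right-multiplication. A minimal sketch, using transition probabilities matching the 3-state demo above (the matrix and function names are illustrative):

```python
import numpy as np

# Transition matrix of the demo chain (one row per state, rows sum to 1):
# from A: B w.p. 0.7, C w.p. 0.3; from B: A 0.4, C 0.6; from C: A 0.5, B 0.5.
P = np.array([
    [0.0, 0.7, 0.3],
    [0.4, 0.0, 0.6],
    [0.5, 0.5, 0.0],
])

def distribution_after(mu0, P, t):
    """Return mu_t = mu_0 P^t via t right-multiplications."""
    mu = np.asarray(mu0, dtype=float)
    for _ in range(t):
        mu = mu @ P
    return mu

mu0 = np.array([1.0, 0.0, 0.0])        # start deterministically in state A
mu50 = distribution_after(mu0, P, 50)
print(mu50)  # close to pi = (0.308, 0.374, 0.317), regardless of mu0
```

Starting from state B or C instead gives the same limit, which is the convergence the demo visualizes.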

Stationary Distributions

A distribution $\pi$ is stationary (or invariant) if $\pi P = \pi$, i.e., $\pi_j = \sum_i \pi_i P_{ij}$ for all $j$. Equivalently, $\pi$ is a left eigenvector of $P$ with eigenvalue 1.

Detailed balance (sufficient for stationarity): if $\pi_i P_{ij} = \pi_j P_{ji}$ for all $i, j$, then $\pi P = \pi$. Any chain satisfying detailed balance is reversible. The Metropolis-Hastings algorithm is designed to satisfy detailed balance by construction.
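Both facts can be checked numerically: $\pi$ falls out of an eigendecomposition, and detailed balance is a symmetry test on the probability flows $\pi_i P_{ij}$. A sketch for the demo chain (probabilities assumed from the widget; this particular chain turns out to be stationary but not reversible):

```python
import numpy as np

P = np.array([
    [0.0, 0.7, 0.3],
    [0.4, 0.0, 0.6],
    [0.5, 0.5, 0.0],
])

# pi is a LEFT eigenvector of P with eigenvalue 1: solve P^T v = v.
eigvals, eigvecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(eigvals - 1.0))   # index of the eigenvalue closest to 1
pi = np.real(eigvecs[:, k])
pi = pi / pi.sum()                     # normalize to a probability distribution

assert np.allclose(pi @ P, pi)         # stationarity: pi P = pi

# Detailed balance check: is pi_i P_ij == pi_j P_ji for all i, j?
flows = pi[:, None] * P                # flows[i, j] = pi_i P_ij
reversible = bool(np.allclose(flows, flows.T))
print(pi, reversible)                  # stationary, yet NOT reversible
```

That `reversible` comes out false here illustrates the point in the text: detailed balance is sufficient for stationarity, not necessary.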

Ergodic Theorem: Existence and Uniqueness

Irreducibility: a chain is irreducible if every state is reachable from every other state — for each pair $i, j$ there exists $n$ such that $P^n_{ij} > 0$.

Aperiodicity: state $i$ has period $d_i = \gcd\{n \geq 1 : P^n_{ii} > 0\}$. A chain is aperiodic if $d_i = 1$ for all $i$.

Ergodic theorem: a chain that is irreducible and aperiodic on a finite state space has a unique stationary distribution $\pi$, and:

$$\lim_{t \to \infty} P^t_{ij} = \pi_j \quad \text{for all } i, j.$$

Moreover, by the law of large numbers for Markov chains:

$$\frac{1}{T}\sum_{t=0}^{T-1} f(X_t) \xrightarrow{a.s.} \sum_i \pi_i f(i) = \mathbb{E}_\pi[f]$$

for any bounded function $f$. This is the basis for MCMC estimation.
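The ergodic theorem can be watched in action: simulate one long trajectory and compare state-visit frequencies against $\pi$. A sketch for the demo chain (transition probabilities assumed; the fixed seed is only for reproducibility):

```python
import random

# Demo chain as per-state lists of (next_state, probability).
P = {
    "A": [("B", 0.7), ("C", 0.3)],
    "B": [("A", 0.4), ("C", 0.6)],
    "C": [("A", 0.5), ("B", 0.5)],
}

def simulate(start, steps, seed=0):
    """Run the chain and return empirical state-visit frequencies."""
    rng = random.Random(seed)
    counts = {s: 0 for s in P}
    state = start
    for _ in range(steps):
        counts[state] += 1
        r, acc = rng.random(), 0.0
        for nxt, p in P[state]:
            acc += p
            if r < acc:
                state = nxt
                break
        else:
            state = P[state][-1][0]   # guard against float round-off
    return {s: c / steps for s, c in counts.items()}

freqs = simulate("A", 200_000)
print(freqs)  # time averages approach pi = (0.308, 0.374, 0.317)
```

This is exactly the MCMC estimator: a single trajectory's time average standing in for the expectation under $\pi$.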

Proof sketch: by Perron-Frobenius, a non-negative irreducible matrix has a unique maximal eigenvalue $\rho$. For a stochastic matrix $\rho = 1$, and aperiodicity ensures no eigenvalue has $|\lambda| = 1$ except $\lambda = 1$ itself. The remaining eigenvalues satisfy $|\lambda_k| < 1$, so $P^t \to \mathbf{1}\pi$ geometrically.

Mixing Times and the Spectral Gap

Define the total variation distance between distributions $\mu$ and $\nu$:

$$\|\mu - \nu\|_{\text{TV}} = \frac{1}{2}\sum_i |\mu_i - \nu_i| = \max_{A \subseteq \mathcal{S}} |\mu(A) - \nu(A)|.$$

The $\varepsilon$-mixing time is $t_{\text{mix}}(\varepsilon) = \min\{t : \max_i \|P^t(i,\cdot) - \pi\|_{\text{TV}} \leq \varepsilon\}$.

For a reversible chain, the eigenvalues of $P$ satisfy $1 = \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_{|\mathcal{S}|} \geq -1$. The spectral gap is $\gamma = 1 - \lambda_2$ (the second-largest eigenvalue controls convergence):

$$\|P^t(i,\cdot) - \pi\|_{\text{TV}} \leq \sqrt{\frac{1}{\pi_{\min}}}\,(1 - \gamma)^t.$$

The mixing time scales as $t_{\text{mix}}(\varepsilon) = O\!\left(\frac{1}{\gamma}\log\frac{1}{\varepsilon \pi_{\min}}\right)$.
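For a small chain, $t_{\text{mix}}(\varepsilon)$ can simply be computed by iterating matrix powers and measuring the worst-case TV distance. A sketch using the demo chain (values assumed from the widget):

```python
import numpy as np

P = np.array([[0.0, 0.7, 0.3],
              [0.4, 0.0, 0.6],
              [0.5, 0.5, 0.0]])

# Stationary distribution via the left eigenvector for eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
pi /= pi.sum()

def tv(mu, nu):
    """Total variation distance: half the L1 distance."""
    return 0.5 * np.abs(mu - nu).sum()

def mixing_time(P, pi, eps=0.25):
    """Smallest t with max_i ||P^t(i,.) - pi||_TV <= eps."""
    Pt = np.eye(len(pi))   # P^0
    t = 0
    while max(tv(Pt[i], pi) for i in range(len(pi))) > eps:
        Pt = Pt @ P
        t += 1
    return t

t_mix = mixing_time(P, pi)
print(t_mix)  # very small here: the demo chain mixes in a couple of steps
```

The demo chain's second eigenvalue has modulus about 0.52, so the worst-case TV distance roughly halves each step.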

Conductance provides a combinatorial bound: the conductance (Cheeger constant) of the chain is

$$\Phi = \min_{S : \pi(S) \leq 1/2} \frac{\sum_{i \in S,\, j \notin S} \pi_i P_{ij}}{\pi(S)}.$$

The Cheeger inequality relates it to the spectral gap: $\Phi^2/2 \leq \gamma \leq 2\Phi$.
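On a small reversible chain, $\Phi$ can be brute-forced over all subsets and checked against the Cheeger inequality. A sketch using a hypothetical example not from the text — the lazy symmetric walk on a 4-state path, whose stationary distribution is uniform:

```python
import numpy as np
from itertools import combinations

# Lazy symmetric walk on a 4-state path (symmetric matrix => reversible,
# pi uniform): interior states move left/right w.p. 1/4 each, else hold.
P = np.array([[0.75, 0.25, 0.0,  0.0 ],
              [0.25, 0.50, 0.25, 0.0 ],
              [0.0,  0.25, 0.50, 0.25],
              [0.0,  0.0,  0.25, 0.75]])
pi = np.full(4, 0.25)

def conductance(P, pi):
    """Brute-force the Cheeger constant over all S with pi(S) <= 1/2."""
    n = len(pi)
    best = float("inf")
    for size in range(1, n):
        for S in combinations(range(n), size):
            piS = sum(pi[i] for i in S)
            if piS > 0.5:
                continue
            flow = sum(pi[i] * P[i, j] for i in S for j in range(n) if j not in S)
            best = min(best, flow / piS)
    return best

Phi = conductance(P, pi)
gamma = 1.0 - sorted(np.linalg.eigvalsh(P))[-2]   # 1 - lambda_2 (P symmetric)
assert Phi**2 / 2 <= gamma <= 2 * Phi             # Cheeger inequality holds
print(Phi, gamma)
```

The minimizing cut splits the path in the middle — the bottleneck the conductance is designed to detect.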

Hitting Times and Recurrence

The expected hitting time from $i$ to $j$ is $h_{ij} = \mathbb{E}_i[\min\{t \geq 1 : X_t = j\}]$. For an ergodic chain, $h_{jj} = 1/\pi_j$ — the expected return time to state $j$ equals the reciprocal of its stationary probability.
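The return-time identity $h_{jj} = 1/\pi_j$ is easy to test by simulation. A sketch on the demo chain (probabilities assumed), estimating the mean return time to state A, which should be about $1/0.308 \approx 3.24$:

```python
import random

# Demo chain; row i gives P(next = A), P(next = B), P(next = C).
STATES = ("A", "B", "C")
P = {"A": (0.0, 0.7, 0.3), "B": (0.4, 0.0, 0.6), "C": (0.5, 0.5, 0.0)}

def step(state, rng):
    r, acc = rng.random(), 0.0
    for nxt, p in zip(STATES, P[state]):
        acc += p
        if r < acc:
            return nxt
    return STATES[-1]   # guard against float round-off

def mean_return_time(target, trials, seed=1):
    """Average number of steps to return to `target`, starting there."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        state, t = step(target, rng), 1
        while state != target:
            state, t = step(state, rng), t + 1
        total += t
    return total / trials

h_AA = mean_return_time("A", 100_000)
print(h_AA)   # about 1 / pi_A = 1 / 0.308 ≈ 3.24
```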

Recurrence vs transience (relevant for countably infinite state spaces): state $i$ is recurrent if $P(\text{return to } i) = 1$, transient if $P(\text{return to } i) < 1$. For finite irreducible chains, all states are recurrent.

Random walk on $\mathbb{Z}^d$: the symmetric nearest-neighbor random walk is recurrent for $d \leq 2$ (Pólya's theorem) and transient for $d \geq 3$. In $d = 2$, the walk returns to the origin with probability 1 but the expected return time is infinite (null recurrence).

Worked Example

Example 1: PageRank as Stationary Distribution

Google's PageRank computes the stationary distribution of a Markov chain on web pages. The transition matrix is:

$$P_{ij} = (1-\alpha)\,\frac{A_{ij}}{\text{out-degree}(i)} + \frac{\alpha}{N}$$

where $A$ is the adjacency matrix of the web graph, $\alpha = 0.15$ is the teleportation probability, and $N$ is the number of pages.

The teleportation term makes $P$ irreducible (all entries strictly positive) and aperiodic. The stationary distribution $\pi$ gives each page's rank. Power iteration computes $\pi$: starting from $\pi_0 = (1/N, \ldots, 1/N)$, iterate $\pi_{t+1} = \pi_t P$ until convergence. The second eigenvalue satisfies $|\lambda_2| \leq 1 - \alpha = 0.85$, which bounds the convergence rate — roughly $\log(1/\varepsilon)/|\log 0.85| \approx 6.2\log(1/\varepsilon)$ iterations.

For a web graph with $N = 10^9$ pages, power iteration converges in $\sim 50$ iterations, exploiting the sparse structure of $A$.
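The same computation fits in a few lines at toy scale. A sketch on a hypothetical 4-page link graph (the adjacency matrix is invented for illustration; real implementations must also handle dangling pages with no outlinks, which this toy graph avoids):

```python
import numpy as np

alpha = 0.15           # teleportation probability, as in the text
N = 4                  # a tiny hypothetical web graph
A = np.array([         # A[i, j] = 1 if page i links to page j
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
])

out_deg = A.sum(axis=1, keepdims=True)       # every page here has outlinks
P = (1 - alpha) * A / out_deg + alpha / N    # PageRank transition matrix

pi = np.full(N, 1.0 / N)                     # start uniform
for _ in range(300):                         # power iteration: pi <- pi P
    new = pi @ P
    if np.abs(new - pi).sum() < 1e-12:
        break
    pi = new

print(pi)   # each page's rank; page 2, the most linked-to, ranks highest
```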

Example 2: Metropolis Chain on $\{1, \ldots, n\}$

Target $\pi_i \propto e^{-\beta E_i}$ (Boltzmann distribution). Proposal: propose $j$ uniformly from $\{i-1, i+1\}$ (nearest neighbors). Accept with probability $\min(1, \pi_j/\pi_i)$.

The resulting chain has transition probabilities:

$$P_{ij} = \begin{cases} \frac{1}{2}\min\left(1, e^{-\beta(E_j - E_i)}\right) & |i-j| = 1 \\ 1 - \sum_{k \neq i} P_{ik} & i = j. \end{cases}$$

At high temperature ($\beta$ small): accepts most moves, mixes quickly ($\gamma$ large). At low temperature ($\beta$ large): rarely accepts uphill moves, mixes slowly. The mixing time scales exponentially in the energy barrier height — this is why simulated annealing slowly increases $\beta$ (lowers the temperature).
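The chain above is a few lines of code. A sketch with a hypothetical five-level energy landscape (the `E` values are invented; off-edge proposals are rejected so the chain stays on $\{0, \ldots, n-1\}$):

```python
import math
import random

n = 5
E = [0.0, 1.0, 0.3, 2.0, 0.5]      # hypothetical energy levels
beta = 1.0

def metropolis_step(i, rng):
    """Propose a uniform neighbor, accept w.p. min(1, e^{-beta * dE})."""
    j = i + rng.choice((-1, 1))
    if j < 0 or j >= n:
        return i                    # proposal off the edge: reject, stay put
    if rng.random() < min(1.0, math.exp(-beta * (E[j] - E[i]))):
        return j
    return i

rng = random.Random(0)
counts = [0] * n
state, steps = 0, 500_000
for _ in range(steps):
    state = metropolis_step(state, rng)
    counts[state] += 1

Z = sum(math.exp(-beta * e) for e in E)
target = [math.exp(-beta * e) / Z for e in E]
emp = [c / steps for c in counts]
print(emp, target)   # empirical frequencies approach the Boltzmann weights
```

Raising `beta` concentrates `target` on the low-energy states and visibly slows the chain's exploration — the temperature effect described above.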

Example 3: Mixing Time Computation

For a random walk on a cycle $C_n$ (states $0, 1, \ldots, n-1$, steps $\pm 1 \bmod n$): the eigenvalues are $\lambda_k = \cos(2\pi k/n)$ for $k = 0, 1, \ldots, n-1$. The spectral gap is $\gamma = 1 - \cos(2\pi/n) \approx 2\pi^2/n^2$ for large $n$. (For even $n$ the walk is periodic — $\lambda_{n/2} = -1$ — so one analyzes the lazy variant; the $\Theta(n^2)$ scaling is unchanged.)

The mixing time is $t_{\text{mix}}(1/4) = \Theta(n^2)$ — the walk must diffuse distance $n/2$, which takes $O((n/2)^2) = O(n^2)$ steps. This is why random walk on a long cycle is slow: the spectral gap $\gamma = \Theta(1/n^2)$ gives the matching $\Omega(n^2)$ lower bound, while the conductance $\Phi = \Theta(1/n)$ recovers the $n^2$ scaling as an upper bound (up to log factors) via Cheeger's $\gamma \geq \Phi^2/2$.
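The eigenvalue formula and the $2\pi^2/n^2$ approximation are quick to verify numerically. A sketch for $n = 64$ (the size is arbitrary):

```python
import numpy as np

n = 64
# Transition matrix of the +-1 random walk on the cycle C_n.
P = np.zeros((n, n))
for i in range(n):
    P[i, (i - 1) % n] = 0.5
    P[i, (i + 1) % n] = 0.5

eig = np.sort(np.linalg.eigvalsh(P))[::-1]   # P is symmetric; sort descending
gap = 1.0 - eig[1]                           # spectral gap 1 - lambda_2
approx = 2 * np.pi**2 / n**2                 # small-angle approximation
print(gap, approx)                           # agree to within ~0.1%
```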

Connections

Where Your Intuition Breaks

The stationary distribution $\pi$ exists and is unique for any irreducible, aperiodic chain — but uniqueness does not imply fast convergence. A two-cluster chain where inter-cluster transition probabilities are $\varepsilon \ll 1$ has a unique stationary distribution assigning equal weight to both clusters, yet mixing time scales as $1/\varepsilon$: the chain spends $O(1/\varepsilon)$ steps in one cluster before crossing to the other. Knowing that $\pi$ exists tells you the long-run average; the spectral gap $\gamma$ is what tells you whether "long run" means thousands or billions of steps.
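The two-cluster slowdown shows up already in the smallest caricature: collapse each cluster to a single state, with crossing probability $\varepsilon$. The stationary distribution is uniform for every $\varepsilon$, but the gap shrinks with it:

```python
import numpy as np

# Two "clusters" collapsed to single states; eps is the crossing probability.
for eps in (1e-1, 1e-2, 1e-3):
    P = np.array([[1 - eps, eps],
                  [eps, 1 - eps]])
    lam2 = np.sort(np.linalg.eigvalsh(P))[0]   # eigenvalues are 1 and 1 - 2*eps
    gap = 1.0 - lam2
    print(eps, gap)   # gap = 2*eps, so mixing time grows like 1/eps
```

Same $\pi = (1/2, 1/2)$ in every case; only the gap — and hence the mixing time — changes.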

💡Intuition

The spectral gap is the single most important quantity for MCMC. The mixing time is roughly $1/\gamma$ where $\gamma = 1 - \lambda_2$. A good MCMC sampler is one with a large spectral gap. Strategies to enlarge $\gamma$: (1) use non-local moves (Hamiltonian Monte Carlo uses gradient information to propose long-distance moves), (2) run multiple chains in parallel with swaps (parallel tempering), (3) use data-driven proposals that approximate the target. The Metropolis algorithm with local proposals has $\gamma = O(1/n^2)$ for many targets — HMC achieves $\gamma = O(1)$ under mild conditions.

💡Intuition

Detailed balance is a design principle, not a theorem. Many textbooks present detailed balance as if it is the definition of MCMC. In reality, detailed balance is a convenient sufficient condition for stationarity — the Metropolis-Hastings algorithm satisfies it by construction. But detailed balance is not necessary. Non-reversible Markov chains can converge faster: they can have larger spectral gaps than any reversible chain with the same stationary distribution. Lifted Markov chains, persistent MCMC, and irreversible perturbations can reduce mixing times.

⚠️Warning

"The chain converged" is not the same as "the chain mixed." Convergence diagnostics (Gelman-Rubin R^\hat{R}, trace plot stationarity) can give false confidence when the chain is stuck in a mode. A chain can appear converged in total variation while exploring only a small region of high probability mass. High-dimensional targets with multiple separated modes require explicit multi-modal sampling strategies (parallel tempering, replica exchange, annealed importance sampling) — standard MCMC with local proposals will not mix across modes in reasonable time.
