
Channel Capacity & Shannon's Coding Theorems

Shannon's channel capacity theorem identifies the maximum rate at which information can be transmitted reliably over a noisy channel, establishing the theoretical limits of compression and communication that no algorithm, regardless of complexity, can exceed.

Concepts

AWGN channel capacity C = ½ log₂(1+SNR) bits per channel use. Practical modulation schemes (dots) must stay below the Shannon limit. No coding scheme can exceed C regardless of complexity.

[Figure: AWGN capacity $C = \tfrac{1}{2}\log_2(1+\text{SNR})$ versus SNR (dB), showing the Shannon limit curve with operating points for BPSK, QPSK, 16-QAM, 64-QAM, and 256-QAM below it.]

High-SNR: $C \approx \tfrac{1}{2}\log_2(\text{SNR}) \approx \text{SNR}_{\text{dB}}/6$ bits/use (each extra 6 dB of SNR buys about 1 bit). Low-SNR: $C \approx \text{SNR}/(2 \ln 2)$ (linear in SNR; power, not bandwidth, is the bottleneck).

Shannon's theorem answers a question that seemed unanswerable: what is the maximum rate at which information can flow reliably through a noisy channel, and can this limit actually be achieved? The remarkable answer — yes, with long enough codes — established that reliable digital communication over a noisy medium is not just possible but achievable at a well-defined theoretical maximum, now called channel capacity.

Channels and Capacity

A discrete memoryless channel (DMC) is specified by input alphabet $\mathcal{X}$, output alphabet $\mathcal{Y}$, and transition probabilities $P_{Y|X}(y|x)$. The channel is memoryless: $P(Y_1,\ldots,Y_n \mid X_1,\ldots,X_n) = \prod_i P(Y_i \mid X_i)$.

Examples:

| Channel | Model |
| --- | --- |
| Binary symmetric (BSC) | Flip each bit with prob $p$: $Y = X \oplus Z$, $Z \sim \text{Bernoulli}(p)$ |
| Binary erasure (BEC) | Erase with prob $\varepsilon$: $Y \in \{0, 1, ?\}$ |
| AWGN | $Y = X + Z$, $Z \sim \mathcal{N}(0, N)$, power constraint $\mathbb{E}[X^2] \leq P$ |

Channel capacity:

$$C = \max_{P_X} I(X; Y) \quad \text{(bits per channel use)}.$$

The capacity is the maximum mutual information over all input distributions. The maximizing distribution is the capacity-achieving distribution.

Capacity is defined as the maximum mutual information over input distributions — not over coding schemes. The maximization over $P_X$ finds the input distribution that extracts the most information per channel use; the channel coding theorem then proves this theoretical limit is achievable with long enough codes. The key conceptual move is separating the information-theoretic limit from the engineering of codes: first establish what is possible, then ask how to achieve it. This separation principle pervades information theory and explains why theoretical limits can be proven decades before practical codes achieving them are found.
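For the BSC this maximization can be seen directly: sweep over input distributions $P(X=1) = a$ and evaluate $I(X;Y) = H(Y) - H(Y|X)$. A minimal sketch (function names are mine), which recovers the uniform input as the maximizer:

```python
import math

def h2(q):
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def bsc_mutual_info(a, p):
    """I(X;Y) for a BSC(p) with P(X=1) = a, via H(Y) - H(Y|X)."""
    py1 = a * (1 - p) + (1 - a) * p    # P(Y=1)
    return h2(py1) - h2(p)             # H(Y|X) = H_b(p) for every input symbol

# Grid search over input distributions: the maximum sits at a = 0.5,
# recovering C = 1 - H_b(p).
p = 0.1
best_a = max((a / 100 for a in range(101)), key=lambda a: bsc_mutual_info(a, p))
print(best_a, bsc_mutual_info(best_a, p))   # 0.5, ~0.531 bits/use
```

A coarse grid suffices here because $I(X;Y)$ is concave in the input distribution; for general DMCs the same maximization is done with the Blahut-Arimoto algorithm.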

BSC capacity: $C_{\text{BSC}} = 1 - H_b(p)$ where $H_b(p) = -p\log_2 p - (1-p)\log_2(1-p)$.

BEC capacity: $C_{\text{BEC}} = 1 - \varepsilon$ (erased bits carry no information; non-erased bits are noiseless).

AWGN capacity (Shannon 1948):

$$C_{\text{AWGN}} = \frac{1}{2}\log_2\!\left(1 + \frac{P}{N}\right) = \frac{1}{2}\log_2(1 + \text{SNR}) \quad \text{bits per channel use}.$$

For bandwidth $W$ and total noise power $N_0 W$: $C = W\log_2(1 + P/(N_0 W))$ bits/second (Shannon-Hartley theorem).
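As a quick sketch of the Shannon-Hartley formula with hypothetical telephone-channel-style numbers ($W = 3$ kHz, 30 dB SNR; the values are illustrative, not from the text):

```python
import math

def shannon_hartley(P, N0, W):
    """Capacity in bits/second: W * log2(1 + P / (N0 * W))."""
    return W * math.log2(1 + P / (N0 * W))

# Hypothetical numbers: W = 3 kHz, SNR = P/(N0 W) = 1000 (30 dB)
W = 3000.0
N0 = 1.0
P = 1000.0 * N0 * W          # choose P so that SNR = 1000
print(shannon_hartley(P, N0, W))   # ~29,901 bits/s
```

The ~30 kbit/s answer for a 3 kHz, 30 dB channel is the classic back-of-envelope limit for an analog telephone line.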

Shannon's Channel Coding Theorem

Theorem (channel coding): for any rate $R < C$ and $\varepsilon > 0$, there exists a sequence of codes with block length $n$ and rate $R$ such that the maximum probability of error $P_e^{(n)} \leq \varepsilon$ for large enough $n$.

Converse: for any sequence of codes with rate $R > C$, $P_e^{(n)} \geq 1 - C/R - o(1) > 0$. Reliable communication above capacity is impossible.

Proof sketch (achievability): random coding — generate $2^{nR}$ codewords iid from the capacity-achieving distribution $P_X^*$. Encode message $m$ by transmitting codeword $c(m)$. Decode by typical-set decoding: find the unique $m'$ such that $(c(m'), y^n)$ are jointly typical. By the AEP, the probability of incorrect decoding decays exponentially in $n$ for $R < C$.

Error exponent: the probability of error decays as $P_e^{(n)} \leq e^{-nE(R)}$, where $E(R) > 0$ for $R < C$ is the reliability function (Gallager exponent). The reliability function characterizes the fundamental tradeoff between rate and reliability.

Source Coding (Lossless Compression)

Shannon's source coding theorem: given an iid source $X_1, X_2, \ldots$ with entropy $H(X)$:

  • Achievability: for any rate $R > H(X)$, there exists a code compressing $n$ symbols to $nR$ bits with vanishing error probability.
  • Converse: for any rate $R < H(X)$, the error probability $\to 1$.

The minimum achievable rate for lossless compression is exactly $H(X)$ bits/symbol.

Practical codes approaching $H(X)$:

  • Huffman coding: achieves $H(X) \leq L_{\text{Huffman}} < H(X) + 1$ bits/symbol (where $L$ is the expected codeword length)
  • Arithmetic coding: achieves $H(X) + 2/n$ bits/symbol for blocks of $n$ symbols
  • Lempel-Ziv: universal, achieves $H(X)$ asymptotically without knowing the source distribution
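A minimal Huffman construction (my own sketch, tracking codeword depths through the usual merge-two-smallest procedure) makes the $H(X) \leq L < H(X) + 1$ guarantee concrete; for a dyadic source the code is exactly optimal:

```python
import heapq
import math

def huffman_lengths(probs):
    """Codeword lengths from the standard Huffman merge procedure.

    Heap entries are (probability, tiebreak, {symbol: current depth});
    merging two subtrees pushes every contained symbol one level deeper.
    """
    heap = [(p, i, {i: 0}) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, tie, merged))
        tie += 1
    return heap[0][2]

probs = [0.5, 0.25, 0.125, 0.125]              # dyadic source
depths = huffman_lengths(probs)
L = sum(probs[s] * depths[s] for s in depths)  # expected codeword length
H = -sum(p * math.log2(p) for p in probs)      # source entropy
print(L, H)   # dyadic probabilities: L == H == 1.75, the bound is tight
```

For non-dyadic sources $L$ exceeds $H$ by up to one bit per symbol, which is why arithmetic coding (amortizing the overhead over a block) gets closer for skewed distributions.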

Source-Channel Separation Theorem

Theorem (separation): to transmit an iid source with entropy $H$ reliably over a channel with capacity $C$:

  • Possible if and only if $H \leq C$
  • The optimal strategy is to use a source code (compress to $H$ bits/symbol) followed by a channel code (transmit at rate $C$) — the two problems decouple

The separation theorem is computationally convenient: design the best source code and best channel code independently, then concatenate. Combined source-channel coding cannot do better asymptotically.

Rate-Distortion Theory

For lossy compression, allow distortion $d(x, \hat x)$ (typically MSE or Hamming distance). The rate-distortion function $R(D)$ is the minimum rate for compressing to within expected distortion $D$:

$$R(D) = \min_{P(\hat X \mid X)\,:\,\mathbb{E}[d(X,\hat X)] \leq D} I(X; \hat X).$$

Gaussian source with $X \sim \mathcal{N}(0, \sigma^2)$ and MSE distortion:

$$R(D) = \begin{cases} \frac{1}{2}\log_2\frac{\sigma^2}{D} & 0 < D \leq \sigma^2 \\ 0 & D > \sigma^2. \end{cases}$$

Each additional bit of rate cuts the distortion by a factor of four (6 dB per bit): $D = \sigma^2 2^{-2R}$.

Vector Gaussian source with covariance $\Sigma$ (eigenvalues $\lambda_1, \ldots, \lambda_d$): reverse water-filling distributes distortion optimally:

$$D_i = \min(\mu, \lambda_i), \qquad R(D) = \sum_i \max\!\left(0, \frac{1}{2}\log_2\frac{\lambda_i}{\mu}\right),$$

where $\mu$ (the water level) satisfies $\sum_i D_i = D$.
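The water level $\mu$ can be found by bisection, since $\sum_i \min(\mu, \lambda_i)$ is increasing in $\mu$. A sketch with hypothetical eigenvalues (function name and numbers are mine):

```python
import math

def reverse_waterfill(eigs, D, iters=100):
    """Find the water level mu with sum_i min(mu, lam_i) = D.

    Returns (mu, total rate in bits). Components with lam_i <= mu are
    described at zero rate; the rest get distortion exactly mu.
    """
    lo, hi = 0.0, max(eigs)
    for _ in range(iters):
        mu = (lo + hi) / 2
        if sum(min(mu, lam) for lam in eigs) > D:
            hi = mu          # too much total distortion: lower the water
        else:
            lo = mu
    rate = sum(max(0.0, 0.5 * math.log2(lam / mu)) for lam in eigs)
    return mu, rate

eigs = [4.0, 1.0, 0.25]      # hypothetical covariance eigenvalues
mu, R = reverse_waterfill(eigs, D=1.5)
print(mu, R)                  # mu = 0.625; the lam = 0.25 mode gets zero rate
```

Here the smallest eigenvalue sits below the water level, so that component is discarded entirely and the distortion budget is spent uniformly on the two strong modes.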

Converse: no compression algorithm can achieve both average distortion $\leq D$ and rate $< R(D)$, regardless of computational complexity.

Worked Example

Example 1: BSC at Threshold

Binary symmetric channel with crossover probability $p = 0.1$. Capacity $C = 1 - H_b(0.1) = 1 - 0.469 = 0.531$ bits/use.

For transmission at rate $R = 0.4 < C$: there exist codes with vanishing error probability. A practical example: a well-designed LDPC code of rate 0.4 achieves $P_e < 10^{-6}$.

For $R = 0.6 > C$: any code has $P_e \geq 1 - C/R = 1 - 0.531/0.6 = 0.115$ — at least 11.5% of messages will be decoded incorrectly, no matter how good the code.

The reliability function $E(0.4) \approx 0.025$ nats/use, so $P_e \leq e^{-0.025n}$. For $n = 1000$: $P_e \leq e^{-25} \approx 10^{-11}$.
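The numbers in this example can be checked in a few lines:

```python
import math

def hb(p):
    """Binary entropy H_b(p) in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = 0.1
C = 1 - hb(p)
print(round(C, 3))             # 0.531 bits/use

# Weak converse at R = 0.6 > C: error probability bounded away from zero
R = 0.6
print(round(1 - C / R, 3))     # 0.115

# Error-exponent bound at n = 1000, using E(0.4) ~ 0.025 nats from the text
print(math.exp(-0.025 * 1000))  # e^-25 ~ 1.4e-11
```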

Example 2: AWGN at Different SNRs

| SNR (dB) | SNR (linear) | Capacity (bits/use) | Typical modulation | Efficiency (C / modulation bits) |
| --- | --- | --- | --- | --- |
| 0 | 1× | 0.5 | BPSK (1 bit/use) | 50% |
| 10 | 10× | 1.73 | 16-QAM (4 bits/use) | 43% |
| 20 | 100× | 3.32 | 64-QAM (6 bits/use) | 55% |
| 30 | 1000× | 4.98 | 256-QAM (8 bits/use) | 62% |

Modern 5G codes (polar codes, LDPC) operate within 0.5 dB of the Shannon limit — essentially achieving the theoretical bound in practice.
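The capacity and efficiency columns of the table follow directly from the formula (efficiency here meaning capacity divided by the modulation's uncoded bits per use):

```python
import math

def awgn_capacity_db(snr_db):
    """AWGN capacity in bits/use for SNR given in dB."""
    return 0.5 * math.log2(1 + 10 ** (snr_db / 10))

for snr_db, mod_bits in [(0, 1), (10, 4), (20, 6), (30, 8)]:
    C = awgn_capacity_db(snr_db)
    print(f"{snr_db:>2} dB  C = {C:.2f} bits/use  efficiency = {C / mod_bits:.0%}")
```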

Example 3: Lossy Compression via Rate-Distortion

Compress $n = 1000$ samples of $X \sim \mathcal{N}(0, 1)$ to $k$ bits per sample (rate $R = k$).

$R(D) = \frac{1}{2}\log_2(1/D)$ gives $D = 2^{-2R}$.

| Rate (bits/sample) | Min distortion ($D = 2^{-2R}$) | Example |
| --- | --- | --- |
| 8 | $2^{-16} \approx 10^{-5}$ | near lossless (truncated float16) |
| 4 | 0.004 | mild compression |
| 2 | 0.0625 | moderate quality |
| 1 | 0.25 | low quality (25% of variance lost) |
| 0.5 | 0.5 | half the variance represented |

No compression algorithm (JPEG, LLM quantization, any neural codec) can beat these bounds: compressing a $\mathcal{N}(0,1)$ source to 1 bit/sample must introduce MSE $\geq 0.25$.
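The distortion floors in the table come straight from $D = \sigma^2 2^{-2R}$:

```python
def gaussian_min_distortion(R, sigma2=1.0):
    """Minimum MSE for compressing N(0, sigma2) at R bits/sample: sigma2 * 2^(-2R)."""
    return sigma2 * 2 ** (-2 * R)

for R in (8, 4, 2, 1, 0.5):
    print(R, gaussian_min_distortion(R))
```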

Connections

Where Your Intuition Breaks

Shannon's theorem is an asymptotic result: reliable communication at rate $R < C$ requires block length $n \to \infty$. For any finite $n$, the achievable rate is strictly below $C$, and the gap can be substantial. Polar codes and LDPC codes approach capacity as $n$ grows, but the finite-$n$ penalty matters in real systems with latency constraints. More importantly, Shannon capacity assumes a fixed known channel — in ML, the "channel" (data distribution, noise level, task) is unknown and must be estimated from data. The information-theoretic limits apply to the true channel, not the estimated one, which introduces an estimation error that classical channel coding theory does not account for.

💡Intuition

Capacity is achievable with random coding but not with simple codes. The achievability proof uses random codes — each codeword is drawn independently from the input distribution. Such codes cannot be encoded or decoded efficiently (exponential search). The practical miracle is that structured codes (turbo codes, LDPC, polar codes) can approach capacity with polynomial-time algorithms. Polar codes (Arıkan 2009), used in 5G standards, provably achieve capacity for any symmetric binary-input channel with $O(n \log n)$ encoding and decoding complexity.

💡Intuition

The AWGN capacity formula has deep geometric content. $C = \frac{1}{2}\log(1 + P/N)$ comes from volume comparison: over $n$ channel uses, received sequences live in a sphere of radius $\sqrt{n(P+N)}$ in $\mathbb{R}^n$ (signal power $P$ plus noise power $N$); around each codeword sits a noise ball of radius $\sqrt{nN}$. The number of distinguishable codewords is approximately the volume ratio $\left(\sqrt{n(P+N)}/\sqrt{nN}\right)^n = (1 + P/N)^{n/2} = 2^{n \cdot \frac{1}{2}\log_2(1 + P/N)}$. The $+1$ appears precisely because the received signal carries the total power $P + N$, not just $P$. Volume counting gives exactly $C = \frac{1}{2}\log_2(1 + P/N)$.

⚠️Warning

The separation theorem breaks down in the finite blocklength regime. For short block lengths (as in real-time communications, control systems), joint source-channel coding can outperform separated source-channel coding. The dispersion theory (Polyanskiy-Poor-Verdú 2010) characterizes finite-blocklength performance: the achievable rate at block length $n$ is $C - \sqrt{V/n}\, Q^{-1}(\varepsilon) + O(\log n / n)$, where $V$ is the channel dispersion. This shows that for short packets (common in IoT and 5G), codes must be carefully designed for the blocklength, and the asymptotic separation theorem no longer applies.
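The normal approximation is easy to evaluate. For the BSC the dispersion is $V = p(1-p)\log_2^2\frac{1-p}{p}$ (squared bits); the sketch below (formula from Polyanskiy-Poor-Verdú, code and parameter choices mine) shows how slowly the achievable rate approaches $C = 0.531$:

```python
import math
from statistics import NormalDist

def bsc_normal_approx_rate(p, n, eps):
    """Normal approximation R ~ C - sqrt(V/n) * Qinv(eps) for the BSC(p).

    C is the capacity in bits/use; V = p(1-p) * log2((1-p)/p)^2 is the
    channel dispersion; eps is the target block error probability.
    """
    C = 1 + p * math.log2(p) + (1 - p) * math.log2(1 - p)
    V = p * (1 - p) * math.log2((1 - p) / p) ** 2
    q_inv = NormalDist().inv_cdf(1 - eps)     # Q^{-1}(eps)
    return C - math.sqrt(V / n) * q_inv

p, eps = 0.1, 1e-3
for n in (100, 1000, 10000):
    print(n, round(bsc_normal_approx_rate(p, n, eps), 3))
```

At $n = 1000$ the backoff from capacity is still nearly 0.1 bits/use, which is why the asymptotic limit is a poor guide for short-packet design.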
