
Channel Capacity & Shannon's Coding Theorems

Shannon's channel capacity theorem identifies the maximum rate at which information can be transmitted reliably over a noisy channel, establishing the theoretical limits of compression and communication that no algorithm, regardless of complexity, can exceed.

Concepts

AWGN channel capacity C = ½ log₂(1+SNR) bits per channel use. Practical modulation schemes (dots) must stay below the Shannon limit. No coding scheme can exceed C regardless of complexity.

[Figure: AWGN capacity $C = \tfrac{1}{2}\log_2(1+\text{SNR})$ versus SNR (dB), showing the Shannon limit curve with operating points for BPSK, QPSK, 16-QAM, 64-QAM, and 256-QAM below it.]

High-SNR: $C \approx \tfrac{1}{2}\log_2(\text{SNR}) \approx \text{SNR}_{\text{dB}}/6$ bits/use (each extra 6 dB of SNR buys about 1 bit). Low-SNR: $C \approx \text{SNR}/(2 \ln 2)$ (linear in SNR; power, not bandwidth, is the bottleneck).

Shannon's theorem answers a question that seemed unanswerable: what is the maximum rate at which information can flow reliably through a noisy channel, and can this limit actually be achieved? The remarkable answer — yes, with long enough codes — established that reliable digital communication over a noisy medium is not just possible but achievable at a well-defined theoretical maximum, now called channel capacity.

Channels and Capacity

A discrete memoryless channel (DMC) is specified by input alphabet $\mathcal{X}$, output alphabet $\mathcal{Y}$, and transition probabilities $P_{Y|X}(y|x)$. The channel is memoryless: $P(Y_1,\ldots,Y_n \mid X_1,\ldots,X_n) = \prod_i P(Y_i \mid X_i)$.

Examples:

| Channel | Model |
| --- | --- |
| Binary symmetric (BSC) | Flip each bit with prob $p$: $Y = X \oplus Z$, $Z \sim \text{Bernoulli}(p)$ |
| Binary erasure (BEC) | Erase with prob $\varepsilon$: $Y \in \{0, 1, ?\}$ |
| AWGN | $Y = X + Z$, $Z \sim \mathcal{N}(0, N)$, power constraint $\mathbb{E}[X^2] \leq P$ |

Channel capacity:

$$C = \max_{P_X} I(X; Y) \quad \text{(bits per channel use)}.$$

The capacity is the maximum mutual information over all input distributions. The maximizing distribution is the capacity-achieving distribution.

Capacity is defined as the maximum mutual information over input distributions — not over coding schemes. The maximization over $P_X$ finds the input distribution that extracts the most information per channel use; the channel coding theorem then proves this theoretical limit is achievable with long enough codes. The key conceptual move is separating the information-theoretic limit from the engineering of codes: first establish what is possible, then ask how to achieve it. This separation principle pervades information theory and explains why theoretical limits can be proven decades before practical codes achieving them are found.
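For the BSC this maximization can be seen directly: sweep over input distributions $P(X=1) = a$ and evaluate $I(X;Y) = H(Y) - H(Y|X)$. A minimal sketch (function names are mine), which recovers the uniform input as the maximizer:

```python
import math

def h2(q):
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def bsc_mutual_info(a, p):
    """I(X;Y) for a BSC(p) with P(X=1) = a, via H(Y) - H(Y|X)."""
    py1 = a * (1 - p) + (1 - a) * p    # P(Y=1)
    return h2(py1) - h2(p)             # H(Y|X) = H_b(p) for every input symbol

# Grid search over input distributions: the maximum sits at a = 0.5,
# recovering C = 1 - H_b(p).
p = 0.1
best_a = max((a / 100 for a in range(101)), key=lambda a: bsc_mutual_info(a, p))
print(best_a, bsc_mutual_info(best_a, p))   # 0.5, ~0.531 bits/use
```

A coarse grid suffices here because $I(X;Y)$ is concave in the input distribution; for general DMCs the same maximization is done with the Blahut-Arimoto algorithm.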

BSC capacity: $C_{\text{BSC}} = 1 - H_b(p)$ where $H_b(p) = -p\log_2 p - (1-p)\log_2(1-p)$.

BEC capacity: $C_{\text{BEC}} = 1 - \varepsilon$ (erased bits carry no information; non-erased bits are noiseless).

AWGN capacity (Shannon 1948):

$$C_{\text{AWGN}} = \frac{1}{2}\log_2\!\left(1 + \frac{P}{N}\right) = \frac{1}{2}\log_2(1 + \text{SNR}) \quad \text{bits per channel use}.$$

For bandwidth $W$ and total noise power $N_0 W$: $C = W\log_2(1 + P/(N_0 W))$ bits/second (Shannon-Hartley theorem).
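As a quick sketch of the Shannon-Hartley formula with hypothetical telephone-channel-style numbers ($W = 3$ kHz, 30 dB SNR; the values are illustrative, not from the text):

```python
import math

def shannon_hartley(P, N0, W):
    """Capacity in bits/second: W * log2(1 + P / (N0 * W))."""
    return W * math.log2(1 + P / (N0 * W))

# Hypothetical numbers: W = 3 kHz, SNR = P/(N0 W) = 1000 (30 dB)
W = 3000.0
N0 = 1.0
P = 1000.0 * N0 * W          # choose P so that SNR = 1000
print(shannon_hartley(P, N0, W))   # ~29,901 bits/s
```

The ~30 kbit/s answer for a 3 kHz, 30 dB channel is the classic back-of-envelope limit for an analog telephone line.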

Shannon's Channel Coding Theorem

Theorem (channel coding): for any rate $R < C$ and $\varepsilon > 0$, there exists a sequence of codes with block length $n$ and rate $R$ such that the maximum probability of error $P_e^{(n)} \leq \varepsilon$ for large enough $n$.

Converse: for any sequence of codes with rate $R > C$, $P_e^{(n)} \geq 1 - C/R - o(1) > 0$. Reliable communication above capacity is impossible.

Proof sketch (achievability): random coding — generate $2^{nR}$ codewords iid from the capacity-achieving distribution $P_X^*$. Encode message $m$ by transmitting codeword $c(m)$. Decode by typical-set decoding: find the unique $m'$ such that $(c(m'), y^n)$ are jointly typical. By the AEP, the probability of incorrect decoding decays exponentially in $n$ for $R < C$.

Error exponent: the probability of error decays as $P_e^{(n)} \leq e^{-nE(R)}$, where $E(R) > 0$ for $R < C$ is the reliability function (Gallager exponent). The reliability function characterizes the fundamental tradeoff between rate and reliability.

Source Coding (Lossless Compression)

Shannon's source coding theorem: given an iid source $X_1, X_2, \ldots$ with entropy $H(X)$:

  • Achievability: for any rate $R > H(X)$, there exists a code compressing $n$ symbols to $nR$ bits with vanishing error probability.
  • Converse: for any rate $R < H(X)$, the error probability $\to 1$.

The minimum achievable rate for lossless compression is exactly $H(X)$ bits/symbol.

Practical codes approaching $H(X)$:

  • Huffman coding: achieves $H(X) \leq L_{\text{Huffman}} < H(X) + 1$ bits/symbol (where $L$ is the expected codeword length)
  • Arithmetic coding: achieves $H(X) + 2/n$ bits/symbol for blocks of $n$ symbols
  • Lempel-Ziv: universal, achieves $H(X)$ asymptotically without knowing the source distribution
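A minimal Huffman construction (my own sketch, tracking codeword depths through the usual merge-two-smallest procedure) makes the $H(X) \leq L < H(X) + 1$ guarantee concrete; for a dyadic source the code is exactly optimal:

```python
import heapq
import math

def huffman_lengths(probs):
    """Codeword lengths from the standard Huffman merge procedure.

    Heap entries are (probability, tiebreak, {symbol: current depth});
    merging two subtrees pushes every contained symbol one level deeper.
    """
    heap = [(p, i, {i: 0}) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, tie, merged))
        tie += 1
    return heap[0][2]

probs = [0.5, 0.25, 0.125, 0.125]              # dyadic source
depths = huffman_lengths(probs)
L = sum(probs[s] * depths[s] for s in depths)  # expected codeword length
H = -sum(p * math.log2(p) for p in probs)      # source entropy
print(L, H)   # dyadic probabilities: L == H == 1.75, the bound is tight
```

For non-dyadic sources $L$ exceeds $H$ by up to one bit per symbol, which is why arithmetic coding (amortizing the overhead over a block) gets closer for skewed distributions.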

Source-Channel Separation Theorem

Theorem (separation): to transmit an iid source with entropy $H$ reliably over a channel with capacity $C$:

  • Possible if and only if $H \leq C$
  • The optimal strategy is to use a source code (compress to $H$ bits/symbol) followed by a channel code (transmit at rate $C$) — the two problems decouple

The separation theorem is computationally convenient: design the best source code and best channel code independently, then concatenate. Combined source-channel coding cannot do better asymptotically.

Rate-Distortion Theory

For lossy compression, allow distortion $d(x, \hat x)$ (typically MSE or Hamming distance). The rate-distortion function $R(D)$ is the minimum rate for compressing to within expected distortion $D$:

$$R(D) = \min_{P(\hat X \mid X)\,:\,\mathbb{E}[d(X,\hat X)] \leq D} I(X; \hat X).$$

Gaussian source with $X \sim \mathcal{N}(0, \sigma^2)$ and MSE distortion:

$$R(D) = \begin{cases} \frac{1}{2}\log_2\frac{\sigma^2}{D} & 0 < D \leq \sigma^2 \\ 0 & D > \sigma^2. \end{cases}$$

Each additional bit of rate cuts the distortion by a factor of four (6 dB per bit): $D = \sigma^2 2^{-2R}$.

Vector Gaussian source with covariance $\Sigma$ (eigenvalues $\lambda_1, \ldots, \lambda_d$): reverse water-filling distributes distortion optimally:

$$D_i = \min(\mu, \lambda_i), \qquad R(D) = \sum_i \max\!\left(0, \frac{1}{2}\log_2\frac{\lambda_i}{\mu}\right),$$

where $\mu$ (the water level) satisfies $\sum_i D_i = D$.
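The water level $\mu$ can be found by bisection, since $\sum_i \min(\mu, \lambda_i)$ is increasing in $\mu$. A sketch with hypothetical eigenvalues (function name and numbers are mine):

```python
import math

def reverse_waterfill(eigs, D, iters=100):
    """Find the water level mu with sum_i min(mu, lam_i) = D.

    Returns (mu, total rate in bits). Components with lam_i <= mu are
    described at zero rate; the rest get distortion exactly mu.
    """
    lo, hi = 0.0, max(eigs)
    for _ in range(iters):
        mu = (lo + hi) / 2
        if sum(min(mu, lam) for lam in eigs) > D:
            hi = mu          # too much total distortion: lower the water
        else:
            lo = mu
    rate = sum(max(0.0, 0.5 * math.log2(lam / mu)) for lam in eigs)
    return mu, rate

eigs = [4.0, 1.0, 0.25]      # hypothetical covariance eigenvalues
mu, R = reverse_waterfill(eigs, D=1.5)
print(mu, R)                  # mu = 0.625; the lam = 0.25 mode gets zero rate
```

Here the smallest eigenvalue sits below the water level, so that component is discarded entirely and the distortion budget is spent uniformly on the two strong modes.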

Converse: no compression algorithm can achieve both average distortion $\leq D$ and rate $< R(D)$, regardless of computational complexity.

Worked Example

Example 1: BSC at Threshold

Binary symmetric channel with crossover probability $p = 0.1$. Capacity $C = 1 - H_b(0.1) = 1 - 0.469 = 0.531$ bits/use.

For transmission at rate $R = 0.4 < C$: there exist codes with vanishing error probability. A practical example: a well-designed LDPC code of rate 0.4 achieves $P_e < 10^{-6}$.

For $R = 0.6 > C$: any code has $P_e \geq 1 - C/R = 1 - 0.531/0.6 = 0.115$ — at least 11.5% of messages will be decoded incorrectly, no matter how good the code.

The reliability function $E(0.4) \approx 0.025$ nats/use, so $P_e \leq e^{-0.025n}$. For $n = 1000$: $P_e \leq e^{-25} \approx 10^{-11}$.
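The numbers in this example can be checked in a few lines:

```python
import math

def hb(p):
    """Binary entropy H_b(p) in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = 0.1
C = 1 - hb(p)
print(round(C, 3))             # 0.531 bits/use

# Weak converse at R = 0.6 > C: error probability bounded away from zero
R = 0.6
print(round(1 - C / R, 3))     # 0.115

# Error-exponent bound at n = 1000, using E(0.4) ~ 0.025 nats from the text
print(math.exp(-0.025 * 1000))  # e^-25 ~ 1.4e-11
```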

Example 2: AWGN at Different SNRs

| SNR (dB) | SNR (linear) | Capacity (bits/use) | Typical modulation | Efficiency (C / modulation bits) |
| --- | --- | --- | --- | --- |
| 0 | 1× | 0.5 | BPSK (1 bit/use) | 50% |
| 10 | 10× | 1.73 | 16-QAM (4 bits/use) | 43% |
| 20 | 100× | 3.32 | 64-QAM (6 bits/use) | 55% |
| 30 | 1000× | 4.98 | 256-QAM (8 bits/use) | 62% |

Modern 5G codes (polar codes, LDPC) operate within 0.5 dB of the Shannon limit — essentially achieving the theoretical bound in practice.
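The capacity and efficiency columns of the table follow directly from the formula (efficiency here meaning capacity divided by the modulation's uncoded bits per use):

```python
import math

def awgn_capacity_db(snr_db):
    """AWGN capacity in bits/use for SNR given in dB."""
    return 0.5 * math.log2(1 + 10 ** (snr_db / 10))

for snr_db, mod_bits in [(0, 1), (10, 4), (20, 6), (30, 8)]:
    C = awgn_capacity_db(snr_db)
    print(f"{snr_db:>2} dB  C = {C:.2f} bits/use  efficiency = {C / mod_bits:.0%}")
```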

Example 3: Lossy Compression via Rate-Distortion

Compress $n = 1000$ samples of $X \sim \mathcal{N}(0, 1)$ to $k$ bits per sample (rate $R = k$).

$R(D) = \frac{1}{2}\log_2(1/D)$ gives $D = 2^{-2R}$.

| Rate (bits/sample) | Min distortion ($D = 2^{-2R}$) | Example |
| --- | --- | --- |
| 8 | $2^{-16} \approx 10^{-5}$ | near lossless (truncated float16) |
| 4 | 0.004 | mild compression |
| 2 | 0.0625 | moderate quality |
| 1 | 0.25 | low quality (25% of variance lost) |
| 0.5 | 0.5 | half the variance represented |

No compression algorithm (JPEG, LLM quantization, any neural codec) can beat these bounds: compressing a $\mathcal{N}(0,1)$ source to 1 bit/sample must introduce MSE $\geq 0.25$.
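The distortion floors in the table come straight from $D = \sigma^2 2^{-2R}$:

```python
def gaussian_min_distortion(R, sigma2=1.0):
    """Minimum MSE for compressing N(0, sigma2) at R bits/sample: sigma2 * 2^(-2R)."""
    return sigma2 * 2 ** (-2 * R)

for R in (8, 4, 2, 1, 0.5):
    print(R, gaussian_min_distortion(R))
```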

Connections

Where Your Intuition Breaks

Shannon's theorem is an asymptotic result: reliable communication at rate $R < C$ requires block length $n \to \infty$. For any finite $n$, the achievable rate is strictly below $C$, and the gap can be substantial. Polar codes and LDPC codes approach capacity as $n$ grows, but the finite-$n$ penalty matters in real systems with latency constraints. More importantly, Shannon capacity assumes a fixed known channel — in ML, the "channel" (data distribution, noise level, task) is unknown and must be estimated from data. The information-theoretic limits apply to the true channel, not the estimated one, which introduces an estimation error that classical channel coding theory does not account for.

💡Intuition

Capacity is achievable with random coding but not with simple codes. The achievability proof uses random codes — each codeword is drawn independently from the input distribution. Such codes cannot be encoded or decoded efficiently (exponential search). The practical miracle is that structured codes (turbo codes, LDPC, polar codes) can approach capacity with polynomial-time algorithms. Polar codes (Arıkan 2009), used in 5G standards, provably achieve capacity for any symmetric binary-input channel with $O(n \log n)$ encoding and decoding complexity.

💡Intuition

The AWGN capacity formula has deep geometric content. $C = \frac{1}{2}\log(1 + P/N)$ comes from volume comparison: over $n$ channel uses, received sequences live in a sphere of radius $\sqrt{n(P+N)}$ in $\mathbb{R}^n$ (signal power $P$ plus noise power $N$); around each codeword sits a noise ball of radius $\sqrt{nN}$. The number of distinguishable codewords is approximately the volume ratio $\left(\sqrt{n(P+N)}/\sqrt{nN}\right)^n = (1 + P/N)^{n/2} = 2^{n \cdot \frac{1}{2}\log_2(1 + P/N)}$. The $+1$ appears precisely because the received signal carries the total power $P + N$, not just $P$. Volume counting gives exactly $C = \frac{1}{2}\log_2(1 + P/N)$.

⚠️Warning

The separation theorem breaks down in the finite blocklength regime. For short block lengths (as in real-time communications, control systems), joint source-channel coding can outperform separated source-channel coding. The dispersion theory (Polyanskiy-Poor-Verdú 2010) characterizes finite-blocklength performance: the achievable rate at block length $n$ is $C - \sqrt{V/n}\, Q^{-1}(\varepsilon) + O(\log n / n)$, where $V$ is the channel dispersion. This shows that for short packets (common in IoT and 5G), codes must be carefully designed for the blocklength, and the asymptotic separation theorem no longer applies.
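The normal approximation is easy to evaluate. For the BSC the dispersion is $V = p(1-p)\log_2^2\frac{1-p}{p}$ (squared bits); the sketch below (formula from Polyanskiy-Poor-Verdú, code and parameter choices mine) shows how slowly the achievable rate approaches $C = 0.531$:

```python
import math
from statistics import NormalDist

def bsc_normal_approx_rate(p, n, eps):
    """Normal approximation R ~ C - sqrt(V/n) * Qinv(eps) for the BSC(p).

    C is the capacity in bits/use; V = p(1-p) * log2((1-p)/p)^2 is the
    channel dispersion; eps is the target block error probability.
    """
    C = 1 + p * math.log2(p) + (1 - p) * math.log2(1 - p)
    V = p * (1 - p) * math.log2((1 - p) / p) ** 2
    q_inv = NormalDist().inv_cdf(1 - eps)     # Q^{-1}(eps)
    return C - math.sqrt(V / n) * q_inv

p, eps = 0.1, 1e-3
for n in (100, 1000, 10000):
    print(n, round(bsc_normal_approx_rate(p, n, eps), 3))
```

At $n = 1000$ the backoff from capacity is still nearly 0.1 bits/use, which is why the asymptotic limit is a poor guide for short-packet design.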
