Channel Capacity & Shannon's Coding Theorems
Shannon's channel capacity theorem identifies the maximum rate at which information can be transmitted reliably over a noisy channel, establishing a theoretical limit on communication that no coding scheme, regardless of complexity, can exceed.
Concepts
AWGN channel capacity C = ½ log₂(1+SNR) bits per channel use. Practical modulation schemes must operate below the Shannon limit: no coding scheme can exceed C regardless of complexity.
High-SNR: C ≈ ½ log₂(SNR) ≈ SNR_dB/6 bits/use (each 6 dB of SNR buys ≈ 1 bit). Low-SNR: C ≈ SNR/(2 ln 2) (linear in SNR; bandwidth, not power, is the bottleneck).
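Both approximations are easy to sanity-check numerically (a quick sketch; the function name is mine):

```python
import math

def awgn_capacity(snr):
    """Exact AWGN capacity C = 0.5*log2(1+SNR), bits per channel use."""
    return 0.5 * math.log2(1 + snr)

for snr_db in [-20, -10, 0, 10, 20, 30]:
    snr = 10 ** (snr_db / 10)
    exact = awgn_capacity(snr)
    low = snr / (2 * math.log(2))                              # low-SNR: linear in SNR
    high = 0.5 * math.log2(snr) if snr > 1 else float("nan")   # high-SNR approximation
    print(f"{snr_db:>4} dB: exact={exact:.4f}  low≈{low:.4f}  high≈{high:.4f}")
```

At −20 dB the linear approximation is within about half a percent of the exact value; at 30 dB the high-SNR approximation is even closer.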
Shannon's theorem answers a question that seemed unanswerable: what is the maximum rate at which information can flow reliably through a noisy channel, and can this limit actually be achieved? The remarkable answer — yes, with long enough codes — established that reliable digital communication over a noisy medium is not just possible but achievable at a well-defined theoretical maximum, now called channel capacity.
Channels and Capacity
A discrete memoryless channel (DMC) is specified by input alphabet 𝒳, output alphabet 𝒴, and transition probabilities p(y|x). The channel is memoryless: p(yⁿ | xⁿ) = ∏ᵢ p(yᵢ | xᵢ).
Examples:
| Channel | Model |
|---|---|
| Binary symmetric (BSC) | Flip each bit with prob p: p(y = x) = 1 − p, p(y ≠ x) = p |
| Binary erasure (BEC) | Erase with prob ε: p(y = e | x) = ε, p(y = x | x) = 1 − ε |
| AWGN | Y = X + Z, Z ~ 𝒩(0, N), power constraint E[X²] ≤ P |
Channel capacity: C = max_{p(x)} I(X; Y).
The capacity is the maximum mutual information over all input distributions. The maximizing distribution is the capacity-achieving distribution.
Capacity is defined as the maximum mutual information over input distributions, not over coding schemes. The maximization over p(x) finds the input distribution that extracts the most information per channel use; the channel coding theorem then proves this theoretical limit is achievable with long enough codes. The key conceptual move is separating the information-theoretic limit from the engineering of codes: first establish what is possible, then ask how to achieve it. This separation principle pervades information theory and explains why theoretical limits can be proven decades before practical codes achieving them are found.
BSC capacity: C = 1 − H(p), where H(p) = −p log₂ p − (1−p) log₂(1−p) is the binary entropy.
BEC capacity: C = 1 − ε (erased bits carry no information; non-erased bits are noiseless).
AWGN capacity (Shannon 1948): C = ½ log₂(1 + P/N) bits per channel use.
For bandwidth W and noise power spectral density N₀ (total noise power N₀W): C = W log₂(1 + P/(N₀W)) bits/second (Shannon-Hartley theorem).
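The three capacity formulas above can be evaluated in a few lines (a sketch; function names are mine):

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    return 1 - h2(p)                       # C = 1 - H(p)

def bec_capacity(eps):
    return 1 - eps                         # C = 1 - epsilon

def awgn_capacity(snr):
    return 0.5 * math.log2(1 + snr)        # C = 0.5*log2(1 + P/N)

def shannon_hartley(W, P, N0):
    """Capacity in bits/second: bandwidth W Hz, signal power P, noise PSD N0."""
    return W * math.log2(1 + P / (N0 * W))

print(bsc_capacity(0.11))   # ≈ 0.5
print(bec_capacity(0.3))    # 0.7
print(awgn_capacity(10))    # ≈ 1.73
```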
Shannon's Channel Coding Theorem
Theorem (channel coding): for any rate R < C and ε > 0, there exists a sequence of (2^(nR), n) codes with block length n and rate R such that the maximum probability of error is less than ε for large enough n.
Converse: for any sequence of codes with rate R > C, the error probability is bounded away from zero. Reliable communication above capacity is impossible.
Proof sketch (achievability): random coding: generate 2^(nR) codewords iid from the capacity-achieving distribution p*(x). Encode message m by transmitting codeword xⁿ(m). Decode by typical set decoding: find the unique m̂ such that (xⁿ(m̂), yⁿ) are jointly typical. By the AEP, the probability of incorrect decoding decays exponentially in n for R < C.
Error exponent: the probability of error decays as P_e ≈ 2^(−nE(R)), where E(R) > 0 for R < C is the reliability function (Gallager exponent). The reliability function characterizes the fundamental tradeoff between rate and reliability.
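For a concrete feel, the random-coding exponent for a BSC with uniform input can be evaluated numerically from Gallager's E₀ function (a sketch; the grid search over ρ ∈ [0, 1] is a simple stand-in for exact optimization):

```python
import math

def gallager_E0(rho, p):
    """Gallager's E0(rho) for a BSC(p) with uniform input, in bits."""
    a = 1.0 / (1.0 + rho)
    return rho - (1 + rho) * math.log2(p ** a + (1 - p) ** a)

def random_coding_exponent(R, p, steps=10_000):
    """E_r(R) = max over rho in [0, 1] of E0(rho) - rho*R."""
    return max(gallager_E0(k / steps, p) - (k / steps) * R
               for k in range(steps + 1))

p = 0.11   # capacity C = 1 - H(0.11) ≈ 0.5 bits/use
for R in [0.1, 0.3, 0.45, 0.5]:
    Er = random_coding_exponent(R, p)
    print(f"R = {R:.2f}  E_r = {Er:.4f}   (P_e ~ 2^(-n*E_r))")
```

The exponent shrinks to zero as R approaches capacity, which is why near-capacity codes need long block lengths.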
Source Coding (Lossless Compression)
Shannon's source coding theorem: given an iid source X with entropy H(X):
- Achievability: for any rate R > H(X), there exists a code compressing n symbols to nR bits with vanishing error probability.
- Converse: for any rate R < H(X), the error probability → 1.
The minimum achievable rate for lossless compression is exactly H(X) bits/symbol.
Practical codes approaching H(X):
- Huffman coding: achieves H(X) ≤ L < H(X) + 1 bits/symbol (where L is the expected codeword length)
- Arithmetic coding: achieves within 2/n bits/symbol of H(X) for blocks of n symbols
- Lempel-Ziv: universal; achieves H(X) asymptotically without knowing the source distribution
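A minimal Huffman construction makes the bound H(X) ≤ L < H(X) + 1 tangible (a sketch using `heapq`; this computes only codeword lengths, not the codewords themselves):

```python
import heapq
import math

def huffman_lengths(probs):
    """Codeword lengths from a Huffman tree over symbol probabilities."""
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)          # unique tiebreaker for merged nodes
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:   # every symbol under a merge gets 1 bit deeper
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return lengths

probs = [0.5, 0.25, 0.125, 0.125]
L = sum(p * l for p, l in zip(probs, huffman_lengths(probs)))
H = -sum(p * math.log2(p) for p in probs)
print(f"H = {H:.3f}, Huffman L = {L:.3f}")   # dyadic source: L = H exactly
```

For dyadic probabilities Huffman is exactly optimal (L = H = 1.75 here); for general sources L lands strictly inside [H, H + 1).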
Source-Channel Separation Theorem
Theorem (separation): to transmit an iid source with entropy H(X) reliably over a channel with capacity C:
- Possible if H(X) < C; impossible if H(X) > C
- The optimal strategy is to use a source code (compress to ≈ H(X) bits/symbol) followed by a channel code (transmit at any rate R < C); the two problems decouple
The separation theorem is computationally convenient: design the best source code and best channel code independently, then concatenate. Combined source-channel coding cannot do better asymptotically.
Rate-Distortion Theory
For lossy compression, allow distortion D under a distortion measure d (typically MSE or Hamming distance). The rate-distortion function R(D) is the minimum rate for compressing X to within expected distortion D: R(D) = min over p(x̂|x) with E[d(X, X̂)] ≤ D of I(X; X̂).
Gaussian source with X ~ 𝒩(0, σ²) and MSE distortion: R(D) = ½ log₂(σ²/D) for 0 ≤ D ≤ σ², and 0 otherwise.
Inverting gives D(R) = σ² 2^(−2R): each additional bit of rate cuts the distortion by a factor of 4 (6 dB).
Vector Gaussian source with covariance Σ (eigenvalues λ₁, …, λ_d): reverse water-filling distributes distortion optimally: R(D) = Σᵢ ½ log₂(λᵢ/Dᵢ) with Dᵢ = min(θ, λᵢ), where θ (water level) satisfies Σᵢ Dᵢ = D.
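The water level θ can be found by bisection, since Σᵢ min(θ, λᵢ) is increasing in θ (a sketch; the function name is mine):

```python
import math

def reverse_waterfill(eigs, D, tol=1e-12):
    """Reverse water-filling for a vector Gaussian source under total MSE D.

    Finds the water level theta with sum_i min(theta, lambda_i) = D, then
    R(D) = sum_i 0.5*log2(lambda_i / D_i) over components with D_i < lambda_i.
    """
    assert 0 < D <= sum(eigs)
    lo, hi = 0.0, max(eigs)
    while hi - lo > tol:                 # bisection on the water level
        theta = (lo + hi) / 2
        if sum(min(theta, lam) for lam in eigs) < D:
            lo = theta
        else:
            hi = theta
    theta = (lo + hi) / 2
    Di = [min(theta, lam) for lam in eigs]
    R = sum(0.5 * math.log2(lam / d) for lam, d in zip(eigs, Di) if d < lam)
    return R, Di

R, Di = reverse_waterfill([4.0, 1.0, 0.25], D=1.0)
print(f"R(D) = {R:.3f} bits, per-component distortions = {Di}")
```

Weak components (λᵢ ≤ θ) get no bits at all: they are described by their mean, and all the rate goes to the strong components.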
Converse: no compression algorithm can achieve both average distortion and rate , regardless of computational complexity.
Worked Example
Example 1: BSC at Threshold
Binary symmetric channel with crossover probability p = 0.11. Capacity C = 1 − H(0.11) ≈ 0.5 bits/use.
For transmission at rate R < 0.5: there exist codes with vanishing error probability. A practical example: an LDPC code of rate 0.4 bits/use operates below capacity and can be driven to arbitrarily low error rates with increasing block length.
For R > C: any code has error probability bounded away from zero. By the weak converse, P_e ≥ 1 − C/R − o(1); at R = 0.565, at least 11.5% of messages will be decoded incorrectly, no matter how good the code.
The reliability function E(R) quantifies the decay below capacity: P_e ≈ 2^(−nE(R)) for R < C. The closer R is to capacity, the smaller E(R), and the longer the block length needed to reach a target error probability.
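The capacity and converse numbers above check out directly (a short sketch; the rates in the loop are illustrative):

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

C = 1 - h2(0.11)
print(f"C = {C:.4f} bits/use")             # ≈ 0.5

# Weak converse: at rate R > C, asymptotically P_e >= 1 - C/R
for R in [0.55, 0.6, 0.7]:
    print(f"R = {R}: P_e >= {1 - C / R:.3f}")
```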
Example 2: AWGN at Different SNRs
| SNR (dB) | SNR (linear) | Capacity (bits/use) | Typical modulation | C / modulation rate |
|---|---|---|---|---|
| 0 dB | 1× | 0.5 | BPSK (1 bit/use) | 50% |
| 10 dB | 10× | 1.73 | 16-QAM (4 bits/use) | 43% |
| 20 dB | 100× | 3.33 | 64-QAM (6 bits/use) | 55% |
| 30 dB | 1000× | 4.98 | 256-QAM (8 bits/use) | 62% |
Modern 5G codes (polar codes, LDPC) operate within 0.5 dB of the Shannon limit — essentially achieving the theoretical bound in practice.
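The table's numbers come straight from C = ½ log₂(1+SNR) (a sketch; the modulation pairings are the table's, not part of the formula):

```python
import math

def cap(snr_db):
    """AWGN capacity in bits per channel use at a given SNR in dB."""
    return 0.5 * math.log2(1 + 10 ** (snr_db / 10))

mods = {0: ("BPSK", 1), 10: ("16-QAM", 4), 20: ("64-QAM", 6), 30: ("256-QAM", 8)}
for snr_db, (name, bits) in mods.items():
    C = cap(snr_db)
    print(f"{snr_db:>2} dB  C = {C:.2f} bits/use  {name:<7} C/bits = {C / bits:.0%}")
```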
Example 3: Lossy Compression via Rate-Distortion
Compress samples of a Gaussian source X ~ 𝒩(0, 1) to R bits per sample.
The distortion-rate function D(R) = σ² 2^(−2R) = 2^(−2R) gives the minimum achievable MSE at rate R.
| Rate R (bits/sample) | Minimum distortion D(R) | Quality |
|---|---|---|
| 8 | 2^(−16) ≈ 1.5×10⁻⁵ | near lossless |
| 4 | ≈ 0.004 | mild compression |
| 2 | 0.0625 | moderate quality |
| 1 | 0.25 | low quality (25% of variance lost) |
| 0.5 | 0.5 | half the variance lost |
No compression algorithm (JPEG, LLM quantization, any neural codec) can beat these bounds: compressing a unit-variance Gaussian source to 1 bit/sample must introduce MSE ≥ 0.25.
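The whole table reduces to one line of arithmetic (a sketch; the function name is mine):

```python
def distortion_rate(R, sigma2=1.0):
    """Minimum achievable MSE when compressing N(0, sigma2) at R bits/sample."""
    return sigma2 * 2 ** (-2 * R)

for R in [8, 4, 2, 1, 0.5]:
    print(f"R = {R:<4} D(R) = {distortion_rate(R):.6g}")
```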
Where Your Intuition Breaks
Shannon's theorem is an asymptotic result: reliable communication at rates approaching C requires block length n → ∞. For any finite n, the maximum achievable rate is strictly below C, and the gap can be substantial. Polar codes and LDPC codes approach capacity as n grows, but the finite-n penalty matters in real systems with latency constraints. More importantly, Shannon capacity assumes a fixed known channel: in ML, the "channel" (data distribution, noise level, task) is unknown and must be estimated from data. The information-theoretic limits apply to the true channel, not the estimated one, which introduces an estimation error that classical channel coding theory does not account for.
Capacity is achievable with random coding but not with simple codes. The achievability proof uses random codes: each codeword is drawn independently from the capacity-achieving input distribution. Such codes cannot be encoded or decoded efficiently (exponential search). The practical miracle is that structured codes (turbo codes, LDPC, polar codes) can approach capacity with polynomial-time algorithms. Polar codes (Arıkan 2009), used in 5G standards, provably achieve capacity for any symmetric binary-input channel with O(n log n) encoding and decoding complexity.
The AWGN capacity formula has deep geometric content. C = ½ log₂(1 + P/N) comes from volume comparison: transmitted codewords live on a sphere of radius √(nP) in ℝⁿ; each received signal is within a noise ball of radius √(nN); the number of distinguishable codewords is approximately the volume ratio (√(n(P+N))/√(nN))ⁿ = (1 + P/N)^(n/2). The √(n(P+N)) comes from the total power P + N: received signals live on a sphere of radius √(n(P+N)). Volume counting gives exactly 2^(nC) codewords with C = ½ log₂(1 + P/N).
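The sphere-hardening behind this argument is easy to see empirically: in high dimensions, Gaussian vectors concentrate tightly on their expected radius (a Monte Carlo sketch; n, P, N are illustrative values):

```python
import math
import random

random.seed(0)
n, P, N = 10_000, 4.0, 1.0
x = [random.gauss(0, math.sqrt(P)) for _ in range(n)]   # codeword with power P
z = [random.gauss(0, math.sqrt(N)) for _ in range(n)]   # channel noise
y = [xi + zi for xi, zi in zip(x, z)]                   # received signal

def norm(v):
    return math.sqrt(sum(t * t for t in v))

print(norm(z) / math.sqrt(n * N))        # ≈ 1.0: noise ball radius sqrt(nN)
print(norm(y) / math.sqrt(n * (P + N)))  # ≈ 1.0: received sphere radius sqrt(n(P+N))
```

The relative fluctuation of these radii is O(1/√n), which is exactly why the volume-counting argument becomes tight as n → ∞.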
The separation theorem breaks down in the finite blocklength regime. For short block lengths (as in real-time communications, control systems), joint source-channel coding can outperform separated source-channel coding. The dispersion theory (Polyanskiy-Poor-Verdú 2010) characterizes finite-blocklength performance: the achievable rate at block length n is approximately C − √(V/n)·Q⁻¹(ε), where V is the channel dispersion and ε the target error probability. This shows that for short packets (common in IoT and 5G), codes must be carefully designed for the blocklength, and the asymptotic separation theorem no longer applies.
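The normal approximation is simple to evaluate for the BSC, whose dispersion is V = p(1−p) log₂²((1−p)/p) (a sketch; it omits the smaller O(log n / n) correction term):

```python
import math
from statistics import NormalDist

def bsc_normal_approx(p, n, eps):
    """Normal approximation R ≈ C - sqrt(V/n) * Qinv(eps) for a BSC(p)."""
    C = 1 + p * math.log2(p) + (1 - p) * math.log2(1 - p)   # 1 - H(p)
    V = p * (1 - p) * math.log2((1 - p) / p) ** 2           # channel dispersion
    Qinv = NormalDist().inv_cdf(1 - eps)                    # Q^{-1}(eps)
    return C - math.sqrt(V / n) * Qinv

for n in [100, 1_000, 10_000, 100_000]:
    print(f"n = {n:>6}: max rate ≈ {bsc_normal_approx(0.11, n, 1e-3):.3f} bits/use")
```

At n = 100 the achievable rate is well under half of capacity; even at n = 10,000 a visible gap to C ≈ 0.5 remains, which is the finite-blocklength penalty the paragraph describes.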