
Bridge: Concentration Inequalities (Hoeffding, Bernstein, Sub-Gaussian) & Generalization Bounds

Concentration inequalities bound how far a random variable deviates from its mean, far more sharply than Chebyshev. The bridge from probability to learning theory runs through these bounds: Hoeffding's inequality gives the PAC learning sample complexity for finite hypothesis classes; Rademacher complexity extends this to infinite classes; and uniform convergence — the guarantee that empirical risk concentrates around true risk simultaneously over all hypotheses — is the foundation of generalization bounds for SVMs, random forests, and neural networks.

Concepts

"How many training examples do I need?" — this question, central to all of ML practice, is answered by concentration inequalities. The CLT tells you the distribution of sample averages but doesn't tell you the probability of a single run going badly wrong. Hoeffding's inequality and Chernoff bounds give explicit probabilities: "with probability at most δ\delta, the empirical risk deviates from true risk by more than ϵ\epsilon, and this requires nClog(1/δ)/ϵ2n \geq C \log(1/\delta)/\epsilon^2 samples." Rademacher complexity extends this to rich function classes. This lesson is where the probability theory of M06 connects directly to the sample complexity and generalization theory that drives architecture and training decisions.

Sub-Gaussian Random Variables

A random variable XX with mean 0 is sub-Gaussian with parameter σ\sigma if:

E[etX]eσ2t2/2tR.\mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2} \quad \forall t \in \mathbb{R}.

Equivalently (up to constants), the tail decays like a Gaussian: P(Xt)2exp(t2/(2σ2))P(|X| \geq t) \leq 2\exp(-t^2/(2\sigma^2)).

Examples:

  • N(0,σ2)\mathcal{N}(0, \sigma^2): sub-Gaussian with parameter σ\sigma (tight)
  • Bounded X[a,b]X \in [a,b]: sub-Gaussian with parameter (ba)/2(b-a)/2
  • Rademacher (uniform on {1,+1}\{-1,+1\}): sub-Gaussian with parameter 1
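
The tail characterization can be checked numerically. The sketch below (seed and sample counts are arbitrary) compares the empirical tails of a standardized sum of Rademacher variables — sub-Gaussian with parameter 1 — against the bound 2exp(-t^2/2):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200_000

# Standardized sum of n Rademacher variables: each sigma_i is sub-Gaussian
# with parameter 1, so S_n / sqrt(n) is again sub-Gaussian with parameter 1.
n = 50
signs = rng.choice([-1.0, 1.0], size=(n_samples, n))
z = signs.sum(axis=1) / np.sqrt(n)

for t in [1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(z) >= t)
    bound = 2 * np.exp(-t**2 / 2)   # P(|X| >= t) <= 2 exp(-t^2 / (2 sigma^2)), sigma = 1
    assert empirical <= bound
    print(f"t={t}: empirical tail {empirical:.4f} <= bound {bound:.4f}")
```

As expected, the bound holds at every threshold; it is loose for small t and increasingly accurate in the tail.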

Characterizations:

  1. E[etX]eσ2t2/2\mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2} (MGF condition)
  2. Xψ2<\|X\|_{\psi_2} < \infty where Xψ2=inf{s>0:E[eX2/s2]2}\|X\|_{\psi_2} = \inf\{s > 0 : \mathbb{E}[e^{X^2/s^2}] \leq 2\} (Orlicz norm)
  3. P(Xt)2et2/(2σ2)P(|X| \geq t) \leq 2e^{-t^2/(2\sigma^2)} (tail condition)
  4. E[X2k](2k1)!!σ2k\mathbb{E}[X^{2k}] \leq (2k-1)!!\sigma^{2k} (moment condition)

These are all equivalent (up to constants). Sub-Gaussianity is the natural class for concentration inequalities. The MGF condition is the right definition rather than the tail condition because the MGF encodes all moments and is stable under sums (the MGF of a sum of independent sub-Gaussian variables is the product of their MGFs — which gives Hoeffding's inequality directly by Markov's inequality applied to etSne^{tS_n}, where Sn=i(Xiμi)S_n = \sum_i (X_i - \mu_i)). Defining sub-Gaussianity via MGFs makes the whole concentration theory fall out algebraically.

Sub-exponential random variables. XX is sub-exponential if E[etX]eν2t2/2\mathbb{E}[e^{tX}] \leq e^{\nu^2 t^2/2} only for t1/b|t| \leq 1/b (not all tt). Examples: χ2\chi^2, squared Gaussian. Sub-exponential tails decay like ecte^{-ct} rather than ect2e^{-ct^2}.
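
The heavier tail is visible in simulation. A sketch (arbitrary sample size and seed) compares the empirical tail of a squared standard Gaussian — chi-squared with one degree of freedom — against an e^{-t/2} envelope, whose exponent is linear rather than quadratic in t:

```python
import numpy as np

rng = np.random.default_rng(5)
z2 = rng.standard_normal(1_000_000) ** 2   # squared Gaussian = chi^2 with 1 dof

for t in [5.0, 10.0, 15.0]:
    empirical = np.mean(z2 >= t)
    envelope = np.exp(-t / 2)   # sub-exponential decay e^{-ct}, here c = 1/2
    print(f"t={t:4.1f}: P(Z^2 >= t) = {empirical:.2e} <= e^(-t/2) = {envelope:.2e}")
```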

Hoeffding's Inequality

Theorem. Let X1,,XnX_1, \ldots, X_n be independent with Xi[ai,bi]X_i \in [a_i, b_i] and E[Xi]=μi\mathbb{E}[X_i] = \mu_i. Then:

P ⁣(i=1n(Xiμi)t)exp ⁣(2t2i=1n(biai)2).P\!\left(\sum_{i=1}^n (X_i - \mu_i) \geq t\right) \leq \exp\!\left(\frac{-2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right).

Special case (iid Xi[0,1]X_i \in [0,1], mean μ\mu):

P(Xˉnμε)e2nε2,P(Xˉnμε)2e2nε2.P(\bar{X}_n - \mu \geq \varepsilon) \leq e^{-2n\varepsilon^2}, \qquad P(|\bar{X}_n - \mu| \geq \varepsilon) \leq 2e^{-2n\varepsilon^2}.

Proof sketch (Hoeffding's lemma). For X[a,b]X \in [a,b] with E[X]=0\mathbb{E}[X]=0: E[etX]et2(ba)2/8\mathbb{E}[e^{tX}] \leq e^{t^2(b-a)^2/8} (proved by convexity of etxe^{tx} and the constraint X[a,b]X \in [a,b]). Apply independently to each term and optimize over tt.

Comparison to Chebyshev. Chebyshev gives P(Xˉnμε)σ2/(nε2)P(|\bar{X}_n - \mu| \geq \varepsilon) \leq \sigma^2/(n\varepsilon^2) — a polynomial bound. Hoeffding gives 2e2nε22e^{-2n\varepsilon^2} — exponentially smaller once nε2n\varepsilon^2 is large. For Xi[0,1]X_i \in [0,1] with ε=0.1\varepsilon=0.1 and n=100n=100 the two are comparable (σ21/4\sigma^2 \leq 1/4 gives Chebyshev 0.25\leq 0.25, Hoeffding 2e20.27\leq 2e^{-2} \approx 0.27), but at n=1000n=1000 Chebyshev improves only to 0.025\leq 0.025 while Hoeffding drops to 2e204×1092e^{-20} \approx 4\times 10^{-9}.
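
A small simulation makes the comparison concrete (Bernoulli(0.5) samples, so variance 1/4; trial counts and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 0.1
for n in [100, 1000]:
    # 20,000 independent runs of averaging n Bernoulli(0.5) samples
    trials = rng.random((20_000, n)) < 0.5
    dev = np.abs(trials.mean(axis=1) - 0.5)
    empirical = np.mean(dev >= eps)
    chebyshev = min(1.0, 0.25 / (n * eps**2))           # sigma^2 / (n eps^2)
    hoeffding = min(1.0, 2 * np.exp(-2 * n * eps**2))   # 2 e^{-2 n eps^2}
    print(f"n={n}: empirical {empirical:.2e}, Chebyshev {chebyshev:.2e}, Hoeffding {hoeffding:.2e}")
```

At n=100 both bounds are loose and similar; at n=1000 Hoeffding is smaller by roughly seven orders of magnitude.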

Bernstein's Inequality

Theorem. Let XiX_i be independent, E[Xi]=0\mathbb{E}[X_i] = 0, XiM|X_i| \leq M, iE[Xi2]=ν\sum_i \mathbb{E}[X_i^2] = \nu. Then:

P ⁣(iXit)exp ⁣(t2/2ν+Mt/3).P\!\left(\sum_i X_i \geq t\right) \leq \exp\!\left(\frac{-t^2/2}{\nu + Mt/3}\right).

Two regimes:

  • Gaussian regime (tν/Mt \ll \nu/M): bound et2/(2ν)\approx e^{-t^2/(2\nu)} (variance-controlled)
  • Poisson regime (tν/Mt \gg \nu/M): bound e3t/(2M)\approx e^{-3t/(2M)} (boundedness-controlled)

Bernstein is tighter than Hoeffding when the variance ν\nu is much smaller than its worst-case value nM2nM^2 (the maximum of iE[Xi2]\sum_i \mathbb{E}[X_i^2] under XiM|X_i| \leq M) — i.e., when most XiX_i are well concentrated near 0 and only rarely large.
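
The gap is easy to see numerically. The sketch below plugs hypothetical "rare spike" variables (value MM with probability pp, else 0, then centered, so the variance is far below its worst case) into both bound formulas:

```python
import numpy as np

# Hypothetical rare-spike variables: X_i = M w.p. p, else 0, then centered.
# The centered range has width M; the variance is p(1-p)M^2 per term.
n, M, p = 10_000, 1.0, 0.01
nu = n * p * (1 - p) * M**2   # total variance: about 99, far below n M^2 = 10^4

for t in [5.0, 20.0, 50.0]:
    bernstein = np.exp(-(t**2 / 2) / (nu + M * t / 3))
    hoeffding = np.exp(-2 * t**2 / (n * M**2))   # sum_i (b_i - a_i)^2 = n M^2
    print(f"t={t:5.1f}: Bernstein {bernstein:.2e}  vs  Hoeffding {hoeffding:.2e}")
```

Because the variance term dominates the denominator, Bernstein is orders of magnitude tighter at every deviation level shown.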

McDiarmid's Inequality (Bounded Differences)

Theorem. Let X1,,XnX_1, \ldots, X_n be independent. If f:XnRf : \mathcal{X}^n \to \mathbb{R} satisfies:

supx1,,xn,xif(x1,,xi,,xn)f(x1,,xi,,xn)ci\sup_{x_1,\ldots,x_n,x_i'} |f(x_1,\ldots,x_i,\ldots,x_n) - f(x_1,\ldots,x_i',\ldots,x_n)| \leq c_i

for all ii (the bounded differences condition), then:

P(f(X1,,Xn)E[f]t)exp ⁣(2t2ici2).P(f(X_1,\ldots,X_n) - \mathbb{E}[f] \geq t) \leq \exp\!\left(\frac{-2t^2}{\sum_i c_i^2}\right).

Applications: The empirical risk L^(h)=1ni(h(xi),yi)\hat{L}(h) = \frac{1}{n}\sum_i \ell(h(x_i), y_i) satisfies bounded differences with ci=1/nc_i = 1/n (one example can change L^\hat{L} by at most 1/n1/n for [0,1]\ell \in [0,1]). McDiarmid gives P(L^(h)L(h)ε)2e2nε2P(|\hat{L}(h) - L(h)| \geq \varepsilon) \leq 2e^{-2n\varepsilon^2} — the concentration of empirical risk around true risk.
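
This concentration can be checked by simulation for a single fixed hypothesis (a hypothetical classifier with true error rate 0.3; sample sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
true_risk = 0.3   # hypothetical fixed classifier: 0-1 loss, P(error) = 0.3

# Empirical risk of the same hypothesis across many independent samples of size n
emp_risk = (rng.random((50_000, n)) < true_risk).mean(axis=1)

eps = 0.05
empirical_prob = np.mean(np.abs(emp_risk - true_risk) >= eps)
mcdiarmid = 2 * np.exp(-2 * n * eps**2)   # bounded differences with c_i = 1/n
print(f"P(|L_hat - L| >= {eps}): empirical {empirical_prob:.4f} <= bound {mcdiarmid:.4f}")
```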

PAC Learning and Sample Complexity

PAC (Probably Approximately Correct) learning formalized by Valiant (1984): a learner sees nn iid examples from unknown distribution DD and must find hHh \in \mathcal{H} with L(h)minhHL(h)+εL(h) \leq \min_{h'\in\mathcal{H}} L(h') + \varepsilon with probability 1δ\geq 1-\delta.

Finite hypothesis class. For H<|\mathcal{H}| < \infty and [0,1]\ell \in [0,1]:

By Hoeffding + union bound: P ⁣(maxhHL^(h)L(h)ε)2He2nε2P\!\left(\max_{h\in\mathcal{H}}|\hat{L}(h) - L(h)| \geq \varepsilon\right) \leq 2|\mathcal{H}|e^{-2n\varepsilon^2}.

Set RHS =δ= \delta: solve for nn:

n12ε2(logH+log2δ).n \geq \frac{1}{2\varepsilon^2}\left(\log|\mathcal{H}| + \log\frac{2}{\delta}\right).

The sample complexity grows logarithmically in H|\mathcal{H}| — the log of the number of hypotheses is the key measure of complexity for finite classes.
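
The bound inverts into a one-line calculator (the helper name is ours, not standard):

```python
import math

def pac_sample_complexity(num_hypotheses: int, eps: float, delta: float) -> int:
    """Smallest integer n with n >= (log|H| + log(2/delta)) / (2 eps^2)."""
    n = (math.log(num_hypotheses) + math.log(2 / delta)) / (2 * eps**2)
    return math.ceil(n)

# Halving eps quadruples n; squaring |H| only doubles the log|H| term.
print(pac_sample_complexity(10**6, eps=0.01, delta=0.05))   # → 87522
```

This matches the arithmetic of Example 1 below (which rounds the logarithms to get ≈ 87,500).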

VC dimension. For infinite H\mathcal{H}, the VC dimension dVC(H)d_{VC}(\mathcal{H}) measures the largest set of points the class can shatter. Generalization bound:

L(h^)minhL(h)O ⁣(dVClogn+log(1/δ)n)L(\hat{h}) - \min_h L(h) \leq O\!\left(\sqrt{\frac{d_{VC}\log n + \log(1/\delta)}{n}}\right)

with probability 1δ\geq 1-\delta. The VC dimension is d+1d+1 for halfspaces in Rd\mathbb{R}^d (dd for homogeneous halfspaces through the origin), and O(plogp)O(p\log p) for neural networks with pp parameters (a crude bound).

Rademacher complexity. For a function class F\mathcal{F}:

Rn(F)=Eσ ⁣[supfF1ni=1nσif(xi)],\mathcal{R}_n(\mathcal{F}) = \mathbb{E}_\sigma\!\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^n\sigma_i f(x_i)\right],

where σiUniform{1,+1}\sigma_i \sim \text{Uniform}\{-1,+1\} are Rademacher variables. The generalization bound:

L(h^)minhL(h)2Rn(H)+3log(2/δ)2n.L(\hat{h}) - \min_h L(h) \leq 2\mathcal{R}_n(\mathcal{H}) + 3\sqrt{\frac{\log(2/\delta)}{2n}}.

Rademacher complexity directly measures the "correlation with noise" that a class can achieve — the smaller Rn\mathcal{R}_n is, the better the generalization. For linear classifiers with w2B\|w\|_2 \leq B and x2R\|x\|_2 \leq R: RnBR/n\mathcal{R}_n \leq BR/\sqrt{n} — the generalization bound depends on the norm, not the dimension.
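
For this linear class the supremum has a closed form via Cauchy-Schwarz, so the Rademacher complexity can be estimated by Monte Carlo and compared to the bound. The data below are arbitrary points on a sphere (dimensions, counts, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, B, R = 200, 5, 1.0, 1.0

# Arbitrary data: n points on the radius-R sphere in R^d
X = rng.standard_normal((n, d))
X *= R / np.linalg.norm(X, axis=1, keepdims=True)

# Cauchy-Schwarz gives the supremum in closed form:
# sup_{||w|| <= B} (1/n) sum_i sigma_i w^T x_i = (B/n) || sum_i sigma_i x_i ||_2
sigmas = rng.choice([-1.0, 1.0], size=(5_000, n))
estimate = (B / n * np.linalg.norm(sigmas @ X, axis=1)).mean()

bound = B * R / np.sqrt(n)
print(f"Monte Carlo R_n ≈ {estimate:.4f} <= BR/sqrt(n) = {bound:.4f}")
```

The estimate sits just below the bound; the gap is the Jensen slack in the step from the expected norm to the root of the expected squared norm.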

Implications for Deep Learning

Neural network generalization. With pp parameters (potentially billions), VC-based bounds become vacuous (scaling as p/n\sqrt{p/n}). Yet modern overparameterized networks generalize well. Why?

  • Implicit regularization: gradient descent finds minimum-norm solutions, and norm-based Rademacher bounds (w/n\sim \|w\|/\sqrt{n}) can be much better than parameter-count-based bounds
  • PAC-Bayes bounds: LtestLtrain+(KL(QP)+log(n/δ))/(2n)\mathcal{L}_{\text{test}} \leq \mathcal{L}_{\text{train}} + \sqrt{(\text{KL}(Q\|P) + \log(n/\delta))/(2n)} for any prior PP and posterior QQ — for priors near the found solution, KL is small
  • Algorithmic stability: SGD-trained networks satisfy leave-one-out stability bounds, giving O(1/n)O(1/n) generalization gaps

Worked Example

Example 1: Hoeffding for Binary Classification

Hypothesis class H\mathcal{H} with H=220106|\mathcal{H}| = 2^{20} \approx 10^6 (e.g., 10610^6 rules). Training set n=10000n = 10000. Desired: ε=0.01\varepsilon = 0.01, δ=0.05\delta = 0.05.

Required: n12(0.01)2(log(106)+log(40))=10.0002(13.8+3.7)=500017.5=87500n \geq \frac{1}{2(0.01)^2}(\log(10^6) + \log(40)) = \frac{1}{0.0002}(13.8 + 3.7) = 5000 \cdot 17.5 = 87500.

With only 10,000 examples, the uniform bound does not guarantee ε=0.01\varepsilon = 0.01 error. With n=87500n = 87500: guaranteed. This is why generalization guarantees for decision trees (exponentially many possible trees) require very large datasets or alternative complexity measures.

Example 2: Rademacher Complexity for Linear Models

For linear classifiers fw(x)=wTxf_w(x) = w^Tx with w2B\|w\|_2 \leq B over x2R\|x\|_2 \leq R:

Rn(F)=Eσ ⁣[supwB1niσiwTxi]=BnE ⁣[iσixi2]BnE ⁣[iσixi22]=BnR.\mathcal{R}_n(\mathcal{F}) = \mathbb{E}_\sigma\!\left[\sup_{\|w\|\leq B}\frac{1}{n}\sum_i \sigma_i w^Tx_i\right] = \frac{B}{n}\mathbb{E}\!\left[\left\|\sum_i \sigma_i x_i\right\|_2\right] \leq \frac{B}{n}\sqrt{\mathbb{E}\!\left[\left\|\sum_i\sigma_i x_i\right\|_2^2\right]} = \frac{B}{\sqrt{n}}R.

The generalization bound: L(h^)L^(h^)+2BR/n+O(log(1/δ)/n)L(\hat{h}) \leq \hat{L}(\hat{h}) + 2BR/\sqrt{n} + O(\sqrt{\log(1/\delta)/n}).

Key insight: the bound depends on BB (weight norm) and RR (feature norm), not on the ambient dimension dd! A 1-million-dimensional linear classifier generalizes as well as a 10-dimensional one, given the same weight norm. This is why L2L_2 regularization (limiting w\|w\|) works.

Example 3: PAC-Bayes Bound

For a Gaussian prior P=N(0,σ2I)P = \mathcal{N}(0, \sigma^2 I) and posterior Q=N(w^,s2I)Q = \mathcal{N}(\hat w, s^2 I) (the trained model), the KL divergence:

KL(QP)=w^22σ2+d(s2σ21logs2σ2)/2.\text{KL}(Q\|P) = \frac{\|\hat w\|^2}{2\sigma^2} + d\left(\frac{s^2}{\sigma^2} - 1 - \log\frac{s^2}{\sigma^2}\right)/2.

With w^\hat w being the SGD output (small norm due to implicit regularization), σ\sigma chosen as the radius of likely good solutions, and dd the parameter count: this bound can be tight for models where the weight norm doesn't grow with dd (weight-tied models, or models with effective rank much smaller than dd).
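
Under these isotropic Gaussian assumptions both the KL term and the resulting bound are simple closed forms. The sketch below plugs in an illustrative small-norm weight vector (all numbers hypothetical):

```python
import numpy as np

def gaussian_kl(w_hat: np.ndarray, s2: float, sigma2: float) -> float:
    """KL( N(w_hat, s^2 I) || N(0, sigma^2 I) ) for isotropic d-dimensional Gaussians."""
    d = w_hat.size
    return (np.dot(w_hat, w_hat) / (2 * sigma2)
            + 0.5 * d * (s2 / sigma2 - 1 - np.log(s2 / sigma2)))

def pac_bayes_gap(kl: float, n: int, delta: float) -> float:
    """Complexity term sqrt((KL + log(n/delta)) / (2n)) from the bound above."""
    return np.sqrt((kl + np.log(n / delta)) / (2 * n))

rng = np.random.default_rng(4)
w_hat = rng.standard_normal(10_000) * 0.01   # hypothetical small-norm SGD solution
gap = pac_bayes_gap(gaussian_kl(w_hat, s2=1.0, sigma2=1.0), n=100_000, delta=0.05)
print(f"generalization gap bound ≈ {gap:.4f}")
```

With s2=σ2s^2 = \sigma^2 the dimension-dependent term vanishes, so the bound is driven by the weight norm — d = 10,000 parameters with tiny weights still give a non-vacuous gap.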

Connections

Where Your Intuition Breaks

Generalization bounds like the Rademacher complexity bound (Rn(F)BR/n\mathcal{R}_n(\mathcal{F}) \leq BR/\sqrt{n} for norm-bounded linear classes) look like they should explain deep learning generalization, but modern neural networks violate their assumptions. These bounds require the function class to be fixed before seeing the data. Neural networks are implicitly selected by early stopping, architecture search, and hyperparameter tuning — all based on validation data — making the effective hypothesis class much larger than any single architecture. The "interpolating classifiers" that achieve zero training loss and still generalize (the so-called benign overfitting regime) cannot be explained by classical Rademacher bounds, whose guarantees become vacuous in that regime. This is an active open problem: the probability theory of M06 explains classical generalization, but the generalization of overparameterized neural networks requires new mathematical frameworks (PAC-Bayes, algorithmic stability, double descent theory) that go beyond what these concentration inequalities can directly handle.

💡Intuition

Concentration inequalities convert statements about expectations into high-probability guarantees. Hoeffding says P(L^Lε)2e2nε2P(|\hat{L} - L| \geq \varepsilon) \leq 2e^{-2n\varepsilon^2} — equivalently, with probability 1δ\geq 1-\delta, the empirical risk is within log(2/δ)/(2n)\sqrt{\log(2/\delta)/(2n)} of the true risk. Inverting this gives the sample complexity, and the key feature is that the δ\delta-failure probability decays exponentially in nn, not just as 1/n1/n (Chebyshev). The exponential decay means high-confidence bounds (say δ=1010\delta = 10^{-10}) only require O(log(1/δ))O(\log(1/\delta)) extra samples — a tiny overhead. This is why PAC bounds are "probably approximately correct": the probability of failure can be made negligible with only logarithmic cost.

💡Intuition

The union bound is both powerful and loose. Applying Hoeffding to each hHh \in \mathcal{H} and union-bounding gives a bound that degrades logarithmically in H|\mathcal{H}| — excellent for finite classes but vacuous for infinite ones. The union bound is exact when events are disjoint (impossible for overlapping hypotheses) and loosest when all events are the same. VC dimension and Rademacher complexity replace the union bound with a more refined covering argument: instead of the H|\mathcal{H}| parameter, they use a combinatorial measure of the effective size of H\mathcal{H} on finite samples.

⚠️Warning

Standard generalization bounds are often vacuous for deep learning. VC-based bounds for neural networks with pp parameters give O(plogp/n)O(\sqrt{p\log p/n}) — for p=109p=10^9 and n=106n=10^6, this bound exceeds 1 (trivially true, uninformative). Modern networks generalize despite being overparameterized (more parameters than training examples), which violates the implicit assumption of VC theory that complexity grows with parameters. Explaining this requires going beyond uniform convergence: implicit regularization, flat-minima generalization, and PAC-Bayes bounds that track weight norms rather than parameter counts all provide partial but non-vacuous explanations.
