
Bridge: Concentration Inequalities (Hoeffding, Bernstein, Sub-Gaussian) & Generalization Bounds

Concentration inequalities bound how far a random variable deviates from its mean, far more sharply than Chebyshev. The bridge from probability to learning theory runs through these bounds: Hoeffding's inequality gives the PAC learning sample complexity for finite hypothesis classes; Rademacher complexity extends this to infinite classes; and uniform convergence — the guarantee that empirical risk concentrates around true risk simultaneously over all hypotheses — is the foundation of generalization bounds for SVMs, random forests, and neural networks.

Concepts

"How many training examples do I need?" — this question, central to all of ML practice, is answered by concentration inequalities. The CLT tells you the distribution of sample averages but doesn't tell you the probability of a single run going badly wrong. Hoeffding's inequality and Chernoff bounds give explicit probabilities: "with probability at most δ\delta, the empirical risk deviates from true risk by more than ϵ\epsilon, and this requires nClog(1/δ)/ϵ2n \geq C \log(1/\delta)/\epsilon^2 samples." Rademacher complexity extends this to rich function classes. This lesson is where the probability theory of M06 connects directly to the sample complexity and generalization theory that drives architecture and training decisions.

Sub-Gaussian Random Variables

A random variable XX with mean 0 is sub-Gaussian with parameter σ\sigma if:

E[etX]eσ2t2/2tR.\mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2} \quad \forall t \in \mathbb{R}.

Equivalently (up to constants), the tail decays like a Gaussian: P(Xt)2exp(t2/(2σ2))P(|X| \geq t) \leq 2\exp(-t^2/(2\sigma^2)).

Examples:

  • N(0,σ2)\mathcal{N}(0, \sigma^2): sub-Gaussian with parameter σ\sigma (tight)
  • Bounded X[a,b]X \in [a,b]: sub-Gaussian with parameter (ba)/2(b-a)/2
  • Rademacher (uniform on {1,+1}\{-1,+1\}): sub-Gaussian with parameter 1
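
The tail characterization can be checked numerically. The sketch below (seed and sample counts are arbitrary) compares the empirical tails of a standardized sum of Rademacher variables — sub-Gaussian with parameter 1 — against the bound 2exp(-t^2/2):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200_000

# Standardized sum of n Rademacher variables: each sigma_i is sub-Gaussian
# with parameter 1, so S_n / sqrt(n) is again sub-Gaussian with parameter 1.
n = 50
signs = rng.choice([-1.0, 1.0], size=(n_samples, n))
z = signs.sum(axis=1) / np.sqrt(n)

for t in [1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(z) >= t)
    bound = 2 * np.exp(-t**2 / 2)   # P(|X| >= t) <= 2 exp(-t^2 / (2 sigma^2)), sigma = 1
    assert empirical <= bound
    print(f"t={t}: empirical tail {empirical:.4f} <= bound {bound:.4f}")
```

As expected, the bound holds at every threshold; it is loose for small t and increasingly accurate in the tail.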

Characterizations:

  1. E[etX]eσ2t2/2\mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2} (MGF condition)
  2. Xψ2<\|X\|_{\psi_2} < \infty where Xψ2=inf{s>0:E[eX2/s2]2}\|X\|_{\psi_2} = \inf\{s > 0 : \mathbb{E}[e^{X^2/s^2}] \leq 2\} (Orlicz norm)
  3. P(Xt)2et2/(2σ2)P(|X| \geq t) \leq 2e^{-t^2/(2\sigma^2)} (tail condition)
  4. E[X2k](2k1)!!σ2k\mathbb{E}[X^{2k}] \leq (2k-1)!!\sigma^{2k} (moment condition)

These are all equivalent (up to constants). Sub-Gaussianity is the natural class for concentration inequalities. The MGF condition is the right definition rather than the tail condition because the MGF encodes all moments and is stable under sums (the MGF of a sum of independent sub-Gaussian variables is the product of their MGFs — which gives Hoeffding's inequality directly by Markov's inequality applied to etSne^{tS_n}, where Sn=i(Xiμi)S_n = \sum_i (X_i - \mu_i)). Defining sub-Gaussianity via MGFs makes the whole concentration theory fall out algebraically.

Sub-exponential random variables. XX is sub-exponential if E[etX]eν2t2/2\mathbb{E}[e^{tX}] \leq e^{\nu^2 t^2/2} only for t1/b|t| \leq 1/b (not all tt). Examples: χ2\chi^2, squared Gaussian. Sub-exponential tails decay like ecte^{-ct} rather than ect2e^{-ct^2}.
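
The heavier tail is visible in simulation. A sketch (arbitrary sample size and seed) compares the empirical tail of a squared standard Gaussian — chi-squared with one degree of freedom — against an e^{-t/2} envelope, whose exponent is linear rather than quadratic in t:

```python
import numpy as np

rng = np.random.default_rng(5)
z2 = rng.standard_normal(1_000_000) ** 2   # squared Gaussian = chi^2 with 1 dof

for t in [5.0, 10.0, 15.0]:
    empirical = np.mean(z2 >= t)
    envelope = np.exp(-t / 2)   # sub-exponential decay e^{-ct}, here c = 1/2
    print(f"t={t:4.1f}: P(Z^2 >= t) = {empirical:.2e} <= e^(-t/2) = {envelope:.2e}")
```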

Hoeffding's Inequality

Theorem. Let X1,,XnX_1, \ldots, X_n be independent with Xi[ai,bi]X_i \in [a_i, b_i] and E[Xi]=μi\mathbb{E}[X_i] = \mu_i. Then:

P ⁣(i=1n(Xiμi)t)exp ⁣(2t2i=1n(biai)2).P\!\left(\sum_{i=1}^n (X_i - \mu_i) \geq t\right) \leq \exp\!\left(\frac{-2t^2}{\sum_{i=1}^n (b_i-a_i)^2}\right).

Special case (iid Xi[0,1]X_i \in [0,1], mean μ\mu):

P(Xˉnμε)e2nε2,P(Xˉnμε)2e2nε2.P(\bar{X}_n - \mu \geq \varepsilon) \leq e^{-2n\varepsilon^2}, \qquad P(|\bar{X}_n - \mu| \geq \varepsilon) \leq 2e^{-2n\varepsilon^2}.

Proof sketch (Hoeffding's lemma). For X[a,b]X \in [a,b] with E[X]=0\mathbb{E}[X]=0: E[etX]et2(ba)2/8\mathbb{E}[e^{tX}] \leq e^{t^2(b-a)^2/8} (proved by convexity of etxe^{tx} and the constraint X[a,b]X \in [a,b]). Apply independently to each term and optimize over tt.

Comparison to Chebyshev. Chebyshev gives P(Xˉnμε)σ2/(nε2)P(|\bar{X}_n - \mu| \geq \varepsilon) \leq \sigma^2/(n\varepsilon^2) — a polynomial bound. Hoeffding gives 2e2nε22e^{-2n\varepsilon^2} — exponentially smaller once nε2n\varepsilon^2 is large. For Xi[0,1]X_i \in [0,1] with ε=0.1\varepsilon=0.1 and n=100n=100 the two are comparable (σ21/4\sigma^2 \leq 1/4 gives Chebyshev 0.25\leq 0.25, Hoeffding 2e20.27\leq 2e^{-2} \approx 0.27), but at n=1000n=1000 Chebyshev improves only to 0.025\leq 0.025 while Hoeffding drops to 2e204×1092e^{-20} \approx 4\times 10^{-9}.
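
A small simulation makes the comparison concrete (Bernoulli(0.5) samples, so variance 1/4; trial counts and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 0.1
for n in [100, 1000]:
    # 20,000 independent runs of averaging n Bernoulli(0.5) samples
    trials = rng.random((20_000, n)) < 0.5
    dev = np.abs(trials.mean(axis=1) - 0.5)
    empirical = np.mean(dev >= eps)
    chebyshev = min(1.0, 0.25 / (n * eps**2))           # sigma^2 / (n eps^2)
    hoeffding = min(1.0, 2 * np.exp(-2 * n * eps**2))   # 2 e^{-2 n eps^2}
    print(f"n={n}: empirical {empirical:.2e}, Chebyshev {chebyshev:.2e}, Hoeffding {hoeffding:.2e}")
```

At n=100 both bounds are loose and similar; at n=1000 Hoeffding is smaller by roughly seven orders of magnitude.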

Bernstein's Inequality

Theorem. Let XiX_i be independent, E[Xi]=0\mathbb{E}[X_i] = 0, XiM|X_i| \leq M, iE[Xi2]=ν\sum_i \mathbb{E}[X_i^2] = \nu. Then:

P ⁣(iXit)exp ⁣(t2/2ν+Mt/3).P\!\left(\sum_i X_i \geq t\right) \leq \exp\!\left(\frac{-t^2/2}{\nu + Mt/3}\right).

Two regimes:

  • Gaussian regime (tν/Mt \ll \nu/M): bound et2/(2ν)\approx e^{-t^2/(2\nu)} (variance-controlled)
  • Poisson regime (tν/Mt \gg \nu/M): bound e3t/(2M)\approx e^{-3t/(2M)} (boundedness-controlled)

Bernstein is tighter than Hoeffding when the variance ν\nu is much smaller than its worst-case value nM2nM^2 (the maximum of iE[Xi2]\sum_i \mathbb{E}[X_i^2] under XiM|X_i| \leq M) — i.e., when most XiX_i are well concentrated near 0 and only rarely large.
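
The gap is easy to see numerically. The sketch below plugs hypothetical "rare spike" variables (value MM with probability pp, else 0, then centered, so the variance is far below its worst case) into both bound formulas:

```python
import numpy as np

# Hypothetical rare-spike variables: X_i = M w.p. p, else 0, then centered.
# The centered range has width M; the variance is p(1-p)M^2 per term.
n, M, p = 10_000, 1.0, 0.01
nu = n * p * (1 - p) * M**2   # total variance: about 99, far below n M^2 = 10^4

for t in [5.0, 20.0, 50.0]:
    bernstein = np.exp(-(t**2 / 2) / (nu + M * t / 3))
    hoeffding = np.exp(-2 * t**2 / (n * M**2))   # sum_i (b_i - a_i)^2 = n M^2
    print(f"t={t:5.1f}: Bernstein {bernstein:.2e}  vs  Hoeffding {hoeffding:.2e}")
```

Because the variance term dominates the denominator, Bernstein is orders of magnitude tighter at every deviation level shown.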

McDiarmid's Inequality (Bounded Differences)

Theorem. Let X1,,XnX_1, \ldots, X_n be independent. If f:XnRf : \mathcal{X}^n \to \mathbb{R} satisfies:

supx1,,xn,xif(x1,,xi,,xn)f(x1,,xi,,xn)ci\sup_{x_1,\ldots,x_n,x_i'} |f(x_1,\ldots,x_i,\ldots,x_n) - f(x_1,\ldots,x_i',\ldots,x_n)| \leq c_i

for all ii (the bounded differences condition), then:

P(f(X1,,Xn)E[f]t)exp ⁣(2t2ici2).P(f(X_1,\ldots,X_n) - \mathbb{E}[f] \geq t) \leq \exp\!\left(\frac{-2t^2}{\sum_i c_i^2}\right).

Applications: The empirical risk L^(h)=1ni(h(xi),yi)\hat{L}(h) = \frac{1}{n}\sum_i \ell(h(x_i), y_i) satisfies bounded differences with ci=1/nc_i = 1/n (one example can change L^\hat{L} by at most 1/n1/n for [0,1]\ell \in [0,1]). McDiarmid gives P(L^(h)L(h)ε)2e2nε2P(|\hat{L}(h) - L(h)| \geq \varepsilon) \leq 2e^{-2n\varepsilon^2} — the concentration of empirical risk around true risk.
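
This concentration can be checked by simulation for a single fixed hypothesis (a hypothetical classifier with true error rate 0.3; sample sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
true_risk = 0.3   # hypothetical fixed classifier: 0-1 loss, P(error) = 0.3

# Empirical risk of the same hypothesis across many independent samples of size n
emp_risk = (rng.random((50_000, n)) < true_risk).mean(axis=1)

eps = 0.05
empirical_prob = np.mean(np.abs(emp_risk - true_risk) >= eps)
mcdiarmid = 2 * np.exp(-2 * n * eps**2)   # bounded differences with c_i = 1/n
print(f"P(|L_hat - L| >= {eps}): empirical {empirical_prob:.4f} <= bound {mcdiarmid:.4f}")
```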

PAC Learning and Sample Complexity

PAC (Probably Approximately Correct) learning formalized by Valiant (1984): a learner sees nn iid examples from unknown distribution DD and must find hHh \in \mathcal{H} with L(h)minhHL(h)+εL(h) \leq \min_{h'\in\mathcal{H}} L(h') + \varepsilon with probability 1δ\geq 1-\delta.

Finite hypothesis class. For H<|\mathcal{H}| < \infty and [0,1]\ell \in [0,1]:

By Hoeffding + union bound: P ⁣(maxhHL^(h)L(h)ε)2He2nε2P\!\left(\max_{h\in\mathcal{H}}|\hat{L}(h) - L(h)| \geq \varepsilon\right) \leq 2|\mathcal{H}|e^{-2n\varepsilon^2}.

Set RHS =δ= \delta: solve for nn:

n12ε2(logH+log2δ).n \geq \frac{1}{2\varepsilon^2}\left(\log|\mathcal{H}| + \log\frac{2}{\delta}\right).

The sample complexity grows logarithmically in H|\mathcal{H}| — the log of the number of hypotheses is the key measure of complexity for finite classes.
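
The bound inverts into a one-line calculator (the helper name is ours, not standard):

```python
import math

def pac_sample_complexity(num_hypotheses: int, eps: float, delta: float) -> int:
    """Smallest integer n with n >= (log|H| + log(2/delta)) / (2 eps^2)."""
    n = (math.log(num_hypotheses) + math.log(2 / delta)) / (2 * eps**2)
    return math.ceil(n)

# Halving eps quadruples n; squaring |H| only doubles the log|H| term.
print(pac_sample_complexity(10**6, eps=0.01, delta=0.05))   # → 87522
```

This matches the arithmetic of Example 1 below (which rounds the logarithms to get ≈ 87,500).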

VC dimension. For infinite H\mathcal{H}, the VC dimension dVC(H)d_{VC}(\mathcal{H}) measures the largest set of points the class can shatter. Generalization bound:

L(h^)minhL(h)O ⁣(dVClogn+log(1/δ)n)L(\hat{h}) - \min_h L(h) \leq O\!\left(\sqrt{\frac{d_{VC}\log n + \log(1/\delta)}{n}}\right)

with probability 1δ\geq 1-\delta. The VC dimension is d+1d+1 for halfspaces in Rd\mathbb{R}^d (dd for homogeneous halfspaces through the origin), and O(plogp)O(p\log p) for neural networks with pp parameters (a crude bound).

Rademacher complexity. For a function class F\mathcal{F}:

Rn(F)=Eσ ⁣[supfF1ni=1nσif(xi)],\mathcal{R}_n(\mathcal{F}) = \mathbb{E}_\sigma\!\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^n\sigma_i f(x_i)\right],

where σiUniform{1,+1}\sigma_i \sim \text{Uniform}\{-1,+1\} are Rademacher variables. The generalization bound:

L(h^)minhL(h)2Rn(H)+3log(2/δ)2n.L(\hat{h}) - \min_h L(h) \leq 2\mathcal{R}_n(\mathcal{H}) + 3\sqrt{\frac{\log(2/\delta)}{2n}}.

Rademacher complexity directly measures the "correlation with noise" that a class can achieve — the smaller Rn\mathcal{R}_n is, the better the generalization. For linear classifiers with w2B\|w\|_2 \leq B and x2R\|x\|_2 \leq R: RnBR/n\mathcal{R}_n \leq BR/\sqrt{n} — the generalization bound depends on the norm, not the dimension.
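
For this linear class the supremum has a closed form via Cauchy-Schwarz, so the Rademacher complexity can be estimated by Monte Carlo and compared to the bound. The data below are arbitrary points on a sphere (dimensions, counts, and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, B, R = 200, 5, 1.0, 1.0

# Arbitrary data: n points on the radius-R sphere in R^d
X = rng.standard_normal((n, d))
X *= R / np.linalg.norm(X, axis=1, keepdims=True)

# Cauchy-Schwarz gives the supremum in closed form:
# sup_{||w|| <= B} (1/n) sum_i sigma_i w^T x_i = (B/n) || sum_i sigma_i x_i ||_2
sigmas = rng.choice([-1.0, 1.0], size=(5_000, n))
estimate = (B / n * np.linalg.norm(sigmas @ X, axis=1)).mean()

bound = B * R / np.sqrt(n)
print(f"Monte Carlo R_n ≈ {estimate:.4f} <= BR/sqrt(n) = {bound:.4f}")
```

The estimate sits just below the bound; the gap is the Jensen slack in the step from the expected norm to the root of the expected squared norm.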

Implications for Deep Learning

Neural network generalization. With pp parameters (potentially billions), VC-based bounds become vacuous (scaling as p/n\sqrt{p/n}). Yet modern overparameterized networks generalize well. Why?

  • Implicit regularization: gradient descent finds minimum-norm solutions, and norm-based Rademacher bounds (w/n\sim \|w\|/\sqrt{n}) can be much better than parameter-count-based bounds
  • PAC-Bayes bounds: LtestLtrain+(KL(QP)+log(n/δ))/(2n)\mathcal{L}_{\text{test}} \leq \mathcal{L}_{\text{train}} + \sqrt{(\text{KL}(Q\|P) + \log(n/\delta))/(2n)} for any prior PP and posterior QQ — for priors near the found solution, KL is small
  • Algorithmic stability: SGD-trained networks satisfy leave-one-out stability bounds, giving O(1/n)O(1/n) generalization gaps

Worked Example

Example 1: Hoeffding for Binary Classification

Hypothesis class H\mathcal{H} with H=220106|\mathcal{H}| = 2^{20} \approx 10^6 (e.g., 10610^6 rules). Training set n=10000n = 10000. Desired: ε=0.01\varepsilon = 0.01, δ=0.05\delta = 0.05.

Required: n12(0.01)2(log(106)+log(40))=10.0002(13.8+3.7)=500017.5=87500n \geq \frac{1}{2(0.01)^2}(\log(10^6) + \log(40)) = \frac{1}{0.0002}(13.8 + 3.7) = 5000 \cdot 17.5 = 87500.

With only 10,000 examples, the uniform bound does not guarantee ε=0.01\varepsilon = 0.01 error. With n=87500n = 87500: guaranteed. This is why generalization guarantees for decision trees (exponentially many possible trees) require very large datasets or alternative complexity measures.

Example 2: Rademacher Complexity for Linear Models

For linear classifiers fw(x)=wTxf_w(x) = w^Tx with w2B\|w\|_2 \leq B over x2R\|x\|_2 \leq R:

Rn(F)=Eσ ⁣[supwB1niσiwTxi]=BnE ⁣[iσixi2]BnE ⁣[iσixi22]=BnR.\mathcal{R}_n(\mathcal{F}) = \mathbb{E}_\sigma\!\left[\sup_{\|w\|\leq B}\frac{1}{n}\sum_i \sigma_i w^Tx_i\right] = \frac{B}{n}\mathbb{E}\!\left[\left\|\sum_i \sigma_i x_i\right\|_2\right] \leq \frac{B}{n}\sqrt{\mathbb{E}\!\left[\left\|\sum_i\sigma_i x_i\right\|_2^2\right]} = \frac{B}{\sqrt{n}}R.

The generalization bound: L(h^)L^(h^)+2BR/n+O(log(1/δ)/n)L(\hat{h}) \leq \hat{L}(\hat{h}) + 2BR/\sqrt{n} + O(\sqrt{\log(1/\delta)/n}).

Key insight: the bound depends on BB (weight norm) and RR (feature norm), not on the ambient dimension dd! A 1-million-dimensional linear classifier generalizes as well as a 10-dimensional one, given the same weight norm. This is why L2L_2 regularization (limiting w\|w\|) works.

Example 3: PAC-Bayes Bound

For a Gaussian prior P=N(0,σ2I)P = \mathcal{N}(0, \sigma^2 I) and posterior Q=N(w^,s2I)Q = \mathcal{N}(\hat w, s^2 I) (the trained model), the KL divergence:

KL(QP)=w^22σ2+d(s2σ21logs2σ2)/2.\text{KL}(Q\|P) = \frac{\|\hat w\|^2}{2\sigma^2} + d\left(\frac{s^2}{\sigma^2} - 1 - \log\frac{s^2}{\sigma^2}\right)/2.

With w^\hat w being the SGD output (small norm due to implicit regularization), σ\sigma chosen as the radius of likely good solutions, and dd the parameter count: this bound can be tight for models where the weight norm doesn't grow with dd (weight-tied models, or models with effective rank much smaller than dd).
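
Under these isotropic Gaussian assumptions both the KL term and the resulting bound are simple closed forms. The sketch below plugs in an illustrative small-norm weight vector (all numbers hypothetical):

```python
import numpy as np

def gaussian_kl(w_hat: np.ndarray, s2: float, sigma2: float) -> float:
    """KL( N(w_hat, s^2 I) || N(0, sigma^2 I) ) for isotropic d-dimensional Gaussians."""
    d = w_hat.size
    return (np.dot(w_hat, w_hat) / (2 * sigma2)
            + 0.5 * d * (s2 / sigma2 - 1 - np.log(s2 / sigma2)))

def pac_bayes_gap(kl: float, n: int, delta: float) -> float:
    """Complexity term sqrt((KL + log(n/delta)) / (2n)) from the bound above."""
    return np.sqrt((kl + np.log(n / delta)) / (2 * n))

rng = np.random.default_rng(4)
w_hat = rng.standard_normal(10_000) * 0.01   # hypothetical small-norm SGD solution
gap = pac_bayes_gap(gaussian_kl(w_hat, s2=1.0, sigma2=1.0), n=100_000, delta=0.05)
print(f"generalization gap bound ≈ {gap:.4f}")
```

With s2=σ2s^2 = \sigma^2 the dimension-dependent term vanishes, so the bound is driven by the weight norm — d = 10,000 parameters with tiny weights still give a non-vacuous gap.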

Connections

Where Your Intuition Breaks

Generalization bounds like the Rademacher complexity bound (Rn(F)BR/n\mathcal{R}_n(\mathcal{F}) \leq BR/\sqrt{n} for norm-bounded linear classes) look like they should explain deep learning generalization, but modern neural networks violate their assumptions. These bounds require the function class to be fixed before seeing the data. Neural networks are implicitly selected by early stopping, architecture search, and hyperparameter tuning — all based on validation data — making the effective hypothesis class much larger than any single architecture. The "interpolating classifiers" that achieve zero training loss and still generalize (the so-called benign overfitting regime) cannot be explained by classical Rademacher bounds, whose guarantees become vacuous in that regime. This is an active open problem: the probability theory of M06 explains classical generalization, but the generalization of overparameterized neural networks requires new mathematical frameworks (PAC-Bayes, algorithmic stability, double descent theory) that go beyond what these concentration inequalities can directly handle.

💡Intuition

Concentration inequalities convert statements about expectations into high-probability guarantees. Hoeffding says P(L^Lε)2e2nε2P(|\hat{L} - L| \geq \varepsilon) \leq 2e^{-2n\varepsilon^2} — equivalently, with probability 1δ\geq 1-\delta, the empirical risk is within log(2/δ)/(2n)\sqrt{\log(2/\delta)/(2n)} of the true risk. Inverting this gives the sample complexity, and the key feature is that the δ\delta-failure probability decays exponentially in nn, not just as 1/n1/n (Chebyshev). The exponential decay means high-confidence bounds (say δ=1010\delta = 10^{-10}) only require O(log(1/δ))O(\log(1/\delta)) extra samples — a tiny overhead. This is why PAC bounds are "probably approximately correct": the probability of failure can be made negligible with only logarithmic cost.

💡Intuition

The union bound is both powerful and loose. Applying Hoeffding to each hHh \in \mathcal{H} and union-bounding gives a bound that degrades logarithmically in H|\mathcal{H}| — excellent for finite classes but vacuous for infinite ones. The union bound is exact when events are disjoint (impossible for overlapping hypotheses) and loosest when all events are the same. VC dimension and Rademacher complexity replace the union bound with a more refined covering argument: instead of the H|\mathcal{H}| parameter, they use a combinatorial measure of the effective size of H\mathcal{H} on finite samples.

⚠️Warning

Standard generalization bounds are often vacuous for deep learning. VC-based bounds for neural networks with pp parameters give O(plogp/n)O(\sqrt{p\log p/n}) — for p=109p=10^9 and n=106n=10^6, this bound exceeds 1 (trivially true, uninformative). Modern networks generalize despite being overparameterized (more parameters than training examples), which violates the implicit assumption of VC theory that complexity grows with parameters. Explaining this requires going beyond uniform convergence: implicit regularization, flat-minima generalization, and PAC-Bayes bounds that track weight norms rather than parameter counts all provide partial but non-vacuous explanations.
