Bridge: Concentration Inequalities (Hoeffding, Bernstein, Sub-Gaussian) & Generalization Bounds
Concentration inequalities bound how far a random variable deviates from its mean, far more sharply than Chebyshev. The bridge from probability to learning theory runs through these bounds: Hoeffding's inequality gives the PAC learning sample complexity for finite hypothesis classes; Rademacher complexity extends this to infinite classes; and uniform convergence — the guarantee that empirical risk concentrates around true risk simultaneously over all hypotheses — is the foundation of generalization bounds for SVMs, random forests, and neural networks.
Concepts
"How many training examples do I need?" — this question, central to all of ML practice, is answered by concentration inequalities. The CLT tells you the distribution of sample averages but doesn't tell you the probability of a single run going badly wrong. Hoeffding's inequality and Chernoff bounds give explicit probabilities: "with probability at most , the empirical risk deviates from true risk by more than , and this requires samples." Rademacher complexity extends this to rich function classes. This lesson is where the probability theory of M06 connects directly to the sample complexity and generalization theory that drives architecture and training decisions.
Sub-Gaussian Random Variables
A random variable with mean 0 is sub-Gaussian with parameter if:
Equivalently, the tail decays like a Gaussian: .
Examples:
- : sub-Gaussian with parameter (tight)
- Bounded : sub-Gaussian with parameter
- Rademacher (uniform on ): sub-Gaussian with parameter 1
Characterizations:
- (MGF condition)
- where (Orlicz norm)
- (tail condition)
- (moment condition)
These are all equivalent (up to constants). Sub-Gaussianity is the natural class for concentration inequalities. The MGF condition is the right definition rather than the tail condition because the MGF encodes all moments and is stable under sums (the MGF of a sum of independent sub-Gaussian variables is the product of their MGFs — which gives Hoeffding's inequality directly by Markov's inequality applied to ). Defining sub-Gaussianity via MGFs makes the whole concentration theory fall out algebraically.
Sub-exponential random variables. is sub-exponential if only for (not all ). Examples: , squared Gaussian. Sub-exponential tails decay like rather than .
Hoeffding's Inequality
Theorem. Let be independent with and . Then:
Special case (iid , mean ):
Proof sketch (Hoeffding's lemma). For with : (proved by convexity of and the constraint ). Apply independently to each term and optimize over .
Comparison to Chebyshev. Chebyshev gives — a polynomial bound. Hoeffding gives — exponentially smaller for large . For , : Chebyshev gives (trivially ), Hoeffding gives .
Bernstein's Inequality
Theorem. Let be independent, , , . Then:
Two regimes:
- Gaussian regime (): bound (variance-controlled)
- Poisson regime (): bound (boundedness-controlled)
Bernstein is tighter than Hoeffding when the variance is much smaller than the maximum range — i.e., when most are well concentrated near 0 and only rarely large.
McDiarmid's Inequality (Bounded Differences)
Theorem. Let be independent. If satisfies:
for all (the bounded differences condition), then:
Applications: The empirical risk satisfies bounded differences with (one example can change by at most for ). McDiarmid gives — the concentration of empirical risk around true risk.
PAC Learning and Sample Complexity
PAC (Probably Approximately Correct) learning formalized by Valiant (1984): a learner sees iid examples from unknown distribution and must find with with probability .
Finite hypothesis class. For and :
By Hoeffding + union bound: .
Set RHS : solve for :
The sample complexity grows logarithmically in — the log of the number of hypotheses is the key measure of complexity for finite classes.
VC dimension. For infinite , the VC dimension measures the largest set of points the class can shatter. Generalization bound:
with probability . The VC dimension is for halfspaces in , for neural networks with parameters (crude bound).
Rademacher complexity. For a function class :
where are Rademacher variables. The generalization bound:
Rademacher complexity directly measures the "correlation with noise" that a class can achieve — the tighter is, the better generalization. For linear classifiers with and : — the generalization bound depends on the norm, not the dimension.
Implications for Deep Learning
Neural network generalization. With parameters (potentially billions), VC-based bounds give vacuous bounds (scaling as ). Yet modern overparameterized networks generalize well. Why?
- Implicit regularization: gradient descent finds minimum-norm solutions, and norm-based Rademacher bounds () can be much better than parameter-count-based bounds
- PAC-Bayes bounds: for any prior and posterior — for priors near the found solution, KL is small
- Algorithmic stability: SGD-trained networks satisfy leave-one-out stability bounds, giving generalization gaps
Worked Example
Example 1: Hoeffding for Binary Classification
Hypothesis class with (e.g., rules). Training set . Desired: , .
Required: .
With only 10,000 examples, the uniform bound does not guarantee error. With : guaranteed. This is why generalization guarantees for decision trees (exponentially many possible trees) require very large datasets or alternative complexity measures.
Example 2: Rademacher Complexity for Linear Models
For linear classifiers with over :
The generalization bound: .
Key insight: the bound depends on (weight norm) and (feature norm), not on the ambient dimension ! A 1-million-dimensional linear classifier generalizes as well as a 10-dimensional one, given the same weight norm. This is why regularization (limiting ) works.
Example 3: PAC-Bayes Bound
For a Gaussian prior and posterior (the trained model), the KL divergence:
With being the SGD output (small norm due to implicit regularization), chosen as the radius of likely good solutions, and the parameter count: this bound can be tight for models where the weight norm doesn't grow with (weight-tied models, or models with effective rank much smaller than ).
Connections
Where Your Intuition Breaks
Generalization bounds like the Rademacher complexity bound () look like they should explain deep learning generalization, but modern neural networks violate their assumptions. These bounds require the function class to be fixed before seeing the data. Neural networks are implicitly selected by early stopping, architecture search, and hyperparameter tuning — all based on validation data — making the effective hypothesis class much larger than any single architecture. The "interpolating classifiers" that achieve zero training loss and still generalize (the so-called benign overfitting regime) cannot be explained by classical Rademacher bounds, which would predict catastrophic generalization failure for zero-training-error models. This is an active open problem: the probability theory of M06 explains classical generalization, but the generalization of overparameterized neural networks requires new mathematical frameworks (PAC-Bayes, algorithmic stability, double descent theory) that go beyond what these concentration inequalities can directly handle.
Concentration inequalities convert high-probability bounds to expectations. Hoeffding says — with probability , the empirical risk is within of the true risk. This is the sample complexity: the -failure probability decays exponentially in , not just as (Chebyshev). The exponential decay means high-confidence bounds (say ) only require extra samples — a tiny constant overhead. This is why PAC bounds are "probably approximately correct": the probability of failure can be made negligible with only logarithmic cost.
The union bound is both powerful and loose. Applying Hoeffding to each and union-bounding gives a bound that degrades logarithmically in — excellent for finite classes but vacuous for infinite ones. The union bound is exact when events are disjoint (impossible for overlapping hypotheses) and loosest when all events are the same. VC dimension and Rademacher complexity replace the union bound with a more refined covering argument: instead of the parameter, they use a combinatorial measure of the effective size of on finite samples.
Standard generalization bounds are often vacuous for deep learning. VC-based bounds for neural networks with parameters give — for and , this bound exceeds 1 (trivially true, uninformative). Modern networks generalize despite being overparameterized (more parameters than training examples), which violates the implicit assumption of VC theory that complexity grows with parameters. Explaining this requires going beyond uniform convergence: implicit regularization, flat-minima generalization, and PAC-Bayes bounds that track weight norms rather than parameter counts all provide partial but non-vacuous explanations.
Enjoying these notes?
Get new lessons delivered to your inbox. No spam.