PAC Learning, VC Dimension & Rademacher Complexity
PAC learning theory asks when a learning algorithm can probably approximately correctly generalize — VC dimension and Rademacher complexity are the complexity measures that determine sample requirements. Together they form the theoretical foundation for understanding when machine learning works and why.
Concepts
Generalization bounds quantify L(h) − L̂(h) with probability ≥ 1−δ. Each bound decays at a different rate determined by the complexity measure.
All bounds decay as O(1/√n). The key difference: finite-class scales log|H|, VC scales d·log(n), Rademacher depends on weight norms — not parameter counts.
Every time a model trained on thousands of examples is deployed to millions of users, you are relying on generalization: that performance on the training set predicts performance on new data. PAC learning theory asks when this is mathematically justified — and the answer is elegant: a hypothesis class can generalize if and only if it has finite VC dimension. The Rademacher complexity then gives the tightest data-dependent bound on how large the generalization gap can be.
The PAC Learning Framework
The PAC (Probably Approximately Correct) model (Valiant, 1984) formalizes learnability.
Setup: an unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \{0,1\}$ generates iid examples $(x_1, y_1), \dots, (x_n, y_n)$. A hypothesis class $\mathcal{H}$ is a set of functions $h : \mathcal{X} \to \{0,1\}$.
True risk: $L(h) = \Pr_{(x,y) \sim \mathcal{D}}[h(x) \neq y]$.
Empirical risk: $\hat{L}(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[h(x_i) \neq y_i]$.
Realizability assumption: there exists $h^\star \in \mathcal{H}$ with $L(h^\star) = 0$.
ERM (empirical risk minimizer): $\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{L}(h)$.
PAC learnability: $\mathcal{H}$ is PAC learnable if there exists an algorithm $A$ and a sample complexity function $n(\epsilon, \delta)$ such that for any $\epsilon, \delta \in (0,1)$ and any distribution $\mathcal{D}$, given $n \geq n(\epsilon, \delta)$ examples, $A$ outputs $h$ with $L(h) \leq \epsilon$ with probability at least $1 - \delta$.
The formulation — probably approximately correct — is the minimal weakening of "always correct" that yields a useful learning guarantee. Asking for zero error on all distributions is impossible (the no-free-lunch theorem proves this). The PAC definition separates two relaxations cleanly: $\epsilon$ controls approximation quality, $\delta$ controls reliability. Both degrade gracefully with sample size $n$, and the relationship $n(\epsilon, \delta)$ is exactly the sample complexity function.
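The definition is concrete enough to simulate. Below is a minimal sketch (the threshold class, target concept, and all names are illustrative choices, not from the text): ERM over 1-D threshold classifiers on Uniform[0, 1] returns a low-error hypothesis in essentially every run, matching the "probably approximately correct" guarantee.

```python
import random

random.seed(0)

# Toy realizable setup: X ~ Uniform[0, 1], target h*(x) = 1[x >= 0.3],
# hypothesis class H = thresholds {x -> 1[x >= t]}.
TARGET = 0.3

def erm_threshold(sample):
    """Return a threshold consistent with the labeled sample
    (one always exists under realizability)."""
    candidates = sorted(x for x, _ in sample) + [0.0, 1.1]
    consistent = [t for t in candidates
                  if all((x >= t) == y for x, y in sample)]
    return max(consistent)

def true_error(t):
    # The disagreement region between 1[x >= t] and 1[x >= TARGET]
    # has probability |t - TARGET| under Uniform[0, 1].
    return abs(t - TARGET)

n, trials, eps = 200, 500, 0.05
failures = 0
for _ in range(trials):
    sample = [(x, x >= TARGET) for x in (random.random() for _ in range(n))]
    if true_error(erm_threshold(sample)) > eps:
        failures += 1

print(f"runs with error > {eps}: {failures}/{trials}")
```

With $n = 200$ examples, the chance that ERM misses by more than $\epsilon = 0.05$ is astronomically small here, so the empirical failure rate sits far below any reasonable $\delta$.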
Finite Hypothesis Classes
For finite $\mathcal{H}$ under realizability: ERM returns $\hat{h}$ with $\hat{L}(\hat{h}) = 0$. For any $h$ with $L(h) > \epsilon$:

$$\Pr[\hat{L}(h) = 0] \leq (1 - \epsilon)^n \leq e^{-\epsilon n}$$

Union bound over all "bad" hypotheses: $\Pr[\exists h \in \mathcal{H} : L(h) > \epsilon \text{ and } \hat{L}(h) = 0] \leq |\mathcal{H}|\, e^{-\epsilon n}$.

Setting this $\leq \delta$:

$$n \geq \frac{1}{\epsilon} \log \frac{|\mathcal{H}|}{\delta}$$

The sample complexity is logarithmic in $|\mathcal{H}|$ — the class of all boolean functions over $k$ variables has $2^{2^k}$ possible functions but only $\log_2 |\mathcal{H}| = 2^k$ bits of complexity.
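The bound is easy to evaluate numerically. A small helper (the function name is mine, for illustration) shows how mild the $\log|\mathcal{H}|$ dependence is, even for the doubly exponential class of all boolean functions on $k$ variables:

```python
import math

def finite_class_sample_complexity(H_size: int, eps: float, delta: float) -> int:
    """Realizable finite-class bound: n >= (1/eps) * ln(|H| / delta)."""
    return math.ceil(math.log(H_size / delta) / eps)

# All boolean functions on k = 4 variables: |H| = 2^(2^4) = 65536,
# yet the log keeps the sample complexity modest.
print(finite_class_sample_complexity(2 ** (2 ** 4), eps=0.1, delta=0.05))  # 141
```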
Agnostic PAC (no realizability): for finite $\mathcal{H}$, Hoeffding + union bound gives

$$L(\hat{h}) \leq \min_{h \in \mathcal{H}} L(h) + \sqrt{\frac{2 \log(2|\mathcal{H}|/\delta)}{n}}$$

Here the square-root term bounds the excess risk $L(\hat{h}) - \min_{h \in \mathcal{H}} L(h)$.
VC Dimension
For infinite $\mathcal{H}$, we need a combinatorial complexity measure.
Shattering: $\mathcal{H}$ shatters a set $\{x_1, \dots, x_d\}$ if for every labeling $(y_1, \dots, y_d) \in \{0,1\}^d$, there exists $h \in \mathcal{H}$ with $h(x_i) = y_i$ for all $i$.
VC dimension (Vapnik–Chervonenkis): $\mathrm{VCdim}(\mathcal{H}) = \max\{d : \text{some set of } d \text{ points is shattered by } \mathcal{H}\}$.
Key examples:
| Hypothesis class | VC dimension |
|---|---|
| Intervals on $\mathbb{R}$: $\{\mathbb{1}[a \leq x \leq b]\}$ | 2 |
| Halfspaces in $\mathbb{R}^d$: $\mathrm{sign}(w^\top x)$ | $d$ |
| Affine halfspaces (with bias): $\mathrm{sign}(w^\top x + b)$ | $d + 1$ |
| Axis-aligned rectangles in $\mathbb{R}^2$ | 4 |
| Sine classifiers $\{\mathrm{sign}(\sin(\omega x))\}$ | $\infty$ |
| Neural networks with $W$ weights and $L$ layers | $O(WL \log W)$ |
| Polynomials of degree $\leq k$ in $\mathbb{R}^d$ | $\binom{d+k}{k}$ |
Why VC dim $= d + 1$ for affine halfspaces: any $d + 1$ points in general position in $\mathbb{R}^d$ can be shattered (for each labeling, some hyperplane separates the positive points from the negative ones). No set of $d + 2$ points can be shattered — by Radon's theorem, any $d + 2$ points in $\mathbb{R}^d$ can be partitioned into two subsets whose convex hulls intersect, and no halfspace can put one subset on the positive side and the other on the negative side.
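Shattering is directly checkable by brute force for simple classes. The sketch below (my own helper, assuming the interval class from the first table row) verifies that intervals on $\mathbb{R}$ shatter 2 points but not 3, confirming $\mathrm{VCdim} = 2$; restricting interval endpoints to the data points loses no generality.

```python
from itertools import product

def shatters_intervals(points):
    """Check whether intervals [a, b] on R shatter the given points:
    every labeling must be realized by some interval (or the empty set)."""
    pts = sorted(points)
    # Candidate intervals: endpoints at data points, plus an empty interval.
    candidates = [(pts[i], pts[j]) for i in range(len(pts)) for j in range(i, len(pts))]
    candidates.append((1.0, 0.0))  # empty interval: labels everything 0
    for labeling in product([0, 1], repeat=len(pts)):
        if not any(all((1 if a <= x <= b else 0) == y for x, y in zip(pts, labeling))
                   for a, b in candidates):
            return False  # this labeling is unachievable
    return True

print(shatters_intervals([1.0, 2.0]))        # True: two points are shattered
print(shatters_intervals([1.0, 2.0, 3.0]))   # False: labeling (1, 0, 1) fails
```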
Sauer-Shelah Lemma
The growth function $\Pi_{\mathcal{H}}(n) = \max_{x_1, \dots, x_n} \left|\{(h(x_1), \dots, h(x_n)) : h \in \mathcal{H}\}\right|$ counts the number of distinct dichotomies $\mathcal{H}$ produces on $n$ points.

Sauer–Shelah lemma: if $\mathrm{VCdim}(\mathcal{H}) = d$, then

$$\Pi_{\mathcal{H}}(n) \leq \sum_{i=0}^{d} \binom{n}{i} \leq \left(\frac{en}{d}\right)^d \quad \text{for } n \geq d$$

The growth function is polynomial in $n$ (once $n > d$), not exponential — this is the key that makes generalization possible.
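A few lines make the polynomial-vs-exponential gap concrete (the helper name is mine):

```python
import math

def sauer_bound(n: int, d: int) -> int:
    """Sauer-Shelah upper bound on the growth function: sum_{i<=d} C(n, i)."""
    return sum(math.comb(n, i) for i in range(d + 1))

# With VC dimension d = 4 (e.g. axis-aligned rectangles), dichotomies
# grow like n^4 while the number of all labelings grows like 2^n.
for n in (4, 10, 20):
    print(n, sauer_bound(n, 4), 2 ** n)
```

At $n = 20$ the class can realize at most 6,196 of the 1,048,576 possible labelings; the fraction shrinks rapidly as $n$ grows.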
VC generalization bound: with probability $\geq 1 - \delta$, for all $h \in \mathcal{H}$:

$$L(h) \leq \hat{L}(h) + O\!\left(\sqrt{\frac{d \log(n/d) + \log(1/\delta)}{n}}\right)$$

Sample complexity for agnostic PAC learning: $n(\epsilon, \delta) = \Theta\!\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right)$.
Fundamental Theorem of Statistical Learning
Theorem (fundamental theorem): The following are equivalent:
- $\mathcal{H}$ is PAC learnable (agnostic)
- $\mathcal{H}$ has finite VC dimension
- $\mathcal{H}$ satisfies uniform convergence: for all $\epsilon, \delta$, with enough samples, $\sup_{h \in \mathcal{H}} |L(h) - \hat{L}(h)| \leq \epsilon$ holds with probability $\geq 1 - \delta$
- ERM is an agnostic PAC learner for $\mathcal{H}$
The equivalence of PAC learnability with finite VC dimension is the central theorem of computational learning theory.
Rademacher Complexity
Rademacher complexity provides tighter, data-dependent bounds:

$$\hat{\mathfrak{R}}_S(\mathcal{H}) = \mathbb{E}_\sigma\left[\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(x_i)\right]$$

where $\sigma_1, \dots, \sigma_n$ are iid uniform on $\{-1, +1\}$ (Rademacher variables), and $\mathfrak{R}_n(\mathcal{H}) = \mathbb{E}_S[\hat{\mathfrak{R}}_S(\mathcal{H})]$.

Generalization bound: with probability $\geq 1 - \delta$, for all $h \in \mathcal{H}$:

$$L(h) \leq \hat{L}(h) + 2 \mathfrak{R}_n(\mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{2n}}$$
Key computations:
For linear classifiers $\{x \mapsto \langle w, x \rangle : \|w\|_2 \leq B\}$ with $\|x_i\|_2 \leq R$:

$$\hat{\mathfrak{R}}_S \leq \frac{BR}{\sqrt{n}}$$

This bound depends only on $B$ (weight norm) and $R$ (feature norm) — not on the ambient dimension $d$. A million-dimensional linear classifier generalizes as well as a 10-dimensional one, for the same weight norm.
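The $BR/\sqrt{n}$ bound can be checked numerically. By Cauchy–Schwarz the supremum over the weight ball has a closed form, $\sup_{\|w\| \leq B} \frac{1}{n}\sum_i \sigma_i \langle w, x_i \rangle = \frac{B}{n}\|\sum_i \sigma_i x_i\|$, so a Monte Carlo sketch (sizes and seed are arbitrary choices) only needs to average that norm over random sign vectors:

```python
import math
import random

random.seed(0)
n, d, B, R = 200, 50, 1.0, 1.0

# n points on the radius-R sphere in d dimensions.
def random_point(dim):
    v = [random.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [R * c / norm for c in v]

X = [random_point(d) for _ in range(n)]

# Empirical Rademacher complexity of {x -> <w,x> : ||w||_2 <= B} equals
# E_sigma[ B * || sum_i sigma_i x_i ||_2 / n ]; estimate it by sampling.
trials = 500
total = 0.0
for _ in range(trials):
    s = [random.choice((-1, 1)) for _ in range(n)]
    z = [sum(s[i] * X[i][j] for i in range(n)) for j in range(d)]
    total += B * math.sqrt(sum(c * c for c in z)) / n
rad_hat = total / trials

print(f"estimate: {rad_hat:.4f}  vs  bound B*R/sqrt(n) = {B * R / math.sqrt(n):.4f}")
```

The estimate lands just below $BR/\sqrt{n} \approx 0.0707$, and the slack is the Jensen gap between $\mathbb{E}\|Z\|$ and $\sqrt{\mathbb{E}\|Z\|^2}$.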
No-Free-Lunch Theorem
Theorem: For any algorithm $A$ and any sample size $n \leq |\mathcal{X}|/2$, there exists a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0,1\}$ such that:
- There is a function $f$ with $L(f) = 0$.
- With probability $\geq 1/7$ over the sample, the algorithm returns $h$ with $L(h) \geq 1/8$.
Interpretation: no learning algorithm is universally good. Any algorithm that performs well on one distribution must perform poorly on some other. The matching lower bound $n(\epsilon, \delta) = \Omega\!\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right)$ shows the sample complexity bound is tight — VC dimension is the right complexity measure.
Worked Example
Example 1: VC Dimension of Rectangles
Axis-aligned rectangles in $\mathbb{R}^2$: $\mathrm{VCdim} = 4$.
Shatters 4 points: place points at $(\pm 1, 0)$ and $(0, \pm 1)$ (a diamond). All $2^4 = 16$ labelings can be achieved by choosing the rectangle to include or exclude each axis-extreme point.
Cannot shatter 5 points: among any 5 points in $\mathbb{R}^2$, at least one is neither the leftmost, rightmost, topmost, nor bottommost — it lies inside the bounding box of the other four. The labeling that includes those four extreme points but excludes the interior point cannot be achieved by an axis-aligned rectangle (any rectangle containing the four extreme points contains their bounding box). Therefore $\mathrm{VCdim} = 4$.
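The diamond construction can be verified exhaustively. Restricting rectangle corners to a small grid (an assumption that suffices here, since only positions relative to the four points matter) already realizes all 16 dichotomies:

```python
from itertools import product

points = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # the diamond configuration

def rect_labels(lx, hx, ly, hy, pts):
    """Labeling induced by the rectangle [lx, hx] x [ly, hy]."""
    return tuple(int(lx <= x <= hx and ly <= y <= hy) for x, y in pts)

grid = [-2, -0.5, 0.5, 2]
dichotomies = {rect_labels(lx, hx, ly, hy, points)
               for lx, hx, ly, hy in product(grid, repeat=4)}

print(len(dichotomies))  # 16: every labeling of the diamond is realized
```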
Sample complexity for $\epsilon = 0.1$, $\delta = 0.05$: with $d = 4$, $n = O\!\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right) \approx \frac{4 + 3}{0.01} = 700$ examples, ignoring constants. Axis-aligned rectangles are simple but have moderately high sample complexity.
Example 2: Rademacher Bound for SVMs
SVM with $\|w\|_2 \leq B$ and all training points with $\|x_i\|_2 \leq R$. By the Rademacher bound, the generalization gap is at most

$$\frac{2BR}{\sqrt{n}} + \sqrt{\frac{\log(1/\delta)}{2n}}$$

For hard-margin SVM (margin $\gamma$, so $B = 1/\gamma$): gap $\leq \dfrac{2R}{\gamma\sqrt{n}} + \sqrt{\dfrac{\log(1/\delta)}{2n}}$.

Key insight: the bound scales as $\dfrac{R/\gamma}{\sqrt{n}}$ — a larger margin gives a better generalization guarantee. This is why the SVM maximizes the margin: it directly minimizes the Rademacher complexity and the generalization bound.
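A small helper makes the $\gamma$-dependence tangible (the function name and exact constants are illustrative; textbook statements differ in constants):

```python
import math

def margin_bound_gap(R: float, gamma: float, n: int, delta: float = 0.05) -> float:
    """Generalization-gap bound ~ 2R/(gamma*sqrt(n)) + sqrt(ln(1/delta)/(2n))."""
    return 2 * R / (gamma * math.sqrt(n)) + math.sqrt(math.log(1 / delta) / (2 * n))

# Doubling the margin roughly halves the dominant term.
print(margin_bound_gap(R=1.0, gamma=0.1, n=10_000))
print(margin_bound_gap(R=1.0, gamma=0.2, n=10_000))
```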
Example 3: VC Dimension Lower Bound for Neural Networks
A neural network with $W$ weights and threshold activations has VC dimension at least $\Omega(W)$ (by the standard parameterization argument — on the order of $W$ points in general position can be shattered). The upper bound is $O(WL \log W)$ for networks with $L$ layers (Bartlett 1999).
For a network with $10^9$ parameters and $10^6$ training examples, the VC bound on the generalization gap comes out around 170 — a vacuous bound (it only guarantees test error ≤ 170, while the 0–1 error can never exceed 1). The classical theory fails to explain modern deep learning generalization.
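The arithmetic behind the vacuousness, using the crude gap $\sqrt{d_{\mathrm{VC}} \log n / n}$ with the parameter count standing in for the VC dimension (the exact figure depends on which variant of the bound is used):

```python
import math

d_vc, n = 10**9, 10**6  # parameter count as a VC-dimension proxy, training set size
gap = math.sqrt(d_vc * math.log(n) / n)
print(f"VC-style bound on the generalization gap: about {gap:.0f}")
```

Any such variant lands two orders of magnitude above the maximum possible 0–1 error of 1.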
Connections
Where Your Intuition Breaks
The fundamental theorem says finite VC dimension is necessary and sufficient for PAC learnability. The dangerous reading is that VC dimension determines how well a learnable class generalizes — it only determines whether it generalizes at all. The VC bound is a worst-case upper bound over all distributions; for specific distributions and algorithms, the actual generalization gap can be orders of magnitude smaller. This is why billion-parameter neural networks with vacuous VC bounds still generalize in practice: the VC bound applies the right qualitative tool (learnable vs. not) but is the wrong instrument for quantitative prediction. It is a theorem about learnability as a binary property, not a tight characterization of sample complexity for any particular model on real data.
VC dimension measures combinatorial complexity, not size. A hypothesis class with finitely many hypotheses always has finite VC dimension (at most $\log_2 |\mathcal{H}|$, possibly much smaller); a class with infinitely many hypotheses can still have small VC dimension (halfspaces), or infinite VC dimension (the one-parameter family of sine classifiers). The VC dimension counts the richness of the class — how many distinct dichotomies it can produce. Large VC dimension means the class is rich enough to memorize any labeling of that many points, which is exactly why rich classes need many examples to generalize.
Rademacher complexity is "correlating with noise." The Rademacher complexity asks: how well can the best hypothesis in $\mathcal{H}$ correlate with random labels? A class that can fit noise perfectly (high $\mathfrak{R}_n$) cannot generalize. A class that cannot fit noise at all ($\mathfrak{R}_n \approx 0$) has essentially no excess generalization gap. Empirically, $\hat{\mathfrak{R}}_S$ can be estimated by fitting the hypothesis class to random labelings of the training points — modern networks can achieve this almost perfectly, suggesting the Rademacher bound is loose for real data.
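The "fit random labels" recipe is easy to run for a toy class. This sketch (decision stumps on 1-D data; the setup and all names are my own illustration) estimates the empirical Rademacher complexity by averaging the best achievable correlation with random sign vectors:

```python
import random

random.seed(1)

def best_stump_correlation(xs, sigmas):
    """sup over stumps h(x) = s * sign(x - t), s in {-1, +1},
    of (1/n) * sum_i sigma_i * h(x_i)."""
    n = len(xs)
    thresholds = [min(xs) - 1.0] + sorted(xs)
    best = 0.0
    for t in thresholds:
        for s in (1, -1):
            corr = sum(sig * (s if x > t else -s)
                       for x, sig in zip(xs, sigmas)) / n
            best = max(best, corr)
    return best

n, trials = 100, 300
xs = [random.random() for _ in range(n)]
est = sum(best_stump_correlation(xs, [random.choice((-1, 1)) for _ in range(n)])
          for _ in range(trials)) / trials
print(f"estimated Rademacher complexity of stumps (n={n}): {est:.3f}")
```

Stumps fit noise only weakly, so the estimate is well below 1 and shrinks roughly like $\sqrt{\log n / n}$; a class that drove this estimate toward 1 (as overparameterized networks do on random labels) would make the bound vacuous.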
VC bounds are often vacuous for neural networks, yet they generalize. The VC bound for a billion-parameter network trained on a million examples gives a bound greater than 1 — meaningless. This is the central puzzle: classical uniform-convergence-based theory predicts that overparameterized networks should not generalize, yet they do. The resolution involves implicit regularization (GD finds min-norm solutions), data-dependent bounds (PAC-Bayes via margins), and double descent (covered in the bridge lesson). The fundamental theorem tells us that finite VC dimension is necessary and sufficient for learnability, but it does not tell us the tightest possible bound.