
PAC Learning, VC Dimension & Rademacher Complexity

PAC learning theory asks when a learning algorithm can probably approximately correctly generalize — VC dimension and Rademacher complexity are the complexity measures that determine sample requirements. Together they form the theoretical foundation for understanding when machine learning works and why.

Concepts

Generalization bounds quantify the gap L(h) - \hat L(h) with probability \geq 1-\delta. Each bound decays at a different rate determined by its complexity measure.

[Figure: bound value vs. sample size n (log scale, 100 to 50k). Curves: finite class \sqrt{(\log|\mathcal{H}| + \log(2/\delta))/(2n)}; VC \sqrt{d \log(n)/n}; Rademacher (B = R = 1). At n = 1000: finite class ≈ 0.094, VC ≈ 0.260, Rademacher ≈ 0.070.]

All three bounds decay as O(1/\sqrt{n}). The key difference is what multiplies that rate: the finite-class bound scales with \log|\mathcal{H}|, the VC bound with d \cdot \log(n), and the Rademacher bound with weight norms — not parameter counts.
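The three bound formulas compared above are one-liners to evaluate. A minimal sketch; the choices |\mathcal{H}| = 10^6, d = 10, and \delta = 0.05 are illustrative assumptions, and the plotted curves may use slightly different constants:

```python
import math

def finite_class_bound(n, H_size, delta):
    """Two-sided Hoeffding + union bound for a finite hypothesis class."""
    return math.sqrt((math.log(H_size) + math.log(2 / delta)) / (2 * n))

def vc_bound(n, d):
    """Simplified VC bound sqrt(d * log(n) / n), constants dropped."""
    return math.sqrt(d * math.log(n) / n)

def rademacher_linear_bound(n, B, R):
    """Norm-based bound B * R / sqrt(n) for linear predictors."""
    return B * R / math.sqrt(n)

n = 1_000
print(finite_class_bound(n, H_size=10**6, delta=0.05))  # ~0.09
print(vc_bound(n, d=10))                                # ~0.26
print(rademacher_linear_bound(n, B=1, R=1))
```

Doubling n shrinks each bound by roughly \sqrt{2}, which is the O(1/\sqrt{n}) decay visible in the figure.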

Every time a model trained on thousands of examples is deployed to millions of users, you are relying on generalization: that performance on the training set predicts performance on new data. PAC learning theory asks when this is mathematically justified — and the answer is elegant: a hypothesis class can generalize if and only if it has finite VC dimension. The Rademacher complexity then gives the tightest data-dependent bound on how large the generalization gap can be.

The PAC Learning Framework

The PAC (Probably Approximately Correct) model (Valiant, 1984) formalizes learnability.

Setup: an unknown distribution \mathcal{D} over \mathcal{X} \times \{0,1\} generates iid examples. A hypothesis class \mathcal{H} is a set of functions h: \mathcal{X} \to \{0,1\}.

True risk: L(h) = P_{(x,y)\sim\mathcal{D}}[h(x) \neq y].

Empirical risk: \hat L(h) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[h(x_i) \neq y_i].

Realizability assumption: \exists h^* \in \mathcal{H} with L(h^*) = 0.

ERM (empirical risk minimizer): \hat h = \arg\min_{h \in \mathcal{H}} \hat L(h).

PAC learnability: \mathcal{H} is PAC learnable if there exists an algorithm A such that for any \varepsilon, \delta > 0, given n \geq n_0(\varepsilon, \delta) examples, A outputs \hat h with P[L(\hat h) \leq \varepsilon] \geq 1-\delta.

The (\varepsilon, \delta) formulation — probably approximately correct — is the minimal weakening of "always correct" that yields a useful learning guarantee. Asking for zero error on all distributions is impossible (the no-free-lunch theorem proves this). The PAC definition separates the two relaxations cleanly: \varepsilon controls approximation quality, \delta controls reliability. Both improve with sample size n, and the relationship n \geq n_0(\varepsilon, \delta) is exactly the sample complexity function.
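The PAC guarantee can be checked by simulation. A minimal sketch, assuming NumPy: the class is intervals on the line (used again in later examples), and the target concept \mathbf{1}[0.3 \leq x \leq 0.7] with \mathcal{D} = Uniform[0,1] is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(2)

def erm_interval(x, y):
    """ERM for the interval class under realizability:
    the tightest [a, b] containing all positive points."""
    pos = x[y == 1]
    return (pos.min(), pos.max()) if len(pos) else (0.0, 0.0)

def true_risk(a, b):
    # The learned interval sits inside [0.3, 0.7], so under Uniform[0, 1]
    # the error is the uncovered sliver (a - 0.3) + (0.7 - b).
    return max(a - 0.3, 0.0) + max(0.7 - b, 0.0)

n, trials, eps = 500, 200, 0.05
fails = 0
for _ in range(trials):
    x = rng.random(n)
    y = ((x >= 0.3) & (x <= 0.7)).astype(int)
    a, b = erm_interval(x, y)
    fails += true_risk(a, b) > eps
print(fails / trials)   # fraction of runs with L(h_hat) > eps; should be tiny
```

With n = 500 the failure probability is astronomically small, consistent with the exponential decay e^{-\varepsilon n} derived in the next section.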

Finite Hypothesis Classes

For |\mathcal{H}| < \infty under realizability, ERM returns \hat h with \hat L(\hat h) = 0. For any h with L(h) > \varepsilon:

P[\hat L(h) = 0] = (1 - L(h))^n \leq (1 - \varepsilon)^n \leq e^{-\varepsilon n}.

Union bound over all "bad" hypotheses: P[\hat L(\hat h) = 0 \text{ yet } L(\hat h) > \varepsilon] \leq |\mathcal{H}| e^{-\varepsilon n}.

Setting this \leq \delta:

n \geq \frac{1}{\varepsilon}\left(\log|\mathcal{H}| + \log\frac{1}{\delta}\right).

The sample complexity is logarithmic in |\mathcal{H}|: even the class of all boolean functions over d variables, with 2^{2^d} members, costs only \log|\mathcal{H}| = 2^d \log 2, and structured subclasses (e.g. conjunctions, with 3^d hypotheses) cost only O(d).

Agnostic PAC (no realizability): for |\mathcal{H}| < \infty, Hoeffding + union bound gives:

n \geq \frac{1}{2\varepsilon^2}\left(\log|\mathcal{H}| + \log\frac{2}{\delta}\right).

Here \varepsilon bounds the excess risk L(\hat h) - \min_{h \in \mathcal{H}} L(h).
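Both sample-complexity formulas are direct to compute. A minimal sketch; the class of boolean conjunctions over 20 variables (|\mathcal{H}| = 3^{20}) is an illustrative choice:

```python
import math

def n_realizable(eps, delta, H_size):
    """n >= (1/eps) * (ln|H| + ln(1/delta))  -- realizable PAC."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def n_agnostic(eps, delta, H_size):
    """n >= (1/(2 eps^2)) * (ln|H| + ln(2/delta))  -- agnostic PAC."""
    return math.ceil((math.log(H_size) + math.log(2 / delta)) / (2 * eps**2))

H = 3**20   # conjunctions over 20 boolean variables (each literal: +, -, absent)
print(n_realizable(0.01, 0.05, H))   # ~2,500 examples
print(n_agnostic(0.01, 0.05, H))     # ~128,000 examples
```

Note the 1/\varepsilon vs. 1/\varepsilon^2 gap: dropping realizability costs a factor of roughly 1/(2\varepsilon) in samples at the same precision.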

VC Dimension

For infinite \mathcal{H}, we need a combinatorial complexity measure.

Shattering: \mathcal{H} shatters a set C = \{x_1,\ldots,x_m\} if for every labeling y \in \{0,1\}^m, there exists h \in \mathcal{H} with h(x_i) = y_i for all i.

VC dimension (Vapnik–Chervonenkis): d_{\text{VC}}(\mathcal{H}) = \max\{m : \exists C \text{ s.t. } \mathcal{H} \text{ shatters } C\}.

Key examples:

| Hypothesis class | VC dimension |
| --- | --- |
| Intervals on \mathbb{R}: h_{a,b}(x) = \mathbf{1}[a \leq x \leq b] | 2 |
| Halfspaces in \mathbb{R}^d: h_w(x) = \mathbf{1}[w^T x \geq 0] | d |
| Affine halfspaces (with bias): \mathbf{1}[w^T x + b \geq 0] | d + 1 |
| Axis-aligned rectangles in \mathbb{R}^2 | 4 |
| Sine classifiers h_\omega(x) = \mathbf{1}[\sin(\omega x) \geq 0] | \infty |
| Neural networks with W weights and L layers | O(WL \log W) |
| Polynomials of degree \leq k in \mathbb{R}^d | \binom{d+k}{k} |

Why VC dim = d for halfspaces through the origin: any d linearly independent points in \mathbb{R}^d can be shattered (solve the linear system w^T x_i = \pm 1 for any sign pattern). No d+1 points can be shattered: any d+1 vectors in \mathbb{R}^d satisfy a nontrivial linear dependence \sum_i a_i x_i = 0, and no halfspace can label the points with a_i > 0 positive and those with a_i < 0 negative — applying w^T to both sides of the dependence yields a contradiction.
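Shattering can be verified by brute force for small classes. A minimal sketch for intervals on \mathbb{R} (VC dimension 2 per the table): enumerate all labelings and check whether some interval realizes each one.

```python
from itertools import product

def intervals_realizes(points, labels):
    """Can some interval [a, b] produce these labels on these points?"""
    inside = [x for x, y in zip(points, labels) if y == 1]
    if not inside:
        return True                      # an empty interval labels everything 0
    a, b = min(inside), max(inside)
    # the tightest covering interval must not capture any negative point
    return all(not (a <= x <= b) for x, y in zip(points, labels) if y == 0)

def shatters(points):
    return all(intervals_realizes(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(shatters([1.0, 2.0]))        # True: any 2 distinct points are shattered
print(shatters([1.0, 2.0, 3.0]))   # False: the labeling 1, 0, 1 is impossible
```

The failing labeling 1, 0, 1 is exactly the combinatorial obstruction: an interval cannot contain two points without containing everything between them.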

Sauer-Shelah Lemma

The growth function \Pi_\mathcal{H}(n) = \max_{C: |C|=n} |\{(h(x_1),\ldots,h(x_n)) : h \in \mathcal{H}\}| counts the number of distinct dichotomies \mathcal{H} produces on n points.

Sauer–Shelah lemma: if d_{\text{VC}}(\mathcal{H}) = d, then

\Pi_\mathcal{H}(n) \leq \sum_{i=0}^d \binom{n}{i} \leq \left(\frac{en}{d}\right)^d.

The growth function is polynomial in n (once n > d), not exponential — this is the key fact that makes generalization over an infinite class possible.
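The Sauer–Shelah bound is easy to evaluate. A minimal sketch contrasting the polynomial dichotomy count with the 2^n possible labelings; n = 100 and d = 4 are illustrative choices:

```python
import math

def sauer_bound(n, d):
    """Sauer-Shelah: max number of dichotomies on n points for VC dim d."""
    return sum(math.comb(n, i) for i in range(min(d, n) + 1))

n, d = 100, 4
print(sauer_bound(n, d))        # 4,087,976 dichotomies -- polynomial, ~n^d
print((math.e * n / d) ** d)    # the looser closed form (en/d)^d
print(2 ** n)                   # versus 2^100 conceivable labelings
```

About 4 million dichotomies out of 2^{100} \approx 10^{30}: the class realizes a vanishing fraction of labelings once n exceeds d, which is what the symmetrization step of the VC bound exploits.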

VC generalization bound: with probability \geq 1-\delta:

\sup_{h \in \mathcal{H}} |\hat L(h) - L(h)| \leq O\!\left(\sqrt{\frac{d_{\text{VC}} \log(n/d_{\text{VC}}) + \log(1/\delta)}{n}}\right).

Sample complexity for agnostic PAC learning: n = O\!\left(\frac{d_{\text{VC}} + \log(1/\delta)}{\varepsilon^2}\right).

Fundamental Theorem of Statistical Learning

Theorem (fundamental theorem of statistical learning): The following are equivalent:

  1. \mathcal{H} is agnostically PAC learnable
  2. \mathcal{H} has finite VC dimension
  3. \mathcal{H} satisfies uniform convergence (for all \varepsilon, \delta, a uniform bound over \mathcal{H} holds)
  4. ERM is a successful agnostic PAC learner for \mathcal{H}

The equivalence of PAC learnability with finite VC dimension is the central theorem of computational learning theory.

Rademacher Complexity

Rademacher complexity provides tighter, data-dependent bounds:

\hat{\mathcal{R}}_n(\mathcal{H}) = \mathbb{E}_\sigma\!\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \sigma_i h(x_i)\right],

where \sigma_i \stackrel{\text{iid}}{\sim} \text{Uniform}\{-1, +1\} (Rademacher variables).

Generalization bound: with probability \geq 1-\delta:

L(\hat h) - \min_{h \in \mathcal{H}} L(h) \leq 2\hat{\mathcal{R}}_n(\mathcal{H}) + 3\sqrt{\frac{\log(2/\delta)}{2n}}.

Key computations:

For linear classifiers h_w(x) = \text{sign}(w^T x) with \|w\|_2 \leq B and \|x_i\|_2 \leq R:

\hat{\mathcal{R}}_n(\mathcal{H}) = \frac{B}{n}\mathbb{E}\!\left[\left\|\sum_i \sigma_i x_i\right\|_2\right] \leq \frac{BR}{\sqrt{n}}.

This bound depends only on B (the weight norm) and R (the feature norm), not on the ambient dimension d. A million-dimensional linear classifier generalizes as well as a 10-dimensional one at the same weight norm.
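The norm-based bound can be sanity-checked by Monte Carlo. A minimal sketch assuming NumPy: it estimates (B/n)\,\mathbb{E}\|\sum_i \sigma_i x_i\|_2 on unit-norm points in two very different ambient dimensions, where d = 10 and d = 10{,}000 are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_rademacher(X, B, trials=2000):
    """Monte Carlo estimate of (B/n) E|| sum_i sigma_i x_i ||_2, the empirical
    Rademacher complexity of {x -> w.x : ||w||_2 <= B} on the sample X."""
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(trials, n))
    return B * np.linalg.norm(sigma @ X, axis=1).mean() / n

n, B, R = 400, 1.0, 1.0
for d in (10, 10_000):
    X = rng.standard_normal((n, d))
    X *= R / np.linalg.norm(X, axis=1, keepdims=True)  # points on the radius-R sphere
    print(d, linear_rademacher(X, B), B * R / np.sqrt(n))
```

Both estimates land near BR/\sqrt{n} = 0.05 regardless of dimension, illustrating that the complexity is controlled by norms rather than by d.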

No-Free-Lunch Theorem

Theorem: For any algorithm A and any n < |\mathcal{X}|/2, there exists a distribution \mathcal{D} such that:

  1. There is an h^* \in \{0,1\}^\mathcal{X} with L(h^*) = 0.
  2. With probability \geq 1/7 over the training sample, A returns h with L(h) \geq 1/8.

Interpretation: no learning algorithm is universally good. Any algorithm that performs well on one distribution must perform poorly on some other. The sample complexity lower bound n = \Omega(d_{\text{VC}}/\varepsilon^2) matches the upper bound, so VC dimension is the right complexity measure.

Worked Example

Example 1: VC Dimension of Rectangles

Axis-aligned rectangles in \mathbb{R}^2: h_{a,b,c,d}(x_1, x_2) = \mathbf{1}[a \leq x_1 \leq b,\; c \leq x_2 \leq d].

Shatters 4 points: place points at (\pm 1, 0) and (0, \pm 1). Each of the 2^4 = 16 labelings can be achieved by choosing the rectangle to include or exclude each axis-extreme point.

Cannot shatter 5 points: given any 5 points, pick four achieving the extreme coordinates (leftmost, rightmost, topmost, bottommost) and label them 1, labeling the remaining point 0. Any axis-aligned rectangle containing the four extreme points contains their bounding box, and hence the fifth point, so this labeling cannot be achieved. Therefore d_{\text{VC}} = 4.

Sample complexity for \varepsilon = 0.01, \delta = 0.05: n \approx \frac{4}{0.0001}\left(\log\frac{4}{0.0001} + \log 40\right) \approx 570{,}000. Axis-aligned rectangles are simple but have substantial agnostic sample complexity at this precision.
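The arithmetic, with constants dropped so only the order of magnitude is meaningful:

```python
import math

d, eps, delta = 4, 0.01, 0.05
# agnostic VC sample complexity with constants dropped (illustrative only):
# n ~ (d / eps^2) * (log(d / eps^2) + log(2 / delta))
n = (d / eps**2) * (math.log(d / eps**2) + math.log(2 / delta))
print(round(n))   # on the order of 6 * 10^5 samples
```

Halving \varepsilon would roughly quadruple this number, which is the 1/\varepsilon^2 price of the agnostic setting.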

Example 2: Rademacher Bound for SVMs

SVM with \|w\|_2 \leq B and all training points satisfying \|x_i\|_2 \leq R. By the Rademacher bound:

L(\hat h) \leq \hat L(\hat h) + \frac{2BR}{\sqrt{n}} + 3\sqrt{\frac{\log(2/\delta)}{2n}}.

For a hard-margin SVM (margin \gamma = 1/\|w\|, so B = 1/\gamma): L(\hat h) \leq 0 + \frac{2R}{\gamma\sqrt{n}} + O\left(\sqrt{\log(1/\delta)/n}\right).

Key insight: the bound scales as R/(\gamma\sqrt{n}) — a larger margin \gamma gives a better generalization guarantee. This is why the SVM maximizes the margin: it directly minimizes the Rademacher complexity and hence the generalization bound.

Example 3: VC Dimension Lower Bound for Neural Networks

A neural network with W weights and threshold activations has VC dimension at least \Omega(W) (by the standard parameterization argument: W points in general position can be shattered). The upper bound is O(WL \log W) for networks with L layers (Bartlett 1999).

For a network with 10^9 parameters and n = 10^6 training examples, the VC bound gives \sqrt{10^9 \log_2(10^9)/10^6} \approx \sqrt{30000} \approx 170 — a vacuous bound, since it asserts only that test error is at most 170 while 0–1 error never exceeds 1. The classical theory fails to explain modern deep learning generalization.
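The vacuous number is reproduced directly (base-2 log, matching the \sqrt{30000} figure):

```python
import math

W, n = 10**9, 10**6   # weights, training examples
bound = math.sqrt(W * math.log2(W) / n)
print(bound)          # ~173: "test error <= 173" says nothing,
                      # since 0-1 error never exceeds 1
```

The bound only becomes non-vacuous here once n exceeds W \log_2 W \approx 3 \times 10^{10} examples, far beyond any realistic dataset for this model size.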

Connections

Where Your Intuition Breaks

The fundamental theorem says finite VC dimension is necessary and sufficient for PAC learnability. The dangerous reading is that VC dimension determines how well a learnable class generalizes — it only determines whether it generalizes at all. The VC bound O(\sqrt{d_{\text{VC}} \log n / n}) is a worst-case upper bound over all distributions; for specific distributions and algorithms, the actual generalization gap can be orders of magnitude smaller. This is why billion-parameter neural networks with vacuous VC bounds still generalize in practice: the VC bound is the right qualitative tool (learnable vs. not) but the wrong instrument for quantitative prediction. It is a theorem about learnability as a binary property, not a tight characterization of sample complexity for any particular model on real data.

💡Intuition

VC dimension measures combinatorial complexity, not size. A hypothesis class with 2^{100} hypotheses has finite (possibly small) VC dimension; a class with infinitely many hypotheses (like the sine classifiers above) can have infinite VC dimension. The VC dimension counts the richness of the class — how many distinct dichotomies it can produce. Large VC dimension means the class is rich enough to memorize any labeling of many points, which correlates with needing many examples to generalize.

💡Intuition

Rademacher complexity is "correlating with noise." It asks: how well can the best hypothesis in \mathcal{H} correlate with random \pm 1 labels? A class that can fit noise perfectly (high \hat{\mathcal{R}}_n) cannot generalize. A class that cannot fit noise at all (\hat{\mathcal{R}}_n \approx 0) has essentially no generalization gap. Empirically, \hat{\mathcal{R}}_n can be estimated by fitting the hypothesis class to random label permutations — modern networks can achieve this almost perfectly, suggesting the Rademacher bound is loose for real data.
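This noise-fitting view can be made concrete for a tiny class. A minimal sketch assuming NumPy; one-sided threshold stumps \text{sign}(x - t) are an illustrative choice, and the sup over the class is taken by brute force over all candidate thresholds:

```python
import numpy as np

rng = np.random.default_rng(1)

def stump_rademacher(x, trials=500):
    """Monte Carlo empirical Rademacher complexity of the stump class
    {x -> sign(x - t)}: average best correlation with random +-1 labels."""
    n = len(x)
    thresholds = np.concatenate(([-np.inf], np.sort(x)))   # candidate cut points
    preds = np.sign(x[None, :] - thresholds[:, None] - 1e-12)  # one stump per row
    total = 0.0
    for _ in range(trials):
        sigma = rng.choice([-1.0, 1.0], size=n)
        total += np.max(preds @ sigma) / n   # best correlation with this noise draw
    return total / trials

for n in (50, 500):
    print(n, stump_rademacher(rng.standard_normal(n)))
    # complexity shrinks roughly like 1/sqrt(n)
```

A stump can always chase some of the noise, but with more points its best achievable correlation drops — the quantitative version of "cannot fit noise, therefore generalizes."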

⚠️Warning

VC bounds are often vacuous for neural networks, yet they generalize. The VC bound for a billion-parameter network trained on a million examples gives a bound greater than 1 — meaningless. This is the central puzzle: classical uniform-convergence-based theory predicts that overparameterized networks should not generalize, yet they do. The resolution involves implicit regularization (GD finds min-norm solutions), data-dependent bounds (PAC-Bayes via margins), and double descent (covered in the bridge lesson). The fundamental theorem tells us that finite VC dimension is necessary and sufficient for learnability, but it does not tell us the tightest possible bound.
