
PAC Learning, VC Dimension & Rademacher Complexity

PAC learning theory asks when a learning algorithm can probably approximately correctly generalize — VC dimension and Rademacher complexity are the complexity measures that determine sample requirements. Together they form the theoretical foundation for understanding when machine learning works and why.

Concepts

Generalization bounds quantify the gap L(h) - \hat L(h) with probability \geq 1-\delta. Each bound decays at a different rate determined by its complexity measure.

[Figure: bound value vs. sample size n (log scale, 100 to 50k). Curves: finite class \sqrt{(\log|\mathcal{H}| + \log(2/\delta))/(2n)}; VC \sqrt{d \log(n)/n}; Rademacher (B = R = 1). At n = 1000: finite class ≈ 0.094, VC ≈ 0.260, Rademacher ≈ 0.070.]

All three bounds decay as O(1/\sqrt{n}). The key difference is what multiplies that rate: the finite-class bound scales with \log|\mathcal{H}|, the VC bound with d \cdot \log(n), and the Rademacher bound with weight norms — not parameter counts.
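The three bound formulas compared above are one-liners to evaluate. A minimal sketch; the choices |\mathcal{H}| = 10^6, d = 10, and \delta = 0.05 are illustrative assumptions, and the plotted curves may use slightly different constants:

```python
import math

def finite_class_bound(n, H_size, delta):
    """Two-sided Hoeffding + union bound for a finite hypothesis class."""
    return math.sqrt((math.log(H_size) + math.log(2 / delta)) / (2 * n))

def vc_bound(n, d):
    """Simplified VC bound sqrt(d * log(n) / n), constants dropped."""
    return math.sqrt(d * math.log(n) / n)

def rademacher_linear_bound(n, B, R):
    """Norm-based bound B * R / sqrt(n) for linear predictors."""
    return B * R / math.sqrt(n)

n = 1_000
print(finite_class_bound(n, H_size=10**6, delta=0.05))  # ~0.09
print(vc_bound(n, d=10))                                # ~0.26
print(rademacher_linear_bound(n, B=1, R=1))
```

Doubling n shrinks each bound by roughly \sqrt{2}, which is the O(1/\sqrt{n}) decay visible in the figure.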

Every time a model trained on thousands of examples is deployed to millions of users, you are relying on generalization: that performance on the training set predicts performance on new data. PAC learning theory asks when this is mathematically justified — and the answer is elegant: a hypothesis class can generalize if and only if it has finite VC dimension. The Rademacher complexity then gives the tightest data-dependent bound on how large the generalization gap can be.

The PAC Learning Framework

The PAC (Probably Approximately Correct) model (Valiant, 1984) formalizes learnability.

Setup: an unknown distribution \mathcal{D} over \mathcal{X} \times \{0,1\} generates iid examples. A hypothesis class \mathcal{H} is a set of functions h: \mathcal{X} \to \{0,1\}.

True risk: L(h) = P_{(x,y)\sim\mathcal{D}}[h(x) \neq y].

Empirical risk: \hat L(h) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[h(x_i) \neq y_i].

Realizability assumption: \exists h^* \in \mathcal{H} with L(h^*) = 0.

ERM (empirical risk minimizer): \hat h = \arg\min_{h \in \mathcal{H}} \hat L(h).

PAC learnability: \mathcal{H} is PAC learnable if there exists an algorithm A such that for any \varepsilon, \delta > 0, given n \geq n_0(\varepsilon, \delta) examples, A outputs \hat h with P[L(\hat h) \leq \varepsilon] \geq 1-\delta.

The (\varepsilon, \delta) formulation — probably approximately correct — is the minimal weakening of "always correct" that yields a useful learning guarantee. Asking for zero error on all distributions is impossible (the no-free-lunch theorem proves this). The PAC definition separates the two relaxations cleanly: \varepsilon controls approximation quality, \delta controls reliability. Both improve with sample size n, and the relationship n \geq n_0(\varepsilon, \delta) is exactly the sample complexity function.
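The PAC guarantee can be checked by simulation. A minimal sketch, assuming NumPy: the class is intervals on the line (used again in later examples), and the target concept \mathbf{1}[0.3 \leq x \leq 0.7] with \mathcal{D} = Uniform[0,1] is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(2)

def erm_interval(x, y):
    """ERM for the interval class under realizability:
    the tightest [a, b] containing all positive points."""
    pos = x[y == 1]
    return (pos.min(), pos.max()) if len(pos) else (0.0, 0.0)

def true_risk(a, b):
    # The learned interval sits inside [0.3, 0.7], so under Uniform[0, 1]
    # the error is the uncovered sliver (a - 0.3) + (0.7 - b).
    return max(a - 0.3, 0.0) + max(0.7 - b, 0.0)

n, trials, eps = 500, 200, 0.05
fails = 0
for _ in range(trials):
    x = rng.random(n)
    y = ((x >= 0.3) & (x <= 0.7)).astype(int)
    a, b = erm_interval(x, y)
    fails += true_risk(a, b) > eps
print(fails / trials)   # fraction of runs with L(h_hat) > eps; should be tiny
```

With n = 500 the failure probability is astronomically small, consistent with the exponential decay e^{-\varepsilon n} derived in the next section.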

Finite Hypothesis Classes

For |\mathcal{H}| < \infty under realizability, ERM returns \hat h with \hat L(\hat h) = 0. For any h with L(h) > \varepsilon:

P[\hat L(h) = 0] = (1 - L(h))^n \leq (1 - \varepsilon)^n \leq e^{-\varepsilon n}.

Union bound over all "bad" hypotheses: P[\hat L(\hat h) = 0 \text{ yet } L(\hat h) > \varepsilon] \leq |\mathcal{H}| e^{-\varepsilon n}.

Setting this \leq \delta:

n \geq \frac{1}{\varepsilon}\left(\log|\mathcal{H}| + \log\frac{1}{\delta}\right).

The sample complexity is logarithmic in |\mathcal{H}|: even the class of all boolean functions over d variables, with 2^{2^d} members, costs only \log|\mathcal{H}| = 2^d \log 2, and structured subclasses (e.g. conjunctions, with 3^d hypotheses) cost only O(d).

Agnostic PAC (no realizability): for |\mathcal{H}| < \infty, Hoeffding + union bound gives:

n \geq \frac{1}{2\varepsilon^2}\left(\log|\mathcal{H}| + \log\frac{2}{\delta}\right).

Here \varepsilon bounds the excess risk L(\hat h) - \min_{h \in \mathcal{H}} L(h).
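Both sample-complexity formulas are direct to compute. A minimal sketch; the class of boolean conjunctions over 20 variables (|\mathcal{H}| = 3^{20}) is an illustrative choice:

```python
import math

def n_realizable(eps, delta, H_size):
    """n >= (1/eps) * (ln|H| + ln(1/delta))  -- realizable PAC."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def n_agnostic(eps, delta, H_size):
    """n >= (1/(2 eps^2)) * (ln|H| + ln(2/delta))  -- agnostic PAC."""
    return math.ceil((math.log(H_size) + math.log(2 / delta)) / (2 * eps**2))

H = 3**20   # conjunctions over 20 boolean variables (each literal: +, -, absent)
print(n_realizable(0.01, 0.05, H))   # ~2,500 examples
print(n_agnostic(0.01, 0.05, H))     # ~128,000 examples
```

Note the 1/\varepsilon vs. 1/\varepsilon^2 gap: dropping realizability costs a factor of roughly 1/(2\varepsilon) in samples at the same precision.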

VC Dimension

For infinite \mathcal{H}, we need a combinatorial complexity measure.

Shattering: \mathcal{H} shatters a set C = \{x_1,\ldots,x_m\} if for every labeling y \in \{0,1\}^m, there exists h \in \mathcal{H} with h(x_i) = y_i for all i.

VC dimension (Vapnik–Chervonenkis): d_{\text{VC}}(\mathcal{H}) = \max\{m : \exists C \text{ s.t. } \mathcal{H} \text{ shatters } C\}.

Key examples:

| Hypothesis class | VC dimension |
| --- | --- |
| Intervals on \mathbb{R}: h_{a,b}(x) = \mathbf{1}[a \leq x \leq b] | 2 |
| Halfspaces in \mathbb{R}^d: h_w(x) = \mathbf{1}[w^T x \geq 0] | d |
| Affine halfspaces (with bias): \mathbf{1}[w^T x + b \geq 0] | d + 1 |
| Axis-aligned rectangles in \mathbb{R}^2 | 4 |
| Sine classifiers h_\omega(x) = \mathbf{1}[\sin(\omega x) \geq 0] | \infty |
| Neural networks with W weights and L layers | O(WL \log W) |
| Polynomials of degree \leq k in \mathbb{R}^d | \binom{d+k}{k} |

Why VC dim = d for halfspaces through the origin: any d linearly independent points in \mathbb{R}^d can be shattered (solve the linear system w^T x_i = \pm 1 for any sign pattern). No d+1 points can be shattered: any d+1 vectors in \mathbb{R}^d satisfy a nontrivial linear dependence \sum_i a_i x_i = 0, and no halfspace can label the points with a_i > 0 positive and those with a_i < 0 negative — applying w^T to both sides of the dependence yields a contradiction.
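Shattering can be verified by brute force for small classes. A minimal sketch for intervals on \mathbb{R} (VC dimension 2 per the table): enumerate all labelings and check whether some interval realizes each one.

```python
from itertools import product

def intervals_realizes(points, labels):
    """Can some interval [a, b] produce these labels on these points?"""
    inside = [x for x, y in zip(points, labels) if y == 1]
    if not inside:
        return True                      # an empty interval labels everything 0
    a, b = min(inside), max(inside)
    # the tightest covering interval must not capture any negative point
    return all(not (a <= x <= b) for x, y in zip(points, labels) if y == 0)

def shatters(points):
    return all(intervals_realizes(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(shatters([1.0, 2.0]))        # True: any 2 distinct points are shattered
print(shatters([1.0, 2.0, 3.0]))   # False: the labeling 1, 0, 1 is impossible
```

The failing labeling 1, 0, 1 is exactly the combinatorial obstruction: an interval cannot contain two points without containing everything between them.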

Sauer-Shelah Lemma

The growth function \Pi_\mathcal{H}(n) = \max_{C: |C|=n} |\{(h(x_1),\ldots,h(x_n)) : h \in \mathcal{H}\}| counts the number of distinct dichotomies \mathcal{H} produces on n points.

Sauer–Shelah lemma: if d_{\text{VC}}(\mathcal{H}) = d, then

\Pi_\mathcal{H}(n) \leq \sum_{i=0}^d \binom{n}{i} \leq \left(\frac{en}{d}\right)^d.

The growth function is polynomial in n (once n > d), not exponential — this is the key fact that makes generalization over an infinite class possible.
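The Sauer–Shelah bound is easy to evaluate. A minimal sketch contrasting the polynomial dichotomy count with the 2^n possible labelings; n = 100 and d = 4 are illustrative choices:

```python
import math

def sauer_bound(n, d):
    """Sauer-Shelah: max number of dichotomies on n points for VC dim d."""
    return sum(math.comb(n, i) for i in range(min(d, n) + 1))

n, d = 100, 4
print(sauer_bound(n, d))        # 4,087,976 dichotomies -- polynomial, ~n^d
print((math.e * n / d) ** d)    # the looser closed form (en/d)^d
print(2 ** n)                   # versus 2^100 conceivable labelings
```

About 4 million dichotomies out of 2^{100} \approx 10^{30}: the class realizes a vanishing fraction of labelings once n exceeds d, which is what the symmetrization step of the VC bound exploits.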

VC generalization bound: with probability \geq 1-\delta:

\sup_{h \in \mathcal{H}} |\hat L(h) - L(h)| \leq O\!\left(\sqrt{\frac{d_{\text{VC}} \log(n/d_{\text{VC}}) + \log(1/\delta)}{n}}\right).

Sample complexity for agnostic PAC learning: n = O\!\left(\frac{d_{\text{VC}} + \log(1/\delta)}{\varepsilon^2}\right).

Fundamental Theorem of Statistical Learning

Theorem (fundamental theorem of statistical learning): The following are equivalent:

  1. \mathcal{H} is agnostically PAC learnable
  2. \mathcal{H} has finite VC dimension
  3. \mathcal{H} satisfies uniform convergence (for all \varepsilon, \delta, a uniform bound over \mathcal{H} holds)
  4. ERM is a successful agnostic PAC learner for \mathcal{H}

The equivalence of PAC learnability with finite VC dimension is the central theorem of computational learning theory.

Rademacher Complexity

Rademacher complexity provides tighter, data-dependent bounds:

\hat{\mathcal{R}}_n(\mathcal{H}) = \mathbb{E}_\sigma\!\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \sigma_i h(x_i)\right],

where \sigma_i \stackrel{\text{iid}}{\sim} \text{Uniform}\{-1, +1\} (Rademacher variables).

Generalization bound: with probability \geq 1-\delta:

L(\hat h) - \min_{h \in \mathcal{H}} L(h) \leq 2\hat{\mathcal{R}}_n(\mathcal{H}) + 3\sqrt{\frac{\log(2/\delta)}{2n}}.

Key computations:

For linear classifiers h_w(x) = \text{sign}(w^T x) with \|w\|_2 \leq B and \|x_i\|_2 \leq R:

\hat{\mathcal{R}}_n(\mathcal{H}) = \frac{B}{n}\mathbb{E}\!\left[\left\|\sum_i \sigma_i x_i\right\|_2\right] \leq \frac{BR}{\sqrt{n}}.

This bound depends only on B (the weight norm) and R (the feature norm), not on the ambient dimension d. A million-dimensional linear classifier generalizes as well as a 10-dimensional one at the same weight norm.
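The norm-based bound can be sanity-checked by Monte Carlo. A minimal sketch assuming NumPy: it estimates (B/n)\,\mathbb{E}\|\sum_i \sigma_i x_i\|_2 on unit-norm points in two very different ambient dimensions, where d = 10 and d = 10{,}000 are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_rademacher(X, B, trials=2000):
    """Monte Carlo estimate of (B/n) E|| sum_i sigma_i x_i ||_2, the empirical
    Rademacher complexity of {x -> w.x : ||w||_2 <= B} on the sample X."""
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(trials, n))
    return B * np.linalg.norm(sigma @ X, axis=1).mean() / n

n, B, R = 400, 1.0, 1.0
for d in (10, 10_000):
    X = rng.standard_normal((n, d))
    X *= R / np.linalg.norm(X, axis=1, keepdims=True)  # points on the radius-R sphere
    print(d, linear_rademacher(X, B), B * R / np.sqrt(n))
```

Both estimates land near BR/\sqrt{n} = 0.05 regardless of dimension, illustrating that the complexity is controlled by norms rather than by d.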

No-Free-Lunch Theorem

Theorem: For any algorithm A and any n < |\mathcal{X}|/2, there exists a distribution \mathcal{D} such that:

  1. There is an h^* \in \{0,1\}^\mathcal{X} with L(h^*) = 0.
  2. With probability \geq 1/7 over the training sample, A returns h with L(h) \geq 1/8.

Interpretation: no learning algorithm is universally good. Any algorithm that performs well on one distribution must perform poorly on some other. The sample complexity lower bound n = \Omega(d_{\text{VC}}/\varepsilon^2) matches the upper bound, so VC dimension is the right complexity measure.

Worked Example

Example 1: VC Dimension of Rectangles

Axis-aligned rectangles in \mathbb{R}^2: h_{a,b,c,d}(x_1, x_2) = \mathbf{1}[a \leq x_1 \leq b,\; c \leq x_2 \leq d].

Shatters 4 points: place points at (\pm 1, 0) and (0, \pm 1). Each of the 2^4 = 16 labelings can be achieved by choosing the rectangle to include or exclude each axis-extreme point.

Cannot shatter 5 points: given any 5 points, pick four achieving the extreme coordinates (leftmost, rightmost, topmost, bottommost) and label them 1, labeling the remaining point 0. Any axis-aligned rectangle containing the four extreme points contains their bounding box, and hence the fifth point, so this labeling cannot be achieved. Therefore d_{\text{VC}} = 4.

Sample complexity for \varepsilon = 0.01, \delta = 0.05: n \approx \frac{4}{0.0001}\left(\log\frac{4}{0.0001} + \log 40\right) \approx 570{,}000. Axis-aligned rectangles are simple but have substantial agnostic sample complexity at this precision.
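The arithmetic, with constants dropped so only the order of magnitude is meaningful:

```python
import math

d, eps, delta = 4, 0.01, 0.05
# agnostic VC sample complexity with constants dropped (illustrative only):
# n ~ (d / eps^2) * (log(d / eps^2) + log(2 / delta))
n = (d / eps**2) * (math.log(d / eps**2) + math.log(2 / delta))
print(round(n))   # on the order of 6 * 10^5 samples
```

Halving \varepsilon would roughly quadruple this number, which is the 1/\varepsilon^2 price of the agnostic setting.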

Example 2: Rademacher Bound for SVMs

SVM with \|w\|_2 \leq B and all training points satisfying \|x_i\|_2 \leq R. By the Rademacher bound:

L(\hat h) \leq \hat L(\hat h) + \frac{2BR}{\sqrt{n}} + 3\sqrt{\frac{\log(2/\delta)}{2n}}.

For a hard-margin SVM (margin \gamma = 1/\|w\|, so B = 1/\gamma): L(\hat h) \leq 0 + \frac{2R}{\gamma\sqrt{n}} + O\left(\sqrt{\log(1/\delta)/n}\right).

Key insight: the bound scales as R/(\gamma\sqrt{n}) — a larger margin \gamma gives a better generalization guarantee. This is why the SVM maximizes the margin: it directly minimizes the Rademacher complexity and hence the generalization bound.

Example 3: VC Dimension Lower Bound for Neural Networks

A neural network with W weights and threshold activations has VC dimension at least \Omega(W) (by the standard parameterization argument: W points in general position can be shattered). The upper bound is O(WL \log W) for networks with L layers (Bartlett 1999).

For a network with 10^9 parameters and n = 10^6 training examples, the VC bound gives \sqrt{10^9 \log_2(10^9)/10^6} \approx \sqrt{30000} \approx 170 — a vacuous bound, since it asserts only that test error is at most 170 while 0–1 error never exceeds 1. The classical theory fails to explain modern deep learning generalization.
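The vacuous number is reproduced directly (base-2 log, matching the \sqrt{30000} figure):

```python
import math

W, n = 10**9, 10**6   # weights, training examples
bound = math.sqrt(W * math.log2(W) / n)
print(bound)          # ~173: "test error <= 173" says nothing,
                      # since 0-1 error never exceeds 1
```

The bound only becomes non-vacuous here once n exceeds W \log_2 W \approx 3 \times 10^{10} examples, far beyond any realistic dataset for this model size.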

Connections

Where Your Intuition Breaks

The fundamental theorem says finite VC dimension is necessary and sufficient for PAC learnability. The dangerous reading is that VC dimension determines how well a learnable class generalizes — it only determines whether it generalizes at all. The VC bound O(\sqrt{d_{\text{VC}} \log n / n}) is a worst-case upper bound over all distributions; for specific distributions and algorithms, the actual generalization gap can be orders of magnitude smaller. This is why billion-parameter neural networks with vacuous VC bounds still generalize in practice: the VC bound is the right qualitative tool (learnable vs. not) but the wrong instrument for quantitative prediction. It is a theorem about learnability as a binary property, not a tight characterization of sample complexity for any particular model on real data.

💡Intuition

VC dimension measures combinatorial complexity, not size. A hypothesis class with 2^{100} hypotheses has finite (possibly small) VC dimension; a class with infinitely many hypotheses (like the sine classifiers above) can have infinite VC dimension. The VC dimension counts the richness of the class — how many distinct dichotomies it can produce. Large VC dimension means the class is rich enough to memorize any labeling of many points, which correlates with needing many examples to generalize.

💡Intuition

Rademacher complexity is "correlating with noise." It asks: how well can the best hypothesis in \mathcal{H} correlate with random \pm 1 labels? A class that can fit noise perfectly (high \hat{\mathcal{R}}_n) cannot generalize. A class that cannot fit noise at all (\hat{\mathcal{R}}_n \approx 0) has essentially no generalization gap. Empirically, \hat{\mathcal{R}}_n can be estimated by fitting the hypothesis class to random label permutations — modern networks can achieve this almost perfectly, suggesting the Rademacher bound is loose for real data.
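This noise-fitting view can be made concrete for a tiny class. A minimal sketch assuming NumPy; one-sided threshold stumps \text{sign}(x - t) are an illustrative choice, and the sup over the class is taken by brute force over all candidate thresholds:

```python
import numpy as np

rng = np.random.default_rng(1)

def stump_rademacher(x, trials=500):
    """Monte Carlo empirical Rademacher complexity of the stump class
    {x -> sign(x - t)}: average best correlation with random +-1 labels."""
    n = len(x)
    thresholds = np.concatenate(([-np.inf], np.sort(x)))   # candidate cut points
    preds = np.sign(x[None, :] - thresholds[:, None] - 1e-12)  # one stump per row
    total = 0.0
    for _ in range(trials):
        sigma = rng.choice([-1.0, 1.0], size=n)
        total += np.max(preds @ sigma) / n   # best correlation with this noise draw
    return total / trials

for n in (50, 500):
    print(n, stump_rademacher(rng.standard_normal(n)))
    # complexity shrinks roughly like 1/sqrt(n)
```

A stump can always chase some of the noise, but with more points its best achievable correlation drops — the quantitative version of "cannot fit noise, therefore generalizes."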

⚠️Warning

VC bounds are often vacuous for neural networks, yet they generalize. The VC bound for a billion-parameter network trained on a million examples gives a bound greater than 1 — meaningless. This is the central puzzle: classical uniform-convergence-based theory predicts that overparameterized networks should not generalize, yet they do. The resolution involves implicit regularization (GD finds min-norm solutions), data-dependent bounds (PAC-Bayes via margins), and double descent (covered in the bridge lesson). The fundamental theorem tells us that finite VC dimension is necessary and sufficient for learnability, but it does not tell us the tightest possible bound.
