Bridge: Double Descent, Implicit Regularization & Modern Generalization Theory
Double descent, implicit regularization by gradient descent, and benign overfitting are modern extensions of classical statistical theory that explain why overparameterized neural networks generalize despite fitting the training data exactly. They reveal that the classical bias-variance tradeoff is only half the picture.
Concepts
Double descent: test error has a classical U-shape in the underparameterized regime (p<n), spikes at the interpolation threshold (p=n), then decreases again as p/n grows beyond 1.
Classical statistics teaches: fit the training data well, but not too well — overfitting kills generalization. Modern deep learning violates this completely: models with more parameters than training examples fit the data exactly and still generalize. The classical U-shaped bias-variance tradeoff is not wrong — it describes only the first descent. Double descent reveals that beyond the interpolation threshold, a second descent begins.
Classical Bias-Variance Tradeoff
For a parametric model with p parameters trained on n examples, the expected test MSE decomposes as:

E[(y − f̂(x))²] = Bias[f̂(x)]² + Var[f̂(x)] + σ²

where σ² is the irreducible noise variance.
The bias-variance decomposition is an algebraic identity: it holds exactly for any estimator under squared loss, with no approximations. The classical inference that adding model complexity increases variance is also correct in the regime p < n. Double descent does not contradict this: it shows that variance eventually decreases again as p grows beyond n, because the minimum-norm interpolant spreads coefficient mass uniformly rather than concentrating it on the training points, a geometric regularization effect that classical theory had no reason to anticipate.
In the classical regime (p < n):
- Low complexity: high bias, low variance
- High complexity: low bias, high variance
- Optimal tradeoff at the model complexity minimizing bias + variance
This predicts a U-shaped test error curve in model complexity. Classical wisdom: stop before full interpolation of the training data (before p = n) to avoid variance blowup.
The Interpolation Threshold and Double Descent
The interpolation threshold is reached when the model has just enough parameters to exactly fit the training data (zero training error). For linear regression: at p = n.
Double descent (Belkin et al. 2019, Hastie et al. 2022): the test error has two descents:
- Classical regime (p < n): U-shaped curve as complexity increases
- Interpolation threshold (p ≈ n): test error spikes; the model just barely interpolates the data with high-variance solutions
- Modern regime (p > n): test error decreases again as p/n → ∞
Minimum-Norm Interpolation
In the overparameterized regime (p > n), there are infinitely many solutions fitting the training data exactly. Gradient descent from zero initialization selects the minimum ℓ₂-norm interpolant:

θ̂ = argmin{ ‖θ‖₂ : Xθ = y } = Xᵀ(XXᵀ)⁻¹y
Test MSE of the minimum-norm interpolant (isotropic linear model, Hastie et al. 2022). For X with iid entries, signal β with ‖β‖² = r², noise σ², and ratio γ = p/n, as n, p → ∞:

R(γ) = σ²γ/(1 − γ) for γ < 1,  R(γ) = r²(1 − 1/γ) + σ²/(γ − 1) for γ > 1

For γ < 1 (OLS): the variance term σ²γ/(1 − γ) diverges as γ → 1. For γ > 1: the variance σ²/(γ − 1) decreases as γ grows, and for γ → ∞: test MSE → r², the risk of the null predictor.
The key is that adding more parameters (larger p) spreads the fitted coefficients thinner, reducing the variance of the min-norm solution even though it still interpolates the noise.
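The asymptotic risk of min-norm least squares (the standard isotropic-model expressions from Hastie et al. 2022) can be evaluated directly; a sketch assuming illustrative values r² = 1 and σ² = 0.2:

```python
import math

def risk(gamma, r2=1.0, s2=0.2):
    """Asymptotic test MSE of min-norm least squares, isotropic features."""
    if gamma < 1:    # underparameterized: OLS, pure variance
        return s2 * gamma / (1 - gamma)
    # overparameterized: bias from norm shrinkage + shrinking variance
    return r2 * (1 - 1 / gamma) + s2 / (gamma - 1)

# Divergence near gamma = 1, then a second descent past the threshold.
print([round(risk(g), 3) for g in (0.5, 0.9, 1.1, 2.0, 10.0)])
```

Note the tradeoff in the overparameterized branch: variance keeps falling, but the shrinkage bias grows toward r², so the risk approaches the null predictor's risk as γ → ∞.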
Implicit Bias of Gradient Descent
Gradient descent on the squared loss L(θ) = ½‖Xθ − y‖² from θ₀ = 0 follows:

θ_{t+1} = θ_t − η Xᵀ(Xθ_t − y)
Theorem (implicit bias): as t → ∞, gradient descent from θ₀ = 0 (with sufficiently small step size η) converges to the minimum ℓ₂-norm solution θ̂ = X⁺y.
Proof sketch: the GD iterates remain in the row space of X, since each update adds Xᵀ(Xθ_t − y), a vector in the row space, to θ_t. Within the row space, the unique solution to Xθ = y with minimal norm is the pseudoinverse solution X⁺y.
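The theorem can be checked numerically on a small overparameterized linear system (the sizes, step size, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 60                       # overparameterized: p > n
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

theta = np.zeros(p)                 # zero initialization is essential
eta = 0.005                         # small enough: eta < 2 / lambda_max(X^T X)
for _ in range(20_000):
    theta -= eta * X.T @ (X @ theta - y)   # GD on the squared loss

theta_pinv = np.linalg.pinv(X) @ y         # min-l2-norm interpolant
print(np.linalg.norm(theta - theta_pinv))  # should be ~ 0
```

Starting from a nonzero θ₀ instead would converge to X⁺y plus the component of θ₀ orthogonal to the row space, which is why zero initialization matters here.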
For classification: gradient descent on the cross-entropy loss over separable data, in the limit of small step sizes, selects the max-margin solution (Soudry et al. 2018, for linear predictors), i.e. the minimum ℓ₂-norm separator achieving margin 1; analogous results hold for homogeneous networks. This implicit bias toward simple (minimum-norm) solutions provides a partial explanation for generalization.
PAC-Bayes and Margin Theory
Classical VC bounds are vacuous for neural networks: the VC dimension scales with the parameter count, which exceeds the number of training examples. Two alternatives that give non-vacuous bounds:
PAC-Bayes (McAllester 1999): for any prior P over parameters (chosen before training) and posterior Q (the distribution of the trained model), with probability at least 1 − δ over the training set:

L(Q) ≤ L̂(Q) + √( (KL(Q‖P) + ln(n/δ)) / (2(n − 1)) )
Here L(Q) is the expected test loss and L̂(Q) the empirical training loss of the randomized predictor θ ~ Q.
For a Gaussian prior P = N(0, σ²I) and posterior Q = N(θ̂, σ²I) (a perturbed trained model):

KL(Q‖P) = ‖θ̂‖² / (2σ²)
With the prior scale σ tuned to the scale of good solutions and δ small: if ‖θ̂‖²/σ² is moderate (the trained network has small norm relative to the prior scale), the KL is bounded, and the bound on the generalization gap is non-vacuous.
Margin theory: for a linear classifier w with normalized margin γ on the training data (y_i⟨w, x_i⟩/‖w‖ ≥ γ for all i, with ‖x_i‖ ≤ R):

generalization gap ≲ √( R²/(γ² n) )
The bound depends on R²/γ² (the feature-norm bound over the squared normalized margin), not the number of parameters. Networks with large margins relative to their weight norms generalize well even with many parameters.
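The scaling of the margin bound √(R²/(γ²n)), with R a bound on feature norms and γ the normalized margin, is easy to tabulate; the numbers below are hypothetical:

```python
import math

def margin_bound(R, gamma, n):
    """Order-of-magnitude margin-based generalization gap: sqrt(R^2 / (gamma^2 n))."""
    return math.sqrt((R / gamma) ** 2 / n)

# Same normalized margin, growing training set: the bound shrinks as 1/sqrt(n),
# independent of the parameter count.
print([round(margin_bound(R=1.0, gamma=0.1, n=n) , 3) for n in (100, 10_000, 1_000_000)])
# -> [1.0, 0.1, 0.01]
```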
The Neural Tangent Kernel
In the infinite-width limit and at small learning rates, neural networks train in the lazy training (NTK) regime:

f(x; θ_t) ≈ f(x; θ₀) + ⟨∇_θ f(x; θ₀), θ_t − θ₀⟩

The network behaves as a linear model in parameter space, with the neural tangent kernel:

K(x, x′) = ⟨∇_θ f(x; θ₀), ∇_θ f(x′; θ₀)⟩
Consequence: the NTK evaluated at initialization remains approximately constant throughout training, so the network's predictions evolve as kernel regression with kernel K. Generalization is governed by the spectrum of K, not by the number of parameters.
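The linearization can be checked on a small one-hidden-layer tanh network. Everything here (architecture, width, 1/√m NTK-style scaling, data) is an illustrative choice: after one small GD step, the actual change in predictions should match the kernel prediction −η K (f − y) to first order.

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, n = 4000, 3, 8                  # width, input dim, train size
W = rng.standard_normal((m, d))
v = rng.standard_normal(m)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def forward(W, v):
    return np.tanh(X @ W.T) @ v / np.sqrt(m)      # f(x) = v^T tanh(Wx) / sqrt(m)

def grads(W, v):
    """Per-example gradient of f wrt (W, v), flattened: one row per example."""
    H = np.tanh(X @ W.T)                           # (n, m) hidden activations
    dv = H / np.sqrt(m)
    dW = (v * (1 - H**2))[:, :, None] * X[:, None, :] / np.sqrt(m)  # (n, m, d)
    return np.concatenate([dW.reshape(n, -1), dv], axis=1)

G = grads(W, v)
K = G @ G.T                                        # empirical NTK Gram matrix

f0 = forward(W, v)
eta = 0.01
step = eta * G.T @ (f0 - y)                        # one GD step on squared loss
W1 = W - step[: m * d].reshape(m, d)
v1 = v - step[m * d :]
actual = forward(W1, v1) - f0
predicted = -eta * K @ (f0 - y)                    # kernel-regression prediction
print(np.max(np.abs(actual - predicted)))          # tiny: lazy regime
```

At large width the discrepancy is second order in the step; shrinking m or raising η makes the network leave the linearized regime, which is exactly the feature-learning deviation discussed below.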
Limitations: the NTK regime requires very large width and describes lazy training, not feature learning. Real networks in practical regimes (finite width, significant learning rate) do learn features and deviate from the NTK regime — they can generalize better than the NTK predicts.
Worked Example
Example 1: Double Descent in Polynomial Regression
Fit a polynomial of degree d to n noisy samples from y = f(x) + ε (noise variance σ²).
- d ≪ n: high bias (model too simple), moderate variance; test MSE high due to underfitting
- d slightly below n (near d = n): low bias, very high variance; the model fits noise, spiky predictions
- d ≈ n: the model just barely interpolates (knife's edge)
- d > n: minimum-norm interpolant, smoother prediction; test MSE lower than at d ≈ n
- d ≫ n: very smooth interpolant, test MSE close to the noise level
The minimum-norm polynomial at d ≫ n is the smoothest interpolant: it implicitly suppresses high-frequency components, behaving like a low-pass filter.
Example 2: Implicit Bias in Logistic Regression
For separable data with GD on the cross-entropy loss: as t → ∞, ‖w_t‖ → ∞ but the direction w_t/‖w_t‖ converges to the max-margin separator.
For example, for the points (2, 1), (2, −1) with label +1 and (−2, 1), (−2, −1) with label −1, the max-margin separator is exactly w ∝ (1, 0) with margin 2. GD converges to this direction from random initialization, even without explicit regularization.
This explains why logistic regression without regularization still generalizes on linearly separable data: GD selects the max-margin separator, which is the most robust separator.
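A quick numerical check of this behavior (the four-point dataset and hyperparameters are illustrative; by symmetry its max-margin direction is (1, 0)):

```python
import numpy as np

# Symmetric toy set: max-margin separator direction is (1, 0), margin 2.
X = np.array([[2.0, 1.0], [2.0, -1.0], [-2.0, 1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.zeros(2)
eta = 0.5
for _ in range(50_000):
    margins = y * (X @ w)
    w += eta * X.T @ (y / (1 + np.exp(margins)))  # GD on sum log(1+e^{-y w.x})

direction = w / np.linalg.norm(w)
print(direction, np.linalg.norm(w))  # direction -> (1, 0); norm keeps growing
```

The norm diverges (logarithmically) while the direction locks onto the max-margin separator, matching the Soudry et al. picture.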
Example 3: PAC-Bayes Bound for a ResNet
For a ResNet with trained weights θ̂ (so ‖θ̂‖² is the sum of squared weights), posterior Q = N(θ̂, σ²I) (a small perturbation of the trained model), and Gaussian prior P = N(0, σ²I):

KL(Q‖P) = ‖θ̂‖² / (2σ²)

(the trace and log-determinant terms cancel when the two variances match). With n training examples, the generalization gap is bounded by √((KL(Q‖P) + ln(n/δ)) / (2(n − 1))); for a KL of a few hundred nats and n in the tens of thousands, this gives a bound of roughly 5%.
A 5% bound is non-vacuous and competitive with the observed test error. Weight decay (penalizing ) directly tightens this bound — it is not just a regularization trick but provably tightens the PAC-Bayes generalization guarantee.
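Plugging illustrative numbers into the bound (KL ≈ 250 nats, n = 50,000, and δ = 0.05 are hypothetical values chosen to land near 5%, not measurements from a real ResNet):

```python
import math

def pac_bayes_bound(kl, n, delta):
    """McAllester-style generalization gap: sqrt((KL + ln(n/delta)) / (2(n-1)))."""
    return math.sqrt((kl + math.log(n / delta)) / (2 * (n - 1)))

gap = pac_bayes_bound(kl=250.0, n=50_000, delta=0.05)
print(round(gap, 4))  # ~ 0.05: a non-vacuous ~5% bound
```

Halving ‖θ̂‖² (e.g. via weight decay) quarters the KL term, which is the quantitative sense in which weight decay tightens the guarantee.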
Connections
Where Your Intuition Breaks
"Gradient descent from zero initialization finds the minimum-norm solution" is proven for linear models and carefully analyzed for overparameterized linear settings. The intuition naturally extends to neural networks — but the implicit bias of GD for nonlinear networks depends critically on architecture. Batch normalization, residual connections, and attention mechanisms all alter the loss landscape and change which of the many training-error-minimizing solutions GD selects. For networks with batch norm, weight norms alone do not determine generalization; the effective degrees of freedom are better captured by spectral norms or PAC-Bayes bounds over perturbed weights. Trusting implicit regularization without understanding how architecture shapes it is a common source of unexpected failure in fine-tuning and transfer learning.
Double descent reconciles memorization and generalization. Classical theory says: models that memorize training data (zero training error) cannot generalize. Double descent refutes this: the minimum-norm interpolant generalizes because it is the simplest (smoothest) of all interpolants. The key is not whether training error is zero, but how training error is achieved. Minimum-norm solutions spread coefficient mass uniformly and do not over-fit individual data points — they are the analog of smooth spline interpolation.
Implicit bias is a form of algorithmic regularization. We choose gradient descent for computational reasons, not because it has nice statistical properties. Yet GD from zero initialization imposes a strong statistical prior: preference for minimum-norm solutions. This connects optimization to statistics in a deep way — the algorithm does not just find a solution to the empirical risk minimization problem; it selects which solution among the many that minimize training error. Understanding implicit bias is central to understanding why deep learning works.
Feature learning vs. lazy training: real networks leave the NTK regime. The NTK analysis applies when networks are in the "lazy" regime — when parameters barely change from initialization. In this regime, the network is essentially a kernel machine, and all the theory of kernel methods applies. However, the power of deep learning comes from feature learning: the internal representations adapt to the data, solving problems that no fixed kernel could solve efficiently. Networks that learn useful features can generalize far better than NTK predictions, and also far better than classical statistical bounds. The theory of feature learning is active and incomplete.