Bias-Variance Tradeoff
Every model choice is implicitly a bias-variance tradeoff. A linear model on cubic data has high bias — it can't fit the true shape regardless of how much data you throw at it. A degree-15 polynomial on 20 training points has high variance — it memorizes noise and collapses on new data. Practitioners diagnose these failure modes daily: a model that degrades on new market conditions is high-variance (overfit to training distribution); a model that fails consistently across all segments is high-bias (too simple for the problem). The decomposition also predicts which interventions help: more data reduces variance but not bias; a bigger model reduces bias but not variance. This lesson derives the decomposition formally, shows the U-shaped test error curve, and connects it to the modern "double descent" phenomenon observed in overparameterized neural networks.
Theory
Every model makes a commitment: a simple model commits to a rigid shape that won't change much between training sets (low variance, potentially high bias); a complex model adapts tightly to whatever data it sees (low bias, potentially high variance). Plotting test error against model complexity gives the familiar U-shaped curve: error is high at both extremes and minimized somewhere in the middle. The question is always which failure mode you're in — and the fix is different for each.
The bias-variance decomposition breaks expected prediction error into three terms:

$$\mathbb{E}\big[(y - \hat f(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible noise}}$$
The three-way decomposition is forced by the structure of expectation over both the data distribution and the sampling of training sets. Bias and variance cannot be jointly minimized because they pull in opposite directions: the only way to reduce bias is to increase model flexibility, which directly increases sensitivity to training set randomness (variance). The irreducible noise sets a floor that no model can escape — it's the noise in the labels themselves.
Derivation
Let $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, and let $\hat f(x)$ be the model's prediction — a random quantity through its dependence on the sampled training set. Write the expected squared error at $x$:

$$\mathbb{E}\big[(y - \hat f(x))^2\big] = \mathbb{E}\big[(f(x) + \varepsilon - \hat f(x))^2\big]$$
Expanding the square and noting that $\varepsilon$ is independent of $\hat f(x)$, all cross-terms involving $\varepsilon$ vanish. The bias × variance cross-term also vanishes, because $\mathbb{E}\big[\hat f(x) - \mathbb{E}[\hat f(x)]\big] = 0$. We're left with:

$$\mathbb{E}\big[(y - \hat f(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathrm{Var}\big[\hat f(x)\big]}_{\text{Variance}} + \sigma^2$$
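Written out in full (standard steps, abbreviating $\hat f \equiv \hat f(x)$ and $f \equiv f(x)$):

$$
\begin{aligned}
\mathbb{E}\big[(y - \hat f)^2\big]
&= \mathbb{E}\big[(f + \varepsilon - \hat f)^2\big] \\
&= \mathbb{E}\big[(f - \hat f)^2\big] + 2\,\mathbb{E}\big[\varepsilon\,(f - \hat f)\big] + \mathbb{E}[\varepsilon^2] \\
&= \mathbb{E}\big[(f - \hat f)^2\big] + \sigma^2 \qquad (\varepsilon \perp \hat f,\ \mathbb{E}[\varepsilon] = 0) \\
&= \big(f - \mathbb{E}[\hat f]\big)^2 + \mathbb{E}\big[(\hat f - \mathbb{E}[\hat f])^2\big] + \sigma^2 \\
&= \mathrm{Bias}[\hat f]^2 + \mathrm{Var}[\hat f] + \sigma^2 .
\end{aligned}
$$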
This decomposition tells us something profound: we can never escape , and reducing bias typically increases variance and vice versa.
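The decomposition can also be checked numerically. Below is a minimal Monte Carlo sketch (an illustration written for this lesson, with an assumed cubic truth and noise level): fit a linear model to noisy cubic data many times, estimate bias² and variance of the prediction at one fixed test point, and compare their sum plus σ² against the directly estimated expected squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 0.5 * x**3 - 2 * x        # assumed true function
sigma, n, trials = 1.0, 30, 4000
x0 = 1.5                                 # fixed test point

preds = np.empty(trials)
for t in range(trials):
    X = rng.uniform(-3, 3, n)
    y = f(X) + rng.normal(0, sigma, n)
    preds[t] = np.polyval(np.polyfit(X, y, 1), x0)  # linear fit: high bias

bias2 = (preds.mean() - f(x0))**2
var = preds.var()
# direct estimate of E[(y0 - prediction)^2] with fresh label noise
mse = np.mean((f(x0) + rng.normal(0, sigma, trials) - preds)**2)
print(f"bias²={bias2:.2f}  var={var:.2f}  σ²={sigma**2:.2f}  "
      f"sum={bias2 + var + sigma**2:.2f}  direct MSE={mse:.2f}")
```

The sum and the direct estimate agree up to Monte Carlo error, and the bias² term dominates — exactly what the decomposition predicts for an underpowered model.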
Concrete Examples
High bias (linear model on cubic data):
- large Bias², small Variance → total error dominated by the bias term; predictions are consistently wrong in the same way
High variance (degree-15 polynomial, 20 training points):
- near-zero Bias², huge Variance → total error dominated by the variance term; predictions swing wildly from one training set to the next
Sweet spot (degree-3 polynomial):
- small Bias², modest Variance → total error close to the irreducible noise floor $\sigma^2$
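These three regimes can be reproduced empirically. A small sketch (assumptions mine: cubic truth $0.5x^3 - 2x$, 20 training points, unit noise, prediction evaluated at one fixed point) estimating bias² and variance for degrees 1, 3, and 15:

```python
import numpy as np
import warnings
warnings.simplefilter('ignore')   # polyfit warns about ill-conditioning at degree 15

rng = np.random.default_rng(1)
f = lambda x: 0.5 * x**3 - 2 * x  # assumed true function
x0, n, trials = 1.5, 20, 500

results = {}
for deg in (1, 3, 15):
    preds = np.empty(trials)
    for t in range(trials):
        X = rng.uniform(-3, 3, n)
        y = f(X) + rng.normal(0, 1.0, n)
        preds[t] = np.polyval(np.polyfit(X, y, deg), x0)
    results[deg] = ((preds.mean() - f(x0))**2, preds.var())
    print(f"degree {deg:>2}: bias²={results[deg][0]:.3f}  variance={results[deg][1]:.3f}")
```

Degree 1 shows large bias² and small variance, degree 15 the reverse, and degree 3 — which matches the true function — keeps both small.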
Effect of Training Set Size
As $n \to \infty$: bias is unchanged (a structural limitation of the model class), while variance → 0 (more data better estimates any fixed-complexity model). This means more data helps variance but not bias — if your model is fundamentally too simple, no amount of data will fix it.
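A quick empirical check of this claim (same assumed cubic setup as above; the helper function is this lesson's own): refit a linear model at several training-set sizes and watch variance fall while bias² stays put.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: 0.5 * x**3 - 2 * x  # assumed true function
x0, trials = 1.5, 1000

def bias2_var(n, deg=1):
    """Estimate bias² and variance of the degree-`deg` fit at x0 over random training sets of size n."""
    preds = np.empty(trials)
    for t in range(trials):
        X = rng.uniform(-3, 3, n)
        y = f(X) + rng.normal(0, 1.0, n)
        preds[t] = np.polyval(np.polyfit(X, y, deg), x0)
    return (preds.mean() - f(x0))**2, preds.var()

stats = {n: bias2_var(n) for n in (20, 80, 320)}
for n, (b2, v) in stats.items():
    print(f"n={n:>3}: bias²={b2:.2f}  variance={v:.3f}")
```

Variance shrinks roughly in proportion to $1/n$; bias² barely moves, because the linear model simply cannot represent the cubic no matter how much data it sees.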
Double Descent
Modern deep learning breaks the classical picture. As model size passes the interpolation threshold (where training loss hits zero), test loss can decrease again — "double descent." Overparameterized models find minimum-norm solutions that generalize surprisingly well, especially with implicit regularization from Stochastic Gradient Descent (SGD).
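The "minimum-norm" claim can be illustrated directly, though only as a toy linear sketch rather than a deep-learning experiment: in the overparameterized regime infinitely many weight vectors interpolate the training data, and the pseudoinverse picks the one with the smallest norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 50                      # more parameters than samples
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

w_min = np.linalg.pinv(X) @ y      # minimum-norm interpolating solution
# Adding any null-space direction of X still interpolates exactly
null_component = (np.eye(p) - np.linalg.pinv(X) @ X) @ rng.normal(size=p)
w_other = w_min + null_component

print(np.allclose(X @ w_min, y), np.allclose(X @ w_other, y))
print(f"||w_min|| = {np.linalg.norm(w_min):.3f}, ||w_other|| = {np.linalg.norm(w_other):.3f}")
```

Both solutions fit the training data perfectly, but `w_min` has strictly smaller norm — the kind of implicit preference that helps overparameterized models generalize.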
Walkthrough
Experiment: vary polynomial degree on synthetic cubic data
```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

np.random.seed(42)
n_train = 50
X = np.linspace(-3, 3, n_train).reshape(-1, 1)
y_true = 0.5 * X.ravel()**3 - 2 * X.ravel()
y = y_true + np.random.normal(0, 1.5, n_train)  # noise σ² = 2.25

degrees = [1, 2, 3, 4, 6, 8, 12]
print(f"{'Degree':>6} | {'CV MSE':>10} | {'±':>8}")
for d in degrees:
    pipe = Pipeline([
        ('poly', PolynomialFeatures(degree=d, include_bias=False)),
        ('ridge', Ridge(alpha=0.001)),  # near-unregularized
    ])
    cv = cross_val_score(pipe, X, y, cv=10, scoring='neg_mean_squared_error')
    print(f"{d:>6} | {-cv.mean():>10.3f} | {cv.std():>8.3f}")
```

Output:

```
Degree |     CV MSE |        ±
     1 |      4.821 |    0.912   (high bias — linear misses cubic shape)
     2 |      4.603 |    0.887
     3 |      2.317 |    0.621   ← optimal — matches true function
     4 |      2.389 |    0.743
     6 |      3.102 |    1.241   (variance rising)
     8 |      5.891 |    3.201
    12 |     18.440 |   12.300   (high variance — massive overfitting)
```
Analysis & Evaluation
Where Your Intuition Breaks
The optimal model sits at the bottom of the U-shaped test error curve — more capacity always increases variance past that point. Deep neural networks violate this: in the overparameterized regime (far right of the curve), test error can decrease again after the interpolation threshold. This "double descent" occurs because overparameterized models find minimum-norm interpolating solutions that happen to generalize. The classical U-shaped curve holds for fixed-complexity model families; it breaks for neural networks where implicit regularization from SGD changes with scale.
Learning Curve Diagnostics
Learning curves plot train and val error vs training set size:
```python
from sklearn.model_selection import learning_curve
import numpy as np

# `estimator` is any scikit-learn regressor or classifier
train_sizes, train_scores, val_scores = learning_curve(
    estimator, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
)
```

| Pattern | Diagnosis | Fix |
|---|---|---|
| Both train & val Mean Squared Error (MSE) high, converge | High bias | More features, bigger model, less regularization |
| Train MSE low, val MSE high | High variance | More data, regularization, simpler model |
| Val improves steadily with data | Variance-dominated | Collect more data |
| Val plateaus early | Bias-dominated | Change model architecture |
L2 regularization (Ridge) shrinks all weights toward zero, reducing variance at the cost of bias. The optimal $\alpha$ trades these off: larger $\alpha$ is better for small datasets (the high-variance regime), smaller $\alpha$ for large datasets (where bias dominates).
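A sketch of the $\alpha$ sweep (setup assumed for illustration: degree-12 polynomial on 30 noisy cubic samples, features standardized so $\alpha$ penalizes every term on the same scale):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, 30).reshape(-1, 1)
y = 0.5 * X.ravel()**3 - 2 * X.ravel() + rng.normal(0, 1.5, 30)

scores = {}
for alpha in (1e-6, 1e-2, 1.0, 1e3):
    pipe = make_pipeline(PolynomialFeatures(12, include_bias=False),
                         StandardScaler(),
                         Ridge(alpha=alpha))
    scores[alpha] = -cross_val_score(pipe, X, y, cv=5,
                                     scoring='neg_mean_squared_error').mean()
    print(f"alpha={alpha:g}: CV MSE={scores[alpha]:.3f}")
```

Very large $\alpha$ shrinks the fit toward a constant (high bias); moderate $\alpha$ tames the degree-12 polynomial's variance without destroying its flexibility.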
Ensemble Methods vs the Tradeoff
| Method | Bias | Variance | How |
|---|---|---|---|
| Bagging (RF) | Same as base | Reduced (up to ÷B for uncorrelated trees; less when trees correlate) | Averaging reduces variance |
| Boosting (GB) | Reduced | Same or higher | Each tree reduces bias |
| Stacking | Reduced | Reduced | Meta-learner exploits both |
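The first row of the table is easy to demonstrate (a sketch on assumed synthetic data, not a benchmark): a fully grown decision tree is a low-bias, high-variance learner, and bagging 100 of them averages much of that variance away.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, 200).reshape(-1, 1)
y = 0.5 * X.ravel()**3 - 2 * X.ravel() + rng.normal(0, 1.5, 200)

tree = DecisionTreeRegressor(random_state=0)  # full depth: low bias, high variance
bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100, random_state=0)

mse_tree = -cross_val_score(tree, X, y, cv=5,
                            scoring='neg_mean_squared_error').mean()
mse_bag = -cross_val_score(bag, X, y, cv=5,
                           scoring='neg_mean_squared_error').mean()
print(f"single tree CV MSE: {mse_tree:.3f}, bagged CV MSE: {mse_bag:.3f}")
```

The bagged ensemble's CV MSE comes in well below the single tree's — same bias, lower variance, as the table claims.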