Bias-Variance Tradeoff

Every model choice is implicitly a bias-variance tradeoff. A linear model on cubic data has high bias — it can't fit the true shape regardless of how much data you throw at it. A degree-15 polynomial on 20 training points has high variance — it memorizes noise and collapses on new data. Practitioners diagnose these failure modes daily: a model that degrades on new market conditions is high-variance (overfit to training distribution); a model that fails consistently across all segments is high-bias (too simple for the problem). The decomposition also predicts which interventions help: more data reduces variance but not bias; a bigger model reduces bias but not variance. This lesson derives the decomposition formally, shows the U-shaped test error curve, and connects it to the modern "double descent" phenomenon observed in overparameterized neural networks.

Theory

Bias–Variance Tradeoff vs. Model Complexity

Bias²

Variance

Total Error

Every model makes a commitment: a simple model commits to a rigid shape that won't change much between training sets (low variance, potentially high bias); a complex model adapts tightly to whatever data it sees (low bias, potentially high variance). The U-shaped curve above shows the result: test error is high at both extremes and minimized somewhere in the middle. The question is always which failure mode you're in — and the fix is different for each.

The bias-variance decomposition breaks expected prediction error into three terms:

$\mathbb{E}\left[(y - \hat{f}(\mathbf{x}))^2\right] = \underbrace{\left(f^*(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})]\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{f}(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})]\right)^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible noise}}$

The three-way decomposition is forced by the structure of expectation over both the data distribution and the sampling of training sets. Bias and variance cannot be jointly minimized because they pull in opposite directions: the only way to reduce bias is to increase model flexibility, which directly increases sensitivity to training set randomness (variance). The irreducible noise $\sigma^2$ sets a floor that no model can escape — it's the noise in the labels themselves.

Derivation

Let $y = f^*(\mathbf{x}) + \epsilon$ with $\mathbb{E}[\epsilon]=0$ , $\text{Var}[\epsilon]=\sigma^2$ . Write the squared error:

$\mathbb{E}[(y - \hat{f})^2] = \mathbb{E}[(y - f^* + f^* - \mathbb{E}[\hat{f}] + \mathbb{E}[\hat{f}] - \hat{f})^2]$

Expanding the square and noting that $\epsilon$ is independent of $\hat{f}$ , all cross-terms with $\epsilon$ vanish. The bias×variance cross-term also vanishes (by definition of expectation). We're left with:

$= (f^* - \mathbb{E}[\hat{f}])^2 + \text{Var}[\hat{f}] + \sigma^2$

This decomposition tells us something profound: we can never escape $\sigma^2$ , and reducing bias typically increases variance and vice versa.

Concrete Examples

High bias (linear model on cubic data):

$\text{Bias}^2 = 2.3$ , $\text{Variance} = 0.1$ , $\sigma^2 = 0.5$ → $\text{MSE} = 2.9$

High variance (degree-15 polynomial, 20 training points):

$\text{Bias}^2 = 0.02$ , $\text{Variance} = 4.8$ , $\sigma^2 = 0.5$ → $\text{MSE} = 5.32$

Sweet spot (degree-3 polynomial):

$\text{Bias}^2 = 0.1$ , $\text{Variance} = 0.4$ , $\sigma^2 = 0.5$ → $\text{MSE} = 1.0$

Effect of Training Set Size

As $n \to \infty$ : bias is unchanged (structural limitation of the model class), variance → 0 (more data better estimates any fixed-complexity model). This means more data helps variance but not bias — if your model is fundamentally too simple, no amount of data will fix it.

Double Descent

Modern deep learning breaks the classical picture. As model size passes the interpolation threshold (where training loss hits zero), test loss can decrease again — "double descent." Overparameterized models find minimum-norm solutions that generalize surprisingly well, especially with implicit regularization from Stochastic Gradient Descent (SGD).

Walkthrough

Experiment: vary polynomial degree on synthetic cubic data

python

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
 
np.random.seed(42)
n_train = 50
X = np.linspace(-3, 3, n_train).reshape(-1, 1)
y_true = 0.5 * X.ravel()**3 - 2 * X.ravel()
y = y_true + np.random.normal(0, 1.5, n_train)  # σ² = 2.25
 
degrees = [1, 2, 3, 4, 6, 8, 12]
print(f"{'Degree':>6} | {'CV MSE':>10} | {'±':>8}")
for d in degrees:
    pipe = Pipeline([
        ('poly', PolynomialFeatures(degree=d, include_bias=False)),
        ('ridge', Ridge(alpha=0.001))  # near-unregularized
    ])
    cv = cross_val_score(pipe, X, y, cv=10, scoring='neg_mean_squared_error')
    print(f"{d:>6} | {-cv.mean():>10.3f} | {cv.std():>8.3f}")

Output:

Degree |     CV MSE |        ±
     1 |      4.821 |    0.912  (high bias — linear misses cubic shape)
     2 |      4.603 |    0.887
     3 |      2.317 |    0.621  ← optimal — matches true function
     4 |      2.389 |    0.743
     6 |      3.102 |    1.241  (variance rising)
     8 |      5.891 |    3.201
    12 |     18.44  |   12.3    (high variance — massive overfitting)

Analysis & Evaluation

Where Your Intuition Breaks

The optimal model sits at the bottom of the U-shaped test error curve — more capacity always increases variance past that point. Deep neural networks violate this: in the overparameterized regime (far right of the curve), test error can decrease again after the interpolation threshold. This "double descent" occurs because overparameterized models find minimum-norm interpolating solutions that happen to generalize. The classical U-shaped curve holds for fixed-complexity model families; it breaks for neural networks where implicit regularization from SGD changes with scale.

Learning Curve Diagnostics

Learning curves plot train and val error vs training set size:

python

from sklearn.model_selection import learning_curve
import numpy as np
 
train_sizes, train_scores, val_scores = learning_curve(
    estimator, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
)

Pattern	Diagnosis	Fix
Both train & val Mean Squared Error (MSE) high, converge	High bias	More features, bigger model, less regularization
Train MSE low, val MSE high	High variance	More data, regularization, simpler model
Val improves steadily with data	Variance-dominated	Collect more data
Val plateaus early	Bias-dominated	Change model architecture

💡Regularization shifts the tradeoff

L2 regularization (Ridge) shrinks all weights toward zero, reducing variance at the cost of bias. The optimal $\lambda$ trades off these: larger $\lambda$ is better for small datasets (high variance regime), smaller $\lambda$ for large datasets (bias regime dominates).

Ensemble Methods vs the Tradeoff

Method	Bias	Variance	How
Bagging (RF)	Same as base	Reduced ÷√B	Averaging reduces variance
Boosting (GB)	Reduced	Same or higher	Each tree reduces bias
Stacking	Reduced	Reduced	Meta-learner exploits both

Enjoying these notes?

Get new lessons delivered to your inbox. No spam.

Train / Validation / Test Splits

Gradient Descent