Neural-Path/Notes
20 min

Bias-Variance Tradeoff

Every model choice is implicitly a bias-variance tradeoff. A linear model on cubic data has high bias — it can't fit the true shape regardless of how much data you throw at it. A degree-15 polynomial on 20 training points has high variance — it memorizes noise and collapses on new data. Practitioners diagnose these failure modes daily: a model that degrades on new market conditions is high-variance (overfit to training distribution); a model that fails consistently across all segments is high-bias (too simple for the problem). The decomposition also predicts which interventions help: more data reduces variance but not bias; a bigger model reduces bias but not variance. This lesson derives the decomposition formally, shows the U-shaped test error curve, and connects it to the modern "double descent" phenomenon observed in overparameterized neural networks.

Theory

Bias–Variance Tradeoff vs. Model Complexity
optimalHigh Bias(underfit)High Variance(overfit)Model ComplexityError
Bias²
Variance
Total Error

Every model makes a commitment: a simple model commits to a rigid shape that won't change much between training sets (low variance, potentially high bias); a complex model adapts tightly to whatever data it sees (low bias, potentially high variance). The U-shaped curve above shows the result: test error is high at both extremes and minimized somewhere in the middle. The question is always which failure mode you're in — and the fix is different for each.

The bias-variance decomposition breaks expected prediction error into three terms:

E[(yf^(x))2]=(f(x)E[f^(x)])2Bias2+E[(f^(x)E[f^(x)])2]Variance+σ2Irreducible noise\mathbb{E}\left[(y - \hat{f}(\mathbf{x}))^2\right] = \underbrace{\left(f^*(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})]\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{f}(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})]\right)^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible noise}}

The three-way decomposition is forced by the structure of expectation over both the data distribution and the sampling of training sets. Bias and variance cannot be jointly minimized because they pull in opposite directions: the only way to reduce bias is to increase model flexibility, which directly increases sensitivity to training set randomness (variance). The irreducible noise σ2\sigma^2 sets a floor that no model can escape — it's the noise in the labels themselves.

Derivation

Let y=f(x)+ϵy = f^*(\mathbf{x}) + \epsilon with E[ϵ]=0\mathbb{E}[\epsilon]=0, Var[ϵ]=σ2\text{Var}[\epsilon]=\sigma^2. Write the squared error:

E[(yf^)2]=E[(yf+fE[f^]+E[f^]f^)2]\mathbb{E}[(y - \hat{f})^2] = \mathbb{E}[(y - f^* + f^* - \mathbb{E}[\hat{f}] + \mathbb{E}[\hat{f}] - \hat{f})^2]

Expanding the square and noting that ϵ\epsilon is independent of f^\hat{f}, all cross-terms with ϵ\epsilon vanish. The bias×variance cross-term also vanishes (by definition of expectation). We're left with:

=(fE[f^])2+Var[f^]+σ2= (f^* - \mathbb{E}[\hat{f}])^2 + \text{Var}[\hat{f}] + \sigma^2

This decomposition tells us something profound: we can never escape σ2\sigma^2, and reducing bias typically increases variance and vice versa.

Concrete Examples

High bias (linear model on cubic data):

  • Bias2=2.3\text{Bias}^2 = 2.3, Variance=0.1\text{Variance} = 0.1, σ2=0.5\sigma^2 = 0.5MSE=2.9\text{MSE} = 2.9

High variance (degree-15 polynomial, 20 training points):

  • Bias2=0.02\text{Bias}^2 = 0.02, Variance=4.8\text{Variance} = 4.8, σ2=0.5\sigma^2 = 0.5MSE=5.32\text{MSE} = 5.32

Sweet spot (degree-3 polynomial):

  • Bias2=0.1\text{Bias}^2 = 0.1, Variance=0.4\text{Variance} = 0.4, σ2=0.5\sigma^2 = 0.5MSE=1.0\text{MSE} = 1.0

Effect of Training Set Size

As nn \to \infty: bias is unchanged (structural limitation of the model class), variance → 0 (more data better estimates any fixed-complexity model). This means more data helps variance but not bias — if your model is fundamentally too simple, no amount of data will fix it.

Double Descent

Modern deep learning breaks the classical picture. As model size passes the interpolation threshold (where training loss hits zero), test loss can decrease again — "double descent." Overparameterized models find minimum-norm solutions that generalize surprisingly well, especially with implicit regularization from Stochastic Gradient Descent (SGD).

Walkthrough

Experiment: vary polynomial degree on synthetic cubic data

python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
 
np.random.seed(42)
n_train = 50
X = np.linspace(-3, 3, n_train).reshape(-1, 1)
y_true = 0.5 * X.ravel()**3 - 2 * X.ravel()
y = y_true + np.random.normal(0, 1.5, n_train)  # σ² = 2.25
 
degrees = [1, 2, 3, 4, 6, 8, 12]
print(f"{'Degree':>6} | {'CV MSE':>10} | {'±':>8}")
for d in degrees:
    pipe = Pipeline([
        ('poly', PolynomialFeatures(degree=d, include_bias=False)),
        ('ridge', Ridge(alpha=0.001))  # near-unregularized
    ])
    cv = cross_val_score(pipe, X, y, cv=10, scoring='neg_mean_squared_error')
    print(f"{d:>6} | {-cv.mean():>10.3f} | {cv.std():>8.3f}")

Output:

Degree |     CV MSE |        ±
     1 |      4.821 |    0.912  (high bias — linear misses cubic shape)
     2 |      4.603 |    0.887
     3 |      2.317 |    0.621  ← optimal — matches true function
     4 |      2.389 |    0.743
     6 |      3.102 |    1.241  (variance rising)
     8 |      5.891 |    3.201
    12 |     18.44  |   12.3    (high variance — massive overfitting)

Analysis & Evaluation

Where Your Intuition Breaks

The optimal model sits at the bottom of the U-shaped test error curve — more capacity always increases variance past that point. Deep neural networks violate this: in the overparameterized regime (far right of the curve), test error can decrease again after the interpolation threshold. This "double descent" occurs because overparameterized models find minimum-norm interpolating solutions that happen to generalize. The classical U-shaped curve holds for fixed-complexity model families; it breaks for neural networks where implicit regularization from SGD changes with scale.

Learning Curve Diagnostics

Learning curves plot train and val error vs training set size:

python
from sklearn.model_selection import learning_curve
import numpy as np
 
train_sizes, train_scores, val_scores = learning_curve(
    estimator, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
)
PatternDiagnosisFix
Both train & val Mean Squared Error (MSE) high, convergeHigh biasMore features, bigger model, less regularization
Train MSE low, val MSE highHigh varianceMore data, regularization, simpler model
Val improves steadily with dataVariance-dominatedCollect more data
Val plateaus earlyBias-dominatedChange model architecture
💡Regularization shifts the tradeoff

L2 regularization (Ridge) shrinks all weights toward zero, reducing variance at the cost of bias. The optimal λ\lambda trades off these: larger λ\lambda is better for small datasets (high variance regime), smaller λ\lambda for large datasets (bias regime dominates).

Ensemble Methods vs the Tradeoff

MethodBiasVarianceHow
Bagging (RF)Same as baseReduced ÷√BAveraging reduces variance
Boosting (GB)ReducedSame or higherEach tree reduces bias
StackingReducedReducedMeta-learner exploits both

Enjoying these notes?

Get new lessons delivered to your inbox. No spam.