Train / Validation / Test Splits
Evaluation discipline separates models that work in notebooks from models that work in production. A recommendation model that looks great in offline validation can degrade in the first week post-launch when the validation split shuffles time — future watch history leaks into training. The fix requires rethinking the entire evaluation pipeline. Rigorous train/val/test splits, cross-validation for stable estimates, and Bayesian hyperparameter search with Optuna are the standard toolkit for anyone running model comparisons that need to hold up under scrutiny. This lesson covers the math of CV bias-variance, the Expected Improvement acquisition function behind Bayesian optimization, and the full practical workflow from raw data to a trusted final test score.
Theory
You can't grade your own homework. A model that selects its own hyperparameters using the same data it's evaluated on will always look better than it is; it has implicitly seen the answers. K-fold cross-validation addresses this: the data is split into K folds, the model is trained on K−1 of them and validated on the held-out fold, and the roles rotate until every example has been the "unseen" case exactly once. The final test set is the exam no one studies for.
Why Three Sets?
- Train: Fit model parameters (e.g. weights $w$, biases $b$)
- Validation: Tune hyperparameters, select model architecture
- Test: Final unbiased performance estimate — touch exactly once
Using validation performance to select the best model introduces selection bias: the best of $M$ candidates will, partly by chance, score higher on validation than it truly would on unseen data. The held-out test set corrects for this.
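A quick simulation makes the effect concrete. Here every candidate "model" has identical true accuracy, yet the best-of-$M$ winner's validation score is reliably inflated (the setup, 50 candidates and 200 validation samples, is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
true_acc = 0.70                      # every candidate's true accuracy
n_models, n_val, n_test = 50, 200, 200

# Each candidate's validation accuracy is a noisy binomial estimate
val_acc = rng.binomial(n_val, true_acc, size=n_models) / n_val
best = np.argmax(val_acc)

# Score the selected winner on a fresh test set
test_acc = rng.binomial(n_test, true_acc) / n_test

print(f"winner's validation accuracy: {val_acc[best]:.3f}")  # inflated above 0.70
print(f"winner on fresh test data:    {test_acc:.3f}")       # hovers around 0.70
```

The winner's validation score is the maximum of 50 noisy draws, so it overshoots the true value; the fresh test draw does not.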
K-Fold Cross-Validation
Partition the data into $K$ equal folds and rotate through each as the validation set:

$$\hat{R}_{\text{CV}} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|D_k|} \sum_{(x_i, y_i) \in D_k} L\big(\hat{f}^{(-k)}(x_i),\, y_i\big)$$

where $\hat{f}^{(-k)}$ is trained on all folds except $D_k$, and $D_k$ is the $k$-th fold. Standard choices: $K = 5$ (lower variance) or $K = 10$ (lower bias). $K = n$ (Leave-One-Out Cross-Validation, LOOCV) is nearly unbiased but expensive.
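The K-fold estimate is just a short loop. A minimal sketch on synthetic data (Ridge and the generated dataset are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_mse = []
for train_idx, val_idx in kf.split(X):
    # Train on K-1 folds, evaluate on the held-out fold
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    resid = model.predict(X[val_idx]) - y[val_idx]
    fold_mse.append(np.mean(resid ** 2))

print(f"5-fold CV MSE: {np.mean(fold_mse):.1f} ± {np.std(fold_mse):.1f}")
```

`cross_val_score` does exactly this rotation for you, but writing it out once makes clear that each example is validated on exactly once.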
Bias-Variance of CV Estimates
The variance of the CV estimate decomposes as

$$\mathrm{Var}\big(\hat{R}_{\text{CV}}\big) = \frac{1}{K}\,\mathrm{Var}(e_k) + \frac{K-1}{K}\,\mathrm{Cov}(e_k, e_j)$$

where $e_k$ is the error estimate from fold $k$. The covariance term is unavoidable: K-fold folds share training data, so their error estimates are correlated. This is why leave-one-out ($K = n$) has higher variance than 5-fold despite lower bias: each LOOCV fold trains on almost identical data, so the estimates are highly correlated. The optimal $K$ balances bias (fewer folds → each model trains on a smaller fraction of the data → pessimistic estimate) against variance (more folds → correlated estimates). $K = 5$ or $K = 10$ are empirically stable across problem sizes.
Stratified Splits
For classification, stratify splits so every fold preserves the overall class distribution:

$$\frac{|\{\, i \in D_k : y_i = c \,\}|}{|D_k|} \approx \frac{n_c}{n} \quad \text{for every class } c$$

Without stratification, a small imbalanced dataset might produce folds with zero positives, making AUC undefined.
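With a 5% positive rate, StratifiedKFold guarantees each fold gets its share of positives (toy data for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 200 samples, only 10 positives (5%)
y = np.array([1] * 10 + [0] * 190)
X = np.zeros((200, 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for _, val_idx in skf.split(X, y):
    print(f"positives in fold: {y[val_idx].sum()}")  # 2 in every fold
```

A plain KFold on the same data can easily put all 10 positives into two or three folds, leaving the rest with none.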
For time-series data, random splits leak future information into training. Use sklearn.model_selection.TimeSeriesSplit, which ensures train always precedes validation chronologically; its gap parameter leaves a buffer between train and validation to prevent look-ahead bias in auto-correlated series.
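A sketch of the chronological split (12 ordered samples and gap=1 are illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # samples in chronological order
tscv = TimeSeriesSplit(n_splits=3, gap=1)

for train_idx, val_idx in tscv.split(X):
    # Training indices always end at least `gap` steps before validation begins
    print(f"train <= {train_idx.max()}, val {val_idx.min()}-{val_idx.max()}")
```

Each successive split extends the training window forward in time; validation never precedes training.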
Hyperparameter Optimization Algorithms
Once you have a validation signal, the question becomes: how do you search the hyperparameter space efficiently?
Grid search evaluates every point on a fixed Cartesian grid. For $d$ hyperparameters each with $m$ candidate values, cost is $m^d$, which is completely impractical beyond 3–4 parameters.
Random search (Bergstra & Bengio, 2012) samples configurations uniformly at random. The key insight: if only a few hyperparameters actually matter, random search dedicates far more evaluations to those dimensions than a grid does. For a budget of $n$ trials it achieves effective resolution $n$ on the important dimensions versus $n^{1/d}$ for grid search, where $d$ is the number of hyperparameters.
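The resolution claim is easy to verify: with a budget of 16 trials in 2 dimensions, a grid covers only 4 distinct values per axis, while random sampling covers 16:

```python
import numpy as np

rng = np.random.default_rng(0)
budget, side = 16, 4                 # a 4 x 4 grid spends the same budget

grid = np.array([(a, b) for a in np.linspace(0, 1, side)
                        for b in np.linspace(0, 1, side)])
rand = rng.uniform(0, 1, size=(budget, 2))

# If only dimension 0 matters, random search probes it 4x more finely
print("grid distinct values on dim 0:  ", len(np.unique(grid[:, 0])))    # 4
print("random distinct values on dim 0:", len(np.unique(rand[:, 0])))    # 16
```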
Bayesian optimization builds a cheap surrogate model of the objective and uses it to pick the next configuration to evaluate. At each step it maximizes an acquisition function that trades off exploration (high uncertainty) and exploitation (high predicted value).
The Expected Improvement acquisition function:

$$\mathrm{EI}(x) = \mathbb{E}\big[\max\big(f(x) - f^{+},\, 0\big)\big]$$

where $f^{+}$ is the current best observed value. For a Gaussian surrogate with predictive mean $\mu(x)$ and standard deviation $\sigma(x)$:

$$\mathrm{EI}(x) = \big(\mu(x) - f^{+}\big)\,\Phi(Z) + \sigma(x)\,\phi(Z), \qquad Z = \frac{\mu(x) - f^{+}}{\sigma(x)}$$

with $\Phi$ and $\phi$ the standard normal CDF and PDF.
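The closed form for EI under a Gaussian surrogate is a few lines of NumPy/SciPy (maximization convention; the numbers in the demo are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for a Gaussian surrogate, maximization convention."""
    sigma = np.maximum(sigma, 1e-12)        # guard against zero uncertainty
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

# Exploitation: confident prediction slightly above the incumbent
print(expected_improvement(mu=0.85, sigma=0.01, f_best=0.80))
# Exploration: prediction below the incumbent, but very uncertain
print(expected_improvement(mu=0.78, sigma=0.10, f_best=0.80))
```

Both calls return positive EI: the first because the mean clearly beats the incumbent, the second because the uncertainty leaves real probability mass above it.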
Optuna's TPE (Tree-structured Parzen Estimator, Bergstra et al. 2011) avoids the cost of Gaussian Process (GP)-based surrogates by instead modeling the conditional distribution of configurations (written here for a minimized objective):

$$p(x \mid y) = \begin{cases} \ell(x) & \text{if } y < y^{*} \\ g(x) & \text{if } y \ge y^{*} \end{cases}$$

where $y^{*}$ is the $\gamma$-quantile of observed objective values (typically $\gamma \approx 0.25$). Both $\ell(x)$ and $g(x)$ are kernel density estimates. The acquisition is proportional to $\ell(x)/g(x)$: maximize the ratio of "looks like a good trial" to "looks like a bad trial." This is equivalent to maximizing EI under the model.
TPE is not just faster than GP-based Bayesian optimization — it also handles mixed search spaces (integers, categoricals, log-scale floats) more naturally. Gaussian processes require careful kernel design for categorical inputs; TPE handles them directly because KDE works on any probability space.
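A toy 1-D version of the TPE acquisition, using SciPy's Gaussian KDE in place of Optuna's Parzen estimators (the fake objective, history size, and γ here are made up for illustration):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Fake trial history: 1-D hyperparameter whose objective peaks near x = 0.3
x_hist = rng.uniform(0, 1, 40)
scores = -(x_hist - 0.3) ** 2 + rng.normal(0, 0.01, 40)

gamma = 0.25
cut = np.quantile(scores, 1 - gamma)          # top-γ trials count as "good"
l = gaussian_kde(x_hist[scores >= cut])       # density of good configs
g = gaussian_kde(x_hist[scores < cut])        # density of the rest

# Propose the candidate that maximizes the ratio l(x) / g(x)
cand = rng.uniform(0, 1, 200)
proposal = cand[np.argmax(l(cand) / g(cand))]
print(f"next proposal: {proposal:.2f}")       # should land near the peak at 0.3
```

The good-trial density $\ell$ concentrates around the peak while $g$ spreads over the rest of the range, so the ratio steers proposals toward the promising region.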
Walkthrough
Dataset: California Housing (20,640 samples, predict median house value)
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
data = fetch_california_housing()
X, y = data.data, data.target
# Step 1: Hold out test set FIRST — don't touch until final eval
X_trainval, X_test, y_trainval, y_test = train_test_split(
X, y, test_size=0.15, random_state=42
)
print(f"Train+Val: {len(X_trainval)}, Test: {len(X_test)}")
# Train+Val: 17544, Test: 3096
# Step 2: CV on trainval for model selection
kf = KFold(n_splits=5, shuffle=True, random_state=42)
ridge_cv = cross_val_score(Ridge(alpha=1.0), X_trainval, y_trainval, cv=kf, scoring='r2')
gb_cv = cross_val_score(
GradientBoostingRegressor(n_estimators=200, random_state=42),
X_trainval, y_trainval, cv=kf, scoring='r2'
)
print(f"Ridge R²: {ridge_cv.mean():.4f} ± {ridge_cv.std():.4f}") # 0.602 ± 0.008
print(f"GB R²: {gb_cv.mean():.4f} ± {gb_cv.std():.4f}") # 0.793 ± 0.009
# Step 3: Retrain best model on ALL trainval, evaluate ONCE on test
best = GradientBoostingRegressor(n_estimators=200, random_state=42)
best.fit(X_trainval, y_trainval)
final_r2 = best.score(X_test, y_test)
print(f"Final test R²: {final_r2:.4f}") # 0.801

Nested CV for Unbiased Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV, cross_val_score
inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
gs = GridSearchCV(Ridge(), param_grid=param_grid, cv=inner_cv)
# Outer CV gives unbiased estimate of model-with-tuning performance
nested_score = cross_val_score(gs, X, y, cv=outer_cv, scoring='r2')
print(f"Nested CV R²: {nested_score.mean():.4f}") # Unbiased
# Without nesting, reporting the inner-CV best score is optimistic
# by ~0.5–3% depending on dataset
gs.fit(X, y)
print(f"Non-nested (inner-CV best) R²: {gs.best_score_:.4f}")

The inner loop tunes hyperparameters; the outer loop measures generalization. Grid search works here because alpha is a single 1D parameter with 5 values. Once you have 4+ parameters or log-scale ranges, replace the inner GridSearchCV with Optuna.
Hyperparameter Tuning with Optuna
Optuna replaces GridSearchCV while fitting naturally inside the same nested-CV structure. Each trial is a function call that proposes a configuration, evaluates it with inner CV, and returns the score. Optuna's TPE sampler then uses that result to guide the next proposal.
import optuna
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score
optuna.logging.set_verbosity(optuna.logging.WARNING)
def objective(trial):
# Define the search space — mixed types, log scale, conditional params
params = {
'n_estimators': trial.suggest_int('n_estimators', 50, 500),
'max_depth': trial.suggest_int('max_depth', 2, 8),
'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 20),
'subsample': trial.suggest_float('subsample', 0.5, 1.0),
'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None]),
}
model = GradientBoostingRegressor(**params, random_state=42)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
model, X_trainval, y_trainval,
cv=inner_cv, scoring='r2', n_jobs=-1,
)
return scores.mean()
study = optuna.create_study(
direction='maximize',
sampler=optuna.samplers.TPESampler(seed=42),
)
study.optimize(objective, n_trials=100)
print(f"Best CV R²: {study.best_value:.4f}")
print(f"Best params:")
for k, v in study.best_params.items():
    print(f" {k}: {v}")

Best CV R²: 0.8317
Best params:
n_estimators: 387
max_depth: 5
learning_rate: 0.0412
min_samples_leaf: 3
subsample: 0.82
max_features: None
This beats the hand-picked n_estimators=200 baseline (0.793) by ~4% with zero manual iteration. The key improvement comes from the learning_rate/n_estimators trade-off — Optuna discovers that more trees at a lower rate generalizes better than fewer trees at the default rate.
Now fit the tuned model on all X_trainval and take the one test evaluation:
tuned_model = GradientBoostingRegressor(**study.best_params, random_state=42)
tuned_model.fit(X_trainval, y_trainval)
print(f"Final test R²: {tuned_model.score(X_test, y_test):.4f}") # 0.836

Optuna also surfaces the importance of each hyperparameter via its built-in analysis:
importances = optuna.importance.get_param_importances(study)
for param, imp in importances.items():
bar = '█' * int(imp * 40)
    print(f" {param:<22} {bar} {imp:.3f}")

learning_rate ████████████████████████████ 0.512
n_estimators ████████████ 0.218
max_depth ██████ 0.113
subsample ████ 0.087
min_samples_leaf ██ 0.051
max_features █ 0.019
learning_rate accounts for 51% of the variance in CV score, so a search over n_estimators alone (as in the earlier baseline) would have been optimizing the wrong axis.
The split discipline still applies: Optuna sees X_trainval and y_trainval only — never X_test. The 100 Optuna trials each involve inner CV on X_trainval. The test set is touched exactly once at the very end, after study.best_params is already fixed.
Analysis & Evaluation
Where Your Intuition Breaks
Cross-validation does not estimate the performance of the model you ship. It estimates performance for models trained on $\frac{K-1}{K}\,n$ samples, which is fewer than your full dataset. If you retrain on all the data after CV, the actual deployed model has seen more data and may perform better or worse than the CV estimate predicted. CV estimates performance of the process (train on 80% of this data), not the artifact (the model you'll actually ship). For small datasets, the gap between these can be meaningful.
HPO Method Comparison
| Method | Trials to match Optuna | Handles mixed types | Parallel | Use when |
|---|---|---|---|---|
| Grid search | N/A (exhaustive) | ✗ (manual encoding) | ✓ | ≤2 params, narrow range |
| Random search | ~3–5× more | ✓ | ✓ | Quick baselines, large budgets |
| Optuna (TPE) | 1× (baseline) | ✓ | ✓ | Standard choice |
| Ax (BoTorch/GP) | 0.5–0.8× | ✓ (with care) | ✓ | Very expensive trials, <50 total |
| Ray Tune | 1× (wraps Optuna) | ✓ | ✓✓ | Multi-GPU / distributed search |
The practical choice for Hyperparameter Optimization (HPO): use Optuna unless your trial takes >30 minutes (then consider Ax) or you need to distribute across a cluster (then Ray Tune wrapping Optuna).
K-Fold Comparison
| K | Bias | Variance | Compute | Use when |
|---|---|---|---|---|
| 2 | High | Low | Fast | Very large datasets |
| 5 | Medium | Medium | Moderate | Default choice |
| 10 | Low | Higher | Slow | Small datasets |
| n (LOOCV) | Lowest | High | Very slow | < 100 samples |
5-fold CV gives up to ~80% variance reduction vs a single hold-out (less in practice, since fold estimates are correlated), at 5× compute. For most datasets this is the right tradeoff. Use 10-fold only if the dataset has < 1,000 samples and compute is cheap.
How Many Optuna Trials?
A practical rule: start with $10 \times d$ trials, where $d$ is the number of hyperparameters. For 6 parameters, 60 trials. Plot the optimization history: if the best value plateaus before the budget is exhausted, you've converged.
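The plateau check is worth automating. A sketch (the helper name, window, and tolerance are made-up choices):

```python
def has_plateaued(history, window_frac=1/3, tol=1e-3):
    """True if the running best barely improved over the last window of trials."""
    best_so_far, best = [], float('-inf')
    for value in history:
        best = max(best, value)
        best_so_far.append(best)
    cut = int(len(best_so_far) * (1 - window_frac))
    return best_so_far[-1] - best_so_far[cut] < tol

# Converged: the best value has been stuck for the last trials
print(has_plateaued([0.10, 0.50, 0.70, 0.71, 0.71, 0.71]))  # True
# Still improving: keep the budget running
print(has_plateaued([0.10, 0.30, 0.50, 0.60, 0.70, 0.80]))  # False
```

With an Optuna study you would pass `[t.value for t in study.trials]` as the history.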
The TPE sampler needs enough "good" observations to fit its density model. With $\gamma \approx 0.25$ and 10–20 initial random trials (TPESampler's n_startup_trials defaults to 10), the surrogate becomes useful around trial 30. Fewer than 20–25 total trials means you're essentially doing random search.
Common Mistakes
- Preprocessing before splitting — StandardScaler fit on full data leaks test stats
- Optimizing on test set — invalidates the estimate; get more data instead
- Shuffling time series — future leaks into past, inflating performance by 5–50%
- Ignoring group structure — if samples have groups (patients, stores), use GroupKFold
- Using test set inside Optuna — the objective function must only touch X_trainval
- Reporting Optuna's best_value as your final score — that's still the CV estimate; report score(X_test, y_test) as the final number
from sklearn.model_selection import GroupKFold
# Each patient has multiple measurements — don't split within patients
groups = df['patient_id'].values
gkf = GroupKFold(n_splits=5)
cv_score = cross_val_score(model, X, y, cv=gkf.split(X, y, groups))
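The first mistake (preprocessing before splitting) has an equally mechanical fix: put the scaler inside a Pipeline so it is refit on the training folds only. A sketch on synthetic data (the dataset and Ridge are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# The scaler is refit inside every training fold; validation stats never leak
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(pipe, X, y, scoring='r2',
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(f"leak-free CV R²: {scores.mean():.3f}")
```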