Train / Validation / Test Splits
Evaluation discipline separates models that work in notebooks from models that work in production. A recommendation model that looks great in offline validation can degrade in the first week post-launch when the validation split shuffles time — future watch history leaks into training. The fix requires rethinking the entire evaluation pipeline. Rigorous train/val/test splits, cross-validation for stable estimates, and Bayesian hyperparameter search with Optuna are the standard toolkit for anyone running model comparisons that need to hold up under scrutiny. This lesson covers the math of CV bias-variance, the Expected Improvement acquisition function behind Bayesian optimization, and the full practical workflow from raw data to a trusted final test score.
Theory
You can't grade your own homework. A model that selects its own hyperparameters using the same data it's evaluated on will always look better than it is; it has implicitly seen the answers. K-fold cross-validation addresses this: the data is split into K folds, the model is trained on K−1 of them and validated on the held-out fold, and the roles rotate until every example has been the "unseen" case exactly once. The final test set is the exam no one studies for.
Why Three Sets?
- Train: Fit model parameters (e.g. weights $w$, biases $b$)
- Validation: Tune hyperparameters, select model architecture
- Test: Final unbiased performance estimate — touch exactly once
Using validation performance to select the best model introduces selection bias: the best of $M$ candidates will, partly by chance, score higher on validation than it truly would on unseen data. The held-out test set corrects for this.
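A quick simulation makes the effect concrete. Here every candidate "model" has identical true accuracy, yet the best-of-$M$ winner's validation score is reliably inflated (the setup, 50 candidates and 200 validation samples, is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
true_acc = 0.70                      # every candidate's true accuracy
n_models, n_val, n_test = 50, 200, 200

# Each candidate's validation accuracy is a noisy binomial estimate
val_acc = rng.binomial(n_val, true_acc, size=n_models) / n_val
best = np.argmax(val_acc)

# Score the selected winner on a fresh test set
test_acc = rng.binomial(n_test, true_acc) / n_test

print(f"winner's validation accuracy: {val_acc[best]:.3f}")  # inflated above 0.70
print(f"winner on fresh test data:    {test_acc:.3f}")       # hovers around 0.70
```

The winner's validation score is the maximum of 50 noisy draws, so it overshoots the true value; the fresh test draw does not.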
K-Fold Cross-Validation
Partition the data into $K$ equal folds and rotate through each as the validation set:

$$\hat{R}_{\text{CV}} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|D_k|} \sum_{(x_i, y_i) \in D_k} L\big(\hat{f}^{(-k)}(x_i),\, y_i\big)$$

where $\hat{f}^{(-k)}$ is trained on all folds except $D_k$, and $D_k$ is the $k$-th fold. Standard choices: $K = 5$ (lower variance) or $K = 10$ (lower bias). $K = n$ (Leave-One-Out Cross-Validation, LOOCV) is nearly unbiased but expensive.
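The K-fold estimate is just a short loop. A minimal sketch on synthetic data (Ridge and the generated dataset are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_mse = []
for train_idx, val_idx in kf.split(X):
    # Train on K-1 folds, evaluate on the held-out fold
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    resid = model.predict(X[val_idx]) - y[val_idx]
    fold_mse.append(np.mean(resid ** 2))

print(f"5-fold CV MSE: {np.mean(fold_mse):.1f} ± {np.std(fold_mse):.1f}")
```

`cross_val_score` does exactly this rotation for you, but writing it out once makes clear that each example is validated on exactly once.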
Bias-Variance of CV Estimates
The variance of the CV estimate decomposes as

$$\mathrm{Var}\big(\hat{R}_{\text{CV}}\big) = \frac{1}{K}\,\mathrm{Var}(e_k) + \frac{K-1}{K}\,\mathrm{Cov}(e_k, e_j)$$

where $e_k$ is the error estimate from fold $k$. The covariance term is unavoidable: K-fold folds share training data, so their error estimates are correlated. This is why leave-one-out ($K = n$) has higher variance than 5-fold despite lower bias: each LOOCV fold trains on almost identical data, so the estimates are highly correlated. The optimal $K$ balances bias (fewer folds → each model trains on a smaller fraction of the data → pessimistic estimate) against variance (more folds → correlated estimates). $K = 5$ or $K = 10$ are empirically stable across problem sizes.
Stratified Splits
For classification, stratify splits so every fold preserves the overall class distribution:

$$\frac{|\{\, i \in D_k : y_i = c \,\}|}{|D_k|} \approx \frac{n_c}{n} \quad \text{for every class } c$$

Without stratification, a small imbalanced dataset might produce folds with zero positives, making AUC undefined.
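With a 5% positive rate, StratifiedKFold guarantees each fold gets its share of positives (toy data for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 200 samples, only 10 positives (5%)
y = np.array([1] * 10 + [0] * 190)
X = np.zeros((200, 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for _, val_idx in skf.split(X, y):
    print(f"positives in fold: {y[val_idx].sum()}")  # 2 in every fold
```

A plain KFold on the same data can easily put all 10 positives into two or three folds, leaving the rest with none.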
For time-series data, random splits leak future information into training. Use sklearn.model_selection.TimeSeriesSplit, which ensures train always precedes validation chronologically; its gap parameter leaves a buffer between train and validation to prevent look-ahead bias in auto-correlated series.
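A sketch of the chronological split (12 ordered samples and gap=1 are illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # samples in chronological order
tscv = TimeSeriesSplit(n_splits=3, gap=1)

for train_idx, val_idx in tscv.split(X):
    # Training indices always end at least `gap` steps before validation begins
    print(f"train <= {train_idx.max()}, val {val_idx.min()}-{val_idx.max()}")
```

Each successive split extends the training window forward in time; validation never precedes training.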
Hyperparameter Optimization Algorithms
Once you have a validation signal, the question becomes: how do you search the hyperparameter space efficiently?
Grid search evaluates every point on a fixed Cartesian grid. For $d$ hyperparameters each with $m$ candidate values, cost is $m^d$, which is completely impractical beyond 3–4 parameters.
Random search (Bergstra & Bengio, 2012) samples configurations uniformly at random. The key insight: if only a few hyperparameters actually matter, random search dedicates far more evaluations to those dimensions than a grid does. For a budget of $n$ trials it achieves effective resolution $n$ on the important dimensions versus $n^{1/d}$ for grid search, where $d$ is the number of hyperparameters.
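The resolution claim is easy to verify: with a budget of 16 trials in 2 dimensions, a grid covers only 4 distinct values per axis, while random sampling covers 16:

```python
import numpy as np

rng = np.random.default_rng(0)
budget, side = 16, 4                 # a 4 x 4 grid spends the same budget

grid = np.array([(a, b) for a in np.linspace(0, 1, side)
                        for b in np.linspace(0, 1, side)])
rand = rng.uniform(0, 1, size=(budget, 2))

# If only dimension 0 matters, random search probes it 4x more finely
print("grid distinct values on dim 0:  ", len(np.unique(grid[:, 0])))    # 4
print("random distinct values on dim 0:", len(np.unique(rand[:, 0])))    # 16
```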
Bayesian optimization builds a cheap surrogate model of the objective and uses it to pick the next configuration to evaluate. At each step it maximizes an acquisition function that trades off exploration (high uncertainty) and exploitation (high predicted value).
The Expected Improvement acquisition function:

$$\mathrm{EI}(x) = \mathbb{E}\big[\max\big(f(x) - f^{+},\, 0\big)\big]$$

where $f^{+}$ is the current best observed value. For a Gaussian surrogate with predictive mean $\mu(x)$ and standard deviation $\sigma(x)$:

$$\mathrm{EI}(x) = \big(\mu(x) - f^{+}\big)\,\Phi(Z) + \sigma(x)\,\phi(Z), \qquad Z = \frac{\mu(x) - f^{+}}{\sigma(x)}$$

with $\Phi$ and $\phi$ the standard normal CDF and PDF.
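The closed form for EI under a Gaussian surrogate is a few lines of NumPy/SciPy (maximization convention; the numbers in the demo are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for a Gaussian surrogate, maximization convention."""
    sigma = np.maximum(sigma, 1e-12)        # guard against zero uncertainty
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

# Exploitation: confident prediction slightly above the incumbent
print(expected_improvement(mu=0.85, sigma=0.01, f_best=0.80))
# Exploration: prediction below the incumbent, but very uncertain
print(expected_improvement(mu=0.78, sigma=0.10, f_best=0.80))
```

Both calls return positive EI: the first because the mean clearly beats the incumbent, the second because the uncertainty leaves real probability mass above it.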
Optuna's TPE (Tree-structured Parzen Estimator, Bergstra et al. 2011) avoids the cost of Gaussian Process (GP)-based surrogates by instead modeling the conditional distribution of configurations (written here for a minimized objective):

$$p(x \mid y) = \begin{cases} \ell(x) & \text{if } y < y^{*} \\ g(x) & \text{if } y \ge y^{*} \end{cases}$$

where $y^{*}$ is the $\gamma$-quantile of observed objective values (typically $\gamma \approx 0.25$). Both $\ell(x)$ and $g(x)$ are kernel density estimates. The acquisition is proportional to $\ell(x)/g(x)$: maximize the ratio of "looks like a good trial" to "looks like a bad trial." This is equivalent to maximizing EI under the model.
TPE is not just faster than GP-based Bayesian optimization — it also handles mixed search spaces (integers, categoricals, log-scale floats) more naturally. Gaussian processes require careful kernel design for categorical inputs; TPE handles them directly because KDE works on any probability space.
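A toy 1-D version of the TPE acquisition, using SciPy's Gaussian KDE in place of Optuna's Parzen estimators (the fake objective, history size, and γ here are made up for illustration):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Fake trial history: 1-D hyperparameter whose objective peaks near x = 0.3
x_hist = rng.uniform(0, 1, 40)
scores = -(x_hist - 0.3) ** 2 + rng.normal(0, 0.01, 40)

gamma = 0.25
cut = np.quantile(scores, 1 - gamma)          # top-γ trials count as "good"
l = gaussian_kde(x_hist[scores >= cut])       # density of good configs
g = gaussian_kde(x_hist[scores < cut])        # density of the rest

# Propose the candidate that maximizes the ratio l(x) / g(x)
cand = rng.uniform(0, 1, 200)
proposal = cand[np.argmax(l(cand) / g(cand))]
print(f"next proposal: {proposal:.2f}")       # should land near the peak at 0.3
```

The good-trial density $\ell$ concentrates around the peak while $g$ spreads over the rest of the range, so the ratio steers proposals toward the promising region.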
Walkthrough
Dataset: California Housing (20,640 samples, predict median house value)
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
data = fetch_california_housing()
X, y = data.data, data.target
# Step 1: Hold out test set FIRST — don't touch until final eval
X_trainval, X_test, y_trainval, y_test = train_test_split(
X, y, test_size=0.15, random_state=42
)
print(f"Train+Val: {len(X_trainval)}, Test: {len(X_test)}")
# Train+Val: 17544, Test: 3096
# Step 2: CV on trainval for model selection
kf = KFold(n_splits=5, shuffle=True, random_state=42)
ridge_cv = cross_val_score(Ridge(alpha=1.0), X_trainval, y_trainval, cv=kf, scoring='r2')
gb_cv = cross_val_score(
GradientBoostingRegressor(n_estimators=200, random_state=42),
X_trainval, y_trainval, cv=kf, scoring='r2'
)
print(f"Ridge R²: {ridge_cv.mean():.4f} ± {ridge_cv.std():.4f}") # 0.602 ± 0.008
print(f"GB R²: {gb_cv.mean():.4f} ± {gb_cv.std():.4f}") # 0.793 ± 0.009
# Step 3: Retrain best model on ALL trainval, evaluate ONCE on test
best = GradientBoostingRegressor(n_estimators=200, random_state=42)
best.fit(X_trainval, y_trainval)
final_r2 = best.score(X_test, y_test)
print(f"Final test R²: {final_r2:.4f}") # 0.801

Nested CV for Unbiased Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV, cross_val_score
inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}
gs = GridSearchCV(Ridge(), param_grid=param_grid, cv=inner_cv)
# Outer CV gives unbiased estimate of model-with-tuning performance
nested_score = cross_val_score(gs, X, y, cv=outer_cv, scoring='r2')
print(f"Nested CV R²: {nested_score.mean():.4f}") # Unbiased
# Without nesting, reporting the inner-CV best score is optimistic
# by ~0.5–3% depending on dataset
gs.fit(X, y)
print(f"Non-nested (inner-CV best) R²: {gs.best_score_:.4f}")

The inner loop tunes hyperparameters; the outer loop measures generalization. Grid search works here because alpha is a single 1D parameter with 5 values. Once you have 4+ parameters or log-scale ranges, replace the inner GridSearchCV with Optuna.
Hyperparameter Tuning with Optuna
Optuna replaces GridSearchCV while fitting naturally inside the same nested-CV structure. Each trial is a function call that proposes a configuration, evaluates it with inner CV, and returns the score. Optuna's TPE sampler then uses that result to guide the next proposal.
import optuna
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score
optuna.logging.set_verbosity(optuna.logging.WARNING)
def objective(trial):
# Define the search space — mixed types, log scale, conditional params
params = {
'n_estimators': trial.suggest_int('n_estimators', 50, 500),
'max_depth': trial.suggest_int('max_depth', 2, 8),
'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 20),
'subsample': trial.suggest_float('subsample', 0.5, 1.0),
'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None]),
}
model = GradientBoostingRegressor(**params, random_state=42)
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
model, X_trainval, y_trainval,
cv=inner_cv, scoring='r2', n_jobs=-1,
)
return scores.mean()
study = optuna.create_study(
direction='maximize',
sampler=optuna.samplers.TPESampler(seed=42),
)
study.optimize(objective, n_trials=100)
print(f"Best CV R²: {study.best_value:.4f}")
print(f"Best params:")
for k, v in study.best_params.items():
    print(f" {k}: {v}")

Best CV R²: 0.8317
Best params:
n_estimators: 387
max_depth: 5
learning_rate: 0.0412
min_samples_leaf: 3
subsample: 0.82
max_features: None
This beats the hand-picked n_estimators=200 baseline (0.793) by ~4% with zero manual iteration. The key improvement comes from the learning_rate/n_estimators trade-off — Optuna discovers that more trees at a lower rate generalizes better than fewer trees at the default rate.
Now fit the tuned model on all X_trainval and take the one test evaluation:
tuned_model = GradientBoostingRegressor(**study.best_params, random_state=42)
tuned_model.fit(X_trainval, y_trainval)
print(f"Final test R²: {tuned_model.score(X_test, y_test):.4f}") # 0.836

Optuna also surfaces the importance of each hyperparameter via its built-in analysis:
importances = optuna.importance.get_param_importances(study)
for param, imp in importances.items():
bar = '█' * int(imp * 40)
    print(f" {param:<22} {bar} {imp:.3f}")

learning_rate ████████████████████████████ 0.512
n_estimators ████████████ 0.218
max_depth ██████ 0.113
subsample ████ 0.087
min_samples_leaf ██ 0.051
max_features █ 0.019
learning_rate accounts for 51% of the variance in CV score, so a search over n_estimators alone (as in the earlier baseline) would have been optimizing the wrong axis.
The split discipline still applies: Optuna sees X_trainval and y_trainval only — never X_test. The 100 Optuna trials each involve inner CV on X_trainval. The test set is touched exactly once at the very end, after study.best_params is already fixed.
Analysis & Evaluation
Where Your Intuition Breaks
Cross-validation does not estimate the performance of the model you ship. It estimates performance for models trained on $\frac{K-1}{K}\,n$ samples, which is fewer than your full dataset. If you retrain on all the data after CV, the actual deployed model has seen more data and may perform better or worse than the CV estimate predicted. CV estimates performance of the process (train on 80% of this data), not the artifact (the model you'll actually ship). For small datasets, the gap between these can be meaningful.
HPO Method Comparison
| Method | Trials to match Optuna | Handles mixed types | Parallel | Use when |
|---|---|---|---|---|
| Grid search | N/A (exhaustive) | ✗ (manual encoding) | ✓ | ≤2 params, narrow range |
| Random search | ~3–5× more | ✓ | ✓ | Quick baselines, large budgets |
| Optuna (TPE) | 1× (baseline) | ✓ | ✓ | Standard choice |
| Ax (BoTorch/GP) | 0.5–0.8× | ✓ (with care) | ✓ | Very expensive trials, <50 total |
| Ray Tune | 1× (wraps Optuna) | ✓ | ✓✓ | Multi-GPU / distributed search |
The practical choice for Hyperparameter Optimization (HPO): use Optuna unless your trial takes >30 minutes (then consider Ax) or you need to distribute across a cluster (then Ray Tune wrapping Optuna).
K-Fold Comparison
| K | Bias | Variance | Compute | Use when |
|---|---|---|---|---|
| 2 | High | Low | Fast | Very large datasets |
| 5 | Medium | Medium | Moderate | Default choice |
| 10 | Low | Higher | Slow | Small datasets |
| n (LOOCV) | Lowest | High | Very slow | < 100 samples |
5-fold CV gives up to ~80% variance reduction vs a single hold-out (less in practice, since fold estimates are correlated), at 5× compute. For most datasets this is the right tradeoff. Use 10-fold only if the dataset has < 1,000 samples and compute is cheap.
How Many Optuna Trials?
A practical rule: start with $10 \times d$ trials, where $d$ is the number of hyperparameters. For 6 parameters, 60 trials. Plot the optimization history: if the best value plateaus before the budget is exhausted, you've converged.
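The plateau check is worth automating. A sketch (the helper name, window, and tolerance are made-up choices):

```python
def has_plateaued(history, window_frac=1/3, tol=1e-3):
    """True if the running best barely improved over the last window of trials."""
    best_so_far, best = [], float('-inf')
    for value in history:
        best = max(best, value)
        best_so_far.append(best)
    cut = int(len(best_so_far) * (1 - window_frac))
    return best_so_far[-1] - best_so_far[cut] < tol

# Converged: the best value has been stuck for the last trials
print(has_plateaued([0.10, 0.50, 0.70, 0.71, 0.71, 0.71]))  # True
# Still improving: keep the budget running
print(has_plateaued([0.10, 0.30, 0.50, 0.60, 0.70, 0.80]))  # False
```

With an Optuna study you would pass `[t.value for t in study.trials]` as the history.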
The TPE sampler needs enough "good" observations to fit its density model. With $\gamma \approx 0.25$ and 10–20 initial random trials (TPESampler's n_startup_trials defaults to 10), the surrogate becomes useful around trial 30. Fewer than 20–25 total trials means you're essentially doing random search.
Common Mistakes
- Preprocessing before splitting — StandardScaler fit on full data leaks test stats
- Optimizing on test set — invalidates the estimate; get more data instead
- Shuffling time series — future leaks into past, inflating performance by 5–50%
- Ignoring group structure — if samples have groups (patients, stores), use GroupKFold
- Using test set inside Optuna — the objective function must only touch X_trainval
- Reporting Optuna's best_value as your final score — that's still the CV estimate; report score(X_test, y_test) as the final number
from sklearn.model_selection import GroupKFold
# Each patient has multiple measurements — don't split within patients
groups = df['patient_id'].values
gkf = GroupKFold(n_splits=5)
cv_score = cross_val_score(model, X, y, cv=gkf.split(X, y, groups))
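The first mistake (preprocessing before splitting) has an equally mechanical fix: put the scaler inside a Pipeline so it is refit on the training folds only. A sketch on synthetic data (the dataset and Ridge are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# The scaler is refit inside every training fold; validation stats never leak
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(pipe, X, y, scoring='r2',
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(f"leak-free CV R²: {scores.mean():.3f}")
```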