Gradient Boosting & Tabular ML
Gradient boosted trees consistently outperform neural networks on structured tabular data — the format that underlies the majority of enterprise ML workloads. Unlike neural networks, tree ensembles require no normalization, handle mixed data types natively, and deliver competitive accuracy on datasets of under a million rows with minimal tuning. This lesson derives the gradient boosting algorithm from first principles, explains the engineering innovations in XGBoost and LightGBM, and walks through a complete five-sample worked example to make the residual-fitting intuition concrete.
Theory
Gradient boosting builds a committee where each new member corrects the mistakes of the previous ones. The first tree makes a rough prediction; the second tree learns from the first tree's errors; the third from the combined errors; and so on. Each round's tree targets what the ensemble got wrong, not the original labels. The "gradient" in gradient boosting is this residual: it is the negative gradient of the loss with respect to the current prediction.
The Additive Model
Gradient Boosting Decision Trees (GBDT) build a prediction as a sum of weak learners, each a shallow decision tree:
$$F_M(x) = F_0(x) + \eta \sum_{m=1}^{M} h_m(x)$$

where $F_0$ is an initial constant prediction (typically $\bar{y}$ for regression), $h_m$ is the $m$-th tree, and $\eta$ is the learning rate. The model is built stagewise: each new tree is added without modifying the previous ones.
Negative Gradient / Pseudo-Residuals
At each step $m$, we want to find a tree $h_m$ that moves $F_{m-1}$ in the direction that most reduces the loss $L$. By analogy with gradient descent in function space, we compute the pseudo-residual for each training sample:

$$r_{im} = -\left.\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right|_{F = F_{m-1}}$$
For mean squared error $L = \tfrac{1}{2}\left(y_i - F(x_i)\right)^2$:

$$r_{im} = y_i - F_{m-1}(x_i)$$
The pseudo-residuals are the ordinary residuals. For binary cross-entropy with $p_i = \sigma(F(x_i))$:

$$r_{im} = y_i - p_i$$
Again, pseudo-residuals equal the residuals between true labels and current predicted probabilities — this is not coincidental but a direct consequence of the log-loss gradient.
Fitting trees to the negative gradient — not the raw residuals — is what makes boosting work for any differentiable loss, not just MSE. For MSE, the negative gradient happens to equal the residual, which is why the two look the same in the regression case. For cross-entropy or other losses, the pseudo-residual is a corrected signal that accounts for the curvature of the loss surface. Friedman's key insight was recognizing that "fit the residual" is actually "do gradient descent in function space" — the generalization is what unlocks classification, ranking, and survival analysis.
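To make the cross-entropy case concrete, a quick numerical check (an illustrative sketch, not part of any library) compares a finite-difference estimate of the negative log-loss gradient against the residual $y - p$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, F):
    """Binary cross-entropy as a function of the raw score F."""
    p = sigmoid(F)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y, F = 1.0, 0.7        # true label and current raw prediction
eps = 1e-6

# Numerical negative gradient of the loss with respect to F
num_neg_grad = -(log_loss(y, F + eps) - log_loss(y, F - eps)) / (2 * eps)

# Pseudo-residual: true label minus predicted probability
pseudo_residual = y - sigmoid(F)
```

Both quantities agree to numerical precision, confirming that "fit the residual" really is "descend the log-loss gradient" in the classification case.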
GBDT Algorithm
```
Initialize F_0(x) = argmin_γ Σ L(y_i, γ)          # e.g., mean(y) for MSE
For m = 1 to M:
    1. Compute pseudo-residuals: r_im = -∂L/∂F(x_i) at F = F_{m-1}
    2. Fit tree h_m to {(x_i, r_im)} minimizing squared error on residuals
    3. Find optimal leaf values for each leaf j:
       γ_jm = argmin_γ Σ_{x_i in leaf_j} L(y_i, F_{m-1}(x_i) + γ)
    4. Update: F_m(x) = F_{m-1}(x) + η · h_m(x)
Return F_M
```
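The algorithm above fits in a few lines of Python. This is an illustrative from-scratch sketch for MSE loss only (the class name `SimpleGBDT` is made up for this lesson); note that for MSE, step 3's optimal leaf value is just the mean residual, which `DecisionTreeRegressor` already outputs:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class SimpleGBDT:
    """Minimal GBDT for MSE loss: each new tree fits the current residuals."""

    def __init__(self, n_estimators=50, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth

    def fit(self, X, y):
        self.f0 = float(np.mean(y))       # F_0: best constant under MSE
        pred = np.full(len(y), self.f0)
        self.trees = []
        for _ in range(self.n_estimators):
            residuals = y - pred          # negative gradient of MSE
            tree = DecisionTreeRegressor(max_depth=self.max_depth).fit(X, residuals)
            pred = pred + self.learning_rate * tree.predict(X)
            self.trees.append(tree)
        return self

    def predict(self, X):
        return self.f0 + self.learning_rate * sum(t.predict(X) for t in self.trees)

# Quick check on a noisy linear target: boosting should beat the constant baseline
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, size=200)
model = SimpleGBDT().fit(X, y)
train_mse = np.mean((y - model.predict(X)) ** 2)
```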
XGBoost: Second-Order Optimization
XGBoost (Chen & Guestrin, 2016) improves on vanilla GBDT with a second-order Taylor expansion of the loss:
$$\mathcal{L}^{(m)} \approx \sum_i \left[ g_i\, h(x_i) + \tfrac{1}{2} h_i\, h(x_i)^2 \right] + \Omega(h)$$

where $g_i = \partial L / \partial F_{m-1}(x_i)$ is the gradient and $h_i = \partial^2 L / \partial F_{m-1}(x_i)^2$ is the Hessian. The regularization term is:

$$\Omega(h) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$

where $T$ is the number of leaves, $w_j$ are leaf weights, $\gamma$ penalizes tree complexity, and $\lambda$ is L2 regularization on leaf values.
For a fixed tree structure, the optimal leaf weight for leaf $j$ is:

$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad G_j = \sum_{i \in \text{leaf}_j} g_i, \quad H_j = \sum_{i \in \text{leaf}_j} h_i$$
The split gain determines whether a candidate split is accepted:

$$\text{Gain} = \tfrac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$
A split is only accepted if Gain > 0. This elegantly combines information gain with regularization.
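The leaf-weight and gain formulas reduce to a few lines of arithmetic. A minimal sketch, taking per-leaf gradient sums $G$ and Hessian sums $H$ as inputs:

```python
def leaf_weight(G, H, lam=1.0):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """XGBoost structure gain for splitting a node into children (L, R)."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# A split that separates opposite-sign gradients is rewarded: the parent
# has G = 0, so all the gain comes from the children.
gain = split_gain(G_L=-4.0, H_L=4.0, G_R=4.0, H_R=4.0)  # 0.5*(16/5 + 16/5) = 3.2
w = leaf_weight(-4.0, 4.0)                              # -(-4)/(4+1) = 0.8
```

Raising `gamma` shifts the acceptance threshold: a split only survives pruning if its gain exceeds the complexity penalty.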
XGBoost engineering innovations:
- Histogram binning: bucket continuous features into at most 256 bins for split-finding
- Column/row subsampling (`colsample_bytree`, `subsample`): reduce tree correlation and variance
- Sparse-aware split finding: native handling of missing values via learned default directions
LightGBM: Leaf-Wise Growth and Efficient Sampling
LightGBM (Ke et al., 2017) introduces two key innovations for large-scale training:
GOSS (Gradient-based One-Side Sampling): Large-gradient samples contribute most to the split gain. GOSS keeps the top $a \times 100\%$ of instances by gradient magnitude and randomly samples $b \times 100\%$ of the remaining low-gradient ones, upweighting the sampled ones by $\frac{1-a}{b}$ to keep the gradient statistics unbiased. This reduces sample complexity while focusing compute where it matters.
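One GOSS round can be sketched in NumPy as follows. This is an illustration only: `goss_sample` is a made-up name, and the fractions `a` and `b` correspond to LightGBM's `top_rate` and `other_rate` parameters:

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=None):
    """Keep the top a-fraction by |gradient|, sample a b-fraction of the
    rest, and upweight the sampled rest by (1 - a) / b."""
    rng = rng or np.random.default_rng(0)
    n = len(gradients)
    n_top, n_rest = int(a * n), int(b * n)
    order = np.argsort(-np.abs(gradients))       # descending |gradient|
    top = order[:n_top]
    rest = rng.choice(order[n_top:], size=n_rest, replace=False)
    idx = np.concatenate([top, rest])
    weights = np.ones(n_top + n_rest)
    weights[n_top:] = (1 - a) / b                # compensate for undersampling
    return idx, weights

grads = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(grads)
# 200 high-gradient instances kept + 100 sampled low-gradient ones, weighted 8x
```

The upweighting keeps the sampled gradient sums (and hence split gains) approximately unbiased relative to using all the data.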
EFB (Exclusive Feature Bundling): Sparse features rarely take nonzero values simultaneously. EFB bundles mutually exclusive sparse features into single dense features, reducing the effective feature count the algorithm must scan.
Leaf-wise (best-first) growth: LightGBM always splits the leaf with the largest gain instead of growing level by level. This produces unbalanced trees that achieve lower loss with fewer leaves — but requires min_child_samples to prevent overfitting on small populations.
CatBoost: Ordered Target Encoding
CatBoost (Prokhorenkova et al., 2018) addresses target leakage in categorical feature encoding. Naive target encoding (replacing a category with its mean target) uses the current observation's own label, inflating apparent performance. CatBoost uses ordered target statistics: when encoding sample $i$, it computes the mean target only from prior samples (in a random permutation) with the same category value. CatBoost also uses symmetric (oblivious) trees where every node at a given depth applies the same split, enabling efficient array-based inference.
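Ordered target statistics can be sketched as a single pass over one permutation. This is a simplified illustration (CatBoost averages over several permutations and uses its own smoothing prior); `ordered_target_encoding` and its parameters are stand-ins:

```python
import numpy as np

def ordered_target_encoding(categories, targets, prior=0.5, strength=1.0):
    """Encode each sample using target statistics from EARLIER samples only.
    `prior` and `strength` smooth the estimate for rare categories."""
    sums, counts = {}, {}
    encoded = np.empty(len(targets))
    for i, (c, t) in enumerate(zip(categories, targets)):
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + strength * prior) / (n + strength)
        sums[c] = s + t          # sample i's own label is added only AFTER encoding
        counts[c] = n + 1
    return encoded

cats = ["a", "a", "b", "a"]
ys = [1, 0, 1, 1]
enc = ordered_target_encoding(cats, ys)
# enc[0] = 0.5 (no history yet), enc[1] = (1+0.5)/2 = 0.75, enc[3] = (1+0.5)/3 = 0.5
```

Because a sample's own label is added to the running sums only after it has been encoded, no observation ever sees its own target, which is exactly the leakage the naive encoder suffers from.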
Key Hyperparameters
| Parameter | Effect | Typical Range |
|---|---|---|
| `learning_rate` | Step size per tree | 0.01–0.3 |
| `n_estimators` | Number of trees (use early stopping) | 100–5000 |
| `max_depth` | Maximum tree depth | 3–10 |
| `subsample` | Fraction of rows per tree | 0.6–1.0 |
| `colsample_bytree` | Fraction of features per tree | 0.5–1.0 |
| `reg_lambda` | L2 regularization on leaf weights | 0.1–10 |
| `min_child_weight` | Minimum Hessian sum in leaf | 1–20 |
Walkthrough
We fit a GBDT manually on five samples with $\eta = 0.5$ and MSE loss.
Dataset:

| $i$ | $x_i$ | $y_i$ |
|---|---|---|
| 1 | 1.0 | 2.5 |
| 2 | 2.0 | 3.8 |
| 3 | 3.0 | 6.2 |
| 4 | 4.0 | 7.1 |
| 5 | 5.0 | 9.0 |
Step 0 — Initialize $F_0 = \bar{y} = (2.5 + 3.8 + 6.2 + 7.1 + 9.0)/5 = 5.72$.
Step 1 — Pseudo-residuals $r_i = y_i - F_0$: $(-3.22,\ -1.92,\ 0.48,\ 1.38,\ 3.28)$.
Fit a depth-1 stump to these residuals. Optimal split at $x \le 2.5$:
- Left leaf ($i = 1, 2$): leaf value $(-3.22 - 1.92)/2 = -2.57$
- Right leaf ($i = 3, 4, 5$): leaf value $(0.48 + 1.38 + 3.28)/3 = 1.71$
Update $F_1(x) = F_0 + 0.5 \cdot h_1(x)$:
| $i$ | $F_1(x_i)$ | New residual |
|---|---|---|
| 1 | 4.44 | −1.94 |
| 2 | 4.44 | −0.64 |
| 3 | 6.58 | −0.38 |
| 4 | 6.58 | 0.52 |
| 5 | 6.58 | 2.42 |
MSE dropped from 5.39 → 2.09.
Step 2 — Fit tree 2 on residuals $r^{(2)} = (-1.94,\ -0.64,\ -0.38,\ 0.52,\ 2.42)$.
Optimal split at $x \le 4.5$: left leaf ($i = 1$–$4$) value $= -0.61$, right leaf ($i = 5$) value $= 2.42$.
Update $F_2(x) = F_1(x) + 0.5 \cdot h_2(x)$:
| Error | |||
|---|---|---|---|
| 1 | 2.5 | 4.13 | −1.63 |
| 2 | 3.8 | 4.13 | −0.33 |
| 3 | 6.2 | 6.27 | −0.07 |
| 4 | 7.1 | 6.27 | +0.83 |
| 5 | 9.0 | 7.79 | +1.21 |
MSE dropped to 0.99. Each additional tree corrects the ensemble's remaining errors by fitting residuals in function space.
Analysis & Evaluation
Where Your Intuition Breaks
*"More trees with a small learning rate always beats fewer trees with a large learning rate."* With early stopping, this is often true. Without it, gradient boosting can overfit with enough trees regardless of learning rate — each new tree continues to reduce training loss even after test loss has started rising. The learning rate controls step size, not stopping point. The correct setup is: use a small learning rate, use early stopping on validation loss, and let the number of trees be determined by when overfitting begins rather than fixed in advance.
Algorithm Comparison
| Method | Training Speed | Accuracy (tabular) | Memory | Categorical Support | Strength |
|---|---|---|---|---|---|
| Decision Tree | Very fast | Low | Low | Native splits | Interpretable |
| Random Forest | Fast | Medium-high | Medium | Native splits | Low variance, robust |
| sklearn GBM | Slow | High | Medium | Ordinal encoding | Reference implementation |
| XGBoost | Fast | Very high | Medium | Ordinal encoding | Second-order, regularization |
| LightGBM | Very fast | Very high | Low | Native (cat_features) | GOSS/EFB, leaf-wise |
| CatBoost | Medium | Very high | Medium | Native ordered encoding | Best out-of-box on categoricals |
Trees vs Neural Networks for Tabular Data
Gradient boosted trees typically win on structured tabular data because:
- Scale invariance: trees are invariant to monotone feature transformations — no normalization needed
- Sparse and mixed data: handle missing values and mixed types without preprocessing
- Small data regime: trees generalize better on datasets below roughly a million rows, where neural networks lack the data to learn good representations
- Interpretability: SHAP values give exact feature attribution per prediction
Neural networks win when the dataset is very large (millions of rows with complex feature interactions), inputs contain high-dimensional embeddings, or the task benefits from transfer learning.
Common Pitfalls
(1) Too many trees without early stopping — training loss keeps dropping while validation loss rises. Always use a validation set with `early_stopping_rounds=50`. (2) Learning rate too high — the model memorizes training residuals too aggressively. Prefer `learning_rate=0.05` with 1000+ trees over 0.3 with 100 trees.
Feature leakage: Including features computed using the target variable inflates apparent performance. Use time-aware splits for temporal data and ordered encoding for categorical target statistics.
Feature importance bias: sklearn's built-in feature_importances_ (mean decrease in impurity) is biased toward high-cardinality features. Use permutation importance or SHAP values for reliable attribution.
Ignoring min_child_weight: On imbalanced datasets, small minority-class leaf populations cause overfitting on noise. Increasing min_child_weight forces each leaf to have a minimum sum of Hessian.
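The permutation-importance alternative mentioned above can be sketched with scikit-learn on a synthetic dataset (an illustrative setup; the dataset shape and scorer are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

# 3 informative features out of 8; the remainder are pure noise
X, y = make_classification(n_samples=600, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X, y)

# Shuffle one column at a time and measure the drop in AUC
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0,
                                scoring="roc_auc")
for i in np.argsort(-result.importances_mean):
    print(f"feature_{i}: {result.importances_mean[i]:.4f}")
```

Unlike impurity-based importance, this measures how much the model's score actually degrades when a feature's values are shuffled, so high-cardinality noise features no longer get a free ride.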
Production-Ready Code
Training with Early Stopping
```python
import xgboost as xgb
import lightgbm as lgb
from sklearn.metrics import roc_auc_score

# XGBoost with early stopping
def train_xgboost(X_train, y_train, X_val, y_val):
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval = xgb.DMatrix(X_val, label=y_val)
    params = {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "learning_rate": 0.05,
        "max_depth": 6,
        "subsample": 0.8,
        "colsample_bytree": 0.8,
        "reg_lambda": 1.0,
        "reg_alpha": 0.1,
        "tree_method": "hist",  # GPU: "gpu_hist"
        "seed": 42,
    }
    model = xgb.train(
        params, dtrain, num_boost_round=2000,
        evals=[(dval, "val")],
        early_stopping_rounds=50,
        verbose_eval=200,
    )
    preds = model.predict(dval)
    print(f"XGBoost val AUC: {roc_auc_score(y_val, preds):.4f}")
    print(f"Best iteration: {model.best_iteration}")
    return model

# LightGBM with early stopping
def train_lightgbm(X_train, y_train, X_val, y_val, cat_features=None):
    train_data = lgb.Dataset(X_train, label=y_train,
                             categorical_feature=cat_features or "auto")
    val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)
    params = {
        "objective": "binary",
        "metric": "auc",
        "learning_rate": 0.05,
        "num_leaves": 63,
        "min_child_samples": 20,
        "subsample": 0.8,
        "subsample_freq": 5,
        "colsample_bytree": 0.8,
        "reg_lambda": 1.0,
        "verbose": -1,
        "seed": 42,
    }
    callbacks = [lgb.early_stopping(50), lgb.log_evaluation(200)]
    model = lgb.train(
        params, train_data, num_boost_round=2000,
        valid_sets=[val_data], callbacks=callbacks,
    )
    preds = model.predict(X_val)
    print(f"LightGBM val AUC: {roc_auc_score(y_val, preds):.4f}")
    return model
```

SHAP Feature Importance
```python
import shap
import numpy as np

explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_val)

# Global importance: mean absolute SHAP per feature
mean_shap = np.abs(shap_values).mean(axis=0)
top = sorted(zip(feature_names, mean_shap), key=lambda x: -x[1])[:10]
for name, score in top:
    print(f"{name:30s} {score:.4f}")

# Summary plot (requires matplotlib)
shap.summary_plot(shap_values, X_val, feature_names=feature_names)

# Single-prediction explanation
shap.waterfall_plot(shap.Explanation(
    values=shap_values[0],
    base_values=explainer.expected_value,
    data=X_val[0],
    feature_names=feature_names,
))
```

Hyperparameter Search with Optuna
```python
import optuna
import xgboost as xgb

optuna.logging.set_verbosity(optuna.logging.WARNING)

# dtrain and dval are the DMatrix objects prepared during training
def objective(trial):
    params = {
        "objective": "binary:logistic",
        "eval_metric": "auc",
        "tree_method": "hist",
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.4, 1.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 0.1, 10.0, log=True),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 20),
    }
    model = xgb.train(
        params, dtrain, num_boost_round=1000,
        evals=[(dval, "val")],
        early_stopping_rounds=30,
        verbose_eval=False,
    )
    return model.best_score

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100, timeout=600)
print("Best AUC:", study.best_value)
print("Best params:", study.best_params)
```

Serving with FastAPI
```python
import xgboost as xgb
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="GBDT Inference")
_model = None

@app.on_event("startup")
def load():
    global _model
    _model = xgb.Booster()
    _model.load_model("artifacts/model.ubj")

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    dmat = xgb.DMatrix([req.features])
    prob = float(_model.predict(dmat)[0])
    return {
        "probability": round(prob, 4),
        "label": int(prob > 0.5),
    }
```

Track PSI (Population Stability Index) on input feature distributions — PSI above 0.2 on any key feature signals distribution shift and triggers retraining. Log prediction score distributions daily; a shift in mean predicted probability often precedes drift in actual outcomes. XGBoost and LightGBM models are typically under 50 MB and load in under 100 ms, making them well-suited for synchronous low-latency serving.
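A PSI check can be sketched as follows. This is an illustrative implementation: the function name, decile binning, and the 0.2 alert threshold are conventions rather than a standard API:

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10):
    """PSI between a reference (training) and a live feature distribution.
    Bins are deciles of the reference; proportions are clipped to avoid log(0)."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf      # catch out-of-range live values
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, 20_000)
psi_stable = population_stability_index(train_scores, rng.normal(0.0, 1.0, 20_000))
psi_shifted = population_stability_index(train_scores, rng.normal(1.0, 1.0, 20_000))
# psi_stable is near 0; psi_shifted lands well above the 0.2 alert threshold
```

Running this per feature per day against the training snapshot gives a cheap, model-agnostic drift alarm.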