
Gradient Boosting & Tabular ML

Gradient boosted trees consistently outperform neural networks on structured tabular data — the format that underlies the majority of enterprise ML workloads. Unlike neural networks, tree ensembles require no normalization, handle mixed data types natively, and deliver competitive accuracy on datasets of under a million rows with minimal tuning. This lesson derives the gradient boosting algorithm from first principles, explains the engineering innovations in XGBoost and LightGBM, and walks through a complete five-sample worked example to make the residual-fitting intuition concrete.

Theory

[Figure: Gradient Boosting — Additive Tree Steps (step 1 of 5). True $y$ (●), current ensemble prediction $F_m(x)$, and per-sample residuals at $x = 1, \dots, 5$. Shown: the constant baseline $F_0 = \text{mean}(y) = 5.72$, with residuals $y - F_0 = [-3.22, -1.92, +0.48, +1.38, +3.28]$.]

Gradient boosting builds a committee where each new member corrects the mistakes of the previous ones. The first tree makes a rough prediction; the second tree learns from the first tree's errors; the third from the combined errors; and so on. The diagram above shows this residual-fitting process — each round's tree targets what the ensemble got wrong, not the original labels. The "gradient" in gradient boosting is this residual: it's the gradient of the loss with respect to the current prediction.

The Additive Model

Gradient Boosting Decision Trees (GBDT) build a prediction as a sum of $M$ weak learners, each a shallow decision tree:

$$F_M(\mathbf{x}) = F_0(\mathbf{x}) + \sum_{m=1}^{M} \eta \cdot h_m(\mathbf{x})$$

where $F_0$ is an initial constant prediction (typically $\bar{y}$ for regression), $h_m$ is the $m$-th tree, and $\eta \in (0, 1]$ is the learning rate. The model is built stagewise: each new tree is added without modifying the previous ones.

Negative Gradient / Pseudo-Residuals

At each step $m$, we want to find a tree that moves $F_{m-1}$ in the direction that most reduces the loss $\mathcal{L}$. By analogy with gradient descent in function space, we compute the pseudo-residual for each training sample:

$$r_{im} = -\left[\frac{\partial \mathcal{L}(y_i,\, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)}\right]_{F = F_{m-1}}$$

For mean squared error $\mathcal{L} = \tfrac{1}{2}(y_i - F(\mathbf{x}_i))^2$:

$$r_{im} = y_i - F_{m-1}(\mathbf{x}_i)$$

The pseudo-residuals are the ordinary residuals. For binary cross-entropy with $p_i = \sigma(F(\mathbf{x}_i))$:

$$r_{im} = y_i - p_{i,m-1}$$

Again, pseudo-residuals equal the residuals between true labels and current predicted probabilities — this is not coincidental but a direct consequence of the log-loss gradient.
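This identity is easy to verify numerically. A small sketch (our own check, not from any library): differentiate the log-loss with respect to the raw score $F$ by central differences and compare against $y - \sigma(F)$:

```python
import math

def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))

def log_loss(y, f):
    # binary cross-entropy on the raw score f, with p = sigmoid(f)
    p = sigmoid(f)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def pseudo_residual(y, f, eps=1e-6):
    # negative gradient of the loss w.r.t. the raw score, by central difference
    return -(log_loss(y, f + eps) - log_loss(y, f - eps)) / (2 * eps)

y, f = 1.0, 0.3
print(pseudo_residual(y, f))   # finite-difference negative gradient
print(y - sigmoid(f))          # closed form: residual on probabilities, ≈ 0.4256
```

Both lines print the same number: the log-loss gradient really is the probability-space residual.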

Fitting trees to the negative gradient — not the raw residuals — is what makes boosting work for any differentiable loss, not just MSE. For MSE, the negative gradient happens to equal the residual, which is why the two look the same in the regression case. For cross-entropy or other losses, the pseudo-residual is a corrected signal that accounts for the curvature of the loss surface. Friedman's key insight was recognizing that "fit the residual" is actually "do gradient descent in function space" — the generalization is what unlocks classification, ranking, and survival analysis.

GBDT Algorithm

Initialize F_0(x) = argmin_γ Σ L(y_i, γ)   # e.g., mean(y) for MSE

For m = 1 to M:
  1. Compute pseudo-residuals: r_im = -∂L/∂F(x_i) at F = F_{m-1}
  2. Fit tree h_m to {(x_i, r_im)} minimizing squared error on residuals
  3. Find optimal leaf values for each leaf j:
       γ_jm = argmin_γ Σ_{x_i in leaf_j} L(y_i, F_{m-1}(x_i) + γ)
  4. Update: F_m(x) = F_{m-1}(x) + η · h_m(x)

Return F_M
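The pseudocode maps almost line-for-line onto Python. A minimal MSE-only sketch using sklearn stumps as the weak learner (the function names `gbdt_fit` and `gbdt_predict` are ours); for MSE, step 3 is absorbed into step 2, because a regression tree's leaf means are already the loss-minimizing leaf values:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=100, lr=0.1, max_depth=1):
    """Minimal MSE gradient boosting: F_0 = mean(y), then fit trees to residuals."""
    f0 = y.mean()                                 # initialize F_0
    pred = np.full_like(y, f0, dtype=float)
    trees = []
    for _ in range(n_trees):
        residual = y - pred                       # step 1: negative gradient for MSE
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                     # step 2 (+3 for MSE): fit to residuals
        pred += lr * tree.predict(X)              # step 4: stagewise update
        trees.append(tree)
    return f0, trees

def gbdt_predict(X, f0, trees, lr=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += lr * tree.predict(X)
    return pred

# the five-sample dataset used in the walkthrough below
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.5, 3.8, 6.2, 7.1, 9.0])
f0, trees = gbdt_fit(X, y, n_trees=50, lr=0.1)
print(np.round(gbdt_predict(X, f0, trees), 2))
```

Fifty depth-1 stumps at a 0.1 learning rate drive the training predictions close to the labels.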

XGBoost: Second-Order Optimization

XGBoost (Chen & Guestrin, 2016) improves on vanilla GBDT with a second-order Taylor expansion of the loss:

$$\mathcal{L}^{(m)} \approx \sum_{i=1}^{n} \left[g_i f_m(\mathbf{x}_i) + \tfrac{1}{2} h_i f_m(\mathbf{x}_i)^2\right] + \Omega(f_m)$$

where $g_i = \partial_{\hat{y}} \ell(y_i, \hat{y}_i^{(m-1)})$ is the gradient and $h_i = \partial_{\hat{y}}^2 \ell(y_i, \hat{y}_i^{(m-1)})$ is the Hessian. The regularization term is:

$$\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \|w\|^2$$

where $T$ is the number of leaves, $w$ are leaf weights, $\gamma$ penalizes tree complexity, and $\lambda$ is L2 regularization on leaf values.

For a fixed tree structure, the optimal leaf weight for leaf $j$ is:

$$w_j^* = -\frac{G_j}{H_j + \lambda}, \quad G_j = \sum_{i \in \text{leaf}_j} g_i, \; H_j = \sum_{i \in \text{leaf}_j} h_i$$

The split gain determines whether a candidate split is accepted:

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$

A split is only accepted if Gain > 0. This elegantly combines information gain with regularization.
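To make the two formulas concrete, here is the leaf-weight and split-gain arithmetic on made-up gradient/Hessian sums (all numeric values are hypothetical, chosen only for illustration):

```python
def leaf_weight(G, H, lam):
    # w* = -G / (H + lambda)
    return -G / (H + lam)

def split_gain(GL, HL, GR, HR, lam, gamma):
    # Gain = 1/2 [ GL^2/(HL+lam) + GR^2/(HR+lam) - (GL+GR)^2/(HL+HR+lam) ] - gamma
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# toy sums: the left leaf collects negative gradients, the right leaf positive ones
GL, HL = -4.0, 3.0
GR, HR = 5.0, 4.0
lam, gamma = 1.0, 0.5
print(leaf_weight(GL, HL, lam))               # positive weight pushes left-leaf scores up
print(split_gain(GL, HL, GR, HR, lam, gamma))
```

With $\lambda = 1$ and $\gamma = 0.5$ this example split has positive gain, so it would be accepted.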

XGBoost engineering innovations:

  • Histogram binning: bucket continuous features into at most 256 bins for $O(n)$ split-finding
  • Column/row subsampling (colsample_bytree, subsample): reduce tree correlation and variance
  • Sparse-aware split finding: native handling of missing values via learned default directions

LightGBM: Leaf-Wise Growth and Efficient Sampling

LightGBM (Ke et al., 2017) introduces two key innovations for large-scale training:

GOSS (Gradient-based One-Side Sampling): Large-gradient samples contribute most to the split gain. GOSS keeps the top $a \times 100\%$ of instances ranked by gradient magnitude and randomly samples $b \times 100\%$ of the remaining low-gradient ones, upweighting the sampled ones by $\frac{1-a}{b}$ to keep the data distribution approximately unchanged. This reduces sample complexity while focusing compute where it matters.
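A sketch of the selection step in NumPy, under our own minimal reading of the procedure (LightGBM's internals differ in detail):

```python
import numpy as np

def goss_sample(grads, a=0.2, b=0.1, seed=0):
    """Keep the top a-fraction by |gradient|; sample a b-fraction of the rest,
    upweighted by (1 - a) / b so gain estimates stay approximately unbiased."""
    rng = np.random.default_rng(seed)
    n = len(grads)
    order = np.argsort(-np.abs(grads))        # indices sorted by |gradient|, descending
    n_top = int(a * n)
    top_idx = order[:n_top]
    rest = order[n_top:]
    rand_idx = rng.choice(rest, size=int(b * n), replace=False)
    idx = np.concatenate([top_idx, rand_idx])
    weights = np.ones(len(idx))
    weights[n_top:] = (1 - a) / b             # compensate for undersampling
    return idx, weights

grads = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(grads)
print(len(idx))   # 200 top-gradient + 100 sampled = 300 of 1000
```

The sampled instances carry weight $(1-a)/b = 8$ here, so the weighted sample still represents all 1000 rows.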

EFB (Exclusive Feature Bundling): Sparse features rarely take nonzero values simultaneously. EFB bundles mutually exclusive sparse features into single dense features, reducing the effective feature count the algorithm must scan.

Leaf-wise (best-first) growth: LightGBM always splits the leaf with the largest gain instead of growing level by level. This produces unbalanced trees that achieve lower loss with fewer leaves — but requires min_child_samples to prevent overfitting on small populations.

CatBoost: Ordered Target Encoding

CatBoost (Prokhorenkova et al., 2018) addresses target leakage in categorical feature encoding. Naive target encoding (replacing a category with its mean target) uses the current observation's own label, inflating apparent performance. CatBoost uses ordered target statistics: when encoding sample ii, it computes the mean target only from prior samples (in a random permutation) with the same category value. CatBoost also uses symmetric (oblivious) trees where every node at a given depth applies the same split, enabling efficient array-based inference.
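A minimal sketch of ordered target statistics (our own illustration; CatBoost additionally averages over several permutations and uses a configurable prior):

```python
import numpy as np

def ordered_target_encode(cats, y, prior=0.5, seed=0):
    """Encode each sample with the target mean of *earlier* samples only
    (in a random permutation), so no sample ever sees its own label."""
    n = len(cats)
    perm = np.random.default_rng(seed).permutation(n)
    sums, counts = {}, {}
    enc = np.empty(n)
    for i in perm:
        c = cats[i]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        enc[i] = (s + prior) / (k + 1)     # smoothed mean over prior samples only
        sums[c] = s + y[i]                 # only now does y[i] enter the statistics
        counts[c] = k + 1
    return enc

cats = ["a", "a", "b", "a", "b"]
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
print(ordered_target_encode(cats, y))
```

Because a sample's own label is added to the running statistics only after its encoding is computed, flipping that label never changes its own encoded value, which is exactly the leakage guarantee.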

Key Hyperparameters

| Parameter | Effect | Typical Range |
| --- | --- | --- |
| learning_rate | Step size per tree | 0.01–0.3 |
| n_estimators | Number of trees (use early stopping) | 100–5000 |
| max_depth | Maximum tree depth | 3–10 |
| subsample | Fraction of rows per tree | 0.6–1.0 |
| colsample_bytree | Fraction of features per tree | 0.5–1.0 |
| reg_lambda | L2 regularization on leaf weights | 0.1–10 |
| min_child_weight | Minimum Hessian sum in leaf | 1–20 |

Walkthrough

We fit a GBDT manually on five samples with $\eta = 0.5$ and MSE loss.

Dataset:

| i | x | y |
| --- | --- | --- |
| 1 | 1.0 | 2.5 |
| 2 | 2.0 | 3.8 |
| 3 | 3.0 | 6.2 |
| 4 | 4.0 | 7.1 |
| 5 | 5.0 | 9.0 |

Step 0 — Initialize $F_0 = \bar{y} = 5.72$.

Step 1 — Pseudo-residuals:

$$r_{i1} = y_i - 5.72 = [-3.22,\; -1.92,\; 0.48,\; 1.38,\; 3.28]$$

Fit a depth-1 stump to these residuals. The optimal split is at $x \leq 2.5$:

  • Left leaf (i = 1, 2): leaf value $= \frac{-3.22 + (-1.92)}{2} = -2.57$
  • Right leaf (i = 3, 4, 5): leaf value $= \frac{0.48 + 1.38 + 3.28}{3} = 1.71$

Update $F_1(\mathbf{x}_i) = 5.72 + 0.5 \cdot h_1(\mathbf{x}_i)$:

| i | F_1 | New residual y_i − F_1 |
| --- | --- | --- |
| 1 | 4.44 | −1.94 |
| 2 | 4.44 | −0.64 |
| 3 | 6.58 | −0.38 |
| 4 | 6.58 | 0.52 |
| 5 | 6.58 | 2.42 |

MSE dropped from 5.39 → 2.09.

Step 2 — Fit tree 2 on residuals $[-1.94, -0.64, -0.38, 0.52, 2.42]$.

The optimal split is at $x \leq 4.5$: left leaf $= -0.61$, right leaf $= +2.42$.

Update $F_2 = F_1 + 0.5 \cdot h_2$:

| i | y_i | F_2 | Error |
| --- | --- | --- | --- |
| 1 | 2.5 | 4.13 | −1.63 |
| 2 | 3.8 | 4.13 | −0.33 |
| 3 | 6.2 | 6.27 | −0.07 |
| 4 | 7.1 | 6.27 | +0.83 |
| 5 | 9.0 | 7.79 | +1.21 |

MSE dropped to 0.99. Each additional tree corrects the ensemble's remaining errors by fitting residuals in function space.
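The two boosting rounds above can be replayed in a few lines of NumPy to check the hand arithmetic:

```python
import numpy as np

y = np.array([2.5, 3.8, 6.2, 7.1, 9.0])
x = np.arange(1, 6)
eta = 0.5

F0 = y.mean()                                                # constant baseline 5.72
r1 = y - F0                                                  # round-1 pseudo-residuals
h1 = np.where(x <= 2, r1[x <= 2].mean(), r1[x > 2].mean())   # stump 1: split at x <= 2.5
F1 = F0 + eta * h1

r2 = y - F1                                                  # round-2 pseudo-residuals
h2 = np.where(x <= 4, r2[x <= 4].mean(), r2[x > 4].mean())   # stump 2: split at x <= 4.5
F2 = F1 + eta * h2

print(np.round(F1, 2), np.round(F2, 2))
print(round(((y - F1) ** 2).mean(), 2), round(((y - F2) ** 2).mean(), 2))
```

The printed predictions match the tables above (up to rounding of the displayed values).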

Analysis & Evaluation

Where Your Intuition Breaks

"More trees with a small learning rate always beats fewer trees with a large learning rate." With early stopping, this is often true. Without it, gradient boosting can overfit with enough trees regardless of learning rate — each new tree continues to reduce training loss even after test loss has started rising. The learning rate controls step size, not stopping point. The correct setup is: use a small learning rate, use early stopping on validation loss, and let the number of trees be determined by when overfitting begins rather than fixed in advance.
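The failure mode without early stopping is easy to reproduce with sklearn's `staged_predict`, which replays the ensemble one round at a time (synthetic data; all names and values here are ours):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)     # noisy target
X_tr, y_tr, X_va, y_va = X[:200], y[:200], X[200:], y[200:]

model = GradientBoostingRegressor(n_estimators=2000, learning_rate=0.1,
                                  max_depth=3, random_state=0).fit(X_tr, y_tr)

# MSE after each boosting round, on train and on held-out data
trn_mse = [((p - y_tr) ** 2).mean() for p in model.staged_predict(X_tr)]
val_mse = [((p - y_va) ** 2).mean() for p in model.staged_predict(X_va)]

best = int(np.argmin(val_mse))
print(f"best round: {best}")
print(f"val MSE there: {val_mse[best]:.3f}, val MSE at round 2000: {val_mse[-1]:.3f}")
```

Training MSE keeps falling all the way to round 2000, while validation MSE bottoms out much earlier and then climbs: exactly the gap early stopping is meant to catch.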

Algorithm Comparison

| Method | Training Speed | Accuracy (tabular) | Memory | Categorical Support | Strength |
| --- | --- | --- | --- | --- | --- |
| Decision Tree | Very fast | Low | Low | Native splits | Interpretable |
| Random Forest | Fast | Medium-high | Medium | Native splits | Low variance, robust |
| sklearn GBM | Slow | High | Medium | Ordinal encoding | Reference implementation |
| XGBoost | Fast | Very high | Medium | Ordinal encoding | Second-order, regularization |
| LightGBM | Very fast | Very high | Low | Native (cat_features) | GOSS/EFB, leaf-wise |
| CatBoost | Medium | Very high | Medium | Native ordered encoding | Best out-of-box on categoricals |

Trees vs Neural Networks for Tabular Data

Gradient boosted trees typically win on structured tabular data because:

  1. Scale invariance: trees are invariant to monotone feature transformations — no normalization needed
  2. Sparse and mixed data: handle missing values and mixed types without preprocessing
  3. Small data regime: trees generalize better when $n < 100\text{k}$
  4. Interpretability: SHAP values give exact feature attribution per prediction

Neural networks win when the dataset is very large (millions of rows with complex feature interactions), inputs contain high-dimensional embeddings, or the task benefits from transfer learning.

⚠️ Two main overfitting failure modes

(1) Too many trees without early stopping — training loss keeps dropping while validation loss rises. Always use a validation set with early_stopping_rounds=50. (2) Learning rate too high — the model memorizes training residuals too aggressively. Prefer learning_rate=0.05 with 1000+ trees over 0.3 with 100 trees.

Common Pitfalls

Feature leakage: Including features computed using the target variable inflates apparent performance. Use time-aware splits for temporal data and ordered encoding for categorical target statistics.

Feature importance bias: sklearn's built-in feature_importances_ (mean decrease in impurity) is biased toward high-cardinality features. Use permutation importance or SHAP values for reliable attribution.
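A sketch of the permutation-importance alternative on synthetic data where only the first feature carries signal (dataset and model choices are ours, for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)              # only feature 0 determines the label

model = GradientBoostingClassifier(random_state=0).fit(X[:400], y[:400])
result = permutation_importance(model, X[400:], y[400:],
                                n_repeats=10, random_state=0)

# mean accuracy drop when each feature is shuffled on held-out data
for i, imp in enumerate(result.importances_mean):
    print(f"feature_{i}: {imp:.3f}")
```

Because the score is measured on held-out data with one column shuffled at a time, the informative feature dominates and the noise features score near zero.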

Ignoring min_child_weight: On imbalanced datasets, small minority-class leaf populations cause overfitting on noise. Increasing min_child_weight forces each leaf to have a minimum sum of Hessian.

Production-Ready Code

Training with Early Stopping

python
import xgboost as xgb
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
 
# XGBoost with early stopping
def train_xgboost(X_train, y_train, X_val, y_val):
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval   = xgb.DMatrix(X_val,   label=y_val)
 
    params = {
        "objective":        "binary:logistic",
        "eval_metric":      "auc",
        "learning_rate":    0.05,
        "max_depth":        6,
        "subsample":        0.8,
        "colsample_bytree": 0.8,
        "reg_lambda":       1.0,
        "reg_alpha":        0.1,
        "tree_method":      "hist",   # GPU: "gpu_hist"
        "seed":             42,
    }
 
    model = xgb.train(
        params, dtrain, num_boost_round=2000,
        evals=[(dval, "val")],
        early_stopping_rounds=50,
        verbose_eval=200,
    )
    preds = model.predict(dval)
    print(f"XGBoost val AUC: {roc_auc_score(y_val, preds):.4f}")
    print(f"Best iteration:  {model.best_iteration}")
    return model
 
# LightGBM with early stopping
def train_lightgbm(X_train, y_train, X_val, y_val, cat_features=None):
    train_data = lgb.Dataset(X_train, label=y_train,
                             categorical_feature=cat_features or 'auto')
    val_data   = lgb.Dataset(X_val,   label=y_val, reference=train_data)
 
    params = {
        "objective":        "binary",
        "metric":           "auc",
        "learning_rate":    0.05,
        "num_leaves":       63,
        "min_child_samples": 20,
        "subsample":        0.8,
        "subsample_freq":   5,
        "colsample_bytree": 0.8,
        "reg_lambda":       1.0,
        "verbose":          -1,
        "seed":             42,
    }
 
    callbacks = [lgb.early_stopping(50), lgb.log_evaluation(200)]
    model = lgb.train(
        params, train_data, num_boost_round=2000,
        valid_sets=[val_data], callbacks=callbacks,
    )
    preds = model.predict(X_val)
    print(f"LightGBM val AUC: {roc_auc_score(y_val, preds):.4f}")
    return model

SHAP Feature Importance

python
import shap
import numpy as np
 
# xgb_model, X_val, feature_names come from the earlier training step
explainer   = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_val)
 
# Global importance: mean absolute SHAP per feature
mean_shap = np.abs(shap_values).mean(axis=0)
top = sorted(zip(feature_names, mean_shap), key=lambda x: -x[1])[:10]
for name, score in top:
    print(f"{name:30s} {score:.4f}")
 
# Summary plot (requires matplotlib)
shap.summary_plot(shap_values, X_val, feature_names=feature_names)
 
# Single-prediction explanation
shap.waterfall_plot(shap.Explanation(
    values=shap_values[0],
    base_values=explainer.expected_value,
    data=X_val[0],
    feature_names=feature_names,
))

Hyperparameter Search with Optuna

python
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)
 
def objective(trial):
    params = {
        "objective":        "binary:logistic",
        "eval_metric":      "auc",
        "tree_method":      "hist",
        "learning_rate":    trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "max_depth":        trial.suggest_int("max_depth", 3, 10),
        "subsample":        trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.4, 1.0),
        "reg_lambda":       trial.suggest_float("reg_lambda", 0.1, 10.0, log=True),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 20),
    }
 
    model = xgb.train(
        params, dtrain, num_boost_round=1000,
        evals=[(dval, "val")],
        early_stopping_rounds=30,
        verbose_eval=False,
    )
    return model.best_score
 
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100, timeout=600)
print("Best AUC:", study.best_value)
print("Best params:", study.best_params)

Serving with FastAPI

python
import numpy as np
import xgboost as xgb
from fastapi import FastAPI
from pydantic import BaseModel
 
app    = FastAPI(title="GBDT Inference")
_model = None
 
@app.on_event("startup")
def load():
    global _model
    _model = xgb.Booster()
    _model.load_model("artifacts/model.ubj")
 
class PredictRequest(BaseModel):
    features: list[float]
 
@app.post("/predict")
def predict(req: PredictRequest):
    # DMatrix expects an array, not a bare Python list
    dmat = xgb.DMatrix(np.asarray([req.features], dtype=np.float32))
    prob = float(_model.predict(dmat)[0])
    return {
        "probability": round(prob, 4),
        "label":       int(prob > 0.5),
    }

🚀 Monitoring in production

Track PSI (Population Stability Index) on input feature distributions — PSI above 0.2 on any key feature signals distribution shift and triggers retraining. Log prediction score distributions daily; a shift in mean predicted probability often precedes drift in actual outcomes. XGBoost and LightGBM models are typically under 50 MB and load in under 100 ms, making them well-suited for synchronous low-latency serving.
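PSI itself is only a few lines of NumPy. A common formulation (bin edges taken from the reference distribution; the exact binning convention varies between teams):

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample (e.g. training
    data) and a live sample, using quantile bins from the reference."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch out-of-range live values
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, eps, None)            # avoid log(0) on empty bins
    a_frac = np.clip(a_frac, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)
print(round(psi(train, rng.normal(0, 1, 10_000)), 3))    # same distribution: near zero
print(round(psi(train, rng.normal(0.5, 1, 10_000)), 3))  # mean shift: flags drift
```

An identical distribution scores close to zero; a half-standard-deviation mean shift pushes PSI well above the usual alerting band.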
