
Gradient Boosting & Tabular ML

Gradient boosted trees consistently outperform neural networks on structured tabular data — the format that underlies the majority of enterprise ML workloads. Unlike neural networks, tree ensembles require no normalization, handle mixed data types natively, and deliver competitive accuracy on datasets of under a million rows with minimal tuning. This lesson derives the gradient boosting algorithm from first principles, explains the engineering innovations in XGBoost and LightGBM, and walks through a complete five-sample worked example to make the residual-fitting intuition concrete.

Theory

[Figure: Gradient Boosting — Additive Tree Steps (step 1 of 5). True $y$ (●), current ensemble prediction $F_m(x)$, and per-sample residuals at $x = 1, \dots, 5$. Shown: the constant baseline $F_0 = \text{mean}(y) = 5.72$, with residuals $y - F_0 = [-3.22, -1.92, +0.48, +1.38, +3.28]$.]

Gradient boosting builds a committee where each new member corrects the mistakes of the previous ones. The first tree makes a rough prediction; the second tree learns from the first tree's errors; the third from the combined errors; and so on. The diagram above shows this residual-fitting process — each round's tree targets what the ensemble got wrong, not the original labels. The "gradient" in gradient boosting is this residual: it's the gradient of the loss with respect to the current prediction.

The Additive Model

Gradient Boosting Decision Trees (GBDT) build a prediction as a sum of $M$ weak learners, each a shallow decision tree:

$$F_M(\mathbf{x}) = F_0(\mathbf{x}) + \sum_{m=1}^{M} \eta \cdot h_m(\mathbf{x})$$

where $F_0$ is an initial constant prediction (typically $\bar{y}$ for regression), $h_m$ is the $m$-th tree, and $\eta \in (0, 1]$ is the learning rate. The model is built stagewise: each new tree is added without modifying the previous ones.

Negative Gradient / Pseudo-Residuals

At each step $m$, we want to find a tree that moves $F_{m-1}$ in the direction that most reduces the loss $\mathcal{L}$. By analogy with gradient descent in function space, we compute the pseudo-residual for each training sample:

$$r_{im} = -\left[\frac{\partial \mathcal{L}(y_i,\, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)}\right]_{F = F_{m-1}}$$

For mean squared error $\mathcal{L} = \tfrac{1}{2}(y_i - F(\mathbf{x}_i))^2$:

$$r_{im} = y_i - F_{m-1}(\mathbf{x}_i)$$

The pseudo-residuals are the ordinary residuals. For binary cross-entropy with $p_i = \sigma(F(\mathbf{x}_i))$:

$$r_{im} = y_i - p_{i,m-1}$$

Again, pseudo-residuals equal the residuals between true labels and current predicted probabilities — this is not coincidental but a direct consequence of the log-loss gradient.
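This identity is easy to verify numerically. A small sketch (our own check, not from any library): differentiate the log-loss with respect to the raw score $F$ by central differences and compare against $y - \sigma(F)$:

```python
import math

def sigmoid(f):
    return 1.0 / (1.0 + math.exp(-f))

def log_loss(y, f):
    # binary cross-entropy on the raw score f, with p = sigmoid(f)
    p = sigmoid(f)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def pseudo_residual(y, f, eps=1e-6):
    # negative gradient of the loss w.r.t. the raw score, by central difference
    return -(log_loss(y, f + eps) - log_loss(y, f - eps)) / (2 * eps)

y, f = 1.0, 0.3
print(pseudo_residual(y, f))   # finite-difference negative gradient
print(y - sigmoid(f))          # closed form: residual on probabilities, ≈ 0.4256
```

Both lines print the same number: the log-loss gradient really is the probability-space residual.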

Fitting trees to the negative gradient — not the raw residuals — is what makes boosting work for any differentiable loss, not just MSE. For MSE, the negative gradient happens to equal the residual, which is why the two look the same in the regression case. For cross-entropy or other losses, the pseudo-residual is a corrected signal that accounts for the curvature of the loss surface. Friedman's key insight was recognizing that "fit the residual" is actually "do gradient descent in function space" — the generalization is what unlocks classification, ranking, and survival analysis.

GBDT Algorithm

Initialize F_0(x) = argmin_γ Σ L(y_i, γ)   # e.g., mean(y) for MSE

For m = 1 to M:
  1. Compute pseudo-residuals: r_im = -∂L/∂F(x_i) at F = F_{m-1}
  2. Fit tree h_m to {(x_i, r_im)} minimizing squared error on residuals
  3. Find optimal leaf values for each leaf j:
       γ_jm = argmin_γ Σ_{x_i in leaf_j} L(y_i, F_{m-1}(x_i) + γ)
  4. Update: F_m(x) = F_{m-1}(x) + η · h_m(x)

Return F_M
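The pseudocode maps almost line-for-line onto Python. A minimal MSE-only sketch using sklearn stumps as the weak learner (the function names `gbdt_fit` and `gbdt_predict` are ours); for MSE, step 3 is absorbed into step 2, because a regression tree's leaf means are already the loss-minimizing leaf values:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=100, lr=0.1, max_depth=1):
    """Minimal MSE gradient boosting: F_0 = mean(y), then fit trees to residuals."""
    f0 = y.mean()                                 # initialize F_0
    pred = np.full_like(y, f0, dtype=float)
    trees = []
    for _ in range(n_trees):
        residual = y - pred                       # step 1: negative gradient for MSE
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                     # step 2 (+3 for MSE): fit to residuals
        pred += lr * tree.predict(X)              # step 4: stagewise update
        trees.append(tree)
    return f0, trees

def gbdt_predict(X, f0, trees, lr=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += lr * tree.predict(X)
    return pred

# the five-sample dataset used in the walkthrough below
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.5, 3.8, 6.2, 7.1, 9.0])
f0, trees = gbdt_fit(X, y, n_trees=50, lr=0.1)
print(np.round(gbdt_predict(X, f0, trees), 2))
```

Fifty depth-1 stumps at a 0.1 learning rate drive the training predictions close to the labels.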

XGBoost: Second-Order Optimization

XGBoost (Chen & Guestrin, 2016) improves on vanilla GBDT with a second-order Taylor expansion of the loss:

$$\mathcal{L}^{(m)} \approx \sum_{i=1}^{n} \left[g_i f_m(\mathbf{x}_i) + \tfrac{1}{2} h_i f_m(\mathbf{x}_i)^2\right] + \Omega(f_m)$$

where $g_i = \partial_{\hat{y}} \ell(y_i, \hat{y}_i^{(m-1)})$ is the gradient and $h_i = \partial_{\hat{y}}^2 \ell(y_i, \hat{y}_i^{(m-1)})$ is the Hessian. The regularization term is:

$$\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \|w\|^2$$

where $T$ is the number of leaves, $w$ are leaf weights, $\gamma$ penalizes tree complexity, and $\lambda$ is L2 regularization on leaf values.

For a fixed tree structure, the optimal leaf weight for leaf $j$ is:

$$w_j^* = -\frac{G_j}{H_j + \lambda}, \quad G_j = \sum_{i \in \text{leaf}_j} g_i, \; H_j = \sum_{i \in \text{leaf}_j} h_i$$

The split gain determines whether a candidate split is accepted:

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$

A split is only accepted if Gain > 0. This elegantly combines information gain with regularization.
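To make the two formulas concrete, here is the leaf-weight and split-gain arithmetic on made-up gradient/Hessian sums (all numeric values are hypothetical, chosen only for illustration):

```python
def leaf_weight(G, H, lam):
    # w* = -G / (H + lambda)
    return -G / (H + lam)

def split_gain(GL, HL, GR, HR, lam, gamma):
    # Gain = 1/2 [ GL^2/(HL+lam) + GR^2/(HR+lam) - (GL+GR)^2/(HL+HR+lam) ] - gamma
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# toy sums: the left leaf collects negative gradients, the right leaf positive ones
GL, HL = -4.0, 3.0
GR, HR = 5.0, 4.0
lam, gamma = 1.0, 0.5
print(leaf_weight(GL, HL, lam))               # positive weight pushes left-leaf scores up
print(split_gain(GL, HL, GR, HR, lam, gamma))
```

With $\lambda = 1$ and $\gamma = 0.5$ this example split has positive gain, so it would be accepted.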

XGBoost engineering innovations:

  • Histogram binning: bucket continuous features into at most 256 bins for $O(n)$ split-finding
  • Column/row subsampling (colsample_bytree, subsample): reduce tree correlation and variance
  • Sparse-aware split finding: native handling of missing values via learned default directions

LightGBM: Leaf-Wise Growth and Efficient Sampling

LightGBM (Ke et al., 2017) introduces two key innovations for large-scale training:

GOSS (Gradient-based One-Side Sampling): Large-gradient samples contribute most to the split gain. GOSS keeps the top $a \times 100\%$ of instances ranked by gradient magnitude and randomly samples $b \times 100\%$ of the remaining low-gradient ones, upweighting the sampled ones by $\frac{1-a}{b}$ to keep the data distribution approximately unchanged. This reduces sample complexity while focusing compute where it matters.
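A sketch of the selection step in NumPy, under our own minimal reading of the procedure (LightGBM's internals differ in detail):

```python
import numpy as np

def goss_sample(grads, a=0.2, b=0.1, seed=0):
    """Keep the top a-fraction by |gradient|; sample a b-fraction of the rest,
    upweighted by (1 - a) / b so gain estimates stay approximately unbiased."""
    rng = np.random.default_rng(seed)
    n = len(grads)
    order = np.argsort(-np.abs(grads))        # indices sorted by |gradient|, descending
    n_top = int(a * n)
    top_idx = order[:n_top]
    rest = order[n_top:]
    rand_idx = rng.choice(rest, size=int(b * n), replace=False)
    idx = np.concatenate([top_idx, rand_idx])
    weights = np.ones(len(idx))
    weights[n_top:] = (1 - a) / b             # compensate for undersampling
    return idx, weights

grads = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(grads)
print(len(idx))   # 200 top-gradient + 100 sampled = 300 of 1000
```

The sampled instances carry weight $(1-a)/b = 8$ here, so the weighted sample still represents all 1000 rows.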

EFB (Exclusive Feature Bundling): Sparse features rarely take nonzero values simultaneously. EFB bundles mutually exclusive sparse features into single dense features, reducing the effective feature count the algorithm must scan.

Leaf-wise (best-first) growth: LightGBM always splits the leaf with the largest gain instead of growing level by level. This produces unbalanced trees that achieve lower loss with fewer leaves — but requires min_child_samples to prevent overfitting on small populations.

CatBoost: Ordered Target Encoding

CatBoost (Prokhorenkova et al., 2018) addresses target leakage in categorical feature encoding. Naive target encoding (replacing a category with its mean target) uses the current observation's own label, inflating apparent performance. CatBoost uses ordered target statistics: when encoding sample ii, it computes the mean target only from prior samples (in a random permutation) with the same category value. CatBoost also uses symmetric (oblivious) trees where every node at a given depth applies the same split, enabling efficient array-based inference.
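A minimal sketch of ordered target statistics (our own illustration; CatBoost additionally averages over several permutations and uses a configurable prior):

```python
import numpy as np

def ordered_target_encode(cats, y, prior=0.5, seed=0):
    """Encode each sample with the target mean of *earlier* samples only
    (in a random permutation), so no sample ever sees its own label."""
    n = len(cats)
    perm = np.random.default_rng(seed).permutation(n)
    sums, counts = {}, {}
    enc = np.empty(n)
    for i in perm:
        c = cats[i]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        enc[i] = (s + prior) / (k + 1)     # smoothed mean over prior samples only
        sums[c] = s + y[i]                 # only now does y[i] enter the statistics
        counts[c] = k + 1
    return enc

cats = ["a", "a", "b", "a", "b"]
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
print(ordered_target_encode(cats, y))
```

Because a sample's own label is added to the running statistics only after its encoding is computed, flipping that label never changes its own encoded value, which is exactly the leakage guarantee.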

Key Hyperparameters

| Parameter | Effect | Typical Range |
| --- | --- | --- |
| learning_rate | Step size per tree | 0.01–0.3 |
| n_estimators | Number of trees (use early stopping) | 100–5000 |
| max_depth | Maximum tree depth | 3–10 |
| subsample | Fraction of rows per tree | 0.6–1.0 |
| colsample_bytree | Fraction of features per tree | 0.5–1.0 |
| reg_lambda | L2 regularization on leaf weights | 0.1–10 |
| min_child_weight | Minimum Hessian sum in leaf | 1–20 |

Walkthrough

We fit a GBDT manually on five samples with $\eta = 0.5$ and MSE loss.

Dataset:

| i | x | y |
| --- | --- | --- |
| 1 | 1.0 | 2.5 |
| 2 | 2.0 | 3.8 |
| 3 | 3.0 | 6.2 |
| 4 | 4.0 | 7.1 |
| 5 | 5.0 | 9.0 |

Step 0 — Initialize $F_0 = \bar{y} = 5.72$.

Step 1 — Pseudo-residuals:

$$r_{i1} = y_i - 5.72 = [-3.22,\; -1.92,\; 0.48,\; 1.38,\; 3.28]$$

Fit a depth-1 stump to these residuals. The optimal split is at $x \leq 2.5$:

  • Left leaf (i = 1, 2): leaf value $= \frac{-3.22 + (-1.92)}{2} = -2.57$
  • Right leaf (i = 3, 4, 5): leaf value $= \frac{0.48 + 1.38 + 3.28}{3} = 1.71$

Update $F_1(\mathbf{x}_i) = 5.72 + 0.5 \cdot h_1(\mathbf{x}_i)$:

| i | F_1 | New residual y_i − F_1 |
| --- | --- | --- |
| 1 | 4.44 | −1.94 |
| 2 | 4.44 | −0.64 |
| 3 | 6.58 | −0.38 |
| 4 | 6.58 | 0.52 |
| 5 | 6.58 | 2.42 |

MSE dropped from 5.39 → 2.09.

Step 2 — Fit tree 2 on residuals $[-1.94, -0.64, -0.38, 0.52, 2.42]$.

The optimal split is at $x \leq 4.5$: left leaf $= -0.61$, right leaf $= +2.42$.

Update $F_2 = F_1 + 0.5 \cdot h_2$:

| i | y_i | F_2 | Error |
| --- | --- | --- | --- |
| 1 | 2.5 | 4.13 | −1.63 |
| 2 | 3.8 | 4.13 | −0.33 |
| 3 | 6.2 | 6.27 | −0.07 |
| 4 | 7.1 | 6.27 | +0.83 |
| 5 | 9.0 | 7.79 | +1.21 |

MSE dropped to 0.99. Each additional tree corrects the ensemble's remaining errors by fitting residuals in function space.
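The two boosting rounds above can be replayed in a few lines of NumPy to check the hand arithmetic:

```python
import numpy as np

y = np.array([2.5, 3.8, 6.2, 7.1, 9.0])
x = np.arange(1, 6)
eta = 0.5

F0 = y.mean()                                                # constant baseline 5.72
r1 = y - F0                                                  # round-1 pseudo-residuals
h1 = np.where(x <= 2, r1[x <= 2].mean(), r1[x > 2].mean())   # stump 1: split at x <= 2.5
F1 = F0 + eta * h1

r2 = y - F1                                                  # round-2 pseudo-residuals
h2 = np.where(x <= 4, r2[x <= 4].mean(), r2[x > 4].mean())   # stump 2: split at x <= 4.5
F2 = F1 + eta * h2

print(np.round(F1, 2), np.round(F2, 2))
print(round(((y - F1) ** 2).mean(), 2), round(((y - F2) ** 2).mean(), 2))
```

The printed predictions match the tables above (up to rounding of the displayed values).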

Analysis & Evaluation

Where Your Intuition Breaks

"More trees with a small learning rate always beats fewer trees with a large learning rate." With early stopping, this is often true. Without it, gradient boosting can overfit with enough trees regardless of learning rate — each new tree continues to reduce training loss even after test loss has started rising. The learning rate controls step size, not stopping point. The correct setup is: use a small learning rate, use early stopping on validation loss, and let the number of trees be determined by when overfitting begins rather than fixed in advance.
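The failure mode without early stopping is easy to reproduce with sklearn's `staged_predict`, which replays the ensemble one round at a time (synthetic data; all names and values here are ours):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)     # noisy target
X_tr, y_tr, X_va, y_va = X[:200], y[:200], X[200:], y[200:]

model = GradientBoostingRegressor(n_estimators=2000, learning_rate=0.1,
                                  max_depth=3, random_state=0).fit(X_tr, y_tr)

# MSE after each boosting round, on train and on held-out data
trn_mse = [((p - y_tr) ** 2).mean() for p in model.staged_predict(X_tr)]
val_mse = [((p - y_va) ** 2).mean() for p in model.staged_predict(X_va)]

best = int(np.argmin(val_mse))
print(f"best round: {best}")
print(f"val MSE there: {val_mse[best]:.3f}, val MSE at round 2000: {val_mse[-1]:.3f}")
```

Training MSE keeps falling all the way to round 2000, while validation MSE bottoms out much earlier and then climbs: exactly the gap early stopping is meant to catch.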

Algorithm Comparison

| Method | Training Speed | Accuracy (tabular) | Memory | Categorical Support | Strength |
| --- | --- | --- | --- | --- | --- |
| Decision Tree | Very fast | Low | Low | Native splits | Interpretable |
| Random Forest | Fast | Medium-high | Medium | Native splits | Low variance, robust |
| sklearn GBM | Slow | High | Medium | Ordinal encoding | Reference implementation |
| XGBoost | Fast | Very high | Medium | Ordinal encoding | Second-order, regularization |
| LightGBM | Very fast | Very high | Low | Native (cat_features) | GOSS/EFB, leaf-wise |
| CatBoost | Medium | Very high | Medium | Native ordered encoding | Best out-of-box on categoricals |

Trees vs Neural Networks for Tabular Data

Gradient boosted trees typically win on structured tabular data because:

  1. Scale invariance: trees are invariant to monotone feature transformations — no normalization needed
  2. Sparse and mixed data: handle missing values and mixed types without preprocessing
  3. Small data regime: trees generalize better when $n < 100\text{k}$
  4. Interpretability: SHAP values give exact feature attribution per prediction

Neural networks win when the dataset is very large (millions of rows with complex feature interactions), inputs contain high-dimensional embeddings, or the task benefits from transfer learning.

⚠️ Two main overfitting failure modes

(1) Too many trees without early stopping — training loss keeps dropping while validation loss rises. Always use a validation set with early_stopping_rounds=50. (2) Learning rate too high — the model memorizes training residuals too aggressively. Prefer learning_rate=0.05 with 1000+ trees over 0.3 with 100 trees.

Common Pitfalls

Feature leakage: Including features computed using the target variable inflates apparent performance. Use time-aware splits for temporal data and ordered encoding for categorical target statistics.

Feature importance bias: sklearn's built-in feature_importances_ (mean decrease in impurity) is biased toward high-cardinality features. Use permutation importance or SHAP values for reliable attribution.
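A sketch of the permutation-importance alternative on synthetic data where only the first feature carries signal (dataset and model choices are ours, for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)              # only feature 0 determines the label

model = GradientBoostingClassifier(random_state=0).fit(X[:400], y[:400])
result = permutation_importance(model, X[400:], y[400:],
                                n_repeats=10, random_state=0)

# mean accuracy drop when each feature is shuffled on held-out data
for i, imp in enumerate(result.importances_mean):
    print(f"feature_{i}: {imp:.3f}")
```

Because the score is measured on held-out data with one column shuffled at a time, the informative feature dominates and the noise features score near zero.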

Ignoring min_child_weight: On imbalanced datasets, small minority-class leaf populations cause overfitting on noise. Increasing min_child_weight forces each leaf to have a minimum sum of Hessian.

Production-Ready Code

Training with Early Stopping

python
import xgboost as xgb
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
 
# XGBoost with early stopping
def train_xgboost(X_train, y_train, X_val, y_val):
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval   = xgb.DMatrix(X_val,   label=y_val)
 
    params = {
        "objective":        "binary:logistic",
        "eval_metric":      "auc",
        "learning_rate":    0.05,
        "max_depth":        6,
        "subsample":        0.8,
        "colsample_bytree": 0.8,
        "reg_lambda":       1.0,
        "reg_alpha":        0.1,
        "tree_method":      "hist",   # GPU: "gpu_hist"
        "seed":             42,
    }
 
    model = xgb.train(
        params, dtrain, num_boost_round=2000,
        evals=[(dval, "val")],
        early_stopping_rounds=50,
        verbose_eval=200,
    )
    preds = model.predict(dval)
    print(f"XGBoost val AUC: {roc_auc_score(y_val, preds):.4f}")
    print(f"Best iteration:  {model.best_iteration}")
    return model
 
# LightGBM with early stopping
def train_lightgbm(X_train, y_train, X_val, y_val, cat_features=None):
    train_data = lgb.Dataset(X_train, label=y_train,
                             categorical_feature=cat_features or 'auto')
    val_data   = lgb.Dataset(X_val,   label=y_val, reference=train_data)
 
    params = {
        "objective":        "binary",
        "metric":           "auc",
        "learning_rate":    0.05,
        "num_leaves":       63,
        "min_child_samples": 20,
        "subsample":        0.8,
        "subsample_freq":   5,
        "colsample_bytree": 0.8,
        "reg_lambda":       1.0,
        "verbose":          -1,
        "seed":             42,
    }
 
    callbacks = [lgb.early_stopping(50), lgb.log_evaluation(200)]
    model = lgb.train(
        params, train_data, num_boost_round=2000,
        valid_sets=[val_data], callbacks=callbacks,
    )
    preds = model.predict(X_val)
    print(f"LightGBM val AUC: {roc_auc_score(y_val, preds):.4f}")
    return model

SHAP Feature Importance

python
import shap
import numpy as np
 
# xgb_model, X_val, feature_names come from the earlier training step
explainer   = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_val)
 
# Global importance: mean absolute SHAP per feature
mean_shap = np.abs(shap_values).mean(axis=0)
top = sorted(zip(feature_names, mean_shap), key=lambda x: -x[1])[:10]
for name, score in top:
    print(f"{name:30s} {score:.4f}")
 
# Summary plot (requires matplotlib)
shap.summary_plot(shap_values, X_val, feature_names=feature_names)
 
# Single-prediction explanation
shap.waterfall_plot(shap.Explanation(
    values=shap_values[0],
    base_values=explainer.expected_value,
    data=X_val[0],
    feature_names=feature_names,
))

Hyperparameter Search with Optuna

python
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)
 
def objective(trial):
    params = {
        "objective":        "binary:logistic",
        "eval_metric":      "auc",
        "tree_method":      "hist",
        "learning_rate":    trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "max_depth":        trial.suggest_int("max_depth", 3, 10),
        "subsample":        trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.4, 1.0),
        "reg_lambda":       trial.suggest_float("reg_lambda", 0.1, 10.0, log=True),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 20),
    }
 
    model = xgb.train(
        params, dtrain, num_boost_round=1000,
        evals=[(dval, "val")],
        early_stopping_rounds=30,
        verbose_eval=False,
    )
    return model.best_score
 
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100, timeout=600)
print("Best AUC:", study.best_value)
print("Best params:", study.best_params)

Serving with FastAPI

python
import numpy as np
import xgboost as xgb
from fastapi import FastAPI
from pydantic import BaseModel
 
app    = FastAPI(title="GBDT Inference")
_model = None
 
@app.on_event("startup")
def load():
    global _model
    _model = xgb.Booster()
    _model.load_model("artifacts/model.ubj")
 
class PredictRequest(BaseModel):
    features: list[float]
 
@app.post("/predict")
def predict(req: PredictRequest):
    # DMatrix expects an array, not a bare Python list
    dmat = xgb.DMatrix(np.asarray([req.features], dtype=np.float32))
    prob = float(_model.predict(dmat)[0])
    return {
        "probability": round(prob, 4),
        "label":       int(prob > 0.5),
    }

🚀 Monitoring in production

Track PSI (Population Stability Index) on input feature distributions — PSI above 0.2 on any key feature signals distribution shift and triggers retraining. Log prediction score distributions daily; a shift in mean predicted probability often precedes drift in actual outcomes. XGBoost and LightGBM models are typically under 50 MB and load in under 100 ms, making them well-suited for synchronous low-latency serving.
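PSI itself is only a few lines of NumPy. A common formulation (bin edges taken from the reference distribution; the exact binning convention varies between teams):

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample (e.g. training
    data) and a live sample, using quantile bins from the reference."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch out-of-range live values
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, eps, None)            # avoid log(0) on empty bins
    a_frac = np.clip(a_frac, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)
print(round(psi(train, rng.normal(0, 1, 10_000)), 3))    # same distribution: near zero
print(round(psi(train, rng.normal(0.5, 1, 10_000)), 3))  # mean shift: flags drift
```

An identical distribution scores close to zero; a half-standard-deviation mean shift pushes PSI well above the usual alerting band.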
