Neural-Path/Notes

Double ML & Debiased Estimation

You have 500 confounders and you want to estimate the effect of price on demand. Naively, you throw everything into a Lasso regression and read off the price coefficient. But Lasso is designed to zero out small coefficients — and it will zero out some of the price variation along with the noise, biasing your estimate toward zero. This regularization bias problem afflicts any high-dimensional estimation, and it motivated one of the most elegant ideas in modern causal inference: Double Machine Learning.

Theory

Double ML cross-fitting

[Interactive diagram: K=5 cross-fitting folds, one held out while nuisance models train on the rest. Step 1: fit Y ← f(X) on the training folds and residualize the held-out fold, Ỹ = Y − f̂(X). Step 2: fit D ← g(X) and residualize, D̃ = D − ĝ(X). Step 3: OLS of Ỹ on D̃ gives the debiased, Neyman-orthogonal estimate θ̂.]

The regularization bias problem. Suppose the true model is $Y = D\theta_0 + g_0(X) + \varepsilon$, where $D$ is the treatment, $X$ are confounders, and $g_0$ is a complex function. If you estimate $g_0$ with Lasso, the regularization introduces bias in $\hat{g}_0$. Because $D$ and $X$ are correlated (confounding), this bias propagates into $\hat{\theta}$.

Neyman orthogonality. Chernozhukov et al. (2018) identified the key condition: a moment condition $\psi(W; \theta, \eta)$ is Neyman-orthogonal if:

$$\frac{\partial}{\partial \eta} \mathbb{E}[\psi(W; \theta_0, \eta)] \bigg|_{\eta = \eta_0} = 0$$

That is, the moment condition is insensitive at first order to perturbations of the nuisance parameter $\eta$ around its true value $\eta_0$, so estimation errors in $\eta$ do not bias $\hat{\theta}$ at first order: the first-order bias cancels.

Why it had to be this way. The orthogonality condition is not arbitrary. It arises from the influence function of semiparametric estimation theory: the efficient influence function of $\theta$ in a partially linear model is $\psi = V^{-1} \cdot \tilde{D} \cdot (Y - D\theta - g(X))$, where $\tilde{D} = D - \mathbb{E}[D \mid X]$ is the "residualized" treatment. This is the Frisch-Waugh-Lovell partialling-out formula made robust to machine-learning nuisance estimation.
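The orthogonality of this score can be checked directly. A sketch of the calculation, writing $\eta = (m, e)$ with $m_0(X) = \mathbb{E}[Y \mid X]$, $e_0(X) = \mathbb{E}[D \mid X]$, and score $\psi(W; \theta, \eta) = (Y - m(X) - \theta(D - e(X)))(D - e(X))$:

```latex
% Gateaux derivative of the population moment in the direction \Delta m:
\partial_r \, \mathbb{E}\big[\psi(W;\theta_0,(m_0 + r\Delta m,\, e_0))\big]\big|_{r=0}
  = -\,\mathbb{E}\big[\Delta m(X)\,(D - e_0(X))\big]
  = -\,\mathbb{E}\big[\Delta m(X)\,\underbrace{\mathbb{E}[D - e_0(X)\mid X]}_{=\,0}\big] = 0
% Direction \Delta e, using Y - m_0(X) - \theta_0(D - e_0(X)) = \varepsilon:
\partial_r \, \mathbb{E}\big[\psi(W;\theta_0,(m_0,\, e_0 + r\Delta e))\big]\big|_{r=0}
  = \mathbb{E}\big[\Delta e(X)\,\big(\theta_0 (D - e_0(X)) - \varepsilon\big)\big] = 0
```

Both terms vanish because $D - e_0(X)$ and $\varepsilon$ are mean-zero given $X$, which is exactly why first-order nuisance errors cancel.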

The Double ML algorithm (Partially Linear Regression):

  1. Fit $\hat{m}(X)$ predicting $Y$ from $X$ (nuisance model 1)
  2. Fit $\hat{e}(X)$ predicting $D$ from $X$ (nuisance model 2)
  3. Compute residuals: $\tilde{Y} = Y - \hat{m}(X)$, $\tilde{D} = D - \hat{e}(X)$
  4. Regress $\tilde{Y}$ on $\tilde{D}$: $\hat{\theta} = (\tilde{D}'\tilde{D})^{-1} \tilde{D}'\tilde{Y}$

Cross-fitting. If you use the same data for both nuisance estimation and the final regression, overfitting re-introduces bias (the nuisance model has memorized the noise in YY and DD). K-fold cross-fitting solves this: fit nuisance models on K-1 folds, predict on the held-out fold, then use all held-out residuals for the final regression.
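To see the overfitting mechanism in isolation, here is a small sketch (illustrative, not part of the walkthrough): a flexible model fit to pure noise produces deceptively small in-sample residuals, while held-out residuals keep roughly their honest variance.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n, p = 500, 10
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)  # pure noise: the best possible predictor is the constant 0

# A deliberately flexible nuisance model, fit on the first half only
model = GradientBoostingRegressor(n_estimators=300, max_depth=5, random_state=0)
model.fit(X[:250], Y[:250])

var_in = float(np.var(Y[:250] - model.predict(X[:250])))   # memorized noise
var_out = float(np.var(Y[250:] - model.predict(X[250:])))  # honest residuals

print(f"in-sample residual variance: {var_in:.3f}")   # far below the true 1.0
print(f"held-out residual variance:  {var_out:.3f}")  # at or above the true 1.0
```

Using in-sample residuals in the final regression would treat that artificially deflated variance as signal; cross-fitting ensures every residual is an out-of-fold prediction error.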

Inference. Under cross-fitting and Neyman orthogonality, $\hat{\theta}$ is asymptotically normal with variance $V^{-1} \mathbb{E}[\psi^2] V^{-1}$ (sandwich estimator), where $V = \mathbb{E}[\tilde{D}^2]$. Standard errors come from the moment condition, not from the nuisance models.

Walkthrough

Scenario: Estimate price elasticity from 50,000 transaction records with 200 confounders (product features, customer attributes, market conditions). Naive OLS would be biased; Lasso direct estimation has regularization bias.

Step 1: Cross-fitting pipeline.

python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
 
def double_ml_plr(
    Y: np.ndarray,   # (n,) outcome
    D: np.ndarray,   # (n,) treatment
    X: np.ndarray,   # (n, p) confounders
    n_folds: int = 5,
) -> dict:
    """Double ML partially linear regression with K-fold cross-fitting.
 
    Returns debiased ATE estimate with SE and 95% CI.
    """
    n = len(Y)
    Y_res = np.zeros(n)  # Y - m(X)
    D_res = np.zeros(n)  # D - e(X)
 
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    for train_idx, test_idx in kf.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        Y_train, Y_test = Y[train_idx], Y[test_idx]
        D_train, D_test = D[train_idx], D[test_idx]
 
        # Nuisance model 1: Y ~ f(X)
        m_hat = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
        m_hat.fit(X_train, Y_train)
        Y_res[test_idx] = Y_test - m_hat.predict(X_test)
 
        # Nuisance model 2: D ~ g(X)
        e_hat = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
        e_hat.fit(X_train, D_train)
        D_res[test_idx] = D_test - e_hat.predict(X_test)
 
    # Final regression: Y_res ~ theta * D_res
    theta = float(np.dot(D_res, Y_res) / np.dot(D_res, D_res))
 
    # Sandwich SE
    psi = D_res * (Y_res - theta * D_res)
    V = float(np.mean(D_res ** 2))
    var_theta = float(np.mean(psi ** 2)) / (n * V ** 2)
    se = float(np.sqrt(var_theta))
 
    from scipy.stats import norm
    z = norm.ppf(0.975)
    return {
        'theta': round(theta, 6),
        'se': round(se, 6),
        'ci_lower': round(theta - z * se, 6),
        'ci_upper': round(theta + z * se, 6),
        'pvalue': round(float(2 * norm.sf(abs(theta / se))), 4),
        'n_folds': n_folds,
    }

Step 2: Compare to naive OLS.

python
import statsmodels.api as sm
 
def naive_ols(Y, D, X):
    features = np.column_stack([D, X])
    model = sm.OLS(Y, sm.add_constant(features)).fit()
    return {'theta': round(float(model.params[1]), 6), 'se': round(float(model.bse[1]), 6)}

Typical finding: naive OLS with hundreds of correlated controls yields an unstable coefficient with understated standard errors; direct Lasso shrinks the price coefficient toward zero (regularization bias); Double ML's estimate is debiased and its SE reflects the actual uncertainty.
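This pattern can be reproduced on synthetic data. The sketch below (all quantities, including theta0 and the sparsity pattern, are illustrative assumptions) compares reading the coefficient off a direct Lasso against a compact cross-fitted Double ML estimate when the true elasticity is known:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)
n, p, theta0 = 2000, 50, -2.0
X = rng.normal(size=(n, p))
a = np.zeros(p); a[:5] = 0.8                 # confounders that drive price
b = np.zeros(p); b[:5] = 1.0                 # the same confounders drive demand
D = X @ a + rng.normal(size=n)               # price
Y = theta0 * D + X @ b + rng.normal(size=n)  # demand, true elasticity -2.0

# Direct Lasso: the price coefficient is regularized along with everything else
theta_direct = float(Lasso(alpha=0.1).fit(np.column_stack([D, X]), Y).coef_[0])

# Compact Double ML: residualize Y and D with cross-fitted Lasso nuisances
Y_res, D_res = np.zeros(n), np.zeros(n)
for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    Y_res[te] = Y[te] - Lasso(alpha=0.05).fit(X[tr], Y[tr]).predict(X[te])
    D_res[te] = D[te] - Lasso(alpha=0.05).fit(X[tr], D[tr]).predict(X[te])
theta_dml = float(D_res @ Y_res / (D_res @ D_res))

print(f"direct Lasso: {theta_direct:+.3f}  Double ML: {theta_dml:+.3f}  truth: {theta0:+.1f}")
```

The direct Lasso coefficient is pulled toward zero by the penalty, while the residualized estimate is close to the truth even though its nuisance models are themselves regularized.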

Analysis & Evaluation

Where your intuition breaks. It seems like using more flexible nuisance models (XGBoost instead of Lasso) should always improve Double ML. This is largely true, but there is a key exception: if the nuisance model for $D$ overfits (nearly perfectly predicts treatment from covariates), the residuals $\tilde{D}$ become very small in the test fold, and the final regression becomes unstable. Cross-fitting mitigates but does not eliminate this: with $K=5$ folds and 80% of the data for training, very flexible models can still overfit enough to create near-zero $\tilde{D}$ residuals. The diagnostic: plot the distribution of $\tilde{D}$ and check that it has substantial variance.

| Method | Regularization bias | High-dimensional confounders | SE validity |
| --- | --- | --- | --- |
| Naive OLS | Present | No (collinearity) | Invalid |
| Lasso direct | Present | Yes | Invalid |
| Double ML (Lasso nuisance) | Eliminated | Yes | Valid |
| Double ML (GBM nuisance) | Eliminated | Yes | Valid |
| Double ML (RF nuisance) | Eliminated | Yes | Valid |

When Double ML fails. If overlap is poor (treatment is nearly deterministic given $X$), $\tilde{D}$ has near-zero variance and the estimator is unstable. If $p \gg n$ (more confounders than observations), even flexible nuisance models may fail. In these cases, consider instrumental variable methods or partial identification.

⚠️Warning

Check the treatment residuals. Before trusting a Double ML estimate, plot the distribution of $D - \hat{e}(X)$ (the treatment residuals). If the variance is very small, overlap is poor and the estimator is unreliable. The ratio $\text{Var}(\tilde{D}) / \text{Var}(D)$ should be at least 0.05.
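A minimal version of this check might look as follows (`overlap_ratio` is an illustrative helper, not a library function; the GBM nuisance model and data-generating choices are assumptions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

def overlap_ratio(D, X, n_folds=5):
    """Var(D - e_hat(X)) / Var(D) from cross-fitted treatment predictions."""
    d_hat = cross_val_predict(
        GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0),
        X, D, cv=n_folds,
    )
    return float(np.var(D - d_hat) / np.var(D))

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))
D_good = X[:, 0] + rng.normal(size=1000)        # ample unexplained variation
D_bad = X[:, 0] + 0.01 * rng.normal(size=1000)  # nearly deterministic given X

r_good = overlap_ratio(D_good, X)
r_bad = overlap_ratio(D_bad, X)
print(f"good overlap: {r_good:.3f}  poor overlap: {r_bad:.3f}")
```

The second treatment leaves almost nothing for the final regression to work with, which is exactly the failure mode the warning describes.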

Production-Ready Code

python
"""
Double ML production pipeline.
PLR cross-fitting, PLIV for instruments, nuisance
diagnostics, and automated model selection.
"""
 
from __future__ import annotations
from dataclasses import dataclass
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso, LassoCV
from scipy.stats import norm
 
 
@dataclass
class DoubleMLResult:
    theta: float
    se: float
    ci_lower: float
    ci_upper: float
    pvalue: float
    d_residual_r2: float   # R^2 of the D nuisance model (overlap diagnostic)
    d_residual_var: float  # Var(D_tilde) — near zero = overlap problem
    n_folds: int
 
 
def double_ml_plr(
    Y: np.ndarray,
    D: np.ndarray,
    X: np.ndarray,
    n_folds: int = 5,
    nuisance_model: str = 'gbm',  # 'gbm', 'lasso', 'lasso_cv'
) -> DoubleMLResult:
    """Double ML partially linear regression with K-fold cross-fitting."""
    n = len(Y)
    Y_res = np.zeros(n)
    D_res = np.zeros(n)
    D_pred = np.zeros(n)
 
    def make_model():
        if nuisance_model == 'gbm':
            return GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
        elif nuisance_model == 'lasso_cv':
            return LassoCV(cv=3, random_state=0)
        return Lasso(alpha=0.01)
 
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    for train_idx, test_idx in kf.split(X):
        m = make_model()
        m.fit(X[train_idx], Y[train_idx])
        Y_res[test_idx] = Y[test_idx] - m.predict(X[test_idx])
 
        e = make_model()
        e.fit(X[train_idx], D[train_idx])
        D_pred[test_idx] = e.predict(X[test_idx])
        D_res[test_idx] = D[test_idx] - D_pred[test_idx]
 
    theta = float(np.dot(D_res, Y_res) / np.dot(D_res, D_res))
    psi = D_res * (Y_res - theta * D_res)
    V = float(np.mean(D_res ** 2))
    var_theta = float(np.mean(psi ** 2)) / (n * V ** 2)
    se = float(np.sqrt(var_theta))
 
    d_residual_r2 = float(1 - np.var(D_res) / np.var(D))
    d_residual_var = float(np.var(D_res))
    z = norm.ppf(0.975)
 
    if d_residual_var < 0.05 * np.var(D):
        import warnings
        warnings.warn(
            f"Treatment residual variance is very small ({d_residual_var:.4f}); "
            "Var(D_res)/Var(D) is below 0.05. Overlap may be poor and the "
            "Double ML estimate unreliable."
        )
 
    return DoubleMLResult(
        theta=round(theta, 6),
        se=round(se, 6),
        ci_lower=round(theta - z * se, 6),
        ci_upper=round(theta + z * se, 6),
        pvalue=round(float(2 * norm.sf(abs(theta / se))), 4),
        d_residual_r2=round(d_residual_r2, 4),
        d_residual_var=round(d_residual_var, 6),
        n_folds=n_folds,
    )
 
 
def double_ml_pliv(
    Y: np.ndarray,
    D: np.ndarray,
    Z: np.ndarray,   # instrument, shape (n,) or (n, 1)
    X: np.ndarray,
    n_folds: int = 5,
) -> dict:
    """Double ML partially linear IV — for endogenous treatment with instrument.
 
    Uses cross-fitting on three nuisance models: E[Y|X], E[D|X], E[Z|X].
    Final stage: 2SLS on residuals.
    """
    n = len(Y)
    Z = np.asarray(Z)
    if Z.ndim > 1:
        if Z.shape[1] != 1:
            raise ValueError("this implementation supports a single instrument")
        Z = Z.ravel()
    Y_res = np.zeros(n)
    D_res = np.zeros(n)
    Z_res = np.zeros(n)

    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    for train_idx, test_idx in kf.split(X):
        # Cross-fit all three nuisance models: E[Y|X], E[D|X], E[Z|X]
        for arr, res in [(Y, Y_res), (D, D_res), (Z, Z_res)]:
            m = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
            m.fit(X[train_idx], arr[train_idx])
            res[test_idx] = arr[test_idx] - m.predict(X[test_idx])
 
    # 2SLS on residuals
    theta = float(np.dot(Z_res, Y_res) / np.dot(Z_res, D_res))
    psi = Z_res * (Y_res - theta * D_res)
    V = float(np.mean(Z_res * D_res))
    var_theta = float(np.mean(psi ** 2)) / (n * V ** 2)
    se = float(np.sqrt(var_theta))
    z_crit = norm.ppf(0.975)
    return {
        'theta_iv': round(theta, 6),
        'se': round(se, 6),
        'ci_lower': round(theta - z_crit * se, 6),
        'ci_upper': round(theta + z_crit * se, 6),
        'pvalue': round(float(2 * norm.sf(abs(theta / se))), 4),
    }
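As a sanity check on the IV moment condition, here is a sketch on simulated data (linear nuisances for speed; every quantity, including theta0 and the shock structure, is an illustrative assumption). Price is endogenous because an unobserved shock drives both price and demand, and the instrument restores consistency:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(7)
n, p, theta0 = 4000, 10, -1.5
X = rng.normal(size=(n, p))
u = rng.normal(size=n)              # unobserved shock hitting price AND demand
Z = X[:, 0] + rng.normal(size=n)    # cost-shifter instrument (varies with X)
D = X[:, 0] + Z + 0.8 * u + rng.normal(size=n)     # endogenous price
Y = theta0 * D + X[:, 1] + u + rng.normal(size=n)  # demand

def resid(v):
    # cross-fitted residual of v on X using a linear nuisance model
    return v - cross_val_predict(LinearRegression(), X, v, cv=5)

Y_res, D_res, Z_res = resid(Y), resid(D), resid(Z)
theta_plr = float(D_res @ Y_res / (D_res @ D_res))  # ignores endogeneity
theta_iv = float(Z_res @ Y_res / (Z_res @ D_res))   # 2SLS on residuals

print(f"PLR (biased by u): {theta_plr:+.3f}  PLIV: {theta_iv:+.3f}  truth: {theta0:+.1f}")
```

The plain PLR estimate absorbs the correlation between the price residual and the unobserved shock, while the instrument residual is uncorrelated with the shock and isolates the true elasticity.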
