Neural-Path/Notes

Double ML & Debiased Estimation

You have 500 confounders and you want to estimate the effect of price on demand. Naively, you throw everything into a Lasso regression and read off the price coefficient. But Lasso is designed to zero out small coefficients — and it will zero out some of the price variation along with the noise, biasing your estimate toward zero. This regularization bias problem afflicts any high-dimensional estimation, and it motivated one of the most elegant ideas in modern causal inference: Double Machine Learning.

Theory

Double ML cross-fitting

[Interactive diagram: K=5 cross-fitting folds, one held out while nuisance models train on the rest. Step 1: fit Y ← f(X) on the training folds and residualize the held-out fold, Ỹ = Y − f̂(X). Step 2: fit D ← g(X) and residualize, D̃ = D − ĝ(X). Step 3: OLS of Ỹ on D̃ gives the debiased, Neyman-orthogonal estimate θ̂.]

The regularization bias problem. Suppose the true model is $Y = D\theta_0 + g_0(X) + \varepsilon$, where $D$ is the treatment, $X$ are confounders, and $g_0$ is a complex function. If you estimate $g_0$ with Lasso, the regularization introduces bias in $\hat{g}_0$. Because $D$ and $X$ are correlated (confounding), this bias propagates into $\hat{\theta}$.

Neyman orthogonality. Chernozhukov et al. (2018) identified the key condition: a moment condition $\psi(W; \theta, \eta)$ is Neyman-orthogonal if:

$$\frac{\partial}{\partial \eta} \mathbb{E}[\psi(W; \theta_0, \eta)] \bigg|_{\eta = \eta_0} = 0$$

That is, the moment condition is insensitive at first order to perturbations of the nuisance parameter $\eta$ around its true value $\eta_0$, so estimation errors in $\eta$ do not bias $\hat{\theta}$ at first order: the first-order bias cancels.

Why it had to be this way. The orthogonality condition is not arbitrary. It arises from the influence function of semiparametric estimation theory: the efficient influence function of $\theta$ in a partially linear model is $\psi = V^{-1} \cdot \tilde{D} \cdot (Y - D\theta - g(X))$, where $\tilde{D} = D - \mathbb{E}[D \mid X]$ is the "residualized" treatment. This is the Frisch-Waugh-Lovell partialling-out formula made robust to machine-learning nuisance estimation.
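The orthogonality of this score can be checked directly. A sketch of the calculation, writing $\eta = (m, e)$ with $m_0(X) = \mathbb{E}[Y \mid X]$, $e_0(X) = \mathbb{E}[D \mid X]$, and score $\psi(W; \theta, \eta) = (Y - m(X) - \theta(D - e(X)))(D - e(X))$:

```latex
% Gateaux derivative of the population moment in the direction \Delta m:
\partial_r \, \mathbb{E}\big[\psi(W;\theta_0,(m_0 + r\Delta m,\, e_0))\big]\big|_{r=0}
  = -\,\mathbb{E}\big[\Delta m(X)\,(D - e_0(X))\big]
  = -\,\mathbb{E}\big[\Delta m(X)\,\underbrace{\mathbb{E}[D - e_0(X)\mid X]}_{=\,0}\big] = 0
% Direction \Delta e, using Y - m_0(X) - \theta_0(D - e_0(X)) = \varepsilon:
\partial_r \, \mathbb{E}\big[\psi(W;\theta_0,(m_0,\, e_0 + r\Delta e))\big]\big|_{r=0}
  = \mathbb{E}\big[\Delta e(X)\,\big(\theta_0 (D - e_0(X)) - \varepsilon\big)\big] = 0
```

Both terms vanish because $D - e_0(X)$ and $\varepsilon$ are mean-zero given $X$, which is exactly why first-order nuisance errors cancel.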

The Double ML algorithm (Partially Linear Regression):

  1. Fit $\hat{m}(X)$ predicting $Y$ from $X$ (nuisance model 1)
  2. Fit $\hat{e}(X)$ predicting $D$ from $X$ (nuisance model 2)
  3. Compute residuals: $\tilde{Y} = Y - \hat{m}(X)$, $\tilde{D} = D - \hat{e}(X)$
  4. Regress $\tilde{Y}$ on $\tilde{D}$: $\hat{\theta} = (\tilde{D}'\tilde{D})^{-1} \tilde{D}'\tilde{Y}$

Cross-fitting. If you use the same data for both nuisance estimation and the final regression, overfitting re-introduces bias (the nuisance model has memorized the noise in YY and DD). K-fold cross-fitting solves this: fit nuisance models on K-1 folds, predict on the held-out fold, then use all held-out residuals for the final regression.
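To see the overfitting mechanism in isolation, here is a small sketch (illustrative, not part of the walkthrough): a flexible model fit to pure noise produces deceptively small in-sample residuals, while held-out residuals keep roughly their honest variance.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n, p = 500, 10
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)  # pure noise: the best possible predictor is the constant 0

# A deliberately flexible nuisance model, fit on the first half only
model = GradientBoostingRegressor(n_estimators=300, max_depth=5, random_state=0)
model.fit(X[:250], Y[:250])

var_in = float(np.var(Y[:250] - model.predict(X[:250])))   # memorized noise
var_out = float(np.var(Y[250:] - model.predict(X[250:])))  # honest residuals

print(f"in-sample residual variance: {var_in:.3f}")   # far below the true 1.0
print(f"held-out residual variance:  {var_out:.3f}")  # at or above the true 1.0
```

Using in-sample residuals in the final regression would treat that artificially deflated variance as signal; cross-fitting ensures every residual is an out-of-fold prediction error.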

Inference. Under cross-fitting and Neyman orthogonality, $\hat{\theta}$ is asymptotically normal with variance $V^{-1} \mathbb{E}[\psi^2] V^{-1}$ (sandwich estimator), where $V = \mathbb{E}[\tilde{D}^2]$. Standard errors come from the moment condition, not from the nuisance models.

Walkthrough

Scenario: Estimate price elasticity from 50,000 transaction records with 200 confounders (product features, customer attributes, market conditions). Naive OLS would be biased; Lasso direct estimation has regularization bias.

Step 1: Cross-fitting pipeline.

python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
 
def double_ml_plr(
    Y: np.ndarray,   # (n,) outcome
    D: np.ndarray,   # (n,) treatment
    X: np.ndarray,   # (n, p) confounders
    n_folds: int = 5,
) -> dict:
    """Double ML partially linear regression with K-fold cross-fitting.
 
    Returns debiased ATE estimate with SE and 95% CI.
    """
    n = len(Y)
    Y_res = np.zeros(n)  # Y - m(X)
    D_res = np.zeros(n)  # D - e(X)
 
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    for train_idx, test_idx in kf.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        Y_train, Y_test = Y[train_idx], Y[test_idx]
        D_train, D_test = D[train_idx], D[test_idx]
 
        # Nuisance model 1: Y ~ f(X)
        m_hat = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
        m_hat.fit(X_train, Y_train)
        Y_res[test_idx] = Y_test - m_hat.predict(X_test)
 
        # Nuisance model 2: D ~ g(X)
        e_hat = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
        e_hat.fit(X_train, D_train)
        D_res[test_idx] = D_test - e_hat.predict(X_test)
 
    # Final regression: Y_res ~ theta * D_res
    theta = float(np.dot(D_res, Y_res) / np.dot(D_res, D_res))
 
    # Sandwich SE
    psi = D_res * (Y_res - theta * D_res)
    V = float(np.mean(D_res ** 2))
    var_theta = float(np.mean(psi ** 2)) / (n * V ** 2)
    se = float(np.sqrt(var_theta))
 
    from scipy.stats import norm
    z = norm.ppf(0.975)
    return {
        'theta': round(theta, 6),
        'se': round(se, 6),
        'ci_lower': round(theta - z * se, 6),
        'ci_upper': round(theta + z * se, 6),
        'pvalue': round(float(2 * norm.sf(abs(theta / se))), 4),
        'n_folds': n_folds,
    }

Step 2: Compare to naive OLS.

python
import statsmodels.api as sm
 
def naive_ols(Y, D, X):
    features = np.column_stack([D, X])
    model = sm.OLS(Y, sm.add_constant(features)).fit()
    return {'theta': round(float(model.params[1]), 6), 'se': round(float(model.bse[1]), 6)}

Typical finding: naive OLS with hundreds of correlated controls yields an unstable coefficient with understated standard errors; direct Lasso shrinks the price coefficient toward zero (regularization bias); Double ML's estimate is debiased and its SE reflects the actual uncertainty.
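This pattern can be reproduced on synthetic data. The sketch below (all quantities, including theta0 and the sparsity pattern, are illustrative assumptions) compares reading the coefficient off a direct Lasso against a compact cross-fitted Double ML estimate when the true elasticity is known:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)
n, p, theta0 = 2000, 50, -2.0
X = rng.normal(size=(n, p))
a = np.zeros(p); a[:5] = 0.8                 # confounders that drive price
b = np.zeros(p); b[:5] = 1.0                 # the same confounders drive demand
D = X @ a + rng.normal(size=n)               # price
Y = theta0 * D + X @ b + rng.normal(size=n)  # demand, true elasticity -2.0

# Direct Lasso: the price coefficient is regularized along with everything else
theta_direct = float(Lasso(alpha=0.1).fit(np.column_stack([D, X]), Y).coef_[0])

# Compact Double ML: residualize Y and D with cross-fitted Lasso nuisances
Y_res, D_res = np.zeros(n), np.zeros(n)
for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    Y_res[te] = Y[te] - Lasso(alpha=0.05).fit(X[tr], Y[tr]).predict(X[te])
    D_res[te] = D[te] - Lasso(alpha=0.05).fit(X[tr], D[tr]).predict(X[te])
theta_dml = float(D_res @ Y_res / (D_res @ D_res))

print(f"direct Lasso: {theta_direct:+.3f}  Double ML: {theta_dml:+.3f}  truth: {theta0:+.1f}")
```

The direct Lasso coefficient is pulled toward zero by the penalty, while the residualized estimate is close to the truth even though its nuisance models are themselves regularized.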

Analysis & Evaluation

Where your intuition breaks. It seems like using more flexible nuisance models (XGBoost instead of Lasso) should always improve Double ML. This is largely true, but there is a key exception: if the nuisance model for $D$ overfits (nearly perfectly predicts treatment from covariates), the residuals $\tilde{D}$ become very small in the test fold, and the final regression becomes unstable. Cross-fitting mitigates but does not eliminate this: with $K=5$ folds and 80% of the data for training, very flexible models can still overfit enough to create near-zero $\tilde{D}$ residuals. The diagnostic: plot the distribution of $\tilde{D}$ and check that it has substantial variance.

| Method | Regularization bias | High-dimensional confounders | SE validity |
| --- | --- | --- | --- |
| Naive OLS | Present | No (collinearity) | Invalid |
| Lasso direct | Present | Yes | Invalid |
| Double ML (Lasso nuisance) | Eliminated | Yes | Valid |
| Double ML (GBM nuisance) | Eliminated | Yes | Valid |
| Double ML (RF nuisance) | Eliminated | Yes | Valid |

When Double ML fails. If overlap is poor (treatment is nearly deterministic given $X$), $\tilde{D}$ has near-zero variance and the estimator is unstable. If $p \gg n$ (more confounders than observations), even flexible nuisance models may fail. In these cases, consider instrumental variable methods or partial identification.

⚠️Warning

Check the treatment residuals. Before trusting a Double ML estimate, plot the distribution of $D - \hat{e}(X)$ (the treatment residuals). If the variance is very small, overlap is poor and the estimator is unreliable. The ratio $\text{Var}(\tilde{D}) / \text{Var}(D)$ should be at least 0.05.
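A minimal version of this check might look as follows (`overlap_ratio` is an illustrative helper, not a library function; the GBM nuisance model and data-generating choices are assumptions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

def overlap_ratio(D, X, n_folds=5):
    """Var(D - e_hat(X)) / Var(D) from cross-fitted treatment predictions."""
    d_hat = cross_val_predict(
        GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0),
        X, D, cv=n_folds,
    )
    return float(np.var(D - d_hat) / np.var(D))

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))
D_good = X[:, 0] + rng.normal(size=1000)        # ample unexplained variation
D_bad = X[:, 0] + 0.01 * rng.normal(size=1000)  # nearly deterministic given X

r_good = overlap_ratio(D_good, X)
r_bad = overlap_ratio(D_bad, X)
print(f"good overlap: {r_good:.3f}  poor overlap: {r_bad:.3f}")
```

The second treatment leaves almost nothing for the final regression to work with, which is exactly the failure mode the warning describes.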

Production-Ready Code

python
"""
Double ML production pipeline.
PLR cross-fitting, PLIV for instruments, nuisance
diagnostics, and automated model selection.
"""
 
from __future__ import annotations
from dataclasses import dataclass
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso, LassoCV
from scipy.stats import norm
 
 
@dataclass
class DoubleMLResult:
    theta: float
    se: float
    ci_lower: float
    ci_upper: float
    pvalue: float
    d_residual_r2: float   # R^2 of the D nuisance model (overlap diagnostic)
    d_residual_var: float  # Var(D_tilde) — near zero = overlap problem
    n_folds: int
 
 
def double_ml_plr(
    Y: np.ndarray,
    D: np.ndarray,
    X: np.ndarray,
    n_folds: int = 5,
    nuisance_model: str = 'gbm',  # 'gbm', 'lasso', 'lasso_cv'
) -> DoubleMLResult:
    """Double ML partially linear regression with K-fold cross-fitting."""
    n = len(Y)
    Y_res = np.zeros(n)
    D_res = np.zeros(n)
    D_pred = np.zeros(n)
 
    def make_model():
        if nuisance_model == 'gbm':
            return GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
        elif nuisance_model == 'lasso_cv':
            return LassoCV(cv=3, random_state=0)
        return Lasso(alpha=0.01)
 
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    for train_idx, test_idx in kf.split(X):
        m = make_model()
        m.fit(X[train_idx], Y[train_idx])
        Y_res[test_idx] = Y[test_idx] - m.predict(X[test_idx])
 
        e = make_model()
        e.fit(X[train_idx], D[train_idx])
        D_pred[test_idx] = e.predict(X[test_idx])
        D_res[test_idx] = D[test_idx] - D_pred[test_idx]
 
    theta = float(np.dot(D_res, Y_res) / np.dot(D_res, D_res))
    psi = D_res * (Y_res - theta * D_res)
    V = float(np.mean(D_res ** 2))
    var_theta = float(np.mean(psi ** 2)) / (n * V ** 2)
    se = float(np.sqrt(var_theta))
 
    d_residual_r2 = float(1 - np.var(D_res) / np.var(D))
    d_residual_var = float(np.var(D_res))
    z = norm.ppf(0.975)
 
    if d_residual_var < 0.05 * np.var(D):
        import warnings
        warnings.warn(
            f"Treatment residual variance is very small ({d_residual_var:.4f}); "
            "Var(D_res)/Var(D) is below 0.05. Overlap may be poor and the "
            "Double ML estimate unreliable."
        )
 
    return DoubleMLResult(
        theta=round(theta, 6),
        se=round(se, 6),
        ci_lower=round(theta - z * se, 6),
        ci_upper=round(theta + z * se, 6),
        pvalue=round(float(2 * norm.sf(abs(theta / se))), 4),
        d_residual_r2=round(d_residual_r2, 4),
        d_residual_var=round(d_residual_var, 6),
        n_folds=n_folds,
    )
 
 
def double_ml_pliv(
    Y: np.ndarray,
    D: np.ndarray,
    Z: np.ndarray,   # instrument, shape (n,) or (n, 1)
    X: np.ndarray,
    n_folds: int = 5,
) -> dict:
    """Double ML partially linear IV — for endogenous treatment with instrument.
 
    Uses cross-fitting on three nuisance models: E[Y|X], E[D|X], E[Z|X].
    Final stage: 2SLS on residuals.
    """
    n = len(Y)
    Z = np.asarray(Z)
    if Z.ndim > 1:
        if Z.shape[1] != 1:
            raise ValueError("this implementation supports a single instrument")
        Z = Z.ravel()
    Y_res = np.zeros(n)
    D_res = np.zeros(n)
    Z_res = np.zeros(n)

    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    for train_idx, test_idx in kf.split(X):
        # Cross-fit all three nuisance models: E[Y|X], E[D|X], E[Z|X]
        for arr, res in [(Y, Y_res), (D, D_res), (Z, Z_res)]:
            m = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
            m.fit(X[train_idx], arr[train_idx])
            res[test_idx] = arr[test_idx] - m.predict(X[test_idx])
 
    # 2SLS on residuals
    theta = float(np.dot(Z_res, Y_res) / np.dot(Z_res, D_res))
    psi = Z_res * (Y_res - theta * D_res)
    V = float(np.mean(Z_res * D_res))
    var_theta = float(np.mean(psi ** 2)) / (n * V ** 2)
    se = float(np.sqrt(var_theta))
    z_crit = norm.ppf(0.975)
    return {
        'theta_iv': round(theta, 6),
        'se': round(se, 6),
        'ci_lower': round(theta - z_crit * se, 6),
        'ci_upper': round(theta + z_crit * se, 6),
        'pvalue': round(float(2 * norm.sf(abs(theta / se))), 4),
    }
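As a sanity check on the IV moment condition, here is a sketch on simulated data (linear nuisances for speed; every quantity, including theta0 and the shock structure, is an illustrative assumption). Price is endogenous because an unobserved shock drives both price and demand, and the instrument restores consistency:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(7)
n, p, theta0 = 4000, 10, -1.5
X = rng.normal(size=(n, p))
u = rng.normal(size=n)              # unobserved shock hitting price AND demand
Z = X[:, 0] + rng.normal(size=n)    # cost-shifter instrument (varies with X)
D = X[:, 0] + Z + 0.8 * u + rng.normal(size=n)     # endogenous price
Y = theta0 * D + X[:, 1] + u + rng.normal(size=n)  # demand

def resid(v):
    # cross-fitted residual of v on X using a linear nuisance model
    return v - cross_val_predict(LinearRegression(), X, v, cv=5)

Y_res, D_res, Z_res = resid(Y), resid(D), resid(Z)
theta_plr = float(D_res @ Y_res / (D_res @ D_res))  # ignores endogeneity
theta_iv = float(Z_res @ Y_res / (Z_res @ D_res))   # 2SLS on residuals

print(f"PLR (biased by u): {theta_plr:+.3f}  PLIV: {theta_iv:+.3f}  truth: {theta0:+.1f}")
```

The plain PLR estimate absorbs the correlation between the price residual and the unobserved shock, while the instrument residual is uncorrelated with the shock and isolates the true elasticity.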
