Double ML & Debiased Estimation
You have 500 confounders and you want to estimate the effect of price on demand. Naively, you throw everything into a Lasso regression and read off the price coefficient. But Lasso is designed to zero out small coefficients — and it will zero out some of the price variation along with the noise, biasing your estimate toward zero. This regularization bias problem afflicts any high-dimensional estimation, and it motivated one of the most elegant ideas in modern causal inference: Double Machine Learning.
Theory
[Interactive figure: Double ML cross-fitting, with one fold held out at a time]
The regularization bias problem. Suppose the true model is $Y = \theta_0 D + g_0(X) + \varepsilon$, where $D$ is the treatment, $X$ are confounders, and $g_0$ is a complex function. If you estimate $g_0$ with Lasso, the regularization introduces bias in $\hat g$. Because $D$ and $X$ are correlated (confounding), this bias propagates into $\hat\theta$.
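To see the shrinkage mechanism in isolation, here is a minimal simulated sketch (synthetic data with no confounding, so the penalty's effect on the treatment coefficient is unambiguous; the variable names and `alpha` value are illustrative): OLS recovers the true coefficient, while Lasso soft-thresholds it toward zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, p, theta = 5000, 50, 0.5
X = rng.normal(size=(n, p))  # pure noise covariates, to isolate shrinkage
D = rng.normal(size=n)
Y = theta * D + rng.normal(size=n)

features = np.column_stack([D, X])
theta_ols = float(LinearRegression().fit(features, Y).coef_[0])
theta_lasso = float(Lasso(alpha=0.1).fit(features, Y).coef_[0])
# theta_ols is close to 0.5; theta_lasso is soft-thresholded toward zero
```

With confounders correlated with $D$, this same shrinkage no longer averages out; it becomes the regularization bias described above.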
Neyman orthogonality. Chernozhukov et al. (2018) identified the key condition: a moment condition $\mathbb{E}[\psi(W; \theta_0, \eta_0)] = 0$ is Neyman-orthogonal if

$$\left.\partial_\eta \, \mathbb{E}[\psi(W; \theta_0, \eta)]\right|_{\eta = \eta_0} = 0.$$

That is, the moment condition is insensitive (at first order) to perturbations in the nuisance parameter $\eta$ around its true value $\eta_0$. This means errors in estimating $\eta$ do not bias $\hat\theta$ at first order: the leading bias term cancels.
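A quick numerical check on simulated data (all names, the data-generating process, and the perturbation direction $h(X) = X$ are illustrative): perturbing the nuisances by $\varepsilon$ moves the orthogonal PLR score only at order $\varepsilon^2$, while a non-orthogonal score moves at order $\varepsilon$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta = 200_000, 0.5
X = rng.normal(size=n)
D = X + rng.normal(size=n)                   # true E[D|X] = X
Y = theta * D + X ** 2 + rng.normal(size=n)  # true E[Y|X] = theta*X + X**2

e0 = X                   # true treatment nuisance
m0 = theta * X + X ** 2  # true outcome nuisance

def orthogonal_score(eps: float) -> float:
    # PLR score with both nuisances perturbed in the direction h(X) = X
    D_t = D - (e0 + eps * X)
    Y_t = Y - (m0 + eps * X)
    return float(np.mean((Y_t - theta * D_t) * D_t))

def naive_score(eps: float) -> float:
    # Non-orthogonal score D * (Y - theta*D - g(X)): linear in the perturbation
    g = X ** 2 + eps * X
    return float(np.mean(D * (Y - theta * D - g)))
```

At `eps = 0.1` the orthogonal score stays near zero (second order in `eps`), while the naive score shifts by roughly `eps` itself.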
Why it had to be this way. The orthogonality condition is not arbitrary. It arises from the influence function of semiparametric estimation theory: in a partially linear model, the efficient influence function for $\theta$ is (up to normalization) $\psi(W; \theta, \eta) = \big(Y - m(X) - \theta\,(D - e(X))\big)\,(D - e(X))$, where $m(X) = \mathbb{E}[Y \mid X]$, $e(X) = \mathbb{E}[D \mid X]$, and $D - e(X)$ is the "residualized" treatment. This is the Frisch-Waugh-Lovell partialling-out formula made robust to machine learning nuisance estimation.
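The Frisch-Waugh-Lovell identity itself is easy to verify numerically. This sketch (synthetic data, illustrative names) shows that the $D$ coefficient from the full least-squares fit equals the slope from regressing residualized $Y$ on residualized $D$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 5
X = rng.normal(size=(n, p))
D = X @ rng.normal(size=p) + rng.normal(size=n)
Y = 0.7 * D + X @ rng.normal(size=p) + rng.normal(size=n)

# Full least squares of Y on [D, X]: read off the D coefficient
beta_full = np.linalg.lstsq(np.column_stack([D, X]), Y, rcond=None)[0]

# FWL: residualize D and Y on X, then regress residual on residual
D_res = D - X @ np.linalg.lstsq(X, D, rcond=None)[0]
Y_res = Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]
theta_fwl = float(D_res @ Y_res / (D_res @ D_res))
# theta_fwl equals beta_full[0] up to floating-point error
```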
The Double ML algorithm (Partially Linear Regression):
- Fit $\hat m(X) \approx \mathbb{E}[Y \mid X]$ predicting $Y$ from $X$ (nuisance model 1)
- Fit $\hat e(X) \approx \mathbb{E}[D \mid X]$ predicting $D$ from $X$ (nuisance model 2)
- Compute residuals: $\tilde Y = Y - \hat m(X)$, $\tilde D = D - \hat e(X)$
- Regress $\tilde Y$ on $\tilde D$: $\hat\theta = \sum_i \tilde D_i \tilde Y_i \,\big/ \sum_i \tilde D_i^2$
Cross-fitting. If you use the same data for both nuisance estimation and the final regression, overfitting re-introduces bias (the nuisance model has memorized the noise in and ). K-fold cross-fitting solves this: fit nuisance models on K-1 folds, predict on the held-out fold, then use all held-out residuals for the final regression.
Inference. Under cross-fitting and Neyman orthogonality, $\hat\theta$ is asymptotically normal, with variance estimated by the sandwich formula

$$\widehat{\mathrm{Var}}(\hat\theta) = \frac{\frac{1}{n}\sum_i \tilde D_i^2\,\big(\tilde Y_i - \hat\theta \tilde D_i\big)^2}{n\,\Big(\frac{1}{n}\sum_i \tilde D_i^2\Big)^2}.$$

Standard errors come from the moment condition, not from the nuisance models.
Walkthrough
Scenario: Estimate price elasticity from 50,000 transaction records with 200 confounders (product features, customer attributes, market conditions). Naive OLS would be biased; Lasso direct estimation has regularization bias.
Step 1: Cross-fitting pipeline.
```python
import numpy as np
from scipy.stats import norm
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor


def double_ml_plr(
    Y: np.ndarray,  # (n,) outcome
    D: np.ndarray,  # (n,) treatment
    X: np.ndarray,  # (n, p) confounders
    n_folds: int = 5,
) -> dict:
    """Double ML partially linear regression with K-fold cross-fitting.

    Returns debiased ATE estimate with SE and 95% CI.
    """
    n = len(Y)
    Y_res = np.zeros(n)  # Y - m(X)
    D_res = np.zeros(n)  # D - e(X)
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    for train_idx, test_idx in kf.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        Y_train, Y_test = Y[train_idx], Y[test_idx]
        D_train, D_test = D[train_idx], D[test_idx]
        # Nuisance model 1: Y ~ m(X)
        m_hat = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
        m_hat.fit(X_train, Y_train)
        Y_res[test_idx] = Y_test - m_hat.predict(X_test)
        # Nuisance model 2: D ~ e(X)
        e_hat = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
        e_hat.fit(X_train, D_train)
        D_res[test_idx] = D_test - e_hat.predict(X_test)
    # Final regression: Y_res ~ theta * D_res
    theta = float(np.dot(D_res, Y_res) / np.dot(D_res, D_res))
    # Sandwich SE from the moment condition
    psi = D_res * (Y_res - theta * D_res)
    V = float(np.mean(D_res ** 2))
    var_theta = float(np.mean(psi ** 2)) / (n * V ** 2)
    se = float(np.sqrt(var_theta))
    z = norm.ppf(0.975)
    return {
        'theta': round(theta, 6),
        'se': round(se, 6),
        'ci_lower': round(theta - z * se, 6),
        'ci_upper': round(theta + z * se, 6),
        'pvalue': round(float(2 * norm.sf(abs(theta / se))), 4),
        'n_folds': n_folds,
    }
```

Step 2: Compare to naive OLS.
```python
import numpy as np
import statsmodels.api as sm


def naive_ols(Y, D, X):
    features = np.column_stack([D, X])
    model = sm.OLS(Y, sm.add_constant(features)).fit()
    # Index 0 is the constant, so index 1 is the D coefficient
    return {'theta': round(float(model.params[1]), 6), 'se': round(float(model.bse[1]), 6)}
```

Typical finding: with 200 correlated confounders, naive OLS understates its standard error and is unstable under collinearity; a direct Lasso fit shrinks the price coefficient toward zero because the penalty absorbs treatment variation; Double ML removes the regularization bias, and its SE reflects the actual uncertainty.
Analysis & Evaluation
Where your intuition breaks. It seems like using more flexible nuisance models (XGBoost instead of Lasso) should always improve Double ML. This is largely true, but there is a key exception: if the nuisance model for $D$ over-fits (nearly perfectly predicts treatment from covariates), the residuals $\tilde D$ become very small in the test fold, and the final regression becomes unstable. Cross-fitting mitigates but does not eliminate this: with $K = 5$ folds and 80% of the data for training, very flexible models can still over-fit enough to create near-zero residuals. The diagnostic: plot the distribution of $\tilde D$ and check that it has substantial variance.
| Method | Regularization bias | High-dimensional confounders | SE validity |
|---|---|---|---|
| Naive OLS | Present | No (collinearity) | Invalid |
| Lasso direct | Present | Yes | Invalid |
| Double ML (Lasso nuisance) | Eliminated | Yes | Valid |
| Double ML (GBM nuisance) | Eliminated | Yes | Valid |
| Double ML (RF nuisance) | Eliminated | Yes | Valid |
When Double ML fails. If overlap is poor (treatment is nearly deterministic given $X$), $\tilde D$ has near-zero variance and the estimator is unstable. If $p \gg n$ (more confounders than observations), even flexible nuisance models may fail. In these cases, consider instrumental variable methods or partial identification.
Check the treatment residuals. Before trusting a Double ML estimate, plot the distribution of $\tilde D = D - \hat e(X)$ (the treatment residuals). If the variance is very small, overlap is poor and the estimator is unreliable. The ratio $\operatorname{Var}(\tilde D) / \operatorname{Var}(D)$ should be at least 0.05.
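That check can be scripted. A small helper along these lines (the function name is hypothetical; the 0.05 default follows the rule of thumb above, not a formal test):

```python
import numpy as np


def check_overlap(D: np.ndarray, D_res: np.ndarray, min_ratio: float = 0.05) -> dict:
    """Flag near-zero treatment residual variance (overlap diagnostic).

    D_res are the cross-fitted residuals D - e_hat(X). A ratio of residual
    variance to raw treatment variance below `min_ratio` suggests the
    nuisance model explains almost all treatment variation.
    """
    ratio = float(np.var(D_res) / np.var(D))
    return {'residual_var_ratio': round(ratio, 4), 'ok': ratio >= min_ratio}


# Example: a nuisance model leaving ~9% of treatment variance passes the check
rng = np.random.default_rng(0)
D = rng.normal(size=1000)
D_res = 0.3 * rng.normal(size=1000)
result = check_overlap(D, D_res)
```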
Production-Ready Code
"""
Double ML production pipeline.
PLR cross-fitting, PLIV for instruments, nuisance
diagnostics, and automated model selection.
"""
from __future__ import annotations
from dataclasses import dataclass
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso, LassoCV
from scipy.stats import norm
@dataclass
class DoubleMLResult:
theta: float
se: float
ci_lower: float
ci_upper: float
pvalue: float
d_residual_r2: float # R^2 of D nuisance model (coverage diagnostic)
d_residual_var: float # Var(D_tilde) — near zero = overlap problem
n_folds: int
def double_ml_plr(
Y: np.ndarray,
D: np.ndarray,
X: np.ndarray,
n_folds: int = 5,
nuisance_model: str = 'gbm', # 'gbm', 'lasso', 'lasso_cv'
) -> DoubleMLResult:
"""Double ML partially linear regression with K-fold cross-fitting."""
n = len(Y)
Y_res = np.zeros(n)
D_res = np.zeros(n)
D_pred = np.zeros(n)
def make_model():
if nuisance_model == 'gbm':
return GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
elif nuisance_model == 'lasso_cv':
return LassoCV(cv=3, random_state=0)
return Lasso(alpha=0.01)
kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
m = make_model()
m.fit(X[train_idx], Y[train_idx])
Y_res[test_idx] = Y[test_idx] - m.predict(X[test_idx])
e = make_model()
e.fit(X[train_idx], D[train_idx])
D_pred[test_idx] = e.predict(X[test_idx])
D_res[test_idx] = D[test_idx] - D_pred[test_idx]
theta = float(np.dot(D_res, Y_res) / np.dot(D_res, D_res))
psi = D_res * (Y_res - theta * D_res)
V = float(np.mean(D_res ** 2))
var_theta = float(np.mean(psi ** 2)) / (n * V ** 2)
se = float(np.sqrt(var_theta))
d_residual_r2 = float(1 - np.var(D_res) / np.var(D))
d_residual_var = float(np.var(D_res))
z = norm.ppf(0.975)
if d_residual_var < 0.01 * np.var(D):
import warnings
warnings.warn(
f"Treatment residual variance is very small ({d_residual_var:.4f}). "
"Overlap may be poor — Double ML estimate is unreliable."
)
return DoubleMLResult(
theta=round(theta, 6),
se=round(se, 6),
ci_lower=round(theta - z * se, 6),
ci_upper=round(theta + z * se, 6),
pvalue=round(float(2 * norm.sf(abs(theta / se))), 4),
d_residual_r2=round(d_residual_r2, 4),
d_residual_var=round(d_residual_var, 6),
n_folds=n_folds,
)
def double_ml_pliv(
Y: np.ndarray,
D: np.ndarray,
Z: np.ndarray, # instrument(s), shape (n,) or (n, n_instruments)
X: np.ndarray,
n_folds: int = 5,
) -> dict:
"""Double ML partially linear IV — for endogenous treatment with instrument.
Uses cross-fitting on three nuisance models: E[Y|X], E[D|X], E[Z|X].
Final stage: 2SLS on residuals.
"""
n = len(Y)
Y_res = np.zeros(n)
D_res = np.zeros(n)
Z_res = np.zeros(n) if Z.ndim == 1 else np.zeros((n, Z.shape[1]))
kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
for arr, res in [(Y, Y_res), (D, D_res)]:
m = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
m.fit(X[train_idx], arr[train_idx])
res[test_idx] = arr[test_idx] - m.predict(X[test_idx])
z_arr = Z if Z.ndim == 1 else Z
mz = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=0)
mz.fit(X[train_idx], z_arr[train_idx] if Z.ndim == 1 else z_arr[train_idx, 0])
Z_res[test_idx] = (z_arr[test_idx] if Z.ndim == 1 else z_arr[test_idx, 0]) - mz.predict(X[test_idx])
# 2SLS on residuals
theta = float(np.dot(Z_res, Y_res) / np.dot(Z_res, D_res))
psi = Z_res * (Y_res - theta * D_res)
V = float(np.mean(Z_res * D_res))
var_theta = float(np.mean(psi ** 2)) / (n * V ** 2)
se = float(np.sqrt(var_theta))
z_crit = norm.ppf(0.975)
return {
'theta_iv': round(theta, 6),
'se': round(se, 6),
'ci_lower': round(theta - z_crit * se, 6),
'ci_upper': round(theta + z_crit * se, 6),
'pvalue': round(float(2 * norm.sf(abs(theta / se))), 4),
}Enjoying these notes?