
Potential Outcomes & DAGs

You can't always run an experiment. A feature rolled out to all users simultaneously, a policy change that happened in the past, a treatment that can't ethically be withheld from a control group — these are the situations that require causal inference from observational data. The potential outcomes framework formalizes what "causal effect" means when you can never directly observe the counterfactual: what would have happened to the same user, at the same time, if we had treated them differently. Directed Acyclic Graphs (DAGs) give you a principled language for determining which variables to control for and — critically — which ones to never touch. Getting this wrong doesn't just add noise to your estimate; conditioning on the wrong variable can flip the sign entirely. The confusion between regression coefficients and causal effects is the most common source of incorrect claims in applied data science.

Theory

DAG Structures

[Interactive figure: a DAG with treatment T, outcome Y, and a third variable M. Shown: the Fork (Confounder) structure — M causes both T and Y. Without conditioning, the backdoor path T ← M → Y is OPEN (bias); conditioning on M closes it.]

Toggle "condition on M" to see how each structure responds to conditioning. Colliders are uniquely dangerous: conditioning opens a previously blocked path.

You can give a patient a drug or not give it, but you can never do both to the same patient at the same time. Causal inference is entirely about estimating what would have happened in the path not taken — the counterfactual. Every method in this module is a strategy for recovering population-level averages of that unobservable difference, under explicit assumptions about which variables have been measured and which paths in the causal graph are blocked.

The fundamental problem of causal inference

For each unit $i$, define two potential outcomes: $Y_i(1)$ (outcome if treated) and $Y_i(0)$ (outcome if not treated). The individual treatment effect is $\tau_i = Y_i(1) - Y_i(0)$.

We can never observe both $Y_i(1)$ and $Y_i(0)$ for the same unit at the same time — this is the fundamental problem of causal inference (Holland, 1986). One of them is always the counterfactual: what would have happened under the alternative assignment.

This is not a data problem. No amount of data lets you observe the counterfactual for a specific person. It is a logical impossibility. All causal inference methods are strategies for estimating population-level averages of effects without observing individual counterfactuals.

Treatment effect estimands

Different causal questions correspond to different estimands:

Average Treatment Effect (ATE): the effect of treating the entire population. $\tau_{\text{ATE}} = E[Y_i(1) - Y_i(0)]$

The expectation is the only estimable quantity because each individual contributes exactly one observed outcome — either $Y_i(1)$ or $Y_i(0)$, never both. Population averages sidestep the fundamental problem by treating the two groups as exchangeable samples from the same distribution: if randomization holds, $E[Y \mid T=1]$ is an unbiased estimate of $E[Y_i(1)]$ and $E[Y \mid T=0]$ estimates $E[Y_i(0)]$. Without that exchangeability assumption — which requires either randomization or measured confounders — the difference in observed means is not the ATE.
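A quick simulation makes both points concrete (all numbers are invented for illustration): each unit reveals only one potential outcome, yet under randomization the difference in observed means recovers the ATE.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulate BOTH potential outcomes — possible only because the data is synthetic.
y0 = rng.normal(0.0, 1.0, n)        # Y_i(0)
y1 = y0 + 2.0                        # Y_i(1); true ATE = 2.0 by construction
t = rng.binomial(1, 0.5, n)          # randomized assignment

# In real data only this column exists; the other outcome is the counterfactual.
y_obs = np.where(t == 1, y1, y0)

true_ate = (y1 - y0).mean()
diff_in_means = y_obs[t == 1].mean() - y_obs[t == 0].mean()
print(true_ate, diff_in_means)       # diff-in-means ≈ 2.0 under randomization
```

No amount of extra data would let you fill in the hidden column of `y0`/`y1`; randomization only guarantees the two arms are exchangeable on average.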

Average Treatment Effect on the Treated (ATT): the effect on those who actually received treatment. $\tau_{\text{ATT}} = E[Y_i(1) - Y_i(0) \mid T_i = 1]$

Conditional Average Treatment Effect (CATE): the effect for a specific subgroup. $\tau(x) = E[Y_i(1) - Y_i(0) \mid X_i = x]$

Local Average Treatment Effect (LATE): the effect for compliers (see IV section). Each estimand answers a different question — confusing them is a common source of incorrect causal claims.
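A sketch of how the estimands diverge, using a synthetic population where the effect is heterogeneous and units with larger gains adopt more often (all coefficients invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

x = rng.uniform(0.0, 1.0, n)      # covariate driving effect size
y0 = rng.normal(0.0, 1.0, n)
y1 = y0 + x                       # heterogeneous effect: tau_i = x_i
t = rng.binomial(1, x)            # units with larger gains adopt more often

tau = y1 - y0                     # observable only because the data is simulated
ate = tau.mean()                  # ≈ E[x] = 0.5
att = tau[t == 1].mean()          # ≈ E[x | T=1] = 2/3 (selection on gains)
cate_low = tau[x < 0.5].mean()    # ≈ 0.25: CATE for the low-x subgroup
print(ate, att, cate_low)
```

ATT > ATE here because adoption selects on gains; neither number is wrong — they answer different questions.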

SUTVA

SUTVA (Stable Unit Treatment Value Assumption) requires:

  1. No interference: $Y_i(T_i, T_j) = Y_i(T_i)$ — unit $i$'s outcome depends only on $i$'s treatment, not others'
  2. No hidden versions of treatment: "treated" has one well-defined meaning

SUTVA is violated when:

  • A social network recommendation changes what user $j$ sees when user $i$ is treated
  • A shared inference server slows down when treatment arm uses more compute
  • A pricing treatment on sellers changes buyers' behavior

Without SUTVA, the naive comparison $E[Y \mid T=1] - E[Y \mid T=0]$ estimates a mixture of direct and spillover effects, not the policy-relevant TATE (Total Average Treatment Effect of treating everyone).
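A minimal interference sketch (the pairing and coefficients are hypothetical): units come in pairs and each unit's outcome depends on its partner's treatment. With independent assignment, the naive contrast recovers only the direct effect, while the effect of rolling out to everyone includes the spillover.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000                                          # even, so units pair up

direct, spillover = 1.0, 0.5                         # invented effect sizes
t = rng.binomial(1, 0.5, n).astype(float)
t_partner = t.reshape(-1, 2)[:, ::-1].reshape(-1)    # each unit's partner's treatment

# SUTVA violation: outcome depends on own treatment AND the partner's.
y = direct * t + spillover * t_partner + rng.normal(0, 1, n)

naive = y[t == 1].mean() - y[t == 0].mean()          # ≈ direct only: spillovers
                                                     # average out across arms
tate = direct + spillover                            # treat everyone vs. no one,
                                                     # known here by construction
print(naive, tate)
```

The naive estimate is not "wrong" about the direct effect — it simply answers a different question than "what happens if we treat everyone?".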

Identification: when is ATE estimable?

A randomized experiment identifies ATE because random assignment ensures: $E[Y_i(1) \mid T_i = 1] = E[Y_i(1)], \quad E[Y_i(0) \mid T_i = 0] = E[Y_i(0)]$

so the observed difference in means equals ATE.

Without randomization, selection bias enters. Users who adopt a new feature are not representative of all users. The observed difference decomposes as:

$$\underbrace{E[Y \mid T=1] - E[Y \mid T=0]}_{\text{observed}} = \underbrace{\tau_{\text{ATT}}}_{\text{causal}} + \underbrace{E[Y_i(0) \mid T_i=1] - E[Y_i(0) \mid T_i=0]}_{\text{selection bias}}$$

The selection bias term is the difference in counterfactual control outcomes between the treated and control groups — unobservable without an experiment or a credible assumption.
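The decomposition is an in-sample identity, which a short simulation (with invented parameters) can verify directly, because simulated data lets us see both potential outcomes:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

u = rng.normal(0, 1, n)                        # confounder
y0 = u + rng.normal(0, 1, n)                   # Y(0) depends on the confounder
y1 = y0 + 1.0                                  # constant effect tau = 1
t = (u + rng.normal(0, 1, n) > 0).astype(int)  # selection on the confounder
y = np.where(t == 1, y1, y0)

observed = y[t == 1].mean() - y[t == 0].mean()
att = (y1 - y0)[t == 1].mean()                       # = 1.0 here by construction
selection = y0[t == 1].mean() - y0[t == 0].mean()    # counterfactual control gap
print(observed, att + selection)               # identity holds exactly in-sample
```

With real data, `selection` is unobservable — that is exactly why an experiment or a credible assumption is needed.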

Directed Acyclic Graphs (DAGs)

A DAG $G = (V, E)$ encodes causal structure: an edge $X \to Y$ means $X$ causes $Y$. DAGs let you reason about which variables to control for — a purely statistical approach can lead to bias by controlling for the wrong variables.

Backdoor criterion: a set $Z$ satisfies the backdoor criterion relative to $(T, Y)$ if:

  1. $Z$ blocks every backdoor path from $T$ to $Y$ (paths with arrows into $T$)
  2. $Z$ contains no descendant of $T$

When $Z$ satisfies the backdoor criterion, the causal effect is identified by the adjustment formula: $E[Y \mid \text{do}(T=t)] = \sum_z E[Y \mid T=t, Z=z] \cdot P(Z=z)$
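The adjustment formula can be implemented directly when $Z$ is discrete: stratify, estimate the treated–control contrast within each stratum, and reweight by $P(Z=z)$. A sketch with a simulated binary confounder (all parameters invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 100_000

z = rng.binomial(1, 0.5, n)                  # discrete confounder
p = np.where(z == 1, 0.8, 0.2)               # Z shifts treatment probability
t = rng.binomial(1, p)
y = 1.0 * t + 2.0 * z + rng.normal(0, 1, n)  # true effect = 1.0
df = pd.DataFrame({"z": z, "t": t, "y": y})

# Naive contrast: confounded, biased upward by the Z -> Y effect
naive = df.loc[df.t == 1, "y"].mean() - df.loc[df.t == 0, "y"].mean()

# Adjustment formula: sum_z [E[Y|T=1,z] - E[Y|T=0,z]] * P(z)
effect = 0.0
for z_val, g in df.groupby("z"):
    stratum_effect = g.loc[g.t == 1, "y"].mean() - g.loc[g.t == 0, "y"].mean()
    effect += stratum_effect * (len(g) / len(df))

print(naive, effect)   # naive is well above 1.0; adjusted estimate ≈ 1.0
```

This stratification estimator is exactly what OLS with a saturated control for $Z$ approximates; it breaks down when strata contain only treated or only control units (an overlap violation).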

In the walkthrough example below, tenure $U$ is a confounder — it causes both feature adoption $T$ and revenue $Y$. Controlling for $U$ blocks the backdoor path $T \leftarrow U \rightarrow Y$ and removes bias.

Collider bias

A collider is a variable caused by both $T$ and $Y$ (or by two variables on a path). Conditioning on a collider opens a path that was previously blocked, introducing bias.

Example: $T$ = new algorithm, $Y$ = user satisfaction, $M$ = user complaints. Both the algorithm and satisfaction affect complaints. If you control for $M$ in your regression — perhaps trying to understand "among users who complained equally" — you open a spurious path between $T$ and $Y$.

The rule: never control for a descendant of $T$ unless you have a specific causal reason. This includes intermediate outcomes, post-treatment variables, and proxy outcomes. Always draw the DAG before selecting controls.
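A simulation with the complaint example's structure (coefficients invented; statsmodels as in the walkthrough below): the treatment has no effect at all on the outcome, yet controlling for the collider manufactures one.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 20_000

t = rng.binomial(1, 0.5, n).astype(float)  # treatment
y = rng.normal(0, 1, n)                    # outcome: NO causal effect of t
m = t + y + rng.normal(0, 1, n)            # collider: caused by both t and y
df = pd.DataFrame({"t": t, "y": y, "m": m})

no_control = smf.ols("y ~ t", data=df).fit().params["t"]        # ≈ 0 (correct)
with_collider = smf.ols("y ~ t + m", data=df).fit().params["t"] # spuriously negative
print(no_control, with_collider)
```

The sign and rough size of the spurious coefficient follow from the collider equation: conditioning on `m = t + y + noise` forces `t` and `y` to trade off against each other.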

OLS under unconfoundedness

If all confounders $X$ are observed (unconfoundedness / ignorability): $Y_i(t) \perp\!\!\!\perp T_i \mid X_i$

Then Ordinary Least Squares (OLS) with controls identifies the ATE: $Y_i = \alpha + \tau T_i + \beta^T X_i + \varepsilon_i$

Adding relevant covariates $X$ increases precision even when they don't affect treatment selection — this is the Analysis of Covariance (ANCOVA) principle. The key assumption is linearity: if the true relationship between $X$ and $Y$ is nonlinear, OLS absorbs the nonlinearity into bias on $\hat{\tau}$.
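The precision claim is easy to verify: in a randomized simulation (invented coefficients), controlling for a prognostic covariate leaves the estimate unbiased and shrinks its standard error.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 5_000

x = rng.normal(0, 1, n)                    # prognostic covariate
t = rng.binomial(1, 0.5, n).astype(float)  # randomized: x is NOT a confounder
y = 1.0 * t + 2.0 * x + rng.normal(0, 1, n)
df = pd.DataFrame({"t": t, "x": x, "y": y})

fit_plain = smf.ols("y ~ t", data=df).fit()
fit_ancova = smf.ols("y ~ t + x", data=df).fit()

# Both target the same ATE; the ANCOVA fit soaks up outcome variance via x,
# so its standard error on t is much smaller.
print(fit_plain.bse["t"], fit_ancova.bse["t"])
```

Under randomization the covariate is optional for unbiasedness but valuable for power; under confounding it becomes mandatory.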

Walkthrough

Reading a DAG: what to control for

python
# Simulated example: user tenure confounds feature adoption and revenue
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
 
np.random.seed(42)
n = 5_000
 
# DAG: Tenure → T (feature adoption), Tenure → Y (revenue), T → Y
tenure = np.random.exponential(scale=3, size=n)  # years
treatment = (tenure + np.random.normal(0, 2, n) > 4).astype(float)  # higher tenure → more likely to adopt
revenue = 0.5 * treatment + 1.2 * tenure + np.random.normal(0, 2, n)  # true τ = 0.5
 
df = pd.DataFrame({'treatment': treatment, 'tenure': tenure, 'revenue': revenue})
 
# Naive: omit confounder → biased
model_naive = smf.ols("revenue ~ treatment", data=df).fit()
print(f"Naive estimate: {model_naive.params['treatment']:.3f}")  # biased well above 0.5 — absorbs the tenure gap
 
# Correct: control for confounder → unbiased
model_correct = smf.ols("revenue ~ treatment + tenure", data=df).fit()
print(f"Correct estimate: {model_correct.params['treatment']:.3f}")  # → ~0.5 (true effect)

CATE estimation with T-learner

When you want to know who benefits most from the treatment, not just the average effect:

python
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
 
def t_learner_cate(
    X: np.ndarray,
    T: np.ndarray,
    Y: np.ndarray,
) -> np.ndarray:
    """
    T-learner: fit separate outcome models for treated and control,
    predict counterfactuals, return per-unit CATE estimates.
    """
    m1 = GradientBoostingRegressor(n_estimators=200, max_depth=4, random_state=42)
    m0 = GradientBoostingRegressor(n_estimators=200, max_depth=4, random_state=42)
 
    m1.fit(X[T == 1], Y[T == 1])
    m0.fit(X[T == 0], Y[T == 0])
 
    # CATE for each unit: predicted outcome under treatment minus under control
    tau_hat = m1.predict(X) - m0.predict(X)
    return tau_hat
 
# Usage: X (covariate matrix), T (treatment), Y (outcome) are arrays from your dataset
tau_hat = t_learner_cate(X, T, Y)
 
# Segment by user type to identify high-responders
df['cate'] = tau_hat
print(df.groupby('user_segment')['cate'].mean().sort_values(ascending=False))

The T-learner has a known weakness: if the treated and control groups differ in covariate distribution (which they do in observational data), each outcome model extrapolates into regions it wasn't trained on, adding bias. The DR-learner and R-learner correct for this.
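For intuition, here is a minimal doubly robust (AIPW) sketch for the average effect — a simplified relative of the DR-learner, without the cross-fitting a careful implementation would add. The data and coefficients are simulated; the models here happen to be correctly specified.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(7)
n = 20_000

x = rng.normal(0, 1, (n, 1))
e_true = 1 / (1 + np.exp(-x[:, 0]))             # true propensity
t = rng.binomial(1, e_true)
y = 1.0 * t + x[:, 0] + rng.normal(0, 1, n)     # true ATE = 1.0

m1 = LinearRegression().fit(x[t == 1], y[t == 1])   # outcome model, treated
m0 = LinearRegression().fit(x[t == 0], y[t == 0])   # outcome model, control
ps = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]

# AIPW: outcome-model prediction plus inverse-propensity-weighted residuals.
# Consistent if EITHER the outcome models or the propensity model is right.
aipw = (
    m1.predict(x) - m0.predict(x)
    + t * (y - m1.predict(x)) / ps
    - (1 - t) * (y - m0.predict(x)) / (1 - ps)
)
print(aipw.mean())   # ≈ 1.0
```

The residual-correction terms are what the plain T-learner lacks: they pull extrapolated outcome-model predictions back toward the data wherever treated and control covariates overlap poorly.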

Analysis & Evaluation

Where Your Intuition Breaks

A common intuition says that controlling for more variables always improves a causal estimate. It doesn't. Conditioning on a collider — a variable caused by both treatment and outcome — opens a spurious path between them and introduces bias that didn't exist before. If you add "hospital admission" as a control in a study of smoking and health, you induce collider bias: among hospitalized patients, smoking appears protective because you've conditioned on a variable that selects for both smokers and sick non-smokers. The rule is not "control for everything measured" but "control for confounders and block backdoor paths without conditioning on colliders or mediators."

When does OLS fail?

OLS identifies ATE under unconfoundedness, but two practical failures are common:

Nonlinear confounding: If tenure affects revenue nonlinearly (e.g., exponentially), a linear control for tenure leaves residual confounding. Adding tenure^2 or using a nonparametric model for confounders (Double ML) addresses this.
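A sketch of this failure and the fix, reusing the walkthrough's simulation design but with a quadratic tenure effect (parameters invented):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n = 50_000

tenure = rng.exponential(scale=3, size=n)
treatment = (tenure + rng.normal(0, 2, n) > 4).astype(float)
# True effect 0.5, but tenure affects revenue QUADRATICALLY
revenue = 0.5 * treatment + 0.3 * tenure**2 + rng.normal(0, 1, n)
df = pd.DataFrame({"treatment": treatment, "tenure": tenure, "revenue": revenue})

linear = smf.ols("revenue ~ treatment + tenure", data=df).fit()
quadratic = smf.ols("revenue ~ treatment + tenure + I(tenure**2)", data=df).fit()

# Linear-only control leaves large residual confounding; the quadratic
# control matches the true functional form and recovers ~0.5.
print(linear.params["treatment"], quadratic.params["treatment"])
```

In practice you rarely know the right transformation, which is the motivation for letting a flexible learner model the confounders (Double ML).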

High-dimensional confounders: With hundreds of user features as controls, OLS overfits. The regularization (Lasso, Ridge) needed to handle high dimensions introduces shrinkage bias on $\hat{\tau}$. Double ML (next note) solves this by separately modeling confounders and using residual regression.

Overlap violations: If treated and control groups have non-overlapping covariate distributions, OLS extrapolates and is unreliable. Check overlap visually before running regression.
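A simple numeric companion to the visual check (the 0.05/0.95 thresholds are a common but arbitrary choice): estimate propensity scores and flag units sitting outside the bulk of the overlap region.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(9)
n = 10_000

x = rng.normal(0, 1, (n, 1))
# Mild selection: treatment probability varies with x but stays moderate,
# so overlap is good in this simulated example.
t = rng.binomial(1, 1 / (1 + np.exp(-0.5 * x[:, 0])))

ps = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]

# Share of units with extreme estimated propensity — an overlap diagnostic.
extreme = ((ps < 0.05) | (ps > 0.95)).mean()
print(f"propensity range: [{ps.min():.3f}, {ps.max():.3f}], "
      f"extreme share: {extreme:.3%}")
```

A large extreme share means some units have essentially no comparable counterparts in the other arm; trimming them or switching estimands (e.g., to an overlap-weighted effect) is safer than letting OLS extrapolate.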

DAG checklist before any observational analysis

1. List all variables in your dataset
2. Draw arrows for every causal relationship you believe exists
3. Identify the treatment T and outcome Y
4. Find all backdoor paths: paths from T to Y with at least one arrow into T
5. Find a sufficient adjustment set Z that:
   - Blocks all backdoor paths
   - Contains no descendant of T
6. Include exactly Z as controls — no more, no less
7. Check for colliders: if any variable in Z is caused by both T and Y (directly or through paths),
   remove it from the adjustment set
⚠️Warning

The most common mistake in observational analysis is "controlling for everything." Including post-treatment variables or colliders introduces more bias, not less. The DAG is the tool for deciding what to control for — not statistical fit or correlation with the outcome.

Causal claims require causal assumptions

A regression coefficient is a statistical summary. It becomes a causal estimate only when you assert:

  1. Unconfoundedness: you've controlled for all common causes of $T$ and $Y$
  2. Correct functional form: the relationship between controls and outcome is well-specified
  3. SUTVA: no interference between units

When you present an observational result, state these assumptions explicitly. "Controlling for user segment and tenure, treated users generated 3.2% more revenue" is a regression result. "The feature caused a 3.2% revenue increase" is a causal claim that requires all three assumptions above.

🚀Production

Lyft's data science team maintains explicit DAG documentation for their core metrics. Before any observational analysis is published internally, a reviewer checks that the adjustment set was chosen using the DAG, not by stepwise regression or arbitrary "throw everything in." This process caught several analyses that had inadvertently controlled for colliders, flipping the sign of estimated effects.

Production-Ready Code

Before running any causal analysis, build a machine-checkable causal model. The code below constructs a DAG from an adjacency dict, finds all backdoor paths, reports unmeasured confounders that block identification, and tests the DAG's d-separation implications against observed data. A failing implication test is evidence the DAG is misspecified — the most common cause of incorrect causal claims in practice.

python
# dag_validation.py
# DAG construction, backdoor path analysis, and d-separation implication testing.
 
from __future__ import annotations
import networkx as nx
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
 
 
def build_dag(adjacency: dict[str, list[str]]) -> nx.DiGraph:
    """Build a DAG from {parent: [child1, child2]} adjacency dict."""
    G = nx.DiGraph()
    for parent, children in adjacency.items():
        for child in children:
            G.add_edge(parent, child)
    assert nx.is_directed_acyclic_graph(G), "Cycle detected — not a valid DAG"
    return G
 
 
def backdoor_analysis(
    G: nx.DiGraph,
    treatment: str,
    outcome: str,
    measured: set[str],
) -> dict:
    """
    Finds all backdoor paths (paths into treatment that reach outcome),
    reports unmeasured confounders on those paths, and suggests an adjustment set.
 
    A backdoor path from T to Y has its first edge going INTO treatment.
    Blocking all such paths via measured covariates achieves identification.
    """
    G_ud = G.to_undirected()
    all_paths = list(nx.all_simple_paths(G_ud, treatment, outcome))
 
    backdoor_paths = []
    for path in all_paths:
        if len(path) > 2 and G.has_edge(path[1], path[0]):
            backdoor_paths.append(path)
 
    unmeasured: set[str] = set()
    adjustment_candidates: set[str] = set()
    for path in backdoor_paths:
        for node in path:
            if node in {treatment, outcome}:
                continue
            if node not in measured:
                unmeasured.add(node)
            else:
                adjustment_candidates.add(node)
 
    return {
        "n_backdoor_paths": len(backdoor_paths),
        "backdoor_paths": [" ← ".join(p[:2]) + " … " + p[-1] for p in backdoor_paths],
        "unmeasured_confounders": sorted(unmeasured),
        "identified": len(unmeasured) == 0,
        "suggested_adjustment_set": sorted(adjustment_candidates),
        "verdict": (
            "Identified via backdoor adjustment"
            if not unmeasured
            else f"NOT identified — unmeasured: {sorted(unmeasured)}. "
                 "Consider IV or RDD if a valid instrument exists."
        ),
    }
 
 
def check_dseparation_implications(
    df: pd.DataFrame,
    G: nx.DiGraph,
    alpha: float = 0.05,
) -> list[dict]:
    """
    Tests every marginal d-separation implication against observed data.
    Each row where implication_holds=False is evidence against the DAG structure.
    Run this before trusting any causal estimate from this DAG.
    """
    nodes = [n for n in G.nodes() if n in df.columns]
    results = []
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if nx.d_separated(G, {u}, {v}, set()):
                r, p = pearsonr(df[u].values, df[v].values)
                results.append({
                    "u": u,
                    "v": v,
                    "conditioning_set": [],
                    "correlation": round(float(r), 4),
                    "p_value": round(float(p), 6),
                    "implication_holds": p >= alpha,
                    "verdict": "OK" if p >= alpha
                    else f"VIOLATION — {u} and {v} should be independent but are "
                         "correlated. DAG may be missing an edge.",
                })
    return results
 
 
# ── Example ───────────────────────────────────────────────────────────────────
dag = build_dag({
    "Age":       ["Treatment", "Outcome"],
    "Treatment": ["Outcome"],
})
print(backdoor_analysis(dag, "Treatment", "Outcome", measured={"Age"}))
# identified: True, suggested_adjustment_set: ['Age']
 
rng = np.random.default_rng(42)
n = 1_000
df = pd.DataFrame({
    "Age":       rng.normal(35, 10, n),
    "Treatment": rng.binomial(1, 0.5, n).astype(float),
    "Outcome":   rng.normal(0, 1, n),
})
for v in check_dseparation_implications(df, dag):
    print(v)
