Neural-Path/Notes

Sensitivity Analysis & Robustness

Every causal estimate from observational data rests on an untestable assumption: that you have measured and controlled for all relevant confounders. You can never prove this. What you can do is ask how sensitive your conclusion is to violations of the assumption — and report that sensitivity alongside your estimate. Sensitivity analysis is not a limitation to apologize for; it is the honest accounting that makes causal claims credible.

Theory

Cinelli-Hazlett sensitivity — how strong must confounding be to explain away the effect?

[Figure: Cinelli-Hazlett sensitivity contours. Axes: partial R² of confounder with treatment (x) vs. partial R² with outcome (y), each 0.0–0.3. Contour lines at t = 1.96 (α = 0.05), t = 2.58 (α = 0.01), and t = 3.0 bound the "effect survives" region; benchmark covariates Age, Income, Education are plotted for comparison.]

The fundamental problem. Unconfoundedness — $Y(d) \perp D \mid X$ — is the identifying assumption for observational causal inference. It is untestable from data: any dataset consistent with unconfoundedness is also consistent with some unmeasured confounder $U$ that explains the association. What sensitivity analysis provides is a characterization of how large $U$ would need to be to overturn your conclusion.

E-values (VanderWeele and Ding, 2017). The E-value is the minimum relative risk (RR) that an unmeasured confounder $U$ would need to have with both $D$ (treatment) and $Y$ (outcome) — after conditioning on measured confounders — to fully explain away the observed association. Formally:

$$E\text{-value} = RR + \sqrt{RR \cdot (RR - 1)}$$

where $RR$ is the observed risk ratio. A large E-value means confounding would need to be very strong to explain the result. An E-value near 1 means even weak confounding could explain it away.
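As a quick arithmetic check (numbers chosen for illustration), an observed risk ratio of 2 yields:

```python
import math

rr = 2.0
# E-value = RR + sqrt(RR * (RR - 1))
evalue = rr + math.sqrt(rr * (rr - 1))  # 2 + sqrt(2) ≈ 3.41
```

So a confounder would need a relative risk of roughly 3.4 with both treatment and outcome to fully explain away an observed RR of 2.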

Why it had to be this way. The E-value formula comes from bounding the bias formula for unmeasured confounding. If $U$ has relative risk $\text{RR}_{UD}$ with treatment and $\text{RR}_{UY}$ with outcome, the maximum bias factor from $U$ is $\text{RR}_{UD} \cdot \text{RR}_{UY} / (\text{RR}_{UD} + \text{RR}_{UY} - 1)$. The E-value is the minimum value of $\text{RR}_{UD} = \text{RR}_{UY}$ that achieves bias factor $= RR$ — i.e., fully explains away the effect. It provides a single interpretable threshold.

Rosenbaum bounds ($\Gamma$ sensitivity). For matched observational studies, Rosenbaum (2002) asks: how large does $\Gamma$ (the odds ratio of treatment for two matched units that differ only in an unmeasured covariate) need to be before the conclusion changes? A study is $\Gamma = 2$ robust if, even with hidden bias that doubles the odds of treatment, the conclusion holds.
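Concretely, $\Gamma$ confines the worst-case probability that the treated unit of a matched pair has the higher outcome under the sharp null; a minimal sketch:

```python
gamma = 2.0
# Under hidden bias of at most Gamma, the sign probability for a matched pair
# lies in [1/(1+Gamma), Gamma/(1+Gamma)] under the null of no effect.
p_lo, p_hi = 1 / (1 + gamma), gamma / (1 + gamma)
# At Gamma = 2 this interval is [1/3, 2/3]: the sign test must be evaluated
# against p = 2/3 rather than p = 1/2 to obtain a worst-case p-value.
```

At $\Gamma = 1$ the interval collapses to the usual fair-coin sign test.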

Cinelli-Hazlett partial $R^2$ framework. For linear regression with continuous outcomes, the partial $R^2$ framework (Cinelli and Hazlett, 2020) expresses sensitivity in terms of:

  • $R^2_{Y \sim Z \mid X, D}$: partial $R^2$ of unmeasured confounder $Z$ with the outcome (after controlling for $X$ and $D$)
  • $R^2_{D \sim Z \mid X}$: partial $R^2$ of $Z$ with the treatment (after controlling for $X$)

The contour lines in the sensitivity plot show combinations of these two quantities that would exactly zero out the estimate or reduce it to a specific magnitude. You can benchmark against observed covariates: if no observed covariate has both high $R^2_{Y}$ and $R^2_{D}$, a confounder would need to be stronger than anything you've measured.
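One closed form worth knowing: writing $f$ for the treatment's partial Cohen's f ($f = t/\sqrt{\text{dof}}$), the robustness value that drives the point estimate exactly to zero satisfies $f^2 = RV^2/(1 - RV)$ (Cinelli and Hazlett, 2020). A sketch with made-up numbers:

```python
import math

# Hypothetical regression: t = 4.0 with 200 residual degrees of freedom.
t, dof = 4.0, 200
f = abs(t) / math.sqrt(dof)                     # partial Cohen's f of treatment
rv = 0.5 * (math.sqrt(f**4 + 4 * f**2) - f**2)  # robustness value RV_{q=1}
```

Here `rv` is the minimum partial $R^2$, shared equally between treatment and outcome, that an unmeasured confounder would need to zero out the estimate.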

Placebo tests. A complementary approach: use pre-treatment outcomes as pseudo-outcomes and test whether the "treatment effect" on them is near zero. If the estimator finds spurious effects on outcomes that precede treatment, the identifying assumptions are likely violated.

Walkthrough

Scenario: A propensity-score matched study finds a 20% relative risk reduction from a new drug. We want to characterize sensitivity to unmeasured confounding.

Step 1: Compute E-value.

python
import numpy as np
 
def evalue_rr(rr: float, rr_lower_ci: float | None = None) -> dict:
    """Compute E-value for an observed risk ratio.
 
    Also computes E-value for the confidence interval bound (more conservative).
    """
    def ev(r):
        r = r if r >= 1 else 1.0 / r  # E-values are reported on the >= 1 scale
        return r + np.sqrt(r * (r - 1))
    e_point = float(ev(rr))
    result = {'rr': rr, 'evalue_point': round(e_point, 3)}
    if rr_lower_ci is not None:
        e_ci = float(ev(rr_lower_ci))
        result['evalue_ci'] = round(e_ci, 3)
        result['interpretation'] = (
            f"To explain away the point estimate, an unmeasured confounder would need "
            f"RR >= {e_point:.2f} with both treatment and outcome. "
            f"To move the CI to include 1.0, RR >= {e_ci:.2f}."
        )
    return result
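For the scenario's RR = 0.8 (a protective effect), the E-value is computed on the ≥ 1 scale by first inverting the risk ratio:

```python
import math

# 20% relative risk reduction: RR = 0.8, so RR* = 1 / 0.8 = 1.25
rr_star = 1 / 0.8
evalue = rr_star + math.sqrt(rr_star * (rr_star - 1))  # ≈ 1.81
```

A confounder with RR ≈ 1.81 for both treatment and outcome could fully explain the 20% risk reduction — a fairly modest bar, which is why the CI-based E-value matters too.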

Step 2: Partial $R^2$ robustness value (Cinelli-Hazlett).

python
def robustness_value(
    t_stat: float,
    n: int,
    k: int,    # number of covariates in model
    alpha: float = 0.05,
) -> dict:
    """Compute Cinelli-Hazlett robustness value RV_alpha.
 
    RV_alpha: minimum partial R^2 (equal for treatment and outcome) that
    reduces the t-stat to the critical value at level alpha.
    """
    import numpy as np
    from scipy.stats import t as t_dist
    dof = n - k - 1
    t_crit = float(t_dist.ppf(1 - alpha / 2, df=dof - 1))
    # Robustness value RV_{q=1,alpha} via partial Cohen's f (Cinelli-Hazlett 2020):
    # subtract the critical f, then solve f_q^2 = RV^2 / (1 - RV) for RV.
    f = abs(t_stat) / np.sqrt(dof)
    f_crit = t_crit / np.sqrt(dof - 1)
    fq = f - f_crit
    rv = 0.0 if fq <= 0 else 0.5 * float(np.sqrt(fq**4 + 4 * fq**2) - fq**2)
    return {
        't_stat': round(t_stat, 3),
        't_crit': round(t_crit, 3),
        'rv_alpha': round(rv, 4),
        'interpretation': (
            f"An unmeasured confounder with partial R^2 >= {rv:.2%} with both "
            f"treatment and outcome would reduce significance below alpha={alpha}."
        ),
    }

Step 3: Placebo test.

python
import pandas as pd
 
def placebo_test(
    df: pd.DataFrame,
    outcome_col: str,
    treatment_col: str,
    pre_outcome_col: str,   # outcome measured before treatment
    covariate_cols: list[str],
) -> dict:
    """Test identifying assumptions by using pre-treatment outcome as pseudo-outcome.
 
    A large 'treatment effect' on the pre-treatment outcome indicates confounding.
    """
    import statsmodels.formula.api as smf
    covs = ' + '.join(covariate_cols)
    model_main = smf.ols(f'{outcome_col} ~ {treatment_col} + {covs}', data=df).fit()
    model_placebo = smf.ols(f'{pre_outcome_col} ~ {treatment_col} + {covs}', data=df).fit()
    main_coef = float(model_main.params[treatment_col])
    placebo_coef = float(model_placebo.params[treatment_col])
    placebo_pval = float(model_placebo.pvalues[treatment_col])
    return {
        'main_effect': round(main_coef, 6),
        'placebo_effect': round(placebo_coef, 6),
        'placebo_pvalue': round(placebo_pval, 4),
        'ratio': round(abs(placebo_coef / (main_coef + 1e-9)), 3),
        'warning': (
            'Placebo effect is large relative to main effect — confounding likely'
            if abs(placebo_coef) > 0.3 * abs(main_coef) else None
        ),
    }
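To see the mechanism, here is a self-contained toy (plain NumPy least squares instead of statsmodels, with a hypothetical confounder baked in):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
u = rng.normal(size=n)                            # unmeasured confounder
d = (u + rng.normal(size=n) > 0).astype(float)    # treatment driven by u
y_pre = u + rng.normal(size=n)                    # pre-treatment outcome: u matters, d cannot
X = np.column_stack([np.ones(n), d])
placebo_effect = np.linalg.lstsq(X, y_pre, rcond=None)[0][1]
# placebo_effect is far from zero even though d cannot cause y_pre:
# the placebo test correctly flags confounding.
```

Because `d` cannot causally affect an outcome measured before treatment, any sizable coefficient here is pure confounding signal.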

Analysis & Evaluation

Where your intuition breaks. A large E-value feels reassuring — "confounding would need to be very strong." But large E-values are common when the effect size is large, not necessarily when the study design is good. A study with strong confounding controls, careful matching, and an E-value of 2 is far more trustworthy than a study with no confounding controls and an E-value of 5. The E-value characterizes robustness to unmeasured confounding, conditional on the assumption that measured confounders are controlled. If the measured confounders are poorly controlled, the E-value overstates robustness.

| Method | What it measures | When to use |
| --- | --- | --- |
| E-value | RR strength needed to explain away effect | Binary outcomes, RR scale |
| Rosenbaum bound | Odds ratio of hidden bias | Matched studies |
| Partial R² (Cinelli-Hazlett) | Variance explained by confounder | Linear regression, ATE scale |
| Placebo test | Pre-treatment outcome falsification | Panel data with baseline |
| Manski bounds | Worst-case bounds without assumptions | When assumptions are highly uncertain |

How to report. Do not bury sensitivity analysis in an appendix. Include the E-value or robustness value in the main result table: "Effect = 0.25 (SE = 0.07); Robustness value = 0.12 — an unmeasured confounder explaining 12% of variance in both treatment and outcome would eliminate significance." This allows readers to calibrate against known confounders.

🚀 Production

Benchmark sensitivity against observed covariates. After computing the Cinelli-Hazlett contour, plot where your measured covariates fall in the partial R² space. If the "danger zone" (where confounding would overturn the result) requires a confounder stronger than any measured covariate, the result is more robust. If common covariates fall in the danger zone, the unmeasured confounder concern is concrete and specific.

Production-Ready Code

python
"""
Sensitivity analysis production toolkit.
E-values, Rosenbaum bounds, Cinelli-Hazlett partial R^2,
placebo tests, and automated sensitivity reports.
"""
 
from __future__ import annotations
from dataclasses import dataclass
import numpy as np
import pandas as pd
from scipy.stats import norm, t as t_dist
 
 
@dataclass
class SensitivityReport:
    estimate: float
    se: float
    t_stat: float
    evalue: float | None
    robustness_value: float | None
    placebo_effect: float | None
    placebo_pvalue: float | None
    interpretation: str
 
 
def evalue_rr(rr: float, rr_ci_bound: float | None = None) -> dict:
    """E-value for observed risk ratio (VanderWeele-Ding 2017)."""
    def ev(r: float) -> float:
        r = r if r >= 1 else 1.0 / r  # E-values are reported on the >= 1 scale
        return r + np.sqrt(r * (r - 1))
    result = {
        'rr': rr,
        'evalue_point': round(float(ev(rr)), 3),
    }
    if rr_ci_bound is not None:
        result['evalue_ci'] = round(float(ev(rr_ci_bound)), 3)
    return result
 
 
def cinelli_hazlett_bounds(
    t_stat: float,
    n: int,
    k: int,
    benchmark_r2: dict[str, tuple[float, float]] | None = None,
    alpha: float = 0.05,
) -> dict:
    """Cinelli-Hazlett partial R^2 sensitivity bounds.
 
    Returns robustness value and optional benchmark comparisons.
    benchmark_r2: {'covariate_name': (r2_treatment, r2_outcome), ...}
    """
    dof = n - k - 1
    t_crit = float(t_dist.ppf(1 - alpha / 2, df=dof - 1))
    # Robustness value RV_{q=1,alpha} via partial Cohen's f (Cinelli-Hazlett 2020)
    f = abs(t_stat) / np.sqrt(dof)
    f_crit = t_crit / np.sqrt(dof - 1)
    fq = f - f_crit
    rv = 0.0 if fq <= 0 else 0.5 * float(np.sqrt(fq**4 + 4 * fq**2) - fq**2)
 
    result = {
        't_stat': round(t_stat, 3),
        't_crit': round(t_crit, 3),
        'robustness_value': round(rv, 4),
        'benchmarks': {},
    }
 
    if benchmark_r2:
        for name, (r2d, r2y) in benchmark_r2.items():
            # First-order screen: bias-adjusted |t| ignoring the SE adjustment
            # (a rough benchmark, not the exact Cinelli-Hazlett adjusted estimate)
            bias_t = abs(t_stat) * np.sqrt(r2d * r2y) / (np.sqrt((1 - r2d) * (1 - r2y)) + 1e-9)
            t_adjusted = abs(t_stat) - bias_t
            result['benchmarks'][name] = {
                'r2_treatment': r2d,
                'r2_outcome': r2y,
                't_adjusted': round(float(t_adjusted), 3),
                'still_significant': t_adjusted > t_crit,
            }
 
    return result
 
 
def rosenbaum_gamma_bound(
    matched_pairs: pd.DataFrame,   # columns: Y_treated, Y_control
    gamma: float = 2.0,
    alpha: float = 0.05,
) -> dict:
    """Rosenbaum sensitivity test for matched studies at given Gamma value.
 
    Returns whether conclusion holds at the specified Gamma level.
    """
    diffs = (matched_pairs['Y_treated'] - matched_pairs['Y_control']).values
    diffs = diffs[diffs != 0]          # sign test discards tied pairs
    n = len(diffs)
    pos = int(np.sum(diffs > 0))
 
    p_max = gamma / (1 + gamma)
    from scipy.stats import binom
    pvalue_upper = float(binom.sf(pos - 1, n, p_max))
 
    return {
        'gamma': gamma,
        'n_pairs': n,
        'n_positive': int(pos),
        'pvalue_upper': round(pvalue_upper, 4),
        'conclusion_holds': pvalue_upper < alpha,
        'interpretation': (
            f"At Gamma={gamma} (hidden bias could {gamma}x the odds of treatment), "
            f"p-value upper bound = {pvalue_upper:.4f}. "
            f"{'Conclusion holds.' if pvalue_upper < alpha else 'Conclusion may not hold.'}"
        ),
    }
 
 
def run_sensitivity_report(
    estimate: float,
    se: float,
    n: int,
    k: int,
    placebo_result: dict | None = None,
    benchmark_r2: dict | None = None,
    alpha: float = 0.05,
) -> SensitivityReport:
    """Generate a unified sensitivity report."""
    t_stat = estimate / se
    rv_result = cinelli_hazlett_bounds(t_stat, n, k, benchmark_r2, alpha)
    rv = rv_result['robustness_value']
    t_crit = rv_result['t_crit']
    interp = (
        f"Effect = {estimate:.4f} (SE = {se:.4f}, t = {t_stat:.2f}). "
        f"Robustness value = {rv:.2%}: an unmeasured confounder explaining "
        f"{rv:.0%} of variance in both treatment and outcome would eliminate significance. "
        f"{'Conclusion is robust.' if rv > 0.10 else 'Conclusion is fragile — sensitivity analysis required.'}"
    )
    return SensitivityReport(
        estimate=round(estimate, 6),
        se=round(se, 6),
        t_stat=round(t_stat, 3),
        evalue=None,  # set separately for RR-scale estimates
        robustness_value=round(rv, 4),
        placebo_effect=placebo_result.get('placebo_effect') if placebo_result else None,
        placebo_pvalue=placebo_result.get('placebo_pvalue') if placebo_result else None,
        interpretation=interp,
    )
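As a worked check of the Γ bound's arithmetic (counts are hypothetical): with 70 of 100 pairs favoring treatment, the worst-case sign-test p-value at Γ = 2 is evaluated against p = 2/3 instead of 1/2:

```python
import math

n, pos, gamma = 100, 70, 2.0          # hypothetical matched-pair counts
p_max = gamma / (1 + gamma)           # worst-case sign probability = 2/3
# Upper bound on the one-sided p-value: P(X >= pos) for X ~ Binomial(n, p_max)
p_upper = sum(math.comb(n, k) * p_max**k * (1 - p_max)**(n - k)
              for k in range(pos, n + 1))
```

Here `p_upper` lands well above 0.05, so this 70/100 split would not survive Γ = 2 hidden bias, even though at Γ = 1 (p = 1/2) the same split is highly significant.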
