Neural-Path/Notes
35 min

Experimental Design

Knowing the hypothesis testing framework is necessary but not sufficient. An experiment that's correctly analyzed but poorly designed will give you narrow confidence intervals around the wrong thing, or require three months of runtime to detect a lift you expected to see in two weeks. Experimental design is where most of the leverage lives: choosing the right randomization unit (user vs. session vs. city), applying variance reduction techniques that can halve your required sample size, constructing ratio metrics carefully enough that your test is valid, and selecting metrics that are sensitive to the changes you care about. A well-designed experiment is one where the randomization is clean, the metric responds quickly to the change under test, and the variance is as low as possible given the data already collected before the experiment started.

Theory

CUPED Variance Reduction
-3-2-10123adjusted outcome Ỹraw Y σ=1.00CUPED Ỹ σ=0.55 (−70% var)narrower

CUPED subtracts the pre-experiment covariate component θX̃ from Y, shrinking variance without changing the expected treatment effect.

Before you run a single experiment, you can cut the required sample size in half — without collecting any new data. Pre-experiment measurements of the same metric already contain most of the user-level variance; subtracting the predictable component leaves only the noise the treatment actually needs to overcome. CUPED does exactly this: last week's behavior controls for this week's baseline, shrinking distributions without touching the causal estimate.

Randomization as identification

The magic of a randomized experiment is that random assignment makes the treated and control groups statistically identical in expectation — not just on observed covariates, but on every characteristic including unobserved ones. This is what gives A/B tests their causal identification power.

Formally: under random assignment, Ti ⁣ ⁣ ⁣(Yi(0),Yi(1))T_i \perp\!\!\!\perp (Y_i(0), Y_i(1)), so:

E[YˉBYˉA]=E[Yi(1)Yi(0)]=τATEE[\bar{Y}_B - \bar{Y}_A] = E[Y_i(1) - Y_i(0)] = \tau_{\text{ATE}}

No other assumption required. This is why experiments are the gold standard and why observational methods need untestable assumptions to claim causal effects.

CUPED variance reduction

The key lever for running shorter experiments is reducing variance without changing sample size. Controlled-experiment Using Pre-Experiment Data (CUPED) (Deng et al., Microsoft, 2013) achieves this by subtracting the component of YY predictable from a pre-experiment covariate XX:

Yadj,i=Yiθ^(XiXˉ),θ^=Cov(Y,X)Var(X)Y_{\text{adj}, i} = Y_i - \hat{\theta}(X_i - \bar{X}), \quad \hat{\theta} = \frac{\text{Cov}(Y, X)}{\text{Var}(X)}

The formula θ^=Cov(Y,X)/Var(X)\hat{\theta} = \text{Cov}(Y, X)/\text{Var}(X) is the OLS slope from regressing YY on XX — the unique coefficient that minimizes Var(Yadj)\text{Var}(Y_\text{adj}) over all possible linear adjustments. Any other value of θ\theta gives an unbiased treatment effect estimate but higher variance; this θ^\hat{\theta} is the variance-minimizing choice. The unbiasedness holds regardless of θ\theta, which is why CUPED can be applied post-hoc at readout time rather than pre-specified before the experiment.

The variance of the adjusted outcome is:

Var(Yadj)=Var(Y)(1ρ2)\text{Var}(Y_{\text{adj}}) = \text{Var}(Y)(1 - \rho^2)

where ρ=Corr(Y,X)\rho = \text{Corr}(Y, X). For Daily Active Users (DAU) with week-over-week stability ρ=0.8\rho = 0.8:

Var(Yadj)=Var(Y)(10.64)=0.36Var(Y)\text{Var}(Y_{\text{adj}}) = \text{Var}(Y) \cdot (1 - 0.64) = 0.36 \cdot \text{Var}(Y)

Same required number of standard errors, 64% less variance → 64% fewer users needed → 64% shorter experiment. The left panel below shows exactly this: the adjusted distribution is narrower, concentrating more probability mass near the true mean. The right panel shows that the variance remaining grows rapidly as ρ\rho falls — if your pre-experiment covariate has no predictive power, CUPED doesn't help.

CUPED variance reduction: before vs after distributions, and variance fraction vs correlation

Key properties:

  • Estimate θ^\hat{\theta} on the pooled dataset (both arms together), never per-arm — this avoids post-treatment contamination
  • The adjustment is unbiased regardless of θ^\hat{\theta}: E[τ^adj]=τE[\hat{\tau}_{\text{adj}}] = \tau for any θ\theta. Optimizing θ\theta only minimizes variance.
  • Can be applied post-hoc at readout time if you forgot to pre-apply it

Ratio metrics and the delta method

Business metrics are often ratios: revenue per session = total revenue / total sessions. The naive variance formula Var(X/Y)Var(X)/Yˉ2\text{Var}(X/Y) \approx \text{Var}(X)/\bar{Y}^2 ignores the denominator's variability. The correct formula via the delta method:

Var ⁣(XˉYˉ)1Yˉ2(σX22XˉYˉσXY+(XˉYˉ)2σY2)\text{Var}\!\left(\frac{\bar{X}}{\bar{Y}}\right) \approx \frac{1}{\bar{Y}^2}\left(\sigma_X^2 - 2\frac{\bar{X}}{\bar{Y}}\sigma_{XY} + \left(\frac{\bar{X}}{\bar{Y}}\right)^2 \sigma_Y^2\right)

The covariance term σXY\sigma_{XY} matters. If sessions increase alongside revenue (treatment increases both numerator and denominator), the naive estimator will overstate variance. If sessions decrease while revenue increases, it will understate variance.

Design effect for cluster randomization

When SUTVA is violated — a treatment affecting one user spills over to others — you randomize at a higher level: cities, devices, households. The design effect (DEFF) quantifies the cost of cluster randomization versus individual randomization:

DEFF=1+(m1)ρICC\text{DEFF} = 1 + (m - 1) \cdot \rho_{\text{ICC}}

where mm is average cluster size and ρICC\rho_{\text{ICC}} is the intraclass correlation coefficient (ICC). For city-level experiments with m=50m = 50 users/city and ICC=0.02\text{ICC} = 0.02:

DEFF=1+49×0.02=1.98\text{DEFF} = 1 + 49 \times 0.02 = 1.98

You need approximately twice as many users compared to user-level randomization. The effective sample size is neff=n/DEFFn_{\text{eff}} = n / \text{DEFF}. The ICC is often underestimated — cluster experiments are chronically underpowered as a result.

Stratified randomization

Rather than pure random assignment, stratified randomization assigns units within pre-specified strata (e.g., new vs. returning users, mobile vs. desktop). The variance reduction:

Var(τ^stratified)=Var(τ^unstratified)σbetween2n\text{Var}(\hat{\tau}_{\text{stratified}}) = \text{Var}(\hat{\tau}_{\text{unstratified}}) - \frac{\sigma_{\text{between}}^2}{n}

The reduction equals the between-stratum variance in outcomes divided by nn. For a metric where new vs. returning users differ significantly (large σbetween2\sigma_{\text{between}}^2), stratification gives meaningful variance reduction with no statistical cost.

Walkthrough

CUPED implementation

python
import numpy as np
import pandas as pd
from scipy import stats
 
def apply_cuped(
    df: pd.DataFrame,
    outcome_col: str,
    pre_experiment_col: str,
    variant_col: str = 'variant',
) -> pd.Series:
    """
    CUPED-adjusted outcome.
    theta estimated on the full dataset (not per-arm) to avoid post-treatment bias.
    """
    X = df[pre_experiment_col].values
    Y = df[outcome_col].values
 
    # Estimate theta = Cov(Y, X) / Var(X) on pooled data
    theta = np.cov(Y, X)[0, 1] / np.var(X)
 
    # Adjust
    Y_adj = Y - theta * (X - X.mean())
    return pd.Series(Y_adj, index=df.index, name=f"{outcome_col}_cuped")
 
def variance_reduction_from_cuped(Y: np.ndarray, X: np.ndarray) -> dict:
    """How much variance does CUPED remove for this metric/covariate pair?"""
    rho = np.corrcoef(Y, X)[0, 1]
    return {
        "rho": round(rho, 3),
        "variance_reduction_pct": round((1 - rho**2) * 100, 1),
        "sample_size_reduction_pct": round((1 - rho**2) * 100, 1),  # same factor
    }
 
# Example: pre-experiment revenue as covariate for post-experiment revenue
df['revenue_cuped'] = apply_cuped(df, 'revenue_post', 'revenue_pre')
stats_pre = variance_reduction_from_cuped(df['revenue_post'].values, df['revenue_pre'].values)
# rho=0.72 → 48% variance reduction → 48% fewer users needed

Sample size with adjustments

python
import math
 
def sample_size_suite(
    baseline_rate: float,
    mde_relative: float,
    rho_cuped: float = 0.0,
    avg_cluster_size: float = 1.0,
    icc: float = 0.0,
    alpha: float = 0.05,
    power: float = 0.80,
) -> dict:
    """
    Sample size accounting for CUPED variance reduction and cluster design effect.
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
 
    delta = baseline_rate * mde_relative
    p_pool = baseline_rate + delta / 2
    sigma = math.sqrt(2 * p_pool * (1 - p_pool))
    n_base = math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)
 
    # Apply CUPED variance reduction
    n_cuped = math.ceil(n_base * (1 - rho_cuped**2))
 
    # Apply design effect for clustering
    deff = 1 + (avg_cluster_size - 1) * icc
    n_clustered = math.ceil(n_cuped * deff)
    n_clusters = math.ceil(n_clustered / avg_cluster_size) if avg_cluster_size > 1 else None
 
    return {
        "n_unadjusted": n_base,
        "n_with_cuped": n_cuped,
        "n_with_clustering": n_clustered,
        "n_clusters_needed": n_clusters,
        "design_effect": round(deff, 2),
        "days_needed": math.ceil(n_clustered / (200_000 * 0.5)),  # example traffic
    }
 
# Pure individual randomization: 45% baseline, 2% MDE
r = sample_size_suite(0.45, 0.02)
print(f"No adjustments: {r['n_unadjusted']:,} per arm")  # 19,652
 
# With CUPED (rho=0.7)
r = sample_size_suite(0.45, 0.02, rho_cuped=0.7)
print(f"With CUPED:     {r['n_with_cuped']:,} per arm")  # 10,718 (45% fewer)
 
# City-level randomization, 50 users/city, ICC=0.02
r = sample_size_suite(0.45, 0.02, avg_cluster_size=50, icc=0.02)
print(f"Cluster:        {r['n_clusters_needed']:,} clusters needed")  # 390

Ratio metric variance (delta method)

python
def ratio_metric_stats(
    numerator_A: np.ndarray,   # e.g., revenue per user in control
    denominator_A: np.ndarray, # e.g., sessions per user in control
    numerator_B: np.ndarray,
    denominator_B: np.ndarray,
) -> dict:
    """
    Compare ratio metrics (e.g., revenue per session) between arms.
    Uses delta method for variance of ratio.
    """
    def ratio_stats(num, den):
        r = num.mean() / den.mean()
        n = len(num)
        var_ratio = (1/den.mean()**2) * (
            np.var(num)
            - 2 * r * np.cov(num, den)[0,1]
            + r**2 * np.var(den)
        ) / n
        return r, var_ratio
 
    r_A, var_A = ratio_stats(numerator_A, denominator_A)
    r_B, var_B = ratio_stats(numerator_B, denominator_B)
 
    delta = r_B - r_A
    se = math.sqrt(var_A + var_B)
    z = delta / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    ci = (delta - 1.96 * se, delta + 1.96 * se)
 
    return {
        "ratio_control": round(r_A, 4),
        "ratio_treatment": round(r_B, 4),
        "delta": round(delta, 4),
        "relative_lift_pct": round(delta / r_A * 100, 2),
        "ci_95": (round(ci[0], 4), round(ci[1], 4)),
        "p_value": round(p_value, 5),
    }

Randomization unit choice

The unit you randomize on determines what questions you can answer:

UnitWhen to useTradeoffs
User IDMost experiments — measures per-user effectsCannot control SUTVA violations between users
Session IDWhen user-level effects aren't relevant (e.g., anonymous search)Same user may see both arms — "dilution"
Device IDWhen the product is device-specificMulti-device users dilute effects
City / MarketWhen SUTVA is violated (marketplace, social)Cluster design effect reduces power significantly
Time periodSwitchback experiments for marketplace pricingCarry-over effects between periods

Always hash the randomization unit deterministically:

python
import hashlib
 
def assign_variant(unit_id: str, experiment_id: str, split: float = 0.5) -> str:
    """Deterministic, consistent assignment. Same user always gets same variant."""
    key = f"{experiment_id}:{unit_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10_000 / 10_000
    return "treatment" if bucket < split else "control"

Using a hash keyed on both experiment_id and unit_id ensures independence across experiments — the same user isn't always in the treatment arm.

Pre-experiment AA test

Simulate 1000 AA tests (no actual treatment) on historical data. Under a correctly specified variance model, exactly ~5% should yield p < 0.05. A higher false positive rate reveals a broken variance model.

python
def run_aa_simulation(
    user_metrics: np.ndarray,
    n_simulations: int = 1000,
    alpha: float = 0.05,
) -> dict:
    """
    Validate variance model by simulating AA tests on historical data.
    """
    false_positives = 0
    for _ in range(n_simulations):
        idx = np.random.permutation(len(user_metrics))
        mid = len(idx) // 2
        _, p = stats.ttest_ind(user_metrics[idx[:mid]], user_metrics[idx[mid:]])
        if p < alpha:
            false_positives += 1
 
    fpr = false_positives / n_simulations
    return {
        "false_positive_rate": round(fpr, 3),
        "expected": alpha,
        "verdict": "PASS" if abs(fpr - alpha) < 0.02 else "FAIL — check variance model",
        "likely_cause": (
            "clustering effects or autocorrelation" if fpr > alpha * 1.5 else "ok"
        ),
    }

Analysis & Evaluation

Where Your Intuition Breaks

Running a longer experiment always fixes an underpowered one. Sample size is determined by variance, not time. If your metric is dominated by high-variance users (e.g., a small fraction of users drive most revenue), doubling experiment duration barely helps — each new day adds the same noisy users. The fix is variance reduction (CUPED, winsorization, switching to a lower-CV metric), not more time. Duration matters only to the extent that it determines how many independent users you observe; for low-traffic products, even a month of runtime may fall short of the required n.

Metric selection: sensitivity vs. relevance

Not all metrics are equally useful as primary metrics. Two dimensions:

Sensitivity (statistical): how much signal per user. Measured by coefficient of variation (CV=σ/μ\text{CV} = \sigma / \mu). Lower CV → smaller sample size for the same MDE.

Relevance (business): how predictive is this metric of long-term business outcomes? A metric can be easy to move (sensitive) but irrelevant (click-through on an interstitial you're about to remove).

The right primary metric maximizes both. Estimate relevance empirically: compute the correlation between candidate short-term metrics and long-term outcomes on historical data. Compute sensitivity using pre-experiment variance.

MetricCV (lower is better)Corr with 6-month retention
Sessions/week0.80.71
Revenue/session2.10.54
Search success rate0.40.83 ← good primary
Feature adoption0.60.31

Variance estimation from historical data

Use 4 weeks of pre-experiment data to estimate σ2\sigma^2. Apply a variance inflation factor of 1.1–1.2 to be conservative — this accounts for:

  • Users not yet in the experiment behaving differently from those who are
  • Week-of-month effects not captured in 4 weeks
  • Cross-device users counted once but generating multiple events

For highly right-skewed metrics (revenue, session duration), winsorize at the 99th percentile before estimating variance. The same winsorization threshold must be applied at analysis time.

Production-Ready Code

In production, CUPED runs as a post-processing step on the raw event data after experiment close — it doesn't require changing instrumentation. The key engineering decision is covariate selection: compute correlations on pre-experiment data and pick the highest-rho columns automatically. Assignment must use a deterministic hash so the same user always gets the same variant across sessions, devices, and service calls.

python
# production_experiment_design.py
# CUPED pipeline with automated covariate selection, sample size service,
# and deterministic variant assignment.
 
from __future__ import annotations
import hashlib, math
from typing import Sequence
import numpy as np
import pandas as pd
import scipy.stats as stats
 
 
# ── Covariate selection ───────────────────────────────────────────────────────
 
def select_cuped_covariates(
    df: pd.DataFrame,
    outcome_col: str,
    candidate_cols: Sequence[str],
    min_rho: float = 0.3,
) -> list[str]:
    """
    Returns pre-experiment columns with |rho| >= min_rho, sorted by descending |rho|.
    Only pass columns measured BEFORE experiment start as candidates.
    Typical candidates: last-week metric, last-month metric, days-since-signup.
    """
    Y = df[outcome_col].values
    ranked = []
    for col in candidate_cols:
        rho = float(np.corrcoef(Y, df[col].values)[0, 1])
        if abs(rho) >= min_rho:
            ranked.append((col, rho))
    ranked.sort(key=lambda x: abs(x[1]), reverse=True)
    return [col for col, _ in ranked]
 
 
# ── Multi-covariate CUPED via OLS residualisation ────────────────────────────
 
def apply_cuped_pipeline(
    df: pd.DataFrame,
    outcome_col: str,
    covariate_cols: Sequence[str],
    variant_col: str = "variant",
) -> pd.Series:
    """
    Generalises single-covariate CUPED to multiple covariates via OLS.
    theta is estimated on the POOLED dataset — never per arm — to avoid
    post-treatment contamination.
    Returns the adjusted outcome series indexed like df.
    """
    Y = df[outcome_col].values
    X = df[covariate_cols].values
    X_dm = X - X.mean(axis=0)                           # demean each covariate
    theta, *_ = np.linalg.lstsq(X_dm, Y, rcond=None)   # OLS minimises Var(Y_adj)
    Y_adj = Y - X_dm @ theta
    return pd.Series(Y_adj, index=df.index, name=f"{outcome_col}_cuped")
 
 
def cuped_variance_reduction(
    df: pd.DataFrame,
    outcome_col: str,
    covariate_cols: Sequence[str],
) -> dict:
    """How much variance does the covariate set explain? Use to justify covariate choice."""
    Y = df[outcome_col].values
    X = df[covariate_cols].values
    X_dm = X - X.mean(axis=0)
    theta, *_ = np.linalg.lstsq(X_dm, Y, rcond=None)
    Y_adj = Y - X_dm @ theta
    reduction = 1 - Y_adj.var() / Y.var()
    return {
        "covariates": list(covariate_cols),
        "variance_reduction_pct": round(reduction * 100, 1),
        "sample_size_reduction_pct": round(reduction * 100, 1),
    }
 
 
# ── Sample size service ───────────────────────────────────────────────────────
 
def sample_size_service(
    baseline: float,
    mde_relative: float,
    rho_cuped: float = 0.0,
    avg_cluster_size: float = 1.0,
    icc: float = 0.0,
    alpha: float = 0.05,
    power: float = 0.80,
) -> dict:
    """
    Single entry point for all experiment sizing. Accounts for CUPED variance
    reduction and cluster design effect in sequence.
    """
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    delta = baseline * mde_relative
    p_pool = baseline + delta / 2
    sigma = math.sqrt(2 * p_pool * (1 - p_pool))
    n_base = math.ceil(((z_a + z_b) * sigma / delta) ** 2)
    n_cuped = math.ceil(n_base * (1 - rho_cuped ** 2))
    deff = 1 + (avg_cluster_size - 1) * icc
    n_final = math.ceil(n_cuped * deff)
    return {
        "n_per_arm_unadjusted": n_base,
        "n_per_arm_with_cuped": n_cuped,
        "n_per_arm_final": n_final,
        "n_clusters_needed": math.ceil(n_final / avg_cluster_size) if avg_cluster_size > 1 else None,
        "design_effect": round(deff, 3),
        "variance_reduction_pct": round(rho_cuped ** 2 * 100, 1),
    }
 
 
# ── Deterministic assignment ──────────────────────────────────────────────────
 
def assign_variant(
    unit_id: str,
    experiment_id: str,
    weights: dict[str, float] | None = None,
) -> str:
    """
    Deterministic assignment via SHA-256 bucket keyed on experiment+unit.
    Keying on both ensures the same user is not always in treatment across experiments.
    Supports multi-arm splits via a weights dict that must sum to 1.0.
    """
    weights = weights or {"control": 0.5, "treatment": 0.5}
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "Weights must sum to 1"
    key = f"{experiment_id}:{unit_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 1_000_000 / 1_000_000
    cumulative = 0.0
    for arm, w in weights.items():
        cumulative += w
        if bucket < cumulative:
            return arm
    return list(weights)[-1]   # floating-point safety valve
 
 
# ── Example ───────────────────────────────────────────────────────────────────
print(sample_size_service(0.45, 0.02, rho_cuped=0.72))
# {'n_per_arm_unadjusted': 19652, 'n_per_arm_with_cuped': 10991, ...}
 
print(assign_variant("user_8675309", "checkout_v2_2026q2"))   # same every time
print(assign_variant("user_8675309", "onboarding_v3_2026q2")) # different experiment

Enjoying these notes?

Get new lessons delivered to your inbox. No spam.