
Long-run Measurement & Holdout Groups

The most dangerous number in experimentation is the two-week result. Users behave differently when a feature is new — they explore it out of curiosity (novelty effect) or resist it from habit (change aversion). After weeks or months, behavior stabilizes to something closer to the true long-run effect. A product that looks like a 5% win at two weeks might be a 1% win at six months — or vice versa. Long-run measurement is the discipline of connecting short-run experiment results to the outcomes that actually matter.

Theory

Short-run A/B tests measure intent-to-treat effects over a fixed window. The gap between this measurement and long-run impact has three components:

1. Novelty and learning effects. Users respond differently to new experiences than to familiar ones. A novelty effect inflates the outcome early, driven by curiosity; a learning effect depresses it early and lifts it over time as users master the feature. Because they push in opposite directions, the two can partially or fully offset each other.

2. Ecosystem effects. A new feature may cannibalize other features (substitution) or complement them (amplification). Two-week experiments often miss cannibalization because users have not yet fully substituted.

3. Causal chain lag. Many business outcomes (retention, LTV) have long causal chains. An onboarding improvement might increase retention by month 3 but be invisible in a two-week engagement metric.

Permanent holdout groups. A permanent holdout is a small fraction of users (typically 1–5%) held out of all new features indefinitely. Comparing the treatment population (everyone else) against the holdout at any future time gives a retrospective estimate of cumulative long-run impact.

Design considerations:

  • Size: Large enough to detect meaningful effects (n ≥ 10,000 for a 2% MDE on conversion)
  • Selection: Random assignment at the user level — not self-selection
  • Rotation: Holdout users eventually receive the features (typically quarterly) to avoid permanent harm
  • Scope: Separate holdout groups per feature team to avoid confounding
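
The sizing bullet above can be made concrete with a standard two-proportion power calculation; a minimal sketch (the function name is mine, and the MDE is treated as an absolute lift):

```python
import numpy as np
from scipy.stats import norm

def holdout_size_needed(
    baseline_rate: float,   # e.g. conversion rate in the holdout
    mde_abs: float,         # minimum detectable absolute lift
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """Approximate per-group n for a two-proportion z-test."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    p = baseline_rate
    n = 2 * (z_a + z_b) ** 2 * p * (1 - p) / mde_abs ** 2
    return int(np.ceil(n))
```

With a 10% baseline and a 2-point absolute MDE this lands in the low thousands per group; halving the MDE roughly quadruples it, which is how requirements like n ≥ 10,000 arise.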

Surrogate indices. If you cannot wait for long-run outcomes, use a surrogate index — a weighted combination of short-run metrics that predicts the long-run outcome. Formally (Athey et al. 2019), S is a valid surrogate for Y if:

Y(d) \perp D \mid S(d)

That is, conditional on the surrogate, the potential outcome is independent of treatment assignment. If this surrogacy assumption holds, the long-run treatment effect equals the treatment effect on the surrogate after applying regression weights \hat{\gamma} learned from historical experiments:

\tau_Y = \mathbb{E}[Y(1) - Y(0)] = \hat{\gamma}^{\top} \, \mathbb{E}[S(1) - S(0)]

Why it had to be this way. The surrogacy assumption is a mediation assumption — the long-run outcome is fully mediated through the surrogate. This is testable on old experiments where both S and Y were observed: run the regression on historical data and check R². High R² supports surrogacy.
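
Applied to a new experiment, the index is just a dot product; a minimal sketch with hypothetical weights and surrogate lifts:

```python
import numpy as np

# hypothetical Ridge weights learned from historical experiments
gamma_hat = np.array([0.8, 0.3])   # [session_length, active_days]
# short-run treatment effects on those surrogates in the new experiment
delta_s = np.array([0.03, 0.01])

# predicted long-run effect: weighted surrogate lift
tau_y = round(float(gamma_hat @ delta_s), 4)  # 0.027
```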

Carryover models for extrapolation. If you have decaying treatment effects (common for novelty), fit a carryover model to the observed short-run weekly effects:

\tau_t = \tau_\infty + (\tau_0 - \tau_\infty) e^{-\lambda t}

and extrapolate \tau_\infty, the long-run asymptote. This requires at least 4 weeks of post-launch data to fit \lambda reliably.

Walkthrough

Scenario: A streaming platform's recommendation improvement shows a 3% session-length increase at 2 weeks. We want to predict the 6-month retention impact.

Step 1: Check for novelty effect by plotting weekly treatment effects.

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_carryover_model(
    weeks: np.ndarray,    # post-launch week indices (1, 2, 3, ...)
    effects: np.ndarray,  # weekly treatment effect estimates
) -> dict:
    """Fit exponential decay: tau(t) = tau_inf + (tau_0 - tau_inf)*exp(-lam*t)."""
    def model(t, tau_inf, tau_0, lam):
        return tau_inf + (tau_0 - tau_inf) * np.exp(-lam * t)
    try:
        popt, pcov = curve_fit(
            model, weeks, effects,
            p0=[effects[-1], effects[0], 0.5],
            bounds=([-np.inf, -np.inf, 0.01], [np.inf, np.inf, 10]),
        )
        tau_inf, tau_0, lam = popt
        perr = np.sqrt(np.diag(pcov))
        return {
            'tau_inf': round(float(tau_inf), 4),
            'tau_0': round(float(tau_0), 4),
            'lambda': round(float(lam), 4),
            'half_life_weeks': round(float(np.log(2) / lam), 1),
            'tau_inf_se': round(float(perr[0]), 4),
        }
    except RuntimeError:
        return {'error': 'Carryover model did not converge — need more weeks of data'}
```
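
A quick way to build trust in this kind of fit: simulate a noise-free decaying effect with a known asymptote and check that curve_fit recovers it (all numbers illustrative):

```python
import numpy as np
from scipy.optimize import curve_fit

def model(t, tau_inf, tau_0, lam):
    return tau_inf + (tau_0 - tau_inf) * np.exp(-lam * t)

weeks = np.arange(1, 9, dtype=float)          # 8 weeks of post-launch data
effects = model(weeks, 0.01, 0.03, 0.5)       # true long-run asymptote is 0.01

popt, _ = curve_fit(model, weeks, effects, p0=[effects[-1], effects[0], 0.5])
tau_inf_hat = float(popt[0])                  # should recover ~0.01
```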

Step 2: Construct surrogate index from historical experiments.

```python
import pandas as pd
from sklearn.linear_model import Ridge

def fit_surrogate_index(
    historical: pd.DataFrame,
    surrogate_cols: list[str],
    outcome_col: str = 'long_run_retention',
) -> dict:
    """Fit Ridge regression mapping short-run surrogates to long-run outcome."""
    X = historical[surrogate_cols].values
    y = historical[outcome_col].values
    model = Ridge(alpha=0.01).fit(X, y)
    r2 = model.score(X, y)
    return {
        'weights': dict(zip(surrogate_cols, model.coef_.round(4))),
        'r2': round(r2, 4),
        'surrogacy_ok': r2 > 0.70,
        'warning': None if r2 > 0.70
            else f"R²={r2:.2f} — surrogate explains only {r2:.0%} of long-run variance",
    }
```

Analysis & Evaluation

Where your intuition breaks. The novelty effect is usually assumed to inflate short-run results — you see a big positive and discount it. But learning effects go in the other direction: a new search interface looks bad at two weeks (users have not yet learned the new behavior) but outperforms at two months. Blindly discounting short-run results for novelty can lead to killing genuinely good features. Always plot the weekly treatment effect trajectory before making a call.

| Effect | Short-run signal | Long-run signal | Action |
| --- | --- | --- | --- |
| Novelty | Inflated positive | Deflates to true effect | Wait or extrapolate |
| Learning | Deflated or negative | Grows to true effect | Wait or use surrogate |
| Habituation | Positive | Fades toward zero | Report both; question LTV |
| Cannibalization | Positive on feature metric | Negative on portfolio metric | Check portfolio-level holdout |

When surrogate indices fail. The surrogacy assumption fails when treatment affects the long-run outcome through channels not captured by the surrogate. Example: a new feature increases engagement (captured by surrogate) but causes privacy concerns that raise churn months later (not in the surrogate). Always validate surrogates out-of-sample on held-out historical experiments.
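
One way to run that out-of-sample validation, on synthetic historical experiments (column names and effect sizes hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 60  # historical experiments with both short- and long-run outcomes
hist = pd.DataFrame({
    'session_delta': rng.normal(0, 1, n),
    'active_days_delta': rng.normal(0, 1, n),
})
# long-run outcome mostly (not fully) mediated through the surrogates
hist['long_run_retention'] = (0.8 * hist['session_delta']
                              + 0.3 * hist['active_days_delta']
                              + rng.normal(0, 0.1, n))

cols = ['session_delta', 'active_days_delta']
train, test = hist.iloc[:40], hist.iloc[40:]
model = Ridge(alpha=0.01).fit(train[cols], train['long_run_retention'])
r2_oos = model.score(test[cols], test['long_run_retention'])
# the held-out R², not the in-sample one, is the honest surrogacy check
```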

💡 Intuition

Permanent holdouts are the gold standard for long-run measurement — but they are expensive. A 1% holdout on a platform with 10M monthly active users means 100,000 users never get any new feature. The forgone value is a real cost. Most organizations compromise: rotate holdout membership quarterly and accept that long-run estimates conflate features shipped within the same cohort period.

Production-Ready Code

```python
"""
Long-run measurement toolkit.
Holdout group management, surrogate index pipeline,
carryover model, and long-run lift extrapolation.
"""

from __future__ import annotations
from dataclasses import dataclass
import hashlib
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit
from scipy.stats import norm
from sklearn.linear_model import Ridge


@dataclass
class HoldoutConfig:
    holdout_fraction: float = 0.02
    salt: str = 'holdout_v1'        # change to rotate holdout membership
    feature_team: str = 'growth'    # separate holdout per team


def is_in_holdout(user_id: int | str, config: HoldoutConfig) -> bool:
    """Deterministic, salt-based holdout assignment.

    Changing config.salt rotates the holdout population without
    requiring a database update.
    """
    key = f"{config.salt}:{config.feature_team}:{user_id}"
    digest = hashlib.sha256(key.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return bucket < config.holdout_fraction


def long_run_lift_estimate(
    holdout_outcome: float,
    treated_outcome: float,
    holdout_n: int,
    treated_n: int,
    holdout_std: float,
    treated_std: float,
    alpha: float = 0.05,
) -> dict:
    """Estimate long-run lift from permanent holdout comparison."""
    lift = treated_outcome - holdout_outcome
    se = np.sqrt(holdout_std**2 / holdout_n + treated_std**2 / treated_n)
    z = lift / se
    pvalue = float(2 * norm.sf(abs(z)))
    z_crit = norm.ppf(1 - alpha / 2)
    return {
        'lift': round(float(lift), 6),
        'se': round(float(se), 6),
        'ci_lower': round(float(lift - z_crit * se), 6),
        'ci_upper': round(float(lift + z_crit * se), 6),
        'pvalue': round(pvalue, 4),
        'significant': pvalue < alpha,
    }


def fit_surrogate_pipeline(
    historical_experiments: pd.DataFrame,
    surrogate_cols: list[str],
    outcome_col: str,
    alpha: float = 0.01,
) -> dict:
    """Fit surrogate index on historical experiments with R² validation."""
    X = historical_experiments[surrogate_cols].values
    y = historical_experiments[outcome_col].values
    model = Ridge(alpha=alpha).fit(X, y)
    y_pred = model.predict(X)
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = float(1 - ss_res / ss_tot) if ss_tot > 0 else 0.0

    def predict_long_run(new_short_run: pd.DataFrame) -> np.ndarray:
        return model.predict(new_short_run[surrogate_cols].values)

    return {
        'weights': dict(zip(surrogate_cols, model.coef_.round(6))),
        'r2': round(r2, 4),
        'surrogacy_valid': r2 >= 0.70,
        'n_training_experiments': len(historical_experiments),
        'predict_fn': predict_long_run,
        'warning': None if r2 >= 0.70
            else f"R²={r2:.2f} — surrogate explains only {r2:.0%} of long-run variance",
    }


def extrapolate_from_carryover(
    weekly_effects: np.ndarray,
    weekly_se: np.ndarray,
    horizon_weeks: int = 26,
) -> dict:
    """Extrapolate long-run effect using exponential carryover model."""
    weeks = np.arange(1, len(weekly_effects) + 1, dtype=float)

    def model(t, tau_inf, tau_0, lam):
        return tau_inf + (tau_0 - tau_inf) * np.exp(-lam * t)

    try:
        popt, pcov = curve_fit(
            model, weeks, weekly_effects,
            # weight each week by its standard error (not its variance)
            sigma=weekly_se, absolute_sigma=True,
            p0=[weekly_effects[-1], weekly_effects[0], 0.3],
            bounds=([-np.inf, -np.inf, 0.01], [np.inf, np.inf, 5]),
            maxfev=5000,
        )
        tau_inf, tau_0, lam = popt
        tau_inf_se = float(np.sqrt(pcov[0, 0]))
        future = float(model(horizon_weeks, *popt))
        novelty_frac = float((tau_0 - tau_inf) / (tau_0 + 1e-9))
        return {
            'tau_inf': round(float(tau_inf), 4),
            'tau_inf_se': round(tau_inf_se, 4),
            'half_life_weeks': round(float(np.log(2) / lam), 1),
            f'effect_at_{horizon_weeks}w': round(future, 4),
            'novelty_fraction': round(novelty_frac, 3),
        }
    except RuntimeError:
        return {
            'error': f'Model did not converge with {len(weekly_effects)} weeks of data. '
                     'Need at least 4 weeks.'
        }
```
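
A sanity check on the hash-based assignment (re-implemented inline so the snippet is self-contained, with feature_team fixed to 'growth' and hypothetical salt names): the realized holdout fraction should hover near the configured 2%, and changing the salt should reshuffle membership almost completely.

```python
import hashlib

def in_holdout(user_id: int, salt: str, fraction: float = 0.02) -> bool:
    # mirrors is_in_holdout above for a single team
    digest = hashlib.sha256(f"{salt}:growth:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0xFFFFFFFF < fraction

users = range(20_000)
frac = sum(in_holdout(u, 'holdout_v1') for u in users) / 20_000
overlap = sum(in_holdout(u, 'holdout_v1') and in_holdout(u, 'holdout_v2')
              for u in users)
# frac should be close to 0.02; overlap between two salts is near
# fraction**2 of users, i.e. rotation replaces almost the whole holdout
```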
