
Switchback Experiments

In a two-sided marketplace, everything is connected. If you raise surge prices in Seattle during a treatment period, the extra driver earnings shift supply for hours afterward. If you test a new ETA algorithm in the morning, idle drivers from the experiment carry over into the afternoon control. Standard A/B testing would assign cities to treatment or control, but treated and control zones still draw on the same supply pool. The solution is randomization over time, not space: switchback experiments.

Theory

Figure: alternating treatment/control time blocks T1–T8, with carry-over bleeding from each treatment block into the following control block.

A switchback experiment alternates the entire system between treatment and control in successive time blocks. Because every unit is exposed to both conditions at different times, interference across units is eliminated by design.

Why spatial randomization violates SUTVA in marketplaces. In a ride-share or delivery marketplace, treatment in one geographic zone affects supply and demand in neighboring zones through driver repositioning. Assigning half the drivers to treatment and half to control creates SUTVA violations because treated drivers compete with control drivers for the same trips. Time-based assignment eliminates this: during a control window, all drivers are in control; during a treatment window, all drivers are in treatment.

Why it had to be this way. The exposure mapping formalism makes this precise. Define $Y_i(d_1, \ldots, d_n)$ as unit $i$'s potential outcome under all $n$ units' assignments. In a two-sided market, this cannot be simplified to $Y_i(d_i)$ under spatial assignment, because driver $i$'s outcome depends on driver $j$'s assignment. Time-based randomization restores $Y_t(d_t)$: the system state at time $t$ depends only on the block assignment at time $t$.

Carry-over effects. The critical challenge: the system state at the start of a new block inherits from the previous block. After a treatment surge-pricing window, drivers are repositioned and riders' mental models of pricing are updated, and these effects persist into the next control window. The carry-over duration $\lambda$ determines the minimum viable block length.

Block length selection. The block length $B$ must satisfy $B \gg \lambda$. A common heuristic: $B \geq 5\lambda$, so that carry-over occupies at most 20% of each block. Shorter blocks mean more independent observations (higher power) but more bias from carry-over.
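The heuristic is one line of arithmetic; here it is as a sketch (the function name and the default multiplier of 5 are illustrative, not from any library):

```python
def min_block_minutes(carryover_minutes: float, multiple: float = 5.0) -> float:
    """Shortest block length satisfying B >= multiple * lambda, so that
    carry-over occupies at most 1/multiple of each block."""
    return multiple * carryover_minutes

# A 10-minute carry-over implies blocks of at least 50 minutes.
min_block_minutes(10)  # 50.0
```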

Variance estimation. Observations within a block are temporally correlated. The standard SE formula ($\hat{\sigma}/\sqrt{n}$) is invalid: it understates the variance and inflates the false-positive rate. Two valid approaches:

  • Block bootstrap: resample entire blocks to estimate variance
  • Newey-West HAC standard errors: heteroscedasticity- and autocorrelation-consistent

With $K$ blocks (half treatment, half control), the effective sample size is $K$, not the number of observations $N$.
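A minimal sketch of the block-bootstrap approach (the function name and defaults are illustrative; resampling blocks as iid units assumes correlation lives mostly within, not across, blocks — for strong cross-block correlation a moving-block bootstrap is preferable):

```python
import numpy as np

def block_bootstrap_se(
    block_means: np.ndarray,   # (K,) one mean per block
    block_treat: np.ndarray,   # (K,) 0/1 assignment
    n_boot: int = 2000,
    seed: int = 0,
) -> float:
    """SE of the treatment-control difference by resampling whole blocks,
    preserving within-block correlation in each resample."""
    rng = np.random.default_rng(seed)
    K = len(block_means)
    diffs = np.full(n_boot, np.nan)
    for b in range(n_boot):
        idx = rng.integers(0, K, size=K)   # draw blocks with replacement
        m, t = block_means[idx], block_treat[idx]
        if 0 < t.sum() < K:                # skip degenerate one-arm resamples
            diffs[b] = m[t == 1].mean() - m[t == 0].mean()
    return float(np.nanstd(diffs, ddof=1))
```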

Power analysis. The MDE with KK blocks is:

$$\Delta_{\text{MDE}} = (z_{\alpha/2} + z_{\beta}) \cdot \sqrt{\frac{2\hat{\sigma}^2_{\text{block}}}{K/2}}$$

where $\hat{\sigma}^2_{\text{block}}$ is the variance of block-level means, estimated from pre-experiment data at your planned block granularity.
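The arithmetic, as a hypothetical helper (it includes the power term $z_\beta$, matching the sizing code in the production section of these notes):

```python
import numpy as np
from scipy.stats import norm

def switchback_mde(block_var: float, k_blocks: int,
                   alpha: float = 0.05, power: float = 0.80) -> float:
    """MDE for k_blocks split evenly between arms: two-sample sizing
    on block-level means with K/2 blocks per arm."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return float(z * np.sqrt(2 * block_var / (k_blocks / 2)))

# 672 blocks at unit block-level variance:
switchback_mde(1.0, 672)  # ≈ 0.216
```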

Walkthrough

Scenario: A ride-share company tests a new surge pricing algorithm. The switchback runs for 4 weeks in a single city, alternating 60-minute blocks between the current algorithm (control) and the new algorithm (treatment).

Block design: 4 weeks × 7 days × 24 hours = 672 one-hour blocks. With 60-minute blocks and a 10-minute carry-over estimate, the ratio is 6:1 — acceptable.

Carry-over detection:

python
import numpy as np
import pandas as pd
 
def detect_carryover(
    block_outcomes: pd.Series,     # one value per block
    block_assignments: pd.Series,  # 'T' or 'C' per block
) -> pd.DataFrame:
    """Estimate carry-over by comparing control blocks after treatment
    vs. control blocks after control."""
    prev_assign = block_assignments.shift(1)
    post_treat = block_assignments.index[
        (block_assignments == 'C') & (prev_assign == 'T')
    ]
    post_control = block_assignments.index[
        (block_assignments == 'C') & (prev_assign == 'C')
    ]
    gap = (block_outcomes.loc[post_treat].mean()
           - block_outcomes.loc[post_control].mean())
    return pd.DataFrame([{
        'carryover_gap': gap,
        'n_post_treat': len(post_treat),
        'n_post_control': len(post_control),
    }])
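To sanity-check the estimator, here is the same post-treatment vs. post-control comparison inlined on synthetic data (all effect sizes are made up for illustration):

```python
import numpy as np
import pandas as pd

# 200 blocks: treatment lifts the metric by 1.0, and 30% of that lift
# leaks into the following block.
rng = np.random.default_rng(0)
assign = pd.Series(rng.choice(['T', 'C'], size=200))
effect = np.where(assign == 'T', 1.0, 0.0)
leak = 0.3 * np.roll(effect, 1)
leak[0] = 0.0
outcomes = pd.Series(rng.normal(10.0, 0.2, size=200) + effect + leak)

# Control blocks after treatment vs. control blocks after control:
prev = assign.shift(1)
gap = (outcomes[(assign == 'C') & (prev == 'T')].mean()
       - outcomes[(assign == 'C') & (prev == 'C')].mean())
# gap recovers roughly the injected 0.3 carry-over, up to noise
```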

HAC standard errors via regression:

python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm
from statsmodels.stats.sandwich_covariance import cov_hac
 
def switchback_ate_hac(
    block_outcomes: np.ndarray,  # (K,)
    block_treat: np.ndarray,     # (K,) — 0/1
    n_lags: int = 4,
) -> dict:
    """Estimate ATE with Newey-West HAC standard errors."""
    X = sm.add_constant(block_treat.astype(float))
    model = sm.OLS(block_outcomes, X).fit()
    hac_cov = cov_hac(model, nlags=n_lags)
    hac_se = float(np.sqrt(hac_cov[1, 1]))
    ate = float(model.params[1])
    pval = float(2 * norm.sf(abs(ate / hac_se)))
    naive_se = float(model.bse[1])
    return {
        'ate': ate,
        'se_hac': hac_se,
        'se_naive': naive_se,
        'se_inflation': round(hac_se / naive_se, 2),
        'pvalue': pval,
    }

A standard OLS SE of 0.03 vs. HAC SE of 0.07 (2.3× larger) is common in practice — ignoring temporal autocorrelation leads to badly over-confident inferences.

Analysis & Evaluation

Where your intuition breaks. The natural instinct is to use shorter blocks for more data points and therefore more power. The opposite is often true: very short blocks mean most of each block is dominated by carry-over from the previous block, biasing your estimate toward zero (attenuation). The optimal block length balances power gain from more blocks against bias from more carry-over contamination — and this optimal point is usually much longer than practitioners expect.
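A back-of-the-envelope sketch of this trade-off (a toy model assuming attenuation $1 - \lambda/B$ and power scaling with $\sqrt{K}$; the function and numbers are illustrative):

```python
import numpy as np

def effective_signal(block_min: float, total_min: float,
                     carryover_min: float) -> float:
    """Relative detectable signal: sqrt(K) from the number of blocks,
    attenuated by the carry-over-contaminated fraction of each block."""
    k = total_min / block_min
    attenuation = max(0.0, 1.0 - carryover_min / block_min)
    return float(np.sqrt(k) * attenuation)

candidates = [15, 30, 60, 120, 240]
total, lam = 4 * 7 * 24 * 60, 10.0   # 4 weeks of minutes, 10-min carry-over
best = max(candidates, key=lambda b: effective_signal(b, total, lam))
# Under these toy assumptions, 30-minute blocks beat the naive 15-minute
# choice: the extra blocks don't compensate for the attenuation.
```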

| Design | Unit | Sample size | SUTVA | Carry-over |
|---|---|---|---|---|
| User-level A/B | User | $n$ users | Valid if no network effects | None |
| City-level A/B | City | $n$ cities | Violated in marketplaces | None |
| Switchback | Time block | $K$ blocks | Restored by design | Present at block boundaries |

When pure holdout is better. If carry-over is longer than half the block length, switchback becomes unreliable. City-level or geo-level holdout with a longer test window is more appropriate. Switchback works best for fast-resetting systems where carry-over dissipates within minutes.

⚠️Warning

Never use OLS standard errors for switchback analysis. Temporal autocorrelation within and across blocks makes standard OLS SEs anti-conservative. Always use block bootstrap or HAC-robust SEs. Report the ratio of HAC SE to naive OLS SE as a diagnostic.

Production-Ready Code

python
"""
Switchback experiment production system.
Block assignment, carry-over detection, HAC estimation,
and block-length optimization.
"""
 
from __future__ import annotations
from dataclasses import dataclass
import numpy as np
import pandas as pd
from scipy.stats import norm
import statsmodels.api as sm
from statsmodels.stats.sandwich_covariance import cov_hac
 
 
@dataclass
class SwitchbackConfig:
    block_minutes: int
    n_blocks: int
    carryover_minutes: int
    metric_col: str = 'metric'
    time_col: str = 'block_start'
    treat_col: str = 'treatment'
 
 
def assign_blocks(
    block_starts: pd.DatetimeIndex,
    seed: int = 42,
) -> np.ndarray:
    """Balanced random assignment within each day to prevent
    time-of-day confounding."""
    rng = np.random.default_rng(seed)
    assignments = np.zeros(len(block_starts), dtype=int)
    days = np.array([t.date() for t in block_starts])
    for day in np.unique(days):
        mask = days == day
        n = mask.sum()
        perm = rng.permutation(n)
        half = n // 2
        day_assign = np.zeros(n, dtype=int)
        day_assign[perm[:half]] = 1
        assignments[mask] = day_assign
    return assignments
 
 
def block_optimizer(
    pre_data: pd.DataFrame,
    carryover_minutes: int,
    target_mde: float = 0.05,
    alpha: float = 0.05,
    power: float = 0.80,
) -> dict:
    """Find the shortest block length that keeps carry-over below 40%
    of the block and size the experiment for the target MDE."""
    # Assumes pre_data holds minute-level observations of the metric.
    sigma_per_minute = pre_data['metric'].std()
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    for block_min in [15, 30, 60, 120, 240]:
        carryover_frac = carryover_minutes / block_min
        if carryover_frac > 0.4:
            continue
        # Std of the block total, assuming independent minute-level obs.
        block_std = sigma_per_minute * np.sqrt(block_min)
        # Carry-over attenuates the measurable effect, so shrink the
        # detectable effect rather than the noise.
        attenuated_mde = target_mde * (1 - carryover_frac)
        k_per_arm = ((z_alpha + z_beta) ** 2 * 2 * block_std ** 2) / attenuated_mde ** 2
        total_hours = k_per_arm * 2 * block_min / 60
        return {
            'block_minutes': block_min,
            'k_per_arm': int(np.ceil(k_per_arm)),
            'total_hours': round(total_hours, 1),
            'carryover_fraction': round(carryover_frac, 3),
        }
    return {'error': 'No valid block length found: increase experiment duration or relax the MDE target'}
 
 
def estimate_ate_hac(
    data: pd.DataFrame,
    config: SwitchbackConfig,
    burn_in_blocks: int = 1,
) -> dict:
    """ATE with HAC standard errors, excluding carry-over burn-in periods."""
    df = data.copy().sort_values(config.time_col)
    prev_treat = df[config.treat_col].shift(1)
    is_transition = (df[config.treat_col] != prev_treat) & prev_treat.notna()
    transition_positions = df.index[is_transition].tolist()
    burn_in_idx = set()
    for pos in transition_positions:
        loc = df.index.get_loc(pos)
        for k in range(burn_in_blocks):
            if loc + k < len(df):
                burn_in_idx.add(df.index[loc + k])
    clean = df[~df.index.isin(burn_in_idx)]
 
    X = sm.add_constant(clean[config.treat_col].values.astype(float))
    y = clean[config.metric_col].values
    model = sm.OLS(y, X).fit()
    n_lags = max(1, int(np.ceil(config.block_minutes / 15)))
    hac_cov = cov_hac(model, nlags=n_lags)
    hac_se = float(np.sqrt(hac_cov[1, 1]))
    ate = float(model.params[1])
    z = ate / hac_se
    pvalue = float(2 * norm.sf(abs(z)))
    naive_se = float(model.bse[1])
    return {
        'ate': round(ate, 6),
        'hac_se': round(hac_se, 6),
        'naive_se': round(naive_se, 6),
        'se_inflation': round(hac_se / naive_se, 2),
        'z': round(z, 3),
        'pvalue': round(pvalue, 4),
        'n_blocks_used': len(clean),
        'n_blocks_dropped': len(data) - len(clean),
    }
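As a self-contained check of the within-day balancing that assign_blocks performs (the same logic inlined here on a hypothetical week of hourly blocks):

```python
import numpy as np
import pandas as pd

# One week of hourly block starts.
block_starts = pd.date_range('2024-01-01', periods=7 * 24, freq='h')
rng = np.random.default_rng(42)
assignments = np.zeros(len(block_starts), dtype=int)
days = np.array([t.date() for t in block_starts])
for day in np.unique(days):
    mask = days == day
    perm = rng.permutation(mask.sum())
    day_assign = np.zeros(mask.sum(), dtype=int)
    day_assign[perm[: mask.sum() // 2]] = 1   # exactly half treated per day
    assignments[mask] = day_assign

per_day = pd.Series(assignments).groupby(days).sum()
# Every 24-block day receives exactly 12 treatment blocks, so treatment
# share cannot be confounded with day-of-week effects.
```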
