Switchback Experiments
In a two-sided marketplace, everything is connected. If you raise surge prices in Seattle during a treatment period, the extra driver earnings shift supply for hours afterward. If you test a new ETA algorithm in the morning, idle drivers from the experiment carry over into the afternoon control. Standard A/B testing would assign cities to treatment or control — but spatial randomization still shares the supply pool. The solution is randomization over time, not space: switchback experiments.
Theory
Figure: alternating treatment / control time blocks.
A switchback experiment alternates the entire system between treatment and control in successive time blocks. Because every unit is exposed to both conditions at different times, interference across units is eliminated by design.
Why spatial randomization violates SUTVA in marketplaces. In a ride-share or delivery marketplace, treatment in one geographic zone affects supply and demand in neighboring zones through driver repositioning. Assigning half the drivers to treatment and half to control creates SUTVA violations because treated drivers compete with control drivers for the same trips. Time-based assignment eliminates this: during a control window, all drivers are in control; during a treatment window, all drivers are in treatment.
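The assignment scheme above can be sketched concretely. This is a minimal illustration (the function name and dates are invented for this example) of a deterministic T/C/T/C schedule; real designs usually randomize the assignment within days rather than alternating strictly:

```python
import pandas as pd

def alternate_blocks(start: str, n_blocks: int, block_minutes: int = 60) -> pd.DataFrame:
    """Deterministic T/C/T/C alternation of consecutive time blocks:
    during each block the WHOLE marketplace is in one condition."""
    starts = pd.date_range(start, periods=n_blocks, freq=f"{block_minutes}min")
    arms = ["T" if i % 2 == 0 else "C" for i in range(n_blocks)]
    return pd.DataFrame({"block_start": starts, "assignment": arms})

schedule = alternate_blocks("2024-01-01 00:00", n_blocks=6)
```

Because every block covers the entire city, no driver in a treatment block ever competes against a control-assigned driver.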
Why it had to be this way. The exposure mapping formalism makes this precise. Define $Y_i(\mathbf{z})$ as unit $i$'s potential outcome under the full assignment vector $\mathbf{z} = (z_1, \dots, z_n)$. In a two-sided market, this cannot be simplified to $Y_i(z_i)$ under spatial assignment because driver $i$'s outcome depends on driver $j$'s assignment. Time-based randomization restores $Y_t(z_t)$ because the system state at time $t$ depends only on the block assignment at time $t$.
Carry-over effects. The critical challenge: the system state at the start of a new block inherits from the previous block. After a treatment surge-pricing window, drivers are repositioned and riders' mental models of pricing are updated — these effects persist into the next control window. The carry-over duration determines the minimum viable block length.
Block length selection. The block length $\Delta$ must satisfy $\Delta \gg \tau$, where $\tau$ is the carry-over duration. A common heuristic: $\Delta \geq 5\tau$, so that carry-over is less than 20% of the block. Shorter blocks mean more independent observations (higher power) but more bias from carry-over.
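The heuristic is a one-liner in code; `min_block_minutes` is an illustrative name, and the 20% threshold is the rule of thumb from the text, not a universal constant:

```python
def min_block_minutes(carryover_minutes: float, max_carryover_frac: float = 0.2) -> float:
    """Smallest block length such that carry-over contaminates at most
    `max_carryover_frac` of each block (the Delta >= 5*tau heuristic)."""
    return carryover_minutes / max_carryover_frac

# A 10-minute carry-over implies blocks of at least 50 minutes.
```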
Variance estimation. Observations within a block are temporally correlated. The standard SE formula ($\hat{\sigma}/\sqrt{n}$) is invalid — it understates variance and inflates false positive rates. Two valid approaches:
- Block bootstrap: resample entire blocks to estimate variance
- Newey-West HAC standard errors: heteroscedasticity- and autocorrelation-consistent
With $K$ blocks (half treatment, half control), the effective sample size is $K$, not the number of underlying observations $n$.
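The block bootstrap listed above can be sketched as follows. This is a minimal version (`block_bootstrap_se` is an invented name) that assumes the data have already been collapsed to one mean per block:

```python
import numpy as np

def block_bootstrap_se(block_means: np.ndarray, treat: np.ndarray,
                       n_boot: int = 2000, seed: int = 0) -> float:
    """Block bootstrap SE of the ATE: resample whole blocks (one mean
    per block) with replacement within each arm, recompute the ATE each
    time, and take the std of the bootstrap distribution."""
    rng = np.random.default_rng(seed)
    t_idx = np.flatnonzero(treat == 1)
    c_idx = np.flatnonzero(treat == 0)
    ates = np.empty(n_boot)
    for b in range(n_boot):
        t_sample = rng.choice(t_idx, size=len(t_idx), replace=True)
        c_sample = rng.choice(c_idx, size=len(c_idx), replace=True)
        ates[b] = block_means[t_sample].mean() - block_means[c_sample].mean()
    return float(ates.std(ddof=1))
```

Resampling whole blocks preserves the within-block correlation structure by construction, which is exactly what the naive formula throws away.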
Power analysis. The MDE with $K$ blocks ($K/2$ per arm) is:

$$
\mathrm{MDE} = (z_{1-\alpha/2} + z_{1-\beta}) \sqrt{\frac{2\sigma_B^2}{K/2}}
$$

where $\sigma_B^2$ is the variance of block-level means — estimated from pre-experiment data at your planned block granularity.
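The power formula translates directly to code. A sketch, assuming block-level means and equal blocks per arm (`switchback_mde` is a hypothetical helper name):

```python
import numpy as np
from scipy.stats import norm

def switchback_mde(sigma_block: float, k_per_arm: int,
                   alpha: float = 0.05, power: float = 0.80) -> float:
    """MDE with k_per_arm blocks in each arm; sigma_block is the
    standard deviation of block-level means from pre-experiment data."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return float(z * np.sqrt(2 * sigma_block ** 2 / k_per_arm))
```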
Walkthrough
Scenario: A ride-share company tests a new surge pricing algorithm. The switchback runs for 4 weeks in a single city, alternating 60-minute blocks between the current algorithm (control) and the new algorithm (treatment).
Block design: 4 weeks × 7 days × 24 hours = 672 one-hour blocks. With 60-minute blocks and a 10-minute carry-over estimate, the ratio is 6:1 — acceptable.
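The arithmetic above is worth pinning down in code (the start date is illustrative):

```python
import pandas as pd

# 4 weeks x 7 days x 24 hours of one-hour blocks
N_BLOCKS = 4 * 7 * 24  # 672
BLOCK_MIN, CARRYOVER_MIN = 60, 10

block_starts = pd.date_range("2024-03-04 00:00", periods=N_BLOCKS, freq="60min")
ratio = BLOCK_MIN / CARRYOVER_MIN  # block length is 6x the estimated carry-over
```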
Carry-over detection:
```python
import numpy as np
import pandas as pd

def detect_carryover(
    block_outcomes: pd.Series,     # one value per block
    block_assignments: pd.Series,  # 'T' or 'C' per block
) -> pd.DataFrame:
    """Estimate carry-over by comparing control blocks after treatment
    vs. control blocks after control."""
    prev_assign = block_assignments.shift(1)
    post_treat = block_assignments.index[
        (block_assignments == 'C') & (prev_assign == 'T')
    ]
    post_control = block_assignments.index[
        (block_assignments == 'C') & (prev_assign == 'C')
    ]
    gap = (block_outcomes.loc[post_treat].mean()
           - block_outcomes.loc[post_control].mean())
    return pd.DataFrame([{
        'carryover_gap': gap,
        'n_post_treat': len(post_treat),
        'n_post_control': len(post_control),
    }])
```

HAC standard errors via regression:
```python
import statsmodels.api as sm
from statsmodels.stats.sandwich_covariance import cov_hac
from scipy.stats import norm

def switchback_ate_hac(
    block_outcomes: np.ndarray,  # (K,)
    block_treat: np.ndarray,     # (K,) — 0/1
    n_lags: int = 4,
) -> dict:
    """Estimate ATE with Newey-West HAC standard errors."""
    X = sm.add_constant(block_treat)
    model = sm.OLS(block_outcomes, X).fit()
    hac_cov = cov_hac(model, nlags=n_lags)
    hac_se = float(np.sqrt(hac_cov[1, 1]))
    ate = float(model.params[1])
    pval = float(2 * norm.sf(abs(ate / hac_se)))
    return {
        'ate': ate,
        'se_hac': hac_se,
        'se_naive': float(model.bse[1]),
        'se_inflation': round(hac_se / model.bse[1], 2),
        'pvalue': pval,
    }
```

A standard OLS SE of 0.03 vs. a HAC SE of 0.07 (2.3× larger) is common in practice — ignoring temporal autocorrelation leads to badly over-confident inferences.
Analysis & Evaluation
Where your intuition breaks. The natural instinct is to use shorter blocks for more data points and therefore more power. The opposite is often true: very short blocks mean most of each block is dominated by carry-over from the previous block, biasing your estimate toward zero (attenuation). The optimal block length balances power gain from more blocks against bias from more carry-over contamination — and this optimal point is usually much longer than practitioners expect.
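A toy model makes the attenuation mechanism concrete. This is a sketch under a crude linear-mixing assumption (the contaminated fraction of each block simply reflects the previous arm's condition), not a model of real marketplace dynamics; `attenuated_effect` is an invented name:

```python
def attenuated_effect(true_ate: float, block_minutes: float,
                      carryover_minutes: float) -> float:
    """Expected measured effect when the first `carryover_minutes` of
    each block still behave like the previous block's condition, under
    a simple linear-mixing assumption."""
    frac = min(carryover_minutes / block_minutes, 1.0)
    return true_ate * (1 - frac)

# With a 10-minute carry-over, 15-minute blocks recover only a third
# of the true effect; 60-minute blocks recover five sixths of it.
```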
| Design | Unit | Sample size | SUTVA | Carry-over |
|---|---|---|---|---|
| User-level A/B | User | $n$ users (large) | Valid if no network effects | None |
| City-level A/B | City | $m$ cities (small) | Violated in marketplaces | None |
| Switchback | Time block | $K$ blocks | Restored by design | Present at block boundaries |
When pure holdout is better. If carry-over is longer than half the block length, switchback becomes unreliable. City-level or geo-level holdout with a longer test window is more appropriate. Switchback works best for fast-resetting systems where carry-over dissipates within minutes.
Never use OLS standard errors for switchback analysis. Temporal autocorrelation within and across blocks makes standard OLS SEs anti-conservative. Always use block bootstrap or HAC-robust SEs. Report the ratio of HAC SE to naive OLS SE as a diagnostic.
Production-Ready Code
```python
"""
Switchback experiment production system.
Block assignment, carry-over detection, HAC estimation,
and block-length optimization.
"""
from __future__ import annotations

from dataclasses import dataclass

import numpy as np
import pandas as pd
from scipy.stats import norm
import statsmodels.api as sm
from statsmodels.stats.sandwich_covariance import cov_hac


@dataclass
class SwitchbackConfig:
    block_minutes: int
    n_blocks: int
    carryover_minutes: int
    metric_col: str = 'metric'
    time_col: str = 'block_start'
    treat_col: str = 'treatment'


def assign_blocks(
    block_starts: pd.DatetimeIndex,
    seed: int = 42,
) -> np.ndarray:
    """Balanced random assignment within each day to prevent
    time-of-day confounding."""
    rng = np.random.default_rng(seed)
    assignments = np.zeros(len(block_starts), dtype=int)
    days = np.array([t.date() for t in block_starts])
    for day in np.unique(days):
        mask = days == day
        n = mask.sum()
        perm = rng.permutation(n)
        half = n // 2
        day_assign = np.zeros(n, dtype=int)
        day_assign[perm[:half]] = 1
        assignments[mask] = day_assign
    return assignments


def block_optimizer(
    pre_data: pd.DataFrame,
    carryover_minutes: int,
    target_mde: float = 0.05,
    alpha: float = 0.05,
    power: float = 0.80,
) -> dict:
    """Find the shortest viable block length for the desired MDE."""
    sigma_per_minute = pre_data['metric'].std()
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    for block_min in [15, 30, 60, 120, 240]:
        carryover_frac = carryover_minutes / block_min
        if carryover_frac > 0.4:
            continue  # carry-over would contaminate too much of each block
        block_std = sigma_per_minute * np.sqrt(block_min) * (1 - carryover_frac)
        k_per_arm = ((z_alpha + z_beta) ** 2 * 2 * block_std ** 2) / target_mde ** 2
        total_hours = k_per_arm * 2 * block_min / 60
        return {
            'block_minutes': block_min,
            'k_per_arm': int(np.ceil(k_per_arm)),
            'total_hours': round(total_hours, 1),
            'carryover_fraction': round(carryover_frac, 3),
        }
    return {'error': 'No valid block length found — increase experiment duration or relax MDE target'}


def estimate_ate_hac(
    data: pd.DataFrame,
    config: SwitchbackConfig,
    burn_in_blocks: int = 1,
) -> dict:
    """ATE with HAC standard errors, excluding carry-over burn-in periods."""
    df = data.copy().sort_values(config.time_col)
    prev_treat = df[config.treat_col].shift(1)
    is_transition = (df[config.treat_col] != prev_treat) & prev_treat.notna()
    transition_positions = df.index[is_transition].tolist()
    burn_in_idx = set()
    for pos in transition_positions:
        loc = df.index.get_loc(pos)
        for k in range(burn_in_blocks):
            if loc + k < len(df):
                burn_in_idx.add(df.index[loc + k])
    clean = df[~df.index.isin(burn_in_idx)]
    X = sm.add_constant(clean[config.treat_col].values.astype(float))
    y = clean[config.metric_col].values
    model = sm.OLS(y, X).fit()
    n_lags = max(1, int(np.ceil(config.block_minutes / 15)))
    hac_cov = cov_hac(model, nlags=n_lags)
    hac_se = float(np.sqrt(hac_cov[1, 1]))
    ate = float(model.params[1])
    z = ate / hac_se
    pvalue = float(2 * norm.sf(abs(z)))
    naive_se = float(model.bse[1])
    return {
        'ate': round(ate, 6),
        'hac_se': round(hac_se, 6),
        'naive_se': round(naive_se, 6),
        'se_inflation': round(hac_se / naive_se, 2),
        'z': round(z, 3),
        'pvalue': round(pvalue, 4),
        'n_blocks_used': len(clean),
        'n_blocks_dropped': len(data) - len(clean),
    }
```