Switchback Experiments
In a two-sided marketplace, everything is connected. If you raise surge prices in Seattle during a treatment period, the extra driver earnings shift supply for hours afterward. If you test a new ETA algorithm in the morning, idle drivers from the experiment carry over into the afternoon control. Standard A/B testing would assign cities to treatment or control — but spatial randomization still shares the supply pool. The solution is randomization over time, not space: switchback experiments.
Theory
Figure: alternating treatment / control time blocks.
A switchback experiment alternates the entire system between treatment and control in successive time blocks. Because every unit is exposed to both conditions at different times, interference across units is eliminated by design.
Why spatial randomization violates SUTVA in marketplaces. In a ride-share or delivery marketplace, treatment in one geographic zone affects supply and demand in neighboring zones through driver repositioning. Assigning half the drivers to treatment and half to control creates SUTVA violations because treated drivers compete with control drivers for the same trips. Time-based assignment eliminates this: during a control window, all drivers are in control; during a treatment window, all drivers are in treatment.
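The assignment scheme above can be sketched concretely. This is a minimal illustration (the function name and dates are invented for this example) of a deterministic T/C/T/C schedule; real designs usually randomize the assignment within days rather than alternating strictly:

```python
import pandas as pd

def alternate_blocks(start: str, n_blocks: int, block_minutes: int = 60) -> pd.DataFrame:
    """Deterministic T/C/T/C alternation of consecutive time blocks:
    during each block the WHOLE marketplace is in one condition."""
    starts = pd.date_range(start, periods=n_blocks, freq=f"{block_minutes}min")
    arms = ["T" if i % 2 == 0 else "C" for i in range(n_blocks)]
    return pd.DataFrame({"block_start": starts, "assignment": arms})

schedule = alternate_blocks("2024-01-01 00:00", n_blocks=6)
```

Because every block covers the entire city, no driver in a treatment block ever competes against a control-assigned driver.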
Why it had to be this way. The exposure mapping formalism makes this precise. Define $Y_i(\mathbf{z})$ as unit $i$'s potential outcome under the full assignment vector $\mathbf{z} = (z_1, \dots, z_n)$. In a two-sided market, this cannot be simplified to $Y_i(z_i)$ under spatial assignment because driver $i$'s outcome depends on driver $j$'s assignment. Time-based randomization restores $Y_t(z_t)$ because the system state at time $t$ depends only on the block assignment at time $t$.
Carry-over effects. The critical challenge: the system state at the start of a new block inherits from the previous block. After a treatment surge-pricing window, drivers are repositioned and riders' mental models of pricing are updated — these effects persist into the next control window. The carry-over duration determines the minimum viable block length.
Block length selection. The block length $\Delta$ must satisfy $\Delta \gg \tau$, where $\tau$ is the carry-over duration. A common heuristic: $\Delta \geq 5\tau$, so that carry-over is less than 20% of the block. Shorter blocks mean more independent observations (higher power) but more bias from carry-over.
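The heuristic is a one-liner in code; `min_block_minutes` is an illustrative name, and the 20% threshold is the rule of thumb from the text, not a universal constant:

```python
def min_block_minutes(carryover_minutes: float, max_carryover_frac: float = 0.2) -> float:
    """Smallest block length such that carry-over contaminates at most
    `max_carryover_frac` of each block (the Delta >= 5*tau heuristic)."""
    return carryover_minutes / max_carryover_frac

# A 10-minute carry-over implies blocks of at least 50 minutes.
```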
Variance estimation. Observations within a block are temporally correlated. The standard SE formula ($\hat{\sigma}/\sqrt{n}$) is invalid — it understates variance and inflates false positive rates. Two valid approaches:
- Block bootstrap: resample entire blocks to estimate variance
- Newey-West HAC standard errors: heteroscedasticity- and autocorrelation-consistent
With $K$ blocks (half treatment, half control), the effective sample size is $K$, not the number of underlying observations $n$.
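The block bootstrap listed above can be sketched as follows. This is a minimal version (`block_bootstrap_se` is an invented name) that assumes the data have already been collapsed to one mean per block:

```python
import numpy as np

def block_bootstrap_se(block_means: np.ndarray, treat: np.ndarray,
                       n_boot: int = 2000, seed: int = 0) -> float:
    """Block bootstrap SE of the ATE: resample whole blocks (one mean
    per block) with replacement within each arm, recompute the ATE each
    time, and take the std of the bootstrap distribution."""
    rng = np.random.default_rng(seed)
    t_idx = np.flatnonzero(treat == 1)
    c_idx = np.flatnonzero(treat == 0)
    ates = np.empty(n_boot)
    for b in range(n_boot):
        t_sample = rng.choice(t_idx, size=len(t_idx), replace=True)
        c_sample = rng.choice(c_idx, size=len(c_idx), replace=True)
        ates[b] = block_means[t_sample].mean() - block_means[c_sample].mean()
    return float(ates.std(ddof=1))
```

Resampling whole blocks preserves the within-block correlation structure by construction, which is exactly what the naive formula throws away.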
Power analysis. The MDE with $K$ blocks ($K/2$ per arm) is:

$$
\mathrm{MDE} = (z_{1-\alpha/2} + z_{1-\beta}) \sqrt{\frac{2\sigma_B^2}{K/2}}
$$

where $\sigma_B^2$ is the variance of block-level means — estimated from pre-experiment data at your planned block granularity.
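The power formula translates directly to code. A sketch, assuming block-level means and equal blocks per arm (`switchback_mde` is a hypothetical helper name):

```python
import numpy as np
from scipy.stats import norm

def switchback_mde(sigma_block: float, k_per_arm: int,
                   alpha: float = 0.05, power: float = 0.80) -> float:
    """MDE with k_per_arm blocks in each arm; sigma_block is the
    standard deviation of block-level means from pre-experiment data."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return float(z * np.sqrt(2 * sigma_block ** 2 / k_per_arm))
```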
Walkthrough
Scenario: A ride-share company tests a new surge pricing algorithm. The switchback runs for 4 weeks in a single city, alternating 60-minute blocks between the current algorithm (control) and the new algorithm (treatment).
Block design: 4 weeks × 7 days × 24 hours = 672 one-hour blocks. With 60-minute blocks and a 10-minute carry-over estimate, the ratio is 6:1 — acceptable.
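The arithmetic above is worth pinning down in code (the start date is illustrative):

```python
import pandas as pd

# 4 weeks x 7 days x 24 hours of one-hour blocks
N_BLOCKS = 4 * 7 * 24  # 672
BLOCK_MIN, CARRYOVER_MIN = 60, 10

block_starts = pd.date_range("2024-03-04 00:00", periods=N_BLOCKS, freq="60min")
ratio = BLOCK_MIN / CARRYOVER_MIN  # block length is 6x the estimated carry-over
```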
Carry-over detection:
```python
import numpy as np
import pandas as pd

def detect_carryover(
    block_outcomes: pd.Series,     # one value per block
    block_assignments: pd.Series,  # 'T' or 'C' per block
) -> pd.DataFrame:
    """Estimate carry-over by comparing control blocks after treatment
    vs. control blocks after control."""
    prev_assign = block_assignments.shift(1)
    post_treat = block_assignments.index[
        (block_assignments == 'C') & (prev_assign == 'T')
    ]
    post_control = block_assignments.index[
        (block_assignments == 'C') & (prev_assign == 'C')
    ]
    gap = (block_outcomes.loc[post_treat].mean()
           - block_outcomes.loc[post_control].mean())
    return pd.DataFrame([{
        'carryover_gap': gap,
        'n_post_treat': len(post_treat),
        'n_post_control': len(post_control),
    }])
```

HAC standard errors via regression:
```python
import statsmodels.api as sm
from statsmodels.stats.sandwich_covariance import cov_hac
from scipy.stats import norm

def switchback_ate_hac(
    block_outcomes: np.ndarray,  # (K,)
    block_treat: np.ndarray,     # (K,) — 0/1
    n_lags: int = 4,
) -> dict:
    """Estimate ATE with Newey-West HAC standard errors."""
    X = sm.add_constant(block_treat)
    model = sm.OLS(block_outcomes, X).fit()
    hac_cov = cov_hac(model, nlags=n_lags)
    hac_se = float(np.sqrt(hac_cov[1, 1]))
    ate = float(model.params[1])
    pval = float(2 * norm.sf(abs(ate / hac_se)))
    return {
        'ate': ate,
        'se_hac': hac_se,
        'se_naive': float(model.bse[1]),
        'se_inflation': round(hac_se / model.bse[1], 2),
        'pvalue': pval,
    }
```

A standard OLS SE of 0.03 vs. a HAC SE of 0.07 (2.3× larger) is common in practice — ignoring temporal autocorrelation leads to badly over-confident inferences.
Analysis & Evaluation
Where your intuition breaks. The natural instinct is to use shorter blocks for more data points and therefore more power. The opposite is often true: very short blocks mean most of each block is dominated by carry-over from the previous block, biasing your estimate toward zero (attenuation). The optimal block length balances power gain from more blocks against bias from more carry-over contamination — and this optimal point is usually much longer than practitioners expect.
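A toy model makes the attenuation mechanism concrete. This is a sketch under a crude linear-mixing assumption (the contaminated fraction of each block simply reflects the previous arm's condition), not a model of real marketplace dynamics; `attenuated_effect` is an invented name:

```python
def attenuated_effect(true_ate: float, block_minutes: float,
                      carryover_minutes: float) -> float:
    """Expected measured effect when the first `carryover_minutes` of
    each block still behave like the previous block's condition, under
    a simple linear-mixing assumption."""
    frac = min(carryover_minutes / block_minutes, 1.0)
    return true_ate * (1 - frac)

# With a 10-minute carry-over, 15-minute blocks recover only a third
# of the true effect; 60-minute blocks recover five sixths of it.
```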
| Design | Unit | Sample size | SUTVA | Carry-over |
|---|---|---|---|---|
| User-level A/B | User | $n$ users (large) | Valid if no network effects | None |
| City-level A/B | City | $m$ cities (small) | Violated in marketplaces | None |
| Switchback | Time block | $K$ blocks | Restored by design | Present at block boundaries |
When pure holdout is better. If carry-over is longer than half the block length, switchback becomes unreliable. City-level or geo-level holdout with a longer test window is more appropriate. Switchback works best for fast-resetting systems where carry-over dissipates within minutes.
Never use OLS standard errors for switchback analysis. Temporal autocorrelation within and across blocks makes standard OLS SEs anti-conservative. Always use block bootstrap or HAC-robust SEs. Report the ratio of HAC SE to naive OLS SE as a diagnostic.
Production-Ready Code
```python
"""
Switchback experiment production system.
Block assignment, carry-over detection, HAC estimation,
and block-length optimization.
"""
from __future__ import annotations

from dataclasses import dataclass

import numpy as np
import pandas as pd
from scipy.stats import norm
import statsmodels.api as sm
from statsmodels.stats.sandwich_covariance import cov_hac


@dataclass
class SwitchbackConfig:
    block_minutes: int
    n_blocks: int
    carryover_minutes: int
    metric_col: str = 'metric'
    time_col: str = 'block_start'
    treat_col: str = 'treatment'


def assign_blocks(
    block_starts: pd.DatetimeIndex,
    seed: int = 42,
) -> np.ndarray:
    """Balanced random assignment within each day to prevent
    time-of-day confounding."""
    rng = np.random.default_rng(seed)
    assignments = np.zeros(len(block_starts), dtype=int)
    days = np.array([t.date() for t in block_starts])
    for day in np.unique(days):
        mask = days == day
        n = mask.sum()
        perm = rng.permutation(n)
        half = n // 2
        day_assign = np.zeros(n, dtype=int)
        day_assign[perm[:half]] = 1
        assignments[mask] = day_assign
    return assignments


def block_optimizer(
    pre_data: pd.DataFrame,
    carryover_minutes: int,
    target_mde: float = 0.05,
    alpha: float = 0.05,
    power: float = 0.80,
) -> dict:
    """Find the shortest viable block length for the desired MDE."""
    sigma_per_minute = pre_data['metric'].std()
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    for block_min in [15, 30, 60, 120, 240]:
        carryover_frac = carryover_minutes / block_min
        if carryover_frac > 0.4:
            continue  # carry-over would contaminate too much of each block
        block_std = sigma_per_minute * np.sqrt(block_min) * (1 - carryover_frac)
        k_per_arm = ((z_alpha + z_beta) ** 2 * 2 * block_std ** 2) / target_mde ** 2
        total_hours = k_per_arm * 2 * block_min / 60
        return {
            'block_minutes': block_min,
            'k_per_arm': int(np.ceil(k_per_arm)),
            'total_hours': round(total_hours, 1),
            'carryover_fraction': round(carryover_frac, 3),
        }
    return {'error': 'No valid block length found — increase experiment duration or relax MDE target'}


def estimate_ate_hac(
    data: pd.DataFrame,
    config: SwitchbackConfig,
    burn_in_blocks: int = 1,
) -> dict:
    """ATE with HAC standard errors, excluding carry-over burn-in periods."""
    df = data.copy().sort_values(config.time_col)
    prev_treat = df[config.treat_col].shift(1)
    is_transition = (df[config.treat_col] != prev_treat) & prev_treat.notna()
    transition_positions = df.index[is_transition].tolist()
    burn_in_idx = set()
    for pos in transition_positions:
        loc = df.index.get_loc(pos)
        for k in range(burn_in_blocks):
            if loc + k < len(df):
                burn_in_idx.add(df.index[loc + k])
    clean = df[~df.index.isin(burn_in_idx)]
    X = sm.add_constant(clean[config.treat_col].values.astype(float))
    y = clean[config.metric_col].values
    model = sm.OLS(y, X).fit()
    n_lags = max(1, int(np.ceil(config.block_minutes / 15)))
    hac_cov = cov_hac(model, nlags=n_lags)
    hac_se = float(np.sqrt(hac_cov[1, 1]))
    ate = float(model.params[1])
    z = ate / hac_se
    pvalue = float(2 * norm.sf(abs(z)))
    naive_se = float(model.bse[1])
    return {
        'ate': round(ate, 6),
        'hac_se': round(hac_se, 6),
        'naive_se': round(naive_se, 6),
        'se_inflation': round(hac_se / naive_se, 2),
        'z': round(z, 3),
        'pvalue': round(pvalue, 4),
        'n_blocks_used': len(clean),
        'n_blocks_dropped': len(data) - len(clean),
    }
```