Geo Testing & Market Holdouts
Standard A/B tests assume that assigning user A to treatment doesn't affect user B's outcome. In marketing and pricing, that assumption breaks: a national ad campaign raises awareness everywhere, a discount floods supply chains, and a surge-pricing change shifts driver behavior across cities. When the intervention is geo-level rather than user-level, you need a geo test — one of the most rigorous (and expensive) experimental tools in the practitioner's toolkit.
Theory
Geo testing assigns entire geographic markets — cities, DMAs, states — to treatment or holdout rather than individual users. This restores the independence assumption, but introduces new challenges: you have far fewer units (20–100 markets vs. millions of users), and markets differ enormously in size, demographics, and baseline trends.
Why user-level randomization fails for geo interventions. The Stable Unit Treatment Value Assumption (SUTVA) requires that unit i's potential outcome Y_i depends only on its own assignment z_i, not on others' assignments z_{−i}. A national TV ad campaign violates this at the user level — if your ad plays in every market, there is no untreated user to compare against. The only valid randomization unit is the market itself.
Why it had to be this way. SUTVA makes the violation precise: Y_i(z_i, z_{−i}) ≠ Y_i(z_i) when geo-level spillovers exist. This is not a modeling choice — it is a statement about the data-generating process. Geo randomization restores SUTVA by making the treatment assignment unit match the spillover unit.
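A quick simulation makes the failure concrete. This is an illustrative sketch with made-up rates, not data from any real campaign: user-level randomization inside an exposed market measures roughly zero, while a geo-level comparison recovers the lift.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up numbers: a market-wide ad campaign lifts every user's weekly
# sign-up probability by 2 points, regardless of individual assignment.
base_rate, campaign_lift = 0.10, 0.02
n_users = 100_000

# User-level randomization inside a treated market: the spillover hits
# both arms, so the measured difference is ~0 even though the ad works.
assign = rng.random(n_users) < 0.5
outcomes = rng.random(n_users) < (base_rate + campaign_lift)
naive_effect = outcomes[assign].mean() - outcomes[~assign].mean()

# Geo-level randomization: treated vs. holdout markets differ by the true lift.
treated_market = rng.random(n_users) < (base_rate + campaign_lift)
holdout_market = rng.random(n_users) < base_rate
geo_effect = treated_market.mean() - holdout_market.mean()
```

The naive user-level estimate is noise around zero; the geo comparison lands near the true 0.02 lift.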
Market selection. With 20–100 markets, balance matters enormously. Markets are matched on pre-period trend similarity, population, seasonality, and demographic composition. The holdout should be 15–30% of total market volume — large enough to measure with precision, small enough to be worth the foregone revenue.
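Market matching can be sketched as a greedy pre-period screen. A hypothetical helper (`pick_holdout` is not a standard library function), assuming a (weeks × markets) DataFrame of outcomes; real selection would also balance demographics and seasonality:

```python
import numpy as np
import pandas as pd

def pick_holdout(market_series: pd.DataFrame, target_frac: float = 0.2) -> list[str]:
    """Greedy sketch: rank markets by pre-period correlation with the
    national aggregate, then add the best-matched markets to the holdout
    until it covers roughly target_frac of total volume."""
    national = market_series.sum(axis=1)       # national weekly aggregate
    volume = market_series.sum(axis=0)         # total volume per market
    corr = market_series.corrwith(national)    # trend similarity per market
    holdout: list[str] = []
    covered = 0.0
    for m in corr.sort_values(ascending=False).index:
        if covered >= target_frac * volume.sum():
            break
        holdout.append(m)
        covered += volume[m]
    return holdout
```

With a 20% target this stops as soon as the selected markets cross 20% of volume, so the realized holdout lands in the 20–30% band the guidance above recommends.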
Synthetic control as the analysis method. Synthetic control (Abadie, Diamond, Hainmueller 2010) constructs the counterfactual treated market as a weighted combination of donor (holdout) markets:

Ŷ_1t = Σ_j w_j · Y_jt

where weights w_j ≥ 0, Σ_j w_j = 1, are chosen to minimize pre-period mean squared prediction error (MSPE):

MSPE = (1/T_pre) · Σ_{t=1..T_pre} (Y_1t − Σ_j w_j Y_jt)²

The treatment effect estimate at each post-period time t is τ̂_t = Y_1t − Ŷ_1t.
Why synthetic control instead of simple DiD. Difference-in-differences (DiD) assumes parallel trends — treated and control markets would have evolved identically without treatment. Synthetic control tests this assumption directly: if the pre-period fit is good (low MSPE), the counterfactual is credible. DiD makes an untestable assumption; SC makes it testable and visible.
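To see the difference on numbers, here is a made-up three-market example where the donors do not trend in parallel: DiD using the unweighted donor mean overstates the effect, while SC with the right convex weights recovers it (weights found by inspection here, by SLSQP in practice):

```python
import numpy as np

# Made-up weekly sign-ups: one treated market, two donors, 4 pre / 2 post weeks.
# The treated market is built as 0.25*donor1 + 0.75*donor2 plus a +15 post lift,
# so the true effect is 15 and the donors do NOT trend in parallel.
donor1 = np.array([100, 100, 100, 100, 100, 100])  # flat
donor2 = np.array([100, 120, 140, 160, 180, 200])  # rising
treat = 0.25 * donor1 + 0.75 * donor2
treat[4:] += 15                                    # campaign effect
pre, post = slice(0, 4), slice(4, 6)

# DiD with the unweighted donor mean as control: biased, because the
# mean trends more slowly than the treated market does.
control = (donor1 + donor2) / 2
did = (treat[post].mean() - treat[pre].mean()) \
    - (control[post].mean() - control[pre].mean())

# SC with the correct convex weights: the pre-period fit is exact,
# so the counterfactual tracks the treated trend.
w = np.array([0.25, 0.75])
synth = w[0] * donor1 + w[1] * donor2
sc_gap = (treat[post] - synth[post]).mean()
```

Here `did` evaluates to 30 — double the true lift of 15 — while `sc_gap` is exactly 15, because the SC pre-period fit is perfect and its residual is therefore a credible effect estimate.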
Permutation inference. With donor markets, you run the same SC optimization for each donor as if it were treated (placebo tests). The p-value is the fraction of donor markets with a larger post-period gap than the actual treated market:
This is valid even with 20 markets — no asymptotic approximations needed.
Walkthrough
Scenario: A streaming service runs an ad campaign in 6 treatment DMAs. 14 holdout DMAs form the donor pool. We want to estimate the causal effect on weekly sign-ups.
Step 1: Pre-period data. Collect 12 weeks of weekly sign-up data for all 20 markets before the campaign.
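As a sketch, the (weeks × markets) matrix can be built from an event-level sign-up log with a groupby-unstack; the column and market names here are hypothetical:

```python
import pandas as pd

# Hypothetical raw log: one row per sign-up, with DMA and week columns.
signups = pd.DataFrame({
    "dma":  ["NYC", "NYC", "CHI", "CHI", "SEA"],
    "week": ["2024-01-01", "2024-01-08", "2024-01-01", "2024-01-08", "2024-01-01"],
})

# Pivot to the (T_pre, n_markets) shape the analysis expects:
# one row per week, one column per DMA, cell = weekly sign-up count.
weekly = (
    signups.groupby(["week", "dma"]).size()
    .unstack("dma", fill_value=0)
    .sort_index()
)
```

`fill_value=0` matters: a market with no sign-ups in a week must contribute a zero, not a missing row, or the weight optimization in Step 2 will see misaligned series.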
Step 2: SC weight optimization. Find convex weights minimizing pre-period MSPE between the treatment aggregate and the weighted donor sum.
import numpy as np
from scipy.optimize import minimize

# pre_treat: (T_pre,) — treatment DMA aggregate weekly sign-ups
# pre_donor: (T_pre, J0) — donor DMA weekly sign-ups
def fit_sc_weights(pre_treat, pre_donor):
    J0 = pre_donor.shape[1]
    def loss(w):
        return np.sum((pre_treat - pre_donor @ w) ** 2)
    constraints = [{'type': 'eq', 'fun': lambda w: w.sum() - 1}]
    bounds = [(0, 1)] * J0
    w0 = np.ones(J0) / J0
    result = minimize(loss, w0, method='SLSQP', bounds=bounds, constraints=constraints)
    return result.x  # shape (J0,)

Step 3: Post-period effect. The synthetic control counterfactual in the post-period is post_donor @ w. The gap at each week is the treatment effect.
Step 4: Permutation p-value. Fit SC weights with each of the 14 donor markets as the "treated" unit. The p-value is the fraction with a larger post-period gap than the actual treatment market.
Pre-period fit check. If MSPE is more than 2× the median donor-market MSPE, the SC weights are unreliable — the treatment market is too unusual to approximate from the donor pool. Consider Augmented SC (Ben-Michael et al.) or Synthetic DiD (Arkhangelsky et al.).
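The whole walkthrough can be exercised end to end on simulated data. A self-contained sketch: it repeats a simplified weight fit so it runs on its own, uses a smaller 8-market donor pool than the scenario's 14 so the pre-period fit is exact, and every number is invented:

```python
import numpy as np
from scipy.optimize import minimize

def fit_sc_weights(pre_treat, pre_donor):
    """Convex SC weights, same approach as the Step 2 helper."""
    J0 = pre_donor.shape[1]
    res = minimize(
        lambda w: np.sum((pre_treat - pre_donor @ w) ** 2),
        np.ones(J0) / J0,
        method='SLSQP',
        bounds=[(0, 1)] * J0,
        constraints=[{'type': 'eq', 'fun': lambda w: w.sum() - 1}],
        options={'ftol': 1e-10, 'maxiter': 1000},
    )
    return res.x

rng = np.random.default_rng(42)
T_pre, T_post, J0 = 12, 6, 8

# Simulated weekly sign-ups: donors share a common upward trend with
# market-specific scale and noise; the treated market is an exact convex
# combination of donors plus a +10 campaign lift in the post-period.
trend = np.linspace(100, 130, T_pre + T_post)
scales = rng.uniform(0.9, 1.1, J0)
donors = trend[:, None] * scales + rng.normal(0, 2, (T_pre + T_post, J0))
true_w = rng.dirichlet(np.ones(J0))
treat = donors @ true_w
treat[T_pre:] += 10.0

# Step 3: fit weights on the pre-period, read off the post-period gap.
w = fit_sc_weights(treat[:T_pre], donors[:T_pre])
gap = treat[T_pre:] - donors[T_pre:] @ w
ate = gap.mean()

# Step 4: placebo runs — refit with each donor as the "treated" unit.
placebo = []
for j in range(J0):
    others = np.delete(np.arange(J0), j)
    w_j = fit_sc_weights(donors[:T_pre, j], donors[:T_pre][:, others])
    placebo.append((donors[T_pre:, j] - donors[T_pre:][:, others] @ w_j).mean())
pvalue = np.mean(np.abs(placebo) >= abs(ate))
```

With a real +10 lift baked in, `ate` should land near 10 and most placebo gaps should be much smaller, so `pvalue` comes out small; on data where the treated market's pre-period cannot be matched, the placebo gaps grow and the p-value loses meaning, which is exactly what the MSPE check above guards against.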
Analysis & Evaluation
Where your intuition breaks. You might assume that more treatment markets means more statistical power. With geo tests, this can backfire: each treatment market you add reduces your donor pool, making the synthetic control counterfactual worse. The optimal split is typically 20–35% treatment, 65–80% holdout, because you need many donors to construct a precise counterfactual for each treatment market.
| Design choice | Recommendation | Reason |
|---|---|---|
| Holdout fraction | 15–30% of total volume | Balance revenue cost vs. precision |
| Pre-period length | At least 2× the post-period | Enough history to validate SC fit |
| Minimum markets | 20 for permutation | Permutation test needs resolution |
| Block size | Entire DMA or city | Match spillover unit to randomization unit |
| Analysis | Synthetic control + permutation | Valid with small market counts |
Geo vs. time-based holdout. Geo holdouts work best when the treatment varies by location (ad spend, pricing). Time-based holdouts (switchback) work better when the treatment varies by time (surge pricing, algorithm changes).
When geo testing is overkill. If the intervention only affects individual user experience (UI changes, recommendation algorithms), user-level A/B testing is superior — far more statistical power with the same budget. Reserve geo tests for interventions with genuine geographic spillovers.
Pre-period MSPE check. Always report the ratio of treatment-market MSPE to median donor-market MSPE. A ratio above 2 means your synthetic control is unreliable. Report this metric transparently; do not just show the post-period gap.
Production-Ready Code
"""
Geo testing production pipeline.
Handles market selection scoring, SC weight optimization,
permutation inference, and MSPE diagnostics.
"""
from __future__ import annotations
from dataclasses import dataclass
import numpy as np
import pandas as pd
from scipy.optimize import minimize
from scipy.stats import pearsonr
from typing import Sequence
@dataclass
class SCResult:
    weights: np.ndarray        # shape (J0,) — donor weights
    pre_mspe: float            # treatment market pre-period MSPE
    donor_mspe_median: float   # median MSPE across placebo runs
    mspe_ratio: float          # pre_mspe / donor_mspe_median (< 2 = reliable)
    post_gap: np.ndarray       # shape (T_post,) — treatment effect per period
    pvalue: float              # permutation p-value (two-sided)
    ate: float                 # average treatment effect over post period
def score_markets(
    market_series: pd.DataFrame,  # (T_pre, n_markets) — weekly outcomes
    treatment_markets: Sequence[str],
    candidate_donors: Sequence[str],
) -> pd.DataFrame:
    """Score each candidate donor market on pre-period trend similarity."""
    treat_agg = market_series[list(treatment_markets)].mean(axis=1).values
    rows = []
    for m in candidate_donors:
        donor = market_series[m].values
        corr, _ = pearsonr(treat_agg, donor)
        rmse = np.sqrt(np.mean((treat_agg - donor) ** 2))
        score = corr - rmse / (treat_agg.mean() + 1e-9)
        rows.append({'market': m, 'correlation': corr, 'rmse': rmse, 'score': score})
    return pd.DataFrame(rows).sort_values('score', ascending=False)
def fit_sc_weights(
    pre_treat: np.ndarray,  # (T_pre,)
    pre_donor: np.ndarray,  # (T_pre, J0)
) -> np.ndarray:
    """Fit convex synthetic control weights via SLSQP."""
    J0 = pre_donor.shape[1]
    def loss(w):
        return float(np.sum((pre_treat - pre_donor @ w) ** 2))
    result = minimize(
        loss,
        x0=np.ones(J0) / J0,
        method='SLSQP',
        bounds=[(0, 1)] * J0,
        constraints=[{'type': 'eq', 'fun': lambda w: w.sum() - 1}],
        options={'ftol': 1e-12, 'maxiter': 2000},
    )
    if not result.success:
        raise RuntimeError(f"SC weight optimization failed: {result.message}")
    return result.x
def permutation_inference(
    pre_treat: np.ndarray,   # (T_pre,)
    post_treat: np.ndarray,  # (T_post,)
    pre_donor: np.ndarray,   # (T_pre, J0)
    post_donor: np.ndarray,  # (T_post, J0)
    donor_names: list[str],
) -> SCResult:
    """Run synthetic control with permutation p-value."""
    w = fit_sc_weights(pre_treat, pre_donor)
    synth_pre = pre_donor @ w
    synth_post = post_donor @ w
    post_gap = post_treat - synth_post
    pre_mspe = float(np.mean((pre_treat - synth_pre) ** 2))
    ate = float(post_gap.mean())
    placebo_ates: list[float] = []
    placebo_mspes: list[float] = []
    for j in range(len(donor_names)):
        other = [k for k in range(len(donor_names)) if k != j]
        if len(other) < 2:
            continue
        try:
            w_j = fit_sc_weights(pre_donor[:, j], pre_donor[:, other])
        except RuntimeError:
            continue
        synth_pre_j = pre_donor[:, other] @ w_j
        synth_post_j = post_donor[:, other] @ w_j
        placebo_ates.append(float((post_donor[:, j] - synth_post_j).mean()))
        placebo_mspes.append(float(np.mean((pre_donor[:, j] - synth_pre_j) ** 2)))
    donor_mspe_median = float(np.median(placebo_mspes)) if placebo_mspes else float('nan')
    mspe_ratio = pre_mspe / (donor_mspe_median + 1e-12)
    pvalue = float(np.mean(np.abs(placebo_ates) >= abs(ate))) if placebo_ates else float('nan')
    return SCResult(
        weights=w,
        pre_mspe=pre_mspe,
        donor_mspe_median=donor_mspe_median,
        mspe_ratio=mspe_ratio,
        post_gap=post_gap,
        pvalue=pvalue,
        ate=ate,
    )
def geo_experiment_report(result: SCResult, alpha: float = 0.05) -> dict:
    """Summarize geo test results with reliability diagnostics."""
    reliable = result.mspe_ratio < 2.0
    significant = result.pvalue < alpha
    return {
        'ate': round(result.ate, 4),
        'pvalue': round(result.pvalue, 4),
        'significant': significant,
        'reliable_fit': reliable,
        'mspe_ratio': round(result.mspe_ratio, 2),
        'warning': None if reliable else (
            f"MSPE ratio {result.mspe_ratio:.1f} > 2.0 — SC fit is poor; "
            "consider Augmented SC or Synthetic DiD."
        ),
    }