Geo Testing & Market Holdouts
Standard A/B tests assume that assigning user A to treatment doesn't affect user B's outcome. In marketing and pricing, that assumption breaks: a national ad campaign raises awareness everywhere, a discount floods supply chains, and a surge-pricing change shifts driver behavior across cities. When the intervention is geo-level rather than user-level, you need a geo test — one of the most rigorous (and expensive) experimental tools in the practitioner's toolkit.
Theory
Geo testing assigns entire geographic markets — cities, DMAs, states — to treatment or holdout rather than individual users. This restores the independence assumption, but introduces new challenges: you have far fewer units (20–100 markets vs. millions of users), and markets differ enormously in size, demographics, and baseline trends.
Why user-level randomization fails for geo interventions. The Stable Unit Treatment Value Assumption (SUTVA) requires that unit i's potential outcome Y_i depends only on its own assignment z_i, not on others' assignments z_{−i}. A national TV ad campaign violates this at the user level — if your ad plays in every market, there is no untreated user to compare against. The only valid randomization unit is the market itself.
Why it had to be this way. SUTVA makes the violation precise: Y_i(z_i, z_{−i}) ≠ Y_i(z_i) when geo-level spillovers exist. This is not a modeling choice — it is a statement about the data-generating process. Geo randomization restores SUTVA by making the treatment assignment unit match the spillover unit.
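A quick simulation makes the failure concrete. This is an illustrative sketch with made-up rates, not data from any real campaign: user-level randomization inside an exposed market measures roughly zero, while a geo-level comparison recovers the lift.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up numbers: a market-wide ad campaign lifts every user's weekly
# sign-up probability by 2 points, regardless of individual assignment.
base_rate, campaign_lift = 0.10, 0.02
n_users = 100_000

# User-level randomization inside a treated market: the spillover hits
# both arms, so the measured difference is ~0 even though the ad works.
assign = rng.random(n_users) < 0.5
outcomes = rng.random(n_users) < (base_rate + campaign_lift)
naive_effect = outcomes[assign].mean() - outcomes[~assign].mean()

# Geo-level randomization: treated vs. holdout markets differ by the true lift.
treated_market = rng.random(n_users) < (base_rate + campaign_lift)
holdout_market = rng.random(n_users) < base_rate
geo_effect = treated_market.mean() - holdout_market.mean()
```

The naive user-level estimate is noise around zero; the geo comparison lands near the true 0.02 lift.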
Market selection. With 20–100 markets, balance matters enormously. Markets are matched on pre-period trend similarity, population, seasonality, and demographic composition. The holdout should be 15–30% of total market volume — large enough to measure with precision, small enough to be worth the foregone revenue.
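Market matching can be sketched as a greedy pre-period screen. A hypothetical helper (`pick_holdout` is not a standard library function), assuming a (weeks × markets) DataFrame of outcomes; real selection would also balance demographics and seasonality:

```python
import numpy as np
import pandas as pd

def pick_holdout(market_series: pd.DataFrame, target_frac: float = 0.2) -> list[str]:
    """Greedy sketch: rank markets by pre-period correlation with the
    national aggregate, then add the best-matched markets to the holdout
    until it covers roughly target_frac of total volume."""
    national = market_series.sum(axis=1)       # national weekly aggregate
    volume = market_series.sum(axis=0)         # total volume per market
    corr = market_series.corrwith(national)    # trend similarity per market
    holdout: list[str] = []
    covered = 0.0
    for m in corr.sort_values(ascending=False).index:
        if covered >= target_frac * volume.sum():
            break
        holdout.append(m)
        covered += volume[m]
    return holdout
```

With a 20% target this stops as soon as the selected markets cross 20% of volume, so the realized holdout lands in the 20–30% band the guidance above recommends.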
Synthetic control as the analysis method. Synthetic control (Abadie, Diamond, Hainmueller 2010) constructs the counterfactual treated market as a weighted combination of donor (holdout) markets:

Ŷ_1t = Σ_j w_j · Y_jt

where weights w_j ≥ 0, Σ_j w_j = 1, are chosen to minimize pre-period mean squared prediction error (MSPE):

MSPE = (1/T_pre) · Σ_{t=1..T_pre} (Y_1t − Σ_j w_j Y_jt)²

The treatment effect estimate at each post-period time t is τ̂_t = Y_1t − Ŷ_1t.
Why synthetic control instead of simple DiD. Difference-in-differences (DiD) assumes parallel trends — treated and control markets would have evolved identically without treatment. Synthetic control tests this assumption directly: if the pre-period fit is good (low MSPE), the counterfactual is credible. DiD makes an untestable assumption; SC makes it testable and visible.
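To see the difference on numbers, here is a made-up three-market example where the donors do not trend in parallel: DiD using the unweighted donor mean overstates the effect, while SC with the right convex weights recovers it (weights found by inspection here, by SLSQP in practice):

```python
import numpy as np

# Made-up weekly sign-ups: one treated market, two donors, 4 pre / 2 post weeks.
# The treated market is built as 0.25*donor1 + 0.75*donor2 plus a +15 post lift,
# so the true effect is 15 and the donors do NOT trend in parallel.
donor1 = np.array([100, 100, 100, 100, 100, 100])  # flat
donor2 = np.array([100, 120, 140, 160, 180, 200])  # rising
treat = 0.25 * donor1 + 0.75 * donor2
treat[4:] += 15                                    # campaign effect
pre, post = slice(0, 4), slice(4, 6)

# DiD with the unweighted donor mean as control: biased, because the
# mean trends more slowly than the treated market does.
control = (donor1 + donor2) / 2
did = (treat[post].mean() - treat[pre].mean()) \
    - (control[post].mean() - control[pre].mean())

# SC with the correct convex weights: the pre-period fit is exact,
# so the counterfactual tracks the treated trend.
w = np.array([0.25, 0.75])
synth = w[0] * donor1 + w[1] * donor2
sc_gap = (treat[post] - synth[post]).mean()
```

Here `did` evaluates to 30 — double the true lift of 15 — while `sc_gap` is exactly 15, because the SC pre-period fit is perfect and its residual is therefore a credible effect estimate.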
Permutation inference. With donor markets, you run the same SC optimization for each donor as if it were treated (placebo tests). The p-value is the fraction of donor markets with a larger post-period gap than the actual treated market:
This is valid even with 20 markets — no asymptotic approximations needed.
Walkthrough
Scenario: A streaming service runs an ad campaign in 6 treatment DMAs. 14 holdout DMAs form the donor pool. We want to estimate the causal effect on weekly sign-ups.
Step 1: Pre-period data. Collect 12 weeks of weekly sign-up data for all 20 markets before the campaign.
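As a sketch, the (weeks × markets) matrix can be built from an event-level sign-up log with a groupby-unstack; the column and market names here are hypothetical:

```python
import pandas as pd

# Hypothetical raw log: one row per sign-up, with DMA and week columns.
signups = pd.DataFrame({
    "dma":  ["NYC", "NYC", "CHI", "CHI", "SEA"],
    "week": ["2024-01-01", "2024-01-08", "2024-01-01", "2024-01-08", "2024-01-01"],
})

# Pivot to the (T_pre, n_markets) shape the analysis expects:
# one row per week, one column per DMA, cell = weekly sign-up count.
weekly = (
    signups.groupby(["week", "dma"]).size()
    .unstack("dma", fill_value=0)
    .sort_index()
)
```

`fill_value=0` matters: a market with no sign-ups in a week must contribute a zero, not a missing row, or the weight optimization in Step 2 will see misaligned series.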
Step 2: SC weight optimization. Find convex weights minimizing pre-period MSPE between the treatment aggregate and the weighted donor sum.
import numpy as np
from scipy.optimize import minimize

# pre_treat: (T_pre,) — treatment DMA aggregate weekly sign-ups
# pre_donor: (T_pre, J0) — donor DMA weekly sign-ups
def fit_sc_weights(pre_treat, pre_donor):
    J0 = pre_donor.shape[1]
    def loss(w):
        return np.sum((pre_treat - pre_donor @ w) ** 2)
    constraints = [{'type': 'eq', 'fun': lambda w: w.sum() - 1}]
    bounds = [(0, 1)] * J0
    w0 = np.ones(J0) / J0
    result = minimize(loss, w0, method='SLSQP', bounds=bounds, constraints=constraints)
    return result.x  # shape (J0,)

Step 3: Post-period effect. The synthetic control counterfactual in the post-period is post_donor @ w. The gap at each week is the treatment effect.
Step 4: Permutation p-value. Fit SC weights with each of the 14 donor markets as the "treated" unit. The p-value is the fraction with a larger post-period gap than the actual treatment market.
Pre-period fit check. If MSPE is more than 2× the median donor-market MSPE, the SC weights are unreliable — the treatment market is too unusual to approximate from the donor pool. Consider Augmented SC (Ben-Michael et al.) or Synthetic DiD (Arkhangelsky et al.).
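The whole walkthrough can be exercised end to end on simulated data. A self-contained sketch: it repeats a simplified weight fit so it runs on its own, uses a smaller 8-market donor pool than the scenario's 14 so the pre-period fit is exact, and every number is invented:

```python
import numpy as np
from scipy.optimize import minimize

def fit_sc_weights(pre_treat, pre_donor):
    """Convex SC weights, same approach as the Step 2 helper."""
    J0 = pre_donor.shape[1]
    res = minimize(
        lambda w: np.sum((pre_treat - pre_donor @ w) ** 2),
        np.ones(J0) / J0,
        method='SLSQP',
        bounds=[(0, 1)] * J0,
        constraints=[{'type': 'eq', 'fun': lambda w: w.sum() - 1}],
        options={'ftol': 1e-10, 'maxiter': 1000},
    )
    return res.x

rng = np.random.default_rng(42)
T_pre, T_post, J0 = 12, 6, 8

# Simulated weekly sign-ups: donors share a common upward trend with
# market-specific scale and noise; the treated market is an exact convex
# combination of donors plus a +10 campaign lift in the post-period.
trend = np.linspace(100, 130, T_pre + T_post)
scales = rng.uniform(0.9, 1.1, J0)
donors = trend[:, None] * scales + rng.normal(0, 2, (T_pre + T_post, J0))
true_w = rng.dirichlet(np.ones(J0))
treat = donors @ true_w
treat[T_pre:] += 10.0

# Step 3: fit weights on the pre-period, read off the post-period gap.
w = fit_sc_weights(treat[:T_pre], donors[:T_pre])
gap = treat[T_pre:] - donors[T_pre:] @ w
ate = gap.mean()

# Step 4: placebo runs — refit with each donor as the "treated" unit.
placebo = []
for j in range(J0):
    others = np.delete(np.arange(J0), j)
    w_j = fit_sc_weights(donors[:T_pre, j], donors[:T_pre][:, others])
    placebo.append((donors[T_pre:, j] - donors[T_pre:][:, others] @ w_j).mean())
pvalue = np.mean(np.abs(placebo) >= abs(ate))
```

With a real +10 lift baked in, `ate` should land near 10 and most placebo gaps should be much smaller, so `pvalue` comes out small; on data where the treated market's pre-period cannot be matched, the placebo gaps grow and the p-value loses meaning, which is exactly what the MSPE check above guards against.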
Analysis & Evaluation
Where your intuition breaks. You might assume that more treatment markets means more statistical power. With geo tests, this can backfire: each treatment market you add reduces your donor pool, making the synthetic control counterfactual worse. The optimal split is typically 20–35% treatment, 65–80% holdout, because you need many donors to construct a precise counterfactual for each treatment market.
| Design choice | Recommendation | Reason |
|---|---|---|
| Holdout fraction | 15–30% of total volume | Balance revenue cost vs. precision |
| Pre-period length | At least 2× the post-period | Enough history to validate SC fit |
| Minimum markets | 20 for permutation | Permutation test needs resolution |
| Block size | Entire DMA or city | Match spillover unit to randomization unit |
| Analysis | Synthetic control + permutation | Valid with small market counts |
Geo vs. time-based holdout. Geo holdouts work best when the treatment varies by location (ad spend, pricing). Time-based holdouts (switchback) work better when the treatment varies by time (surge pricing, algorithm changes).
When geo testing is overkill. If the intervention only affects individual user experience (UI changes, recommendation algorithms), user-level A/B testing is superior — far more statistical power with the same budget. Reserve geo tests for interventions with genuine geographic spillovers.
Pre-period MSPE check. Always report the ratio of treatment-market MSPE to median donor-market MSPE. A ratio above 2 means your synthetic control is unreliable. Report this metric transparently; do not just show the post-period gap.
Production-Ready Code
"""
Geo testing production pipeline.
Handles market selection scoring, SC weight optimization,
permutation inference, and MSPE diagnostics.
"""
from __future__ import annotations
from dataclasses import dataclass
import numpy as np
import pandas as pd
from scipy.optimize import minimize
from scipy.stats import pearsonr
from typing import Sequence
@dataclass
class SCResult:
    weights: np.ndarray        # shape (J0,) — donor weights
    pre_mspe: float            # treatment market pre-period MSPE
    donor_mspe_median: float   # median MSPE across placebo runs
    mspe_ratio: float          # pre_mspe / donor_mspe_median (< 2 = reliable)
    post_gap: np.ndarray       # shape (T_post,) — treatment effect per period
    pvalue: float              # permutation p-value (two-sided)
    ate: float                 # average treatment effect over post period
def score_markets(
    market_series: pd.DataFrame,  # (T_pre, n_markets) — weekly outcomes
    treatment_markets: Sequence[str],
    candidate_donors: Sequence[str],
) -> pd.DataFrame:
    """Score each candidate donor market on pre-period trend similarity."""
    treat_agg = market_series[list(treatment_markets)].mean(axis=1).values
    rows = []
    for m in candidate_donors:
        donor = market_series[m].values
        corr, _ = pearsonr(treat_agg, donor)
        rmse = np.sqrt(np.mean((treat_agg - donor) ** 2))
        score = corr - rmse / (treat_agg.mean() + 1e-9)
        rows.append({'market': m, 'correlation': corr, 'rmse': rmse, 'score': score})
    return pd.DataFrame(rows).sort_values('score', ascending=False)
def fit_sc_weights(
    pre_treat: np.ndarray,  # (T_pre,)
    pre_donor: np.ndarray,  # (T_pre, J0)
) -> np.ndarray:
    """Fit convex synthetic control weights via SLSQP."""
    J0 = pre_donor.shape[1]
    def loss(w):
        return float(np.sum((pre_treat - pre_donor @ w) ** 2))
    result = minimize(
        loss,
        x0=np.ones(J0) / J0,
        method='SLSQP',
        bounds=[(0, 1)] * J0,
        constraints=[{'type': 'eq', 'fun': lambda w: w.sum() - 1}],
        options={'ftol': 1e-12, 'maxiter': 2000},
    )
    if not result.success:
        raise RuntimeError(f"SC weight optimization failed: {result.message}")
    return result.x
def permutation_inference(
    pre_treat: np.ndarray,   # (T_pre,)
    post_treat: np.ndarray,  # (T_post,)
    pre_donor: np.ndarray,   # (T_pre, J0)
    post_donor: np.ndarray,  # (T_post, J0)
    donor_names: list[str],
) -> SCResult:
    """Run synthetic control with permutation p-value."""
    w = fit_sc_weights(pre_treat, pre_donor)
    synth_pre = pre_donor @ w
    synth_post = post_donor @ w
    post_gap = post_treat - synth_post
    pre_mspe = float(np.mean((pre_treat - synth_pre) ** 2))
    ate = float(post_gap.mean())
    placebo_ates: list[float] = []
    placebo_mspes: list[float] = []
    for j in range(len(donor_names)):
        other = [k for k in range(len(donor_names)) if k != j]
        if len(other) < 2:
            continue
        try:
            w_j = fit_sc_weights(pre_donor[:, j], pre_donor[:, other])
        except RuntimeError:
            continue
        synth_pre_j = pre_donor[:, other] @ w_j
        synth_post_j = post_donor[:, other] @ w_j
        placebo_ates.append(float((post_donor[:, j] - synth_post_j).mean()))
        placebo_mspes.append(float(np.mean((pre_donor[:, j] - synth_pre_j) ** 2)))
    donor_mspe_median = float(np.median(placebo_mspes)) if placebo_mspes else float('nan')
    mspe_ratio = pre_mspe / (donor_mspe_median + 1e-12)
    pvalue = float(np.mean(np.abs(placebo_ates) >= abs(ate))) if placebo_ates else float('nan')
    return SCResult(
        weights=w,
        pre_mspe=pre_mspe,
        donor_mspe_median=donor_mspe_median,
        mspe_ratio=mspe_ratio,
        post_gap=post_gap,
        pvalue=pvalue,
        ate=ate,
    )
def geo_experiment_report(result: SCResult, alpha: float = 0.05) -> dict:
    """Summarize geo test results with reliability diagnostics."""
    reliable = result.mspe_ratio < 2.0
    significant = result.pvalue < alpha
    return {
        'ate': round(result.ate, 4),
        'pvalue': round(result.pvalue, 4),
        'significant': significant,
        'reliable_fit': reliable,
        'mspe_ratio': round(result.mspe_ratio, 2),
        'warning': None if reliable else (
            f"MSPE ratio {result.mspe_ratio:.1f} > 2.0 — SC fit is poor; "
            "consider Augmented SC or Synthetic DiD."
        ),
    }