Adaptive Experiments & Bandits
Fixed-horizon A/B tests are clean but rigid: you commit to a sample size, run to completion, and only then decide. Many real-world settings offer a better option — sequential interaction where you allocate traffic across options and learn their quality in real time. A recommendation system that can't explore new content dies; a pricing experiment that wastes 80% of traffic on a clearly losing variant isn't just statistically inefficient, it costs real revenue. Multi-armed bandits formalize this exploration-exploitation tradeoff and give principled guarantees on how much you lose relative to an oracle that always picks the best option. Understanding when to use bandits versus fixed A/B tests, and which bandit algorithm to deploy, is a core skill for any engineer running online experiments.
Theory
UCB1 and Thompson achieve O(√(KT log T)) regret vs O(T) for random · Thompson ≈ Lai-Robbins lower bound empirically
Every time a recommendation system shows you a video you've never seen, it's gambling: either you'll like it (valuable signal) or you won't (wasted impression). Multi-armed bandits formalize this as a sequential decision problem — show the variant that seems best, but not so exclusively that you stop learning about alternatives. The diagram above shows how three algorithms handle this tradeoff: Thompson Sampling and UCB1 quickly identify the best arm and concentrate pulls on it, while ε-greedy with fixed ε wastes traffic on random exploration indefinitely.
The bandit problem
The K-armed bandit has K arms (variants, actions) with unknown reward distributions. At each step t = 1, …, T:
- Choose arm a_t ∈ {1, …, K}
- Observe reward r_t (e.g., Bernoulli with parameter μ_{a_t})
Goal: maximize cumulative reward Σ_{t=1}^T r_t, or equivalently minimize cumulative regret:
R(T) = T·μ* − E[Σ_{t=1}^T μ_{a_t}]
where μ* = max_k μ_k is the best arm's mean reward. Regret measures the cumulative cost of not always playing the optimal arm.
The regret formula measures performance against an oracle that always plays the best arm — a standard no online algorithm can match, because the best arm is unknown. Regret is the right loss function because it penalizes both pulling a suboptimal arm too many times (exploiting a wrong belief) and pulling the optimal arm too few times (insufficient exploration). Any algorithm that achieves sublinear regret — growing slower than T — is eventually spending most of its pulls on the best arm; the question is only how quickly it gets there.
Key property: R(T) = Σ_k Δ_k · E[N_k(T)], where Δ_k = μ* − μ_k and N_k(T) is the number of pulls of arm k. Reducing regret means pulling suboptimal arms fewer times.
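The decomposition can be checked numerically. A quick sketch (the arm means and pull counts are illustrative, not from any real run):

```python
import numpy as np

# Illustrative 3-armed bandit: true means and hypothetical pull counts
mu = np.array([0.3, 0.5, 0.7])       # arm means (mu* = 0.7)
pulls = np.array([40, 60, 900])      # N_k(T) after T = 1000 steps
gaps = mu.max() - mu                 # Delta_k = mu* - mu_k

# Regret two ways: oracle reward minus expected collected reward,
# and the gap-weighted pull-count decomposition
regret_direct = pulls.sum() * mu.max() - (pulls * mu).sum()
regret_decomp = (gaps * pulls).sum()
print(regret_direct, regret_decomp)  # both 28.0
```

Both expressions agree, which is why regret analysis reduces to bounding E[N_k(T)] for each suboptimal arm.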
Lai-Robbins lower bound
No algorithm can achieve better than logarithmic regret asymptotically. For any consistent algorithm:
liminf_{T→∞} R(T) / ln T ≥ Σ_{k: Δ_k > 0} Δ_k / KL(μ_k, μ*)
where KL(p, q) is the KL divergence between Bernoulli(p) and Bernoulli(q). This means Ω(ln T) regret is unavoidable — you must explore to reduce uncertainty, and that exploration has a cost. The question is how close to this bound your algorithm gets.
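The bound's constant can be computed for a concrete instance (the arm means below are illustrative):

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

mu = np.array([0.3, 0.5, 0.7])
mu_star = mu.max()

# Lai-Robbins: liminf R(T)/ln T >= sum over suboptimal arms of
# Delta_k / KL(mu_k, mu*)
c = sum((mu_star - m) / kl_bernoulli(m, mu_star) for m in mu if m < mu_star)
print(round(c, 2))   # asymptotic regret per unit of ln T for this instance
```

Note the near-optimal arm (0.5) dominates the constant: arms that are close to the best are the expensive ones to distinguish.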
Exploration strategies
ε-greedy: With probability ε, pull a random arm; with probability 1−ε, pull the empirically best arm argmax_k μ̂_k.
Regret with fixed ε: R(T) = Θ(εT) — linear in T. Linear regret is bad. Fix: use a decaying schedule ε_t = min(1, cK/(Δ²t)), which gives O(ln T) regret but requires knowing the gap Δ (unknown in practice).
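A sketch of one standard decaying schedule (the constant c and the gap Δ are illustrative; Δ is exactly the quantity you don't know in practice):

```python
def epsilon_schedule(t: int, K: int, delta: float, c: float = 5.0) -> float:
    """Decaying exploration rate eps_t = min(1, c*K / (delta^2 * t)).
    delta is the gap between the best and second-best arm."""
    return min(1.0, c * K / (delta ** 2 * t))

# eps stays at 1 early, then decays like 1/t, so the expected number of
# exploration pulls grows only like ln T
for t in (1, 100, 1_000, 10_000):
    print(t, round(epsilon_schedule(t, K=3, delta=0.2), 4))
```

With a small gap the schedule explores fully for a long warm-up period before decaying, which is the price of distinguishing close arms.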
UCB1 (Upper Confidence Bound): Choose the arm that maximizes an optimistic upper bound on its true mean:
a_t = argmax_k [ μ̂_k + √(2 ln t / N_k(t)) ]
The second term is a confidence bonus that shrinks as arm k is pulled more. UCB1 achieves:
R(T) ≤ Σ_{k: Δ_k > 0} (8 ln T / Δ_k) + O(1)
This is O(ln T) — matching the Lai-Robbins lower bound up to constants.
Thompson Sampling: Maintain a posterior Beta(α_k, β_k) over each arm's reward parameter. At each step:
- Sample θ_k ~ Beta(α_k, β_k) for each arm
- Pull arm a_t = argmax_k θ_k
- Update posterior: if reward = 1, α_{a_t} ← α_{a_t} + 1; if reward = 0, β_{a_t} ← β_{a_t} + 1
Thompson Sampling is Bayes-optimal under correct priors and empirically achieves regret very close to the Lai-Robbins lower bound. Unlike UCB, it requires no explicit exploration parameter — randomness in posterior sampling handles exploration naturally.
The cumulative regret comparison above shows the practical difference: with K=3 arms and 200 steps, Thompson and UCB1 achieve ~6x lower regret than random, while ε-greedy with fixed ε falls in between.
Bayesian A/B testing
Instead of a fixed-horizon test with a p-value threshold, a Bayesian A/B test maintains posteriors over conversion rates and reports expected loss:
E[loss(ship A)] = E[max(θ_B − θ_A, 0)]
You stop when expected loss drops below a threshold (e.g., 0.1% lift). This is equivalent to a bandit with two arms and a stopping rule — it allows peeking and early stopping without the multiple-comparisons inflation that breaks frequentist tests.
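The expected-loss stopping rule is easy to estimate by Monte Carlo. A minimal sketch, assuming Beta(1,1) priors (the conversion counts and 0.1% threshold are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_loss_ship_b(succ_a, fail_a, succ_b, fail_b, n=200_000):
    """Monte Carlo estimate of E[max(theta_A - theta_B, 0)]: the conversion
    rate we expect to give up by shipping B if A was actually better.
    Assumes Beta(1,1) priors on both conversion rates."""
    theta_a = rng.beta(succ_a + 1, fail_a + 1, n)
    theta_b = rng.beta(succ_b + 1, fail_b + 1, n)
    return float(np.maximum(theta_a - theta_b, 0).mean())

# Hypothetical data: A converts 120/1000, B converts 140/1000
loss = expected_loss_ship_b(120, 880, 140, 860)
print(f"expected loss of shipping B: {loss:.4%}")
print("ship B" if loss < 0.001 else "keep collecting data")
```

Because the loss is an expectation over the joint posterior, checking it repeatedly as data arrives does not inflate error the way repeated p-value peeking does.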
Contextual bandits
Standard bandits learn one mean per arm globally. Contextual bandits use a context vector x_t (user features, time of day, etc.) to predict arm rewards and adapt per-user:
LinUCB: Assume linear reward model E[r_t | x_t, a] = θ_a^⊤ x_t. Maintain per-arm ridge regression estimates θ̂_a = A_a^{−1} b_a (with A_a = X_a^⊤ X_a + I and b_a = X_a^⊤ y_a) and confidence ellipsoids:
UCB_a(x) = θ̂_a^⊤ x + α √(x^⊤ A_a^{−1} x)
The exploration bonus √(x^⊤ A_a^{−1} x) inflates in directions of feature-space uncertainty. LinUCB is widely used in news recommendation (Li et al., 2010) and achieves Õ(√(dT)) contextual regret.
Neural/nonparametric contextual bandits: Replace the linear model with a neural network; use dropout or Laplace approximation for uncertainty quantification. Less theoretically clean but more powerful in practice.
Switchback experiments and interference
Standard A/B tests assume SUTVA (Stable Unit Treatment Value Assumption): user i's outcome is unaffected by other users' assignments. This breaks in:
- Marketplaces: showing User A a price affects supply available to User B
- Social platforms: viral effects between friends
- Ride-sharing: changing dispatch for one driver affects wait times for all
Switchback experiments: alternate treatment/control assignment across time windows (not across users). Each time window is independently randomized. Assumes treatment effects decay within the window length (requires domain knowledge about carryover).
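A switchback assignment can be as simple as hashing the time-window index. A sketch (the salt and window length are hypothetical):

```python
import hashlib

def switchback_assignment(timestamp_s: int, window_minutes: int = 60,
                          salt: str = "expt-42") -> str:
    """Assign an entire time window to treatment or control.
    Every request in the same window gets the same arm, so within-window
    interference is absorbed into the window-level outcome."""
    window_id = timestamp_s // (window_minutes * 60)
    digest = hashlib.sha256(f"{salt}:{window_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# All requests in the same hour share one assignment
assert switchback_assignment(3600) == switchback_assignment(3601)
```

Hashing a salted window id (rather than alternating deterministically) keeps assignment unpredictable while remaining reproducible for analysis.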
Cluster randomization: Randomize at the level of geographic or social clusters rather than individual users, to contain interference within clusters.
Walkthrough
Implementing ε-greedy, UCB1, and Thompson Sampling
```python
import numpy as np
from scipy.stats import beta as beta_dist

class Bandit:
    """K-armed Bernoulli bandit with fixed true probabilities."""
    def __init__(self, probs):
        self.probs = probs
        self.K = len(probs)
        self.best_mean = max(probs)

    def pull(self, arm):
        return float(np.random.random() < self.probs[arm])

class EpsilonGreedy:
    def __init__(self, K, epsilon=0.1):
        self.K = K
        self.epsilon = epsilon
        self.counts = np.zeros(K)
        self.means = np.zeros(K)

    def select(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.K)
        return np.argmax(self.means)

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean update
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

class UCB1:
    def __init__(self, K):
        self.K = K
        self.counts = np.zeros(K)
        self.means = np.zeros(K)
        self.t = 0

    def select(self):
        # Pull each arm once first
        if self.t < self.K:
            return self.t
        bonus = np.sqrt(2 * np.log(self.t) / self.counts)
        return np.argmax(self.means + bonus)

    def update(self, arm, reward):
        self.t += 1
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

class ThompsonSampling:
    def __init__(self, K):
        self.K = K
        self.alpha = np.ones(K)  # successes + 1
        self.beta = np.ones(K)   # failures + 1

    def select(self):
        samples = beta_dist.rvs(self.alpha, self.beta)
        return np.argmax(samples)

    def update(self, arm, reward):
        self.alpha[arm] += reward
        self.beta[arm] += (1 - reward)

def simulate(bandit, agent, T=1000):
    regret = 0.0
    cumulative_regrets = []
    for _ in range(T):
        arm = agent.select()
        reward = bandit.pull(arm)
        agent.update(arm, reward)
        regret += bandit.best_mean - bandit.probs[arm]
        cumulative_regrets.append(regret)
    return cumulative_regrets

# Compare strategies
probs = [0.3, 0.5, 0.7]
bandit = Bandit(probs)
results = {
    'epsilon_greedy': simulate(bandit, EpsilonGreedy(3, epsilon=0.1)),
    'ucb1': simulate(bandit, UCB1(3)),
    'thompson': simulate(bandit, ThompsonSampling(3)),
}
for name, regrets in results.items():
    print(f'{name:20s}: final regret = {regrets[-1]:.1f}')
```
LinUCB for contextual bandits
```python
class LinUCB:
    """Linear contextual bandit using ridge regression + UCB exploration."""
    def __init__(self, K, d, alpha=1.0):
        self.K = K
        self.alpha = alpha
        # Per-arm: A = X^T X + I, b = X^T y
        self.A = [np.eye(d) for _ in range(K)]
        self.b = [np.zeros(d) for _ in range(K)]

    def select(self, context):
        x = context
        ucbs = []
        for a in range(self.K):
            theta = np.linalg.solve(self.A[a], self.b[a])
            uncertainty = np.sqrt(x @ np.linalg.solve(self.A[a], x))
            ucbs.append(float(theta @ x + self.alpha * uncertainty))
        return int(np.argmax(ucbs))

    def update(self, arm, context, reward):
        x = context
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Usage: create features from user context, call select/update at each request
```
Analysis & Evaluation
Where Your Intuition Breaks
"Bandits are always better than A/B tests because they waste less traffic on losing variants." Tempting, but wrong. Bandits make causal inference harder: as the bandit adapts, arm-assignment probabilities become correlated with unobserved user characteristics, and standard z-tests on bandit data are biased. Fixed A/B tests give you a clean, unconfounded causal estimate of the treatment effect — bandits give you lower regret but a noisier and potentially biased effect estimate. For feature launches that need a defensible causal estimate (regulatory, safety, detecting long-term harm), fixed A/B tests remain the correct choice.
Exploration strategy comparison
| Strategy | Regret | Tuning | Key assumption | Best when |
|---|---|---|---|---|
| Random | O(T) | None | — | Baseline only |
| ε-greedy (fixed ε) | O(εT) | ε | Stationary rewards | Simple baseline |
| ε-greedy (decaying) | O(ln T) | c, Δ | Needs gap estimate | When Δ is known |
| UCB1 | O(√(KT ln T)) | None | Sub-Gaussian rewards | Default choice |
| Thompson Sampling | ≈ Lai-Robbins | Prior | Prior specification | Best empirical perf |
| LinUCB | O(√(dT ln T)) | α | Linear reward model | Context available |
Bandits vs fixed A/B tests
| Criterion | Fixed A/B test | Bandit |
|---|---|---|
| Causal inference | Clean (randomized, SUTVA) | Confounded by adaptive assignment |
| Carryover / novelty | Controlled (fixed horizon) | Harder to disentangle |
| Long-horizon effects | Easily measured | Harder (exploitation converges fast) |
| Traffic efficiency | May waste traffic on loser | Allocates more to winner |
| Analysis complexity | Simple z-test | Requires bandit-aware inference |
| Best fit | Feature launches, compliance | Recommendation, pricing, ads, personalization |
Practical pitfalls
Cold start: UCB must pull every arm at least once (the confidence bonus is infinite for an unpulled arm) before exploiting — fine for K=3, problematic for K=10,000. Solution: contextual bandits or prior knowledge to warm-start.
Non-stationarity: Rewards drift over time (seasonality, feature interactions). Standard UCB/Thompson assume stationarity. Use discounted Thompson Sampling or sliding-window UCB.
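Sliding-window UCB can be sketched by keeping only the most recent rewards per arm (window size illustrative):

```python
from collections import deque
import numpy as np

class SlidingWindowUCB:
    """UCB1 over only the last `window` rewards per arm, so estimates
    track drifting reward distributions instead of averaging all history."""
    def __init__(self, K: int, window: int = 500):
        self.K = K
        self.recent = [deque(maxlen=window) for _ in range(K)]
        self.t = 0

    def select(self) -> int:
        self.t += 1
        for k in range(self.K):        # pull any arm with no recent data
            if not self.recent[k]:
                return k
        ucbs = [
            np.mean(self.recent[k])
            + np.sqrt(2 * np.log(self.t) / len(self.recent[k]))
            for k in range(self.K)
        ]
        return int(np.argmax(ucbs))

    def update(self, arm: int, reward: float) -> None:
        self.recent[arm].append(reward)  # oldest reward drops out at capacity
```

The deque's `maxlen` does the forgetting automatically; discounted variants (exponentially down-weighting old rewards) are the smoother alternative.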
Delayed feedback: Conversion events arrive hours or days later. Batch updates with a fixed lag, or use pessimistic initialization until delayed feedback arrives.
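Batched lagged updates can be sketched with a simple buffer (step-count lag for illustration; real systems key on event timestamps):

```python
from collections import deque

class DelayedRewardBuffer:
    """Hold (arm, reward) events until `lag` steps have passed, then release
    them for bandit.update(), so the posterior never reflects
    partially-observed conversions."""
    def __init__(self, lag: int):
        self.lag = lag
        self.pending: deque = deque()  # (due_step, arm, reward)
        self.step = 0

    def record(self, arm: int, reward: float) -> None:
        self.pending.append((self.step + self.lag, arm, reward))

    def flush(self) -> list:
        """Advance one step; return the updates whose lag has elapsed."""
        self.step += 1
        ready = []
        while self.pending and self.pending[0][0] <= self.step:
            _, arm, reward = self.pending.popleft()
            ready.append((arm, reward))
        return ready
```

Each `flush()` output is fed to the bandit's `update()`; until the lag elapses, the bandit simply keeps acting on its last confirmed posterior.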
Multiple metrics: Bandits optimize a single scalar. If you need to balance CTR and revenue, combine into a composite reward — or use constrained bandits.
Peeking in A/B tests: Running a Bayesian bandit isn't the same as repeatedly peeking at a frequentist p-value. Bayesian expected-loss stopping is valid; p-value peeking without correction inflates Type I error.
Production-Ready Code
Production bandits require three things beyond the algorithm: durable state persistence (so the prior survives restarts), a reward logging pipeline, and a non-stationarity detector that resets priors when the reward distribution shifts. The framework below separates selection (read-path) from update (write-path) so both can scale independently.
```python
# production_bandit.py
# Thompson Sampling bandit with durable state, reward logging, and
# non-stationarity detection via Kolmogorov-Smirnov test.
from __future__ import annotations

from dataclasses import dataclass, field

import numpy as np
from scipy.stats import ks_2samp

# ── Beta-Bernoulli Thompson Sampling ──────────────────────────────────────────
@dataclass
class ThompsonSamplingBandit:
    """
    Beta-Bernoulli Thompson Sampling for binary rewards (click, conversion).
    alpha_[k] = successes + 1, beta_[k] = failures + 1 (Beta(1,1) prior).
    Guard .update() with a lock in concurrent environments.
    Serialise state to DB after every update for crash recovery.
    """
    arms: list[str]
    alpha_: np.ndarray = field(init=False)
    beta_: np.ndarray = field(init=False)
    pulls: np.ndarray = field(init=False)

    def __post_init__(self) -> None:
        k = len(self.arms)
        self.alpha_ = np.ones(k)
        self.beta_ = np.ones(k)
        self.pulls = np.zeros(k, dtype=int)

    def select(self) -> str:
        samples = np.random.beta(self.alpha_, self.beta_)
        return self.arms[int(np.argmax(samples))]

    def update(self, arm: str, reward: float) -> None:
        """reward in {0, 1} for Bernoulli; clip to [0, 1] for rate metrics."""
        idx = self.arms.index(arm)
        self.pulls[idx] += 1
        self.alpha_[idx] += reward
        self.beta_[idx] += 1.0 - reward

    def to_dict(self) -> dict:
        return {
            "arms": self.arms,
            "alpha": self.alpha_.tolist(),
            "beta": self.beta_.tolist(),
            "pulls": self.pulls.tolist(),
        }

    @classmethod
    def from_dict(cls, state: dict) -> "ThompsonSamplingBandit":
        bandit = cls(arms=state["arms"])
        bandit.alpha_ = np.array(state["alpha"])
        bandit.beta_ = np.array(state["beta"])
        bandit.pulls = np.array(state["pulls"], dtype=int)
        return bandit

    def stats(self) -> list[dict]:
        total = self.pulls.sum()
        return [
            {
                "arm": arm,
                "pulls": int(self.pulls[i]),
                "traffic_share_pct": round(self.pulls[i] / max(total, 1) * 100, 1),
                "estimated_rate": round(
                    self.alpha_[i] / (self.alpha_[i] + self.beta_[i]), 4
                ),
            }
            for i, arm in enumerate(self.arms)
        ]

# ── Non-stationarity detection ────────────────────────────────────────────────
def detect_reward_shift(
    reward_log: list[tuple[str, float]],  # [(arm, reward), ...]
    arm: str,
    window: int = 500,
    significance: float = 0.05,
) -> dict:
    """
    KS test comparing recent vs. historical rewards for one arm.
    Call periodically (e.g., every 1k events). Reset priors if shift detected.
    """
    arm_rewards = [r for a, r in reward_log if a == arm]
    if len(arm_rewards) < window * 2:
        return {"arm": arm, "shift_detected": False, "reason": "insufficient data"}
    historical = np.array(arm_rewards[:-window])
    recent = np.array(arm_rewards[-window:])
    stat, p = ks_2samp(historical, recent)
    return {
        "arm": arm,
        "historical_mean": round(float(historical.mean()), 4),
        "recent_mean": round(float(recent.mean()), 4),
        "ks_stat": round(float(stat), 4),
        "p_value": round(float(p), 6),
        "shift_detected": p < significance,
        "action": "RESET priors to Beta(1,1)" if p < significance else "OK",
    }

# ── A/B vs. bandit decision framework ─────────────────────────────────────────
def recommend_approach(
    n_variants: int,
    total_traffic: int,
    inference_required: bool,
    regret_sensitive: bool,
) -> str:
    if inference_required:
        return "ab_test — fixed randomisation required for causal inference"
    if n_variants > 10:
        return "bandit — too many arms for A/B to be powered within reasonable time"
    if total_traffic < 5_000:
        return "ab_test — too little traffic for bandit to converge meaningfully"
    if regret_sensitive and total_traffic > 50_000:
        return "bandit — large traffic; regret savings justify adaptive allocation"
    return "ab_test — default; use bandit only with explicit regret justification"

# ── Example ───────────────────────────────────────────────────────────────────
rng = np.random.default_rng(42)
bandit = ThompsonSamplingBandit(arms=["control", "v1", "v2"])
true_rates = {"control": 0.45, "v1": 0.47, "v2": 0.43}
reward_log: list[tuple[str, float]] = []
for _ in range(2_000):
    arm = bandit.select()
    reward = float(rng.binomial(1, true_rates[arm]))
    bandit.update(arm, reward)
    reward_log.append((arm, reward))

for s in bandit.stats():
    print(s)
print(detect_reward_shift(reward_log, "v1", window=200))
print(recommend_approach(3, 50_000, inference_required=False, regret_sensitive=True))
```