Neural-Path/Notes
35 min

Adaptive Experiments & Bandits

Fixed-horizon A/B tests are clean but rigid: you commit to a sample size, run to completion, and only then decide. Many real-world settings offer a better option — sequential interaction where you allocate traffic across options and learn their quality in real time. A recommendation system that can't explore new content dies; a pricing experiment that wastes 80% of traffic on a clearly losing variant isn't just statistically inefficient, it costs real revenue. Multi-armed bandits formalize this exploration-exploitation tradeoff and give principled guarantees on how much you lose relative to an oracle that always picks the best option. Understanding when to use bandits versus fixed A/B tests, and which bandit algorithm to deploy, is a core skill for any engineer running online experiments.

Theory

Cumulative regret — 3-arm bandit, arms p = 0.3 / 0.5 / 0.7
[figure: cumulative regret R(t) over t = 0…200 steps]
  • Random: 40.6 @ T=200
  • ε-greedy (ε=0.1): 14.3 @ T=200
  • UCB1: 7.3 @ T=200
  • Thompson: 5.9 @ T=200

UCB1 and Thompson achieve O(√(KT log T)) regret vs O(T) for random · Thompson ≈ Lai-Robbins lower bound empirically

Every time a recommendation system shows you a video you've never seen, it's gambling: either you'll like it (valuable signal) or you won't (wasted impression). Multi-armed bandits formalize this as a sequential decision problem — show the variant that seems best, but not so exclusively that you stop learning about alternatives. The diagram above shows how three algorithms handle this tradeoff: Thompson Sampling and UCB1 quickly identify the best arm and concentrate pulls on it, while ε-greedy with fixed ε wastes traffic on random exploration indefinitely.

The bandit problem

The K-armed bandit has K arms (variants, actions) with unknown reward distributions. At each step t = 1, …, T:

  1. Choose arm a_t ∈ {1, …, K}
  2. Observe reward r_t ~ ν_{a_t} (e.g., Bernoulli with parameter μ_k)

Goal: maximize cumulative reward Σ_t r_t, or equivalently minimize cumulative regret:

R(T) = T·μ* − Σ_{t=1}^{T} μ_{a_t}

where μ* = max_k μ_k is the best arm's mean reward. Regret measures the cumulative cost of not always playing the optimal arm.

The regret formula measures performance against an oracle that always plays the best arm — a standard no online algorithm can match, because the best arm is unknown. Regret is the right loss function because it penalizes both pulling a suboptimal arm too many times (exploiting a wrong belief) and pulling the optimal arm too few times (insufficient exploration). Any algorithm that achieves sublinear regret — growing slower than T — eventually spends most of its pulls on the best arm; the question is only how quickly it gets there.

Key property: R(T) = Σ_{k ≠ k*} Δ_k · E[N_k(T)], where Δ_k = μ* − μ_k and N_k(T) is the number of pulls of arm k. Reducing regret means pulling suboptimal arms fewer times.
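The decomposition is easy to sanity-check numerically. A minimal sketch using the 3-arm example above, with hypothetical pull counts (any plausible allocation over T = 200 steps would do):

```python
import numpy as np

# Sanity check of R(T) = sum_{k != k*} Delta_k * E[N_k(T)] for the 3-arm
# example (p = 0.3 / 0.5 / 0.7). The pull counts are hypothetical values
# a reasonable algorithm might produce over T = 200 steps.
mu = np.array([0.3, 0.5, 0.7])
delta = mu.max() - mu                # per-arm gaps: [0.4, 0.2, 0.0]
pulls = np.array([15, 25, 160])      # hypothetical N_k(T), summing to T = 200

regret = float((delta * pulls).sum())
print(regret)  # 15*0.4 + 25*0.2 + 160*0.0 = 11.0
```

Only the suboptimal arms contribute: the 160 pulls of the best arm add zero regret.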

Lai-Robbins lower bound

No algorithm can achieve better than logarithmic regret asymptotically. For any consistent algorithm:

liminf_{T→∞} R(T) / ln T ≥ Σ_{k ≠ k*} Δ_k / KL(ν_k, ν_{k*})

This means R(T) = Ω(ln T) is unavoidable — you must explore to reduce uncertainty, and that exploration has a cost. The question is how close to this bound your algorithm gets.

Exploration strategies

ε-greedy: With probability ε, pull a random arm; with probability 1−ε, pull the empirically best arm k̂ = argmax_k μ̂_k.

Regret with fixed ε: R(T) ≈ ε·T·(μ* − μ̄) — linear in T. Linear regret is bad. Fix: use a decaying schedule ε_t = min(1, c·K / (Δ²·t)), which gives O(ln T) regret but requires knowing Δ (unknown in practice).
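As a sketch of the decaying schedule — assuming a constant c = 0.1 and, unrealistically, a known gap Δ = 0.2:

```python
def epsilon_schedule(t: int, K: int = 3, delta: float = 0.2, c: float = 0.1) -> float:
    """Decaying exploration rate eps_t = min(1, c*K / (delta^2 * t)).
    delta is the smallest suboptimality gap; it is unknown in practice,
    which is why this schedule is rarely deployable as-is."""
    return min(1.0, c * K / (delta ** 2 * t))

# Exploration probability falls off roughly as 1/t:
for t in [1, 10, 100, 1000]:
    print(t, round(epsilon_schedule(t), 4))
# 1 -> 1.0, 10 -> 0.75, 100 -> 0.075, 1000 -> 0.0075
```

Early on the schedule saturates at ε = 1 (pure exploration); once t is large enough, exploration decays as 1/t, which is what yields the logarithmic total regret.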

UCB1 (Upper Confidence Bound): Choose the arm that maximizes an optimistic upper bound on its true mean:

a_t = argmax_k [ μ̂_k + √(2 ln t / N_k(t)) ]

The second term is a confidence bonus that shrinks as arm k is pulled more. UCB1 achieves:

E[R(T)] ≤ Σ_{k ≠ k*} ( 8 ln T / Δ_k² + 1 + π²/3 ) · Δ_k

This is O(ln T) — matching the Lai-Robbins lower bound up to constants.

Thompson Sampling: Maintain a posterior over each arm's reward parameter. At each step:

  1. Sample θ_k ~ Beta(α_k, β_k) for each arm
  2. Pull arm a_t = argmax_k θ_k
  3. Update posterior: if reward = 1, α_{a_t} += 1; if reward = 0, β_{a_t} += 1

Thompson Sampling is Bayes-optimal under correct priors and empirically achieves regret very close to the Lai-Robbins lower bound. Unlike UCB, it requires no explicit exploration parameter — randomness in posterior sampling handles exploration naturally.

The cumulative regret comparison above shows the practical difference: with K=3 arms and 200 steps, Thompson and UCB1 achieve ~6x lower regret than random, while ε-greedy with fixed ε falls in between.

Bayesian A/B testing

Instead of a fixed-horizon test with a p-value threshold, a Bayesian A/B test maintains posteriors over conversion rates and reports expected loss:

Expected Loss = E_{θ_A, θ_B}[ max(θ_A, θ_B) − θ_chosen ]

You stop when expected loss drops below a threshold (e.g., 0.1% lift). This is equivalent to a bandit with two arms and a stopping rule — it allows peeking and early stopping without the multiple-comparisons inflation that breaks frequentist tests.
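A minimal Monte Carlo sketch of this stopping rule, with hypothetical conversion counts (120/1000 for A, 150/1000 for B) and a Beta(1,1) prior:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Posteriors after hypothetical data: 120/1000 conversions (A), 150/1000 (B)
theta_a = rng.beta(1 + 120, 1 + 880, size=n)
theta_b = rng.beta(1 + 150, 1 + 850, size=n)

# Expected loss of committing to each variant, estimated by Monte Carlo
loss_a = float((np.maximum(theta_a, theta_b) - theta_a).mean())
loss_b = float((np.maximum(theta_a, theta_b) - theta_b).mean())
print(f"E[loss | ship A] = {loss_a:.5f}")
print(f"E[loss | ship B] = {loss_b:.5f}")
# Ship B once E[loss | ship B] falls below the chosen threshold (e.g. 0.001)
```

Because B's observed rate is higher, its expected loss is the smaller of the two; you keep collecting data until one variant's expected loss crosses the threshold.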

Contextual bandits

Standard bandits learn one mean per arm globally. Contextual bandits use a context vector X_t (user features, time of day, etc.) to predict arm rewards and adapt per-user:

LinUCB: Assume a linear reward model E[r_{a,t}] = X_tᵀθ_a. Maintain per-arm ridge regression estimates θ̂_a and design matrices A_a, which define confidence ellipsoids:

a_t = argmax_a [ X_tᵀθ̂_a + α·√(X_tᵀ A_a⁻¹ X_t) ]

The exploration bonus α·√(X_tᵀ A_a⁻¹ X_t) inflates in directions of feature-space uncertainty. LinUCB is widely used in news recommendation (Li et al., 2010) and achieves O(√(T ln T)) contextual regret.

Neural/nonparametric contextual bandits: Replace the linear model with a neural network; use dropout or Laplace approximation for uncertainty quantification. Less theoretically clean but more powerful in practice.

Switchback experiments and interference

Standard A/B tests assume SUTVA (Stable Unit Treatment Value Assumption): user i's outcome is unaffected by other users' assignments. This breaks in:

  • Marketplaces: showing User A a price affects supply available to User B
  • Social platforms: viral effects between friends
  • Ride-sharing: changing dispatch for one driver affects wait times for all

Switchback experiments: alternate treatment/control assignment across time windows (not across users). Each time window is independently randomized. Assumes treatment effects decay within the window length (requires domain knowledge about carryover).

Cluster randomization: Randomize at the level of geographic or social clusters rather than individual users, to contain interference within clusters.
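A minimal switchback assignment sketch: hash the time-window index (with a hypothetical per-experiment salt) so that every request inside the same window gets the same arm, while consecutive windows are independently randomized:

```python
import hashlib
from datetime import datetime, timezone

def switchback_assignment(ts: datetime, window_minutes: int = 60,
                          salt: str = "expt-42") -> str:
    """Assign an entire time window to treatment or control.
    window_minutes must exceed the carryover time of the treatment effect;
    `salt` is a hypothetical per-experiment identifier."""
    epoch_min = int(ts.replace(tzinfo=timezone.utc).timestamp()) // 60
    window_id = epoch_min // window_minutes
    digest = hashlib.sha256(f"{salt}:{window_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"

# Requests 50 minutes apart fall in the same hour-long window: same assignment
a1 = switchback_assignment(datetime(2024, 5, 1, 10, 5))
a2 = switchback_assignment(datetime(2024, 5, 1, 10, 55))
print(a1, a2)  # identical
```

Hashing the window index (rather than flipping a coin at runtime) makes the assignment deterministic and reproducible across services, which is the same trick used for hash-based user bucketing in standard A/B tests.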

Walkthrough

Implementing ε-greedy, UCB1, and Thompson Sampling

python
import numpy as np
from scipy.stats import beta as beta_dist
 
class Bandit:
    """K-armed Bernoulli bandit with fixed true probabilities."""
    def __init__(self, probs):
        self.probs = probs
        self.K = len(probs)
        self.best_mean = max(probs)
 
    def pull(self, arm):
        return float(np.random.random() < self.probs[arm])
 
 
class EpsilonGreedy:
    def __init__(self, K, epsilon=0.1):
        self.K = K
        self.epsilon = epsilon
        self.counts = np.zeros(K)
        self.means = np.zeros(K)
 
    def select(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.K)
        return np.argmax(self.means)
 
    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean update
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
 
 
class UCB1:
    def __init__(self, K):
        self.K = K
        self.counts = np.zeros(K)
        self.means = np.zeros(K)
        self.t = 0
 
    def select(self):
        # Pull each arm once first
        if self.t < self.K:
            return self.t
        bonus = np.sqrt(2 * np.log(self.t) / self.counts)
        return np.argmax(self.means + bonus)
 
    def update(self, arm, reward):
        self.t += 1
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
 
 
class ThompsonSampling:
    def __init__(self, K):
        self.K = K
        self.alpha = np.ones(K)   # successes + 1
        self.beta = np.ones(K)    # failures + 1
 
    def select(self):
        samples = beta_dist.rvs(self.alpha, self.beta)
        return np.argmax(samples)
 
    def update(self, arm, reward):
        self.alpha[arm] += reward
        self.beta[arm] += (1 - reward)
 
 
def simulate(bandit, agent, T=1000):
    regret = 0.0
    cumulative_regrets = []
    for _ in range(T):
        arm = agent.select()
        reward = bandit.pull(arm)
        agent.update(arm, reward)
        regret += bandit.best_mean - bandit.probs[arm]
        cumulative_regrets.append(regret)
    return cumulative_regrets
 
 
# Compare strategies
probs = [0.3, 0.5, 0.7]
bandit = Bandit(probs)
 
results = {
    'epsilon_greedy': simulate(bandit, EpsilonGreedy(3, epsilon=0.1)),
    'ucb1':           simulate(bandit, UCB1(3)),
    'thompson':       simulate(bandit, ThompsonSampling(3)),
}
 
for name, regrets in results.items():
    print(f'{name:20s}: final regret = {regrets[-1]:.1f}')

LinUCB for contextual bandits

python
class LinUCB:
    """Linear contextual bandit using ridge regression + UCB exploration."""
    def __init__(self, K, d, alpha=1.0):
        self.K = K
        self.alpha = alpha
        # Per-arm: A = X^T X + I, b = X^T y
        self.A = [np.eye(d) for _ in range(K)]
        self.b = [np.zeros(d) for _ in range(K)]
 
    def select(self, context):
        x = context
        ucbs = []
        for a in range(self.K):
            theta = np.linalg.solve(self.A[a], self.b[a])
            uncertainty = np.sqrt(x @ np.linalg.solve(self.A[a], x))
            ucbs.append(float(theta @ x + self.alpha * uncertainty))
        return int(np.argmax(ucbs))
 
    def update(self, arm, context, reward):
        x = context
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
 
 
# Usage: create features from user context, call select/update at each request

Analysis & Evaluation

Where Your Intuition Breaks

"Bandits are always better than A/B tests because they waste less traffic on losing variants." Not quite. Bandits make causal inference harder: as the bandit adapts, arm assignment probabilities become correlated with unobserved user characteristics, and standard z-tests on bandit data are biased. Fixed A/B tests give you a clean, unconfounded causal estimate of the treatment effect — bandits give you lower regret but a noisier and potentially biased effect estimate. For feature launches that need a defensible causal estimate (regulatory, safety, detecting long-term harm), fixed A/B tests remain the correct choice.

Exploration strategy comparison

Strategy             | Regret        | Tuning | Key assumption       | Best when
Random               | O(T)          | None   | —                    | Baseline only
ε-greedy (fixed ε)   | O(εT)         | ε      | Stationary rewards   | Simple baseline
ε-greedy (decaying)  | O(ln T)       | c, Δ   | Needs gap estimate   | When Δ is known
UCB1                 | O(√(KT ln T)) | None   | Sub-Gaussian rewards | Default choice
Thompson Sampling    | ≈ Lai-Robbins | Prior  | Prior specification  | Best empirical performance
LinUCB               | O(√(dT ln T)) | α      | Linear reward model  | Context available

Bandits vs fixed A/B tests

Criterion            | Fixed A/B test               | Bandit
Causal inference     | Clean (SUTVA)                | Biased without bandit-aware correction
Carryover / novelty  | Controlled (fixed horizon)   | Harder to disentangle
Long-horizon effects | Easily measured              | Harder (exploitation converges fast)
Traffic efficiency   | May waste traffic on loser   | Allocates more to winner
Analysis complexity  | Simple z-test                | Requires bandit-aware inference
Best fit             | Feature launches, compliance | Recommendation, pricing, ads, personalization

Practical pitfalls

Cold start: UCB1 must pull every arm at least once before its confidence bonus is even defined — fine for K=3, problematic for K=10,000. Solution: contextual bandits or prior knowledge to warm-start.

Non-stationarity: Rewards drift over time (seasonality, feature interactions). Standard UCB/Thompson assume stationarity. Use discounted Thompson Sampling or sliding-window UCB.
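A sketch of discounted Thompson Sampling under these assumptions: decay all pseudo-counts toward the Beta(1,1) prior by a factor γ before each update, so stale evidence fades and the effective sample size is capped at roughly 1/(1−γ):

```python
import numpy as np

class DiscountedThompson:
    """Thompson Sampling with exponential forgetting (gamma < 1) for
    non-stationary Bernoulli rewards. A sketch, not a tuned implementation."""
    def __init__(self, K: int, gamma: float = 0.99):
        self.gamma = gamma
        self.alpha = np.ones(K)
        self.beta = np.ones(K)

    def select(self) -> int:
        return int(np.argmax(np.random.beta(self.alpha, self.beta)))

    def update(self, arm: int, reward: float) -> None:
        # Decay all pseudo-counts toward the Beta(1,1) prior...
        self.alpha = 1 + self.gamma * (self.alpha - 1)
        self.beta = 1 + self.gamma * (self.beta - 1)
        # ...then add the new observation
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward
```

Because counts are capped near 1/(1−γ), the posterior never fully concentrates; that residual exploration is exactly what a drifting environment requires.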

Delayed feedback: Conversion events arrive hours or days later. Batch updates with a fixed lag, or use pessimistic initialization until delayed feedback arrives.
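One way to sketch the batch-with-lag pattern (the names here are illustrative, not a standard API): log selections immediately, apply the bandit update only when the conversion arrives or the attribution window expires, and default to a pessimistic reward of 0 on expiry:

```python
from collections import deque

class DelayedUpdateQueue:
    """Buffers bandit updates until delayed feedback arrives.
    `bandit` is any object exposing update(arm, reward)."""
    def __init__(self, bandit, max_lag_steps: int = 100):
        self.bandit = bandit
        self.max_lag = max_lag_steps
        self.pending = deque()              # (step, arm) awaiting an outcome

    def log_selection(self, step: int, arm: int) -> None:
        self.pending.append((step, arm))

    def resolve(self, current_step: int, outcomes: dict[int, float]) -> None:
        """outcomes maps selection step -> observed reward (0/1)."""
        while self.pending:
            step, arm = self.pending[0]
            if step in outcomes:                        # feedback arrived
                self.bandit.update(arm, outcomes[step])
            elif current_step - step >= self.max_lag:   # window expired
                self.bandit.update(arm, 0.0)            # pessimistic default
            else:
                break                                   # still waiting; stop
            self.pending.popleft()
```

The queue decouples the read path (selection) from the write path (update), so conversions arriving hours later still feed the posterior in order.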

Multiple metrics: Bandits optimize a single scalar. If you need to balance CTR and revenue, combine into a composite reward — or use constrained bandits.

Peeking in A/B tests: Running a Bayesian bandit isn't the same as repeatedly peeking at a frequentist p-value. Bayesian expected-loss stopping is valid; p-value peeking without correction inflates Type I error.
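A quick simulation makes the inflation concrete. Under a true null (both arms identical), stopping at the first p < 0.05 across 20 interim looks rejects far more often than the nominal 5%:

```python
import numpy as np
from scipy.stats import norm

# Both arms share the SAME conversion rate, so every rejection is a false
# positive. Peeking after every batch and stopping at the first p < 0.05
# inflates the Type I error well above the nominal level.
rng = np.random.default_rng(1)
p_true, batches, batch_size, sims = 0.5, 20, 200, 500
false_positives = 0

for _ in range(sims):
    a = b = n = 0
    for _ in range(batches):
        a += rng.binomial(batch_size, p_true)
        b += rng.binomial(batch_size, p_true)
        n += batch_size
        pooled = (a + b) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        z = (a / n - b / n) / se
        if 2 * norm.sf(abs(z)) < 0.05:    # peek: stop at first "significance"
            false_positives += 1
            break

rate = false_positives / sims
print(f"Type I error with peeking: {rate:.3f}  (nominal level: 0.05)")
```

The same repeated looks are harmless under Bayesian expected-loss stopping, because the posterior already conditions on all data seen so far; the frequentist p-value does not.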

Production-Ready Code

Production bandits require three things beyond the algorithm: durable state persistence (so the prior survives restarts), a reward logging pipeline, and a non-stationarity detector that resets priors when the reward distribution shifts. The framework below separates selection (read-path) from update (write-path) so both can scale independently.

python
# production_bandit.py
# Thompson Sampling bandit with durable state, reward logging, and
# non-stationarity detection via Kolmogorov-Smirnov test.
 
from __future__ import annotations
from dataclasses import dataclass, field
import numpy as np
from scipy.stats import ks_2samp
 
 
# ── Beta-Bernoulli Thompson Sampling ──────────────────────────────────────────
 
@dataclass
class ThompsonSamplingBandit:
    """
    Beta-Bernoulli Thompson Sampling for binary rewards (click, conversion).
    alpha_[k] = successes + 1, beta_[k] = failures + 1 (Beta(1,1) prior).
    Guard .update() with a lock in concurrent environments.
    Serialise state to DB after every update for crash recovery.
    """
    arms: list[str]
    alpha_: np.ndarray = field(init=False)
    beta_: np.ndarray = field(init=False)
    pulls: np.ndarray = field(init=False)
 
    def __post_init__(self) -> None:
        k = len(self.arms)
        self.alpha_ = np.ones(k)
        self.beta_  = np.ones(k)
        self.pulls  = np.zeros(k, dtype=int)
 
    def select(self) -> str:
        samples = np.random.beta(self.alpha_, self.beta_)
        return self.arms[int(np.argmax(samples))]
 
    def update(self, arm: str, reward: float) -> None:
        """reward in {0, 1} for Bernoulli; clip to [0, 1] for rate metrics."""
        idx = self.arms.index(arm)
        self.pulls[idx] += 1
        self.alpha_[idx] += reward
        self.beta_[idx]  += 1.0 - reward
 
    def to_dict(self) -> dict:
        return {
            "arms": self.arms,
            "alpha": self.alpha_.tolist(),
            "beta": self.beta_.tolist(),
            "pulls": self.pulls.tolist(),
        }
 
    @classmethod
    def from_dict(cls, state: dict) -> "ThompsonSamplingBandit":
        bandit = cls(arms=state["arms"])
        bandit.alpha_ = np.array(state["alpha"])
        bandit.beta_  = np.array(state["beta"])
        bandit.pulls  = np.array(state["pulls"], dtype=int)
        return bandit
 
    def stats(self) -> list[dict]:
        total = self.pulls.sum()
        return [
            {
                "arm": arm,
                "pulls": int(self.pulls[i]),
                "traffic_share_pct": round(self.pulls[i] / max(total, 1) * 100, 1),
                "estimated_rate": round(
                    self.alpha_[i] / (self.alpha_[i] + self.beta_[i]), 4
                ),
            }
            for i, arm in enumerate(self.arms)
        ]
 
 
# ── Non-stationarity detection ─────────────────────────────────────────────────
 
def detect_reward_shift(
    reward_log: list[tuple[str, float]],  # [(arm, reward), ...]
    arm: str,
    window: int = 500,
    significance: float = 0.05,
) -> dict:
    """
    KS test comparing recent vs. historical rewards for one arm.
    Call periodically (e.g., every 1k events). Reset priors if shift detected.
    """
    arm_rewards = [r for a, r in reward_log if a == arm]
    if len(arm_rewards) < window * 2:
        return {"arm": arm, "shift_detected": False, "reason": "insufficient data"}
    historical = np.array(arm_rewards[:-window])
    recent     = np.array(arm_rewards[-window:])
    stat, p = ks_2samp(historical, recent)
    return {
        "arm": arm,
        "historical_mean": round(float(historical.mean()), 4),
        "recent_mean":     round(float(recent.mean()), 4),
        "ks_stat": round(float(stat), 4),
        "p_value": round(float(p), 6),
        "shift_detected": p < significance,
        "action": "RESET priors to Beta(1,1)" if p < significance else "OK",
    }
 
 
# ── A/B vs. bandit decision framework ─────────────────────────────────────────
 
def recommend_approach(
    n_variants: int,
    total_traffic: int,
    inference_required: bool,
    regret_sensitive: bool,
) -> str:
    if inference_required:
        return "ab_test — fixed randomisation required for causal inference"
    if n_variants > 10:
        return "bandit — too many arms for A/B to be powered within reasonable time"
    if total_traffic < 5_000:
        return "ab_test — too little traffic for bandit to converge meaningfully"
    if regret_sensitive and total_traffic > 50_000:
        return "bandit — large traffic; regret savings justify adaptive allocation"
    return "ab_test — default; use bandit only with explicit regret justification"
 
 
# ── Example ───────────────────────────────────────────────────────────────────
rng = np.random.default_rng(42)
bandit = ThompsonSamplingBandit(arms=["control", "v1", "v2"])
true_rates = {"control": 0.45, "v1": 0.47, "v2": 0.43}
reward_log: list[tuple[str, float]] = []
 
for _ in range(2_000):
    arm = bandit.select()
    reward = float(rng.binomial(1, true_rates[arm]))
    bandit.update(arm, reward)
    reward_log.append((arm, reward))
 
for s in bandit.stats():
    print(s)
 
print(detect_reward_shift(reward_log, "v1", window=200))
print(recommend_approach(3, 50_000, inference_required=False, regret_sensitive=True))
