Always-Valid Sequential Testing
Every practitioner who runs A/B tests eventually does it: they check the experiment dashboard before the planned end date, see $p < 0.05$, and ship. This is peeking — and it inflates the false positive rate from 5% to as high as 26% with five peeks. O'Brien-Fleming boundaries fix this, but only if you pre-commit to exact peek times. What if you want to monitor an experiment continuously and stop whenever the evidence is convincing? That requires a fundamentally different statistical object: always-valid inference.
Theory
Why standard p-values break under optional stopping. A p-value is valid at a single fixed sample size. If you commit to stopping when $p < 0.05$ — regardless of when that happens — you have changed the stopping rule. Under the null hypothesis, the running p-value is uniform on $[0, 1]$ at any fixed sample size, but the minimum of the p-value process over time is stochastically smaller than uniform. The probability of ever seeing $p < 0.05$ under the null over an infinite horizon approaches 1.
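A quick simulation makes the inflation concrete. This is an illustrative sketch (our own toy setup: Gaussian data with known variance, five evenly spaced looks), not a reproduction of any specific study:

```python
import numpy as np

# Estimate the false positive rate when we test at five interim peeks
# and reject if ANY peek shows |z| > 1.96, versus testing once at the end.
rng = np.random.default_rng(0)
n_sims, n_max = 2000, 1000
peeks = [200, 400, 600, 800, 1000]   # five evenly spaced looks
z_crit = 1.96                        # two-sided 5% critical value

false_positives_peeking = 0
false_positives_fixed = 0
for _ in range(n_sims):
    x = rng.standard_normal(n_max)   # null is true: mean 0
    cum = np.cumsum(x)
    z_at = [cum[n - 1] / np.sqrt(n) for n in peeks]
    if any(abs(z) > z_crit for z in z_at):
        false_positives_peeking += 1
    if abs(z_at[-1]) > z_crit:
        false_positives_fixed += 1

print(f"fixed-horizon FPR: {false_positives_fixed / n_sims:.3f}")   # near 0.05
print(f"peeking (5x) FPR:  {false_positives_peeking / n_sims:.3f}")  # well above 0.05
```

The single end-of-experiment test holds its nominal level; the peeking rule rejects far more often on the same null data, because it takes the best of five correlated looks.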
E-values: the currency of always-valid inference. An e-value is a non-negative random variable $E$ with $\mathbb{E}[E] \le 1$ under the null. By Markov's inequality, $P(E \ge 1/\alpha) \le \alpha$. The crucial property: e-values compose under optional stopping. If $E_1, E_2, \ldots$ are e-values computed from independent data, then $\prod_i E_i$ is still an e-value. You can accumulate evidence continuously and stop whenever $E_n \ge 1/\alpha$ — the false positive rate is still controlled at $\alpha$.
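As a toy illustration of both properties (this coin-flip example and its parameters are our own, not from any particular library): a likelihood ratio has expectation exactly 1 under the null, so it is an e-value, and per-flip e-values multiply into an e-process:

```python
import numpy as np

# Toy e-value: likelihood ratio for a coin. Null: heads prob 0.5;
# alternative: heads prob 0.7. Under the null,
# E[LR] = 0.5*(0.7/0.5) + 0.5*(0.3/0.5) = 1, so each flip's LR is an
# e-value and the running product is an e-process.
def flip_evalue(flip: int, p_alt: float = 0.7, p_null: float = 0.5) -> float:
    """Per-flip likelihood ratio (expectation exactly 1 under the null)."""
    return (p_alt / p_null) if flip == 1 else ((1 - p_alt) / (1 - p_null))

rng = np.random.default_rng(1)
alpha = 0.05
flips = rng.random(400) < 0.7         # data actually from the alternative
e_process = np.cumprod([flip_evalue(int(f)) for f in flips])

crossed = e_process >= 1 / alpha
first_cross = int(np.argmax(crossed)) + 1 if crossed.any() else None
print(f"first crossing of 1/alpha = {1/alpha:.0f} at flip {first_cross}")
```

Because the data here really come from the alternative, the e-process drifts upward and crosses the $1/\alpha$ threshold; under the null it would cross with probability at most $\alpha$, no matter when you stop.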
Why it had to be this way. The classical Neyman-Pearson guarantee is proved at a single fixed sample size. E-values escape this through the martingale structure: the running product $M_n = \prod_{i \le n} E_i$ is a non-negative supermartingale under the null, and Ville's inequality — a time-uniform version of Markov's inequality — gives $P(\sup_n M_n \ge 1/\alpha) \le \alpha$.
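Ville's inequality can be sanity-checked by simulation. The sketch below (again our own toy setup: a fair-coin null monitored against a 0.7 alternative) tracks the likelihood-ratio martingale over a long horizon and counts how often it *ever* crosses $1/\alpha$:

```python
import numpy as np

# Monte Carlo check of Ville's inequality: under the null (fair coin),
# the likelihood-ratio martingale for an alternative of 0.7 should
# exceed 1/alpha = 20 at ANY point in at most ~5% of runs.
rng = np.random.default_rng(42)
alpha, n_sims, horizon = 0.05, 4000, 2000
crossed = 0
for _ in range(n_sims):
    flips = rng.random(horizon) < 0.5              # null is true
    lr = np.where(flips, 0.7 / 0.5, 0.3 / 0.5)     # per-flip likelihood ratio
    if np.cumprod(lr).max() >= 1 / alpha:
        crossed += 1
print(f"P(sup M_n >= 20) estimated at {crossed / n_sims:.3f} (bound: {alpha})")
```

The empirical crossing rate sits below the $\alpha = 0.05$ bound even though every run is monitored at every single step — exactly the guarantee a fixed-horizon test cannot give.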
Confidence sequences. A confidence sequence $(\mathrm{CI}_n)_{n \ge 1}$ satisfies:

$$P\big(\theta \in \mathrm{CI}_n \text{ for all } n \ge 1\big) \ge 1 - \alpha$$
This is a time-uniform guarantee: the true parameter is simultaneously covered at every sample size. A fixed-horizon CI only guarantees coverage at the planned end point. The confidence sequence is always wider than the fixed CI (it has more to cover), but it narrows as $n$ grows — and it is valid whenever you look.
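To see what the time-uniform guarantee costs, compare half-widths directly. The sketch below uses a log-log boundary of the same form as the walkthrough code later in this lesson; exact constants differ across CS constructions, so treat the ratios as indicative rather than exact:

```python
import numpy as np

# Half-width of a fixed-horizon 95% CI vs. an anytime-valid
# confidence sequence for a proportion, at the worst case p_hat = 0.5.
def fixed_ci_width(p: float, n: int, z: float = 1.96) -> float:
    """Standard Wald interval half-width at fixed n."""
    return z * np.sqrt(p * (1 - p) / n)

def cs_width(p: float, n: int, alpha: float = 0.05) -> float:
    """Iterated-logarithm-style confidence sequence half-width."""
    return np.sqrt(2 * p * (1 - p) / n * np.log(np.log(max(2 * n, 3)) / alpha))

for n in (100, 1_000, 10_000, 100_000):
    ratio = cs_width(0.5, n) / fixed_ci_width(0.5, n)
    print(f"n={n:>6}: CS is {ratio:.2f}x wider than the fixed CI")
```

The ratio stays modest and grows only with $\sqrt{\log\log n}$ — the price of looking at every $n$ instead of one.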
mSPRT (mixture Sequential Probability Ratio Test). For testing a Gaussian mean $\mu = 0$ with known variance $\sigma^2$ and a $N(0, \tau^2)$ mixing distribution, the e-value at step $n$ is:

$$E_n = \sqrt{\frac{\sigma^2}{\sigma^2 + n\tau^2}} \, \exp\!\left(\frac{n^2 \tau^2 \bar{X}_n^2}{2\sigma^2(\sigma^2 + n\tau^2)}\right)$$
where $\tau$ is the mixing parameter (prior SD on effect size). Choose $\tau$ to be roughly the minimum effect size you care about detecting.
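A minimal sketch of this formula (the default $\sigma$ and $\tau$ values here are arbitrary illustrative choices):

```python
import numpy as np

# Gaussian mSPRT e-value: null mean 0, known sigma, N(0, tau^2) mixture.
def msprt_evalue(xbar: float, n: int, sigma: float = 1.0, tau: float = 0.1) -> float:
    """Mixture likelihood ratio E_n for H0: mu = 0 against a N(0, tau^2) mixture."""
    s2, t2 = sigma**2, tau**2
    return float(
        np.sqrt(s2 / (s2 + n * t2))
        * np.exp(n**2 * t2 * xbar**2 / (2 * s2 * (s2 + n * t2)))
    )

print(f"E_0 = {msprt_evalue(0.0, 0):.1f}")   # no data -> e-value exactly 1
```

With no data the e-value is exactly 1; a sample mean near zero drives it below 1 (evidence for the null), while a large standardized mean drives it past the $1/\alpha$ stopping threshold.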
Bayesian sequential testing. The posterior probability correctly represents uncertainty at any point. Stopping when the posterior exceeds 0.95 is always-valid in spirit — but it does not automatically control Type I error unless the prior is calibrated. Use e-values when calibrated priors are unavailable.
Walkthrough
Scenario: Monitoring a conversion rate experiment. Null: $p_{\text{treat}} = p_{\text{control}}$.
Step 1: Compute e-values for each new batch of observations.
import numpy as np
from scipy.special import betaln
def evalue_proportion(
    x_treat: int, n_treat: int,
    x_control: int, n_control: int,
    rho: float = 0.5,
) -> float:
    """E-value for two-proportion test using a beta-binomial mixture.

    Bayes factor of independent Beta(rho, rho) priors per arm against
    a pooled Beta(2*rho, 2*rho) prior under the null of equal rates.
    """
    log_e = (
        betaln(x_treat + rho, n_treat - x_treat + rho) +
        betaln(x_control + rho, n_control - x_control + rho) -
        2 * betaln(rho, rho) -   # one prior normalizer per arm
        betaln(x_treat + x_control + 2*rho,
               n_treat + n_control - x_treat - x_control + 2*rho) +
        betaln(2*rho, 2*rho)
    )
    return float(np.exp(log_e))
def confidence_sequence_proportion(
    n: int,
    p_hat: float,
    alpha: float = 0.05,
) -> tuple[float, float]:
    """Anytime-valid CI for a proportion using a Robbins-Siegmund-style boundary."""
    if n == 0:
        return (0.0, 1.0)
    p = np.clip(p_hat, 1e-6, 1 - 1e-6)
    width = np.sqrt(
        2 * p * (1 - p) / n * np.log(np.log(max(2 * n, 3)) / alpha)
    )
    return (max(p_hat - width, 0.0), min(p_hat + width, 1.0))

Step 2: Monitor the e-process and confidence sequence.
def run_sequential_monitor(
    outcomes_treat: np.ndarray,   # 0/1 per observation
    outcomes_control: np.ndarray,
    alpha: float = 0.05,
    rho: float = 0.5,
) -> dict:
    """Run always-valid sequential monitor, return first stopping time."""
    threshold = 1.0 / alpha
    e_process = 1.0
    x_t, n_t, x_c, n_c = 0, 0, 0, 0
    stop_n = None
    n_obs = min(len(outcomes_treat), len(outcomes_control))
    for n in range(1, n_obs + 1):
        x_t += int(outcomes_treat[n - 1])
        n_t += 1
        x_c += int(outcomes_control[n - 1])
        n_c += 1
        # evalue_proportion is computed on ALL data so far, so it already
        # IS the running e-process. Do not multiply successive cumulative
        # values -- that double-counts evidence and breaks error control.
        e_process = evalue_proportion(x_t, n_t, x_c, n_c, rho)
        if stop_n is None and e_process >= threshold:
            stop_n = n
    return {
        'final_e_process': round(e_process, 4),
        'threshold': threshold,
        'stopped_at_n': stop_n,
        'reject': stop_n is not None,   # decision is made at first crossing
    }

Comparison to fixed-horizon. Run the same data through a fixed-horizon two-proportion $z$-test. With 5 intermediate peeks, the naive test's false positive rate inflates far above the nominal 5%. The e-process stops when evidence is strong enough — at any sample size — with Type I error still at $\alpha$.
Analysis & Evaluation
Where your intuition breaks. Always-valid tests sound better in every way — so why use fixed-horizon tests at all? The answer is power. For the same sample size, a confidence sequence is always wider than a fixed-horizon CI: the always-valid guarantee costs statistical efficiency. At the planned end of the experiment, an always-valid test typically requires 1.3–2× more data than a fixed-horizon test to achieve the same power. Use always-valid methods when you need to peek early; use fixed-horizon when you can commit to an end date.
| Method | Valid at fixed $n$ | Valid under peeking | Power | When to use |
|---|---|---|---|---|
| Fixed-horizon p-value | Yes | No | High | Can commit to end date |
| O'Brien-Fleming | Yes | Yes (pre-planned) | High | Know peek schedule in advance |
| E-value / mSPRT | Yes | Yes (any time) | ~70–80% of FH | Continuous monitoring |
| Confidence sequence | Yes | Yes (any time) | ~70–80% of FH | Always-valid interval |
Common misuse. E-values are not p-values. An e-value of 20 does not mean $p = 0.05$. The relationship is $p \le 1/E$: an e-value of 20 gives $\alpha = 0.05$ error control in the sense of Markov's inequality, not Neyman-Pearson.
"Always-valid" does not mean "always correct." An e-value process correctly controls false positives at any stopping time. But if you run 100 true-null experiments and always reject when $E \ge 20$, you should still expect about 5 false rejections. This is correct per-experiment frequentist control — but do not interpret a single e-value as a within-experiment running probability that the effect is real.
Production-Ready Code
"""
Always-valid sequential testing production system.
E-value process, confidence sequences, and monitoring
with immutable state updates for streaming contexts.
"""
from __future__ import annotations
from dataclasses import dataclass, field
import numpy as np
from scipy.special import betaln
@dataclass
class SequentialMonitor:
    """Production always-valid experiment monitor.

    Immutable update pattern: call .update() to get a new monitor
    with accumulated state. Safe for use in streaming/Kafka pipelines.
    """
    alpha: float = 0.05
    rho: float = 0.5
    x_treat: int = 0
    n_treat: int = 0
    x_control: int = 0
    n_control: int = 0
    e_process: float = 1.0
    stopped: bool = False
    stop_n: int | None = None
    _history: list[dict] = field(default_factory=list, repr=False)

    @property
    def threshold(self) -> float:
        return 1.0 / self.alpha
    def update(
        self,
        treat_conversions: int,
        treat_n: int,
        control_conversions: int,
        control_n: int,
    ) -> 'SequentialMonitor':
        """Return a new monitor with updated state."""
        new_x_t = self.x_treat + treat_conversions
        new_n_t = self.n_treat + treat_n
        new_x_c = self.x_control + control_conversions
        new_n_c = self.n_control + control_n
        total_n = new_n_t + new_n_c
        log_e = (
            betaln(new_x_t + self.rho, new_n_t - new_x_t + self.rho) +
            betaln(new_x_c + self.rho, new_n_c - new_x_c + self.rho) -
            2 * betaln(self.rho, self.rho) -   # one prior normalizer per arm
            betaln(new_x_t + new_x_c + 2*self.rho,
                   new_n_t + new_n_c - new_x_t - new_x_c + 2*self.rho) +
            betaln(2*self.rho, 2*self.rho)
        )
        # The mixture likelihood ratio on ALL data so far is itself the
        # running e-process; multiplying successive cumulative values
        # would double-count evidence and inflate Type I error.
        new_e_process = float(np.exp(log_e))
        new_stopped = self.stopped or (new_e_process >= self.threshold)
        new_stop_n = (
            self.stop_n if self.stopped
            else (total_n if new_e_process >= self.threshold else None)
        )
        return SequentialMonitor(
            alpha=self.alpha, rho=self.rho,
            x_treat=new_x_t, n_treat=new_n_t,
            x_control=new_x_c, n_control=new_n_c,
            e_process=new_e_process,
            stopped=new_stopped, stop_n=new_stop_n,
            _history=self._history + [{
                'n': total_n,
                'e_process': round(new_e_process, 4),
                'p_treat': round(new_x_t / max(new_n_t, 1), 4),
                'p_control': round(new_x_c / max(new_n_c, 1), 4),
                'stopped': new_stopped,
            }],
        )
    def confidence_sequence(self) -> dict:
        """Return current always-valid CI for the rate difference.

        The per-arm widths are combined in quadrature, which is an
        approximation; a strictly conservative interval would allocate
        alpha/2 to each arm and add the widths.
        """
        def cs_width(x: int, n: int) -> float:
            if n == 0:
                return 0.5
            p = np.clip(x / n, 1e-6, 1 - 1e-6)
            return float(np.sqrt(
                2 * p * (1 - p) / n * np.log(np.log(max(2 * n, 3)) / self.alpha)
            ))

        p_t = self.x_treat / max(self.n_treat, 1)
        p_c = self.x_control / max(self.n_control, 1)
        diff = p_t - p_c
        width = np.sqrt(
            cs_width(self.x_treat, self.n_treat)**2 +
            cs_width(self.x_control, self.n_control)**2
        )
        return {
            'estimate': round(diff, 6),
            'lower': round(diff - width, 6),
            'upper': round(diff + width, 6),
            'contains_zero': (diff - width) < 0 < (diff + width),
        }
    def summary(self) -> dict:
        cs = self.confidence_sequence()
        return {
            'n_treat': self.n_treat,
            'n_control': self.n_control,
            'e_process': round(self.e_process, 4),
            'threshold': self.threshold,
            'decision': 'REJECT H0' if self.stopped else 'CONTINUE',
            'stop_n': self.stop_n,
            'rate_diff': cs,
        }