Neural-Path/Notes

Always-Valid Sequential Testing

Every practitioner who runs A/B tests eventually does it: check the experiment dashboard before the planned end date, see p < 0.05, and ship. This is peeking, and it inflates the false positive rate from 5% to roughly 14% with five peeks, and toward 100% with unlimited peeking. O'Brien-Fleming boundaries fix this, but only if you pre-commit to exact peek times. What if you want to monitor an experiment continuously and stop whenever the evidence is convincing? That requires a fundamentally different statistical object: always-valid inference.

Theory

Figure: effect estimate versus samples collected, comparing the running estimate, a fixed CI, and an always-valid confidence sequence (CS).

Why standard p-values break under optional stopping. A p-value is valid at a single fixed sample size. If you commit to stopping as soon as p < 0.05, whenever that happens, you have changed the stopping rule. Under the null hypothesis the p-value is uniform on [0, 1] at any fixed sample size, but its running minimum over time is stochastically smaller than uniform. With an unlimited horizon, the probability of ever seeing p < \alpha under the null approaches 1.
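The inflation is easy to reproduce by simulation. A minimal sketch (the function name, batch sizes, and trial counts are illustrative): run many A/A experiments under the null, peek at a two-sided z-test at five evenly spaced checkpoints, and count how often any peek rejects.

```python
import numpy as np

def peeking_false_positive_rate(
    n_trials: int = 4000, n_max: int = 1000, n_peeks: int = 5, seed: int = 0
) -> float:
    """Fraction of null (A/A) experiments rejected when peeking at a z-test."""
    rng = np.random.default_rng(seed)
    peeks = np.linspace(n_max // n_peeks, n_max, n_peeks).astype(int)
    rejections = 0
    for _ in range(n_trials):
        x = rng.standard_normal(n_max)                 # null: mean 0, known sd 1
        z = np.cumsum(x)[peeks - 1] / np.sqrt(peeks)   # z-stat at each peek
        rejections += bool(np.any(np.abs(z) > 1.96))   # reject if ANY peek fires
    return rejections / n_trials

fpr = peeking_false_positive_rate()   # inflated well above the nominal 0.05
```

Each individual peek has a 5% false positive rate; the union over five peeks lands near 14%, in line with the classic repeated-significance-testing numbers.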

E-values: the currency of always-valid inference. An e-value E is a non-negative random variable with \mathbb{E}_0[E] \leq 1 under the null. By Markov's inequality, P_0(E \geq 1/\alpha) \leq \alpha. The crucial property: e-values compose under optional stopping. If E_1, E_2, \ldots are e-values computed from independent batches of data, then E_n = \prod_{i=1}^n E_i is still an e-value. You can accumulate evidence continuously and stop whenever E_n \geq 1/\alpha; the false positive rate is still controlled at \alpha.
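Both properties can be checked numerically with the simplest e-value there is, a likelihood ratio. A sketch under the null (the value of \theta is an arbitrary choice, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.4                       # any fixed alternative mean works
n_sims, n_batches = 20_000, 5

# Per-batch e-value: the N(theta, 1) vs N(0, 1) likelihood ratio.
x = rng.standard_normal((n_sims, n_batches))   # data generated under H0
e = np.exp(theta * x - theta**2 / 2)           # E_0[e] = 1 for each batch

mean_single = e[:, 0].mean()          # ~1: a single likelihood ratio is an e-value
mean_product = e.prod(axis=1).mean()  # ~1: the product across batches still is
```

Markov's inequality applied to the running product is exactly the stopping rule described above.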

Why it had to be this way. The classical Neyman-Pearson guarantee holds only at a single fixed sample size. E-values escape this through the martingale structure: the running product E_n = \prod_{i \leq n} E_i is a non-negative supermartingale under the null, and Ville's inequality, a time-uniform version of Markov's inequality, gives P_0\!\left(\sup_n E_n \geq 1/\alpha\right) \leq \alpha.
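Ville's bound can also be checked by simulation: track the running maximum of a likelihood-ratio e-process over a long horizon under the null and measure how often it ever reaches 1/\alpha (the value of \theta and the horizon are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, theta = 0.05, 0.3
n_sims, horizon = 4_000, 500

x = rng.standard_normal((n_sims, horizon))            # H0 data
log_e = np.cumsum(theta * x - theta**2 / 2, axis=1)   # log e-process paths
crossing_rate = (log_e.max(axis=1) >= np.log(1 / alpha)).mean()
# crossing_rate stays at or below alpha even though we scan the whole horizon
```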

Confidence sequences. A confidence sequence \{C_n\}_{n \geq 1} satisfies:

P\!\left(\forall n \geq 1: \mu \in C_n\right) \geq 1 - \alpha

This is a time-uniform guarantee: the true parameter is simultaneously covered at every sample size. A fixed-horizon CI only guarantees coverage at the planned end point. The confidence sequence is always wider than the fixed CI (it has more to cover), but it narrows as n grows, and it is valid whenever you look.
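A sketch of that width penalty, comparing a fixed-n normal interval against an iterated-logarithm-style CS half-width (one common Robbins-type form; the exact constants vary by construction, and the variance here is the Bernoulli worst case):

```python
import numpy as np

def fixed_ci_halfwidth(n: int, var: float = 0.25) -> float:
    """Standard fixed-horizon normal interval (alpha = 0.05), valid only at the planned n."""
    return 1.96 * float(np.sqrt(var / n))

def cs_halfwidth(n: int, var: float = 0.25, alpha: float = 0.05) -> float:
    """Iterated-logarithm-style confidence sequence half-width."""
    return float(np.sqrt(2 * var / n * np.log(np.log(max(2 * n, 3)) / alpha)))

ratios = {n: cs_halfwidth(n) / fixed_ci_halfwidth(n) for n in (100, 1_000, 10_000)}
# The CS is roughly 1.5-1.7x wider at every n here; both shrink like 1/sqrt(n).
```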

mSPRT (mixture Sequential Probability Ratio Test). For testing a Gaussian mean against zero with a N(0, \tau^2) mixture over the alternative, the e-value at step n is:

E_n = \sqrt{\frac{\tau^{-2}}{n\sigma^{-2} + \tau^{-2}}} \exp\!\left(\frac{\bar{X}_n^2 \, n^2 \sigma^{-4}}{2\left(n\sigma^{-2} + \tau^{-2}\right)}\right)

where \tau is the mixing parameter (the prior SD on the effect size). Choose \tau near the minimum effect size you care about detecting.
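The mSPRT e-value is just the Gaussian-mixture likelihood ratio \int \frac{\prod_i N(x_i;\, \mu, \sigma^2)}{\prod_i N(x_i;\, 0, \sigma^2)}\, N(\mu;\, 0, \tau^2)\, d\mu, so the closed form can be sanity-checked against direct numerical integration. A sketch (the function names are mine):

```python
import numpy as np
from scipy.integrate import quad

def msprt_evalue(x_bar: float, n: int, sigma: float, tau: float) -> float:
    """Closed-form Gaussian mSPRT e-value (N(0, tau^2) mixture)."""
    prec = n / sigma**2 + 1 / tau**2          # posterior precision
    a = n * x_bar / sigma**2
    return float(np.sqrt((1 / tau**2) / prec) * np.exp(a**2 / (2 * prec)))

def msprt_evalue_numeric(x_bar: float, n: int, sigma: float, tau: float) -> float:
    """Same quantity by integrating the likelihood ratio over the prior."""
    def integrand(mu: float) -> float:
        log_lr = (n * x_bar * mu - n * mu**2 / 2) / sigma**2
        prior = np.exp(-mu**2 / (2 * tau**2)) / np.sqrt(2 * np.pi * tau**2)
        return np.exp(log_lr) * prior
    value, _ = quad(integrand, -10 * tau, 10 * tau)
    return value
```

Small \tau concentrates the mixture near zero, making the test sensitive to small effects but slower on large ones, which is why \tau is matched to the minimum effect of interest.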

Bayesian sequential testing. The posterior probability P(H_1 \mid X_n) correctly represents uncertainty at any point. Stopping when the posterior exceeds 0.95 is always-valid in spirit, but it does not automatically control Type I error unless the prior is calibrated. Use e-values when calibrated priors are unavailable.

Walkthrough

Scenario: Monitoring a conversion rate experiment. Null: p_{treatment} = p_{control}.

Step 1: Compute e-values for each new batch of observations.

python
import numpy as np
from scipy.special import betaln
 
def evalue_proportion(
    x_treat: int, n_treat: int,
    x_control: int, n_control: int,
    rho: float = 0.5,
) -> float:
    """E-value for two-proportion test using beta-binomial mixture."""
    # Bayes factor: independent Beta(rho, rho) priors on each arm under H1
    # vs. a shared Beta(2*rho, 2*rho) prior on the common rate under H0.
    log_e = (
        betaln(x_treat + rho, n_treat - x_treat + rho) +
        betaln(x_control + rho, n_control - x_control + rho) -
        2 * betaln(rho, rho) -
        betaln(x_treat + x_control + 2*rho,
               n_treat + n_control - x_treat - x_control + 2*rho) +
        betaln(2*rho, 2*rho)
    )
    return float(np.exp(log_e))
 
 
def confidence_sequence_proportion(
    n: int,
    p_hat: float,
    alpha: float = 0.05,
) -> tuple[float, float]:
    """Anytime-valid CI for a proportion using Robbins-Siegmund CS."""
    if n == 0:
        return (0.0, 1.0)
    p = np.clip(p_hat, 1e-6, 1 - 1e-6)
    width = np.sqrt(
        2 * p * (1 - p) / n * np.log(np.log(max(2 * n, 3)) / alpha)
    )
    return (max(p_hat - width, 0.0), min(p_hat + width, 1.0))

Step 2: Monitor the e-process and confidence sequence.

python
def run_sequential_monitor(
    outcomes_treat: np.ndarray,    # 0/1 per observation
    outcomes_control: np.ndarray,
    alpha: float = 0.05,
    rho: float = 0.5,
) -> dict:
    """Run always-valid sequential monitor, return first stopping time."""
    threshold = 1.0 / alpha
    e_process = 1.0
    x_t, n_t, x_c, n_c = 0, 0, 0, 0
    stop_n = None
    n_obs = min(len(outcomes_treat), len(outcomes_control))
 
    for n in range(1, n_obs + 1):
        x_t += int(outcomes_treat[n - 1])
        n_t += 1
        x_c += int(outcomes_control[n - 1])
        n_c += 1
        # evalue_proportion is computed on *cumulative* counts, so its value
        # is already the e-process at time n; do not multiply successive
        # cumulative values together (that double-counts the data).
        e_process = evalue_proportion(x_t, n_t, x_c, n_c, rho)
        if stop_n is None and e_process >= threshold:
            stop_n = n
 
    return {
        'final_e_process': round(e_process, 4),
        'threshold': threshold,
        'stopped_at_n': stop_n,
        'reject': e_process >= threshold,
    }

Comparison to fixed-horizon. Run the same data through a fixed-horizon z-test. With 5 intermediate peeks, the naive test inflates to \alpha \approx 0.14. The e-process stops when evidence is strong enough, at any sample size, with Type I error still controlled at \alpha = 0.05.
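A self-contained sketch of that comparison under a real effect; the conversion rates, peek interval, and the Beta(1/2, 1/2)-mixture e-value mirror the walkthrough, and all numbers are illustrative:

```python
import numpy as np
from scipy.special import betaln

def log_bf(x_t: int, n_t: int, x_c: int, n_c: int, rho: float = 0.5) -> float:
    """Log Bayes factor on cumulative two-arm counts (beta-binomial mixture)."""
    return float(
        betaln(x_t + rho, n_t - x_t + rho)
        + betaln(x_c + rho, n_c - x_c + rho)
        - 2 * betaln(rho, rho)
        - betaln(x_t + x_c + 2 * rho, n_t + n_c - x_t - x_c + 2 * rho)
        + betaln(2 * rho, 2 * rho)
    )

rng = np.random.default_rng(3)
p_control, p_treat, n_max = 0.30, 0.40, 4_000
treat = rng.binomial(1, p_treat, n_max)
control = rng.binomial(1, p_control, n_max)

stop_n = None
for n in range(100, n_max + 1, 100):       # "peek" every 100 pairs
    if log_bf(int(treat[:n].sum()), n, int(control[:n].sum()), n) >= np.log(20):
        stop_n = n                          # evidence crossed 1/alpha = 20
        break
# With a genuine 10-point lift the e-process typically stops well before
# n_max, and the repeated peeks cost nothing in Type I error.
```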

Analysis & Evaluation

Where your intuition breaks. Always-valid tests sound better in every way — so why use fixed-horizon tests at all? The answer is power. For the same sample size, a confidence sequence is always wider than a fixed-horizon CI: the always-valid guarantee costs statistical efficiency. At the planned end of the experiment, an always-valid test typically requires 1.3–2× more data than a fixed-horizon test to achieve the same power. Use always-valid methods when you need to peek early; use fixed-horizon when you can commit to an end date.

Method                  Valid at fixed n   Valid under peeking   Power           When to use
Fixed-horizon p-value   Yes                No                    High            Can commit to end date
O'Brien-Fleming         Yes                Yes (pre-planned)     High            Know peek schedule in advance
E-value / mSPRT         Yes                Yes (any time)        ~70–80% of FH   Continuous monitoring
Confidence sequence     Yes                Yes (any time)        ~70–80% of FH   Always-valid interval

Common misuse. E-values are not p-values. An e-value of 20 does not mean p = 0.05. The relationship is P_0(E \geq 1/\alpha) \leq \alpha: an e-value of 20 gives \alpha = 1/20 = 0.05 error control in the sense of Markov's inequality, not a Neyman-Pearson p-value.

⚠️Warning

"Always-valid" does not mean "always correct." An e-value process correctly controls false positives at any stopping time. But if you run 100 experiments and always reject when E20E \geq 20, your experiment-level FDR is 5%. This is correct frequentist control — but do not interpret a single e-value as a within-experiment running probability.

Production-Ready Code

python
"""
Always-valid sequential testing production system.
E-value process, confidence sequences, and monitoring
with immutable state updates for streaming contexts.
"""
 
from __future__ import annotations
from dataclasses import dataclass, field
import numpy as np
from scipy.special import betaln
 
 
@dataclass
class SequentialMonitor:
    """Production always-valid experiment monitor.
 
    Immutable update pattern: call .update() to get a new monitor
    with accumulated state. Safe for use in streaming/Kafka pipelines.
    """
    alpha: float = 0.05
    rho: float = 0.5
    x_treat: int = 0
    n_treat: int = 0
    x_control: int = 0
    n_control: int = 0
    e_process: float = 1.0
    stopped: bool = False
    stop_n: int | None = None
    _history: list[dict] = field(default_factory=list, repr=False)
 
    @property
    def threshold(self) -> float:
        return 1.0 / self.alpha
 
    def update(
        self,
        treat_conversions: int,
        treat_n: int,
        control_conversions: int,
        control_n: int,
    ) -> 'SequentialMonitor':
        """Return a new monitor with updated state."""
        new_x_t = self.x_treat + treat_conversions
        new_n_t = self.n_treat + treat_n
        new_x_c = self.x_control + control_conversions
        new_n_c = self.n_control + control_n
        total_n = new_n_t + new_n_c
 
        log_e = (
            betaln(new_x_t + self.rho, new_n_t - new_x_t + self.rho) +
            betaln(new_x_c + self.rho, new_n_c - new_x_c + self.rho) -
            2 * betaln(self.rho, self.rho) -
            betaln(new_x_t + new_x_c + 2*self.rho,
                   new_n_t + new_n_c - new_x_t - new_x_c + 2*self.rho) +
            betaln(2*self.rho, 2*self.rho)
        )
        # log_e is the Bayes factor on *cumulative* counts, so it is
        # already the e-process; do not multiply per-update values.
        new_e_process = float(np.exp(log_e))
        new_stopped = self.stopped or (new_e_process >= self.threshold)
        new_stop_n = (
            self.stop_n if self.stopped
            else (total_n if new_e_process >= self.threshold else None)
        )
 
        return SequentialMonitor(
            alpha=self.alpha, rho=self.rho,
            x_treat=new_x_t, n_treat=new_n_t,
            x_control=new_x_c, n_control=new_n_c,
            e_process=new_e_process,
            stopped=new_stopped, stop_n=new_stop_n,
            _history=self._history + [{
                'n': total_n,
                'e_process': round(new_e_process, 4),
                'p_treat': round(new_x_t / max(new_n_t, 1), 4),
                'p_control': round(new_x_c / max(new_n_c, 1), 4),
                'stopped': new_stopped,
            }],
        )
 
    def confidence_sequence(self) -> dict:
        """Return current always-valid CI for rate difference."""
        def cs_width(x: int, n: int, alpha: float) -> float:
            if n == 0:
                return 0.5
            p = np.clip(x / n, 1e-6, 1 - 1e-6)
            return float(np.sqrt(
                2 * p * (1 - p) / n * np.log(np.log(max(2 * n, 3)) / alpha)
            ))

        p_t = self.x_treat / max(self.n_treat, 1)
        p_c = self.x_control / max(self.n_control, 1)
        diff = p_t - p_c
        # Union bound: alpha/2 per arm, and the per-arm widths add for
        # the difference (conservative but valid at every n).
        width = (
            cs_width(self.x_treat, self.n_treat, self.alpha / 2) +
            cs_width(self.x_control, self.n_control, self.alpha / 2)
        )
        return {
            'estimate': round(diff, 6),
            'lower': round(diff - width, 6),
            'upper': round(diff + width, 6),
            'contains_zero': (diff - width) < 0 < (diff + width),
        }
 
    def summary(self) -> dict:
        cs = self.confidence_sequence()
        return {
            'n_treat': self.n_treat,
            'n_control': self.n_control,
            'e_process': round(self.e_process, 4),
            'threshold': self.threshold,
            'decision': 'REJECT H0' if self.stopped else 'CONTINUE',
            'stop_n': self.stop_n,
            'rate_diff': cs,
        }
