Neural-Path/Notes

Always-Valid Sequential Testing

Every practitioner who runs A/B tests eventually does it: check the experiment dashboard before the planned end date, see p < 0.05, and ship. This is peeking, and it inflates the false positive rate from 5% to roughly 14% with five peeks, and toward 100% with unlimited peeking. O'Brien-Fleming boundaries fix this, but only if you pre-commit to exact peek times. What if you want to monitor an experiment continuously and stop whenever the evidence is convincing? That requires a fundamentally different statistical object: always-valid inference.

Theory

Figure: effect estimate versus samples collected, comparing the running estimate, a fixed CI, and an always-valid confidence sequence (CS).

Why standard p-values break under optional stopping. A p-value is valid at a single fixed sample size. If you commit to stopping as soon as p < 0.05, whenever that happens, you have changed the stopping rule. Under the null hypothesis the p-value is uniform on [0, 1] at any fixed sample size, but its running minimum over time is stochastically smaller than uniform. With an unlimited horizon, the probability of ever seeing p < \alpha under the null approaches 1.
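The inflation is easy to reproduce by simulation. A minimal sketch (the function name, batch sizes, and trial counts are illustrative): run many A/A experiments under the null, peek at a two-sided z-test at five evenly spaced checkpoints, and count how often any peek rejects.

```python
import numpy as np

def peeking_false_positive_rate(
    n_trials: int = 4000, n_max: int = 1000, n_peeks: int = 5, seed: int = 0
) -> float:
    """Fraction of null (A/A) experiments rejected when peeking at a z-test."""
    rng = np.random.default_rng(seed)
    peeks = np.linspace(n_max // n_peeks, n_max, n_peeks).astype(int)
    rejections = 0
    for _ in range(n_trials):
        x = rng.standard_normal(n_max)                 # null: mean 0, known sd 1
        z = np.cumsum(x)[peeks - 1] / np.sqrt(peeks)   # z-stat at each peek
        rejections += bool(np.any(np.abs(z) > 1.96))   # reject if ANY peek fires
    return rejections / n_trials

fpr = peeking_false_positive_rate()   # inflated well above the nominal 0.05
```

Each individual peek has a 5% false positive rate; the union over five peeks lands near 14%, in line with the classic repeated-significance-testing numbers.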

E-values: the currency of always-valid inference. An e-value E is a non-negative random variable with \mathbb{E}_0[E] \leq 1 under the null. By Markov's inequality, P_0(E \geq 1/\alpha) \leq \alpha. The crucial property: e-values compose under optional stopping. If E_1, E_2, \ldots are e-values computed from independent batches of data, then E_n = \prod_{i=1}^n E_i is still an e-value. You can accumulate evidence continuously and stop whenever E_n \geq 1/\alpha; the false positive rate is still controlled at \alpha.
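Both properties can be checked numerically with the simplest e-value there is, a likelihood ratio. A sketch under the null (the value of \theta is an arbitrary choice, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 0.4                       # any fixed alternative mean works
n_sims, n_batches = 20_000, 5

# Per-batch e-value: the N(theta, 1) vs N(0, 1) likelihood ratio.
x = rng.standard_normal((n_sims, n_batches))   # data generated under H0
e = np.exp(theta * x - theta**2 / 2)           # E_0[e] = 1 for each batch

mean_single = e[:, 0].mean()          # ~1: a single likelihood ratio is an e-value
mean_product = e.prod(axis=1).mean()  # ~1: the product across batches still is
```

Markov's inequality applied to the running product is exactly the stopping rule described above.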

Why it had to be this way. The classical Neyman-Pearson guarantee holds only at a single fixed sample size. E-values escape this through the martingale structure: the running product E_n = \prod_{i \leq n} E_i is a non-negative supermartingale under the null, and Ville's inequality, a time-uniform version of Markov's inequality, gives P_0\!\left(\sup_n E_n \geq 1/\alpha\right) \leq \alpha.
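Ville's bound can also be checked by simulation: track the running maximum of a likelihood-ratio e-process over a long horizon under the null and measure how often it ever reaches 1/\alpha (the value of \theta and the horizon are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, theta = 0.05, 0.3
n_sims, horizon = 4_000, 500

x = rng.standard_normal((n_sims, horizon))            # H0 data
log_e = np.cumsum(theta * x - theta**2 / 2, axis=1)   # log e-process paths
crossing_rate = (log_e.max(axis=1) >= np.log(1 / alpha)).mean()
# crossing_rate stays at or below alpha even though we scan the whole horizon
```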

Confidence sequences. A confidence sequence \{C_n\}_{n \geq 1} satisfies:

P\!\left(\forall n \geq 1: \mu \in C_n\right) \geq 1 - \alpha

This is a time-uniform guarantee: the true parameter is simultaneously covered at every sample size. A fixed-horizon CI only guarantees coverage at the planned end point. The confidence sequence is always wider than the fixed CI (it has more to cover), but it narrows as n grows, and it is valid whenever you look.
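A sketch of that width penalty, comparing a fixed-n normal interval against an iterated-logarithm-style CS half-width (one common Robbins-type form; the exact constants vary by construction, and the variance here is the Bernoulli worst case):

```python
import numpy as np

def fixed_ci_halfwidth(n: int, var: float = 0.25) -> float:
    """Standard fixed-horizon normal interval (alpha = 0.05), valid only at the planned n."""
    return 1.96 * float(np.sqrt(var / n))

def cs_halfwidth(n: int, var: float = 0.25, alpha: float = 0.05) -> float:
    """Iterated-logarithm-style confidence sequence half-width."""
    return float(np.sqrt(2 * var / n * np.log(np.log(max(2 * n, 3)) / alpha)))

ratios = {n: cs_halfwidth(n) / fixed_ci_halfwidth(n) for n in (100, 1_000, 10_000)}
# The CS is roughly 1.5-1.7x wider at every n here; both shrink like 1/sqrt(n).
```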

mSPRT (mixture Sequential Probability Ratio Test). For testing a Gaussian mean against zero with a N(0, \tau^2) mixture over the alternative, the e-value at step n is:

E_n = \sqrt{\frac{\tau^{-2}}{n\sigma^{-2} + \tau^{-2}}} \exp\!\left(\frac{\bar{X}_n^2 \, n^2 \sigma^{-4}}{2\left(n\sigma^{-2} + \tau^{-2}\right)}\right)

where \tau is the mixing parameter (the prior SD on the effect size). Choose \tau near the minimum effect size you care about detecting.
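The mSPRT e-value is just the Gaussian-mixture likelihood ratio \int \frac{\prod_i N(x_i;\, \mu, \sigma^2)}{\prod_i N(x_i;\, 0, \sigma^2)}\, N(\mu;\, 0, \tau^2)\, d\mu, so the closed form can be sanity-checked against direct numerical integration. A sketch (the function names are mine):

```python
import numpy as np
from scipy.integrate import quad

def msprt_evalue(x_bar: float, n: int, sigma: float, tau: float) -> float:
    """Closed-form Gaussian mSPRT e-value (N(0, tau^2) mixture)."""
    prec = n / sigma**2 + 1 / tau**2          # posterior precision
    a = n * x_bar / sigma**2
    return float(np.sqrt((1 / tau**2) / prec) * np.exp(a**2 / (2 * prec)))

def msprt_evalue_numeric(x_bar: float, n: int, sigma: float, tau: float) -> float:
    """Same quantity by integrating the likelihood ratio over the prior."""
    def integrand(mu: float) -> float:
        log_lr = (n * x_bar * mu - n * mu**2 / 2) / sigma**2
        prior = np.exp(-mu**2 / (2 * tau**2)) / np.sqrt(2 * np.pi * tau**2)
        return np.exp(log_lr) * prior
    value, _ = quad(integrand, -10 * tau, 10 * tau)
    return value
```

Small \tau concentrates the mixture near zero, making the test sensitive to small effects but slower on large ones, which is why \tau is matched to the minimum effect of interest.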

Bayesian sequential testing. The posterior probability P(H_1 \mid X_n) correctly represents uncertainty at any point. Stopping when the posterior exceeds 0.95 is always-valid in spirit, but it does not automatically control Type I error unless the prior is calibrated. Use e-values when calibrated priors are unavailable.

Walkthrough

Scenario: Monitoring a conversion rate experiment. Null: p_{treatment} = p_{control}.

Step 1: Compute e-values for each new batch of observations.

python
import numpy as np
from scipy.special import betaln
 
def evalue_proportion(
    x_treat: int, n_treat: int,
    x_control: int, n_control: int,
    rho: float = 0.5,
) -> float:
    """E-value for two-proportion test using beta-binomial mixture."""
    # Bayes factor: independent Beta(rho, rho) priors on each arm under H1
    # vs. a shared Beta(2*rho, 2*rho) prior on the common rate under H0.
    log_e = (
        betaln(x_treat + rho, n_treat - x_treat + rho) +
        betaln(x_control + rho, n_control - x_control + rho) -
        2 * betaln(rho, rho) -
        betaln(x_treat + x_control + 2*rho,
               n_treat + n_control - x_treat - x_control + 2*rho) +
        betaln(2*rho, 2*rho)
    )
    return float(np.exp(log_e))
 
 
def confidence_sequence_proportion(
    n: int,
    p_hat: float,
    alpha: float = 0.05,
) -> tuple[float, float]:
    """Anytime-valid CI for a proportion using Robbins-Siegmund CS."""
    if n == 0:
        return (0.0, 1.0)
    p = np.clip(p_hat, 1e-6, 1 - 1e-6)
    width = np.sqrt(
        2 * p * (1 - p) / n * np.log(np.log(max(2 * n, 3)) / alpha)
    )
    return (max(p_hat - width, 0.0), min(p_hat + width, 1.0))

Step 2: Monitor the e-process and confidence sequence.

python
def run_sequential_monitor(
    outcomes_treat: np.ndarray,    # 0/1 per observation
    outcomes_control: np.ndarray,
    alpha: float = 0.05,
    rho: float = 0.5,
) -> dict:
    """Run always-valid sequential monitor, return first stopping time."""
    threshold = 1.0 / alpha
    e_process = 1.0
    x_t, n_t, x_c, n_c = 0, 0, 0, 0
    stop_n = None
    n_obs = min(len(outcomes_treat), len(outcomes_control))
 
    for n in range(1, n_obs + 1):
        x_t += int(outcomes_treat[n - 1])
        n_t += 1
        x_c += int(outcomes_control[n - 1])
        n_c += 1
        # evalue_proportion is computed on *cumulative* counts, so its value
        # is already the e-process at time n; do not multiply successive
        # cumulative values together (that double-counts the data).
        e_process = evalue_proportion(x_t, n_t, x_c, n_c, rho)
        if stop_n is None and e_process >= threshold:
            stop_n = n
 
    return {
        'final_e_process': round(e_process, 4),
        'threshold': threshold,
        'stopped_at_n': stop_n,
        'reject': e_process >= threshold,
    }

Comparison to fixed-horizon. Run the same data through a fixed-horizon z-test. With 5 intermediate peeks, the naive test inflates to \alpha \approx 0.14. The e-process stops when evidence is strong enough, at any sample size, with Type I error still controlled at \alpha = 0.05.
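A self-contained sketch of that comparison under a real effect; the conversion rates, peek interval, and the Beta(1/2, 1/2)-mixture e-value mirror the walkthrough, and all numbers are illustrative:

```python
import numpy as np
from scipy.special import betaln

def log_bf(x_t: int, n_t: int, x_c: int, n_c: int, rho: float = 0.5) -> float:
    """Log Bayes factor on cumulative two-arm counts (beta-binomial mixture)."""
    return float(
        betaln(x_t + rho, n_t - x_t + rho)
        + betaln(x_c + rho, n_c - x_c + rho)
        - 2 * betaln(rho, rho)
        - betaln(x_t + x_c + 2 * rho, n_t + n_c - x_t - x_c + 2 * rho)
        + betaln(2 * rho, 2 * rho)
    )

rng = np.random.default_rng(3)
p_control, p_treat, n_max = 0.30, 0.40, 4_000
treat = rng.binomial(1, p_treat, n_max)
control = rng.binomial(1, p_control, n_max)

stop_n = None
for n in range(100, n_max + 1, 100):       # "peek" every 100 pairs
    if log_bf(int(treat[:n].sum()), n, int(control[:n].sum()), n) >= np.log(20):
        stop_n = n                          # evidence crossed 1/alpha = 20
        break
# With a genuine 10-point lift the e-process typically stops well before
# n_max, and the repeated peeks cost nothing in Type I error.
```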

Analysis & Evaluation

Where your intuition breaks. Always-valid tests sound better in every way — so why use fixed-horizon tests at all? The answer is power. For the same sample size, a confidence sequence is always wider than a fixed-horizon CI: the always-valid guarantee costs statistical efficiency. At the planned end of the experiment, an always-valid test typically requires 1.3–2× more data than a fixed-horizon test to achieve the same power. Use always-valid methods when you need to peek early; use fixed-horizon when you can commit to an end date.

Method                  Valid at fixed n   Valid under peeking   Power           When to use
Fixed-horizon p-value   Yes                No                    High            Can commit to end date
O'Brien-Fleming         Yes                Yes (pre-planned)     High            Know peek schedule in advance
E-value / mSPRT         Yes                Yes (any time)        ~70–80% of FH   Continuous monitoring
Confidence sequence     Yes                Yes (any time)        ~70–80% of FH   Always-valid interval

Common misuse. E-values are not p-values. An e-value of 20 does not mean p = 0.05. The relationship is P_0(E \geq 1/\alpha) \leq \alpha: an e-value of 20 gives \alpha = 1/20 = 0.05 error control in the sense of Markov's inequality, not a Neyman-Pearson p-value.

⚠️Warning

"Always-valid" does not mean "always correct." An e-value process correctly controls false positives at any stopping time. But if you run 100 experiments and always reject when E20E \geq 20, your experiment-level FDR is 5%. This is correct frequentist control — but do not interpret a single e-value as a within-experiment running probability.

Production-Ready Code

python
"""
Always-valid sequential testing production system.
E-value process, confidence sequences, and monitoring
with immutable state updates for streaming contexts.
"""
 
from __future__ import annotations
from dataclasses import dataclass, field
import numpy as np
from scipy.special import betaln
 
 
@dataclass
class SequentialMonitor:
    """Production always-valid experiment monitor.
 
    Immutable update pattern: call .update() to get a new monitor
    with accumulated state. Safe for use in streaming/Kafka pipelines.
    """
    alpha: float = 0.05
    rho: float = 0.5
    x_treat: int = 0
    n_treat: int = 0
    x_control: int = 0
    n_control: int = 0
    e_process: float = 1.0
    stopped: bool = False
    stop_n: int | None = None
    _history: list[dict] = field(default_factory=list, repr=False)
 
    @property
    def threshold(self) -> float:
        return 1.0 / self.alpha
 
    def update(
        self,
        treat_conversions: int,
        treat_n: int,
        control_conversions: int,
        control_n: int,
    ) -> 'SequentialMonitor':
        """Return a new monitor with updated state."""
        new_x_t = self.x_treat + treat_conversions
        new_n_t = self.n_treat + treat_n
        new_x_c = self.x_control + control_conversions
        new_n_c = self.n_control + control_n
        total_n = new_n_t + new_n_c
 
        log_e = (
            betaln(new_x_t + self.rho, new_n_t - new_x_t + self.rho) +
            betaln(new_x_c + self.rho, new_n_c - new_x_c + self.rho) -
            2 * betaln(self.rho, self.rho) -
            betaln(new_x_t + new_x_c + 2*self.rho,
                   new_n_t + new_n_c - new_x_t - new_x_c + 2*self.rho) +
            betaln(2*self.rho, 2*self.rho)
        )
        # log_e is the Bayes factor on *cumulative* counts, so it is
        # already the e-process; do not multiply per-update values.
        new_e_process = float(np.exp(log_e))
        new_stopped = self.stopped or (new_e_process >= self.threshold)
        new_stop_n = (
            self.stop_n if self.stopped
            else (total_n if new_e_process >= self.threshold else None)
        )
 
        return SequentialMonitor(
            alpha=self.alpha, rho=self.rho,
            x_treat=new_x_t, n_treat=new_n_t,
            x_control=new_x_c, n_control=new_n_c,
            e_process=new_e_process,
            stopped=new_stopped, stop_n=new_stop_n,
            _history=self._history + [{
                'n': total_n,
                'e_process': round(new_e_process, 4),
                'p_treat': round(new_x_t / max(new_n_t, 1), 4),
                'p_control': round(new_x_c / max(new_n_c, 1), 4),
                'stopped': new_stopped,
            }],
        )
 
    def confidence_sequence(self) -> dict:
        """Return current always-valid CI for rate difference."""
        def cs_width(x: int, n: int, alpha: float) -> float:
            if n == 0:
                return 0.5
            p = np.clip(x / n, 1e-6, 1 - 1e-6)
            return float(np.sqrt(
                2 * p * (1 - p) / n * np.log(np.log(max(2 * n, 3)) / alpha)
            ))

        p_t = self.x_treat / max(self.n_treat, 1)
        p_c = self.x_control / max(self.n_control, 1)
        diff = p_t - p_c
        # Union bound: alpha/2 per arm, and the per-arm widths add for
        # the difference (conservative but valid at every n).
        width = (
            cs_width(self.x_treat, self.n_treat, self.alpha / 2) +
            cs_width(self.x_control, self.n_control, self.alpha / 2)
        )
        return {
            'estimate': round(diff, 6),
            'lower': round(diff - width, 6),
            'upper': round(diff + width, 6),
            'contains_zero': (diff - width) < 0 < (diff + width),
        }
 
    def summary(self) -> dict:
        cs = self.confidence_sequence()
        return {
            'n_treat': self.n_treat,
            'n_control': self.n_control,
            'e_process': round(self.e_process, 4),
            'threshold': self.threshold,
            'decision': 'REJECT H0' if self.stopped else 'CONTINUE',
            'stop_n': self.stop_n,
            'rate_diff': cs,
        }
