
Hypothesis Testing: Neyman-Pearson, Likelihood Ratio Tests & Multiple Testing

Hypothesis testing formalizes the question "is this effect real or noise?" — the Neyman-Pearson lemma identifies the most powerful test for any given significance level, likelihood ratio tests extend this to composite hypotheses, and multiple testing corrections prevent false discovery from multiplying across thousands of simultaneous tests.

Concepts

[Interactive figure: sampling distributions of the mean for an A/B test — Control (A) vs Treatment (B), with the critical threshold at z = 1.96; an example run with z = 5.00, p < 0.001, power = 100% (significant); and a power-vs-sample-size curve for effect size δ = 0.50σ, α = 0.05. Amber dot = current n; green dashed line = 80% power target (industry standard). Power ∝ √n × δ.]

Every time an A/B test is run to compare two product variants, the p-value answers one precise question: how often would we see a difference this large or larger by chance alone, if the two variants were actually identical? Hypothesis testing formalizes "probably real vs. probably noise" — and the Neyman-Pearson lemma proves which test draws this distinction as powerfully as possible for any given significance level.

Testing Setup: Error Types and Power

A hypothesis test specifies a null hypothesis $H_0$ and an alternative hypothesis $H_1$. A test is a function $\phi(X) \in [0, 1]$ giving the probability of rejecting $H_0$ (deterministic tests have $\phi \in \{0,1\}$).

Error types:

|              | $H_0$ true                                  | $H_1$ true                           |
|--------------|---------------------------------------------|--------------------------------------|
| Accept $H_0$ | Correct ✓                                   | Type II error (miss), prob $\beta$   |
| Reject $H_0$ | Type I error (false alarm), prob $\alpha$   | Correct ✓                            |

  • Size (significance level): $\alpha = \sup_{\theta \in H_0} \mathbb{E}_\theta[\phi(X)]$
  • Power: $\beta_\phi(\theta) = \mathbb{E}_\theta[\phi(X)]$ for $\theta \in H_1$; the power function shows power across all $\theta$
  • Type II error rate: $\beta = 1 - \text{power}$

The testing problem is: given a fixed significance level $\alpha$, find the test $\phi^*$ that maximizes power.

The asymmetry — fixing the Type I rate $\alpha$ first and then maximizing power — is not an arbitrary convention. It encodes the cost structure of discovery: a false alarm triggers costly interventions, retractions, or deployed changes; a missed detection merely delays discovery. Fixing $\alpha$ is the mathematical statement of "control the worst-case outcome first, then optimize." This is formally identical to constrained optimization: the Neyman-Pearson framework finds the most powerful test subject to a hard upper bound on false alarms.

Neyman-Pearson Lemma

For simple hypotheses $H_0: \theta = \theta_0$ vs $H_1: \theta = \theta_1$, the most powerful level-$\alpha$ test rejects when the likelihood ratio exceeds a threshold:

$$\phi^*(x) = \begin{cases} 1 & \text{if } \Lambda(x) > \kappa \\ \gamma & \text{if } \Lambda(x) = \kappa \\ 0 & \text{if } \Lambda(x) < \kappa \end{cases}, \quad \Lambda(x) = \frac{p(x;\theta_1)}{p(x;\theta_0)},$$

where $\kappa$ and $\gamma$ are chosen so that $\mathbb{E}_{\theta_0}[\phi^*] = \alpha$ exactly.

Proof (variational). Suppose $\phi'$ is any other level-$\alpha$ test. Define $A = \{x : \Lambda(x) > \kappa\}$. Then:

$$\int (\phi^* - \phi')(p_1 - \kappa p_0)\,dx \geq 0,$$

because on $A$, $\phi^* \geq \phi'$ and $p_1 - \kappa p_0 > 0$; on $A^c$, $\phi^* \leq \phi'$ and $p_1 - \kappa p_0 \leq 0$. Rearranging:

$$\int \phi^* p_1 - \int \phi' p_1 \geq \kappa\left(\int \phi^* p_0 - \int \phi' p_0\right) \geq 0,$$

since $\int \phi^* p_0 = \alpha \geq \int \phi' p_0$ ($\phi^*$ has size exactly $\alpha$, while $\phi'$, being level-$\alpha$, has size at most $\alpha$, and $\kappa \geq 0$). Thus $\text{power}(\phi^*) \geq \text{power}(\phi')$. $\square$

The NP lemma says: any level-$\alpha$ test not based on the likelihood ratio can be matched or beaten by one that is.
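A minimal numerical sketch of the lemma (my own illustrative example, using `scipy`): test $H_0: X \sim \mathcal{N}(0,1)$ against $H_1: X \sim \mathcal{N}(1,1)$ from a single observation. Here $\Lambda(x) = e^{x - 1/2}$ is increasing in $x$, so "$\Lambda(x) > \kappa$" reduces to "$x > c$" for a normal quantile $c$.

```python
import numpy as np
from scipy import stats

# NP test for H0: X ~ N(0,1) vs H1: X ~ N(1,1), one observation.
# Lambda(x) = exp(x - 1/2) is monotone in x, so "Lambda(x) > kappa"
# is equivalent to "x > c"; the distribution is continuous, so gamma = 0.
alpha = 0.05
c = stats.norm.ppf(1 - alpha)            # threshold giving size alpha exactly

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, 100_000)       # draws under H0
x1 = rng.normal(1.0, 1.0, 100_000)       # draws under H1

size = np.mean(x0 > c)                   # Monte Carlo size, close to alpha
power = np.mean(x1 > c)                  # Monte Carlo power
exact_power = stats.norm.sf(c - 1.0)     # P(X > c) when X ~ N(1,1), about 0.26
```

Any other level-0.05 rejection region for this pair (e.g. rejecting for $|x|$ large) has smaller power, which is exactly the lemma's claim.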

Uniformly Most Powerful Tests

A test $\phi^*$ is uniformly most powerful (UMP) at level $\alpha$ if it is most powerful against every $\theta_1 \in H_1$.

UMP tests exist for one-sided alternatives in exponential families. For example, testing $H_0: \theta \leq \theta_0$ vs $H_1: \theta > \theta_0$ in a one-parameter exponential family with sufficient statistic $T$: reject when $T > c_\alpha$, where $c_\alpha$ is the $1-\alpha$ quantile of $T$ under $\theta_0$.
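As a sketch with illustrative numbers (using `scipy`), consider a Poisson rate, a one-parameter exponential family where $T = \sum_i X_i$ is sufficient; because $T$ is discrete, the non-randomized threshold is slightly conservative:

```python
from scipy import stats

# One-sided test for a Poisson rate: H0: lam <= 1 vs H1: lam > 1,
# with n = 10 observations, so the sufficient statistic T = sum(X_i) ~ Poisson(n*lam).
n, lam0, alpha = 10, 1.0, 0.05
c = stats.poisson.ppf(1 - alpha, n * lam0)   # 1-alpha quantile of T at the boundary
size = stats.poisson.sf(c, n * lam0)         # P(T > c | lam = 1), at most alpha
power = stats.poisson.sf(c, n * 1.5)         # power at lam = 1.5
```

The same threshold is most powerful against every $\lambda > 1$ simultaneously, which is what makes the test UMP; achieving size exactly $\alpha$ would require randomizing at $T = c$.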

UMP tests do not exist for two-sided alternatives $H_1: \theta \neq \theta_0$ in general (they would need to be simultaneously most powerful against both $\theta > \theta_0$ and $\theta < \theta_0$, which is contradictory).

Likelihood Ratio, Wald, and Score Tests

For a composite null $H_0: \theta \in \Theta_0 \subset \mathbb{R}^k$ vs $H_1: \theta \in \Theta_0^c$, three asymptotically equivalent tests have $\chi^2$ null distributions:

Likelihood Ratio Test (LRT):

$$\Lambda_n = \frac{\sup_{\theta \in \Theta_0} L(\theta)}{\sup_{\theta \in \Theta} L(\theta)}, \quad W = -2\log\Lambda_n \xrightarrow{d} \chi^2_r,$$

where $r = \dim(\Theta) - \dim(\Theta_0)$ is the number of constrained parameters. This is Wilks' theorem — the degrees of freedom equal the number of equality constraints imposed by $H_0$.

Wald test: $W_n = (\hat\theta - \theta_0)^T [n \, I(\hat\theta)] (\hat\theta - \theta_0) \xrightarrow{d} \chi^2_r$.

Score (Rao) test: $S_n = s_n(\theta_0)^T [n \, I(\theta_0)]^{-1} s_n(\theta_0) \xrightarrow{d} \chi^2_r$, evaluated at $\theta_0$ so the model need not be fit under the alternative.

All three are asymptotically equivalent under $H_0$ and under local alternatives. The LRT is most commonly used in practice for its robustness.
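A hand computation of all three statistics for a Bernoulli proportion (a hypothetical example, not from the text above): $n = 100$ trials, $k = 60$ successes, $H_0: p = 0.5$, using the Fisher information $I(p) = 1/(p(1-p))$.

```python
import numpy as np
from scipy import stats

# LRT, Wald, and score statistics for H0: p = 0.5 in a Bernoulli(p) model,
# with n = 100 trials and k = 60 successes.
n, k, p0 = 100, 60, 0.5
phat = k / n                                         # MLE

def loglik(p):
    return k * np.log(p) + (n - k) * np.log(1 - p)

W_lrt = 2 * (loglik(phat) - loglik(p0))              # -2 log Lambda_n
W_wald = (phat - p0) ** 2 * n / (phat * (1 - phat))  # information at the MLE
score = k / p0 - (n - k) / (1 - p0)                  # score function at p0
W_score = score ** 2 * p0 * (1 - p0) / n             # information at p0

crit = stats.chi2.ppf(0.95, df=1)                    # about 3.84
# W_lrt ~ 4.03, W_wald ~ 4.17, W_score = 4.00: close to each other, all reject.
```

The three values differ only in where the curvature of the log-likelihood is evaluated, which is why they coincide asymptotically.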

P-values

The p-value is $p = P_{\theta_0}(T \geq t_{\text{obs}})$ — the probability under $H_0$ of observing a test statistic at least as extreme as what was observed. Reject $H_0$ at level $\alpha$ iff $p \leq \alpha$.

Critical misinterpretations:

  • The p-value is not $P(H_0 \text{ is true} \mid \text{data})$ — that requires a prior.
  • A p-value of 0.04 does not mean there is a 4% chance the null is true.
  • A large p-value does not prove $H_0$; it only fails to reject it.

Multiple Testing Corrections

When $m$ independent tests are conducted simultaneously, the probability that at least one false positive occurs (the family-wise error rate, FWER) can be large even if each individual test uses $\alpha = 0.05$: $\text{FWER} = 1 - (1-\alpha)^m \approx m\alpha$ for small $\alpha$.
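Plugging numbers into the FWER formula (a quick sketch):

```python
# FWER for m independent tests, each run at per-test level alpha = 0.05.
alpha = 0.05
fwer = {m: 1 - (1 - alpha) ** m for m in (1, 10, 20, 100)}
# m = 20 alone pushes the chance of at least one false positive to about 0.64.
```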

Bonferroni correction: use threshold $\alpha/m$ per test. Controls FWER $\leq \alpha$ regardless of dependence structure. Conservative when tests are correlated.

False Discovery Rate (FDR) (Benjamini-Hochberg): the FDR is the expected proportion of false discoveries among all rejections. BH procedure: sort the p-values $p_{(1)} \leq \ldots \leq p_{(m)}$; reject all $p_{(k)}$ with $k \leq k^* = \max\{k : p_{(k)} \leq \alpha k/m\}$. Controls FDR $\leq \alpha$ when tests are independent (or positively dependent, PRDS).
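A direct implementation of the BH step-up procedure as described above (the p-values are made up for illustration):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of rejections under the BH step-up procedure."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                        # indices that sort the p-values
    thresholds = alpha * np.arange(1, m + 1) / m # alpha * k / m for k = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k_star = np.nonzero(below)[0].max()      # largest k with p_(k) <= alpha*k/m
        reject[order[: k_star + 1]] = True       # reject everything up to p_(k*)
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.35, 0.55, 0.62, 0.80, 0.95]
n_bh = benjamini_hochberg(pvals).sum()               # BH rejects 2
n_bonf = sum(p <= 0.05 / len(pvals) for p in pvals)  # Bonferroni rejects 1
```

Note the step-up logic: $p_{(3)} = 0.039$ and $p_{(4)} = 0.041$ fail their thresholds ($0.015$, $0.020$), so the procedure stops at $k^* = 2$, and everything up to $p_{(2)}$ is rejected.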

When to use which:

  • Medical trials, safety claims: FWER control (Bonferroni) — even a single false positive is costly
  • Genomics, large-scale screening: FDR (BH) — some false positives are acceptable if the overall discovery rate is controlled

Worked Example

Example 1: Gaussian One-Sample Test

$X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known. Test $H_0: \mu = \mu_0$ vs $H_1: \mu \neq \mu_0$.

LRT: $\Lambda_n = L(\mu_0)/L(\hat\mu)$. Under $H_0$: $-2\log\Lambda_n = n(\bar X - \mu_0)^2/\sigma^2 \sim \chi^2_1$. Equivalently: reject when $|Z| > z_{\alpha/2}$, where $Z = \sqrt{n}(\bar X - \mu_0)/\sigma \sim \mathcal{N}(0,1)$ under $H_0$.

Power: $\beta(\mu) = P_\mu(|Z| > z_{\alpha/2}) = P\left(\left|\sqrt{n}(\bar X - \mu)/\sigma + \sqrt{n}(\mu - \mu_0)/\sigma\right| > z_{\alpha/2}\right)$.

The non-centrality parameter is $\delta = \sqrt{n}\,|\mu - \mu_0|/\sigma$. Power increases with $\delta$ — larger samples, larger effect sizes, or smaller $\sigma$ all increase power.

Sample size for 80% power: need $z_{\alpha/2} + z_{0.2} = \delta = \sqrt{n}\,|\mu_1 - \mu_0|/\sigma$, so $n = (z_{\alpha/2} + z_\beta)^2 \sigma^2 / (\mu_1 - \mu_0)^2$. For $\alpha = 0.05$, $\beta = 0.2$: $(1.96 + 0.84)^2 \approx 7.85$.
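The sample-size formula above, evaluated for a hypothetical effect of half a standard deviation ($\sigma = 1$, $|\mu_1 - \mu_0| = 0.5$):

```python
import math
from scipy import stats

# n = (z_{alpha/2} + z_beta)^2 * sigma^2 / delta^2 for a two-sided z-test.
alpha, beta = 0.05, 0.20
z_a = stats.norm.ppf(1 - alpha / 2)   # 1.96
z_b = stats.norm.ppf(1 - beta)        # 0.84
sigma, delta = 1.0, 0.5               # detect a shift of half a standard deviation
n = math.ceil((z_a + z_b) ** 2 * sigma ** 2 / delta ** 2)   # rounds up to 32

# Achieved power at this n (ignoring the negligible far tail):
achieved = stats.norm.sf(z_a - math.sqrt(n) * delta / sigma)
```

Rounding $n$ up guarantees the achieved power is at least the 80% target.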

Example 2: Wilks' Theorem for the LRT in Logistic Regression

Testing whether $k$ coefficients in a logistic regression are jointly zero: $H_0: \beta_2 = \ldots = \beta_{k+1} = 0$.

Fit the full model (MLE $\hat\beta$) and the restricted model (MLE $\hat\beta_0$ under $H_0$). Compute:

$$W = 2[\ell(\hat\beta) - \ell(\hat\beta_0)] \xrightarrow{d} \chi^2_k.$$

This is the likelihood ratio chi-squared test reported by logistic regression software. It is preferred over Wald tests when sample sizes are moderate, since Wald statistics can be sensitive to parameterization.
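A self-contained sketch of this LRT on simulated data, fitting both logistic models by directly maximizing the log-likelihood with `scipy.optimize` rather than a regression package (data and coefficients are invented for illustration):

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 features
beta_true = np.array([-0.5, 1.0, 0.0])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

def negloglik(beta, X, y):
    eta = X @ beta
    return -np.sum(y * eta - np.log1p(np.exp(eta)))   # Bernoulli log-likelihood

# Full model: intercept + both slopes. Restricted model: H0 forces both slopes
# to zero, leaving the intercept only, so r = 2 constraints.
full = optimize.minimize(negloglik, np.zeros(3), args=(X, y))
restr = optimize.minimize(negloglik, np.zeros(1), args=(X[:, :1], y))

W = 2 * (restr.fun - full.fun)     # = 2 [loglik(full) - loglik(restricted)]
pval = stats.chi2.sf(W, df=2)      # compare to chi^2_2
```

Since one true slope is 1.0, the restricted model fits much worse and the test rejects decisively.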

Example 3: BH Procedure in Gene Expression

A gene expression study tests $m = 10{,}000$ genes for differential expression. With Bonferroni at $\alpha = 0.05$, each gene must achieve $p < 5 \times 10^{-6}$ — very stringent.

With BH at FDR = 0.05: sort the p-values $p_{(1)} \leq p_{(2)} \leq \ldots$ and reject the largest $k$ with $p_{(k)} \leq 0.05 \cdot k/10{,}000$. If the top 500 p-values are all $\leq 0.0025$, then $p_{(500)} \leq 0.05 \cdot 500/10{,}000 = 0.0025$ — all 500 are rejected. This allows discovery of real effects that Bonferroni would miss, at the cost of permitting up to 5% expected false discoveries among the rejections.

Connections

Where Your Intuition Breaks

The p-value is the most widely misinterpreted number in science. The correct statement — "probability of observing data this extreme if $H_0$ is true" — is subtly different from the natural reading "probability that $H_0$ is true given this data." These differ by Bayes' theorem: the posterior probability of $H_0$ requires a prior over whether $H_0$ is true, which the p-value does not provide. A p-value of 0.03 combined with a null hypothesis that is a priori likely true (say, 99% of tested genomic variants have no effect) can correspond to a false discovery rate above 50% — far from "only 3% chance of being wrong." Sequential testing compounds this: peeking at results mid-study and stopping when $p < 0.05$ inflates the true Type I error rate far above $\alpha$, even when the final test is reported as a single decision.

💡Intuition

The NP lemma says: use the likelihood ratio. Any test can be described by its rejection region in the sample space. The NP lemma proves that among all regions of fixed probability under $H_0$, the one that maximizes probability under $H_1$ is exactly the region where $p_1/p_0$ is largest. Intuitively: order data points by how much more likely they are under $H_1$ than under $H_0$, and put the most $H_1$-likely points in the rejection region. All common tests (t-test, chi-squared, F-test) are likelihood ratio tests for their respective parametric families.

💡Intuition

Wilks' theorem makes the LRT universally applicable. Instead of deriving the null distribution for each problem, Wilks' theorem says it is always $\chi^2_r$ asymptotically, where $r$ is the number of constraints. This is remarkable: regardless of the parametric family, the same $\chi^2$ table applies. The reason is that near the null, the log-likelihood is locally quadratic (by Taylor expansion around the MLE), and a constrained quadratic minimization over $r$ directions gives a $\chi^2_r$ distribution.

⚠️Warning

P-hacking inflates Type I errors even with correct individual tests. If a researcher tries 20 different outcome measures and reports the one with $p < 0.05$, the effective significance level is approximately $1 - 0.95^{20} \approx 0.64$. Pre-registration, multiple testing corrections, and replication requirements exist precisely because the p-value only controls error for the single pre-specified test, not for the exploration process. In ML: repeatedly tuning hyperparameters and reporting test accuracy inflates the apparent accuracy by the same mechanism — the test set has been implicitly used for selection.
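A quick simulation of the selection effect (under $H_0$ every p-value is Uniform(0,1), so we can draw them directly):

```python
import numpy as np

# 10,000 simulated studies, each testing 20 true-null outcome measures and
# reporting only the smallest p-value.
rng = np.random.default_rng(0)
pvals = rng.random((10_000, 20))                   # null p-values ~ Uniform(0,1)
at_least_one = np.mean(pvals.min(axis=1) < 0.05)   # close to 1 - 0.95**20 = 0.64
```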
