
Hypothesis Testing: Neyman-Pearson, Likelihood Ratio Tests & Multiple Testing

Hypothesis testing formalizes the question "is this effect real or noise?" — the Neyman-Pearson lemma identifies the most powerful test for any given significance level, likelihood ratio tests extend this to composite hypotheses, and multiple testing corrections prevent false discovery from multiplying across thousands of simultaneous tests.

Concepts

[Interactive figure: sampling distributions of the mean for an A/B test — Control (A) vs Treatment (B), with the critical threshold at z = 1.96; an example run with z = 5.00, p < 0.001, power = 100% (significant); and a power-vs-sample-size curve for effect size δ = 0.50σ, α = 0.05. Amber dot = current n; green dashed line = 80% power target (industry standard). Power ∝ √n × δ.]

Every time an A/B test is run to compare two product variants, the p-value answers one precise question: how often would we see a difference this large or larger by chance alone, if the two variants were actually identical? Hypothesis testing formalizes "probably real vs. probably noise" — and the Neyman-Pearson lemma proves which test draws this distinction as powerfully as possible for any given significance level.

Testing Setup: Error Types and Power

A hypothesis test specifies a null hypothesis $H_0$ and an alternative hypothesis $H_1$. A test is a function $\phi(X) \in [0, 1]$ giving the probability of rejecting $H_0$ (deterministic tests have $\phi \in \{0,1\}$).

Error types:

|              | $H_0$ true                                  | $H_1$ true                           |
|--------------|---------------------------------------------|--------------------------------------|
| Accept $H_0$ | Correct ✓                                   | Type II error (miss), prob $\beta$   |
| Reject $H_0$ | Type I error (false alarm), prob $\alpha$   | Correct ✓                            |

  • Size (significance level): $\alpha = \sup_{\theta \in H_0} \mathbb{E}_\theta[\phi(X)]$
  • Power: $\beta_\phi(\theta) = \mathbb{E}_\theta[\phi(X)]$ for $\theta \in H_1$; the power function shows power across all $\theta$
  • Type II error rate: $\beta = 1 - \text{power}$

The testing problem is: given a fixed significance level $\alpha$, find the test $\phi^*$ that maximizes power.

The asymmetry — fixing the Type I rate $\alpha$ first and then maximizing power — is not an arbitrary convention. It encodes the cost structure of discovery: a false alarm triggers costly interventions, retractions, or deployed changes; a missed detection merely delays discovery. Fixing $\alpha$ is the mathematical statement of "control the worst-case outcome first, then optimize." This is formally identical to constrained optimization: the Neyman-Pearson framework finds the most powerful test subject to a hard upper bound on false alarms.

Neyman-Pearson Lemma

For simple hypotheses $H_0: \theta = \theta_0$ vs $H_1: \theta = \theta_1$, the most powerful level-$\alpha$ test rejects when the likelihood ratio exceeds a threshold:

$$\phi^*(x) = \begin{cases} 1 & \text{if } \Lambda(x) > \kappa \\ \gamma & \text{if } \Lambda(x) = \kappa \\ 0 & \text{if } \Lambda(x) < \kappa \end{cases}, \quad \Lambda(x) = \frac{p(x;\theta_1)}{p(x;\theta_0)},$$

where $\kappa$ and $\gamma$ are chosen so that $\mathbb{E}_{\theta_0}[\phi^*] = \alpha$ exactly.

Proof (variational). Suppose $\phi'$ is any other level-$\alpha$ test. Define $A = \{x : \Lambda(x) > \kappa\}$. Then:

$$\int (\phi^* - \phi')(p_1 - \kappa p_0)\,dx \geq 0,$$

because on $A$, $\phi^* \geq \phi'$ and $p_1 - \kappa p_0 > 0$; on $A^c$, $\phi^* \leq \phi'$ and $p_1 - \kappa p_0 \leq 0$. Rearranging:

$$\int \phi^* p_1 - \int \phi' p_1 \geq \kappa\left(\int \phi^* p_0 - \int \phi' p_0\right) \geq 0,$$

since $\int \phi^* p_0 = \alpha \geq \int \phi' p_0$ ($\phi^*$ has size exactly $\alpha$, while $\phi'$, being level-$\alpha$, has size at most $\alpha$, and $\kappa \geq 0$). Thus $\text{power}(\phi^*) \geq \text{power}(\phi')$. $\square$

The NP lemma says: any level-$\alpha$ test not based on the likelihood ratio can be matched or beaten by one that is.
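A minimal numerical sketch of the lemma (my own illustrative example, using `scipy`): test $H_0: X \sim \mathcal{N}(0,1)$ against $H_1: X \sim \mathcal{N}(1,1)$ from a single observation. Here $\Lambda(x) = e^{x - 1/2}$ is increasing in $x$, so "$\Lambda(x) > \kappa$" reduces to "$x > c$" for a normal quantile $c$.

```python
import numpy as np
from scipy import stats

# NP test for H0: X ~ N(0,1) vs H1: X ~ N(1,1), one observation.
# Lambda(x) = exp(x - 1/2) is monotone in x, so "Lambda(x) > kappa"
# is equivalent to "x > c"; the distribution is continuous, so gamma = 0.
alpha = 0.05
c = stats.norm.ppf(1 - alpha)            # threshold giving size alpha exactly

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, 100_000)       # draws under H0
x1 = rng.normal(1.0, 1.0, 100_000)       # draws under H1

size = np.mean(x0 > c)                   # Monte Carlo size, close to alpha
power = np.mean(x1 > c)                  # Monte Carlo power
exact_power = stats.norm.sf(c - 1.0)     # P(X > c) when X ~ N(1,1), about 0.26
```

Any other level-0.05 rejection region for this pair (e.g. rejecting for $|x|$ large) has smaller power, which is exactly the lemma's claim.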

Uniformly Most Powerful Tests

A test $\phi^*$ is uniformly most powerful (UMP) at level $\alpha$ if it is most powerful against every $\theta_1 \in H_1$.

UMP tests exist for one-sided alternatives in exponential families. For example, testing $H_0: \theta \leq \theta_0$ vs $H_1: \theta > \theta_0$ in a one-parameter exponential family with sufficient statistic $T$: reject when $T > c_\alpha$, where $c_\alpha$ is the $1-\alpha$ quantile of $T$ under $\theta_0$.
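As a sketch with illustrative numbers (using `scipy`), consider a Poisson rate, a one-parameter exponential family where $T = \sum_i X_i$ is sufficient; because $T$ is discrete, the non-randomized threshold is slightly conservative:

```python
from scipy import stats

# One-sided test for a Poisson rate: H0: lam <= 1 vs H1: lam > 1,
# with n = 10 observations, so the sufficient statistic T = sum(X_i) ~ Poisson(n*lam).
n, lam0, alpha = 10, 1.0, 0.05
c = stats.poisson.ppf(1 - alpha, n * lam0)   # 1-alpha quantile of T at the boundary
size = stats.poisson.sf(c, n * lam0)         # P(T > c | lam = 1), at most alpha
power = stats.poisson.sf(c, n * 1.5)         # power at lam = 1.5
```

The same threshold is most powerful against every $\lambda > 1$ simultaneously, which is what makes the test UMP; achieving size exactly $\alpha$ would require randomizing at $T = c$.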

UMP tests do not exist for two-sided alternatives $H_1: \theta \neq \theta_0$ in general (they would need to be simultaneously most powerful against both $\theta > \theta_0$ and $\theta < \theta_0$, which is contradictory).

Likelihood Ratio, Wald, and Score Tests

For a composite null $H_0: \theta \in \Theta_0 \subset \mathbb{R}^k$ vs $H_1: \theta \in \Theta_0^c$, three asymptotically equivalent tests have $\chi^2$ null distributions:

Likelihood Ratio Test (LRT):

$$\Lambda_n = \frac{\sup_{\theta \in \Theta_0} L(\theta)}{\sup_{\theta \in \Theta} L(\theta)}, \quad W = -2\log\Lambda_n \xrightarrow{d} \chi^2_r,$$

where $r = \dim(\Theta) - \dim(\Theta_0)$ is the number of constrained parameters. This is Wilks' theorem — the degrees of freedom equal the number of equality constraints imposed by $H_0$.

Wald test: $W_n = (\hat\theta - \theta_0)^T [n \, I(\hat\theta)] (\hat\theta - \theta_0) \xrightarrow{d} \chi^2_r$.

Score (Rao) test: $S_n = s_n(\theta_0)^T [n \, I(\theta_0)]^{-1} s_n(\theta_0) \xrightarrow{d} \chi^2_r$, evaluated at $\theta_0$ so the model need not be fit under the alternative.

All three are asymptotically equivalent under $H_0$ and under local alternatives. The LRT is most commonly used in practice for its robustness.
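A hand computation of all three statistics for a Bernoulli proportion (a hypothetical example, not from the text above): $n = 100$ trials, $k = 60$ successes, $H_0: p = 0.5$, using the Fisher information $I(p) = 1/(p(1-p))$.

```python
import numpy as np
from scipy import stats

# LRT, Wald, and score statistics for H0: p = 0.5 in a Bernoulli(p) model,
# with n = 100 trials and k = 60 successes.
n, k, p0 = 100, 60, 0.5
phat = k / n                                         # MLE

def loglik(p):
    return k * np.log(p) + (n - k) * np.log(1 - p)

W_lrt = 2 * (loglik(phat) - loglik(p0))              # -2 log Lambda_n
W_wald = (phat - p0) ** 2 * n / (phat * (1 - phat))  # information at the MLE
score = k / p0 - (n - k) / (1 - p0)                  # score function at p0
W_score = score ** 2 * p0 * (1 - p0) / n             # information at p0

crit = stats.chi2.ppf(0.95, df=1)                    # about 3.84
# W_lrt ~ 4.03, W_wald ~ 4.17, W_score = 4.00: close to each other, all reject.
```

The three values differ only in where the curvature of the log-likelihood is evaluated, which is why they coincide asymptotically.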

P-values

The p-value is $p = P_{\theta_0}(T \geq t_{\text{obs}})$ — the probability under $H_0$ of observing a test statistic at least as extreme as what was observed. Reject $H_0$ at level $\alpha$ iff $p \leq \alpha$.

Critical misinterpretations:

  • The p-value is not $P(H_0 \text{ is true} \mid \text{data})$ — that requires a prior.
  • A p-value of 0.04 does not mean there is a 4% chance the null is true.
  • A large p-value does not prove $H_0$; it only fails to reject it.

Multiple Testing Corrections

When $m$ independent tests are conducted simultaneously, the probability that at least one false positive occurs (the family-wise error rate, FWER) can be large even if each individual test uses $\alpha = 0.05$: $\text{FWER} = 1 - (1-\alpha)^m \approx m\alpha$ for small $\alpha$.
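Plugging numbers into the FWER formula (a quick sketch):

```python
# FWER for m independent tests, each run at per-test level alpha = 0.05.
alpha = 0.05
fwer = {m: 1 - (1 - alpha) ** m for m in (1, 10, 20, 100)}
# m = 20 alone pushes the chance of at least one false positive to about 0.64.
```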

Bonferroni correction: use threshold $\alpha/m$ per test. Controls FWER $\leq \alpha$ regardless of dependence structure. Conservative when tests are correlated.

False Discovery Rate (FDR) (Benjamini-Hochberg): the FDR is the expected proportion of false discoveries among all rejections. BH procedure: sort the p-values $p_{(1)} \leq \ldots \leq p_{(m)}$; reject all $p_{(k)}$ with $k \leq k^* = \max\{k : p_{(k)} \leq \alpha k/m\}$. Controls FDR $\leq \alpha$ when tests are independent (or positively dependent, PRDS).
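A direct implementation of the BH step-up procedure as described above (the p-values are made up for illustration):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of rejections under the BH step-up procedure."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                        # indices that sort the p-values
    thresholds = alpha * np.arange(1, m + 1) / m # alpha * k / m for k = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k_star = np.nonzero(below)[0].max()      # largest k with p_(k) <= alpha*k/m
        reject[order[: k_star + 1]] = True       # reject everything up to p_(k*)
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.35, 0.55, 0.62, 0.80, 0.95]
n_bh = benjamini_hochberg(pvals).sum()               # BH rejects 2
n_bonf = sum(p <= 0.05 / len(pvals) for p in pvals)  # Bonferroni rejects 1
```

Note the step-up logic: $p_{(3)} = 0.039$ and $p_{(4)} = 0.041$ fail their thresholds ($0.015$, $0.020$), so the procedure stops at $k^* = 2$, and everything up to $p_{(2)}$ is rejected.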

When to use which:

  • Medical trials, safety claims: FWER control (Bonferroni) — even a single false positive is costly
  • Genomics, large-scale screening: FDR (BH) — some false positives are acceptable if the overall discovery rate is controlled

Worked Example

Example 1: Gaussian One-Sample Test

$X_1, \ldots, X_n \stackrel{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known. Test $H_0: \mu = \mu_0$ vs $H_1: \mu \neq \mu_0$.

LRT: $\Lambda_n = L(\mu_0)/L(\hat\mu)$. Under $H_0$: $-2\log\Lambda_n = n(\bar X - \mu_0)^2/\sigma^2 \sim \chi^2_1$. Equivalently: reject when $|Z| > z_{\alpha/2}$, where $Z = \sqrt{n}(\bar X - \mu_0)/\sigma \sim \mathcal{N}(0,1)$ under $H_0$.

Power: $\beta(\mu) = P_\mu(|Z| > z_{\alpha/2}) = P\left(\left|\sqrt{n}(\bar X - \mu)/\sigma + \sqrt{n}(\mu - \mu_0)/\sigma\right| > z_{\alpha/2}\right)$.

The non-centrality parameter is $\delta = \sqrt{n}\,|\mu - \mu_0|/\sigma$. Power increases with $\delta$ — larger samples, larger effect sizes, or smaller $\sigma$ all increase power.

Sample size for 80% power: need $z_{\alpha/2} + z_{0.2} = \delta = \sqrt{n}\,|\mu_1 - \mu_0|/\sigma$, so $n = (z_{\alpha/2} + z_\beta)^2 \sigma^2 / (\mu_1 - \mu_0)^2$. For $\alpha = 0.05$, $\beta = 0.2$: $(1.96 + 0.84)^2 \approx 7.85$.
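The sample-size formula above, evaluated for a hypothetical effect of half a standard deviation ($\sigma = 1$, $|\mu_1 - \mu_0| = 0.5$):

```python
import math
from scipy import stats

# n = (z_{alpha/2} + z_beta)^2 * sigma^2 / delta^2 for a two-sided z-test.
alpha, beta = 0.05, 0.20
z_a = stats.norm.ppf(1 - alpha / 2)   # 1.96
z_b = stats.norm.ppf(1 - beta)        # 0.84
sigma, delta = 1.0, 0.5               # detect a shift of half a standard deviation
n = math.ceil((z_a + z_b) ** 2 * sigma ** 2 / delta ** 2)   # rounds up to 32

# Achieved power at this n (ignoring the negligible far tail):
achieved = stats.norm.sf(z_a - math.sqrt(n) * delta / sigma)
```

Rounding $n$ up guarantees the achieved power is at least the 80% target.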

Example 2: Wilks' Theorem for the LRT in Logistic Regression

Testing whether $k$ coefficients in a logistic regression are jointly zero: $H_0: \beta_2 = \ldots = \beta_{k+1} = 0$.

Fit the full model (MLE $\hat\beta$) and the restricted model (MLE $\hat\beta_0$ under $H_0$). Compute:

$$W = 2[\ell(\hat\beta) - \ell(\hat\beta_0)] \xrightarrow{d} \chi^2_k.$$

This is the likelihood ratio chi-squared test reported by logistic regression software. It is preferred over Wald tests when sample sizes are moderate, since Wald statistics can be sensitive to parameterization.
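A self-contained sketch of this LRT on simulated data, fitting both logistic models by directly maximizing the log-likelihood with `scipy.optimize` rather than a regression package (data and coefficients are invented for illustration):

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 features
beta_true = np.array([-0.5, 1.0, 0.0])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

def negloglik(beta, X, y):
    eta = X @ beta
    return -np.sum(y * eta - np.log1p(np.exp(eta)))   # Bernoulli log-likelihood

# Full model: intercept + both slopes. Restricted model: H0 forces both slopes
# to zero, leaving the intercept only, so r = 2 constraints.
full = optimize.minimize(negloglik, np.zeros(3), args=(X, y))
restr = optimize.minimize(negloglik, np.zeros(1), args=(X[:, :1], y))

W = 2 * (restr.fun - full.fun)     # = 2 [loglik(full) - loglik(restricted)]
pval = stats.chi2.sf(W, df=2)      # compare to chi^2_2
```

Since one true slope is 1.0, the restricted model fits much worse and the test rejects decisively.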

Example 3: BH Procedure in Gene Expression

A gene expression study tests $m = 10{,}000$ genes for differential expression. With Bonferroni at $\alpha = 0.05$, each gene must achieve $p < 5 \times 10^{-6}$ — very stringent.

With BH at FDR = 0.05: sort the p-values $p_{(1)} \leq p_{(2)} \leq \ldots$ and reject the largest $k$ with $p_{(k)} \leq 0.05 \cdot k/10{,}000$. If the top 500 p-values are all $\leq 0.0025$, then $p_{(500)} \leq 0.05 \cdot 500/10{,}000 = 0.0025$ — all 500 are rejected. This allows discovery of real effects that Bonferroni would miss, at the cost of permitting up to 5% expected false discoveries among the rejections.

Connections

Where Your Intuition Breaks

The p-value is the most widely misinterpreted number in science. The correct statement — "probability of observing data this extreme if $H_0$ is true" — is subtly different from the natural reading "probability that $H_0$ is true given this data." These differ by Bayes' theorem: the posterior probability of $H_0$ requires a prior over whether $H_0$ is true, which the p-value does not provide. A p-value of 0.03 combined with a null hypothesis that is a priori likely true (say, 99% of tested genomic variants have no effect) can correspond to a false discovery rate above 50% — far from "only 3% chance of being wrong." Sequential testing compounds this: peeking at results mid-study and stopping when $p < 0.05$ inflates the true Type I error rate far above $\alpha$, even when the final test is reported as a single decision.

💡Intuition

The NP lemma says: use the likelihood ratio. Any test can be described by its rejection region in the sample space. The NP lemma proves that among all regions of fixed probability under $H_0$, the one that maximizes probability under $H_1$ is exactly the region where $p_1/p_0$ is largest. Intuitively: order data points by how much more likely they are under $H_1$ than under $H_0$, and put the most $H_1$-likely points in the rejection region. All common tests (t-test, chi-squared, F-test) are likelihood ratio tests for their respective parametric families.

💡Intuition

Wilks' theorem makes the LRT universally applicable. Instead of deriving the null distribution for each problem, Wilks' theorem says it is always $\chi^2_r$ asymptotically, where $r$ is the number of constraints. This is remarkable: regardless of the parametric family, the same $\chi^2$ table applies. The reason is that near the null, the log-likelihood is locally quadratic (by Taylor expansion around the MLE), and a constrained quadratic minimization over $r$ directions gives a $\chi^2_r$ distribution.

⚠️Warning

P-hacking inflates Type I errors even with correct individual tests. If a researcher tries 20 different outcome measures and reports the one with $p < 0.05$, the effective significance level is approximately $1 - 0.95^{20} \approx 0.64$. Pre-registration, multiple testing corrections, and replication requirements exist precisely because the p-value only controls error for the single pre-specified test, not for the exploration process. In ML: repeatedly tuning hyperparameters and reporting test accuracy inflates the apparent accuracy by the same mechanism — the test set has been implicitly used for selection.
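A quick simulation of the selection effect (under $H_0$ every p-value is Uniform(0,1), so we can draw them directly):

```python
import numpy as np

# 10,000 simulated studies, each testing 20 true-null outcome measures and
# reporting only the smallest p-value.
rng = np.random.default_rng(0)
pvals = rng.random((10_000, 20))                   # null p-values ~ Uniform(0,1)
at_least_one = np.mean(pvals.min(axis=1) < 0.05)   # close to 1 - 0.95**20 = 0.64
```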
