Hypothesis Testing: Neyman-Pearson, Likelihood Ratio Tests & Multiple Testing
Hypothesis testing formalizes the question "is this effect real or noise?" — the Neyman-Pearson lemma identifies the most powerful test for any given significance level, likelihood ratio tests extend this to composite hypotheses, and multiple testing corrections prevent false discovery from multiplying across thousands of simultaneous tests.
Concepts
[Interactive figure: power vs. sample size. Amber dot = current $n$; green dashed line = 80% power target (a common industry standard). Power is driven by the non-centrality $\sqrt{n}\,\delta$.]
Every time an A/B test is run to compare two product variants, the p-value answers one precise question: how often would we see a difference this large or larger by chance alone, if the two variants were actually identical? Hypothesis testing formalizes "probably real vs. probably noise" — and the Neyman-Pearson lemma proves which test draws this distinction as powerfully as possible for any given significance level.
Testing Setup: Error Types and Power
A hypothesis test specifies a null hypothesis $H_0: \theta \in \Theta_0$ and an alternative hypothesis $H_1: \theta \in \Theta_1$. A test is a function $\phi(x) \in [0, 1]$ giving the probability of rejecting $H_0$ when $x$ is observed (deterministic tests have $\phi(x) \in \{0, 1\}$).
Error types:
| | $H_0$ true | $H_1$ true |
|---|---|---|
| Accept $H_0$ | Correct ✓ | Type II error (miss), prob $\beta$ |
| Reject $H_0$ | Type I error (false alarm), prob $\alpha$ | Correct ✓ |
- Size (significance level): $\alpha = \sup_{\theta \in \Theta_0} \mathbb{E}_\theta[\phi(X)]$
- Power: $\pi(\theta) = \mathbb{E}_\theta[\phi(X)]$ for $\theta \in \Theta_1$; the power function $\pi(\cdot)$ shows power across all $\theta$
- Type II error rate: $\beta(\theta) = 1 - \pi(\theta)$

The testing problem is: given a fixed significance level $\alpha$, find the test that maximizes power.
The asymmetry — fixing the Type I rate first and then maximizing power — is not an arbitrary convention. It encodes the cost structure of discovery: a false alarm triggers costly interventions, retractions, or deployed changes; a missed detection merely delays discovery. Fixing $\alpha$ is the mathematical statement of "control the worst-case outcome first, then optimize." This is formally identical to constrained optimization: the Neyman-Pearson framework finds the most powerful test subject to a hard upper bound on false alarms.
Neyman-Pearson Lemma
For simple hypotheses $H_0: X \sim p_0$ vs $H_1: X \sim p_1$, the most powerful level-$\alpha$ test rejects when the likelihood ratio exceeds a threshold:

$$\phi(x) = \begin{cases} 1 & \Lambda(x) > k \\ \gamma & \Lambda(x) = k \\ 0 & \Lambda(x) < k \end{cases} \qquad \Lambda(x) = \frac{p_1(x)}{p_0(x)},$$

where $k$ and $\gamma$ are chosen so that $\mathbb{E}_0[\phi(X)] = \alpha$ exactly.
Proof (variational). Suppose $\phi'$ is any other level-$\alpha$ test. Define $D(x) = \big(\phi(x) - \phi'(x)\big)\big(p_1(x) - k\,p_0(x)\big)$. Then $D(x) \ge 0$ for every $x$:

because on $\{x : p_1(x) > k\,p_0(x)\}$, $\phi(x) = 1$ and so $\phi(x) - \phi'(x) \ge 0$; on $\{x : p_1(x) < k\,p_0(x)\}$, $\phi(x) = 0$ and so $\phi(x) - \phi'(x) \le 0$. Integrating $D \ge 0$ and rearranging:

$$\int (\phi - \phi')\,p_1\,dx \;\ge\; k \int (\phi - \phi')\,p_0\,dx \;\ge\; 0,$$

since $\int \phi\,p_0\,dx = \alpha \ge \int \phi'\,p_0\,dx$ (both tests have level $\alpha$). Thus power($\phi$) $\ge$ power($\phi'$).
The NP lemma says: any test that is not based on the likelihood ratio can be improved.
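As a sanity check, here is a small Monte Carlo sketch (assuming NumPy and SciPy are available) of the simple-vs-simple case $\mathcal{N}(0,1)$ vs $\mathcal{N}(1,1)$. The likelihood ratio $e^{x - 1/2}$ is increasing in $x$, so the LR test rejects for large $x$; we compare it against a same-size test that rejects for large $|x|$ and is therefore not likelihood-ratio based:

```python
# Neyman-Pearson demo: H0: X ~ N(0,1) vs H1: X ~ N(1,1), one observation.
# LR test rejects when x > z_{1-alpha}; the rival test rejects when |x| > z_{1-alpha/2}.
# Both have size alpha under H0, but the LR test has strictly higher power.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
alpha = 0.05

c_lr = norm.ppf(1 - alpha)           # one-sided cutoff: P0(X > c_lr) = alpha
c_abs = norm.ppf(1 - alpha / 2)      # two-sided cutoff: P0(|X| > c_abs) = alpha

x0 = rng.normal(0.0, 1.0, 200_000)   # draws under H0
x1 = rng.normal(1.0, 1.0, 200_000)   # draws under H1

size_lr = np.mean(x0 > c_lr)
size_abs = np.mean(np.abs(x0) > c_abs)
pow_lr = np.mean(x1 > c_lr)
pow_abs = np.mean(np.abs(x1) > c_abs)

print(f"size:  LR {size_lr:.3f}, |x| {size_abs:.3f}")   # both close to 0.05
print(f"power: LR {pow_lr:.3f}, |x| {pow_abs:.3f}")     # LR test dominates
```

The exact powers are $\Phi(1 - z_{0.95}) \approx 0.26$ for the LR test versus roughly $0.17$ for the two-sided rival — same false-alarm budget, spent on the wrong region.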
Uniformly Most Powerful Tests
A test is uniformly most powerful (UMP) at level $\alpha$ if it is most powerful against every $\theta \in \Theta_1$.
UMP tests exist for one-sided alternatives in exponential families. For example, testing $H_0: \theta \le \theta_0$ vs $H_1: \theta > \theta_0$ in a one-parameter exponential family with sufficient statistic $T(x)$: reject when $T(x) > c_\alpha$, where $c_\alpha$ is the $(1-\alpha)$-quantile of $T$ under $\theta_0$.
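A minimal sketch of such a one-sided test, using Poisson data as an illustrative exponential family (the choices $\lambda_0 = 1$, $n = 50$ are arbitrary). The rejection rate sits at or below $\alpha$ on the null boundary and rises monotonically with $\lambda$:

```python
# One-sided test in an exponential family: X_i ~ Poisson(lam),
# H0: lam <= 1 vs H1: lam > 1. The sufficient statistic T = sum(X_i)
# is Poisson(n * lam); reject when T exceeds the 1-alpha quantile under lam0.
import numpy as np
from scipy.stats import poisson

n, lam0, alpha = 50, 1.0, 0.05
c = poisson.ppf(1 - alpha, n * lam0)   # critical value on the null boundary

rng = np.random.default_rng(1)
rates = {}
for lam in [1.0, 1.2, 1.5]:
    T = rng.poisson(lam, size=(100_000, n)).sum(axis=1)
    rates[lam] = (T > c).mean()        # Monte Carlo rejection probability
    print(f"lam={lam}: rejection rate {rates[lam]:.3f}")
```

Because $T$ is discrete, the size is slightly below $\alpha$ rather than exactly $\alpha$ — the randomized-test $\gamma$ in the NP lemma exists precisely to absorb this gap.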
UMP tests do not exist for two-sided alternatives in general (they would need to be simultaneously most powerful against both $\theta > \theta_0$ and $\theta < \theta_0$, which is contradictory).
Likelihood Ratio, Wald, and Score Tests
For a composite null $H_0: \theta \in \Theta_0$ vs $H_1: \theta \notin \Theta_0$, three asymptotically equivalent tests have $\chi^2$ null distributions:

Likelihood Ratio Test (LRT):

$$\lambda = 2\big[\ell(\hat\theta) - \ell(\hat\theta_0)\big] \xrightarrow{d} \chi^2_r,$$

where $r$ is the number of constrained parameters. This is Wilks' theorem — the degrees of freedom equal the number of equality constraints imposed by $H_0$.

Wald test: $W = (\hat\theta - \theta_0)^\top I(\hat\theta)\,(\hat\theta - \theta_0) \xrightarrow{d} \chi^2_r$.

Score (Rao) test: $S = U(\theta_0)^\top I(\theta_0)^{-1}\,U(\theta_0) \xrightarrow{d} \chi^2_r$, where $U = \nabla_\theta\,\ell$ is the score; everything is evaluated at $\theta_0$, so the model need not be fit under the alternative.

All three are asymptotically equivalent under $H_0$ and under local alternatives. The LRT is most commonly used in practice for its robustness.
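The three statistics can be compared directly in the simplest case, a Bernoulli proportion with $H_0: p = p_0$; the counts below (560 successes in 1000 trials, $p_0 = 0.5$) are illustrative:

```python
# LRT, Wald, and score statistics for a Bernoulli proportion, H0: p = p0.
# All three are asymptotically chi^2_1 under H0 and nearly agree for large n.
import numpy as np
from scipy.stats import chi2

def trinity(k, n, p0):
    """Return (LRT, Wald, score) statistics for k successes in n trials."""
    ph = k / n                                     # unrestricted MLE
    def ll(p):                                     # Bernoulli log-likelihood
        return k * np.log(p) + (n - k) * np.log(1 - p)
    lrt = 2 * (ll(ph) - ll(p0))                    # 2 [l(p_hat) - l(p0)]
    wald = (ph - p0) ** 2 / (ph * (1 - ph) / n)    # information at the MLE
    score = (ph - p0) ** 2 / (p0 * (1 - p0) / n)   # information at p0
    return lrt, wald, score

lrt, wald, score = trinity(k=560, n=1000, p0=0.5)
for name, stat in [("LRT", lrt), ("Wald", wald), ("score", score)]:
    print(f"{name:5s}: stat = {stat:.2f}, p = {chi2.sf(stat, df=1):.4f}")
```

Note that only the score statistic avoids fitting the unrestricted model's variance — its denominator uses $p_0(1 - p_0)$, exactly the "evaluated at $\theta_0$" property above.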
P-values
The p-value is $p = \Pr_{H_0}(T \ge t_{\text{obs}})$ — the probability under $H_0$ of observing a test statistic at least as extreme as what was observed. Reject at level $\alpha$ iff $p \le \alpha$.
Critical misinterpretations:
- The p-value is not $\Pr(H_0 \mid \text{data})$ — that requires a prior.
- A p-value of 0.04 does not mean there is a 4% chance the null is true.
- A large p-value does not prove $H_0$; it only fails to reject it.
Multiple Testing Corrections
When $m$ independent tests are conducted simultaneously, the probability that at least one false positive occurs (the family-wise error rate, FWER) can be large even if each individual test uses $\alpha = 0.05$: $\text{FWER} = 1 - (1 - \alpha)^m \approx m\alpha$ for small $\alpha$.

Bonferroni correction: use threshold $\alpha/m$ per test. Controls FWER regardless of dependence structure. Conservative when tests are correlated.

False Discovery Rate (FDR) (Benjamini-Hochberg): if $V$ of the $R$ rejections are false discoveries, the FDR is the expected proportion of false discoveries among all rejections, $\mathrm{FDR} = \mathbb{E}[V / \max(R, 1)]$. BH procedure: sort p-values $p_{(1)} \le \cdots \le p_{(m)}$; find the largest $k$ with $p_{(k)} \le \frac{k}{m}\,q$ and reject all $p_{(i)}$ with $i \le k$. Controls FDR at level $q$ when tests are independent (or PRDS).
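A sketch of the BH step-up procedure against Bonferroni on simulated p-values; the mixture below (90% true nulls with uniform p-values, 10% real effects drawn Beta-skewed toward zero) is a modeling assumption for illustration:

```python
# Benjamini-Hochberg vs Bonferroni on a simulated screen of m hypotheses.
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q."""
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order]
    below = ranked <= q * np.arange(1, m + 1) / m   # p_(k) <= (k/m) q
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True                        # reject the k smallest p-values
    return reject

rng = np.random.default_rng(2)
m, m1 = 10_000, 1_000
null_p = rng.uniform(size=m - m1)                   # true nulls: p ~ Uniform(0,1)
alt_p = rng.beta(0.05, 1.0, size=m1)                # real effects: p piled near 0
p = np.concatenate([null_p, alt_p])

bh = benjamini_hochberg(p, q=0.05)
bonf = p < 0.05 / m                                 # Bonferroni threshold
print(f"BH rejections: {bh.sum()}, Bonferroni rejections: {bonf.sum()}")
```

BH's threshold adapts to the data — the more small p-values there are, the more generous the cutoff $\frac{k}{m} q$ becomes — which is why it rejects strictly more than Bonferroni here.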
When to use which:
- Medical trials, safety claims: FWER control (Bonferroni) — even a few false positives are unacceptable
- Genomics, large-scale screening: FDR (BH) — some false positives acceptable if overall discovery rate is controlled
Worked Example
Example 1: Gaussian One-Sample Test
$X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)$, $\sigma$ known. Test $H_0: \mu = 0$ vs $H_1: \mu \ne 0$.

LRT: $\lambda = n\bar{X}^2 / \sigma^2$. Under $H_0$: $\lambda \sim \chi^2_1$. Equivalently: reject when $|Z| > z_{1-\alpha/2}$, where $Z = \sqrt{n}\,\bar{X} / \sigma \sim \mathcal{N}(0, 1)$ under $H_0$.

Power: $\pi(\mu) = \Phi\big({-z_{1-\alpha/2}} + \sqrt{n}\,\mu/\sigma\big) + \Phi\big({-z_{1-\alpha/2}} - \sqrt{n}\,\mu/\sigma\big)$.

The non-centrality parameter is $\delta = \sqrt{n}\,\mu/\sigma$. Power increases with $|\delta|$ — larger samples, larger effect sizes, or smaller $\sigma$ all increase power.

Sample size for 80% power: need $\sqrt{n}\,\mu/\sigma \ge z_{1-\alpha/2} + z_{0.80}$, so $n \ge \big((z_{1-\alpha/2} + z_{0.80})\,\sigma/\mu\big)^2$. For $\alpha = 0.05$: $z_{0.975} = 1.96$ and $z_{0.80} \approx 0.84$, giving $n \ge (2.80\,\sigma/\mu)^2 \approx 7.85\,(\sigma/\mu)^2$.
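The power and sample-size formulas translate directly into code; the effect size $\mu/\sigma = 0.5$ (half a standard deviation) is an illustrative choice:

```python
# Closed-form power and sample size for the two-sided z-test of Example 1.
import numpy as np
from scipy.stats import norm

def power(n, mu, sigma=1.0, alpha=0.05):
    """Exact two-sided power: Phi(-z + delta) + Phi(-z - delta)."""
    z = norm.ppf(1 - alpha / 2)
    delta = np.sqrt(n) * mu / sigma            # non-centrality parameter
    return norm.cdf(-z + delta) + norm.cdf(-z - delta)

def sample_size(mu, sigma=1.0, alpha=0.05, target=0.80):
    """Smallest n with sqrt(n) mu/sigma >= z_{1-alpha/2} + z_{target}."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(target)
    return int(np.ceil(((z_a + z_b) * sigma / mu) ** 2))

n = sample_size(mu=0.5)
print(f"n = {n}, achieved power = {power(n, mu=0.5):.3f}")
```

For $\mu/\sigma = 0.5$ this gives $n = 32$, matching the rule-of-thumb $7.85\,(\sigma/\mu)^2 = 31.4$ rounded up.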
Example 2: Wilks' Theorem for the LRT in Logistic Regression
Testing whether $k$ coefficients in a logistic regression are jointly zero: $H_0: \beta_{j_1} = \cdots = \beta_{j_k} = 0$.

Fit the full model (MLE $\hat\beta$) and the restricted model (MLE $\hat\beta_0$ under $H_0$). Compute:

$$\lambda = 2\big[\ell(\hat\beta) - \ell(\hat\beta_0)\big] \sim \chi^2_k \quad \text{under } H_0.$$
This is the likelihood ratio chi-squared test reported by logistic regression software. It is preferred over Wald tests when sample sizes are moderate, since Wald statistics can be sensitive to parameterization.
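A self-contained sketch of the computation, fitting both models by Newton-Raphson in plain NumPy rather than calling a regression library; the simulated data and coefficients are illustrative ($H_0$ constrains both slopes, so $k = 2$):

```python
# Logistic-regression LRT by hand: 2 * (ll_full - ll_restricted) ~ chi^2_k.
import numpy as np
from scipy.stats import chi2

def fit_logistic(X, y, iters=25):
    """Newton-Raphson MLE for logistic regression; returns (beta, log-likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))
        W = p * (1 - p)
        H = X.T @ (W[:, None] * X)               # observed information matrix
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    p = 1 / (1 + np.exp(-X @ beta))
    ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return beta, ll

rng = np.random.default_rng(3)
n = 2000
Z = rng.normal(size=(n, 2))
logits = 0.5 + 0.8 * Z[:, 0]                     # only the first predictor matters
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logits))).astype(float)

_, ll_full = fit_logistic(np.column_stack([np.ones(n), Z]), y)  # intercept + 2 slopes
_, ll_null = fit_logistic(np.ones((n, 1)), y)                   # intercept only

lam = 2 * (ll_full - ll_null)
print(f"LRT statistic = {lam:.1f}, p = {chi2.sf(lam, df=2):.2e}")
```

Note the statistic depends on $\hat\beta$ only through the fitted log-likelihoods — this is the source of the LRT's parameterization invariance that the Wald test lacks.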
Example 3: BH Procedure in Gene Expression
A gene expression study tests $m$ genes for differential expression. At Bonferroni, each gene must achieve $p < 0.05/m$ — very stringent.

With BH at FDR $= 0.05$: sort the p-values $p_{(1)} \le \cdots \le p_{(m)}$ and reject the largest $k$ with $p_{(k)} \le \frac{k}{m} \cdot 0.05$. If the top 500 p-values all fall below $\frac{500}{m} \cdot 0.05$, then $k \ge 500$ — all 500 are rejected. This allows discovery of real effects that Bonferroni would miss, at the cost of permitting up to 5% expected false discoveries.
Where Your Intuition Breaks
The p-value is the most widely misinterpreted number in science. The correct statement — "the probability of observing data this extreme if $H_0$ is true" — is subtly different from the natural reading "the probability that $H_0$ is true given this data." These differ by Bayes' theorem: the posterior probability of $H_0$ requires a prior over whether $H_0$ is true, which the p-value does not provide. A p-value of 0.03 combined with a null hypothesis that is likely true a priori (say, 99% of tested genomic variants have no effect) can correspond to a false discovery rate above 50% — far from "only 3% chance of being wrong." Sequential testing compounds this: peeking at results mid-study and stopping when $p < 0.05$ inflates the true Type I error rate far above $0.05$, even when the final test is reported as a single decision.
The NP lemma says: use the likelihood ratio. Any test can be described by its rejection region in the sample space. The NP lemma proves that among all regions of fixed probability $\alpha$ under $H_0$, the one that maximizes probability under $H_1$ is exactly the region where $\Lambda(x) = p_1(x)/p_0(x)$ is largest. Intuitively: order data points by how much more likely they are under $H_1$ than under $H_0$, and put the most $H_1$-favoring points in the rejection region. All common tests (t-test, chi-squared, F-test) are likelihood ratio tests for their respective parametric families.
Wilks' theorem makes the LRT universally applicable. Instead of deriving the null distribution for each problem, Wilks' theorem says it is always asymptotically $\chi^2_r$, where $r$ = number of constraints. This is remarkable: regardless of the parametric family, the same $\chi^2$ table applies. The reason is that near the null, the log-likelihood is locally quadratic (by Taylor expansion around the MLE), and a constrained quadratic minimization over $r$ directions gives a $\chi^2_r$ distribution.
P-hacking inflates Type I errors even with correct individual tests. If a researcher tries 20 different outcome measures and reports the one with $p < 0.05$, the effective significance level is approximately $1 - (1 - 0.05)^{20} \approx 0.64$. Pre-registration, multiple testing corrections, and replication requirements exist precisely because the p-value only controls error for the single pre-specified test, not for the exploration process. In ML: repeatedly tuning hyperparameters and reporting test accuracy inflates the apparent accuracy by the same mechanism — the test set has been implicitly used for selection.
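The $1 - 0.95^{20}$ arithmetic can be checked by simulation (the 20 outcomes and $n = 50$ observations per outcome are illustrative):

```python
# P-hacking simulation: every null is true, but each "study" runs 20 outcome
# measures and reports only its smallest p-value. How often does the best
# of 20 clear p < 0.05?
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
studies, outcomes, n = 5_000, 20, 50

x = rng.normal(0.0, 1.0, size=(studies, outcomes, n))  # all effects are zero
z = x.mean(axis=2) * np.sqrt(n)                        # z-statistic per outcome
p = 2 * norm.sf(np.abs(z))                             # two-sided p-values
false_positive = (p.min(axis=1) < 0.05).mean()         # best-of-20 "significance"

print(f"effective Type I rate: {false_positive:.3f}")  # near 1 - 0.95^20 = 0.64
```

Each individual test is perfectly valid at $\alpha = 0.05$; the inflation comes entirely from the unreported selection step.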