The Multiple Testing Problem
When Testing More Means Finding More Wrong
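Running many hypothesis tests at once inflates the number of false positives: each test at α = 0.05 carries its own 5% false-positive risk, and those risks accumulate across tests. Without correction, a large share of the "significant" findings in a large-scale study can be false positives, especially when true effects are rare.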
- Genomics — Testing thousands of genes for differential expression in disease studies
- Neuroimaging — Analyzing millions of brain voxels for activation differences
- Quality Control — Inspecting multiple product characteristics simultaneously
The more you test, the more you must correct — or drown in false discoveries.
If you run 20 hypothesis tests at α = 0.05 and every null hypothesis (H₀) is true, you expect 20 × 0.05 = 1 false positive by chance. Run 1000 tests and you expect 1000 × 0.05 = 50 false discoveries.
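The arithmetic above can be checked with a small simulation. This is a minimal sketch using only the Python standard library; it assumes that a valid p-value is uniform on [0, 1] when its null hypothesis is true, and the helper name `count_false_positives`, the seed, and the repetition count are illustrative choices.

```python
import random

def count_false_positives(m, alpha, rng):
    """Run m tests whose null hypotheses are all true and count rejections.

    Assumption: under a true null, the p-value is uniform on [0, 1],
    so each test rejects with probability alpha.
    """
    return sum(rng.random() < alpha for _ in range(m))

rng = random.Random(0)   # fixed seed so the run is reproducible
alpha = 0.05
n_reps = 2000            # number of simulated "studies" per setting

means = {}
for m in (20, 1000):
    counts = [count_false_positives(m, alpha, rng) for _ in range(n_reps)]
    means[m] = sum(counts) / n_reps
    print(f"m={m}: mean false positives = {means[m]:.2f} (theory: {m * alpha:.0f})")
```

The simulated averages should land close to m × α; individual studies vary around that value, which is why a single study can show more or fewer false positives than expected.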
Definition: Family-Wise Error Rate (FWER)
FWER = P(at least one false rejection) = 1 − (1−α)^m for m independent tests, each run at level α, when all null hypotheses are true
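The closed form follows from one step of reasoning, sketched here in LaTeX. The symbol V (the number of false rejections) and p_i (the p-value of test i) are notation introduced for this sketch; the derivation assumes all m nulls are true and the tests are independent.

```latex
\begin{align*}
\mathrm{FWER} &= P(V \ge 1) = 1 - P(V = 0) \\
              &= 1 - \prod_{i=1}^{m} P(p_i > \alpha) \\
              &= 1 - (1 - \alpha)^m .
\end{align*}
```

Without independence, the union bound still gives FWER ≤ mα, which is why testing each hypothesis at level α/m (the Bonferroni correction) keeps the FWER at or below α.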
FWER Inflation
Visualizing FWER Inflation
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests
import matplotlib.pyplot as plt
# Show FWER inflation
alpha = 0.05
m_tests = np.arange(1, 201)
fwer = 1 - (1 - alpha)**m_tests
plt.figure(figsize=(10, 5))
plt.plot(m_tests, fwer, 'b-', linewidth=2)
plt.axhline(0.05, color='red', linestyle='--', label='α=0.05 (desired level)')
plt.axhline(0.50, color='orange', linestyle=':', label='FWER = 0.50')
plt.xlabel('Number of Tests (m)')
plt.ylabel('FWER: P(at least one false positive)')
plt.title('Multiple Testing Problem: FWER Inflation')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('multiple_testing.png', dpi=150)
plt.show()
# Print the FWER for a few representative numbers of tests
for m in (5, 20, 100):
    print(f"FWER with {m} tests: {1 - (1 - alpha)**m:.4f}")
Comparison of Correction Methods
Bonferroni, Holm, and Benjamini-Hochberg
# ==========================================
# CORRECTIONS
# ==========================================
np.random.seed(42)
n_tests = 20
n_truly_null = 15 # 15 of 20 are truly null
n_true_alt = 5 # 5 truly have effects
# Simulate p-values
null_p = np.random.uniform(0, 1, n_truly_null) # null: uniform
alt_p = np.random.beta(0.3, 5, n_true_alt) # alternative: small p-values
all_p = np.concatenate([null_p, alt_p])
np.random.shuffle(all_p)
print(f"\nUncorrected: {(all_p < 0.05).sum()} rejections (may include false positives)")
# Bonferroni correction
bonf_alpha = 0.05 / n_tests
print(f"\nBonferroni (α/m = {bonf_alpha:.4f}): {(all_p < bonf_alpha).sum()} rejections")
# Holm-Bonferroni (uniformly more powerful than Bonferroni)
reject_holm, pvals_corrected_holm, _, _ = multipletests(all_p, alpha=0.05, method='holm')
print(f"Holm-Bonferroni: {reject_holm.sum()} rejections")
# Benjamini-Hochberg (controls FDR, not FWER — more power for many tests)
reject_bh, pvals_corrected_bh, _, _ = multipletests(all_p, alpha=0.05, method='fdr_bh')
print(f"Benjamini-Hochberg (FDR): {reject_bh.sum()} rejections")
print("\nB-H procedure: controls the EXPECTED proportion of false discoveries, not the probability of any false discovery")
print(" FDR ≤ α when the tests are independent or positively dependent")
print(" More powerful than Bonferroni for many simultaneous tests")
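The code above delegates the corrections to `multipletests`. To show what the three procedures do step by step, here is a minimal from-scratch sketch in plain Python. The function names and the six example p-values are made up for illustration and are not part of statsmodels; the rejection rules use `<=`, which is the usual textbook statement of each procedure.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject every p-value at or below alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Holm step-down: compare the sorted p-values against
    alpha/m, alpha/(m-1), ... and stop at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):          # rank 0 is the smallest p-value
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break                             # all larger p-values are retained
    return reject

def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up: find the largest rank k with p_(k) <= (k / m) * alpha
    and reject the k smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for k, i in enumerate(order, start=1):
        if pvals[i] <= k / m * alpha:
            k_max = k
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

# Hypothetical p-values, deliberately unsorted
example_p = [0.20, 0.001, 0.030, 0.74, 0.012, 0.008]
n_bonf = sum(bonferroni(example_p))
n_holm = sum(holm(example_p))
n_bh = sum(benjamini_hochberg(example_p))
print(f"Bonferroni: {n_bonf}, Holm: {n_holm}, Benjamini-Hochberg: {n_bh}")
# -> Bonferroni: 2, Holm: 3, Benjamini-Hochberg: 4
```

On this example each procedure rejects a superset of the previous one, which matches the power ordering discussed in this lesson: Holm relaxes the threshold after each rejection, and Benjamini-Hochberg relaxes it further because it only bounds the expected proportion of false discoveries.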
FWER vs FDR
Choosing a Correction Method
| Method | Controls | Power | Use When |
|---|---|---|---|
| Bonferroni | FWER (strong) | Lowest | Few tests, any false positive is catastrophic |
| Holm | FWER (strong) | Higher than Bonferroni | Few to a moderate number of tests |
| Benjamini-Hochberg | FDR | Highest | Many tests (genomics, fMRI), some false positives acceptable |
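The difference between the "Controls" column entries can be made concrete by simulation. The sketch below is an illustration under stated assumptions, not a benchmark: it uses 15 true nulls with uniform p-values and 5 true effects with Beta(0.3, 5) p-values, keeps track of which hypotheses are truly null, and estimates both error rates for each method. The helper names `bh_reject` and `simulate`, the seed, and the repetition count are hypothetical choices.

```python
import random

def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: reject the k smallest p-values, where k
    is the largest rank with p_(k) <= (k / m) * alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for k, i in enumerate(order, start=1):
        if pvals[i] <= k / m * alpha:
            k_max = k
    rejected = set(order[:k_max])
    return [i in rejected for i in range(m)]

def simulate(n_reps=2000, n_null=15, n_alt=5, alpha=0.05, seed=0):
    """Estimate (FWER, FDR) for three decision rules on simulated p-values."""
    rng = random.Random(seed)
    m = n_null + n_alt
    methods = {
        "uncorrected": lambda p: [x <= alpha for x in p],
        "bonferroni": lambda p: [x <= alpha / m for x in p],
        "benjamini_hochberg": lambda p: bh_reject(p, alpha),
    }
    any_false = {name: 0 for name in methods}
    fdp_sum = {name: 0.0 for name in methods}
    for _ in range(n_reps):
        # The first n_null entries are true nulls (uniform p-values); the
        # rest are true effects with p-values concentrated near zero.
        pvals = [rng.random() for _ in range(n_null)]
        pvals += [rng.betavariate(0.3, 5) for _ in range(n_alt)]
        for name, method in methods.items():
            reject = method(pvals)
            false_pos = sum(reject[:n_null])
            total = sum(reject)
            any_false[name] += false_pos > 0
            # False discovery proportion; defined as 0 when nothing is rejected
            fdp_sum[name] += false_pos / total if total else 0.0
    return {name: (any_false[name] / n_reps, fdp_sum[name] / n_reps)
            for name in methods}

results = simulate()
for name, (fwer, fdr) in results.items():
    print(f"{name:>20}: FWER={fwer:.3f}  FDR={fdr:.3f}")
```

In a run like this, Bonferroni should hold the FWER at or below α, while Benjamini-Hochberg lets the FWER rise above α but keeps the average false discovery proportion at or below α. The exact numbers depend on the assumed effect-size distribution.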
Key Takeaways
Summary: Multiple Testing Problem
- Run m tests at α → expect m×α false positives when all nulls are true
- Bonferroni: divide α by m — simple, conservative, appropriate for few tests
- Benjamini-Hochberg controls FDR — used in genomics (thousands of tests)
- Pre-specify your comparisons before seeing data — reduces multiple testing inflation
- Report all tests conducted — selective reporting is a major source of false discoveries