The Multiple Testing Problem

Hypothesis Testing

When Testing More Means Finding More Wrong

The multiple testing problem causes false discovery rates to explode when running many hypothesis tests simultaneously. Without correction, most "significant" findings in large-scale studies are false positives.

Genomics — Testing thousands of genes for differential expression in disease studies
Neuroimaging — Analyzing millions of brain voxels for activation differences
Quality Control — Inspecting multiple product characteristics simultaneously

The more you test, the more you must correct — or drown in false discoveries.

If you run 20 hypothesis tests at α = 0.05 when all H₀s are true, you expect 1 false positive by chance. Run 1000 tests? Expect 50 false discoveries.

DfFamily-Wise Error Rate (FWER)

FWER = P(at least one false rejection) = 1 − (1−α)^m for independent tests

FWER Inflation

Visualizing FWER Inflation

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests
import matplotlib.pyplot as plt

# Show FWER inflation
alpha = 0.05
m_tests = np.arange(1, 201)
fwer = 1 - (1 - alpha)**m_tests

plt.figure(figsize=(10, 5))
plt.plot(m_tests, fwer, 'b-', linewidth=2)
plt.axhline(0.05, color='red', linestyle='--', label='α=0.05 (desired level)')
plt.axhline(0.50, color='orange', linestyle=':', label='50% false positive rate')
plt.xlabel('Number of Tests (m)')
plt.ylabel('FWER (P at least one false positive)')
plt.title('Multiple Testing Problem: FWER Inflation')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('multiple_testing.png', dpi=150)
plt.show()

print(f"FWER with {5} tests:   {1-(1-0.05)**5:.4f}")
print(f"FWER with {20} tests:  {1-(1-0.05)**20:.4f}")
print(f"FWER with {100} tests: {1-(1-0.05)**100:.4f}")

Comparison of Correction Methods

Bonferroni, Holm, and Benjamini-Hochberg

# ==========================================
# CORRECTIONS
# ==========================================
np.random.seed(42)
n_tests = 20
n_truly_null = 15  # 15 of 20 are truly null
n_true_alt = 5    # 5 truly have effects

# Simulate p-values
null_p = np.random.uniform(0, 1, n_truly_null)       # null: uniform
alt_p = np.random.beta(0.3, 5, n_true_alt)           # alternative: small p-values
all_p = np.concatenate([null_p, alt_p])
np.random.shuffle(all_p)

print(f"\nUncorrected: {(all_p < 0.05).sum()} rejections (some are false)")

# Bonferroni correction
bonf_alpha = 0.05 / n_tests
print(f"\nBonferroni (α/m = {bonf_alpha:.4f}): {(all_p < bonf_alpha).sum()} rejections")

# Holm-Bonferroni (uniformly more powerful than Bonferroni)
reject_holm, pvals_corrected_holm, _, _ = multipletests(all_p, alpha=0.05, method='holm')
print(f"Holm-Bonferroni: {reject_holm.sum()} rejections")

# Benjamini-Hochberg (controls FDR, not FWER — more power for many tests)
reject_bh, pvals_corrected_bh, _, _ = multipletests(all_p, alpha=0.05, method='fdr_bh')
print(f"Benjamini-Hochberg (FDR): {reject_bh.sum()} rejections")

print("\nB-H procedure: controls the EXPECTED proportion of false discoveries, not probability of any")
print("  FDR ≤ α (expected false discovery rate)")
print("  More powerful than Bonferroni for many simultaneous tests")

FWER vs FDR

Choosing a Correction Method

Method	Controls	Power	Use When
Bonferroni	FWER (strong)	Lowest	Few tests, any false positive is catastrophic
Holm	FWER (strong)	Higher than Bonferroni	Few-moderate tests
Benjamini-Hochberg	FDR	Highest	Many tests (genomics, fMRI), some false positives acceptable

Key Takeaways

Summary: Multiple Testing Problem

Run m tests at α -> expect m×α false positives when all nulls are true
Bonferroni: divide α by m — simple, conservative, appropriate for few tests
Benjamini-Hochberg controls FDR — used in genomics (thousands of tests)
Pre-specify your comparisons before seeing data — reduces multiple testing inflation
Report all tests conducted — selective reporting is a major source of false discoveries

Multiple Testing Problem — FWER and False Discovery Rate