False Discovery Rate (FDR) and Benjamini-Hochberg

Hypothesis Testing

Balancing Discovery and Error Control

The Benjamini-Hochberg procedure controls the expected proportion of false discoveries, offering more power than Bonferroni for large-scale testing. It is the standard method for genomics, neuroimaging, and high-dimensional data.

Genomics — Identifying differentially expressed genes while controlling false discoveries
Neuroscience — Analyzing brain imaging data with millions of statistical tests
Digital Advertising — Optimizing keyword testing at scale with FDR control

FDR control enables discovery without drowning in false positives.

The FDR is the expected proportion of rejected H₀s that are actually true (false discoveries):

False Discovery Rate

\text{FDR} = E\left[\frac{\text{\# False Rejections}}{\text{\# Total Rejections}}\right]

Here,

$FDR$ =False Discovery Rate

The Benjamini-Hochberg (BH) procedure controls FDR at level q:

DfBH Algorithm

Sort p-values: p₍₁₎ ≤ p₍₂₎ ≤ ... ≤ p₍ₘ₎
Find largest k such that p₍ₖ₎ ≤ k·q/m
Reject all H₀₍ᵢ₎ for i = 1, ..., k

Python Implementation

Benjamini-Hochberg Procedure

import numpy as np
from statsmodels.stats.multitest import multipletests
import matplotlib.pyplot as plt

np.random.seed(42)
m = 1000  # 1000 tests (e.g., gene expression analysis)
m0 = 800  # 800 truly null genes
m1 = 200  # 200 truly differentially expressed

# Simulate p-values
null_pvals = np.random.uniform(0, 1, m0)          # null: uniform
alt_pvals = np.random.beta(0.5, 10, m1)            # alternative: skewed toward 0
all_pvals = np.concatenate([null_pvals, alt_pvals])
true_labels = np.array([0]*m0 + [1]*m1)            # 0=null, 1=alternative

# Sort for visualization
sort_idx = np.argsort(all_pvals)
sorted_pvals = all_pvals[sort_idx]
ranks = np.arange(1, m+1)
q = 0.05  # desired FDR level

# BH threshold line
bh_line = ranks * q / m

# Find BH cutoff
bh_mask = sorted_pvals <= bh_line
if bh_mask.any():
    bh_cutoff = sorted_pvals[bh_mask].max()
    n_rejected_bh = bh_mask.sum()
else:
    bh_cutoff = 0
    n_rejected_bh = 0

# Bonferroni for comparison
bonf_cutoff = 0.05 / m
n_rejected_bonf = (all_pvals <= bonf_cutoff).sum()

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# BH plot
axes[0].scatter(ranks, sorted_pvals, s=2, alpha=0.4, color='steelblue', label='p-values')
axes[0].plot(ranks, bh_line, 'r-', linewidth=2, label=f'BH line (slope=q/m={q/m:.5f})')
if bh_cutoff > 0:
    axes[0].axhline(bh_cutoff, color='green', linewidth=2, linestyle='--',
                    label=f'BH cutoff = {bh_cutoff:.5f}')
axes[0].set_xlabel('Rank')
axes[0].set_ylabel('Sorted p-value')
axes[0].set_title(f'Benjamini-Hochberg Procedure\n{n_rejected_bh} rejections at FDR ≤ {q}')
axes[0].legend()

# Histogram of p-values
axes[1].hist(all_pvals, bins=50, edgecolor='black', color='steelblue', alpha=0.7)
axes[1].axvline(bonf_cutoff, color='red', linewidth=2, label=f'Bonferroni ({n_rejected_bonf} rejected)')
axes[1].axvline(bh_cutoff, color='green', linewidth=2, linestyle='--', label=f'BH ({n_rejected_bh} rejected)')
axes[1].set_title('P-value Distribution\n(spike near 0 = true alternatives)')
axes[1].legend()

plt.tight_layout()
plt.savefig('bh_procedure.png', dpi=150)
plt.show()

# Using statsmodels
reject_bh, pvals_adj_bh, _, _ = multipletests(all_pvals, alpha=0.05, method='fdr_bh')
reject_bonf, _, _, _ = multipletests(all_pvals, alpha=0.05, method='bonferroni')

print(f"Total tests: {m}, True alternatives: {m1}")
print(f"Bonferroni rejections: {reject_bonf.sum()}")
print(f"BH rejections: {reject_bh.sum()}")

# Estimate false discovery rate achieved
bh_false = np.sum(reject_bh & (true_labels == 0))
print(f"BH actual FDR: {bh_false/max(reject_bh.sum(),1):.4f} (target: {q})")
print(f"BH Power: {np.sum(reject_bh & (true_labels==1))/m1:.4f}")

Key Takeaways

Summary: FDR — Benjamini-Hochberg Procedure

FDR controls the proportion of false positives among rejections — more lenient than FWER
BH algorithm is simple: rank p-values, compare to kq/m threshold
BH is more powerful than Bonferroni — especially for many tests
Genomics, fMRI, metabolomics — all use FDR because thousands of tests are conducted
q = 0.05 FDR means you expect 5% of your "discoveries" to be false
Adjusted p-values from BH are the smallest q for which that test would still be rejected

FDR — Benjamini-Hochberg Procedure for Multiple Testing