Significance Levels

Hypothesis Testing

Setting the Bar for Evidence

The significance level α is the probability threshold for rejecting the null hypothesis, set before data collection. It directly controls the Type I error rate and shapes the stringency of your statistical test.

Medical Research — Using α = 0.01 or stricter for high-stakes drug approvals
Social Science — Balancing discovery rigor against exploratory flexibility
Quality Control — Setting acceptance criteria for incoming materials

Alpha is a decision, not a discovery — choose it wisely before you see the data.

The significance level (α) is the probability threshold below which we consider evidence against H₀ sufficient to reject it. It is set before the test, not after.

What α = 0.05 Actually Means

DfSignificance Level (α)

If we use α = 0.05 and H₀ is true, we will incorrectly reject H₀ in 5% of repeated experiments. That means 1 in 20 studies will produce a false positive by chance alone. It is NOT the probability H₀ is true, nor that the result occurred by chance.

Demonstrating the False Positive Rate

False Positive Rate Simulation

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Demonstrate false positive rate
np.random.seed(42)
n_experiments = 10000
alpha = 0.05
false_positives = 0

# H₀ is TRUE in all experiments
for _ in range(n_experiments):
    sample = np.random.normal(0, 1, 30)  # true μ = 0 = μ₀
    _, p = stats.ttest_1samp(sample, popmean=0)
    if p < alpha:
        false_positives += 1

print(f"H₀ always true. α = {alpha}")
print(f"False positives: {false_positives}/{n_experiments} = {false_positives/n_experiments:.4f}")
print(f"This should ≈ α = {alpha}")

Conventional Significance Levels by Field

Field	Common α	Reason
Social science	0.05	Fisher's historical convention
Medicine (clinical)	0.05 or 0.01	Patient safety concerns
Psychology	0.05	Replication crisis driving to 0.005
Genetics (GWAS)	5×10⁻⁸	Millions of simultaneous tests
Particle physics	0.0000003 (5σ)	Extraordinary claims need extraordinary evidence
Quality control	Varies	Depends on cost of errors

Choosing α: A Decision Framework

Alpha Selection Based on Error Costs

def recommend_alpha(false_positive_cost, false_negative_cost, prior_prob_H0=0.5):
    """Recommend alpha based on relative costs of errors."""
    ratio = false_positive_cost / false_negative_cost
    
    if ratio > 10:
        alpha_rec = 0.01
        rationale = "False positives are very costly -> use strict α"
    elif ratio > 3:
        alpha_rec = 0.05
        rationale = "Moderate concern about false positives -> standard α"
    elif ratio > 1:
        alpha_rec = 0.10
        rationale = "False negatives relatively costly -> relaxed α"
    else:
        alpha_rec = 0.20
        rationale = "False negatives very costly -> very relaxed α (screening)"
    
    print(f"FP cost / FN cost = {ratio:.1f}")
    print(f"Recommended α = {alpha_rec}")
    print(f"Rationale: {rationale}")
    return alpha_rec

print("=== Drug with severe side effects ===")
recommend_alpha(false_positive_cost=100, false_negative_cost=5)

print("\n=== Cancer screening test ===")
recommend_alpha(false_positive_cost=1, false_negative_cost=50)

print("\n=== Website A/B test ===")
recommend_alpha(false_positive_cost=2, false_negative_cost=3)

The Multiple Testing Problem

Multiple Testing Inflation

If you run 20 tests all at α = 0.05, you expect about 1 false positive even if all H₀s are true.

Simulating Multiple Testing

# Simulate multiple testing inflation
n_tests = 20
alpha = 0.05
n_simulations = 10000
any_false_positive = 0

for _ in range(n_simulations):
    p_values = [stats.ttest_1samp(np.random.normal(0, 1, 30), 0)[1]
                for _ in range(n_tests)]
    if any(p < alpha for p in p_values):
        any_false_positive += 1

family_wise_error_rate = any_false_positive / n_simulations
expected_fewr = 1 - (1 - alpha)**n_tests

print(f"{n_tests} tests at α={alpha}")
print(f"Simulated FWER: {family_wise_error_rate:.4f}")
print(f"Theoretical FWER: {expected_fewr:.4f}")
print(f"1 − (1 − 0.05)²⁰ = {1-(0.95**20):.4f}")

# Bonferroni correction
bonferroni_alpha = alpha / n_tests
print(f"\nBonferroni-corrected α = {alpha}/{n_tests} = {bonferroni_alpha:.4f}")

Critical Values at Common Significance Levels

# For one-sample z-test (large n)
print("Critical values for z-test:")
print(f"{'α':<8} {'Two-tailed zc':<18} {'One-tailed zc'}")
for alpha in [0.10, 0.05, 0.025, 0.01, 0.005, 0.001]:
    zc_two = stats.norm.ppf(1 - alpha/2)
    zc_one = stats.norm.ppf(1 - alpha)
    print(f"{alpha:<8.3f} ±{zc_two:<17.4f} {zc_one:.4f}")

Key Takeaways

Summary: Significance Levels

α is set before the experiment — choosing after seeing results is cheating
α = 0.05 is a convention, not a law — choose based on costs of error types
α = the long-run false positive rate under repeated experiments with H₀ true
Multiple testing inflates false positives — always correct (Bonferroni, FDR)
Journals' overemphasis on p less than 0.05 has caused the replication crisis
Report your chosen α in your methods section before analysis

Significance Levels (α) — Choosing Alpha and What It Means