The Replication Crisis

Advanced Statistical Methods

Why Many Published Results Fail to Reproduce

The replication crisis has revealed that a significant proportion of published scientific findings cannot be reproduced, driven by p-hacking, HARKing, publication bias, and underpowered studies. Statistical remedies are reshaping research practice.

Psychology: A large-scale replication project found that only 36% of 100 published studies replicated
Medicine: Pre-clinical cancer research faces replication rates below 50%
Economics: Replication audits have corrected influential policy recommendations

The replication crisis is science's immune system — painful but ultimately strengthening the body of knowledge.

Definition: Replication Crisis

The replication crisis is an ongoing methodological crisis in which researchers have found that the results of many published studies are difficult or impossible to reproduce or replicate. It affects psychology, medicine, economics, biology, and other fields, raising fundamental questions about the reliability of published scientific findings.

"There are lies, damned lies, and statistics." Popularized by Mark Twain, who attributed the line to Benjamin Disraeli; it is best read as a commentary on the misuse of statistics.

The Evidence

Reproducibility Project: Psychology

Definition: Reproducibility Project

The Open Science Collaboration (2015) attempted to replicate 100 experimental and correlational studies published in three top psychology journals. Only 36% of replications achieved statistical significance ($p < 0.05$), compared to 97% of the original studies. The average effect size in replications was about half that of the originals.

Other Replication Efforts

Project	Field	Findings
Amgen internal replication effort (Begley & Ellis, 2012)	Pre-clinical cancer research	Only 6 of 53 (11%) landmark findings confirmed
Many Labs 2	Psychology	14 of 28 (50%) replications significant
Social Sciences Replication Project	Social science	13 of 21 (62%) replications significant in the original direction
REMAP	Medicine	Significant replication failures across multiple therapeutic areas

Defining Replication

A "successful" replication means the independent experiment reaches the same qualitative conclusion (e.g., significant effect in the same direction). Exact numerical replication is not required — what matters is the overall pattern of evidence.

Causes of Non-Replication

P-Hacking

Definition: P-Hacking

P-hacking (also called data dredging or significance chasing) is the practice of analyzing data in multiple ways until a statistically significant result is obtained, without adjusting for multiple comparisons. It exploits researcher degrees of freedom: the many defensible choices available when collecting and analyzing data.

Common forms of p-hacking:

Testing multiple outcomes but reporting only the significant ones
Adding or removing covariates until $p < 0.05$
Running multiple statistical tests (t-test, ANOVA, regression) and choosing the one that works
Collecting more data until significance is reached (optional stopping)
Excluding outliers based on different criteria until significance
Analyzing subgroups until a significant effect appears

Optional Stopping

If you collect data sequentially and test after each new observation, the probability of eventually finding $p < 0.05$ can be much higher than 5%, even when the null hypothesis is true. This is a form of p-hacking that exploits the sequential nature of data collection.

HARKing

Definition: HARKing

HARKing (Hypothesizing After the Results are Known) is the practice of presenting post-hoc hypotheses as if they had been formulated a priori. The researcher observes the data, identifies a pattern, and then writes the introduction as if that pattern had been predicted before data collection.

HARKing is particularly insidious because:

It is undetectable from the published paper alone
It inflates the apparent confirmatory nature of research
It makes the literature appear more theoretically driven than it actually is

The Garden of Forking Paths

Definition: Garden of Forking Paths

The garden of forking paths (Gelman & Loken, 2013) describes the situation where researchers face many possible analytical choices, and even without conscious p-hacking, the cumulative effect of these choices inflates false positive rates. Each decision point (which covariates to include, which outliers to remove, which subgroups to analyze) branches into multiple analyses.

Even a well-intentioned researcher can arrive at $p < 0.05$ through legitimate analytical choices, because the number of possible analyses is enormous and each choice has a small but real chance of producing a false positive.
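The effect of branching analysis choices can be sketched with a small simulation. Everything here is an illustrative assumption rather than anything specified in the text: the dataset is pure noise, the hypothetical `sex` variable defines subgroups, and the three forks (outlier cutoff, subgroup, test) give 18 analysis paths. The sketch reports how often at least one path reaches $p < 0.05$, which is an upper bound on the problem, since a real researcher typically follows only one data-dependent path rather than trying all of them.

```python
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def mann_whitney(a, b):
    # Two-sided Mann-Whitney U test, wrapped so both tests share one call signature
    return stats.mannwhitneyu(a, b, alternative="two-sided")

def forking_path_p_values(n_per_group=50):
    """Return the p-value from every analysis path applied to one null dataset."""
    group = np.repeat([0, 1], n_per_group)
    outcome = rng.normal(0, 1, 2 * n_per_group)   # no true group difference
    sex = rng.integers(0, 2, 2 * n_per_group)     # hypothetical subgroup variable

    outlier_cutoffs = [np.inf, 2.5, 2.0]          # keep observations with |z| below cutoff
    subgroups = [None, 0, 1]                      # everyone, or a single subgroup
    tests = [stats.ttest_ind, mann_whitney]

    z_scores = np.abs(stats.zscore(outcome))
    p_values = []
    for cutoff, subgroup, test in itertools.product(outlier_cutoffs, subgroups, tests):
        keep = z_scores < cutoff
        if subgroup is not None:
            keep = keep & (sex == subgroup)
        a = outcome[keep & (group == 0)]
        b = outcome[keep & (group == 1)]
        p_values.append(test(a, b).pvalue)
    return np.array(p_values)

n_sims = 500
all_p = np.array([forking_path_p_values() for _ in range(n_sims)])

# Path 0 is the "pre-specified" analysis: no exclusions, full sample, t-test
single_rate = np.mean(all_p[:, 0] < 0.05)
any_rate = np.mean(all_p.min(axis=1) < 0.05)

print(f"Pre-specified path only:     FPR = {single_rate:.3f}")
print(f"Best of {all_p.shape[1]} forking paths:  FPR = {any_rate:.3f}")
```

Because the paths reuse the same data, their p-values are correlated, so the inflation is smaller than it would be for 18 independent tests but still well above the nominal 5%.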

Statistical Remedies

Pre-Registration

Definition: Pre-Registration

Pre-registration is the practice of publicly documenting a study's hypotheses, methods, and analysis plan before data collection begins. Platforms like OSF (Open Science Framework), ClinicalTrials.gov, and AsPredicted.org timestamp and make these plans publicly accessible.

Pre-registration addresses:

HARKing (hypotheses cannot be fabricated after seeing results)
P-hacking (analysis plan is fixed before data collection)
Publication bias (all registered studies are discoverable regardless of results)

Multi-Lab Replications

Definition: Multi-Lab Replication

Multi-lab replications coordinate many independent research teams to replicate the same study across diverse samples and settings. Projects like Many Labs, PSYCHE, and the Reproducibility Project use standardized protocols to ensure comparability.

Advantages:

Large aggregate sample sizes provide high statistical power
Cross-cultural variation tests generalizability
Independent teams reduce single-lab bias
Pre-registered protocols limit analytical flexibility
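The power advantage of pooling can be made concrete with a rough calculation. This is a minimal sketch under stated assumptions: a two-sided, two-sample z-test approximation, a hypothetical true effect of $d = 0.2$, 50 participants per group in a single lab, and 20 labs pooled as if they were one homogeneous sample (between-lab heterogeneity is ignored).

```python
import numpy as np
from scipy import stats

def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test for effect size d."""
    z_crit = stats.norm.isf(alpha / 2)
    ncp = d * np.sqrt(n_per_group / 2)   # expected z-statistic under the alternative
    # Probability of landing in either rejection region
    return stats.norm.sf(z_crit - ncp) + stats.norm.cdf(-z_crit - ncp)

single_lab = two_sample_power(d=0.2, n_per_group=50)
pooled = two_sample_power(d=0.2, n_per_group=50 * 20)

print(f"Single lab (n=50 per group):    power = {single_lab:.3f}")
print(f"20 labs pooled (n=1000/group):  power = {pooled:.3f}")
```

Under these assumptions a single lab detects the effect less than one time in five, while the pooled sample detects it almost always.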

Effect Sizes and Confidence Intervals

Definition: Effect Size Reporting

Reporting effect sizes (Cohen's d, $\eta^2$, odds ratios, correlation coefficients) alongside p-values provides a more complete picture of results. With $n = 10{,}000$, an effect as small as $d = 0.05$ can be statistically significant yet practically negligible.

Effect Size	Small	Medium	Large
Cohen's d	0.2	0.5	0.8
$\eta^2$ (ANOVA)	0.01	0.06	0.14
Pearson's r	0.10	0.30	0.50
Odds Ratio	1.5	2.5	4.3
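To see why significance and effect size must be read together, the following sketch computes Cohen's d from raw data and the p-value that a fixed observed $d = 0.05$ would produce at different sample sizes. The helper names are illustrative, and the sample sizes are taken as per-group counts in an equal-n two-sample t-test, which is an assumption the text does not spell out.

```python
import numpy as np
from scipy import stats

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

def p_value_for_d(d, n_per_group):
    """Two-sided p-value of an equal-n two-sample t-test with observed effect size d."""
    t_stat = d * np.sqrt(n_per_group / 2)   # t = d / sqrt(2 / n)
    df = 2 * n_per_group - 2
    return 2 * stats.t.sf(abs(t_stat), df)

for n in (100, 1_000, 10_000):
    print(f"d = 0.05, n = {n:>6} per group: p = {p_value_for_d(0.05, n):.4f}")
```

The effect size stays the same in every row; only the p-value changes, which is why a small p-value alone says little about practical importance.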

Bayesian Analysis

Definition: Bayesian Analysis

Bayesian methods provide an alternative framework that:

Produces posterior probabilities of hypotheses rather than p-values
Naturally incorporates prior information
Avoids the dichotomous significant/non-significant decision
Provides Bayes factors that quantify evidence for or against the null

The Bayes factor compares the evidence for two hypotheses:

Bayes Factor

$$BF_{10} = \frac{P(\text{data} \mid H_1)}{P(\text{data} \mid H_0)}$$
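A Bayes factor becomes a posterior probability once it is combined with prior odds, since posterior odds equal the Bayes factor times the prior odds. The sketch below shows that conversion; the function name and the default prior of 0.5 are illustrative assumptions.

```python
def posterior_prob_h1(bf10, prior_h1=0.5):
    """Posterior probability of H1 given a Bayes factor and a prior probability."""
    prior_odds = prior_h1 / (1 - prior_h1)
    posterior_odds = bf10 * prior_odds      # Bayes' rule in odds form
    return posterior_odds / (1 + posterior_odds)

for bf10 in (1, 3, 10, 30, 100):
    print(f"BF10 = {bf10:>3}: P(H1 | data) = {posterior_prob_h1(bf10):.3f} "
          f"(prior 0.5), {posterior_prob_h1(bf10, prior_h1=0.1):.3f} (prior 0.1)")
```

The same Bayes factor leads to different posterior probabilities under different priors, which is why Bayes factors are usually reported on their own and the prior is left to the reader.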

$BF_{10}$	Interpretation
1–3	Anecdotal evidence for $H_1$
3–10	Moderate evidence for $H_1$
10–30	Strong evidence for $H_1$
30–100	Very strong evidence for $H_1$
> 100	Extreme evidence for $H_1$
1/3–1	Anecdotal evidence for $H_0$
1/10–1/3	Moderate evidence for $H_0$
< 1/10	Strong evidence for $H_0$

p-Values vs Bayes Factors

A p-value is the probability of obtaining data at least as extreme as those observed, assuming $H_0$ is true; it does not measure the probability that $H_0$ is true. A Bayes factor directly compares how well the two hypotheses predict the data. Two studies with $p = 0.04$ can have dramatically different Bayes factors depending on sample size and effect magnitude.
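The dependence on sample size can be illustrated in a setting simple enough to have a closed form. This sketch assumes a one-sample z-test with known unit variance and a normal prior $\delta \sim N(0, \tau^2)$ on the effect under $H_1$ (with $\tau = 1$ chosen arbitrarily); it is not the JZS prior. In that setting $BF_{10} = (1 + n\tau^2)^{-1/2} \exp\left(\frac{z^2}{2} \cdot \frac{n\tau^2}{1 + n\tau^2}\right)$.

```python
import numpy as np
from scipy import stats

def bf10_normal_prior(z, n, tau=1.0):
    """BF10 for a z-test with known unit variance and a N(0, tau^2) prior on the effect."""
    shrink = n * tau**2 / (1 + n * tau**2)
    return np.sqrt(1 - shrink) * np.exp(0.5 * z**2 * shrink)

# Hold the two-sided p-value fixed at 0.04 and vary only the sample size
z = stats.norm.isf(0.04 / 2)
for n in (20, 200, 2000):
    print(f"p = 0.04, n = {n:>4}: BF10 = {bf10_normal_prior(z, n):.2f}")
```

With the p-value held constant, the Bayes factor drifts from weak support for $H_1$ at small $n$ toward support for $H_0$ at large $n$, because a fixed z-score at large $n$ implies an effect very close to zero.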

Open Science Reforms

Registered Reports

Definition: Registered Reports

Registered Reports are a publication format in which journals peer-review the methods and analysis plan before results are known and commit to publishing the study regardless of the outcome. Because the publication decision no longer depends on the results, this removes a major source of publication bias.

Open Data and Open Materials

Definition: Open Data

Open data means making raw data publicly available. Open materials means sharing analysis code, survey instruments, and protocols. Together they enable independent verification and replication.

Incentive Reform

Current System	Proposed Reform
Publish novel, positive results	Value replication and null results
Reward p < 0.05	Reward rigorous methodology
Single studies count	Cumulative evidence counts
Career advancement via publication count	Career advancement via reproducibility

Quantifying the Impact

Inflation of Effect Sizes

Winner's Curse

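The winner's curse describes how statistically significant effects are systematically overestimated. Suppose the estimate is normally distributed, $\hat{\theta} \sim N(\theta, \text{SE}^2)$, and a result is published only when $\hat{\theta}$ exceeds the critical value $z_{\alpha} \cdot \text{SE}$. The expected published effect is then:

$$E[\hat{\theta} \mid \hat{\theta} > z_{\alpha} \cdot \text{SE}] = \theta + \text{SE} \cdot \lambda\left(z_{\alpha} - \frac{\theta}{\text{SE}}\right)$$

where $\lambda(u) = \phi(u) / (1 - \Phi(u))$ is the inverse Mills ratio, representing selection bias. The bias term is always positive and grows as $\theta / \text{SE}$ shrinks, so underpowered studies overestimate the most.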
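The selection effect can be checked numerically. For a normally distributed estimate, the mean of the published (truncated) estimates is $\theta + \text{SE} \cdot \phi(u) / (1 - \Phi(u))$ with $u = z_{\alpha} - \theta / \text{SE}$. The sketch below compares that expression with a simulation; the true effect of 0.2, the standard error of 0.15, and the one-sided cutoff at $\alpha = 0.05$ are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def expected_published_effect(theta, se, alpha=0.05):
    """Mean of a N(theta, se^2) estimate, given that it exceeds z_alpha * se."""
    z_alpha = stats.norm.isf(alpha)      # one-sided critical value (assumed)
    u = z_alpha - theta / se             # standardized distance from theta to the cutoff
    inverse_mills = stats.norm.pdf(u) / stats.norm.sf(u)
    return theta + se * inverse_mills

# Hypothetical study: true effect 0.2, standard error 0.15
theta, se = 0.2, 0.15
rng = np.random.default_rng(0)
estimates = rng.normal(theta, se, 200_000)
published = estimates[estimates > stats.norm.isf(0.05) * se]

analytic = expected_published_effect(theta, se)
simulated = published.mean()

print(f"True effect:              {theta:.3f}")
print(f"Analytic published mean:  {analytic:.3f}")
print(f"Simulated published mean: {simulated:.3f}")
```

The two published means should agree closely, and both sit well above the true effect.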

False Discovery Rate

Definition: False Discovery Rate

The false discovery rate (FDR) is the expected proportion of false positives among all significant results:

$$\text{FDR} = E\left[\frac{\text{False Positives}}{\text{Total Significant}}\right]$$

If 30% of tested hypotheses correspond to real effects (so 70% are true nulls) and the test has 80% power at $\alpha = 0.05$:

$$\text{FDR} = \frac{0.70 \times 0.05}{0.70 \times 0.05 + 0.30 \times 0.80} = \frac{0.035}{0.035 + 0.24} \approx 12.7\%$$

Even with nominal $\alpha = 0.05$, more than one in eight significant results is a false positive in this scenario, and the FDR grows further as the share of true nulls rises or power falls.
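The arithmetic above generalizes to any mix of real effects, power, and significance level. The following sketch wraps it in a function (the name and argument names are illustrative) and tabulates a few hypothetical scenarios.

```python
def false_discovery_rate(prior_true, power, alpha=0.05):
    """Expected share of significant results that are false positives."""
    false_positives = (1 - prior_true) * alpha   # true nulls that reach significance
    true_positives = prior_true * power          # real effects that reach significance
    return false_positives / (false_positives + true_positives)

# Worked example from the text: 30% real effects, 80% power
print(f"Worked example: FDR = {false_discovery_rate(0.30, 0.80):.1%}")

# Hypothetical scenarios: fewer real effects and lower power both raise the FDR
print(f"{'Real effects':<14}{'Power':<8}{'FDR':<8}")
for prior_true in (0.5, 0.3, 0.1):
    for power in (0.8, 0.5, 0.2):
        fdr = false_discovery_rate(prior_true, power)
        print(f"{prior_true:<14.0%}{power:<8.0%}{fdr:<8.1%}")
```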

Python Implementation

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# --- Simulate p-hacking via optional stopping ---
def simulate_optional_stopping(true_effect=0, n_max=500, alpha=0.05, n_sims=10000):
    """
    Show how optional stopping inflates the rate of significant results.
    Data accumulate within each simulated study: test at n = 10, 15, ...
    per group and stop as soon as p < alpha or n_max is reached.
    With true_effect=0 the rejection rate is the false positive rate.
    """
    rejections = 0
    sample_sizes = []

    for _ in range(n_sims):
        # Draw the full data set once; each look re-tests the first n observations
        data1 = np.random.normal(0, 1, n_max)
        data2 = np.random.normal(true_effect, 1, n_max)
        stopped_at = n_max

        for n in range(10, n_max + 1, 5):
            t_stat, p_val = stats.ttest_ind(data1[:n], data2[:n])

            if p_val < alpha:
                rejections += 1
                stopped_at = n
                break

        sample_sizes.append(stopped_at)

    rejection_rate = rejections / n_sims
    avg_n = np.mean(sample_sizes)
    return rejection_rate, avg_n

# Run at different true effects; only true_effect = 0 gives a false positive rate
print("=== Optional Stopping Simulation ===")
print(f"{'True Effect':<15} {'Rejection rate (α=0.05)':<25} {'Avg Sample Size':<15}")
print("-" * 55)

for true_eff in [0, 0.1, 0.2, 0.3]:
    rate, avg_n = simulate_optional_stopping(true_effect=true_eff)
    print(f"{true_eff:<15.1f} {rate:<25.3f} {avg_n:<15.1f}")

# --- Simulate p-hacking via multiple testing ---
def simulate_p_hacking(n_tests=20, n_sims=10000, alpha=0.05):
    """
    Show how testing multiple outcomes inflates the false positive rate.
    All outcomes are simulated under the null (no group difference).
    """
    def null_p_value():
        # p-value of one two-sample t-test when both groups have the same mean
        return stats.ttest_ind(
            np.random.normal(0, 1, 100),
            np.random.normal(0, 1, 100)
        )[1]

    # No p-hacking: test only the primary outcome
    fp_no_hack = sum(null_p_value() < alpha for _ in range(n_sims)) / n_sims

    # P-hacking: test n_tests outcomes and keep the smallest p-value
    min_p_vals = np.array([
        min(null_p_value() for _ in range(n_tests)) for _ in range(n_sims)
    ])

    # Without correction: significant if the smallest p-value beats alpha
    fp_hack_uncorrected = np.mean(min_p_vals < alpha)

    # With Bonferroni correction: compare against alpha / n_tests instead
    fp_hack_corrected = np.mean(min_p_vals < alpha / n_tests)

    return fp_no_hack, fp_hack_uncorrected, fp_hack_corrected

print("\n=== Multiple Testing Simulation (20 outcomes) ===")
fp1, fp2, fp3 = simulate_p_hacking()
print(f"Single test (no hacking):           FPR = {fp1:.3f}")
print(f"20 tests, no correction (hacking):  FPR = {fp2:.3f}")
print(f"20 tests, Bonferroni corrected:      FPR = {fp3:.3f}")

# --- Funnel plot of published vs all studies ---
np.random.seed(42)
true_effect = 0.3
true_se = np.random.uniform(0.05, 0.3, 200)
true_effects = np.random.normal(true_effect, 0.1, 200)
observed_effects = np.random.normal(true_effects, true_se)
p_values = 2 * (1 - stats.norm.cdf(np.abs(observed_effects / true_se)))

# "Published" studies (significant only)
published = p_values < 0.05

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# All studies
axes[0].scatter(observed_effects, true_se, s=20, alpha=0.5, c='gray')
axes[0].axvline(x=true_effect, color='red', linestyle='--', label='True effect')
axes[0].set_xlabel('Observed Effect Size')
axes[0].set_ylabel('Standard Error')
axes[0].set_title(f'All Studies (n={len(observed_effects)})')
axes[0].invert_yaxis()
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Published only (publication bias)
axes[1].scatter(observed_effects[published], true_se[published], s=30, alpha=0.7, c='blue',
                label='Published')
axes[1].scatter(observed_effects[~published], true_se[~published], s=15, alpha=0.3, c='gray',
                label='Not published')
axes[1].axvline(x=true_effect, color='red', linestyle='--', label='True effect')
axes[1].set_xlabel('Observed Effect Size')
axes[1].set_ylabel('Standard Error')
axes[1].set_title(f'Published Only (n={published.sum()})')
axes[1].invert_yaxis()
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.suptitle('Publication Bias: Funnel Plot Distortion', fontsize=14)
plt.tight_layout()
plt.savefig('publication_bias.png', dpi=150)
plt.show()

# Effect size inflation
print("\n=== Effect Size Inflation ===")
print(f"True effect:                          {true_effect:.3f}")
print(f"Mean of all observed effects:         {np.mean(observed_effects):.3f}")
print(f"Mean of published (significant) effects: {np.mean(observed_effects[published]):.3f}")
print(f"Inflation factor:                     {np.mean(observed_effects[published])/true_effect:.2f}x")

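# --- Bayes Factor computation (JZS prior, numerical integration) ---
from scipy import integrate

def bayes_factor_ttest(x, y, r=np.sqrt(2)/2):
    """
    Compute BF10 for an independent-samples t-test using the JZS prior
    (Rouder et al., 2009), with Cauchy prior scale r on the effect size.
    """
    n1, n2 = len(x), len(y)
    t_stat, _ = stats.ttest_ind(x, y)
    df = n1 + n2 - 2
    n_eff = n1 * n2 / (n1 + n2)  # effective sample size for two groups

    # Marginal likelihood under H0 (up to a constant shared with H1)
    null_like = (1 + t_stat**2 / df) ** (-(df + 1) / 2)

    # Marginal likelihood under H1: integrate over g, which has an
    # inverse-gamma(1/2, 1/2) prior; this mixture yields the Cauchy prior
    def integrand(g):
        scale = 1 + n_eff * g * r**2
        return (scale ** -0.5
                * (1 + t_stat**2 / (scale * df)) ** (-(df + 1) / 2)
                * (2 * np.pi) ** -0.5 * g ** -1.5 * np.exp(-1 / (2 * g)))

    alt_like, _ = integrate.quad(integrand, 0, np.inf)

    return alt_like / null_like  # BF10 (evidence for alternative)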

print("\n=== Bayes Factor vs p-Value ===")
print(f"{'Scenario':<30} {'p-value':<10} {'BF10':<10} {'Interpretation':<20}")
print("-" * 70)

scenarios = [
    ("Large effect, n=20", np.random.normal(0.8, 1, 20), np.random.normal(0, 1, 20)),
    ("Small effect, n=20", np.random.normal(0.2, 1, 20), np.random.normal(0, 1, 20)),
    ("Small-to-medium effect, n=200", np.random.normal(0.3, 1, 200), np.random.normal(0, 1, 200)),
    ("No effect, n=200", np.random.normal(0, 1, 200), np.random.normal(0, 1, 200)),
]

for name, x, y in scenarios:
    t_stat, p_val = stats.ttest_ind(x, y)
    bf = bayes_factor_ttest(x, y)

    # Labels follow the conventional Bayes factor evidence categories
    if bf > 100:
        interp = "Extreme for H1"
    elif bf > 30:
        interp = "Very strong for H1"
    elif bf > 10:
        interp = "Strong for H1"
    elif bf > 3:
        interp = "Moderate for H1"
    elif bf > 1:
        interp = "Anecdotal for H1"
    elif bf > 1/3:
        interp = "Anecdotal for H0"
    elif bf > 1/10:
        interp = "Moderate for H0"
    else:
        interp = "Strong for H0"

    print(f"{name:<30} {p_val:<10.4f} {bf:<10.2f} {interp:<20}")

Key Takeaways

Summary: The Replication Crisis

The replication crisis affects multiple fields: in large-scale psychology and social science projects, only about a third to two-thirds of published effects replicate.
P-hacking (multiple testing, optional stopping, outcome switching) inflates false positive rates far beyond the nominal 5%.
HARKing (hypothesizing after results are known) makes post-hoc explorations appear confirmatory.
The garden of forking paths means even well-intentioned analytical choices can produce false positives.
Pre-registration is among the most effective remedies, fixing hypotheses and analysis plans before data collection.
Bayes factors provide evidence for or against the null, avoiding the binary significant/non-significant trap.
Open science reforms (registered reports, open data, open materials) are transforming incentive structures to value rigor over novelty.
