The Replication Crisis
Advanced Statistical Methods
Why Many Published Results Fail to Reproduce
The replication crisis has revealed that a significant proportion of published scientific findings cannot be reproduced, driven by p-hacking, HARKing, publication bias, and underpowered studies. Statistical remedies are reshaping research practice.
- Psychology: Large-scale replication projects found that as few as 36% of published studies replicate
- Medicine: Pre-clinical cancer research faces replication rates below 50%
- Economics: Replication audits have corrected influential policy recommendations
The replication crisis is science's immune system: painful, but ultimately strengthening the body of knowledge.
Definition: Replication Crisis
The replication crisis is a methodological crisis in science in which researchers have found that the results of many scientific studies are difficult or impossible to reproduce or replicate. It affects psychology, medicine, economics, biology, and other fields, raising fundamental questions about the reliability of published scientific findings.
"There are lies, damned lies, and statistics." Often attributed to Benjamin Disraeli, the line is best read as a commentary on the misuse of statistics.
The Evidence
Reproducibility Project: Psychology
Definition: Reproducibility Project
The Open Science Collaboration (2015) attempted to replicate 100 experimental and correlational studies published in three top psychology journals. Only 36% of replications achieved statistical significance ($p < 0.05$), compared to 97% of the original studies. The average effect size in replications was about half that of the originals.
Other Replication Efforts
| Project | Field | Findings |
|---|---|---|
| Amgen in-house replication effort (Begley & Ellis, 2012) | Preclinical cancer research | Only 6 of 53 (11%) landmark findings confirmed |
| Many Labs 2 | Psychology | 14 of 28 (50%) replications significant |
| Social Sciences Replication Project | Social science | 13 of 21 (62%) replications significant in the original direction |
| REMAP | Medicine | Significant replication failures across multiple therapeutic areas |
Defining Replication
A "successful" replication means the independent experiment reaches the same qualitative conclusion (e.g., significant effect in the same direction). Exact numerical replication is not required; what matters is the overall pattern of evidence.
Causes of Non-Replication
P-Hacking
Definition: P-Hacking
P-hacking (also called data dredging or significance chasing) is the practice of analyzing data in multiple ways until a statistically significant result is obtained, without adjusting for multiple comparisons. It exploits what are often called researcher degrees of freedom: the many discretionary choices available during analysis.
Common forms of p-hacking:
- Testing multiple outcomes but reporting only the significant ones
- Adding or removing covariates until $p < 0.05$
- Running multiple statistical tests (t-test, ANOVA, regression) and choosing the one that works
- Collecting more data until significance is reached (optional stopping)
- Excluding outliers based on different criteria until significance
- Analyzing subgroups until a significant effect appears
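The cost of unadjusted multiple testing can be worked out directly. The following is a minimal sketch, assuming the tests are independent, each is run at level `alpha`, and every null hypothesis is true; the helper name `familywise_error_rate` is hypothetical and chosen only for illustration.

```python
def familywise_error_rate(k, alpha=0.05):
    """Probability of at least one false positive across k independent
    tests when every null hypothesis is true (illustrative helper)."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 20):
    rate = familywise_error_rate(k)
    print(f"{k:>2} tests: P(at least one p < 0.05) = {rate:.3f}")
```

Under these assumptions, 20 unadjusted tests give roughly a 64% chance of at least one spurious "finding". Real outcomes are usually correlated, so the exact figure will differ, but the direction of the problem is the same.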
Optional Stopping
If you collect data sequentially and test after each new observation, the probability of eventually finding $p < 0.05$ can be much higher than 5%, even when the null hypothesis is true. This is a form of p-hacking that exploits the sequential nature of data collection.
HARKing
Definition: HARKing
HARKing (Hypothesizing After Results are Known) is the practice of presenting post-hoc hypotheses as if they were formulated a priori. The researcher observes the data, identifies a pattern, and then writes the introduction as if the hypothesis was predicted before data collection.
HARKing is particularly insidious because:
- It is undetectable from the published paper alone
- It inflates the apparent confirmatory nature of research
- It makes the literature appear more theoretically driven than it actually is
The Garden of Forking Paths
Definition: Garden of Forking Paths
The garden of forking paths (Gelman & Loken, 2013) describes the situation where researchers face many possible analytical choices, and even without conscious p-hacking, the cumulative effect of these choices inflates false positive rates. Each decision point (which covariates to include, which outliers to remove, which subgroups to analyze) branches into multiple analyses.
Even a well-intentioned researcher can arrive at $p < 0.05$ through legitimate analytical choices, because the number of possible analyses is enormous and each choice has a small but real chance of producing a false positive.
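A rough simulation can make the forking-paths effect concrete. The sketch below is illustrative only: it assumes two groups drawn from the same distribution (so the null is true) and a hypothetical set of six defensible analyses (three outlier rules crossed with two tests). It reports how often at least one path reaches significance, which is an upper envelope on what a data-contingent analyst might report rather than a model of any real workflow.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def trim(x, z_cut):
    """Drop points more than z_cut sample SDs from the sample mean."""
    if z_cut is None:
        return x
    z = (x - x.mean()) / x.std(ddof=1)
    return x[np.abs(z) <= z_cut]

def forking_path_p_values(a, b):
    """p-values from six defensible analyses of the same two groups."""
    p_values = []
    for z_cut in (None, 3.0, 2.0):  # three outlier rules
        a_t, b_t = trim(a, z_cut), trim(b, z_cut)
        p_values.append(stats.ttest_ind(a_t, b_t).pvalue)
        p_values.append(
            stats.mannwhitneyu(a_t, b_t, alternative="two-sided").pvalue
        )
    return p_values

n_sims, n = 1000, 50
first_path = any_path = 0
for _ in range(n_sims):
    a, b = rng.normal(size=n), rng.normal(size=n)  # null hypothesis is true
    p = forking_path_p_values(a, b)
    first_path += p[0] < 0.05    # one pre-specified analysis
    any_path += min(p) < 0.05    # best of the six paths

print(f"Pre-specified single analysis:   {first_path / n_sims:.3f}")
print(f"Any of six analyses significant: {any_path / n_sims:.3f}")
```

Because the six analyses are highly correlated, the inflation is smaller than for six independent tests, but the second rate can never fall below the first.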
Statistical Remedies
Pre-Registration
Definition: Pre-Registration
Pre-registration is the practice of publicly documenting a study's hypotheses, methods, and analysis plan before data collection begins. Platforms like OSF (Open Science Framework), ClinicalTrials.gov, and AsPredicted.org timestamp these plans and make them publicly accessible.
Pre-registration addresses:
- HARKing (hypotheses cannot be fabricated after seeing results)
- P-hacking (analysis plan is fixed before data collection)
- Publication bias (all registered studies are discoverable regardless of results)
Multi-Lab Replications
Definition: Multi-Lab Replication
Multi-lab replications coordinate many independent research teams to replicate the same study across diverse samples and settings. Projects like Many Labs, PSYCHE, and the Reproducibility Project use standardized protocols to ensure comparability.
Advantages:
- Large aggregate sample sizes provide high statistical power
- Cross-cultural variation tests generalizability
- Independent teams reduce single-lab bias
- Pre-registered protocols limit analytical flexibility
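The power advantage of large aggregate samples can be sketched numerically. The example below assumes a two-sided, two-sample comparison with equal group sizes and uses a normal approximation to the test statistic; `approx_power` is a hypothetical helper, and the effect size $d = 0.2$ is an arbitrary illustration.

```python
import numpy as np
from scipy import stats

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate two-sided power of a two-sample comparison
    (normal approximation; illustrative helper)."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    ncp = d * np.sqrt(n_per_group / 2)  # expected z-statistic under the effect
    return stats.norm.cdf(ncp - z_crit) + stats.norm.cdf(-ncp - z_crit)

for n in (20, 50, 200, 1000):
    print(f"n per group = {n:>4}: power to detect d = 0.2 is "
          f"{approx_power(0.2, n):.2f}")
```

With 20 participants per group, a small effect is detected only about one time in ten, whereas pooling to 1,000 per group makes detection nearly certain under this approximation.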
Effect Sizes and Confidence Intervals
Definition: Effect Size Reporting
Reporting effect sizes (Cohen's $d$, $\eta^2$, odds ratios, correlation coefficients) alongside p-values provides a more complete picture of results. A very small effect may be statistically significant in a very large sample but is practically negligible.
| Effect Size | Small | Medium | Large |
|---|---|---|---|
| Cohen's d | 0.2 | 0.5 | 0.8 |
| $\eta^2$ (ANOVA) | 0.01 | 0.06 | 0.14 |
| Pearson's r | 0.10 | 0.30 | 0.50 |
| Odds Ratio | 1.5 | 2.5 | 4.3 |
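As a sketch of effect size reporting in practice, the snippet below computes Cohen's $d$ with a confidence interval for simulated data. It assumes equal-variance groups, uses a common large-sample approximation for the standard error of $d$, and the function name `cohens_d_with_ci` and the simulated effect of 0.02 are hypothetical choices for illustration.

```python
import numpy as np
from scipy import stats

def cohens_d_with_ci(x, y, confidence=0.95):
    """Cohen's d (pooled SD) with an approximate normal-theory CI."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    pooled_var = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
    d = (x.mean() - y.mean()) / np.sqrt(pooled_var)
    # Common large-sample approximation to the standard error of d
    se = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    z = stats.norm.ppf(0.5 + confidence / 2)
    return d, (d - z * se, d + z * se)

rng = np.random.default_rng(1)
treated = rng.normal(0.02, 1, 100_000)  # tiny true effect, huge sample
control = rng.normal(0.00, 1, 100_000)

d, (lo, hi) = cohens_d_with_ci(treated, control)
p = stats.ttest_ind(treated, control).pvalue
print(f"d = {d:.3f}, 95% CI [{lo:.3f}, {hi:.3f}], p = {p:.4g}")
```

Reporting the interval alongside the p-value makes it visible when a result that clears the significance threshold still sits far below the "small" benchmark in the table above.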
Bayesian Analysis
Definition: Bayesian Analysis
Bayesian methods provide an alternative framework that:
- Produces posterior probabilities of hypotheses rather than p-values
- Naturally incorporates prior information
- Avoids the dichotomous significant/non-significant decision
- Provides Bayes factors that quantify evidence for or against the null
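The Bayes factor compares the evidence for two hypotheses by taking the ratio of how well each predicts the observed data $D$:
Bayes Factor
$$BF_{10} = \frac{P(D \mid H_1)}{P(D \mid H_0)}$$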
| Bayes factor $BF_{10}$ | Interpretation |
|---|---|
| 1–3 | Anecdotal evidence for $H_1$ |
| 3–10 | Moderate evidence for $H_1$ |
| 10–30 | Strong evidence for $H_1$ |
| 30–100 | Very strong evidence for $H_1$ |
| > 100 | Extreme evidence for $H_1$ |
| 1/3–1 | Anecdotal evidence for $H_0$ |
| 1/10–1/3 | Moderate evidence for $H_0$ |
| < 1/10 | Strong (or stronger) evidence for $H_0$ |
p-Values vs Bayes Factors
A p-value measures the probability of data at least as extreme as those observed, given $H_0$; it does not measure the probability that $H_0$ is true. Bayes factors directly compare the evidence for both hypotheses. Two studies with the same p-value can have dramatically different Bayes factors depending on sample size and effect magnitude.
Open Science Reforms
Registered Reports
Definition: Registered Reports
Registered Reports are a publication format where journals commit to accepting or rejecting papers based on the methods and analysis plan before results are known. Because the decision does not depend on the outcome, this removes publication bias at the source for those studies.
Open Data and Open Materials
Definition: Open Data
Open data means making raw data publicly available. Open materials means sharing analysis code, survey instruments, and protocols. Together they enable independent verification and replication.
Incentive Reform
| Current System | Proposed Reform |
|---|---|
| Publish novel, positive results | Value replication and null results |
| Reward p < 0.05 | Reward rigorous methodology |
| Single studies count | Cumulative evidence counts |
| Career advancement via publication count | Career advancement via reproducibility |
Quantifying the Impact
Inflation of Effect Sizes
Winner's Curse
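The winner's curse describes how statistically significant effects are systematically overestimated. If the true effect is $\theta$, the estimate is $\hat{\theta} \sim N(\theta, \sigma^2)$, and we only publish when $\hat{\theta}$ exceeds the critical value $c$, the expected published effect is:

$$E[\hat{\theta} \mid \hat{\theta} > c] = \theta + \sigma \, \lambda\!\left(\frac{c - \theta}{\sigma}\right)$$

where $\lambda(z) = \phi(z) / (1 - \Phi(z))$ is the inverse Mills ratio, representing selection bias.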
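The selection effect can be checked numerically. This is a minimal sketch under stated assumptions: the estimate is normally distributed around the true effect with standard error `se`, publication requires the estimate to exceed a one-sided critical value, and the values `theta = 0.2` and `se = 0.15` are hypothetical. It compares the truncated-normal mean (true effect plus standard error times the inverse Mills ratio) against a brute-force simulation.

```python
import numpy as np
from scipy import stats

def expected_published_effect(theta, se, alpha=0.05):
    """Mean of a N(theta, se^2) estimate given that it exceeds the
    one-sided critical value (truncated-normal mean)."""
    c = stats.norm.ppf(1 - alpha) * se   # critical value on the effect scale
    z = (c - theta) / se
    inverse_mills = stats.norm.pdf(z) / stats.norm.sf(z)
    return theta + se * inverse_mills

theta, se = 0.2, 0.15                    # hypothetical true effect and SE
analytic = expected_published_effect(theta, se)

rng = np.random.default_rng(2)
estimates = rng.normal(theta, se, 1_000_000)
published = estimates[estimates > stats.norm.ppf(0.95) * se]

print(f"True effect:           {theta:.3f}")
print(f"Analytic E[published]: {analytic:.3f}")
print(f"Simulated mean:        {published.mean():.3f}")
```

The two estimates should agree closely, and both exceed the true effect; the gap widens as the standard error grows relative to the effect, which is why underpowered literatures show the most inflation.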
False Discovery Rate
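Definition: False Discovery Rate
The false discovery rate (FDR) is the expected proportion of false positives among all significant results:

$$\mathrm{FDR} = E\left[\frac{FP}{FP + TP}\right]$$

When a fraction $\pi$ of tested hypotheses correspond to real effects and tests have power $1 - \beta$ at level $\alpha$, this is approximately

$$\mathrm{FDR} \approx \frac{\alpha (1 - \pi)}{\alpha (1 - \pi) + (1 - \beta)\,\pi}$$

If 30% of tested hypotheses are true and the test has 80% power at $\alpha = 0.05$:

$$\mathrm{FDR} \approx \frac{0.05 \times 0.70}{0.05 \times 0.70 + 0.80 \times 0.30} = \frac{0.035}{0.275} \approx 0.127$$

Even with nominal $\alpha = 0.05$, the FDR can be substantial when many null hypotheses are tested.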
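The scenario above (30% of hypotheses true, 80% power, significance level 0.05) can be reproduced with a few lines of arithmetic. The sketch below assumes the FDR is approximated by expected false positives divided by expected significant results; `false_discovery_rate` is a hypothetical helper, and the alternative priors and the lower power of 0.35 are arbitrary values added to show sensitivity.

```python
def false_discovery_rate(prior_true, power, alpha=0.05):
    """Expected share of significant results that are false positives,
    given the fraction of tested hypotheses that are true."""
    false_pos = alpha * (1 - prior_true)   # nulls that reach significance
    true_pos = power * prior_true          # real effects that are detected
    return false_pos / (false_pos + true_pos)

print(f"Scenario from the text: FDR = {false_discovery_rate(0.30, 0.80):.3f}")

for prior in (0.5, 0.3, 0.1):
    for power in (0.80, 0.35):
        fdr = false_discovery_rate(prior, power)
        print(f"prior = {prior:.1f}, power = {power:.2f}: FDR = {fdr:.3f}")
```

The first line gives about 0.127, so roughly one in eight significant results is false in that scenario. The grid shows the rate climbing as true hypotheses become rarer or studies lose power.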
Python Implementation
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# --- Simulate p-hacking via optional stopping ---
def simulate_optional_stopping(true_effect=0, n_max=500, alpha=0.05, n_sims=10000):
    """
    Show how optional stopping inflates the rate of significant results.
    Data accumulate within each simulated study; a t-test is run after
    every 5 new observations per group, stopping when p < alpha or
    n_max is reached.
    """
    rejections = 0
    sample_sizes = []
    for _ in range(n_sims):
        # Draw the full sequence once so later looks reuse earlier data
        data1 = np.random.normal(0, 1, n_max)
        data2 = np.random.normal(true_effect, 1, n_max)
        stopped_at = n_max
        for n in range(10, n_max + 1, 5):
            _, p_val = stats.ttest_ind(data1[:n], data2[:n])
            if p_val < alpha:
                rejections += 1
                stopped_at = n
                break
        # Record the final sample size for every study, stopped early or not
        sample_sizes.append(stopped_at)
    rejection_rate = rejections / n_sims
    return rejection_rate, np.mean(sample_sizes)

# Rejection rate at different true effects
# (this is a false positive rate only when the true effect is 0)
print("=== Optional Stopping Simulation ===")
print(f"{'True Effect':<15} {'Rejection rate (α=0.05)':<25} {'Avg Sample Size':<15}")
print("-" * 55)
for true_eff in [0, 0.1, 0.2, 0.3]:
    rate, avg_n = simulate_optional_stopping(true_effect=true_eff)
    print(f"{true_eff:<15.1f} {rate:<25.3f} {avg_n:<15.1f}")
```
```python
# --- Simulate p-hacking via multiple testing ---
def simulate_p_hacking(n_tests=20, n_sims=10000, alpha=0.05):
    """
    Show how testing multiple outcomes inflates false positive rate.
    """
    # No p-hacking: test only the primary outcome
    fp_no_hack = sum(
        1 for _ in range(n_sims)
        if stats.ttest_ind(
            np.random.normal(0, 1, 100),
            np.random.normal(0, 1, 100)
        )[1] < alpha
    ) / n_sims

    # Test n_tests outcomes, take the smallest p-value,
    # and compare it to a Bonferroni-corrected threshold
    fp_hack = 0
    for _ in range(n_sims):
        p_vals = [stats.ttest_ind(
            np.random.normal(0, 1, 100),
            np.random.normal(0, 1, 100)
        )[1] for _ in range(n_tests)]
        if min(p_vals) < alpha / n_tests:
            fp_hack += 1
    fp_hack_corrected = fp_hack / n_sims

    # P-hacking without correction: smallest p-value against the nominal alpha
    fp_hack_uncorrected = sum(
        1 for _ in range(n_sims)
        if min([stats.ttest_ind(
            np.random.normal(0, 1, 100),
            np.random.normal(0, 1, 100)
        )[1] for _ in range(n_tests)]) < alpha
    ) / n_sims

    return fp_no_hack, fp_hack_uncorrected, fp_hack_corrected

print("\n=== Multiple Testing Simulation (20 outcomes) ===")
fp1, fp2, fp3 = simulate_p_hacking()
print(f"Single test (no hacking): FPR = {fp1:.3f}")
print(f"20 tests, no correction (hacking): FPR = {fp2:.3f}")
print(f"20 tests, Bonferroni corrected: FPR = {fp3:.3f}")
```
```python
# --- Funnel plot of published vs all studies ---
np.random.seed(42)
true_effect = 0.3
true_se = np.random.uniform(0.05, 0.3, 200)
true_effects = np.random.normal(true_effect, 0.1, 200)
observed_effects = np.random.normal(true_effects, true_se)
p_values = 2 * (1 - stats.norm.cdf(np.abs(observed_effects / true_se)))

# "Published" studies (significant only)
published = p_values < 0.05

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# All studies
axes[0].scatter(observed_effects, true_se, s=20, alpha=0.5, c='gray')
axes[0].axvline(x=true_effect, color='red', linestyle='--', label='True effect')
axes[0].set_xlabel('Observed Effect Size')
axes[0].set_ylabel('Standard Error')
axes[0].set_title(f'All Studies (n={len(observed_effects)})')
axes[0].invert_yaxis()
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Published only (publication bias)
axes[1].scatter(observed_effects[published], true_se[published], s=30, alpha=0.7, c='blue')
axes[1].scatter(observed_effects[~published], true_se[~published], s=15, alpha=0.3, c='gray',
                label='Not published')
axes[1].axvline(x=true_effect, color='red', linestyle='--', label='True effect')
axes[1].set_xlabel('Observed Effect Size')
axes[1].set_ylabel('Standard Error')
axes[1].set_title(f'Published Only (n={published.sum()})')
axes[1].invert_yaxis()
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.suptitle('Publication Bias: Funnel Plot Distortion', fontsize=14)
plt.tight_layout()
plt.savefig('publication_bias.png', dpi=150)
plt.show()

# Effect size inflation
print("\n=== Effect Size Inflation ===")
print(f"True effect: {true_effect:.3f}")
print(f"Mean of all observed effects: {np.mean(observed_effects):.3f}")
print(f"Mean of published (significant) effects: {np.mean(observed_effects[published]):.3f}")
print(f"Inflation factor: {np.mean(observed_effects[published])/true_effect:.2f}x")
```
```python
# --- Bayes factor computation (JZS prior, numerical integration) ---
from scipy import integrate

def bayes_factor_ttest(x, y, r=np.sqrt(2)/2):
    """
    Compute BF10 for an independent-samples t-test using the JZS prior
    (Rouder et al., 2009), with Cauchy prior scale r on the effect size.
    """
    n1, n2 = len(x), len(y)
    t_stat, _ = stats.ttest_ind(x, y)
    df = n1 + n2 - 2
    n_eff = n1 * n2 / (n1 + n2)  # effective sample size for two groups

    # Likelihood of t under H1, averaged over g; the inverse-chi-square(1)
    # prior on g is what turns a normal prior on effect size into a Cauchy
    def integrand(g):
        scale = 1 + n_eff * g * r**2
        return (scale**-0.5
                * (1 + t_stat**2 / (scale * df))**(-(df + 1) / 2)
                * (2 * np.pi)**-0.5 * g**-1.5 * np.exp(-1 / (2 * g)))

    marginal_h1, _ = integrate.quad(integrand, 0, np.inf)
    # Likelihood of t under H0 (same proportionality constant as above)
    marginal_h0 = (1 + t_stat**2 / df)**(-(df + 1) / 2)
    return marginal_h1 / marginal_h0  # BF10 (evidence for alternative)

print("\n=== Bayes Factor vs p-Value ===")
print(f"{'Scenario':<30} {'p-value':<10} {'BF10':<10} {'Interpretation':<20}")
print("-" * 70)
# Labels describe the simulated true effect (first group's mean shift)
scenarios = [
    ("Large effect, n=20", np.random.normal(0.8, 1, 20), np.random.normal(0, 1, 20)),
    ("Small effect, n=20", np.random.normal(0.2, 1, 20), np.random.normal(0, 1, 20)),
    ("Modest effect, n=200", np.random.normal(0.3, 1, 200), np.random.normal(0, 1, 200)),
    ("No effect, n=200", np.random.normal(0, 1, 200), np.random.normal(0, 1, 200)),
]
for name, x, y in scenarios:
    t_stat, p_val = stats.ttest_ind(x, y)
    bf = bayes_factor_ttest(x, y)
    if bf > 100:
        interp = "Extreme for H1"
    elif bf > 30:
        interp = "Very strong for H1"
    elif bf > 10:
        interp = "Strong for H1"
    elif bf > 3:
        interp = "Moderate for H1"
    elif bf > 1/3:
        interp = "Anecdotal"
    elif bf > 1/10:
        interp = "Moderate for H0"
    else:
        interp = "Strong for H0"
    print(f"{name:<30} {p_val:<10.4f} {bf:<10.2f} {interp:<20}")
```
Key Takeaways
Summary: The Replication Crisis
- The replication crisis affects multiple fields; in large-scale psychology projects, only about 36–50% of published effects replicated.
- P-hacking (multiple testing, optional stopping, outcome switching) inflates false positive rates far beyond the nominal 5%.
- HARKing (hypothesizing after results are known) makes post-hoc explorations appear confirmatory.
- The garden of forking paths means even well-intentioned analytical choices can produce false positives.
- Pre-registration is among the most direct remedies, fixing hypotheses and analysis plans before data collection.
- Bayes factors provide evidence for or against the null, avoiding the binary significant/non-significant trap.
- Open science reforms (registered reports, open data, open materials) are transforming incentive structures to value rigor over novelty.