Open Science Practices
Advanced Statistical Methods
Making Research Transparent and Reproducible
Open science practices (open data, open code, preregistration, and the FAIR principles) increase transparency, enable verification, and accelerate scientific progress through collaboration and reuse.
- Research credibility: open data allows independent verification of analytical conclusions
- Collaboration: shared code and datasets enable other researchers to build on published work
- Funding compliance: grant agencies increasingly mandate data sharing and open access publication
Open science is not just good ethics; it produces better, more trustworthy science.
The replication crisis in psychology, medicine, and the social sciences has catalyzed a paradigm shift toward open science: a set of principles and practices that increase the transparency, accessibility, and reproducibility of research.
Definition: Open Science
Open science is the movement to make scientific research (including publications, data, samples, and software) freely accessible and transparently documented, enabling verification, replication, and extension by others.
The Four Pillars of Open Science
Core Components
- Open Data: raw and processed datasets are publicly available
- Open Code: analysis scripts and software are shared
- Open Materials: experimental stimuli, protocols, and procedures are documented
- Open Access: publications are freely available without paywalls
The Reproducibility Project (Open Science Collaboration, 2015) attempted to replicate 100 published psychology studies and found that only 36% yielded statistically significant results in replications, compared to 97% in the originals. This alarming discrepancy underscored the need for structural reforms.
FAIR Data Principles
The FAIR principles (Wilkinson et al., 2016) provide a framework for data management:
Definition: FAIR Principles
- Findable: data has a unique identifier (DOI) and is indexed in searchable repositories
- Accessible: data is retrievable via standard protocols (HTTP, FTP) with metadata always accessible
- Interoperable: data uses formal, shared vocabularies and formats (JSON-LD, RDF)
- Reusable: data has clear usage licenses and detailed provenance
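As a rough illustration of how these four properties can show up in a machine-readable metadata record, the sketch below builds a JSON-LD description using the schema.org `Dataset` vocabulary. Every value (the DOI, URLs, names) is a placeholder, and the `missing_fields` helper and its list of required fields are assumptions made for this example, not part of the FAIR specification.

```python
import json

# Hypothetical dataset description; every value below is a placeholder.
dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example reaction-time study",
    "description": "Trial-level reaction times from a hypothetical experiment.",
    "identifier": "https://doi.org/10.0000/example",  # Findable: persistent identifier
    "license": "https://creativecommons.org/licenses/by/4.0/",  # Reusable: clear license
    "creator": {"@type": "Person", "name": "A. Researcher"},
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",  # Interoperable: open, documented format
        "contentUrl": "https://example.org/data/trials.csv",  # Accessible: plain HTTP(S)
    },
}


def missing_fields(record, required=("identifier", "license", "distribution", "description")):
    """Return the required metadata fields that are absent or empty (illustrative check)."""
    return [field for field in required if not record.get(field)]


metadata_json = json.dumps(dataset_metadata, indent=2)
print(metadata_json)
print("Missing fields:", missing_fields(dataset_metadata))
```

A record like this is typically deposited alongside the data so that repositories and search engines can index it.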
FAIR Compliance Score
$$F = \frac{\sum_{i=1}^{n} w_i \, r_i}{\sum_{i=1}^{n} w_i}$$
Here,
- $F$ = Overall FAIR compliance score (0 to 1)
- $w_i$ = Weight for criterion $i$
- $r_i$ = Rating for criterion $i$ (0 or 1)
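A score of this kind can be computed in a few lines. The sketch below assumes the score is the weight-normalized sum of binary ratings, so that it always falls between 0 and 1; the criterion names, weights, and ratings are hypothetical.

```python
def fair_score(ratings, weights):
    """Weighted share of satisfied criteria, in [0, 1].

    ratings: dict mapping criterion -> 0 or 1
    weights: dict mapping criterion -> non-negative weight
    """
    if ratings.keys() != weights.keys():
        raise ValueError("ratings and weights must cover the same criteria")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[c] * ratings[c] for c in ratings) / total_weight


# Hypothetical self-assessment: 1 = criterion met, 0 = not met
ratings = {"findable": 1, "accessible": 1, "interoperable": 0, "reusable": 1}
weights = {"findable": 1.0, "accessible": 1.0, "interoperable": 1.0, "reusable": 1.0}

score = fair_score(ratings, weights)
print(f"FAIR compliance score: {score:.2f}")  # 3 of 4 equally weighted criteria met -> 0.75
```

With equal weights the score is simply the fraction of criteria met; unequal weights let a project emphasize, say, licensing over format.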
Data Sharing and Repositories
Data Sharing Considerations
- De-identification: remove all personally identifiable information (PII)
- Informed consent: ensure participants consented to data sharing
- Privacy regulations: comply with GDPR, HIPAA, and institutional policies
- Embargo periods: allow data exclusivity for a limited time if needed
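The de-identification step can be sketched as follows. This minimal example drops direct identifiers and replaces the participant ID with a keyed hash (HMAC-SHA256) so that records stay linkable without exposing the original ID. The field names, the `pseudonymize` helper, and the key are all hypothetical, and removing direct identifiers alone does not guarantee anonymity: combinations of remaining fields can still identify people.

```python
import hashlib
import hmac

# Hypothetical names of columns that directly identify a participant
PII_FIELDS = {"name", "email", "date_of_birth"}


def pseudonymize(record, secret_key, id_field="participant_id"):
    """Drop direct identifiers and replace the ID with a keyed hash."""
    cleaned = {key: value for key, value in record.items() if key not in PII_FIELDS}
    digest = hmac.new(secret_key, str(record[id_field]).encode(), hashlib.sha256)
    cleaned[id_field] = digest.hexdigest()[:12]  # shortened for readability
    return cleaned


record = {
    "participant_id": 17,
    "name": "Jane Doe",
    "email": "jane@example.org",
    "date_of_birth": "1990-01-01",
    "score": 42,
}
key = b"keep-this-key-out-of-the-repository"  # placeholder; store real keys securely

shared = pseudonymize(record, key)
print(shared)
```

Using a keyed hash rather than a plain hash means that someone without the key cannot recompute the mapping from a list of known IDs.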
Major data repositories include:
| Repository | Domain | DOI Support | Access |
|---|---|---|---|
| Dryad | General | Yes | Free (with publication) |
| Zenodo | General | Yes | Free |
| ICPSR | Social sciences | Yes | Restricted access |
| OpenNeuro | Neuroimaging | Yes | Free |
| Figshare | General | Yes | Free (limited) |
| OSF | Multi-disciplinary | Yes | Free |
Code Sharing and Reproducibility
Definition: Computational Reproducibility
A study is computationally reproducible if an independent researcher can obtain the same results (numbers, figures, tables) from the same data using the same code and environment.
Reproducibility Checklist
- Use version control (Git) for all analysis code
- Specify software versions (e.g., Python 3.11, R 4.3.1)
- Use containerization (Docker, Singularity) for environment reproducibility
- Set random seeds for stochastic algorithms
- Include a `requirements.txt` or `renv.lock` for dependency management
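To complement a lock file, an analysis can also record the environment it actually ran in. The sketch below uses only the Python standard library; the `environment_snapshot` helper and the package list are illustrative assumptions, and packages that are not installed are recorded as `None` rather than raising an error.

```python
import json
import platform
from importlib import metadata


def environment_snapshot(packages):
    """Collect interpreter, OS, and installed package versions for a run."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


snapshot = environment_snapshot(["numpy", "pandas"])
print(json.dumps(snapshot, indent=2))
```

Saving this snapshot next to the results makes it easier to spot a version mismatch when someone later fails to reproduce a number.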
Python Implementation: Reproducibility Workflow
Open Science Reproducibility Toolkit
```python
import datetime
import hashlib
import json
from pathlib import Path

import numpy as np
import pandas as pd

# 1. Set random seeds for reproducibility
SEED = 42
np.random.seed(SEED)


# 2. Create a reproducible analysis pipeline
class ReproducibleAnalysis:
    """Record artifacts, their hashes, and a timestamped log of analysis steps."""

    def __init__(self, seed=42):
        self.seed = seed  # stored for the report; the global seed is set above
        self.artifacts = {}
        self.log = []

    def log_step(self, step_name, description):
        """Append a timestamped entry to the execution log."""
        entry = {
            "step": step_name,
            "description": description,
            "timestamp": datetime.datetime.now().isoformat(),
        }
        self.log.append(entry)
        return entry

    def compute_hash(self, data, algorithm="sha256"):
        """Compute hash of data for integrity verification."""
        if isinstance(data, np.ndarray):
            data_bytes = data.tobytes()
        elif isinstance(data, pd.DataFrame):
            # One uint64 hash per row, covering values and index
            data_bytes = pd.util.hash_pandas_object(data).values.tobytes()
        else:
            data_bytes = str(data).encode()
        return hashlib.new(algorithm, data_bytes).hexdigest()

    def save_artifact(self, name, data, directory="artifacts"):
        """Save an artifact as CSV and record a hash of its in-memory contents."""
        Path(directory).mkdir(parents=True, exist_ok=True)
        filepath = Path(directory) / f"{name}.csv"
        if isinstance(data, pd.DataFrame):
            data.to_csv(filepath, index=False)
        else:
            pd.DataFrame(data).to_csv(filepath, index=False)
        data_hash = self.compute_hash(data)
        self.artifacts[name] = {
            "path": str(filepath),
            "hash": data_hash,
            "rows": len(data) if hasattr(data, "__len__") else None,
        }
        self.log_step(f"save_{name}", f"Saved {filepath} (hash: {data_hash[:16]}...)")
        return filepath

    def generate_report(self):
        """Generate a reproducibility report."""
        return {
            "seed": self.seed,
            "numpy_version": np.__version__,
            "pandas_version": pd.__version__,
            "artifacts": self.artifacts,
            "execution_log": self.log,
            "generated_at": datetime.datetime.now().isoformat(),
        }


# 3. Example analysis with full traceability
analysis = ReproducibleAnalysis(seed=SEED)
analysis.log_step("init", f"Initialize reproducible analysis with seed={SEED}")

# Simulate data: y depends linearly on x (true slope 2.5) plus unit-variance noise
n = 500
data = pd.DataFrame({
    "x": np.random.normal(0, 1, n),
    "group": np.random.choice(["A", "B", "C"], n),
})
data["y"] = 2.5 * data["x"] + np.random.normal(0, 1, n)

# Save raw data
analysis.save_artifact("raw_data", data)

# Analysis step: ordinary least squares with an intercept column
analysis.log_step("regression", "Fit OLS: y ~ x")
X = np.column_stack([np.ones(n), data["x"].values])
beta, residuals, rank, sv = np.linalg.lstsq(X, data["y"].values, rcond=None)

# Save results
results = pd.DataFrame({"coefficient": ["intercept", "slope"], "estimate": beta})
analysis.save_artifact("regression_results", results)

# 4. Generate reproducibility report
report = analysis.generate_report()
print(json.dumps(report, indent=2))
```
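A reader who downloads shared artifacts will usually want to check the files on disk rather than in-memory objects. The following is a minimal sketch, using only the standard library, of hashing a file in chunks and comparing it with a recorded checksum; the `file_sha256` and `verify_artifact` helpers and the file contents are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=65536):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_hash):
    """Return True if the file's current hash matches the recorded one."""
    return file_sha256(path) == expected_hash


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "raw_data.csv"
    artifact.write_text("x,y\n1,2\n")
    recorded = file_sha256(artifact)          # checksum published with the data

    unchanged = verify_artifact(artifact, recorded)
    artifact.write_text("x,y\n1,3\n")         # simulate a silent modification
    modified = verify_artifact(artifact, recorded)

print(unchanged, modified)  # True False
```

Publishing the checksum alongside the dataset lets anyone confirm they are analyzing exactly the bytes the authors used.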
Preregistration and Transparency
Definition: Preregistration
Preregistration is the act of specifying a study's design, hypotheses, and analysis plan in a time-stamped, publicly accessible document before data collection begins.
The key distinction is between:
Exploratory vs. Confirmatory Research
- Confirmatory: hypothesis-driven, preregistered, with strict α-control; results confirm or refute a priori predictions
- Exploratory: data-driven, hypothesis-generating, flexible; results suggest new theories but require independent confirmation
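One common way to keep strict control of α when a confirmatory plan contains several preregistered tests is a familywise correction. The sketch below implements the Holm-Bonferroni step-down procedure as an illustration; the choice of this particular correction and the p-values are assumptions for the example, not something the text prescribes.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a reject/retain decision per hypothesis, controlling familywise error."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared with alpha / (m - k + 1)
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


p_values = [0.001, 0.04, 0.03, 0.20]  # hypothetical preregistered tests
decisions = holm_bonferroni(p_values)
print(decisions)  # [True, False, False, False]
```

Here only the first hypothesis survives: 0.001 is below 0.05/4, but the next smallest value, 0.03, exceeds 0.05/3, so the procedure stops.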
Registered Reports
Registered Reports (RR) are a publishing format where journals peer-review and accept studies before results are known, based on the importance of the research question and the quality of the methodology.
Benefits of Registered Reports
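- Counters publication bias (the file drawer problem), because acceptance does not depend on the results
- Reduces p-hacking and HARKing (hypothesizing after the results are known)
- Rewards rigorous methodology over novel results
- Provides in-principle acceptance (IPA): the journal commits to publishing whatever the results, provided the approved protocol is followed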
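To see why fixing the analysis plan in advance matters, consider one form of p-hacking: optional stopping, where a researcher tests repeatedly as data arrive and stops at the first significant result. The simulation below is a sketch under simple assumptions (a true null effect, normally distributed data with known unit variance, a z-test, and a hypothetical schedule of peeking every 10 observations). It compares the false positive rate of a single planned test with that of repeated peeking.

```python
import math
import random


def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, assuming known unit variance."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n_sims=2000, n_max=100, peek_every=10, alpha=0.05, seed=42):
    """Return (fixed-design, optional-stopping) false positive rates under the null."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        sample = [rng.gauss(0, 1) for _ in range(n_max)]  # the null is true
        # Fixed design: a single test at the planned sample size
        fixed_hits += z_test_p(sample) < alpha
        # Optional stopping: test after every batch, count a hit if any test is significant
        peeking_hits += any(
            z_test_p(sample[:n]) < alpha
            for n in range(peek_every, n_max + 1, peek_every)
        )
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = simulate()
print(f"Fixed sample size: {fixed_rate:.3f}")
print(f"Optional stopping: {peeking_rate:.3f}")
```

The fixed design should stay close to the nominal 5%, while repeated peeking pushes the rate well above it; a preregistered stopping rule removes that flexibility.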
Measuring Reproducibility
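Reproducibility Ratio
$$R = \frac{\text{number of successfully replicated studies}}{\text{number of replication attempts}}$$
Here,
- $R$ = Reproducibility ratio (0 to 1)
The Open Science Collaboration (2015) found $R \approx 0.36$ for psychology. Subsequent large-scale replication projects have found: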
| Field | Replication Rate | Source |
|---|---|---|
| Psychology | 36% | OSC (2015) |
| Economics | 61% | Camerer et al. (2016) |
| Social Science | 62% | Camerer et al. (2018) |
| Cancer Biology | 46% | Errington et al. (2021) |
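Because these rates come from modest numbers of studies, it helps to report a ratio together with its uncertainty. The sketch below computes a reproducibility ratio as successes divided by attempts and adds a Wilson score interval; the counts are hypothetical values chosen to give a ratio of 0.36 and are not the actual counts of any project in the table.

```python
import math


def reproducibility_ratio(successes, attempts):
    """Share of replication attempts that succeeded."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return successes / attempts


def wilson_interval(successes, attempts, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / attempts
    denom = 1 + z**2 / attempts
    center = (p + z**2 / (2 * attempts)) / denom
    half_width = z * math.sqrt(p * (1 - p) / attempts + z**2 / (4 * attempts**2)) / denom
    return center - half_width, center + half_width


successes, attempts = 36, 100  # hypothetical counts
ratio = reproducibility_ratio(successes, attempts)
low, high = wilson_interval(successes, attempts)
print(f"R = {ratio:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

With only a few dozen studies per project the interval is wide, which is worth keeping in mind when comparing fields.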
Python Implementation: Detecting p-hacking
p-curve Analysis
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt


def p_curve(p_values):
    """
    Analyze the distribution of significant p-values (p < 0.05).

    A right-skewed p-curve (many very small p-values) suggests evidential value.
    A flat curve is what a true null effect produces; a left-skewed curve
    (p-values piling up just below 0.05) suggests p-hacking.
    """
    p_values = np.asarray(p_values, dtype=float)
    sig_pvals = p_values[p_values < 0.05]

    # Count significant p-values in five equal-width bins
    bins = [0, 0.01, 0.02, 0.03, 0.04, 0.05]
    observed_counts, _ = np.histogram(sig_pvals, bins=bins)

    # Expected counts if significant p-values were uniform (no true effect)
    n_bins = len(bins) - 1
    expected = np.full(n_bins, len(sig_pvals) / n_bins)

    # Chi-square goodness-of-fit test against uniformity
    chi2, p_uniform = stats.chisquare(observed_counts, f_exp=expected)

    # Skewness of significant p-values (positive = right-skewed)
    if len(sig_pvals) > 2:
        skewness = stats.skew(sig_pvals)
    else:
        skewness = np.nan

    return {
        'n_significant': len(sig_pvals),
        'n_total': len(p_values),
        'chi2_uniformity': chi2,
        'p_uniformity': p_uniform,
        'skewness': skewness,
        'counts': observed_counts,
        'bins': bins
    }


# Simulate p-values with and without p-hacking
np.random.seed(42)

# Scenario 1: Genuine effects (right-skewed p-curve), 200 studies in total
genuine_pvals = np.concatenate([
    np.random.beta(0.5, 20, 80),   # Strong effects
    np.random.uniform(0, 1, 120)   # Null effects
])

# Scenario 2: selectively reported results, flat between 0.01 and 0.05
p_hacked_pvals = np.random.uniform(0.01, 0.05, 100)  # Only p < .05 reported

# Analyze both
result_genuine = p_curve(genuine_pvals)
result_hacked = p_curve(p_hacked_pvals)

# Plot the two p-curves side by side
bin_labels = ['0-0.01', '0.01-0.02', '0.02-0.03', '0.03-0.04', '0.04-0.05']
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].bar(range(5), result_genuine['counts'], color='steelblue', edgecolor='black')
axes[0].set_xticks(range(5))
axes[0].set_xticklabels(bin_labels)
axes[0].set_title(f'Genuine Effects\nSkewness: {result_genuine["skewness"]:.2f}')
axes[0].set_ylabel('Frequency')

axes[1].bar(range(5), result_hacked['counts'], color='coral', edgecolor='black')
axes[1].set_xticks(range(5))
axes[1].set_xticklabels(bin_labels)
axes[1].set_title(f'p-Hacked Data\nSkewness: {result_hacked["skewness"]:.2f}')

plt.tight_layout()
plt.savefig('p_curve_analysis.png', dpi=150)
plt.show()
```
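A simpler companion to the chi-square check is a binomial test for right skew: among significant p-values, count how many fall below 0.025 and ask whether that share exceeds the 50% expected when there is no true effect. The sketch below uses only the standard library; the `p_curve_binomial` helper and the example p-values are hypothetical.

```python
from math import comb


def p_curve_binomial(p_values, alpha=0.05):
    """One-sided binomial test for right skew among significant p-values."""
    significant = [p for p in p_values if p < alpha]
    n = len(significant)
    n_low = sum(p < alpha / 2 for p in significant)
    # P(X >= n_low) for X ~ Binomial(n, 0.5): the chance of at least this many
    # low p-values if significant p-values were uniformly distributed
    p_right_skew = sum(comb(n, j) for j in range(n_low, n + 1)) / 2**n
    return {"n_significant": n, "n_low": n_low, "p_right_skew": p_right_skew}


# Hypothetical set of reported p-values
example = [0.001, 0.003, 0.004, 0.008, 0.012, 0.015, 0.021, 0.033, 0.041, 0.20]
result = p_curve_binomial(example)
print(result)
```

In this example 7 of the 9 significant p-values lie below 0.025, giving a one-sided p-value of 46/512, or about 0.09: suggestive of right skew but not conclusive with so few studies.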
Incentives and Barriers
Barriers to Open Science
- Career incentives: traditional metrics reward novelty over reproducibility
- Data privacy: sensitive data (medical, educational) cannot be shared openly
- Time and resources: curating open data requires additional effort
- Scoop fears: researchers worry others will use their data prematurely
- Lack of training: many researchers lack computational skills for reproducible workflows
Key Takeaways
Summary: Open Science Practices
- Open data, code, and materials increase transparency and enable verification of results
- FAIR principles provide a structured framework for data management
- Preregistration separates confirmatory from exploratory research, reducing p-hacking
- Registered Reports counter publication bias by accepting studies before results are known
- p-curve analysis can detect evidential value and identify potential p-hacking
- Reproducibility ratios vary widely across fields (36% to 62%), highlighting the need for reform
- Computational reproducibility requires version control, containerization, and random seed management