Open Science Practices

Advanced Statistical Methods

Making Research Transparent and Reproducible

Open science practices — open data, open code, preregistration, and FAIR principles — increase transparency, enable verification, and accelerate scientific progress through collaboration and reuse.

Research credibility — Open data allows independent verification of analytical conclusions
Collaboration — Shared code and datasets enable other researchers to build on published work
Funding compliance — Increasingly, grant agencies mandate data sharing and open access publication

Open science is not just good ethics — it produces better, more trustworthy science.

The replication crisis in psychology, medicine, and social sciences has catalyzed a paradigm shift toward open science — a set of principles and practices that increase the transparency, accessibility, and reproducibility of research.

Definition: Open Science

Open science is the movement to make scientific research (including publications, data, samples, and software) freely accessible and transparently documented, enabling verification, replication, and extension by others.

The Four Pillars of Open Science

Core Components

Open Data — raw and processed datasets are publicly available
Open Code — analysis scripts and software are shared
Open Materials — experimental stimuli, protocols, and procedures are documented
Open Access — publications are freely available without paywalls

The Reproducibility Project (Open Science Collaboration, 2015) attempted to replicate 100 published psychology studies and found that only 36% yielded statistically significant results in replications, compared to 97% in the originals. This alarming discrepancy underscored the need for structural reforms.

FAIR Data Principles

The FAIR principles (Wilkinson et al., 2016) provide a framework for data management:

Definition: FAIR Principles

Findable — data has a unique identifier (DOI) and is indexed in searchable repositories
Accessible — data is retrievable via standard protocols (HTTP, FTP) with metadata always accessible
Interoperable — data uses formal, shared vocabularies and formats (JSON-LD, RDF)
Reusable — data has clear usage licenses and detailed provenance

FAIR Compliance Score

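$$S_{\text{FAIR}} = \frac{1}{n}\sum_{i=1}^{n} w_i \cdot r_i$$

Here,

$S_{\text{FAIR}}$ = Overall FAIR compliance score (0–1)
$n$ = Number of criteria assessed
$w_i$ = Weight for criterion $i$ (non-negative; the score is guaranteed to stay within 0–1 only if the weights sum to at most $n$, for example $w_i = 1$ for equal weighting)
$r_i$ = Rating for criterion $i$ (0 or 1)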
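As a worked illustration of the score, the sketch below evaluates the formula for four binary ratings. The function name `fair_score`, the example ratings, and the choice to rescale the weights so that they average to 1 (which keeps the result between 0 and 1) are assumptions made for illustration, not part of the FAIR specification.

```python
def fair_score(ratings, weights=None):
    """Weighted FAIR compliance score: (1/n) * sum(w_i * r_i).

    Assumption: weights are rescaled so they average to 1, which keeps
    the score in the 0-1 range regardless of the raw weight scale.
    """
    n = len(ratings)
    if n == 0:
        raise ValueError("at least one criterion is required")
    if any(r not in (0, 1) for r in ratings):
        raise ValueError("ratings must be 0 or 1")
    if weights is None:
        weights = [1.0] * n  # equal weighting
    if len(weights) != n or any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be non-negative, non-zero, one per rating")
    scale = n / sum(weights)  # rescale so the weights average to 1
    return sum(w * scale * r for w, r in zip(weights, ratings)) / n


# Hypothetical dataset: findable, accessible, and reusable, but not interoperable
ratings = [1, 1, 0, 1]
equal_weight_score = fair_score(ratings)
# Same ratings, with interoperability counted twice as heavily
weighted_score = fair_score(ratings, weights=[1, 1, 2, 1])
print(round(equal_weight_score, 2), round(weighted_score, 2))
```

With equal weights the score is simply the fraction of criteria met; giving more weight to the unmet criterion lowers it.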

Data Sharing and Repositories

Data Sharing Considerations

De-identification — remove all personally identifiable information (PII)
Informed consent — ensure participants consented to data sharing
Privacy regulations — comply with GDPR, HIPAA, and institutional policies
Embargo periods — allow data exclusivity for a limited time if needed
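The de-identification step can be sketched with pandas. This is a minimal illustration only: the column names, the example records, and the helper `deidentify` are hypothetical, and replacing IDs with keyed hashes is pseudonymization rather than full anonymization. Quasi-identifiers (age, postcode, rare conditions) would still need review, and nothing here by itself establishes GDPR or HIPAA compliance.

```python
import hashlib
import hmac

import pandas as pd


def deidentify(df, direct_identifiers, id_column, secret_key):
    """Drop direct identifiers and replace the ID column with keyed hashes.

    secret_key (bytes) must stay private; without it the pseudonyms
    cannot be recomputed from the original IDs.
    """
    shared = df.drop(columns=direct_identifiers)
    shared[id_column] = [
        hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()[:16]
        for value in df[id_column]
    ]
    return shared


# Hypothetical records, for illustration only
records = pd.DataFrame({
    "participant_id": [101, 102, 103],
    "name": ["Ana", "Ben", "Chloe"],
    "email": ["ana@example.org", "ben@example.org", "chloe@example.org"],
    "score": [12.5, 9.0, 14.2],
})

shared = deidentify(records, ["name", "email"], "participant_id", b"replace-with-a-private-key")
print(shared)
```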

Major data repositories include:

Repository	Domain	DOI Support	Access
Dryad	General	Yes	Free (with publication)
Zenodo	General	Yes	Free
ICPSR	Social sciences	Yes	Restricted access
OpenNeuro	Neuroimaging	Yes	Free
Figshare	General	Yes	Free (limited)
OSF	Multi-disciplinary	Yes	Free

Code Sharing and Reproducibility

Definition: Computational Reproducibility

A study is computationally reproducible if an independent researcher can obtain the same results (numbers, figures, tables) from the same data using the same code and environment.

Reproducibility Checklist

Use version control (Git) for all analysis code
Specify software versions (e.g., Python 3.11, R 4.3.1)
Use containerization (Docker, Singularity) for environment reproducibility
Set random seeds for stochastic algorithms
Include a requirements.txt or renv.lock for dependency management
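Two of the checklist items, recording software versions and seeding random number generation, can be sketched as follows. The helper names `environment_snapshot` and `seeded_draws` are hypothetical, and the snippet assumes NumPy is installed; packages that are not installed are recorded as `None` instead of raising an error.

```python
import json
import platform
from importlib import metadata

import numpy as np


def environment_snapshot(packages):
    """Record the Python version, platform, and installed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


def seeded_draws(seed, size=3):
    """Draw from a locally seeded generator instead of the global NumPy state."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=size)


snapshot = environment_snapshot(["numpy", "pandas"])
print(json.dumps(snapshot, indent=2))

# The same seed gives the same draws on a given NumPy version
first = seeded_draws(42)
second = seeded_draws(42)
print(np.array_equal(first, second))
```

A snapshot like this complements, but does not replace, a lock file or container image, since it records versions without being able to restore them.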

Python Implementation: Reproducibility Workflow

Open Science Reproducibility Toolkit

import numpy as np
import pandas as pd
import hashlib
import json
import datetime
from pathlib import Path

# 1. Set random seeds for reproducibility
np.random.seed(42)

# 2. Create a reproducible analysis pipeline
class ReproducibleAnalysis:
    def __init__(self, seed=42):
        self.seed = seed
        np.random.seed(seed)  # ensure the seed recorded in the report is the one in effect
        self.artifacts = {}
        self.log = []

    def log_step(self, step_name, description):
        entry = {
            "step": step_name,
            "description": description,
            "timestamp": datetime.datetime.now().isoformat()
        }
        self.log.append(entry)
        return entry

    def compute_hash(self, data, algorithm='sha256'):
        """Compute hash of data for integrity verification."""
        if isinstance(data, np.ndarray):
            data_bytes = data.tobytes()
        elif isinstance(data, pd.DataFrame):
            data_bytes = pd.util.hash_pandas_object(data).values.tobytes()
        else:
            data_bytes = str(data).encode()
        return hashlib.new(algorithm, data_bytes).hexdigest()

    def save_artifact(self, name, data, directory='artifacts'):
        """Save an analysis artifact and record a hash of its in-memory contents."""
        Path(directory).mkdir(parents=True, exist_ok=True)
        filepath = Path(directory) / f"{name}.csv"

        if isinstance(data, pd.DataFrame):
            data.to_csv(filepath, index=False)
        else:
            pd.DataFrame(data).to_csv(filepath, index=False)

        file_hash = self.compute_hash(data)
        self.artifacts[name] = {
            "path": str(filepath),
            "hash": file_hash,
            "rows": len(data) if hasattr(data, '__len__') else None
        }
        self.log_step(f"save_{name}", f"Saved {filepath} (hash: {file_hash[:16]}...)")
        return filepath

    def generate_report(self):
        """Generate a reproducibility report."""
        report = {
            "seed": self.seed,
            "numpy_version": np.__version__,
            "pandas_version": pd.__version__,
            "artifacts": self.artifacts,
            "execution_log": self.log,
            "generated_at": datetime.datetime.now().isoformat()
        }
        return report

# 3. Example analysis with full traceability
analysis = ReproducibleAnalysis(seed=42)
analysis.log_step("init", "Initialize reproducible analysis with seed=42")

# Simulate data
n = 500
data = pd.DataFrame({
    'x': np.random.normal(0, 1, n),
    'group': np.random.choice(['A', 'B', 'C'], n)
})
data['y'] = 2.5 * data['x'] + np.random.normal(0, 1, n)

# Save raw data
analysis.save_artifact("raw_data", data)

# Analysis step
analysis.log_step("regression", "Fit OLS: y ~ x")
X = np.column_stack([np.ones(n), data['x'].values])
beta, residuals, rank, sv = np.linalg.lstsq(X, data['y'].values, rcond=None)

# Save results
results = pd.DataFrame({'coefficient': ['intercept', 'slope'], 'estimate': beta})
analysis.save_artifact("regression_results", results)

# 4. Generate reproducibility report
report = analysis.generate_report()
print(json.dumps(report, indent=2))
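The toolkit above hashes the in-memory object, not the bytes written to disk. Someone who downloads a shared file may instead want to check the file itself against a published checksum. The following is a minimal sketch of that check; the function names `sha256_of_file` and `verify_file` and the file contents are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_hash):
    """Return True if the file's SHA-256 matches the recorded hash."""
    return sha256_of_file(path) == expected_hash


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "raw_data.csv"
    artifact.write_text("x,y\n1,2\n")
    recorded = sha256_of_file(artifact)  # the value one would publish with the data

    unchanged_ok = verify_file(artifact, recorded)

    artifact.write_text("x,y\n1,3\n")  # simulate a silent modification
    tampered_ok = verify_file(artifact, recorded)

print(unchanged_ok, tampered_ok)
```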

Preregistration and Transparency

Definition: Preregistration

Preregistration is the act of specifying a study's design, hypotheses, and analysis plan in a time-stamped, publicly accessible document before data collection begins.

The key distinction is between:

Exploratory vs. Confirmatory Research

Confirmatory — hypothesis-driven, preregistered, strict α-control; results confirm or refute a priori predictions
Exploratory — data-driven, hypothesis-generating, flexible; results suggest new theories but require independent confirmation
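One common way to keep the family-wise error rate at α across several preregistered confirmatory tests is the Holm-Bonferroni step-down procedure. It is shown here as one possible sketch of "strict α-control", not as a method the text above prescribes; the p-values are hypothetical.

```python
import numpy as np


def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down: True where the null hypothesis is rejected."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)  # test the smallest p-value first
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared with alpha / (m - k + 1)
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


# Hypothetical p-values from four preregistered hypotheses
preregistered_p = [0.004, 0.030, 0.020, 0.200]
decisions = holm_reject(preregistered_p)
uncorrected = np.asarray(preregistered_p) < 0.05

print(decisions.tolist())    # [True, False, False, False]
print(uncorrected.tolist())  # [True, True, True, False]
```

Three of the four tests fall below .05 when taken one at a time, but only the first survives the correction. This is the kind of difference a preregistered analysis plan fixes in advance.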

Registered Reports

Registered Reports (RR) are a publishing format where journals peer-review and accept studies before results are known, based on the importance of the research question and the quality of the methodology.

Benefits of Registered Reports

Greatly reduces publication bias (file drawer problem), because acceptance does not depend on the results
Reduces p-hacking and HARKing
Rewards rigorous methodology over novel results
Provides in-principle acceptance (IPA) guarantee

Measuring Reproducibility

Reproducibility Ratio

$$R = \frac{\text{Number of successful replications}}{\text{Total replication attempts}}$$

Here,

$R$ = Reproducibility ratio (0–1)

The Open Science Collaboration (2015) found R ≈ 0.36 for psychology. Subsequent large-scale replication projects have found:

Field	Replication Rate	Source
Psychology	36%	OSC (2015)
Economics	61%	Camerer et al. (2016)
Social Science	62%	Camerer et al. (2018)
Cancer Biology	46%	Errington et al. (2021)
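Because each replication project covers a limited number of studies, a reproducibility ratio is an estimate with sampling uncertainty. The sketch below computes R with a Wilson score interval; the counts (36 successes in 100 attempts) are illustrative round numbers and the function names are hypothetical.

```python
import math


def reproducibility_ratio(successes, attempts):
    """R = successful replications / total replication attempts."""
    if attempts <= 0 or not 0 <= successes <= attempts:
        raise ValueError("need 0 <= successes <= attempts and attempts > 0")
    return successes / attempts


def wilson_interval(successes, attempts, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / attempts
    denom = 1 + z**2 / attempts
    centre = (p + z**2 / (2 * attempts)) / denom
    half_width = z * math.sqrt(p * (1 - p) / attempts + z**2 / (4 * attempts**2)) / denom
    return centre - half_width, centre + half_width


# Illustrative counts only, not taken from any specific project
R = reproducibility_ratio(36, 100)
low, high = wilson_interval(36, 100)
print(f"R = {R:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Even with 100 attempts the interval spans close to twenty percentage points, so the rates in the table above are best compared with this uncertainty in mind.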

Python Implementation: Detecting p-hacking

p-curve Analysis

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def p_curve(p_values):
    """
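    Analyze the distribution of significant p-values.
    A right-skewed p-curve (more very small p-values) suggests evidential value.
    A flat curve is what a true null effect produces; a left-skewed curve
    (p-values piling up just below .05) suggests p-hacking.
    """
    p_values = np.asarray(p_values, dtype=float)  # accept lists as well as arrays
    sig_pvals = p_values[p_values < 0.05]

    # Bin the significant p-values into five equal-width bins on [0, 0.05]
    bins = [0, 0.01, 0.02, 0.03, 0.04, 0.05]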
    observed_counts, _ = np.histogram(sig_pvals, bins=bins)

    # Expected under uniform (no effect)
    expected = np.full_like(observed_counts, len(sig_pvals) / 5, dtype=float)

    # Chi-square test for uniformity
    chi2, p_uniform = stats.chisquare(observed_counts, f_exp=expected)

    # Skewness of significant p-values
    if len(sig_pvals) > 2:
        skewness = stats.skew(sig_pvals)
    else:
        skewness = np.nan

    return {
        'n_significant': len(sig_pvals),
        'n_total': len(p_values),
        'chi2_uniformity': chi2,
        'p_uniformity': p_uniform,
        'skewness': skewness,
        'counts': observed_counts,
        'bins': bins
    }

# Simulate p-values with and without p-hacking
np.random.seed(42)
n_studies = 200

# Scenario 1: Genuine effects (right-skewed p-curve)
genuine_pvals = np.concatenate([
    np.random.beta(0.5, 20, 80),   # Strong effects
    np.random.uniform(0, 1, 120)    # Null effects
])

# Scenario 2: p-hacked (simulated here as flat over 0.01-0.05; real p-hacking tends toward left skew)
p_hacked_pvals = np.random.uniform(0.01, 0.05, 100)  # Only p < .05 reported

# Analyze both
result_genuine = p_curve(genuine_pvals)
result_hacked = p_curve(p_hacked_pvals)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].bar(range(5), result_genuine['counts'], color='steelblue', edgecolor='black')
axes[0].set_xticks(range(5))
axes[0].set_xticklabels(['0-0.01', '0.01-0.02', '0.02-0.03', '0.03-0.04', '0.04-0.05'])
axes[0].set_title(f'Genuine Effects\nSkewness: {result_genuine["skewness"]:.2f}')
axes[0].set_ylabel('Frequency')

axes[1].bar(range(5), result_hacked['counts'], color='coral', edgecolor='black')
axes[1].set_xticks(range(5))
axes[1].set_xticklabels(['0-0.01', '0.01-0.02', '0.02-0.03', '0.03-0.04', '0.04-0.05'])
axes[1].set_title(f'p-Hacked Data\nSkewness: {result_hacked["skewness"]:.2f}')

plt.tight_layout()
plt.savefig('p_curve_analysis.png', dpi=150)
plt.show()
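The scenarios above draw p-values directly from beta and uniform distributions. As a complementary sketch, p-values can be generated from simulated two-sample t-tests, which shows where the right skew comes from. The effect size (0.5), the group size (30), the number of simulated studies, and the helper names are all assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats


def simulate_t_test_pvalues(effect_size, n_per_group, n_studies, rng):
    """Two-sample t-test p-values for simulated studies with a given true effect."""
    treatment = rng.normal(effect_size, 1.0, size=(n_studies, n_per_group))
    control = rng.normal(0.0, 1.0, size=(n_studies, n_per_group))
    return stats.ttest_ind(treatment, control, axis=1).pvalue


def share_below_half_alpha(p_values, alpha=0.05):
    """Among significant p-values, the fraction that falls below alpha / 2."""
    significant = p_values[p_values < alpha]
    return float(np.mean(significant < alpha / 2))


rng = np.random.default_rng(42)
true_effect_p = simulate_t_test_pvalues(0.5, 30, 5000, rng)
null_effect_p = simulate_t_test_pvalues(0.0, 30, 5000, rng)

share_true = share_below_half_alpha(true_effect_p)
share_null = share_below_half_alpha(null_effect_p)

# A true effect concentrates significant p-values near zero (right skew);
# under the null they are spread roughly evenly across (0, 0.05).
print(f"true effect: {share_true:.2f}, null effect: {share_null:.2f}")
```

A share well above one half indicates right skew, while a share near one half is what a flat p-curve looks like.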

Incentives and Barriers

Barriers to Open Science

Career incentives — traditional metrics reward novelty over reproducibility
Data privacy — sensitive data (medical, educational) cannot be shared openly
Time and resources — curating open data requires additional effort
Scoop fears — researchers worry others will use their data prematurely
Lack of training — many researchers lack computational skills for reproducible workflows

Key Takeaways

Summary: Open Science Practices

Open data, code, and materials increase transparency and enable verification of results
FAIR principles provide a structured framework for data management
Preregistration separates confirmatory from exploratory research, reducing p-hacking
Registered Reports counter publication bias by accepting studies before results are known
p-curve analysis can detect evidential value and identify potential p-hacking
Reproducibility ratios vary widely across fields (36–62%), highlighting the need for reform
Computational reproducibility requires version control, containerization, and random seed management
