
Open Science Practices

Advanced Statistical Methods

Making Research Transparent and Reproducible

Open science practices (open data, open code, preregistration, and FAIR principles) increase transparency, enable verification, and accelerate scientific progress through collaboration and reuse.

  • Research credibility: Open data allows independent verification of analytical conclusions
  • Collaboration: Shared code and datasets enable other researchers to build on published work
  • Funding compliance: Increasingly, grant agencies mandate data sharing and open access publication

Open science is not just good ethics; it produces better, more trustworthy science.


The replication crisis in psychology, medicine, and the social sciences has catalyzed a paradigm shift toward open science, a set of principles and practices that increase the transparency, accessibility, and reproducibility of research.

Definition: Open Science

Open science is the movement to make scientific research (including publications, data, samples, and software) freely accessible and transparently documented, enabling verification, replication, and extension by others.


The Four Pillars of Open Science

Core Components

  • Open Data: raw and processed datasets are publicly available
  • Open Code: analysis scripts and software are shared
  • Open Materials: experimental stimuli, protocols, and procedures are documented
  • Open Access: publications are freely available without paywalls

The Reproducibility Project (Open Science Collaboration, 2015) attempted to replicate 100 published psychology studies and found that only 36% yielded statistically significant results in replications, compared to 97% in the originals. This alarming discrepancy underscored the need for structural reforms.


FAIR Data Principles

The FAIR principles (Wilkinson et al., 2016) provide a framework for data management:

Definition: FAIR Principles

  1. Findable: data has a globally unique, persistent identifier (such as a DOI) and is indexed in searchable repositories
  2. Accessible: data is retrievable via standard protocols (HTTP, FTP), and its metadata remains accessible even if the data itself is no longer available
  3. Interoperable: data uses formal, shared vocabularies and formats (JSON-LD, RDF)
  4. Reusable: data has clear usage licenses and detailed provenance

FAIR Compliance Score

$$S_{\text{FAIR}} = \frac{1}{n}\sum_{i=1}^{n} w_i \cdot r_i$$

Here,

  • S_FAIR = Overall FAIR compliance score (0–1)
  • w_i = Weight for criterion i
  • r_i = Rating for criterion i (0 or 1)
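
As a rough illustration of the score above, the following sketch computes it for a single dataset. The function name `fair_score` and the example ratings are assumptions made for illustration; they do not come from any FAIR assessment tool. Note that the score stays within 0–1 only if every weight lies between 0 and 1 (for example, all weights equal to 1).

```python
def fair_score(ratings, weights=None):
    """Return (1/n) * sum(w_i * r_i) for binary ratings r_i."""
    n = len(ratings)
    if n == 0:
        raise ValueError("at least one criterion is required")
    if weights is None:
        weights = [1.0] * n  # assumption: equal weights unless specified
    if len(weights) != n:
        raise ValueError("ratings and weights must have the same length")
    if any(r not in (0, 1) for r in ratings):
        raise ValueError("each rating must be 0 or 1")
    return sum(w * r for w, r in zip(weights, ratings)) / n


# Hypothetical dataset: has a DOI, is retrievable, uses shared formats, lacks a license
ratings = [1, 1, 1, 0]  # Findable, Accessible, Interoperable, Reusable
score = fair_score(ratings)
print(score)  # 0.75
```

With equal weights the score is simply the fraction of criteria met; unequal weights let a reviewer emphasize, say, licensing over format.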

Data Sharing and Repositories

Data Sharing Considerations

  • De-identification: remove all personally identifiable information (PII)
  • Informed consent: ensure participants consented to data sharing
  • Privacy regulations: comply with GDPR, HIPAA, and institutional policies
  • Embargo periods: allow data exclusivity for a limited time if needed
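
The de-identification step can be sketched in a few lines. This is a minimal example under stated assumptions: the helper `deidentify`, the field names, and the records are all hypothetical. It drops direct identifiers and replaces the participant ID with a salted hash, which is pseudonymization rather than a guarantee of anonymity, since combinations of remaining fields may still identify a person.

```python
import hashlib


def deidentify(records, direct_identifiers, id_field, salt):
    """Drop direct identifiers and replace the ID field with a salted hash."""
    cleaned = []
    for record in records:
        row = {k: v for k, v in record.items() if k not in direct_identifiers}
        digest = hashlib.sha256(f"{salt}{record[id_field]}".encode()).hexdigest()
        row[id_field] = digest[:12]  # shortened pseudonym, stable for a given salt
        cleaned.append(row)
    return cleaned


# Hypothetical participant records (all values invented for illustration)
raw = [
    {"participant_id": 101, "name": "Ada", "email": "ada@example.org", "score": 0.82},
    {"participant_id": 102, "name": "Ben", "email": "ben@example.org", "score": 0.47},
]
shared = deidentify(raw, {"name", "email"}, "participant_id", salt="keep-this-secret")
print(sorted(shared[0].keys()))  # ['participant_id', 'score']
```

The salt should be kept private and never deposited with the data; otherwise the hashed IDs can be recomputed from the original ones.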

Major data repositories include:

| Repository | Domain | DOI Support | Access |
|---|---|---|---|
| Dryad | General | Yes | Free (with publication) |
| Zenodo | General | Yes | Free |
| ICPSR | Social sciences | Yes | Restricted access |
| OpenNeuro | Neuroimaging | Yes | Free |
| Figshare | General | Yes | Free (limited) |
| OSF | Multi-disciplinary | Yes | Free |

Code Sharing and Reproducibility

Definition: Computational Reproducibility

A study is computationally reproducible if an independent researcher can obtain the same results (numbers, figures, tables) from the same data using the same code and environment.

Reproducibility Checklist

  • Use version control (Git) for all analysis code
  • Specify software versions (e.g., Python 3.11, R 4.3.1)
  • Use containerization (Docker, Singularity) for environment reproducibility
  • Set random seeds for stochastic algorithms
  • Include a requirements.txt or renv.lock for dependency management
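
Part of the checklist above can be automated by recording the environment alongside the results. The sketch below uses only the standard library; the helper name `environment_snapshot` and the package list are illustrative assumptions, and it complements rather than replaces a lock file or container image.

```python
import json
import platform
import random
import sys
from importlib import metadata


def environment_snapshot(packages, seed):
    """Record interpreter, OS, package versions, and the seed used."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
        "seed": seed,
    }


SEED = 42
random.seed(SEED)  # seeds only the standard-library generator
snapshot = environment_snapshot(["numpy", "pandas"], SEED)
print(json.dumps(snapshot, indent=2))
```

Saving this snapshot next to each output file makes it easier to tell later which software versions produced a given result.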

Python Implementation: Reproducibility Workflow

Open Science Reproducibility Toolkit

import numpy as np
import pandas as pd
import hashlib
import json
import datetime
from pathlib import Path

# 1. Set random seeds for reproducibility
np.random.seed(42)

# 2. Create a reproducible analysis pipeline
class ReproducibleAnalysis:
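    def __init__(self, seed=42):
        self.seed = seed
        # Apply the seed here so the value recorded in the report is the one actually used
        np.random.seed(seed)
        self.artifacts = {}
        self.log = []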

    def log_step(self, step_name, description):
        entry = {
            "step": step_name,
            "description": description,
            "timestamp": datetime.datetime.now().isoformat()
        }
        self.log.append(entry)
        return entry

    def compute_hash(self, data, algorithm='sha256'):
        """Compute hash of data for integrity verification."""
        if isinstance(data, np.ndarray):
            data_bytes = data.tobytes()
        elif isinstance(data, pd.DataFrame):
            data_bytes = pd.util.hash_pandas_object(data).values.tobytes()
        else:
            data_bytes = str(data).encode()
        return hashlib.new(algorithm, data_bytes).hexdigest()

    def save_artifact(self, name, data, directory='artifacts'):
        """Save and hash an analysis artifact."""
        Path(directory).mkdir(parents=True, exist_ok=True)  # also create missing parent folders
        filepath = Path(directory) / f"{name}.csv"

        if isinstance(data, pd.DataFrame):
            data.to_csv(filepath, index=False)
        else:
            pd.DataFrame(data).to_csv(filepath, index=False)

        # Note: this hashes the in-memory data, not the bytes of the saved CSV file
        file_hash = self.compute_hash(data)
        self.artifacts[name] = {
            "path": str(filepath),
            "hash": file_hash,
            "rows": len(data) if hasattr(data, '__len__') else None
        }
        self.log_step(f"save_{name}", f"Saved {filepath} (hash: {file_hash[:16]}...)")
        return filepath

    def generate_report(self):
        """Generate a reproducibility report."""
        report = {
            "seed": self.seed,
            "numpy_version": np.__version__,
            "pandas_version": pd.__version__,
            "artifacts": self.artifacts,
            "execution_log": self.log,
            "generated_at": datetime.datetime.now().isoformat()
        }
        return report

# 3. Example analysis with full traceability
analysis = ReproducibleAnalysis(seed=42)
analysis.log_step("init", "Initialize reproducible analysis with seed=42")

# Simulate data
n = 500
data = pd.DataFrame({
    'x': np.random.normal(0, 1, n),
    'group': np.random.choice(['A', 'B', 'C'], n)
})
data['y'] = 2.5 * data['x'] + np.random.normal(0, 1, n)

# Save raw data
analysis.save_artifact("raw_data", data)

# Analysis step: ordinary least squares via NumPy's least-squares solver
analysis.log_step("regression", "Fit OLS: y ~ x")
X = np.column_stack([np.ones(n), data['x'].values])  # intercept column plus predictor
beta, residuals, rank, sv = np.linalg.lstsq(X, data['y'].values, rcond=None)

# Save results
results = pd.DataFrame({'coefficient': ['intercept', 'slope'], 'estimate': beta})
analysis.save_artifact("regression_results", results)

# 4. Generate reproducibility report
report = analysis.generate_report()
print(json.dumps(report, indent=2))
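
The toolkit above hashes the in-memory object rather than the saved file. A complementary check, sketched here with hypothetical helper names `file_sha256` and `verify_artifact`, is to hash the file on disk so that a later reader can confirm a downloaded artifact matches the recorded checksum.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_hash):
    """Check that the file at `path` still matches a previously recorded hash."""
    return file_sha256(path) == expected_hash


# Demonstration on a throwaway file (contents invented for illustration)
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "demo.csv"
    artifact.write_text("x,y\n1,2\n")
    recorded = file_sha256(artifact)
    print(verify_artifact(artifact, recorded))  # True
    artifact.write_text("x,y\n1,3\n")           # simulate a silent change
    print(verify_artifact(artifact, recorded))  # False
```

Reading in chunks keeps memory use flat for large datasets; the chunk size here is an arbitrary choice.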

Preregistration and Transparency

Definition: Preregistration

Preregistration is the act of specifying a study's design, hypotheses, and analysis plan in a time-stamped, publicly accessible document before data collection begins.

The key distinction is between:

Exploratory vs. Confirmatory Research

  • Confirmatory: hypothesis-driven, preregistered, strict α-control; results confirm or refute a priori predictions
  • Exploratory: data-driven, hypothesis-generating, flexible; results suggest new theories but require independent confirmation

Registered Reports

Registered Reports (RR) are a publishing format where journals peer-review and accept studies before results are known, based on the importance of the research question and the quality of the methodology.

Benefits of Registered Reports

  • Eliminates publication bias (file drawer problem)
  • Reduces p-hacking and HARKing
  • Rewards rigorous methodology over novel results
  • Provides in-principle acceptance (IPA) guarantee

Measuring Reproducibility

Reproducibility Ratio

$$R = \frac{\text{Number of successful replications}}{\text{Total replication attempts}}$$

Here,

  • R = Reproducibility ratio (0–1)

The Open Science Collaboration (2015) found R ≈ 0.36 for psychology. Large-scale replication projects in other fields have reported different rates:

| Field | Replication Rate | Source |
|---|---|---|
| Psychology | 36% | OSC (2015) |
| Economics | 61% | Camerer et al. (2016) |
| Social Science | 62% | Camerer et al. (2018) |
| Cancer Biology | 46% | Errington et al. (2021) |
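
Because replication projects differ greatly in size, a ratio is easier to interpret with an uncertainty range. The sketch below computes R together with an approximate 95% Wilson interval. The function name `replication_ratio` and the counts are hypothetical and chosen only to illustrate the arithmetic; they are not taken from any of the studies in the table.

```python
import math


def replication_ratio(successes, attempts, z=1.96):
    """Return R = successes / attempts with an approximate 95% Wilson interval."""
    if attempts <= 0 or not 0 <= successes <= attempts:
        raise ValueError("need attempts > 0 and 0 <= successes <= attempts")
    r = successes / attempts
    denom = 1 + z**2 / attempts
    centre = (r + z**2 / (2 * attempts)) / denom
    half = z * math.sqrt(r * (1 - r) / attempts + z**2 / (4 * attempts**2)) / denom
    return r, max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts chosen only to illustrate the calculation
r, low, high = replication_ratio(successes=9, attempts=25)
print(f"R = {r:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

With few replication attempts the interval is wide, which is one reason differences between fields should be read cautiously.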

Python Implementation: Detecting p-hacking

p-curve Analysis

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def p_curve(p_values):
    """
    Analyze the distribution of significant p-values.
    A right-skewed p-curve (many very small p-values) suggests evidential value.
    A flat curve suggests no evidential value; a left-skewed curve
    (p-values piling up just below .05) suggests p-hacking.
    """
    p_values = np.asarray(p_values, dtype=float)  # accept lists as well as arrays
    sig_pvals = p_values[p_values < 0.05]

    # Count significant p-values in five equal-width bins between 0 and .05
    bins = [0, 0.01, 0.02, 0.03, 0.04, 0.05]
    observed_counts, _ = np.histogram(sig_pvals, bins=bins)

    # Expected counts if significant p-values were uniformly distributed (no true effect)
    expected = np.full_like(observed_counts, len(sig_pvals) / 5, dtype=float)

    # Chi-square test for uniformity
    chi2, p_uniform = stats.chisquare(observed_counts, f_exp=expected)

    # Skewness of significant p-values
    if len(sig_pvals) > 2:
        skewness = stats.skew(sig_pvals)
    else:
        skewness = np.nan

    return {
        'n_significant': len(sig_pvals),
        'n_total': len(p_values),
        'chi2_uniformity': chi2,
        'p_uniformity': p_uniform,
        'skewness': skewness,
        'counts': observed_counts,
        'bins': bins
    }

# Simulate p-values with and without p-hacking
np.random.seed(42)
n_studies = 200

# Scenario 1: Genuine effects (right-skewed p-curve)
genuine_pvals = np.concatenate([
    np.random.beta(0.5, 20, 80),   # Strong effects
    np.random.uniform(0, 1, 120)    # Null effects
])

# Scenario 2: p-hacked (uniform or left-skewed)
p_hacked_pvals = np.random.uniform(0.01, 0.05, 100)  # Only p < .05 reported

# Analyze both
result_genuine = p_curve(genuine_pvals)
result_hacked = p_curve(p_hacked_pvals)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].bar(range(5), result_genuine['counts'], color='steelblue', edgecolor='black')
axes[0].set_xticks(range(5))
axes[0].set_xticklabels(['0-0.01', '0.01-0.02', '0.02-0.03', '0.03-0.04', '0.04-0.05'])
axes[0].set_title(f'Genuine Effects\nSkewness: {result_genuine["skewness"]:.2f}')
axes[0].set_ylabel('Frequency')

axes[1].bar(range(5), result_hacked['counts'], color='coral', edgecolor='black')
axes[1].set_xticks(range(5))
axes[1].set_xticklabels(['0-0.01', '0.01-0.02', '0.02-0.03', '0.03-0.04', '0.04-0.05'])
axes[1].set_title(f'p-Hacked Data\nSkewness: {result_hacked["skewness"]:.2f}')

plt.tight_layout()
plt.savefig('p_curve_analysis.png', dpi=150)
plt.show()
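
The chi-square check above tests for any departure from uniformity. A simpler directional check, sketched below, is a one-sided binomial test of whether significant p-values fall mostly in the lower half of the significant range (below .025). The function name `right_skew_binomial_test` and the example p-values are assumptions for illustration, and this is a simplified stand-in rather than a full p-curve procedure.

```python
from math import comb


def right_skew_binomial_test(p_values, alpha=0.05):
    """One-sided binomial test: are significant p-values mostly below alpha / 2?"""
    significant = [p for p in p_values if p < alpha]
    n = len(significant)
    k = sum(p < alpha / 2 for p in significant)
    if n == 0:
        return {"n_significant": 0, "n_low": 0, "p_value": float("nan")}
    # P(X >= k) for X ~ Binomial(n, 0.5): under a flat p-curve,
    # a significant p-value is equally likely to land in either half
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return {"n_significant": n, "n_low": k, "p_value": tail}


# Hypothetical p-values, invented for illustration
example = [0.001, 0.002, 0.003, 0.004, 0.01, 0.02, 0.03, 0.2]
result = right_skew_binomial_test(example)
print(result)  # {'n_significant': 7, 'n_low': 6, 'p_value': 0.0625}
```

A small p-value here points toward right skew (evidential value). The same tail probability can be obtained with `scipy.stats.binomtest(k, n, 0.5, alternative='greater')`.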

Incentives and Barriers

Barriers to Open Science

  • Career incentives: traditional metrics reward novelty over reproducibility
  • Data privacy: sensitive data (medical, educational) cannot be shared openly
  • Time and resources: curating open data requires additional effort
  • Scoop fears: researchers worry others will use their data prematurely
  • Lack of training: many researchers lack computational skills for reproducible workflows

Key Takeaways

Summary: Open Science Practices

  1. Open data, code, and materials increase transparency and enable verification of results
  2. FAIR principles provide a structured framework for data management
  3. Preregistration separates confirmatory from exploratory research, reducing p-hacking
  4. Registered Reports eliminate publication bias by accepting studies before results are known
  5. p-curve analysis can detect evidential value and identify potential p-hacking
  6. Reproducibility ratios vary widely across fields (36–62%), highlighting the need for reform
  7. Computational reproducibility requires version control, containerization, and random seed management
