Pre-registration of Studies
Advanced Statistical Methods
Locking In Hypotheses Before Seeing Results
Pre-registration documents research plans, hypotheses, and analysis strategies before data collection, which guards against HARKing (hypothesizing after the results are known). It separates confirmatory from exploratory research.
- Clinical trials: prevent outcome switching by registering primary endpoints in advance
- Social sciences: distinguish planned analyses from exploratory fishing expeditions
- Drug development: provide regulatory assurance that trial results are not selectively reported
Pre-registration makes the boundary between prediction and postdiction crystal clear.
Pre-registration is the practice of documenting a study's hypotheses, design, analysis plan, and any deviations from standard protocols in a time-stamped, publicly accessible registry before data collection begins.
Definition: Pre-registration
A pre-registration is a time-stamped document that specifies, prior to data collection: (1) research hypotheses, (2) study design, (3) sample size justification, (4) primary and secondary outcome measures, (5) exclusion criteria, and (6) planned statistical analyses.
What to Pre-register
Essential Pre-registration Components
- Hypotheses: clearly stated primary and secondary hypotheses
- Study design: between-subjects, within-subjects, longitudinal, etc.
- Sample size justification: a priori power analysis or information-based sizing
- Outcome measures: primary, secondary, and exploratory endpoints
- Exclusion criteria: rules for excluding data (pre-specified, not post hoc)
- Analysis plan: specific statistical tests, models, and software
- Inference criteria: significance threshold, one- vs. two-tailed, correction methods
Exploratory vs. Confirmatory Analysis
Definition: Confirmatory Data Analysis (CDA)
CDA tests a priori hypotheses using pre-specified analyses. The Type I error rate is controlled at α, and the analysis is deductive: theory → hypothesis → data → conclusion.
Definition: Exploratory Data Analysis (EDA)
EDA generates new hypotheses from data using flexible, post hoc analyses. It is inductive: data → patterns → hypothesis. Results require independent confirmation.
The CDA–EDA Boundary
- CDA is valid only if analyses were truly specified before seeing the data
- EDA becomes problematic when presented as confirmatory (HARKing)
- Pre-registration creates a clear, auditable boundary between the two
- Deviations from the pre-registered plan must be transparently reported
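
The boundary described above can be made auditable with a simple comparison between the registered plan and what is eventually reported. The sketch below is illustrative only: the function name `label_analyses` and the string format used to describe each analysis are assumptions, not part of any registry's tooling.

```python
def label_analyses(preregistered, reported):
    """Label each reported analysis as confirmatory or exploratory.

    An analysis counts as confirmatory only if its exact specification
    appears in the pre-registered plan; everything else is exploratory.
    """
    planned = set(preregistered)
    labels = {
        analysis: "confirmatory" if analysis in planned else "exploratory"
        for analysis in reported
    }
    # Planned analyses that were never reported are deviations to disclose.
    reported_set = set(reported)
    unreported = [a for a in preregistered if a not in reported_set]
    return labels, unreported


# Hypothetical plan and report for a two-group study.
preregistered = ["welch_t: nback_accuracy ~ sleep_condition"]
reported = [
    "welch_t: nback_accuracy ~ sleep_condition",
    "anova: reaction_time ~ sleep_condition * age_group",
]

labels, unreported = label_analyses(preregistered, reported)
for analysis, label in labels.items():
    print(f"{label:>12}: {analysis}")
print("Planned but not reported:", unreported)
```

Exact string matching is deliberately strict here: any change to a model or variable set moves the analysis into the exploratory column, where it has to be reported as a deviation.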
Mathematical Framework: Decision Theory for Pre-registration
Expected Value of Pre-registration
$$\mathbb{E}[V_{\text{prereg}}] = p \cdot V - C_t - C_f$$
Here,
- $p$ = probability that the study yields true results
- $V$ = value of credible, reproducible findings
- $C_t$ = time cost of writing the pre-registration
- $C_f$ = cost of reduced analytical flexibility
The key insight is that pre-registration reduces flexibility but increases credibility. The net value depends on the field's replication norms and the study's stakes.
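
To make the trade-off concrete, the following sketch assumes the expected value is the probability of a true result times the value of a credible finding, minus the two costs. The function name, the arbitrary units, and the example numbers are all illustrative assumptions.

```python
def expected_value_of_preregistration(p_true, value, time_cost, flexibility_cost):
    """Expected net value: p_true * value - time_cost - flexibility_cost.

    All quantities are in the same arbitrary units (assumed for illustration).
    """
    if not 0.0 <= p_true <= 1.0:
        raise ValueError("p_true must be a probability in [0, 1]")
    return p_true * value - time_cost - flexibility_cost


# Same costs, different stakes: only the value of a credible finding changes.
high_stakes = expected_value_of_preregistration(0.6, 100.0, 5.0, 10.0)
low_stakes = expected_value_of_preregistration(0.6, 20.0, 5.0, 10.0)

print(f"High-stakes study: {high_stakes:+.1f}")
print(f"Low-stakes study:  {low_stakes:+.1f}")
```

With these made-up inputs the net value is positive for the high-stakes study and negative for the low-stakes one, which is the dependence on stakes described above.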
OSF Pre-registration
The Open Science Framework (OSF) is the most widely used pre-registration platform.
DfOSF Pre-registration Structure
- Summary: brief description of the study
- Hypotheses: numbered, specific predictions
- Design: factorial, between/within, blocking variables
- Sampling Plan: data collection stopping rule, sample size justification
- Variables: IVs, DVs, covariates, manipulation checks
- Analysis Plan: specific tests, models, software, α-level
- Other: deviations, unanticipated events, exploratory analyses
Power Analysis for Pre-registration
A Priori Power Analysis
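$$n = \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2}{\delta^2}$$
Here,
- $n$ = required sample size per group
- $\alpha$ = significance level (typically 0.05)
- $\beta$ = Type II error rate (power = $1 - \beta$)
- $\delta$ = minimum detectable effect size
- $\sigma$ = population standard deviation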
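
As a worked example, the sketch below assumes the usual normal-approximation formula for comparing two group means, n = 2(z_{1-α/2} + z_{1-β})²σ²/δ², and evaluates it with Python's standard library. The function name `n_per_group` and the `two_sided` switch are illustrative choices. An exact t-test calculation typically gives a slightly larger n.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80, two_sided=True):
    """Required sample size per group under the normal approximation."""
    z = NormalDist().inv_cdf  # quantile function of the standard normal
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    z_beta = z(power)  # power = 1 - beta
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)  # round up to a whole participant


# Standardized effect d = delta / sigma = 0.5
print(n_per_group(delta=0.5, sigma=1.0, power=0.80))  # 63
print(n_per_group(delta=0.5, sigma=1.0, power=0.90))  # 85
```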
Power Analysis Best Practices
- Conduct power analysis before pre-registration, not after
- Specify minimum effect size of practical significance
- Use simulation-based power for complex designs (Bayesian, multilevel)
- Report a sensitivity analysis: "What effect size can we detect with n = X at 80% power?"
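
The sensitivity question in the last bullet can be answered by solving the same normal-approximation relationship for the effect size instead of n. This is a sketch for a two-group, two-sided comparison. The function names are assumptions, and the power function ignores the negligible probability of rejecting in the wrong direction.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def minimum_detectable_effect(n_per_group, sigma=1.0, alpha=0.05, power=0.80):
    """Smallest mean difference detectable with the given n (two-sided test)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    return (z_alpha + z_beta) * sigma * math.sqrt(2 / n_per_group)


def approximate_power(delta, n_per_group, sigma=1.0, alpha=0.05):
    """Approximate two-sided power for a true mean difference of delta."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    standard_error = sigma * math.sqrt(2 / n_per_group)
    return _STD_NORMAL.cdf(delta / standard_error - z_alpha)


mde = minimum_detectable_effect(n_per_group=50)
print(f"Detectable effect with n = 50 per group at 80% power: d = {mde:.2f}")
print(f"Power to detect d = 0.5 with n = 50 per group: {approximate_power(0.5, 50):.2f}")
```

Reporting both numbers in the pre-registration shows readers what the fixed sample can and cannot detect.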
Python Implementation: Pre-registration Template Generator
OSF Pre-registration Template
import json
from datetime import datetime


class PreregistrationTemplate:
    """Collects the sections of a pre-registration and exports them as JSON."""

    def __init__(self, title, authors, research_question):
        self.title = title
        self.authors = authors
        self.research_question = research_question
        self.hypotheses = []
        self.design = {}
        self.sampling_plan = {}
        self.variables = {"IV": [], "DV": [], "covariates": []}
        self.analysis_plan = []
        self.exclusions = []
        # Creation time of this template object (local clock)
        self.timestamp = datetime.now().isoformat()

    def add_hypothesis(self, number, statement, direction="two-sided"):
        self.hypotheses.append({
            "H_number": number,
            "statement": statement,
            "direction": direction
        })

    def set_design(self, design_type, factors=None, within_subjects=False):
        self.design = {
            "type": design_type,
            "factors": factors or [],
            "within_subjects": within_subjects
        }

    def set_sampling_plan(self, n_per_group, power, effect_size, alpha=0.05,
                          stopping_rule="fixed"):
        # Call set_design() first: total_n assumes two groups for a
        # between-subjects design and a single group otherwise.
        n_groups = 2 if "between" in self.design.get("type", "") else 1
        self.sampling_plan = {
            "n_per_group": n_per_group,
            "power": power,
            "effect_size": effect_size,
            "alpha": alpha,
            "stopping_rule": stopping_rule,
            "total_n": n_per_group * n_groups
        }

    def add_variable(self, var_type, name, measure, coding=None):
        # var_type must be one of "IV", "DV", or "covariates"
        entry = {"name": name, "measure": measure}
        if coding:
            entry["coding"] = coding
        self.variables[var_type].append(entry)

    def add_analysis(self, test_name, variables, model=None, software="R"):
        self.analysis_plan.append({
            "test": test_name,
            "variables": variables,
            "model": model,
            "software": software
        })

    def add_exclusion_criterion(self, criterion):
        self.exclusions.append(criterion)

    def to_osf_format(self):
        """Return the registration as a dict organized by numbered sections."""
        return {
            "title": self.title,
            "authors": self.authors,
            "registration_type": "OSF Standard",
            "timestamp": self.timestamp,
            "sections": {
                "1_summary": self.research_question,
                "2_hypotheses": self.hypotheses,
                "3_design": self.design,
                "4_sampling_plan": self.sampling_plan,
                "5_variables": self.variables,
                "6_analysis_plan": self.analysis_plan,
                "7_exclusions": self.exclusions,
                "8_other": "No additional information at this time."
            }
        }

    def export_json(self, filename="preregistration.json"):
        data = self.to_osf_format()
        with open(filename, "w") as f:
            json.dump(data, f, indent=2)
        print(f"Pre-registration exported to {filename}")
        return data


# Example: Create a pre-registration for a two-sample t-test
prereg = PreregistrationTemplate(
    title="Effect of Sleep Deprivation on Cognitive Performance",
    authors=["Smith, J.", "Doe, A."],
    research_question="Does 24-hour sleep deprivation impair working memory performance?"
)

prereg.add_hypothesis(
    number=1,
    statement="Sleep-deprived participants will show lower accuracy on the n-back task "
              "than well-rested controls.",
    direction="one-sided"
)

# Set the design before the sampling plan so total_n is computed correctly
prereg.set_design(
    design_type="between-subjects",
    factors=["sleep_condition"],
    within_subjects=False
)

prereg.set_sampling_plan(
    n_per_group=50,
    power=0.80,       # approximate power for n = 50 per group, d = 0.5, one-sided alpha = 0.05
    effect_size=0.5,  # Cohen's d
    alpha=0.05
)

prereg.add_variable("IV", "sleep_condition", "Manipulated (deprived vs. control)")
prereg.add_variable("DV", "nback_accuracy", "Proportion correct on 2-back task")
prereg.add_variable("covariates", "baseline_cognition", "Pre-study n-back score")

prereg.add_analysis(
    test_name="Welch's t-test",
    variables=["nback_accuracy ~ sleep_condition"],
    software="R (version 4.3.1)"
)

prereg.add_exclusion_criterion(
    "Participants who fail manipulation check "
    "(subjective sleepiness < 6/10 in the deprived group)"
)
prereg.add_exclusion_criterion("Participants with >20% missing trial data")

# Export to disk and print the same structure for inspection
data = prereg.export_json("sleep_deprivation_prereg.json")
print(json.dumps(data, indent=2))
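
Because a pre-registration is only useful if it can be shown not to have changed, one optional local safeguard is to record a content fingerprint of the exported plan. The sketch below is an assumption-laden illustration, not an OSF feature: the function name `fingerprint` and the small example dictionary are hypothetical, and a public registry's own frozen, time-stamped copy remains the authoritative record.

```python
import hashlib
import json


def fingerprint(registration: dict) -> str:
    """Return a SHA-256 digest of a registration in canonical JSON form.

    Sorting keys and fixing separators makes the digest independent of
    dictionary ordering and whitespace.
    """
    canonical = json.dumps(registration, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical miniature registration
plan = {"title": "Sleep study", "alpha": 0.05, "hypotheses": ["H1"]}
same_plan_reordered = {"hypotheses": ["H1"], "alpha": 0.05, "title": "Sleep study"}
edited_plan = {**plan, "alpha": 0.10}

print(fingerprint(plan) == fingerprint(same_plan_reordered))  # True
print(fingerprint(plan) == fingerprint(edited_plan))          # False
```

Any later edit to the plan, such as the changed alpha above, produces a different digest, so the recorded value can be compared against the file at publication time.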
Threats to Pre-registration
Common Threats to Pre-registration Integrity
- Vague hypotheses: pre-registering overly broad predictions that can match any outcome
- Outcome switching: changing primary outcomes after seeing results
- Analytical flexibility: pre-registering multiple analyses and reporting only significant ones
- Leakage: sharing pre-registration privately to bias reviewers
- File drawer of pre-registrations: never publishing pre-registered studies that fail
- Post hoc justification: claiming deviations were "necessary" without pre-specification
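
The cost of analytical flexibility can be quantified with a short calculation. Assuming k independent tests of true null hypotheses, each at level α, the chance that at least one comes out significant is 1 - (1 - α)^k. The independence assumption and the function name below are simplifications for illustration.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive among independent tests."""
    return 1 - (1 - alpha) ** n_tests


for k in (1, 5, 10, 20):
    rate = familywise_error_rate(k)
    print(f"{k:>2} analyses -> P(at least one false positive) = {rate:.3f}")
```

Registering ten analyses and reporting only the significant one therefore carries roughly a 40% false-positive risk under these assumptions, which is why the correction method belongs in the registered inference criteria.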
Pre-registration Platforms
Major Pre-registration Platforms
- OSF (osf.io): free, open, supports many formats (OSF Standard, AsPredicted, Registered Report)
- AsPredicted.org: quick, 9-question template for simple studies
- ClinicalTrials.gov: mandatory for FDA-regulated clinical trials
- ISRCTN: international clinical trial registry
- EGAP: Evidence in Governance and Politics (registry for governance and political science experiments)
- AEA RCT Registry: American Economic Association registry for randomized controlled trials
Evaluating Pre-registration Quality
Pre-registration Completeness Index
$$C = \frac{1}{N} \sum_{i=1}^{N} I(\text{item}_i \text{ is specified})$$
Here,
- $C$ = completeness index (0–1)
- $N$ = total number of required items
- $I(\cdot)$ = indicator function (1 if specified, 0 otherwise)
Higher completeness is associated with more rigorous research practices, though quality of specification matters more than mere presence of items.
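
A minimal sketch of the index follows, assuming it is the fraction of required items that are present and non-empty. The item names in `REQUIRED_ITEMS` mirror the essential components listed earlier, but the exact keys and the draft registration are hypothetical.

```python
REQUIRED_ITEMS = [
    "hypotheses",
    "design",
    "sample_size_justification",
    "outcome_measures",
    "exclusion_criteria",
    "analysis_plan",
    "inference_criteria",
]


def completeness_index(registration, required=REQUIRED_ITEMS):
    """Fraction of required items that are present and non-empty (0 to 1)."""
    specified = sum(1 for item in required if registration.get(item))
    return specified / len(required)


# Hypothetical draft: two of the seven items are still missing or empty.
draft = {
    "hypotheses": ["H1: deprived < control on n-back accuracy"],
    "design": "between-subjects",
    "sample_size_justification": "a priori power analysis",
    "outcome_measures": ["nback_accuracy"],
    "exclusion_criteria": [],
    "analysis_plan": "Welch's t-test",
}

print(f"Completeness: {completeness_index(draft):.2f}")
```

An empty list counts as unspecified here, but the check only detects presence: a vague hypothesis still scores 1, which is the limitation noted above.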
Key Takeaways
Summary: Pre-registration of Studies
- Pre-registration separates confirmatory from exploratory research with a time-stamped record
- What to pre-register: hypotheses, design, sample size, outcomes, exclusions, and analysis plan
- OSF and AsPredicted are the most widely used pre-registration platforms
- Power analysis must be conducted before pre-registration, not after
- Threats include vague hypotheses, outcome switching, and analytical flexibility
- Pre-registration does not prevent EDA; it clarifies which analyses are confirmatory vs. exploratory
- Registered Reports build on pre-registration by providing in-principle acceptance before results