Systematic Review Methodology
Comprehensive, Reproducible Evidence Synthesis
Systematic reviews identify, evaluate, and synthesize all relevant evidence on a question using transparent, reproducible methodology, and are reported according to the PRISMA guidelines. Risk of bias assessment gauges how much confidence each included study's results deserve.
- Healthcare guideline development: Form the evidence base for clinical practice recommendations
- Policy evaluation: Assess the total evidence for government program effectiveness
- Technology assessment: Compare interventions systematically for informed purchasing decisions
Systematic reviews replace narrative cherry-picking with comprehensive, reproducible evidence synthesis.
Definition: Systematic Review
A systematic review is a rigorous, transparent, and reproducible method for identifying, evaluating, and synthesizing all relevant evidence to answer a specific research question. Unlike narrative reviews, systematic reviews use explicit, pre-specified methods to minimize bias.
"A systematic review attempts to identify, appraise, and synthesize all the empirical evidence that meets pre-specified eligibility criteria to answer a given research question." (Cochrane Handbook)
Systematic Review vs Meta-Analysis
| Aspect | Systematic Review | Meta-Analysis |
|---|---|---|
| Scope | Qualitative synthesis of evidence | Quantitative statistical pooling |
| Output | Narrative summary with assessment | Pooled effect estimate with CI |
| When used | Always for systematic reviews | Only when studies are comparable enough |
| Heterogeneity | Addressed narratively | Quantified (I², τ²) |
| PRISMA | Required | Required (as part of SR) |
Key Point
Every meta-analysis should be embedded within a systematic review, but not every systematic review includes a meta-analysis. When studies are too heterogeneous, methodologically diverse, or use incompatible outcome measures, synthesis without meta-analysis (SWiM) is appropriate.
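Whether studies are "comparable enough" to pool is partly a judgment call, but heterogeneity statistics often inform it. The sketch below is one hedged illustration: it computes Cochran's Q, I², and the DerSimonian-Laird estimate of τ² from per-study effect estimates and their variances. The function name `heterogeneity` and the input values are assumptions made for illustration, not data from a real review.

```python
def heterogeneity(effects, variances):
    """Return Cochran's Q, I² (in %), and the DerSimonian-Laird τ²."""
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    sum_w = sum(weights)
    # Fixed-effect pooled estimate, used as the reference point for Q.
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum_w
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I²: share of total variation attributed to between-study heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    # DerSimonian-Laird moment estimator of the between-study variance.
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    return q, i2, tau2


# Hypothetical log hazard ratios and variances (illustrative only).
log_hr = [-0.15, -0.22, -0.05, -0.30]
var = [0.004, 0.006, 0.005, 0.010]
q, i2, tau2 = heterogeneity(log_hr, var)
print(f"Q = {q:.2f}, I2 = {i2:.1f}%, tau2 = {tau2:.4f}")
```

Both I² and τ² are truncated at zero here, which is a common convention when Q falls below its degrees of freedom.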
PRISMA Guidelines
Definition: PRISMA
PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) is a 27-item checklist and flow diagram that standardizes the reporting of systematic reviews. PRISMA 2020 (Page et al., 2021) updated the original 2009 guidelines.
PRISMA 2020 Flow Diagram
```
Identification
├── Records from databases (n = ___)
├── Records from other sources (n = ___)
└── Duplicates removed (n = ___)
Screening
├── Records screened (n = ___)
├── Records excluded (n = ___)
└── Reports sought for retrieval (n = ___)
Eligibility
├── Reports assessed for eligibility (n = ___)
└── Reports excluded (n = ___)
    ├── Reason 1 (n = ___)
    ├── Reason 2 (n = ___)
    └── Reason 3 (n = ___)
Included
├── Studies included in review (n = ___)
└── Studies included in meta-analysis (n = ___)
```
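The counts in a flow diagram should reconcile from one stage to the next. As a minimal sketch, the helper below checks that arithmetic. The function name `check_prisma_flow`, the dictionary keys, and the counts are all assumptions for illustration, and the check simplifies by assuming every report sought was retrieved.

```python
def check_prisma_flow(flow):
    """Return a list of arithmetic inconsistencies in PRISMA flow counts."""
    problems = []

    # Records screened should equal records identified minus duplicates.
    if flow["screened"] != flow["identified"] - flow["duplicates_removed"]:
        problems.append("screened != identified - duplicates_removed")

    # Reports assessed should equal records screened minus screening exclusions.
    if flow["assessed"] != flow["screened"] - flow["excluded_at_screening"]:
        problems.append("assessed != screened - excluded_at_screening")

    # Full-text exclusion reasons should account for every excluded report.
    excluded_full_text = sum(flow["full_text_exclusions"].values())
    if flow["included"] != flow["assessed"] - excluded_full_text:
        problems.append("included != assessed - full-text exclusions")

    return problems


# Illustrative counts only (not from a real review).
flow = {
    "identified": 4523,
    "duplicates_removed": 1632,
    "screened": 2891,
    "excluded_at_screening": 2579,
    "assessed": 312,
    "full_text_exclusions": {"Wrong population": 87, "Wrong intervention": 62, "Other": 118},
    "included": 45,
}
print(check_prisma_flow(flow) or "Flow counts are consistent")
```

Running a check like this before drawing the diagram catches transcription errors while they are still cheap to fix.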
Search Strategy
Definition: Search Strategy
A search strategy is a systematic, reproducible method for identifying all relevant studies. It combines subject terms (MeSH, Emtree) with free-text keywords using Boolean operators.
PICO Framework
Definition: PICO
The PICO framework structures the research question:
- P (Population): Who is being studied?
- I (Intervention): What is the treatment or exposure?
- C (Comparison): What is the comparator?
- O (Outcome): What outcomes are measured?
Example Search Strategy
```
# Database: MEDLINE via PubMed
# Population: Adults with type 2 diabetes
# Intervention: SGLT2 inhibitors
# Comparison: Placebo or other antidiabetics
# Outcome: Cardiovascular outcomes (MACE, heart failure)

("diabetes mellitus, type 2"[MeSH] OR "type 2 diabetes"[tiab])
AND
("sodium-glucose transporter 2 inhibitors"[MeSH] OR "SGLT2 inhibitor*"[tiab]
 OR "empagliflozin"[tiab] OR "dapagliflozin"[tiab] OR "canagliflozin"[tiab])
AND
("cardiovascular diseases"[MeSH] OR "heart failure"[MeSH] OR "MACE"[tiab]
 OR "cardiovascular outcome*"[tiab])
NOT
("type 1 diabetes"[tiab] OR "gestational"[tiab])
```
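A query of this shape follows a simple pattern: synonyms for one concept are joined with OR, and the concept blocks are joined with AND. The sketch below assembles such a string programmatically so the term lists can be versioned alongside the protocol. The helper names `or_block` and `build_query` are hypothetical, and the term lists are a shortened subset of the example.

```python
def or_block(terms):
    """Join synonyms for one concept with OR and wrap them in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def build_query(concepts, exclude=None):
    """AND together one OR-block per concept; optionally append a NOT block."""
    query = " AND ".join(or_block(terms) for terms in concepts)
    if exclude:
        query += " NOT " + or_block(exclude)
    return query


# Shortened term lists, one per PICO concept (illustrative subset).
population = ['"diabetes mellitus, type 2"[MeSH]', '"type 2 diabetes"[tiab]']
intervention = ['"sodium-glucose transporter 2 inhibitors"[MeSH]', '"SGLT2 inhibitor*"[tiab]']
outcome = ['"cardiovascular diseases"[MeSH]', '"MACE"[tiab]']
excluded = ['"type 1 diabetes"[tiab]', '"gestational"[tiab]']

query = build_query([population, intervention, outcome], exclude=excluded)
print(query)
```

The comparison concept is left out of the query here, as it is in the example, since comparators are often poorly indexed and are usually applied at screening instead.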
Search Sensitivity vs Specificity
A highly sensitive search (broad terms, no filters) minimizes the chance of missing relevant studies but returns many irrelevant records. A highly specific search is efficient but risks missing studies. For systematic reviews, maximize sensitivity: extra screening time is preferable to missed studies.
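The trade-off can be made concrete by testing a search against a set of studies already known to be relevant. The sketch below is a hedged illustration that reports sensitivity, precision (used here as a practical stand-in for specificity, since the number of irrelevant records in a database is rarely known), and the number needed to read. The function name `search_performance` and the record IDs are invented for the example.

```python
def search_performance(retrieved_ids, relevant_ids):
    """Return (sensitivity, precision, number needed to read) for a search."""
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    if not retrieved or not relevant:
        raise ValueError("both ID sets must be non-empty")
    hits = len(retrieved & relevant)
    sensitivity = hits / len(relevant)   # share of relevant studies found
    precision = hits / len(retrieved)    # share of retrieved records that are relevant
    # Number needed to read: records screened per relevant study found.
    nnr = 1 / precision if hits else float("inf")
    return sensitivity, precision, nnr


# Hypothetical record IDs: five known relevant studies.
known_relevant = {5, 50, 500, 900, 999}
broad = search_performance(range(1, 1001), known_relevant)
narrow = search_performance(range(1, 101), known_relevant)

for name, (sens, prec, nnr) in [("broad", broad), ("narrow", narrow)]:
    print(f"{name}: sensitivity={sens:.2f}, precision={prec:.3f}, NNR={nnr:.0f}")
```

In this toy setup the broad search finds every known study at the cost of more screening per hit, while the narrow search is cheaper to screen but misses studies.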
Inclusion and Exclusion Criteria
Definition: Eligibility Criteria
Eligibility criteria define which studies are included in the review. They must be pre-specified in a protocol and applied consistently.
Common Eligibility Domains
| Domain | Inclusion | Exclusion |
|---|---|---|
| Study design | RCTs, quasi-experimental | Case reports, editorials |
| Population | Adults ≥18 years | Pediatric, pregnant |
| Intervention | SGLT2 inhibitors (any dose) | Combination therapy only |
| Comparator | Placebo, active control | No comparator |
| Outcome | MACE, all-cause mortality | Surrogate endpoints only |
| Timeframe | Published 2010–2025 | Pre-2010 |
| Language | English | Non-English (if justified) |
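Applying criteria consistently is easier when they are written as explicit rules. The following sketch encodes a few of the domains above as checks that return the reasons for exclusion. The `Record` fields and the `screen` function are hypothetical simplifications. A real screening form would capture far more detail, and borderline cases still need reviewer judgment.

```python
from dataclasses import dataclass

ALLOWED_DESIGNS = {"RCT", "quasi-experimental"}


@dataclass
class Record:
    """Hypothetical, simplified screening record for one study."""
    title: str
    design: str
    min_age: int          # youngest eligible participant, in years
    has_comparator: bool
    year: int
    language: str


def screen(record):
    """Return (include, reasons); reasons lists every failed criterion."""
    reasons = []
    if record.design not in ALLOWED_DESIGNS:
        reasons.append("Wrong study design")
    if record.min_age < 18:
        reasons.append("Wrong population")
    if not record.has_comparator:
        reasons.append("No comparator")
    if not 2010 <= record.year <= 2025:
        reasons.append("Outside timeframe")
    if record.language != "English":
        reasons.append("Non-English")
    return (not reasons, reasons)


trial = Record("Trial A", "RCT", 18, True, 2019, "English")
report = Record("Case B", "case report", 12, False, 2005, "German")
print(screen(trial))
print(screen(report))
```

Recording every failed criterion, rather than stopping at the first, makes it simpler to tabulate exclusion reasons for the PRISMA flow diagram later.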
Risk of Bias Assessment
Definition: Risk of Bias
Risk of bias (RoB) refers to the likelihood that a study's design, conduct, or analysis introduced systematic error, leading to over- or under-estimation of the true effect.
Cochrane Risk of Bias Tool (RoB 2)
The Cochrane RoB 2 tool assesses five domains:
| Domain | Key Question |
|---|---|
| D1: Randomization process | Was allocation truly random? Was it concealed? |
| D2: Deviations from intended interventions | Were participants aware of allocation? |
| D3: Missing outcome data | Was attrition balanced and explained? |
| D4: Measurement of the outcome | Was the outcome measure valid and assessed blindly? |
| D5: Selection of reported result | Was the analysis pre-specified? |
Each domain is rated as Low risk, Some concerns, or High risk.
Overall Judgment
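If any domain is rated "High risk," the overall study is "High risk." If any domain has "Some concerns" and no domain is "High risk," the overall rating is "Some concerns." Only studies with "Low risk" across all domains receive an overall low risk of bias rating. RoB 2 also allows reviewers to assign an overall "High risk" when several domains raise "Some concerns" in a way that substantially lowers confidence in the result.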
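The worst-domain rule described above is mechanical enough to express in a few lines. This is a minimal sketch, assuming domain ratings are stored as the strings "Low", "Some concerns", and "High". The function name `overall_rob` is hypothetical, and any reviewer judgment beyond the rule is not modelled.

```python
VALID_RATINGS = {"Low", "Some concerns", "High"}


def overall_rob(domains):
    """Derive the overall rating from per-domain ratings (worst domain wins)."""
    ratings = set(domains.values())
    unknown = ratings - VALID_RATINGS
    if unknown:
        raise ValueError(f"unrecognized ratings: {sorted(unknown)}")
    if "High" in ratings:
        return "High"
    if "Some concerns" in ratings:
        return "Some concerns"
    return "Low"


# Illustrative domain ratings for two hypothetical studies.
study_a = {"D1": "Low", "D2": "Low", "D3": "Some concerns", "D4": "Low", "D5": "Low"}
study_b = {"D1": "High", "D2": "Some concerns", "D3": "Low", "D4": "Some concerns", "D5": "Low"}
print(overall_rob(study_a))
print(overall_rob(study_b))
```

Deriving the overall rating from the domain ratings, instead of typing it by hand, avoids tables where the two disagree.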
Data Extraction
Definition: Data Extraction
Data extraction is the systematic process of recording study characteristics and results from included studies into a standardized form. Two independent reviewers typically extract data, with discrepancies resolved by consensus or a third reviewer.
Standard Data Fields
| Category | Fields |
|---|---|
| Study | Author, year, country, design, sample size |
| Population | Age, sex, BMI, diabetes duration, HbA1c |
| Intervention | Drug, dose, duration |
| Comparator | Drug, dose, duration |
| Outcomes | Effect estimate (OR, HR, MD), 95% CI, n per group |
| Quality | RoB rating, GRADE certainty |
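When two reviewers extract independently, the first practical step is finding where their forms disagree. The sketch below compares two extraction records field by field and lists the mismatches for consensus discussion. The function name `find_discrepancies`, the field names, and the values are invented for illustration.

```python
def find_discrepancies(form_a, form_b):
    """Map each field where two extraction forms disagree to both values."""
    fields = sorted(set(form_a) | set(form_b))
    return {
        field: (form_a.get(field), form_b.get(field))
        for field in fields
        if form_a.get(field) != form_b.get(field)
    }


# Hypothetical extractions of the same study by two reviewers.
reviewer_1 = {"author": "Smith", "year": 2020, "sample_size": 1200, "hr": 0.86}
reviewer_2 = {"author": "Smith", "year": 2020, "sample_size": 1020, "hr": 0.86}

for field, (a, b) in find_discrepancies(reviewer_1, reviewer_2).items():
    print(f"{field}: reviewer 1 = {a}, reviewer 2 = {b} (resolve by consensus)")
```

A field missing from one form shows up as `None` on that side, so omissions are flagged along with conflicting values.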
GRADE Quality Assessment
Definition: GRADE
GRADE (Grading of Recommendations Assessment, Development and Evaluation) is a systematic approach for rating the certainty of evidence and strength of recommendations. It rates evidence as high, moderate, low, or very low certainty.
GRADE Domains
| Domain | Effect on Certainty |
|---|---|
| Risk of bias | ↓ Downgrade if serious limitations |
| Inconsistency | ↓ Downgrade if I² > 50% or unexplained heterogeneity |
| Indirectness | ↓ Downgrade if population, intervention, or outcome differs |
| Imprecision | ↓ Downgrade if wide CI crosses clinical decision threshold |
| Publication bias | ↓ Downgrade if funnel plot asymmetry or Egger's p < 0.10 |
Upgrade factors:
- ↑ Large effect (RR > 2 or < 0.5)
- ↑ Dose-response gradient
- ↑ All plausible confounding would reduce the observed effect
Starting Certainty
RCTs start at high certainty. Observational studies start at low certainty. Each domain can move the rating down (or up for observational studies with large effects).
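The bookkeeping behind a GRADE rating can be sketched as moving along a four-step scale. The example below is a simplified illustration, not a substitute for GRADE judgment: the function name `grade_certainty` and the idea of passing plain counts of downgrades and upgrades are assumptions, and upgrades are applied only to observational evidence, as described above.

```python
LEVELS = ["Very low", "Low", "Moderate", "High"]


def grade_certainty(design, downgrades=0, upgrades=0):
    """Return the final certainty level for one outcome.

    design: "RCT" starts at High; anything else is treated as observational
    and starts at Low. Upgrades are only counted for observational evidence.
    """
    if design == "RCT":
        index = LEVELS.index("High") - downgrades
    else:
        index = LEVELS.index("Low") - downgrades + upgrades
    # Clamp to the valid range: certainty cannot fall below Very low or exceed High.
    index = max(0, min(index, len(LEVELS) - 1))
    return LEVELS[index]


print(grade_certainty("RCT", downgrades=1))
print(grade_certainty("RCT", downgrades=2))
print(grade_certainty("observational", upgrades=1))
```

In practice each downgrade may be by one or two levels depending on how serious the concern is. Passing the total number of levels keeps the sketch small.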
Synthesis Without Meta-Analysis (SWiM)
Definition: SWiM
Synthesis without meta-analysis (SWiM) provides guidance for systematically reviewing evidence when quantitative pooling is inappropriate. It uses structured, transparent narrative synthesis with tabular and graphical summaries.
When to Use SWiM
- Studies use different outcome measures or scales
- Studies are too heterogeneous to pool meaningfully
- Few studies (K < 3) are available
- Methodological diversity prevents valid pooling
SWiM Methods
| Method | Description |
|---|---|
| Vote counting | Count studies directionally favorable/unfavorable |
| Harvest plots | Bar charts weighted by study quality |
| Blobbograms | Modified forest plots without pooling |
| Narrative synthesis | Structured textual summary by subgroups |
| Tabular summaries | Effect estimates, CIs, and quality ratings in tables |
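Of these methods, vote counting is the simplest to illustrate. The sketch below counts studies by direction of effect and attaches an exact two-sided sign test under the null hypothesis that either direction is equally likely. The function name `vote_count` and the +1/-1 coding are assumptions for the example, and the direction data are invented.

```python
from math import comb


def vote_count(directions):
    """Count effect directions and return (favorable, unfavorable, p_value).

    directions: +1 if a study favors the intervention, -1 if it favors the
    comparator. Zeros (no clear direction) are ignored.
    """
    favorable = sum(1 for d in directions if d > 0)
    unfavorable = sum(1 for d in directions if d < 0)
    n = favorable + unfavorable
    k = max(favorable, unfavorable)
    # Exact binomial tail probability of a split at least this lopsided (p = 0.5).
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)  # two-sided
    return favorable, unfavorable, p_value


# Hypothetical directions of effect for ten studies.
directions = [1, 1, 1, 1, 1, 1, 1, 1, -1, -1]
fav, unfav, p = vote_count(directions)
print(f"{fav} favorable, {unfav} unfavorable, sign test p = {p:.3f}")
```

Counting by direction of effect, not by statistical significance, avoids penalizing small studies, but the result says nothing about the size of the effect.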
Python Implementation
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# --- PRISMA flow diagram data ---
prisma_data = {
    'Stage': ['Identification', 'Screening', 'Eligibility', 'Included'],
    'Records': [4523, 2891, 312, 45],
    'Excluded': [1632, 2579, 267, 0]
}
print("=== PRISMA Flow ===")
for i, stage in enumerate(prisma_data['Stage']):
    print(f"{stage}: {prisma_data['Records'][i]} records")
    if prisma_data['Excluded'][i] > 0:
        print(f"  Excluded: {prisma_data['Excluded'][i]}")

# --- Risk of Bias Assessment ---
studies = [
    {'Study': 'Smith 2020', 'D1': 'Low', 'D2': 'Low', 'D3': 'Some concerns',
     'D4': 'Low', 'D5': 'Low', 'Overall': 'Some concerns'},
    {'Study': 'Jones 2021', 'D1': 'Low', 'D2': 'Low', 'D3': 'Low',
     'D4': 'Low', 'D5': 'Low', 'Overall': 'Low'},
    {'Study': 'Lee 2022', 'D1': 'High', 'D2': 'Some concerns', 'D3': 'Low',
     'D4': 'Some concerns', 'D5': 'Low', 'Overall': 'High risk'},
    {'Study': 'Chen 2023', 'D1': 'Low', 'D2': 'Low', 'D3': 'Low',
     'D4': 'Some concerns', 'D5': 'Low', 'Overall': 'Some concerns'},
    {'Study': 'Wang 2024', 'D1': 'Low', 'D2': 'Low', 'D3': 'Low',
     'D4': 'Low', 'D5': 'Low', 'Overall': 'Low'},
]
df_rob = pd.DataFrame(studies)
print("\n=== Risk of Bias Summary ===")
print(df_rob.to_string(index=False))

# Traffic light plot
rob_matrix = df_rob[['D1', 'D2', 'D3', 'D4', 'D5']].values
color_map = {'Low': '#2ecc71', 'Some concerns': '#f39c12', 'High': '#e74c3c',
             'High risk': '#e74c3c'}
fig, ax = plt.subplots(figsize=(10, 5))
for i in range(len(rob_matrix)):
    for j in range(5):
        color = color_map.get(rob_matrix[i, j], '#95a5a6')
        ax.add_patch(plt.Rectangle((j, len(rob_matrix) - i - 1), 1, 1,
                                   facecolor=color, edgecolor='white', linewidth=2))
ax.set_xlim(0, 5)
ax.set_ylim(0, len(rob_matrix))
ax.set_xticks([0.5, 1.5, 2.5, 3.5, 4.5])
ax.set_xticklabels(['Randomization', 'Deviations', 'Missing Data',
                    'Measurement', 'Reporting'])
ax.set_yticks([0.5 + i for i in range(len(rob_matrix))])
ax.set_yticklabels(df_rob['Study'].tolist()[::-1])
ax.set_title('Risk of Bias Traffic Light Plot')
plt.tight_layout()
plt.savefig('rob_traffic_light.png', dpi=150)
plt.show()

# --- GRADE Evidence Profile ---
grade_data = {
    'Outcome': ['MACE (3-point)', 'All-cause mortality', 'Heart failure hospitalization'],
    'Studies': [5, 4, 6],
    'Participants': [45000, 38000, 52000],
    'Risk of bias': ['Serious (-1)', 'Not serious', 'Serious (-1)'],
    'Inconsistency': ['Not serious', 'Serious (-1)', 'Not serious'],
    'Indirectness': ['Not serious', 'Not serious', 'Not serious'],
    'Imprecision': ['Not serious', 'Serious (-1)', 'Not serious'],
    'Publication bias': ['Undetected', 'Undetected', 'Undetected'],
    'Starting level': ['High', 'High', 'High'],
    'Final certainty': ['Moderate', 'Low', 'Moderate'],
    'Effect estimate': ['HR 0.86 (0.80-0.93)', 'HR 0.92 (0.84-1.01)', 'HR 0.72 (0.64-0.82)']
}
df_grade = pd.DataFrame(grade_data)
print("\n=== GRADE Evidence Profile ===")
print(df_grade.to_string(index=False))

# --- Study selection funnel ---
selection_data = {
    'Phase': ['Database search', 'Duplicate removal', 'Title/abstract screen',
              'Full-text review', 'Data extraction', 'Quality assessment', 'Final synthesis'],
    'Records': [4523, 2891, 892, 312, 45, 45, 45]
}
fig, ax = plt.subplots(figsize=(10, 6))
phases = selection_data['Phase']
counts = selection_data['Records']
colors = plt.cm.Blues(np.linspace(0.3, 0.9, len(phases)))
ax.barh(phases[::-1], counts[::-1], color=colors[::-1], edgecolor='black')
for i, (count, phase) in enumerate(zip(counts[::-1], phases[::-1])):
    ax.text(count + 50, i, str(count), va='center', fontsize=10)
ax.set_xlabel('Number of Records')
ax.set_title('Study Selection Funnel')
ax.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.savefig('study_selection.png', dpi=150)
plt.show()

# --- Inclusion/exclusion summary ---
print("\n=== Exclusion Reasons (Full-Text) ===")
exclusion_reasons = {
    'Wrong population': 87,
    'Wrong intervention': 62,
    'Wrong outcome': 45,
    'Wrong study design': 38,
    'Duplicate data': 21,
    'Insufficient data': 14
}
for reason, count in sorted(exclusion_reasons.items(), key=lambda x: -x[1]):
    print(f"  {reason}: {count} studies")
```
Key Takeaways
Summary: Systematic Review Methodology
- Systematic reviews use explicit, pre-specified methods to minimize bias; they are not narrative summaries.
- PRISMA 2020 provides the standard reporting framework with a 27-item checklist and flow diagram.
- Search strategies should maximize sensitivity using PICO-framed Boolean queries across multiple databases.
- Risk of bias assessment (Cochrane RoB 2) evaluates five domains: randomization, deviations, missing data, measurement, and reporting.
- GRADE rates evidence certainty from high to very low, starting from study design and adjusting for bias, inconsistency, indirectness, imprecision, and publication bias.
- SWiM provides structured methods for narrative synthesis when meta-analysis is inappropriate.
- Data extraction should be performed independently by two reviewers with pre-specified data fields.