Statistics Review and Roadmap
Advanced Statistical Methods
Your Complete Guide to Mastering Statistics
This comprehensive review connects all major statistics topics from descriptive methods to Bayesian inference, providing structured learning paths for every level. It maps the full landscape of the discipline and charts your course through it.
- Beginner path -- Build foundations in probability, estimation, and hypothesis testing
- Intermediate path -- Master regression, ANOVA, multivariate methods, and experimental design
- Advanced path -- Explore Bayesian methods, high-dimensional statistics, and specialized applications
Statistics is not a destination but a journey -- this roadmap ensures you never lose your way.
The Statistics Curriculum
The discipline of statistics can be organized into a coherent curriculum: a foundation in descriptive statistics, followed by five pillars covering probability theory, statistical inference, regression and linear models, applied methods, and advanced and Bayesian methods. This roadmap provides a structured overview of the entire field, connecting concepts and identifying learning paths.
"Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write." -- attributed to H.G. Wells
Foundation: Descriptive Statistics
Descriptive Statistics
Descriptive statistics summarizes and visualizes data through measures of central tendency, dispersion, and distribution shape. It forms the foundation for all statistical reasoning.
Core Topics
| Topic | Key Concepts | Difficulty |
|---|---|---|
| Levels of Measurement | Nominal, ordinal, interval, ratio | Beginner |
| Central Tendency | Mean, median, mode, trimmed mean | Beginner |
| Dispersion | Variance, SD, IQR, range, CV | Beginner |
| Shape | Skewness, kurtosis | Beginner |
| Data Visualization | Histograms, box plots, scatter plots | Beginner |
| Tabulation | Frequency distributions, contingency tables | Beginner |
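As a minimal sketch, several of the measures in the table can be computed with Python's standard `statistics` module. The sample values are hypothetical and chosen only for illustration; the quartiles use the module's default "exclusive" method, which is one of several common conventions.

```python
import statistics

# Hypothetical sample, chosen only for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Dispersion
sample_var = statistics.variance(data)   # divides by n - 1
sample_sd = statistics.stdev(data)
data_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1
cv = sample_sd / mean                    # coefficient of variation

print(f"mean={mean}, median={median}, mode={mode}")
print(f"variance={sample_var:.3f}, sd={sample_sd:.3f}, range={data_range}")
print(f"IQR={iqr:.3f}, CV={cv:.3f}")
```

The median and IQR are less sensitive to extreme values than the mean and standard deviation, which is why both families of measures are usually reported together.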
Mathematical Foundations
Key Formulas to Master
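A hedged sketch of the formulas usually grouped under this heading, assuming a sample $x_1, \dots, x_n$, the $n - 1$ denominator for the sample variance, and quartiles $Q_1$ and $Q_3$ (the notation is an assumption, not taken from the text above):

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
\qquad
\mathrm{CV} = \frac{s}{\bar{x}}
\qquad
\mathrm{IQR} = Q_3 - Q_1
```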
Pillar 1: Probability Theory
Probability Theory
Probability theory provides the mathematical framework for quantifying uncertainty. It underpins all of statistical inference: we reason about data by computing the probability of observing such data under various hypotheses.
Core Topics
| Topic | Key Concepts | Difficulty |
|---|---|---|
| Probability Axioms | Kolmogorov axioms, sample spaces, events | Beginner |
| Conditional Probability | Bayes' theorem, independence | Beginner-Intermediate |
| Random Variables | PMF, PDF, CDF, expectation, variance | Intermediate |
| Discrete Distributions | Binomial, Poisson, geometric, negative binomial | Intermediate |
| Continuous Distributions | Normal, exponential, gamma, beta, chi-square | Intermediate |
| Joint Distributions | Marginal, conditional, covariance, correlation | Intermediate |
| Limit Theorems | CLT, LLN, convergence concepts | Intermediate-Advanced |
The Probability Distributions to Know
Essential Distributions
Every statistician must have deep familiarity with these distributions:
- Normal -- the backbone of parametric statistics (CLT)
- Binomial -- counting successes in Bernoulli trials
- Poisson -- modeling rare event counts
- Exponential/Gamma -- waiting times, survival analysis
- Beta -- modeling proportions, Bayesian conjugacy
- Chi-square -- goodness-of-fit, contingency tables
- t-distribution -- small-sample inference
- F-distribution -- ANOVA, variance ratios
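The link between the normal distribution and the CLT can be illustrated with a small simulation. This is a sketch under stated assumptions: the Exponential(1) population, the sample size `n`, and the number of repetitions `reps` are illustrative choices, not values from the text.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean_exponential(n, rate=1.0):
    """Mean of n independent draws from an Exponential(rate) distribution."""
    return statistics.fmean(random.expovariate(rate) for _ in range(n))

n = 50       # observations per sample (illustrative choice)
reps = 2000  # number of repeated samples (illustrative choice)
means = [sample_mean_exponential(n) for _ in range(reps)]

# Exponential(1) has population mean 1 and population SD 1, so the CLT
# predicts that the sample means have SD close to 1 / sqrt(n).
print(f"mean of sample means: {statistics.fmean(means):.3f}")
print(f"SD of sample means:   {statistics.stdev(means):.3f}")
print(f"CLT prediction:       {1 / n ** 0.5:.3f}")
```

Even though the exponential distribution is strongly skewed, the distribution of its sample means is already close to normal at moderate `n`.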
Pillar 2: Statistical Inference
Statistical Inference
Statistical inference is the process of drawing conclusions about populations from sample data, quantifying uncertainty in those conclusions. It encompasses estimation, hypothesis testing, and confidence/credible intervals.
Estimation
| Topic | Key Concepts | Difficulty |
|---|---|---|
| Point Estimation | MLE, method of moments, sufficient statistics | Intermediate |
| Properties of Estimators | Unbiasedness, consistency, efficiency, MSE | Intermediate |
| Confidence Intervals | Wald, score, bootstrap CIs | Intermediate |
| Sample Size Determination | Power analysis, effect sizes | Intermediate |
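To make the interval types in the table concrete, here is a minimal sketch of the Wald and Wilson score intervals for a binomial proportion, using only the standard library. The function names and the counts (40 successes in 100 trials) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(successes, n, confidence=0.95):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval, obtained by inverting the score test."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical data: 40 successes in 100 trials
wald_low, wald_high = wald_interval(40, 100)
wilson_low, wilson_high = wilson_interval(40, 100)
print(f"Wald:   ({wald_low:.4f}, {wald_high:.4f})")
print(f"Wilson: ({wilson_low:.4f}, {wilson_high:.4f})")
```

The Wilson interval is pulled slightly toward 0.5 and generally behaves better than the Wald interval for small samples or proportions near 0 or 1.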
Hypothesis Testing
| Topic | Key Concepts | Difficulty |
|---|---|---|
| Null/Alternative Hypotheses | One-sided vs. two-sided | Beginner-Intermediate |
| Type I/II Errors | Alpha, beta, power | Intermediate |
| p-values | Definition, interpretation, misuse | Intermediate |
| z-tests and t-tests | One-sample, two-sample, paired | Intermediate |
| Chi-square Tests | Goodness-of-fit, independence | Intermediate |
| F-test | Equality of variances, ANOVA | Intermediate |
| Nonparametric Tests | Wilcoxon, Mann-Whitney, Kruskal-Wallis | Intermediate |
Mathematical Framework
Neyman-Pearson Lemma
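For testing $H_0: \theta = \theta_0$ vs. $H_1: \theta = \theta_1$, the most powerful test of size $\alpha$ rejects $H_0$ when:
$$\Lambda(x) = \frac{L(\theta_1 \mid x)}{L(\theta_0 \mid x)} > k$$
where $k$ is chosen so that $P(\Lambda(X) > k \mid H_0) = \alpha$.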
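The lemma can be sketched for one concrete case: independent normal observations with known variance and two simple hypotheses about the mean. All numeric values below (`mu0`, `mu1`, `sigma`, `n`, `alpha`) are illustrative assumptions. In this case the likelihood ratio is increasing in the sample mean, so the most powerful test reduces to a cutoff on the sample mean.

```python
from math import exp, sqrt
from statistics import NormalDist

# Illustrative values: H0: mu = mu0 versus H1: mu = mu1, known sigma
mu0, mu1, sigma, n, alpha = 0.0, 1.0, 1.0, 9, 0.05

def likelihood_ratio(xbar):
    """L(mu1) / L(mu0) for n iid N(mu, sigma^2) observations with mean xbar."""
    return exp(n * (mu1 - mu0) * (xbar - (mu0 + mu1) / 2) / sigma**2)

se = sigma / sqrt(n)
# Cutoff on the sample mean that gives size alpha under H0
c = mu0 + NormalDist().inv_cdf(1 - alpha) * se
# Equivalent threshold k on the likelihood ratio
k = likelihood_ratio(c)

size = 1 - NormalDist(mu0, se).cdf(c)   # P(reject | H0)
power = 1 - NormalDist(mu1, se).cdf(c)  # P(reject | H1)
print(f"cutoff on xbar: {c:.4f}, threshold k: {k:.4f}")
print(f"size: {size:.4f}, power: {power:.4f}")
```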
Pillar 3: Regression and Linear Models
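Linear Models
The general linear model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I})$ is the workhorse of applied statistics. Extensions include generalized linear models, mixed effects models, and regularized variants.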
Topic Map
| Topic | Key Concepts | Difficulty |
|---|---|---|
| Simple Linear Regression | OLS, slope/intercept, $R^2$ | Intermediate |
| Multiple Regression | Multicollinearity, adjusted $R^2$ | Intermediate |
| Regression Diagnostics | Residuals, leverage, Cook's distance | Intermediate |
| Heteroscedasticity | Breusch-Pagan, White's test, WLS | Intermediate |
| Logistic Regression | Odds ratios, logit, Wald test | Intermediate |
| Regularized Regression | Ridge, Lasso, Elastic Net | Advanced |
| Quantile Regression | Conditional quantiles | Advanced |
| ANOVA/Factorial Designs | One-way, two-way, interactions | Intermediate |
| MANOVA/ANCOVA | Multivariate and adjusted comparisons | Advanced |
| Generalized Linear Models | Link functions, exponential family | Advanced |
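OLS Estimator
For a design matrix $\mathbf{X}$ and response vector $\mathbf{y}$:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
Under the Gauss-Markov assumptions (zero-mean, homoscedastic, uncorrelated errors), the OLS estimator is BLUE (Best Linear Unbiased Estimator). Normality of the errors is not required for this result.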
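A minimal NumPy sketch of computing the OLS estimator from the normal equations, on simulated data. The coefficients in `beta_true`, the noise scale, and the sample size are hypothetical. Solving the linear system is used here instead of forming an explicit inverse, and the result is cross-checked against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

n = 200
# Design matrix: intercept column plus two simulated predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])  # hypothetical coefficients
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X'X) beta = X'y rather than inverting X'X
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", np.round(beta_hat, 3))
print("lstsq:           ", np.round(beta_lstsq, 3))
```

When predictors are highly collinear, $\mathbf{X}^\top \mathbf{X}$ becomes ill-conditioned and a QR- or SVD-based solver such as `lstsq` is the safer choice.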
Pillar 4: Applied Methods
Experimental Design
| Topic | Key Concepts | Difficulty |
|---|---|---|
| Design of Experiments | Randomization, blocking, factorial | Intermediate |
| Response Surface Methods | Optimization, central composite designs | Advanced |
| Adaptive Trial Designs | Group sequential, Bayesian adaptive | Advanced |
| Optimal Design | D-optimal, A-optimal, information criteria | Advanced |
Multivariate Methods
| Topic | Key Concepts | Difficulty |
|---|---|---|
| PCA | Eigenvectors, variance explained, scree plots | Intermediate |
| Factor Analysis | Latent variables, rotation, communalities | Advanced |
| Cluster Analysis | K-means, hierarchical, DBSCAN | Intermediate |
| Discriminant Analysis | LDA, QDA, Fisher's criterion | Intermediate |
| MANOVA | Multivariate hypothesis testing | Advanced |
| Canonical Correlation | Relationships between variable sets | Advanced |
| MDS | Multidimensional scaling | Advanced |
Time Series Analysis
| Topic | Key Concepts | Difficulty |
|---|---|---|
| Stationarity | Weak/strong stationarity, unit root tests | Intermediate |
| ACF/PACF | Autocorrelation, partial autocorrelation | Intermediate |
| ARIMA Models | AR, MA, ARMA, ARIMA, seasonal | Advanced |
| Exponential Smoothing | Simple, Holt, Holt-Winters | Intermediate |
| Granger Causality | Lag-based predictive causation | Advanced |
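As a sketch of what the ACF measures, the function below computes the sample autocorrelation at lags 0 through `max_lag` and applies it to a simulated AR(1) series. The function name `acf`, the coefficient `phi`, and the series length are illustrative assumptions. For an AR(1) process the theoretical autocorrelation at lag k is `phi ** k`.

```python
import random

def acf(series, max_lag):
    """Sample autocorrelation at lags 0..max_lag (biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    c0 = sum(d * d for d in centered)  # lag-0 sum of squares
    return [
        sum(centered[t] * centered[t + k] for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]

random.seed(1)  # fixed seed so the run is repeatable

# Simulate AR(1): x_t = phi * x_{t-1} + e_t with standard normal noise
phi = 0.7
x = [0.0]
for _ in range(1999):
    x.append(phi * x[-1] + random.gauss(0, 1))

r = acf(x, 3)
for lag, value in enumerate(r):
    print(f"lag {lag}: sample ACF = {value:.3f}, AR(1) theory = {phi ** lag:.3f}")
```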
Survival Analysis
| Topic | Key Concepts | Difficulty |
|---|---|---|
| Kaplan-Meier | Survival curves, censoring | Intermediate |
| Cox Proportional Hazards | Hazard ratios, proportional hazards | Advanced |
| Event History Analysis | Competing risks, recurrent events | Advanced |
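The Kaplan-Meier estimator can be sketched in a few lines of plain Python. The function name `kaplan_meier` and the follow-up data are hypothetical. At each distinct event time the survival estimate is multiplied by `1 - deaths / at_risk`, and subjects censored at a given time are counted as at risk at that time.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times:  follow-up time for each subject
    events: 1 if the event was observed, 0 if the subject was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        leaving = sum(1 for tt, _ in data if tt == t)  # events + censored at t
        if deaths > 0:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving
        i += leaving
    return curve

# Hypothetical follow-up data (0 marks a censored observation)
times = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t}: S(t) = {s:.4f}")
```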
Pillar 5: Advanced and Bayesian Methods
Bayesian Statistics
| Topic | Key Concepts | Difficulty |
|---|---|---|
| Bayesian Inference | Prior, posterior, conjugacy | Advanced |
| Bayesian Regression | Posterior predictive, credible intervals | Advanced |
| Hierarchical Bayesian Models | Random effects, partial pooling | Advanced |
| MCMC Diagnostics | Convergence, trace plots, R-hat, ESS | Advanced |
| Model Comparison | Bayes factors, DIC, WAIC | Advanced |
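Bayes' Theorem
$$P(\theta \mid y) = \frac{P(y \mid \theta)\, P(\theta)}{P(y)}$$
where $P(\theta)$ is the prior, $P(y \mid \theta)$ the likelihood, $P(y)$ the marginal likelihood (evidence), and $P(\theta \mid y)$ the posterior.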
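A small sketch of Bayes' theorem in action, assuming a Beta prior on a binomial success probability. The prior parameters and the data are hypothetical. The conjugate update gives the posterior in closed form, and a grid approximation that multiplies prior by likelihood and normalizes serves as a numerical check.

```python
# Hypothetical prior and data
a, b = 2, 2    # Beta(a, b) prior on the success probability
k, n = 7, 10   # k successes observed in n trials

# Conjugate update: Beta(a, b) prior + binomial data -> Beta(a + k, b + n - k)
post_a, post_b = a + k, b + (n - k)
post_mean = post_a / (post_a + post_b)

# Grid check of Bayes' theorem: posterior is proportional to prior * likelihood
grid = [(i + 0.5) / 1000 for i in range(1000)]
unnormalized = [
    t ** (a - 1) * (1 - t) ** (b - 1)   # prior, up to a constant
    * t ** k * (1 - t) ** (n - k)       # likelihood, up to a constant
    for t in grid
]
evidence = sum(unnormalized)            # normalizing constant on the grid
posterior = [u / evidence for u in unnormalized]
grid_mean = sum(t * p for t, p in zip(grid, posterior))

print(f"posterior: Beta({post_a}, {post_b})")
print(f"posterior mean (closed form): {post_mean:.4f}")
print(f"posterior mean (grid):        {grid_mean:.4f}")
```

Conjugacy is what makes the closed form possible here; for non-conjugate models the normalizing constant is usually intractable, which is where MCMC comes in.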
Causal Inference
| Topic | Key Concepts | Difficulty |
|---|---|---|
| Causal Inference Intro | Potential outcomes, SUTVA | Advanced |
| Randomized Controlled Trials | Randomization, intention-to-treat | Intermediate |
| Instrumental Variables | Exogeneity, exclusion restriction | Advanced |
| Regression Discontinuity | Sharp/fuzzy, bandwidth selection | Advanced |
| Difference-in-Differences | Parallel trends, staggered adoption | Advanced |
| Propensity Score Matching | Balance, overlap, ATT estimation | Advanced |
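The arithmetic behind difference-in-differences can be sketched with four group means. The numbers and the function name are hypothetical. The estimate is the change in the treated group minus the change in the control group, and it identifies a causal effect only if the parallel trends assumption holds.

```python
# Hypothetical mean outcomes by group and period, for illustration only
means = {
    ("treated", "pre"): 10.0,
    ("treated", "post"): 15.0,
    ("control", "pre"): 8.0,
    ("control", "post"): 10.0,
}

def diff_in_diff(m):
    """(treated post - treated pre) - (control post - control pre)."""
    treated_change = m[("treated", "post")] - m[("treated", "pre")]
    control_change = m[("control", "post")] - m[("control", "pre")]
    return treated_change - control_change

estimate = diff_in_diff(means)
print(f"DiD estimate: {estimate}")
```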
Specialized Methods
| Topic | Key Concepts | Difficulty |
|---|---|---|
| Missing Data | MCAR, MAR, MNAR | Advanced |
| Multiple Imputation | Rubin's rules, chained equations | Advanced |
| Meta-Analysis | Fixed/random effects, heterogeneity | Advanced |
| Robust Statistics | M-estimators, breakdown point | Advanced |
| High-Dimensional Statistics | Sparsity, LASSO, compressed sensing | Advanced |
| Spatial Statistics | Kriging, geostatistics, spatial autocorrelation | Advanced |
| Extreme Value Theory | GEV, GP distribution, return levels | Advanced |
| Copulas | Dependence structures, marginal distributions | Advanced |
Learning Paths
Beginner Path (0-6 months)
Beginner Curriculum
Goal: Build intuition for data and basic statistical reasoning.
Prerequisites: Basic algebra
Topics to cover (in order):
- What is Statistics
- Types of Data / Levels of Measurement
- Data Collection Methods
- Sampling Techniques and Bias
- Frequency Distributions and Histograms
- Measures of Central Tendency
- Variance and Standard Deviation
- Correlation (Pearson, Spearman)
- Introduction to Probability
- Normal Distribution and Z-scores
- Confidence Intervals
- Hypothesis Testing Basics (z-test, t-test)
Time commitment: 5-8 hours/week for 6 months
Intermediate Path (6-18 months)
Intermediate Curriculum
Goal: Master core statistical methods and regression.
Prerequisites: Beginner path or equivalent
Topics to cover:
- Simple and Multiple Linear Regression
- Regression Diagnostics
- Logistic Regression
- ANOVA (One-way, Two-way)
- Chi-square Tests
- Nonparametric Tests
- Experimental Design
- Time Series Introduction (ACF/PACF, basic ARIMA)
- Survival Analysis (Kaplan-Meier)
- Principal Component Analysis
- Bootstrap Methods
- Cross-Validation
Time commitment: 8-12 hours/week for 12 months
Advanced Path (18-36 months)
Advanced Curriculum
Goal: Master modern and specialized methods.
Prerequisites: Intermediate path, linear algebra, calculus
Topics to cover:
- Bayesian Statistics (hierarchical models, MCMC)
- Causal Inference (IV, RDD, DiD, PSM)
- Meta-Analysis and Systematic Review
- High-Dimensional Statistics
- Regularized Regression (Ridge, Lasso, Elastic Net)
- Spatial Statistics
- Extreme Value Theory
- Copulas
- Mixture Models
- Hidden Markov Models
- Streaming Statistics and Online Learning
- Statistics Meets Machine Learning
Time commitment: 10-15 hours/week for 18 months
Recommended Textbooks
Beginner
| Textbook | Author(s) | Strength |
|---|---|---|
| OpenIntro Statistics | Diez, Barr, Cetinkaya-Rundel | Free, modern, excellent examples |
| Introductory Statistics | OpenStax | Free, comprehensive |
| Statistics | Freedman, Pisani, Purves | Unique intuitive approach |
Intermediate
| Textbook | Author(s) | Strength |
|---|---|---|
| Applied Linear Statistical Models | Kutner et al. | Regression reference, problem sets |
| An Introduction to Statistical Learning | James, Witten, Hastie, Tibshirani | Accessible ML/stats bridge, free PDF |
| Statistical Methods | Snedecor & Cochran | Classic, thorough |
| Time Series Analysis | Hamilton | Comprehensive, rigorous |
| Causal Inference: The Mixtape | Cunningham | Modern, free, excellent examples |
Advanced
| Textbook | Author(s) | Strength |
|---|---|---|
| All of Statistics | Wasserman | Concise, covers breadth |
| Bayesian Data Analysis | Gelman et al. | Bayesian bible (BDA3) |
| The Elements of Statistical Learning | Hastie, Tibshirani, Friedman | Rigorous ML theory, free PDF |
| Asymptotic Statistics | van der Vaart | Mathematical statistics reference |
| High-Dimensional Statistics | Wainwright | Modern theory, sparse recovery |
| Causal Inference | Imbens & Rubin | Potential outcomes framework |
Online Resources
Free Courses
| Resource | Platform | Level | Focus |
|---|---|---|---|
| Statistical Learning | Stanford (edX) | Intermediate | ML from stats perspective |
| Introduction to Probability | Harvard (edX) | Intermediate | Probability theory |
| Bayesian Statistics | UCSC (Coursera) | Advanced | Bayesian methods |
| Data Science Specialization | Johns Hopkins (Coursera) | Beginner-Intermediate | R-based, applied |
| Mathematics for Machine Learning | Imperial (Coursera) | Intermediate | Linear algebra, calculus |
Interactive Learning
| Resource | Description |
|---|---|
| Seeing Theory | Visual probability/statistics (Brown) |
| Stat Trek | Online calculators and tutorials |
| Cross Validated | Stack Exchange for statistics |
| R-bloggers | R community blog aggregator |
| Towards Data Science | ML/data science articles (Medium) |
Certification Paths
Professional Certifications
Certifications validate skills and can accelerate career advancement:
- PStat (Professional Statistician) -- ASA's gold standard; requires education, experience, and peer review. Demonstrates competence and ethical commitment.
- SAS Certified -- Multiple levels (Base, Advanced, Specialist). Required for many pharmaceutical and regulatory roles.
- Google Data Analytics Certificate -- Entry-level, good for career changers into data.
- AWS Machine Learning Specialty -- Validates cloud ML deployment skills.
- Six Sigma (Green/Black Belt) -- Process improvement; valued in manufacturing and consulting.
- Certified Analytics Professional (CAP) -- Broad analytics certification.
Certification Strategy
For career advancement: PStat for statistics-specific roles, SAS for pharma/regulatory, AWS/GCP for cloud-focused roles. Certifications are most valuable early in career or when transitioning between sectors. At senior levels, publications and demonstrated impact matter more than certifications.
Python Implementation: Topic Difficulty Analysis
```python
import pandas as pd

# Map the statistics curriculum with difficulty and prerequisites
topics = pd.DataFrame({
    'Topic': [
        'Descriptive Statistics', 'Probability Basics', 'Distributions',
        'Confidence Intervals', 'Hypothesis Testing', 'Correlation',
        'Simple Linear Regression', 'Multiple Regression', 'ANOVA',
        'Logistic Regression', 'Time Series (ARIMA)', 'PCA',
        'Bayesian Inference', 'Causal Inference', 'Meta-Analysis',
        'Survival Analysis', 'High-Dim Statistics', 'Streaming Methods'
    ],
    # Difficulty on a 1 (easiest) to 5 (hardest) scale
    'Difficulty': [1, 1, 2, 2, 2, 1, 3, 3, 3, 3, 4, 3, 4, 5, 4, 4, 5, 5],
    'Hours_To_Master': [20, 40, 60, 30, 30, 15, 40, 60, 50, 50, 80, 50, 80, 100, 60, 70, 100, 80],
    'Prerequisites': [
        'None', 'Descriptive Stats', 'Probability',
        'Distributions', 'Distributions', 'Descriptive Stats',
        'Correlation', 'Simple Reg', 'Multiple Reg',
        'Multiple Reg', 'Regression', 'Regression',
        'Probability', 'Regression + Causal', 'Hypothesis Testing',
        'Survival Analysis', 'Regression + Bayesian', 'Bayesian + ML'
    ]
})

# Print one row per topic, with difficulty shown as a run of stars
print("=== Statistics Curriculum Map ===")
print(f"{'Topic':<25s} {'Level':>8s} {'Hours':>8s} {'Prerequisites'}")
print("-" * 80)
for row in topics.itertuples(index=False):
    stars = '*' * int(row.Difficulty)
    print(f"{row.Topic:<25s} {stars:>8s} {int(row.Hours_To_Master):>7d}h {row.Prerequisites}")

# Convert the total into calendar time at two weekly study loads
total_hours = topics['Hours_To_Master'].sum()
print(f"\nTotal hours to master all topics: ~{total_hours} hours")
print(f"  At 10 hrs/week: {total_hours / 10 / 52:.1f} years")
print(f"  At 20 hrs/week: {total_hours / 20 / 52:.1f} years")

# Learning path analysis
# Note: difficulty-2 topics are counted in both the beginner and intermediate paths
beginner = topics[topics['Difficulty'] <= 2]
intermediate = topics[(topics['Difficulty'] >= 2) & (topics['Difficulty'] <= 3)]
advanced = topics[topics['Difficulty'] >= 4]

print("\n=== Learning Path Summary ===")
print(f"Beginner (1-2): {len(beginner)} topics, ~{beginner['Hours_To_Master'].sum()} hours")
print(f"Intermediate (2-3): {len(intermediate)} topics, ~{intermediate['Hours_To_Master'].sum()} hours")
print(f"Advanced (4-5): {len(advanced)} topics, ~{advanced['Hours_To_Master'].sum()} hours")
```
Key Takeaways
Summary: Statistics Review and Roadmap
- Statistics builds from a foundation of descriptive statistics through five pillars: probability theory, statistical inference, regression and linear models, applied methods, and advanced and Bayesian methods. Each builds on the previous.
- Beginner path (0-6 months): Focus on descriptive statistics, probability, normal distribution, confidence intervals, and basic hypothesis testing.
- Intermediate path (6-18 months): Master regression, logistic regression, ANOVA, nonparametric tests, and introductory time series.
- Advanced path (18-36 months): Bayesian methods, causal inference, high-dimensional statistics, meta-analysis, and specialized topics.
- Textbook recommendations: Introductory Statistics (beginner), Applied Linear Statistical Models (intermediate), BDA3 and All of Statistics (advanced).
- Certifications: PStat for statistics professionals, SAS for pharma, cloud certifications for ML engineering.
- The field is evolving: Streaming statistics, AI fairness, causal inference, and privacy-preserving methods are emerging frontiers that extend the classical curriculum.
- Continuous learning: Statistics is a lifelong learning journey. Even experienced practitioners must continually update their skills as new methods and applications emerge.