Capstone: End-to-End Data Science Project

Your capstone should demonstrate mastery across the entire data science lifecycle, from problem selection to deployment. Treat it as the centerpiece of your portfolio.

Capstone Workflow

1. Problem Selection Criteria

Criterion	Good	Bad
Impact	Solves real business problem	Academic exercise
Data	Available and sufficient	Too small or inaccessible
Scope	Completable in 2-4 weeks	Too ambitious
Novelty	Fresh angle or approach	Tutorial copy-paste
Deployable	Can be served as API/app	Notebook-only

2. Project Structure

Architecture Diagram

capstone/
├── README.md
├── proposal.md
├── data/
│   ├── raw/
│   ├── processed/
│   └── data_dictionary.md
├── notebooks/
│   ├── 01_EDA.ipynb
│   ├── 02_Feature_Engineering.ipynb
│   ├── 03_Modeling.ipynb
│   └── 04_Evaluation.ipynb
├── src/
│   ├── data/
│   │   ├── collect.py
│   │   └── clean.py
│   ├── features/
│   │   └── build_features.py
│   ├── models/
│   │   ├── train.py
│   │   └── predict.py
│   ├── api/
│   │   └── app.py
│   └── utils.py
├── configs/
│   └── params.yaml
├── tests/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── .github/
    └── workflows/
        └── ci.yml
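
One way to set up this layout is a short scaffolding script. The sketch below is only an illustration: the `scaffold` helper and the `DIRECTORIES` and `FILES` lists are hypothetical names, with entries copied from the tree above. It creates empty placeholder files and skips the notebooks, because an empty `.ipynb` file is not a valid notebook.

```python
from pathlib import Path

# Directories copied from the project tree above.
DIRECTORIES = [
    "data/raw",
    "data/processed",
    "notebooks",
    "src/data",
    "src/features",
    "src/models",
    "src/api",
    "configs",
    "tests",
    ".github/workflows",
]

# Placeholder files copied from the project tree above.
# Notebooks are left out: create them from Jupyter instead.
FILES = [
    "README.md",
    "proposal.md",
    "data/data_dictionary.md",
    "src/data/collect.py",
    "src/data/clean.py",
    "src/features/build_features.py",
    "src/models/train.py",
    "src/models/predict.py",
    "src/api/app.py",
    "src/utils.py",
    "configs/params.yaml",
    "Dockerfile",
    "docker-compose.yml",
    "requirements.txt",
    ".github/workflows/ci.yml",
]


def scaffold(root):
    """Create the capstone layout under `root`; return the files newly created."""
    root = Path(root)
    for directory in DIRECTORIES:
        (root / directory).mkdir(parents=True, exist_ok=True)

    created = []
    for name in FILES:
        path = root / name
        if not path.exists():  # never overwrite existing work
            path.touch()
            created.append(path)
    return created
```

Calling `scaffold("capstone")` from an empty working directory produces the skeleton. Running it a second time changes nothing, since existing files are left untouched.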

3. Timeline

4. Evaluation Template

Model comparison starts with lift over a baseline, then uses a statistical test to check whether the improvement could be due to chance:

\text{Lift} = \frac{\text{Model Performance}}{\text{Baseline Performance}}
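
As a worked example of the lift formula, the sketch below uses the illustrative AUC values from the model comparison in this section and assumes logistic regression is the baseline. The `lift` helper is a hypothetical name, not part of any library.

```python
def lift(model_score, baseline_score):
    """Ratio of model performance to baseline performance."""
    if baseline_score <= 0:
        raise ValueError("baseline score must be positive")
    return model_score / baseline_score


# Illustrative AUC values; logistic regression is the assumed baseline.
auc = {
    "logistic_regression": 0.87,
    "random_forest": 0.92,
    "xgboost": 0.95,
    "lightgbm": 0.94,
}
baseline = auc["logistic_regression"]

# Lift of each model relative to the baseline, rounded for reporting.
lifts = {name: round(lift(score, baseline), 3) for name, score in auc.items()}

for name, value in lifts.items():
    print(f"{name}: {value}")
```

A lift above 1 means the model beats the baseline on that metric. Here the lift for xgboost is about 1.09, roughly a 9% relative improvement in AUC. Lift alone says nothing about whether the gap is larger than fold-to-fold noise, which is what the significance test addresses.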

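Use the Wilcoxon signed-rank test on paired scores (for example, per-fold cross-validation scores computed on the same splits) to test for significance at the 5% level:

p < 0.05 \implies \text{statistically significant difference between the two models}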

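from scipy.stats import wilcoxon

# Model comparison framework (illustrative metrics)
results = {
    "logistic_regression": {"accuracy": 0.82, "f1": 0.79, "auc": 0.87},
    "random_forest": {"accuracy": 0.87, "f1": 0.85, "auc": 0.92},
    "xgboost": {"accuracy": 0.91, "f1": 0.89, "auc": 0.95},
    "lightgbm": {"accuracy": 0.90, "f1": 0.88, "auc": 0.94},
}

# Statistical significance: paired per-fold scores from the same CV splits.
# Placeholder values for illustration only; replace with your own scores.
xgb_scores = [0.912, 0.905, 0.918, 0.909, 0.921, 0.903, 0.915, 0.910, 0.907, 0.919]
rf_scores = [0.871, 0.866, 0.880, 0.872, 0.878, 0.861, 0.869, 0.875, 0.863, 0.885]
stat, p_value = wilcoxon(xgb_scores, rf_scores)
print(f"p-value: {p_value:.4f}")  # Significant if p < 0.05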
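
To see what the test computes, here is a from-scratch sketch of the exact two-sided Wilcoxon signed-rank test. It is for intuition only and is not SciPy's implementation. The `wilcoxon_exact` name is hypothetical, and the sketch assumes a small number of pairs with no zero differences and no tied absolute differences.

```python
from itertools import product


def wilcoxon_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (statistic, p_value), where statistic = min(W+, W-).
    Assumes no zero differences and no tied absolute differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    abs_diffs = [abs(d) for d in diffs]
    if any(d == 0 for d in diffs):
        raise ValueError("zero differences are not handled in this sketch")
    if len(set(abs_diffs)) != len(abs_diffs):
        raise ValueError("tied absolute differences are not handled in this sketch")

    n = len(diffs)
    # Rank absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs_diffs[i])
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank

    # Sum the ranks of positive and negative differences separately.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    statistic = min(w_plus, w_minus)

    # Under the null hypothesis each rank is equally likely to carry a
    # plus or minus sign, so enumerate all 2^n sign patterns and count
    # those at least as extreme as the observed statistic.
    total = n * (n + 1) // 2
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= statistic:
            extreme += 1
    return statistic, extreme / 2 ** n
```

The enumeration explains a practical limit. With five paired scores, the smallest possible two-sided p-value is 2/32 = 0.0625, so 5-fold cross-validation can never reach p < 0.05 with this test. At least six paired scores are needed (2/64 = 0.03125). The loop grows as 2^n, so this sketch only suits small n; use `scipy.stats.wilcoxon` for real analyses.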

5. Writeup Structure

Executive Summary (1 paragraph)
Problem Definition (business context, objectives)
Data Description (source, size, features, quality)
Methodology (approach, models, evaluation)
Results (metrics, visualizations, insights)
Limitations (what the model cannot do)
Future Work (improvements, extensions)
Appendix (code references, hyperparameters)

6. Presentation Structure (15 minutes)

Time	Section	Content
0-2 min	Hook	Problem statement with compelling statistic
2-5 min	Background	Data, EDA highlights, key insights
5-8 min	Approach	Feature engineering, model selection
8-11 min	Results	Metrics, visualizations, business impact
11-13 min	Demo	Live demo or video walkthrough
13-15 min	Reflection	Limitations, what you learned

Key Takeaways

Problem first: Choose a problem that matters, not just interesting data
Document everything: Trade-offs, decisions, and failures are valuable
Deploy it: A running application demonstrates end-to-end capability
Tell the story: Your presentation skills matter as much as your code
