Capstone: End-to-End Data Science Project
Your capstone demonstrates mastery across the entire data science lifecycle. This is your portfolio centerpiece.
Capstone Workflow
1. Problem Selection Criteria
| Criterion | Good | Bad |
|---|---|---|
| Impact | Solves real business problem | Academic exercise |
| Data | Available and sufficient | Too small or inaccessible |
| Scope | Completable in 2-4 weeks | Too ambitious |
| Novelty | Fresh angle or approach | Tutorial copy-paste |
| Deployable | Can be served as API/app | Notebook-only |
2. Project Structure
Architecture Diagram
capstone/
├── README.md
├── proposal.md
├── data/
│   ├── raw/
│   ├── processed/
│   └── data_dictionary.md
├── notebooks/
│   ├── 01_EDA.ipynb
│   ├── 02_Feature_Engineering.ipynb
│   ├── 03_Modeling.ipynb
│   └── 04_Evaluation.ipynb
├── src/
│   ├── data/
│   │   ├── collect.py
│   │   └── clean.py
│   ├── features/
│   │   └── build_features.py
│   ├── models/
│   │   ├── train.py
│   │   └── predict.py
│   ├── api/
│   │   └── app.py
│   └── utils.py
├── configs/
│   └── params.yaml
├── tests/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── .github/
    └── workflows/
        └── ci.yml
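The layout above can be created by hand, but a small script keeps it reproducible. The following is a minimal sketch, not a prescribed tool: the `scaffold` function and the `DIRECTORIES` and `FILES` lists are hypothetical names introduced here for illustration. Notebooks are left out on purpose, since an empty `.ipynb` file is not a valid notebook and is better created from Jupyter itself.

```python
from pathlib import Path
from typing import List

# Directories taken from the project layout above.
DIRECTORIES = [
    "data/raw",
    "data/processed",
    "notebooks",
    "src/data",
    "src/features",
    "src/models",
    "src/api",
    "configs",
    "tests",
    ".github/workflows",
]

# Placeholder files taken from the same layout (notebooks omitted).
FILES = [
    "README.md",
    "proposal.md",
    "data/data_dictionary.md",
    "src/data/collect.py",
    "src/data/clean.py",
    "src/features/build_features.py",
    "src/models/train.py",
    "src/models/predict.py",
    "src/api/app.py",
    "src/utils.py",
    "configs/params.yaml",
    "Dockerfile",
    "docker-compose.yml",
    "requirements.txt",
    ".github/workflows/ci.yml",
]


def scaffold(root: Path) -> List[Path]:
    """Create the skeleton under `root` and return the files that were newly created.

    Existing directories and files are left untouched, so the function is safe to re-run.
    """
    for directory in DIRECTORIES:
        (root / directory).mkdir(parents=True, exist_ok=True)

    created = []
    for name in FILES:
        path = root / name
        if not path.exists():
            path.touch()  # empty placeholder, to be filled in later
            created.append(path)
    return created
```

Calling `scaffold(Path("capstone"))` from the parent directory builds the skeleton. A second call returns an empty list because nothing is overwritten.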
3. Timeline
4. Evaluation Template
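Compare candidate models on the same cross-validation folds, then use a statistical test to check whether an apparent improvement is more than fold-to-fold noise.
The Wilcoxon signed-rank test, a nonparametric paired test, is applied to the per-fold scores of two models: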
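# Model comparison framework
from scipy.stats import wilcoxon

# Headline metrics per model
results = {
    "logistic_regression": {"accuracy": 0.82, "f1": 0.79, "auc": 0.87},
    "random_forest": {"accuracy": 0.87, "f1": 0.85, "auc": 0.92},
    "xgboost": {"accuracy": 0.91, "f1": 0.89, "auc": 0.95},
    "lightgbm": {"accuracy": 0.90, "f1": 0.88, "auc": 0.94},
}

# Statistical significance: the test needs paired scores, one per CV fold,
# computed on identical folds for both models.
# PLACEHOLDER values for illustration only; substitute your own fold scores.
xgb_scores = [0.91, 0.89, 0.92, 0.90, 0.93, 0.88, 0.91, 0.92, 0.90, 0.94]
rf_scores = [0.87, 0.86, 0.86, 0.88, 0.85, 0.87, 0.84, 0.87, 0.81, 0.84]

stat, p_value = wilcoxon(xgb_scores, rf_scores)
print(f"p-value: {p_value:.4f}")  # Significant at the 5% level if p < 0.05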
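The test above needs paired per-fold scores for the two models being compared. One way to obtain them is sketched below, assuming scikit-learn is installed. The synthetic dataset, the two stand-in models, and the variable names are all assumptions made for illustration and do not come from the original. The key detail is that both models are scored with the same splitter and the same `random_state`, so fold *i* of one model is directly comparable to fold *i* of the other.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the project dataset (assumption for illustration).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)

# A fixed random_state yields identical folds for every model, so scores are paired.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Stand-in models; swap in the candidates from your own project.
baseline = LogisticRegression(max_iter=1000)
candidate = RandomForestClassifier(n_estimators=100, random_state=0)

baseline_scores = cross_val_score(baseline, X, y, cv=cv, scoring="roc_auc")
candidate_scores = cross_val_score(candidate, X, y, cv=cv, scoring="roc_auc")

differences = candidate_scores - baseline_scores
if np.allclose(differences, 0):
    # The signed-rank test cannot rank all-zero differences;
    # treat identical scores as no evidence of a difference.
    p_value = 1.0
else:
    _, p_value = wilcoxon(candidate_scores, baseline_scores)

print(f"baseline mean AUC:  {baseline_scores.mean():.3f}")
print(f"candidate mean AUC: {candidate_scores.mean():.3f}")
print(f"p-value: {p_value:.4f}")
```

Ten folds are used because the two-sided signed-rank test cannot reach p < 0.05 with only five pairs. Fold scores share training data and are therefore not fully independent, so the p-value is best read as a rough guide rather than an exact guarantee.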
5. Writeup Structure
- Executive Summary (1 paragraph)
- Problem Definition (business context, objectives)
- Data Description (source, size, features, quality)
- Methodology (approach, models, evaluation)
- Results (metrics, visualizations, insights)
- Limitations (what the model cannot do)
- Future Work (improvements, extensions)
- Appendix (code references, hyperparameters)
6. Presentation Structure (15 minutes)
| Time | Section | Content |
|---|---|---|
| 0-2 min | Hook | Problem statement with compelling statistic |
| 2-5 min | Background | Data, EDA highlights, key insights |
| 5-8 min | Approach | Feature engineering, model selection |
| 8-11 min | Results | Metrics, visualizations, business impact |
| 11-13 min | Demo | Live demo or video walkthrough |
| 13-15 min | Reflection | Limitations, what you learned |
Key Takeaways
- Problem first: Choose a problem that matters, not just interesting data
- Document everything: Trade-offs, decisions, and failures are valuable
- Deploy it: A running application demonstrates end-to-end capability
- Tell the story: Your presentation skills matter as much as your code