How to Know If Your Model Actually Works — Beyond Accuracy
Choosing the right metric and evaluation strategy is critical. A model with 99% accuracy might be useless if the data is imbalanced.
- Precision and Recall — When false positives and false negatives matter differently
- Cross-Validation — Getting reliable performance estimates
- Bias-Variance Tradeoff — The central challenge in machine learning
"Not everything that counts can be counted, and not everything that can be counted counts."
Model Evaluation — Complete Guide
Classification Metrics
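Definition: Confusion Matrix
A matrix $C$ where entry $C_{ij}$ counts the number of samples from true class $i$ predicted as class $j$. For binary classification, with class 0 as negative and class 1 as positive:
$$C = \begin{pmatrix} C_{00} & C_{01} \\ C_{10} & C_{11} \end{pmatrix} = \begin{pmatrix} TN & FP \\ FN & TP \end{pmatrix}$$
where $C_{ij}$ = count of samples truly in class $i$ predicted as class $j$.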
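As a small illustration of the definition, the sketch below builds a binary confusion matrix with scikit-learn's `confusion_matrix`, which puts true classes on rows and predicted classes on columns. The label arrays are made up for the example and are not from any real dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration (1 = positive, 0 = negative).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# For the binary case, ravel() yields the four counts in row-major order.
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Passing `labels` explicitly fixes the row and column order, which avoids surprises when a class happens to be missing from a small sample.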
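F1 Score
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Here,
- $F_1$ = harmonic mean of precision and recall, $F_1 \in [0, 1]$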
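To make the harmonic mean concrete, here is a minimal sketch that computes precision, recall, and F1 directly from confusion-matrix counts. The helper name `precision_recall_f1` and the counts are hypothetical, and returning 0.0 for an undefined ratio is a convention chosen for this example.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    # Guard against empty denominators; 0.0 is an assumed convention here.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives.
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, F1 stays low whenever either precision or recall is low.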
When to Use What?
- Accuracy: Balanced classes, all errors equally costly
- Precision: When false positives are costly (spam filter — don't lose real emails)
- Recall: When false negatives are costly (cancer detection — don't miss cases)
- F1: Imbalanced data, need balance between precision and recall
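The guidance above can be checked on a toy case. The sketch below scores a trivial predictor that always outputs the majority class on a made-up dataset with 1% positives; the data are synthetic and chosen only to show how accuracy and recall can disagree.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical imbalanced labels: 990 negatives, 10 positives.
y_true = np.array([0] * 990 + [1] * 10)

# A trivial "model" that always predicts the majority (negative) class.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)
print(f"accuracy={acc:.2f} recall={rec:.2f} F1={f1:.2f}")
```

The predictor reaches 99% accuracy while finding none of the positive cases, which is why recall and F1 are the more informative numbers here.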
ROC Curve and AUC
Definition: ROC Curve
The Receiver Operating Characteristic curve plots the True Positive Rate (Recall) against the False Positive Rate at various classification thresholds. The Area Under the Curve (AUC) summarizes performance across all thresholds: AUC = 0.5 corresponds to random guessing, and AUC = 1.0 corresponds to a perfect ranking of positives above negatives.
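As a rough sketch of how the curve and its area are obtained in practice, the example below fits a logistic regression on synthetic data and passes its predicted probabilities to scikit-learn's `roc_curve` and `roc_auc_score`. The dataset, the choice of model, and the 70/30 split are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data and a held-out test set (assumed setup for the example).
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ROC needs continuous scores, not hard labels: use P(class = 1).
scores = clf.predict_proba(X_test)[:, 1]

# One (FPR, TPR) point per threshold; AUC is the area under those points.
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print(f"AUC: {auc:.3f} over {len(thresholds)} thresholds")
```

Passing hard 0/1 predictions instead of scores would collapse the curve to a single operating point, so the probabilities are the important input.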
Cross-Validation
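Definition: K-Fold Cross-Validation
Split the data into $k$ folds. Train on $k-1$ folds, test on the remaining fold, and rotate $k$ times. The final score is the mean across all folds:
$$\bar{s} = \frac{1}{k} \sum_{i=1}^{k} s_i$$
where $s_i$ is the score on fold $i$.
Variance estimate:
$$\hat{\sigma}^2 = \frac{1}{k-1} \sum_{i=1}^{k} (s_i - \bar{s})^2$$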
Python Implementation
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold CV (for classifiers, an integer cv already uses unshuffled stratified folds)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

# Stratified K-Fold with shuffling (preserves class proportions in each fold)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
stratified_scores = cross_val_score(model, X, y, cv=skf, scoring='f1_macro')
print(f"Stratified CV F1: {stratified_scores.mean():.3f}")
```
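To make the fold rotation explicit, the sketch below writes out the loop that `cross_val_score` performs internally, then computes the mean and the sample variance of the fold scores. The logistic regression model and the smaller synthetic dataset are assumptions chosen to keep the example fast; `var(ddof=1)` divides by $k-1$, whereas NumPy's default `std()` and `var()` divide by $k$.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for train_idx, test_idx in kf.split(X):
    # Fit a fresh model on k-1 folds, then score it on the held-out fold.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

fold_scores = np.array(fold_scores)
mean_score = fold_scores.mean()
var_score = fold_scores.var(ddof=1)  # sample variance across folds
print(f"mean={mean_score:.3f} variance={var_score:.5f}")
```

Fold scores share most of their training data, so this variance is only a rough indication of how much the estimate would change on new data.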
Bias-Variance Tradeoff
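Theorem: Bias-Variance Decomposition
For a model $\hat{f}_D$ trained on dataset $D$, the expected prediction error at point $x$ decomposes as:
$$\mathbb{E}\big[(y - \hat{f}_D(x))^2\big] = \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_D\big[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\big]}_{\text{Variance}} + \sigma^2$$
where $y = f(x) + \varepsilon$, $f$ is the true function, and $\varepsilon$ is noise with mean zero and variance $\sigma^2$.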
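The decomposition can be checked numerically. The simulation below is a sketch under stated assumptions: a sine curve as the true function, Gaussian noise with standard deviation 0.3, a fixed grid of 20 training inputs (so only the noise varies between datasets), and polynomial fits of degree 1, 3, and 9. It estimates bias², variance, and the expected squared error at a single point and compares the error with the sum of the three terms.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    """Assumed ground-truth function for the simulation."""
    return np.sin(2 * np.pi * x)

sigma = 0.3                        # assumed noise standard deviation
x0 = 0.25                          # point at which the error is decomposed
x_train = np.linspace(0, 1, 20)    # fixed design; only the noise is resampled
n_repeats = 2000                   # number of simulated training sets

def decompose(degree):
    """Estimate bias^2, variance, and expected squared error at x0."""
    preds = np.empty(n_repeats)
    for r in range(n_repeats):
        y_train = true_f(x_train) + rng.normal(0, sigma, x_train.size)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[r] = np.polyval(coefs, x0)
    bias_sq = (preds.mean() - true_f(x0)) ** 2
    variance = preds.var()
    # Fresh noisy targets at x0 give a direct estimate of the expected error.
    y0 = true_f(x0) + rng.normal(0, sigma, n_repeats)
    mse = np.mean((y0 - preds) ** 2)
    return bias_sq, variance, mse

results = {d: decompose(d) for d in (1, 3, 9)}
for d, (b, v, mse) in results.items():
    print(f"degree={d}: bias^2={b:.4f} variance={v:.4f} "
          f"sum+noise={b + v + sigma**2:.4f} measured error={mse:.4f}")
```

In this setup the straight line carries most of its error as bias, while the degree-9 polynomial trades that bias for higher variance.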
Diagnosing Bias vs Variance
- High bias: Both train and test error are high → model is too simple → add features, use more complex model
- High variance: Train error low, test error high → model is too complex → more data, regularization, simpler model
- Learning curves: Plot train and test error against training set size. A persistent gap between the two curves indicates variance; both curves converging at a high error indicates bias
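The diagnosis above can be sketched with scikit-learn's `learning_curve`, which refits a model on growing subsets of the data and returns train and validation scores per size. The unpruned decision tree is an assumption, picked because it is a deliberately high-variance model, and the synthetic dataset is only illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# An unpruned tree memorizes its training set, so it should show a clear gap.
train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=42),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Average over the CV folds and convert accuracy to error.
train_err = 1 - train_scores.mean(axis=1)
val_err = 1 - val_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, train_err, val_err):
    print(f"n={n:4d} train_error={tr:.3f} val_error={va:.3f} gap={va - tr:.3f}")
```

A training error near zero next to a clearly higher validation error is the high-variance pattern; a high-bias model would instead show both errors high and close together.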
Regression Metrics Comparison
Regression Metrics Summary
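| Metric | Formula | Robust to Outliers? | In Units of y? |
|---|---|---|---|
| MSE | $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ | No | No (y²) |
| RMSE | $\sqrt{\text{MSE}}$ | No | Yes |
| MAE | $\frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert$ | Yes | Yes |
| R² | $1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$ | No | No (dimensionless) |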
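To illustrate the outlier column, the sketch below computes the four metrics with NumPy and then repeats the calculation after one prediction is made badly wrong. The helper name `regression_metrics` and the numbers are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R^2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(residuals)),
        "R2": 1 - ss_res / ss_tot,
    }

# Hypothetical targets and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
clean = regression_metrics(y_true, [2.0, 5.0, 8.0, 9.0])

# Same predictions, except the last one is now off by 10.
outlier = regression_metrics(y_true, [2.0, 5.0, 8.0, 19.0])

for name in ("MSE", "RMSE", "MAE", "R2"):
    print(f"{name}: clean={clean[name]:.3f} with_outlier={outlier[name]:.3f}")
```

The single large error multiplies MSE by 51 but MAE by only 6, because squaring the residual magnifies large errors far more than taking its absolute value.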
Key Takeaways
Summary: Model Evaluation
- Accuracy is misleading for imbalanced datasets — use F1 or AUC-ROC
- Precision when FP costly, Recall when FN costly, F1 for balance
- AUC-ROC is threshold-independent: 0.5 = random, 1.0 = perfect
- Always use cross-validation ($k = 5$ or $k = 10$) for reliable performance estimates
- Bias-variance tradeoff: Error = Bias² + Variance + σ²
- Underfitting: high bias, both errors high → increase model complexity
- Overfitting: high variance, gap between train/test → regularization, more data
- Choose metrics that match your business objective — no single metric fits all
- Stratified K-Fold preserves class proportions in each fold
- No Free Lunch — no single model works best for all problems
What to Learn Next
-> Regularization: Prevent overfitting with Ridge, Lasso, and Elastic Net.
-> Model Selection: Hyperparameter tuning, grid search, and choosing the best model.
-> Ensemble Methods: Bagging, boosting, and stacking for stronger models.