Model Evaluation — Metrics, Cross-Validation and Selection

How to Know If Your Model Actually Works — Beyond Accuracy

Choosing the right metric and evaluation strategy is critical. A model with 99% accuracy might be useless if the data is imbalanced.

  • Precision and Recall — When false positives and false negatives matter differently
  • Cross-Validation — Getting reliable performance estimates
  • Bias-Variance Tradeoff — The central challenge in machine learning

"Not everything that counts can be counted, and not everything that can be counted counts."

Model Evaluation — Complete Guide


Classification Metrics

Definition: Confusion Matrix

A $K \times K$ matrix where entry $C_{ij}$ counts the number of samples from true class $i$ predicted as class $j$. For binary classification:

$$\mathbf{C} = \begin{bmatrix} TP & FN \\ FP & TN \end{bmatrix}$$

Rows index the true class and columns the predicted class, with the positive class listed first.

Figure: Confusion Matrix and Classification Metrics. Example with rows = actual and columns = predicted (Neg, Pos): TN = 950, FP = 50, FN = 30, TP = 970, so Accuracy = (950 + 970) / 2000 = 96%.

Key Metrics

  • Accuracy = (TP + TN) / Total: correct predictions out of all predictions. Misleading for imbalanced data.
  • Precision = TP / (TP + FP): of predicted positives, how many are correct? Matters when FP is costly (spam filter).
  • Recall = TP / (TP + FN): of actual positives, how many did we find? Matters when FN is costly (disease detection).
  • F1 = 2 × Precision × Recall / (Precision + Recall)

F1 Score

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Here,

  • $F_1$ = harmonic mean of precision and recall, $F_1 \in [0, 1]$
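As a worked example, the sketch below computes the four metrics from raw confusion-matrix counts, using the counts from the example above (TN = 950, FP = 50, FN = 30, TP = 970). The function name `classification_metrics` and the choice to return 0.0 when a denominator is zero are assumptions made for illustration, not part of the lesson.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    # Assumption: report 0.0 when a denominator is zero (no predicted/actual positives)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts taken from the confusion-matrix example
metrics = classification_metrics(tp=970, fp=50, fn=30, tn=950)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the script prints an accuracy of 0.9600, precision of 0.9510, recall of 0.9700 and F1 of 0.9604. Because the classes are balanced here, all four metrics land close together.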

When to Use What?

  • Accuracy: Balanced classes, all errors equally costly
  • Precision: When false positives are costly (spam filter — don't lose real emails)
  • Recall: When false negatives are costly (cancer detection — don't miss cases)
  • F1: Imbalanced data, need balance between precision and recall
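To see why accuracy alone can mislead on imbalanced data, consider a hypothetical dataset with 990 negatives and 10 positives, and a "model" that always predicts the majority class. The data and the always-negative predictor are invented for illustration.

```python
# Hypothetical imbalanced dataset: 990 negatives (0) and 10 positives (1)
y_true = [0] * 990 + [1] * 10
# A degenerate "model" that always predicts the majority class
y_pred = [0] * len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)
recall = tp / (tp + fn)

print(f"Accuracy: {accuracy:.2%}")  # 99.00%
print(f"Recall:   {recall:.2%}")    # 0.00%
```

The predictor scores 99% accuracy while finding none of the positive cases, which is the situation recall (and therefore F1) is designed to expose.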

ROC Curve and AUC

Definition: ROC Curve

The Receiver Operating Characteristic curve plots True Positive Rate (Recall) vs False Positive Rate at various classification thresholds. The Area Under the Curve (AUC) summarizes performance: AUC = 0.5 is random, AUC = 1.0 is perfect.

Figure: ROC Curve and AUC. Axes: False Positive Rate (x, 0 to 1) vs True Positive Rate / Recall (y, 0 to 1). Curves shown: Random (AUC = 0.5), AUC = 0.75, AUC = 0.95, Perfect (AUC = 1.0).

AUC Interpretation

  • AUC = 1.0: Perfect. Separates classes perfectly
  • AUC ≥ 0.9: Excellent. Strong discrimination
  • AUC ≥ 0.7: Acceptable. Some discrimination ability
  • AUC ≤ 0.5: Random. No discrimination (useless model)

Threshold Trade-off

  • Low threshold → high recall, low precision
  • High threshold → high precision, low recall
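One way to make AUC concrete is its ranking interpretation: it equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below computes AUC that way and also shows the threshold trade-off. The labels, scores and helper names (`auc_pairwise`, `precision_recall_at`) are hypothetical, chosen only for illustration.

```python
def auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Cost is O(P * N), fine for small examples.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and model scores
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.8, 0.9, 0.95]

auc = auc_pairwise(y_true, scores)
low = precision_recall_at(y_true, scores, threshold=0.3)
high = precision_recall_at(y_true, scores, threshold=0.65)

print(f"AUC = {auc:.2f}")                                    # 0.96
print(f"threshold 0.30: precision={low[0]:.3f}, recall={low[1]:.3f}")
print(f"threshold 0.65: precision={high[0]:.3f}, recall={high[1]:.3f}")
```

Here 24 of the 25 positive/negative pairs are ordered correctly, giving AUC = 0.96. The low threshold yields precision 0.625 with recall 1.0, while the high threshold yields precision 1.0 with recall 0.8.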

Cross-Validation

Definition: K-Fold Cross-Validation

Split data into $K$ folds. Train on $K-1$ folds, test on the remaining fold, and rotate $K$ times. The final score is the mean across all folds:

$$\hat{\text{Score}} = \frac{1}{K}\sum_{k=1}^{K} \text{Score}_k$$

Variance estimate: $\text{Var} = \frac{1}{K}\sum_{k=1}^{K}(\text{Score}_k - \hat{\text{Score}})^2$

Figure: 5-Fold Cross-Validation. Each fold takes one turn as the test set while the other four are used for training, giving Score₁ through Score₅. CV Score = (Score₁ + Score₂ + Score₃ + Score₄ + Score₅) / 5.

Why Cross-Validation?

  1. Every sample is used for both training and testing
  2. Reduces variance of the performance estimate
  3. Detects overfitting (large train/test gap)
  4. Better model selection and comparison
  5. K=5 or K=10 is standard (bias-variance tradeoff)
  6. Leave-One-Out: K=N, for very small datasets

More folds = less bias, more variance (and more compute)
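The rotation described above can be sketched by hand using only the standard library. The toy 1-D threshold classifier, the synthetic data and all helper names are assumptions for illustration; the fold mean and variance follow the $\frac{1}{K}$ formulas in the definition.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices once, then deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_threshold(xs, ys):
    """Toy 1-D classifier: threshold halfway between the two class means."""
    mean0 = mean(x for x, y in zip(xs, ys) if y == 0)
    mean1 = mean(x for x, y in zip(xs, ys) if y == 1)
    return (mean0 + mean1) / 2


def accuracy(threshold, xs, ys):
    """Fraction of samples where (x > threshold) matches the label."""
    return mean(1.0 if (x > threshold) == (y == 1) else 0.0
                for x, y in zip(xs, ys))


# Hypothetical data: two overlapping Gaussian classes
rng = random.Random(42)
ys = [0] * 50 + [1] * 50
xs = [rng.gauss(0.0, 1.0) if y == 0 else rng.gauss(2.0, 1.0) for y in ys]

K = 5
folds = k_fold_indices(len(xs), K)
scores = []
for k in range(K):
    test_idx = folds[k]
    # Train on every fold except fold k
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    threshold = fit_threshold([xs[i] for i in train_idx],
                              [ys[i] for i in train_idx])
    scores.append(accuracy(threshold,
                           [xs[i] for i in test_idx],
                           [ys[i] for i in test_idx]))

cv_score = sum(scores) / K                               # mean over folds
cv_var = sum((s - cv_score) ** 2 for s in scores) / K    # 1/K variance estimate
print(f"CV accuracy: {cv_score:.3f}, variance: {cv_var:.5f}")
```

Each sample lands in exactly one test fold, so every observation is used for testing once and for training K-1 times.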

Python Implementation

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold CV; for a classifier, an integer cv uses unshuffled StratifiedKFold
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

# Explicit Stratified K-Fold (preserves class proportions), with shuffling
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
stratified_scores = cross_val_score(model, X, y, cv=skf, scoring='f1_macro')
print(f"Stratified CV F1: {stratified_scores.mean():.3f}")

Bias-Variance Tradeoff

Theorem: Bias-Variance Decomposition

For a model $\hat{f}$ trained on dataset $\mathcal{D}$, the expected prediction error at point $x$ decomposes as:

$$\mathbb{E}_{\mathcal{D}}[(y - \hat{f}(x))^2] = \underbrace{(f(x) - \mathbb{E}[\hat{f}(x)])^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Error}}$$

where $f(x)$ is the true function, $y = f(x) + \varepsilon$ with zero-mean noise $\varepsilon$, and $\sigma^2$ is the variance of that noise. The expectation is taken over training sets $\mathcal{D}$ and over the noise.
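The decomposition can be checked numerically with a small simulation. The sketch below assumes a hypothetical true function $f(x) = x^2$, noise $\sigma = 0.5$ and query point $x = 0.9$, and compares a very simple model (predict the mean target) with a very flexible one (1-nearest neighbour). All of these choices are illustrative assumptions. The error here is measured against $f(x)$ rather than the noisy $y$, so it equals Bias² + Variance; adding $\sigma^2$ would give the expected error against $y$.

```python
import random


def true_f(x):
    return x * x  # hypothetical ground-truth function


def sample_dataset(rng, n=20, sigma=0.5):
    """Draw one training set: x uniform on [0, 1], y = f(x) + Gaussian noise."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


def predict_mean(xs, ys, x0):
    """Simple model: ignore x0 and predict the average training target."""
    return sum(ys) / len(ys)


def predict_1nn(xs, ys, x0):
    """Flexible model: copy the target of the nearest training point."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
    return ys[nearest]


def decompose(predict, x0=0.9, n_datasets=2000, seed=0):
    """Estimate bias^2, variance and squared error vs f(x0) over many datasets."""
    rng = random.Random(seed)
    preds = [predict(*sample_dataset(rng), x0) for _ in range(n_datasets)]
    avg = sum(preds) / len(preds)
    bias_sq = (true_f(x0) - avg) ** 2
    variance = sum((p - avg) ** 2 for p in preds) / len(preds)
    mse_vs_f = sum((p - true_f(x0)) ** 2 for p in preds) / len(preds)
    return bias_sq, variance, mse_vs_f


for name, model in [("mean model", predict_mean), ("1-NN model", predict_1nn)]:
    bias_sq, variance, mse = decompose(model)
    print(f"{name}: bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"bias^2+variance={bias_sq + variance:.4f}  mse_vs_f={mse:.4f}")
```

In this setup the mean model shows the larger bias and the 1-NN model the larger variance, mirroring the underfitting and overfitting ends of the tradeoff.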

Figure: Bias-Variance Tradeoff: The Central Challenge. Axes: model complexity (x) vs error (y). Curves: training error, test error (generalization), Bias², Variance. Left region: underfitting (high bias, low variance). Right region: overfitting (low bias, high variance). The sweet spot marks the optimal complexity.

Diagnosing Bias vs Variance

  • High bias: Both train and test error are high → model is too simple → add features, use more complex model
  • High variance: Train error low, test error high → model is too complex → more data, regularization, simpler model
  • Learning curves: Plot train/test error vs training size — gap indicates variance, both high indicates bias
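The rules above can be turned into a rough diagnostic helper. This is only a sketch: the function name `diagnose` and the thresholds `high_error` and `large_gap` are hypothetical, and sensible values depend on the task and on the lowest error that is realistically achievable.

```python
def diagnose(train_error, test_error, high_error=0.15, large_gap=0.05):
    """Rule-of-thumb diagnosis from train and test error.

    `high_error` and `large_gap` are hypothetical defaults, not standard values.
    """
    gap = test_error - train_error
    if train_error >= high_error:
        # Model cannot even fit the training data well
        return "high bias (underfitting)"
    if gap >= large_gap:
        # Fits the training data but does not generalize
        return "high variance (overfitting)"
    return "reasonable fit"


print(diagnose(train_error=0.30, test_error=0.32))  # high bias (underfitting)
print(diagnose(train_error=0.02, test_error=0.20))  # high variance (overfitting)
print(diagnose(train_error=0.05, test_error=0.07))  # reasonable fit
```

In practice a learning curve gives a fuller picture than a single pair of numbers, since it shows how the gap changes as training size grows.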

Regression Metrics Comparison

Regression Metrics Summary

| Metric | Formula | Robust to Outliers? | In Units of y? |
|---|---|---|---|
| MSE | $\frac{1}{N}\sum(y_i - \hat{y}_i)^2$ | No | No (y²) |
| RMSE | $\sqrt{\text{MSE}}$ | No | Yes |
| MAE | $\frac{1}{N}\sum \lvert y_i - \hat{y}_i \rvert$ | Yes | Yes |
| R² | $1 - SS_{res}/SS_{tot}$ | No | No (dimensionless) |
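The formulas in the table can be computed directly, and a single large error makes the robustness column visible. The target and prediction values below are made up for illustration, and `regression_metrics` is a hypothetical helper name.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R^2 from paired targets and predictions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    y_mean = sum(y_true) / n
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(r) for r in residuals) / n,
        "r2": 1 - ss_res / ss_tot,
    }


y_true = [3.0, 5.0, 7.0, 9.0]

# Small errors everywhere
clean = regression_metrics(y_true, [2.0, 5.0, 8.0, 9.0])
# Same predictions, except the last one is off by 10 (an outlier error)
outlier = regression_metrics(y_true, [2.0, 5.0, 8.0, 19.0])

for label, m in [("clean", clean), ("outlier", outlier)]:
    print(label, {k: round(v, 4) for k, v in m.items()})
```

The single bad prediction raises MSE from 0.5 to 25.5 (51 times larger) but MAE only from 0.5 to 3.0 (6 times larger), which is why MAE is listed as the more outlier-robust metric.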

Key Takeaways

Summary: Model Evaluation

  1. Accuracy is misleading for imbalanced datasets — use F1 or AUC-ROC
  2. Precision when FP costly, Recall when FN costly, F1 for balance
  3. AUC-ROC is threshold-independent: 0.5 = random, 1.0 = perfect
  4. Always use cross-validation ($K=5$ or $10$) for reliable performance estimates
  5. Bias-variance tradeoff: Error = Bias² + Variance + σ²
  6. Underfitting: high bias, both errors high → increase model complexity
  7. Overfitting: high variance, gap between train/test → regularization, more data
  8. Choose metrics that match your business objective — no single metric fits all
  9. Stratified K-Fold preserves class proportions in each fold
  10. No Free Lunch — no single model works best for all problems

What to Learn Next

  • Regularization: Prevent overfitting with Ridge, Lasso, and Elastic Net.
  • Model Selection: Hyperparameter tuning, grid search, and choosing the best model.
  • Ensemble Methods: Bagging, boosting, and stacking for stronger models.
