How to Know If Your Model Actually Works — Beyond Accuracy

Choosing the right metric and evaluation strategy is critical. A model with 99% accuracy might be useless if the data is imbalanced.

Precision and Recall — When false positives and false negatives matter differently
Cross-Validation — Getting reliable performance estimates
Bias-Variance Tradeoff — The central challenge in machine learning

"Not everything that counts can be counted, and not everything that can be counted counts."

Model Evaluation — Complete Guide

Classification Metrics

Definition: Confusion Matrix

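A $K \times K$ matrix where entry $C_{ij}$ counts the number of samples from true class $i$ predicted as class $j$. For binary classification, with the positive class listed first:

$$\mathbf{C} = \begin{bmatrix} TP & FN \\ FP & TN \end{bmatrix}$$

Rows are true classes and columns are predicted classes, so $TP$ and $TN$ count correct predictions, $FN$ counts positives predicted as negative, and $FP$ counts negatives predicted as positive.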
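As a minimal sketch of the binary layout above, the following plain-Python function counts the four cells by hand. The function name `confusion_matrix_binary` and the toy labels are illustrative assumptions, not part of any library.

```python
def confusion_matrix_binary(y_true, y_pred, positive=1):
    """Return [[TP, FN], [FP, TN]]: rows = true class, columns = predicted class."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1          # true positive
        elif t == positive:
            fn += 1          # positive sample predicted as negative
        elif p == positive:
            fp += 1          # negative sample predicted as positive
        else:
            tn += 1          # true negative
    return [[tp, fn], [fp, tn]]


# Toy labels (hypothetical data for illustration)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

C = confusion_matrix_binary(y_true, y_pred)
print(C)  # [[3, 1], [1, 3]]
```

Note that scikit-learn's `confusion_matrix` orders classes by sorted label, so for 0/1 labels its default output is `[[TN, FP], [FN, TP]]`. Passing `labels=[1, 0]` should give the positive-first layout used here.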

F1 Score

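$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Here,

$\text{Precision} = \frac{TP}{TP + FP}$ = fraction of predicted positives that are truly positive
$\text{Recall} = \frac{TP}{TP + FN}$ = fraction of actual positives that the model finds
$F_1$ = harmonic mean of precision and recall, $F_1 \in [0, 1]$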
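A short worked example may make the harmonic mean concrete. The sketch below computes precision, recall, and $F_1$ from raw counts. The helper name `precision_recall_f1`, the counts, and the convention of returning 0.0 when a denominator is zero are all assumptions made for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives
p, r, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
# precision=0.750 recall=0.600 F1=0.667
```

The arithmetic mean of 0.75 and 0.60 would be 0.675. The harmonic mean is slightly lower because it is pulled toward the smaller of the two values.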

When to Use What?

Accuracy: Balanced classes, all errors equally costly
Precision: When false positives are costly (spam filter — don't lose real emails)
Recall: When false negatives are costly (cancer detection — don't miss cases)
F1: Imbalanced data, need balance between precision and recall
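The guidance above can be illustrated with the imbalanced case: a classifier that always predicts the majority class. This is a sketch on made-up labels (1% positives). The `zero_division=0` argument assumes scikit-learn 0.22 or later.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical imbalanced labels: 10 positives, 990 negatives
y_true = np.array([1] * 10 + [0] * 990)

# A "model" that always predicts the negative (majority) class
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, zero_division=0)  # no predicted positives
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} F1={f1:.3f}")
```

Accuracy comes out at 0.99 even though the model never finds a single positive case, which is why recall and F1 are both 0 here.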

ROC Curve and AUC

Definition: ROC Curve

The Receiver Operating Characteristic curve plots the True Positive Rate (Recall, $TP/(TP+FN)$) against the False Positive Rate ($FP/(FP+TN)$) at various classification thresholds. The Area Under the Curve (AUC) summarizes performance across all thresholds: AUC = 0.5 corresponds to random ranking, AUC = 1.0 to perfect ranking.
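As a small sketch of the definition above, the snippet below uses scikit-learn's `roc_curve` and `roc_auc_score` on four hand-picked scores. The labels and scores are toy values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy example: two negatives, two positives, with predicted scores
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold yields one (FPR, TPR) point on the curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")

auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.2f}")  # AUC = 0.75
```

One way to read the 0.75: of the four (positive, negative) pairs, three are ranked correctly. Only the positive scored 0.35 falls below the negative scored 0.4.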

Cross-Validation

Definition: K-Fold Cross-Validation

Split data into $K$ folds. Train on $K-1$ folds, test on the remaining fold, rotate $K$ times. Final score is the mean across all folds:

$$\hat{\text{Score}} = \frac{1}{K}\sum_{k=1}^{K} \text{Score}_k$$

Variance estimate: $\text{Var} = \frac{1}{K}\sum_{k=1}^{K}(\text{Score}_k - \hat{\text{Score}})^2$

Python Implementation

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold CV; for a classifier, an integer cv uses unshuffled StratifiedKFold
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

# Explicit Stratified K-Fold with shuffling (preserves class proportions)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
stratified_scores = cross_val_score(model, X, y, cv=skf, scoring='f1_macro')
print(f"Stratified CV F1: {stratified_scores.mean():.3f} ± {stratified_scores.std():.3f}")
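To connect the code to the mean and variance formulas, here is a sketch that runs the fold loop by hand instead of calling `cross_val_score`. The choice of `LogisticRegression`, the sample size, and the variable names are assumptions made to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for train_idx, test_idx in skf.split(X, y):
    # Train on K-1 folds, score on the held-out fold
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])
    fold_scores.append(clf.score(X[test_idx], y[test_idx]))

fold_scores = np.array(fold_scores)
mean_score = fold_scores.mean()                        # (1/K) * sum of Score_k
var_score = ((fold_scores - mean_score) ** 2).mean()   # (1/K) * sum of squared deviations

print(f"fold scores: {np.round(fold_scores, 3)}")
print(f"mean = {mean_score:.3f}, variance = {var_score:.5f}")
```

The variance here divides by $K$, matching the formula above and NumPy's default `std()`/`var()` (`ddof=0`). Dividing by $K-1$ instead would give the unbiased sample variance.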

Bias-Variance Tradeoff

Theorem: Bias-Variance Decomposition

For a model $\hat{f}$ trained on dataset $\mathcal{D}$, the expected squared prediction error at point $x$ decomposes as:

$$\mathbb{E}_{\mathcal{D}}[(y - \hat{f}(x))^2] = \underbrace{(f(x) - \mathbb{E}[\hat{f}(x)])^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Error}}$$

where $f(x)$ is the true function, $y = f(x) + \varepsilon$ is the noisy observation, $\sigma^2$ is the variance of the noise $\varepsilon$, and the expectations are taken over training sets $\mathcal{D}$ and the noise.
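The decomposition can be checked numerically. The sketch below is a Monte Carlo estimate under assumed settings: a sine-shaped true function, Gaussian noise, polynomial models fitted with `np.polyfit`, and a single evaluation point. None of these choices come from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed true function f(x)
    return np.sin(2 * np.pi * x)

sigma = 0.3                    # assumed noise standard deviation
x0 = 0.25                      # assumed evaluation point x
n_train, n_repeats = 30, 2000  # training-set size and number of simulated datasets

def simulate(degree):
    """Estimate bias^2, variance, and expected squared error at x0."""
    preds = np.empty(n_repeats)
    for r in range(n_repeats):
        # Draw a fresh training set D and fit a polynomial of the given degree
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, sigma, n_train)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x0)
    bias_sq = (true_f(x0) - preds.mean()) ** 2
    variance = preds.var()
    # Independent noisy observations y at x0, to estimate E[(y - f_hat(x0))^2]
    y0 = true_f(x0) + rng.normal(0, sigma, n_repeats)
    mse = np.mean((y0 - preds) ** 2)
    return bias_sq, variance, mse

for degree in (1, 3, 5):
    b, v, m = simulate(degree)
    print(f"degree={degree}: bias^2={b:.4f}  variance={v:.4f}  "
          f"bias^2+var+sigma^2={b + v + sigma**2:.4f}  simulated error={m:.4f}")
```

Under these assumptions, the last two columns should agree up to Monte Carlo noise. The straight-line fit (degree 1) should show a much larger squared bias than the cubic.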

Diagnosing Bias vs Variance

High bias: Both train and test error are high → model is too simple → add features, use more complex model
High variance: Train error low, test error high → model is too complex → more data, regularization, simpler model
Learning curves: Plot train/test error vs training size — gap indicates variance, both high indicates bias
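The learning-curve diagnostic above can be sketched with scikit-learn's `learning_curve`. The two decision trees are assumed stand-ins for a too-simple model and a too-flexible one, and the dictionary keys `"stump"` and `"unpruned"` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

models = {
    "stump": DecisionTreeClassifier(max_depth=1, random_state=0),  # likely high bias
    "unpruned": DecisionTreeClassifier(random_state=0),            # likely high variance
}

results = {}
for name, model in models.items():
    sizes, train_scores, val_scores = learning_curve(
        model, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy"
    )
    # Convert accuracy to error and average over the CV folds
    train_err = 1 - train_scores.mean(axis=1)
    val_err = 1 - val_scores.mean(axis=1)
    results[name] = (sizes, train_err, val_err)
    for n, tr, va in zip(sizes, train_err, val_err):
        print(f"{name:9s} n={n:4d}  train error={tr:.3f}  validation error={va:.3f}")
```

A persistent gap between training and validation error (as expected for the unpruned tree, which typically fits its training data exactly) points to variance. Two curves that sit close together at a high error point to bias.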

Regression Metrics Comparison

Regression Metrics Summary

Metric	Formula	Robust to Outliers?	In Units of y?
MSE	$\frac{1}{N}\sum(y_i - \hat{y}_i)^2$	No	No (y²)
RMSE	$\sqrt{\text{MSE}}$	No	Yes
MAE	$\frac{1}{N}\sum|y_i - \hat{y}_i|$	Yes	Yes
R²	$1 - SS_{res}/SS_{tot}$	No	No (dimensionless)
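To make the outlier column of the table concrete, here is a sketch that implements the four formulas with NumPy and compares predictions with and without one large error. The function name `regression_metrics` and the data are illustrative assumptions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, and R^2 from the formulas in the table."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(residuals)),
        "R2": 1 - ss_res / ss_tot,
    }

# Hypothetical targets and two sets of predictions
y_true = [3.0, 5.0, 7.0, 9.0]
clean = [2.5, 5.5, 6.5, 9.5]          # every prediction is off by 0.5
with_outlier = [2.5, 5.5, 6.5, 19.0]  # the last prediction is off by 10

m_clean = regression_metrics(y_true, clean)
m_outlier = regression_metrics(y_true, with_outlier)
for name, m in (("clean", m_clean), ("with outlier", m_outlier)):
    print(name, {k: round(float(v), 4) for k, v in m.items()})
```

With these numbers, the single outlier raises MSE from 0.25 to about 25.19 (roughly 100 times) but MAE only from 0.5 to 2.875 (under 6 times). It also drives R² from 0.95 to a negative value.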

Key Takeaways

Summary: Model Evaluation

Accuracy is misleading for imbalanced datasets — use F1 or AUC-ROC
Precision when FP costly, Recall when FN costly, F1 for balance
AUC-ROC is threshold-independent: 0.5 = random, 1.0 = perfect
Always use cross-validation ($K=5$ or $10$) for reliable performance estimates
Bias-variance tradeoff: Error = Bias² + Variance + σ²
Underfitting: high bias, both errors high → increase model complexity
Overfitting: high variance, gap between train/test → regularization, more data
Choose metrics that match your business objective — no single metric fits all
Stratified K-Fold preserves class proportions in each fold
No Free Lunch — no single model works best for all problems

What to Learn Next

-> Regularization: Prevent overfitting with Ridge, Lasso, and Elastic Net.

-> Model Selection: Hyperparameter tuning, grid search, and choosing the best model.

-> Ensemble Methods: Bagging, boosting, and stacking for stronger models.
