
Evaluation Metrics: Precision, Recall, F1, AUC-ROC & Confusion Matrix


Meta & Netflix Interview

Choosing the right metric for your business problem

Interview Question

"When would you optimize for precision vs recall? Explain the ROC curve and AUC. How do you evaluate models on imbalanced datasets?"

Difficulty: Medium | Frequently asked at Meta, Netflix, Amazon


Theoretical Foundation

Confusion Matrix

For binary classification:

$$\text{Confusion Matrix} = \begin{bmatrix} TP & FP \\ FN & TN \end{bmatrix}$$

In this layout, rows are the predicted class and columns are the actual class.

  • True Positive (TP): Correctly predicted positive
  • False Positive (FP): Incorrectly predicted positive (Type I error)
  • False Negative (FN): Incorrectly predicted negative (Type II error)
  • True Negative (TN): Correctly predicted negative
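
As a minimal sketch of the four definitions above, the counts can be tallied directly from label lists. The function name `confusion_counts` and the sample labels are illustrative assumptions, with 1 taken as the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # correctly predicted positive
        elif predicted == 1 and actual == 0:
            fp += 1  # incorrectly predicted positive (Type I error)
        elif predicted == 0 and actual == 1:
            fn += 1  # incorrectly predicted negative (Type II error)
        else:
            tn += 1  # correctly predicted negative
    return tp, fp, fn, tn


# Made-up labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 3 1 1 3
```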

Classification Metrics

Accuracy

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$
  • Intuitive but misleading for imbalanced datasets
  • Example: if 99% of samples are negative, always predicting "negative" already reaches 99% accuracy
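
The point about imbalance can be checked with a small sketch. The dataset below (10 positives, 990 negatives) and the helper names are made up for illustration; the "model" simply predicts negative for every sample.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def positives_caught(y_true, y_pred):
    """Number of actual positives that were predicted positive (TP)."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)


# Hypothetical imbalanced dataset: 1% positive, 99% negative.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * len(y_true)  # trivial classifier: always predict negative

print(accuracy(y_true, y_pred))          # 0.99
print(positives_caught(y_true, y_pred))  # 0
```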

Precision (Positive Predictive Value)

$$\text{Precision} = \frac{TP}{TP + FP}$$
  • "Of all predicted positives, how many are actually positive?"
  • Important when FP is costly (spam detection, ad targeting)

Recall (Sensitivity, True Positive Rate)

$$\text{Recall} = \frac{TP}{TP + FN}$$
  • "Of all actual positives, how many did we catch?"
  • Important when FN is costly (disease detection, fraud detection)

F1 Score

$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
  • Harmonic mean of precision and recall
  • Balances precision and recall
  • Range: $[0, 1]$
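
The accuracy, precision, recall, and F1 formulas above can be sketched as one small function over the four counts. Returning 0.0 when a denominator is zero is an assumption made here for convenience (the metric is strictly undefined in that case), and the example counts are made up.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""

    def safe_div(numerator, denominator):
        # Assumption: report 0.0 when the metric is undefined (zero denominator).
        return numerator / denominator if denominator else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Hypothetical counts: 50 actual positives among 1000 samples.
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=940)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts accuracy is 0.970 while recall is only 0.600, which shows how accuracy can hide missed positives.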

Specificity (True Negative Rate)

$$\text{Specificity} = \frac{TN}{TN + FP}$$
  • "Of all actual negatives, how many did we correctly identify?"

When to Optimize Which Metric

| Metric | Optimize When | Example |
| --- | --- | --- |
| Precision | FP is costly | Spam detection (don't mark legitimate emails as spam) |
| Recall | FN is costly | Disease detection (don't miss positive cases) |
| F1 | Both FP and FN are important | General classification |
| Accuracy | Classes are balanced | Balanced datasets |

Key Insight: Precision and recall usually trade off against each other. Raising the decision threshold never increases recall and typically (though not always) increases precision. The optimal threshold depends on the business cost of FP vs FN.
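
A rough sketch of this tradeoff, sweeping a few thresholds over made-up scores. The helper `precision_recall_at`, the data, and the convention of reporting precision as 1.0 when nothing is predicted positive are all assumptions for illustration.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    # Assumed convention: precision is 1.0 when there are no positive predictions.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels and model scores.
y_true = [0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.65, 0.8, 0.9]

for threshold in (0.3, 0.5, 0.7):
    precision, recall = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
# threshold=0.3: precision=0.67, recall=1.00
# threshold=0.5: precision=0.75, recall=0.75
# threshold=0.7: precision=1.00, recall=0.50
```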

ROC Curve and AUC

ROC Curve

Plots True Positive Rate (Recall) vs False Positive Rate at different thresholds:

$$\text{TPR} = \frac{TP}{TP + FN}, \quad \text{FPR} = \frac{FP}{FP + TN}$$

Interpretation:

  • Upper-left corner: Perfect classifier (TPR=1, FPR=0)
  • Diagonal: Random classifier (TPR=FPR)
  • Area under curve (AUC): Probability that classifier ranks a random positive higher than a random negative

AUC (Area Under ROC Curve)

  • AUC = 1.0: Perfect classifier
  • AUC = 0.5: Random classifier
  • AUC < 0.5: Worse than random; inverting the scores gives an AUC of 1 − AUC

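AUC Interpretation: AUC is the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance, with ties counted as one half. It therefore measures ranking quality across all thresholds rather than performance at any single operating point.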
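
The ranking interpretation can be turned into a direct, brute-force computation: compare every positive with every negative and count how often the positive scores higher. This is a sketch for small inputs (it is quadratic in the number of samples); the function name `pairwise_auc` and the data are assumptions.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up labels and model scores.
y_true = [0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.65, 0.8, 0.9]
print(pairwise_auc(y_true, scores))  # 0.8125
```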

Precision-Recall Curve

For imbalanced datasets, the precision-recall curve is often more informative than the ROC curve. It is commonly summarized by average precision:

$$\text{Average Precision (AP)} = \sum_{k} (R_k - R_{k-1}) P_k$$

where $R_k$ and $P_k$ are recall and precision at threshold $k$.
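
A minimal sketch of the AP sum above, treating each sample (in descending score order) as its own threshold. It assumes there are no tied scores; the function name `average_precision` and the data are illustrative.

```python
def average_precision(y_true, scores):
    """Sum of (R_k - R_{k-1}) * P_k over thresholds in descending score order.

    Assumption: scores are distinct, so each sample defines one threshold.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("average precision needs at least one positive sample")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    tp = 0
    ap = 0.0
    prev_recall = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        tp += label
        precision = tp / rank          # P_k: precision among the top `rank` items
        recall = tp / total_pos        # R_k: recall among the top `rank` items
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Made-up labels and model scores.
y_true = [0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.65, 0.8, 0.9]
print(round(average_precision(y_true, scores), 4))  # 0.8542
```

Only steps where recall increases contribute, so AP here is the mean of the precision values at the ranks of the four positives (1, 1, 3/4, 4/6).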

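Common Misconception: "A high ROC-AUC means the classifier is good." On imbalanced datasets a classifier can have high AUC but poor precision, because even a small FPR applied to a very large number of negatives can produce more false positives than true positives. Always check the precision-recall curve for imbalanced problems.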

Confusion Matrix Derivatives

Matthews Correlation Coefficient (MCC)

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
  • Range: $[-1, 1]$
  • 1 = perfect, 0 = random, -1 = inverse
  • Stays informative with imbalanced classes because it uses all four cells of the confusion matrix
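
A small sketch of the MCC formula from raw counts. Returning 0.0 when the denominator is zero is an assumed convention (the coefficient is undefined there), and the function name `mcc` and the example counts are made up.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumption: treat the undefined case (an empty row or column) as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical classifier on imbalanced data (50 positives in 1000 samples).
print(round(mcc(tp=30, fp=10, fn=20, tn=940), 4))  # 0.6556

# Always predicting negative: 99% accuracy, but MCC is 0.
print(mcc(tp=0, fp=0, fn=10, tn=990))  # 0.0
```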

Cohen's Kappa

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is observed agreement and $p_e$ is expected agreement.
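
For the binary case, a sketch of kappa from confusion-matrix counts: observed agreement is the accuracy, and expected agreement is what two independent raters with the same marginal rates would reach by chance. The function name `cohens_kappa` and the counts are illustrative assumptions.

```python
def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa between predictions and true labels (binary case)."""
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n  # observed agreement (accuracy)
    # Expected agreement: chance that both say "positive" plus both say "negative".
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    p_e = p_yes + p_no
    if p_e == 1:
        # Assumption: report 0.0 when kappa is undefined (only one class present).
        return 0.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical counts: accuracy is 0.97, chance agreement is 0.914.
print(round(cohens_kappa(tp=30, fp=10, fn=20, tn=940), 4))  # 0.6512
```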

Regression Metrics

Mean Squared Error (MSE)

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Root Mean Squared Error (RMSE)

$$RMSE = \sqrt{MSE}$$

Mean Absolute Error (MAE)

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

R-Squared (Coefficient of Determination)

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$
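
The four regression formulas above can be sketched in plain Python. The function name `regression_metrics` and the sample values are assumptions; the sketch raises an error when all targets are identical, since R² is undefined in that case.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R-squared for paired targets and predictions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when all targets are identical")
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(r) for r in residuals) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Made-up targets and predictions.
result = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 11.0])
print(result)  # mse=1.5, rmse≈1.2247, mae=1.0, r2=0.7
```

Note how the single residual of size 2 contributes 4 of the 6 units of squared error but only 2 of the 4 units of absolute error, which is why MSE and RMSE are more sensitive to outliers than MAE.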

Code Implementation

Explanation of Code

  1. Confusion Matrix: Shows TP, FP, FN, TN and derived metrics.

  2. Precision vs Recall: Demonstrates the tradeoff at different thresholds.

  3. ROC Curve: Plots TPR vs FPR and finds optimal threshold.

  4. Precision-Recall Curve: Shows performance for imbalanced datasets.

  5. Regression Metrics: Demonstrates MSE, RMSE, MAE, and R².

  6. Metric Selection: Provides guidance on when to use each metric.


Real-World Applications

Meta: Content Moderation

Meta optimizes for:

  • Recall: Catch all harmful content (minimize FN)
  • Precision: Don't remove legitimate content (minimize FP)
  • F1: Balance between the two

Netflix: Recommendation Ranking

Netflix uses:

  • NDCG: Ranking quality (top recommendations matter)
  • AUC-ROC: Classification of relevant vs irrelevant items
  • Coverage: Ensure recommendations span all content

Meta Interview Tip: Be prepared to discuss how business costs affect metric selection. For example, in content moderation, missing harmful content (FN) is often treated as costlier than removing legitimate content (FP), though the balance depends on the content type and policy.


Common Follow-Up Questions

Q1: Why is accuracy misleading for imbalanced datasets?

If 99% of samples are negative, a classifier that predicts "negative" for everything achieves 99% accuracy but catches zero positives. Precision, recall, and F1 are more informative.

Q2: What is the difference between ROC-AUC and PR-AUC?

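ROC-AUC summarizes TPR against FPR. Because FPR is normalized by the (large) number of negatives, it can look good on imbalanced datasets even when most predicted positives are wrong. PR-AUC summarizes precision against recall, both of which focus on the positive class, so it is more informative when the positive class is rare.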

Q3: How do you choose the optimal classification threshold?

It depends on the objective. Common approaches:

  • Youden's J: Maximizes TPR - FPR (balanced)
  • Cost-based: Minimize total cost of FP and FN
  • F1-based: Maximize F1 score
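
The three strategies can be sketched as interchangeable objectives over candidate thresholds. Everything here is illustrative: the data is made up, the helper names are hypothetical, and the 1:5 FP-to-FN cost ratio is an assumed example, not a recommendation.

```python
def counts_at(y_true, scores, threshold):
    """Confusion counts when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted = score >= threshold
        if predicted and label == 1:
            tp += 1
        elif predicted:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def youden_j(tp, fp, fn, tn):
    # TPR - FPR; assumes both classes are present in the data.
    return tp / (tp + fn) - fp / (fp + tn)


def f1(tp, fp, fn, tn):
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def negative_cost(tp, fp, fn, tn, fp_cost=1.0, fn_cost=5.0):
    # Assumed costs: a missed positive is 5x as expensive as a false alarm.
    return -(fp_cost * fp + fn_cost * fn)


def best_threshold(y_true, scores, objective):
    """Pick the candidate threshold (each distinct score) maximizing the objective."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: objective(*counts_at(y_true, scores, t)))


y_true = [0, 0, 1, 0, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
print(best_threshold(y_true, scores, youden_j))       # 0.6
print(best_threshold(y_true, scores, f1))             # 0.6
print(best_threshold(y_true, scores, negative_cost))  # 0.3
```

On this toy data the cost-based rule picks a lower threshold than the other two, because the assumed FN penalty makes it worth accepting more false positives to avoid missing a positive.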

Q4: Can you use accuracy for multi-class problems?

Yes, but accuracy can hide poor performance on rare classes, so also consider:

  • Macro-averaged F1: Treats all classes equally
  • Weighted F1: Accounts for class imbalance
  • Confusion matrix: Shows per-class performance
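
A brief sketch of the difference between macro and weighted F1 on a made-up three-class problem. The helper names and labels are assumptions; a class with no true positives, false positives, or false negatives is given an F1 of 0.0 by convention here.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """One-vs-rest F1 for every label that appears in y_true or y_pred."""
    result = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denominator = 2 * tp + fp + fn
        result[label] = 2 * tp / denominator if denominator else 0.0
    return result


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's number of true samples."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Made-up labels: class "c" is rare and never predicted correctly.
y_true = ["a"] * 6 + ["b"] * 3 + ["c"]
y_pred = ["a"] * 5 + ["b"] + ["b"] * 2 + ["a"] + ["a"]

print(round(macro_f1(y_true, y_pred), 4))     # 0.4786
print(round(weighted_f1(y_true, y_pred), 4))  # 0.6615
```

Accuracy on this data is 0.7, and weighted F1 stays close to it, while macro F1 drops sharply because the missed rare class counts as much as the common ones.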

Company-Specific Tips

Meta Interview Tips

  • Discuss multi-objective optimization (precision, recall, latency)
  • Be ready to explain calibration of probabilities
  • Mention fairness metrics across different groups
  • Talk about online evaluation (A/B testing)

Netflix Interview Tips

  • Focus on ranking metrics (NDCG, MRR)
  • Discuss coverage and diversity metrics
  • Be prepared to explain business-aligned metrics
  • Mention user study design
