ROC Curves and AUC: Model Discrimination
Evaluating How Well Classifiers Separate Classes
ROC curves plot the true positive rate against the false positive rate across all thresholds, while AUC summarizes overall discrimination ability in a single number. These threshold-independent metrics reveal a model's inherent ability to distinguish between classes.
- Medical Screening: Compare diagnostic tests for disease detection accuracy
- Fraud Detection: Evaluate model performance across operating thresholds
- Credit Risk: Assess borrower classification before setting cutoff policies
An AUC of 0.5 means random guessing; an AUC of 1 means perfect separation.
ROC curves and AUC measure how well a classifier distinguishes between classes. They are threshold-independent metrics that evaluate the discrimination ability of a model.
Definition: ROC Curve
A plot of the True Positive Rate (sensitivity) against the False Positive Rate (1 - specificity) at various classification thresholds.
Key Metrics
True Positive Rate (Sensitivity)
$$\text{TPR} = \frac{TP}{TP + FN}$$
Here,
- $TP$ = True positives (correctly predicted positive)
- $FN$ = False negatives (missed positive cases)
False Positive Rate
$$\text{FPR} = \frac{FP}{FP + TN}$$
Here,
- $FP$ = False positives (incorrectly predicted positive)
- $TN$ = True negatives (correctly predicted negative)
Precision
$$\text{Precision} = \frac{TP}{TP + FP}$$
Here,
- $\text{Precision}$ = Proportion of positive predictions that are correct
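The three rates above all come from the same four counts at a single threshold. A minimal sketch, assuming scores at or above the threshold are labeled positive; the helper name `rates_at_threshold` and the toy data are made up for illustration:

```python
import numpy as np

def rates_at_threshold(y_true, y_score, threshold):
    """Return (TPR, FPR, precision) when scores >= threshold are called positive."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_score) >= threshold

    tp = np.sum(y_pred & y_true)    # correctly predicted positive
    fp = np.sum(y_pred & ~y_true)   # incorrectly predicted positive
    fn = np.sum(~y_pred & y_true)   # missed positive cases
    tn = np.sum(~y_pred & ~y_true)  # correctly predicted negative

    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # Precision is undefined when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    return tpr, fpr, precision

# Toy data (illustrative only): 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]

tpr, fpr, precision = rates_at_threshold(y_true, y_score, threshold=0.5)
print(f"TPR={tpr:.2f}  FPR={fpr:.2f}  precision={precision:.2f}")
# prints: TPR=0.75  FPR=0.17  precision=0.75
```

Sweeping `threshold` from high to low and plotting each (FPR, TPR) pair traces out the ROC curve.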
Area Under the Curve (AUC)
AUC
$$\text{AUC} = \int_0^1 \text{TPR}\; d(\text{FPR})$$
Here,
- $\text{AUC}$ = Area under the ROC curve (0 to 1)
| AUC | Interpretation |
|---|---|
| 0.5 | No discrimination (random guessing) |
| 0.5 - 0.7 | Poor discrimination |
| 0.7 - 0.8 | Acceptable discrimination |
| 0.8 - 0.9 | Excellent discrimination |
| > 0.9 | Outstanding discrimination |
Probabilistic Interpretation
AUC = the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case.
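The area definition and the ranking interpretation give the same number, which can be checked numerically. The sketch below uses synthetic scores (an assumption for illustration, not data from this article) and computes AUC three ways: by the trapezoidal rule on the empirical ROC curve, by counting correctly ordered positive-negative pairs, and with scikit-learn's `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
# Synthetic scores (illustrative): positives tend to score higher than negatives.
pos = rng.normal(1.0, 1.0, size=200)
neg = rng.normal(0.0, 1.0, size=300)
y_true = np.r_[np.ones_like(pos), np.zeros_like(neg)]
y_score = np.r_[pos, neg]

# 1) Area under the empirical ROC curve by the trapezoidal rule.
fpr, tpr, _ = roc_curve(y_true, y_score)
auc_trapezoid = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)

# 2) Fraction of (positive, negative) pairs ranked correctly; ties count as 1/2.
diff = pos[:, None] - neg[None, :]
auc_pairwise = np.mean((diff > 0) + 0.5 * (diff == 0))

# 3) Library reference value.
auc_sklearn = roc_auc_score(y_true, y_score)

print(auc_trapezoid, auc_pairwise, auc_sklearn)
```

The pairwise version builds a full positives-by-negatives matrix, so it is only practical for small samples; it is shown here to make the interpretation concrete.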
Threshold Selection
The optimal threshold depends on the cost ratio of false positives vs false negatives.
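Optimal Threshold
$$\left.\frac{d\,\text{TPR}}{d\,\text{FPR}}\right|_{t^*} = \frac{C_{FP}}{C_{FN}} \cdot \frac{P(-)}{P(+)}$$
Here,
- $t^*$ = Threshold at which the expected misclassification cost is minimized (the ROC slope at that point equals the right-hand side)
- $C_{FP}$ = Cost of a false positive
- $C_{FN}$ = Cost of a false negative
- $P(+), P(-)$ = Prior probabilities of the positive and negative class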
Balanced Threshold
The default threshold of 0.5 is optimal only when classes are equally important and equally prevalent. Adjust the threshold based on the specific application costs.
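In practice a cost-sensitive threshold can be found by brute force: evaluate the total misclassification cost at every candidate threshold and keep the cheapest. The following is a minimal sketch under assumed costs; the helper `min_cost_threshold`, the toy data, and the 5:1 cost ratios are hypothetical.

```python
import numpy as np

def min_cost_threshold(y_true, y_score, cost_fp, cost_fn):
    """Scan candidate thresholds and return (threshold, cost) with the lowest total cost."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    # Candidates: every observed score, plus +inf ("predict nothing positive").
    candidates = np.append(np.unique(y_score), np.inf)
    costs = []
    for t in candidates:
        y_pred = y_score >= t
        fp = np.sum(y_pred & ~y_true)
        fn = np.sum(~y_pred & y_true)
        costs.append(cost_fp * fp + cost_fn * fn)
    best = int(np.argmin(costs))
    return candidates[best], costs[best]

# Toy data (illustrative only): 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]

# Assumption: a missed positive costs 5x as much as a false alarm.
t_fn_heavy, _ = min_cost_threshold(y_true, y_score, cost_fp=1, cost_fn=5)
# Assumption: a false alarm costs 5x as much as a missed positive.
t_fp_heavy, _ = min_cost_threshold(y_true, y_score, cost_fp=5, cost_fn=1)

print(t_fn_heavy, t_fp_heavy)  # prints: 0.3 0.8
```

Expensive false negatives push the threshold down (flag more cases); expensive false positives push it up. Thresholds chosen this way should be validated on held-out data, since the scan is fit to the sample it sees.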
Confusion Matrix
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
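F1 Score
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Here,
- $\text{Recall}$ = Same as TPR
- $F_1$ = Harmonic mean of precision and recall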
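Because F1 is defined at a single threshold, one common use is to scan thresholds and pick the one with the highest F1. A short sketch using scikit-learn's `precision_recall_curve` on made-up toy data:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

# Toy data (illustrative only): 4 positives, 6 negatives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
# The final (precision=1, recall=0) point has no threshold, so drop it.
p, r = precision[:-1], recall[:-1]
# Guard against 0/0 when both precision and recall are zero.
f1 = 2 * p * r / np.maximum(p + r, 1e-12)

best = int(np.argmax(f1))
best_threshold = thresholds[best]
print(f"best threshold={best_threshold:.2f}, F1={f1[best]:.2f}")
# prints: best threshold=0.60, F1=0.75

# Cross-check against f1_score on the hard predictions at that threshold.
f1_check = f1_score(y_true, (y_score >= best_threshold).astype(int))
```

F1 ignores true negatives and weights precision and recall equally, so the F1-maximizing threshold need not match a threshold chosen from explicit costs.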
Precision-Recall Curve
When classes are imbalanced, the PR curve may be more informative than ROC.
ROC with Imbalanced Classes
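With severe class imbalance (e.g., 99% negatives), ROC can be optimistically misleading: FPR = FP / (FP + TN) has all negatives in its denominator, so even a large number of false positives barely moves it. The PR curve focuses on the minority class, because precision compares false positives with true positives instead.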
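The effect can be illustrated by holding the score distributions fixed and changing only the class ratio. This sketch uses synthetic normal scores (an assumption for illustration) and compares ROC AUC with average precision, a common summary of the PR curve:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
pos = rng.normal(1.5, 1.0, size=100)    # scores of positive cases
neg = rng.normal(0.0, 1.0, size=9900)   # scores of negative cases

def summarize(pos_scores, neg_scores):
    """Return (ROC AUC, average precision) for the pooled sample."""
    y = np.r_[np.ones(len(pos_scores)), np.zeros(len(neg_scores))]
    s = np.r_[pos_scores, neg_scores]
    return roc_auc_score(y, s), average_precision_score(y, s)

roc_bal, ap_bal = summarize(pos, neg[:100])  # 50% positives
roc_imb, ap_imb = summarize(pos, neg)        # 1% positives

print(f"balanced:   ROC AUC={roc_bal:.3f}  AP={ap_bal:.3f}")
print(f"imbalanced: ROC AUC={roc_imb:.3f}  AP={ap_imb:.3f}")
```

ROC AUC should stay roughly the same in both settings because it depends only on how positives rank against negatives, while average precision drops sharply once negatives outnumber positives 99 to 1.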
Multi-Class Extensions
One-vs-Rest (OvR)
Compute ROC for each class against all others, then average.
One-vs-One (OvO)
Compute ROC for each pair of classes.
Multi-class AUC (macro)
$$\text{AUC}_{\text{macro}} = \frac{1}{K} \sum_{k=1}^{K} \text{AUC}_k$$
Here,
- $K$ = Number of classes
- $\text{AUC}_k$ = AUC for class $k$ (one-vs-rest)
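As a sketch of the one-vs-rest macro average, the example below fits a logistic regression to a synthetic three-class problem (the dataset and model settings are assumptions for illustration), computes each per-class AUC by hand, and compares the mean with scikit-learn's `roc_auc_score(..., multi_class="ovr", average="macro")`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data (illustrative only).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)

# Per-class one-vs-rest AUC: class k is "positive", all other classes "negative".
auc_k = [roc_auc_score((y_te == k).astype(int), proba[:, k]) for k in range(3)]
auc_macro = float(np.mean(auc_k))

# scikit-learn's built-in macro-averaged one-vs-rest AUC.
auc_sklearn = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")

print(auc_k, auc_macro, auc_sklearn)
```

The macro average weights every class equally regardless of its size; `average="weighted"` would instead weight each class by its prevalence.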
DeLong Test
Tests whether two AUCs are significantly different.
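DeLong Test
$$z = \frac{\widehat{\text{AUC}}_1 - \widehat{\text{AUC}}_2}{\sqrt{\widehat{\text{Var}}(\widehat{\text{AUC}}_1) + \widehat{\text{Var}}(\widehat{\text{AUC}}_2) - 2\,\widehat{\text{Cov}}(\widehat{\text{AUC}}_1, \widehat{\text{AUC}}_2)}}$$
Here,
- $\widehat{\text{AUC}}_1, \widehat{\text{AUC}}_2$ = Estimated AUCs of the two models, evaluated on the same cases
- $z$ = Test statistic (standard normal under $H_0$)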
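scikit-learn does not ship a DeLong test, so the sketch below implements the statistic directly from its structural components with NumPy. It is an illustrative implementation under stated assumptions (binary labels, two score vectors for the same cases, a two-sided normal p-value), not a reference implementation; the function name `delong_test` and the synthetic data are hypothetical.

```python
import math
import numpy as np
from sklearn.metrics import roc_auc_score

def delong_test(y_true, score_a, score_b):
    """Two-sided DeLong test for two correlated AUCs (same cases, two models).

    Returns (auc_a, auc_b, z, p_value).
    """
    y_true = np.asarray(y_true).astype(bool)
    v_pos, v_neg = [], []
    for score in (score_a, score_b):
        score = np.asarray(score, dtype=float)
        pos, neg = score[y_true], score[~y_true]
        # psi[i, j] = 1 if positive i outranks negative j, 0.5 on ties, else 0.
        psi = (pos[:, None] > neg[None, :]) + 0.5 * (pos[:, None] == neg[None, :])
        v_pos.append(psi.mean(axis=1))  # one component per positive case
        v_neg.append(psi.mean(axis=0))  # one component per negative case
    v_pos, v_neg = np.vstack(v_pos), np.vstack(v_neg)

    auc_a, auc_b = v_pos.mean(axis=1)
    m, n = v_pos.shape[1], v_neg.shape[1]
    # 2x2 covariance matrix of the two AUC estimates.
    cov = np.cov(v_pos) / m + np.cov(v_neg) / n
    var_diff = cov[0, 0] + cov[1, 1] - 2 * cov[0, 1]
    z = (auc_a - auc_b) / math.sqrt(var_diff)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return auc_a, auc_b, z, p_value

# Synthetic example (illustrative): model A is less noisy than model B.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
score_a = y + rng.normal(scale=0.8, size=400)
score_b = y + rng.normal(scale=1.5, size=400)

auc_a, auc_b, z, p_value = delong_test(y, score_a, score_b)
print(f"AUC A={auc_a:.3f}  AUC B={auc_b:.3f}  z={z:.2f}  p={p_value:.4f}")
```

The pairwise matrix costs memory proportional to positives times negatives, so faster rank-based variants are preferable for large samples. The statistic is undefined when the two score vectors are identical, because the variance of the difference is then zero.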
Python Implementation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (roc_curve, roc_auc_score, precision_recall_curve,
                             confusion_matrix, classification_report)
from sklearn.model_selection import train_test_split
np.random.seed(42)
# Generate data: roughly 70% negatives (class 0) and 30% positives (class 1)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           weights=[0.7], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]
# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)
# Find the threshold maximizing Youden's J (TPR - FPR), which weights sensitivity and specificity equally
J = tpr - fpr
optimal_idx = np.argmax(J)
optimal_threshold = thresholds[optimal_idx]
# Plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# ROC
axes[0].plot(fpr, tpr, 'b-', label=f'AUC = {auc:.3f}')
axes[0].plot([0, 1], [0, 1], 'r--', label='Random')
axes[0].scatter(fpr[optimal_idx], tpr[optimal_idx], c='red', s=100, label=f'Threshold = {optimal_threshold:.2f}')
axes[0].set_xlabel('FPR')
axes[0].set_ylabel('TPR')
axes[0].set_title('ROC Curve')
axes[0].legend()
# Precision-Recall
precision, recall, _ = precision_recall_curve(y_test, y_prob)
axes[1].plot(recall, precision, 'b-')
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curve')
plt.tight_layout()
plt.show()
# Confusion matrix at optimal threshold
y_pred_optimal = (y_prob >= optimal_threshold).astype(int)
print(f"Optimal threshold: {optimal_threshold:.3f}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred_optimal)}")
print(f"\n{classification_report(y_test, y_pred_optimal)}")
Worked Example
Example: Medical Screening
Evaluating a disease screening test with 5% prevalence:
| Threshold | Sensitivity | Specificity | PPV | F1 |
|---|---|---|---|---|
| 0.3 | 0.95 | 0.72 | 0.15 | 0.26 |
| 0.5 | 0.82 | 0.88 | 0.26 | 0.40 |
| 0.7 | 0.65 | 0.95 | 0.41 | 0.50 |
AUC = 0.89 (excellent discrimination)
Recommendation: Use threshold = 0.3 for screening (high sensitivity, accept more false positives). Use threshold = 0.7 for diagnosis (high specificity, fewer false positives).
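PPV is not a property of the test alone: it follows from sensitivity, specificity, and prevalence through Bayes' rule, which is why it stays low at 5% prevalence even when specificity is high. A minimal sketch that recomputes PPV and F1 for the three thresholds from the sensitivity and specificity columns (the helper names `ppv` and `f1` are illustrative; results can differ in the second decimal depending on how the inputs were rounded):

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

def f1(precision, recall):
    """Harmonic mean of precision (PPV) and recall (sensitivity)."""
    return 2 * precision * recall / (precision + recall)

prevalence = 0.05
# (threshold, sensitivity, specificity) taken from the worked example.
rows = [(0.3, 0.95, 0.72), (0.5, 0.82, 0.88), (0.7, 0.65, 0.95)]

results = []
for threshold, sens, spec in rows:
    p = ppv(sens, spec, prevalence)
    results.append((threshold, p, f1(p, sens)))
    print(f"threshold={threshold}: PPV={p:.2f}, F1={f1(p, sens):.2f}")
```

Changing `prevalence` shows how strongly PPV depends on it: at 50% prevalence the same sensitivity and specificity give a much higher PPV.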
Key Takeaways
Summary: ROC and AUC
- ROC curve plots TPR vs FPR across all thresholds
- AUC summarizes discrimination ability: 0.5 = random, 1.0 = perfect
- AUC = probability that a random positive scores higher than a random negative
- Threshold selection depends on the relative costs of FP vs FN
- For imbalanced classes, the PR curve is often more informative than ROC
- Use the DeLong test to compare AUCs between models
- F1 score balances precision and recall at a single threshold
Related Topics
- See Cross-Validation for model evaluation methodology
- See AIC and BIC for model selection criteria
- See Missing Data for data quality considerations