ROC Curves and AUC: Model Discrimination

Evaluating How Well Classifiers Separate Classes

ROC curves plot the true positive rate against the false positive rate across all thresholds, while AUC summarizes overall discrimination ability. These threshold-independent metrics reveal a model's inherent ability to distinguish between classes.

Medical Screening: Compare diagnostic tests for disease detection accuracy
Fraud Detection: Evaluate model performance across operating thresholds
Credit Risk: Assess borrower classification before setting cutoff policies

An AUC of 0.5 means random guessing; an AUC of 1 means perfect separation.

ROC curves and AUC measure how well a classifier distinguishes between classes. They are threshold-independent metrics that evaluate the discrimination ability of a model.

Definition: ROC Curve

A plot of the True Positive Rate (sensitivity) against the False Positive Rate (1 - specificity) at various classification thresholds.
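To make the definition concrete, here is a minimal sketch that builds the ROC points by hand: it sweeps a cutoff over the observed scores and records one (FPR, TPR) pair per cutoff. The function name `roc_points` and the four-sample data set are illustrative assumptions, not part of any library.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays obtained by sweeping a cutoff over the scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    # Start above the highest score so the first point is (0, 0),
    # then lower the cutoff through each distinct score.
    cutoffs = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    fpr, tpr = [], []
    for t in cutoffs:
        pred = scores >= t  # predicted positive at this cutoff
        tpr.append(np.sum(pred & (y_true == 1)) / n_pos)
        fpr.append(np.sum(pred & (y_true == 0)) / n_neg)
    return np.array(fpr), np.array(tpr)

# Hypothetical toy data: two negatives, two positives.
y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
fpr_toy, tpr_toy = roc_points(y_toy, s_toy)
print(fpr_toy)  # [0.  0.  0.5 0.5 1. ]
print(tpr_toy)  # [0.  0.5 0.5 1.  1. ]
```

Each lowering of the cutoff can only add predicted positives, so both rates are non-decreasing and the curve runs from (0, 0) to (1, 1).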

Key Metrics

True Positive Rate (Sensitivity)

TPR = \frac{TP}{TP + FN}

Here,

$TP$ = True positives (correctly predicted positive)
$FN$ = False negatives (missed positive cases)

False Positive Rate

FPR = \frac{FP}{FP + TN}

Here,

$FP$ = False positives (incorrectly predicted positive)
$TN$ = True negatives (correctly predicted negative)

Precision

Precision = \frac{TP}{TP + FP}

Here,

$Precision$ = Proportion of positive predictions that are correct
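The three rates above can be computed directly from confusion-matrix counts. The sketch below uses made-up counts (40 TP, 10 FN, 20 FP, 130 TN) purely for illustration; the helper name `rates` is an assumption.

```python
def rates(tp, fn, fp, tn):
    """Compute TPR, FPR and precision from confusion-matrix counts."""
    tpr = tp / (tp + fn)        # share of actual positives that were caught
    fpr = fp / (fp + tn)        # share of actual negatives that were flagged
    precision = tp / (tp + fp)  # share of flagged cases that are truly positive
    return tpr, fpr, precision

# Hypothetical counts, not taken from any real data set.
tpr_ex, fpr_ex, precision_ex = rates(tp=40, fn=10, fp=20, tn=130)
print(f"TPR = {tpr_ex:.3f}, FPR = {fpr_ex:.3f}, Precision = {precision_ex:.3f}")
```

Note that TPR and FPR are each normalized by an actual-class total, while precision is normalized by the number of positive predictions, which is why precision reacts to class prevalence and the other two do not.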

Area Under the Curve (AUC)

AUC

AUC = \int_0^1 TPR(FPR^{-1}(x)) \, dx

Here,

$AUC$ = Area under the ROC curve (0 to 1)
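For an empirical ROC curve, which is a finite set of points, the integral is usually approximated with the trapezoidal rule. The sketch below assumes the points are already sorted by increasing FPR; `auc_trapezoid` is an illustrative helper, and the five points are a hypothetical toy curve.

```python
import numpy as np

def auc_trapezoid(fpr, tpr):
    """Approximate the area under an ROC curve with the trapezoidal rule."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    widths = np.diff(fpr)                # horizontal step between points
    heights = (tpr[1:] + tpr[:-1]) / 2   # mean height over each step
    return float(np.sum(widths * heights))

# Hypothetical ROC points, sorted by increasing FPR.
fpr_pts = [0.0, 0.0, 0.5, 0.5, 1.0]
tpr_pts = [0.0, 0.5, 0.5, 1.0, 1.0]
auc_value = auc_trapezoid(fpr_pts, tpr_pts)
print(auc_value)  # 0.75
```

Vertical segments (where FPR does not change) have zero width and contribute nothing, so only the horizontal steps add area.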

AUC	Interpretation
0.5	No discrimination (random guessing)
0.5 - 0.7	Poor discrimination
0.7 - 0.8	Acceptable discrimination
0.8 - 0.9	Excellent discrimination
> 0.9	Outstanding discrimination

Probabilistic Interpretation

AUC = the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case.
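This interpretation can be checked by brute force: compare every positive score with every negative score and count how often the positive wins. The sketch below assumes the common convention of counting ties as one half; `auc_pairwise` and the toy data are illustrative.

```python
import numpy as np

def auc_pairwise(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # diff[i, j] = score of positive i minus score of negative j
    diff = pos[:, None] - neg[None, :]
    # A win counts 1, a tie counts 0.5 (assumed convention), a loss counts 0.
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Hypothetical toy data: 3 of the 4 positive-negative pairs are ordered correctly.
pairwise_auc = auc_pairwise([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(pairwise_auc)  # 0.75
```

The pairwise count costs O(m x n) comparisons, so it is useful for understanding and for small samples rather than as a replacement for library routines.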

Threshold Selection

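The optimal operating point depends on the relative costs of false positives and false negatives and on class prevalence.

Optimal Threshold

\text{Threshold}^* = \frac{C_{FP} \cdot P(N)}{C_{FN} \cdot P(P)}

Here,

$C_{FP}$ = Cost of a false positive
$C_{FN}$ = Cost of a false negative
$P(P), P(N)$ = Prior probabilities of the positive and negative class

This quantity is a cutoff on the likelihood ratio $p(x \mid P) / p(x \mid N)$, and it equals the slope of the ROC curve at the cost-minimizing operating point. It is not a cutoff on the predicted probability itself and can exceed 1.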

Balanced Threshold

The default threshold of 0.5 is optimal only when classes are equally important and equally prevalent. Adjust the threshold based on the specific application costs.
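In practice, a cost-sensitive cutoff is often found empirically: try each observed score as a cutoff, total the misclassification cost, and keep the cheapest. The sketch below is one simple way to do that under assumed costs; `min_cost_threshold`, the cost values, and the eight-sample data set are all hypothetical.

```python
import numpy as np

def min_cost_threshold(y_true, scores, cost_fp, cost_fn):
    """Return (cutoff, total cost) minimizing cost_fp * FP + cost_fn * FN."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    best_t, best_cost = None, np.inf
    # Candidate cutoffs are the distinct observed scores, in ascending order;
    # on ties in cost the lowest cutoff is kept.
    for t in np.unique(scores):
        pred = scores >= t
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = float(t), float(cost)
    return best_t, best_cost

# Hypothetical labels and predicted probabilities.
y_cost = [0, 0, 0, 0, 1, 0, 1, 1]
s_cost = [0.05, 0.2, 0.3, 0.45, 0.5, 0.6, 0.7, 0.9]

equal = min_cost_threshold(y_cost, s_cost, cost_fp=1, cost_fn=1)
fp_heavy = min_cost_threshold(y_cost, s_cost, cost_fp=5, cost_fn=1)
print(equal)     # (0.5, 1.0)
print(fp_heavy)  # (0.7, 1.0)
```

Making false positives five times as expensive moves the chosen cutoff from 0.5 up to 0.7: the classifier gives up one true positive to avoid the costly false alarm at 0.6.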

Confusion Matrix

	Predicted Positive	Predicted Negative
Actual Positive	TP	FN
Actual Negative	FP	TN

F1 Score

F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}

Here,

$Recall$ = Same as TPR
$F1$ = Harmonic mean of precision and recall
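As a quick numeric check of the formula, the sketch below computes F1 from counts and compares it with the equivalent closed form 2TP / (2TP + FP + FN). The counts are hypothetical and `f1_from_counts` is an illustrative helper.

```python
def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: precision = 40/60, recall = 40/50.
f1_value = f1_from_counts(tp=40, fp=20, fn=10)
closed_form = 2 * 40 / (2 * 40 + 20 + 10)  # 2TP / (2TP + FP + FN)
print(f"F1 = {f1_value:.4f}, closed form = {closed_form:.4f}")
```

True negatives do not appear anywhere in the formula, which is why F1 is insensitive to how many easy negatives the data set contains.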

Precision-Recall Curve

When classes are imbalanced, the PR curve may be more informative than ROC.

ROC with Imbalanced Classes

With severe class imbalance (e.g., 99% negatives), ROC can be optimistically misleading because FPR uses a large denominator (all negatives), so even a large number of false positives barely raises it. The PR curve focuses on the minority class.
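The effect is easy to see with arithmetic. The sketch below holds one ROC operating point fixed (TPR = 0.80, FPR = 0.05) and computes the resulting precision for a balanced and an imbalanced population. The class sizes are hypothetical and `precision_at` is an illustrative helper.

```python
def precision_at(tpr, fpr, n_pos, n_neg):
    """Precision implied by an ROC operating point and the class sizes."""
    tp = tpr * n_pos  # expected true positives
    fp = fpr * n_neg  # expected false positives
    return tp / (tp + fp)

# Same ROC point, two hypothetical populations.
balanced = precision_at(0.80, 0.05, n_pos=1000, n_neg=1000)
imbalanced = precision_at(0.80, 0.05, n_pos=100, n_neg=9900)
print(f"balanced:   precision = {balanced:.3f}")
print(f"imbalanced: precision = {imbalanced:.3f}")
```

The ROC point is identical in both cases, yet precision falls from about 0.94 to about 0.14 once negatives outnumber positives 99 to 1. That drop is visible on a PR curve and invisible on the ROC curve.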

Multi-Class Extensions

One-vs-Rest (OvR)

Compute ROC for each class against all others, then average.

One-vs-One (OvO)

Compute ROC for each pair of classes.

Multi-class AUC (macro)

AUC_{macro} = \frac{1}{K}\sum_{k=1}^{K} AUC_k

Here,

$K$ = Number of classes
$AUC_k$ = AUC for class k (one-vs-rest)
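The macro average can be computed by hand, one class at a time, and compared with scikit-learn's `roc_auc_score` using `multi_class="ovr"` and `average="macro"`. The sketch below uses the Iris data set and a logistic regression purely as a convenient three-class example; the choice of data and model is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_mc, y_mc = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_mc, y_mc, test_size=0.3, random_state=42, stratify=y_mc
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)  # shape (n_samples, K), rows sum to 1

# One-vs-rest: treat class k as positive and all other classes as negative.
per_class = [
    roc_auc_score((y_te == k).astype(int), proba[:, i])
    for i, k in enumerate(clf.classes_)
]
macro_manual = float(np.mean(per_class))

macro_sklearn = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")
print(f"manual macro AUC  = {macro_manual:.4f}")
print(f"sklearn macro AUC = {macro_sklearn:.4f}")
```

Macro averaging weights every class equally regardless of its size; `average="weighted"` is the alternative when per-class AUCs should be weighted by prevalence.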

DeLong Test

Tests whether two correlated AUCs, computed from two models scored on the same test cases, are significantly different.

DeLong Test

z = \frac{AUC_1 - AUC_2}{\sqrt{Var(AUC_1) + Var(AUC_2) - 2Cov(AUC_1, AUC_2)}}

Here,

$z$ = Test statistic (standard normal under $H_0$)
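The variances and covariance in the statistic are usually estimated from per-case "placement values": for each positive, the share of negatives it outranks, and for each negative, the share of positives that outrank it. The sketch below is a minimal NumPy implementation under that approach, assuming binary labels coded 0/1 and both score vectors computed on the same cases. The function name `delong_test` and the simulated scores are illustrative, and a vetted statistical package is preferable for production use.

```python
import math
import numpy as np

def delong_test(y_true, scores_a, scores_b):
    """Return (auc_a, auc_b, z, two-sided p) for two models on the same cases."""
    y_true = np.asarray(y_true)
    pos, neg = (y_true == 1), (y_true == 0)
    m, n = int(pos.sum()), int(neg.sum())
    aucs, v10, v01 = [], [], []
    for s in (np.asarray(scores_a, float), np.asarray(scores_b, float)):
        diff = s[pos][:, None] - s[neg][None, :]
        psi = (diff > 0) + 0.5 * (diff == 0)  # 1 = win, 0.5 = tie, 0 = loss
        v10.append(psi.mean(axis=1))  # placement value of each positive
        v01.append(psi.mean(axis=0))  # placement value of each negative
        aucs.append(float(psi.mean()))
    # 2x2 covariance matrix of (AUC_a, AUC_b)
    cov = np.cov(np.vstack(v10)) / m + np.cov(np.vstack(v01)) / n
    var_diff = cov[0, 0] + cov[1, 1] - 2 * cov[0, 1]
    if var_diff <= 0:  # e.g. identical score vectors: no evidence of a difference
        return aucs[0], aucs[1], 0.0, 1.0
    z = (aucs[0] - aucs[1]) / math.sqrt(var_diff)
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return aucs[0], aucs[1], z, p

# Simulated example: model A is less noisy than model B (hypothetical data).
rng = np.random.default_rng(0)
y_sim = rng.integers(0, 2, size=300)
scores_a = y_sim + rng.normal(0, 0.5, size=300)
scores_b = y_sim + rng.normal(0, 3.0, size=300)

auc_a, auc_b, z, p_value = delong_test(y_sim, scores_a, scores_b)
print(f"AUC A = {auc_a:.3f}, AUC B = {auc_b:.3f}, z = {z:.2f}, p = {p_value:.4f}")
```

The covariance term matters: because both models are scored on the same cases, their AUC estimates are correlated, and ignoring that correlation would misstate the variance of the difference.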

Python Implementation

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (roc_curve, roc_auc_score, precision_recall_curve,
                              confusion_matrix, classification_report)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

np.random.seed(42)

# Generate data
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           weights=[0.7], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit model
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc = roc_auc_score(y_test, y_prob)

# Find optimal threshold (Youden's J)
J = tpr - fpr
optimal_idx = np.argmax(J)
optimal_threshold = thresholds[optimal_idx]

# Plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# ROC
axes[0].plot(fpr, tpr, 'b-', label=f'AUC = {auc:.3f}')
axes[0].plot([0, 1], [0, 1], 'r--', label='Random')
axes[0].scatter(fpr[optimal_idx], tpr[optimal_idx], c='red', s=100, label=f'Threshold = {optimal_threshold:.2f}')
axes[0].set_xlabel('FPR')
axes[0].set_ylabel('TPR')
axes[0].set_title('ROC Curve')
axes[0].legend()

# Precision-Recall
precision, recall, _ = precision_recall_curve(y_test, y_prob)
axes[1].plot(recall, precision, 'b-')
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curve')
plt.tight_layout()
plt.show()

# Confusion matrix at optimal threshold
y_pred_optimal = (y_prob >= optimal_threshold).astype(int)
print(f"Optimal threshold: {optimal_threshold:.3f}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred_optimal)}")
print(f"\n{classification_report(y_test, y_pred_optimal)}")

Worked Example

Example: Medical Screening

Evaluating a disease screening test with 5% prevalence:

Threshold	Sensitivity	Specificity	PPV	F1
0.3	0.95	0.72	0.15	0.26
0.5	0.82	0.88	0.26	0.40
0.7	0.65	0.95	0.41	0.50

AUC = 0.89 (excellent discrimination)

Recommendation: Use threshold = 0.3 for screening (high sensitivity, accept more false positives). Use threshold = 0.7 for diagnosis (high specificity, fewer false positives).
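PPV is not a property of the test alone: it follows from sensitivity, specificity, and prevalence through Bayes' rule. The sketch below recomputes PPV and F1 for the three operating points at 5% prevalence. The helper names `ppv` and `f1` are illustrative, and figures tabulated from a finite sample can differ from this closed-form result in the second decimal.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' rule."""
    true_pos = sensitivity * prevalence               # P(test+, disease)
    false_pos = (1 - specificity) * (1 - prevalence)  # P(test+, no disease)
    return true_pos / (true_pos + false_pos)

def f1(precision, recall):
    """Harmonic mean of precision (PPV) and recall (sensitivity)."""
    return 2 * precision * recall / (precision + recall)

prevalence = 0.05
operating_points = {0.3: (0.95, 0.72), 0.5: (0.82, 0.88), 0.7: (0.65, 0.95)}

for threshold, (sens, spec) in operating_points.items():
    p = ppv(sens, spec, prevalence)
    print(f"threshold={threshold}: PPV={p:.3f}, F1={f1(p, sens):.3f}")
```

Even at 95% specificity, most positive results are false alarms when only 5% of the population has the disease, which is why a high-sensitivity screening test is normally followed by a more specific confirmatory test.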

Key Takeaways

Summary: ROC and AUC

ROC curve plots TPR vs FPR across all thresholds
AUC summarizes discrimination ability: 0.5 = random, 1.0 = perfect
AUC = probability that a random positive scores higher than a random negative
Threshold selection depends on the relative costs of FP vs FN
For imbalanced classes, the PR curve may be more informative than ROC
Use the DeLong test to compare AUCs between models
F1 score balances precision and recall at a single threshold
