Calibration and Model Checking
Advanced Statistical Methods
Ensuring Your Predictions Match Reality
Calibration is the bridge between raw model outputs and trustworthy predictions: it ensures that when a model says "80% probability," events actually occur about 80% of the time. In medicine, finance, and weather forecasting, miscalibrated probabilities can lead to disastrous decisions.
- Medical diagnosis: Calibrated risk scores enable clinicians to trust and act on predicted probabilities
- Weather forecasting: Calibration ensures probabilistic forecasts are reliable and actionable
- Machine learning deployment: Well-calibrated models produce outputs that can be interpreted as true probabilities
Calibration transforms opaque model scores into probabilities you can stake decisions on.
What Is Calibration?
Definition: Calibration
A probabilistic classifier is calibrated if, for any predicted probability $p$, the true probability of the positive class among all instances predicted with probability $p$ is approximately $p$. Formally, a model is calibrated if:
$$P(Y = 1 \mid \hat{p}(X) = p) = p \quad \text{for all } p \in [0, 1]$$
Calibration is distinct from discrimination. A model can have excellent discrimination (high AUC) yet be poorly calibrated, producing predicted probabilities that are systematically too high or too low.
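The gap between discrimination and calibration can be illustrated with a small simulation. The sketch below uses synthetic data (an assumption, not taken from any real model): outcomes are drawn from known probabilities, and a strictly increasing distortion is then applied. The ranking, and therefore the AUC, is unchanged, while the Brier score gets worse.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)

# Probabilities that are calibrated by construction: each outcome is drawn
# with exactly the stated probability.
p_true = rng.uniform(0.05, 0.95, size=20_000)
y = rng.binomial(1, p_true)

# A strictly increasing distortion: squaring lowers every probability
# but preserves the ordering of the instances.
p_distorted = p_true ** 2

auc_true = roc_auc_score(y, p_true)
auc_distorted = roc_auc_score(y, p_distorted)
brier_true = brier_score_loss(y, p_true)
brier_distorted = brier_score_loss(y, p_distorted)

print(f"AUC   calibrated: {auc_true:.4f}  distorted: {auc_distorted:.4f}")
print(f"Brier calibrated: {brier_true:.4f}  distorted: {brier_distorted:.4f}")
```

Because AUC depends only on how instances are ranked, any monotone rescaling leaves it untouched; this is why a high AUC says nothing about whether the probabilities themselves can be trusted.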
Brier Score: The Decomposition
The Brier score measures the mean squared error between predicted probabilities and observed outcomes:
Brier Score
$$\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} (\hat{p}_i - y_i)^2$$
Here,
- $\hat{p}_i$ = Predicted probability for observation $i$
- $y_i$ = Observed binary outcome (0 or 1)
- $N$ = Total number of observations
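As a quick worked example, the Brier score is simply the mean of the squared gaps between each predicted probability and its 0/1 outcome. The four predictions below are made-up illustrative values.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Illustrative values only: four predicted probabilities and their outcomes.
p_hat = np.array([0.9, 0.2, 0.6, 0.4])
y = np.array([1, 0, 1, 1])

# Squared errors: 0.01, 0.04, 0.16, 0.36 -> mean 0.1425
brier = np.mean((p_hat - y) ** 2)

# Reference point: always predicting the base rate (0.75 here).
baseline = np.mean((y.mean() - y) ** 2)

print(round(brier, 4))     # 0.1425
print(round(baseline, 4))  # 0.1875
print(np.isclose(brier, brier_score_loss(y, p_hat)))  # True
```

Lower is better: 0 is a perfect forecast, and a model is only useful if it beats the base-rate baseline.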
Decomposition of the Brier Score
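The Brier score can be decomposed into three components:
$$\mathrm{BS} = \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k (\bar{p}_k - \bar{o}_k)^2}_{\text{Reliability}} - \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k (\bar{o}_k - \bar{o})^2}_{\text{Resolution}} + \underbrace{\bar{o}(1 - \bar{o})}_{\text{Uncertainty}}$$
where $n_k$ is the number of observations in bin $k$, $\bar{p}_k$ and $\bar{o}_k$ are the mean predicted probability and the observed frequency in that bin, $\bar{o}$ is the overall base rate, and:
- Reliability = average squared difference between mean predicted probability and observed frequency within each probability bin
- Resolution = variance of observed frequencies across bins (how far the bin-level outcome rates spread from the overall base rate; larger is better)
- Uncertainty = variance of the overall outcome (irreducible noise that does not depend on the model)
A perfectly calibrated model has reliability = 0. The decomposition is exact when all predictions within a bin share the same value; with wider bins it holds only approximately.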
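The identity Brier = Reliability - Resolution + Uncertainty can be checked numerically. The sketch below is a minimal implementation under one simplifying assumption: forecasts are restricted to a 0.1 grid and grouped by their exact value, which is the case where the decomposition holds exactly. The function name `brier_decomposition` and the synthetic data are illustrative, not part of any library.

```python
import numpy as np

def brier_decomposition(y, p_hat):
    """Return (reliability, resolution, uncertainty), grouping by distinct forecast value."""
    y = np.asarray(y, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    n = len(y)
    base_rate = y.mean()
    reliability = 0.0
    resolution = 0.0
    for value in np.unique(p_hat):
        mask = p_hat == value
        n_k = mask.sum()
        o_k = y[mask].mean()  # observed frequency in this group
        reliability += n_k * (value - o_k) ** 2
        resolution += n_k * (o_k - base_rate) ** 2
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability / n, resolution / n, uncertainty

rng = np.random.default_rng(1)
# Forecasts on a 0.1 grid (0.1, ..., 0.9) so every group shares one value.
p_hat = rng.integers(1, 10, size=5000) / 10.0
# Outcomes drawn from a shifted probability, so the forecasts are miscalibrated.
y = rng.binomial(1, np.clip(p_hat + 0.1, 0.0, 1.0))

rel, res, unc = brier_decomposition(y, p_hat)
brier = np.mean((p_hat - y) ** 2)
print(f"Brier={brier:.4f}  REL={rel:.4f}  RES={res:.4f}  UNC={unc:.4f}")
print(f"REL - RES + UNC = {rel - res + unc:.4f}")
```

The two printed totals agree up to floating-point error. With continuous forecasts and wide bins, a within-bin variance term appears and the three components no longer sum exactly to the Brier score.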
Reliability Diagrams
Definition: Reliability Diagram
A reliability diagram plots the mean predicted probability (x-axis) against the observed frequency (y-axis) across bins of predictions. A perfectly calibrated model traces the 45-degree diagonal $y = x$. Systematic deviations above the diagonal indicate underestimation (events occur more often than predicted); deviations below indicate overestimation.
The construction procedure:
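- Bin predictions into $K$ intervals (e.g., $[0, 0.1), [0.1, 0.2), \ldots, [0.9, 1.0]$)
- For each bin $k$, compute the mean predicted probability $\bar{p}_k = \frac{1}{n_k} \sum_{i \in B_k} \hat{p}_i$
- Compute the observed frequency $\bar{o}_k = \frac{1}{n_k} \sum_{i \in B_k} y_i$
- Plot $(\bar{p}_k, \bar{o}_k)$ for each bin and connect with lines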
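The construction steps above can be sketched directly in NumPy. The helper `reliability_table` and the "overconfident model" below are hypothetical, chosen only to make the bin-level pattern visible; the function returns the (mean predicted, observed frequency) pairs that would be plotted.

```python
import numpy as np

def reliability_table(y, p_hat, n_bins=10):
    """Return (bin index, count, mean predicted, observed frequency) per non-empty bin."""
    y = np.asarray(y, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Use interior edges only, so that p = 1.0 falls into the last bin.
    bin_ids = np.digitize(p_hat, edges[1:-1])
    rows = []
    for k in range(n_bins):
        mask = bin_ids == k
        if not mask.any():
            continue  # skip empty bins rather than plotting NaN
        rows.append((k, int(mask.sum()), p_hat[mask].mean(), y[mask].mean()))
    return rows

rng = np.random.default_rng(0)
p_true = rng.uniform(0, 1, size=20_000)
y = rng.binomial(1, p_true)
# Hypothetical overconfident model: pushes probabilities toward 0 and 1.
p_hat = p_true**2 / (p_true**2 + (1 - p_true)**2)

rows = reliability_table(y, p_hat, n_bins=10)
for k, n_k, mean_pred, obs_freq in rows:
    print(f"bin {k}: n={n_k:5d}  mean predicted={mean_pred:.3f}  observed={obs_freq:.3f}")
```

For this overconfident model the observed frequency sits above the predicted mean in the low bins and below it in the high bins, which is the S-shaped pattern a reliability diagram would show around the diagonal.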
Hosmer-Lemeshow Test
Theorem: Hosmer-Lemeshow Test Statistic
Divide the sample into $G$ groups (typically deciles) ordered by predicted probability. The test statistic is:
$$\hat{C} = \sum_{g=1}^{G} \frac{(O_g - n_g \bar{p}_g)^2}{n_g \bar{p}_g (1 - \bar{p}_g)}$$
where $O_g$ is the observed number of events in group $g$, $n_g$ is the group size, and $\bar{p}_g$ is the mean predicted probability in group $g$.
Under $H_0$ (calibration), $\hat{C} \sim \chi^2_{G-2}$ asymptotically. A significant p-value indicates lack of fit.
Limitations of Hosmer-Lemeshow
- The test is sensitive to the number of bins: too few bins lose power; too many bins create sparse groups
- It is an overall test and does not identify where miscalibration occurs
- Power depends on sample size: very large samples can reject trivial miscalibration
- Use the test alongside visual reliability diagrams, not as a sole diagnostic
Calibration Methods
Platt Scaling
Definition: Platt Scaling
Platt scaling fits a logistic regression to the outputs of a classifier. Given raw scores $f(x)$, it estimates:
$$P(y = 1 \mid f(x)) = \sigma(A f(x) + B)$$
where $\sigma$ is the sigmoid function and parameters $A$, $B$ are estimated via maximum likelihood on a held-out calibration set. Platt scaling assumes the calibration function is logistic in form.
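To make the mechanics concrete, Platt scaling can be written out by hand as a one-feature logistic regression on the raw scores. This is a minimal sketch, not the exact procedure of any library: the `LinearSVC` base model, the synthetic data, and the split sizes are assumptions, and it omits the target smoothing used in Platt's original formulation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
# Separate data for training, calibration, and evaluation.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.5, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Base model that outputs uncalibrated margins rather than probabilities.
svm = LinearSVC(C=1.0, max_iter=10000).fit(X_train, y_train)
scores_cal = svm.decision_function(X_cal).reshape(-1, 1)
scores_test = svm.decision_function(X_test).reshape(-1, 1)

# One-feature logistic regression on the scores: sigma(A * f + B).
# A very large C makes the default L2 penalty negligible, approximating plain MLE.
platt = LogisticRegression(C=1e6, max_iter=1000).fit(scores_cal, y_cal)
A, B = platt.coef_[0, 0], platt.intercept_[0]

p_test = platt.predict_proba(scores_test)[:, 1]
p_manual = 1.0 / (1.0 + np.exp(-(A * scores_test[:, 0] + B)))

print(f"A = {A:.3f}, B = {B:.3f}")
print("manual sigmoid matches predict_proba:", np.allclose(p_test, p_manual))
```

Only two parameters are learned, which is why the method remains stable on small calibration sets but cannot correct miscalibration that is not sigmoid-shaped.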
Isotonic Regression
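Definition: Isotonic Regression for Calibration
Isotonic regression fits a non-decreasing step function $m$ to the mapping from predicted probabilities to observed outcomes by minimizing:
$$\sum_{i=1}^{N} \bigl(y_i - m(\hat{p}_i)\bigr)^2 \quad \text{subject to } m(a) \le m(b) \text{ whenever } a \le b$$
where $\hat{p}_i$ is the predicted probability and $y_i$ the observed outcome for observation $i$.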
This is a non-parametric method that assumes only that the calibration map is monotone, with no further assumption about the functional form of miscalibration, making it more flexible than Platt scaling.
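A short sketch of isotonic calibration using scikit-learn's `IsotonicRegression`. The data are synthetic and the cubic distortion is an arbitrary choice for illustration: it is monotone but not sigmoid-shaped, which is the kind of miscalibration a step function can absorb.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
p_true = rng.uniform(0, 1, size=4000)
y = rng.binomial(1, p_true)
# Hypothetical raw model output: a monotone distortion of the true probability.
p_raw = p_true ** 3

# First half for fitting the calibration map, second half for evaluation.
cal_idx, test_idx = np.arange(2000), np.arange(2000, 4000)

# out_of_bounds="clip" maps scores outside the calibration range to the
# nearest fitted value instead of returning NaN.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(p_raw[cal_idx], y[cal_idx])
p_iso = iso.predict(p_raw[test_idx])

brier_raw = brier_score_loss(y[test_idx], p_raw[test_idx])
brier_iso = brier_score_loss(y[test_idx], p_iso)
print(f"Brier raw: {brier_raw:.4f}  isotonic: {brier_iso:.4f}")
```

The fitted map is piecewise constant, so many distinct raw scores collapse onto the same calibrated value; with few calibration points these steps become coarse and prone to overfitting.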
Platt vs. Isotonic
- Platt scaling: parametric, stable with small calibration sets, assumes logistic form
- Isotonic regression: non-parametric, requires more data, can fit arbitrary calibration shapes
- For large calibration sets, isotonic regression typically outperforms Platt scaling
Python Implementation
Calibration Analysis with sklearn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.metrics import brier_score_loss
np.random.seed(42)
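# Generate imbalanced dataset (about 70% negatives)
X, y = make_classification(n_samples=5000, n_features=20,
                           n_informative=10, n_redundant=5,
                           weights=[0.7], flip_y=0.05, random_state=42)
# Three-way split: train the models, fit the calibrators, then evaluate
# on data that neither step has seen
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_cal, X_test, y_cal, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)
# Uncalibrated models
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)
lr_probs = lr.predict_proba(X_test)[:, 1]
rf_probs = rf.predict_proba(X_test)[:, 1]
# Calibrated models, fitted on the held-out calibration set (not the test set).
# Note: cv='prefit' is deprecated from scikit-learn 1.6; newer versions wrap the
# fitted model in sklearn.frozen.FrozenEstimator instead.
lr_cal = CalibratedClassifierCV(lr, method='sigmoid', cv='prefit').fit(X_cal, y_cal)
rf_cal = CalibratedClassifierCV(rf, method='isotonic', cv='prefit').fit(X_cal, y_cal)
lr_cal_probs = lr_cal.predict_proba(X_test)[:, 1]
rf_cal_probs = rf_cal.predict_proba(X_test)[:, 1]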
# Brier scores
print("Brier Scores:")
print(f" LR (uncalibrated): {brier_score_loss(y_test, lr_probs):.4f}")
print(f" LR (Platt): {brier_score_loss(y_test, lr_cal_probs):.4f}")
print(f" RF (uncalibrated): {brier_score_loss(y_test, rf_probs):.4f}")
print(f" RF (isotonic): {brier_score_loss(y_test, rf_cal_probs):.4f}")
# Reliability diagram
fig, ax = plt.subplots(figsize=(8, 6))
for name, probs in [("LR (raw)", lr_probs), ("LR (Platt)", lr_cal_probs),
                    ("RF (raw)", rf_probs), ("RF (isotonic)", rf_cal_probs)]:
    fraction_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
    ax.plot(mean_pred, fraction_pos, 'o-', label=name)
ax.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
ax.set_xlabel('Mean predicted probability')
ax.set_ylabel('Fraction of positives')
ax.set_title('Calibration Curves')
ax.legend(loc='lower right')
plt.tight_layout()
plt.savefig('calibration_curves.png', dpi=150)
plt.show()
Hosmer-Lemeshow Test Implementation
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

def hosmer_lemeshow_test(y_true, y_pred_prob, n_groups=10):
    """Hosmer-Lemeshow goodness-of-fit test.

    Returns the chi-square statistic, degrees of freedom, and p-value.
    """
    order = np.argsort(y_pred_prob)
    y_true_sorted = np.asarray(y_true)[order]
    y_pred_sorted = np.asarray(y_pred_prob)[order]
    # Create equal-frequency bins over the sorted predictions
    bins = np.array_split(np.arange(len(y_true_sorted)), n_groups)
    chi2_stat = 0.0
    for idx in bins:
        n_g = len(idx)                     # group size
        o_g = y_true_sorted[idx].sum()     # observed number of events
        p_bar = y_pred_sorted[idx].mean()  # mean predicted probability
        # Skip degenerate groups where the variance term is zero
        if p_bar * (1 - p_bar) > 0:
            chi2_stat += (o_g - n_g * p_bar)**2 / (n_g * p_bar * (1 - p_bar))
    df = n_groups - 2
    # Survival function: numerically safer than 1 - cdf for small p-values
    p_value = stats.chi2.sf(chi2_stat, df)
    return chi2_stat, df, p_value

# Example usage (model checked on the data it was fitted to)
np.random.seed(42)
n = 500
X_dummy = np.random.randn(n, 5)
y_dummy = (X_dummy[:, 0] + 0.5 * X_dummy[:, 1] + np.random.randn(n) * 0.8 > 0).astype(int)
model = LogisticRegression().fit(X_dummy, y_dummy)
probs = model.predict_proba(X_dummy)[:, 1]
chi2, df, p = hosmer_lemeshow_test(y_dummy, probs, n_groups=10)
print(f"Hosmer-Lemeshow statistic: {chi2:.4f}")
print(f"Degrees of freedom: {df}")
print(f"P-value: {p:.4f}")
# Failing to reject H0 is not proof of calibration, only absence of evidence against it
print("No evidence of miscalibration" if p > 0.05 else "Poor calibration detected")
Calibration in Practice
Best Practices for Calibration
- Always use a held-out calibration set: never calibrate on training data
- Use cross-validation for small datasets: CalibratedClassifierCV with cv=5
- Inspect reliability diagrams visually: statistical tests alone are insufficient
- Re-calibrate periodically: model drift degrades calibration over time
- Calibrate before combining models: ensemble methods benefit from calibrated base learners
- Consider the application: medical decision-making requires well-calibrated probabilities; tasks that only rank cases or apply a fixed classification threshold do not
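For the small-dataset case in the list above, a minimal sketch of cross-validated calibration with `CalibratedClassifierCV` and `cv=5` might look as follows. The dataset size, the random forest base model, and the choice of `method="sigmoid"` are illustrative assumptions.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Uncalibrated reference model trained on all of the training data.
raw = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# With cv=5, each fold trains a clone of the base model on 4/5 of the training
# data and fits the calibrator on the remaining 1/5, so no separate calibration
# split is needed. Predictions average the 5 fold-specific calibrated models.
base = RandomForestClassifier(n_estimators=100, random_state=0)
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

probs_raw = raw.predict_proba(X_test)[:, 1]
probs_cal = calibrated.predict_proba(X_test)[:, 1]
print(f"Brier raw:        {brier_score_loss(y_test, probs_raw):.4f}")
print(f"Brier calibrated: {brier_score_loss(y_test, probs_cal):.4f}")
```

Sigmoid calibration is used here because each fold leaves only about 140 points for the calibrator; whether calibration actually improves the Brier score on a given dataset should be checked rather than assumed.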
Key Takeaways
Summary: Calibration and Model Checking
- Calibration means predicted probabilities match observed frequencies
- Brier score measures overall prediction quality and decomposes into reliability, resolution, and uncertainty
- Reliability diagrams are the gold-standard visual diagnostic for calibration
- Hosmer-Lemeshow test provides a formal test but should complement, not replace, visual inspection
- Platt scaling (sigmoid) and isotonic regression are the two main post-hoc calibration methods
- Use Platt scaling for small calibration sets; isotonic regression for large ones
- Calibration is critical in medical, financial, and policy applications where probabilities drive decisions