Calibration and Model Checking

Advanced Statistical Methods

Ensuring Your Predictions Match Reality

Calibration is the bridge between raw model outputs and trustworthy predictions: when a calibrated model says "80% probability," the event occurs about 80% of the time. In medicine, finance, and weather forecasting, miscalibrated probabilities can lead to badly misjudged decisions.

Medical diagnosis — Calibrated risk scores enable clinicians to trust and act on predicted probabilities
Weather forecasting — Calibration ensures probabilistic forecasts are reliable and actionable
Machine learning deployment — Well-calibrated models produce outputs that can be interpreted as true probabilities

Calibration transforms opaque model scores into probabilities you can stake decisions on.

What Is Calibration?

Definition: Calibration

A probabilistic classifier is calibrated if, among all instances assigned a predicted probability $p$, the true probability of the positive class is approximately $p$. Formally, a model $\hat{f}(x) = \hat{P}(Y=1 \mid X=x)$ is calibrated if:

$$P(Y=1 \mid \hat{f}(X)=p) = p \quad \text{for all } p \in [0, 1]$$

Calibration is distinct from discrimination. A model can have excellent discrimination (high AUC) yet be poorly calibrated, producing predicted probabilities that are systematically too high or too low.
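The gap between discrimination and calibration can be illustrated with a small simulation. The sketch below uses synthetic data and an arbitrary square-root distortion, both chosen only for illustration: a strictly increasing transformation leaves the ranking (and hence the AUC) unchanged while making the probabilities systematically too high.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)

# Calibrated by construction: outcomes are drawn from p_true itself.
p_true = rng.uniform(0.05, 0.95, size=20_000)
y = rng.binomial(1, p_true)

# A strictly increasing distortion keeps the ordering but inflates every probability.
p_distorted = np.sqrt(p_true)

auc_true = roc_auc_score(y, p_true)
auc_distorted = roc_auc_score(y, p_distorted)
brier_true = brier_score_loss(y, p_true)
brier_distorted = brier_score_loss(y, p_distorted)

print(f"AUC   calibrated: {auc_true:.4f}  distorted: {auc_distorted:.4f}")
print(f"Brier calibrated: {brier_true:.4f}  distorted: {brier_distorted:.4f}")
```

Both versions rank observations identically, so AUC cannot tell them apart; only a calibration-sensitive measure such as the Brier score picks up the difference.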

Brier Score — The Decomposition

The Brier score measures the mean squared error between predicted probabilities and observed outcomes:

Brier Score

$$\text{BS} = \frac{1}{N} \sum_{i=1}^{N} (\hat{p}_i - y_i)^2$$

Here,

$\hat{p}_i$ = predicted probability for observation $i$
$y_i$ = observed binary outcome (0 or 1)
$N$ = total number of observations
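As a quick worked example of the formula, the following sketch computes the Brier score by hand for four made-up predictions and compares it with an uninformative forecast of 0.5 for every observation. The numbers are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

# Four hypothetical predictions and their observed outcomes
p_hat = np.array([0.9, 0.7, 0.3, 0.1])
y = np.array([1, 1, 0, 0])

# Mean squared difference between prediction and outcome
brier = np.mean((p_hat - y) ** 2)

# Reference point: always predicting 0.5
brier_constant = np.mean((0.5 - y) ** 2)

print(f"Brier score (model):        {brier:.4f}")           # 0.0500
print(f"Brier score (constant 0.5): {brier_constant:.4f}")  # 0.2500
```

Lower is better: 0 corresponds to perfect, fully confident predictions, and 0.25 is what a constant forecast of 0.5 always scores.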

Decomposition of the Brier Score

The Brier score can be decomposed into three components:

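$$\text{BS} = \text{Reliability} - \text{Resolution} + \text{Uncertainty}$$

where, for bins $k = 1, \ldots, K$ containing $n_k$ observations each, with mean predicted probability $\bar{p}_k$, observed frequency $\bar{y}_k$, and overall base rate $\bar{y}$:

Reliability = $\sum_k \frac{n_k}{N} (\bar{p}_k - \bar{y}_k)^2$, the weighted average squared difference between mean predicted probability and observed frequency within each bin (lower is better)
Resolution = $\sum_k \frac{n_k}{N} (\bar{y}_k - \bar{y})^2$, the weighted variance of observed bin frequencies around the overall base rate (higher is better)
Uncertainty = $\bar{y}(1 - \bar{y})$, the variance of the outcome itself, which does not depend on the model

A perfectly calibrated model has reliability = 0. The identity is exact when all predictions within a bin share the same value; with wider bins it holds only approximately.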
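The decomposition can be checked numerically. The sketch below assumes forecasts that take only a handful of distinct values, so each value forms its own bin and the three-term identity holds exactly; the simulated data and the 0.05 miscalibration offset are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical forecasts restricted to five distinct values (one bin per value)
levels = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
p_hat = rng.choice(levels, size=5_000)

# Outcomes drawn from slightly shifted probabilities, i.e. a mildly miscalibrated model
y = rng.binomial(1, np.clip(p_hat + 0.05, 0.0, 1.0))

n = len(y)
base_rate = y.mean()

reliability = 0.0
resolution = 0.0
for level in levels:
    in_bin = p_hat == level
    n_k = in_bin.sum()
    if n_k == 0:
        continue
    obs_freq = y[in_bin].mean()
    reliability += n_k / n * (level - obs_freq) ** 2      # calibration error in this bin
    resolution += n_k / n * (obs_freq - base_rate) ** 2   # spread of bin frequencies

uncertainty = base_rate * (1 - base_rate)
brier = np.mean((p_hat - y) ** 2)

print(f"Brier score:                     {brier:.6f}")
print(f"Reliability - Resolution + Unc.: {reliability - resolution + uncertainty:.6f}")
```

With continuous predictions grouped into wider bins, the two printed numbers would differ slightly because of within-bin variation in the forecasts.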

Reliability Diagrams

Definition: Reliability Diagram

A reliability diagram plots the mean predicted probability (x-axis) against the observed frequency (y-axis) across $K$ bins of predictions. A perfectly calibrated model traces the 45-degree diagonal $y = x$. Points systematically above the diagonal indicate underestimation (events occur more often than predicted); points below indicate overestimation.

The construction procedure:

Bin predictions into $K$ intervals (e.g., $[0, 0.1), [0.1, 0.2), \ldots, [0.9, 1.0]$)
For each bin $k$, compute $\bar{p}_k = \frac{1}{n_k}\sum_{i \in \text{bin } k} \hat{p}_i$
Compute $\bar{y}_k = \frac{1}{n_k}\sum_{i \in \text{bin } k} y_i$
Plot $(\bar{p}_k, \bar{y}_k)$ for each bin and connect with lines
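The construction steps above can be sketched directly in NumPy. The helper name `reliability_bins` is hypothetical, and the data are simulated to be calibrated by construction, so the two columns it returns should roughly agree.

```python
import numpy as np

def reliability_bins(y_true, p_hat, n_bins=10):
    """Return per-bin mean prediction, observed frequency, and count (empty bins dropped)."""
    y_true = np.asarray(y_true, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Use interior edges only, so that p = 1.0 lands in the last bin [0.9, 1.0]
    bin_idx = np.digitize(p_hat, edges[1:-1])
    counts = np.bincount(bin_idx, minlength=n_bins)
    sum_p = np.bincount(bin_idx, weights=p_hat, minlength=n_bins)
    sum_y = np.bincount(bin_idx, weights=y_true, minlength=n_bins)
    keep = counts > 0
    return sum_p[keep] / counts[keep], sum_y[keep] / counts[keep], counts[keep]

# Simulated example: outcomes drawn from the predictions themselves
rng = np.random.default_rng(2)
p_hat = rng.uniform(0, 1, size=10_000)
y = rng.binomial(1, p_hat)

mean_pred, obs_freq, counts = reliability_bins(y, p_hat)
for p_bar, y_bar, n_k in zip(mean_pred, obs_freq, counts):
    print(f"mean predicted = {p_bar:.3f}   observed = {y_bar:.3f}   n = {n_k}")
```

Plotting `mean_pred` against `obs_freq` gives the reliability diagram; the bin counts are worth showing alongside it, since sparsely populated bins produce noisy points.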

Hosmer-Lemeshow Test

Theorem: Hosmer-Lemeshow Test Statistic

Divide the sample into $G$ groups (typically deciles) ordered by predicted probability. The test statistic is:

$$C = \sum_{g=1}^{G} \frac{(O_g - n_g \bar{p}_g)^2}{n_g \bar{p}_g (1 - \bar{p}_g)}$$

where $O_g$ is the observed number of events in group $g$, $n_g$ is the group size, and $\bar{p}_g$ is the mean predicted probability in group $g$.

Under $H_0$ (the model is calibrated), $C$ asymptotically follows a $\chi^2_{G-2}$ distribution when the test is applied to the sample used to fit the model. A small p-value indicates lack of fit; a large p-value means only that no miscalibration was detected.

Limitations of Hosmer-Lemeshow

The test is sensitive to the number of bins: too few bins lose power; too many bins create sparse groups
It is an overall test and does not identify where miscalibration occurs
Power depends on sample size: very large samples can reject trivial miscalibration
Use the test alongside visual reliability diagrams, not as a sole diagnostic

Calibration Methods

Platt Scaling

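Definition: Platt Scaling

Platt scaling fits a logistic regression to the outputs of a classifier. Given raw scores $f(x)$, it estimates:

$$P(Y=1 \mid f) = \sigma(Af + B)$$

where $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid function and the parameters $A$ and $B$ are estimated via maximum likelihood on a held-out calibration set. With this sign convention $A$ is typically positive, so that higher scores map to higher probabilities; Platt's original formulation writes the model as $1/(1 + e^{Af + B})$, in which case $A < 0$. Platt scaling assumes the calibration function is logistic in form.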
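A minimal sketch of Platt scaling, assuming simulated raw scores and using a one-feature `LogisticRegression` as the fitting routine. The true parameters (1.5 and -0.5) are arbitrary illustration values, and the large `C` is there only to make scikit-learn's default L2 penalty negligible. Platt's original method also smooths the 0/1 targets slightly, which this sketch omits.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(3)

# Hypothetical raw classifier scores on an arbitrary scale (e.g. margins)
scores = rng.normal(0.0, 2.0, size=4_000)
# Simulated truth: P(Y=1 | f) = sigmoid(1.5 * f - 0.5)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(1.5 * scores - 0.5))))

# First half fits the calibration map, second half evaluates it
s_cal, s_eval = scores[:2_000], scores[2_000:]
y_cal, y_eval = y[:2_000], y[2_000:]

# Logistic regression on the score alone estimates A (slope) and B (intercept)
platt = LogisticRegression(C=1e6).fit(s_cal.reshape(-1, 1), y_cal)
A, B = platt.coef_[0, 0], platt.intercept_[0]

p_platt = platt.predict_proba(s_eval.reshape(-1, 1))[:, 1]
p_naive = 1.0 / (1.0 + np.exp(-s_eval))  # squashing the raw score without fitting

print(f"Estimated A = {A:.3f}, B = {B:.3f}")
print(f"Brier (naive sigmoid): {brier_score_loss(y_eval, p_naive):.4f}")
print(f"Brier (Platt scaled):  {brier_score_loss(y_eval, p_platt):.4f}")
```

The estimated slope comes out positive here, matching the $\sigma(Af + B)$ convention in which higher scores mean higher probability.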

Isotonic Regression

Definition: Isotonic Regression for Calibration

Isotonic regression fits a non-decreasing step function $\hat{m}(\hat{p})$ to the mapping from predicted probabilities to observed outcomes by minimizing:

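$$\min_{m} \sum_{i=1}^{N} w_i (y_i - m(\hat{p}_i))^2 \quad \text{subject to } m(\hat{p}_1) \leq m(\hat{p}_2) \leq \cdots \leq m(\hat{p}_N)$$

where the observations are indexed so that $\hat{p}_1 \leq \hat{p}_2 \leq \cdots \leq \hat{p}_N$ and the $w_i$ are optional observation weights (all equal to 1 by default).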

This is a non-parametric method that makes no assumptions about the functional form of miscalibration, making it more flexible than Platt scaling.
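To see the flexibility in practice, here is a minimal sketch using scikit-learn's `IsotonicRegression`. The overconfident predictions are simulated (the true event probability is taken to be the square of the predicted one, an assumption made purely for illustration), and the map is fit on one half of the data and evaluated on the other.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(4)

# Hypothetical overconfident model: true probability is p_hat ** 2
p_hat = rng.uniform(0, 1, size=5_000)
y = rng.binomial(1, p_hat ** 2)

p_cal, p_eval = p_hat[:2_500], p_hat[2_500:]
y_cal, y_eval = y[:2_500], y[2_500:]

# Non-decreasing step function, clipped to [0, 1]; inputs outside the
# range seen during fitting are mapped to the nearest fitted value
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(p_cal, y_cal)
p_iso = iso.predict(p_eval)

brier_raw = brier_score_loss(y_eval, p_eval)
brier_iso = brier_score_loss(y_eval, p_iso)
print(f"Brier (raw):      {brier_raw:.4f}")
print(f"Brier (isotonic): {brier_iso:.4f}")
```

Because the fitted function is piecewise constant, distinct raw predictions can map to the same calibrated value, which introduces ties in the ranking.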

Platt vs. Isotonic

Platt scaling: parametric, stable with small calibration sets, assumes logistic form
Isotonic regression: non-parametric, requires more data, can fit arbitrary calibration shapes
As a rule of thumb, with larger calibration sets (roughly $n > 1000$) isotonic regression often matches or outperforms Platt scaling; with less data it is prone to overfitting

Python Implementation

Calibration Analysis with sklearn

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

# Generate an imbalanced dataset (about 70% negatives) with 5% label noise
X, y = make_classification(n_samples=5000, n_features=20,
                           n_informative=10, n_redundant=5,
                           weights=[0.7], flip_y=0.05, random_state=42)

# Three-way split: 60% train, 20% calibration, 20% test.
# The calibration map must not be fit on the data used for evaluation.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_cal, X_test, y_cal, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

# Uncalibrated models
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

lr_probs = lr.predict_proba(X_test)[:, 1]
rf_probs = rf.predict_proba(X_test)[:, 1]

# Calibrated models: cv='prefit' reuses the fitted model and learns only the
# calibration map from (X_cal, y_cal). scikit-learn 1.6+ deprecates 'prefit'
# in favor of wrapping the model in sklearn.frozen.FrozenEstimator.
lr_cal = CalibratedClassifierCV(lr, method='sigmoid', cv='prefit').fit(X_cal, y_cal)
rf_cal = CalibratedClassifierCV(rf, method='isotonic', cv='prefit').fit(X_cal, y_cal)

lr_cal_probs = lr_cal.predict_proba(X_test)[:, 1]
rf_cal_probs = rf_cal.predict_proba(X_test)[:, 1]

# Brier scores on the untouched test set
print("Brier Scores:")
print(f"  LR (uncalibrated): {brier_score_loss(y_test, lr_probs):.4f}")
print(f"  LR (Platt):        {brier_score_loss(y_test, lr_cal_probs):.4f}")
print(f"  RF (uncalibrated): {brier_score_loss(y_test, rf_probs):.4f}")
print(f"  RF (isotonic):     {brier_score_loss(y_test, rf_cal_probs):.4f}")

# Reliability diagram
fig, ax = plt.subplots(figsize=(8, 6))
for name, probs in [("LR (raw)", lr_probs), ("LR (Platt)", lr_cal_probs),
                     ("RF (raw)", rf_probs), ("RF (isotonic)", rf_cal_probs)]:
    # calibration_curve returns (observed fraction of positives, mean predicted probability)
    fraction_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
    ax.plot(mean_pred, fraction_pos, 'o-', label=name)

ax.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
ax.set_xlabel('Mean predicted probability')
ax.set_ylabel('Fraction of positives')
ax.set_title('Calibration Curves')
ax.legend(loc='lower right')
plt.tight_layout()
plt.savefig('calibration_curves.png', dpi=150)
plt.show()

Hosmer-Lemeshow Test Implementation

import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

def hosmer_lemeshow_test(y_true, y_pred_prob, n_groups=10):
    """Hosmer-Lemeshow goodness-of-fit test.

    Returns the chi-square statistic, its degrees of freedom (n_groups - 2),
    and the p-value.
    """
    # Sort observations by predicted probability
    order = np.argsort(y_pred_prob)
    y_true_sorted = np.asarray(y_true)[order]
    y_pred_sorted = np.asarray(y_pred_prob)[order]

    # Create equal-frequency bins (deciles when n_groups=10)
    bins = np.array_split(np.arange(len(y_true_sorted)), n_groups)

    chi2_stat = 0.0
    for idx in bins:
        n_g = len(idx)                      # group size
        o_g = y_true_sorted[idx].sum()      # observed events
        p_bar = y_pred_sorted[idx].mean()   # mean predicted probability
        # Skip groups whose mean prediction is exactly 0 or 1 (zero variance)
        if p_bar * (1 - p_bar) > 0:
            chi2_stat += (o_g - n_g * p_bar)**2 / (n_g * p_bar * (1 - p_bar))

    df = n_groups - 2
    # The survival function is more accurate than 1 - cdf for small p-values
    p_value = stats.chi2.sf(chi2_stat, df)
    return chi2_stat, df, p_value

# Example usage: test a logistic regression on the sample it was fitted to
np.random.seed(42)
n = 500
X_dummy = np.random.randn(n, 5)
y_dummy = (X_dummy[:, 0] + 0.5 * X_dummy[:, 1] + np.random.randn(n) * 0.8 > 0).astype(int)
model = LogisticRegression().fit(X_dummy, y_dummy)
probs = model.predict_proba(X_dummy)[:, 1]

chi2, df, p = hosmer_lemeshow_test(y_dummy, probs, n_groups=10)
print(f"Hosmer-Lemeshow statistic: {chi2:.4f}")
print(f"Degrees of freedom: {df}")
print(f"P-value: {p:.4f}")
# Failing to reject H0 is not proof of calibration, only absence of evidence against it
print("No evidence of miscalibration at the 5% level" if p > 0.05 else "Poor calibration detected")

Calibration in Practice

Best Practices for Calibration

Always use a held-out calibration set — never calibrate on training data
Use cross-validation for small datasets — CalibratedClassifierCV with cv=5
Inspect reliability diagrams visually — statistical tests alone are insufficient
Re-calibrate periodically — model drift degrades calibration over time
Calibrate before combining models — ensemble methods benefit from calibrated base learners
Consider the application: medical decision-making requires well-calibrated probabilities, whereas tasks that only rank cases or assign class labels depend mainly on discrimination and are less sensitive to calibration
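For the small-dataset case mentioned above, a minimal sketch of cross-validated calibration with `CalibratedClassifierCV` and `cv=5` might look as follows. The synthetic dataset and the choice of `GaussianNB` as the base model are assumptions made for illustration; no held-out calibration set is needed because each fold's calibration map is fit on data its base model did not see.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Small synthetic dataset with redundant features
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Uncalibrated baseline
raw = GaussianNB().fit(X_train, y_train)

# 5-fold calibration: each fold trains the base model on 4/5 of the training
# data and fits the sigmoid map on the remaining 1/5; at prediction time the
# five calibrated models are averaged
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

raw_probs = raw.predict_proba(X_test)[:, 1]
cal_probs = calibrated.predict_proba(X_test)[:, 1]

print(f"Brier (raw):        {brier_score_loss(y_test, raw_probs):.4f}")
print(f"Brier (calibrated): {brier_score_loss(y_test, cal_probs):.4f}")
```

Whether calibration improves the Brier score depends on the model and data, so the comparison on a separate test set is the part worth keeping in any real workflow.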

Key Takeaways

Summary: Calibration and Model Checking

Calibration means predicted probabilities match observed frequencies
Brier score measures overall prediction quality and decomposes into reliability, resolution, and uncertainty
Reliability diagrams are the gold-standard visual diagnostic for calibration
Hosmer-Lemeshow test provides a formal test but should complement, not replace, visual inspection
Platt scaling (sigmoid) and isotonic regression are the two main post-hoc calibration methods
Use Platt scaling for small calibration sets; isotonic regression for large ones
Calibration is critical in medical, financial, and policy applications where probabilities drive decisions
