
Calibration and Model Checking

Ensuring Your Predictions Match Reality

Calibration is the bridge between raw model outputs and trustworthy predictions: it ensures that when a model says "80% probability," events actually occur about 80% of the time. In medicine, finance, and weather forecasting, miscalibrated probabilities can lead to disastrous decisions.

  • Medical diagnosis: calibrated risk scores enable clinicians to trust and act on predicted probabilities
  • Weather forecasting: calibration ensures probabilistic forecasts are reliable and actionable
  • Machine learning deployment: well-calibrated models produce outputs that can be interpreted as true probabilities

Calibration transforms opaque model scores into probabilities you can stake decisions on.


What Is Calibration?

Definition: Calibration

A probabilistic classifier is calibrated if, for any predicted probability $p$, the true probability of the positive class among all instances predicted with probability $p$ is approximately $p$. Formally, a model $\hat{f}(x) = \hat{P}(Y=1 \mid X=x)$ is calibrated if:

$$P(Y=1 \mid \hat{f}(X)=p) = p \quad \text{for all } p \in [0, 1]$$

Calibration is distinct from discrimination. A model can have excellent discrimination (high AUC) yet be poorly calibrated, producing predicted probabilities that are systematically too high or too low.
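To make the distinction concrete, here is a minimal sketch on simulated data (the data, the cubic distortion, and the helper `auc_by_ranks` are illustrative assumptions, not part of the lesson). A strictly monotone transform of calibrated probabilities preserves the ranking, so the AUC is unchanged, while the Brier score gets worse.

```python
import numpy as np

def auc_by_ranks(y_true, scores):
    """AUC via the Mann-Whitney rank statistic (assumes no tied scores)."""
    y_true = np.asarray(y_true)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
p_true = rng.uniform(0.05, 0.95, size=20_000)        # calibrated by construction
y = (rng.uniform(size=p_true.size) < p_true).astype(int)

p_distorted = p_true ** 3                            # same ordering, wrong probabilities

auc_cal = auc_by_ranks(y, p_true)
auc_dist = auc_by_ranks(y, p_distorted)
brier_cal = np.mean((p_true - y) ** 2)
brier_dist = np.mean((p_distorted - y) ** 2)

print(f"AUC:   calibrated={auc_cal:.4f}  distorted={auc_dist:.4f}")
print(f"Brier: calibrated={brier_cal:.4f}  distorted={brier_dist:.4f}")
```

The two AUC values agree because AUC depends only on the ordering of the scores, which is why a ranking metric alone cannot reveal miscalibration.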


Brier Score: The Decomposition

The Brier score measures the mean squared error between predicted probabilities and observed outcomes:

Brier Score

$$\text{BS} = \frac{1}{N} \sum_{i=1}^{N} (\hat{p}_i - y_i)^2$$

Here,

  • $\hat{p}_i$ = predicted probability for observation $i$
  • $y_i$ = observed binary outcome (0 or 1)
  • $N$ = total number of observations
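As a small worked instance of the formula, the sketch below scores four made-up predictions and compares them with a constant forecast equal to the base rate. The numbers are purely illustrative.

```python
import numpy as np

p_hat = np.array([0.9, 0.6, 0.3, 0.1])   # hypothetical predicted probabilities
y = np.array([1, 1, 0, 0])               # hypothetical observed outcomes

# Mean of the squared errors: (0.01 + 0.16 + 0.09 + 0.01) / 4
brier = np.mean((p_hat - y) ** 2)

# Reference point: always predicting the base rate (0.5 here)
brier_baseline = np.mean((y.mean() - y) ** 2)

print(f"Model Brier score:     {brier:.4f}")
print(f"Base-rate Brier score: {brier_baseline:.4f}")
```

Lower is better: 0 is a perfect score, and a forecast that cannot beat the base-rate reference adds no information.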

Decomposition of the Brier Score

The Brier score can be decomposed into three components:

$$\text{BS} = \text{Reliability} - \text{Resolution} + \text{Uncertainty}$$

where:

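  • Reliability = weighted average, over probability bins, of the squared difference between the mean predicted probability and the observed frequency in each bin
  • Resolution = weighted average squared difference between each bin's observed frequency and the overall base rate (how far the observed frequencies spread away from the base rate)
  • Uncertainty = variance of the outcome, $\bar{y}(1 - \bar{y})$, which depends only on the base rate and not on the model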

A perfectly calibrated model has reliability = 0.
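The decomposition can be checked numerically. The sketch below assumes forecasts that take only a few distinct values, so each distinct value forms its own bin and the identity holds exactly; with continuous forecasts grouped into wider bins it holds only approximately. The function name `brier_decomposition` and the simulated data are assumptions for illustration.

```python
import numpy as np

def brier_decomposition(p_hat, y):
    """Return (reliability, resolution, uncertainty), one bin per distinct forecast."""
    p_hat = np.asarray(p_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    base_rate = y.mean()
    reliability = 0.0
    resolution = 0.0
    for p in np.unique(p_hat):
        mask = p_hat == p
        n_k = mask.sum()
        obs_k = y[mask].mean()                      # observed frequency in this bin
        reliability += n_k * (p - obs_k) ** 2
        resolution += n_k * (obs_k - base_rate) ** 2
    return reliability / n, resolution / n, base_rate * (1 - base_rate)

rng = np.random.default_rng(1)
p_hat = rng.choice([0.1, 0.3, 0.5, 0.7, 0.9], size=2000)
# Outcomes drawn from p_hat ** 1.5, so the forecasts are deliberately too high
y = (rng.uniform(size=2000) < p_hat ** 1.5).astype(int)

rel, res, unc = brier_decomposition(p_hat, y)
brier = np.mean((p_hat - y) ** 2)

print(f"Brier score:       {brier:.4f}")
print(f"REL - RES + UNC:   {rel - res + unc:.4f}")
print(f"Reliability alone: {rel:.4f}")
```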


Reliability Diagrams

Definition: Reliability Diagram

A reliability diagram plots the mean predicted probability (x-axis) against the observed frequency (y-axis) across $K$ bins of predictions. A perfectly calibrated model traces the 45-degree diagonal $y = x$. Systematic deviations above the diagonal indicate underestimation; deviations below indicate overestimation.

The construction procedure:

  1. Bin predictions into $K$ intervals (e.g., $[0, 0.1), [0.1, 0.2), \ldots, [0.9, 1.0]$)
  2. For each bin $k$ containing $n_k$ predictions, compute $\bar{p}_k = \frac{1}{n_k}\sum_{i \in \text{bin } k} \hat{p}_i$
  3. Compute $\bar{y}_k = \frac{1}{n_k}\sum_{i \in \text{bin } k} y_i$
  4. Plot $(\bar{p}_k, \bar{y}_k)$ for each bin and connect with lines
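The binning steps above can be sketched in a few lines of NumPy. The helper name `reliability_bins` and the toy inputs are assumptions for illustration; empty bins are simply skipped.

```python
import numpy as np

def reliability_bins(p_hat, y, n_bins=10):
    """Return (mean predicted, observed frequency, count) for each non-empty bin."""
    p_hat = np.asarray(p_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Use interior edges only, so that p = 1.0 falls into the last bin
    bin_ids = np.digitize(p_hat, edges[1:-1])
    rows = []
    for k in range(n_bins):
        mask = bin_ids == k
        if mask.any():
            rows.append((p_hat[mask].mean(), y[mask].mean(), int(mask.sum())))
    return rows

p_hat = [0.05, 0.15, 0.12, 0.95, 1.0]
y = [0, 0, 1, 1, 1]
rows = reliability_bins(p_hat, y)
for mean_pred, obs_freq, count in rows:
    print(f"mean predicted={mean_pred:.3f}  observed={obs_freq:.3f}  n={count}")
```

Plotting the first two columns against each other gives the reliability diagram; the count column is worth keeping, since points from sparsely populated bins are noisy.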

Hosmer-Lemeshow Test

Theorem: Hosmer-Lemeshow Test Statistic

Divide the sample into $G$ groups (typically deciles) ordered by predicted probability. The test statistic is:

$$C = \sum_{g=1}^{G} \frac{(O_g - n_g \bar{p}_g)^2}{n_g \bar{p}_g (1 - \bar{p}_g)}$$

where $O_g$ is the observed number of events in group $g$, $n_g$ is the group size, and $\bar{p}_g$ is the mean predicted probability in group $g$.

Under $H_0$ (calibration), $C \sim \chi^2_{G-2}$ asymptotically. A significant p-value indicates lack of fit.

Limitations of Hosmer-Lemeshow

  • The test is sensitive to the number of bins: too few bins lose power; too many bins create sparse groups
  • It is an overall test and does not identify where miscalibration occurs
  • Power depends on sample size: very large samples can reject trivial miscalibration
  • Use the test alongside visual reliability diagrams, not as a sole diagnostic

Calibration Methods

Platt Scaling

Definition: Platt Scaling

Platt scaling fits a logistic regression to the outputs of a classifier. Given raw scores $f(x)$, it estimates:

$$P(Y=1 \mid f) = \sigma(Af + B)$$

where $\sigma$ is the sigmoid function and the parameters $A$ and $B$ are estimated via maximum likelihood on a held-out calibration set. With this parameterization, $A > 0$ whenever higher scores indicate the positive class (Platt's original form, $1 / (1 + e^{Af + B})$, has $A < 0$ instead). Platt scaling assumes the calibration function is logistic in form.
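A minimal sketch of the fitting step, assuming simulated scores and using Newton's method on the log-likelihood. The function `fit_platt` is a hypothetical helper, and it omits the target smoothing used in Platt's original procedure. Because it uses the $\sigma(Af + B)$ form with scores that increase with the positive class, the fitted $A$ comes out positive.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_platt(scores, y, n_iter=50):
    """Maximum-likelihood fit of sigma(A * f + B) by Newton's method."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([scores, np.ones_like(scores)])   # columns: f, intercept
    theta = np.zeros(2)                                   # [A, B]
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y)                              # gradient of the negative log-likelihood
        hess = X.T @ (X * (p * (1 - p))[:, None])         # its Hessian
        theta -= np.linalg.solve(hess + 1e-8 * np.eye(2), grad)
    return theta

# Simulated calibration set: the true mapping is sigma(2 f - 1)
rng = np.random.default_rng(2)
scores = rng.normal(size=5000)
y = (rng.uniform(size=5000) < sigmoid(2.0 * scores - 1.0)).astype(int)

A, B = fit_platt(scores, y)
calibrated = sigmoid(A * scores + B)
print(f"A = {A:.3f}, B = {B:.3f}")
```

In practice the scores and labels would come from a held-out calibration set rather than a simulation, and the fitted pair is then applied to new scores.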

Isotonic Regression

Definition: Isotonic Regression for Calibration

Isotonic regression fits a non-decreasing step function $\hat{m}(\hat{p})$ to the mapping from predicted probabilities to observed outcomes by minimizing:

$$\min_{m} \sum_{i=1}^{N} w_i (y_i - m(\hat{p}_i))^2 \quad \text{subject to } m(\hat{p}_{(1)}) \leq m(\hat{p}_{(2)}) \leq \cdots \leq m(\hat{p}_{(N)})$$

where $\hat{p}_{(1)} \leq \hat{p}_{(2)} \leq \cdots \leq \hat{p}_{(N)}$ are the predictions sorted in increasing order and $w_i$ are observation weights.

This is a non-parametric method that makes no assumptions about the functional form of miscalibration, making it more flexible than Platt scaling.
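This least-squares problem is commonly solved with the pool-adjacent-violators (PAV) algorithm. The sketch below is a plain implementation for illustration; the function name `pav_isotonic` and the toy data are assumptions, and it returns fitted values only at the training points rather than a full interpolating function.

```python
import numpy as np

def pav_isotonic(y_sorted, weights=None):
    """Non-decreasing least-squares fit to outcomes ordered by predicted probability."""
    y_sorted = np.asarray(y_sorted, dtype=float)
    w = np.ones_like(y_sorted) if weights is None else np.asarray(weights, dtype=float)
    blocks = []   # each block: [weighted mean, total weight, number of points]
    for value, weight in zip(y_sorted, w):
        blocks.append([value, weight, 1])
        # Pool adjacent blocks while they violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, c2 = blocks.pop()
            v1, w1, c1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    return np.concatenate([np.full(c, v) for v, _, c in blocks])

p_hat = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
y = np.array([0, 0, 1, 0, 0, 1, 1, 1])

order = np.argsort(p_hat)
fitted = pav_isotonic(y[order])
print(fitted)   # [0, 0, 1/3, 1/3, 1/3, 1, 1, 1]
```

The out-of-order outcomes at positions three to five are pooled into a single step at their mean, which is how the step-function shape arises.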

Platt vs. Isotonic

  • Platt scaling: parametric, stable with small calibration sets, assumes logistic form
  • Isotonic regression: non-parametric, requires more data, can fit arbitrary calibration shapes
  • For large calibration sets (roughly $n > 1000$), isotonic regression typically outperforms Platt scaling

Python Implementation

Calibration Analysis with sklearn

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

np.random.seed(42)

# Generate imbalanced dataset
X, y = make_classification(n_samples=5000, n_features=20,
                           n_informative=10, n_redundant=5,
                           weights=[0.7], flip_y=0.05, random_state=42)
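# Three-way split: train the models, fit the calibrators, and evaluate on
# separate data so the reported scores are not optimistically biased
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_cal, X_test, y_cal, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=42)

# Uncalibrated models
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

lr_probs = lr.predict_proba(X_test)[:, 1]
rf_probs = rf.predict_proba(X_test)[:, 1]

# Calibrated models, fitted on the calibration split only.
# Note: cv='prefit' is deprecated in recent scikit-learn releases; there,
# wrap the fitted model in sklearn.frozen.FrozenEstimator instead.
lr_cal = CalibratedClassifierCV(lr, method='sigmoid', cv='prefit').fit(X_cal, y_cal)
rf_cal = CalibratedClassifierCV(rf, method='isotonic', cv='prefit').fit(X_cal, y_cal)

lr_cal_probs = lr_cal.predict_proba(X_test)[:, 1]
rf_cal_probs = rf_cal.predict_proba(X_test)[:, 1]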

# Brier scores
print("Brier Scores:")
print(f"  LR (uncalibrated): {brier_score_loss(y_test, lr_probs):.4f}")
print(f"  LR (Platt):        {brier_score_loss(y_test, lr_cal_probs):.4f}")
print(f"  RF (uncalibrated): {brier_score_loss(y_test, rf_probs):.4f}")
print(f"  RF (isotonic):     {brier_score_loss(y_test, rf_cal_probs):.4f}")

# Reliability diagram
fig, ax = plt.subplots(figsize=(8, 6))
for name, probs in [("LR (raw)", lr_probs), ("LR (Platt)", lr_cal_probs),
                     ("RF (raw)", rf_probs), ("RF (isotonic)", rf_cal_probs)]:
    fraction_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
    ax.plot(mean_pred, fraction_pos, 'o-', label=name)

ax.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
ax.set_xlabel('Mean predicted probability')
ax.set_ylabel('Fraction of positives')
ax.set_title('Calibration Curves')
ax.legend(loc='lower right')
plt.tight_layout()
plt.savefig('calibration_curves.png', dpi=150)
plt.show()

Hosmer-Lemeshow Test Implementation

import numpy as np
from scipy import stats

def hosmer_lemeshow_test(y_true, y_pred_prob, n_groups=10):
    """Hosmer-Lemeshow goodness-of-fit test."""
    order = np.argsort(y_pred_prob)
    y_true_sorted = np.asarray(y_true)[order]
    y_pred_sorted = np.asarray(y_pred_prob)[order]

    # Create equal-frequency bins
    bins = np.array_split(np.arange(len(y_true_sorted)), n_groups)

    chi2_stat = 0.0
    for idx in bins:
        n_g = len(idx)
        o_g = y_true_sorted[idx].sum()
        p_bar = y_pred_sorted[idx].mean()
        if p_bar * (1 - p_bar) > 0:
            chi2_stat += (o_g - n_g * p_bar)**2 / (n_g * p_bar * (1 - p_bar))

    df = n_groups - 2
    p_value = stats.chi2.sf(chi2_stat, df)  # upper-tail probability, more accurate than 1 - cdf
    return chi2_stat, df, p_value

# Example usage
from sklearn.linear_model import LogisticRegression

np.random.seed(42)
n = 500
X_dummy = np.random.randn(n, 5)
y_dummy = (X_dummy[:, 0] + 0.5 * X_dummy[:, 1] + np.random.randn(n) * 0.8 > 0).astype(int)
model = LogisticRegression().fit(X_dummy, y_dummy)
probs = model.predict_proba(X_dummy)[:, 1]

chi2, df, p = hosmer_lemeshow_test(y_dummy, probs, n_groups=10)
print(f"Hosmer-Lemeshow statistic: {chi2:.4f}")
print(f"Degrees of freedom: {df}")
print(f"P-value: {p:.4f}")
# A large p-value means no evidence of miscalibration; it does not prove calibration
print("No evidence of miscalibration" if p > 0.05 else "Poor calibration detected")

Calibration in Practice

Best Practices for Calibration

  1. Always use a held-out calibration set: never fit the calibrator on the data used to train the model
  2. Use cross-validation for small datasets, e.g. CalibratedClassifierCV with cv=5
  3. Inspect reliability diagrams visually: statistical tests alone are insufficient
  4. Re-calibrate periodically: model drift degrades calibration over time
  5. Calibrate before combining models: ensemble methods benefit from calibrated base learners
  6. Consider the application: medical decision-making requires well-calibrated probabilities, whereas tasks that only rank cases or apply a fixed classification threshold depend mainly on discrimination
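For the cross-validation practice in item 2, a minimal sketch with scikit-learn might look as follows. The dataset, model choice, and hyperparameters are illustrative assumptions. With an integer `cv`, `CalibratedClassifierCV` refits the base model on each training fold and fits a calibrator on the matching held-out fold, so no separate calibration split is needed.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Small synthetic dataset, standing in for a case where data is scarce
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Unfitted base model: each of the 5 folds trains its own copy
base = RandomForestClassifier(n_estimators=100, random_state=0)
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

# Predictions average the per-fold calibrated models (default ensemble behaviour)
probs = calibrated.predict_proba(X_test)[:, 1]
brier = brier_score_loss(y_test, probs)
print(f"Brier score on the test split: {brier:.4f}")
```

The test split is still kept apart, since it is used only to evaluate the calibrated model.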

Key Takeaways

Summary: Calibration and Model Checking

  • Calibration means predicted probabilities match observed frequencies
  • Brier score measures overall prediction quality and decomposes into reliability, resolution, and uncertainty
  • Reliability diagrams are the gold-standard visual diagnostic for calibration
  • Hosmer-Lemeshow test provides a formal test but should complement, not replace, visual inspection
  • Platt scaling (sigmoid) and isotonic regression are the two main post-hoc calibration methods
  • Use Platt scaling for small calibration sets; isotonic regression for large ones
  • Calibration is critical in medical, financial, and policy applications where probabilities drive decisions
