MCMC Diagnostics and Convergence

Advanced Statistical Methods

Knowing When Your MCMC Has Converged

MCMC diagnostics verify that Markov chain samples have converged to the target posterior distribution, preventing invalid Bayesian inference. Tools like R-hat, effective sample size, and trace plots are essential safeguards.

Bayesian modeling — Validate that posterior samples accurately represent the target distribution
Pharmacokinetics — Ensure reliable parameter estimates from complex hierarchical models
Computational biology — Confirm MCMC chains explore the full posterior in high-dimensional problems

Diagnostics are the quality control of Bayesian computation — never trust a chain you haven't checked.

Markov Chain Monte Carlo (MCMC) methods generate samples from posterior distributions that cannot be evaluated analytically. Proper diagnostics are essential to ensure the samples are valid representations of the target distribution.

MCMC Fundamentals

DfMarkov Chain Monte Carlo

MCMC constructs a Markov chain $\{X^{(t)}\}_{t=0}^{\infty}$ whose stationary distribution is the target posterior $p(\boldsymbol{\theta} \mid \mathcal{D})$ . After sufficient iterations (burn-in), samples $\{X^{(t)}\}_{t=B}^{T}$ approximate draws from the posterior.

The chain satisfies the detailed balance condition:

\pi(\mathbf{x}) \, T(\mathbf{x} \to \mathbf{x}') = \pi(\mathbf{x}') \, T(\mathbf{x}' \to \mathbf{x})

where $\pi(\mathbf{x})$ is the target density and $T$ is the transition kernel.

Why MCMC?

Most posterior distributions cannot be sampled directly (they are not standard distributions) and cannot be normalized (the evidence integral is intractable). MCMC circumvents both problems by constructing a Markov chain that only requires evaluating the posterior up to a normalizing constant.

Metropolis-Hastings Algorithm

ThMetropolis-Hastings Algorithm

At each iteration $t$ :

Propose $\mathbf{x}' \sim q(\mathbf{x}' \mid \mathbf{x}^{(t)})$ from a proposal distribution
Compute acceptance ratio:

\alpha = \min\left(1, \frac{\pi(\mathbf{x}') \, q(\mathbf{x}^{(t)} \mid \mathbf{x}')}{\pi(\mathbf{x}^{(t)}) \, q(\mathbf{x}' \mid \mathbf{x}^{(t)})}\right)

Accept with probability $\alpha$ : $\mathbf{x}^{(t+1)} = \mathbf{x}'$ ; otherwise $\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)}$

The chain converges to $\pi$ regardless of the choice of $q$ (subject to regularity conditions).

Proposal Tuning

The proposal distribution $q$ must be tuned for efficiency. Too narrow: high acceptance but slow exploration (high autocorrelation). Too wide: low acceptance, many rejected proposals. The optimal acceptance rate is approximately 23.4% for high-dimensional Gaussian targets (Roberts, Gelman, & Gilks, 1997).

Gibbs Sampling

DfGibbs Sampling

Gibbs sampling is a special case of Metropolis-Hastings where each component is updated conditionally:

x_1^{(t+1)} \sim p(x_1 \mid x_2^{(t)}, x_3^{(t)}, \ldots, x_k^{(t)})

x_2^{(t+1)} \sim p(x_2 \mid x_1^{(t+1)}, x_3^{(t)}, \ldots, x_k^{(t)})

\vdots

x_k^{(t+1)} \sim p(x_k \mid x_1^{(t+1)}, x_2^{(t+1)}, \ldots, x_{k-1}^{(t+1)})

Each update is accepted with probability 1 (no rejection step), but requires sampling from the full conditional distributions.

Gibbs vs. Metropolis-Hastings

Gibbs is efficient when full conditionals are easy to sample from. When they are not available or expensive, Metropolis-Hastings with a flexible proposal is used. Modern samplers (NUTS, HMC) combine both approaches.

Convergence Diagnostics

R-hat (Gelman-Rubin Statistic)

R-hat (Potential Scale Reduction Factor)

\hat{R} = \sqrt{\frac{\hat{\text{Var}}(\theta)}{W}}

Here,

$\hat{\text{Var}}(\theta)$ =Pooled posterior variance estimate (between + within chain variance)
$W$ =Within-chain variance averaged across M chains

More precisely:

R-hat Decomposition

\hat{R} = \sqrt{\frac{\frac{n-1}{n}W + \frac{1}{n}B}{W}}

Here,

$B = \frac{M}{M-1} \sum_{m=1}^M (\bar{\theta}_m - \bar{\theta})^2$ =Between-chain variance
$W = \frac{1}{M(n-1)} \sum_{m=1}^M \sum_{t=1}^n (\theta_m^{(t)} - \bar{\theta}_m)^2$ =Within-chain variance
$M$ =Number of chains
$n$ =Iterations per chain (after burn-in)

R-hat Threshold

A common threshold is $\hat{R} < 1.1$ (Gelman & Rubin, 1992). Modern recommendations (Vehtari et al., 2021) use $\hat{R} < 1.01$ . If $\hat{R} > 1.05$ , the chain has likely not converged — run longer or reparameterize.

Effective Sample Size (ESS)

Effective Sample Size

\text{ESS} = \frac{MN}{1 + 2 \sum_{k=1}^{\infty} \rho_k}

Here,

$M$ =Number of chains
$N$ =Total iterations per chain
$\rho_k$ =Autocorrelation at lag k

ESS measures the number of independent samples equivalent to the correlated MCMC output. High autocorrelation reduces ESS. A minimum of ESS $> 400$ is recommended for reliable posterior summaries.

Autocorrelation

Sample Autocorrelation

\hat{\rho}_k = \frac{\sum_{t=1}^{N-k}(\theta^{(t)} - \bar{\theta})(\theta^{(t+k)} - \bar{\theta})}{\sum_{t=1}^{N}(\theta^{(t)} - \bar{\theta})^2}

Here,

$\hat{\rho}_k$ =Autocorrelation at lag k
$N$ =Chain length
$\theta^{(t)}$ =Parameter value at iteration t

Autocorrelation Interpretation

Low autocorrelation (rapid decay): chain mixes well, explores efficiently
High autocorrelation (slow decay): chain is stuck, needs thinning or reparameterization
Autocorrelation should drop below $0.1$ within a few dozen lags for well-mixed chains

Thinning and Burn-in

DfBurn-in Period

The first $B$ iterations are discarded as burn-in (or warm-up). These iterations reflect the initial starting point, not the stationary distribution. Common practice: discard 50% of iterations, or use diagnostics to determine when stationarity begins.

DfThinning

Instead of keeping every iteration, retain every $k$ -th sample to reduce storage and autocorrelation. Thinning produces a chain of length $N/k$ with reduced autocorrelation but also reduces information — total ESS is unchanged.

Recommendation: Store all samples and thin only for storage constraints. Compute ESS to assess effective information content.

Modern Practice

Modern MCMC (especially HMC/NUTS) produces low-autocorrelation samples, making thinning unnecessary. Store all post-burn-in samples and compute ESS directly.

Geweke Diagnostic

ThGeweke Diagnostic

Compare the mean of the first 10% of the chain to the mean of the last 50%:

z = \frac{\bar{\theta}_{\text{first}} - \bar{\theta}_{\text{last}}}{\sqrt{\hat{S}_{\text{first}}(f_0) / n_1 + \hat{S}_{\text{last}}(f_0) / n_2}}

where $\hat{S}(f_0)$ is the spectral density at frequency zero. Under convergence, $z \sim \mathcal{N}(0, 1)$ . $|z| > 2$ suggests non-convergence.

Brooks-Gelman-Rubin Multivariate Diagnostic

Multivariate R-hat

\hat{R}_{\text{mult}} = \max_j \hat{R}_j

Here,

$\hat{R}_j$ =Univariate R-hat for parameter j

The multivariate $\hat{R}$ (Vehtari et al., 2021) monitors all parameters simultaneously and is more sensitive to divergence in high-dimensional posteriors. $\hat{R}_{\text{mult}} < 1.01$ is the recommended threshold.

Divergent Transitions (HMC/NUTS)

Divergent Transitions

When using Hamiltonian Monte Carlo (HMC) or NUTS (No-U-Turn Sampler), divergent transitions indicate that the numerical integrator failed to accurately follow the Hamiltonian trajectory. Divergences signal:

Regions of high curvature that the integrator cannot handle
Potential bias in posterior estimates
Need for reparameterization (e.g., non-centered parameterization)

Zero divergences is the goal. If divergences occur, try: (a) increase target_accept, (b) use non-centered parameterization, (c) reparameterize the model.

Python Implementation

import numpy as np
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt

np.random.seed(42)

# --- Simple model for demonstration ---
n_obs = 200
X = np.random.randn(n_obs)
true_beta = np.array([1.5, -2.0])
y = true_beta[0] + true_beta[1] * X + np.random.randn(n_obs) * 0.5

with pm.Model() as demo_model:
    beta = pm.Normal('beta', mu=0, sigma=10, shape=2)
    sigma = pm.HalfNormal('sigma', sigma=2)
    mu = beta[0] + beta[1] * X
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)
    
    # Run MCMC with diagnostics
    trace = pm.sample(2000, tune=1000, chains=4, return_inferencedata=True,
                      random_seed=42)

# --- R-hat ---
print("=== R-hat ===")
print(az.rhat(trace, var_names=['beta', 'sigma']))

# --- Effective Sample Size ---
print("\n=== Effective Sample Size ===")
print(az.ess(trace, var_names=['beta', 'sigma']))

# --- Autocorrelation ---
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for i, ax in enumerate(axes.flatten()):
    if i < 2:
        az.plot_autocorr(trace, var_names=['beta'], combined=True, ax=ax)
    elif i == 2:
        az.plot_autocorr(trace, var_names=['sigma'], combined=True, ax=ax)
    else:
        az.plot_posterior(trace, var_names=['beta'], ax=ax)
plt.tight_layout()
plt.savefig('mcmc_autocorr.png', dpi=150)
plt.show()

# --- Trace plots ---
az.plot_trace(trace, var_names=['beta', 'sigma'], compact=True)
plt.tight_layout()
plt.savefig('mcmc_trace.png', dpi=150)
plt.show()

# --- Rank plots (modern diagnostic) ---
az.plot_rank(trace, var_names=['beta'])
plt.tight_layout()
plt.savefig('mcmc_rank.png', dpi=150)
plt.show()

# --- Geweke diagnostic ---
# Manual implementation for demonstration
from scipy import stats as sp_stats

def geweke_diagnostic(chain, first=0.1, last=0.5):
    n = len(chain)
    first_end = int(n * first)
    last_start = int(n * (1 - last))
    mean_first = np.mean(chain[:first_end])
    mean_last = np.mean(chain[last_start:])
    var_first = np.var(chain[:first_end]) / first_end
    var_last = np.var(chain[last_start:]) / (n - last_start)
    z = (mean_first - mean_last) / np.sqrt(var_first + var_last)
    p_value = 2 * (1 - sp_stats.norm.cdf(abs(z)))
    return z, p_value

print("\n=== Geweke Diagnostic ===")
for name in ['beta', 'sigma']:
    samples = trace.posterior[name].values.flatten()
    z, p = geweke_diagnostic(samples)
    print(f"{name}: z={z:.3f}, p={p:.3f} {'✓' if abs(z) < 2 else '✗'}")

Diagnostic Checklist

Before using MCMC samples for inference:

$\hat{R} < 1.01$ for all parameters
ESS $> 400$ for all parameters
Trace plots show no trends or drift
Autocorrelation decays rapidly
No divergent transitions (HMC/NUTS)
Posterior predictive checks pass
Rank plots are approximately uniform

Key Takeaways

Summary: MCMC Diagnostics and Convergence

MCMC generates correlated samples from the posterior — diagnostics verify the chain has converged
R-hat compares between-chain vs. within-chain variance — $\hat{R} < 1.01$ is the modern threshold
ESS counts independent-equivalent samples — ESS $> 400$ is recommended
Autocorrelation measures chain mixing — rapid decay indicates good mixing
Burn-in discards early non-stationary iterations; thinning reduces storage but not information
Geweke diagnostic compares early vs. late chain segments — $|z| < 2$ indicates stationarity
Divergent transitions in HMC/NUTS signal numerical problems — always target zero divergences
Modern practice: store all post-burn-in samples, use rank plots, and compute ESS and $\hat{R}$ together

MCMC Diagnostics and Convergence

MCMC Diagnostics and Convergence

Knowing When Your MCMC Has Converged

MCMC Fundamentals

DfMarkov Chain Monte Carlo

Metropolis-Hastings Algorithm

ThMetropolis-Hastings Algorithm

Gibbs Sampling

DfGibbs Sampling

Convergence Diagnostics

R-hat (Gelman-Rubin Statistic)

R-hat (Potential Scale Reduction Factor)

R-hat Decomposition

Effective Sample Size (ESS)

Effective Sample Size

Autocorrelation

Sample Autocorrelation

Thinning and Burn-in

DfBurn-in Period

DfThinning

Geweke Diagnostic

ThGeweke Diagnostic

Brooks-Gelman-Rubin Multivariate Diagnostic

Multivariate R-hat

Divergent Transitions (HMC/NUTS)

Python Implementation

Related Topics

Key Takeaways

Summary: MCMC Diagnostics and Convergence

Premium Content

Need Expert Statistics Help?