MCMC Diagnostics and Convergence
Advanced Statistical Methods
Knowing When Your MCMC Has Converged
MCMC diagnostics verify that Markov chain samples have converged to the target posterior distribution, preventing invalid Bayesian inference. Tools like R-hat, effective sample size, and trace plots are essential safeguards.
- Bayesian modeling — Validate that posterior samples accurately represent the target distribution
- Pharmacokinetics — Ensure reliable parameter estimates from complex hierarchical models
- Computational biology — Confirm MCMC chains explore the full posterior in high-dimensional problems
Diagnostics are the quality control of Bayesian computation — never trust a chain you haven't checked.
Markov Chain Monte Carlo (MCMC) methods generate samples from posterior distributions that cannot be evaluated analytically. Proper diagnostics are essential to ensure the samples are valid representations of the target distribution.
MCMC Fundamentals
DfMarkov Chain Monte Carlo
MCMC constructs a Markov chain whose stationary distribution is the target posterior . After sufficient iterations (burn-in), samples approximate draws from the posterior.
The chain satisfies the detailed balance condition:
where is the target density and is the transition kernel.
Why MCMC?
Most posterior distributions cannot be sampled directly (they are not standard distributions) and cannot be normalized (the evidence integral is intractable). MCMC circumvents both problems by constructing a Markov chain that only requires evaluating the posterior up to a normalizing constant.
Metropolis-Hastings Algorithm
ThMetropolis-Hastings Algorithm
At each iteration :
- Propose from a proposal distribution
- Compute acceptance ratio:
- Accept with probability : ; otherwise
The chain converges to regardless of the choice of (subject to regularity conditions).
Proposal Tuning
The proposal distribution must be tuned for efficiency. Too narrow: high acceptance but slow exploration (high autocorrelation). Too wide: low acceptance, many rejected proposals. The optimal acceptance rate is approximately 23.4% for high-dimensional Gaussian targets (Roberts, Gelman, & Gilks, 1997).
Gibbs Sampling
DfGibbs Sampling
Gibbs sampling is a special case of Metropolis-Hastings where each component is updated conditionally:
Each update is accepted with probability 1 (no rejection step), but requires sampling from the full conditional distributions.
Gibbs vs. Metropolis-Hastings
Gibbs is efficient when full conditionals are easy to sample from. When they are not available or expensive, Metropolis-Hastings with a flexible proposal is used. Modern samplers (NUTS, HMC) combine both approaches.
Convergence Diagnostics
R-hat (Gelman-Rubin Statistic)
R-hat (Potential Scale Reduction Factor)
Here,
- =Pooled posterior variance estimate (between + within chain variance)
- =Within-chain variance averaged across M chains
More precisely:
R-hat Decomposition
Here,
- =Between-chain variance
- =Within-chain variance
- =Number of chains
- =Iterations per chain (after burn-in)
R-hat Threshold
A common threshold is (Gelman & Rubin, 1992). Modern recommendations (Vehtari et al., 2021) use . If , the chain has likely not converged — run longer or reparameterize.
Effective Sample Size (ESS)
Effective Sample Size
Here,
- =Number of chains
- =Total iterations per chain
- =Autocorrelation at lag k
ESS measures the number of independent samples equivalent to the correlated MCMC output. High autocorrelation reduces ESS. A minimum of ESS is recommended for reliable posterior summaries.
Autocorrelation
Sample Autocorrelation
Here,
- =Autocorrelation at lag k
- =Chain length
- =Parameter value at iteration t
Autocorrelation Interpretation
- Low autocorrelation (rapid decay): chain mixes well, explores efficiently
- High autocorrelation (slow decay): chain is stuck, needs thinning or reparameterization
- Autocorrelation should drop below within a few dozen lags for well-mixed chains
Thinning and Burn-in
DfBurn-in Period
The first iterations are discarded as burn-in (or warm-up). These iterations reflect the initial starting point, not the stationary distribution. Common practice: discard 50% of iterations, or use diagnostics to determine when stationarity begins.
DfThinning
Instead of keeping every iteration, retain every -th sample to reduce storage and autocorrelation. Thinning produces a chain of length with reduced autocorrelation but also reduces information — total ESS is unchanged.
Recommendation: Store all samples and thin only for storage constraints. Compute ESS to assess effective information content.
Modern Practice
Modern MCMC (especially HMC/NUTS) produces low-autocorrelation samples, making thinning unnecessary. Store all post-burn-in samples and compute ESS directly.
Geweke Diagnostic
ThGeweke Diagnostic
Compare the mean of the first 10% of the chain to the mean of the last 50%:
where is the spectral density at frequency zero. Under convergence, . suggests non-convergence.
Brooks-Gelman-Rubin Multivariate Diagnostic
Multivariate R-hat
Here,
- =Univariate R-hat for parameter j
The multivariate (Vehtari et al., 2021) monitors all parameters simultaneously and is more sensitive to divergence in high-dimensional posteriors. is the recommended threshold.
Divergent Transitions (HMC/NUTS)
Divergent Transitions
When using Hamiltonian Monte Carlo (HMC) or NUTS (No-U-Turn Sampler), divergent transitions indicate that the numerical integrator failed to accurately follow the Hamiltonian trajectory. Divergences signal:
- Regions of high curvature that the integrator cannot handle
- Potential bias in posterior estimates
- Need for reparameterization (e.g., non-centered parameterization)
Zero divergences is the goal. If divergences occur, try: (a) increase target_accept, (b) use non-centered parameterization, (c) reparameterize the model.
Python Implementation
import numpy as np
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt
np.random.seed(42)
# --- Simple model for demonstration ---
n_obs = 200
X = np.random.randn(n_obs)
true_beta = np.array([1.5, -2.0])
y = true_beta[0] + true_beta[1] * X + np.random.randn(n_obs) * 0.5
with pm.Model() as demo_model:
beta = pm.Normal('beta', mu=0, sigma=10, shape=2)
sigma = pm.HalfNormal('sigma', sigma=2)
mu = beta[0] + beta[1] * X
y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)
# Run MCMC with diagnostics
trace = pm.sample(2000, tune=1000, chains=4, return_inferencedata=True,
random_seed=42)
# --- R-hat ---
print("=== R-hat ===")
print(az.rhat(trace, var_names=['beta', 'sigma']))
# --- Effective Sample Size ---
print("\n=== Effective Sample Size ===")
print(az.ess(trace, var_names=['beta', 'sigma']))
# --- Autocorrelation ---
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for i, ax in enumerate(axes.flatten()):
if i < 2:
az.plot_autocorr(trace, var_names=['beta'], combined=True, ax=ax)
elif i == 2:
az.plot_autocorr(trace, var_names=['sigma'], combined=True, ax=ax)
else:
az.plot_posterior(trace, var_names=['beta'], ax=ax)
plt.tight_layout()
plt.savefig('mcmc_autocorr.png', dpi=150)
plt.show()
# --- Trace plots ---
az.plot_trace(trace, var_names=['beta', 'sigma'], compact=True)
plt.tight_layout()
plt.savefig('mcmc_trace.png', dpi=150)
plt.show()
# --- Rank plots (modern diagnostic) ---
az.plot_rank(trace, var_names=['beta'])
plt.tight_layout()
plt.savefig('mcmc_rank.png', dpi=150)
plt.show()
# --- Geweke diagnostic ---
# Manual implementation for demonstration
from scipy import stats as sp_stats
def geweke_diagnostic(chain, first=0.1, last=0.5):
n = len(chain)
first_end = int(n * first)
last_start = int(n * (1 - last))
mean_first = np.mean(chain[:first_end])
mean_last = np.mean(chain[last_start:])
var_first = np.var(chain[:first_end]) / first_end
var_last = np.var(chain[last_start:]) / (n - last_start)
z = (mean_first - mean_last) / np.sqrt(var_first + var_last)
p_value = 2 * (1 - sp_stats.norm.cdf(abs(z)))
return z, p_value
print("\n=== Geweke Diagnostic ===")
for name in ['beta', 'sigma']:
samples = trace.posterior[name].values.flatten()
z, p = geweke_diagnostic(samples)
print(f"{name}: z={z:.3f}, p={p:.3f} {'✓' if abs(z) < 2 else '✗'}")
Diagnostic Checklist
Before using MCMC samples for inference:
- for all parameters
- ESS for all parameters
- Trace plots show no trends or drift
- Autocorrelation decays rapidly
- No divergent transitions (HMC/NUTS)
- Posterior predictive checks pass
- Rank plots are approximately uniform
Related Topics
- See Bayesian Linear Regression for when MCMC is needed
- See Hierarchical Bayesian Models for complex models requiring MCMC
- See Bayesian Statistics for the Bayesian framework
Key Takeaways
Summary: MCMC Diagnostics and Convergence
- MCMC generates correlated samples from the posterior — diagnostics verify the chain has converged
- R-hat compares between-chain vs. within-chain variance — is the modern threshold
- ESS counts independent-equivalent samples — ESS is recommended
- Autocorrelation measures chain mixing — rapid decay indicates good mixing
- Burn-in discards early non-stationary iterations; thinning reduces storage but not information
- Geweke diagnostic compares early vs. late chain segments — indicates stationarity
- Divergent transitions in HMC/NUTS signal numerical problems — always target zero divergences
- Modern practice: store all post-burn-in samples, use rank plots, and compute ESS and together