🎉 75% of content is free forever — Unlock Premium from $10/mo →
CW
Search courses…
💼 Servicesℹ️ About✉️ ContactView Pricing Plansfrom $10

MCMC Diagnostics and Convergence

Advanced Statistical MethodsBayesian Methods🟢 Free Lesson

Advertisement

MCMC Diagnostics and Convergence

Advanced Statistical Methods

Knowing When Your MCMC Has Converged

MCMC diagnostics verify that Markov chain samples have converged to the target posterior distribution, preventing invalid Bayesian inference. Tools like R-hat, effective sample size, and trace plots are essential safeguards.

  • Bayesian modeling — Validate that posterior samples accurately represent the target distribution
  • Pharmacokinetics — Ensure reliable parameter estimates from complex hierarchical models
  • Computational biology — Confirm MCMC chains explore the full posterior in high-dimensional problems

Diagnostics are the quality control of Bayesian computation — never trust a chain you haven't checked.


Markov Chain Monte Carlo (MCMC) methods generate samples from posterior distributions that cannot be evaluated analytically. Proper diagnostics are essential to ensure the samples are valid representations of the target distribution.


MCMC Fundamentals

DfMarkov Chain Monte Carlo

MCMC constructs a Markov chain {X(t)}t=0\{X^{(t)}\}_{t=0}^{\infty} whose stationary distribution is the target posterior p(θD)p(\boldsymbol{\theta} \mid \mathcal{D}). After sufficient iterations (burn-in), samples {X(t)}t=BT\{X^{(t)}\}_{t=B}^{T} approximate draws from the posterior.

The chain satisfies the detailed balance condition:

π(x)T(xx)=π(x)T(xx)\pi(\mathbf{x}) \, T(\mathbf{x} \to \mathbf{x}') = \pi(\mathbf{x}') \, T(\mathbf{x}' \to \mathbf{x})

where π(x)\pi(\mathbf{x}) is the target density and TT is the transition kernel.

Why MCMC?

Most posterior distributions cannot be sampled directly (they are not standard distributions) and cannot be normalized (the evidence integral is intractable). MCMC circumvents both problems by constructing a Markov chain that only requires evaluating the posterior up to a normalizing constant.


Metropolis-Hastings Algorithm

ThMetropolis-Hastings Algorithm

At each iteration tt:

  1. Propose xq(xx(t))\mathbf{x}' \sim q(\mathbf{x}' \mid \mathbf{x}^{(t)}) from a proposal distribution
  2. Compute acceptance ratio:
α=min(1,π(x)q(x(t)x)π(x(t))q(xx(t)))\alpha = \min\left(1, \frac{\pi(\mathbf{x}') \, q(\mathbf{x}^{(t)} \mid \mathbf{x}')}{\pi(\mathbf{x}^{(t)}) \, q(\mathbf{x}' \mid \mathbf{x}^{(t)})}\right)
  1. Accept with probability α\alpha: x(t+1)=x\mathbf{x}^{(t+1)} = \mathbf{x}'; otherwise x(t+1)=x(t)\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)}

The chain converges to π\pi regardless of the choice of qq (subject to regularity conditions).

Proposal Tuning

The proposal distribution qq must be tuned for efficiency. Too narrow: high acceptance but slow exploration (high autocorrelation). Too wide: low acceptance, many rejected proposals. The optimal acceptance rate is approximately 23.4% for high-dimensional Gaussian targets (Roberts, Gelman, & Gilks, 1997).


Gibbs Sampling

DfGibbs Sampling

Gibbs sampling is a special case of Metropolis-Hastings where each component is updated conditionally:

x1(t+1)p(x1x2(t),x3(t),,xk(t))x_1^{(t+1)} \sim p(x_1 \mid x_2^{(t)}, x_3^{(t)}, \ldots, x_k^{(t)})
x2(t+1)p(x2x1(t+1),x3(t),,xk(t))x_2^{(t+1)} \sim p(x_2 \mid x_1^{(t+1)}, x_3^{(t)}, \ldots, x_k^{(t)})
\vdots
xk(t+1)p(xkx1(t+1),x2(t+1),,xk1(t+1))x_k^{(t+1)} \sim p(x_k \mid x_1^{(t+1)}, x_2^{(t+1)}, \ldots, x_{k-1}^{(t+1)})

Each update is accepted with probability 1 (no rejection step), but requires sampling from the full conditional distributions.

Gibbs vs. Metropolis-Hastings

Gibbs is efficient when full conditionals are easy to sample from. When they are not available or expensive, Metropolis-Hastings with a flexible proposal is used. Modern samplers (NUTS, HMC) combine both approaches.


Convergence Diagnostics

R-hat (Gelman-Rubin Statistic)

R-hat (Potential Scale Reduction Factor)

R^=Var^(θ)W\hat{R} = \sqrt{\frac{\hat{\text{Var}}(\theta)}{W}}

Here,

  • Var^(θ)\hat{\text{Var}}(\theta)=Pooled posterior variance estimate (between + within chain variance)
  • WW=Within-chain variance averaged across M chains

More precisely:

R-hat Decomposition

R^=n1nW+1nBW\hat{R} = \sqrt{\frac{\frac{n-1}{n}W + \frac{1}{n}B}{W}}

Here,

  • B=MM1m=1M(θˉmθˉ)2B = \frac{M}{M-1} \sum_{m=1}^M (\bar{\theta}_m - \bar{\theta})^2=Between-chain variance
  • W=1M(n1)m=1Mt=1n(θm(t)θˉm)2W = \frac{1}{M(n-1)} \sum_{m=1}^M \sum_{t=1}^n (\theta_m^{(t)} - \bar{\theta}_m)^2=Within-chain variance
  • MM=Number of chains
  • nn=Iterations per chain (after burn-in)

R-hat Threshold

A common threshold is R^<1.1\hat{R} < 1.1 (Gelman & Rubin, 1992). Modern recommendations (Vehtari et al., 2021) use R^<1.01\hat{R} < 1.01. If R^>1.05\hat{R} > 1.05, the chain has likely not converged — run longer or reparameterize.

Effective Sample Size (ESS)

Effective Sample Size

ESS=MN1+2k=1ρk\text{ESS} = \frac{MN}{1 + 2 \sum_{k=1}^{\infty} \rho_k}

Here,

  • MM=Number of chains
  • NN=Total iterations per chain
  • ρk\rho_k=Autocorrelation at lag k

ESS measures the number of independent samples equivalent to the correlated MCMC output. High autocorrelation reduces ESS. A minimum of ESS >400> 400 is recommended for reliable posterior summaries.

Autocorrelation

Sample Autocorrelation

ρ^k=t=1Nk(θ(t)θˉ)(θ(t+k)θˉ)t=1N(θ(t)θˉ)2\hat{\rho}_k = \frac{\sum_{t=1}^{N-k}(\theta^{(t)} - \bar{\theta})(\theta^{(t+k)} - \bar{\theta})}{\sum_{t=1}^{N}(\theta^{(t)} - \bar{\theta})^2}

Here,

  • ρ^k\hat{\rho}_k=Autocorrelation at lag k
  • NN=Chain length
  • θ(t)\theta^{(t)}=Parameter value at iteration t

Autocorrelation Interpretation

  • Low autocorrelation (rapid decay): chain mixes well, explores efficiently
  • High autocorrelation (slow decay): chain is stuck, needs thinning or reparameterization
  • Autocorrelation should drop below 0.10.1 within a few dozen lags for well-mixed chains

Thinning and Burn-in

DfBurn-in Period

The first BB iterations are discarded as burn-in (or warm-up). These iterations reflect the initial starting point, not the stationary distribution. Common practice: discard 50% of iterations, or use diagnostics to determine when stationarity begins.

DfThinning

Instead of keeping every iteration, retain every kk-th sample to reduce storage and autocorrelation. Thinning produces a chain of length N/kN/k with reduced autocorrelation but also reduces information — total ESS is unchanged.

Recommendation: Store all samples and thin only for storage constraints. Compute ESS to assess effective information content.

Modern Practice

Modern MCMC (especially HMC/NUTS) produces low-autocorrelation samples, making thinning unnecessary. Store all post-burn-in samples and compute ESS directly.


Geweke Diagnostic

ThGeweke Diagnostic

Compare the mean of the first 10% of the chain to the mean of the last 50%:

z=θˉfirstθˉlastS^first(f0)/n1+S^last(f0)/n2z = \frac{\bar{\theta}_{\text{first}} - \bar{\theta}_{\text{last}}}{\sqrt{\hat{S}_{\text{first}}(f_0) / n_1 + \hat{S}_{\text{last}}(f_0) / n_2}}

where S^(f0)\hat{S}(f_0) is the spectral density at frequency zero. Under convergence, zN(0,1)z \sim \mathcal{N}(0, 1). z>2|z| > 2 suggests non-convergence.


Brooks-Gelman-Rubin Multivariate Diagnostic

Multivariate R-hat

R^mult=maxjR^j\hat{R}_{\text{mult}} = \max_j \hat{R}_j

Here,

  • R^j\hat{R}_j=Univariate R-hat for parameter j

The multivariate R^\hat{R} (Vehtari et al., 2021) monitors all parameters simultaneously and is more sensitive to divergence in high-dimensional posteriors. R^mult<1.01\hat{R}_{\text{mult}} < 1.01 is the recommended threshold.


Divergent Transitions (HMC/NUTS)

Divergent Transitions

When using Hamiltonian Monte Carlo (HMC) or NUTS (No-U-Turn Sampler), divergent transitions indicate that the numerical integrator failed to accurately follow the Hamiltonian trajectory. Divergences signal:

  1. Regions of high curvature that the integrator cannot handle
  2. Potential bias in posterior estimates
  3. Need for reparameterization (e.g., non-centered parameterization)

Zero divergences is the goal. If divergences occur, try: (a) increase target_accept, (b) use non-centered parameterization, (c) reparameterize the model.


Python Implementation

import numpy as np
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt

np.random.seed(42)

# --- Simple model for demonstration ---
n_obs = 200
X = np.random.randn(n_obs)
true_beta = np.array([1.5, -2.0])
y = true_beta[0] + true_beta[1] * X + np.random.randn(n_obs) * 0.5

with pm.Model() as demo_model:
    beta = pm.Normal('beta', mu=0, sigma=10, shape=2)
    sigma = pm.HalfNormal('sigma', sigma=2)
    mu = beta[0] + beta[1] * X
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)
    
    # Run MCMC with diagnostics
    trace = pm.sample(2000, tune=1000, chains=4, return_inferencedata=True,
                      random_seed=42)

# --- R-hat ---
print("=== R-hat ===")
print(az.rhat(trace, var_names=['beta', 'sigma']))

# --- Effective Sample Size ---
print("\n=== Effective Sample Size ===")
print(az.ess(trace, var_names=['beta', 'sigma']))

# --- Autocorrelation ---
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for i, ax in enumerate(axes.flatten()):
    if i < 2:
        az.plot_autocorr(trace, var_names=['beta'], combined=True, ax=ax)
    elif i == 2:
        az.plot_autocorr(trace, var_names=['sigma'], combined=True, ax=ax)
    else:
        az.plot_posterior(trace, var_names=['beta'], ax=ax)
plt.tight_layout()
plt.savefig('mcmc_autocorr.png', dpi=150)
plt.show()

# --- Trace plots ---
az.plot_trace(trace, var_names=['beta', 'sigma'], compact=True)
plt.tight_layout()
plt.savefig('mcmc_trace.png', dpi=150)
plt.show()

# --- Rank plots (modern diagnostic) ---
az.plot_rank(trace, var_names=['beta'])
plt.tight_layout()
plt.savefig('mcmc_rank.png', dpi=150)
plt.show()

# --- Geweke diagnostic ---
# Manual implementation for demonstration
from scipy import stats as sp_stats

def geweke_diagnostic(chain, first=0.1, last=0.5):
    n = len(chain)
    first_end = int(n * first)
    last_start = int(n * (1 - last))
    mean_first = np.mean(chain[:first_end])
    mean_last = np.mean(chain[last_start:])
    var_first = np.var(chain[:first_end]) / first_end
    var_last = np.var(chain[last_start:]) / (n - last_start)
    z = (mean_first - mean_last) / np.sqrt(var_first + var_last)
    p_value = 2 * (1 - sp_stats.norm.cdf(abs(z)))
    return z, p_value

print("\n=== Geweke Diagnostic ===")
for name in ['beta', 'sigma']:
    samples = trace.posterior[name].values.flatten()
    z, p = geweke_diagnostic(samples)
    print(f"{name}: z={z:.3f}, p={p:.3f} {'✓' if abs(z) < 2 else '✗'}")

Diagnostic Checklist

Before using MCMC samples for inference:

  1. R^<1.01\hat{R} < 1.01 for all parameters
  2. ESS >400> 400 for all parameters
  3. Trace plots show no trends or drift
  4. Autocorrelation decays rapidly
  5. No divergent transitions (HMC/NUTS)
  6. Posterior predictive checks pass
  7. Rank plots are approximately uniform

Related Topics


Key Takeaways

Summary: MCMC Diagnostics and Convergence

  • MCMC generates correlated samples from the posterior — diagnostics verify the chain has converged
  • R-hat compares between-chain vs. within-chain variance — R^<1.01\hat{R} < 1.01 is the modern threshold
  • ESS counts independent-equivalent samples — ESS >400> 400 is recommended
  • Autocorrelation measures chain mixing — rapid decay indicates good mixing
  • Burn-in discards early non-stationary iterations; thinning reduces storage but not information
  • Geweke diagnostic compares early vs. late chain segments — z<2|z| < 2 indicates stationarity
  • Divergent transitions in HMC/NUTS signal numerical problems — always target zero divergences
  • Modern practice: store all post-burn-in samples, use rank plots, and compute ESS and R^\hat{R} together

Premium Content

MCMC Diagnostics and Convergence

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
💼Interview Prep
📜Certificates
🤝Community Access

Already a member? Log in

Need Expert Statistics Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement