Monte Carlo Simulation
Advanced Statistical Methods
Solving Impossible Problems With Random Numbers
Monte Carlo methods use random sampling to approximate solutions to problems that are intractable analytically. The convergence rate of O(1/√n) is dimension-independent, making these methods powerful in high dimensions.
- Physics — Simulate particle interactions in nuclear reactor design and shielding calculations
- Finance — Price complex derivatives using risk-neutral Monte Carlo pricing models
- Engineering — Perform structural reliability analysis with probability of failure estimation
Monte Carlo methods harness randomness to tame the curse of dimensionality.
DfMonte Carlo Integration
Monte Carlo integration estimates an integral I = ∫_Ω g(x) dx by drawing random samples from a proposal distribution. Given n i.i.d. samples X₁, ..., X_n drawn from a distribution p(x):
The strong law of large numbers guarantees I_n → I almost surely as n → ∞. The standard error is:
where σ_g² is the variance of the integrand under the proposal distribution. The convergence rate O(1/√n) is independent of dimension, making Monte Carlo superior to quadrature in high dimensions.
Variance Decomposition
The total variance of a Monte Carlo estimator can be decomposed:
where h(x) = g(x)/p(x) is the importance weight times the integrand. The choice of proposal distribution p determines efficiency:
- Optimal proposal: p*(x) = |g(x)| / ∫|g(x)| dx minimizes variance to zero
- Prior in Bayesian inference: p(x) = π(x) leads to self-normalized importance weights
- Prior predictive: p(x) = p(x | θ)π(θ) for marginal likelihood estimation
The effective sample size (ESS) measures the quality of the importance weights:
where w_i = g(X_i)/p(X_i) are unnormalized importance weights.
Importance Sampling
ThImportance Sampling Estimator
Given a target distribution π(x) ∝ f(x) and a proposal distribution q(x) with q(x) > 0 whenever π(x) > 0, the importance sampling estimator for E_π[h(X)] is:
where w(x) = π(x)/q(x) ∝ f(x)/q(x) are unnormalized importance weights. This is a self-normalized estimator.
Variance bound: For the unnormalized estimator μ̂_n = (1/n)∑w(X_i)h(X_i):
The proposal q should be heavy-tailed relative to π to ensure finite variance.
Adaptive Importance Sampling
Adaptive methods update the proposal distribution iteratively to reduce variance:
- Cross-Entropy method: Minimize KL(π || q_θ) by iteratively updating θ toward high-probability regions
- Population Monte Carlo: Use mixture proposals with component-specific bandwidths updated via EM
- Adaptive tempering: Anneal between easy and hard target distributions
The adaptive independence sampler maintains a mixture proposal q_t(x) = ∑_k α_k q_k(x; θ_k^(t)) and updates weights and parameters based on previous samples, achieving convergence to the optimal proposal in the limit.
Rejection Sampling
ThRejection Sampling Algorithm
To sample from a target π(x) with density proportional to f(x), given a proposal q(x) and constant M such that f(x) ≤ Mq(x) for all x:
- Draw X ~ q(x)
- Draw U ~ Uniform(0, 1)
- If U ≤ f(X)/(Mq(X)), accept X; otherwise reject
Acceptance probability: α = 1/M = ∫f(x)dx / sup_x[f(x)/q(x)]
Efficiency: The expected number of samples per acceptance is M. Optimal proposals satisfy q(x) ∝ f(x), giving acceptance rate 1.
Rejection constant: For common distributions:
- Normal: M = √(2πe) ≈ 2.718 (with uniform proposal)
- Gamma(α): M = e^(1-α)α^α / Γ(α)
import numpy as np
from scipy import stats
class MonteCarloSampler:
def __init__(self, seed=42):
self.rng = np.random.RandomState(seed)
def rejection_sampling(self, target_pdf, proposal_pdf, proposal_sampler, M, n_samples):
samples = []
attempts = 0
while len(samples) < n_samples:
x = proposal_sampler()
u = self.rng.uniform()
if u <= target_pdf(x) / (M * proposal_pdf(x)):
samples.append(x)
attempts += 1
return np.array(samples), len(samples) / attempts
def importance_sampling(self, target_pdf, proposal_pdf, proposal_sampler, n_samples, h=lambda x: x):
x = proposal_sampler(n_samples)
w = target_pdf(x) / proposal_pdf(x)
w_norm = w / np.sum(w)
mean = np.sum(w_norm * h(x))
var = np.sum(w_norm * (h(x) - mean)**2)
ess = np.sum(w)**2 / np.sum(w**2)
return mean, var, ess
def mcmc_gibbs(self, log_target, initial, n_samples, proposal_std=1.0):
n_vars = len(initial)
samples = np.zeros((n_samples, n_vars))
current = initial.copy()
accepted = np.zeros(n_vars)
for t in range(n_samples):
for i in range(n_vars):
candidate = current.copy()
candidate[i] += self.rng.normal(0, proposal_std)
log_ratio = log_target(candidate) - log_target(current)
if np.log(self.rng.uniform()) < log_ratio:
current = candidate
accepted[i] += 1
samples[t] = current
acceptance_rates = accepted / n_samples
return samples, acceptance_rates
Markov Chain Monte Carlo
ThMetropolis-Hastings Algorithm
The Metropolis-Hastings (MH) algorithm constructs a Markov chain with stationary distribution π. Given current state x_t:
- Draw proposal x* ~ q(· | x_t)
- Compute acceptance probability:
- Set x_{t+1} = x* with probability α(x_t, x*), else x_{t+1} = x_t
Key variants:
- Random walk MH: q(x* | x) = N(x, σ²I), symmetric → α = min(1, π(x*)/π(x_t))
- Independence sampler: q(x* | x) = q(x*), non-informative → α = min(1, [π(x*)q(x_t)] / [π(x_t)q(x*)])
- Langevin MH: q(x* | x) = N(x + (σ²/2)∇log π(x), σ²I), gradient-informed
Optimal acceptance rate: For random walk MH in high dimensions, the optimal acceptance rate is 23.4% (Roberts, Gelman, and Gilks, 1997).
Hamiltonian Monte Carlo
HMC introduces auxiliary momentum variables p to improve mixing. The target distribution is embedded in a joint distribution:
The Hamiltonian H(q, p) = -log π(q, p) = -log π(q) + (1/2)pᵀM⁻¹p is (approximately) conserved under Hamiltonian dynamics:
Leapfrog integration with step size ε and L steps:
Acceptance probability: α = min(1, exp(-H(q*, p*) + H(q, p)))
HMC achieves O(d^{1/4}) effective sample size per gradient evaluation in d dimensions, compared to O(d^{-1/2}) for random walk MH.
No-U-Turn Sampler (NUTS)
NUTS adaptively selects the number of leapfrog steps L to avoid U-turns in trajectory. Starting from (q, p), it builds a binary tree of proposals by doubling the trajectory length until:
- The trajectory begins to fold back on itself (U-turn condition)
- A maximum depth D is reached
- The acceptance probability drops below threshold
The slice sampling scheme ensures detailed balance without the need to manually tune L. NUTS is the default sampler in Stan and provides automatic tuning of step size ε and mass matrix M.
Variance Reduction
ThAntithetic Variates
Given a Monte Carlo estimator μ̂ = (1/n)∑h(X_i) with variance σ²/n, antithetic variates use negatively correlated pairs:
where X_i' is obtained by flipping quantiles: X_i' = F⁻¹(1 - F(X_i)).
Variance reduction: If Cov(h(X), h(X')) < 0, then Var(μ̂_AV) < Var(μ̂). For the uniform distribution:
where ρ = Corr(h(X), h(X')). Antithetic variates are most effective when h is monotone.
Control Variates and Stratification
Control variates: If Y is a random variable with known mean μ_Y = E[Y], use:
The optimal control coefficient is c* = Cov(h(X), Y) / Var(Y), reducing variance by ρ² where ρ = Corr(h(X), Y).
Stratified sampling: Partition the sample space into K strata {S₁,...,S_K} and sample n_k ∝ π(S_k) samples from each stratum:
Variance: Var(μ̂_strat) = (1/n)∑_k π(S_k) σ_k² ≤ σ²/n, with equality only when all strata have equal variance.
Convergence and Diagnostics
ThCentral Limit Theorem for MCMC
Under ergodicity conditions (aperiodicity, irreducibility, positive recurrence), the MCMC central limit theorem states:
where τ² = σ² · IAT is the asymptotic variance, σ² = Var_π(h(X)) is the target variance, and IAT is the integrated autocorrelation time:
where ρ_k = Corr(h(X_t), h(X_{t+k})) is the autocorrelation at lag k.
Effective sample size: n_eff = n / IAT. For well-mixed chains, IAT ≈ 1; for poor mixing, IAT can be >> n.
Diagnostics and Convergence Assessment
Key diagnostic tools for assessing MCMC convergence:
-
Gelman-Rubin statistic (R̂): Compares within-chain and between-chain variance. R̂ < 1.1 indicates convergence.
-
Effective sample size (ESS): Computes the number of independent samples. ESS > 100 per chain is recommended.
-
Trace plots: Visual inspection of sample paths for stationarity and mixing.
-
Geweke diagnostic: Tests mean stationarity by comparing early and late portions of the chain.
-
Raftery-Lewis: Estimates the number of iterations needed to estimate a quantile with specified precision.
-
Heidelberger-Welch: Combines a stationarity test (Cramér-von Mises) with half-width confidence interval assessment.
Warm-up (burn-in): The first n_b samples are discarded. Adaptive MCMC algorithms (NUTS, adaptive MH) use warm-up to tune proposal distributions.
Monte Carlo Convergence Rates
The convergence rates of different Monte Carlo methods are summarized:
| Method | Convergence Rate | Dimension Dependence |
|---|---|---|
| Simple MC | O(n⁻¹/²) | Dimension-independent |
| Importance Sampling | O(n⁻¹/²) | Degrades with dimension |
| Rejection Sampling | O(acceptance rate) | Degrades exponentially |
| Random Walk MH | O(n⁻¹/²) | O(d⁻¹/²) per gradient |
| HMC | O(n⁻¹/²) | O(d⁻¹/⁴) per gradient |
| NUTS | O(n⁻¹/²) | O(d⁻¹/⁴) per gradient |
The key insight is that MCMC methods achieve the same n⁻¹/² Monte Carlo convergence rate as independent sampling, but with an effective sample size that depends on the mixing properties of the chain.
Bayesian Posterior Computation
Consider a Bayesian model with likelihood p(D|θ) and prior π(θ). The posterior is:
Gibbs sampling for a conjugate normal model:
- Prior: θ ~ N(μ₀, τ²₀)
- Likelihood: x̄ | θ ~ N(θ, σ²/n)
- Posterior: θ | D ~ N(μ_n, τ²_n) where μ_n = (μ₀/τ²₀ + nx̄/σ²)/(1/τ²₀ + n/σ²)
HMC for non-conjugate models:
- Requires gradient ∇_θ log π(θ|D) = ∇_θ log p(D|θ) + ∇_θ log π(θ)
- Leapfrog integration preserves volume in phase space
- Adaptive tuning during warm-up phase
For hierarchical models with many parameters, NUTS provides efficient sampling with automatic tuning, achieving 10-100× better effective sample size than random walk MH.