Monte Carlo Simulation

Advanced Statistical Methods

Solving Impossible Problems With Random Numbers

Monte Carlo methods use random sampling to approximate solutions to problems that are intractable analytically. The convergence rate of O(1/√n) is dimension-independent, making these methods powerful in high dimensions.

Physics — Simulate particle interactions in nuclear reactor design and shielding calculations
Finance — Price complex derivatives using risk-neutral Monte Carlo pricing models
Engineering — Perform structural reliability analysis with probability of failure estimation

Monte Carlo methods harness randomness to tame the curse of dimensionality.

DfMonte Carlo Integration

Monte Carlo integration estimates an integral I = ∫_Ω g(x) dx by drawing random samples from a proposal distribution. Given n i.i.d. samples X₁, ..., X_n drawn from a distribution p(x):

\hat{I}_n = \frac{1}{n} \sum_{i=1}^{n} \frac{g(X_i)}{p(X_i)} \cdot \mathbb{1}_{\Omega}(X_i)

The strong law of large numbers guarantees I_n → I almost surely as n → ∞. The standard error is:

\text{SE}(\hat{I}_n) = \frac{\sigma_g}{\sqrt{n}}, \quad \sigma_g^2 = \text{Var}_p\left[\frac{g(X)}{p(X)}\right]

where σ_g² is the variance of the integrand under the proposal distribution. The convergence rate O(1/√n) is independent of dimension, making Monte Carlo superior to quadrature in high dimensions.

Variance Decomposition

The total variance of a Monte Carlo estimator can be decomposed:

\text{Var}(\hat{I}_n) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^n h(X_i)\right) = \frac{1}{n}\text{Var}_p(h(X))

where h(x) = g(x)/p(x) is the importance weight times the integrand. The choice of proposal distribution p determines efficiency:

Optimal proposal: p*(x) = |g(x)| / ∫|g(x)| dx minimizes variance to zero
Prior in Bayesian inference: p(x) = π(x) leads to self-normalized importance weights
Prior predictive: p(x) = p(x | θ)π(θ) for marginal likelihood estimation

The effective sample size (ESS) measures the quality of the importance weights:

\text{ESS} = \frac{(\sum_{i=1}^n w_i)^2}{\sum_{i=1}^n w_i^2}

where w_i = g(X_i)/p(X_i) are unnormalized importance weights.

Importance Sampling

ThImportance Sampling Estimator

Given a target distribution π(x) ∝ f(x) and a proposal distribution q(x) with q(x) > 0 whenever π(x) > 0, the importance sampling estimator for E_π[h(X)] is:

\hat{\mu}_{IS} = \frac{\sum_{i=1}^{n} w(X_i) h(X_i)}{\sum_{i=1}^{n} w(X_i)}

where w(x) = π(x)/q(x) ∝ f(x)/q(x) are unnormalized importance weights. This is a self-normalized estimator.

Variance bound: For the unnormalized estimator μ̂_n = (1/n)∑w(X_i)h(X_i):

\text{Var}_q(\hat{\mu}_n) = \frac{1}{n}\left[\int \frac{f(x)^2 h(x)^2}{q(x)} dx - \left(\int f(x)h(x) dx\right)^2\right]

The proposal q should be heavy-tailed relative to π to ensure finite variance.

Adaptive Importance Sampling

Adaptive methods update the proposal distribution iteratively to reduce variance:

Cross-Entropy method: Minimize KL(π || q_θ) by iteratively updating θ toward high-probability regions
Population Monte Carlo: Use mixture proposals with component-specific bandwidths updated via EM
Adaptive tempering: Anneal between easy and hard target distributions

The adaptive independence sampler maintains a mixture proposal q_t(x) = ∑_k α_k q_k(x; θ_k^(t)) and updates weights and parameters based on previous samples, achieving convergence to the optimal proposal in the limit.

Rejection Sampling

ThRejection Sampling Algorithm

To sample from a target π(x) with density proportional to f(x), given a proposal q(x) and constant M such that f(x) ≤ Mq(x) for all x:

Draw X ~ q(x)
Draw U ~ Uniform(0, 1)
If U ≤ f(X)/(Mq(X)), accept X; otherwise reject

Acceptance probability: α = 1/M = ∫f(x)dx / sup_x[f(x)/q(x)]

Efficiency: The expected number of samples per acceptance is M. Optimal proposals satisfy q(x) ∝ f(x), giving acceptance rate 1.

Rejection constant: For common distributions:

Normal: M = √(2πe) ≈ 2.718 (with uniform proposal)
Gamma(α): M = e^(1-α)α^α / Γ(α)

import numpy as np
from scipy import stats

class MonteCarloSampler:
    def __init__(self, seed=42):
        self.rng = np.random.RandomState(seed)

    def rejection_sampling(self, target_pdf, proposal_pdf, proposal_sampler, M, n_samples):
        samples = []
        attempts = 0
        while len(samples) < n_samples:
            x = proposal_sampler()
            u = self.rng.uniform()
            if u <= target_pdf(x) / (M * proposal_pdf(x)):
                samples.append(x)
            attempts += 1
        return np.array(samples), len(samples) / attempts

    def importance_sampling(self, target_pdf, proposal_pdf, proposal_sampler, n_samples, h=lambda x: x):
        x = proposal_sampler(n_samples)
        w = target_pdf(x) / proposal_pdf(x)
        w_norm = w / np.sum(w)
        mean = np.sum(w_norm * h(x))
        var = np.sum(w_norm * (h(x) - mean)**2)
        ess = np.sum(w)**2 / np.sum(w**2)
        return mean, var, ess

    def mcmc_gibbs(self, log_target, initial, n_samples, proposal_std=1.0):
        n_vars = len(initial)
        samples = np.zeros((n_samples, n_vars))
        current = initial.copy()
        accepted = np.zeros(n_vars)

        for t in range(n_samples):
            for i in range(n_vars):
                candidate = current.copy()
                candidate[i] += self.rng.normal(0, proposal_std)
                log_ratio = log_target(candidate) - log_target(current)
                if np.log(self.rng.uniform()) < log_ratio:
                    current = candidate
                    accepted[i] += 1
            samples[t] = current

        acceptance_rates = accepted / n_samples
        return samples, acceptance_rates

Markov Chain Monte Carlo

ThMetropolis-Hastings Algorithm

The Metropolis-Hastings (MH) algorithm constructs a Markov chain with stationary distribution π. Given current state x_t:

Draw proposal x* ~ q(· | x_t)
Compute acceptance probability:

\alpha(x_t, x^*) = \min\left(1, \frac{\pi(x^*) q(x_t | x^*)}{\pi(x_t) q(x^* | x_t)}\right)

Set x_{t+1} = x* with probability α(x_t, x*), else x_{t+1} = x_t

Key variants:

Random walk MH: q(x* | x) = N(x, σ²I), symmetric → α = min(1, π(x*)/π(x_t))
Independence sampler: q(x* | x) = q(x*), non-informative → α = min(1, [π(x*)q(x_t)] / [π(x_t)q(x*)])
Langevin MH: q(x* | x) = N(x + (σ²/2)∇log π(x), σ²I), gradient-informed

Optimal acceptance rate: For random walk MH in high dimensions, the optimal acceptance rate is 23.4% (Roberts, Gelman, and Gilks, 1997).

Hamiltonian Monte Carlo

HMC introduces auxiliary momentum variables p to improve mixing. The target distribution is embedded in a joint distribution:

\pi(q, p) = \pi(q) \cdot \mathcal{N}(p | 0, M)

The Hamiltonian H(q, p) = -log π(q, p) = -log π(q) + (1/2)pᵀM⁻¹p is (approximately) conserved under Hamiltonian dynamics:

\dot{q} = M^{-1}p, \quad \dot{p} = -\nabla_q \log \pi(q)

Leapfrog integration with step size ε and L steps:

p_{t+\varepsilon/2} = p_t + \frac{\varepsilon}{2}\nabla_q \log \pi(q_t)

q_{t+\varepsilon} = q_t + \varepsilon M^{-1} p_{t+\varepsilon/2}

p_{t+\varepsilon} = p_{t+\varepsilon/2} + \frac{\varepsilon}{2}\nabla_q \log \pi(q_{t+\varepsilon})

Acceptance probability: α = min(1, exp(-H(q*, p*) + H(q, p)))

HMC achieves O(d^{1/4}) effective sample size per gradient evaluation in d dimensions, compared to O(d^{-1/2}) for random walk MH.

No-U-Turn Sampler (NUTS)

NUTS adaptively selects the number of leapfrog steps L to avoid U-turns in trajectory. Starting from (q, p), it builds a binary tree of proposals by doubling the trajectory length until:

The trajectory begins to fold back on itself (U-turn condition)
A maximum depth D is reached
The acceptance probability drops below threshold

The slice sampling scheme ensures detailed balance without the need to manually tune L. NUTS is the default sampler in Stan and provides automatic tuning of step size ε and mass matrix M.

Variance Reduction

ThAntithetic Variates

Given a Monte Carlo estimator μ̂ = (1/n)∑h(X_i) with variance σ²/n, antithetic variates use negatively correlated pairs:

\hat{\mu}_{AV} = \frac{1}{n/2}\sum_{i=1}^{n/2} \frac{h(X_i) + h(X_i')}{2}

where X_i' is obtained by flipping quantiles: X_i' = F⁻¹(1 - F(X_i)).

Variance reduction: If Cov(h(X), h(X')) < 0, then Var(μ̂_AV) < Var(μ̂). For the uniform distribution:

\text{Var}(\hat{\mu}_{AV}) \approx \frac{\sigma^2}{n}\left(1 - \rho\right)

where ρ = Corr(h(X), h(X')). Antithetic variates are most effective when h is monotone.

Control Variates and Stratification

Control variates: If Y is a random variable with known mean μ_Y = E[Y], use:

\hat{\mu}_{CV} = \hat{\mu} - c(Y - \mu_Y)

The optimal control coefficient is c* = Cov(h(X), Y) / Var(Y), reducing variance by ρ² where ρ = Corr(h(X), Y).

Stratified sampling: Partition the sample space into K strata {S₁,...,S_K} and sample n_k ∝ π(S_k) samples from each stratum:

\hat{\mu}_{strat} = \sum_{k=1}^{K} \pi(S_k) \frac{1}{n_k}\sum_{i \in S_k} h(X_i)

Variance: Var(μ̂_strat) = (1/n)∑_k π(S_k) σ_k² ≤ σ²/n, with equality only when all strata have equal variance.

Convergence and Diagnostics

ThCentral Limit Theorem for MCMC

Under ergodicity conditions (aperiodicity, irreducibility, positive recurrence), the MCMC central limit theorem states:

\sqrt{n}(\hat{\mu}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \tau^2)

where τ² = σ² · IAT is the asymptotic variance, σ² = Var_π(h(X)) is the target variance, and IAT is the integrated autocorrelation time:

\text{IAT} = 1 + 2\sum_{k=1}^{\infty} \rho_k

where ρ_k = Corr(h(X_t), h(X_{t+k})) is the autocorrelation at lag k.

Effective sample size: n_eff = n / IAT. For well-mixed chains, IAT ≈ 1; for poor mixing, IAT can be >> n.

Diagnostics and Convergence Assessment

Key diagnostic tools for assessing MCMC convergence:

Gelman-Rubin statistic (R̂): Compares within-chain and between-chain variance. R̂ < 1.1 indicates convergence.
Effective sample size (ESS): Computes the number of independent samples. ESS > 100 per chain is recommended.
Trace plots: Visual inspection of sample paths for stationarity and mixing.
Geweke diagnostic: Tests mean stationarity by comparing early and late portions of the chain.
Raftery-Lewis: Estimates the number of iterations needed to estimate a quantile with specified precision.
Heidelberger-Welch: Combines a stationarity test (Cramér-von Mises) with half-width confidence interval assessment.

Warm-up (burn-in): The first n_b samples are discarded. Adaptive MCMC algorithms (NUTS, adaptive MH) use warm-up to tune proposal distributions.

Monte Carlo Convergence Rates

The convergence rates of different Monte Carlo methods are summarized:

Method	Convergence Rate	Dimension Dependence
Simple MC	O(n⁻¹/²)	Dimension-independent
Importance Sampling	O(n⁻¹/²)	Degrades with dimension
Rejection Sampling	O(acceptance rate)	Degrades exponentially
Random Walk MH	O(n⁻¹/²)	O(d⁻¹/²) per gradient
HMC	O(n⁻¹/²)	O(d⁻¹/⁴) per gradient
NUTS	O(n⁻¹/²)	O(d⁻¹/⁴) per gradient

The key insight is that MCMC methods achieve the same n⁻¹/² Monte Carlo convergence rate as independent sampling, but with an effective sample size that depends on the mixing properties of the chain.

Bayesian Posterior Computation

Consider a Bayesian model with likelihood p(D|θ) and prior π(θ). The posterior is:

\pi(\theta | D) = \frac{p(D|\theta)\pi(\theta)}{\int p(D|\theta)\pi(\theta)d\theta}

Gibbs sampling for a conjugate normal model:

Prior: θ ~ N(μ₀, τ²₀)
Likelihood: x̄ | θ ~ N(θ, σ²/n)
Posterior: θ | D ~ N(μ_n, τ²_n) where μ_n = (μ₀/τ²₀ + nx̄/σ²)/(1/τ²₀ + n/σ²)

HMC for non-conjugate models:

Requires gradient ∇_θ log π(θ|D) = ∇_θ log p(D|θ) + ∇_θ log π(θ)
Leapfrog integration preserves volume in phase space
Adaptive tuning during warm-up phase

For hierarchical models with many parameters, NUTS provides efficient sampling with automatic tuning, achieving 10-100× better effective sample size than random walk MH.

Monte Carlo Simulation

Monte Carlo Simulation

Solving Impossible Problems With Random Numbers

DfMonte Carlo Integration

Variance Decomposition

Importance Sampling

ThImportance Sampling Estimator

Rejection Sampling

ThRejection Sampling Algorithm

Markov Chain Monte Carlo

ThMetropolis-Hastings Algorithm

Hamiltonian Monte Carlo

Variance Reduction

ThAntithetic Variates

Control Variates and Stratification

Convergence and Diagnostics

ThCentral Limit Theorem for MCMC

Monte Carlo Convergence Rates

Bayesian Posterior Computation

Premium Content

Need Expert Statistics Help?