🎉 75% of content is free forever — Unlock Premium from $10/mo →
CW
Search courses…
💼 Servicesℹ️ About✉️ ContactView Pricing Plansfrom $10

Monte Carlo Simulation

Advanced Statistical MethodsComputational Methods🟢 Free Lesson

Advertisement

Monte Carlo Simulation

Advanced Statistical Methods

Solving Impossible Problems With Random Numbers

Monte Carlo methods use random sampling to approximate solutions to problems that are intractable analytically. The convergence rate of O(1/√n) is dimension-independent, making these methods powerful in high dimensions.

  • Physics — Simulate particle interactions in nuclear reactor design and shielding calculations
  • Finance — Price complex derivatives using risk-neutral Monte Carlo pricing models
  • Engineering — Perform structural reliability analysis with probability of failure estimation

Monte Carlo methods harness randomness to tame the curse of dimensionality.


DfMonte Carlo Integration

Monte Carlo integration estimates an integral I = ∫_Ω g(x) dx by drawing random samples from a proposal distribution. Given n i.i.d. samples X₁, ..., X_n drawn from a distribution p(x):

I^n=1ni=1ng(Xi)p(Xi)1Ω(Xi)\hat{I}_n = \frac{1}{n} \sum_{i=1}^{n} \frac{g(X_i)}{p(X_i)} \cdot \mathbb{1}_{\Omega}(X_i)

The strong law of large numbers guarantees I_n → I almost surely as n → ∞. The standard error is:

SE(I^n)=σgn,σg2=Varp[g(X)p(X)]\text{SE}(\hat{I}_n) = \frac{\sigma_g}{\sqrt{n}}, \quad \sigma_g^2 = \text{Var}_p\left[\frac{g(X)}{p(X)}\right]

where σ_g² is the variance of the integrand under the proposal distribution. The convergence rate O(1/√n) is independent of dimension, making Monte Carlo superior to quadrature in high dimensions.

Variance Decomposition

The total variance of a Monte Carlo estimator can be decomposed:

Var(I^n)=Var(1ni=1nh(Xi))=1nVarp(h(X))\text{Var}(\hat{I}_n) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^n h(X_i)\right) = \frac{1}{n}\text{Var}_p(h(X))

where h(x) = g(x)/p(x) is the importance weight times the integrand. The choice of proposal distribution p determines efficiency:

  • Optimal proposal: p*(x) = |g(x)| / ∫|g(x)| dx minimizes variance to zero
  • Prior in Bayesian inference: p(x) = π(x) leads to self-normalized importance weights
  • Prior predictive: p(x) = p(x | θ)π(θ) for marginal likelihood estimation

The effective sample size (ESS) measures the quality of the importance weights:

ESS=(i=1nwi)2i=1nwi2\text{ESS} = \frac{(\sum_{i=1}^n w_i)^2}{\sum_{i=1}^n w_i^2}

where w_i = g(X_i)/p(X_i) are unnormalized importance weights.

Importance Sampling

ThImportance Sampling Estimator

Given a target distribution π(x) ∝ f(x) and a proposal distribution q(x) with q(x) > 0 whenever π(x) > 0, the importance sampling estimator for E_π[h(X)] is:

μ^IS=i=1nw(Xi)h(Xi)i=1nw(Xi)\hat{\mu}_{IS} = \frac{\sum_{i=1}^{n} w(X_i) h(X_i)}{\sum_{i=1}^{n} w(X_i)}

where w(x) = π(x)/q(x) ∝ f(x)/q(x) are unnormalized importance weights. This is a self-normalized estimator.

Variance bound: For the unnormalized estimator μ̂_n = (1/n)∑w(X_i)h(X_i):

Varq(μ^n)=1n[f(x)2h(x)2q(x)dx(f(x)h(x)dx)2]\text{Var}_q(\hat{\mu}_n) = \frac{1}{n}\left[\int \frac{f(x)^2 h(x)^2}{q(x)} dx - \left(\int f(x)h(x) dx\right)^2\right]

The proposal q should be heavy-tailed relative to π to ensure finite variance.

Adaptive Importance Sampling

Adaptive methods update the proposal distribution iteratively to reduce variance:

  1. Cross-Entropy method: Minimize KL(π || q_θ) by iteratively updating θ toward high-probability regions
  2. Population Monte Carlo: Use mixture proposals with component-specific bandwidths updated via EM
  3. Adaptive tempering: Anneal between easy and hard target distributions

The adaptive independence sampler maintains a mixture proposal q_t(x) = ∑_k α_k q_k(x; θ_k^(t)) and updates weights and parameters based on previous samples, achieving convergence to the optimal proposal in the limit.

Rejection Sampling

ThRejection Sampling Algorithm

To sample from a target π(x) with density proportional to f(x), given a proposal q(x) and constant M such that f(x) ≤ Mq(x) for all x:

  1. Draw X ~ q(x)
  2. Draw U ~ Uniform(0, 1)
  3. If U ≤ f(X)/(Mq(X)), accept X; otherwise reject

Acceptance probability: α = 1/M = ∫f(x)dx / sup_x[f(x)/q(x)]

Efficiency: The expected number of samples per acceptance is M. Optimal proposals satisfy q(x) ∝ f(x), giving acceptance rate 1.

Rejection constant: For common distributions:

  • Normal: M = √(2πe) ≈ 2.718 (with uniform proposal)
  • Gamma(α): M = e^(1-α)α^α / Γ(α)
import numpy as np
from scipy import stats

class MonteCarloSampler:
    def __init__(self, seed=42):
        self.rng = np.random.RandomState(seed)

    def rejection_sampling(self, target_pdf, proposal_pdf, proposal_sampler, M, n_samples):
        samples = []
        attempts = 0
        while len(samples) < n_samples:
            x = proposal_sampler()
            u = self.rng.uniform()
            if u <= target_pdf(x) / (M * proposal_pdf(x)):
                samples.append(x)
            attempts += 1
        return np.array(samples), len(samples) / attempts

    def importance_sampling(self, target_pdf, proposal_pdf, proposal_sampler, n_samples, h=lambda x: x):
        x = proposal_sampler(n_samples)
        w = target_pdf(x) / proposal_pdf(x)
        w_norm = w / np.sum(w)
        mean = np.sum(w_norm * h(x))
        var = np.sum(w_norm * (h(x) - mean)**2)
        ess = np.sum(w)**2 / np.sum(w**2)
        return mean, var, ess

    def mcmc_gibbs(self, log_target, initial, n_samples, proposal_std=1.0):
        n_vars = len(initial)
        samples = np.zeros((n_samples, n_vars))
        current = initial.copy()
        accepted = np.zeros(n_vars)

        for t in range(n_samples):
            for i in range(n_vars):
                candidate = current.copy()
                candidate[i] += self.rng.normal(0, proposal_std)
                log_ratio = log_target(candidate) - log_target(current)
                if np.log(self.rng.uniform()) < log_ratio:
                    current = candidate
                    accepted[i] += 1
            samples[t] = current

        acceptance_rates = accepted / n_samples
        return samples, acceptance_rates

Markov Chain Monte Carlo

ThMetropolis-Hastings Algorithm

The Metropolis-Hastings (MH) algorithm constructs a Markov chain with stationary distribution π. Given current state x_t:

  1. Draw proposal x* ~ q(· | x_t)
  2. Compute acceptance probability:
α(xt,x)=min(1,π(x)q(xtx)π(xt)q(xxt))\alpha(x_t, x^*) = \min\left(1, \frac{\pi(x^*) q(x_t | x^*)}{\pi(x_t) q(x^* | x_t)}\right)
  1. Set x_{t+1} = x* with probability α(x_t, x*), else x_{t+1} = x_t

Key variants:

  • Random walk MH: q(x* | x) = N(x, σ²I), symmetric → α = min(1, π(x*)/π(x_t))
  • Independence sampler: q(x* | x) = q(x*), non-informative → α = min(1, [π(x*)q(x_t)] / [π(x_t)q(x*)])
  • Langevin MH: q(x* | x) = N(x + (σ²/2)∇log π(x), σ²I), gradient-informed

Optimal acceptance rate: For random walk MH in high dimensions, the optimal acceptance rate is 23.4% (Roberts, Gelman, and Gilks, 1997).

Hamiltonian Monte Carlo

HMC introduces auxiliary momentum variables p to improve mixing. The target distribution is embedded in a joint distribution:

π(q,p)=π(q)N(p0,M)\pi(q, p) = \pi(q) \cdot \mathcal{N}(p | 0, M)

The Hamiltonian H(q, p) = -log π(q, p) = -log π(q) + (1/2)pᵀM⁻¹p is (approximately) conserved under Hamiltonian dynamics:

q˙=M1p,p˙=qlogπ(q)\dot{q} = M^{-1}p, \quad \dot{p} = -\nabla_q \log \pi(q)

Leapfrog integration with step size ε and L steps:

pt+ε/2=pt+ε2qlogπ(qt)p_{t+\varepsilon/2} = p_t + \frac{\varepsilon}{2}\nabla_q \log \pi(q_t)
qt+ε=qt+εM1pt+ε/2q_{t+\varepsilon} = q_t + \varepsilon M^{-1} p_{t+\varepsilon/2}
pt+ε=pt+ε/2+ε2qlogπ(qt+ε)p_{t+\varepsilon} = p_{t+\varepsilon/2} + \frac{\varepsilon}{2}\nabla_q \log \pi(q_{t+\varepsilon})

Acceptance probability: α = min(1, exp(-H(q*, p*) + H(q, p)))

HMC achieves O(d^{1/4}) effective sample size per gradient evaluation in d dimensions, compared to O(d^{-1/2}) for random walk MH.

No-U-Turn Sampler (NUTS)

NUTS adaptively selects the number of leapfrog steps L to avoid U-turns in trajectory. Starting from (q, p), it builds a binary tree of proposals by doubling the trajectory length until:

  1. The trajectory begins to fold back on itself (U-turn condition)
  2. A maximum depth D is reached
  3. The acceptance probability drops below threshold

The slice sampling scheme ensures detailed balance without the need to manually tune L. NUTS is the default sampler in Stan and provides automatic tuning of step size ε and mass matrix M.

Variance Reduction

ThAntithetic Variates

Given a Monte Carlo estimator μ̂ = (1/n)∑h(X_i) with variance σ²/n, antithetic variates use negatively correlated pairs:

μ^AV=1n/2i=1n/2h(Xi)+h(Xi)2\hat{\mu}_{AV} = \frac{1}{n/2}\sum_{i=1}^{n/2} \frac{h(X_i) + h(X_i')}{2}

where X_i' is obtained by flipping quantiles: X_i' = F⁻¹(1 - F(X_i)).

Variance reduction: If Cov(h(X), h(X')) < 0, then Var(μ̂_AV) < Var(μ̂). For the uniform distribution:

Var(μ^AV)σ2n(1ρ)\text{Var}(\hat{\mu}_{AV}) \approx \frac{\sigma^2}{n}\left(1 - \rho\right)

where ρ = Corr(h(X), h(X')). Antithetic variates are most effective when h is monotone.

Control Variates and Stratification

Control variates: If Y is a random variable with known mean μ_Y = E[Y], use:

μ^CV=μ^c(YμY)\hat{\mu}_{CV} = \hat{\mu} - c(Y - \mu_Y)

The optimal control coefficient is c* = Cov(h(X), Y) / Var(Y), reducing variance by ρ² where ρ = Corr(h(X), Y).

Stratified sampling: Partition the sample space into K strata {S₁,...,S_K} and sample n_k ∝ π(S_k) samples from each stratum:

μ^strat=k=1Kπ(Sk)1nkiSkh(Xi)\hat{\mu}_{strat} = \sum_{k=1}^{K} \pi(S_k) \frac{1}{n_k}\sum_{i \in S_k} h(X_i)

Variance: Var(μ̂_strat) = (1/n)∑_k π(S_k) σ_k² ≤ σ²/n, with equality only when all strata have equal variance.

Convergence and Diagnostics

ThCentral Limit Theorem for MCMC

Under ergodicity conditions (aperiodicity, irreducibility, positive recurrence), the MCMC central limit theorem states:

n(μ^nμ)dN(0,τ2)\sqrt{n}(\hat{\mu}_n - \mu) \xrightarrow{d} \mathcal{N}(0, \tau^2)

where τ² = σ² · IAT is the asymptotic variance, σ² = Var_π(h(X)) is the target variance, and IAT is the integrated autocorrelation time:

IAT=1+2k=1ρk\text{IAT} = 1 + 2\sum_{k=1}^{\infty} \rho_k

where ρ_k = Corr(h(X_t), h(X_{t+k})) is the autocorrelation at lag k.

Effective sample size: n_eff = n / IAT. For well-mixed chains, IAT ≈ 1; for poor mixing, IAT can be >> n.

Diagnostics and Convergence Assessment

Key diagnostic tools for assessing MCMC convergence:

  1. Gelman-Rubin statistic (R̂): Compares within-chain and between-chain variance. R̂ < 1.1 indicates convergence.

  2. Effective sample size (ESS): Computes the number of independent samples. ESS > 100 per chain is recommended.

  3. Trace plots: Visual inspection of sample paths for stationarity and mixing.

  4. Geweke diagnostic: Tests mean stationarity by comparing early and late portions of the chain.

  5. Raftery-Lewis: Estimates the number of iterations needed to estimate a quantile with specified precision.

  6. Heidelberger-Welch: Combines a stationarity test (Cramér-von Mises) with half-width confidence interval assessment.

Warm-up (burn-in): The first n_b samples are discarded. Adaptive MCMC algorithms (NUTS, adaptive MH) use warm-up to tune proposal distributions.

Monte Carlo Convergence Rates

The convergence rates of different Monte Carlo methods are summarized:

MethodConvergence RateDimension Dependence
Simple MCO(n⁻¹/²)Dimension-independent
Importance SamplingO(n⁻¹/²)Degrades with dimension
Rejection SamplingO(acceptance rate)Degrades exponentially
Random Walk MHO(n⁻¹/²)O(d⁻¹/²) per gradient
HMCO(n⁻¹/²)O(d⁻¹/⁴) per gradient
NUTSO(n⁻¹/²)O(d⁻¹/⁴) per gradient

The key insight is that MCMC methods achieve the same n⁻¹/² Monte Carlo convergence rate as independent sampling, but with an effective sample size that depends on the mixing properties of the chain.

Bayesian Posterior Computation

Consider a Bayesian model with likelihood p(D|θ) and prior π(θ). The posterior is:

π(θD)=p(Dθ)π(θ)p(Dθ)π(θ)dθ\pi(\theta | D) = \frac{p(D|\theta)\pi(\theta)}{\int p(D|\theta)\pi(\theta)d\theta}

Gibbs sampling for a conjugate normal model:

  • Prior: θ ~ N(μ₀, τ²₀)
  • Likelihood: x̄ | θ ~ N(θ, σ²/n)
  • Posterior: θ | D ~ N(μ_n, τ²_n) where μ_n = (μ₀/τ²₀ + nx̄/σ²)/(1/τ²₀ + n/σ²)

HMC for non-conjugate models:

  • Requires gradient ∇_θ log π(θ|D) = ∇_θ log p(D|θ) + ∇_θ log π(θ)
  • Leapfrog integration preserves volume in phase space
  • Adaptive tuning during warm-up phase

For hierarchical models with many parameters, NUTS provides efficient sampling with automatic tuning, achieving 10-100× better effective sample size than random walk MH.

Premium Content

Monte Carlo Simulation

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
💼Interview Prep
📜Certificates
🤝Community Access

Already a member? Log in

Need Expert Statistics Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement