Bootstrap Methods — Resampling for Inference
Computer-Intensive Inference Without Distributional Assumptions
Bootstrapping estimates the sampling distribution of a statistic by resampling with replacement from the observed data. It provides standard errors, confidence intervals, and hypothesis tests when theoretical formulas are unavailable or unreliable.
- Finance: Estimate VaR confidence intervals for complex portfolio distributions
- Ecology: Build confidence intervals for species diversity indices
- Machine Learning: Assess variability of feature importance measures
Let the data generate its own reference distribution through the power of resampling.
Definition: Bootstrap
A computer-intensive method that approximates the sampling distribution of a statistic by repeatedly resampling (with replacement) from the observed data and recomputing the statistic for each resample.
Bootstrap Principle
Key Insight
The empirical distribution of the sample approximates the true population distribution. Therefore, the distribution of statistics computed from bootstrap samples approximates the true sampling distribution.
Algorithm
| Step | Action |
|------|--------|
| 1 | Draw a bootstrap sample by sampling with replacement from the original data |
| 2 | Compute the statistic from the bootstrap sample |
| 3 | Repeat steps 1-2 B times (typically B = 1,000 - 10,000) |
| 4 | Use the distribution of $\hat{\theta}^*_1, \dots, \hat{\theta}^*_B$ for inference |
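The four steps in the table can be sketched as a small helper. This is a minimal illustration, not a reference implementation: the function name `bootstrap_distribution`, the exponential test data, and the default of 2,000 resamples are assumptions chosen for the example.

```python
import numpy as np

def bootstrap_distribution(data, statistic, n_resamples=2000, seed=0):
    """Return the statistic recomputed on n_resamples bootstrap samples."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    stats = np.empty(n_resamples)
    for b in range(n_resamples):
        # Step 1: draw n indices with replacement from the original data
        idx = rng.integers(0, n, size=n)
        # Step 2: recompute the statistic on the resample
        stats[b] = statistic(data[idx])
    # Step 3 is the loop itself; step 4 uses the returned array for inference
    return stats

# Illustrative data (assumption): 100 draws from a skewed exponential distribution
sample = np.random.default_rng(1).exponential(scale=2.0, size=100)
boot_medians = bootstrap_distribution(sample, np.median)
print(f"sample median: {np.median(sample):.3f}")
print(f"mean of bootstrap medians: {boot_medians.mean():.3f}")
```

Passing the statistic as a function keeps the resampling loop unchanged when switching from the median to, say, a trimmed mean.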
Bootstrap Standard Error
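Bootstrap Standard Error
$$\widehat{SE}_{boot} = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\hat{\theta}^*_b - \bar{\theta}^*\right)^2}$$
Here,
- $\hat{\theta}^*_b$ = Statistic from bootstrap sample $b$
- $\bar{\theta}^*$ = Mean of bootstrap statistics
- $B$ = Number of bootstrap resamples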
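As a rough check of the bootstrap standard error, the sketch below computes it for a sample mean, where the textbook value $s/\sqrt{n}$ is available for comparison. The simulated normal data and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=200)   # illustrative data (assumption)
B = 2000

# Draw all B index sets at once: one row of n indices per resample
idx = rng.integers(0, len(x), size=(B, len(x)))
boot_means = x[idx].mean(axis=1)

# Sample standard deviation of the bootstrap statistics (divisor B - 1)
theta_bar = boot_means.mean()
se_boot = np.sqrt(np.sum((boot_means - theta_bar) ** 2) / (B - 1))

# For the mean, s / sqrt(n) gives a reference value
se_formula = x.std(ddof=1) / np.sqrt(len(x))
print(f"bootstrap SE: {se_boot:.3f}, formula SE: {se_formula:.3f}")
```

The two values should be close for the mean; the bootstrap version becomes useful for statistics that have no such closed form.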
Bootstrap Confidence Intervals
Percentile Method
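Percentile CI
$$\left[\hat{\theta}^*_{(\alpha/2)},\ \hat{\theta}^*_{(1-\alpha/2)}\right]$$
Here,
- $\hat{\theta}^*_{(p)}$ = $p$-th quantile (percentile) of the bootstrap distribution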
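The percentile interval simply reads off two quantiles of the bootstrap distribution. The sketch below applies it to the median of a skewed sample; the helper name `percentile_ci` and the lognormal data are assumptions for the example.

```python
import numpy as np

def percentile_ci(boot_stats, alpha=0.05):
    """Percentile interval: the alpha/2 and 1 - alpha/2 quantiles."""
    lo, hi = np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

rng = np.random.default_rng(0)
x = rng.lognormal(mean=1.5, sigma=0.6, size=150)   # illustrative skewed data
idx = rng.integers(0, len(x), size=(4000, len(x)))
boot_medians = np.median(x[idx], axis=1)

lo, hi = percentile_ci(boot_medians)
print(f"median = {np.median(x):.3f}, 95% percentile CI = [{lo:.3f}, {hi:.3f}]")
```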
BCa (Bias-Corrected and Accelerated)
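BCa CI
$$\left[\hat{\theta}^*_{(\alpha_1)},\ \hat{\theta}^*_{(\alpha_2)}\right], \qquad \alpha_1 = \Phi\!\left(\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}\,(\hat{z}_0 + z_{\alpha/2})}\right)$$
Here,
- $\alpha_1$ = Adjusted lower percentile ($\alpha_2$ is the same expression with $z_{1-\alpha/2}$ in place of $z_{\alpha/2}$)
- $\hat{z}_0$ = Bias correction factor
- $\hat{a}$ = Acceleration factor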
BCa vs Percentile
The BCa interval adjusts for bias and skewness in the bootstrap distribution. It is generally preferred over the simple percentile method.
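One common way to compute the BCa adjustment is sketched below. It estimates the bias correction from the share of bootstrap statistics below the observed statistic and the acceleration from jackknife (leave-one-out) estimates. The function name `bca_ci`, the jackknife choice for the acceleration, and the exponential test data are assumptions for illustration; production code would normally rely on a tested library routine.

```python
import numpy as np
from statistics import NormalDist

def bca_ci(x, statistic, n_resamples=4000, alpha=0.05, seed=0):
    """BCa interval with jackknife acceleration (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = len(x)
    theta_hat = statistic(x)
    boot = np.array([statistic(x[rng.integers(0, n, size=n)])
                     for _ in range(n_resamples)])

    std_normal = NormalDist()
    # Bias correction: normal quantile of the share of resamples below theta_hat
    z0 = std_normal.inv_cdf(np.mean(boot < theta_hat))
    # Acceleration: skewness of the leave-one-out (jackknife) estimates
    jack = np.array([statistic(np.delete(x, i)) for i in range(n)])
    d = jack.mean() - jack
    a = np.sum(d ** 3) / (6.0 * np.sum(d ** 2) ** 1.5)

    def adjusted(p):
        # Map a nominal level p to the bias- and skew-adjusted level
        z = std_normal.inv_cdf(p)
        return std_normal.cdf(z0 + (z0 + z) / (1.0 - a * (z0 + z)))

    a1, a2 = adjusted(alpha / 2), adjusted(1 - alpha / 2)
    lo, hi = np.quantile(boot, [a1, a2])
    return lo, hi, z0, a

# Illustrative right-skewed data (assumption)
x = np.random.default_rng(1).exponential(scale=2.0, size=80)
lo, hi, z0, a = bca_ci(x, np.mean)
print(f"mean = {x.mean():.3f}, BCa 95% CI = [{lo:.3f}, {hi:.3f}]")
print(f"bias correction z0 = {z0:.3f}, acceleration a = {a:.3f}")
```

When both `z0` and `a` are zero, the adjusted levels reduce to `alpha/2` and `1 - alpha/2`, so the BCa interval coincides with the percentile interval.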
Types of Bootstrap
| Type | Resampling Unit | When to Use |
|------|----------------|-------------|
| Nonparametric | Individual observations | Default; no distributional assumptions |
| Parametric | From fitted distribution | When distribution is known |
| Block | Blocks of observations | Time series data |
| Wild | Residuals with sign changes | Heteroscedastic data |
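To illustrate why the resampling unit matters, the sketch below compares a moving-block bootstrap with ordinary observation-level resampling on an autocorrelated series. The helper `moving_block_resample`, the AR(1) test series, and the block length of 25 are assumptions for the example; choosing the block length in practice is a separate tuning question.

```python
import numpy as np

def moving_block_resample(series, block_len, rng):
    """One moving-block bootstrap resample with the same length as the series."""
    series = np.asarray(series)
    n = len(series)
    n_blocks = int(np.ceil(n / block_len))
    # Random start positions of overlapping blocks
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    blocks = [series[s:s + block_len] for s in starts]
    return np.concatenate(blocks)[:n]   # trim to the original length

rng = np.random.default_rng(0)

# Illustrative AR(1) series with positive autocorrelation (assumption)
n, phi = 500, 0.7
eps = rng.normal(size=n)
y = np.empty(n)
y[0] = eps[0]
for t in range(1, n):
    y[t] = phi * y[t - 1] + eps[t]

block_means = np.array([moving_block_resample(y, 25, rng).mean()
                        for _ in range(2000)])
iid_means = np.array([rng.choice(y, size=n, replace=True).mean()
                      for _ in range(2000)])
print(f"block bootstrap SE: {block_means.std(ddof=1):.3f}")
print(f"observation-level SE: {iid_means.std(ddof=1):.3f}")
```

Resampling single observations destroys the serial dependence, so for a positively autocorrelated series it tends to understate the standard error of the mean; resampling blocks retains the dependence within each block.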
Bootstrap Hypothesis Testing
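Bootstrap p-value
$$p = \frac{1}{B}\sum_{b=1}^{B} \mathbf{1}\!\left(\left|\hat{\theta}^*_b - \hat{\theta}\right| \ge \left|\hat{\theta} - \theta_0\right|\right)$$
Here,
- $\hat{\theta}$ = Observed statistic
- $\theta_0$ = Null hypothesis value
- $B$ = Number of bootstrap samples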
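A two-sided bootstrap p-value of this kind can be sketched as follows. It uses the bootstrap distribution centred at the observed statistic as a stand-in for the null distribution, which assumes the shape of the sampling distribution does not depend on the parameter value. The function name `bootstrap_p_value` and the simulated data are assumptions for illustration.

```python
import numpy as np

def bootstrap_p_value(boot_stats, theta_hat, theta0):
    """Share of resamples at least as far from theta_hat as theta_hat is from theta0."""
    boot_stats = np.asarray(boot_stats)
    return np.mean(np.abs(boot_stats - theta_hat) >= np.abs(theta_hat - theta0))

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=200)   # illustrative data (assumption)
idx = rng.integers(0, len(x), size=(5000, len(x)))
boot_means = x[idx].mean(axis=1)
theta_hat = x.mean()

# A null value far from the observed mean should give a small p-value
p_far = bootstrap_p_value(boot_means, theta_hat, theta0=4.0)
# A null value equal to the observed mean gives p = 1 by construction
p_near = bootstrap_p_value(boot_means, theta_hat, theta0=theta_hat)
print(f"H0: mean = 4.0 -> p = {p_far:.4f}")
print(f"H0: mean = observed -> p = {p_near:.4f}")
```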
Subsampling
A related method that samples without replacement, using a smaller sample size $m < n$.
Subsampling vs Bootstrap
Subsampling does not require the data to be exchangeable and works for some problems where the bootstrap fails (e.g., unit roots). However, it requires choosing the subsample size $m$.
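A minimal subsampling sketch for a standard error is shown below. It assumes a root-n consistent statistic, so the spread observed at the smaller sample size is rescaled by $\sqrt{m/n}$, and it assumes the subsample size is small relative to the full sample. The function name `subsample_se` and the choice of 40 out of 400 observations are illustrative, not recommendations.

```python
import numpy as np

def subsample_se(x, statistic, m, n_subsamples=2000, seed=0):
    """Standard error estimate from subsamples of size m drawn without replacement.

    Assumes a root-n consistent statistic, so the spread at size m is
    rescaled to size n by sqrt(m / n).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n = len(x)
    stats = np.array([statistic(rng.choice(x, size=m, replace=False))
                      for _ in range(n_subsamples)])
    return stats.std(ddof=1) * np.sqrt(m / n)

# Illustrative data (assumption)
x = np.random.default_rng(1).normal(loc=5.0, scale=2.0, size=400)
se_sub = subsample_se(x, np.mean, m=40)
se_formula = x.std(ddof=1) / np.sqrt(len(x))
print(f"subsampling SE: {se_sub:.3f}, formula SE: {se_formula:.3f}")
```

The result depends on the subsample size: too large and the subsamples overlap heavily, too small and the statistic behaves differently than it does at the full sample size.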
Python Implementation
```python
import numpy as np
import matplotlib.pyplot as plt
from statistics import NormalDist

np.random.seed(42)

# Original data
n = 200
true_mean = 5.0
true_std = 2.0
data = np.random.normal(true_mean, true_std, n)

# Observed statistic
obs_mean = np.mean(data)
print(f"Observed mean: {obs_mean:.3f}")

# Bootstrap: resample with replacement and recompute the mean B times
B = 5000
boot_means = np.zeros(B)
for b in range(B):
    sample = np.random.choice(data, size=n, replace=True)
    boot_means[b] = np.mean(sample)

# Bootstrap SE (the analytical value uses the known population sigma)
boot_se = np.std(boot_means, ddof=1)
print(f"Bootstrap SE: {boot_se:.3f}")
print(f"Analytical SE: {true_std/np.sqrt(n):.3f}")

# Percentile CI
alpha = 0.05
ci_perc = np.percentile(boot_means, [100*alpha/2, 100*(1-alpha/2)])
print(f"Percentile 95% CI: [{ci_perc[0]:.3f}, {ci_perc[1]:.3f}]")

# BCa CI
std_normal = NormalDist()
# Bias correction: normal quantile of the share of bootstrap means below the observed mean
z0 = std_normal.inv_cdf(np.mean(boot_means < obs_mean))
# Acceleration from the jackknife (leave-one-out) means
jack_means = (data.sum() - data) / (n - 1)
d = jack_means.mean() - jack_means
a_hat = np.sum(d**3) / (6 * np.sum(d**2)**1.5)
# Adjusted percentile levels
z_lo = std_normal.inv_cdf(alpha/2)
z_hi = std_normal.inv_cdf(1 - alpha/2)
alpha1 = std_normal.cdf(z0 + (z0 + z_lo) / (1 - a_hat*(z0 + z_lo)))
alpha2 = std_normal.cdf(z0 + (z0 + z_hi) / (1 - a_hat*(z0 + z_hi)))
ci_bca = np.percentile(boot_means, [100*alpha1, 100*alpha2])
print(f"BCa 95% CI: [{ci_bca[0]:.3f}, {ci_bca[1]:.3f}]")

# Bootstrap distribution
plt.figure(figsize=(8, 5))
plt.hist(boot_means, bins=50, edgecolor='black', alpha=0.7)
plt.axvline(x=obs_mean, color='red', linestyle='--', label='Observed')
plt.axvline(x=true_mean, color='green', linestyle='--', label='True')
plt.xlabel('Bootstrap Mean')
plt.ylabel('Frequency')
plt.title('Bootstrap Distribution of the Mean')
plt.legend()
plt.show()

# Bootstrap hypothesis test: H0: mean = 4.5
theta0 = 4.5
p_value = np.mean(np.abs(boot_means - obs_mean) >= np.abs(obs_mean - theta0))
print(f"\nBootstrap test (H0: mean=4.5): p={p_value:.4f}")
```
Worked Example
Example: Median with Bootstrap CI
Computing a 95% confidence interval for the median of a skewed distribution:
| Method | Estimate | 95% CI |
|--------|----------|--------|
| Normal theory | 4.85 | [4.21, 5.49] |
| Bootstrap percentile | 4.82 | [4.18, 5.62] |
| Bootstrap BCa | 4.82 | [4.25, 5.71] |
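The distribution is right-skewed, but the normal-theory CI is symmetric about its estimate by construction (0.64 on each side of 4.85), so it cannot reflect the skew. Both bootstrap intervals are asymmetric and extend further to the right (for BCa, 0.57 below and 0.89 above 4.82), which typically gives more accurate coverage for skewed distributions.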
Key Takeaways
Summary: Bootstrap Methods
- Bootstrap resamples with replacement to approximate the sampling distribution
- Works for many statistics, including the mean, median, and regression coefficients
- Use B = 1,000-10,000 bootstrap resamples
- Percentile CI is simple but may be biased; BCa adjusts for bias and skewness
- Bootstrap provides standard errors and confidence intervals without distributional assumptions
- For time series, use block bootstrap to preserve dependence
- Subsampling (without replacement) is an alternative for some problems where the bootstrap fails
Related Topics
- See Cross-Validation for resampling in model evaluation
- See AIC and BIC for model selection criteria
- See Multiple Imputation for another resampling-based approach