Multiple Imputation for Missing Data
Principled Missing Data Handling With Rubin's Rules
Multiple imputation creates several plausible completed datasets, analyzes each separately, and combines the results using Rubin's rules. It properly accounts for the uncertainty introduced by the imputation process.
- Epidemiology: Handle missing biomarker data in cohort studies
- Economics: Complete income data with observed correlates in survey analysis
- Healthcare: Impute missing lab values while preserving statistical validity
Imputing several times is better than imputing once with false confidence.
Multiple imputation (MI) creates several plausible completed datasets by filling in missing values, analyzes each separately, and combines results using Rubin's rules.
Definition: Multiple Imputation
A principled method for handling missing data that:
- Creates M imputed datasets
- Analyzes each dataset with standard methods
- Combines estimates and standard errors using Rubin's rules
Why Multiple Imputation?
Advantages Over Single Imputation
- Uncertainty: MI accounts for the uncertainty due to missing values
- Unbiased: Produces unbiased estimates under MAR
- Flexible: Works with any statistical model
- Valid: Produces correct standard errors and confidence intervals
The MICE Algorithm
Multiple Imputation by Chained Equations (MICE) is the most popular MI method.
Definition: MICE Algorithm
Iteratively imputes each variable with missing values using its own regression model, conditional on other variables. Each iteration cycles through all variables with missing data.
Steps
| Step | Action |
|------|--------|
| 1 | Initialize missing values with simple imputation (e.g., mean) |
| 2 | For each variable with missing values: fit regression on other variables |
| 3 | Draw imputed values from the predictive distribution |
| 4 | Repeat steps 2-3 for many cycles (typically 10-20) |
| 5 | After convergence, save the imputed dataset |
| 6 | Repeat steps 1-5 M times to create M datasets |
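The steps in the table can be sketched in a few lines of NumPy. This is a simplified illustration, not a reference implementation: it assumes linear regression for every variable and draws imputations by adding residual noise to the fitted values, whereas a full MICE implementation would also draw the regression coefficients from their posterior (or use PMM). The function name `chained_equations` and the toy data are hypothetical.

```python
import numpy as np

def chained_equations(data, n_cycles=10, seed=0):
    """One imputed dataset from a simplified chained-equations loop (linear models)."""
    rng = np.random.default_rng(seed)
    X = np.array(data, dtype=float)
    miss = np.isnan(X)
    # Step 1: initialize missing entries with column means
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.where(miss)[1])
    for _ in range(n_cycles):                      # Step 4: repeat for several cycles
        for j in np.where(miss.any(axis=0))[0]:    # Step 2: each incomplete variable
            obs = ~miss[:, j]
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            resid = X[obs, j] - A[obs] @ beta
            sigma = resid.std(ddof=A.shape[1])
            # Step 3: draw from an approximate predictive distribution
            n_mis = int((~obs).sum())
            X[~obs, j] = A[~obs] @ beta + rng.normal(scale=sigma, size=n_mis)
    return X                                       # Step 5: save the imputed dataset

# Toy data (hypothetical): two correlated columns with about 20% holes each
rng = np.random.default_rng(1)
n = 300
a = rng.normal(size=n)
b = 0.6 * a + rng.normal(scale=0.5, size=n)
toy = np.column_stack([a, b])
toy[rng.random(n) < 0.2, 0] = np.nan
toy[rng.random(n) < 0.2, 1] = np.nan

# Step 6: repeat with different seeds to obtain M = 5 imputed datasets
imputed_sets = [chained_equations(toy, seed=m) for m in range(5)]
```

Observed entries are never modified; only the originally missing cells change from one imputed dataset to the next, and that variation is what the between-imputation variance later measures.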
Predictive Mean Matching (PMM)
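PMM Method
$$y_i^{\text{imp}} = y_{j^*}^{\text{obs}}, \qquad j^* = \arg\min_{j \in \text{obs}} \left| \hat{y}_j - \hat{y}_i \right|$$
Here,
- $y_{j^*}^{\text{obs}}$ = Observed value from the donor with the closest predicted mean
Steps for each missing value:
- Fit a regression predicting $Y$ from the other variables using the observed data
- Predict $\hat{y}$ for both observed and missing cases
- For each missing case, find the observed case with the closest predicted value
- Use that donor's observed value as the imputation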
PMM Advantage
PMM always imputes plausible values within the range of observed data. It is robust to non-normality and model misspecification.
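A minimal sketch of the matching step, assuming a single predictor, ordinary least squares, and one closest donor per missing case as described above. Implementations often pick at random among several close donors and also perturb the regression coefficients; this sketch omits both. The function name `pmm_impute` and the simulated data are hypothetical.

```python
import numpy as np

def pmm_impute(y, X):
    """Impute NaNs in y by predictive mean matching on predictors X (single donor)."""
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])    # add an intercept column
    miss = np.isnan(y)
    # Fit a least-squares regression on the observed cases
    beta, *_ = np.linalg.lstsq(A[~miss], y[~miss], rcond=None)
    # Predict for both observed and missing cases
    pred = A @ beta
    # For each missing case, find the observed case with the closest prediction
    dist = np.abs(pred[miss][:, None] - pred[~miss][None, :])
    donors = dist.argmin(axis=1)
    # Copy the donor's observed value into the missing slot
    y_imp = y.copy()
    y_imp[miss] = y[~miss][donors]
    return y_imp

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.exp(0.5 * x + rng.normal(scale=0.3, size=200))   # skewed, strictly positive
y_missing = y.copy()
y_missing[rng.random(200) < 0.3] = np.nan
y_completed = pmm_impute(y_missing, x)
```

Because every imputed value is copied from an observed case, the completed variable stays strictly positive here even though the linear model itself could predict negative values.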
Rubin's Rules
Combined Point Estimate
Pooled Estimate
$$\bar{Q} = \frac{1}{M} \sum_{m=1}^{M} \hat{Q}_m$$
Here,
- $\hat{Q}_m$ = Point estimate from imputation $m$
- $M$ = Number of imputations
Combined Variance
Total Variance
$$T = \bar{U} + \left(1 + \frac{1}{M}\right) B$$
Here,
- $\bar{U}$ = Within-imputation variance (average of the M variances)
- $B$ = Between-imputation variance (variance of the M estimates)
Variance Components
- Within-imputation ($\bar{U}$): Sampling variability that would remain if the imputed values were the true values
- Between-imputation ($B$): Additional variability due to not knowing the true values
- Total ($T$): Correctly reflects the full uncertainty
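The pooling step is short enough to write out directly. The sketch below assumes each imputed dataset has already produced a point estimate and its squared standard error; the function name `pool_rubin` and the three input values are made up for illustration.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M point estimates and their squared standard errors with Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    M = len(q)
    q_bar = q.mean()               # pooled point estimate
    u_bar = u.mean()               # within-imputation variance
    b = q.var(ddof=1)              # between-imputation variance
    t = u_bar + (1 + 1 / M) * b    # total variance
    return q_bar, u_bar, b, t

# Hypothetical results from M = 3 imputed datasets
q_bar, u_bar, b, t = pool_rubin([10.0, 12.0, 14.0], [1.0, 1.0, 1.0])
print(f"estimate={q_bar:.2f}, within={u_bar:.2f}, between={b:.2f}, total={t:.2f}")
```

With these inputs the between-imputation variance (4) dominates the within-imputation variance (1), so ignoring it, as single imputation does, would badly understate the standard error.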
Number of Imputations
| Missing Fraction | Recommended M |
|-----------------|---------------|
| < 10% | 5-10 |
| 10-30% | 20-40 |
| 30-50% | 40-100 |
| > 50% | > 100 |
Rule of Thumb
Use at least as many imputations as the percentage of missing data. For example, if 25% of values are missing, use M = 25.
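As a quick sketch, the rule of thumb is a one-line calculation. The helper name `suggested_imputations` and the floor of 5 (taken from the smallest value in the table above) are assumptions made for illustration.

```python
import math

def suggested_imputations(missing_fraction, minimum=5):
    """Rule of thumb: at least as many imputations as the percentage missing."""
    percent_missing = round(100 * missing_fraction, 6)   # guard against float noise
    return max(minimum, math.ceil(percent_missing))

print(suggested_imputations(0.25))
```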
Diagnostics
Convergence
Plot the mean and variance of imputed values across iterations. They should stabilize after 10-20 iterations.
Imputation Quality
| Diagnostic | What to Check |
|-----------|---------------|
| Density plots | Imputed and observed distributions should overlap |
| Scatter plots | Relationships between variables should be similar |
| Trace plots | MICE chains should mix and converge |
Assumptions
MAR Assumption
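MI yields valid estimates when missingness is MAR (Missing at Random), which includes MCAR as a special case, and the imputation model is reasonable. If missingness is MNAR (Missing Not at Random), results may be biased, so conduct a sensitivity analysis.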
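One simple form of sensitivity analysis is a delta adjustment: shift the imputed values by a fixed amount and see how far the estimate moves. The sketch below assumes a single completed variable and a boolean mask of the originally missing entries; the function name `delta_adjusted_means` and the numbers are hypothetical. In a full analysis the shift would be applied within each imputed dataset before pooling.

```python
import numpy as np

def delta_adjusted_means(completed, was_missing, deltas):
    """Mean of a completed variable after shifting only the imputed values by each delta."""
    completed = np.asarray(completed, dtype=float)
    was_missing = np.asarray(was_missing, dtype=bool)
    results = {}
    for delta in deltas:
        shifted = completed.copy()
        shifted[was_missing] += delta   # MNAR scenario: missing values differ systematically
        results[delta] = shifted.mean()
    return results

# Hypothetical completed variable: the last two values were imputed
values = np.array([1.0, 2.0, 3.0, 2.0, 2.0])
mask = np.array([False, False, False, True, True])
sensitivity = delta_adjusted_means(values, mask, deltas=[0.0, 0.5, 1.0])
print(sensitivity)
```

If the conclusions hold across the range of shifts considered plausible, they are reasonably robust to departures from MAR.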
Python Implementation
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
import matplotlib.pyplot as plt

np.random.seed(42)

# Simulate three correlated variables
n = 500
X1 = np.random.randn(n)
X2 = 0.6 * X1 + np.random.randn(n) * 0.5
X3 = 0.3 * X1 + 0.5 * X2 + np.random.randn(n) * 0.7

# Create MAR missingness in X1: the probability depends only on the observed X2
missing_prob = 1 / (1 + np.exp(-(-1 + 0.8 * X2)))
R = np.random.binomial(1, missing_prob)  # R == 1 marks a missing value
X1_obs = X1.copy()
X1_obs[R == 1] = np.nan

df = pd.DataFrame({'X1': X1_obs, 'X2': X2, 'X3': X3})
true_mean = X1.mean()
obs_mean = np.nanmean(X1_obs)
missing_pct = df['X1'].isna().mean()
print(f"True X1 mean: {true_mean:.3f}")
print(f"Observed X1 mean: {obs_mean:.3f}")
print(f"Missing: {missing_pct:.1%}")

# Multiple imputation (M = 30)
# sample_posterior=True draws imputations from the predictive distribution.
# With the default (False) the imputations do not vary with random_state,
# so the between-imputation variance B would be zero.
M = 30
estimates = []
variances = []
for m in range(M):
    imputer = IterativeImputer(sample_posterior=True, random_state=m, max_iter=20)
    imputed = imputer.fit_transform(df)
    x1 = imputed[:, 0]
    estimates.append(x1.mean())
    variances.append(x1.var(ddof=1) / n)  # squared SE of the mean in this dataset

# Rubin's rules
Q_bar = np.mean(estimates)          # pooled estimate
U_bar = np.mean(variances)          # within-imputation variance
B = np.var(estimates, ddof=1)       # between-imputation variance
T = U_bar + (1 + 1 / M) * B         # total variance
SE = np.sqrt(T)
print(f"\nMI estimate: {Q_bar:.3f} (SE: {SE:.3f})")
# Normal approximation; a t reference distribution is more exact for small M
print(f"95% CI: [{Q_bar - 1.96*SE:.3f}, {Q_bar + 1.96*SE:.3f}]")

# Plot the estimate from each imputed dataset against the true mean
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].plot(range(1, M + 1), estimates, 'o-')
axes[0].axhline(y=true_mean, color='red', linestyle='--', label='True')
axes[0].set_xlabel('Imputation')
axes[0].set_ylabel('X1 Mean')
axes[0].set_title('Estimates Across Imputations')
axes[0].legend()

# Density comparison: observed vs. imputed X1 values in the last imputed dataset
was_missing = df['X1'].isna().to_numpy()
axes[1].hist(x1[~was_missing], bins=30, alpha=0.5, density=True, label='Observed')
axes[1].hist(x1[was_missing], bins=30, alpha=0.5, density=True, label='Imputed')
axes[1].legend()
axes[1].set_title('Density Comparison')
plt.tight_layout()
plt.show()
Worked Example
Example: Health Survey with Missing Income
A health survey with 30% missing income data (MAR: respondents whose observed characteristics are associated with higher income are less likely to report it):
| Method | Mean Income (USD) | SE (USD) | 95% CI (USD) |
|--------|-------------------|----------|--------------|
| Complete cases | 42,500 | 1,200 | [40,148, 44,852] |
| Single imputation | 41,800 | 980 | [39,880, 43,720] |
| Multiple imputation | 45,200 | 1,350 | [42,554, 47,846] |
| True value | 45,100 | n/a | n/a |
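MI produces the least biased estimate. Complete-case analysis underestimates mean income because higher earners are underrepresented among respondents. Single imputation understates the standard error because it treats imputed values as if they had been observed.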
Key Takeaways
Summary: Multiple Imputation
- MI creates M plausible completed datasets and combines results via Rubin's rules
- MICE (chained equations) is the most popular algorithm
- PMM (predictive mean matching) is robust to non-normality
- Use M = 20 imputations; more for large missing fractions
- Rubin's rules combine estimates (average) and variances (within + between)
- Diagnostics: Check convergence, density overlap, and imputation quality
- MI assumes MAR; conduct sensitivity analysis for MNAR
- Always report the number of imputations and missing data patterns
Related Topics
- See Missing Data for mechanisms (MCAR, MAR, MNAR)
- See Bootstrap Methods for resampling methods
- See Regression Diagnostics for model checking