Multiple Imputation for Missing Data

Principled Missing Data Handling With Rubin's Rules

Multiple imputation creates several plausible completed datasets, analyzes each separately, and combines the results using Rubin's rules. This accounts for the uncertainty introduced by the imputation process.

Epidemiology — Handle missing biomarker data in cohort studies
Economics — Complete income data with observed correlates in survey analysis
Healthcare — Impute missing lab values while preserving statistical validity

Imputing several times and pooling honestly is better than imputing once with false confidence.

Multiple imputation (MI) creates several plausible completed datasets by filling in missing values, analyzes each separately, and combines results using Rubin's rules.

Definition: Multiple Imputation

A principled method for handling missing data that:

Creates M imputed datasets
Analyzes each dataset with standard methods
Combines estimates and standard errors using Rubin's rules

Why Multiple Imputation?

Advantages Over Single Imputation

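Uncertainty: Accounts for the uncertainty due to missing values instead of treating imputed values as known
Unbiased: Produces unbiased estimates under MAR when the imputation model is adequate
Flexible: Works with any analysis model that yields an estimate and a standard error
Valid: Produces standard errors and confidence intervals with appropriate coverage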

The MICE Algorithm

Multiple Imputation by Chained Equations (MICE) is the most popular MI method.

Definition: MICE Algorithm

Iteratively imputes each variable with missing values using its own regression model, conditional on other variables. Each iteration cycles through all variables with missing data.

Steps

| Step | Action |
|------|--------|
| 1 | Initialize missing values with simple imputation (e.g., mean) |
| 2 | For each variable $X_j$ with missing values: fit a regression on the other variables |
| 3 | Draw imputed values from the predictive distribution |
| 4 | Repeat steps 2-3 for many cycles (typically 10-20) |
| 5 | After convergence, save the imputed dataset |
| 6 | Repeat steps 1-5 M times to create M datasets |
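
The six steps can be sketched in a few lines of NumPy. This is a simplified illustration, not a reference implementation: it assumes every variable is continuous and uses ordinary least squares with normal residual noise for each conditional model, and it does not draw the regression coefficients from their posterior, which a full MICE implementation would do. The names `mice_once` and `mice` are hypothetical.

```python
import numpy as np

def mice_once(X, n_cycles=10, rng=None):
    """One chained-equations run on X (NaN marks missing); returns a completed copy."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Step 1: initialize missing entries with the observed column means
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_cycles):                          # Step 4: repeat the cycle
        for j in np.where(missing.any(axis=0))[0]:     # Step 2: each incomplete variable
            obs = ~missing[:, j]
            A = np.column_stack([np.ones(len(filled)), np.delete(filled, j, axis=1)])
            beta, *_ = np.linalg.lstsq(A[obs], filled[obs, j], rcond=None)
            resid = filled[obs, j] - A[obs] @ beta
            dof = max(obs.sum() - A.shape[1], 1)
            sigma = np.sqrt(resid @ resid / dof)
            # Step 3: draw imputations as prediction plus residual noise
            # (simplification: beta is held at its least-squares estimate)
            filled[~obs, j] = A[~obs] @ beta + rng.normal(0.0, sigma, (~obs).sum())
    return filled                                      # Step 5: save the completed data

def mice(X, M=5, n_cycles=10, seed=0):
    """Step 6: repeat the whole procedure M times with different random streams."""
    return [mice_once(X, n_cycles, rng=seed + m) for m in range(M)]

# Small synthetic demo: three variables, about 30% of the first one missing
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 3))
Z[:, 1] += 0.6 * Z[:, 0]
X = Z.copy()
X[rng.random(200) < 0.3, 0] = np.nan
completed = mice(X, M=5)
```

Each element of `completed` has the observed entries unchanged and a different set of draws in the missing positions, which is what later produces a nonzero between-imputation variance.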

Predictive Mean Matching (PMM)

PMM Method

$$\hat{X}_j^{miss} = X_j^{obs}[k]$$

Here,

$X_j^{obs}[k]$ = observed value from donor $k$, the case with the closest predicted mean

Steps for each missing value:

Fit a regression predicting $X_j$ from other variables using observed data
Predict $\hat{X}_j$ for both observed and missing cases
For each missing case, find the observed case with the closest predicted value
Use the observed value as the imputation
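
A minimal sketch of these four steps for a single variable follows, assuming a linear regression as the prediction model. The function name `pmm_impute` and the `n_donors` parameter are illustrative. With `n_donors=1` it takes the single closest donor, as in the steps above; software commonly draws at random from a small pool of nearest donors instead, which `n_donors > 1` imitates.

```python
import numpy as np

def pmm_impute(y, X, n_donors=1, rng=None):
    """Impute NaNs in y by predictive mean matching on predictors X."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    obs = ~np.isnan(y)
    A = np.column_stack([np.ones(len(y)), X])
    # Step 1: fit the regression on cases where y is observed
    beta, *_ = np.linalg.lstsq(A[obs], y[obs], rcond=None)
    # Step 2: predict for both observed and missing cases
    pred = A @ beta
    y_obs, pred_obs = y[obs], pred[obs]
    out = y.copy()
    for i in np.where(~obs)[0]:
        # Step 3: find the observed case(s) with the closest predicted value
        donors = np.argsort(np.abs(pred_obs - pred[i]))[:n_donors]
        # Step 4: copy the donor's observed value
        out[i] = y_obs[rng.choice(donors)]
    return out

# Demo on a skewed, strictly positive variable
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = np.exp(0.5 * x + rng.normal(scale=0.3, size=100))
y_miss = y.copy()
y_miss[rng.random(100) < 0.3] = np.nan
y_imp = pmm_impute(y_miss, x, rng=0)
```

Because every imputed value is copied from a donor, `y_imp` stays strictly positive even though the linear model itself could predict negative values.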

PMM Advantage

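PMM imputes only values that were actually observed, so imputations stay within the observed range and respect features such as bounds or discreteness. This makes it more robust to non-normality and to misspecification of the regression model than drawing from a fitted normal distribution.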

Rubin's Rules

Combined Point Estimate

Pooled Estimate

$$\bar{Q} = \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m$$

Here,

$\hat{Q}_m$ = point estimate from imputation $m$
$M$ = number of imputations

Combined Variance

Total Variance

$$T = \bar{U} + \left(1 + \frac{1}{M}\right)B$$

Here,

$\bar{U}$ = within-imputation variance (average of the M variances)
$B$ = between-imputation variance (variance of the M estimates)
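
Written out, the two components are as follows, where $U_m$ (a symbol introduced here for illustration) denotes the squared standard error of $\hat{Q}_m$ in imputed dataset $m$:

$$\bar{U} = \frac{1}{M}\sum_{m=1}^{M} U_m, \qquad B = \frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{Q}_m - \bar{Q}\right)^2$$

As a purely illustrative calculation with made-up numbers, take $M = 5$, $\bar{U} = 0.04$ and $B = 0.01$:

$$T = 0.04 + \left(1 + \frac{1}{5}\right)(0.01) = 0.052, \qquad \sqrt{T} \approx 0.228$$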

Variance Components

Within-imputation ( $\bar{U}$ ): Sampling variability that would remain even if the missing values were known
Between-imputation ( $B$ ): Additional variability due to not knowing the true values
Total ( $T$ ): Combines both sources, so it reflects the full uncertainty
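
The pooling step is short enough to write directly. The sketch below assumes each analysis returns a scalar estimate together with its squared standard error; the function name `pool_rubin` and the input numbers are made up for illustration.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M scalar estimates and their squared standard errors with Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    M = len(q)
    q_bar = q.mean()                    # pooled point estimate
    u_bar = u.mean()                    # within-imputation variance
    b = q.var(ddof=1)                   # between-imputation variance
    t = u_bar + (1 + 1 / M) * b         # total variance
    return {"estimate": q_bar, "within": u_bar, "between": b,
            "total": t, "se": np.sqrt(t)}

# Illustrative inputs: five estimates and their squared standard errors
pooled = pool_rubin([10.2, 9.8, 10.5, 10.1, 9.9],
                    [0.25, 0.24, 0.26, 0.25, 0.25])
```

For these inputs the pooled estimate is 10.1, the within-imputation variance is 0.25, the between-imputation variance is 0.075, and the total variance is 0.25 + 1.2 × 0.075 = 0.34.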

Number of Imputations

| Missing Fraction | Recommended M |
|-----------------|---------------|
| < 10% | 5-10 |
| 10-30% | 20-40 |
| 30-50% | 40-100 |
| > 50% | > 100 |

Rule of Thumb

Use at least as many imputations as the percentage of missing data. For example, if 25% of values are missing, use M = 25.
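
The rule of thumb translates into a one-line helper. The floor of 5 imputations is an assumption taken from the smallest value in the table above, and `recommended_m` is a hypothetical name.

```python
import math

def recommended_m(missing_fraction, minimum=5):
    """Rule of thumb: at least as many imputations as the percentage missing."""
    # Rounding first guards against floating-point noise such as 100 * 0.07
    percent = round(100 * missing_fraction, 6)
    return max(minimum, math.ceil(percent))

m_for_25_percent = recommended_m(0.25)
m_for_3_percent = recommended_m(0.03)
```

Treat the result as a lower bound; the table suggests going higher when the missing fraction is large.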

Diagnostics

Convergence

Plot the mean and variance of imputed values across iterations. They should stabilize after 10-20 iterations.
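
One way to obtain such traces is to record the statistics after every cycle of the chained regressions. The sketch below uses a simplified imputer (ordinary least squares plus normal noise, continuous variables only) as a stand-in for whatever MICE software is actually in use; `imputation_trace` is a hypothetical helper.

```python
import numpy as np

def imputation_trace(X, j, n_iter=20, seed=0):
    """Mean and variance of the imputed values of column j after each cycle."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, axis=0), X)   # mean initialization
    means, variances = [], []
    for _ in range(n_iter):
        for k in np.where(miss.any(axis=0))[0]:
            obs = ~miss[:, k]
            A = np.column_stack([np.ones(len(X)), np.delete(filled, k, axis=1)])
            beta, *_ = np.linalg.lstsq(A[obs], filled[obs, k], rcond=None)
            sigma = (filled[obs, k] - A[obs] @ beta).std(ddof=A.shape[1])
            filled[~obs, k] = A[~obs] @ beta + rng.normal(0.0, sigma, (~obs).sum())
        imputed_j = filled[miss[:, j], j]
        means.append(imputed_j.mean())
        variances.append(imputed_j.var(ddof=1))
    return np.array(means), np.array(variances)

# Synthetic data with two incomplete variables, three chains with different seeds
rng = np.random.default_rng(42)
n = 300
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(scale=0.5, size=n)
x3 = 0.3 * x1 + 0.5 * x2 + rng.normal(scale=0.7, size=n)
X = np.column_stack([x1, x2, x3])
X[rng.random(n) < 0.3, 0] = np.nan
X[rng.random(n) < 0.2, 1] = np.nan
traces = [imputation_trace(X, j=0, seed=s) for s in range(3)]
mean_traces = np.array([t[0] for t in traces])
```

Plotting each row of `mean_traces` against the iteration number gives one line per chain; lines that intermingle without a trend suggest convergence.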

Imputation Quality

| Diagnostic | What to Check |
|-----------|---------------|
| Density plots | Imputed and observed distributions should overlap |
| Scatter plots | Relationships between variables should be similar |
| Trace plots | MICE chains should mix and converge |

Assumptions

MAR Assumption

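MI yields unbiased estimates when data are missing at random (MAR), meaning missingness depends only on observed values. If data are missing not at random (MNAR), meaning missingness depends on the unobserved values themselves, results may be biased. Conduct a sensitivity analysis to gauge how conclusions change under plausible MNAR scenarios.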
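
One common form of sensitivity analysis, used here as an assumed example since the text does not name a method, is delta adjustment: shift the imputed values by a fixed amount and see how far the pooled estimate moves. The helper `delta_adjusted_means` and the synthetic inputs are hypothetical.

```python
import numpy as np

def delta_adjusted_means(completed, was_missing, deltas):
    """Pooled mean of one variable after shifting its imputed values by each delta.

    completed:   array of shape (M, n), the variable in each of M completed datasets
    was_missing: boolean array of shape (n,), True where the value was imputed
    """
    completed = np.asarray(completed, dtype=float)
    mask = np.asarray(was_missing, dtype=bool)
    pooled = {}
    for delta in deltas:
        shifted = completed + delta * mask            # shift imputed entries only
        pooled[delta] = shifted.mean(axis=1).mean()   # average of the M means
    return pooled

# Synthetic stand-in for M = 5 completed copies of a variable with 30% imputed
rng = np.random.default_rng(0)
completed = rng.normal(loc=50.0, scale=10.0, size=(5, 100))
was_missing = np.arange(100) < 30
result = delta_adjusted_means(completed, was_missing, deltas=[-5.0, 0.0, 5.0])
```

With 30% of values imputed, each unit of delta moves the pooled mean by 0.3 units. If the substantive conclusion survives the range of deltas considered plausible, it is less sensitive to departures from MAR.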

Python Implementation


import numpy as np
import pandas as pd
# IterativeImputer is experimental in scikit-learn; this import enables it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
import matplotlib.pyplot as plt

np.random.seed(42)

# Simulate three correlated variables
n = 500
X1 = np.random.randn(n)
X2 = 0.6 * X1 + np.random.randn(n) * 0.5
X3 = 0.3 * X1 + 0.5 * X2 + np.random.randn(n) * 0.7

# Create MAR missingness in X1: the probability depends only on the fully observed X2
missing_prob = 1 / (1 + np.exp(-(-1 + 0.8 * X2)))
R = np.random.binomial(1, missing_prob)  # R == 1 marks a missing value
X1_obs = X1.copy()
X1_obs[R == 1] = np.nan

df = pd.DataFrame({'X1': X1_obs, 'X2': X2, 'X3': X3})
true_mean = X1.mean()
obs_mean = np.nanmean(X1_obs)
missing_pct = df['X1'].isna().mean()

print(f"True X1 mean: {true_mean:.3f}")
print(f"Observed X1 mean: {obs_mean:.3f}")
print(f"Missing: {missing_pct:.1%}")



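# Multiple Imputation (M = 30)
# sample_posterior=True draws each imputation from the predictive distribution.
# With the default (False) every run returns nearly the same deterministic
# fill-in, so the between-imputation variance collapses toward zero.
M = 30
estimates = []
within_vars = []
for m in range(M):
    imputer = IterativeImputer(sample_posterior=True, random_state=m, max_iter=20)
    imputed = imputer.fit_transform(df)
    x1_completed = imputed[:, 0]
    estimates.append(x1_completed.mean())
    # Squared standard error of the mean in this completed dataset
    within_vars.append(x1_completed.var(ddof=1) / n)

# Rubin's rules
Q_bar = np.mean(estimates)
B = np.var(estimates, ddof=1)   # between-imputation variance
U_bar = np.mean(within_vars)    # within-imputation variance
T = U_bar + (1 + 1/M) * B
SE = np.sqrt(T)

print(f"\nMI estimate: {Q_bar:.3f} (SE: {SE:.3f})")
# Normal approximation; a t reference distribution is more accurate for small M
print(f"95% CI: [{Q_bar - 1.96*SE:.3f}, {Q_bar + 1.96*SE:.3f}]")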



# Plot the estimate from each imputed dataset against the true mean
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].plot(estimates, 'o-')
axes[0].axhline(y=true_mean, color='red', linestyle='--', label='True')
axes[0].set_xlabel('Imputation')
axes[0].set_ylabel('X1 Mean')
axes[0].set_title('Estimates Across Imputations')
axes[0].legend()

# Density comparison: observed X1 values vs. imputed X1 values (last imputation)
axes[1].hist(X1_obs[R == 0], bins=30, alpha=0.5, density=True, label='Observed')
axes[1].hist(imputed[R == 1, 0], bins=30, alpha=0.3, density=True, label='Imputed')
axes[1].legend()
axes[1].set_title('Density Comparison')
plt.tight_layout()
plt.show()

Worked Example

Example: Health Survey with Missing Income

A health survey has 30% missing income data. Missingness is assumed MAR: respondents whose observed covariates predict higher income are less likely to report it.

| Method | Mean Income | SE | 95% CI |
|--------|------------|-----|---------|
| True value | $45,100 | — | — |

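MI produces the least biased estimate. Complete-case analysis underestimates mean income because higher earners are under-represented among respondents who reported it. Single imputation treats imputed values as if they were observed, so it underestimates the standard error.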

Key Takeaways

Summary: Multiple Imputation

MI creates M plausible completed datasets and combines results via Rubin's rules
MICE (chained equations) is the most popular algorithm
PMM (predictive mean matching) is robust to non-normality
Use roughly as many imputations as the percentage of missing data (M = 20 is a common starting point); use more for large missing fractions
Rubin's rules combine estimates (average) and variances (within + between)
Diagnostics: Check convergence, density overlap, and imputation quality
MI assumes MAR — conduct sensitivity analysis for MNAR
Always report the number of imputations and missing data patterns
