
Multiple Imputation for Missing Data


Principled Missing Data Handling With Rubin's Rules

Multiple imputation creates several plausible completed datasets, analyzes each separately, and combines the results using Rubin's rules. It properly accounts for the uncertainty introduced by the imputation process.

  • Epidemiology: Handle missing biomarker data in cohort studies

  • Economics: Complete income data with observed correlates in survey analysis

  • Healthcare: Impute missing lab values while preserving statistical validity

Imputing several times and pooling the results beats imputing once with false confidence.


Multiple imputation (MI) creates several plausible completed datasets by filling in missing values, analyzes each separately, and combines results using Rubin's rules.

Definition: Multiple Imputation

A principled method for handling missing data that:

  1. Creates M imputed datasets

  2. Analyzes each dataset with standard methods

  3. Combines estimates and standard errors using Rubin's rules


Why Multiple Imputation?

Advantages Over Single Imputation

  • Uncertainty: MI accounts for the uncertainty due to missing values

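  • Unbiased: Produces approximately unbiased estimates under MAR, provided the imputation model is reasonably specified

  • Flexible: Works with most analysis models, because each completed dataset is analyzed with standard methods

  • Valid: Produces standard errors and confidence intervals that reflect the imputation uncertainty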


The MICE Algorithm

Multiple Imputation by Chained Equations (MICE) is the most popular MI method.

Definition: MICE Algorithm

Iteratively imputes each variable with missing values using its own regression model, conditional on other variables. Each iteration cycles through all variables with missing data.

Steps

| Step | Action |
|------|--------|
| 1 | Initialize missing values with simple imputation (e.g., mean) |
| 2 | For each variable $X_j$ with missing values: fit a regression on the other variables |
| 3 | Draw imputed values from the predictive distribution |
| 4 | Repeat steps 2-3 for many cycles (typically 10-20) |
| 5 | After convergence, save the imputed dataset |
| 6 | Repeat steps 1-5 M times to create M datasets |
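The six steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not a reference implementation: the function names `mice_once` and `mice` are hypothetical, every variable is imputed with ordinary least squares plus Gaussian noise, and the regression coefficients are not redrawn from their posterior as a full MICE implementation would do.

```python
import numpy as np

def mice_once(X, n_cycles=10, rng=None):
    """Run one simplified MICE chain and return one completed copy of X."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Step 1: initialize missing entries with column means
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])

    for _ in range(n_cycles):                    # Step 4: repeat the cycle
        for j in range(X.shape[1]):              # Step 2: visit each variable
            miss_j = missing[:, j]
            if not miss_j.any():
                continue
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            # Fit the regression on rows where variable j was observed
            beta, *_ = np.linalg.lstsq(design[~miss_j], filled[~miss_j, j], rcond=None)
            resid = filled[~miss_j, j] - design[~miss_j] @ beta
            dof = max((~miss_j).sum() - design.shape[1], 1)
            sigma = np.sqrt(resid @ resid / dof)
            # Step 3: draw from the (approximate) predictive distribution
            noise = rng.normal(0.0, sigma, miss_j.sum())
            filled[miss_j, j] = design[miss_j] @ beta + noise
    return filled                                # Step 5: save the dataset

def mice(X, M=5, n_cycles=10, seed=0):
    """Step 6: run M independent chains to obtain M completed datasets."""
    return [mice_once(X, n_cycles, rng=seed + m) for m in range(M)]

# Small demonstration on synthetic data with about 20% of entries removed
demo_rng = np.random.default_rng(0)
data = demo_rng.normal(size=(200, 3))
data[:, 1] += 0.8 * data[:, 0]
holes = demo_rng.random(data.shape) < 0.2
data_missing = np.where(holes, np.nan, data)
completed = mice(data_missing, M=5)
```

Each element of `completed` keeps the observed entries unchanged and differs from the others only in the imputed cells, which is what produces the between-imputation variance used later in pooling.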


Predictive Mean Matching (PMM)

PMM Method

$$\hat{X}_j^{\text{miss}} = X_j^{\text{obs}}[k]$$

Here,

  • $X_j^{\text{obs}}[k]$ = Observed value from the donor case $k$ with the closest predicted mean

Steps for each missing value:

  1. Fit a regression predicting $X_j$ from other variables using observed data

  2. Predict $\hat{X}_j$ for both observed and missing cases

  3. For each missing case, find the observed case with the closest predicted value

  4. Use the observed value as the imputation
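The four PMM steps can be illustrated with a short NumPy sketch. The function name `pmm_impute` and its parameter `k` are hypothetical: with `k=1` the code copies the single closest donor exactly as described above, while larger `k` samples at random among the `k` nearest donors, a common variation. The sketch uses a plain least-squares fit and does not add the parameter uncertainty that a full multiple-imputation PMM would include.

```python
import numpy as np

def pmm_impute(y, X, k=1, rng=None):
    """Impute NaNs in y by predictive mean matching on predictors X."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    miss = np.isnan(y)
    design = np.column_stack([np.ones(len(y)), X])

    # 1. Fit the regression on cases where y is observed
    beta, *_ = np.linalg.lstsq(design[~miss], y[~miss], rcond=None)
    # 2. Predict for both observed and missing cases
    pred = design @ beta
    pred_obs, y_obs = pred[~miss], y[~miss]

    out = y.copy()
    for i in np.flatnonzero(miss):
        # 3. Find the observed cases with the closest predicted values
        donors = np.argsort(np.abs(pred_obs - pred[i]))[:k]
        # 4. Copy the observed value of one donor
        out[i] = y_obs[rng.choice(donors)]
    return out

# Demonstration: about 30% of y removed completely at random
demo_rng = np.random.default_rng(1)
x = demo_rng.normal(size=300)
y_full = 2.0 + 1.5 * x + demo_rng.normal(scale=0.5, size=300)
y = y_full.copy()
y[demo_rng.random(300) < 0.3] = np.nan
y_imputed = pmm_impute(y, x, k=1, rng=0)
```

Because every imputed value is copied from an observed case, the result can never fall outside the observed range of `y`.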

PMM Advantage

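Because PMM copies values from observed donors, every imputed value is plausible and lies within the range of the observed data. This makes it relatively robust to non-normality and to some misspecification of the regression model.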


Rubin's Rules

Combined Point Estimate

Pooled Estimate

$$\bar{Q} = \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m$$

Here,

  • $\hat{Q}_m$ = Point estimate from imputation $m$
  • $M$ = Number of imputations

Combined Variance

Total Variance

$$T = \bar{U} + \left(1 + \frac{1}{M}\right)B$$

Here,

  • $\bar{U}$ = Within-imputation variance (average of the M variances)
  • $B$ = Between-imputation variance (variance of the M estimates)

Variance Components

  • Within-imputation ($\bar{U}$): Sampling variability that would remain even if no values were missing

  • Between-imputation ($B$): Additional variability due to not knowing the true values

  • Total ($T$): Correctly reflects the full uncertainty
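The pooling formulas above reduce to a few lines of arithmetic. The following sketch uses only the Python standard library; the function name `pool_rubin` and the five estimates and variances are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool M point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates and matching variances")
    q_bar = mean(estimates)            # pooled point estimate
    u_bar = mean(variances)            # within-imputation variance
    b = variance(estimates)            # between-imputation variance (divisor M - 1)
    t = u_bar + (1 + 1 / m) * b        # total variance
    return {"estimate": q_bar, "within": u_bar, "between": b,
            "total": t, "se": sqrt(t)}

# Hypothetical results from M = 5 imputed datasets
pooled = pool_rubin(
    estimates=[10.2, 9.8, 10.5, 10.1, 9.9],
    variances=[0.25, 0.27, 0.24, 0.26, 0.25],
)
# Pooled estimate is about 10.1; total variance is about 0.254 + 1.2 * 0.075 = 0.344
```

In this example the between-imputation term adds 0.09 to the within-imputation variance of 0.254, so ignoring it would understate the standard error.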


Number of Imputations

| Missing Fraction | Recommended M |
|-----------------|---------------|
| < 10% | 5-10 |
| 10-30% | 20-40 |
| 30-50% | 40-100 |
| > 50% | > 100 |

Rule of Thumb

Use at least as many imputations as the percentage of missing data. For example, if 25% of values are missing, use M = 25.
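As a small illustration, the rule of thumb can be written as a helper function. The name `recommended_m` and the floor of 5 imputations are assumptions chosen to match the lower end of the table above, not a fixed standard.

```python
import math

def recommended_m(missing_fraction, minimum=5):
    """Return M >= percentage of missing values, with an assumed floor of 5."""
    if not 0 <= missing_fraction <= 1:
        raise ValueError("missing_fraction must be between 0 and 1")
    # Round first so floating-point noise (0.07 * 100 = 7.000000000000001)
    # does not push the ceiling up by one
    percent = round(missing_fraction * 100, 6)
    return max(minimum, math.ceil(percent))

m_for_quarter_missing = recommended_m(0.25)   # 25% missing gives M = 25
```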


Diagnostics

Convergence

Plot the mean and variance of imputed values across iterations. They should stabilize after 10-20 iterations.

Imputation Quality

| Diagnostic | What to Check |
|-----------|---------------|
| Density plots | Imputed and observed distributions should overlap |
| Scatter plots | Relationships between variables should be similar |
| Trace plots | MICE chains should mix and converge |


Assumptions

MAR Assumption

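MI produces unbiased estimates only when the data are MAR (missing at random), which includes the special case of MCAR (missing completely at random). If missingness is MNAR (missing not at random), results may be biased. Conduct a sensitivity analysis to see how much the conclusions change under plausible MNAR scenarios.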
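One simple way to run such a sensitivity analysis is a delta adjustment: shift every imputed value by a fixed offset and check how the pooled estimate moves. The sketch below is a minimal illustration under that assumption; the function name `delta_adjusted_means` and the toy data are hypothetical, and the choice of offsets has to come from subject-matter knowledge.

```python
from statistics import mean

def delta_adjusted_means(completed, missing_mask, deltas):
    """For each delta, shift the imputed entries by delta and pool the mean.

    completed    : list of M completed value lists for one variable
    missing_mask : list of booleans, True where the value was imputed
    deltas       : offsets representing assumed departures from MAR
    """
    results = {}
    for delta in deltas:
        per_imputation = [
            mean(v + delta if was_missing else v
                 for v, was_missing in zip(values, missing_mask))
            for values in completed
        ]
        results[delta] = mean(per_imputation)   # pooled point estimate
    return results

# Toy example: two completed datasets, second and fourth values were imputed
mask = [False, True, False, True]
completed_sets = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.5, 3.0, 3.5]]
sensitivity = delta_adjusted_means(completed_sets, mask, deltas=[-1.0, 0.0, 1.0])
```

With `delta = 0` the result is the ordinary MAR-based pooled mean. If the substantive conclusion survives the full range of plausible offsets, it is less likely to depend on the MAR assumption.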


Python Implementation


import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
import matplotlib.pyplot as plt

np.random.seed(42)

# Simulate three correlated variables
n = 500
X1 = np.random.randn(n)
X2 = 0.6 * X1 + np.random.randn(n) * 0.5
X3 = 0.3 * X1 + 0.5 * X2 + np.random.randn(n) * 0.7

# Create MAR missingness in X1: the probability depends only on the fully observed X2
missing_prob = 1 / (1 + np.exp(-(-1 + 0.8 * X2)))
R = np.random.binomial(1, missing_prob)  # R == 1 marks a missing value
X1_obs = X1.copy()
X1_obs[R == 1] = np.nan

df = pd.DataFrame({'X1': X1_obs, 'X2': X2, 'X3': X3})
true_mean = X1.mean()
obs_mean = np.nanmean(X1_obs)
missing_pct = df['X1'].isna().mean()

print(f"True X1 mean: {true_mean:.3f}")
print(f"Observed X1 mean: {obs_mean:.3f}")
print(f"Missing: {missing_pct:.1%}")

# Multiple Imputation (M = 30)
M = 30
estimates = []  # per-imputation estimates of the X1 mean
variances = []  # per-imputation squared standard errors of that mean
for m in range(M):
    # sample_posterior=True draws imputations from the predictive distribution;
    # without it the M datasets are nearly identical and B collapses toward zero
    imputer = IterativeImputer(sample_posterior=True, random_state=m, max_iter=20)
    imputed = imputer.fit_transform(df)
    estimates.append(imputed[:, 0].mean())
    variances.append(imputed[:, 0].var(ddof=1) / n)

# Rubin's rules
Q_bar = np.mean(estimates)
U_bar = np.mean(variances)     # within-imputation variance
B = np.var(estimates, ddof=1)  # between-imputation variance
T = U_bar + (1 + 1/M) * B
SE = np.sqrt(T)

print(f"\nMI estimate: {Q_bar:.3f} (SE: {SE:.3f})")
# Normal approximation; a t reference distribution gives slightly wider intervals
print(f"95% CI: [{Q_bar - 1.96*SE:.3f}, {Q_bar + 1.96*SE:.3f}]")

# Plot the per-imputation estimates against the true mean
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].plot(range(1, M + 1), estimates, 'o-')
axes[0].axhline(y=true_mean, color='red', linestyle='--', label='True')
axes[0].set_xlabel('Imputation')
axes[0].set_ylabel('X1 Mean')
axes[0].set_title('X1 Mean Across Imputations')
axes[0].legend()

# Density comparison: observed X1 values vs. imputed X1 values (last imputed dataset)
is_missing = df['X1'].isna().to_numpy()
axes[1].hist(X1_obs[~is_missing], bins=30, alpha=0.5, density=True, label='Observed')
axes[1].hist(imputed[is_missing, 0], bins=30, alpha=0.5, density=True, label='Imputed')
axes[1].set_xlabel('X1')
axes[1].legend()
axes[1].set_title('Density Comparison')
plt.tight_layout()
plt.show()

Worked Example

Example: Health Survey with Missing Income

A health survey with 30% missing income data (assumed MAR: respondents whose observed correlates predict higher income are less likely to report it):

| Method | Mean Income (USD) | SE (USD) | 95% CI (USD) |
|--------|-------------------|----------|--------------|
| Complete cases | 42,500 | 1,200 | [40,148, 44,852] |
| Single imputation | 41,800 | 980 | [39,880, 43,720] |
| Multiple imputation | 45,200 | 1,350 | [42,554, 47,846] |
| True value | 45,100 | n/a | n/a |

MI produces the least biased estimate (45,200 against a true value of 45,100). Complete-case analysis underestimates the mean because higher-income respondents are underrepresented among those who reported. Single imputation understates the standard error (980 versus 1,350 under MI), so its confidence interval is too narrow.


Key Takeaways

Summary: Multiple Imputation

  • MI creates M plausible completed datasets and combines results via Rubin's rules

  • MICE (chained equations) is the most popular algorithm

  • PMM (predictive mean matching) is robust to non-normality

  • Use about M = 20 imputations as a starting point; increase M for larger missing fractions

  • Rubin's rules combine estimates (average) and variances (within + between)

  • Diagnostics: Check convergence, density overlap, and imputation quality

  • MI assumes MAR; conduct sensitivity analysis for MNAR

  • Always report the number of imputations and missing data patterns

