Missing Data — MCAR, MAR, MNAR, Imputation
Statistics
Understanding Why Data Is Missing and How to Handle It
The mechanism generating missing values — MCAR, MAR, or MNAR — determines which methods produce valid inferences. Naive deletion can bias results, while principled approaches preserve information and validity.
- Clinical Research: Handle patient dropout that may be related to outcomes
- Survey Analysis: Address item nonresponse that varies across demographic groups
- Social Science: Deal with attrition in longitudinal panel studies
How data goes missing matters as much as how much is missing.
Missing data is ubiquitous in real-world research. Understanding the mechanism that generates missing values is critical for choosing appropriate handling methods.
Definition: Missing Data
Values in a dataset that are not observed. The analysis must account for missingness to produce valid statistical inferences.
Types of Missingness
MCAR — Missing Completely at Random
MCAR
$$P(R = 1 \mid Y, X) = P(R = 1)$$
Here,
- $R$ = Missingness indicator (1 = missing, 0 = observed)
- $Y$ = Outcome value
- $X$ = Covariates
Missingness is completely unrelated to any data (observed or missing), like survey forms being lost in the mail.
MCAR Implication
Under MCAR, the observed data is a random subsample of the full data. Listwise deletion is unbiased but reduces power.
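The claim above can be checked with a small simulation. The sketch below is illustrative only: the outcome distribution, the sample size, and the 30% missingness rate are assumptions chosen for the demonstration, not values from any study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical full data: an outcome with true mean 50 (assumed values)
n = 20_000
y = rng.normal(loc=50.0, scale=10.0, size=n)

# MCAR: every value has the same 30% chance of being missing,
# independent of y and of any covariate
missing = rng.random(n) < 0.30
y_cc = y[~missing]  # complete cases left after listwise deletion

full_mean = y.mean()
cc_mean = y_cc.mean()

# The standard error of the mean grows because fewer cases remain
se_full = y.std(ddof=1) / np.sqrt(n)
se_cc = y_cc.std(ddof=1) / np.sqrt(len(y_cc))

print(f"full mean = {full_mean:.2f}, complete-case mean = {cc_mean:.2f}")
print(f"SE full = {se_full:.3f}, SE complete-case = {se_cc:.3f}")
```

The complete-case mean should land close to the full-data mean, while its standard error is larger. This is the power loss the implication refers to.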
MAR — Missing at Random
MAR
$$P(R = 1 \mid Y, X) = P(R = 1 \mid Y_{obs}, X)$$
Here,
- $Y_{obs}$ = Observed portion of the outcome
Missingness depends on observed data but not on the missing values themselves.
MAR Example
In a depression study, younger people are less likely to report income. If age is observed and, among people of the same age, nonresponse does not depend on income itself, missingness in income is MAR.
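A simulation in the spirit of this example shows why MAR missingness biases a naive summary and how the observed variable can correct it. This is a minimal sketch: the two age groups, the income values, and the nonresponse rates are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical population: older respondents earn more on average
young = rng.random(n) < 0.5
income = np.where(young, 40.0, 60.0) + rng.normal(0, 10, n)

# MAR: nonresponse depends only on the observed age group
# (50% for young, 10% for older), not on income within a group
p_missing = np.where(young, 0.5, 0.1)
missing = rng.random(n) < p_missing

df = pd.DataFrame({"young": young, "income": income})
df.loc[missing, "income"] = np.nan

true_mean = income.mean()
naive_mean = df["income"].mean()  # uses complete cases only (NaN skipped)

# Adjust with the observed variable: average the within-group means,
# weighted by each group's share of the full sample
group_means = df.groupby("young")["income"].mean()
group_share = df["young"].value_counts(normalize=True)
adjusted_mean = (group_means * group_share).sum()

print(f"true = {true_mean:.2f}, naive = {naive_mean:.2f}, adjusted = {adjusted_mean:.2f}")
```

The naive mean overstates income because the low-income young group is under-represented among respondents. Reweighting by age group recovers the full-sample mean, which is only possible because age is observed for everyone.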
MNAR — Missing Not at Random
MNAR
$$P(R = 1 \mid Y, X) = P(R = 1 \mid Y_{obs}, Y_{mis}, X)$$
Here,
- $Y_{mis}$ = Missing portion of the outcome
Missingness depends on the unobserved values themselves. The hardest mechanism to handle.
MNAR Challenge
MNAR requires modeling the missingness mechanism directly, which is difficult without external information. Results are sensitive to the assumed model.
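One simple way to see this sensitivity is a delta-adjusted pattern-mixture calculation: assume the missing cases differ from the observed cases by an offset `delta`, then vary `delta`. The sketch below uses simulated data; the selection model, the function name `pattern_mixture_mean`, and the grid of offsets are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Hypothetical outcome; higher values are more likely to go unreported (MNAR)
y = rng.normal(0.0, 1.0, n)
p_missing = 1 / (1 + np.exp(-(-1.0 + 1.0 * y)))
missing = rng.random(n) < p_missing

y_obs = y[~missing]
pi_mis = missing.mean()   # fraction of cases that are missing
mean_obs = y_obs.mean()   # biased low, since high values drop out

def pattern_mixture_mean(delta):
    """Overall mean if missing cases average `delta` above observed cases."""
    mean_mis = mean_obs + delta  # untestable assumption about the missing group
    return (1 - pi_mis) * mean_obs + pi_mis * mean_mis

for delta in [0.0, 0.5, 1.0, 1.5]:
    print(f"delta = {delta:.1f} -> estimated mean {pattern_mixture_mean(delta):+.3f}")
print(f"true mean (unknown in practice): {y.mean():+.3f}")
```

The data cannot identify `delta`, so the estimate moves with whatever value is assumed. Reporting results across a plausible range of offsets is a common form of sensitivity analysis.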
Comparison
| Mechanism | Missingness depends on | Example |
|-----------|----------------------|---------|
| MCAR | Nothing | Data entry errors; random equipment failure |
| MAR | Observed variables only | Young people skip income questions |
| MNAR | Missing values themselves | Depressed people don't report depression |
Handling Missing Data
Listwise Deletion
Delete any row with missing values.
| Pros | Cons |
|------|------|
| Simple; unbiased under MCAR | Loses data; reduces power |
| | Generally biased under MAR and MNAR |
Mean Imputation
Replace missing values with the observed mean.
Mean Imputation Problems
- Biases standard errors downward
- Distorts correlations and distributions
- Never recommended for statistical analysis
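These problems are easy to reproduce. The following sketch uses simulated data with an assumed correlation and an assumed 40% missingness rate to show how filling in the mean shrinks the variance and weakens the correlation.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

# Hypothetical correlated pair (x, y)
x = rng.normal(0, 1, n)
y = 0.8 * x + rng.normal(0, 0.6, n)

# Remove 40% of y completely at random, then fill with the observed mean
missing = rng.random(n) < 0.40
y_imp = y.copy()
y_imp[missing] = y[~missing].mean()

var_true = y.var(ddof=1)
var_imp = y_imp.var(ddof=1)
corr_true = np.corrcoef(x, y)[0, 1]
corr_imp = np.corrcoef(x, y_imp)[0, 1]

print(f"variance:    true = {var_true:.3f}, mean-imputed = {var_imp:.3f}")
print(f"correlation: true = {corr_true:.3f}, mean-imputed = {corr_imp:.3f}")
```

Every imputed value sits exactly at the mean, so the filled-in column has less spread than the real variable. Standard errors computed from it are too small, and its relationship with `x` is attenuated.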
Multiple Imputation
Definition: Multiple Imputation
Create M complete datasets by imputing missing values with plausible values drawn from their predictive distribution. Analyze each dataset and combine results using Rubin's rules.
Multiple Imputation: Rubin's Rules
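Combined Estimate
$$\bar{Q} = \frac{1}{M}\sum_{m=1}^{M} \hat{Q}_m$$
Here,
- $\hat{Q}_m$ = Estimate from imputed dataset $m$
- $M$ = Number of imputations
Combined Variance
$$T = \bar{U} + \left(1 + \frac{1}{M}\right) B$$
Here,
- $\bar{U}$ = Within-imputation variance: $\frac{1}{M}\sum_{m=1}^{M} U_m$, where $U_m$ is the estimated variance of $\hat{Q}_m$
- $B$ = Between-imputation variance: $\frac{1}{M-1}\sum_{m=1}^{M}(\hat{Q}_m - \bar{Q})^2$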
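Rubin's rules average the per-dataset estimates and add the within-imputation variance to the between-imputation variance inflated by $(1 + 1/M)$. A minimal sketch of that pooling step follows. The function name `pool_rubin` and the five input values are hypothetical, and it pools a single scalar parameter only.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M point estimates and their variances (squared standard errors)."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    M = len(q)
    q_bar = q.mean()              # combined estimate
    u_bar = u.mean()              # within-imputation variance
    b = q.var(ddof=1)             # between-imputation variance (divides by M - 1)
    t = u_bar + (1 + 1 / M) * b   # total variance
    return q_bar, u_bar, b, t

# Hypothetical estimates and variances from M = 5 imputed datasets
est = [2.0, 2.2, 1.8, 2.1, 1.9]
var = [0.10, 0.12, 0.11, 0.09, 0.08]
q_bar, u_bar, b, t = pool_rubin(est, var)

print(f"pooled estimate = {q_bar:.3f}")
print(f"within = {u_bar:.3f}, between = {b:.3f}, total = {t:.3f}")
print(f"pooled standard error = {np.sqrt(t):.3f}")
```

The between-imputation term is what single imputation leaves out. It carries the extra uncertainty that comes from not knowing the missing values.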
Number of Imputations
Use at least M = 20 imputations. When a large share of the data is missing, use M = 50 to 100 so that pooled estimates and standard errors are stable across repeated runs.
Predictive Mean Matching (PMM)
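A widely used imputation method, and the default for numeric variables in the R mice package. For each missing value:
- Fit a regression predicting the variable from other variables
- Find the observed cases (donors) whose predicted values are closest to the prediction for the missing case
- Randomly select one of these donors and use its observed value as the imputation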
PMM Advantage
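PMM produces plausible values within the range of the observed data. Because every imputed value is one that was actually observed, it never extrapolates beyond the data, which makes it more robust to model misspecification than imputing the regression prediction directly.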
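The PMM steps can be sketched in a few lines of NumPy. This is a simplified illustration, not a full multiple-imputation implementation: it fits a single least-squares model and omits the parameter draws that packages add between imputations. The function name `pmm_impute`, the donor count `k=5`, and the simulated data are assumptions.

```python
import numpy as np

def pmm_impute(X, y, k=5, rng=None):
    """Fill NaNs in y by predictive mean matching on predictors X (simplified)."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)
    obs = ~np.isnan(y)

    # Step 1: regress y on X using the observed cases only
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A[obs], y[obs], rcond=None)
    pred = A @ beta

    y_filled = y.copy()
    donors_pred, donors_y = pred[obs], y[obs]
    for i in np.flatnonzero(~obs):
        # Step 2: the k observed cases whose predictions are closest
        nearest = np.argsort(np.abs(donors_pred - pred[i]))[:k]
        # Step 3: copy the observed value of one randomly chosen donor
        y_filled[i] = donors_y[rng.choice(nearest)]
    return y_filled

# Hypothetical data with about 30% of y removed at random
rng = np.random.default_rng(4)
x = rng.normal(0, 1, 300)
y = 2.0 + 1.5 * x + rng.normal(0, 1, 300)
y_missing = y.copy()
y_missing[rng.random(300) < 0.3] = np.nan

y_pmm = pmm_impute(x, y_missing, k=5, rng=rng)
print(f"imputed {np.isnan(y_missing).sum()} values; "
      f"range stays within [{np.nanmin(y_missing):.2f}, {np.nanmax(y_missing):.2f}]")
```

Drawing at random among several donors, rather than always taking the single closest one, keeps some variability in the imputed values.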
Python Implementation
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before importing IterativeImputer)
from sklearn.impute import IterativeImputer
import matplotlib.pyplot as plt
np.random.seed(42)
# Simulate data with missing values
n = 500
X1 = np.random.randn(n)
X2 = 0.7 * X1 + np.random.randn(n) * 0.5
X3 = 0.3 * X1 + 0.4 * X2 + np.random.randn(n) * 0.8
# MAR: the probability that X1 is missing depends only on the fully observed X2
missing_prob = 1 / (1 + np.exp(-(-1 + 0.5*X2)))
R = np.random.binomial(1, missing_prob)  # 1 = missing, 0 = observed
X1_obs = X1.copy()
X1_obs[R == 1] = np.nan
df = pd.DataFrame({'X1': X1_obs, 'X2': X2, 'X3': X3})
print(f"Missing in X1: {df['X1'].isna().sum()} ({df['X1'].isna().mean():.1%})")
# Listwise deletion
complete = df.dropna()
print(f"\nListwise deletion: n={len(complete)}")
print(f"X1 mean (complete): {complete['X1'].mean():.3f} (true: {X1.mean():.3f})")
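# Multiple imputation via chained equations: draw M stochastic imputations
# (sample_posterior=True) rather than a single deterministic fill
M = 20
imputed_sets = [
    pd.DataFrame(
        IterativeImputer(sample_posterior=True, max_iter=10, random_state=m).fit_transform(df),
        columns=df.columns,
    )
    for m in range(M)
]
pooled_mean = np.mean([d['X1'].mean() for d in imputed_sets])
print(f"\nMICE imputation (M={M}):")
print(f"X1 mean (pooled): {pooled_mean:.3f} (true: {X1.mean():.3f})")
imputed = imputed_sets[0]  # one completed dataset, used for the histogram below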
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].hist(X1, bins=30, alpha=0.5, label='True')
axes[0].hist(complete['X1'], bins=30, alpha=0.5, label='Listwise')
axes[0].legend()
axes[0].set_title('Listwise Deletion')
axes[1].hist(X1, bins=30, alpha=0.5, label='True')
axes[1].hist(imputed['X1'], bins=30, alpha=0.5, label='MICE')
axes[1].legend()
axes[1].set_title('Multiple Imputation')
plt.tight_layout()
plt.show()
Worked Example
Example: Clinical Trial with Dropout
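A drug trial has 20% dropout due to side effects. If dropout is also related to the unobserved outcome, the missingness is MNAR. Illustrative estimates from four methods (the bias column implies a true effect of 4.5):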
| Method | Mean Effect | Bias |
|--------|------------|------|
| Complete cases | 3.8 | -0.7 (underestimate) |
| Mean imputation | 3.2 | -1.3 (severe underestimate) |
| Multiple imputation | 4.4 | -0.1 (minimal bias) |
| Pattern-mixture model | 4.6 | +0.1 (minimal bias) |
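When the imputation model includes the variables that drive dropout, so that missingness is approximately MAR, multiple imputation removes most of the bias seen with complete-case analysis and mean imputation. If missingness is MNAR, specialized models (pattern-mixture, selection models) are needed, and their results should be checked with a sensitivity analysis.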
Key Takeaways
Summary: Missing Data
- MCAR: Missingness is completely random, so the observed data is representative
- MAR: Missingness depends on observed data, and MI handles this well
- MNAR: Missingness depends on the missing values and requires specialized models
- Multiple imputation (MI) is the gold standard for MAR data
- Use Rubin's rules to combine estimates across M imputed datasets
- Mean imputation is biased and should never be used for analysis
- Use at least M = 20 imputations; more for large amounts of missingness
- Conduct sensitivity analysis for potential MNAR
Related Topics
- See Multiple Imputation for detailed MI methods
- See Propensity Score Matching for handling selection bias
- See Causal Inference for related topics