Missing Data — MCAR, MAR, MNAR, Imputation


Understanding Why Data Is Missing and How to Handle It

The mechanism generating missing values — MCAR, MAR, or MNAR — determines which methods produce valid inferences. Naive deletion can bias results, while principled approaches preserve information and validity.

Clinical Research — Handle patient dropout that may be related to outcomes
Survey Analysis — Address item nonresponse that varies across demographic groups
Social Science — Deal with attrition in longitudinal panel studies

How data goes missing matters as much as how much is missing.

Missing data is ubiquitous in real-world research. Understanding the mechanism that generates missing values is critical for choosing appropriate handling methods.

Definition: Missing Data

Values in a dataset that are not observed. The analysis must account for missingness to produce valid statistical inferences.

Types of Missingness

MCAR — Missing Completely at Random

MCAR

$$P(R_i = 1 \mid Y_i, X_i) = P(R_i = 1)$$

Here,

$R_i$ = Missingness indicator (1 = missing, 0 = observed)
$Y_i$ = Outcome value
$X_i$ = Covariates

Missingness is completely unrelated to any data, observed or missing. It is like data being lost in the mail.

MCAR Implication

Under MCAR, the observed data is a random subsample of the full data. Listwise deletion is unbiased but reduces power.
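A small simulation can make this implication concrete. The sketch below uses made-up numbers (a normal outcome with mean 50 and a 30% missingness rate, both assumptions chosen for illustration) and compares the complete-case mean with the full-data mean.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000
y = rng.normal(loc=50.0, scale=10.0, size=n)  # full outcome (hypothetical)

# MCAR: every value has the same 30% chance of being missing,
# regardless of y or any covariate.
missing = rng.random(n) < 0.30
y_obs = y[~missing]

full_mean = y.mean()
cc_mean = y_obs.mean()  # complete-case (listwise deletion) estimate
se_full = y.std(ddof=1) / np.sqrt(n)
se_cc = y_obs.std(ddof=1) / np.sqrt(y_obs.size)

print(f"full-data mean:     {full_mean:.3f} (SE {se_full:.4f})")
print(f"complete-case mean: {cc_mean:.3f} (SE {se_cc:.4f})")
```

The two means should agree closely, while the complete-case standard error is larger because fewer rows contribute. That larger standard error is the loss of power.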

MAR — Missing at Random

MAR

$$P(R_i = 1 \mid Y_i, X_i) = P(R_i = 1 \mid Y_i^{obs}, X_i)$$

Here,

$Y_i^{obs}$ = Observed portion of the outcome

Missingness depends on observed data but not on the missing values themselves.

MAR Example

In a depression study, younger people are less likely to report income. If age is observed for everyone and, among people of the same age, nonresponse does not depend on income itself, then missingness in income is MAR.
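The age and income example can be simulated to show what MAR implies in practice. This is a minimal sketch with invented coefficients (the income model and the logistic missingness curve are assumptions). It shows that the complete-case mean is biased, while a regression on the observed age recovers the full-data mean.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 50_000
age = rng.uniform(20, 70, size=n)                    # fully observed
income = 20 + 0.8 * age + rng.normal(0, 10, size=n)  # hypothetical, in thousands

# MAR: the chance of a missing income depends only on age
# (about 88% missing at age 20, about 12% at age 70).
p_missing = 1 / (1 + np.exp(-(2 - 0.08 * (age - 20))))
missing = rng.random(n) < p_missing

true_mean = income.mean()
cc_mean = income[~missing].mean()  # older respondents are over-represented

# Adjust for age: fit income ~ age on complete cases, then average the
# predictions over everyone's age (np.polyfit returns slope first).
slope, intercept = np.polyfit(age[~missing], income[~missing], deg=1)
adjusted_mean = (intercept + slope * age).mean()

print(f"true mean:          {true_mean:.2f}")
print(f"complete-case mean: {cc_mean:.2f}")
print(f"age-adjusted mean:  {adjusted_mean:.2f}")
```

Conditioning on the variable that drives missingness removes the bias here, which is the property that imputation methods rely on under MAR.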

MNAR — Missing Not at Random

MNAR

$$P(R_i = 1 \mid Y_i, X_i) \text{ depends on } Y_i^{miss}$$

Here,

$Y_i^{miss}$ = Missing portion of the outcome

Missingness depends on the unobserved values themselves, even after conditioning on the observed data. This is the hardest mechanism to handle.

MNAR Challenge

MNAR requires modeling the missingness mechanism directly, which is difficult without external information. Results are sensitive to the assumed model.
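To see why MNAR is harder, the sketch below simulates an outcome whose probability of being missing rises with its own value. All numbers are invented for illustration. It shows that adjusting for an observed covariate, which is enough under MAR, still leaves bias.

```python
import numpy as np

rng = np.random.default_rng(2)

n = 50_000
x = rng.normal(0, 1, size=n)               # observed covariate
y = 10 + 2 * x + rng.normal(0, 2, size=n)  # outcome, e.g. a symptom score

# MNAR: the chance of a missing y increases with y itself.
p_missing = 1 / (1 + np.exp(-(y - 10) / 2))
missing = rng.random(n) < p_missing

true_mean = y.mean()
cc_mean = y[~missing].mean()

# Regression adjustment on x, fitted to complete cases only.
slope, intercept = np.polyfit(x[~missing], y[~missing], deg=1)
adjusted_mean = (intercept + slope * x).mean()

print(f"true mean:          {true_mean:.2f}")
print(f"complete-case mean: {cc_mean:.2f}")
print(f"x-adjusted mean:    {adjusted_mean:.2f}")
```

Both estimates fall below the true mean. The adjustment helps only partly, because high values of y are under-represented even among cases that share the same x.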

Comparison

| Mechanism | Missingness depends on | Example |
|-----------|------------------------|---------|
| MCAR | Nothing | Data entry errors; random equipment failure |
| MAR | Observed variables only | Young people skip income questions |
| MNAR | Missing values themselves | Depressed people don't report depression |

Handling Missing Data

Listwise Deletion

Delete any row with missing values.

| Pros | Cons |
|------|------|
| Simple; unbiased under MCAR | Loses data; reduces power |
| | Generally biased under MAR and MNAR |

Mean Imputation

Replace missing values with the observed mean.

Mean Imputation Problems

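Biases variances and standard errors downward, because every imputed value sits exactly at the mean
Distorts distributions and attenuates correlations with other variables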
Never recommended for statistical analysis
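These problems are easy to reproduce. The following sketch uses simulated data with a true correlation of 0.8 and 40% MCAR missingness (both assumptions) to show how mean imputation shrinks the standard deviation, the correlation, and the naive standard error.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 20_000
x = rng.normal(0, 1, size=n)
y = 0.8 * x + rng.normal(0, 0.6, size=n)  # var(y) = 1, corr(x, y) = 0.8

missing = rng.random(n) < 0.40            # MCAR, 40% missing
y_imp = y.copy()
y_imp[missing] = y[~missing].mean()       # mean imputation

sd_true = y.std(ddof=1)
sd_imp = y_imp.std(ddof=1)
corr_true = np.corrcoef(x, y)[0, 1]
corr_imp = np.corrcoef(x, y_imp)[0, 1]

# The naive SE treats imputed values as real observations.
naive_se_imp = sd_imp / np.sqrt(n)
complete_case_se = y[~missing].std(ddof=1) / np.sqrt((~missing).sum())

print(f"SD:   true {sd_true:.3f}, after mean imputation {sd_imp:.3f}")
print(f"corr: true {corr_true:.3f}, after mean imputation {corr_imp:.3f}")
print(f"SE:   complete-case {complete_case_se:.4f}, naive imputed {naive_se_imp:.4f}")
```

The imputed dataset looks more precise than the complete cases even though it contains no additional information.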

Multiple Imputation

Definition: Multiple Imputation

Create M complete datasets by imputing missing values with plausible values drawn from their predictive distribution. Analyze each dataset and combine results using Rubin's rules.
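As a rough illustration of the first step, the sketch below builds M completed copies of one incomplete variable. It is not a full MICE implementation. The function name `multiply_impute`, the linear imputation model, and the use of a bootstrap to reflect parameter uncertainty are all assumptions made for this example.

```python
import numpy as np

def multiply_impute(x, y, M=20, seed=0):
    """Return M completed copies of y (NaN = missing), imputing from y ~ x."""
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(y)
    x_obs, y_obs = x[obs], y[obs]
    n_obs, n_mis = obs.sum(), (~obs).sum()
    completed = []
    for _ in range(M):
        # Bootstrap the complete cases so each imputation uses slightly
        # different regression parameters (parameter uncertainty).
        idx = rng.integers(0, n_obs, size=n_obs)
        slope, intercept = np.polyfit(x_obs[idx], y_obs[idx], deg=1)
        resid = y_obs[idx] - (intercept + slope * x_obs[idx])
        sigma = resid.std(ddof=2)
        # Draw imputations from the predictive distribution, not the mean.
        y_m = y.copy()
        y_m[~obs] = intercept + slope * x[~obs] + rng.normal(0, sigma, size=n_mis)
        completed.append(y_m)
    return completed

# Hypothetical data: y is MAR given the fully observed x.
rng = np.random.default_rng(4)
x = rng.normal(size=1000)
y_full = 1.0 + 2.0 * x + rng.normal(size=1000)
y = y_full.copy()
y[rng.random(1000) < 1 / (1 + np.exp(-x))] = np.nan

datasets = multiply_impute(x, y, M=20)
estimates = np.array([d.mean() for d in datasets])
print(f"full-data mean: {y_full.mean():.3f}")
print(f"MI estimates:   mean {estimates.mean():.3f}, spread {estimates.std(ddof=1):.3f}")
```

The spread of the M estimates is what later feeds the between-imputation variance. A single imputed dataset would hide that uncertainty.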

Multiple Imputation: Rubin's Rules

Combined Estimate

$$\bar{Q} = \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m$$

Here,

$\hat{Q}_m$ = Estimate from imputed dataset $m$
$M$ = Number of imputations

Combined Variance

$$T = \bar{U} + \left(1 + \frac{1}{M}\right)B$$

Here,

$\bar{U}$ = Within-imputation variance: $\frac{1}{M}\sum_{m=1}^{M} U_m$
$B$ = Between-imputation variance: $\frac{1}{M-1}\sum_{m=1}^{M}(\hat{Q}_m - \bar{Q})^2$
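The two formulas translate directly into code. The helper below is a minimal sketch. The name `pool_rubin` and the three example estimates and variances are hypothetical, chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M point estimates and their squared standard errors."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    M = q.size
    q_bar = q.mean()             # combined estimate
    u_bar = u.mean()             # within-imputation variance
    b = q.var(ddof=1)            # between-imputation variance (divides by M - 1)
    t = u_bar + (1 + 1 / M) * b  # total variance
    return q_bar, u_bar, b, t

# Hypothetical results from M = 3 imputed datasets.
q_bar, u_bar, b, t = pool_rubin([4.2, 4.5, 4.8], [0.10, 0.12, 0.11])
print(f"Q_bar = {q_bar:.3f}, U_bar = {u_bar:.3f}, B = {b:.3f}, T = {t:.3f}")
# Q_bar = 4.500, U_bar = 0.110, B = 0.090, T = 0.230
```

By hand: $B = (0.09 + 0 + 0.09)/2 = 0.09$ and $T = 0.11 + (4/3)(0.09) = 0.23$, so the pooled standard error is $\sqrt{0.23} \approx 0.48$.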

Number of Imputations

Use at least M = 20 imputations. When a large fraction of values is missing, use M = 50 to 100 so that pooled estimates and standard errors stay stable across repeated runs.

Predictive Mean Matching (PMM)

A widely used imputation method. For each missing value:

Fit a regression predicting the variable from the other variables
Find the observed cases (donors) whose predicted values are closest to the prediction for the missing case
Randomly select one donor and use its observed value as the imputation

PMM Advantage

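PMM produces plausible values within the range of the observed data. Because every imputation is a real observed value, it never extrapolates beyond the data, which makes it more robust to misspecification of the regression model than drawing values directly from that model.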
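The three steps can be sketched in a few lines of NumPy. This is a simplified illustration, not the algorithm of any particular package. The function name `pmm_impute`, the single predictor, and the choice of k = 5 donors are assumptions. Fuller implementations also draw the regression parameters at random for each imputation.

```python
import numpy as np

def pmm_impute(x, y, k=5, seed=0):
    """Fill NaNs in y with observed values borrowed from similar cases."""
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(y)
    # Step 1: regression of y on x, fitted to the observed cases.
    slope, intercept = np.polyfit(x[obs], y[obs], deg=1)
    pred_obs = intercept + slope * x[obs]
    pred_mis = intercept + slope * x[~obs]
    y_donors = y[obs]
    imputed = np.empty(pred_mis.size)
    for i, p in enumerate(pred_mis):
        # Step 2: the k observed cases with the closest predicted values.
        nearest = np.argsort(np.abs(pred_obs - p))[:k]
        # Step 3: borrow the observed value of one randomly chosen donor.
        imputed[i] = y_donors[rng.choice(nearest)]
    y_imp = y.copy()
    y_imp[~obs] = imputed
    return y_imp

# Hypothetical data with 30% of y missing.
rng = np.random.default_rng(6)
x = rng.normal(size=500)
y = 3.0 + 1.5 * x + rng.normal(size=500)
y[rng.random(500) < 0.3] = np.nan

y_completed = pmm_impute(x, y, k=5)
print(f"observed range: [{np.nanmin(y):.2f}, {np.nanmax(y):.2f}]")
print(f"imputed range:  [{y_completed[np.isnan(y)].min():.2f}, "
      f"{y_completed[np.isnan(y)].max():.2f}]")
```

Every imputed value is copied from an observed case, so the imputed range can never exceed the observed range.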

Python Implementation


```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

np.random.seed(42)

# Simulate three correlated variables
n = 500
X1 = np.random.randn(n)
X2 = 0.7 * X1 + np.random.randn(n) * 0.5
X3 = 0.3 * X1 + 0.4 * X2 + np.random.randn(n) * 0.8

# MAR: the probability that X1 is missing depends only on the fully observed X2
missing_prob = 1 / (1 + np.exp(-(-1 + 0.5 * X2)))
R = np.random.binomial(1, missing_prob)  # 1 = missing, 0 = observed
X1_obs = X1.copy()
X1_obs[R == 1] = np.nan

df = pd.DataFrame({'X1': X1_obs, 'X2': X2, 'X3': X3})
print(f"Missing in X1: {df['X1'].isna().sum()} ({df['X1'].isna().mean():.1%})")

# Listwise deletion: keep only rows with no missing values
complete = df.dropna()
print(f"\nListwise deletion: n={len(complete)}")
print(f"X1 mean (complete): {complete['X1'].mean():.3f} (true: {X1.mean():.3f})")

# Chained-equations imputation. With its default settings IterativeImputer
# returns ONE imputed dataset (conditional means), not multiple imputations.
# For multiple imputation, refit with sample_posterior=True and a different
# random_state per dataset, then pool the results with Rubin's rules.
imputer = IterativeImputer(random_state=42, max_iter=10)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print("\nIterative imputation (single dataset):")
print(f"X1 mean (imputed): {imputed['X1'].mean():.3f} (true: {X1.mean():.3f})")

# Compare each method's X1 distribution against the true values
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].hist(X1, bins=30, alpha=0.5, label='True')
axes[0].hist(complete['X1'], bins=30, alpha=0.5, label='Listwise')
axes[0].legend()
axes[0].set_title('Listwise Deletion')

axes[1].hist(X1, bins=30, alpha=0.5, label='True')
axes[1].hist(imputed['X1'], bins=30, alpha=0.5, label='Iterative imputation')
axes[1].legend()
axes[1].set_title('Iterative Imputation (single dataset)')
plt.tight_layout()
plt.show()
```

Worked Example

Example: Clinical Trial with Dropout

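A drug trial loses 20% of its participants to dropout caused by side effects, so the missing outcomes are plausibly MNAR. Bias is measured against the true mean effect of 4.5 implied by the table: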

| Method | Mean Effect | Bias |
|--------|-------------|------|
| Complete cases | 3.8 | -0.7 (underestimate) |
| Mean imputation | 3.2 | -1.3 (severe underestimate) |
| Multiple imputation | 4.4 | -0.1 (minimal bias) |
| Pattern-mixture model | 4.6 | +0.1 (minimal bias) |

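In this example, multiple imputation and the pattern-mixture model both land within 0.1 of the true effect, while complete-case analysis and mean imputation underestimate it. Multiple imputation is approximately unbiased when missingness is MAR given the variables in the imputation model. When dropout is MNAR, specialized models (pattern-mixture, selection models) and sensitivity analyses are needed.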
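One simple member of the pattern-mixture family is a delta adjustment: impute as if the data were MAR, then shift the imputed values by a fixed offset delta and watch how the conclusion moves. The sketch below uses simulated data, not the trial above. The dropout rate, the outcome distribution, the function name `delta_adjusted_mean`, and the delta grid are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

n = 2000
completer = rng.random(n) < 0.80  # roughly 20% dropout
y = np.where(completer, rng.normal(4.0, 1.0, size=n), np.nan)

def delta_adjusted_mean(y, delta, rng):
    """Impute dropouts from the observed values, shifted by delta."""
    obs = ~np.isnan(y)
    y_imp = y.copy()
    draws = rng.choice(y[obs], size=(~obs).sum(), replace=True)
    y_imp[~obs] = draws + delta   # delta = 0 is the MAR-style baseline
    return y_imp.mean()

# Reuse the same seed for every delta so only the shift differs.
deltas = [-2.0, -1.0, 0.0, 1.0, 2.0]
results = {d: delta_adjusted_mean(y, d, np.random.default_rng(0)) for d in deltas}

for d in deltas:
    print(f"delta = {d:+.1f}: estimated mean = {results[d]:.3f}")
```

Each unit of delta moves the estimate by the fraction of missing values. If the substantive conclusion survives the whole range of plausible deltas, it is reasonably robust to MNAR dropout.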

Key Takeaways

Summary: Missing Data

MCAR: Missingness is completely random — observed data is representative
MAR: Missingness depends on observed data — MI handles this well
MNAR: Missingness depends on missing values — requires specialized models
Multiple imputation (MI) is the gold standard for MAR data
Use Rubin's rules to combine estimates across M imputed datasets
Mean imputation understates variances and standard errors and distorts correlations; avoid it for analysis
Use at least M = 20 imputations; more for large amounts of missingness
Conduct sensitivity analysis for potential MNAR
