Missing Data — MCAR, MAR, MNAR, Imputation

Understanding Why Data Is Missing and How to Handle It

The mechanism generating missing values — MCAR, MAR, or MNAR — determines which methods produce valid inferences. Naive deletion can bias results, while principled approaches preserve information and validity.

  • Clinical Research — Handle patient dropout that may be related to outcomes

  • Survey Analysis — Address item nonresponse that varies across demographic groups

  • Social Science — Deal with attrition in longitudinal panel studies

How data goes missing matters as much as how much is missing.


Missing data is ubiquitous in real-world research. Understanding the mechanism that generates missing values is critical for choosing appropriate handling methods.

Definition: Missing Data

Values in a dataset that are not observed. The analysis must account for missingness to produce valid statistical inferences.


Types of Missingness

MCAR — Missing Completely at Random

MCAR

$P(R_i = 1 \mid Y_i, X_i) = P(R_i = 1)$

Here,

  • $R_i$ = Missingness indicator (1 = missing, 0 = observed)
  • $Y_i$ = Outcome value
  • $X_i$ = Covariates

Missingness is completely unrelated to any data, observed or missing. A typical example is data being lost in the mail.

MCAR Implication

Under MCAR, the observed data is a random subsample of the full data. Listwise deletion is unbiased but reduces power.
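The claim above can be checked with a small simulation. This is a minimal sketch on made-up data: the outcome distribution, the 30% missingness rate, and all variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000
y = rng.normal(loc=50.0, scale=10.0, size=n)   # full data (hypothetical outcome)

# MCAR: every value has the same 30% chance of being missing,
# regardless of y or anything else.
r = rng.random(n) < 0.30                       # True = missing
y_obs = y[~r]

full_mean = y.mean()
cc_mean = y_obs.mean()                         # complete-case (listwise) estimate

# The standard error of the mean grows because fewer rows remain.
se_full = y.std(ddof=1) / np.sqrt(n)
se_cc = y_obs.std(ddof=1) / np.sqrt(y_obs.size)

print(f"full mean = {full_mean:.2f}, complete-case mean = {cc_mean:.2f}")
print(f"SE full = {se_full:.4f}, SE complete-case = {se_cc:.4f}")
```

The two means should agree closely, while the complete-case standard error is larger. That is the loss of power mentioned above.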

MAR — Missing at Random

MAR

$P(R_i = 1 \mid Y_i, X_i) = P(R_i = 1 \mid Y_i^{obs}, X_i)$

Here,

  • $Y_i^{obs}$ = Observed portion of the outcome

Missingness depends on observed data but not on the missing values themselves.

MAR Example

In a depression study, younger people are less likely to report income. If age is observed and, among people of the same age, nonresponse does not depend on income itself, then missingness in income is MAR.
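To make the example concrete, here is a rough simulation of that situation. The two age groups, the income levels, and the nonresponse rates are invented for illustration. The point it tries to show is that the raw complete-case mean is biased, while an estimate that conditions on the observed age group is not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical data: older respondents earn more on average.
young = rng.random(n) < 0.5
income = np.where(young, 40.0, 60.0) + rng.normal(0, 10, n)

# MAR: the probability of nonresponse depends only on age group (observed),
# not on income itself once age group is known.
p_missing = np.where(young, 0.6, 0.1)
missing = rng.random(n) < p_missing

df = pd.DataFrame({"young": young, "income": income})
df["income_obs"] = df["income"].where(~missing)

true_mean = df["income"].mean()
cc_mean = df["income_obs"].mean()              # ignores age: biased upward

# Within each age group the observed incomes are representative,
# so weighting group means by group size recovers the overall mean.
group_means = df.groupby("young")["income_obs"].mean()
group_share = df["young"].value_counts(normalize=True)
adjusted_mean = (group_means * group_share).sum()

print(f"true = {true_mean:.2f}, complete-case = {cc_mean:.2f}, "
      f"age-adjusted = {adjusted_mean:.2f}")
```

Conditioning on the variable that drives missingness is what makes MAR tractable. Model-based methods such as multiple imputation do the same thing in a more general way.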

MNAR — Missing Not at Random

MNAR

$P(R_i = 1 \mid Y_i, X_i) \text{ depends on } Y_i^{miss}$

Here,

  • $Y_i^{miss}$ = Missing portion of the outcome

Missingness depends on the unobserved values themselves. This is the hardest mechanism to handle.

MNAR Challenge

MNAR requires modeling the missingness mechanism directly, which is difficult without external information. Results are sensitive to the assumed model.
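The sketch below illustrates why MNAR is hard, using simulated data in which high outcome values are more likely to be missing. The data-generating model and the logistic missingness function are assumptions made for the demonstration. Even a regression that conditions on the fully observed covariate is biased when fit to complete cases.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

x = rng.normal(0, 1, n)                      # fully observed covariate
y = 2.0 * x + rng.normal(0, 1, n)            # outcome (true intercept 0, slope 2)

# MNAR: higher y values are more likely to be missing, even after accounting for x.
p_missing = 1 / (1 + np.exp(-(y - 1.0)))
missing = rng.random(n) < p_missing

# Regression of y on x using complete cases only.
slope_cc, intercept_cc = np.polyfit(x[~missing], y[~missing], deg=1)

# Same regression on the full data, which an analyst would never see.
slope_full, intercept_full = np.polyfit(x, y, deg=1)

print(f"full data:      intercept = {intercept_full:.2f}, slope = {slope_full:.2f}")
print(f"complete cases: intercept = {intercept_cc:.2f}, slope = {slope_cc:.2f}")
```

Nothing in the observed rows reveals how far off the complete-case fit is, which is why external information or an explicit model of the missingness is needed.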


Comparison

| Mechanism | Missingness depends on | Example |
|-----------|----------------------|---------|
| MCAR | Nothing | Data entry errors; random equipment failure |
| MAR | Observed variables only | Young people skip income questions |
| MNAR | Missing values themselves | Depressed people don't report depression |
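One practical consequence of the table is that MCAR can be partly checked from the data: if missingness is associated with an observed variable, MCAR is implausible. The sketch below compares an observed covariate between rows with and without a missing value using a Welch t statistic. The helper `missingness_t_stat` and the simulated `age` variable are hypothetical. A check like this cannot separate MAR from MNAR, because that distinction depends on values that are not observed.

```python
import numpy as np

def missingness_t_stat(covariate, is_missing):
    """Welch t statistic comparing an observed covariate between rows
    where another variable is missing and rows where it is observed."""
    a = covariate[is_missing]
    b = covariate[~is_missing]
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se

rng = np.random.default_rng(3)
n = 5_000
age = rng.normal(40, 12, n)                       # fully observed covariate

# Missingness unrelated to age (MCAR-like).
mcar_mask = rng.random(n) < 0.3
# Missingness more likely for younger people (depends on an observed variable).
mar_mask = rng.random(n) < 1 / (1 + np.exp((age - 40) / 6))

t_mcar = missingness_t_stat(age, mcar_mask)
t_mar = missingness_t_stat(age, mar_mask)

print(f"t statistic, MCAR-like mask: {t_mcar:.2f}")
print(f"t statistic, age-dependent mask: {t_mar:.2f}")
```

A large absolute t statistic is evidence against MCAR. A small one does not prove MCAR, since missingness may depend on variables that were not checked.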


Handling Missing Data

Listwise Deletion

Delete any row with missing values.

| Pros | Cons |
|------|------|
| Simple; unbiased under MCAR | Loses data; reduces power |
| | Biased under MAR and MNAR |
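In pandas, listwise deletion is a single call to `DataFrame.dropna()`. The toy table below uses invented values. It shows how quickly rows disappear when different columns are missing for different rows.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with scattered missing values.
df = pd.DataFrame({
    "age":    [25, 31, np.nan, 47, 52],
    "income": [38.0, np.nan, 45.0, 61.0, np.nan],
    "score":  [3.1, 2.8, 3.5, np.nan, 4.0],
})

complete = df.dropna()                    # listwise: drop rows with any NaN
rows_lost = len(df) - len(complete)
per_column_missing = df.isna().mean()     # fraction missing in each column

print(per_column_missing)
print(f"rows kept: {len(complete)} of {len(df)} (lost {rows_lost})")
```

No single column is more than 40% missing here, yet only one of the five rows survives deletion.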

Mean Imputation

Replace missing values with the observed mean.

Mean Imputation Problems

  • Biases standard errors downward

  • Distorts correlations and distributions

  • Never recommended for statistical analysis
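The first two problems can be seen in a short simulation. This sketch uses simulated data with a known correlation of 0.8 and removes 40% of one variable completely at random. All numbers are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000

x = rng.normal(0, 1, n)
y = 0.8 * x + rng.normal(0, 0.6, n)       # corr(x, y) = 0.8 by construction

# Remove 40% of y completely at random, then fill with the observed mean.
missing = rng.random(n) < 0.4
y_imputed = y.copy()
y_imputed[missing] = y[~missing].mean()

sd_true, sd_imputed = y.std(ddof=1), y_imputed.std(ddof=1)
corr_true = np.corrcoef(x, y)[0, 1]
corr_imputed = np.corrcoef(x, y_imputed)[0, 1]

# The naive standard error treats imputed values as if they were real observations.
se_naive = sd_imputed / np.sqrt(n)
se_honest = y[~missing].std(ddof=1) / np.sqrt((~missing).sum())

print(f"SD:   true {sd_true:.3f} vs mean-imputed {sd_imputed:.3f}")
print(f"corr: true {corr_true:.3f} vs mean-imputed {corr_imputed:.3f}")
print(f"SE of mean: naive {se_naive:.4f} vs complete-case {se_honest:.4f}")
```

The imputed column has a spike at the mean, so its spread and its correlation with `x` both shrink, and the naive standard error understates the real uncertainty.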

Multiple Imputation

Definition: Multiple Imputation

Create M complete datasets by imputing missing values with plausible values drawn from their predictive distribution. Analyze each dataset and combine results using Rubin's rules.


Multiple Imputation: Rubin's Rules

Combined Estimate

$\bar{Q} = \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m$

Here,

  • $\hat{Q}_m$ = Estimate from imputed dataset $m$
  • $M$ = Number of imputations

Combined Variance

$T = \bar{U} + \left(1 + \frac{1}{M}\right)B$

Here,

  • $\bar{U}$ = Within-imputation variance: $\frac{1}{M}\sum_{m=1}^{M} U_m$, where $U_m$ is the estimated variance (squared standard error) of $\hat{Q}_m$
  • $B$ = Between-imputation variance: $\frac{1}{M-1}\sum_{m=1}^{M}(\hat{Q}_m - \bar{Q})^2$
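The two formulas translate directly into a few lines of NumPy. In the sketch below, `pool_rubin` is a hypothetical helper name, and the five estimates and variances are made-up inputs. Only the point estimate and total variance are pooled, not degrees of freedom.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool M point estimates and their squared standard errors with Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                      # combined estimate
    u_bar = u.mean()                      # within-imputation variance
    b = q.var(ddof=1)                     # between-imputation variance
    t = u_bar + (1 + 1 / m) * b           # total variance
    return q_bar, t

# Hypothetical estimates and squared standard errors from M = 5 imputed datasets.
q_hat = [4.2, 4.5, 4.4, 4.6, 4.3]
u_hat = [0.10, 0.12, 0.11, 0.10, 0.12]

q_bar, total_var = pool_rubin(q_hat, u_hat)
print(f"pooled estimate = {q_bar:.3f}, pooled SE = {np.sqrt(total_var):.3f}")
```

For these inputs the within-imputation variance is 0.11 and the between-imputation variance is 0.025, so $T = 0.11 + 1.2 \times 0.025 = 0.14$.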

Number of Imputations

Use at least M = 20 imputations. When a large share of values is missing, use M = 50 to 100 so that the pooled estimates and standard errors are stable.


Predictive Mean Matching (PMM)

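A widely used imputation method for numeric variables. For each missing value:

  1. Fit a regression predicting the variable from the other variables

  2. Find the observed cases (donors) whose predicted values are closest to the prediction for the missing case

  3. Randomly pick one of those donors and use its observed value as the imputation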

PMM Advantage

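Because every imputed value is copied from an observed case, PMM produces plausible values within the range of the observed data. It does not extrapolate beyond the data, which makes it more robust to model misspecification than imputing the regression prediction directly.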
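The three steps can be sketched in NumPy for the simplest case of one predictor. This is a simplified illustration, not a full implementation: `pmm_impute` and its parameter `k` (the number of candidate donors) are hypothetical names, and the sketch omits the random draw of regression parameters that full PMM implementations typically include to reflect model uncertainty.

```python
import numpy as np

def pmm_impute(x, y, k=5, rng=None):
    """Impute NaNs in y by predictive mean matching on a single predictor x."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    obs = ~np.isnan(y)

    # Step 1: regression of y on x, fit to the observed rows.
    slope, intercept = np.polyfit(x[obs], y[obs], deg=1)
    pred = intercept + slope * x

    y_out = y.copy()
    donors_pred, donors_y = pred[obs], y[obs]
    for i in np.flatnonzero(~obs):
        # Step 2: the k observed rows whose predictions are closest to this row's.
        nearest = np.argsort(np.abs(donors_pred - pred[i]))[:k]
        # Step 3: copy the observed value of one randomly chosen donor.
        y_out[i] = donors_y[rng.choice(nearest)]
    return y_out

# Usage on simulated data with about 25% of y missing.
rng = np.random.default_rng(5)
x = rng.normal(0, 1, 300)
y = 1.5 * x + rng.normal(0, 0.5, 300)
y_missing = y.copy()
y_missing[rng.random(300) < 0.25] = np.nan

y_filled = pmm_impute(x, y_missing, k=5, rng=rng)
print(f"imputed {np.isnan(y_missing).sum()} values; "
      f"range stays within [{np.nanmin(y_missing):.2f}, {np.nanmax(y_missing):.2f}]")
```

Choosing randomly among several donors, rather than always taking the single closest one, keeps some variability in the imputed values.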


Python Implementation


```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# IterativeImputer is still experimental, so this enabling import must come first
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

np.random.seed(42)

# Simulate three correlated variables
n = 500
X1 = np.random.randn(n)
X2 = 0.7 * X1 + np.random.randn(n) * 0.5
X3 = 0.3 * X1 + 0.4 * X2 + np.random.randn(n) * 0.8

# MAR: the probability that X1 is missing depends only on the fully observed X2
missing_prob = 1 / (1 + np.exp(-(-1 + 0.5 * X2)))
R = np.random.binomial(1, missing_prob)  # 1 = missing, 0 = observed
X1_obs = X1.copy()
X1_obs[R == 1] = np.nan

df = pd.DataFrame({'X1': X1_obs, 'X2': X2, 'X3': X3})
print(f"Missing in X1: {df['X1'].isna().sum()} ({df['X1'].isna().mean():.1%})")

# Listwise deletion
complete = df.dropna()
print(f"\nListwise deletion: n={len(complete)}")
print(f"X1 mean (complete): {complete['X1'].mean():.3f} (true: {X1.mean():.3f})")

# Multiple imputation by chained equations (MICE).
# With the default sample_posterior=False, IterativeImputer returns a single
# deterministic imputation. sample_posterior=True draws from the predictive
# distribution, so each seed yields a different completed dataset.
M = 20
imputed_sets = []
for m in range(M):
    mice = IterativeImputer(sample_posterior=True, max_iter=10, random_state=m)
    imputed_sets.append(pd.DataFrame(mice.fit_transform(df), columns=df.columns))

# Pool the point estimate across imputations (Rubin's combined estimate)
pooled_mean = np.mean([d['X1'].mean() for d in imputed_sets])
print(f"\nMICE imputation (M={M}):")
print(f"X1 mean (pooled): {pooled_mean:.3f} (true: {X1.mean():.3f})")
imputed = imputed_sets[0]  # one completed dataset, used for plotting

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].hist(X1, bins=30, alpha=0.5, label='True')
axes[0].hist(complete['X1'], bins=30, alpha=0.5, label='Listwise')
axes[0].legend()
axes[0].set_title('Listwise Deletion')

axes[1].hist(X1, bins=30, alpha=0.5, label='True')
axes[1].hist(imputed['X1'], bins=30, alpha=0.5, label='MICE (one imputed dataset)')
axes[1].legend()
axes[1].set_title('Multiple Imputation')
plt.tight_layout()
plt.show()
```

Worked Example

Example: Clinical Trial with Dropout

A drug trial has 20% dropout due to side effects, so missingness is plausibly MNAR. The bias column is measured against a true mean effect of 4.5:

| Method | Mean Effect | Bias |
|--------|------------|------|
| Complete cases | 3.8 | -0.7 (underestimate) |
| Mean imputation | 3.2 | -1.3 (severe underestimate) |
| Multiple imputation | 4.4 | -0.1 (minimal bias) |
| Pattern-mixture model | 4.6 | +0.1 (minimal bias) |

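Complete-case analysis and mean imputation both underestimate the effect, while multiple imputation and the pattern-mixture model come within 0.1 of it. Multiple imputation yields valid estimates when missingness is MAR. Under MNAR, specialized models (pattern-mixture, selection models) are needed.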
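A very simple pattern-mixture model is the delta adjustment: assume the mean outcome among dropouts differs from the mean among completers by some amount `delta`, and see how the overall estimate moves as `delta` varies. The sketch below uses simulated outcomes. The mean of 4.5 and the 20% dropout rate are borrowed from the example only as illustrative settings, and the function name and `delta` grid are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2_000

# Hypothetical trial outcomes; about 20% of patients drop out.
effect = rng.normal(4.5, 2.0, n)
dropped = rng.random(n) < 0.20
observed = effect[~dropped]               # only completers are seen by the analyst

def pattern_mixture_mean(observed, n_missing, delta):
    """Overall mean if the dropouts' mean differs from the completers' mean by delta."""
    n_obs = observed.size
    missing_mean = observed.mean() + delta      # untestable assumption about dropouts
    return (n_obs * observed.mean() + n_missing * missing_mean) / (n_obs + n_missing)

deltas = [-2.0, -1.0, 0.0, 1.0, 2.0]
estimates = {d: pattern_mixture_mean(observed, dropped.sum(), d) for d in deltas}
for d, est in estimates.items():
    print(f"delta = {d:+.1f} -> estimated mean effect = {est:.2f}")
```

With `delta = 0` the result equals the complete-case mean. Each unit of `delta` shifts the estimate by the dropout fraction, so the range of estimates across plausible `delta` values shows how sensitive the conclusion is to the MNAR assumption.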


Key Takeaways

Summary: Missing Data

  • MCAR: Missingness is completely random — observed data is representative

  • MAR: Missingness depends on observed data — MI handles this well

  • MNAR: Missingness depends on missing values — requires specialized models

  • Multiple imputation (MI) is the gold standard for MAR data

  • Use Rubin's rules to combine estimates across M imputed datasets

  • Mean imputation is biased — never use it for analysis

  • Use at least M = 20 imputations; more for large amounts of missingness

  • Conduct sensitivity analysis for potential MNAR

