Kaplan-Meier Estimator — Survival Function
Statistics
Non-Parametric Estimation of Survival Probabilities
The Kaplan-Meier estimator constructs the survival function step-by-step at each event time, handling censored observations correctly. It produces the iconic survival curve used throughout medical and reliability research.
- Clinical Trials: Estimate patient survival probabilities with varying follow-up times
- Manufacturing: Predict component reliability with incomplete failure data
- Customer Analytics: Model subscription duration with right-censored observations
Each step down in the survival curve represents real events, properly weighted for those still at risk.
The Kaplan-Meier estimator is a non-parametric method for estimating the survival function from time-to-event data, even when observations are censored.
Definition: Survival Function
The probability that the event has not yet occurred by time $t$:
Survival Function
$$S(t) = P(T > t) = 1 - F(t)$$
Here,
- $T$ = Time until event occurs
- $S(t)$ = Probability of surviving past time $t$
- $F(t)$ = Cumulative distribution function of $T$
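As a small illustration of the definition, the sketch below evaluates the survival function empirically on fully observed (uncensored) data, where it is simply the fraction of subjects whose event time exceeds a given time. The function name `empirical_survival` and the sample times are made up for this example.

```python
def empirical_survival(times, t):
    """Fraction of subjects whose event time exceeds t (assumes no censoring)."""
    return sum(1 for x in times if x > t) / len(times)


# Hypothetical, fully observed event times.
times = [2, 3, 3, 5, 8]

s3 = empirical_survival(times, 3)                    # estimate of P(T > 3)
f3 = sum(1 for x in times if x <= 3) / len(times)    # empirical CDF at 3
print(s3, 1 - f3)                                    # two ways of writing the same quantity
```

Once some observations are censored, this simple fraction is no longer valid, which is the gap the Kaplan-Meier estimator fills.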
Censoring
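Definition: Right Censoring
An observation is right-censored if the event has not been observed by the time follow-up ends, whether because the study period ended or because the subject dropped out or was lost to follow-up. We know only that the survival time is at least as long as the observed time.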
| Type | Description |
|------|------------|
| Right-censored | Event not observed before follow-up ends; true time exceeds the observed time |
| Left-censored | Event occurred before observation began; exact time unknown |
| Interval-censored | Event known only to have occurred within a time interval |
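Right-censored data are commonly stored as one pair per subject: an observed duration and an indicator of whether the event was seen. The sketch below shows one way to build such pairs when every subject is cut off at a fixed study end (administrative censoring); the helper `observe` and all numbers are hypothetical.

```python
def observe(event_time, study_end):
    """Return (duration, event_observed) for one subject.

    event_observed is 1 if the event happened by study_end,
    otherwise 0 (right-censored at study_end).
    """
    if event_time <= study_end:
        return event_time, 1
    return study_end, 0


# Hypothetical true event times; in a real study, values past the cutoff are never seen.
true_times = [4.0, 15.0, 9.5, 30.0]
records = [observe(t, study_end=12.0) for t in true_times]
print(records)
```

The duration/indicator layout is the same shape of input that the `fit` calls in the Python implementation later in this article receive.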
Why Kaplan-Meier Matters
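Naive summaries such as the sample mean or median of the observed times are biased when data are censored: dropping censored subjects and treating censoring times as event times both tend to underestimate survival. Kaplan-Meier uses all available information, including the fact that each censored subject survived at least until its censoring time.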
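A tiny numeric illustration of that bias, under stated assumptions: the true event times below are invented, and a fixed cutoff at time 12 censors the later ones. Both naive averages fall short of the true mean.

```python
# Hypothetical true event times and a fixed study cutoff.
true_times = [3.0, 6.0, 10.0, 14.0, 20.0, 31.0]
cutoff = 12.0

# What the study actually records.
durations = [min(t, cutoff) for t in true_times]
events = [1 if t <= cutoff else 0 for t in true_times]

true_mean = sum(true_times) / len(true_times)

# Naive approach 1: pretend censoring times are event times.
naive_all = sum(durations) / len(durations)

# Naive approach 2: drop censored subjects entirely.
observed_only = [d for d, e in zip(durations, events) if e == 1]
naive_events_only = sum(observed_only) / len(observed_only)

print(f"true mean:             {true_mean:.2f}")
print(f"censored as events:    {naive_all:.2f}")
print(f"censored rows dropped: {naive_events_only:.2f}")
```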
Kaplan-Meier Formula
Kaplan-Meier Estimator
$$\hat{S}(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right)$$
Here,
- $t_i$ = Time of the i-th event
- $d_i$ = Number of events at time $t_i$
- $n_i$ = Number at risk just before time $t_i$
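The estimator is a step function: it drops at each event time and stays flat in between. Censored observations cause no drop, but they leave the risk set, so the number at risk is smaller at later event times.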
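The product over event times can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function name `kaplan_meier` and the toy data are assumptions, and subjects censored at an event time are counted as still at risk at that time (the usual convention).

```python
def kaplan_meier(durations, events):
    """Return a list of (t_i, n_i, d_i, S_hat(t_i)) for each distinct event time.

    durations: observed times; events: 1 if the event was observed, 0 if censored.
    """
    event_times = sorted({t for t, e in zip(durations, events) if e})
    surv = 1.0
    table = []
    for t in event_times:
        n_i = sum(1 for x in durations if x >= t)  # at risk just before t
        d_i = sum(1 for x, e in zip(durations, events) if x == t and e)
        surv *= 1.0 - d_i / n_i                    # multiply in the factor (1 - d_i / n_i)
        table.append((t, n_i, d_i, surv))
    return table


# Hypothetical toy data: 8 subjects, 3 of them censored (event = 0).
durations = [2, 3, 3, 5, 8, 8, 9, 12]
events = [1, 1, 0, 1, 1, 0, 1, 0]

km_table = kaplan_meier(durations, events)
for t, n_i, d_i, s in km_table:
    print(f"t={t:>2}  n_i={n_i}  d_i={d_i}  S_hat={s:.3f}")
```

In this toy data the subject censored at time 3 still counts in the risk set at $t = 3$ but not afterwards, which is how censoring shrinks $n_i$ without producing a step.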
Standard Error
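Greenwood's Formula
$$\widehat{SE}\left[\hat{S}(t)\right] = \hat{S}(t) \sqrt{\sum_{i:\, t_i \le t} \frac{d_i}{n_i (n_i - d_i)}}$$
Here,
- $d_i$ = Events at time $t_i$
- $n_i$ = Number at risk just before time $t_i$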
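Greenwood's formula accumulates the term $d_i / (n_i (n_i - d_i))$ over event times and scales its square root by the current survival estimate. A minimal sketch, assuming the same duration/indicator input as elsewhere in this article; the function name `km_greenwood`, the toy data, and the choice to return NaN when everyone at risk fails are all assumptions of this example.

```python
import math


def km_greenwood(durations, events):
    """Return (t_i, S_hat(t_i), SE(t_i)) at each event time, SE from Greenwood's formula."""
    event_times = sorted({t for t, e in zip(durations, events) if e})
    surv, cum = 1.0, 0.0
    rows = []
    for t in event_times:
        n_i = sum(1 for x in durations if x >= t)  # at risk just before t
        d_i = sum(1 for x, e in zip(durations, events) if x == t and e)
        surv *= 1.0 - d_i / n_i
        if n_i > d_i:
            cum += d_i / (n_i * (n_i - d_i))       # running Greenwood sum
            se = surv * math.sqrt(cum)
        else:
            se = float("nan")                      # everyone at risk failed; term undefined (choice made here)
        rows.append((t, surv, se))
    return rows


# Hypothetical toy data: 8 subjects, 3 of them censored.
durations = [2, 3, 3, 5, 8, 8, 9, 12]
events = [1, 1, 0, 1, 1, 0, 1, 0]

greenwood_rows = km_greenwood(durations, events)
for t, s, se in greenwood_rows:
    print(f"t={t:>2}  S_hat={s:.3f}  SE={se:.3f}")
```

Before any censoring has occurred, the Greenwood standard error coincides with the ordinary binomial one, $\sqrt{\hat{S}(1 - \hat{S})/n}$, which makes a convenient sanity check.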
The 95% confidence interval is:
Confidence Interval
$$\hat{S}(t) \pm 1.96 \cdot \widehat{SE}\left[\hat{S}(t)\right]$$
Here,
- $\widehat{SE}[\hat{S}(t)]$ = Estimated standard error from Greenwood's formula
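The plain interval can stray outside $[0, 1]$ when the estimate is near 0 or 1, so it is usually clipped. The sketch below shows the clipped plain interval and, as one commonly used alternative not covered above, an interval built on the log(-log) scale, which stays inside $(0, 1)$ by construction. Both function names and the input numbers are hypothetical.

```python
import math


def wald_ci(surv, se, z=1.96):
    """Plain interval S_hat +/- z * SE, clipped to [0, 1]."""
    return max(0.0, surv - z * se), min(1.0, surv + z * se)


def loglog_ci(surv, se, z=1.96):
    """Interval computed on the log(-log S_hat) scale; requires 0 < surv < 1."""
    se_theta = se / (surv * abs(math.log(surv)))  # delta-method SE of log(-log S_hat)
    lower = surv ** math.exp(z * se_theta)
    upper = surv ** math.exp(-z * se_theta)
    return lower, upper


# Hypothetical estimate and Greenwood standard error at one time point.
s_hat, se_hat = 0.875, 0.1169

plain = wald_ci(s_hat, se_hat)
transformed = loglog_ci(s_hat, se_hat)
print("plain (clipped):", plain)
print("log-log:        ", transformed)
```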
Log-Rank Test
The log-rank test compares survival curves between two or more groups.
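Log-Rank Test Statistic
$$\chi^2 = \frac{\left[\sum_i (O_{1i} - E_{1i})\right]^2}{\sum_i V_i}$$
Here,
- $O_{1i}$ = Observed events in group 1 at time $t_i$
- $E_{1i}$ = Expected events in group 1 under $H_0$
- $V_i$ = Variance of $(O_{1i} - E_{1i})$
For two groups, the statistic is approximately chi-square distributed with 1 degree of freedom under $H_0$.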
| Hypothesis | Meaning |
|-----------|---------|
| $H_0$: $S_1(t) = S_2(t)$ for all $t$ | No difference in survival between groups |
| $H_1$: $S_1(t) \neq S_2(t)$ for some $t$ | Survival curves differ |
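For two groups, the test can be sketched by hand. The expected count and variance at each event time are not spelled out above, so this example assumes the standard hypergeometric forms: expected events $d_i n_{1i} / n_i$ and variance $n_{1i} n_{2i} d_i (n_i - d_i) / (n_i^2 (n_i - 1))$, where $n_{1i}$ and $n_{2i}$ are the numbers at risk in each group. The function name and toy data are hypothetical.

```python
import math


def logrank_two_groups(dur1, ev1, dur2, ev2):
    """Two-group log-rank test; returns (chi2, p_value) with 1 degree of freedom."""
    pooled = list(zip(dur1, ev1)) + list(zip(dur2, ev2))
    event_times = sorted({t for t, e in pooled if e})
    o_minus_e, var = 0.0, 0.0
    for t in event_times:
        n1 = sum(1 for x in dur1 if x >= t)   # at risk in group 1
        n2 = sum(1 for x in dur2 if x >= t)   # at risk in group 2
        d1 = sum(1 for x, e in zip(dur1, ev1) if x == t and e)
        d2 = sum(1 for x, e in zip(dur2, ev2) if x == t and e)
        n, d = n1 + n2, d1 + d2
        if n < 2:
            continue                          # variance term undefined with one subject left
        o_minus_e += d1 - d * n1 / n          # observed minus expected in group 1
        var += n1 * n2 * d * (n - d) / (n * n * (n - 1))  # hypergeometric variance
    chi2 = o_minus_e ** 2 / var
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value


# Hypothetical toy data: group 1 fails early, group 2 later with some censoring.
g1_dur, g1_ev = [2, 3, 5, 8], [1, 1, 1, 1]
g2_dur, g2_ev = [6, 9, 12, 12], [1, 0, 1, 0]

chi2, p_value = logrank_two_groups(g1_dur, g1_ev, g2_dur, g2_ev)
print(f"chi2={chi2:.2f}, p={p_value:.4f}")
```

The statistic is symmetric in the group labels, so swapping the two groups gives the same value.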
Median Survival Time
The median survival is the smallest time at which $\hat{S}(t) \le 0.5$.
Median Survival
$$\hat{t}_{0.5} = \min\{\, t : \hat{S}(t) \le 0.5 \,\}$$
Here,
- $\hat{S}(t)$ = Kaplan-Meier estimate of survival
When Median is Undefined
If the survival curve never drops to 0.5 or below (that is, the estimated survival is still above one half at the end of follow-up), the median survival time is undefined. Report it as "not reached" (NR).
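A minimal sketch of the median rule, including the "not reached" case. The function name `km_median` and the data are hypothetical, and using `math.inf` as the NR marker is a choice made for this example.

```python
import math


def km_median(durations, events):
    """Smallest event time at which the Kaplan-Meier estimate is <= 0.5, else math.inf."""
    surv = 1.0
    for t in sorted({t for t, e in zip(durations, events) if e}):
        n_i = sum(1 for x in durations if x >= t)  # at risk just before t
        d_i = sum(1 for x, e in zip(durations, events) if x == t and e)
        surv *= 1.0 - d_i / n_i
        if surv <= 0.5:
            return t
    return math.inf                                # curve never reached 0.5: "not reached"


# Hypothetical data where the curve crosses 0.5.
reached = km_median([2, 3, 3, 5, 8, 8, 9, 12], [1, 1, 0, 1, 1, 0, 1, 0])

# Hypothetical data with heavy censoring: only one event among five subjects.
not_reached = km_median([2, 5, 10, 10, 10], [1, 0, 0, 0, 0])

print(reached, "NR" if math.isinf(not_reached) else not_reached)
```

Because the comparison is done in floating point, an estimate that should equal 0.5 exactly may land slightly above it; a small tolerance can be added if that matters.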
Python Implementation
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test
import matplotlib.pyplot as plt
np.random.seed(42)
# Simulate survival data
n = 200
treatment = np.random.binomial(1, 0.5, n)
time = np.where(treatment,
                np.random.exponential(12, n),  # Treatment: longer survival
                np.random.exponential(8, n))   # Control: shorter survival
censored = np.random.binomial(1, 0.2, n) # 20% censoring
event = 1 - censored
# Kaplan-Meier curves
kmf_treat = KaplanMeierFitter()
kmf_control = KaplanMeierFitter()
mask_treat = treatment == 1
kmf_treat.fit(time[mask_treat], event[mask_treat], label='Treatment')
kmf_control.fit(time[~mask_treat], event[~mask_treat], label='Control')
# Plot
fig, ax = plt.subplots(figsize=(8, 5))
kmf_treat.plot_survival_function(ax=ax)
kmf_control.plot_survival_function(ax=ax)
ax.set_title('Kaplan-Meier Survival Curves')
ax.set_xlabel('Time')
ax.set_ylabel('Survival Probability')
plt.show()
# Median survival
print(f"Treatment median: {kmf_treat.median_survival_time_:.1f}")
print(f"Control median: {kmf_control.median_survival_time_:.1f}")
# Log-rank test
result = logrank_test(time[mask_treat], time[~mask_treat],
                      event_observed_A=event[mask_treat],
                      event_observed_B=event[~mask_treat])
print(f"\nLog-rank test: χ²={result.test_statistic:.2f}, p={result.p_value:.4f}")
Worked Example
Example: Drug Trial
Comparing survival times between treatment and control groups:
| Time | At Risk (Control) | Events (Control) | At Risk (Treatment) | Events (Treatment) |
|------|-------------------|------------------|---------------------|--------------------|
| 3 | 100 | 5 | 100 | 2 |
| 6 | 94 | 8 | 97 | 3 |
| 9 | 85 | 6 | 93 | 4 |
| 12 | 78 | 4 | 88 | 3 |
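Control: $\hat{S}(12) = \left(1 - \frac{5}{100}\right)\left(1 - \frac{8}{94}\right)\left(1 - \frac{6}{85}\right)\left(1 - \frac{4}{78}\right) \approx 0.766$
Treatment: $\hat{S}(12) = \left(1 - \frac{2}{100}\right)\left(1 - \frac{3}{97}\right)\left(1 - \frac{4}{93}\right)\left(1 - \frac{3}{88}\right) \approx 0.878$
Log-rank test: p = 0.009, so the treatment group has significantly better survival.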
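The survival estimates for this table can be reproduced in a few lines. The sketch below applies the product formula to the (time, at risk, events) rows of the table; the helper name `km_from_rows` is hypothetical, and it assumes the four rows shown are the only event times up to $t = 12$.

```python
def km_from_rows(rows):
    """rows: iterable of (time, at_risk, events). Returns [(time, S_hat(time))]."""
    surv = 1.0
    curve = []
    for t, n_i, d_i in rows:
        surv *= 1.0 - d_i / n_i   # one Kaplan-Meier factor per event time
        curve.append((t, surv))
    return curve


# Rows copied from the drug trial table: (time, at risk, events).
control_rows = [(3, 100, 5), (6, 94, 8), (9, 85, 6), (12, 78, 4)]
treatment_rows = [(3, 100, 2), (6, 97, 3), (9, 93, 4), (12, 88, 3)]

control_curve = km_from_rows(control_rows)
treatment_curve = km_from_rows(treatment_rows)

for (t, s_c), (_, s_t) in zip(control_curve, treatment_curve):
    print(f"t={t:>2}  control={s_c:.3f}  treatment={s_t:.3f}")
```

The gap between consecutive at-risk counts that is not explained by events (for example, 100 at risk and 5 events at time 3, but only 94 at risk at time 6 in the control arm) reflects subjects censored in between.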
Key Takeaways
Summary: Kaplan-Meier Estimator
- Kaplan-Meier estimates the survival function from censored data
- The estimator is a step function that drops at each observed event time
- Greenwood's formula provides the standard error for confidence intervals
- The log-rank test compares survival curves between groups
- Median survival is the smallest time at which $\hat{S}(t) \le 0.5$; it may be undefined
- The method assumes independent censoring: censoring times carry no information about event times
- Always report confidence intervals alongside point estimates
Related Topics
- See Cox Proportional Hazards for regression with covariates
- See Hypothesis Testing for the log-rank test framework
- See Missing Data for handling different censoring mechanisms