Kaplan-Meier Estimator — Survival Function

Non-Parametric Estimation of Survival Probabilities

The Kaplan-Meier estimator constructs the survival function step-by-step at each event time, handling censored observations correctly. It produces the iconic survival curve used throughout medical and reliability research.

Clinical Trials — Estimate patient survival probabilities with varying follow-up times
Manufacturing — Predict component reliability with incomplete failure data
Customer Analytics — Model subscription duration with right-censored observations

Each step down in the survival curve represents real events, properly weighted for those still at risk.

The Kaplan-Meier estimator is a non-parametric method for estimating the survival function from time-to-event data, even when observations are censored.

Definition: Survival Function

The probability that the event has not yet occurred by time $t$:

Survival Function

$$S(t) = P(T > t) = 1 - F(t)$$

Here,

$T$ = Time until the event occurs
$S(t)$ = Probability of surviving past time $t$
$F(t)$ = Cumulative distribution function of $T$
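
As a quick illustration of the identity $S(t) = 1 - F(t)$, the sketch below computes both sides empirically for a small set of fully observed event times. The data and the helper name `empirical_survival` are made up for this example, and the approach only works when nothing is censored.

```python
def empirical_survival(times, t):
    """Fraction of subjects whose event time exceeds t (valid only without censoring)."""
    return sum(1 for x in times if x > t) / len(times)


# Hypothetical event times (for example, months), all fully observed
event_times = [2, 3, 3, 5, 8, 8, 9, 12]

s_at_4 = empirical_survival(event_times, 4)
f_at_4 = sum(1 for x in event_times if x <= 4) / len(event_times)

print(s_at_4, 1 - f_at_4)  # both are 0.625: five of the eight times exceed 4
```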

Censoring

Definition: Right Censoring

An observation is right-censored if the event has not occurred by the end of the study period. We know the survival time is at least as long as the observed time.

| Type | Description |
|------|------------|
| Right-censored | Event not observed before study ends |
| Left-censored | Event occurred before study began |
| Interval-censored | Event known to occur in an interval |
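
Right-censored data are usually stored as a pair per subject: an observed duration and a flag saying whether the event was seen. The sketch below shows one way to build those pairs, assuming a single fixed study end time; the function name `apply_right_censoring` and the numbers are hypothetical.

```python
def apply_right_censoring(event_times, study_end):
    """Turn true event times into (duration, event_observed) pairs.

    Assumes every subject enters at time 0 and the only source of
    censoring is the end of the study.
    """
    observed = []
    for t in event_times:
        duration = min(t, study_end)             # we never see past the study end
        event_observed = 1 if t <= study_end else 0
        observed.append((duration, event_observed))
    return observed


# Hypothetical true event times; the study ends at time 12
true_times = [4, 15, 7, 22, 10]
data = apply_right_censoring(true_times, study_end=12)
print(data)  # [(4, 1), (12, 0), (7, 1), (12, 0), (10, 1)]
```

For the two censored subjects, all the data record is that the survival time is at least 12.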

Why Kaplan-Meier Matters

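Naive summaries such as the sample mean or median of the observed times are biased when data are censored: treating a censored time as an event time understates survival, and dropping censored subjects discards those who tend to survive longest. Kaplan-Meier uses all available information, including the partial information that a censored subject survived at least until its censoring time.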
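
The bias of naive summaries is easy to see on a toy data set. In this sketch the true event times are known (something that never happens in practice) so the naive estimates can be compared against the truth; all numbers are invented for illustration.

```python
# Hypothetical true event times; the study stops observing at time 12.
true_times = [4, 15, 7, 22, 10]
study_end = 12

# Naive approach 1: treat the censoring time as if it were the event time
observed = [min(t, study_end) for t in true_times]

# Naive approach 2: drop censored subjects and keep observed events only
events_only = [t for t in true_times if t <= study_end]

true_mean = sum(true_times) / len(true_times)
naive_mean = sum(observed) / len(observed)
events_only_mean = sum(events_only) / len(events_only)

print(true_mean, naive_mean, events_only_mean)  # 11.6 9.0 7.0
```

Both naive means fall below the true mean, because the longest survival times are exactly the ones that get truncated or discarded.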

Kaplan-Meier Formula

Kaplan-Meier Estimator

$$\hat{S}(t) = \prod_{t_i \leq t}\left(1 - \frac{d_i}{n_i}\right)$$

Here,

$t_i$ = Time of the $i$-th event
$d_i$ = Number of events at time $t_i$
$n_i$ = Number at risk just before time $t_i$

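The estimator is a step function that drops at each event time and stays flat in between. Censored subjects cause no drop; they only leave the risk set $n_i$ for later event times.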
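
The product formula can be traced by hand with a short pure-Python sketch. The function name `kaplan_meier`, its return layout, and the six-subject data set are assumptions made for illustration; subjects censored at an event time are counted as still at risk at that time, which is the usual convention.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t_i, n_i, d_i, S_hat(t_i))] for each distinct event time.

    durations: observed times; events: 1 if the event was seen, 0 if censored.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    leaving = Counter(durations)      # everyone leaves the risk set at their own time
    n_at_risk = len(durations)
    survival = 1.0
    table = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d > 0:                     # the curve only drops at event times
            survival *= 1 - d / n_at_risk
            table.append((t, n_at_risk, d, survival))
        n_at_risk -= leaving[t]       # events and censorings both shrink the risk set
    return table


# Hypothetical data: the second subject at time 5 and the one at time 10 are censored
durations = [3, 5, 5, 8, 10, 12]
events = [1, 1, 0, 1, 0, 1]

km_table = kaplan_meier(durations, events)
for t, n, d, s in km_table:
    print(f"t={t:>2}  at risk={n}  events={d}  S={s:.3f}")
```

Note how the censoring at time 10 produces no row of its own but reduces the number at risk for the event at time 12.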

Standard Error

Greenwood's Formula

$$\widehat{\text{Var}}[\hat{S}(t)] = \hat{S}(t)^2 \sum_{t_i \leq t}\frac{d_i}{n_i(n_i - d_i)}$$

Here,

$d_i$ = Number of events at time $t_i$
$n_i$ = Number at risk just before time $t_i$
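
Greenwood's formula accumulates one term per event time alongside the survival product. The sketch below takes a list of $(n_i, d_i)$ pairs for the event times up to $t$; the function name `greenwood_se` and the two pairs used are illustrative assumptions, and the code does not handle the degenerate case $n_i = d_i$, where the term is undefined.

```python
import math


def greenwood_se(risk_table):
    """risk_table: list of (n_i, d_i) for event times t_i <= t, in time order.

    Returns (S_hat(t), standard error of S_hat(t)).
    Assumes d_i < n_i for every row; otherwise the variance term divides by zero.
    """
    survival = 1.0
    term_sum = 0.0
    for n, d in risk_table:
        survival *= 1 - d / n
        term_sum += d / (n * (n - d))
    variance = survival ** 2 * term_sum
    return survival, math.sqrt(variance)


# Illustrative risk table: 5 events among 100 at risk, then 8 among 94
s_hat, se = greenwood_se([(100, 5), (94, 8)])
print(f"S_hat = {s_hat:.3f}, SE = {se:.3f}")
```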

An approximate 95% confidence interval is:

Confidence Interval

$$\hat{S}(t) \pm 1.96 \times \widehat{\text{SE}}[\hat{S}(t)]$$

Here,

$\widehat{\text{SE}}$ = Estimated standard error, the square root of the variance from Greenwood's formula
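
A minimal sketch of the interval calculation follows. The plain $\pm 1.96 \times \text{SE}$ interval can extend outside $[0, 1]$ when $\hat{S}(t)$ is near 0 or 1, so this version clips the bounds; the clipping and the function name `plain_ci` are assumptions of this example, not part of the formula above.

```python
def plain_ci(s_hat, se, z=1.96):
    """Symmetric interval S_hat +/- z * SE, clipped to the valid range [0, 1]."""
    lower = max(0.0, s_hat - z * se)
    upper = min(1.0, s_hat + z * se)
    return lower, upper


# Illustrative inputs: a survival estimate and its Greenwood standard error
lower, upper = plain_ci(0.869, 0.034)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")

# Near the boundary the unclipped upper bound would exceed 1
print(plain_ci(0.98, 0.02))
```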

Log-Rank Test

The log-rank test compares survival curves between two or more groups.

Log-Rank Test Statistic

$$\chi^2 = \frac{\left(\sum_{i}(O_{1i} - E_{1i})\right)^2}{\sum_{i}V_i}$$

Here,

$O_{1i}$ = Observed events in group 1 at time $t_i$
$E_{1i}$ = Expected events in group 1 at time $t_i$ under $H_0$
$V_i$ = Variance of $(O_{1i} - E_{1i})$

| Hypothesis | Meaning |
|-----------|---------|
| $H_0$ : $S_1(t) = S_2(t)$ | No difference in survival between groups |
| $H_1$ : $S_1(t) \neq S_2(t)$ | Survival curves differ |
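
To make the statistic concrete, here is a two-group sketch in pure Python. It uses the standard hypergeometric forms $E_{1i} = d_i n_{1i} / n_i$ and $V_i = n_{1i} n_{2i} d_i (n_i - d_i) / (n_i^2 (n_i - 1))$, which are not spelled out above, and it converts the statistic to a p-value with the one-degree-of-freedom chi-square tail. The function names and the data are hypothetical.

```python
import math


def logrank_statistic(times_1, events_1, times_2, events_2):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    all_times = list(times_1) + list(times_2)
    all_events = list(events_1) + list(events_2)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e == 1})

    o_minus_e = 0.0
    variance = 0.0
    for t in event_times:
        n1 = sum(1 for x in times_1 if x >= t)      # at risk in group 1
        n2 = sum(1 for x in times_2 if x >= t)      # at risk in group 2
        d1 = sum(1 for x, e in zip(times_1, events_1) if x == t and e == 1)
        d2 = sum(1 for x, e in zip(times_2, events_2) if x == t and e == 1)
        n, d = n1 + n2, d1 + d2
        o_minus_e += d1 - d * n1 / n                # observed minus expected
        if n > 1:                                   # variance term is 0/0 when n == 1
            variance += n1 * n2 * d * (n - d) / (n ** 2 * (n - 1))
    return o_minus_e ** 2 / variance                # assumes variance > 0


def chi2_pvalue_1df(chi2):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical data: group A tends to survive longer than group B
times_a, events_a = [6, 7, 10, 15, 19], [1, 0, 1, 1, 0]
times_b, events_b = [1, 3, 4, 5, 8], [1, 1, 1, 0, 1]

chi2 = logrank_statistic(times_a, events_a, times_b, events_b)
print(f"chi2 = {chi2:.2f}, p = {chi2_pvalue_1df(chi2):.4f}")
```

With more than two groups the statistic generalizes to a quadratic form with more degrees of freedom, which this sketch does not cover.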

Median Survival Time

The median survival is the smallest time $t$ at which $S(t) \leq 0.5$ .

Median Survival

$$\hat{t}_{\text{med}} = \inf\{t : \hat{S}(t) \leq 0.5\}$$

Here,

$\hat{S}(t)$ = Kaplan-Meier estimate of survival

When Median is Undefined

If the survival curve never drops to 0.5 or below (for example, when more than half of the subjects are still event-free at the end of follow-up), the median survival time is undefined. Report it as "not reached" (NR).
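
Because the estimate is a step function, finding the median amounts to scanning the steps for the first one at or below 0.5. The sketch below assumes the curve is given as a time-ordered list of `(t_i, S_hat(t_i))` pairs; the function name `median_survival` and the step values are illustrative.

```python
def median_survival(km_steps):
    """Smallest event time with S_hat(t) <= 0.5, or None if the median is not reached.

    km_steps: list of (t_i, S_hat(t_i)) pairs sorted by time.
    """
    for t, s in km_steps:
        if s <= 0.5:
            return t
    return None


# Illustrative curve that crosses 0.5 at time 8
steps = [(3, 0.833), (5, 0.667), (8, 0.444), (12, 0.0)]
print(median_survival(steps))                   # 8

# Illustrative curve that never reaches 0.5: median not reached
print(median_survival([(3, 0.9), (6, 0.8)]))    # None
```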

Python Implementation


```python
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test
import matplotlib.pyplot as plt

np.random.seed(42)

# Simulate survival data
n = 200
treatment = np.random.binomial(1, 0.5, n)
time = np.where(treatment,
                np.random.exponential(12, n),  # Treatment: longer survival
                np.random.exponential(8, n))   # Control: shorter survival
censored = np.random.binomial(1, 0.2, n)       # 20% censoring
event = 1 - censored

# Kaplan-Meier curves
kmf_treat = KaplanMeierFitter()
kmf_control = KaplanMeierFitter()

mask_treat = treatment == 1
kmf_treat.fit(time[mask_treat], event[mask_treat], label='Treatment')
kmf_control.fit(time[~mask_treat], event[~mask_treat], label='Control')

# Plot
fig, ax = plt.subplots(figsize=(8, 5))
kmf_treat.plot_survival_function(ax=ax)
kmf_control.plot_survival_function(ax=ax)
ax.set_title('Kaplan-Meier Survival Curves')
ax.set_xlabel('Time')
ax.set_ylabel('Survival Probability')
plt.show()

# Median survival
print(f"Treatment median: {kmf_treat.median_survival_time_:.1f}")
print(f"Control median: {kmf_control.median_survival_time_:.1f}")

# Log-rank test
result = logrank_test(time[mask_treat], time[~mask_treat],
                      event_observed_A=event[mask_treat],
                      event_observed_B=event[~mask_treat])
print(f"\nLog-rank test: χ²={result.test_statistic:.2f}, p={result.p_value:.4f}")
```

Worked Example

Example: Drug Trial

Comparing survival times between treatment and control groups:

| Time | Control (at risk) | Events | Treatment (at risk) | Events |
|------|-------------------|--------|---------------------|--------|
| 3 | 100 | 5 | 100 | 2 |
| 6 | 94 | 8 | 97 | 3 |
| 9 | 85 | 6 | 93 | 4 |
| 12 | 78 | 4 | 88 | 3 |

Control $\hat{S}(6) = (1 - 5/100)(1 - 8/94) = 0.95 \times 0.915 = 0.869$

Treatment $\hat{S}(6) = (1 - 2/100)(1 - 3/97) = 0.98 \times 0.969 = 0.950$

Log-rank test: $\chi^2 = 6.82$, $p = 0.009$, so the treatment group has significantly better survival.
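
The $\hat{S}(6)$ calculations can be reproduced, and extended to the later time points, with a few lines of Python. The at-risk and event counts are taken from the table; the helper name `km_curve` is made up for this sketch.

```python
def km_curve(risk_table):
    """Cumulative Kaplan-Meier estimates for a list of (n_i, d_i) rows in time order."""
    survival = 1.0
    curve = []
    for n, d in risk_table:
        survival *= 1 - d / n
        curve.append(survival)
    return curve


times = [3, 6, 9, 12]
control = [(100, 5), (94, 8), (85, 6), (78, 4)]      # (at risk, events) per row
treatment = [(100, 2), (97, 3), (93, 4), (88, 3)]

s_control = km_curve(control)
s_treatment = km_curve(treatment)

for t, sc, st in zip(times, s_control, s_treatment):
    print(f"t={t:>2}  control={sc:.3f}  treatment={st:.3f}")
```

The row for $t = 6$ matches the hand calculation: 0.869 for control and 0.950 for treatment.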

Figure: Kaplan-Meier survival curves for the drug trial.

Key Takeaways

Summary: Kaplan-Meier Estimator

Kaplan-Meier estimates the survival function from censored data
The estimator is a step function that drops at each observed event time
Greenwood's formula provides the standard error for confidence intervals
The log-rank test compares survival curves between groups
Median survival is the smallest time at which $\hat{S}(t) \leq 0.5$; it is undefined if the curve never reaches 0.5
The method assumes that censoring is independent of the event process
Always report confidence intervals alongside point estimates
