Survival Analysis
Foundations of Statistics
Analyzing Time-to-Event Data With Censoring
Survival analysis handles the unique challenge of censored observations — subjects who haven't experienced the event by study end. From Kaplan-Meier curves to Cox models, these methods extract maximum information from incomplete time-to-event data.
- Clinical Trials — Estimate patient survival probabilities with censored follow-up
- Reliability Engineering — Predict time-to-failure for mechanical components
- Customer Analytics — Model time until churn with subscription cancellations
The event will happen — survival analysis tells us when, even when we don't wait long enough to see it.
Survival analysis models the time until an event occurs (death, failure, relapse). It accommodates censored data: subjects who have not experienced the event by the end of the study.
Definition: Survival Function
The probability of surviving past time t: S(t) = P(T > t), where T is the time to event.
Key functions:
- S(t): survival function (probability of surviving past time t)
- h(t): hazard rate (instantaneous risk at time t)
- H(t): cumulative hazard
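These three functions are linked by standard identities: the cumulative hazard H(t) is the hazard h(t) accumulated from 0 to t, and S(t) = exp(-H(t)). The following is a minimal sketch of that relationship, assuming exponentially distributed event times with a 12-month median; the helper names are illustrative and do not come from any library.

```python
import math

# Assumption: exponential event times with a 12-month median,
# so the hazard is constant: rate = ln(2) / median.
MEDIAN_MONTHS = 12.0
RATE = math.log(2) / MEDIAN_MONTHS

def hazard(t):
    """h(t): instantaneous risk at time t (constant for the exponential)."""
    return RATE

def cumulative_hazard(t):
    """H(t): hazard accumulated over [0, t]."""
    return RATE * t

def survival(t):
    """S(t) = exp(-H(t)): probability of surviving past time t."""
    return math.exp(-cumulative_hazard(t))

for t in (6, 12, 24):
    print(f"t={t:>2}  h={hazard(t):.4f}  H={cumulative_hazard(t):.4f}  S={survival(t):.4f}")
```

At the median (t = 12) the survival probability is one half, and it halves again with every further 12 months, which is what a constant hazard implies.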
Survival Function
S(t) = P(T > t)
Here,
- S(t) = Survival probability at time t
- T = Time to event
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test
import matplotlib.pyplot as plt
np.random.seed(42)
n = 200
# Simulate clinical trial: two treatment groups
# Group A (control): exponential survival, median = 12 months
# Group B (treatment): longer survival, median = 20 months
group = np.random.choice([0, 1], n)
true_median = np.where(group == 0, 12, 20)
duration = np.random.exponential(true_median/np.log(2))
censored_at = 24 # study ends at 24 months
observed = duration <= censored_at
duration_obs = np.minimum(duration, censored_at)
df = pd.DataFrame({
    'duration': duration_obs,
    'event': observed.astype(int),
    'group': np.where(group == 0, 'Control', 'Treatment'),
    'age': np.random.uniform(40, 70, n),
    'stage': np.random.choice([1, 2, 3], n, p=[0.3, 0.5, 0.2])
})
# Kaplan-Meier estimator
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
kmf = KaplanMeierFitter()
for grp, color in [('Control', 'red'), ('Treatment', 'blue')]:
    mask = df['group'] == grp
    kmf.fit(df[mask]['duration'], df[mask]['event'], label=grp)
    kmf.plot_survival_function(ax=axes[0], color=color)
axes[0].set_title('Kaplan-Meier Survival Curves')
axes[0].set_xlabel('Time (months)')
axes[0].set_ylabel('Survival Probability')
# Log-rank test
ctrl = df[df['group']=='Control']
trt = df[df['group']=='Treatment']
result = logrank_test(ctrl['duration'], trt['duration'],
                      ctrl['event'], trt['event'])
axes[0].text(0.05, 0.1, f'Log-rank p = {result.p_value:.4f}',
             transform=axes[0].transAxes)
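# Cox proportional hazards model
# Encode the treatment arm as a 0/1 indicator so every covariate is numeric
cox_df = df[['duration', 'event', 'age', 'stage']].assign(
    treatment=(df['group'] == 'Treatment').astype(int))
cph = CoxPHFitter()
cph.fit(cox_df, duration_col='duration', event_col='event')
cph.print_summary()
# Plot exp(coef) rather than the default log hazard ratios, to match the title
cph.plot(hazard_ratios=True, ax=axes[1])
axes[1].set_title('Cox Model: Hazard Ratios')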
plt.tight_layout()
plt.savefig('survival_analysis.png', dpi=150)
plt.show()
# Median survival times
for grp in ['Control', 'Treatment']:
    mask = df['group'] == grp
    kmf.fit(df[mask]['duration'], df[mask]['event'])
    median = kmf.median_survival_time_
    print(f"{grp}: median survival = {median:.1f} months")
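To make the log-rank test in the example less of a black box, the sketch below computes the two-group statistic by hand. At each event time it compares the events observed in group A with the number expected if both groups shared one hazard, then refers the result to a chi-square distribution with one degree of freedom. The function name and toy data are hypothetical, and this is an illustration of the standard formula rather than the lifelines implementation.

```python
import math

def logrank(dur_a, ev_a, dur_b, ev_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    pooled = zip(dur_a + dur_b, ev_a + ev_b)
    event_times = sorted({d for d, e in pooled if e == 1})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Subjects still at risk just before time t (censored ones included)
        n_a = sum(1 for d in dur_a if d >= t)
        n_b = sum(1 for d in dur_b if d >= t)
        # Events observed exactly at time t
        d_a = sum(1 for d, e in zip(dur_a, ev_a) if d == t and e == 1)
        d_b = sum(1 for d, e in zip(dur_b, ev_b) if d == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    stat = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical toy data (months); event = 1 observed, 0 censored
a_dur, a_ev = [1, 2, 3, 4, 5], [1, 1, 1, 1, 1]
b_dur, b_ev = [6, 7, 8, 9, 10], [1, 1, 1, 0, 0]
stat, p = logrank(a_dur, a_ev, b_dur, b_ev)
print(f"chi-square = {stat:.3f}, p = {p:.4f}")
```

Two identical groups give a statistic of 0 and a p-value of 1, which is a quick sanity check on the formula.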
Censored Data
Censored observations are not missing — they carry information. A censored subject survived at least until the censoring time.
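A hand-rolled Kaplan-Meier estimate shows how that information is used: a censored subject stays in the risk set up to the censoring time and only then drops out, without ever counting as an event. This is a minimal sketch on hypothetical toy data; the function name is illustrative.

```python
def kaplan_meier(durations, events):
    """Return [(time, S(time))] at each distinct observed event time.

    events[i] is 1 if the event was observed, 0 if the subject was censored.
    """
    survival = 1.0
    curve = []
    for t in sorted({d for d, e in zip(durations, events) if e == 1}):
        # Everyone whose follow-up reaches t is at risk, censored or not
        at_risk = sum(1 for d in durations if d >= t)
        died = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        survival *= 1 - died / at_risk
        curve.append((t, survival))
    return curve

# Hypothetical toy data: five subjects, two censored (at months 3 and 8)
durations = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"S({t}) = {s:.2f}")
```

With the censored subjects kept, the estimate steps through 0.80, 0.60 and 0.30. Discarding them and rerunning on the three observed events alone would give roughly 0.67, 0.33 and 0.00, which understates survival.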
Key Takeaways
Summary: Survival Analysis
- Censored observations are not missing — they carry information (survived at least until censoring)
- Kaplan-Meier is a nonparametric estimator of the survival function
- Log-rank test compares survival curves between groups
- Cox proportional hazards model estimates hazard ratios adjusting for covariates
- A hazard ratio below 1 means the treatment reduces the hazard; above 1 means it increases the hazard; exactly 1 means no effect
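As a worked illustration of the hazard ratio, the simulated trial above has a known true value. Assuming exponential survival in both arms (control median 12 months, treatment median 20 months), each hazard is ln(2) divided by the median, so the ratio reduces to 12/20. The variable names below are illustrative.

```python
import math

# Assumption: exponential survival in both arms, as in the simulation
rate_control = math.log(2) / 12     # hazard per month, control arm
rate_treatment = math.log(2) / 20   # hazard per month, treatment arm

hazard_ratio = rate_treatment / rate_control   # equals 12 / 20
beta = math.log(hazard_ratio)                  # log hazard ratio, the scale of a Cox coefficient

print(f"true hazard ratio = {hazard_ratio:.2f}")
print(f"log hazard ratio  = {beta:.3f}")
print(f"hazard reduction  = {1 - hazard_ratio:.0%}")
```

A fitted Cox model reports its coefficient on the log scale, so exponentiating the treatment coefficient gives an estimate that can be compared with this true value of 0.6.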