Survival Analysis

Foundations of Statistics

Analyzing Time-to-Event Data With Censoring

Survival analysis handles the unique challenge of censored observations — subjects who haven't experienced the event by study end. From Kaplan-Meier curves to Cox models, these methods extract maximum information from incomplete time-to-event data.

Clinical Trials — Estimate patient survival probabilities with censored follow-up
Reliability Engineering — Predict time-to-failure for mechanical components
Customer Analytics — Model time until churn with subscription cancellations

The event will happen — survival analysis tells us when, even when we don't wait long enough to see it.

Survival analysis models the time until an event occurs (death, failure, relapse). It accounts for censored data: subjects whose event time is only known to exceed some value, for example because they had not experienced the event by the end of the study.

Definition: Survival Function

The probability of surviving past time t: S(t) = P(T > t), where T is the time to event.

Key functions:

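$S(t) = P(T > t)$: survival function (probability of surviving past time $t$)
$h(t) = \lim_{\Delta t \to 0} P(t \le T < t + \Delta t \mid T \ge t) / \Delta t$: hazard rate (instantaneous event rate at time $t$, given survival up to $t$)
$H(t) = \int_0^t h(u)\,du = -\log S(t)$: cumulative hazard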
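To make the link between these three functions concrete, here is a minimal numerical sketch. It assumes exponentially distributed event times with a median of 12 months (an illustrative choice that matches the control group simulated later in this section); under that assumption the hazard is constant, so the cumulative hazard grows linearly in time.

```python
import numpy as np

# Assumed example: exponential event times with a median of 12 months
median = 12.0
lam = np.log(2) / median          # constant hazard rate

t = np.linspace(0.0, 36.0, 361)   # time grid in months
S = np.exp(-lam * t)              # survival function S(t)
H = -np.log(S)                    # cumulative hazard H(t) = -log S(t)
h = np.gradient(H, t)             # hazard as the numerical slope of H(t)

# For the exponential distribution, H(t) = lam * t and h(t) = lam
print(np.allclose(H, lam * t))
print(np.allclose(h, lam))
print(np.exp(-lam * median))      # survival probability at the median
```

For other distributions the hazard varies with time, but the same identities between $S(t)$, $h(t)$, and $H(t)$ hold.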

Survival Function

S(t) = P(T > t)

Here,

$S(t)$ = survival probability at time $t$
$T$ = time to event

import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test
import matplotlib.pyplot as plt

np.random.seed(42)
n = 200

# Simulate clinical trial: two treatment groups
# Group A (control): exponential survival, median = 12 months
# Group B (treatment): longer survival, median = 20 months
group = np.random.choice([0, 1], n)
true_median = np.where(group == 0, 12, 20)
duration = np.random.exponential(true_median/np.log(2))
censored_at = 24  # study ends at 24 months
observed = duration <= censored_at
duration_obs = np.minimum(duration, censored_at)

df = pd.DataFrame({
    'duration': duration_obs,
    'event': observed.astype(int),
    'group': np.where(group==0, 'Control', 'Treatment'),
    'age': np.random.uniform(40, 70, n),
    'stage': np.random.choice([1,2,3], n, p=[0.3,0.5,0.2])
})

# Kaplan-Meier estimator
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

kmf = KaplanMeierFitter()
for grp, color in [('Control','red'), ('Treatment','blue')]:
    mask = df['group'] == grp
    kmf.fit(df[mask]['duration'], df[mask]['event'], label=grp)
    kmf.plot_survival_function(ax=axes[0], color=color)

axes[0].set_title('Kaplan-Meier Survival Curves')
axes[0].set_xlabel('Time (months)')
axes[0].set_ylabel('Survival Probability')

# Log-rank test
ctrl = df[df['group']=='Control']
trt  = df[df['group']=='Treatment']
result = logrank_test(ctrl['duration'], trt['duration'],
                       ctrl['event'], trt['event'])
axes[0].text(0.05, 0.1, f'Log-rank p = {result.p_value:.4f}',
             transform=axes[0].transAxes)

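# Cox proportional hazards model
# The string 'group' column is encoded as a 0/1 indicator,
# because the fitter expects numeric covariates
cox_df = df[['duration', 'event', 'age', 'stage']].copy()
cox_df['treatment'] = (df['group'] == 'Treatment').astype(int)
cph = CoxPHFitter()
cph.fit(cox_df, duration_col='duration', event_col='event')
cph.print_summary()
# hazard_ratios=True plots exp(coef) instead of the log hazard ratios
cph.plot(ax=axes[1], hazard_ratios=True)
axes[1].set_title('Cox Model: Hazard Ratios')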

plt.tight_layout()
plt.savefig('survival_analysis.png', dpi=150)
plt.show()

# Median survival times
for grp in ['Control','Treatment']:
    mask = df['group'] == grp
    kmf.fit(df[mask]['duration'], df[mask]['event'])
    median = kmf.median_survival_time_
    print(f"{grp}: median survival = {median:.1f} months")

Censored Data

Censored observations are not missing — they carry information. A censored subject survived at least until the censoring time.
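The following sketch illustrates how censored subjects contribute. It hand-computes the Kaplan-Meier product-limit estimate on a hypothetical six-subject dataset (the values and the helper name `kaplan_meier` are made up for illustration), then repeats the calculation after dropping the censored subjects. It is a teaching sketch, not a replacement for a library estimator.

```python
import numpy as np

def kaplan_meier(durations, events):
    """Return distinct event times and the Kaplan-Meier estimate of S(t) at each."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    times = np.unique(durations[events])       # distinct observed event times
    surv, s = [], 1.0
    for t in times:
        at_risk = np.sum(durations >= t)       # under observation just before t
        died = np.sum((durations == t) & events)
        s *= 1.0 - died / at_risk              # product-limit update
        surv.append(s)
    return times, np.array(surv)

# Hypothetical toy data: event = 0 marks a censored observation
durations = [2, 3, 3, 5, 8, 8]
events    = [1, 0, 1, 1, 0, 1]

times, surv = kaplan_meier(durations, events)
for t, s in zip(times, surv):
    print(f"t = {t:.0f}: S(t) = {s:.3f}")

# Naive alternative: discard censored subjects entirely
t_naive, s_naive = kaplan_meier(
    [d for d, e in zip(durations, events) if e],
    [1] * sum(events),
)
for t, s in zip(t_naive, s_naive):
    print(f"t = {t:.0f}: naive S(t) = {s:.3f}")
```

In this toy dataset the naive estimate is lower at every event time, because the censored subjects are removed from the at-risk counts even though they were observed to be event-free up to their censoring times.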

Key Takeaways

Summary: Survival Analysis

Censored observations are not missing — they carry information (survived at least until censoring)
Kaplan-Meier is a nonparametric estimator of the survival function
Log-rank test compares survival curves between groups
Cox proportional hazards model estimates hazard ratios adjusting for covariates
A hazard ratio below 1 means the treatment reduces the hazard; above 1 means it increases the hazard; equal to 1 means no difference
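As a rough check on how to read a hazard ratio, the sketch below computes the true ratio implied by the simulation in this section, where both groups have exponential survival with medians of 12 and 20 months. It relies on the exponential relationship hazard = log(2) / median, so it does not carry over to data with non-constant hazards.

```python
import numpy as np

# Medians taken from the simulation in this section
median_control, median_treatment = 12.0, 20.0

# Exponential model: constant hazard = log(2) / median
hazard_control = np.log(2) / median_control
hazard_treatment = np.log(2) / median_treatment

hazard_ratio = hazard_treatment / hazard_control   # treatment vs. control
log_hr = np.log(hazard_ratio)                      # scale of a Cox coefficient

print(f"hazard ratio = {hazard_ratio:.2f}")
print(f"log hazard ratio = {log_hr:.3f}")
```

A fitted Cox model reports coefficients on the log scale, so the coefficient for the treatment indicator should land near this log hazard ratio, up to sampling noise.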
