4. Kaplan-Meier estimator
• The Kaplan–Meier approach, also called the product-limit approach, is a very
popular method that re-estimates the survival probability at each time an event occurs.
• This estimator is the product over the failure times of the conditional probabilities of
surviving to the next failure time.
• The Kaplan–Meier approach relies on several assumptions. Specifically, we assume
that (1) censoring is independent of the probability of developing the event of
interest, and (2) survival probabilities are comparable in participants who are
recruited earlier and later into the study. (3) When comparing several groups, these
assumptions must also hold in each comparison group so that, for example, censoring
is not more likely in one group than in another.
5. We first define the notation used in deriving the Kaplan-Meier estimator.
• Let t0 < t1 < t2 < … < tk represent the observed failure times as well as the censoring times in a
sample of size n = n0 (where n0 is the number of participants at baseline).
• dj : the number of failures at tj.
• cj : the number of censored cases during the interval [tj, tj+1).
• nj : the number of individuals at risk just prior to tj, computed as
$$n_j = n_{j-1} - (d_{j-1} + c_{j-1}).$$
• The probability of surviving the jth interval is estimated as
$$\hat{p}_j = \frac{n_j - d_j}{n_j} = 1 - \frac{d_j}{n_j}.$$
• The probability of surviving up to tj is the product of the probabilities of surviving all the
intervals up to and including the jth interval.
• The survivor function is then estimated by
$$\hat{S}_t = \prod_{j \,:\, t_j \le t} \frac{n_j - d_j}{n_j} = \hat{S}_{t-1} \times \frac{n_t - d_t}{n_t},$$
where $\hat{S}_{t-1}$ denotes the estimate at the preceding failure time.
6. Kaplan-Meier estimator - Example-1
Consider a small prospective cohort study designed to study time to death. The study involves participants
aged 65 years or older who are followed for up to 24 years. Twenty participants are followed until they
die, until the study ends, or until they drop out of the study. The data obtained from the study are presented
in the following table (a + after the year indicates that the subject was censored at that time).
1. Derive the Kaplan-Meier estimate of the survivor function at the failure times, and compute the
standard errors of those estimates.
2. Draw the Kaplan-Meier survivor function curve. Based on the curve, find: (a) the probability that a
participant survives past 10 years, (b) the number of years at which the survival probability first
falls to 75%, and (c) the estimate of the median survival time.
Participant ID:                1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  20
Year of death / last contact:  24+ 3   11+ 19+ 24  13  14  2+  18  17+ 24  21+ 12  1   10+ 23  6+  5   9+  17
8. Kaplan-Meier estimator – Example-1
The data are first re-organized in ascending order, with the censored cases marked, to make it easier
to construct the Kaplan-Meier table and obtain the survival estimates. This ordering can easily be
obtained using R.
Years to death: 1, 2+, 3, 5, 6+, 9+, 10+, 11+, 12, 13, 14, 17+, 17, 18, 19+, 21+, 23, 24+, 24, 24
(+ indicates censored cases)
9. Kaplan-Meier estimator – Example-1
Note that the Kaplan-Meier table displays computations only at the failure times (not at the
censored times), because the survival estimates stay constant at censored times.
Columns: tj = time; nj = number alive (at risk) just prior to tj; dj = number of deaths (failures) at tj;
cj = number censored during the interval [tj, tj+1); pj = (nj − dj)/nj = proportion surviving at tj;
Sj = pj × Sj−1 = survival probability estimate up to tj.

tj     nj   dj   cj   pj      Sj      (1-pj)/(nj*pj)   Σ(1-pj)/(nj*pj)   SE(Sj)
0      20   0    0    1.000   1.000
1      20   1    1    0.950   0.950   0.003            0.003             0.0487
3      18   1    0    0.944   0.897   0.003            0.006             0.0689
5      17   1    4    0.941   0.844   0.004            0.010             0.0826
12     12   1    0    0.917   0.774   0.008            0.017             0.1014
13     11   1    0    0.909   0.704   0.009            0.026             0.1140
14     10   1    0    0.900   0.633   0.011            0.037             0.1224
17     9    1    1    0.889   0.563   0.014            0.051             0.1274
18     7    1    2    0.857   0.483   0.024            0.075             0.1322
23     4    1    0    0.750   0.362   0.083            0.158             0.1440
24     3    2    1    0.333   0.121   0.667            0.825             0.1096
Total       11   9

where $n_j = n_{j-1} - (d_{j-1} + c_{j-1})$ and $\hat{p}_j = 1 - \frac{d_j}{n_j}$.
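The table can be reproduced programmatically. The following is a minimal Python sketch (the slides themselves use R), assuming that a subject censored at the same time as a death is still counted as at risk at that time; the function name `kaplan_meier` and the `(year, event)` encoding are illustrative choices, not part of the original material.

```python
from collections import Counter

# Example-1 data as (year, event) pairs in participant order;
# event = 1 is a death, event = 0 is a censored time (marked "+").
data = [(24, 0), (3, 1), (11, 0), (19, 0), (24, 1), (13, 1), (14, 1), (2, 0),
        (18, 1), (17, 0), (24, 1), (21, 0), (12, 1), (1, 1), (10, 0), (23, 1),
        (6, 0), (5, 1), (9, 0), (17, 1)]

def kaplan_meier(data):
    """Return rows (t_j, n_j, d_j, p_j, S_j), one per distinct failure time."""
    deaths = Counter(t for t, e in data if e == 1)
    rows, surv = [], 1.0
    for t in sorted(deaths):
        # At risk just prior to t: everyone whose death or censoring time is >= t.
        n = sum(1 for u, _ in data if u >= t)
        d = deaths[t]
        p = (n - d) / n          # conditional probability of surviving t
        surv *= p                # product-limit update: S_j = p_j * S_{j-1}
        rows.append((t, n, d, p, surv))
    return rows

km_rows = kaplan_meier(data)
for t, n, d, p, s in km_rows:
    print(f"{t:>2}  n={n:>2}  d={d}  p={p:.3f}  S={s:.3f}")
```

Each printed row should agree with the nj, dj, pj and Sj columns of the table up to rounding.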
10. Kaplan-Meier estimator – Example-1 (Cont.)
• The K-M curve is drawn as a step function.
• A + on the curve indicates where censoring occurred.
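The three quantities asked for in part 2 of Example 1 can be read off the step function. The Python sketch below does this from the (tj, Sj) values of the Kaplan-Meier table; it assumes that part (b) means the first time at which the curve reaches 0.75 or lower, and the helper names are hypothetical.

```python
# (t_j, S_j) pairs copied from the Example-1 Kaplan-Meier table.
km_steps = [(1, 0.950), (3, 0.897), (5, 0.844), (12, 0.774), (13, 0.704),
            (14, 0.633), (17, 0.563), (18, 0.483), (23, 0.362), (24, 0.121)]

def survival_at(t, steps):
    """Step-function value S(t): the estimate at the last failure time <= t."""
    s = 1.0
    for tj, sj in steps:
        if tj <= t:
            s = sj
    return s

def first_time_at_or_below(level, steps):
    """Earliest failure time at which S(t) is `level` or lower (None if never)."""
    for tj, sj in steps:
        if sj <= level:
            return tj
    return None

s10 = survival_at(10, km_steps)                  # (a) P(survive past 10 years)
t75 = first_time_at_or_below(0.75, km_steps)     # (b) curve first reaches 75%
median = first_time_at_or_below(0.50, km_steps)  # (c) median survival time

print(s10, t75, median)  # prints 0.844 13 18
```

The curve is flat between failure times, so S(10) equals the estimate at the last failure before year 10, which is year 5.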
11. Kaplan-Meier estimator Standard Error (SE) Estimates
▪ A popular formula for estimating the standard error of the survival estimates is
Greenwood's formula:
$$\widehat{Var}(\hat{S}_t) = \hat{S}_t^{2} \sum_{j \le t} \frac{1 - \hat{p}_j}{n_j \hat{p}_j},$$
where the sum is computed cumulatively across all failure times up to and including
the time point of interest. The standard error is the square root of this variance.
▪ Also, a 100(1 − α)% CI for $S_t$ is obtained as $\hat{S}_t \pm Z_{\alpha/2}\, SE(\hat{S}_t)$.
▪ Unfortunately, confidence intervals computed from the above variance may
extend above one or below zero. A more satisfactory approach is to compute the
confidence interval on the log-log transformed scale (available, for example, in R).
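Greenwood's formula and the two kinds of interval can be sketched in Python as follows. The (tj, nj, dj) counts are those of Example 1; the value 1.96 for the 95% normal quantile and the particular log-log interval form (the survival estimate raised to the power exp(∓w), with w the half-width on the log(−log) scale) are assumptions of this sketch rather than something stated on the slides.

```python
import math

# (t_j, n_j, d_j) at each failure time, from the Example-1 Kaplan-Meier table.
counts = [(1, 20, 1), (3, 18, 1), (5, 17, 1), (12, 12, 1), (13, 11, 1),
          (14, 10, 1), (17, 9, 1), (18, 7, 1), (23, 4, 1), (24, 3, 2)]
Z = 1.96  # approximate standard normal quantile for a 95% interval

def greenwood(counts, z=Z):
    """Return (t, S, SE, plain_ci, loglog_ci) for every failure time."""
    out, surv, cum = [], 1.0, 0.0
    for t, n, d in counts:
        p = (n - d) / n
        surv *= p
        cum += (1 - p) / (n * p)          # cumulative Greenwood sum
        se = surv * math.sqrt(cum)        # SE = S * sqrt(sum)
        plain = (surv - z * se, surv + z * se)   # can leave [0, 1]
        # Interval built on the log(-log S) scale and transformed back;
        # both limits stay strictly between 0 and 1.
        w = z * math.sqrt(cum) / abs(math.log(surv))
        loglog = (surv ** math.exp(w), surv ** math.exp(-w))
        out.append((t, surv, se, plain, loglog))
    return out

results = greenwood(counts)
for t, s, se, plain, loglog in results:
    print(f"{t:>2}  S={s:.3f}  SE={se:.4f}  "
          f"plain=({plain[0]:.3f}, {plain[1]:.3f})  "
          f"log-log=({loglog[0]:.3f}, {loglog[1]:.3f})")
```

At the last failure time the plain interval has a negative lower limit, which illustrates why the transformed interval is preferred.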
13. Kaplan-Meier estimator – Example-1(Cont.)
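The last three columns of the Example-1 Kaplan-Meier table (the term (1-pj)/(nj*pj), its
cumulative sum, and SE(Sj)) are obtained from Greenwood's formula:
$$\widehat{Var}(\hat{S}_t) = \hat{S}_t^{2} \sum_{j \le t} \frac{1 - \hat{p}_j}{n_j \hat{p}_j}$$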
17. Kaplan-Meier estimator - Example-2
The table shows the results of a clinical trial of a treatment (the drug
6-mercaptopurine, or 6-MP) versus a placebo in 42 children with acute leukemia.
Patients were followed until their leukemia returned (relapse, i.e. going out of
remission) or until the end of the study.
1. Use the Kaplan-Meier method to estimate the survival function for each group.
2. Draw the survival curve for each group. Comment on the results.
18. Kaplan-Meier estimator - Example-2 (Cont.)
Group = Treatment
Columns: tj = time; nj = number alive (at risk) just prior to tj; dj = number of deaths (failures) at tj;
cj = number censored during the interval [tj, tj+1); pj = (nj − dj)/nj = proportion surviving at tj;
Sj = pj × Sj−1 = survival probability estimate up to tj.

tj     nj   dj   cj   pj       Sj       (1-pj)/(nj*pj)   Σ(1-pj)/(nj*pj)   SE(Sj)
0      21   0    0    1.0000   1.0000
6      21   3    1    0.8571   0.8571   0.008            0.008             0.0764
7      17   1    1    0.9412   0.8067   0.004            0.012             0.0869
10     15   1    2    0.9333   0.7529   0.005            0.016             0.0963
13     12   1    0    0.9167   0.6902   0.008            0.024             0.1068
16     11   1    3    0.9091   0.6275   0.009            0.033             0.1141
22     7    1    0    0.8571   0.5378   0.024            0.057             0.1282
23     6    1    5    0.8333   0.4482   0.033            0.090             0.1346
Total       9    12
19. Kaplan-Meier estimator - Example-2 (Cont.)
Group = Placebo
Columns: tj = time; nj = number alive (at risk) just prior to tj; dj = number of deaths (failures) at tj;
cj = number censored during the interval [tj, tj+1); pj = (nj − dj)/nj = proportion surviving at tj;
Sj = pj × Sj−1 = survival probability estimate up to tj.

tj     nj   dj   cj   pj       Sj       (1-pj)/(nj*pj)   Σ(1-pj)/(nj*pj)   SE(Sj)
0      21   0    0    1.0000   1.0000
1      21   2    0    0.9048   0.9048   0.005            0.005             0.0641
2      19   2    1    0.8947   0.8095   0.006            0.011             0.0857
4      16   1    1    0.9375   0.7589   0.004            0.015             0.0941
5      14   1    1    0.9286   0.7047   0.005            0.021             0.1018
8      12   3    1    0.7500   0.5285   0.028            0.049             0.1166
11     8    1    1    0.8750   0.4625   0.018            0.067             0.1193
12     6    1    1    0.8333   0.3854   0.033            0.100             0.1218
15     4    1    0    0.7500   0.2890   0.083            0.183             0.1237
17     3    1    0    0.6667   0.1927   0.167            0.350             0.1140
22     2    1    1    0.5000   0.0963   0.500            0.850             0.0888
Total       14   7
21. Kaplan-Meier estimator - Example-2 (Cont.)
Interpretation: Based on the figure, the survival probabilities for the treatment
group are higher than those for the placebo group at every time point. That is, the
placebo group goes out of remission faster than the treatment group. The KM curves
indicate that 6-MP (treatment) maintains longer remissions in acute leukemia than
the placebo.
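One way to put a number on this comparison is the median remission time in each group. The Python sketch below rebuilds both curves from the (tj, nj, dj) counts of the two Kaplan-Meier tables and takes the median as the first time the estimate is 0.5 or lower; the function names are hypothetical and the median is an added summary, not one reported on the slides.

```python
# (t_j, n_j, d_j) rows copied from the Example-2 Kaplan-Meier tables.
treatment = [(6, 21, 3), (7, 17, 1), (10, 15, 1), (13, 12, 1),
             (16, 11, 1), (22, 7, 1), (23, 6, 1)]
placebo = [(1, 21, 2), (2, 19, 2), (4, 16, 1), (5, 14, 1), (8, 12, 3),
           (11, 8, 1), (12, 6, 1), (15, 4, 1), (17, 3, 1), (22, 2, 1)]

def km_curve(counts):
    """Product-limit estimate S(t_j) at each failure time."""
    surv, curve = 1.0, []
    for t, n, d in counts:
        surv *= (n - d) / n
        curve.append((t, surv))
    return curve

def median_time(curve):
    """First failure time with S(t) <= 0.5, or None if the curve stays above 0.5."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

trt_curve = km_curve(treatment)
pbo_curve = km_curve(placebo)
print("median, treatment:", median_time(trt_curve))  # prints 23
print("median, placebo:  ", median_time(pbo_curve))  # prints 11
```

The treatment curve only falls below 0.5 at its last failure time, while the placebo curve does so much earlier, which matches the interpretation of the figure.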
23. Comparing survival functions: Log-rank test
▪ One of the important goals of survival analysis is to assess whether there are
differences in survival among different groups (two or more) of participants.
For example:
(1) In a clinical trial with a survival outcome, we are interested in comparing
survival between participants receiving a new drug as compared to a placebo.
(2) In an observational study, we might be interested in comparing survival
between men and women or between participants with and without a
particular risk factor (e.g., hypertension or diabetes).
▪ Many tests are available; the log-rank test is the most popular one for testing the
null hypothesis of no difference in survival between two or more independent
groups.
24. Log-rank test
Test hypotheses:
H0: The survival functions of the two (or more, say r) independent groups are identical
(S1t = S2t at all times t)
H1: The survival functions of the two (or more, say r) independent groups are not identical
(S1t ≠ S2t for at least one time t)
Test statistic
▪ There are many versions of this test. The one presented here is based on a χ² test
statistic and is known as the Cox-Mantel log-rank test. The test statistic is derived by
comparing the observed numbers of events to the expected numbers at each time
point over the follow-up period.
▪ First, the Kaplan-Meier (K-M) table is constructed for each group. For the data
from example 2 above, the K-M tables for the treatment and placebo groups are used.
25. Log-rank test (test statistic computation)
1. For each group, at each event time, compute the number at risk and the observed number of
failure events. These can be extracted from the Kaplan-Meier table for each group.
2. Rank the survival times (event times) for the combined data (over all groups).
3. Under H0, the log-rank statistic follows a χ² distribution with df = r − 1 and is computed as
follows:
$$\chi^2 = \sum_{i=1}^{r} \frac{\left(\sum_t O_{it} - \sum_t E_{it}\right)^2}{\sum_t E_{it}} \qquad (1)$$
$\sum_t O_{it}$ : the sum of the observed number of events in the ith group over time.
$\sum_t E_{it}$ : the sum of the expected number of events in the ith group over time, where
$$E_{it} = N_{it} \times \frac{O_t}{N_t}, \quad i = 1, 2, \ldots, r$$
$N_{it}$ : number of subjects at risk in group i at time point t.
$N_t$ : total number of subjects at risk (over all groups) at time point t ($N_t = N_{1t} + N_{2t} + \cdots + N_{rt}$).
$O_t$ : total number of observed events (over all groups) at time point t ($O_t = O_{1t} + O_{2t} + \cdots + O_{rt}$).
26. Log-rank test (example-2)
1- For each group, at each event time, compute the number at risk and the observed number of events.
Then rank the survival times (event times) for the combined data (over all groups).

Per-group counts (t = time to event, N = number at risk, O = number of observed events). These
should be extracted from the Kaplan-Meier table of each group.

Treatment group          Placebo group
t      N1   O1           t      N2   O2
6      21   3            1      21   2
7      17   1            2      19   2
10     15   1            4      16   1
13     12   1            5      14   1
16     11   1            8      12   3
22     7    1            11     8    1
23     6    1            12     6    1
                         15     4    1
                         17     3    1
                         22     2    1
Total       9            Total       14

Combined table, ranked by event time over both groups:

t      N1   O1   N2   O2
1      21   0    21   2
2      21   0    19   2
4      21   0    16   1
5      21   0    14   1
6      21   3    14   0
7      17   1    14   0
8      17   0    12   3
10     15   1    12   0
11     15   0    8    1
12     15   0    6    1
13     12   1    6    0
15     12   0    4    1
16     11   1    4    0
17     11   0    3    1
22     7    1    2    1
23     6    1    2    0
Total       9         14
27. Log-rank test (example-2)(cont.)
2- In the combined table, add the total number at risk and the total number of observed events at each
time (from the previous table), then compute $E_{1t}$ and $E_{2t}$ over time, and finally apply formula (1).

t      N1   O1   N2   O2   N = N1+N2   O = O1+O2   E1 = N1*(O/N)   E2 = N2*(O/N)
1      21   0    21   2    42          2           1.000           1.000
2      21   0    19   2    40          2           1.050           0.950
4      21   0    16   1    37          1           0.568           0.432
5      21   0    14   1    35          1           0.600           0.400
6      21   3    14   0    35          3           1.800           1.200
7      17   1    14   0    31          1           0.548           0.452
8      17   0    12   3    29          3           1.759           1.241
10     15   1    12   0    27          1           0.556           0.444
11     15   0    8    1    23          1           0.652           0.348
12     15   0    6    1    21          1           0.714           0.286
13     12   1    6    0    18          1           0.667           0.333
15     12   0    4    1    16          1           0.750           0.250
16     11   1    4    0    15          1           0.733           0.267
17     11   0    3    1    14          1           0.786           0.214
22     7    1    2    1    9           2           1.556           0.444
23     6    1    2    0    8           1           0.750           0.250
Total       9         14                           14.488          8.512

Applying the formula
$$\chi^2 = \sum_{i=1}^{r} \frac{\left(\sum_t O_{it} - \sum_t E_{it}\right)^2}{\sum_t E_{it}}$$
with r = 2 gives
$$\chi^2 = \frac{(9 - 14.488)^2}{14.488} + \frac{(14 - 8.512)^2}{8.512} = 5.617$$
Since the p-value 0.018 < 0.05 (equivalently, 5.617 > 3.841, the 5% critical value of χ² with 1 df),
we reject the null hypothesis that the two survival functions are identical.
The p-value and the critical value are obtained in R as follows:
    1 - pchisq(5.617, 1)
    [1] 0.01778707
    qchisq(0.95, 1)
    [1] 3.841459
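The computation on this slide can be checked with a short Python sketch. It uses the (t, N1, O1, N2, O2) counts exactly as tabulated in the combined table and obtains the 1-df p-value from the identity P(χ² > x) = erfc(√(x/2)); the function name `logrank_formula_1` is an illustrative choice.

```python
import math

# Rows (t, N1, O1, N2, O2) from the combined table: 1 = treatment, 2 = placebo.
table = [
    (1, 21, 0, 21, 2), (2, 21, 0, 19, 2), (4, 21, 0, 16, 1), (5, 21, 0, 14, 1),
    (6, 21, 3, 14, 0), (7, 17, 1, 14, 0), (8, 17, 0, 12, 3), (10, 15, 1, 12, 0),
    (11, 15, 0, 8, 1), (12, 15, 0, 6, 1), (13, 12, 1, 6, 0), (15, 12, 0, 4, 1),
    (16, 11, 1, 4, 0), (17, 11, 0, 3, 1), (22, 7, 1, 2, 1), (23, 6, 1, 2, 0),
]

def logrank_formula_1(table):
    """Observed totals, expected totals and the chi-square of formula (1)."""
    obs1 = sum(o1 for _, _, o1, _, _ in table)
    obs2 = sum(o2 for _, _, _, _, o2 in table)
    # E_it = N_it * O_t / N_t, summed over the event times.
    exp1 = sum(n1 * (o1 + o2) / (n1 + n2) for _, n1, o1, n2, o2 in table)
    exp2 = sum(n2 * (o1 + o2) / (n1 + n2) for _, n1, o1, n2, o2 in table)
    chi2 = (obs1 - exp1) ** 2 / exp1 + (obs2 - exp2) ** 2 / exp2
    return obs1, exp1, obs2, exp2, chi2

obs1, exp1, obs2, exp2, chi2 = logrank_formula_1(table)
p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df

print(f"O1={obs1}, E1={exp1:.3f}, O2={obs2}, E2={exp2:.3f}")
print(f"chi2={chi2:.3f}, p={p_value:.4f}")
```

The erfc shortcut only applies with two groups (1 df); for more groups a chi-square tail function with r − 1 degrees of freedom would be needed.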
28. Log-rank test (example-2)(R-output)
The log-rank test statistic (in the case of two groups) used by R and many other
software packages such as Stata is:
$$\chi^2 = \frac{(O_1 - E_1)^2}{Var(O_1 - E_1)} = \frac{\left[\sum_t (O_{1t} - E_{1t})\right]^2}{V} \qquad (2)$$
This statistic is distributed as χ² with 1 df under H0, where:
$$V = \sum_t \frac{N_{1t} N_{2t} O_t (N_t - O_t)}{N_t^{2} (N_t - 1)}.$$
Formula (2) is the original one, while (1) is an approximate version. Also, the statistic
from formula (1) is slightly smaller than the log-rank statistic from formula (2).
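To see how formula (2) compares with formula (1) on the example-2 data, the following Python sketch evaluates the variance-based statistic from the combined-table counts as tabulated on the slides. It is a hand computation written for illustration, not the output of R or Stata, and the function name is hypothetical.

```python
# Rows (t, N1, O1, N2, O2) from the combined table: 1 = treatment, 2 = placebo.
table = [
    (1, 21, 0, 21, 2), (2, 21, 0, 19, 2), (4, 21, 0, 16, 1), (5, 21, 0, 14, 1),
    (6, 21, 3, 14, 0), (7, 17, 1, 14, 0), (8, 17, 0, 12, 3), (10, 15, 1, 12, 0),
    (11, 15, 0, 8, 1), (12, 15, 0, 6, 1), (13, 12, 1, 6, 0), (15, 12, 0, 4, 1),
    (16, 11, 1, 4, 0), (17, 11, 0, 3, 1), (22, 7, 1, 2, 1), (23, 6, 1, 2, 0),
]

def logrank_formula_2(table):
    """Chi-square of formula (2): [sum_t (O1t - E1t)]^2 / V."""
    diff, var = 0.0, 0.0
    for _, n1, o1, n2, o2 in table:
        n, o = n1 + n2, o1 + o2
        diff += o1 - n1 * o / n                      # O1t - E1t
        if n > 1:                                    # variance term needs N_t > 1
            var += n1 * n2 * o * (n - o) / (n ** 2 * (n - 1))
    return diff ** 2 / var

chi2_v = logrank_formula_2(table)
print(f"chi2 (formula 2) = {chi2_v:.3f}")
```

On these counts the result comes out a little above the 5.617 obtained from formula (1), in line with the remark that formula (1) gives the slightly smaller value.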
30. Comparing survival functions: Hazard Ratio (HR)
▪ To compare two independent groups (exposed vs. unexposed or treatment vs. control), the
hazard ratio (HR) is used. The hazard ratio is estimated from the data organized for the log-rank test.
▪ Specifically, the hazard ratio is the ratio of the observed-to-expected event ratios of the
two independent comparison groups:
$$HR = \frac{\sum_t O_{\text{Exposed},t} \,/\, \sum_t E_{\text{Exposed},t}}{\sum_t O_{\text{Unexposed},t} \,/\, \sum_t E_{\text{Unexposed},t}} = \frac{\sum_t O_{\text{Treatment},t} \,/\, \sum_t E_{\text{Treatment},t}}{\sum_t O_{\text{Control},t} \,/\, \sum_t E_{\text{Control},t}}$$
▪ In example 2, to compare the survival functions of the treatment and placebo groups, one can
use the hazard ratio as follows:
$$HR = \frac{\sum_t O_{\text{Placebo},t} \,/\, \sum_t E_{\text{Placebo},t}}{\sum_t O_{\text{Treatment},t} \,/\, \sum_t E_{\text{Treatment},t}} = \frac{14 / 8.512}{9 / 14.488} = 2.65$$
Thus, participants in the placebo group have 2.65 times the risk of going out
of remission (relapse) as compared to participants in the treatment group.
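The arithmetic can be checked in a few lines of Python, using the observed and expected totals from the log-rank table of example 2. The reciprocal, which expresses the same comparison with the treatment group in the numerator, is an added illustration.

```python
# Observed and expected event totals from the log-rank table of example 2.
obs_placebo, exp_placebo = 14, 8.512
obs_treatment, exp_treatment = 9, 14.488

# Ratio of the observed/expected ratios, placebo relative to treatment.
hr_placebo_vs_treatment = (obs_placebo / exp_placebo) / (obs_treatment / exp_treatment)
# The same comparison in the other direction is simply the reciprocal.
hr_treatment_vs_placebo = 1 / hr_placebo_vs_treatment

print(f"HR placebo vs treatment: {hr_placebo_vs_treatment:.2f}")  # prints 2.65
print(f"HR treatment vs placebo: {hr_treatment_vs_placebo:.2f}")  # prints 0.38
```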
32. Procedures for analyzing time-to-event
• Kaplan-Meier curves and log-rank tests are examples of univariate analysis. They
describe survival according to one factor under investigation, but ignore the
impact of any others.
• Additionally, Kaplan-Meier curves and log-rank tests are useful only when the
predictor variable is categorical (e.g. treatment A vs treatment B; males vs females).
They do not work easily for quantitative predictors such as gene expression, weight,
or age.
• The Cox PH model is the most commonly used technique for analyzing survival data; it
allows one to include multiple covariates simultaneously and to assess their effects.
33. Procedures for analyzing time-to-event
▪ When the research interest lies in modeling the relationship between the time to event and a
set of explanatory variables (or risk factors), regression models are needed.
▪ Two types of regression models can be used:
1. Semiparametric models: these do not require us to specify a parametric form for the
baseline hazard (defined as the hazard at time t for observations with all
predictors equal to zero).
2. Parametric models: a parametric distributional assumption is made about the
hazard function.
▪ One of the most popular models of the first type is the proportional
hazards (PH) model, or Cox regression:
$$h(t, \mathbf{X}) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)$$
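As a worked step under this model, the ratio of the hazards of two subjects can be written out directly. The covariate profiles $\mathbf{X}^{*}$ and $\mathbf{X}$ below are hypothetical and introduced only for illustration.

```latex
\[
\frac{h(t, \mathbf{X}^{*})}{h(t, \mathbf{X})}
= \frac{h_0(t)\,\exp\left(\sum_{k=1}^{p} \beta_k X_k^{*}\right)}
       {h_0(t)\,\exp\left(\sum_{k=1}^{p} \beta_k X_k\right)}
= \exp\left(\sum_{k=1}^{p} \beta_k \left(X_k^{*} - X_k\right)\right)
\]
```

The baseline hazard $h_0(t)$ cancels, so the ratio does not depend on t; this is the sense in which the hazards are proportional. For a single binary covariate (for example, 1 = placebo and 0 = treatment) the hazard ratio reduces to $\exp(\beta_1)$.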
34. Cox Regression Model
Example:
These data involve two groups of
leukemia patients, with 21
patients in each group. Group 1 is
the treatment group, and group 2
is the placebo group.
The data set also contains the
variable log WBC, which is a
well-known prognostic indicator
of survival for leukemia patients