# Survival Analysis more then just functions in R.

Hi All

I would like to move this blog in a new-ish direction of survival analysis. The main reasons for this are:

1. it is useful,
2. it is an area that ML has not got its dirty paws on,
3. It is a field that I did not study in my Ph.D.,
4. Wikipedia does not seem to do a good job of describing it.

Starting at the beginning, because that always seems like a good place to start, survival analysis is the study of the duration from an origin to an event. In concrete terms (and avoiding depressing experiments in involving cancer patients) think about phone call lengths. The origin time is the time the call is started, the event time is the time the call ends and the duration is the difference. In addition to calls that end there are calls that are censored out; a call is censored out if it is still ongoing at the time the data is collected. In such cases the duration is the longest time that it is know the call is still ongoing.

To summarize survival data it is common to use either a survival function or hazard function. If $f(t)$ is the pdf of the survival times and $F(t)$ is the cdf then $1- F(t) =\int_t ^{\infty} du f(u) = S(t)$ is the survival function. That is it the probability that a “phone call” will last at-least until time $t$. The hazard function is the instantaneous probability of failure (i.e., “hanging up” ) given that an observation has lasted until that time (i.e., given that I have been on hold at Air Canada for 45 mins what is the probability that they answer me now). This is calculated as the ratio of the pdf to the survival function; $h(t) = f(t)/S(t)$.

The survival function is commonly empirically estimated in two ways these are Kaplan–Meier and Nelson-Aalen. In a later post I plan to gives some details about another method that I have been developing. The remainder of this post discusses the Kaplan–Meier method.

Consider an ordered sample of $n$ observed durations ( $t_1$ to $t_n$ with $t_j \geq t_i$ IIF $j > i$), each duration is either to an event (“hanging up“) or to censoring (the phone call is ongoing at the time of the study). The estimate is $\hat{S}(t) = \prod_{j=1}^{k} (r_j-d_j)/r_j$ where $t \in [t_{k}, t_{k+1})$, $r_j$ is the number of observations at risk in time interval $[t_{j-1}, t_{j})$, and $d_j$ is the number of “hanging ups” at time $t_j$. That is, $\hat{S}(t)$ is a step function with a different value for each interval $[t_{i}, t_{i+1})$ in the observed data. This should be sort of intuitive as it is very similar to the empirical cdf estimate used in boot strapping.

The confidence bounds for the Kaplan–Meier estimate can be derived as follows: Starting with Greenwood’s formula ( $Var[\hat{S}(t)] = \hat{S}(t)^2 \sum_{j=1}^k d_j/[r_j (r_j - d_j)]$) the variance of $\log(- \log[\hat{S}(t)])$ is calculated. $\log(- \log[\hat{S}(t)])$ is the assumed to be Gaussian  (If you know how this is justified please let me know I have not found a formal proof yet) and the confidence bounds are found by transforming the  bounds for $\log(- \log[\hat{S}(t)])$ with $\exp(-\exp(x))$. The reason that Greenwood’s formula is not used directly is that the survival function must be bounded by 0 and 1 and adding a delta to $\hat{S}(t)$ may not be in in this interval.

Greenwood’s formula is essentially an application of the approximation $var[g(x)] \approx [(d/dx) g(x)]^2 var(x)$. Recall (really this is just up the page a few lines so recall is not necessary but I hate starting a sentence with an equation) $\hat{S}(t) = \prod_{j=1}^{k} (r_j-d_j)/r_j = \prod_{j=1}^{k} p_j$. Thus $var(log[\hat{S}(t)]) = \sum_{j=1}^{k} var[ \log(p_j)]$. The variance of $p_j$ is $p_j (1- p_j) /(r_j)$; this bold assertion is justified with the hand wavy response that $p_j$ has a binomial distribution. So moving full circle $var[ \log(p_j)] = [p_j^{-2}] p_j (1- p_j) /(r_j) = (1- p_j) /(p_j r_j)$, which can be simplified $(1- p_j) /(p_j r_j) = (1-[(r_j-d_j)/r_j]) /([(r_j-d_j)/r_j] r_j)= d_j/[r_j(r_j-d_j)]$. Using $var[g(x)] \approx [(d/dx) g(x)]^2 var(x)$ again (I really hope this approximation is good!) $var(log[\hat{S}(t)]) =\hat{S}(t)^{-2} var[\hat{S}(t)]$. Putting all these pieces together $\sum_{j=1}^{k} var[ \log(p_j)] = \sum_{j=1}^{k} d_j/[r_j(r_j-d_j)]=var(log[\hat{S}(t)]) = \hat{S}(t)^{-2} var[\hat{S}(t)]$ which can be solved for $var[\hat{S}(t)]$ giving Greenwood’s formula $Var[\hat{S}(t)] = \hat{S}(t)^2 \sum_{j=1}^k d_j/[r_j (r_j - d_j)]$.

To get the variance of $\log(- \log[\hat{S}(t)])$ once again $var[g(x)] \approx [(d/dx) g(x)]^2 var(x)$ is used.  This time $x = -\log( \hat{S}(t))$; $var[\log(- \log[\hat{S}(t)])] = \log[\hat{S}(t)]^2 (\sum_{j=1}^{k} d_j/[r_j(r_j-d_j)])$.

Now the completely unjustified assumption that $\log(- \log[\hat{S}(t)]) \sim$ Gaussian, a 95% confidence bound (yes I am not using the Bayesian terminology of central credibility interval because this whole process is frequentist) is then $\log(- \log[\hat{S}(t)]) \pm 1.96 se[\log(- \log[\hat{S}(t)])]$.

And finally the complementary log log space confidence bound is transformed back; $\exp( - \exp[ (\log(- \log[\hat{S}(t)]) \pm 1.96 se[\log(- \log[\hat{S}(t)])]) ]) = \hat{S}(t)^{\exp(\pm 1.96 se[\log(- \log[\hat{S}(t)])])}$

So before going on to some code that implements this I just want to point out that It does feel strange to me that the uncertainty of $\hat{S}(t)$ is not a beta distribution at each step. Again if anyone has any thoughts on this please let me know.

The code to plot a Kaplan–Meier estimate of a survival function is given below.  Note that in the r package survival has a faster implementation of this the process; it looks something like “plot(survfit(Surv(time=t, event=event)~1 ))”.  The plot generated by my code for the simulated data is shown in figure 1. Figure 1: shows a Kaplan–Meier estimated (blue) survival function with uncertainty (grey) and the true survival curve (red)

Cheers Tune in next week for more survival analysis (assuming you last that long)

t = sort(rexp(100))
r = runif(100)
event = rep(1,100)
event[r < 0.1] = 0
cen  = runif(100)
t[event==0] = t[event==0]*cen[event==0]
index = sort(t, index.return = T)\$ix

t=t[index]
event= event[index]

g.KM<-function(t, event) { k = length(t) ndn = rep(1,k+1) dnnd = rep(0,k+1) for ( i in  1:k)     {     n.i = length(t[t > t[i]]) # the number of observations at risk
d.i = event[i]            # the number that die at the ith obersved time
c.i = abs(1-event[i])

ndn[i+1]      = (n.i - d.i -c.i)/(n.i - c.i) # prob of failure at that time
dnnd[i+1]     = dnnd[i] + d.i/(n.i*(n.i-d.i)+0.00000001) # greenwood estimate

}

S = ndn
for (i in  2:(k+1))
{
S[i] = prod(ndn[1:i])
}
se = sqrt(dnnd)*S
UB  = S ^exp( 1.96*sqrt(dnnd)/((log(S))))
LB  = S ^exp( -1.96*sqrt(dnnd)/((log(S))))
df = data.frame(time =  c(0,t), Event = c(1,event), ndn, S, LB, UB)
print(df)

plot(c(0,t), seq(1, 0, length.out = k+1), type= "n", xlab = "time", ylab = "proportion alive" )
for (i in 2:(k+1))
{
segments(x0=t[i-1], x1=t[i], y0 = UB[i-1], y1 = UB[i-1], lwd=1, col="grey")
segments(x0=t[i],   x1=t[i], y0 = UB[i-1], y1 = UB[i], lwd=1, col="grey")

segments(x0=t[i-1], x1=t[i], y0 = LB[i-1], y1 = LB[i-1], lwd=1, col="grey")
segments(x0=t[i],   x1=t[i], y0 = LB[i-1], y1 = LB[i], lwd=1, col="grey")

segments(x0=t[i-1], x1=t[i], y0 = S[i-1], y1 = S[i-1], lwd=3, col="blue")
segments(x0=t[i],   x1=t[i], y0 = S[i-1], y1 = S[i], lwd=3, col="blue")
}
points(c(1, t)[c(F, event==0)], S[c(F, event==0)], pch=3)

t.frame = seq(0, 2*max(t), length.out =1001)
S.true  = 1-pexp(t.frame )
lines(t.frame, S.true, col="red", lwd=2)

}

g.KM(t, event)