Proportional Hazards (unpacking equation 12)

Hi All

This week I am looking at proportional hazards regression. In particular, trying to clarify the conditional likelihood formula.

Before diving into that please note that proportional hazards, also called Cox proportional hazards (CoxPH), was first presented in 1972 in “Regression models and life-tables” which has been cited ~ 46,000 times! So Dr Cox’s contribution with this one paper (he has a lot of other great papers as well) really is extraordinary.

That said I did not find it the clearest of papers in particular equation 12 took me some time to understand. Figure 1 shows an excerpt of the paper.

Screen Shot 2017-05-29 at 12.03.51 PM

Figure 1: an excerpt of Cox’s 1972 “regression models and life tables”

The point that Cox is getting at is, if the hazard function (\lambda(t) in his notation and h(t) in ours ) is completely unknown but the order of failures are know can inference still be conducted? The answer is of course yes. Cox proposes the conditional likelihood function; that is, the probability of the order of failures or to put it another way the likelihood conditioned on the observed times of failures. Thus Cox’s equation 12 is apparently the probability that the i’th phone call ended given that one phone call end at that time. This post explores his assertion.

So to start, in CoxPH the hazard function is not assumed to have any particular form however it is assumed that all phone calls have the same form of hazard function only differing in a parametrically defined proportion; i.e., h_{(i)}(t) = \exp({\bf z}_{(i)} {\bf m}) h_0(t) is the hazard function for the i’th largest observation.

The \exp({\bf z}_{(i)}{\bf m}) gives the relative hazard of the i’th observation relative to h_0(t) the base hazard. This is the “proportional” in proportional hazards.

Returning to equation 12 and translating it into the this blogs notation we get

Screen Shot 2017-05-29 at 3.09.11 PM.

This is the proportional term divided by the sum of the probational terms for the risk set [R(t)]. The risk set is the set of observation with event times greater that of the i’th observation. By multiplying the top and bottom of the fraction with the baseline hazard h_0(t_{(i)}) it becomes clear that this is the hazard of the ending phone call divided by the hazard of all phone call still active imitatively before time t_{(i)}.

Screen Shot 2017-05-29 at 4.43.42 PM

Intuitively, this should make sense the probability of phone call i ending given that exactly one of the phone calls ended is the hazard of the i’th phone call divided by the total hazard of all active phone calls. But sadly Intuitive is not really good enough. So continuing to please back some of these layers and using the definition of a hazard function:
Screen Shot 2017-05-29 at 4.41.38 PM.  This can be rewritten as
Screen Shot 2017-05-29 at 7.31.59 PM using Bayes rule [P(A|B) = P(A \cap B )/P(B) ].  Given the assumption that there is one event at time t_{(i)} (this assumption accounts deals for the conditional term) the probabilities in the fraction can be interpreted as
Screen Shot 2017-05-29 at 7.11.35 PM,  making the additional assumption that the event times are independent of each other the denominator can be interpreted as the probability that there is one event at time t_{(i)} this results in further simplification:
Screen Shot 2017-05-29 at 7.13.43 PM.  Finally using Bayes rule again the fraction can be written as single conditional probability
Screen Shot 2017-05-29 at 7.15.34 PM. So CoxPH is correct and 46,000 papers don’t need an errata! But as I said I would have like a little more expansion in the original paper.

Basically that is everything I wanted to get through, however, …, I like like to talk too much / over think things. In particular,  the assumption that there is only one event at each time. In the proof this leads to the probability of the an event happening at the observed being the sum of the probabilities; however, in situation where there could be more than one event at a given time (likely due to sloppy data collection) then that is clearly not true (Cox and the reviewers discuss this in the paper a bit). The probability if at-least one observations has event time t_{(i)} is then 1- \prod(1-P[\text{Obs. k has event time } t_{(i)}]). While it is difficult to make general observations about this probability it must be larger then the assumed probability. In addition, the numerator would then be not just the probability that observation i has event time t_{(i)}  but instead the probability that all the observations that end at t_{(i)} have event time t_{(i)}. This probability must smaller. So it does look like this would change the conditional likelihood a lot.

Tune in next time for exciting Bayesian Business intelligence (I promise next week there will actually be a Bayesian component to this)!



A short note on Nelson-Aalen

Hi All

Last time I made a post deriving the variance of the Kaplan Meier survival function. I had intended to also give the Nelson Aalen estimate as well buts its variance is much harder to get. This post just highlights the difficulties that I had.

The Nelson Aalen estimate is \hat{S}(t) = \prod_{j=1}^{k} \exp(-d_j/r_j). It feels a bit random to me, however David Collett points out that \exp(-x) = 1 - x + x^2/2 - x^3/6 ... so for x << 1 \exp(-x) \approx  1-x \hat{S}(t) = \prod_{j=1}^{k} \exp(-d_j/r_j) \approx \prod_{j=1}^{k} )(1-d_j/r_j)  \approx \prod_{j=1}^{k} )(r_j-d_j)/r_j which is the Kaplan Meier estimate.

Collett give the variance of the Nelson Aalen estimate as Var[\hat{S}(t)] = \hat{S}(t)^2 \sum_{j=1}^k (d_j/r_j^2). I have been unable to verify the formula. If the same logic as my perfidious post is applied the variance of var(\log[\hat{S}(t)]) can be found to be var(\log[\hat{S}(t)]) = \sum_{j=1}^{k} var[-d_j/r_j] and using everyone’s favorite identity (var[g(x)] \approx [(d/dx) g(x)]^2 var(x)) var(\log[\hat{S}(t)]) can also be found to be var(\log[\hat{S}(t)]) =\hat{S}(t)^{-2} var[\hat{S}(t)]. Putting to two sides together \hat{S}(t)^{-2} var[\hat{S}(t)]  = \sum_{j=1}^{k} var[-d_j/r_j] which is solved for var[\hat{S}(t)] to give var[\hat{S}(t)]  = \hat{S}(t)^{2} \sum_{j=1}^{k} var[-d_j/r_j]. So for Var[\hat{S}(t)] = \hat{S}(t)^2 \sum_{j=1}^k (d_j/r_j^2) to be true \sum_{j=1}^{k} var[-d_j/r_j] = \sum_{j=1}^k (d_j/r_j^2). I don’t see any clear reason why this is the case.

That is it for now. Tune in next time for hopefully some answers instead of just expositions on my inadequacies.

Survival Analysis more then just functions in R.

Hi All

I would like to move this blog in a new-ish direction of survival analysis. The main reasons for this are:

  1. it is useful,
  2. it is an area that ML has not got its dirty paws on,
  3. It is a field that I did not study in my Ph.D.,
  4. Wikipedia does not seem to do a good job of describing it.

Starting at the beginning, because that always seems like a good place to start, survival analysis is the study of the duration from an origin to an event. In concrete terms (and avoiding depressing experiments in involving cancer patients) think about phone call lengths. The origin time is the time the call is started, the event time is the time the call ends and the duration is the difference. In addition to calls that end there are calls that are censored out; a call is censored out if it is still ongoing at the time the data is collected. In such cases the duration is the longest time that it is know the call is still ongoing.

To summarize survival data it is common to use either a survival function or hazard function. If f(t) is the pdf of the survival times and F(t) is the cdf then 1- F(t) =\int_t ^{\infty} du f(u) = S(t) is the survival function. That is it the probability that a “phone call” will last at-least until time t. The hazard function is the instantaneous probability of failure (i.e., “hanging up” ) given that an observation has lasted until that time (i.e., given that I have been on hold at Air Canada for 45 mins what is the probability that they answer me now). This is calculated as the ratio of the pdf to the survival function; h(t) = f(t)/S(t).

The survival function is commonly empirically estimated in two ways these are Kaplan–Meier and Nelson-Aalen. In a later post I plan to gives some details about another method that I have been developing. The remainder of this post discusses the Kaplan–Meier method.

Consider an ordered sample of n observed durations (t_1 to t_n with t_j \geq t_i IIF j > i), each duration is either to an event (“hanging up“) or to censoring (the phone call is ongoing at the time of the study). The estimate is \hat{S}(t) = \prod_{j=1}^{k} (r_j-d_j)/r_j where t \in [t_{k}, t_{k+1}), r_j is the number of observations at risk in time interval [t_{j-1}, t_{j}), and d_j is the number of “hanging ups” at time t_j. That is, \hat{S}(t) is a step function with a different value for each interval [t_{i}, t_{i+1}) in the observed data. This should be sort of intuitive as it is very similar to the empirical cdf estimate used in boot strapping.

The confidence bounds for the Kaplan–Meier estimate can be derived as follows: Starting with Greenwood’s formula (Var[\hat{S}(t)] = \hat{S}(t)^2 \sum_{j=1}^k d_j/[r_j (r_j - d_j)]) the variance of \log(- \log[\hat{S}(t)]) is calculated. \log(- \log[\hat{S}(t)]) is the assumed to be Gaussian  (If you know how this is justified please let me know I have not found a formal proof yet) and the confidence bounds are found by transforming the  bounds for \log(- \log[\hat{S}(t)]) with \exp(-\exp(x)). The reason that Greenwood’s formula is not used directly is that the survival function must be bounded by 0 and 1 and adding a delta to \hat{S}(t) may not be in in this interval.

Greenwood’s formula is essentially an application of the approximation var[g(x)] \approx [(d/dx) g(x)]^2 var(x). Recall (really this is just up the page a few lines so recall is not necessary but I hate starting a sentence with an equation) \hat{S}(t) = \prod_{j=1}^{k} (r_j-d_j)/r_j = \prod_{j=1}^{k} p_j. Thus var(log[\hat{S}(t)]) = \sum_{j=1}^{k} var[ \log(p_j)]. The variance of p_j is p_j (1- p_j) /(r_j); this bold assertion is justified with the hand wavy response that p_j has a binomial distribution. So moving full circle var[ \log(p_j)] = [p_j^{-2}] p_j (1- p_j) /(r_j) = (1- p_j) /(p_j r_j), which can be simplified (1- p_j) /(p_j r_j) = (1-[(r_j-d_j)/r_j]) /([(r_j-d_j)/r_j] r_j)= d_j/[r_j(r_j-d_j)]. Using var[g(x)] \approx [(d/dx) g(x)]^2 var(x) again (I really hope this approximation is good!) var(log[\hat{S}(t)]) =\hat{S}(t)^{-2} var[\hat{S}(t)]. Putting all these pieces together \sum_{j=1}^{k} var[ \log(p_j)] = \sum_{j=1}^{k} d_j/[r_j(r_j-d_j)]=var(log[\hat{S}(t)]) = \hat{S}(t)^{-2} var[\hat{S}(t)] which can be solved for  var[\hat{S}(t)] giving Greenwood’s formula Var[\hat{S}(t)] = \hat{S}(t)^2 \sum_{j=1}^k d_j/[r_j (r_j - d_j)].

To get the variance of \log(- \log[\hat{S}(t)]) once again var[g(x)] \approx [(d/dx) g(x)]^2 var(x) is used.  This time x = -\log( \hat{S}(t)); var[\log(- \log[\hat{S}(t)])] = \log[\hat{S}(t)]^2 (\sum_{j=1}^{k} d_j/[r_j(r_j-d_j)]).

Now the completely unjustified assumption that \log(- \log[\hat{S}(t)]) \sim Gaussian, a 95% confidence bound (yes I am not using the Bayesian terminology of central credibility interval because this whole process is frequentist) is then \log(- \log[\hat{S}(t)]) \pm 1.96 se[\log(- \log[\hat{S}(t)])].

And finally the complementary log log space confidence bound is transformed back; \exp( - \exp[ (\log(- \log[\hat{S}(t)]) \pm 1.96 se[\log(- \log[\hat{S}(t)])]) ]) = \hat{S}(t)^{\exp(\pm 1.96 se[\log(- \log[\hat{S}(t)])])}

So before going on to some code that implements this I just want to point out that It does feel strange to me that the uncertainty of \hat{S}(t) is not a beta distribution at each step. Again if anyone has any thoughts on this please let me know.

The code to plot a Kaplan–Meier estimate of a survival function is given below.  Note that in the r package survival has a faster implementation of this the process; it looks something like “plot(survfit(Surv(time=t, event=event)~1 ))”.  The plot generated by my code for the simulated data is shown in figure 1.


Screen Shot 2017-05-22 at 3.33.26 PM

Figure 1: shows a Kaplan–Meier estimated (blue) survival function with uncertainty (grey) and the true survival curve (red)

Cheers Tune in next week for more survival analysis (assuming you last that long)

t = sort(rexp(100))
r = runif(100)
event = rep(1,100)
event[r < 0.1] = 0
cen  = runif(100)
t[event==0] = t[event==0]*cen[event==0]
index = sort(t, index.return = T)$ix

event= event[index]

g.KM<-function(t, event) { k = length(t) ndn = rep(1,k+1) dnnd = rep(0,k+1) for ( i in  1:k)     {     n.i = length(t[t > t[i]]) # the number of observations at risk
    d.i = event[i]            # the number that die at the ith obersved time 
    c.i = abs(1-event[i])

    ndn[i+1]      = (n.i - d.i -c.i)/(n.i - c.i) # prob of failure at that time
    dnnd[i+1]     = dnnd[i] + d.i/(n.i*(n.i-d.i)+0.00000001) # greenwood estimate


S = ndn
for (i in  2:(k+1))
    S[i] = prod(ndn[1:i])
se = sqrt(dnnd)*S
UB  = S ^exp( 1.96*sqrt(dnnd)/((log(S))))
LB  = S ^exp( -1.96*sqrt(dnnd)/((log(S))))
df = data.frame(time =  c(0,t), Event = c(1,event), ndn, S, LB, UB)

plot(c(0,t), seq(1, 0, length.out = k+1), type= "n", xlab = "time", ylab = "proportion alive" )
for (i in 2:(k+1))
    segments(x0=t[i-1], x1=t[i], y0 = UB[i-1], y1 = UB[i-1], lwd=1, col="grey")
    segments(x0=t[i],   x1=t[i], y0 = UB[i-1], y1 = UB[i], lwd=1, col="grey")

    segments(x0=t[i-1], x1=t[i], y0 = LB[i-1], y1 = LB[i-1], lwd=1, col="grey")
    segments(x0=t[i],   x1=t[i], y0 = LB[i-1], y1 = LB[i], lwd=1, col="grey")

    segments(x0=t[i-1], x1=t[i], y0 = S[i-1], y1 = S[i-1], lwd=3, col="blue")
    segments(x0=t[i],   x1=t[i], y0 = S[i-1], y1 = S[i], lwd=3, col="blue")
points(c(1, t)[c(F, event==0)], S[c(F, event==0)], pch=3)

t.frame = seq(0, 2*max(t), length.out =1001)
S.true  = 1-pexp(t.frame )
lines(t.frame, S.true, col="red", lwd=2)


g.KM(t, event)