Hi all

First no this post does not mean that I have finished writing my textbook. In fact truth be told I have made very little progress since my last post. I choose to highlight that I have given four talks on data science in that time so 1-1 tie?

Secondly I want to bring up that my former colleague Alexandre Macmillan has also started a blog on business analytics. Alex is less mathy than me, however he is extremely insightful when considering user behavior; if he applies to your company just hire him.

Now onto risk adjusted survival functions…

As noted before survival analysis is great for modeling both churn and conversion. Also Cox proportional hazards (Cox PH) is a great way to do survival analysis. However, Cox PH models the relative hazards of members of a population and the common way to plot (and think about?) survival data is the empirical survival function. This post addresses how to integrate these two methods.

Before continuing I should not that r does this auto-magically with the “survfit” function:

# max some simulated data d.g1 = rexp(1000, rate = 1/720) C.g1 = rev(seq(1,365, length.out = 1000 )) E.g1 = rep(1,1000) E.g1[d.g1 >= C.g1] = 0 d.g1[d.g1 >= C.g1] = C.g1[d.g1 >= C.g1] d.g2 = rexp(1000, rate = 1/180) C.g2 = rev(seq(1,365, length.out = 1000 )) E.g2 = rep(1,1000) E.g2[d.g2 >= C.g2] = 0 d.g2[d.g2 >= C.g2] = C.g2[d.g2 >= C.g2] d.g3 = rweibull(1000, shape =1/2, scale = 180) C.g3 = rev(seq(1,365, length.out = 1000 )) E.g3 = rep(1,1000) E.g3[d.g3 >= C.g3] = 0 d.g3[d.g3 >= C.g3] = C.g3[d.g3 >= C.g3] event.time = c(d.g1, d.g2, d.g3) event.result = c(E.g1, E.g2, E.g3) Group = c(rep(1, 1000), rep(2, 1000), rep(3, 1000)) # fit a cox model and plot the result with a "predicted" survival curv mod = coxph(Surv(time=event.time , event=event.result )~factor(Group)) km = survfit(Surv(time=event.time , event=event.result )~factor(Group)) plot(km, col=c("red", "blue", "green"), lwd=3, mark.time=FALSE, mark=3, xlim=c(0, 365), xlab = "time (days)", ylab = "S(d)") lines(survfit(mod, newdata = data.frame(Group = 1:3)), col=c("red", "blue", "green"), lty=2, mark.time=FALSE )

Figure 1 shows the result of the r code. Note that Cox model would not be appropriate in the is case as the proportional hazards assumption is not met; i.e., the KM lines cross!

So how does r calculate the risk adjusted survival function from the Cox model? To answer this question recall the definition of a hazard function; . Also recall that the Cox model describes the hazard function as . Thus what is needed is some function such that

is true.

All we have to do is find and use it to calculate the risk adjusted survival function from the empirical survival function. But how to find ? well like a lot of math it is much easier to know the answer and work back wards!

In this case I tried a few functions and could not get any of them to work. Then I had the idea to assume , i.e., it is exponential. Using this it became clear that . It then turns out that this definition of works for all !

Thus risk adjusted survival function .

That’s it for now.