Hi all, My friend and former colleague Alex has asked me to write a short “vulgarization” of my methods for modelling churn. This will likely be a review for my loyal readers, but summarizing and clarifying is rarely a waste of time.
Before continuing I should add that Alex’s blog has had some very insightful posts about churn. In general, I think he is very good (definitely better than myself) at describing the qualitative, non-mathematical ideas of business intelligence.
On to the vulgarization:
In short, I model login velocity (time between logins) using survival analysis. This approach solves the two primary difficulties of modelling churn on mobile products. The first is that churn is passive, not active; i.e., users do not cancel a subscription, they simply stop returning. The second is that users have had different lengths of time in which to return; e.g., if user A stopped playing 6 months ago, they have had more time to return than user B, who stopped 1 week ago.
How does a survival analysis approach address these problems? First, with regard to the passive nature of churn: by modelling the time until return, the passive non-action of not returning is replaced by the active action of returning. The second problem, how to account for each user having had a different length of time in which to return, is handled through data censoring.
Data censoring is probably the only tricky part of using this method. To describe it, consider the problem of measuring phone call durations (trust me, the alternative examples are far more depressing). If a large phone company were to pull the durations of all phone calls on its network right now, it could get an estimate; however, many phone calls would still be ongoing at the moment it pulled the data. Importantly, some of the ongoing calls might be the very longest. Thus, for calls that have already ended there is a measurable duration, but for ongoing calls all that is known is that they have not ended yet (in other words, that they lasted at least this long). The way survival analysis uses this censored information can get a bit mathy. A more common-sense way of thinking about it: if you are calculating the percentage of users that have not returned by some length of time after their previous session, then censored users count in both the numerator and the denominator before their censor time and in neither after it.
In mobile churn the censor time, just as in the phone example, is the current time. For users that have already returned there is a defined time between logins; for users that have not yet returned, all that is known is that they have not returned by now (the current time). The main difference in this application of survival analysis is that, unlike in most conventional problems, most of our observations will be censored rather than observed.
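To make the numerator and denominator idea concrete, here is a minimal sketch of the Kaplan Meier estimate computed by hand in base R. The seven observations are made up purely for illustration, and the variable names are my own. This is one standard way of formalising the common-sense description, not the only one.

```r
# Toy data (made up for illustration): days until return, or days observed so far.
time  <- c(2, 3, 3, 5, 8, 8, 10)
event <- c(1, 1, 0, 1, 0, 1, 0)   # 1 = returned, 0 = censored (still away "now")

# Distinct times at which at least one return was observed.
event.times <- sort(unique(time[event == 1]))

# At each such time: how many users were still being followed,
# and how many of them returned exactly then.
at.risk  <- sapply(event.times, function(tt) sum(time >= tt))
returned <- sapply(event.times, function(tt) sum(time == tt & event == 1))

# Kaplan Meier estimate: running product of the fraction NOT returning.
surv <- cumprod(1 - returned / at.risk)

km.by.hand <- data.frame(day = event.times, at.risk, returned, surv)
print(km.by.hand)
```

Note how the user censored at day 3 is still counted in at.risk at day 3 but has dropped out of the count by day 5, which is exactly the "both before, neither after" behaviour described above.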
The SQL code to pull the inter-arrival times should look something like this:
    SELECT
      player_id,
      LOGIN_DATE,
      LOGIN_MONTH,
      -- days between this login and the next one, or the censor date if there is no next login
      DATETIME_DIFF(COALESCE(NEXT_DATE, EOT), LOGIN_DATE, DAY) AS EVENT_TIME,
      -- 1 = the player returned (observed), 0 = no return yet (censored)
      CASE WHEN NEXT_DATE IS NULL THEN 0 ELSE 1 END AS EVENT
    FROM (
      SELECT
        LOGIN_DATE,
        player_id,
        DATETIME_TRUNC(LOGIN_DATE, MONTH) AS LOGIN_MONTH,
        -- next login by the same player (NULL if there is none yet)
        MAX(LOGIN_DATE) OVER (PARTITION BY player_id ORDER BY LOGIN_DATE
                              ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING) AS NEXT_DATE,
        -- time at the end of the data set (the censor time)
        MAX(LOGIN_DATE) OVER () AS EOT,
        -- this query can't be aggregated, so we need to sample the observations
        ROW_NUMBER() OVER (PARTITION BY DATETIME_TRUNC(LOGIN_DATE, MONTH) ORDER BY RAND()) AS index
      FROM LOGIN_DATE
    )
    WHERE index <= 5000  -- 5k observations per month
Note that this query assumes we are interested in cohorting the users by last contact month. If a different attribute is of interest, that should be used instead.
To actually conduct the survival analysis I use R, though in some parametric cases it is possible to do so in SQL.
The two most important survival analysis methods are Kaplan Meier plots and the Cox proportional hazards model. Describing the nuts and bolts of these falls outside the scope of a “vulgarization” like this, so I will just leave you with the R code for them (note you will need to load the survival package for this code to work) and a short description of how to interpret the results.
Before starting, let’s simulate some data.
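As a sketch of what such a parametric case can look like: if you are willing to assume a constant hazard (exponential return times), the censored maximum likelihood estimate of the return rate is simply the number of observed returns divided by the total time at risk. The simulation below, with parameters I picked arbitrarily, checks that in R. The same ratio could be computed in SQL as SUM(EVENT) / SUM(EVENT_TIME) per cohort, using the columns from the query above.

```r
set.seed(1)                            # arbitrary seed, for reproducibility
n <- 5000
true.rate <- 1 / 180                   # assumed: mean return time of 180 days
d <- rexp(n, rate = true.rate)         # true return times (partly unobserved)
C <- runif(n, min = 1, max = 365)      # hypothetical censor times
event <- as.numeric(d < C)             # 1 = returned before being censored
time  <- pmin(d, C)                    # what we actually get to observe

# Censored exponential MLE: observed returns divided by total time at risk.
rate.hat <- sum(event) / sum(time)
rate.hat
1 / rate.hat                           # estimated mean days until return
```

This shortcut is only as good as the constant hazard assumption; when the hazard changes over time, the non-parametric methods that follow are the safer choice.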
    ########## simulate data
    # group 1: exponential return times, mean 720 days
    d.g1 = rexp(1000, rate = 1/720)
    C.g1 = rev(seq(1, 365, length.out = 1000))   # censor times
    E.g1 = rep(1, 1000)                          # 1 = observed
    E.g1[d.g1 >= C.g1] = 0                       # 0 = censored
    d.g1[d.g1 >= C.g1] = C.g1[d.g1 >= C.g1]      # censored times are cut at the censor time

    # group 2: exponential return times, mean 180 days
    d.g2 = rexp(1000, rate = 1/180)
    C.g2 = rev(seq(1, 365, length.out = 1000))
    E.g2 = rep(1, 1000)
    E.g2[d.g2 >= C.g2] = 0
    d.g2[d.g2 >= C.g2] = C.g2[d.g2 >= C.g2]

    # group 3: Weibull return times (non-constant hazard)
    d.g3 = rweibull(1000, shape = 1/2, scale = 180)
    C.g3 = rev(seq(1, 365, length.out = 1000))
    E.g3 = rep(1, 1000)
    E.g3[d.g3 >= C.g3] = 0
    d.g3[d.g3 >= C.g3] = C.g3[d.g3 >= C.g3]

    event.time   = c(d.g1, d.g2, d.g3)
    event.result = c(E.g1, E.g2, E.g3)
    Group        = c(rep(1, 1000), rep(2, 1000), rep(3, 1000))
Here the “Group” vector would be equivalent to the LOGIN_MONTH column from the SQL query.
    ########################## Kaplan Meier plot with groups
    library(survival)
    km = survfit(Surv(time=event.time, event=event.result) ~ factor(Group))
    plot(km, col=c("red", "blue", "green"), lwd=3, mark.time=TRUE, mark=3,
         xlim=c(0, 365), xlab = "time (days)", ylab = "S(d)")
This produces Fig. 1, a Kaplan Meier plot showing the empirical survival function (the fraction of people who have not returned by the indicated number of days) for the simulated data. The crosses mark times at which an observation was censored (i.e., hit the ever-moving now).
The next step is to fit a Cox proportional hazards model to the data and plot the predicted (modelled) survival functions.
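If you want numbers rather than a picture, the fitted curve can also be read off at chosen days. The following is a minimal sketch on a single freshly simulated group (the seed, sample size and the days 30, 90 and 180 are my own arbitrary choices), using summary() from the survival package.

```r
library(survival)

set.seed(42)                                # arbitrary seed
n <- 1000
d <- rexp(n, rate = 1 / 180)                # assumed true return times
C <- rev(seq(1, 365, length.out = n))       # censor times, as in the simulation above
event.result <- as.numeric(d < C)           # 1 = returned, 0 = censored
event.time   <- pmin(d, C)

km <- survfit(Surv(time = event.time, event = event.result) ~ 1)

# Estimated fraction not yet returned at 30, 90 and 180 days, with 95% intervals.
s <- summary(km, times = c(30, 90, 180))
data.frame(day = s$time, not.returned = s$surv, lower = s$lower, upper = s$upper)
```

Each row is the estimated share of users still away after that many days, which is often the number a product team actually asks for.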
    ###################################### Cox model
    library(survival)
    mod = coxph(Surv(time=event.time, event=event.result) ~ factor(Group))
    km = survfit(Surv(time=event.time, event=event.result) ~ factor(Group))
    # solid lines: empirical (Kaplan Meier) survival curves
    plot(km, col=c("red", "blue", "green"), lwd=3, mark.time=FALSE, mark=3,
         xlim=c(0, 365), xlab = "time (days)", ylab = "S(d)")
    # dashed lines: survival curves predicted by the Cox model for each group
    lines(survfit(mod, newdata = data.frame(Group = 1:3)),
          col=c("red", "blue", "green"), lty=2, mark.time=FALSE)
Really, that is it for now, but one final point. In this example the Cox model is not a good choice because the groups do not have proportional hazards (the Kaplan Meier curves cross).
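Eyeballing crossing curves is the quick check. A more formal one is the Schoenfeld residual test, available as cox.zph() in the survival package. Below is a minimal sketch on two freshly simulated groups that mirror groups 2 and 3 above; the seed and the data frame name sim are my own choices.

```r
library(survival)

set.seed(7)                                       # arbitrary seed
n <- 1000
C  <- rev(seq(1, 365, length.out = n))            # same censoring scheme as above
d2 <- rexp(n, rate = 1 / 180)                     # constant hazard
d3 <- rweibull(n, shape = 1 / 2, scale = 180)     # hazard that decreases over time

sim <- data.frame(
  time  = c(pmin(d2, C), pmin(d3, C)),
  event = as.numeric(c(d2 < C, d3 < C)),          # 1 = returned, 0 = censored
  Group = factor(rep(c(2, 3), each = n))
)

mod <- coxph(Surv(time, event) ~ Group, data = sim)

# Schoenfeld residual test: a small p-value is evidence against proportional hazards.
zph <- cox.zph(mod)
print(zph)
```

Calling plot(zph) afterwards draws the estimated coefficient over time; a roughly flat line would support the proportional hazards assumption.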
P.S. if you are new to my blog (or have been here for a long time) please ask questions in the comments.