How to define and model Churn:

For those of us in BI, churn is the enemy. For those of you who have made better life choices and are not in BI, a user is said to have churned from a game when they leave the game forever. As is probably obvious, forever is a very long time. Thus, to have data before the heat death of the universe, it is common [citation needed] to describe churn by picking a semi-arbitrary time period and defining a user as churned if they do not return within that period; e.g., 7 days could be chosen, and all users who have a 7-day lapse are defined as churned even if they return later. This paradigm is unsatisfactory to me. This post will explore ways of making it more satisfying.

Two alternative methods are explored; both are parametric. The first is to model a user's probability of returning on a given day as their total number of distinct active days divided by their calendar age. If the probability of them not returning for n days in a row becomes too small, they are defined as churned. The second is to use classic survival analysis techniques; a user is defined as dead when they log in again, or censored when they reach the current date without logging in.

The first (let's call it binomial churn [BC]) I have not been able to make work; however, I believe it has potential. The second (survival churn [SC]) is my current go-to method. Sadly, neither of these is being done in a particularly Bayesian way; I might have to change the title of this blog to BI the Pragmatic Way.

To get data for the BC method, a query like the following can be used:

"SELECT id, Prob, N_day_absent,
       CASE WHEN (1 - Prob)^N_day_absent < 0.05 THEN 1 ELSE 0 END as churned
FROM
    (SELECT id, N_active_days/cal_age::float as Prob, N_day_absent
    FROM
        -- one row per user: active days, calendar age, and days since last login
        (SELECT id, count(distinct day) as N_active_days, max(datediff('day', first_contact_day, getdate()) + 1) as cal_age, datediff('day', max(day), getdate()) as N_day_absent
        FROM login
        GROUP BY 1) as per_user
    WHERE cal_age >= 7) as with_prob"

The basic problem that I have not been able to get around is that as users stop logging in, their estimated probability of logging in goes down, and thus, paradoxically, a long absence looks less surprising and they become less likely to be flagged as churned. I have a feeling that some sort of “runs test” type idea could account for this but, like I said, I have not figured out how yet. It may also be possible to estimate each user's probability using only the period up to their last login; however, that will result in them always having 2 or more logins (a first and a last login). Anyway, if you have any ideas about this, let me know.
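To make the problem concrete, here is a rough Python sketch of the BC rule, not a solution to it. The user is made up (2 active days, calendar age of 10 days at the last login), and the function names and the "frozen" variant, which stops the clock at the last login, are assumptions added for illustration only.

```python
def run_probability(p_return, n_day_absent):
    """Probability of n_day_absent misses in a row if each day is an
    independent return with probability p_return (the BC assumption)."""
    return (1.0 - p_return) ** n_day_absent


def diluted_probability(n_active_days, age_at_last_login, n_day_absent):
    """Return probability as in the query: calendar age keeps growing
    while the user is absent, so the estimate shrinks every day."""
    return n_active_days / (age_at_last_login + n_day_absent)


def frozen_probability(n_active_days, age_at_last_login):
    """Hypothetical variant: stop the clock at the last login."""
    return n_active_days / age_at_last_login


THRESHOLD = 0.05  # same cut-off as the query

# Made-up user: 2 active days, calendar age of 10 days at the last login.
for k in (1, 14, 100, 1000):
    diluted = run_probability(diluted_probability(2, 10, k), k)
    frozen = run_probability(frozen_probability(2, 10), k)
    print(k, round(diluted, 4), diluted < THRESHOLD,
          round(frozen, 4), frozen < THRESHOLD)
```

For this user the diluted value levels off near exp(-2), about 0.135, so it never drops below 0.05 no matter how long they stay away. The frozen value crosses the threshold after 14 days of absence.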

The SC method is simpler. The only mental hurdle is that we have to conceptualize logging in as dying, so increased hazard is good and increased survival is bad. A query to get the data for the SC method should look similar to:

"SELECT id, EVENT, datediff('day', day, EVENT_DAY) as Event_age, day, EVENT_DAY
FROM
    (SELECT id, day,
               CASE WHEN next_day is NULL THEN 0 ELSE 1 END as EVENT,
               CASE WHEN next_day is NULL THEN getdate() ELSE next_day END as EVENT_DAY
    FROM
        -- next_day is the user's following login day, or NULL after their latest login
        (SELECT id, day, max(day) over (partition by id order by day rows between 1 following and 1 following) as next_day
        FROM login_day) as with_next -- login_day is a table with at most one entry per id per day
    ) as events"
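The same reshaping can be sketched in Python for anyone who wants to check the logic on a handful of rows. The function name, the input shape (a dict of user id to login dates) and the example dates are all assumptions made for this illustration.

```python
from datetime import date


def survival_rows(login_days, today):
    """Build one row per login day: (id, day, event, event_day, event_age).

    event is 1 when the user logged in again later ("died"), and 0 when
    the login is their latest one and the row is censored at today.
    """
    rows = []
    for user_id, days in login_days.items():
        ordered = sorted(set(days))  # at most one entry per id per day
        for i, day in enumerate(ordered):
            if i + 1 < len(ordered):
                event, event_day = 1, ordered[i + 1]
            else:
                event, event_day = 0, today
            rows.append((user_id, day, event, event_day, (event_day - day).days))
    return rows


# Made-up example: one user with two login days.
example = survival_rows({"a": [date(2024, 1, 1), date(2024, 1, 3)]},
                        today=date(2024, 1, 10))
for row in example:
    print(row)
```

Every login day becomes its own observation, so a user with many logins contributes many rows, and only their latest login is censored.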

Unlike BC, SC still needs some analysis at this point, as users have not been defined as churned within the query. The two basic R commands that I would use to model churn are

"library(survival)  # provides Surv, survfit and coxph
surv.plot.test    = survfit(Surv(time=event_age, event=event)~x, data=data.junk1)
cph.test          = coxph(Surv(time=event_age, event=event)~x, data=data.junk1)"

where “x” could be any predictor and is entered in the standard R model syntax. The “survfit” function is good for plotting survival rates (remember, lower is better!). Figure 1 is an example of a Kaplan-Meier plot of a possible “surv.plot.test” model. The second command, “coxph”, fits a Cox proportional hazards model; it is good for looking at the effect of predictors on churn (again, higher hazard is better).


Figure 1: Kaplan-Meier plot of user probability of return. At zero days 100% of users have not returned, and by 30 days about 80% of users have returned. The different colors indicate different groups of users; this is a synthetic example, so the groups are meaningless.
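For intuition about what a curve like this is made of, here is a minimal hand-rolled Kaplan-Meier estimate in Python. It is a sketch of the textbook product-limit formula, not a replacement for “survfit” (no groups, no confidence intervals), and the function name and toy data are made up for the example.

```python
def kaplan_meier(event_ages, events):
    """Return [(age, fraction not yet returned)] at each age with a return.

    event_ages: days until the next login, or until today if censored.
    events: 1 if the user logged in again, 0 if censored.
    """
    pairs = sorted(zip(event_ages, events))
    n_at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        age = pairs[i][0]
        returns = 0
        leaving = 0
        # Handle everyone who shares this age in one step.
        while i < len(pairs) and pairs[i][0] == age:
            returns += pairs[i][1]
            leaving += 1
            i += 1
        if returns:
            survival *= 1 - returns / n_at_risk
            curve.append((age, survival))
        # Returned and censored users both drop out of the risk set.
        n_at_risk -= leaving
    return curve


# Toy data: five observations, two of them censored.
for age, s in kaplan_meier([1, 2, 2, 3, 5], [1, 1, 0, 1, 0]):
    print(age, round(s, 3))
```

Censored rows never pull the curve down themselves; they only shrink the number of users still at risk at later ages.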

So that's it for churn. As always, please leave comments, suggestions, poems, etc.

P.S. Sorry this entry was left unfinished for so long.

Tune in next time, unless you churn, for more BI the Bayesian Way.