How to define and model Churn:

For those of us in BI, churn is the enemy. For those of you who have made better life choices and are not in BI, a user is said to have churned from a game when they leave the game forever. As is probably obvious, forever is a very long time. Thus, to have data before the heat death of the universe, it is common [citation needed] to describe churn by picking a semi-arbitrary time period and defining a user as churned if they do not return within that period; e.g., 7 days could be chosen, and all users who have a 7-day lapse are defined as churned even if they return later. This paradigm is unsatisfactory to me. This post will explore ways of making it more satisfying.

Two alternative methods are explored; both are parametric. The first is to model a user's probability of returning on a given day as their total number of distinct active days divided by their calendar age. If the probability of them not returning for n days becomes too small, they are defined as churned. The second is to use classic survival analysis techniques; a user is defined as dead when they log in again, or censored when they reach the current date without having logged in again.

The first (let's call it binomial churn [BC]) I have not been able to make work; however, I believe it has potential. The second (survival churn [SC]) is my current go-to method. Sadly, neither of these is being done in a particularly Bayesian way; I might have to change the title of this blog to BI the pragmatic way.

To get data for the BC method, a query like this could be used:

"SELECT id, Prob, N_day_absent,
       CASE WHEN (1 - Prob)^N_day_absent < 0.05 THEN 1 ELSE 0 END AS churned  -- absence this long is unlikely by chance
FROM (SELECT id, N_active_days / cal_age::float AS Prob, N_day_absent
      FROM (SELECT id,
                   count(DISTINCT day) AS N_active_days,
                   max(datediff('day', first_contact_day, getdate()) + 1) AS cal_age,
                   datediff('day', max(day), getdate()) AS N_day_absent
            FROM login
            GROUP BY 1) AS per_user
      WHERE cal_age >= 7) AS per_user_prob"

The basic problem that I have not been able to get around is that as users stop logging in, their estimated probability of logging in goes down, and thus, paradoxically, their probability of being defined as churned is decreased. I have a feeling that some sort of “runs test” type idea could account for this but, like I said, I have not figured out how yet. It may also be possible to limit the probability calculation to the period up to each user's last login; however, that will result in them always having 2 or more logins (a first and a last login). Anyway, if you have any ideas about this, let me know.
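To make the problem concrete, here is a minimal Python sketch of the BC rule applied to one made-up user. The function names, the example numbers, and the "frozen" variant (estimating the probability only up to the last login) are illustrative assumptions, not something taken from the query; only the 0.05 threshold comes from it.

```python
def absence_probability(n_active_days, cal_age, n_day_absent):
    """Chance of n_day_absent straight days without a login under the BC model."""
    prob = n_active_days / cal_age          # daily login probability
    return (1.0 - prob) ** n_day_absent


def is_churned(n_active_days, cal_age, n_day_absent, threshold=0.05):
    """Flag the user as churned when the absence is too unlikely to be chance."""
    return absence_probability(n_active_days, cal_age, n_day_absent) < threshold


# Hypothetical user: 10 distinct active days in the first 20 calendar days,
# then silent for 5 days.
active_days, age_at_last_login, absent = 10, 20, 5

# Running estimate: the calendar age keeps growing while the user is away,
# so the estimated login probability shrinks (10 / 25 = 0.4).
running = absence_probability(active_days, age_at_last_login + absent, absent)

# Frozen estimate: probability computed only up to the last login (10 / 20 = 0.5).
frozen = absence_probability(active_days, age_at_last_login, absent)

print(f"running estimate: {running:.4f}")
print(f"frozen estimate:  {frozen:.4f}")
```

With these numbers the running estimate stays above 0.05 (about 0.078) while the frozen one drops below it (about 0.031), so the same silent user is flagged only when the absence is kept out of the probability estimate.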

The SC method is simpler. The only mental hurdle is that we have to conceptualize logging in as dying. So increased hazard is good and increased survival is bad. A query to get the data for the SC method should look similar to:

"SELECT id, EVENT, datediff('day', day, EVENT_DAY) AS Event_age, day, EVENT_DAY
FROM (SELECT id, day,
             CASE WHEN next_day IS NULL THEN 0 ELSE 1 END AS EVENT,
             CASE WHEN next_day IS NULL THEN getdate() ELSE next_day END AS EVENT_DAY
      FROM (SELECT id, day,
                   lead(day) OVER (PARTITION BY id ORDER BY day) AS next_day  -- next login day, NULL if none yet
            FROM login_day  -- this is a table with at most one entry per id per day
           ) AS with_next
     ) AS with_event"
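For readers who find the nested query hard to follow, the same bookkeeping can be sketched in Python for a single user. This is only an illustration under assumptions: the function name `survival_rows` and the example dates are made up, and the output mirrors the query's day, EVENT, and Event_age columns.

```python
from datetime import date


def survival_rows(login_days, today):
    """Turn one user's login days into (day, event, event_age) rows.

    event = 1 means the user logged in again ("died");
    event = 0 means the row is censored at the current date.
    """
    days = sorted(set(login_days))          # at most one entry per day
    rows = []
    for i, day in enumerate(days):
        if i + 1 < len(days):
            event_day, event = days[i + 1], 1   # a later login exists
        else:
            event_day, event = today, 0         # still waiting: censored
        rows.append((day, event, (event_day - day).days))
    return rows


# Hypothetical user with three login days, observed on 20 January.
rows = survival_rows(
    [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 5)],
    today=date(2024, 1, 20),
)
for row in rows:
    print(row)
```

Each login day starts its own clock: the first two rows end in a return after 1 and 3 days, and the last row is censored after 15 days.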

Unlike BC, SC still needs some analysis at this point, as users have not been defined as churned within the query. The two basic R commands that I would use to model churn are

"library(survival)  # provides Surv(), survfit() and coxph()
surv.plot.test = survfit(Surv(time = event_age, event = event) ~ x, data = data.junk1)
cph.test       = coxph(Surv(time = event_age, event = event) ~ x, data = data.junk1)"

where “x” could be any predictor, entered in the standard R model formula syntax. The “survfit” function is good for plotting survival rates (remember, lower is better!). Figure 1 is an example of a Kaplan-Meier plot of a possible “surv.plot.test” model. The second command fits a Cox proportional hazards model; it is good for looking at the effect of predictors on churn (again, higher hazard is better).


Figure 1: Kaplan-Meier plot of user probability of return. At zero days, 100% of users have not returned, and by 30 days about 80% of users have returned. The different colors indicate different groups of users; this is a synthetic example, so the groups are meaningless.
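To see what a Kaplan-Meier curve like this is made of, the estimator can be written out by hand in a few lines of Python. This is a sketch of the standard product-limit calculation on made-up data, not a replacement for “survfit”; the function name and the example values are assumptions.

```python
from collections import Counter


def kaplan_meier(event_ages, events):
    """Return [(t, S(t))] at each time t where at least one event occurred.

    S(t) is the estimated share of users who have NOT yet returned by day t.
    """
    deaths = Counter(t for t, e in zip(event_ages, events) if e == 1)
    exits = Counter(event_ages)     # events and censorings both leave the risk set
    at_risk = len(event_ages)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            surv *= 1.0 - d / at_risk   # product-limit step
            curve.append((t, surv))
        at_risk -= exits[t]
    return curve


# Hypothetical data: six gaps in days, 1 = user returned, 0 = censored.
event_ages = [1, 1, 2, 3, 5, 8]
events = [1, 1, 1, 0, 1, 0]

curve = kaplan_meier(event_ages, events)
for t, s in curve:
    print(f"day {t}: {s:.2f} not yet returned")
```

For this toy data the curve steps down to roughly 0.67, 0.50 and 0.25 at days 1, 2 and 5, i.e. an estimated 75% of users have returned by day 5. The censored rows never produce a step; they only shrink the number of users still at risk.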

So that's it for churn. As always, please leave comments, suggestions, poems, etc.

P.S. sorry this entry was left unfinished for so long.

Tune in next time, unless you churn, for more BI the Bayesian Way.



Thoughts on “How to define and model Churn”:

  1. Wouldn’t it be more precise to define churn as the condition of both: a) the user's last experience in the game was characterized by feature X, and b) the user has not logged in for some period? Feature X could be extracted from the logs of users who have long left the game. Did they all lose money on their last login? Attempt the same level 3 times or more?


  2. How very Bayesian of you to define something in terms of probability and not certainty. Just be sure you round any aggregation or sum to a whole number. Business people don’t understand what 0.2 of a user means.


