Hi all

This post shows how to use SQL to get estimates of the expected event times given either the Exponential or Weibull distributions and right censored data. That might seem like a strange restriction however it is very powerful have estimates in a query instead of requiring R as they can be used directly in dashboard (in my experience with Tableau). Also if like me you are lazy being able to just have a query that answers your question is much nicer then having to analyze data from a query.

Before showing the query it might be helpful to describe what is being queried. Starting with the exponential distribution. Let be a set of observed data; has observations with event times and censored times. The likelihood function is where if the ith observation is an observed event and 0 if it is censored. Thus if the event times are assumed to be exponentially distributed . If the likelihood is considered as function of not then it is recognizable as a gamma distribution; thus the expectation of is . Note that is can also be found using the standard ML approach of solving the derivative of the like-hood set equal to zero.

For the Weibull distribution this a bit more complicated! Recall that the pdf ofr a Weibull is and the survival function is . Thus using the same process as shown with the exponential distribution (i.e., I am too lazy to write it again) the likelihood function of a similar data set is . This does not correspond to a distribution with know properties when considered as a function of and . However the derivative of the log likelihood can be solved to find an estimate right? And that estimate is going to be used the the query? We shall see!

. The two partial derivatives are and . Using the first equation it is is to find that however can not be solved for algebraically.

The value of that minimizes is the maximum likelihood estimate.

So this whole post comes down to how can a SQL query find the optimal value of . The answer is actually quite simple, brute force. I create a large number of values and full join them to my data set. Then I rank the rows by their misfit to the objective function and take the best one! BAM!

This query assume there is a table with user ids, event times and user ages. The cent follows the same structure as the churn posts query.

SELECT age, gamma,lambda, n, r, o_f ,lambda_ex FROM ( SELECT age, gamma, n, r,lambda, o_f, lambda_ex, row_number() over (partition by age order by o_f) as index FROM ( SELECT age, gamma, n, r, r/(SUM_U_to_gamma + SUM_t_to_gamma) as lambda, r/(SUM_U + SUM_t) as lambda_ex, ABS(r/gamma + SUM_log_t - (r/ (SUM_U_to_gamma + SUM_t_to_gamma) )*(SUM_U_to_gamma_log_u + SUM_t_to_gamma_log_t)) as o_f FROM ( SELECT age, gamma, sum(1) as n, SUM(event) as r, SUM(CASE WHEN event = 0 THEN (event_time::float) END) as SUM_U, SUM(CASE WHEN event = 1 THEN (event_time::float) END) as SUM_t, SUM(CASE WHEN event = 0 THEN (event_time::float)^gamma END) as SUM_U_to_gamma, SUM(CASE WHEN event = 1 THEN (event_time::float)^gamma END) as SUM_t_to_gamma, SUM(CASE WHEN event = 0 THEN ln(event_time::Float) END) as SUM_log_U, SUM(CASE WHEN event = 1 THEN ln(event_time::Float) END) as SUM_log_t, SUM(CASE WHEN event = 0 THEN ln(event_time::float)*event_time^gamma END) as SUM_U_to_gamma_log_u, SUM(CASE WHEN event = 1 THEN ln(event_time::float)*event_time^gamma END) as SUM_t_to_gamma_log_t FROM ( (SELECT id, CASE WHEN next_time is null then 0 else 1 end as event, datediff('sec', clean_time, coalesce(next_time,'2017-08-28 00:00:00') ) as event_time, age, 1 as key FROM (SELECT id, clean_time, age, max(clean_time) over (partition by id order by clean_time rows between 1 following and 1 following) as next_time, row_number() over (partition by id,age order by random() ) as index FROM tab_event and clean_time <= '2017-08-28 00:00:00' ) where index <= 1 and age <= 7) Q1 LEFT JOIN (SELECT random() as gamma, 1 as key from tab_event LIMIT 10000 ) Q2 ON Q1.key = Q2.key ) where event_time > 0 GROUP BY 1,2 ) ) ) where index = 1 ORDER BY 1,2

That is it for now tune in next time for more survival analysis!

Pingback: An introduction or “vulgurization” on Churn | Bayesian Business Intelligence

SELECT *

FROM

(

SELECT active_on_date, power(ln(2),1/gamma)*power(lambda,(-1/gamma)) as MEDIAN_RET_TIME, log(2)/lambda_ex as MEDIAN_RET_TIME2, max(ACTIVE_ON_DATE) OVER () as MAX_DAY

FROM

(

SELECT active_on_date, gamma,lambda, n, r, o_f ,lambda_ex

FROM

(

SELECT active_on_date, gamma, n, r,lambda, o_f, lambda_ex,

row_number() over (partition by active_on_date order by o_f) as index

FROM

(

SELECT active_on_date, gamma, n, r, r/(SUM_U_to_gamma + SUM_t_to_gamma) as lambda, r/(SUM_U + SUM_t) as lambda_ex,

ABS(r/gamma + SUM_log_t – (r/ (SUM_U_to_gamma + SUM_t_to_gamma) )*(SUM_U_to_gamma_log_u + SUM_t_to_gamma_log_t)) as o_f

FROM

(

SELECT active_on_date, gamma,

sum(1) as n,

SUM(event) as r,

SUM(CASE WHEN event = 0 THEN (event_time) END) as SUM_U,

SUM(CASE WHEN event = 1 THEN (event_time) END) as SUM_t,

SUM(CASE WHEN event = 0 THEN power(event_time,gamma) END) as SUM_U_to_gamma,

SUM(CASE WHEN event = 1 THEN power(event_time,gamma) END) as SUM_t_to_gamma,

SUM(CASE WHEN event = 0 THEN ln(event_time) END) as SUM_log_U,

SUM(CASE WHEN event = 1 THEN ln(event_time) END) as SUM_log_t,

SUM(CASE WHEN event = 0 THEN ln(event_time)*power(event_time, gamma) END) as SUM_U_to_gamma_log_u,

SUM(CASE WHEN event = 1 THEN ln(event_time)*power(event_time, gamma) END) as SUM_t_to_gamma_log_t

FROM

(

(SELECT player_id, CASE WHEN next_age is null then 0 else 1 end as event,

COALESCE(next_age, age_possible) – age as event_time, active_on_date, 1 as key

FROM

(SELECT Q1.*, age_possible

FROM

(

(SELECT player_id, active_on_date, age,

max(age) over (partition by player_id order by age rows between 1 following and 1 following) as next_age,

row_number() over (partition by player_id, active_on_date order by rand() ) as index

FROM datapipe.cohort

where active_on_date >= ‘2018-12-06’) Q1

LEFT JOIN

(SELECT player_id, age_possible

FROM datapipe.user_summary) Q2

ON Q1.player_id = Q2.player_id

)

)

where index 0

GROUP BY 1,2

)

)

)

where index = 1

)

)

WHERE abs(DATETIME_DIFF(MAX_DAY, ACTIVE_ON_DATE, DAY)) >= 7

ORDER BY 1,2

LikeLike