Hi all
This post shows how to use SQL to get estimates of the expected event times given either the Exponential or Weibull distributions and right censored data. That might seem like a strange restriction however it is very powerful have estimates in a query instead of requiring R as they can be used directly in dashboard (in my experience with Tableau). Also if like me you are lazy being able to just have a query that answers your question is much nicer then having to analyze data from a query.
Before showing the query it might be helpful to describe what is being queried. Starting with the exponential distribution. Let be a set of observed data;
has
observations with
event times and
censored times. The likelihood function is
where
if the ith observation is an observed event and 0 if it is censored. Thus if the event times are assumed to be exponentially distributed
. If the likelihood is considered as function of
not
then it is recognizable as a gamma distribution; thus the expectation of
is
. Note that is can also be found using the standard ML approach of solving the derivative of the like-hood set equal to zero.
For the Weibull distribution this a bit more complicated! Recall that the pdf ofr a Weibull is and the survival function is
. Thus using the same process as shown with the exponential distribution (i.e., I am too lazy to write it again) the likelihood function of a similar data set is
. This does not correspond to a distribution with know properties when considered as a function of
and
. However the derivative of the log likelihood can be solved to find an estimate right? And that estimate is going to be used the the query? We shall see!
. The two partial derivatives are
and
. Using the first equation it is is to find that
however
can not be solved for algebraically.
The value of that minimizes
is the maximum likelihood estimate.
So this whole post comes down to how can a SQL query find the optimal value of . The answer is actually quite simple, brute force. I create a large number of
values and full join them to my data set. Then I rank the rows by their misfit to the objective function and take the best one! BAM!
This query assume there is a table with user ids, event times and user ages. The cent follows the same structure as the churn posts query.
SELECT age, gamma,lambda, n, r, o_f ,lambda_ex FROM ( SELECT age, gamma, n, r,lambda, o_f, lambda_ex, row_number() over (partition by age order by o_f) as index FROM ( SELECT age, gamma, n, r, r/(SUM_U_to_gamma + SUM_t_to_gamma) as lambda, r/(SUM_U + SUM_t) as lambda_ex, ABS(r/gamma + SUM_log_t - (r/ (SUM_U_to_gamma + SUM_t_to_gamma) )*(SUM_U_to_gamma_log_u + SUM_t_to_gamma_log_t)) as o_f FROM ( SELECT age, gamma, sum(1) as n, SUM(event) as r, SUM(CASE WHEN event = 0 THEN (event_time::float) END) as SUM_U, SUM(CASE WHEN event = 1 THEN (event_time::float) END) as SUM_t, SUM(CASE WHEN event = 0 THEN (event_time::float)^gamma END) as SUM_U_to_gamma, SUM(CASE WHEN event = 1 THEN (event_time::float)^gamma END) as SUM_t_to_gamma, SUM(CASE WHEN event = 0 THEN ln(event_time::Float) END) as SUM_log_U, SUM(CASE WHEN event = 1 THEN ln(event_time::Float) END) as SUM_log_t, SUM(CASE WHEN event = 0 THEN ln(event_time::float)*event_time^gamma END) as SUM_U_to_gamma_log_u, SUM(CASE WHEN event = 1 THEN ln(event_time::float)*event_time^gamma END) as SUM_t_to_gamma_log_t FROM ( (SELECT id, CASE WHEN next_time is null then 0 else 1 end as event, datediff('sec', clean_time, coalesce(next_time,'2017-08-28 00:00:00') ) as event_time, age, 1 as key FROM (SELECT id, clean_time, age, max(clean_time) over (partition by id order by clean_time rows between 1 following and 1 following) as next_time, row_number() over (partition by id,age order by random() ) as index FROM tab_event and clean_time <= '2017-08-28 00:00:00' ) where index <= 1 and age <= 7) Q1 LEFT JOIN (SELECT random() as gamma, 1 as key from tab_event LIMIT 10000 ) Q2 ON Q1.key = Q2.key ) where event_time > 0 GROUP BY 1,2 ) ) ) where index = 1 ORDER BY 1,2
That is it for now tune in next time for more survival analysis!
Pingback: An introduction or “vulgurization” on Churn | Bayesian Business Intelligence
SELECT *
FROM
(
SELECT active_on_date, power(ln(2),1/gamma)*power(lambda,(-1/gamma)) as MEDIAN_RET_TIME, log(2)/lambda_ex as MEDIAN_RET_TIME2, max(ACTIVE_ON_DATE) OVER () as MAX_DAY
FROM
(
SELECT active_on_date, gamma,lambda, n, r, o_f ,lambda_ex
FROM
(
SELECT active_on_date, gamma, n, r,lambda, o_f, lambda_ex,
row_number() over (partition by active_on_date order by o_f) as index
FROM
(
SELECT active_on_date, gamma, n, r, r/(SUM_U_to_gamma + SUM_t_to_gamma) as lambda, r/(SUM_U + SUM_t) as lambda_ex,
ABS(r/gamma + SUM_log_t – (r/ (SUM_U_to_gamma + SUM_t_to_gamma) )*(SUM_U_to_gamma_log_u + SUM_t_to_gamma_log_t)) as o_f
FROM
(
SELECT active_on_date, gamma,
sum(1) as n,
SUM(event) as r,
SUM(CASE WHEN event = 0 THEN (event_time) END) as SUM_U,
SUM(CASE WHEN event = 1 THEN (event_time) END) as SUM_t,
SUM(CASE WHEN event = 0 THEN power(event_time,gamma) END) as SUM_U_to_gamma,
SUM(CASE WHEN event = 1 THEN power(event_time,gamma) END) as SUM_t_to_gamma,
SUM(CASE WHEN event = 0 THEN ln(event_time) END) as SUM_log_U,
SUM(CASE WHEN event = 1 THEN ln(event_time) END) as SUM_log_t,
SUM(CASE WHEN event = 0 THEN ln(event_time)*power(event_time, gamma) END) as SUM_U_to_gamma_log_u,
SUM(CASE WHEN event = 1 THEN ln(event_time)*power(event_time, gamma) END) as SUM_t_to_gamma_log_t
FROM
(
(SELECT player_id, CASE WHEN next_age is null then 0 else 1 end as event,
COALESCE(next_age, age_possible) – age as event_time, active_on_date, 1 as key
FROM
(SELECT Q1.*, age_possible
FROM
(
(SELECT player_id, active_on_date, age,
max(age) over (partition by player_id order by age rows between 1 following and 1 following) as next_age,
row_number() over (partition by player_id, active_on_date order by rand() ) as index
FROM datapipe.cohort
where active_on_date >= ‘2018-12-06’) Q1
LEFT JOIN
(SELECT player_id, age_possible
FROM datapipe.user_summary) Q2
ON Q1.player_id = Q2.player_id
)
)
where index 0
GROUP BY 1,2
)
)
)
where index = 1
)
)
WHERE abs(DATETIME_DIFF(MAX_DAY, ACTIVE_ON_DATE, DAY)) >= 7
ORDER BY 1,2
LikeLike