This post shows how to use SQL to get estimates of the expected event times given either the Exponential or Weibull distributions and right censored data. That might seem like a strange restriction however it is very powerful have estimates in a query instead of requiring R as they can be used directly in dashboard (in my experience with Tableau). Also if like me you are lazy being able to just have a query that answers your question is much nicer then having to analyze data from a query.
Before showing the query it might be helpful to describe what is being queried. Starting with the exponential distribution. Let be a set of observed data; has observations with event times and censored times. The likelihood function is where if the ith observation is an observed event and 0 if it is censored. Thus if the event times are assumed to be exponentially distributed . If the likelihood is considered as function of not then it is recognizable as a gamma distribution; thus the expectation of is . Note that is can also be found using the standard ML approach of solving the derivative of the like-hood set equal to zero.
For the Weibull distribution this a bit more complicated! Recall that the pdf ofr a Weibull is and the survival function is . Thus using the same process as shown with the exponential distribution (i.e., I am too lazy to write it again) the likelihood function of a similar data set is . This does not correspond to a distribution with know properties when considered as a function of and . However the derivative of the log likelihood can be solved to find an estimate right? And that estimate is going to be used the the query? We shall see!
. The two partial derivatives are and . Using the first equation it is is to find that however can not be solved for algebraically.
The value of that minimizes is the maximum likelihood estimate.
So this whole post comes down to how can a SQL query find the optimal value of . The answer is actually quite simple, brute force. I create a large number of values and full join them to my data set. Then I rank the rows by their misfit to the objective function and take the best one! BAM!
This query assume there is a table with user ids, event times and user ages. The cent follows the same structure as the churn posts query.
SELECT age, gamma,lambda, n, r, o_f ,lambda_ex
SELECT age, gamma, n, r,lambda, o_f, lambda_ex,
row_number() over (partition by age order by o_f) as index
SELECT age, gamma, n, r, r/(SUM_U_to_gamma + SUM_t_to_gamma) as lambda, r/(SUM_U + SUM_t) as lambda_ex,
ABS(r/gamma + SUM_log_t - (r/ (SUM_U_to_gamma + SUM_t_to_gamma) )*(SUM_U_to_gamma_log_u + SUM_t_to_gamma_log_t)) as o_f
SELECT age, gamma,
sum(1) as n,
SUM(event) as r,
SUM(CASE WHEN event = 0 THEN (event_time::float) END) as SUM_U,
SUM(CASE WHEN event = 1 THEN (event_time::float) END) as SUM_t,
SUM(CASE WHEN event = 0 THEN (event_time::float)^gamma END) as SUM_U_to_gamma,
SUM(CASE WHEN event = 1 THEN (event_time::float)^gamma END) as SUM_t_to_gamma,
SUM(CASE WHEN event = 0 THEN ln(event_time::Float) END) as SUM_log_U,
SUM(CASE WHEN event = 1 THEN ln(event_time::Float) END) as SUM_log_t,
SUM(CASE WHEN event = 0 THEN ln(event_time::float)*event_time^gamma END) as SUM_U_to_gamma_log_u,
SUM(CASE WHEN event = 1 THEN ln(event_time::float)*event_time^gamma END) as SUM_t_to_gamma_log_t
(SELECT id, CASE WHEN next_time is null then 0 else 1 end as event,
datediff('sec', clean_time, coalesce(next_time,'2017-08-28 00:00:00') ) as event_time, age, 1 as key
(SELECT id, clean_time, age,
max(clean_time) over (partition by id order by clean_time rows between 1 following and 1 following) as next_time,
row_number() over (partition by id,age order by random() ) as index
and clean_time <= '2017-08-28 00:00:00'
where index <= 1
and age <= 7) Q1
(SELECT random() as gamma, 1 as key
LIMIT 10000 ) Q2
ON Q1.key = Q2.key
where event_time > 0
GROUP BY 1,2
where index = 1
ORDER BY 1,2
That is it for now tune in next time for more survival analysis!