Weibull parameters for censored data using only SQL

Hi all

This post shows how to use SQL to get estimates of the expected event times given either the Exponential or Weibull distributions and right censored data. That might seem like a strange restriction however it is very powerful have estimates in a query instead of requiring R as they can be used directly in dashboard (in my experience with Tableau). Also if like me you are lazy being able to just have a query that answers your question is much nicer then having to analyze data from a query.

Before showing the query it might be helpful to describe what is being queried. Starting with the exponential distribution. Let {\bf d} be a set of observed data; {\bf d} has n observations with k event times and n-k censored times. The likelihood function is \mathcal{L}(\lambda) = \prod f_x (x= d_i)^{\delta_i} [1- F_x (x = d_j)]^{1- \delta_i} = \prod f_x (x= d_i)^{\delta_i} [S_x (x = d_j)]^{1- \delta_i} where \delta_i =1 if the ith observation is an observed event and 0 if it is censored. Thus if the event times are assumed to be exponentially distributed \mathcal{L}(\lambda) = \prod \lambda^{\delta_i} \exp( - \lambda  d_i \delta_i)  \exp( - \lambda d_i [1- \delta_i] ) = \lambda^{\sum_{i = 1}^{n} \delta_i} exp( - \lambda \sum_{i = 1}^{n} d_i). If the likelihood is considered as function of \lambda not {\bf d} then it is recognizable as a gamma distribution; thus the expectation of \lambda is (\sum_{i = 1}^{n} \delta_i) / ( \sum_{i = 1}^{n} d_i ) = k / ( \sum_{i = 1}^{n} d_i). Note that is can also be found using the standard ML approach of solving the derivative of the like-hood set equal to zero.

For the Weibull distribution this a bit more complicated! Recall that the pdf ofr a Weibull is f_x(x) = \lambda \gamma x^{ \gamma -1} \exp(- \lambda x^{\gamma}) and the survival function is S_x(x) = \exp(-\lambda x^{\gamma}). Thus using the same process as shown with the exponential distribution (i.e., I am too lazy to write it again) the likelihood function of a similar data set is  \mathcal{L}(\lambda, \gamma) = \prod [\lambda \gamma {d_i}^{\gamma-1}]^{\delta_i} \exp(- \lambda {d_i}^{\gamma-1}). This does not correspond to a distribution with know properties when considered as a function of \gamma and \lambda. However the derivative of the log likelihood can be solved to find an estimate right? And that estimate is going to be used the the query? We shall see!

\log \mathcal{L}(\lambda, \gamma) = \sum \delta_i \log [\lambda \gamma {d_i}] + (\gamma -1) \sum \delta_i \log(d_i) - \lambda \sum {d_i}^{\gamma}.  The two partial derivatives are d/d \lambda \log \mathcal{L}(\lambda, \gamma) = k/\lambda - \sum {d_i}^{\gamma} and d/d \gamma \log \mathcal{L}(\lambda, \gamma) = k/ \gamma + \sum \delta_i \log(d_i) - \lambda  \sum {d_i}^{\gamma} \log(d_i) . Using the first equation it is is to find that \lambda = k / \sum {d_i}^{\gamma} however \gamma can not be solved for algebraically.

The value of \gamma that minimizes | k/ \gamma + \sum \delta_i \log(d_i) - (k / \sum {d_i}^{\gamma})  \sum {d_i}^{\gamma} \log(d_i) | is the maximum likelihood estimate.

So this whole post comes down to how can a SQL query find the optimal value of \gamma. The answer is actually quite simple, brute force. I create a large number of \gamma values and full join them to my data set. Then I rank the rows by their misfit to the objective function and take the best one! BAM!

This query assume there is a table with user ids, event times and user ages. The cent follows the same structure as the churn posts query.

SELECT age, gamma,lambda, n, r, o_f ,lambda_ex
FROM
 (
 SELECT age, gamma, n, r,lambda, o_f, lambda_ex,
 row_number() over (partition by age order by o_f) as index
 FROM

(
 SELECT age, gamma, n, r, r/(SUM_U_to_gamma + SUM_t_to_gamma) as lambda, r/(SUM_U + SUM_t) as lambda_ex,
 ABS(r/gamma + SUM_log_t - (r/ (SUM_U_to_gamma + SUM_t_to_gamma) )*(SUM_U_to_gamma_log_u + SUM_t_to_gamma_log_t)) as o_f
 FROM
 (
 SELECT age, gamma,
 sum(1) as n,
 SUM(event) as r,

SUM(CASE WHEN event = 0 THEN (event_time::float) END) as SUM_U,
 SUM(CASE WHEN event = 1 THEN (event_time::float) END) as SUM_t,

SUM(CASE WHEN event = 0 THEN (event_time::float)^gamma END) as SUM_U_to_gamma,
 SUM(CASE WHEN event = 1 THEN (event_time::float)^gamma END) as SUM_t_to_gamma,

SUM(CASE WHEN event = 0 THEN ln(event_time::Float) END) as SUM_log_U,
 SUM(CASE WHEN event = 1 THEN ln(event_time::Float) END) as SUM_log_t,

SUM(CASE WHEN event = 0 THEN ln(event_time::float)*event_time^gamma END) as SUM_U_to_gamma_log_u,
 SUM(CASE WHEN event = 1 THEN ln(event_time::float)*event_time^gamma END) as SUM_t_to_gamma_log_t
 FROM
 (
 (SELECT id, CASE WHEN next_time is null then 0 else 1 end as event,
 datediff('sec', clean_time, coalesce(next_time,'2017-08-28 00:00:00') ) as event_time, age, 1 as key
 FROM
 (SELECT id, clean_time, age,
 max(clean_time) over (partition by id order by clean_time rows between 1 following and 1 following) as next_time,
 row_number() over (partition by id,age order by random() ) as index
 FROM tab_event
 and clean_time <= '2017-08-28 00:00:00'
 )
 where index <= 1
 and age <= 7) Q1
 LEFT JOIN
 (SELECT random() as gamma, 1 as key
 from tab_event
 LIMIT 10000 ) Q2
 ON Q1.key = Q2.key
 )
 where event_time > 0
 GROUP BY 1,2
 )
 )
 )
where index = 1
ORDER BY 1,2

That is it for now tune in next time for more survival analysis!

Advertisement

2 thoughts on “Weibull parameters for censored data using only SQL

  1. Pingback: An introduction or “vulgurization” on Churn | Bayesian Business Intelligence

  2. SELECT *
    FROM
    (
    SELECT active_on_date, power(ln(2),1/gamma)*power(lambda,(-1/gamma)) as MEDIAN_RET_TIME, log(2)/lambda_ex as MEDIAN_RET_TIME2, max(ACTIVE_ON_DATE) OVER () as MAX_DAY
    FROM
    (
    SELECT active_on_date, gamma,lambda, n, r, o_f ,lambda_ex
    FROM
    (
    SELECT active_on_date, gamma, n, r,lambda, o_f, lambda_ex,
    row_number() over (partition by active_on_date order by o_f) as index
    FROM

    (
    SELECT active_on_date, gamma, n, r, r/(SUM_U_to_gamma + SUM_t_to_gamma) as lambda, r/(SUM_U + SUM_t) as lambda_ex,
    ABS(r/gamma + SUM_log_t – (r/ (SUM_U_to_gamma + SUM_t_to_gamma) )*(SUM_U_to_gamma_log_u + SUM_t_to_gamma_log_t)) as o_f
    FROM
    (
    SELECT active_on_date, gamma,
    sum(1) as n,
    SUM(event) as r,

    SUM(CASE WHEN event = 0 THEN (event_time) END) as SUM_U,
    SUM(CASE WHEN event = 1 THEN (event_time) END) as SUM_t,

    SUM(CASE WHEN event = 0 THEN power(event_time,gamma) END) as SUM_U_to_gamma,
    SUM(CASE WHEN event = 1 THEN power(event_time,gamma) END) as SUM_t_to_gamma,

    SUM(CASE WHEN event = 0 THEN ln(event_time) END) as SUM_log_U,
    SUM(CASE WHEN event = 1 THEN ln(event_time) END) as SUM_log_t,

    SUM(CASE WHEN event = 0 THEN ln(event_time)*power(event_time, gamma) END) as SUM_U_to_gamma_log_u,
    SUM(CASE WHEN event = 1 THEN ln(event_time)*power(event_time, gamma) END) as SUM_t_to_gamma_log_t
    FROM
    (

    (SELECT player_id, CASE WHEN next_age is null then 0 else 1 end as event,
    COALESCE(next_age, age_possible) – age as event_time, active_on_date, 1 as key
    FROM
    (SELECT Q1.*, age_possible
    FROM
    (

    (SELECT player_id, active_on_date, age,
    max(age) over (partition by player_id order by age rows between 1 following and 1 following) as next_age,
    row_number() over (partition by player_id, active_on_date order by rand() ) as index
    FROM datapipe.cohort
    where active_on_date >= ‘2018-12-06’) Q1
    LEFT JOIN
    (SELECT player_id, age_possible
    FROM datapipe.user_summary) Q2
    ON Q1.player_id = Q2.player_id
    )
    )
    where index 0
    GROUP BY 1,2
    )
    )
    )
    where index = 1
    )
    )
    WHERE abs(DATETIME_DIFF(MAX_DAY, ACTIVE_ON_DATE, DAY)) >= 7
    ORDER BY 1,2

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s