# Weibull parameters for censored data using only SQL

Hi all

This post shows how to use SQL to estimate expected event times under either the exponential or Weibull distribution with right-censored data. That might seem like a strange restriction, but having the estimates come straight out of a query instead of requiring R is very powerful: they can be used directly in a dashboard (in my experience, Tableau). Also, if like me you are lazy, having a single query that answers your question is much nicer than having to analyze the output of a query separately.

Before showing the query it might be helpful to describe what is being queried, starting with the exponential distribution. Let ${\bf d}$ be a set of observed data; ${\bf d}$ has $n$ observations with $k$ event times and $n-k$ censored times. The likelihood function is $\mathcal{L}(\lambda) = \prod f_x (x= d_i)^{\delta_i} [1- F_x (x = d_i)]^{1- \delta_i} = \prod f_x (x= d_i)^{\delta_i} [S_x (x = d_i)]^{1- \delta_i}$, where $\delta_i = 1$ if the $i$th observation is an observed event and $0$ if it is censored. Thus if the event times are assumed to be exponentially distributed, $\mathcal{L}(\lambda) = \prod \lambda^{\delta_i} \exp( - \lambda d_i \delta_i) \exp( - \lambda d_i [1- \delta_i] ) = \lambda^{\sum_{i = 1}^{n} \delta_i} \exp( - \lambda \sum_{i = 1}^{n} d_i)$. If the likelihood is considered as a function of $\lambda$ rather than ${\bf d}$, it is recognizable as a gamma kernel, and its mode, $(\sum_{i = 1}^{n} \delta_i) / ( \sum_{i = 1}^{n} d_i ) = k / ( \sum_{i = 1}^{n} d_i)$, is the estimate of $\lambda$. Note that the same estimate follows from the standard ML approach of setting the derivative of the log-likelihood equal to zero.
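As a quick sanity check on that estimator, here is the same computation done outside SQL on simulated data. The true rate, censoring time, and sample size are all made-up values for illustration:

```python
import random

random.seed(0)
lam_true = 0.5    # hypothetical true rate
censor_at = 3.0   # fixed right-censoring time

# Simulate n users: observe min(event time, censor time) plus an event flag delta.
n = 20000
d, delta = [], []
for _ in range(n):
    t = random.expovariate(lam_true)
    d.append(min(t, censor_at))
    delta.append(1 if t <= censor_at else 0)

# The estimator from above: lambda_hat = k / sum(d_i), with k = number of events.
k = sum(delta)
lam_hat = k / sum(d)
print(lam_hat)  # should land near lam_true
```

Note that the censored rows contribute only to the denominator, which is exactly why the estimator drops out so cleanly.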

For the Weibull distribution this is a bit more complicated! Recall that the pdf for a Weibull is $f_x(x) = \lambda \gamma x^{ \gamma -1} \exp(- \lambda x^{\gamma})$ and the survival function is $S_x(x) = \exp(-\lambda x^{\gamma})$. Using the same process as shown for the exponential distribution (i.e., I am too lazy to write it again), the likelihood function for a similar data set is $\mathcal{L}(\lambda, \gamma) = \prod [\lambda \gamma {d_i}^{\gamma-1}]^{\delta_i} \exp(- \lambda {d_i}^{\gamma})$. This does not correspond to a distribution with known properties when considered as a function of $\gamma$ and $\lambda$. However, the derivative of the log likelihood can be solved to find an estimate, right? And that estimate is going to be used in the query? We shall see!

$\log \mathcal{L}(\lambda, \gamma) = \sum \delta_i \log [\lambda \gamma {d_i}] + (\gamma -1) \sum \delta_i \log(d_i) - \lambda \sum {d_i}^{\gamma}$.  The two partial derivatives are $d/d \lambda \log \mathcal{L}(\lambda, \gamma) = k/\lambda - \sum {d_i}^{\gamma}$ and $d/d \gamma \log \mathcal{L}(\lambda, \gamma) = k/ \gamma + \sum \delta_i \log(d_i) - \lambda \sum {d_i}^{\gamma} \log(d_i)$. Using the first equation it is is to find that $\lambda = k / \sum {d_i}^{\gamma}$ however $\gamma$ can not be solved for algebraically.

The value of $\gamma$ that minimizes $| k/ \gamma + \sum \delta_i \log(d_i) - (k / \sum {d_i}^{\gamma}) \sum {d_i}^{\gamma} \log(d_i) |$ is the maximum likelihood estimate.

So this whole post comes down to: how can a SQL query find the optimal value of $\gamma$? The answer is actually quite simple: brute force. I generate a large number of candidate $\gamma$ values and cross join them to the data set (via a join on a constant key). Then I rank the rows by their misfit to the objective function and take the best one! BAM!
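Before writing it in SQL, the brute-force idea is easy to sketch in Python on simulated Weibull data. The true parameters, grid spacing, and grid range are arbitrary choices for illustration:

```python
import math
import random

random.seed(1)
gamma_true, lam_true = 1.7, 0.25  # hypothetical true parameters
censor_at = 4.0

# Simulate right-censored Weibull data: S(x) = exp(-lam * x**gamma),
# so inverting the survival function gives X = (-ln(U) / lam)**(1 / gamma).
n = 5000
d, delta = [], []
for _ in range(n):
    u = 1.0 - random.random()  # uniform in (0, 1], avoids log(0)
    t = (-math.log(u) / lam_true) ** (1.0 / gamma_true)
    d.append(min(t, censor_at))
    delta.append(1 if t <= censor_at else 0)
k = sum(delta)

def objective(g):
    # |k/g + sum(delta_i * ln d_i) - lam(g) * sum(d_i**g * ln d_i)|
    # with the profiled lam(g) = k / sum(d_i**g) substituted in.
    s_dg = sum(x ** g for x in d)
    s_log_t = sum(dl * math.log(x) for x, dl in zip(d, delta))
    s_dg_log = sum((x ** g) * math.log(x) for x in d)
    return abs(k / g + s_log_t - (k / s_dg) * s_dg_log)

# Brute force: score a grid of candidate gammas and keep the best.
grid = [0.01 * i for i in range(1, 301)]
gamma_hat = min(grid, key=objective)
lam_hat = k / sum(x ** gamma_hat for x in d)
print(gamma_hat, lam_hat)  # should land near the true values
```

The grid plays the role of the joined candidate table in the query, and `min(grid, key=objective)` plays the role of the `row_number()` ranking.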

This query assumes there is a table with user ids, event times, and user ages. The setup follows the same structure as the query in the churn post.

SELECT age, gamma,lambda, n, r, o_f ,lambda_ex
FROM
(
SELECT age, gamma, n, r,lambda, o_f, lambda_ex,
row_number() over (partition by age order by o_f) as index
FROM

(
SELECT age, gamma, n, r, r/(SUM_U_to_gamma + SUM_t_to_gamma) as lambda, r/(SUM_U + SUM_t) as lambda_ex,
ABS(r/gamma + SUM_log_t - (r/ (SUM_U_to_gamma + SUM_t_to_gamma) )*(SUM_U_to_gamma_log_u + SUM_t_to_gamma_log_t)) as o_f
FROM
(
SELECT age, gamma,
sum(1) as n,
SUM(event) as r,

SUM(CASE WHEN event = 0 THEN (event_time::float) END) as SUM_U,
SUM(CASE WHEN event = 1 THEN (event_time::float) END) as SUM_t,

SUM(CASE WHEN event = 0 THEN (event_time::float)^gamma END) as SUM_U_to_gamma,
SUM(CASE WHEN event = 1 THEN (event_time::float)^gamma END) as SUM_t_to_gamma,

SUM(CASE WHEN event = 0 THEN ln(event_time::Float) END) as SUM_log_U,
SUM(CASE WHEN event = 1 THEN ln(event_time::Float) END) as SUM_log_t,

SUM(CASE WHEN event = 0 THEN ln(event_time::float)*event_time^gamma END) as SUM_U_to_gamma_log_u,
SUM(CASE WHEN event = 1 THEN ln(event_time::float)*event_time^gamma END) as SUM_t_to_gamma_log_t
FROM
(
(SELECT id, CASE WHEN next_time is null then 0 else 1 end as event,
datediff('sec', clean_time, coalesce(next_time,'2017-08-28 00:00:00') ) as event_time, age, 1 as key
FROM
(SELECT id, clean_time, age,
max(clean_time) over (partition by id order by clean_time rows between 1 following and 1 following) as next_time,
row_number() over (partition by id,age order by random() ) as index
FROM tab_event
where clean_time <= '2017-08-28 00:00:00'
) base
where index <= 1
and age <= 7) Q1
LEFT JOIN
(SELECT random() as gamma, 1 as key
from tab_event
LIMIT 10000 ) Q2
ON Q1.key = Q2.key
)
where event_time > 0
GROUP BY 1,2
) agg
) fits
) ranked
where index = 1
ORDER BY 1,2
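For anyone without a warehouse handy, the core of the cross-join pattern also runs in SQLite from Python. Here `ln` and `power` are registered as user-defined functions (SQLite may lack them unless built with math functions), the table contents are made-up toy data, and a fixed grid replaces `random()` so the run is reproducible:

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
# Register the math functions the query needs.
conn.create_function("ln", 1, math.log)
conn.create_function("power", 2, lambda x, y: x ** y)

# Toy data: one row per user with an event time and an event/censored flag.
conn.execute("CREATE TABLE obs (event_time REAL, event INTEGER)")
conn.executemany("INSERT INTO obs VALUES (?, ?)",
                 [(1.2, 1), (0.7, 1), (2.5, 0), (3.1, 1), (0.4, 0), (1.9, 1)])

# The candidate-gamma grid that gets cross joined to every observation.
conn.execute("CREATE TABLE grid (gamma REAL)")
conn.executemany("INSERT INTO grid VALUES (?)", [(0.05 * i,) for i in range(1, 61)])

# Cross join data to candidates, aggregate per gamma, rank by objective, keep best.
gamma_hat, lam_hat, o_f = conn.execute("""
SELECT gamma, r / s_dg AS lambda,
       ABS(r / gamma + s_log_t - (r / s_dg) * s_dg_log) AS o_f
FROM (
  SELECT gamma,
         CAST(SUM(event) AS REAL) AS r,
         SUM(power(event_time, gamma)) AS s_dg,
         SUM(CASE WHEN event = 1 THEN ln(event_time) END) AS s_log_t,
         SUM(ln(event_time) * power(event_time, gamma)) AS s_dg_log
  FROM obs CROSS JOIN grid
  GROUP BY gamma
) agg
ORDER BY o_f
LIMIT 1
""").fetchone()
print(gamma_hat, lam_hat, o_f)
```

With only six rows this is a toy, but the shape of the query (aggregate per candidate, order by misfit, limit 1) is the same trick as above.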

That is it for now; tune in next time for more survival analysis!

## 2 thoughts on “Weibull parameters for censored data using only SQL”

1. SELECT *
FROM
(
SELECT active_on_date, power(ln(2),1/gamma)*power(lambda,(-1/gamma)) as MEDIAN_RET_TIME, log(2)/lambda_ex as MEDIAN_RET_TIME2, max(ACTIVE_ON_DATE) OVER () as MAX_DAY
FROM
(
SELECT active_on_date, gamma,lambda, n, r, o_f ,lambda_ex
FROM
(
SELECT active_on_date, gamma, n, r,lambda, o_f, lambda_ex,
row_number() over (partition by active_on_date order by o_f) as index
FROM

(
SELECT active_on_date, gamma, n, r, r/(SUM_U_to_gamma + SUM_t_to_gamma) as lambda, r/(SUM_U + SUM_t) as lambda_ex,
ABS(r/gamma + SUM_log_t - (r/ (SUM_U_to_gamma + SUM_t_to_gamma) )*(SUM_U_to_gamma_log_u + SUM_t_to_gamma_log_t)) as o_f
FROM
(
SELECT active_on_date, gamma,
sum(1) as n,
SUM(event) as r,

SUM(CASE WHEN event = 0 THEN (event_time) END) as SUM_U,
SUM(CASE WHEN event = 1 THEN (event_time) END) as SUM_t,

SUM(CASE WHEN event = 0 THEN power(event_time,gamma) END) as SUM_U_to_gamma,
SUM(CASE WHEN event = 1 THEN power(event_time,gamma) END) as SUM_t_to_gamma,

SUM(CASE WHEN event = 0 THEN ln(event_time) END) as SUM_log_U,
SUM(CASE WHEN event = 1 THEN ln(event_time) END) as SUM_log_t,

SUM(CASE WHEN event = 0 THEN ln(event_time)*power(event_time, gamma) END) as SUM_U_to_gamma_log_u,
SUM(CASE WHEN event = 1 THEN ln(event_time)*power(event_time, gamma) END) as SUM_t_to_gamma_log_t
FROM
(

(SELECT player_id, CASE WHEN next_age is null then 0 else 1 end as event,
COALESCE(next_age, age_possible) - age as event_time, active_on_date, 1 as key
FROM
(SELECT Q1.*, age_possible
FROM
(

(SELECT player_id, active_on_date, age,
max(age) over (partition by player_id order by age rows between 1 following and 1 following) as next_age,
row_number() over (partition by player_id, active_on_date order by rand() ) as index
FROM datapipe.cohort
where active_on_date >= '2018-12-06') Q1
LEFT JOIN
(SELECT player_id, age_possible
FROM datapipe.user_summary) Q2
ON Q1.player_id = Q2.player_id
)
)
where index <= 1 and event_time > 0
GROUP BY 1,2
)
)
)
where index = 1
)
)
WHERE abs(DATETIME_DIFF(MAX_DAY, ACTIVE_ON_DATE, DAY)) >= 7
ORDER BY 1,2
