# Creating a distribution with a given hazard function.

Hi All

This week I write about how to create a distribution with a specified hazard function.

“Crayon math” is what I call it when I have shape in my mind that I want to express as a function; i.e., draw the shape with math. This can happen a lot when there is empirical data but there is value to describing it parametrically. In particular, recall Kaplan-Meier curves, in many cases it is useful to describe them parametrically. While it is possible to use Cox Proportional Hazards as noted before proportional hazards considers the order of events not the time between them. Thus in cases where the goal is to estimate not only the relative hazard but the absolute hazard it can be necessary to use parametric survival regression (think a GLM). The problem is to do that a distribution must be specified. That is a parameterization has to be chosen.

The normal way to pick a distribution is to chose semi-arbitrarily from usual suspects (Exponential, Weibull, Gamma, truncated normal and log normal). Sadly this choice is commonly made either for mathematical convenience or laziness (i.e, that was one of the options in R).

To do a better job it is worth looking at the shape of the empirical hazard function (I will make a post about this in the future ) and then build a distribution that matches it.

An illustrative example:

Consider the hazard function shown in figure 2; it starts high and asymptotes down to a constant value exponentially (I know this because I cheated, i.e., it is simulated data) .

Figure 1: a hazard function that asymptotes down to a constant value.

Starting with the hazard function $h(t)$ the goal is to get a distribution. As usual working backwards is far easier. Thus let $F(t)$ be the desired cumulative distribution. If it is assumed that $F(t) = 1- \exp(-H(t))$ where $d/dt H(t) = h(t)$ then the pdf is $f(t) = h(t) \exp(-H(t))$. The result is then that the hazard function is $h(t)$ as $f(t)/[1-F(t)] = h(t)$.

Thus in the example of Fig. 2 $h(t) = A - B C \exp(-C t)$ therefore $f(t) = [A - B C \exp(-C t)] \exp(-[A t + B exp( -C t )])$ and $F(t) = 1- \exp(- [A t + B exp( -C t )] )$.

Clearly not all $h(t)$ functions will work. The big restrictions are $H(t =0) = 0$ (to ensure that $F(t=0) = 1- \exp(-H(t=0)) = 1- \exp(-0)= 1-1 = 0$) and  $H(t = \infty) = \infty$ (to ensure that $F(t= \infty) = 1- \exp(-H(t= \infty)) = 1- \exp(- \infty)= 1-0 = 1$). In addition, the usual restriction that $h(t)$ is positive is required.

That is it for now.

# Weibull parameters for censored data using only SQL

Hi all

This post shows how to use SQL to get estimates of the expected event times given either the Exponential or Weibull distributions and right censored data. That might seem like a strange restriction however it is very powerful have estimates in a query instead of requiring R as they can be used directly in dashboard (in my experience with Tableau). Also if like me you are lazy being able to just have a query that answers your question is much nicer then having to analyze data from a query.

Before showing the query it might be helpful to describe what is being queried. Starting with the exponential distribution. Let ${\bf d}$ be a set of observed data; ${\bf d}$ has $n$ observations with $k$ event times and $n-k$ censored times. The likelihood function is $\mathcal{L}(\lambda) = \prod f_x (x= d_i)^{\delta_i} [1- F_x (x = d_j)]^{1- \delta_i} = \prod f_x (x= d_i)^{\delta_i} [S_x (x = d_j)]^{1- \delta_i}$ where $\delta_i =1$ if the ith observation is an observed event and 0 if it is censored. Thus if the event times are assumed to be exponentially distributed $\mathcal{L}(\lambda) = \prod \lambda^{\delta_i} \exp( - \lambda d_i \delta_i) \exp( - \lambda d_i [1- \delta_i] ) = \lambda^{\sum_{i = 1}^{n} \delta_i} exp( - \lambda \sum_{i = 1}^{n} d_i)$. If the likelihood is considered as function of $\lambda$ not ${\bf d}$ then it is recognizable as a gamma distribution; thus the expectation of $\lambda$ is $(\sum_{i = 1}^{n} \delta_i) / ( \sum_{i = 1}^{n} d_i ) = k / ( \sum_{i = 1}^{n} d_i)$. Note that is can also be found using the standard ML approach of solving the derivative of the like-hood set equal to zero.

For the Weibull distribution this a bit more complicated! Recall that the pdf ofr a Weibull is $f_x(x) = \lambda \gamma x^{ \gamma -1} \exp(- \lambda x^{\gamma})$ and the survival function is $S_x(x) = \exp(-\lambda x^{\gamma})$. Thus using the same process as shown with the exponential distribution (i.e., I am too lazy to write it again) the likelihood function of a similar data set is  $\mathcal{L}(\lambda, \gamma) = \prod [\lambda \gamma {d_i}^{\gamma-1}]^{\delta_i} \exp(- \lambda {d_i}^{\gamma-1})$. This does not correspond to a distribution with know properties when considered as a function of $\gamma$ and $\lambda$. However the derivative of the log likelihood can be solved to find an estimate right? And that estimate is going to be used the the query? We shall see!

$\log \mathcal{L}(\lambda, \gamma) = \sum \delta_i \log [\lambda \gamma {d_i}] + (\gamma -1) \sum \delta_i \log(d_i) - \lambda \sum {d_i}^{\gamma}$.  The two partial derivatives are $d/d \lambda \log \mathcal{L}(\lambda, \gamma) = k/\lambda - \sum {d_i}^{\gamma}$ and $d/d \gamma \log \mathcal{L}(\lambda, \gamma) = k/ \gamma + \sum \delta_i \log(d_i) - \lambda \sum {d_i}^{\gamma} \log(d_i)$. Using the first equation it is is to find that $\lambda = k / \sum {d_i}^{\gamma}$ however $\gamma$ can not be solved for algebraically.

The value of $\gamma$ that minimizes $| k/ \gamma + \sum \delta_i \log(d_i) - (k / \sum {d_i}^{\gamma}) \sum {d_i}^{\gamma} \log(d_i) |$ is the maximum likelihood estimate.

So this whole post comes down to how can a SQL query find the optimal value of $\gamma$. The answer is actually quite simple, brute force. I create a large number of $\gamma$ values and full join them to my data set. Then I rank the rows by their misfit to the objective function and take the best one! BAM!

This query assume there is a table with user ids, event times and user ages. The cent follows the same structure as the churn posts query.

SELECT age, gamma,lambda, n, r, o_f ,lambda_ex
FROM
(
SELECT age, gamma, n, r,lambda, o_f, lambda_ex,
row_number() over (partition by age order by o_f) as index
FROM

(
SELECT age, gamma, n, r, r/(SUM_U_to_gamma + SUM_t_to_gamma) as lambda, r/(SUM_U + SUM_t) as lambda_ex,
ABS(r/gamma + SUM_log_t - (r/ (SUM_U_to_gamma + SUM_t_to_gamma) )*(SUM_U_to_gamma_log_u + SUM_t_to_gamma_log_t)) as o_f
FROM
(
SELECT age, gamma,
sum(1) as n,
SUM(event) as r,

SUM(CASE WHEN event = 0 THEN (event_time::float) END) as SUM_U,
SUM(CASE WHEN event = 1 THEN (event_time::float) END) as SUM_t,

SUM(CASE WHEN event = 0 THEN (event_time::float)^gamma END) as SUM_U_to_gamma,
SUM(CASE WHEN event = 1 THEN (event_time::float)^gamma END) as SUM_t_to_gamma,

SUM(CASE WHEN event = 0 THEN ln(event_time::Float) END) as SUM_log_U,
SUM(CASE WHEN event = 1 THEN ln(event_time::Float) END) as SUM_log_t,

SUM(CASE WHEN event = 0 THEN ln(event_time::float)*event_time^gamma END) as SUM_U_to_gamma_log_u,
SUM(CASE WHEN event = 1 THEN ln(event_time::float)*event_time^gamma END) as SUM_t_to_gamma_log_t
FROM
(
(SELECT id, CASE WHEN next_time is null then 0 else 1 end as event,
datediff('sec', clean_time, coalesce(next_time,'2017-08-28 00:00:00') ) as event_time, age, 1 as key
FROM
(SELECT id, clean_time, age,
max(clean_time) over (partition by id order by clean_time rows between 1 following and 1 following) as next_time,
row_number() over (partition by id,age order by random() ) as index
FROM tab_event
and clean_time <= '2017-08-28 00:00:00'
)
where index <= 1
and age <= 7) Q1
LEFT JOIN
(SELECT random() as gamma, 1 as key
from tab_event
LIMIT 10000 ) Q2
ON Q1.key = Q2.key
)
where event_time > 0
GROUP BY 1,2
)
)
)
where index = 1
ORDER BY 1,2

That is it for now tune in next time for more survival analysis!

# An brief answer to the unsettling questions raised by interval censored data.

Hi All

So first of all sorry for the delay in the posts. Those that know me (i.e., have connected to me on facebook) know that I have recently changed my employer from Hothead games to Phoenix Labs. the process of this transition left me with little time for this blog. Now that I have had my last day at Hothead I can focus here a bit more.

FYI if you are reading this some time near 2017-09-09 and are data scientist you might want to apply to Hothead; they are a great company.

Moving on to the meats of this post.

While I was reading about interval censored data I kept having problems reconciling my understanding of likelihood function with the presented methods.  In short, what has been bugging me is how do cdfs and pdfs work together because they seem to have different assumptions. Let $F_x(x)$ be a cdf and $f_x(x)$ be a pdf; in a classic likelihood function with iid observations the probability that observation i is $p(d_i) = f_x(x = d_i)$. For an interval censored observation (lets say observation i must be in the interval $[x_i, x_i + \delta]$ ) the probability of is $p(d_i) = F_x(x = x_i + \delta) - F_x(x = x_i)$. Now as $\delta$ approaches zero in a common sense sort of way I want the two probabilities to be the same. However they are clearly different! $p(d_i) = F_x(x = x_i ) - F_x(x = x_i) = 0$ which is not in general $f_x(x = d_i)$.

So how is this problem resolved? It might be worth stopping here for a moment and thinking about this because it is not easy.

Figure 1: Mr Darcy and a chess board. This just here to give some space so you don’t read ahead while you have your moment.

Returning to the likelihood for a interval censored observation $p(d_i) = F_x(x = x_i + \delta) - F_x(x = x_i)$; this can be multiplied by 1 without issue. In this case $1 = \delta / \delta$. Thus $p(d_i) = [F_x(x = x_i + \delta) - F_x(x = x_i)] (\delta / \delta) = ([F_x(x = x_i + \delta) - F_x(x = x_i)] / \delta) \delta$. Now using L’Hôpital’s rule (by far the best spelt name ever!) the limit as $\delta$ approaches zero for the first term is $\lim_{\delta \rightarrow 0} ([F_x(x = x_i + \delta) - F_x(x = x_i)] / \delta) = f_x(x= d_i)$. Thus for an arbitrarily small $\delta$  $p(d_i) = F_x(x = x_i + \delta) - F_x(x = x_i) = f_x(x= d_i) \delta$.
For fequentists it is not exactly clear to me what happens to the $\delta$ term; to be sure for ML estimates $\delta$ cancel when the derivatives are solved to equal zero but who knows in what happens in general?
For Bayesians the $\delta$ term cancels with the evidence; .
To justify this recall that $\mathcal{Z}_j = \int d {\bf m}_j \pi({\bf m}_j) \mathcal{L}({\bf m}_j)$ so the $\delta$ will be in both the numerator and the denominator.