# Proportional Hazards (unpacking equation 12)

Hi All

This week I am looking at proportional hazards regression; in particular, I am trying to clarify the conditional likelihood formula.

Before diving into that, please note that proportional hazards regression, also called Cox proportional hazards (CoxPH), was first presented in 1972 in “Regression models and life-tables”, which has been cited ~46,000 times! So Dr Cox’s contribution with this one paper (he has a lot of other great papers as well) really is extraordinary.

That said, I did not find it the clearest of papers; in particular, equation 12 took me some time to understand. Figure 1 shows an excerpt of the paper.

Figure 1: an excerpt of Cox’s 1972 “regression models and life tables”

The point that Cox is getting at is this: if the hazard function ($\lambda(t)$ in his notation and $h(t)$ in ours) is completely unknown but the order of failures is known, can inference still be conducted? The answer is, of course, yes. Cox proposes the conditional likelihood function; that is, the probability of the order of failures, or, to put it another way, the likelihood conditioned on the observed failure times. Thus Cox’s equation 12 is apparently the probability that the i’th phone call ended given that exactly one phone call ended at that time. This post explores his assertion.

So to start: in CoxPH the hazard function is not assumed to have any particular form; however, it is assumed that all phone calls have the same form of hazard function, differing only by a parametrically defined proportion; i.e., $h_{(i)}(t) = \exp({\bf z}_{(i)} {\bf m}) h_0(t)$ is the hazard function for the i’th ordered observation.

The factor $\exp({\bf z}_{(i)}{\bf m})$ gives the hazard of the i’th observation relative to the baseline hazard $h_0(t)$. This is the “proportional” in proportional hazards.
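To make the “proportional” part concrete, here is a minimal sketch. The coefficients ${\bf m}$, covariates ${\bf z}$, and baseline hazard below are all made-up illustrative values; the point is only that the hazard ratio between two calls does not depend on $t$:

```python
import numpy as np

# Hazard under the proportional form h_i(t) = exp(z_i . m) * h_0(t).
# The coefficients m, covariates z, and baseline hazard h0 are all made up.
def hazard(z, m, t, h0=lambda t: 0.5 + 0.1 * t):
    return np.exp(np.dot(z, m)) * h0(t)

m = np.array([0.8, -0.3])    # hypothetical regression coefficients
z_a = np.array([1.0, 2.0])   # covariates for call A
z_b = np.array([0.0, 1.0])   # covariates for call B

# The ratio of two calls' hazards is constant in t: the "proportional" part.
for t in [0.5, 1.0, 5.0]:
    print(hazard(z_a, m, t) / hazard(z_b, m, t))   # same value every time
```

Whatever shape $h_0(t)$ takes, it cancels in the ratio, which is why it can be left completely unspecified.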

Returning to equation 12 and translating it into this blog’s notation, we get

$$\frac{\exp({\bf z}_{(i)} {\bf m})}{\sum_{k \in R(t_{(i)})} \exp({\bf z}_{(k)} {\bf m})}.$$

This is the proportional term divided by the sum of the proportional terms for the risk set [$R(t)$]. The risk set is the set of observations with event times at least that of the i’th observation. By multiplying the top and bottom of the fraction by the baseline hazard $h_0(t_{(i)})$ it becomes clear that this is the hazard of the ending phone call divided by the total hazard of all phone calls still active immediately before time $t_{(i)}$.
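As a sanity check, here is a short sketch of one term of equation 12. All durations, covariates, and coefficients below are made up for illustration:

```python
import numpy as np

# One term of Cox's equation 12 in this blog's notation:
# exp(z_i . m) / sum over the risk set R(t_i) of exp(z_k . m).
# All durations, covariates, and coefficients are made up for illustration.
def eq12_term(i, times, z, m):
    risk_set = [k for k in range(len(times)) if times[k] >= times[i]]
    num = np.exp(np.dot(z[i], m))
    den = sum(np.exp(np.dot(z[k], m)) for k in risk_set)
    return num / den

times = np.array([2.0, 5.0, 3.0, 7.0])      # observed call durations
z = np.array([[0.5], [1.0], [0.0], [1.5]])  # one covariate per call
m = np.array([0.7])                          # hypothetical coefficient

# The earliest-ending call's risk set is every call; the last call's is itself.
print(eq12_term(0, times, z, m))
print(eq12_term(3, times, z, m))   # 1.0: the last call is alone in its risk set
```

Note that the baseline hazard never appears, which is exactly what lets Cox estimate ${\bf m}$ without specifying $h_0(t)$.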

Intuitively, this should make sense: the probability of phone call i ending, given that exactly one of the phone calls ended, is the hazard of the i’th phone call divided by the total hazard of all active phone calls. But sadly, intuitive is not really good enough. So, continuing to peel back some of these layers and using the definition of a hazard function:

$$\frac{h_{(i)}(t_{(i)})}{\sum_{k \in R(t_{(i)})} h_{(k)}(t_{(i)})} = \frac{\lim_{\delta \to 0} P[t_{(i)} \le T_{(i)} < t_{(i)} + \delta \mid T_{(i)} \ge t_{(i)}]/\delta}{\sum_{k \in R(t_{(i)})} \lim_{\delta \to 0} P[t_{(i)} \le T_{(k)} < t_{(i)} + \delta \mid T_{(k)} \ge t_{(i)}]/\delta}.$$

Cancelling the common $1/\delta$ factors and writing $P[T_{(k)} = t_{(i)}]$ loosely for the limiting probability, this can be rewritten as

$$\frac{P[T_{(i)} = t_{(i)}]/P[T_{(i)} \ge t_{(i)}]}{\sum_{k \in R(t_{(i)})} P[T_{(k)} = t_{(i)}]/P[T_{(k)} \ge t_{(i)}]}$$

using Bayes rule [$P(A|B) = P(A \cap B)/P(B)$]. Every member of the risk set is still active immediately before $t_{(i)}$, so each $P[T_{(k)} \ge t_{(i)}]$ term equals one (this deals with the conditioning), and the probabilities in the fraction can be interpreted as

$$\frac{P[T_{(i)} = t_{(i)}]}{\sum_{k \in R(t_{(i)})} P[T_{(k)} = t_{(i)}]}.$$

Making the additional assumptions that the event times are independent of each other and that there is exactly one event at time $t_{(i)}$, the denominator can be interpreted as the probability that there is one event at time $t_{(i)}$; this results in further simplification:

$$\frac{P[T_{(i)} = t_{(i)}]}{P[\text{exactly one event at } t_{(i)}]}.$$

Finally, using Bayes rule again [the event $T_{(i)} = t_{(i)}$ is contained in the event that exactly one call ends at $t_{(i)}$], the fraction can be written as a single conditional probability:

$$P[T_{(i)} = t_{(i)} \mid \text{exactly one event at } t_{(i)}].$$

So CoxPH is correct and ~46,000 papers don’t need an errata! But, as I said, I would have liked a little more expansion in the original paper.
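One way to convince yourself of the result numerically: assume (purely for tractability, not something the model requires) a constant baseline hazard $h_0(t) = 1$, so each call’s lifetime is exponential with rate $\exp({\bf z}_{k}{\bf m})$. For exponentials, the probability that a given call is the first to end equals its rate over the total rate, which is exactly its equation-12 term, and a quick Monte Carlo confirms it:

```python
import numpy as np

# Monte Carlo check of the derivation, assuming (purely for convenience) a
# constant baseline hazard h_0(t) = 1, so call k's lifetime is exponential
# with rate exp(z_k . m). The chance that call k ends first should then match
# its equation-12 term: rate_k / sum of all rates. Covariates and m are made up.
rng = np.random.default_rng(0)
m = np.array([0.6])
z = np.array([[0.0], [1.0], [2.0]])
rates = np.exp(z @ m)                  # constant hazards exp(z_k . m)

n = 200_000
lifetimes = rng.exponential(scale=1.0 / rates, size=(n, 3))
first = lifetimes.argmin(axis=1)       # which call ended first in each draw
empirical = np.bincount(first, minlength=3) / n
predicted = rates / rates.sum()

print(empirical)
print(predicted)                       # the two should agree closely
```

The exponential case makes the “hazard over total hazard” intuition exact, but the derivation above shows it holds at each event time for any shared baseline hazard.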

Basically that is everything I wanted to get through; however... I like to talk too much / over-think things. In particular, consider the assumption that there is only one event at each time. In the proof this leads to the probability of an event happening at the observed time being the sum of the individual probabilities; however, in situations where there could be more than one event at a given time (likely due to sloppy data collection) that is clearly not true (Cox and the reviewers discuss this in the paper a bit). The probability that at least one observation has event time $t_{(i)}$ is then $1- \prod_k(1-P[\text{Obs. k has event time } t_{(i)}])$, which by inclusion-exclusion can be no larger than the summed probability assumed above. In addition, the numerator would then be not just the probability that observation i has event time $t_{(i)}$ but the probability that all of the observations that end at $t_{(i)}$ do so, and this product of small probabilities must be much smaller. So it does look like ties would change the conditional likelihood a lot.
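Here is a small numeric illustration of the tie discussion, with made-up per-call probabilities of ending in the same small interval:

```python
# Probability that at least one of several calls ends in the same small
# interval, versus the one-event-at-a-time approximation (the plain sum).
# The per-call probabilities below are made up for illustration.
p = [0.10, 0.05, 0.02]

approx = sum(p)                        # sum used under the one-event assumption
exact = 1.0
for pk in p:
    exact *= (1.0 - pk)
exact = 1.0 - exact                    # 1 - prod(1 - p_k), allowing ties

print(approx, exact)  # the exact probability is never larger than the sum
```

As the interval shrinks the two agree to first order, which is why the one-event assumption is harmless for truly continuous event times and only bites when the data are coarsely recorded.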

Tune in next time for exciting Bayesian Business intelligence (I promise next week there will actually be a Bayesian component to this)!