# The Central Limit Theorem; How to measure information loss.

Hi All

This might not be the last post for a while as I have begun to start thinking about commencing to consider writing a proposal for a text book. And such a committed undertaking will occupy  (and has occupied) most of my writing time. This post only exists because I having been trying to clarify my understanding of some basic statistical concepts.

The Central Limit Theorem (CLT) states that the average (or sum) of a collection of random variables from any distribution (with a defined first and second moment; i.e., this does not Cauchy distribution) will be normally distributed. Formally if $x_1, x_2, ..., x_n$ be an independent, identically distributed set of $n$ random variables with PDF $f_x(x; \mu, \sigma^2)$, where $\mu$ and $\sigma^2$ and the mean and variance of $f_x$, respectively  then $\bar{x}$ ~ Norm($\mu, \sigma^2 / n$) as $n \rightarrow \inf$.

In practice the CLT is the justification for the use of the z- and t-tests. Basically I don’t have to care what the underlying distribution of some process as the average of its observably random variables will be Gaussian. While I use t-tests (or more commonly evidence tests ) for split (or AB) testings I have never been truly comfortable with the idea of the CLT. In particular I don’t doubt the various proofs that it works, however I have never seen any non-experimental process to show how fast it becomes true. For example the sum of 12 random uniform random variables makes a near perfect Gaussian and the sum of 12 log-normal random variables does not.  This post shows a process to evaluating the convergence (information loss) of the CLT. The process is demonstrated with a set of exponentially distributed variables.

The process has two steps. The first step is to derive the distribution of the average of the observed random variables $f_t(t)$. The second step is to calculate the Kullback–Leibler divergence of the $f_t(t)$ and a Gaussian as a function on $n$. Note that here the math is tractable and so an analytic result is derived however both these integrals could be conducted numerically.

Step 1) Finding $f_t(t)$: Let $x_1, x_2, ..., x_n$ be an independent, identically distributed set of exponentially distributed random variables ($x_i$ ~ $\lambda \exp[- \lambda x_i ]$). Let $z_k = x_1+x_2+...x_k$. The distribution of $z_k$ is $f_{z_k}(z_k) = \int_0^{z_k} dx_k f_{z_{k-1}}(z_k - x_k)f_x(x_k)$; i.e., each random variable is integrated out in turn into the sum.

To find $f_{z_k}(z_k)$ it is, as usual, easiest to start by knowing the answer and work back wards; I got to answer in a ponderous and indirect path so I will spare you the details but it involved failing to sleep on an uncomfortable couch. Let $f_{z_k}(z_k) = \lambda^{k+1} (z_k)^{k} \exp(- \lambda z_k) / \Gamma(k+1)$; thus $f_{z_{k+1}}(z_{k+1}) = \int_0^{z_{k+1}} dx_{k+1} [ \lambda^{k+1} (z_{k+1} - x_{k+1})^{k} \exp(- \lambda [z_{k+1} - x_{k+1}]) / \Gamma(k+1) ]$ $\times \lambda \exp(- \lambda x_{k+1} )$. A small amount of algebra leads to $f_{z_{k+1}}(z_{k+1}) =[\lambda^{k+2} \exp(- \lambda z_{k+1}) / \Gamma(k+1)] \int_0^{z_{k+1}} dx_{k+1} (z_{k+1} - x_{k+1})^{k}$.

The remaining integral is trivial (i.e., wolfram alpha can solve it for me… ok in this case it is just a straight substitution of variables but normally that is what trivial integral means). Thus $f_{z_{k+1}}(z_{k+1}) =[\lambda^{k+2} \exp(- \lambda z_{k+1}) / \Gamma(n+1)] [(z_{k+1}-x_{k+1})^{k+1} /(k+1) ]_0^{z_{k+1}}$ $= [\lambda^{k+2} \exp(- \lambda z_{k+1}) / \Gamma(k+1)][z_{k+1}^{k+1} / (k+1)]$. And one more “thus” leads to $f_{z_{k+1}}(z_{k+1}) = \lambda^{k+2} (z_{k+1})^{k+1} \exp(- \lambda z_{k+1}) / \Gamma(k+2)$. Since $f_x(x) = \lambda \exp(- \lambda x_i )$ is also of the form of $f_{z_k}(z_k)$ the principle of induction shows this is always valid and all that I have to do is turn the “k“s to “n“s. I debated leaving that as an exercise to the reader but here it is $f_{z_n}(z_n) = \lambda^{n+1} (z_n)^{n} \exp(- \lambda z_n) / \Gamma(n+1)$.

Next since $z_n$ is a sum not an average it must be transformed to $t = z_n / n = g(z_n)$. Recall that for monotonic functions changes of variables are conducted using $f_y(y) = |(d/dy) g^{-1}(y) |f_x(g^{-1}(y))$. So $f_t(t) = \lambda^{n+1} n^{n+1} (z_n)^{n} \exp(- [\lambda n] z_n) / \Gamma(n+1)$. For the sake of notation let $\beta = \lambda n$;  $f_t(t) = \beta^{n+1} (z_n)^{n} \exp(- \beta z_n) / \Gamma(n+1)$.

That is the distribution of the average of n exponentially distributed random variables. I was slightly surprised to note that it is a Gamma distribution as I had not previously realized that a Gamma was sum of  exponential distributions!

Step 2) Calculate the KL divergence between $f_t$ and a Gaussian $f_g(t)$ with the same mean and variance:

The mean and variance of $f_t$ are $(n+1)/\beta$ and $(n+1)/\beta^2$. The KL divergence between two distributions $f_1(x)$ and $f2_(x)$ is $\int_{\text{range}} dx f_1(x) [\log(f_1[x]) - \log(f_2[x])]$. So the KL divergence is the expected log-ratio of the two distributions with respect to the first. It can be thought of a a measure of the difference between the two distributions weighted by the relative importance of the first. It has few nice properties. First it is always positive unless the two distributions are the same at which point is is zero. It works for both continuous and discrete distributions. It commonly is used to measure information gained from one distribution to another (normally the gain going from the prior to the posterior). Thus here it can be thought of as the information lost by using the CLT Gaussian approximation.  Thus here the KL divergence is $\int_0^{\inf} dt f_t(t; \beta) [log(f_t[t; \beta]) - \log(f_g[t;(n+1)/\beta, (n+1)/\beta^2 ])]$.

This integral large but straight forward. The first step is to group all of the constant terms into one place; let $\alpha = n \log(\beta) +0.5 \log(n+1) - \log( \Gamma[n+1] ) + 0.5 \log(2 \pi)$. Thus the integral can be rewritten as the more manageable $\int_0^{\inf} dt [\beta^{n+1} (z_n)^{n} \exp(- \beta z_n) / \Gamma(n+1)]$ $\times [\alpha + n \log(t) - \beta t +0.5 (t-[n+1]/ \beta)^2/([n+1]/ \beta^2) ]$. Which can be parsed as $\alpha + E[\log(t)] - \beta E[t] + \beta^2/(2[n+1])Var[t]$, where all the expectations and variances are with respect to $f_t(t)$. Using the many known properties of the Gamma distribution all terms can be evaluated. Thus the KL divergence is $\Delta(n) = 0.5 \log(2 \pi) + 0.5 + 0.5 \log (n+1) - \log( \Gamma[n+1]) + n \Psi (n+1) -(n+1)$, where $\Psi(x)$ is the digamma function (very similar to the harmonic series, i.e. $\sum 1/k$ ).

The final function $\Delta(n)$ is messy but it does give an objective measure of the difference between the true distribution of the average of observed exponential random variables and the CLT assumed Gaussian distribution. It is there for possible to use this method to check, directly!, if the n for a given problem is large enough to make the CLT valid.

Also as noted in the introduction while I chose to do this algebraically because it is raining you could do both sets of integrals numerical.

Cheers That is it for now.

P.S. Stay tuned for more text book developments!

# Calculating Bayesian Evidence for Cox Proportional Hazards.

Our perception of time is arbitrary [citation needed].

It objectively has not been as long as it seems since I wrote a post.

This post shows how to use importance sampling to calculate Bayesian evidence and then applies that to Cox PH models.

Starting with importance sampling or “IS” (pronounced [iz]) as it likes to be called; this leads to sentences that are difficult to parse.  IS is (see how fun “IS is” is?) a sampling strategy that involves drawing samples from a similar distribution and re weighting the samples such that their importance corespondents to the target distribution.  This is useful in the ubiquitous Bayesian situation where it is possible to evaluate a posterior distribution but not directly sample from it.

Formally, if $f(x)$ is a function proportional to the target distribution, $g(x)$ is a distribution that is similar to $f(x)$ [$g(x) > 0 \, \forall x$ such that $f(x) > 0$ is the only necessary condition but it works faster if the two distributions share modes and variances], and ${\bf x}$ is a random sample drawn from $g(x)$ with i’th element $x_{i}$ then the importance weight for $x_i$, $w_i = f(x_i)/g(x_i)$.

The importance weight are used to evaluate the expectation of functions with respect to $f(x)$. For example, the mean of $f(x)$ can be estimated as $\sum x_i w_i / \sum w_i$. The expectation of an arbitrary function $Q(x)$ can estimated as  $\sum Q(x_i) w_i / \sum w_i$. However in this post I am not interested in evaluating any the expectation of functions; as noted in the title this is about calculating the Bayesian Evidence.

It is obvious (i.e., someone pointed it out to me before I noticed) that the $n^{-1} \sum w_i = Z_f/Z_g$; in other words the average of the importance weights is the ratio of the integrals over $f$ and $g$ with respect to $x$.

Not a proof but a good thought experiment to make this feel intuitive is to let $f(x) = (2 \pi)^{-0.5} \exp(-0.5 x^2)$ for $x \in [0, 6]$, $f(x) = 0$ other wise and let $g(x)$ be uniform on the same range. Also because only a finite summation is used we can re order the sample ${\bf x}$ so that $x_1 < x_2 ... < x_n$ and noting that $Z_g =1$. The $w_i = 6*(2 \pi)^{-0.5} \exp(-0.5 x_i^2)$ which implies that $Z_f = 6/(n \sqrt{2 \pi}) \sum \exp(-0.5 x_i^2) =1/ \sqrt{2 \pi} \sum \exp(-0.5 x_i^2)*(6/n)$. Now as $6/n$ is the expected distance between $x_i$ and $x_{i+1}$ for all $i$, the summation turns into the Riemann sum  for $\int_0^6 dx \exp(-0.5 x^2)$.   Thus $\sum \exp(-0.5 x_i^2)*(6/n) \approx \sqrt{2 \pi}/2$, thus $Z_f = 1/ \sqrt{2 \pi} \sum \exp(-0.5 x_i^2)*(6/n) \approx 0.5$. Which is correct as just over half of the Gaussian distribution has been cut off.

Figure 1: Shows the 1 dimensional log posterior / log likelihood for a one dimensional Cox model. The black line is the true normalized posterior and the blue line is the assumed Gaussian.

Returning to Cox PH models, as noted before they do not use a proper likelihood. And their likelihood function is definitely not Gaussian (Figure 1 shows log posterior distribution for a Cox model with uniform prior as well as the assumed Gaussian distribution. While similar the two are clearly different). So it is difficult to analytically the Bayesian evidence. Note I am choosing to interpret the event times as a choice of model; in future work it might be worth considering what happens if the event times are marginalized out?

The following R code creates a simulated data set of right censored survival times with 2 predictive and 4 random predictors. It also defines several functions to be used in calculating the evidence.

#simulate data

S1 = rexp(1000)
S2 = rexp( 500, rate = 0.5)
S3 = rexp( 500, rate = 2)
ET = c(S1, S2, S3)
X1 = c(rep(0, 1000), rep(1, 500), rep(1, 500))
X2 = c(rep(0, 1500), rep(0, 500))
#X2 = sample(1:2, 2000, replace=T)
X3 = sample(1:2, 2000, replace=T)
X4 = sample(1:2, 2000, replace=T)
X5 = sample(1:2, 2000, replace=T)
X6 = sample(1:2, 2000, replace=T)
ET.raw = ET

# censor events
r = runif(2000)
EVENT = rep(0, 2000)
EVENT[r >= 0.25] = 1
r = runif(2000)
ET[EVENT == 0] = r[EVENT == 0]*ET[EVENT == 0]

df = data.frame(ET,ET.raw, EVENT, X1, X2, X3, X4, X5, X6)

# sort event times
index <- sort(ET, index.return=T)
df = df[index$ix, ] # these things shoud be consistant with the df table for easy use ET = df$ET
EVENT = df$EVENT ET = df$ET
EVENT = df$EVENT ET.raw = df$ET.raw
X1 = df$X1 X2 = df$X2
X3 = df$X3 X4 = df$X4
X5 = df$X5 X6 = df$X6

# log likelihood function
g.cph.LL<-function(m, ET, EVENT, X)
{
n = ncol(X)
t = ET
k = length(t)
A = X
eta = rep(0,k)

if (n >= 2)
{
for(i in 1:n)
{
eta = eta + A[,i]*m[i]
}
}
if(n == 1)
{
eta = A[,1]*m[1]
}

h = exp(eta)
H = sum(h) - cumsum(h) + h
P = as.numeric(h/H)
out = sum(log(P[EVENT==1]))
return(out)
}

#multi variate dnorm funny this is not in base or MASS?
mvdnorm <-function(X, mu, Sigma)
{
A.1 = det(Sigma)
A = -0.5*log(2*pi*A.1)
B.1 = solve(Sigma)
B.2 = X-mu
B.0 = t(B.2)%*%B.1%*%B.2
B = -0.5*B.0
LP = A + B
return(LP)
}

# esitmate evidence given IS
g.z <- function(ET, EVENT, A, n.s = 1000)
{
mod.freq = coxph(Surv(time = ET, event = EVENT)~., data= A)

SIGMA = mod.freq$var MU = as.vector(mod.freq$coefficients)
SAMP = mvrnorm(n = n.s, mu = MU, Sigma = SIGMA, tol = 1e-6, empirical = FALSE, EISPACK = FALSE)
W = 1:n.s
for ( i in 1:n.s)
{
W.num = g.cph.LL(m= SAMP[i,], ET, EVENT, X= A )
W.dom = mvdnorm(X = SAMP[i,], mu=MU, Sigma = SIGMA)
W[i] = W.num - W.dom +9500.27
}
return(log(mean(exp(W))))
}

This next code junk can be run to conduct calculate the evidence for all sub sets of predictors in the simulated data set. Note that I have updated how I do the all sub sets by using the “expand.grid” function.  The code needs both survival and MASS to be loaded.

The strategy is to use “summary.coxph” function to define a multivariate Gaussian distribution which is taken as $g(x)$. This is is used to crease samples that are evaluated by  “g.cph.LL” which is the $f(x)$. The ratios are used to calculate the evidence. Note that because of machine precision issues I am adding “9500” to the log ratio. This is equivalent to multiplying by a very large number; since only the relative evidence value are needed this will not change the order and stops all the weights from being rounded to zero!

# make list of all subsets
IF.list = data.frame( TRUE, TRUE, TRUE, expand.grid(rep(list(c(T,F)), 6)))

# internalize some lists
colnames(IF.list) = colnames(df)
z = rep(-1, (nrow(IF.list)-1) )

# loop though all models
for (i in 1:(nrow(IF.list)-1) )
{
print(i)
print(IF.list[i,])
data.temp = df[,as.numeric(IF.list[i,1:9])==T]
A.temp = as.data.frame(data.temp[4:ncol(data.temp)])
z[i] = g.z(ET = ET, EVENT = EVENT, A = A.temp)
}
IF.list = IF.list[1:(nrow(IF.list)-1), ]

# report results
plot(z) # plot the evidence
which.max(z)
IF.list[which.max(z),]
data.frame(log.z = z[(max(z)- z) <= 10], IF.list[(max(z)- z) <= 10,])
data.temp = df[,as.numeric(IF.list[which.max(z),1:9])==T]
A.temp = as.data.frame(data.temp[4:ncol(data.temp)])
summary(coxph(Surv(time = ET, event = EVENT)~., data= A.temp))

This code does find the correct parameterization almost all the time and in cases where it does not it does not have signifigantly lower evidence than the “best model”. But don’t take my word for it run the code your self a few times.

That is it for now. Tune in next time for more Bayesian Business Intelligence.

# Frailty part 1

One of the larger problems in modern statistics is that that telemetry (i.e., the data source) is chosen before analysis conducted.

[Yes it can be updated, yes a lot of good work goes into choices where events are recorded and how that data is recorded, and yes I really don’t care about the point you are trying to make right now… just read my article!]

The result of incomplete telemetry is that even if there are adequate model selection procedures (e.g., rjMCMC) there is still no guarantee that the data is available to accurately answer the parameter estimate question. This problem is has been discussed on this blog before; but that was in the context of GLM modelling.

This post explores the same problem within the context of survival analysis where for some bizarre accident of history dispersion is called frailty.

To be more precise, frailty is when two phone calls form the same population (i.e., have the same predictive statistics) have different true hang up time distributions. There could ether be unobservable sub-groups or a continuous spectrum.

The frailty model is $S_i^{\star}(t_i) = \int_{\text{range}} du_i S_i(t_i | u_i) f(u_i)$ where $u_i$ is a parameter of $S_i()$ the survival function for the ith observation. There is an obvious parallel to Bayes theorem here. If $f()$ is thought of as the prior distribution and $S_i()$ the likelihood then $S_i^{\star}()$ is the evidence.

Note that I don’t really know why the custom is to express frailty models in terms of survival function. For example $S_i^{\star}(t_i) = \int_{\text{range}} du_i S_i(t_i | u_i) f(u_i) = \int_{\text{range}} du_i [1 - G_i(t_i | u_i)] f(u_i)$ note that $G_i()$ is the cdf of the distribution of the ith observation. Thus $S_i^{\star}(t_i) = \int_{\text{range}} du_i [1 - \int_{\text{range}} du_i g_i(t_i | u_i)] f(u_i)$. Which implies that the process could have been started with the specification of the frail pdf of the ith observation $g_i^{\star} = \int_{\text{range}} du_i g_i(t_i | u_i) f(u_i)$.

An informative (simple-ish) example is the exponential-gamma frailty model. Consider a a population of phone calls with exponentially distributed hang-up times; i.e., $g_i(t_i|\lambda_i) = \lambda_i \exp(-\lambda_i t_i)$. The complication is that the $\lambda_i$ are distributed gamma($\alpha, \beta)$; i.e, $f(\lambda) = [ \beta^\alpha / \Gamma(\alpha)] \lambda^{\alpha -1} e^{- \beta \lambda }$

Thus to get the frailty model $g^{\star} = \int_{\text{range}} du \ g_i(t_i | u) f(u) = \int_0^{\inf} d\lambda \lambda \exp(-\lambda t_i) [ \beta^\alpha / \Gamma(\alpha)] \lambda^{\alpha -1} e^{- \beta \lambda }$. this integral can be evaluated to be $g^{\star}(t_i) = [ \beta^\alpha / \Gamma(\alpha)] \int_0^{\inf} d\lambda \lambda^{\alpha } e^{- [\beta+t_i] \lambda } = [ \beta^\alpha / \Gamma(\alpha)] [ \Gamma(\alpha + 1) / (\beta + t_i)^{\alpha +1} ] = \beta^\alpha \alpha/( \beta + t_i)^{\alpha +1})$

That is it for now. Just a word of warning you still want a prior for $\alpha$ and $\beta$ as they will not be constrained by the data. So don’t just MCMC sample from this thing yet!

# Plotting your Hazard

As humans we know that Misfortune stalk us at every turn. But how much hazard are we  truly in?

This post answers that important question.

Recall that the hazard function [$h(t)$] is the ration of the event time pdf [$f(t)$] to the survival function [$S(t)$]; i.e., it the instantaneous probability of failure for an observation given that survived at least that long. Also recall that a Kaplan-Meier curve (KMC) is reasonable ways to estimate the survival function….

So what is the point of this post? Can’t the hazard function be gotten directly from the observed KMC? surprisingly no. The basic problem is that a KMC is not a smooth estimate of the $S(t)$ thus $f(t) = d/dt [1- S(t)]$ is not really defined.

The simplest way to get around this problem is to use the piece wise estimate $\hat{h}(t) = d_j / (r_j \Delta_j)$ where $t \in [t_{j-1}, t_{j})$, $r_j$ is the number of observations at risk in time interval $[t_{j-1}, t_{j})$,, $d_j$ is the number of “hanging ups” at time $t_j$, and $\Delta_j = t_{j} - t_{j-1}$.

In practice this method is not satisfying as its uncertainty [$\text{Var}(\hat{h}[t]) = \hat{h}(t)^2 (r_j - d_j)/(r_j d_j)$] tends to be large and estimates for adjacent segments tend to vary largely.

So here a Bayesian approach is proposed! Yes that is right I am finally moving in to a Bayesian survival analysis problem!

As noted before Bayesian methods tend to start with a likelihood function (yes ABC does not but it sort of does because it is based on defining an approximate likelihood function… do you feel smug for pointing that out? do you wake up in the night and hug yourself? No one cares keener!) . Here Order statistics are exploited to produce the likelihood function. The percentile of the $k$‘th of $n$ observation has a beta distribution because there are $k -1$ observations smaller and $n - k$ observations bigger. The percentiles can them be fit using a collections of weighted cdfs. The number of cdf and their parameters are found using rjMCMC. I know this is a fairly vague description (frankly it is a vague reminder) however I would like to get this post done and I beleive that if you have been reading the other posts this is not that far out of an idea.

To explore the idea I have made a simulation study. In the study a collection of observed exponential random variables  are used to create a estimated a hazard function. Truncated Gaussian distributions are used for the kernel of the rjMCMC inversion. In practice it would probably be better to use truncated generalized Gaussian distributions. However then the kernel could exactly fit the simulated data which is unlikely to be possible in practice. The choice of simulated exponential distributed data set is made for ease of interpretation as the exponential has a constant hazard function.  Figure 1 left shows simulated exponential random variables (circles) plotted vs their percentiles; Fig. 1 right gives a color histogram of the posterior hazard function. The posterior distribution is distinct from the true hazard function (the constant like $y = 1$) for much of the range zero to two however it is close. And it should be noted that the data set is constantly lower than the theoretical percentile for that range.

Figure 1: Left the simulated exponential random variables (circles) plotted vs their percentiles. The blue line is the theoretical values. Right is a color histogram of the posterior hazard function. The horizontal line is the hazard function of the simulation distribution; i.e., the “true value”.

That is it for this week. I hope this post is interesting As I had this idea in my head for a while before I could find time to write it.  The code for the simulated inversion is append.

Tune in next time for more Bayesian Business Intelligence! Same Bayesian Channel, Same Bayesian Time!

g.sim.data <-function(N = 1000, lambda = 1)
{
t  = rexp(N, rate = lambda )
t. = sort(t)

index = 1:N
cdf   = (1:N)/N

out = data.frame(t., index, cdf)
return(out)

}
g.sim.data(N= 100)

#  LL  j, P1, P2,  Mu1  Mu2, S1   S2
g.predict<-function(MOD=c(225.870633018 ,  3.000000000 ,  1.000000000 ,  0.408442587 ,  0.173129261 ,  0.006416128 ,  0.992600670 ,  1.333872850 ,  0.488384568 ,  1.279163793,   0.261276892 ), X = g.sim.data(N=100), plot.line=0 )
{
# upack the pars

j      = MOD[2]
z      = MOD[3:(j+2)]
mu     = MOD[(j+3):(2*j + 2)]
sigma  = MOD[(2*j + 3):(3*j + 2)]

#print(z)

# get the probs
z.star = c(z, 0)
prob = 1:j
for (i in 1:j)
{
prob[i] = z.star[i]-z.star[i+1]
}

if (min(prob) < 0 ) {return(-Inf) } #print(prob) t.     = X$t. n = nrow(X) k = 1:n alpha = k beta = n - k + 1 #print("-----------") #print(prob[1]) CDF.hat = prob[1]*(pnorm(t., mean = mu[1], sd=sigma[1]) - pnorm(0,mean = mu[1], sd=sigma[1]))/( 1- pnorm(0,mean = mu[1], sd=sigma[1]) ) if(j >= 2) { for ( i in 2:(j)) { #print("-----------") #print(prob[i]) temp = prob[i]*(pnorm(t., mean = mu[i], sd=sigma[i]) - pnorm(0,mean = mu[i], sd=sigma[i]))/( 1- pnorm(0,mean = mu[i], sd=sigma[i]) ) CDF.hat = CDF.hat + temp } } LL = dbeta(CDF.hat, shape1 = alpha, shape2 = beta, ncp = 0, log = TRUE) LL = sum(LL, na.rm=T) #print(CDF.hat) if (plot.line == 1 ) { plot(t., X$cdf)
lines(t., CDF.hat)
}

return(LL)
}
g.predict(plot.line=1)

g.perterb<-function(M=c(-Inf, 3, 0,  0.6, 0.2,   0, 0.5,1,   1, 0.5, 0.1 ), LB=c(0, -10, 0), UB = c(1, 10, 10), Qsd = c(0.01, 0.01, 0.01)  , data=g.sim.data(N=100)) { # unpacking hte parameters LL     = M[1] j      = M[2] z      = M[3:(j+2)] mu     = M[(j+3):(2*j + 2)] sigma  = M[(2*j + 3):(3*j + 2)]    #print(M)  # make the proposal model   z.prime      = z mu.prime     = mu sigma.prime  = sigma   # make the proposal model index  = sample(1:j, 1 ) if(index >1) {z.prime[index]     = z.prime[index]     + rnorm(1, mean = 0, sd= Qsd[1])} # add random noise to old model
mu.prime[index]                  = mu.prime[index]    + rnorm(1, mean = 0, sd= Qsd[2])
sigma.prime[index]               = sigma.prime[index] + rnorm(1, mean = 0, sd= Qsd[3])

SI = sort(z, index.return=T, decreasing = T)$ix z.prime = z.prime[SI] mu.prime = mu.prime[SI] sigma.prime = sigma.prime[SI] M.prime =c(LL, j, z.prime, mu.prime, sigma.prime) #print(M.prime) #print(z.prime) #print(mu.prime) #print(sigma.prime) #print(z.prime[index]) #print(mu.prime[index]) #print(sigma.prime[index]) if ((z.prime[index] >= LB[1]) & (z.prime[index] <= UB[1]) & # check the bounds (mu.prime[index] >= LB[2]) & (mu.prime[index] <= UB[2]) & (sigma.prime[index] >= LB[3]) & (sigma.prime[index] <= UB[3]) #&(sum(prob.prime ) <= 1) ) { LL.prime = g.predict(M.prime, X = data ) # compute loglikihood M.prime[1] = LL.prime # save LL r = runif(1) # random uniform MH = exp(LL.prime - LL) # Metropolis-hasting acceptance probability value if (r <= MH) { M = M.prime } } return(M) } g.birth<-function(M=c(-Inf, 3, 0, 0.6, 0.2, 0, 0.5,1, 1, 0.5, 0.1 ), LB=c(0, -10, 0), UB = c(1, 10, 10), Qsd = c(0.01, 0.01, 0.01) , data=g.sim.data(N=100)) { # unpacking hte parameters LL = M[1] j = M[2] z = M[3:(j+2)] mu = M[(j+3):(2*j + 2)] sigma = M[(2*j + 3):(3*j + 2)] #print(M) #print(z) if (j <= 14 ) { # make the proposal model LL.prime = LL j.prime = j + 1 z.prime = c(z , runif(1, min = LB[1], max = UB[1]) ) mu.prime = c(mu, runif(1, min = LB[2], max = UB[2]) ) sigma.prime = c(sigma , runif(1, min = LB[3], max = UB[3]) ) SI = sort(z.prime, index.return=T, decreasing = T)$ix       # resort the mods in prob
z.prime     = z.prime[SI]
mu.prime    = mu.prime[SI]
sigma.prime = sigma.prime[SI]

M.prime      = c(LL.prime, j.prime, z.prime, mu.prime, sigma.prime)
#print("------ gavin this line ------------")
#print(M.prime)
LL.prime     = g.predict(M.prime, X = data)                  # get predicted values
M.prime[1]   = LL.prime

r = runif(1)                                          # random uniform
MH = exp(LL.prime - LL)                               # Metropolis-hasting acceptance probability value note that the prior and proposal cancel
if (r <= MH) {M = M.prime}                            # if accepted
}
return(M)

}

g.death<-function(M=c(-Inf, 3, 0,  0.6, 0.2,   0, 0.5,1,   1, 0.5, 0.1 ), LB=c(0, -10, 0), UB = c(1, 10, 10), Qsd = c(0.01, 0.01, 0.01)  , data=g.sim.data(N=100)) { # unpacking hte parameters LL     = M[1] j      = M[2] z      = M[3:(j+2)] mu     = M[(j+3):(2*j + 2)] sigma  = M[(2*j + 3):(3*j + 2)] #print(M) if (j >= 3 )
{
# make the proposal model
index        = sample(2:j, 1)
LL.prime     = LL
j.prime      = j - 1
z.prime      = z[-index]
mu.prime     = mu[-index]
sigma.prime  = sigma[-index]
M.prime      = c(LL.prime, j.prime, z.prime, mu.prime, sigma.prime)
LL.prime     = g.predict(M.prime, X = data)                  # get predicted values
M.prime[1]   = LL.prime
r = runif(1)                                          # random uniform
MH = exp(LL.prime - LL)                               # Metropolis-hasting acceptance probability value note that the prior and proposal cancel
if (r <= MH) {M = M.prime}                            # if accepted
}
return(M)

}

g.explore<-function(old, d)
{
Qsd. = c(0.005, 0.005, 0.005)
LB.  = c(0, -10, 0)
UB.  = c(1, 10, 10)

move.type = sample(1:3, 1) # the type of move i.e., perterb, birth, death
#print("------ this line")
#print(move.type)

if (move.type == 1) {old = g.perterb(M=old, Qsd =Qsd., LB = LB., UB=UB., data= d)}
if (move.type == 2) {old = g.birth(M=old,  Qsd =Qsd., LB = LB., UB=UB., data = d)}
if (move.type == 3) {old = g.death(M=old, Qsd =Qsd., LB = LB., UB=UB., data = d )}

return(old)
}

g.rjMCMC<-function( Nsamp = 20000, BURN = 2000)
{

data = g.sim.data(N= 100, lambda = 1)

#x11(height=4, width = 5)
par(mfrow= c(1,2))
plot(data$t., data$cdf, xlab = "time", main = "Simulated Data", ylab= "Cumlative Probability", xlim=c(0, 5))
t.f = seq(0, 7, length.out = 1001)
lines(t.f,1-exp(-t.f)  , col="blue", lwd=3 )
points(data$t., data$cdf)

Mod.old  =c(-Inf, 3, 1,  0.6, 0.2,   0, 0.5,1,   1, 0.5, 0.1 )

for(i in 1:BURN) # the burn in
{
Mod.old = g.explore(old = Mod.old, d = data)
}
print(Mod.old)

REC = list()
for(i in 1:(Nsamp-1)) # the burn in
{
Mod.old = g.explore(old = Mod.old, d = data)
REC[[i]] = Mod.old
rownames(REC) = NULL
}

#print(table(REC[,2]))
#best = which.max(REC[,1])
#g.predict(MOD=REC[best,], X = data , plot.line=1)
#lines(data$t.,1-exp(-data$t.)  , col="blue", lwd=3 )
return(REC)

}

g.Haz<-function(N = 19999, REC= samps) { hist.mat = matrix(0, nrow= N, ncol = 101) t. = seq(0, 7, length.out = 101)      for (bi in 1:N)     {         #print(bi)     MOD = REC[[bi]]     # upack the pars          j      = MOD[2]     z      = MOD[3:(j+2)]     mu     = MOD[(j+3):(2*j + 2)]     sigma  = MOD[(2*j + 3):(3*j + 2)]               #print(z)          # get the probs      z.star = c(z, 0)     prob = 1:j     for (i in 1:j)         {         prob[i] = z.star[i]-z.star[i+1]         }          #print(prob)          PDF.hat  =  prob[1]*(dnorm(t., mean = mu[1], sd=sigma[1]))/( 1- pnorm(0,mean = mu[1], sd=sigma[1])  )          if(j >= 2)
{
for ( i in 2:(j))
{
temp =  prob[i]*(dnorm(t., mean = mu[i], sd=sigma[i]))/( 1- pnorm(0,mean = mu[i], sd=sigma[i])  )
PDF.hat  = PDF.hat + temp
}
}
if (min(PDF.hat) < 0 ) {print("problem pdf")}     #print("-----------")     #print(prob[1])     CDF.hat  =  prob[1]*(pnorm(t., mean = mu[1], sd=sigma[1]) -   pnorm(0,mean = mu[1], sd=sigma[1]))/( 1- pnorm(0,mean = mu[1], sd=sigma[1])  )          if(j >= 2)
{
for ( i in 2:(j))
{
temp =  prob[i]*(pnorm(t., mean = mu[i], sd=sigma[i]) - pnorm(0,mean = mu[i], sd=sigma[i]))/( 1- pnorm(0,mean = mu[i], sd=sigma[i])  )
CDF.hat  = CDF.hat + temp
}
}
if (max(CDF.hat) > 1 ) {print("problem cdf too big")}
if (max(CDF.hat) > 1 ) {print(prob)}

if (max(CDF.hat) > 1 ) {print(CDF.hat)}
if (min(CDF.hat) < 0 ) {print("problem cdf")}

S.hat = 1 - CDF.hat
if (min(S.hat) < 0 ) {print("problem s")}

h.hat = PDF.hat/S.hat

hist.mat[bi,] =  h.hat

}

plot(t., t., type= "n", xlab = "time", ylab = "hazard function", ylim=c(0,4 ), xlim= c(0, 5))

ncolor = 21
layers = matrix(0, nrow=ncolor, ncol=101)
for (i in 1:101)
{
q = quantile(hist.mat[,i], probs=seq(0, 1, length.out=ncolor))
layers[,i] = q
}
gcol = c("white", rev(heat.colors((ncolor-1)/2) ))
#       1            2           34        4  3 21

for (i in 1:((ncolor-1)/2))
{
PY = c(layers[i,], rev(layers[ncolor+1-i,]))
Px   = c(t., rev(t.))
polygon(Px, PY, col=gcol[i], border=F)
}
lines(t., layers[((ncolor-1)/2)+1,], col=gcol[((ncolor-1)/2)+1], lwd=2)

abline(h = 1)

#print(layers)

}

samps = g.rjMCMC()
g.Haz()



# Creating a distribution with a given hazard function.

Hi All

This week I write about how to create a distribution with a specified hazard function.

“Crayon math” is what I call it when I have shape in my mind that I want to express as a function; i.e., draw the shape with math. This can happen a lot when there is empirical data but there is value to describing it parametrically. In particular, recall Kaplan-Meier curves, in many cases it is useful to describe them parametrically. While it is possible to use Cox Proportional Hazards as noted before proportional hazards considers the order of events not the time between them. Thus in cases where the goal is to estimate not only the relative hazard but the absolute hazard it can be necessary to use parametric survival regression (think a GLM). The problem is to do that a distribution must be specified. That is a parameterization has to be chosen.

The normal way to pick a distribution is to chose semi-arbitrarily from usual suspects (Exponential, Weibull, Gamma, truncated normal and log normal). Sadly this choice is commonly made either for mathematical convenience or laziness (i.e, that was one of the options in R).

To do a better job it is worth looking at the shape of the empirical hazard function (I will make a post about this in the future ) and then build a distribution that matches it.

An illustrative example:

Consider the hazard function shown in figure 2; it starts high and asymptotes down to a constant value exponentially (I know this because I cheated, i.e., it is simulated data) .

Figure 1: a hazard function that asymptotes down to a constant value.

Starting with the hazard function $h(t)$ the goal is to get a distribution. As usual working backwards is far easier. Thus let $F(t)$ be the desired cumulative distribution. If it is assumed that $F(t) = 1- \exp(-H(t))$ where $d/dt H(t) = h(t)$ then the pdf is $f(t) = h(t) \exp(-H(t))$. The result is then that the hazard function is $h(t)$ as $f(t)/[1-F(t)] = h(t)$.

Thus in the example of Fig. 2 $h(t) = A - B C \exp(-C t)$ therefore $f(t) = [A - B C \exp(-C t)] \exp(-[A t + B exp( -C t )])$ and $F(t) = 1- \exp(- [A t + B exp( -C t )] )$.

Clearly not all $h(t)$ functions will work. The big restrictions are $H(t =0) = 0$ (to ensure that $F(t=0) = 1- \exp(-H(t=0)) = 1- \exp(-0)= 1-1 = 0$) and  $H(t = \infty) = \infty$ (to ensure that $F(t= \infty) = 1- \exp(-H(t= \infty)) = 1- \exp(- \infty)= 1-0 = 1$). In addition, the usual restriction that $h(t)$ is positive is required.

That is it for now.

# Weibull parameters for censored data using only SQL

Hi all

This post shows how to use SQL to get estimates of the expected event times given either the Exponential or Weibull distributions and right censored data. That might seem like a strange restriction however it is very powerful have estimates in a query instead of requiring R as they can be used directly in dashboard (in my experience with Tableau). Also if like me you are lazy being able to just have a query that answers your question is much nicer then having to analyze data from a query.

Before showing the query it might be helpful to describe what is being queried. Starting with the exponential distribution. Let ${\bf d}$ be a set of observed data; ${\bf d}$ has $n$ observations with $k$ event times and $n-k$ censored times. The likelihood function is $\mathcal{L}(\lambda) = \prod f_x (x= d_i)^{\delta_i} [1- F_x (x = d_j)]^{1- \delta_i} = \prod f_x (x= d_i)^{\delta_i} [S_x (x = d_j)]^{1- \delta_i}$ where $\delta_i =1$ if the ith observation is an observed event and 0 if it is censored. Thus if the event times are assumed to be exponentially distributed $\mathcal{L}(\lambda) = \prod \lambda^{\delta_i} \exp( - \lambda d_i \delta_i) \exp( - \lambda d_i [1- \delta_i] ) = \lambda^{\sum_{i = 1}^{n} \delta_i} exp( - \lambda \sum_{i = 1}^{n} d_i)$. If the likelihood is considered as function of $\lambda$ not ${\bf d}$ then it is recognizable as a gamma distribution; thus the expectation of $\lambda$ is $(\sum_{i = 1}^{n} \delta_i) / ( \sum_{i = 1}^{n} d_i ) = k / ( \sum_{i = 1}^{n} d_i)$. Note that is can also be found using the standard ML approach of solving the derivative of the like-hood set equal to zero.

For the Weibull distribution this a bit more complicated! Recall that the pdf ofr a Weibull is $f_x(x) = \lambda \gamma x^{ \gamma -1} \exp(- \lambda x^{\gamma})$ and the survival function is $S_x(x) = \exp(-\lambda x^{\gamma})$. Thus using the same process as shown with the exponential distribution (i.e., I am too lazy to write it again) the likelihood function of a similar data set is  $\mathcal{L}(\lambda, \gamma) = \prod [\lambda \gamma {d_i}^{\gamma-1}]^{\delta_i} \exp(- \lambda {d_i}^{\gamma-1})$. This does not correspond to a distribution with know properties when considered as a function of $\gamma$ and $\lambda$. However the derivative of the log likelihood can be solved to find an estimate right? And that estimate is going to be used the the query? We shall see!

$\log \mathcal{L}(\lambda, \gamma) = \sum \delta_i \log [\lambda \gamma {d_i}] + (\gamma -1) \sum \delta_i \log(d_i) - \lambda \sum {d_i}^{\gamma}$.  The two partial derivatives are $d/d \lambda \log \mathcal{L}(\lambda, \gamma) = k/\lambda - \sum {d_i}^{\gamma}$ and $d/d \gamma \log \mathcal{L}(\lambda, \gamma) = k/ \gamma + \sum \delta_i \log(d_i) - \lambda \sum {d_i}^{\gamma} \log(d_i)$. Using the first equation it is is to find that $\lambda = k / \sum {d_i}^{\gamma}$ however $\gamma$ can not be solved for algebraically.

The value of $\gamma$ that minimizes $| k/ \gamma + \sum \delta_i \log(d_i) - (k / \sum {d_i}^{\gamma}) \sum {d_i}^{\gamma} \log(d_i) |$ is the maximum likelihood estimate.

So this whole post comes down to how can a SQL query find the optimal value of $\gamma$. The answer is actually quite simple, brute force. I create a large number of $\gamma$ values and full join them to my data set. Then I rank the rows by their misfit to the objective function and take the best one! BAM!

This query assume there is a table with user ids, event times and user ages. The cent follows the same structure as the churn posts query.

SELECT age, gamma,lambda, n, r, o_f ,lambda_ex
FROM
(
SELECT age, gamma, n, r,lambda, o_f, lambda_ex,
row_number() over (partition by age order by o_f) as index
FROM

(
SELECT age, gamma, n, r, r/(SUM_U_to_gamma + SUM_t_to_gamma) as lambda, r/(SUM_U + SUM_t) as lambda_ex,
ABS(r/gamma + SUM_log_t - (r/ (SUM_U_to_gamma + SUM_t_to_gamma) )*(SUM_U_to_gamma_log_u + SUM_t_to_gamma_log_t)) as o_f
FROM
(
SELECT age, gamma,
sum(1) as n,
SUM(event) as r,

SUM(CASE WHEN event = 0 THEN (event_time::float) END) as SUM_U,
SUM(CASE WHEN event = 1 THEN (event_time::float) END) as SUM_t,

SUM(CASE WHEN event = 0 THEN (event_time::float)^gamma END) as SUM_U_to_gamma,
SUM(CASE WHEN event = 1 THEN (event_time::float)^gamma END) as SUM_t_to_gamma,

SUM(CASE WHEN event = 0 THEN ln(event_time::Float) END) as SUM_log_U,
SUM(CASE WHEN event = 1 THEN ln(event_time::Float) END) as SUM_log_t,

SUM(CASE WHEN event = 0 THEN ln(event_time::float)*event_time^gamma END) as SUM_U_to_gamma_log_u,
SUM(CASE WHEN event = 1 THEN ln(event_time::float)*event_time^gamma END) as SUM_t_to_gamma_log_t
FROM
(
(SELECT id, CASE WHEN next_time is null then 0 else 1 end as event,
datediff('sec', clean_time, coalesce(next_time,'2017-08-28 00:00:00') ) as event_time, age, 1 as key
FROM
(SELECT id, clean_time, age,
max(clean_time) over (partition by id order by clean_time rows between 1 following and 1 following) as next_time,
row_number() over (partition by id,age order by random() ) as index
FROM tab_event
and clean_time <= '2017-08-28 00:00:00'
)
where index <= 1
and age <= 7) Q1
LEFT JOIN
(SELECT random() as gamma, 1 as key
from tab_event
LIMIT 10000 ) Q2
ON Q1.key = Q2.key
)
where event_time > 0
GROUP BY 1,2
)
)
)
where index = 1
ORDER BY 1,2

That is it for now tune in next time for more survival analysis!

# An brief answer to the unsettling questions raised by interval censored data.

Hi All

So first of all sorry for the delay in the posts. Those that know me (i.e., have connected to me on facebook) know that I have recently changed my employer from Hothead games to Phoenix Labs. the process of this transition left me with little time for this blog. Now that I have had my last day at Hothead I can focus here a bit more.

FYI if you are reading this some time near 2017-09-09 and are data scientist you might want to apply to Hothead; they are a great company.

Moving on to the meats of this post.

While I was reading about interval censored data I kept having problems reconciling my understanding of likelihood function with the presented methods.  In short, what has been bugging me is how do cdfs and pdfs work together because they seem to have different assumptions. Let $F_x(x)$ be a cdf and $f_x(x)$ be a pdf; in a classic likelihood function with iid observations the probability that observation i is $p(d_i) = f_x(x = d_i)$. For an interval censored observation (lets say observation i must be in the interval $[x_i, x_i + \delta]$ ) the probability of is $p(d_i) = F_x(x = x_i + \delta) - F_x(x = x_i)$. Now as $\delta$ approaches zero in a common sense sort of way I want the two probabilities to be the same. However they are clearly different! $p(d_i) = F_x(x = x_i ) - F_x(x = x_i) = 0$ which is not in general $f_x(x = d_i)$.

So how is this problem resolved? It might be worth stopping here for a moment and thinking about this because it is not easy.

Figure 1: Mr Darcy and a chess board. This just here to give some space so you don’t read ahead while you have your moment.

The, my, a answer is:
Returning to the likelihood for a interval censored observation $p(d_i) = F_x(x = x_i + \delta) - F_x(x = x_i)$; this can be multiplied by 1 without issue. In this case $1 = \delta / \delta$. Thus $p(d_i) = [F_x(x = x_i + \delta) - F_x(x = x_i)] (\delta / \delta) = ([F_x(x = x_i + \delta) - F_x(x = x_i)] / \delta) \delta$. Now using L’Hôpital’s rule (by far the best spelt name ever!) the limit as $\delta$ approaches zero for the first term is $\lim_{\delta \rightarrow 0} ([F_x(x = x_i + \delta) - F_x(x = x_i)] / \delta) = f_x(x= d_i)$. Thus for an arbitrarily small $\delta$  $p(d_i) = F_x(x = x_i + \delta) - F_x(x = x_i) = f_x(x= d_i) \delta$.

For fequentists it is not exactly clear to me what happens to the $\delta$ term; to be sure for ML estimates $\delta$ cancel when the derivatives are solved to equal zero but who knows in what happens in general?

For Bayesians the $\delta$ term cancels with the evidence; .

To justify this recall that $\mathcal{Z}_j = \int d {\bf m}_j \pi({\bf m}_j) \mathcal{L}({\bf m}_j)$ so the $\delta$ will be in both the numerator and the denominator.

The good part is that directly observed data and interval censored data can be used in a mixed likelihood function without any requirement to re-weight one or the other data types.

That is it for this post I will be writing one again soon on how to get Weibull distribution estimates in a SQL query.

# Road map of Survival Analysis posts

Hi All

I have been a bit busy recently and so this post has been both delayed and diminished. Now it presents only a list of the survival analysis topics that I intent to cover in the next few months.

1. Kaplan Meier plots (estimating survival functions) with interval censored data.
1. Cox proportional hazards
2. Bayesian Parametric Approach
3. Uniform Hazards
2. Cox proportional hazards and MCMC
1. Specifying prior distributions and verifying the common assumption of Gaussian uncertainty
2. RJMCMC for model selection
3.  Parametric Survival Analysis
4. Frailty Models
5. Event Dependent Censoring
6. Multiple Events
7. Competing risks

Points one and two I have actively been thinking about and working on. As is probably apparent from the lack of detail, points 3 through 7 are topics that I think sound interesting (i.e., know nothing about). They may (read are very likely to) change as I start writing.

A quick note on writing style of this blog. A friend of mine recently challenged me on the clarity and approach-ability of my writing. In writing there are two broad categories descriptions and reminders. The former is clearly better then the latter. This blog has been guilty of only giving reminders. However, it is my mission with the proposed posts on survival analysis to be squarely in the descriptions not reminders category of writing.

Finally I really am going to try to stick to a scheduled of 1 post ever two weeks that will mean that some posts will be shorter.

Tune in next time for hopefully a clear description of using the Cox Proportional hazards model to estimate survival curves.

Cheers

# Proportional Hazards (unpacking equation 12)

Hi All

This week I am looking at proportional hazards regression. In particular, trying to clarify the conditional likelihood formula.

Before diving into that please note that proportional hazards, also called Cox proportional hazards (CoxPH), was first presented in 1972 in “Regression models and life-tables” which has been cited ~ 46,000 times! So Dr Cox’s contribution with this one paper (he has a lot of other great papers as well) really is extraordinary.

That said I did not find it the clearest of papers in particular equation 12 took me some time to understand. Figure 1 shows an excerpt of the paper.

Figure 1: an excerpt of Cox’s 1972 “regression models and life tables”

The point that Cox is getting at is, if the hazard function ($\lambda(t)$ in his notation and $h(t)$ in ours ) is completely unknown but the order of failures are know can inference still be conducted? The answer is of course yes. Cox proposes the conditional likelihood function; that is, the probability of the order of failures or to put it another way the likelihood conditioned on the observed times of failures. Thus Cox’s equation 12 is apparently the probability that the i’th phone call ended given that one phone call end at that time. This post explores his assertion.

So to start, in CoxPH the hazard function is not assumed to have any particular form however it is assumed that all phone calls have the same form of hazard function only differing in a parametrically defined proportion; i.e., $h_{(i)}(t) = \exp({\bf z}_{(i)} {\bf m}) h_0(t)$ is the hazard function for the i’th largest observation.

The $\exp({\bf z}_{(i)}{\bf m})$ gives the relative hazard of the i’th observation relative to $h_0(t)$ the base hazard. This is the “proportional” in proportional hazards.

Returning to equation 12 and translating it into the this blogs notation we get

.

This is the proportional term divided by the sum of the probational terms for the risk set [$R(t)$]. The risk set is the set of observation with event times greater that of the i’th observation. By multiplying the top and bottom of the fraction with the baseline hazard $h_0(t_{(i)})$ it becomes clear that this is the hazard of the ending phone call divided by the hazard of all phone call still active imitatively before time $t_{(i)}$.

Intuitively, this should make sense the probability of phone call i ending given that exactly one of the phone calls ended is the hazard of the i’th phone call divided by the total hazard of all active phone calls. But sadly Intuitive is not really good enough. So continuing to please back some of these layers and using the definition of a hazard function:
.  This can be rewritten as
using Bayes rule [$P(A|B) = P(A \cap B )/P(B)$].  Given the assumption that there is one event at time $t_{(i)}$ (this assumption accounts deals for the conditional term) the probabilities in the fraction can be interpreted as
,  making the additional assumption that the event times are independent of each other the denominator can be interpreted as the probability that there is one event at time $t_{(i)}$ this results in further simplification:
.  Finally using Bayes rule again the fraction can be written as single conditional probability
. So CoxPH is correct and 46,000 papers don’t need an errata! But as I said I would have like a little more expansion in the original paper.

Basically that is everything I wanted to get through, however, …, I like like to talk too much / over think things. In particular,  the assumption that there is only one event at each time. In the proof this leads to the probability of the an event happening at the observed being the sum of the probabilities; however, in situation where there could be more than one event at a given time (likely due to sloppy data collection) then that is clearly not true (Cox and the reviewers discuss this in the paper a bit). The probability if at-least one observations has event time $t_{(i)}$ is then $1- \prod(1-P[\text{Obs. k has event time } t_{(i)}])$. While it is difficult to make general observations about this probability it must be larger then the assumed probability. In addition, the numerator would then be not just the probability that observation i has event time $t_{(i)}$  but instead the probability that all the observations that end at $t_{(i)}$ have event time $t_{(i)}$. This probability must smaller. So it does look like this would change the conditional likelihood a lot.

Tune in next time for exciting Bayesian Business intelligence (I promise next week there will actually be a Bayesian component to this)!

# A short note on Nelson-Aalen

Hi All

Last time I made a post deriving the variance of the Kaplan Meier survival function. I had intended to also give the Nelson Aalen estimate as well buts its variance is much harder to get. This post just highlights the difficulties that I had.

The Nelson Aalen estimate is $\hat{S}(t) = \prod_{j=1}^{k} \exp(-d_j/r_j)$. It feels a bit random to me, however David Collett points out that $\exp(-x) = 1 - x + x^2/2 - x^3/6 ...$ so for $x << 1 \exp(-x) \approx 1-x$ $\hat{S}(t) = \prod_{j=1}^{k} \exp(-d_j/r_j) \approx \prod_{j=1}^{k} )(1-d_j/r_j) \approx \prod_{j=1}^{k} )(r_j-d_j)/r_j$ which is the Kaplan Meier estimate.

Collett give the variance of the Nelson Aalen estimate as $Var[\hat{S}(t)] = \hat{S}(t)^2 \sum_{j=1}^k (d_j/r_j^2)$. I have been unable to verify the formula. If the same logic as my perfidious post is applied the variance of $var(\log[\hat{S}(t)])$ can be found to be $var(\log[\hat{S}(t)]) = \sum_{j=1}^{k} var[-d_j/r_j]$ and using everyone’s favorite identity ($var[g(x)] \approx [(d/dx) g(x)]^2 var(x)$) $var(\log[\hat{S}(t)])$ can also be found to be $var(\log[\hat{S}(t)]) =\hat{S}(t)^{-2} var[\hat{S}(t)]$. Putting to two sides together $\hat{S}(t)^{-2} var[\hat{S}(t)] = \sum_{j=1}^{k} var[-d_j/r_j]$ which is solved for $var[\hat{S}(t)]$ to give $var[\hat{S}(t)] = \hat{S}(t)^{2} \sum_{j=1}^{k} var[-d_j/r_j]$. So for $Var[\hat{S}(t)] = \hat{S}(t)^2 \sum_{j=1}^k (d_j/r_j^2)$ to be true $\sum_{j=1}^{k} var[-d_j/r_j] = \sum_{j=1}^k (d_j/r_j^2)$. I don’t see any clear reason why this is the case.

That is it for now. Tune in next time for hopefully some answers instead of just expositions on my inadequacies.