# Calculating Bayesian Evidence for Cox Proportional Hazards

Our perception of time is arbitrary [citation needed].

It objectively has not been as long as it seems since I wrote a post.

This post shows how to use importance sampling to calculate Bayesian evidence and then applies that to Cox PH models.

Starting with importance sampling, or “IS” (pronounced [iz]) as it likes to be called; this leads to sentences that are difficult to parse. IS is (see how fun “IS is” is?) a sampling strategy that draws samples from a similar, tractable distribution and re-weights them so that their importance corresponds to the target distribution. This is useful in the ubiquitous Bayesian situation where it is possible to evaluate a posterior distribution but not to sample from it directly.

Formally, let $f(x)$ be a function proportional to the target distribution and $g(x)$ a distribution similar to $f(x)$ [strictly, $g(x) > 0 \, \forall x$ such that $f(x) > 0$ is the only necessary condition, but the estimator converges faster if the two distributions share modes and variances]. If ${\bf x}$ is a random sample drawn from $g(x)$ with $i$'th element $x_i$, then the importance weight for $x_i$ is $w_i = f(x_i)/g(x_i)$.

The importance weights are used to evaluate expectations with respect to $f(x)$. For example, the mean under $f(x)$ can be estimated as $\sum x_i w_i / \sum w_i$, and the expectation of an arbitrary function $Q(x)$ as $\sum Q(x_i) w_i / \sum w_i$. However, in this post I am not interested in evaluating expectations of functions; as noted in the title, this is about calculating the Bayesian evidence.
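As a quick illustration (my own sketch, not from the original post), here is the self-normalized estimator in action for a target known only up to a constant; the target, proposal, and sample size are all made up for the example:

```r
set.seed(1)

# target: f(x) proportional to x*(1 - x) on [0, 1] (an unnormalized Beta(2,2))
f <- function(x) x * (1 - x)

# proposal: g(x) uniform on [0, 1], which is easy to sample from
n <- 1e5
x <- runif(n)
w <- f(x) / dunif(x)   # importance weights w_i = f(x_i)/g(x_i)

# self-normalized IS estimate of E[X] under the target
mu.hat <- sum(x * w) / sum(w)
mu.hat   # close to the true mean, 0.5
```

Note that the normalizing constant of $f$ cancels in the ratio, which is exactly why self-normalized IS is useful for unnormalized posteriors.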

It is obvious (i.e., someone pointed it out to me before I noticed) that $n^{-1} \sum w_i = Z_f/Z_g$; in other words, the average of the importance weights is the ratio of the integrals of $f$ and $g$ with respect to $x$.

This is not a proof, but a thought experiment that makes the result feel intuitive. Let $f(x) = (2 \pi)^{-0.5} \exp(-0.5 x^2)$ for $x \in [0, 6]$ and $f(x) = 0$ otherwise, and let $g(x)$ be uniform on the same range, so that $Z_g = 1$. Because only a finite summation is used, we can reorder the sample ${\bf x}$ so that $x_1 < x_2 < ... < x_n$. Then $w_i = 6 (2 \pi)^{-0.5} \exp(-0.5 x_i^2)$, which implies $Z_f \approx n^{-1} \sum w_i = (2 \pi)^{-0.5} \sum \exp(-0.5 x_i^2) (6/n)$. Now, since $6/n$ is the expected distance between $x_i$ and $x_{i+1}$ for all $i$, the summation turns into the Riemann sum for $\int_0^6 \exp(-0.5 x^2) \, dx \approx \sqrt{2 \pi}/2$. Thus $Z_f \approx (2 \pi)^{-0.5} \sqrt{2 \pi}/2 = 0.5$, which is correct, as just over half of the Gaussian distribution has been cut off.

Figure 1: The one-dimensional log posterior / log likelihood for a one-dimensional Cox model. The black line is the true normalized posterior and the blue line is the assumed Gaussian.
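The thought experiment above is easy to check numerically (a sketch of my own, using the same $f$ and $g$ as in the text):

```r
set.seed(2)

# f(x): standard normal density truncated (not renormalized) to [0, 6]
# g(x): uniform on [0, 6], so g(x) = 1/6 and Z_g = 1
n <- 1e6
x <- runif(n, 0, 6)
w <- ((2*pi)^-0.5 * exp(-0.5 * x^2)) / (1/6)   # w_i = f(x_i)/g(x_i)

# the mean of the weights estimates Z_f / Z_g = Z_f
Z.f <- mean(w)
Z.f   # approximately 0.5: half the Gaussian mass is cut off
```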

Returning to Cox PH models: as noted before, they do not use a proper likelihood (the Cox partial likelihood), and their likelihood function is definitely not Gaussian. Figure 1 shows the log posterior distribution for a Cox model with a uniform prior, along with the assumed Gaussian distribution; while similar, the two are clearly different. So it is difficult to calculate the Bayesian evidence analytically. Note that I am choosing to interpret the event times as part of the choice of model; in future work it might be worth considering what happens if the event times are marginalized out.

The following R code creates a simulated data set of right-censored survival times with two predictive predictors (X1, X2) and four noise predictors (X3–X6). It also defines several functions used in calculating the evidence.

library(survival)  # for coxph
library(MASS)      # for mvrnorm

# simulate data
S1 = rexp(1000)                  # group 1: rate 1
S2 = rexp( 500, rate = 0.5)      # group 2: rate 0.5
S3 = rexp( 500, rate = 2)        # group 3: rate 2
ET = c(S1, S2, S3)
X1 = c(rep(0, 1000), rep(1, 500), rep(1, 500))  # predictive
X2 = c(rep(0, 1500), rep(1, 500))               # predictive: separates groups 2 and 3
X3 = sample(1:2, 2000, replace=T)               # noise
X4 = sample(1:2, 2000, replace=T)
X5 = sample(1:2, 2000, replace=T)
X6 = sample(1:2, 2000, replace=T)
ET.raw = ET

# censor events: each observation is censored with probability 0.25
r = runif(2000)
EVENT = rep(0, 2000)
EVENT[r >= 0.25] = 1
r = runif(2000)
ET[EVENT == 0] = r[EVENT == 0]*ET[EVENT == 0]  # censoring time uniform on (0, event time)

df = data.frame(ET,ET.raw, EVENT, X1, X2, X3, X4, X5, X6)

# sort by event time (the likelihood function below assumes sorted times)
index <- sort(ET, index.return=T)
df = df[index$ix, ]
# keep these vectors consistent with the sorted df table for easy use
ET = df$ET
EVENT = df$EVENT
ET.raw = df$ET.raw
X1 = df$X1
X2 = df$X2
X3 = df$X3
X4 = df$X4
X5 = df$X5
X6 = df$X6

# log partial likelihood function (Breslow form; assumes ET sorted ascending)
g.cph.LL <- function(m, ET, EVENT, X)
{
  n = ncol(X)
  t = ET
  k = length(t)
  A = X
  eta = rep(0, k)

  # linear predictor eta = X %*% m
  if (n >= 2)
  {
    for (i in 1:n)
    {
      eta = eta + A[,i]*m[i]
    }
  }
  if (n == 1)
  {
    eta = A[,1]*m
  }

  h = exp(eta)
  H = sum(h) - cumsum(h) + h   # risk-set sums: H[i] = sum of h[j] for t[j] >= t[i]
  P = as.numeric(h/H)
  out = sum(log(P[EVENT==1]))  # sum over observed events only
  return(out)
}

# multivariate Gaussian log density (funny this is not in base or MASS?)
mvdnorm <- function(X, mu, Sigma)
{
  k = length(mu)  # dimension: without the k*log(2*pi) term below, models of
                  # different sizes would not be comparable
  A = -0.5*(k*log(2*pi) + log(det(Sigma)))
  B.1 = solve(Sigma)
  B.2 = X - mu
  B = -0.5*as.numeric(t(B.2) %*% B.1 %*% B.2)
  LP = A + B
  return(LP)
}

# estimate the evidence via importance sampling
g.z <- function(ET, EVENT, A, n.s = 1000)
{
  mod.freq = coxph(Surv(time = ET, event = EVENT)~., data = A)

  # Gaussian proposal g(x) built from the frequentist fit
  SIGMA = mod.freq$var
  MU = as.vector(mod.freq$coefficients)
  SAMP = mvrnorm(n = n.s, mu = MU, Sigma = SIGMA, tol = 1e-6)
  W = numeric(n.s)
  for (i in 1:n.s)
  {
    W.num = g.cph.LL(m = SAMP[i,], ET, EVENT, X = A)       # log f(x_i)
    W.dom = mvdnorm(X = SAMP[i,], mu = MU, Sigma = SIGMA)  # log g(x_i)
    W[i] = W.num - W.dom + 9500.27  # constant offset guards against underflow
  }
  return(log(mean(exp(W))))  # log evidence, up to the arbitrary offset
}
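As a sanity check (my own addition, not from the original post), the partial log-likelihood computed by the risk-set trick in “g.cph.LL” can be compared against the log partial likelihood reported by coxph; with no tied event times the Breslow form below agrees exactly. The data, coefficient, and variable names here are made up for the check:

```r
library(survival)
set.seed(3)

# small simulated data set, sorted by time (the risk-set trick assumes this)
n  <- 200
x  <- rnorm(n)
t  <- rexp(n, rate = exp(0.7 * x))
ev <- rbinom(n, 1, 0.8)
o  <- order(t); t <- t[o]; x <- x[o]; ev <- ev[o]

# Breslow partial log-likelihood, same computation as g.cph.LL in the post
cph.ll <- function(beta, t, ev, x) {
  h <- exp(x * beta)
  H <- sum(h) - cumsum(h) + h   # risk-set sums, valid because t is sorted
  sum(log(h / H)[ev == 1])
}

fit <- coxph(Surv(t, ev) ~ x, ties = "breslow")
ll.mine <- cph.ll(fit$coefficients, t, ev, x)
c(ll.mine, fit$loglik[2])   # the two values agree (no ties in this data)
```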

This next code chunk can be run to calculate the evidence for all subsets of predictors in the simulated data set. Note that I have updated how I do the all-subsets enumeration by using the “expand.grid” function. The code needs both survival and MASS to be loaded.

The strategy is to use the coxph fit (its coefficient vector and variance matrix) to define a multivariate Gaussian distribution, which is taken as $g(x)$. This is used to create samples that are evaluated by “g.cph.LL”, which plays the role of $f(x)$. The average of the ratios is used to calculate the evidence. Note that because of machine-precision issues I am adding roughly 9500 to each log ratio. This is equivalent to multiplying by a very large constant; since only the relative evidence values are needed, this does not change the ordering, and it stops all the weights from being rounded to zero!

# make a list of all subsets; the first three columns (ET, ET.raw, EVENT) are always kept
IF.list = data.frame( TRUE, TRUE, TRUE, expand.grid(rep(list(c(T,F)), 6)))

# initialize storage for the log evidence of each model
colnames(IF.list) = colnames(df)
z = rep(-1, (nrow(IF.list)-1) )

# loop through all models (the last row, which has no predictors, is skipped)
for (i in 1:(nrow(IF.list)-1) )
{
  print(i)
  print(IF.list[i,])
  data.temp = df[, as.numeric(IF.list[i,1:9]) == T]
  A.temp = as.data.frame(data.temp[4:ncol(data.temp)])  # predictor columns only
  z[i] = g.z(ET = ET, EVENT = EVENT, A = A.temp)
}
IF.list = IF.list[1:(nrow(IF.list)-1), ]

# report results
plot(z)            # plot the log evidence for each subset
which.max(z)       # index of the best model
IF.list[which.max(z), ]
# models within 10 log-evidence units of the best
data.frame(log.z = z[(max(z) - z) <= 10], IF.list[(max(z) - z) <= 10, ])
data.temp = df[, as.numeric(IF.list[which.max(z), 1:9]) == T]
A.temp = as.data.frame(data.temp[4:ncol(data.temp)])
summary(coxph(Surv(time = ET, event = EVENT)~., data = A.temp))
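Since only relative evidence values are needed, the log-evidence vector can also be turned into posterior model probabilities (assuming equal prior probability for each subset). A minimal sketch, using a made-up stand-in for the z vector above:

```r
# hypothetical log-evidence values for three models (stand-ins for z above)
z <- c(-9.1, -3.2, -4.0)

# softmax on the log scale; subtracting max(z) avoids underflow, and any
# arbitrary additive offset (like the +9500 above) cancels out here
p <- exp(z - max(z))
p <- p / sum(p)
round(p, 3)   # probabilities summing to 1; model 2 dominates
```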

This code finds the correct parameterization almost all of the time, and in the cases where it does not, the correct model does not have significantly lower evidence than the “best” model. But don’t take my word for it; run the code yourself a few times.

That is it for now. Tune in next time for more Bayesian Business Intelligence.