Our perception of time is arbitrary [citation needed].

It objectively has not been as long as it seems since I wrote a post.

This post shows how to use importance sampling to calculate Bayesian evidence and then applies that to Cox PH models.

Starting with importance sampling, or “IS” (pronounced [iz]) as it likes to be called, which leads to sentences that are difficult to parse. IS is (see how fun “IS is” is?) a sampling strategy that involves drawing samples from a similar distribution and re-weighting them so that their importance corresponds to the target distribution. This is useful in the ubiquitous Bayesian situation where it is possible to evaluate a posterior distribution but not to sample from it directly.

Formally, if g(x) is a function proportional to the target distribution, f(x) is a distribution that is similar to g(x) [such that f(x) > 0 wherever g(x) > 0 is the only necessary condition, but it converges faster if the two distributions share modes and variances], and X is a random sample drawn from f with i’th element x_i, then the importance weight for x_i is w_i = g(x_i)/f(x_i).

The importance weights are used to evaluate the expectations of functions with respect to g. For example, the mean of g can be estimated as (Σ_i w_i x_i) / (Σ_i w_i). The expectation of an arbitrary function h can be estimated as (Σ_i w_i h(x_i)) / (Σ_i w_i). However, in this post I am not interested in evaluating the expectations of functions; as noted in the title, this is about calculating the Bayesian evidence.
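As a quick illustration of the self-normalized estimator above, here is a sketch (not from the original post; the target, proposal, and sample size are arbitrary choices for demonstration):

```r
set.seed(1)
n <- 1e5
g <- function(x) dnorm(x, mean = 2, sd = 1)  # target g: N(2, 1)
f <- function(x) dnorm(x, mean = 0, sd = 3)  # proposal f: wider, covers g
x <- rnorm(n, mean = 0, sd = 3)              # sample from f
w <- g(x) / f(x)                             # importance weights
sum(w * x) / sum(w)                          # estimate of the mean of g; close to 2
```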

It is obvious (i.e., someone pointed it out to me before I noticed) that (1/n) Σ_i w_i ≈ ∫ g(x) dx / ∫ f(x) dx; in other words, the average of the importance weights is the ratio of the integrals over g and f with respect to x. Since f is a normalized probability distribution, ∫ f(x) dx = 1, so the average weight estimates ∫ g(x) dx directly; when g is the likelihood times the prior, that integral is the Bayesian evidence.

Not a proof, but a good thought experiment to make this feel intuitive: let g(x) = N(x; 0, 1) for 0 ≤ x ≤ 5 and g(x) = 0 otherwise, and let f be uniform on the same range, so f(x) = 1/5. Also, because only a finite summation is used, we can reorder the sample so that x_1 ≤ x_2 ≤ … ≤ x_n. The w_i = g(x_i)/f(x_i) = 5 g(x_i), which implies that (1/n) Σ_i w_i = Σ_i (5/n) g(x_i). Now as 5/n is the expected distance between x_i and x_{i+1} for all i, the summation turns into the Riemann sum for ∫ g(x) dx over [0, 5]. Thus (1/n) Σ_i w_i ≈ ∫ N(x; 0, 1) dx over [0, 5], thus (1/n) Σ_i w_i ≈ 0.5. Which is correct, as just over half of the Gaussian distribution has been cut off.
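The thought experiment is easy to check numerically; a quick sketch (the seed and sample size are arbitrary):

```r
set.seed(42)
n <- 1e5
x <- runif(n, 0, 5)        # sample from f: uniform on [0, 5], so f(x) = 1/5
g <- function(x) dnorm(x)  # g: standard normal density restricted to [0, 5]
w <- g(x) / (1 / 5)        # importance weights w_i = 5 * g(x_i)
mean(w)                    # approximately 0.5, the integral of g over [0, 5]
```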

Returning to Cox PH models: as noted before, they do not use a proper likelihood (they use a partial likelihood), and their likelihood function is definitely not Gaussian (Figure 1 shows the log posterior distribution for a Cox model with a uniform prior, as well as the assumed Gaussian distribution; while similar, the two are clearly different). So it is difficult to calculate the Bayesian evidence analytically. Note that I am choosing to interpret the event times as part of the choice of model; in future work it might be worth considering what happens if the event times are marginalized out.

The following R code creates a simulated data set of right-censored survival times with 2 predictive and 4 random predictors. It also defines several functions to be used in calculating the evidence.

```r
# simulate data
S1 = rexp(1000)
S2 = rexp(500, rate = 0.5)
S3 = rexp(500, rate = 2)
ET = c(S1, S2, S3)
X1 = c(rep(0, 1000), rep(1, 500), rep(1, 500))
X2 = c(rep(0, 1500), rep(1, 500))
#X2 = sample(1:2, 2000, replace = T)
X3 = sample(1:2, 2000, replace = T)
X4 = sample(1:2, 2000, replace = T)
X5 = sample(1:2, 2000, replace = T)
X6 = sample(1:2, 2000, replace = T)
ET.raw = ET

# censor events
r = runif(2000)
EVENT = rep(0, 2000)
EVENT[r >= 0.25] = 1
r = runif(2000)
ET[EVENT == 0] = r[EVENT == 0] * ET[EVENT == 0]
df = data.frame(ET, ET.raw, EVENT, X1, X2, X3, X4, X5, X6)

# sort event times
index <- sort(ET, index.return = T)
df = df[index$ix, ]

# these should be consistent with the df table for easy use
ET = df$ET
EVENT = df$EVENT
ET.raw = df$ET.raw
X1 = df$X1
X2 = df$X2
X3 = df$X3
X4 = df$X4
X5 = df$X5
X6 = df$X6

# log partial likelihood function; assumes event times are sorted
g.cph.LL <- function(m, ET, EVENT, X) {
  n = ncol(X)
  k = length(ET)
  eta = rep(0, k)
  if (n >= 2) {
    for (i in 1:n) {
      eta = eta + X[, i] * m[i]
    }
  }
  if (n == 1) {
    eta = X[, 1] * m[1]
  }
  h = exp(eta)
  H = sum(h) - cumsum(h) + h  # risk-set sums
  P = as.numeric(h / H)
  out = sum(log(P[EVENT == 1]))
  return(out)
}

# multivariate normal log density (funny this is not in base or MASS?)
mvdnorm <- function(X, mu, Sigma) {
  d = length(mu)
  A = -0.5 * (d * log(2 * pi) + log(det(Sigma)))
  B = -0.5 * t(X - mu) %*% solve(Sigma) %*% (X - mu)
  LP = A + B
  return(as.numeric(LP))
}

# estimate the evidence by importance sampling
g.z <- function(ET, EVENT, A, n.s = 1000) {
  mod.freq = coxph(Surv(time = ET, event = EVENT) ~ ., data = A)
  SIGMA = mod.freq$var
  MU = as.vector(mod.freq$coefficients)
  SAMP = mvrnorm(n = n.s, mu = MU, Sigma = SIGMA, tol = 1e-6,
                 empirical = FALSE, EISPACK = FALSE)
  W = 1:n.s
  for (i in 1:n.s) {
    W.num = g.cph.LL(m = SAMP[i, ], ET, EVENT, X = A)
    W.dom = mvdnorm(X = SAMP[i, ], mu = MU, Sigma = SIGMA)
    W[i] = W.num - W.dom + 9500.27  # offset to avoid underflow
  }
  return(log(mean(exp(W))))
}
```

This next code chunk can be run to calculate the evidence for all subsets of predictors in the simulated data set. Note that I have updated how I do the all-subsets enumeration by using the “expand.grid” function. The code needs both survival and MASS to be loaded.

The strategy is to use the “coxph” fit to define a multivariate Gaussian distribution, which is taken as f(x). This is used to create samples that are evaluated by “g.cph.LL”, which gives the log of g(x). The ratios are used to calculate the evidence. Note that because of machine precision issues I am adding 9500 to the log ratio. This is equivalent to multiplying by a very large number; since only the relative evidence values are needed, this will not change the ordering, and it stops all the weights from being rounded to zero!
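As an aside, a more general way to handle this underflow than a hand-tuned constant is the log-sum-exp trick: subtract the largest log weight before exponentiating, then add it back. A minimal sketch (the function name is my own, not from the post):

```r
# Stable computation of log(mean(exp(W))) for very negative log weights W.
log.mean.exp <- function(W) {
  M <- max(W)
  M + log(mean(exp(W - M)))
}

W <- c(-9500, -9501, -9502)  # raw exp() of these underflows to zero
log(mean(exp(W)))            # -Inf due to underflow
log.mean.exp(W)              # finite and correct
```

This removes the need to guess a magic offset like 9500.27 for each data set.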

```r
# make list of all subsets
IF.list = data.frame(TRUE, TRUE, TRUE, expand.grid(rep(list(c(T, F)), 6)))
colnames(IF.list) = colnames(df)
z = rep(-1, nrow(IF.list) - 1)

# loop through all models (the last row has no predictors, so skip it)
for (i in 1:(nrow(IF.list) - 1)) {
  print(i)
  print(IF.list[i, ])
  data.temp = df[, as.numeric(IF.list[i, 1:9]) == T]
  A.temp = as.data.frame(data.temp[4:ncol(data.temp)])
  z[i] = g.z(ET = ET, EVENT = EVENT, A = A.temp)
}
IF.list = IF.list[1:(nrow(IF.list) - 1), ]

# report results
plot(z)  # plot the evidence
which.max(z)
IF.list[which.max(z), ]
data.frame(log.z = z[(max(z) - z) <= 10], IF.list[(max(z) - z) <= 10, ])
data.temp = df[, as.numeric(IF.list[which.max(z), 1:9]) == T]
A.temp = as.data.frame(data.temp[4:ncol(data.temp)])
summary(coxph(Surv(time = ET, event = EVENT) ~ ., data = A.temp))
```

This code does find the correct parameterization almost all of the time, and in the cases where it does not, the correct model does not have significantly lower evidence than the “best model”. But don’t take my word for it; run the code yourself a few times.

That is it for now. Tune in next time for more Bayesian Business Intelligence.