# Parallel Tempering / Replica Exchange

Hi All
Last time I promised some Bayesian neural network post; but as happens I lied. Basically the post I wanted would be too long so I have broken it into several parts that will drop over the next little while.This part is on a Bayesian MCMC system called Parallel Tempering or Replica Exchange Method (REM).

REM works by running many Markov chains in parallel and allowing them to exchange information. Each chain is run at a different temperature (sometimes you might run many at the same temperature but lets not get ahead of ourselves). Temperature (denoted $t$) in this context is a modifier to the acceptance probability ($A$) of the MCMC steps; i.e.,  $A =$ min ( exp $\{(E_1 - E_2)/t\}$,1). If $t$ is large the MCMC process will tend to accept bad (low probability) models. If $t$ is small only high liklyhood models will be accepted.

At random points in the MCMC walk the chains have the option to swap temperatures with each other. The idea is that high temp. chains are good at exploring the space but bad at finding local maxima, low temp. chains are good at fined maxima but bad at exploring. So we use both. The probability of two randomly selected chains swapping their temperatures is: $A =$ min ( exp $\{(E_1 - E_2) (1/t_1 - 1/t_2) \}$ ,1). IN general i use a strategy of doing  about 100-1000 random swap attempts every 4 or so MCMC steps. These swaps are basically free as there is no need to evaluate any likelihood functions.

So why is this useful?   Consider running conventional MCMC with the target distribution shown in Fig. 1. Because there are four disjointed probability masses it would be extremely difficult to pick perturbation sizes a prioi. Also many adaptive methods will just optimize for one of the modes making sampling in the other difficult as they have different correlation structures.

Figure 1: Multi-modal target distribution. Darker colors indicate higher probabilities

Figure 2 shows the posterior samples from the distribution without the use of REM (Note that no attempt to optimize the proposal distribution was made). 51,000 samples are taken; these where down sampled to 10,000. As can be seen only two of the four modes are sampled by the process. Figure 3 shows samples using REM. Thirteen chains are used with temperatures 1.00 1.19, 1.41, 1.68, 2.00, 2.38, 2.83, 3.36, 4.00, 4.76, 5.66, 6.73, and 8.00. 10,000 MCMC steps are taken; this means that the process has made about 2.5 as many likelihood calculations. All four modes are sampled from!

Figure 2: The multi-modal target distribution shown in Fig. 1. The dots (.) show the posterior samples.

Figure 3: Target distribution with posteriors samples (.) taken using REM.

```Code is appended

g.LL <- function(m)
{
m = as.numeric(m)
mu = abs(m[2]) - abs(m[1])
nu = sqrt(m[1]^2 + m[2]^2 )

#print(mu)
#print(nu)

#print(is.numeric(mu))

P1 = dnorm(mu[1], mean = 0, sd = 0.25, log=T)
P2 = dnorm(nu[1], mean = 3.5, sd = 1, log=T)

#print(P1)
#print(P2)

P = P1 + P2

return(P)
}

g.REM.mcmc <- function(N= 1000, DREM = 1 )
{
UB = c( 5, 5)
LB = c(-5, -5)

N.temps = 13
EX = seq(0, 3, length.out = N.temps )
t.star = 2^EX

M = matrix( nrow = N.temps, ncol= 4)
for ( i in 1:N.temps)
{
m.old = c(runif(1, min = -5, max = 5), runif(1, min = -5, max = 5))
ll.old = g.LL(m = m.old)
M[i,1] = ll.old
M[i,2:3] = m.old
M[i,4] = t.star[i]
}

M.list = c(ll.old, m.old,0)
alpha = 1

for ( i in 1:N)
{
# the mcmc steps for each chain
for (j in 1:N.temps)
{
m.old = M[j,2:3]
ll.old = M[j,1]
t.star = M[j,4]

jump = rnorm(2, mean=0, sd = 0.07) # this is not best but the point is to highlight REM
m.new = m.old + jump
if ( (m.new[1] < UB[1]) & (m.new[2] < UB[2]) & (m.new[1] > LB[1]) & (m.new[2] > LB[2]) )
{
ll.new = g.LL(m = m.new)

# standard mcmc acceptance rule
A = exp((ll.new - ll.old)/t.star)
r = runif(1)

if (r <= A)
{
M[j,1:3] = c(ll.new, m.new)
}
}
}

# Replica exchange method steps

r = runif(1)
if ( (r<= 0.25) & (DREM == 1)) # clearly this (the DREM on switch) should also be in the liklihood step but i am lazy and this fast anyway
{

for (j in 1:100)
{
k = sample(x = 1:N.temps, size= 2, replace =T)

A = exp((M[k[1],1] - M[k[2],1])*( (1/ M[k[1],4] ) - (1/M[k[2],4] ) ))
r = runif(1)
if ( r <= A)
{
T1 = M[k[1],4]
T2 = M[k[2],4]

M[k[1],4] = T2
M[k[2],4] = T1
}
}
}

if ( i <= 9999)
{
M.list = rbind(M.list, c(M[M[,4]==1,1:3 ],i))
rownames(M.list) =NULL
}

# fixed length thinning
if ( i > 9999)
{
alpha = alpha*10000/(alpha+10000)
r = runif(1)
if ( r <= alpha)
{
row = sample(1:10000,1)
M.list[row,] = c(M[M[,4]==1,1:3 ],i)
}
}

}

N.s = nrow(M.list)

return(M.list[floor(N.s/2):N.s,] )
}
samples = g.REM.mcmc(N=10999, DREM = 1 )
png(file = "target3.png")
image(y= seq(-5, 5, length.out =101), x= seq(-5, 5, length.out =101), LL.mat, col = c("white", rev(heat.colors(51))), xlab="m1", ylab ="m2")
points(samples[,2], samples[,3], pch=".")
dev.off()

```