# In which the author gets the Reversible Jump Markov Chain Monte Carlo algorithm and trans-dimension inversion off of his chest.

Last time I promised to write an entry about Data Augmentation and Dispersion however I lied.  The basic problem is that in order to provide an example of either process I need to first talk about Markov chain Monte Carlo (MCMC) algorithms such as  Gibbs sampling or Metropolis-Hastings Sampling.  And if I am going to write about MCMC then there really is no reason not to write about reversible jump Markov chain Monte Carlo (rjMCMC).

And because I talk to much and as well as desire all of these post to more or less stand on their own we shall start by reviewing statistical inversion (parameter estimation). Forward and inverse problems may be broadly defined as data prediction and parameter estimation, respectively. That is, if the function $f$ represents the process of interest (e.g., the systematic part of the KPI) and ${\bf m}$ is a vector of model parameters representing the system (predictors), then calculating predicted data ${\bf d} = f({\bf m})$ represents the forward problem, while solving the same system of equation with an observed data set ${\bf d}_0$ for an estimate of the parameters $\hat{{\bf m}}$ is the inverse problem, i.e., $\hat{{\bf m}} = f^{-1}({\bf d}_0)$. Note, however, an analytic form for $f^{-1}({\bf d}_0)$ rarely exists and consequently the distribution of $\hat{{\bf m}}$ can only be expressed numerically.

The most common sampling paradigm is Metropolis-Hasting (MH) sampling, which produces a random walk through the parameter space. The walk is conducted such that the long-run distribution of the steps converge to the target distribution, e.g., the PPD. Figure 1 shows an example of MH sampling for a one-dimensional target distribution; the distribution of the steps converge to the target as the sampling progresses.

Figure 1: Samples (dots) and distribution for Metropolis-Hasting sampling after 1,000 (left) and 10,000 (right) iterations. The blue line indicates the target distribution and the shaded area is the sampled distribution.

The r code for this example is shown below; in summary the algorithm has 3 steps. These are draw the proposed model, accept or reject the proposed model, and save the model. Starting with the first step the proposed model ${\bf m'}_t$ is drawn randomly from the distribution $Q({\bf m'}_t|{\bf m}_t)$ which as the notation suggests can only depend on constants  and the current model (this is not precisely true but is a simplifying lie). The second step has two parts (this is apparently distinct from there being four steps with one part each…); the first part is to calculate the acceptance probability, the other is to accept the proposed model with that probability. The acceptance probability is

where $\mathcal{L}$ is the likelihood and $\pi$ is the prior. If the proppsed model is accepted then ${\bf m}_{t+1} = {\bf m'}_t$ else ${\bf m}_{t+1} = {\bf m}_t$. In the final step ${\bf m}_{t+1}$ is saved.

The rjMCMC is described after this code segment.

g.MCMC<-function(N= 10000)
{
# this stuff is to start the process
data = rnorm(100, mean=0, sd=1) # the simulated data mean forced to be
M = c(-Inf, 0)                    # starting model (log likelihood, m)
Q.mu1 = 0.05                      # perterbation SD (chosen here to be bad for plotting !)
SAMPLES = M

# plot stuff
par(mar = c(0.1, 0.1, 0.1, 0.1)  )
layout(mat = matrix(c(7,7,7,7,7,5,1,2,3,4,6,6,6,6,6), nrow=3, ncol=5, byrow=T),
widths = c(0.3, 1, 0.5, 1, 0.5), heights = c(0.05, 1, 0.2)   )
M.frame = seq(-0.4, 0.4, length.out = 1001)
Target = dnorm(M.frame, mean= mean(data), sd=1/sqrt(length(data))) # the target distribution

# the prior
LB = -3
UB = 3

# the sampling
for (i in 1:N)
{
M.prime    = M                                           # Step 1 the proposal of a new model
M.prime[2] = M.prime[2] + rnorm(1, mean = 0, sd = Q.mu1) # the perterbation

# calculate the likelihood of the data
e          = data – M.prime[2]                  # the residuals
LL         = dnorm(e, mean=0, sd = 1, log = T ) # the log likelihood
M.prime[1] = sum(LL)                            # add LLs to get total log likelihood

# Step 2 MCMC acceptance
r = runif(1)
A = exp(M.prime[1]-M[1])
if (is.na(A)) {A= 0}
if ((A>=r)&(M.prime[2]>= LB)&(M.prime[2]<= UB))
{
M = M.prime
}
# Step 3 record sample
SAMPLES = rbind(SAMPLES, M)
rownames(SAMPLES) = NULL

if (i == 1000)
{
SAMPLES=SAMPLES[1:1000,]
# plot PPD
plot(1:1000,SAMPLES[,2], pch=”.”, yaxs=”i”, ylim=c(-0.4, 0.4), xlab=”Sample Index”, ylab=”m”)
axis(1, at = 500, tick = F, label = “Sample Index”, cex.axis=1.5, mgp=c(3, 3, 0))
axis(2, at = 0, tick = F, label = “m”, cex.axis=1.5, mgp=c(3, 3, 0))
H=hist(SAMPLES[,2], breaks=seq(min(SAMPLES[,2]),max(SAMPLES[,2]), length.out=51), plot=F)
plot( seq(0, 1.04*max(H$den), length.out=length(H$mids)),H$mids, type=”n”, xaxs=”i”, ylab=””, xlab=””, yaxs=”i”, yaxt=”n”, xaxt=”n”, ylim=c(-0.4, 0.4)) axis(1, at = mean( seq(0, 1.04*max(H$den), length.out=length(H$mids))), tick = F, label = “P(m)”, cex.axis=1.5, mgp=c(3, 3, 0)) polygon( c(0, H$den, 0), c(-0.4, H$mids, 0.4), col=”grey”) lines( Target, M.frame, col=”blue”, lwd=3) } if (i == N) { # plot PPD plot(1:N,SAMPLES[,2], pch=”.”, yaxs=”i”, ylim=c(-0.4, 0.4), xlab=”Sample Index”, ylab=”m”, yaxt=”n”) axis(1, at = N/2, tick = F, label = “Sample Index”, cex.axis=1.5, mgp=c(3, 3, 0)) H=hist(SAMPLES[,2], breaks=seq(min(SAMPLES[,2]),max(SAMPLES[,2]), length.out=51), plot=F) plot( seq(0, 1.04*max(H$den), length.out=length(H$mids)),H$mids, type=”n”, xaxs=”i”, ylab=””,
xlab=””, yaxs=”i”, yaxt=”n”, xaxt=”n”, ylim=c(-0.4, 0.4))
axis(1, at = mean( seq(0, 1.04*max(H$den), length.out=length(H$mids))), tick = F, label = “P(m)”, cex.axis=1.5, mgp=c(3, 3, 0))
polygon( c(0, H$den, 0), c(-0.4, H$mids, 0.4), col=”grey”)
lines( Target, M.frame, col=”blue”, lwd=3)
}

}
}
g.MCMC()

So far we have only considered cases where the number of parameters is known(i.e., the dimension of the PPD fixed), however in many setting the number of parameters is not know a priori. In such cases it may be possible to calculate the evidence for each model and select the best one, but in common applications situation that can be computationally impractical. In addition even if it is possible the point estimate for the dimension of the PPD will have uncertainty that will not be accounted for in the parameter estimate uncertainties! The rjMCMC algorithm solves both these problems. The PPD is expanded to span multiple dimensions (numbers of parameters). Like the MCMC algorithm rjMCMC takes a random walk though the parameter space but now the walk can not only perturb the parameter values but “Birth” and “Death” sets if parameters. The rjMCMC acceptance probability is
.
The time subscript (t) has been suppressed for clarity. The subscript $j$ indicates the dimension of the model vector ${\bf m}$. The term |J| is the Jacobin of the diffeomorphism of the transform between dimension j and j’. In general if only the parameters that are birthed or deathed are altered if the dimension changes this will be 1 and be ignored. Also the proposal distribution Q can really be thought of as three parts the birth proposal, the death proposal and the perturbation proposal; i.e.,

The proposal distributions will operate in pares as the reverse move for a perturbation must be a perturbation, a birth must pare with a death, and a death with a birth. Also, in general the prior ratio will change only by the prior of the birthed / deathed parameters. Thus if the birthed parameters are proposed from the prior The acceptance probability becomes that of the standard MCMC.

To illustrate the rjMCMC algorithm consider the following regression example. It conists of regressing 20 y’s onto 20 x’s. The order of the model is unknown, i.e., it could be a constant, linear, or quadratic. The data and true model (linear) is shown in the next figure.

The simulated noisy (circles) and true (blue line) data used inthe rjMCMC example.

The posterior uncertainty of the parameters is shown in the histograms below. Note the PPD might be under-sampled but i wanted the code to run relatively quickly and I do not want my out put to look different from your s when you run it! Out of the 25000 samples models 7595 had only a constant, 17376 had a mean and linear trend and only 28 had a quadratic term! the code for this is below.

The posterior marginal uncertainties of beta0, beta1, and beta2, the regression parameters. The dashed vertical blue lines indicate the “true” values.

To recap Bayesians commonly approximate the PPD with samples. These samples can be gathered by MCMC type algorithms. It is possible to collect the samples even if the number of model parameters are unknown. And the rjMCMC algorithm selects parameters from models proportionally to their evidence. Finally parameter uncertainties account for model selection uncertainties!

Tune in next time for Bayesian BI! Also I recently moved to Hothead games so check out our products if you get a chance (Boom Boom Soccer and Boom Boom Foot Ball are in soft launch so please play them and leave any comments you have!)

P.S. I have noticed that this blog format changes some of the symbols in the code if you have trouble running it leave a comment and I will email you directly.
The rjMCMC code:
g.predict<-function(Mod, X)
{
j     = Mod[2]
beta0 = Mod[3]
beta1 = Mod[4]
beta2 = Mod[5]

P = 0*X + beta0
if (j >= 2) {P = P + X*beta1}
if (j >= 3) {P = P + (X^2)*beta2}

return(P)
}

g.perterb<-function(M=c(-Inf, 3, 0, 0, 0), Qsd=c(0, 0, 0.1, 0.01, 0.001), LB = c(0, 1, -10, -2, -1 ), UB = c(0, 1, 10, 2, 1 )  , data=data.frame(1:100, rnorm(100)))
{
# unpacking hte parameters
LL    = M[1]
j     = M[2]
#beta0 = M[3]
#beta1 = M[4]
#beta2 = M[5]
x     = data[,1]
y     = data[,2]

ORDER = sample(3:(3+j-1), j)

for (i in ORDER)
{
M.prime = M                                           # make the proposal model
M.prime[i] = M.prime[i] + rnorm(1, mean = 0, sd= Qsd[i]) # add random noise to old model
P = g.predict(M.prime, x)                             # get predicted values

LL.prime = sum(dnorm(y-P, mean= 0, sd=1, log=T))      # compute loglikihood
M.prime[1] = LL.prime                                 # save LL

r = runif(1)                                          # random uniform
MH = exp(LL.prime – LL)                               # Metropolis-hasting acceptance probability value
if ((r <= MH) & (M.prime[i] >= LB[i]) & (M.prime[i] <= UB[i])  ) {M = M.prime}                            # if accepted and in bounds

}
return(M)
}

g.birth<-function(M=c(-Inf, 1, 0, 0, 0), Qsd=c(0, 0, 0.1, 0.01, 0.001), LB = c(0, 1, -10, -2, -1 ), UB = c(0, 1, 10, 2, 1 )  , data=data.frame(1:100, rnorm(100)))
{
# unpacking hte parameters
LL    = M[1]
j     = M[2]
x     = data[,1]
y     = data[,2]

if (j == 1)
{
M.prime = M                                           # make the proposal model
M.prime[2] = 2
M.prime[4] = runif(1, min = LB[4], max = UB[4])       # propose from prior
P = g.predict(M.prime, x)                             # get predicted values

LL.prime = sum(dnorm(y-P, mean= 0, sd=1, log=T))      # compute loglikihood
M.prime[1] = LL.prime                                 # save LL

r = runif(1)                                          # random uniform
MH = exp(LL.prime – LL)                               # Metropolis-hasting acceptance probability value
if (r <= MH) {M = M.prime}                            # if accepted

}

if (j == 2)
{
M.prime = M                                           # make the proposal model
M.prime[2] = 3
M.prime[5] = runif(1, min = LB[5], max = UB[5])       # propose from prior
P = g.predict(M.prime, x)                             # get predicted values

LL.prime = sum(dnorm(y-P, mean= 0, sd=1, log=T))      # compute loglikihood
M.prime[1] = LL.prime                                 # save LL

r = runif(1)                                          # random uniform
MH = exp(LL.prime – LL)                               # Metropolis-hasting acceptance probability value
if (r <= MH) {M = M.prime}                            # if accepted

}

return(M)
}

g.death<-function(M=c(-Inf, 2, 1, 0.5, 0.0), Qsd=c(0, 0, 0.1, 0.01, 0.001), LB = c(0, 1, -10, -2, -1 ), UB = c(0, 1, 10, 2, 1 )  , data=data.frame(1:100, rnorm(100)))
{
# unpacking hte parameters
LL    = M[1]
j     = M[2]
x     = data[,1]
y     = data[,2]

if (j == 3)
{
M.prime = M                                           # make the proposal model
M.prime[2] = 2
M.prime[5] = 0                                        # propose from prior
P = g.predict(M.prime, x)                             # kill the parameter

LL.prime = sum(dnorm(y-P, mean= 0, sd=1, log=T))      # compute loglikihood
M.prime[1] = LL.prime                                 # save LL

r = runif(1)                                          # random uniform
MH = exp(LL.prime – LL)                               # Metropolis-hasting acceptance probability value
if (r <= MH) {M = M.prime}                            # if accepted

}

if (j == 2)
{
M.prime = M                                           # make the proposal model
M.prime[2] = 1
M.prime[4] = 0                                        # kill the parameter
P = g.predict(M.prime, x)                             # get p

LL.prime = sum(dnorm(y-P, mean= 0, sd=1, log=T))      # compute loglikihood
M.prime[1] = LL.prime                                 # save LL

r = runif(1)                                          # random uniform
MH = exp(LL.prime – LL)                               # Metropolis-hasting acceptance probability value
if (r <= MH) {M = M.prime}                            # if accepted

}

return(M)
}

}

g.explore<-function(old, d)
{
Qsd. = c(0, 0, 0.1, 0.01, 0.001)
LB.  = c(0, 1, -10, -2, -1 )
UB.  = c(0, 1, 10, 2, 1 )

move.type = sample(1:3, 1) # the type of move i.e., perterb, birth, death

if (move.type == 1) {old = g.perterb(M=old, Qsd =Qsd., LB = LB., UB=UB., data= d)}
if (move.type == 2) {old = g.birth(M=old,  Qsd =Qsd., LB = LB., UB=UB., data = d)}
if (move.type == 3) {old = g.death(M=old, Qsd =Qsd., LB = LB., UB=UB., data = d )}

return(old)
}

g.rjMCMC<-function(Ndat = 100, Nsamp = 25000, BURN = 1000)
{
#+ (1:Ndat)*0.75
beta0 = 3
beta1 = 0.1
beta2 = 0

data = data.frame(x = 1:Ndat, y = beta0  +rnorm(Ndat)+ (1:Ndat)*beta1  ) # the simulated data

x11(height=4, width = 5)
plot(data[,1], data[,2], xlab=”x”, ylab=”y”, main = “Simulated Data”)
lines(1:Ndat,beta0  +  (1:Ndat)*beta1, col=”blue”, lwd=3 )
points(data[,1], data[,2])

Mod.old  = c(-Inf, 1, 4, 0, 0)

for(i in 1:BURN) # the burn in
{
Mod.old = g.explore(old = Mod.old, d = data)
}
print(Mod.old)

REC = Mod.old
for(i in 1:(Nsamp-1)) # the burn in
{
Mod.old = g.explore(old = Mod.old, d = data)
REC = rbind(REC, Mod.old)
rownames(REC) = NULL
}

print(table(REC[,2]))

x11(height=4, width = 8)
x = 16
par(mar = c(4,4,1,1), oma = c(1,1,1,1))
layout(mat = matrix(c(1, 2, 3), nrow=1, ncol=3, byrow=T) )

REC = rbind(REC, c(0, 3, 0,0,0)) # just to make the ploting easier

H1 = hist(REC[,3],breaks = seq(-10, 10, length.out = 1001),  plot=F)
H2 = hist(REC[REC[,2] >= 2 ,4], breaks = seq(-2, 2, length.out = 1001), plot=F)
H3 = hist(REC[REC[,2] >= 3 ,5], breaks = seq(-1, 1, length.out = 1001), plot=F)

plot(H1$mids, H1$den, type=”n”, xlab=”Beta 0″, ylab= “P(Beta 0)”,xaxs=”i”, yaxs=”i”)
polygon(x=c(H1$mids[1], H1$mids, H1$mids[length(H1$mids)] ), y=c(0, H1$den, 0), col=”grey”, border=F ) abline( v = beta0, col=”blue”, lwd=2, lty=2 ) plot(H2$mids, H2$den, type=”n”, xlab=”Beta 1″, ylab= “P(Beta 1)”,xaxs=”i”, yaxs=”i”) polygon(x=c(H2$mids[1], H2$mids, H2$mids[length(H2$mids)] ), y=c(0, H2$den, 0), col=”grey”, border=F  )
abline( v = beta1, col=”blue”, lwd=2, lty=2  )

plot(H3$mids, H3$den, type=”n”, xlab=”Beta 2″, ylab= “P(Beta 2)”,xaxs=”i”, yaxs=”i”)
polygon(x=c(H3$mids[1], H3$mids, H3$mids[length(H3$mids)] ), y=c(0, H3\$den, 0), col=”grey”, border=F  )
abline( v = beta2, col=”blue”, lwd=2, lty=2  )

}

g.rjMCMC(Ndat = 20)

# In which we learn that about Bayesian Evidence, Change Point Detection, and latex.

Note although WordPress claims to work with Latex this seems to be lies (or more to the point their plan to make me give them money) so sorry about the terrible formatting of equations.

An applied statistics problem that commonly needs to be solved in BI is change point detection; i.e., determining when a key performance indicator (KPI) changes its underlining model. Commonly the purpose of change point detection is to analyze the effects of a game feature that cannot be AB tested (e.g., there may not be the capability to make the change for only some of the users or there may be concerns about push back from “disadvantaged” users), it can also be an important part of error detection. In either case it may not be known apriori when a change in the KPI occurs or specifically for feature evaluation does the change in KPI performance actually correspond to the feature release.

An approach to change point location is to create a linear model of the KPI, regressing its values (${\bf d}$) onto the time/ index of the observation and other relevant factors. The change points are represented by the design matrix $\text{A}$. So for example a linear model with no change point that might have design matrix

or alternatively a linear model with a change point between the $n$ th and $n+1$ st KPI measures could have design matrix
.
In both cases it is assumed that ${\bf d} = \text{A} {\bf m} + {\bf e}$ where ${\bf m}$ represents an arbitrary model (vector of parameters). The linear models are then compared using Bayesian evidence. In general this is better than P value tests (e.g., anova) as comparisons between models with the same number of parameters will have zero degrees of freedom with which to conduct the test.

As this is the first post it is convenient to review Bayesian inference in general before describe Bayesian evidence; a complete description of Bayesian inference can be found in Refs.[1,2]. In Bayesian inference, model parameters are considered random variables. The distribution of ${\bf m}$ is evolved for its prior distribution, $\pi({\bf m})$, to its posterior probability density (PPD), $P({\bf m}|{\bf d})$, by the introduction of data information via the likelihood function, $\mathcal{L}({\bf m})$.  Using Bayes’ rule, the PPD can be expressed as $P({\bf m}|{\bf d})=\frac{\pi({\bf m})\mathcal{L}({\bf m})} {\mathcal{Z}}$,where $\mathcal{Z}$ is the Bayesian evidence. As the PPD is a distribution and therefor normalized $\mathcal{Z} =\int d {\bf m} \pi({\bf m})\mathcal{L}({\bf m})$ (Now that I am out of grad school I can write internals the way that makes sense, none of this $\int f(x)dx$ non-sense. You don’t agree?, then riddle me why derivatives don’t act like brackets? ). Thus $\mathcal{Z}$ can be thought as the zeroth-dimensional marginal density of the model, which given a “moments contemplation” can thus be interpreted as the probability of the data.

As the name suggests, comparisons of Bayesian evidence are a strong indicator of which model has the greatest support by the data and prior information. The model with the highest evidence evaluated at the observed data, i.e., the model for which the observations are most probable, should be taken as the preferred model.Bayesian evidence accounts for both data t and model parsimony (i.e., a preference for simple models). Parsimony is addressed naturally by penalizing needlessly flexible models, i.e., penalizing models that assign evidence over an unnecessarily large region of the data space. This penalty results from the fact that evidence is a normalized distribution over the data space; models that assign probability too widely over the data space will tend to have lower probabilities at the observed data than more focused models. Intuitively, overly-flexible, non-parsimonious models are undesirable because they are not easily disproved.

Figure 1: Bayesian Evidence of two models (J1, J0) as a function of the data space. The vertical line indicates a possible set of data observations.

The FIg. 1 illustrates the evidence of two models, $J0$ and $J1$, as a function of the data space (schematically compressed into one dimension). Model $J0$ is less parsimonious as $\mathcal{Z}(J0)$ is more widely distributed than $\mathcal{Z}(J1)$. Evidence compares the two models at the observed data (represented by the vertical line). In this example, $J1$ is found to have higher evidence (support from the data and prior) than $J0$.

Evaluating Bayesian evidence can be computationally difficult for many nonlinear problems; I will give strategies to get around this in latter posts. For change point detection fortunately linear models are sufficient, so the evidence can be calculated algebraically giving you more time for other things. The evidence for a linear model with known covariance matrix $\text{C}$ [i.e., ${\bf d}$ ~ Norm( $\text{A}{\bf m}$, $\text{C}$)] is found by evaluating the integral $\int d{\bf m} (2\pi)^{-k/2}|\text{C}|^{-1/2} \exp(-\frac{1}{2}[{\bf d}-\text{A}{\bf m}]^{t}\text{C}^{-1}[{\bf d}-\text{A}{\bf m}])$  which by completing the square (for this step it helps to know the answer and work backwards) can be found to be $( 2 \pi )^{(j-k)/2}| \text{C}|^{-1/2}|\text{A}^t\text{C}^{-1}\text{A}|^{-1/2} \exp (-\frac{1}{2}[{\bf d}^t \text{C}^{-1}{\bf d} - {\bf d}^t \text{C}^{-1} \text{A} (\text{A}^t \text{C}^{-1} \text{A} )^{-1}\text{A}^t \text{C}^{-1} {\bf d} ] )$, where $j$ is the dimension of ${\bf m}$ and  $k$ is the dimension of ${\bf d}$.

This ends the “theoretical” part of this blog; in summary the denominator of posterior distribution is called the Bayesian evidence. For linear models with known covariance matrices this can be calculated analytical by me, and now also, by you! Comparing the Bayesian evidence is good way to find the model with most support from the data and prior information.

Returning to the original problem, how to use Bayesian evidence used to locate changed points. The recipe is to iteratively create models differing by their design matrices $\text{A}$ which represent possible change points. Evaluate the evidence of the models at the data, and compare the evidences. The R code given at the end of the post implements this recipe for a simulated KPI.

Top: KPI as a function of time (circles), the solid blue lines indicate the true/noise free KPI values. The vertical dashed line shows the evidence preferred change point. Bottom The Log Bayesian Evidence for the change point at the indicated time index. The vertical solid blue line gives the true change point. The vertical dashed line shows the evidence preferred change point.

To conclude I just want to make a few final observations. Firstly the assumption of known data covariance matrix might seem impractical, however, with many revenue based KPI such as average revenue per daily active user (ARPDAU) the variance is “known” as the sampled variance for the users can be used to estimate the variance of the mean and in general the there will be high enough degrees of freedom for the t-distribution to be well approximated by a Normal distribution. The $\mathcal{Z}$ can be used to weight the linear models if you have to report a fit to the data. Have fun with my code, try modifying it to check for multiple change points.

Cheers Gavin

P.S. tune in next time for “Data Augmentation and Dispersion: Fitting Binomial Data”

g.BE {

# the true parameters
m0 = 0.15
m1 = -0.001
m2 = 0.125
m3 = 0.0005

# the “true data”
time = 1:n
KPI.true = time
KPI.true[time <= cp] = m1*time[time <= cp] +m0 KPI.true[time > cp] = m3*(time[time > cp]-cp) + m2

sigma.list = runif(n, min=0.001, max=0.01 )
noise = rnorm(n, sd=sigma.list, mean = 0)
KPI = KPI.true + noise

# calculate the evidence

j = 4 # four parameters
k = n # n data points
C = matrix(0, nrow=n , ncol =n) # the data covariance matrix
diag(C) = sigma.list^2 #
DET.C = prod(sigma.list^2) #
C.inv = solve(C) #
d = matrix(KPI, ncol=1, nrow = n ) # the data in matrix form

B.E.list = 0*time # list for the BE to go in to
for ( i in 2:(n-2)) # loop for all possible Ai
{
A = matrix(0, nrow=k , ncol=j) # make the Ai
A[1:i,1] = 1 + A[1:i,1]
A[1:i,2] = 1:i
A[(i+1):n,3] = 1 + A[(i+1):n,1]
A[(i+1):n,4] = (i+1):n -i

#values used to calc the evidence
sigma.inv = t(A)%*%C.inv%*%A
sigma = solve(sigma.inv)
LT = t(A)%*%C.inv%*%d
m.hat = sigma%*%t(A)%*%C.inv%*%d

# the log Bayesian Evedence
log.BE = log((2*pi)^((j-k)/2))
log.BE = log.BE + log( DET.C^(-0.5))
log.BE = log.BE + log( det(sigma.inv)^(-0.5) ) # inverse of inverse
B.E.list[i] = log.BE

# print stuff
if ( (1==2))
{
print(i)
print(log.BE)
print(data.frame(m.HAT= m.hat[,1], M.true=c(m0, m1, m2, m3)))
print(sigma)
print(sigma.inv)
print(“—————————-“)
}
}

# plot the data
par(mar=c(0,0,0,0), cex.axis=1.3)
mat = matrix(c(3, 3, 3, 4, 1, 5, 6, 6, 6, 7,2,8, 9,9,9 ), nrow =5, ncol=3, byrow = T)
layout(mat=mat, widths= c(0.08, 0.9, 0.02), heights =c(0.02, 0.40, 0.05, 0.40, 0.13) )

# plot KPI
plot(time, KPI, xaxt=”n”, xlim=c(1, n))
axis(1, at=seq(0, n, by = 10), label=F)
axis(2, at=mean(KPI), label=”KPI”, tick=F, mgp=c(0, 3,1))
lines(time, KPI.true, lwd=3, col=”blue”)
abline(v=which.max(B.E.list), lwd=2, lty=2, col=”red”)
points(time, KPI)

# show best model
plot(time,B.E.list, ylim=c(min(B.E.list[2:(n-2)]),max(B.E.list[2:(n-2)]) ), xlim=c(1, n) )
axis(1, at=seq(0, n, by = 10), label=F)
axis(2, at=mean(c(min(B.E.list[2:(n-2)]),max(B.E.list[2:(n-2)]) )), label=”log(Z)”, tick=F, mgp=c(0, 3,1))
axis(1, at=mean(1:n), label=”Time index”, tick=F, mgp=c(0, 3,1))
abline(v=cp, col=”blue”, lwd=3)
abline(v=which.max(B.E.list), lwd=2, lty=2, col=”red”)
points(time,B.E.list)
print(which.max(B.E.list))

}
g.BE()