# In which we learn that about Bayesian Evidence, Change Point Detection, and latex.

Note although WordPress claims to work with Latex this seems to be lies (or more to the point their plan to make me give them money) so sorry about the terrible formatting of equations.

An applied statistics problem that commonly needs to be solved in BI is change point detection; i.e., determining when a key performance indicator (KPI) changes its underlining model. Commonly the purpose of change point detection is to analyze the effects of a game feature that cannot be AB tested (e.g., there may not be the capability to make the change for only some of the users or there may be concerns about push back from “disadvantaged” users), it can also be an important part of error detection. In either case it may not be known apriori when a change in the KPI occurs or specifically for feature evaluation does the change in KPI performance actually correspond to the feature release.

An approach to change point location is to create a linear model of the KPI, regressing its values (${\bf d}$) onto the time/ index of the observation and other relevant factors. The change points are represented by the design matrix $\text{A}$. So for example a linear model with no change point that might have design matrix

or alternatively a linear model with a change point between the $n$ th and $n+1$ st KPI measures could have design matrix
.
In both cases it is assumed that ${\bf d} = \text{A} {\bf m} + {\bf e}$ where ${\bf m}$ represents an arbitrary model (vector of parameters). The linear models are then compared using Bayesian evidence. In general this is better than P value tests (e.g., anova) as comparisons between models with the same number of parameters will have zero degrees of freedom with which to conduct the test.

As this is the first post it is convenient to review Bayesian inference in general before describe Bayesian evidence; a complete description of Bayesian inference can be found in Refs.[1,2]. In Bayesian inference, model parameters are considered random variables. The distribution of ${\bf m}$ is evolved for its prior distribution, $\pi({\bf m})$, to its posterior probability density (PPD), $P({\bf m}|{\bf d})$, by the introduction of data information via the likelihood function, $\mathcal{L}({\bf m})$.  Using Bayes’ rule, the PPD can be expressed as $P({\bf m}|{\bf d})=\frac{\pi({\bf m})\mathcal{L}({\bf m})} {\mathcal{Z}}$,where $\mathcal{Z}$ is the Bayesian evidence. As the PPD is a distribution and therefor normalized $\mathcal{Z} =\int d {\bf m} \pi({\bf m})\mathcal{L}({\bf m})$ (Now that I am out of grad school I can write internals the way that makes sense, none of this $\int f(x)dx$ non-sense. You don’t agree?, then riddle me why derivatives don’t act like brackets? ). Thus $\mathcal{Z}$ can be thought as the zeroth-dimensional marginal density of the model, which given a “moments contemplation” can thus be interpreted as the probability of the data.

As the name suggests, comparisons of Bayesian evidence are a strong indicator of which model has the greatest support by the data and prior information. The model with the highest evidence evaluated at the observed data, i.e., the model for which the observations are most probable, should be taken as the preferred model.Bayesian evidence accounts for both data t and model parsimony (i.e., a preference for simple models). Parsimony is addressed naturally by penalizing needlessly flexible models, i.e., penalizing models that assign evidence over an unnecessarily large region of the data space. This penalty results from the fact that evidence is a normalized distribution over the data space; models that assign probability too widely over the data space will tend to have lower probabilities at the observed data than more focused models. Intuitively, overly-flexible, non-parsimonious models are undesirable because they are not easily disproved.

Figure 1: Bayesian Evidence of two models (J1, J0) as a function of the data space. The vertical line indicates a possible set of data observations.

The FIg. 1 illustrates the evidence of two models, $J0$ and $J1$, as a function of the data space (schematically compressed into one dimension). Model $J0$ is less parsimonious as $\mathcal{Z}(J0)$ is more widely distributed than $\mathcal{Z}(J1)$. Evidence compares the two models at the observed data (represented by the vertical line). In this example, $J1$ is found to have higher evidence (support from the data and prior) than $J0$.

Evaluating Bayesian evidence can be computationally difficult for many nonlinear problems; I will give strategies to get around this in latter posts. For change point detection fortunately linear models are sufficient, so the evidence can be calculated algebraically giving you more time for other things. The evidence for a linear model with known covariance matrix $\text{C}$ [i.e., ${\bf d}$ ~ Norm( $\text{A}{\bf m}$, $\text{C}$)] is found by evaluating the integral $\int d{\bf m} (2\pi)^{-k/2}|\text{C}|^{-1/2} \exp(-\frac{1}{2}[{\bf d}-\text{A}{\bf m}]^{t}\text{C}^{-1}[{\bf d}-\text{A}{\bf m}])$  which by completing the square (for this step it helps to know the answer and work backwards) can be found to be $( 2 \pi )^{(j-k)/2}| \text{C}|^{-1/2}|\text{A}^t\text{C}^{-1}\text{A}|^{-1/2} \exp (-\frac{1}{2}[{\bf d}^t \text{C}^{-1}{\bf d} - {\bf d}^t \text{C}^{-1} \text{A} (\text{A}^t \text{C}^{-1} \text{A} )^{-1}\text{A}^t \text{C}^{-1} {\bf d} ] )$, where $j$ is the dimension of ${\bf m}$ and  $k$ is the dimension of ${\bf d}$.

This ends the “theoretical” part of this blog; in summary the denominator of posterior distribution is called the Bayesian evidence. For linear models with known covariance matrices this can be calculated analytical by me, and now also, by you! Comparing the Bayesian evidence is good way to find the model with most support from the data and prior information.

Returning to the original problem, how to use Bayesian evidence used to locate changed points. The recipe is to iteratively create models differing by their design matrices $\text{A}$ which represent possible change points. Evaluate the evidence of the models at the data, and compare the evidences. The R code given at the end of the post implements this recipe for a simulated KPI.

Top: KPI as a function of time (circles), the solid blue lines indicate the true/noise free KPI values. The vertical dashed line shows the evidence preferred change point. Bottom The Log Bayesian Evidence for the change point at the indicated time index. The vertical solid blue line gives the true change point. The vertical dashed line shows the evidence preferred change point.

To conclude I just want to make a few final observations. Firstly the assumption of known data covariance matrix might seem impractical, however, with many revenue based KPI such as average revenue per daily active user (ARPDAU) the variance is “known” as the sampled variance for the users can be used to estimate the variance of the mean and in general the there will be high enough degrees of freedom for the t-distribution to be well approximated by a Normal distribution. The $\mathcal{Z}$ can be used to weight the linear models if you have to report a fit to the data. Have fun with my code, try modifying it to check for multiple change points.

If you have any questions or suggestions for future topics please leave a comment.

Cheers Gavin

P.S. tune in next time for “Data Augmentation and Dispersion: Fitting Binomial Data”

g.BE {

# the true parameters
m0 = 0.15
m1 = -0.001
m2 = 0.125
m3 = 0.0005

# the “true data”
time = 1:n
KPI.true = time
KPI.true[time <= cp] = m1*time[time <= cp] +m0 KPI.true[time > cp] = m3*(time[time > cp]-cp) + m2

# add noise to data
sigma.list = runif(n, min=0.001, max=0.01 )
noise = rnorm(n, sd=sigma.list, mean = 0)
KPI = KPI.true + noise

# calculate the evidence

j = 4 # four parameters
k = n # n data points
C = matrix(0, nrow=n , ncol =n) # the data covariance matrix
diag(C) = sigma.list^2 #
DET.C = prod(sigma.list^2) #
C.inv = solve(C) #
d = matrix(KPI, ncol=1, nrow = n ) # the data in matrix form

B.E.list = 0*time # list for the BE to go in to
for ( i in 2:(n-2)) # loop for all possible Ai
{
A = matrix(0, nrow=k , ncol=j) # make the Ai
A[1:i,1] = 1 + A[1:i,1]
A[1:i,2] = 1:i
A[(i+1):n,3] = 1 + A[(i+1):n,1]
A[(i+1):n,4] = (i+1):n -i

#values used to calc the evidence
sigma.inv = t(A)%*%C.inv%*%A
sigma = solve(sigma.inv)
LT = t(A)%*%C.inv%*%d
m.hat = sigma%*%t(A)%*%C.inv%*%d

# the log Bayesian Evedence
log.BE = log((2*pi)^((j-k)/2))
log.BE = log.BE + log( DET.C^(-0.5))
log.BE = log.BE + log( det(sigma.inv)^(-0.5) ) # inverse of inverse
log.BE = log.BE -0.5*QUAD
B.E.list[i] = log.BE

# print stuff
if ( (1==2))
{
print(i)
print(log.BE)
print(data.frame(m.HAT= m.hat[,1], M.true=c(m0, m1, m2, m3)))
print(sigma)
print(sigma.inv)
print(“—————————-“)
}
}

# plot the data
par(mar=c(0,0,0,0), cex.axis=1.3)
mat = matrix(c(3, 3, 3, 4, 1, 5, 6, 6, 6, 7,2,8, 9,9,9 ), nrow =5, ncol=3, byrow = T)
layout(mat=mat, widths= c(0.08, 0.9, 0.02), heights =c(0.02, 0.40, 0.05, 0.40, 0.13) )

# plot KPI
plot(time, KPI, xaxt=”n”, xlim=c(1, n))
axis(1, at=seq(0, n, by = 10), label=F)
axis(2, at=mean(KPI), label=”KPI”, tick=F, mgp=c(0, 3,1))
lines(time, KPI.true, lwd=3, col=”blue”)
abline(v=which.max(B.E.list), lwd=2, lty=2, col=”red”)
points(time, KPI)

# show best model
plot(time,B.E.list, ylim=c(min(B.E.list[2:(n-2)]),max(B.E.list[2:(n-2)]) ), xlim=c(1, n) )
axis(1, at=seq(0, n, by = 10), label=F)
axis(2, at=mean(c(min(B.E.list[2:(n-2)]),max(B.E.list[2:(n-2)]) )), label=”log(Z)”, tick=F, mgp=c(0, 3,1))
axis(1, at=mean(1:n), label=”Time index”, tick=F, mgp=c(0, 3,1))
abline(v=cp, col=”blue”, lwd=3)
abline(v=which.max(B.E.list), lwd=2, lty=2, col=”red”)
points(time,B.E.list)
print(which.max(B.E.list))

}
g.BE()