Note: although WordPress claims to support LaTeX, this seems to be a lie (or, more to the point, their plan to make me give them money), so sorry about the terrible formatting of the equations.

An applied statistics problem that commonly needs to be solved in BI is change point detection, i.e., determining when a key performance indicator (KPI) changes its underlying model. Commonly the purpose of change point detection is to analyze the effects of a game feature that cannot be A/B tested (e.g., there may not be the capability to make the change for only some of the users, or there may be concerns about push-back from "disadvantaged" users); it can also be an important part of error detection. In either case it may not be known a priori when a change in the KPI occurs, or, specifically for feature evaluation, whether the change in KPI performance actually corresponds to the feature release.

An approach to change point location is to create a linear model of the KPI, regressing its values ($\mathbf{d}$) onto the time/index of the observations and other relevant factors. The change points are represented by the design matrix $\mathbf{A}$. So for example a linear model with no change point might have design matrix

$$\mathbf{A} = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ \vdots & \vdots \\ 1 & n \end{pmatrix},$$

or alternatively a linear model with a change point between the $i$th and $(i+1)$st KPI measures could have design matrix

$$\mathbf{A}_i = \begin{pmatrix} 1 & 1 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & i & 0 & 0 \\ 0 & 0 & 1 & 1 \\ \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 1 & n-i \end{pmatrix}.$$
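To make the construction concrete, here is a small sketch in R that builds both design matrices for a toy series (the values of `n` and `i` are mine, chosen only for illustration):

```r
# Sketch: the two design matrices for n = 6 observations,
# with a candidate change point after the i = 3rd measurement.
n <- 6
i <- 3

# No change point: an intercept column and a time column
A0 <- cbind(1, 1:n)

# Change point after observation i: separate intercept/slope pairs
# for the before-segment (columns 1-2) and after-segment (columns 3-4),
# with zeros everywhere else
Ai <- matrix(0, nrow = n, ncol = 4)
Ai[1:i, 1] <- 1
Ai[1:i, 2] <- 1:i
Ai[(i + 1):n, 3] <- 1
Ai[(i + 1):n, 4] <- (i + 1):n - i
```

Note that in the second matrix the after-segment slope is counted from the change point, so the two segments are fit independently.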

In both cases it is assumed that $\mathbf{d} = \mathbf{A}\mathbf{m} + \text{noise}$, where $\mathbf{m}$ represents an arbitrary model (vector of parameters). The linear models are then compared using **Bayesian evidence**. In general this is better than p-value tests (e.g., ANOVA), as comparisons between models with the same number of parameters will have zero degrees of freedom with which to conduct the test.

As this is the first post it is convenient to review Bayesian inference in general before describing Bayesian evidence; a complete description of Bayesian inference can be found in Refs. [1,2]. In Bayesian inference, model parameters are considered random variables. The distribution of $\mathbf{m}$ is evolved from its prior distribution, $p(\mathbf{m})$, to its posterior probability density (PPD), $p(\mathbf{m}|\mathbf{d})$, by the introduction of data information via the likelihood function, $p(\mathbf{d}|\mathbf{m})$. Using Bayes' rule, the PPD can be expressed as

$$p(\mathbf{m}|\mathbf{d}) = \frac{p(\mathbf{d}|\mathbf{m})\, p(\mathbf{m})}{Z},$$

where $Z$ is the Bayesian evidence. As the PPD is a distribution and therefore normalized,

$$Z = \int d\mathbf{m}\; p(\mathbf{d}|\mathbf{m})\, p(\mathbf{m}).$$

(Now that I am out of grad school I can write integrals the way that makes sense, with the measure up front, none of this $\int p(\mathbf{d}|\mathbf{m})\, p(\mathbf{m})\, d\mathbf{m}$ nonsense. You don't agree? Then riddle me why derivatives don't act like brackets?) Thus $Z$ can be thought of as the zeroth-dimensional marginal density of the model, which given a "moment's contemplation" can thus be interpreted as the probability of the data.
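A quick numerical sanity check of this picture, using a toy model of my own choosing (a single observation with a Normal likelihood and Normal prior, where the evidence is known analytically):

```r
# Sketch: the evidence as the normalizing constant of the PPD.
# Toy model: one observation d ~ Norm(m, 1), prior m ~ Norm(0, 1).
d <- 0.5
dm <- 0.001
m.grid <- seq(-10, 10, by = dm)

prior <- dnorm(m.grid, mean = 0, sd = 1)
lik   <- dnorm(d, mean = m.grid, sd = 1)

Z   <- sum(lik * prior) * dm   # numerical integral of likelihood x prior
ppd <- lik * prior / Z         # Bayes' rule

# The PPD integrates to one, and Z matches the analytic
# prior predictive density Norm(d; 0, sqrt(2)).
sum(ppd) * dm                  # ~ 1
dnorm(d, mean = 0, sd = sqrt(2))  # ~ Z
```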

As the name suggests, comparisons of Bayesian evidence are a strong indicator of which model has the greatest support from the data and prior information. The model with the highest evidence evaluated at the observed data, i.e., the model for which the observations are most probable, should be taken as the preferred model. Bayesian evidence accounts for both data fit and model parsimony (i.e., a preference for simple models). Parsimony is addressed naturally by penalizing needlessly flexible models, i.e., penalizing models that assign evidence over an unnecessarily large region of the data space. This penalty results from the fact that evidence is a normalized distribution over the data space; models that assign probability too widely over the data space will tend to have lower probabilities at the observed data than more focused models. Intuitively, overly flexible, non-parsimonious models are undesirable because they are not easily disproved.
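The parsimony penalty can be seen in a two-line example (the toy models and numbers here are my own, not from the post): two models with identical likelihoods but priors of different widths, compared at the same observation.

```r
# Sketch: the Occam/parsimony penalty.
# Both models say d ~ Norm(m, 1). The "focused" model has prior
# m ~ Norm(0, 1); the "flexible" one has prior m ~ Norm(0, 10).
# For this conjugate setup the evidence is the prior predictive at d.
d <- 0.5
Z.focused  <- dnorm(d, mean = 0, sd = sqrt(1 + 1^2))   # evidence spread narrowly
Z.flexible <- dnorm(d, mean = 0, sd = sqrt(1 + 10^2))  # evidence spread thinly

Z.focused > Z.flexible  # TRUE: near-zero data favor the focused model
```

The flexible model "could explain" a huge range of data, so it has to spread its normalized evidence thinly, and loses at any observation the focused model also covers.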

Fig. 1 illustrates the evidence of two models, $M_1$ and $M_2$, as a function of the data space (schematically compressed into one dimension). Model $M_1$ is less parsimonious, as its evidence is more widely distributed than that of $M_2$. Evidence compares the two models at the observed data (represented by the vertical line). In this example, $M_2$ is found to have higher evidence (support from the data and prior) than $M_1$.

Evaluating Bayesian evidence can be computationally difficult for many nonlinear problems; I will give strategies to get around this in later posts. For change point detection, fortunately, linear models are sufficient, so the evidence can be calculated algebraically, giving you more time for other things. The evidence for a linear model with known covariance matrix [i.e., $\mathbf{d} \sim \mathrm{Norm}(\mathbf{A}\mathbf{m}, \mathbf{C})$] and a flat prior on $\mathbf{m}$ is found by evaluating the integral

$$Z = \int d\mathbf{m}\; (2\pi)^{-k/2}\, |\mathbf{C}|^{-1/2} \exp\!\left[-\tfrac{1}{2}(\mathbf{d}-\mathbf{A}\mathbf{m})^T \mathbf{C}^{-1} (\mathbf{d}-\mathbf{A}\mathbf{m})\right],$$

which by completing the square (for this step it helps to know the answer and work backwards) can be found to be

$$Z = (2\pi)^{(j-k)/2}\, |\mathbf{C}|^{-1/2}\, |\mathbf{A}^T\mathbf{C}^{-1}\mathbf{A}|^{-1/2} \exp\!\left[-\tfrac{1}{2}\left(\mathbf{d}^T\mathbf{C}^{-1}\mathbf{d} - \mathbf{d}^T\mathbf{C}^{-1}\mathbf{A}\,(\mathbf{A}^T\mathbf{C}^{-1}\mathbf{A})^{-1}\mathbf{A}^T\mathbf{C}^{-1}\mathbf{d}\right)\right],$$

where $k$ is the dimension of $\mathbf{d}$ and $j$ is the dimension of $\mathbf{m}$.
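The closed form translates directly into a few lines of R. This is a minimal sketch (the function name `log.evidence` is mine); it works in log space and uses `determinant(..., logarithm = TRUE)` so that large or tiny determinants do not overflow or underflow:

```r
# Log evidence of d ~ Norm(A m, C) with a flat prior on m,
# following the closed-form expression term by term.
log.evidence <- function(d, A, C) {
  k <- length(d)                   # dimension of the data d
  j <- ncol(A)                     # dimension of the model m
  C.inv <- solve(C)
  S.inv <- t(A) %*% C.inv %*% A    # posterior precision of m
  quad  <- t(d) %*% C.inv %*% d -
           t(d) %*% C.inv %*% A %*% solve(S.inv) %*% t(A) %*% C.inv %*% d
  ((j - k) / 2) * log(2 * pi) -
    0.5 * as.numeric(determinant(C, logarithm = TRUE)$modulus) -
    0.5 * as.numeric(determinant(S.inv, logarithm = TRUE)$modulus) -
    0.5 * as.numeric(quad)
}
```

As a sanity check: with $\mathbf{A}$ the identity (a saturated model, $j = k$), the quadratic form vanishes and the log determinants of $\mathbf{C}$ and $\mathbf{C}^{-1}$ cancel, so the log evidence is exactly zero.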

This ends the "theoretical" part of this blog; in summary, the denominator of the posterior distribution is called the **Bayesian evidence**. For linear models with known covariance matrices it can be calculated analytically by me, and now also by you! Comparing the Bayesian evidence is a good way to find the model with the most support from the data and prior information.

Returning to the original problem: how can Bayesian evidence be used to locate change points? The recipe is to iteratively create models differing by their design matrices $\mathbf{A}_i$, which represent the possible change points; evaluate the evidence of each model at the data; and compare the evidences. The R code given at the end of the post implements this recipe for a simulated KPI.

To conclude I just want to make a few final observations. Firstly, the assumption of a known data covariance matrix might seem impractical; however, with many revenue-based KPIs, such as average revenue per daily active user (ARPDAU), the variance is "known", as the sample variance over the users can be used to estimate the variance of the mean, and in general there will be high enough degrees of freedom for the t-distribution to be well approximated by a Normal distribution. The evidences can be used to weight the linear models if you have to report a fit to the data. Have fun with my code; try modifying it to check for multiple change points.
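A small sketch of that first observation, using hypothetical per-user data (the distribution and sample size here are assumptions for illustration only):

```r
# Sketch: treating the KPI variance as "known".
# Hypothetical per-user daily revenue for one day; the variance of the
# daily mean (ARPDAU) is estimated from the user-level sample variance.
set.seed(1)
revenue <- rexp(5000, rate = 10)   # assumed per-user revenue distribution

arpdau <- mean(revenue)
sigma.arpdau <- sd(revenue) / sqrt(length(revenue))  # std. error of the mean
```

With thousands of users the degrees of freedom are so high that treating `sigma.arpdau` as a known Normal standard deviation, as the change point code does, is a fine approximation.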

If you have any questions or suggestions for future topics please leave a comment.

Cheers Gavin

P.S. tune in next time for “Data Augmentation and Dispersion: Fitting Binomial Data”

g.BE = function()
{

# simulation settings (these were missing from the original listing;
# the values are assumptions)
n = 100 # number of KPI observations
cp = 50 # true change point index

# the true parameters
m0 = 0.15
m1 = -0.001
m2 = 0.125
m3 = 0.0005

# the "true" data
time = 1:n
KPI.true = time
KPI.true[time <= cp] = m1*time[time <= cp] + m0
KPI.true[time > cp] = m3*(time[time > cp] - cp) + m2

# add noise to the data
sigma.list = runif(n, min=0.001, max=0.01)
noise = rnorm(n, sd=sigma.list, mean=0)
KPI = KPI.true + noise

# calculate the evidence
j = 4 # four parameters
k = n # n data points
C = matrix(0, nrow=n, ncol=n) # the data covariance matrix
diag(C) = sigma.list^2
log.DET.C = sum(log(sigma.list^2)) # log determinant; prod(sigma.list^2) would underflow to 0
C.inv = solve(C)
d = matrix(KPI, ncol=1, nrow=n) # the data in matrix form
B.E.list = 0*time # vector for the log evidences to go into

for (i in 2:(n-2)) # loop over all possible change points Ai
{
A = matrix(0, nrow=k, ncol=j) # make the Ai
A[1:i, 1] = 1            # intercept before the change point
A[1:i, 2] = 1:i          # slope before the change point
A[(i+1):n, 3] = 1        # intercept after the change point
A[(i+1):n, 4] = (i+1):n - i # slope after the change point

# values used to calculate the evidence
sigma.inv = t(A)%*%C.inv%*%A # posterior precision of m
sigma = solve(sigma.inv)
LT = t(A)%*%C.inv%*%d
QUAD1 = t(d)%*%C.inv%*%d
QUAD2 = t(LT)%*%sigma%*%LT
QUAD = QUAD1 - QUAD2
m.hat = sigma%*%LT # posterior mean of m

# the log Bayesian evidence
log.BE = ((j-k)/2)*log(2*pi)
log.BE = log.BE - 0.5*log.DET.C
log.BE = log.BE - 0.5*as.numeric(determinant(sigma.inv, logarithm=TRUE)$modulus)
log.BE = log.BE - 0.5*QUAD

B.E.list[i] = log.BE

# print stuff (flip to TRUE for debugging)
if (FALSE)
{
print(i)
print(log.BE)
print(data.frame(m.HAT=m.hat[,1], M.true=c(m0, m1, m2, m3)))
print(sigma)
print(sigma.inv)
print("----------------------------")
}
}

# set up the plot layout
par(mar=c(0,0,0,0), cex.axis=1.3)
mat = matrix(c(3,3,3, 4,1,5, 6,6,6, 7,2,8, 9,9,9), nrow=5, ncol=3, byrow=TRUE)
layout(mat=mat, widths=c(0.08, 0.9, 0.02), heights=c(0.02, 0.40, 0.05, 0.40, 0.13))

# plot the KPI
plot(time, KPI, xaxt="n", xlim=c(1, n))
axis(1, at=seq(0, n, by=10), labels=FALSE)
axis(2, at=mean(KPI), labels="KPI", tick=FALSE, mgp=c(0, 3, 1))
lines(time, KPI.true, lwd=3, col="blue")
abline(v=which.max(B.E.list), lwd=2, lty=2, col="red")
points(time, KPI)

# show the best model
plot(time, B.E.list, ylim=range(B.E.list[2:(n-2)]), xlim=c(1, n))
axis(1, at=seq(0, n, by=10), labels=FALSE)
axis(2, at=mean(range(B.E.list[2:(n-2)])), labels="log(Z)", tick=FALSE, mgp=c(0, 3, 1))
axis(1, at=mean(1:n), labels="Time index", tick=FALSE, mgp=c(0, 3, 1))
abline(v=cp, col="blue", lwd=3) # true change point
abline(v=which.max(B.E.list), lwd=2, lty=2, col="red") # detected change point
points(time, B.E.list)

print(which.max(B.E.list))
}

g.BE()