Hi All

This week I thought we could address A/B/split testing, that is, testing whether subsets of the data are different from each other. Typically in mobile gaming the test segments would each have a different variant of some game parameter (e.g., group 1 might have a 5 USD sale, group 2 a 10 USD sale, and group 3 no sale). Also, typically, users would be assigned randomly to each group so that the demographics of the groups would be similar.

Before continuing, I would like to say that both of those names are deeply stupid. However, like "data scientist" (because there are scientists that don't use data?), the terms have become standard. Basically I just think that Student himself would not approve. So what is the difference between the two terms (A/B testing and split testing)? Absolutely nothing; they are both just applications of parameter estimation. If I have to pick a name I tend to stick with split testing, as it does not seem to be limited to two groups.

This article describes a Bayesian method of conducting split tests on "normal" (Gaussian) data; so mostly this is for KPIs such as ARPI. A later article will consider data with extreme outliers (i.e., data this test is not suited for).

The common way of conducting a split test is with a Student t-test. If you are reading this post because you are preparing for a job interview with the analytics department at HotHead Games, then you should know that I think of this as sort of a C/C+ answer. The basic problem with this approach is that it fails to account for natural parsimony; i.e., that the more parameters you have, the better you will fit the data, even if those parameters are unnecessary.
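For concreteness, here is a minimal sketch of the Welch t statistic that this standard approach rests on. The sketch is in Python (the analysis code appended below is R), and the two groups are simulated, purely illustrative data:

```python
import math
import random

random.seed(1)
# Two hypothetical test groups drawn from the SAME Gaussian (no real difference)
a = [random.gauss(0.05, 1) for _ in range(5000)]
b = [random.gauss(0.05, 1) for _ in range(5000)]

def welch_t(x, y):
    """Welch t statistic and Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)   # sample variances
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    se2 = vx / nx + vy / ny                          # squared standard error
    t = (mx - my) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

t, df = welch_t(a, b)
print(t, df)
```

The statistic is then compared against a t distribution with df degrees of freedom to get a P-value, which is exactly what the appended R code does with `pt`.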

The Bayesian way (i.e., the correct way) to conduct a split test is to use the Bayesian evidence. To calculate the Bayesian evidence, consider Bayes' formula as commonly used in statistical inference:

P(\theta_k \mid D) = \frac{\mathcal{L}(\theta_k)\,\pi(\theta_k)}{Z_k}, \qquad Z_k = \int \mathcal{L}(\theta_k)\,\pi(\theta_k)\,d\theta_k

Here \pi(\theta_k) is the prior distribution of \theta_k, Z_k is the total Bayesian evidence of the ensemble of models (i.e., the probability of the data integrated / summed over all possible parameter values for the given parameterization), and \mathcal{L}(\theta_k) is the likelihood of the parameter vector (i.e., the probability of the data given the model). The subscript k denotes the choice of parameterization.
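A toy numeric illustration of this normalization may help; the numbers below are made up, with two parameterizations given equal prior weight:

```python
# Made-up numbers: two hypothetical parameterizations with equal prior weight.
priors = {"M1": 0.5, "M2": 0.5}
likelihoods = {"M1": 0.2, "M2": 0.1}  # P(D | M_k), hypothetical values

# The evidence Z is the likelihood summed over the model ensemble;
# dividing by it turns likelihood x prior into a proper posterior.
Z = sum(likelihoods[m] * priors[m] for m in priors)
posterior = {m: likelihoods[m] * priors[m] / Z for m in priors}
print(Z, posterior)  # Z ~ 0.15; posterior ~ {'M1': 2/3, 'M2': 1/3}
```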

Models are compared using the ratio of their respective evidences. This ratio is called the Bayes factor and is defined as BF = Z_1 / Z_2, where Z_1 and Z_2 are the Bayesian evidences of the two models being compared. The BF can be shown to satisfy natural parsimony; that is, it will tend to favor simpler models with low deviance.

For a Gaussian model the Bayesian evidence can be calculated to be:

\ln Z = -\frac{N-1}{2}\ln(2\pi) - \frac{1}{2}\ln N + \ln\Gamma\!\left(\frac{N-3}{2}\right) - \frac{N-3}{2}\,\ln\!\left(\frac{1}{2}\sum_{i}(x_i-\bar{x})^2\right) - \ln V

Intuitively we can see that the evidence increases with N and decreases with s (the sample standard deviation).

[Note: I have updated this equation since the original post to be more verbose; it also uses an explicit prior in which both \mu and \sigma are uniform over a region of volume V.]
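This evidence expression can be sketched as a short Python function; it mirrors the calculation in the R code appended below, and, as in that code, the constant prior-volume term (-ln V) is dropped since it contributes the same additive constant to every group:

```python
import math

def log_evidence(x):
    """Log Bayesian evidence for a single Gaussian group, mirroring the
    appended R code; the constant prior-volume term is dropped."""
    n, s1, s2 = len(x), sum(x), sum(v * v for v in x)
    j = (1 - n) / 2                     # J = (1 - N)/2, as in the R code
    ss = 0.5 * (s2 - s1 * s1 / n)       # half the sum of squared deviations
    return (j * math.log(2 * math.pi) - 0.5 * math.log(n)
            + math.lgamma(-(j + 1)) + (j + 1) * math.log(ss))

# Tighter data (smaller s) yields higher evidence for the same N:
tight = [0.1 * i for i in range(10)]
loose = [1.0 * i for i in range(10)]
print(log_evidence(tight) > log_evidence(loose))  # True
```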

Kass and Raftery published a table to categorize the significance of the BF (on the 2 ln BF scale). Their table is:

2 ln(BF)   Strength of evidence
0 to 2     Not worth more than a bare mention
2 to 6     Positive
6 to 10    Strong
> 10       Very strong
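If you want this categorization in code, a small helper sketched in Python (category names as in Kass and Raftery's 1995 paper) might look like:

```python
def kass_raftery(two_ln_bf):
    """Map a 2*ln(BF) value to the Kass & Raftery evidence category."""
    if two_ln_bf < 0:
        return "supports the other model"
    if two_ln_bf < 2:
        return "not worth more than a bare mention"
    if two_ln_bf < 6:
        return "positive"
    if two_ln_bf < 10:
        return "strong"
    return "very strong"

print(kass_raftery(84.59))  # very strong
```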

Moving on to an example, because let's be honest: if the derivation interested you, you would have done it yourself, right? The example considers data without a significant difference between the groups; we could do more, but the code is appended (i.e., do them yourself!).

Example: 10 groups, each with 5000 observations drawn from a Gaussian with mean 0.05 and standard deviation 1. Figure 1 shows the simulated data and posterior histograms of the means of the different groups. Table 2 gives the two-sided P-values for the pairwise tests of significance between each group. The red squares indicate the pairs found to have a significant difference.

Clearly the t-test is not satisfactory; there are many false positives. In real situations, the pressure to do something would probably result in many less-significant results also being declared significant.
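The false positives are no surprise once you count the comparisons: with 10 groups there are 45 pairwise tests, so even when every group is identical you expect a couple of "significant" results at the usual 0.05 level. A quick back-of-the-envelope sketch (the independence assumption in the last line is only approximate, since the pairwise tests share groups):

```python
from math import comb

groups, alpha = 10, 0.05
pairs = comb(groups, 2)                        # 45 pairwise comparisons
expected_fp = pairs * alpha                    # expected false positives under H0
prob_at_least_one = 1 - (1 - alpha) ** pairs   # if the tests were independent
print(pairs, expected_fp, prob_at_least_one)
```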

The BF-based test does much better. First, the test of all groups vs. having only one group has a BF of 84.58817 (i.e., very strong evidence that there is only one group). The pairwise tests are shown in Table 3. No pair is found to have more support for two groups than for one (i.e., no pair has a negative BF). In addition, many of the pairs have significant evidence of being only one group; take that, "fail to reject the null": here we get a definitive statement in support of the null!
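The pairwise version of the test can be sketched in Python as follows. This is a hedged companion to the appended R code, not a replacement: it re-implements the same Gaussian evidence expression (prior-volume constant dropped) and compares "one shared Gaussian" against "two separate Gaussians" for two hypothetical, simulated groups:

```python
import math
import random

def log_evidence(x):
    # Gaussian log evidence, mirroring the appended R code (prior volume dropped).
    n, s1, s2 = len(x), sum(x), sum(v * v for v in x)
    j = (1 - n) / 2
    ss = 0.5 * (s2 - s1 * s1 / n)  # half the sum of squared deviations
    return (j * math.log(2 * math.pi) - 0.5 * math.log(n)
            + math.lgamma(-(j + 1)) + (j + 1) * math.log(ss))

def two_ln_bf(a, b):
    # 2*ln(BF) of "one shared Gaussian" vs "two separate Gaussians";
    # positive values favour the single-group model, as in the R code.
    return 2 * (log_evidence(a + b) - log_evidence(a) - log_evidence(b))

random.seed(42)
a = [random.gauss(0.05, 1) for _ in range(5000)]
b = [random.gauss(0.05, 1) for _ in range(5000)]
print(two_ln_bf(a, b))  # same distribution: tends to come out positive
```

Shift one group's mean far enough and the statistic swings strongly negative, i.e., decisive support for two separate groups.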

Cheers

Tune in Next Week for how to deal with outliers! And post comments!

Code:

g.data <- function(N = 1000, G = 10, mu1 = 0, sigma1 = 1, mu2 = 0, sigma2 = 1) {
  # Makes G groups of Gaussian data; only group 1 differs from the others
  Group = rep(1, N)
  Value = rnorm(N, mean = mu1, sd = sigma1)
  for (i in 2:G) {
    Group = c(Group, rep(i, N))
    Value = c(Value, rnorm(N, mean = mu2, sd = sigma2))
  }
  data = data.frame(Group, Value)
  plot(factor(data$Group), data$Value)
  return(data)
}

g.t.test <- function(data) {
  g.col = c(1, 2, 3, 4, 6:11)
  Value = data$Value
  G = data$Group
  Groups = unique(data$Group)
  NG = length(Groups)
  print(Groups)
  print(NG)
  X.bar = 1:NG
  S.bar = 1:NG
  N.bar = 1:NG
  LB = 1:NG
  UB = 1:NG
  for (i in 1:NG) {
    N.bar[i] = length(Value[G == Groups[i]])
    X.bar[i] = mean(Value[G == Groups[i]])
    S.bar[i] = sd(Value[G == Groups[i]])
    LB[i] = X.bar[i] - 3 * S.bar[i] / sqrt(N.bar[i])
    UB[i] = X.bar[i] + 3 * S.bar[i] / sqrt(N.bar[i])
  }
  print(N.bar)
  print(X.bar)
  print(S.bar)
  print(LB)
  print(UB)
  Mu.dist.x = seq(min(LB), max(UB), length.out = 1001)
  Mu.dist.y = matrix(0, nrow = 1001, ncol = NG)
  for (i in 1:NG) {
    # This should be dt; however, R does not have a very flexible t density
    # function and it is just not worth writing one for this sort of plotting.
    Mu.dist.y[, i] = dnorm(x = Mu.dist.x, mean = X.bar[i],
                           sd = S.bar[i] / sqrt(N.bar[i]), log = FALSE)
  }
  Mu.dist.y.f = seq(min(Mu.dist.y), max(Mu.dist.y), length.out = 1001)
  plot(Mu.dist.x, Mu.dist.y.f, xlab = "Mean(Value)", ylab = "P[Mean(Value)]", type = "n")
  for (i in 1:NG) {
    lines(Mu.dist.x, Mu.dist.y[, i], col = g.col[i], lwd = 3, lty = 2)
  }
  legend("topright", col = g.col[1:NG], lwd = 3, legend = Groups)
  axis(3, at = X.bar, label = Groups)
  delta.mat = matrix(0, nrow = NG, ncol = NG)
  df.mat = matrix(Inf, nrow = NG, ncol = NG)
  S12.mat = matrix(0, nrow = NG, ncol = NG)
  t.mat = matrix(0, nrow = NG, ncol = NG)
  P.mat = matrix(1, nrow = NG, ncol = NG)
  for (i in 1:NG) {
    for (j in 1:NG) if (i != j) {
      delta.mat[i, j] = X.bar[i] - X.bar[j]
      # Welch-Satterthwaite degrees of freedom
      df.mat[i, j] = ((S.bar[i]^2) / N.bar[i] + (S.bar[j]^2) / N.bar[j]) /
        ((((S.bar[i]^2) / N.bar[i])^2) / (N.bar[i] - 1) +
         (((S.bar[j]^2) / N.bar[j])^2) / (N.bar[j] - 1))
      S12.mat[i, j] = sqrt((S.bar[i]^2) / N.bar[i] + (S.bar[j]^2) / N.bar[j])
      t.mat[i, j] = delta.mat[i, j] / S12.mat[i, j]
      P.mat[i, j] = 2 * (1 - pt(q = abs(t.mat[i, j]), df = df.mat[i, j],
                                lower.tail = TRUE, log.p = FALSE))
    }
  }
  print(round(delta.mat, 4))
  print(round(P.mat, 4))
}

g.b.test <- function(data) {
  x = data$Value
  G = data$Group
  Groups = unique(data$Group)
  NG = length(Groups)
  # Evidence Z for all the data pooled into one group
  x. = sum(x)
  x.. = sum(x^2)
  N = length(x)
  J = (1 - N) / 2
  Z.a = J * log(2 * pi) - 0.5 * log(N)
  # Z.b = -0.5 * (x.. - (x.^2) / N)  # if sd known to be 1
  Z.b = lgamma(-(J + 1))
  Z.c = (-(J + 1)) * log(0.5 * (x.. - (x.^2) / N))
  Z = Z.a + Z.b - Z.c
  # Evidence for each individual group
  Z.split = 1:NG
  for (i in 1:NG) {
    y = x[G == Groups[i]]
    x. = sum(y)
    x.. = sum(y^2)
    N = length(y)
    J = (1 - N) / 2
    Z.a = J * log(2 * pi) - 0.5 * log(N)
    # Z.b = -0.5 * (x.. - (x.^2) / N)  # if sd known to be 1
    Z.b = lgamma(-(J + 1))
    Z.c = (-(J + 1)) * log(0.5 * (x.. - (x.^2) / N))
    Z.split[i] = Z.a + Z.b - Z.c
  }
  Z.B = sum(Z.split)
  print("Bayes factor, one group vs all separate (2 ln BF)")
  print(2 * (Z - Z.B))
  # Pairwise 2 ln BF: pooled pair vs two separate groups
  BF.mat = matrix(0, nrow = NG, ncol = NG)
  for (i in 1:NG) {
    for (j in 1:NG) {
      if (i != j) {
        x.temp = x[(G == Groups[i]) | (G == Groups[j])]
        x. = sum(x.temp)
        x.. = sum(x.temp^2)
        N = length(x.temp)
        J = (1 - N) / 2
        Z.a = J * log(2 * pi) - 0.5 * log(N)
        # Z.b = -0.5 * (x.. - (x.^2) / N)  # if sd known to be 1
        Z.b = lgamma(-(J + 1))
        Z.c = (-(J + 1)) * log(0.5 * (x.. - (x.^2) / N))
        Z.ab = Z.a + Z.b - Z.c
        BF.mat[i, j] = 2 * (Z.ab - Z.split[i] - Z.split[j])
      }
    }
  }
  print(round(BF.mat, 2))
}

par(mfrow = c(1, 2))
test.data.0 = g.data(N = 5000, G = 10, mu1 = 0.05, sigma1 = 1, mu2 = 0.05, sigma2 = 1)
g.t.test(test.data.0)
g.b.test(test.data.0)