Hi all, this week I am trying to avoid being interesting but useless. To that end I am looking at user engagement, that is, measuring or estimating how engaged a user is. This is a strangely difficult thing to do. It is easy to measure the signs of engagement: for example, engaged users tend to spend more, use more product features, log in more, have longer product sessions, etc. However, the engagement itself, as a variable, is never observed. My strategy is to use hierarchical Bayes: the engagement has a prior distribution with variable hyperparameters. Inference is conducted by comparing the hyperparameters.

In more detail, the approach used here is based on structural errors-in-variables regression. It should also be noted that it uses the integrated likelihood presented in Sambridge 2016. That is, I define an unobserved, unitless parameter vector whose i’th element is the engagement of the i’th user. The elements share a Gaussian prior distribution whose mean and variance are hyperparameters. I also define a curve, with its own parameters, through the space of observed engagement attributes. A user's engagement value determines their location on the curve in the attribute space. The probability of an observed set of attributes, given the engagement and the curve parameters, is then assigned a multivariate Gaussian distribution. The engagement is marginalized out, giving an integrated likelihood in terms of the remaining parameters. Standard MCMC sampling is then used. That was super clear, right? Trust me, it will be once you understand all the steps (or you will just have to leave a comment and ask a question).

Figure 1 shows a diagram of a single data point and a line defined by two points. These points are parameterized with the model parameters. Two difference vectors are also shown: one between the data point and a point defining the line, and one between the two points defining the line. These will be used later to simplify some equations (note they should have subscripts, but these have been ignored for clarity). Finally, the diagram marks an arbitrary point on the line, defined by a particular engagement value, together with the vector between that point and the observed data point. The color represents the data uncertainty, i.e., the multivariate Gaussian, which represents the significance of that vector. To get the probability of the data point given the model, the integral over the engagement value is computed. Conceptually this should make sense, as the probability of the data point is the sum of its probability given each engagement value.
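The integral described above can be approximated numerically before any algebra is done, which gives a handy reference for checking a closed form. The sketch below is only an illustration under stated assumptions: the line is written as `a + t * (b - a)`, the data covariance is taken to be diagonal (`var` holds its diagonal), and all function names are hypothetical.

```python
import math


def point_on_line(a, b, t):
    """Point a + t * (b - a) on the line through a and b (assumed parameterization)."""
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]


def log_gauss_diag(d, x, var):
    """Log density at d of a Gaussian centred at x with diagonal covariance var."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - 0.5 * (di - xi) ** 2 / v
        for di, xi, v in zip(d, x, var)
    )


def log_prior(t, mu, sigma):
    """Log density of the Gaussian prior on the engagement t."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - 0.5 * ((t - mu) / sigma) ** 2


def marginal_numeric(d, a, b, var, mu, sigma, half_width=10.0, steps=20001):
    """Trapezoid approximation of the integral over t of P(d | t) * P(t).

    The grid covers mu +/- half_width * sigma, where the prior has its mass.
    """
    lo = mu - half_width * sigma
    hi = mu + half_width * sigma
    h = (hi - lo) / (steps - 1)
    total = 0.0
    for k in range(steps):
        t = lo + k * h
        weight = 0.5 if k in (0, steps - 1) else 1.0  # trapezoid end points
        x = point_on_line(a, b, t)
        total += weight * math.exp(log_gauss_diag(d, x, var) + log_prior(t, mu, sigma))
    return total * h
```

In one dimension with `a = [0.0]`, `b = [1.0]`, unit data variance, and a standard normal prior, the result should match a zero-mean Gaussian with variance 2 evaluated at the data point.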

Because of the dark sorcery that protects the Gaussian distribution, all algebra using it always works; in this case it means that the integral can be found analytically. In detail, i.e., more detail than anyone but me cares to see:

, where is the dimension of .

Now using , can be rewritten as .

Using the completing-the-square technique, similar to the process in the change point detection post, the likelihood can be written as a scaled Gaussian distribution in terms of the engagement. Note that what may look like a “wealth of linear algebra sins” is in fact legitimate, as the arrays are 1×1, i.e., scalars.

By defining , , and , the probability can be simplified to .

The likelihood is now multiplied by the prior . The result is , where , , and .

Finally the probability can be computed; .
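For anyone who wants to check the algebra, here is one way the steps above can be written out end to end. The notation is assumed for illustration and may differ from the symbols used in the figure: $d$ is the observed attribute vector of dimension $n$, $C$ the data covariance, $a$ and $b$ the two points defining the line, $m$ the model parameters that set them, $t$ the engagement, and $\mu$, $\sigma^2$ the prior mean and variance.

```latex
% Sketch under assumed notation; alpha, beta, gamma are scalars.
\begin{align}
x(t) &= a + t\,(b - a), \qquad u = d - a, \qquad v = b - a, \\
P(d \mid t, m) &= (2\pi)^{-n/2}\,|C|^{-1/2}
  \exp\!\left[-\tfrac{1}{2}\,(u - t v)^{T} C^{-1} (u - t v)\right], \\
\alpha &= v^{T} C^{-1} v, \qquad \beta = v^{T} C^{-1} u, \qquad \gamma = u^{T} C^{-1} u, \\
(u - t v)^{T} C^{-1} (u - t v)
  &= \alpha\left(t - \tfrac{\beta}{\alpha}\right)^{2} + \gamma - \tfrac{\beta^{2}}{\alpha}, \\
P(d \mid m, \mu, \sigma)
  &= \int P(d \mid t, m)\,\mathcal{N}(t;\,\mu,\,\sigma^{2})\,dt \\
  &= \frac{(2\pi)^{-n/2}\,|C|^{-1/2}}{\sqrt{1 + \alpha\sigma^{2}}}
     \exp\!\left[-\tfrac{1}{2}\left(\gamma - \tfrac{\beta^{2}}{\alpha}
     + \frac{(\beta/\alpha - \mu)^{2}}{\sigma^{2} + 1/\alpha}\right)\right].
\end{align}
```

A useful sanity check is that this is the same as saying $d$ is Gaussian with mean $a + \mu v$ and covariance $C + \sigma^{2} v v^{T}$, which follows directly from $d$ being a linear function of the Gaussian variable $t$ plus Gaussian noise.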

The likelihood of a data set is thus the product of the probabilities of the individual data points. Unfortunately, this expression cannot be optimized analytically; however, it is possible to use MCMC to sample the parameters.
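As a minimal sketch of how the sampling could look, the code below evaluates an integrated log-likelihood and runs a plain random-walk Metropolis sampler. It is an illustration under several assumptions, not the full sampler: the line is `a + t * (b - a)`, the data covariance is diagonal (`var`), the line and covariance are held fixed so that only the prior mean `mu` and `log(sigma)` are sampled, and weak Gaussian priors are placed on both. All names are hypothetical.

```python
import math
import random


def log_marginal(d, a, b, var, mu, sigma):
    """Log probability of one data point with the engagement integrated out.

    Assumes a != b and a diagonal data covariance with diagonal `var`.
    """
    u = [di - ai for di, ai in zip(d, a)]
    v = [bi - ai for bi, ai in zip(b, a)]
    alpha = sum(vi * vi / s for vi, s in zip(v, var))
    beta = sum(vi * ui / s for vi, ui, s in zip(v, u, var))
    gamma = sum(ui * ui / s for ui, s in zip(u, var))
    log_norm = -0.5 * sum(math.log(2 * math.pi * s) for s in var)
    log_norm -= 0.5 * math.log(1.0 + alpha * sigma ** 2)
    quad = gamma - beta ** 2 / alpha
    quad += (beta / alpha - mu) ** 2 / (sigma ** 2 + 1.0 / alpha)
    return log_norm - 0.5 * quad


def log_likelihood(data, a, b, var, mu, sigma):
    """Sum of per-user log probabilities (users treated as independent)."""
    return sum(log_marginal(d, a, b, var, mu, sigma) for d in data)


def metropolis(data, a, b, var, n_steps=5000, step=0.05, seed=0):
    """Random-walk Metropolis over (mu, log sigma); returns (mu, sigma) samples."""
    rng = random.Random(seed)

    def log_post(mu, log_sigma):
        # Weak priors (an assumption): mu ~ N(0, 10^2), log sigma ~ N(0, 2^2).
        log_prior = -0.5 * (mu / 10.0) ** 2 - 0.5 * (log_sigma / 2.0) ** 2
        return log_prior + log_likelihood(data, a, b, var, mu, math.exp(log_sigma))

    mu, log_sigma = 0.0, 0.0
    current = log_post(mu, log_sigma)
    samples = []
    for _ in range(n_steps):
        mu_new = mu + rng.gauss(0.0, step)
        log_sigma_new = log_sigma + rng.gauss(0.0, step)
        proposed = log_post(mu_new, log_sigma_new)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposed - current)):
            mu, log_sigma, current = mu_new, log_sigma_new, proposed
        samples.append((mu, math.exp(log_sigma)))
    return samples
```

Sampling `log(sigma)` rather than `sigma` keeps the variance positive without needing a boundary check. In practice the line parameters and the covariance would also be sampled, and the early part of the chain should be discarded as burn-in before summarizing.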

Now that the maths is done, it may be helpful to return to the original purpose: estimating user engagement. The hyperparameters of the prior are shared by all users and are estimated using the MCMC process. So, to be explicit, the prior distribution of every user's engagement is defined functionally, but its parameters are estimated using observed data! Also note that the model parameters are buried in the intermediate quantities defined above, which must be calculated for each data point. In addition, to avoid having extra correlated parameters, it is possible to fix some of them. Finally, consideration should be given to the parameterization structure of the data covariance matrix, in particular what level of complexity is required to model your data.

This post is long enough, so I will add some code for the MCMC sampling at a later date. Thanks for tuning in! Please come back next week for Bayesian Business Intelligence.