Hi all, this week I am trying to avoid being interesting but useless. To that end I am looking at user engagement, that is, measuring or estimating how engaged a user is. This is a strangely difficult thing to do. It is easy to measure the signs of engagement: engaged users tend to spend more, use more product features, log in more, have longer product sessions, etc. However, the engagement itself, as a variable, is never observed. My strategy is to use hierarchical Bayes: the engagement has a prior distribution with variable hyperparameters. Inference is conducted by comparing the hyperparameters.
In more detail, the approach used here is based on structural errors-in-variables regression. It should also be noted that this uses the integrated likelihood presented in Sambridge 2016. That is, I define an unobserved and unit-less parameter vector with elements
to be the engagement of the i’th user. Each has a prior distribution
which is Gaussian with mean and variance
and
, respectively. I also define a curve with parameters
through the space of observed engagement attributes. The value of
for a user determines their location on the curve in the attribute space. The probability of an observed set of attributes
given
and
is then assigned a multi-variate Gaussian distribution. The value of
is marginalized out, giving an integrated likelihood in terms of
. Standard MCMC sampling is then used. That was super clear, right? Trust me, it will be once you understand all the steps (or you will just have to leave a comment and ask a question).
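To make the set-up a bit more concrete, here is a minimal sketch of the generative story in Python with NumPy. It assumes the simplest possible "curve", a straight line through two points as in Figure 1. The names `a`, `b`, `t`, `mu`, `sigma` and `cov` are illustrative placeholders chosen for this sketch (line end points, engagement, prior mean and standard deviation, data covariance), and the numbers are made up.

```python
import numpy as np

def simulate_attributes(a, b, mu, sigma, cov, n_users, rng):
    """Draw engagement values and noisy attributes for n_users users.

    Assumed model (a sketch, not a definitive implementation):
      t_i ~ Normal(mu, sigma^2)              unobserved engagement
      d_i ~ Normal(a + t_i * (b - a), cov)   observed attributes
    """
    v = b - a                                    # direction of the line
    t = rng.normal(mu, sigma, size=n_users)      # one engagement per user
    on_line = a + t[:, None] * v                 # location on the line
    noise = rng.multivariate_normal(np.zeros(a.size), cov, size=n_users)
    return t, on_line + noise

rng = np.random.default_rng(0)
a = np.array([0.0, 1.0])                         # hypothetical values
b = np.array([2.0, 3.0])
cov = np.array([[0.05, 0.01], [0.01, 0.08]])
t, d = simulate_attributes(a, b, mu=0.5, sigma=0.3, cov=cov, n_users=500, rng=rng)
print(d.shape)
```

Only `d` would be available in practice; `t` is returned here purely so the simulation can be checked.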

Figure 1: An example of one data point with uncertainty (the colors) near a line (grey dashed line) defined by two points. Finally, a residual conditional on a point on the line, defined by a scalar, is also shown. (Sorry this is vague; I can't put LaTeX equations in here for some unclear reason.)
Figure 1 shows a diagram of a single data point and a line defined by the two points
and
, i.e.
. These points are parameterized with the model parameters
. The vectors
and
are the difference between the data point and
and the difference between
and
; these will be used later to simplify some equations (note that they should have subscripts, but these have been omitted for clarity). Finally, the point denoted
and the vector
are an arbitrary point on
defined by a particular
and the difference between that point and the observed data point,
. The color represents the data uncertainty, i.e., the multivariate Gaussian, which represents the significance of
. To get the probability of
given
(
) the integral
is computed. Conceptually this should make sense as the probability of
is the sum of the probabilities given each
.
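The "sum over every point on the line" idea can be checked numerically. The sketch below integrates the Gaussian likelihood times the Gaussian prior over the engagement on a fine grid, and compares the result with the standard closed form for a linear Gaussian model, where the marginal of the data is again Gaussian with mean `a + mu * v` and covariance `cov + sigma**2 * v v^T`. All names (`a`, `b`, `v`, `mu`, `sigma`, `cov`, `d`) and values are assumptions made for this example.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian evaluated at x."""
    r = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = r @ np.linalg.solve(cov, r)
    return np.exp(-0.5 * (r.size * np.log(2.0 * np.pi) + logdet + quad))

def marginal_by_quadrature(d, a, b, mu, sigma, cov, n_grid=4001):
    """Integrate likelihood * prior over the engagement t (trapezoid rule)."""
    v = b - a
    t = np.linspace(mu - 10.0 * sigma, mu + 10.0 * sigma, n_grid)
    prior = np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    like = np.array([gaussian_pdf(d, a + ti * v, cov) for ti in t])
    f = like * prior
    return np.sum(0.5 * (f[1:] + f[:-1])) * (t[1] - t[0])

def marginal_closed_form(d, a, b, mu, sigma, cov):
    """Same integral done analytically (linear Gaussian marginal)."""
    v = b - a
    return gaussian_pdf(d, a + mu * v, cov + sigma**2 * np.outer(v, v))

a = np.array([0.0, 1.0])                         # hypothetical values
b = np.array([2.0, 3.0])
cov = np.array([[0.05, 0.01], [0.01, 0.08]])
d = np.array([1.2, 1.9])
numeric = marginal_by_quadrature(d, a, b, 0.5, 0.3, cov)
closed = marginal_closed_form(d, a, b, 0.5, 0.3, cov)
print(numeric, closed)
```

If the two printed numbers agree to several digits, the analytic marginalization is doing what the picture suggests.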
Because of the dark sorcery that protects the Gaussian distribution, all algebra using it always works; in this case it means that the integral can be found analytically. In detail, i.e., more detail than anyone but me cares to see:
, where
is the dimension of
.
Now using ,
can be rewritten as
.
Using the completing-the-square technique, similar to the process in the change point detection post, this can be written as a scaled Gaussian distribution in terms of
.
. Note what may look like a “wealth of linear algebra sins” are in fact legitimate as the arrays are 1×1, i.e., scalars.
By defining ,
, and
, the probability
can be simplified to
.
The likelihood is now multiplied by the prior
. The result is
, where
,
, and
.
Finally, the probability can be computed:
.
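For readers who want the end result in one place, here is one way the marginalization can be written out. The notation is an assumption of this sketch: $\mathbf{d}$ is the observed attribute vector of dimension $k$, $\mathbf{a}$ and $\mathbf{b}$ are the two points defining the line, $\mathbf{v} = \mathbf{b} - \mathbf{a}$, $\mathbf{r} = \mathbf{d} - \mathbf{a}$, $t$ is the engagement with prior mean $\mu$ and variance $\sigma^2$, and $\mathbf{C}$ is the data covariance. The helper scalars $\alpha$, $\beta$, $\gamma$ are introduced here for convenience and need not match the quantities defined above.

```latex
\begin{align*}
p(\mathbf{d}) &= \int \mathcal{N}\!\left(\mathbf{d};\ \mathbf{a} + t\,\mathbf{v},\ \mathbf{C}\right)\,
                 \mathcal{N}\!\left(t;\ \mu,\ \sigma^{2}\right)\,dt, \\
\alpha &= \mathbf{v}^{T}\mathbf{C}^{-1}\mathbf{v} + \frac{1}{\sigma^{2}}, \qquad
\beta  = \mathbf{v}^{T}\mathbf{C}^{-1}\mathbf{r} + \frac{\mu}{\sigma^{2}}, \qquad
\gamma = \mathbf{r}^{T}\mathbf{C}^{-1}\mathbf{r} + \frac{\mu^{2}}{\sigma^{2}}, \\
\alpha t^{2} - 2\beta t + \gamma
       &= \alpha\left(t - \frac{\beta}{\alpha}\right)^{2} + \gamma - \frac{\beta^{2}}{\alpha}, \\
p(\mathbf{d}) &= (2\pi)^{-k/2}\,|\mathbf{C}|^{-1/2}\,\left(\sigma^{2}\alpha\right)^{-1/2}
                 \exp\!\left[-\frac{1}{2}\left(\gamma - \frac{\beta^{2}}{\alpha}\right)\right].
\end{align*}
```

The last line is the same thing as a Gaussian in $\mathbf{d}$ with mean $\mathbf{a} + \mu\mathbf{v}$ and covariance $\mathbf{C} + \sigma^{2}\mathbf{v}\mathbf{v}^{T}$, since $|\mathbf{C} + \sigma^{2}\mathbf{v}\mathbf{v}^{T}| = |\mathbf{C}|\,\sigma^{2}\alpha$.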
The likelihood of a data set is thus . Unfortunately, this expression cannot be optimized analytically; however, it is possible to use MCMC to sample the parameters.
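As a rough illustration of what sampling could look like, here is a bare-bones random-walk Metropolis sketch. It is not the full sampler for this model. It makes several simplifying assumptions that are mine: the line is parameterized by two points `a` and `b`, the prior on engagement is fixed at `mu=0`, `sigma=1`, the data covariance `cov` is treated as known, the prior on `a` and `b` is flat, and the data are simulated. Note that the sign of `b - a` is not identified by the marginal likelihood, so the example only checks that the chain moves to a better-fitting region.

```python
import numpy as np

def log_likelihood(data, a, b, mu, sigma, cov):
    """Sum over users of the log integrated likelihood (Gaussian marginal)."""
    v = b - a
    S = cov + sigma**2 * np.outer(v, v)          # marginal covariance
    r = data - (a + mu * v)                      # residuals, shape (n, k)
    _, logdet = np.linalg.slogdet(S)
    quad = np.einsum("ij,ij->i", r, np.linalg.solve(S, r.T).T)
    k = data.shape[1]
    return -0.5 * np.sum(k * np.log(2.0 * np.pi) + logdet + quad)

def metropolis(data, theta0, cov, n_steps, step, rng, mu=0.0, sigma=1.0):
    """Random-walk Metropolis over theta = (a, b) with a flat prior."""
    k = data.shape[1]
    theta = np.asarray(theta0, dtype=float)
    logp = log_likelihood(data, theta[:k], theta[k:], mu, sigma, cov)
    chain = np.empty((n_steps, theta.size))
    logps = np.empty(n_steps)
    accepted = 0
    for i in range(n_steps):
        proposal = theta + step * rng.normal(size=theta.size)
        logp_prop = log_likelihood(data, proposal[:k], proposal[k:], mu, sigma, cov)
        if np.log(rng.uniform()) < logp_prop - logp:   # accept / reject
            theta, logp = proposal, logp_prop
            accepted += 1
        chain[i] = theta
        logps[i] = logp
    return chain, logps, accepted / n_steps

rng = np.random.default_rng(1)
a_true, b_true = np.array([1.0, 2.0]), np.array([2.0, 2.5])   # made-up truth
cov = 0.05 * np.eye(2)
t = rng.normal(0.0, 1.0, size=300)
data = a_true + t[:, None] * (b_true - a_true)
data = data + rng.multivariate_normal(np.zeros(2), cov, size=300)

theta0 = np.array([0.0, 0.0, 1.0, 1.0])          # deliberately poor start
initial_logp = log_likelihood(data, theta0[:2], theta0[2:], 0.0, 1.0, cov)
chain, logps, acc_rate = metropolis(data, theta0, cov, 4000, 0.05, rng)
print(initial_logp, logps[-1], acc_rate)
```

In a real analysis the step size, burn-in, priors and the covariance parameterization would all need more care than this.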
Now that the maths is done, it may be helpful to return to the original purpose, that is, estimating user engagement. The parameters and
are shared by all
and are estimated using the MCMC process. So to be explicit the prior distribution of all of the
is defined functionally, but its parameters are estimated using observed data! Also note that the model parameters are buried in
,
, and
which must be calculated for each data point. In addition, to avoid having extra correlated parameters, it is possible to force
and
. In addition, the parameterization structure of the data covariance matrix
deserves some consideration; in particular, what level of complexity is required to model your data.
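Coming back to the goal of estimating engagement: once the shared parameters are known (or sampled), the engagement of a single user given their attributes has a Gaussian posterior, because both the prior and the likelihood are Gaussian in the engagement. A minimal sketch, again using assumed names (`a`, `b` for the line, `mu`, `sigma` for the prior, `cov` for the data covariance) and a straight-line "curve":

```python
import numpy as np

def engagement_posterior(data, a, b, mu, sigma, cov):
    """Posterior mean and standard deviation of each user's engagement.

    data has shape (n_users, k). The posterior precision is
    v^T C^-1 v + 1/sigma^2, the same for every user under this model.
    """
    v = b - a
    cinv_v = np.linalg.solve(cov, v)             # C^-1 v
    precision = v @ cinv_v + 1.0 / sigma**2
    mean = ((data - a) @ cinv_v + mu / sigma**2) / precision
    return mean, np.sqrt(1.0 / precision)

a = np.array([0.0, 0.0])                         # hypothetical values
b = np.array([1.0, 1.0])
cov = 0.01 * np.eye(2)
users = np.array([[2.0, 2.0],                    # sits on the line at t = 2
                  [0.5, 0.7]])                   # slightly off the line
post_mean, post_sd = engagement_posterior(users, a, b, mu=0.0, sigma=100.0, cov=cov)
print(post_mean, post_sd)
```

With a very wide prior, as in this example, the posterior mean is essentially the projection of the data point onto the line; a tighter prior pulls each user's estimate towards `mu`.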
This post is long enough, so I will add some code for the MCMC sampling at a later date. Thanks for tuning in! Please come back next week for Bayesian Business Intelligence.