Estimating User Engagement (Errors-in-variables Regression)

Hi all, this week I am trying to avoid being interesting but useless. To that end I am looking at user engagement, that is, measuring or estimating how engaged a user is. This is a strangely difficult thing to do. It is easy to measure the signs of engagement: for example, engaged users tend to spend more, use more product features, log in more often, have longer product sessions, etc. However, engagement as a variable is itself never observed. My strategy is to use hierarchical Bayes: each user's engagement has a prior distribution with variable hyperparameters, and inference is conducted by comparing the hyperparameters.

In more detail, the approach used here is based on structural errors-in-variables regression. It should also be noted that it uses the integrated likelihood presented in Sambridge 2016. That is, I define an unobserved and unitless parameter vector \boldsymbol{\theta} whose element \theta_i is the engagement of the i'th user. Each \theta_i has a prior distribution \pi (\theta_i), which is Gaussian with mean \mu and variance \sigma^2. I also define a curve with parameters {\bf m} through the space of observed engagement attributes. The value of \theta_i for a user determines their location on the curve in the attribute space. The probability of an observed set of attributes {\bf d}_i given \theta_i and {\bf m} is then assigned a multivariate Gaussian distribution. The value of \theta_i is marginalized out, giving an integrated likelihood in terms of {\bf m}. Standard MCMC sampling is then used. That was super clear, right? Trust me, it will be once you understand all the steps (or you will just have to make a comment and ask a question).


Figure 1: An example of one data point with uncertainty (the colors) near a line (grey dashed line) defined by two points. A residual, conditional on a point on the line defined by a scalar, is also shown. (Sorry this is vague; I can't put LaTeX equations in captions for some unclear reason.)

Figure 1 shows a diagram of a single data point {\bf d}_i and a line defined by two points {\bf Q}_0 and {\bf Q}_1, i.e., L(\theta_i) = (1-\theta_i) {\bf Q}_0 + \theta_i {\bf Q}_1, so \theta_i = 0 lands on {\bf Q}_0 and \theta_i = 1 on {\bf Q}_1. These points are parameterized with the model parameters {\bf m}. The vectors {\bf u} = {\bf d}_i - {\bf Q}_0 and {\bf v} = {\bf Q}_1 - {\bf Q}_0 will be used later to simplify some equations (note: {\bf u} should carry the subscript i, but it has been dropped for clarity). Finally, the point denoted \theta_i and the vector {\bf r}_i | \theta_i are an arbitrary point on L defined by a particular \theta_i and the difference between that point and the observed data point {\bf d}_i. The color represents the data uncertainty, i.e., the multivariate Gaussian, which determines the significance of {\bf r}_i | \theta_i. To get the probability of {\bf d}_i given {\bf m}, P({\bf d}_i | {\bf m}), the integral \int d \theta_i \, P({\bf d}_i | {\bf m}, \theta_i) \pi ( \theta_i | \mu, \sigma^2 ) is computed. Conceptually this should make sense: the probability of {\bf d}_i is the sum of the probabilities given each \theta_i.

Because of the dark sorcery that protects the Gaussian distribution, all algebra using it always works; in this case it means that P({\bf d}_i | {\bf m}) can be found analytically. In detail (i.e., more detail than anyone but me cares to see):

P({\bf r}_i | \theta_i ) = (2 \pi)^{-k/2} |C_{d}|^{-1/2}  \exp(-0.5[{\bf d}_i - L(\theta_i)]^T C_{d}^{-1}[{\bf d}_i - L(\theta_i)]) , where k is the dimension of {\bf d}_i.
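To make the notation concrete, here is a small numerical sketch of this likelihood for a single data point. All the numbers ({\bf d}_i, {\bf Q}_0, {\bf Q}_1, C_{d}) are purely illustrative, and the convention here puts \theta_i = 0 at {\bf Q}_0:

```python
import numpy as np

# Illustrative values only: one observed 2-D attribute vector, two points
# defining the line, and a data covariance matrix.
d_i = np.array([0.4, 1.1])
Q0, Q1 = np.array([0.0, 0.5]), np.array([1.0, 2.0])
C_d = np.array([[0.05, 0.01], [0.01, 0.08]])

def line_point(theta):
    """Point on the line L at parameter theta (theta=0 at Q0, theta=1 at Q1)."""
    return (1.0 - theta) * Q0 + theta * Q1

def gaussian_likelihood(d, mean, C):
    """(2*pi)^{-k/2} |C|^{-1/2} exp(-0.5 r^T C^{-1} r) with r = d - mean."""
    k = len(d)
    r = d - mean
    quad = r @ np.linalg.solve(C, r)
    return (2 * np.pi) ** (-k / 2) / np.sqrt(np.linalg.det(C)) * np.exp(-0.5 * quad)

# Likelihood of d_i given the line position at theta = 0.3.
p = gaussian_likelihood(d_i, line_point(0.3), C_d)
```

The density is largest when the line position coincides with the data point and falls off according to C_{d}.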

Now, using {\bf d}_i - L(\theta_i ) = {\bf d}_i - [(1-\theta_i ) {\bf Q}_0 + \theta_i {\bf Q}_1 ] = {\bf u} - \theta_i  {\bf v}, P({\bf r}_i | \theta_i ) can be rewritten as P({\bf r}_i | \theta_i ) = (2 \pi)^{-k/2} |C_{d}|^{-1/2}  \exp(-0.5[{\bf u} - \theta_i {\bf v}]^T C_{d}^{-1}[{\bf u} - \theta_i {\bf v}]) .

Using the completing-the-square technique, similar to the process in the change point detection post, P({\bf r}_i | \theta_i ) can be written as a scaled Gaussian distribution in terms of \theta_i : P({\bf r}_i | \theta_i ) = (2 \pi)^{-k/2} |C_{d}|^{-1/2}  \exp(-0.5 [{\bf u}^T C_{d}^{-1} {\bf u} - ({\bf v}^T C_{d}^{-1} {\bf u})^2 /{\bf v}^T C_{d}^{-1} {\bf v}] ) \times \exp(-0.5 [{\bf v}^T C_{d}^{-1} {\bf v}] [\theta_i  - ({\bf v}^T C_{d}^{-1} {\bf u} /{\bf v}^T C_{d}^{-1} {\bf v}) ]^2) . Note that what may look like a "wealth of linear algebra sins" is in fact legitimate, as the arrays involved are 1×1, i.e., scalars.

By defining \alpha = {\bf u}^T C_{d}^{-1} {\bf u} - ({\bf v}^T C_{d}^{-1} {\bf u})^2 /{\bf v}^T C_{d}^{-1} {\bf v} , S^2 = 1/[{\bf v}^T C_{d}^{-1} {\bf v}] , and \bar{\theta}= {\bf v}^T C_{d}^{-1} {\bf u} /{\bf v}^T C_{d}^{-1} {\bf v}, the probability P({\bf r}_i | \theta_i ) can be simplified to P({\bf r}_i | \theta_i ) = (2 \pi)^{-k/2} |C_{d}|^{-1/2}  \exp(-0.5 \alpha ) \exp[-0.5  ([\theta_i  - \bar{\theta} ]/S)^2] .
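If you want to convince yourself the completed square is right, a quick numerical check works nicely. The values of {\bf u}, {\bf v}, and C_{d} below are made up; the residual is written as {\bf u} - \theta {\bf v} (\theta = 0 at {\bf Q}_0):

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up test quantities: u = d_i - Q0, v = Q1 - Q0, and a data covariance.
u = np.array([0.3, -0.2])
v = np.array([1.0, 1.5])
A = rng.normal(size=(2, 2))
C_d = A @ A.T + 2 * np.eye(2)           # symmetric positive definite
Cinv = np.linalg.inv(C_d)

# Scalars from the completed square.
vCv = v @ Cinv @ v
vCu = v @ Cinv @ u
uCu = u @ Cinv @ u
S2 = 1.0 / vCv                          # S^2
theta_bar = vCu / vCv                   # \bar{theta}
alpha = uCu - vCu ** 2 / vCv            # note the square on v^T C^-1 u

# Check: the direct quadratic form equals alpha + (theta - theta_bar)^2 / S^2.
for theta in (-0.7, 0.1, 1.3):
    r = u - theta * v
    direct = r @ Cinv @ r
    completed = alpha + (theta - theta_bar) ** 2 / S2
    assert np.isclose(direct, completed)
```

The assertion passing for arbitrary \theta values is exactly the statement that the exponent factors into a constant \alpha plus a Gaussian term in \theta.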

The likelihood P({\bf r}_i | \theta_i ) is now multiplied by the prior \pi ( \theta_i ) = (2 \pi \sigma^2 )^{-1/2}  \exp[-0.5  ([\theta_i  - \mu ]/ \sigma)^2]. The result is P({\bf r}_i | \theta_i ) \pi( \theta_i) = (2 \pi)^{-(k+1)/2 } \sigma^{-1} |C_{d}|^{-1/2}  \exp(-0.5 \alpha ) \exp(-0.5 \beta ) \exp[-0.5  ([\theta_i  - \hat{\theta} ]/ \hat{\sigma} )^2] , where \hat{\theta} = (\mu/ \sigma^2 + \bar{\theta}/S^2)/(1/\sigma^2  + 1/S^2), \hat{\sigma}^2 = (1/\sigma^2  + 1/S^2)^{-1}, and \beta = (\bar{\theta} - \mu)^2/(S^2 + \sigma^2).
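The product-of-Gaussians step is easy to get wrong, so here is another short sanity check with illustrative values. The leftover constant \beta is computed here in the compact form (\bar{\theta} - \mu)^2/(S^2 + \sigma^2):

```python
import numpy as np

# Illustrative values for the scaled Gaussian (theta_bar, S^2) and the
# prior (mu, sigma^2).
theta_bar, S2 = 0.8, 0.25
mu, sigma2 = 0.0, 1.0

sigma_hat2 = 1.0 / (1.0 / sigma2 + 1.0 / S2)               # \hat{sigma}^2
theta_hat = sigma_hat2 * (mu / sigma2 + theta_bar / S2)    # \hat{theta}
beta = (theta_bar - mu) ** 2 / (S2 + sigma2)               # leftover constant

# Check the exponent identity at a few theta values:
# (theta-theta_bar)^2/S^2 + (theta-mu)^2/sigma^2
#     == (theta-theta_hat)^2/sigma_hat^2 + beta
for theta in (-1.0, 0.3, 2.0):
    lhs = (theta - theta_bar) ** 2 / S2 + (theta - mu) ** 2 / sigma2
    rhs = (theta - theta_hat) ** 2 / sigma_hat2 + beta
    assert np.isclose(lhs, rhs)
```

\hat{\theta} is the familiar precision-weighted average of \bar{\theta} and \mu, and \beta penalizes disagreement between the data's preferred \theta and the prior mean.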

Finally the probability P({\bf r}_i ) = \int d \theta_i \, P({\bf r}_i | \theta_i ) \pi( \theta_i)  can be computed: P({\bf r}_i ) = (2 \pi)^{-k/2 } |C_{d}|^{-1/2}  (\hat{\sigma}/\sigma) \exp(-0.5 [\alpha+ \beta]  ).
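As a last check on the algebra, the closed-form marginal can be compared with brute-force quadrature over \theta_i. All the scalar values below are illustrative:

```python
import numpy as np

# Illustrative scalars (same symbols as the derivation above).
k = 2
detC = 0.1                      # |C_d| for some 2x2 covariance
alpha = 1.7
theta_bar, S2 = 0.8, 0.25
mu, sigma2 = 0.0, 1.0

sigma_hat2 = 1.0 / (1.0 / sigma2 + 1.0 / S2)
beta = (theta_bar - mu) ** 2 / (S2 + sigma2)

norm = (2 * np.pi) ** (-k / 2) * detC ** (-0.5)
closed_form = norm * np.exp(-0.5 * (alpha + beta)) * np.sqrt(sigma_hat2 / sigma2)

# Brute force: integrate P(r|theta) * prior(theta) on a dense grid
# using the trapezoid rule.
theta = np.linspace(-10.0, 10.0, 200001)
lik = norm * np.exp(-0.5 * alpha) * np.exp(-0.5 * (theta - theta_bar) ** 2 / S2)
prior = np.exp(-0.5 * (theta - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
f = lik * prior
numeric = np.sum((f[1:] + f[:-1]) * 0.5) * (theta[1] - theta[0])
```

The two numbers agree to high precision, which is reassuring given how many chances there were to drop a factor.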

The likelihood of a data set is thus \prod_{i=1}^{n} P({\bf r}_i ) . Unfortunately, this expression cannot be maximized analytically; however, it is possible to use MCMC to sample the parameters.
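I promised the real sampling code for a later date, but here is a minimal sketch of what it could look like: a plain Metropolis random walk over (m_1, m_2, \mu, \log \sigma) with the line parameterization {\bf Q}_0 = [0, m_1], {\bf Q}_1 = [1, m_2] discussed below, run on synthetic data. All tuning constants and names are assumptions, not a finished implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# --- Synthetic data: hypothetical engagement attributes for n users. ---
n = 200
true_Q0, true_Q1 = np.array([0.0, 0.2]), np.array([1.0, 1.5])
theta_true = rng.normal(0.5, 0.2, size=n)
C_d = 0.02 * np.eye(2)                       # known data covariance
d = ((1 - theta_true)[:, None] * true_Q0
     + theta_true[:, None] * true_Q1
     + rng.multivariate_normal(np.zeros(2), C_d, size=n))

Cinv = np.linalg.inv(C_d)
logdetC = np.log(np.linalg.det(C_d))

def log_marginal(params):
    """Sum over users of log P(d_i | m, mu, sigma), theta_i integrated out.
    params = [m1, m2, mu, log_sigma]; Q0 = [0, m1], Q1 = [1, m2]."""
    m1, m2, mu, log_s = params
    sigma2 = np.exp(2 * log_s)
    Q0, Q1 = np.array([0.0, m1]), np.array([1.0, m2])
    u = d - Q0                               # one u_i per row
    v = Q1 - Q0
    vCv = v @ Cinv @ v
    vCu = u @ Cinv @ v                       # vector of v^T C^-1 u_i
    uCu = np.einsum('ij,jk,ik->i', u, Cinv, u)
    S2 = 1.0 / vCv
    theta_bar = vCu / vCv
    alpha = uCu - vCu ** 2 / vCv
    beta = (theta_bar - mu) ** 2 / (S2 + sigma2)
    sigma_hat2 = 1.0 / (1.0 / sigma2 + 1.0 / S2)
    return np.sum(-0.5 * (alpha + beta) - np.log(2 * np.pi)
                  - 0.5 * logdetC + 0.5 * np.log(sigma_hat2 / sigma2))

# --- Plain Metropolis over (m1, m2, mu, log sigma); flat priors assumed. ---
x = np.array([0.0, 1.0, 0.0, np.log(0.5)])   # crude starting point
lp = log_marginal(x)
samples = []
for step in range(5000):
    prop = x + rng.normal(scale=0.02, size=4)
    lp_prop = log_marginal(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        x, lp = prop, lp_prop
    samples.append(x.copy())
samples = np.array(samples)
```

Sampling \log \sigma rather than \sigma keeps the scale parameter positive without a boundary check; a real run would want tuned proposal scales, burn-in, and convergence diagnostics.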

Now that the maths is done it may be helpful to return to the original purpose: estimating user engagement. The parameters \mu and \sigma^2 are shared by all the \theta_i and are estimated using the MCMC process. So, to be explicit, the prior distribution of all of the \theta_i has a fixed functional form, but its parameters are estimated using observed data! Also note that the model parameters are buried in \alpha, \beta, and \hat{\sigma}, which must be calculated for each data point. In addition, to avoid having extra correlated parameters, it is possible to force {\bf Q}_0 = [0, m_1] and {\bf Q}_1 = [1, m_2]. Finally, some consideration should be given to the parameterization of the data covariance matrix C_{d}, in particular what level of complexity is required to model your data.

This post is long enough, so I will add some code for the MCMC sampling at a later date. Thanks for tuning in! Please come back next week for Bayesian Business Intelligence.