Posterior Predictive Uncertainty for Linear Models.

Hi All, I am going to expand on something I noticed in the last post: posterior predictive uncertainty. Afterwards I will continue with the scheduled summary of linear modeling and generalized linear models as used in business intelligence.

What caught my attention was the predictive uncertainty of new observations given the observations I have so far. That is, if I observe k data points from a Gaussian linear system with sensitivity matrix \text{A}, model parameters {\bf m}, and residual variance \sigma^2, what is the distribution of j new data points with sensitivity matrix \text{B}?
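Written out explicitly (this restatement is my own, using the symbols above), the assumed data-generating model is {\bf d} = \text{A}{\bf m} + {\bf e} with {\bf e} \sim N({\bf 0}, \sigma^2 \text{I}_k), and {\bf d}_n = \text{B}{\bf m} + {\bf e}_n with {\bf e}_n \sim N({\bf 0}, \sigma^2 \text{I}_j).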

In a Bayesian framework, this question can be answered conceptually by computing the integral \int d \sigma^2 \int d {\bf m} [ P({\bf d}_n | {\bf m}, \sigma^2) P({\bf m}, \sigma^2 | {\bf d}) ], where {\bf d} is the original data set and {\bf d}_n is the new data set. The reason this gives the answer is that P({\bf d}_n | {\bf m}, \sigma^2) P({\bf m}, \sigma^2 | {\bf d}) = P({\bf d}_n, {\bf m}, \sigma^2 | {\bf d}), and the integral marginalizes this joint distribution down to P({\bf d}_n | {\bf d}).

Figure 1 shows my actual derivation (although not the first attempt). As an aside, buying old calendars and using their backs is a very inexpensive way to get larger pieces of paper, which are perfect for cumbersome derivations. I also want to highlight that I derive everything I present here myself, mostly by hand. One of the main reasons I went through this exercise is that the actual answer seems to be strangely missing from Bayesian Data Analysis.


Figure 1: Derivation of the posterior predictive distribution for a linear model with uncorrelated Gaussian errors.

In real life I would just simulate from P({\bf d}_n | {\bf m}, \sigma^2) P({\bf m}, \sigma^2 | {\bf d}) = P({\bf d}_n, {\bf m}, \sigma^2 | {\bf d}), as it is trivial to draw random variables from a Gaussian distribution. However, because I like to pretend that I still know how to do some math, for this post I have derived it analytically. The rest of this post just transcribes Fig. 1 into a more legible format.
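For completeness, here is a minimal sketch of what that simulation could look like in R. It is my own code, not part of the derivation; simulate.pred and n.sim are names I made up, A and B are assumed to be the sensitivity matrices as R matrices, d is the observed data vector, and the flat prior on {\bf m} and \sigma^2 used in the derivation below is assumed.

simulate.pred = function(A, d, B, n.sim = 10000)
    {
    k   = nrow(A); p = ncol(A); j = nrow(B)
    V   = solve(t(A) %*% A)                           # (A^t A)^-1
    mu1 = V %*% t(A) %*% d                            # mu_1 in the text
    RSS = sum((d - A %*% mu1)^2)                      # equals Q_1(d, A) below
    out = matrix(NA, n.sim, j)
    for (i in 1:n.sim)
        {
        s2      = RSS/rchisq(1, df = k - p - 2)       # draw sigma^2 | d (needs k > p + 2)
        m       = mu1 + t(chol(s2*V)) %*% rnorm(p)    # draw m | sigma^2, d
        out[i,] = B %*% m + rnorm(j, sd = sqrt(s2))   # draw d_n | m, sigma^2
        }
    out
    }

Quantiles of the columns of the returned matrix then give predictive intervals for the new data points.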

Starting with P({\bf m}, \sigma^2 | {\bf d}), the posterior distribution of the parameters. Assuming flat priors on {\bf m} and \sigma^2, this is proportional to (2\pi)^{-k/2} (\sigma^2)^{-k/2} \exp(-\frac{1}{2} \frac{1}{\sigma^2}[{\bf d}-\text{A}{\bf m}]^{t} [{\bf d}-\text{A}{\bf m}]), which can be rewritten as (2\pi)^{-k/2} (\sigma^2)^{-k/2} \times \exp(-\frac{1}{2} \frac{1}{\sigma^2} [{\bf d}^t {\bf d} - {\bf d}^t \text{A} (\text{A}^t \text{A})^{-1}\text{A}^t {\bf d}]) \times \exp(-\frac{1}{2} \frac{1}{\sigma^2} [{\bf m} - {\bf \mu}_1]^{t} [\text{A}^t \text{A}] [{\bf m} - {\bf \mu}_1]) using the completing-the-square trick, where \mu_1 = (\text{A}^t \text{A})^{-1}\text{A}^t {\bf d}.

Similarly (a word that hides a wealth of mathematical sins), P({\bf d}_n |{\bf m}, \sigma^2,{\bf d}) can be written in the same format, i.e., P({\bf d}_n |{\bf m}, \sigma^2,{\bf d}) = (2\pi)^{-j/2} (\sigma^2)^{-j/2} \times \exp(-\frac{1}{2} \frac{1}{\sigma^2} [{\bf d}_n^t {\bf d}_n - {\bf d}_n^t \text{B} (\text{B}^t \text{B})^{-1}\text{B}^t {\bf d}_n]) \times \exp(-\frac{1}{2} \frac{1}{\sigma^2} [{\bf m} - {\bf \mu}_n]^{t} [\text{B}^t \text{B}] [{\bf m} - {\bf \mu}_n]), where \mu_n = (\text{B}^t \text{B})^{-1}\text{B}^t {\bf d}_n.

The product P({\bf m}, \sigma^2 | {\bf d}) P({\bf d}_n |{\bf m}, \sigma^2,{\bf d}) = P({\bf d}_n, {\bf m}, \sigma^2 | {\bf d}) can be expressed as proportional to a Gaussian in {\bf m} by defining a few terms. These are \tau = (\text{A}^t \text{A} + \text{B}^t \text{B})^{-1} and \mu = \tau (\text{A}^t \text{A} \mu_1 + \text{B}^t \text{B} \mu_n). Thus P({\bf d}_n | {\bf m}, \sigma^2) P({\bf m}, \sigma^2 | {\bf d}) = (2\pi)^{-(k+j)/2} (\sigma^2)^{-(k+j)/2} \times \exp(-\frac{1}{2} \frac{1}{\sigma^2} [Q_1({\bf d},\text{A}) + Q_1({\bf d}_n,\text{B})]) \times \exp(-\frac{1}{2} \frac{1}{\sigma^2} Q_2) \times \exp(-\frac{1}{2} \frac{1}{\sigma^2} [{\bf m} - \mu]^{t} \tau^{-1} [{\bf m} - \mu]), where Q_1({\bf x}, \text{C}) = {\bf x}^t{\bf x} - {\bf x}^t \text{C} (\text{C}^t \text{C})^{-1}\text{C}^t {\bf x} and Q_2 = \mu_1^t \text{A}^t \text{A} \mu_1 + \mu_n^t \text{B}^t \text{B} \mu_n - (\text{A}^t \text{A} \mu_1 + \text{B}^t \text{B} \mu_n)^t \tau (\text{A}^t \text{A} \mu_1 + \text{B}^t \text{B} \mu_n).

The Gaussian integral over {\bf m} is known, so \int d{\bf m} P({\bf d}_n, {\bf m}, \sigma^2 | {\bf d}) = (2\pi)^{-(k+j-p)/2} (\sigma^2)^{-(k+j-p)/2} |\tau|^{1/2} \times \exp(-\frac{1}{2} \frac{1}{\sigma^2} [Q_1({\bf d},\text{A}) + Q_1({\bf d}_n,\text{B}) + Q_2]), where p is the number of model parameters (the length of {\bf m}). As a function of \sigma^2 this has the form of an inverse-gamma distribution. Thus the posterior predictive distribution is P({\bf d}_n | {\bf d}) = \int d \sigma^2 (2\pi)^{-(k+j-p)/2} (\sigma^2)^{-(k+j-p)/2} |\tau|^{1/2} \times \exp(-\frac{1}{2} \frac{1}{\sigma^2} [Q_1({\bf d},\text{A}) + Q_1({\bf d}_n,\text{B}) + Q_2]) = (2\pi)^{-(k+j-p)/2} |\tau|^{1/2} \Gamma([k+j-p]/2 - 1) ([Q_1({\bf d},\text{A}) + Q_1({\bf d}_n,\text{B}) + Q_2]/2)^{1-[k+j-p]/2}.

The final form of the distribution is not clear to me, largely because of the Q_2 term, which does not seem to simplify easily. Also, somewhat ironically, because of the gamma function the analytic answer is probably less efficient than just simulating the result. So this whole exercise may be a waste of time; that could explain why it is not presented in any of the textbooks I used in my undergrad. Regardless, I hope it was at least fun to follow along.

Tune in next time for more on linear models.

Another summary of Linear Models for BI

Hi All, as with the last post I am summarizing linear modeling and generalized linear models as used in business intelligence. The areas covered are:

  1. Querying and cleaning data for modeling
  2. Plotting models with uncertainty
  3. Quick and dirty model selection
  4. Regularized regression
  5. L1 and L2 Norms

This post explores plotting models with uncertainty. Jumping right in, code to simulate and plot heteroscedastic data with a linear model is:

function (n = 365) 
{

# simulating data 
index_day = 1:n
dow       = index_day%%7
week      = floor(index_day/7)     
week.d    = 0:max(week)  
dac       = sample(x = 25000:30000, size = n, replace =T)
NW        = length(unique(week))
arpdac.t  = rep(0, n)   # noise-free ARPDAC, filled in below
week.P.T  = runif(NW) + 1
dow.P.T   = 0.5*(runif(7)-0.5) 
error     = rnorm(n, mean = 0, sd = 50/sqrt(dac))

# add week and dow effects
for (i in 1:NW)
    {
    arpdac.t[week == week.d[i]] = week.P.T[i]
    }
for (i in 0:6)
    {
    arpdac.t[dow == i] = arpdac.t[dow == i] + dow.P.T[i+1]
    }

# final data    
arpdac   = arpdac.t+error    
rev      = arpdac*dac    
data.sym = data.frame(index_day, dow, week, dac, arpdac.t, arpdac, rev)       

#############################################################################

# plotting code
par(mfrow = c(2,1), mar= c(4,4,1,1))
plot(index_day, arpdac, xlab = "date", ylab = "ARPDAC ($)", ylim=c(0, max(arpdac)))    

mod  = lm(arpdac~factor(week)+factor(dow)-1, data =data.sym, weights = dac)
P    = predict(mod, se.fit = T)
EX   = P$fit
LB   = P$fit-1.96*P$se.fit
UB   = P$fit+1.96*P$se.fit

plot(index_day, arpdac, xlab = "date", ylab = "ARPDAC ($)", ylim=c(0, max(arpdac)))    
polygon(c(index_day, rev(index_day)), c(UB, rev(LB)), col="grey", border=F)
lines(index_day, EX, lwd=3, col="blue")
lines(index_day, arpdac.t, lwd=1, col="red")   # noise-free ARPDAC (the red lines in Fig. 1)
points(index_day, arpdac)   
}

The output of this R script is shown in Fig. 1. There are two parts to the script. The first part simulates ARPDAC and revenue data. The second part fits a model to that data and plots the model (with uncertainty) over the ARPDAC data. The simulation process is straightforward; I would only like to highlight that total revenue is not simulated directly: first ARPDAC and DAC are simulated, and total revenue is calculated from them.

Moving to the second part, ARPDAC is modeled with a standard LM. The predictors are the day of week (DOW) and week. As the data has been pre-aggregated, the DAC values are used as weights for the data points. The prediction and its uncertainty are found using the “predict” function. First the expectation is found (denoted EX in the code), then the upper and lower confidence bounds are calculated (UB and LB respectively). The confidence bounds use the fact that the prediction is Gaussian, so the standard “1.96*SD” gives a 95% confidence interval. Note this interval is for the expectation (the blue line in Fig. 1), not for the data points, which have their own error on top of the uncertainty of the “true” ARPDAC. To plot the uncertainty the “polygon” function is used. I use the strategy of plotting the UB first and then the LB in reverse (i.e., “polygon(x = c(index_day, rev(index_day)), y= c(UB, rev(LB)) …”). Also note that you probably want to re-plot your data points after plotting the uncertainty region, as it is opaque.
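If a band that covers new data points (rather than just the expectation) is wanted, one way to get it is to add the residual variance, scaled by each day's weight, to the squared standard error of the fit. This is a minimal sketch of my own, not part of the original script; it assumes mod, P, and data.sym from the script above are available, and LB.new and UB.new are names I made up:

sigma  = summary(mod)$sigma                                     # residual SD for a unit-weight observation
LB.new = P$fit - 1.96*sqrt(P$se.fit^2 + sigma^2/data.sym$dac)
UB.new = P$fit + 1.96*sqrt(P$se.fit^2 + sigma^2/data.sym$dac)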


Figure 1: Top: the simulated ARPDAC data (circles). Bottom: the simulated ARPDAC data, the noise-free ARPDAC values (red lines), the predicted ARPDAC value (blue line), and the uncertainty of the prediction (grey area).

The process is virtually identical for GLMs; the difference is that the link function has to be accounted for. For example, for a probit regression this is done by wrapping the EX, LB, and UB terms in “pnorm” (i.e., “pnorm(EX)”, “pnorm(UB)”, and “pnorm(LB)”).
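As a sketch (the probit fit mod.p is hypothetical; nothing in this post actually fits one), the interval is built on the link scale and then pushed through the inverse link:

P.p = predict(mod.p, type = "link", se.fit = T)   # predictions on the link (probit) scale
EX  = pnorm(P.p$fit)
LB  = pnorm(P.p$fit - 1.96*P.p$se.fit)
UB  = pnorm(P.p$fit + 1.96*P.p$se.fit)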

That is it for now. Tune in next time for more on linear models.

A summary of Linear Models for BI

Hi All, with this and the next few posts I would like to summarize linear modeling and generalized linear models as used in business intelligence. The areas that I would like to cover are:

  1. Querying and cleaning data for modeling
  2. Plotting models with uncertainty
  3. Quick and dirty model selection
  4. Regularized regression
  5. L1 and L2 Norms

Starting with querying and cleaning data for modeling. The main question before querying data for modeling is whether it should be pre-aggregated, i.e., does each row in your data table represent an individual observation or a collection of observations?

For example, consider a data table of the form:


Table 1: An example of spend data

Now let's assume that we want to model user spend as a function of date. Without any pre-aggregation, a query like:

SELECT day, datetrunc('week', day) as week, datepart('dow', day) as dow, 
       datediff('day', '2016-01-01', day) as day_index, user, spend 
FROM table 
ORDER BY 1,4

gives everything that is needed; let us call the resulting table data.1. The R command to model spend is:

mod.1 = lm(spend~day_index, data = data.1)

This is sort of a “C-” answer though. In general, with a KPI modeled as a function of date, it is better to build a model something like:

mod.2 = lm(spend~factor(dow)+factor(week)+day_index -1, data = data.1)

The idea is that in mod.2 day-of-week effects are parametrically accounted for. The trend can be accounted for by either the factor(week) or the day_index term (both might not be necessary and some model selection process might be needed; indeed, you could also use the datediff trick to give a week_index, or derive one in R as sketched below). The “-1” is used to suppress the model intercept; I prefer to do this for models with factors as it gives a more intuitive display with the “summary” function.
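For example, a week_index can also be derived in R from the day_index already in data.1 (a one-line sketch of my own, assuming weeks are counted from the same start date as day_index):

data.1$week_index = floor(data.1$day_index/7)   # integer week counter, analogous to the datediff trick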

But I have given in to the excitement of linear modeling and digressed. Returning to the question of pre-aggregation, let data.2 be the result of the query:

SELECT day_index, AVG_SPEND, SD_SPEND/SQRT(N::float) as SD_AVG_SPEND 
FROM 
    (
     SELECT  datediff('day', '2016-01-01', day) as day_index, 
             SUM(1) as N, 
             AVG(spend) as AVG_SPEND, 
             STDDEV_SAMP(SPEND) as SD_SPEND 
     FROM table 
     GROUP BY 1
    ) daily
ORDER BY 1

I have not included the extra date columns because I wanted this to be clear. The R code for the linear model should now be:

mod.3 = lm(avg_spend~day_index, weights = sd_avg_spend^(-2), data = data.2)

The weights argument now accounts for the observed variance and the number of observations on each day. The main advantages of cleaning the data in this way are that the data table data.2 can be very small compared to data.1, and that another diagnostic is available to check the validity of the model. The diagnostic is that the model residual variance should be 1! This is because the weights have already accounted for the uncertainty of all of the means. The R code to conduct the test looks like:

SSE    = sum(weights(mod.3) * resid(mod.3)^2)       # weighted sum of squared residuals
bounds = qchisq(c(0.025, 0.975), df = mod.3$df.residual)
if((SSE < bounds[1]) | (SSE > bounds[2])) {print("Your model failed just like you do at everything you try!")}

The downside of pre-aggregating is that the model weights are not always sensible. In particular, if the number of observations in a group is small then the standard deviation (s) of the group might be zero, and thus the weights, being n/s^2, are undefined. The two solutions to undefined weights that I favor are to replace the undefined s with an average or interpolation of “near” values, or to assume a constant variance for all observations. The latter can be accomplished by changing the weights to n and conducting the same regression (see the sketch after the query below). The former is more tricky; generally in the SQL I will add a window function:

SELECT day_index, AVG_SPEND, 
       COALESCE(SD_SPEND, 0.5*COALESCE(LAST_SD, NEXT_SD) + 0.5*COALESCE(NEXT_SD, LAST_SD))/SQRT(N::float) as SD_AVG_SPEND 
FROM
    (SELECT day_index, N, AVG_SPEND, SD_SPEND,
            min(SD_SPEND) over (order by day_index rows between 1 preceding and 1 preceding) as LAST_SD,
            min(SD_SPEND) over (order by day_index rows between 1 following and 1 following) as NEXT_SD 
     FROM 
            ( SELECT datediff('day', '2016-01-01', day) as day_index, SUM(1) as N, AVG(spend) as AVG_SPEND, STDDEV_SAMP(SPEND) as SD_SPEND 
              FROM table 
              GROUP BY 1 ) daily
    ) padded           
ORDER BY 1
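For the constant-variance option mentioned above, a minimal sketch (my own; mod.4 is a hypothetical name, and it assumes N is also selected into data.2, which the earlier query does not do) would be:

mod.4 = lm(avg_spend~day_index, weights = N, data = data.2)   # assumes a constant per-observation variance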

I think that is enough on cleaning and querying data for LMs, so tune in next time for plotting linear and generalized linear models with uncertainty.