L1 and L2 Norms

Hi all as with the last few posts I am summarizing linear modeling and generalized linear models as used in Business intelligence. The areas that I am covering are:

  1. Querying and cleaning data for modeling
  2. Plotting models with uncertainty
  3. Quick and dirty model selection
  4. Regularized regression
  5. L1 and L2 Norms

This is the last post in the series and honestly like aging I did not save the best to last.  That said given that you have opened this link if you stop reading now I will find you!

So now that the “extortion” portion of this post is completed we can return to Norms. In regression with a meaningful definition of misfit (i.e., cases where f[{\bf m}] - {\bf d} = {\bf r} ) the norm is the exponent of the sum absolute residual that minimized. In typical linear regression an L2 Norm is used, that is \sum|r_i|^2 is minimized. Note this is equivalent assuming a Gaussian distribution. Alternatively if your data has an outlier you might consider using an L1 norm (i.e., \sum|r_i| is minimized); this is equivalent to assuming a Laplace distribution.

A fun aside is to note that for the case where a central parameter is being estimated (d_i - \theta = r_i) the L2 norm will minimize error when \theta is the mean and the L1 norm will minimize error when of \theta is the median. If you have too much time on your hands it is a fun exercise to find an estimate that gives an arbitrary percentile (hint: use a piece wise function to adjust cost of being to high or too low f(x) = ax | x>0 and -bx |x<0).

The rest of this post will introduce the QQ plot as tool for diagnosing if the correct Norm was used in a model. And propose the strategy of using the generalized normal distribution with MCMC to account for data with outliers.

A QQ plot is a plot of the quantiles of two data sets vs each other.  Some technical mambo-jumbo could be given here but really this just plotting sorted data set 1 vs sorted data set 2. Visualizing data goes into a lot of the “jumbo” about how to deal with different sized data sets but you will probably figure it out on your own so I don’t need to repeat it.

Theoretical quantile values can be used instead a particular data set. If two data sets / theoretical values come from the same distribution their QQ plot should look like a straight line with slope 1 and zero intercept. Consider the following example:

q = qnorm(seq(0, 1, length = 1001))
d = sort(rnorm(1001))
plot(q,d, xlab = "Theoretical", ylab = "Observed" )
abline(a=0, b =1, col="grey", lwd=1)
Screen Shot 2017-04-22 at 10.31.50 PM

Figure 1: A QQ plot showing observed Gaussian quantiles for a sample of size 1001 vs theoretical quanatiles. The grey line give y = x.

In linear regression the residuals are assumed be Gaussian thus a QQ plot of the observed residuals vs theoretical Gaussian quantiles should be a straight line (if you don’t studentize them the slope will be the SD). If the QQ plot for your data residuals does not look like a straight line then the model assumption of Gaussian errors is not correct and likely the wrong norm is being used.

The pdf for the generalized Gaussian distribution is f(x) = \beta (2 \alpha \Gamma[1/ \beta] ) ^1 \exp(- [|x-\mu|/ \alpha]^ \beta ) . Thus for a standard 2d linear regression problem (d_i = m_1 + m_2 x_i + r_i and there are n independent identically distributed observations), the log-likelihood is \log(L) = n*\log( \beta) - n(log(2) + log( \alpha) + \Gamma[1/ \beta]) - \sum( |d_i - m_1 + m_2 x_i |/ \alpha)^{ \beta}). This can then be sampled using MCMC. The only wrinkle is that because \alpha and \beta or correlated it may be difficult to sample them directly. In sample code I show below i use the strategy of sampling the residual standard error ( \alpha (\Gamma[3/ \beta] / \Gamma[1/\beta])^{1/2} ) and \beta to infer \alpha.

The results of MCMC sampling is shown in figure 2. Just a quick note because the point of this post was to talk about norms I have cheated in the sampling and started the MCMC chain at the true value.

Screen Shot 2017-04-23 at 6.18.00 PM.png

Figure 2: Sample trajectories (first column) and posterior 1 dimensional marginal distributions for the sampled parameters. The blue lines show the “true” value.

The code for this plot is appended to the post.  The structure should be familiar to any one that has been following this blog. That is it for this week. Tune in next time for a new series on survival analysis.

x       = 20*sort(runif(N)-0.5)
r.prime = rexp(N, rate = 1)
S       = sample(c(-1,1), N, replace = T)
r       = r.prime*S 

d.true  = 2*x -3
d       = r + d.true 

#hist(r, breaks=seq(-50, 50, length.out = 101))

out = data.frame(x,d.true,d)
g.data.maker(N = 10001)

g.LL <-function(sd,b,m1, m2, x, d)
a = sd / sqrt(gamma(3/b)/gamma(1/b))    
n = length(d)
mu = m1+ m2*x
r  = d - mu 
LL = n*log(b) - n*(log(2) + log(a) + lgamma(1/b)) - sum((abs(r)/a)^b)

g.MCMC <-function()
data = g.data.maker(N = 201)

          # sd   b   m1   m2
mod.old = c(1,  0.1,   -3,  2)
UB      = c(30, 2,  30, 30)
LB      = c(0,  0, -30,-30) 

LL.old  = -Inf 

Q.sd    = c(0.1, 0.05, 0.1, 0.01)

BURN = 5000

for (i in 1:20000)
    order = sample(1:4, 4)
    for(j in order)
        mod.prime    = mod.old
        mod.prime[j] = mod.prime[j] +rnorm(1, sd = Q.sd[j])
        if( (mod.prime[j] > LB[j]) & (mod.prime[j]< UB[j]))
            LL.prime     = g.LL(sd = mod.prime[1],b= mod.prime[2],m1= mod.prime[3], m2= mod.prime[4], x = data[,1], d = data[,3])

            r = runif(1)
            if(r <= exp(LL.prime - LL.old))
                LL.old = LL.prime
                mod.old = mod.prime


    if ( i == BURN) {MOD.LIST = c(LL.old, mod.old) }
    if ( i > BURN)
        MOD.LIST = rbind(MOD.LIST, c(LL.old, mod.old)) 
        row.names(MOD.LIST) = NULL

lmat = matrix(c(11,13,12,14,
        11,1,12, 6,
        11,2,12, 7,
        11,3,12, 8,
        11,4,12, 9,
        11,15,12, 16), nrow = 7, ncol=4, byrow=T)
layout(mat=lmat, widths = c(0.1, 1, 0.05, 0.25), heights =c(0.1, 1,1,1,1,1, 0.4) )
par( mar= c(1,1,1,1))


plot(MOD.LIST[,1], xlab = "sample index", ylab = "Log LHood", pch=".", ylim=c(min(MOD.LIST[,1]), max(MOD.LIST[,1]))  )
axis(2, at = mean(c(min(MOD.LIST[,1]), max(MOD.LIST[,1]))), label =  "Log LHood", tick= F, mgp = c(3,3,1) )
plot(MOD.LIST[,2], xlab = "sample index", ylab = "SE[Residual]", pch=".", ylim=c(min(MOD.LIST[,2]), max(MOD.LIST[,2])))
axis(2, at = mean(c(min(MOD.LIST[,2]), max(MOD.LIST[,2]))), label =  "SE[Residual]", tick= F, mgp = c(3,3,1) )
abline(h = 1.414214, col="blue", lwd=3)
plot(MOD.LIST[,3], xlab = "sample index", ylab = "Beta", pch=".", ylim=c(min(MOD.LIST[,3]), max(MOD.LIST[,3])))
axis(2, at = mean(c(min(MOD.LIST[,3]), max(MOD.LIST[,3]))), label =  "Beta", tick= F, mgp = c(3,3,1) )
abline(h = 1, col="blue", lwd=3)
plot(MOD.LIST[,4], xlab = "sample index", ylab = "M1", pch=".", ylim=c(min(MOD.LIST[,4]), max(MOD.LIST[,4])))
axis(2, at = mean(c(min(MOD.LIST[,4]), max(MOD.LIST[,4]))), label =  "M1", tick= F, mgp = c(3,3,1) )
abline(h = -3, col="blue", lwd=3)
plot(MOD.LIST[,5], xlab = "sample index", ylab = "M2", pch=".", ylim=c(min(MOD.LIST[,5]), max(MOD.LIST[,5])))
axis(2, at = mean(c(min(MOD.LIST[,5]), max(MOD.LIST[,5]))), label =  "M2", tick= F, mgp = c(3,3,1) )
axis(1, at = nrow(MOD.LIST)/2, label =  "Sample Index", tick= F, mgp = c(3,3,1) )
abline(h = 2, col="blue", lwd=3)

h = hist(MOD.LIST[,1], breaks =seq(min(MOD.LIST[,1]), max(MOD.LIST[,1]), length.out =101), plot =F)
plot(h$den, h$mids, type ="n", ylim=c(min(MOD.LIST[,1]), max(MOD.LIST[,1])) ,  )
polygon(c(0,h$den,0), c(min(MOD.LIST[,1]), h$mids, max(MOD.LIST[,1])), col="grey", border=F )

h = hist(MOD.LIST[,2], breaks =seq(min(MOD.LIST[,2]), max(MOD.LIST[,2]), length.out =101), plot =F)
plot(h$den, h$mids, type ="n", ylim=c(min(MOD.LIST[,2]), max(MOD.LIST[,2])))
polygon(c(0,h$den,0), c(min(MOD.LIST[,2]), h$mids, max(MOD.LIST[,2])), col="grey", border=F )
abline(h = 1.414214, col="blue", lwd=3)

h = hist(MOD.LIST[,3], breaks =seq(min(MOD.LIST[,3]), max(MOD.LIST[,3]), length.out =101), plot =F)
plot(h$den, h$mids, type ="n", ylim=c(min(MOD.LIST[,3]), max(MOD.LIST[,3])))
polygon(c(0,h$den,0), c(min(MOD.LIST[,3]), h$mids, max(MOD.LIST[,3])), col="grey", border=F )
abline(h = 1, col="blue", lwd=3)

h = hist(MOD.LIST[,4], breaks =seq(min(MOD.LIST[,4]), max(MOD.LIST[,4]), length.out =101), plot =F)
plot(h$den, h$mids, type ="n", ylim=c(min(MOD.LIST[,4]), max(MOD.LIST[,4])))
polygon(c(0,h$den,0), c(min(MOD.LIST[,4]), h$mids, max(MOD.LIST[,4])), col="grey", border=F )
abline(h = -3, col="blue", lwd=3)

h = hist(MOD.LIST[,5], breaks =seq(min(MOD.LIST[,5]), max(MOD.LIST[,5]), length.out =101), plot =F)
plot(h$den, h$mids, type ="n", ylim=c(min(MOD.LIST[,5]), max(MOD.LIST[,5])))
polygon(c(0,h$den,0), c(min(MOD.LIST[,5]), h$mids, max(MOD.LIST[,5])), col="grey", border=F )
abline(h = 2, col="blue", lwd=3)


samples = g.MCMC()


C) Bayesian Linear Regression

Hi all as with the last few posts I am summarizing linear modeling and generalized linear models as used in Business intelligence. The areas that I am covering are:

  1. Querying and cleaning data for modeling
  2. Plotting models with uncertainty
  3. Quick and dirty model selection
  4. Regularized regression
  5. L1 and L2 Norms

This post continues the discussion on  regularized regression (i.e., singular value decomposition regression, ridge regression, general regularized regression, and Bayesian methods). In particular, it covers Bayesian methods. My original intention had been to cover 1 Bayesian approach (pseudo-data) and then produce some sort of “taste test” (yes given Pepsi’s trouble this week I am giving them the enormous marketing boost that comes from highlighting their less controversial commercials to my loyal blog readers [i.e., people that I guilt into looking at this at work] ), however I got distracted with some other projects. I don’t like going too long without making a new post; so the new plan is to present the pseudo-data approach mathematically.

The basic idea of the pseudo-data approach is to exploit the above solution by treating the prior information as  “data”.  Specifically in the case where the data residuals are uncorrelated and there is Gaussian prior for each of the parameters the data is augmented such that  {\bf d}^{\star} = ({\bf d}, {\bf m}_0) and  A^{\star} = (A, I_p)

Recall that P({\bf m}, \sigma^2 | {\bf d}), the posterior distribution of the model ({\bf m}) is proportional to (2\pi)^{-k/2} (\sigma^2)^{-k/2} \exp(-\frac{1}{2} \frac{1}{\sigma^2}[{\bf d}-\text{A}{\bf m}]^{t} [{\bf d}-\text{A}{\bf m}]) which can be rewritten as (2\pi)^{-k/2} (\sigma^2)^{-k/2}  \times \exp(-\frac{1}{2} \frac{1}{\sigma^2} [{\bf d}^t {\bf d} - {\bf d}^t  \text{A} (\text{A}^t  \text{A} )^{-1}\text{A}^t {\bf d} ] )  \times \exp(-\frac{1}{2} \frac{1}{\sigma^2} [{\bf m} - {\bf \mu}_1]^{t} [\text{A}^t  \text{A} ]^{-1} [{\bf m} - {\bf \mu}_1] ) using the completing the squares trick, where \mu_1 = (\text{A}^t  \text{A} )^{-1}\text{A}^t {\bf d} . Thus the solution with prior information on {\bf m} is (2\pi)^{-(k+p)/2} (\sigma^2)^{-(k+p)/2}  \times \exp(-\frac{1}{2} \frac{1}{\sigma^2} [{{\bf d}^{\star}}^t {\bf d}^{\star} - {{\bf d}^{\star}}^t  \text{A}^{\star} ({\text{A}^{\star}}^t  {\text{A}^{\star}} )^{-1}{\text{A}^{\star}}^t {{\bf d}^{\star}} ] )  \times \exp(-\frac{1}{2} \frac{1}{\sigma^2} [{\bf m} - {\bf \mu}_1^{\star}]^{t} [{\text{A}^{\star}}^t  {\text{A}^{\star}} ]^{-1} [{\bf m} - {\bf \mu}_1^{\star}] ) where {\bf \mu}_1^{\star} = ({\text{A}^{\star}}^t  {\text{A}^{\star}} )^{-1}{\text{A}^{\star}}^t {\bf d}^{\star}.

More complicated prior information can be accounted for similarly to the regularized regression already discussed; that is by also augmenting the co-variance matrix  Screen Shot 2017-04-08 at 2.51.16 PM.png
where 0_{k,p} is a k by p zero matrix. In addition more complicated structures can be used in A^{\star}. For example A^{\star} = (A, D_p) where D_p is

Screen Shot 2017-04-08 at 3.11.20 PM.png

is the “flattest solution”  (note this could also be done with a auto regressive prior co-variance matrix). Once a {\bf d}^{\star}, A^{\star}, and C_{\bf d}^{\star} are chosen standard Gibbs sampling can be done.

A few final thoughts:

First if you are using a formulation were the data co-variance is a summed known until a consent the prior can be a function of your estimated data variance (e.g., proportional to 1/\sigma^2) such that data variance does not change prior information.

Second don’t be be frightened to used MCMC or other sampling methods for more complicated prior co-variances.

Third I prefer this formalization to other regularization ideas  presented here as the parameter uncertainty estimates are easier to understand in the context of data and prior information.

That is it for now.