Hi all, as with the last few posts I am summarizing linear modeling and generalized linear models as used in business intelligence. The areas that I am covering are:

- Querying and cleaning data for modeling
- Plotting models with uncertainty
- Quick and dirty model selection
- Regularized regression
- L1 and L2 Norms

This is the last post in the series and, honestly, like aging, I did not save the best for last. That said, given that you have opened this link, if you stop reading now I will find you!

So now that the "extortion" portion of this post is complete we can return to norms. In regression with a meaningful definition of misfit (i.e., cases where the residuals r_i = d_i - mu_i are well defined) the norm is the exponent p in the sum of absolute residuals, sum(|r_i|^p), that is minimized. In typical linear regression an L2 norm is used, that is, sum(|r_i|^2) is minimized; note this is equivalent to assuming a Gaussian error distribution. Alternatively, if your data has outliers you might consider using an L1 norm (i.e., sum(|r_i|) is minimized); this is equivalent to assuming a Laplace distribution.
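To see the practical difference, here is a small sketch of my own (not part of the appendix code): fit a line to data containing one large outlier under both norms, L2 via lm() and L1 by numerically minimizing the sum of absolute residuals with optim(). The data-generating values (intercept -3, slope 2) are chosen to match the example used later in the post.

```r
set.seed(1)
x <- 1:20
d <- 2 * x - 3 + rnorm(20, sd = 0.5)
d[20] <- d[20] + 50                      # a single large outlier

# L2 norm: ordinary least squares
fit.l2 <- coef(lm(d ~ x))

# L1 norm: minimize the sum of absolute residuals numerically
l1.cost <- function(p) sum(abs(d - (p[1] + p[2] * x)))
fit.l1 <- optim(c(0, 1), l1.cost)$par

fit.l2   # dragged toward the outlier
fit.l1   # stays close to the true values (-3, 2)
```

The L2 fit is pulled noticeably toward the corrupted point, while the L1 fit largely ignores it; this is the robustness that motivates the Laplace assumption.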

A fun aside is to note that for the case where a single central parameter is being estimated (i.e., d_i = mu + r_i) the L2 norm is minimized when mu is the mean and the L1 norm is minimized when mu is the median. If you have too much time on your hands it is a fun exercise to find an estimate that gives an arbitrary percentile (hint: use a piecewise function to adjust the cost of being too high vs. too low, weighting the two sides by p and 1 - p).
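A quick numerical check of that aside (a sketch of my own, not from the appendix): minimize the L2 cost, the L1 cost, and the tilted piecewise L1 ("pinball") cost over a constant mu and compare against the mean, median, and a chosen percentile.

```r
set.seed(2)
d <- rexp(10001)   # skewed data, so the mean and median differ

# L2 cost: minimizer is the mean
mu.l2 <- optimize(function(mu) sum((d - mu)^2), range(d))$minimum

# L1 cost: minimizer is the median
mu.l1 <- optimize(function(mu) sum(abs(d - mu)), range(d))$minimum

# Tilted L1 cost: weight overshoot by (1 - p) and undershoot by p;
# the minimizer is the p-th quantile
p <- 0.9
pinball <- function(mu) { r <- d - mu; sum(ifelse(r > 0, p * r, (p - 1) * r)) }
mu.q <- optimize(pinball, range(d))$minimum

c(mu.l2, mean(d))         # agree
c(mu.l1, median(d))       # agree
c(mu.q, quantile(d, p))   # agree
```

Each minimizer lands on the corresponding summary statistic to within optimizer tolerance.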

The rest of this post will introduce the QQ plot as a tool for diagnosing whether the correct norm was used in a model, and propose the strategy of using the generalized normal distribution with MCMC to account for data with outliers.

A QQ plot is a plot of the quantiles of two data sets against each other. Some technical mumbo-jumbo could be given here, but really this is just plotting sorted data set 1 vs. sorted data set 2. Texts on visualizing data go into a lot of the "jumbo" about how to handle data sets of different sizes, but you will probably figure it out on your own so I don't need to repeat it.

Theoretical quantile values can be used instead of a particular data set. If two data sets (or a data set and theoretical values) come from the same distribution, their QQ plot should look like a straight line with slope 1 and zero intercept. Consider the following example:

q = qnorm(ppoints(1001))   # ppoints avoids qnorm(0) = -Inf and qnorm(1) = Inf
d = sort(rnorm(1001))      # observed quantiles are just the sorted sample
plot(q, d, xlab = "Theoretical", ylab = "Observed")
abline(a = 0, b = 1, col = "grey", lwd = 1)
points(q, d)

In linear regression the residuals are assumed to be Gaussian, thus a QQ plot of the observed residuals vs. theoretical Gaussian quantiles should be a straight line (if you don't studentize the residuals the slope will be the residual SD). If the QQ plot of your residuals does not look like a straight line then the model assumption of Gaussian errors is not correct and likely the wrong norm is being used.
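As a sketch of what that diagnostic looks like when the assumption fails (my own illustration, using base R): fit lm() to data with Laplace-style errors, generated the same way as in the appendix code, and compare the studentized residual quantiles to Gaussian quantiles.

```r
set.seed(3)
n <- 1001
x <- runif(n)
# Laplace errors: exponential magnitude with a random sign
r <- rexp(n) * sample(c(-1, 1), n, replace = TRUE)
d <- 2 * x - 3 + r

res <- rstandard(lm(d ~ x))   # studentized residuals
q <- qnorm(ppoints(n))        # theoretical Gaussian quantiles
plot(q, sort(res), xlab = "Theoretical", ylab = "Observed")
abline(a = 0, b = 1, col = "grey")
# the tails bow away from the 1:1 line: heavier than Gaussian
```

The middle of the plot tracks the line, but the extreme quantiles fall well outside it, which is the signature of heavy-tailed (here Laplace) errors.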

The pdf for the generalized Gaussian distribution is f(x) = beta / (2 * alpha * Gamma(1/beta)) * exp(-(|x - mu| / alpha)^beta). Thus for a standard 2d linear regression problem (mu_i = m1 + m2 * x_i, with n independent identically distributed observations) the log-likelihood is LL = n*log(beta) - n*(log(2) + log(alpha) + log(Gamma(1/beta))) - sum((|r_i| / alpha)^beta), where r_i = d_i - mu_i. This can then be sampled using MCMC. The only wrinkle is that because alpha and beta are correlated it may be difficult to sample them directly. In the sample code below I use the strategy of sampling the residual standard error sigma and beta, and inferring alpha = sigma * sqrt(Gamma(1/beta) / Gamma(3/beta)).
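As a sanity check on that likelihood (my own sketch; the g.LL function here is a restatement of the one in the appendix), at beta = 2 the generalized Gaussian reduces to an ordinary Gaussian, so the log-likelihood should match sum(dnorm(..., log = TRUE)):

```r
# restated from the appendix: generalized Gaussian log-likelihood,
# parameterized by the residual SD rather than alpha
g.LL <- function(sd, b, m1, m2, x, d) {
  a <- sd / sqrt(gamma(3 / b) / gamma(1 / b))  # alpha from SD and beta
  n <- length(d)
  r <- d - (m1 + m2 * x)
  n * log(b) - n * (log(2) + log(a) + lgamma(1 / b)) - sum((abs(r) / a)^b)
}

set.seed(4)
x <- runif(50)
d <- 2 * x - 3 + rnorm(50)
LL.gg    <- g.LL(sd = 1, b = 2, m1 = -3, m2 = 2, x = x, d = d)
LL.gauss <- sum(dnorm(d, mean = -3 + 2 * x, sd = 1, log = TRUE))
all.equal(LL.gg, LL.gauss)   # TRUE
```

At b = 2 the scale works out to alpha = sigma * sqrt(2) and the normalizing constant collapses to the familiar -n/2 * log(2*pi), so the two expressions agree to floating-point precision.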

The results of the MCMC sampling are shown in figure 2. A quick note: because the point of this post was to talk about norms, I have cheated in the sampling and started the MCMC chain near the true values.

The code for this plot is appended to the post. The structure should be familiar to anyone who has been following this blog. That is it for this week. Tune in next time for a new series on survival analysis.

g.data.maker <- function(N) {
  # synthetic data: straight line d = 2x - 3 with Laplace noise
  # (exponential magnitude times a random sign)
  x = 20 * sort(runif(N) - 0.5)
  r.prime = rexp(N, rate = 1)
  S = sample(c(-1, 1), N, replace = T)
  r = r.prime * S
  d.true = 2 * x - 3
  d = r + d.true
  out = data.frame(x, d.true, d)
  return(out)
}

g.LL <- function(sd, b, m1, m2, x, d) {
  # generalized Gaussian log-likelihood; alpha is inferred from the
  # residual standard error (sd) and the shape parameter (b)
  a = sd / sqrt(gamma(3 / b) / gamma(1 / b))
  n = length(d)
  mu = m1 + m2 * x
  r = d - mu
  LL = n * log(b) - n * (log(2) + log(a) + lgamma(1 / b)) - sum((abs(r) / a)^b)
  return(LL)
}

g.MCMC <- function() {
  data = g.data.maker(N = 201)
  #           sd    b     m1   m2
  mod.old = c(1,    0.1,  -3,  2)
  UB      = c(30,   2,    30,  30)
  LB      = c(0,    0,   -30, -30)
  LL.old = -Inf
  Q.sd = c(0.1, 0.05, 0.1, 0.01)
  BURN = 5000
  for (i in 1:20000) {
    order = sample(1:4, 4)   # update the parameters in a random order
    for (j in order) {
      mod.prime = mod.old
      mod.prime[j] = mod.prime[j] + rnorm(1, sd = Q.sd[j])
      if ((mod.prime[j] > LB[j]) & (mod.prime[j] < UB[j])) {
        LL.prime = g.LL(sd = mod.prime[1], b = mod.prime[2],
                        m1 = mod.prime[3], m2 = mod.prime[4],
                        x = data[, 1], d = data[, 3])
        r = runif(1)
        if (r <= exp(LL.prime - LL.old)) {   # Metropolis acceptance
          LL.old = LL.prime
          mod.old = mod.prime
        }
      }
    }
    # keep samples only after the burn-in period
    if (i == BURN) { MOD.LIST = c(LL.old, mod.old) }
    if (i > BURN) {
      MOD.LIST = rbind(MOD.LIST, c(LL.old, mod.old))
      row.names(MOD.LIST) = NULL
    }
  }

  # figure 2: trace plots (left column) and marginal histograms (right)
  lmat = matrix(c(11, 13, 12, 14,
                  11,  1, 12,  6,
                  11,  2, 12,  7,
                  11,  3, 12,  8,
                  11,  4, 12,  9,
                  11,  5, 12, 10,
                  11, 15, 12, 16),
                nrow = 7, ncol = 4, byrow = T)
  layout(mat = lmat, widths = c(0.1, 1, 0.05, 0.25),
         heights = c(0.1, 1, 1, 1, 1, 1, 0.4))
  par(mar = c(1, 1, 1, 1))

  labels = c("Log LHood", "SE[Residual]", "Beta", "M1", "M2")
  true   = c(NA, 1.414214, 1, -3, 2)   # SD of Laplace(rate 1) is sqrt(2)

  # trace plots, one per column of MOD.LIST
  for (k in 1:5) {
    ylim = range(MOD.LIST[, k])
    plot(MOD.LIST[, k], pch = ".", ylim = ylim)
    axis(2, at = mean(ylim), label = labels[k], tick = F, mgp = c(3, 3, 1))
    if (k == 5) {
      axis(1, at = nrow(MOD.LIST) / 2, label = "Sample Index",
           tick = F, mgp = c(3, 3, 1))
    }
    if (!is.na(true[k])) abline(h = true[k], col = "blue", lwd = 3)
  }

  # marginal histograms, drawn sideways to share the trace-plot y-axes
  for (k in 1:5) {
    ylim = range(MOD.LIST[, k])
    h = hist(MOD.LIST[, k], breaks = seq(ylim[1], ylim[2], length.out = 101),
             plot = F)
    plot(h$den, h$mids, type = "n", ylim = ylim)
    polygon(c(0, h$den, 0), c(ylim[1], h$mids, ylim[2]),
            col = "grey", border = F)
    if (!is.na(true[k])) abline(h = true[k], col = "blue", lwd = 3)
  }

  return(MOD.LIST)
}

samples = g.MCMC()