# Ridge Regression Appendix

Hi All

To avoid a post being to mathy I am making this short appendix that is super mathy but you don’t have to read it.

In classic regression we are trying to minimize the misfit function $\phi=[{\bf d}-A{\bf m}]^t C_d^{-1}[{\bf d}-A{\bf m}]$ in ridge regression the $\phi$ is augmented to be $\phi= [{\bf d}-A{\bf m}]^t C_d^{-1}[{\bf d}-A{\bf m}] +\mu [{\bf m}- {\bf m}_0]^t C_{{\bf m}_0}^{-1}[{\bf m}- {\bf m}_0]$. To get the minimum the derivative of $\phi$ is taken with respect to ${\bf m}$, set equal; to zero and solved for ${\bf m}$; i.e., solve $(d/d{\bf m}) \phi = 0$.

To express the derivative recall $(d/d{\bf x}) [A{\bf x}+b]^t C [D{\bf x} +e]$ $= [D{\bf x} +e]^t C^t A + [A{\bf x}+b]^t C D$ (This is analogous to the derivative of a quadratic in 1 dimension). Thus, $(d/d{\bf m}) \phi$ $=(d/d{\bf m}) [{\bf d}-A{\bf m}]^t C_d^{-1}[{\bf d}-A{\bf m}] + (d/d{\bf m}) \mu [{\bf m}- {\bf m}_0]^t C_{{\bf m}_0}^{-1}[{\bf m}- {\bf m}_0]$ $= 2[A{\bf m}-{\bf d}]^t C_d^{-1} A+2\mu [{\bf m}- {\bf m}_0]^t C_{{\bf m}_0}^{-1}$. This results from both $C_d^{-1}$ and $C_{{\bf m}_0}^{-1}$ being symmetric.

Now solving $(d/d{\bf m}) \phi = 0$ $\Rightarrow 0 = [A{\bf m}-{\bf d}]^t C_d^{-1} A+\mu [{\bf m}- {\bf m}_0]^t C_{{\bf m}_0}^{-1}$ $\Rightarrow 0 = [A{\bf m}]^t C_d^{-1} A- {\bf d}^t C_d^{-1} A+ \mu{\bf m}^t C_{{\bf m}_0}^{-1} - \mu{\bf m}_0^t C_{{\bf m}_0}^{-1}$ $\Rightarrow {\bf d}^t C_d^{-1} A + \mu{\bf m}_0^t C_{{\bf m}_0}^{-1}= [A{\bf m}]^t C_d^{-1} A + \mu{\bf m}^t C_{{\bf m}_0}^{-1}$ $\Rightarrow {\bf d}^t C_d^{-1} A + \mu{\bf m}_0^t C_{{\bf m}_0}^{-1}= {\bf m}^t A^t C_d^{-1} A + {\bf m}^t \mu C_{{\bf m}_0}^{-1}$ $\Rightarrow {\bf d}^t C_d^{-1} A + \mu{\bf m}_0^t C_{{\bf m}_0}^{-1}= {\bf m}^t [A^t C_d^{-1} A + \mu C_{{\bf m}_0}^{-1} ]$ $\Rightarrow [ {\bf d}^t C_d^{-1} A + \mu{\bf m}_0^t C_{{\bf m}_0}^{-1}] [A^t C_d^{-1} A + \mu C_{{\bf m}_0}^{-1} ]^{-1}= {\bf m}^t$.

In theory I could stop here as ${\bf m}$ is shown explicitly but this is not the most user friendly way to write it so a few more steps are needed. First removing the transpose and then pulling out the ${\bf m}_0$ term. Thus, $[ {\bf d}^t C_d^{-1} A + \mu{\bf m}_0^t C_{{\bf m}_0}^{-1}] [A^t C_d^{-1} A + \mu C_{{\bf m}_0}^{-1} ]^{-1}= {\bf m}^t$ $\Rightarrow {\bf m} = ([ {\bf d}^t C_d^{-1} A + \mu{\bf m}_0^t C_{{\bf m}_0}^{-1}] [A^t C_d^{-1} A + \mu C_{{\bf m}_0}^{-1} ]^{-1})^t$ $\Rightarrow {\bf m} = ([A^t C_d^{-1} A + \mu C_{{\bf m}_0}^{-1} ]^{-1})^t ([ {\bf d}^t C_d^{-1} A + \mu{\bf m}_0^t C_{{\bf m}_0}^{-1}])^t$ $\Rightarrow {\bf m} = [A^t C_d^{-1} A + \mu C_{{\bf m}_0}^{-1} ]^{-1} [ A^t C_d^{-1} {\bf d} + \mu C_{{\bf m}_0}^{-1} {\bf m}_0]$.

To pull out the ${\bf m}_0$ term it is necessary to add zero to the right hand side in the form of $A^t C_d^{-1} A {\bf m}_0 - A^t C_d^{-1} A {\bf m}_0$. Thus, ${\bf m} = [A^t C_d^{-1} A + \mu C_{{\bf m}_0}^{-1} ]^{-1} [ A^t C_d^{-1} {\bf d} - A^t C_d^{-1} A {\bf m}_0 + A^t C_d^{-1} A {\bf m}_0 + \mu C_{{\bf m}_0}^{-1} {\bf m}_0]$ $\Rightarrow {\bf m} = [A^t C_d^{-1} A + \mu C_{{\bf m}_0}^{-1} ]^{-1} [ A^t C_d^{-1}({\bf d} - A {\bf m}_0) +(A^t C_d^{-1} A + \mu C_{{\bf m}_0}^{-1}) {\bf m}_0]$ $\Rightarrow {\bf m} = [A^t C_d^{-1} A + \mu C_{{\bf m}_0}^{-1} ]^{-1} [ A^t C_d^{-1}({\bf d} - A {\bf m}_0)]$ $+[A^t C_d^{-1} A + \mu C_{{\bf m}_0}^{-1} ]^{-1} [A^t C_d^{-1} A + \mu C_{{\bf m}_0}^{-1}] {\bf m}_0]$ $\Rightarrow \hat{\bf m} = {\bf m}_0 + [A^t C_d^{-1} A + \mu C_{{\bf m}_0}^{-1} ]^{-1} [ A^t C_d^{-1}({\bf d} - A {\bf m}_0)]$.

Note the final form has $\hat{\bf m}$ not ${\bf m}$ as it now represents an a most likely estimator. Changing the symbol is arbitrary but it helps with notation latter. I like this form as it highlights the components of the estimate. The prior term ${\bf m}_0$ is updated (a very fancy word for added to) by $[A^t C_d^{-1} A + \mu C_{{\bf m}_0}^{-1} ]^{-1} [ A^t C_d^{-1}({\bf d} - A {\bf m}_0)]$, a function of the prior data misfit ( ${\bf d} - A {\bf m}_0$).

# Regularized Regression: Part A) SVD.

Hi all as with the last few posts  I am summarizing linear modeling and generalized linear models as used in Business intelligence. The areas that I am covering are:

1. Querying and cleaning data for modeling
2. Plotting models with uncertainty
3. Quick and dirty model selection
4. Regularized regression
5. L1 and L2 Norms

This post introduces regularized regression (i.e., singular value decomposition regression, ridge regression, general regularized regression, and Bayesian methods). In particular, it covers singular value decomposition (SVD) regression. So with that in mind are you regular?

Before answering that question it might be helpful to review  parameter estimates for linear models. Assume that we have a linear model $\text{N}(A{\bf m}, C_{\bf d})$, i.e., we have a linear model with sensitivity matrix $A$, data covariance matrix $C_{\bf d}$, and parameters ${\bf m}$. Thus the estimates of the parameter vector ${\bf m}$ is $\hat{{\bf m}} = [A^{t} C_{\bf d} A]^{-1}A^{t} C_{\bf d} {\bf d}$.

Although the $A$ and $C_{\bf d}$ are in general assumed to be known ( $C_{\bf d}$ is might be known only up to a constant) they are not in general something that the statistician can control. Consequently there is no guarantee that $[A^{t} C_{\bf d} A]$ is an invertible matrix. In the most basic sense regularization is a set of techniques to ensure that $[A^{t} C_{\bf d} A]$ invertible. Later this post shows how to incorporate prior information through regularization but as a start let us just assume that $[A^{t} C_{\bf d} A]$ is not invertible and we can’t change $A$ and $C_{\bf d}$ . So how do we get the parameter estimates?

The first technique to invert $[A^{t} C_{\bf d} A]$ is SVD. In SVD $[A^{t} C_{\bf d} A]$ is decomposed into $[V^{t} D V]$ where $V$ is column matrix of eigen vectors of $[A^{t} C_{\bf d} A]$ and $D$ is a diagonal matrix where the ith diagonal element is the eigen value ( $\lambda_i$) that corresponds to the eigen vector in the ith column of $V$. By convention normally the eigen values are given in descending order.

The inverse of a matrix of the form $[V ^{t}D V]$ is $[V^{t} D^{-1} V]$ as eigen vectors are orthogonal, i.e., $V^{t} V = I_n$. Recall that if a matrix is not invertible, then at-least one of its eigen values must be zero. Thus in our problem, where $[A^{t} C_{\bf d} A]$ is not invertible, only $k < n$ of the eigen values can be non zero, i.e., $D$ looks like: Also recall that the inverse of a diagonal matrix is the inverse of its diagonal elements. The combination of these two facts means that $D^{-1}$ will not be defined for some of its diagonal element. To make a meaningful estimate of $[A^{t} C_{\bf d} A]^{-1}$ we define $\hat{D}^{-1}$ as: That is the ith diagonal elements of $\hat{D}^{-1}$ where the $\lambda_i > 0$  are $1/ \lambda_i$ otherwise (where $\lambda_i = 0$) the ith diagonal element is 0.  So putting this all together the SVD estimate of the parameter vector ${\bf m}$ is $\hat{{\bf m}} =[V^{t} \hat{D}^{-1} V]A^{t} C_{\bf d} {\bf d}$.

As a final point please note that you can replace $1/ \lambda_i$ with $0$ even if $\lambda_i \neq 0$. This might be desirable if $1/ \lambda_i$ is very small and  causing $[A^{t} C_{\bf d} A]^{-1}$ to be numerically unstable.

That is it for now, the next post will continue with ridge regression.

# Quick and Dirty Model Selection

Hi All as with the last three posts I am summarizing linear modeling and generalized linear models as used in Business intelligence. The areas that I are covered are:

1. Querying and cleaning data for modeling
2. Plotting models with uncertainty
3. Quick and dirty model selection
4. Regularized regression
5. L1 and L2 Norms

This post explores number 3 model selection. Model selection is a topic that has been covered a few times already (1, 2, 3, 4) on this blog. What makes this post different (i.e., better) is that it lacks mathematical rigor and  uses base functions (e.g., lm).

Now before going forward nothing here is worthwhile if you are happy with stepAIC. Personally I have found it so buggy as to never be useful; if you have power to you and I suggest you use the time you would have spent here on something more productive.

Before model selecting can happen data must be simulated:

# data for modeling
x.index = 1:336
x.month = floor((x.index-1)/28)
x.dom   = rep(1:28, 12)
x.week  = floor((x.index-1)/7)
x.dow   = rep(0:6, 12*4)
x.rand1 = runif(336)
x.rand2 = runif(336)
x.rand3 = runif(336)
x.dom0    = rep(1:28, 12)
x.dom1    = rep(1:28, 12)
x.dom2    = rep(1:28, 12)
x.dom3    = rep(1:28, 12)
x.dom4    = rep(1:28, 12)
x.dom5    = rep(1:28, 12)
x.dom6    = rep(1:28, 12)
x.dom7    = rep(1:28, 12)
x.dom8    = rep(1:28, 12)
x.dom9    = rep(1:28, 12)
x.dom10   = rep(1:28, 12)
x.dom0[x.month != 0] = 0
x.dom1[x.month != 1] = 0
x.dom2[x.month != 2] = 0
x.dom3[x.month != 3] = 0
x.dom4[x.month != 4] = 0
x.dom5[x.month != 5] = 0
x.dom6[x.month != 6] = 0
x.dom7[x.month != 7] = 0
x.dom8[x.month != 8] = 0
x.dom9[x.month != 9] = 0
x.dom10[x.month != 9] = 0

X  = data.frame(x.index,x.month,x.dom,x.week ,x.dow , x.rand1, x.rand2, x.rand3,x.dom0 ,x.dom1 ,x.dom2 ,x.dom3 ,x.dom4 ,x.dom5 ,x.dom6 ,x.dom7 ,x.dom8 ,x.dom9 ,x.dom10)

# data for making the simulated observations
x.sun   = rep(c(1,0,0,0,0,0,0), 12*4)
x.mon   = rep(c(0,1,0,0,0,0,0), 12*4)
x.tue   = rep(c(0,0,1,0,0,0,0), 12*4)
x.wed   = rep(c(0,0,0,1,0,0,0), 12*4)
x.thu   = rep(c(0,0,0,0,1,0,0), 12*4)
x.fri   = rep(c(0,0,0,0,0,1,0), 12*4)
x.sat   = rep(c(0,0,0,0,0,0,1), 12*4)
x.m0    = c(rep(1, 28), rep(0, 11*28))
x.m1    = c(rep(0, 1*28), rep(1, 28), rep(0, 10*28))
x.m2    = c(rep(0, 2*28), rep(1, 28), rep(0, 9*28))
x.m3    = c(rep(0, 3*28), rep(1, 28), rep(0, 8*28))
x.m4    = c(rep(0, 4*28), rep(1, 28), rep(0, 7*28))
x.m5    = c(rep(0, 5*28), rep(1, 28), rep(0, 6*28))
x.m6    = c(rep(0, 6*28), rep(1, 28), rep(0, 5*28))
x.m7    = c(rep(0, 7*28), rep(1, 28), rep(0, 4*28))
x.m8    = c(rep(0, 8*28), rep(1, 28), rep(0, 3*28))
x.m9    = c(rep(0, 9*28), rep(1, 28), rep(0, 2*28))
x.m10   = c(rep(0, 10*28), rep(1, 28), rep(0, 1*28))
x.m11   = c(rep(0, 11*28), rep(1, 28))

# the simulated observations
n      = floor(runif(336, min = 10000, max =20000))
sd     = rchisq(n = 336, df = 25, ncp = 0)/sqrt(n)
error  = rnorm(336, mean= 0, sd =sd)
y.true = 0.5*x.fri + 0.5*x.sat - 0.25*x.mon - 0.25*x.tue - 0.25*x.wed -0.25*x.thu +
0.15*x.m0 +0.2*x.m1  -0.05*x.m2  -0.65*x.m3  +0.05*x.m4  +0.35*x.m5  +0.25*x.m6  +0.15*x.m7  -0.15*x.m8  +0.15*x.m9  -0.45*x.m10  +
0.02*x.dom5 - 0.03*x.dom9 +0.015*x.dom10 +
1.5
y      = y.true +error

# plot data
plot(x.index, y)
lines(x.index, y.true, col="blue", lwd=3)
points(x.index, y)

The simulated data and true model are shown in Fig. 1. The true model has day of week (dow) and monthly parameters. In addition 6n and 10th months have slopes. In real-life this code would just be a sql query. Figure 1: The simulated data with noise (circles) and without noise (blue lines).

The model selection procedures described here algorithmically have two parts. The first part is to build a data frame will all features that will be considered in selection process. These features can be either individual predictors, collections of categorical predictors, interaction terms, or any combination there of. The only restriction is that they must be prespecified. The second step of the algorithm is to compute the BIC of models using sub sets of the features compiled in the first step. Here both an exhaustive and annealing method is given.

Returning to the first step:

#make data.frame for model selection
data.MS    = data.frame(
x.index           = X$x.index , #1 x.month.cat = factor(X$x.month),                    # 2
x.dom             = X$x.dom , # 3 x.week = X$x.week    ,                         # 4
x.week.cat        = factor(X$x.week) , # 5 x.dow = X$x.dow  ,                            # 6
x.dow.cat         = factor(X$x.dow), # 7 x.month.dow.cat = factor(X$x.dow):factor(X$x.month), # 8 x.rand1 = X$x.rand1 ,                           # 9
x.rand2           = X$x.rand2 , # 10 x.dom0 = X$x.dom0  ,                           # 11
x.dom1            = X$x.dom1 , # 12 x.dom2 = X$x.dom2  ,                           # 13
x.dom3            = X$x.dom3 , # 14 x.dom4 = X$x.dom4  ,                           # 15
x.dom5            = X$x.dom5 , # 16 x.dom6 = X$x.dom6  ,                           # 17
x.dom7            = X$x.dom7 , # 18 x.dom8 = X$x.dom8  ,                           # 19
x.dom9            = X$x.dom9 , # 20 x.dom10 = X$x.dom10 ,                           # 21
y                 = y

)

In this example the individual predictors are “x.index”, “x.dom”, “x.week”, “x.dow”, “x.rand1”, and “x.rand2”. The collections of predictors are “x.month.cat”, “x.week.cat”, and “x.dow.cat”. Finally, the interaction terms are “x.month.dow.cat”, and “x.dom” 0 to 10. The exhaustive approach will calculate the BIC for each subset of these predictors.  The code for the exhaustive method is:


# model selection
BIC.list = 1:2047
N        = sum(n)

for ( i in 1:2047)
{
print("--------------------------------------------------")
print(i)
cols = intToBits(i)[1:11]
cols.tf = c(cols == 1, cols == 1, cols == 1, cols == 1, cols == 1, cols == 1,cols == 1,cols == 1,cols == 1, cols == 1,  rep(cols == 1,11), T )

data.temp = data.MS[,cols.tf]
print(colnames(data.temp))

mod.temp = lm(y~., data = data.temp, weights =  sd^(-2)  ) # note that in this case I am asuming that hte true sds are known
LogLhood = logLik(mod.temp)
K        = mod.temp$rank K.col = sum(cols.tf) BIC = -2*LogLhood + log(N)*K + 0.01*K.col print(K) print(K.col) print(log(N)) print(BIC) BIC.list[i] = BIC } i = which.min(BIC.list) cols = intToBits(i)[1:11] cols.tf = c(cols == 1, cols == 1, cols == 1, cols == 1, cols == 1, cols == 1,cols == 1,cols == 1,cols == 1, cols == 1, rep(cols == 1,11), T ) data.temp = data.MS[,cols.tf] mod.best = lm(y~., data = data.temp, weights = sd^(-2) ) # note that in this case I am asuming that hte true sds are known P = predict(mod.best, se.fit =T) EX = P$fit
LB = P$fit - 1.96*P$se.fit
UB = P$fit + 1.96*P$se.fit

plot(x.index, y)
polygon(c(x.index, rev(x.index)), c(UB, rev(LB)), col= "grey", border=F)
lines(x.index, EX, col="red", lwd=3)
lines(x.index, y.true, col="blue", lwd=3, lty =2)
points(x.index, y)

I would like to highlight a few choices. First the BIC is calculated manually. This is does because the BIC function in r does not always calculate the number of observations correctly when they are used in the weights of the linear model. Second the strategy of using  intToBits and then making rules on the cols.tf slows for groups of parameters to be linked. Here all of the “x.dom” parameters are either on or off together. Thirdly, I have modified the BIC to avoid models with many unused parameters (the “0.01*K.col” term).

Here there are only 11 features that could be on or off. If it is describable to look through many features to the extent that an exhaustive search is not possible then an annealing approach seems to be the best plan.  The code for the annealing model selection is

# model selection
N.steps  = 10000
t        = 10
t.step   = 0.9995
0        = last.accept
N        = sum(n)
cols.tf  = c(runif(1) >= 0.5, runif(1) >= 0.5, runif(1) >= 0.5, runif(1) >= 0.5, runif(1) >= 0.5,
runif(1) >= 0.5, runif(1) >= 0.5, runif(1) >= 0.5, runif(1) >= 0.5, runif(1) >= 0.5,
runif(1) >= 0.5, runif(1) >= 0.5, runif(1) >= 0.5, runif(1) >= 0.5, runif(1) >= 0.5,
runif(1) >= 0.5, runif(1) >= 0.5, runif(1) >= 0.5, runif(1) >= 0.5, runif(1) >= 0.5,
runif(1) >= 0.5, T )
cols.tf.old  = cols.tf
cols.tf.best = cols.tf
data.temp    = data.MS[,cols.tf]
print(colnames(data.temp))

mod.temp = lm(y~., data = data.temp, weights =  sd^(-2)  ) # note that in this case I am asuming that hte true sds are known
LogLhood = logLik(mod.temp)
K        = mod.temp$rank K.col = sum(cols.tf) BIC.OLD = -2*LogLhood + log(N)*K + 0.01*K.col BIC.BEST = BIC.OLD for ( i in 1:N.steps) { print("--------------------------------------------------") print(i) cols.tf = cols.tf.old j = sample(1:21,1) # S. = 1 if ( (cols.tf[j] ==T) & (S. == 1) ) { cols.tf[j] = F S. = 0 } if ( (cols.tf[j] ==F) & (S. == 1) ) { cols.tf[j] = T S. = 0 } data.temp = data.MS[,cols.tf] print(colnames(data.temp)) mod.temp = lm(y~., data = data.temp, weights = sd^(-2) ) # note that in this case I am asuming that hte true sds are known LogLhood = logLik(mod.temp) K = mod.temp$rank
K.col    = sum(cols.tf)

BIC.NEW      = -2*LogLhood + log(N)*K  + 0.01*K.col

r = runif(1)
A = exp((BIC.OLD-BIC.NEW)/t )

print(paste("A", A))
print(paste("NEW", BIC.NEW))
print(paste("OLD", BIC.OLD))
if(A >= r)
{
print("accept new model")
BIC.OLD = BIC.NEW
cols.tf.old = cols.tf
last.accept = i
}

if(BIC.NEW <= BIC.BEST)
{
BIC.BEST = BIC.NEW
cols.tf.best = cols.tf
}

print(K)
print(K.col)
print(paste("t",t))
print(paste("j", j))
print(log(N))
print(BIC.OLD)

t= t*t.step

}

cols.tf      = cols.tf.best
cols.tf.old  = cols.tf
cols.tf.best = cols.tf
data.temp    = data.MS[,cols.tf]
print(colnames(data.temp))

# secound run though for good measure
0        = last.accept
mod.temp = lm(y~., data = data.temp, weights =  sd^(-2)  ) # note that in this case I am asuming that hte true sds are known
LogLhood = logLik(mod.temp)
K        = mod.temp$rank K.col = sum(cols.tf) BIC.OLD = -2*LogLhood + log(N)*K + 0.01*K.col BIC.BEST = BIC.OLD for ( i in 1:N.steps) { print("--------------------------------------------------") print(i) cols.tf = cols.tf.old j = sample(1:21,1) # S. = 1 if ( (cols.tf[j] ==T) & (S. == 1) ) { cols.tf[j] = F S. = 0 } if ( (cols.tf[j] ==F) & (S. == 1) ) { cols.tf[j] = T S. = 0 } data.temp = data.MS[,cols.tf] print(colnames(data.temp)) mod.temp = lm(y~., data = data.temp, weights = sd^(-2) ) # note that in this case I am asuming that hte true sds are known LogLhood = logLik(mod.temp) K = mod.temp$rank
K.col    = sum(cols.tf)
BIC.NEW      = -2*LogLhood + log(N)*K  + 0.01*K.col
r = runif(1)
A = exp((BIC.OLD-BIC.NEW)/t )
print(paste("A", A))
print(paste("NEW", BIC.NEW))
print(paste("OLD", BIC.OLD))
if(A >= r)
{
print("accept new model")
BIC.OLD = BIC.NEW
cols.tf.old = cols.tf
last.accept = i
}
if(BIC.NEW <= BIC.BEST)
{
BIC.BEST = BIC.NEW
cols.tf.best = cols.tf
}
print(K)
print(K.col)
print(paste("t",t))
print(paste("j", j))
print(log(N))
print(BIC.OLD)
t= t*t.step
}
data.temp = data.MS[,cols.tf]
mod.best  = lm(y~., data = data.temp, weights =  sd^(-2)  ) # note that in this case I am asuming that hte true sds are known
P  = predict(mod.best, se.fit =T)
EX = P$fit LB = P$fit - 1.96*P$se.fit UB = P$fit + 1.96*P\$se.fit

plot(x.index, y)
polygon(c(x.index, rev(x.index)), c(UB, rev(LB)), col= "grey", border=F)
lines(x.index, EX, col="red", lwd=3)
lines(x.index, y.true, col="blue", lwd=3, lty =2)
points(x.index, y)

This code is almost exactly the same as the exhaustive search but now a random walk is taken instead of a systematic iteration. In addition, models are not automatically accepted but only accepted using stochastic annealing.  Two cooling runs are made because … why not? It is good to check stability.

As a final point I would say once you have a best model cross validation is always a good extra check.

Tune in next time for more on linear models.