Regularized Regression: Part B) Ridge Regression.

Hi all! As with the last few posts, I am summarizing linear modeling and generalized linear models as used in business intelligence. The areas that I am covering are:

  1. Querying and cleaning data for modeling
  2. Plotting models with uncertainty
  3. Quick and dirty model selection
  4. Regularized regression
  5. L1 and L2 Norms

This post continues the discussion of regularized regression (i.e., singular value decomposition regression, ridge regression, general regularized regression, and Bayesian methods). In particular, it covers ridge regression.

Ridge regression (apparently called Tikhonov regularization by Wikipedia) is the use of prior information to stabilize the parameter estimates. The standard misfit function \phi=[{\bf d}-A{\bf m}]^t C_d^{-1}[{\bf d}-A{\bf m}] is augmented with the “prior misfit” term \mu [{\bf m}- {\bf m}_0]^t C_{{\bf m}_0}^{-1}[{\bf m}- {\bf m}_0]. The scalar \mu sets the weight of the prior information relative to the data information (there will be more on choosing it later). {\bf m}_0 and C_{{\bf m}_0} define the prior information; e.g., {\bf m}_0 could be {\bf 0} and C_{{\bf m}_0} = I_{n}, causing the prior to prefer the “smallest” model. It is worth spending some time thinking about why this is true and how you could get the flattest or smoothest model instead.
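As a hint for that last question, here is one common construction (my sketch, not from the post): build C_{{\bf m}_0}^{-1} from a first-difference operator D so that the prior penalizes model roughness rather than model size, leaving constant ("flat") models unpenalized.

```python
import numpy as np

n = 5
# First-difference operator: (D m)_i = m_{i+1} - m_i, shape (n-1, n).
D = np.diff(np.eye(n), axis=0)

# Using C_m0^{-1} = D^T D penalizes differences between adjacent model
# parameters, so a constant ("flat") model incurs zero prior misfit:
flat = np.full(n, 3.0)
assert np.allclose(D @ flat, 0.0)

# ...while a sloping model is penalized.
sloping = np.arange(n, dtype=float)
assert not np.allclose(D @ sloping, 0.0)
```

A second difference of the same flavor would prefer the smoothest model.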

Using the regularized misfit function \phi= [{\bf d}-A{\bf m}]^t C_d^{-1}[{\bf d}-A{\bf m}] +\mu [{\bf m}- {\bf m}_0]^t C_{{\bf m}_0}^{-1}[{\bf m}- {\bf m}_0], the MLE estimate of {\bf m} can be found to be \hat{\bf m} = {\bf m}_0 + [A^t C_d^{-1} A + \mu C_{{\bf m}_0}^{-1} ]^{-1} [ A^t C_d^{-1}({\bf d} - A {\bf m}_0)]. The details of this derivation are shown in the appendix to this post.
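That estimator translates almost line-for-line into NumPy. A minimal sketch (the function name and sanity checks are mine; it assumes C_d and C_{{\bf m}_0} are invertible):

```python
import numpy as np

def ridge_mle(A, d, C_d, mu, m0, C_m0):
    """m_hat = m0 + [A^T Cd^-1 A + mu Cm0^-1]^-1 [A^T Cd^-1 (d - A m0)].

    A: design matrix; d: data vector; C_d: data covariance;
    mu: prior weight; m0, C_m0: prior mean and covariance.
    """
    Cd_inv = np.linalg.inv(C_d)
    Cm_inv = np.linalg.inv(C_m0)
    lhs = A.T @ Cd_inv @ A + mu * Cm_inv
    rhs = A.T @ Cd_inv @ (d - A @ m0)
    # Solve the linear system rather than forming the inverse explicitly.
    return m0 + np.linalg.solve(lhs, rhs)
```

With mu = 0 this reduces to ordinary generalized least squares; as mu grows, the estimate shrinks toward m0.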

So, to continue the discussion of matrix inversion that started with the last post: here the matrix that must be inverted is [A^t C_d^{-1} A + \mu C_{{\bf m}_0}^{-1} ]. Starting with the special case of the smallest model (C_{{\bf m}_0} = I_{n} and {\bf m}_0 = {\bf 0}), if {\bf u} is an eigenvector of A^t C_d^{-1} A with eigenvalue \lambda, then it is also an eigenvector of [A^t C_d^{-1} A + \mu I_n ], as [A^t C_d^{-1} A + \mu I_n ] {\bf u} = (\lambda + \mu) {\bf u}. The new eigenvalue (\lambda + \mu) highlights how this sort of regularization stabilizes the inversion; i.e., zero- and small-valued \lambdas are replaced with approximately \mu, so 1/(\lambda + \mu) is always stable! The large-valued \lambdas are not significantly changed by \mu.
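The eigenvalue shift is easy to check numerically. In this toy example (my own, with C_d = I), A is deliberately rank-deficient, so A^t A has a zero eigenvalue that the \mu I_n term lifts:

```python
import numpy as np

# Two identical columns make A rank deficient: one eigenvalue of A^T A is zero.
A = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
AtA = A.T @ A
mu = 0.1

lam = np.linalg.eigvalsh(AtA)                      # unregularized spectrum
lam_reg = np.linalg.eigvalsh(AtA + mu * np.eye(2)) # regularized spectrum

# Every eigenvalue is shifted by exactly mu, so the zero eigenvalue
# becomes mu and 1/(lambda + mu) is finite for all of them.
assert np.allclose(lam_reg, lam + mu)
```

The large eigenvalue (28 here) barely notices the 0.1 shift, while the zero eigenvalue is rescued entirely.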

The key problem with ridge regression is choosing the value of \mu. If you are working in a setting where the data uncertainty is known (or at least approximately known, i.e., you are guessing), then the easiest approach is to find the value of \mu such that the summed standardized squared misfit \sum ([d_i-d_p]/ \sigma_i)^2 = \chi^2 = n, where d_p is the model-predicted value and n is the number of observations. That is, the prior is used to degrade the fit until it has the expected misfit.
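A minimal sketch of finding that \mu numerically (function names are mine; it assumes the smallest-model prior {\bf m}_0 = {\bf 0}, C_{{\bf m}_0} = I, and a diagonal data covariance). Since \chi^2 grows monotonically with \mu, a bisection in log space works:

```python
import numpy as np

def chi2(mu, A, d, sigma):
    """Summed standardized squared misfit for the smallest-model prior."""
    Cd_inv = np.diag(1.0 / sigma**2)
    p = A.shape[1]
    m_hat = np.linalg.solve(A.T @ Cd_inv @ A + mu * np.eye(p),
                            A.T @ Cd_inv @ d)
    r = (d - A @ m_hat) / sigma
    return float(r @ r)

def discrepancy_mu(A, d, sigma, lo=1e-8, hi=1e8, iters=100):
    """Bisect (in log space) for the mu where chi2(mu) = n."""
    n = len(d)
    for _ in range(iters):
        mid = np.sqrt(lo * hi)  # geometric midpoint
        if chi2(mid, A, d, sigma) < n:
            lo = mid  # under-regularized: fit is still too good
        else:
            hi = mid  # over-regularized: misfit already exceeds n
    return np.sqrt(lo * hi)
```

This assumes the unregularized fit has \chi^2 < n and the fully damped fit has \chi^2 > n, so a crossing exists in the bracket.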


Figure 1: Blue line: standardized squared misfit vs. prior weight \mu. Black line: the theoretical misfit \chi^2 = n.

If you are in the real world (i.e., you are not willing to guess) and don’t know the data residual variance, then the best way to choose the value of \mu is to find the value where d \chi^2 / d \mu is maximized, i.e., the place where \mu is changing the misfit the most. The logic is that before this point the model is likely to be over-fitting, and after it there is diminishing value in the regularization.

Normally both these values of \mu are found numerically.
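For the second criterion, one simple numerical recipe (my sketch, again assuming the smallest-model prior and diagonal data covariance) is to evaluate \chi^2 over a log-spaced grid of \mu values and take a finite-difference derivative:

```python
import numpy as np

def chi2_curve(A, d, sigma, mus):
    """Chi-squared misfit evaluated across a grid of mu values."""
    Cd_inv = np.diag(1.0 / sigma**2)
    AtA = A.T @ Cd_inv @ A
    Atd = A.T @ Cd_inv @ d
    p = A.shape[1]
    out = []
    for mu in mus:
        m_hat = np.linalg.solve(AtA + mu * np.eye(p), Atd)
        r = (d - A @ m_hat) / sigma
        out.append(r @ r)
    return np.array(out)

# Synthetic example: pick the mu where the misfit is changing fastest.
rng = np.random.default_rng(2)
A = rng.normal(size=(40, 4))
d = A @ rng.normal(size=4) + rng.normal(scale=0.3, size=40)
sigma = np.full(40, 0.3)

mus = np.logspace(-4, 4, 200)
curve = chi2_curve(A, d, sigma, mus)
# np.gradient handles the non-uniform (log-spaced) grid of mu values.
dchi2 = np.gradient(curve, mus)
mu_best = mus[np.argmax(dchi2)]
```

In practice you would plot `curve` against `mus` (as in Figure 1) to sanity-check the chosen value.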

That is it for now! Tune in next week for the Bayesian methods of regularization (and some working code comparing the three approaches).