Regularized Regression: Part B) Ridge Regression.

Hi all! As with the last few posts, I am summarizing linear modeling and generalized linear models as used in business intelligence. The areas that I am covering are:

  1. Querying and cleaning data for modeling
  2. Plotting models with uncertainty
  3. Quick and dirty model selection
  4. Regularized regression
  5. L1 and L2 Norms

This post continues the discussion of regularized regression (i.e., singular value decomposition regression, ridge regression, general regularized regression, and Bayesian methods). In particular, it covers ridge regression.

Ridge regression (apparently called Tikhonov regularization by Wikipedia) is the use of prior information to stabilize the parameter estimates. The standard misfit function \phi=[{\bf d}-A{\bf m}]^t C_d^{-1}[{\bf d}-A{\bf m}] is augmented with the “prior misfit” term \mu [{\bf m}- {\bf m}_0]^t C_{{\bf m}_0}^{-1}[{\bf m}- {\bf m}_0]. The scalar \mu sets the weight of the prior information relative to the data information (there will be more on choosing it later). {\bf m}_0 and C_{{\bf m}_0} define the prior information; e.g., {\bf m}_0 could be {\bf 0} and C_{{\bf m}_0} = I_{n}, causing the prior to prefer the “smallest” model. It is worth spending some time thinking about why this is true and how you could get the flattest or smoothest model instead.
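As a hint for that last question, here is one common construction (my sketch, not from the post): build C_{{\bf m}_0}^{-1} from a first-difference operator D so that the prior penalizes model roughness rather than model size, leaving constant ("flat") models unpenalized.

```python
import numpy as np

n = 5
# First-difference operator: (D m)_i = m_{i+1} - m_i, shape (n-1, n).
D = np.diff(np.eye(n), axis=0)

# Using C_m0^{-1} = D^T D penalizes differences between adjacent model
# parameters, so a constant ("flat") model incurs zero prior misfit:
flat = np.full(n, 3.0)
assert np.allclose(D @ flat, 0.0)

# ...while a sloping model is penalized.
sloping = np.arange(n, dtype=float)
assert not np.allclose(D @ sloping, 0.0)
```

A second difference of the same flavor would prefer the smoothest model.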

Using the regularized misfit function \phi= [{\bf d}-A{\bf m}]^t C_d^{-1}[{\bf d}-A{\bf m}] +\mu [{\bf m}- {\bf m}_0]^t C_{{\bf m}_0}^{-1}[{\bf m}- {\bf m}_0], the MLE estimate of {\bf m} can be found to be \hat{\bf m} = {\bf m}_0 + [A^t C_d^{-1} A + \mu C_{{\bf m}_0}^{-1} ]^{-1} [ A^t C_d^{-1}({\bf d} - A {\bf m}_0)]. The details of this derivation are shown in the appendix to this post.
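That estimator translates almost line-for-line into NumPy. A minimal sketch (the function name and sanity checks are mine; it assumes C_d and C_{{\bf m}_0} are invertible):

```python
import numpy as np

def ridge_mle(A, d, C_d, mu, m0, C_m0):
    """m_hat = m0 + [A^T Cd^-1 A + mu Cm0^-1]^-1 [A^T Cd^-1 (d - A m0)].

    A: design matrix; d: data vector; C_d: data covariance;
    mu: prior weight; m0, C_m0: prior mean and covariance.
    """
    Cd_inv = np.linalg.inv(C_d)
    Cm_inv = np.linalg.inv(C_m0)
    lhs = A.T @ Cd_inv @ A + mu * Cm_inv
    rhs = A.T @ Cd_inv @ (d - A @ m0)
    # Solve the linear system rather than forming the inverse explicitly.
    return m0 + np.linalg.solve(lhs, rhs)
```

With mu = 0 this reduces to ordinary generalized least squares; as mu grows, the estimate shrinks toward m0.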

So, to continue the discussion of matrix inversion that started with the last post: here the matrix that must be inverted is [A^t C_d^{-1} A + \mu C_{{\bf m}_0}^{-1} ]. Starting with the special case of the smallest model (C_{{\bf m}_0} = I_{n} and {\bf m}_0 = {\bf 0}), if {\bf u} is an eigenvector of A^t C_d^{-1} A with eigenvalue \lambda, then it is also an eigenvector of [A^t C_d^{-1} A + \mu I_n ], as [A^t C_d^{-1} A + \mu I_n ] {\bf u} = (\lambda + \mu) {\bf u}. The new eigenvalue (\lambda + \mu) highlights how this sort of regularization stabilizes the inversion; i.e., zero- and small-valued \lambdas are replaced with approximately \mu, so 1/(\lambda + \mu) is always stable! The large-valued \lambdas are not significantly changed by \mu.
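The eigenvalue shift is easy to check numerically. In this toy example (my own, with C_d = I), A is deliberately rank-deficient, so A^t A has a zero eigenvalue that the \mu I_n term lifts:

```python
import numpy as np

# Two identical columns make A rank deficient: one eigenvalue of A^T A is zero.
A = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
AtA = A.T @ A
mu = 0.1

lam = np.linalg.eigvalsh(AtA)                      # unregularized spectrum
lam_reg = np.linalg.eigvalsh(AtA + mu * np.eye(2)) # regularized spectrum

# Every eigenvalue is shifted by exactly mu, so the zero eigenvalue
# becomes mu and 1/(lambda + mu) is finite for all of them.
assert np.allclose(lam_reg, lam + mu)
```

The large eigenvalue (28 here) barely notices the 0.1 shift, while the zero eigenvalue is rescued entirely.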

The key problem with ridge regression is choosing the value of \mu. If you are working in a setting where the data uncertainty is known (or at least approximately known, i.e., you are guessing), then the easiest approach is to find the value of \mu such that the summed standardized squared misfit \sum ([d_i-d_p]/ \sigma_i)^2 = \chi^2 = n, where d_p is the model-predicted value and n is the number of observations. That is, the prior is used to degrade the fit until it has the expected misfit.
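A minimal sketch of finding that \mu numerically (function names are mine; it assumes the smallest-model prior {\bf m}_0 = {\bf 0}, C_{{\bf m}_0} = I, and a diagonal data covariance). Since \chi^2 grows monotonically with \mu, a bisection in log space works:

```python
import numpy as np

def chi2(mu, A, d, sigma):
    """Summed standardized squared misfit for the smallest-model prior."""
    Cd_inv = np.diag(1.0 / sigma**2)
    p = A.shape[1]
    m_hat = np.linalg.solve(A.T @ Cd_inv @ A + mu * np.eye(p),
                            A.T @ Cd_inv @ d)
    r = (d - A @ m_hat) / sigma
    return float(r @ r)

def discrepancy_mu(A, d, sigma, lo=1e-8, hi=1e8, iters=100):
    """Bisect (in log space) for the mu where chi2(mu) = n."""
    n = len(d)
    for _ in range(iters):
        mid = np.sqrt(lo * hi)  # geometric midpoint
        if chi2(mid, A, d, sigma) < n:
            lo = mid  # under-regularized: fit is still too good
        else:
            hi = mid  # over-regularized: misfit already exceeds n
    return np.sqrt(lo * hi)
```

This assumes the unregularized fit has \chi^2 < n and the fully damped fit has \chi^2 > n, so a crossing exists in the bracket.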


Figure 1: Blue line: standardized squared misfit vs. prior weight \mu. Black line: the theoretical misfit \chi^2 = n.

If you are in the real world (i.e., you are not willing to guess) and don’t know the data residual variance, then the best way to choose the value of \mu is to find the value where d \chi^2 / d \mu is maximized, i.e., the place where \mu is changing the misfit the most. The logic is that before this point the model is likely to be over-fitting, and after it there is diminishing value in the regularization.

Normally both these values of \mu are found numerically.
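For the second criterion, one simple numerical recipe (my sketch, again assuming the smallest-model prior and diagonal data covariance) is to evaluate \chi^2 over a log-spaced grid of \mu values and take a finite-difference derivative:

```python
import numpy as np

def chi2_curve(A, d, sigma, mus):
    """Chi-squared misfit evaluated across a grid of mu values."""
    Cd_inv = np.diag(1.0 / sigma**2)
    AtA = A.T @ Cd_inv @ A
    Atd = A.T @ Cd_inv @ d
    p = A.shape[1]
    out = []
    for mu in mus:
        m_hat = np.linalg.solve(AtA + mu * np.eye(p), Atd)
        r = (d - A @ m_hat) / sigma
        out.append(r @ r)
    return np.array(out)

# Synthetic example: pick the mu where the misfit is changing fastest.
rng = np.random.default_rng(2)
A = rng.normal(size=(40, 4))
d = A @ rng.normal(size=4) + rng.normal(scale=0.3, size=40)
sigma = np.full(40, 0.3)

mus = np.logspace(-4, 4, 200)
curve = chi2_curve(A, d, sigma, mus)
# np.gradient handles the non-uniform (log-spaced) grid of mu values.
dchi2 = np.gradient(curve, mus)
mu_best = mus[np.argmax(dchi2)]
```

In practice you would plot `curve` against `mus` (as in Figure 1) to sanity-check the chosen value.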

That is it for now! Tune in next week for the Bayesian methods of regularization (and some working code comparing the three approaches).