RIDGE REGRESSION:


The objective (or loss) function that is minimized in OLS is:

\(\bf f(\beta) = (y - X\beta)^\prime(y - X\beta)\)

In ridge regression we minimize:

\(\bf (y - X\beta)^\prime(y - X\beta) + \lambda \beta^\prime \beta\)

What are the normal equations in this case?

\(\bf (X^\prime X + \lambda I)\beta = X^\prime y\)

Proof:

Let \(\nu\) be the square root of \(\lambda\), and construct the row-augmented \([(n + p) \times p]\) model matrix, where \(n\) is the number of rows (observations) and \(p\) the number of columns of \(\bf X\):

\[\bf X_{*} = \pmatrix{X \\ \nu I}\]

The \(\bf y\) vector is likewise augmented by a corresponding \(p\) zeros, giving \(\bf y_{*}\).

Now the cost function will be:

\(\bf (y_{*} - X_{*}\beta)^\prime(y_{*} - X_{*}\beta) = (y - X\beta)^\prime(y - X\beta) + \lambda \beta^\prime \beta \tag{1}\)

because the augmented rows contribute \(p\) additional terms of the form \((0 - \nu \beta_i)^2 = \lambda \beta_i^2\), which sum to \(\bf \lambda \beta^\prime \beta\).
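
As a brief illustration (a worked step added here, using only the symbols already defined), stacking the residuals makes the extra terms explicit:

\[\bf y_{*} - X_{*}\beta = \pmatrix{y \\ 0} - \pmatrix{X \\ \nu I}\beta = \pmatrix{y - X\beta \\ -\nu \beta}\]

The squared length of this stacked vector is \(\bf (y - X\beta)^\prime(y - X\beta) + \nu^2 \beta^\prime \beta\), and \(\nu^2 = \lambda\).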

The LHS of Eq. 1 is an ordinary least-squares objective in \(\bf X_{*}\) and \(\bf y_{*}\), so the normal equations are:

\(\bf (X_{*}^\prime X_{*})\beta = X_{*}^\prime y_{*}\tag{2}\)

Since \(\bf y_{*}\) is just \(\bf y\) with zeros appended, the RHS of Eq. 2 is \(\bf X_{*}^\prime y_{*} = X^\prime y\); on the LHS, \(\bf X_{*}^\prime X_{*} = X^\prime X + \nu^2 I\), and \(\nu^2 I=\lambda I\) results in:

\(\bf (X^\prime X + \lambda I)\beta = X^\prime y.\)
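
The equivalence can also be checked numerically. The following is a minimal sketch, assuming NumPy and randomly generated data; the names `X_aug`, `y_aug`, `beta_aug`, and `beta_ridge`, as well as the values of `n`, `p`, and `lam`, are illustrative choices and not part of the derivation above.

```python
import numpy as np

# Illustrative data (assumed): n observations, p predictors.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

lam = 2.5            # ridge penalty lambda (arbitrary value)
nu = np.sqrt(lam)    # nu is the square root of lambda

# Row-augmented model matrix (n + p) x p and response with p zeros appended.
X_aug = np.vstack([X, nu * np.eye(p)])
y_aug = np.concatenate([y, np.zeros(p)])

# Ordinary least squares on the augmented problem.
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

# Direct solution of the ridge normal equations (X'X + lambda I) beta = X'y.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# The two coefficient vectors should agree up to floating-point error.
print(np.allclose(beta_aug, beta_ridge))
```

Solving the augmented least-squares problem avoids forming \(\bf X^\prime X\) explicitly, which is one reason this formulation is sometimes preferred in practice.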


By adjoining \(\bf \nu I\) to \(\bf X\), thereby lengthening its columns, we place those column vectors in a larger space \(\mathbb R^{n+p}\) that includes \(p\) “imaginary”, mutually orthogonal directions. The first column of \(\bf X\) is given a small imaginary component of size \(\nu\), thereby lengthening it and moving it out of the space generated by the original \(p\) columns. The second, third, …, \(p^{\text{th}}\) columns are similarly lengthened and moved out of the original space by the same amount \(\nu\), but all in different new directions. Consequently, any collinearity present in the original columns is resolved as soon as \(\nu > 0\). Moreover, the larger \(\nu\) becomes, the more these new vectors approach the individual \(p\) imaginary directions: they become more and more nearly orthogonal. As a result, the normal equations become solvable for any \(\nu > 0\), and the solution rapidly becomes numerically stable as \(\nu\) increases from zero.
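
The stabilizing effect can be illustrated with a small numerical experiment. This is a sketch under assumed conditions: two nearly collinear columns are generated artificially, and the helper `augmented_condition_number` is a hypothetical name introduced here. It reports the condition number of the augmented matrix \(\bf X_{*}\) for a few values of \(\nu\).

```python
import numpy as np

def augmented_condition_number(X, nu):
    """Condition number of the row-augmented matrix [X; nu * I]."""
    p = X.shape[1]
    X_aug = np.vstack([X, nu * np.eye(p)])
    return np.linalg.cond(X_aug)

# Two nearly collinear columns (assumed example data).
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + 1e-8 * rng.normal(size=100)
X = np.column_stack([x1, x2])

nus = [0.0, 0.1, 1.0, 10.0]
conds = [augmented_condition_number(X, nu) for nu in nus]

for nu, c in zip(nus, conds):
    print(f"nu = {nu:5.1f}   cond(X_aug) = {c:.3e}")
```

With \(\nu = 0\) the condition number is very large because the columns are almost parallel; it falls sharply for even a small positive \(\nu\) and keeps shrinking toward 1 as \(\nu\) grows.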

