### RIDGE REGRESSION:

The objective (or loss) function that is minimized in OLS is:

$$\bf f(\beta) = (y - X\beta)^\prime(y - X\beta)$$

In ridge regression we minimize:

$$\bf (y - X\beta)^\prime(y - X\beta) + \lambda \beta^\prime \beta$$

What are the normal equations in this case?

$$\bf (X^\prime X + \lambda I)\beta = X^\prime y$$

Proof:

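From this entry, let $$\nu = \sqrt{\lambda}$$ and construct the row-augmented model matrix by stacking the $$[p \times p]$$ matrix $$\bf \nu I$$ beneath $$\bf X$$, with $$p$$ being the number of columns of $$\bf X$$. If $$\bf X$$ has $$n$$ rows, the augmented matrix is $$[(n + p) \times p]$$: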

$$\bf X_{*} = \pmatrix{X \\ \nu I}$$

Correspondingly, the $$\bf y$$ vector is augmented with $$p$$ trailing zeros to form $$\bf y_{*}$$.

Now the cost function will be:

$$\bf (y_{*} - X_{*}\beta)^\prime(y_{*} - X_{*}\beta) = (y - X\beta)^\prime(y - X\beta) + \lambda \beta^\prime \beta \tag{1}$$

because the $$p$$ appended rows contribute $$p$$ additional terms of the form $$(0 - \nu \beta_i)^2 = \lambda \beta_i^2$$, which sum to $$\bf \lambda \beta^\prime \beta$$.

Inspecting the LHS of Eq.1, the normal equations are:

$$\bf (X_{*}^\prime X_{*})\beta = X_{*}^\prime y_{*}\tag{2}$$

Since $$\bf y_{*}$$ just has zeros appended to the end, the RHS of Eq.2 is $$\bf X_{*}^\prime y_{*} = X^\prime y + \nu I \, 0 = X^\prime y$$. On the LHS, $$\bf X_{*}^\prime X_{*} = X^\prime X + \nu^2 I = X^\prime X + \lambda I$$, which results in:

$$\bf (X^\prime X + \lambda I)\beta = X^\prime y.$$
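The equivalence just derived can be checked numerically. The following is a minimal sketch, assuming NumPy and a small simulated data set; the function names `ridge_normal_equations` and `ridge_augmented`, the data, and the value of `lam` are illustrative choices, not part of the derivation above. It solves the ridge normal equations directly and then solves ordinary least squares on the augmented pair $$\bf (X_{*}, y_{*})$$.

```python
import numpy as np

def ridge_normal_equations(X, y, lam):
    """Solve (X'X + lam * I) beta = X'y directly."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def ridge_augmented(X, y, lam):
    """Solve plain least squares on the row-augmented system (X_*, y_*)."""
    p = X.shape[1]
    nu = np.sqrt(lam)
    X_star = np.vstack([X, nu * np.eye(p)])    # nu * I stacked beneath X
    y_star = np.concatenate([y, np.zeros(p)])  # y followed by p zeros
    beta, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
    return beta

# Simulated data (an assumption made purely for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
lam = 2.0

beta_ne = ridge_normal_equations(X, y, lam)
beta_aug = ridge_augmented(X, y, lam)

print(beta_ne)
print(beta_aug)
print(np.allclose(beta_ne, beta_aug))
```

Both routes should return the same coefficient vector up to floating-point error, which is what the final `np.allclose` call compares.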

By adjoining $$\bf \nu I$$ to $$\bf X$$, thereby lengthening its columns, we are placing the vectors in a larger space $$\mathbb R^{n+p}$$ by including $$p$$ “imaginary”, mutually orthogonal directions. The first column of $$\bf X$$ is given a small imaginary component of size $$\nu$$, thereby lengthening it and moving it out of the space generated by the original $$p$$ columns. The second, third, …, $$p^{\text{th}}$$ columns are similarly lengthened and moved out of the original space by the same amount $$\nu$$, but all in different new directions. Consequently, for any $$\nu > 0$$ the augmented columns are linearly independent, so any collinearity present in the original columns is resolved. Moreover, the larger $$\nu$$ becomes, the more these new vectors approach the individual $$p$$ imaginary directions: they become closer and closer to mutually orthogonal (and, after rescaling by $$1/\nu$$, to orthonormal). As a result, the normal equations become solvable as soon as $$\nu > 0$$, and their solution rapidly becomes more numerically stable as $$\nu$$ increases from zero.
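This geometric picture can be illustrated with a small experiment. The sketch below assumes NumPy and a made-up design matrix whose third column is exactly the sum of the first two; the helper names `augmented_gram_condition` and `max_abs_column_cosine` are hypothetical, introduced only for this illustration. For a few values of $$\nu$$ it reports the condition number of $$\bf X_{*}^\prime X_{*}$$ and the largest absolute cosine between distinct augmented columns.

```python
import numpy as np

# Made-up design matrix with an exactly collinear third column (illustrative only).
rng = np.random.default_rng(1)
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x1, x2, x1 + x2])

def augment(X, nu):
    """Stack nu * I beneath X to form the augmented matrix X_*."""
    p = X.shape[1]
    return np.vstack([X, nu * np.eye(p)])

def augmented_gram_condition(X, nu):
    """Condition number of X_*' X_* = X'X + nu^2 I."""
    X_star = augment(X, nu)
    return np.linalg.cond(X_star.T @ X_star)

def max_abs_column_cosine(X, nu):
    """Largest |cosine| between two distinct columns of X_*."""
    X_star = augment(X, nu)
    U = X_star / np.linalg.norm(X_star, axis=0)  # unit-length columns
    C = U.T @ U                                  # matrix of pairwise cosines
    return np.max(np.abs(C - np.eye(C.shape[0])))

nus = [0.1, 1.0, 10.0]
conds = [augmented_gram_condition(X, nu) for nu in nus]
cosines = [max_abs_column_cosine(X, nu) for nu in nus]

print("rank of X:", np.linalg.matrix_rank(X))
for nu, c, s in zip(nus, conds, cosines):
    print(f"nu={nu:5.1f}  cond={c:12.3f}  max |cos|={s:.4f}")
```

Under these assumptions, the original matrix is rank-deficient, while both the condition number and the largest cosine shrink as $$\nu$$ grows, consistent with the columns becoming closer to orthogonal.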