NOTES ON STATISTICS, PROBABILITY and MATHEMATICS


Regression Optimization:


An explicit cost function is not strictly necessary for OLS, which has a closed-form solution, but the cost-function view becomes central once regularization (e.g. a ridge or lasso penalty) is added.

The cost function is generally expressed as:

\[J(\hat \beta)= (y - \mathbf{X}\hat \beta)^\top(y- \mathbf{X} \hat \beta)= \displaystyle \sum_{i=1}^n (y_i - x_i^\top\hat \beta)^2= \sum_{i=1}^n(y_i - \hat y_i)^2\]
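
A minimal numerical sketch of this cost function (assuming Python with numpy; the design matrix \(\mathbf{X}\), response \(y\), and candidate \(\hat\beta\) below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up design matrix (with an intercept column) and response
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

beta_hat = np.array([0.9, 1.8, -0.4])  # an arbitrary candidate estimate

# Matrix form: (y - X beta)^T (y - X beta)
residuals = y - X @ beta_hat
J_matrix = residuals @ residuals

# Sum-of-squares form: sum_i (y_i - x_i^T beta)^2
J_sum = np.sum((y - X @ beta_hat) ** 2)

print(J_matrix, J_sum)  # identical up to floating point
```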

Expanding the quadratic in matrix notation:

\[J(\hat \beta)= (y - \mathbf{X}\hat \beta)^\top (y- \mathbf{X} \hat \beta)= y^\top y + \color{red}{\hat \beta^\top\,\mathbf{X}^\top \mathbf{X}\,\hat \beta} - 2y^\top \mathbf{X}\hat \beta\]

The term in red is a quadratic form in \(\hat \beta\); the matrix \(\mathbf{X}^\top \mathbf{X}\) is positive semidefinite (and positive definite when \(\mathbf{X}\) has full column rank, satisfying the requirement \(x^\top A x>0\) for all \(x \neq 0\)), which makes the cost function convex, so the stationary point found below is a minimum. All three terms are scalars; the two cross terms \(y^\top \mathbf{X}\hat\beta\) and \(\hat\beta^\top \mathbf{X}^\top y\) are equal (each is the transpose of the other), which is why they combine into \(-2y^\top \mathbf{X}\hat\beta\).
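
A quick numerical check of that claim (a sketch with a made-up \(\mathbf{X}\)): the eigenvalues of \(\mathbf{X}^\top\mathbf{X}\) are nonnegative, and the quadratic form is never negative for any vector:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])

XtX = X.T @ X
print(np.linalg.eigvalsh(XtX))  # all >= 0 (here > 0: X has full column rank)

# The quadratic form b^T (X^T X) b is nonnegative for any b
for _ in range(1000):
    b = rng.normal(size=3)
    assert b @ XtX @ b >= -1e-12  # allow tiny floating-point error
```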

To differentiate the cost function and find its minimum we need two results from matrix calculus:

\(\frac{\partial {\bf A}\hat \beta}{\partial \hat \beta}={\bf A}^\top\) (the derivative of a linear form with respect to a vector, in denominator layout); and \(\frac{\partial \hat \beta^\top{\bf A}\hat \beta}{\partial \hat \beta}= ({\bf A}+{\bf A}^\top)\hat \beta = 2{\bf A}\hat \beta\) for symmetric \({\bf A}\) (the derivative of a quadratic form with respect to a vector), which applies here because \(\mathbf{X}^\top \mathbf{X}\) is symmetric. Both identities are spot-checked numerically below.
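
Both identities can be verified with finite differences (a sketch with a small made-up symmetric matrix; for the first rule, the scalar case \(\frac{\partial a^\top \hat\beta}{\partial \hat\beta} = a\) is checked, which is the form in which it is used in the next step):

```python
import numpy as np

def num_grad(f, beta, eps=1e-6):
    """Central-difference approximation of the gradient of a scalar function."""
    grad = np.zeros_like(beta)
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = eps
        grad[j] = (f(beta + step) - f(beta - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
A = A.T @ A                      # symmetric, like X^T X
a = rng.normal(size=4)
beta = rng.normal(size=4)

# d(beta^T A beta)/d beta = (A + A^T) beta = 2 A beta for symmetric A
print(np.allclose(num_grad(lambda b: b @ A @ b, beta), 2 * A @ beta))  # True

# d(a^T beta)/d beta = a
print(np.allclose(num_grad(lambda b: a @ b, beta), a))                 # True
```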

Applying these rules to the expanded cost function:

\[\frac{\partial J(\hat \beta)}{\partial \hat \beta}=\frac{\partial}{\partial\hat \beta}\left[y^\top y + \color{red}{\hat \beta^\top \,\mathbf{X}^\top \mathbf{X}\,\hat \beta} - 2y^\top \mathbf{X}\hat \beta \right]= 0 + 2 \color{red}{\mathbf{X}^\top \mathbf{X}\,\hat \beta}-2\mathbf{X}^\top y\]

Setting this derivative equal to zero yields the normal equations:

\[2\mathbf{X}^\top \mathbf{X}\hat \beta = 2\mathbf{X}^\top y\]

\[\hat \beta = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top y\]

provided \(\mathbf{X}^\top \mathbf{X}\) is invertible, which holds when \(\mathbf{X}\) has full column rank.
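
A sketch tying the derivation together (made-up data again, numpy assumed): the closed-form estimate matches numpy's least-squares solver, and the gradient \(2\mathbf{X}^\top\mathbf{X}\hat\beta - 2\mathbf{X}^\top y\) is numerically zero at that estimate:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# Closed form: beta_hat = (X^T X)^{-1} X^T y,
# computed by solving the normal equations rather than forming the inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference: numpy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))            # True

# The gradient of the cost is (numerically) zero at the minimizer
gradient = 2 * X.T @ X @ beta_hat - 2 * X.T @ y
print(np.allclose(gradient, 0, atol=1e-8))          # True
```

Solving the normal equations with np.linalg.solve (or using a QR/SVD-based solver such as np.linalg.lstsq) is numerically preferable to computing \((\mathbf{X}^\top \mathbf{X})^{-1}\) explicitly, even though the algebra is the same.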


NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.