### REGRESSION OPTIMIZATION:

The cost function is not necessary in OLS, but it comes into play when using regularization.

The cost function would be generally expressed as:

$J(\hat \beta)= (y - \mathbf{X}\hat \beta)^\top(y- \mathbf{X} \hat \beta)= \displaystyle \sum_{i=1}^n (y_i - x_i^\top\hat \beta)^2= \sum_{i=1}^n(y_i - \hat y_i)^2$

Expanding the quadratic in matrix notation:

$J(\hat \beta)= (y - \bf X\hat \beta)^\top (y- \bf X \hat \beta)= y^\top y + \color{red}{\hat \beta^\top\,X^\top X\,\hat \beta} - 2y^\top X\hat \beta$

The term in red is a positive semidefinite matrix. A positive definite matrix fulfills the requirement, $$x^\top Ax>0$$. The other two terms are scalars.

To differentiate the cost function to obtain a minimum we need two pieces of information:

$$\frac{\partial {\bf A}\hat \beta}{\partial \hat \beta}={\bf A}^\top$$ (the derivative of a matrix with respect to a vector); and $$\frac{\partial \hat \beta^\top{\bf A}\hat \beta}{\partial \hat \beta}= 2{\bf A}^\top \hat \beta$$ (derivative of a quadratic form with respect to a vector).

$\frac{\partial J(\hat \beta)}{\partial \hat \beta}=\frac{\partial}{\partial\hat \beta}\left[y^\top y + \color{red}{\hat \beta^\top \,X^\top X\,\hat \beta} - 2y^\top X\hat \beta \right]=0 +2 \color{red}{X^\top X\,\hat \beta}-2X^\top y$

which gives:

$2X^\top X\hat \beta = 2X^\top y$