Proof that the variance of the estimated coefficients is \(\operatorname{var}(\hat \beta) = \sigma^2 \left(X^\top X \right)^{-1},\) where the estimate of \(\sigma^2\) depends on the number of estimated parameters, \(p,\) and the number of observations, \(n\): \(\text{SSE} = e^\top e\) has \(\text{df}_E = n - p.\) The estimate of \(\sigma^2\) is

\[s^2 = \frac{e^\top e}{n-p}= \frac{(Y - Xb)^\top(Y - Xb)}{n-p}= \frac{\text{SSE}} {\text{df}_E}= \text{MSE}\]

\[s = \sqrt{s^2}= \text{Root MSE}\]
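As a quick numerical illustration of these definitions (a Python sketch with made-up data; the dimensions, seed, and true \(\beta\) are arbitrary assumptions, not from the source):

```python
import numpy as np

# Fit y = Xb by least squares and estimate sigma^2 as s^2 = e'e / (n - p).
rng = np.random.default_rng(0)
n, p = 50, 3                                  # n observations, p parameters (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=1.5, size=n)  # true sigma = 1.5

b, *_ = np.linalg.lstsq(X, y, rcond=None)     # OLS estimate of beta
e = y - X @ b                                 # residuals
s2 = e @ e / (n - p)                          # MSE = SSE / df_E
root_mse = np.sqrt(s2)                        # Root MSE = estimate of sigma
print(s2, root_mse)
```

The residuals are orthogonal to the column space of \(X\) (\(X^\top e = 0\)), which is why \(n - p\) rather than \(n\) appears in the denominator: \(p\) degrees of freedom are used up by the fit.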

From this video and this post:

We know that \(\hat \beta= \left(X^\top X \right)^{-1} X^\top Y.\)

And we want to calculate the variance of the parameter estimate:

\(\begin{align}
\text{var}\left(\hat \beta\right) &= \text{var}\left[\left(X^\top X \right)^{-1} X^\top y\right]\\
&\underset{\text{var}(Ay)=A\,\text{var}(y)\,A^\top}{=}\quad\left[\left(X^\top X \right)^{-1} X^\top \right] \color{blue}{\text{var}(y)} \left[\left(X^\top X \right)^{-1} X^\top \right]^\top\\
&\underset{*}{=}\quad\left[\left(X^\top X \right)^{-1} X^\top \right] {\bf I}\sigma^2 \left[\left(X^\top X \right)^{-1} X^\top \right]^\top\\
&=\sigma^2\left[\left(X^\top X \right)^{-1} X^\top \right]\left[\left(X^\top X \right)^{-1} X^\top \right]^\top\\
&\underset{(AB)^\top = B^\top A^\top}{=}\quad\sigma^2\left[\left(X^\top X \right)^{-1} X^\top \right]\left[\left(X^\top\right)^\top \left(\left(X^\top X \right)^{-1}\right)^\top \right]\\
&\underset{(A^{-1})^\top = (A^\top)^{-1}}{=}\quad\sigma^2\left[\left(X^\top X \right)^{-1} X^\top \right]\left[X \left(\left(X^\top X \right)^\top\right)^{-1} \right]\\
&\underset{X^\top X \text{ is symmetric}}{=}\quad\sigma^2\left[\left(X^\top X \right)^{-1} X^\top \right]\left[X \left(X^\top X \right)^{-1} \right]\\
&\underset{\text{regrouping}}{=}\quad\sigma^2\left(X^\top X \right)^{-1} \left(X^\top X\right) \left(X^\top X \right)^{-1}\\
&=\bbox[yellow, 2px]{\sigma^2 \left(X^\top X\right)^{-1}}
\end{align}\)
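The result can be sanity-checked by Monte Carlo (a Python sketch under assumed toy values for \(X\), \(\beta\), and \(\sigma\)): simulate many draws of \(y = X\beta + \varepsilon\) with \(X\) held fixed, and compare the empirical covariance of \(\hat\beta\) with \(\sigma^2(X^\top X)^{-1}\).

```python
import numpy as np

# Fixed design matrix, true parameters (arbitrary choices for illustration).
rng = np.random.default_rng(1)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta, sigma = np.array([0.5, 2.0]), 1.0

XtX_inv = np.linalg.inv(X.T @ X)
theory = sigma**2 * XtX_inv              # var(beta_hat) from the derivation

# Many realizations of beta_hat = (X'X)^{-1} X' y, with fresh noise each time.
draws = np.array([
    XtX_inv @ X.T @ (X @ beta + rng.normal(scale=sigma, size=n))
    for _ in range(20000)
])
empirical = np.cov(draws.T)              # empirical covariance of beta_hat
print(np.abs(empirical - theory).max())  # should be close to zero
```

The empirical covariance matrix converges to \(\sigma^2(X^\top X)^{-1}\) as the number of simulated draws grows.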

\(*\) because:

\[\text{var}(y)= \text{var}\left(X\beta+\varepsilon\right)\]

and since \(X\beta\) is fixed, \(\text{var}(y)= \text{var}\left(X\beta+\varepsilon\right)=0+\text{var}(\varepsilon\mid X)=\sigma^2 \bf I,\) since \(\varepsilon \sim N(0, \sigma^2 \bf I).\)

This latter part comes from the Gauss-Markov assumption of spherical errors.

In the case of a single regressor (and intercept), \(Y = \beta_0 + \beta_1 x + \varepsilon\), again with the assumption that \(Y \sim N(\mu = \beta_0 + \beta_1 x,\; \sigma^2):\)

\[\operatorname{var}(\hat \beta):=\sigma^2(\hat \beta)=\begin{bmatrix}\operatorname{var}(\hat \beta_0) &\operatorname{cov}(\hat \beta_0, \hat \beta_1)\\\operatorname{cov}(\hat \beta_0, \hat \beta_1)& \operatorname{var}(\hat \beta_1)\end{bmatrix}\]

and having derived \(\operatorname{var}(\hat \beta) =\sigma^2 \left(X^\top X\right)^{-1},\) and knowing that

\[\left(X^\top X\right)^{-1}=\begin{bmatrix}\frac{\sum x_i^2}{n\sum(x_i - \bar x)^2} & \frac{-\sum x_i}{n\sum(x_i - \bar x)^2}\\\frac{-\sum x_i}{n\sum(x_i - \bar x)^2}& \frac{1}{\sum(x_i - \bar x)^2}\end{bmatrix}\]

\[\operatorname{var}(\hat \beta) =\sigma^2 \left(X^\top X\right)^{-1}=\begin{bmatrix}\frac{\sigma^2\sum x_i^2}{n\sum(x_i - \bar x)^2} & \frac{-\sigma^2\sum x_i}{n\sum(x_i - \bar x)^2}\\\frac{-\sigma^2\sum x_i}{n\sum(x_i - \bar x)^2}& \frac{\sigma^2}{\sum(x_i - \bar x)^2}\end{bmatrix} \]
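These elementwise formulas can be verified numerically against the matrix expression (a Python sketch; the sample size, \(\sigma^2\), and regressor values are made up for illustration):

```python
import numpy as np

# Compare sigma^2 (X'X)^{-1} with the closed-form entries for simple regression.
rng = np.random.default_rng(2)
n, sigma2 = 30, 2.0
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])     # design matrix: intercept + single regressor

V = sigma2 * np.linalg.inv(X.T @ X)      # matrix formula

Sxx = np.sum((x - x.mean()) ** 2)        # sum of squared deviations of x
var_b0 = sigma2 * np.sum(x**2) / (n * Sxx)
cov_b0b1 = -sigma2 * np.sum(x) / (n * Sxx)
var_b1 = sigma2 / Sxx
print(V)
```

Each entry of \(V\) matches the corresponding closed-form expression exactly (up to floating-point error), since the \(2\times 2\) inverse can be written out in terms of \(\sum x_i,\) \(\sum x_i^2,\) and \(\sum(x_i-\bar x)^2.\)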

This is my attempt at answering this question.

The question is what happens to the **variance** of a parameter estimate (\(\hat \beta_i\)) with the introduction of a new *control variable* in a regression model? And secondarily, how would the statistical testing of \(\hat \beta_i\) be affected?

The answer is that **the variance of the estimated parameter
will increase**; and **the \(p\)-value will increase**.

Your question revolves around the problem of overfitting and the bias-variance trade-off.

The more regressors you add, the lower the \(\text{RSS}\) (residual sum of squares) and the higher the \(R^2\). In fact, if you just add columns of pure noise (`rnorm()`) to your data and run OLS regressions after every addition, you will see a marked, monotonic decrease in the \(\text{RSS}\). I tested this with an example:
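A Python analogue of that experiment (the original used R's `rnorm()`; the sample size, true model, and number of noise columns here are arbitrary choices):

```python
import numpy as np

# Start from a true model y = 1 + 2x + noise, then append pure-noise
# regressors one at a time and record the RSS after each addition.
rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

def rss(X, y):
    """Residual sum of squares of the OLS fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return e @ e

history = [rss(X, y)]
for _ in range(10):
    X = np.column_stack([X, rng.normal(size=n)])  # add a pure-noise column
    history.append(rss(X, y))
print(history)  # non-increasing, even though the new columns are pure noise
```

Adding a column can never increase the RSS, because the old fit is still available in the enlarged column space; with continuous noise columns the decrease is strict almost surely.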

The idea is that in OLS the hat matrix \(X (X^TX)^{-1}X^T\) is a projection matrix that projects the vector of observed \(y\) values onto the column space of the model matrix. The higher the dimension of the column space (the more vectors available to form a basis), the closer \(\hat y\) will be to \(y\). But at a price!
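The projection property is easy to verify numerically (a Python sketch with an arbitrary small design matrix):

```python
import numpy as np

# The hat matrix H = X (X'X)^{-1} X' projects y onto the column space of X:
# H y equals the fitted values, H is symmetric and idempotent (H @ H == H),
# and its trace equals the number of columns p (the dimension projected onto).
rng = np.random.default_rng(4)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # p = 3 columns
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
print(np.abs(H @ y - y_hat).max())  # H y and the fitted values coincide
```

The trace of \(H\) being \(p\) is one way to see why each added regressor "uses up" a degree of freedom.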

The estimation of the variance of \(\hat \beta_i\) is given by:

\(Var[\hat\beta_i]\) is the \(i\)-th diagonal entry of \(\sigma^2(X^TX)^{-1},\) with \(\sigma^2\) corresponding to the variation of the observations around the predicted values (\(Var[\epsilon|X] = \sigma^2I_n\)). The estimation of \(\sigma^2\) from the sample is \(s^2=\frac{e^Te}{n-p},\) or \(\text{MSE}\) (mean squared error). The denominator is the number of observations, \(n,\) minus the number of parameters, \(p,\) counting the intercept. It is also referred to as the *error* or *residual degrees of freedom* (\(\small\text{no. observations} - \text{no. independent variables} - 1\)). Alternatively, the formula for the \(\sigma^2\) estimate can be expressed as \(\text{MSE = RSS/df},\) with \(\text{RSS}\) being the same as \(\text{SSE}\) (sum of squared errors), \(\sum_1^n(y_i - \hat y_i)^2.\)

Therefore the estimated covariance matrix of \(\hat\beta\) is \(s^2(X^TX)^{-1}\) or \(\text{MSE}\times(X^TX)^{-1},\) with the estimated variance of \(\hat\beta_i\) being its \(i\)-th diagonal entry.

And I think this may be a source of confusion: just because adding another regressor decreases the \(\text{MSE}\) by decreasing the \(\text{RSS}\) (if it does so at all, given the change in the degrees of freedom in the denominator), it cannot be said that the variance of the estimated parameter decreases.

In a parallel OLS simulation (here), I found, quite anecdotally, a \(\small 3.5\%\) increase in \(\text{MSE}\) when adding an extra regressor, with a minimal decrease in \(\text{RSS}\).

What drove the increase in the variance of the estimate of \(\hat\beta_i\) turned out to be the corresponding entry in the matrix of cofactors used to compute the inverse of \(X^TX\), which explained a variance for the estimate of the parameter \(\hat\beta_i\) \(2.8\) times higher in the presence of the control variable.

An alternative formula for the variance (not applicable to the intercept) is:

\(\large \text{var}[\hat\beta_i]= \frac{\sigma^2}{n \times \text{var}[X_i]\times(1\,-\,R_i^2)}\). The key here is that \(R_i^2\) is the \(R^2\) of the regression of \(X_i\) (the regressor corresponding to the parameter \(\hat\beta_i\)) on all the other regressors. Therefore, the better the regression model \(X_i \sim \text{control variable}\) fits, the higher the estimated \(\text{var}[\hat\beta_i].\)
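This identity (the variance-inflation-factor decomposition) can be checked numerically against the matrix formula (a Python sketch; the correlated regressors here are constructed for illustration, not taken from the source):

```python
import numpy as np

# Check var(b_1) = sigma^2 / (n * var(x_1) * (1 - R_1^2)), where R_1^2 is the
# R-squared of regressing x_1 on the remaining regressors (incl. intercept).
rng = np.random.default_rng(5)
n, sigma2 = 200, 1.0
z = rng.normal(size=n)
x1 = z + rng.normal(scale=0.5, size=n)   # x1 and x2 correlated through z,
x2 = z + rng.normal(scale=0.5, size=n)   # mimicking a collinear control variable
X = np.column_stack([np.ones(n), x1, x2])

# Exact variance of b_1 from the matrix formula (entry [1, 1]):
v_matrix = (sigma2 * np.linalg.inv(X.T @ X))[1, 1]

# R_1^2 from regressing x1 on an intercept and x2:
Z = np.column_stack([np.ones(n), x2])
g, *_ = np.linalg.lstsq(Z, x1, rcond=None)
resid = x1 - Z @ g
R2 = 1 - (resid @ resid) / np.sum((x1 - x1.mean()) ** 2)

# Alternative formula; np.var uses the 1/n divisor, so n * var(x1) = sum of
# squared deviations of x1.
v_formula = sigma2 / (n * np.var(x1) * (1 - R2))
print(v_matrix, v_formula)  # the two expressions agree
```

The stronger the collinearity between \(X_i\) and the other regressors, the closer \(R_i^2\) gets to \(1\) and the larger the variance of \(\hat\beta_i\) becomes.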

Naturally, the higher the variance (and hence the standard error), the broader the confidence intervals, and the lower the \(t\) statistic, resulting in a higher \(p\)-value.

**NOTE: These are tentative notes on different topics for
personal use - expect mistakes and misunderstandings.**