[First written for a post on CrossValidated.]

Let me put some color into the idea that OLS with categorical (*dummy-coded*) regressors is equivalent to the *factors* in ANOVA. In both cases there are *levels* (or *groups* in the case of ANOVA).

In OLS regression it is common to also include continuous variables among the regressors. These modify the relationship in the fitted model between the categorical variables and the dependent variable (DV), but not to the point of making the parallel unrecognizable.

Based on the `mtcars` data set, we can first visualize the model `lm(mpg ~ wt + as.factor(cyl), data = mtcars)` as a slope determined by the continuous variable `wt` (weight), plus different intercepts projecting the effect of the categorical variable `cyl` (four, six, or eight cylinders). It is this last part that forms a parallel with a one-way ANOVA.
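Before turning to the plots, it helps to fit this model and read off the coefficients: one shared slope for `wt` and one intercept shift per cylinder level (a minimal sketch; `fit_ols` is a name of my choosing):

```
# One shared slope for weight, separate intercepts per cylinder level
fit_ols <- lm(mpg ~ wt + as.factor(cyl), data = mtcars)
coef(fit_ols)
# (Intercept) is the cyl == 4 baseline; as.factor(cyl)6 and
# as.factor(cyl)8 shift that intercept for the other two levels.
```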

Let’s see it graphically on the plot to the right (the three plots to the left are included for side-to-side comparison with the ANOVA model discussed immediately afterwards):

Each cylinder engine is color-coded, and the distance between the fitted lines with different intercepts and the data cloud is the equivalent of within-group variation in an ANOVA (not mathematically identical, though, due to the effect of `weight`). The slope of the lines is the coefficient for the continuous variable `weight`.

If you try to suppress the effect of `weight` by mentally straightening these lines and returning them to the horizontal, you'll end up with the ANOVA plot of the model `aov(mtcars$mpg ~ as.factor(mtcars$cyl))` on the three subplots to the left. The `weight` regressor is now out, but the relationship of the points to the different intercepts is roughly preserved - we are simply rotating counter-clockwise and spreading out the previously overlapping plots for each level (again, only as a visual device to “see” the connection; not as a mathematical equality, since we are comparing two different models!).

Each level in the factor `cylinder` is separated, and the vertical lines represent the residuals or within-group error: the distance from each point in the cloud to the mean for its level (the appropriately color-coded horizontal line). The color gradient gives us an indication of how much the levels contribute to the model: the more clustered the data points are around their group means, the more likely the ANOVA model will be statistically significant. The horizontal black line around \(\small 20\) in all the plots is the grand mean across all groups. The numbers on the \(x\)-axis are simply placeholder identifiers for each point within each level, and serve no purpose other than to separate points along the horizontal line, allowing a plotting display different from boxplots.
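Those group means and the grand mean can be computed directly (a minimal sketch; the variable names are my own):

```
grand_mean  <- mean(mtcars$mpg)                      # the black line near 20
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)  # one colored line per level
grand_mean
group_means
```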

And it is through the sum of the squares of these vertical segments that we can manually calculate the residual sum of squares:

```
mu_mpg <- mean(mtcars$mpg)          # Grand mean of mpg
TSS <- sum((mtcars$mpg - mu_mpg)^2) # Total sum of squares
# Within-group (residual) sum of squares, level by level:
SumSq <- sum((mtcars[mtcars$cyl == 4, "mpg"] - mean(mtcars[mtcars$cyl == 4, "mpg"]))^2) +
         sum((mtcars[mtcars$cyl == 6, "mpg"] - mean(mtcars[mtcars$cyl == 6, "mpg"]))^2) +
         sum((mtcars[mtcars$cyl == 8, "mpg"] - mean(mtcars[mtcars$cyl == 8, "mpg"]))^2)
```
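The same within-group sum of squares can be computed more compactly by looping over the levels with `tapply` (a sketch; `SumSq_compact` is a name of my choosing):

```
# Squared deviations of each observation from its own group mean, summed
SumSq_compact <- sum(tapply(mtcars$mpg, mtcars$cyl,
                            function(g) sum((g - mean(g))^2)))
SumSq_compact
```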

The result: `SumSq = 301.2626` and `TSS - SumSq = 824.7846`. Compare to:

```
Call:
   aov(formula = mtcars$mpg ~ as.factor(mtcars$cyl))

Terms:
                as.factor(mtcars$cyl) Residuals
Sum of Squares               824.7846  301.2626
Deg. of Freedom                     2        29
```

Exactly the same result as testing with an ANOVA the linear model with only the categorical `cylinder` as regressor:

```
fit <- lm(mpg ~ as.factor(cyl), data = mtcars)
anova(fit)

Analysis of Variance Table

Response: mpg
               Df Sum Sq Mean Sq F value    Pr(>F)
as.factor(cyl)  2 824.78  412.39  39.697 4.979e-09 ***
Residuals      29 301.26   10.39
```

What we see, then, is that the residuals - the part of the total variance not explained by the model - as well as the explained variance are the same whether you call an OLS of the type `lm(DV ~ factors)` or an ANOVA (`aov(DV ~ factors)`): when we strip the model of continuous variables we end up with an identical system. Similarly, when we evaluate the models globally, as an omnibus ANOVA (not level by level), we naturally get the same \(p\)-value: `F-statistic: 39.7 on 2 and 29 DF, p-value: 4.979e-09`.
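A direct check of this equivalence (a minimal sketch; `fit_lm` and `fit_aov` are names of my choosing):

```
fit_lm  <- lm(mpg ~ as.factor(cyl), data = mtcars)
fit_aov <- aov(mpg ~ as.factor(cyl), data = mtcars)
# Both routes yield the same omnibus F statistic
anova(fit_lm)$"F value"[1]
summary(fit_aov)[[1]]$"F value"[1]
```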

This is not to imply that the testing of individual levels is going to yield identical \(p\)-values. In the case of OLS, we can invoke `summary(fit)` and get:

```
lm(formula = mpg ~ as.factor(cyl), data = mtcars)

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      26.6636     0.9718  27.437  < 2e-16 ***
as.factor(cyl)6  -6.9208     1.5583  -4.441 0.000119 ***
as.factor(cyl)8 -11.5636     1.2986  -8.905 8.57e-10 ***
```
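Note that these \(p\)-values compare each level against the reference level, `cyl 4`. If we want a different baseline, one option is to relevel the factor first (a sketch; `cyl_ref8` is a name of my choosing):

```
# Make 8 cylinders the reference level; the remaining coefficients
# then contrast 4 and 6 cylinders against it
mtcars$cyl_ref8 <- relevel(as.factor(mtcars$cyl), ref = "8")
coef(summary(lm(mpg ~ cyl_ref8, data = mtcars)))
```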

This is not possible in ANOVA, which is more of an omnibus test. To get these types of \(p\)-value assessments we need to run a Tukey Honest Significant Difference test, which tries to reduce the possibility of a type I error resulting from performing multiple pairwise comparisons (hence, “`p adjusted`”), resulting in a completely different output:

```
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = mtcars$mpg ~ as.factor(mtcars$cyl))

$`as.factor(mtcars$cyl)`
         diff        lwr        upr     p adj
6-4  -6.920779 -10.769350 -3.0722086 0.0003424
8-4 -11.563636 -14.770779 -8.3564942 0.0000000
8-6  -4.642857  -8.327583 -0.9581313 0.0112287
```
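For reference, this table is produced by calling `TukeyHSD` on the fitted `aov` object (`fit_anova` is a name of my choosing):

```
fit_anova <- aov(mpg ~ as.factor(cyl), data = mtcars)
TukeyHSD(fit_anova)  # adjusted pairwise comparisons of the level means
```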

Ultimately, nothing is more reassuring than getting a glimpse of the engine under the hood, which is none other than the model matrices and the projections onto the column space. These are actually quite simple in the case of an ANOVA:

\[\small\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ \vdots & \vdots & \vdots \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ \vdots & \vdots & \vdots \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \mu_1\\ \mu_2\\ \mu_3 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_n \end{bmatrix}\]

This would be the one-way ANOVA model matrix with three levels (e.g. `cyl 4`, `cyl 6`, `cyl 8`), summarized as \(\small y_{ij} = \mu_i + \epsilon_{ij}\), where \(\mu_i\) is the mean at each level or group: when the error or residual for observation \(j\) of group or level \(i\) is added, we obtain the actual DV observation \(y_{ij}\).
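R builds exactly this matrix if we suppress the intercept, which switches to the cell-means coding above (a sketch; `X_anova` is a name of my choosing):

```
# One indicator column per cylinder level, no intercept column
X_anova <- model.matrix(~ 0 + as.factor(cyl), data = mtcars)
head(X_anova)
colSums(X_anova)  # how many cars fall in each level
```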

On the other hand, the model matrix for an OLS regression is:

\[\small\begin{bmatrix}y_1 \\ y_2 \\ y_3 \\ y_4 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{12} & x_{13}\\ 1 & x_{22} & x_{23} \\ 1 & x_{32} & x_{33} \\ 1 & x_{42} & x_{43} \\ \vdots & \vdots & \vdots \\1 & x_{n2} & x_{n3} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \vdots \\ \varepsilon_n \end{bmatrix}\]

This is of the form \(\small y_i = \beta_0 + \beta_1\, x_{i1} + \beta_2\, x_{i2} + \varepsilon_i\), with a single intercept \(\beta_0\) and two slopes (\(\beta_1\) and \(\beta_2\)), each for a different continuous variable, say `weight` and `displacement`.
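The corresponding matrix can again be inspected with `model.matrix` (a sketch; `disp` stands in for displacement, and `X_ols` is a name of my choosing):

```
# A column of ones (the intercept) plus two continuous columns
X_ols <- model.matrix(~ wt + disp, data = mtcars)
head(X_ols)
```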

The trick now is to see how we can create different intercepts, as in the initial example - so let's get rid of the second slope and stick to the original single continuous variable `weight` (in other words, one single column besides the column of ones in the model matrix: the intercept \(\beta_0\) and the slope for `weight`, \(\beta_1\)). The column of ones will by default correspond to the `cyl 4` intercept, which will be shifted via dummy coding as follows:

\[\small\begin{bmatrix}y_1 \\ y_2 \\ y_3 \\ y_4\\ y_5 \\ \vdots \\ y_n\end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ 1 & x_4 \\ 1 & x_5 \\ \vdots & \vdots \\1 & x_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}+ \begin{bmatrix}1&0\\1&0\\1&0\\0&1\\0&1\\ \vdots & \vdots\\0&1\end{bmatrix} \begin{bmatrix} \mu_2 \\ \mu_3 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5\\ \vdots \\ \varepsilon_n \end{bmatrix}\]

Now when the third column is \(1\) we'll be systematically shifting the intercept by \(\mu_2\), and when the fourth column is \(1\), it will instead be \(\mu_3\) that is added to the intercept. The matrix equation, hence, will be \(\small y_i = \beta_0 + \beta_1\, x_i + \mu_i + \varepsilon_i\), where \(\mu_i\) is the shift for the level of observation \(i\) (and zero for the baseline level). Therefore, going from this model to the ANOVA model is just a matter of getting rid of the continuous variables, and understanding that the default intercept in OLS corresponds to the first level in ANOVA.
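Putting it together, the dummy-coded design of the initial model shows the column of ones, the `wt` column, and the two intercept-shift columns side by side (a sketch; `X_dummy` is a name of my choosing):

```
X_dummy <- model.matrix(~ wt + as.factor(cyl), data = mtcars)
head(X_dummy)
# Drop the wt column and this is the reference-coded one-way ANOVA design
```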