This question appeared here.
Newbie question using R’s mtcars dataset with anova() function. My question is how to use anova() to select the best (nested) model. Here’s some example data:
anova(lm(mpg~disp,mtcars),lm(mpg~disp+wt,mtcars),lm(mpg~disp+wt+am,mtcars))
Analysis of Variance Table
Model 1: mpg ~ disp
Model 2: mpg ~ disp + wt
Model 3: mpg ~ disp + wt + am
Res.Df RSS Df Sum of Sq F Pr(>F)
1 30 317.16
2 29 246.68 1 70.476 8.0036 0.008535 **
3 28 246.56 1 0.126 0.0143 0.905548
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> anova(lm(mpg~disp,mtcars),lm(mpg~disp+wt,mtcars),lm(mpg~disp+wt+hp,mtcars))
Analysis of Variance Table
Model 1: mpg ~ disp
Model 2: mpg ~ disp + wt
Model 3: mpg ~ disp + wt + hp
Res.Df RSS Df Sum of Sq F Pr(>F)
1 30 317.16
2 29 246.68 1 70.476 10.1201 0.003571 **
3 28 194.99 1 51.692 7.4228 0.010971 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
My understanding is anova() compares the reduction in the residual sum of squares to report a corresponding p-value for each nested model, where lower p-values means that nested model is more significantly different from the first model.
Question 1: Why is it that changing the 3rd regressor variable
effects results from the 2nd nest model? That is, the p-value for
disp+wt
model changes from 0.008535 to 0.003571 going from
the first to the second example. (does anova’s model 2 analysis use data
from model 3???)
Question 2: Since the 3rd model’s Sum of Sq
value is
much lower in the first example (e.g. 0.126 versus 51.692), I’d expect
the p-value to be lower as well, but it in fact increases (e.g. 0.905548
versus 0.010971). Why?
Question 3: Ultimately I’m trying to understand, given a dataset with a lot of regressors, how to use anova() to find the best model. Any general rules of thumb are appreciated.
ANSWER:
Question 1:
Regarding the difference in p-values (or F-values)
for Model 2
in the first ANOVA compared to the second,
there is no explanation apparent to me. They should indeed be the same,
since we are comparing at that point Model 1
to
Model 2
in both instances. It would be very useful to get
further input about the reason behind, but in the absence of an
explanation, it is probably advisable to only include two models in each
ANOVA call. The values calculated “manually” (I presume the
correct values) are consistent with the first ANOVA call in
your post with an F = 8.2852
.
Question 2:
You are comparing 51.692
and 0.126
BUT this
bulkier difference in the second case is a good thing: What you should
be paying attention to is the difference in the second ANOVA between
246.68
for Model 2
and 194.99
for
Model 3
, which is much more of a drop in RSS than
what you observe in in the first ANOVA, a change from
246.68
to 246.56
, tantamount to nothing. So by
including the third regressor in the second ANOVA, you shave off a ton
of residual (misfitted) information that is now captured, whereas you
hardly made a dent in the first set of models.
Question 3:
The question is how to use ANOVA to find the best model, but I think the answer is clear by now - low p-value makes it a good bet that you should include the regressor under consideration. Yet, what is not clear at all is the process by which you move along a series of ANOVA tests. In other words, I know you know because I’ve taken the Coursera course, that at first you are acquainted with \(R^2\), which you can calculate manually, and understand the idea here (among many other places); after the idea of overfitting is introduced, Adjusted-\(R^2\) values are explained, shortly followed by, yes, ANOVA (F-test). All these and others, such as the AIC criterion are tests or criteria to select models or include / exclude regressors. For instance, AIC and BIC used for non-nested models, and ANOVA and Likelihood Ratio Tests for nested models (I’m quoting a comment by @f coppens in this post.
The challenge, though, is to come up with models that make sense to
test (check this
post). Say you have 5 independent variables, and you are doing
linear regression, or OLS, you can include or exclude each variable in
the any conceivable model for a total of \(2^{10}=1,024\) models before considering
interactions (remember that mtcars
has 10 variables besides
mpg
). This has its own problems with collinearity for
instance. You can resort to your knowledge of cars, or just run with raw
computer power through all combinations either with forward or
backward stepwise selection as very nicely explained in this
document, using R code along the lines of:
## Method Adj.r.square
## 1 for.aic 0.8263446
## 2 for.bic 0.8185189
## 3 back.aic 0.8335561
## 4 back.bic 0.8335561
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
to select the model with the highest adjusted \(R^2\) (or you can also run ANOVA) with the idea of later looking into possible interactions, and running residuals diagnostics.
NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.