NOTES ON STATISTICS, PROBABILITY and MATHEMATICS


ANOVA From Summary Statistics:


Extracted from this document.

As an illustration of the one-way ANOVA calculated from summary statistics in R, we can look at the differences in sepal length between three species of Iris, comparing the output of the “manual” code with the built-in output for the ANOVA table.

We are looking at these differences:

plot(iris$Sepal.Length ~ iris$Species, xlab="species", ylab="sepal length",
     main = "Differences in Sepal length across species", cex.main=.8)

The SUMMARY STATISTICS (including the number of specimens) are as follows:

    # Number of specimens per species:
    (n           = tapply(iris$Sepal.Length, iris$Species, length))
    #    setosa versicolor  virginica 
    # E        50         50         50 
    # Mean sepal length for each species:
    (group_means = tapply(iris$Sepal.Length, iris$Species, mean))
    #    setosa versicolor  virginica 
    #     5.006      5.936      6.588 
    # Sample variance of the sepal length for each species:
    (group_var   = tapply(iris$Sepal.Length, iris$Species, var))
    #    setosa versicolor  virginica 
    #    0.1242490  0.2664327  0.4043429 

Now, working with these summary statistics we can calculate…

The variability between groups:

    # First we need to calculate the overall mean:
    grand_mean  = sum(n * group_means) / sum(n) 
    #... to calculate the sum of square differences between groups:
    (SS_between  = sum(n * (group_means - grand_mean)^2))
    # [1] 63.21213
    # with the corresponding degrees of freedom:
    (df_between  = length(unique(iris$Species)) - 1)
    # [1] 2
    # and the mean square error:
    (MS_between  = SS_between / df_between)
    # [1] 31.60607

Now we can calculate the variability within groups:

It’s worth at this point remembering that the sample variance is calculated as \(s^2=\frac{\sum (x_i - \bar x )^2}{n-1}\) so that the \(\displaystyle \text{SS}_{\text{within groups}}= \sum_{\text{groups}}(n-1) s^2.\)

    # We obtained the sum of squares within adding the numerators of the var's):
    (SS_within   = sum((n - 1) * group_var))
    # [1] 38.9562
    # And calculate the degrees of freedom:
    (df_within   = sum(n - 1))
    # [1] 147
    # And the mean squared error:
    (MS_within   = SS_within / df_within)
    # [1] 0.2650082

We calculate the F value and the p value as:

    # The ratio of mean squared errors:
    (F_value = MS_between / MS_within)
    # [1] 119.2645
    # and the corresponding probability:
    (p_value = pf(F_value, df_between, df_within, lower.tail = F))
    # [1] 1.669669e-31

Now we can compare these values to the output in R using the original data set:

    > anova(lm(iris$Sepal.Length ~ iris$Species))
    Analysis of Variance Table
    
    Response: iris$Sepal.Length
                  Df Sum Sq Mean Sq F value    Pr(>F)    
    iris$Species   2 63.212  31.606  119.26 < 2.2e-16 ***
    Residuals    147 38.956   0.265   

Finally,

    > (SS_total = SS_between + SS_within)
    [1] 102.1683
    > (df       = df_between + df_within)
    [1] 149

Home Page

NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.