NOTES ON STATISTICS, PROBABILITY and MATHEMATICS


Applications of the Chi Square:


General test statistic (regardless of specific application):

The chi square statistic is denoted as \(\chi^2=\displaystyle \sum_i \frac{(O_i-E_i)^2}{E_i}\) where \(O_i\) is the observed number of cases in category \(i\), and \(E_i\) is the expected number of cases in category \(i\).

The null and alternative hypotheses for each chi square test are:

\(H_o: O_i=E_i\) for all categories \(i\), and \(H_1:O_i\neq E_i\) for at least one category.

The theoretical distribution is continuous, but the \(\chi^2\) statistic is obtained in a discrete manner, from discrete differences between the observed and expected values. Hence the test relies on an asymptotic approximation: for the chi square statistic to be well approximated by the theoretical chi square distribution, the sample size must be reasonably large, with at least \(\color{red}{5}\, \text{expected cases}\) per category. There may be no observed values in one or more categories, and this causes no difficulty with the test, but there must be around 5 or more expected cases per category. If there are considerably fewer than this, adjacent categories may be grouped together to boost the number of expected cases.

Under the null hypothesis the observed counts follow a multinomial distribution, and the test statistic is approximately distributed as a \(\chi^2\) with the appropriate degrees of freedom.
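
As a minimal sketch (with made-up counts for three categories), the statistic and its p-value can be computed by hand in [R] and compared against the theoretical \(\chi^2\) distribution:

O <- c(30, 45, 25)                    # hypothetical observed counts
E <- sum(O) * rep(1/3, 3)             # expected counts under a uniform null
(chi_sq <- sum((O - E)^2 / E))        # the chi square statistic
## [1] 6.5
pchisq(chi_sq, df = length(O) - 1, lower.tail = FALSE)  # p-value from the chi square distribution
## [1] 0.0387742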



GoF vs Homogeneity vs Independence Testing:


• The “goodness-of-fit test” is a way of determining whether a set of categorical data came from a claimed discrete distribution or not. The null hypothesis is that they did and the alternate hypothesis is that they didn’t. It answers the question: are the frequencies I observe for my categorical variable consistent with my theory? The goodness-of-fit test expands the one-proportion z-test. The one-proportion z-test is used if the outcome has only two categories. The goodness-of-fit test is used if you have two or more categories.
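
As a quick (hypothetical) illustration of that relationship: with only two categories and no continuity correction, the one-proportion z-test and the chi-square goodness-of-fit test give the same result, since the chi-square statistic equals \(z^2\). The counts below (58 successes out of 100, testing \(p=0.5\)) are made up:

x <- 58; n <- 100                                    # hypothetical: 58 successes in 100 trials
prop.test(x, n, p = 0.5, correct = FALSE)$statistic  # one-proportion test: X-squared = z^2 = 2.56
chisq.test(c(x, n - x), p = c(0.5, 0.5))$statistic   # GoF test with two categories: also 2.56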

• The “test of homogeneity” is a way of determining whether two or more DIFFERENT POPULATIONS (or GROUPS) share the same distribution of a SINGLE CATEGORICAL VARIABLE. For example, do people of different races have the same proportion of smokers to non-smokers, or do different education levels have different proportions of Democrats, Republicans, and Independents? The test of homogeneity expands on the two-proportion z-test. The two-proportion z-test is used when the response variable has only two categories as outcomes and we are comparing two groups. The homogeneity test is used if the response variable has several outcome categories, and we wish to compare two or more groups.

\(H_0\): \(p_1 = p_2 = \cdots = p_n\), i.e. the proportion of \(X\) is the same in all the populations studied.

\(H_a\): At least one proportion of \(X\) is not the same.

• The “test of independence” is a way of determining whether TWO CATEGORICAL VARIABLES are associated with one another in ONE SINGLE POPULATION. For example, we draw a single group of 200 subjects and record the number of children they have, and the number of colds they each got last year, trying to see if there is a relationship between having children and getting colds.

\(H_0\): \(X\) and \(Y\) are independent.

\(H_a\): \(X\) and \(Y\) are dependent.

• Homogeneity and independence sound the same. The difference is a matter of design. In the test of independence, observational units are collected at random from ONE POPULATION and TWO CATEGORICAL VARIABLES are observed for each observational unit. In the test of homogeneity, the data are collected by randomly sampling from each sub-group (SEVERAL POPULATIONS) separately. (Say, 100 blacks, 100 whites, 100 American Indians, and so on.) The null hypothesis is that each sub-group shares the same distribution of A SINGLE CATEGORICAL VARIABLE. (Say, “chain smoker”, “occasional smoker”, “non-smoker”).
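
Mechanically, both the homogeneity and the independence tests are run the same way in [R]: chisq.test() applied to a table of counts, with \(\small \text{df}=(\text{rows}-1)(\text{columns}-1)\); only the sampling design and the wording of the conclusion differ. A minimal sketch with made-up smoking counts for three groups:

# Hypothetical counts: rows are the groups (or the levels of the first variable),
# columns are the levels of the single categorical variable under study.
smoke <- matrix(c(20, 30, 50,
                  15, 35, 50,
                  10, 25, 65),
                nrow = 3, byrow = TRUE,
                dimnames = list(group  = c("A", "B", "C"),
                                status = c("chain", "occasional", "non-smoker")))
chisq.test(smoke)  # expected counts come from the row and column totals; df = (3-1)*(3-1) = 4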


Goodness-of-Fit Test:


This test concerns itself with checking whether the sample is consistent with a theorized population distribution. Examples:

  1. Is there the same number of customers each day visiting a store (the answer will determine staffing needs)? In other words, is the distribution uniform?

  2. Have the grades of the students been allocated using a bell distribution?

  3. Does a sample follow the distribution of the population in a science project?

  4. Check whether the random generation of numbers by a software program is as promised.

When dealing with a continuous variable there will be a need to discretize it by grouping the observations into \(k\) bins or categories, for instance through quantiles. Importantly, when no parameters are estimated from the data, the degrees of freedom are \(k-1\). Large values of the chi square statistic lead to rejecting the null hypothesis.

The degrees of freedom in this case are calculated with the formula: \(\small \text{df}\,=\, \text{bins}\,(\text{or}\,\text{categories}) - 1 - \text{number}\,\text{of}\,\text{estimated}\,\text{parameters}\,\text{of}\,\text{the}\,\text{distribution}\). For example, two parameters (mean and standard deviation) have to be estimated for the normal distribution, whereas only one (the success probability, \(n\) being known) has to be estimated for the binomial.
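
A small sketch of the binning step, using simulated data (the sample, the seed, and the choice of five bins below are arbitrary and only for illustration):

set.seed(1)
x      <- rnorm(200, mean = 10, sd = 2)             # a hypothetical continuous sample
breaks <- quantile(x, probs = seq(0, 1, by = 0.2))  # five quantile-based bins
obs    <- table(cut(x, breaks, include.lowest = TRUE))
# Expected bin probabilities under a normal with the estimated mean and sd:
p_exp  <- diff(pnorm(breaks, mean(x), sd(x)))
p_exp  <- p_exp / sum(p_exp)                        # renormalize so they sum to 1
TS     <- sum((obs - sum(obs) * p_exp)^2 / (sum(obs) * p_exp))
pchisq(TS, df = length(obs) - 1 - 2, lower.tail = FALSE)  # df = bins - 1 - 2 estimated parameters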



Example 1:


Consider a standard package of milk chocolate M&Ms. There are six different colors: red, orange, yellow, green, blue and brown. We get a sample of \(600\) M&Ms and check whether the distribution of colors is uniform. We have ONE CATEGORICAL VARIABLE with SIX LEVELS. The null hypothesis is that all six colors occur with a frequency or probability of \(1/6\). The degrees of freedom are the \(\text{no. of levels} - 1 = 5\).

(MM = c(red=212, orange=147, yellow=103, green=50, blue=46, brown=42))
##    red orange yellow  green   blue  brown 
##    212    147    103     50     46     42
expected_freq = rep(1/length(MM), length(MM)) # probability 1/6 for each of the six colors
chisq.test(MM, p = expected_freq, correct = F)
## 
##  Chi-squared test for given probabilities
## 
## data:  MM
## X-squared = 235.42, df = 5, p-value < 2.2e-16


Example 2:


A very generic example to see how the expected frequencies are generated would consist of testing whether the values in a vector representing a sample are consistent with a second vector representative of the population:

samp       <- c(12455, 11554, 11908, 13416, 12647, 7828, 10172) # the sample
population <- c(181395, 165347, 175331, 193760, 190370, 124226, 157822)
expected_freq <- population/sum(population) # we need to get the expected frequencies
ch_sq <- chisq.test(samp, p= expected_freq)
ch_sq$expected
## [1] 12209.518 11129.343 11801.356 13041.794 12813.616  8361.529 10622.843
ch_sq
## 
##  Chi-squared test for given probabilities
## 
## data:  samp
## X-squared = 88.184, df = 6, p-value < 2.2e-16

Given the tiny p-value, we reject the null: the sample does not follow the population distribution.



Example 3:


Does the distribution of ages in a sample match that expected in the population?

age         <- c("20-24", "25-34", "35-44")
observed    <- c(103, 216, 171)
Expec_Perc. <- c(18, 50, 32) # Percentages expected.
(dat <- data.frame(age, observed, Expec_Perc.))
##     age observed Expec_Perc.
## 1 20-24      103          18
## 2 25-34      216          50
## 3 35-44      171          32

In other words, \(H_o\): The proportion of cases in category \(i\) is \(p_i\); \(i= 1, 2, 3, \dots k\). The alternative hypothesis is that at least one of the proportions is not as hypothesized.

We can calculate the expected counts manually:

\(\small E_1= 490*\frac{18}{100}= 490*0.18 = 88.2\)

\(\small E_2= 490*\frac{50}{100}= 490*0.50 = 245.0\)

\(\small E_3= 490*\frac{32}{100}= 490*0.32 = 156.8\)

The TS (test statistic) is:

\(\small TS = (103 - 88)^2 / 88 + (216 - 245)^2 / 245 + (171 - 157)^2 / 157 = 7.2378\).

The degrees of freedom are \(k - 1 = 2\).

This rounding doesn’t take place when the values are calculated in [R]:

\(\small TS =(103 - 88.2)^2 / 88.2 + (216 - 245)^2 / 245 + (171 - 156.8)^2 / 156.8 = 7.202069\)
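
The same computation can be reproduced in [R] without rounding the expected counts:

expected <- sum(observed) * Expec_Perc. / 100  # 88.2, 245.0, 156.8
sum((observed - expected)^2 / expected)
## [1] 7.202069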

Check it out:

chisq.test(dat[,2], p = dat[,3]/100, correct = F)
## 
##  Chi-squared test for given probabilities
## 
## data:  dat[, 2]
## X-squared = 7.2021, df = 2, p-value = 0.0273

Since the p-value is below \(\small 0.05\), we reject the equality of distributions: the sample is not distributed as the hypothesized population.



Example 4:


Has the number of suicides in a region stayed the same over time?

This is the data:

years    <- c(1978:1989)
suicides <- c(164, 142, 153, 171, 171, 148, 136, 133, 138, 132, 145, 124)
dat <- data.frame(years, suicides); as.data.frame(t(dat))
##            V1   V2   V3   V4   V5   V6   V7   V8   V9  V10  V11  V12
## years    1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989
## suicides  164  142  153  171  171  148  136  133  138  132  145  124

The assumption is that the frequency is the same across years, and the test is:

chisq.test(dat$suicides, p = replicate(length(dat$suicides), 
                        1/length(dat$suicides))) # Each of the 12 years gets probability 1/12 under the null.
## 
##  Chi-squared test for given probabilities
## 
## data:  dat$suicides
## X-squared = 18.133, df = 11, p-value = 0.07855

Therefore, we don’t have enough evidence to reject the hypothesis that the number of suicides is the same across years.

Example 5:


There is a self-perpetuating calendar effect among youth soccer players whereby young players trying out for a team have a better chance to prove themselves and make the team if they are born in January than in November or December, a difference that matters most at an early age.

The birth months of a league of professional players can be analysed with a GOF test:

month_bins <- c("Jan to Mar", "Apr to Jun", "Jul to Sep", "Oct to Dec")
players    <- c(84,77,35,34)
dat <- data.frame(month_bins, players)
dat
##   month_bins players
## 1 Jan to Mar      84
## 2 Apr to Jun      77
## 3 Jul to Sep      35
## 4 Oct to Dec      34
chisq.test(dat$players, p = replicate(length(dat$players),
                                      1/length(dat$players))) # Each quarter gets probability 1/4 under the null.
## 
##  Chi-squared test for given probabilities
## 
## data:  dat$players
## X-squared = 37.235, df = 3, p-value = 4.104e-08

… a powerful argument in favor of the calendar effect.

However, this effect may be due to seasonal differences in the number of births. Hence, if the proportion of births in each quarter of the year is:

births  <- c(0.242,0.260,0.258,0.240) # Proportions of births each period (adds to 1)
(dat    <- cbind(dat,births))
##   month_bins players births
## 1 Jan to Mar      84  0.242
## 2 Apr to Jun      77  0.260
## 3 Jul to Sep      35  0.258
## 4 Oct to Dec      34  0.240
chisq.test(dat$players, p = births)
## 
##  Chi-squared test for given probabilities
## 
## data:  dat$players
## X-squared = 37.503, df = 3, p-value = 3.602e-08

… the players’ birth dates still follow a different distribution from the one predicted by the seasonality of births, further supporting the calendar effect.



Example 6: COMPARING MANUSCRIPTS (modified Jane Austen example):


In another example of GoF testing, suppose that we know the relative frequency of occurrence of certain letters (let’s take the vowels) in English literature at large. We now examine some documents, and we want to see if the counts of these vowels are consistent with the expected frequencies.

Let’s simulate some data:

English   <- c(a = 0.34, e = 0.12, i = 0.24, o = 0.11, u = 0.19) # proportions in English at large
documents <- c(a = 256,  e = 344,  i = 63,   o = 223,  u = 337)  # observed counts in the documents
rbind(English, documents)
##                a      e     i      o      u
## English     0.34   0.12  0.24   0.11   0.19
## documents 256.00 344.00 63.00 223.00 337.00

The idea is to run a GoF (goodness-of-fit) test to see if the data conforms to the expected frequencies. And we would do this with a \(\chi^2\) test. The degrees of freedom are the number of cells minus 1.

chisq.test(documents, p = English)
## 
##  Chi-squared test for given probabilities
## 
## data:  documents
## X-squared = 612.844, df = 4, p-value < 2.2e-16

The test statistic is far too large for the proportion of vowels in the documents to be consistent with the expected proportions in English literature at large. Perhaps this is because the documents come from a very specific linguistic register, or because they are very old documents, etc.



Example 7: COMPARING DISTRIBUTIONS: RANDOM NUMBER GENERATION:


We can test how good [R] is at generating uniform random numbers by generating \(\small 1 \,\text{million}\) draws and comparing the actual counts in each quarter of the unit interval with the expected frequency (\(0.25\)):

uniform_sample <- runif(1e6)

first    <- length(uniform_sample[uniform_sample<0.25])
secnd    <- length(uniform_sample[uniform_sample>=0.25&
                              uniform_sample<0.5])
third    <- length(uniform_sample[uniform_sample>=0.5&
                              uniform_sample<0.75])
fourth   <- length(uniform_sample[uniform_sample>=0.75])

counts   <- c(first,secnd,third,fourth)
TargFreq <- c(0.25, 0.25, 0.25, 0.25)

tab <- t(as.matrix(data.frame(counts, TargFreq)))
colnames(tab) <-c("[0,0.25)", "[0.25,0.5)", "[0.5,0.75)", "[0.75,1)")

exp_counts <- c(rep(1e6 * 0.25, times = 4))
tableau <- rbind(tab, exp_counts)
tableau # tableau is just a name for this object
##             [0,0.25) [0.25,0.5) [0.5,0.75)  [0.75,1)
## counts     250084.00  249610.00  249803.00 250503.00
## TargFreq        0.25       0.25       0.25      0.25
## exp_counts 250000.00  250000.00  250000.00 250000.00

The name I gave to the table presenting the counts and expected frequencies is tableau. Now we can run the test with degrees of freedom corresponding to the number of cells minus 1 (in this case \(\small \text{df}=3\)):

TS <- sum((tableau[1,] - tableau[3,])^2/tableau[3,])
TS
## [1] 1.803896
pchisq(TS, df=3, lower.tail = F)
## [1] 0.6140875

This value is clearly not extreme enough to conclude that the sample departs from the \(0.25\) fraction in each quarter.

With [R]’s built-in function we can easily run this GoF test:

chisq.test(tableau[1,], p = tableau[2,])
## 
##  Chi-squared test for given probabilities
## 
## data:  tableau[1, ]
## X-squared = 1.8039, df = 3, p-value = 0.6141



Example 8: COMPARING DISTRIBUTIONS: ARE INVESTMENT RETURNS “NORMAL”?


As an example of how the chi-square goodness-of-fit test for a normal distribution can be used, the 5-year annualized return rates achieved by 158 growth funds can be assessed for normality:

library(gsheet)
data <- read.csv(text = gsheet2text('https://docs.google.com/spreadsheets/d/1VQP4jwknEQPnbRFOCeFoNdvieHg9BSVrPzMDwXJbsL4/edit?usp=sharing', format ='csv'))
## No encoding supplied: defaulting to UTF-8.
funds <- data[data$Objective=="Growth",c("Fund","FiveYrRtn")]
head(funds)
##                                  Fund FiveYrRtn
## 1 ABN Amro Montag & Calswell Growth I      11.7
## 2             AIM Aggressive Growth A       5.4
## 4              Aim Small Cap Growth A      18.5
## 5                        AIM Summit I       7.3
## 6                         AIM Value B       8.8
## 7                    AIM Weingarten A       3.5
# Since we are going to test for normality the two parameters to estimate are:

mean(funds$FiveYrRtn); sd(funds$FiveYrRtn)
## [1] 10.14873
## [1] 4.772583

The null and alternative hypotheses are:

\(H_o\): The 5-year annualized return rates follow a normal distribution.

\(H_1\): The 5-year annualized return rates do not follow a normal distribution.

Segmenting the five-year returns into intervals:

library(ggplot2)
intervals <- cut_interval(funds$FiveYrRtn, length = 5)
tab <- as.matrix(table(intervals))
colnames(tab) <- c("OBS.FR.")
Upp.Bound <- c(-5,0,5,10,15,20,25,30)
Z.vals <- round((Upp.Bound - replicate(8, mean(funds$FiveYrRtn)))/replicate(8, sd(funds$FiveYrRtn)),4)
area.below <- round(pnorm(Z.vals,lower.tail=T),5)
area.bin <- round(c(area.below[1],diff(area.below)),5)
EXP.FR. <- round(area.bin * length(funds$FiveYrRtn))
DELTAsq <- (as.data.frame(tab)[,1] - EXP.FR.)^2/EXP.FR.
(dat<-data.frame(tab, Upp.Bound, Z.vals, area.below, area.bin, EXP.FR., DELTAsq))
##          OBS.FR. Upp.Bound  Z.vals area.below area.bin EXP.FR.   DELTAsq
## [-10,-5]       1        -5 -3.1741    0.00075  0.00075       0       Inf
## (-5,0]         3         0 -2.1265    0.01673  0.01598       3 0.0000000
## (0,5]         15         5 -1.0788    0.14034  0.12361      20 1.2500000
## (5,10]        58        10 -0.0312    0.48756  0.34722      55 0.1636364
## (10,15]       61        15  1.0165    0.84530  0.35774      57 0.2807018
## (15,20]       16        20  2.0641    0.98050  0.13520      21 1.1904762
## (20,25]        3        25  3.1118    0.99907  0.01857       3 0.0000000
## (25,30]        1        30  4.1594    0.99998  0.00091       0       Inf
# The TEST STATISTIC IS:

sum(DELTAsq)
## [1] Inf

The TEST STATISTIC is infinite because the rounded expected frequencies in the extreme bins are \(0\) (far below the recommended minimum of 5 per bin). So we have to combine bins:

OBS.FR. <- c(4,15,58,61,16,4)
Upp.Bound <-c(0,5,10,15,20,25)
area.bin <- round(c(pnorm(Z.vals[2]),diff(area.below[2:7])),5)
EXP.FR. <- round(area.bin * length(funds$FiveYrRtn),3)
dat <- dat[2:7,]
dat$OBS.FR. <- OBS.FR.
dat$Upp.Bound <- Upp.Bound
dat$area.bin <- area.bin
dat$EXP.FR. <- EXP.FR.
dat$DELTAsq <- (OBS.FR. - EXP.FR.)^2/EXP.FR.
row.names(dat) <-c("(-10,0]","(0,5]","(5,10]","(10,15]",
"(15,20]","(20,30)")
dat
##         OBS.FR. Upp.Bound  Z.vals area.below area.bin EXP.FR.   DELTAsq
## (-10,0]       4         0 -2.1265    0.01673  0.01673   2.643 0.6967268
## (0,5]        15         5 -1.0788    0.14034  0.12361  19.530 1.0507373
## (5,10]       58        10 -0.0312    0.48756  0.34722  54.861 0.1796052
## (10,15]      61        15  1.0165    0.84530  0.35774  56.523 0.3546084
## (15,20]      16        20  2.0641    0.98050  0.13520  21.362 1.3458966
## (20,30)       4        25  3.1118    0.99907  0.01857   2.934 0.3873061
# The TEST STATISTIC:

sum(dat$DELTAsq)
## [1] 4.01488

Now that we finally have \(k=6\) bins, and the normal distribution has two estimated parameters, we end up with \(df=6-1-2=3\), and…

qchisq(0.95, df = 3)
## [1] 7.814728

since the test statistic (\(\small 4.01\)) is below the critical value (\(\small 7.81\)), we can’t reject the hypothesis that the returns follow a normal distribution.


CLOSING REMARKS for GOF test:


This test is sensitive to the choice of bins. There is no optimal choice for the bin width (since the optimal bin width depends on the distribution). Most reasonable choices should produce similar, but not identical, results. For the chi-square approximation to be valid, the expected frequency in each bin should be at least 5. This test is not valid for small samples, and if some of the expected counts are less than five, you may need to combine some bins in the tails.

The Shapiro-Wilk test, proposed in 1965, calculates a \(W\) statistic that tests whether a random sample, \(x_1,x_2,\dots,x_n\), comes from (specifically) a normal distribution. Small values of \(W\) are evidence of departure from normality, and percentage points for the \(W\) statistic, obtained via Monte Carlo simulations, were reproduced by Pearson and Hartley (1972). This test has done very well in comparison studies with other goodness-of-fit tests.
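
In [R] the test is available as shapiro.test(); for instance, it could be applied to the fund returns of Example 8 (assuming the funds data frame from that example is still in memory):

shapiro.test(funds$FiveYrRtn)  # W close to 1 supports normality; a small p-value argues against it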

The chi-square test is an alternative to the Anderson-Darling and Kolmogorov-Smirnov goodness-of-fit tests. The chi-square goodness-of-fit test can be applied to discrete distributions such as the binomial and the Poisson. The Kolmogorov-Smirnov and Anderson-Darling tests are restricted to continuous distributions.

The distribution of the K-S test statistic itself does not depend on the underlying cumulative distribution function being tested. Another advantage is that it is an exact test (the chi-square goodness-of-fit test depends on an adequate sample size for the approximations to be valid). Despite these advantages, the K-S test has several important limitations:

  1. It only applies to continuous distributions.
  2. It tends to be more sensitive near the center of the distribution than at the tails.
  3. Perhaps the most serious limitation is that the distribution must be fully specified. That is, if location, scale, and shape parameters are estimated from the data, the critical region of the K-S test is no longer valid. It typically must be determined by simulation.
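
For reference, a minimal K-S sketch on the same fund returns (Example 8); note that, per point 3 above, estimating the mean and standard deviation from the data means the reported p-value is only approximate:

ks.test(funds$FiveYrRtn, "pnorm", mean(funds$FiveYrRtn), sd(funds$FiveYrRtn))
# ks.test will also warn about ties if any return values are repeated in the data.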


NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.