Chi-square: Independence, Homgeneity

APPLICATION OF THE CHI SQUARE TEST:

GENERAL TEST STATISTIC (regardless of specific application):

The chi square statistic is denoted as \(\chi^2=\displaystyle \sum_i \frac{(O_i-E_i)^2}{E_i}\) where \(O_i\) is the observed number of cases in category \(i\), and \(E_i\) is the expected number of cases in category \(i\).

The null and alternative hypotheses for each chi square test are:

\(H_o: O_i=E_i\) and \(H_1:O_i\neq E_i\).

The theoretical distribution is a continuous distribution, but the \(\chi^2\) statistic is obtained in a discrete manner, on the basis of discrete differences betweenthe observed and expected values. Hence, there is a need for asymptotics: The condition in which the chi square statistic is approximated by the theoretical chi square distribution, is that the sample size is reasonably large: there should be at least \(5 \,\text{expected cases}\) in at least \(80\%\) of the cells (with all the cells having expected values \(> 1\)). If there are considerably less than this, adjacent categories may be grouped together to boost the number of expected cases.

Concepts:

• The “goodness-of-fit test” is a way of determining whether a set of categorical data came from a claimed discrete distribution or not. The null hypothesis is that they did and the alternate hypothesis is that they didn’t. It answers the question: are the frequencies I observe for my categorical variable consistent with my theory? The goodness-of-fit test expands the one-proportion z-test. The one-proportion z-test is used if the outcome has only two categories. The goodness-of-fit test is used if you have two or more categories.

• The “test of homogeneity” is a way of determining whether two or more DIFFERENT POPULATIONS (or GROUPS) share the same distribution of a SINGLE CATEGORICAL VARIABLE. For example, do people of different races have the same proportion of smokers to non-smokers, or do different education levels have different proportions of Democrats, Republicans, and Independent. The test of homogeneity expands on the two-proportion z-test. The two proportion z-test is used when the response variable has only two categories as outcomes and we are comparing two groups. The homogeneity test is used if the response variable has several outcome categories, and we wish to compare two or more groups.

\(H_0\): \(p_1 = p_2 = \cdots = p_n\) the proportion of \(X\) is the same in all the populations studied.

\(H_a\): At least one proportion of \(X\) is not the same.

• The “test of independence” is a way of determining whether TWO CATEGORICAL VARIABLES are associated with one another in ONE SINGLE POPULATION. For example, we draw a single group of 200 subjects and record the number of children they have, and the number of colds they each got last year, trying to see if there is a relationship between the having children and getting colds.

\(H_0\): \(X\) and \(Y\) are independent.

\(H_a\): \(X\) and \(Y\) are dependent.

• Homogeneity and independence sound the same. The difference is a matter of design. In the test of independence, observational units are collected at random from ONE POPULATION and TWO CATEGORICAL VARIABLES are observed for each observational unit. In the test of homogeneity, the data are collected by randomly sampling from each sub-group (SEVERAL POPULATIONS) separately. (Say, 100 blacks, 100 whites, 100 American Indians, and so on.) The null hypothesis is that each sub-group shares the same distribution of A SINGLE CATEGORICAL VARIABLE. (Say, “chain smoker”, “occasional smoker”, “non-smoker”).

• Homogeneity and independence sound the same. The difference is conceptual:

Test of independence:

Observational units are collected at random from a population and two categorical variables are observed for each unit.
Two variables are observed for each observational unit.
The margins are considered fixed. The test conditions on both margins - No cross-multiplication of margins.

Example:

  Couples are sampled and "husband satisfaction" and "wife satisfaction are measured."

Test of homogeneity:

The data are collected by randomly sampling from each sub-group separately. (Say, 100 blacks, 100 whites, 100 American Indians, and so on.) The null hypothesis is that each sub-group shares the same distribution of another categorical variable. (Say, “chain smoker”, “occasional smoker”, “non-smoker”.)
The margins of the variables are considered random variables distributed normally, and we cross-multiply to get expected frequencies.

Examples:

1. Patients on drug A and patients in drug B would be sub-groups of the variable ("treatment"), and each assessed for the variable "heartburn".
  
2. Relationship between "sex" and "class". Take samples of both genders and check their socioeconomic class.

It is based on the analysis of a cross classification on a contingency table to test the possible dependency or relationship between variables.

There is a conceptual distinction between the test of independence and the chi-square test of homogeneity, see here and here, although there are no practical mathematical consequences. A test of independence would condition on both margins, whereas in a test of homogeneity the marginals of one of the categories are assumed not be random variables, but rather fixed by design. These theoretical distinctions aside, both can be lumped under the terms chi-square test of association (for two-way contingency tables) or chi-square k-independent samples comparison test (for more two categories). This is in contradistinction to the one-sample chi-square test of agreement, essentially a GOF test.

For larger samples (> 5 expected frequency count in each cell) the \(\chi^2\) provides an approximation of the significance value. The test is based on calculating the expected frequency counts obtained by cross-multiplying the marginals (assuming normal distribution of the marginals, it makes sense that we end up with a \(\chi^2\) distributed test statistic, since if \(X\sim N(\mu,\sigma^2)\), then \(X^2\sim \chi^2(1))\):

The goodness of fit test examines only one variable, while the test of independence is concerned with the relationship betweentwo variables. Like the goodness of fit test, the chi square test of independence is verygeneral, and can be used with variables measured on any type of scale, nominal, ordinal, interval or ratio. As with the GOF test the expected number of cases in each cell should be more than \(5\).

The chi square test for independence is conducted by assuming that there is no relationship between the two variables being examined. The alternative hypothesis is that there is some relationship between the variables.

The chi square statistic used to conduct this test is the same as in the goodness of fit test:

\(\chi^2=\displaystyle \sum_i \frac{(O_i-E_i)^2}{E_i}\) where \(O_i\) is the numbers of cases in each cell ofthe cross classiØcation table, and \(E_i\) is the expected number of cases.

The chi square statistic computed from the observed and expected values is calculated, and if this statistic is inthe region of rejection of the null hypothesis, then the assumption of no relationship between \(X\) and \(Y\) is rejected.

Example 1:

The example given in the test of proportions entry.

Example 2:

Is there a relationship between sex and class?

tab <- matrix(c(33,153,103,16,29,181,81,14), nrow =4)
dimnames(tab)<-list(Social.Class = c("Upper","Middle","Working","Lower"), Sex = c("Male", "Female"))
addmargins(tab) # Margins are considered random variables!

##             Sex
## Social.Class Male Female Sum
##      Upper     33     29  62
##      Middle   153    181 334
##      Working  103     81 184
##      Lower     16     14  30
##      Sum      305    305 610

Under the NULL hypothesis the two variables are unrelated and therefore:

\(\small P(\text{Upper} \,\text{and}\, \text{Male}) = P(\text{Upper})P(\text{Male})\)

\(\small P(\text{Upper}\,\text{and}\,\text{Male}) =\frac{62}{610}\frac{305}{610}\)

and the number of expected cases for the cell:

\(610 \, \frac{62}{610}\frac{305}{610}\)

The general formula for the degrees of freedom is the number of rows minus 1, times the number of columns minus 1.

chisq.test(tab)

## 
##  Pearson's Chi-squared test
## 
## data:  tab
## X-squared = 5.3691, df = 3, p-value = 0.1467

The independence of sex and class cannot be rejected.

Test of Independence of Two Variables:

The best illustration is Agresti’s example with husband and wife marital satisfaction:

sex_satis <- matrix(c(7,2,1,2,7,8,5,8,2,3,4,9,3,7,9,14), nrow = 4)
dimnames(sex_satis) = list(wife = c("Never","Sometimes","Often","Always"), husband = c("Never","Sometimes","Often","Always"))
addmargins(sex_satis) # Margins are considered fixed!

##            husband
## wife        Never Sometimes Often Always Sum
##   Never         7         7     2      3  19
##   Sometimes     2         8     3      7  20
##   Often         1         5     4      9  19
##   Always        2         8     9     14  33
##   Sum          12        28    18     33  91

It is to be expected that, in general, the satisfaction of a spouse moves together with the other.

If we run a \(\chi^2\) test we get,

chisq.test(sex_satis, correct = T)

## 
##  Pearson's Chi-squared test
## 
## data:  sex_satis
## X-squared = 16.955, df = 9, p-value = 0.04942

that with a probability of \(<\,5\%\) of making a type I error, the counts in the table cannot be considered independent - we reject the \(H_o\) hypothesis of independence, and assume that there is an association between the satisfaction of the husband and the wife.

In this test, the margins are considered fixed, and the expected counts calculated as follows:

What this is saying is that the expected values in the cells are conditioned on both margins (row and column), and it should be read as along the lines as, for example: the expected value of “never/never” is the probability that the wife “never” experiences satisfaction (\(19/91\)) times (independent of) the probability that the husband “never” experiences pleasure (\(12/91\)). Then we calculate based on the resultant probability the expected frequency in the cell by multiplying by the total number of cases (\(91\)).

In contradistinction in the test of homogeneity (as in the example above with antacids - table reproduced below this paragraph) we are saying that, for instance, getting the result “Drug A/heartburn” is the probability of “heartburn” (\(156/368\)) (one single variable). Then we calculate the expected frequency by multiplying this times the number of patients taking Drug A (\(178\)).

##            Medication
## Symptoms    Drug A Drug B Sum
##   Heartburn     64     92 156
##   Normal       114     98 212
##   Sum          178    190 368

The test statistic will be:

\(\sum \frac{(O-E)^2}{E}\) with \(\small df = (rows-1)*(cols - 1)\)

The exact p-value can be obtained with a Monte Carlo simulation:

library(gsheet)
data <- read.csv(text =  gsheet2text('https://docs.google.com/spreadsheets/d/1w575Rfj2QY30wWc6piunwcyDOfYQfL-5KSulJIQ3ego/edit?usp=sharing',format ='csv'))

## No encoding supplied: defaulting to UTF-8.

data <- data[,-1]

We need first the dataset in long format:

data[1:3,]

##   husband  wife
## 1   Never Never
## 2   Never Never
## 3   Never Never

tail(data)

##    husband   wife
## 86  Always Always
## 87  Always Always
## 88  Always Always
## 89  Always Always
## 90  Always Always
## 91  Always Always

If we perform first the chi square test on the dataset:

chisq.test(table(data),correct = F)$statistic

## X-squared 
##  16.95524

… and then we permute either the husband or the wife (swingers?), calculating the chi square value every time, and record the percentage of time that it will be greater than the original chi statistic:

baseline <- chisq.test(table(data),correct = F)$statistic
dat <- data
statistic <- 0
for(i in 1: 1000){dat[,2] <- sample(dat[,2])
statistic[i] <- chisq.test(table(dat), correct = F)$statistic}
(p_value <- length(statistic[statistic>=baseline])/length(statistic))

## [1] 0.055

So, under the assumption of independence (swingers) only in this small fraction of case would we obtain such a high chi square value to begin with. Hence, we can reject the hypothesis of independence, and assume there is dependence between the husbands’ and wives’ satisfaction.

We can calculate this “exact” test (really a generalization of Fisher’s exact test) with the command:

chisq.test(table(data), simulate.p.value = T)

## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  table(data)
## X-squared = 16.955, df = NA, p-value = 0.04848

This how simulate.p.value works:

I realize what follows is not a formal answer to the question, and that the majority of users will find it self-evident. However, after being reminded by @bdeonovic of the simulate.p.value option within the chisq.test() function in R, I went through the code to understand how the function works, and posting it here could possibly be of help (starting with myself… next time I forget).

Even though what follows is code-based, it exposes in very simple terms the actual idea of the simulation.

So we have a \(3 \times 2\) matrix with margins as follows:

m = matrix(c(4,5,23,20,104,496), nrow=3)
dimnames(m) = list(c("A", "B", "C"), c("Yes","No"))
addmargins(m)

    Yes  No Sum
A     4  20  24
B     5 104 109
C    23 496 519
Sum  32 620 652

The key lines in the chisq.test() function dealing with the Monte Carlo option are:

if (simulate.p.value) {
            nx <- length(x)
            sm <- matrix(sample.int(nx, B * n, TRUE, prob = p), 
                nrow = n)
            ss <- apply(sm, 2L, function(x, E, k) {
                sum((table(factor(x, levels = 1L:k)) - E)^2/E)
            }, E = E, k = nx)
            PARAMETER <- NA
            PVAL <- (1 + sum(ss >= almost.1 * STATISTIC))/(B + 
                1)
        }

The function takes in as input at the beginning of the call x in

       chisq.test(x, y = NULL, correct = TRUE,
       p = rep(1/length(x), length(x)), rescale.p = FALSE,
       simulate.p.value = FALSE, B = 2000)

Therefore x = m. Now, nx is the length(x), which is the number of cells (\(6\) cells in this example). n is the total number of counts, which can be observed as the grand total in the printed out table above \((652)\), and is generally calculated as n = sum(x).

Under the null hypothesis of independence, the probability of each cell in the matrix is uniformly p = rep(1/length(x), length(x)) # 0.1666667, which implies that the expected value is E = n * p # 108.6667.

With this the actual Monte Carlo simulation is carried out with the line:

sm = matrix(sample.int(nx, B * n, TRUE, prob = p), nrow = n) which spells the following command: Sample from the integers nx (i.e. \(1\) to \(6\)) so as to randomly choose one of the six cells in the matrix, B * n times, where B is the number of simulations, and n is the total number of counts in the original matrix \((652)\). TRUE stands for replace = TRUE because we want to be able to select each cell more than once. And p gives directions to use uniform probability across cells in this case. So we get an insane amount of picks from \(1\) to \(6\), which the matrix( , nrow = n) organizes in columns of in this case n entries, i.e. \(652\) (the total counts above). Each column is therefore a random table with the same totals as in the original matrix. Now it is just a matter of counting:

ss = apply(sm, 2L, function(x, E, k) {sum((table(factor(x, levels = 1L:k)) - E)^2/E)}, E = E, k = nx)

is commanding to do the following operations on sm, columnwise (2L), i.e. for each simulation:

Tabulate the number of times each cell \(1: 6\) was picked (table(x, levels = 1L:K). Here ’k = nx = c`.
Calculate the chi square statistic as \(\chi^2=\displaystyle \sum_i \frac{(O_i-E_i)^2}{E_i}\) for each column (simulation).

The last step is simply to calculate in what proportion of cases this chi square statistic is bigger than the chi square statistic calculated on the original matrix. This is the PVAL.

Home Page

NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.