NOTES ON STATISTICS, PROBABILITY and MATHEMATICS


Monte Carlo Instead of Chi Square:


An alternative when the conditions for a chi squared test (i.e. No cells with expected values \(<1\)), and no more than \(20\%\) of cells with values \(<5\)) are not met, is a Monte Carlo simulation.

The function is the option simulate.p.value = TRUE in the function chisq.test(). Of note the default number of replicates in the simulation is \(\text{B}=2000\), which can naturally be changed.

To make it manageable, I can use a minimalistic table:

sales <- matrix(c(2, 7, 1, 2), nrow = 2, byrow = T)
dimnames(sales) = list(Sales = c("Purchases", "No_Purchases"),
                       Clients = c("One_off", "Good_Customer"))
sales
##               Clients
## Sales          One_off Good_Customer
##   Purchases          2             7
##   No_Purchases       1             2

For the probabilities used for each cell in the simulation, we can use the default value, which assigns the same probability to every cell rep(1/length(x), length(x)) = 0.25. But this wouldn’t make sense in our case, since we see that we have more good customers than occasional, one-off clients. So we can make the probability vector to reflect the break-down of our clients: \(3/9 = 1/3\). So, under the null hypothesis we assume that there is the same probability of a sale for both customer category, p = c(1/8,3/8,1/8,3/8).

First some definitions:

probabilities = c(1/8,3/8,1/8,3/8) # Our probability vector
B = 2 # Only two replicates to keep things under control
no.cells = length(sales) # No. cells (4) in table. In R, this is "nx".
total_counts = sum(sales) # 12 in this case. In R, it's "n".
expected = total_counts * probabilities # 1.5  4.5  1.5  4.5. In R, it's "E".

Now we are going to randomly pick up cells among the 4 cells in our table, but with one condition: We’ll do that as many times as the total number of counts in the original table, which in this case is 12. The cells will be assigned values from 1 to 4 through the call length(x) (in this case no.cells). The randomness, though, will be governed by the probabilities assigned. The result will look something like: 1, 2, 2, 4, 2, 4, 3, 1, 2, 1, 4, 4.

It may seem now for a second that we are generating some sort of counts, but we are not, because these numbers from 1 to 4 will be tabulated - counted for their frequencies. And it will be these counts that will constitute the simulated counts.

We will do this for B replicates or simulations. All of it in one single sweep.

Remember that we are doing it twice (B = 2), and there are 12 counts, so we expect 24 values returned. These will be in 12 rows, so that each count of all the original counts is now going to point to a random cell:

set.seed(23)
simulations = matrix(sample.int(no.cells, B * total_counts, replace = TRUE, prob = probabilities), nrow = total_counts)
t(simulations)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## [1,]    4    2    2    4    3    4    1    1    3     1     3     4
## [2,]    4    2    3    2    4    4    1    4    4     2     2     3

Now it is time to tabulate, and we have to do it separately for each simulation (i.e. column-wise in the original, non-transposed matrix).

The next call will produce the chi square for each simulation, which is the sum of the squared value of Observed - Expected counts for each cell for all cells, divided by the Expected values.

Some details about the function: First, the 2L means column-wise, and the function within apply takes the expected counts calculated above based on overall counts and probabilities, and the number of cells. It is the number of cells (4 in this case) that will be the levels = 1:k of 1,2,3,4 used to turn the results of simulations into factors: so instead of 2 the computer will see "2". Whatever… The important thing is that now, finally, these cell reference numbers from 1 to 4, and again, there are as many chosen as total number of cells, will be tabulated to create the actual (fictitious) counts across columns. Once this is done, we’ll subtract the expected values, square, divide by the expected value. And all done!

chi = apply(simulations, 2L, function(x, E, k) {
    sum((table(factor(x, levels = 1L:k)) - E)^2 / E)}, E = expected, k = no.cells)
chi # In this case we are going to have just two chi squared values (2 simulations). R calls this function `ss`.
## [1] 4.4444444 0.4444444

Now we calculate what percentage of chi square values are large than what R calls STATISTIC, which is calculated based on the original data (sales), and which would be the actual calculation of the chi square value. So we are really looking at how many, randomly generated tables with the same overall counts end up with a more extreme chi square:

STATISTIC <- sum((sales - expected)^2/expected)
(p.value <- (1 + sum(chi >= STATISTIC))/(B + 1)) # We divide by the number of simulations.
## [1] 0.3333333

This would not be a statistically significant value, relying on this minimalistic simulation So we wouldn’t be able to say that there are differences in purchases between regular and occasional clients.

With just two simulations we can’t really compare to the R formula, which would be:

chisq.test(sales, p = probabilities, simulate.p.value = T, B = 2)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2
##  replicates)
## 
## data:  sales
## X-squared = 0.14815, df = NA, p-value = 1


References:

R Studio pubs


Home Page

NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.