NOTES ON STATISTICS, PROBABILITY and MATHEMATICS


One Sample Mean to Population Comparison with Z and T Tests:


The one-sample z-test (or t-test) is used when we want to know whether our sample comes from a particular population but we may or may not have full population information (population standard deviation) available to us.

Motivating example:

Suppose that we wanted to detect whether a sample mean in a hospital is higher than the national mean of the RDI (respiratory disturbance index) in the context of sleep apnea. Assume normality and that the sample in question has a standard deviation of \(4\).

We test whether the population mean is bigger than a reference accepted value (\(\mu_0=30\)) based on the mean obtained from sample \((\bar X)\).

We compare the sample mean, \(\bar X\) to a constant \(C\) so that the probability of a type I error is \(\alpha\) (typically \(0.05\) or \(5\%\)):

\(0.05 = \Pr\left (\bar X \geq C \mid \mu_o =30 \right)\)

Normalizing:

\(0.05\,=\Pr \left (\frac{\bar X - \mu_o}{s/\sqrt{n}}\,\geq\frac{C - \mu_o}{s/\sqrt{n}} \mid \mu_o=30 \right)\)

For this to be true the second part of the inequality must be equal to:

\(\frac{C - \mu_o}{s/\sqrt{n}}\) = qnorm(0.95)= 1.6448

And the first part of the inequality is really the Z-statistic:

\(\text{Z test statistic} = \frac{\bar X - \mu_o}{s/\sqrt{n}}\)

Before we can do a Z-test, in general we need to make check if we can reasonably treat the mean of this sample as normally distributed (also for a t-test) (see discussion for two samples here), and we have enough data for a Z test:

  1. The data comes from a normal distribution.
  2. How much data do we need? Many textbooks use \(30\) data points as a rule of thumb.
set.seed(2021)
n <- 100
sample <- rnorm(n, 35, 5) # The actual sample values.
par(mar=c(3,3,3,3))
par(mgp=c(1,0.2,0))
qqnorm(sample, pch=19, cex=0.5, cex.main = .9, cex.lab = .5, cex.axis =.4, tck = -.01, col='red4')
qqline(sample)

Here is the Z-test using the package BSDA. First (even though we are deviating from the example above) the two-sided alternative):

library(BSDA)
sigma <- 4   # Population standard deviation:
mu    <- 30  # Population mean
z.test(sample, mu=mu, sigma.x=sigma)
## 
##  One-sample z-Test
## 
## data:  sample
## z = 10.319, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 30
## 95 percent confidence interval:
##  33.34359 34.91156
## sample estimates:
## mean of x 
##  34.12758

Manually, and following this post:

# calculate the z-statistic
paste('The z statistic is', z_stat <- (mean(sample) - mu) / (sigma / sqrt(n)))
## [1] "The z statistic is 10.3189449413837"
paste('The p-value is', 2 * (1 - pnorm(z_stat)))
## [1] "The p-value is 0"
paste('The CI is between', mean(sample) - qnorm(0.975) * sigma / sqrt(n),'and', mean(sample) + qnorm(0.975) * sigma / sqrt(n) )
## [1] "The CI is between 33.3435923827374 and 34.9115635703695"

And finally, the one-sided test assesses whether the mean is larger than the population mean:

z.test(sample, mu=mu, sigma.x=sigma, alternative='greater')
## 
##  One-sample z-Test
## 
## data:  sample
## z = 10.319, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 30
## 95 percent confidence interval:
##  33.46964       NA
## sample estimates:
## mean of x 
##  34.12758

which manually would work like:

paste('The z statistic is', z_stat <- (mean(sample) - mu) / (sigma / sqrt(n)))
## [1] "The z statistic is 10.3189449413837"
paste('The p-value is', 1 - pnorm(z_stat))
## [1] "The p-value is 0"
paste('The CI is between', mean(sample) - qnorm(0.95) * sigma / sqrt(n),'and infinity.')
## [1] "The CI is between 33.4696365257729 and infinity."

If \(n\) is small we will do exactly the same thing but with a \(t\) test:

\(\frac{C - \mu_o}{s/\sqrt{n}}\) = qt(0.95, n - 1)

To test whether the sample mean is different to the population mean:

For instance, if the sample mean is greater:

n <- 15
sample <- rnorm(n, 35, 5)
t.test(sample, mu=30)
## 
##  One Sample t-test
## 
## data:  sample
## t = 2.2079, df = 14, p-value = 0.04443
## alternative hypothesis: true mean is not equal to 30
## 95 percent confidence interval:
##  30.07617 35.25081
## sample estimates:
## mean of x 
##  32.66349

For a one-sided test to assess if the sample mean is greater:

set.seed(2021)
n <- 15
sample <- rnorm(n, 35, 5)
t.test(sample, mu=30, alternative = 'greater')
## 
##  One Sample t-test
## 
## data:  sample
## t = 5.2614, df = 14, p-value = 6.016e-05
## alternative hypothesis: true mean is greater than 30
## 95 percent confidence interval:
##  34.42933      Inf
## sample estimates:
## mean of x 
##  36.65826

what if we don’t have the sample vector, but rather just the sample standard deviation and the mean?

Let’s say the mean is 36.65826, which is mean(sample) as above, but instead of plugging in the sample into the function, we just have the \(\bar x\) value or sample mean, and the sample standard deviation (or the population’s, perhaps).

Incredibly we need an ad hoc function! Which for a one-sided (‘greater’) test involves the following calculation:

mu <- 30                 # Population mean
x.bar <- mean(sample)    # Sample mean
n <- 15                  # Sample size
se <- sd(sample)         # Sample SD
t <- (x.bar - mu) / (se/sqrt(n)) # t statistic

# Which is greater than:
qt(0.95, n - 1) # We test for a 'greater' alternative hypothesis.
## [1] 1.76131
# This corresponds to a p-value of
1 - pt(abs(t), n - 1)
## [1] 6.016346e-05
# And for lower bound of confidence intervals (see below for formula derivation):

x.bar - qt(0.95, n - 1) * se/sqrt(n)
## [1] 34.42933

[What follows was first posted on CrossValidated]

Example:

We know that mean glucose concentration in blood is 100 mg/dL, and we are studying an ethnic group with very low prevalence of metabolic syndrome. We sample 100 of these individuals, and we get a mean [glucose] = 98 mg/dL with a sd = 10 mg/dL. Is it reasonable to conclude with a risk alpha of 5% that this group has lower glucose levels than the general population?

When sample size is larger than 30, we can consider using a z-test. Typically the standard deviation of the sample will have to be used to substitute for the unknown parameter. In the example above we want a left-tailed test: \(\small H_1: \bar X < \mu\). We use the central limit theorem, which tells us that the sample distribution of the sample means has a standard error, \(\small SE=\frac{\sigma}{\sqrt n}\), with \(\sigma\) corresponding to the population standard deviation, and we get the following statistic:

\[z = \frac{\bar X - \mu}{\frac{\sigma}{\sqrt n}}\] If \(\sigma\) is unknown we have to use the unbiased SD of the sample \(z = \frac{\bar X - \mu}{\frac{s}{\sqrt n}}.\)

In our example: z= (98-100)/(10/sqrt(100)) = -2, which needs to be compared to qnorm(0.05) = -1.644854. Therefore we reject the \(\small H_0: \bar X=\mu\).

x.bar = 98
mu = 100
s = 10
n = 100
paste('The z statistic is', z_stat <- (x.bar - mu) / (s / sqrt(n)))
## [1] "The z statistic is -2"
paste('The p-value is', pnorm(z_stat))
## [1] "The p-value is 0.0227501319481792"

Is the Mean of a Small Sample Consistent with Known Population Mean?

Consider the example above, but with a sample size of n = 16 individuals. The distribution of the same statistic would follow a \(t\)-distribution, and the test applied is the “one-sample \(t\) test”:

\(t = \frac{\bar X - \mu}{\frac{s}{\sqrt n}}\).

So, the statistic is calculated identically if we have been using the sample SD all along, but now we go to the \(t\) tables: n = 16; qt(0.05, n - 1) = -1.75305 (n - 1 corresponds to the degrees of freedom). Now we can reject the null.

x.bar = 98
mu = 100
s = 10
n = 16
paste('The t statistic is', t_stat <- (x.bar - mu) / (s / sqrt(n)))
## [1] "The t statistic is -0.8"
paste('The p-value is', pt(t_stat, n - 1))
## [1] "The p-value is 0.218098969581344"

Confidence Interval Calculation:

Based on the prior example and with n = 16 we want to calculate the 95% percent confidence interval around the sample mean. In this case we don’t know the population mean, but we will assume that if we were to repeat the sampling from the population 100 times, in 95% of the times, it would be comprised within the interval. The formula is derived from the two-tailed \(t\) test:

Under \(H_0\), \(\left| \frac{\bar X - \mu}{\frac{\sigma}{\sqrt n}}\right|\leq t_{(1-\alpha/2, \,n - 1)}\). Hence, \(\left| \bar X - \mu\right|\leq t_{(1-\alpha/2, \,n - 1)}\, \frac{\sigma}{\sqrt n}\), or:

\[\bar X- t_{(1-\alpha/2, \,n - 1)}\, \frac{\sigma}{\sqrt n}\, <\mu<\bar X + t_{(1-\alpha/2, \,n - 1)}\, \frac{\sigma}{\sqrt n}\]

In R,

x.bar + c(-1,1) * qt(1 - 0.5/2, n - 1) * s / sqrt(n)
## [1] 96.27201 99.72799

Alternatively, the confidence interval can be calculated using the normal distribution in larger samples (let’s imagine n=100), using the formula: n = 100; 98 + c(-1,1) * qnorm(1 - 0.5/2) * 10/sqrt(n).


Altogether, this would be the calculation in R:

mu <- 100                 # Population mean
x.bar <- 98               # Sample mean
n <- 16                   # Sample size
sd <- 10                  # Sample SD
(t <- (x.bar - mu) / (sd/sqrt(n))) # t statistic
## [1] -0.8
# Which is not lower than:
qt(0.05, n - 1)
## [1] -1.75305
# And therefore we don't reject the null.

# This corresponds to a p-value of
pt(t, n - 1)
## [1] 0.218099
# And for upper bound of confidence intervals:

x.bar + qt(0.05, n - 1) * sd/sqrt(n)
## [1] 93.61737

When we have summary statistics only but two means to compare we can use the function:

# m1, m2: the sample means
# s1, s2: the sample standard deviations
# n1, n2: the same sizes
# m0: the null value for the difference in means to be tested for. Default is 0. 
# equal.variance: whether or not to assume equal variance. Default is FALSE. 
t.test2 <- function(m1,m2,s1,s2,n1,n2,m0=0,equal.variance=FALSE)
{
    if( equal.variance==FALSE ) 
    {
        se <- sqrt( (s1^2/n1) + (s2^2/n2) )
        # welch-satterthwaite df
        df <- ( (s1^2/n1 + s2^2/n2)^2 )/( (s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1) )
    } else
    {
        # pooled standard deviation, scaled by the sample sizes
        se <- sqrt( (1/n1 + 1/n2) * ((n1-1)*s1^2 + (n2-1)*s2^2)/(n1+n2-2) ) 
        df <- n1+n2-2
    }      
    t <- (m1-m2-m0)/se 
    dat <- c(m1-m2, se, t, 2*pt(-abs(t),df))    
    names(dat) <- c("Difference of means", "Std Error", "t", "p-value")
    return(dat) 
}

from this post.


References:

Basic Inferential Calculations with only Summary Statistics


Home Page

NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.