The one-sample z-test (or t-test) is used when we want to know whether our sample comes from a particular population but we do not have full population information available to us.

We test whether the population mean is bigger than a reference accepted value (\(\mu_o\)) based on the mean obtained from sample (\(\bar X\)).

We compare the sample mean, \(\bar X\) to a constant \(C\) so that the probability of a type I error is \(\alpha\), which is typically \(0.05\) or \(5\%\):

\(0.05 = p\,(\bar X \geq C) | \mu_o=30\)

Normalizing:

\(0.05\,=\,p(\frac{\bar X - \mu_o}{s/\sqrt{n}}\,\geq\frac{C - \mu_o}{s/\sqrt{n}}|\mu_o=30)\)

And the first part of the inequality is really the *Z-statistic*:

Z test statistic = \(\Large \frac{\bar X - \mu_o}{s/\sqrt{n}}\)

based on the central limit theorem.

For this to be true the second part of the inequality must be equal to:

\(\Large \frac{C - \mu_o}{s/\sqrt{n}}\) `= qnorm(0.95)`

If \(n\) is small we will do exactly the same thing but with a \(t\) test:

\(\Large \frac{C - \mu_o}{s/\sqrt{n}}\) `= qt(0.95, n - 1)`

**Is the Mean of a Large Sample Consistent with Known Population Mean?**

[What follows was first posted on CrossValidated]

Example: We know that mean glucose concentration in blood is `100 mg/dL`

, and we are studying an ethnic group with very low prevalence of metabolic syndrome. We sample 100 of these individuals, and we get a `mean [glucose] = 98 mg/dL`

with a `sd = 10 mg/dL`

. Is it reasonable to conclude with a risk alpha of 5% that this group has lower glucose levels than the general population?

When sample size is larger than 30, we can consider using a z-test. Typically the standard deviation of the sample will have to be used to substitute for the unknown parameter. In the example above we want a left-tailed test: \(\small H_1: \bar X < \mu\). We use the central limit theorem, which tells us that the sample distribution of the sample means has a standard error, \(\small SE=\frac{\sigma}{\sqrt n}\), with \(\sigma\) corresponding to the population standard deviation, and we get the following statistic:

\(\large z = \frac{\bar X - \mu}{\frac{\sigma}{\sqrt n}}\). Again, since \(\sigma\) is unknown we have to use the unbiased SD of the sample.

What would happen if we sampled multiple times from the presumed population (Monte Carlo): `means <- replicate(1e6, mean(rnorm(100, 100, 10)))`

? We would end up with the following density with the area of type 1 error shaded in red:

In our example: `z= (98-100)/(10/sqrt(100)) = -2`

, which needs to be compared to `qnorm(0.05) = -1.644854`

. Therefore we reject the \(\small H_0: \bar X=\mu\).

**Is the Mean of a Small Sample Consistent with Known Population Mean?**

Consider the example above, but with a sample size of `n = 16`

individuals. The distribution of the same statistic would follow a \(t\)-distribution, and the test applied is the “one-sample \(t\) test”:

\(t = \frac{\bar X - \mu}{\frac{\sigma}{\sqrt n}}\). So, identical statistic, but now we go to the \(t\) tables: `n = 16; qt(0.05, n - 1) = -1.75305`

(`n - 1`

corresponds to the degrees of freedom). We reject the null just as well, but notice that this time it was a bit harder to reject (broader tails of the \(t\) distribution).

**Confidence Interval Calculation:**

Based on the prior example and with `n = 16`

we want to calculate the 95% percent confidence interval around the sample mean. In this case we don’t know the population mean, but we will assume that if we were to repeat the sampling from the population 100 times, in 95% of the times, it would be comprised within the interval. The formula is derived from the two-tailed \(t\) test:

Under \(H_0\),

\(\left| \frac{\bar X - \mu}{\frac{\sigma}{\sqrt n}}\right|\leq t_{(1-\alpha/2, \,n - 1)}\). Hence, \(\left| \bar X - \mu\right|\leq t_{(1-\alpha/2, \,n - 1)}\, \frac{\sigma}{\sqrt n}\), or:

\[\bar X- t_{(1-\alpha/2, \,n - 1)}\, \frac{\sigma}{\sqrt n}\, <\mu<\bar X + t_{(1-\alpha/2, \,n - 1)}\, \frac{\sigma}{\sqrt n}\]

In R, `n = 16; 98 + c(-1,1) * qt(1 - 0.5/2, n -1) * 10/sqrt(n) = 96.27201 to 99.72799`

Alternatively, the confidence interval can be calculated using the normal distribution in larger samples (let’s imagine `n=100`

), using the formula: `n = 100; 98 + c(-1,1) * qnorm(1 - 0.5/2) * 10/sqrt(n)`

.

References:

Basic Inferential Calculations with only Summary Statistics