It defines probabilities as favorable events over all possible events, and it works well when all outcomes are equally likely, i.e. probability of getting a sum of \(4\) when rolling of two dice.
It measures the relative frequency of an event in a hypothetical infinite sequence of experiments, relying on the law of large numbers.
Bayesian probability comes in response to the limitations of the frequentist paradigm in definining an infinite sequence of experiments in cases such as \(\mathbb P(\text{tomorrow will rain})\), or \(\mathbb P(\text{the die is fair})\). In this latter case, rolling the die over and over, will not give an answer to the question - the die is fair or biased with a probability of \(1\) or \(0\): \(\mathbb P(\text{fair})=\{0,1\}\).
The Bayesian perspective is a subjective measure of uncertainty.
Probabilities can be quantified by thinking of a “fair bet”. For example, if we assess the \(\text{odds}(\text{rain})=1:4\), and we stand to win \(n=\$10\) if it rains, we should pay up to \(m=n\times \text{odds}(\text{rain})= 10 \times \frac{1}{4}= \frac{5}{2} =\$2.5\). In other words, if we pay up to \(m=\$2.5\) to play a game that pays \(n=\$10\), the subjective odds are \(m/n=\frac{\$2.5}{\$10}=\frac{1}{4}=0.25\). The probability of winning is \(\mathbb P(\text{win})=\frac{1/4}{1 + 1/4}=1/5\).
A fair bet could be taken in either direction: in the prior bet, if we stand to pay \(\$2.5\) betting that it will rain with the idea of winning \(\$10\) if it does, we should also be ready to pay \(\$10\) if it doesn’t rain in exchange for winning \(\$2.5\) if it does. Hunger for risk has no bearing on the concept, which is predicated on expectations:
\(\mathbb E(\text{return})=\mathbb P(\text{rain})\times \text{gain} - \mathbb P(\text{no rain})\times \text{loss}= \frac{1}{5} \times 10 - \frac{4}{5} \times \frac{5}{2}=\color{red}{0}\).
The idea is that if we were to place this bet over and over again, the return would be \(0\), being a fair bet.
From Herbert Lee’s Coursera course on Bayesian statistics: “Frequentist confidence intervals have the interpretation that”If you were to repeat many times the process of collecting data and computing a 95% confidence interval, then on average about 95% of those intervals would contain the true parameter value; however, once you observe data and compute an interval the true value is either in the interval or it is not, but you can’t tell which.” Bayesian credible intervals have the interpretation that “Your posterior probability that the parameter is in a 95% credible interval is 95%.””
1. Frequentist:
We observe \(5\) tosses and we want to know if it was a fair die or a loaded die with a \(\mathbb P(\text{H})=0.7\).
Parameter to estimate: \(\theta=\{\text{fair die},\text{loaded die}\}\)
Method MLE:
The PDF for the entire set of data conditioned on the value of \(\theta\):
\[f(X|\theta)=\binom{5}{x}\,0.5^5\,I_{\theta=\text{fair}} + \binom{5}{x}\,0.7^x\,0.3^{5-x}\,I_{\theta=\text{loaded}}\]
Hence, in the case of observing \(2\) heads, the density function as a function of \(\theta\) will be:
\[L(\theta|X=2)=0.3125\,I_{\theta=\text{fair}} + 0.1323\,I_{\theta=\text{loaded}}\]
The MLE is identical to the PDF, but thought of as a function of \(\theta\) given the data, and it will no longer be a PDF. We will choose the \(\theta\) that maximizes the likelihood:
Therefore, the \(\text{MLE}(\hat \theta)=\text{fair}\). How sure are we? It is difficult to answer.
In the frequentist paradigm, \(\mathbb P(\theta=\text{fair}|X=2)=\mathbb P(\text{fair})\) because the coin is either fair or not.
2. Bayesian:
We incorporate a prior belief based on the knowledge of the person conducting the experiment, for example, that \(\mathbb P(\text{loaded})=0.6\). This is the prior.
Method: Bayes theorem:
\[f(\theta|X)=\frac{f(X|\theta)\,f(\theta)}{\displaystyle \sum_\theta f(X|\theta)f(\theta)}\]
\[f(\theta|X)=\frac{\binom{5}{x}\left[\,0.5^5\,\color{blue}{0.4}\,I_{\theta=\text{fair}} + 0.7^x\,0.3^{5-x}\,\color{blue}{0.6}\,I_{\theta=\text{loaded}}\,\right]}{\binom{5}{x}\left[\,0.5^5\,\color{blue}{0.4} + 0.7^x\,0.3^{5-x}\,\color{blue}{0.6}\,\right]}\]
Having observed \(2\) heads:
\[f(\theta|X)=\frac{0.0125\,I_{\theta=\text{fair}} + 0.079\,I_{\theta=\text{loaded}}}{0.0125+0.0079}=0.612\,I_{\theta=\text{fair}} + 0.388\,I_{\theta=\text{loaded}}\].
The denominator is a normalizing constant, so that we get values that add up to \(1\), i.e. we get probabilities: posterior probabilities:
\[\mathbb P(\theta=\text{loaded}|X=2)=0.388\].
\[f(\theta|X)=\frac{f(X|\theta)\,f(\theta)}{f(x)} =\frac{f(X|\theta)\,f(\theta)}{\int f(X|\theta)f(\theta)d\theta}=\frac{\text{likelihood}\times\text{prior}}{\text{normalizing constant}}\sim \text{likelihood}\times\text{prior}\]
In the case of a coin with an uninformative prior:
\(\theta\sim U[0,1]\) and \(f(\theta)=I_{0\leq\theta\leq1}\)
If we get \(1\) head in a single toss:
\[f(\theta|X=1)=\frac{\theta^1(1-\theta)^0I_{(0\leq\theta\leq1)}}{\int_{-\infty}^\infty \theta^1(1-\theta)^0I_{(0\leq\theta\leq1)}d\theta}=\frac{\theta^1I_{(0\leq\theta\leq1)}}{\int_0^1 \theta^1I_{(0\leq\theta\leq1)}d\theta}=2\theta I_{(0\leq\theta\leq1)}\tag 1\]
Since it is a uniform distribution, it is immediate:
\(\mathbb P(0.025<\theta<0.975)=0.95\)
and
\(\mathbb P(theta>0.05)=0.95\)
\(\mathbb P(0.025<\theta<0.975)=\int_.025^.975 2\theta d\theta= .975^2-.025^2=.95\)
\(\mathbb P(\theta>0.05)=1 - 0.05^2=0.9975\)
It’s the equivalent of the CI of the frequentist paradigm:
1. Equal tailed interval:
From equation 1:
\(\mathbb P(\theta<\text{quantile}|X=1)=\int_0^q 2\theta d\theta=q^2\). Hence,
\(\mathbb P\left(\sqrt{.025}<\theta<\sqrt{.975}\right)=P\left(.158<\theta<.987\right)=.95\)
We can say, “under the posterior, there is a \(95\%\) probability that \(\theta\) is in the interval \([.158,.967]\).”
2. Highest posterior density (HPD):
Shortest possible interval that contains the \(95\%\) probability - not necessarily split equally between tails:
\(\mathbb P(\theta>\sqrt{.05}|X=1)=\mathbb P(\theta>.224|X=1)=.95\)
So a \(\theta\) parameter between \([.224,1]\) carries a probability of \(95\%\).
Any beta prior will give a beta posterior when the likelihood is Bernoulli or Binomial.
We can use a uniform prior (which is the same as a \(Beta(1,1)\)) and get a beta posterior:
The likelihood is:
\(f(X|\theta)= \theta^{\sum x_i}(1-\theta)^{n-\sum x_i}\)
The prior is just a uniform: \(f(\theta) = I_{0\leq \theta \leq 1}\).
So the posterior is:
\[\begin{align} \displaystyle f(\theta|X) &= \frac{\theta^{\sum x_i}\,(1-\theta)^{n-\sum x_i}\, I_{0\leq \theta \leq 1}}{\int_0^1 \theta^{\sum x_i}\,(1-\theta)^{n-\sum x_i}\, d\theta} \\\\ &=\frac{\theta^{\sum x_i}\,(1-\theta)^{n-\sum x_i}\, I_{0\leq \theta \leq 1}}{\frac{\Gamma (\sum x_i + 1)\,\Gamma(n-\sum x_i +1 )}{\Gamma(n+2)} \color{red}{\int_0^1 {\frac{\Gamma(n+2)}{\Gamma (\sum x_i + 1)\,\Gamma(n-\sum x_i +1 )} \theta^{\sum x_i}\,(1-\theta)^{n-\sum x_i}\, d\theta}}} \\\\ &=\frac{\Gamma(n+2)}{\Gamma (\sum x_i + 1)\,\Gamma(n-\sum x_i +1 )} \, \theta^{\sum x_i}\,(1-\theta)^{n-\sum x_i}\, I_{0\leq \theta \leq 1} \end{align}\].
Everything in red is the pdf of a beta, and hence it integrates to \(1\).
The end expression indicates that:
\(f(\theta|X) \sim \text{Beta} (\color{green}{\sum x_i} + \color{red}{1}, \,\,\, \color{blue}{n} - \color{green}{\sum x_i} + \color{red}{1})\tag {*}\)
Generalizing to any beta (not just the uniform prior) the prior will be:
\(\color{blue}{f(\theta)={\large \frac{\Gamma(\alpha+\beta)}{\Gamma (\alpha)\,\Gamma(\beta)} \, \theta^{\alpha-1}\,(1-\theta)^{\beta - 1} I_{0\leq \theta \leq 1}}}\)
Hence, the posterior:
\[\begin{align} f(\theta|X) &\propto f(X|\theta)\, f(\theta)\\\\ &= \theta^{\sum x_i}\,(1-\theta)^{n-\sum x_i} \color{blue}{\large \frac{\Gamma(\alpha+\beta)}{\Gamma (\alpha)\,\Gamma(\beta)} \, \theta^{\alpha-1}\,(1-\theta)^{\beta - 1} I_{0\leq \theta \leq 1}}\\\\ &\propto \theta^{\,\alpha + \sum x_i - 1}\,(1-\theta)^{\,\beta + n - \sum x_i - 1}\,I_{0\leq \theta \leq 1}\\\\ &\theta|X \sim Beta(\color{green}{\sum x_i} + \color{red}{\alpha},\,\,\, \color{blue}{n} - \color{green}{\sum x_i} + \color{red}{\beta}) \end{align}\]
Notice how when the \(\alpha\) and \(\beta\) are \(1\) we get the expression (*).
These hyperparameters \(\alpha\) and \(\beta\) determine the prior effective sample size: \(\alpha + \beta\), and can be interpreted as prior “successes” and “failures”, respectively.
The mean of a beta distribution is \(\frac{\alpha}{\alpha+\beta}\).
The posterior mean is:
\[\begin{align} \text{Posterior mean} &= \frac{\alpha + \sum x_i}{\alpha + \sum x_i + \beta + n - \sum x_i}\\\\ &=\frac{\alpha + \sum x_i}{\alpha + \beta + n}\\\\ &= \color{red}{\frac{\alpha + \beta}{\alpha + \beta + n}}\color{blue} {\frac{\alpha}{\alpha + \beta}} + \color{red}{\frac{n}{\alpha + \beta + n}}\color{blue}{\frac{\sum x_i}{n}}\\\\ &=\color{red}{\text{prior weight}}\times\color{blue}{\text{prior mean}} \,+\,\color{red}{\text{data weight}}\times\color{blue}{\text{data mean}} \end{align}\]
Notice how the weight fittingly add up to \(1\).
The likelihood is:
\(\large f(X \arrowvert \lambda) = \frac{\lambda^{\sum x_i} e^{-n \lambda}}{\displaystyle \Pi_{i=1}^n x_i!}\)
The distribution that simulates this expression is the gamma. A gamma prior ((, )) is going to look like:
\(\large f(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha - 1\, e^{-\beta \lambda}}\)
The posterior will be:
\(\large f(\lambda\arrowvert X) \propto f(X\arrowvert \lambda)\,f(\lambda)\propto \lambda^{\sum X_i}\,e^{-n\lambda}\, \lambda^{\alpha - 1} e^{-\beta \lambda} = \lambda^{(\alpha + \sum x_i)-1}\,e^{-(\beta + n)\lambda}\)
or
\(\large \Gamma(\alpha+\sum x_i, \beta +n)\)
Since the mean of a Gamma distribution is \(\alpha/\beta\) the posterior mean is:
\(\large \frac{\alpha + \sum x_i}{\beta +n} = \frac{\beta}{\beta + n}\frac{\alpha}{\beta}+\frac{n}{\beta + n}\frac{\sum x_i}{n}\), or a weighted average of the prior mean and the data mean. The sample size for the data is \(n\) and the effective sample size for the prior is \(\beta\).
Strategies to choose the prior parameters:
Prior mean: E.g. number of chips per cookie, corresponding to \(\alpha/\beta\). To further determine each parameter, we can think of the prior standard deviation as \(\sqrt \alpha/\beta\).
Effective of sample size (\(\beta\)): We can think of it as how much certainty we place in the prior as countered by the information in the data, \(n\).
Vague prior - flat prior. We can have a \(\Gamma(\epsilon,\epsilon)\) with a small \(\epsilon\), greater than \(0\). The posterior mean would be \(\frac{\epsilon + \sum x_i}{\epsilon + n}\), which is approximately the data mean.
The conjugate prior is a gamma distribution.
\(X \sim Exp(\lambda)\). The prior mean will be \(1/\lambda\). Hence we have to adjust the parameters of the gamma so that \(\alpha/\beta = 1/\lambda\). The prior standard deviation will be \(\frac{\alpha}{\beta}\).
If we say that a bus arrives on average every \(10\) minutes, we have a mean of the rate parameter of \(1/10\). We can think of a gamma prior of \(\Gamma(100,1000)\), which will have a mean of \(1/10\) and a standard deviation of \(1/100\). Hence the prior mean will be \(0.1 \pm 0.02\).
\(\large f(\lambda \arrowvert x) \propto f(X \arrowvert \lambda)f(\lambda) \propto \lambda e^{-\lambda y}\lambda^{\alpha-1} e^{-\beta \lambda} = \lambda^{(\alpha+1)-1}\,e^{-(\beta x)\lambda}\)
Say that the bus comes after a waiting time of \(12\) minutes once, the posterior will follow a distribution: \(\large f(\lambda \arrowvert x) \sim \Gamma (\alpha+1, \beta + x)\). Pluggin in the values, we end up with a posterior \(\lambda \arrowvert x ~\propto \Gamma(101, 1012)\) with a posterior mean of \(\frac{101}{1012}\).
Normal is conjugate for itself. So for data following \(N(\mu, \sigma^2)\), there will be a prior distribution \(\mu \sim N(m_o,\S_o^2)\).
The posterior will be \(f(\mu \vert X) \propto f(X \vert \mu) f(\mu).\)
The posterior distribution will end up being normally distributed as:
\[\mu \vert X \sim N \left ( \frac{\frac{n\bar x}{\sigma_o^2} +\frac{m_0}{S_o^2}}{\frac{n}{\sigma_o^2} +\frac{1}{S_0^2}} , \frac{1}{\frac{n}{S_0^2}+\frac{1}{S_0^2}} \right )\]
The posterior mean can be written as:
\[\frac{\frac{n}{\sigma_o^2} }{\frac{n}{\sigma_o^2} +\frac{1}{S_0^2}} \bar x + \frac{ \frac{1}{S_o^2}}{\frac{n}{\sigma_o^2} +\frac{1}{S_0^2}}m_0\]
Rearranging we get a weighted mean:
\[\frac{n}{n+\frac{\sigma_0^2}{S_0^2}}\bar x + \frac{\frac{\sigma_0^2}{S_0^2}}{n + \frac{\sigma_0^2}{S_0^2}}m\]
We specify a prior in a hierarchial fashion. The data \(X_i \vert \mu, \sigma^2 \overset{iid}{\sim} N(\mu,\sigma^2)\), and we specify a prior for \(\mu\) conditional on \(\sigma^2\): \(\mu \vert \sigma^2 \sim N(m, \frac{\sigma^2}{w})\). \(w = \frac{\sigma^2}{\sigma_\mu^2}\) is the effective sample size. The prior for \(\sigma^2 \sim \Gamma^{-1}(\alpha, \beta)\).
The posterior for \(\sigma^2\) is:
\(\sigma^2 \vert x \sim \Gamma^{-1}\left(\alpha + \frac{n}{2}, \beta + \frac{1}{2} \sum_{i=1}^n (x_i -\bar x)^2 + \frac{nw}{2(n+w)}(\bar x - m)^2 \right)\)
while the posterior distribution for \(\mu\) is:
\(\mu \vert \sigma^2 \sim N\left( \frac{n\bar x + wm}{n+w}, \frac{\sigma^2}{n+w}\right)\).
The mean can be written as a weighted average of the prior mean and the data mean:
\(\frac{n\bar x + wm}{n+w}= \frac{w}{n+w}m+\frac{n}{n+w}\bar x\).
We can marginalized \(\sigma^2\), and get a \(\mu \vert X \sim t\).
We can have a prior for a coin toss (\(Y_i \sim B(\theta)\)) that follows \(\theta \sim U[0,1]=\text{Beta}(1,1)\). The effective sample size of a beta prior is the sum of its parameters, which in this case is \(2\) - not a completely non-informative prior. We can use \(\text{Beta}(1/2,1/2)\), or \((.001,.001)\), or \((0,0)\). This latter case would have a density \(\text{\Beta}(0,0) \sim f(\theta) \propto \theta^{-1}(1-\theta)^{-1}\), which does not integrate to \(1\) (not a true density). It is an improper prior, but we can still use it, and get a posterior \(f(\theta \vert y) \propto \theta^{y-1}(1-\theta)^{n-y-1} \sim \text{Beta}(y, n-y)\). This posterior is proper.
The posterior mean will be \(\frac{y}{n}\), which is the MLE of \(\hat\theta\): \(\hat \theta = \frac{y}{n}\), as in the frequentist approach.
We can take a vague prior with a huge variance: \(\mu \sim N(0,1000000^2)\) At the limit of the variance, we spread across the real line, getting an improper prior \(f(\mu) \propto 1\).
But the posterior will be:
\(f(\mu \vert y) \propto f(y \vert \mu) f(\mu) \propto \text{exp} \left\{ \frac{1}{-2\frac{\sigma^2}{n}} (\mu - \bar y)^2\right\}\).
This is to say \(\mu \vert y \sim N(\bar y, \sigma^2/n)\).
If the variance is unknown the non-informative prior is \(f(\sigma^2) \propto \frac{1}{\sigma^2}=\Gamma^{-1}(0,0)\). It is improper and uniform in log scale of \(\sigma^2\). The posterior is:
\(\sigma^2 \vert y \sim \Gamma^{-1}\left(\frac{n-1}{2}, 1/2 \sum(y_i - \bar y)^2\right)\)
To address parametrization differences.
\(f(\theta) \propto \sqrt{I(\theta)}\) or square root of the Fisher information. In most cases it will be an improper prior.
For normal data, the Jeffrey’s prior is \(f(\mu) \propto 1\) (uniform for mu), and \(f(\sigma^2) \propto \frac{1}{\sigma^2}\) (uniform on the log scale).
For Bernouilli or binomial, the Jeffrey’s prior is \(f(\theta) \propto \theta^{-1/2}(1- \theta)^{-1/2} \sim \text{Beta}(1/2,1/2)\).
NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.