Specific derivations can be found here.

From this post:

We want to estimate the mean and variance of the stem diameters (in mm) of Pinus radiata trees based on twelve observations, and using a normal model:

The point of departure is the pdf of the normal distribution:

\[f_X(x \vert \mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp{\frac{-(x-\mu)^2}{2\sigma^2}}\]

Wikipedia Now we look at this function from a different perspective by considering the observed values \(x_1, x_2,\dots, x_n\) to be fixed “parameters” of this function, whereas \(\theta\) will be the function’s variable and allowed to vary freely; this same function will be called the likelihood:

\[\mathcal L( \mu,\sigma ; x_i)=\prod_{i=1}^n f_X(x_i \vert \mu, \sigma)=\prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}\exp{\frac{-(x_i-\mu)^2}{2\sigma^2}}\]

Here the observed values \(x_i\) are fixed, and the function depends on the parameters \(\mu\) and \(\sigma\). The idea is to maximize the value of the formula. However, multiplying small fractions becomes numerically unstable. In comes the log-likelihood function:

the log-likelihood:

\[\ln \mathcal {L}(\theta \,;\,x_{1},\ldots ,x_{n})=\sum _{i=1}^{n}\ln f_X(x_{i}\vert \theta )\]

The average log-likelihood function is denoted as \(\hat l\):

\[\hat l=\frac {\ln \mathcal L}{n}\]

The method of maximum likelihood estimates by finding a value of \(\theta\) that maximizes \(\hat l (\theta ;x)\). This method of estimation defines a maximum likelihood estimator (MLE) of \(\theta\):

\[\{\hat \theta_{\text{mle}}\} \subseteq \{\underset{\theta \in \Theta }{\operatorname{arg}\text{max}}\ \hat {l}(\theta \,;\,x_{1},\ldots ,x_{n})\}\]

In many instances, there is no closed form, and an computational or iterative procedures will be used. Here is a link to Andrew Ng’s gradient descent will be used.

\[\begin{align}\log \mathcal L( \mu,\sigma ; x_i) &= \sum_{i=1}^n \log\left(\frac{1}{\sqrt{2\pi}\,\sigma}\exp{\frac{-(x_i-\mu)^2}{2\sigma^2}}\right)\\ &= \sum_{i=1}^n\log\left( \frac{1}{\sqrt{2\pi}\,\sigma}\right)+\sum_{i=1}^n\log\left(\exp{\frac{-(x_i-\mu)^2}{2\sigma^2}}\right)\\ &= \sum_{i=1}^n\log\left( (2\pi)^{-1/2}\right) + \sum_{i=1}^n \log\left( (\sigma)^{-1}\right)+\sum_{i=1}^n\log\left(\exp{\frac{-(x_i-\mu)^2}{2\sigma^2}}\right)\\ &=-\frac{n}{2}\log(2\pi)-n\log(\sigma)-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i - \mu)^2 \end{align}\]

Now taking the partial derivative with respect to \(\mu\):

\[\frac{\partial \log L}{\partial \mu}= \frac{1}{2\sigma^2} 2 \sum_{i=1}^n(x_i - \mu)\]

and setting it to zero:

\[0 = \sum_{i=1}^n(x_i - \mu)\]

\[\mu = \frac{\sum_{i=1}^nx_i}{n}\]

Actually, this is the mean, and it is the **sample** mean providing and estimation of \(\mu\) (i.e. \(\hat \mu\)).

We can also take the partial with respect to \(\sigma\):

\[\begin{align}\frac{\partial \log L}{\partial \sigma}&= -\frac{n}{\sigma}+\frac{2}{2}\sum_{i=1}^n(x_i - \mu)^2 \sigma^{-3}\\ &=-\frac{n}{\sigma}+\frac{1}{\sigma^3}\sum_{i=1}^n(x_i - \mu)^2\\ \end{align}\]And setting it to zero:

\[0 = -\frac{n}{\sigma}+\frac{1}{\sigma^3}\sum_{i=1}^n(x_i - \mu)^2\]

Hence,

\[\sigma^2 =\frac{\sum_{i=1}^n (x_i - \mu)^2}{n}\]

which is really a biased estimate of the variance in the population, \(\sigma\) (i.e. \(s^2\)). This bias arises because maximum likelihood estimates do not take into account the loss of degrees of freedom when estimating fixed effects.

```
diams = c(150, 124, 100, 107, 170, 144,
113, 108, 92, 129, 123, 118)
# A simple function to calculate log-likelihood.
# Values will be easier to manage
loglike = function(data, mu, sigma) {
loglike = 0
for(obs in data){
loglike = loglike +
log(1/(sqrt(2*pi)*sigma) *
exp(-1/2 * (obs - mu)^2/(sigma^2)))
}
return(loglike)
}
# Let's try some combinations of parameters and
# plot the results
params = expand.grid(mu = seq(50, 200, 1),
sigma = seq(10, 40, 1))
params$logL = with(params, loglike(diams, mu, sigma))
summary(params)
```

```
## mu sigma logL
## Min. : 50 Min. :10 Min. :-420.42
## 1st Qu.: 87 1st Qu.:17 1st Qu.: -89.43
## Median :125 Median :25 Median : -69.39
## Mean :125 Mean :25 Mean : -85.58
## 3rd Qu.:163 3rd Qu.:33 3rd Qu.: -59.13
## Max. :200 Max. :40 Max. : -53.81
```

```
library(lattice)
contourplot(logL ~ mu*sigma, data = params, cuts = 30, col=2, labels = list(cex = 0.7))
```

```
# Zooming in
params = expand.grid(mu = seq(120, 126, 0.01),
sigma = seq(18, 25, 0.1))
params$logL = with(params, loglike(diams, mu, sigma))
summary(params)
```

```
## mu sigma logL
## Min. :120.0 Min. :18.0 Min. :-54.40
## 1st Qu.:121.5 1st Qu.:19.7 1st Qu.:-54.03
## Median :123.0 Median :21.5 Median :-53.93
## Mean :123.0 Mean :21.5 Mean :-53.96
## 3rd Qu.:124.5 3rd Qu.:23.3 3rd Qu.:-53.87
## Max. :126.0 Max. :25.0 Max. :-53.81
```

```
contourplot(logL ~ mu*sigma, data = params, cuts = 10, col=2,
labels = list(cex = 0.7))
```

```
# Compare to:
mean(diams); sd(diams)
```

`## [1] 123.1667`

`## [1] 22.38438`

From here.

If our experiment is a single Bernoulli trial and we observe \(X = 1\) (success) then the likelihood function is \(\mathcal L(p ; x) = p\) . This function reaches its maximum at \(\hat p =1\). If we observe \(X = 0\) (failure) then the likelihood is \(\mathcal L(p ; x) = 1 − p\), which reaches its maximum at \(\hat p=0\). Of course, it is somewhat silly for us to try to make formal inferences about \(\theta\) on the basis of a single Bernoulli trial; usually multiple trials are available.

Suppose that \(X = (X_1, X_2, \dots, X_n)\) represents the outcomes of \(n\) independent Bernoulli trials, each with success probability \(p\). The likelihood for \(p\) based on \(X\) is defined as the joint probability distribution of \(X = (X_1, X_2, \dots, X_n)\). Since \(X = (X_1, X_2, \dots, X_n)\) are iid random variables, the joint distribution is

\[\begin{align}\mathcal L(\theta;x) &=\prod_{i=1}^{n}f_X(x_i;\theta)=\prod_{i=1}^{n} \theta^x_i(1−\theta)^{1−x}\\[2ex] &=\theta^{x_1}(1-\theta)^{1-x_1}\dots\theta^{x_n}(1-\theta)^{1-x_n}\tag{1}\\ \end{align}\]\((1)\) just reflects the probability mass function of the Bernouilli distribution.

For the log-likelihood,

\[\begin{align}\ln \mathcal L(\theta ; x)&= \ln \theta \left(\sum_{i=1}^n x_i\right) + \ln(1- \theta)\left(n -\sum_{i=1}^n x_i\right)\\[2ex] &=n\bar x \ln \theta + n(1- \bar x)\ln(1- \theta) \end{align}\]Differentiating and setting it up to zero,

\[\begin{align} \frac{\partial}{\partial \theta} \ln \mathcal L(\theta; x) &= n\left(\frac{\bar x}{\theta}-\frac{1-\bar x}{1-\theta}\right) \\[2ex] &=\frac{\bar x (1-\theta) - \theta(1 - \bar x)}{\theta (1-\theta)}=0 \end{align}\]Therefore \(\hat \theta = \bar x\)

Suppose that $ $is an observation from a binomial distribution, \(X \sim Bin(n, p )\), where \(n\) is known and \(p\) is to be estimated. The likelihood function is

\[\mathcal L(p;x)=\frac{n!}{x!(n−x)!}p^x(1−p)^{n−x}\]

which, except for the factor \(\frac{n!}{x!(n−x)!}\), is identical to the likelihood from \(n\) independent Bernoulli trials with \(x=\sum_{i=1}^n x_i\). But since the likelihood function is regarded as a function only of the parameter p, the factor \(\frac{n!}{x!(n−x)!}\) is a fixed constant and does not affect the MLE. Thus the MLE is again \(\hat p = x/n\), the sample proportion of successes.

You get the same value by maximizing the binomial loglikelihood function:

\[l(p;x)=k+x \log p+(n−x) \log(1−p)\]

where \(k\) is a constant that does not involve the parameter \(p.\)

Let’s see what the likelihood function looks like with a binomial experiment yielding \(3 \text{ heads}\) and \(2 \text{ tails}:\)

```
p = seq(0, 1, 0.001)
L = p^3 * (1 - p)^2
plot(p, L, type = "l", xlab="p", ylab="L(p(x))")
max(L)
```

`## [1] 0.03456`

```
d = cbind(p,L)
(MLE = d[which.max(d[,2]), 1])
```

```
## p
## 0.6
```

```
arrows(x0 = MLE, y0 = 0, x1 = MLE, y1 = max(L))
abline(h = 0); abline(h = max(L))
```

```
# As for the log-likelihood:
loglik = 3 * log(p) + 2 * log(1 - p)
plot(p, loglik, type = "l")
d = cbind(p,loglik)
MLE = d[which.max(d[,2]), 1]
arrows(x0 = MLE, y0 = -20, x1 = MLE, y1 = max(loglik))
abline(h = -20); abline(h = max(loglik))
```

The model is:

\[y_i = \alpha + \beta x_i + \epsilon_i\]

Because the errors are independent with mean of \(0\) and unknown variance \(\sigma^2:\)

\[\begin{align}\mathcal L(\alpha, \beta, \sigma^2; x, y)&=f_X(y_1\vert \alpha + \beta x_1, \sigma^2)\quad f_X(y_2\vert \alpha + \beta x_2, \sigma^2)\quad\dots\quad f_X(y_n\vert \alpha + \beta x_n, \sigma^2)\\[2ex] &= \small \frac{1}{\sigma \sqrt{2\pi}}\exp{-\frac{(y_1 - \alpha -\beta x_1)^2}{2\sigma^2}}\frac{1}{\sigma \sqrt{2\pi}}\exp{-\frac{(y_2 - \alpha -\beta x_2)^2}{2\sigma^2}}\dots \frac{1}{\sigma \sqrt{2\pi}}\exp{-\frac{(y_n - \alpha -\beta x_n)^2}{2\sigma^2}}\\[2ex] &=\frac{1}{\sqrt{(2\pi\sigma^2)^n}}\exp -\frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i - (\alpha + \beta x_i\right))^2 \end{align}\]The log-likelihood:

\[\ln \mathcal L(\alpha,\beta,\sigma^2; x,y)=-\frac{n}{2}\ln 2\pi\sigma^2 \color{red}{-} \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i -(\alpha + \beta x_i))^2\]

So when the residuals are minimum (i.e. \(\sum_{i=1}^n(y_i -(\alpha + \beta x_i))^2\)), the log-likelihood will be maximized (due to the negative sign).

Since OLS minimizes residuals, it will correspond to the MLE, provided the errors are following a normal distribution around the mean (the predicted values).

We can estimate the slope \(\beta\) for instance:

\[\begin{align}\frac{\partial\ln \mathcal L(\alpha,\beta,\sigma^2; x,y)}{\partial \beta}&= \frac{1}{\sigma^2}\sum_{i=1}^nx_i(y_i -(\alpha + \beta x_i)) \end{align}\]and setting to zero:

\[\begin{align}0&= \frac{1}{\sigma^2}\sum_{i=1}^nx_i(y_i -(\alpha + \beta x_i))\\ 0&= \color{blue}{\sum_{i=1}^nx_i(y_i -(\alpha + \beta x_i))} \end{align}\] \[\begin{align}0&= \sum_{i=1}^n x_i y_i - x_i \alpha - \beta x_i^2))\\ \sum_{i=1}^n\beta x_i^2 &= \sum_{i=1}^n x_i y_i - x_i \alpha \end{align}\]Therefore,

\[\hat \beta = \frac{\sum_{i=1}^n x_i y_i - x_i \alpha }{\sum_{i=1}^n x_i^2}\]

Now the \(\sum_{i=1}^n(y_i -(\alpha + \beta x_i))\) is the estimated error, and

\[\color{blue}{0=\sum_{i=1}^nx_i(y_i -(\alpha + \beta x_i))}\]

one of the conditions of classical OLS.