Maximum Likelihood Estimation MLE

MLE - Maximum Likelihood Estimation:

Specific derivations can be found here.

We want to estimate the mean and variance of the stem diameters (in mm) of Pinus radiata trees based on twelve observations, and using a normal model:

The point of departure is the pdf of the normal distribution:

\[f_X(x \vert \mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp{\frac{-(x-\mu)^2}{2\sigma^2}}\]

Wikipedia Now we look at this function from a different perspective by considering the observed values $x_1, x_2,\dots, x_n$ to be fixed “parameters” of this function, whereas $\theta$ will be the function’s variable and allowed to vary freely; this same function will be called the likelihood:

\[\mathcal L( \mu,\sigma ; x_i)=\prod_{i=1}^n f_X(x_i \vert \mu, \sigma)=\prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}\exp{\frac{-(x_i-\mu)^2}{2\sigma^2}}\]

Here the observed values $x_i$ are fixed, and the function depends on the parameters $\mu$ and $\sigma$. The idea is to maximize the value of the formula. However, multiplying small fractions becomes numerically unstable. In comes the log-likelihood function:

the log-likelihood:

\[\ln \mathcal {L}(\theta \,;\,x_{1},\ldots ,x_{n})=\sum _{i=1}^{n}\ln f_X(x_{i}\vert \theta )\]

The average log-likelihood function is denoted as $\hat l$:

\[\hat l=\frac {\ln \mathcal L}{n}\]

The method of maximum likelihood estimates by finding a value of $\theta$ that maximizes $\hat l (\theta ;x)$. This method of estimation defines a maximum likelihood estimator (MLE) of $\theta$:

\[\{\hat \theta_{\text{mle}}\} \subseteq \{\underset{\theta \in \Theta }{\operatorname{arg}\text{max}}\ \hat {l}(\theta \,;\,x_{1},\ldots ,x_{n})\}\]

In many instances, there is no closed form, and an computational or iterative procedures will be used. Here is a link to Andrew Ng’s gradient descent will be used.

\[\begin{align}\log \mathcal L( \mu,\sigma ; x_i) &= \sum_{i=1}^n \log\left(\frac{1}{\sqrt{2\pi}\,\sigma}\exp{\frac{-(x_i-\mu)^2}{2\sigma^2}}\right)\\ &= \sum_{i=1}^n\log\left( \frac{1}{\sqrt{2\pi}\,\sigma}\right)+\sum_{i=1}^n\log\left(\exp{\frac{-(x_i-\mu)^2}{2\sigma^2}}\right)\\ &= \sum_{i=1}^n\log\left( (2\pi)^{-1/2}\right) + \sum_{i=1}^n \log\left( (\sigma)^{-1}\right)+\sum_{i=1}^n\log\left(\exp{\frac{-(x_i-\mu)^2}{2\sigma^2}}\right)\\ &=-\frac{n}{2}\log(2\pi)-n\log(\sigma)-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i - \mu)^2 \end{align}\]

Now taking the partial derivative with respect to $\mu$:

\[\frac{\partial \log L}{\partial \mu}= \frac{1}{2\sigma^2} 2 \sum_{i=1}^n(x_i - \mu)\]

and setting it to zero:

\[0 = \sum_{i=1}^n(x_i - \mu)\]

\[\mu = \frac{\sum_{i=1}^nx_i}{n}\]

Actually, this is the mean, and it is the sample mean providing and estimation of $\mu$ (i.e. $\hat \mu$).

We can also take the partial with respect to $\sigma$:

\[\begin{align}\frac{\partial \log L}{\partial \sigma}&= -\frac{n}{\sigma}+\frac{2}{2}\sum_{i=1}^n(x_i - \mu)^2 \sigma^{-3}\\ &=-\frac{n}{\sigma}+\frac{1}{\sigma^3}\sum_{i=1}^n(x_i - \mu)^2\\ \end{align}\]

And setting it to zero:

\[0 = -\frac{n}{\sigma}+\frac{1}{\sigma^3}\sum_{i=1}^n(x_i - \mu)^2\]

Hence,

\[\sigma^2 =\frac{\sum_{i=1}^n (x_i - \mu)^2}{n}\]

which is really a biased estimate of the variance in the population, $\sigma$ (i.e. $s^2$). This bias arises because maximum likelihood estimates do not take into account the loss of degrees of freedom when estimating fixed effects.

diams = c(150, 124, 100, 107, 170, 144,
          113, 108, 92, 129, 123, 118)

# A simple function to calculate log-likelihood.
# Values will be easier to manage
loglike = function(data, mu, sigma) {
  loglike = 0
    for(obs in data){
      loglike = loglike +
                log(1/(sqrt(2*pi)*sigma) *
                       exp(-1/2 * (obs - mu)^2/(sigma^2)))
    }
    return(loglike)
}
# Let's try some combinations of parameters and
# plot the results
params = expand.grid(mu = seq(50, 200, 1),
                     sigma = seq(10, 40, 1))
params$logL = with(params, loglike(diams, mu, sigma))
summary(params)

##        mu          sigma         logL        
##  Min.   : 50   Min.   :10   Min.   :-420.42  
##  1st Qu.: 87   1st Qu.:17   1st Qu.: -89.43  
##  Median :125   Median :25   Median : -69.39  
##  Mean   :125   Mean   :25   Mean   : -85.58  
##  3rd Qu.:163   3rd Qu.:33   3rd Qu.: -59.13  
##  Max.   :200   Max.   :40   Max.   : -53.81

library(lattice)
contourplot(logL ~ mu*sigma, data = params, cuts = 30, col=2, labels = list(cex = 0.7))

# Zooming in
params = expand.grid(mu = seq(120, 126, 0.01),
                     sigma = seq(18, 25, 0.1))
params$logL = with(params, loglike(diams, mu, sigma))
summary(params)

##        mu            sigma           logL       
##  Min.   :120.0   Min.   :18.0   Min.   :-54.40  
##  1st Qu.:121.5   1st Qu.:19.7   1st Qu.:-54.03  
##  Median :123.0   Median :21.5   Median :-53.93  
##  Mean   :123.0   Mean   :21.5   Mean   :-53.96  
##  3rd Qu.:124.5   3rd Qu.:23.3   3rd Qu.:-53.87  
##  Max.   :126.0   Max.   :25.0   Max.   :-53.81

contourplot(logL ~ mu*sigma, data = params, cuts = 10, col=2,
            labels = list(cex = 0.7))

# Compare to:

mean(diams); sd(diams)

## [1] 123.1667

## [1] 22.38438

BERNOULLI AND BINOMIAL TRIALS:

From here.

ML for Bernoulli trials

If our experiment is a single Bernoulli trial and we observe $X = 1$ (success) then the likelihood function is $\mathcal L(p ; x) = p$ . This function reaches its maximum at $\hat p =1$. If we observe $X = 0$ (failure) then the likelihood is $\mathcal L(p ; x) = 1 − p$, which reaches its maximum at $\hat p=0$. Of course, it is somewhat silly for us to try to make formal inferences about $\theta$ on the basis of a single Bernoulli trial; usually multiple trials are available.

Suppose that $X = (X_1, X_2, \dots, X_n)$ represents the outcomes of $n$ independent Bernoulli trials, each with success probability $p$. The likelihood for $p$ based on $X$ is defined as the joint probability distribution of $X = (X_1, X_2, \dots, X_n)$. Since $X = (X_1, X_2, \dots, X_n)$ are iid random variables, the joint distribution is

\[\begin{align}\mathcal L(\theta;x) &=\prod_{i=1}^{n}f_X(x_i;\theta)=\prod_{i=1}^{n} \theta^x_i(1−\theta)^{1−x}\\[2ex] &=\theta^{x_1}(1-\theta)^{1-x_1}\dots\theta^{x_n}(1-\theta)^{1-x_n}\tag{1}\\ \end{align}\]

$(1)$ just reflects the probability mass function of the Bernouilli distribution.

For the log-likelihood,

\[\begin{align}\ln \mathcal L(\theta ; x)&= \ln \theta \left(\sum_{i=1}^n x_i\right) + \ln(1- \theta)\left(n -\sum_{i=1}^n x_i\right)\\[2ex] &=n\bar x \ln \theta + n(1- \bar x)\ln(1- \theta) \end{align}\]

Differentiating and setting it up to zero,

\[\begin{align} \frac{\partial}{\partial \theta} \ln \mathcal L(\theta; x) &= n\left(\frac{\bar x}{\theta}-\frac{1-\bar x}{1-\theta}\right) \\[2ex] &=\frac{\bar x (1-\theta) - \theta(1 - \bar x)}{\theta (1-\theta)}=0 \end{align}\]

Therefore $\hat \theta = \bar x$

ML for Binomial trials:

Suppose that $ $is an observation from a binomial distribution, $X \sim Bin(n, p )$, where $n$ is known and $p$ is to be estimated. The likelihood function is

\[\mathcal L(p;x)=\frac{n!}{x!(n−x)!}p^x(1−p)^{n−x}\]

which, except for the factor $\frac{n!}{x!(n−x)!}$, is identical to the likelihood from $n$ independent Bernoulli trials with $x=\sum_{i=1}^n x_i$. But since the likelihood function is regarded as a function only of the parameter p, the factor $\frac{n!}{x!(n−x)!}$ is a fixed constant and does not affect the MLE. Thus the MLE is again $\hat p = x/n$, the sample proportion of successes.

You get the same value by maximizing the binomial loglikelihood function:

\[l(p;x)=k+x \log p+(n−x) \log(1−p)\]

where $k$ is a constant that does not involve the parameter $p.$

Let’s see what the likelihood function looks like with a binomial experiment yielding $3 \text{ heads}$ and $2 \text{ tails}:$

p = seq(0, 1, 0.001)
L = p^3 * (1 - p)^2
plot(p, L, type = "l", xlab="p", ylab="L(p(x))")
max(L)

## [1] 0.03456

d = cbind(p,L)
(MLE = d[which.max(d[,2]), 1])

##   p 
## 0.6

arrows(x0 = MLE, y0 = 0, x1 = MLE, y1 = max(L))
abline(h = 0); abline(h = max(L))

# As for the log-likelihood:

loglik = 3 * log(p) + 2 * log(1 - p)
plot(p, loglik, type = "l")
d = cbind(p,loglik)
MLE = d[which.max(d[,2]), 1]
arrows(x0 = MLE, y0 = -20, x1 = MLE, y1 = max(loglik))
abline(h = -20); abline(h = max(loglik))

Linear Regression:

The model is:

\[y_i = \alpha + \beta x_i + \epsilon_i\]

Because the errors are independent with mean of $0$ and unknown variance $\sigma^2:$

\[\begin{align}\mathcal L(\alpha, \beta, \sigma^2; x, y)&=f_X(y_1\vert \alpha + \beta x_1, \sigma^2)\quad f_X(y_2\vert \alpha + \beta x_2, \sigma^2)\quad\dots\quad f_X(y_n\vert \alpha + \beta x_n, \sigma^2)\\[2ex] &= \small \frac{1}{\sigma \sqrt{2\pi}}\exp{-\frac{(y_1 - \alpha -\beta x_1)^2}{2\sigma^2}}\frac{1}{\sigma \sqrt{2\pi}}\exp{-\frac{(y_2 - \alpha -\beta x_2)^2}{2\sigma^2}}\dots \frac{1}{\sigma \sqrt{2\pi}}\exp{-\frac{(y_n - \alpha -\beta x_n)^2}{2\sigma^2}}\\[2ex] &=\frac{1}{\sqrt{(2\pi\sigma^2)^n}}\exp -\frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i - (\alpha + \beta x_i\right))^2 \end{align}\]

The log-likelihood:

\[\ln \mathcal L(\alpha,\beta,\sigma^2; x,y)=-\frac{n}{2}\ln 2\pi\sigma^2 \color{red}{-} \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i -(\alpha + \beta x_i))^2\]

So when the residuals are minimum (i.e. $\sum_{i=1}^n(y_i -(\alpha + \beta x_i))^2$), the log-likelihood will be maximized (due to the negative sign).

Since OLS minimizes residuals, it will correspond to the MLE, provided the errors are following a normal distribution around the mean (the predicted values).

We can estimate the slope $\beta$ for instance:

\[\begin{align}\frac{\partial\ln \mathcal L(\alpha,\beta,\sigma^2; x,y)}{\partial \beta}&= \frac{1}{\sigma^2}\sum_{i=1}^nx_i(y_i -(\alpha + \beta x_i)) \end{align}\]

and setting to zero:

\[\begin{align}0&= \frac{1}{\sigma^2}\sum_{i=1}^nx_i(y_i -(\alpha + \beta x_i))\\ 0&= \color{blue}{\sum_{i=1}^nx_i(y_i -(\alpha + \beta x_i))} \end{align}\]

\[\begin{align}0&= \sum_{i=1}^n x_i y_i - x_i \alpha - \beta x_i^2))\\ \sum_{i=1}^n\beta x_i^2 &= \sum_{i=1}^n x_i y_i - x_i \alpha \end{align}\]

Therefore,

\[\hat \beta = \frac{\sum_{i=1}^n x_i y_i - x_i \alpha }{\sum_{i=1}^n x_i^2}\]

Now the $\sum_{i=1}^n(y_i -(\alpha + \beta x_i))$ is the estimated error, and

\[\color{blue}{0=\sum_{i=1}^nx_i(y_i -(\alpha + \beta x_i))}\]

one of the conditions of classical OLS.

Home Page

NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.