Maximum Likelihood Estimation

An intuitive guide to MLE

Maximum likelihood is a method for estimating parameters given data and an assumed parametric distribution. It is a widely used estimation technique that is effective across a huge class of models. The standard statistical procedure is attributed to Ronald Fisher, but it has an “epic” history involving Lagrange, Gauss, Euler, and more (see Stigler 2008).

The idea is simple: given a finite set of data, which parameter values make the observed data most likely?

We begin with some data generating process that outputs data following some probability distribution. For example, heights of adults within a sex tend to be normally distributed. By parametric distribution we mean a distribution that is completely determined by its “parameters”; for a normal distribution these are the mean and the variance. We have our data and we know the distribution type, but we don’t know the parameters and would like to estimate them. In the case of heights, different countries may have different mean heights but the same underlying type of distribution. Often, MLE is used where determining the parameters analytically is very difficult or intractable, so the maximization requires some numerical technique.

Estimation Mechanics

Let $X_1, X_2, \ldots, X_n$ be an independent and identically distributed (i.i.d.) random sample drawn from a distribution with probability density (or mass) function $f(x \mid \theta)$, where $\theta$ is an unknown parameter.

The likelihood function is the joint probability (or density) of the observed data, viewed as a function of $\theta$.

$$
\mathcal{L}(\theta \mid X) = \prod_{i=1}^{n} f(x_i \mid \theta)
$$

We estimate the unknown parameter $\theta_0$ by maximizing the likelihood of the observed data over a range of possible parameter values $\theta \in \Omega$.

That is,

$$
\begin{aligned}
\hat{\theta}_{\text{ML}} &= \arg\max_{\theta \in \Omega} \mathcal{L}(\theta \mid X) \\
&= \arg\max_{\theta \in \Omega} \prod_{i=1}^{n} f(x_i \mid \theta)
\end{aligned}
$$
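As a concrete illustration of the argmax above, here is a minimal sketch that maximizes a Bernoulli likelihood by brute-force search over a grid standing in for $\Omega$. The sample, the grid resolution, and the function names (`bernoulli_likelihood`, `grid_mle`) are assumptions made for illustration; they do not come from any particular library.

```python
def bernoulli_likelihood(p, data):
    """Likelihood of i.i.d. 0/1 observations under success probability p."""
    likelihood = 1.0
    for x in data:
        # f(x | p) is p when x == 1 and 1 - p when x == 0
        likelihood *= p if x == 1 else 1.0 - p
    return likelihood


def grid_mle(data, steps=100):
    """Return the grid point in (0, 1) with the largest likelihood."""
    grid = [i / steps for i in range(1, steps)]  # excludes the endpoints 0 and 1
    return max(grid, key=lambda p: bernoulli_likelihood(p, data))


# Hypothetical sample: 7 successes in 10 trials.
data = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
p_hat = grid_mle(data)
print(p_hat)
```

For this sample the search lands on the sample proportion of successes, which agrees with the closed-form Bernoulli estimate $\hat{p} = \frac{1}{n} \sum_{i=1}^{n} x_i$. A grid search is only practical for one or two parameters, but it makes the "try every $\theta$ and keep the best" idea explicit.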

I find the normal distribution most intuitive here; others seem to find the Bernoulli more intuitive. Take your pick, the idea is the same. Suppose the true mean height is 5’9”. If we test a candidate mean of 5’3”, the density values at most of our data points would be quite small. The closer the candidate mean gets to 5’9”, the larger the density values at most points, and so the larger their product. Hence, 5’9” would be the more “likely” mean.
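The height example can be checked numerically. The sketch below simulates heights in inches and evaluates the likelihood at three candidate means. The sample size, the standard deviation of 3 inches, and the random seed are assumptions chosen for illustration, and the standard deviation is treated as known so that only the mean varies.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def likelihood(mean, data, sd):
    """Product of the densities of all observations at a candidate mean."""
    return math.prod(normal_pdf(x, mean, sd) for x in data)


random.seed(0)  # fixed seed so the run is repeatable
TRUE_MEAN, SD = 69.0, 3.0  # 5'9" in inches; the spread is an assumed value
heights = [random.gauss(TRUE_MEAN, SD) for _ in range(50)]

# Candidate means: 5'3", 5'6", and 5'9"
candidates = [63.0, 66.0, 69.0]
values = [likelihood(m, heights, SD) for m in candidates]
for m, v in zip(candidates, values):
    print(f"mean = {m:.0f} in, likelihood = {v:.3e}")
```

The printed likelihoods are all tiny in absolute terms; what matters is their ordering, which grows as the candidate mean approaches the value used to generate the data.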

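Usually, we apply a monotonic transformation by taking the logarithm of the likelihood, which converts the product into a sum. Writing $\mathcal{L}_n(\theta)$ for the likelihood of a sample of size $n$ and $f(X_i; \theta)$ for $f(x_i \mid \theta)$: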

$$
\begin{aligned}
\ell_n(\theta) &= \ln \mathcal{L}_n(\theta) \\
&= \ln \prod_{i=1}^{n} f(X_i; \theta) \\
&= \sum_{i=1}^{n} \ln f(X_i; \theta).
\end{aligned}
$$
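As a worked instance of this identity, assume (for illustration only) that $f$ is the normal density with mean $\mu$ and variance $\sigma^2$, so that $\theta = (\mu, \sigma^2)$. The product of exponentials then becomes a sum of squared deviations:

$$
\begin{aligned}
\ell_n(\mu, \sigma^2) &= \sum_{i=1}^{n} \ln\left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(X_i - \mu)^2}{2\sigma^2} \right) \right] \\
&= -\frac{n}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (X_i - \mu)^2 .
\end{aligned}
$$

For fixed $\sigma^2$, maximizing over $\mu$ amounts to minimizing the sum of squared deviations, which is achieved at the sample mean.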

Historically, the reason for applying the logarithm was that it converts multiplication (hard to do by hand) into addition (easy to do by hand). Interestingly, computers face a similar problem with multiplication. When the individual terms $f(X_i; \theta)$ are below 1 (probabilities from a mass function are at most 1, and density values are often small too), their product over a large data set can underflow: the resulting value is too tiny for the computer to represent.
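The underflow problem is easy to reproduce. This sketch assumes a standard normal model and 2000 simulated observations (both arbitrary choices for illustration) and compares the raw product of densities with the sum of log-densities in double-precision floating point.

```python
import math
import random

random.seed(1)  # fixed seed so the run is repeatable
n = 2000
sample = [random.gauss(0.0, 1.0) for _ in range(n)]


def std_normal_pdf(x):
    """Standard normal density; never larger than about 0.399."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def std_normal_logpdf(x):
    """Log of the standard normal density, computed without exponentiating."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


# Multiplying 2000 factors below 0.4 falls under the smallest positive double.
likelihood = 1.0
for x in sample:
    likelihood *= std_normal_pdf(x)

# Summing the logs keeps the same information at an ordinary magnitude.
log_likelihood = sum(std_normal_logpdf(x) for x in sample)

print(likelihood)
print(log_likelihood)
```

The product collapses to zero, so every candidate parameter would look equally bad, while the log-likelihood stays a finite negative number that can still be compared across candidates.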

Because the logarithm is a monotonic transformation, the maximizer remains the same. This also highlights an important point of terminology: likelihood is not a measure of how “probable” our estimated parameter is. It says only that the estimate fits the observed data better than the other candidate values under the assumed distribution, and is in that sense the most “likely”.

Add derivative, score function, asymptotics, Fisher information, and properties (consistency, normality, invariance, efficiency)