maximum likelihood estimation

Maximum likelihood

Maximum likelihood estimation (MLE) is a popular statistical method used for fitting a mathematical model to some data. The modeling of real world data using estimation by maximum likelihood offers a way of tuning the free parameters of the model to provide a good fit.

The method was pioneered by geneticist and statistician Sir R. A. Fisher between 1912 and 1922. It has widespread applications in various fields, including:

The method of maximum likelihood corresponds to many well-known estimation methods in statistics. For example, suppose you are interested in the heights of Americans. You have a sample of some number of Americans, but not the entire population, and record their heights. Further, you are willing to assume that heights are normally distributed with some unknown mean and variance. The sample mean is then the maximum likelihood estimator of the population mean, and the sample variance is a close approximation to the maximum likelihood estimator of the population variance (see examples below).

For a fixed set of data and underlying probability model, maximum likelihood picks the values of the model parameters that make the data "more likely" than any other values of the parameters would make them. Maximum likelihood estimation gives a unique and easy way to determine solution in the case of the normal distribution and many other problems, although in very complex problems this may not be the case. If a uniform prior distribution is assumed over the parameters, the maximum likelihood estimate coincides with the most probable values thereof.

Prerequisites

The following discussion assumes that readers are familiar with basic notions in probability theory such as probability distributions, probability density functions, random variables and expectation. It also assumes they are familiar with standard basic techniques of maximizing continuous real-valued functions, such as using differentiation to find a function's maxima.

Principles

Consider a family D_theta of probability distributions parameterized by an unknown parameter theta (which could be vector-valued), associated with either a known probability density function (continuous distribution) or a known probability mass function (discrete distribution), denoted as f_theta. We draw a sample x_1,x_2,dots,x_n of n values from this distribution, and then using f_theta we compute the (multivariate) probability density associated with our observed data, f_theta(x_1,dots,x_n).,!

As a function of θ with x1, ..., xn fixed, this is the likelihood function

mathcal{L}(theta) = f_{theta}(x_1,dots,x_n).,!

The method of maximum likelihood estimates θ by finding the value of θ that maximizes mathcal{L}(theta). This is the maximum likelihood estimator (MLE) of θ:

widehat{theta} = underset{theta}{operatorname{arg max}} mathcal{L}(theta).

From a simple point of view, the outcome of a maximum likelihood analysis is the maximum likelihood estimate. This can be supplemented by an approximation for the covariance matrix of the MLE, where this approximation is derived from the likelihood function. A more complete outcome from a maximum likelihood analysis would be the likelihood function itself, which can be used to construct improved versions of confidence intervals compared to those obtained from the approximate variance matrix. See also Likelihood Ratio Test

Commonly, one assumes that the data drawn from a particular distribution are independent, identically distributed (iid) with unknown parameters. This considerably simplifies the problem because the likelihood can then be written as a product of n univariate probability densities:

mathcal{L}(theta) = prod_{i=1}^n f_{theta}(x_i)

and since maxima are unaffected by monotone transformations, one can take the logarithm of this expression to turn it into a sum:

mathcal{L}^*(theta) = sum_{i=1}^n log f_{theta}(x_i).

The maximum of this expression can then be found numerically using various optimization algorithms.

This contrasts with seeking an unbiased estimator of θ, which may not necessarily yield the MLE but which will yield a value that (on average) will neither tend to over-estimate nor under-estimate the true value of θ.

Note that the maximum likelihood estimator may not be unique, or indeed may not even exist.

Properties

Functional invariance

The maximum likelihood estimator selects the parameter value which gives the observed data the largest possible probability (or probability density, in the continuous case). If the parameter consists of a number of components, then we define their separate maximum likelihood estimators, as the corresponding component of the MLE of the complete parameter. Consistent with this, if widehat{theta} is the MLE for θ, and if g is any function of θ, then the MLE for α = g(θ) is by definition

widehat{alpha} = g(widehat{theta}).,!

It maximizes the so-called profile likelihood:

bar{L}(alpha) = sup_{theta: alpha = g(theta)} L(theta).

Bias

For small numbers of samples, the bias of maximum-likelihood estimators can be substantial. Consider a case where n tickets numbered from 1 to n are placed in a box and one is selected at random (see uniform distribution). If n is unknown, then the maximum-likelihood estimator of n is the number on the drawn ticket. The expected value of the number on the drawn ticket, and therefore the expected value of n, is (n+1)/2. As a result, the maximum likelihood estimator for n will systematically underestimate n by (n-1)/2. In estimating the highest number n, we can only be certain that it is greater than or equal to the drawn ticket number.

Asymptotics

In many cases, estimation is performed using a set of independent identically distributed measurements. These may correspond to distinct elements from a random sample, repeated observations, etc. In such cases, it is of interest to determine the behavior of a given estimator as the number of measurements increases to infinity, referred to as asymptotic behaviour.

Under certain (fairly weak) regularity conditions, which are listed below, the MLE exhibits several characteristics which can be interpreted to mean that it is "asymptotically optimal". These characteristics include:

  • The MLE is asymptotically unbiased, i.e., its bias tends to zero as the number of samples increases to infinity.
  • The MLE is asymptotically efficient, i.e., it achieves the Cramér-Rao lower bound when the number of samples tends to infinity. This means that, asymptotically, no unbiased estimator has lower mean squared error than the MLE.
  • The MLE is asymptotically normal. As the number of samples increases, the distribution of the MLE tends to the Gaussian distribution with mean theta and covariance matrix equal to the inverse of the Fisher information matrix.

Since the Cramér-Rao bound only speaks of unbiased estimators while the maximum likelihood estimator is usually biased, asymptotic efficiency as defined here does not mean anything: perhaps there are other nearly unbiased estimators with much smaller variance. However, it can be shown that among all regular estimators, which are estimators which have an asymptotic distribution which is not dramatically disturbed by small changes in the parameters, the asymptotic distribution of the maximum likelihood estimator is the best possible, i.e., most concentrated.

Some regularity conditions which ensure this behavior are:

  1. The first and second derivatives of the log-likelihood function must be defined.
  2. The Fisher information matrix must not be zero, and must be continuous as a function of the parameter.
  3. The maximum likelihood estimator is consistent.

By the mathematical meaning of the word asymptotic, asymptotic properties are properties which only approached in the limit of larger and larger samples: they are approximately true when the sample size is large enough. The theory does not tell us how large the sample needs to be in order to obtain a good enough degree of approximation. Fortunately, in practice they often appear to be approximately true, when the sample size is moderately large. So in practice, inference about the estimated parameters is often based on the asymptotic Gaussian distribution of the MLE. When we do this, the Fisher information matrix is usefully estimated by the observed information matrix.

Some cases where the asymptotic behaviour described above does not hold are outlined next.

Estimate on boundary. Sometimes the maximum likelihood estimate lies on the boundary of the set of possible parameters, or (if the boundary is not, strictly speaking, allowed) the likelihood gets larger and larger as the parameter approaches the boundary. Standard asymptotic theory needs the assumption that the true parameter value lies away from the boundary. If we have enough data, the maximum likelihood estimate will keep away from the boundary too. But with smaller samples, the estimate can lie on the boundary. In such cases, the asymptotic theory clearly does not give a practically useful approximation. Examples here would be variance-component models, where each component of variance, σ2, must satisfy the constraint σ2 ≥0.

Data boundary parameter-dependent. For the theory to apply in a simple way, the set of data values which has positive probability (or positive probability density) should not depend on the unknown parameter. A simple example where such parameter-dependence does hold is the case of estimating θ from a set of independent identically distributed when the common distribution is uniform on the range (0,θ). For estimation purposes the relevant range of θ is such that θ cannot be less than the largest observation. In this instance the maximum likelihood estimate exists and has some good behaviour, but the asymptotics are not as outlined above.

Nuisance parameters. For maximum likelihood estimations, a model may have a number of nuisance parameters. For the asymptotic behaviour outlined to hold, the number of nuisance parameters should not increase with the number of observations (the sample size). A well-known example of this case is where observations occur as pairs, where the observations in each pair have a different (unknown) mean but otherwise the observations are independent and Normally distributed with a common variance. Here for 2N observations, there are N+1 parameters. It is well-known that the maximum likelihood estimate for the variance does not converge to the true value of the variance.

Increasing information. For the asymptotics to hold in cases where the assumption of independent identically distributed observations does not hold, a basic requirement is that the amount of information in the data increases indefinitely as the sample size increases. Such a requirement may not be met if either there is too much dependence in the data (for example, if new observations are essentially identical to existing observations), or if new independent observations are subject to an increasing observation error.

Examples

Discrete distribution, finite parameter space

Consider tossing an unfair coin 80 times (i.e., we sample something like x1=H, x2=T, ..., x80=T, and count the number of HEADS "H" observed). Call the probability of tossing a HEAD p, and the probability of tossing TAILS 1-p (so here p is θ above). Suppose we toss 49 HEADS and 31 TAILS, and suppose the coin was taken from a box containing three coins: one which gives HEADS with probability p=1/3, one which gives HEADS with probability p=1/2 and another which gives HEADS with probability p=2/3. The coins have lost their labels, so we don't know which one it was. Using maximum likelihood estimation we can calculate which coin has the largest likelihood, given the data that we observed. The likelihood function (defined below) takes one of three values:

begin{matrix} Pr(mathrm{H} = 49 mid p=1/3) & = & binom{80}{49}(1/3)^{49}(1-1/3)^{31} approx 0.000 && Pr(mathrm{H} = 49 mid p=1/2) & = & binom{80}{49}(1/2)^{49}(1-1/2)^{31} approx 0.012 && Pr(mathrm{H} = 49 mid p=2/3) & = & binom{80}{49}(2/3)^{49}(1-2/3)^{31} approx 0.054 end{matrix}

We see that the likelihood is maximized when p=2/3, and so this is our maximum likelihood estimate for p.

Discrete distribution, continuous parameter space

Now suppose we had only one coin but its p could have been any value 0 ≤ p ≤ 1. We must maximize the likelihood function:

L(theta) = f_D(mathrm{H} = 49 mid p) = binom{80}{49} p^{49}(1-p)^{31}

over all possible values 0 ≤ p ≤ 1.

One way to maximize this function is by differentiating with respect to p and setting to zero:

begin{align} {0}&{} = frac{partial}{partial p} left(binom{80}{49} p^{49}(1-p)^{31} right) & {}propto 49p^{48}(1-p)^{31} - 31p^{49}(1-p)^{30} & {}= p^{48}(1-p)^{30}left[49(1-p) - 31p right] & {}= p^{48}(1-p)^{30}left[49 - 80p right] end{align}

which has solutions p=0, p=1, and p=49/80. The solution which maximizes the likelihood is clearly p=49/80 (since p=0 and p=1 result in a likelihood of zero). Thus we say the maximum likelihood estimator for p is 49/80.

This result is easily generalized by substituting a letter such as t in the place of 49 to represent the observed number of 'successes' of our Bernoulli trials, and a letter such as n in the place of 80 to represent the number of Bernoulli trials. Exactly the same calculation yields the maximum likelihood estimator t / n for any sequence of n Bernoulli trials resulting in t 'successes'.

Continuous distribution, continuous parameter space

For the normal distribution mathcal{N}(mu, sigma^2) which has probability density function

f(xmid mu,sigma^2) = frac{1}{sqrt{2pi} sigma }
exp{left(-frac {(x-mu)^2}{2sigma^2} right)},

the corresponding probability density function for a sample of n independent identically distributed normal random variables (the likelihood) is

f(x_1,ldots,x_n mid mu,sigma^2) = prod_{i=1}^{n} f(x_{i}mid mu, sigma^2) = left(frac{1}{2pisigma^2} right)^{n/2} expleft(-frac{ sum_{i=1}^{n}(x_i-mu)^2}{2sigma^2}right),

or more conveniently:

f(x_1,ldots,x_n mid mu,sigma^2) = left(frac{1}{2pisigma^2} right)^{n/2} expleft(-frac{ sum_{i=1}^{n}(x_i-bar{x})^2+n(bar{x}-mu)^2}{2sigma^2}right),
where bar{x} is the sample mean.

This family of distributions has two parameters: θ=(μ,σ), so we maximize the likelihood, mathcal{L} (mu,sigma) = f(x_1,ldots,x_n mid mu, sigma), over both parameters simultaneously, or if possible, individually.

Since the logarithm is a continuous strictly increasing function over the range of the likelihood, the values which maximize the likelihood will also maximize its logarithm. Since maximizing the logarithm often requires simpler algebra, it is the logarithm which is maximized below. (Note: the log-likelihood is closely related to information entropy and Fisher information.)

0 = frac{partial}{partial mu} log left(left(frac{1}{2pisigma^2} right)^{n/2} expleft(-frac{ sum_{i=1}^{n}(x_i-bar{x})^2+n(bar{x}-mu)^2}{2sigma^2}right) right)

= frac{partial}{partial mu} left(logleft(frac{1}{2pisigma^2} right)^{n/2} - frac{ sum_{i=1}^{n}(x_i-bar{x})^2+n(bar{x}-mu)^2}{2sigma^2} right)

= 0 - frac{-2n(bar{x}-mu)}{2sigma^2}

which is solved by

hatmu = bar{x} = sum^{n}_{i=1}x_i/n .

This is indeed the maximum of the function since it is the only turning point in μ and the second derivative is strictly less than zero. Its expectation value is equal to the parameter μ of the given distribution,

E left[widehatmu right] = mu,

which means that the maximum-likelihood estimator widehatmu is unbiased.

Similarly we differentiate the log likelihood with respect to σ and equate to zero:

0 = frac{partial}{partial sigma} log left(left(frac{1}{2pisigma^2} right)^{n/2} expleft(-frac{ sum_{i=1}^{n}(x_i-bar{x})^2+n(bar{x}-mu)^2}{2sigma^2}right) right)

= frac{partial}{partial sigma} left(frac{n}{2}logleft(frac{1}{2pisigma^2} right) - frac{ sum_{i=1}^{n}(x_i-bar{x})^2+n(bar{x}-mu)^2}{2sigma^2} right)

= -frac{n}{sigma} + frac{ sum_{i=1}^{n}(x_i-bar{x})^2+n(bar{x}-mu)^2}{sigma^3}

which is solved by

widehatsigma^2 = sum_{i=1}^n(x_i-widehat{mu})^2/n.

Inserting widehatmu we obtain

widehatsigma^2 = frac{1}{n} sum_{i=1}^{n} (x_{i} - bar{x})^2 = frac{1}{n}sum_{i=1}^n x_i^2
-frac{1}{n^2}sum_{i=1}^nsum_{j=1}^n x_i x_j.

To calculate its expected value, it is convenient to rewrite the expression in terms of zero-mean random variables (statistical error) delta_i equiv mu - x_i. Expressing the estimate in these variables yields

widehatsigma^2 = frac{1}{n} sum_{i=1}^{n} (mu - delta_i)^2 -frac{1}{n^2}sum_{i=1}^nsum_{j=1}^n (mu - delta_i)(mu - delta_j).

Simplifying the expression above, utilizing the facts that Eleft[delta_iright] = 0 and E[delta_i^2] = sigma^2 , allows us to obtain

E left[widehat{sigma^2} right]= frac{n-1}{n}sigma^2.

This means that the estimator widehatsigma is biased (However, widehatsigma is consistent).

Formally we say that the maximum likelihood estimator for theta=(mu,sigma^2) is:

widehat{theta} = left(widehat{mu},widehat{sigma}^2right).

In this case the MLEs could be obtained individually. In general this may not be the case, and the MLEs would have to be obtained simultaneously.

Non-independent variables

It may be the case that variables are correlated, in which case they are not independent. Two random variables X and Y are only independent if their joint probability density function is the product of the individual probability density functions, i.e.

f(x,y)=f(x)f(y),

Suppose one constructs an order n, Gaussian vector out of random variables (x_1,ldots,x_n),, where each variable has means given by (mu_1, ldots, mu_n),. Furthermore, let the covariance matrix be denoted by Sigma,

The joint probability density function of these n random variables is then given by:

f(x_1,ldots,x_n)=frac{1}{2pisqrt{text{det}(Sigma)}} expleft(-frac{1}{2} left[x_1-mu_1,ldots,x_n-mu_nright]Sigma^{-1} left[x_1-mu_1,ldots,x_n-mu_nright]^T right)

In the two variable case, the joint probability density function is given by:

f(x,y) = frac{1}{2pi sigma_x sigma_y sqrt{1-rho^2}} expleft[-frac{1}{2(1-rho^2)} left(frac{(x-mu_x)^2}{sigma_x^2} - frac{2rho(x-mu_x)(y-mu_y)}{sigma_xsigma_y} + frac{(y-mu_y)^2}{sigma_y^2}right) right]

In this and other cases where a joint density function exists, the likelihood function is defined as above, under Principles, using this density.

See also

References

* *

External links

Search another word or see maximum likelihood estimationon Dictionary | Thesaurus |Spanish
Copyright © 2014 Dictionary.com, LLC. All rights reserved.
  • Please Login or Sign Up to use the Recent Searches feature
FAVORITES
RECENT

;