Definitions

# Expectation-maximization algorithm

An expectation-maximization (EM) algorithm is used in statistics for finding maximum likelihood estimates of parameters in probabilistic models, where the model depends on unobserved latent variables. EM alternates between performing an expectation (E) step, which computes an expectation of the likelihood by including the latent variables as if they were observed, and a maximization (M) step, which computes the maximum likelihood estimates of the parameters by maximizing the expected likelihood found on the E step. The parameters found on the M step are then used to begin another E step, and the process is repeated.

## History

The EM algorithm was explained and given its name in a classic 1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin in the Journal of the Royal Statistical Society . They pointed out that the method had been "proposed many times in special circumstances" by other authors, but the 1977 paper generalized the method and developed the theory behind it.

## Applications

EM is frequently used for data clustering in machine learning and computer vision. In natural language processing, two prominent instances of the algorithm are the Baum-Welch algorithm (also known as forward-backward) and the inside-outside algorithm for unsupervised induction of probabilistic context-free grammars.

In psychometrics, EM is almost indispensable for estimating item parameters and latent abilities of item response theory models.

With the ability to deal with missing data and observe unidentified variables, EM is becoming a useful tool to price and manage risk of a portfolio.

The EM algorithm (and its faster variant OS-EM) is also widely used in medical image reconstruction, especially in Positron Emission Tomography and Single Photon Emission Computed Tomography. See below for other faster variants of EM.

## Demonstrations and activities

Various 1D, 2D and 3D demonstrations of EM together with Mixture Modeling are provided as part of the paired SOCR activities and applets. These applets and activities show empirically the properties of the EM algorithm for parameter estimation in diverse settings.

A number of methods have been proposed to accelerate the sometimes slow convergence of the EM algorithm, such as those utilising conjugate gradient and modified Newton-Raphson techniques. Additionally EM can be utilised with constrained estimation techniques.

## EM in a nutshell

EM is an iterative technique for estimating the value of some unknown quantity, given the values of some correlated, known quantity.

The approach is to first assume that the quantity is represented as a value in some parameterized probability distribution (the popular application is a mixture of Gaussians, hence the example below). The EM procedure, then, is:

1. Initialize the distribution parameters
2. Repeat until convergence:
1. E-Step: estimate the [E]xpected value of the unknown variables, given the current parameter estimate
2. M-Step: re-estimate the distribution parameters to [M]aximize the likelihood of the data, given the expected estimates of the unknown variables

Steps 1 and 2 are loaded, and depend on what distribution you choose, how many parameters there are, and how complicated the missing value is. Two other caveats: First, the choice of initial parameters technically does not matter, but in practice a poor choice can lead to a bad estimate. Second, convergence, though guaranteed, may take a long time to get to. In practice, if the values of the missing variables and parameters do not change significantly between two iterations, then the algorithm terminates.

A simple application is filling missing values in a column of a database. Assume you know about 50% of the values in a column, but the remaining values are corrupt or missing. For simplicity, assume that the data is distributed normally with a unit variance. Then, the only parameter that must be computed is the mean value. What's more convenient, is that the E-step is simple: the expected value of each missing value is the mean. But the E-step changes the overall mean of the data, and so the estimate can be improved. This will continue for a few iterations. An example:

Initialization Step: Data: [4, 10, ?, ?] Initial mean value: 0
New Data: [4, 10, 0, 0]

Step 1: New Mean: 3.5
New Data:[4, 10, 3.5, 3.5]

Step 2: New Mean: 5.25
New Data: [4, 10, 5.25, 5.25]

Step 3: New Mean: 6.125
New Data: [4, 10, 6.125, 6.125]

Step 4: New Mean: 6.5625
New Data: [4, 10, 6.5625, 6.5625]

Step 5: New Mean: 6.7825
New Data: [4, 10, 6.7825, 6.7825]

Result: New Mean: 6.890625

From here you can see the value is slowly converging toward 7. For this simple model (a univariate normal distribution with unit variance), it can be seen that substituting the average of the known values is the best answer. For more complex models, there is no easy way to find the best answer, and the EM algorithm is a very popular approach for estimating the answer.

In the mixture of Gaussians demonstration below, we have a collection of multi-dimensional objects. We'll assume that each individual data object was generated from one of K Gaussians. If we knew which Gaussian each object came from, then estimating the parameters is easy (using Maximum likelihood Estimation techniques). Instead, we must first compute the Expected value that the object came from each Gaussian (E-step) and then estimate the parameters, given these expected assignments (M-step).

## Specification of the EM procedure

Let $textbf\left\{y\right\}$ denote incomplete data consisting of values of observable variables, and let $textbf\left\{z\right\}$ denote the missing data. Together, $textbf\left\{y\right\}$ and $textbf\left\{z\right\}$ form the complete data. $textbf\left\{z\right\}$ can either be actual missing measurements or a hidden variable that would make the problem easier if its value were known. For instance, in mixture models, the likelihood formula would be much more convenient if mixture components that "generated" the samples were known (see example below).

Let $p,$ be the joint probability density function (continuous case) or probability mass function (discrete case) of the complete data with parameters given by the vector $theta$: $p\left(mathbf y, mathbf z | theta\right)$. This function can also be considered as the complete data likelihood, that is, it can be thought of as a function of $theta$. Further, note that the conditional distribution of the missing data given the observed can be expressed as:

$p\left(mathbf z |mathbf y, theta\right) = frac\left\{p\left(mathbf y, mathbf z | theta\right)\right\}\left\{p\left(mathbf y | theta\right)\right\} = frac\left\{p\left(mathbf y|mathbf z, theta\right) p\left(mathbf z |theta\right) \right\}\left\{int p\left(mathbf y|mathbf hat\left\{z\right\}, theta\right) p\left(mathbf hat\left\{z\right\} |theta\right) dmathbf hat\left\{z\right\}\right\}$

by using the Bayes rule and the law of total probability. (This formulation only requires knowledge of the observation likelihood given the unobservable data $p\left(mathbf y|mathbf z, theta\right)$ and the probability of the unobservable data $p\left(mathbf z |theta\right)$.)

An EM algorithm iteratively improves an initial estimate $theta_0$ by constructing new estimates $theta_1, theta_2,$ and so on. An individual re-estimation step that derives $theta_\left\{n+1\right\},$ from $theta_n,$ takes the following form:


theta_{n+1} = argmax_{theta}Q(theta)

where $Q\left(theta\right)$ is the expected value of the log-likelihood. In other words, we do not know the complete data, so we cannot say what is the exact value of the likelihood, but given the data that we do know (the $y$'s), we can find a posteriori estimates of the probabilities for the various values of the unknown $z$'s. For each set of $z$'s there is a likelihood value for $theta$, and we can thus calculate an expected value of the likelihood with the given values of $y$'s (and which depends on the previously assumed value of $theta$ because this influenced the probabilities of the z's).

Q is given by


Q(theta) =
`sum_z`
` p left(z ,|, y, theta_n right)`
` log p left(y, z ,|, theta right)`

or more generally


Q(theta) = E_{mathbf z} ! ! left[log p left(mathbf y, mathbf z ,|, theta right) Big| mathbf y right] where it is understood that this denotes the conditional expectation of $log p left\left(mathbf y, mathbf z ,|, theta right\right)$ being taken with the $theta$ used in the conditional distribution of $textbf\left\{z\right\}$ fixed at $theta_n$. (The log of the likelihood is often used instead of true likelihood because it leads to easier formulas, but still attains its maximum at the same point as the likelihood.)

In other words, $theta_\left\{n+1\right\}$ is the value that maximizes (M) the conditional expectation (E) of the complete data log-likelihood given the observed variables under the previous parameter value. The expectation $Q\left(theta\right)$ in the continuous case would be given by


Q(theta) = E_{mathbf z} ! ! left[log p left(mathbf y, mathbf z ,|, theta right) Big| mathbf y right] = int^infty _{- infty}
`p left(mathbf z ,|, mathbf y, theta_n right)`
`log p left(mathbf y, mathbf z ,|, theta right) dmathbf z`

Speaking of an expectation (E) step is a bit of a misnomer. What is calculated in the first step are the fixed, data-dependent parameters of the function Q. Once the parameters of Q are known, it is fully determined and is maximized in the second (M) step of an EM algorithm.

The origin of the name comes from the fact that in the paper by Dempster, Laird and Rubin, they first discuss a less general problem in which the probability distribution is of the exponential family, and in that case the so-called E step consists of finding the expected values of certain sufficient statistics of the complete data.

### Properties

Part of the reason for the popularity of EM algorithms is that, as can be shown, an EM iteration does not decrease the observed data likelihood function. However, there is no guarantee that the sequence converges to a maximum likelihood estimator. For multimodal distributions, this means that an EM algorithm will converge to a local maximum (or saddle point) of the observed data likelihood function, depending on starting values. There are a variety of heuristic approaches for escaping a local maximum such as using several different random initial estimates, $theta_0$, or applying simulated annealing.

EM is particularly useful when maximum likelihood estimation of a complete data model is easy. If closed-form estimators exist, the M step is often trivial. A classic example is maximum likelihood estimation of a finite mixture of Gaussians, where each component of the mixture can be estimated trivially if the mixing distribution is known.

"Expectation-maximization" is a description of a class of related algorithms, not a specific algorithm; EM is a recipe or meta-algorithm which is used to devise particular algorithms. The Baum-Welch algorithm is an example of an EM algorithm applied to hidden Markov models. Another example is the EM algorithm for fitting a mixture density model.

An EM algorithm can also find maximum a posteriori (MAP) estimates, by performing MAP estimation in the M step, rather than maximum likelihood.

There are other methods for finding maximum likelihood estimates, such as gradient descent, conjugate gradient or variations of the Gauss-Newton method. Unlike EM, such methods typically require the evaluation of first and/or second derivatives of the likelihood function.

## Incremental versions

The classic EM procedure is to replace both Q and θ with their optimal possible (argmax) values at each iteration. However it can be shown (see Neal & Hinton, 1999) that simply finding Q and θ to give some improvement over their current value will also ensure successful convergence.

For example, to improve Q, we could restrict the space of possible functions to a computationally simple distribution such as a factorial distribution,

$Q=prod_i Q_i. !$

Thus at each E step we compute the variational approximation of Q.

To improve θ, we could use any hill-climbing method, and not worry about finding the optimal θ, just some improvement. This method is also known as Generalized EM (GEM).

## Relation to variational Bayes methods

EM is a partially non-Bayesian, maximum likelihood method. Its final result gives a probability distribution over the latent variables (in the Bayesian style) together with a point estimate for θ (either a maximum likelihood estimate or a posterior mode). We may want a fully Bayesian version of this, giving a probability distribution over θ as well as the latent variables. In fact the Bayesian approach to inference is simply to treat θ as another latent variable. In this paradigm, the distinction between the E and M steps disappears. If we use the factorized Q approximation as described above (variational Bayes), we may iterate over each latent variable (now including θ) and optimize them one at a time. There are now k steps per iteration, where k is the number of latent variables. For graphical models this is easy to do as each variable's new Q depends only on its Markov blanket, so local message passing can be used for efficient inference.

## Example: Gaussian Mixture

Assume that a sample of m vectors (or scalars) $mathbf y_1, dots, textbf\left\{y\right\}_m$, where $mathbf y_j in mathbb\left\{R\right\}^l$, is drawn from one of n Gaussian distributions. Let $z_j in \left\{1,2,ldots,n\right\}$ denote which Gaussian $mathbf y_j$ is from. The probability that a particular $mathbf y$ comes from the $i^\left\{mathrm\left\{th\right\}\right\}$ $D$-dimensional Gaussian is

$P\left(mathbf y | z=i,theta\right) = mathcal\left\{N\right\}\left(mu_i,sigma_i\right) = \left(2pi\right)^\left\{-D/2\right\} \left\{left| sigma_i right|\right\}^\left\{-1/2\right\} expleft\left(-frac\left\{1\right\}\left\{2\right\}\left(mathbf y - mathbf mu_i\right)^T sigma_i^\left\{-1\right\} \left(mathbf y - mathbf mu_i\right)right\right)$

Our task is to estimate the unknown parameters $theta = left\left\{ mu_1, dots, mu_n, sigma_1, dots, sigma_n, P\left(z=1\right), dots, P\left(z=n\right) right\right\}$, that is, the mean and standard deviation of each Gaussian and the probability for each Gaussian being drawn for any given point. (Actually it is not clear that we should allow the standard deviations to take any value because then the maximum likelihood may be unbounded as one centers a particular Gaussian on a particular data point and decreases the standard deviation toward zero!)

### E-step

Estimation of the unobserved z's (which Gaussian is used), conditioned on the observation, using the values from the last maximization step:


p(z_j=i|mathbf y_j,theta_t)

# frac{p(z_j

i, mathbf y_j | theta_t)}{p(mathbf y_j|theta_t)}

# frac{p(mathbf y_j|z_j=i,theta_t) p(z=i|theta_t)}{sum_{k=1}^n p(mathbf y_j | z_j=k, theta_t) p(z

k|theta_t)}

### M-step

We now want to maximize the expected log-likelihood of the joint event:


begin{align} Q(theta) & = E_{z} left[ln prod_{j=1}^m p left(mathbf y_j, mathbf z | theta right) Big| mathbf y_j right] & = E_{z} left[sum_{j=1}^m ln p left(mathbf y_j, mathbf z | theta right) Big| mathbf y_j right] & = sum_{j=1}^m E_{z} left[ln p left(mathbf y_j, mathbf z | theta right) Big| mathbf y_j right] & = sum_{j=1}^m sum_{i=1}^n p left(z_j=i | mathbf y_j, theta_t right) ln pleft(z_j=i, mathbf y_j | theta right) end{align}

If we expand the probability of the joint event, we get


Q(theta)

# sum_{j=1}^m sum_{i=1}^n p(z_j=i | mathbf y_j, theta_t) ln left(p(mathbf y_j | z_j=i, theta) p(z_j

i | theta) right)

We have the constraint


sum_{i=1}^{n} p(z_j=i|theta) = 1

If we add a Lagrange multiplier, and expand the pdf, we get


begin{align} mathcal{L}(theta)

# & left(sum_{j=1}^m sum_{i=1}^n p(z_j=i | mathbf y_j, theta_t) left(- frac{D}{2} ln (2pi) - frac{1}{2} ln left| sigma_i right| - frac{1}{2}(mathbf y_j - mathbf mu_i)^T sigma_i^{-1} (mathbf y_j - mathbf mu_i) + ln p(z

i | theta) right) right) & - lambda left(sum_{i=1}^{n} p(z=i | theta) - 1 right) end{align}

To find the new estimate $theta_\left\{t+1\right\}$, we find a maximum where $frac\left\{partial mathcal\left\{L\right\}\left(theta\right)\right\}\left\{partial theta\right\} = 0$.

New estimate for mean (using some differentiation rules from matrix calculus):


begin{align} frac{partial mathcal{L}(theta)}{partial mu_i} & = sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) left(- frac{partial}{partial mu_i} frac{1}{2}(mathbf y_j - mathbf mu_i)^T sigma_i^{-1} (mathbf y_j - mathbf mu_i) right) & = sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) left(- frac{1}{2}(sigma_i^{-1} +sigma_i^{-T})(mathbf y_j - mathbf mu_i)(-1) right) & = sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) left(sigma_i^{-1}(mathbf y_j - mathbf mu_i) right)
`& = 0 `
`& Downarrow `
sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) sigma_i^{-1} mathbf mu_i & = sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) sigma_i^{-1} mathbf y_j
`& Downarrow `
mu_i sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) & = sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) mathbf y_j
`& Downarrow `
mu_i & = frac{sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) mathbf y_j}{sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t)} end{align}

New estimate for covariance:


begin{align} frac{partial mathcal{L}(theta)}{partial sigma_i} & = sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) left(- frac{partial}{partial sigma_i} frac{1}{2} ln left| sigma_i right| - frac{partial}{partial sigma_i} frac{1}{2}(mathbf y_j - mathbf mu_i)^T sigma_i^{-1} (mathbf y_j - mathbf mu_i) right) & = sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) left(- frac{1}{2} sigma_i^{-T} + frac{1}{2} sigma_i^{-T}(mathbf y_j - mathbf mu_i) (mathbf y_j - mathbf mu_i)^T sigma_i^{-T} right)
`& = 0 `
`& Downarrow `
sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) sigma_i^{-1} & = sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) sigma_i^{-1} (mathbf y_j - mathbf mu_i) (mathbf y_j - mathbf mu_i)^T sigma_i^{-1}
`& Downarrow `
sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) & = sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) sigma_i^{-1} (mathbf y_j - mathbf mu_i) (mathbf y_j - mathbf mu_i)^T
`& Downarrow `
sigma_i & = frac{sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) (mathbf y_j - mathbf mu_i) (mathbf y_j - mathbf mu_i)^T}{sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t)} end{align}

New estimate for class probability:


begin{align} frac{partial mathcal{L}(theta)}{partial p(z=i|theta)} & = left(sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) frac{partial ln p(z=i|theta)}{partial p(z=i|theta)} right) - lambda left(frac{partial p(z=i|theta)}{partial p(z=i|theta)} right) & = left(sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) frac{1}{p(z=i|theta)} right) - lambda
`& = 0 `
`& Downarrow `
sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) frac{1}{p(z=i|theta)}
`& = lambda `
`& Downarrow `
p(z=i|theta) & = frac{1}{lambda} sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) end{align}

Inserting into the constraint:


begin{align} sum_{i=1}^{n} p(z=i|theta) & = sum_{i=1}^{n} frac{1}{lambda} sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t)
`& = 1 `
`& Downarrow `
lambda & = sum_{i=1}^{n} sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) end{align}

Inserting $lambda$ into our estimate:


begin{align} p(z=i|theta) & = frac{1}{lambda} sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) & = {displaystylefrac{sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t)}{sum_{k=1}^{n} sum_{j=1}^m p(z_j=k | mathbf y_j, theta_t)}} & = frac{1}{m}sum_{j=1}^m p(z_j=i | mathbf y_j, theta_t) end{align}

These estimates now become our $theta_\left\{t+1\right\}$, to be used in the next estimation step.