Expectation-Maximization Algorithm

    Topic area Machine Learning
    Prerequisites Maximum Likelihood Estimation, Latent Variable Models, Kullback-Leibler Divergence


    Expectation-Maximization (EM) is an iterative algorithm for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models that involve unobserved latent variables. Each iteration alternates between computing the expected complete-data log-likelihood under the current posterior over the latent variables (the E-step) and maximizing this expectation with respect to the parameters (the M-step). EM is the standard workhorse for fitting mixture models, hidden Markov models, factor analyzers, and many other latent variable models, and it underlies the modern theory of variational inference.

    Overview

    Many models of practical interest assume that each observed data point $ x $ is generated jointly with a latent variable $ z $ that is never observed directly. Examples include the cluster assignment in a Gaussian mixture, the hidden state sequence in a hidden Markov model, the topic of a word in Latent Dirichlet Allocation, and the shared factor in factor analysis. The marginal likelihood

    $ p(x \mid \theta) = \sum_{z} p(x, z \mid \theta) \quad \text{or} \quad \int p(x, z \mid \theta)\, dz $

    typically has no tractable closed form, and direct gradient ascent on $ \log p(x \mid \theta) $ is awkward because the sum or integral sits inside the logarithm.
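
    Concretely, for a Gaussian mixture with mixing weights $ \pi_k $, the marginal log-likelihood of a dataset $ x_1, \ldots, x_N $ is

    $ {\displaystyle \log p(x_{1:N} \mid \theta) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k),} $

    and the inner sum over components cannot be pulled outside the logarithm, so the per-point terms do not decouple into anything resembling a fully observed Gaussian fit.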

    EM sidesteps this difficulty by introducing the posterior $ p(z \mid x, \theta) $ as a coupling distribution. At each iteration the algorithm computes (or bounds) this posterior using the current parameters, then optimizes a surrogate objective in which the latent variables are effectively "filled in." The procedure was given its name and general formulation by Dempster, Laird, and Rubin in 1977, although special cases had appeared decades earlier in the genetics, signal processing, and clustering literatures.

    EM is attractive because each iteration is guaranteed to never decrease the marginal log-likelihood, the M-step often reduces to a familiar fully-observed maximum likelihood problem, and the algorithm requires no learning rate or step size. It is the canonical example of a Majorization-Minimization procedure and a direct ancestor of modern Variational Inference.

    Intuition

    EM can be understood through a simple chicken-and-egg observation. If the latent variables $ z $ were known, fitting $ \theta $ would reduce to standard Maximum Likelihood Estimation on the complete data $ (x, z) $. Conversely, if $ \theta $ were known, the posterior $ p(z \mid x, \theta) $ would tell us what the latents likely are. Neither is available at the start, so EM bootstraps: it makes a soft guess about the latents using the current parameters, then refits the parameters as if that guess were true, and iterates.

    The "soft" qualifier matters. Unlike hard-assignment heuristics such as the canonical k-means update, EM does not commit to a single value of $ z $. Instead it computes a distribution over latents and weights the M-step by that distribution. This soft assignment is what makes the marginal log-likelihood non-decreasing and what distinguishes EM from coordinate descent on a hard-assignment objective.

    Formulation

    Evidence Lower Bound

    Let $ q(z) $ be any distribution over the latent variables. Jensen's inequality gives the evidence lower bound (ELBO):

    $ {\displaystyle \log p(x \mid \theta) \;\geq\; \mathbb{E}_{q(z)}\!\left[\log p(x, z \mid \theta)\right] - \mathbb{E}_{q(z)}\!\left[\log q(z)\right] \;\equiv\; \mathcal{L}(q, \theta).} $

    The gap between the marginal log-likelihood and the ELBO is exactly the Kullback-Leibler Divergence from $ q(z) $ to the true posterior:

    $ {\displaystyle \log p(x \mid \theta) - \mathcal{L}(q, \theta) = \mathrm{KL}\!\left(q(z) \,\Vert\, p(z \mid x, \theta)\right) \geq 0.} $
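
    To see why the gap is exactly this KL divergence, substitute $ p(x, z \mid \theta) = p(z \mid x, \theta)\, p(x \mid \theta) $ into the ELBO and use the fact that $ \log p(x \mid \theta) $ does not depend on $ z $:

    $ {\displaystyle \mathcal{L}(q, \theta) = \mathbb{E}_{q(z)}\!\left[\log \frac{p(z \mid x, \theta)\, p(x \mid \theta)}{q(z)}\right] = \log p(x \mid \theta) - \mathrm{KL}\!\left(q(z) \,\Vert\, p(z \mid x, \theta)\right).} $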

    EM is then coordinate ascent on $ \mathcal{L} $: it alternately maximizes with respect to $ q $ (with $ \theta $ fixed) and with respect to $ \theta $ (with $ q $ fixed).

    E-step

    With $ \theta $ fixed at the current estimate $ \theta^{(t)} $, the ELBO is maximized by setting $ q $ equal to the posterior:

    $ q^{(t+1)}(z) = p(z \mid x, \theta^{(t)}). $

    This drives the KL term to zero, so the bound becomes tight (equal to the marginal log-likelihood) at $ \theta^{(t)} $. Equivalently, the E-step computes the expected complete-data log-likelihood, often denoted

    $ Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{p(z \mid x, \theta^{(t)})}\!\left[\log p(x, z \mid \theta)\right]. $

    In practice, the E-step amounts to evaluating sufficient statistics of $ z $ under the posterior — for instance, the responsibilities $ r_{ik} = p(z_i = k \mid x_i, \theta^{(t)}) $ in a mixture model, or the smoothed state probabilities computed by the forward-backward recursion in a hidden Markov model.
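
    As a concrete illustration, the following minimal sketch computes mixture-model responsibilities in log space, which avoids underflow when densities are tiny. The array shapes are assumptions made for the example: X of shape (N, D), means of shape (K, D), covariances of shape (K, D, D), and weights of shape (K,).

        import numpy as np
        from scipy.special import logsumexp
        from scipy.stats import multivariate_normal

        def responsibilities(X, weights, means, covs):
            """E-step for a Gaussian mixture: r[i, k] = p(z_i = k | x_i, theta)."""
            N, K = X.shape[0], weights.shape[0]
            log_r = np.empty((N, K))
            for k in range(K):
                # log pi_k + log N(x_i | mu_k, Sigma_k) for every data point i
                log_r[:, k] = np.log(weights[k]) + multivariate_normal.logpdf(
                    X, mean=means[k], cov=covs[k])
            # normalize each row with log-sum-exp so responsibilities sum to one
            log_r -= logsumexp(log_r, axis=1, keepdims=True)
            return np.exp(log_r)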

    M-step

    Holding $ q^{(t+1)} $ fixed, the ELBO becomes (up to an additive constant) the expected complete-data log-likelihood, and the M-step solves

    $ {\displaystyle \theta^{(t+1)} = \arg\max_{\theta} \; Q(\theta \mid \theta^{(t)}).} $

    Because $ q^{(t+1)} $ does not depend on $ \theta $, the latents act as fixed weights and the optimization typically reduces to a standard fully-observed maximum likelihood problem. For exponential families, the M-step often has a closed form: weighted means, weighted covariances, and weighted multinomial counts replace their unweighted counterparts.
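
    For the Gaussian mixture case, the corresponding M-step is an ordinary weighted maximum likelihood fit. The sketch below assumes a responsibility matrix r of shape (N, K) from the E-step above; the names and shapes are illustrative rather than canonical.

        import numpy as np

        def m_step(X, r):
            """M-step for a Gaussian mixture given fixed responsibilities r."""
            N, D = X.shape
            Nk = r.sum(axis=0)                    # soft counts N_k
            weights = Nk / N                      # mixing proportions pi_k
            means = (r.T @ X) / Nk[:, None]       # responsibility-weighted means
            covs = np.empty((r.shape[1], D, D))
            for k in range(r.shape[1]):
                diff = X - means[k]               # residuals about the new mean
                covs[k] = (r[:, k, None] * diff).T @ diff / Nk[k]   # weighted covariance
            return weights, means, covs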

    Monotone Improvement

    The combined update satisfies

    $ \log p(x \mid \theta^{(t+1)}) \;\geq\; \mathcal{L}(q^{(t+1)}, \theta^{(t+1)}) \;\geq\; \mathcal{L}(q^{(t+1)}, \theta^{(t)}) \;=\; \log p(x \mid \theta^{(t)}), $

    so the marginal log-likelihood is non-decreasing at every iteration. Under mild regularity conditions, the iterates converge to a stationary point of the marginal likelihood, though not necessarily a global optimum.

    Canonical Example: Gaussian Mixture Models

    For a Gaussian mixture model with $ K $ components, parameters $ \theta = \{\pi_k, \mu_k, \Sigma_k\} $, and latent assignment $ z_i \in \{1, \ldots, K\} $, the EM updates take a particularly transparent form.

    The E-step computes the responsibility of each component for each point:

    $ r_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}. $

    The M-step uses these responsibilities as soft counts:

    $ {\displaystyle N_k = \sum_{i=1}^{N} r_{ik}, \quad \pi_k^{\text{new}} = \frac{N_k}{N}, \quad \mu_k^{\text{new}} = \frac{1}{N_k}\sum_{i=1}^{N} r_{ik}\, x_i, \quad \Sigma_k^{\text{new}} = \frac{1}{N_k}\sum_{i=1}^{N} r_{ik}\,(x_i - \mu_k^{\text{new}})(x_i - \mu_k^{\text{new}})^{\!\top}.} $

    In the limit where each responsibility collapses to a one-hot vector (for instance, with equal mixing weights and a shared isotropic covariance $ \sigma^2 I $ as $ \sigma \to 0 $), the algorithm reduces exactly to k-means, which is therefore the hard-assignment limit of EM for a spherical Gaussian mixture.
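
    Putting the two steps together gives the following self-contained sketch, which fits a two-component mixture to synthetic data and records the marginal log-likelihood at each iteration so the monotone-improvement guarantee can be checked directly. The names, shapes, initialization scheme, and iteration budget are illustrative choices, not part of any library API.

        import numpy as np
        from scipy.special import logsumexp
        from scipy.stats import multivariate_normal

        def log_joint(X, weights, means, covs):
            """Return the (N, K) matrix of log pi_k + log N(x_i | mu_k, Sigma_k)."""
            N, K = X.shape[0], weights.shape[0]
            out = np.empty((N, K))
            for k in range(K):
                out[:, k] = np.log(weights[k]) + multivariate_normal.logpdf(
                    X, mean=means[k], cov=covs[k])
            return out

        def fit_gmm(X, K, n_iter=100, seed=0):
            rng = np.random.default_rng(seed)
            N, D = X.shape
            # crude initialization: K random data points as means, shared data covariance
            means = X[rng.choice(N, size=K, replace=False)]
            covs = np.stack([np.cov(X, rowvar=False)] * K)
            weights = np.full(K, 1.0 / K)
            log_liks = []
            for _ in range(n_iter):
                # E-step: responsibilities via log-sum-exp normalization
                lj = log_joint(X, weights, means, covs)
                log_norm = logsumexp(lj, axis=1, keepdims=True)
                r = np.exp(lj - log_norm)
                log_liks.append(log_norm.sum())   # marginal log-likelihood at current theta
                # M-step: weighted maximum likelihood with soft counts
                Nk = r.sum(axis=0)
                weights = Nk / N
                means = (r.T @ X) / Nk[:, None]
                for k in range(K):
                    diff = X - means[k]
                    covs[k] = (r[:, k, None] * diff).T @ diff / Nk[k]
            return weights, means, covs, log_liks

        if __name__ == "__main__":
            rng = np.random.default_rng(1)
            X = np.vstack([rng.normal(-2.0, 0.5, size=(200, 2)),
                           rng.normal(+2.0, 0.7, size=(200, 2))])
            w, mu, Sigma, lls = fit_gmm(X, K=2)
            # the log-likelihood sequence is non-decreasing, as the theory guarantees
            assert all(b >= a - 1e-8 for a, b in zip(lls, lls[1:]))
            print("mixing weights:", np.round(w, 3))
            print("means:", np.round(mu, 2))

    With well-separated clusters like these, the fit typically recovers the two component means; the assertion on the recorded log-likelihoods holds regardless of which local optimum the iterates reach.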

    Variants and Generalizations

    Several relaxations of the basic recipe extend its applicability when either the E-step or the M-step is intractable.

    • Generalized EM (GEM): The M-step is replaced by any update that increases (rather than maximizes) $ Q(\theta \mid \theta^{(t)}) $. Monotone improvement of the marginal log-likelihood is preserved.
    • Expectation-Conditional-Maximization (ECM): The M-step is decomposed into a sequence of conditional maximizations over disjoint blocks of parameters, useful when joint maximization is hard but conditional maximization is easy.
    • Stochastic EM and Monte Carlo EM: When the posterior $ p(z \mid x, \theta) $ is intractable, the E-step is replaced by sampling. In stochastic EM a single sample is drawn; in Monte Carlo EM a sample mean approximates the expectation.
    • Variational EM: The E-step restricts $ q $ to a tractable family $ \mathcal{Q} $ and minimizes the KL divergence to the true posterior within that family. The marginal log-likelihood is then bounded but not necessarily reached, trading exactness for tractability. This is the bridge between classical EM and modern variational methods.
    • Online and incremental EM: Sufficient statistics are updated using exponentially weighted moving averages over mini-batches, scaling EM to streaming or massive datasets.
    • Hard EM: Each E-step replaces the posterior by a point mass at its mode. This loses the monotone improvement guarantee but is sometimes preferred for speed or when posteriors are sharply peaked.
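
    For contrast with the soft updates above, a hard E-step can be sketched in a few lines: the posterior over each $ z_i $ is replaced by a point mass at its mode, encoded as a one-hot row. The input is assumed to be the (N, K) matrix of log-joint scores $ \log \pi_k + \log p(x_i, z_i = k \mid \theta) $, as in the Gaussian-mixture sketches above.

        import numpy as np

        def hard_e_step(log_joint_scores):
            """Replace each row's posterior with a one-hot vector at its argmax."""
            best = np.argmax(log_joint_scores, axis=1)      # most probable component per point
            r = np.zeros_like(log_joint_scores)
            r[np.arange(log_joint_scores.shape[0]), best] = 1.0
            return r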

    Convergence Properties

    EM is a local-search algorithm. Its monotone-improvement property guarantees convergence of the marginal log-likelihood sequence to a limit, and under standard regularity conditions the parameter iterates converge to a stationary point, typically a local maximum but possibly a saddle point or a boundary point of the parameter space. The convergence rate is generically linear, governed by the fraction of missing information analyzed by Dempster, Laird, and Rubin: the closer the posterior over the latents is to a point mass, the faster EM converges. When the fraction of missing information is large, convergence can be very slow, and acceleration schemes such as Aitken extrapolation, SQUAREM, and parameter-expanded EM are sometimes used.

    Sensitivity to initialization is a well-known practical issue. For mixture models, common strategies include multiple random restarts, k-means initialization, and deterministic annealing, which begins with a heavily smoothed objective and gradually sharpens it.

    Connections

    EM is a special case of the Majorization-Minimization (MM) family, applied in its minorize-maximize form: each iteration constructs a tangent lower bound (the ELBO at the current parameters) and maximizes that bound. This perspective unifies EM with proximal methods, other MM algorithms in optimization, and the recent literature on lower-bound maximization for nonconvex problems.

    EM is also the conceptual ancestor of Variational Inference and the variational autoencoder. Where EM evaluates the exact posterior in the E-step, variational methods replace it with a tractable approximation; where EM optimizes parameters in the M-step, amortized variational inference learns an inference network that predicts the variational posterior from data. The ELBO, which arose as a tool for analyzing EM, is now the central training objective of probabilistic deep generative models.

    In probabilistic principal component analysis (PPCA), factor analysis, and independent component analysis, EM provides a scalable alternative to direct eigendecomposition or fixed-point iteration. In hidden Markov models, EM specializes to the Baum-Welch algorithm, where the E-step is the forward-backward recursion and the M-step is a closed-form re-estimation of the transition and emission parameters.

    Limitations

    EM is powerful but not universal. Its main practical limitations are:

    • Local optima: The marginal likelihood is generally non-convex, and EM converges only to a stationary point. Mixtures with poorly separated components, in particular, exhibit many local maxima and saddle points.
    • Slow convergence under high missing information: When the posterior is diffuse, the M-step makes only small parameter moves, and the algorithm can require very many iterations.
    • Singularities: In Gaussian mixtures with unconstrained covariances, the likelihood is unbounded: a single component can collapse onto a single data point and drive the likelihood to infinity. Practical implementations regularize covariances or impose priors to avoid this pathology, as in the sketch after this list.
    • Intractable E-step: For complex models with discrete latents in high dimension or strongly coupled latent structure, computing the exact posterior is infeasible, requiring variational or Monte Carlo approximations.
    • No uncertainty quantification: EM returns point estimates of $ \theta $ and does not, by itself, provide a posterior over parameters. Adding a prior and doing MAP-EM partially addresses this; a fully Bayesian treatment requires variational Bayes or MCMC.
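
    On the singularities point, one common guard is to regularize each covariance estimate after the M-step. The sketch below adds a small ridge and floors the eigenvalues; the constants are illustrative, not prescriptive.

        import numpy as np

        def regularize_cov(cov, ridge=1e-6, min_eig=1e-4):
            """Keep a covariance estimate safely positive definite."""
            cov = cov + ridge * np.eye(cov.shape[0])   # ridge keeps the matrix invertible
            eigvals, eigvecs = np.linalg.eigh(cov)     # symmetric eigendecomposition
            eigvals = np.maximum(eigvals, min_eig)     # floor eigenvalues away from zero
            return (eigvecs * eigvals) @ eigvecs.T     # reassemble V diag(lambda) V^T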

    These limitations motivate the broader landscape of approximate inference techniques in modern probabilistic machine learning.

    References

    • Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". Journal of the Royal Statistical Society, Series B, 39(1), 1-38.
    • Wu, C. F. J. (1983). "On the Convergence Properties of the EM Algorithm". The Annals of Statistics, 11(1), 95-103.
    • McLachlan, G. J. and Krishnan, T. (2008). The EM Algorithm and Extensions (2nd ed.). Wiley.
    • Neal, R. M. and Hinton, G. E. (1998). "A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants". In Learning in Graphical Models, M. I. Jordan (ed.), Kluwer, 355-368.
    • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer, Chapter 9.
    • Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press, Chapter 11.
    • Baum, L. E., Petrie, T., Soules, G. and Weiss, N. (1970). "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains". The Annals of Mathematical Statistics, 41(1), 164-171.