Markov Chain Monte Carlo
| Article | |
|---|---|
| Topic area | Bayesian Methods |
| Prerequisites | Bayesian Inference, Markov Chain |
Overview
Markov chain Monte Carlo (MCMC) is a family of algorithms for drawing samples from a probability distribution that is known only up to a normalising constant. Rather than sampling directly, MCMC constructs a Markov chain whose stationary distribution is the target, runs the chain long enough to forget its starting state, and treats subsequent states as approximate samples. Expectations under the target are then estimated by averaging functions of those states.[1] The technique underpins modern Bayesian inference: posteriors of the form $ p(\theta \mid \mathbf{y}) \propto p(\mathbf{y} \mid \theta) p(\theta) $ are typically intractable, but their unnormalised density is cheap to evaluate, which is exactly what MCMC needs.
MCMC sits between exact sampling, which requires a tractable normaliser, and variational inference, which optimises a tractable surrogate at the cost of a fixed approximation gap. Given enough compute, an MCMC estimator is asymptotically unbiased; its main practical concerns are mixing time, autocorrelation, and tuning. The framework was popularised in statistics by the Metropolis algorithm in 1953 and extended by Hastings in 1970, then transformed Bayesian computation in the 1990s once Gibbs sampling and general-purpose software such as BUGS made posterior inference routine.
Intuition
Imagine wanting to compute the average value of a function over a high-dimensional posterior whose density you can score but not invert. Direct sampling fails because you do not know the normaliser; importance sampling fails because no good proposal exists in high dimensions. MCMC sidesteps both issues with a random walk biased toward high-density regions: at each step, propose a small change to the current state, then accept or reject the proposal so that, in the long run, the chain spends time in each region in proportion to the target density.
The key fact is that one only needs to evaluate ratios of the target density. Since $ p(\theta) = \tilde{p}(\theta) / Z $ where $ Z $ is the unknown normaliser, the ratio $ \tilde{p}(\theta')/\tilde{p}(\theta) $ equals $ p(\theta')/p(\theta) $. The unknown constant cancels, so the chain can be simulated using only the unnormalised density.
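This cancellation is easy to check numerically. The short Python sketch below uses an unnormalised standard normal as a stand-in target (the choice of density and the test points are illustrative) and confirms that density ratios do not depend on the missing constant:

```python
import math

def unnorm(x):
    # Unnormalised standard-normal density: the 1/sqrt(2*pi) factor is omitted.
    return math.exp(-0.5 * x * x)

def norm(x):
    # Fully normalised density, for comparison only.
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

# The ratio is the same whether or not the normaliser is known,
# which is all an MCMC acceptance test ever evaluates.
r_unnorm = unnorm(1.3) / unnorm(0.2)
r_norm = norm(1.3) / norm(0.2)
```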
Foundations
A Markov chain is a sequence of random variables $ \theta_0, \theta_1, \ldots $ where $ \theta_{t+1} $ depends only on $ \theta_t $, governed by a transition kernel $ T(\theta' \mid \theta) $. A distribution $ \pi $ is stationary for $ T $ if
$ {\displaystyle \pi(\theta') = \int T(\theta' \mid \theta) \pi(\theta)\, d\theta.} $
A sufficient (but not necessary) condition for stationarity is detailed balance,
$ {\displaystyle \pi(\theta) T(\theta' \mid \theta) = \pi(\theta') T(\theta \mid \theta'),} $
which states that probability flux between any pair of states is symmetric. If the chain is also irreducible (every state reachable from every other) and aperiodic, the ergodic theorem guarantees that time averages along the chain converge to expectations under $ \pi $.[2] MCMC algorithms are recipes for designing transition kernels whose stationary distribution is a chosen target.
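As a concrete check of these definitions, the following Python sketch builds a Metropolis-style kernel for a small three-state target (the probabilities are made up for the demo) and verifies detailed balance and stationarity numerically:

```python
# Illustrative 3-state target and a Metropolis kernel built from a symmetric
# uniform proposal; the target probabilities are made up for this demo.
pi = [0.2, 0.3, 0.5]
n = len(pi)
q = 1.0 / (n - 1)  # propose one of the other states uniformly

T = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            T[i][j] = q * min(1.0, pi[j] / pi[i])  # Metropolis acceptance
    T[i][i] = 1.0 - sum(T[i])                      # rejected mass stays put

# Detailed balance: the flux pi_i * T_ij is symmetric in i and j.
balanced = all(abs(pi[i] * T[i][j] - pi[j] * T[j][i]) < 1e-12
               for i in range(n) for j in range(n))

# Stationarity follows: applying T to pi returns pi.
pi_next = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
stationary = all(abs(a - b) < 1e-12 for a, b in zip(pi_next, pi))
```

Note that the kernel was never told the normaliser of `pi`; scaling all three weights by any constant leaves `T` unchanged.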
Metropolis-Hastings
The Metropolis-Hastings (MH) algorithm is the canonical MCMC method. Given the current state $ \theta $, draw a candidate $ \theta' $ from a proposal distribution $ q(\theta' \mid \theta) $, then accept with probability
$ {\displaystyle \alpha(\theta, \theta') = \min\!\left(1,\; \frac{\tilde{p}(\theta')\, q(\theta \mid \theta')}{\tilde{p}(\theta)\, q(\theta' \mid \theta)}\right).} $
If the proposal is symmetric (the original Metropolis case), the $ q $ ratio drops out and acceptance reduces to $ \min(1, \tilde{p}(\theta')/\tilde{p}(\theta)) $. By construction, the resulting transition kernel satisfies detailed balance with respect to $ p $, so $ p $ is stationary. The proposal is the only design lever: a proposal that is too narrow yields high acceptance but slow exploration; one that is too wide is rejected often. For random-walk proposals in high dimension, an acceptance rate near 0.234 is asymptotically optimal, which is why step-size adaptation schemes commonly target it.[3]
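A minimal random-walk Metropolis sampler fits in a few lines of Python. The target here is a hypothetical unnormalised standard normal, and the function name and tuning values are illustrative rather than a reference implementation:

```python
import math
import random

def metropolis(log_p_tilde, theta0, step, n_steps, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal.

    log_p_tilde evaluates the log unnormalised target; the normaliser never
    appears because acceptance uses only differences of log densities.
    """
    rng = random.Random(seed)
    theta = theta0
    samples = []
    for _ in range(n_steps):
        prop = theta + rng.gauss(0.0, step)
        # Symmetric proposal: the q-ratio cancels, leaving the Metropolis rule.
        if math.log(rng.random()) < log_p_tilde(prop) - log_p_tilde(theta):
            theta = prop
        samples.append(theta)  # on rejection the current state is repeated
    return samples

# Hypothetical target: log p~(x) = -x^2/2, i.e. a standard normal.
chain = metropolis(lambda x: -0.5 * x * x, theta0=3.0, step=1.0, n_steps=20000)
kept = chain[2000:]                 # discard burn-in
mean = sum(kept) / len(kept)        # should be near 0 for this target
```

Repeating the current state on rejection is essential: the rejection mass is part of the kernel, and dropping those duplicates would bias the averages.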
Gibbs sampling
Gibbs sampling exploits the structure of multivariate targets by updating one coordinate (or block of coordinates) at a time, conditional on the others. For $ \theta = (\theta_1, \ldots, \theta_d) $, a sweep of the Gibbs sampler draws
$ {\displaystyle \theta_i^{(t+1)} \sim p\!\left(\theta_i \,\big|\, \theta_1^{(t+1)}, \ldots, \theta_{i-1}^{(t+1)}, \theta_{i+1}^{(t)}, \ldots, \theta_d^{(t)}\right),} $
so that each coordinate is conditioned on the most recent values of all the others. Each conditional update is itself an MH step with acceptance probability one, so no tuning is required when the conditionals can be sampled in closed form. Conjugate hierarchical models, mixture models, and graphical models with tractable Markov blankets are natural fits. When a conditional is not directly sampleable, a within-Gibbs MH or slice-sampling move can be substituted, yielding a Metropolis-within-Gibbs scheme.
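The scheme can be illustrated on a standard bivariate normal with correlation $ \rho $, whose full conditionals are the Gaussians $ x \mid y \sim \mathcal{N}(\rho y, 1 - \rho^2) $ and $ y \mid x \sim \mathcal{N}(\rho x, 1 - \rho^2) $. A Python sketch, with illustrative settings:

```python
import math
import random

def gibbs_bivariate_normal(rho, n_sweeps, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each full conditional is Gaussian, so every update is drawn exactly and
    no accept-reject step or tuning is needed.
    """
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)
    x, y = 0.0, 0.0
    out = []
    for _ in range(n_sweeps):
        x = rng.gauss(rho * y, sd)  # update x given the current y
        y = rng.gauss(rho * x, sd)  # update y given the fresh x
        out.append((x, y))
    return out

draws = gibbs_bivariate_normal(rho=0.9, n_sweeps=20000)[1000:]
corr = sum(x * y for x, y in draws) / len(draws)  # estimates E[xy] = rho
```

The strong correlation here also shows the sampler's weakness: as $ \rho \to 1 $ the conditionals become very narrow and successive sweeps move only slightly, so mixing slows down.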
Hamiltonian Monte Carlo
Random-walk MH mixes poorly in high dimensions and on correlated targets. Hamiltonian Monte Carlo (HMC) augments the state with an auxiliary momentum variable $ r $ and proposes long, gradient-informed trajectories by simulating Hamiltonian dynamics with potential $ U(\theta) = -\log \tilde{p}(\theta) $ and kinetic energy $ K(r) = \tfrac{1}{2} r^\top M^{-1} r $.[4] The leapfrog integrator preserves volume and is reversible; remaining discretisation error is corrected by a single MH accept-reject step on the joint $ (\theta, r) $ energy. Because trajectories follow the gradient, HMC suppresses random-walk behaviour and scales much better with dimension. The No-U-Turn Sampler (NUTS) automates the choice of trajectory length by extending the path until it begins to double back, removing one of HMC's tuning headaches and making it the default sampler in Stan and PyMC.[5]
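A bare-bones HMC transition can be sketched as follows, assuming a one-dimensional target with unit mass matrix; the target, step size, and trajectory length are illustrative choices, not recommendations:

```python
import math
import random

def hmc_step(theta, log_p, grad_log_p, eps, n_leapfrog, rng):
    """One HMC transition for a 1-D target with unit mass matrix.

    Potential U = -log p~, kinetic K = r^2 / 2; the leapfrog integrator is
    volume-preserving and reversible, and the final MH test corrects its
    discretisation error.
    """
    r = rng.gauss(0.0, 1.0)                      # resample momentum
    theta_new, r_new = theta, r
    r_new += 0.5 * eps * grad_log_p(theta_new)   # half step in momentum
    for i in range(n_leapfrog):
        theta_new += eps * r_new                 # full step in position
        if i < n_leapfrog - 1:
            r_new += eps * grad_log_p(theta_new)
    r_new += 0.5 * eps * grad_log_p(theta_new)   # closing half step
    # Accept or reject on the joint (theta, r) energy difference.
    h_old = -log_p(theta) + 0.5 * r * r
    h_new = -log_p(theta_new) + 0.5 * r_new * r_new
    if math.log(rng.random()) < h_old - h_new:
        return theta_new
    return theta

# Hypothetical target: standard normal, log p~(x) = -x^2/2, gradient -x.
rng = random.Random(0)
theta, chain = 2.0, []
for _ in range(5000):
    theta = hmc_step(theta, lambda x: -0.5 * x * x,
                     lambda x: -x, eps=0.2, n_leapfrog=10, rng=rng)
    chain.append(theta)
kept = chain[500:]
mean = sum(kept) / len(kept)  # should be near 0 for this target
```

Because the momentum is refreshed each transition and the trajectory covers a long arc, successive draws here are far less correlated than those of a random-walk proposal with comparable cost per step.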
Diagnostics and convergence
MCMC estimates are biased for finite chains; the bias decays as the chain forgets its initialisation. A burn-in period is discarded, and the remaining samples are summarised by their autocorrelation, the effective sample size (ESS), and the potential scale reduction factor $ \hat{R} $ computed from multiple chains started at overdispersed points. A common rule of thumb is to run at least four chains and treat $ \hat{R} > 1.01 $ as evidence of non-convergence.[6] Trace plots, rank plots, and divergence diagnostics for HMC catch pathologies that scalar diagnostics miss, such as funnels and multimodality. None of these tools can prove convergence; they can only flag failure.
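The classic (non-rank-normalised) $ \hat{R} $ is straightforward to compute from multiple same-length chains. In the Python sketch below, synthetic chains stand in for sampler output to show how between-chain disagreement inflates the diagnostic:

```python
import math
import random

def r_hat(chains):
    """Potential scale reduction factor from multiple same-length chains.

    Classic formula: compare the between-chain variance B with the mean
    within-chain variance W; values near 1 are consistent with convergence.
    """
    m = len(chains)
    n = len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    w = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_plus = (n - 1) / n * w + b / n  # pooled variance estimate
    return math.sqrt(var_plus / w)

rng = random.Random(0)
# Four chains exploring the same distribution: R-hat stays near 1.
good = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
# Chains stuck in two different regions: B explodes and R-hat rises well above 1.
bad = [[rng.gauss(mu, 1) for _ in range(1000)] for mu in (0, 0, 5, 5)]
```

This illustrates why overdispersed starting points matter: had all four "bad" chains started in the same mode, their agreement would have masked the failure.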
Variants and connections
Beyond the canonical samplers, several variants address specific challenges. Reversible-jump MCMC handles models of varying dimension. Parallel tempering and replica exchange run multiple chains at different temperatures and swap states to escape modes. Sequential Monte Carlo blends importance sampling with MCMC moves and is well suited to streaming and tempered targets. Stochastic-gradient MCMC methods such as SGLD and SGHMC scale to large datasets by replacing the full-data gradient with mini-batch estimates, trading exact stationarity for tractability. MCMC also interacts with other inference tools: it provides asymptotically unbiased reference posteriors for evaluating Variational Inference, supplies fully Bayesian hyperparameter estimates for Gaussian Processes, and offers a stochastic alternative to the Expectation-Maximization Algorithm when the E-step is intractable.
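As an illustration of the stochastic-gradient idea, a single SGLD update for a toy model $ x_i \sim \mathcal{N}(\theta, 1) $ with a flat prior can be sketched as follows; the data, batch size, and step size are all made up for the demo:

```python
import random

def sgld_step(theta, batch, n_total, eps, rng):
    """One SGLD update for the toy model x_i ~ N(theta, 1), flat prior.

    The full-data gradient of the log posterior is estimated from a
    mini-batch rescaled by N/n, and Gaussian noise of variance eps is
    injected. There is no accept-reject step, so stationarity is only
    approximate and depends on eps being small.
    """
    grad = (n_total / len(batch)) * sum(x - theta for x in batch)
    return theta + 0.5 * eps * grad + rng.gauss(0.0, eps ** 0.5)

# Hypothetical data: 1000 points centred near 2.0; mini-batches of 10.
rng = random.Random(0)
data = [2.0 + rng.gauss(0, 1) for _ in range(1000)]
theta = 0.0
for _ in range(3000):
    theta = sgld_step(theta, rng.sample(data, 10), len(data), 1e-4, rng)
# theta now fluctuates around the posterior mean, near the data mean of ~2.
```

Each update touches only 1% of the data, which is the entire point of the method; the price is the unquantified bias mentioned under Limitations.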
Limitations
MCMC's headline weakness is sample efficiency: each draw requires a full evaluation of the unnormalised density (and, for HMC, its gradient), and consecutive draws are correlated, so the effective sample size can be orders of magnitude smaller than the chain length. Mixing degrades on multimodal posteriors, where chains can become trapped in a single mode for arbitrarily long, and on geometries with strong curvature variation, where step sizes that work in one region fail in another. Convergence diagnostics are necessary but not sufficient. Stochastic-gradient variants introduce bias that is rarely quantified rigorously. For very high-dimensional or latent-variable models, variational and amortised methods are often preferred when calibrated uncertainty is not the priority, with MCMC reserved for gold-standard reference inference.
References
- ↑ Robert, C. P. and Casella, G., Monte Carlo Statistical Methods, 2nd ed., Springer, 2004.
- ↑ Meyn, S. and Tweedie, R. L., Markov Chains and Stochastic Stability, 2nd ed., Cambridge University Press, 2009.
- ↑ Roberts, G. O., Gelman, A. and Gilks, W. R., "Weak convergence and optimal scaling of random walk Metropolis algorithms", Annals of Applied Probability, 1997.
- ↑ Neal, R. M., "MCMC using Hamiltonian dynamics", in Handbook of Markov Chain Monte Carlo, Chapman and Hall/CRC, 2011.
- ↑ Hoffman, M. D. and Gelman, A., "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo", Journal of Machine Learning Research, 2014.
- ↑ Vehtari, A., Gelman, A., Simpson, D., Carpenter, B. and Bürkner, P.-C., "Rank-normalization, folding, and localization: An improved $ \hat{R} $ for assessing convergence of MCMC", Bayesian Analysis, 2021.