Gaussian Processes
| Article | |
|---|---|
| Topic area | Bayesian Methods |
| Prerequisites | Linear Regression |
Overview
A Gaussian process (GP) is a distribution over functions such that any finite collection of function values has a joint Gaussian distribution. GPs provide a principled, non-parametric Bayesian approach to regression, classification, and other supervised learning tasks: instead of fitting a fixed-dimensional parameter vector, the model places a prior directly on the function space and updates it to a posterior conditioned on observed data.[1] Predictions come with calibrated uncertainty estimates derived from the posterior, which is what makes GPs attractive for applications such as Bayesian optimization, geostatistics, and surrogate modelling of expensive simulators.
Compared with parametric models like Linear Regression or feed-forward neural networks, a GP does not commit to a fixed functional form; the inductive bias is encoded in a kernel (covariance) function rather than in a basis. This flexibility comes at a computational cost that is cubic in the number of training points, which has motivated a large body of work on sparse and approximate inference.
Intuition
A standard Gaussian distribution over a vector specifies, for each coordinate, a mean and how that coordinate covaries with every other coordinate. A Gaussian process generalises this idea from a finite vector to an infinite collection indexed by inputs $ x \in \mathcal{X} $: it places a Gaussian distribution over function values at any finite set of points. If you sample a function from a GP and evaluate it at $ n $ inputs, you get an $ n $-dimensional Gaussian random vector whose mean and covariance are determined by the GP's mean and kernel functions.
The kernel encodes assumptions about smoothness, periodicity, length scales, and amplitude. Two inputs that are close under the kernel produce highly correlated outputs; two inputs that are far apart produce nearly independent outputs. Conditioning on observed data pins the posterior down at the training points and lets uncertainty grow smoothly as you move away from them, recovering the prior in regions with no nearby observations.
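To make the finite-dimensional view concrete, the minimal sketch below (assuming a zero-mean GP, an RBF kernel, and illustrative hyperparameter values) draws three functions from the prior at $ n = 100 $ inputs; each draw is one $ n $-dimensional Gaussian vector whose covariance is the kernel matrix.

```python
# Sampling functions from a zero-mean GP prior with an RBF kernel.
# Kernel choice and hyperparameters are illustrative, not prescribed by the text.
import numpy as np

def rbf_kernel(x1, x2, lengthscale=0.5, amplitude=1.0):
    """Squared-exponential kernel for 1-D inputs."""
    sq_dists = (x1[:, None] - x2[None, :]) ** 2
    return amplitude**2 * np.exp(-0.5 * sq_dists / lengthscale**2)

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)            # n = 100 evaluation points
K = rbf_kernel(x, x)                   # n x n covariance of the function values
jitter = 1e-6 * np.eye(len(x))         # small diagonal term for numerical stability
L = np.linalg.cholesky(K + jitter)

# Each column is one prior function draw: an n-dimensional Gaussian vector.
samples = L @ rng.standard_normal((len(x), 3))
print(samples.shape)                   # (100, 3)
```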
Formal definition
A Gaussian process on an input domain $ \mathcal{X} $ is a stochastic process $ \{f(x) : x \in \mathcal{X}\} $ such that, for any finite set of inputs $ x_1, \ldots, x_n \in \mathcal{X} $, the random vector $ (f(x_1), \ldots, f(x_n)) $ is multivariate Gaussian. A GP is fully specified by a mean function $ m(x) = \mathbb{E}[f(x)] $ and a covariance (or kernel) function $ k(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))] $, written
$ {\displaystyle f(\cdot) \sim \mathcal{GP}(m(\cdot), k(\cdot, \cdot)).} $
For the kernel to define a valid GP, it must be symmetric and positive semi-definite: for any finite set of inputs, the Gram matrix $ K_{ij} = k(x_i, x_j) $ must satisfy $ \mathbf{c}^\top K \mathbf{c} \ge 0 $ for every weight vector $ \mathbf{c} $. The mean function is often set to zero after centering the targets, so the model's expressiveness lives entirely in the kernel.
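As a concrete illustration of this requirement, the small check below (using an RBF kernel and an arbitrary input set as illustrative choices) verifies that the resulting Gram matrix is symmetric and numerically positive semi-definite.

```python
# Checking the validity conditions of a candidate kernel on a finite input set.
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, amplitude=1.0):
    sq_dists = (x1[:, None] - x2[None, :]) ** 2
    return amplitude**2 * np.exp(-0.5 * sq_dists / lengthscale**2)

rng = np.random.default_rng(1)
x = rng.uniform(-5, 5, size=30)
K = rbf_kernel(x, x)

assert np.allclose(K, K.T), "Gram matrix must be symmetric"
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-8, "Gram matrix must be positive semi-definite (up to tolerance)"
print(f"smallest eigenvalue: {eigvals.min():.2e}")
```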
Kernels
The kernel determines what kinds of functions are a priori plausible. Common choices include:
- Squared exponential (RBF): $ k(x, x') = \sigma_f^2 \exp\!\left(-\tfrac{1}{2 \ell^2} \lVert x - x' \rVert^2 \right) $, which produces infinitely differentiable, smooth functions controlled by a length-scale $ \ell $ and amplitude $ \sigma_f $.
- Matérn: a family parameterised by a smoothness parameter $ \nu $, recovering the RBF kernel as $ \nu \to \infty $ and giving rougher sample paths for smaller $ \nu $; $ \nu = 3/2 $ and $ \nu = 5/2 $ are common choices. Often preferred over the RBF when modelling physical processes, which are rarely as smooth as RBF samples.
- Periodic: $ k(x, x') = \sigma_f^2 \exp\!\left(-\tfrac{2}{\ell^2} \sin^2(\pi (x - x') / p)\right) $, used when the underlying function has period $ p $.
- Linear: $ k(x, x') = \sigma_f^2\, x^\top x' $, which makes the GP equivalent to Bayesian linear regression in the original feature space.
- Composite: sums and products of base kernels combine assumptions, e.g. a sum of a periodic and an RBF kernel models seasonality plus a slow trend.
Choosing a kernel and its hyperparameters is the central modelling decision in GP regression and is analogous to architecture and feature design for parametric models.
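As a rough sketch of composition, the snippet below builds the "seasonality plus slow trend" kernel mentioned above as the sum of a periodic and an RBF kernel; the hyperparameter values are illustrative only.

```python
# Composing kernels: sums (and products) of valid kernels are valid kernels.
import numpy as np

def rbf(x1, x2, lengthscale, amplitude):
    return amplitude**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / lengthscale**2)

def periodic(x1, x2, lengthscale, amplitude, period):
    d = np.pi * np.abs(x1[:, None] - x2[None, :]) / period
    return amplitude**2 * np.exp(-2.0 * np.sin(d)**2 / lengthscale**2)

def seasonal_plus_trend(x1, x2):
    # Periodic component captures seasonality; the long length-scale RBF captures a slow trend.
    return (periodic(x1, x2, lengthscale=0.7, amplitude=1.0, period=1.0)
            + rbf(x1, x2, lengthscale=5.0, amplitude=0.5))

x = np.linspace(0, 10, 200)
K = seasonal_plus_trend(x, x)
print(K.shape)   # (200, 200)
```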
Inference and prediction
Consider noisy regression with observations $ y_i = f(x_i) + \varepsilon_i $ where $ \varepsilon_i \sim \mathcal{N}(0, \sigma_n^2) $ are i.i.d. Gaussian. Stack the training inputs into $ X $, targets into $ \mathbf{y} $, and write the kernel matrix $ K = K(X, X) $. Place a zero-mean GP prior on $ f $ and integrate out the latent function values. The joint distribution of training targets and a test prediction $ f_* $ at $ x_* $ is Gaussian, so the posterior is also Gaussian with closed-form mean and variance:
$ {\displaystyle \bar{f}_* = k_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y},} $
$ {\displaystyle \mathbb{V}[f_*] = k(x_*, x_*) - k_*^\top (K + \sigma_n^2 I)^{-1} k_*,} $
where $ k_* = K(X, x_*) $. The mean is a linear smoother of the targets, and the variance shrinks near training points and grows away from them. In practice the inverse is replaced by a Cholesky factorisation of $ K + \sigma_n^2 I $ for numerical stability.
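A minimal implementation of these formulas might look as follows, assuming an RBF kernel and toy one-dimensional data; it follows the Cholesky-based recipe rather than forming an explicit inverse.

```python
# Exact GP regression: posterior mean and variance at test inputs via Cholesky solves.
import numpy as np

def rbf(x1, x2, lengthscale=1.0, amplitude=1.0):
    return amplitude**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, noise_var=0.1):
    K = rbf(x_train, x_train)                    # K(X, X)
    K_s = rbf(x_train, x_test)                   # K(X, X_*)
    K_ss = rbf(x_test, x_test)                   # K(X_*, X_*)

    L = np.linalg.cholesky(K + noise_var * np.eye(len(x_train)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))   # (K + sigma_n^2 I)^{-1} y

    mean = K_s.T @ alpha                         # posterior mean at the test inputs
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(v**2, axis=0)   # posterior variance at the test inputs
    return mean, var

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, 20)
y_train = np.sin(x_train) + 0.3 * rng.standard_normal(20)
x_test = np.linspace(-4, 4, 50)
mean, var = gp_posterior(x_train, y_train, x_test)
print(mean.shape, var.shape)   # (50,) (50,)
```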
For non-Gaussian likelihoods (e.g. classification with a Bernoulli likelihood), the posterior is no longer Gaussian. Approximations such as the Laplace approximation, expectation propagation, or variational inference replace the exact closed form.
Hyperparameter learning
Kernel hyperparameters $ \theta $ (length scales, amplitudes, noise variance) are typically learned by maximising the log marginal likelihood,
$ {\displaystyle \log p(\mathbf{y} \mid X, \theta) = -\tfrac{1}{2} \mathbf{y}^\top (K_\theta + \sigma_n^2 I)^{-1} \mathbf{y} - \tfrac{1}{2} \log |K_\theta + \sigma_n^2 I| - \tfrac{n}{2} \log 2\pi.} $
The first term rewards data fit, the second penalises model complexity, and the third is constant. This automatic Occam's razor balances flexibility against parsimony without a separate validation set, though the objective is non-convex and can have multiple local optima. A fully Bayesian alternative places a hyperprior on $ \theta $ and integrates with Markov chain Monte Carlo, which is slower but quantifies hyperparameter uncertainty.
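A sketch of this procedure is shown below, assuming an RBF kernel, a small toy dataset, and a log-parameterisation that keeps the hyperparameters positive; the optimiser choice is illustrative.

```python
# Learning kernel hyperparameters by minimising the negative log marginal likelihood.
import numpy as np
from scipy.optimize import minimize

def rbf(x1, x2, lengthscale, amplitude):
    return amplitude**2 * np.exp(-0.5 * (x1[:, None] - x2[None, :])**2 / lengthscale**2)

def neg_log_marginal_likelihood(log_theta, x, y):
    lengthscale, amplitude, noise_var = np.exp(log_theta)   # positivity via log-parameterisation
    K = rbf(x, x, lengthscale, amplitude) + noise_var * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # log|K| = 2 * sum(log diag(L)) from the Cholesky factor.
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(x) * np.log(2 * np.pi)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 40)
y = np.sin(x) + 0.2 * rng.standard_normal(40)

result = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(x, y), method="L-BFGS-B")
lengthscale, amplitude, noise_var = np.exp(result.x)
print(f"lengthscale={lengthscale:.3f}, amplitude={amplitude:.3f}, noise_var={noise_var:.3f}")
```

The data-fit and complexity terms of the objective trade off automatically as the optimiser adjusts the length scale and noise variance.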
Computational cost and scalable variants
Exact GP inference requires storing and factorising an $ n \times n $ kernel matrix, costing $ \mathcal{O}(n^3) $ time and $ \mathcal{O}(n^2) $ memory. This makes vanilla GPs impractical beyond a few thousand training points. Scalable variants reduce cost by exploiting structure or approximation:
- Sparse / inducing-point methods summarise the training data with $ m \ll n $ inducing inputs, reducing cost to $ \mathcal{O}(n m^2) $. The variational formulation of Titsias and the stochastic variant SVGP are widely used.[2][3]
- Structured kernel interpolation (KISS-GP) and Toeplitz / Kronecker tricks exploit gridded inputs to drop cost further.
- Local and product-of-experts approximations partition the input space and combine local GPs.
- Random Fourier features approximate stationary kernels with finite-dimensional feature maps, turning the GP into Bayesian linear regression in the feature space (see the sketch after this list).
- Deep kernels compose a parametric feature extractor (often a neural network) with a base kernel, blending GP uncertainty quantification with learned representations.
The right approximation depends on dataset size, input geometry, and whether posterior covariance is needed in addition to the mean.
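To make the random Fourier feature idea concrete, the sketch below approximates a one-dimensional RBF kernel with an $ m $-dimensional random cosine feature map, so that $ k(x, x') \approx z(x)^\top z(x') $; the feature dimension and length scale are illustrative assumptions.

```python
# Random Fourier features for the RBF kernel: k(x, x') ~ z(x)^T z(x').
import numpy as np

def rff_features(x, m=500, lengthscale=1.0, seed=0):
    """Random cosine features whose inner products approximate the 1-D RBF kernel."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(m) / lengthscale     # frequencies drawn from the RBF spectral density
    b = rng.uniform(0, 2 * np.pi, m)             # random phases
    return np.sqrt(2.0 / m) * np.cos(np.outer(x, w) + b)

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 50)
Z = rff_features(x)

K_exact = np.exp(-0.5 * (x[:, None] - x[None, :])**2)   # RBF kernel, unit amplitude and length scale
K_approx = Z @ Z.T                                       # finite-dimensional approximation
print("max abs error:", np.max(np.abs(K_exact - K_approx)))
```

GP regression with this feature map reduces to Bayesian linear regression on the features, with cost linear in $ n $ for fixed $ m $.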
Relations to other models
A GP with a linear kernel recovers Bayesian linear regression; a GP with a fixed-feature kernel $ k(x, x') = \phi(x)^\top \Sigma_p \phi(x') $ recovers Bayesian linear regression in the feature space $ \phi $. Conversely, a Bayesian neural network with one hidden layer of infinite width and Gaussian weight priors converges to a GP with a kernel determined by the activation function, a connection extended by the neural tangent kernel for deep networks. Compared to support-vector regression, GPs share the kernel machinery but produce a full posterior rather than a single regression function; compared to random forests and gradient boosting, GPs typically yield better-calibrated uncertainty at the cost of poorer scaling.
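The linear-kernel equivalence can be checked numerically; the one-dimensional toy data, weight-prior variance, and noise level in the sketch below are illustrative assumptions.

```python
# Numerical check: GP predictive mean with a linear kernel equals the Bayesian
# linear regression predictive mean under a zero-mean Gaussian weight prior.
import numpy as np

rng = np.random.default_rng(0)
sigma_f2, sigma_n2 = 2.0, 0.25                      # prior weight variance, noise variance
x = rng.uniform(-2, 2, 15)
y = 0.7 * x + np.sqrt(sigma_n2) * rng.standard_normal(15)
x_star = 1.5

# GP predictive mean with the linear kernel k(x, x') = sigma_f^2 * x * x'.
K = sigma_f2 * np.outer(x, x)
k_star = sigma_f2 * x * x_star
gp_mean = k_star @ np.linalg.solve(K + sigma_n2 * np.eye(len(x)), y)

# Bayesian linear regression: posterior mean of the weight, then predict at x_star.
w_mean = (x @ y / sigma_n2) / (x @ x / sigma_n2 + 1.0 / sigma_f2)
blr_mean = w_mean * x_star

print(np.isclose(gp_mean, blr_mean))   # True
```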
Limitations
The dominant practical limitation is cubic scaling, which restricts exact GPs to small datasets. Approximations relax this but introduce their own bias-variance trade-offs and tuning burden. GPs are sensitive to kernel choice: a misspecified kernel can lead to overconfident or miscalibrated predictions, and standard stationary kernels struggle with high-dimensional inputs because length scales become hard to learn and the curse of dimensionality erodes the smoothness prior. Non-Gaussian likelihoods require approximate inference, and constraints such as monotonicity or non-negativity are not handled natively. Finally, while the marginal likelihood gives an elegant model-selection criterion, it can favour overly flexible kernels when the noise model is misspecified.