Stochastic Gradient Descent
| Article | |
|---|---|
| Topic area | Machine Learning |
| Difficulty | Intermediate |
| Prerequisites | Gradient Descent, Linear Regression |
Stochastic gradient descent (often abbreviated SGD) is an iterative optimisation algorithm used to minimise an objective function written as a sum of differentiable sub-functions. It is the workhorse behind modern machine-learning training, powering everything from linear models to deep neural networks.
Motivation
In classical gradient descent, the full gradient of the loss function is computed over the entire training set before each parameter update. When the dataset is large this becomes prohibitively expensive. SGD addresses the problem by estimating the gradient from a single randomly chosen sample (or a small mini-batch) at each step, trading a noisier estimate for dramatically lower per-iteration cost.
Algorithm
Given a parameterised loss function
- $ L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(\theta;\, x_i,\, y_i) $
the SGD update rule at step $ t $ is:
- $ \theta_{t+1} = \theta_t - \eta_t \,\nabla_\theta \ell(\theta_t;\, x_{i_t},\, y_{i_t}) $
where $ \eta_t $ is the learning rate (step size) and $ i_t $ is a randomly selected index.
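The update rule above can be sketched for one concrete case. The example below assumes a squared loss $ \ell(\theta; x_i, y_i) = \tfrac{1}{2}(x_i^\top \theta - y_i)^2 $, synthetic noiseless data, and a constant step size of 0.05; the function name `sgd_step` and all of these choices are illustrative assumptions rather than part of the algorithm's definition.

```python
import numpy as np

def sgd_step(theta, x_i, y_i, eta):
    """One SGD update for the squared loss 0.5 * (x_i . theta - y_i)**2."""
    residual = x_i @ theta - y_i
    grad = residual * x_i            # gradient of the per-sample loss
    return theta - eta * grad

# Synthetic, noiseless linear data (an assumption for this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

theta = np.zeros(3)
for t in range(5000):
    i_t = rng.integers(len(X))       # randomly selected index i_t
    theta = sgd_step(theta, X[i_t], y[i_t], eta=0.05)
```

Because the targets here are noiseless, every per-sample loss shares the same minimiser, so a constant step size is enough for `theta` to approach `true_theta`; with noisy data a decaying step size would normally be needed.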
Mini-batch variant
In practice a mini-batch of $ B $ samples is used:
- $ \theta_{t+1} = \theta_t - \frac{\eta_t}{B}\sum_{j=1}^{B} \nabla_\theta \ell(\theta_t;\, x_{i_j},\, y_{i_j}) $
Common batch sizes range from 32 to 512. Larger batches reduce gradient variance but increase memory usage.
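The variance claim can be checked empirically. The following sketch assumes a least-squares loss on synthetic data and measures the total variance of the mini-batch gradient at a fixed $ \theta $ for three batch sizes; the helper names `minibatch_grad` and `grad_variance` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + rng.normal(scale=0.5, size=N)
theta = np.zeros(d)                  # fixed point at which gradients are compared

def minibatch_grad(batch_size):
    """Average squared-loss gradient over one random mini-batch."""
    idx = rng.choice(N, size=batch_size, replace=False)
    residual = X[idx] @ theta - y[idx]
    return X[idx].T @ residual / batch_size

def grad_variance(batch_size, trials=500):
    """Sum of per-coordinate variances of the mini-batch gradient estimate."""
    grads = np.stack([minibatch_grad(batch_size) for _ in range(trials)])
    return grads.var(axis=0).sum()

variances = {B: grad_variance(B) for B in (1, 32, 512)}
```

In this setting the variance should fall roughly in proportion to $ 1/B $, which is why the returns from ever larger batches diminish.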
Pseudocode
initialise parameters θ
for epoch = 1, 2, … do
    shuffle training set
    for each mini-batch B ⊂ training set do
        g ← (1/|B|) Σ ∇ℓ(θ; xᵢ, yᵢ)    # estimate gradient
        θ ← θ − η · g                   # update parameters
    end for
end for
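A runnable version of the pseudocode might look like the following. It is a minimal sketch that assumes a least-squares loss, a constant learning rate, and NumPy arrays as inputs; the function `sgd_train` and its default hyperparameters are illustrative, not prescribed by the algorithm.

```python
import numpy as np

def sgd_train(X, y, eta=0.1, batch_size=32, epochs=20, seed=0):
    """Mini-batch SGD for least squares, following the pseudocode line by line."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)                          # initialise parameters
    for _ in range(epochs):
        order = rng.permutation(n)               # shuffle training set
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ theta - y[idx]
            g = X[idx].T @ residual / len(idx)   # estimate gradient
            theta = theta - eta * g              # update parameters
    return theta

# Usage on synthetic data (an assumption for this sketch).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))
true_theta = np.array([2.0, -1.0, 0.0, 3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=1000)
theta_hat = sgd_train(X, y)
```

Dividing by `len(idx)` rather than `batch_size` keeps the gradient average correct for the final, possibly shorter, mini-batch of each epoch.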
Learning rate schedules
The learning rate $ \eta_t $ strongly influences convergence. Common strategies include:
- Constant — simple but may overshoot or stall.
- Step decay — multiply $ \eta $ by a factor (e.g. 0.1) every $ k $ epochs.
- Exponential decay — $ \eta_t = \eta_0 \, e^{-\lambda t} $.
- Cosine annealing — smoothly reduces the rate following a cosine curve, often with warm restarts.
- Linear warm-up — ramp up from a small $ \eta $ during the first few iterations to stabilise early training.
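The schedules listed above can be written as small functions of the step or epoch counter. The sketch below is one possible formulation; the function names, default values, and the choice to omit warm restarts from the cosine schedule are assumptions made for brevity.

```python
import math

def step_decay(eta0, epoch, factor=0.1, k=30):
    """Multiply the rate by `factor` once every `k` epochs."""
    return eta0 * factor ** (epoch // k)

def exponential_decay(eta0, t, lam=0.01):
    """eta_t = eta0 * exp(-lam * t)."""
    return eta0 * math.exp(-lam * t)

def cosine_annealing(eta0, t, total_steps, eta_min=0.0):
    """Single cosine cycle from eta0 down to eta_min (no warm restarts)."""
    cos_term = 1 + math.cos(math.pi * t / total_steps)
    return eta_min + 0.5 * (eta0 - eta_min) * cos_term

def linear_warmup(eta0, t, warmup_steps):
    """Ramp linearly up to eta0 over the first `warmup_steps` iterations."""
    return eta0 * min(1.0, (t + 1) / warmup_steps)
```

In practice warm-up is usually composed with one of the decay schedules, for example by multiplying the warm-up ratio into the decayed rate.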
Convergence properties
For convex objectives with Lipschitz-continuous gradients, SGD with a decaying learning rate satisfying
- $ \sum_{t=1}^{\infty} \eta_t = \infty, \qquad \sum_{t=1}^{\infty} \eta_t^2 < \infty $
converges almost surely to a global minimum, provided the gradient noise has bounded variance (the Robbins–Monro conditions). For non-convex problems, the typical regime for deep learning, the guarantee is weaker: SGD converges only to a stationary point, although empirical evidence shows it often finds good local minima.
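A small worked case illustrates the Robbins–Monro conditions. The schedule $ \eta_t = 1/t $ satisfies both of them, since $ \sum_t 1/t $ diverges while $ \sum_t 1/t^2 = \pi^2/6 $ is finite. The sketch below assumes the one-dimensional loss $ \ell(\theta; x) = \tfrac{1}{2}(\theta - x)^2 $ and a stream of synthetic samples; under these assumptions the SGD iterate equals the running sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=3.0, scale=1.0, size=10_000)

theta = 0.0
for t, x_t in enumerate(samples, start=1):
    eta_t = 1.0 / t          # satisfies both Robbins-Monro sums
    grad = theta - x_t       # gradient of 0.5 * (theta - x_t)**2
    theta -= eta_t * grad    # theta is now the mean of the first t samples
```

The minimiser of the expected loss is the population mean, so the iterate's convergence here is just the law of large numbers viewed as an optimisation procedure.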
Popular variants
Several extensions reduce the variance of the gradient estimate or adapt the learning rate per parameter:
| Method | Key idea | Reference |
|---|---|---|
| Momentum | Accumulates an exponentially decaying moving average of past gradients | Polyak, 1964 |
| Nesterov accelerated gradient | Evaluates the gradient at a "look-ahead" position | Nesterov, 1983 |
| AdaGrad | Per-parameter rates that shrink for frequently updated features | Duchi et al., 2011 |
| RMSProp | Fixes AdaGrad's diminishing rates using a moving average of squared gradients | Hinton (lecture notes), 2012 |
| Adam | Combines momentum with RMSProp-style adaptive rates | Kingma & Ba, 2015 |
| AdamW | Decouples weight decay from the adaptive gradient step | Loshchilov & Hutter, 2019 |
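Two of the methods in the table can be sketched as single update steps. The code below shows one common formulation of momentum (a decaying sum of past gradients) and of Adam with bias correction; the function names are hypothetical, the default hyperparameters are conventional values, and other equivalent parameterisations exist.

```python
import numpy as np

def momentum_step(theta, velocity, grad, eta=0.01, beta=0.9):
    """Momentum: step along a decaying accumulation of past gradients."""
    velocity = beta * velocity + grad
    return theta - eta * velocity, velocity

def adam_step(theta, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum-style first moment scaled by an RMSProp-style second moment."""
    m = beta1 * m + (1 - beta1) * grad        # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad**2     # moving average of squared gradients
    m_hat = m / (1 - beta1**t)                # bias correction, t starts at 1
    v_hat = v / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage on f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([10.0, -0.1])
velocity = np.zeros(2)
for _ in range(500):
    theta, velocity = momentum_step(theta, velocity, grad=theta, eta=0.1)
```

A useful property to notice in `adam_step` is that the first update has magnitude close to `eta` in every coordinate, whatever the scale of the gradient, because `m_hat / sqrt(v_hat)` reduces to the sign of the gradient at `t = 1`.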
Practical considerations
- Data shuffling — Re-shuffle the dataset each epoch to avoid cyclic patterns.
- Gradient clipping — Cap the gradient norm to prevent exploding updates, especially in recurrent networks.
- Batch normalisation — Normalising layer inputs reduces sensitivity to the learning rate.
- Mixed-precision training — Using half-precision floats accelerates SGD on modern GPUs with minimal accuracy loss.
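Of the items above, capping the gradient norm is simple to illustrate. The helper below is a minimal sketch of clipping by global L2 norm on a single NumPy array; the name `clip_by_norm` is hypothetical, and deep-learning frameworks ship their own utilities for clipping across many parameter tensors.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale `grad` so its L2 norm is at most `max_norm`; direction is unchanged."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

clipped = clip_by_norm(np.array([3.0, 4.0]), max_norm=1.0)    # norm 5 scaled to 1
untouched = clip_by_norm(np.array([0.3, 0.4]), max_norm=1.0)  # norm 0.5, left as-is
```

The clipped gradient would then replace the raw gradient in the parameter update.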
Applications
SGD and its variants are used across virtually all areas of machine learning:
- Training deep neural networks (computer vision, NLP, speech recognition)
- Large-scale linear models (logistic regression, SVMs via SGD)
- Reinforcement learning policy optimisation
- Collaborative filtering
- Online learning settings where data arrives in a stream
References
- Robbins, H. and Monro, S. (1951). "A Stochastic Approximation Method". Annals of Mathematical Statistics.
- Bottou, L. (2010). "Large-Scale Machine Learning with Stochastic Gradient Descent". COMPSTAT.
- Kingma, D. P. and Ba, J. (2015). "Adam: A Method for Stochastic Optimization". ICLR.
- Ruder, S. (2016). "An overview of gradient descent optimization algorithms". arXiv:1609.04747.