AdaGrad
| Article | |
|---|---|
| Topic area | optimization |
| Prerequisites | Stochastic gradient descent, Gradient, Loss function |
Overview
AdaGrad, short for Adaptive Gradient Algorithm, is a first-order optimization method that adapts the learning rate of each parameter individually, scaling updates inversely to the square root of the sum of all historical squared gradients for that parameter. Introduced by Duchi, Hazan, and Singer in 2011, it was one of the first widely used adaptive learning rate methods and a direct ancestor of modern optimizers such as RMSProp, Adadelta, and Adam. The motivating insight is that frequently occurring features should receive smaller updates while rare features should receive larger ones, which is especially valuable for sparse data such as natural language text or click streams.
AdaGrad sits between vanilla Stochastic gradient descent and the family of adaptive optimizers used to train modern neural networks. While its automatic per-parameter scaling reduces the burden of learning rate tuning, the unbounded accumulation of squared gradients causes the effective step size to shrink monotonically toward zero, a property that is sometimes desirable for convex problems but often hampers training of deep networks. Today AdaGrad is rarely used as a default optimizer for deep learning, but it remains pedagogically important and continues to appear in convex optimization, online learning, and high-dimensional sparse regression settings.
Motivation and Intuition
In classical Stochastic gradient descent (SGD), every parameter is updated with the same scalar learning rate. This is awkward when different parameters require different update magnitudes. Consider a language model trained on word embeddings: the embedding for a common word like "the" receives gradient signal on nearly every minibatch, while the embedding for a rare word may be touched only a handful of times across an entire epoch. Using the same learning rate for both forces a compromise: small enough to keep the common word stable, but then too small to make meaningful progress on rare words.
AdaGrad addresses this by giving each parameter its own learning rate that decays based on how much gradient signal that parameter has already received. Parameters touched often by large gradients have their step sizes shrunk; parameters touched rarely retain larger steps. The result is implicit learning rate annealing tailored per-coordinate, with no need for the practitioner to schedule decays manually.
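As a concrete illustration of the scaling defined formally below: with a base learning rate of $ \eta = 0.1 $ (and ignoring $ \epsilon $), a coordinate that has accumulated 100 unit-magnitude gradients has an effective step size of $ 0.1/\sqrt{100} = 0.01 $, while a coordinate that has seen only 4 such gradients still steps at $ 0.1/\sqrt{4} = 0.05 $, five times larger.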
The effect can be understood geometrically. AdaGrad approximates a diagonal preconditioning of the gradient by the inverse square root of an outer-product accumulator of past gradients. This is a cheap surrogate for second-order methods such as Newton's method, which precondition by the inverse Hessian matrix. Where Newton's method is too expensive for high-dimensional problems, AdaGrad provides a coordinate-wise approximation that captures some of the same scale-invariance benefits at the cost of one extra accumulator per parameter.
Formulation
Let $ \theta \in \mathbb{R}^d $ denote the parameter vector and $ g_t = \nabla_\theta L_t(\theta_{t-1}) $ the gradient of the loss $ L_t $ at step $ t $. AdaGrad maintains a running sum of element-wise squared gradients
$ {\displaystyle G_t = \sum_{\tau=1}^{t} g_\tau \odot g_\tau} $
where $ \odot $ denotes the element-wise (Hadamard) product. The parameter update is
$ {\displaystyle \theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t} $
where $ \eta $ is a global step size (often called the base learning rate) and $ \epsilon $ is a small constant, typically $ 10^{-8} $, that prevents division by zero on the first step and dampens updates for parameters whose accumulators are still small. The square root and division are taken element-wise. Some presentations and implementations add $ \epsilon $ outside the square root instead, dividing by $ \sqrt{G_t} + \epsilon $ as in the pseudocode below; the difference is negligible in practice.
Equivalently, writing the running sum as a vector $ s_t \in \mathbb{R}^d $ (so that the diagonal-matrix form used in the original analysis is $ \mathrm{diag}(s_t) $), the update for the $ i $-th coordinate is
$ {\displaystyle \theta_{t,i} = \theta_{t-1,i} - \frac{\eta}{\sqrt{s_{t,i} + \epsilon}} \, g_{t,i}.} $
The original paper presents a more general full-matrix variant in which $ G_t $ is the full outer-product accumulator $ \sum_\tau g_\tau g_\tau^\top $ and the update uses $ G_t^{-1/2} $. The full-matrix variant has stronger theoretical guarantees but is rarely used in practice because storing and inverting a $ d \times d $ matrix is infeasible when $ d $ reaches millions or billions of parameters. The diagonal variant described above is what is meant when practitioners say "AdaGrad" without qualification.
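To make the cost of the full-matrix variant concrete, here is a minimal NumPy sketch of a single step (the function name and hyperparameters are illustrative, not from the original paper); it applies $ G_t^{-1/2} $ through an eigendecomposition, which takes $ O(d^3) $ time and $ O(d^2) $ memory and is therefore only viable for small $ d $:

```python
import numpy as np

def full_matrix_adagrad_step(theta, g, G, eta=0.1, eps=1e-8):
    """One full-matrix AdaGrad step; G accumulates the outer products g g^T."""
    G = G + np.outer(g, g)                          # d x d accumulator
    # Apply (G + eps*I)^(-1/2) to g via an eigendecomposition of the accumulator.
    vals, vecs = np.linalg.eigh(G + eps * np.eye(len(g)))
    theta = theta - eta * vecs @ ((vecs.T @ g) / np.sqrt(vals))
    return theta, G
```

The diagonal variant described in the next section needs only element-wise operations and one extra vector of state.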
Algorithm
A typical implementation for the diagonal variant is straightforward:
```
Initialize:
    theta := initial parameters
    s     := 0                       (vector of the same shape as theta)
    eta   := base learning rate      (e.g. 0.01)
    eps   := 1e-8

For t = 1, 2, ... until convergence:
    Sample a minibatch and compute gradient g_t
    s     := s + g_t * g_t                          # element-wise square accumulation
    theta := theta - eta * g_t / (sqrt(s) + eps)
```
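As a runnable counterpart, the following NumPy sketch applies the same loop to a toy least-squares problem with full-batch gradients (the problem setup is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))            # toy design matrix
x_true = rng.normal(size=5)
b = A @ x_true                           # noiseless targets

theta = np.zeros(5)                      # parameters
s = np.zeros(5)                          # per-coordinate accumulator of squared gradients
eta, eps = 0.5, 1e-8

for t in range(500):
    g = A.T @ (A @ theta - b) / len(b)   # gradient of the (halved) mean squared error
    s += g * g                           # accumulate element-wise squared gradients
    theta -= eta * g / (np.sqrt(s) + eps)

print("parameter error:", np.linalg.norm(theta - x_true))
```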
The accumulator $ s $ has the same shape as the parameter vector, so AdaGrad roughly doubles the memory footprint relative to plain SGD (the parameters plus one accumulator). Adam, which tracks both first- and second-moment estimates, requires roughly three times the parameter memory.
In modern frameworks such as PyTorch and TensorFlow, AdaGrad is exposed as a drop-in optimizer (torch.optim.Adagrad, tf.keras.optimizers.Adagrad) with optional initial accumulator value, learning rate decay, and weight decay knobs.
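A minimal PyTorch usage sketch (the model, data, and hyperparameters here are placeholders) looks the same as for any other built-in optimizer:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                        # placeholder model
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01, eps=1e-10)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)                  # placeholder minibatch

for step in range(100):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = loss_fn(model(x), y)
    loss.backward()                # compute gradients
    optimizer.step()               # AdaGrad update: accumulate squared grads, scale the step
```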
Convergence and Regret
AdaGrad was originally analyzed in the online convex optimization framework. For convex losses with bounded gradients, the diagonal variant achieves a regret bound of
$ {\displaystyle R(T) = \sum_{t=1}^{T} L_t(\theta_t) - \min_{\theta^*} \sum_{t=1}^{T} L_t(\theta^*) = O\left(\sum_{i=1}^{d} \sqrt{\sum_{t=1}^{T} g_{t,i}^2}\right),} $
where the constant hidden in the $ O(\cdot) $ depends on the diameter of the feasible parameter set.
Crucially, the per-coordinate sum inside makes this bound much tighter than the corresponding $ O(\sqrt{dT}) $ bound for SGD when gradients are sparse, because coordinates that are rarely active contribute little to the total. This is the formal statement of the intuition that AdaGrad benefits problems with sparse features.
For non-convex objectives, including most deep networks, no comparable global guarantees exist, but coordinate-wise scaling still offers practical preconditioning benefits, particularly early in training before the accumulator grows large.
Variants and Successors
AdaGrad's monotonic accumulation of squared gradients leads to ever-shrinking step sizes, which can stall training before the model reaches a good optimum. Several successors address this directly:
- RMSProp (Tieleman and Hinton, 2012) replaces the running sum with an exponential moving average of squared gradients, $ s_t = \rho s_{t-1} + (1-\rho) g_t \odot g_t $. This keeps the effective learning rate from decaying to zero and restores the ability to make large updates if gradient magnitudes change; the sketch after this list contrasts the accumulator updates.
- Adadelta (Zeiler, 2012) takes RMSProp's exponential averaging further by also rescaling each step with a running average of past squared parameter updates, eliminating the global learning rate $ \eta $ as an explicit hyperparameter.
- Adam (Kingma and Ba, 2014) combines RMSProp's running average of squared gradients with a separate running average of gradients themselves, providing momentum-like behavior in addition to per-parameter scaling. Adam is now the dominant optimizer for deep learning.
- AdaGrad-Norm uses a single scalar accumulator $ s_t = \sum_\tau \|g_\tau\|^2 $ shared across all parameters. This loses the per-parameter adaptation but is simpler to analyze and has clean convergence guarantees for non-convex optimization.
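These successors differ mainly in how the squared-gradient statistic is maintained. A minimal sketch contrasting the AdaGrad, RMSProp, and AdaGrad-Norm accumulators (function names are illustrative only):

```python
import numpy as np

def adagrad_accumulate(s, g):
    """AdaGrad: monotone sum of squared gradients, so step sizes only shrink."""
    return s + g * g

def rmsprop_accumulate(s, g, rho=0.9):
    """RMSProp: exponential moving average, so old gradients are gradually forgotten."""
    return rho * s + (1.0 - rho) * g * g

def adagrad_norm_accumulate(s_scalar, g):
    """AdaGrad-Norm: a single scalar accumulating the squared gradient norm."""
    return s_scalar + float(np.dot(g, g))
```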
A separate research thread on sparse AdaGrad updates has also produced lazy variants that only touch the accumulator entries corresponding to non-zero gradient coordinates, an important optimization for very high-dimensional sparse models.
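A minimal sketch of the lazy idea (the dictionary-based state layout and function name are assumptions made for illustration, not any particular library's API): only accumulator entries for the active feature indices are read and written on each example.

```python
from collections import defaultdict
import math

def lazy_adagrad_update(theta, s, sparse_grad, eta=0.1, eps=1e-8):
    """Update only the coordinates with nonzero gradient.

    theta, s: dicts mapping feature index -> value (missing keys default to 0.0)
    sparse_grad: dict mapping active feature index -> gradient value
    """
    for i, g_i in sparse_grad.items():
        s[i] += g_i * g_i                               # touch only active accumulator entries
        theta[i] -= eta * g_i / (math.sqrt(s[i]) + eps)

theta = defaultdict(float)
s = defaultdict(float)
lazy_adagrad_update(theta, s, {3: 0.5, 1024: -1.2})     # example with two active features
```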
Comparison with Other Optimizers
Compared to plain SGD, AdaGrad eliminates the need for manual learning rate schedules in many convex settings and handles sparse gradients gracefully. Its main disadvantage is the eventual stalling caused by accumulator growth.
Compared to Adam, AdaGrad is simpler, has fewer hyperparameters, and uses less memory, but it generally converges more slowly and reaches worse solutions when training deep networks. Adam's exponential averaging avoids the stalling problem, and its momentum-like first-moment term helps progress through narrow ravines in the loss landscape.
Compared to second-order methods such as Newton's method or L-BFGS, AdaGrad is dramatically cheaper, requiring only diagonal storage and element-wise operations, but offers a much weaker form of preconditioning.
In practice, AdaGrad still appears in production for recommender systems, click-through rate prediction, and large-scale logistic regression, where features are highly sparse and the convex formulation aligns with AdaGrad's theoretical strengths. Google's FTRL-Proximal (Follow The Regularized Leader) algorithm for ad click-through rate prediction is closely related and shares the per-coordinate accumulation idea.
Limitations
The defining limitation of AdaGrad is the monotonic decay of the effective learning rate. Because $ G_t $ only grows, the step size $ \eta / \sqrt{G_t + \epsilon} $ can only shrink. For long training runs, especially on non-convex problems where the model needs to make large adjustments late in training, this aggressive annealing causes premature stalling. RMSProp and Adam were designed to address exactly this issue.
A secondary issue is sensitivity to the base learning rate $ \eta $. While AdaGrad reduces sensitivity to per-parameter learning rate scaling, it is still sensitive to the global scale, and a poorly chosen $ \eta $ can cause either explosive updates early on or too-slow progress.
Finally, AdaGrad's preconditioning is purely diagonal: it cannot account for correlations between parameters. Problems where parameter dimensions are strongly coupled, such as those arising in ill-conditioned linear systems, may not benefit substantially from AdaGrad relative to a tuned SGD-with-momentum.
Applications
AdaGrad's strongest practical foothold is in convex problems with sparse high-dimensional features:
- Click-through rate prediction and other large-scale logistic regression models in advertising and recommendation systems.
- Sparse natural language processing models, particularly bag-of-words and n-gram classifiers, where most features are zero on any given example.
- Word embedding training in early implementations of word2vec and similar models, where rare-word embeddings benefit from large updates.
- Online learning settings where data arrive sequentially and per-coordinate adaptation is more important than careful schedule design.
For modern deep learning workloads, AdaGrad has been largely supplanted by Adam, AdamW, and related methods, though it remains a useful pedagogical entry point and a sensible baseline for new convex problems.
References
- Duchi, J., Hazan, E., and Singer, Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
- Tieleman, T. and Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
- Zeiler, M. D. ADADELTA: An Adaptive Learning Rate Method. arXiv:1212.5701, 2012.
- Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization. arXiv:1412.6980, 2014.
- McMahan, H. B. et al. Ad Click Prediction: a View from the Trenches. KDD, 2013.