RMSProp

    Topic area: Optimization
    Prerequisites: Stochastic Gradient Descent, Backpropagation, Gradient Descent


    Overview

    RMSProp (Root Mean Square Propagation) is an adaptive learning-rate optimization algorithm for training neural networks with Stochastic Gradient Descent. Proposed by Geoffrey Hinton in his 2012 Coursera lecture on neural networks, RMSProp divides each parameter's update by the square root of an exponentially decaying running average of that parameter's squared gradients. Parameters whose gradients have been consistently large receive a smaller effective step size, while parameters with small or sparse gradients receive larger updates. This per-parameter rescaling helps the optimizer make rapid progress along shallow directions of the loss landscape without diverging along steep ones, and it was for several years the default optimizer for recurrent neural networks before the rise of Adam.[1]

    The method occupies a middle ground between vanilla Gradient Descent and fully adaptive methods such as Adam. It addresses a well-known failure mode of AdaGrad, in which the cumulative sum of squared gradients grows monotonically and eventually shrinks the effective learning rate to zero. By replacing the cumulative sum with an exponential moving average, RMSProp keeps step sizes responsive to recent curvature and remains usable for long training runs and on non-stationary objectives such as those encountered in deep learning and reinforcement learning.
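
    The contrast is visible directly in the two accumulators. Writing $ g_t $ for the gradient at step $ t $ (squares taken elementwise), AdaGrad's sum can only grow, while RMSProp's average forgets old gradients at a rate set by $ \rho $:

    $ {\displaystyle v_t^{\text{AdaGrad}} = v_{t-1} + g_t^2, \qquad v_t^{\text{RMSProp}} = \rho\, v_{t-1} + (1 - \rho)\, g_t^2} $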

    Intuition

    The core idea is to estimate, for each parameter, the typical magnitude of recent gradients and to divide each update by that estimate. If a parameter's gradient has been consistently large in absolute value, the running average of squared gradients is also large, and the update is dampened. If the gradient has been small, the divisor is small, so the same nominal learning rate produces a larger effective step. The result is approximate per-parameter step-size normalization without requiring second-order information such as the Hessian.
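
    As a rough numerical illustration (the numbers are invented for exposition), consider one parameter whose recent gradients have magnitude around $ 10 $ and another whose gradients have magnitude around $ 0.01 $. Their running averages of squared gradients are roughly $ 100 $ and $ 10^{-4} $, and dividing each gradient by the square root of its own average ($ 10 $ and $ 0.01 $, respectively) yields updates of roughly the same size in both cases, even though the raw gradients differ by three orders of magnitude.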

    A useful mental picture is that of a long, narrow valley in the loss surface. Plain SGD with a single global learning rate either bounces back and forth across the steep walls (if the rate is too large) or crawls along the floor (if the rate is too small). RMSProp shrinks the step size in the steep direction, where squared gradients are large, while leaving steps in the gentle direction relatively untouched. The trajectory becomes smoother and progress along the floor accelerates. Unlike AdaGrad, the squared-gradient estimate decays over time, so the optimizer does not lose the ability to take large steps later in training when needed.

    Formulation

    Let $ \theta_t \in \mathbb{R}^n $ denote the parameter vector at step $ t $, and let $ g_t = \nabla_\theta L(\theta_t) $ be the gradient of the loss $ L $ with respect to the parameters, evaluated on a Mini-Batch. RMSProp maintains an exponential moving average $ v_t $ of the elementwise squared gradients and uses its square root to rescale the update:

    $ {\displaystyle v_t = \rho\, v_{t-1} + (1 - \rho)\, g_t \odot g_t} $

    $ {\displaystyle \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t} $

    Here $ \eta $ is the global learning rate, $ \rho \in [0, 1) $ is the decay rate of the moving average (often denoted $ \beta $ or $ \gamma $), and $ \epsilon $ is a small constant that prevents division by zero. The operation $ \odot $ denotes elementwise multiplication, and the square root and division are also elementwise. The state $ v_0 $ is initialized to zero.

    A common alternative writes the denominator as $ \sqrt{v_t + \epsilon} $, with $ \epsilon $ inside the square root rather than added afterward; this is the form used in TensorFlow and gives slightly different numerical behavior near the start of training but is otherwise equivalent in spirit. Typical hyperparameter defaults are $ \eta = 10^{-3} $, $ \rho = 0.9 $, and $ \epsilon = 10^{-8} $, although the learning rate often needs to be tuned per problem.
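
    The update translates directly into a few lines of array code. The following is a minimal NumPy sketch of the rule above (the function and variable names are illustrative, not taken from any library); it uses the form with $ \epsilon $ outside the root, and a comment notes the alternative placement.

        import numpy as np

        def rmsprop_step(theta, grad, state, lr=1e-3, rho=0.9, eps=1e-8):
            # v_t = rho * v_{t-1} + (1 - rho) * g_t * g_t   (elementwise)
            state["v"] = rho * state["v"] + (1.0 - rho) * grad * grad
            # theta_{t+1} = theta_t - lr * g_t / (sqrt(v_t) + eps); swap in
            # np.sqrt(state["v"] + eps) for the epsilon-inside-the-root variant.
            return theta - lr * grad / (np.sqrt(state["v"]) + eps)

        # Toy usage: a few steps on the quadratic f(x, y) = x**2 + 100 * y**2.
        theta = np.array([1.0, 1.0])
        state = {"v": np.zeros_like(theta)}
        for _ in range(1000):
            grad = np.array([2.0 * theta[0], 200.0 * theta[1]])
            theta = rmsprop_step(theta, grad, state)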

    Training and Inference

    RMSProp is used only at training time; at inference, the trained weights are applied directly with no optimizer state involved. During training, the optimizer maintains one extra tensor of the same shape as the parameters (the running average $ v_t $), so the memory overhead is roughly equal to that of the model itself. This is cheaper than Adam, which stores both first and second moments, but more expensive than vanilla SGD, which stores none.

    Each step requires one elementwise square, one elementwise multiply-add to update $ v_t $, one elementwise square root, and one elementwise division. These operations are trivially parallel on GPUs and contribute negligible overhead compared to the forward and backward passes through the network. Because the moment estimates accumulate over batches, the meaningful unit of progress is the optimizer step rather than the wall-clock second; learning-rate schedules and warmup periods are typically expressed in steps.

    When resuming training from a checkpoint, both the parameters and the optimizer state must be restored. Restoring only the parameters effectively resets the squared-gradient estimate to zero, which causes the first few updates to be excessively large and can destabilize training on a cold start. Most deep-learning frameworks save and restore the optimizer state automatically when checkpointing.
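
    As an illustration, a PyTorch-style checkpoint that captures both pieces of state might look like the sketch below (the tiny model and file path are placeholders). The optimizer's state_dict() includes the running average $ v_t $, so restoring it avoids the cold-start effect described above.

        import torch

        model = torch.nn.Linear(10, 1)  # placeholder model
        optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.9, eps=1e-8)

        # ... training steps would go here ...

        # Save parameters and optimizer state (including the running squared-gradient average).
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict()}, "checkpoint.pt")

        # Resume: restoring only the model would silently reset the squared-gradient average to zero.
        checkpoint = torch.load("checkpoint.pt")
        model.load_state_dict(checkpoint["model"])
        optimizer.load_state_dict(checkpoint["optimizer"])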

    Variants

    Several variants of the basic algorithm appear in the literature and in practice. The original Hinton lecture describes RMSProp without momentum, but a common extension adds Momentum by maintaining a separate velocity buffer:

    $ {\displaystyle m_t = \mu\, m_{t-1} + \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t} $

    $ {\displaystyle \theta_{t+1} = \theta_t - m_t} $

    This RMSProp-with-momentum form is what TensorFlow exposes when its momentum hyperparameter is non-zero, and it tends to perform better on problems with poorly conditioned curvature. A further extension proposed by Graves, usually called centered RMSProp, additionally tracks a running average of the (unsquared) gradients and subtracts its elementwise square from $ v_t $, so the update is normalized by an estimate of the gradient's variance rather than by its raw second moment.[2]
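
    A minimal sketch of the momentum form, extending the plain NumPy update from the Formulation section (names are again illustrative), looks as follows; the state now carries both buffers, e.g. state = {"v": np.zeros_like(theta), "m": np.zeros_like(theta)}.

        import numpy as np

        def rmsprop_momentum_step(theta, grad, state, lr=1e-3, rho=0.9, mu=0.9, eps=1e-8):
            # Running average of squared gradients, exactly as in plain RMSProp.
            state["v"] = rho * state["v"] + (1.0 - rho) * grad * grad
            # m_t = mu * m_{t-1} + lr * g_t / (sqrt(v_t) + eps)
            state["m"] = mu * state["m"] + lr * grad / (np.sqrt(state["v"]) + eps)
            # theta_{t+1} = theta_t - m_t
            return theta - state["m"]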

    The closely related AdaDelta algorithm, introduced by Zeiler in 2012, was developed independently and arrives at a similar update rule but additionally rescales the numerator by a running average of squared parameter updates, eliminating the need to specify a global learning rate.[3] Adam can be viewed as RMSProp augmented with a first-moment estimate (a smoothed gradient) and bias-correction terms; in practice the two often produce comparable results on supervised tasks, with Adam more commonly chosen as the default in modern training pipelines.
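
    For contrast, a sketch of a single Adam step in the same illustrative style (t is the 1-based step count) makes the two additions visible: the smoothed gradient and the bias-correction factors.

        import numpy as np

        def adam_step(theta, grad, state, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
            # First moment: exponentially smoothed gradient (absent in RMSProp).
            state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
            # Second moment: the same kind of running average RMSProp maintains.
            state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad * grad
            # Bias correction compensates for initializing both buffers at zero.
            m_hat = state["m"] / (1.0 - beta1 ** t)
            v_hat = state["v"] / (1.0 - beta2 ** t)
            return theta - lr * m_hat / (np.sqrt(v_hat) + eps)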

    Comparison with Other Optimizers

    Compared to AdaGrad, RMSProp avoids the indefinite shrinkage of step sizes by exponentially decaying old squared-gradient information; this makes it preferable for long training runs and for non-stationary problems. Compared to plain SGD with a fixed global learning rate, RMSProp is more forgiving of poor learning-rate choices and converges faster on problems with heterogeneous parameter scales, such as when training across different layer types or modalities. The cost is one extra state tensor per parameter and slightly more arithmetic per step.

    Compared to Adam, RMSProp omits the first-moment (velocity) estimate and the bias-correction step. In settings where the first moment provides a meaningful smoothing of noisy gradients, such as small batch sizes or noisy reinforcement-learning rollouts, Adam tends to outperform RMSProp. In settings where the gradients are already relatively well-behaved, the two are often interchangeable, and RMSProp's simplicity and slightly lower memory footprint can be an advantage. Empirically, RMSProp with momentum was the default for many deep recurrent network and policy-gradient training pipelines through the mid-2010s and continues to appear as a baseline in reinforcement-learning libraries.

    Limitations

    RMSProp inherits the general limitations of adaptive optimizers. The per-parameter scaling can interfere with explicit weight decay: folding the L2 penalty into the gradient means the penalty is rescaled by $ 1/(\sqrt{v_t} + \epsilon) $ along with everything else, which gives a different effective regularization than decaying the weights directly, and this mismatch motivated the decoupled-weight-decay variants that have since become standard for Adam-like methods. Care is also required when combining RMSProp with learning-rate warmup or sudden schedule changes, because the running average $ v_t $ needs on the order of $ 1/(1-\rho) $ steps to adapt to a new gradient scale and can interact unexpectedly with abrupt step-size jumps.
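
    The coupling issue can be made concrete with a sketch in the same illustrative style as the earlier snippets: in the L2-coupled form the penalty gradient passes through the adaptive rescaling, while in the decoupled form the weights are shrunk directly and the statistics in $ v_t $ are unaffected by the penalty.

        import numpy as np

        def rmsprop_l2_coupled(theta, grad, state, lr=1e-3, rho=0.9, eps=1e-8, wd=1e-2):
            # The L2 term is folded into the gradient, so it is rescaled by 1/(sqrt(v)+eps).
            g = grad + wd * theta
            state["v"] = rho * state["v"] + (1.0 - rho) * g * g
            return theta - lr * g / (np.sqrt(state["v"]) + eps)

        def rmsprop_decoupled_wd(theta, grad, state, lr=1e-3, rho=0.9, eps=1e-8, wd=1e-2):
            # The adaptive update sees only the raw gradient; the decay acts on the weights directly.
            state["v"] = rho * state["v"] + (1.0 - rho) * grad * grad
            return theta - lr * grad / (np.sqrt(state["v"]) + eps) - lr * wd * theta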

    The algorithm has no convergence guarantees on non-convex objectives stronger than those for SGD, and on some convex problems it has been shown to fail to converge to a stationary point even in the limit of infinite samples, a deficiency it shares with Adam.[4] In practice this is rarely a problem for over-parameterized neural networks, but it does mean that RMSProp should not be relied on as a black-box solver in settings where convergence guarantees matter. Finally, like all elementwise adaptive methods, RMSProp ignores correlations between parameters and so cannot exploit curvature structure that a true second-order method would capture.

    References

    1. Hinton, G. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning, 2012.
    2. Graves, A. Generating Sequences With Recurrent Neural Networks. arXiv:1308.0850, 2013.
    3. Zeiler, M. D. ADADELTA: An Adaptive Learning Rate Method. arXiv:1212.5701, 2012.
    4. Reddi, S. J., Kale, S., Kumar, S. On the Convergence of Adam and Beyond. ICLR, 2018.