Cyclic Learning Rates

    Topic area: Optimization
    Prerequisites: Stochastic Gradient Descent, Learning Rate Schedule


    Overview

    Cyclic learning rates (CLR) are a family of learning rate schedules that vary the learning rate periodically between a lower and an upper bound rather than decreasing it monotonically. The technique was introduced by Leslie N. Smith in 2015 as a practical alternative to hand-tuned step decays and to the more elaborate adaptive optimizers, with the empirical observation that letting the learning rate oscillate often reaches a target accuracy in fewer iterations than a carefully tuned fixed or decaying schedule.[1] The same paper introduced the learning rate range test, a short sweep that locates the bounds of the cycle automatically.

    Cyclic schedules have since become a default choice in libraries such as PyTorch (`torch.optim.lr_scheduler.CyclicLR`) and the fast.ai training stack, and they underpin the popular one-cycle policy used to train deep networks to high accuracy in tens rather than hundreds of epochs.

    Background and motivation

    Most pre-2015 training recipes used a piecewise-constant learning rate with one or two manual drops, sometimes augmented by a learning-rate decay tied to validation loss. Two practical problems motivate cyclic schedules. First, picking the initial learning rate and the timing of decays requires costly trial-and-error sweeps. Second, even a well-tuned monotonic schedule can leave the optimizer trapped near saddle points or in narrow, sharp minima from which a small step cannot escape.

    Smith's hypothesis is that periodically increasing the learning rate has a useful regularizing effect: large steps push the iterate away from sharp minima and across saddle ridges, while the subsequent low-rate phases allow it to settle into nearby flatter basins. The total computation spent at very small learning rates is also reduced, which contributes to the observed wall-clock speedup.

    The triangular schedule

    The basic CLR policy, called triangular, linearly interpolates the learning rate between a base value $ \eta_{\min} $ and a maximum $ \eta_{\max} $ over a half-period of length $ s $ (the step size), then linearly returns to $ \eta_{\min} $ over the next half-period. Define the cycle index

    $ c = \left\lfloor 1 + \frac{t}{2s} \right\rfloor, $

    and the position within the cycle

    $ x = \left| \frac{t}{s} - 2c + 1 \right|. $

    The learning rate at iteration $ t $ is then

    $ \eta_t = \eta_{\min} + (\eta_{\max} - \eta_{\min})\,\max(0, 1 - x). $

    A typical step size is two to ten times the number of iterations in one epoch. Smith reports good results with cycles short enough that several complete cycles fit within the training budget.
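
    The policy is simple enough to implement directly. The sketch below transcribes the three formulas above into plain Python; the function name `triangular_lr` and the numeric bounds in the example are illustrative, not taken from any library.

    ```python
    import math

    def triangular_lr(t, eta_min, eta_max, step_size):
        """Triangular CLR: learning rate at iteration t, for bounds
        [eta_min, eta_max] and half-period step_size (in iterations)."""
        cycle = math.floor(1 + t / (2 * step_size))   # index of the current cycle
        x = abs(t / step_size - 2 * cycle + 1)        # position within the cycle
        return eta_min + (eta_max - eta_min) * max(0.0, 1.0 - x)

    # Example: bounds 1e-4 and 1e-2, step size of 2000 iterations.
    # The rate peaks at t = 2000 and returns to the base value at t = 4000.
    for t in (0, 1000, 2000, 3000, 4000):
        print(t, triangular_lr(t, 1e-4, 1e-2, 2000))
    ```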

    Learning rate range test

    To choose $ \eta_{\min} $ and $ \eta_{\max} $, Smith proposes the range test (also known as the LR finder). One trains for a few hundred iterations while increasing the learning rate exponentially from a tiny value (e.g. $ 10^{-7} $) to a large one (e.g. $ 10 $), recording the training loss at each step. The resulting loss-versus-rate curve typically shows three regimes: a flat region where the rate is too small to make progress, a steeply descending region of effective learning, and a divergent region where the loss explodes.

    A practical heuristic is to set $ \eta_{\max} $ at or just below the rate at which loss begins to climb, and $ \eta_{\min} $ roughly one order of magnitude smaller, near the start of the descending region. The range test is cheap (typically less than one full epoch) and replaces several full training runs needed for a traditional grid search.
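
    A minimal sketch of the range test is shown below, assuming a PyTorch model, data loader, and loss function. The function `lr_range_test` and its defaults are illustrative rather than a library API; fast.ai ships a ready-made equivalent (`lr_find`).

    ```python
    import math
    import torch

    def lr_range_test(model, loader, loss_fn, lr_min=1e-7, lr_max=10.0, num_iters=300):
        """Sweep the learning rate exponentially from lr_min to lr_max,
        recording (lr, loss) pairs; plot the result to pick the CLR bounds."""
        optimizer = torch.optim.SGD(model.parameters(), lr=lr_min, momentum=0.9)
        gamma = (lr_max / lr_min) ** (1.0 / num_iters)   # per-step multiplier
        history, lr, data_iter = [], lr_min, iter(loader)
        for _ in range(num_iters):
            try:
                inputs, targets = next(data_iter)
            except StopIteration:                        # recycle the loader if needed
                data_iter = iter(loader)
                inputs, targets = next(data_iter)
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()
            loss_val = loss.item()
            history.append((lr, loss_val))
            if not math.isfinite(loss_val):              # stop once the loss diverges
                break
            lr *= gamma
            for group in optimizer.param_groups:         # apply the next rate
                group["lr"] = lr
        return history
    ```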

    Variants

    Smith's original paper defines three policies that share the triangular base shape but modulate the amplitude over time.

    • triangular: pure triangle, constant amplitude across all cycles.
    • triangular2: the difference $ \eta_{\max} - \eta_{\min} $ is halved at the end of each cycle, gradually shrinking the oscillation while preserving its shape. This blends the exploration benefit of cycling with the convergence behaviour of a decaying schedule.
    • exp_range: the amplitude decays exponentially, $ (\eta_{\max} - \eta_{\min})\gamma^t $, for some $ \gamma \in (0,1) $.
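
    All three policies are available through the `mode` argument of PyTorch's `torch.optim.lr_scheduler.CyclicLR`. A minimal usage sketch follows; the model, optimizer, and numeric values are placeholders.

    ```python
    import torch

    model = torch.nn.Linear(10, 2)  # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

    # mode selects the amplitude policy: 'triangular', 'triangular2', or 'exp_range'
    # (gamma is used only by 'exp_range', where the amplitude decays as gamma**t).
    scheduler = torch.optim.lr_scheduler.CyclicLR(
        optimizer,
        base_lr=1e-4,       # eta_min, e.g. from a range test
        max_lr=1e-2,        # eta_max, e.g. from a range test
        step_size_up=2000,  # half-period s, in iterations
        mode="triangular2",
    )

    for step in range(4000):
        # ... forward pass, loss.backward(), optimizer.step() would go here ...
        scheduler.step()    # advance the schedule once per batch
    ```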

    A closely related cosine variant replaces the triangle with a half-cosine wave; this smoother transition is the basis of SGDR (stochastic gradient descent with warm restarts) introduced by Loshchilov and Hutter.[2]

    The one-cycle policy

    In a 2018 follow-up, Smith proposed the one-cycle policy, in which the learning rate executes a single triangle over the entire training run, optionally followed by a short annihilation phase that drops it well below $ \eta_{\min} $ for the final few percent of iterations.[3] The policy is often combined with a mirrored cyclic momentum schedule that decreases the momentum coefficient while the learning rate increases, then reverses, on the heuristic that high learning rates are better tolerated when momentum is reduced.

    One-cycle training was popularized as a recipe for super-convergence, in which CIFAR-10 and ImageNet models are trained to competitive accuracy in roughly an order of magnitude fewer iterations than with conventional schedules. The PyTorch implementation is `torch.optim.lr_scheduler.OneCycleLR`.
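
    A minimal usage sketch of the one-cycle policy with this scheduler is shown below; the model and the training budget are placeholders.

    ```python
    import torch

    model = torch.nn.Linear(10, 2)  # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    epochs, steps_per_epoch = 30, 500  # illustrative budget
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=0.1,            # peak rate, e.g. from a range test
        epochs=epochs,
        steps_per_epoch=steps_per_epoch,
        pct_start=0.3,         # fraction of steps spent ramping up to max_lr
        cycle_momentum=True,   # mirrored momentum schedule (high -> low -> high)
        final_div_factor=1e4,  # annihilation phase: final lr = initial lr / 1e4
    )

    for step in range(epochs * steps_per_epoch):
        # ... forward pass, loss.backward(), optimizer.step() would go here ...
        scheduler.step()       # call once per batch
    ```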

    Comparison with other schedules

    Cyclic schedules occupy an intermediate position between fully monotonic decay and the warm-restart family. Compared with step decay or cosine annealing, cyclic schedules spend significantly more time at high learning rates, which acts as an implicit regularizer and can be especially useful when training data is limited or noisy. Compared with cosine annealing with warm restarts (SGDR), the triangular shape is simpler and arguably easier to reason about, while SGDR tends to produce smoother loss curves and lends itself naturally to snapshot ensembling.[4]

    Adaptive optimizers such as Adam interact non-trivially with cyclic schedules: the adaptive per-parameter scaling already attenuates large effective steps for high-curvature directions, so the empirical benefit of cycling is generally smaller than with plain SGD with momentum. The one-cycle recipe is most often paired with SGD plus momentum.

    Practical considerations and limitations

    Cyclic schedules introduce three new hyperparameters: the bounds $ \eta_{\min} $ and $ \eta_{\max} $ and the step size. The range test largely automates the choice of bounds, and good defaults exist for the step size. The principal cost is that intermediate snapshots are not directly comparable across cycles: validation accuracy oscillates with the learning rate, so model selection should occur at the end of a cycle rather than at arbitrary iterations.

    Several caveats are documented in the literature and in practitioner experience.

    • The interaction with batch normalization statistics can be non-trivial; very high learning rates briefly destabilize running statistics, which sometimes requires extending the warm-up portion of the first cycle.
    • The optimal $ \eta_{\max} $ depends on batch size and model width; doubling the batch size typically allows a proportionally higher peak rate.
    • For very deep transformer models, the one-cycle policy is less commonly used than linear warm-up followed by cosine decay, partly because layer-norm and pre-norm architectures tolerate higher constant rates.

    Cyclic learning rates remain a strong default for medium-scale convolutional and recurrent models and a useful diagnostic tool even in workflows that ultimately settle on a different schedule, because the range test provides a fast, principled way to bound any learning rate sweep.

    References

    1. Smith, L. N. Cyclical Learning Rates for Training Neural Networks. WACV, 2017. arXiv:1506.01186.
    2. Loshchilov, I. and Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR, 2017. arXiv:1608.03983.
    3. Smith, L. N. A Disciplined Approach to Neural Network Hyper-parameters: Part 1. Technical Report, US Naval Research Laboratory, 2018. arXiv:1803.09820.
    4. Huang, G. et al. Snapshot Ensembles: Train 1, get M for free. ICLR, 2017. arXiv:1704.00109.