Lion Optimizer

    Topic area: Optimization
    Prerequisites: Stochastic gradient descent, Momentum (optimization), Adam optimizer


    Overview

    The Lion optimizer (short for EvoLved Sign Momentum) is a first-order stochastic optimization algorithm for training neural networks, introduced by Chen et al. in 2023.[1] Unlike most widely used optimizers, which were designed by hand from theoretical principles, Lion was discovered through symbolic program search over a space of candidate optimizer programs. The result is an update rule that is conceptually simpler than Adam yet often matches or exceeds its accuracy on large-scale deep learning workloads, while using roughly half the optimizer state memory.

    Lion's defining characteristic is that the parameter update direction is the elementwise sign of an interpolated momentum term. This makes every coordinate of the update have the same magnitude (a fixed step size scaled by the learning rate), in contrast to Adam, whose adaptive per-coordinate scaling is governed by an exponential moving average of squared gradients. Lion has been adopted in various large language model and vision training pipelines and serves as a touchstone for the research direction of automatically discovered optimizers.

    Background and motivation

    By the early 2020s, AdamW had become the de facto optimizer for transformer training, despite well-known limitations: it stores two moment buffers per parameter (first and second moments), which doubles optimizer memory beyond the parameters themselves, and its hyperparameters are sensitive to model scale. A long line of work attempted to replace Adam by hand, with mixed empirical success: LAMB, Adafactor, Sophia, Shampoo, and others each addressed specific issues but did not displace Adam as the default.

    Lion arose from a different premise: instead of designing an optimizer, search for one. The authors framed an optimizer as a short program operating on gradients and state, defined a search space of primitive operations (additions, multiplications, sign, exponential moving averages, clipping, and so on), and used evolutionary program search with meta-learning-style proxy training tasks to evaluate candidates. Lion is the simplest high-performing program that emerged after substantial pruning of larger discovered candidates.

    Algorithm

    Let $ g_t $ be the gradient of the loss with respect to parameters $ \theta_t $ at step $ t $, let $ m_t $ be a single momentum buffer initialized to zero, and let $ \eta $ be the learning rate. Lion has two interpolation coefficients $ \beta_1, \beta_2 \in [0,1) $ and a weight decay coefficient $ \lambda \ge 0 $. The update at step $ t $ is:

    $ c_t = \beta_1 \, m_{t-1} + (1 - \beta_1) \, g_t $

    $ \theta_t = \theta_{t-1} - \eta \, \big( \operatorname{sign}(c_t) + \lambda \, \theta_{t-1} \big) $

    $ m_t = \beta_2 \, m_{t-1} + (1 - \beta_2) \, g_t $

    The first line forms an update direction $ c_t $ by interpolating between the previous momentum and the current gradient. The second line applies the parameter update using only the elementwise sign of $ c_t $, plus decoupled weight decay in the style of AdamW. The third line then updates the persistent momentum buffer $ m_t $ using a different interpolation coefficient. Crucially, the direction used at the current step ($ c_t $, formed with $ \beta_1 $) and the buffer carried to the next step ($ m_t $, formed with $ \beta_2 $) are governed by separate $ \beta $ values.

    The default hyperparameters reported by Chen et al. are $ \beta_1 = 0.9 $, $ \beta_2 = 0.99 $, and a learning rate roughly an order of magnitude smaller than AdamW's, with weight decay roughly an order of magnitude larger.
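
    The three lines above translate directly into code. The following is a minimal NumPy sketch of a single training step; the function name, argument names, and default values are illustrative rather than taken from any particular library.

        import numpy as np

        def lion_step(theta, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.01):
            # Direction: interpolate previous momentum and current gradient with beta1,
            # then keep only the elementwise sign.
            c = beta1 * m + (1.0 - beta1) * grad
            # Parameter update: signed step plus decoupled (AdamW-style) weight decay.
            theta = theta - lr * (np.sign(c) + weight_decay * theta)
            # Persistent momentum buffer is refreshed with beta2, not beta1.
            m = beta2 * m + (1.0 - beta2) * grad
            return theta, m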

    Intuition

    Three observations help explain why Lion works.

    First, every coordinate of the update has the same magnitude. Because the update is $ \eta \cdot \operatorname{sign}(c_t) $, no coordinate can take a step larger than $ \eta $. This implicit clipping resembles gradient clipping but is applied at the per-coordinate level and at every step, providing robustness to the outlier gradients common in language model training.
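
    A toy example (numbers invented for illustration) shows the effect: even when one coordinate receives an outlier gradient, its step is capped at $ \eta $.

        import numpy as np

        m_prev = np.array([0.01, 0.02, -0.01])   # previous momentum buffer
        g      = np.array([0.02, 50.0, -0.03])   # the second coordinate is an outlier
        beta1, lr = 0.9, 1e-4

        c = beta1 * m_prev + (1 - beta1) * g
        step = lr * np.sign(c)                   # array([ 1e-04,  1e-04, -1e-04])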

    Second, the two-$ \beta $ structure decouples direction smoothing from buffer accumulation. With $ \beta_1 < \beta_2 $, the direction $ c_t $ is more reactive to the current gradient than the persistent buffer $ m_t $ is. This gives Lion a built-in Nesterov-like lookahead: the step taken at time $ t $ reflects more of $ g_t $ than the momentum buffer alone would.
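
    This lookahead can be made explicit by subtracting the two interpolations, which gives $ c_t = m_t + (\beta_2 - \beta_1)\,(g_t - m_{t-1}) $: with $ \beta_2 > \beta_1 $, the direction $ c_t $ is the persistent buffer $ m_t $ nudged further toward the current gradient.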

    Third, taking the sign discards gradient magnitude information, which Adam exploits via its second moment. The empirical fact that Lion still trains well suggests that for many large neural networks, direction matters more than precise magnitude scaling, provided learning rate and weight decay are tuned compatibly.

    Memory and compute

    Lion stores a single momentum buffer $ m_t $ per parameter. Adam and AdamW store two (the first and second moments). For a model with $ P $ parameters trained in bfloat16 or 32-bit precision, this halves optimizer state memory: from $ 2P $ to $ P $ values. For models in the tens to hundreds of billions of parameters, this savings is significant for distributed training under fixed accelerator memory budgets.
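
    As a rough illustration (the model size and precision here are hypothetical, not figures from the paper): for a model with $ 7\times 10^{9} $ parameters and optimizer state kept in 32-bit floats, AdamW's two buffers occupy about $ 2 \times 7\times 10^{9} \times 4 $ bytes $ \approx 56 $ GB, whereas Lion's single buffer needs roughly $ 28 $ GB.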

    Lion's per-step compute is also slightly cheaper: a sign operation and two interpolations, with no division by a square root of a second-moment estimate. In practice the time per step is dominated by the forward and backward passes, so the practical speedup from cheaper optimizer arithmetic is modest, but nonzero.

    Hyperparameter tuning

    Lion's hyperparameters do not transfer directly from Adam. The adjustments recommended by the authors and corroborated in independent reproductions are listed below; a configuration sketch follows the list.

    • Learning rate: roughly $ 3\times $ to $ 10\times $ smaller than the AdamW value that works for the same model. Because every coordinate of Lion's update has the same magnitude, the update tends to have a larger norm than Adam's, so a smaller learning rate is needed to keep the effective step size comparable.
    • Weight decay: roughly $ 3\times $ to $ 10\times $ larger than the AdamW value, to compensate for the smaller learning rate while preserving the effective regularization strength $ \eta \lambda $.
    • Batch size: Lion benefits from larger batch sizes more than Adam in some regimes, plausibly because sign-based updates have higher variance per step and average more cleanly across larger batches.
    • $ \beta_1, \beta_2 $: the defaults $ (0.9, 0.99) $ are robust; the original paper found minor gains from tuning but recommends starting from defaults.
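
    A minimal sketch of translating an existing AdamW configuration into a Lion starting point, following the rules of thumb above (the baseline values and the $ 10\times $ factor are illustrative only):

        # Hypothetical AdamW baseline; values chosen for illustration.
        adamw_cfg = {"lr": 3e-4, "weight_decay": 0.1, "betas": (0.9, 0.999)}

        # Rule of thumb: shrink the learning rate and enlarge weight decay by the
        # same factor, keeping the product lr * weight_decay roughly constant.
        factor = 10.0
        lion_cfg = {
            "lr": adamw_cfg["lr"] / factor,                        # 3e-5
            "weight_decay": adamw_cfg["weight_decay"] * factor,    # 1.0
            "betas": (0.9, 0.99),                                  # Lion defaults
        }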

    Empirical performance

    In the original paper, Lion was evaluated on ImageNet image classification, image-text contrastive learning (CLIP-style), diffusion model training, and language model pretraining up to several billion parameters. Across these settings, Lion matched or modestly exceeded AdamW on validation accuracy or perplexity, typically with comparable wall-clock training time.

    Independent reproductions on smaller models have generally confirmed that Lion is competitive but not uniformly superior. The result depends on model architecture, batch size, and tuning effort. Some fine-tuning settings, particularly those with very small datasets, have shown Lion to be more sensitive to learning rate schedule choices than AdamW.

    Variants and follow-ups

    Several variants and analyses have appeared since 2023:

    • Tiger applies a similar sign-of-momentum idea with a single $ \beta $ and is reported to be even more memory-efficient at modest accuracy cost.
    • Lion-K replaces the elementwise sign with a more general nonlinearity, with theoretical analysis connecting Lion to a regularized mirror descent interpretation.
    • Continuous-time and convergence analyses of Lion have been developed, viewing the algorithm as a discrete-time approximation to a particular ordinary differential equation with sign nonlinearity, providing partial theoretical justification for the observed robustness.

    These follow-ups generally agree that Lion's behavior is governed by an interplay between the sign nonlinearity and the exponential moving average of momentum, rather than by either component alone.

    Comparison with related optimizers

    • vs. signSGD: signSGD takes the sign of the raw gradient with no momentum buffer. Lion is signSGD with a particular two-$ \beta $ momentum scheme; the momentum is essential to Lion's empirical performance (the sketch after this list contrasts the update directions).
    • vs. Adam / AdamW: Lion uses one buffer instead of two and discards magnitude information. Adam adapts per-coordinate scale via the second moment; Lion does not.
    • vs. SGD with momentum: SGD with momentum uses the raw momentum vector as the direction. Lion uses its sign, capping per-coordinate update magnitude.
    • vs. Adafactor: Adafactor reduces optimizer memory by factorizing the second moment; Lion eliminates the second moment entirely.
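
    For orientation, the update directions of these methods can be written side by side. The schematic below is a simplification (bias correction, weight decay, and buffer updates are omitted), and the function and variable names are invented for the comparison.

        import numpy as np

        def directions(g, m, v, beta1=0.9, eps=1e-8):
            """Update directions for one step, given gradient g and state buffers m, v."""
            return {
                "sgd_momentum": m,                                     # raw momentum vector
                "signsgd":      np.sign(g),                            # sign of the raw gradient, no state
                "lion":         np.sign(beta1 * m + (1 - beta1) * g),  # sign of interpolated momentum
                "adam":         m / (np.sqrt(v) + eps),                # per-coordinate scaling by 2nd moment
            }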

    Limitations

    Lion is not a universal replacement for Adam. Reported limitations include:

    • Hyperparameter sensitivity: because Lion's update is sign-based, subtle changes in learning rate, weight decay, and warmup can change behavior more abruptly than they would for Adam.
    • Small-batch regimes: sign-based updates have higher per-step variance; with very small batches, Adam's per-coordinate adaptive scaling can be a stabilizer that Lion lacks.
    • Sparse gradients: in models with intrinsically sparse gradient structure (some recommender systems, embedding tables), the sign operation can amplify noise on rarely updated coordinates.
    • Limited theoretical guarantees: although follow-up analyses exist, Lion lacks the same depth of convex convergence theory as SGD or Adam variants.

    History

    Lion was reported in the 2023 paper "Symbolic Discovery of Optimization Algorithms" by Chen et al. at Google. The optimizer was the headline product of the broader symbolic search methodology, which the authors argued could be applied to other components of machine learning training pipelines. Open-source implementations appeared in PyTorch, JAX, and Optax within months of release, and the algorithm was incorporated into several large-scale training stacks, including for language model and image generation pretraining.

    References
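
    1. Chen et al. (2023). "Symbolic Discovery of Optimization Algorithms". arXiv:2302.06675.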