LoRA Adapters

    Topic area: deep learning
    Prerequisites: Transformer, AdamW, Backpropagation


    Overview

    LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that injects small trainable rank-decomposition matrices into the layers of a frozen pre-trained model. Introduced by Hu et al. in 2021, LoRA reparameterizes selected weight updates as the product of two low-rank factors, training only those factors while keeping the original weights fixed. The resulting "adapter" typically contains 0.01 to 1 percent of the parameters of the base model, yet recovers most of the quality of full fine-tuning on downstream tasks. Because the adapter weights can be merged back into the base weights at inference time, LoRA introduces no additional latency in deployment, distinguishing it from earlier adapter methods that inserted extra layers in the forward pass.[1] LoRA has become the dominant fine-tuning recipe for large language models, diffusion models, and Vision Transformers, and is the building block for hubs of community-shared adapters used to specialize models for tasks, styles, or characters.

    Motivation

    Full fine-tuning of a modern foundation model means updating every parameter, then storing a complete copy of the model per task. For a 70B-parameter language model in 16-bit precision, each fine-tuned checkpoint is roughly 140 GB; the optimizer state required during training is several times larger. This is impractical when one base model must be specialized for hundreds of tasks, users, or domains.

    The empirical observation motivating LoRA is that the change a fine-tuning run applies to a pre-trained weight matrix tends to lie in a low-rank subspace. If $ W_0 \in \mathbb{R}^{d \times k} $ is a pre-trained weight matrix and $ W_0 + \Delta W $ is the fine-tuned version, the singular value decomposition of $ \Delta W $ typically shows that a small number of singular directions carry most of the energy. Hu et al. argued that the "intrinsic rank" of task-specific adaptation is small enough that $ \Delta W $ can be parameterized directly as a low-rank product without losing expressiveness.
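    The claim can be illustrated directly. The snippet below is a synthetic sketch, not a reproduction of the paper's experiments: it fabricates an update matrix dominated by a few directions and measures how much of its spectral energy the top singular values capture.

        import torch

        # Synthetic stand-in for a fine-tuning update Delta_W = W_ft - W_0.
        # A real Delta_W would come from comparing checkpoints before and
        # after fine-tuning; here we build a low-rank signal plus noise.
        d, k, true_rank = 1024, 1024, 8
        delta_w = torch.randn(d, true_rank) @ torch.randn(true_rank, k)
        delta_w += 0.01 * torch.randn(d, k)   # small full-rank perturbation

        s = torch.linalg.svdvals(delta_w)             # descending singular values
        energy = s.pow(2).cumsum(0) / s.pow(2).sum()  # cumulative spectral energy
        print(f"top-8 directions carry {energy[7]:.1%} of the energy")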

    Formulation

    LoRA replaces the update $ \Delta W $ for a chosen weight matrix with a factored form $ {\displaystyle \Delta W = \frac{\alpha}{r}\, B A,} $ where $ A \in \mathbb{R}^{r \times k} $, $ B \in \mathbb{R}^{d \times r} $, the rank $ r $ is much smaller than $ \min(d, k) $ (commonly 4, 8, 16, or 64), and $ \alpha $ is a constant scaling factor. The forward pass of the modified layer becomes $ {\displaystyle h = W_0 x + \frac{\alpha}{r}\, B A x.} $ The original matrix $ W_0 $ is frozen and receives no gradient updates. Only $ A $ and $ B $ are trained.

    At initialization, $ A $ is drawn from a Gaussian (or Kaiming) distribution and $ B $ is set to zero, so that $ \Delta W = 0 $ at step zero and the network's predictions match the pre-trained model exactly. Gradient flow through the product $ BA $ then breaks the symmetry, and the adapter learns a task-specific update.
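    A minimal PyTorch sketch of such a layer, written for illustration rather than taken from the paper or any particular library, makes the frozen base, the factored update, and the zero initialization concrete:

        import math
        import torch
        import torch.nn as nn

        class LoRALinear(nn.Module):
            """Frozen linear layer plus a trainable low-rank update (alpha/r) * B A."""

            def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
                super().__init__()
                self.base = base
                self.base.weight.requires_grad_(False)   # W_0 is frozen
                if self.base.bias is not None:
                    self.base.bias.requires_grad_(False)
                d, k = base.out_features, base.in_features
                # A: Kaiming init, B: zeros, so Delta_W = B A = 0 at step zero.
                self.A = nn.Parameter(torch.empty(r, k))
                nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
                self.B = nn.Parameter(torch.zeros(d, r))
                self.scale = alpha / r

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                # h = W_0 x + (alpha/r) B A x, without materializing the d x k product
                return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)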

    The scaling factor $ \alpha/r $ decouples the effective learning rate of the adapter from the rank. Doubling $ r $ would otherwise double the magnitude of $ \Delta W $ at fixed parameter values; the $ 1/r $ factor cancels this so that hyperparameters tuned at one rank transfer reasonably to another. A common convention is to fix $ \alpha = r $ or $ \alpha = 2r $.

    Where to apply LoRA

    In a Transformer, LoRA is most often applied to the projection matrices of the attention mechanism -- the query, key, value, and output projections $ W_Q $, $ W_K $, $ W_V $, $ W_O $. The original paper found that adapting $ W_Q $ and $ W_V $ alone was sufficient on GLUE-style tasks; more recent practice for large language models applies LoRA to all four attention projections plus the feed-forward up- and down-projections, since instruction-following and reasoning tasks benefit from broader coverage.

    LoRA can be applied to any linear layer, including embedding tables and the output softmax head, though embedding-table LoRA requires care because of its non-standard backward pass through index-select operations. Convolutional kernels can be reshaped to a 2D matrix and adapted similarly, which is the basis of LoRA for vision and diffusion models.
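    With a wrapper like the LoRALinear sketch above, choosing where to apply LoRA reduces to replacing the matching nn.Linear submodules by name. The helper below assumes the q_proj/v_proj naming common in Hugging Face Transformer implementations; other codebases use different module names.

        def add_lora(model: nn.Module, targets=("q_proj", "v_proj"),
                     r: int = 8, alpha: float = 16.0) -> nn.Module:
            """Wrap every nn.Linear whose attribute name is in `targets`."""
            for module in model.modules():
                for child_name, child in list(module.named_children()):
                    if isinstance(child, nn.Linear) and child_name in targets:
                        setattr(module, child_name, LoRALinear(child, r=r, alpha=alpha))
            return model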

    Training and inference

    During training, the optimizer state -- gradients, Adam/AdamW moments, and master weights for mixed precision -- is allocated only for $ A $ and $ B $. With rank 8 and adaptation of $ W_Q, W_V $, the trainable parameter count for a 7B-parameter language model drops from 7B to roughly 4M, a 1700x reduction. The corresponding memory savings, combined with gradient checkpointing and 4-bit weight quantization, are what make techniques like QLoRA able to fine-tune 65B-parameter models on a single 48 GB GPU.
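    The arithmetic behind that count is short. Assuming LLaMA-7B-like shapes (hidden size 4096, 32 decoder layers; these specifics are illustrative):

        d = k = 4096                 # hidden size: W_Q and W_V are 4096 x 4096
        r, n_layers = 8, 32
        per_matrix = r * (d + k)     # A is r x k, B is d x r
        trainable = per_matrix * 2 * n_layers   # two adapted matrices per layer
        print(trainable)             # 4194304, i.e. roughly 4M of 7B parameters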

    At inference time, the adapter can be merged into the base weights: $ {\displaystyle W_{\text{merged}} = W_0 + \frac{\alpha}{r}\, B A.} $ Once merged, the modified layer is indistinguishable from a fully fine-tuned layer and adds no latency or memory cost. Alternatively, the adapter can be kept unmerged and applied as an additive branch at runtime; this allows hot-swapping different adapters on the same base model without reloading weights, which is the basis for multi-tenant adapter serving and "adapter libraries" in image-generation tools.
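    Continuing the LoRALinear sketch above, merging is a one-time in-place update of the frozen weight, and unmerging subtracts the same term, which is what makes hot-swapping adapters over a shared base possible:

        @torch.no_grad()
        def merge(layer: LoRALinear) -> None:
            # W_merged = W_0 + (alpha/r) B A; afterwards the additive branch
            # must be skipped, or the update would be applied twice.
            layer.base.weight += layer.scale * (layer.B @ layer.A)

        @torch.no_grad()
        def unmerge(layer: LoRALinear) -> None:
            # Restore W_0 (up to floating-point error) so a different
            # adapter can be swapped in on the same base weights.
            layer.base.weight -= layer.scale * (layer.B @ layer.A)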

    Variants and extensions

    A large family of methods builds on the LoRA recipe:

    • QLoRA (Dettmers et al. 2023) -- combines LoRA with 4-bit quantization of the frozen base weights using the NF4 datatype and double quantization. The adapter trains in 16-bit while gradients flow through dequantized base weights. This made fine-tuning 30B+ models feasible on a single 24 GB GPU.[2]
    • DoRA (Weight-Decomposed Low-Rank Adaptation) -- decomposes $ W_0 $ into a magnitude vector and a direction matrix, then applies LoRA only to the direction. Empirically narrows the gap to full fine-tuning on tasks where vanilla LoRA underperforms.
    • AdaLoRA -- learns rank allocation across layers using a parameterization analogous to singular value decomposition (with explicit singular values), pruning unimportant directions during training.
    • LoRA+ -- uses different learning rates for $ A $ and $ B $, motivated by the asymmetry that $ B $ starts at zero. The paper recommends $ \eta_B / \eta_A \approx 16 $; see the optimizer sketch after this list.
    • VeRA -- shares random projection matrices across all layers and trains only small per-layer scaling vectors, achieving 10x parameter reduction over LoRA at similar quality.
    • LoRA for diffusion models -- the dominant fine-tuning method for Stable Diffusion and related models. Community hubs distribute hundreds of thousands of style, character, and concept LoRAs of typically 10-200 MB each.
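    Of these, LoRA+ is the simplest to express in code, since it only changes how the optimizer is built. A minimal sketch with AdamW, using the parameter names from the LoRALinear sketch above and the paper's recommended ratio of about 16:

        from torch.optim import AdamW

        def lora_plus_optimizer(model: nn.Module, lr_a: float = 1e-4,
                                ratio: float = 16.0) -> AdamW:
            # B gets a larger learning rate than A, since B starts at zero.
            a_params = [p for n, p in model.named_parameters() if n.endswith(".A")]
            b_params = [p for n, p in model.named_parameters() if n.endswith(".B")]
            return AdamW([
                {"params": a_params, "lr": lr_a},
                {"params": b_params, "lr": lr_a * ratio},
            ])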

    Comparisons

    LoRA sits in the broader landscape of parameter-efficient fine-tuning (PEFT) methods:

    • Full fine-tuning -- highest quality ceiling, highest cost. Used when the task domain is far from pre-training data and a single dedicated model per task is acceptable.
    • Adapter modules (Houlsby et al. 2019) -- insert small bottleneck networks between Transformer sublayers. Extra parameters and forward-pass latency at inference; no merge-back option.
    • Prefix tuning and prompt tuning -- prepend learned vectors to the input or to each attention layer. Very few parameters; sensitive to initialization and harder to tune for tasks requiring substantial behavioral change.
    • BitFit -- trains only the bias terms. Extremely cheap but limited expressiveness.

    LoRA's appeal is that it occupies a sweet spot: comparable quality to full fine-tuning on most downstream tasks, parameter counts comparable to prefix tuning, no inference-time overhead after merging, and clean compositionality (multiple LoRAs for different concepts can be linearly combined, with caveats).
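    Linear combination itself is mechanically simple: for adapters trained on the same base matrix, the merged weight is a weighted sum of their updates. The sketch below illustrates this (the weights are user-chosen, and nothing guarantees the combination preserves either adapter's behavior; see Limitations):

        @torch.no_grad()
        def combine(w0: torch.Tensor, adapters, weights) -> torch.Tensor:
            # W = W_0 + sum_i w_i * scale_i * (B_i @ A_i)
            merged = w0.clone()
            for (A, B, scale), w in zip(adapters, weights):
                merged += w * scale * (B @ A)
            return merged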

    Limitations

    LoRA assumes the task-specific update is approximately low-rank. For tasks that require substantial knowledge injection -- training a model on a new language or a large new corpus of facts -- the rank required to match full fine-tuning may be high enough that the parameter savings disappear. Pre-training a model from scratch with LoRA-only updates is generally not effective.

    Choosing where and at what rank to apply LoRA remains an empirical exercise. Common defaults (rank 8 or 16, attention projections only) work well for instruction tuning of language models but may underperform for code, math, or multi-modal adaptation, which typically require higher ranks applied across more layers.

    Adapter composition is not always linear: stacking or summing multiple LoRAs trained independently can produce interference, particularly in diffusion models, where dedicated merging algorithms (TIES, DARE) and orthogonalization techniques have been proposed to mitigate cross-talk.

    Finally, LoRA narrows but does not eliminate the gap to full fine-tuning. On benchmarks measuring narrow technical skill (mathematical reasoning, code synthesis), well-tuned full fine-tuning still outperforms LoRA by a small but measurable margin, and the gap widens at very low ranks.

    References