Translations:Attention Mechanisms/17/en


    The scaling factor $ \sqrt{d_k} $ prevents the dot products from growing large in magnitude as the key dimension $ d_k $ increases: if the components of the query and key vectors are independent with mean 0 and variance 1, their dot product has variance $ d_k $, so unscaled scores would push the softmax into regions where its gradients are extremely small.
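The effect can be illustrated with a small NumPy sketch (the dimension $ d_k = 512 $ and the number of keys are arbitrary choices for this example). Without scaling, the score with the largest value dominates the softmax almost entirely, leaving near-zero gradients for the rest; dividing by $ \sqrt{d_k} $ keeps the scores at unit variance and the distribution usable.

```python
import numpy as np

rng = np.random.default_rng(0)

d_k = 512                           # key dimension (illustrative choice)
q = rng.standard_normal(d_k)        # one query vector
K = rng.standard_normal((8, d_k))   # eight key vectors

scores = K @ q                      # raw dot products, variance ~ d_k
scaled = scores / np.sqrt(d_k)      # scaled scores, variance ~ 1

def softmax(x):
    e = np.exp(x - x.max())         # subtract max for numerical stability
    return e / e.sum()

p_unscaled = softmax(scores)
p_scaled = softmax(scaled)

# Unscaled scores typically saturate the softmax (almost all probability
# mass on one key); the scaled version stays far less peaked.
print("max prob, unscaled:", p_unscaled.max())
print("max prob, scaled:  ", p_scaled.max())
```

Because dividing the logits by a factor greater than 1 always flattens the softmax, the scaled distribution is strictly less peaked than the unscaled one for the same scores.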