Translations:Attention Mechanisms/17/en
The scaling factor <math>\sqrt{d_k}</math> prevents the dot products from growing large in magnitude as the key dimension <math>d_k</math> increases, which would push the softmax into regions where its gradients are extremely small. To see why: if the components of a query <math>q</math> and a key <math>k</math> are independent random variables with mean 0 and variance 1, their dot product <math>q \cdot k = \sum_{i=1}^{d_k} q_i k_i</math> has mean 0 and variance <math>d_k</math>, so dividing by <math>\sqrt{d_k}</math> restores unit variance regardless of the key dimension.
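A minimal NumPy sketch of this effect (the dimension <math>d_k = 512</math>, the 8 random keys, and the variable names are illustrative assumptions, not taken from this article): unscaled scores have standard deviation near <math>\sqrt{d_k}</math> and drive the softmax to a nearly one-hot, saturated distribution, while scaled scores keep it diffuse.

<syntaxhighlight lang="python">
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k = 512                            # key dimension (illustrative choice)
q = rng.standard_normal(d_k)         # query with unit-variance components
K = rng.standard_normal((8, d_k))    # 8 keys, also unit-variance

scores = K @ q                       # raw dot products: variance grows with d_k
scaled = scores / np.sqrt(d_k)       # rescaled scores: variance stays near 1

print("raw score std:   ", scores.std())
print("scaled score std:", scaled.std())

# The unscaled softmax concentrates almost all mass on one key (saturation,
# hence vanishing gradients); the scaled softmax remains spread out.
print("unscaled attention:", np.round(softmax(scores), 4))
print("scaled attention:  ", np.round(softmax(scaled), 4))
</syntaxhighlight>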