- Masking: In autoregressive decoding, future positions are masked (set to $ -\infty $ before softmax) to preserve the causal structure: each position can attend only to itself and earlier positions (see the attention sketch after this list).
- Attention dropout: Randomly dropping attention weights during training acts as a regulariser and reduces overfitting to specific alignment patterns (also illustrated in the sketch below).
- Key-value caching: During inference, previously computed key and value vectors are cached to avoid redundant computation, significantly speeding up autoregressive generation (see the decoding sketch below).
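
A minimal NumPy sketch of the first two points, combining a causal mask with attention dropout in one scaled dot-product attention function. The function name, argument names, and shapes are illustrative assumptions, not taken from the article:

```python
import numpy as np

def masked_attention(Q, K, V, p_drop=0.1, training=True, rng=None):
    """Scaled dot-product attention with a causal mask and attention dropout.

    Q, K: (seq_len, d_k) arrays; V: (seq_len, d_v) array.
    """
    rng = rng or np.random.default_rng()
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len) alignment scores

    # Causal mask: future positions (j > i) are set to -inf so the
    # softmax assigns them exactly zero weight.
    seq_len = scores.shape[0]
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = np.where(causal, scores, -np.inf)

    # Numerically stable softmax over the key dimension.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    # Attention dropout: zero attention weights at random during training
    # (inverted-dropout scaling keeps the expected weight mass unchanged).
    if training and p_drop > 0.0:
        keep = rng.random(weights.shape) >= p_drop
        weights = weights * keep / (1.0 - p_drop)

    return weights @ V  # (seq_len, d_v)
```

Note that, as is common practice, the dropped weights are not renormalised, so each row of the attention matrix is only approximately a distribution during training.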
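
And a sketch of the third point: incremental decoding with a key-value cache, where each new token computes only its own key and value and reuses the cached rows from earlier steps. `decode_step`, the cache layout, and the weight names are hypothetical:

```python
import numpy as np

def decode_step(x_t, W_q, W_k, W_v, cache):
    """One autoregressive decoding step with a key-value cache.

    x_t: (d_model,) embedding of the newest token.
    cache: dict holding the growing 'K' and 'V' arrays from earlier steps.
    """
    q = x_t @ W_q
    cache["K"] = np.vstack([cache["K"], x_t @ W_k])  # append only the new key
    cache["V"] = np.vstack([cache["V"], x_t @ W_v])  # append only the new value

    # No explicit mask is needed here: the cache contains only past and
    # current positions, so causality holds by construction.
    scores = cache["K"] @ q / np.sqrt(q.shape[-1])  # (t+1,) scores
    scores -= scores.max()
    w = np.exp(scores)
    w /= w.sum()
    return w @ cache["V"]  # attention output for the current step

# Demo: decode five tokens one at a time, reusing cached keys/values.
rng = np.random.default_rng(0)
d_model = d_k = d_v = 8
W_q, W_k, W_v = (0.1 * rng.standard_normal((d_model, d)) for d in (d_k, d_k, d_v))
cache = {"K": np.empty((0, d_k)), "V": np.empty((0, d_v))}
for _ in range(5):
    out_t = decode_step(rng.standard_normal(d_model), W_q, W_k, W_v, cache)
```

Without the cache, every step would recompute the keys and values for the whole prefix; with it, each step performs a single key and value projection and only the attention itself grows with sequence length.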