Translations:Attention Mechanisms/33/en

* '''Masking''': In autoregressive decoding, future positions are masked (set to <math>-\infty</math> before {{Term|softmax}}) to preserve the causal structure (see the masking sketch after this list).
* '''Attention {{Term|dropout}}''': Randomly dropping attention weights during training acts as a regulariser and reduces {{Term|overfitting}} to specific alignment patterns (see the dropout sketch below).
* '''Key-value caching''': During inference, previously computed key and value vectors are cached to avoid redundant computation, significantly speeding up autoregressive generation (see the caching sketch below).
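A minimal PyTorch sketch of causal masking, assuming a single attention head with query and key tensors of shape <code>(seq_len, d_k)</code>; the function name and shapes are illustrative, not taken from the article:

<syntaxhighlight lang="python">
import torch

def causal_attention_weights(q, k):
    """Scaled dot-product attention weights with a causal mask.

    q, k: (seq_len, d_k) tensors for a single head (illustrative shapes).
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (seq_len, seq_len)
    seq_len = scores.size(-1)
    # Strictly upper-triangular positions are "future" tokens: setting
    # them to -inf makes softmax assign them zero weight.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    return torch.softmax(scores, dim=-1)
</syntaxhighlight>

Each row of the result then sums to one over the current and past positions only, which is exactly the causal structure the bullet describes.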
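A sketch of attention dropout under the same single-head assumptions; the dropout probability <code>p=0.1</code> is an arbitrary example value:

<syntaxhighlight lang="python">
import torch
import torch.nn.functional as F

def attention_with_dropout(q, k, v, p=0.1, training=True):
    """Apply dropout to the attention weights themselves (not the inputs)."""
    d_k = q.size(-1)
    weights = torch.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    # Zeroing random weights during training discourages over-reliance
    # on any single alignment pattern; at inference (training=False)
    # this line is a no-op.
    weights = F.dropout(weights, p=p, training=training)
    return weights @ v
</syntaxhighlight>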
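A sketch of key-value caching for one autoregressive decoding step; the <code>cache</code> dictionary layout and function signature are assumptions for illustration, not a fixed API:

<syntaxhighlight lang="python">
import torch

def decode_step(q_new, k_new, v_new, cache):
    """Attend from the newest token only, reusing cached keys and values.

    q_new, k_new, v_new: (1, d_k) tensors for the latest token;
    cache: dict holding keys/values from earlier steps (assumed layout).
    """
    if cache:
        # Only the new token's key/value are computed this step;
        # everything earlier is reused from the cache.
        k = torch.cat([cache["k"], k_new], dim=0)
        v = torch.cat([cache["v"], v_new], dim=0)
    else:
        k, v = k_new, v_new
    cache["k"], cache["v"] = k, v  # reused on the next step
    d_k = q_new.size(-1)
    weights = torch.softmax(q_new @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    return weights @ v  # (1, d_k) output for the new position
</syntaxhighlight>

The saving comes from never recomputing keys and values for earlier positions: each step processes only the newest token and appends its key/value pair to the cache.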
