
Prior to the Transformer, dominant sequence transduction models relied on recurrent neural networks (RNNs), particularly LSTMs and GRUs, often enhanced with attention mechanisms. These architectures process tokens sequentially, a fundamental bottleneck that precludes parallelization within a training example. The Transformer eliminated this constraint by relying solely on attention to draw global dependencies between input and output sequences, enabling far greater parallelism and reducing training times from days to hours on contemporary hardware.
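
To illustrate why attention admits this parallelism, the minimal sketch below computes scaled dot-product attention over a toy sequence: every pairwise (global) dependency is produced by a single matrix product rather than a step-by-step recurrence. The function name, toy shapes, and use of NumPy are illustrative assumptions, not details taken from the text above.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    # All query-key similarities are computed at once, so the whole
    # sequence is handled in parallel instead of token by token.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over keys turns the scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted mixture of all value vectors,
    # i.e. a direct (global) dependency on every other position.
    return weights @ V

# Toy self-attention: 4 tokens with 8-dimensional representations.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

Because the computation is a handful of dense matrix operations with no sequential dependence across positions, it maps directly onto the parallel hardware that made the shorter training times described above possible.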