where $ Q $, $ K $, and $ V $ are matrices of queries, keys, and values respectively, and $ d_k $ is the dimensionality of the keys. Scaling the dot products by $ \frac{1}{\sqrt{d_k}} $ keeps them from growing large in magnitude for large $ d_k $, which would otherwise push the softmax into regions with extremely small gradients.
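As a concrete illustration, the following is a minimal sketch of scaled dot-product attention for a single, unmasked head; the function name, shapes, and NumPy setting are assumptions for the example and not taken from the paper itself.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single unmasked head.

    Q: (n_q, d_k) queries, K: (n_k, d_k) keys, V: (n_k, d_v) values.
    Shapes and the single-head setting are illustrative assumptions.
    """
    d_k = K.shape[-1]
    # Scale the dot products by 1/sqrt(d_k) so they do not grow large
    # in magnitude as d_k increases.
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the keys for each query.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors.
    return weights @ V

# Example: 4 queries attending over 6 key/value pairs with d_k = d_v = 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)
```

Without the division by $ \sqrt{d_k} $, the variance of each dot product grows with $ d_k $, so the softmax saturates and its gradients shrink toward zero.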