where $ Q $, $ K $, and $ V $ are matrices of queries, keys, and values respectively, and $ d_k $ is the dimensionality of the keys. Scaling the dot products by $ \frac{1}{\sqrt{d_k}} $ keeps them from growing large in magnitude for large $ d_k $, which would otherwise push the softmax into regions with extremely small gradients.
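As a concrete illustration, the following is a minimal sketch of scaled dot-product attention for a single, unmasked head; the function name, shapes, and NumPy setting are assumptions for the example and not taken from the paper itself.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single unmasked head.

    Q: (n_q, d_k) queries, K: (n_k, d_k) keys, V: (n_k, d_v) values.
    Shapes and the single-head setting are illustrative assumptions.
    """
    d_k = K.shape[-1]
    # Scale the dot products by 1/sqrt(d_k) so they do not grow large
    # in magnitude as d_k increases.
    scores = Q @ K.T / np.sqrt(d_k)
    # Numerically stable softmax over the keys for each query.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value vectors.
    return weights @ V

# Example: 4 queries attending over 6 key/value pairs with d_k = d_v = 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)
```

Without the division by $ \sqrt{d_k} $, the variance of each dot product grows with $ d_k $, so the softmax saturates and its gradients shrink toward zero.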