Multi-head attention applies several attention functions in parallel, each with different learned linear projections:
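As a rough illustration of that parallel structure, the sketch below runs several attention functions side by side, each on its own projections of the queries, keys, and values. Several details are assumptions not stated in the text above: each head is taken to be scaled dot-product attention, the head outputs are concatenated and passed through a final output projection `w_o`, and the projection matrices are filled with random numbers standing in for learned weights. The helper names (`matmul`, `attention`, `multi_head_attention`, `random_matrix`) are hypothetical and chosen only for this example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap the rows and columns of a matrix."""
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over a single list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Assumed attention function: scaled dot-product attention."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head_attention(q, k, v, heads, w_o):
    """Apply one attention function per head, each with its own projections.

    `heads` is a list of (w_q, w_k, w_v) projection matrices, one triple per head.
    Concatenation followed by the output projection `w_o` is an assumption here.
    """
    head_outputs = [
        attention(matmul(q, w_q), matmul(k, w_k), matmul(v, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the head outputs along the feature dimension, position by position.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(q))]
    return matmul(concat, w_o)


def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix (illustration only)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_model, num_heads = 4, 8, 2   # illustrative sizes, not from the text
d_head = d_model // num_heads

x = random_matrix(seq_len, d_model, rng)
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(num_heads)
]
w_o = random_matrix(num_heads * d_head, d_model, rng)

# Self-attention: queries, keys, and values all come from the same input.
out = multi_head_attention(x, x, x, heads, w_o)
print(len(out), len(out[0]))
```

Each head here works in a smaller subspace of size `d_head`, so the concatenated result has `num_heads * d_head` features before the output projection maps it back to `d_model`. The loop over heads is written sequentially for readability; the heads do not depend on one another, which is what allows them to be computed in parallel in practice.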