Layer normalization (Ba et al., 2016) normalizes across all features within a single sample, so its statistics are independent of batch size. It is the standard choice in transformer architectures.
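As a minimal sketch of the idea, the following NumPy snippet normalizes each sample over its feature dimension; the function name, the learnable `gamma`/`beta` scalars, and `eps` value are illustrative choices, not from the source:

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each sample (last axis) to zero mean and unit variance,
    then apply an affine transform. Statistics are computed per sample,
    so the result does not depend on batch size."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Two samples with very different scales normalize identically per row.
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])
y = layer_norm(x)
```

Note that the same function works for a batch of one, which is one reason layer normalization is preferred over batch normalization in settings with small or variable batch sizes.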