- Data shuffling — Re-shuffle the dataset at the start of each epoch so the optimiser does not see examples in a fixed, cyclic order.
- Gradient clipping — Cap the gradient norm before each update to prevent exploding gradients, especially in recurrent networks (both techniques are shown in the first sketch after this list).
- Batch normalisation — Normalising layer inputs over each mini-batch stabilises training and reduces sensitivity to the learning rate.
- Mixed-precision training — Using half-precision floats accelerates SGD on modern GPUs with minimal accuracy loss; the loss is typically scaled so that small gradients do not underflow in float16 (both techniques appear in the second sketch after this list).
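
A minimal sketch of the first two items, assuming PyTorch: passing `shuffle=True` to `DataLoader` draws a fresh random ordering every epoch, and `torch.nn.utils.clip_grad_norm_` caps the global gradient norm before each SGD step. The model, data, and hyperparameters here are placeholders for illustration, not taken from the original text.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy regression data standing in for a real dataset.
X = torch.randn(1024, 20)
y = torch.randn(1024, 1)

# shuffle=True re-shuffles the dataset each epoch,
# avoiding cyclic update patterns.
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        # Cap the global gradient norm at 1.0 to prevent
        # exploding updates before the SGD step.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
```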
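
A second sketch covering the last two items, again assuming PyTorch and, for the mixed-precision part, a CUDA GPU: `BatchNorm1d` normalises each hidden feature over the mini-batch, while `autocast` and `GradScaler` run the forward and backward passes in half precision with loss scaling to avoid gradient underflow. Layer sizes and the learning rate are illustrative choices.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda"  # mixed precision as shown here requires a CUDA GPU

# BatchNorm1d normalises each hidden feature over the mini-batch,
# reducing sensitivity to the learning rate.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 1),
).to(device)

loader = DataLoader(
    TensorDataset(torch.randn(1024, 20), torch.randn(1024, 1)),
    batch_size=64,
    shuffle=True,
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()  # scales the loss so fp16 gradients do not underflow

for epoch in range(5):
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        # Forward pass runs in float16 where safe, float32 elsewhere.
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = loss_fn(model(xb), yb)
        scaler.scale(loss).backward()  # backward on the scaled loss
        scaler.step(optimizer)         # unscales gradients, then steps
        scaler.update()                # adjusts the scale factor for next step
```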