- Data shuffling — Re-shuffle the dataset at the start of each epoch so the optimiser does not see examples in a fixed, cyclic order.
- Gradient clipping — Cap the gradient norm before each update to prevent exploding gradients, especially in recurrent networks (both techniques are shown in the first sketch after this list).
- Batch normalisation — Normalising layer inputs over each mini-batch stabilises training and reduces sensitivity to the learning rate.
- Mixed-precision training — Using half-precision floats accelerates SGD on modern GPUs with minimal accuracy loss; the loss is typically scaled so that small gradients do not underflow in float16 (both techniques appear in the second sketch after this list).
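
A minimal sketch of the first two items, assuming PyTorch: passing `shuffle=True` to `DataLoader` draws a fresh random ordering every epoch, and `torch.nn.utils.clip_grad_norm_` caps the global gradient norm before each SGD step. The model, data, and hyperparameters here are placeholders for illustration, not taken from the original text.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy regression data standing in for a real dataset.
X = torch.randn(1024, 20)
y = torch.randn(1024, 1)

# shuffle=True re-shuffles the dataset each epoch,
# avoiding cyclic update patterns.
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        # Cap the global gradient norm at 1.0 to prevent
        # exploding updates before the SGD step.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
```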
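
A second sketch covering the last two items, again assuming PyTorch and, for the mixed-precision part, a CUDA GPU: `BatchNorm1d` normalises each hidden feature over the mini-batch, while `autocast` and `GradScaler` run the forward and backward passes in half precision with loss scaling to avoid gradient underflow. Layer sizes and the learning rate are illustrative choices.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda"  # mixed precision as shown here requires a CUDA GPU

# BatchNorm1d normalises each hidden feature over the mini-batch,
# reducing sensitivity to the learning rate.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 1),
).to(device)

loader = DataLoader(
    TensorDataset(torch.randn(1024, 20), torch.randn(1024, 1)),
    batch_size=64,
    shuffle=True,
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()  # scales the loss so fp16 gradients do not underflow

for epoch in range(5):
    for xb, yb in loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        # Forward pass runs in float16 where safe, float32 elsewhere.
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = loss_fn(model(xb), yb)
        scaler.scale(loss).backward()  # backward on the scaled loss
        scaler.step(optimizer)         # unscales gradients, then steps
        scaler.update()                # adjusts the scale factor for next step
```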