- Data shuffling — Re-shuffle the dataset each epoch to avoid cyclic sampling patterns (see the training-loop sketch after this list).
- Gradient clipping — Cap the gradient norm to prevent exploding updates, especially in recurrent networks (also shown in the sketch below).
- Batch normalisation — Normalising layer inputs reduces sensitivity to the learning rate (sketched below).
- Mixed-precision training — Using half-precision floats accelerates SGD on modern GPUs with minimal accuracy loss (sketched below).
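
A minimal PyTorch sketch of the first two points. The toy model, data, and hyperparameters are illustrative assumptions, not taken from the page: per-epoch re-shuffling comes from `DataLoader(shuffle=True)`, and the gradient norm is capped with `torch.nn.utils.clip_grad_norm_` before each SGD step.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy regression data; any Dataset works the same way.
X, y = torch.randn(1024, 20), torch.randn(1024, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)  # new random order every epoch

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for xb, yb in loader:                  # shuffled mini-batches
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        # Cap the global gradient norm at 1.0 to prevent exploding updates.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
```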
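
For batch normalisation, a sketch of the usual placement (an assumed architecture, not the page's own): a `BatchNorm1d` layer between the linear layer and its activation normalises each mini-batch, which typically lets SGD tolerate a wider range of learning rates than the plain network.

```python
import torch
from torch import nn

def mlp(use_batchnorm: bool) -> nn.Sequential:
    layers = [nn.Linear(20, 64)]
    if use_batchnorm:
        layers.append(nn.BatchNorm1d(64))  # normalise activations per mini-batch
    layers += [nn.ReLU(), nn.Linear(64, 1)]
    return nn.Sequential(*layers)

# The normalised model is usually stable at learning rates that would make
# the plain model diverge; the exact values are problem-dependent.
model = mlp(use_batchnorm=True)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
```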
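
For mixed-precision training, a sketch using `torch.cuda.amp` (it assumes a CUDA GPU; the model and data are placeholders): `autocast` runs the forward pass in half precision where it is safe, and `GradScaler` scales the loss so small fp16 gradients do not underflow before the SGD step.

```python
import torch
from torch import nn

device = "cuda"
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()
loss_fn = nn.MSELoss()

xb = torch.randn(32, 20, device=device)
yb = torch.randn(32, 1, device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast():       # forward pass in half precision where safe
    loss = loss_fn(model(xb), yb)
scaler.scale(loss).backward()         # scale the loss to avoid fp16 underflow
scaler.step(optimizer)                # unscale gradients, then take the SGD step
scaler.update()
```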