    • Feature scaling — normalising input features so they have similar ranges dramatically improves convergence, because the loss surface becomes more isotropic (first sketch below).
    • Gradient clipping — capping the norm of the gradient at a chosen threshold (rescaling it whenever the norm exceeds that threshold) prevents excessively large updates from destabilising training (second sketch below).
    • Random initialisation — starting from a reasonable random initialisation (e.g. Xavier or He initialisation for neural networks) breaks the symmetry between units that would otherwise compute identical gradients and never learn distinct features (third sketch below).
    • Monitoring the loss curve — plotting the training loss over iterations is the simplest diagnostic: a smoothly decreasing curve indicates healthy training, while oscillations suggest the learning rate is too high (final sketch below).
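
To make the first point concrete, here is a minimal sketch of feature standardisation in NumPy. The function name standardise and the toy data are illustrative, not taken from the original article.

```python
import numpy as np

def standardise(X):
    """Rescale each feature (column) to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0.0] = 1.0  # guard against constant features
    return (X - mu) / sigma, mu, sigma

# Toy example: two features on very different scales.
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 1500.0]])
X_scaled, mu, sigma = standardise(X)
print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]
```

Because both columns now have unit variance, a single learning rate behaves similarly along every axis of the loss surface.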
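Gradient clipping by norm can be sketched in a few lines, assuming the gradient is a NumPy vector; the function name clip_by_norm and the threshold value are illustrative.

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale grad so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0])              # norm 5
print(clip_by_norm(g, max_norm=1.0))  # [0.6, 0.8], norm 1
```

Note that clipping preserves the gradient's direction; only its magnitude is reduced.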
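The two initialisation schemes named in the third point can be sketched as follows; the layer sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    """Xavier/Glorot: variance 2 / (fan_in + fan_out), suited to tanh/sigmoid."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out):
    """He: variance 2 / fan_in, suited to ReLU activations."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W1 = xavier_init(784, 256)  # hidden layer with tanh
W2 = he_init(256, 10)       # hidden layer with ReLU
```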
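Finally, a sketch of loss-curve monitoring inside a plain gradient-descent loop; the quadratic loss, learning rate, and step count are illustrative choices, and the recorded losses list can be plotted with any plotting library.

```python
import numpy as np

# Quadratic loss f(w) = 0.5 * ||w||^2 with gradient f'(w) = w.
w = np.array([5.0, -3.0])
lr = 0.1
losses = []
for step in range(50):
    grad = w
    w = w - lr * grad
    losses.append(0.5 * np.dot(w, w))

# With lr = 0.1 the recorded losses decrease smoothly (healthy training);
# rerunning with lr = 2.1 makes the same loop diverge, and the curve
# oscillates and grows — the "learning rate too high" signature.
print(losses[0], losses[-1])
```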