- Momentum — accumulates a velocity vector from past gradients, helping to accelerate convergence in ravine-like loss landscapes (see the update-rule sketch after this list).
- Nesterov accelerated gradient — a momentum variant that evaluates the gradient at a look-ahead position, yielding better theoretical convergence rates (also sketched below).
- Adaptive methods (AdaGrad, RMSProp, Adam) — maintain per-parameter learning rates that adapt based on the history of gradients (an Adam-style sketch follows the list).
- Second-order methods — algorithms such as Newton's method and L-BFGS use curvature information (the Hessian or an approximation of it) for faster convergence, but are often too expensive for large-scale problems (a Newton-step sketch closes this section).
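A minimal sketch of the momentum and Nesterov update rules, assuming a small ill-conditioned quadratic objective f(x) = ½ xᵀAx − bᵀx chosen purely for illustration; the matrix `A`, vector `b`, and hyperparameter values are not from the original text.

```python
import numpy as np

A = np.array([[10.0, 0.0], [0.0, 1.0]])   # ill-conditioned quadratic: a "ravine"
b = np.array([1.0, 1.0])

def grad(x):
    return A @ x - b                        # gradient of f(x) = 0.5 x^T A x - b^T x

def momentum(lr=0.05, beta=0.9, steps=200):
    x, v = np.zeros(2), np.zeros(2)
    for _ in range(steps):
        v = beta * v - lr * grad(x)         # accumulate velocity from past gradients
        x = x + v
    return x

def nesterov(lr=0.05, beta=0.9, steps=200):
    x, v = np.zeros(2), np.zeros(2)
    for _ in range(steps):
        v = beta * v - lr * grad(x + beta * v)  # gradient taken at the look-ahead point
        x = x + v
    return x

# Both approach the closed-form minimizer A^{-1} b = [0.1, 1.0]
print(momentum(), nesterov(), np.linalg.solve(A, b))
```

The only difference between the two is where the gradient is evaluated: momentum uses the current point, Nesterov uses the point the velocity is about to carry it to.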
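A minimal Adam-style sketch showing how per-parameter step sizes arise from running gradient statistics; the objective and hyperparameter values are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def adam(grad, x0, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)                    # first moment: running mean of gradients
    v = np.zeros_like(x)                    # second moment: running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)        # bias correction for the warm-up phase
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter effective learning rate
    return x

# Usage on the same kind of ill-conditioned quadratic as above
A = np.array([[10.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
print(adam(lambda x: A @ x - b, x0=[0.0, 0.0]))    # approaches the minimizer [0.1, 1.0]
```

Dividing by the square root of the second moment gives parameters with consistently large gradients a smaller step and rarely-updated parameters a larger one, which is the "per-parameter learning rate" behaviour described above.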
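A minimal Newton-step sketch on the same kind of quadratic, showing how curvature information rescales the step; for a quadratic the method reaches the minimizer in a single iteration. The Hessian here is just the matrix `A` of the assumed objective.

```python
import numpy as np

A = np.array([[10.0, 0.0], [0.0, 1.0]])    # Hessian of the quadratic objective
b = np.array([1.0, 1.0])

def grad(x):
    return A @ x - b

def newton_step(x):
    # Solve H p = -grad(x) for the Newton direction rather than forming H^{-1}
    p = np.linalg.solve(A, -grad(x))
    return x + p

x = newton_step(np.zeros(2))               # lands exactly at A^{-1} b for a quadratic
print(x, np.linalg.solve(A, b))
```

The cost is the linear solve with the Hessian, which is why full Newton steps are rarely practical at large scale and quasi-Newton approximations such as L-BFGS are used instead.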