In classical gradient descent, the full gradient of the loss function is computed over the entire training set before each parameter update. When the dataset is large, this becomes prohibitively expensive. SGD addresses the problem by estimating the gradient from a single randomly chosen sample (or a small mini-batch) at each step, trading a noisier estimate for a dramatically lower per-iteration cost.
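The contrast can be seen in a minimal sketch on a synthetic least-squares problem; everything below (the function names, the learning rate, the batch size) is illustrative rather than taken from the source. The full-batch gradient touches all n samples per update, while the stochastic gradient touches only a random mini-batch, giving an unbiased but noisy estimate at a fraction of the cost.

<syntaxhighlight lang="python">
import numpy as np

# Least-squares loss: L(w) = (1/2n) * ||X @ w - y||^2 on synthetic data.
rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = X @ true_w + 0.1 * rng.normal(size=n)

def full_gradient(w):
    # Classical gradient descent: uses every sample, O(n * d) per update.
    return X.T @ (X @ w - y) / n

def stochastic_gradient(w, batch_size=1):
    # SGD: uses a random mini-batch, O(batch_size * d) per update.
    # The estimate is noisy but unbiased (its expectation is the full gradient).
    idx = rng.integers(0, n, size=batch_size)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

w = np.zeros(d)
lr = 0.01  # illustrative step size
for step in range(1_000):
    w -= lr * stochastic_gradient(w, batch_size=1)

print("distance to true weights:", np.linalg.norm(w - true_w))
</syntaxhighlight>

Each SGD step here costs roughly n times less than a full-batch step; the price is that individual updates wander around the true descent direction, which is the noise-versus-cost trade-off described above.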