- Memory — the forward pass must store all intermediate activations for the backward pass. For very deep networks this can be prohibitive; gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them.
- Numerical stability — using log-sum-exp tricks and fused softmax-cross-entropy implementations avoids overflow and underflow.
- Higher-order gradients — differentiating through the backward pass itself yields second-order information (Hessian-vector products), useful for methods like natural gradient descent and meta-learning.
- Mixed precision — computing the forward and backward passes in half precision while keeping a master copy of the weights in full precision speeds up training on modern GPUs; loss scaling is typically used alongside it to keep small gradients from underflowing in half precision.
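The checkpointing trade-off in the first point can be sketched in plain NumPy: instead of storing every activation, save only every k-th one and recompute the intermediate activations from the nearest checkpoint when the backward pass needs them. The function names and the `tanh` "layer" here are illustrative, not from any particular framework.

```python
import numpy as np

def layer(x):
    # Stand-in for one network layer (illustrative choice).
    return np.tanh(x)

def forward_checkpointed(x, n_layers, checkpoint_every):
    # Store only every k-th activation: O(n/k) memory instead of O(n).
    saved = {0: x}
    h = x
    for i in range(1, n_layers + 1):
        h = layer(h)
        if i % checkpoint_every == 0:
            saved[i] = h
    return h, saved

def recompute_segment(saved, start, end):
    # During the backward pass, re-run the forward computation from the
    # nearest checkpoint to recover the activations that were not stored.
    h = saved[start]
    acts = [h]
    for _ in range(start, end):
        h = layer(h)
        acts.append(h)
    return acts
```

Real implementations (e.g. `torch.utils.checkpoint`) wrap this recomputation inside the autograd graph, but the memory-for-compute exchange is exactly this.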
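The log-sum-exp trick from the numerical-stability point is straightforward to demonstrate: subtracting the maximum logit before exponentiating keeps every `exp` in range, and the fused softmax-cross-entropy loss never materializes the softmax at all. A minimal sketch:

```python
import numpy as np

def logsumexp(x):
    # Shift by the max so the largest exponent is exp(0) = 1;
    # overflow is impossible and underflowed terms contribute ~0 harmlessly.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def softmax_cross_entropy(logits, target):
    # Fused form: loss = logsumexp(logits) - logits[target].
    # Computing softmax explicitly and then taking log would overflow here.
    return logsumexp(logits) - logits[target]

logits = np.array([1000.0, 1001.0, 1002.0])  # naive np.exp(logits) -> inf
loss = softmax_cross_entropy(logits, 2)
```

With these logits a naive `-log(softmax(logits)[2])` evaluates `exp(1002)`, which overflows to infinity; the fused form returns a small finite loss.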
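For the higher-order point: a Hessian-vector product is the derivative of the gradient in the direction v, which autodiff frameworks obtain by differentiating through the backward pass. The same quantity can be approximated by a central difference of the gradient, shown here for a quadratic whose gradient is known in closed form (the matrix `A` and the functions are illustrative):

```python
import numpy as np

# For f(x) = 0.5 * x^T A x the gradient is A x and the Hessian is A,
# so the exact Hessian-vector product H v is simply A v.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

def grad_f(x):
    return A @ x

def hvp(x, v, eps=1e-5):
    # Directional derivative of the gradient:
    # (grad(x + eps*v) - grad(x - eps*v)) / (2*eps) ≈ H v.
    # Autodiff gets the same result exactly, without the epsilon.
    return (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2 * eps)

x = np.array([1.0, -1.0])
v = np.array([0.5, 2.0])
```

Note that forming `H v` never requires building the full Hessian, which is what makes methods like natural gradient descent feasible at scale.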
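The mixed-precision point can be made concrete with NumPy's `float16`: near 1.0 the spacing between representable half-precision values is about 1e-3, so an update of 1e-4 rounds away entirely, while a full-precision master copy accumulates it correctly. This toy loop is a sketch of the failure mode, not a training recipe:

```python
import numpy as np

update = np.float32(1e-4)  # a small gradient step

w_half = np.float16(1.0)     # weight stored only in half precision
w_master = np.float32(1.0)   # full-precision master copy

for _ in range(100):
    # fp16 spacing at 1.0 is ~9.8e-4, so 1.0 + 1e-4 rounds back to 1.0:
    w_half = np.float16(w_half + np.float16(update))
    # the fp32 master copy accumulates the same updates without loss:
    w_master = w_master + update
```

After 100 steps `w_half` is still exactly 1.0 while `w_master` is about 1.01, which is why mixed-precision training applies updates to the master weights and only casts to half precision for the forward pass.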