- Memory — the forward pass must store all intermediate activations for the backward pass. For very deep networks this can be prohibitive; gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them.
- Numerical stability — using log-sum-exp tricks and fused softmax-cross-entropy implementations avoids overflow and underflow.
- Higher-order gradients — differentiating through the backward pass itself yields second-order information (Hessian-vector products), useful for methods like natural gradient descent and meta-learning.
- Mixed precision — computing the forward and backward passes in half precision while keeping a master copy of the weights in full precision speeds up training on modern GPUs; loss scaling is typically used alongside it to keep small gradients from underflowing in half precision.
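The checkpointing trade-off in the first point can be sketched in plain NumPy: instead of storing every activation, save only every k-th one and recompute the intermediate activations from the nearest checkpoint when the backward pass needs them. The function names and the `tanh` "layer" here are illustrative, not from any particular framework.

```python
import numpy as np

def layer(x):
    # Stand-in for one network layer (illustrative choice).
    return np.tanh(x)

def forward_checkpointed(x, n_layers, checkpoint_every):
    # Store only every k-th activation: O(n/k) memory instead of O(n).
    saved = {0: x}
    h = x
    for i in range(1, n_layers + 1):
        h = layer(h)
        if i % checkpoint_every == 0:
            saved[i] = h
    return h, saved

def recompute_segment(saved, start, end):
    # During the backward pass, re-run the forward computation from the
    # nearest checkpoint to recover the activations that were not stored.
    h = saved[start]
    acts = [h]
    for _ in range(start, end):
        h = layer(h)
        acts.append(h)
    return acts
```

Real implementations (e.g. `torch.utils.checkpoint`) wrap this recomputation inside the autograd graph, but the memory-for-compute exchange is exactly this.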
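The log-sum-exp trick from the numerical-stability point is straightforward to demonstrate: subtracting the maximum logit before exponentiating keeps every `exp` in range, and the fused softmax-cross-entropy loss never materializes the softmax at all. A minimal sketch:

```python
import numpy as np

def logsumexp(x):
    # Shift by the max so the largest exponent is exp(0) = 1;
    # overflow is impossible and underflowed terms contribute ~0 harmlessly.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def softmax_cross_entropy(logits, target):
    # Fused form: loss = logsumexp(logits) - logits[target].
    # Computing softmax explicitly and then taking log would overflow here.
    return logsumexp(logits) - logits[target]

logits = np.array([1000.0, 1001.0, 1002.0])  # naive np.exp(logits) -> inf
loss = softmax_cross_entropy(logits, 2)
```

With these logits a naive `-log(softmax(logits)[2])` evaluates `exp(1002)`, which overflows to infinity; the fused form returns a small finite loss.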
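For the higher-order point: a Hessian-vector product is the derivative of the gradient in the direction v, which autodiff frameworks obtain by differentiating through the backward pass. The same quantity can be approximated by a central difference of the gradient, shown here for a quadratic whose gradient is known in closed form (the matrix `A` and the functions are illustrative):

```python
import numpy as np

# For f(x) = 0.5 * x^T A x the gradient is A x and the Hessian is A,
# so the exact Hessian-vector product H v is simply A v.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

def grad_f(x):
    return A @ x

def hvp(x, v, eps=1e-5):
    # Directional derivative of the gradient:
    # (grad(x + eps*v) - grad(x - eps*v)) / (2*eps) ≈ H v.
    # Autodiff gets the same result exactly, without the epsilon.
    return (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2 * eps)

x = np.array([1.0, -1.0])
v = np.array([0.5, 2.0])
```

Note that forming `H v` never requires building the full Hessian, which is what makes methods like natural gradient descent feasible at scale.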
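The mixed-precision point can be made concrete with NumPy's `float16`: near 1.0 the spacing between representable half-precision values is about 1e-3, so an update of 1e-4 rounds away entirely, while a full-precision master copy accumulates it correctly. This toy loop is a sketch of the failure mode, not a training recipe:

```python
import numpy as np

update = np.float32(1e-4)  # a small gradient step

w_half = np.float16(1.0)     # weight stored only in half precision
w_master = np.float32(1.0)   # full-precision master copy

for _ in range(100):
    # fp16 spacing at 1.0 is ~9.8e-4, so 1.0 + 1e-4 rounds back to 1.0:
    w_half = np.float16(w_half + np.float16(update))
    # the fp32 master copy accumulates the same updates without loss:
    w_master = w_master + update
```

After 100 steps `w_half` is still exactly 1.0 while `w_master` is about 1.01, which is why mixed-precision training applies updates to the master weights and only casts to half precision for the forward pass.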