Backpropagation

Revision as of 06:57, 24 April 2026
    Topic area: Deep Learning
    Difficulty: Intermediate
    Prerequisites: Gradient Descent, Neural Networks

    Backpropagation (short for backward propagation of errors) is an algorithm for efficiently computing the gradient of a loss function with respect to every weight in a neural network. Combined with an optimisation method such as gradient descent, it forms the standard training procedure for modern deep learning models.

    The chain rule

    Backpropagation is fundamentally an application of the chain rule of calculus. If a variable $ z $ depends on $ y $, which in turn depends on $ x $, then:

    $ \frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \cdot \frac{\partial y}{\partial x} $

    In a neural network the loss $ L $ depends on the output, which depends on the activations of the last hidden layer, which depend on the activations of the previous layer, and so on back to the input. The chain rule allows us to decompose the gradient into a product of local derivatives, one for each layer.
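This decomposition can be checked numerically on a toy composition. The functions below ($ z = y^2 $, $ y = 3x $) are illustrative choices, not from the article:

```python
# Chain rule on a toy composition: z = y**2 with y = 3*x,
# so dz/dx = (dz/dy) * (dy/dx) = 2*y * 3 = 18*x.
x = 2.0
y = 3.0 * x          # y = 6.0
z = y ** 2           # z = 36.0

dz_dy = 2.0 * y      # local derivative of z with respect to y
dy_dx = 3.0          # local derivative of y with respect to x
dz_dx = dz_dy * dy_dx

print(dz_dx)         # 36.0, matches the direct formula 18*x
```

Backpropagation applies exactly this multiplication of local derivatives, once per layer, starting from the loss.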

    Forward pass

    During the forward pass, input data propagates through the network layer by layer. For a fully connected layer $ l $:

    $ \mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)} $
    $ \mathbf{a}^{(l)} = g^{(l)}(\mathbf{z}^{(l)}) $

    where $ \mathbf{a}^{(l-1)} $ is the activation from the previous layer (with $ \mathbf{a}^{(0)} = \mathbf{x} $), $ \mathbf{W}^{(l)} $ and $ \mathbf{b}^{(l)} $ are the weights and biases, and $ g^{(l)} $ is the activation function. The forward pass stores all intermediate values $ \mathbf{z}^{(l)} $ and $ \mathbf{a}^{(l)} $ because they are needed during the backward pass.
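A minimal sketch of one such forward step, using plain Python lists, a sigmoid as the activation $ g $, and arbitrary toy weights (none of these numbers come from the article):

```python
import math

def matvec(W, v):
    """Matrix-vector product using plain lists."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy layer: 2 inputs -> 2 units (weights chosen arbitrarily for illustration).
W = [[0.5, -0.2],
     [0.1,  0.4]]
b = [0.0, 0.1]
a_prev = [1.0, 2.0]   # a^(0) = x

z = [zi + bi for zi, bi in zip(matvec(W, a_prev), b)]  # z^(l) = W a^(l-1) + b
a = [sigmoid(zi) for zi in z]                          # a^(l) = g(z^(l))

# Both z and a are kept, because the backward pass will need them.
print([round(v, 6) for v in z])  # [0.1, 1.0]
```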

    Backward pass

    The backward pass computes gradients starting from the loss and moving toward the input. Define the error signal at layer $ l $ as:

    $ \boldsymbol{\delta}^{(l)} = \frac{\partial L}{\partial \mathbf{z}^{(l)}} $

    For the output layer (layer $ L_{\text{out}} $):

    $ \boldsymbol{\delta}^{(L_{\text{out}})} = \frac{\partial L}{\partial \mathbf{a}^{(L_{\text{out}})}} \odot g'^{(L_{\text{out}})}(\mathbf{z}^{(L_{\text{out}})}) $

    For each earlier layer, the error propagates backward:

    $ \boldsymbol{\delta}^{(l)} = \bigl(\mathbf{W}^{(l+1)}\bigr)^\top \boldsymbol{\delta}^{(l+1)} \odot g'^{(l)}(\mathbf{z}^{(l)}) $

    where $ \odot $ denotes element-wise multiplication. Once the error signal is known, the parameter gradients are:

    $ \frac{\partial L}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} \bigl(\mathbf{a}^{(l-1)}\bigr)^\top, \qquad \frac{\partial L}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)} $
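The four equations above can be exercised end to end on a tiny network and validated against a finite-difference estimate. The architecture and numbers below are arbitrary illustrative choices (2 inputs, 2 sigmoid hidden units, 1 linear output, squared-error loss):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy network: 2 inputs -> 2 sigmoid units -> 1 linear output,
# with loss L = 0.5 * (a2 - t)**2. All values are illustrative.
W1 = [[0.5, -0.2], [0.1, 0.4]]
b1 = [0.0, 0.1]
W2 = [[0.3, -0.5]]
b2 = [0.2]
x, t = [1.0, 2.0], 1.0

def forward(W1, b1, W2, b2, x):
    z1 = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W1, b1)]
    a1 = [sigmoid(z) for z in z1]
    z2 = [sum(w * ai for w, ai in zip(row, a1)) + bi for row, bi in zip(W2, b2)]
    a2 = z2[0]                       # identity activation at the output
    return z1, a1, a2

z1, a1, a2 = forward(W1, b1, W2, b2, x)

# Backward pass, following the equations above.
delta2 = a2 - t                                       # dL/da2 * g'(z2), g' = 1
dW2 = [[delta2 * ai for ai in a1]]                    # delta2 a1^T
db2 = [delta2]
delta1 = [W2[0][j] * delta2 * a1[j] * (1.0 - a1[j])   # (W2^T delta2) elementwise
          for j in range(2)]                          # times sigma'(z1)
dW1 = [[d * xi for xi in x] for d in delta1]          # delta1 x^T
db1 = delta1[:]

# Finite-difference check on one weight.
eps = 1e-6
W1p = [row[:] for row in W1]
W1p[0][0] += eps
_, _, a2p = forward(W1p, b1, W2, b2, x)
numeric = (0.5 * (a2p - t) ** 2 - 0.5 * (a2 - t) ** 2) / eps
print(abs(numeric - dW1[0][0]) < 1e-5)  # True
```

The finite-difference comparison is the standard sanity check for a hand-written backward pass: the analytic gradient and the numerical estimate should agree to within the step size.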

    Computational graphs

    Modern deep learning frameworks (PyTorch, TensorFlow, JAX) implement backpropagation by constructing a computational graph — a directed acyclic graph where each node represents an operation and each edge carries a tensor. The forward pass builds the graph; the backward pass traverses it in reverse topological order, applying the chain rule at every node.

    This abstraction makes it possible to differentiate arbitrary compositions of operations, not just standard layer types. Two implementation strategies exist:

    • Static graphs — the graph is defined once before execution (early TensorFlow). Enables aggressive compiler optimisations but is less flexible.
    • Dynamic graphs — the graph is rebuilt on every forward pass (PyTorch, TensorFlow Eager mode). More intuitive for debugging and models with data-dependent control flow.
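The dynamic-graph approach can be sketched in a few dozen lines of scalar autograd. The `Value` class below is purely illustrative (in the spirit of minimal autograd engines, not any framework's actual API): each operation records its inputs and local derivatives during the forward computation, and `backward()` traverses the resulting graph in reverse topological order.

```python
class Value:
    """Minimal reverse-mode AD on scalars: each operation records its
    inputs and their local derivatives; backward() replays the graph
    in reverse topological order, applying the chain rule."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents       # pairs of (input Value, local derivative)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data,
                     ((self, other.data), (other, self.data)))

    def backward(self):
        # Topologically sort the graph, then accumulate gradients backwards.
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for parent, _ in v._parents:
                    visit(parent)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for parent, local in v._parents:
                parent.grad += local * v.grad

x = Value(2.0)
y = x * x + x * 3.0   # y = x^2 + 3x, so dy/dx = 2x + 3 = 7 at x = 2
y.backward()
print(x.grad)  # 7.0
```

Because the graph is rebuilt on every forward computation, ordinary Python control flow (loops, conditionals) works transparently, which is the key advantage the dynamic approach offers.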

    Automatic differentiation

    Backpropagation is a special case of reverse-mode automatic differentiation (AD). Unlike numerical differentiation (which is approximate) or symbolic differentiation (which can produce unwieldy expressions), AD computes exact derivatives by systematically applying the chain rule to elementary operations.

    Reverse-mode AD computes the gradient of a scalar output with respect to all inputs in a single backward pass, making it ideally suited to neural networks where the loss is scalar but the parameters number in the millions.

    The cost of the backward pass is typically 2–3 times that of the forward pass, because it must evaluate the local Jacobians and multiply them with the incoming error signal.

    Vanishing and exploding gradients

    When a network has many layers, the gradient is a product of many local derivatives. If these factors are consistently less than 1, the gradient shrinks exponentially toward zero — the vanishing gradient problem. If they are consistently greater than 1, the gradient grows exponentially — the exploding gradient problem.

    Problem             | Symptom                               | Common mitigations
    Vanishing gradients | Early layers learn extremely slowly   | ReLU activations, residual connections, batch normalisation, careful initialisation
    Exploding gradients | Loss diverges or produces NaN values  | Gradient clipping, weight regularisation, lower learning rate

    These issues were major obstacles to training deep networks before the introduction of ReLU activations, residual connections (ResNets), and normalisation techniques.
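The exponential behaviour is easy to see by multiplying per-layer factors directly. Sigmoid's derivative is at most 0.25, so a 50-layer chain of sigmoids can scale gradients by at most $ 0.25^{50} $; a factor consistently above 1 grows just as fast in the other direction (the depth and factors here are illustrative):

```python
# Product of per-layer derivative factors across depth.
depth = 50
print(0.25 ** depth)   # ~7.9e-31: the gradient vanishes
print(1.5 ** depth)    # ~6.4e8:  the gradient explodes
```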

    Practical considerations

    • Memory — the forward pass must store all intermediate activations for the backward pass. For very deep networks this can be prohibitive; gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them.
    • Numerical stability — using log-sum-exp tricks and fused softmax-cross-entropy implementations avoids overflow and underflow.
    • Higher-order gradients — differentiating through the backward pass itself yields second-order information (Hessian-vector products), useful for methods like natural gradient descent and meta-learning.
    • Mixed precision — computing the forward pass in half precision while keeping a master copy of the weights in full precision speeds up training on modern GPUs.
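As a concrete instance of the numerical-stability point, here is a sketch of the log-sum-exp trick (the max-shift formulation; the helper name is ours):

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) via the max-shift trick:
    log sum exp(x_i) = m + log sum exp(x_i - m), with m = max(x)."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# The naive math.log(sum(math.exp(x) for x in xs)) overflows for x ~ 1000;
# after shifting by the maximum, every exponent is <= 0 and stays finite.
print(logsumexp([1000.0, 1000.0]))  # 1000.6931..., i.e. 1000 + ln 2
```

The same shift underlies stable softmax and fused softmax-cross-entropy implementations.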

    Historical development

    The key ideas behind backpropagation were developed independently by several researchers. Seppo Linnainmaa described reverse-mode automatic differentiation in 1970. Paul Werbos applied it to neural networks in his 1974 PhD thesis. The algorithm achieved widespread adoption after the influential 1986 paper by Rumelhart, Hinton, and Williams, which demonstrated its effectiveness on multi-layer networks.

    References

    • Rumelhart, D. E., Hinton, G. E. and Williams, R. J. (1986). "Learning representations by back-propagating errors". Nature, 323, 533–536.
    • Linnainmaa, S. (1970). "The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors". Master's thesis, University of Helsinki.
    • Werbos, P. J. (1974). "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences". PhD thesis, Harvard University.
    • Baydin, A. G. et al. (2018). "Automatic Differentiation in Machine Learning: a Survey". JMLR, 18(153), 1–43.
    • Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning, Chapter 6. MIT Press.