Deep learning
Deep learning is a subfield of Machine Learning that uses artificial neural networks with many layers — and millions to billions of parameters — to learn hierarchical representations directly from raw data. It underpins most of the recent breakthroughs in computer vision, natural language processing, speech recognition, and scientific discovery.
Overview
Classical machine learning relied on hand-engineered features: a practitioner would design pixel statistics, n-gram counts, or acoustic descriptors, and a relatively shallow model would map these features to outputs. Deep learning removes this bottleneck. A deep neural network learns its own features layer by layer, with each successive layer composing simpler patterns from the layer below into more abstract concepts.
The qualifier "deep" refers to the depth of the computation graph rather than any particular biological fidelity. Modern systems routinely stack tens to hundreds of layers and rely on three coupled ingredients that became simultaneously available in the early 2010s: large labelled datasets, massively parallel hardware (GPUs and later TPUs), and stable optimisation techniques. Together they made it practical to train networks whose representational capacity dwarfs anything previously feasible.
Deep learning is often credited with shifting AI from rule-based and feature-engineered systems toward a paradigm of end-to-end learning, in which a single differentiable model is trained jointly to map raw inputs to task outputs.
Key Concepts
- Hierarchical representation learning — successive layers transform the input into representations of increasing abstraction; the network discovers features rather than receiving them.
- Distributed representations — concepts are encoded as patterns of activation across many units, allowing combinatorial generalisation that one-hot or symbolic schemes cannot match.
- Differentiable computation — every operation is (almost everywhere) differentiable, so gradients flow through the entire model and parameters are tuned by gradient-based optimisation.
- End-to-end training — the entire pipeline, from raw input to final prediction, is optimised against a single loss, which removes the need for hand-tuned intermediate stages.
- Inductive biases via architecture — convolution encodes translation equivariance, recurrence encodes temporal locality, attention encodes pairwise interaction; the choice of architecture injects assumptions appropriate for the data.
- Scale — empirical scaling laws show that loss decreases predictably as a power of model size, dataset size, and compute, motivating ever-larger models.
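The power-law scaling behaviour above can be sketched numerically. The functional form and constants below are illustrative, in the spirit of the fits reported in the scaling-law literature (e.g. Kaplan et al. 2020), not authoritative values:

```python
# Illustrative power-law scaling of loss with model size:
# L(N) = (N_c / N) ** alpha, where N is the parameter count.
# N_c and alpha are placeholder constants roughly in the range
# reported for language models; treat them as assumptions.

def predicted_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Predicted loss for a model with n_params parameters."""
    return (n_c / n_params) ** alpha

# Under a pure power law, doubling model size shrinks the loss
# by a constant multiplicative factor of 2 ** (-alpha):
ratio = predicted_loss(2e9) / predicted_loss(1e9)
```

The key qualitative point is the constant multiplicative improvement per doubling, which is what motivates training ever-larger models.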
History
Deep learning has roots that long predate its modern dominance. The perceptron (Rosenblatt 1958) and the early multilayer models of the 1960s established the basic neuron abstraction, but were limited by the lack of an effective training procedure for hidden layers. The reinvention and popularisation of backpropagation by Rumelhart, Hinton, and Williams in 1986 made multi-layer training feasible, and Yann LeCun's LeNet (1989, refined through the 1990s) demonstrated end-to-end learning of handwritten digits with a convolutional network.
Through the 1990s and early 2000s neural networks were largely overshadowed by support vector machines, kernel methods, and probabilistic graphical models. Renewed interest came from work on deep belief networks and unsupervised pre-training (Hinton, Salakhutdinov, Bengio, around 2006), which showed that depth was tractable if initialisation was handled carefully.
The decisive turning point was AlexNet (Krizhevsky, Sutskever, Hinton, 2012), which won the ImageNet challenge by a wide margin and demonstrated the practical force of GPU-trained convolutional networks with Dropout and cross-entropy objectives. The years that followed saw rapid architectural progress: VGG and GoogLeNet (2014), ResNet (He et al. 2015) and its residual connections, sequence-to-sequence models with attention, and the transformer (Vaswani et al. 2017). The transformer in turn enabled large language models (BERT 2018, GPT-2 2019, GPT-3 2020) and modern multimodal systems.
Key Approaches
A typical deep model is a parameterised function $ f_\theta : \mathcal{X} \to \mathcal{Y} $ trained by minimising an empirical risk:
- $ \mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell\bigl(f_\theta(x_i),\, y_i\bigr) + \lambda\, R(\theta) $
where $ \ell $ is a per-example loss (e.g. cross-entropy for classification, squared error for regression) and $ R $ is an optional regulariser. Gradients $ \nabla_\theta \mathcal{L} $ are computed by backpropagation and parameters are updated with stochastic gradient descent or adaptive methods such as Adam:
- $ \theta_{t+1} = \theta_t - \eta\, \widehat{\nabla}_\theta \mathcal{L}(\theta_t) $
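The update rule above can be sketched as a minimal NumPy training loop. The linear model, synthetic data, and hyperparameters are illustrative, chosen only to make the mini-batch SGD mechanics concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = x . w_true + noise (illustrative).
X = rng.normal(size=(256, 4))
true_w = rng.normal(size=(4, 1))
Y = X @ true_w + 0.01 * rng.normal(size=(256, 1))

theta = np.zeros((4, 1))   # parameters of a linear model f_theta(x) = x . theta
eta = 0.1                  # learning rate
batch = 32                 # mini-batch size

for step in range(200):
    idx = rng.choice(len(X), size=batch, replace=False)  # sample a mini-batch
    xb, yb = X[idx], Y[idx]
    pred = xb @ theta
    # Gradient of the mean squared error over the batch: (2/B) X^T (pred - y)
    grad = 2.0 / batch * xb.T @ (pred - yb)
    theta -= eta * grad    # SGD update: theta <- theta - eta * grad_hat

mse = float(np.mean((X @ theta - Y) ** 2))  # training error after 200 steps
```

For a linear model the gradient is available in closed form; for a deep network the same loop applies, with the gradient supplied by backpropagation through the computation graph.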
The dominant architectural families are:
- Convolutional networks — translation-equivariant feature extractors for grid-structured data; foundational in vision.
- Recurrent networks (LSTM, GRU) — state-carrying models for sequences, central to early speech and language work.
- Transformers — built around the attention mechanism, where outputs are computed as $ \operatorname{Attention}(Q,K,V)=\operatorname{softmax}(QK^\top/\sqrt{d_k})V $; now the default for language and increasingly for vision and audio.
- Graph neural networks — generalise convolution to nodes and edges, used for molecules, citation networks, and social graphs.
- Autoencoders and variational autoencoders — encoder–decoder pairs trained to compress and reconstruct, useful for representation learning and generation.
- Generative adversarial networks — a generator and discriminator trained in a minimax game to produce realistic samples.
- Diffusion models — generative models that learn to invert a gradual noising process, dominant in modern image and video synthesis.
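The attention formula from the transformer entry above can be written directly in NumPy. This is a minimal single-head sketch without masking, learned projections, or batching:

```python
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                            # weighted average of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
out = attention(Q, K, V)   # shape (5, 8)
```

Because the softmax rows sum to one, each output is a convex combination of the value vectors, weighted by query-key similarity.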
Effective training depends on auxiliary techniques: careful initialisation (Xavier, He), normalisation (batch, layer, group), regularisation (Dropout, weight decay, data augmentation), and learning-rate schedules (warm-up, cosine decay). Increasingly, self-supervised and pre-training objectives are used to learn general-purpose representations from unlabelled data, which are then adapted to downstream tasks via fine-tuning or transfer learning.
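One of the schedules mentioned above, linear warm-up followed by cosine decay, can be sketched as follows; the peak rate and step counts are illustrative placeholders:

```python
import math

def lr_schedule(step: int, peak_lr: float = 3e-4,
                warmup_steps: int = 1_000, total_steps: int = 100_000) -> float:
    """Linear warm-up to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        # Ramp linearly from 0 to peak_lr over the warm-up phase.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Warm-up avoids large early updates while normalisation statistics and adaptive-optimiser moments are still unreliable; the cosine tail anneals the step size smoothly toward convergence.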
A loose taxonomy of training regimes:
| Regime | Signal | Typical use |
|---|---|---|
| Supervised | labelled $ (x, y) $ pairs | image classification, machine translation |
| Self-supervised | pretext task derived from $ x $ alone | pre-training language and vision models |
| Unsupervised / generative | likelihood of $ x $ | autoencoders, diffusion, GANs |
| Reinforcement | scalar reward from an environment | game playing, robotics, RLHF for alignment |
Connections
Deep learning sits at the intersection of several long-standing fields. As a form of Machine Learning, it inherits the bias–variance trade-off, generalisation theory, and concerns about overfitting. It is built on top of neural networks and depends critically on Backpropagation for credit assignment and on Gradient Descent (in particular Stochastic Gradient Descent) for optimisation. Classification heads typically combine a softmax output with a cross-entropy loss, while other losses are chosen to match the task structure.
Architecturally, CNNs specialise the general framework to spatial data, RNNs to sequential data, and Transformers to general set- and sequence-structured data via attention. In language and search, word embeddings were an early demonstration that deep models could learn meaningful continuous representations of discrete symbols. Modern reinforcement learning, recommendation systems, and many areas of computational science now rely on deep models as drop-in function approximators.
See also
- Neural Networks
- Backpropagation
- Gradient Descent
- Stochastic Gradient Descent
- Convolutional Neural Networks
- Recurrent Neural Networks
- Attention Mechanisms
- Dropout
- Batch Normalization
- Transfer Learning
- Cross-Entropy Loss
- Overfitting and Regularization
References
- LeCun, Y., Bengio, Y. and Hinton, G. (2015). "Deep learning". Nature, 521, 436–444.
- Goodfellow, I., Bengio, Y. and Courville, A. (2016). Deep Learning. MIT Press.
- Rumelhart, D. E., Hinton, G. E. and Williams, R. J. (1986). "Learning representations by back-propagating errors". Nature, 323, 533–536.
- Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks". NeurIPS.
- He, K., Zhang, X., Ren, S. and Sun, J. (2016). "Deep Residual Learning for Image Recognition". CVPR.
- Vaswani, A. et al. (2017). "Attention Is All You Need". NeurIPS.
- Schmidhuber, J. (2015). "Deep Learning in Neural Networks: An Overview". Neural Networks, 61, 85–117.