Convolutional Neural Networks
| Article | |
|---|---|
| Topic area | Deep Learning |
| Difficulty | Intermediate |
| Prerequisites | Neural Networks, Backpropagation |
Convolutional neural networks (CNNs or ConvNets) are a class of deep neural networks specifically designed to process data with a grid-like topology, such as images (2D grids of pixels), audio spectrograms, and video. They exploit the spatial structure of the input through local connectivity, weight sharing, and pooling, making them far more efficient than fully connected networks for visual and spatial tasks.
The convolution operation
The core building block is the discrete convolution. For a 2D input $ \mathbf{X} $ and a filter (kernel) $ \mathbf{K} $ of size $ k \times k $, the output feature map $ \mathbf{Y} $ is:
- $ Y_{i,j} = \sum_{m=0}^{k-1}\sum_{n=0}^{k-1} K_{m,n} \cdot X_{i+m,\, j+n} + b $
where $ b $ is a bias term. The filter slides (convolves) across the input, computing a dot product at each position. Technically, most implementations compute cross-correlation rather than true convolution (which would flip the kernel), but the distinction is immaterial since the kernel weights are learned.
Key hyperparameters controlling the convolution:
- Kernel size — the spatial extent of the filter (e.g. $ 3 \times 3 $, $ 5 \times 5 $).
- Stride — the step size between successive positions of the kernel. A stride of 2 roughly halves the spatial dimensions (up to rounding).
- Padding — adding zeros around the border of the input to control the output size. "Same" padding preserves spatial dimensions; "valid" padding uses no padding.
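The operation above, including stride and zero padding, can be sketched directly in NumPy. This is a naive nested loop for clarity, not the optimised im2col or FFT routines real frameworks use:

```python
import numpy as np

def conv2d(x, kernel, bias=0.0, stride=1, padding=0):
    """Naive 2D cross-correlation (what deep learning frameworks call convolution)."""
    if padding > 0:
        x = np.pad(x, padding)  # zero-pad the border
    k = kernel.shape[0]
    out_h = (x.shape[0] - k) // stride + 1
    out_w = (x.shape[1] - k) // stride + 1
    y = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # dot product of the kernel with the current window, plus bias
            window = x[i*stride : i*stride + k, j*stride : j*stride + k]
            y[i, j] = np.sum(window * kernel) + bias
    return y

# example: a hand-made vertical-edge detector on a two-tone image
x = np.zeros((5, 5))
x[:, 2:] = 1.0                      # left half dark, right half bright
k = np.array([[1., 0., -1.]] * 3)   # responds to left-to-right intensity change
print(conv2d(x, k))                 # strong (negative) response along the edge
```

With "valid" padding (the default here) a $ 5 \times 5 $ input and $ 3 \times 3 $ kernel give a $ 3 \times 3 $ output; `padding=1` would preserve the $ 5 \times 5 $ size, matching "same" padding.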
Filters and feature detection
Each filter learns to detect a specific local pattern. In early layers, filters typically respond to edges, corners, and colour gradients. Deeper layers compose these into higher-level features — textures, parts, and eventually entire objects.
A convolutional layer applies multiple filters in parallel, producing a stack of feature maps. If the input has $ C_{\text{in}} $ channels and the layer has $ C_{\text{out}} $ filters, the total number of learnable parameters is:
- $ C_{\text{out}} \times (C_{\text{in}} \times k^2 + 1) $
This is dramatically fewer than a fully connected layer with the same input and output dimensions, because weights are shared across all spatial positions.
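The parameter-count formula is easy to verify with a short calculation; the layer sizes below are arbitrary examples, not a prescribed design:

```python
def conv_params(c_in, c_out, k):
    """Learnable parameters of a conv layer: each of the c_out filters
    has c_in * k * k weights plus one bias."""
    return c_out * (c_in * k * k + 1)

# e.g. 64 filters of size 3x3 on a 3-channel (RGB) input
print(conv_params(3, 64, 3))    # 1792 parameters

# the count is independent of the input's spatial size, because the
# same weights are reused at every position; a fully connected layer
# mapping a 32x32x3 input to 64 maps of 30x30 would instead need
# (32*32*3) * (30*30*64) ≈ 177 million weights
```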
Pooling
Pooling layers downsample the feature maps, reducing their spatial dimensions and providing a degree of translation invariance. Common pooling operations:
- Max pooling — takes the maximum value in each local window (e.g. $ 2 \times 2 $).
- Average pooling — takes the mean value in each window.
- Global average pooling — averages each entire feature map to a single value, often used before the final classification layer.
Pooling reduces computational cost and helps prevent overfitting by progressively abstracting the representation.
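Max pooling over non-overlapping $ 2 \times 2 $ windows can be sketched in a few lines of NumPy; this simplified version drops trailing rows or columns that do not fill a window, whereas real libraries also support padding and custom strides:

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling (stride equals window size)."""
    h, w = x.shape[0] // size, x.shape[1] // size
    # reshape into (h, size, w, size) blocks and take the max over each block
    return x[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))
```

Average pooling is the same with `.mean(axis=(1, 3))`, and global average pooling is simply `x.mean(axis=(0, 1))` applied per feature map.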
Architecture of a CNN
A typical CNN alternates convolutional layers and pooling layers, followed by one or more fully connected layers for the final prediction:
Input → [Conv → ReLU → Pool] × N → Flatten → FC → FC → Output
Each conv-pool block extracts increasingly abstract features, while the fully connected layers combine them for classification or regression.
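The effect of such a stack on spatial dimensions can be traced with simple arithmetic. The $ 32 \times 32 $ input and three blocks below are illustrative choices, not a fixed recipe:

```python
def conv_out(size, k=3, stride=1, pad=1):
    # standard output-size formula for a convolution
    return (size + 2 * pad - k) // stride + 1

def pool_out(size, k=2):
    # non-overlapping pooling divides the spatial size by the window size
    return size // k

size = 32  # e.g. a CIFAR-10 image
for block in range(3):
    size = pool_out(conv_out(size))  # 3x3 "same" conv, then 2x2 pool
    print(f"after block {block + 1}: {size}x{size}")
# 32 -> 16 -> 8 -> 4: each block halves the spatial extent while the
# channel count (and thus feature abstraction) typically grows
```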
Landmark architectures
| Architecture | Year | Key contribution | Depth |
|---|---|---|---|
| LeNet-5 | 1998 | Pioneered CNNs for handwritten digit recognition (MNIST) | 5 layers |
| AlexNet | 2012 | Won ImageNet; popularised ReLU, dropout, GPU training | 8 layers |
| VGGNet | 2014 | Showed depth matters; used only $ 3 \times 3 $ filters throughout | 16–19 layers |
| GoogLeNet (Inception) | 2014 | Introduced inception modules with parallel filter sizes | 22 layers |
| ResNet | 2015 | Introduced residual connections enabling very deep networks | 50–152+ layers |
| DenseNet | 2017 | Connected each layer to every subsequent layer via dense blocks | 121–264 layers |
| EfficientNet | 2019 | Compound scaling of depth, width, and resolution | Variable |
Residual connections
The residual connection (or skip connection) introduced by ResNet adds the input of a block directly to its output:
- $ \mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x} $
This allows gradients to flow directly through the identity path, mitigating the vanishing gradient problem and enabling the training of networks with hundreds of layers. Residual connections have become a standard component in nearly all modern architectures.
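A minimal sketch of the idea, using plain matrices for the residual branch $ \mathcal{F} $ rather than the convolutions an actual ResNet block uses:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x), where F is two linear maps with a ReLU in between.
    (ResNet applies the final activation after the addition.)"""
    f = relu(x @ w1) @ w2  # the residual branch F(x)
    return relu(f + x)     # identity shortcut added to the branch output
```

If the branch weights are zero, the block reduces to the identity (for non-negative inputs), which is exactly why deep stacks of such blocks are easy to optimise: each block only has to learn a *correction* to the identity mapping.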
Applications in computer vision
CNNs have achieved state-of-the-art performance across a wide range of vision tasks:
- Image classification — assigning a label to an entire image (ImageNet, CIFAR).
- Object detection — localising and classifying objects within an image (YOLO, Faster R-CNN, SSD).
- Semantic segmentation — assigning a class label to every pixel (U-Net, DeepLab).
- Instance segmentation — distinguishing individual instances of objects (Mask R-CNN).
- Image generation — generating realistic images using CNN-based generators (GANs, diffusion models).
- Medical imaging — tumour detection, retinal analysis, and radiology screening.
Practical tips
- Use pretrained models (transfer learning) when labelled data is limited.
- Prefer small kernels ($ 3 \times 3 $) stacked in depth — two $ 3 \times 3 $ layers have the same receptive field as one $ 5 \times 5 $ layer but with fewer parameters.
- Apply batch normalisation after convolution and before activation.
- Use data augmentation generously to reduce overfitting.
- Replace fully connected layers with global average pooling to reduce parameters.
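The parameter claim in the small-kernels tip can be checked directly; the count below covers weights only (biases ignored), and the channel width of 64 is an arbitrary example:

```python
def stacked_vs_single(c=64):
    """Weights for two stacked 3x3 conv layers vs one 5x5 layer,
    each with c input and c output channels."""
    two_3x3 = 2 * (c * 3 * 3 * c)   # 18 * c^2
    one_5x5 = c * 5 * 5 * c         # 25 * c^2
    return two_3x3, one_5x5

two, one = stacked_vs_single()
print(two, one)  # 73728 vs 102400: ~28% fewer weights for the same
                 # 5x5 receptive field, plus an extra non-linearity
```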
See also
- Neural Networks
- Backpropagation
- Overfitting and Regularization
- Recurrent Neural Networks
- Gradient Descent
References
- LeCun, Y. et al. (1998). "Gradient-Based Learning Applied to Document Recognition". Proceedings of the IEEE.
- Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks". NeurIPS.
- Simonyan, K. and Zisserman, A. (2015). "Very Deep Convolutional Networks for Large-Scale Image Recognition". ICLR.
- He, K. et al. (2016). "Deep Residual Learning for Image Recognition". CVPR.
- Tan, M. and Le, Q. V. (2019). "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks". ICML.