Convolutional Neural Networks
| Article | |
|---|---|
| Topic area | Deep Learning |
| Difficulty | Intermediate |
| Prerequisites | Neural Networks, Backpropagation |
Convolutional neural networks (CNNs or ConvNets) are a class of deep neural networks specifically designed to process data with a grid-like topology, such as images (2D grids of pixels), audio spectrograms, and video. They exploit the spatial structure of the input through local connectivity, weight sharing, and pooling, making them far more efficient than fully connected networks for visual and spatial tasks.
The convolution operation
The core building block is the discrete convolution. For a 2D input $ \mathbf{X} $ and a filter (kernel) $ \mathbf{K} $ of size $ k \times k $, the output feature map $ \mathbf{Y} $ is:
- $ Y_{i,j} = \sum_{m=0}^{k-1}\sum_{n=0}^{k-1} K_{m,n} \cdot X_{i+m,\, j+n} + b $
where $ b $ is a bias term. The filter slides (convolves) across the input, computing a dot product at each position. Technically, most implementations compute cross-correlation rather than true convolution (which would flip the kernel), but the distinction is immaterial since the kernel weights are learned.
Key hyperparameters controlling the convolution:
- Kernel size — the spatial extent of the filter (e.g. $ 3 \times 3 $, $ 5 \times 5 $).
- Stride — the step size between successive positions of the kernel. A stride of 2 roughly halves the spatial dimensions (up to rounding).
- Padding — adding zeros around the border of the input to control the output size. "Same" padding preserves spatial dimensions; "valid" padding uses no padding.
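The operation above, including stride and zero padding, can be sketched directly in NumPy. This is a naive nested loop for clarity, not the optimised im2col or FFT routines real frameworks use:

```python
import numpy as np

def conv2d(x, kernel, bias=0.0, stride=1, padding=0):
    """Naive 2D cross-correlation (what deep learning frameworks call convolution)."""
    if padding > 0:
        x = np.pad(x, padding)  # zero-pad the border
    k = kernel.shape[0]
    out_h = (x.shape[0] - k) // stride + 1
    out_w = (x.shape[1] - k) // stride + 1
    y = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # dot product of the kernel with the current window, plus bias
            window = x[i*stride : i*stride + k, j*stride : j*stride + k]
            y[i, j] = np.sum(window * kernel) + bias
    return y

# example: a hand-made vertical-edge detector on a two-tone image
x = np.zeros((5, 5))
x[:, 2:] = 1.0                      # left half dark, right half bright
k = np.array([[1., 0., -1.]] * 3)   # responds to left-to-right intensity change
print(conv2d(x, k))                 # strong (negative) response along the edge
```

With "valid" padding (the default here) a $ 5 \times 5 $ input and $ 3 \times 3 $ kernel give a $ 3 \times 3 $ output; `padding=1` would preserve the $ 5 \times 5 $ size, matching "same" padding.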
Filters and feature detection
Each filter learns to detect a specific local pattern. In early layers, filters typically respond to edges, corners, and colour gradients. Deeper layers compose these into higher-level features — textures, parts, and eventually entire objects.
A convolutional layer applies multiple filters in parallel, producing a stack of feature maps. If the input has $ C_{\text{in}} $ channels and the layer has $ C_{\text{out}} $ filters, the total number of learnable parameters is:
- $ C_{\text{out}} \times (C_{\text{in}} \times k^2 + 1) $
This is dramatically fewer than a fully connected layer with the same input and output dimensions, because weights are shared across all spatial positions.
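The parameter-count formula is easy to verify with a short calculation; the layer sizes below are arbitrary examples, not a prescribed design:

```python
def conv_params(c_in, c_out, k):
    """Learnable parameters of a conv layer: each of the c_out filters
    has c_in * k * k weights plus one bias."""
    return c_out * (c_in * k * k + 1)

# e.g. 64 filters of size 3x3 on a 3-channel (RGB) input
print(conv_params(3, 64, 3))    # 1792 parameters

# the count is independent of the input's spatial size, because the
# same weights are reused at every position; a fully connected layer
# mapping a 32x32x3 input to 64 maps of 30x30 would instead need
# (32*32*3) * (30*30*64) ≈ 177 million weights
```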
Pooling
Pooling layers downsample the feature maps, reducing their spatial dimensions and providing a degree of translation invariance. Common pooling operations:
- Max pooling — takes the maximum value in each local window (e.g. $ 2 \times 2 $).
- Average pooling — takes the mean value in each window.
- Global average pooling — averages each entire feature map to a single value, often used before the final classification layer.
Pooling reduces computational cost and helps prevent overfitting by progressively abstracting the representation.
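Max pooling over non-overlapping $ 2 \times 2 $ windows can be sketched in a few lines of NumPy; this simplified version drops trailing rows or columns that do not fill a window, whereas real libraries also support padding and custom strides:

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling (stride equals window size)."""
    h, w = x.shape[0] // size, x.shape[1] // size
    # reshape into (h, size, w, size) blocks and take the max over each block
    return x[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))
```

Average pooling is the same with `.mean(axis=(1, 3))`, and global average pooling is simply `x.mean(axis=(0, 1))` applied per feature map.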
Architecture of a CNN
A typical CNN alternates convolutional layers and pooling layers, followed by one or more fully connected layers for the final prediction:
Input → [Conv → ReLU → Pool] × N → Flatten → FC → FC → Output
Each conv-pool block extracts increasingly abstract features, while the fully connected layers combine them for classification or regression.
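The effect of such a stack on spatial dimensions can be traced with simple arithmetic. The $ 32 \times 32 $ input and three blocks below are illustrative choices, not a fixed recipe:

```python
def conv_out(size, k=3, stride=1, pad=1):
    # standard output-size formula for a convolution
    return (size + 2 * pad - k) // stride + 1

def pool_out(size, k=2):
    # non-overlapping pooling divides the spatial size by the window size
    return size // k

size = 32  # e.g. a CIFAR-10 image
for block in range(3):
    size = pool_out(conv_out(size))  # 3x3 "same" conv, then 2x2 pool
    print(f"after block {block + 1}: {size}x{size}")
# 32 -> 16 -> 8 -> 4: each block halves the spatial extent while the
# channel count (and thus feature abstraction) typically grows
```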
Landmark architectures
| Architecture | Year | Key contribution | Depth |
|---|---|---|---|
| LeNet-5 | 1998 | Pioneered CNNs for handwritten digit recognition (MNIST) | 5 layers |
| AlexNet | 2012 | Won ImageNet; popularised ReLU, dropout, GPU training | 8 layers |
| VGGNet | 2014 | Showed depth matters; used only $ 3 \times 3 $ filters throughout | 16–19 layers |
| GoogLeNet (Inception) | 2014 | Introduced inception modules with parallel filter sizes | 22 layers |
| ResNet | 2015 | Introduced residual connections enabling very deep networks | 50–152+ layers |
| DenseNet | 2017 | Connected each layer to every subsequent layer via dense blocks | 121–264 layers |
| EfficientNet | 2019 | Compound scaling of depth, width, and resolution | Variable |
Residual connections
The residual connection (or skip connection) introduced by ResNet adds the input of a block directly to its output:
- $ \mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x} $
This allows gradients to flow directly through the identity path, mitigating the vanishing gradient problem and enabling the training of networks with hundreds of layers. Residual connections have become a standard component in nearly all modern architectures.
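A minimal sketch of the idea, using plain matrices for the residual branch $ \mathcal{F} $ rather than the convolutions an actual ResNet block uses:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x), where F is two linear maps with a ReLU in between.
    (ResNet applies the final activation after the addition.)"""
    f = relu(x @ w1) @ w2  # the residual branch F(x)
    return relu(f + x)     # identity shortcut added to the branch output
```

If the branch weights are zero, the block reduces to the identity (for non-negative inputs), which is exactly why deep stacks of such blocks are easy to optimise: each block only has to learn a *correction* to the identity mapping.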
Applications in computer vision
CNNs have achieved state-of-the-art performance across a wide range of vision tasks:
- Image classification — assigning a label to an entire image (ImageNet, CIFAR).
- Object detection — localising and classifying objects within an image (YOLO, Faster R-CNN, SSD).
- Semantic segmentation — assigning a class label to every pixel (U-Net, DeepLab).
- Instance segmentation — distinguishing individual instances of objects (Mask R-CNN).
- Image generation — generating realistic images using CNN-based generators (GANs, diffusion models).
- Medical imaging — tumour detection, retinal analysis, and radiology screening.
Practical tips
- Use pretrained models (transfer learning) when labelled data is limited.
- Prefer small kernels ($ 3 \times 3 $) stacked in depth — two $ 3 \times 3 $ layers have the same receptive field as one $ 5 \times 5 $ layer but with fewer parameters.
- Apply batch normalisation after convolution and before activation.
- Use data augmentation generously to reduce overfitting.
- Replace fully connected layers with global average pooling to reduce parameters.
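The parameter claim in the small-kernels tip can be checked directly; the count below covers weights only (biases ignored), and the channel width of 64 is an arbitrary example:

```python
def stacked_vs_single(c=64):
    """Weights for two stacked 3x3 conv layers vs one 5x5 layer,
    each with c input and c output channels."""
    two_3x3 = 2 * (c * 3 * 3 * c)   # 18 * c^2
    one_5x5 = c * 5 * 5 * c         # 25 * c^2
    return two_3x3, one_5x5

two, one = stacked_vs_single()
print(two, one)  # 73728 vs 102400: ~28% fewer weights for the same
                 # 5x5 receptive field, plus an extra non-linearity
```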
See also
- Neural Networks
- Backpropagation
- Overfitting and Regularization
- Recurrent Neural Networks
- Gradient Descent
References
- LeCun, Y. et al. (1998). "Gradient-Based Learning Applied to Document Recognition". Proceedings of the IEEE.
- Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks". NeurIPS.
- Simonyan, K. and Zisserman, A. (2015). "Very Deep Convolutional Networks for Large-Scale Image Recognition". ICLR.
- He, K. et al. (2016). "Deep Residual Learning for Image Recognition". CVPR.
- Tan, M. and Le, Q. V. (2019). "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks". ICML.