Convolutional Neural Networks

    Revision as of 07:00, 24 April 2026

    Topic area: Deep Learning
    Difficulty: Intermediate
    Prerequisites: Neural Networks, Backpropagation

    Convolutional neural networks (CNNs or ConvNets) are a class of deep neural networks specifically designed to process data with a grid-like topology, such as images (2D grids of pixels), audio spectrograms, and video. They exploit the spatial structure of the input through local connectivity, weight sharing, and pooling, making them far more efficient than fully connected networks for visual and spatial tasks.

    The convolution operation

    The core building block is the discrete convolution. For a 2D input $ \mathbf{X} $ and a filter (kernel) $ \mathbf{K} $ of size $ k \times k $, the output feature map $ \mathbf{Y} $ is:

    $ Y_{i,j} = \sum_{m=0}^{k-1}\sum_{n=0}^{k-1} K_{m,n} \cdot X_{i+m,\, j+n} + b $

    where $ b $ is a bias term. The filter slides (convolves) across the input, computing a dot product at each position. Technically, most implementations compute cross-correlation rather than true convolution (which would flip the kernel), but the distinction is immaterial since the kernel weights are learned.

    Key hyperparameters controlling the convolution:

    • Kernel size — the spatial extent of the filter (e.g. $ 3 \times 3 $, $ 5 \times 5 $).
    • Stride — the step size between successive positions of the kernel. A stride of 2 halves the spatial dimensions.
    • Padding — adding zeros around the border of the input to control the output size. "Same" padding preserves the spatial dimensions; "valid" padding adds none, so at stride 1 the output shrinks by $ k - 1 $ in each dimension.
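As a concrete illustration of the formula and hyperparameters above, here is a direct (deliberately slow) NumPy sketch of 2D cross-correlation with stride and zero padding. The function name and loop structure are illustrative only; production frameworks use heavily optimised routines such as im2col or FFT-based convolution.

```python
import numpy as np

def conv2d(x, k, b=0.0, stride=1, padding=0):
    """Cross-correlation of a 2D input x with kernel k, as in
    Y[i,j] = sum_{m,n} K[m,n] * X[i*s+m, j*s+n] + b."""
    if padding:
        x = np.pad(x, padding)          # zero-pad all four borders
    kh, kw = k.shape
    oh = (x.shape[0] - kh) // stride + 1  # output height
    ow = (x.shape[1] - kw) // stride + 1  # output width
    y = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[i*stride:i*stride+kh, j*stride:j*stride+kw]
            y[i, j] = np.sum(k * window) + b
    return y

x = np.arange(16, dtype=float).reshape(4, 4)
print(conv2d(x, np.ones((2, 2))).shape)            # (3, 3): valid, stride 1
print(conv2d(x, np.ones((2, 2)), stride=2).shape)  # (2, 2): stride 2 halves it
print(conv2d(x, np.ones((3, 3)), padding=1).shape) # (4, 4): "same" padding
```

Note how "same" padding (padding of 1 for a $ 3 \times 3 $ kernel at stride 1) preserves the $ 4 \times 4 $ input size, while stride 2 halves each spatial dimension.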

    Filters and feature detection

    Each filter learns to detect a specific local pattern. In early layers, filters typically respond to edges, corners, and colour gradients. Deeper layers compose these into higher-level features — textures, parts, and eventually entire objects.

    A convolutional layer applies multiple filters in parallel, producing a stack of feature maps. If the input has $ C_{\text{in}} $ channels and the layer has $ C_{\text{out}} $ filters, the total number of learnable parameters is:

    $ C_{\text{out}} \times (C_{\text{in}} \times k^2 + 1) $

    This is dramatically fewer than a fully connected layer with the same input and output dimensions, because weights are shared across all spatial positions.
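To make the parameter-count formula concrete, the following quick check evaluates it in Python. The 3-to-64-channel example is our own choice for illustration; it corresponds to a first $ 3 \times 3 $ convolutional layer over an RGB input.

```python
def conv_param_count(c_in, c_out, k):
    """C_out filters, each with C_in * k^2 weights plus one bias."""
    return c_out * (c_in * k * k + 1)

# A 3x3 conv mapping a 3-channel RGB input to 64 feature maps:
print(conv_param_count(3, 64, 3))  # 1792
```

Those 1,792 parameters serve every spatial position of the image; a fully connected layer mapping the same input pixels to the same number of outputs would need billions of weights.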

    Pooling

    Pooling layers downsample the feature maps, reducing their spatial dimensions and providing a degree of translation invariance. Common pooling operations:

    • Max pooling — takes the maximum value in each local window (e.g. $ 2 \times 2 $).
    • Average pooling — takes the mean value in each window.
    • Global average pooling — averages each entire feature map to a single value, often used before the final classification layer.

    Pooling reduces computational cost and helps prevent overfitting by progressively abstracting the representation.
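A minimal NumPy sketch of max pooling, the most common of the operations above (the loop-based implementation is for clarity, not speed):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over non-overlapping (by default) size x size windows."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    y = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            y[i, j] = x[i*stride:i*stride+size,
                        j*stride:j*stride+size].max()
    return y

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x))  # [[ 5.  7.] [13. 15.]]
```

Replacing `.max()` with `.mean()` gives average pooling, and a single window covering the whole feature map gives global average pooling.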

    Architecture of a CNN

    A typical CNN alternates convolutional layers and pooling layers, followed by one or more fully connected layers for the final prediction:

    Input → [Conv → ReLU → Pool] × N → Flatten → FC → FC → Output
    

    Each conv-pool block extracts increasingly abstract features, while the fully connected layers combine them for classification or regression.
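The spatial dimensions through such a pipeline can be traced with the standard output-size formula. The 32×32 input and three conv-pool blocks below are a hypothetical CIFAR-style example, not a specific published network:

```python
def out_size(n, k, stride=1, pad=0):
    """Spatial output size of a conv or pool layer with a square k x k kernel."""
    return (n + 2 * pad - k) // stride + 1

n = 32                              # 32x32 input image
for _ in range(3):                  # N = 3 conv-pool blocks
    n = out_size(n, 3, pad=1)       # 3x3 conv, "same" padding: size unchanged
    n = out_size(n, 2, stride=2)    # 2x2 max pool: size halved
print(n)  # 4: the 32x32 input is reduced to 4x4 before flattening
```

The flatten step would then produce a vector of length $ 4 \times 4 \times C_{\text{out}} $ for the fully connected layers.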

    Landmark architectures

    Architecture | Year | Key contribution | Depth
    LeNet-5 | 1998 | Pioneered CNNs for handwritten digit recognition (MNIST) | 5 layers
    AlexNet | 2012 | Won ImageNet; popularised ReLU, dropout, GPU training | 8 layers
    VGGNet | 2014 | Showed depth matters; used only $ 3 \times 3 $ filters throughout | 16–19 layers
    GoogLeNet (Inception) | 2014 | Introduced inception modules with parallel filter sizes | 22 layers
    ResNet | 2015 | Introduced residual connections enabling very deep networks | 50–152+ layers
    DenseNet | 2017 | Connected each layer to every subsequent layer via dense blocks | 121–264 layers
    EfficientNet | 2019 | Compound scaling of depth, width, and resolution | Variable

    Residual connections

    The residual connection (or skip connection) introduced by ResNet adds the input of a block directly to its output:

    $ \mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x} $

    This allows gradients to flow directly through the identity path, mitigating the vanishing gradient problem and enabling the training of networks with hundreds of layers. Residual connections have become a standard component in nearly all modern architectures.
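A minimal sketch of the identity-skip idea, using two linear maps with a ReLU in place of the convolutional residual function $ \mathcal{F} $ (real ResNet blocks use convolutions and batch normalisation; this simplification only illustrates the additive skip):

```python
import numpy as np

def residual_block(x, w1, w2):
    """y = F(x) + x, where F is two linear layers with a ReLU between."""
    f = np.maximum(w1 @ x, 0.0)  # first layer + ReLU
    f = w2 @ f                   # second layer (no activation before the add)
    return f + x                 # identity skip connection

# With all-zero weights, F(x) = 0 and the block reduces to the identity,
# which is why deep residual stacks are easy to optimise from the start.
x = np.array([1.0, -2.0, 3.0])
w0 = np.zeros((3, 3))
print(residual_block(x, w0, w0))  # [ 1. -2.  3.]
```

Because the identity path bypasses $ \mathcal{F} $ entirely, the gradient of the loss with respect to $ \mathbf{x} $ always contains a direct term, regardless of how small the gradients through $ \mathcal{F} $ become.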

    Applications in computer vision

    CNNs have achieved state-of-the-art performance across a wide range of vision tasks:

    • Image classification — assigning a label to an entire image (ImageNet, CIFAR).
    • Object detection — localising and classifying objects within an image (YOLO, Faster R-CNN, SSD).
    • Semantic segmentation — assigning a class label to every pixel (U-Net, DeepLab).
    • Instance segmentation — distinguishing individual instances of objects (Mask R-CNN).
    • Image generation — generating realistic images using CNN-based generators (GANs, diffusion models).
    • Medical imaging — tumour detection, retinal analysis, and radiology screening.

    Practical tips

    • Use pretrained models (transfer learning) when labelled data is limited.
    • Prefer small kernels ($ 3 \times 3 $) stacked in depth — two $ 3 \times 3 $ layers have the same receptive field as one $ 5 \times 5 $ layer but with fewer parameters.
    • Apply batch normalisation after convolution and before activation.
    • Use data augmentation generously to reduce overfitting.
    • Replace fully connected layers with global average pooling to reduce parameters.
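The second tip above is easy to verify by counting weights. Two stacked $ 3 \times 3 $ layers see a $ 5 \times 5 $ region of the input yet need fewer parameters than a single $ 5 \times 5 $ layer (the channel count of 64 is an illustrative choice; biases are ignored for simplicity):

```python
def stacked_3x3_params(c):
    # two 3x3 conv layers, each with c input and c output channels
    return 2 * (c * c * 3 * 3)   # 18 * c^2

def single_5x5_params(c):
    # one 5x5 conv layer with the same 5x5 receptive field
    return c * c * 5 * 5         # 25 * c^2

c = 64
print(stacked_3x3_params(c), single_5x5_params(c))  # 73728 102400
```

The stacked version also interposes an extra nonlinearity between the two layers, which tends to increase representational power.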


    References

    • LeCun, Y. et al. (1998). "Gradient-Based Learning Applied to Document Recognition". Proceedings of the IEEE.
    • Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks". NeurIPS.
    • Simonyan, K. and Zisserman, A. (2015). "Very Deep Convolutional Networks for Large-Scale Image Recognition". ICLR.
    • He, K. et al. (2016). "Deep Residual Learning for Image Recognition". CVPR.
    • Tan, M. and Le, Q. V. (2019). "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks". ICML.