Convolutional Neural Networks/en - Revision history

FuzzyBot: Updating to match new version of source page

2026-04-27T23:39:05Z

Updating to match new version of source page

FuzzyBot: Updating to match new version of source page

2026-04-27T22:22:31Z

Updating to match new version of source page

FuzzyBot: Updating to match new version of source page

2026-04-27T19:42:05Z

Updating to match new version of source page

FuzzyBot: Updating to match new version of source page

2026-04-27T02:39:39Z

Updating to match new version of source page

FuzzyBot: Updating to match new version of source page

2026-04-27T00:30:33Z

Updating to match new version of source page

New page

<languages />
{{LanguageBar | page = Convolutional Neural Networks}}
{{ArticleInfobox | topic_area = Deep Learning | difficulty = Intermediate | prerequisites = [[Neural Networks]], [[Backpropagation]]}}
{{ContentMeta | generated_by = claude-opus | model_used = claude-opus-4-6 | generated_date = 2026-03-13}}

'''Convolutional neural networks''' ('''CNNs''' or '''ConvNets''') are a class of deep [[Neural Networks|neural networks]] designed to process data with a grid-like topology, such as images (2D grids of pixels), audio spectrograms, and video. They exploit the spatial structure of the input through local connectivity, weight sharing, and pooling, which makes them far more parameter-efficient than fully connected networks for visual and spatial tasks.

== The convolution operation ==

The core building block is the '''discrete convolution'''. For a 2D input <math>\mathbf{X}</math> and a filter (kernel) <math>\mathbf{K}</math> of size <math>k \times k</math>, the output feature map <math>\mathbf{Y}</math> is:

:<math>Y_{i,j} = \sum_{m=0}^{k-1}\sum_{n=0}^{k-1} K_{m,n} \cdot X_{i+m,\, j+n} + b</math>

where <math>b</math> is a bias term. The filter slides across the input, computing a dot product between the kernel and the input patch at each position. Strictly speaking, most implementations compute '''cross-correlation''' rather than true convolution (which would flip the kernel along both axes), but the distinction is immaterial in practice: because the kernel weights are learned, the network can learn the flipped kernel just as easily.
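
The formula above can be sketched in a few lines of plain Python. This is only an illustrative, unoptimised version for a single-channel input with stride 1 and no padding; the function names <code>cross_correlate2d</code> and <code>flip</code> are chosen here for the example and do not come from any library.

```python
def cross_correlate2d(x, kernel, bias=0.0):
    """Stride-1, unpadded cross-correlation of a 2D list with a square kernel."""
    k = len(kernel)
    out_h = len(x) - k + 1
    out_w = len(x[0]) - k + 1
    y = [[bias] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Y[i][j] = sum over m, n of K[m][n] * X[i+m][j+n] + b
            for m in range(k):
                for n in range(k):
                    y[i][j] += kernel[m][n] * x[i + m][j + n]
    return y


def flip(kernel):
    """Flip a kernel along both axes (what true convolution would do first)."""
    return [row[::-1] for row in kernel[::-1]]


x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
k = [[1, 0],
     [0, -1]]

print(cross_correlate2d(x, k))        # [[-4.0, -4.0], [-4.0, -4.0]]
print(cross_correlate2d(x, flip(k)))  # [[4.0, 4.0], [4.0, 4.0]]
```

The second call shows what true convolution would compute for the same kernel: the result differs only because the kernel is flipped, which a learned kernel can absorb.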

Key hyperparameters controlling the convolution:

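* '''Kernel size''': the spatial extent of the filter (e.g. <math>3 \times 3</math>, <math>5 \times 5</math>).
* '''Stride''': the step size between successive positions of the kernel. For input size <math>n</math>, padding <math>p</math>, and stride <math>s</math>, each output dimension is <math>\lfloor (n + 2p - k)/s \rfloor + 1</math>, so a stride of 2 roughly halves the spatial dimensions.
* '''Padding''': adding values (usually zeros) around the border of the input to control the output size. "Same" padding preserves the spatial dimensions at stride 1; "valid" padding adds no padding, so the output shrinks.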
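
As a rough illustration of how these three hyperparameters interact, the helper below computes the output length along one spatial dimension using the standard formula <math>\lfloor (n + 2p - k)/s \rfloor + 1</math>. The function name and argument names are hypothetical, chosen only for this sketch.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one dimension for input length n and kernel size k."""
    return (n + 2 * padding - k) // stride + 1


print(conv_output_size(32, 3))                       # 30  ("valid": no padding)
print(conv_output_size(32, 3, padding=1))            # 32  ("same" at stride 1)
print(conv_output_size(32, 3, stride=2, padding=1))  # 16  (stride 2 halves the size)
print(conv_output_size(7, 3, stride=2))              # 3   (halving is only approximate)
```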

== Filters and feature detection ==

Each filter learns to detect a specific local pattern. In early layers, filters typically respond to edges, corners, and colour gradients. Deeper layers compose these into higher-level features — textures, parts, and eventually entire objects.

A convolutional layer applies multiple filters in parallel, producing a stack of feature maps. If the input has <math>C_{\text{in}}</math> channels and the layer has <math>C_{\text{out}}</math> filters, the total number of learnable parameters is:

:<math>C_{\text{out}} \times (C_{\text{in}} \times k^2 + 1)</math>

This is dramatically fewer than a fully connected layer with the same input and output dimensions, because weights are shared across all spatial positions.
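
To make the comparison concrete, the sketch below counts parameters for a convolutional layer using the formula above, and for a fully connected layer mapping the same input volume to the same output volume. The 32 × 32 input size and the channel counts are arbitrary values assumed for illustration.

```python
def conv_params(c_in, c_out, k):
    """Each of the c_out filters has c_in * k * k weights plus one bias."""
    return c_out * (c_in * k * k + 1)


def dense_params(n_in, n_out):
    """A fully connected layer has one weight per input-output pair plus biases."""
    return n_out * (n_in + 1)


# Assumed example: 3-channel 32x32 input, 64 output channels, 3x3 kernels,
# with "same" padding so the output is also 32x32.
h = w = 32
c_in, c_out, k = 3, 64, 3

print(conv_params(c_in, c_out, k))                  # 1792
print(dense_params(c_in * h * w, c_out * h * w))    # 201392128
```

The convolutional count does not depend on the height or width of the input, whereas the fully connected count grows with both.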

== Pooling ==

'''Pooling''' layers downsample the feature maps, reducing their spatial dimensions and providing a degree of translation invariance. Common pooling operations:

* '''Max pooling''' — takes the maximum value in each local window (e.g. <math>2 \times 2</math>).
* '''Average pooling''' — takes the mean value in each window.
* '''Global average pooling''' — averages each entire feature map to a single value, often used before the final classification layer.

Pooling reduces computational cost by shrinking the feature maps passed to later layers, and it can help limit overfitting because it discards fine spatial detail and keeps a more abstract summary of each region.
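
The three pooling operations listed above can be sketched in plain Python as follows. This is a minimal version assuming non-overlapping windows (stride equal to the window size) on a single feature map; <code>pool2d</code> and <code>global_average_pool</code> are illustrative names, not library functions.

```python
def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling; rows/columns that do not fill a window are dropped."""
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    out = []
    for i in range(0, len(x) - size + 1, size):
        row = []
        for j in range(0, len(x[0]) - size + 1, size):
            window = [x[i + a][j + b] for a in range(size) for b in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out


def global_average_pool(x):
    """Reduce an entire feature map to its mean value."""
    values = [v for row in x for v in row]
    return sum(values) / len(values)


fmap = [[1, 3, 2, 0],
        [5, 4, 1, 1],
        [0, 2, 9, 6],
        [1, 1, 7, 8]]

print(pool2d(fmap, 2, "max"))     # [[5, 2], [2, 9]]
print(pool2d(fmap, 2, "avg"))     # [[3.25, 1.0], [1.0, 7.5]]
print(global_average_pool(fmap))  # 3.1875
```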

== Architecture of a CNN ==

A typical CNN alternates convolutional layers and pooling layers, followed by one or more fully connected layers for the final prediction:

<pre>
Input → [Conv → ReLU → Pool] × N → Flatten → FC → FC → Output
</pre>

Each conv-pool block extracts increasingly abstract features, while the fully connected layers combine them for classification or regression.
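
One way to get a feel for this pipeline is to trace tensor shapes through it. The sketch below does only that, with no actual computation, under assumptions that are not part of the pattern itself: each convolution uses "same" padding (so height and width are unchanged), each pooling layer is <math>2 \times 2</math> with stride 2, and the channel widths and <code>fc_units</code> value are arbitrary.

```python
def trace_shapes(height, width, channels, conv_channels, num_classes, fc_units=128):
    """Return (layer name, shape) pairs for Input -> [Conv -> ReLU -> Pool] x N -> Flatten -> FC -> FC."""
    shapes = [("input", (channels, height, width))]
    for idx, c_out in enumerate(conv_channels, start=1):
        # Assumed "same"-padded convolution: only the channel count changes.
        channels = c_out
        shapes.append((f"conv{idx}+relu", (channels, height, width)))
        # Assumed 2x2 pooling with stride 2: height and width are halved (rounded down).
        height, width = height // 2, width // 2
        shapes.append((f"pool{idx}", (channels, height, width)))
    shapes.append(("flatten", (channels * height * width,)))
    shapes.append(("fc1", (fc_units,)))
    shapes.append(("fc2 (output)", (num_classes,)))
    return shapes


for name, shape in trace_shapes(32, 32, 3, conv_channels=[16, 32], num_classes=10):
    print(f"{name:14s} {shape}")
```

For a 3 × 32 × 32 input with two blocks of 16 and 32 channels, the flattened vector has 32 × 8 × 8 = 2048 elements before the fully connected layers.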

== Landmark architectures ==

{| class="wikitable"
|-
! Architecture !! Year !! Key contribution !! Depth
|-
| '''LeNet-5''' || 1998 || Pioneered CNNs for handwritten digit recognition (MNIST) || 7 layers (5 excluding subsampling layers)
|-
| '''AlexNet''' || 2012 || Won the ILSVRC 2012 ImageNet challenge; popularised ReLU, dropout, and GPU training || 8 layers
|-
| '''VGGNet''' || 2014 || Showed depth matters; used only <math>3 \times 3</math> filters throughout || 16–19 layers
|-
| '''GoogLeNet (Inception)''' || 2014 || Introduced inception modules with parallel filter sizes || 22 layers
|-
| '''ResNet''' || 2015 || Introduced residual connections enabling very deep networks || 50–152+ layers
|-
| '''DenseNet''' || 2017 || Connected each layer to every subsequent layer within a dense block || 121–264 layers
|-
| '''EfficientNet''' || 2019 || Compound scaling of depth, width, and resolution || Variable
|}

=== Residual connections ===

The '''residual connection''' (or skip connection) introduced by ResNet adds the input of a block directly to its output:

:<math>\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}</math>

Because the derivative of the identity path is 1, gradients can flow directly through it, which mitigates the vanishing gradient problem and enables the training of networks with hundreds of layers. Residual connections have since become a standard component of most modern deep architectures.
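
The sketch below illustrates both points in a deliberately simplified setting. <code>residual_block</code> applies <math>\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}</math> to a plain list, and <code>chain_gradient</code> is a toy scalar model (an assumption made for illustration, not how real networks are analysed) in which every layer has the same small local slope, so the end-to-end gradient is a product of per-layer factors.

```python
def residual_block(x, f):
    """Compute y = F(x) + x element-wise; F must preserve the shape of x."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same length as x")
    return [a + b for a, b in zip(fx, x)]


def chain_gradient(local_slopes, residual):
    """Toy scalar model: multiply per-layer derivatives along the chain.

    A plain layer contributes F'(x); a residual layer contributes F'(x) + 1.
    """
    grad = 1.0
    for s in local_slopes:
        grad *= (s + 1.0) if residual else s
    return grad


# If F outputs zeros, the block reduces to the identity mapping.
print(residual_block([1.0, 2.0], lambda v: [0.0 for _ in v]))  # [1.0, 2.0]

slopes = [0.1] * 20  # 20 layers, each with a small local derivative
print(chain_gradient(slopes, residual=False))  # vanishingly small, about 1e-20
print(chain_gradient(slopes, residual=True))   # stays of order 1 (about 6.7)
```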

== Applications in computer vision ==

CNNs have achieved state-of-the-art performance across a wide range of vision tasks:

* '''Image classification''' — assigning a label to an entire image (ImageNet, CIFAR).
* '''Object detection''' — localising and classifying objects within an image (YOLO, Faster R-CNN, SSD).
* '''Semantic segmentation''' — assigning a class label to every pixel (U-Net, DeepLab).
* '''Instance segmentation''' — distinguishing individual instances of objects (Mask R-CNN).
* '''Image generation''' — generating realistic images using CNN-based generators (GANs, diffusion models).
* '''Medical imaging''' — tumour detection, retinal analysis, and radiology screening.

== Practical tips ==

* Use pretrained models (transfer learning) when labelled data is limited.
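* Prefer small kernels (<math>3 \times 3</math>) stacked in depth. Two <math>3 \times 3</math> layers have the same receptive field as one <math>5 \times 5</math> layer but use fewer parameters: with <math>C</math> input and output channels, <math>18C^2</math> weights instead of <math>25C^2</math>.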
* A common convention is to apply batch normalisation after the convolution and before the activation.
* Use data augmentation generously to reduce [[Overfitting and Regularization|overfitting]].
* Replace fully connected layers with global average pooling to reduce parameters.
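
The small-kernel tip can be checked with a short calculation. The sketch below assumes stride-1 convolutions with the same number of channels <math>C</math> in and out of every layer, and ignores biases; the helper names and the choice of <math>C = 64</math> are illustrative.

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions: each layer adds k - 1."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf


def stacked_weights(kernel_sizes, channels):
    """Weight count (no biases) when every layer maps `channels` to `channels`."""
    return sum(k * k * channels * channels for k in kernel_sizes)


c = 64
print(receptive_field([3, 3]), stacked_weights([3, 3], c))  # 5 73728
print(receptive_field([5]), stacked_weights([5], c))        # 5 102400
```

Both configurations see a <math>5 \times 5</math> region of the input, but the stacked version uses 28% fewer weights and applies an additional non-linearity between the two layers.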

== See also ==

* [[Neural Networks]]
* [[Backpropagation]]
* [[Overfitting and Regularization]]
* [[Recurrent Neural Networks]]
* [[Gradient Descent]]

== References ==

* LeCun, Y. et al. (1998). "Gradient-Based Learning Applied to Document Recognition". ''Proceedings of the IEEE''.
* Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks". ''NeurIPS''.
* Simonyan, K. and Zisserman, A. (2015). "Very Deep Convolutional Networks for Large-Scale Image Recognition". ''ICLR''.
* He, K. et al. (2016). "Deep Residual Learning for Image Recognition". ''CVPR''.
* Tan, M. and Le, Q. V. (2019). "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks". ''ICML''.

[[Category:Deep Learning]]
[[Category:Intermediate]]
[[Category:Neural Networks]]
