ImageNet Classification with Deep CNNs

    Research Paper
    Authors: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
    Year: 2012
    Venue: NeurIPS
    Topic area: Deep Learning
    Difficulty: Research

    ImageNet Classification with Deep Convolutional Neural Networks is a 2012 paper by Krizhevsky, Sutskever, and Hinton that introduced AlexNet, a deep convolutional neural network that won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 by a dramatic margin. The paper is widely considered the catalyst for the modern deep learning revolution, demonstrating that deep neural networks trained on GPUs could vastly outperform traditional computer vision methods on large-scale image recognition.

    Overview

    Before AlexNet, the dominant approaches to image classification relied on hand-engineered features (such as SIFT, HOG, or Fisher vectors) fed into shallow classifiers like SVMs. While neural networks had shown promise on smaller datasets like MNIST, they had not been successfully scaled to complex, large-scale recognition tasks. Many researchers questioned whether deep networks could compete with carefully designed feature pipelines.

    Krizhevsky et al. shattered this assumption by training a deep convolutional neural network with 60 million parameters on the ImageNet LSVRC-2010 dataset (1.2 million images, 1,000 classes), achieving top-5 error rates that were nearly half those of the best competing methods. This result demonstrated that the combination of large datasets, GPU computing, and architectural innovations could unlock the representational power of deep networks.

    Key Contributions

    • Large-scale CNN training on GPUs: One of the first successful demonstrations of training deep convolutional networks on GPUs, using a model split across two NVIDIA GTX 580 GPUs with 3 GB of memory each.
    • ReLU activation function: Adoption of rectified linear units ($ f(x) = \max(0, x) $) instead of the traditional sigmoid or tanh activations, enabling much faster training of deep networks.
    • Data augmentation: Use of random image translations, horizontal reflections, and PCA-based color augmentation to artificially enlarge the training set and reduce overfitting.
    • Dropout regularization: Application of dropout (with probability 0.5) in the fully connected layers, one of the earliest uses of this technique in a large convolutional network.
    • Local response normalization: A normalization scheme inspired by lateral inhibition in biological neurons, applied after ReLU activations.
    • Overlapping pooling: Use of max-pooling with stride smaller than the kernel size, which slightly reduced overfitting compared to non-overlapping pooling.
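The local response normalization scheme mentioned above can be sketched in NumPy. This is a minimal, unvectorized reading of the paper's formula, which divides each activation by a factor computed from the squared activations of n adjacent channels at the same spatial position; the hyperparameter values (k = 2, n = 5, α = 10⁻⁴, β = 0.75) are those the paper reports:

```python
import numpy as np

def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Local response normalization across channels: each activation is
    divided by (k + alpha * sum of squares over n neighboring channels)
    raised to the power beta. `a` has shape (channels, height, width)."""
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        # Window of channels centered on i, clipped at the boundaries.
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b
```

With k = 2 the denominator always exceeds 1, so the scheme uniformly damps activations while damping strongly responding channel neighborhoods the most.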

    Methods

    AlexNet consists of eight learned layers: five convolutional layers followed by three fully connected layers. The final fully connected layer feeds into a 1,000-way softmax to produce the class probability distribution.

    The network processes 224x224 RGB images. The first convolutional layer applies 96 kernels of size 11x11 with stride 4, dramatically reducing the spatial dimensions. Subsequent layers use smaller kernels (5x5 and 3x3). The architecture was split across two GPUs, with each GPU processing half of the feature maps, and cross-GPU communication occurring only at certain layers.
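The spatial dimensions at each layer follow the standard convolution arithmetic, floor((W − K + 2P)/S) + 1. A well-known quirk of the paper is that (224 − 11)/4 + 1 is not an integer, so most reimplementations use an effective 227x227 input (or equivalent padding); the sketch below assumes that convention:

```python
def conv_out(size, kernel, stride, pad=0):
    """Spatial output size of a conv/pool layer: floor((W - K + 2P) / S) + 1."""
    return (size - kernel + 2 * pad) // stride + 1

# First convolutional layer: 11x11 kernels with stride 4. Assuming the
# effective 227x227 input used by most reimplementations, this gives 55x55:
first = conv_out(227, 11, 4)    # 55
# Followed by overlapping 3x3 max pooling with stride 2:
pooled = conv_out(first, 3, 2)  # 27
```

The same formula explains the "dramatic" reduction the text describes: a single stride-4 layer shrinks each spatial dimension roughly fourfold.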

    The ReLU activation function was a critical innovation. Compared to the saturating nonlinearities (sigmoid, tanh) standard at the time, ReLU enabled training to converge approximately six times faster on the same architecture:

    $ f(x) = \max(0, x) $
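In code, the nonlinearity and its derivative are one-liners. The key property is visible in the gradient: unlike sigmoid or tanh, ReLU's gradient does not shrink toward zero for large positive inputs, which is what makes training converge faster:

```python
import numpy as np

def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    The gradient never saturates on the positive side."""
    return (x > 0).astype(float)
```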

    Data augmentation was applied in two forms. The first extracted random 224x224 patches (and their horizontal reflections) from the 256x256 images, increasing the training set by a factor of 2,048. The second performed PCA-based color perturbation, adding multiples of the principal components of the RGB pixel values to each image, reducing the top-1 error rate by over 1%.
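The first augmentation scheme can be sketched as follows (the exact sampling details are an assumption; the paper's factor of 2,048 comes from counting 32 x 32 translations times 2 reflections):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_and_flip(img, crop=224):
    """Extract a random crop x crop patch from `img` (shape (H, W, 3))
    and flip it horizontally with probability 0.5, in the spirit of the
    paper's translation-and-reflection augmentation."""
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]  # horizontal reflection
    return patch

# The augmentation factor the paper cites: 32x32 translations, 2 reflections.
n_augmented = 32 * 32 * 2  # 2048
```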

    Dropout was applied to the outputs of the first two fully connected layers during training, randomly setting each neuron's output to zero with probability 0.5. This approximately doubled the number of iterations required to converge, but substantially reduced overfitting.
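A minimal sketch of dropout as the paper uses it: zero each activation with probability 0.5 during training, and at test time keep every unit but scale its output by 0.5 so the expected activation matches training:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5, train=True):
    """Dropout as described in the paper: during training each activation
    is zeroed with probability p; at test time all units are kept and
    their outputs are multiplied by (1 - p)."""
    if train:
        mask = rng.random(x.shape) >= p
        return x * mask
    return x * (1.0 - p)
```

Modern frameworks typically use the equivalent "inverted" variant (scaling by 1/(1 − p) at training time instead), but the test-time scaling shown here is what the original paper describes.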

    The network was trained using stochastic gradient descent with a batch size of 128, momentum of 0.9, and weight decay of 0.0005. The learning rate was initialized at 0.01 and manually reduced by a factor of 10 when the validation error stopped improving. Training took approximately five to six days on two NVIDIA GTX 580 GPUs.
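The paper gives the update rule explicitly: v ← 0.9·v − 0.0005·ε·w − ε·∂L/∂w, then w ← w + v, where ε is the learning rate. A single parameter update can be sketched as:

```python
import numpy as np

def sgd_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One update following the rule stated in the paper:
        v <- momentum * v - weight_decay * lr * w - lr * grad
        w <- w + v
    Note that the weight-decay term is scaled by the learning rate."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    w = w + v
    return w, v
```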

    Results

    On the ILSVRC-2012 competition, AlexNet achieved:

    • Top-5 error rate of 15.3% on the test set, compared to 26.2% for the second-place entry (which used traditional features with an SVM). This 10.9 percentage point gap was unprecedented in the competition's history.
    • Top-1 error rate of 36.7% on the validation set (for the best ensemble), also substantially ahead of competing methods.

    On the ILSVRC-2010 test set (where labels were publicly available), the network achieved top-1 and top-5 error rates of 37.5% and 17.0% respectively, outperforming the previous best results of 47.1% and 28.2%.
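The top-k error metric used throughout these results counts an example as correct whenever the true label appears among the model's k highest-scoring classes. A minimal implementation:

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is NOT among the k
    highest-scoring classes -- the metric used for ILSVRC results.
    `scores` has shape (n_examples, n_classes)."""
    topk = np.argsort(scores, axis=1)[:, -k:]  # indices of the k largest scores
    correct = np.any(topk == labels[:, None], axis=1)
    return 1.0 - correct.mean()
```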

    Qualitative analysis of the learned features showed that the first convolutional layer learned a collection of frequency- and orientation-selective filters as well as color-specific filters — reminiscent of the simple cells found in the primary visual cortex. The two GPU pathways specialized differently, with one GPU learning largely color-agnostic features and the other learning color-specific features.

    The authors also demonstrated that the features learned by AlexNet transferred well to other tasks, achieving competitive results when the last-layer features were used with simple classifiers on other datasets.

    Impact

    AlexNet is widely credited with igniting the deep learning revolution. Its decisive victory in the 2012 ImageNet competition convinced the computer vision community — and the broader AI field — that deep neural networks were a viable and powerful approach to perception tasks. Within two years, virtually all competitive entries in ImageNet used deep convolutional networks, and the top-5 error rate dropped below human-level performance by 2015.

    The paper introduced or popularized several techniques (ReLU, dropout, GPU training, data augmentation) that became standard practice. It directly influenced subsequent architectures including VGGNet, GoogLeNet, and ResNet. The use of GPUs for training, popularized by this work, transformed the hardware landscape for machine learning and drove the development of specialized AI accelerators.

    AlexNet is consistently ranked among the most influential machine learning papers ever published and is a landmark in the history of artificial intelligence.

    The paper's success also validated the importance of large-scale labeled datasets for training deep networks. The ImageNet dataset itself, curated by Fei-Fei Li and collaborators, proved essential — without 1.2 million labeled images, the deep network's capacity could not have been fully leveraged. This insight spurred the creation of large-scale datasets across many domains.

    The collaboration between Krizhevsky, Sutskever, and Hinton at the University of Toronto exemplified the academic origins of the deep learning renaissance, and all three went on to play central roles in the subsequent development of the field at major technology companies.

    References

    • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems 25 (NeurIPS 2012).
    • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A Large-Scale Hierarchical Image Database. CVPR 2009.
    • Simonyan, K. & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.