Deep Residual Learning for Image Recognition

    Research Paper
    Authors: Kaiming He; Xiangyu Zhang; Shaoqing Ren; Jian Sun
    Year: 2016
    Venue: CVPR
    Topic area: Deep Learning
    Difficulty: Research
    arXiv: 1512.03385

    Deep Residual Learning for Image Recognition is a 2016 paper by He et al. from Microsoft Research that introduced residual networks (ResNets), a framework for training extremely deep neural networks using skip connections (also called shortcut connections). The paper demonstrated that networks with over 100 layers could be trained effectively, winning first place in the ILSVRC 2015 image classification competition with a 3.57% top-5 error rate.

    Overview

    As neural networks grew deeper in the mid-2010s, researchers observed a counterintuitive degradation problem: adding more layers to a network eventually caused training accuracy to degrade, not from overfitting but from optimization difficulty. A 56-layer plain network performed worse than a 20-layer network on both training and test sets, indicating that deeper networks were harder to optimize rather than simply more prone to overfitting.

    He et al. proposed that instead of learning desired underlying mappings directly, layers should learn residual functions with reference to the layer inputs. This reformulation, implemented through shortcut connections that skip one or more layers, made it substantially easier to optimize very deep networks and enabled training of architectures with up to 152 layers (and experimentally over 1,000 layers) without degradation.

    Key Contributions

    • Residual learning framework: A reformulation where network layers learn residual functions $ F(x) = H(x) - x $ rather than unreferenced mappings $ H(x) $, with identity shortcut connections passing the input directly to deeper layers.
    • Extremely deep networks: Successful training of networks with 152 layers for ImageNet and over 1,000 layers on CIFAR-10, far exceeding the depth of prior architectures.
    • State-of-the-art results: First place in the ILSVRC 2015 classification, detection, and localization tracks, as well as first place in the COCO 2015 detection and segmentation tracks.
    • Generalizable insight: The residual learning principle proved applicable far beyond image classification, influencing architectures across all areas of deep learning.

    Methods

    The core idea is deceptively simple. For a stack of layers intended to fit a desired mapping $ H(x) $, instead of fitting $ H(x) $ directly, the layers are tasked with fitting the residual:

    $ F(x) := H(x) - x $

    The original mapping is then recast as $ H(x) = F(x) + x $. This is implemented by adding an identity shortcut connection that skips one or more layers:

    $ y = F(x, \{W_i\}) + x $

    where $ F(x, \{W_i\}) $ represents the residual mapping to be learned (typically two or three convolutional layers with batch normalization and ReLU activations). The addition is element-wise and requires that $ F $ and $ x $ have the same dimensions. When dimensions differ (e.g., at downsampling stages), a linear projection $ W_s $ is applied to the shortcut:

    $ y = F(x, \{W_i\}) + W_s x $

    The hypothesis is that it is easier for a network to learn a small residual perturbation $ F(x) \approx 0 $ than to learn an identity mapping from scratch. If the optimal function is close to identity, the residual formulation makes it easy for the solver to push weights toward zero rather than fitting an identity through nonlinear layers.
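    A minimal sketch of such a residual block in PyTorch may make the structure concrete. This is illustrative rather than the authors' original implementation; the class name, argument layout, and use of PyTorch are assumptions, while the two 3×3 convolutions, batch normalization, the identity or projection shortcut, and the ReLU applied after the addition follow the paper's description.

```python
import torch.nn as nn
import torch.nn.functional as F


class BasicBlock(nn.Module):
    """Residual block: two 3x3 convolutions computing F(x), added to a shortcut of x."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Identity shortcut when shapes match; 1x1 projection W_s otherwise
        # (e.g. at downsampling stages where stride > 1 or channels change).
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))   # first conv + BN + ReLU
        out = self.bn2(self.conv2(out))         # second conv + BN (no ReLU yet)
        out = out + self.shortcut(x)            # y = F(x, {W_i}) + x  or  + W_s x
        return F.relu(out)                      # nonlinearity after the addition
```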

    The paper presented several ResNet variants: ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152. The deeper variants (ResNet-50 and beyond) use a bottleneck design with 1×1, 3×3, and 1×1 convolutions to reduce computational cost while maintaining representational capacity.
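    In the same illustrative spirit, a sketch of the bottleneck block (1×1 reduce, 3×3, 1×1 expand, with the fourfold channel expansion described in the paper); the class and argument names are again hypothetical and PyTorch is an assumed framework.

```python
import torch.nn as nn
import torch.nn.functional as F


class Bottleneck(nn.Module):
    """Bottleneck residual block: 1x1 reduce, 3x3, 1x1 expand, plus shortcut."""

    expansion = 4  # output channels are 4x the bottleneck width, as in the paper

    def __init__(self, in_channels, width, stride=1):
        super().__init__()
        out_channels = width * self.expansion
        self.conv1 = nn.Conv2d(in_channels, width, kernel_size=1, bias=False)    # 1x1 reduce
        self.bn1 = nn.BatchNorm2d(width)
        self.conv2 = nn.Conv2d(width, width, kernel_size=3,
                               stride=stride, padding=1, bias=False)              # 3x3
        self.bn2 = nn.BatchNorm2d(width)
        self.conv3 = nn.Conv2d(width, out_channels, kernel_size=1, bias=False)    # 1x1 expand
        self.bn3 = nn.BatchNorm2d(out_channels)

        # Projection shortcut when the block changes resolution or channel count.
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + self.shortcut(x))   # add shortcut, then ReLU
```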

    Results

    On ImageNet, an ensemble of residual networks achieved a 3.57% top-5 error rate on the test set, surpassing all previous approaches and winning the ILSVRC 2015 classification task. As a single model, ResNet-152 achieved 4.49% top-5 error on the validation set, substantially below the 6.67% of the 2014 winner, GoogLeNet.

    Critical evidence for the residual learning framework came from controlled comparisons: a 34-layer ResNet outperformed an 18-layer ResNet, whereas a 34-layer plain network performed worse than an 18-layer plain network. This directly demonstrated that skip connections solved the degradation problem.

    On CIFAR-10, the authors trained networks with over 1,000 layers, showing that extremely deep residual networks could still be optimized, though a 1202-layer network showed mild overfitting compared to a 110-layer variant due to the small dataset size.

    The representations learned by ResNets also transferred well to other tasks, achieving state-of-the-art results on PASCAL VOC and MS COCO object detection and segmentation benchmarks. The generality of these improvements confirmed that the benefits of residual learning extended well beyond classification to dense prediction tasks. ResNet-based feature extractors became the standard backbone for Faster R-CNN, Mask R-CNN, and Feature Pyramid Networks.

    Impact

    ResNet is one of the most cited and influential papers in deep learning. The residual connection became a fundamental building block adopted in virtually every subsequent deep architecture, including Transformers (which use residual connections around each attention and feed-forward sublayer), DenseNets, U-Nets, and modern convolutional architectures. The insight that identity mappings ease optimization in deep networks profoundly shaped both theoretical understanding and practical architecture design.

    ResNet won the 2016 Best Paper Award at CVPR. As of 2026, ResNet variants remain competitive baselines in computer vision and are among the most commonly used backbone architectures for transfer learning.

    The mathematical simplicity of the residual connection — adding the input to the output of a block — belies its profound impact. This single idea enabled training networks that were an order of magnitude deeper than previously feasible, and the principle has proven essential in architectures far removed from the original image classification context, including speech synthesis, natural language processing, and scientific computing.

    Subsequent theoretical work showed that skip connections help gradients flow through very deep networks by providing shorter paths during backpropagation, effectively mitigating the vanishing gradient problem that had long plagued deep network training. The paper has accumulated over 200,000 citations, making it one of the most referenced works in all of science.
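    A brief sketch of that argument, following the analysis in the follow-up paper on identity mappings (He et al., 2016, listed in the references) and assuming shortcuts that remain pure identities with the addition passed on unchanged: stacking residual blocks from layer $ l $ to a deeper layer $ L $ gives

    $ x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i) $

    so the chain rule yields, for a loss $ \mathcal{E} $,

    $ \frac{\partial \mathcal{E}}{\partial x_l} = \frac{\partial \mathcal{E}}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i) \right) $

    The additive 1 means that gradient information reaches every earlier layer directly from the loss, no matter how small the stacked residual terms become, which is the sense in which shortcuts provide shorter backpropagation paths.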

    Pre-trained ResNet models are available in every major deep learning framework, making them among the most accessible starting points for transfer learning in computer vision.
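    For illustration, a minimal transfer-learning setup using a pre-trained ResNet from torchvision (an assumption rather than something described in the paper; the weights API shown here exists in recent torchvision releases, older versions use pretrained=True, and the 10-class head is hypothetical):

```python
import torch.nn as nn
from torchvision import models

# Load a ResNet-50 with ImageNet pre-trained weights.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Freeze the backbone so only the new classifier head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer for a hypothetical 10-class task.
model.fc = nn.Linear(model.fc.in_features, 10)
```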

    References

    • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. Proceedings of CVPR 2016. arXiv:1512.03385.
    • Simonyan, K. & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.
    • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Identity Mappings in Deep Residual Networks. ECCV 2016. arXiv:1603.05027.