Deep Residual Learning for Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
ILSVRC 2015 winner, arXiv:1512.03385
Problem & Motivation
Problem: Deeper networks become harder to train. Vanishing/exploding gradients are largely handled by normalized initialization and batch normalization, but a degradation problem remains: accuracy saturates and then degrades as depth increases, and this is not caused by overfitting.
Observation: Plain networks beyond roughly 20–30 layers often show higher training and test error than their shallower counterparts.
Motivation: Enable very deep networks (50–152 layers) to be trained effectively.
Significance: Depth is critical for hierarchical feature learning in images.
Method – Residual Learning
Reformulated mapping: instead of fitting the desired mapping H(x) directly, the stacked layers fit the residual F(x) = H(x) − x.
Final output: H(x) = F(x) + x, realized by a shortcut connection that passes x unchanged to the output of the block.
Key idea: optimizing the residual is easier than optimizing the original, unreferenced mapping; in the extreme case, driving F(x) toward zero recovers an identity mapping, which plain stacked layers struggle to learn.
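In the paper's notation, a building block with input x and output y is written as below; F(x, {W_i}) is the residual function learned by the stacked layers, and the projection W_s (e.g., a 1×1 convolution) is applied on the shortcut only when the dimensions of x and F differ.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\begin{align}
  y &= \mathcal{F}(x, \{W_i\}) + x        && \text{identity shortcut (adds no parameters)} \\
  y &= \mathcal{F}(x, \{W_i\}) + W_s\, x  && \text{projection shortcut (only when dimensions differ)}
\end{align}
\end{document}
```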
Method – Residual Block & Architecture
Basic residual block: Conv–BN–ReLU–Conv–BN plus an identity skip connection, with a ReLU applied after the addition.
Deeper models (ResNet-50/101/152) use a bottleneck block (1×1–3×3–1×1 convolutions) to keep computation manageable.
Enables extremely deep networks (e.g., 152 layers) with lower complexity than VGG-16/19 while achieving better accuracy.
Variants: ResNet-18, 34, 50, 101, 152.
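A minimal sketch of the basic block in PyTorch, assuming the Conv–BN–ReLU–Conv–BN layout described above; this is an illustrative re-implementation, not the authors' released code, and the names (BasicBlock, downsample) are chosen for this example.

```python
import torch
import torch.nn as nn


class BasicBlock(nn.Module):
    """Conv-BN-ReLU-Conv-BN plus an identity (or projected) shortcut."""

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut (1x1 conv + BN) only when the shapes change;
        # otherwise the shortcut is a parameter-free identity.
        self.downsample = None
        if stride != 1 or in_channels != out_channels:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity   # residual addition: H(x) = F(x) + x
        return self.relu(out)  # ReLU after the addition


if __name__ == "__main__":
    block = BasicBlock(64, 128, stride=2)
    y = block(torch.randn(1, 64, 56, 56))
    print(y.shape)  # torch.Size([1, 128, 28, 28])
```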
Results – ImageNet
An ensemble of ResNets achieved 3.57% top-5 error on the ImageNet test set.
Outperformed VGG-16 and GoogLeNet.
1st place in the ILSVRC 2015 classification task.
Results – Other Benchmarks
CIFAR-10: plain nets degrade with depth, while ResNets train successfully even beyond 1000 layers.
COCO detection: ~28% relative improvement simply by switching to a ResNet backbone.
Winning entries in the ImageNet & COCO 2015 detection, localization, and segmentation tasks.
Strengths
Enables training of very deep networks (>100 layers).
Significant accuracy improvements on ImageNet, CIFAR-10, and COCO.
Simple yet powerful architecture; identity shortcuts add no extra parameters or computation.
Influence: residual connections became a foundation of modern architectures, including Transformers.
Limitations
Resource-intensive: deep variants require substantial compute and memory to train.
Diminishing returns at extreme depth: the ~1200-layer CIFAR-10 network performs worse than the 110-layer one, likely due to overfitting.
Initially designed and evaluated for computer vision tasks only.
Summary
Problem: degradation in deep plain networks.
Solution: residual learning with skip connections.
Impact: enabled training of very deep networks (152 layers).
Results: record-breaking accuracy on ImageNet and COCO.
Strengths: scalable, effective, foundational.
Limitations: compute-heavy, diminishing returns at extreme depth.
Discussion
Inspired later models: DenseNet, ResNeXt, EfficientNet.
Residual connections are used in Transformers for both NLP and vision.
Residual learning is now a near-universal building block in deep architectures.