Paper overview: "Deep Residual Learning for Image Recognition"

A talk given at the Computational Neuroscience Seminar at the University of Tartu. We discuss the idea behind the deep neural network that won the ILSVRC (ImageNet) 2015 and COCO 2015 image competitions.


  1. 1. Article overview by Ilya Kuzovkin. “Deep Residual Learning for Image Recognition” by K. He, X. Zhang, S. Ren and J. Sun, Microsoft Research. Computational Neuroscience Seminar, University of Tartu, 2016. ILSVRC 2015 and MS COCO 2015 winner.
  2. 2. THE IDEA
  3. 3. 1000 classes
  4. 4. 2012 8 layers 15.31% error
  5. 5. 2012 8 layers 15.31% error 2013 9 layers, 2x params 11.74% error
  6. 6. 2012 8 layers 15.31% error 2013 2014 9 layers, 2x params 11.74% error 19 layers 7.41% error
  7. 7. 2012 8 layers 15.31% error 2013 2014 2015 9 layers, 2x params 11.74% error 19 layers 7.41% error ?
  8. 8. 2012 8 layers 15.31% error 2013 2014 2015 9 layers, 2x params 11.74% error 19 layers 7.41% error Is learning better networks as easy as stacking more layers? ?
  9. 9. ? 2012 8 layers 15.31% error 2013 2014 2015 9 layers, 2x params 11.74% error 19 layers 7.41% error Is learning better networks as easy as stacking more layers? Vanishing / exploding gradients
  10. 10. ? 2012 8 layers 15.31% error 2013 2014 2015 9 layers, 2x params 11.74% error 19 layers 7.41% error Is learning better networks as easy as stacking more layers? Vanishing / exploding gradients Normalized initialization & intermediate normalization
  11. 11. ? 2012 8 layers 15.31% error 2013 2014 2015 9 layers, 2x params 11.74% error 19 layers 7.41% error Is learning better networks as easy as stacking more layers? Vanishing / exploding gradients Normalized initialization & intermediate normalization Degradation problem
  12. 12. Degradation problem “with the network depth increasing, accuracy gets saturated”
  13. 13. Not caused by overfitting: Degradation problem “with the network depth increasing, accuracy gets saturated”
  14. 14. Conv Conv Conv Conv Trained Accuracy X% Tested
  15. 15. Conv Conv Conv Conv Trained Accuracy X% Conv Conv Conv Conv Identity Identity Identity Identity Tested Tested
  16. 16. Conv Conv Conv Conv Trained Accuracy X% Conv Conv Conv Conv Identity Identity Identity Identity Same performance Tested Tested
  17. 17. Conv Conv Conv Conv Trained Accuracy X% Conv Conv Conv Conv Identity Identity Identity Identity Same performance Conv Conv Conv Conv Conv Conv Conv Conv Trained Tested Tested Tested
  18. 18. Conv Conv Conv Conv Trained Accuracy X% Conv Conv Conv Conv Identity Identity Identity Identity Same performance Conv Conv Conv Conv Conv Conv Conv Conv Trained Worse! Tested Tested Tested
  19. 19. Conv Conv Conv Conv Trained Accuracy X% Conv Conv Conv Conv Identity Identity Identity Identity Same performance Conv Conv Conv Conv Conv Conv Conv Conv Trained Worse! Tested Tested Tested “Our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time)”
  20. 20. Conv Conv Conv Conv Trained Accuracy X% Conv Conv Conv Conv Identity Identity Identity Identity Same performance Conv Conv Conv Conv Conv Conv Conv Conv Trained Worse! Tested Tested Tested “Our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time)” “Solvers might have difficulties in approximating identity mappings by multiple nonlinear layers”
  21. 21. Conv Conv Conv Conv Trained Accuracy X% Conv Conv Conv Conv Identity Identity Identity Identity Same performance Conv Conv Conv Conv Conv Conv Conv Conv Trained Worse! Tested Tested Tested “Our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time)” “Solvers might have difficulties in approximating identity mappings by multiple nonlinear layers” Add explicit identity connections and “solvers may simply drive the weights of the multiple nonlinear layers toward zero”
  22. 22. Add explicit identity connections and “solvers may simply drive the weights of the multiple nonlinear layers toward zero”. H(x) is the true function we want to learn.
  23. 23. Add explicit identity connections and “solvers may simply drive the weights of the multiple nonlinear layers toward zero”. H(x) is the true function we want to learn. Let’s pretend we want to learn F(x) = H(x) - x instead.
  24. 24. Add explicit identity connections and “solvers may simply drive the weights of the multiple nonlinear layers toward zero”. H(x) is the true function we want to learn. Let’s pretend we want to learn F(x) = H(x) - x instead. The original function is then H(x) = F(x) + x (the resulting block is sketched in code after the transcript).
  25. 25. Network can decide how deep it needs to be…
  26. 26. Network can decide how deep it needs to be… “The identity connections introduce neither extra parameter nor computation complexity”
  27. 27. 2012 8 layers 15.31% error 2013 2014 2015 9 layers, 2x params 11.74% error 19 layers 7.41% error ?
  28. 28. 2012 8 layers 15.31% error 2013 2014 2015 9 layers, 2x params 11.74% error 19 layers 7.41% error 152 layers 3.57% error
  29. 29. EXPERIMENTS AND DETAILS
  30. 30. • Lots of 3x3 convolutional layers • VGG complexity is 19.6 billion FLOPs, the 34-layer ResNet is 3.6 billion FLOPs
  31. 31. • Lots of 3x3 convolutional layers • VGG complexity is 19.6 billion FLOPs, the 34-layer ResNet is 3.6 billion FLOPs • Batch normalization • SGD with batch size 256 • (up to) 600,000 iterations • LR 0.1 (divided by 10 when error plateaus) • Momentum 0.9 • No dropout • Weight decay 0.0001
  32. 32. • Lots of 3x3 convolutional layers • VGG complexity is 19.6 billion FLOPs, the 34-layer ResNet is 3.6 billion FLOPs • Batch normalization • SGD with batch size 256 • (up to) 600,000 iterations • LR 0.1 (divided by 10 when error plateaus) • Momentum 0.9 • No dropout • Weight decay 0.0001 • 1.28 million training images • 50,000 validation images • 100,000 test images (this training setup is sketched in code after the transcript)
  33. 33. • 34-layer ResNet has lower training error. This indicates that the degradation problem is well addressed and we manage to obtain accuracy gains from increased depth.
  34. 34. • 34-layer ResNet has lower training error. This indicates that the degradation problem is well addressed and we manage to obtain accuracy gains from increased depth. • 34-layer ResNet reduces the top-1 error by 3.5%
  35. 35. • 34-layer ResNet has lower training error. This indicates that the degradation problem is well addressed and we manage to obtain accuracy gains from increased depth. • 34-layer ResNet reduces the top-1 error by 3.5% • 18-layer ResNet converges faster, and thus ResNet eases the optimization by providing faster convergence at the early stage.
  36. 36. GOING DEEPER
  37. 37. Due to time complexity, the usual building block is replaced by the Bottleneck Block. The 50/101/152-layer ResNets are built from those blocks (a code sketch follows the transcript).
  38. 38. ANALYSIS ON CIFAR-10
  39. 39. ImageNet Classification 2015: 1st place, 3.57% error. ImageNet Object Detection 2015: 1st place, 194/200 categories. ImageNet Object Localization 2015: 1st place, 9.02% error. COCO Detection 2015: 1st place, 37.3%. COCO Segmentation 2015: 1st place, 28.2%. http://research.microsoft.com/en-us/um/people/kahe/ilsvrc15/ilsvrc2015_deep_residual_learning_kaiminghe.pdf
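
A few illustrative code sketches for the slides referenced above. First, slides 22–26: a minimal PyTorch sketch of the basic residual block (my own illustration, not the authors' original implementation), for the simple case where input and output dimensions match so the shortcut is a parameter-free identity. The two stacked 3x3 convolutions learn the residual F(x), and the block outputs H(x) = F(x) + x.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Two 3x3 conv layers learn the residual F(x); the identity shortcut
    adds x back, so the block outputs H(x) = F(x) + x at no extra cost."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))          # F(x)
        return F.relu(out + x)                   # H(x) = F(x) + x

# e.g. a 56x56 feature map with 64 channels keeps its shape through the block
y = BasicBlock(64)(torch.randn(1, 64, 56, 56))
```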
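
Second, the training recipe from slide 31 as a hedged sketch. Only the hyperparameters (SGD with batch size 256, LR 0.1 divided by 10 on plateaus, momentum 0.9, weight decay 0.0001, no dropout) come from the slide; the tiny placeholder model and the random batch are stand-ins for a real ResNet and the ImageNet data loader.

```python
import torch
import torch.nn as nn

# Placeholder model and data; in the paper these are a ResNet and ImageNet.
model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=7, stride=4), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1000))
images = torch.randn(256, 3, 224, 224)            # one mini-batch of 256 images
labels = torch.randint(0, 1000, (256,))

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
# "divided by 10 when error plateaus"
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)
criterion = nn.CrossEntropyLoss()

# One of the (up to) 600,000 SGD iterations; no dropout is used.
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
scheduler.step(loss.item())   # in practice, stepped on the validation error
```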
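
Finally, slide 37's bottleneck block, again as an illustrative sketch rather than the authors' code: a 1x1 convolution reduces the channel count (e.g. 256 to 64), the 3x3 convolution operates on the reduced volume, and a second 1x1 convolution restores the width before the identity shortcut is added. The 50/101/152-layer ResNets are stacks of blocks of this shape, which keeps the time complexity manageable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, plus the identity shortcut."""

    def __init__(self, channels, reduced):           # e.g. channels=256, reduced=64
        super().__init__()
        self.reduce = nn.Conv2d(channels, reduced, kernel_size=1, bias=False)
        self.conv = nn.Conv2d(reduced, reduced, kernel_size=3, padding=1, bias=False)
        self.expand = nn.Conv2d(reduced, channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(reduced)
        self.bn2 = nn.BatchNorm2d(reduced)
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.reduce(x)))        # shrink the channel dimension
        out = F.relu(self.bn2(self.conv(out)))        # cheap 3x3 convolution
        out = self.bn3(self.expand(out))              # restore the channel dimension
        return F.relu(out + x)                        # residual addition as before

y = Bottleneck(256, 64)(torch.randn(1, 256, 56, 56))
```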
