Min-Seo Kim
Network Science Lab
Dept. of Artificial Intelligence
The Catholic University of Korea
E-mail: kms39273@naver.com
1
ResNet
• Deep convolutional neural networks have led to a series of breakthroughs for image classification.
• The depth of representations is of central importance for many visual recognition tasks.
• Is learning better networks as easy as stacking more layers?
• An obstacle to answering this question was the notorious problem of vanishing/exploding gradients.
• The degradation problem has been exposed: it is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error.
• In this paper, we address the degradation problem by introducing a deep residual learning framework.
Introduction
2
ResNet
• Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another
mapping of F(x) := H(x)−x. The original mapping is recast into F(x)+x.
• Identity shortcut connections (used here for residual learning) add neither extra parameters nor computational complexity.
• Deep residual nets are easy to optimize, but the counterpart “plain” nets (that simply stack layers) exhibit
higher training error when the depth increases.
• Deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results
substantially better than previous networks.
Residual learning framework
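Below is a minimal sketch of the residual building block described above, written in PyTorch purely for illustration; the class name BasicBlock, the channel count, and the BatchNorm/ReLU placement are assumptions of this sketch, not details taken from the slide.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Minimal residual block: the stacked layers fit F(x); the block outputs F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        # Two 3x3 convolutions form the residual function F(x, {Wi}).
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        # Identity shortcut: a plain element-wise addition, no extra parameters.
        return F.relu(out + x)

# Example: input and output shapes match, so the shortcut needs no projection.
y = BasicBlock(64)(torch.randn(1, 64, 56, 56))
print(y.shape)   # torch.Size([1, 64, 56, 56])
```

Because the shortcut is just an addition of the input, this block has exactly the same parameter count as the corresponding two-layer plain stack.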
3
ResNet
• We learn not H(x) but the residual F(x) := H(x) − x.
• Although both forms should be able to asymptotically approximate the desired functions (as hypothesized),
the ease of learning might be different.
• This reformulation is motivated by the counterintuitive phenomena about the degradation problem.
• We show by experiments that the learned residual functions in general have small responses, suggesting that
identity mappings provide reasonable preconditioning.
Residual Learning
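As a small, hypothetical illustration of the preconditioning point (not taken from the slides, PyTorch assumed for convenience): when the residual branch's weights are zero, the block computes the identity mapping exactly, so the solver only has to learn a small perturbation F(x) around identity rather than fit H(x) from scratch.

```python
import torch
import torch.nn as nn

# Hypothetical check: a residual branch with zero weights makes the whole
# block compute the identity mapping, so training only needs to push a small
# residual F(x) away from zero.
conv = nn.Conv2d(16, 16, kernel_size=3, padding=1, bias=False)
nn.init.zeros_(conv.weight)        # residual function F(x) = 0

x = torch.randn(2, 16, 8, 8)
y = conv(x) + x                    # F(x) + x == x when F is zero
print(torch.allclose(y, x))        # True: the block starts at (or near) identity
```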
4
ResNet
• A building block is defined as y = F(x, {Wi}) + x (Eq. 1), where F(x, {Wi}) represents the residual mapping to be learned; the identity shortcut requires x and F to have the same dimensions.
• If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection Ws by the shortcut connections to match the dimensions: y = F(x, {Wi}) + Ws x (Eq. 2), as sketched below.
Identity Mapping by Shortcuts
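A hedged PyTorch sketch of Eq. (2): when a block changes the channel count and the spatial size, a 1×1 convolution with stride 2 can play the role of the projection Ws on the shortcut. The class name DownsampleBlock and the exact layer ordering are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsampleBlock(nn.Module):
    """Residual block that halves the spatial size and changes the channel count,
    so the shortcut needs a linear projection Ws to match dimensions (Eq. 2)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # Ws: a 1x1 convolution with stride 2 acts as the projection on the shortcut.
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=2, bias=False)

    def forward(self, x):
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(out + self.proj(x))   # y = F(x, {Wi}) + Ws x

# Example: channels 64 -> 128, spatial size 56 -> 28; shortcut and branch match.
y = DownsampleBlock(64, 128)(torch.randn(1, 64, 56, 56))
print(y.shape)   # torch.Size([1, 128, 28, 28])
```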
5
ResNet
• Plain Network: inspired by the philosophy of VGG nets.
• Convolutional layers mostly have 3×3 filters.
• Downsampling is performed directly by convolutional layers with a stride of 2 (no separate pooling layers).
• The network ends with a global average pooling layer.
• A 1000-way fully-connected layer with softmax is used at the end.
• Compare with the Residual Network, which inserts identity shortcut connections into the same plain architecture.
Network Architectures (Plain Network, Residual Network)
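Below is a deliberately tiny sketch (PyTorch assumed) of the plain-network recipe from this slide: 3×3 convolutions, stride-2 convolutions instead of pooling for downsampling, global average pooling, and a 1000-way fully-connected classifier. It is far shallower than the actual 34-layer network and only shows where each design choice sits.

```python
import torch
import torch.nn as nn

def conv3x3(in_ch, out_ch, stride=1):
    # 3x3 convolution, the dominant filter size in both the plain and residual designs.
    return nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False)

class TinyPlainNet(nn.Module):
    """Illustrative plain (no-shortcut) stack: 3x3 convs, stride-2 downsampling,
    global average pooling, and a 1000-way fully-connected classifier."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            conv3x3(3, 64), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            conv3x3(64, 128, stride=2), nn.BatchNorm2d(128), nn.ReLU(inplace=True),   # downsample by stride 2
            conv3x3(128, 256, stride=2), nn.BatchNorm2d(256), nn.ReLU(inplace=True),  # downsample by stride 2
        )
        self.pool = nn.AdaptiveAvgPool2d(1)          # global average pooling
        self.fc = nn.Linear(256, num_classes)        # 1000-way fully-connected layer

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)
        return self.fc(x)

# Example forward pass on an ImageNet-sized input.
logits = TinyPlainNet()(torch.randn(1, 3, 224, 224))
print(logits.shape)   # torch.Size([1, 1000])
```

Softmax is left to the loss function (e.g. nn.CrossEntropyLoss), the usual PyTorch idiom for a softmax classifier.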
6
ResNet
• We evaluate our method on the ImageNet 2012 classification dataset [36], which consists of 1000 classes.
• We evaluate both top-1 and top-5 error rates.
Implementation
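A small helper (PyTorch assumed) showing how the reported top-1 and top-5 error rates can be computed from raw class scores; the function name topk_error and the random inputs are illustrative only.

```python
import torch

def topk_error(logits, targets, ks=(1, 5)):
    """Top-k error: fraction of samples whose true label is not among the
    k highest-scoring classes. logits: [N, num_classes], targets: [N]."""
    max_k = max(ks)
    _, pred = logits.topk(max_k, dim=1)              # indices of the max_k highest scores, [N, max_k]
    correct = pred.eq(targets.unsqueeze(1))          # [N, max_k] boolean matches
    errors = {}
    for k in ks:
        top_k_acc = correct[:, :k].any(dim=1).float().mean().item()
        errors[k] = 1.0 - top_k_acc
    return errors

# Hypothetical usage with random scores over 1000 ImageNet classes.
logits = torch.randn(8, 1000)
targets = torch.randint(0, 1000, (8,))
print(topk_error(logits, targets))   # e.g., {1: ..., 5: ...}
```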
7
ResNet
• We argue that this optimization difficulty is unlikely to be caused by vanishing gradients.
• We conjecture that the deep plain nets may have exponentially low convergence rates, which slow the reduction of the training error.
Implementation
8
ResNet
• ResNet introduced residual connections to address the vanishing gradient problem in deep networks.
• Initially, it was believed that the vanishing gradient was the primary issue in training deep networks.
• However, further observations showed that deep plain networks (without residual connections) experienced
exponentially slow convergence, which was not solely due to the vanishing gradient.
• This slow convergence hampers the reduction of the training error; it is a matter of convergence rate rather than of learning rates.
• Residual connections improve gradient propagation by directly adding the input to the output.
• As a result, residual connections speed up the convergence of deep networks and allow the training error to keep decreasing.
• In this case, the ResNet eases the optimization by providing faster convergence at the early stage.
Result
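A minimal numerical check (PyTorch autograd assumed) of the gradient-propagation point above: for y = F(x) + x the Jacobian is dF/dx + I, so the shortcut passes gradients through even when the residual branch contributes nothing.

```python
import torch

# For y = F(x) + x, dy/dx = dF/dx + I: even an extreme residual branch with
# zero weights still lets the gradient flow through the shortcut unattenuated.
x = torch.randn(4, requires_grad=True)
W = torch.zeros(4, 4)                          # "residual branch" weights, deliberately zero

y_plain = W @ x                                # plain mapping: gradient w.r.t. x is W^T (all zeros)
y_resid = W @ x + x                            # residual mapping: gradient w.r.t. x is W^T + I

g_plain = torch.autograd.grad(y_plain.sum(), x)[0]
g_resid = torch.autograd.grad(y_resid.sum(), x)[0]
print(g_plain)                                 # tensor([0., 0., 0., 0.]) -> signal blocked
print(g_resid)                                 # tensor([1., 1., 1., 1.]) -> identity term keeps it flowing
```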
