
[MIRU2018] Attention Branch Network Using the Characteristics of Global Average Pooling

6,550 views

Published on

Slides presented as a long oral at MIRU 2018. We propose the Attention Branch Network, a network that uses an attention map both to visualize the regions a CNN attends to and to improve object recognition accuracy.

Published in: Science


  1. [Title slide: "Attention Branch Network Using the Characteristics of Global Average Pooling" — author names and affiliations (marked with † and ‡) are not recoverable from the transcript.]
  2. [Figure: layer configurations of representative deep CNNs — VGG (19 layers, ILSVRC 2014), GoogLeNet (22 layers, ILSVRC 2014), and a deep ResNet built from stacked 1x1/3x3/1x1 bottleneck blocks. Reference: Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep Residual Learning for Image Recognition", arXiv 2015.]
  3. [Slide content not recoverable from the transcript; only a citation fragment ("et al.") survives.]
  4. [Figure: weights w1, w2, w3 appearing twice — apparently on the classifier connections and again weighting the final feature maps, as in a Class Activation Mapping style diagram; a sketch under that assumption follows below.]
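The diagram above appears to show the Class Activation Mapping (CAM) construction, in which the fully connected weights learned after global average pooling are reused to weight the last convolutional feature maps. A minimal sketch under that assumption (PyTorch; the function name class_activation_map, the tensor shapes, and the class count are illustrative, not taken from the slides):

```python
import torch

def class_activation_map(feature_maps, fc_weights, cls):
    # feature_maps: (K, H, W) last-conv maps f_k
    # fc_weights:   (num_classes, K) weights of the FC layer after GAP
    # CAM_c(h, w) = sum_k w_k^c * f_k(h, w)
    w = fc_weights[cls]                              # (K,) weights w1..wK for class c
    return (w[:, None, None] * feature_maps).sum(0)  # (H, W) class activation map

feature_maps = torch.randn(512, 14, 14)   # illustrative: K=512 maps of size 14x14
fc_weights = torch.randn(1000, 512)       # illustrative: 1000-class FC layer
cam = class_activation_map(feature_maps, fc_weights, cls=3)
print(cam.shape)  # torch.Size([14, 14])
```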
  5. [Manuscript excerpt: "Learning of Occlusion-Aware Attention for Pedestrian Detection" (the slide overlays several draft pages; duplicated passages and margin line numbers are omitted here).]
     The classification scores are output from the feature map f(·) with global average pooling or global max pooling. Global average pooling raises the response of the entire feature map for a specific class because it averages all pixels of the map, whereas global max pooling does not, because it uses only the maximum pixel value in the map. The response score for each class is computed as in Eq. (1):
     v_i^c = (1 / (M·N)) Σ_{m=1}^{M} Σ_{n=1}^{N} f_{m,n}^c(x_i)   (global average pooling)
     v_i^c = max_{m,n} f_{m,n}^c(x_i)                             (global max pooling)    (1)
     After the score for each class is output, the attentions for the pedestrian and occlusion regions are generated. First, the multi-channel feature map is fused into one channel; three fusion types are examined (Fig. 1(b)-(d)): 1) standard fusion, a simple summation over the feature maps; 2) softmax-weighting fusion, where each channel is weighted by its softmax score as in Eq. (2), which can mask unnecessary channels; and 3) squeeze-and-excitation (SE) block fusion, where each channel is weighted by an SE-block attention as in Squeeze-and-Excitation Networks. After fusing to one channel, the pedestrian-classification and occlusion-state attentions are combined by subtracting the occlusion attention from the pedestrian-classification attention; the result is called an attention map because it contains both positive and negative values.
     Attention_i = Σ_{c=1}^{C} f^c(x_i) · exp(v_i^c) / Σ_{j=1}^{J} exp(v_i^j)    (2)
     3.4 Perception branch: the perception branch outputs the final score using the attention map and the RoI-pooled feature map. The attention map refines the RoI-pooled features, for example by masking unnecessary background features and enhancing important locations; the refined map is the inner product of the attention map and the feature map from RoI pooling. The perception branch consists of two fully connected layers, with the same structure as conventional Fast R-CNN. A binary cross-entropy term, t log y + (1 − t) log(1 − y), also appears on the overlaid draft pages. (A minimal code sketch of Eqs. (1)-(2) follows below.)
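A minimal code sketch of the two formulas above (not the authors' implementation; it assumes PyTorch and a per-class feature map fmap of shape C×M×N, both illustrative choices):

```python
import torch
import torch.nn.functional as F

def response_scores(fmap, mode="gap"):
    # fmap: (C, M, N) per-class feature maps f^c(x_i) -> per-class score v^c_i, Eq. (1)
    if mode == "gap":
        return fmap.mean(dim=(1, 2))   # global average pooling
    return fmap.amax(dim=(1, 2))       # global max pooling

def softmax_weighted_attention(fmap):
    # Fuse the C channel maps into one map, weighting each channel f^c(x_i)
    # by softmax(v^c_i) as in Eq. (2) (softmax-weighting fusion).
    v = response_scores(fmap, mode="gap")        # (C,)
    w = F.softmax(v, dim=0)                      # (C,)
    return (w[:, None, None] * fmap).sum(dim=0)  # (M, N) attention map

fmap = torch.randn(10, 14, 14)                   # illustrative: C=10 classes, 14x14 maps
att = softmax_weighted_attention(fmap)
print(att.shape)  # torch.Size([14, 14])
```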
  6. Table 1. Classification error on the ILSVRC validation set.
     Networks          top-1 val. error   top-5 val. error
     VGGnet-GAP        33.4               12.2
     GoogLeNet-GAP     35.0               13.2
     AlexNet*-GAP      44.9               20.9
     AlexNet-GAP       51.1               26.3
     GoogLeNet         31.9               11.3
     VGGnet            31.2               11.4
     AlexNet           42.6               19.5
     NIN               41.9               19.6
     GoogLeNet-GMP     35.6               13.9

     Table 2. Localization error on the ILSVRC validation set. Backprop refers to using [23] for localization instead of CAM.
     Method                   top-1 val. error   top-5 val. error
     GoogLeNet-GAP            56.40              43.00
     VGGnet-GAP               57.20              45.14
     GoogLeNet                60.09              49.34
     AlexNet*-GAP             63.75              49.53
     AlexNet-GAP              67.19              52.16
     NIN                      65.47              54.19
     Backprop on GoogLeNet    61.31              50.55
  7. Training loss: L_all(x) = E_att(x) + E_per(x), where E_att(x) is the training loss of the attention branch and E_per(x) is the training loss of the perception branch. (A minimal sketch follows below.)
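A minimal sketch of how the combined objective could be computed (not the authors' code), assuming both branches are trained with softmax cross-entropy on the same label; the batch size and class count are illustrative:

```python
import torch
import torch.nn.functional as F

def abn_loss(att_logits, per_logits, target):
    # L_all(x) = E_att(x) + E_per(x)
    e_att = F.cross_entropy(att_logits, target)  # attention branch loss E_att(x)
    e_per = F.cross_entropy(per_logits, target)  # perception branch loss E_per(x)
    return e_att + e_per

att_logits = torch.randn(8, 100)                 # attention branch class scores
per_logits = torch.randn(8, 100)                 # perception branch class scores
target = torch.randint(0, 100, (8,))             # ground-truth labels
print(abn_loss(att_logits, per_logits, target))
```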
  8. Attention mechanism: g′(x) = (1 + M(x)) ⋅ g(x), where g(x) is the feature map from the feature extractor, M(x) is the attention map, and g′(x) is the refined feature map passed to the perception branch. (A minimal sketch follows below.)
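A minimal sketch of this attention mechanism (assumptions: PyTorch, a feature map of shape B×C×H×W, and a single-channel attention map with values in [0, 1]):

```python
import torch

def apply_attention(g, m):
    # g'(x) = (1 + M(x)) * g(x); the single-channel map broadcasts over all channels
    return (1.0 + m) * g

g = torch.randn(2, 512, 14, 14)    # feature map g(x)
m = torch.rand(2, 1, 14, 14)       # attention map M(x) in [0, 1]
g_refined = apply_attention(g, m)  # refined feature map g'(x)
print(g_refined.shape)             # torch.Size([2, 512, 14, 14])
```

The 1 + M(x) form acts as a residual attention: where M(x) is near zero the original features pass through unchanged, so the attention map can emphasize regions without erasing the extracted features.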
  9. [Excerpts from two architecture papers shown on the slide:]
     Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He, "Aggregated Residual Transformations for Deep Neural Networks", arXiv:1611.05431. ResNeXt is a simple, highly modularized architecture built by repeating a block that aggregates a set of transformations with the same topology. This exposes a new dimension, "cardinality" (the size of the set of transformations, e.g. 32 parallel 1x1-3x3-1x1 paths), as an essential factor in addition to depth and width; on ImageNet-1K, increasing cardinality improves accuracy even under fixed complexity and is more effective than going deeper or wider. ResNeXt was the foundation of the authors' 2nd-place entry in the ILSVRC 2016 classification task.
     Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger, "Densely Connected Convolutional Networks", arXiv:1608.06993. DenseNet connects each layer to every other layer in a feed-forward fashion: with L layers there are L(L+1)/2 direct connections, and each layer receives the feature maps of all preceding layers as input. This alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces the number of parameters; DenseNets improve over the state of the art on CIFAR-10, CIFAR-100, SVHN, and ImageNet while requiring less computation. Code and pre-trained models: https://github.com/liuzhuang13/DenseNet
     (A minimal ResNeXt-style block sketch follows below.)
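A minimal sketch of a ResNeXt-style block with cardinality 32, written with a grouped 3×3 convolution (the ResNeXt paper shows this form is equivalent to the 32 parallel paths in its Figure 1); the 256→128→256 channel widths follow that figure, and the rest is an illustrative PyTorch rendering, not the authors' code:

```python
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    def __init__(self, channels=256, bottleneck=128, cardinality=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, bottleneck, 1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            # grouped conv = 32 parallel paths of width 4 (128 / 32)
            nn.Conv2d(bottleneck, bottleneck, 3, padding=1,
                      groups=cardinality, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.block(x))  # identity shortcut, as in ResNet

x = torch.randn(1, 256, 56, 56)
print(ResNeXtBlock()(x).shape)  # torch.Size([1, 256, 56, 56])
```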
  10. [Figure: gating/attention diagram showing a tanh activation, an element-wise product (×), and a summation (Σ); further details are not recoverable from the transcript.]
  11. [Figure: apparently the attention mechanism applied to features s_t, showing f(s_t), g(s_t), and the refined g′(s_t); further details are not recoverable from the transcript.]
