Journal Club: Paper Introduction
Visualizing and Understanding Convolutional Networks
Zeiler, M. et al.
In Proc. European Conference on Computer Vision 2014

2015/07/03  shouno@uec.ac.jp
Bonus slides from last time (2015.03.26)
•  Introduction to hierarchical networks
– [Zeiler & Fergus 13]
invert this, the deconvnet uses transposed versions of
the same filters, but applied to the rectified maps, not
the output of the layer beneath. In practice this means
flipping each filter vertically and horizontally.
Projecting down from higher layers uses the switch
settings generated by the max pooling in the convnet
on the way up. As these switch settings are peculiar
to a given input image, the reconstruction obtained
from a single activation thus resembles a small piece
of the original input image, with structures weighted
according to their contribution toward the feature
activation. Since the model is trained discriminatively,
they implicitly show which parts of the input image
are discriminative. Note that these projections are not
samples from the model, since there is no generative
process involved.
[Figure 1 diagram labels: convnet path (Layer Below Pooled Maps → Convolutional Filtering {F} → Rectified Linear Function → Max Pooling → Pooled Maps) mirrored by the deconvnet path (Layer Above Reconstruction → Max Unpooling with Switches → Rectified Linear Function → Convolutional Filtering {F^T} → Reconstruction); bottom panel: Pooling / Unpooling with the Max Locations "Switches".]
Figure 1. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet will reconstruct an approximate version of the convnet features from the layer beneath. Bottom: An illustration of the unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet.
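To make the projection concrete, here is a minimal PyTorch sketch (not the authors' code) of one convnet/deconvnet layer pair: record the max-pooling switches on the way up, then unpool with them, rectify, and filter with the transposed weights on the way down. The layer sizes, paddings, and the dummy input are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Forward (convnet) step for one layer, keeping the pooling "switches".
conv = torch.nn.Conv2d(3, 96, kernel_size=7, stride=2, padding=1)   # example first layer
x = torch.randn(1, 3, 224, 224)                                     # dummy input image
feat = F.relu(conv(x))                                              # rectified feature maps
pooled, switches = F.max_pool2d(feat, kernel_size=3, stride=2,
                                padding=1, return_indices=True)     # switches = max locations

# Backward (deconvnet) step: unpool with the recorded switches,
# rectify, then apply the transposed version of the same filters.
unpooled = F.max_unpool2d(pooled, switches, kernel_size=3, stride=2,
                          padding=1, output_size=feat.shape[-2:])
rectified = F.relu(unpooled)
reconstruction = F.conv_transpose2d(rectified, conv.weight,
                                    stride=2, padding=1)            # back toward pixel space
print(reconstruction.shape)   # roughly the input resolution
```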
Other important differences relating to layers 1 and 2 were made following inspection of the visualizations in Fig. 6, as described in Section 4.1.
The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes). Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners + center with(out) horizontal flips).
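A rough torchvision-style sketch of this preprocessing (an assumption-laden illustration, not the authors' pipeline); `mean_image` stands in for the precomputed per-pixel mean.

```python
import torch
from torchvision import transforms

# Assumed: a precomputed 3x256x256 per-pixel mean over the resized training images
# (placeholder zeros here; the real mean would be computed offline).
mean_image = torch.zeros(3, 256, 256)

preprocess = transforms.Compose([
    transforms.Resize(256),                                   # smallest dimension -> 256
    transforms.CenterCrop(256),                               # central 256x256 region
    transforms.PILToTensor(),                                 # keep raw 0-255 pixel values
    transforms.Lambda(lambda img: img.float() - mean_image),  # per-pixel mean subtraction
    transforms.TenCrop(224),                                  # 4 corners + center, plus horizontal flips
])
# preprocess(pil_image) yields a tuple of ten 3x224x224 crops.
```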
Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of 10^-2, in conjunction with a momentum term of 0.9. We anneal the learning rate throughout training manually when the validation error plateaus. Dropout (Hinton et al., 2012) is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to 10^-2 and biases are set to 0.
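A hedged PyTorch equivalent of this recipe might look as follows; `configure_training` is a hypothetical helper, the normal-initialization reading of "weights initialized to 10^-2" is an assumption, and the mini-batch size of 128 would be set on the DataLoader.

```python
import torch
from torch import nn

def configure_training(model: nn.Module):
    """Initialize weights/biases and build the SGD optimizer described above.

    Dropout(p=0.5) before fully connected layers 6 and 7 is assumed to be
    part of `model` itself.
    """
    def init_weights(m):
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.normal_(m.weight, std=1e-2)  # one reading of "weights initialized to 10^-2"
            nn.init.zeros_(m.bias)               # biases set to 0
    model.apply(init_weights)

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    # The paper anneals the learning rate by hand when validation error plateaus;
    # ReduceLROnPlateau is only a stand-in for that manual schedule.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)
    return optimizer, scheduler
```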
Visualization of the first layer filters during training reveals that a few of them dominate, as shown in Fig. 6(a). To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of 10^-1 to this fixed radius. This is crucial, especially in the first layer of the model, where the input images are roughly in the [-128,128] range. As in (Krizhevsky et al., 2012), we produce multiple different crops and flips of each training example to boost training set size. We stopped training after 70 epochs, which took around 12 days on a single GTX580 GPU, using an implementation based on (Krizhevsky et al., 2012).
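A sketch of that renormalization rule as I read it (applied to the convolutional layers after parameter updates; not the authors' code):

```python
import torch

@torch.no_grad()
def renormalize_filters(model: torch.nn.Module, radius: float = 0.1):
    """Rescale each conv filter whose RMS value exceeds `radius` back to that radius."""
    for m in model.modules():
        if isinstance(m, torch.nn.Conv2d):
            w = m.weight                                              # (out_ch, in_ch, kH, kW)
            rms = w.pow(2).mean(dim=(1, 2, 3), keepdim=True).sqrt()   # per-filter RMS
            scale = torch.where(rms > radius, radius / rms, torch.ones_like(rms))
            w.mul_(scale)                                             # rescale only offending filters
```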
4. Convnet Visualization
Using the model described in Section 3, we now use
the deconvnet to visualize the feature activations on
the ImageNet validation set.
Feature Visualization: Fig. 2 shows feature visu-
alizations from our model once training is complete.
However, instead of showing the single strongest ac-
tivation for a given feature map, we show the top 9
activations. Projecting each separately down to pixel
Overview
•  Deep Convolutional Network
– Performs very well as a feature extractor + classifier
– But why does it perform so well?

•  It is surprisingly hard to guess "which parts of the image the network bases its decision on"

•  Visualize what the intermediate layers of a Deep Convolutional Network are "looking at"
Network architecture
•  DCNN: local feature extraction + invariance to deformation
[Diagram: Convolutions → Subsampling → Convolutions → Subsampling]
Network architecture
•  Use the DCNN of [Krizhevsky+12] to analyze its features
[Figure 3 diagram: 224x224x3 input → Layer 1: 96 filters of size 7x7, stride 2 (110x110), 3x3 max pool stride 2 + contrast norm. (55x55) → Layer 2: 256 filters of size 5x5, stride 2 (26x26), 3x3 max pool stride 2 + contrast norm. (13x13) → Layers 3, 4: 384 maps each, 3x3 filters, stride 1 (13x13) → Layer 5: 256 maps, 3x3 max pool stride 2 (6x6) → Layers 6, 7: 4096 units each → C-class softmax output.]
Figure 3. Architecture of our 8 layer convnet model. A 224 by 224 crop of an image (with 3 color planes) is presented as the input. This is convolved with 96 different 1st layer filters (red), each of size 7 by 7, using a stride of 2 in both x and y. The resulting feature maps are then: (i) passed through a rectified linear function (not shown), (ii) pooled (max within 3x3 regions, using stride 2) and (iii) contrast normalized across feature maps to give 96 different 55 by 55 element feature maps. Similar operations are repeated in layers 2,3,4,5. The last two layers are fully connected, taking features from the top convolutional layer as input in vector form (6 · 6 · 256 = 9216 dimensions). The final layer is a C-way softmax function, C being the number of classes. All filters and feature maps are square in shape.
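For reference, a hedged PyTorch sketch of this architecture; the layer 2-5 filter sizes and all paddings are assumptions chosen to reproduce the feature-map sizes quoted in the caption, and LocalResponseNorm stands in for the contrast normalization.

```python
import torch
from torch import nn

def zf_convnet(num_classes: int) -> nn.Sequential:
    """Approximate 8-layer model of Fig. 3 (paddings chosen to match the stated map sizes)."""
    return nn.Sequential(
        # Layer 1: 96 x 7x7 filters, stride 2 -> 110x110, pool + contrast norm -> 55x55
        nn.Conv2d(3, 96, kernel_size=7, stride=2, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1), nn.LocalResponseNorm(5),
        # Layer 2: 256 x 5x5 filters, stride 2 -> 26x26, pool + contrast norm -> 13x13
        nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2, padding=1), nn.LocalResponseNorm(5),
        # Layers 3-5: 384, 384, 256 maps of 3x3, stride 1 (13x13), final pool -> 6x6
        nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        # Layers 6, 7: 4096 fully connected units each, then a C-way output
        nn.Flatten(),
        nn.Linear(6 * 6 * 256, 4096), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(4096, num_classes),   # softmax applied by the cross-entropy loss
    )

# Sanity check on the paper's input size:
# zf_convnet(1000)(torch.randn(1, 3, 224, 224)).shape == (1, 1000)
```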
Transforming higher-level representations back: DeConvNet
•  Position recovery via the Max Locations Switches

•  Rectification to positive values via ReLU

•  Inverse filtering via the transposed filters F^T
Fig. 1: Schematic of the DeConvNet
Enjoy the visualization results (1)
[Paper-page residue: Figure 2 panels for Layer 1 and Layer 2; Figure 4 (evolution of randomly chosen features at epochs 1, 2, 5, 10, 20, 30, 40, 64); Figure 5 (vertical translation, scale, and rotation invariance analysis for five example classes).]
Fig. 2: Visualization of Layers 1 and 2
Enjoy the visualization results (2)
[Paper-page residue, same as above; Figure 2 panel for Layer 3.]
Fig. 2: Visualization of Layer 3
Enjoy the visualization results (3)
[Paper-page residue, same as above; Figure 2 panel for Layer 4.]
Figure 2. Visualization of features in a fully trained model. For layers 2-5 we show the top 9 activations in a random subset
Changes in the extracted features as training progresses
Fig. 4: Reconstructions from (randomly selected) feature maps at epochs 1, 2, 5, 10, 20, 30, 40, 64
Choosing network parameters (1)
•  Are settings such as kernel size appropriate?
[Krizhevsky+12]: quite a few filters that look unusable
Stride: 4
Filter size: 11x11
Tuned version:
Stride: 2
Filter size: 7x7
Choosing network parameters (2)
•  Are settings such as kernel size appropriate at Layer 2?
[Krizhevsky+12] vs. the tuned version: the aliasing artifacts seem to have disappeared
Considerations on occlusion
Figure 6. (a): 1st layer features without feature scale clipping. Note that one feature dominates. (b): 1st layer features
from (Krizhevsky et al., 2012). (c): Our 1st layer features. The smaller stride (2 vs 4) and filter size (7x7 vs 11x11)
results in more distinctive features and fewer “dead” features. (d): Visualizations of 2nd layer features from (Krizhevsky
et al., 2012). (e): Visualizations of our 2nd layer features. These are cleaner, with no aliasing artifacts that are visible in
(d).
[Figure 7 panels for three examples (true labels: Pomeranian, Car Wheel, Afghan Hound): (a) Input Image, (b) Layer 5, strongest feature map, (c) Layer 5, strongest feature map projections, (d) Classifier, probability of correct class, (e) Classifier, most probable class.]
Figure 7. Three test examples where we systematically cover up different portions of the scene with a gray square (1st
Activation as a function of the occluded location
Visualizing the strongest-activation region
Correct-class probability as a function of the occluded location
The classifier turns out to be rather sloppy
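A minimal sketch of the occlusion-sensitivity experiment (my reconstruction, not the authors' code): slide a gray square across the input and record the correct-class probability at each location. Patch size, stride, and fill value are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def occlusion_map(model, image, true_class, patch=64, stride=16, fill=0.0):
    """Probability of `true_class` as a gray patch is slid over `image` (3xHxW)."""
    model.eval()
    _, h, w = image.shape
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heat = torch.zeros(rows, cols)
    for i in range(rows):
        for j in range(cols):
            occluded = image.clone()
            y, x = i * stride, j * stride
            occluded[:, y:y + patch, x:x + patch] = fill      # 0 = gray for mean-subtracted inputs
            probs = F.softmax(model(occluded.unsqueeze(0)), dim=1)
            heat[i, j] = probs[0, true_class]                 # low value = evidence was covered
    return heat
```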
Measuring classification performance
(not much to do with visualization)
[Figure 5 panels: rows for Shift, Mag., Rot.; columns for Layer 1, Layer 7, and the Output]
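One of these curves, P(true class) under vertical translation, could be measured with a sketch like the one below (model, image, and class index are supplied by the caller; wrap-around shifting is a simplification):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def translation_curve(model, image, true_class, shifts=range(-60, 61, 10)):
    """P(true class) as the 3xHxW input is translated vertically by a number of pixels."""
    model.eval()
    probs = []
    for dy in shifts:
        shifted = torch.roll(image, shifts=dy, dims=1)   # vertical shift (wraps around the edge)
        p = F.softmax(model(shifted.unsqueeze(0)), dim=1)
        probs.append(p[0, true_class].item())
    return list(shifts), probs
```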
  
Correspondence Analysis
(this part, too, has little to do with visualization)
•  Measuring the (apparent) discrepancy between occluded and non-occluded images
Varying ImageNet Model Sizes: In Table 3, we
first explore the architecture of (Krizhevsky et al.,
2012) by adjusting the size of layers, or removing
them entirely. In each case, the model is trained from
Figure 8. Images used for correspondence experiments. Col 1: Original image. Col 2,3,4: Occlusion of the right eye, left eye, and nose respectively. Other columns show examples of random occlusions.

Occlusion Location    Mean Feature Sign Change (Layer 5)    Mean Feature Sign Change (Layer 7)
Right Eye             0.067 ± 0.007                         0.069 ± 0.015
Left Eye              0.069 ± 0.007                         0.068 ± 0.013
Nose                  0.079 ± 0.017                         0.069 ± 0.011
Random                0.107 ± 0.017                         0.073 ± 0.014

Table 1. Measure of correspondence for different object parts in 5 different dog images. The lower scores for the eyes and nose (compared to random object parts) show the model implicitly establishing some form of correspondence of parts at layer 5 in the model. At layer 7, the scores are more similar, perhaps due to upper layers trying to discriminate between the different breeds of dog.
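The "Mean Feature Sign Change" statistic is not spelled out on the slide; the sketch below is a simplified reading of the idea (consistency of the sign pattern of the occlusion-induced feature change across images), not the paper's exact formula.

```python
import torch

@torch.no_grad()
def mean_sign_change(features, occluded_features):
    """Rough correspondence score for N images with the same part occluded.

    `features` and `occluded_features` are (N, D) tensors of layer activations
    for the original and occluded versions; lower = more consistent response.
    """
    eps = torch.sign(features - occluded_features)   # sign of the feature change per image
    n = eps.shape[0]
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += (eps[i] != eps[j]).float().mean().item()   # normalized Hamming distance
                pairs += 1
    return total / pairs
```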
5. Experiments
5.1. ImageNet 2012
This dataset consists of 1.3M/50k/100k train-
ing/validation/test examples, spread over 1000 cate-
gories. Table 2 shows our results on this dataset.
Numerically, the result is rather marginal…
When a specific part is occluded, the representation shifts less than for random occlusion (or so the claim goes)
Aiming for state of the art
•  With careful design, and by combining models, performance improves further
ing set). We note that this error is almost half that of the top non-convnet entry in the ImageNet 2012 classification challenge, which obtained 26.2% error (Gunji et al., 2012).

Error %                                                                  Val Top-1   Val Top-5   Test Top-5
(Gunji et al., 2012)                                                     -           -           26.2
(Krizhevsky et al., 2012), 1 convnet                                     40.7        18.2
(Krizhevsky et al., 2012), 5 convnets                                    38.1        16.4        16.4
(Krizhevsky et al., 2012)*, 1 convnet                                    39.0        16.6
(Krizhevsky et al., 2012)*, 7 convnets                                   36.7        15.4        15.3
Our replication of (Krizhevsky et al., 2012), 1 convnet                  40.5        18.1
1 convnet as per Fig. 3                                                  38.4        16.5
5 convnets as per Fig. 3 – (a)                                           36.7        15.3        15.3
1 convnet as per Fig. 3 but with layers 3,4,5: 512,1024,512 maps – (b)   37.5        16.0        16.1
6 convnets, (a) & (b) combined                                           36.0        14.7        14.8

Table 2. ImageNet 2012 classification error rates. The * indicates models that were trained on both ImageNet 2011 and 2012 training sets.
Trying various things (1)
•  By this point it no longer has much to do with visualization
Error %                                                             Train Top-1   Val Top-1   Val Top-5
Our replication of (Krizhevsky et al., 2012), 1 convnet             35.1          40.5        18.1
Removed layers 3,4                                                  41.8          45.4        22.1
Removed layer 7                                                     27.4          40.0        18.4
Removed layers 6,7                                                  27.4          44.8        22.4
Removed layers 3,4,6,7                                              71.1          71.3        50.1
Adjust layers 6,7: 2048 units                                       40.3          41.7        18.8
Adjust layers 6,7: 8192 units                                       26.8          40.0        18.1
Our Model (as per Fig. 3)                                           33.1          38.4        16.5
Adjust layers 6,7: 2048 units                                       38.2          40.2        17.6
Adjust layers 6,7: 8192 units                                       22.0          38.8        17.0
Adjust layers 3,4,5: 512,1024,512 maps                              18.8          37.5        16.0
Adjust layers 6,7: 8192 units and layers 3,4,5: 512,1024,512 maps   10.0          38.3        16.9

Table 3. ImageNet 2012 classification error rates with various architectural changes to the model of (Krizhevsky et al., 2012) and our model (see Fig. 3).
Trying various things (2)
•  Does it also work on Caltech-101/256 and PASCAL 2012?
→ Probably yes
[Figure 9 plot: Accuracy % vs. training images per class, comparing Our Model, Bo et al., and Sohn et al.]
Figure 9. Caltech-256 classification performance as the number of training images per class is varied. Using only 6 training examples per class with our pre-trained feature extractor, we surpass best reported result by (Bo et al., 2013).
softmax classifier on top (for the appropriate number
of classes) using the training images of the new dataset.
Since the softmax contains relatively few parameters,
it can be trained quickly from a relatively small num-
ber of examples, as is the case for certain datasets.
The classifiers used by our model (a softmax) and
other approaches (typically a linear SVM) are of simi-
lar complexity, thus the experiments compare our fea-
ture representation, learned from ImageNet, with the
hand-crafted features used by other methods. It is im-
portant to note that both our feature representation
and the hand-crafted features are designed using im-
ages beyond the Caltech and PASCAL training sets.
For example, the hyper-parameters in HOG descrip-
tors were determined through systematic experiments
on a pedestrian dataset (Dalal & Triggs, 2005). We
also try a second strategy of training a model from
scratch, i.e. resetting layers 1-7 to random values and
train them, as well as the softmax, on the training
images of the dataset.
One complication is that some of the Caltech datasets
have some images that are also in the ImageNet train-
ing data. Using normalized correlation, we identified
these few “overlap” images²
and removed them from
our Imagenet training set and then retrained our Ima-
genet models, so avoiding the possibility of train/test
contamination.
Caltech-101: We follow the procedure of (Fei-fei et al., 2006) and randomly select 15 or 30 images per class for training and test on up to 50 images per class, reporting the average of the per-class accuracies in Table 4, using 5 train/test folds. Training took 17 minutes for 30 images/class. The pre-trained model beats the best reported result for 30 images/class from (Bo et al., 2013) by 2.2%. The convnet model trained from scratch however does terribly, only achieving 46.5%.
² For Caltech-101, we found 44 images in common (out of 9,144 total images), with a maximum overlap of 10 for any given class. For Caltech-256, we found 243 images in common (out of 30,607 total images), with a maximum overlap of 18 for any given class.
# Train                       Acc % 15/class   Acc % 30/class
(Bo et al., 2013)             -                81.4 ± 0.33
(Jianchao et al., 2009)       73.2             84.3
Non-pretrained convnet        22.8 ± 1.5       46.5 ± 1.7
ImageNet-pretrained convnet   83.8 ± 0.5       86.5 ± 0.5

Table 4. Caltech-101 classification accuracy for our convnet models, against two leading alternate approaches.
Caltech-256: We follow the procedure of (Griffin
et al., 2006), selecting 15, 30, 45, or 60 training im-
ages per class, reporting the average of the per-class
accuracies in Table 5. Our ImageNet-pretrained model
beats the current state-of-the-art results obtained by
Bo et al. (Bo et al., 2013) by a significant margin:
74.2% vs 55.2% for 60 training images/class. However,
as with Caltech-101, the model trained from scratch
does poorly. In Fig. 9, we explore the “one-shot learn-
ing” (Fei-fei et al., 2006) regime. With our pre-trained
model, just 6 Caltech-256 training images are needed
to beat the leading method using 10 times as many im-
ages. This shows the power of the ImageNet feature
extractor.
# Train               Acc % 15/class   Acc % 30/class   Acc % 45/class   Acc % 60/class
(Sohn et al., 2011)   35.1             42.1             45.7             47.9
(Bo et al., 2013)     40.5 ± 0.4       48.0 ± 0.2       51.9 ± 0.2       55.2 ± 0.3
Non-pretr.            9.0 ± 1.4        22.5 ± 0.7       31.2 ± 0.5       38.8 ± 1.4
ImageNet-pretr.       65.7 ± 0.2       70.6 ± 0.2       72.7 ± 0.4       74.2 ± 0.3

Table 5. Caltech-256 classification accuracies.
PASCAL 2012: We used the standard training and
validation images to train a 20-way softmax on top of
the ImageNet-pretrained convnet. This is not ideal, as
PASCAL images can contain multiple objects and our
model just provides a single exclusive prediction for
each image. Table 6 shows the results on the test set.
The PASCAL and ImageNet images are quite different
in nature, the former being full scenes unlike the
latter. This may explain our mean performance being
3.2% lower than the leading (Yan et al., 2012) result,
however we do beat them on 5 classes, sometimes by
large margins.
Acc %      [A]    [B]    Ours      Acc %        [A]    [B]    Ours
Airplane   92.0   97.3   96.0      Dining tab   63.2   77.8   67.7
Bicycle    74.2   84.2   77.1      Dog          68.9   83.0   87.8
Bird       73.0   80.8   88.4      Horse        78.2   87.5   86.0
Boat       77.5   85.3   85.5      Motorbike    81.0   90.1   85.1
Bottle     54.3   60.8   55.8      Person       91.6   95.0   90.9
Bus        85.2   89.9   85.8      Potted pl    55.9   57.8   52.2
Car        81.9   86.8   78.6      Sheep        69.4   79.2   83.6
Cat        76.4   89.3   91.2      Sofa         65.4   73.4   61.1
Chair      65.2   75.4   65.0      Train        86.7   94.5   91.8
Cow        63.2   77.8   74.4      Tv           77.4   80.7   76.1
Mean       74.3   82.2   79.0      # won        0      15     5
Table 6. PASCAL 2012 classification results, comparing
our Imagenet-pretrained convnet against the leading two
Caltech-101
Caltech-256
PASCAL 2012
Trained with 6 images/class, Caltech-256
Trying various things (3)
•  Apparently this is where evaluation with an SVM first appears…
5.3. Feature Analysis
We explore how discriminative the features in each
layer of our Imagenet-pretrained model are. We do this
by varying the number of layers retained from the Ima-
geNet model and place either a linear SVM or softmax
classifier on top. Table 7 shows results on Caltech-
101 and Caltech-256. For both datasets, a steady im-
provement can be seen as we ascend the model, with
best results being obtained by using all layers. This
supports the premise that as the feature hierarchies
become deeper, they learn increasingly powerful fea-
tures.
              Cal-101 (30/class)   Cal-256 (60/class)
SVM (1)       44.8 ± 0.7           24.6 ± 0.4
SVM (2)       66.2 ± 0.5           39.6 ± 0.3
SVM (3)       72.3 ± 0.4           46.0 ± 0.3
SVM (4)       76.6 ± 0.4           51.3 ± 0.1
SVM (5)       86.2 ± 0.8           65.6 ± 0.3
SVM (7)       85.5 ± 0.4           71.7 ± 0.2
Softmax (5)   82.9 ± 0.4           65.7 ± 0.5
Softmax (7)   85.4 ± 0.4           72.6 ± 0.1

Table 7. Analysis of the discriminative information contained in each layer of feature maps within our ImageNet-pretrained convnet. We train either a linear SVM or softmax on features from different layers (as indicated in brackets) from the convnet. Higher layers generally produce more discriminative features.
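A minimal sketch of this layer-wise evaluation (my own illustration, not the authors' setup): fit a linear SVM on frozen features taken from one layer of the pretrained convnet. The feature and label arrays here are random stand-ins for the real Caltech features.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Stand-in for features extracted at some layer of the ImageNet-pretrained convnet
# (e.g. layer 5 pooled maps flattened to 9216-d vectors), one row per Caltech image.
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 9216))   # hypothetical layer-5 features
labels = rng.integers(0, 10, size=300)    # hypothetical class labels

clf = LinearSVC(C=1.0, max_iter=5000)     # linear SVM on frozen convnet features
scores = cross_val_score(clf, features, labels, cv=5)
print(f"accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```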
Summary
•  Visualization is important for understanding and designing DCNNs
– As an example, [Krizhevsky+12] was improved using it
– The improved DCNN was applied to other databases and shown to work sufficiently well

More Related Content

What's hot

WBOIT Final Version
WBOIT Final VersionWBOIT Final Version
WBOIT Final Version
Brock Stoops
 
Double transform contoor extraction
Double transform contoor extractionDouble transform contoor extraction
Double transform contoor extraction
arteimi
 
改进的固定点图像复原算法_英文_阎雪飞
改进的固定点图像复原算法_英文_阎雪飞改进的固定点图像复原算法_英文_阎雪飞
改进的固定点图像复原算法_英文_阎雪飞
alen yan
 

What's hot (20)

3 intensity transformations and spatial filtering slides
3 intensity transformations and spatial filtering slides3 intensity transformations and spatial filtering slides
3 intensity transformations and spatial filtering slides
 
2021 05-04-u2-net
2021 05-04-u2-net2021 05-04-u2-net
2021 05-04-u2-net
 
WBOIT Final Version
WBOIT Final VersionWBOIT Final Version
WBOIT Final Version
 
Kernel Estimation of Videodeblurringalgorithm and Motion Compensation of Resi...
Kernel Estimation of Videodeblurringalgorithm and Motion Compensation of Resi...Kernel Estimation of Videodeblurringalgorithm and Motion Compensation of Resi...
Kernel Estimation of Videodeblurringalgorithm and Motion Compensation of Resi...
 
Interferogram Filtering Using Gaussians Scale Mixtures in Steerable Wavelet D...
Interferogram Filtering Using Gaussians Scale Mixtures in Steerable Wavelet D...Interferogram Filtering Using Gaussians Scale Mixtures in Steerable Wavelet D...
Interferogram Filtering Using Gaussians Scale Mixtures in Steerable Wavelet D...
 
Anatomy of a Texture Fetch
Anatomy of a Texture FetchAnatomy of a Texture Fetch
Anatomy of a Texture Fetch
 
Performance Analysis of Image Enhancement Using Dual-Tree Complex Wavelet Tra...
Performance Analysis of Image Enhancement Using Dual-Tree Complex Wavelet Tra...Performance Analysis of Image Enhancement Using Dual-Tree Complex Wavelet Tra...
Performance Analysis of Image Enhancement Using Dual-Tree Complex Wavelet Tra...
 
LAPLACE TRANSFORM SUITABILITY FOR IMAGE PROCESSING
LAPLACE TRANSFORM SUITABILITY FOR IMAGE PROCESSINGLAPLACE TRANSFORM SUITABILITY FOR IMAGE PROCESSING
LAPLACE TRANSFORM SUITABILITY FOR IMAGE PROCESSING
 
Lect 03 - first portion
Lect 03 - first portionLect 03 - first portion
Lect 03 - first portion
 
Matlab Implementation of Baseline JPEG Image Compression Using Hardware Optim...
Matlab Implementation of Baseline JPEG Image Compression Using Hardware Optim...Matlab Implementation of Baseline JPEG Image Compression Using Hardware Optim...
Matlab Implementation of Baseline JPEG Image Compression Using Hardware Optim...
 
Network Deconvolution review [cdm]
Network Deconvolution review [cdm]Network Deconvolution review [cdm]
Network Deconvolution review [cdm]
 
Double transform contoor extraction
Double transform contoor extractionDouble transform contoor extraction
Double transform contoor extraction
 
Multimedia lossy compression algorithms
Multimedia lossy compression algorithmsMultimedia lossy compression algorithms
Multimedia lossy compression algorithms
 
DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I...
DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I...DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I...
DeepLab V3+: Encoder-Decoder with Atrous Separable Convolution for Semantic I...
 
Structure from motion
Structure from motionStructure from motion
Structure from motion
 
PERFORMANCE EVALUATION OF DIFFERENT TECHNIQUES FOR TEXTURE CLASSIFICATION
PERFORMANCE EVALUATION OF DIFFERENT TECHNIQUES FOR TEXTURE CLASSIFICATION PERFORMANCE EVALUATION OF DIFFERENT TECHNIQUES FOR TEXTURE CLASSIFICATION
PERFORMANCE EVALUATION OF DIFFERENT TECHNIQUES FOR TEXTURE CLASSIFICATION
 
改进的固定点图像复原算法_英文_阎雪飞
改进的固定点图像复原算法_英文_阎雪飞改进的固定点图像复原算法_英文_阎雪飞
改进的固定点图像复原算法_英文_阎雪飞
 
2020 11 4_bag_of_tricks
2020 11 4_bag_of_tricks2020 11 4_bag_of_tricks
2020 11 4_bag_of_tricks
 
A DIGITAL COLOR IMAGE WATERMARKING SYSTEM USING BLIND SOURCE SEPARATION
A DIGITAL COLOR IMAGE WATERMARKING SYSTEM USING BLIND SOURCE SEPARATIONA DIGITAL COLOR IMAGE WATERMARKING SYSTEM USING BLIND SOURCE SEPARATION
A DIGITAL COLOR IMAGE WATERMARKING SYSTEM USING BLIND SOURCE SEPARATION
 
Image enhancement
Image enhancementImage enhancement
Image enhancement
 

Viewers also liked

20141003.journal club
20141003.journal club20141003.journal club
20141003.journal club
Hayaru SHOUNO
 
20130925.deeplearning
20130925.deeplearning20130925.deeplearning
20130925.deeplearning
Hayaru SHOUNO
 

Viewers also liked (15)

20150326.journal club
20150326.journal club20150326.journal club
20150326.journal club
 
20141208.名大セミナー
20141208.名大セミナー20141208.名大セミナー
20141208.名大セミナー
 
20140726.西野研セミナー
20140726.西野研セミナー20140726.西野研セミナー
20140726.西野研セミナー
 
20150803.山口大学講演
20150803.山口大学講演20150803.山口大学講演
20150803.山口大学講演
 
20140705.西野研セミナー
20140705.西野研セミナー20140705.西野研セミナー
20140705.西野研セミナー
 
20150803.山口大学集中講義
20150803.山口大学集中講義20150803.山口大学集中講義
20150803.山口大学集中講義
 
20160825 IEICE SIP研究会 講演
20160825 IEICE SIP研究会 講演20160825 IEICE SIP研究会 講演
20160825 IEICE SIP研究会 講演
 
20160329.dnn講演
20160329.dnn講演20160329.dnn講演
20160329.dnn講演
 
20130722
2013072220130722
20130722
 
20141003.journal club
20141003.journal club20141003.journal club
20141003.journal club
 
ベイズ Chow-Liu アルゴリズム
ベイズ Chow-Liu アルゴリズムベイズ Chow-Liu アルゴリズム
ベイズ Chow-Liu アルゴリズム
 
20141204.journal club
20141204.journal club20141204.journal club
20141204.journal club
 
20130925.deeplearning
20130925.deeplearning20130925.deeplearning
20130925.deeplearning
 
量子アニーリングを用いたクラスタ分析
量子アニーリングを用いたクラスタ分析量子アニーリングを用いたクラスタ分析
量子アニーリングを用いたクラスタ分析
 
ものまね鳥を愛でる 結合子論理と計算
ものまね鳥を愛でる 結合子論理と計算ものまね鳥を愛でる 結合子論理と計算
ものまね鳥を愛でる 結合子論理と計算
 

Similar to 20150703.journal club

Visualizing and understanding convolutional networks(2014)
Visualizing and understanding convolutional networks(2014)Visualizing and understanding convolutional networks(2014)
Visualizing and understanding convolutional networks(2014)
WoochulShin10
 
Automatic Detection of Window Regions in Indoor Point Clouds Using R-CNN
Automatic Detection of Window Regions in Indoor Point Clouds Using R-CNNAutomatic Detection of Window Regions in Indoor Point Clouds Using R-CNN
Automatic Detection of Window Regions in Indoor Point Clouds Using R-CNN
Zihao(Gerald) Zhang
 
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITIONVERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
Willy Marroquin (WillyDevNET)
 
nips report
nips reportnips report
nips report
?? ?
 

Similar to 20150703.journal club (20)

Visualizing and Understanding Convolutional Networks
Visualizing and Understanding Convolutional NetworksVisualizing and Understanding Convolutional Networks
Visualizing and Understanding Convolutional Networks
 
Overview of Convolutional Neural Networks
Overview of Convolutional Neural NetworksOverview of Convolutional Neural Networks
Overview of Convolutional Neural Networks
 
Road Segmentation from satellites images
Road Segmentation from satellites imagesRoad Segmentation from satellites images
Road Segmentation from satellites images
 
stylegan.pdf
stylegan.pdfstylegan.pdf
stylegan.pdf
 
Visualizing and understanding convolutional networks(2014)
Visualizing and understanding convolutional networks(2014)Visualizing and understanding convolutional networks(2014)
Visualizing and understanding convolutional networks(2014)
 
Human Head Counting and Detection using Convnets
Human Head Counting and Detection using ConvnetsHuman Head Counting and Detection using Convnets
Human Head Counting and Detection using Convnets
 
Deep learning for image video processing
Deep learning for image video processingDeep learning for image video processing
Deep learning for image video processing
 
2022-01-17-Rethinking_Bisenet.pptx
2022-01-17-Rethinking_Bisenet.pptx2022-01-17-Rethinking_Bisenet.pptx
2022-01-17-Rethinking_Bisenet.pptx
 
Conditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN DecodersConditional Image Generation with PixelCNN Decoders
Conditional Image Generation with PixelCNN Decoders
 
Image Classification using deep learning
Image Classification using deep learning Image Classification using deep learning
Image Classification using deep learning
 
Automatic Detection of Window Regions in Indoor Point Clouds Using R-CNN
Automatic Detection of Window Regions in Indoor Point Clouds Using R-CNNAutomatic Detection of Window Regions in Indoor Point Clouds Using R-CNN
Automatic Detection of Window Regions in Indoor Point Clouds Using R-CNN
 
1409.1556.pdf
1409.1556.pdf1409.1556.pdf
1409.1556.pdf
 
Decomposing image generation into layout priction and conditional synthesis
Decomposing image generation into layout priction and conditional synthesisDecomposing image generation into layout priction and conditional synthesis
Decomposing image generation into layout priction and conditional synthesis
 
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITIONVERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION
 
Geographic Information System unit 4
Geographic Information System   unit 4Geographic Information System   unit 4
Geographic Information System unit 4
 
nips report
nips reportnips report
nips report
 
Presentation vision transformersppt.pptx
Presentation vision transformersppt.pptxPresentation vision transformersppt.pptx
Presentation vision transformersppt.pptx
 
Vector-Based Back Propagation Algorithm of.pdf
Vector-Based Back Propagation Algorithm of.pdfVector-Based Back Propagation Algorithm of.pdf
Vector-Based Back Propagation Algorithm of.pdf
 
Deep learning in Computer Vision
Deep learning in Computer VisionDeep learning in Computer Vision
Deep learning in Computer Vision
 
Review-image-segmentation-by-deep-learning
Review-image-segmentation-by-deep-learningReview-image-segmentation-by-deep-learning
Review-image-segmentation-by-deep-learning
 

Recently uploaded

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
ssuser89054b
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Kandungan 087776558899
 
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
Health
 

Recently uploaded (20)

Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal load
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projects2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projects
 
AIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsAIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech students
 
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxS1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
 
Bridge Jacking Design Sample Calculation.pptx
Bridge Jacking Design Sample Calculation.pptxBridge Jacking Design Sample Calculation.pptx
Bridge Jacking Design Sample Calculation.pptx
 
Learn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic MarksLearn the concepts of Thermodynamics on Magic Marks
Learn the concepts of Thermodynamics on Magic Marks
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
 
Rums floating Omkareshwar FSPV IM_16112021.pdf
Rums floating Omkareshwar FSPV IM_16112021.pdfRums floating Omkareshwar FSPV IM_16112021.pdf
Rums floating Omkareshwar FSPV IM_16112021.pdf
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best ServiceTamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
Tamil Call Girls Bhayandar WhatsApp +91-9930687706, Best Service
 

20150703.journal club

  • 1. Journal  Club  論文紹介   Visualizing  and  Understanding   Convolu5onal  Networks Zeiler,  M.  et  al.   In  Proc.  European  Conference  on   Computer  Vision  2014     2015/07/03  shouno@uec.ac.jp
  • 2. 前回(2015.03.26)のおまけスライド •  階層ネットワークへの導入   – [Zeiler  &  Fergus  13] invert this, the deconvnet uses transposed versions of the same filters, but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally. Projecting down from higher layers uses the switch settings generated by the max pooling in the convnet on the way up. As these switch settings are peculiar to a given input image, the reconstruction obtained from a single activation thus resembles a small piece of the original input image, with structures weighted according to their contribution toward to the feature activation. Since the model is trained discriminatively, they implicitly show which parts of the input image are discriminative. Note that these projections are not samples from the model, since there is no generative process involved. Layer Below Pooled Maps Feature Maps Rectified Feature Maps Convolu'onal) Filtering){F}) Rec'fied)Linear) Func'on) Pooled Maps Max)Pooling) Reconstruction Rectified Unpooled Maps Unpooled Maps Convolu'onal) Filtering){FT}) Rec'fied)Linear) Func'on) Layer Above Reconstruction Max)Unpooling) Switches) Unpooling Max Locations “Switches” Pooling Pooled Maps Feature Map Layer Above Reconstruction Unpooled Maps Rectified Feature Maps Figure 1. Top: A deconvnet layer (left) attached to a con- vnet layer (right). The deconvnet will reconstruct an ap- proximate version of the convnet features from the layer beneath. Bottom: An illustration of the unpooling oper- ation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet. Other important di↵erences relating to layers 1 and 2 were made following inspection of the visualizations in Fig. 6, as described in Section 4.1. The model was trained on the ImageNet 2012 train- ing set (1.3 million images, spread over 1000 di↵erent classes). Each RGB image was preprocessed by resiz- ing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 di↵erent sub-crops of size 224x224 (corners + center with(out) horizontal flips). Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of 10 2 , in conjunction with a momentum term of 0.9. We anneal the learning rate throughout training manually when the validation error plateaus. Dropout (Hinton et al., 2012) is used in the fully con- nected layers (6 and 7) with a rate of 0.5. All weights are initialized to 10 2 and biases are set to 0. Visualization of the first layer filters during training reveals that a few of them dominate, as shown in Fig. 6(a). To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of 10 1 to this fixed radius. This is cru- cial, especially in the first layer of the model, where the input images are roughly in the [-128,128] range. As in (Krizhevsky et al., 2012), we produce multiple di↵er- ent crops and flips of each training example to boost training set size. We stopped training after 70 epochs, which took around 12 days on a single GTX580 GPU, using an implementation based on (Krizhevsky et al., 2012). 4. Convnet Visualization Using the model described in Section 3, we now use the deconvnet to visualize the feature activations on the ImageNet validation set. Feature Visualization: Fig. 
2 shows feature visu- alizations from our model once training is complete. However, instead of showing the single strongest ac- tivation for a given feature map, we show the top 9 activations. Projecting each separately down to pixel
  • 3. Overview •  Deep  Convolu5onal  Network   – 特徴抽出+識別器として好成績を収める   – なんで,そんなに成績良いの?     •  “画像のどの成分をみて判断しているか”  を 推測するのは意外と難しい     •  Deep  Convolu5onal  Network  の中間層の     “観ているもの”  を可視化する
  • 5. ネットワークアーキテクチャ •  DCNN  [Krizhevsky+12]  を利用して特徴解析を してみる Visualizing and Understanding Convolutional Networks Input Image stride 2! image size 224! 3! 96! 5! 2! 110! 55 3x3 max pool stride 2 96! 3! 1! 26 256! filter size 7! 3x3 max pool stride 2 13 256! 3! 1! 13 384! 3! 1! 13 384! Layer 1 Layer 2 13 256! 3x3 max pool stride 2 6 Layer 3 Layer 4 Layer 5 256! 4096 units! 4096 units! Layer 6 Layer 7 C class softmax! Output contrast norm. contrast norm. Figure 3. Architecture of our 8 layer convnet model. A 224 by 224 crop of an image (with 3 color planes) is presented as the input. This is convolved with 96 di↵erent 1st layer filters (red), each of size 7 by 7, using a stride of 2 in both x and y. The resulting feature maps are then: (i) passed through a rectified linear function (not shown), (ii) pooled (max within 3x3 regions, using stride 2) and (iii) contrast normalized across feature maps to give 96 di↵erent 55 by 55 element feature maps. Similar operations are repeated in layers 2,3,4,5. The last two layers are fully connected, taking features from the top convolutional layer as input in vector form (6 · 6 · 256 = 9216 dimensions). The final layer is a C-way softmax function, C being the number of classes. All filters and feature maps are square in shape.
  • 6. 高次表現の変換:  DeConv  Net •  Max  Loca5ons  Sw.   による位置推定     •  ReLU  による正値化   •  FT  による逆フィルタ according to their contribution toward to the feature activation. Since the model is trained discriminatively, they implicitly show which parts of the input image are discriminative. Note that these projections are not samples from the model, since there is no generative process involved. Layer Below Pooled Maps Feature Maps Rectified Feature Maps Convolu'onal) Filtering){F}) Rec'fied)Linear) Func'on) Pooled Maps Max)Pooling) Reconstruction Rectified Unpooled Maps Unpooled Maps Convolu'onal) Filtering){FT}) Rec'fied)Linear) Func'on) Layer Above Reconstruction Max)Unpooling) Switches) Unpooling Max Locations “Switches” Pooling Pooled Maps Feature Map Layer Above Reconstruction Unpooled Maps Rectified Feature Maps Stochastic gradient descent with a mini-b 128 was used to update the parameters, star learning rate of 10 2 , in conjunction with a term of 0.9. We anneal the learning rate training manually when the validation erro Dropout (Hinton et al., 2012) is used in th nected layers (6 and 7) with a rate of 0.5. are initialized to 10 2 and biases are set to Visualization of the first layer filters duri reveals that a few of them dominate, a Fig. 6(a). To combat this, we renormalize in the convolutional layers whose RMS va a fixed radius of 10 1 to this fixed radius. cial, especially in the first layer of the mode input images are roughly in the [-128,128] r (Krizhevsky et al., 2012), we produce mul ent crops and flips of each training examp training set size. We stopped training after which took around 12 days on a single GTX using an implementation based on (Krizhe 2012). 4. Convnet Visualization Using the model described in Section 3, w the deconvnet to visualize the feature act the ImageNet validation set.Fig.1:  DeConv  Net  の模式図
• 7. Enjoy the visualization results (1)
Fig. 2: Visualization of layer 1 and layer 2 features.
• 8. Enjoy the visualization results (2)
Fig. 2: Visualization of layer 3 features.
• 9. Enjoy the visualization results (3)
Fig. 2: Visualization of layer 4 features.
(Figure 2 caption: Visualization of features in a fully trained model. For layers 2-5, the top 9 activations in a random subset of feature maps are shown.)
• 10. How the extracted features change as training progresses
Fig. 4: Reconstructions from a randomly selected subset of feature maps at epochs 1, 2, 5, 10, 20, 30, 40 and 64.
• 11. Choosing the network parameters (1)
•  Are settings such as the filter size and stride appropriate?
•  [Krizhevsky+12] (stride 4, filter size 11x11): quite a few filters look unusable ("dead" filters)
•  Adjusted version: stride 2, filter size 7x7
• 13. Occlusion analysis
(Figure 6, supporting the previous comparison: (a) 1st-layer features without feature-scale clipping, where one feature dominates; (b) 1st-layer features of (Krizhevsky et al., 2012); (c) the adjusted 1st-layer features, where the smaller stride (2 vs 4) and filter size (7x7 vs 11x11) give more distinctive features and fewer "dead" features; (d)/(e) 2nd-layer features of (Krizhevsky et al., 2012) vs the adjusted model, the latter being cleaner and free of aliasing artifacts.)
Figure 7: Three test examples (true labels: Pomeranian, Car Wheel, Afghan Hound) in which different portions of the scene are systematically covered with a gray square. Panels: (a) input image; (b) layer 5, strongest feature map; (c) layer 5, strongest feature map projections; (d) classifier, probability of the correct class; (e) classifier, most probable class.
•  Activity of the strongest layer-5 feature map as a function of occluder position, with the maximally active region visualized
•  Probability of the correct class as a function of occluder position
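A rough sketch of this kind of occlusion sweep: slide a gray square over the (preprocessed) input and record the classifier's probability for the true class at each position. The `model` argument, the patch size and the stride are illustrative assumptions.

```python
# Occlusion-sensitivity sweep: low probability marks regions the classifier relies on.
import torch
import torch.nn.functional as F

def occlusion_map(model, image, true_class, patch=32, stride=16, fill=0.0):
    """image: preprocessed (3, H, W) tensor; returns a 2-D heat map of P(true class)."""
    model.eval()
    _, H, W = image.shape
    ys = list(range(0, H - patch + 1, stride))
    xs = list(range(0, W - patch + 1, stride))
    heat = torch.zeros(len(ys), len(xs))
    with torch.no_grad():
        for i, y in enumerate(ys):
            for j, x in enumerate(xs):
                occluded = image.clone()
                occluded[:, y:y + patch, x:x + patch] = fill   # gray square
                probs = F.softmax(model(occluded.unsqueeze(0)), dim=1)
                heat[i, j] = probs[0, true_class]
    return heat
```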
• 16. Correspondence analysis (again, not really about visualization)
•  Measures (what appears to be) the discrepancy between the features of occluded and non-occluded images
Figure 8: Images used for the correspondence experiments. Col 1: original image; Cols 2-4: occlusion of the right eye, left eye, and nose respectively; the other columns show examples of random occlusions.

Table 1. Measure of correspondence for different object parts in 5 different dog images (mean feature sign change):

  Occlusion location   Layer 5          Layer 7
  Right eye            0.067 ± 0.007    0.069 ± 0.015
  Left eye             0.069 ± 0.007    0.068 ± 0.013
  Nose                 0.079 ± 0.017    0.069 ± 0.011
  Random               0.107 ± 0.017    0.073 ± 0.014

The lower scores for the eyes and nose (compared with random object parts) show the model implicitly establishing some form of correspondence between parts at layer 5; at layer 7 the scores are more similar, perhaps because the upper layers try to discriminate between the different breeds of dog.
•  Numerically the result is rather marginal, but occluding a specific part does produce a smaller shift in the representation than a random occlusion (or so the claim goes).
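Under my reading of this experiment, the "sign change" score compares the sign pattern of the feature difference ε_i = x_i − x̃_i (original minus occluded) across the five dog images via a Hamming distance. The sketch below illustrates that kind of measure; the exact normalization is an assumption.

```python
# Correspondence score sketch: consistency of the sign of the feature shift
# caused by occluding the same part in several images (lower = more consistent).
import numpy as np

def sign_change_score(feats_orig, feats_occl):
    """feats_orig, feats_occl: arrays of shape (n_images, n_features)."""
    eps = np.sign(feats_orig - feats_occl)        # sign of the per-image feature shift
    n = eps.shape[0]
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            total += np.mean(eps[i] != eps[j])    # normalized Hamming distance
            pairs += 1
    return total / pairs

# Usage idea: compare the score for "right eye" occlusions against random occlusions.
```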
• 17. Toward state of the art
•  With careful design and by combining (ensembling) convnets, performance improves further
•  The best combination (14.8% top-5 test error) is almost half the error of the top non-convnet entry in ImageNet 2012, which obtained 26.2% (Gunji et al., 2012)

Table 2. ImageNet 2012 classification error rates (* = trained on both the ImageNet 2011 and 2012 training sets):

  Error %                                                          Val Top-1   Val Top-5   Test Top-5
  (Gunji et al., 2012)                                             -           -           26.2
  (Krizhevsky et al., 2012), 1 convnet                             40.7        18.2        -
  (Krizhevsky et al., 2012), 5 convnets                            38.1        16.4        16.4
  (Krizhevsky et al., 2012)*, 1 convnet                            39.0        16.6        -
  (Krizhevsky et al., 2012)*, 7 convnets                           36.7        15.4        15.3
  Our replication of (Krizhevsky et al., 2012), 1 convnet          40.5        18.1        -
  1 convnet as per Fig. 3                                          38.4        16.5        -
  5 convnets as per Fig. 3 – (a)                                   36.7        15.3        15.3
  1 convnet as per Fig. 3, layers 3,4,5: 512,1024,512 maps – (b)   37.5        16.0        16.1
  6 convnets, (a) & (b) combined                                   36.0        14.7        14.8
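The slide does not spell out how the convnets are combined; a standard way to ensemble such models is to average their softmax outputs, sketched below purely as an illustration (the `models` list and its contents are assumptions).

```python
# Ensemble by averaging class probabilities of independently trained convnets.
import torch
import torch.nn.functional as F

def ensemble_predict(models, images):
    """images: (N, 3, 224, 224) tensor; returns averaged class probabilities (N, C)."""
    probs = None
    with torch.no_grad():
        for m in models:
            m.eval()
            p = F.softmax(m(images), dim=1)
            probs = p if probs is None else probs + p
    return probs / len(models)
```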
• 18. Trying various things (1)
•  At this point it no longer has much to do with visualization
•  Ablation: layers of (Krizhevsky et al., 2012) and of the Fig. 3 model are resized or removed entirely, and the model is retrained in each case

Table 3. ImageNet 2012 classification error rates with various architectural changes:

  Error %                                                             Train Top-1   Val Top-1   Val Top-5
  Our replication of (Krizhevsky et al., 2012), 1 convnet             35.1          40.5        18.1
  Removed layers 3,4                                                  41.8          45.4        22.1
  Removed layer 7                                                     27.4          40.0        18.4
  Removed layers 6,7                                                  27.4          44.8        22.4
  Removed layers 3,4,6,7                                              71.1          71.3        50.1
  Adjust layers 6,7: 2048 units                                       40.3          41.7        18.8
  Adjust layers 6,7: 8192 units                                       26.8          40.0        18.1
  Our Model (as per Fig. 3)                                           33.1          38.4        16.5
  Adjust layers 6,7: 2048 units                                       38.2          40.2        17.6
  Adjust layers 6,7: 8192 units                                       22.0          38.8        17.0
  Adjust layers 3,4,5: 512,1024,512 maps                              18.8          37.5        16.0
  Adjust layers 6,7: 8192 units and layers 3,4,5: 512,1024,512 maps   10.0          38.3        16.9
• 19. Trying various things (2)
•  Does it also work on Caltech-101/256 and PASCAL 2012? → apparently, yes
•  Setup: keep the ImageNet-pretrained layers 1-7 fixed and train only a new softmax classifier on top using the training images of the new dataset. Since the softmax has relatively few parameters, it can be trained quickly from relatively few examples; the comparison classifiers (typically a linear SVM) are of similar complexity, so the experiments compare the ImageNet-learned features with hand-crafted features. As a second strategy, the same model is trained from scratch (layers 1-7 reset to random values) on the target data. Caltech images that also occur in the ImageNet training data were identified by normalized correlation and removed before retraining, avoiding train/test contamination.

Table 4. Caltech-101 classification accuracy (15 or 30 training images per class, 5 train/test folds):

  # Train                       15/class      30/class
  (Bo et al., 2013)             -             81.4 ± 0.33
  (Jianchao et al., 2009)       73.2          84.3
  Non-pretrained convnet        22.8 ± 1.5    46.5 ± 1.7
  ImageNet-pretrained convnet   83.8 ± 0.5    86.5 ± 0.5

The pretrained model beats the best reported 30 images/class result of (Bo et al., 2013) by 2.2%; the convnet trained from scratch does poorly (46.5%).

Table 5. Caltech-256 classification accuracy:

  # Train                15/class     30/class     45/class     60/class
  (Sohn et al., 2011)    35.1         42.1         45.7         47.9
  (Bo et al., 2013)      40.5 ± 0.4   48.0 ± 0.2   51.9 ± 0.2   55.2 ± 0.3
  Non-pretrained         9.0 ± 1.4    22.5 ± 0.7   31.2 ± 0.5   38.8 ± 1.4
  ImageNet-pretrained    65.7 ± 0.2   70.6 ± 0.2   72.7 ± 0.4   74.2 ± 0.3

The ImageNet-pretrained model beats the state of the art of (Bo et al., 2013) by a significant margin (74.2% vs 55.2% at 60 images/class). Figure 9 explores the "one-shot learning" regime: with only 6 Caltech-256 training images per class, the pretrained feature extractor already beats the leading method trained on 10 times as many images.

Table 6 (PASCAL 2012): a 20-way softmax trained on top of the pretrained convnet reaches 79.0% mean accuracy, 3.2% below the leading 82.2% of (Yan et al., 2012), while winning on 5 of the 20 classes. PASCAL images are full scenes, often with multiple objects, unlike ImageNet, and the model gives only a single exclusive prediction per image.
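A minimal sketch of this transfer setup: run images through the frozen, ImageNet-pretrained network up to layer 7 and train only a new softmax head on the target dataset. The `feature_extractor` callable (images → 4096-d layer-7 features), the data loader, and the hyperparameters are assumptions standing in for the pretrained convnet and dataset.

```python
# Train only a C-way softmax classifier on top of frozen pretrained features.
import torch
import torch.nn as nn

def train_softmax_head(feature_extractor, loader, num_classes, epochs=10, lr=1e-2):
    head = nn.Linear(4096, num_classes)           # softmax applied via the loss below
    opt = torch.optim.SGD(head.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():                  # layers 1-7 stay fixed
                feats = feature_extractor(images)
            opt.zero_grad()
            loss = loss_fn(head(feats), labels)
            loss.backward()
            opt.step()
    return head
```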
• 20. Trying various things (3)
•  This appears to be the first point in the paper where a linear SVM is used for evaluation
•  Feature analysis: how discriminative are the features at each layer? Vary the number of layers retained from the ImageNet-pretrained model and place either a linear SVM or a softmax classifier on top (Table 7, for Caltech-101 and Caltech-256). Accuracy improves steadily as more layers are kept, with the best results from using all layers, supporting the premise that deeper feature hierarchies learn increasingly powerful features.

Table 7. Discriminative power of the feature maps at each layer (classifier trained on features from the layer indicated in brackets):

  Classifier (layer)   Cal-101 (30/class)   Cal-256 (60/class)
  SVM (1)              44.8 ± 0.7           24.6 ± 0.4
  SVM (2)              66.2 ± 0.5           39.6 ± 0.3
  SVM (3)              72.3 ± 0.4           46.0 ± 0.3
  SVM (4)              76.6 ± 0.4           51.3 ± 0.1
  SVM (5)              86.2 ± 0.8           65.6 ± 0.3
  SVM (7)              85.5 ± 0.4           71.7 ± 0.2
  Softmax (5)          82.9 ± 0.4           65.7 ± 0.5
  Softmax (7)          85.4 ± 0.4           72.6 ± 0.1
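A minimal sketch of this per-layer evaluation: extract features from a chosen layer of the pretrained network and fit a linear SVM on them. The helper `extract_layer_features` is an assumed stand-in for running the convnet up to layer n and flattening the resulting feature maps; the scaling step is my addition, not something the slide specifies.

```python
# Linear SVM on fixed features from one layer of the pretrained network.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def evaluate_layer(extract_layer_features, layer,
                   train_imgs, y_train, test_imgs, y_test):
    X_train = np.stack([extract_layer_features(img, layer) for img in train_imgs])
    X_test = np.stack([extract_layer_features(img, layer) for img in test_imgs])
    clf = make_pipeline(StandardScaler(), LinearSVC())   # linear SVM on frozen features
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)                     # per-layer accuracy, as in Table 7
```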
• 21. Summary
•  Visualization is important for understanding and designing DCNNs
  – As an example, the model of [Krizhevsky+12] was improved based on the visualizations
  – The improved DCNN was applied to other datasets and shown to work well enough there too