How transferable are features in
deep neural networks?
Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson
DL Hacks
!  NIPS 2014
!  A Deep Learning paper on transfer learning:
!  how general or specific are the features learned at each layer of a
   deep network, and how well do they transfer to a new task?
Are DNN features general or specific?
!  First-layer features are general: regardless of the dataset or task,
   layer 1 tends to learn Gabor-like filters and color blobs.
!  Last-layer features are specific: they are tied to the particular
   classes of the training task.
!  So somewhere between the first and the last layer, features must
   transition from general to specific.
   (Filter visualization: http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/CVPR2012-Tutorial_lee.pdf)
From general to specific: the questions
Lower DNN layers are general, higher layers specific, but:
1. Can we quantify the degree to which a particular layer is general or
   specific?
2. Does the transition from general to specific occur suddenly at a single
   layer, or is it spread out over several layers?
3. Where in the network does this transition take place?
Transfer learning
!  [Pan et al. 2010] call this setting inductive transfer learning:
!  first train a network on a base task,
!  then reuse its learned features for a target task.
!  Transfer should work to the extent that the features are general
   rather than specific to the base task.
Two ways to handle transferred layers: fine-tune vs. frozen
1. fine-tune: keep backpropagating through the copied layers on the new
   task (suitable when the target dataset is large enough not to overfit).
2. frozen: leave the copied layers fixed and train only the layers above
   (suitable when the target dataset is small).
Whether the copied layers are frozen or fine-tuned is exactly what the
experiments below vary (see the sketch after this slide).
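As a concrete illustration of the two options, here is a minimal PyTorch sketch (the paper itself used Caffe; treating the network as an `nn.Sequential` of per-layer blocks, and the name `transfer_first_n`, are assumptions made for illustration):

```python
import torch.nn as nn

def transfer_first_n(donor: nn.Sequential, recipient: nn.Sequential,
                     n: int, frozen: bool) -> nn.Sequential:
    """Copy the first n layer blocks of `donor` into `recipient`.

    frozen=True  -> copied layers are locked (no gradient updates);
    frozen=False -> copied layers are fine-tuned along with the rest.
    The remaining layers keep their fresh random initialization and are
    always trained on the target task.
    """
    for i in range(n):
        recipient[i].load_state_dict(donor[i].state_dict())  # copy weights
        for p in recipient[i].parameters():
            p.requires_grad = not frozen  # False = frozen, True = fine-tune
    return recipient
```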
Experimental setup: tasks A and B
!  Split the ImageNet classes into two halves, A and B, and train an
   8-layer convnet on each: baseA and baseB.
!  Pick a layer n from {1, 2, …, 7}.
!  Copy the first n layers of a base network into a new network, then
   retrain its remaining 8 − n layers on the target task.
Treatments for n = 3
!  B3B: the first 3 layers are copied from baseB and frozen; the rest of
   the network is retrained on B. A "selffer" control.
!  A3B: the first 3 layers are copied from baseA and frozen; the rest is
   retrained on B. A "transfer" network.
!  If the copied layers are fine-tuned instead of frozen, the networks
   are written B3B+ and A3B+.
[Figure 1: Overview of the experimental treatments and controls. Top two
rows: the base networks are trained using standard supervised backprop on
only half of the ImageNet dataset each (first row: the A half, second row:
the B half); the labeled rectangles (e.g. W_A1) represent the learned
weight vectors. The figure distinguishes frozen from fine-tuned copied
layers.]
!  Repeating this for every n yields the AnB and BnA families.
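Given trained baseA and baseB, the full treatment grid for one value of n can then be enumerated as below. This is only a sketch: `make_base_net` (a factory returning a freshly initialized 8-layer net) is a hypothetical helper, `transfer_first_n` is the function sketched after the fine-tune/frozen slide above, and each returned network must still be trained on task B.

```python
def build_treatments(baseA, baseB, make_base_net, n):
    """Build the paper's four treatments for layer n.

    BnB / BnB+ : selffer controls (first n layers copied from baseB);
    AnB / AnB+ : transfer networks (first n layers copied from baseA);
    a trailing '+' means the copied layers are fine-tuned, not frozen.
    """
    grid = {}
    for donor_name, donor in [("B", baseB), ("A", baseA)]:
        for plus in (False, True):
            name = f"{donor_name}{n}B" + ("+" if plus else "")
            grid[name] = transfer_first_n(donor, make_base_net(),
                                          n, frozen=not plus)
    return grid
```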
Interpreting AnB
!  If A3B performs as well as baseB, the first 3 layers are general:
   features learned on A serve B equally well.
!  If A3B performs worse than baseB, the first 3 layers are specific
   to task A.
[Figure 1, repeated: overview of the experimental treatments and controls.]
Dataset: ImageNet [Deng et al., 2009]
!  1000 classes, 1,281,167 training images, 50,000 validation images.
!  Each of A and B gets 500 classes and about 645,000 images.
!  Two kinds of A/B splits are used:
!  Random splits (similar halves): ImageNet contains clusters of closely
   related classes, e.g. 13 felid classes, which a random split divides
   roughly 6 / 7 between the two halves.
!  Man-made vs. natural split (dissimilar halves): 551 man-made object
   classes vs. 449 natural ones.
A sketch of the random split follows this slide.
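The random split can be written in a few lines, assuming a list of the 1000 ImageNet synset ids is available (the man-made/natural split instead follows the WordNet hierarchy and is not reproduced here):

```python
import random

def random_ab_split(class_ids, seed=0):
    """Randomly partition the 1000 ImageNet classes into two disjoint
    500-class halves A and B (each keeps roughly 645k training images)."""
    assert len(class_ids) == 1000
    rng = random.Random(seed)        # fixed seed -> reproducible split
    ids = sorted(class_ids)          # canonical order before shuffling
    rng.shuffle(ids)
    return ids[:500], ids[500:]      # (task A classes, task B classes)
```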
Implementation
!  Built on Caffe [Jia et al., 2014].
!  Code and parameter files to reproduce all experiments:
!  http://yosinski.com/transfer
!  Results were analyzed in IPython notebooks.
Experiment 1: random A/B splits
[Figure 2, top: top-1 accuracy (higher is better) vs. the layer n at which
the network is chopped and retrained. White circles above n = 0 show baseB;
dark blue dots are BnB, light blue BnB+, dark red diamonds AnB, light red
AnB+. Each marker is the validation accuracy of one trained network, and
points are jittered horizontally for visual clarity. There are eight baseB
points because four separate random A/B splits were tested.]
•  Four random A/B splits of ImageNet were used.
•  By symmetry AnA behaves like BnB and BnA like AnB, so the plot reports
   only the BnB(+) and AnB(+) treatments.
Experiment 1, observation 1: baseB
!  baseB reaches 37.5% top-1 error, versus 42.5% for a 1000-class network.
!  Classifying among 500 classes is simply easier than among 1000
   (see Figure 2 above).
Experiment 1, observation 2: selffer BnB
!  BnB dips at layers 4 and 5: adjacent layers of the original network
   learn fragile, co-adapted interactions (co-adaptation).
!  With the lower layers frozen, the upper layers alone cannot relearn
   those interactions.
!  Performance recovers at layers 6 and 7, where less remains to be
   relearned above the split (see Figure 2 above).
Experiment 1, observation 3: selffer BnB+
!  BnB+ matches baseB at every layer: fine-tuning recovers the co-adapted
   interactions that hurt the frozen BnB networks (see Figure 2 above).
Experiment 1, observation 4: transfer AnB
!  AnB transfers layers 1 and 2 with almost no penalty: the features in
   layers 1–2 are general.
!  Performance drops at layer 3 and falls further through layer 7, for
   two reasons (cf. observation 2):
   - layers 3–4: mostly lost co-adaptation;
   - layers 5–7: lost co-adaptation plus increasing specificity of the
     features to task A (see Figure 2 above).
Experiment 1, observation 5: transfer AnB+
!  AnB+ outperforms both baseB and BnB+.
!  Transferring features and then fine-tuning them improves generalization
   on the target task.
!  The boost is not explained by extra training time: these networks were
   trained just as long as BnB+.
!  The boost grows with the number of layers transferred (Table 1):
Table 1: Performance boost of AnB+ over controls, averaged over different
ranges of layers.

  layers aggregated    mean boost over baseB    mean boost over selffer BnB+
  1-7                  1.6%                     1.4%
  3-7                  1.8%                     1.4%
  5-7                  2.1%                     1.7%
Experiment 1: summary
[Figure 2, bottom: lines connecting the means of each treatment, annotated
with the numbered interpretations from Section 4.1 of the paper:
  2: performance drops due to fragile co-adaptation;
  3: fine-tuning recovers co-adapted interactions;
  4: performance drops due to representation specificity;
  5: transfer + fine-tuning improves generalization.]
Experiment 2: dissimilar tasks (man-made / natural split)
[Figure 3, top left: top-1 accuracy vs. layer n for the man-made/natural
split.]
•  baseA is trained on the man-made classes, baseB on the natural classes.
•  baseA and baseB reach different accuracies, so each transfer direction
   (BnA and AnB) is plotted as its own line.
•  Transferring between these dissimilar halves degrades performance more
   than transferring between random splits.
•  Networks targeting the natural classes do better than those targeting
   the man-made classes.
•  The natural half has only 449 classes versus 551 man-made ones,
•  so the natural task may simply be easier.
Experiment 3: random weights
!  [Jarrett et al. 2009] found, surprisingly, that random convolutional
   filters combined with rectification, pooling, and normalization can
   work almost as well as learned features.
!  But they showed this on NNs with only 2–3 learned layers,
!  and on the much smaller Caltech-101 dataset.
Does the nearly optimal performance of random filters carry over to a
deeper NN trained on a larger dataset?
Experiment 3: results
!  Accuracy falls off quickly when the first 1 or 2 layers are left as
   random, untrained filters,
!  and drops to near-chance levels when 3 or more layers are random.
!  Random filters do not rival learned features in a deep NN on a large
   dataset,
!  contradicting the conclusion of [Jarrett et al. 2009].
[Figure 3, top right: top-1 accuracy when the first n layers consist of
random, untrained filters.]
A sketch of this control follows.
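In the same sketch style as before, the random-weights control leaves the first n layers untrained and frozen; the Gaussian initialization below is an assumption for illustration, not the paper's exact scheme:

```python
import torch.nn as nn

def randomize_first_n(net: nn.Sequential, n: int) -> nn.Sequential:
    """Experiment 3 control: keep the first n layers as random, untrained
    filters (frozen) and train only the layers above them."""
    for i in range(n):
        for p in net[i].parameters():
            nn.init.normal_(p, std=0.01)  # random, untrained weights
            p.requires_grad = False       # never updated during training
    return net
```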
Comparison of experiments 1, 2, and 3
[Figure 3, bottom: relative top-1 accuracy (higher is better) vs. the layer
n at which the network is chopped and retrained: reference line, mean AnB
on random splits, mean AnB on the man-made/natural (m/n) split, and random
features, all normalized by subtracting base-level performance.]
Figure 3: Performance degradation vs. layer. Top left: degradation when
transferring between dissimilar tasks (from man-made classes of ImageNet to
natural classes or vice versa); the upper line connects networks trained
toward the "natural" target task, the lower line those trained toward the
"man-made" target task. Top right: performance when the first n layers
consist of random, untrained weights. Bottom: the top two plots compared to
the random A/B split from Section 4.1 (red diamonds), all normalized by
subtracting their base-level performance.
(Excerpt from the paper's Section 5, Conclusions:) "We have demonstrated a
method for quantifying the transferability of features from each layer of
a neural network, which reveals their generality or specificity. We showed
how transferability is negatively affected by two distinct issues:
optimization difficulties related to splitting networks in the middle of
fragilely co-adapted layers and the specialization of higher layer features
to the original task."
•  Transfer degrades more as the base and target tasks become less similar.
•  Random features are far worse than transferred features at every depth.
•  The random-filter result of [Jarrett et al. 2009] likely reflected their
   shallower networks and the smaller Caltech-101 dataset, on which
   networks overfit, making random filters look better by comparison.
Conclusions: quantifying how general or specific NN features are
!  With frozen transferred layers, two distinct issues degrade performance:
!  fragile co-adaptation between layers that are split apart, and
!  the specificity of higher-layer features to the original task.
!  Even so, transferring from a distant task beats random filters.
!  Initializing from transferred features and then fine-tuning improves
   generalization,
!  and the boost persists even after extensive training on the target task.
Impressions
!  A simple design but a massive experiment: training all the networks
   took a great deal of GPU time (9.5).
!  A useful empirical reference for anyone doing transfer learning with DL.
!  Puts concrete numbers on intuitions about what NN layers learn.
References
!  Transfer learning:
!  R. Caruana, "Multitask learning," Mach. Learn., vol. 28, no. 1,
   pp. 41–75, 1997.
!  S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans.
   Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, 2010.
!  Transfer learning with Deep Learning:
!  Y. Bengio, "Deep Learning of Representations for Unsupervised and
   Transfer Learning," JMLR Work. Conf. Proc., vol. 7, pp. 1–20, 2011.
!  Resources:
!  http://yosinski.com/transfer
!  CVPR 2012 Tutorial: Deep Learning Methods for Vision
   http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/CVPR2012-Tutorial_lee.pdf