Fine-Grained Image Analysis
ICME Tutorial
Jul. 8, 2019 Shanghai, China
Xiu-Shen WEI
Megvii Research Nanjing, Megvii Inc.
IEEE ICME2019
Outline
Part 1
Part 2
Background about CV, DL, and image analysis
Introduction of fine-grained image analysis
Fine-grained image retrieval
Fine-grained image recognition
Other computer vision tasks related to fine-grained image analysis
New developments of fine-grained image analysis
http://www.weixiushen.com/
Part 1
Background
A brief introduction of computer vision
Traditional image recognition and retrieval
Deep learning and convolutional neural networks
Introduction
Fine-grained images vs. generic images
Various real-world applications of fine-grained images
Challenges of fine-grained image analysis
Fine-grained benchmark datasets
Fine-grained image retrieval
Fine-grained image retrieval based on hand-crafted features
Fine-grained image retrieval based on deep learning
http://www.weixiushen.com/
Background
What is computer vision?
The goal of computer vision
• To extract "meaning" from pixels
What we see vs. what a computer sees
Source: S. Narasimhan
http://www.weixiushen.com/
Background (con't)
Why study computer vision?
CV is useful
CV is interesting
CV is difficult
…
Finger reader
Image captioning
Crowds and occlusions
http://www.weixiushen.com/
Background (con't)
Successes of computer vision to date
Optical character recognition Biometric systems
Face recognition Self-driving cars
http://www.weixiushen.com/
Top-tier CV conferences/journals and prizes
Marr Prize
Background (con't)
http://www.weixiushen.com/
Traditional image recognition and image retrieval
Traditional Approaches
Image → vision features → recognition (e.g., object detection, classification)
Audio → audio features → speaker identification
Data → feature extraction → learning algorithm
Recognition:
SVM
Random Forest
…
Retrieval:
k-NN
…
Background (con't)
http://www.weixiushen.com/
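To make this pipeline concrete, here is a minimal, hedged sketch (my illustration, not code from the tutorial): hand-crafted feature vectors (random stand-ins for SIFT/HoG-style descriptors) are fed to an SVM for recognition and to a k-NN index for retrieval, using scikit-learn.

```python
# A minimal sketch of the traditional pipeline: hand-crafted features
# (random stand-ins here) feed a learning algorithm for recognition (SVM)
# or a nearest-neighbor search for retrieval (k-NN).
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import NearestNeighbors

features = np.random.rand(100, 128)          # e.g., 128-d hand-crafted descriptors
labels = np.random.randint(0, 5, size=100)   # 5 toy classes

clf = SVC(kernel="linear").fit(features, labels)       # recognition
print(clf.predict(features[:3]))

index = NearestNeighbors(n_neighbors=5).fit(features)  # retrieval
dists, ids = index.kneighbors(features[:1])
print(ids)
```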
Computer vision features
SIFT, HoG, RIFT, Textons, GIST, …
Deep learning
Background (con't)
http://www.weixiushen.com/
Deep learning and convolutional neural networks
Before: hand-crafted features fed into a separate classifier
Now: an end-to-end neural network (input layer → hidden layer → output layer)
Figures are courtesy of L. Fei-Fei.
"Rome was not built in one day!"
Background (con't)
http://www.weixiushen.com/
All neural net activations are arranged in 3 dimensions.
Similarly, in the three-dimensional case, if the input tensor of convolution layer l is x^l ∈ R^{H^l × W^l × D^l}, then the convolution kernel of that layer is f^l ∈ R^{H × W × D^l}. A 3-D convolution simply extends the 2-D convolution to all channels at the corresponding position (i.e., over D^l), summing all H × W × D^l elements covered by one convolution step as the convolution result at that position. (Figure: a 3 × 4 × 3 kernel applied to the input at one position yields a 1 × 1 × 1 output.)
Furthermore, if there are D kernels like f^l, then D values are obtained at the same position, and D is exactly the number of channels D^{l+1} of the next layer's feature x^{l+1}. Formally, the convolution operation can be written as:
$$y_{i^{l+1},\, j^{l+1},\, d} = \sum_{i=0}^{H} \sum_{j=0}^{W} \sum_{d^l=0}^{D^l} f_{i,\, j,\, d^l,\, d} \times x^l_{i^{l+1}+i,\; j^{l+1}+j,\; d^l} \qquad (2.1)$$
where (i^{l+1}, j^{l+1}) are the coordinates of the convolution result, satisfying 0 ≤ i^{l+1} < H^{l+1} and 0 ≤ j^{l+1} < W^{l+1}.
Background (con't)
http://www.weixiushen.com/
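As a hedged illustration of Eq. (2.1) (not code from the tutorial), the NumPy sketch below computes the convolution result at a single output position by summing the element-wise products of one kernel and the input patch over all channels; the tensor shapes are illustrative assumptions.

```python
# A minimal NumPy sketch of Eq. (2.1): the convolution output at one position
# (i_{l+1}, j_{l+1}) sums H x W x D^l products of the kernel f and input x^l.
import numpy as np

Hl, Wl, Dl = 8, 8, 3          # input size of layer l (assumed)
H, W = 3, 4                   # kernel spatial size (as in the 3x4x3 example)
x = np.random.rand(Hl, Wl, Dl)
f = np.random.rand(H, W, Dl)  # one kernel; D kernels would give D output channels

def conv_at(x, f, i_out, j_out):
    # y_{i_out, j_out} = sum_{i, j, dl} f[i, j, dl] * x[i_out + i, j_out + j, dl]
    patch = x[i_out:i_out + f.shape[0], j_out:j_out + f.shape[1], :]
    return np.sum(patch * f)

print(conv_at(x, f, 0, 0))    # a single 1x1x1 convolution result
```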
Unit processing of CNNs
(Figure: example of the convolution operation — a small kernel slides over a 5×5 input; panels show the convolved features obtained after the 2nd and the 9th convolution steps)
Convolution
Max-pooling:
$$y_{i^{l+1},\, j^{l+1},\, d} = \max_{0 \le i < H,\; 0 \le j < W} x^l_{i^{l+1} \times H + i,\; j^{l+1} \times W + j,\; d} \qquad (2.6)$$
where 0 ≤ i^{l+1} < H^{l+1}, 0 ≤ j^{l+1} < W^{l+1}, and 0 ≤ d < D^{l+1} = D^l.
(Figure: example of max-pooling with a 2 × 2 window and stride 1 — panels show the pooled features obtained after the 1st and the 16th pooling operations)
Besides these two most common pooling operations, stochastic pooling [94] lies between them. It is very simple: elements of the input are selected at random with probability proportional to their values, rather than always taking the maximum as max-pooling does. Thus larger activations are more likely to be chosen, and vice versa. In a global sense stochastic pooling approximates average pooling, while locally it behaves like max-pooling.
Pooling
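A minimal NumPy sketch of max-pooling (my illustration; window size, stride, and input are assumptions). Eq. (2.6) corresponds to a stride equal to the window size, while the figure's example uses a 2 × 2 window with stride 1.

```python
# A minimal NumPy sketch of max-pooling over an h x w x d activation tensor.
import numpy as np

def max_pool(x, H=2, W=2, stride=1):
    h_out = (x.shape[0] - H) // stride + 1
    w_out = (x.shape[1] - W) // stride + 1
    y = np.zeros((h_out, w_out, x.shape[2]))
    for i in range(h_out):
        for j in range(w_out):
            y[i, j, :] = x[i * stride:i * stride + H,
                           j * stride:j * stride + W, :].max(axis=(0, 1))
    return y

x = np.random.rand(5, 5, 3)       # a small 5x5 input with 3 channels
print(max_pool(x).shape)          # (4, 4, 3) with a 2x2 window and stride 1
```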
$$\mathrm{ReLU}(x) = \max\{0, x\} \qquad (8.3)$$
$$\mathrm{ReLU}(x) = \begin{cases} x & x \ge 0 \\ 0 & x < 0 \end{cases} \qquad (8.4)$$
Compared with the previous two activation functions, the gradient of ReLU is 1 for x ≥ 0 and 0 otherwise; for x ≥ 0 it completely removes the gradient-saturation effect of sigmoid-type functions. ReLU is also computationally simpler than their exponential computations, and experiments show it helps stochastic gradient descent converge roughly 6× faster [52]. Its drawback is that the gradient is 0 for x < 0: once a convolutional response becomes negative, it can no longer influence network training — the so-called "dead zone" (dying ReLU) phenomenon.
(Figure: (a) the ReLU function and (b) its gradient)
Non-linear activation function
Background (con't)
http://www.weixiushen.com/
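A tiny sketch (my illustration) of ReLU and its gradient, showing the "dead zone": negative inputs get zero output and zero gradient.

```python
# ReLU and its gradient: the gradient is 1 for x >= 0 and 0 otherwise,
# so negative responses fall into the "dead zone" and stop learning.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x >= 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # [0.  0.  0.  0.5 3. ]
print(relu_grad(x))  # [0. 0. 1. 1. 1.]
```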
CNN architecture
A typical CNN stacks layers as:
CONV → ReLU → CONV → ReLU → POOL → CONV → ReLU → CONV → ReLU → POOL → CONV → ReLU → CONV → ReLU → POOL → FC (fully-connected)
Figures are courtesy of L. Fei-Fei.
Background (con't)
http://www.weixiushen.com/
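A minimal PyTorch sketch (an assumption of mine, not the tutorial's code) of such a CONV → ReLU → CONV → ReLU → POOL stack followed by a fully-connected classifier; the layer widths and the number of classes are arbitrary.

```python
# A minimal CONV-ReLU-CONV-ReLU-POOL network ending in a fully-connected layer.
import torch
import torch.nn as nn

def block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
    )

model = nn.Sequential(
    block(3, 64), block(64, 128), block(128, 256),   # three CONV-ReLU-...-POOL stages
    nn.Flatten(),
    nn.Linear(256 * 28 * 28, 10),                    # FC layer for 10 toy classes
)

x = torch.rand(1, 3, 224, 224)
print(model(x).shape)                                # torch.Size([1, 10])
```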
Meta class
Bird, Dog, Orange, …
Traditional image recognition
(Coarse-grained)
Fine-grained image recognition
Dog
Husky, Satsuma, Alaska, …
Introduction
Fine-grained images vs. generic images
http://www.weixiushen.com/
Various real-world applications
Introduction (con't)
http://www.weixiushen.com/
Various real-world applications
Introduction (con't)
http://www.weixiushen.com/
Various real-world applications
Introduction (con't)
http://www.weixiushen.com/
Results
1. Megvii Research Nanjing
a. Error = 0.10267
2. Alibaba Machine Intelligence Technology Lab
a. Error = 0.11315
3. General Dynamics Mission Systems
a. Error = 0.12678
FGVC6
Sponsored by
This certificate is awarded to
Bo-Yan Zhou, Bo-Rui Zhao, Quan Cui, Yan-Ping Xie,
Zhao-Min Chen, Ren-Jie Song, and Xiu-Shen Wei
Megvii Research Nanjing
iNat2019
winners of the iNaturalist 2019 image
classification challenge held in conjunction with
the FGVC workshop at CVPR 2019.
Genera with at least 10 species
Herbarium Challenge 2019 (top teams)
● #1 Megvii Research Nanjing (89.8%)
  ○ Boyan Zhou, Quan Cui, Borui Zhao, Yanping Xie, Renjie Song
● #2 PEAK (89.1%)
  ○ Chunqiao Xu, Shao Zeng, Qiule Sun, Shuyu Ge, Peihua Li (Dalian University of Technology)
● #3 Miroslav Valan (89.0%)
  ○ Swedish Museum of Natural History
● #4 Hugo Touvron (88.9%)
  ○ Hugo Touvron and Andrea Vedaldi (Facebook AI Research)
Various real-world applications
Introduction (con't)
http://www.weixiushen.com/
Herbarium Challenge 2019
Kiat Chuan Tan¹, Yulong Liu¹, Barbara Ambrose², Melissa Tulig², Serge Belongie¹,³
¹Google Research, ²New York Botanical Garden, ³Cornell Tech
Google Research
, Xiu-Shen Wei
Various real-world applications
Introduction (con't)
http://www.weixiushen.com/
Various real-world applications
Introduction (con't)
http://www.weixiushen.com/
Various real-world applications
Introduction (con't)
http://www.weixiushen.com/
Challenge of fine-grained image analysis
Small inter-class variance
Large intra-class variance
(Figure: examples from CUB-200-2011 — Heermann Gull, Herring Gull, Slaty-backed Gull, and Western Gull — categorizing them is technically challenging even for humans, due to the large intra-class variance and small inter-class variance)
Introduction (con't)
Figures are courtesy of [X. He and Y. Peng, CVPR 2017]. http://www.weixiushen.com/
The key of fine-grained image analysis
Introduction (con't)
http://www.weixiushen.com/
Fine-grained benchmark datasets
CUB200-2011 contains 200 bird classes and 11,788 images (5,994 training images and 5,794 test images); each image is annotated with an image-level label, an object bounding box, part key points, and bird attributes. In addition, Birdsnap [6] is a large-scale fine-grained bird dataset released by Columbia University researchers in 2014, with 500 bird species and 49,829 images (47,386 training images and 2,443 test images); like CUB200-2011, each image provides an image-level label, an object bounding box, part key points, and bird attributes.
(Figure 2-12: example images of the CUB200-2011 [75] dataset)
CUB200-2011
[C. Wah et al., CNS-TR-2011-001, 2011]
11,788 images, 200 fine-grained classes
Introduction (con't)
http://www.weixiushen.com/
Fine-grained benchmark datasets
Stanford Dogs [38] is a fine-grained image dataset released by Stanford University in 2011. It contains 20,580 images of 120 dog sub-categories; each image provides an image-level label and an object bounding box, but no finer-grained supervision such as part key points or attributes. (Figure 2-13: example images of the dataset.)
[A. Khosla et al., CVPR Workshop 2011]
Stanford Dogs
20,580 images
120 fine-grained classes
Introduction (con't)
http://www.weixiushen.com/
Fine-grained benchmark datasets
Oxford Flowers 8,189 images, 102 fine-grained classes
[M.-E. Nilsback and A. Zisserman, CVGIP 2008]
Introduction (con't)
http://www.weixiushen.com/
Fine-grained benchmark datasets
(Figure 2-15: example images of the Aircrafts [51] dataset)
Aircrafts
10,200 images
100 fine-grained classes
[S. Maji et al., arXiv: 1306.5151, 2013]
Introduction (con't)
http://www.weixiushen.com/
Fine-grained benchmark datasets
Stanford Cars
(Figure 2-16: example images of the Stanford Cars [40] dataset)
16,185 images, 196 fine-grained classes
[J. Krause et al., ICCV Workshop 2013]
Introduction (con't)
http://www.weixiushen.com/
Fine-grained image analysis is hot …
Introduction (con't)
Many papers published on top-tier conf./journals
CVPR, ICCV, ECCV, IJCAI, etc.
TPAMI, IJCV, TIP, etc.
Many frequently held workshops
Workshop on Fine-Grained Visual Categorization
…
Many academic challenges about fine-grained tasks
The Nature Conservancy Fisheries Monitoring
iFood Classification Challenge
iNature Classification Challenge
…
http://www.weixiushen.com/
Image Retrieval (IR)
What is image retrieval?
• Description Based Image Retrieval (DBIR)
• Content Based Image Retrieval (CBIR)
Query types: textual query (text) or query by example (image, sketch)
Fine-grained image retrieval
http://www.weixiushen.com/
Fine-grained image retrieval (con't)
Common components of CBIR systems
Database images → feature extraction; query image → feature extraction → comparison against the database → return ranked results
http://www.weixiushen.com/
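A hedged sketch of this CBIR loop in NumPy: extract_feature is a hypothetical placeholder for any descriptor (hand-crafted or deep); after ℓ2-normalization, cosine similarity ranks the database for a query.

```python
# A minimal CBIR loop: extract features for database and query, then compare
# with cosine similarity and return the closest images.
import numpy as np

def extract_feature(image):
    return np.random.rand(512)            # stand-in for a real feature extractor

database = [np.zeros((224, 224, 3)) for _ in range(10)]   # toy image database
db_feats = np.stack([extract_feature(im) for im in database])
db_feats /= np.linalg.norm(db_feats, axis=1, keepdims=True)

query_feat = extract_feature(np.zeros((224, 224, 3)))
query_feat /= np.linalg.norm(query_feat)

scores = db_feats @ query_feat            # cosine similarity after l2-normalization
ranking = np.argsort(-scores)             # indices of returned images, best first
print(ranking[:5])
```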
Deep learning for image retrieval
Figure 9. Sample search results using CroW features compressed to just d = 32 dimensions. The query image is shown at the leftmost side with the query bounding box marked in a red rectangle.
Fine-grained image retrieval (con't)
http://www.weixiushen.com/
FGIR vs. general-purpose IR
Fine-grained image retrieval (con't)
"Selective Convolutional Descriptor Aggregation for Fine-Grained Image Retrieval", Xiu-Shen Wei, Jian-Hao Luo, Jianxin Wu (Member, IEEE), Zhi-Hua Zhou (Fellow, IEEE)
(Cropped first page of the paper: SCDA reuses CNN models pre-trained on other tasks in a pure unsupervised setting — without image labels or bounding-box annotations — by localizing the main object(s), discarding noisy background descriptors, and aggregating the selected descriptors into a short feature vector.)
Figure 1. Fine-grained image retrieval vs. general image retrieval. (a) Fine-grained image retrieval: two examples ("Mallard" and "Rolls-Royce Phantom Sedan 2012") from the CUB200-2011 [10] and Cars [11] datasets, respectively. (b) General image retrieval: two examples from the Oxford Building [12] dataset. Fine-grained image retrieval (FGIR) processes visually similar objects as the probe and gallery; given an image of a Mallard (or Rolls-Royce Phantom Sedan 2012) as the query, the FGIR system should return images of the same class.
http://www.weixiushen.com/
FGIR based on hand-crafted features
Fine-grained image retrieval (con't)
Fine-grained image retrieval is usually restricted to one object category, e.g., retrieving among different bird sub-categories, rather than over a mixed dataset containing birds, dogs, and other categories. By contrast, the retrieval method presented here (Chapter 3 of the author's thesis) is a deep-learning-based fine-grained image retrieval method.
(Figure 2-11: illustration of the fine-grained image "search" task proposed by Xie et al. [83]; figure reproduced with IEEE permission)
[Xie et al., IEEE TMM 2015] http://www.weixiushen.com/
Figure 1. Pipeline of the proposed SCDA method: (a) input image → (b) convolutional activation tensor → (c) descriptor selection → (d) descriptor aggregation → an SCDA feature. (Best viewed in color.)
Annotations are expensive and unrealistic in many real applications. In answer to this difficulty, there are attempts to categorize fine-grained images with only image-level labels, e.g., [6–9]. In this paper, we handle a more challenging but more realistic task, i.e., fine-grained image retrieval.
[Wei et al., IEEE TIP 2017]
Fine-grained image retrieval (con't)
Selective Convolutional Descriptor Aggregation (SCDA)
http://www.weixiushen.com/
To our best knowledge, this is the first attempt at fine-grained image retrieval using deep learning.
3 Selective Convolutional Descriptor Aggregation
In this paper, we follow the notations in [24]. The term "feature map" indicates the convolution results of one channel; the term "activations" indicates feature maps of all channels in a convolution layer; and the term "descriptor" indicates the d-dimensional component vector of activations. "pool5" refers to the activations of the max-pooled last convolution layer, and "fc8" refers to the activations of the last fully connected layer.
Given an input image I of size H × W, the activations of a convolution layer are formulated as an order-3 tensor T with h × w × d elements, which include a set of 2-D feature maps S = {S_n} (n = 1, …, d). S_n of size h × w is the n-th feature map of the corresponding channel. From another point of view, T can also be considered as having h × w cells, with each cell containing one d-dimensional deep descriptor. We denote the deep descriptors as X = {x_(i,j)}, where (i, j) is a particular cell (i ∈ {1, …, h}, j ∈ {1, …, w}, x_(i,j) ∈ R^d). For instance, by employing the popular pre-trained VGG-16 model [25] to extract deep descriptors, we get a 7 × 7 × 512 activation tensor in pool5 if the input image is 224 × 224. Thus, on one hand, for this image we have 512 feature maps (i.e., S_n) of size 7 × 7; on the other hand, 49 deep descriptors of 512-d are also obtained.
3.1 Selecting Convolutional Descriptors
What distinguishes SCDA from existing deep learning-based image retrieval methods is: using only the pre-trained model, SCDA is able to find useful deep descriptors.
Feature maps: S = {S_n}, n = 1, …, d
Descriptors: X = {x_(i,j)}, x_(i,j) ∈ R^d
Fine-grained image retrieval (con't)
Notations
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
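A minimal sketch (assuming torchvision ≥ 0.13; not the authors' code) of extracting these pool5 descriptors with a pre-trained VGG-16: a 224 × 224 input yields a 7 × 7 × 512 tensor, i.e., 49 deep descriptors of 512-d.

```python
# Extract pool5 activations and reshape them into 49 deep descriptors of 512-d.
import torch
from torchvision.models import vgg16, VGG16_Weights

model = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).eval()
image = torch.rand(1, 3, 224, 224)          # stand-in for a preprocessed image

with torch.no_grad():
    pool5 = model.features(image)           # shape: (1, 512, 7, 7)

h, w, d = pool5.shape[2], pool5.shape[3], pool5.shape[1]
descriptors = pool5.squeeze(0).permute(1, 2, 0).reshape(h * w, d)  # (49, 512)
print(descriptors.shape)
```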
Figure 2. Sampled feature maps of four fine-grained images (e.g., the 468-th, 375-th, 284-th, 108-th, 481-th, 245-th, 6-th, and 163-th channels). Although we resize the images for better visualization, our method can deal with images of any resolution.
Fine-grained image retrieval (con't)
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
Figure 3. Visualization of the mask map M and the corresponding largest connected component M̃. The selected regions are highlighted in red. (Best viewed in color.)
If the activation is high in a region, we could expect this region to be an object rather than the background. Therefore, in the proposed method, we add up the obtained activation tensor through the depth direction: the h × w × d 3-D tensor becomes an h × w 2-D tensor, which we call the "activation map", i.e., A = Σ_n S_n (where S_n is the n-th feature map in pool5). The activation map A has h × w summed activation responses, corresponding to the h × w positions. Based on the aforementioned observation, the larger the activation response at a particular position (i, j), the more likely its corresponding region is part of the object. We then calculate the mean value ā of all positions in A as the threshold to decide which positions locate objects: a position (i, j) whose activation response is higher than ā indicates that the main object, e.g., a bird or dog, might appear there. The mask map M of the same size as A is obtained as:
$$M_{i,j} = \begin{cases} 1 & \text{if } A_{i,j} > \bar{a} \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
where (i, j) is a particular position among these h × w positions. The first row of Fig. 3 shows some example mask maps for birds and dogs.
Fine-grained image retrieval (con't)
Obtaining the activation map by summing the feature maps
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
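A minimal sketch of this selection step (my illustration with random stand-in activations): sum the pool5 activations over channels to get A, threshold at its mean ā to get M, and keep only the descriptors at selected positions. (The largest-connected-component step M̃ used by SCDA is omitted here.)

```python
# SCDA-style descriptor selection: activation map A, mask M (Eq. 1), selection F.
import torch

pool5 = torch.rand(512, 7, 7)                 # stand-in for VGG-16 pool5 activations
A = pool5.sum(dim=0)                          # activation map, shape (7, 7)
a_bar = A.mean()                              # threshold \bar{a}
M = (A > a_bar).float()                       # mask map M, Eq. (1)

# Keep only descriptors at positions where M == 1 (background is discarded).
descriptors = pool5.permute(1, 2, 0).reshape(-1, 512)    # (49, 512)
selected = descriptors[M.reshape(-1) > 0]                # F = {x_(i,j) | M_(i,j) = 1}
print(selected.shape)
```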
Figure 3. Visualization of the mask map M and the corresponding largest connected component M̃. The selected regions are highlighted in red. (Best viewed in color.)
Fine-grained image retrieval (con't)
Visualization of the mask map M
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
Figure 4. Selecting useful deep convolutional descriptors: (a) input image → (b) feature maps → (c) activation map → (d) mask map → (e) the largest connected component of the mask map. (Best viewed in color.)
Therefore, we use M̃ to select useful and meaningful deep convolutional descriptors. The descriptor x_(i,j) should be kept when M̃_{i,j} = 1, while M̃_{i,j} = 0 means position (i, j) might contain background or noisy parts:
$$F = \left\{ x_{(i,j)} \mid \tilde{M}_{i,j} = 1 \right\} \qquad (2)$$
Fine-grained image retrieval (con't)
Selecting useful deep convolutional descriptors
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
Qualitative evaluation
Figure 6. Samples of predicted object localization bounding box on CUB200-2011 and Stanford Dogs. The ground-truth bounding box is marked as the red dashed rectangle.
Fine-grained image retrieval (con't)
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
By comparing the results of PCP for torso and our whole-object on CUB200-2011, we find that, even though our method is unsupervised, the localization performance is just slightly lower than or even comparable to that of algorithms using strong supervision, e.g., ground-truth bounding boxes and part annotations (even in the test phase).
3.2 Aggregating Convolutional Descriptors
After selecting F = { x_(i,j) | M̃_{i,j} = 1 }, we compare several encoding or pooling approaches to aggregate these convolutional features, and then give our proposal.
– VLAD [14] uses k-means to find a codebook of K centroids {c_1, …, c_K} and maps x_(i,j) into a single vector v_(i,j) = [0 … 0, x_(i,j) − c_k, … 0] ∈ R^{K×d}, where c_k is the closest centroid to x_(i,j). The final representation is Σ_(i,j) v_(i,j).
– Fisher Vector [15]: FV is similar to VLAD, but uses a soft assignment (i.e., a Gaussian Mixture Model) instead of k-means. Moreover, FV also includes second-order statistics. (For the parameter choice of VLAD/FV we follow the suggestions reported in [30]: the number of clusters in VLAD and the number of Gaussian components in FV are both set to 2; larger values lead to lower accuracy.)
– Pooling approaches. We also try two traditional pooling approaches, i.e., max-pooling and average-pooling, to aggregate the deep descriptors.
After encoding or pooling into a single vector, for VLAD and FV the square-root normalization and ℓ2-normalization follow; for max- and average-pooling we only apply ℓ2-normalization (the square-root normalization did not work well). Finally, the cosine similarity is used for nearest-neighbor search. We use two datasets to demonstrate which type of aggregation method is optimal.

Table 2. Comparison of different encoding or pooling approaches for FGIR.
Approach        Dimension   CUB200-2011 (top1 / top5)   Stanford Dogs (top1 / top5)
VLAD            1,024       55.92% / 62.51%             69.28% / 74.43%
Fisher Vector   2,048       52.04% / 59.19%             68.37% / 73.74%
avgPool         512         56.42% / 63.14%             73.76% / 78.47%
maxPool         512         58.35% / 64.18%             70.37% / 75.59%
avg&maxPool     1,024       59.72% / 65.79%             74.86% / 79.24%
Fine-grained image retrieval (con't)
Aggregating convolutional descriptors
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
Table 2. Comparison of different encoding or pooling approaches for FGIR.
Approach        Dimension   CUB200-2011 (top1 / top5)   Stanford Dogs (top1 / top5)
VLAD            1,024       55.92% / 62.51%             69.28% / 74.43%
Fisher Vector   2,048       52.04% / 59.19%             68.37% / 73.74%
avgPool         512         56.42% / 63.14%             73.76% / 78.47%
maxPool         512         58.35% / 64.18%             70.37% / 75.59%
avg&maxPool     1,024       59.72% / 65.79%             74.86% / 79.24%   ← SCDA
Fine-grained image retrieval (con't)
Comparing different encoding or pooling methods
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
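A hedged sketch of the winning "avg&maxPool" aggregation: average- and max-pool the selected descriptors, concatenate into a 1,024-d SCDA feature, and ℓ2-normalize; the descriptor matrix is a random stand-in.

```python
# SCDA's "avg&maxPool" aggregation of the selected descriptors.
import torch
import torch.nn.functional as F

selected = torch.rand(23, 512)                 # stand-in for F = {x_(i,j) | M̃_(i,j) = 1}
avg_part = selected.mean(dim=0)                # 512-d
max_part = selected.max(dim=0).values          # 512-d
scda = torch.cat([avg_part, max_part])         # 1,024-d SCDA feature
scda = F.normalize(scda, dim=0)                # l2-normalization

# Retrieval then ranks gallery images by cosine similarity of SCDA features.
```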
Figure 6. The mask map and its corresponding largest connected component for different CNN layers: (a) M of pool5, (b) M̃ of pool5, (c) M of relu5_2, (d) M̃ of relu5_2. (Best viewed in color.)
As studied in [31, 32], ensembling multiple layers boosts the final performance. Thus, we also incorporate another SCDA feature produced from the relu5_2 layer, which is three layers in front of pool5 in the VGG-16 model [25]. Following pool5, we get the mask map M_relu5_2 from relu5_2. Its activations are less related to the semantic meaning than those of pool5; as shown in Fig. 6 (c), there are many noisy parts. However, the bird is more accurately detected than with pool5. Therefore, we combine M̃_pool5 and M_relu5_2 to get the final mask map of relu5_2: M̃_pool5 is first upsampled to the size of M_relu5_2, and we keep the descriptors whose positions are 1 in both M̃_pool5 and M_relu5_2 — these are the final selected relu5_2 descriptors. The aggregation process remains the same. Finally, we concatenate the SCDA features of relu5_2 and pool5 into a single representation, denoted "SCDA+":
$$\mathrm{SCDA}^{+} = \left[ \mathrm{SCDA}_{\mathrm{pool5}},\; \alpha \times \mathrm{SCDA}_{\mathrm{relu5\_2}} \right] \qquad (3)$$
where α is the coefficient for SCDA_relu5_2; it is set to 0.5 for FGIR. ℓ2-normalization follows. In addition, another SCDA+ of the horizontal flip of the original image is incorporated, which is denoted "SCDA_flip+" (4,096-d).
Fine-grained image retrieval (con't)
Multiple layer ensemble
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
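A small sketch of Eq. (3) (my illustration with random stand-in features): the pool5 and relu5_2 SCDA features are concatenated with coefficient α = 0.5 and ℓ2-normalized.

```python
# SCDA+ ensemble: concatenate the pool5 SCDA feature with the relu5_2 SCDA
# feature weighted by alpha, then l2-normalize.
import torch
import torch.nn.functional as F

scda_pool5 = F.normalize(torch.rand(1024), dim=0)      # stand-in SCDA from pool5
scda_relu52 = F.normalize(torch.rand(1024), dim=0)     # stand-in SCDA from relu5_2
alpha = 0.5                                             # coefficient used for FGIR

scda_plus = torch.cat([scda_pool5, alpha * scda_relu52])    # 2,048-d SCDA+
scda_plus = F.normalize(scda_plus, dim=0)
# Concatenating SCDA+ of the horizontally flipped image gives the 4,096-d "SCDA_flip+".
```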
Figure 7. Some retrieval results of four fine-grained datasets. On the left are the query images.
Fine-grained image retrieval (con't)
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
Figure 8. Quality demonstrations of the SCDA feature. From top to bottom of each column, there are six returned original images in the order of one sorted dimension of "256-d SVD+whitening". (Best viewed in color and zoomed in.)
Fine-grained image retrieval (con't)
Quality demonstration of the SCDA feature
[Wei et al., IEEE TIP 2017] http://www.weixiushen.com/
Deep Descriptor Transforming
CNN pre-trained models → Deep Descriptor Transforming (DDT)
Image co-localization (a.k.a. unsupervised object discovery) is a fundamental computer vision problem, which simultaneously localizes objects of the same category across a set of distinct images.
"Deep Descriptor Transforming for Image Co-Localization", Xiu-Shen Wei, Chen-Lin Zhang, Yao Li, Chen-Wei Xie, Jianxin Wu, Chunhua Shen, Zhi-Hua Zhou (National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; The University of Adelaide, Adelaide, Australia)
Overview: 2 – the proposed method (DDT); 4 – experiments on four benchmark co-localization datasets (Object Discovery, PASCAL VOC, ImageNet subsets); 5 – further study (detecting noisy images, robustness of DDT).
Input images
PCA converts a set of possibly correlated variables into a set of linearly uncorrelated variables (i.e., the principal components). This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to all the preceding components. PCA is widely used in machine learning and computer vision for dimension reduction [Chen et al., 2013; Gu et al., 2011; Zhang et al., 2009; Davidson, 2009], noise reduction [Zhang et al., 2013; Nie et al., 2011] and so on. Specifically, in this paper, we utilize PCA as projection directions for transforming these deep descriptors {x_(i,j)} to evaluate their correlations. Then, on each projection direction, the corresponding principal component's values are treated as the cues for image co-localization, especially the first principal component. Thanks to the property of this kind of transforming, DDT is also able to handle data noise.
Pipeline step: X^1, X^2, …, X^N (deep descriptors) → calculating the descriptors' mean vector.
In DDT, for a set of N images containing objects from the same category, we first collect the corresponding convolutional descriptors (X^1, …, X^N) by feeding the images into a pre-trained CNN model. Then, the mean vector of all the descriptors is calculated by:
$$\bar{x} = \frac{1}{K} \sum_{n} \sum_{i,j} x^{n}_{(i,j)} \qquad (1)$$
where K = h × w × N. Note that here we assume each image has the same number of deep descriptors (i.e., h × w) for presentation clarity; the proposed method, however, can handle input images with arbitrary resolutions.
Obtaining the covariance matrix and then getting the eigenvectors:
$$\mathrm{Cov}(x) = \frac{1}{K} \sum_{n} \sum_{i,j} \left( x^{n}_{(i,j)} - \bar{x} \right) \left( x^{n}_{(i,j)} - \bar{x} \right)^{\top} \qquad (2)$$
We can then get the eigenvectors ξ_1, …, ξ_d of Cov(x), which correspond to the sorted eigenvalues λ_1 ≥ ⋯ ≥ λ_d ≥ 0.
Figure 2: Examples from twelve randomly sampled classes of VOC 2007 (bike, diningtable, boat, bottle, chair, dog, horse, motorbike, …). For each class, the first column is by SCDA and the second column by our DDT; the red vertical lines in the histograms mark the thresholds used for localizing objects, and the selected regions in the images are highlighted in red. (Best viewed in color.)
In Fig. 2, more examples are presented; some failure cases can also be found, e.g., the chair class in Fig. 2 (g). In addition, the normalized P^1 can also be used as localization probability scores; combining it with conditional random field techniques might produce more accurate object boundaries. Thus, DDT can be modified slightly in that way to perform co-segmentation. More importantly, different from other co-segmentation methods, DDT can detect noisy images while other methods cannot.
4 Experiments
In this section, we first introduce the evaluation metric and datasets used in image co-localization. Then, we compare the empirical results of our DDT with other state-of-the-art methods on these datasets. The computational cost of DDT is reported too. Moreover, the results in Sec. 4.4 and Sec. 4.5 illustrate the generalization ability and robustness of the proposed method. Finally, our further study in Sec. 4.6 reveals DDT might deal …
(Tables 1–5, cropped: comparisons of the CorLoc metric on the co-localization datasets — Object Discovery, VOC 2007 / VOC 2012, and ImageNet subsets — between DDT, SCDA, [Joulin et al.], [Cho et al.], [Li et al.], and weakly supervised object localization (WSOL) methods; an "X" in the "Neg." column of Table 4 indicates that the WSOL method requires access to negative images. Figure 3, cropped: successful localization examples with predicted bounding boxes.)
Related work and preliminaries:
Adaptations of pre-trained models for other purposes (e.g., describing texture or finding object proposals [Ghodrati et al., 2015]) still require further annotations in the new domain (e.g., image labels), while DDT deals with the image co-localization problem in an unsupervised setting. Coincidentally, several recent works also shed light on CNN pre-trained model reuse in the unsupervised setting, e.g., SCDA [Wei et al., 2017]. SCDA is proposed for handling the fine-grained image retrieval task, where it uses pre-trained models (from ImageNet, which is not fine-grained) to locate main objects in fine-grained images. It is the most related work to ours, even though SCDA is not for image co-localization. Different from our DDT, SCDA assumes only one object of interest in each image and that objects from other categories do not exist; it locates the object using cues from this single-image assumption. Apparently, it cannot work well for images containing diverse objects (cf. Table 2 and Table 3), and also cannot handle data noise (cf. Sec. 4.5).
Image co-localization is a fundamental problem in computer vision: it needs to discover the common object emerging in only positive sets of example images (without any negative examples or further supervision). It shares some similarities with image co-segmentation [Zhao and Fu, 2015; Kim et al., 2011; Joulin et al., 2012], but instead of generating a precise segmentation of the related objects, it returns a bounding box.
3.1 Preliminary. The term "feature map" indicates the convolution results of one channel; the term "activations" indicates feature maps of all channels in a convolution layer; and the term "descriptor" indicates the d-dimensional component vector of activations. Given an input image I of size H × W, the activations of a convolution layer are formulated as an order-3 tensor T with h × w × d elements; T can be considered as having h × w cells, each containing one d-dimensional deep descriptor. For the n-th image, we denote its descriptors as X^n = { x^n_(i,j) ∈ R^d }, where (i, j) is a particular cell (i ∈ {1, …, h}, j ∈ {1, …, w}).
3.2 SCDA Recap. Since SCDA [Wei et al., 2017] is the most related work, we present a recap of this method. SCDA deals with the fine-grained image retrieval task; it employs pre-trained models to select meaningful deep descriptors by localizing the main object in fine-grained images unsupervisedly. SCDA assumes that each image contains only one main object of interest and no objects of other categories, so its object localization strategy relies on this single-image assumption.
Fine-grained image retrieval (con't)
[Wei et al., IJCAI 2017] http://www.weixiushen.com/
CNN pre-trained models → Deep Descriptor Transforming
Experiments: four benchmark co-localization datasets (Object Discovery, PASCAL VOC, ImageNet subsets); co-localization results; detecting noisy images; further study.
Figure 1: Pipeline of the proposed DDT method for image co-localization. In this instance, the goal is to localize the airplane within each image. Note that there might be a few noisy images in the image set. (Best viewed in color.)
Pre-trained CNNs have been adopted for various usages, e.g., as universal feature extractors [Wang et al., 2015; Li et al., 2016] or object proposal generators.
Input images
PCA is widely used in machine learning and computer
vision for dimension reduction [Chen et al., 2013; Gu et
al., 2011; Zhang et al., 2009; Davidson, 2009], noise reduc-
tion [Zhang et al., 2013; Nie et al., 2011] and so on. Speciļ¬-
cally, in this paper, we utilize PCA as projection directions for
transforming these deep descriptors {x(i,j)} to evaluate their
correlations. Then, on each projection direction, the corre-
sponding principal componentā€™s values are treated as the cues
for image co-localization, especially the ļ¬rst principal com-
ponent. Thanks to the property of this kind of transforming,
DDT is also able to handle data noise.
In DDT, for a set of N images containing objects from the
same category, we ļ¬rst collect the corresponding convolutional
descriptors (X1
, . . . , XN
) by feeding them into a pre-trained
CNN model. Then, the mean vector of all the descriptors is
calculated by:
ĀÆx =
1
K
X
n
X
i,j
xn
(i,j) , (1)
where K = h ā‡„ w ā‡„ N. Note that, here we assume each
image has the same number of deep descriptors (i.e., h ā‡„ w)
for presentation clarity. Our proposed method, however, can
handle input images with arbitrary resolutions.
Then, after obtaining the covariance matrix:
Cov(x) =
1
K
X
n
X
i,j
(xn
(i,j) ĀÆx)(xn
(i,j) ĀÆx)>
, (2)
we can get the eigenvectors ā‡ 1, . . . , ā‡ d of Cov(x) which cor-
respond to the sorted eigenvalues 1 Ā· Ā· Ā· d 0.
part has negative values pre
appear. In addition, if P1
of
indicates that no common o
can be used for detecting
resized by the nearest inter
same as that of the input im
largest connected componen
what is done in [Wei et al., 2
relation values and the zero
bounding box which contain
of positive regions is retur
prediction.
3.4 Discussions and A
In this section, we investig
comparing with SCDA.
As shown in Fig. 2, the ob
and DDT are highlighted in
ers the information from a s
ā€œpersonā€ and even ā€œguide-b
jects. Furthermore, we nor
the aggregation map of SC
calculate the mean value (w
ization threshold in SCDA)
values in aggregation map
red vertical line corresponds
beyond the threshold, there
explanation about why SCD
Whilst, for DDT, it lever
form these deep descriptor
n Wu , Chunhua Shen , Zhi-Hua Zhou
Novel Software Technology, Nanjing University, Nanjing, China
niversity of Adelaide, Adelaide, Australia
zh}@lamda.nju.edu.cn, {yao.li01,chunhua.shen}@adelaide.edu.au
able with the
plications. In
of pre-trained
ally, different
eature extrac-
convolutional
ons could act
the image co-
mple but effec-
Transforming
of descriptors
stent regions,
CNN pre-trained models
Deep Descriptor Transforming
XN-1X1 X2
XNā€¦
Deep
descriptors
Calculating descriptorsā€™
mean vector
i1
, Chen-Lin Zhang1
, Yao Li2
, Chen-Wei Xie1
,
n Wu1
, Chunhua Shen2
, Zhi-Hua Zhou1
Novel Software Technology, Nanjing University, Nanjing, China
niversity of Adelaide, Adelaide, Australia
zh}@lamda.nju.edu.cn, {yao.li01,chunhua.shen}@adelaide.edu.au
able with the
plications. In
of pre-trained
ally, different
eature extrac-
convolutional
ons could act
the image co-
mple but effec-
Transforming
CNN pre-trained models
transforming these deep descriptors {x(i,j)} to evaluate their
correlations. Then, on each projection direction, the corre-
sponding principal componentā€™s values are treated as the cues
for image co-localization, especially the ļ¬rst principal com-
ponent. Thanks to the property of this kind of transforming,
DDT is also able to handle data noise.
In DDT, for a set of N images containing objects from the
same category, we ļ¬rst collect the corresponding convolutional
descriptors (X1
, . . . , XN
) by feeding them into a pre-trained
CNN model. Then, the mean vector of all the descriptors is
calculated by:
ĀÆx =
1
K
X
n
X
i,j
xn
(i,j) , (1)
where K = h ā‡„ w ā‡„ N. Note that, here we assume each
image has the same number of deep descriptors (i.e., h ā‡„ w)
for presentation clarity. Our proposed method, however, can
handle input images with arbitrary resolutions.
Then, after obtaining the covariance matrix:
Cov(x) =
1
K
X
n
X
i,j
(xn
(i,j) ĀÆx)(xn
(i,j) ĀÆx)>
, (2)
we can get the eigenvectors ā‡ 1, . . . , ā‡ d of Cov(x) which cor-
respond to the sorted eigenvalues 1 Ā· Ā· Ā· d 0.
As aforementioned, since the ļ¬rst principal component has
the largest variance, we take the eigenvector ā‡ 1 corresponding
to the largest eigenvalue as the main projection direction. For
the deep descriptor at a particular position (i, j) of an image,
its ļ¬rst principal component p1
is calculated as follows:
$$p^1_{(i,j)} = \xi_1^{\top} \left(x_{(i,j)} - \bar{x}\right). \qquad (3)$$

According to their spatial locations, all $p^1_{(i,j)}$ from an image are combined into a 2-D matrix whose dimensions are h × w. We call that matrix the indicator matrix:

$$P^1 = \begin{bmatrix} p^1_{(1,1)} & p^1_{(1,2)} & \cdots & p^1_{(1,w)} \\ \vdots & \vdots & \ddots & \vdots \\ p^1_{(h,1)} & p^1_{(h,2)} & \cdots & p^1_{(h,w)} \end{bmatrix}.$$
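To make Eqs. (1)-(3) concrete, the following NumPy sketch computes the indicator matrices for a set of images. The function name `ddt_indicator_maps` and the array layout are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def ddt_indicator_maps(descriptor_sets):
    """Compute DDT indicator matrices for a set of images.

    descriptor_sets: list of arrays, one per image, each of shape (h_n, w_n, d)
                     holding the deep convolutional descriptors x^n_(i,j).
    Returns: list of (h_n, w_n) indicator matrices P^1, one per image.
    """
    # Stack all descriptors from all images into a (K, d) matrix.
    all_desc = np.concatenate(
        [X.reshape(-1, X.shape[-1]) for X in descriptor_sets], axis=0)

    # Eq. (1): mean vector over all K descriptors.
    mean_vec = all_desc.mean(axis=0)

    # Eq. (2): covariance matrix of the centered descriptors.
    centered = all_desc - mean_vec
    cov = centered.T @ centered / centered.shape[0]

    # Eigen-decomposition; the eigenvector of the largest eigenvalue is the
    # main projection direction (the first principal component).
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    xi1 = eigvecs[:, -1]                     # d-dimensional direction

    # Eq. (3): project every descriptor and reshape back to h x w per image.
    indicator_maps = []
    for X in descriptor_sets:
        p1 = (X.reshape(-1, X.shape[-1]) - mean_vec) @ xi1
        indicator_maps.append(p1.reshape(X.shape[0], X.shape[1]))
    return indicator_maps
```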
The positive values of P^1, resized to the resolution of the input image, indicate the positions of the common object. We keep only the largest connected component of the positive regions (as what is done in [Wei et al., 2017]); based on the positive correlation values and the zero threshold, the minimum rectangle bounding box which contains this largest connected component of positive regions is returned as our object co-localization prediction.
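A hedged sketch of this localization step follows. The helper name, the use of `scipy.ndimage`, and the simple scaling of the box back to the image resolution are assumptions of this sketch.

```python
import numpy as np
from scipy import ndimage

def localize_from_indicator(p1, image_hw):
    """Return an (x1, y1, x2, y2) box from an indicator matrix P^1.

    p1: (h, w) indicator matrix; positive values mark the common object.
    image_hw: (H, W) of the input image, used to rescale the box.
    """
    mask = p1 > 0                               # zero threshold of DDT
    labels, num = ndimage.label(mask)           # connected components
    if num == 0:
        return None                             # no positive region found
    # Keep only the largest connected component.
    sizes = ndimage.sum(mask, labels, range(1, num + 1))
    largest = labels == (np.argmax(sizes) + 1)
    ys, xs = np.nonzero(largest)
    # Scale the box from the h x w map to the original image resolution.
    sy, sx = image_hw[0] / p1.shape[0], image_hw[1] / p1.shape[1]
    return (int(xs.min() * sx), int(ys.min() * sy),
            int((xs.max() + 1) * sx), int((ys.max() + 1) * sy))
```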
3.4 Discussions and Analys
In this section, we investigate th
comparing with SCDA.
As shown in Fig. 2, the object l
and DDT are highlighted in red.
ers the information from a single
ā€œpersonā€ and even ā€œguide-boardā€
jects. Furthermore, we normaliz
the aggregation map of SCDA
calculate the mean value (which
ization threshold in SCDA). The
values in aggregation map is als
red vertical line corresponds to th
beyond the threshold, there are s
explanation about why SCDA hig
Whilst, for DDT, it leverages t
form these deep descriptors into
class, DDT can accurately locat
histogram is also drawn. But, P1
tive values. We normalize P1
int
Apparently, few values are larg
(i.e., 0). More importantly, many
indicates the strong negative co
validates the effectiveness of DD
As another example shown in Fig
locates ā€œpersonā€ in the image b
class. While, DDT can correctl
(g) Chair  (h) Dog  (i) Horse  (j) Motorbike
Figure 2: Examples from twelve randomly sampled classes of VOC 2007. The first column of each class is produced by SCDA, the second column by our DDT. The red vertical lines in the histogram plots indicate the thresholds for localizing objects. The selected regions in images are highlighted in red. (Best viewed in color and zoomed in.)
In Fig. 2, more examples are presented. In that figure, some failure cases can also be found, e.g., the chair class in Fig. 2 (g).

In addition, the normalized P^1 can also be used as localization probability scores. Combining it with conditional random field techniques might produce more accurate object boundaries. Thus, DDT can be modified slightly in that way to perform co-segmentation. More importantly, different from other co-segmentation methods, DDT can detect noisy images while other methods cannot.
4 Experiments
In this section, we first introduce the evaluation metric and datasets used in image co-localization. Then, we compare the empirical results of our DDT with other state-of-the-art methods on these datasets. The computational cost of DDT is also reported. Moreover, the results in Sec. 4.4 and Sec. 4.5 illustrate the generalization ability and robustness of the proposed method. Finally, our further study in Sec. 4.6 reveals that DDT might deal with part-based image co-localization, which is a novel and challenging problem.

In our experiments, the images keep their original resolutions. For the pre-trained deep model, the publicly available VGG-19 model [Simonyan and Zisserman, 2015] is employed to extract deep convolutional descriptors from the last convolution layer (before pool5). We use the open-source library MatConvNet [Vedaldi and Lenc, 2015] for conducting experiments. All the experiments are run on a computer with an Intel Xeon E5-2660 v3 CPU, 500 GB main memory, and a K80 GPU.
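The paper's descriptors come from VGG-19 in MatConvNet; as a rough, hedged equivalent, the same kind of descriptors could be extracted with torchvision. The `features[:36]` slice (which ends at the ReLU after conv5_4, i.e., just before pool5), the preprocessing, and the helper name are assumptions of this sketch.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Pre-trained VGG-19; newer torchvision versions use the `weights=` argument instead.
vgg19 = models.vgg19(pretrained=True).features[:36].eval()

# No resizing: images keep their original resolutions, as in the experiments above.
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_descriptors(image_path):
    """Return an (h, w, d) array of deep descriptors for one image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feat = vgg19(img)                       # shape (1, d, h, w)
    return feat.squeeze(0).permute(1, 2, 0).numpy()
```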
4.1 Evaluation Metric and Datasets
Following previous image co-localization works [Li et al., 2016; Cho et al., 2015; Tang et al., 2014], we take the correct localization (CorLoc) metric for evaluating the proposed method. CorLoc is defined as the percentage of images correctly localized according to the PASCAL-criterion [Everingham et al., 2015]: $\frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})} > 0.5$, where $B_p$ is the predicted bounding box and $B_{gt}$ is the ground-truth bounding box.
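In code, the per-image CorLoc test is just an intersection-over-union check; a minimal sketch, assuming boxes are (x1, y1, x2, y2) tuples:

```python
def iou(box_a, box_b):
    """Return area(A ∩ B) / area(A ∪ B) for two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def corloc(predictions, ground_truths):
    """Percentage of images whose predicted box has IoU > 0.5 with the ground truth."""
    hits = sum(iou(p, g) > 0.5 for p, g in zip(predictions, ground_truths))
    return 100.0 * hits / len(predictions)
```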
Table 1: Comparisons of CorLoc (%) on the Object Discovery dataset (Airplane).
Methods                     Airplane
[Joulin et al., 2010]       32.93
[Joulin et al., 2012]       57.32
[Rubinstein et al., 2013]   74.39
[Tang et al., 2014]         71.95
SCDA                        87.80
[Cho et al., 2015]          82.93
Our DDT                     91.46
All CorLoc results are reported as percentages.

Our experiments are conducted on the datasets commonly used in image co-localization, i.e., the Object Discovery dataset [Rubinstein et al., 2013], the PASCAL VOC 2007 / VOC 2012 datasets [Everingham et al., 2015], and the ImageNet Subsets [Li et al., 2016].

For experiments on the VOC datasets, we follow [Cho et al., 2015; Li et al., 2016; Joulin et al., 2014] and use all images in the trainval set (excluding images that only contain object instances annotated as difficult). For Object Discovery, we use the 100-image subsets following [Rubinstein et al., 2013; Cho et al., 2015] in order to make a fair comparison with other methods.

In addition, Object Discovery contains noisy images in the Airplane, Car and Horse categories; these noisy images contain no object belonging to their category. In Sec. 4.5, we quantitatively measure the ability of DDT to identify these noisy images.

To further investigate the generalization ability of DDT, the ImageNet Subsets [Li et al., 2016] are used, which contain six subsets/categories. These subsets are held out from the 1000-label ILSVRC classification [Russakovsky et al., 2015]. That is to say, these categories are unseen by the pre-trained CNN models. Experimental results in Sec. 4.4 show that DDT is insensitive to the object categories used for pre-training.
Table 2: Comparisons of CorLoc (%) on VOC 2007 (aero, bike and bird classes shown).
Methods                 aero   bike   bird
[Joulin et al., 2014]   32.8   17.3   20.9
SCDA                    54.4   27.2   43.4
[Cho et al., 2015]      50.3   42.8   30.0
[Li et al., 2016]       73.1   45.0   43.4
Our DDT                 67.3   63.3   61.3

Table 3: Comparisons of CorLoc (%) on VOC 2012 (aero, bike and bird classes shown).
Methods                 aero   bike   bird
SCDA                    60.8   41.7   38.6
[Cho et al., 2015]      57.0   41.2   36.0
[Li et al., 2016]       65.7   57.8   47.9
Our DDT                 76.7   67.1   57.9
4.2 Comparisons with State-of-the-Arts
Comparisons to Image Co-Localization Methods
We first compare the results of DDT with other co-localization methods (including SCDA) on Object Discovery. For SCDA, we also use VGG-19 to extract the convolutional descriptors and perform experiments. As shown in Table 1, DDT outperforms other methods by about 4%. Especially for the airplane class, it is about 8.5% higher than that of [Cho et al., 2015]. In addition, since the images of each category in this dataset mostly contain a single object, SCDA can perform well.

For VOC 2007 and 2012, the images contain diverse objects per image, which is more difficult than Object Discovery. The comparisons of the CorLoc metric on these two datasets are reported in Table 2 and Table 3. It is clear that on average our DDT outperforms the previous state-of-the-arts (based on deep learning) on both datasets. Moreover, DDT also works well on localizing small common objects. In contrast, because most images of these datasets contain diverse objects, which do not obey SCDA's single-object assumption, SCDA cannot perform well on them.
Table 4: Comparisons of the CorLoc metric with weakly supervised object localization (WSOL) methods on VOC 2007. "X" in the "Neg." column indicates that these WSOL methods require access to a negative image set, whereas our DDT does not.
Methods                Neg.  aero  bike  bird  boat  bottle  bus   car   cat   chair  cow   table  dog   horse  mbike
[Shi et al., 2013]     X     67.3  54.4  34.3  17.8  1.3     46.6  60.7  68.9  2.5    32.4  16.2   58.9  51.5   64.6
[Cinbis et al., 2015]  X     56.6  58.3  28.4  20.7  6.8     54.9  69.1  20.8  9.2    50.5  10.2   29.0  58.0   64.9
[Wang et al., 2015]    X     37.7  58.8  39.0  4.7   4.0     48.4  70.0  63.7  9.0    54.2  33.3   37.4  61.6   57.6
[Bilen et al., 2015]   X     66.4  59.3  42.7  20.4  21.3    63.4  74.3  59.6  21.1   58.2  14.0   38.5  49.5   60.0
[Ren et al., 2016]     X     79.2  56.9  46.0  12.2  15.7    58.4  71.4  48.6  7.2    69.9  16.7   47.4  44.2   75.5
[Wang et al., 2014]    X     80.1  63.9  51.5  14.9  21.0    55.7  74.2  43.5  26.2   53.4  16.3   56.7  58.3   69.5
Our DDT                      67.3  63.3  61.3  22.7  8.5     64.8  57.0  80.5  9.4    49.0  22.5   72.6  73.8   69.0
(a) Chipmunk  (b) Rhino  (d) Racoon  (e) Rake
Figure 3: Random samples of predicted object co-localization bounding boxes on ImageNet Subsets. Each category shows successful predictions and one failure case. In these images, the red rectangle is the prediction of DDT, and the yellow rectangle is the ground-truth bounding box. In the successful predictions, the yellow rectangles are almost the same as the red predictions. (Best viewed in color and zoomed in.)
Table 5: Comparisons of CorLoc (%) on image sets disjoint with ImageNet (Chipmunk subset).
Methods                Chipmunk
[Cho et al., 2015] 26.6
SCDA 32.3
[Li et al., 2016] 44.9
Our DDT 70.3
Figure 4: ROC curves (false positive rate vs. true positive rate, e.g., for the Airplane subset) illustrating the ability of DDT at identifying noisy images. The curves in red are produced by our DDT, and the curves in blue dashed line present the method in [Tang et al., 2014].
CNN pre-trained model reuse in the unsupervised setting, e.g., SCDA [Wei et al., 2017]. SCDA is proposed for handling the fine-grained image retrieval task, where it uses pre-trained models (from ImageNet, which is not fine-grained) to locate the main objects in fine-grained images. It is the most related work to ours, even though SCDA is not designed for image co-localization. Different from our DDT, SCDA assumes that each image contains only one object of interest and that objects from other categories do not exist. Thus, SCDA locates the object using cues from this single-image assumption. Apparently, it cannot work well for images containing diverse objects (cf. Table 2 and Table 3), and it also cannot handle data noise (cf. Sec. 4.5).

2.2 Image Co-Localization
Image co-localization is a fundamental problem in computer vision, where the goal is to discover the common object emerging in only a positive set of example images (without any negative examples or further supervision). Image co-localization shares some similarities with image co-segmentation [Zhao and Fu, 2015; Kim et al., 2011; Joulin et al., 2012]. Instead of generating a precise segmentation of the related objects in each image, co-localization methods aim to return a bounding box around the object. Moreover, co-segmentation has a strong assumption that every image contains the object of interest, and hence is unable to handle noisy images.

Additionally, co-localization is also related to weakly supervised object localization (WSOL) [Zhang et al., 2016; Bilen et al., 2015; Wang et al., 2014; Siva and Xiang, 2011]. But the key difference between them is that WSOL requires manually-labeled negative images, whereas co-localization does not.
channel; the term "activations" indicates feature maps of all channels in a convolution layer; and the term "descriptor" indicates the d-dimensional component vector of activations.

Given an input image I of size H × W, the activations of a convolution layer are formulated as an order-3 tensor T with h × w × d elements. T can be considered as having h × w cells, and each cell contains one d-dimensional deep descriptor. For the n-th image, we denote its corresponding deep descriptors as $X^n = \{x^n_{(i,j)} \in \mathbb{R}^d\}$, where (i, j) is a particular cell (i ∈ {1, ..., h}, j ∈ {1, ..., w}) and n ∈ {1, ..., N}.

3.2 SCDA Recap
Since SCDA [Wei et al., 2017] is the most related work to ours, we hereby present a recap of this method. SCDA is proposed for dealing with the fine-grained image retrieval problem. It employs pre-trained models to select the meaningful deep descriptors by localizing the main object in fine-grained images unsupervisedly. In SCDA, it is assumed that each image contains only one main object of interest and no objects from other categories. Thus, the object localization strategy is based on the activation tensor of a single image.

Concretely, for an image, the activation tensor is added up through the depth direction. Thus, the h × w × d 3-D tensor becomes an h × w 2-D matrix, which is called the "aggregation map" in SCDA. Then, the mean value $\bar{a}$ of the aggregation map is regarded as the threshold for localizing the object: if the activation response in the position (i, j) of the aggregation map is larger than $\bar{a}$, it indicates that the object might appear at that position.
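A minimal NumPy sketch of this SCDA localization rule (the function name is an assumption; the full SCDA method additionally keeps only the largest connected component of the resulting mask):

```python
import numpy as np

def scda_object_mask(activation_tensor):
    """activation_tensor: (h, w, d) activations of one image from a conv layer.

    Returns a boolean (h, w) mask of positions where the main object may appear.
    """
    aggregation_map = activation_tensor.sum(axis=2)   # add up along the depth direction
    threshold = aggregation_map.mean()                # mean value as the threshold
    return aggregation_map > threshold                # positions likely on the object
```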
What we need is a mapping function …
Fine-grained image retrieval (con't)
[Wei et al., IJCAI 2017] http://www.weixiushen.com/
  • 11. Deep learning and convolutional neural networks. All neural net activations arranged in 3-dimension: similarly, if the input tensor of convolution layer l in the three-dimensional case is $x^l \in \mathbb{R}^{H^l \times W^l \times D^l}$, then a convolution kernel of this layer is $f^l \in \mathbb{R}^{H \times W \times D^l}$. In the 3-D case, the convolution operation simply extends the 2-D convolution to all channels at the corresponding positions (i.e., $D^l$), and the $H W D^l$ elements processed by one convolution step are summed as the convolution result at that position (illustration: a 3 × 4 × 3 kernel applied at one position of the input yields a 1 × 1 × 1 output). Further, if there are $D$ kernels like $f^l$, then a 1 × 1 × 1 × D convolution output is obtained at the same position, and $D$ is exactly the number of channels $D^{l+1}$ of the feature $x^{l+1}$ of layer l+1. Formally, the convolution operation can be expressed as: $$y_{i^{l+1}, j^{l+1}, d} = \sum_{i=0}^{H} \sum_{j=0}^{W} \sum_{d^l=0}^{D^l} f_{i, j, d^l, d} \times x^l_{i^{l+1}+i,\, j^{l+1}+j,\, d^l}, \qquad (2.1)$$ where $(i^{l+1}, j^{l+1})$ is the position coordinate of the convolution result. Background (con't) http://www.weixiushen.com/
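Eq. (2.1) can be written out directly as nested loops; a minimal NumPy sketch, assuming stride 1, no padding, and zero-based kernel indices:

```python
import numpy as np

def conv3d_channels(x, f):
    """x: input of shape (H_l, W_l, D_l); f: kernels of shape (H, W, D_l, D).

    Returns y of shape (H_l - H + 1, W_l - W + 1, D), following Eq. (2.1).
    """
    H_l, W_l, D_l = x.shape
    H, W, _, D = f.shape
    y = np.zeros((H_l - H + 1, W_l - W + 1, D))
    for i_out in range(y.shape[0]):
        for j_out in range(y.shape[1]):
            patch = x[i_out:i_out + H, j_out:j_out + W, :]   # (H, W, D_l) window
            for d in range(D):
                # Sum of H * W * D_l element-wise products for this position/kernel.
                y[i_out, j_out, d] = np.sum(patch * f[:, :, :, d])
    return y
```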
  • 12. Background (cont'd) Unit processing of CNNs: convolution, pooling, and non-linear activation. Convolution: the slide walks through a step-by-step worked example of sliding a kernel over a small input (e.g., the 2nd and 9th convolution steps and the resulting convolved features). Max-pooling: y_{i^{l+1}, j^{l+1}, d} = max_{0 ≤ i < H, 0 ≤ j < W} x^l_{i^{l+1}×H+i, j^{l+1}×W+j, d}, (2.6) where 0 ≤ i^{l+1} < H^{l+1}, 0 ≤ j^{l+1} < W^{l+1}, 0 ≤ d < D^{l+1} = D^l; the slide shows a 2 × 2, stride-1 max-pooling example (the 1st and 16th pooling steps and the resulting pooled features). Besides these two most common pooling operations, stochastic pooling [94] lies between them: elements of the input are selected at random with probability proportional to their values, so larger activations are more likely to be picked rather than always taking the maximum; in a global sense, stochastic pooling approximates average-pooling. Non-linear activation function: ReLU(x) = max{0, x}, (8.3) i.e., ReLU(x) = x for x ≥ 0 and 0 for x < 0. (8.4) Compared with sigmoid-type activations, the gradient of ReLU is 1 for x ≥ 0 and 0 for x < 0, which removes the gradient-saturation effect in the x ≥ 0 part; it is also cheaper to compute than exponential-based functions, and experiments show it helps stochastic gradient descent converge about 6× faster [52]. Its drawback is that the gradient is 0 for x < 0: once a unit's convolution response becomes negative it can no longer affect training, a phenomenon called the "dead region" problem. http://www.weixiushen.com/
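A minimal NumPy sketch of these two units (my own illustration; it follows Eq. (2.6) with non-overlapping 2 x 2 pooling windows, not the stride-1 example drawn on the slide):

import numpy as np

def relu(x):
    # ReLU(x) = max{0, x}, applied element-wise; gradient is 1 for x >= 0 and 0 otherwise.
    return np.maximum(0.0, x)

def max_pool_2x2(x):
    # Non-overlapping 2 x 2 max-pooling on an (H, W, D) tensor (H and W assumed even).
    H, W, D = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, D).max(axis=(1, 3))

feat = np.random.randn(4, 4, 3)
print(relu(feat).min() >= 0)        # True: all negative responses are clipped to zero
print(max_pool_2x2(feat).shape)     # (2, 2, 3): spatial size halved, channels unchanged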
  • 13. Background (cont'd) CNN architecture: a stack of [CONV → ReLU → CONV → ReLU → POOL] blocks repeated several times, followed by FC (fully-connected) layers. Figures are courtesy of L. Fei-Fei. http://www.weixiushen.com/
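The stack on this slide can be sketched in a few lines of PyTorch (my own illustration; the channel widths, the 224 x 224 input size, and the 200-way classifier are arbitrary choices, not values from the tutorial):

import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # One [CONV -> ReLU -> CONV -> ReLU -> POOL] block, as drawn on the slide.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

model = nn.Sequential(
    conv_block(3, 64),                   # 224 -> 112
    conv_block(64, 128),                 # 112 -> 56
    conv_block(128, 256),                # 56 -> 28
    nn.Flatten(),
    nn.Linear(256 * 28 * 28, 200),       # FC (fully-connected) classifier, e.g. 200 classes
)

x = torch.randn(1, 3, 224, 224)
print(model(x).shape)                    # torch.Size([1, 200])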
  • 14. Introduction Fine-grained images vs. generic images. Traditional (coarse-grained) image recognition distinguishes meta classes such as bird, dog, orange, ...; fine-grained image recognition distinguishes sub-classes within one meta class, e.g., for dog: Husky, Satsuma, Alaska, ... http://www.weixiushen.com/
  • 15. Introduction (cont'd) Various real-world applications. http://www.weixiushen.com/
  • 16. Introduction (cont'd) Various real-world applications. http://www.weixiushen.com/
  • 17. Introduction (cont'd) Various real-world applications. iNaturalist 2019 (iNat2019) image classification challenge, held in conjunction with the FGVC6 workshop at CVPR 2019 (genera with at least 10 species). Results: 1. Megvii Research Nanjing, error = 0.10267; 2. Alibaba Machine Intelligence Technology Lab, error = 0.11315; 3. General Dynamics Mission Systems, error = 0.12678. The winner certificate was awarded to Bo-Yan Zhou, Bo-Rui Zhao, Quan Cui, Yan-Ping Xie, Zhao-Min Chen, Ren-Jie Song, and Xiu-Shen Wei of Megvii Research Nanjing. http://www.weixiushen.com/
  • 18. Introduction (cont'd) Various real-world applications. Herbarium Challenge 2019 (Kiat Chuan Tan, Yulong Liu, Barbara Ambrose, Melissa Tulig, Serge Belongie; Google Research, New York Botanical Garden, Cornell Tech). Top teams: #1 Megvii Research Nanjing (89.8%): Boyan Zhou, Quan Cui, Borui Zhao, Yanping Xie, Renjie Song, Xiu-Shen Wei; #2 PEAK (89.1%): Chunqiao Xu, Shao Zeng, Qiule Sun, Shuyu Ge, Peihua Li (Dalian University of Technology); #3 Miroslav Valan (89.0%), Swedish Museum of Natural History; #4 Hugo Touvron (88.9%): Hugo Touvron and Andrea Vedaldi (Facebook AI Research). http://www.weixiushen.com/
  • 19. Introduction (cont'd) Various real-world applications. http://www.weixiushen.com/
  • 20. Introduction (cont'd) Various real-world applications. http://www.weixiushen.com/
  • 21. Introduction (cont'd) Various real-world applications. http://www.weixiushen.com/
  • 22. Introduction (cont'd) Challenges of fine-grained image analysis: small inter-class variance and large intra-class variance. Figure examples from CUB-200-2011: Heermann Gull, Herring Gull, Slaty-backed Gull, and Western Gull are so similar that categorizing them is technically challenging even for humans. Figures are courtesy of [X. He and Y. Peng, CVPR 2017]. http://www.weixiushen.com/
  • 23. Introduction (cont'd) The key of fine-grained image analysis. http://www.weixiushen.com/
  • 24. Introduction (cont'd) Fine-grained benchmark datasets. CUB200-2011 [C. Wah et al., CNS-TR-2011-001, 2011]: 200 fine-grained bird classes, 11,788 images (5,994 training and 5,794 test images); each image is annotated with an image-level label, an object bounding box, part key points, and bird attributes. In addition, Birdsnap [6] (Columbia University, 2014) is a larger-scale fine-grained bird dataset with 500 bird species and 49,829 images (47,386 training and 2,443 test images), annotated similarly to CUB200-2011. http://www.weixiushen.com/
  • 25. Introduction (cont'd) Fine-grained benchmark datasets. Stanford Dogs [A. Khosla et al., CVPR Workshop 2011]: 20,580 images of 120 fine-grained dog classes; each image is annotated with an image-level label and an object bounding box, but no part key points or attribute annotations are provided. http://www.weixiushen.com/
  • 26. Introduction (cont'd) Fine-grained benchmark datasets. Oxford Flowers [M.-E. Nilsback and A. Zisserman, CVGIP 2008]: 8,189 images, 102 fine-grained classes. http://www.weixiushen.com/
  • 27. Introduction (cont'd) Fine-grained benchmark datasets. Aircrafts [S. Maji et al., arXiv:1306.5151, 2013]: 10,200 images, 100 fine-grained classes. http://www.weixiushen.com/
  • 28. Introduction (cont'd) Fine-grained benchmark datasets. Stanford Cars [J. Krause et al., ICCV Workshop 2013]: 16,185 images, 196 fine-grained classes. http://www.weixiushen.com/
  • 29. Introduction (cont'd) Fine-grained image analysis is hot: many papers published at top-tier conferences/journals (CVPR, ICCV, ECCV, IJCAI, etc.; TPAMI, IJCV, TIP, etc.); many frequently held workshops (Workshop on Fine-Grained Visual Categorization, ...); many academic challenges on fine-grained tasks (The Nature Conservancy Fisheries Monitoring, iFood Classification Challenge, iNature Classification Challenge, ...). http://www.weixiushen.com/
  • 30. Fine-grained image retrieval Image Retrieval (IR): what is image retrieval? Description Based Image Retrieval (DBIR): textual query (text). Content Based Image Retrieval (CBIR): query by example (image, sketch). http://www.weixiushen.com/
  • 31. Fine-grained image retrieval (cont'd) Common components of CBIR systems: feature extraction applied to both the query and the database images, comparison of the extracted features, and returning the closest matches. http://www.weixiushen.com/
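A minimal Python sketch of these components (my own illustration; it assumes features have already been extracted for the query and the database, and ranks by cosine similarity):

import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve(query_feat, db_feats, top_k=5):
    # Compare the query feature against every database feature and return the best matches.
    q = l2_normalize(query_feat[None, :])
    db = l2_normalize(db_feats)
    sims = (db @ q.T).ravel()            # cosine similarity = dot product of unit vectors
    return np.argsort(-sims)[:top_k]     # indices of the top_k most similar database images

db_feats = np.random.randn(1000, 512)    # e.g. 1,000 database images with 512-d features
query_feat = np.random.randn(512)
print(retrieve(query_feat, db_feats))    # the returned ranking for this query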
  • 32. Fine-grained image retrieval (cont'd) Deep learning for image retrieval. Figure 9: sample search results using CroW features compressed to just d = 32 dimensions; the query image is shown at the leftmost side with the query bounding box marked in a red rectangle. http://www.weixiushen.com/
  • 33. Fine-grained image retrieval (cont'd) FGIR vs. general-purpose IR (Selective Convolutional Descriptor Aggregation for Fine-Grained Image Retrieval). Figure 1: fine-grained image retrieval vs. general image retrieval. (a) Fine-grained image retrieval: two examples ("Mallard" and "Rolls-Royce Phantom Sedan 2012") from the CUB200-2011 and Cars datasets, respectively. (b) General image retrieval: two examples from the Oxford Building dataset. Fine-grained image retrieval (FGIR) processes visually similar objects as the probe and gallery: given an image of a Mallard (or a Rolls-Royce Phantom Sedan 2012) as the query, the FGIR system should return images of the same fine-grained class. http://www.weixiushen.com/
  • 34. Fine-grained image retrieval (cont'd) FGIR based on hand-crafted features [Xie et al., IEEE TMM 2015]. Fine-grained image retrieval is usually performed within a single object category, e.g., retrieving among different bird classes only, rather than over a mixed dataset containing birds, dogs, and other categories. In contrast, the method presented next (SCDA) is a deep-learning-based fine-grained image retrieval method. Illustration of the fine-grained image "search" task proposed by Xie et al. (figure reproduced with IEEE permission). http://www.weixiushen.com/
  • 35. Fine-grained image retrieval (cont'd) Selective Convolutional Descriptor Aggregation (SCDA) [Wei et al., IEEE TIP 2017]. Figure 1: pipeline of the proposed SCDA method: (a) input image, (b) convolutional activation tensor, (c) descriptor selection, (d) descriptor aggregation, producing an SCDA feature. Annotations are expensive and unrealistic in many real applications; in answer to this difficulty, there are attempts to categorize fine-grained images with only image-level labels. Here a more challenging but more realistic task is handled, i.e., fine-grained image retrieval in a pure unsupervised setting. http://www.weixiushen.com/
  • 36. Fine-grained image retrieval (cont'd) Notations [Wei et al., IEEE TIP 2017]. The term "feature map" indicates the convolution results of one channel; "activations" indicates the feature maps of all channels in a convolution layer; "descriptor" indicates the d-dimensional component vector of activations; "pool5" refers to the activations of the max-pooled last convolution layer, and "fc8" refers to the activations of the last fully connected layer. Given an input image I of size H × W, the activations of a convolution layer are formulated as an order-3 tensor T with h × w × d elements, which include a set of 2-D feature maps S = {S_n} (n = 1, ..., d); S_n of size h × w is the n-th feature map of the corresponding channel. From another point of view, T can also be considered as having h × w cells, each containing one d-dimensional deep descriptor. The deep descriptors are denoted as X = {x_(i,j)}, where (i, j) is a particular cell (i ∈ {1, ..., h}, j ∈ {1, ..., w}, x_(i,j) ∈ R^d). For instance, by employing the popular pre-trained VGG-16 model [25] to extract deep descriptors, a 7 × 7 × 512 activation tensor is obtained in pool5 if the input image is 224 × 224; thus, for this image there are 512 feature maps (i.e., S_n) of size 7 × 7, and also 49 deep descriptors of 512-d. http://www.weixiushen.com/
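A minimal PyTorch/torchvision sketch of extracting such a pool5 tensor with VGG-16 (my own illustration; in practice the ImageNet pre-trained weights would be loaded and the image properly resized and normalized):

import torch
import torchvision

vgg = torchvision.models.vgg16(weights=None)   # load the ImageNet weights here in practice
vgg.eval()

x = torch.randn(1, 3, 224, 224)                # a 224 x 224 input image
with torch.no_grad():
    pool5 = vgg.features(x)                    # (1, 512, 7, 7): the pool5 activations
T = pool5.squeeze(0).permute(1, 2, 0)          # order-3 tensor, h x w x d = 7 x 7 x 512
descriptors = T.reshape(-1, T.shape[-1])       # 49 deep descriptors, each 512-d
print(T.shape, descriptors.shape)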
  • 37. Fine-grained image retrieval (cont'd) [Wei et al., IEEE TIP 2017]. Figure 2: sampled feature maps of four fine-grained images (e.g., the 468th, 375th, 284th, 108th, 481st, 245th, 6th, and 163rd channels). Although the images are resized for better visualization, the method can deal with images of any resolution. http://www.weixiushen.com/
  • 38. Fine-grained image retrieval (cont'd) Obtaining the activation map by summarizing feature maps [Wei et al., IEEE TIP 2017]. If a region has a high activation response, it can be expected to be an object rather than the background. Therefore, the pool5 activation tensor is added up through the depth direction: the h × w × d 3-D tensor becomes an h × w 2-D "activation map" A = Σ_n S_n, where S_n is the n-th feature map in pool5. The activation map contains h × w summed responses corresponding to the h × w positions; the higher the activation response at position (i, j), the more likely the corresponding region is part of the object. The mean value ā of all positions in A is then used as the threshold to decide which positions localize the object, giving a mask map M of the same size as A: M_{i,j} = 1 if A_{i,j} > ā, and 0 otherwise. http://www.weixiushen.com/
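A minimal NumPy sketch of this step (my own illustration of the formula above; scda_mask is a hypothetical helper name):

import numpy as np

def scda_mask(pool5):
    # pool5: activations of shape (h, w, d). Returns the h x w binary mask map M.
    A = pool5.sum(axis=-1)                 # activation map: add the feature maps over depth
    a_bar = A.mean()                       # threshold = mean response over all h*w positions
    return (A > a_bar).astype(np.uint8)    # M[i, j] = 1 if A[i, j] > a_bar, else 0

pool5 = np.random.rand(7, 7, 512)
M = scda_mask(pool5)
print(M.shape, int(M.sum()), "positions selected")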
  • 39. Fine-grained image retrieval (cont'd) Visualization of the mask map M [Wei et al., IEEE TIP 2017]. Figure 3: visualization of the mask map M and its corresponding largest connected component M̃; the selected regions are highlighted in red. http://www.weixiushen.com/
  • 40. Fine-grained image retrieval (cont'd) Selecting useful deep convolutional descriptors [Wei et al., IEEE TIP 2017]. Figure 4: (a) input image, (b) feature maps, (c) activation map, (d) mask map, (e) the largest connected component of the mask map. The largest connected component M̃ is used to select useful and meaningful deep convolutional descriptors: the descriptor x_(i,j) is kept when M̃_{i,j} = 1, while M̃_{i,j} = 0 means position (i, j) might contain background or noisy parts, i.e., F = { x_(i,j) | M̃_{i,j} = 1 }. (2) http://www.weixiushen.com/
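A minimal sketch of this selection step (my own illustration; it uses scipy.ndimage.label to keep only the descriptors inside the largest connected component of the mask, recomputing the mask inline as on the previous slide):

import numpy as np
from scipy import ndimage

def select_descriptors(pool5, M):
    # pool5: (h, w, d) activations; M: (h, w) binary mask map.
    # Returns F, the descriptors inside the largest connected component of M, shape (n, d).
    labels, num = ndimage.label(M)                          # connected components of M
    if num == 0:
        return pool5.reshape(-1, pool5.shape[-1])           # degenerate case: keep everything
    sizes = ndimage.sum(M, labels, index=range(1, num + 1))
    M_tilde = labels == (int(np.argmax(sizes)) + 1)         # largest connected component
    return pool5[M_tilde]                                   # keep x_(i,j) where M_tilde = 1

pool5 = np.random.rand(7, 7, 512)
A = pool5.sum(axis=-1)
M = (A > A.mean()).astype(np.uint8)                         # mask map from the previous slide
F = select_descriptors(pool5, M)
print(F.shape)                                              # (num_selected, 512)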
  • 41. Fine-grained image retrieval (cont'd) Qualitative evaluation [Wei et al., IEEE TIP 2017]. Figure 6: samples of the predicted object localization bounding boxes on CUB200-2011 and Stanford Dogs; the ground-truth bounding box is marked as the red dashed rectangle. http://www.weixiushen.com/
  • 42. Fine-grained image retrieval (cont'd) Aggregating convolutional descriptors [Wei et al., IEEE TIP 2017]. After selecting F = { x_(i,j) | M̃_{i,j} = 1 }, several encoding or pooling approaches are compared for aggregating these convolutional features. VLAD [14] uses k-means to find a codebook of K centroids {c_1, ..., c_K} and maps x_(i,j) into a single vector v_(i,j) = [0 ... 0 (x_(i,j) − c_k) ... 0] ∈ R^{K×d}, where c_k is the closest centroid to x_(i,j); the final representation is Σ_{i,j} v_(i,j). Fisher Vector [15] is similar to VLAD but uses a soft assignment (a Gaussian Mixture Model) instead of k-means, and also includes second-order statistics. Pooling approaches: the traditional max-pooling and average-pooling are also tried for aggregating the deep descriptors. After encoding or pooling into a single vector, VLAD and FV are followed by square-root normalization and l2-normalization; for max- and average-pooling only l2-normalization is applied (square-root normalization did not work well). Finally, the cosine similarity is used for nearest-neighbor search. For the parameter choice of VLAD/FV, the number of clusters in VLAD and the number of Gaussian components in FV are both set to 2; larger values lead to lower accuracy. http://www.weixiushen.com/
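A minimal NumPy sketch of the pooling-based aggregation described above (my own illustration of the avg&maxPool variant; aggregate_scda is a hypothetical helper name):

import numpy as np

def aggregate_scda(F, eps=1e-12):
    # F: selected descriptors, shape (num_selected, d). Returns the 2d-dim SCDA feature.
    avg = F.mean(axis=0)                         # average-pooling over the selected positions
    mx = F.max(axis=0)                           # max-pooling over the selected positions
    feat = np.concatenate([avg, mx])             # avg&maxPool: 1,024-d when d = 512
    return feat / (np.linalg.norm(feat) + eps)   # l2-normalization (cosine similarity follows)

F = np.random.rand(30, 512)                      # e.g. 30 selected 512-d descriptors
print(aggregate_scda(F).shape)                   # (1024,)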
  • 43. Fine-grained image retrieval (cont'd) Comparing different encoding or pooling methods [Wei et al., IEEE TIP 2017]. Table 2 (top-1 / top-5, CUB200-2011 and Stanford Dogs): VLAD (1,024-d): 55.92% / 62.51%, 69.28% / 74.43%; Fisher Vector (2,048-d): 52.04% / 59.19%, 68.37% / 73.74%; avgPool (512-d): 56.42% / 63.14%, 73.76% / 78.47%; maxPool (512-d): 58.35% / 64.18%, 70.37% / 75.59%; avg&maxPool (1,024-d, the chosen SCDA aggregation): 59.72% / 65.79%, 74.86% / 79.24%. http://www.weixiushen.com/
  • 44. Fine-grained image retrieval (cont'd) Multiple layer ensemble [Wei et al., IEEE TIP 2017]. Figure 6: the mask map and its corresponding largest connected component for different CNN layers: (a) M of pool5, (b) M̃ of pool5, (c) M of relu5_2, (d) M̃ of relu5_2. As studied in [31, 32], an ensemble of multiple layers boosts the final performance, so another SCDA feature is produced from the relu5_2 layer, three layers in front of pool5 in VGG-16 [25]. Following the pool5 procedure, a mask map M_relu5_2 is obtained from relu5_2; its activations are less related to semantic meaning than those of pool5 and contain many noisy parts, but the bird is more accurately detected than with pool5. Therefore, M̃_pool5 and M_relu5_2 are combined to get the final mask map of relu5_2: M̃_pool5 is first upsampled to the size of M_relu5_2, and a descriptor is kept only when its position is 1 in both M̃_pool5 and M_relu5_2; the aggregation process remains the same. Finally, the SCDA features of relu5_2 and pool5 are concatenated into a single representation, SCDA+ = [SCDA_pool5, α · SCDA_relu5_2], (3) where the coefficient α for SCDA_relu5_2 is set to 0.5 for FGIR, followed by l2-normalization. In addition, another SCDA+ of the horizontal flip of the original image is incorporated, denoted "SCDA_flip+" (4,096-d). http://www.weixiushen.com/
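A minimal sketch of Eq. (3) (my own illustration; it assumes the pool5 and relu5_2 SCDA features have already been computed as above, and scda_plus is a hypothetical helper name):

import numpy as np

def scda_plus(scda_pool5, scda_relu5_2, alpha=0.5, eps=1e-12):
    # Concatenate the pool5 feature with the alpha-weighted relu5_2 feature, then l2-normalize.
    feat = np.concatenate([scda_pool5, alpha * scda_relu5_2])
    return feat / (np.linalg.norm(feat) + eps)

scda_pool5 = np.random.rand(1024)                # e.g. two 1,024-d avg&maxPool features
scda_relu5_2 = np.random.rand(1024)
print(scda_plus(scda_pool5, scda_relu5_2).shape) # (2048,): the SCDA+ representation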
  • 45. Fine-grained image retrieval (cont'd) [Wei et al., IEEE TIP 2017]. Figure 7: some retrieval results on the four fine-grained datasets; the query images are shown on the left. http://www.weixiushen.com/
  • 46. Fine-grained image retrieval (cont'd) Quality demonstration of the SCDA feature [Wei et al., IEEE TIP 2017]. Figure 8: from top to bottom of each column, six returned original images in the order of one sorted dimension of the "256-d SVD + whitening" feature. http://www.weixiushen.com/
  • 47. Fine-grained image retrieval (cont'd) Deep Descriptor Transforming (DDT) [Wei et al., IJCAI 2017]. Image co-localization (a.k.a. unsupervised object discovery) is a fundamental computer vision problem which simultaneously localizes objects of the same category across a set of distinct images. DDT reuses CNN pre-trained models for this task: PCA is utilized as projection directions for transforming the deep descriptors {x_(i,j)} in order to evaluate their correlations, and on each projection direction the corresponding principal component's values are treated as the cues for co-localization, especially the first principal component; thanks to this kind of transforming, DDT is also able to handle data noise. For a set of N images containing objects from the same category, the convolutional descriptors (X^1, ..., X^N) are first collected by feeding the images into a pre-trained CNN model. The mean vector of all the descriptors is calculated as x̄ = (1/K) Σ_n Σ_{i,j} x^n_(i,j), (1) where K = h × w × N (each image is assumed to have the same number h × w of descriptors only for presentation clarity; the method handles input images of arbitrary resolutions). After obtaining the covariance matrix Cov(x) = (1/K) Σ_n Σ_{i,j} (x^n_(i,j) − x̄)(x^n_(i,j) − x̄)^T, (2) its eigenvectors ξ_1, ..., ξ_d are computed, corresponding to the sorted eigenvalues λ_1 ≥ ... ≥ λ_d ≥ 0. Unlike SCDA, which assumes a single object of interest in each image and localizes it from that image alone, DDT leverages the correlations among descriptors across the whole image set. Fine-grained image retrieval (cont'd) [Wei et al., IJCAI 2017] http://www.weixiushen.com/
  • 48. CNN pre-trained models Deep Descriptor Transforming Four benchmark c Object Discovery, PA Col-localization re Detecting noisy im 2 ā€“ The proposed method: DDT 4 ā€“ Experimen 5 ā€“ Further st the s. In ined rent trac- onal d act e co- ffec- ming tors ons, in a ffec- nch- nsis- hods mon- cate- ta. model by for other CNN pre-trained models Deep Descriptor Transforming Figure 1: Pipeline of the proposed DDT method for image co-localization. In this instance, the goal is to localize the airplane within each image. Note that, there might be few noisy images in the image set. (Best viewed in color.) adopted to various usages, e.g., as universal feature extrac- tors [Wang et al., 2015; Li et al., 2016], object proposal gen- Input images PCA is widely used in machine learning and computer vision for dimension reduction [Chen et al., 2013; Gu et al., 2011; Zhang et al., 2009; Davidson, 2009], noise reduc- tion [Zhang et al., 2013; Nie et al., 2011] and so on. Speciļ¬- cally, in this paper, we utilize PCA as projection directions for transforming these deep descriptors {x(i,j)} to evaluate their correlations. Then, on each projection direction, the corre- sponding principal componentā€™s values are treated as the cues for image co-localization, especially the ļ¬rst principal com- ponent. Thanks to the property of this kind of transforming, DDT is also able to handle data noise. In DDT, for a set of N images containing objects from the same category, we ļ¬rst collect the corresponding convolutional descriptors (X1 , . . . , XN ) by feeding them into a pre-trained CNN model. Then, the mean vector of all the descriptors is calculated by: ĀÆx = 1 K X n X i,j xn (i,j) , (1) where K = h ā‡„ w ā‡„ N. Note that, here we assume each image has the same number of deep descriptors (i.e., h ā‡„ w) for presentation clarity. Our proposed method, however, can handle input images with arbitrary resolutions. Then, after obtaining the covariance matrix: Cov(x) = 1 K X n X i,j (xn (i,j) ĀÆx)(xn (i,j) ĀÆx)> , (2) we can get the eigenvectors ā‡ 1, . . . , ā‡ d of Cov(x) which cor- respond to the sorted eigenvalues 1 Ā· Ā· Ā· d 0. part has negative values pre appear. In addition, if P1 of indicates that no common o can be used for detecting resized by the nearest inter same as that of the input im largest connected componen what is done in [Wei et al., 2 relation values and the zero bounding box which contain of positive regions is retur prediction. 3.4 Discussions and A In this section, we investig comparing with SCDA. As shown in Fig. 2, the ob and DDT are highlighted in ers the information from a s ā€œpersonā€ and even ā€œguide-b jects. Furthermore, we nor the aggregation map of SC calculate the mean value (w ization threshold in SCDA) values in aggregation map red vertical line corresponds beyond the threshold, there explanation about why SCD Whilst, for DDT, it lever form these deep descriptor n Wu , Chunhua Shen , Zhi-Hua Zhou Novel Software Technology, Nanjing University, Nanjing, China niversity of Adelaide, Adelaide, Australia zh}@lamda.nju.edu.cn, {yao.li01,chunhua.shen}@adelaide.edu.au able with the plications. 
In of pre-trained ally, different eature extrac- convolutional ons could act the image co- mple but effec- Transforming of descriptors stent regions, CNN pre-trained models Deep Descriptor Transforming XN-1X1 X2 XNā€¦ Deep descriptors Calculating descriptorsā€™ mean vector i1 , Chen-Lin Zhang1 , Yao Li2 , Chen-Wei Xie1 , n Wu1 , Chunhua Shen2 , Zhi-Hua Zhou1 Novel Software Technology, Nanjing University, Nanjing, China niversity of Adelaide, Adelaide, Australia zh}@lamda.nju.edu.cn, {yao.li01,chunhua.shen}@adelaide.edu.au able with the plications. In of pre-trained ally, different eature extrac- convolutional ons could act the image co- mple but effec- Transforming CNN pre-trained models transforming these deep descriptors {x(i,j)} to evaluate their correlations. Then, on each projection direction, the corre- sponding principal componentā€™s values are treated as the cues for image co-localization, especially the ļ¬rst principal com- ponent. Thanks to the property of this kind of transforming, DDT is also able to handle data noise. In DDT, for a set of N images containing objects from the same category, we ļ¬rst collect the corresponding convolutional descriptors (X1 , . . . , XN ) by feeding them into a pre-trained CNN model. Then, the mean vector of all the descriptors is calculated by: ĀÆx = 1 K X n X i,j xn (i,j) , (1) where K = h ā‡„ w ā‡„ N. Note that, here we assume each image has the same number of deep descriptors (i.e., h ā‡„ w) for presentation clarity. Our proposed method, however, can handle input images with arbitrary resolutions. Then, after obtaining the covariance matrix: Cov(x) = 1 K X n X i,j (xn (i,j) ĀÆx)(xn (i,j) ĀÆx)> , (2) we can get the eigenvectors ā‡ 1, . . . , ā‡ d of Cov(x) which cor- respond to the sorted eigenvalues 1 Ā· Ā· Ā· d 0. As aforementioned, since the ļ¬rst principal component has the largest variance, we take the eigenvector ā‡ 1 corresponding to the largest eigenvalue as the main projection direction. For the deep descriptor at a particular position (i, j) of an image, its ļ¬rst principal component p1 is calculated as follows: same as that of the input i largest connected compone what is done in [Wei et al., 2 relation values and the zero bounding box which contai of positive regions is retu prediction. 3.4 Discussions and A In this section, we investi comparing with SCDA. As shown in Fig. 2, the o and DDT are highlighted in ers the information from a ā€œpersonā€ and even ā€œguide-b jects. Furthermore, we nor the aggregation map of S calculate the mean value ( ization threshold in SCDA) values in aggregation map red vertical line correspond beyond the threshold, ther explanation about why SC Whilst, for DDT, it leve form these deep descripto class, DDT can accurately histogram is also drawn. B tive values. We normalize Apparently, few values ar (i.e., 0). More importantly, Obtaining the covariance matrix and then getting the eigenvectors sponding principal componentā€™s values are treated as the cues for image co-localization, especially the ļ¬rst principal com- ponent. Thanks to the property of this kind of transforming, DDT is also able to handle data noise. In DDT, for a set of N images containing objects from the same category, we ļ¬rst collect the corresponding convolutional descriptors (X1 , . . . , XN ) by feeding them into a pre-trained CNN model. Then, the mean vector of all the descriptors is calculated by: ĀÆx = 1 K X n X i,j xn (i,j) , (1) where K = h ā‡„ w ā‡„ N. 
Note that, here we assume each image has the same number of deep descriptors (i.e., h ā‡„ w) for presentation clarity. Our proposed method, however, can handle input images with arbitrary resolutions. Then, after obtaining the covariance matrix: Cov(x) = 1 K X n X i,j (xn (i,j) ĀÆx)(xn (i,j) ĀÆx)> , (2) we can get the eigenvectors ā‡ 1, . . . , ā‡ d of Cov(x) which cor- respond to the sorted eigenvalues 1 Ā· Ā· Ā· d 0. As aforementioned, since the ļ¬rst principal component has the largest variance, we take the eigenvector ā‡ 1 corresponding to the largest eigenvalue as the main projection direction. For the deep descriptor at a particular position (i, j) of an image, its ļ¬rst principal component p1 is calculated as follows: what is done in [Wei et al., 2017]). B relation values and the zero thresho bounding box which contains the la of positive regions is returned as prediction. 3.4 Discussions and Analyse In this section, we investigate the comparing with SCDA. As shown in Fig. 2, the object loc and DDT are highlighted in red. B ers the information from a single i ā€œpersonā€ and even ā€œguide-boardā€ a jects. Furthermore, we normalize the aggregation map of SCDA in calculate the mean value (which i ization threshold in SCDA). The hi values in aggregation map is also red vertical line corresponds to the beyond the threshold, there are sti explanation about why SCDA high Whilst, for DDT, it leverages th form these deep descriptors into class, DDT can accurately locate histogram is also drawn. But, P1 h tive values. We normalize P1 into Apparently, few values are larger (i.e., 0). More importantly, many va ponent. Thanks to the property of this kind of transforming, DDT is also able to handle data noise. In DDT, for a set of N images containing objects from the same category, we ļ¬rst collect the corresponding convolutional descriptors (X1 , . . . , XN ) by feeding them into a pre-trained CNN model. Then, the mean vector of all the descriptors is calculated by: ĀÆx = 1 K X n X i,j xn (i,j) , (1) where K = h ā‡„ w ā‡„ N. Note that, here we assume each image has the same number of deep descriptors (i.e., h ā‡„ w) for presentation clarity. Our proposed method, however, can handle input images with arbitrary resolutions. Then, after obtaining the covariance matrix: Cov(x) = 1 K X n X i,j (xn (i,j) ĀÆx)(xn (i,j) ĀÆx)> , (2) we can get the eigenvectors ā‡ 1, . . . , ā‡ d of Cov(x) which cor- respond to the sorted eigenvalues 1 Ā· Ā· Ā· d 0. As aforementioned, since the ļ¬rst principal component has the largest variance, we take the eigenvector ā‡ 1 corresponding to the largest eigenvalue as the main projection direction. For the deep descriptor at a particular position (i, j) of an image, its ļ¬rst principal component p1 is calculated as follows: p1 (i,j) = ā‡ > 1 x(i,j) ĀÆx . (3) According to their spatial locations, all p1 (i,j) from an image are combined into a 2-D matrix whose dimensions are h ā‡„ w. bounding box which contains the of positive regions is returned a prediction. 3.4 Discussions and Analys In this section, we investigate th comparing with SCDA. As shown in Fig. 2, the object l and DDT are highlighted in red. ers the information from a single ā€œpersonā€ and even ā€œguide-boardā€ jects. Furthermore, we normaliz the aggregation map of SCDA calculate the mean value (which ization threshold in SCDA). 
The values in aggregation map is als red vertical line corresponds to th beyond the threshold, there are s explanation about why SCDA hig Whilst, for DDT, it leverages t form these deep descriptors into class, DDT can accurately locat histogram is also drawn. But, P1 tive values. We normalize P1 int Apparently, few values are larg (i.e., 0). More importantly, many indicates the strong negative co validates the effectiveness of DD As another example shown in Fig locates ā€œpersonā€ in the image b class. While, DDT can correctl r Transforming for Image Co-Localizationā‡¤ 1 , Chen-Lin Zhang1 , Yao Li2 , Chen-Wei Xie1 , Wu1 , Chunhua Shen2 , Zhi-Hua Zhou1 Novel Software Technology, Nanjing University, Nanjing, China niversity of Adelaide, Adelaide, Australia zh}@lamda.nju.edu.cn, {yao.li01,chunhua.shen}@adelaide.edu.au able with the plications. In of pre-trained ally, different eature extrac- convolutional CNN pre-trained modelsTransforming Transforming for Image Co-Localizationā‡¤ 1 , Chen-Lin Zhang1 , Yao Li2 , Chen-Wei Xie1 , Wu1 , Chunhua Shen2 , Zhi-Hua Zhou1 Novel Software Technology, Nanjing University, Nanjing, China niversity of Adelaide, Adelaide, Australia zh}@lamda.nju.edu.cn, {yao.li01,chunhua.shen}@adelaide.edu.au able with the plications. In of pre-trained lly, different CNN pre-trained models ming (DDT) is that we can leverage mage set, instead of a We call that matrix as indicator matrix: 2 6 p1 (1,1) p1 (1,2) . . . p1 (1,w) p1 p1 . . . p1 3 7 sforming for Image Co-Localizationā‡¤ n-Lin Zhang1 , Yao Li2 , Chen-Wei Xie1 , Chunhua Shen2 , Zhi-Hua Zhou1 Software Technology, Nanjing University, Nanjing, China y of Adelaide, Adelaide, Australia amda.nju.edu.cn, {yao.li01,chunhua.shen}@adelaide.edu.au h the s. In ined erent trac- onal CNN pre-trained models 0 0.2 0.4 0.6 0.8 1 0 20 40 60 80 100 120 -1 -0.5 0 0.5 10 20 40 60 80 100 120 (g) Chair (h) Dog (i) Horse (j) Motorbike (k) Pla 0 0.2 0.4 0.6 0.8 1 0 20 40 60 8 1 -1 -0.5 0 0.5 1 0 50 100 150 200 250 0 0.2 0.4 0.6 0.8 1 0 50 100 -1 -0.5 0 0.5 1 0 50 100 150 200 250 0 0.5 1 0 20 40 60 80 -1 0 200 400 600 0 0.5 1 0 20 40 -1 0 100 200 Figure 2: Examples from twelve randomly sampled classes of VOC 2007. The ļ¬rst column of ea SCDA, the second column are by our DDT. The red vertical lines in the histogram plots indicate th localizing objects. The selected regions in images are highlighted in red. (Best viewed in color a ā€œdiningtableā€ image region. In Fig. 2, more examples are pre- sented. In that ļ¬gure, some failure cases can be also found, e.g., the chair class in Fig. 2 (g). In addition, the normalized P1 can be also used as localiza- tion probability scores. Combining it with conditional random ļ¬led techniques might produce more accurate object bound- aries. Thus, DDT can be modiļ¬ed slightly in that way, and then perform the co-segmentation problem. More importantly, different from other co-segmentation methods, DDT can detect noisy images while other methods can not. 4 Experiments In this section, we ļ¬rst introduce the evaluation metric and datasets used in image co-localization. Then, we compare the empirical results of our DDT with other state-of-the-arts on these datasets. The computational cost of DDT is reported too. Moreover, the results in Sec. 4.4 and Sec. 4.5 illustrate the generalization ability and robustness of the proposed method. Finally, our further study in Sec. 4.6 reveals DDT might deal with part-based image co-localization, which is a novel and challenging problem. 
In our experiments, the images keep their original resolutions. For the pre-trained deep model, the publicly available VGG-19 model [Simonyan and Zisserman, 2015] is employed to extract deep convolutional descriptors from the last convolution layer (before pool5). We use the open-source library MatConvNet [Vedaldi and Lenc, 2015] for conducting experiments. All the experiments are run on a computer with an Intel Xeon E5-2660 v3, 500 GB main memory, and a K80 GPU.

4.1 Evaluation Metric and Datasets
Following previous image co-localization works [Li et al., 2016; Cho et al., 2015; Tang et al., 2014], we take the correct localization (CorLoc) metric for evaluating the proposed method. CorLoc is defined as the percentage of images correctly localized according to the PASCAL criterion [Everingham et al., 2015]:

\frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})} > 0.5 ,

where B_p is the predicted bounding box and B_{gt} is the ground-truth bounding box. All CorLoc results are reported in percentages.

Our experiments are conducted on datasets commonly used in image co-localization: the Object Discovery dataset [Rubinstein et al., 2013], the PASCAL VOC 2007 / VOC 2012 datasets [Everingham et al., 2015], and the ImageNet Subsets [Li et al., 2016]. For experiments on the VOC datasets, we follow [Cho et al., 2015; Li et al., 2016; Joulin et al., 2014] and use all images in the trainval set (excluding images that only contain object instances annotated as difficult). For Object Discovery, we use the 100-image subset following [Rubinstein et al., 2013; Cho et al., 2015] in order to make a fair comparison with other methods. In addition, Object Discovery contains noisy images in the Airplane, Car and Horse categories; these noisy images contain no object belonging to their category. In Sec. 4.5, we quantitatively measure the ability of DDT to identify these noisy images. To further investigate the generalization ability of DDT, the ImageNet Subsets [Li et al., 2016] provide six subsets/categories that are disjoint from the 1000-label ILSVRC classes [Russakovsky et al., 2015]; that is to say, these categories are unseen by pre-trained CNN models. Experiments on them show that DDT is insensitive to the object category.

Table 1: Comparisons of CorLoc on Object Discovery (Airplane class).
  Methods                     Airplane
  [Joulin et al., 2010]       32.93
  [Joulin et al., 2012]       57.32
  [Rubinstein et al., 2013]   74.39
  [Tang et al., 2014]         71.95
  SCDA                        87.80
  [Cho et al., 2015]          82.93
  Our DDT                     91.46

4.2 Comparisons with State-of-the-Arts
Comparisons to image co-localization methods. We first compare the results of DDT with image co-localization methods (including SCDA) on Object Discovery; for SCDA, we also use VGG-19 to extract the descriptors and perform experiments. As shown in Table 1, DDT outperforms the other methods by about 4%, especially for the airplane class compared with [Cho et al., 2015]. In addition, since most images of this dataset contain only a single object per category, SCDA can also perform well.

For VOC 2007 and 2012, there are usually multiple objects per image, which is more difficult than Object Discovery. The comparisons on these two datasets are reported in Table 2 and Table 3. It is clear that, on average, our DDT outperforms the previous state-of-the-arts (based on deep learning) on both datasets. Moreover, DDT handles small common objects such as "bottle" better, because most images of these datasets contain multiple diverse objects, which do not obey SCDA's assumption.

Table 2: Comparisons of CorLoc on VOC 2007 (aero, bike and bird classes).
  Methods                 aero   bike   bird
  [Joulin et al., 2014]   32.8   17.3   20.9
  SCDA                    54.4   27.2   43.4
  [Cho et al., 2015]      50.3   42.8   30.0
  [Li et al., 2016]       73.1   45.0   43.4
  Our DDT                 67.3   63.3   61.3

Table 3: Comparisons of CorLoc on VOC 2012 (aero, bike and bird classes).
  Methods                 aero   bike   bird
  SCDA                    60.8   41.7   38.6
  [Cho et al., 2015]      57.0   41.2   36.0
  [Li et al., 2016]       65.7   57.8   47.9
  Our DDT                 76.7   67.1   57.9

Table 4: Comparisons of the CorLoc metric with weakly supervised object localization methods. "X" in the "Neg." column indicates that these WSOL methods require access to a negative image set, whereas DDT does not.
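Since CorLoc reduces to a per-image IoU test against the PASCAL criterion, a small sketch may help; the (x1, y1, x2, y2) box convention below is our assumption:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def corloc(predicted_boxes, ground_truth_boxes, threshold=0.5):
    """Percentage of images whose predicted box passes the PASCAL criterion,
    i.e., IoU with the ground-truth box greater than the threshold."""
    hits = sum(iou(p, g) > threshold
               for p, g in zip(predicted_boxes, ground_truth_boxes))
    return 100.0 * hits / len(predicted_boxes)
```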
  Methods                Neg.  aero  bike  bird  boat  bottle  bus   car   cat   chair  cow   table  dog   horse  mbike
  [Shi et al., 2013]     X     67.3  54.4  34.3  17.8  1.3     46.6  60.7  68.9  2.5    32.4  16.2   58.9  51.5   64.6
  [Cinbis et al., 2015]  X     56.6  58.3  28.4  20.7  6.8     54.9  69.1  20.8  9.2    50.5  10.2   29.0  58.0   64.9
  [Wang et al., 2015]    X     37.7  58.8  39.0  4.7   4.0     48.4  70.0  63.7  9.0    54.2  33.3   37.4  61.6   57.6
  [Bilen et al., 2015]   X     66.4  59.3  42.7  20.4  21.3    63.4  74.3  59.6  21.1   58.2  14.0   38.5  49.5   60.0
  [Ren et al., 2016]     X     79.2  56.9  46.0  12.2  15.7    58.4  71.4  48.6  7.2    69.9  16.7   47.4  44.2   75.5
  [Wang et al., 2014]    X     80.1  63.9  51.5  14.9  21.0    55.7  74.2  43.5  26.2   53.4  16.3   56.7  58.3   69.5
  Our DDT                      67.3  63.3  61.3  22.7  8.5     64.8  57.0  80.5  9.4    49.0  22.5   72.6  73.8   69.0

(Figure 3: Random samples of predicted object co-localization bounding boxes on ImageNet Subsets, including successful predictions and one failure case for categories such as Chipmunk, Rhino, Racoon and Rake. The red rectangle is the prediction and the yellow rectangle is the ground-truth bounding box; in the successful predictions, the yellow rectangle is almost the same as the red prediction.)

Table 5: Comparisons of CorLoc on image sets disjoint with ImageNet (Chipmunk subset).
  Methods              Chipmunk
  [Cho et al., 2015]   26.6
  SCDA                 32.3
  [Li et al., 2016]    44.9
  Our DDT              70.3

(Figure 4: ROC curves of false positive rate vs. true positive rate, illustrating the ability of DDT to identify noisy images on Object Discovery (e.g., the Airplane category), compared with the method of [Tang et al., 2014].)

A few works study CNN pre-trained model reuse in the unsupervised setting, e.g., SCDA [Wei et al., 2017]. SCDA is proposed for handling the fine-grained image retrieval task, where it uses pre-trained models (trained on ImageNet, which is not fine-grained) to locate main objects in fine-grained images. It is the most related work to ours, even though SCDA is not designed for image co-localization. Different from our DDT, SCDA assumes there is only one object of interest in each image and that objects from other categories do not exist. Thus, SCDA locates the object using cues from this single-image assumption. Apparently, it cannot work well for images containing diverse objects (cf. Table 2 and Table 3), and it also cannot handle data noise (cf. Sec. 4.5).

2.2 Image Co-Localization
Image co-localization is a fundamental problem in computer vision, where it needs to discover the common object emerging in only a positive set of example images (without any negative examples or further supervision). Image co-localization shares some similarities with image co-segmentation [Zhao and Fu, 2015; Kim et al., 2011; Joulin et al., 2012]. Instead of generating a precise segmentation of the related objects in each image, co-localization methods aim to return a bounding box around the object. Moreover, co-segmentation has a strong assumption that every image contains the object of interest, and hence is unable to handle noisy images.
Additionally, co-localization is also related to weakly supervised object localization (WSOL) [Zhang et al., 2016; Bilen et al., 2015; Wang et al., 2014; Siva and Xiang, 2011]. But the key difference between them is that WSOL requires manually-labeled negative images, whereas co-localization does not.

In the following, the term "feature map" indicates the convolution results of one channel; the term "activations" indicates the feature maps of all channels in a convolution layer; and the term "descriptor" indicates the d-dimensional component vector of the activations. Given an input image I of size H × W, the activations of a convolution layer are formulated as an order-3 tensor T with h × w × d elements. T can be considered as having h × w cells, and each cell contains one d-dimensional deep descriptor. For the n-th image, we denote its corresponding deep descriptors as X^n = {x^n_{(i,j)} ∈ R^d}, where (i, j) is a particular cell (i ∈ {1, ..., h}, j ∈ {1, ..., w}) and n ∈ {1, ..., N}.

3.2 SCDA Recap
Since SCDA [Wei et al., 2017] is the most related work to ours, we hereby present a recap of this method. SCDA is proposed for dealing with the fine-grained image retrieval problem. It employs pre-trained models to select the meaningful deep descriptors by localizing the main object in fine-grained images unsupervisedly. In SCDA, it is assumed that each image contains only one main object of interest and no objects of other categories. Thus, the object localization strategy is based on the activation tensor of a single image. Concretely, for an image, the activation tensor is added up through the depth direction; thus, the h × w × d 3-D tensor becomes an h × w 2-D matrix, which is called the "aggregation map" in SCDA. Then, the mean value ā of the aggregation map is regarded as the threshold for localizing the object: if the activation response in position (i, j) of the aggregation map is larger than ā, it indicates that the object might appear in that position.

Fine-grained image retrieval (con't)
[Wei et al., IJCAI 2017]
http://www.weixiushen.com/
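As a rough sketch of the SCDA recap above (assuming the activations of one image are available as an h × w × d NumPy array; the function name is hypothetical):

```python
import numpy as np

def scda_object_mask(activations):
    """SCDA-style localization from a single image's activations.

    activations: (h, w, d) array from a convolution layer.
    Returns a boolean (h, w) mask of positions whose aggregated response is
    larger than the mean of the aggregation map.
    """
    # Add the tensor up through the depth direction -> h x w aggregation map.
    aggregation_map = activations.sum(axis=2)
    # The mean value of the aggregation map serves as the threshold.
    threshold = aggregation_map.mean()
    return aggregation_map > threshold
```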