11. You Only Look Once: Unified, Real-Time Object Detection
• Makes the search for object regions efficient by making it grid-based
– Uses a GoogLeNet-based CNN that outputs bounding boxes and confidence scores
– Can run in real time even on a laptop PC
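As a rough illustration of the grid idea (a hypothetical sketch, not the authors' code: the grid size S, the box format, and all pixel values below are illustrative assumptions), each ground-truth box is assigned to the grid cell that contains its center:

```python
# Hypothetical sketch of YOLO-style grid assignment: the image is divided
# into an S x S grid, and the grid cell containing an object's center is
# responsible for predicting its bounding box.

def encode_box(box, img_w, img_h, S=7):
    """box = (x_center, y_center, w, h) in pixels -> (row, col, regression target)."""
    x, y, w, h = box
    col = int(x / img_w * S)           # grid column containing the center
    row = int(y / img_h * S)           # grid row containing the center
    tx = x / img_w * S - col           # center offset within the cell, in [0, 1)
    ty = y / img_h * S - row
    return row, col, (tx, ty, w / img_w, h / img_h)

row, col, target = encode_box((224, 224, 100, 50), 448, 448)
print(row, col, target)  # center of a 448x448 image falls in cell (3, 3), offsets (0.5, 0.5)
```

Because every cell predicts boxes in one forward pass, detection cost does not grow with the number of candidate regions, which is what makes the method fast.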
12. Deep Residual Learning for Image Recognition (Best Paper)
• ResNet: a very deep CNN
– Simply stacking layers leads to exploding or vanishing gradients
→ Instead, each block is trained to fit the residual between its input and output
– Performance peaks at 152 layers on ImageNet and 110 layers on CIFAR-10
– Can also be combined with Fast / Faster R-CNN
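A minimal numpy sketch of the residual idea (placeholder weights and sizes, not the paper's architecture): a block learns F(x) = H(x) − x and outputs F(x) + x through an identity shortcut, so a block whose weights are near zero defaults to a near-identity mapping:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def residual_block(x, w1, w2):
    f = relu(x @ w1) @ w2   # the residual branch F(x)
    return relu(f + x)      # identity shortcut: output = F(x) + x

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal(d)
w1, w2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
print(residual_block(x, w1, w2).shape)  # (8,)

# With zero weights the block reduces to relu(x): the shortcut keeps
# gradients flowing, which is why very deep stacks remain trainable.
z = np.zeros((d, d))
print(np.allclose(residual_block(x, z, z), relu(x)))  # True
```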
13. Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images
• Proposes a network model for 3D object recognition using RGB images and point-cloud data
– 3D Amodal Region Proposal Network: detects object candidates
– Joint Object Recognition Network: estimates object locations in 3D space and recognizes the objects
[Figure: 3D Amodal Region Proposal Network → Joint Object Recognition Network]
14. Dense Human Body Correspondences Using Convolutional Networks
• CNN-based matching between 3D human bodies and non-rigid registration
– Feature extraction and correspondence matching using three kinds of 3D human-body datasets
• Processing is both highly accurate and real-time
[Figure 2 diagram: input depth maps → full model (feature extraction producing a per-pixel descriptor and classification predictions 1…N against a loss function); a depth generator renders depth maps of 3D models, and per-vertex descriptors are obtained by feature averaging]
Figure 2: We train a neural network which extracts a feature descriptor and predicts the corresponding segmentation label on the human body surface for each point in the input depth maps. We generate per-vertex descriptors for 3D models by averaging the feature descriptors in their rendered depth maps. We use the extracted features to compute dense correspondences.
layer           0:image  1:conv  2:max  3:conv  4:max  5:2×conv  6:conv  7:max  8:2×conv  9:int  10:conv
filter-stride   -        11-4    3-2    5-1     3-2    3-1       3-1     3-2    1-1       -      3-1
channel         1        96      96     256     256    384       256     256    4096      4096   16
activation      -        relu    lrn    relu    lrn    relu      relu    idn    relu      idn    relu
size            512      128     64     64      32     32        32      16     16        128    512
num             1        1       4      4       16     16        16      64     64        1      1
Table 1: The end-to-end network architecture generates a per-pixel feature descriptor and a classification label for all pixels in a depth map simultaneously. From top to bottom: the layer type, the filter size and the stride, the number of filters, the type of the activation function, the size of the image after filtering, and the number of copies reserved for up-sampling.
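The size row of the table can be sanity-checked against the stride row. Assuming padding such that spatial size changes only through the stride (an assumption on my part, i.e. size_out = ceil(size_in / stride)), the sizes through layer 8 follow directly:

```python
import math

# Strides of layers 1-8, read from the filter-stride row of Table 1.
strides = [4, 2, 1, 2, 1, 1, 2, 1]
size, sizes = 512, []
for s in strides:
    size = math.ceil(size / s)   # padding assumed so only stride shrinks the map
    sizes.append(size)
print(sizes)  # [128, 64, 64, 32, 32, 32, 16, 16], matching the size row
```

Layers 9 and 10 then upsample back toward the 512-input resolution, consistent with the "int" (interpolation) entry.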
15. Pose Estimation: Convolutional Pose Machines
• Pose estimation with CNNs arranged in a cascade
– Outputs a belief map for each joint position
• Each successive stage enlarges the region attended to, producing more accurate belief maps
– Stage t takes as input the belief maps and feature maps from stage t−1
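The cascade can be sketched as follows (a toy stand-in, not the real network: the "stage" below is a placeholder for a CNN, and the joint count and map sizes are assumptions). The point is only the wiring: stage 1 sees image features alone, and every later stage sees the features concatenated with the previous stage's belief maps:

```python
import numpy as np

def stage(inputs, n_joints):
    # Stand-in for a stage CNN: collapse the input channels into
    # n_joints + 1 belief maps (one per joint plus background).
    return np.tile(inputs.mean(axis=0, keepdims=True), (n_joints + 1, 1, 1))

def cpm(features, n_joints=14, T=3):
    beliefs = stage(features, n_joints)                     # stage 1: features only
    for _ in range(2, T + 1):                               # stages 2..T: features + prior beliefs
        beliefs = stage(np.concatenate([features, beliefs]), n_joints)
    return beliefs

feats = np.random.rand(32, 46, 46)   # hypothetical feature maps
out = cpm(feats)
print(out.shape)  # (15, 46, 46): one belief map per joint + background
```

Feeding each stage the previous beliefs is what lets later stages exploit spatial context between joints over a growing effective receptive field.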
[Figure 2 diagram: the T-stage CPM, built from 9×9, 5×5, 1×1, and 11×11 convolutions (C) and 2× pooling (P), with a loss after each stage; effective receptive fields grow across layers and stages: 9×9, 26×26, 60×60, 96×96, 160×160, 240×240, 320×320, 400×400]
Figure 2: Architecture and receptive fields of CPMs. We show a convolutional architecture and receptive fields across layers for a CPM with any T stages. The pose machine [29] is shown in insets (a) and (b), and the corresponding convolutional networks are shown in insets (c) and (d).
17. A Key Volume Mining Deep Framework for Action Recognition
• Proposes a CNN+RNN model that focuses learning on the key regions for recognizing human actions in video
– Selects key frames for action recognition using a CNN with 3D (2D + temporal) convolutions and Stochastic out
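A loose caricature of the mining idea (all numbers hypothetical, and a plain max over volumes stands in here for the paper's Stochastic out aggregation, which interpolates between max and average): the video-level score should be driven by the most discriminative ("key") volumes rather than an average over everything:

```python
import numpy as np

def aggregate(volume_scores):
    # volume_scores: (num_volumes, num_classes) from a 3D CNN.
    # Max over volumes: the key volume dominates the video-level score.
    return volume_scores.max(axis=0)

scores = np.array([[0.1, 0.2],
                   [0.9, 0.1],
                   [0.3, 0.3]])
print(aggregate(scores))  # [0.9 0.3]
```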
18. DenseCap: Fully Convolutional Localization Network for Dense Captioning
• A network that generates a caption for each detected object region
– Uses a Fully Convolutional Localization Network based on the Region Proposal Network
19. Stacked Attention Networks for Image Question Answering
• Introduces attention layers that indicate where in the image to look in order to answer a question
– Stacking attention layers yields progressively more accurate attention locations
[Figure (a): Stacked Attention Network for ImageQA. The question "What are sitting in the basket on a bicycle?" and the image are encoded by a CNN/LSTM; the query attends over feature vectors of different parts of the image through Attention layer 1 and Attention layer 2, and a softmax outputs the answer "dogs"]
This paper presents stacked attention networks (SANs) that learn to answer natural language questions from images. SANs use semantic representation of a question as query to search for the regions in an image that are related to the answer. We argue that image question answering (QA) often requires multiple steps of reasoning. Thus, we develop a multiple-layer SAN in which we query an image multiple times to infer the answer progressively. Experiments conducted on four image QA data sets demonstrate that the proposed SANs significantly outperform previous state-of-the-art approaches. The visualization of the attention layers illustrates the progress that the SAN locates the relevant visual clues that lead to the answer of the question layer-by-layer.
1. Introduction
With the recent advancement in computer vision and in natural language processing (NLP), image question answering (QA) becomes one of the most active research areas [7, 21, 18, 1, 19]. Unlike pure language based QA systems that have been studied extensively in the NLP community [28, 14, 4, 31, 3, 32], image QA systems are designed to automatically answer natural language questions according to the content of a reference image.
(b) Visualization of the learned multiple attention layers. The stacked attention network first focuses on all referred concepts, e.g., bicycle, basket, and the objects in the basket (dogs) in the first attention layer, and then further narrows down the focus in the second layer and finds out the answer dogs.
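The stacked-attention step can be sketched in a few lines (dimensions, the additive query update, and the dot-product scoring are simplifying assumptions, not the paper's exact formulation): each layer scores every image region against the current query, pools the regions with softmax weights, and refines the query for the next layer:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_layer(regions, query):
    scores = regions @ query      # relevance of each region to the query
    weights = softmax(scores)     # attention distribution over regions
    context = weights @ regions   # attention-weighted image vector
    return query + context        # refined query for the next layer

rng = np.random.default_rng(1)
regions = rng.standard_normal((196, 64))   # e.g. a 14x14 grid of region features
query = rng.standard_normal(64)            # question embedding
for _ in range(2):                         # two stacked attention layers
    query = attention_layer(regions, query)
print(query.shape)  # (64,): final query, fed to a classifier over answers
```

Stacking the layer twice is what lets the second pass attend with a query already biased toward the regions found relevant in the first pass.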