My Personal Summary of CVPR2016
Machine Perception & Robotics Group
Hiroshi Fukui
About Me
• Name: Hiroshi Fukui
– Affiliation: Chubu University, Machine Perception and Robotics Group (Fujiyoshi Laboratory)
– Twitter ID: @Catechine0125
– Website: https://sites.google.com/site/fhiroresearch/home
• Main research topics
Pedestrian detection, pedestrian attribute recognition
Trip Report
• Attended CVPR2016 from 6/27 to 7/1
– I was a co-author on one of the papers, so I tagged along with Prof. Takayoshi Yamashita
– From 6/19 to 6/22 I also attended IEEE Intelligent Vehicles (not covered here)
[Travel timeline: 6/19–6/23 IEEE Intelligent Vehicles, 6/27–7/1 CVPR2016 — for some reason I flew back to Japan once in between...]
Trip Report: CVPR2016
• What is CVPR (Computer Vision and Pattern Recognition)?
– A top-tier conference in the field of image recognition
– Topics are highly diverse, including image recognition, low-level processing, and 3D vision
• Venue: Las Vegas, Nevada, USA
• Dates: 6/26 – 7/1
Overview of This Year's CVPR
• Acceptance rate: 29.9% (643 / 2,145)
– Long oral: 3.9% (83 papers)
– Short oral: 9.7% (123 papers)
CVPR2016 program: main conference (4 days), tutorials (1 day), workshops (1 day)
Paper Trends
[Word cloud of frequent terms in CVPR2016 paper titles: Learning, Deep, Image, Object, Detection, Networks, Recognition, Convolutional, Neural, Video, Segmentation, Estimation, Classification, Semantic, Pose, Tracking, Re-Identification, and many more. A toy recreation of such a keyword count is sketched below.]
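For illustration only, here is a minimal Python sketch (my own; the tool actually used to build the word cloud is not stated in the slides) of how such a keyword frequency count could be computed from a list of paper titles. The stopword list and the example titles are just a tiny hypothetical sample.

```python
# Toy keyword counter for paper titles (my own sketch, not the author's tool).
from collections import Counter

def title_keyword_counts(titles, stopwords=("a", "an", "the", "for", "of",
                                             "with", "and", "in", "on", "to")):
    """Count how often each word appears across a list of paper titles."""
    counts = Counter()
    for title in titles:
        for word in title.split():
            word = word.strip(",:;.")          # drop trailing punctuation
            if word.lower() not in stopwords:  # skip common function words
                counts[word] += 1
    return counts

# Hypothetical usage with a few CVPR2016-style titles:
titles = [
    "Deep Residual Learning for Image Recognition",
    "You Only Look Once: Unified, Real-Time Object Detection",
    "Convolutional Pose Machines",
]
print(title_keyword_counts(titles).most_common(5))
```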
Paper Trends
The title keywords group into broad areas: Recognition, Video, Low-level, …
Paper Trends
Most presentations in the image recognition area use Deep Learning and CNNs.
Paper Trends
Research related to human pose estimation and person re-identification is on the rise.
With the proposal of Fast/Faster R-CNN, a large number of object detection (recognition) papers were presented.
For segmentation, many combinations of CNNs with MRFs/CRFs were proposed.
Object Detection / Recognition
• Mostly high-accuracy object detection built on Fast / Faster R-CNN
– Used frequently because it is both fast and accurate
• Improving detection performance by improving the network itself
– Use of transfer learning
– Some methods are engineered for even faster recognition (YOLO)
– Stacking many more layers to raise performance was also proposed
• Devising the input data
– Combining it with depth information or point cloud data
You Only Look Once: Unified, Real-Time Object Detection
• Makes the object region search efficient by making it grid-based (see the sketch below)
– Uses a GoogLeNet-based CNN that outputs bounding boxes and confidence scores
– Can run in real time even on a laptop
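As a rough illustration of the grid idea only (my own sketch, not the paper's code): every cell of an S×S grid predicts B boxes with a confidence score, and low-confidence boxes are discarded. The tensor layout and the threshold below are assumptions made for the example.

```python
# Decode a toy S x S grid prediction into boxes (illustrative only).
import numpy as np

def decode_grid(pred, S=7, B=2, conf_thresh=0.2):
    """pred: array of shape (S, S, B*5) holding (x, y, w, h, conf) per box,
    with x, y relative to the cell and w, h relative to the whole image."""
    boxes = []
    for row in range(S):
        for col in range(S):
            for b in range(B):
                x, y, w, h, conf = pred[row, col, b * 5:(b + 1) * 5]
                if conf < conf_thresh:
                    continue                      # drop low-confidence boxes
                cx = (col + x) / S                # cell-relative -> image-relative
                cy = (row + y) / S
                boxes.append((cx, cy, w, h, conf))
    return boxes

# Hypothetical usage with random predictions:
print(len(decode_grid(np.random.rand(7, 7, 10))))
```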
Deep Residual Learning for Image Recognition (Best Paper)
• ResNet: a very deep CNN
– Simply stacking layers makes the error gradients explode or vanish
→ Instead, each block is trained to fit the residual between its input and the desired output (see the sketch below)
– Performs best with 152 layers on ImageNet and 110 layers on CIFAR-10
– Can also be combined with Fast / Faster R-CNN
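A minimal PyTorch-style sketch of the residual idea (my own illustration; the paper's actual blocks also use bottleneck layers and projection shortcuts): the block computes F(x) and adds the identity shortcut, so its layers only have to model the residual.

```python
# Basic residual block: output = ReLU(F(x) + x)  (illustrative sketch).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # identity shortcut carries x around F(x)

# Hypothetical usage:
y = ResidualBlock(64)(torch.randn(1, 64, 32, 32))
```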
Deep Sliding Shapes for Amodal 3D Object Detection in RGB-D Images
• Proposes a network model that performs 3D object recognition from RGB images and point cloud data
– 3D Amodal Region Proposal Network: a network that detects object candidates
– Joint Object Recognition Network: estimates object locations in 3D space and recognizes the objects
[Figure: 3D Amodal Region Proposal Network → Joint Object Recognition Network]
Dense Human Body Correspondences Using Convolutional Networks
• CNN-based matching between 3D human bodies and non-rigid registration
– Feature extraction and correspondence matching using three kinds of 3D human body datasets
• Accurate and able to run in real time (a matching sketch follows the figure below)
[Figure and table from the paper: a neural network extracts a feature descriptor and predicts a segmentation label on the human body surface for each point of the input depth maps; per-vertex descriptors for 3D models are obtained by averaging the descriptors over rendered depth maps, and the extracted features are then used to compute dense correspondences. The accompanying table details the end-to-end architecture that produces a per-pixel descriptor and classification label for every pixel of a depth map.]
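Once every surface point carries a learned descriptor, the matching step can be pictured as nearest-neighbor search in descriptor space. The sketch below is my own illustration, not the paper's code; the descriptor dimensionality and brute-force search are assumptions for the example.

```python
# Dense correspondences via nearest neighbor in descriptor space (toy sketch).
import numpy as np

def dense_correspondences(desc_a, desc_b):
    """desc_a: (N, D) descriptors of body A; desc_b: (M, D) descriptors of body B.
    Returns, for each point of A, the index of its closest point on B."""
    # pairwise squared Euclidean distances, shape (N, M)
    d2 = ((desc_a[:, None, :] - desc_b[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

# Hypothetical usage with random 16-D descriptors:
matches = dense_correspondences(np.random.rand(100, 16), np.random.rand(120, 16))
```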
Pose Estimation: Convolutional Pose Machines
• Pose estimation with CNNs arranged in a cascade
– Each stage outputs a belief map for every joint location
• Each successive stage enlarges the effective receptive field and outputs more accurate belief maps
– Stage t takes as input the belief maps from stage t−1 together with the feature maps (see the sketch after the figure below)
[Figure from the paper: architecture and receptive fields of CPMs — the stage-1 network and the stage ≥ 2 networks built from convolution and pooling layers, with the effective receptive field growing from 9×9 up to roughly 400×400 across stages.]
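A minimal PyTorch-style sketch of the cascade (my own illustration; the real stage networks are deeper, stage 1 has its own architecture, and a background map accompanies the joint maps): each refinement stage receives the shared image features concatenated with the previous stage's belief maps.

```python
# One CPM-style refinement stage (illustrative sketch).
import torch
import torch.nn as nn

class CPMStage(nn.Module):
    def __init__(self, feat_ch, num_joints):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch + num_joints, 128, 7, padding=3), nn.ReLU(),
            nn.Conv2d(128, num_joints, 1),   # one belief map per joint
        )

    def forward(self, features, prev_beliefs):
        # stage t input = image features + stage t-1 belief maps
        return self.net(torch.cat([features, prev_beliefs], dim=1))

# Hypothetical usage: three refinement passes over shared image features
features = torch.randn(1, 32, 46, 46)
beliefs = torch.zeros(1, 14, 46, 46)   # initial belief maps for 14 joints
stage = CPMStage(32, 14)
for _ in range(3):
    beliefs = stage(features, beliefs)
```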
Paper Trends
With RNNs coming into use, research on video analysis (action recognition, tracking, etc.), caption generation, and Visual Question Answering has advanced considerably.
A Key Volume Mining Deep Framework for Action Recognition
• Proposes a CNN+RNN model that concentrates learning on the key regions for recognizing human actions in video
– Selects key frames for action recognition by using a CNN with 3D (2D + temporal) convolutions and stochastic-out (a toy sketch follows)
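Purely as a toy illustration of the "key volume" idea (my own interpretation, not the paper's model or its stochastic-out training scheme): score candidate spatio-temporal volumes with a small 3D-conv network and aggregate only the top-scoring ones into the video-level prediction.

```python
# Toy key-volume selection over candidate clips (illustrative sketch).
import torch
import torch.nn as nn

class VolumeScorer(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.conv3d = nn.Conv3d(3, 8, kernel_size=3, padding=1)  # 2D + time
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(8, num_classes)

    def forward(self, volumes):                    # volumes: (N, 3, T, H, W)
        x = torch.relu(self.conv3d(volumes))
        return self.fc(self.pool(x).flatten(1))    # per-volume class scores

def video_prediction(volumes, scorer, top_k=2):
    scores = scorer(volumes)                               # (num_volumes, C)
    key = scores.max(dim=1).values.topk(top_k).indices     # pick "key" volumes
    return scores[key].mean(dim=0)                         # aggregate them

# Hypothetical usage: 8 candidate volumes of 16 frames each
pred = video_prediction(torch.randn(8, 3, 16, 32, 32), VolumeScorer(10))
```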
DenseCap: Fully Convolutional Localization Networks for Dense Captioning
• A network that generates a caption for each detected object region (sketched below)
– Uses a Fully Convolutional Localization Network based on the Region Proposal Network
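The per-region captioning step can be pictured as: pool a feature vector for each proposed region and let a recurrent decoder emit caption tokens for it. The sketch below is my own simplified illustration (greedy decoding, made-up dimensions), not the paper's model.

```python
# Toy per-region caption decoder (illustrative sketch).
import torch
import torch.nn as nn

class RegionCaptioner(nn.Module):
    def __init__(self, feat_dim, vocab_size, hidden=128):
        super().__init__()
        self.embed = nn.Linear(feat_dim, hidden)       # region feature -> LSTM input
        self.lstm = nn.LSTMCell(hidden, hidden)
        self.word = nn.Linear(hidden, vocab_size)
        self.word_embed = nn.Embedding(vocab_size, hidden)

    def forward(self, region_feat, max_len=5):
        h = c = torch.zeros(1, self.lstm.hidden_size)
        x = self.embed(region_feat).unsqueeze(0)       # region feature starts the sequence
        tokens = []
        for _ in range(max_len):
            h, c = self.lstm(x, (h, c))
            tok = self.word(h).argmax(dim=1)           # greedy decoding
            tokens.append(tok.item())
            x = self.word_embed(tok)                   # feed the word back in
        return tokens

# Hypothetical usage: caption one 256-D region feature
print(RegionCaptioner(256, vocab_size=1000)(torch.randn(256)))
```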
Stacked Attention Networks for Image Question Answering
• A network with attention layers that indicate where in the image to look in order to answer the question
– Stacking the attention layers lets the network pinpoint the relevant location more precisely (a sketch follows the figure below)
[Figure from the paper: (a) the Stacked Attention Network for Image QA — CNN image features and a CNN/LSTM question encoding feed two stacked attention layers, and a softmax outputs the answer (e.g. Q: "What are sitting in the basket on a bicycle?" → A: "dogs"); (b) visualization of the learned attention layers, which first focus on all referred concepts (bicycle, basket, objects in the basket) and then narrow down to the answer, dog.]
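A minimal sketch of one stacked-attention step (my own simplified illustration of the idea; the projection sizes and the exact way the query is refined are assumptions): the question vector scores every image-region feature, the softmax-weighted sum of regions is added back to the query, and the process repeats for the next layer.

```python
# One stacked-attention step over image regions (illustrative sketch).
import torch
import torch.nn.functional as F

def attention_step(query, regions, w_q, w_r, w_p):
    """query: (D,) question vector; regions: (K, D) region features; w_*: projections."""
    h = torch.tanh(regions @ w_r + query @ w_q)       # joint hidden state, (K, H)
    alpha = F.softmax(h @ w_p, dim=0)                 # attention weights over K regions
    attended = (alpha.unsqueeze(1) * regions).sum(0)  # weighted region feature, (D,)
    return query + attended                           # refined query for the next layer

D, K, H = 64, 49, 32
query = torch.randn(D)
regions = torch.randn(K, D)
w_q, w_r, w_p = torch.randn(D, H), torch.randn(D, H), torch.randn(H)
# stack two attention layers, as in the slide
for _ in range(2):
    query = attention_step(query, regions, w_q, w_r, w_p)
```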
Other Trends
• Tackling unusual (or novel) problem settings
– Sound estimation from video, heart rate estimation from face images, Face2Face, etc.
• Creation of new datasets
Sound Estimation
• Video frames are fed into a CNN+RNN, which outputs a sound waveform (toy sketch below)
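Only as a toy sketch of the CNN+RNN pipeline shape (my own illustration, not the paper's model; all layer sizes and the samples-per-frame value are made up): per-frame CNN features drive an RNN that regresses audio samples frame by frame.

```python
# Toy video-to-waveform pipeline: per-frame CNN features -> GRU -> samples.
import torch
import torch.nn as nn

class VideoToSound(nn.Module):
    def __init__(self, samples_per_frame=735):   # e.g. 22.05 kHz audio at 30 fps
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.rnn = nn.GRU(8, 64, batch_first=True)
        self.out = nn.Linear(64, samples_per_frame)

    def forward(self, frames):                    # frames: (N, T, 3, H, W)
        n, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(n, t, -1)  # per-frame features
        hidden, _ = self.rnn(feats)
        return self.out(hidden).flatten(1)        # (N, T * samples_per_frame) waveform

wave = VideoToSound()(torch.randn(1, 8, 3, 64, 64))
```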
Heart Rate Estimation from Face Images
• Extracts color features from local facial regions
– Applies Self-Adaptive Matrix Completion (SAMC) to the extracted color features
• Estimates the heart rate from the power spectrum obtained via SAMC (the last step is sketched after the figure below)
[Figure from the paper: overview of the proposed approach for HR estimation — 1. face region extraction (ROI extraction and warping), 2. feature extraction from local facial regions, 3. self-adaptive matrix completion (observation matrix → low-rank matrix with an estimated mask), 4. heart rate estimation from the power spectral density of the recovered signal.]
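For the final step only, here is a minimal sketch (my own illustration, not the paper's method — it skips SAMC entirely): take the power spectrum of a color trace from the face and read the heart rate off the dominant peak within a plausible frequency band.

```python
# Estimate heart rate from the dominant spectral peak of a color trace.
import numpy as np

def heart_rate_bpm(signal, fps, lo=0.7, hi=4.0):
    """signal: 1-D color trace sampled at `fps`; search 0.7-4 Hz (42-240 bpm)."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    power = np.abs(np.fft.rfft(signal - signal.mean())) ** 2
    band = (freqs >= lo) & (freqs <= hi)          # plausible heart-rate band
    return 60.0 * freqs[band][power[band].argmax()]

# Hypothetical usage: a 30 fps, 10 s trace with a 1.2 Hz (72 bpm) component
t = np.arange(300) / 30.0
print(heart_rate_bpm(np.sin(2 * np.pi * 1.2 * t) + 0.1 * np.random.randn(300), fps=30))
```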
Face2Face
• Transfers the source actor's facial expressions onto the target
– The source expression is reflected by compositing the alignments of the two faces
New Dataset: Cityscapes Dataset
• A larger-scale dataset than previous segmentation datasets
– Previous segmentation dataset: CamVid Dataset, built from about 700 frames captured in a single city
– Scale of the Cityscapes Dataset: built from about 25,000 frames captured in 50 cities (fine annotations: 5,000; coarse: 20,000)
New Dataset: SYNTHIA Dataset
• A dataset of driving scenes rendered with CG
– Segmentation labels and depth data are also released
New Dataset: WIDER FACE
• A large-scale face detection & face attribute dataset
Summary
• In image recognition, methods based on Deep Learning now make up almost everything
– With the arrival of RNNs, a large number of works on video understanding, caption generation, and VQA appeared
• Active enough that two oral sessions were devoted to caption generation & VQA
– My strong impression is that a great many works used Fast/Faster R-CNN together with RNNs
– Some papers built large-scale datasets specifically to improve Deep Learning performance
– Methods other than Deep Learning (MRFs, RFs, SVMs, ...) are mostly used as post-processing
My Personal Summary of CVPR2016