MIRU2014 tutorial deeplearning

Deep Learning
What You Need to Know to Use It Effectively
Takayoshi Yamashita

About Deep Learning
Top-level performance on a variety of benchmarks
Speech recognition (2011)
A deep network of seven layers, with pre-training
F. Seide, G. Li and D. Yu, “Conversational Speech Transcription Using Context-Dependent Deep Neural Networks”, INTERSPEECH 2011.
Generic object recognition (2012)
A multi-layer CNN greatly outperforms previous methods
A. Krizhevsky, I. Sutskever and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS 2012.

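Why is Deep Learning attracting attention?
Because it has taken first place on various benchmarks...
Great achievements of the past
A variety of techniques
Advances in hardware and in how it is used
Persistent work through the dark age of neural networks has paid off
A possible paradigm shift in image recognition
Conventional: feature extraction and classification are separate; features are designed by hand
Deep learning: feature extraction and classification are done together; features are designed automatically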

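The state of the Deep Learning community
IT companies across the board are paying attention and investing in development
Facebook established a research lab (AI Lab)
LeCun is the director; Ranzato and others are members
Google acquired related companies
DNNResearch (the company of Hinton et al.), DeepMind
Yahoo acquired related companies
IQ Engines (an image recognition company), LookFlow
Baidu established a research lab
Almost all of this happened in 2013 or later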

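Deep Learning papers at conferences
[Figure: number of Deep Learning papers per year, 2010 to 2014, at ICCV, CVPR, ICPR and ECCV; the bars range from 1 to 20 papers. Annotations: "attention from speech recognition" and "top at LSVRC2012 (ICCV workshop)".]
The number of presentations has grown sharply over the last one to two years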

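What can Deep Learning do?
Generic object recognition (2012, 2013)
Overwhelmed the other methods at the Large Scale Visual Recognition Challenge 2012
In the 2013 challenge, many of the methods were based on Deep Learning
Convolutional neural networks + techniques for higher accuracy (ReLU, dropout, etc.)
Recognizes 1000 object categories
[Figure: dataset examples, quoted from the LSVRC2012 web page]
http://image-net.org/challenges/LSVRC/2012/ilsvrc2012.pdf
http://image-net.org/challenges/LSVRC/2012/supervision.pdf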

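Scene recognition (2012)
A convolutional neural network assigns a scene label to every pixel
Raw pixel values are fed in directly and the features are learned automatically
Superpixel-based segmentation is used in combination
C. Farabet, C. Couprie, L. Najman, Y. LeCun, “Learning Hierarchical Features for Scene Labeling”, PAMI 2012.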

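Denoising (2012)
A neural network trained layer by layer
Sparse coding is used to train the network
⇒ denoising autoencoder
Visually better than conventional methods (numerically comparable)
[Figure: noise-free image, restoration by Deep Learning, restoration by a conventional method]
J. Xie, L. Xu, E. Chen, “Image Denoising and Inpainting with Deep Neural Networks”, NIPS 2012.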

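Pedestrian detection (2013)
Sparse coding is used to train the convolutional neural network
Local and global features are extracted by merging the outputs of all layers
[Figure: example filters of a convolution layer (INRIA dataset, filter size 9x9)]
P. Sermanet, K. Kavukcuoglu, S. Chintala, Y. LeCun, “Pedestrian Detection with Unsupervised Multi-Stage Feature Learning”, CVPR 2013.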

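Real-time pedestrian detection (2014)
Runs at 16-17 fps on an NVIDIA Tegra K1 (140x60 pixels)
A nine-layer convolutional neural network
Also estimates the distance to the person, their height and their orientation at the same time
I. Sato, H. Niihara, “Beyond Pedestrian Detection: Deep Neural Networks Level-Up Automotive Safety”, 2014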

Deep Belief Object Recognition (an object recognition app for iOS) (2014)
Object recognition implemented on iOS, running in 300 ms
Based on the convolutional neural network of Krizhevsky et al.
An SDK is publicly available
https://www.jetpac.com/

Project Adam
Recognizes objects down to the breed of a dog
https://www.youtube.com/watch?feature=player_embedded&v=zOPIvC0MlA4

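Notable Deep Learning researchers (1)
Hinton (University of Toronto): the guru of the field
https://www.cs.toronto.edu/~hinton/
Kept working persistently even during the dark age of neural networks
Proposed approaches such as the auto encoder and dropout
Has compiled practical recipes for using Deep Learning
LeCun (New York University): the leading authority on CNNs
http://yann.lecun.com
Applies Deep Learning to images
Achieved handwritten character recognition with convolutional neural networks (CNNs)
Has trained many students: Ranzato (Facebook), Kavukcuoglu (DeepMind)
* Also serves as director of Facebook's AI Lab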

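Notable Deep Learning researchers (2)
Schmidhuber (IDSIA): first place in many competitions
http://www.idsia.ch/~juergen/
Has been researching Deep Learning for many years
Top results in recognition tests for characters, traffic signs, etc.
Methods for exploiting GPUs
Has also proposed methods for speeding up detection and recognition
X. Wang (The Chinese University of Hong Kong): applications such as pedestrian detection
http://mmlab.ie.cuhk.edu.hk/project_deep_learning.html
Top result on a pedestrian detection benchmark
Also applies Deep Learning widely, for example to facial landmark detection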

What exactly is Deep Learning?
Keywords related to Deep Learning
Restricted Boltzmann Machines
Deep Belief Networks
Deep Boltzmann Machines
Convolutional Neural Networks
Deep Neural Networks
Back-propagation
Contrastive Divergence
Maxpooling
Dropout
Maxout
Dropconnect

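The keywords fall into three groups
Network architecture
Network training method
Method for improving generalization
[Figure: the keywords Restricted Boltzmann Machines, Deep Belief Networks, Deep Boltzmann Machines, Convolutional Neural Networks, Deep Neural Networks, Back-propagation, Contrastive Divergence, Maxpooling, Dropout, Maxout and Dropconnect, each marked with one of the three groups]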

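How the Deep Learning models relate
[Figure: relationship diagram. Model nodes: Multi-Layer Perceptron, Restricted Boltzmann Machines, Deep Neural Networks, Convolutional Neural Networks, Deep Belief Networks, Deep Boltzmann Machines. Labels on the diagram: "artificial intelligence model", "probabilistic model", "add more layers" (three times), "introduce convolution layers". Related techniques: Max pooling, Maxout, Dropout, Dropconnect.]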

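MLP and RBM
Multi-Layer Perceptron (MLP)
$y_i = \sigma\left(\sum_{j=1}^{m} w_{ij} x_j + b_i\right)$
Restricted Boltzmann Machine (RBM)
$p(x_i = 1 \mid Y) = \sigma\left(\sum_{j=1}^{m} w_{ij} y_j + a_i\right)$
$p(y_j = 1 \mid X) = \sigma\left(\sum_{i=1}^{n} w_{ij} x_i + b_j\right)$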

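DNN and DBN
Deep Neural Networks (DNNs)
Supervised learning (back-propagation)
All parameters are learned at the same time
Deep Belief Networks (DBNs)
Unsupervised learning (Contrastive Divergence)
Parameters are learned one layer at a time
+
Supervised learning (back-propagation)
All parameters are learned at the same time
[Figure: both networks drawn as input layer, hidden layers and output layer, with arrows marking where parameters are learned and updated]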

About Convolutional Neural Networks
Overview of Convolutional Neural Networks
Components of the network
Training method

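Convolutional Neural Networks
Early work (the structure is still the same today)
Built from convolution, sampling and fully connected layers
Applied to handwritten character recognition
Invariant to translation
Each component is explained in turn
Y. LeCun, et al., “Gradient-based Learning Applied to Document Recognition”, Proc. of the IEEE, 1998.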

Training method

Convolution Layer
Slide a kernel over the image and convolve
Each unit is connected only to nearby pixels (local receptive field)
[Figure: input image 10x10, kernel 3x3 → convolution response → activation function f → feature map 8x8]

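Convolution Layer
Slide a kernel over the image and convolve
Each unit is connected only to nearby pixels (local receptive field)
More than one kernel may be used
[Figure: input image 10x10, several 3x3 kernels → convolution responses → activation function f per kernel → one 8x8 feature map per kernel]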
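
The sliding-kernel operation on the two slides above can be sketched in plain Python. This is a minimal illustration, not the tutorial's implementation: the function name `convolve_valid` and the averaging kernel are assumptions made for the example, and the kernel is applied without flipping (cross-correlation), which is how CNN code commonly implements "convolution".

```python
def convolve_valid(image, kernel):
    """Slide `kernel` over `image` without padding and return the response map."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            # Each output value depends only on a kh x kw neighbourhood
            # of the input (the local receptive field).
            s = 0.0
            for i in range(kh):
                for j in range(kw):
                    s += image[r + i][c + j] * kernel[i][j]
            row.append(s)
        out.append(row)
    return out


# 10x10 input and 3x3 kernel, as on the slide (the values are illustrative).
image = [[float(r * 10 + c) for c in range(10)] for r in range(10)]
kernel = [[1.0 / 9.0] * 3 for _ in range(3)]
feature_map = convolve_valid(image, kernel)
print(len(feature_map), len(feature_map[0]))  # 8 8
```

With several kernels, the same function is called once per kernel and each call yields its own 8x8 feature map.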

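Activation Function
Sigmoid function
$f(x_i) = \frac{1}{1 + e^{-x_i}}$
Used for a long time
The gradient vanishes once the unit saturates
Rectified Linear Unit (ReLU)
$f(x_i) = \max(x_i, 0)$
Widely used in image recognition
Training is fast and the gradient does not saturate
Maxout
$f(x_i) = \max_j (x_j)$
Outputs the maximum of the convolution values of several kernels
A piecewise linear convex function
More expressive than ReLU
The gradient does not vanish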
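
The three activation functions above can be written down directly. The sketch below is illustrative only; the helper `sigmoid_grad` is added here (it is not on the slide) to show numerically why a saturated sigmoid passes almost no gradient.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    # Derivative of the sigmoid: close to zero when |x| is large (saturation).
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    return max(x, 0.0)


def maxout(responses):
    """`responses` holds the values of several kernels at the same position."""
    return max(responses)


print(sigmoid(0.0))              # 0.5
print(sigmoid_grad(10.0) < 1e-4) # True: the gradient has almost vanished
print(relu(-2.0), relu(3.0))     # 0.0 3.0
print(maxout([-1.0, 0.5, 2.0]))  # 2.0
```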

Maxout
Outputs the maximum of the responses of several kernels
[Figure: input image 10x10 → convolution with 3x3 kernels → feature maps 8x8x3 → maximum across the maps → feature map 8x8]
I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks”, arXiv preprint arXiv:1302.4389, 2013.

Pooling Layer
Reduces the size of the feature map
Max pooling
The maximum value in a 2x2 region
Average pooling
The mean value in a 2x2 region
Lp pooling
Emphasizes peaks more strongly
$f(x_i) = \left(\sum_{i=1}^{m} \sum_{j=1}^{n} I(i,j)^{p} \, G(i,j)\right)^{1/p}$
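
Max pooling and average pooling over 2x2 regions can be sketched as follows. The function name `pool2x2` and the non-overlapping stride of 2 are assumptions made for this example; the slide only states the 2x2 region.

```python
def pool2x2(feature_map, mode="max"):
    """Shrink a feature map by taking the max or mean of each 2x2 block."""
    out = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[0]) - 1, 2):
            window = [
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ]
            row.append(max(window) if mode == "max" else sum(window) / 4.0)
        out.append(row)
    return out


fm = [
    [1, 2, 5, 6],
    [3, 4, 7, 8],
    [9, 10, 13, 14],
    [11, 12, 15, 16],
]
print(pool2x2(fm, "max"))      # [[4, 8], [12, 16]]
print(pool2x2(fm, "average"))  # [[2.5, 6.5], [10.5, 14.5]]
```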

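Fully connected layer
A fully connected structure
Every input node x_1, ..., x_i is connected to every node h_1, ..., h_j by a connection weight w_ij
For example, a node h_j is obtained by computing the weighted sum of its inputs plus a bias and passing it to the activation function:
$h_j = f(W^{T} x + b_j)$
[Figure: inputs x_1, x_2, x_3, ..., x_i connected to h_1, h_2, ..., h_j with weights w_11, w_12, ..., w_ij]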

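Classification Layer
Number of output nodes = number of classes to recognize
Softmax is commonly used
The probability of each class is computed, and the class with the highest probability is the recognition result
Probability of each class:
$P(y_i) = \frac{\exp(h_i)}{\sum_{j=1}^{M} \exp(h_j)}$
[Figure: previous layer x_1, x_2, x_3, ..., x_i → output layer h_1, h_2, ..., h_M → P(y_1), P(y_2), ..., P(y_M)]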
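
The fully connected layer and the softmax classification layer can be sketched together. The weights, the input and the identity activation below are made-up values for illustration; subtracting the maximum inside `softmax` is a standard numerical-stability step that does not change the result.

```python
import math


def fully_connected(x, W, b, f):
    """h_j = f(sum_i W[i][j] * x[i] + b[j]) for every output node j."""
    return [
        f(sum(W[i][j] * x[i] for i in range(len(x))) + b[j])
        for j in range(len(b))
    ]


def softmax(h):
    m = max(h)  # shift by the maximum to avoid overflow in exp
    e = [math.exp(v - m) for v in h]
    total = sum(e)
    return [v / total for v in e]


x = [1.0, 2.0]                            # output of the previous layer
W = [[0.5, -1.0, 0.0], [0.25, 0.5, 0.0]]  # W[i][j]: weight from x_i to h_j
b = [0.0, 0.0, 1.0]
h = fully_connected(x, W, b, lambda v: v)  # identity activation for clarity
p = softmax(h)
predicted_class = p.index(max(p))          # class with the highest probability
print(h)  # [1.0, 0.0, 1.0]
```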

Training method

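How is the network trained?
What training determines: the parameters
The value of every element of every kernel in the convolution layers
The connection weights and biases of every unit in the fully connected layers
The more layers there are, the more parameters there are
Each parameter is updated from the classification error on labelled training data
The update is repeated so that the error decreases until it converges
Stochastic Gradient Descent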

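Stochastic Gradient Descent
Forward propagation and back-propagation updates are repeated to obtain a good set of parameters
Input: image x, label y
Training set: (x1,y1),…, (xn,yn)
Forward propagation
Each training sample is recognized with the current parameters
Back-propagation
The parameters are updated from the recognition result (the error)
[Figure: Convolution → Full connection → Classification]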

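Update the parameters of every layer so as to minimize the error
Input: image x_i, label y_i
Training set: (x1,y1),…, (xn,yn)
[Figure: Convolution → Full connection → Classification]
Let W denote all the parameters:
$y' = F(W, x)$
Loss function: measures the difference between the output y'_i and the label y_i; this is the quantity to minimize
$E = \sum_{i=1}^{n} \mathrm{Loss}(F(W, x_i), y_i)$
The partial derivatives of the error are computed by back-propagation
The parameters W are updated after multiplying by the update rate γ:
$W \leftarrow W - \gamma \frac{\partial E}{\partial W}$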

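mini batch
Computing the loss over a large number of training samples at once is expensive
⇒ it also makes overfitting more likely
Update the parameters using only a few samples ⇒ mini batch (SGD)
[Figure: n training images split into mini-batches m1, m2, ..., mk]
The parameters are updated sequentially using each m in turn
After mk has been used, start again from m1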
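
The update rule W ← W − γ ∂E/∂W applied to mini-batches m1, ..., mk can be sketched on a toy problem. Everything concrete here is an assumption for illustration: a one-dimensional linear model y' = w·x + b stands in for the network F(W, x), the loss is the squared error, and the data, batch size and update rate are made up.

```python
def sgd_minibatch(xs, ys, batch_size=2, gamma=0.05, epochs=500):
    """Fit y' = w * x + b by cycling through the mini-batches m1..mk."""
    w, b = 0.0, 0.0
    batches = [
        (xs[i:i + batch_size], ys[i:i + batch_size])
        for i in range(0, len(xs), batch_size)
    ]
    for _ in range(epochs):
        for bx, by in batches:  # m1, m2, ..., mk, then back to m1
            errors = [(w * x + b) - y for x, y in zip(bx, by)]
            # Gradient of the mean squared error over this mini-batch only.
            grad_w = sum(2.0 * e * x for e, x in zip(errors, bx)) / len(bx)
            grad_b = sum(2.0 * e for e in errors) / len(bx)
            w -= gamma * grad_w
            b -= gamma * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = sgd_minibatch(xs, ys)
print(round(w, 3), round(b, 3))  # 2.0 1.0
```

In a real network the two gradient lines are replaced by back-propagation through every layer, but the loop structure stays the same.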

What about the initial parameters?
They are all set with random numbers...
Training may not converge well, and good parameters may not be obtained

Training methods (for obtaining better initial parameters)

Unsupervised pre-training (1)
The initial parameters are learned layer by layer
Input: image x, label y
Training set: (x1,y1),…, (xn,yn)
First, determine the elements of each kernel

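Unsupervised pre-training (2)
The initial parameters are learned layer by layer
Input: image x, label y
Next, determine the connection weights
(the input is the features obtained by convolving with the kernels learned in the previous layer)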

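Unsupervised pre-training (3)
The initial parameters are learned layer by layer
Input: image x, label y
Then determine the parameters (connection weights) of the following layers in turn
(the input is the features obtained with the parameters learned up to the previous layer)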

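Auto encoder
Learns initial parameters without supervision
[Figure: input layer x → hidden layer h(x) → reconstruction layer x', with the reconstruction error measured between x and x']
Each parameter is updated so that the error between the input and its reconstruction is minimized
The weights used for reconstruction are the transpose of W (tied weights)
Stochastic gradient descent is used to update the parameters
Parameters to learn: w1, b1, b1’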
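
The forward pass of a tied-weight auto encoder and its reconstruction error can be sketched as follows. The sigmoid on both the hidden and the reconstruction layer and the squared-error measure are assumptions made for this example; the slide does not fix either choice.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def autoencoder_forward(x, W, b, b_prime):
    """W[j][i] links input i to hidden unit j; decoding reuses W transposed."""
    hidden = [
        sigmoid(sum(W[j][i] * x[i] for i in range(len(x))) + b[j])
        for j in range(len(b))
    ]
    # Tied weights: the reconstruction uses the same W, read column-wise.
    reconstruction = [
        sigmoid(sum(W[j][i] * hidden[j] for j in range(len(hidden))) + b_prime[i])
        for i in range(len(x))
    ]
    return hidden, reconstruction


def reconstruction_error(x, reconstruction):
    return sum((a - r) ** 2 for a, r in zip(x, reconstruction))


x = [1.0, 0.0, 1.0, 0.0]
W = [[0.0] * 4 for _ in range(2)]  # 4 inputs, 2 hidden units, all-zero start
b = [0.0, 0.0]
b_prime = [0.0] * 4
hidden, x_rec = autoencoder_forward(x, W, b, b_prime)
print(reconstruction_error(x, x_rec))  # 1.0
```

Training then lowers this error by updating W, b and b_prime with stochastic gradient descent.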

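Auto encoder
Convolutional layers are handled in the same way
[Figure: input image → kernel k1 → feature map → kernel k1’ → restored image]
The kernels are updated so that the difference between the input image and the image restored by convolving with kernel k1 and then with kernel k1’ is minimized
Parameters to learn: k, k’, b, b’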

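Stacked Auto encoder
From the second layer on, training uses the output of the layers before it
[Figure: input layer x → hidden layer h(x) → reconstruction x']
・The output of the first layer is used as the input data
・The reconstruction layer has as many units as the input layer
(here, the number of outputs of the first layer)
・The parameters of the second layer are updated
Y. Bengio, P. Lamblin, D. Popovici and H. Larochelle, “Greedy Layer-Wise Training of Deep Networks”, NIPS 2007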

Methods for improving generalization

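Dropout
Suppresses overfitting in the fully connected layers
(Recap of training)
The parameters are updated from the error between the label of the input data and the output of the network
[Figure: input layer, kernels K1 ... Kn]
The connections from some of the nodes of the fully connected layer are removed (set to 0)
Roughly 50%
Different connections are removed at random for each mini-batch
An approximate form of ensemble learning
G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors”, arXiv preprint arXiv:1207.0580, 2012.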
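
Dropping nodes can be sketched as a random mask over the outputs of a layer. This is a minimal illustration; the function name `dropout`, its `rate` parameter and the seeded generator are choices made for the example.

```python
import random


def dropout(activations, rate=0.5, rng=random):
    """Zero each node's output with probability `rate` (about 50% by default)."""
    return [0.0 if rng.random() < rate else a for a in activations]


rng = random.Random(0)  # a fresh mask is drawn on every call, e.g. per mini-batch
hidden = [0.3, 1.2, 0.7, 2.0]
dropped = dropout(hidden, rate=0.5, rng=rng)
print(dropped)
```

In the cited paper, all nodes are kept at test time and their outgoing weights are halved to compensate for the dropped half during training.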

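Dropconnect
Suppresses overfitting in the fully connected layers
(Recap of training)
The parameters are updated from the error between the label of the input data and the output of the network
[Figure: input layer, kernels K1 ... Kn]
Some of the individual connections of the fully connected layer are removed (set to 0)
Roughly 50%
Different connections are removed at random for each mini-batch
An approximate form of ensemble learning
L. Wan, M. Zeiler, S. Zhang, Y. LeCun, R. Fergus, “Regularization of Neural Network using DropConnect”, ICML 2013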
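
Where dropout removes whole nodes, Dropconnect removes individual weights. The sketch below draws one independent random decision per connection; the function name and the example values are assumptions made for illustration.

```python
import random


def dropconnect_forward(x, W, b, rate=0.5, rng=random):
    """Weighted sums of a fully connected layer with single weights zeroed."""
    out = []
    for j in range(len(b)):
        s = b[j]
        for i in range(len(x)):
            # One independent draw per connection (i, j), not per node.
            if rng.random() >= rate:
                s += W[i][j] * x[i]
        out.append(s)
    return out


x = [1.0, 2.0]
W = [[1.0, 2.0], [3.0, 4.0]]  # W[i][j]: weight from input i to output j
b = [0.5, -0.5]
print(dropconnect_forward(x, W, b, rate=0.0))  # [7.5, 9.5]: nothing dropped
print(dropconnect_forward(x, W, b, rate=1.0))  # [0.5, -0.5]: only the bias is left
```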

Dropout vs Dropconnect
L. Wan, M. Zeiler, S. Zhang, Y. LeCun, R. Fergus, “Regularization of Neural Network using DropConnect”, ICML 2013

Generating training images
Data Augmentation
Increase the amount of training data by changing position and size
Elastic Distortion
Applies changes in shape as well as in position and size
P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convolutional neural networks applied to visual document analysis”, ICDAR 2003.

The importance of preprocessing
Global Contrast Normalization
Each image is normalized to mean 0 and variance 1
Normalizing the brightness of the input data improves performance

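Global Contrast Normalization
Each image is normalized to mean 0 and variance 1
The figure shows results from pylearn2
[Figure: images without normalization and with normalization]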
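
Normalizing one image to mean 0 and variance 1 can be sketched as below. The image is treated as a flat list of pixel values, and the small constant `eps`, which guards against division by zero for a constant image, is an assumption added for the example.

```python
import math


def global_contrast_normalize(pixels, eps=1e-8):
    """Shift one image to mean 0 and scale it to variance 1."""
    n = len(pixels)
    mean = sum(pixels) / n
    variance = sum((p - mean) ** 2 for p in pixels) / n
    std = math.sqrt(variance)
    return [(p - mean) / max(std, eps) for p in pixels]


image = [10.0, 20.0, 30.0, 40.0]  # illustrative pixel values
normalized = global_contrast_normalize(image)
new_mean = sum(normalized) / len(normalized)
new_var = sum((p - new_mean) ** 2 for p in normalized) / len(normalized)
print(round(new_mean, 6), round(new_var, 6))
```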

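ZCA whitening
Removes the redundancy between neighbouring pixels
$X' = WX$
A matrix W that removes the redundancy between neighbouring pixels is learned by principal component analysis
http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf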
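
One common way to write the whitening matrix is given below as a worked equation. It assumes that the columns of X are the n mean-subtracted training images and that ε is a small regularization constant; both are assumptions of this sketch, not details stated on the slide.

```latex
\Sigma = \frac{1}{n} X X^{\top} = U \Lambda U^{\top},
\qquad
W = U \, (\Lambda + \epsilon I)^{-1/2} \, U^{\top},
\qquad
X' = W X
```

Here U holds the principal components (eigenvectors of the covariance Σ) and Λ the corresponding eigenvalues. Rotating back with U after rescaling is what distinguishes ZCA from plain PCA whitening and keeps the whitened data looking like images.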

ZCA whitening
[Figure: images after ZCA whitening only, and after Global contrast normalization + ZCA whitening]

Normalize Layer
Normalizes the values produced by the activation function
[Figure: two orderings: Convolutional layer → Normalize layer → Pooling layer, and Convolutional layer → Pooling layer → Normalize layer]
The Normalize layer is sometimes placed after the pooling layer

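Normalize Layer
Local contrast normalization
Normalizes within a local region of the same feature map
$v_{j,k} = x_{j,k} - \sum_{p,q} w_{p,q} \, x_{j+p,k+q}, \qquad \sum_{p,q} w_{p,q} = 1$
$y_{j,k} = \frac{v_{j,k}}{\max(C, \sigma_{jk})}$
$\sigma_{jk} = \left(\sum_{p,q} w_{pq} \, v_{j+p,k+q}^{2}\right)^{1/2}$
K. Jarrett, K. Kavukcuoglu, M. Ranzato and Y. LeCun, “What is the Best Multi-Stage Architecture for Object Recognition?”, ICCV 2009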

Normalize Layer
Local response normalization
Normalizes across different feature maps at the same position
$y^{i}_{j,k} = \frac{x^{i}_{j,k}}{\left(1 + \alpha \sum_{l=i-N/2}^{i+N/2} \left(x^{l}_{j,k}\right)^{2}\right)^{\beta}}$
G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors”, arXiv 2012
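
Normalization across feature maps at a single position can be sketched as follows. The sketch assumes that each response is divided by (1 + α Σ x²)^β, with the sum taken over the N neighbouring maps; the default values of `N`, `alpha` and `beta` are placeholders chosen for illustration, not values from the slide.

```python
def local_response_normalize(x, i, N=5, alpha=1e-4, beta=0.75):
    """x holds the responses of all feature maps at one position (j, k)."""
    lo = max(0, i - N // 2)           # clip the window at the first map
    hi = min(len(x) - 1, i + N // 2)  # and at the last map
    s = sum(x[l] ** 2 for l in range(lo, hi + 1))
    return x[i] / (1.0 + alpha * s) ** beta


responses = [1.0, 2.0, 3.0]  # three feature maps at the same position
value = local_response_normalize(responses, 1, N=3, alpha=1.0, beta=1.0)
print(value)  # 2 / (1 + 1 + 4 + 9)
```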

Performance comparison by parameter

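Parameters involved in training
What has to be decided before training (network structure)
Size and number of the convolution kernels
Type of activation function: sigmoid, maxout, ReLU
Pooling method and size: max, average, L2, etc.
Size of the maxout
Number of convolution layers
Number of units in the fully connected layers
Number of fully connected layers
Dropout rate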

Parameters involved in training
What has to be decided before training (training parameters)
Mini-batch size
Number of updates
Learning rate
Whether to use an auto encoder

Performance comparison on CIFAR10
About the dataset
Number of classes: 10
airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
Training data: 50,000 images
Test data: 10,000 images
Without data augmentation: about 85%

Structure of the baseline network
Three convolutional layers and one classification layer
[Figure: 32x32 input image → (Convolution layer → Normalize layer → Pooling layer) x 3 → Classification layer]

What is compared
Three convolutional layers and one classification layer
[Figure: 32x32 input image → (Convolution layer → Normalize layer → Pooling layer) x 3 → Classification layer]
Size and number of kernels
Type of activation function
Type of pooling
With or without the Normalize layer
Number of fully connected layers
With or without preprocessing
Others
・With or without dropout
・Batch size
・Learning rate

Performance comparison on CIFAR10 (1)
Network structure
[Figure: error rate (0 to 1) against number of iterations (0 to 300,000) for Conv3Full0, Conv3Full1, Conv3Full2 and Conv3Full3]

Performance comparison on CIFAR10 (2)
With or without preprocessing
[Figure: error rate (0 to 1) against number of iterations (0 to 300,000) for GCN+ZCA, ZCA, and no preprocessing]

Filters of the convolution layer
[Figure: filters learned from the original image, the ZCA image, and the GCN + ZCA image]

Performance comparison on CIFAR10 (3)
Convolutional Layer
Comparison by kernel size
Comparison by number of kernels
[Figure: error rate (0 to 1) against number of iterations (0 to 300,000) for 3x3x2, 5x5x4, 3x4x5 and 7x6x4]

Performance comparison on CIFAR10 (4)
Convolutional Layer
[Figure: error rate (0 to 1) against number of iterations (0 to 300,000) for 4x4x4, 8x8x8, 16x16x16, 32x32x32, 64x64x64, 128x128x128 and 256x256x256]

Performance comparison on CIFAR10 (5)
Convolutional Layer
[Figure: error rate (0 to 1) against number of iterations (0 to 300,000) for 4x8x16, 8x16x32, 16x32x64, 32x64x128, 64x128x256, 128x256x512, 256x512x1024 and 256x256x256]

Visualizing the filters
How the first convolutional layer changes during training
Number of updates: 0 to 7000 (every 1000 updates)
Number of kernels: 128
Kernel size: 5x5

Performance comparison on CIFAR10 (6)
Activation Function
[Figure: error rate (0 to 1) against number of iterations (0 to 300,000) for maxout, sigmoid, ReLU and Tanh]

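Performance comparison on CIFAR10 (7)
Training settings
Comparison with and without dropout
Comparison with and without an auto encoder
Comparison by learning rate
[Figure: error rate (0 to 1) against number of iterations (0 to 300,000), with dropout and without dropout]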

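Performance comparison on CIFAR10 (9)
Training settings
Comparison with and without dropout
Comparison with and without an auto encoder
Comparison by learning rate
[Figure: error rate (0 to 1) against number of iterations (0 to 300,000) for learning rates 0.1, 0.01, 0.001 and 0.0001]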

Performance comparison on CIFAR10 (10)
Training settings
Comparison by batch size
[Figure: error rate (0 to 1) against number of iterations (0 to 200,000) for batch sizes 5, 10, 20, 25, 50, 100 and 125]

Performance comparison on CIFAR10 (11)
With or without the Normalize Layer
[Figure: error rate (0 to 1) against number of iterations (0 to 300,000), with normalization and without normalization]

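Parameters for CIFAR10
| Item | Result |
|---|---|
| Layer structure | 3 convolution layers + 1 classification layer |
| Preprocessing of the input image | GCN + ZCA |
| Number and size of convolution kernels | 128x256x512, 5x5+5x5+4x4 |
| Activation function | maxout (ReLU is also good) |
| Number of fully connected layers | Can be omitted |
| Dropout | Better with it |
| Normalize layer | Better with it |
| Learning rate | 0.001 is best |
| Batch size | About 10 (training takes long if it is too large) |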

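Training time
CPU vs. GPU (time for one update)
| Layer | CPU (Core2 2.6GHz) | GPU (GeForce GT690) | Ratio |
|---|---|---|---|
| Convolution layer, 1 kernel | 27.3ms | 11.6ms | 2.35x |
| Convolution layer, 20 kernels | 451.5ms | 29.2ms | 15.46x |
| Fully connected layer, 100 nodes | 486ms | 14.8ms | 32.84x |
Learning rate
Pre training: 0.5
Fine tuning: 0.01
Mini-batch: 10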

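Resource size
| Layer type | Resource size |
|---|---|
| Convolution layer (kernel size 5x5, 1 kernel) | 0.1KB |
| Convolution layer (kernel size 5x5, 32 kernels) | 4KB |
| Fully connected layer (100 nodes, about 87,000 parameters) | 0.35MB |
| Binarization layer (1600 nodes, about 410,000 parameters) | 1.6MB |
For an input image of 40x40 pixels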

Tools for getting started and things worth knowing

Tutorial page
http://deeplearning.net/tutorial/intro.html

An environment for working with CNNs
The Theano library is well known (Python)
It makes numerical operations such as partial derivatives easy to implement
http://deeplearning.net/software/theano/

cuda-convnet
The code of Krizhevsky et al.
https://code.google.com/p/cuda-convnet/

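Caffe
A public implementation of convolutional neural networks
From a research group at UC Berkeley (T. Darrell)
Pre-trained networks are also released
Some teams have built on it to reach the top of a benchmark site (Kaggle)
https://github.com/UCB-ICSI-Vision-Group/decaf-release/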

OverFeat
A public implementation of convolutional neural networks (C/C++)
Pre-trained on ImageNet
http://cilvr.nyu.edu/doku.php?id=software:overfeat:start
From the group of LeCun and R. Fergus

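Deep Learning tools
| Tool | Environment | What it can do |
|---|---|---|
| Cuda Conv-net | python, c++ | CNN training and evaluation; GPU acceleration |
| Caffe | c++ | CNN training and evaluation |
| OverFeat | python, c/c++ | CNN training and evaluation |
| Torch7 | lua (scripting language) | Machine learning library; CNN and RBM training and evaluation |
| Pylearn2 | python | Machine learning library; CNN and RBM training and evaluation |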

Performance on benchmarks (1)
http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html

Performance on benchmarks (2)
http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html

Performance on benchmarks (3)
Large Scale Visual Recognition Challenge 2012 (ILSVRC2012)
http://www.image-net.org/challenges/LSVRC/2012/results.html#t1

Performance on benchmarks (4)
Caltech Pedestrian Detection
http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/

Kaggle Dogs vs. Cats
P. Sermanet won
Starting from OverFeat trained on ImageNet, the network was updated on the benchmark dataset
Second place went to the UC Berkeley team with Decaf (Caffe)
http://fastly.kaggle.net/c/dogs-vs-cats
[Figure: examples from the dataset]

Object detection
Recent trends
The CNN is used to generate features
A network trained on ImageNet is reused
Built on Caffe and applied to object localization
Features extracted by the CNN are classified with an SVM
Top object detection result on Pascal VOC
R. Girshick, J. Donahue, T. Darrell, J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation”, Conference on Computer Vision and Pattern Recognition, pp. 580–587, 2014.

Object detection (transfer)
A network already trained on one dataset (ImageNet) is transferred to a specific dataset (Pascal VOC)
⇒ the network is updated with a limited dataset
M. Oquab, et al., “Learning and transferring mid-level image representations using convolutional neural networks”, 2013.

Person attribute classification
Features are extracted from a CNN for each poselet
Each attribute is then classified with an SVM
[Figure: excerpts from the PANDA paper: overview of Pose Aligned Networks for Deep Attribute modeling (Figure 1), the part-based convolutional neural nets (Figure 2), poselet input patches from the Berkeley Attributes of People Dataset (Figure 3), and statistics of the number of ground-truth labels on the Attribute 25k Dataset (Figure 4)]
N. Zhang, M. Paluri, M. Ranzato, T. Darrell, L. Bourdev, “PANDA: Pose Aligned Networks for Deep Attribute Modeling”, CVPR 2014

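Video recognition
Networks that extract features over time
[Figure: four architectures: Single Frame, Late Fusion, Early Fusion, Slow Fusion]
Early Fusion
・Integrates information from multiple frames at the pixel level
・Can detect local motion direction and speed
Late Fusion
・Uses two Single Frame networks
・Can compute global motion characteristics
Slow Fusion
・Uses temporal and spatial information in a balanced way
・Convolving over time as well as space yields global information
A. Karpathy, T. Leung, G. Toderici, R. Sukthankar, S. Shetty, Li Fei-Fei, “Large-scale Video Classification with Convolutional Neural Networks”, 2014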

Video recognition
[Figure: first-layer filters of the Slow Fusion network]
A. Karpathy, T. Leung, G. Toderici, R. Sukthankar, S. Shetty, Li Fei-Fei, “Large-scale Video Classification with Convolutional Neural Networks”, 2014

Video recognition
The Sports-1M Dataset has been released
A. Karpathy, T. Leung, G. Toderici, R. Sukthankar, S. Shetty, Li Fei-Fei, “Large-scale Video Classification with Convolutional Neural Networks”, 2014

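Face verification
Faces are aligned in 2D and 3D, and features are then extracted with a CNN
(a 4096-dimensional vector, trained on 4.4 million images of 4,030 people)
Identity is verified by comparing the distance between vectors
Accuracy is coming very close to human ability
Y. Taigman, M. Yang, M. A. Ranzato and L. Wolf, “DeepFace: Closing the Gap to Human-Level Performance in Face Verification”, CVPR 2014.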
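
Verification by comparing the distance between feature vectors can be sketched as follows. The Euclidean distance, the function names and the threshold values are all assumptions made for illustration; the slide does not say which distance or threshold is used, and the short vectors stand in for the 4096-dimensional CNN features.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def same_person(features_a, features_b, threshold):
    """Accept the pair as the same identity when the features are close."""
    return euclidean_distance(features_a, features_b) < threshold


f1 = [0.0, 0.0]  # hypothetical feature vectors
f2 = [3.0, 4.0]
print(euclidean_distance(f1, f2))          # 5.0
print(same_person(f1, f2, threshold=6.0))  # True
print(same_person(f1, f2, threshold=4.0))  # False
```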

Face verification
Features are extracted from several networks, each given a differently cropped region
Verification accuracy exceeds that of DeepFace
| Method | Accuracy (%) | No. of points | No. of images | Feature dimensions |
|---|---|---|---|---|
| DeepFace | 97.25 | 6+67 | 4,400,000+3,000,000 | 4096×4 |
| DeepID on CelebFaces | 96.05 | 5 | 87,628 | 150 |
| DeepID on CelebFaces+ | 97.20 | 5 | 202,599 | 150 |
| DeepID on CelebFaces+ & TL | 97.45 | 5 | 202,599 | 150 |
Y. Taigman, M. Yang, M. A. Ranzato and L. Wolf, “DeepFace: Closing the Gap to Human-Level Performance in Face Verification”, CVPR 2014.

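Applications of Deep Learning
Task types: recognition, detection, segmentation, regression
Generic object recognition (top at LSVRC)
Generic object detection (top on Pascal VOC)
Face recognition (verification) (top on LFW)
Person attribute estimation
Pedestrian detection* (top on the Caltech Pedestrian dataset)
* With sliding-window processing using HOG+CSS-SVM
Scene labeling
Hand region extraction
Face labeling
Hair region extraction
Facial landmark detection
Human pose estimation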

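Research example (1)
Hand region extraction
Extracts the region of a previously unseen hand from a grayscale image
Yamashita, Watasue, Yamauchi, Fujiyoshi, “Extraction of hand shape regions with a Deep Convolutional Neural Network”, SSII2014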

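The case of segmentation problems
A Binarization Layer is used in place of the Classification Layer
Fully connected, like the fully connected layer
The number of output nodes equals the size of the input image
$y = \sigma(wx + b)$
Sigmoid is used as the activation function
[Figure: inputs x_1, x_2, x_3, ..., x_i fully connected to outputs y_1, y_2, ..., y_j]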

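Visualizing the parameters
The weights of the classification layer are visualized
[Figure: inputs x_1, x_2, x_3, ..., x_i connected to outputs y_1, y_2, ..., y_j]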

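Other research examples
Facial landmark detection
Estimates the positions of the landmark points
Scene labeling
Assigns one of 8 class labels to every pixel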

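Summary
Introduced the trends in Deep Learning and convolutional neural networks
Showed how performance changes with the parameters
(These results are only one example; note that they vary with the dataset)
To deepen your understanding further...
Tutorial pages and plenty of other useful material are available
(though much of it is hard to follow)
If you would like to hear more...
Please get in touch: yamashita@cs.chubu.ac.jp
Also reachable on Twitter (takayosiy), Facebook, etc.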
