1.
DEEP LEARNING JP
[DL Papers]
http://deeplearning.jp/
Matrix Capsules with EM Routing (ICLR2018)
Kazuki Fujikawa, DeNA
2. Summary
• Bibliographic information
– ICLR 2018
– Geoffrey Hinton, Sara Sabour, Nicholas Frosst
• Overview
– Follow-up to Dynamic Routing Between Capsules [Sabour+, NIPS2017]
– Stacks multiple capsule layers
– Represents each capsule's pose as a 4×4 matrix rather than a vector
– Represents the existence probability with a dedicated unit rather than the vector norm
– Uses the EM algorithm for a Gaussian mixture model for routing
– Achieves SOTA on the smallNORB dataset
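The representational change can be illustrated with a small NumPy sketch (shapes and values are illustrative, not the paper's implementation): a NIPS2017-style capsule encodes existence in the squashed norm of a single vector, while an ICLR2018-style capsule pairs a 4×4 pose matrix with a sigmoid activation produced by a dedicated unit.

```python
import numpy as np

rng = np.random.default_rng(0)

# NIPS2017: a capsule is one vector u; existence = norm of squash(u), in (0, 1)
u = rng.standard_normal(16)
sq = float(np.dot(u, u))
existence_2017 = sq / (1.0 + sq)                 # norm of the squashed vector

# ICLR2018: a capsule is a 4x4 pose matrix plus a dedicated activation unit
pose = rng.standard_normal((4, 4))               # viewpoint-equivariant pose
logit = float(rng.standard_normal())             # output of the capsule's own unit
activation_2018 = 1.0 / (1.0 + np.exp(-logit))   # sigmoid, independent of the pose
```

The point of the separation: the pose can change freely with viewpoint without also changing the entity's estimated probability of existence, which the norm-based encoding conflates.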
Published as a conference paper at ICLR 2018
Figure 1: A network with one ReLU convolutional layer followed by a primary convolutional capsule layer and two more convolutional capsule layers.
22. Related work
• Dynamic Routing Between Capsules [Sabour+, NIPS2017]
– Architecture
Figure adapted from: https://medium.com/@mike_ross/a-visual-representation-of-capsule-network-computations-83767d79e737
In addition to the classification loss (margin loss), the output of only the correct class's capsule is fed to a decoder that reconstructs the input image, and the pixel-wise reconstruction error is also used for training.
→ As a result, the attributes needed for reconstruction (stroke thickness, scale, etc.) come to be represented in the capsules.
Figure 2: Decoder structure to reconstruct a digit from the DigitCaps layer representation. The euclidean distance between the image and the output of the Sigmoid layer is minimized during training. We use the true label as reconstruction target during training.
…fields overlap with the location of the center of the capsule. In total PrimaryCapsules has [32 × 6 × 6] capsule outputs (each output is an 8D vector) and each capsule in the [6 × 6] grid is sharing their weights with each other. One can see PrimaryCapsules as a Convolution layer with Eq. 1 as its block non-linearity. The final Layer (DigitCaps) has one 16D capsule per digit class and each of these capsules receives input from all the capsules in the layer below.
We have routing only between two consecutive capsule layers (e.g. PrimaryCapsules and DigitCaps). Since Conv1 output is 1D, there is no orientation in its space to agree on. Therefore, no routing is used between Conv1 and PrimaryCapsules. All the routing logits (b_ij) are initialized to zero. Therefore, initially a capsule output (u_i) is sent to all parent capsules (v_0 … v_9) with equal probability (c_ij).
Our implementation is in TensorFlow (Abadi et al. [2016]) and we use the Adam optimizer (Kingma and Ba [2014]) with its TensorFlow default parameters, including the exponentially decaying learning rate, to minimize the sum of the margin losses in Eq. 4.
4.1 Reconstruction as a regularization method
We use an additional reconstruction loss to encourage the digit capsules to encode the instantiation parameters of the input digit. During training, we mask out all but the activity vector of the correct …
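The routing initialization described in the excerpt (all logits b_ij zero, hence equal coupling c_ij) can be sketched in NumPy. The 3-iteration count follows the paper, but the shapes are illustrative and this is a simplified sketch, not the authors' TensorFlow code:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Eq. 1 non-linearity of Sabour+ (2017): keeps direction, maps norm into [0, 1)."""
    sq = np.sum(s * s, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, n_iter=3):
    """u_hat: (n_in, n_out, d) prediction vectors from lower capsules to parents."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                                   # logits start at zero
    for _ in range(n_iter):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)      # equal coupling at first
        s = (c[:, :, None] * u_hat).sum(axis=0)                   # weighted sum per parent
        v = squash(s)                                             # parent outputs
        b = b + (u_hat * v[None]).sum(axis=-1)                    # agreement updates logits
    return v

rng = np.random.default_rng(0)
v = dynamic_routing(rng.standard_normal((1152, 10, 16)))          # PrimaryCaps -> DigitCaps shapes
```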
24. Proposed method
• Matrix Capsules with EM Routing
– Stacks multiple capsule layers
– Represents each capsule's pose as a 4×4 matrix rather than a vector
– Represents the existence probability with a dedicated unit rather than the vector norm
– Uses the EM algorithm for a Gaussian mixture model (GMM) for routing
Figure 1: A network with one ReLU convolutional layer followed by a primary convolutional capsule layer and two more convolutional capsule layers.
…that location. The activations of the primary capsules are produced by applying the sigmoid function to the weighted sums of the same set of lower-layer ReLUs.
The primary capsules are followed by two 3×3 convolutional capsule layers (K=3), each with 32 capsule types (C=D=32) with strides of 2 and one, respectively. The last layer of convolutional …
34. Proposed method
• Margin loss [Sabour+, NIPS2017]
• Spread loss [Hinton+, ICLR2018]
– Based on the margin loss
– Explicitly formulates the requirement that the score of the correct class be larger than the scores of the incorrect classes
• Coordinate addition
– In the final layer, all capsules connect to the class capsules, so the original x-y coordinate information is lost
– So that the same positional relationships hold in the affine-transformed feature space, the scaled x-y coordinates are added to the first two components of each vote to the class capsules
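Coordinate addition can be sketched as follows; the grid size, the scaling of coordinates into [0, 1], and the tensor layout are assumptions for illustration:

```python
import numpy as np

def coordinate_addition(votes, grid_h, grid_w):
    """votes: (grid_h, grid_w, n_types, n_classes, 16) flattened 4x4 vote matrices.
    Adds each capsule's scaled receptive-field center (row, col) to the first
    two components of its votes, restoring the positional information."""
    out = votes.copy()
    rows = (np.arange(grid_h) + 0.5) / grid_h          # scaled y-coordinates
    cols = (np.arange(grid_w) + 0.5) / grid_w          # scaled x-coordinates
    out[..., 0] += rows[:, None, None, None]
    out[..., 1] += cols[None, :, None, None]
    return out

votes = np.zeros((4, 4, 32, 5, 16))
shifted = coordinate_addition(votes, 4, 4)
```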
…the training less sensitive to the initialization and hyper-parameters of the model, we use "spread loss" to directly maximize the gap between the activation of the target class (a_t) and the activation of the other classes. If the activation of a wrong class, a_i, is closer than the margin, m, to a_t then it is penalized by the squared distance to the margin:

L_i = (max(0, m − (a_t − a_i)))² ,   L = Σ_{i≠t} L_i   (3)

By starting with a small margin of 0.2 and linearly increasing it during training to 0.9, we avoid dead capsules in the earlier layers. Spread loss is equivalent to squared Hinge loss with m = 1. Guermeur & Monfrini (2011) studies a variant of this loss in the context of multi class SVMs.
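Eq. 3 translates directly into code; `spread_loss` below is a sketch for a single example, with the margin m passed in so it can follow the 0.2 → 0.9 schedule mentioned in the text:

```python
import numpy as np

def spread_loss(a, target, m):
    """a: (n_classes,) class activations; target: index of the correct class t;
    m: current margin. Implements L = sum_{i != t} max(0, m - (a_t - a_i))^2."""
    a_t = a[target]
    li = np.maximum(0.0, m - (a_t - a)) ** 2   # per-class penalty
    li[target] = 0.0                           # the sum runs over i != t
    return li.sum()

# Only the class at 0.8 is within the margin of a_t = 0.9: (0.2 - 0.1)^2 = 0.01
loss = spread_loss(np.array([0.9, 0.5, 0.8]), target=0, m=0.2)
```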
EXPERIMENTS
The smallNORB dataset (LeCun et al. (2004)) has gray-level stereo images of 5 classes of toys: airplanes, cars, trucks, humans and animals. There are 10 physical instances of each class which are painted matte green. 5 physical instances of a class are selected for the training data and the other 5 for the test data. Every individual toy is pictured at 18 different azimuths (0-340), 9 elevations and 6 lighting conditions, so the training and test sets each contain 24,300 stereo pairs of 96x96 images. We selected smallNORB as a benchmark for developing our capsules system because it is carefully …
37. Experiments
• Results
– Outperformed Cireşan+, 2011, achieving SOTA
– The benefits of the pose matrix, spread loss, and coordinate addition were confirmed
– The activation of the correct class was observed to increase gradually over the routing iterations
Figure 2: Histogram of distances of votes to the mean of each of the 5 final capsules after each routing iteration. Each distance point is weighted by its assignment probability. All three images are selected from the smallNORB test set. The routing procedure correctly routes the votes in the truck and the human example. The plane example shows a rare failure case of the model where the plane is confused with a car in the third routing iteration. The histograms are zoomed-in to visualize only votes with distances less than 0.05. Fig. B.2 shows the complete histograms for the "human" capsule without clipping the x-axis or fixing the scale of the y-axis.
38. References
• Hinton, G., Frosst, N., & Sabour, S. (2018). Matrix capsules with EM routing. Hinton, G.,
Frosst, N., & Sabour, S. (2018). Matrix capsules with EM routing. ICLR2018.
• Sabour, S., Frosst, N., & Hinton, G. E. (2017). Dynamic routing between capsules.
In Advances in Neural Information Processing Systems (pp. 3859-3869).
• Cireşan, Dan C., et al. "High-performance neural networks for visual object
classification." arXiv preprint arXiv:1102.0183 (2011).