1.
DEEP LEARNING JP
[DL Papers]
http://deeplearning.jp/
Matrix Capsules with EM Routing (ICLR2018)
Kazuki Fujikawa, DeNA
2. Summary
• Bibliographic information
– ICLR 2018
– Geoffrey Hinton, Sara Sabour, Nicholas Frosst
• Overview
– Follow-up to Dynamic Routing Between Capsules [Sabour+, NIPS2017]
– Stacks multiple capsule layers
– Represents each capsule's pose as a 4×4 matrix rather than a vector
– Represents the existence probability with a dedicated unit rather than the vector norm
– Uses the EM algorithm for a Gaussian mixture model for routing
– Achieves SOTA on the smallNORB dataset
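The representational change can be illustrated with a small NumPy sketch (shapes and values are illustrative, not the paper's implementation): a NIPS2017-style capsule encodes existence in the squashed norm of a single vector, while an ICLR2018-style capsule pairs a 4×4 pose matrix with a sigmoid activation produced by a dedicated unit.

```python
import numpy as np

rng = np.random.default_rng(0)

# NIPS2017: a capsule is one vector u; existence = norm of squash(u), in (0, 1)
u = rng.standard_normal(16)
sq = float(np.dot(u, u))
existence_2017 = sq / (1.0 + sq)                 # norm of the squashed vector

# ICLR2018: a capsule is a 4x4 pose matrix plus a dedicated activation unit
pose = rng.standard_normal((4, 4))               # viewpoint-equivariant pose
logit = float(rng.standard_normal())             # output of the capsule's own unit
activation_2018 = 1.0 / (1.0 + np.exp(-logit))   # sigmoid, independent of the pose
```

The point of the separation: the pose can change freely with viewpoint without also changing the entity's estimated probability of existence, which the norm-based encoding conflates.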
Published as a conference paper at ICLR 2018
Figure 1: A network with one ReLU convolutional layer followed by a primary convolutional capsule layer and two more convolutional capsule layers.
22. Related work
• Dynamic Routing Between Capsules [Sabour+, NIPS2017]
– Architecture
Figure adapted from: https://medium.com/@mike_ross/a-visual-representation-of-capsule-network-computations-83767d79e737
In addition to the classification loss (margin loss), the output of only the correct class's capsule is fed to a decoder that reconstructs the input image, and the pixel-wise reconstruction error is also used for training.
→ As a result, the attributes needed for reconstruction (stroke thickness, scale, etc.) come to be represented in the capsules.
Figure 2: Decoder structure to reconstruct a digit from the DigitCaps layer representation. The euclidean distance between the image and the output of the Sigmoid layer is minimized during training. We use the true label as reconstruction target during training.
…fields overlap with the location of the center of the capsule. In total PrimaryCapsules has [32 × 6 × 6] capsule outputs (each output is an 8D vector) and each capsule in the [6 × 6] grid is sharing their weights with each other. One can see PrimaryCapsules as a Convolution layer with Eq. 1 as its block non-linearity. The final Layer (DigitCaps) has one 16D capsule per digit class and each of these capsules receives input from all the capsules in the layer below.
We have routing only between two consecutive capsule layers (e.g. PrimaryCapsules and DigitCaps). Since Conv1 output is 1D, there is no orientation in its space to agree on. Therefore, no routing is used between Conv1 and PrimaryCapsules. All the routing logits (b_ij) are initialized to zero. Therefore, initially a capsule output (u_i) is sent to all parent capsules (v_0 … v_9) with equal probability (c_ij).
Our implementation is in TensorFlow (Abadi et al. [2016]) and we use the Adam optimizer (Kingma and Ba [2014]) with its TensorFlow default parameters, including the exponentially decaying learning rate, to minimize the sum of the margin losses in Eq. 4.
4.1 Reconstruction as a regularization method
We use an additional reconstruction loss to encourage the digit capsules to encode the instantiation parameters of the input digit. During training, we mask out all but the activity vector of the correct …
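The routing initialization described in the excerpt (all logits b_ij zero, hence equal coupling c_ij) can be sketched in NumPy. The 3-iteration count follows the paper, but the shapes are illustrative and this is a simplified sketch, not the authors' TensorFlow code:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Eq. 1 non-linearity of Sabour+ (2017): keeps direction, maps norm into [0, 1)."""
    sq = np.sum(s * s, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, n_iter=3):
    """u_hat: (n_in, n_out, d) prediction vectors from lower capsules to parents."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                                   # logits start at zero
    for _ in range(n_iter):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)      # equal coupling at first
        s = (c[:, :, None] * u_hat).sum(axis=0)                   # weighted sum per parent
        v = squash(s)                                             # parent outputs
        b = b + (u_hat * v[None]).sum(axis=-1)                    # agreement updates logits
    return v

rng = np.random.default_rng(0)
v = dynamic_routing(rng.standard_normal((1152, 10, 16)))          # PrimaryCaps -> DigitCaps shapes
```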
24. Proposed method
• Matrix Capsules with EM Routing
– Stacks multiple capsule layers
– Represents each capsule's pose as a 4×4 matrix rather than a vector
– Represents the existence probability with a dedicated unit rather than the vector norm
– Uses the EM algorithm for a Gaussian mixture model (GMM) for routing
Figure 1: A network with one ReLU convolutional layer followed by a primary convolutional capsule layer and two more convolutional capsule layers.
…that location. The activations of the primary capsules are produced by applying the sigmoid function to the weighted sums of the same set of lower-layer ReLUs.
The primary capsules are followed by two 3×3 convolutional capsule layers (K=3), each with 32 capsule types (C=D=32) with strides of 2 and one, respectively. The last layer of convolutional …
34. Proposed method
• Margin loss [Sabour+, NIPS2017]
• Spread loss [Hinton+, ICLR2018]
– Based on the margin loss
– Explicitly formulates the requirement that the score of the correct class be larger than the scores of the incorrect classes
• Coordinate addition
– In the final layer, all capsules connect to the class capsules, so the original x-y coordinate information is lost
– So that the same positional relationships hold in the affine-transformed feature space, the scaled x-y coordinates are added to the first two components of each vote to the class capsules
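Coordinate addition can be sketched as follows; the grid size, the scaling of coordinates into [0, 1], and the tensor layout are assumptions for illustration:

```python
import numpy as np

def coordinate_addition(votes, grid_h, grid_w):
    """votes: (grid_h, grid_w, n_types, n_classes, 16) flattened 4x4 vote matrices.
    Adds each capsule's scaled receptive-field center (row, col) to the first
    two components of its votes, restoring the positional information."""
    out = votes.copy()
    rows = (np.arange(grid_h) + 0.5) / grid_h          # scaled y-coordinates
    cols = (np.arange(grid_w) + 0.5) / grid_w          # scaled x-coordinates
    out[..., 0] += rows[:, None, None, None]
    out[..., 1] += cols[None, :, None, None]
    return out

votes = np.zeros((4, 4, 32, 5, 16))
shifted = coordinate_addition(votes, 4, 4)
```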
…the training less sensitive to the initialization and hyper-parameters of the model, we use "spread loss" to directly maximize the gap between the activation of the target class (a_t) and the activation of the other classes. If the activation of a wrong class, a_i, is closer than the margin, m, to a_t then it is penalized by the squared distance to the margin:

L_i = (max(0, m − (a_t − a_i)))² ,   L = Σ_{i≠t} L_i   (3)

By starting with a small margin of 0.2 and linearly increasing it during training to 0.9, we avoid dead capsules in the earlier layers. Spread loss is equivalent to squared Hinge loss with m = 1. Guermeur & Monfrini (2011) studies a variant of this loss in the context of multi class SVMs.
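Eq. 3 translates directly into code; `spread_loss` below is a sketch for a single example, with the margin m passed in so it can follow the 0.2 → 0.9 schedule mentioned in the text:

```python
import numpy as np

def spread_loss(a, target, m):
    """a: (n_classes,) class activations; target: index of the correct class t;
    m: current margin. Implements L = sum_{i != t} max(0, m - (a_t - a_i))^2."""
    a_t = a[target]
    li = np.maximum(0.0, m - (a_t - a)) ** 2   # per-class penalty
    li[target] = 0.0                           # the sum runs over i != t
    return li.sum()

# Only the class at 0.8 is within the margin of a_t = 0.9: (0.2 - 0.1)^2 = 0.01
loss = spread_loss(np.array([0.9, 0.5, 0.8]), target=0, m=0.2)
```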
EXPERIMENTS
The smallNORB dataset (LeCun et al. (2004)) has gray-level stereo images of 5 classes of toys: airplanes, cars, trucks, humans and animals. There are 10 physical instances of each class which are painted matte green. 5 physical instances of a class are selected for the training data and the other 5 for the test data. Every individual toy is pictured at 18 different azimuths (0-340), 9 elevations and 6 lighting conditions, so the training and test sets each contain 24,300 stereo pairs of 96x96 images. We selected smallNORB as a benchmark for developing our capsules system because it is carefully …
37. Experiments
• Results
– Outperformed Cireşan+, 2011, achieving SOTA
– The benefits of the pose matrix, spread loss, and coordinate addition were confirmed
– The activation of the correct class was observed to increase gradually over the routing iterations
Figure 2: Histogram of distances of votes to the mean of each of the 5 final capsules after each routing iteration. Each distance point is weighted by its assignment probability. All three images are selected from the smallNORB test set. The routing procedure correctly routes the votes in the truck and the human example. The plane example shows a rare failure case of the model where the plane is confused with a car in the third routing iteration. The histograms are zoomed-in to visualize only votes with distances less than 0.05. Fig. B.2 shows the complete histograms for the "human" capsule without clipping the x-axis or fixing the scale of the y-axis.
38. References
• Hinton, G., Frosst, N., & Sabour, S. (2018). Matrix capsules with EM routing. Hinton, G.,
Frosst, N., & Sabour, S. (2018). Matrix capsules with EM routing. ICLR2018.
• Sabour, S., Frosst, N., & Hinton, G. E. (2017). Dynamic routing between capsules.
In Advances in Neural Information Processing Systems (pp. 3859-3869).
• Cireşan, Dan C., et al. "High-performance neural networks for visual object
classification." arXiv preprint arXiv:1102.0183 (2011).