Capsule Networks
Phạm Hoàng Hiệp (hoanghiepjp96@gmail.com)
Code: https://github.com/hiepph/capsules-pytorch
What’s wrong with CNNs?
Translation Invariance
A CNN’s neural activities are invariant to small changes in viewpoint.
Aim for equivariance instead: changes in viewpoint lead to corresponding changes in neural activities.
https://kndrck.co/posts/capsule_networks_explained/
The activation in the last hidden layer of a deep CNN is a percept:
+ Info about many objects in the image
+ But not the relationships between objects
Requires a lot of data to become translation invariant (a.k.a. brute force)
https://kndrck.co/posts/capsule_networks_explained/
Bad representation of the vision system
No routing mechanism from low-level visual data to higher-level parts.
(Max) Pooling sucks:
+ The precise location of the most active feature is thrown away
+ Reduces the number of inputs to the next layer of feature extraction
https://kndrck.co/posts/capsule_networks_explained/
“The pooling operation used in convolutional
neural networks is a big mistake and the fact that
it works so well is a disaster.” - Hinton
Capsules to the rescue
Inverse Graphics
Represent the relationship between the object as a whole and the pose of each part as a matrix of weights.
The translation invariance is now represented in the matrix of weights, not in the neural activity.
Aurélien Géron. Capsule Networks (CapsNets) – Tutorial
Key intuition: preserve hierarchical pose relationships between object parts
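A minimal sketch of that intuition in PyTorch (the repo's framework); the dimensions and names here are illustrative assumptions, not taken from the slides. The learned matrix W maps a part's pose vector to a prediction of the whole's pose, so viewpoint knowledge lives in W rather than in the activities:

```python
import torch

part_pose = torch.randn(8)   # u_i: pose vector of a lower-level (part) capsule
W = torch.randn(16, 8)       # learned part-whole transformation matrix W_ij
u_hat = W @ part_pose        # prediction u_hat_{j|i} for the whole's 16-D pose
print(u_hat.shape)           # torch.Size([16])
```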
Architecture
Encoder
Sara Sabour, Geoffrey E Hinton, et al. Dynamic Routing Between Capsules, 2017.
Layer 1. Convolutional layer
Converts pixel intensities to the activities of local feature detectors, which are then used as inputs to the primary capsules.
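A sketch of this layer, assuming the hyperparameters of the Dynamic Routing paper (256 feature maps, 9x9 kernels, stride 1, ReLU) on MNIST-sized input; the linked repo may differ:

```python
import torch
import torch.nn as nn

# Layer 1: 256 feature maps, 9x9 kernels, stride 1, ReLU (per the paper).
conv1 = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=256, kernel_size=9, stride=1),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 1, 28, 28)   # one MNIST-sized grayscale image
features = conv1(x)
print(features.shape)           # torch.Size([1, 256, 20, 20])
```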
Layer 2. Primary Capsules
Takes the basic features detected by the convolutional layer and produces combinations of those features.
The lowest level of multi-dimensional entities.
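A sketch of the primary capsule layer under the paper's configuration (a 9x9 stride-2 convolution whose 256 output channels are regrouped into 32 channels of 8-D capsules, giving 6 x 6 x 32 = 1152 capsules); the squashing activation, covered next, would then be applied:

```python
import torch
import torch.nn as nn

# Regroup a 9x9 stride-2 convolution's 256 channels into 32 x 8-D capsules.
primary_conv = nn.Conv2d(256, 32 * 8, kernel_size=9, stride=2)

features = torch.randn(1, 256, 20, 20)   # output of Layer 1 above
out = primary_conv(features)             # (1, 256, 6, 6)
capsules = out.view(1, 32 * 6 * 6, 8)    # 1152 capsules, 8-D each
print(capsules.shape)                    # torch.Size([1, 1152, 8])
```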
Non-linear “squashing” activation
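The squashing function from the paper shrinks short vectors toward zero and long vectors toward (but below) unit length, so a capsule's length can act as a probability. A minimal sketch:

```python
import torch

def squash(s, dim=-1, eps=1e-8):
    # v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||), applied per capsule vector.
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return sq_norm / (1.0 + sq_norm) * s / torch.sqrt(sq_norm + eps)

capsules = torch.randn(1, 1152, 8)   # e.g. the primary capsules above
v = squash(capsules)
print(v.norm(dim=-1).max())          # all lengths now strictly below 1
```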
Layer 3. Digit Capsules
A lower-level capsule sends its output to the higher-level capsule that agrees with its prediction (a.k.a. routing by agreement).
Dynamic Routing Algorithm
Max Pechyonkin. Understanding Hinton’s Capsule Networks.
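A compact sketch of the routing loop from the paper (3 iterations by default); the tensor shapes follow the MNIST setup above and are assumptions rather than repo code:

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    n2 = (s ** 2).sum(dim=dim, keepdim=True)
    return n2 / (1.0 + n2) * s / torch.sqrt(n2 + eps)

def dynamic_routing(u_hat, num_iters=3):
    # u_hat: (batch, n_lower, n_upper, dim) prediction vectors from below.
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)    # routing logits
    for _ in range(num_iters):
        c = F.softmax(b, dim=2)                       # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)      # weighted sum of predictions
        v = squash(s)                                 # upper-level capsule outputs
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)  # reward agreeing capsules
    return v

u_hat = torch.randn(1, 1152, 10, 16)   # 1152 primary -> 10 digit capsules
print(dynamic_routing(u_hat).shape)    # torch.Size([1, 10, 16])
```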
Layer 3. Digit Capsules
Capsules encapsulate all the important information about the state of the feature they detect, in vector form.
Each capsule outputs two things:
+ The probability that an object of that type is present (the vector’s length)
+ Instantiation parameters, including the precise pose (the vector’s orientation)
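Reading out both quantities is then a one-liner each; a small sketch with assumed shapes:

```python
import torch

v = torch.randn(1, 10, 16)    # digit-capsule outputs (assume already squashed)
probs = v.norm(dim=-1)        # vector length ~ presence probability
pred = probs.argmax(dim=-1)   # predicted digit class
pose = v[0, pred[0]]          # 16 instantiation parameters of the winner
print(pred.item(), pose.shape)
```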
Decoder
Sara Sabour, Geoffrey E Hinton, et al. Dynamic Routing Between Capsules, 2017.
Encourages the digit capsules to encode the instantiation parameters of the input digit by reconstructing it.
Loss = margin_loss + 0.0005 * recon_loss
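The margin-loss term follows the formula in the paper (m+ = 0.9, m- = 0.1, lambda = 0.5), and the reconstruction term is a sum of squared pixel differences; a hedged sketch:

```python
import torch
import torch.nn.functional as F

def margin_loss(v, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    # v: (batch, 10, 16) digit capsules; targets: (batch,) long class indices.
    lengths = v.norm(dim=-1)                          # (batch, 10)
    T = F.one_hot(targets, num_classes=10).float()    # T_k: 1 for the true class
    L = (T * F.relu(m_pos - lengths) ** 2
         + lam * (1 - T) * F.relu(lengths - m_neg) ** 2)
    return L.sum(dim=-1).mean()

def total_loss(v, targets, recon, images):
    # Down-weight reconstruction so it does not dominate the margin loss.
    recon_loss = F.mse_loss(recon, images.view(recon.shape), reduction="sum")
    return margin_loss(v, targets) + 0.0005 * recon_loss
```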
Reconstruction
G. E. Hinton, et al. Transforming Auto-encoders.
Experiment
Fashion MNIST
Reference accuracies: human 83.5%; 2 conv layers with max pooling 87.6%
Reconstruction samples: ground truth vs. reconstructed.
Pros
+ Position and pose are preserved (equivariance)
+ Promising for image segmentation and object detection
+ Routing by agreement is great for overlapping objects
+ Robust to affine transformations
+ Activation vectors are easier to interpret (rotation, thickness, skew, ...)
Cons
- Not state of the art on datasets other than MNIST (but a good start)
- Slow training, due to the internal loop (routing-by-agreement algorithm)
- Cannot see two very close identical objects ("crowding")
