Capsule Networks
Phạm Hoàng Hiệp (hoanghiepjp96@gmail.com)
Code: https://github.com/hiepph/capsules-pytorch
What’s wrong with CNNs?
Translation Invariance
A CNN’s neural activities are invariant to small changes in viewpoint.
Aim for equivariance instead: changes in viewpoint lead to corresponding changes in neural activities.
https://kndrck.co/posts/capsule_networks_explained/
The activation in the last hidden layer of a deep CNN is a percept:
+ Info about many objects in the image
+ But not the relationships between objects
Requires a lot of data to become translation invariant (a.k.a. brute force)
https://kndrck.co/posts/capsule_networks_explained/
Bad representation of the vision system
No routing mechanism from low-level visual data to higher-level parts.
(Max) Pooling sucks:
+ The precise location of the most active feature is thrown away
+ Reduces the number of inputs to the next layer of feature extraction
https://kndrck.co/posts/capsule_networks_explained/
“The pooling operation used in convolutional
neural networks is a big mistake and the fact that
it works so well is a disaster.” - Hinton
Capsules to the rescue
Inverse Graphics
Represent the relationship between the object as a whole and the pose of each part as a matrix of weights.
The translation invariance is now represented in the matrix of weights, not in the neural activity.
Aurélien Géron. Capsule Networks (CapsNets) – Tutorial
Key intuition: preserve hierarchical pose relationships between object parts
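A minimal sketch of that intuition in PyTorch (the repo's framework); the dimensions and names here are illustrative assumptions, not taken from the slides. The learned matrix W maps a part's pose vector to a prediction of the whole's pose, so viewpoint knowledge lives in W rather than in the activities:

```python
import torch

part_pose = torch.randn(8)   # u_i: pose vector of a lower-level (part) capsule
W = torch.randn(16, 8)       # learned part-whole transformation matrix W_ij
u_hat = W @ part_pose        # prediction u_hat_{j|i} for the whole's 16-D pose
print(u_hat.shape)           # torch.Size([16])
```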
Architecture
Encoder
Sara Sabour, Geoffrey E Hinton, et al. Dynamic Routing Between Capsules, 2017.
Layer 1. Convolutional layer
Converts pixel intensities to the activities of local feature detectors, which are then used as inputs to the primary capsules.
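A sketch of this layer, assuming the hyperparameters of the Dynamic Routing paper (256 feature maps, 9x9 kernels, stride 1, ReLU) on MNIST-sized input; the linked repo may differ:

```python
import torch
import torch.nn as nn

# Layer 1: 256 feature maps, 9x9 kernels, stride 1, ReLU (per the paper).
conv1 = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=256, kernel_size=9, stride=1),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 1, 28, 28)   # one MNIST-sized grayscale image
features = conv1(x)
print(features.shape)           # torch.Size([1, 256, 20, 20])
```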
Layer 2. Primary Capsules
Takes the basic features detected by the convolutional layer and produces combinations of those features.
The lowest level of multi-dimensional entities.
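A sketch of the primary capsule layer under the paper's configuration (a 9x9 stride-2 convolution whose 256 output channels are regrouped into 32 channels of 8-D capsules, giving 6 x 6 x 32 = 1152 capsules); the squashing activation, covered next, would then be applied:

```python
import torch
import torch.nn as nn

# Regroup a 9x9 stride-2 convolution's 256 channels into 32 x 8-D capsules.
primary_conv = nn.Conv2d(256, 32 * 8, kernel_size=9, stride=2)

features = torch.randn(1, 256, 20, 20)   # output of Layer 1 above
out = primary_conv(features)             # (1, 256, 6, 6)
capsules = out.view(1, 32 * 6 * 6, 8)    # 1152 capsules, 8-D each
print(capsules.shape)                    # torch.Size([1, 1152, 8])
```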
Non-linear “squashing” activation
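The squashing function from the paper shrinks short vectors toward zero and long vectors toward (but below) unit length, so a capsule's length can act as a probability. A minimal sketch:

```python
import torch

def squash(s, dim=-1, eps=1e-8):
    # v = (||s||^2 / (1 + ||s||^2)) * (s / ||s||), applied per capsule vector.
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return sq_norm / (1.0 + sq_norm) * s / torch.sqrt(sq_norm + eps)

capsules = torch.randn(1, 1152, 8)   # e.g. the primary capsules above
v = squash(capsules)
print(v.norm(dim=-1).max())          # all lengths now strictly below 1
```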
Layer 3. Digit Capsules
A lower-level capsule sends its output to the higher-level capsule that agrees with its prediction (a.k.a. routing by agreement).
Dynamic Routing Algorithm
Max Pechyonkin. Understanding Hinton’s Capsule Networks.
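A compact sketch of the routing loop from the paper (3 iterations by default); the tensor shapes follow the MNIST setup above and are assumptions rather than repo code:

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    n2 = (s ** 2).sum(dim=dim, keepdim=True)
    return n2 / (1.0 + n2) * s / torch.sqrt(n2 + eps)

def dynamic_routing(u_hat, num_iters=3):
    # u_hat: (batch, n_lower, n_upper, dim) prediction vectors from below.
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)    # routing logits
    for _ in range(num_iters):
        c = F.softmax(b, dim=2)                       # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)      # weighted sum of predictions
        v = squash(s)                                 # upper-level capsule outputs
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)  # reward agreeing capsules
    return v

u_hat = torch.randn(1, 1152, 10, 16)   # 1152 primary -> 10 digit capsules
print(dynamic_routing(u_hat).shape)    # torch.Size([1, 10, 16])
```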
Layer 3. Digit Capsules
Capsules encapsulate all the important information about the state of the feature they detect, in vector form.
Each capsule outputs two things:
+ The probability that an object of that type is present (the vector’s length)
+ Instantiation parameters, including the precise pose (the vector’s orientation)
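Reading out both quantities is then a one-liner each; a small sketch with assumed shapes:

```python
import torch

v = torch.randn(1, 10, 16)    # digit-capsule outputs (assume already squashed)
probs = v.norm(dim=-1)        # vector length ~ presence probability
pred = probs.argmax(dim=-1)   # predicted digit class
pose = v[0, pred[0]]          # 16 instantiation parameters of the winner
print(pred.item(), pose.shape)
```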
Decoder
Sara Sabour, Geoffrey E Hinton, et al. Dynamic Routing Between Capsules, 2017.
Encourages the digit capsules to encode the instantiation parameters of the input digit by reconstructing it.
Loss = margin_loss + 0.0005 * recon_loss
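The margin-loss term follows the formula in the paper (m+ = 0.9, m- = 0.1, lambda = 0.5), and the reconstruction term is a sum of squared pixel differences; a hedged sketch:

```python
import torch
import torch.nn.functional as F

def margin_loss(v, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    # v: (batch, 10, 16) digit capsules; targets: (batch,) long class indices.
    lengths = v.norm(dim=-1)                          # (batch, 10)
    T = F.one_hot(targets, num_classes=10).float()    # T_k: 1 for the true class
    L = (T * F.relu(m_pos - lengths) ** 2
         + lam * (1 - T) * F.relu(lengths - m_neg) ** 2)
    return L.sum(dim=-1).mean()

def total_loss(v, targets, recon, images):
    # Down-weight reconstruction so it does not dominate the margin loss.
    recon_loss = F.mse_loss(recon, images.view(recon.shape), reduction="sum")
    return margin_loss(v, targets) + 0.0005 * recon_loss
```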
Reconstruction
G. E. Hinton, et al. Transforming Auto-encoders.
Experiment
Fashion MNIST
Reference accuracies: human 83.5%; 2 conv layers with max pooling 87.6%
Reconstruction samples: ground truth vs. reconstructed.
Pros
+ Position and pose are preserved (equivariance)
+ Promising for image segmentation and object detection
+ Routing by agreement is great for overlapping objects
+ Robust to affine transformations
+ Activation vectors are easier to interpret (rotation, thickness, skew, ...)
Cons
- Not state of the art on datasets other than MNIST (but a good start)
- Slow training, due to the internal loop (routing-by-agreement algorithm)
- Cannot see two very close identical objects ("crowding")
