2. What is the current State of the Art in Image
Classification & Object Recognition?
● Image classification is a central problem in machine learning
● We would not have been successful if we simply connected a raw multi-layer
perceptron to each pixel of an image.
● On top of quickly becoming intractable, this direct approach is not very efficient,
because neighbouring pixels are spatially correlated. So we first need to extract
features that are:
○ meaningful, and
○ low-dimensional
● And that's where convolutional neural networks come into play!
● Convolutional networks are the current state-of-the-art approach.
● The basic idea: show the algorithm labeled images, e.g. photos of dogs labeled
"dog", and eventually it will start to abstract the features that are most likely
to indicate the presence of an actual dog.
● We can use this model to classify new, unlabeled images.
3. ● First, an input image is fed to the network.
● Filters of a given size scan the image and perform convolutions.
● The resulting feature maps then go through an activation function, after which
the output goes through a succession of pooling and further convolution operations.
● Features are reduced in dimension as the network goes on.
● At the end, high-level features are flattened and fed to fully connected layers,
which will eventually yield class probabilities through a softmax layer.
● During training time, the network learns how to recognize the features that make
a sample belong to a given class through backpropagation.
● ConvNets appear as a way to construct features that we would have had to
handcraft ourselves otherwise.
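The pipeline just described can be sketched end to end in a few lines of NumPy. This is a toy illustration only, with invented sizes (a single 8x8 image, one 3x3 filter, two classes) and random, untrained weights; real frameworks use optimized convolutions and learn the kernel and fully connected weights by backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, k):
    """Valid 2-D convolution (really cross-correlation, as in most DL libraries)."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling over size x size blocks."""
    h, w = x.shape
    h2, w2 = h // size, w // size
    return x[:h2 * size, :w2 * size].reshape(h2, size, w2, size).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy 8x8 "image", one 3x3 filter, one fully connected layer, 2 classes.
image = rng.random((8, 8))
kernel = rng.standard_normal((3, 3))
features = max_pool(relu(conv2d(image, kernel)))   # 6x6 -> 3x3
flat = features.ravel()                            # flatten high-level features
W = rng.standard_normal((2, flat.size))            # fully connected layer
probs = softmax(W @ flat)                          # class probabilities
print(probs)
```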
4. For a CNN, the mere presence of these objects (e.g. eyes, a nose, a mouth) can be
a very strong indicator that there is a face in the image.
5. Why Convolutional Networks are Doomed
1. Sub-sampling loses the precise spatial relationships between higher-level
parts such as a nose and a mouth. These precise spatial relationships are needed
for identity recognition.
6. 2. They cannot extrapolate their understanding of geometrical relationships to
radically new viewpoints.
7. EQUIVARIANCE Vs. INVARIANCE
● Sub-sampling tries to make the neural activities invariant to small changes in
viewpoint.
● It's better to aim for equivariance: changes in viewpoint lead to corresponding
changes in neural activities.
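A tiny NumPy experiment makes the invariance point concrete: after 2x2 max pooling, a one-pixel shift of an activation can become completely invisible (the feature-map size and positions here are made up):

```python
import numpy as np

def max_pool(x, size=2):
    """Non-overlapping max pooling over size x size blocks."""
    h, w = x.shape
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

# A single activation at position (1, 1)...
a = np.zeros((4, 4)); a[1, 1] = 1.0
# ...and the same activation shifted to (0, 0).
b = np.zeros((4, 4)); b[0, 0] = 1.0

# Both land in the same 2x2 block, so pooling maps them to identical outputs:
# the precise position is lost (invariance, not equivariance).
print(np.array_equal(max_pool(a), max_pool(b)))   # True
```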
8. What is the right representation of the images?
● Computer graphics deals with constructing a visual image from some internal
hierarchical representation of geometric data.
● Note that the structure of this representation needs to take into account relative
positions of objects.
● That internal representation is stored in the computer's memory as arrays of
geometrical objects, together with matrices that represent the relative positions
and orientations of these objects.
● Then, special software takes that representation and converts it into an image on
the screen. This is called rendering.
9. ● Hinton argues that brains, in fact, do the opposite of rendering. He calls it
inverse graphics: from visual information received by eyes, they deconstruct a
hierarchical representation of the world around us.
● He argues that in order to correctly do classification and object recognition, it is
important to preserve hierarchical pose relationships between object parts. This
is the key intuition that will allow us to understand why capsule theory is so
important. It incorporates relative relationships between objects and it is
represented numerically as a 4D pose matrix.
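The pose matrix mentioned above is just a homogeneous transformation matrix, as used in computer graphics. A minimal sketch of how such matrices encode and compose relative poses (the rotation angle and offsets are invented for illustration):

```python
import numpy as np

def pose(rotation_deg, tx, ty):
    """4x4 homogeneous pose matrix: rotation about the z-axis plus translation."""
    t = np.radians(rotation_deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c,  -s,  0.0, tx],
                     [s,   c,  0.0, ty],
                     [0.0, 0.0, 1.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])

# Pose of a whole object (say, a face) in the image...
face_pose = pose(30, 5, 2)
# ...and the pose of a part (say, the nose) relative to the face.
nose_in_face = pose(0, 0, -1)

# Composing the two matrices gives the part's pose in the image,
# which is exactly how a renderer would place the nose.
nose_pose = face_pose @ nose_in_face
print(nose_pose[:2, 3])   # the nose's (x, y) position in the image
```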
10. In simple words, a Capsule network is a neural network that tries to perform Inverse
Graphics
11. What is a Capsule?
● A capsule is any function that tries to predict the presence and the instantiation
parameters of a particular object at a given location.
● Max pooling loses valuable information and also does not encode relative spatial
relationships between features.
● We should use capsules instead, because they encapsulate all the important
information about the state of the feature they are detecting in the form of a
vector (as opposed to the scalar that a neuron outputs).
● Capsules encode the probability of detection of a feature as the length of their
output vector, and the state of the detected feature as the direction in which
that vector points (the “instantiation parameters”).
● So when the detected feature moves around the image or its state somehow
changes, the probability stays the same (the length of the vector does not change),
but the vector's orientation changes (thus achieving equivariance).
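To make the output vector's length behave like a probability, the CapsNet paper (Sabour, Frosst & Hinton, 2017) applies a "squash" nonlinearity that shrinks the length into [0, 1) while preserving the direction. A minimal sketch:

```python
import numpy as np

def squash(v, eps=1e-8):
    """Squash nonlinearity: length goes to [0, 1), direction is preserved."""
    norm2 = np.sum(v ** 2)
    norm = np.sqrt(norm2 + eps)
    return (norm2 / (1.0 + norm2)) * (v / norm)

v = np.array([3.0, 4.0])         # raw capsule output, length 5
s = squash(v)
print(np.linalg.norm(s))         # ≈ 0.96: the feature is almost certainly present
print(s / np.linalg.norm(s))     # same unit direction as v / 5
```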
12. ● In the diagram below, the network contains 50 capsules; the arrows represent the
output vectors of these capsules.
● The black arrows correspond to the capsules that try to find rectangles, while
the blue arrows represent the output of the capsules looking for triangles.
● The length of each output vector represents the estimated probability of
detection, while its orientation represents the object's estimated pose parameters.
● The vector will rotate in its space, representing the changing state of the detected
object, but its length will remain fixed, because the capsule is still sure it has
detected the object.
● This is what Hinton refers to as activities of equivariance: neuronal activities will
change when an object “moves over the manifold of possible appearances” in the
picture. At the same time, the probabilities of detection remain constant, which is
the form of invariance that we should aim at, and not the type offered by CNNs
with max pooling.
14. Primary Capsules
Ex: a capsule that detects rectangles. During training, the network learns a
transformation matrix for each pair of (lower-level, higher-level) capsules.
Ex: a capsule that detects triangles.
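These transformation matrices can be illustrated with 2-D homogeneous poses: each part capsule multiplies its detected pose by a matrix to predict the pose of the whole. In this hand-set sketch (the offsets and the 15-degree rotation are invented, and the matrices are written down rather than learned), both predictions coincide, which is exactly the agreement that routing exploits:

```python
import numpy as np

def pose2d(theta_deg, tx, ty):
    """3x3 homogeneous 2-D pose: rotation plus translation."""
    t = np.radians(theta_deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]])

boat = pose2d(15, 2, 3)              # true (unknown) pose of the boat
rect_in_boat = pose2d(0, 0, -1)      # hull: one unit below the boat's centre
tri_in_boat  = pose2d(0, 0,  1)      # sail: one unit above

# Poses actually detected by the two primary capsules:
rect_pose = boat @ rect_in_boat
tri_pose  = boat @ tri_in_boat

# Each capsule predicts the boat's pose through its transformation matrix
# (here simply the inverse of the part-in-whole offset, for illustration):
boat_from_rect = rect_pose @ np.linalg.inv(rect_in_boat)
boat_from_tri  = tri_pose  @ np.linalg.inv(tri_in_boat)

# The two predictions agree, so a boat is very likely present.
print(np.allclose(boat_from_rect, boat_from_tri))   # True
```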
15. Routing by Agreement
● Since the outputs of both capsules agree on the boat's orientation, it is
safe to assume that both the triangle and the rectangle are parts of a boat.
● Thus the output of these capsules should be routed to the boat capsule. This
helps reduce both the training time and the noise in the final output. This is
called routing by agreement.
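The routing loop itself can be sketched in NumPy, following the dynamic routing algorithm of Sabour et al. (2017). The capsule counts, vector dimension, and hand-set predictions below are toy values: the two lower capsules agree on the "boat" capsule and contradict each other on the "house" capsule, so the coupling coefficients shift toward the boat over the iterations:

```python
import numpy as np

def squash(v, axis=-1, eps=1e-8):
    """Shrink vector lengths into [0, 1) while preserving direction."""
    norm2 = np.sum(v ** 2, axis=axis, keepdims=True)
    return (norm2 / (1.0 + norm2)) * v / np.sqrt(norm2 + eps)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def routing(u_hat, n_iters=3):
    """Dynamic routing by agreement.
    u_hat: predicted parent poses, shape (n_lower, n_upper, dim)."""
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))            # routing logits
    for _ in range(n_iters):
        c = softmax(b, axis=1)                  # coupling coefficients
        s = (c[..., None] * u_hat).sum(axis=0)  # weighted sum per upper capsule
        v = squash(s)                           # upper-capsule outputs
        b = b + (u_hat * v[None]).sum(axis=-1)  # agreement = dot product
    return v, c

# Two lower capsules (rectangle, triangle), two upper capsules (boat, house),
# 4-D pose vectors. Identical predictions for the boat, opposite ones for the house.
u_hat = np.zeros((2, 2, 4))
u_hat[0, 0] = u_hat[1, 0] = [1.0, 0.0, 0.0, 0.0]
u_hat[0, 1] = [1.0, 0.0, 0.0, 0.0]
u_hat[1, 1] = [-1.0, 0.0, 0.0, 0.0]

v, c = routing(u_hat)
print(np.round(c, 2))   # both rows couple mostly to capsule 0, the boat
```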
19. PROS
● Reaches high accuracy on MNIST, and promising on CIFAR10
● Requires less training data
● Position and pose information are preserved (equivariance)
● This is promising for image segmentation and object detection
● Routing by agreement is great for overlapping objects
● Capsule activations nicely map the hierarchy of parts
● Offers robustness to affine transformations
● Activation vectors are easier to interpret (rotation, thickness, skew...)
● It’s Hinton! ;-)
20. CONS
● Not state of the art on CIFAR10 (but it’s a good start)
● Not tested yet on larger images (e.g., ImageNet): will it work well?
● Slow training, due to the inner loop (in the routing by agreement algorithm)
● A CapsNet cannot see two very close identical objects. This is called “crowding”,
and it has been observed as well in human vision