Images and words
mechanics of automated captioning
with neural networks
Alberto Massidda
Who we are
● Founded in 2001;
● Branches in Milan, Rome and London;
● Market leader in enterprise ready solutions based on Open Source tech;
● Expertise:
○ DevOps
○ Cloud
○ BigData and many more...
This presentation is Open Source (yay!)
https://creativecommons.org/licenses/by-nc-sa/3.0/
Outline
1. Task introduction
2. Object recognition
3. Language generation
4. Putting it all together
5. Improving performance
6. Beyond captioning: deep image search
The task
Generating a description from an image.
“A man jumps over a skateboard”
The challenges
1. Recognize objects in the image
2. Generate a fluent description in natural language
Neural object recognition
A solved problem: Convolutional Neural Networks do the trick.
A CNN is an architecture specialized in finding topological invariants in the input.
It finds relationships between atoms (low-level elements) and infers higher abstractions.
It is highly resistant to noise and spatial transformations.
It learns automatically which features are relevant to extract from an input.
Not limited to images: CNNs can be applied to text, audio, etc.
An image as integers
A handwritten “8” can be
represented as a matrix of
integers.
● 0 for blank (white)
● 1-255 for grayscale values, white to black
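A minimal sketch of that representation in Python (toy values, not an actual handwritten digit):

```python
# A tiny 5x5 grayscale "image" as a matrix of integers:
# 0 = blank (white), up to 255 = full black. Values are made up for illustration.
image = [
    [0,   0, 200,   0, 0],
    [0, 180,   0, 180, 0],
    [0,   0, 210,   0, 0],
    [0, 180,   0, 180, 0],
    [0,   0, 200,   0, 0],
]

height, width = len(image), len(image[0])
```

Everything the network does downstream is arithmetic on this matrix of integers.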
Architecture of a ConvNet
[Filter convolution + ReLU] → [Max pooling] → [Filter convolution + ReLU] → [Max pooling] → [Fully connected]
1. Convolution
2. Non Linearity (ReLU)
3. Pooling or Sub Sampling
4. Classification (Fully Connected Layer)
Convolution intuition
Let’s multiply a sliding matrix (the “brushing filter”) with our input matrix,
summing the products at each position.
For example, with the right values the filter performs edge detection.
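A minimal sketch of this sliding multiply-and-sum, using a classic edge-detection (Laplacian) filter as the example:

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (valid mode) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            # Elementwise multiply the window by the kernel and sum.
            acc = 0
            for a in range(kh):
                for b in range(kw):
                    acc += image[i + a][j + b] * kernel[a][b]
            row.append(acc)
        out.append(row)
    return out

# Laplacian filter: responds where a pixel differs from its neighbours,
# and gives exactly zero on flat (edge-free) regions.
edge_filter = [[0,  1, 0],
               [1, -4, 1],
               [0,  1, 0]]

flat = [[5] * 4 for _ in range(4)]          # a uniform image: no edges anywhere
feature_map = convolve2d(flat, edge_filter)  # -> [[0, 0], [0, 0]]
```

In a CNN the kernel values are not hand-picked like this; they are learned during training.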
Convolution in CNN
Each newly generated image is called a “channel”. A common RGB image has 3 channels.
Channels hold different perspectives on the image.
We start with random filters and tune these matrices as part of our training.
We end up with filters that have learned perspectives of interest.
Convolution example
Rectifier Linear Unit
We can apply another operation to rule out pixels that don’t contribute: ReLU zeroes out all negative values.
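The operation is just an elementwise max(0, x); a one-line sketch:

```python
def relu(feature_map):
    """Keep positive activations, zero out the rest: max(0, x) elementwise."""
    return [[max(0, x) for x in row] for row in feature_map]

activated = relu([[-3, 5],
                  [ 2, -1]])   # -> [[0, 5], [2, 0]]
```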
Max Pooling 1/2
After this, we downsample the image by “hashing” it to fewer values. We can:
● Max: pick only the highest element
● Sum: sum together all the elements

Input (4x4):
1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4

Max pooling (2x2, stride 2):
6 8
3 4

Sum pooling (2x2, stride 2):
13 21
 8  8
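Both variants can be sketched with one function, parameterized by the combining operation:

```python
def pool2x2(channel, op):
    """Downsample with a 2x2 window, stride 2, combining each window with `op`."""
    out = []
    for i in range(0, len(channel) - 1, 2):
        row = []
        for j in range(0, len(channel[0]) - 1, 2):
            window = [channel[i][j],     channel[i][j + 1],
                      channel[i + 1][j], channel[i + 1][j + 1]]
            row.append(op(window))
        out.append(row)
    return out

channel = [[1, 1, 2, 4],
           [5, 6, 7, 8],
           [3, 2, 1, 0],
           [1, 2, 3, 4]]

max_pooled = pool2x2(channel, max)   # -> [[6, 8], [3, 4]]
sum_pooled = pool2x2(channel, sum)   # -> [[13, 21], [8, 8]]
```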
Max Pooling 2/2
Fully connected layer
After a couple of “convolve, ReLU and pool” cycles, we have maybe 128 channels
of 14x14 pixel images.
Concatenate and reshape them into a linear array of 25088 cells (128 × 14 × 14).
Feed it to a feed-forward neural network that will output our classes.
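The reshape step is a plain flatten; a sketch with the sizes from the slide:

```python
# 128 channels of 14x14 pixels (zero-filled here just to show the shapes).
channels = [[[0.0] * 14 for _ in range(14)] for _ in range(128)]

# Flatten all channels into one linear array: the classifier's input vector.
flat = [pixel for channel in channels for row in channel for pixel in row]

# len(flat) == 128 * 14 * 14 == 25088
```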
CNN demo time
Real time web handwritten digit recognition
http://scs.ryerson.ca/~aharley/vis/conv/flat.html
There are a lot of “famous” nets that can be freely downloaded and used off the shelf,
like ResNet, which reaches a 3.6% top-5 error rate over the 1000 ImageNet categories.
Why not just use an MLP?
Why MLP suffers
The Multi-Layer Perceptron can actually classify images as plain arrays of pixels.
But it fails if the image is shifted and/or rotated.
This is because it lacks support for learning the invariant topological properties
that are maintained when the image goes through a spatial transformation.
Language generation with Recurrent Networks
Language generation is a serial task. We generate words one after another.
This is well modeled by recurrent neural cells: a neuron that is applied over and
over again to accept serial inputs, outputting a new value at each step.
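A toy sketch of that loop, with a scalar hidden state and made-up weights (a real cell uses weight matrices learned in training):

```python
import math

def rnn_step(x, h, w_x=0.5, w_h=0.8, w_y=1.0):
    """One step of a toy scalar recurrent cell: the new hidden state mixes
    the current input with the previous state; the weights are assumptions."""
    h_new = math.tanh(w_x * x + w_h * h)   # hidden state carries the "memory"
    y = w_y * h_new                        # output emitted at this time step
    return h_new, y

h = 0.0                      # initial hidden state
outputs = []
for x in [1.0, 0.5, -0.2]:   # a "sentence" of three inputs, fed serially
    h, y = rnn_step(x, h)    # the same cell is reused at every step
    outputs.append(y)
```

The key point: one value per step comes out, and the hidden state `h` threads the steps together.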
Words as integers without embedding
Vocabulary of words.
V = [‘fight’, ‘kill’, ‘queen’, ‘king’, ‘man’, ‘woman’, ‘love’,...]
“One hot vector” encoding representation of single words.
‘fight’ = [1 0 0 0 0 0 0 …]
‘kill’ = [0 1 0 0 0 0 0 …]
‘queen’ = [0 0 1 0 0 0 0 …]
Can correlate documents (TF-IDF), but can’t correlate single words to each other:
“I fight the king” = [ 1 0 0 1 0 0 0 …]
“we fight tyranny” = [ 1 0 0 0 0 0 0 …]
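The encoding above, sketched in Python (note how consistent the vocabulary indexing has to be):

```python
vocab = ['fight', 'kill', 'queen', 'king', 'man', 'woman', 'love']

def one_hot(word):
    """A vector with a single 1 at the word's index in the vocabulary."""
    return [1 if w == word else 0 for w in vocab]

def bag_of_words(sentence):
    """A document vector: 1 in every position whose word occurs (unknown words ignored)."""
    vec = [0] * len(vocab)
    for word in sentence.split():
        if word in vocab:
            vec[vocab.index(word)] = 1
    return vec

doc = bag_of_words('I fight the king')   # -> [1, 0, 0, 1, 0, 0, 0]
```

Two documents can be compared dimension by dimension, but `one_hot('king')` and `one_hot('queen')` are exactly as far apart as `one_hot('king')` and `one_hot('love')`: no word similarity survives.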
Words as floats with vector embedding
Word embedding.
Fixed-length, real-valued vector representation of single words.
Close concepts have close vectors.
‘fight’ = [0.17 0.53 0.89 0.03 0.00 0.54 0.11 ]
‘kill’ = [0.17 0.53 0.91 0.06 0.00 0.54 0.12 ]
‘queen’ = [0.22 0.45 0.13 0.53 0.90 0.41 0.00 ]
Vector operations yield coherent results: king - man + woman ≈ queen
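The analogy trick, sketched with tiny hand-made vectors (toy values, not trained embeddings):

```python
vectors = {          # toy 3-dimensional "embeddings", made up for illustration
    'king':  [0.9, 0.8, 0.1],
    'man':   [0.9, 0.1, 0.1],
    'woman': [0.1, 0.1, 0.9],
    'queen': [0.1, 0.8, 0.9],
    'love':  [0.5, 0.0, 0.5],
}

def analogy(a, b, c):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    def dist(word):  # squared Euclidean distance to the target point
        return sum((t - v) ** 2 for t, v in zip(target, vectors[word]))
    # Exclude the query words themselves, as word2vec-style tooling does.
    return min((w for w in vectors if w not in (a, b, c)), key=dist)

result = analogy('king', 'man', 'woman')   # -> 'queen'
```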
How language is generated
[Diagram: an RNN unrolled over three time steps]
input words:         “What”   “is”    “the”
input embeddings:    x1       x2      x3      (via Wxh)
hidden states:       h1 →     h2 →    h3      (via Whh)
output likelihoods:  y1       y2      y3      (via Why)
target words:        “is”     “the”   “problem”
RNN and the problem of memory
All network state is held in a single cell, used over and over again. The internal
state can get really complicated, and moving values around during training can
lead to loss of information.
RNNs have a “plugin” architecture, in which we can use different types of cells:
Simple RNN cell: fastest, but breaks over long sequences. Outdated.
LSTM cell: slower, supports selectively forgetting and keeping data. The standard.
GRU cell: like an LSTM, but faster due to a simpler internal architecture. State of the art.
RNN demo
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Putting it all together
This is a classical seq2seq.
An image is fed to the CNN.
The CNN generates a state that
models the scene as a cluster of
objects.
The state is fed to an LSTM cell, which decodes it into the caption, one word at a time.
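The decoding loop can be sketched as below. The encoder and decoder here are throwaway stand-ins (a scripted word sequence) just to show the control flow; a real system plugs in a trained CNN and LSTM:

```python
def caption(image, encoder, decoder_step, word_embed, max_len=20):
    """Greedy decoding: the CNN state seeds the RNN, which emits one word per step."""
    state = encoder(image)          # CNN summarizes the scene into a state vector
    word = '<start>'
    sentence = []
    for _ in range(max_len):        # cap the length in case '<end>' never comes
        state, word = decoder_step(state, word_embed(word))
        if word == '<end>':
            break
        sentence.append(word)
    return ' '.join(sentence)

# Toy stand-ins for the trained components (assumptions for illustration only).
script = iter(['a', 'man', 'jumps', '<end>'])
result = caption(image=None,
                 encoder=lambda img: [0.0],
                 decoder_step=lambda state, emb: (state, next(script)),
                 word_embed=lambda w: [0.0])
# result == 'a man jumps'
```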
Avoiding distraction: Attention
We can train an intermediate network called
Attention that emphasizes relationships
between different parts of the
encoder (image) and
different time steps of the
decoder (the current word
being generated).
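A minimal sketch of the mechanism, using plain dot-product scores (one of several possible scoring functions) over toy region features:

```python
import math

def attention(decoder_state, encoder_regions):
    """Score each image region against the decoder state, softmax the scores,
    and return the weighted average region: the 'context' for the next word."""
    scores = [sum(d * r for d, r in zip(decoder_state, region))
              for region in encoder_regions]           # dot-product relevance scores
    m = max(scores)                                    # subtract max: numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]                # softmax: weights sum to 1
    context = [sum(w * region[i] for w, region in zip(weights, encoder_regions))
               for i in range(len(encoder_regions[0]))]
    return weights, context

regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy CNN region features
weights, context = attention([2.0, 0.0], regions)  # this state "looks for" feature 0
```

Regions whose features align with the decoder state get the largest weights, so the next word is generated while "looking at" the relevant part of the image.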
How attention works
End to end demo
https://www.captionbot.ai/
Beyond captioning: deep image search
Given an image and a question as input, the network outputs an answer.
Chain together CNN and RNN models into fully connected layers outputting over our vocabulary.
http://vqa.cloudcv.org/
[Diagram: image → CNN → h; “How many wheels has the skate?” → RNN → h;
the two states are merged and fed through FC → FC → FC to produce the answer]
Thank you

Alberto Massidda - Images and words: mechanics of automated captioning with neural networks - Codemotion Milan 2018
