From Object Classification to Caption Generation: a
Descriptive Approach
T. Campari, G. Etta, T. Sgarbanti
University of Padua
September 19, 2018
T. Campari, G. Etta, T. Sgarbanti Object Classification and Caption Generation September 19, 2018 1 / 74
Overview
1 Introduction
2 Object Detection
Convolutional Neural Networks
Residual Network
YOLO
Dataset
3 Caption Generation
Introduction
Model Architecture
Word Representation
Training Process
Caption Prediction and Audio Generation
4 Analysis
Image Analysis
Text Analysis
5 Discussion
Introduction
In this presentation we describe an application that, given an image,
identifies the visual elements within it and generates a descriptive
caption in both textual and audio form.
Our project is divided into two parts:
The first compares the two object detectors chosen for the
experiments (YOLO and ResNet50)
The second generates a caption using an LSTM that takes a
vocabulary and the image feature array previously produced
Technical Specification
For this project, a Google Cloud Platform virtual machine with the
following specifications was used:
2 x Nvidia Tesla P100
n1-highmem-2 (2 vCPU)
13GB of RAM
600GB of SSD
Object Detection
What is Object Detection?
It is a task whose goal is to recognize different classes of objects in a
given image.
What do we need in order to train an Object Detector?
A set of images;
A list of object classes to recognize;
A CNN which is able to recognize patterns in images (e.g. an Object
Detector like YOLO or ResNet50);
A ground-truth support file containing the Bounding Boxes for each
image.
Convolutional Neural Networks
A Convolutional Neural Network (CNN) is a particular type of artificial
neural network commonly used for image and video analysis.
A CNN is composed of several layers that, given an input such as an image,
transform it into a vector of class scores as output.
CNN layers have neurons arranged in three dimensions (width, height,
depth)
Three main types of layers:
Convolutional Layers
Pooling Layers
Fully Connected Layers
Convolutional Layers
A convolutional layer consists of a set of learnable filters able to detect
some visual features or patterns in images.
Figure: six 5x5x3 filters
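As a rough illustration of six 5x5x3 filters at work, the sketch below slides them over a small RGB input in plain Python. The 8x8 input size, stride 1, absence of padding and the constant filter weights are assumptions chosen for readability; they are not taken from the project.

```python
def conv2d(image, filters):
    """Valid convolution (computed as cross-correlation, stride 1, no padding)
    of an HxWxC image with K filters of size FxFxC."""
    h, w, c = len(image), len(image[0]), len(image[0][0])
    f = len(filters[0])
    out_h, out_w = h - f + 1, w - f + 1
    output = [[[0.0] * len(filters) for _ in range(out_w)] for _ in range(out_h)]
    for k, filt in enumerate(filters):
        for i in range(out_h):
            for j in range(out_w):
                total = 0.0
                for di in range(f):
                    for dj in range(f):
                        for ch in range(c):
                            total += image[i + di][j + dj][ch] * filt[di][dj][ch]
                output[i][j][k] = total
    return output

# Assumed toy input: an 8x8 RGB image filled with ones.
image = [[[1.0] * 3 for _ in range(8)] for _ in range(8)]
# Six 5x5x3 filters; filter k has every weight equal to k (illustrative values only).
filters = [[[[float(k)] * 3 for _ in range(5)] for _ in range(5)] for k in range(6)]

activation = conv2d(image, filters)
shape = (len(activation), len(activation[0]), len(activation[0][0]))
print(shape)  # (4, 4, 6): one 4x4 activation map per filter
```

Each filter produces its own activation map, so the output depth equals the number of filters.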
Pooling Layers
Reduce the volume of the representation
Reduce the number of parameters
Control overfitting
Pooling layers operate independently on each activation map, preserving
the input volume depth.
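The depth-preserving behaviour can be seen in a minimal sketch, assuming max pooling with a 2x2 window and stride 2 (the slides do not say which pooling variant or window size was used).

```python
def max_pool(volume, size=2, stride=2):
    """Max pooling applied to each depth slice of an HxWxD volume independently."""
    h, w, d = len(volume), len(volume[0]), len(volume[0][0])
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    return [[[max(volume[i * stride + di][j * stride + dj][k]
                  for di in range(size) for dj in range(size))
              for k in range(d)]
             for j in range(out_w)]
            for i in range(out_h)]

# Assumed 4x4x2 input: channel 0 holds the row-major position index, channel 1 its negative.
volume = [[[4 * i + j, -(4 * i + j)] for j in range(4)] for i in range(4)]
pooled = max_pool(volume)
pooled_shape = (len(pooled), len(pooled[0]), len(pooled[0][0]))
print(pooled_shape)  # (2, 2, 2): width and height halved, depth unchanged
print(pooled[0][0])  # [5, 0]
```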
Fully Connected Layer
Residual Networks
What are Residual Networks?
They are a type of Convolutional Neural Network that:
Exploits Residual Learning;
Residual Learning
Residuals
They are defined as H(x) = F(x) − x, that is, the difference between the
output F(x) of a layer and its input x (the original image).
Residual Learning
It refers to the training process in which the residual network, at each
layer, learns the residuals instead of learning the features of an image
directly.
Residual Networks
What are Residual Networks?
They are a type of Convolutional Neural Network that:
Exploits Residual Learning;
Resolves the degrading accuracy problem.
Degrading Accuracy
Degrading Accuracy: in plain convolutional neural networks, as the depth
of the network increases, accuracy saturates and then degrades. The image
below shows an example of this, where the number of layers of a CNN has
been increased.
This does not happen in residual networks, where accuracy does not
degrade as the number of layers increases.
Shortcut connections
They are the truly distinctive element of residual networks: they are what
allows the learning of residuals.
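A minimal sketch of a shortcut connection, assuming a block made of two fully connected layers with ReLU (actual ResNet blocks use convolutions and normalization layers). In the notation of these slides, the branch computes the residual H(x) and the shortcut adds x back, so the block outputs F(x) = H(x) + x. All weights here are illustrative.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(weights, v):
    """Matrix-vector product with weights given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def residual_block(x, w1, w2):
    """Return relu(x + residual), where the residual branch is w2 * relu(w1 * x)."""
    residual = linear(w2, relu(linear(w1, x)))
    return relu([a + r for a, r in zip(x, residual)])  # shortcut: add the input back

x = [1.0, 2.0, 3.0]
zero = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
print(residual_block(x, zero, zero))          # [1.0, 2.0, 3.0]
print(residual_block(x, identity, identity))  # [2.0, 4.0, 6.0]
```

With an all-zero residual branch the block simply passes its input through, which is why adding such blocks need not make a deeper network worse than a shallower one.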
ResNet50
It is a 50-layer Residual Network developed by Microsoft Research, used
here for the Object Detection task, with the following structure:
YOLO
You Only Look Once
YOLO analyzes the full image just once, applying a single neural
network that:
Divides the image into regions
Predicts bounding boxes for each region
Computes a confidence value for each box
Advantages
Predictions informed by global context in the image
Faster than other detection systems
Dataset I
Dataset
For this task we used a subset of Open Images V4 composed of 100K images
containing 135 different object classes.
How did we pick images?
We sorted the Open Images pictures by the number of objects in each image
and then took 33.3K images each from the beginning, the middle and the end
of the sorted list.
This was done so that the network would be trained both on images with
few objects and on images with many objects.
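The selection procedure can be sketched as follows. The function name, the toy image ids and the choice of a centred middle slice are assumptions; the sketch also assumes the dataset holds at least three times `per_slice` images, so the slices do not overlap.

```python
def sample_by_object_count(object_counts, per_slice):
    """Sort image ids by number of objects and take per_slice ids
    from the start, the middle and the end of the sorted list."""
    ordered = sorted(object_counts, key=lambda image_id: object_counts[image_id])
    mid_start = (len(ordered) - per_slice) // 2
    head = ordered[:per_slice]
    middle = ordered[mid_start:mid_start + per_slice]
    tail = ordered[-per_slice:]
    return head + middle + tail

# Hypothetical toy data: image id -> number of annotated objects.
counts = {f"img{i:02d}": i for i in range(1, 31)}
subset = sample_by_object_count(counts, per_slice=3)
print(subset)
```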
Dataset II
Training Process on YOLO
One XML file for each image, containing:
width and height
labels and bounding box coordinates for each object
Batch size of 64 images
30 epochs
Training process on ResNet50
How are the coordinates of Bounding Boxes managed by ResNet50?
ResNet50 needs a single CSV file containing all the Bounding Boxes for
every image of the training set. Each line of this file must have the
following structure:
path to image, x1, y1, x2, y2, label
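A small sketch of how such a file could be written with Python's `csv` module. The paths, coordinates and labels are hypothetical, and an in-memory buffer stands in for the real annotation file.

```python
import csv
import io

# Hypothetical annotations: (image path, x1, y1, x2, y2, label).
annotations = [
    ("images/0001.jpg", 34, 50, 200, 310, "Person"),
    ("images/0001.jpg", 220, 80, 400, 300, "Dog"),
    ("images/0002.jpg", 10, 15, 120, 90, "Car"),
]

buffer = io.StringIO()  # stands in for a file opened with open(..., "w", newline="")
writer = csv.writer(buffer, lineterminator="\n")
for row in annotations:
    writer.writerow(row)  # one line per bounding box, so an image can appear several times

csv_text = buffer.getvalue()
print(csv_text)
```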
Caption Generation
It is a task whose goal is to produce a short descriptive sentence for a
given image.
What do we need in order to generate a caption?
An image previously encoded into a feature array
A neural network which is able to "remember" the sequence being
generated
A dictionary (of words) with an appropriate representation
Remembering sequences
Since captions are sentences made of series of words, we need a model
that handles Sequential Data effectively, whose goal is to learn:
P(o|x)
where x is a sequence of input elements of a static type, whilst o is the
output, whose type can be either static or sequential.
Sequential data can be handled with Sequential Transductions, defined as:
$$T : X^{*} \to O^{*}$$
If a transduction is causal with a finite memory $k \in \mathbb{N}$, a
Recursive State Representation can be used. It depends on hidden state
variables tied to different time steps t. The hidden state at time t and
its output can be written as:
$$h_t = f(h_{t-1}, x_t, t)$$
$$o_t = g(h_t, x_t, t)$$
Recursive State Update
Shallow Recurrent Neural Network
A nonlinear model that implements f(·) and g(·) is called a Recurrent
Neural Network (RNN). In its shallow form, it uses tanh() to implement
both functions and sets h_0 = 0. The resulting representation and internal
structure are the following:
$$h_t = f(B h_{t-1} + A x_t)$$
$$o_t = g(C h_t)$$
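The recurrence can be sketched in plain Python, using tanh for both f and g and a zero initial state as stated above. The weight matrices A, B and C and the two-step input sequence are illustrative assumptions.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def rnn_forward(xs, A, B, C):
    """Run h_t = tanh(B h_{t-1} + A x_t), o_t = tanh(C h_t) over a sequence, with h_0 = 0."""
    h = [0.0] * len(B)
    outputs = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(B, h), matvec(A, x))]
        h = [math.tanh(p) for p in pre]
        outputs.append([math.tanh(o) for o in matvec(C, h)])
    return outputs, h

# Illustrative weights (assumptions): 2 inputs, 2 hidden units, 1 output.
A = [[0.5, 0.0], [0.0, 0.5]]
B = [[0.1, 0.0], [0.0, 0.1]]
C = [[1.0, -1.0]]
outputs, final_state = rnn_forward([[1.0, 0.0], [0.0, 1.0]], A, B, C)
print(outputs, final_state)
```

The same hidden state h is threaded through every step, which is what lets the network carry information about earlier inputs forward in time.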
Long Short Term Memory (LSTM)
Training RNNs with Backpropagation Through Time or Real-Time Recurrent
Learning exposes a problem in gradient computation known as the
Vanishing / Exploding Gradient problem. The solution adopted in this
project was to use Long Short-Term Memory (LSTM) networks.
Putting it all together
At this point, the final neural network is composed of:
An image encoder, as previously described, which produces a feature
array whose dimension is reduced to 256 in order to save space
An LSTM whose number of hidden states is equal to the maximum
caption length. It receives the reduced feature array as h_0 and a word,
previously one-hot encoded, as input. The output o_t is the next word,
predicted after a softmax computation
Assumptions and model choices
During development, some design choices were made regarding parts of the
learning model. Specifically:
For comparison purposes, a VGG-16 encoder was used in addition to
the ResNet50 image encoder, in order to test the reliability of the
NLG model itself
The embedding layer, which shrinks the image feature array to a
vector of 256 elements, was set up with a ReLU activation function in
order to optimize performance
Adaptive Moment Estimation (Adam) was chosen as the optimizer of
the model because it uses adaptive learning steps and stores a
decaying average of past gradients, speeding up training as
momentum does
Each cell in the LSTM produces a likelihood for each word. Since this
is a multi-class classification setting, a softmax activation function
was chosen, also for its "squashing" properties.
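To make the "squashing" property concrete, here is a small softmax sketch in plain Python. The four-word vocabulary and the raw scores are hypothetical.

```python
import math

def softmax(scores):
    """Map raw scores to probabilities that are positive and sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for a four-word vocabulary.
vocabulary = ["<END>", "dog", "runs", "grass"]
probabilities = softmax([0.1, 2.0, 0.5, 1.0])
best = vocabulary[probabilities.index(max(probabilities))]
print(best)  # dog
```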
Data transformation operations
Apart from image encoding, some further work was required in order to
represent words effectively for the learning algorithm. It was necessary to:
Strip punctuation and non-alphabetic symbols from words
Insert a <BEGIN> and an <END> placeholder into each caption in
order to mark its beginning and end
Remove one-letter words to improve efficiency
Represent words effectively in a numerical format
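The first three steps can be sketched as a single cleaning function. Lowercasing and the exact filtering rule (`str.isalpha`) are assumptions; the slides only list the goals of each step.

```python
import string

def clean_caption(caption):
    """Lowercase, strip punctuation, drop non-alphabetic and one-letter tokens, add markers."""
    table = str.maketrans("", "", string.punctuation)
    words = caption.lower().translate(table).split()
    words = [w for w in words if w.isalpha() and len(w) > 1]
    return ["<BEGIN>"] + words + ["<END>"]

print(clean_caption("A dog runs, fast, across the grass!"))
# ['<BEGIN>', 'dog', 'runs', 'fast', 'across', 'the', 'grass', '<END>']
```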
Word representation
Words cannot be represented as they are. In order to be "understood" by
ML algorithms, they need to be numerically encoded. To achieve this,
each unique word was mapped to its own index, and each index back to
its word.
Unfortunately, an integer representation of words is not enough, due to
performance issues and redundancy. The next step in order to represent
words effectively is to perform one-hot encoding:
Since words can be seen as categorical values, an MxN matrix
Encoding can be created, where M is the number of captions and
N the number of unique words
For each sentence i, a value of 1 is inserted into the cell for the
index j of each word it contains, such that Encoding[i][j] = 1, and 0
otherwise.
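Both encodings can be sketched on two toy captions. The captions and the alphabetical ordering of the vocabulary are assumptions made for the example.

```python
captions = [
    ["dog", "runs", "on", "grass"],
    ["cat", "sits", "on", "mat"],
]

# Integer encoding: each unique word gets an index, and each index maps back to its word.
vocabulary = sorted({word for caption in captions for word in caption})
word_to_index = {word: index for index, word in enumerate(vocabulary)}
index_to_word = {index: word for word, index in word_to_index.items()}

# M x N matrix: encoding[i][j] = 1 if caption i contains word j, 0 otherwise.
encoding = [[0] * len(vocabulary) for _ in captions]
for i, caption in enumerate(captions):
    for word in caption:
        encoding[i][word_to_index[word]] = 1

print(vocabulary)  # ['cat', 'dog', 'grass', 'mat', 'on', 'runs', 'sits']
print(encoding)    # [[0, 1, 1, 0, 1, 1, 0], [1, 0, 0, 1, 1, 0, 1]]
```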
As a result, a structure similar to the following is obtained:
Training Process
The training process was originally set up to use two datasets:
Flickr8k
MS-COCO
For time and performance reasons, we chose to work only with the first
one. It is composed of:
6000 training images
1000 testing images
1000 free-use images
Each image has 4 captions associated with it.
Two dictionaries were created to associate images with their captions and
encodings:
The first maps each image filename to the list of its captions
The second maps each image filename to its encoding, which depends
on the encoder used
For the captions, a further series of metrics and data structures was
computed for use by the Neural Network:
A list of the unique words across all the captions
The number of unique words
The caption with the highest number of words
The total number of words (samples) for the entire training set
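These quantities can be derived from the caption dictionary in a few lines. The dictionary contents and variable names below are hypothetical.

```python
# Hypothetical dictionary: image filename -> list of cleaned captions (lists of words).
captions_by_image = {
    "img1.jpg": [["<BEGIN>", "dog", "runs", "<END>"],
                 ["<BEGIN>", "brown", "dog", "runs", "fast", "<END>"]],
    "img2.jpg": [["<BEGIN>", "cat", "sleeps", "<END>"]],
}

all_captions = [c for caps in captions_by_image.values() for c in caps]
unique_words = sorted({w for c in all_captions for w in c})  # list of unique words
vocabulary_size = len(unique_words)                          # number of unique words
longest_caption = max(all_captions, key=len)                 # caption with the most words
total_words = sum(len(c) for c in all_captions)              # total number of words

print(vocabulary_size, len(longest_caption), total_words)  # 8 6 14
```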
The training process was performed for both configurations, one using
ResNet50 and the other using VGG16, for 3 epochs, each of them
requiring about 10 hours to be completed.
Caption Prediction
Given an image, caption prediction can be performed in two ways:
Argmax Prediction
Beam Search Prediction
Argmax Prediction
From the output of each LSTM cell, which is a list of words with their
likelihoods, the word with the highest value is always chosen.
Beam Search Prediction
Sometimes, choosing the most likely word does not lead to the most
descriptive caption. This problem can be addressed with the Beam Search
heuristic, which at each output step keeps k ∈ N+ text candidates,
called beams, each scored by summing the related likelihoods.
With this approach, at every step we choose among k candidates, which
are built iteratively as described.
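The two prediction strategies can be compared on a toy example. The next-word table below is a hypothetical stand-in for the LSTM output, and scoring beams by summed log-likelihood (equivalent to multiplying probabilities) is an assumption about how the scores are combined.

```python
import math

# Hypothetical next-word distributions keyed by the last word of the partial caption.
TOY_MODEL = {
    "<BEGIN>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.55, "cat": 0.45},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"<END>": 1.0},
    "cat": {"<END>": 1.0},
}

def beam_search(model, k, max_len=10):
    """Keep the k best partial captions at each step, scored by summed log-likelihood."""
    beams = [(0.0, ["<BEGIN>"])]
    for _ in range(max_len):
        candidates = []
        for score, words in beams:
            if words[-1] == "<END>":
                candidates.append((score, words))  # finished captions are carried over
                continue
            for word, prob in model[words[-1]].items():
                candidates.append((score + math.log(prob), words + [word]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(words[-1] == "<END>" for _, words in beams):
            break
    return beams[0][1]

greedy = beam_search(TOY_MODEL, k=1)  # argmax prediction is the special case k = 1
beam = beam_search(TOY_MODEL, k=2)
print(greedy)  # ['<BEGIN>', 'a', 'dog', '<END>']
print(beam)    # ['<BEGIN>', 'the', 'dog', '<END>']
```

In this toy model the greedy choice "a" leads to a caption with probability 0.33, while keeping two beams recovers "the dog" with probability 0.36.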
Audio Description Generation
The caption obtained from the prediction was then converted into
audio using the Google Text-to-Speech framework.
The caption was sent to an API of the framework together with
the voice settings
The API returned an .mp3 file containing the audio version of the
caption as its response
Analysis
IoU
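IoU - Intersection over Union
Intersection over Union is an evaluation metric used to measure the
accuracy of an object detector on a particular dataset. It is computed as
the area of overlap between the predicted and the ground-truth bounding
boxes divided by the area of their union:
$$IoU = \frac{\text{area of overlap}}{\text{area of union}}$$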
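A minimal IoU sketch for axis-aligned boxes, assuming each box is given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2. The two example boxes are made up.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

prediction = (0, 0, 10, 10)
ground_truth = (5, 0, 15, 10)
print(iou(prediction, ground_truth))  # 50 / 150, about 0.333
```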
TP, FP, TN, FN
The analysis was performed by dividing predictions according to the
following confusion matrix:
                     Real Positive   Real Negative
Predicted Positive   TP              FP
Predicted Negative   FN              TN
In particular, a prediction is a true positive if and only if the IoU
between the prediction and the ground truth is at least 0.5.
Measures
TP, FP, TN and FN were computed, and the following measures were used to
quantify the performance of the network:
Precision: $\frac{TP}{TP+FP}$, the fraction of True Positives among all
predicted positives;
Recall: $\frac{TP}{TP+FN}$, the fraction of True Positive predictions
among all actual positive cases;
F1 score: $2 \cdot \frac{precision \cdot recall}{precision + recall} = \frac{2TP}{2TP+FP+FN}$,
the harmonic mean of precision and recall.
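The three measures follow directly from the counts, as in this sketch. The counts are hypothetical and not taken from the experiments; returning 0.0 for an undefined ratio is a convention chosen here.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts, returning 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1

# Hypothetical counts, not taken from the experiments.
precision, recall, f1 = precision_recall_f1(tp=30, fp=70, fn=60)
print(precision, recall, f1)
```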
mAP
Mean Average Precision (mAP) is the mean over all classes of the Average
Precision:
$$mAP = \frac{\sum_{c \in C} AP(c)}{|C|}$$
AP
Average Precision (AP) is computed as the average of the maximum precision
at 11 fixed recall levels (0.0, 0.1, 0.2, ..., 0.9, 1.0):
$$AP = \frac{1}{11} \sum_{r \in \{0.0, \ldots, 1.0\}} AP_r = \frac{1}{11} \sum_{r \in \{0.0, \ldots, 1.0\}} p_{interp}(r)$$
$$p_{interp}(r) = \max_{\tilde{r} \ge r} p(\tilde{r})$$
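A sketch of the 11-point interpolation and of the mean over classes. The precision/recall points and class names are hypothetical, and a recall level with no measured point at or above it is assumed to contribute a precision of 0.

```python
def average_precision(recalls, precisions):
    """11-point interpolated AP from matching lists of recall and precision values."""
    total = 0.0
    for step in range(11):
        level = step / 10
        # Interpolated precision: the best precision at any recall >= level.
        candidates = [p for r, p in zip(recalls, precisions) if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / 11

def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Hypothetical precision/recall points for two classes.
ap = {
    "Dog": average_precision([0.2, 0.4, 0.6], [1.0, 0.8, 0.6]),
    "Cat": average_precision([0.5, 1.0], [1.0, 0.5]),
}
map_score = mean_average_precision(ap)
print(ap, map_score)
```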
Dataset for Analysis
For our analysis we used the Open Images V4 Validation Set, which has 41K
images and an associated CSV file containing all the correct bounding boxes.
Results on Object Detection - YOLO
           F1 Score   Precision   Recall
Yolo-0.1   0.210      0.134       0.476
Yolo-0.2   0.283      0.220       0.397
Yolo-0.3   0.312      0.296       0.331
Yolo-0.4   0.311      0.371       0.267
Yolo-0.5   0.278      0.499       0.201
Results on Object Detection - ResNet50
               F1 Score   Precision   Recall
ResNet50-0.1   0.061      0.032       0.785
ResNet50-0.2   0.190      0.112       0.640
ResNet50-0.3   0.311      0.222       0.520
ResNet50-0.4   0.379      0.354       0.409
ResNet50-0.5   0.377      0.498       0.303
A deeper analysis on YOLO with mAP
A deeper analysis on ResNet50 with mAP
Caption Evaluation
Captions were evaluated using 3 different metrics, each with a specific
purpose:
BLEU (BiLingual Evaluation Understudy): a precision-oriented
metric for machine-generated captions
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): a
recall-oriented metric
WER (Word Error Rate): used to measure the number of substituted,
missing and inserted words
BLEU
BLEU is a metric that evaluates the modified precision of a generated
sentence against a reference one.
It gives a score from 0 to 1 that indicates how good the match is. It relies
on the concept of n-grams, lists of n contiguous words. BLEU measures how
many n-grams of the machine-generated caption appear in the reference,
counting each n-gram at most as many times as it occurs in the reference.
This modified precision is written $B_nP_r$, where n is the size of the
n-gram.
Example:
Candidate: The cat is on the mat
Reference: There is a cat on the mat
BLEU 1-gram modified precision: 5/6 ("the" occurs twice in the candidate
but only once in the reference)
BLEU 2-grams modified precision: 2/5
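The modified (clipped) n-gram precision for this example can be checked with a short sketch. Lowercasing the sentences before comparison is an assumption; the function returns the numerator and denominator separately so the fractions stay visible.

```python
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped, sum(cand_counts.values())

candidate = "the cat is on the mat".split()
reference = "there is a cat on the mat".split()
print(modified_precision(candidate, reference, 1))  # (5, 6)
print(modified_precision(candidate, reference, 2))  # (2, 5)
```

Clipping is what brings the unigram numerator to 5: "the" appears twice in the candidate but is credited only once.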
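That being said, the formula is:
$$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} W_n \log(B_nP_r)\right)$$
BP: the Brevity Penalty, which handles the cases where the candidate is
shorter than the reference; c is the candidate length and r the
reference length
$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$
W_n: normalization in relation to N, the maximum n-gram size
$$W_n = \frac{1}{N}$$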
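The formula can be sketched as a function of the per-size precisions and the two lengths. The example values (a 6-word candidate, a 7-word reference, precisions 5/6 and 2/5 with N = 2) are assumptions for illustration, and returning 0 when any precision is 0 is a convention chosen here because the logarithm is undefined.

```python
import math

def brevity_penalty(c, r):
    """BP = 1 if the candidate length c exceeds the reference length r, else exp(1 - r/c)."""
    return 1.0 if c > r else math.exp(1 - r / c)

def bleu(precisions, c, r):
    """Combine n-gram precisions with uniform weights W_n = 1/N and the brevity penalty."""
    if any(p == 0 for p in precisions):
        return 0.0  # log(0) is undefined; the score collapses to zero
    weight = 1 / len(precisions)
    return brevity_penalty(c, r) * math.exp(sum(weight * math.log(p) for p in precisions))

# Assumed example: 1-gram and 2-gram modified precisions of 5/6 and 2/5 (N = 2).
score = bleu([5 / 6, 2 / 5], c=6, r=7)
print(round(score, 4))
```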
BLEU Analysis Results
           BLEU     BLEU-1   BLEU-2   BLEU-3   BLEU-4
VGG16      0.4719   0.1564   0.2896   0.4766   0.5274
ResNet50   0.5060   0.1795   0.3472   0.5049   0.5561
ROUGE
As previously stated, ROUGE measures recall, by counting how many of the
n-grams in the human reference captions appear in the machine-generated
ones.
For the previous example:
Candidate: The cat is on the mat
Reference: There is a cat on the mat
ROUGE 1-gram recall: 5/7 ("there" and "a" do not appear in the candidate)
ROUGE 2-grams recall: 2/6
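The n-gram recall for this example can be checked with a short sketch. Lowercasing the sentences is an assumption; the function returns the numerator and denominator separately.

```python
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def rouge_n_recall(candidate, reference, n):
    """Count reference n-grams that also occur in the candidate, over all reference n-grams."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap, sum(ref_counts.values())

candidate = "the cat is on the mat".split()
reference = "there is a cat on the mat".split()
print(rouge_n_recall(candidate, reference, 1))  # (5, 7)
print(rouge_n_recall(candidate, reference, 2))  # (2, 6)
```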
ROUGE Analysis Results
           ROUGE-1Rc   ROUGE-2Rc
VGG16      0.2737      0.12034
ResNet50   0.2798      0.1164
WER
In addition to the precision and the recall previously calculated, a third
measure is needed in order to know by how many word-level edits the
candidate caption differs from its reference. This measure is called Word
Error Rate (WER), defined as:
$$WER = \frac{S + D + I}{N}$$
Where:
S is the number of substitutions
D is the number of deletions
I is the number of insertions
N is the number of words in the reference sentence
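S + D + I is the word-level edit distance between reference and candidate, so WER can be sketched with the standard dynamic-programming table. Splitting on whitespace and the example sentences are assumptions, and the reference is assumed to be non-empty.

```python
def word_error_rate(candidate, reference):
    """Word-level edit distance (S + D + I) divided by the number of reference words."""
    hyp, ref = candidate.split(), reference.split()
    # distance[i][j]: edits needed to turn the first i reference words
    # into the first j candidate words
    distance = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        distance[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        distance[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = distance[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = distance[i - 1][j] + 1
            insertion = distance[i][j - 1] + 1
            distance[i][j] = min(substitution, deletion, insertion)
    return distance[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat is on the mat", "there is a cat on the mat"))  # 4 edits / 7 words
```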
WER Analysis Results
           WER      WERRatio
VGG16      11.676   0.6373125
ResNet50   8.487    0.5350625
Discussion and Future Improvements
Discussion
The detection results confirm what was expected:
ResNet50:
-: Needs a high confidence threshold to be sure about its predictions;
+: Finds more objects and always has a better recall than YOLO.
YOLO:
-: Finds fewer objects than ResNet50;
+: Fast algorithm that finds objects at lower confidence threshold values.
Caption generation, instead, produced metric scores and score
distributions that reflect the insufficient training of the model.
Despite this, the full pipeline, which starts from an input image and
ends with its textual and audio description, was completed.
Future Improvements
Improvements on Object Detector
To improve the performance of the Object Detectors, a more homogeneous
dataset must be used. Furthermore, a more powerful machine is needed in
order to train the Object Detectors well.
As for Caption Generation, results can be improved by:
Training the model for an adequate number of epochs
Reintroducing one-letter words to improve BLEU, ROUGE and WER
scores
Using a higher number of hidden layers
Training with larger datasets like MS-COCO
Demo
Thank you for the attention.
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
 
Barbie - Brand Strategy Presentation
Barbie - Brand Strategy PresentationBarbie - Brand Strategy Presentation
Barbie - Brand Strategy Presentation
 

Cognitive Services "From Object Classification to caption generation a descriptive approach"

  • 1. From Object Classification to Caption Generation: a Descriptive Approach T. Campari, G. Etta, T. Sgarbanti University of Padua September 19, 2018 T. Campari, G. Etta, T. Sgarbanti Object Classification and Caption Generation September 19, 2018 1 / 74
  • 2. Overview
      1 Introduction
      2 Object Detection: Convolutional Neural Networks; Residual Network; YOLO; Dataset
      3 Caption Generation: Introduction; Model Architecture; Word Representation; Training Process; Caption Prediction and Audio Generation
      4 Analysis: Image Analysis; Text Analysis
      5 Discussion
  • 3. Introduction. In this presentation we describe an application that, given an image, identifies the visual elements in it and generates a descriptive caption in textual and audio form. The project is divided into two parts: the first compares the two object detectors chosen for the experiments (YOLO and ResNet50); the second generates a caption using an LSTM that takes a vocabulary and the array of image features previously produced.
  • 4. Technical Specification. For this project a Google Cloud Platform virtual machine with the following specification was used: 2 x Nvidia Tesla P100; n1-highmem-2 (2 vCPU); 13GB of RAM; 600GB of SSD.
  • 5. Object Detection
  • 9. Object Detection. What is Object Detection? It is a task whose goal is to recognize different classes of objects in a given image. What do we need in order to train an Object Detector? A set of images; a list of object classes to recognize; a CNN able to recognize patterns in images (e.g. an object detector like YOLO or ResNet50); a ground-truth support file containing the bounding boxes for each image.
  • 10. Convolutional Neural Networks. A Convolutional Neural Network (CNN) is a type of artificial neural network commonly used for audio and video analysis. A CNN is composed of several layers that transform an input, such as an image, into a vector of class scores as output.
  • 11. Convolutional Neural Networks. CNN layers have neurons arranged in three dimensions (width, height, depth). There are three main types of layers: Convolutional Layers; Pooling Layers; Fully Connected Layers.
  • 12. Convolutional Layers. A convolutional layer consists of a set of learnable filters able to detect visual features or patterns in images. Figure: six 5×5×3 filters.
  • 13. Pooling Layers. They reduce the volume of the representation, reduce the number of parameters, and control overfitting. Pooling layers operate independently on each activation map, preserving the depth of the input volume.
  • 14. Fully Connected Layer
  • 15. Residual Networks. What are Residual Networks? They are a new type of Convolutional Neural Network that exploits Residual Learning.
  • 16. Residual Learning. Residuals are defined as $F(x) = H(x) - x$, that is, the difference between the mapping $H(x)$ that a stack of layers should produce and its input $x$. Residual Learning refers to training in which the residual network, at each block of layers, learns these residuals instead of learning the output features directly.
  • 17. Residual Networks. What are Residual Networks? They are a new type of Convolutional Neural Network that: exploits Residual Learning; resolves the degrading accuracy problem.
  • 18. Degrading Accuracy. In plain convolutional neural networks, once the network is deep enough, accuracy degrades as further layers are added. The figure on this slide shows an example, where the number of layers of a CNN has been increased. This does not happen in residual networks, where accuracy does not saturate as layers are added.
  • 19. Shortcut connections. They are the truly distinctive element of residual networks: they are what allows residuals to be learned.
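As a rough illustration of the shortcut connection described above, the sketch below adds a block's input back onto the output of its layers. The names `residual_block`, `layer_fn` and `zero_layers` are hypothetical, inputs are plain Python lists rather than image tensors, and this is not the ResNet50 implementation used in the project.

```python
def residual_block(x, layer_fn):
    """Apply layer_fn to x and add x back through the shortcut connection."""
    residual = layer_fn(x)  # what the stacked layers have learned
    if len(residual) != len(x):
        raise ValueError("shortcut needs matching input and output sizes")
    return [r + xi for r, xi in zip(residual, x)]


def zero_layers(x):
    # Hypothetical layers that have learned a zero residual.
    return [0.0 for _ in x]


print(residual_block([1.0, 2.0, 3.0], zero_layers))
```

When the layers output zeros, the block reduces to the identity, so an extra block can in principle leave the representation unchanged.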
  • 20. ResNet50. It is a Residual Network, developed by Microsoft Research for the Object Detection task, with the following structure:
  • 21. YOLO (You Only Look Once). YOLO analyzes the full image just once, applying a single neural network that: divides the image into regions; predicts bounding boxes; computes confidence values.
  • 22. YOLO Advantages. Predictions are informed by the global context of the image; faster than other detection systems.
  • 23. Dataset I. For this task we used a subset of OpenImage v4 composed of 100K images containing 135 different classes of objects. How did we pick the images? We sorted OpenImages by the number of objects per image and then took 33.3K images each from the beginning, the middle, and the end of the sorted dataset. This was done so that the network is trained both on images with few objects and on images with many.
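The selection procedure above can be sketched as follows. This is only a minimal illustration: the function name `pick_images`, the `object_counts` mapping (image id to number of annotated objects) and the exact position of the "middle" slice are assumptions, not details given in the slides.

```python
def pick_images(object_counts, per_slice):
    """Sort images by object count, then take slices from start, middle and end."""
    ordered = sorted(object_counts, key=object_counts.get)
    mid_start = max(0, (len(ordered) - per_slice) // 2)  # assumed centring of the middle slice
    head = ordered[:per_slice]
    middle = ordered[mid_start:mid_start + per_slice]
    tail = ordered[-per_slice:]
    # dict.fromkeys removes duplicates (possible on tiny datasets) and keeps order.
    return list(dict.fromkeys(head + middle + tail))


counts = {f"img{i}": i for i in range(9)}  # toy data: img0 has 0 objects, img8 has 8
print(pick_images(counts, per_slice=2))
```

With the real dataset, `per_slice` would be about 33.3K, giving the 100K images mentioned above.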
  • 24. Dataset II
  • 25. Training Process on YOLO. One XML file for each image, containing the image width and height, and the labels and bounding box coordinates for each object; batch size of 64 images; 30 epochs.
  • 26. Training process on ResNet50. How are the coordinates of bounding boxes managed by ResNet50? ResNet50 needs a single CSV file containing all the bounding boxes for each image of the training set. Each line of this file must have the following structure: path to image, x1, y1, x2, y2, label
  • 27. Caption Generation
  • 28. Caption Generation. It is a task whose goal is to produce a short descriptive sentence for a given image. What do we need in order to generate a caption? An image previously encoded into a feature array; a neural network able to "remember" the sequence being generated; a dictionary (of words) with an appropriate representation.
  • 29. Remembering sequences. Since sentences are series of words, we need a model able to manage sequential data effectively, whose goal is to learn $P(o \mid x)$, where $x$ is a sequence of input elements with a static type, whilst $o$ is the output, whose type can be either static or sequential.
  • 30. Remembering sequences. Consider Sequential Transductions, defined as $T : X^* \to O^*$. If a transduction is causal with a finite memory $k \in \mathbb{N}$, a Recursive State Representation can be used. It depends on hidden state variables associated with different time steps $t$. A hidden state at time $t$, and its output, can be represented with the following equations: $h_t = f(h_{t-1}, x_t, t)$ and $o_t = g(h_t, x_t, t)$.
  • 31. Recursive State Update
  • 32. Shallow Recursive Neural Network. A nonlinear model that addresses the problem of implementing $f(\cdot)$ and $g(\cdot)$ is called a Recurrent Neural Network (RNN). In its shallow form it uses $\tanh$ to implement both functions, with $h_0 = 0$. The resulting representation is: $h_t = f(B h_{t-1} + A x_t)$ and $o_t = g(C h_t)$.
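A single step of the shallow recurrent update above could look like the sketch below, using plain Python lists for the matrices A, B and C and tanh for both f and g, as stated on the slide. The helper names `matvec` and `rnn_step` and the toy weights are assumptions made for illustration only.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(A, B, C, h_prev, x_t):
    """One update: h_t = tanh(B h_{t-1} + A x_t), o_t = tanh(C h_t)."""
    pre_activation = [b + a for b, a in zip(matvec(B, h_prev), matvec(A, x_t))]
    h_t = [math.tanh(p) for p in pre_activation]
    o_t = [math.tanh(c) for c in matvec(C, h_t)]
    return h_t, o_t


# Toy one-dimensional weights (hypothetical values), starting from h_0 = 0.
A, B, C = [[1.0]], [[0.5]], [[1.0]]
h, o = rnn_step(A, B, C, h_prev=[0.0], x_t=[1.0])
print(h, o)
```

Calling `rnn_step` repeatedly, feeding each returned `h_t` back in as `h_prev`, unrolls the recursion over a sequence.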
  • 33. Long Short Term Memory (LSTM). Training with Back Propagation Through Time or Real Time Recurrent Learning exposes a problem in computing gradients for RNNs, known as the Vanishing / Exploding Gradients problem. One of the solutions, adopted in this project, is to use Long Short Term Memory (LSTM) networks.
  • 34. Putting it all together. At this point, the final neural network is composed of: the image encoder previously described, which produces an array of features whose dimension is reduced to 256 in order to save space; an LSTM whose number of hidden states is equal to the maximum caption length. The LSTM receives the reduced feature array as $h_0$ and a one-hot encoded word as input. The output $o_t$ is the new word predicted after a softmax computation.
  • 35. Assumptions and model choices. During development, choices were made about some parts of the learning model. Specifically: for comparison purposes, a VGG-16 encoder was used in addition to the ResNet50 image encoder, in order to test the reliability of the NLG model itself; the embedding layer that shrinks the image feature array to a vector of 256 was set up with a ReLU activation function in order to optimize performance.
  • 36. Assumptions and model choices. Adaptive Moment Estimation (Adam) was chosen as the optimizer because it uses adaptive learning steps and stores a decaying average of past gradients, speeding up training as momentum does. Each LSTM cell produces a likelihood for each word. Since this is a multi-class classification setting, a softmax activation function was chosen, also for its "squashing" properties.
  • 37. Data transformation operations. Apart from image encoding, further work was required to represent words effectively for the learning algorithm. It was necessary to: clean words of punctuation and non-literal symbols; insert a <BEGIN> and an <END> placeholder into each caption to mark its beginning and end; remove one-letter words to improve efficiency; represent words in a numerical format.
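The cleaning steps listed above might be sketched as follows. The function name `clean_caption` is hypothetical, and lowercasing the text is an added assumption that the slides do not mention.

```python
import string


def clean_caption(caption):
    """Strip punctuation, drop one-letter and non-alphabetic tokens, add markers."""
    no_punctuation = caption.lower().translate(str.maketrans("", "", string.punctuation))
    words = [w for w in no_punctuation.split() if len(w) > 1 and w.isalpha()]
    return ["<BEGIN>"] + words + ["<END>"]


print(clean_caption("A dog, running on the grass!"))
```

The remaining step, turning the cleaned words into numbers, is the subject of the word representation slides.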
  • 38. Word representation. Words cannot be represented as they are. In order to be "understood" by ML algorithms, they need to be numerically encoded. To achieve this, each word in the list of unique words was mapped to its own index, and each index back to its word.
  • 39. Word representation. Unfortunately, an integer representation of words is not enough, due to performance issues and redundancy. The next step is to perform one-hot encoding. Since words can be seen as categorical values, an M×N matrix Encoding can be created, where M is the number of captions and N the number of unique words. For each sentence i, Encoding[i][j] = 1 if the word with index j occurs in it, and 0 otherwise.
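A small sketch of the two encoding steps described above: a word-to-index mapping (and its inverse), followed by the M×N 0/1 matrix. The function names and the alphabetical ordering of the vocabulary are assumptions; the slides do not say how indices were assigned.

```python
def build_vocabulary(captions):
    """Map each unique word to an index and each index back to its word."""
    unique_words = sorted({word for caption in captions for word in caption.split()})
    word_to_index = {word: i for i, word in enumerate(unique_words)}
    index_to_word = {i: word for word, i in word_to_index.items()}
    return word_to_index, index_to_word


def encode_captions(captions, word_to_index):
    """Build the M x N matrix with a 1 where caption i contains word j."""
    encoding = [[0] * len(word_to_index) for _ in captions]
    for i, caption in enumerate(captions):
        for word in caption.split():
            encoding[i][word_to_index[word]] = 1
    return encoding


captions = ["dog runs", "cat runs"]
word_to_index, index_to_word = build_vocabulary(captions)
print(word_to_index)
print(encode_captions(captions, word_to_index))
```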
  • 40. Word representation. As a result, a structure similar to the following is obtained:
  • 41. Training Process. The training process was originally set up to use two datasets: Flickr8k and MS-COCO. For time and performance reasons, we chose to work only with the first one. It is composed of: 6000 training images; 1000 testing images; 1000 free-use images. Each image has 4 associated captions.
  • 42. Training process. Two dictionaries were created to associate images with their captions and encodings: the first maps each filename to the list of its captions; the second maps each filename to its encoding, depending on the encoder used.
  • 43. Training process. From the captions, a series of further metrics and data structures were computed for use by the neural network: the list of unique words across all captions; the number of unique words; the caption with the highest number of words; the total number of words (samples) in the entire training set.
  • 44. Training process. Training was performed for both configurations, one using ResNet50 and the other using VGG16, for 3 epochs, each requiring about 10 hours to complete.
  • 45. Caption Prediction. Given an image, caption prediction can be performed in two ways: Argmax Prediction; Beam Search Prediction.
  • 46. Argmax Prediction. From the output of each LSTM cell, which is a list of words with their likelihoods, the word with the highest value is always chosen.
  • 47. Beam Search Prediction. Sometimes, choosing the most likely word at each step does not lead to the most descriptive caption. This problem can be addressed with the Beam Search heuristic, which at each output step keeps $k \in \mathbb{N}^+$ candidate texts, called beams, scored by summing the likelihoods of their words. With this approach, we always choose among k candidates at a time, built iteratively as explained.
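To make the difference between the two prediction strategies concrete, here is a minimal sketch on a toy next-word table. Everything in it is illustrative: `TOY_MODEL` stands in for the LSTM output, the function names are hypothetical, and the beam score is a sum of log-likelihoods, which is a common choice but an assumption about how the project summed likelihoods.

```python
import math

# Hypothetical next-word likelihoods, keyed by the last word generated so far.
TOY_MODEL = {
    "<BEGIN>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.5},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"<END>": 1.0},
    "cat": {"<END>": 1.0},
}


def argmax_decode(model, max_len=10):
    """Always pick the single most likely next word."""
    words = ["<BEGIN>"]
    while words[-1] != "<END>" and len(words) < max_len:
        options = model[words[-1]]
        words.append(max(options, key=options.get))
    return words


def beam_search(model, k, max_len=10):
    """Keep the k best partial captions, scored by summed log-likelihood."""
    beams = [(["<BEGIN>"], 0.0)]
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            if words[-1] == "<END>":
                candidates.append((words, score))  # finished beams are carried over
                continue
            for word, prob in model[words[-1]].items():
                candidates.append((words + [word], score + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(words[-1] == "<END>" for words, _ in beams):
            break
    return beams[0][0]


print(argmax_decode(TOY_MODEL))
print(beam_search(TOY_MODEL, k=2))
```

On this toy table the greedy choice commits to "a" early, while the beam keeps "the" alive long enough to find a caption with higher overall likelihood.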
  • 48. Audio Description Generation. The caption obtained from the prediction was then converted into audio using the Google Text To Speech framework: the caption was sent to an API of the framework together with other voice settings; the API returned an .mp3 file containing the audio version of the caption as its response.
  • 49. Analysis
  • 50. IoU (Intersection over Union). Intersection over Union is an evaluation metric used to measure the accuracy of an object detector on a particular dataset. It can be computed as follows: $\mathrm{IoU} = \frac{\text{area of intersection}}{\text{area of union}}$ of the predicted and ground-truth bounding boxes.
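A minimal sketch of the IoU computation for two axis-aligned boxes, assuming each box is given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2, the same corner order as the CSV annotation format. The function name `iou` is hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```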
  • 51. TP, FP, TN, FN. The analysis was performed by dividing predictions according to the following confusion matrix:
                            Real Positive   Real Negative
      Predicted Positive    TP              FP
      Predicted Negative    FN              TN
    In particular, a prediction is a true positive if and only if the IoU between prediction and ground truth is at least 0.5.
  • 54. Measures. TP, FP, TN and FN were computed, and the following measures were used to quantify the performance of the network: Precision: $\frac{TP}{TP+FP}$, the fraction of true positives among all positive predictions; Recall: $\frac{TP}{TP+FN}$, the fraction of true positive predictions among all actual positive cases; F1 score: $2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2TP}{2TP+FP+FN}$, the harmonic mean of precision and recall.
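The three measures can be computed directly from the counts, as in the sketch below. The function name and the toy counts are made up for illustration; returning 0.0 for an empty denominator is a convention chosen here, not something stated in the slides.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(precision, recall, f1)
```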
  • 55. mAP. Mean Average Precision is the mean over all classes of the Average Precision: $\mathrm{mAP} = \frac{\sum_{c \in C} AP(c)}{|C|}$
  • 56. AP. Average Precision is computed as the average of the maximum precision at 11 fixed recall levels (0.0, 0.1, 0.2, ..., 0.9, 1.0): $AP = \frac{1}{11} \sum_{r \in \{0.0, \ldots, 1.0\}} AP_r = \frac{1}{11} \sum_{r \in \{0.0, \ldots, 1.0\}} p_{\mathrm{interp}}(r)$, where $p_{\mathrm{interp}}(r) = \max_{\tilde{r} \geq r} p(\tilde{r})$.
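The 11-point interpolation and the mean over classes can be sketched as below. The inputs are assumed to be matching lists of recall and precision values measured for one class; the function names and the toy numbers are hypothetical.

```python
def average_precision_11pt(recalls, precisions):
    """11-point interpolated AP from paired recall/precision measurements."""
    total = 0.0
    for i in range(11):
        level = i / 10
        # Interpolated precision: best precision at any recall >= this level.
        reachable = [p for r, p in zip(recalls, precisions) if r >= level]
        total += max(reachable) if reachable else 0.0
    return total / 11


def mean_average_precision(ap_by_class):
    """Mean of the per-class AP values."""
    return sum(ap_by_class.values()) / len(ap_by_class)


ap = average_precision_11pt(recalls=[0.5, 1.0], precisions=[1.0, 0.5])
print(ap)
print(mean_average_precision({"Cat": 0.5, "Dog": 1.0}))
```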
  • 57. Dataset for Analysis. For our analysis we used the OpenImage v4 validation set, which has 41K images and an associated CSV file containing all the correct bounding boxes.
  • 58. Results on Object Detection - YOLO
      Configuration   F1 Score   Precision   Recall
      Yolo-0.1        0.210      0.134       0.476
      Yolo-0.2        0.283      0.220       0.397
      Yolo-0.3        0.312      0.296       0.331
      Yolo-0.4        0.311      0.371       0.267
      Yolo-0.5        0.278      0.499       0.201
  • 59. Results on Object Detection - ResNet50
      Configuration   F1 Score   Precision   Recall
      ResNet50-0.1    0.061      0.032       0.785
      ResNet50-0.2    0.190      0.112       0.640
      ResNet50-0.3    0.311      0.222       0.520
      ResNet50-0.4    0.379      0.354       0.409
      ResNet50-0.5    0.377      0.498       0.303
  • 60. A deeper analysis on YOLO with mAP
  • 61. A deeper analysis on ResNet50 with mAP
  • 62. Caption Evaluation. Captions were evaluated using 3 different metrics with specific purposes: BLEU (BiLingual Evaluation Understudy), a precision-oriented metric for machine-generated captions; ROUGE (Recall-Oriented Understudy for Gisting Evaluation), a recall-oriented metric; WER (Word Error Rate), used to measure the number of substituted, missing and extra words.
  • 63. BLEU. BLEU is a metric that evaluates the modified precision of a generated sentence against a reference one. It gives a score from 0 to 1 that indicates how good the match is. It relies on the concept of n-grams, lists of n contiguous words. BLEU measures how many n-grams of the machine-generated caption appear in the reference, counting each n-gram at most as many times as it occurs in the reference. This quantity is written BnPr, where n is the size of the n-gram. Example: Candidate: The cat is on the mat. Reference: There is a cat on the mat. BLEU 1-gram modified precision: 5/6 (the candidate contains "the" twice, the reference only once). BLEU 2-gram modified precision: 2/5.
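The modified n-gram precision can be checked on the example sentences with a short sketch. The helper names are hypothetical, and comparing words case-insensitively after whitespace splitting is an assumption about tokenization.

```python
from collections import Counter


def ngrams(words, n):
    """All contiguous n-word tuples in a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def modified_precision(candidate, reference, n):
    """Return (clipped matches, candidate n-gram count) for one reference."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    # Clip each candidate n-gram count by its count in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped, sum(cand.values())


candidate = "The cat is on the mat"
reference = "There is a cat on the mat"
print(modified_precision(candidate, reference, 1))
print(modified_precision(candidate, reference, 2))
```

Clipping is what caps the two occurrences of "the" in the candidate at the single occurrence in the reference.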
  • 64. BLEU. That being said, the formula is: $\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} W_n \log(\mathrm{BnPr})\right)$, where BP is the Brevity Penalty, which handles the cases where the candidate is shorter than the reference, and $W_n$ is a normalization weight that depends on N, the maximum n-gram size: $BP = 1$ if $c > r$ and $BP = e^{1 - r/c}$ if $c \leq r$, with $c$ the length of the candidate and $r$ the length of the reference; $W_n = \frac{1}{N}$.
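Given the per-size modified precisions, the formula above reduces to a few lines. In this sketch the function name `bleu_score` and its argument layout are hypothetical; returning 0.0 when any precision is zero is a convention chosen here because the logarithm is undefined in that case.

```python
import math


def bleu_score(precisions, candidate_len, reference_len):
    """BLEU from modified precisions for n = 1..N with uniform weights 1/N."""
    if not precisions or min(precisions) <= 0:
        return 0.0
    c, r = candidate_len, reference_len
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    weight = 1 / len(precisions)
    return brevity_penalty * math.exp(sum(weight * math.log(p) for p in precisions))


# Toy values: unigram precision 5/6 and bigram precision 2/5, candidate of 6 words, reference of 7.
print(bleu_score([5 / 6, 2 / 5], candidate_len=6, reference_len=7))
```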
  • 65. BLEU Analysis Results
                 BLEU     BLEU-1   BLEU-2   BLEU-3   BLEU-4
      VGG16      0.4719   0.1564   0.2896   0.4766   0.5274
      ResNet50   0.5060   0.1795   0.3472   0.5049   0.5561
  • 66. BLEU Analysis Results
  • 67. ROUGE. As previously stated, ROUGE measures recall, by counting how many of the n-grams in the human reference captions appear in the machine-generated ones. With the previous example: Candidate: The cat is on the mat. Reference: There is a cat on the mat. ROUGE 1-gram recall: 5/7 ("there" and "a" do not appear in the candidate). ROUGE 2-gram recall: 2/6.
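ROUGE-N recall looks at the same n-gram overlap from the reference side. The sketch below is a simplified single-reference version with a hypothetical function name and case-insensitive whitespace tokenization as an assumption; it is not the evaluation script used in the project.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n):
    """Return (matched reference n-grams, total reference n-grams)."""
    def count_ngrams(text):
        words = text.lower().split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    cand, ref = count_ngrams(candidate), count_ngrams(reference)
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched, sum(ref.values())


candidate = "The cat is on the mat"
reference = "There is a cat on the mat"
print(rouge_n_recall(candidate, reference, 1))
print(rouge_n_recall(candidate, reference, 2))
```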
  • 68. Analysis Text Analysis ROUGE Analysis Results
                ROUGE-1 recall  ROUGE-2 recall
    VGG16       0.2737          0.12034
    ResNet50    0.2798          0.1164
  • 69. Analysis Text Analysis ROUGE Analysis Results
  • 70. Analysis Text Analysis ROUGE Analysis Results
  • 71. Analysis Text Analysis WER
    In addition to the precision and the recall previously calculated, a third measure is needed in order to know by how many words the candidate caption differs from its reference. This measure is called Word Error Rate (WER), defined as:
    $$\mathrm{WER} = \frac{S + D + I}{N}$$
    where:
    S is the number of substitutions
    D is the number of deletions
    I is the number of insertions
    N is the number of words in the reference sentence
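One common way to obtain S + D + I is the word-level Levenshtein (edit) distance between reference and candidate, computed by dynamic programming. The sketch below assumes that approach; the slides do not say how the project computed it. The function name `word_error_rate` is hypothetical, and it returns both the raw number of edits and the ratio over the reference length.

```python
def word_error_rate(candidate, reference):
    """Return (edits, wer) for two token lists.

    edits is the minimum number of word substitutions, deletions and
    insertions turning the reference into the candidate; wer is
    edits divided by the number of reference words.
    """
    if not reference:
        raise ValueError("reference must contain at least one word")
    rows, cols = len(reference) + 1, len(candidate) + 1
    # dist[i][j]: edit distance between reference[:i] and candidate[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i reference words
    for j in range(cols):
        dist[0][j] = j  # insert all j candidate words
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if reference[i - 1] == candidate[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    edits = dist[-1][-1]
    return edits, edits / len(reference)


# Assumption: lowercase text, whitespace tokenization.
candidate = "the cat is on the mat".split()
reference = "there is a cat on the mat".split()
edits, wer = word_error_rate(candidate, reference)
print(f"edits: {edits}, WER: {wer:.4f}")
```

Because insertions are counted, the ratio can exceed 1 when the candidate is much longer than the reference.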
  • 72. Analysis Text Analysis WER Analysis Results
                WER     WERRatio
    VGG16       11.676  0.6373125
    ResNet50    8.487   0.5350625
  • 73. Discussion Discussion and Future Improvements
  • 74. Discussion Discussion
    Results for detection demonstrate what was already expected:
    ResNet50:
      -: Needs high confidence to be sure about predictions;
      +: Finds more objects and always has a better recall than YOLO.
    YOLO:
      -: Finds fewer objects than ResNet50;
      +: Fast algorithm that finds objects at lower confidence threshold values.
  • 75. Discussion Discussion
    Caption generation, instead, produced metric scores, and a distribution of their values, that reflected the model's lack of training. Despite this, the full pipeline, which starts from an input image and ends with its textual and audio description, was achieved.
  • 76. Discussion Future Improvements
    Improvements on Object Detector
    To improve the Object Detectors' performance, a more homogeneous dataset must be used. Furthermore, a more powerful machine is needed in order to train the Object Detectors well.
  • 77. Discussion Future Improvements
    As for Caption Generation, results can be improved by:
    Training the model for an adequate number of epochs
    Reintroducing one-letter words to improve BLEU, ROUGE and WER scores
    Using a higher number of hidden layers
    Training with larger datasets like MS-COCO
  • 78. Discussion Demo Demo
  • 79. Discussion Thank you for the attention.