Cognitive Services "From Object Classification to caption generation a descriptive approach"

From Object Classification to Caption Generation: a Descriptive Approach
T. Campari, G. Etta, T. Sgarbanti
University of Padua
September 19, 2018

Overview
1 Introduction
2 Object Detection
Convolutional Neural Networks
Residual Network
YOLO
Dataset
3 Caption Generation
Introduction
Model Architecture
Word Representation
Training Process
Caption Prediction and Audio Generation
4 Analysis
Image Analysis
Text Analysis
5 Discussion

Introduction
Introduction
In this presentation we describe an application which, given an
image, identifies the visual elements within it and generates a
descriptive caption in both textual and audio formats.
Our project is divided into two parts:
The first compares the two object detectors
chosen for the experiments (YOLO and ResNet50)
The second generates a caption using an LSTM, which takes a
vocabulary and the image feature array previously produced

Introduction
Technical Specification
For this project, a Google Cloud Platform Virtual Machine with the following
specification was used:
2 x Nvidia Tesla P100
n1-highmem-2 (2 vCPU)
13GB of RAM
600GB of SSD

Object Detection
Object Detection

Object Detection
What is Object Detection?
It is a task whose goal is to recognize different classes of objects in a given
image.
What do we need in order to train an Object Detector?
A set of images;
A list of object classes to recognize;
A CNN which is able to recognize patterns in images (e.g. an Object
Detector like YOLO or ResNet50);
A ground-truth support file containing the Bounding Boxes for each
image.

Object Detection Convolutional Neural Networks
A Convolutional Neural Network (CNN) is a particular type of artificial
neural network commonly used for image and video analysis.
A CNN is composed of several layers that, given an input such as an image,
transform it into a vector of class scores as output.

CNN layers have neurons arranged in three dimensions (width, height,
depth)
Three main types of layers:
Convolutional Layers
Pooling Layers
Fully Connected Layers

Convolutional Layers
A convolutional layer consists of a set of learnable filters able to detect
some visual features or patterns in images.
Figure: six 5x5x3 filters

Pooling Layers
Reduce the volume of the representation
Reduce the amount of parameters
Control overfitting
Pooling layers operate independently on each activation map preserving
the input volume depth.

Fully Connected Layer

Object Detection Residual Network
Residual Networks
What are Residual Networks?
They are a new type of Convolutional Neural Network that:
Exploits Residual Learning;

Residual Learning
Residuals
Are defined as F(x) = H(x) − x, that is, the difference between the output
H(x) of a layer and its input x.
Residual Learning
Refers to the training process in which the residual network, at
each layer, learns the residuals instead of directly learning the features of an
image.

Residual Networks
What are Residual Networks?
They are a new type of Convolutional Neural Network that:
Exploits Residual Learning;
Resolves degrading accuracy problem.

Degrading Accuracy
Degrading Accuracy: in plain convolutional neural networks, as the depth of
the network increases, the accuracy degrades. The image below shows
an example of this effect as the number of layers of a CNN is
increased.
This does not happen in residual networks, where accuracy does not
saturate as layers are added.

Shortcut connections
They are the truly distinctive element of residual networks, and they allow
the learning of residuals.

ResNet50
It is a Residual Network with 50 layers, developed by Microsoft Research and
used here for the Object Detection task, with the following structure:

Object Detection YOLO
YOLO
You Only Look Once
YOLO analyzes the full image just once, applying a single neural
network that:
Divides the image into regions
Predicts bounding boxes
Computes a confidence value for each bounding box

Object Detection YOLO
Advantages
Predictions informed by global context in the image
Faster than other detection systems

Object Detection Dataset
Dataset I
Dataset
For this task we used a subset of Open Images V4 composed of 100K images
containing 135 different classes of objects.
How did we pick images?
We sorted the Open Images dataset by the number of objects in each
image and then took 33.3K images each from the beginning, the
middle and the end of the sorted dataset.
This was done in order to train our networks both on images
with few objects and on images with a greater number of objects.

Dataset II

Training Process on YOLO
One XML file for each image:
width and height
labels and bounding box coordinates for each object
Batch size of 64 images
30 epochs

Training process on ResNet50
How are the coordinates of Bounding Boxes managed by ResNet50?
ResNet50 needs a single CSV file containing all the Bounding Boxes for every
image of the training set. Each line of this file must have the following
structure:
path to image, x1, y1, x2, y2, label
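As an illustration of the line structure above, the following is a minimal sketch that writes such a CSV with the Python standard library. The file paths, labels and coordinates are hypothetical, and we assume that (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner of each box.

```python
import csv
import io

# Hypothetical annotations: (path to image, x1, y1, x2, y2, label).
annotations = [
    ("images/0001.jpg", 34, 50, 210, 300, "Dog"),
    ("images/0001.jpg", 220, 40, 400, 310, "Person"),
    ("images/0002.jpg", 10, 15, 120, 90, "Car"),
]


def write_annotations(rows, stream):
    """Write one bounding box per line: path, x1, y1, x2, y2, label."""
    writer = csv.writer(stream)
    for path, x1, y1, x2, y2, label in rows:
        # Assumption: (x1, y1) is the top-left and (x2, y2) the bottom-right corner.
        if not (x1 < x2 and y1 < y2):
            raise ValueError(f"degenerate bounding box for {path}")
        writer.writerow([path, x1, y1, x2, y2, label])


# An in-memory buffer stands in for the real CSV file.
buffer = io.StringIO()
write_annotations(annotations, buffer)
csv_text = buffer.getvalue()
```

An image with several objects simply appears on several lines, one per bounding box.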

Caption Generation Introduction
Caption Generation

Caption Generation
It is a task whose goal is to produce a short descriptive sentence for a given
image.
What do we need in order to generate a caption?
An image previously encoded into a feature array
A neural network which is able to "remember" the sequence being
generated
A dictionary (of words) with an appropriate representation

Remembering sequences
Since we need to cope with sentences composed of series of words, we need a
model able to manage Sequential Data effectively, whose
goal is to learn:
P(o|x)
where x is a sequence of input elements with a static type, whilst o is the
output, whose type can be either static or sequential.

Remembering sequences
Consider a Sequential Transduction, defined as:
T : X^* \to O^*
If it is causal with a finite memory k ∈ N, a Recursive State
Representation can be used. It depends on hidden
state variables related to different time steps t. The hidden state at
time t, and its output, can be represented with the following equations:
h_t = f(h_{t-1}, x_t, t)
o_t = g(h_t, x_t, t)

Recursive State Update

Shallow Recursive Neural Network
A nonlinear model which addresses the problem of implementing f(·) and
g(·) is called a Recurrent Neural Network (RNN). In its shallow
form, it uses tanh() to implement both functions
and sets h_0 = 0. The resulting representation and internal
structure are the following:
h_t = f(B h_{t-1} + A x_t)
o_t = g(C h_t)
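To make the recursive state update concrete, here is a minimal sketch of the forward pass of a shallow RNN in plain Python, with tanh used for both f and g and h_0 = 0. The matrices A, B, C and the input sequence are toy values chosen only for illustration; they are not the weights of our model.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_forward(inputs, A, B, C):
    """Run a shallow RNN over a sequence and return all outputs and the last state."""
    h = [0.0] * len(B)  # h_0 = 0
    outputs = []
    for x in inputs:
        pre_activation = [b + a for b, a in zip(matvec(B, h), matvec(A, x))]
        h = [math.tanh(p) for p in pre_activation]            # h_t = tanh(B h_{t-1} + A x_t)
        outputs.append([math.tanh(o) for o in matvec(C, h)])  # o_t = tanh(C h_t)
    return outputs, h


# Toy weights: 2 input features, 2 hidden units, 1 output.
A = [[0.5, -0.2], [0.1, 0.3]]
B = [[0.0, 0.1], [0.2, 0.0]]
C = [[1.0, -1.0]]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

outputs, last_state = rnn_forward(sequence, A, B, C)
```

The same hidden state h is carried from one time step to the next, which is what allows the network to "remember" the part of the sequence already seen.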

Long Short Term Memory (LSTM)
Back Propagation Through Time and Real Time Recurrent Learning
expose a problem related to gradient computation in
RNNs, known as the Vanishing / Exploding Gradients problem. One of
the possible solutions, adopted in the project, is to use a Long Short Term
Memory (LSTM) network.

Caption Generation Model Architecture
Putting it all together
At this point, the final neural network is composed of:
The image encoder previously described, which produces an array of
features whose dimension is reduced to 256 in order to save space
An LSTM whose number of hidden states is equal to the maximum
caption length. It receives the reduced feature array as h_0 and a
previously one-hot encoded word as input. The output o_t is the new word
predicted after a softmax computation

Assumptions and model choices
During development, some choices were made regarding parts of
the learning model. Specifically:
For comparison purposes, a VGG-16 encoder was used in addition to
the ResNet50 image encoder, in order to test the
reliability of the NLG model itself
The embedding layer which shrinks the image feature array to a
vector of 256 elements was set up with a ReLU activation function in order to
optimize performance

Assumptions and model choices
Adaptive Moment Estimation (Adam) was chosen as the optimizer of
the model because it uses adaptive learning steps and
stores a decaying average of past gradients, speeding up training
as momentum does
Each LSTM cell produces a likelihood for each word. Since this is
a multi-class classification setting, a softmax
activation function was chosen, also due to its "squashing" properties.

Data transformation operations
Apart from image encoding, some other work was required in order to
represent words effectively for the learning algorithm. It was necessary to:
Clean words of punctuation and non-alphabetic symbols
Insert a <BEGIN> and an <END> placeholder into each caption in
order to mark its beginning and end
Remove one-letter words to improve efficiency
Represent the words effectively in a numerical format
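The cleaning steps listed above can be sketched as follows. This is only an illustrative version: the function name is hypothetical, and lowercasing the caption is an extra assumption that is not stated in the slides.

```python
import string

BEGIN, END = "<BEGIN>", "<END>"


def clean_caption(caption):
    """Strip punctuation, drop non-alphabetic and one-letter tokens, add markers."""
    # Assumption: captions are lowercased so that "Dog" and "dog" share one entry.
    without_punctuation = caption.lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    words = [
        word
        for word in without_punctuation.split()
        if word.isalpha() and len(word) > 1  # remove symbols and one-letter words
    ]
    return " ".join([BEGIN] + words + [END])


cleaned = clean_caption("A dog runs, fast!")
```

The markers are added after the punctuation is removed, otherwise the angle brackets of the placeholders would be stripped as well.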

Caption Generation Word Representation
Word representation
Words cannot be represented as they are. In order to be "understood" by
ML algorithms, they need to be numerically encoded. To achieve this,
each word in the list of unique words was mapped to its own index, and
each index back to its word.
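A minimal sketch of this two-way mapping is shown below. The function name and the toy captions are hypothetical, and sorting the unique words is just one possible way of fixing the index order.

```python
def build_vocabulary(captions):
    """Return the word -> index and index -> word mappings for a list of captions."""
    unique_words = sorted({word for caption in captions for word in caption.split()})
    word_to_index = {word: index for index, word in enumerate(unique_words)}
    index_to_word = {index: word for word, index in word_to_index.items()}
    return word_to_index, index_to_word


captions = ["dog runs fast", "cat sleeps", "dog sleeps"]
word_to_index, index_to_word = build_vocabulary(captions)
```

The reverse mapping is needed at prediction time, when the index chosen by the network has to be turned back into a word.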

Word representation
Unfortunately, an integer representation of words is not enough, due to
performance issues and redundancy. The next step in order to represent
words effectively is to perform one-hot encoding:
Since words can be seen as categorical values, an M×N matrix
Encoding can be created, where M is the number of captions and
N is the number of unique words
For each sentence i and each word j it contains, a value 1 is inserted into the
cell at the index of word j, such that Encoding[i][j] = 1, and 0 otherwise.
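The construction of the Encoding matrix can be sketched as follows, assuming a word-to-index dictionary is already available. The vocabulary and captions below are toy values used only for illustration.

```python
def encode_captions(captions, word_to_index):
    """Build the M x N matrix with Encoding[i][j] = 1 if caption i contains word j."""
    n_words = len(word_to_index)
    encoding = [[0] * n_words for _ in captions]
    for i, caption in enumerate(captions):
        for word in caption.split():
            encoding[i][word_to_index[word]] = 1
    return encoding


# Hypothetical vocabulary of N = 5 unique words and M = 2 captions.
word_to_index = {"cat": 0, "dog": 1, "fast": 2, "runs": 3, "sleeps": 4}
captions = ["dog runs fast", "cat sleeps"]
encoding = encode_captions(captions, word_to_index)
```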

Word representation
As a result, a structure similar to the following is obtained:

Caption Generation Training Process
Training Process
The training process originally was set up to use two datasets:
Flickr8k
MS-COCO
Due to time and performance constraints, we chose to work only with the
first one. It is composed of:
6000 training images
1000 testing images
1000 free-use images
Each image has 4 associated captions.

Training process
Two dictionaries were created to associate images with their captions and
encodings:
The first maps each filename to the list of its captions
The second maps each filename to its encoding,
depending on the encoder used

Training process
For the captions, a series of further metrics and data structures
were computed to be used by the Neural Network:
A list of unique words through all the captions
The number of unique words
The caption with the highest number of words
The total number of words (samples) for the entire training set

Training process
The training process was performed for both configurations, one using
ResNet50 and the other using VGG16, for 3 epochs, each of them
requiring about 10 hours to complete.

Caption Generation Caption Prediction and Audio Generation
Caption Prediction
Given an image, caption prediction can be performed in two ways:
Argmax Prediction
Beam Search Prediction

Argmax Prediction
From the output of each LSTM cell, which is a list of words with their
likelihoods, the word with the highest likelihood is always chosen.

Beam Search Prediction
Sometimes, choosing the most likely word at each step does not lead to the most
descriptive caption. This problem can be mitigated by using the Beam Search
heuristic, which keeps at each step k ∈ N+
text candidates, called beams, each scored by summing the (log-)likelihoods of its words.
With this approach, at each step we choose among k candidates,
which are built iteratively as explained.
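The difference between the two prediction strategies can be seen on a toy example. In the sketch below the LSTM is replaced by a hypothetical lookup table of next-word probabilities, and beams are scored by summing log-likelihoods; with k = 1 the procedure reduces to Argmax Prediction.

```python
import math

# Hypothetical next-word model standing in for the LSTM:
# it maps the last word of a beam to a distribution over the next word.
TOY_MODEL = {
    "<BEGIN>": {"a": 0.6, "the": 0.4},
    "a": {"dog": 0.5, "cat": 0.5},
    "the": {"dog": 0.9, "cat": 0.1},
    "dog": {"<END>": 1.0},
    "cat": {"<END>": 1.0},
}


def next_word_probs(words):
    return TOY_MODEL[words[-1]]


def beam_search(k, max_length=10):
    """Return the best caption found while keeping k beams at every step."""
    beams = [(0.0, ["<BEGIN>"])]  # (sum of log-likelihoods, words so far)
    for _ in range(max_length):
        candidates = []
        for score, words in beams:
            if words[-1] == "<END>":
                candidates.append((score, words))  # finished beams are kept unchanged
                continue
            for word, prob in next_word_probs(words).items():
                candidates.append((score + math.log(prob), words + [word]))
        # Keep only the k most likely candidates.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(words[-1] == "<END>" for _, words in beams):
            break
    return beams[0][1]


greedy_caption = beam_search(k=1)  # same as Argmax Prediction
beam_caption = beam_search(k=2)
```

Here the greedy choice commits to "a" (probability 0.6) and ends with a caption of probability 0.3, while keeping two beams lets "the dog" (probability 0.36) win.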

Audio Description Generation
The caption obtained from the prediction was then converted into an
audio format using the Google Text-to-Speech framework.
The caption was sent to the framework's API together with
the voice settings
The API returned an .mp3 file containing the audio version of the
caption as response

Analysis

Analysis Image Analysis
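IoU
IoU - Intersection over Union
Intersection over Union is an evaluation metric used to measure the
accuracy of an object detector on a particular dataset. It is computed
as the ratio between the area of overlap and the area of union of the
predicted bounding box B_p and the ground-truth bounding box B_{gt}:
IoU = \frac{area(B_p \cap B_{gt})}{area(B_p \cup B_{gt})}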
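For axis-aligned bounding boxes, IoU can be computed with a few lines of code. The sketch below assumes boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2; the function name is ours.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
```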

TP, FP, TN, FN
The analysis was performed by dividing predictions according to the
following confusion matrix:
                     Real Positive   Real Negative
Predicted Positive   TP              FP
Predicted Negative   FN              TN
In particular, a prediction is a true positive if and only if the IoU between
the prediction and the ground truth is at least 0.5.

Measures
TP, FP, TN and FN were computed, and the following measures were used to
quantify the performance of the networks:
Precision: \frac{TP}{TP + FP}, the fraction of True Positives among all
predicted positives;
Recall: \frac{TP}{TP + FN}, the fraction of True Positive predictions among all
actual positive cases;
F1 score: 2 \cdot \frac{precision \cdot recall}{precision + recall} = \frac{2TP}{2TP + FP + FN}, the harmonic mean of
precision and recall.
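The three measures follow directly from the counts. The sketch below uses hypothetical counts (30 TP, 10 FP, 20 FN) and also checks that the two forms of the F1 score agree; the function names are ours.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def f1_from_counts(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0


def f1_from_precision_recall(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical counts, for illustration only.
tp, fp, fn = 30, 10, 20
p, r = precision(tp, fp), recall(tp, fn)   # 0.75 and 0.6
f1_counts = f1_from_counts(tp, fp, fn)     # 60 / 90
f1_harmonic = f1_from_precision_recall(p, r)
```

Given only a precision and a recall value, the harmonic-mean form is enough to recover the corresponding F1 score.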

mAP
mAP
Mean Average Precision is the mean over all classes of the Average
Precision.
mAP = \frac{\sum_{c \in C} AP(c)}{|C|}

AP
AP
Average Precision is computed as the average of the maximum precision at 11
fixed recall levels (0.0, 0.1, 0.2, ..., 0.9, 1.0).
AP = \frac{1}{11} \sum_{r \in \{0.0, 0.1, ..., 1.0\}} AP_r = \frac{1}{11} \sum_{r \in \{0.0, 0.1, ..., 1.0\}} p_{interp}(r)
p_{interp}(r) = \max_{\tilde{r} \ge r} p(\tilde{r})
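The 11-point interpolation can be sketched as follows, together with the mean over classes that gives mAP. The precision/recall points and the per-class AP values are hypothetical, and the interpolated precision is taken to be 0 at recall levels that the detector never reaches.

```python
def average_precision_11pt(recalls, precisions):
    """11-point interpolated AP from paired (recall, precision) measurements."""
    total = 0.0
    for level in (i / 10 for i in range(11)):
        # p_interp(level): maximum precision among points with recall >= level.
        reachable = [p for r, p in zip(recalls, precisions) if r >= level]
        total += max(reachable) if reachable else 0.0
    return total / 11


def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Hypothetical precision/recall points of one class.
ap = average_precision_11pt(recalls=[0.2, 0.4, 0.6], precisions=[1.0, 0.8, 0.6])

# Hypothetical per-class AP values.
m_ap = mean_average_precision({"Dog": 0.5, "Cat": 0.7})
```

In the example the interpolated precision is 1.0 for recall levels up to 0.2, 0.8 up to 0.4, 0.6 up to 0.6 and 0 afterwards, so AP = 5.8 / 11.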

Dataset for Analysis
For our analysis we used the Open Images V4 Validation Set, which has 41K
images with an associated CSV file containing all the correct bounding boxes.

Results on Object Detection - YOLO
            F1 Score   Precision   Recall
Yolo-0.1    0.210      0.134       0.476
Yolo-0.2    0.283      0.220       0.397
Yolo-0.3    0.312      0.296       0.331
Yolo-0.4    0.311      0.371       0.267
Yolo-0.5    0.278      0.499       0.201

Results on Object Detection - ResNet50
               F1 Score   Precision   Recall
ResNet50-0.1   0.061      0.032       0.785
ResNet50-0.2   0.190      0.112       0.640
ResNet50-0.3   0.311      0.222       0.520
ResNet50-0.4   0.379      0.354       0.409
ResNet50-0.5   0.377      0.498       0.303

A deeper analysis on YOLO with mAP

A deeper analysis on ResNet50 with mAP

Analysis Text Analysis
Caption Evaluation
Caption evaluation was carried out using 3 different metrics with specific
purposes:
BLEU (BiLingual Evaluation Understudy): a precision-oriented
metric for machine generated captions
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): a
recall-oriented metric
WER (Word Error Rate): measures the number of word substitutions,
deletions and insertions separating a caption from its reference

BLEU
BLEU is a metric that evaluates the modified precision of a generated
sentence with respect to a reference one.
It gives a score from 0 to 1 that indicates how good the match is. It relies
on the concept of n-grams, sequences of n contiguous words. BLEU measures
how many n-grams of the machine generated caption appear in the reference,
counting each n-gram at most as many times as it occurs in the reference.
This modified precision is written B_nPr, where n is the size of the n-gram.
Example:
Candidate: The cat is on the mat
Reference: There is a cat on the mat
BLEU 1-gram modified precision: 5/6 ("the" occurs twice in the candidate but only once in the reference)
BLEU 2-grams modified precision: 2/5
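The modified n-gram precision of the example can be checked with a short sketch. Each candidate n-gram is counted at most as many times as it occurs in the reference (clipping), so a repeated word such as "the" is matched only once if the reference contains it once. Lowercasing both sentences is an assumption made here for simplicity, and the function names are ours.

```python
from collections import Counter


def ngrams(words, n):
    """All contiguous n-grams of a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def modified_precision(candidate, reference, n):
    """Return (clipped matches, total candidate n-grams)."""
    candidate_counts = Counter(ngrams(candidate.lower().split(), n))
    reference_counts = Counter(ngrams(reference.lower().split(), n))
    # Clip each candidate n-gram count to its count in the reference.
    clipped = sum(
        min(count, reference_counts[gram]) for gram, count in candidate_counts.items()
    )
    return clipped, sum(candidate_counts.values())


candidate = "The cat is on the mat"
reference = "There is a cat on the mat"
unigram = modified_precision(candidate, reference, 1)
bigram = modified_precision(candidate, reference, 2)
```

The only matching 2-grams are "on the" and "the mat", out of the five 2-grams of the candidate.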

BLEU
That being said, the formula is:
BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} W_n \log(B_nPr) \right)
BP: the Brevity Penalty, which penalizes candidates that are shorter than the
reference (c is the length of the candidate, r the length of the reference)
BP = 1 if c > r, e^{1 - r/c} if c \le r
W_n: the weight of each n-gram size, normalized with respect to N, the
maximum n-gram size
W_n = \frac{1}{N}
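Putting the pieces of the formula together, BLEU is the brevity penalty multiplied by the geometric mean of the n-gram precisions. The sketch below uses hypothetical precision values and lengths, assumes c > 0, and returns 0 when any precision is 0, since the logarithm is undefined there.

```python
import math


def brevity_penalty(c, r):
    """BP = 1 if the candidate is longer than the reference, e^(1 - r/c) otherwise."""
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(precisions, c, r):
    """BLEU from the n-gram precisions for n = 1..N, with uniform weights 1/N."""
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; the score is taken to be 0
    weight = 1 / len(precisions)
    log_sum = sum(weight * math.log(p) for p in precisions)
    return brevity_penalty(c, r) * math.exp(log_sum)


# Hypothetical values: 1-gram and 2-gram precisions, candidate of 6 words,
# reference of 7 words.
score = bleu([0.5, 0.25], c=6, r=7)
```

With c = r the exponent is 0, so the penalty only takes effect when the candidate is strictly shorter than the reference.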

BLEU Analysis Results
           BLEU     BLEU-1   BLEU-2   BLEU-3   BLEU-4
VGG16      0.4719   0.1564   0.2896   0.4766   0.5274
ResNet50   0.5060   0.1795   0.3472   0.5049   0.5561

BLEU Analysis Results

ROUGE
As previously stated, ROUGE measures recall, by counting how many
of the n-grams in the human reference captions appear in the
machine generated ones.
In relation to the previous example:
Candidate: The cat is on the mat
Reference: There is a cat on the mat
ROUGE 1-gram recall: 5/7 ("There" and "a" do not appear in the candidate)
ROUGE 2-grams recall: 2/6
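The n-gram recall of the example can be verified with a sketch that mirrors the precision computation, but iterates over the n-grams of the reference instead of those of the candidate. Lowercasing is again an assumption, and the function names are ours.

```python
from collections import Counter


def ngrams(words, n):
    """All contiguous n-grams of a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def rouge_n_recall(candidate, reference, n):
    """Return (reference n-grams found in the candidate, total reference n-grams)."""
    candidate_counts = Counter(ngrams(candidate.lower().split(), n))
    reference_counts = Counter(ngrams(reference.lower().split(), n))
    overlap = sum(
        min(count, candidate_counts[gram]) for gram, count in reference_counts.items()
    )
    return overlap, sum(reference_counts.values())


candidate = "The cat is on the mat"
reference = "There is a cat on the mat"
rouge_1 = rouge_n_recall(candidate, reference, 1)
rouge_2 = rouge_n_recall(candidate, reference, 2)
```

Of the seven reference words, "there" and "a" are missing from the candidate, while only the 2-grams "on the" and "the mat" are shared.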

ROUGE Analysis Results
           ROUGE-1 recall   ROUGE-2 recall
VGG16      0.2737           0.12034
ResNet50   0.2798           0.1164

WER
In addition to the precision and recall previously calculated, a third
measure is needed in order to know by how many words the candidate
caption differs from its reference. This measure is called Word Error Rate
(WER), defined as:
WER = \frac{S + D + I}{N}
Where:
S is the number of substitutions
D is the number of deletions
I is the number of insertions
N is the number of words in the reference sentence
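The quantity S + D + I is the word-level edit distance between the reference and the candidate, which can be computed with dynamic programming. The sketch below is one possible implementation, assuming a non-empty reference; the sentences are hypothetical.

```python
def word_error_rate(candidate, reference):
    """WER = (S + D + I) / N, computed as a word-level edit distance."""
    cand, ref = candidate.split(), reference.split()
    # dist[i][j]: minimum edits to turn the first i reference words
    # into the first j candidate words.
    dist = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # i deletions
    for j in range(len(cand) + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(cand) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != cand[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[len(ref)][len(cand)] / len(ref)


# One substitution (runs -> walks) and one deletion (the): 2 edits over 6 words.
wer = word_error_rate("a dog walks in park", "a dog runs in the park")
```

Since insertions are counted too, WER can exceed 1 when the candidate is much longer than the reference.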

WER Analysis Results
WER WERRatio
VGG16 11.676 0.6373125
ResNet50 8.487 0.5350625

Discussion
Discussion and Future
Improvements

Discussion
Discussion
Results for detection demonstrate what was already expected:
ResNet50:
-: Needs a high confidence threshold to be sure about predictions;
+: Finds more objects and always has a better recall than YOLO
YOLO:
-: Finds fewer objects than ResNet50;
+: Fast algorithm that finds objects at lower confidence threshold values.

Discussion
Discussion
Caption generation, instead, showed score metrics and a
distribution of their values which reflect the insufficient training of the
model.
Despite this, the full pipeline, which started from an input image and
ended with its textual and audio description, was achieved.

Discussion
Future Improvements
Improvements on Object Detector
To improve the Object Detectors' performance, a more homogeneous dataset
must be used. Furthermore, a more powerful machine is needed in order to
train the Object Detectors well.

Discussion
Future Improvements
As for Caption Generation, results can be improved by:
Training the model for an adequate number of epochs
Reintroducing one-letter words to improve BLEU, ROUGE and WER
scores
Using a higher number of hidden layers
Training with larger datasets like MS-COCO

Discussion
Demo
Demo

Discussion
Thank you for the attention.
