Cognitive Services "From Object Classification to caption generation a descriptive approach"
1. From Object Classification to Caption Generation: a
Descriptive Approach
T. Campari, G. Etta, T. Sgarbanti
University of Padua
September 19, 2018
T. Campari, G. Etta, T. Sgarbanti Object Classification and Caption Generation September 19, 2018 1 / 74
2. Overview
1 Introduction
2 Object Detection
Convolutional Neural Networks
Residual Network
YOLO
Dataset
3 Caption Generation
Introduction
Model Architecture
Word Representation
Training Process
Caption Prediction and Audio Generation
4 Analysis
Image Analysis
Text Analysis
5 Discussion
3. Introduction
Introduction
In this presentation we describe an application which, given an image,
identifies the visual elements within it and generates a descriptive
caption in both textual and audio format.
Our project is divided into two parts:
The first compares the two object detectors chosen for the
experiments (YOLO and ResNet50)
The second generates a caption using an LSTM, which takes a
vocabulary and the image feature array previously produced
4. Introduction
Technical Specification
For this project a Google Cloud Platform virtual machine with the following
specifications was used:
2 x Nvidia Tesla P100
n1-highmem-2 (2 vCPU)
13GB of RAM
600GB of SSD
9. Object Detection
Object Detection
What is Object Detection?
It is the task of locating and recognizing objects of different classes in a
given image.
What do we need in order to train an Object Detector?
A set of images;
A list of object classes to recognize;
A CNN which is able to recognize patterns in images (e.g. an Object
Detector like YOLO or ResNet50);
A ground-truth support file containing the Bounding Boxes for each
image.
10. Object Detection Convolutional Neural Networks
Convolutional Neural Networks
A Convolutional Neural Network (CNN) is a particular type of artificial
neural network commonly used for image and video analysis.
A CNN is composed of several layers that transform an input, such as an
image, into a vector of class scores as output.
11. Object Detection Convolutional Neural Networks
Convolutional Neural Networks
CNN layers have neurons arranged in three dimensions (width, height,
depth)
Three main types of layers:
Convolutional Layers
Pooling Layers
Fully Connected Layers
12. Object Detection Convolutional Neural Networks
Convolutional Layers
A convolutional layer consists of a set of learnable filters able to detect
some visual features or patterns in images.
Figure: six 5×5×3 filters
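To make the effect of such a layer concrete, the sketch below computes the output volume and the number of learnable parameters for six 5×5×3 filters. The 32×32×3 input size, the stride of 1 and the absence of padding are assumptions chosen for illustration; they are not stated in the slides.

```python
def conv_output_shape(width, height, filter_size, num_filters, stride=1, padding=0):
    """Return (out_width, out_height, out_depth) of a convolutional layer."""
    out_w = (width - filter_size + 2 * padding) // stride + 1
    out_h = (height - filter_size + 2 * padding) // stride + 1
    # Each filter yields one activation map, so the depth equals the filter count.
    return out_w, out_h, num_filters


def conv_parameter_count(filter_size, in_depth, num_filters):
    """Learnable weights of every filter plus one bias per filter."""
    return (filter_size * filter_size * in_depth + 1) * num_filters


# Assumed 32x32x3 input, six 5x5x3 filters as in the figure.
shape = conv_output_shape(32, 32, filter_size=5, num_filters=6)
params = conv_parameter_count(filter_size=5, in_depth=3, num_filters=6)
```

Under these assumptions the layer produces a 28×28×6 volume from 456 parameters.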
13. Object Detection Convolutional Neural Networks
Pooling Layers
Reduce the volume of the representation
Reduce the amount of parameters
Control overfitting
Pooling layers operate independently on each activation map preserving
the input volume depth.
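A minimal sketch of this behaviour, assuming 2×2 max pooling with stride 2 (the slides do not say which pooling variant was used): every activation map is reduced on its own, so width and height are halved while the depth stays the same.

```python
def max_pool_2x2(activation_map):
    """Apply 2x2 max pooling with stride 2 to one activation map (list of rows)."""
    pooled = []
    for i in range(0, len(activation_map) - 1, 2):
        row = []
        for j in range(0, len(activation_map[i]) - 1, 2):
            window = (activation_map[i][j], activation_map[i][j + 1],
                      activation_map[i + 1][j], activation_map[i + 1][j + 1])
            row.append(max(window))
        pooled.append(row)
    return pooled


def max_pool_volume(volume):
    """Pool every activation map independently; the depth is unchanged."""
    return [max_pool_2x2(activation_map) for activation_map in volume]


# Toy volume with depth 2 and 4x4 activation maps.
volume = [
    [[1, 3, 2, 4], [5, 6, 1, 2], [7, 2, 9, 1], [3, 4, 1, 8]],
    [[0, 1, 0, 1], [1, 0, 1, 0], [2, 2, 3, 3], [2, 2, 3, 3]],
]
pooled = max_pool_volume(volume)
```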
14. Object Detection Convolutional Neural Networks
Fully Connected Layer
15. Object Detection Residual Network
Residual Networks
What are Residual Networks?
They are a new type of Convolutional Neural Network that:
Exploit Residual Learning;
16. Object Detection Residual Network
Residual Learning
Residuals
Are defined as F(x) = H(x) − x, that is, the difference between the desired
output H(x) of a block of layers and its input x.
Residual Learning
Refers to a training process where the residual network, at each block of
layers, learns the residual F(x) instead of learning the full mapping H(x)
directly.
17. Object Detection Residual Network
Residual Networks
What are Residual Networks?
They are a new type of Convolutional Neural Network that:
Exploit Residual Learning;
Resolve the degrading accuracy problem.
18. Object Detection Residual Network
Degrading Accuracy
Degrading Accuracy: in plain convolutional neural networks, as the depth of
the network increases, the accuracy first saturates and then degrades. The
accompanying figure shows an example of this, where the number of layers of
a CNN was increased.
This does not happen in residual networks, where accuracy does not
saturate as the number of layers increases.
19. Object Detection Residual Network
Shortcut connections
Shortcut connections are the distinctive element of residual networks:
they are what allows the layers to learn residuals.
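The idea can be sketched with a tiny fully connected block. This is an illustration only: ResNet50 uses convolutional layers, and the names w1, w2 and residual_block are hypothetical. The shortcut adds the input x back to whatever the layers compute, so the layers only need to learn the residual.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def residual_block(x, w1, w2):
    """Return relu(F(x) + x), where F(x) = w2 * relu(w1 * x) is the learned residual."""
    residual = matvec(w2, relu(matvec(w1, x)))
    # Shortcut connection: the input is added back, unchanged, to the residual.
    return relu([r + v for r, v in zip(residual, x)])


x = [1.0, 2.0]
zero = [[0.0, 0.0], [0.0, 0.0]]
# With all-zero weights the residual is zero and the block acts as the identity.
y = residual_block(x, zero, zero)
```

This also hints at why added depth does not have to hurt: a block that has nothing useful to learn can fall back to passing its input through.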
20. Object Detection Residual Network
ResNet50
It is a 50-layer Residual Network, developed by Microsoft Research and used
here for the Object Detection task, with the following structure:
21. Object Detection YOLO
YOLO
You Only Look Once
YOLO analyzes the full image just once, applying a single neural
network that:
Divides the image into regions
Predicts bounding boxes
Computes a confidence value for each box
22. Object Detection YOLO
Advantages
Predictions informed by global context in the image
Faster than other detection systems
23. Object Detection Dataset
Dataset I
Dataset
For this task we used a subset of Open Images V4 composed of 100K images
containing 135 different classes of objects.
How did we pick images?
Our dataset images were picked by sorting Open Images by the number of
objects in an image and then taking 33.3K images each from the beginning,
the middle and the end of the sorted dataset.
This was done in order to train our network both on images with few
objects and on images with a greater number of objects.
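The selection procedure can be sketched as follows. This is a minimal illustration, not the authors' script: the function name, the dictionary layout and the ascending sort order are assumptions, and the toy data stands in for the real annotation counts.

```python
def pick_images(object_counts, per_slice):
    """Select images from the beginning, middle and end of the sorted list.

    object_counts maps an image id to its number of annotated objects.
    Assumes the list holds at least 3 * per_slice images, so slices do not overlap.
    """
    ordered = sorted(object_counts, key=object_counts.get)
    mid_start = (len(ordered) - per_slice) // 2
    beginning = ordered[:per_slice]
    middle = ordered[mid_start:mid_start + per_slice]
    end = ordered[-per_slice:]
    return beginning + middle + end


# Toy data: img1 contains 1 object, img2 contains 2 objects, and so on.
counts = {f"img{i}": i for i in range(1, 10)}
subset = pick_images(counts, per_slice=2)
```

With the real data, per_slice would be about 33,300 so that the three slices add up to roughly 100K images.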
24. Object Detection Dataset
Dataset II
25. Object Detection Dataset
Training Process on YOLO
One XML file for each image, containing:
width and height
labels and bounding box coordinates for each object
Batch size of 64 images
30 epochs
26. Object Detection Dataset
Training process on ResNet50
How are the coordinates of Bounding Boxes managed by ResNet50?
ResNet50 needs a single CSV file containing all the Bounding Boxes for
every image of the training set. Each line of this file must have the
following structure:
path to image, x1, y1, x2, y2, label
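A minimal sketch of how such a file could be produced with Python's csv module. The image paths, coordinates and labels below are made up for illustration, and write_annotations is a hypothetical helper, not part of any detector library.

```python
import csv
import io


def write_annotations(rows, handle):
    """Write one line per bounding box: path, x1, y1, x2, y2, label."""
    writer = csv.writer(handle)
    for path, (x1, y1, x2, y2), label in rows:
        writer.writerow([path, x1, y1, x2, y2, label])


# Hypothetical annotations: an image with two objects yields two lines.
rows = [
    ("images/0001.jpg", (48, 30, 210, 180), "Cat"),
    ("images/0001.jpg", (220, 40, 300, 170), "Dog"),
]
buffer = io.StringIO()  # stands in for an open file
write_annotations(rows, buffer)
csv_text = buffer.getvalue()
```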
28. Caption Generation Introduction
Caption Generation
It is the task of producing a short descriptive sentence for a given
image.
What do we need in order to generate a caption?
An image previously encoded into a feature array
A neural network which is able to "remember" the sequence being
generated
A dictionary (of words) with an appropriate representation
29. Caption Generation Introduction
Remembering sequences
Since we need to cope with sentences composed of series of words, a
model able to manage Sequential Data effectively is needed, whose
goal is to learn:
P(o|x)
where x is a sequence of input elements with a static type, whilst o is the
output, whose type can be either static or sequential.
30. Caption Generation Introduction
Remembering sequences
Such a model can be expressed as a Sequential Transduction, defined as:
$T : X^{*} \to O^{*}$
If T is causal with a finite memory $k \in \mathbb{N}$, a Recursive State
Representation can be used. It depends on hidden state variables related
to different time steps t. A hidden state at time t, and its output, can be
represented with the following equations:
$h_t = f(h_{t-1}, x_t, t)$
$o_t = g(h_t, x_t, t)$
32. Caption Generation Introduction
Shallow Recursive Neural Network
A nonlinear model which addresses the problem of implementing f(·) and
g(·) is called a Recurrent Neural Network (RNN). In its shallow
form it uses tanh() to implement both functions and sets
$h_0 = 0$. The resulting representation and internal
structure are the following:
$h_t = f(B h_{t-1} + A x_t)$
$o_t = g(C h_t)$
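The two equations can be unrolled over a short input sequence as in the sketch below. It assumes tanh for both f and g and a zero initial state, as stated above; the matrices A, B and C and the inputs are arbitrary toy values.

```python
import math


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_forward(xs, A, B, C):
    """Run h_t = tanh(B h_{t-1} + A x_t) and o_t = tanh(C h_t), starting from h_0 = 0."""
    h = [0.0] * len(B)
    outputs = []
    for x in xs:
        pre_activation = [b + a for b, a in zip(matvec(B, h), matvec(A, x))]
        h = [math.tanh(p) for p in pre_activation]
        outputs.append([math.tanh(c) for c in matvec(C, h)])
    return outputs, h


# Toy parameters: 2 inputs, 2 hidden units, 1 output.
A = [[0.5, -0.2], [0.1, 0.3]]
B = [[0.0, 0.4], [-0.3, 0.0]]
C = [[1.0, -1.0]]
outputs, last_h = rnn_forward([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], A, B, C)
```

The same matrices are reused at every time step; only the hidden state h carries information from one step to the next.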
33. Caption Generation Introduction
Long Short Term Memory (LSTM)
Training RNNs with Back Propagation Through Time or Real Time Recurrent
Learning exposes a problem in the computation of gradients, known as the
Vanishing / Exploding Gradients problem. The solution adopted in the
project was to use Long Short Term
Memories (LSTM).
34. Caption Generation Model Architecture
Putting all together
At this point, the final neural network is composed of:
The image encoder previously described, which produces an array of
features whose dimensions are reduced to 256 in order to save space
An LSTM whose number of hidden states is equal to the maximum
caption length. It receives the reduced feature array as $h_0$ and a word,
previously one-hot encoded, as input. The output $o_t$ is the new word
predicted after a softmax computation
35. Caption Generation Model Architecture
Assumptions and model choices
During development, several choices were made regarding parts of
the learning model. Specifically:
For comparison purposes, the VGG-16 encoder was used in addition to
the ResNet50 image encoder, in order to test the
reliability of the NLG model itself
The embedding layer, which shrinks the image feature array to a
vector of 256 elements, was set up with a ReLU activation function in
order to optimize performance
36. Caption Generation Model Architecture
Assumptions and model choices
Adaptive Moment Estimation (Adam) was chosen as the optimizer of
the model because it uses adaptive learning steps and
stores a decaying average of past gradients, speeding up the training
as momentum does
Each LSTM cell produces a likelihood for each word. Since this is a
multi-class classification setting, a softmax
activation function was chosen, also because of its "squashing" properties.
37. Caption Generation Model Architecture
Data transformation operations
Apart from image encoding, some other work was required in order to
represent words effectively for the learning algorithm. It was necessary to:
Clean words of punctuation and non-alphabetic symbols
Insert a <BEGIN> and an <END> placeholder into each caption in
order to mark its beginning and end
Remove one-letter words to improve efficiency
Represent words effectively using a numerical format
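The first three steps can be sketched as below. This is an assumption-laden illustration rather than the authors' code: lowercasing is added for convenience, and clean_caption is a hypothetical helper name.

```python
import string


def clean_caption(caption):
    """Strip punctuation, drop non-alphabetic and one-letter tokens, add markers."""
    table = str.maketrans("", "", string.punctuation)
    words = caption.lower().translate(table).split()  # lowercasing is an assumption
    words = [w for w in words if w.isalpha() and len(w) > 1]
    return ["<BEGIN>"] + words + ["<END>"]


tokens = clean_caption("A dog runs, quickly, through 2 fields!")
```

Note that the article "A" is dropped by the one-letter rule, which is why the slides later suggest reintroducing such words to improve the text metrics.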
38. Caption Generation Word Representation
Word representation
Words cannot be used as they are. In order to be "understood" by
ML algorithms, they need to be numerically encoded. To achieve this,
each word in the list of unique words was mapped to its own index, and
each index back to its word.
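A minimal sketch of the two mappings, assuming captions are already tokenized. Sorting the vocabulary is an assumption made here so that the indices are reproducible; the names are hypothetical.

```python
def build_vocabulary(captions):
    """Return word -> index and index -> word mappings over all unique words."""
    unique_words = sorted({word for caption in captions for word in caption})
    word_to_index = {word: i for i, word in enumerate(unique_words)}
    index_to_word = {i: word for word, i in word_to_index.items()}
    return word_to_index, index_to_word


captions = [["dog", "runs"], ["cat", "runs", "fast"]]
word_to_index, index_to_word = build_vocabulary(captions)
```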
39. Caption Generation Word Representation
Word representation
Unfortunately, an integer representation of words is not enough, due to
performance issues and redundancy. The next step towards an effective
representation is to perform one-hot encoding:
Since words can be seen as categorical values, an M×N matrix
Encoding can be created, where M is the number of captions and
N the number of unique words
For each sentence i and each word with index j that it contains, the
corresponding cell is set so that Encoding[i][j] = 1; all other cells are 0.
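The M×N matrix described above can be built as in the following sketch. The vocabulary and captions are toy values, and encode_captions is a hypothetical name.

```python
def encode_captions(captions, word_to_index):
    """Build the M x N matrix: Encoding[i][j] = 1 if caption i contains word j."""
    encoding = [[0] * len(word_to_index) for _ in captions]
    for i, caption in enumerate(captions):
        for word in caption:
            encoding[i][word_to_index[word]] = 1
    return encoding


word_to_index = {"cat": 0, "dog": 1, "fast": 2, "runs": 3}
captions = [["dog", "runs"], ["cat", "runs", "fast"]]
encoding = encode_captions(captions, word_to_index)
```

Here M = 2 captions and N = 4 unique words, so the result has two rows of four cells each.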
40. Caption Generation Word Representation
Word representation
As a result, a structure similar to the following is obtained:
41. Caption Generation Training Process
Training Process
The training process was originally set up to use two datasets:
Flickr8k
MS-COCO
Due to time and performance constraints, only the first one was used.
It is composed of:
6000 training images
1000 testing images
1000 free-use images
Each image has 4 captions associated with it.
42. Caption Generation Training Process
Training process
Two dictionaries were created to associate images with their captions and
encodings:
The first associates each filename key with the list of its captions
The second associates each filename with its image encoding,
which depends on the encoder used
43. Caption Generation Training Process
Training process
For the captions, a series of further metrics and data structures
were computed for use by the Neural Network:
A list of the unique words across all the captions
The number of unique words
The length of the caption with the highest number of words
The total number of words (samples) in the entire training set
44. Caption Generation Training Process
Training process
The training process was performed for both configurations, one using
ResNet50 and the other using VGG16, for 3 epochs, each of them
requiring about 10 hours to be completed.
45. Caption Generation Caption Prediction and Audio Generation
Caption Prediction
Given an image, caption prediction can be performed in two ways:
Argmax Prediction
Beam Search Prediction
46. Caption Generation Caption Prediction and Audio Generation
Argmax Prediction
From the output of each LSTM cell, which is a list of words with their
likelihoods, the word with the highest value is always chosen.
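A minimal sketch of argmax (greedy) decoding. The trained LSTM is replaced by a hypothetical lookup table TOY_MODEL that only looks at the last word, and the <BEGIN> and <END> markers are assumed to delimit the caption.

```python
def greedy_caption(next_word_probs, max_length):
    """Build a caption by always taking the most likely next word.

    next_word_probs(words) returns a dict mapping each word to its likelihood.
    """
    words = ["<BEGIN>"]
    while len(words) < max_length:
        probs = next_word_probs(words)
        best = max(probs, key=probs.get)
        words.append(best)
        if best == "<END>":
            break
    return words


# Hypothetical stand-in for the trained model: a fixed lookup on the last word.
TOY_MODEL = {
    "<BEGIN>": {"dog": 0.6, "cat": 0.4},
    "dog": {"runs": 0.7, "<END>": 0.3},
    "cat": {"sleeps": 0.9, "<END>": 0.1},
    "runs": {"<END>": 1.0},
    "sleeps": {"<END>": 1.0},
}
caption = greedy_caption(lambda words: TOY_MODEL[words[-1]], max_length=10)
```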
47. Caption Generation Caption Prediction and Audio Generation
Beam Search Prediction
Sometimes, always choosing the most likely word does not lead to the most
descriptive caption. This problem can be mitigated by using the Beam Search
heuristic, which at each step keeps k ∈ N+ candidate texts, called
beams, each scored by accumulating the likelihoods of its words.
With this approach, at every step we choose among k candidates, which
are built iteratively as explained.
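A minimal beam search sketch. Two assumptions are made here: beams are scored by summing log-likelihoods (a common way of accumulating word likelihoods without numerical underflow), and the trained model is replaced by a hypothetical lookup table. The table is built so that the greedy choice is not the best complete caption.

```python
import math


def beam_search(next_word_probs, k, max_length):
    """Keep the k best partial captions at each step; return the best one."""
    beams = [(0.0, ["<BEGIN>"])]  # (summed log-likelihood, words so far)
    for _ in range(max_length - 1):
        candidates = []
        for score, words in beams:
            if words[-1] == "<END>":
                candidates.append((score, words))  # finished beams are kept as they are
                continue
            for word, p in next_word_probs(words).items():
                candidates.append((score + math.log(p), words + [word]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(words[-1] == "<END>" for _, words in beams):
            break
    return beams[0][1]


# Hypothetical stand-in for the trained model: a fixed lookup on the last word.
TOY_MODEL = {
    "<BEGIN>": {"dog": 0.6, "cat": 0.4},
    "dog": {"runs": 0.5, "sits": 0.5},
    "cat": {"sleeps": 0.95, "<END>": 0.05},
    "runs": {"<END>": 1.0},
    "sits": {"<END>": 1.0},
    "sleeps": {"<END>": 1.0},
}
model = lambda words: TOY_MODEL[words[-1]]
best = beam_search(model, k=2, max_length=10)
greedy = beam_search(model, k=1, max_length=10)  # k = 1 reduces to argmax prediction
```

With k = 2 the search keeps the initially less likely "cat" alive and ends with the caption of overall likelihood 0.38, while k = 1 commits to "dog" and ends with likelihood 0.30.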
48. Caption Generation Caption Prediction and Audio Generation
Audio Description Generation
The caption obtained from the prediction was then converted into an
audio format by using Google Text To Speech Framework.
The caption was sent to an API from the framework together with
other voice settings
The API returned an .mp3 file containing the audio version of the
caption as response
50. Analysis Image Analysis
IoU
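IoU - Intersection over Union
Intersection over Union is an evaluation metric used to measure the
accuracy of an object detector on a particular dataset. It is the area of
overlap between the predicted and the ground-truth bounding box divided by
the area of their union:
$IoU = \frac{\text{area of overlap}}{\text{area of union}}$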
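For axis-aligned boxes the metric can be sketched in a few lines. The (x1, y1, x2, y2) corner format is assumed here because it matches the CSV layout used for training; the function name is hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)


# Two 10x10 boxes overlapping on half of their width.
score = iou((0, 0, 10, 10), (5, 0, 15, 10))
```

The two boxes share an area of 50 out of a union of 150, so the score is one third.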
51. Analysis Image Analysis
TP, FP, TN, FN
The analysis was performed by classifying predictions according to the
following confusion matrix:
                      Real Positive   Real Negative
Predicted Positive    TP              FP
Predicted Negative    FN              TN
In particular, a prediction is a true positive if and only if the IoU
between the prediction and the ground truth is at least 0.5.
54. Analysis Image Analysis
Measures
TP, FP, TN and FN were computed, and these measures were used to
quantify the performance of the network:
Precision: $\frac{TP}{TP+FP}$, the fraction of true positives among all
positive predictions;
Recall: $\frac{TP}{TP+FN}$, the fraction of true positives among all
actual positive cases;
F1 score: $2 \cdot \frac{precision \cdot recall}{precision + recall} = \frac{2TP}{2TP+FP+FN}$, the harmonic mean of
precision and recall.
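The three measures can be sketched as follows. The counts passed to detection_scores are made-up values; the second call recomputes F1 from the precision and recall reported for YOLO at threshold 0.3 in the results table as a consistency check.

```python
def detection_scores(tp, fp, fn):
    """Return (precision, recall, f1) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Made-up counts for illustration.
precision, recall, f1 = detection_scores(tp=30, fp=70, fn=60)

# Precision and recall of the Yolo-0.3 row; the result is close to the reported 0.312.
yolo_f1 = f1_from(0.296, 0.331)
```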
55. Analysis Image Analysis
mAP
mAP
Mean Average Precision is the mean over all classes of the Average
Precision.
$mAP = \frac{\sum_{c \in C} AP(c)}{|C|}$
56. Analysis Image Analysis
AP
AP
Average Precision is computed as the average of maximum precision at 11
fixed recall levels (0.0, 0.1, 0.2, ..., 0.9, 1.0).
$AP = \frac{1}{11} \sum_{r \in \{0.0, \dots, 1.0\}} AP_r = \frac{1}{11} \sum_{r \in \{0.0, \dots, 1.0\}} p_{interp}(r)$
$p_{interp}(r) = \max_{\tilde{r} \geq r} p(\tilde{r})$
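The 11-point interpolation and the mean over classes can be sketched as below. The precision-recall points are toy values, and a recall level that no point reaches is assumed to contribute a precision of 0.

```python
def average_precision_11pt(recalls, precisions):
    """11-point interpolated AP from parallel lists of recall and precision values."""
    total = 0.0
    for i in range(11):
        level = i / 10
        # Interpolated precision: best precision at any recall >= level (0 if none).
        reachable = [p for r, p in zip(recalls, precisions) if r >= level]
        total += max(reachable) if reachable else 0.0
    return total / 11


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Toy precision-recall curve with three operating points.
ap = average_precision_11pt([0.2, 0.4, 0.6], [1.0, 0.5, 0.75])
m_ap = mean_average_precision({"cat": 0.5, "dog": 1.0})
```

In the toy curve the levels 0.0 to 0.2 see precision 1.0, the levels 0.3 to 0.6 see 0.75 (the later, higher point overrides the dip to 0.5), and the remaining four levels see 0, giving 6/11.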
57. Analysis Image Analysis
Dataset for Analysis
For our analysis we used the Open Images V4 validation set, which has 41K
images and an associated CSV file containing all correct bounding boxes.
58. Analysis Image Analysis
Results on Object Detection - YOLO
            F1 Score   Precision   Recall
Yolo-0.1    0.210      0.134       0.476
Yolo-0.2    0.283      0.220       0.397
Yolo-0.3    0.312      0.296       0.331
Yolo-0.4    0.311      0.371       0.267
Yolo-0.5    0.278      0.499       0.201
59. Analysis Image Analysis
Results on Object Detection - ResNet50
                F1 Score   Precision   Recall
ResNet50-0.1    0.061      0.032       0.785
ResNet50-0.2    0.190      0.112       0.640
ResNet50-0.3    0.311      0.222       0.520
ResNet50-0.4    0.379      0.354       0.409
ResNet50-0.5    0.377      0.498       0.303
60. Analysis Image Analysis
A deeper analysis on YOLO with mAP
61. Analysis Image Analysis
A deeper analysis on ResNet50 with mAP
62. Analysis Text Analysis
Caption Evaluation
Captions were evaluated using 3 different metrics with specific
purposes:
BLEU (BiLingual Evaluation Understudy): a precision-oriented
metric for machine generated captions
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): a
recall-oriented metric
WER (Word Error Rate): measures the number of word substitutions,
deletions and insertions separating a caption from its reference
63. Analysis Text Analysis
BLEU
BLEU is a metric that evaluates the modified precision of a generated
sentence with respect to a reference one.
It gives a score from 0 to 1 that indicates how good the match is. It relies
on the concept of n-grams, sequences of n contiguous words. BLEU measures
how many n-grams of the machine generated caption appear in the
reference, counting each n-gram at most as many times as it occurs in the
reference. This modified precision is denoted $B_nPr$, where n is the size
of the n-gram.
Example:
Candidate: The cat is on the mat
Reference: There is a cat on the mat
BLEU 1-gram modified precision: 5/6 ("the" occurs twice in the candidate
but only once in the reference)
BLEU 2-grams modified precision: 2/5
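The clipped counting behind modified precision can be sketched as follows, using the candidate and reference of the example. Lowercasing the sentences is an assumption; the function names are hypothetical. Because of the clipping, the word "the", which occurs twice in the candidate but only once in the reference, is credited once.

```python
from collections import Counter


def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def modified_precision(candidate, reference, n):
    """Return (clipped matches, total candidate n-grams)."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clip each candidate n-gram count by its count in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped, sum(cand_counts.values())


candidate = "the cat is on the mat".split()
reference = "there is a cat on the mat".split()
unigram = modified_precision(candidate, reference, 1)
bigram = modified_precision(candidate, reference, 2)
```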
64. Analysis Text Analysis
BLEU
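That being said, the formula is:
$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} W_n \log(B_nPr)\right)$
BP: the Brevity Penalty, which penalizes candidates that are shorter than
the reference (c is the candidate length, r the reference length)
$W_n$: normalization in relation to N, the maximum n-gram size
$BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \leq r \end{cases}$
$W_n = \frac{1}{N}$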
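A minimal sketch of the formula with uniform weights. The input precisions are the clipped 1-gram and 2-gram precisions of the cat example (5/6 and 2/5), with candidate length c = 6 and reference length r = 7; all precisions are assumed to be greater than zero so that the logarithm is defined.

```python
import math


def brevity_penalty(c, r):
    """1 when the candidate is longer than the reference, exp(1 - r/c) otherwise."""
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(precisions, c, r):
    """BLEU from modified n-gram precisions for n = 1..N with W_n = 1/N."""
    n_max = len(precisions)
    log_sum = sum(math.log(p) / n_max for p in precisions)
    return brevity_penalty(c, r) * math.exp(log_sum)


# Clipped unigram and bigram precisions of the example, N = 2.
score = bleu([5 / 6, 2 / 5], c=6, r=7)
```

The candidate is one word shorter than the reference, so the penalty exp(1 - 7/6) lowers the geometric mean of the two precisions.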
65. Analysis Text Analysis
BLEU Analysis Results
            BLEU     BLEU-1   BLEU-2   BLEU-3   BLEU-4
VGG16       0.4719   0.1564   0.2896   0.4766   0.5274
ResNet50    0.5060   0.1795   0.3472   0.5049   0.5561
66. Analysis Text Analysis
BLEU Analysis Results
67. Analysis Text Analysis
ROUGE
As previously stated, ROUGE measures recall, by counting how many of
the n-grams in the human reference captions appear in the
machine generated ones.
With the previous example:
Candidate: The cat is on the mat
Reference: There is a cat on the mat
ROUGE 1-gram recall: 5/7 ("there" and "a" do not appear in the candidate)
ROUGE 2-grams recall: 2/6
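The recall counterpart can be sketched in the same way: the denominator is now the number of n-grams in the reference. Lowercasing is an assumption and the function names are hypothetical. Of the seven reference words, "there" and "a" are missing from the candidate.

```python
from collections import Counter


def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def rouge_n_recall(candidate, reference, n):
    """Return (reference n-grams found in the candidate, total reference n-grams)."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap, sum(ref_counts.values())


candidate = "the cat is on the mat".split()
reference = "there is a cat on the mat".split()
rouge_1 = rouge_n_recall(candidate, reference, 1)
rouge_2 = rouge_n_recall(candidate, reference, 2)
```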
68. Analysis Text Analysis
ROUGE Analysis Results
            ROUGE-1Rc   ROUGE-2Rc
VGG16       0.2737      0.12034
ResNet50    0.2798      0.1164
69. Analysis Text Analysis
ROUGE Analysis Results
70. Analysis Text Analysis
ROUGE Analysis Results
71. Analysis Text Analysis
WER
In addition to the precision and the recall previously calculated, a third
measure is needed in order to know by how many words the candidate
caption differs from its reference. This measure is called Word Error Rate
(WER), defined as:
$WER = \frac{S + D + I}{N}$
Where:
S is the number of substitutions
D is the number of deletions
I is the number of insertions
N is the number of words in the reference sentence
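S + D + I is the word-level edit distance between the two sentences, which can be computed with dynamic programming as in the sketch below. The function name and the toy sentences are illustrative assumptions.

```python
def word_error_rate(candidate, reference):
    """Word-level edit distance (S + D + I) divided by the reference length N."""
    rows, cols = len(reference) + 1, len(candidate) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # every reference word deleted
    for j in range(cols):
        dist[0][j] = j  # every candidate word inserted
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if reference[i - 1] == candidate[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match
    return dist[-1][-1] / len(reference)


# The candidate misses the last word of the reference: one deletion over four words.
wer = word_error_rate("the cat sat".split(), "the cat sat down".split())
```

Since insertions are counted, the rate can exceed 1 when the candidate is much longer than the reference.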
72. Analysis Text Analysis
WER Analysis Results
            WER      WERRatio
VGG16       11.676   0.6373125
ResNet50    8.487    0.5350625
74. Discussion
Discussion
The detection results confirm what was expected:
ResNet50:
-: Needs a high confidence threshold to be sure about predictions;
+: Finds more objects and always has a better recall than YOLO
YOLO:
-: Finds fewer objects than ResNet50;
+: Fast algorithm that finds objects at lower confidence threshold values.
75. Discussion
Discussion
Caption generation, instead, produced metric scores and score
distributions that reflect the insufficient training of the
model.
Despite this, the full pipeline, which starts from an input image and
ends with its textual and audio description, was achieved.
76. Discussion
Future Improvements
Improvements on Object Detector
To improve the performance of the Object Detectors, a more homogeneous
dataset must be used. Furthermore, a more powerful machine is needed in
order to train the Object Detectors well.
77. Discussion
Future Improvements
As for Caption Generation, results can be improved by:
Training the model for an adequate number of epochs
Reintroducing one-letter words to improve BLEU, ROUGE and WER
scores
Using a higher number of hidden layers
Training with larger datasets like MS-COCO