IMAGE CAPTIONING
MUHAMMAD ZBEEDAT
MAY 2019
INTRODUCTION
• What do you see in the
picture?
• Well some of you might say “A white dog in a grassy area”, some may say
“White dog with brown spots” and yet some others might say “A dog on
grass and some pink flowers”.
• All of these captions are certainly relevant for this image, and there may be others too. But the point I want to make is this: it's so easy for us, as human beings, to just glance at a picture and describe it in appropriate language. Even a 5-year-old could do this with the utmost ease.
• But, can you write a computer program that takes an image as input and
produces a relevant caption as output?
• Just prior to the recent development of DNNs (Deep Neural Networks), this problem was considered intractable even by the most advanced researchers in Computer Vision. But with the advent of Deep Learning, this problem can be solved quite easily if we have the required dataset.
• This problem was studied in depth by Andrej Karpathy in his PhD thesis at Stanford; he is now the Director of AI at Tesla.
• To get a better feel for this problem, I strongly recommend trying the state-of-the-art system created by Microsoft called Caption Bot. Just go to this link and try uploading any picture you want; the system will generate a caption for it.
MOTIVATION
We must first understand how important this problem is in real-world scenarios. Let's look at a few applications where a solution to this problem can be very useful:
MOTIVATION
Aid to the blind: We can create a product for the blind that guides them when travelling on the roads without anyone else's support. We can do this by first converting the scene into text and then the text into voice; both are now well-known applications of Deep Learning. Refer to this link, which shows how Nvidia research is trying to create such a product.
Self-driving cars: Automatic driving is one of the biggest challenges, and if we can properly caption the scene around the car, it can give a boost to the self-driving system.
MOTIVATION
Automatic captioning can help make Google Image Search as good as Google Search: every image could first be converted into a caption, and then the search could be performed on that caption.
In web development, it’s good
practice to provide a description
for any image that appears on
the page so that an image can be
read or heard as opposed to just
seen. This makes web content
accessible.
MOTIVATION
CCTV cameras (Closed-circuit television
cameras) are everywhere today, but
along with viewing the world, if we can
also generate relevant captions, then
we can raise alarms as soon as there is
some malicious activity going on
somewhere. This could probably help
reduce some crime and/or accidents.
It can be used to describe video in real
time.
DATA COLLECTION
• There are many open-source datasets available for this problem, like Flickr 8k (containing 8k images), Flickr 30k (containing 30k images), MS COCO (containing 180k images), etc.
• COCO stands for Common Objects in Context, and it contains a large variety of images, where each image is paired with several human-written captions (typically five).
EXAMPLE
CAPTIONING MODEL
A captioning model relies on two main components, a CNN and an RNN. Captioning is all about merging the two to combine their most powerful attributes:
CNNs (Convolutional Neural Networks) excel at preserving spatial information and recognizing objects in images.
RNNs (Recurrent Neural Networks) work well with any kind of sequential data, such as generating a sequence of words.
So by merging the two, you can get a model that can find patterns in images and then use that information to help generate a description of those images.
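As a rough illustration (not the exact model from any particular paper), here is a minimal sketch in Keras of this "merge" idea: an image feature vector and the caption-so-far are combined to predict the next word. The feature size (2048), vocabulary size (5000) and caption length (30) are illustrative assumptions.

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

feat_dim, vocab_size, max_len = 2048, 5000, 30   # illustrative sizes

# Image branch: a pre-computed CNN feature vector, projected to 256-d.
img_in = Input(shape=(feat_dim,))
img_vec = Dense(256, activation='relu')(Dropout(0.5)(img_in))

# Text branch: the caption generated so far, embedded and summarized by an LSTM.
txt_in = Input(shape=(max_len,))
txt_vec = LSTM(256)(Embedding(vocab_size, 256, mask_zero=True)(txt_in))

# Merge the two views and predict the next word of the caption.
merged = add([img_vec, txt_vec])
out = Dense(vocab_size, activation='softmax')(Dense(256, activation='relu')(merged))

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss='categorical_crossentropy', optimizer='adam')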
CNN
Convolutional Neural Networks, or CNNs, were designed to map image data to an output variable.
They have proven so effective that they are the go-to method for any type of prediction problem involving image data as an input.
The benefit of using CNNs is their ability to develop an internal representation of a two-dimensional image. This allows the model to learn position- and scale-invariant structures in the data, which is important when working with images.
Use CNNs For:
• Image data
• Classification prediction problems
• Regression prediction problems
More generally, CNNs work well with data that has a spatial relationship.
Although not specifically developed for non-image data, CNNs also achieve state-of-the-art results on problems such as document classification (where there is an order relationship between the words of a text), used in sentiment analysis and related problems; a sketch of such a text CNN follows after the note below.
(Sentiment analysis: the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral.)
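As a hedged sketch of that text-CNN idea (a 1D convolution over word embeddings for sentiment classification), with illustrative vocabulary and filter sizes:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

model = Sequential([
    Embedding(input_dim=10000, output_dim=64),      # word ids -> dense vectors
    Conv1D(128, kernel_size=5, activation='relu'),  # detects local word-order patterns
    GlobalMaxPooling1D(),                           # keeps the strongest response per filter
    Dense(1, activation='sigmoid'),                 # positive / negative sentiment
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])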
RNN
RNNs in general and LSTMs in particular have received the most success when working with sequences of words and
paragraphs, generally called natural language processing.
This includes both sequences of text and sequences of spoken language represented as a time series. They are also
used as generative models that require a sequence output, not only with text, but on applications such as generating
handwriting.
Use RNNs For:
• Text data
• Speech data
• Classification prediction problems
• Regression prediction problems
• Generative models
Don’t Use RNNs For:
• Tabular data (as you would see in a CSV file or spreadsheet)
• Image data
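A minimal, hedged sketch of the "sequence in, single prediction out" RNN use case listed above, again with illustrative sizes:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(input_dim=10000, output_dim=64),  # token ids -> dense vectors
    LSTM(128),                                  # reads the sequence, keeps a summary state
    Dense(1, activation='sigmoid'),             # many-to-one prediction (e.g. sentiment)
])
model.compile(optimizer='adam', loss='binary_crossentropy')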
MODEL OF
IMAGE
CAPTIONING
APPROACH
• Say you’re asked to write a caption that
describes this image, how would you approach
this task?
• Based on how these objects are placed in the image and their relationship to each other, you might think that a dog is looking at the sky. He is smiling, so he may well be happy. He is also outside, so he might be in a park. The sky isn't blue, so it might be sunset or sunrise.
• After collecting these visual observations, you could put together a phrase that describes the image as, "A happy dog is looking at the sky".
CHALLENGES
1. Recognize objects in the image
2. Generate a fluent description in natural language
NEURAL OBJECT RECOGNITION
• A largely solved problem: Convolutional Neural Networks (CNNs) do the trick.
• A CNN is an architecture specialized in finding topological invariants in the input.
• It finds relationships between low-level elements ("atoms") and infers higher abstractions.
• It is highly resistant to noise and spatial transformations.
• It automatically learns which features are relevant to extract from an input.
• Not limited to images: CNNs can be applied to text, audio, etc.
ARCHITECTURE OF CNN
1. Convolution
2. Non Linearity (ReLU)
3. Pooling or Sub Sampling
4. Classification (Fully
Connected Layer)
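A small Keras sketch of these four building blocks in order; the input size (224x224x3) and class count (10) are illustrative assumptions:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Input(shape=(224, 224, 3)),
    Conv2D(32, (3, 3), activation='relu'),  # 1 + 2: convolution + ReLU non-linearity
    MaxPooling2D((2, 2)),                   # 3: pooling / sub-sampling
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(10, activation='softmax'),        # 4: fully connected classifier
])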
OBJECT
DETECTION
POOLING - MAX/SUM/AVG
• Downsample the image by "hashing" each window of values down to a single value; we can take the max, the sum, or the average of each window (see the small example below).
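A tiny NumPy illustration of 2x2 max pooling; average or sum pooling would just swap the reduction:

import numpy as np

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 0],
              [3, 4, 1, 8]], dtype=float)

# Group the 4x4 image into non-overlapping 2x2 windows and keep each window's max.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6. 4.]
                #  [7. 9.]]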
CLASSIFICATION: FULLY CONNECTED LAYER
• After a couple of "convolve, ReLU and pool" cycles, we have maybe 128 channels of 14x14-pixel feature maps.
• Concatenate and reshape them into a linear array of 25088 (14x14x128) cells.
• Feed it to a feed-forward neural network that will output our classes (see the sketch below).
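As a hedged sketch of that last step, with the class count (1000) as an illustrative assumption:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Flatten, Dense

model = Sequential([
    Input(shape=(14, 14, 128)),           # 128 channels of 14x14 feature maps
    Flatten(),                            # 14 * 14 * 128 = 25088 values
    Dense(4096, activation='relu'),       # feed-forward layers...
    Dense(1000, activation='softmax'),    # ...ending in the class scores
])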
• Since we want a set of features that represents the spatial content in the image, we're going to remove the final fully connected layer that classifies the image and look at an earlier layer that processes the spatial information in the image.
CNN MODEL
• Feed an image into a CNN. We can use a pre-trained network like VGG16, ResNet, or AlexNet (see the sketch after the links below).
• https://neurohive.io/en/popular-networks/vgg16/
• https://medium.com/@sidereal/cnns-architectures-lenet-alexnet-vgg-googlenet-resnet-and-more-666091488df5
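A hedged sketch of using a pre-trained VGG16 from Keras as the feature extractor: drop the final classification layer and read the 4096-d activations of the last fully connected layer. The image path 'dog.jpg' is a placeholder.

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model

base = VGG16(weights='imagenet')                         # pre-trained on ImageNet
encoder = Model(inputs=base.input,
                outputs=base.get_layer('fc2').output)    # 4096-d features, classifier removed

img = image.load_img('dog.jpg', target_size=(224, 224))  # placeholder image path
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
features = encoder.predict(x)                            # shape (1, 4096)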
VGG16
• VGG16 is a convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University of Oxford in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition". The model achieves 92.7% top-5 test accuracy on ImageNet, a dataset of over 14 million images belonging to 1000 classes. It was one of the famous models submitted to ILSVRC-2014. It improves over AlexNet by replacing large kernel-sized filters (11 and 5 in the first and second convolutional layers, respectively) with multiple 3×3 kernel-sized filters one after another. VGG16 was trained for weeks on NVIDIA Titan Black GPUs.
VGG-16 VS
ALEX-NET
• Similar to AlexNet, but with only 3x3 convolutions, and lots of filters.
[Architecture diagrams: AlexNet and VGG-16]
LAYERS
VGG-16# 3D CONVOLUTION LAYERS
Filter size: 3×3 (the smallest size that captures the notion of left/right, up/down, center).
In one of the configurations, it also utilizes 1×1 convolution filters, which can be seen as a linear transformation of the input channels.
The convolution stride is fixed to 1 pixel.
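A small sketch of those two filter sizes in Keras (stride 1 and 'same' padding, as in VGG); the channel counts are illustrative:

from tensorflow.keras.layers import Input, Conv2D
from tensorflow.keras.models import Model

inp = Input(shape=(224, 224, 64))
x = Conv2D(64, (3, 3), strides=1, padding='same', activation='relu')(inp)  # 3x3, stride 1
x = Conv2D(64, (3, 3), strides=1, padding='same', activation='relu')(x)    # stacked 3x3
x = Conv2D(64, (1, 1), strides=1, activation='relu')(x)                    # 1x1 channel mix
model = Model(inp, x)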
VGG-16#
THRESHOLD
LAYERS
VGG-16#
MAX
POOLING
LAYERS
Max-pooling is performed over a 2×2 pixel window,
with stride 2.
VGG-16#
MULTILAYER
CLASSIFIER
VGG-16#
SOFTMAX
LAYER
VGG-16#
WHAT DO
THEY LEARN?
VGG-16#
TRAINING
VGG-16#
TRAINING
• Backward propagation
LEARNING
OBJECTS
PARTS
EXAMPLE
CNN DEMO TIME
Real time web handwritten digit recognition
http://scs.ryerson.ca/~aharley/vis/conv/flat.html
There are a lot of "famous" nets that can be freely downloaded and used off the shelf, like ResNet, which reaches a top-5 error rate of about 3.6% on the 1000-class ImageNet classification task.
VGG-16 and more…
• So now the CNN acts as a feature extractor that compresses the information in the original image into a smaller representation. Since it encodes the content of the image into a smaller feature vector, this CNN is often called the encoder.
• When we process this feature vector and use it as an initial input to the following RNN, that RNN is called the decoder, because it decodes the processed feature vector and turns it into natural language (a sketch follows below).
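A hedged sketch of this encoder-decoder wiring in Keras: the CNN feature vector is projected and used as the LSTM decoder's initial state, and the LSTM then predicts the caption word by word. All sizes are illustrative assumptions.

from tensorflow.keras.layers import Input, Dense, Embedding, LSTM
from tensorflow.keras.models import Model

feat_dim, vocab_size, hidden = 4096, 5000, 256   # illustrative sizes

img_feat = Input(shape=(feat_dim,))
h0 = Dense(hidden, activation='tanh')(img_feat)  # initial hidden state from the image
c0 = Dense(hidden, activation='tanh')(img_feat)  # initial cell state from the image

caption_in = Input(shape=(None,))                # caption tokens generated so far
emb = Embedding(vocab_size, hidden, mask_zero=True)(caption_in)
seq = LSTM(hidden, return_sequences=True)(emb, initial_state=[h0, c0])
next_word = Dense(vocab_size, activation='softmax')(seq)  # distribution over the next word

decoder = Model(inputs=[img_feat, caption_in], outputs=next_word)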
RNN -
RECURRENT
NETWORKS
OFFER A LOT OF
FLEXIBILITY
Vanilla mode of processing without RNN, from fixed-sized input to fixed-
sized output (e.g. image classification).
Sequence output (e.g. image captioning takes an image and outputs a
sentence of words).
Sequence input (e.g. sentiment analysis where a given sentence is
classified as expressing positive or negative sentiment).
Sequence input and sequence output (e.g. Machine Translation: an RNN
reads a sentence in English and then outputs a sentence in French).
Synced sequence input and output (e.g. video classification where we wish
to label each frame of the video).
SEQUENTIAL
PROCESSING OF
FIXED INPUTS
(IN ABSENCE OF
SEQUENCES)
• The figure below shows results from two very nice papers from DeepMind:
• An algorithm learns a recurrent network policy that steers its attention around an image; in particular, it learns to read out house numbers from left to right.
• A recurrent network generates images of digits by learning to sequentially add color to a canvas.
LANGUAGE
GENERATION
WITH
RECURRENT
NEURAL
NETWORKS
• Language generation is
a serial task. We
generate words one
after another.
• This is well modeled by recurrent neural cells: a neuron that uses itself over and over again to accept serial inputs, outputting a new value each time (see the generation loop sketched below).
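A hedged sketch of that serial loop: starting from a <start> token, a trained captioning model (for example, the decoder sketched earlier) is asked repeatedly for the most likely next word until it emits <end>. The model and the word/id lookup tables are assumed to exist already.

import numpy as np

def greedy_caption(model, image_feature, word_to_id, id_to_word, max_len=30):
    words = ['<start>']
    for _ in range(max_len):
        seq = np.array([[word_to_id[w] for w in words]])         # caption so far, as ids
        probs = model.predict([image_feature, seq], verbose=0)   # next-word probabilities
        next_id = int(np.argmax(probs[0, -1]))                   # greedy choice
        word = id_to_word[next_id]
        if word == '<end>':
            break
        words.append(word)
    return ' '.join(words[1:])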
RNN HIDDEN
STATES
RNN HIDDEN
STATES
TRAINING
RNN
TRAINING
PHASE
[Training-phase diagram: an input image X of size 3x224x224 is fed through a pre-trained VGG-16, whose output is connected to the RNN through the weights Wih]
TEST PHASE
RNN AND THE PROBLEM OF MEMORY
All network state is held in a single cell, used over and over again. The internal state can get really complicated, and moving the values around during training can lead to loss of data.
An RNN has a "plugin" architecture, in which we can use different types of cells (compared in the snippet below):
• Simple RNN cell: fastest, but breaks over long sequences. Outdated.
• LSTM cell: slower, supports selectively forgetting and keeping data. Standard.
• GRU cell: like LSTM, but faster due to simpler internal architecture. State of the art.
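In Keras, this "plugin" nature is literal: the three cell types can be swapped without changing the rest of the model.

from tensorflow.keras.layers import SimpleRNN, LSTM, GRU

hidden = 128
simple_layer = SimpleRNN(hidden)  # fastest, struggles with long sequences
lstm_layer   = LSTM(hidden)       # gated: selectively keeps / forgets state
gru_layer    = GRU(hidden)        # gated like LSTM, simpler and a bit faster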
A REGULAR
RNN
• Vanilla RNN
Activation Function
LSTM - LONG
SHORT TERM
MEMORY
• With LSTM
LSTM
• Overall
structure
• Recommended video: https://www.youtube.com/watch?v=WCUNPb-5EYI
GATES
GATING
MEMORY
ANOTHER ENTIRELY
SEPARATE NEURAL
NETWORK THAT LEARNS
WHEN TO FORGET WHAT
MEMORY
ANOTHER ENTIRELY
SEPARATE NEURAL
NETWORK THAT LEARNS
WHAT PREDICTIONS TO
PASS AND WHAT NOT
LONG
SHORT
MEMORY
WITH
ATTENTION
WORK WITH ATTENTION TO
DECIDE WHAT TO IGNORE
LSTM
• Core idea
LSTM
• Step by step
LSTM
• Step by step - Update
GRU -
GATED
RECURRENT
UNIT
A VARIATION OF THE LSTM
GRU
An LSTM has three gates: input, output, and forget. A GRU has only two: an update gate z and a reset gate r.
Update gate: decides how much of the previous memory to keep around.
Reset gate: defines how to combine the new input with the previous value.
Unlike an LSTM, a GRU has no persistent cell state distinct from the hidden state (see the one-step sketch below).
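A hedged NumPy sketch of a single GRU step, following the description above (gate conventions vary slightly between papers and implementations); weight shapes are illustrative.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(Wz @ x + Uz @ h)              # update gate: how much old memory to keep
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate: how to mix input with old state
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate new state
    return z * h + (1.0 - z) * h_tilde        # blend old memory and candidate

# Example with random weights (input dim 4, hidden dim 3).
rng = np.random.default_rng(0)
x, h = rng.normal(size=4), np.zeros(3)
Wz, Wr, Wh = (rng.normal(size=(3, 4)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(size=(3, 3)) for _ in range(3))
h_next = gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh)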
SOFT
ATTENTION
MECHANISM
SOFT
ATTENTION
• Example
SOFT
ATTENTION
• Results
SOFT
ATTENTION
• Results
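The soft-attention slides above are figure-based; as a hedged sketch of the general mechanism (one common additive formulation, not necessarily the exact one in the figures): at each decoding step, every image region is scored against the decoder state, the scores are turned into a softmax distribution, and the weighted average of the regions (the "context vector") is fed to the decoder.

import numpy as np

def soft_attention(features, h, W_f, W_h, v):
    """features: (L, D) image regions; h: (H,) decoder hidden state."""
    scores = np.tanh(features @ W_f + h @ W_h) @ v   # one score per region, shape (L,)
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                           # attention weights, sum to 1
    context = alphas @ features                      # weighted average of regions, shape (D,)
    return context, alphas

# Example: 49 regions (a 7x7 grid) of 512-d features, 256-d decoder state.
rng = np.random.default_rng(0)
features, h = rng.normal(size=(49, 512)), rng.normal(size=256)
W_f, W_h, v = rng.normal(size=(512, 64)), rng.normal(size=(256, 64)), rng.normal(size=64)
context, alphas = soft_attention(features, h, W_f, W_h, v)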
DENSE-CAP
• Fully Convolutional Localization Networks for Dense Captioning
DENSECAP
DENSECAP
• Model Architecture
REFERENCES
• VGG16 – Convolutional Network for Classification and Detection
• The Unreasonable Effectiveness of Recurrent Neural Networks
• Illustrated Guide to LSTM’s and GRU’s: A step by step explanation
• Image Captioning with Keras
• Automatic Image Captioning : Building an image-caption generator
from scratch
• Multi-Modal Methods: Image Captioning (From Translation to
Attention)
• TensorFlow Tutorial #22 - Image Captioning
RECOMMENDED VIDEOS
• Convolutional Neural Network (CNN) models
• CS231n Winter 2016: Lecture 7: Convolutional Neural Networks
• Convolutional Neural Network Course
• Recurrent Neural Networks (RNN) and Long Short-Term Memory
(LSTM)
• CS231n Lecture 10 - Recurrent Neural Networks, Image Captioning,
LSTM
• Andrej Karpathy - Automated Image Captioning with ConvNets and
Recurrent Nets
• Illustrated Guide to LSTM's and GRU's: A step by step explanation
• https://paperswithcode.com/task/text-generation
• https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/
