1. Department of Artificial Intelligence
Winter 2022 (Session: 2022-2023)
G H Raisoni College of Engineering, Nagpur
Presented By:
1. Akundi Harshvardhan (A-20)
2. Arya Bharne (A-24)
3. Priyesh Gawali (A-62)
4. Rajat Satpure (A-65)
Guide:
Prof. Pravin Kshirsagar
Assistant Professor
GHRCE, Nagpur
Title of the Project:
Image Captioning using Deep Learning
and NLP
3. Introduction
• Looking at an image, we can describe what is in it; simply by seeing an image, we can put its content into words.
• In this project, we are developing a system/model that describes the content present in an image, i.e. the model gives the image a caption. This task is called Image Captioning.
• For this we use deep neural networks (DNNs), specifically a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM).
• Since we are dealing with text data (the image captions), we also use Natural Language Processing (NLP).
• By combining all of these, we are developing a model called an Image Caption Generator.
Example output: “A muscular man standing.”
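The division of labour described above can be sketched as a toy pipeline. This is an illustration only: `encode_image` and `predict_next_word` are hypothetical stand-ins for the trained CNN encoder and LSTM decoder, not code from the project.

```python
# Toy sketch of the captioning pipeline (illustration only): a real
# system replaces encode_image with a CNN and predict_next_word with
# a trained LSTM. All names here are hypothetical placeholders.

def encode_image(image):
    # Stand-in for the CNN encoder: map an image to a feature vector.
    return [sum(row) for row in image]

def predict_next_word(features, caption_so_far):
    # Stand-in for the LSTM decoder: pick the next word from a
    # hand-written lookup keyed on the last word generated.
    transitions = {
        "<start>": "a",
        "a": "muscular",
        "muscular": "man",
        "man": "standing",
        "standing": "<end>",
    }
    return transitions[caption_so_far[-1]]

def generate_caption(image, max_len=10):
    features = encode_image(image)
    caption = ["<start>"]
    while len(caption) < max_len:
        word = predict_next_word(features, caption)
        if word == "<end>":  # stop when the decoder emits the end token
            break
        caption.append(word)
    return " ".join(caption[1:])

print(generate_caption([[0, 1], [1, 0]]))  # a muscular man standing
```

In the real model, the transition table is replaced by a learned probability distribution over the vocabulary, conditioned on both the image features and the words generated so far.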
4. Abstract
• Image captioning is an important task nowadays. It helps describe images in editing software, assists visually impaired users, and can generate captions for social media posts.
• In recent years, researchers have made significant progress in image captioning.
• Our solution uses Long Short-Term Memory (LSTM) together with a Convolutional Neural Network (CNN).
• We use the Convolutional Neural Network (CNN) to extract features from images and the Long Short-Term Memory (LSTM) network to generate a description from the extracted image features.
5. • To describe the contents of an image using a CNN.
• To showcase the effectiveness of LSTM.
• To create a working model that describes an image on the basis of its extracted features.
• To understand the features of an image.
• To predict the next words from extracted features to form a caption.
Objectives
6. ● Template-based approaches are able to generate grammatically correct captions, but since the templates are predefined, they cannot generate variable-length captions.
● Retrieval-based methods produce general and grammatically correct captions. However, they cannot generate more descriptive and semantically correct captions.
● Captions are most often generated for the whole scene in an image. However, captions can also be generated for different areas of an image, as in dense captioning.
● A plain RNN, when used along with a CNN, has a very short-term memory.
● The multimodal recurrent neural network method is similar to the method of Kiros, which uses a fixed-length context; but in this method, the temporal context is stored in a recurrent architecture that allows an arbitrary context length.
Literature Survey (Survey of existing products)
7. • We describe a thing by its features. For example, if we see a large red rose, we might describe it as “a big, beautiful red rose”; in this sentence we use features such as size (“large”), colour (“red”), and appearance (“beautiful”) to describe the flower, i.e. we give it a caption. This process of describing something by seeing it is what image captioning automates.
• In this project we are developing an Image Caption Generator that extracts features from an image using a Convolutional Neural Network (CNN); from the extracted features, the model generates a caption for the given image by arranging words in a proper, meaningful order using a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM).
• The LSTM remembers the previous words, which helps in predicting the words that come later, so that the output forms a proper sentence (caption).
• By combining the CNN (for feature extraction) with the RNN/LSTM (for predicting and arranging words), we build our model.
Proposed Methodology/System Architecture
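One way to see how the decoder learns to arrange words in a meaningful order is to look at how a single caption is typically expanded into training examples: each prefix of the caption, paired with the image's feature vector, becomes one input whose target is the next word. A minimal sketch of this common supervised setup (the function name and the feature vector here are hypothetical, not from the project):

```python
# Sketch of the usual supervised setup for caption decoders: every
# prefix of a caption, together with the image's feature vector,
# forms one training example whose target is the next word.

def make_training_pairs(features, caption):
    tokens = ["<start>"] + caption.split() + ["<end>"]
    pairs = []
    for i in range(1, len(tokens)):
        # input: image features plus all words so far; target: next word
        pairs.append(((features, tokens[:i]), tokens[i]))
    return pairs

for (feats, prefix), target in make_training_pairs([0.3, 0.7], "a red rose"):
    print(" ".join(prefix), "->", target)
# <start> -> a
# <start> a -> red
# <start> a red -> rose
# <start> a red rose -> <end>
```

The `<start>` and `<end>` markers let the trained model know where to begin generating and when to stop.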
8. (System architecture diagram)
9. • Category:
Machine Learning, Deep Learning and NLP
• Programming Language:
Python
• Tools & Libraries:
TensorFlow, Keras, NumPy, tqdm
• IDE:
Google Colab, Kaggle and Jupyter Notebook
• Prerequisites:
Python, Machine Learning, Deep Learning and NLP
• Datasets:
Flickr dataset.
Hardware / Software Specification
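The listed libraries can be installed with pip; a minimal environment-setup sketch (the slides do not pin versions, so none are given here):

```shell
# Install the tools and libraries listed above (unpinned versions).
pip install tensorflow keras numpy tqdm
```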
10. Our developed solution is a model that describes an image using the features extracted from it, i.e. the model gives the image a caption. We used a Convolutional Neural Network (CNN) for feature extraction from an image and a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) to predict the words that form a proper caption for the given image.
Conclusion
11. 1. International Journal of Innovative Research in Electrical, Electronics, Instrumentation and
Control Engineering NCIET- 2020
2. Aditya, A. N., Anditya, A. and Suyanto, (2019). “Generating Image Description on
Indonesian Language using Convolutional Neural Network and Gated Recurrent Unit”, 7th
International Conference on Information and Communication Technology (ICoICT).
3. Chetan, A. and Vaishli, J. (2018). “Image Caption Generation using Deep Learning
Technique”, Fourth International Conference on Computing Communication Control and
Automation (ICCUBEA).
4. Huda A. Al-muzaini, Tasniem N. and Hafida B. (2018) “Automatic Arabic Image
Captioning using RNN LSTM-Based Language Model and CNN”, International Journal of
Advanced Computer Science and Applications (IJACSA), Vol. 9, No.6.
References
12. 1. J. Liu, G. Wang, P. Hu, L.-Y. Duan, and A. C. Kot. Global context-aware
attention LSTM networks for 3D action recognition. CVPR, 2017.
2. J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing when to look: Adaptive
attention via a visual sentinel for image captioning. CVPR, 2017.
3. S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical
sequence training for image captioning. CVPR, 2017.
4. I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with restarts.
ICLR, 2016.
5. J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully convolutional
localization networks for dense captioning. In CVPR, 2016.
References