VIDEO DESCRIPTION USING DEEP
LEARNING
By: Pranjal Mahajan
Mentor: Pranjali Deshpande
AGENDA
1. PROBLEM STATEMENT
2. INTRODUCTION
3. LITERATURE SURVEY
4. MOTIVATION
5. BACKGROUND
6. SYSTEM DESCRIPTION
7. REQUIREMENTS
8. ADVANTAGES
9. LIMITATIONS
10. CONCLUSION
1. PROBLEM STATEMENT
• To identify the contents of a video and describe them in natural language.
2. INTRODUCTION
• A machine can efficiently perform image classification, object recognition,
and video segmentation.
• Tasks like video description, however, remain a challenge.
• Video description has applications in
1. human-robot interaction,
2. helping the visually impaired,
3. video retrieval by content.
3. LITERATURE SURVEY
Publication: IEEE CVPR, 2020
Methodology and Techniques: The video is condensed into a spatio-temporal graph network, which serves as the object branch. This interaction information is distilled into another scene branch via an object-aware knowledge distillation mechanism.
Remarks: Takes interaction information into consideration. Can shortcut the classification problem using the background.

Publication: ICCVW, 2019
Methodology and Techniques: A two-stage training setting optimises the encoder and decoder simultaneously. The architecture is initialized with pre-trained encoders and decoders; the most relevant features for video description generation are then learnt.
Remarks: Vocabulary is large. Computationally expensive.

Publication: arXiv, 2018
Methodology and Techniques: A self-critical REINFORCE algorithm is used to obtain better weights for the LSTMs and to train them. The full model is then tuned jointly, freeing the CNN weights.
Remarks: Can generate complex sentences. Challenging to train such a large model.

Publication: ACM Books, 2018
Methodology and Techniques: An encoder-decoder framework that uses an encoder (CNN) to extract visual features from raw video frames and a decoder (RNN/LSTM) to produce the desired output sentence.
Remarks: Easy to train. Limited to a small vocabulary.

Publication: AAAI, 2013
Methodology and Techniques: A template-based approach in which SVO triplets are identified using a combination of visual object and activity detectors, followed by search-based optimization to find their best combination.
Remarks: Simplest approach. Generated sentences are simple.
4. MOTIVATION
• Spatial, temporal, and attribute-based attention models
1. are inefficient at exploiting video temporal structure over longer ranges, and
2. require heavy computation.
• The Hierarchical Recurrent Neural Encoder (HRNE) model overcomes these challenges.
5. BACKGROUND
5.1 CONVOLUTIONAL NEURAL NETWORK (CNN)
5.2 LONG SHORT-TERM MEMORY (LSTM)
An LSTM is a type of RNN with three gates (a minimal sketch follows this list):
• input gate (i)
• forget gate (f)
• output gate (o)
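A minimal NumPy sketch of a single LSTM step, assuming a standard formulation; the variable names and weight shapes are illustrative rather than taken from the project code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step (illustrative shapes, not the project's exact code).
    x_t: input at time t, shape (d_in,); h_prev, c_prev: previous states, shape (d_h,).
    W, U, b: stacked parameters for the four gates: (4*d_h, d_in), (4*d_h, d_h), (4*d_h,).
    """
    d_h = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * d_h:1 * d_h])      # input gate: how much new information to write
    f = sigmoid(z[1 * d_h:2 * d_h])      # forget gate: how much of the old cell state to keep
    o = sigmoid(z[2 * d_h:3 * d_h])      # output gate: how much of the cell state to expose
    g = np.tanh(z[3 * d_h:4 * d_h])      # candidate cell update
    c_t = f * c_prev + i * g             # new cell state
    h_t = o * np.tanh(c_t)               # new hidden state
    return h_t, c_t
```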
6. SYSTEM DESCRIPTION
• Input: A video (in .npy format).
• Expected output: A natural-language description of the input video.
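A small sketch of reading such an input, assuming the .npy file stores a stack of RGB frames; the file name and array shape are illustrative.

```python
import numpy as np

# Hypothetical input file; shape assumed to be (num_frames, height, width, 3).
frames = np.load("input_video.npy")
print(frames.shape, frames.dtype)
```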
6.1 ENCODER-DECODER MODELS
6.1.1 ENCODER
 The encoder extracts visual features from the raw video frames into a fixed-dimensional vector (h_e) that represents the entire sequence.
 The video feature pool consists of (a feature-extraction sketch follows this list):
1. Object appearance features –
extracted using VGG16 pretrained on the ImageNet dataset.
2. Action features –
extracted using C3D pretrained on an activity recognition dataset.
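A hedged sketch of the appearance branch using the Keras VGG16 pretrained on ImageNet; the pooling choice, frame shape, and file name are assumptions, and the C3D action branch is omitted because it is not bundled with Keras.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

# Pretrained VGG16 without the classifier head; global average pooling
# turns each frame into a single 512-dimensional appearance feature.
appearance_model = VGG16(weights="imagenet", include_top=False, pooling="avg")

# Frames assumed to be stored as (num_frames, 224, 224, 3) RGB values.
frames = np.load("input_video.npy").astype("float32")
frame_features = appearance_model.predict(preprocess_input(frames))  # (num_frames, 512)
```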
6.1.2 DECODER
 The decoder takes that vector as its initial state; it is then fed to a BLSTM to generate the desired output sentence.
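A minimal Keras sketch of such a decoder, assuming the encoder summary vector h_e initializes the recurrent state and a word-level softmax produces the sentence; the layer sizes and vocabulary size are illustrative, and a unidirectional LSTM is used here for simplicity in place of the BLSTM mentioned above.

```python
from tensorflow.keras import layers, Model

VOCAB_SIZE, EMB_DIM, HID_DIM = 10000, 300, 512   # illustrative sizes

# Encoder summary vector h_e (e.g. produced by the HRNE encoder).
h_e = layers.Input(shape=(HID_DIM,), name="encoder_state")
# Previously generated words, fed in as integer token ids.
word_ids = layers.Input(shape=(None,), dtype="int32", name="word_ids")

x = layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(word_ids)
# The encoder vector initializes both the hidden and the cell state of the decoder.
dec_out = layers.LSTM(HID_DIM, return_sequences=True)(x, initial_state=[h_e, h_e])
word_probs = layers.Dense(VOCAB_SIZE, activation="softmax")(dec_out)

decoder = Model([h_e, word_ids], word_probs)
decoder.summary()
```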
6.2 HIERARCHICAL RECURRENT NEURAL ENCODER (HRNE)
• The first LSTM layer is used to explore local temporal structure within short subsequences of frames.
• The second LSTM layer learns the temporal dependencies among those subsequences (a sketch of this two-layer structure follows below).
• A more complex HRNE model can be built by adding more layers to obtain multiple time-scale abstractions of the visual information.
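A hedged sketch of that two-layer idea in Keras: per-frame features are grouped into fixed-size chunks, one LSTM (shared across chunks via TimeDistributed) summarizes local temporal structure, and a second LSTM runs over the chunk summaries; the chunk size and dimensions are assumptions, and details of the published HRNE (e.g. attention) are not shown.

```python
from tensorflow.keras import layers, Model

N_CHUNKS, CHUNK_LEN, FEAT_DIM, HID_DIM = 8, 5, 512, 512   # illustrative sizes

# Frame features arranged as (n_chunks, chunk_len, feat_dim) per video.
chunks_in = layers.Input(shape=(N_CHUNKS, CHUNK_LEN, FEAT_DIM))

# First layer: one LSTM shared across chunks captures local temporal structure.
chunk_vectors = layers.TimeDistributed(layers.LSTM(HID_DIM))(chunks_in)  # (N_CHUNKS, HID_DIM)

# Second layer: an LSTM over the chunk summaries captures longer-range dependencies.
h_e = layers.LSTM(HID_DIM)(chunk_vectors)

hrne_encoder = Model(chunks_in, h_e)
hrne_encoder.summary()
```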
6.3 DATASET
• The MSR-VTT dataset is used for training and testing.
• In its current version, MSR-VTT provides 10K web video clips with 41.2
hours and 200K clip-sentence pairs in total.
6.4 EVALUATION METRICS
• Standard captioning metrics (e.g. BLEU, METEOR, CIDEr, ROUGE-L) are used; the generated sentence correlates well with human judgment when these metrics are high.
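As one hedged example, sentence-level BLEU can be computed with NLTK; the captions below are made up for illustration, and BLEU is only one of the scores usually reported.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical reference captions and a generated caption for one clip.
references = [
    "a man is playing a guitar on stage".split(),
    "a person plays guitar in front of a crowd".split(),
]
candidate = "a man plays a guitar on stage".split()

score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```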
7. REQUIREMENTS
• Central Processing Unit (CPU) — Intel Core i5 6th Gen. processor or higher.
• RAM — 8 GB minimum.
• Graphics Processing Unit (GPU) — NVIDIA GeForce GTX 960 or higher.
• Operating System — Ubuntu, Mac or Microsoft Windows 10.
• Software – a Python IDE with libraries such as Keras and TensorFlow.
8. ADVANTAGES
• Exploits temporal information over longer time spans.
• Shortens the information flow path while adding non-linearity, providing a better trade-off between efficiency and effectiveness.
• Is able to uncover temporal transitions between frame chunks with different
granularities.
9. LIMITATIONS
• The LSTM decoder is prone to overfitting.
• Hence, the generalization capability needs to be validated.
• As future work, a softmax classifier trained with video labels can be plugged on top of the encoder in place of the LSTM language decoder.
10. CONCLUSION
• Raw video is taken as input, and a 2D CNN (VGG16) and a 3D CNN (C3D) are applied to extract object appearance and action features, respectively.
• To obtain the encoded vector, multiple LSTMs can be stacked using the HRNE.
• The decoder is an LSTM that takes the visual features as input and generates a natural-language description of the video.
REFERENCES
[1] N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama. Generating natural-language video descriptions using text-mined knowledge. In AAAI, July 2013.
[2] Z. Wu, T. Yao, Y. Fu, and Y. Jiang. Deep learning for video classification and captioning. In S.-F. Chang, editor, Frontiers of Multimedia Research, pages 3–29. ACM Books, 2018.
[3] S. Olivastri, G. Singh, and F. Cuzzolin. End-to-end video captioning. In Large Scale Holistic Video Understanding, ICCVW, 2019.
[4] B. Pan, H. Cai, D.-A. Huang, K.-H. Lee, A. Gaidon, E. Adeli, and J. C. Niebles. Spatio-temporal graph for video captioning with knowledge distillation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[5] L. Li and B. Gong. End-to-end video captioning with multitask reinforcement learning. arXiv preprint arXiv:1803.07950, 2018.
[6] Y. Gui, D. Guo, and Y. Zhao. Semantic Enhanced Encoder-Decoder Network (SEN) for video captioning. In MAHCI '19, 2019.
[7] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
[8] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[9] N. Aafaq, A. Mian, W. Liu, S. Z. Gilani, and M. Shah. Video description: A survey of methods, datasets and evaluation metrics. In ACM Computing Surveys (CSUR), 2019.
[10] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang. Hierarchical recurrent neural encoder for video representation with application to captioning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
THANK YOU
