VIDEO DESCRIPTION USING DEEP
LEARNING
By: Pranjal Mahajan
Mentor: Pranjali Deshpande
AGENDA
1. PROBLEM STATEMENT
2. INTRODUCTION
3. LITERATURE SURVEY
4. MOTIVATION
5. BACKGROUND
6. SYSTEM DESCRIPTION
7. REQUIREMENTS
8. ADVANTAGES
9. LIMITATIONS
10. CONCLUSION
1. PROBLEM STATEMENT
• To identify the contents of a video and describe them in natural language.
2. INTRODUCTION
• A machine can efficiently perform image classification, object recognition,
and video segmentation.
• Tasks like video description, however, remain a challenge.
• Video description has applications in
1. human-robot interaction,
2. helping the visually impaired,
3. video retrieval by content.
3. LITERATURE SURVEY
Publication: IEEE CVPR, 2020
Methodology and Techniques: The video is condensed into a spatio-temporal graph network, which serves as the object branch. This interaction information is distilled into another scene branch via an object-aware knowledge distillation mechanism.
Remarks: Takes interaction information into consideration. Can shortcut the classification problem using the background.

Publication: ICCVW, 2019
Methodology and Techniques: A two-stage training setting optimises the encoder and decoder simultaneously. The architecture is initialized with pre-trained encoders and decoders; the most relevant features for video description generation are then learnt.
Remarks: Vocabulary is large. Computationally expensive.

Publication: arXiv, 2018
Methodology and Techniques: A self-critical REINFORCE algorithm is used to obtain better weights for the LSTMs and to train them. The full model is then tuned jointly, freeing the CNN weights.
Remarks: Can generate complex sentences. Challenging to train such a large model.

Publication: ACM Books, 2018
Methodology and Techniques: An encoder-decoder framework that uses an encoder (CNN) to extract visual features from raw video frames and a decoder (RNN/LSTM) to produce the desired output sentence.
Remarks: Easy to train. Limited to a small vocabulary.

Publication: AAAI, 2013
Methodology and Techniques: A template-based approach in which SVO triplets are identified using a combination of visual object and activity detectors, followed by search-based optimization to find their best combination.
Remarks: Simplest approach. Generated sentences are simple.
4. MOTIVATION
• Spatial, temporal, and attribute-based attention models
1. are inefficient at exploiting video temporal structure over longer ranges, and
2. require heavy computation.
• The Hierarchical Recurrent Neural Encoder (HRNE) model overcomes these challenges.
5. BACKGROUND
5.1 CONVOLUTIONAL NEURAL NETWORK (CNN)
5.2 LONG SHORT-TERM MEMORY (LSTM)
An LSTM is a type of RNN with three gates (a minimal sketch follows this list):
• input gate (i)
• forget gate (f)
• output gate (o)
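A minimal NumPy sketch of a single LSTM step, assuming a standard formulation; the variable names and weight shapes are illustrative rather than taken from the project code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step (illustrative shapes, not the project's exact code).
    x_t: input at time t, shape (d_in,); h_prev, c_prev: previous states, shape (d_h,).
    W, U, b: stacked parameters for the four gates: (4*d_h, d_in), (4*d_h, d_h), (4*d_h,).
    """
    d_h = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * d_h:1 * d_h])      # input gate: how much new information to write
    f = sigmoid(z[1 * d_h:2 * d_h])      # forget gate: how much of the old cell state to keep
    o = sigmoid(z[2 * d_h:3 * d_h])      # output gate: how much of the cell state to expose
    g = np.tanh(z[3 * d_h:4 * d_h])      # candidate cell update
    c_t = f * c_prev + i * g             # new cell state
    h_t = o * np.tanh(c_t)               # new hidden state
    return h_t, c_t
```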
6. SYSTEM DESCRIPTION
• Input: A video (in .npy format).
• Expected output: A natural-language description of the input video.
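A small sketch of reading such an input, assuming the .npy file stores a stack of RGB frames; the file name and array shape are illustrative.

```python
import numpy as np

# Hypothetical input file; shape assumed to be (num_frames, height, width, 3).
frames = np.load("input_video.npy")
print(frames.shape, frames.dtype)
```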
6.1 ENCODER-DECODER MODELS
6.1.1 ENCODER
 The encoder extracts visual features from the raw video frames into a fixed-dimensional vector (h_e) that represents the entire sequence.
 The video feature pool consists of (a feature-extraction sketch follows this list):
1. Object appearance features –
extracted using VGG16 pretrained on the ImageNet dataset.
2. Action features –
extracted using C3D pretrained on an activity recognition dataset.
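A hedged sketch of the appearance branch using the Keras VGG16 pretrained on ImageNet; the pooling choice, frame shape, and file name are assumptions, and the C3D action branch is omitted because it is not bundled with Keras.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

# Pretrained VGG16 without the classifier head; global average pooling
# turns each frame into a single 512-dimensional appearance feature.
appearance_model = VGG16(weights="imagenet", include_top=False, pooling="avg")

# Frames assumed to be stored as (num_frames, 224, 224, 3) RGB values.
frames = np.load("input_video.npy").astype("float32")
frame_features = appearance_model.predict(preprocess_input(frames))  # (num_frames, 512)
```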
6.1.2 DECODER
 The decoder takes that vector as its initial state; it is then fed to a BLSTM to generate the desired output sentence.
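A minimal Keras sketch of such a decoder, assuming the encoder summary vector h_e initializes the recurrent state and a word-level softmax produces the sentence; the layer sizes and vocabulary size are illustrative, and a unidirectional LSTM is used here for simplicity in place of the BLSTM mentioned above.

```python
from tensorflow.keras import layers, Model

VOCAB_SIZE, EMB_DIM, HID_DIM = 10000, 300, 512   # illustrative sizes

# Encoder summary vector h_e (e.g. produced by the HRNE encoder).
h_e = layers.Input(shape=(HID_DIM,), name="encoder_state")
# Previously generated words, fed in as integer token ids.
word_ids = layers.Input(shape=(None,), dtype="int32", name="word_ids")

x = layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(word_ids)
# The encoder vector initializes both the hidden and the cell state of the decoder.
dec_out = layers.LSTM(HID_DIM, return_sequences=True)(x, initial_state=[h_e, h_e])
word_probs = layers.Dense(VOCAB_SIZE, activation="softmax")(dec_out)

decoder = Model([h_e, word_ids], word_probs)
decoder.summary()
```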
6.2 HIERARCHICAL RECURRENT NEURAL ENCODER (HRNE)
• The first LSTM layer is used to explore local temporal structure within short subsequences of frames.
• The second LSTM layer learns the temporal dependencies among those subsequences (a sketch of this two-layer structure follows below).
• A more complex HRNE model can be built by adding more layers to obtain multiple time-scale abstractions of the visual information.
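A hedged sketch of that two-layer idea in Keras: per-frame features are grouped into fixed-size chunks, one LSTM (shared across chunks via TimeDistributed) summarizes local temporal structure, and a second LSTM runs over the chunk summaries; the chunk size and dimensions are assumptions, and details of the published HRNE (e.g. attention) are not shown.

```python
from tensorflow.keras import layers, Model

N_CHUNKS, CHUNK_LEN, FEAT_DIM, HID_DIM = 8, 5, 512, 512   # illustrative sizes

# Frame features arranged as (n_chunks, chunk_len, feat_dim) per video.
chunks_in = layers.Input(shape=(N_CHUNKS, CHUNK_LEN, FEAT_DIM))

# First layer: one LSTM shared across chunks captures local temporal structure.
chunk_vectors = layers.TimeDistributed(layers.LSTM(HID_DIM))(chunks_in)  # (N_CHUNKS, HID_DIM)

# Second layer: an LSTM over the chunk summaries captures longer-range dependencies.
h_e = layers.LSTM(HID_DIM)(chunk_vectors)

hrne_encoder = Model(chunks_in, h_e)
hrne_encoder.summary()
```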
6.3 DATASET
• The MSR-VTT dataset is used for training and testing.
• In its current version, MSR-VTT provides 10K web video clips with 41.2
hours and 200K clip-sentence pairs in total.
6.4 EVALUATION METRICS
• Standard captioning metrics (e.g. BLEU, METEOR, CIDEr, ROUGE-L) are used; the generated sentence correlates well with human judgment when these metrics are high.
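As one hedged example, sentence-level BLEU can be computed with NLTK; the captions below are made up for illustration, and BLEU is only one of the scores usually reported.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical reference captions and a generated caption for one clip.
references = [
    "a man is playing a guitar on stage".split(),
    "a person plays guitar in front of a crowd".split(),
]
candidate = "a man plays a guitar on stage".split()

score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```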
7. REQUIREMENTS
• Central Processing Unit (CPU) — Intel Core i5 6th Gen. processor or higher.
• RAM — 8 GB minimum.
• Graphics Processing Unit (GPU) — NVIDIA GeForce GTX 960 or higher.
• Operating System — Ubuntu, Mac or Microsoft Windows 10.
• Software – a Python IDE with libraries such as Keras and TensorFlow.
8. ADVANTAGES
• Exploits temporal information over longer time spans.
• Shortens the information flow path while adding non-linearity, providing a better trade-off between efficiency and effectiveness.
• Is able to uncover temporal transitions between frame chunks with different
granularities.
9. LIMITATIONS
• The LSTM decoder is prone to overfitting.
• Hence, the generalization capability needs to be validated.
• As future work, a softmax classifier trained with video labels can be plugged on top of the encoder in place of the LSTM language decoder.
10. CONCLUSION
• Raw video is taken as input, and a 2D CNN (VGG16) and a 3D CNN (C3D) are applied to extract object appearance and action features, respectively.
• To obtain the encoded vector, multiple LSTMs can be stacked using the HRNE.
• The decoder is an LSTM that takes the visual features as input and generates a natural-language description of the video.
REFERENCES
[1] N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama. Generating natural-language video descriptions using text-mined knowledge. In AAAI, July 2013.
[2] Z. Wu, T. Yao, Y. Fu, and Y. Jiang. Deep learning for video classification and captioning. In S.-F. Chang, editor, Frontiers of Multimedia Research, pages 3–29. ACM Books, 2018.
[3] S. Olivastri, G. Singh, and F. Cuzzolin. End-to-end video captioning. In Large Scale Holistic Video Understanding, ICCVW, 2019.
[4] B. Pan, H. Cai, D.-A. Huang, K.-H. Lee, A. Gaidon, E. Adeli, and J. C. Niebles. Spatio-temporal graph for video captioning with knowledge distillation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[5] L. Li and B. Gong. End-to-end video captioning with multitask reinforcement learning. arXiv preprint arXiv:1803.07950, 2018.
[6] Y. Gui, D. Guo, and Y. Zhao. Semantic Enhanced Encoder-Decoder Network (SEN) for video captioning. In MAHCI '19, 2019.
[7] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
[8] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[9] N. Aafaq, A. Mian, W. Liu, S. Z. Gilani, and M. Shah. Video description: A survey of methods, datasets and evaluation metrics. In ACM Computing Surveys (CSUR), 2019.
[10] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang. Hierarchical recurrent neural encoder for video representation with application to captioning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
THANK YOU
