VIDEO DESCRIPTION USING DEEP
LEARNING
By : Pranjal Mahajan
Mentor : Pranjali Deshpande
AGENDA
1. PROBLEM STATEMENT
2. INTRODUCTION
3. LITERATURE SURVEY
4. MOTIVATION
5. BACKGROUND
6. SYSTEM DESCRIPTION
7. REQUIREMENTS
8. ADVANTAGES
9. LIMITATIONS
10. CONCLUSION
1. PROBLEM STATEMENT
• To identify the contents of a video and describe them in natural language.
2. INTRODUCTION
• A machine can efficiently perform image classification, object recognition,
and video segmentation.
• Tasks such as video description remain a challenge.
• Video description has applications in
1. human-robot interaction,
2. helping the visually impaired,
3. video retrieval by content.
3. LITERATURE SURVEY
Publication: IEEE CVPR, 2020
Methodology and Techniques: The video is condensed into a spatio-temporal graph network, which serves as the object branch. This interaction information is distilled into another scene branch via the object-aware knowledge distillation mechanism.
Remarks: Takes interaction information into consideration. Can shortcut the classification problem using background.

Publication: ICCVW, 2019
Methodology and Techniques: Two-stage training setting to optimise both encoder and decoder simultaneously. The architecture is initialized using pre-trained encoders and decoders. Then the most relevant features for video description generation are learnt.
Remarks: Vocabulary is large. Is computationally expensive.

Publication: arXiv, 2018
Methodology and Techniques: A self-critical REINFORCE algorithm is used to train the LSTMs and obtain better weights for them. Then the full model is tuned jointly, freeing the weights of the CNNs.
Remarks: Can generate complex sentences. Challenging to train such a big model.

Publication: ACM Books, 2018
Methodology and Techniques: Encoder-decoder framework that uses an encoder (CNN) to extract visual features from raw video frames and a decoder (RNN/LSTM) to produce the desired output sentence.
Remarks: Easy to train. Limited to a small vocabulary.

Publication: AAAI, 2013
Methodology and Techniques: Template-based approach in which SVO triplets are identified using a combination of visual object and activity detectors, followed by search-based optimization to get their best combination.
Remarks: Simplest approach. Generated sentences are simple.
4. MOTIVATION
• Spatial, temporal and attribute-based attention models
1. cannot efficiently exploit the temporal structure of a video over a longer range,
2. require computationally heavy operations.
• The Hierarchical Recurrent Neural Encoder (HRNE) model is able to overcome these
challenges.
5. BACKGROUND
5.1 CONVOLUTIONAL NEURAL NETWORK (CNN)
5.2 LONG SHORT-TERM MEMORY (LSTM)
The LSTM is an RNN with three gates:
• input gate (i)
• forget gate (f)
• output gate (o)
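As a minimal sketch of how the three gates interact, the following NumPy function performs one LSTM time step. The weight layout (rows stacked in the order i, f, o, g), the names `W`, `U`, `b`, and the toy sizes are assumptions made for illustration; they are not taken from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    Assumed layout: W is (4*H, D), U is (4*H, H), b is (4*H,),
    with rows stacked in the order i, f, o, g.
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate: how much new content to write
    f = sigmoid(z[H:2 * H])      # forget gate: how much old cell state to keep
    o = sigmoid(z[2 * H:3 * H])  # output gate: how much cell state to expose
    g = np.tanh(z[3 * H:4 * H])  # candidate cell update
    c = f * c_prev + i * g       # new cell state
    h = o * np.tanh(c)           # new hidden state
    return h, c

# Toy usage with hypothetical sizes: 5-d input, 3-d hidden state.
rng = np.random.default_rng(0)
D, H = 5, 3
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, U, b)
print(h.shape, c.shape)
```

With all weights set to zero, every gate evaluates to 0.5 and the candidate update to 0, so the step simply halves the previous cell state. That makes a convenient sanity check.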
6. SYSTEM DESCRIPTION
• Input : A video (in .npy format).
• Expected Output : Natural language description of the
input video.
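The slides do not say what the .npy file contains. A plausible reading is that it stores pre-extracted per-frame features; the sketch below assumes a hypothetical array of 80 frames with 4096 features each and shows the save/load round trip with NumPy. The file name and shape are illustrative only.

```python
import os
import tempfile

import numpy as np

# Hypothetical input: 80 sampled frames, each a 4096-d feature vector.
features = np.random.default_rng(0).normal(size=(80, 4096)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "video0.npy")  # illustrative file name
    np.save(path, features)                 # write the array in .npy format
    loaded = np.load(path)                  # read it back as an ndarray

print(loaded.shape, loaded.dtype)
```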
6.1 ENCODER-DECODER MODELS
6.1.1 ENCODER
• The encoder extracts visual features from raw video frames into a fixed-dimension vector (h_e) that represents the entire sequence.
• The video feature pool consists of
1. Object appearance feature:
extracted using VGG16 pretrained on the ImageNet dataset.
2. Action feature:
extracted using C3D pretrained on an activity recognition dataset.
6.1.2 DECODER
• The decoder takes that vector as its initial state and feeds it to a BLSTM
to generate the desired output sentence.
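The data flow described above (feature pool, fixed-dimension vector, decoder started from that vector) can be sketched as follows. To keep the example short, a plain tanh recurrence is swapped in for the LSTM/BLSTM cells, the weights are random rather than trained, and all sizes, the greedy decoding strategy, and the `bos`/`eos` token ids are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_step(x, h, Wx, Wh):
    # Plain tanh recurrence, used only as a short stand-in for an LSTM cell.
    return np.tanh(Wx @ x + Wh @ h)

def encode(features, Wx, Wh):
    """Run the recurrence over all time steps; the last state is the video vector."""
    h = np.zeros(Wh.shape[0])
    for x in features:
        h = rnn_step(x, h, Wx, Wh)
    return h

def greedy_decode(h_e, embed, Wx, Wh, Wout, bos, eos, max_len=10):
    """Start from h_e and emit the highest-scoring word id at each step."""
    h, word, out = h_e, bos, []
    for _ in range(max_len):
        h = rnn_step(embed[word], h, Wx, Wh)
        word = int(np.argmax(Wout @ h))
        if word == eos:
            break
        out.append(word)
    return out

# Hypothetical sizes: 8 time steps, 6-d appearance and 4-d action features.
appearance = rng.normal(size=(8, 6))  # stand-in for VGG16 features
action = rng.normal(size=(8, 4))      # stand-in for C3D features
pool = np.concatenate([appearance, action], axis=1)  # feature pool, shape (8, 10)

hidden, vocab, emb = 5, 7, 3
h_e = encode(pool, rng.normal(size=(hidden, 10)), rng.normal(size=(hidden, hidden)))
words = greedy_decode(
    h_e,
    embed=rng.normal(size=(vocab, emb)),
    Wx=rng.normal(size=(hidden, emb)),
    Wh=rng.normal(size=(hidden, hidden)),
    Wout=rng.normal(size=(vocab, hidden)),
    bos=0,
    eos=1,
)
print(h_e.shape, words)
```

Because the weights are random, the emitted word ids are meaningless; the point is only the shape of the pipeline.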
6.2 HIERARCHICAL RECURRENT NEURAL
ENCODER (HRNE)
• The first LSTM layer explores the local temporal structure within each frame chunk.
• The second LSTM layer learns the temporal dependencies among frame chunks.
• A more complex HRNE model can be built by adding more layers, giving multiple
time-scale abstractions of the visual information.
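A simplified two-layer hierarchy can be sketched in NumPy as below. This is an illustration under stated assumptions, not the HRNE implementation: chunks are non-overlapping and of fixed length, no attention is used, the weights are random, and the helper names (`init_lstm`, `run_lstm`, `hrne_encode`) are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(rng, in_dim, hid):
    # One weight matrix acting on [x; h], rows stacked as i, f, o, g (assumed layout).
    return {"W": rng.normal(scale=0.1, size=(4 * hid, in_dim + hid)),
            "b": np.zeros(4 * hid), "hid": hid}

def run_lstm(params, inputs):
    """Run an LSTM over a sequence and return the final hidden state."""
    H = params["hid"]
    h, c = np.zeros(H), np.zeros(H)
    for x in inputs:
        z = params["W"] @ np.concatenate([x, h]) + params["b"]
        i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
        g = np.tanh(z[3 * H:])
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

def hrne_encode(frames, chunk_len, layer1, layer2):
    # Layer 1: summarize each chunk of consecutive frames (local structure).
    chunks = [frames[s:s + chunk_len] for s in range(0, len(frames), chunk_len)]
    chunk_states = [run_lstm(layer1, chunk) for chunk in chunks]
    # Layer 2: model the dependencies among the chunk summaries.
    return run_lstm(layer2, chunk_states), len(chunks)

rng = np.random.default_rng(0)
frames = rng.normal(size=(20, 10))  # hypothetical: 20 frames, 10-d features
layer1 = init_lstm(rng, in_dim=10, hid=8)
layer2 = init_lstm(rng, in_dim=8, hid=6)
video_vec, n_chunks = hrne_encode(frames, 5, layer1, layer2)
print(video_vec.shape, n_chunks)
```

With 20 frames and chunks of 5, the second layer processes only 4 steps instead of 20, which is where the shorter information path comes from.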
6.3 DATASET
• The MSR-VTT dataset is used for training and testing.
• In its current version, MSR-VTT provides 10K web video clips totalling 41.2
hours, and 200K clip-sentence pairs in total.
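The figures above imply some per-clip averages, worked out in the short calculation below. The round numbers 10K and 200K are taken at face value, so the results are approximate.

```python
clips = 10_000    # "10K web video clips"
hours = 41.2      # total duration in hours
pairs = 200_000   # "200K clip-sentence pairs"

sentences_per_clip = pairs / clips        # average captions per clip
avg_clip_seconds = hours * 3600 / clips   # average clip length in seconds

print(f"{sentences_per_clip:.0f} sentences per clip on average")
print(f"{avg_clip_seconds:.1f} seconds per clip on average")
```

So each clip comes with roughly 20 reference sentences and lasts about 15 seconds on average.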
6.4 EVALUATION METRICS
• Higher metric scores indicate that the generated sentence correlates better
with human judgment.
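The slides do not name the metrics used. As an illustrative assumption, many captioning metrics are built on n-gram overlap between the generated sentence and human reference sentences; the sketch below computes a clipped unigram precision, which is one ingredient of BLEU-style scores and not a full metric. The function name and example sentences are hypothetical.

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
    """Fraction of candidate words that also occur in a reference.

    Each word's count is clipped to the maximum number of times it
    appears in any single reference.
    """
    cand = Counter(candidate.lower().split())
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for word, n in Counter(ref.lower().split()).items():
            max_ref[word] = max(max_ref[word], n)
    matched = sum(min(n, max_ref[word]) for word, n in cand.items())
    return matched / sum(cand.values())

refs = ["a man plays the guitar", "someone is playing a guitar"]
score = clipped_unigram_precision("a man is playing a guitar", refs)
print(round(score, 3))
```

In this example the candidate has six words and five of them are credited: the second "a" is clipped because no reference contains "a" twice.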
7. REQUIREMENTS
• Central Processing Unit (CPU) — Intel Core i5 6th Gen. processor or higher.
• RAM — 8 GB minimum.
• Graphics Processing Unit (GPU) — NVIDIA GeForce GTX 960 or higher.
• Operating System — Ubuntu, Mac or Microsoft Windows 10.
• Software – A Python IDE with modules such as Keras and TensorFlow.
8. ADVANTAGES
• Exploits temporal information over a longer time range.
• Shortens the path with the capability of adding non-linearity, providing a
better trade-off between efficiency and effectiveness.
• Is able to uncover temporal transitions between frame chunks with different
granularities.
9. LIMITATIONS
• LSTM decoder is prone to overfitting.
• Hence, we need to validate the generalization capability.
• In future work, a softmax classifier can be placed on top of the encoder and
trained with video labels, in place of the LSTM language decoder.
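A softmax classifier on top of an encoded video vector could look like the sketch below. The vector size, the number of labels, and the random weights are hypothetical; in practice the weights would be learned from labelled videos.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify(video_vec, W, b):
    """Map an encoded video vector to a probability distribution over labels."""
    return softmax(W @ video_vec + b)

rng = np.random.default_rng(0)
video_vec = rng.normal(size=6)  # hypothetical encoder output
W = rng.normal(size=(4, 6))     # hypothetical: 4 video labels
b = np.zeros(4)
probs = classify(video_vec, W, b)
print(probs.shape, int(np.argmax(probs)))
```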
10. CONCLUSION
• We take raw video as input and apply a 2D CNN (VGG16) and a 3D CNN (C3D)
to it to extract the object appearance and action features, respectively.
• To get the encoded vector, multiple LSTMs can be stacked using HRNE.
• The decoder is an LSTM that takes the visual features as input and generates a natural
language description of the video.
REFERENCES
[1] N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama.
Generating natural-language video descriptions using text-mined knowledge. In AAAI, July
2013.
[2] Z. Wu, T. Yao, Y. Fu, and Y. Jiang. Deep learning for video classification and captioning. In S.
Chang, editor, Frontiers of Multimedia Research, pages 3–29. ACM Books, 2018.
[3] S. Olivastri, G. Singh, and F. Cuzzolin. End-to-End Video Captioning. In Large Scale Holistic Video
Understanding, ICCVW, 2019.
[4] B. Pan, H. Cai, D.-A. Huang, K.-H. Lee, A. Gaidon, E. Adeli, and J. C. Niebles.
Spatio-Temporal Graph for Video Captioning with Knowledge
Distillation. In Computer Vision and Pattern Recognition (CVPR), 2020.
[5] Lijun Li and Boqing Gong. End-to-end video captioning with multitask reinforcement
learning. arXiv preprint arXiv:1803.07950, 2018.
[6] Yuling Gui, Dan Guo, Ye Zhao. Semantic Enhanced Encoder-Decoder Network (SEN) for
Video Captioning. In MAHCI '19 2019
[7] K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image
Recognition. In International Conference on Learning Representations, 2015
[8] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, Learning Spatiotemporal Features
with 3D Convolutional Networks, ICCV 2015
[9] Nayyer Aafaq, Ajmal Mian, Wei Liu, Syed Zulqarnain Gilani, and Mubarak Shah. Video
Description: A Survey of Methods, Datasets and Evaluation Metrics. In ACM Computing Surveys
(CSUR),2019
[10] Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. Hierarchical Recurrent Neural
Encoder for Video Representation with Application to Captioning. In IEEE Conference on
Computer Vision and Pattern Recognition (CVPR)
THANK YOU

More Related Content

What's hot

Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videosAdria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Codiax
 
Image Captioning Generator using Deep Machine Learning
Image Captioning Generator using Deep Machine LearningImage Captioning Generator using Deep Machine Learning
Image Captioning Generator using Deep Machine Learning
ijtsrd
 
モデルアーキテクチャ観点からの高速化2019
モデルアーキテクチャ観点からの高速化2019モデルアーキテクチャ観点からの高速化2019
モデルアーキテクチャ観点からの高速化2019
Yusuke Uchida
 
AI&BigData Lab. Артем Чернодуб "Распознавание изображений методом Lazy Deep ...
AI&BigData Lab. Артем Чернодуб  "Распознавание изображений методом Lazy Deep ...AI&BigData Lab. Артем Чернодуб  "Распознавание изображений методом Lazy Deep ...
AI&BigData Lab. Артем Чернодуб "Распознавание изображений методом Lazy Deep ...
GeeksLab Odessa
 
B018131117
B018131117B018131117
B018131117
IOSR Journals
 
The road to multi/many core computing
The road to multi/many core computingThe road to multi/many core computing
The road to multi/many core computing
Osvaldo Gervasi
 
Unsupervised Video Anomaly Detection: A brief overview
Unsupervised Video Anomaly Detection: A brief overviewUnsupervised Video Anomaly Detection: A brief overview
Unsupervised Video Anomaly Detection: A brief overview
Ridge-i, Inc.
 
Intro to Deep Learning for Computer Vision
Intro to Deep Learning for Computer VisionIntro to Deep Learning for Computer Vision
Intro to Deep Learning for Computer Vision
Christoph Körner
 
Cisco packettracer overview_20jul09
Cisco packettracer overview_20jul09Cisco packettracer overview_20jul09
Cisco packettracer overview_20jul09
rahmanitayulia
 
Ac02417471753
Ac02417471753Ac02417471753
Ac02417471753
IJMER
 
Neural Architectures for Video Encoding
Neural Architectures for Video EncodingNeural Architectures for Video Encoding
Neural Architectures for Video Encoding
Universitat Politècnica de Catalunya
 
Multilayer bit allocation for video encoding
Multilayer bit allocation for video encodingMultilayer bit allocation for video encoding
Multilayer bit allocation for video encoding
IJMIT JOURNAL
 

What's hot (12)

Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videosAdria Recasens, DeepMind – Multi-modal self-supervised learning from videos
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos
 
Image Captioning Generator using Deep Machine Learning
Image Captioning Generator using Deep Machine LearningImage Captioning Generator using Deep Machine Learning
Image Captioning Generator using Deep Machine Learning
 
モデルアーキテクチャ観点からの高速化2019
モデルアーキテクチャ観点からの高速化2019モデルアーキテクチャ観点からの高速化2019
モデルアーキテクチャ観点からの高速化2019
 
AI&BigData Lab. Артем Чернодуб "Распознавание изображений методом Lazy Deep ...
AI&BigData Lab. Артем Чернодуб  "Распознавание изображений методом Lazy Deep ...AI&BigData Lab. Артем Чернодуб  "Распознавание изображений методом Lazy Deep ...
AI&BigData Lab. Артем Чернодуб "Распознавание изображений методом Lazy Deep ...
 
B018131117
B018131117B018131117
B018131117
 
The road to multi/many core computing
The road to multi/many core computingThe road to multi/many core computing
The road to multi/many core computing
 
Unsupervised Video Anomaly Detection: A brief overview
Unsupervised Video Anomaly Detection: A brief overviewUnsupervised Video Anomaly Detection: A brief overview
Unsupervised Video Anomaly Detection: A brief overview
 
Intro to Deep Learning for Computer Vision
Intro to Deep Learning for Computer VisionIntro to Deep Learning for Computer Vision
Intro to Deep Learning for Computer Vision
 
Cisco packettracer overview_20jul09
Cisco packettracer overview_20jul09Cisco packettracer overview_20jul09
Cisco packettracer overview_20jul09
 
Ac02417471753
Ac02417471753Ac02417471753
Ac02417471753
 
Neural Architectures for Video Encoding
Neural Architectures for Video EncodingNeural Architectures for Video Encoding
Neural Architectures for Video Encoding
 
Multilayer bit allocation for video encoding
Multilayer bit allocation for video encodingMultilayer bit allocation for video encoding
Multilayer bit allocation for video encoding
 

Similar to Video Description using Deep Learning

Parking Surveillance Footage Summarization
Parking Surveillance Footage SummarizationParking Surveillance Footage Summarization
Parking Surveillance Footage Summarization
IRJET Journal
 
Video content analysis and retrieval system using video storytelling and inde...
Video content analysis and retrieval system using video storytelling and inde...Video content analysis and retrieval system using video storytelling and inde...
Video content analysis and retrieval system using video storytelling and inde...
IJECEIAES
 
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
Ijripublishers Ijri
 
SUMMARY GENERATION FOR LECTURING VIDEOS
SUMMARY GENERATION FOR LECTURING VIDEOSSUMMARY GENERATION FOR LECTURING VIDEOS
SUMMARY GENERATION FOR LECTURING VIDEOS
IRJET Journal
 
Key frame extraction methodology for video annotation
Key frame extraction methodology for video annotationKey frame extraction methodology for video annotation
Key frame extraction methodology for video annotation
IAEME Publication
 
Mtech Second progresspresentation ON VIDEO SUMMARIZATION
Mtech Second progresspresentation ON VIDEO SUMMARIZATIONMtech Second progresspresentation ON VIDEO SUMMARIZATION
Mtech Second progresspresentation ON VIDEO SUMMARIZATION
NEERAJ BAGHEL
 
Multimodal video abstraction into a static document using deep learning
Multimodal video abstraction into a static document using deep learning Multimodal video abstraction into a static document using deep learning
Multimodal video abstraction into a static document using deep learning
IJECEIAES
 
PERFORMANCE ANALYSIS OF FINGERPRINTING EXTRACTION ALGORITHM IN VIDEO COPY DET...
PERFORMANCE ANALYSIS OF FINGERPRINTING EXTRACTION ALGORITHM IN VIDEO COPY DET...PERFORMANCE ANALYSIS OF FINGERPRINTING EXTRACTION ALGORITHM IN VIDEO COPY DET...
PERFORMANCE ANALYSIS OF FINGERPRINTING EXTRACTION ALGORITHM IN VIDEO COPY DET...
IJCSEIT Journal
 
Inverted File Based Search Technique for Video Copy Retrieval
Inverted File Based Search Technique for Video Copy RetrievalInverted File Based Search Technique for Video Copy Retrieval
Inverted File Based Search Technique for Video Copy Retrieval
ijcsa
 
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
Ijripublishers Ijri
 
Video Summarization for Sports
Video Summarization for SportsVideo Summarization for Sports
Video Summarization for Sports
IRJET Journal
 
Secure IoT Systems Monitor Framework using Probabilistic Image Encryption
Secure IoT Systems Monitor Framework using Probabilistic Image EncryptionSecure IoT Systems Monitor Framework using Probabilistic Image Encryption
Secure IoT Systems Monitor Framework using Probabilistic Image Encryption
IJAEMSJORNAL
 
Multi-View Video Coding Algorithms/Techniques: A Comprehensive Study
Multi-View Video Coding Algorithms/Techniques: A Comprehensive StudyMulti-View Video Coding Algorithms/Techniques: A Comprehensive Study
Multi-View Video Coding Algorithms/Techniques: A Comprehensive Study
IJERA Editor
 
Video copy detection using segmentation method and
Video copy detection using segmentation method andVideo copy detection using segmentation method and
Video copy detection using segmentation method and
eSAT Publishing House
 
Publications
PublicationsPublications
Video stream analysis in clouds an object detection and classification frame...
Video stream analysis in clouds  an object detection and classification frame...Video stream analysis in clouds  an object detection and classification frame...
Video stream analysis in clouds an object detection and classification frame...
Finalyearprojects Toall
 
Semantic Summarization of videos, Semantic Summarization of videos
Semantic Summarization of videos, Semantic Summarization of videosSemantic Summarization of videos, Semantic Summarization of videos
Semantic Summarization of videos, Semantic Summarization of videos
darsh228313
 
Final PPT.pptx (1).pptx
Final PPT.pptx (1).pptxFinal PPT.pptx (1).pptx
Final PPT.pptx (1).pptx
gopikahari7
 
Mtech Fourth progress presentation
Mtech Fourth progress presentationMtech Fourth progress presentation
Mtech Fourth progress presentation
NEERAJ BAGHEL
 
VIDEO SUMMARIZATION: CORRELATION FOR SUMMARIZATION AND SUBTRACTION FOR RARE E...
VIDEO SUMMARIZATION: CORRELATION FOR SUMMARIZATION AND SUBTRACTION FOR RARE E...VIDEO SUMMARIZATION: CORRELATION FOR SUMMARIZATION AND SUBTRACTION FOR RARE E...
VIDEO SUMMARIZATION: CORRELATION FOR SUMMARIZATION AND SUBTRACTION FOR RARE E...
Journal For Research
 

Similar to Video Description using Deep Learning (20)

Parking Surveillance Footage Summarization
Parking Surveillance Footage SummarizationParking Surveillance Footage Summarization
Parking Surveillance Footage Summarization
 
Video content analysis and retrieval system using video storytelling and inde...
Video content analysis and retrieval system using video storytelling and inde...Video content analysis and retrieval system using video storytelling and inde...
Video content analysis and retrieval system using video storytelling and inde...
 
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
 
SUMMARY GENERATION FOR LECTURING VIDEOS
SUMMARY GENERATION FOR LECTURING VIDEOSSUMMARY GENERATION FOR LECTURING VIDEOS
SUMMARY GENERATION FOR LECTURING VIDEOS
 
Key frame extraction methodology for video annotation
Key frame extraction methodology for video annotationKey frame extraction methodology for video annotation
Key frame extraction methodology for video annotation
 
Mtech Second progresspresentation ON VIDEO SUMMARIZATION
Mtech Second progresspresentation ON VIDEO SUMMARIZATIONMtech Second progresspresentation ON VIDEO SUMMARIZATION
Mtech Second progresspresentation ON VIDEO SUMMARIZATION
 
Multimodal video abstraction into a static document using deep learning
Multimodal video abstraction into a static document using deep learning Multimodal video abstraction into a static document using deep learning
Multimodal video abstraction into a static document using deep learning
 
PERFORMANCE ANALYSIS OF FINGERPRINTING EXTRACTION ALGORITHM IN VIDEO COPY DET...
PERFORMANCE ANALYSIS OF FINGERPRINTING EXTRACTION ALGORITHM IN VIDEO COPY DET...PERFORMANCE ANALYSIS OF FINGERPRINTING EXTRACTION ALGORITHM IN VIDEO COPY DET...
PERFORMANCE ANALYSIS OF FINGERPRINTING EXTRACTION ALGORITHM IN VIDEO COPY DET...
 
Inverted File Based Search Technique for Video Copy Retrieval
Inverted File Based Search Technique for Video Copy RetrievalInverted File Based Search Technique for Video Copy Retrieval
Inverted File Based Search Technique for Video Copy Retrieval
 
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...
 
Video Summarization for Sports
Video Summarization for SportsVideo Summarization for Sports
Video Summarization for Sports
 
Secure IoT Systems Monitor Framework using Probabilistic Image Encryption
Secure IoT Systems Monitor Framework using Probabilistic Image EncryptionSecure IoT Systems Monitor Framework using Probabilistic Image Encryption
Secure IoT Systems Monitor Framework using Probabilistic Image Encryption
 
Multi-View Video Coding Algorithms/Techniques: A Comprehensive Study
Multi-View Video Coding Algorithms/Techniques: A Comprehensive StudyMulti-View Video Coding Algorithms/Techniques: A Comprehensive Study
Multi-View Video Coding Algorithms/Techniques: A Comprehensive Study
 
Video copy detection using segmentation method and
Video copy detection using segmentation method andVideo copy detection using segmentation method and
Video copy detection using segmentation method and
 
Publications
PublicationsPublications
Publications
 
Video stream analysis in clouds an object detection and classification frame...
Video stream analysis in clouds  an object detection and classification frame...Video stream analysis in clouds  an object detection and classification frame...
Video stream analysis in clouds an object detection and classification frame...
 
Semantic Summarization of videos, Semantic Summarization of videos
Semantic Summarization of videos, Semantic Summarization of videosSemantic Summarization of videos, Semantic Summarization of videos
Semantic Summarization of videos, Semantic Summarization of videos
 
Final PPT.pptx (1).pptx
Final PPT.pptx (1).pptxFinal PPT.pptx (1).pptx
Final PPT.pptx (1).pptx
 
Mtech Fourth progress presentation
Mtech Fourth progress presentationMtech Fourth progress presentation
Mtech Fourth progress presentation
 
VIDEO SUMMARIZATION: CORRELATION FOR SUMMARIZATION AND SUBTRACTION FOR RARE E...
VIDEO SUMMARIZATION: CORRELATION FOR SUMMARIZATION AND SUBTRACTION FOR RARE E...VIDEO SUMMARIZATION: CORRELATION FOR SUMMARIZATION AND SUBTRACTION FOR RARE E...
VIDEO SUMMARIZATION: CORRELATION FOR SUMMARIZATION AND SUBTRACTION FOR RARE E...
 

Recently uploaded

Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
Remote DBA Services
 
How to write a program in any programming language
How to write a program in any programming languageHow to write a program in any programming language
How to write a program in any programming language
Rakesh Kumar R
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
mz5nrf0n
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
lorraineandreiamcidl
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
Ayan Halder
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
Green Software Development
 
Artificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension FunctionsArtificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension Functions
Octavian Nadolu
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
Aftab Hussain
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Crescat
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
Łukasz Chruściel
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata
 
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppAI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
Google
 
SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024
Hironori Washizaki
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
Neo4j
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
Fermin Galan
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
Webinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for EmbeddedWebinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for Embedded
ICS
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
Drona Infotech
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
Rakesh Kumar R
 

Recently uploaded (20)

Oracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptxOracle Database 19c New Features for DBAs and Developers.pptx
Oracle Database 19c New Features for DBAs and Developers.pptx
 
How to write a program in any programming language
How to write a program in any programming languageHow to write a program in any programming language
How to write a program in any programming language
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
 
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOMLORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
LORRAINE ANDREI_LEQUIGAN_HOW TO USE ZOOM
 
Using Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional SafetyUsing Xen Hypervisor for Functional Safety
Using Xen Hypervisor for Functional Safety
 
Energy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina JonuziEnergy consumption of Database Management - Florina Jonuzi
Energy consumption of Database Management - Florina Jonuzi
 
Artificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension FunctionsArtificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension Functions
 
Graspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code AnalysisGraspan: A Big Data System for Big Code Analysis
Graspan: A Big Data System for Big Code Analysis
 
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
Introducing Crescat - Event Management Software for Venues, Festivals and Eve...
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
 
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf2024 eCommerceDays Toulouse - Sylius 2.0.pdf
2024 eCommerceDays Toulouse - Sylius 2.0.pdf
 
OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024OpenMetadata Community Meeting - 5th June 2024
OpenMetadata Community Meeting - 5th June 2024
 
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI AppAI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
AI Fusion Buddy Review: Brand New, Groundbreaking Gemini-Powered AI App
 
SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024SWEBOK and Education at FUSE Okinawa 2024
SWEBOK and Education at FUSE Okinawa 2024
 
GraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph TechnologyGraphSummit Paris - The art of the possible with Graph Technology
GraphSummit Paris - The art of the possible with Graph Technology
 
Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604Orion Context Broker introduction 20240604
Orion Context Broker introduction 20240604
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Webinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for EmbeddedWebinar On-Demand: Using Flutter for Embedded
Webinar On-Demand: Using Flutter for Embedded
 
Mobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona InfotechMobile App Development Company In Noida | Drona Infotech
Mobile App Development Company In Noida | Drona Infotech
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
 

Video Description using Deep Learning

  • 1. VIDEO DESCRIPTION USING DEEP LEARNING By : Pranjal Mahajan Mentor : Pranjali Deshpande
  • 2. AGENDA 1. PROBLEM STATEMENT 2. INTRODUCTION 3. LITERATURE SURVEY 4. MOTIVATION 5. BACKGROUND 6. SYSTEM DESCRIPTION 7. REQUIREMENTS 8. ADVANTAGES 9. LIMITATIONS 10. CONCLUSION
  • 3. 1. PROBLEM STATEMENT • To identify the contents of a video and describe in natural language.
  • 4. 2. INTRODUCTION • A machine can efficiently perform image classification, object recognition, and video segmentation. • The tasks like video description are a challenge. • Video description has applications in 1. human-robot interaction, 2. helping the visually impaired, 3. video retrieval by content.
  • 5. 3. LITERATURE SURVEY Publication Methodology andTechniques Remarks IEEECVPR, 2020 Video is condensed into a spatio-temporal graph network, which serves as the object branch.This interaction information is distilled into another scene branch via the object-aware knowledge distillation mechanism. Takes into consideration interaction information.Can shortcut the classification problem using background. ICCVW, 2019 Two stage training setting to optimise both encoder and decoder simultaneously.The architecture is initialized using pre-trained encoders and decoders.Then the most relevant features for video description generation are learnt. Vocabulary is large. Is computationally expensive.
  • 6. Publication Methodology andTechniques Remarks arXiv, 2018 A self-critical REINFORCE algorithm is used to get better weights for the LSTMs and train the LSTMS. Then, we jointly tune the full model in this step, freeing the weights of the CNNs. Can generate complex sentences. Challenging to train such a big model. ACM Books, 2018 Encoder-Decoder framework in which uses encoder (CNN) to extract visual features from raw video frames and decoder (RNN/LSTM) to get the desired output sentence. Easy to train. Limited to small vocabulary. AAAI, 2013 Template based approach in which SVO triplets are identified using a combination of visual object and activity detectors. followed by search based optimization to get their best combination. Simplest approach. Generated sentences are simple.
  • 7. 4. MOTIVATION • Spatial, temporal and attribute-based attention models 1. are inefficient at exploiting video temporal structure over longer ranges, 2. require heavy computation. • The Hierarchical Recurrent Neural Encoder model is able to overcome these challenges.
  • 8. 5. BACKGROUND 5.1 CONVOLUTIONAL NEURAL NETWORK (CNN)
  • 9. 5.2 LONG SHORT-TERM MEMORY (LSTM) The LSTM is an RNN and has three gates: • input gate (i) • forget gate (f) • output gate (o)
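A minimal sketch of one LSTM time step with the three gates listed above, in pure Python. The candidate cell update `g`, the `params` layout, and the helper names are assumptions made for illustration; they follow the common textbook formulation and are not necessarily the exact variant used in this project.

```python
import math


def sigmoid(x):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, b):
    """Return W @ x + b, where W is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias
            for row, bias in zip(W, b)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step.

    params maps "i", "f", "o", "g" to a (W, b) pair acting on the
    concatenation [x, h_prev] (hypothetical layout for this sketch).
    """
    z = x + h_prev  # list concatenation of input and previous hidden state
    i = [sigmoid(v) for v in affine(*params["i"][:1], z, params["i"][1])]
    f = [sigmoid(v) for v in affine(*params["f"][:1], z, params["f"][1])]
    o = [sigmoid(v) for v in affine(*params["o"][:1], z, params["o"][1])]
    g = [math.tanh(v) for v in affine(*params["g"][:1], z, params["g"][1])]
    # Forget gate scales the old cell state; input gate scales the candidate.
    c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c_prev, i, g)]
    # Output gate decides how much of the cell state is exposed as h.
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c
```

With all weights and biases set to zero, every gate evaluates to 0.5 and the candidate to 0, so the step simply halves the previous cell state. That makes a convenient sanity check.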
  • 10. 6. SYSTEM DESCRIPTION • Input: a video (in .npy format). • Expected output: a natural language description of the input video.
  • 12. 6.1.1 ENCODER • The encoder extracts visual features from the raw video frames and condenses them into a fixed-dimension vector (he) that represents the entire sequence. • The video feature pool consists of 1. Object appearance features, extracted using VGG16 pretrained on the ImageNet dataset. 2. Action features, extracted using C3D pretrained on an activity recognition dataset. 6.1.2 DECODER • The decoder takes that vector as its initial state and feeds it to a BLSTM to generate the desired output sentence.
  • 13. 6.2 HIERARCHICAL RECURRENT NEURAL ENCODER (HRNE) • The first LSTM layer explores the local temporal structure within each subsequence (chunk of frames). • The second LSTM layer learns the temporal dependencies among subsequences. • A more complex HRNE model could add more layers to build multiple time-scale abstractions of the visual information.
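The two-layer idea can be sketched as follows. This is an illustrative outline only: a plain tanh recurrence with fixed scalar weights stands in for each LSTM layer to keep the example short, and the names `rnn_encode`, `hierarchical_encode`, and `chunk_len` are hypothetical, not taken from the slides.

```python
import math


def rnn_encode(seq, w_in=0.5, w_rec=0.5):
    """Run a simple tanh recurrence over seq and return the final state.

    Stand-in for an LSTM layer (assumption made for brevity).
    """
    h = [0.0] * len(seq[0])
    for x in seq:
        h = [math.tanh(w_in * xv + w_rec * hv) for xv, hv in zip(x, h)]
    return h


def hierarchical_encode(frames, chunk_len):
    """Encode per-frame feature vectors with a two-level hierarchy."""
    # Split the frame sequence into consecutive chunks.
    chunks = [frames[i:i + chunk_len]
              for i in range(0, len(frames), chunk_len)]
    # Layer 1: summarise the local temporal structure inside each chunk.
    chunk_states = [rnn_encode(chunk) for chunk in chunks]
    # Layer 2: model the dependencies among the chunk summaries.
    return rnn_encode(chunk_states)


if __name__ == "__main__":
    frames = [[0.1 * t, -0.1 * t] for t in range(8)]  # toy 2-d features
    print(hierarchical_encode(frames, chunk_len=4))
```

With 8 frames and chunks of 4, information from the first frame passes through 4 + 2 = 6 recurrent steps instead of the 8 a single flat recurrence would need. The saving grows with longer videos.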
  • 15. 6.3 DATASET • The MSR-VTT dataset is used for training and testing. • In its current version, MSR-VTT provides 10K web video clips totaling 41.2 hours, with 200K clip-sentence pairs in total.
  • 16. 6.4 EVALUATION METRICS • High metric scores indicate that the generated sentence agrees well with human judgment.
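As a quick back-of-the-envelope check on the dataset figures above, the snippet below derives the average number of sentences per clip and the average clip length. It assumes "10K" and "200K" mean exactly 10,000 and 200,000, which is a rounding assumption.

```python
clips = 10_000   # "10K" web video clips (assumed exact)
pairs = 200_000  # "200K" clip-sentence pairs (assumed exact)
hours = 41.2     # total duration in hours

sentences_per_clip = pairs / clips
avg_clip_seconds = hours * 3600 / clips

print(f"{sentences_per_clip:.0f} sentences per clip")
print(f"{avg_clip_seconds:.1f} s average clip length")
```

Under these assumptions each clip has 20 reference sentences and lasts a little under 15 seconds on average.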
  • 17. 7. REQUIREMENTS • Central Processing Unit (CPU): Intel Core i5 6th Gen. processor or higher. • RAM: 8 GB minimum. • Graphics Processing Unit (GPU): NVIDIA GeForce GTX 960 or higher. • Operating System: Ubuntu, macOS or Microsoft Windows 10. • Software: Python IDE with modules such as Keras and TensorFlow.
  • 18. 8. ADVANTAGES • Exploits temporal information over a longer time range. • Shortens the path that information travels through the recurrent layers while still adding non-linearity, providing a better trade-off between efficiency and effectiveness. • Is able to uncover temporal transitions between frame chunks at different granularities.
  • 19. 9. LIMITATIONS • The LSTM decoder is prone to overfitting. • Hence, we need to validate its generalization capability. • In future work, a softmax classifier could be placed on top of the encoder and trained with video labels, replacing the LSTM language decoder.
  • 20. 10. CONCLUSION • We take raw video as input and apply a 2D CNN (VGG16) and a 3D CNN (C3D) to it to extract the object appearance and action features, respectively. • To get the encoded vector, multiple LSTM layers can be stacked using HRNE. • The decoder is an LSTM that takes the visual features as input and generates a natural language description of the video.
  • 21. REFERENCES
    [1] N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama. Generating natural-language video descriptions using text-mined knowledge. In AAAI, July 2013.
    [2] Z. Wu, T. Yao, Y. Fu, and Y. Jiang. Deep learning for video classification and captioning. In S. Chang, editor, Frontiers of Multimedia Research, pages 3–29. ACM Books, 2018.
    [3] S. Olivastri, G. Singh, and F. Cuzzolin. End-to-End Video Captioning. In Large Scale Holistic Video Understanding, ICCVW, 2019.
    [4] Boxiao Pan, Haoye Cai, De-An Huang, Kuan-Hui Lee, Adrien Gaidon, Ehsan Adeli, and Juan Carlos Niebles. Spatio-Temporal Graph for Video Captioning with Knowledge Distillation. In Computer Vision and Pattern Recognition (CVPR), 2020.
    [5] Lijun Li and Boqing Gong. End-to-end video captioning with multitask reinforcement learning. arXiv preprint arXiv:1803.07950, 2018.
  • 22.
    [6] Yuling Gui, Dan Guo, and Ye Zhao. Semantic Enhanced Encoder-Decoder Network (SEN) for Video Captioning. In MAHCI '19, 2019.
    [7] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations, 2015.
    [8] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning Spatiotemporal Features with 3D Convolutional Networks. In ICCV, 2015.
    [9] Nayyer Aafaq, Ajmal Mian, Wei Liu, Syed Zulqarnain Gilani, and Mubarak Shah. Video Description: A Survey of Methods, Datasets and Evaluation Metrics. In ACM Computing Surveys (CSUR), 2019.
    [10] Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).