Generating natural language descriptions from video using CNN (Convolutional Neural
Network) and LSTM (Long Short-Term Memory) layers stacked into one HRNE (Hierarchical Recurrent Neural Encoder) model.
Attention boosted deep networks for video classification (Danbi Cho)
The presentation explains how attention is integrated with CNN and LSTM models.
The paper carries out the video classification task using attention-augmented CNN-LSTM models.
(9th April 2021)
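As a rough illustration of the attention idea in the entry above, the sketch below pools per-frame feature vectors into one clip representation by softmax-weighting them against a query vector. All names, dimensions, and values are toy assumptions, not taken from the paper; a real model would learn the query and feed the pooled vector to a classifier.

```python
import math

def attention_pool(frame_features, query):
    """Soft attention over per-frame feature vectors (illustrative sketch).

    Scores each frame by dot product with a query vector, softmax-normalizes
    the scores, and returns the weighted sum of the frame features."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in frame_features]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frame_features[0])
    pooled = [sum(w * feat[i] for w, feat in zip(weights, frame_features))
              for i in range(dim)]
    return pooled, weights

# Toy example: three "frames" with 2-D features; the query matches frame 1.
frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
pooled, weights = attention_pool(frames, query=[0.0, 5.0])
```

The pooled vector is dominated by the frame the query attends to, which is the property that lets attention emphasize informative frames for classification.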
A survey on deep learning based approaches for action and gesture recognition... (Danbi Cho)
The presentation surveys the methodologies for action and gesture recognition tasks with deep learning models and feature engineering methods.
(6th April 2021)
Presenter: Eun-sol Kim (PhD candidate, Seoul National University)
Date: June 2017
She has been enrolled in the combined master's/doctoral program in Computer Science and Engineering at Seoul National University since September 2010, and was selected as a Young Woman Scientist in June 2014.
Overview:
This talk introduces a machine learning engine that lets a human and a machine watch content together and ask and answer questions about the content in natural language.
Based on hierarchical multimodal recurrent neural network techniques, it sequentially combines the image, subtitle (text), and audio information contained in the content to build a multimodal episodic memory, and it selects the memory needed for a given question in order to extract an answer.
It also introduces a method that incorporates reinforcement learning ideas to learn long-term sequences efficiently when building multimodal memory with a recurrent neural network.
The document discusses fuzzy logic in watermarking systems. It provides an overview of watermarking, encryption, fuzzy logic, and fuzzification/defuzzification. It then discusses how fuzzy sets can be used to design the front-end of a watermarking system, with fuzzy sets defined for factors like robustness, encryption algorithms, and watermarking methods. The front-end architecture is designed based on fuzzy rules to provide a more user-friendly interface. In conclusion, this approach can make software architecture design easier by incorporating user-defined fuzzy sets and rules.
Cognitive ability of human brain and soft computing techniques (Dr G R Sinha)
The document discusses cognitive ability of the human brain and soft computing techniques. It begins with providing facts about the brain like the number of neurons and growth rate of neurons. It then discusses cognitive ability development through activities, memory, and experience. Soft computing techniques like neural networks, fuzzy logic, and genetic algorithms are presented as ways to understand cognition through applied neuroscience. Deep learning and convolutional neural networks are specifically highlighted as machine learning approaches for pattern recognition and classification.
The document provides an overview of steganography, discussing various techniques for hiding secret messages in digital files like images, audio, and video. It describes methods for embedding data in the frequency and spatial domains of digital files. The paper also compares steganography to encryption and digital watermarking, outlining the goals and measures of different steganography techniques.
The document discusses predicting the short-term and long-term memorability of videos using machine learning models. It extracts features like C3D, HMP, captions and InceptionV3 from videos and uses models like SVR, XGBoost and Bayesian Ridge Regression. It finds that combining features gives better memorability prediction scores than individual features. A neural network further improves scores over traditional models when trained on a combination of all features. Future work could explore other video features and deep learning models.
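The entry above reports that combining features beats any single feature for memorability prediction. As a minimal stand-in for that late-fusion idea (the feature values, labels, and the use of simple univariate least squares are all toy assumptions, not the paper's actual SVR/XGBoost setup), the sketch below fits one regressor per feature and averages their predictions:

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return slope, my - slope * mx

def mae(preds, ys):
    """Mean absolute error between predictions and labels."""
    return sum(abs(p - y) for p, y in zip(preds, ys)) / len(ys)

# Hypothetical per-video scores from two feature types (say, a motion feature
# and a caption feature) plus ground-truth memorability labels.
f1 = [0.2, 0.4, 0.6, 0.8]
f2 = [0.9, 0.1, 0.7, 0.3]
y  = [0.55, 0.25, 0.65, 0.55]          # depends on both features

models = [fit_line(f, y) for f in (f1, f2)]
preds  = [[a * x + b for x in f] for (a, b), f in zip(models, (f1, f2))]
fused  = [(p1 + p2) / 2 for p1, p2 in zip(*preds)]
```

Averaging per-point predictions guarantees the fused absolute error never exceeds the worse individual model's, which is one reason simple late fusion is a common baseline.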
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videos (Codiax)
The document summarizes a talk on multi-modal self-supervised learning from videos. It discusses using multiple modalities like vision, audio and language from videos for self-supervised learning. It presents two models: 1) A Multi-Modal Versatile network that can take any modality as input and respects the specificity of each while enabling comparison. 2) BraVe which learns representations by regressing a broad representation of the whole video from a narrow view to leverage different augmentations and modalities. Both models achieve state-of-the-art results on downstream tasks, showing videos provide rich self-supervision and using additional context improves representation learning.
Image Captioning Generator using Deep Machine Learning (ijtsrd)
Technology's scope has evolved into one of the most powerful tools for human development in a variety of fields. AI and machine learning have become among the most powerful tools for completing tasks quickly and accurately without the need for human intervention. This project demonstrates how deep machine learning can be used to create a caption or sentence for a given picture. This can be used for visually impaired persons, for self-identification in automobiles, and for various applications that need quick and easy verification. A Convolutional Neural Network (CNN) is used to extract image features, and a Long Short-Term Memory (LSTM) network is used to organize them into correct, meaningful sentences. The Flickr8k and Flickr30k datasets were used for training. Sreejith S P | Vijayakumar A, "Image Captioning Generator using Deep Machine Learning", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume 5, Issue 4, June 2021. URL: https://www.ijtsrd.com/papers/ijtsrd42344.pdf Paper URL: https://www.ijtsrd.com/computer-science/artificial-intelligence/42344/image-captioning-generator-using-deep-machine-learning/sreejith-s-p
AI&BigData Lab. Artem Chernodub, "Image Recognition Using the Lazy Deep ..." (GeeksLab Odessa)
23.05.15, Odessa, Impact Hub Odessa. AI&BigData Lab conference
Artem Chernodub (Computer Vision Team, ZZ Wolf)
"Image Recognition Using the Lazy Deep Learning Method in the ZZ Photo Photo Organizer"
The talk addresses the problem of image recognition with computer vision methods. It gives a brief overview of the existing subtasks in this area (object detection, scene classification, associative search in image databases, face recognition, and others) and of modern methods for solving them, with an emphasis on deep learning.
More details:
http://geekslab.co/
https://www.facebook.com/GeeksLab.co
https://www.youtube.com/user/GeeksLabVideo
This document provides a survey of video steganography techniques. It begins with definitions and comparisons of steganography, cryptography and watermarking. Video steganography hides secret information by embedding it in video files. Various video steganography techniques are explored, including spatial domain and transform domain methods. Spatial domain methods embed in pixel values directly while transform methods operate in compressed domains. The document evaluates and analyzes different video steganography methods and their imperceptibility, payload, security and computational costs.
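The spatial-domain embedding the survey above mentions is commonly illustrated with least-significant-bit (LSB) substitution. The sketch below is a minimal, assumed example (the 16-bit length header and the flat pixel list are my conventions, not any particular scheme from the survey): each payload bit replaces the LSB of one cover byte, so no pixel value changes by more than 1.

```python
def embed(pixels, message):
    """Embed a UTF-8 message into pixel LSBs; a 16-bit length header
    precedes the payload (spatial-domain steganography sketch)."""
    data = bytes(message, "utf-8")
    bits = [(len(data) >> (15 - i)) & 1 for i in range(16)]
    for byte in data:
        bits += [(byte >> (7 - i)) & 1 for i in range(8)]
    if len(bits) > len(pixels):
        raise ValueError("cover too small for payload")
    out = list(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & ~1) | bit      # overwrite only the lowest bit
    return out

def extract(pixels):
    """Recover the embedded message from the LSBs."""
    bits = [p & 1 for p in pixels]
    length = int("".join(map(str, bits[:16])), 2)
    data = bytearray()
    for i in range(length):
        chunk = bits[16 + 8 * i: 24 + 8 * i]
        data.append(int("".join(map(str, chunk)), 2))
    return data.decode("utf-8")

cover = list(range(256)) * 4              # stand-in for frame pixel bytes
stego = embed(cover, "hidden")
```

The at-most-1 change per byte is what gives LSB methods their imperceptibility, at the cost of fragility against recompression, which is why transform-domain methods exist.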
This document summarizes a presentation given in Kangwondo, South Korea in December 2012 about heterogeneous computing and GPU computing. The presentation discussed the evolution of CPUs and GPUs, general purpose GPU computing, OpenCL as a programming standard, and a case study implementing the AES encryption algorithm on a GPU. Performance tests showed the GPU implementations were over an order of magnitude faster than a CPU implementation for AES.
Unsupervised Video Anomaly Detection: A brief overview (Ridge-i, Inc.)
The presentation provided an overview of unsupervised video anomaly detection techniques. It discussed benchmark datasets, approaches like convolutional LSTM autoencoders, memory-augmented autoencoders, and memory-augmented convolutional autoencoders. Experiments on these approaches were conducted on standard datasets and a new oven dataset. The memory-augmented convolutional autoencoder achieved the best performance, being more sensitive to anomalies while maintaining fast inference speed. The presentation concluded with recommendations on the discussed techniques.
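The memory-augmented models mentioned above constrain reconstructions to combinations of stored prototypes of normal data, so anomalies score high. As a heavily simplified, assumed stand-in for that idea (no autoencoder here, just a nearest-prototype distance over toy 2-D feature vectors):

```python
def anomaly_score(patch, memory):
    """Distance from a feature vector to its nearest memory item; a minimal
    stand-in for memory-augmented reconstruction error (illustrative only)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(dist(patch, item) for item in memory)

# Hypothetical memory bank built from normal training patches.
memory = [[0.0, 0.0], [1.0, 1.0]]
normal_score   = anomaly_score([0.1, 0.0], memory)   # near a prototype
abnormal_score = anomaly_score([5.0, 5.0], memory)   # far from all prototypes
```

Thresholding this score separates normal from anomalous inputs, mirroring how the reconstruction error of a memory-augmented autoencoder is thresholded at inference time.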
Cisco Packet Tracer is a network simulation software that allows users to design, configure and test networks virtually. It provides benefits for both instructors and students by making networking concepts easier to teach and learn. Packet Tracer's key features include simulation of network devices and protocols, visualization of network traffic, and multi-user collaboration. The software supports Cisco Networking Academy curricula and helps develop students' problem solving and critical thinking skills.
This document summarizes a research paper that proposes a method to enhance security in a video copy detection system using content-based fingerprinting. The paper discusses how existing video fingerprinting systems are not robust against content-changing attacks like changing the background of a video. To address this, the paper proposes using an interest point matching algorithm to extract fingerprints. The interest point matching algorithm detects interest points in video frames using the Harris corner detection method. It then constructs correspondences between interest points to form fingerprints. The fingerprints extracted with this method are claimed to be more robust against content-changing attacks compared to existing fingerprinting methods. The proposed algorithm is tested on videos with distortions and is found to have high detection rates and low false positive rates.
These slides summarize the main trends in deep neural networks for video encoding, including single-frame models, spatiotemporal convolutions, long-term sequence modeling with RNNs, and their combination with optical flow.
Multilayer bit allocation for video encoding (IJMIT JOURNAL)
A video compression approach removes spatial and temporal redundancy based on the statistical correlation of the signal. The bit allocation technique adopts a visual distortion model for better rate-distortion video coding. The visual distortion model uses both motion and texture structures in the video sequences. Existing video coding mechanisms reduce the bit rate for video coding; however, to obtain a better compression ratio, a multilayer compression technique is needed. In this paper we propose a multilayer bit allocation video coding mechanism. The proposed model reduces the bit allocation for video coding while retaining the same video quality. Experiments with the proposed model reduced the bit rate by 3% to 4%. The results are promising. We conclude with conclusions and future work.
The document summarizes a research paper that proposes a method to summarize parking surveillance footage. The method first pre-processes the raw footage to extract only frames containing vehicles. These frames are then classified using a CNN model to detect vehicles and recognize license plates. The classified objects and license plate numbers are used to generate a textual summary of the vehicles in the footage, making it easier for users to review large amounts of surveillance video. The paper discusses related work on video summarization techniques and provides details of the proposed methodology, which includes preprocessing footage, extracting features from frames containing vehicles, using CNNs for object detection and license plate recognition, and generating a summarized video and text report.
Video content analysis and retrieval system using video storytelling and inde... (IJECEIAES)
Videos are often used for communicating ideas, concepts, experiences, and situations, because of the significant advances made in video communication technology, and social media platforms have expanded video usage rapidly. At present, a video is recognized using metadata such as its title, description, and thumbnails. There are situations where a searcher requires only a video clip on a specific topic from a long video. This paper proposes a novel methodology for analyzing video content and using video storytelling and indexing techniques to retrieve the intended video clip from a long-duration video. The video storytelling technique is used for video content analysis and to produce a description of the video. The description thus created is used to prepare an index with the wormhole algorithm, guaranteeing the search of a keyword of definite length L within the minimum worst-case time. This video index can be used by a video search algorithm to retrieve the relevant part of the video based on the frequency of the word in the keyword search of the video index. Instead of downloading and transferring a whole video, the user can download or transfer only the specifically needed video clip, which considerably eases the network constraints associated with transferring videos.
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil... (Ijripublishers Ijri)
This document discusses techniques for improving video compression efficiency for surveillance videos. It proposes modifying the architecture of scalable video coding to make it surveillance-centric by allowing adaptive rate-distortion optimization at the GOP level based on whether events of interest are present. Experimental results show foreground detection and updating of background adaptively over time to improve compression. Future work includes further enhancing selective motion estimation techniques to improve processing efficiency without degrading video quality.
SUMMARY GENERATION FOR LECTURING VIDEOS (IRJET Journal)
The document discusses generating summaries for lecturing videos. It proposes a method that uses optical character recognition to extract textual information from each video frame, detects changes in text between frames to find scene changes, and combines key frames to create a highlight summary. The proposed system was tested on PowerPoint-based lecture videos. Future work could focus on expanding it to handle chalkboard-style lectures through improved machine learning models. The goal is to automatically generate concise summaries that extract the most important concepts from long videos to help students learn more efficiently.
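The scene-change step above (detecting when on-screen text changes between frames) can be sketched with a simple text-similarity test. Everything below is an assumed simplification: the OCR stage is replaced by given per-frame strings, and the similarity threshold is illustrative, not the paper's.

```python
import difflib

def scene_changes(frame_texts, threshold=0.5):
    """Mark frames whose OCR text differs enough from the previous frame.

    frame_texts stands in for per-frame OCR output; a frame is kept as a
    key frame when its text similarity to the previous frame drops below
    the threshold (i.e. a new slide has appeared)."""
    keys = [0]                                  # first frame is always kept
    for i in range(1, len(frame_texts)):
        sim = difflib.SequenceMatcher(
            None, frame_texts[i - 1], frame_texts[i]).ratio()
        if sim < threshold:
            keys.append(i)
    return keys

texts = ["Intro to sorting", "Intro to sorting",
         "Quicksort: partition step", "Quicksort: partition step",
         "Complexity analysis"]
keyframes = scene_changes(texts)
```

Identical consecutive frames (similarity 1.0) are never duplicated in the key-frame list, which is what keeps the resulting summary concise.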
Key frame extraction methodology for video annotation (IAEME Publication)
This document summarizes a research paper that proposes a key frame extraction methodology to facilitate video annotation. The methodology uses edge difference between consecutive video frames to determine if the content has significantly changed. Frames where the edge difference exceeds a threshold are selected as key frames. The algorithm calculates edge differences for all frame pairs in a video. It then computes statistics like mean and standard deviation to determine a threshold. Frames with differences above this threshold are extracted as key frames. The key frames extracted represent important content changes in the video. Extracting key frames reduces processing requirements for video annotation compared to analyzing all frames. The methodology was tested on videos from domains like transportation and performed well at selecting representative frames.
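The selection rule described above (edge differences thresholded at mean plus standard deviation) can be sketched end-to-end on toy data. The gradient-based edge map, the 8x8 synthetic frames, and the gradient threshold below are all assumptions for illustration, not the paper's exact operators.

```python
import statistics

def edge_map(frame, grad_threshold=20):
    """Binary edge map from horizontal and vertical intensity differences
    (a simple stand-in for a real edge detector)."""
    h, w = len(frame), len(frame[0])
    edges = [[0] * w for _ in range(h)]
    for r in range(h - 1):
        for c in range(w - 1):
            g = (abs(frame[r][c] - frame[r][c + 1])
                 + abs(frame[r][c] - frame[r + 1][c]))
            edges[r][c] = 1 if g > grad_threshold else 0
    return edges

def edge_difference(f1, f2):
    """Number of pixels whose edge label differs between two frames."""
    return sum(a != b
               for row1, row2 in zip(edge_map(f1), edge_map(f2))
               for a, b in zip(row1, row2))

def key_frames(frames):
    """Select frames whose edge difference from the previous frame exceeds
    mean + standard deviation of all consecutive differences."""
    diffs = [edge_difference(frames[i], frames[i + 1])
             for i in range(len(frames) - 1)]
    cut = statistics.mean(diffs) + statistics.pstdev(diffs)
    return [i + 1 for i, d in enumerate(diffs) if d > cut]

# Toy sequence: two flat frames, then a frame where a bright block appears.
flat = [[0] * 8 for _ in range(8)]
block = [[100 if 2 <= r <= 5 and 2 <= c <= 5 else 0 for c in range(8)]
         for r in range(8)]
keys = key_frames([flat, flat, block, block])
```

Only the frame where the block first appears exceeds the adaptive threshold, so the summary keeps one frame per content change rather than every frame.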
Mtech Second progress presentation ON VIDEO SUMMARIZATION (NEERAJ BAGHEL)
This document presents a second progress report on video summarization research. It provides an outline of topics covered, including an introduction to video summarization, a literature review summarizing 5 papers on the topic, identified research gaps, challenges, the problem statement of finding key frames based on extracted text, overview of relevant datasets and tools used, and conclusions. The literature review analyzes the objectives, methods, strengths and limitations of the summarized papers.
Multimodal video abstraction into a static document using deep learning (IJECEIAES)
Abstraction is a strategy that gives the essential points of a document in a short period of time. The video abstraction approach proposed in this research is based on multimodal video data, which comprises both audio and visual data. The major procedures for summarizing video events into a static document are segmenting the input video into scenes and obtaining a textual and visual summary for each scene. To recognize shot and scene boundaries in a video sequence, a hybrid-features method was employed, which improves shot detection performance by selecting strong and flexible features. The most informative keyframes from each scene are then incorporated into the visual summary. A hybrid deep learning model was used for abstractive text summarization. The testing videos came from the BBC archive, comprising BBC Learning English and BBC News, and a news summary dataset was used to train the deep model. The performance of the proposed approaches was assessed using metrics like ROUGE for the textual summary, which achieved a score of 40.49%, while the precision, recall, and F-score used for the visual summary reached 94.9%, outperforming the other methods according to the findings of the experiments.
PERFORMANCE ANALYSIS OF FINGERPRINTING EXTRACTION ALGORITHM IN VIDEO COPY DET... (IJCSEIT Journal)
A video fingerprint is a recognizer derived from a piece of video content. Video fingerprinting methods obtain unique features of a video that differentiate one video clip from another. The aim is to identify whether a query video segment is a copy of a video from the video database, based on the signature of the video. It is difficult to determine whether a video is a copy or merely a similar video, since the features of the content are very similar from one video to another. The main focus of this paper is to detect that the query video is present in the video database, with robustness depending on the content of the video, and also by fast search of fingerprints. The Fingerprint Extraction Algorithm and Fast Search Algorithms are adopted in this paper to achieve robust, fast, efficient, and accurate video copy detection. As a first step, the Fingerprint Extraction algorithm is employed, which extracts a fingerprint from features of the image content of the video. The images are represented as Temporally Informative Representative Images (TIRI). The second step is to find the presence of a copy of a query video in the video database, in which a close match of its fingerprint in the corresponding fingerprint database is searched using an inverted-file-based method. The proposed system is tested against various attacks like noise, brightness, contrast, rotation, and frame drop. On average, the proposed system shows a high true positive rate of 98% and a low false positive rate of 1.3% across different attacks.
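The TIRI representation mentioned above collapses a short run of frames into one still image via a weighted temporal average, so motion leaves a trace that a per-image fingerprint can capture. The sketch below uses tiny 2x2 frames and an illustrative weighting constant (both assumed, not the paper's parameters), then binarizes the result into fingerprint bits:

```python
def tiri(frames, gamma=0.65):
    """Temporally Informative Representative Image: exponentially weighted
    average of a short run of frames (weights gamma**k are illustrative)."""
    h, w = len(frames[0]), len(frames[0][0])
    weights = [gamma ** k for k in range(len(frames))]
    total = sum(weights)
    return [[sum(wt * f[r][c] for wt, f in zip(weights, frames)) / total
             for c in range(w)] for r in range(h)]

def fingerprint(image, threshold):
    """Binarize the TIRI against a threshold to get a compact bit string."""
    return [1 if px >= threshold else 0 for row in image for px in row]

# Two toy 2x2 frames whose bright pixels swap position between frames.
frames = [[[10, 200], [10, 200]],
          [[200, 10], [200, 10]]]
bits = fingerprint(tiri(frames), 100)
```

In a real system the fingerprint bits come from block features of the TIRI rather than raw pixels, but the pipeline shape (temporal average, then binarized features) is the same.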
Inverted File Based Search Technique for Video Copy Retrieval (ijcsa)
A video copy detection system is a content-based search engine focusing on spatio-temporal features. It aims to find whether a query video segment is a copy of a video from the video database, based on the signature of the video. It is hard to determine whether a video is a copy or merely a similar video, since the features of the content are very similar from one video to another. The main focus is to detect that the query video is present in the video database, with robustness depending on the content of the video, and also by fast search of fingerprints. The Fingerprint Extraction Algorithm and Fast Search Algorithm are adopted to achieve robust, fast, efficient, and accurate video copy detection. As a first step, the Fingerprint Extraction algorithm is employed, which extracts a fingerprint from features of the image content of the video. The images are represented as Temporally Informative Representative Images (TIRI). The next step is to find the presence of a copy of a query video in the video database, in which a close match of its fingerprint in the corresponding fingerprint database is searched using an inverted-file-based method.
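The inverted-file lookup described above can be sketched as a word-to-postings map over fingerprint words, with a simple vote count ranking candidate videos. The fingerprint words and video IDs below are toy assumptions; real systems hash TIRI features into such words.

```python
from collections import defaultdict

def build_index(fingerprints):
    """Inverted file: each fingerprint word maps to the (video, offset)
    positions where it occurs, so lookup avoids scanning the whole database."""
    index = defaultdict(list)
    for video_id, words in fingerprints.items():
        for offset, word in enumerate(words):
            index[word].append((video_id, offset))
    return index

def query(index, words):
    """Vote for the database video sharing the most fingerprint words."""
    votes = defaultdict(int)
    for word in words:
        for video_id, _ in index.get(word, ()):
            votes[video_id] += 1
    return max(votes, key=votes.get) if votes else None

db = {"vid_a": ["0110", "1010", "0011"],
      "vid_b": ["1111", "0000", "1001"]}
index = build_index(db)
best = query(index, ["1010", "0011"])   # query words taken from vid_a
```

Lookup cost scales with the number of query words and their posting lists rather than with the database size, which is what makes the search "fast" in the abstract's sense.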
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil... (Ijripublishers Ijri)
Global interconnect planning becomes a challenge as semiconductor technology continuously scales. Because of increasing wire resistance and higher capacitive coupling at smaller feature sizes, the delay of global interconnects becomes large compared with the delay of a logic gate, introducing a large performance gap that must be resolved. A novel equalized global link architecture and driver-receiver co-design flow are proposed for high-speed, low-energy on-chip communication using a continuous-time linear equalizer (CTLE). The proposed global link is analyzed with a linear-system method, and a formula for the CTLE eye opening is derived to provide high-level design guidelines and insights.
Compared with a separate driver-receiver design flow, over 50% energy reduction is observed.
Adria Recasens, DeepMind – Multi-modal self-supervised learning from videosCodiax
The document summarizes a talk on multi-modal self-supervised learning from videos. It discusses using multiple modalities like vision, audio and language from videos for self-supervised learning. It presents two models: 1) A Multi-Modal Versatile network that can take any modality as input and respects the specificity of each while enabling comparison. 2) BraVe which learns representations by regressing a broad representation of the whole video from a narrow view to leverage different augmentations and modalities. Both models achieve state-of-the-art results on downstream tasks, showing videos provide rich self-supervision and using additional context improves representation learning.
Image Captioning Generator using Deep Machine Learningijtsrd
Technologys scope has evolved into one of the most powerful tools for human development in a variety of fields.AI and machine learning have become one of the most powerful tools for completing tasks quickly and accurately without the need for human intervention. This project demonstrates how deep machine learning can be used to create a caption or a sentence for a given picture. This can be used for visually impaired persons, as well as automobiles for self identification, and for various applications to verify quickly and easily. The Convolutional Neural Network CNN is used to describe the alphabet, and the Long Short Term Memory LSTM is used to organize the right meaningful sentences in this model. The flicker 8k and flicker 30k datasets were used to train this. Sreejith S P | Vijayakumar A "Image Captioning Generator using Deep Machine Learning" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5 | Issue-4 , June 2021, URL: https://www.ijtsrd.compapers/ijtsrd42344.pdf Paper URL: https://www.ijtsrd.comcomputer-science/artificial-intelligence/42344/image-captioning-generator-using-deep-machine-learning/sreejith-s-p
AI&BigData Lab. Артем Чернодуб "Распознавание изображений методом Lazy Deep ...GeeksLab Odessa
23.05.15 Одесса. Impact Hub Odessa. Конференция AI&BigData Lab
Артем Чернодуб (Computer Vision Team, ZZ Wolf)
"Распознавание изображений методом Lazy Deep Learning в фото-органайзере ZZ Photo"
В докладе рассматривается проблема распознавания изображений методами машинного зрения. Проводится краткий обзор существующих подзадач в этой области (детекция обьектов, классификация сцен, ассоциативный поиск в базах изображений, распознавание лиц и др.) и современных методов их решения с акцентом на глубокое обучение (Deep Learning).
Подробнее:
http://geekslab.co/
https://www.facebook.com/GeeksLab.co
https://www.youtube.com/user/GeeksLabVideo
This document provides a survey of video steganography techniques. It begins with definitions and comparisons of steganography, cryptography and watermarking. Video steganography hides secret information by embedding it in video files. Various video steganography techniques are explored, including spatial domain and transform domain methods. Spatial domain methods embed in pixel values directly while transform methods operate in compressed domains. The document evaluates and analyzes different video steganography methods and their imperceptibility, payload, security and computational costs.
This document summarizes a presentation given in Kangwondo, South Korea in December 2012 about heterogeneous computing and GPU computing. The presentation discussed the evolution of CPUs and GPUs, general purpose GPU computing, OpenCL as a programming standard, and a case study implementing the AES encryption algorithm on a GPU. Performance tests showed the GPU implementations were over an order of magnitude faster than a CPU implementation for AES.
Unsupervised Video Anomaly Detection: A brief overviewRidge-i, Inc.
The presentation provided an overview of unsupervised video anomaly detection techniques. It discussed benchmark datasets, approaches like convolutional LSTM autoencoders, memory-augmented autoencoders, and memory-augmented convolutional autoencoders. Experiments on these approaches were conducted on standard datasets and a new oven dataset. The memory-augmented convolutional autoencoder achieved the best performance, being more sensitive to anomalies while maintaining fast inference speed. The presentation concluded with recommendations on the discussed techniques.
Cisco Packet Tracer is a network simulation software that allows users to design, configure and test networks virtually. It provides benefits for both instructors and students by making networking concepts easier to teach and learn. Packet Tracer's key features include simulation of network devices and protocols, visualization of network traffic, and multi-user collaboration. The software supports Cisco Networking Academy curricula and helps develop students' problem solving and critical thinking skills.
This document summarizes a research paper that proposes a method to enhance security in a video copy detection system using content-based fingerprinting. The paper discusses how existing video fingerprinting systems are not robust against content-changing attacks like changing the background of a video. To address this, the paper proposes using an interest point matching algorithm to extract fingerprints. The interest point matching algorithm detects interest points in video frames using the Harris corner detection method. It then constructs correspondences between interest points to form fingerprints. The fingerprints extracted with this method are claimed to be more robust against content-changing attacks compared to existing fingerprinting methods. The proposed algorithm is tested on videos with distortions and is found to have high detection rates and low false positive rates.
These slides summarize the main trends in deep neural networks for video encoding. Including single frame models, spatiotemporal convolutionals, long term sequence modeling with RNNs and their combinaction with optical flow.
Multilayer bit allocation for video encodingIJMIT JOURNAL
Video compression approach removes spatial and temporal redundancy based on the signal statistical correlation. Bit allocation technique adopts a visual distortion model for a better rate visual distortion video coding. Visual distortion model uses both motion and the texture structures in the video sequences. The existing video coding mechanisms reduces the bit rate for video coding. However to get better video compression ratio there is a need for multilayer compression technique. In this paper we proposed a multilayer bit allocation video coding mechanism. The proposed model reduces the bit allocation for video coding by retaining the same video quality. The experimental results using the proposed model reduced the bit rate by 3% to 4%. The result are promising. Finally we conclude with conclusion and future work.
The document summarizes a research paper that proposes a method to summarize parking surveillance footage. The method first pre-processes the raw footage to extract only frames containing vehicles. These frames are then classified using a CNN model to detect vehicles and recognize license plates. The classified objects and license plate numbers are used to generate a textual summary of the vehicles in the footage, making it easier for users to review large amounts of surveillance video. The paper discusses related work on video summarization techniques and provides details of the proposed methodology, which includes preprocessing footage, extracting features from frames containing vehicles, using CNNs for object detection and license plate recognition, and generating a summarized video and text report.
Video content analysis and retrieval system using video storytelling and inde...IJECEIAES
Videos are used often for communicating ideas, concepts, experience, and situations, because of the significant advances made in video communication technology. The social media platforms enhanced the video usage expeditiously. At, present, recognition of a video is done, using the metadata like video title, video descriptions, and video thumbnails. There are situations like video searcher requires only a video clip on a specific topic from a long video. This paper proposes a novel methodology for the analysis of video content and using video storytelling and indexing techniques for the retrieval of the intended video clip from a long duration video. Video storytelling technique is used for video content analysis and to produce a description of the video. The video description thus created is used for preparation of an index using wormhole algorithm, guarantying the search of a keyword of definite length L, within the minimum worst-case time. This video index can be used by video searching algorithm to retrieve the relevant part of the video by virtue of the frequency of the word in the keyword search of the video index. Instead of downloading and transferring a whole video, the user can download or transfer the specifically necessary video clip. The network constraints associated with the transfer of videos are considerably addressed.
Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...Ijripublishers Ijri
This document discusses techniques for improving video compression efficiency for surveillance videos. It proposes modifying the architecture of scalable video coding to make it surveillance-centric by allowing adaptive rate-distortion optimization at the GOP level based on whether events of interest are present. Experimental results show foreground detection and updating of background adaptively over time to improve compression. Future work includes further enhancing selective motion estimation techniques to improve processing efficiency without degrading video quality.
SUMMARY GENERATION FOR LECTURING VIDEOSIRJET Journal
The document discusses generating summaries for lecturing videos. It proposes a method that uses optical character recognition to extract textual information from each video frame, detects changes in text between frames to find scene changes, and combines key frames to create a highlight summary. The proposed system was tested on PowerPoint-based lecture videos. Future work could focus on expanding it to handle chalkboard-style lectures through improved machine learning models. The goal is to automatically generate concise summaries that extract the most important concepts from long videos to help students learn more efficiently.
Key frame extraction methodology for video annotationIAEME Publication
This document summarizes a research paper that proposes a key frame extraction methodology to facilitate video annotation. The methodology uses edge difference between consecutive video frames to determine if the content has significantly changed. Frames where the edge difference exceeds a threshold are selected as key frames. The algorithm calculates edge differences for all frame pairs in a video. It then computes statistics like mean and standard deviation to determine a threshold. Frames with differences above this threshold are extracted as key frames. The key frames extracted represent important content changes in the video. Extracting key frames reduces processing requirements for video annotation compared to analyzing all frames. The methodology was tested on videos from domains like transportation and performed well at selecting representative frames.
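The thresholding scheme described above can be sketched in a few lines; a minimal illustration, assuming grayscale frames given as 2-D arrays and a simple gradient-magnitude edge map standing in for a full edge detector:

```python
import numpy as np

def edge_map(frame):
    # Simple gradient-magnitude edge map (stand-in for a full edge detector).
    gy, gx = np.gradient(frame.astype(float))
    return np.hypot(gx, gy)

def extract_key_frames(frames):
    # Mean absolute edge difference between each pair of consecutive frames.
    diffs = np.array([np.abs(edge_map(a) - edge_map(b)).mean()
                      for a, b in zip(frames, frames[1:])])
    # Threshold derived from the statistics of all differences, as in the paper.
    threshold = diffs.mean() + diffs.std()
    # Frame i+1 is a key frame when the change from frame i exceeds the threshold.
    return [i + 1 for i, d in enumerate(diffs) if d > threshold]
```

On a synthetic clip whose content changes abruptly once, only the frame at the change point is selected.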
Mtech Second progresspresentation ON VIDEO SUMMARIZATIONNEERAJ BAGHEL
This document presents a second progress report on video summarization research. It provides an outline of topics covered, including an introduction to video summarization, a literature review summarizing 5 papers on the topic, identified research gaps, challenges, the problem statement of finding key frames based on extracted text, overview of relevant datasets and tools used, and conclusions. The literature review analyzes the objectives, methods, strengths and limitations of the summarized papers.
Multimodal video abstraction into a static document using deep learning IJECEIAES
Abstraction is a strategy that gives the essential points of a document in a short period of time. The video abstraction approach proposed in this research is based on multi-modal video data, which comprises both audio and visual data. The major abstraction procedures are segmenting the input video into scenes and obtaining a textual and visual summary for each scene, so that the video events are summarized into a static document. To recognize shot and scene boundaries in a video sequence, a hybrid features method was employed, which improves shot detection performance by selecting strong and flexible features. The most informative keyframes from each scene are then incorporated into the visual summary, and a hybrid deep learning model was used for abstractive text summarization. The BBC archive provided the testing videos, comprising BBC Learning English and BBC News; in addition, a news summary dataset was used to train the deep model. The performance of the proposed approaches was assessed using metrics such as ROUGE for the textual summary, which achieved a score of 40.49%, while the precision, recall, and F-score used for the visual summary reached 94.9%, outperforming the other methods in the experiments.
PERFORMANCE ANALYSIS OF FINGERPRINTING EXTRACTION ALGORITHM IN VIDEO COPY DET...IJCSEIT Journal
A video fingerprint is a recognizer derived from a piece of video content. Video fingerprinting methods obtain unique features of a video that differentiate one video clip from another, aiming to identify whether a query video segment is a copy of a video from the video database based on the signature of the video. It is difficult to decide whether a video is a copy or merely a similar video, since the features of the content are very similar from one video to the other. The main focus of this paper is to detect whether the query video is present in the video database, with robustness depending on the content of the video and with fast search of fingerprints. The Fingerprint Extraction Algorithm and Fast Search Algorithms are adopted in this paper to achieve robust, fast, efficient, and accurate video copy detection. As a first step, the Fingerprint Extraction algorithm extracts a fingerprint from the features of the image content of the video, with the images represented as Temporally Informative Representative Images (TIRI). The second step finds the presence of a copy of a query video in the video database, in which a close match of its fingerprint is searched in the corresponding fingerprint database using an inverted-file-based method. The proposed system is tested against various attacks such as noise, brightness, contrast, rotation, and frame drop. On average, it shows a high true positive rate of 98% and a low false positive rate of 1.3% across the different attacks.
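As an illustration of the TIRI representation, a minimal sketch follows; the exponential weighting and the `gamma` value are illustrative assumptions, not the paper's exact parameters:

```python
import numpy as np

def tiri(frames, gamma=0.65):
    # Temporally Informative Representative Image: a weighted temporal
    # average of a short frame segment, so that motion across the segment
    # is embedded into a single representative image.
    frames = np.asarray(frames, dtype=float)
    w = gamma ** np.arange(len(frames))   # illustrative exponential weights
    w = w / w.sum()                       # normalize to a weighted average
    # Contract the weight vector against the temporal axis of the frames.
    return np.tensordot(w, frames, axes=1)
```

The fingerprint is then extracted from these representative images rather than from every raw frame.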
Inverted File Based Search Technique for Video Copy Retrievalijcsa
A video copy detection system is a content-based search engine focusing on spatio-temporal features. It aims to find whether a query video segment is a copy of a video from the video database based on the signature of the video. It is hard to decide whether a video is a copy or merely a similar video, since the features of the content are very similar from one video to the other. The main focus is to detect whether the query video is present in the video database, with robustness depending on the content of the video and with fast search of fingerprints. The Fingerprint Extraction Algorithm and Fast Search Algorithm are adopted to achieve robust, fast, efficient, and accurate video copy detection. As a first step, the Fingerprint Extraction algorithm extracts a fingerprint from the features of the image content of the video, with the images represented as Temporally Informative Representative Images (TIRI). The next step finds the presence of a copy of a query video in the video database, in which a close match of its fingerprint is searched in the corresponding fingerprint database using an inverted-file-based method.
The document proposes a method to summarize sports match videos using object detection, optical character recognition (OCR), and speech analysis. Video frames are analyzed using a YOLO model to detect important objects like cards in football or scoreboards in cricket. OCR is used to read text on scoreboards and detect changes. Speech analysis examines crowd noise to find exciting moments. Timestamps of important clips identified through these methods are combined and extracted from the original video to create a summarized highlights video. The approach is intended to work for both cricket and football matches.
Secure IoT Systems Monitor Framework using Probabilistic Image EncryptionIJAEMSJORNAL
In recent years, the modeling of human behaviors and patterns of activity for recognition or detection of special events has attracted considerable research interest. Various methods abound for building intelligent vision systems that aim to understand the scene and make correct semantic inferences from the observed dynamics of moving targets. Many systems include detection, storage of video information, and human-computer interfaces. Here we present not only an update that expands previous similar surveys but also an emphasis on contextual abnormal human activity detection, especially in video surveillance applications. The main purpose of this survey is to identify existing methods extensively and to characterize the literature in a manner that brings key challenges to attention.
Multi-View Video Coding Algorithms/Techniques: A Comprehensive StudyIJERA Editor
This document summarizes recent developments in multi-view video coding techniques. It begins with an introduction to multi-view video and multi-view video coding. It then discusses exploiting temporal and inter-view similarities for efficient compression. Several existing multi-view video coding methods and algorithms are reviewed, including predictive coding, subband coding, motion and disparity compensation, and wavelet-based approaches. The benefits and requirements of multi-view video compression are also outlined.
This document proposes a method for video copy detection using segmentation, MPEG-7 descriptors, and graph-based sequence matching. It extracts key frames from videos, extracts features from the frames using descriptors like CEDD, FCTH, SCD, EHD and CLD, and stores them in a database. When a query video is input, its features are extracted and compared to the database to detect if it matches any videos already in the database. Graph-based sequence matching is also used to find the optimal matching between video sequences despite transformations like changed frame rates or ordering. The method is shown to perform better than previous techniques at detecting copied videos through transformations.
The document proposes a Hybrid Layered Video (HLV) encoding scheme for mobile multimedia applications. The scheme has two components: (1) a sketch-based representation that uses parametric curves to represent object outlines, called Generative Sketch-based Video (GSV); and (2) a texture component with three layers - a low-quality base layer, medium-quality mid-layer, and original-quality highest layer. Different combinations of the GSV and texture layers provide varying quality and resource usage profiles. The scheme aims to enable computer vision tasks on mobile devices in a bandwidth- and power-efficient manner.
Video stream analysis in clouds an object detection and classification frame...Finalyearprojects Toall
The document presents a cloud-based video analytics framework for scalable and automated object detection and classification from video streams. The framework allows an operator to specify the video analysis criteria and duration. Videos are fetched from cloud storage, decoded, and analyzed on GPU-powered cloud servers. In vehicle and face detection case studies, the framework reliably analyzed 21,600 video streams totaling 175 GB in 6.52 hours on a 15-node cloud, and in about 3 hours when using GPUs, roughly twice as fast as without them.
Semantic Summarization of videos, Semantic Summarization of videosdarsh228313
In [1], each capsule uses an activity vector to represent different instantiation parameters (position, size, orientation, thickness, etc.), with the vector length (norm) representing the probability that an entity is present.
Hence, the output vector of each capsule needs to be normalized to a length in [0, 1].
This is done by the non-linear squashing function.
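In the commonly cited form of [1], the squashing function scales a capsule's input vector s by |s|^2 / (1 + |s|^2) along its own direction, so that short vectors shrink toward length 0 and long vectors saturate toward length 1; a minimal NumPy sketch:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # v = (|s|^2 / (1 + |s|^2)) * (s / |s|)
    # eps avoids division by zero for an all-zero input vector.
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    norm = np.sqrt(sq_norm + eps)
    return (sq_norm / (1.0 + sq_norm)) * (s / norm)
```

For example, a vector of length 5 is squashed to length 25/26, just under 1, while a vector of length 0.1 is squashed to a length near 0.01.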
The document proposes enhancing security in video communication through visual cryptography and firefly optimization. Key points:
- The approach uses visual cryptography to hide a watermark in selected video frames. Frames are selected using a threshold determined by the firefly optimization algorithm.
- The watermark is embedded into frames using reversible data hiding. The embedded watermark can then be extracted after attacks and has minimum distortion.
- The goals are to achieve high security, accommodate high capacity data, and recover data with minimum distortion compared to other reversible data hiding techniques.
The document presents a progress report on video summarization. It outlines the proposed work, which involves using a pre-trained Inception V3 network for feature extraction and matching extracted features to a user query to generate a summarized video. The document also discusses related work on query-focused and query-conditioned video summarization, and references datasets and tools used for video summarization.
VIDEO SUMMARIZATION: CORRELATION FOR SUMMARIZATION AND SUBTRACTION FOR RARE E...Journal For Research
The document presents a video summarization technique called Correlation for Summarization and Subtraction for Rare Event (CSSR). The technique extracts frames from the input video, calculates the correlation between frames to identify redundant frames, and discards similar frames to create a summarized video. It also identifies objects or actions in areas of interest by subtracting summarized frames from the stored background image of that area. The technique was tested on videos and was able to successfully create short summarized videos while also detecting objects in specified areas of interest. The authors conclude that the technique provides an optimized solution for automatic video summarization and security monitoring with reduced manual effort.
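The correlation-based discarding step can be sketched as follows; the threshold value and the compare-against-last-kept-frame strategy are illustrative assumptions:

```python
import numpy as np

def summarize(frames, corr_threshold=0.95):
    # Keep a frame only when it is not highly correlated with the
    # most recently kept frame; highly correlated frames are redundant.
    kept = [0]
    for i in range(1, len(frames)):
        a = frames[kept[-1]].ravel().astype(float)
        b = frames[i].ravel().astype(float)
        r = np.corrcoef(a, b)[0, 1]   # Pearson correlation of pixel values
        if r < corr_threshold:
            kept.append(i)
    return kept   # indices of frames that form the summarized video
```

For a clip of three identical frames followed by two frames of different content, only the first frame of each run survives.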
2. AGENDA
1. PROBLEM STATEMENT
2. INTRODUCTION
3. LITERATURE SURVEY
4. MOTIVATION
5. BACKGROUND
6. SYSTEM DESCRIPTION
7. REQUIREMENTS
8. ADVANTAGES
9. LIMITATIONS
10. CONCLUSION
3. 1. PROBLEM STATEMENT
• To identify the contents of a video and describe them in natural language.
4. 2. INTRODUCTION
• A machine can efficiently perform image classification, object recognition, and video segmentation.
• Tasks like video description, however, remain a challenge.
• Video description has applications in
1. human-robot interaction,
2. helping the visually impaired,
3. video retrieval by content.
5. 3. LITERATURE SURVEY
Publication: IEEE CVPR, 2020
Methodology and Techniques: Video is condensed into a spatio-temporal graph network, which serves as the object branch. This interaction information is distilled into another scene branch via an object-aware knowledge distillation mechanism.
Remarks: Takes interaction information into consideration. Can shortcut the classification problem using the background.

Publication: ICCVW, 2019
Methodology and Techniques: A two-stage training setting optimizes both encoder and decoder simultaneously. The architecture is initialized using pre-trained encoders and decoders; then the features most relevant for video description generation are learnt.
Remarks: Vocabulary is large. Computationally expensive.

Publication: arXiv, 2018
Methodology and Techniques: A self-critical REINFORCE algorithm is used to obtain better weights for the LSTMs and to train them. The full model is then jointly tuned, freeing the weights of the CNNs.
Remarks: Can generate complex sentences. Challenging to train such a big model.

Publication: ACM Books, 2018
Methodology and Techniques: An encoder-decoder framework that uses an encoder (CNN) to extract visual features from raw video frames and a decoder (RNN/LSTM) to produce the desired output sentence.
Remarks: Easy to train. Limited to a small vocabulary.

Publication: AAAI, 2013
Methodology and Techniques: A template-based approach in which SVO triplets are identified using a combination of visual object and activity detectors, followed by search-based optimization to find their best combination.
Remarks: Simplest approach. Generated sentences are simple.
7. 4. MOTIVATION
• Spatial, temporal, and attribute-based attention models
1. are unable to efficiently exploit video temporal structure over a longer range, and
2. require heavy computation.
• The Hierarchical Recurrent Neural Encoder model overcomes these challenges.
12. 6.1.1 ENCODER
The encoder extracts visual features from raw video frames into a fixed-dimension vector (h_e) that represents the entire sequence.
The video feature pool consists of
1. Object appearance features, extracted using VGG16 pretrained on the ImageNet dataset.
2. Action features, extracted using C3D pretrained on an activity recognition dataset.
6.1.2 DECODER
The decoder takes that vector as its initial state; the state is fed to a BLSTM to generate the desired output sentence.
13. 6.2 HIERARCHICAL RECURRENT NEURAL ENCODER (HRNE)
• The first LSTM layer explores local temporal structure within short chunks of consecutive frames.
• The second LSTM layer learns the temporal dependencies among those chunks.
• A more complex HRNE model could add further layers to build multiple time-scale abstractions of the visual information.
15. 6.3 DATASET
• The MSR-VTT dataset is used for training and testing.
• In its current version, MSR-VTT provides 10K web video clips totaling 41.2 hours, with 200K clip-sentence pairs.
16. 6.4 EVALUATION METRICS
• The generated sentence correlates well with human judgment when the metric scores are high.
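As an example of such a metric, a minimal unigram-precision score in the spirit of BLEU-1 (with a brevity penalty, assuming a single reference and non-empty inputs) can be computed as:

```python
from collections import Counter
import math

def bleu1(candidate, reference):
    # Unigram precision, clipped by reference counts, multiplied by a
    # brevity penalty that punishes candidates shorter than the reference.
    cand, ref = candidate.split(), reference.split()
    overlap = sum((Counter(cand) & Counter(ref)).values())  # clipped matches
    precision = overlap / len(cand)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision
```

A perfect match scores 1.0; a shorter candidate with all words matching is penalized only by its brevity.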
17. 7. REQUIREMENTS
• Central Processing Unit (CPU) — Intel Core i5 6th Gen. processor or higher.
• RAM — 8 GB minimum.
• Graphics Processing Unit (GPU) — NVIDIA GeForce GTX 960 or higher.
• Operating System — Ubuntu, Mac or Microsoft Windows 10.
• Software – a Python IDE with modules such as Keras and TensorFlow
18. 8. ADVANTAGES
• Exploits temporal information over a longer time range.
• Shortens the path with the capability of adding non-linearity, providing a
better trade-off between efficiency and effectiveness.
• Is able to uncover temporal transitions between frame chunks with different
granularities.
19. 9. LIMITATIONS
• The LSTM decoder is prone to overfitting; hence, its generalization capability needs to be validated.
• In future work, a softmax classifier trained on video labels can be plugged on top of the encoder in place of the LSTM language decoder.
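The suggested replacement, a softmax classifier on top of the encoder output, amounts to a single linear layer; a minimal sketch with illustrative shapes:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the logits.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def classify(h_e, W, b):
    # A linear + softmax head predicting a video label distribution
    # directly from the encoded vector h_e, in place of a language decoder.
    return softmax(W @ h_e + b)
```

With n_labels classes, W has shape (n_labels, encoder_dim) and the output is a probability distribution over the labels.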
20. 10. CONCLUSION
• We take raw video as input and apply a 2D CNN (VGG16) and a 3D CNN (C3D) to extract the object appearance and action features, respectively.
• To obtain the encoded vector, multiple LSTM layers are stacked using HRNE.
• The decoder is an LSTM that takes the visual features as input and generates a natural language description.
21. REFERENCES
[1] N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama. Generating natural-language video descriptions using text-mined knowledge. In AAAI, July 2013.
[2] Z. Wu, T. Yao, Y. Fu, and Y. Jiang. Deep learning for video classification and captioning. In S. Chang, editor, Frontiers of Multimedia Research, pages 3-29. ACM Books, 2018.
[3] S. Olivastri, G. Singh, and F. Cuzzolin. End-to-End Video Captioning. In Large Scale Holistic Video Understanding, ICCVW, 2019.
[4] B. Pan, H. Cai, D.-A. Huang, K.-H. Lee, A. Gaidon, E. Adeli, and J. C. Niebles. Spatio-Temporal Graph for Video Captioning with Knowledge Distillation. In CVPR, 2020.
[5] L. Li and B. Gong. End-to-end video captioning with multitask reinforcement learning. arXiv preprint arXiv:1803.07950, 2018.
[6] Y. Gui, D. Guo, and Y. Zhao. Semantic Enhanced Encoder-Decoder Network (SEN) for Video Captioning. In MAHCI '19, 2019.
[7] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015.
[8] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning Spatiotemporal Features with 3D Convolutional Networks. In ICCV, 2015.
[9] N. Aafaq, A. Mian, W. Liu, S. Z. Gilani, and M. Shah. Video Description: A Survey of Methods, Datasets and Evaluation Metrics. In ACM Computing Surveys (CSUR), 2019.
[10] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang. Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning. In CVPR, 2016.