Jiri ece-01-03 adaptive temporal averaging and frame prediction based surveil...Ijripublishers Ijri
Global interconnect planning becomes a challenge as semiconductor technology continues to scale. Because of increasing wire resistance and higher capacitive coupling at smaller feature sizes, the delay of global interconnects grows large compared with the delay of a logic gate, introducing a performance gap that needs to be resolved. A novel equalized global link architecture and driver–receiver co-design flow are proposed for high-speed, low-energy on-chip communication by utilizing a continuous-time linear equalizer (CTLE). The proposed global link is analyzed using a linear-system method, and a formula for the CTLE eye opening is derived to provide high-level design guidelines and insights.
Compared with the separate driver–receiver design flow, over 50% energy reduction is observed.
Video has become ubiquitous on the Internet, on TV, and on personal devices. Recognition of video content has been a fundamental challenge in computer vision for decades, with previous research predominantly focused on recognizing videos using a predefined yet limited vocabulary. Thanks to the recent development of deep learning techniques, researchers in multiple communities are now striving to bridge videos with natural language in order to move beyond classification to interpretation, which should be regarded as the ultimate goal of video understanding. We will present recent advances in exploring the synergy of video understanding and language processing techniques.
Video saliency recognition by applying a custom spatio-temporal fusion technique (IAES IJAI)
Video saliency detection is a growing field with relatively few contributions to date. The common approach today is frame-wise saliency detection, which leads to several complications, including incoherent pixel-level saliency maps of limited practical use. This paper proposes a novel solution to saliency detection and mapping: a custom spatio-temporal fusion method that combines frame-wise motion and colour saliency with pixel-wise consistent spatio-temporal diffusion for temporal uniformity. In the proposed method, the video is split into groups of frames, and each frame undergoes temporal diffusion and integration to compute the colour saliency map. Inter-group frames are then used to form the pixel-level saliency fusion, after which the fused features (pixel saliency and colour information) guide the diffusion of the spatio-temporal saliency. The results are evaluated with five publicly available saliency evaluation metrics, and the proposed algorithm outperforms several state-of-the-art saliency detection methods by a clear margin in accuracy. The results demonstrate robustness, reliability, versatility, and accuracy.
Serena Yeung, PhD, Stanford, at MLconf Seattle 2017 (MLconf)
Serena is a Ph.D. student in the Stanford Vision Lab, advised by Prof. Fei-Fei Li. Her research interests are in computer vision, machine learning, and deep learning. She is particularly interested in the areas of video understanding, human action recognition, and healthcare applications. She interned at Facebook AI Research in Summer 2016.
Before starting her Ph.D., she received a B.S. in Electrical Engineering in 2010, and an M.S. in Electrical Engineering in 2013, both from Stanford. She also worked as a software engineer at Rockmelt (acquired by Yahoo) from 2009-2011.
Abstract summary
Towards Scaling Video Understanding:
The quantity of video data is vast, yet our capabilities for visual recognition and understanding in videos lag significantly behind those for images. In this talk, I will first discuss some of the challenges of scale in labeling, modeling, and inference behind this gap. I will then present some of our recent work toward addressing these challenges, in particular using reinforcement learning-based formulations to tackle efficient inference in videos and to learn classifiers from noisy web search results. Finally, I will conclude with a discussion of promising future directions toward scaling video understanding.
Video has become ubiquitous on the Internet, TV, as well as personal devices. Recognition of video content has been a fundamental challenge in computer vision for decades, where previous research predominantly focused on understanding videos using a predefined yet limited vocabulary. Thanks to the recent development of deep learning techniques, researchers in both computer vision and multimedia communities are now striving to bridge videos with natural language, which can be regarded as the ultimate goal of video understanding. We will present recent advances in exploring the synergy of video understanding and language processing techniques, including video-language alignment and video captioning.
Generating natural language descriptions from video using CNN (Convolutional Neural
Network) and LSTM (Long Short Term Memory) layers stacked into one HRNE (Hierarchical Recurrent Neural Encoder) model.
IJRET: International Journal of Research in Engineering and Technology is an international peer-reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together scientists, academicians, field engineers, scholars and students of related fields of Engineering and Technology.
Multi-View Video Coding Algorithms/Techniques: A Comprehensive Study (IJERA Editor)
For scientific exploration and visualization, multi-view displays enable a viewer to experience a 3-D environment via a flat 2-D screen. Visualization is the most effective and informative form for delivering information. In this paper, recent developments in multi-view video coding are reviewed, such as motion- and disparity-compensated coding, wavelet-based multi-view video coding, and spatial scalability coding.
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/pathpartner/embedded-vision-training/videos/pages/may-2019-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Praveen Nayak, Tech Lead at PathPartner Technology, presents the "Using Deep Learning for Video Event Detection on a Compute Budget" tutorial at the May 2019 Embedded Vision Summit.
Convolutional neural networks (CNNs) have made tremendous strides in object detection and recognition in recent years. However, extending the CNN approach to understanding of video or volumetric data poses tough challenges, including trade-offs between representation quality and computational complexity, which is of particular concern on embedded platforms with tight computational budgets. This presentation explores the use of CNNs for video understanding.
Nayak reviews the evolution of deep representation learning methods involving spatio-temporal fusion, from C3D to Conv-LSTMs, for vision-based human activity detection. He proposes a decoupled alternative to this fusion, describing an approach that combines a low-complexity predictive temporal segment proposal model with a fine-grained (and potentially high-complexity) inference model. PathPartner Technology finds that this hybrid approach, in addition to reducing computational load with minimal loss of accuracy, enables effective solutions to these high-complexity inference tasks.
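To make the decoupled idea concrete, here is a minimal, hypothetical sketch (not PathPartner's actual implementation) of gating an expensive video-event classifier behind a cheap temporal segment proposal score; the function names, window size, and threshold are illustrative assumptions.

```python
# Minimal sketch of the decoupled approach described above: a cheap temporal
# segment proposal model runs on every window of frames, and the expensive
# fine-grained classifier is invoked only on windows the proposal model flags.
def cheap_proposal_score(window):
    # e.g., a tiny motion/appearance model; here, mean absolute frame difference
    diffs = [abs(a - b) for a, b in zip(window[:-1], window[1:])]
    return sum(diffs) / max(len(diffs), 1)

def expensive_event_classifier(window):
    # stand-in for a heavy 3D-CNN / Conv-LSTM inference call
    return "event" if max(window) > 0.8 else "background"

def detect_events(frames, window_size=16, threshold=0.05):
    events = []
    for start in range(0, len(frames) - window_size + 1, window_size):
        window = frames[start:start + window_size]
        if cheap_proposal_score(window) >= threshold:      # gate the heavy model
            events.append((start, expensive_event_classifier(window)))
    return events

# Toy 1-D "frames": activity (large values / changes) only in the middle.
frames = [0.0] * 32 + [0.9, 0.1] * 16 + [0.0] * 32
print(detect_events(frames))
```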
Abstract: Video is becoming a popular medium for e-learning. Access to online video on the World Wide Web (WWW) mostly depends on user-assigned tags or descriptions, which is how such videos are currently found. However, this approach has limitations for retrieval: we frequently want the content of the video itself to be matched directly against a user's query, rather than relying on manually assigned tags or descriptions. Lecture videos contain both visual and aural channels: presentation slides and speech. In this system, text is retrieved from the videos automatically. To extract visual information, we apply video content analysis to detect slides and optical character recognition to obtain their text. We extract textual metadata by applying video Optical Character Recognition (OCR) on key-frames and Automatic Speech Recognition (ASR) on the lecture audio. The slide text line types detected by ASR and OCR are then used for keyword extraction, in which video- and fragment-level keywords are extracted for content-based video search. Keywords: video fragmentation, frame extraction, video indexing.
Inverted File Based Search Technique for Video Copy Retrieval (ijcsa)
A video copy detection system is a content-based search engine focusing on spatio-temporal features. It aims to find whether a query video segment is a copy of a video from the video database, based on the signature of the video. It is hard to tell whether a video is a copied video or merely a similar video, since the content features are very similar from one video to another. The main focus is to detect whether the query video is present in the video database, with robustness to the content of the video, and with fast search of fingerprints. The Fingerprint Extraction Algorithm and Fast Search Algorithm are adopted to achieve robust, fast, efficient and accurate video copy detection. As a first step, the Fingerprint Extraction algorithm extracts a fingerprint from features of the image content of the video. The images are represented as Temporally Informative Representative Images (TIRI). The next step is to find the presence of a copy of a query video in a video database, in which a close match of its fingerprint is searched in the corresponding fingerprint database using an inverted-file-based method.
PERFORMANCE ANALYSIS OF FINGERPRINTING EXTRACTION ALGORITHM IN VIDEO COPY DET... (IJCSEIT Journal)
A video fingerprint is a recognizer derived from a piece of video content. Video fingerprinting methods obtain unique features of a video that differentiate one video clip from another. The aim is to identify whether a query video segment is a copy of a video from the video database, based on the signature of the video. It is difficult to tell whether a video is a copied video or merely a similar video, since the content features are very similar from one video to another. The main focus of this paper is to detect whether the query video is present in the video database, with robustness to the content of the video, and with fast search of fingerprints. The Fingerprint Extraction Algorithm and Fast Search Algorithms are adopted in this paper to achieve robust, fast, efficient and accurate video copy detection. As a first step, the Fingerprint Extraction algorithm extracts a fingerprint from features of the image content of the video. The images are represented as Temporally Informative Representative Images (TIRI). The second step is to find the presence of a copy of a query video in a video database, in which a close match of its fingerprint is searched in the corresponding fingerprint database using an inverted-file-based method. The proposed system is tested against various attacks such as noise, brightness, contrast, rotation and frame drop. On average, the proposed system shows a high true positive rate of 98% and a low false positive rate of 1.3% across the different attacks.
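As a rough illustration of the fingerprint-plus-inverted-file search described in these two abstracts, the sketch below hashes coarse block-luminance features of representative images into fingerprint words and looks them up in an inverted index. It is a simplified stand-in under assumed data structures, not the TIRI algorithm from the papers.

```python
# Minimal sketch of fingerprint extraction and inverted-file lookup for
# video copy detection. The "fingerprint" is a toy hash of per-block mean
# luminance of each representative image; the inverted index maps fingerprint
# words to the videos (and offsets) that contain them.
from collections import defaultdict

def fingerprint(block_means, bits_per_block=2):
    # Quantize each block's mean luminance (0..255) into a few bits and pack
    # them into one integer "word" per representative image.
    word = 0
    for mean in block_means:
        level = min(int(mean / 256 * (1 << bits_per_block)), (1 << bits_per_block) - 1)
        word = (word << bits_per_block) | level
    return word

def build_inverted_index(videos):
    index = defaultdict(list)            # fingerprint word -> [(video_id, offset)]
    for vid, images in videos.items():
        for offset, blocks in enumerate(images):
            index[fingerprint(blocks)].append((vid, offset))
    return index

def query(index, query_images):
    votes = defaultdict(int)
    for blocks in query_images:
        for vid, _ in index.get(fingerprint(blocks), []):
            votes[vid] += 1              # count matching fingerprint words
    return max(votes, key=votes.get) if votes else None

# Toy database: each video is a list of representative images, each reduced
# to 4 block-mean luminance values.
db = {"vidA": [[10, 200, 30, 40], [15, 190, 35, 45]],
      "vidB": [[100, 100, 100, 100], [120, 110, 105, 95]]}
index = build_inverted_index(db)
print(query(index, [[12, 198, 33, 42]]))   # expected to vote for "vidA"
```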
Towards Using Semantic Features for Near-Duplicate Video Detection (Wesley De Neve)
Towards Using Semantic Features for Near-Duplicate Video Detection.
Paper presented at the ICME 2010 Workshop on Visual Content Identification and Search in Singapore.
Semantic Summarization of videos (darsh228313)
In [1], each capsule uses an activity vector to represent different instantiation parameters (position, size, orientation, thickness, etc.), with the vector length (norm) representing the probability that an entity is present.
Hence, the output vector of each capsule needs to be normalized to [0, 1].
This is done by the non-linear squashing function:
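The squashing function referred to here, as given in the capsule networks paper cited as [1], maps a capsule's raw output vector s_j to a vector v_j whose length lies in [0, 1]:

```latex
v_j = \frac{\lVert s_j \rVert^2}{1 + \lVert s_j \rVert^2} \, \frac{s_j}{\lVert s_j \rVert}
```

Short vectors are squashed toward zero length, while long vectors saturate just below unit length, so the norm can be read as a probability.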
Review on content based video lecture retrieval (eSAT Journals)
Abstract: Recent advances in multimedia technologies allow the capture and storage of video data with relatively inexpensive computers. Furthermore, the new possibilities offered by the information highways have made a large amount of video data publicly available. However, without appropriate search techniques, all these data are hardly usable. Users are not satisfied with video retrieval systems that provide only analogue VCR functionality. For example, a user analysing a soccer video will ask for specific events such as goals. Content-based search and retrieval of video data therefore becomes a challenging and important problem, and tools that can manipulate video content the way traditional databases manage numeric and textual data are needed. A more efficient method for video retrieval on the WWW, or within large lecture video archives, is urgently needed. This project presents an approach for automated video indexing and video search in large lecture video archives. First, we apply automatic video segmentation and key-frame detection to offer a visual guideline for video content navigation. Subsequently, we extract textual metadata by applying video Optical Character Recognition (OCR) technology on key-frames and Automatic Speech Recognition on the lecture audio tracks. Keywords: feature extraction, video annotation, video browsing, video retrieval, video structure analysis
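A minimal sketch of the key-frame detection and OCR step described in this abstract, assuming OpenCV and pytesseract as the underlying libraries (the abstract itself does not name specific tools); the slide-change threshold and file name are illustrative.

```python
# Key-frame detection plus OCR: a frame is kept as a key-frame when it differs
# strongly from the previously kept frame (a crude slide-change detector), and
# OCR is run only on the kept frames.
import cv2
import pytesseract

def extract_keyframe_text(video_path, diff_threshold=30.0):
    cap = cv2.VideoCapture(video_path)
    previous, texts = None, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if previous is None or cv2.absdiff(gray, previous).mean() > diff_threshold:
            texts.append(pytesseract.image_to_string(gray))   # OCR on the key-frame
            previous = gray
    cap.release()
    return texts

# Usage (hypothetical file name): each returned string holds the slide text of
# one detected key-frame, ready for keyword extraction and indexing.
# print(extract_keyframe_text("lecture01.mp4"))
```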
Tutorial on "Video Summarization and Re-use Technologies and Tools", delivered at IEEE ICME 2020. These slides correspond to the first part of the tutorial, presented by Vasileios Mezaris and Evlampios Apostolidis. This part deals with automatic video summarization, and includes a presentation of the video summarization problem definition and a literature overview; an in-depth discussion on a few unsupervised GAN-based methods; and a discussion on video summarization datasets, evaluation protocols and results, and future directions.
Similar to "Paper introduction: Temporal Sentence Grounding in Videos: A Survey and Future Directions" (20)
Paper introduction: Multi-criteria Token Fusion with One-step-ahead Attention for Efficient ... (Toru Tamaki)
Sanghyeok Lee, Joonmyung Choi, Hyunwoo J. Kim, "Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers" arXiv2024
https://arxiv.org/abs/2403.10030
Paper introduction: ArcFace: Additive Angular Margin Loss for Deep Face Recognition (Toru Tamaki)
Jiankang Deng, Jia Guo, Niannan Xue, Stefanos Zafeiriou , "ArcFace: Additive Angular Margin Loss for Deep Face Recognition" CVPR2019
https://openaccess.thecvf.com/content_CVPR_2019/html/Deng_ArcFace_Additive_Angular_Margin_Loss_for_Deep_Face_Recognition_CVPR_2019_paper.html
Paper introduction: Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers (Toru Tamaki)
Lei Ke, Yu-Wing Tai, Chi-Keung Tang, "Deep Occlusion-Aware Instance Segmentation With Overlapping BiLayers" CVPR2021
https://openaccess.thecvf.com/content/CVPR2021/html/Ke_Deep_Occlusion-Aware_Instance_Segmentation_With_Overlapping_BiLayers_CVPR_2021_paper.html
Paper introduction: Automated Classification of Model Errors on ImageNet (Toru Tamaki)
Momchil Peychev, Mark Müller, Marc Fischer, Martin Vechev, " Automated Classification of Model Errors on ImageNet", NeurIPS2023
https://proceedings.neurips.cc/paper_files/paper/2023/hash/7480ed13740773505262791131c12b89-Abstract-Conference.html
Paper introduction: MOSE: A New Dataset for Video Object Segmentation in Complex Scenes (Toru Tamaki)
Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip H.S. Torr, Song Bai, " MOSE: A New Dataset for Video Object Segmentation in Complex Scenes " ICCV2023
Paper introduction: Tracking Anything with Decoupled Video Segmentation (Toru Tamaki)
Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, Joon-Young Lee, " Tracking Anything with Decoupled Video Segmentation " ICCV2023
https://openaccess.thecvf.com/content/ICCV2023/html/Cheng_Tracking_Anything_with_Decoupled_Video_Segmentation_ICCV_2023_paper.html
Paper introduction: Real-Time Evaluation in Online Continual Learning: A New Hope (Toru Tamaki)
Yasir Ghunaim, Adel Bibi, Kumail Alhamoud, Motasem Alfarra, Hasan Abed Al Kader Hammoud, Ameya Prabhu, Philip H.S. Torr, Bernard Ghanem, " Real-Time Evaluation in Online Continual Learning: A New Hope " CVPR2023
https://openaccess.thecvf.com/content/CVPR2023/html/Ghunaim_Real-Time_Evaluation_in_Online_Continual_Learning_A_New_Hope_CVPR_2023_paper.html
Paper introduction: PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta... (Toru Tamaki)
Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas, " PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation " CVPR2017
https://openaccess.thecvf.com/content_cvpr_2017/html/Qi_PointNet_Deep_Learning_CVPR_2017_paper.html
UiPath Test Automation using UiPath Test Suite series, part 4 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Threats to mobile devices are increasingly prevalent and growing in scope and complexity. Users of mobile devices want to take full advantage of the features available on those devices, but many features provide convenience and capability at the expense of security. This best practices guide outlines steps users can take to better protect their personal devices and information.
Pushing the limits of ePRTC: 100ns holdover for 100 days (Adtran)
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Removing Uninteresting Bytes in Software Fuzzing (Aftab Hussain)
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... (SOFTTECHHUB)
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
PHP Frameworks: I want to break free (IPC Berlin 2024) (Ralf Eggert)
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis at the 30.5.2024 DASA Connect conference. We discuss what testing is, then what agile testing is, and finally what testing in DevOps means. We closed with a lovely workshop in which participants explored different ways to think about quality and testing in the different parts of the DevOps infinity loop.
Paper introduction: Temporal Sentence Grounding in Videos: A Survey and Future Directions
1. Temporal Sentence Grounding in Videos: A Survey and Future Directions
Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou
TPAMI
Tomoya Nitta (Nagoya Institute of Technology)
2. Overview
■ Temporal Sentence Grounding in Videos (TSGV)
• Input
  • A video
  • A sentence (query)
• Output
  • The segment of the video related to the sentence
• A specific part of a video can be retrieved with a natural-language sentence
[Embedded excerpt from the first page of the paper (Index Terms, Section 1 Introduction, and Section 1.1 Definition and History), including Fig. 1, "An illustration of temporal sentence grounding in videos (TSGV)": for the query "A person is putting clothes in the washing machine.", the TSGV model returns the start and end timestamps (9.6s and 24.5s) of the answer moment.]
[Embedded Fig. 2, "Statistics of the collected papers in this survey": left, the number of papers published each year (until September 2022); right, the distribution of papers by venue (CVPR 12%, AAAI 11%, ArXiv 11%, ACM MM 10%, *ACL 8%, TMM 6%, TIP 5%, ICCV 4%, ECCV 4%, SIGIR 4%, TPAMI 2%, NeurIPS 2%, other conferences/journals 21%).]
• Number of papers published per year (presenter's annotation on the chart)
3. Datasets
■ DiDeMo (Hendricks+, ICCV2017)
  • Ground-truth annotated segments are fixed at 5 seconds
  • Videos are distributed as pre-extracted features
■ Charades-STA (Gao+, ICCV2017)
  • Average video length: 30 seconds
■ ActivityNet Captions (Krishna+, ICCV2017)
  • Reuses the dataset built for dense video captioning
■ TACoS (Regneri+, TACL2013)
  • A modified version, TACoS 2D-TAN (Zhang+, AAAI2020), also exists
■ MAD (Soldan+, CVPR2022)
  • Average video length: 110 minutes
  • Average ground-truth segment length: 4.1 seconds
[Embedded Fig. 7 from the paper, "Statistics of query length and normalized moment length over benchmark datasets", with the accompanying text: for TACoS, average video and moment lengths are 286.59 and 6.10 seconds, the average query length is 10.05 words, and each video contains 148.17 annotations on average (named TACoS_org in Table 1); the modified version TACoS_2D-TAN made available by Zhang et al. has 18,227 annotations (143.52 per video on average), with an average moment length of 27.88 seconds and an average query length of 9.42 words; MAD is a large-scale dataset containing mainstream movies.]
• Presenter's chart annotations: query length in words; ratio of the ground-truth segment length to the video duration
5. Evaluation metrics
■ dR@n, IoU@m
• A refinement of R@n,IoU@m
• dR@n,\mathrm{IoU}@m = \frac{1}{N_q}\sum_{i=1}^{N_q} r(n, m, q_i)\cdot \alpha_i^{s}\cdot \alpha_i^{e}
• \alpha_i^{s} and \alpha_i^{e} are discount factors
• \alpha_i^{*} = 1 - |p_i^{*} - g_i^{*}|, with p_i^{*} and g_i^{*} the (normalized) predicted and ground-truth boundary timestamps
• Even when the IoU exceeds the threshold, the score is discounted by the gap between the prediction and the ground truth (a small Python sketch of this metric follows the figure note below)
[Embedded figure from the paper: (a) an illustration of temporal IoU between prediction moments #1, #2 and the ground-truth moment; (b) an illustration of dR@n,IoU@m, where the boundary gaps |p_i^s - g_i^s| and |p_i^e - g_i^e| between predicted and ground-truth timestamps determine the discounts. The Fig. 7 statistics text (query length and normalized moment length over benchmark datasets) also appears in this screenshot.]
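To make the metric concrete, here is a minimal Python sketch (not the survey's code) of temporal IoU and the discounted recall dR@n,IoU@m defined above; timestamps are assumed to be normalized to [0, 1], and the data structures are illustrative.

```python
# Temporal IoU and the discounted recall metric dR@n,IoU@m from this slide.
# `predictions[q]` holds the top-ranked (start, end) candidates for query q;
# `ground_truth[q]` is the annotated (start, end) moment.

def temporal_iou(pred, gt):
    """Intersection-over-union of two temporal segments (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def discounted_recall(predictions, ground_truth, n=1, m=0.5):
    """dR@n,IoU@m: recall over queries, discounted by boundary errors."""
    total = 0.0
    for q, gt in ground_truth.items():
        top_n = predictions[q][:n]
        hits = [p for p in top_n if temporal_iou(p, gt) >= m]
        if not hits:
            continue
        p = hits[0]                          # best-ranked hit within the top-n
        alpha_s = 1.0 - abs(p[0] - gt[0])    # start-boundary discount
        alpha_e = 1.0 - abs(p[1] - gt[1])    # end-boundary discount
        total += alpha_s * alpha_e           # r(n, m, q) = 1 for a hit
    return total / len(ground_truth)

# Example: one query, one prediction close to the ground truth.
preds = {"q1": [(0.30, 0.60)]}
gts = {"q1": (0.32, 0.64)}
print(discounted_recall(preds, gts, n=1, m=0.5))
```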
6. Methods
■ Classification of TSGV methods
• Supervised learning
  • Proposal-based
  • Proposal-free
  • Reinforcement learning
  • Others
• Weakly supervised learning
[Embedded taxonomy figure from the paper (supervised and weakly supervised TSGV methods in different categories):
TSGV Methods
• Supervised Methods
  • Proposal-based Methods: Sliding window-based Methods, Proposal generated Methods, Anchor-based Methods (Standard Anchor-based Methods, 2D-Map-based Methods)
  • Proposal-free Methods: Regression-based Methods, Span-based Methods
  • Reinforcement learning-based Methods
  • Other Supervised Methods
• Weakly-supervised Methods
  • Multi-Instance Learning-based Methods
  • Reconstruction-based Methods
  • Other Weakly-supervised Methods
The screenshot also includes a fragment of the paper text explaining why R@n,IoU@m can be unreliable when the predicted and ground-truth moments are long.]
7. Supervised learning
1. Feature extraction
  • Extract features from each modality with a modality-specific feature extractor
2. Encoder
  • Transform each modality's features so that they can be combined
3. Multimodal interactor
  • Fuse the two modalities and process them jointly
4. Moment localizer
  • Estimate the target segment
(A small Python sketch of this pipeline follows the figure note below.)
[Embedded pipeline figure from the paper: (a) Preprocessor, (b) Feature Extractor (visual and textual), (c) Feature Encoder (visual and textual), (d) Feature Interactor (multi-modal interaction / cross-modal reasoning, optionally with a Proposal Generator), (e) Answer Predictor (moment localizer). Example query: "The man speeds up then returns to his initial speed." The numbers 1-4 in the figure correspond to the four steps listed above.]
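Below is a minimal, self-contained sketch of the four-stage supervised TSGV pipeline just described. All components (random-projection encoders, a mean-query attention interactor, a toy run-length localizer) are hypothetical stand-ins to show the data flow, not any specific published model.

```python
# Generic supervised TSGV pipeline: feature extraction -> encoding ->
# multimodal interaction -> moment localization.
import numpy as np

def extract_video_features(frames, dim=512):
    # Stand-in for a pretrained visual backbone (e.g., C3D/I3D-style features).
    return np.random.randn(len(frames), dim)

def extract_query_features(tokens, dim=300):
    # Stand-in for per-token word embeddings (e.g., GloVe-style vectors).
    return np.random.randn(len(tokens), dim)

def encode(features, out_dim=256):
    # Stand-in encoder: project both modalities into a shared dimension.
    w = np.random.randn(features.shape[1], out_dim)
    return features @ w

def interact(video_enc, query_enc):
    # Simplest possible interactor: weight video frames by the mean query vector.
    q = query_enc.mean(axis=0)
    scores = video_enc @ q
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    return video_enc * attn[:, None]      # query-conditioned video features

def localize(fused, threshold=0.0):
    # Toy moment localizer: the longest run of frames with positive score.
    frame_scores = fused.sum(axis=1)
    best, cur_start, best_span = None, None, 0
    for t, s in enumerate(frame_scores):
        if s > threshold and cur_start is None:
            cur_start = t
        if (s <= threshold or t == len(frame_scores) - 1) and cur_start is not None:
            end = t if s <= threshold else t + 1
            if end - cur_start > best_span:
                best, best_span = (cur_start, end), end - cur_start
            cur_start = None
    return best

frames = list(range(64))
tokens = "the man speeds up then returns to his initial speed".split()
fused = interact(encode(extract_video_features(frames)), encode(extract_query_features(tokens)))
print(localize(fused))                    # (start frame, end frame) or None
```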
9. Proposal-based methods
■ Sliding Window (SW)
• Proposal module
  • Sliding window
• Candidate segments are generated by sliding windows defined in units of ς frames
• CTRL (Gao+, ICCV2017)
(A small sketch of sliding-window proposal generation follows the figure note below.)
[Embedded figures from the paper: Fig. 5(a), the sliding window (SW) strategy, which slides windows over the frames to produce proposal candidates; and the CTRL architecture (query-guided segment retrieval), with a visual feature extractor (C3D), a textual feature extractor (SkipThoughts or WE+LSTM), multimodal fusion (concatenation and FC layers), an alignment score, and a location regressor. Example query: "The man speeds up then returns to his initial speed."]
CTRL (Gao+, ICCV2017)
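A minimal sketch of the sliding-window (SW) proposal generation described on this slide; the window lengths and stride are illustrative choices, and each produced candidate would then be scored against the query by a model such as CTRL.

```python
# Sliding-window proposal generation: candidate segments of several window
# lengths are slid over the video in fixed strides.
def sliding_window_proposals(num_frames, window_lengths=(16, 32, 64), stride=8):
    proposals = []
    for w in window_lengths:
        for start in range(0, max(num_frames - w, 0) + 1, stride):
            proposals.append((start, start + w))   # (start frame, end frame)
    return proposals

# Example: a 128-frame video yields multi-scale overlapping candidates.
print(len(sliding_window_proposals(128)), sliding_window_proposals(128)[:3])
```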
10. Proposal-based methods
■ Proposal Generation (PG)
• Proposal module
  • A learned proposal generation module
• Proposal segments are generated from the input video and the query
• QSPN (Xu+, AAAI2019)
[Embedded figures from the paper: Fig. 5(b), the proposal generated (PG) strategy, in which a proposal detector takes the video and the language query and outputs proposal candidates; and the QSPN architecture: (a) a query-guided segment proposal network (C3D visual features, WE+LSTM textual features, convolutional blocks, temporal attention weights, max pooling, and 3D RoI pooling) and (b) an early-fusion retrieval model built from LSTMs and FC layers. Example query: "The man speeds up then returns to his initial speed."]
QSPN (Xu+, AAAI2019)
11. Proposal-based methods
■ Anchor-based (AN)
• Proposal module
  • Anchors
• Segments are proposed from anchors placed over the extracted feature sequence
• TGN (Chen+, EMNLP2018)
[Embedded figures from the paper: Fig. 5(c), the anchor-based strategy defined over the visual feature sequence; and Fig. 14, the TGN architecture (visual and textual feature extractors, an encoder, an interactor built from interaction LSTMs, and a grounder), reproduced from Chen et al. Example query: "The man speeds up then returns to his initial speed."]
TGN (Chen+, EMNLP2018)
12. Proposal-based methods
■ 2D-Map anchor-based (2D)
• Proposal module
  • 2D-map anchors
• Candidate segments are represented on a 2D map indexed by start time and end time
• 2D-TAN (Zhang+, AAAI2020)
(A small sketch of the 2D proposal map follows the figure note below.)
[Embedded figures from the paper: Fig. 5(d), the 2D-Map strategy, where a cell such as (1, 5) on the (start time index, end time index) map denotes one possible proposal; and Fig. 15, the 2D-TAN architecture (C3D or VGG visual features, WE+LSTM query encoding, 2D temporal feature map extraction with max pooling, Hadamard-product fusion, and ConvNets over the temporal adjacent network producing start/end indices), reproduced from Zhang et al. Example query: "The man speeds up then returns to his initial speed."]
2D-TAN (Zhang+, AAAI2020)
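A minimal sketch of the 2D-map proposal representation described on this slide: every valid (start, end) cell of a T×T map is one candidate moment, as a model like 2D-TAN would score; the clip count is an illustrative choice.

```python
# 2D-Map proposal representation: a T x T map whose cell (i, j) stands for the
# candidate segment that starts at clip i and ends at clip j; only the upper
# triangle (j >= i) is valid.
import numpy as np

def build_2d_proposal_map(num_clips):
    valid = np.zeros((num_clips, num_clips), dtype=bool)
    proposals = []
    for i in range(num_clips):
        for j in range(i, num_clips):
            valid[i, j] = True
            proposals.append((i, j))       # (start clip index, end clip index)
    return valid, proposals

# Example: with 6 clips there are 6*7/2 = 21 candidate moments.
valid, proposals = build_2d_proposal_map(6)
print(valid.sum(), proposals[:5])
```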
13. Proposal-free methods
■ Regression-based (RG)
• Directly regress the start and end frames (t^s, t^e) of the segment
• Output: (t^s, t^e) ∈ ℝ²
• ABLR (Yuan+, AAAI2019)
[Embedded Fig. 16, the ABLR architecture, from the paper: C3D visual features and GloVe word embeddings, BiLSTM encoders for both modalities, multi-modal co-attention interaction with mean pooling, and attention-based coordinate regression (attention-weight-based and attention-feature-based regression), reproduced from Yuan et al.]
ABLR (Yuan+, AAAI2019)
14. Proposal-free methods
■ Span-based (SN)
• For each frame, estimate the probability that it is the start or the end of the segment
• Output: per-frame start and end probability sequences P^s, P^e over the T frames
• VSLNet (Zhang+, ACL2020)
(A small sketch of selecting the best span from these probabilities follows the figure note below.)
[Embedded Fig. 17, the VSLNet architecture, from the paper: I3D visual features and GloVe word embeddings, shared feature encoders, context-query attention, query-guided highlighting, and a conditioned span predictor, reproduced from Zhang et al. Example query: "The man speeds up then returns to his initial speed."]
VSLNet (Zhang+, ACL2020)
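A minimal sketch of span-based moment selection as described on this slide: given per-frame start and end probabilities, pick the (start, end) pair with the highest joint score subject to start ≤ end. The probability values in the example are made up.

```python
# Span selection from per-frame start/end probability vectors.
import numpy as np

def best_span(p_start, p_end, max_len=None):
    T = len(p_start)
    best_score, best = -1.0, (0, 0)
    for s in range(T):
        last = T if max_len is None else min(T, s + max_len)
        for e in range(s, last):
            score = p_start[s] * p_end[e]   # joint score of (start, end)
            if score > best_score:
                best_score, best = score, (s, e)
    return best, best_score

# Example with T = 8 frames.
p_s = np.array([0.02, 0.05, 0.60, 0.20, 0.05, 0.04, 0.02, 0.02])
p_e = np.array([0.01, 0.02, 0.03, 0.10, 0.15, 0.55, 0.10, 0.04])
print(best_span(p_s, p_e))                  # expected: span (2, 5)
```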
15. Reinforcement-learning-based methods
■ Reinforcement learning-based (RL)
• The agent receives rewards by expanding, shrinking, or shifting the estimated segment
• RWM-RL (He+, AAAI2019)
  • The estimated segment is adjusted through a policy over 7 actions
(A small sketch of this adjust-and-reward loop follows the figure note below.)
[Embedded material from the paper: Fig. 18, an illustration of the sequence-decision-making formulation of TSGV (the environment is the video plus the query "A person is putting clothes in the washing machine."; the agent, a deep model, observes the state and takes actions under a policy to collect rewards); and Fig. 19, the RWM-RL architecture (C3D visual encoder, Skip-Thought textual encoder, an observation network over query, global visual, local visual, and location features, a GRU, and an actor-critic head with multi-task IoU regression, location regression, and a softmax policy over the action space), reproduced from He et al. The surrounding paper text discusses regression-based methods that exploit additional modalities (HVTG and MARN with appearance and motion features, PMI additionally with audio features extracted by SoundNet, DRFT with visual, optical-flow, and depth features) and introduces the span-based method section.]
RWM-RL (He+, AAAI2019)
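A minimal sketch of the adjust-and-reward loop behind RL-based TSGV as described on this slide. The concrete action set (shift left/right, expand, shrink) and the greedy policy are illustrative assumptions, not the exact 7 actions or the learned policy of RWM-RL.

```python
# An agent iteratively adjusts the current (start, end) estimate with discrete
# actions and is rewarded when the temporal IoU with the ground truth improves.
def apply_action(segment, action, step, num_frames):
    s, e = segment
    if action == "shift_left":
        s, e = s - step, e - step
    elif action == "shift_right":
        s, e = s + step, e + step
    elif action == "expand":
        s, e = s - step, e + step
    elif action == "shrink":
        s, e = s + step, e - step
    s = max(0, min(s, num_frames - 1))      # clip to the video extent
    e = max(s + 1, min(e, num_frames))
    return s, e

def temporal_iou(a, b):
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union else 0.0

# Greedy rollout standing in for a learned policy: at each step, take the
# action whose resulting IoU with the ground truth (the reward signal) is best.
segment, ground_truth, num_frames = (16, 48), (40, 80), 128
for _ in range(5):
    candidates = [(apply_action(segment, a, 8, num_frames), a)
                  for a in ("shift_left", "shift_right", "expand", "shrink")]
    segment, action = max(candidates, key=lambda c: temporal_iou(c[0], ground_truth))
    print(action, segment, round(temporal_iou(segment, ground_truth), 2))
```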
17. Weakly supervised learning
■ Multi-Instance Learning (MIL)
• A model trained with video-text metric learning
• Can compute a correlation score between a video and a text
• TGA (Mithun+, CVPR2019)
(A small sketch of the ranking-loss idea follows the figure note below.)
[Embedded Fig. 21, the TGA architecture, from the paper: C3D or VGG visual features and WE+GRU textual features, text-guided attention with weighted pooling, and FC projections into a joint video-text embedding space, reproduced from Mithun et al. Example query: "The man speeds up then returns to his initial speed."]
TGA (Mithun+, CVPR2019)
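A minimal sketch of the multiple-instance-learning idea behind weakly supervised TSGV as described on this slide: with only video-level supervision, a ranking (triplet) loss pulls matching video-query pairs together and pushes mismatched pairs apart in a joint embedding space; the margin and the embeddings are illustrative.

```python
# Ranking loss over cosine similarities in a joint video-text embedding space.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def ranking_loss(video_emb, pos_query_emb, neg_query_emb, margin=0.2):
    pos = cosine(video_emb, pos_query_emb)   # matching video-query score
    neg = cosine(video_emb, neg_query_emb)   # mismatched video-query score
    return max(0.0, margin - pos + neg)

rng = np.random.default_rng(0)
v, q_pos, q_neg = rng.standard_normal((3, 64))
print(ranking_loss(v, q_pos, q_neg))
```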
18. Weakly supervised learning
■ Reconstruction-based (REC)
• Estimate the segment whose generated caption best reconstructs the input query
• SCN (Zhao+, AAAI2020)
[Embedded figures from the paper: Fig. 22, the WS-DEC architecture, reproduced from Duan et al.; and Fig. 23, the SCN architecture, reproduced from Lin et al.: a proposal generation module (C3D video features, GloVe query features, textual encoder, visual decoder) produces proposal candidates with confidence scores; a semantic completion module (visual encoder, textual decoder) takes the selected proposals together with a masked query ("The man <mask> up then <mask> to his initial speed.") and reconstructs the query; a reconstruction loss, top-K rewards over a softmax, and a rank loss refine the selected top-K proposals.]
SCN (Zhao+, AAAI2020)
22. Discussion
■ Data uncertainty
• Annotation
  • Errors in the annotated segments
  • Bias introduced by individual annotators
  • Only one annotator per video
■ Data bias
• Word usage
  • The words used in queries are skewed
• Annotated segments
  • The distribution of annotated segments is very similar between the train and test splits
[Embedded material from the paper: a fragment of a results table (MAN, TMN, VLG-Net, L-Net, SM-RL); Fig. 24, "Illustration of data uncertainty in TSGV benchmarks": annotation uncertainty means disagreement of the annotated ground-truth moments across different annotators, and query uncertainty means multiple query expressions for one ground-truth moment (e.g., "the person washes their hands at the kitchen sink" vs. "a person is washing their hands in the sink"); Fig. 25, the top-30 frequent actions in the Charades-STA and ActivityNet Captions datasets; and Fig. 26, moment distributions for Charades-STA and ActivityNet Captions over the normalized start and end timestamps, where deeper color means higher annotation density. The accompanying text notes that improving the feature extractor or exploiting more diverse features leads to better accuracy, and that, to investigate distributional bias, Yuan et al. reorganize the two benchmarks into Charades-CD and ActivityNet-CD, each with an independent-and-identically-distributed (iid) test set and an out-of-distribution (ood) test set; SOTA TSGV baselines achieve impressive performance on the iid test set but fail to generalize to the ood test set, while weakly-supervised methods are naturally immune to this distributional bias since they do not require annotated samples for training.]