Tutorial on "Video Summarization and Re-use Technologies and Tools", delivered at IEEE ICME 2020. These slides correspond to the first part of the tutorial, presented by Vasileios Mezaris and Evlampios Apostolidis. This part deals with automatic video summarization, and includes a presentation of the video summarization problem definition and a literature overview; an in-depth discussion on a few unsupervised GAN-based methods; and a discussion on video summarization datasets, evaluation protocols and results, and future directions.
1. Video Summarization and Re-use Technologies and Tools
Part I: Automatic video summarization
Section I.1: Video summarization problem definition and literature overview
Vasileios Mezaris, Evlampios Apostolidis
CERTH-ITI, Greece
Tutorial at IEEE ICME 2020
retv-project.eu @ReTV_EU @ReTVproject retv-project retv_project
2. Tutorial's structure and time schedule
Part I: Automatic video summarization
Section I.1: Video summarization problem definition and literature overview (20')
Q&A (5')
Section I.2: In-depth discussion on a few unsupervised GAN-based methods (20')
Q&A (5')
Section I.3: Datasets, evaluation protocols and results, and future directions (20')
20' Q&A and break; then we are back with the tutorial's Part II: Video summaries re-use and recommendation
3. Problem definition
Video is everywhere!
Hours of video content are uploaded on YouTube every minute
Video is captured by smart devices and instantly shared online
Constantly and rapidly increasing volumes of video content
Image sources: https://www.financialexpress.com/india-news/govt-agencies-adopt-new-age-video-sharing-apps-like-tiktok/1767354/ (left) & https://www.statista.com/ (right)
4.-5. Problem definition - video consumption side
But how do we find what we are looking for in endless collections of video content?
Quickly inspect a video's content by checking its synopsis!
Image source: https://www.voicendata.com/sprint-removes-video-streaming-limits/
6.-7. Problem definition - video editing side
But how do we reach different audiences for a given media item?
Use of technologies for content adaptation, re-use and re-purposing!
(Image: different viewers reacting to the same content: "Good", "Very interesting", "Boring", "Nice", "Much detailed")
Image source: https://marketingland.com/social-media-audience-critical-content-marketing-223647
8. Problem definition
Video summary: a short visual summary that encapsulates the flow of the story and the essential parts of the full-length video
Example: original video vs. video summary (storyboard)
Source: https://www.youtube.com/watch?v=deRF9oEbRso
9. Problem definition
General applications of video summarization
Professional CMS: effective indexing, browsing, retrieval & promotion of media assets!
Video sharing platforms: improved viewer experience, enhanced viewer engagement & increased content consumption!
Sources: https://www.redbytes.in/how-to-build-an-app-like-hotstar/ & screenshot of the BBC News channel on YouTube
10. Problem definition
General applications of video summarization
Audience- and channel-specific content adaptation: video content re-use and re-distribution in the most appropriate way!
Image source: https://www.databagg.com/online-video-sharing
11. Problem definition
Domain-specific applications of video summarization
Full movie (e.g. 1h 30'-2h) vs. movie trailer (2'30'')
J. R. Smith, D. Joshi, B. Huet, W. Hsu, and J. Cota, "Harnessing A.I. for Augmenting Creativity: Application to Movie Trailer Creation," in Proc. of the 25th ACM Int. Conf. on Multimedia, ser. MM '17. New York, NY, USA: ACM, 2017, pp. 1799-1808.
Source: https://www.youtube.com/watch?v=wb49-oV0F78
12. Problem definition
Domain-specific applications of video summarization
Full game (e.g. 1h 30') vs. game's synopsis & highlights (1'32'')
Source: https://www.youtube.com/watch?v=oo-2IFTifUU
13. Problem definition
Domain-specific applications of video summarization
Raw CCTV material (e.g. 24h) vs. summary of important actions/events (with timestamps)
Video samples extracted from: https://www.youtube.com/watch?v=gk3qTMlcadk
14. Literature overview
Taxonomy of deep-learning-based methods for automatic video summarization
15. Literature overview
Supervised approaches: using video semantics and metadata
[Zhang, 2016; Kaufman, 2017] learn and transfer the summary structure of semantically-similar videos
[Panda, 2017] metadata-driven video categorization and summarization by maximizing relevance with the video category
[Song, 2016; Zhou, 2018a] category-driven summarization by category feature preservation (keep the main parts of a wedding when summarizing a wedding video)
[Otani, 2016; Yuan, 2019] maximize relevance of visual (video) and textual (metadata) data in a common latent space
16. Literature overview
Supervised approaches: considering temporal structure and dependency
[Zhang, 2016b] estimates frames' importance by modeling their variable-range temporal dependency using RNNs (see the sketch below)
[Zhao, 2018] models and encodes the temporal structure of the video for defining the key-fragments, using hierarchies of RNNs
[Ji, 2019] treats video-to-summary generation as a sequence-to-sequence learning problem, using an attention-driven encoder-decoder network
[Feng, 2018; Wang, 2019] estimate frames' importance by modeling their long-range dependency using high-capacity memory networks
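As a rough illustration of the RNN-based modeling cited above (e.g. [Zhang, 2016b]), the following sketch shows a bidirectional LSTM that maps pre-extracted frame features to per-frame importance scores. The layer sizes and the training data are hypothetical; this does not reproduce the exact architecture of any cited paper.

import torch
import torch.nn as nn

class FrameScorer(nn.Module):
    """Bidirectional LSTM that scores each frame's importance in [0, 1]."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        # The LSTM models temporal dependencies among the CNN-extracted frame features.
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, 1), nn.Sigmoid())

    def forward(self, feats):            # feats: (batch, n_frames, feat_dim)
        h, _ = self.lstm(feats)          # (batch, n_frames, 2 * hidden)
        return self.head(h).squeeze(-1)  # per-frame importance scores

# Supervised training step against (hypothetical) human importance annotations.
scorer = FrameScorer()
feats = torch.randn(2, 120, 1024)        # CNN features of 2 videos, 120 frames each
targets = torch.rand(2, 120)             # ground-truth frame-importance scores
loss = nn.functional.mse_loss(scorer(feats), targets)
loss.backward()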
17. Literature overview
Supervised approaches: imitating human summaries
[Zhang, 2019] summarization by confusing a trainable discriminator when making the distinction between a machine- and a human-generated summary; models the variable-range temporal dependency using RNNs and Dilated Temporal Units
[Fu, 2019] key-fragment selection by confusing a trainable discriminator when making the distinction between machine- and human-selected key-fragments; fragmentation based on an attention-based Pointer Network, and discrimination using a 3D-CNN classifier
18. Literature overview
Supervised approaches: targeting specific properties of the summary
[Chu, 2019] models spatiotemporal information based on raw frames and optical flow maps, and learns frames' importance from human annotations via a label distribution learning process
[Elfeki, 2019] uses CNNs and RNNs to form spatiotemporal feature vectors, and estimates the level of activity and importance of each frame to create the summary
[Chen, 2019] summarization based on reinforcement learning and reward functions associated with the diversity and representativeness of the video summary
19. Literature overview
Unsupervised approaches: inferring the original video
[Mahasseni, 2017] SUM-GAN trains a summarizer to fool a discriminator when distinguishing the original from the summary-based reconstructed video, using adversarial learning
[Jung, 2019] CSNet extends [Mahasseni, 2017] with a chunk-and-stride network and an attention mechanism to assess variable-range dependencies and select the video keyframes
[Apostolidis, 2020] SUM-GAN-AAE extends [Mahasseni, 2017] with a stepwise, fine-grained training strategy and an attention auto-encoder to improve the key-fragment selection process
[Rochan, 2019] UnpairedVSN learns video summarization from unpaired data, based on an adversarial process that defines a mapping function from a raw video to a human summary
20. Literature overview
Unsupervised approaches: targeting specific properties of the summary
[Zhou, 2018b] DR-DSN learns to create representative and diverse summaries via reinforcement learning and relevant reward functions (sketched below)
[Gonuguntla, 2019] EDSN extracts spatiotemporal information and learns summarization by rewarding the maintenance of the main spatiotemporal patterns in the summary
[Zhang, 2018] OnlineMotionAE extracts the key motions of appearing objects and uses an online motion auto-encoder model to generate summaries that include the main objects in the video and the attractive actions made by each of these objects
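To make the reward-driven training concrete, here is a minimal sketch of diversity and representativeness rewards in the spirit of DR-DSN [Zhou, 2018b]: diversity rewards dissimilarity among the selected frames, representativeness rewards how well they cover the whole video. The paper's exact formulations differ in details, and all names here are ours.

import torch

def diversity_reward(feats, picks):
    # Mean pairwise dissimilarity (1 - cosine similarity) among selected frames.
    sel = torch.nn.functional.normalize(feats[picks], dim=1)
    sim = sel @ sel.t()
    n = sel.size(0)
    return (1.0 - sim[~torch.eye(n, dtype=torch.bool)]).mean()

def representativeness_reward(feats, picks):
    # Rewards selections whose frames lie close to every frame of the video
    # (a k-medoids-like coverage criterion).
    dists = torch.cdist(feats, feats[picks])   # (n_frames, n_selected)
    return torch.exp(-dists.min(dim=1).values.mean())

feats = torch.randn(120, 1024)                 # per-frame deep features
picks = torch.tensor([3, 40, 77, 110])         # frame indices chosen by the agent
reward = diversity_reward(feats, picks) + representativeness_reward(feats, picks)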
21. Some concluding remarks
DL-based video summarization methods mainly rely on combinations of CNNs and RNNs
Pre-trained CNNs are used to represent the visual content; RNNs (mostly LSTMs) are used to model the temporal dependency among video frames (see the feature-extraction sketch below)
The proposed video summarization approaches are mostly supervised
The best supervised approaches utilize tailored attention mechanisms or memory networks to capture variable- and long-range temporal dependencies, respectively
For unsupervised video summarization, GANs are the central direction; RL is another, but less common, approach
The best unsupervised approaches rely on VAE-GAN architectures that have been enhanced with attention mechanisms
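The feature-extraction step mentioned above is typically implemented with an off-the-shelf backbone. The following sketch uses torchvision's GoogLeNet, whose pooled 1024-d features are a common choice in this literature; the frame tensor is placeholder data, and the sampling rate is only an example.

import torch
import torchvision.models as models

# Pre-trained CNN used as a frozen per-frame feature extractor.
backbone = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()        # drop the classifier, keep 1024-d features
backbone.eval()

frames = torch.randn(120, 3, 224, 224)   # e.g. frames sampled at 2 fps from a video
with torch.no_grad():
    feats = backbone(frames)             # (120, 1024): one descriptor per frame
# 'feats' is what the temporal model (LSTM, attention network, ...) then consumes.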
22.-23. Some concluding remarks
The generation of ground-truth data can be an expensive and laborious process
Video summarization is a subjective task, and multiple summaries can be proposed for a video
Human annotations that vary a lot make it hard to train a method with the typical supervised training approaches
Unsupervised video summarization algorithms overcome the need for ground-truth data and can be trained using only an adequately large collection of videos
Unsupervised learning allows training a summarization method on different types of video content (TV shows, news) and then performing content-wise video summarization
Unsupervised video summarization has great advantages, increases the applicability of summarization technologies, and its potential should be investigated
Short break; coming up:
Section I.2: Discussion on a few unsupervised GAN-based methods
The SUM-GAN method [Mahasseni, 2017]
Problem formulation: video summarization via selecting a sparse subset of frames that optimally represent the video
Main idea: learn summarization by minimizing the distance between videos and a distribution of their summarizations
Goal: select a set of keyframes such that a distance between the deep representations of the selected keyframes and the video is minimized
Challenge: how to define a good distance?
Solution: use a Discriminator network and train it with the Summarizer in an adversarial manner
B. Mahasseni, M. Lam and S. Todorovic, "Unsupervised Video Summarization with Adversarial LSTM Networks," 2017 IEEE CVPR, Honolulu, HI, 2017, pp. 2982-2991, doi: 10.1109/CVPR.2017.318. (Architecture figure courtesy of Mahasseni et al.)
Training pipeline and loss functions
Deep features of video frames in the Frame Selector => normalized importance scores
Weighted features in the Encoder => latent representation e
Latent representation e in the Decoder => sequence of features for the frames of the input video
Original & reconstructed features in the Discriminator => distance estimation and binary classification as "video" or "summary"
Train the Frame Selector and Encoder by minimizing Lsparsity + Lprior + Lreconst
Train the Decoder by minimizing Lreconst + LGAN
Train the Discriminator by maximizing LGAN
Update all components via backpropagation using Stochastic Gradient Variational Bayes estimation (see the loss-level sketch below)
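To make the division of labor among these objectives concrete, here is a minimal PyTorch-style sketch of how the SUM-GAN loss terms could be computed and assigned to the three component groups. The module interfaces (selector, encoder, decoder, discriminator), the regularization factor sigma, and the feature-space MSE reconstruction loss are simplifying assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def sumgan_losses(x, selector, encoder, decoder, discriminator, sigma=0.3):
    """Loss terms of a SUM-GAN-style model for one video x of shape (T, D).

    The four modules stand in for the paper's LSTM-based components; the
    discriminator is assumed to output a probability in (0, 1).
    """
    scores = selector(x)                        # (T, 1) frame importance scores
    e, mu, logvar = encoder(scores * x)         # VAE over score-weighted features
    x_hat = decoder(e, steps=x.shape[0])        # reconstructed frame features

    l_sparsity = (scores.mean() - sigma).abs()  # keep roughly a sigma fraction of frames
    l_prior = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL to N(0, I)
    l_reconst = F.mse_loss(x_hat, x)            # simplified reconstruction loss
    d_real, d_fake = discriminator(x), discriminator(x_hat)
    l_gan = torch.mean(torch.log(d_real) + torch.log(1.0 - d_fake))

    return {
        "frame_selector_and_encoder": l_sparsity + l_prior + l_reconst,  # minimize
        "decoder": l_reconst + l_gan,                                    # minimize
        "discriminator": l_gan,                                          # maximize
    }
```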
The SUM-GAN method [Mahasseni, 2017]
Inference stage and video summarization
Deep features of video frames in the Frame Selector => normalized frame-level importance scores
Video fragmentation using KTS
Fragment-level importance scores
Key-fragment selection as a Knapsack problem (see the sketch below)
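A minimal sketch of the last two inference steps, assuming the frame-level scores and the KTS shot boundaries are already available; the function name and the 15% default budget are illustrative:

```python
import numpy as np

def select_key_fragments(frame_scores, shot_boundaries, max_ratio=0.15):
    """Select video fragments under a length budget via a 0/1 knapsack DP.

    frame_scores: (T,) per-frame importance scores from the Frame Selector.
    shot_boundaries: list of (start, end) frame-index pairs produced by KTS.
    max_ratio: summary length budget as a fraction of the video duration.
    """
    values = [float(frame_scores[s:e].mean()) for s, e in shot_boundaries]
    weights = [e - s for s, e in shot_boundaries]       # fragment lengths
    budget = int(max_ratio * len(frame_scores))

    # Classic 0/1 knapsack dynamic program over the fragments
    n = len(values)
    dp = np.zeros((n + 1, budget + 1))
    for i in range(1, n + 1):
        for w in range(budget + 1):
            dp[i, w] = dp[i - 1, w]
            if weights[i - 1] <= w:
                dp[i, w] = max(dp[i, w],
                               dp[i - 1, w - weights[i - 1]] + values[i - 1])

    # Backtrack to recover the selected fragments
    selected, w = [], budget
    for i in range(n, 0, -1):
        if dp[i, w] != dp[i - 1, w]:
            selected.append(shot_boundaries[i - 1])
            w -= weights[i - 1]
    return sorted(selected)
```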
The SUM-GAN-sl method [Apostolidis, 2019]
Builds on the SUM-GAN architecture
Contains a linear compression (LC) layer that reduces the size of the CNN feature vectors
Follows an incremental and fine-grained approach to train the model's components
E. Apostolidis, A. Metsai, E. Adamantidou, V. Mezaris, I. Patras, "A Stepwise, Label-based Approach for Improving the Adversarial Training in Unsupervised Video Summarization", Proc. 1st Int. Workshop on AI for Smart TV Content Production, Access and Delivery (AI4TV'19) at ACM Multimedia 2019, Nice, France, October 2019.
Training pipeline and loss functions: step-wise training process
Inference stage and video summarization
Deep features of video frames in the LC layer and Frame Selector => normalized frame-level importance scores
Video fragmentation using KTS
Fragment-level importance scores
Key-fragment selection as a Knapsack problem
The SUM-GAN-AAE method [Apostolidis, 2020]
Builds on the SUM-GAN-sl algorithm
Introduces an attention mechanism by replacing the VAE of SUM-GAN-sl with a deterministic attention auto-encoder
E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras, "Unsupervised Video Summarization via Attention-Driven Adversarial Learning", Proc. 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Springer LNCS vol. 11961, pp. 492-504, Jan. 2020. Best paper award.
The attention auto-encoder: processing pipeline
Weighted feature vectors are fed to the Encoder
The Encoder's output (V) and the Decoder's previous hidden state are fed to the Attention component:
for t > 1, the hidden state of the previous Decoder step (h_t-1) is used
for t = 1, the hidden state of the last Encoder step (H_e) is used
Attention weights (α_t) are computed using an energy score function and a soft-max function
α_t is multiplied with V to form the context vector v_t'
v_t' is combined with the Decoder's previous output y_t-1
The Decoder gradually reconstructs the video (see the sketch after this list)
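The pipeline above is a Bahdanau-style attention decoder; a minimal PyTorch sketch of a single decoding step follows. The layer sizes, the GRU cell, and the exact way v_t' is combined with y_t-1 are assumptions for illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    """One decoding step of an attention auto-encoder (illustrative sketch)."""

    def __init__(self, dim):
        super().__init__()
        self.energy = nn.Linear(2 * dim, 1)    # energy score function
        self.cell = nn.GRUCell(2 * dim, dim)   # decoder recurrent cell (assumed GRU)

    def forward(self, V, h_prev, y_prev):
        # V: (T, dim) Encoder outputs; h_prev: (dim,) previous Decoder hidden
        # state (the Encoder's last hidden state H_e when t = 1);
        # y_prev: (dim,) the Decoder's previous output y_t-1.
        T = V.shape[0]
        e = self.energy(torch.cat([V, h_prev.expand(T, -1)], dim=1))  # (T, 1)
        alpha = torch.softmax(e, dim=0)               # attention weights α_t
        context = (alpha * V).sum(dim=0)              # context vector v_t'
        h = self.cell(torch.cat([context, y_prev]).unsqueeze(0),
                      h_prev.unsqueeze(0)).squeeze(0)
        return h, alpha                               # new hidden state + weights
```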
Training pipeline and loss functions
Training is performed in an incremental way, as in SUM-GAN-sl
No prior loss is used, since the deterministic attention auto-encoder has no latent distribution to regularize
Inference stage and video summarization
Deep features of video frames in the LC layer and Frame Selector => normalized frame-level importance scores
Video fragmentation using KTS
Fragment-level importance scores
Key-fragment selection as a Knapsack problem
Impact of the introduced attention mechanism
Much smoother series of importance scores
Much faster and more stable training of the model
[Figures: loss curves for SUM-GAN-sl and SUM-GAN-AAE; average (over 5 splits) learning curves of SUM-GAN-sl and SUM-GAN-AAE on SumMe]
Some concluding remarks
Using GANs for video summarization:
GANs are the most common strategy for learning summarization in an unsupervised way
They offer a mechanism to build a representative summary, by making the summary-based reconstruction of the video as close as possible to the original
Summarization performance is superior to other unsupervised learning approaches (e.g. reinforcement learning) and comparable to a few supervised learning methods
Step-wise training facilitates the training of complex GAN-based architectures
The introduction of attention mechanisms is beneficial to the quality of the created summary
There is room for further improving GAN-based unsupervised video summarization via: a) combination with reinforcement learning approaches, b) extension with memory networks
Short break; coming up:
Section I.3: Datasets, evaluation protocols and results, and future directions
Datasets
Most commonly used:
SumMe (https://gyglim.github.io/me/vsum/index.html#benchmark)
25 videos capturing multiple events (e.g. cooking and sports)
video length: 1 to 6 min
annotation: fragment-based video summaries (15-18 per video)
TVSum (https://github.com/yalesong/tvsum)
50 videos from 10 categories of the TRECVid MED task
video length: 1 to 11 min
annotation: frame-level importance scores (20 per video)
Less commonly used:
Open Video Project (OVP) (https://sites.google.com/site/vsummsite/download)
50 videos of various genres (e.g. documentary, educational, historical, lecture)
video length: 1 to 4 min
annotation: keyframe-based video summaries (5 per video)
YouTube (https://sites.google.com/site/vsummsite/download)
50 videos of diverse content (e.g. cartoons, news, sports, commercials) collected from websites
video length: 1 to 10 min
annotation: keyframe-based video summaries (5 per video)
Evaluation protocols
Early approach
Agreement between an automatically-created (A) and a user-defined (U) summary is expressed by an F-Score computed over matched pairs of frames
Matching of a pair of frames is based on color histograms, the Manhattan distance and a predefined similarity threshold
80% of the video samples are used for training and the remaining 20% for testing
The final evaluation outcome is obtained by:
computing the average F-Score for a test video, given the different user summaries for this video
computing the average of the calculated F-Score values over the different test videos
Established approach
The generated summary should not exceed 15% of the video length
Agreement between an automatically-generated (A) and a user-defined (U) summary is expressed by the F-Score (%), with (P)recision and (R)ecall measuring their temporal overlap (∩) at the frame level (|| · || denotes duration)
80% of the video samples are used for training and the remaining 20% for testing
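The Precision, Recall and F-Score definitions referenced above, reconstructed here in LaTeX from the slide's description (these are the standard frame-level formulas in the video summarization literature):

```latex
P = \frac{\|A \cap U\|}{\|A\|}, \qquad
R = \frac{\|A \cap U\|}{\|U\|}, \qquad
F\text{-Score} = 2 \cdot \frac{P \cdot R}{P + R} \times 100\%
```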
Established approach - a side note
TVSum annotations need conversion from frame-level importance scores to key-fragments:
human annotations in TVSum: frame-level importance scores
video fragmentation using KTS
fragment-level importance scores
key-fragment selection as a Knapsack problem
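This conversion is essentially the inference pipeline applied to the human annotations; reusing the illustrative select_key_fragments sketch from the inference slide (user_scores and kts_boundaries are hypothetical variables holding one annotator's scores and the KTS output):

```python
# Turn one TVSum annotator's frame-level scores into a key-fragment summary
user_fragments = select_key_fragments(user_scores, kts_boundaries, max_ratio=0.15)
```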
Evaluation protocols
Established approach
Slight but important distinction w.r.t. what is eventually used as the ground-truth summary
Most used approach: the automatic summary is compared against each of the N user summaries, yielding F-Score_1, F-Score_2, ..., F-Score_N, which are then aggregated per dataset (see the evaluation sketch below):
SumMe: F-Score = max{F-Score_i}, i = 1..N
TVSum: F-Score = mean{F-Score_i}, i = 1..N
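A minimal sketch of this evaluation protocol, assuming the automatic and user summaries are given as binary frame-selection vectors; the function names are illustrative:

```python
import numpy as np

def fscore(auto, user):
    """F-Score (%) between two binary frame-selection vectors."""
    overlap = np.logical_and(auto, user).sum()
    if overlap == 0:
        return 0.0
    p, r = overlap / auto.sum(), overlap / user.sum()
    return 2 * p * r / (p + r) * 100

def evaluate(auto, user_summaries, dataset="TVSum"):
    """Aggregate the per-user F-Scores: max for SumMe, mean for TVSum."""
    scores = [fscore(auto, u) for u in user_summaries]
    return max(scores) if dataset == "SumMe" else float(np.mean(scores))
```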
Evaluation protocols
Established approach
Slight but important distinction w.r.t. what is eventually used as the ground-truth summary
Alternative approach: a single ground-truth summary is formed from the available user summaries, and a single F-Score is computed against it
Results: comparison of unsupervised methods
General remarks:
Best-performing unsupervised methods rely on Generative Adversarial Networks
The use of attention mechanisms allows the identification of the important parts of the video
The best method on TVSum (Tesselation) appears to be dataset-tailored, as it shows random-level performance on SumMe
The use of rewards and reinforcement learning is less competitive than the use of GANs
A few methods show random-level performance on at least one of the used datasets

Method            SumMe FSc (Rnk)   TVSum FSc (Rnk)   AVG Rnk
Random summary    40.2 (10)         54.4 (9)           9.5
Online Motion AE  37.7 (11)         51.5 (11)         11
SUM-FCNunsup      41.5 (8)          52.7 (10)          9
DR-DSN            41.4 (9)          57.6 (6)           7.5
EDSN              42.6 (7)          57.3 (7)           7
UnpairedVSN       47.5 (4)          55.6 (8)           6
PCDL              42.7 (6)          58.4 (4)           5
ACGAN             46.0 (5)          58.5 (3)           4
Tesselation       41.4 (7)          64.1 (1)           4
SUM-GAN-sl        47.8 (3)          58.4 (4)           3.5
SUM-GAN-AAE       48.9 (2)          58.3 (5)           3.5
CSNet             51.3 (1)          58.8 (2)           1.5
Quantitative comparison
Video #15 of TVSum: "How to Clean Your Dog's Ears - Vetoquinol USA"
Use of video summarization technologies
Tool for content adaptation / re-purposing
Developed by CERTH-ITI
Employs the GAN-based methods for unsupervised learning of [Apostolidis 2019, 2020]
Enables content adaptation for distribution via multiple communication channels
Facilitates summary creation based on the audience needs for: Twitter, Facebook (feed & stories), Instagram (feed & stories), YouTube, TikTok
E. Apostolidis, A. Metsai, E. Adamantidou, V. Mezaris, I. Patras, "A Stepwise, Label-based Approach for Improving the Adversarial Training in Unsupervised Video Summarization", Proc. 1st Int. Workshop on AI for Smart TV Content Production, Access and Delivery (AI4TV'19) at ACM Multimedia 2019, Nice, France, October 2019.
E. Apostolidis, E. Adamantidou, A. Metsai, V. Mezaris, I. Patras, "Unsupervised Video Summarization via Attention-Driven Adversarial Learning", Proc. 26th Int. Conf. on Multimedia Modeling (MMM2020), Daejeon, Korea, Springer LNCS vol. 11961, pp. 492-504, Jan. 2020.
Learns content-specific summarization
Separate models can be trained and used for different video content (e.g. TV shows)
Creating these models does not require manually-generated training data (it's (almost) for free)
Tool for content adaptation / re-purposing
Try it with your video at: http://multimedia2.iti.gr/videosummarization/service/start.html
Demo video: https://youtu.be/LbjPLJzeNII
Future directions
Analysis-oriented:
Unsupervised video summarization based on combining adversarial and reinforcement learning
Advanced attention mechanisms and memory networks for capturing long-range temporal dependencies among parts of the video
Exploiting augmented/extended training data
Introducing editorial rules in unsupervised video summarization
Examining the potential of transfer learning in video summarization
Application-oriented:
There is a lack of integrated technologies for automating video summarization; CERTH's web application is one of the first complete tools
Automated summarization that adapts to the distribution channel, the targeted audience or the video content has a strong potential!
Further applications of video summarization should be investigated by:
monitoring the modern media / social media ecosystem
identifying new application domains for content adaptation / re-purposing
translating the needs of these application domains into analysis requirements
Key references
[Apostolidis, 2019] E. Apostolidis, A. I. Metsai, E. Adamantidou, V. Mezaris, and I. Patras, “A stepwise, label-based approach for
improving the adversarial training in unsupervised video summarization,” in Proc. of the 1st Int. Workshop on AI for Smart TV
Content Production, Access and Delivery, ser. AI4TV ’19. New York, NY, USA: ACM, 2019, pp. 17–25.
[Apostolidis, 2020] E. Apostolidis, E. Adamantidou, A. I. Metsai, V. Mezaris, and I. Patras, “Unsupervised video summarization via
attention-driven adversarial learning,” in Proc. of the Int. Conf. on Multimedia Modeling. Springer, 2020, pp. 492–504.
[Bahdanau, 2015] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in
Proc. of the 3rd Int. Conf. on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track
Proceedings, Y. Bengio and Y. LeCun, Eds., 2015.
[Chen 2019] Y. Chen, L. Tao, X. Wang, and T. Yamasaki, “Weakly supervised video summarization by hierarchical reinforcement
learning,” in Proc. of the ACM Multimedia Asia, 2019, pp. 1–6.
[Cho, 2014] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder–
decoder approaches,” in Proc. of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.
Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 103–111.
[Chu, 2019] W.-T. Chu and Y.-H. Liu, “Spatiotemporal modeling and label distribution learning for video summarization,” in Proc.
of the 2019 IEEE 21st Int. Workshop on Multimedia Signal Processing (MMSP). IEEE, 2019, pp. 1–6.
[Elfeki, 2019] M. Elfeki and A. Borji, “Video summarization via actionness ranking,” in Proc. of the IEEE Winter Conference on
Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, January 7-11, 2019, Jan 2019, pp. 754–763.
[Fajtl, 2019] J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, and P. Remagnino, “Summarizing videos with attention,” in Asian
Conf. on Computer Vision (ACCV) 2019 Workshops, G. Carneiro and S. You, Eds. Cham: Springer International Publishing,
2019, pp. 39–54.
[Feng, 2018] L. Feng, Z. Li, Z. Kuang, and W. Zhang, “Extractive video summarizer with memory augmented neural networks,” in
Proc. of the 26th ACM Int. Conf. on Multimedia, ser. MM ’18. New York, NY, USA: ACM, 2018, pp. 976–983.
[Fu, 2019] T. Fu, S. Tai, and H. Chen, “Attentive and adversarial learning for video summarization,” in Proc. of the IEEE Winter
Conf. on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, January 7-11, 2019, pp. 1579–1587.
[Gonuguntla, 2019] N. Gonuguntla, B. Mandal, N. Puhan et al., “Enhanced deep video summarization network,” in Proc. of the
2019 British Machine Vision Conference (BMVC), 2019.
[Goyal, 2017] A. Goyal, N. R. Ke, A. Lamb, R. D. Hjelm, C. J. Pal, J. Pineau, and Y. Bengio, “Actual: Actor-critic under adversarial
learning,” ArXiv, vol. abs/1711.04755, 2017.
[Gygli, 2014] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool, “Creating summaries from user videos,” in Proc. of the
European Conference on Computer Vision (ECCV) 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer
International Publishing, 2014, pp. 505–520.
[Gygli, 2015] M. Gygli, H. Grabner, and L. V. Gool, “Video summarization by learning submodular mixtures of objectives,” in Proc.
of the 2015 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 3090–3098.
[Haarnoja, 2018] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep
reinforcement learning with a stochastic actor,” in Proc. of the 35th Int. Conf. on Machine Learning (ICML), 2018.
[He, 2019] X. He, Y. Hua, T. Song, Z. Zhang, Z. Xue, R. Ma, N. Robertson, and H. Guan, “Unsupervised video summarization with
attentive conditional generative adversarial networks,” in Proc. of the 27th ACM Int. Conf. on Multimedia, ser. MM ’19. New
York, NY, USA: ACM, 2019, pp. 2296–2304.
[Hochreiter, 1997] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–
1780, 1997.
[Huang, 2020] C. Huang and H. Wang, “A novel key-frames selection framework for comprehensive video summarization,” IEEE
Transactions on Circuits and Systems for Video Technology, vol. 30, no. 2, pp. 577–589, 2020.
[Ji, 2019] Z. Ji, K. Xiong, Y. Pang, and X. Li, “Video summarization with attention-based encoder-decoder networks,” IEEE
Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2019.
[Jung, 2019] Y. Jung, D. Cho, D. Kim, S. Woo, and I. S. Kweon, “Discriminative feature learning for unsupervised video
summarization,” in Proc. of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8537–8544.
[Kaufman, 2017] D. Kaufman, G. Levi, T. Hassner, and L. Wolf, “Temporal tessellation: A unified approach for video analysis,” in
Proc. of the 2017 IEEE Int. Conf. on Computer Vision (ICCV), Oct 2017, pp. 94–104.
[Kulesza, 2012] A. Kulesza and B. Taskar, Determinantal Point Processes for Machine Learning. Hanover, MA, USA: Now
Publishers Inc., 2012.
[Lal, 2019] S. Lal, S. Duggal, and I. Sreedevi, “Online video summarization: Predicting future to better summarize present,” in
Proc. of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2019, pp. 471–480.
[Lebron Casas, 2019] L. Lebron Casas and E. Koblents, “Video summarization with LSTM and deep attention models,” in
MultiMedia Modeling, I. Kompatsiaris, B. Huet, V. Mezaris, C. Gurrin, W.-H. Cheng, and S. Vrochidis, Eds. Cham: Springer
International Publishing, 2019, pp. 67–79.
[Liu, 2019] Y.-T. Liu, Y.-J. Li, F.-E. Yang, S.-F. Chen, and Y.-C. F. Wang, “Learning hierarchical self-attention for video
summarization,” in Proc. of the 2019 IEEE Int. Conf. on Image Processing (ICIP). IEEE, 2019, pp. 3377–3381.
[Mahasseni, 2017] B. Mahasseni, M. Lam, and S. Todorovic, “Unsupervised video summarization with adversarial LSTM
networks,” in Proc. of the 2017 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2982–
2991.
[Otani, 2016] M. Otani, Y. Nakashima, E. Rahtu, J. Heikkilä, and N. Yokoya, "Video summarization using deep semantic features," in Proc. of the 13th Asian Conference on Computer Vision (ACCV'16), 2016.
[Panda, 2017] R. Panda, A. Das, Z. Wu, J. Ernst, and A. K. Roy-Chowdhury, “Weakly supervised summarization of web videos,” in
Proc. of the 2017 IEEE Int. Conf. on Computer Vision (ICCV), Oct 2017, pp. 3677–3686.
[Pfau, 2016] D. Pfau and O. Vinyals, “Connecting generative adversarial networks and actor-critic methods,” in NIPS Workshop
on Adversarial Training, 2016.
[Potapov, 2014] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid, “Category-specific video summarization,” in Proc. of the
European Conference on Computer Vision (ECCV) 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer
International Publishing, 2014, pp. 540–555.
[Rochan, 2018] M. Rochan, L. Ye, and Y. Wang, “Video summarization using fully convolutional sequence networks,” in Proc. of
the European Conference on Computer Vision (ECCV) 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds. Cham:
Springer International Publishing, 2018, pp. 358–374.
[Rochan, 2019] M. Rochan and Y. Wang, “Video summarization by learning from unpaired data,” in Proc. of the 2019 IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[Savioli, 2019] N. Savioli, “A hybrid approach between adversarial generative networks and actor-critic policy gradient for low
rate high-resolution image compression,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, CVPR Workshops 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 2019.
[Smith, 2017] J. R. Smith, D. Joshi, B. Huet, W. Hsu, and J. Cota, “Harnessing A.I. for Augmenting Creativity: Application to Movie
Trailer Creation,” in Proc. of the 25th ACM Int. Conf. on Multimedia, ser. MM ’17. New York, NY, USA: ACM, 2017, pp. 1799–
1808.
[Song, 2015] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, "TVSum: Summarizing web videos using titles," in Proc. of the 2015 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 5179–5187.
[Song, 2016] X. Song, K. Chen, J. Lei, L. Sun, Z. Wang, L. Xie, and M. Song, “Category driven deep recurrent neural network for
video summarization,” in Proc. of the 2016 IEEE Int. Conf. on Multimedia Expo Workshops (ICMEW), July 2016, pp. 1–6.
[Szegedy, 2015] C. Szegedy, Wei Liu, Yangqing Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A.
Rabinovich, “Going deeper with convolutions,” in Proc. of the 2015 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), June 2015, pp. 1–9.
[Vinyals, 2015] O. Vinyals, M. Fortunato, and N. Jaitly, “Pointer networks,” in Advances in Neural Information Processing Systems
28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2692–2700.
[Wang, 2019] J. Wang, W. Wang, Z. Wang, L. Wang, D. Feng, and T. Tan, “Stacked memory network for video summarization,” in
Proc. of the 27th ACM Int. Conf. on Multimedia, ser. MM ’19. New York, NY, USA: ACM, 2019, pp. 836–844.
[Wang, 2016] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good
practices for deep action recognition,” in Proc. of the European Conference on Computer Vision – ECCV 2016, B. Leibe, J.
Matas, N. Sebe, and M. Welling, Eds. Cham: Springer International Publishing, 2016, pp. 20–36.
[Wei, 2018] H. Wei, B. Ni, Y. Yan, H. Yu, X. Yang, and C. Yao, “Video summarization via semantic attended networks,” in Proc. of
the 2018 AAAI Conf. on Artificial Intelligence (AAAI), 2018.
[Yu, 2017] L. Yu, W. Zhang, J. Wang, and Y. Yu, “SeqGAN: Sequence generative adversarial nets with policy gradient,” in Proc. of
the 2017 AAAI Conf. on Artificial Intelligence, ser. (AAAI). AAAI Press, 2017, pp. 2852–2858.
[Yuan, 2019a] L. Yuan, F. E. H. Tay, P. Li, L. Zhou, and J. Feng, “Cycle-SUM: Cycle-consistent adversarial lstm networks for
unsupervised video summarization,” in Proc. of the 2019 AAAI Conf. on Artificial Intelligence (AAAI), 2019.
[Yuan, 2019b] Y. Yuan, T. Mei, P. Cui, and W. Zhu, “Video summarization by learning deep side semantic embedding,” IEEE
Transactions on Circuits and Systems for Video Technology, vol. 29, no. 1, pp. 226–237, Jan 2019.
[Yuan, 2019c] Y. Yuan, H. Li, and Q. Wang, "Spatiotemporal modeling for video summarization using convolutional recurrent neural network," IEEE Access, vol. 7, pp. 64676–64685, 2019.
[Zhang, 2016a] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Summary transfer: Exemplar-based subset selection for video
summarization,” in Proc. of the 2016 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp.
1059–1067.
[Zhang, 2016b] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Video summarization with long short-term memory,” in Proc. of
the European Conference on Computer Vision (ECCV) 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham: Springer
International Publishing, 2016, pp. 766–782.
[Zhang, 2018] Y. Zhang, X. Liang, D. Zhang, M. Tan, and E. P. Xing, “Unsupervised object-level video summarization with online
motion auto-encoder,” Pattern Recognition Letters, 2018.
[Zhang, 2019] Y. Zhang, M. Kampffmeyer, X. Zhao, and M. Tan, “DTR-GAN: Dilated temporal relational adversarial network for
video summarization,” in Proc. of the ACM Turing Celebration Conference - China, ser. ACM TURC ’19. New York, NY, USA:
ACM, 2019, pp. 89:1–89:6.
[Zhao, 2017] B. Zhao, X. Li, and X. Lu, “Hierarchical recurrent neural network for video summarization,” in Proc. of the 2017 ACM
on Multimedia Conference, ser. MM ’17. New York, NY, USA: ACM, 2017, pp. 863–871.
[Zhao, 2018] B. Zhao, X. Li, and X. Lu, "HSA-RNN: Hierarchical structure-adaptive RNN for video summarization," in Proc. of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7405–7414.
[Zhao, 2019] B. Zhao, X. Li, and X. Lu, “Property-constrained dual learning for video summarization,” IEEE Transactions on
Neural Networks and Learning Systems, 2019.
[Zhou, 2018a] K. Zhou, T. Xiang, and A. Cavallaro, “Video summarisation by classification with deep reinforcement learning,” in
Proc. of the 2018 British Machine Vision Conference (BMVC), 2018.
[Zhou, 2018b] K. Zhou and Y. Qiao, “Deep reinforcement learning for unsupervised video summarization with diversity-
representativeness reward,” in Proc. of the 2018 AAAI Conference on Artificial Intelligence (AAAI), 2018.
Vasileios Mezaris
bmezaris@iti.gr
Evlampios Apostolidis
apostolid@iti.gr
CERTH-ITI, Greece
info@retv-project.eu
This work has received funding from the
European Union’s Horizon 2020 research
and innovation programme under grant
agreement H2020-780656 ReTV
Questions?
Following the Q&A session and the break, we will be back with Part II of the tutorial, on video summaries re-use and recommendation.