Presentation of the paper "Gated-ViGAT: Efficient Bottom-Up Event Recognition and Explanation Using a New Frame Selection Policy and Gating Mechanism", by N. Gkalelis, D. Daskalakis, V. Mezaris, delivered at IEEE ISM 2022, Dec. 2022, Naples, Italy.
In this paper, Gated-ViGAT, an efficient approach for video event recognition that utilizes bottom-up (object) information, a new frame sampling policy and a gating mechanism, is proposed. Specifically, the frame sampling policy uses weighted in-degrees (WiDs), derived from the adjacency matrices of graph attention networks (GATs), and a dissimilarity measure to select the most salient and, at the same time, diverse frames representing the event in the video. Additionally, the proposed gating mechanism fetches the selected frames sequentially and commits to an early exit when an adequately confident decision is achieved. In this way, only a few frames are processed by the computationally expensive branch of our network that is responsible for the bottom-up information extraction. The experimental evaluation on two large, publicly available video datasets (MiniKinetics, ActivityNet) demonstrates that Gated-ViGAT provides a large computational complexity reduction in comparison to our previous approach (ViGAT), while maintaining excellent event recognition and explainability performance.
Gated-ViGAT
1. Gated-ViGAT: Efficient Bottom-Up Event Recognition and Explanation Using a New Frame Selection Policy and Gating Mechanism
Nikolaos Gkalelis, Dimitrios Daskalakis, Vasileios Mezaris
CERTH-ITI, Thermi - Thessaloniki, Greece
IEEE Int. Symposium on Multimedia,
Naples, Italy, Dec. 2022
2. Introduction
• The recognition of high-level events in unconstrained video is an important topic with applications in security (e.g. "making a bomb"), the automotive industry (e.g. "pedestrian crossing the street"), etc.
• Most approaches are top-down: they "patchify" the frame (context agnostic) and use the label and loss function to learn to focus on the frame regions related to the event
• Bottom-up approaches: use an object detector, a feature extractor and a graph network to extract and process features from the main objects in the video
[Example figure: video event "walking the dog"]
3. ViGAT
• Our recent bottom-up approach with SOTA performance on many datasets
• Uses a graph attention network (GAT) head to process local (object) & global (frame) information
• Also provides frame/object-level explanations (in contrast to top-down approaches)
[Example figure: video event "removing ice from car" miscategorized as "shoveling snow"; the object-level explanation shows that the classifier does not focus on the car object]
4. ViGAT block
• Cornerstone of the ViGAT head; transforms a feature matrix (representing the graph's nodes) into a feature vector (representing the whole graph)
• Computes the explanation significance (weighted in-degrees, WiDs) of each node using the graph's adjacency matrix
[Block diagram, GAT head and graph pooling: the attention mechanism computes an attention matrix from the node features X (K x F) and an adjacency matrix A (K x K) from the attention coefficients; multiplying the node features with the adjacency matrix gives Z (K x F), and graph pooling produces the vector representation of the graph η (1 x F)]
WiDs, the explanation significance of the l-th node: $\varphi_l = \sum_{k=1}^{K} a_{k,l}, \quad l = 1, \dots, K$
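To make the graph pooling and WiD computation above concrete, here is a minimal NumPy sketch of how an adjacency matrix built from attention coefficients yields a graph-level vector and per-node WiDs. This is not the authors' code: the dot-product attention and the mean pooling are illustrative assumptions standing in for the learned ViGAT block.

import numpy as np

def vigat_block_sketch(X):
    """Illustrative GAT-head-style graph pooling.
    X: (K, F) node-feature matrix. Returns the graph vector eta (1, F)
    and the per-node weighted in-degrees (WiDs) phi (K,)."""
    scores = X @ X.T                                                 # hypothetical attention scores; ViGAT learns these
    scores = scores - scores.max(axis=1, keepdims=True)              # numerical stability
    A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # row-wise softmax -> adjacency matrix A (K x K)
    Z = A @ X                                                        # Z (K x F): node features mixed through A
    eta = Z.mean(axis=0, keepdims=True)                              # eta (1 x F): graph-level vector (mean pooling assumed)
    phi = A.sum(axis=0)                                              # WiDs: phi_l = sum_k a_{k,l}
    return eta, phi

eta, phi = vigat_block_sketch(np.random.rand(50, 768))               # e.g. K=50 objects, F=768-dim features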
5. ViGAT architecture
[Architecture diagram: the object detector o extracts K objects from each of the P video frames and the feature extractor b produces object-level features as well as frame-level local and global features; the GAT block ω1 (global branch) pools the frame-level global features into a video-level global feature, while the GAT blocks ω2 and ω3 (local branch) pool the object-level features into frame-level local features and then into a video-level local feature; the two video-level features are concatenated into the video feature and fed to the classification head u. Frame WiDs (global and local info) and object WiDs provide the explanation: the event-supporting frames and objects. Example output: recognized event "Playing beach volleyball!"]
o: object detector
b: feature extractor
u: classification head
GAT blocks: ω1, ω2, ω3
Global branch: ω1
Local branch: ω2, ω3
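A rough pseudo-code sketch of the two-branch pipeline summarized in the legend above; the function names (o, b, omega1-3, u) mirror the legend, but their exact interfaces are assumptions and not the repository's API.

import numpy as np

def vigat_forward_sketch(frames, o, b, omega1, omega2, omega3, u):
    """Illustrative two-branch ViGAT forward pass over P frames.
    o: object detector, b: feature extractor, omega1-3: GAT blocks, u: classification head."""
    # Global branch: one feature per frame, pooled by omega1 into the video-level global feature.
    frame_global = np.stack([b(f) for f in frames])            # (P, F)
    video_global, frame_wids_global = omega1(frame_global)
    # Local branch: K object features per frame; omega2 pools objects per frame, omega3 pools over frames.
    frame_local, object_wids = [], []
    for f in frames:
        obj_feats = np.stack([b(crop) for crop in o(f)])       # (K, F)
        z, wids = omega2(obj_feats)                            # frame-level local feature + object WiDs
        frame_local.append(z)
        object_wids.append(wids)
    video_local, frame_wids_local = omega3(np.stack(frame_local))
    # Event scores from the concatenated video-level features; the WiDs act as the explanation.
    scores = u(np.concatenate([video_global, video_local]))
    return scores, frame_wids_global, frame_wids_local, object_wids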
6. ViGAT
• ViGAT has a high computational cost due to the local (object) information processing (e.g., P=120 frames, K=50 objects per frame, PK=6000 objects/video)
• Efficient video processing has been investigated in the top-down (frame) paradigm:
- Frame selection policy: identify the most important frames for classification
- Gating component: stop processing frames when sufficient evidence is achieved
• Unexplored topic in the bottom-up paradigm: can we use such techniques to reduce the computational complexity of ViGAT's local processing pipeline?
7. Gated-ViGAT
[Architecture diagram, local information processing pipeline: the frame selection policy uses the computed video-level global feature and the computed frame WiDs (global info) to pick Q(s) of the P extracted video frames; the object detector o, the feature extractor b and the GAT blocks ω2, ω3 process only these Q(s) frames to produce the video-level local feature ζ(s) (and Z(s)), the frame WiDs (local info) and the object WiDs; gate g(s) then decides ON/OFF on Z(s). Gate is closed: request Q(s+1) - Q(s) additional frames. Gate is open: the video-level global and local features are concatenated and fed to the classification head u. Example output: recognized event "Playing beach volleyball!", explained by the event-supporting frames and objects]
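The early-exiting control flow of this pipeline can be sketched as follows; select_frames, local_branch, gates and classify are placeholders for the corresponding Gated-ViGAT components (not the repository's API), and the default Q values are the ActivityNet budgets reported later in the experiments.

def gated_vigat_inference_sketch(video_global_feat, frame_wids_global, frames,
                                 select_frames, local_branch, gates, classify,
                                 Q=(9, 12, 16, 20, 25, 30)):
    """Early-exiting inference: process frames in increments Q(1) < ... < Q(S) and
    stop as soon as a gate decides that the accumulated evidence is sufficient."""
    for s, q in enumerate(Q):
        selected = select_frames(frames, frame_wids_global, q)   # salient + diverse frames (policy of the next slide)
        zeta_s, Z_s = local_branch(selected)                     # expensive bottom-up (object) branch, only Q(s) frames
        # gates[s] is a GAT-block-like binary classifier: True = open (exit), False = closed (fetch more frames)
        if gates[s](Z_s) or s == len(Q) - 1:
            return classify(video_global_feat, zeta_s)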
8. Gated-ViGAT: Frame selection policy
• Iterative algorithm to select Q frames
Input: Q, frame index p1, P frame-level global feature vectors
1. Initialize: the frame WiDs (global info) u1, ..., uP and the frame-level global features γ1, ..., γP are min-max normalized, and each feature is L2-normalized, $\gamma_p = \gamma_p / \lVert \gamma_p \rVert$; p1 is the frame with the largest WiD (argmax)
2. Select the remaining Q-1 frames, iterating for i = 2 to Q-1:
- compute the dissimilarity of each frame to the previously selected one, $\alpha_p = \tfrac{1}{2}\,(1 - \gamma_p^{\top} \gamma_{p_{i-1}})$
- reweight the frame scores, $u_p = \alpha_p\, u_p$, and select the next frame as $p_i = \arg\max_p u_p$
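A minimal NumPy sketch of this salient-and-diverse selection loop. The α and γ updates follow the formulas above, while the exact normalization order and the no-reselection rule are assumptions.

import numpy as np

def select_frames_sketch(wids, feats, Q):
    """Greedy salient-and-diverse frame selection (illustrative).
    wids: (P,) frame WiDs from the global branch; feats: (P, F) frame-level global features."""
    u = (wids - wids.min()) / (wids.max() - wids.min() + 1e-12)           # min-max normalized saliency scores
    g = (feats - feats.min(0)) / (feats.max(0) - feats.min(0) + 1e-12)    # min-max normalized features
    g = g / (np.linalg.norm(g, axis=1, keepdims=True) + 1e-12)            # gamma_p = gamma_p / |gamma_p|
    selected = [int(np.argmax(u))]                                        # p1: most salient frame
    while len(selected) < Q:
        alpha = 0.5 * (1.0 - g @ g[selected[-1]])    # alpha_p = (1/2)(1 - gamma_p^T gamma_{p_{i-1}})
        u = alpha * u                                # u_p = alpha_p * u_p: penalize frames similar to the last pick
        scores = u.copy()
        scores[selected] = -np.inf                   # never re-select a frame (assumption)
        selected.append(int(np.argmax(scores)))
    return selected

idx = select_frames_sketch(np.random.rand(120), np.random.rand(120, 768), Q=9)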
9. Gated-ViGAT: Gate training
• Each gate g(s) has a GAT block-like structure and a binary classification head (open/close); it corresponds to a specified number of frames Q(s) and is trained to output 1 (i.e. open) when the ViGAT loss is low; design hyperparameters: Q(s), β (sensitivity)
Training procedure (per video, for each gate s = 1, ..., S):
- Use the frame selection policy to select Q(s) frames for gate g(s)
- Compute the video-level local feature ζ(s) (and Z(s)) with the local ViGAT branch
- Compute the ViGAT classification loss against the ground-truth label: $l_{ce} = \mathrm{CE}(\text{label}, y)$
- Derive the pseudolabel o(s): 1 if $l_{ce} \le \beta e^{s/2}$, zero otherwise
- Compute the gate component loss: $L = \frac{1}{S} \sum_{s=1}^{S} l_{bce}\big(g^{(s)}(\mathbf{Z}^{(s)}), o^{(s)}\big)$ (binary cross-entropy)
- Perform backpropagation to update the gate weights
[Diagram: the local ViGAT branch feeds Z(s) to gate g(s); ζ(s) is concatenated with the computed video-level global feature into the video feature for the classification head u, whose output y is compared to the ground-truth label via cross-entropy; the resulting pseudolabel o(s) supervises the gate via binary cross-entropy]
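A hedged PyTorch-style sketch of the pseudolabel and gate-loss computation described above; module and tensor names are illustrative, the exponential threshold follows the slide's β e^{s/2} expression, and a single-label cross-entropy is assumed for brevity.

import torch
import torch.nn.functional as F

def gate_training_step_sketch(gates, Z_list, logits, label, beta=1e-8):
    """Pseudolabel + gate-loss computation for one video (illustrative, single-label case).
    gates: S gate modules g(s), each mapping Z(s) to one logit; Z_list: the S tensors Z(s)
    from the (frozen) local ViGAT branch; logits: ViGAT class scores y; label: class index."""
    S = len(gates)
    l_ce = F.cross_entropy(logits.unsqueeze(0), label.unsqueeze(0))       # ViGAT classification loss
    loss = 0.0
    for s in range(1, S + 1):
        # Pseudolabel o(s): open the gate (1) when the classification loss is already low enough.
        o_s = (l_ce <= beta * torch.exp(torch.tensor(s / 2.0))).float()
        gate_logit = gates[s - 1](Z_list[s - 1]).squeeze()
        loss = loss + F.binary_cross_entropy_with_logits(gate_logit, o_s)
    loss = loss / S
    loss.backward()   # only the gate weights are updated; the ViGAT components stay frozen
    return loss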
10. Experiments
• ActivityNet v1.3: 200 events/actions, 10K/5K training/testing videos, 5 to 10 mins duration; multilabel
• MiniKinetics: 200 events/actions, 80K/5K training/testing videos, 10 secs duration; single-label
• Video representation: 120/30 frames with uniform sampling for ActivityNet/MiniKinetics
• Pretrained ViGAT components: Faster R-CNN (pretrained/finetuned on Imagenet1K/VG, K=50 objects), ViT-B/16 backbone (pretrained/finetuned on Imagenet11K/Imagenet1K), 3 GAT blocks (pretrained on the respective dataset, i.e., ActivityNet or MiniKinetics)
• Gates: S = 6 / 5 (number of gates), {Q(s)} = {9, 12, 16, 20, 25, 30} / {2, 4, 6, 8, 10} (sequence lengths), for ActivityNet/MiniKinetics
• Gate training hyperparameters: β = 10^-8, epochs = 40, lr = 10^-4 multiplied by 0.1 at epochs 16, 35
• Evaluation measures: mAP (ActivityNet), top-1 accuracy (MiniKinetics), FLOPs
• Gated-ViGAT is compared against top-scoring methods on the two datasets
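For reference, the gate-related settings listed above could be collected into a small configuration like the following (a sketch; the key names are not taken from the repository):

# Hypothetical configuration mirroring the gate settings reported on this slide.
GATE_CONFIG = {
    "ActivityNet":  {"S": 6, "Q": [9, 12, 16, 20, 25, 30], "frames": 120, "measure": "mAP"},
    "MiniKinetics": {"S": 5, "Q": [2, 4, 6, 8, 10],        "frames": 30,  "measure": "top-1 accuracy"},
    "beta": 1e-8,                                           # gate sensitivity
    "epochs": 40,
    "lr": 1e-4,
    "lr_decay": {"factor": 0.1, "milestones": [16, 35]},
}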
11. Experiments: results
Methods in MiniKinetics Top-1%
TBN [30] 69.5
BAT [7] 70.6
MARS (3D ResNet) [31] 72.8
Fast-S3D (Inception) [14] 78
ATFR (X3D-S) [18] 78
ATFR (R(2+1D)) [18] 78.2
RMS (SlowOnly) [28] 78.6
ATFR (I3D) [18] 78.8
Ada3D (I3D, Kinetics) [32] 79.2
ATFR (3D Resnet) [18] 79.3
CGNL (Modified ResNet) [17] 79.5
TCPNet (ResNet, Kinetics) [3] 80.7
LgNet (R3D) [3] 80.9
FrameExit (EfficientNet) [1] 75.3
ViGAT [9] 82.1
Gated-ViGAT (proposed) 81.3
• Gated-ViGAT outperforms all top-down approaches
• Slightly underperforms ViGAT, but with approx. 4x (MiniKinetics) and 5.5x (ActivityNet) FLOPs reduction
• As expected, it has higher computational complexity than many top-down approaches (e.g. see [3], [4]) but can provide explanations
Methods in ActivityNet mAP%
AdaFrame [21] 71.5
ListenToLook [23] 72.3
LiteEval [33] 72.7
SCSampler [25] 72.9
AR-Net [13] 73.8
FrameExit [1] 77.3
AR-Net (EfficientNet) [13] 79.7
MARL (ResNet, Kinetics) [22] 82.9
FrameExit (X3D-S) [1] 87.4
ViGAT [9] 88.1
Gated-ViGAT (proposed) 87.3
FLOPS in 2 datasets ViGAT Gated-ViGAT
ActivityNet 137.4 24.8
MiniKinetics 34.4 8.7
*Best and second-best performance are denoted with bold and underline
12. Experiments: method insight
• Computed the number of videos processed and the recognition performance at each gate
• Average number of frames per video for ActivityNet / MiniKinetics: 20 / 7
• The recognition rate drops as the gate number increases; this behavior is shown more clearly on ActivityNet (longer videos)
• Conclusion: most "easy" videos exit early, while "difficult" videos remain difficult to recognize even with many frames (similar conclusion as [1])
ActivityNet g(1) g(2) g(3) g(4) g(5) g(6)
# frames 9 12 16 20 25 30
# videos 793 651 722 502 535 1722
mAP% 99.8 94.5 93.8 92.7 86 71.6
MiniKinetics g(1) g(2) g(3) g(4) g(5)
# frames 2 4 6 8 10
# videos 179 686 1199 458 2477
Top-1% 84.9 83 81.1 84.9 80.7
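As a quick sanity check of the average-frames figure above, weighting the per-gate frame budgets by the number of videos exiting at each gate reproduces roughly 20 frames per video for ActivityNet:

# Weighted average of frames per video on ActivityNet, computed from the per-gate table above.
frames = [9, 12, 16, 20, 25, 30]
videos = [793, 651, 722, 502, 535, 1722]
avg = sum(f * v for f, v in zip(frames, videos)) / sum(videos)
print(round(avg, 1))   # ~20.6 frames per video on average, vs. 120 frames for the original ViGAT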
13. Experiments: examples
• Bullfighting (top) and Cricket (bottom) test videos of ActivityNet exited at the first gate, i.e., they were recognized using only 9 frames out of the 120 required by ViGAT
• The frames selected with the proposed policy both explain the recognition result and provide a diverse view of the video, helping to recognize the video with fewer frames
[Example figures: selected frames for "Bullfighting" and "Cricket"]
14. Experiments: examples
• Gated-ViGAT can also provide explanations at the object level (in contrast to top-down methods)
[Example figures: "Waterskiing" predicted as "Making a sandwich"; "Playing accordion" predicted as "Playing guitarra"; "Breakdancing" (correct prediction)]
15. Experiments: ablation study on frame selection policies
• Comparison (mAP%) on ActivityNet
Policy / #frames 10 20 30
Random 83 85.5 86.5
WiD-based 84.9 86.1 86.9
Random on local 85.4 86.6 86.9
WiD-based on local 86.6 87.1 87.5
FrameExit policy 86.2 87.3 87.5
Proposed policy 86.7 87.3 87.6
Gated-ViGAT (proposed) 86.8 87.5 87.7
• Gated-ViGAT selects diverse frames with high explanation potential
• The proposed policy is second best (surpassing FrameExit [1], the current SOTA)
Random: Θ frames selected randomly for the local/global features
WiD-based: Θ frames selected using the global WiDs
Random on local: P frames derive the global feature; Θ frames selected randomly
WiD-based on local: P frames derive the global feature; Θ frames selected using the global WiDs
FrameExit policy: Θ frames selected using the policy in [1]
Proposed policy: P frames derive the global feature; Θ frames selected using the proposed policy
Gated-ViGAT: in addition to the above, the gate component selects Θ frames on average
16. Experiments: ablation study example
• Top-6 frames of a "bungee jumping" video selected with the WiD-based vs the proposed policy
[Example figure: frames selected by the WiD-based policy and by the proposed policy, together with the updated WiDs]
17. Conclusions
• An efficient bottom-up event recognition and explanation approach was presented
• It utilizes a new policy algorithm to select frames that a) best explain the classifier's decision and b) provide diverse information about the underlying event
• It utilizes a gating mechanism that instructs the model to stop extracting bottom-up (object) information when sufficient evidence of the event has been gathered
• Evaluation on 2 datasets showed competitive recognition performance and approx. 5x FLOPs reduction in comparison to the previous SOTA
• Future work: investigations for further efficiency improvements, e.g. a faster object detector and feature extractor, frame selection also for the global information pipeline, etc.
18. Thank you for your attention!
Questions?
Nikolaos Gkalelis, gkalelis@iti.gr
Vasileios Mezaris, bmezaris@iti.gr
Code publicly available at: https://github.com/bmezaris/Gated-ViGAT
This work was supported by the EU's Horizon 2020 research and innovation programme under grant agreement 101021866 CRiTERIA.