The paper “Query and Keyframe Representations for Ad-hoc Video Search” describes a fully automatic method that combines video concept detection and textual query analysis in order to solve the problem of ad-hoc video search.
MediaEval 2016: LAPI at Predicting Media Interestingness Task (multimediaeval)
The document describes the LAPI team's approach for the MediaEval 2016 predicting media interestingness task. They used a classical machine learning approach where descriptors were generated for videos and images, and support vector machines (SVMs) were trained on the development set. The best SVM-descriptor combinations were identified on the development set and used to predict the test set. Descriptors included HSV histograms, HOG, SIFT, LBP, GIST, and AlexNet layers. SVMs with linear, polynomial, and RBF kernels were tested. The best results on the development set had a MAP of 0.214 for images and 0.179 for videos. On the test set, their best run was above the estimated MAP on
Automatic Building Detection for Satellite Images using IGV and DSM (Amit Raikar)
This document presents a method for automatic building detection from satellite images using internal gray variance (IGV) and digital surface model (DSM). The proposed method aims to detect low-rising buildings and buildings with partially bright and partially dark rooftops more accurately than existing methods. The key steps include image enhancement, IGV feature extraction, seed point detection using the enhanced image and IGV, clustering using DSM data, binarization, thinning, shadow detection, and segmentation. Results on test satellite images show the method achieves higher detection percentages and lower branch factors than an existing method.
MediaEval 2016 - MLPBOON Predicting Media Interestingness System (multimediaeval)
Presenter: Jayneel Parekh
The MLPBOON Predicting Media Interestingness System for MediaEval 2016 In Working Notes Proceedings of the MediaEval 2016 Workshop, Hilversum, Netherlands, October 20-21, CEUR-WS.org (2016) by Jayneel Parekh, Sanjeel Parekh
Paper: http://ceur-ws.org/Vol-1739/MediaEval_2016_paper_25.pdf
Video: https://youtu.be/nAnrdYiy7nc
Abstract: This paper describes the system developed by team MLPBOON for MediaEval 2016 Predicting Media Interestingness Image Subtask. After experimenting with various features and classifiers on the development dataset, our final system involves use of CNN features (fc7 layer of AlexNet) for the input representation and logistic regression as the classifier. For the proposed method, the MAP for the best run reaches a value of 0.229.
MediaEval 2016 - UNIFESP Predicting Media Interestingness Task (multimediaeval)
Presenter: Samuel G. Fadel
UNIFESP at MediaEval 2016: Predicting Media Interestingness Task In Working Notes Proceedings of the MediaEval 2016 Workshop, Hilversum, Netherlands, October 20-21, CEUR-WS.org (2016) by Jurandy Almeida
Paper: http://ceur-ws.org/Vol-1739/MediaEval_2016_paper_28.pdf
Video: https://youtu.be/YLthKNczlcA
Abstract: This paper describes the approach proposed by UNIFESP for the MediaEval 2016 Predicting Media Interestingness Task and for its video subtask only. The proposed approach is based on combining learning-to-rank algorithms for predicting the interestingness of videos by their visual content.
1) The document presents a method for detecting building damage from very high resolution satellite images using one-class SVM classification and shadow information.
2) Initial building damage is detected using one-class SVM classification on multitemporal images. Shadows are then detected and changes in shadows over time are identified.
3) The initial damage detection results are refined by considering areas of shadow change, removing detections not near shadow changes. This combined method improved damage detection accuracy over using spectral data alone.
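The two-stage idea in 1)-3) can be sketched in code: an initial one-class SVM flags candidate damage pixels, and the result is refined by keeping only candidates near shadow-change areas. This is a minimal illustration with synthetic data; the paper's actual spectral features, SVM settings, and proximity criterion are not reproduced here.

```python
# Hedged sketch of one-class SVM damage detection refined by shadow change.
# All data is synthetic; features and thresholds are illustrative only.
import numpy as np
from sklearn.svm import OneClassSVM
from scipy.ndimage import binary_dilation

rng = np.random.default_rng(0)

# Toy "spectral" pixels: train the one-class SVM on undamaged samples only.
undamaged = rng.normal(0.0, 1.0, size=(200, 3))
svm = OneClassSVM(nu=0.1, gamma="scale").fit(undamaged)

# Score a 10x10 test image; -1 = outlier = candidate damage.
test_pixels = rng.normal(0.0, 1.0, size=(100, 3))
test_pixels[:20] += 4.0                     # injected "damage" pixels (rows 0-1)
candidate = (svm.predict(test_pixels) == -1).reshape(10, 10)

# Shadow-change mask (here placed over the same region, as if shadows changed).
shadow_change = np.zeros((10, 10), dtype=bool)
shadow_change[0:2, :] = True

# Refinement: keep only candidates within 1 pixel of a shadow change.
near_shadow = binary_dilation(shadow_change, iterations=1)
refined = candidate & near_shadow

print(int(candidate.sum()), int(refined.sum()))
```

The refinement can only remove detections, never add them, which matches the summary's description of pruning detections not near shadow changes.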
PR-232: AutoML-Zero: Evolving Machine Learning Algorithms From Scratch (Sunghoon Joo)
PR-232: AutoML-Zero: Evolving Machine Learning Algorithms From Scratch
Paper link: https://arxiv.org/abs/2003.03384
Video presentation link: https://youtu.be/J__uJ79m01Q
This document discusses applications of computer vision and provides examples. It defines computer vision as acquiring, processing, analyzing and understanding images from the real world. It notes common programming languages and libraries used like OpenCV. It then provides two examples: using camera and computer vision to track a simple pendulum experiment, and using OCR to automatically grade surveys by detecting answers. Algorithms and flowcharts are provided for both applications. The conclusion discusses more potential science experiments and uses of computer vision in education.
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization (Sunghoon Joo)
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
[Guo et al., ICML 2020]
Paper link: https://arxiv.org/abs/1908.10396
Video presentation link: https://youtu.be/cU46yR-A0cs
reviewed by Sunghoon Joo
Fast Non-Uniform Filtering with Symmetric Weighted Integral Images (davidmarimon)
Oral presentation at IEEE International Conference on Image Processing (ICIP), Hong Kong, September 2010.
Abstract: Non-uniform filters are frequently used in many image processing applications to describe regions or to detect specific features. However, non-uniform filtering is a computationally complex task. This paper presents a method to perform fast non-uniform filtering using a reduced number of memory accesses. The idea is based on integral images, which are commonly used for box or Haar wavelet filtering. The disadvantage of those filters for several applications is their uniform shape. We describe a method to build Symmetric Weighted Integral Images that are tailored for a variety of kernels, and the process to perform fast filtering with them. By reducing the computational complexity, we obtain a relevant speedup compared to Kernel Integral Images and a large one compared to conventional non-uniform filtering.
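A minimal illustration of the integral-image idea the paper builds on: once the integral image is computed, any axis-aligned box sum takes four lookups, so a box filter costs O(1) per pixel regardless of kernel size. The paper's symmetric weighted variant extends this to non-uniform kernels; only the plain box case is sketched here.

```python
# Box sums via an integral image (summed-area table).
import numpy as np

def integral_image(img):
    # Zero-padded cumulative sum so box_sum needs no boundary checks.
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    # Sum of img[r0:r1, c0:c1] from four integral-image lookups.
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 3, 3))   # -> 30.0, equals img[1:3, 1:3].sum()
```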
Object Discovery using CNN Features in Egocentric Videos (Marc Bolaños Solà)
This document proposes a method for object discovery in egocentric videos using convolutional neural networks (CNN). The method aims to characterize the environment of the person wearing an egocentric camera. It uses an objectness detector to sample object candidates, extracts CNN features to represent objects, and employs a refill strategy and clustering to discover new concepts in an iterative manner. The method is validated on a dataset of 1,000 images labeled with the most frequent objects, outperforming state-of-the-art approaches. Future work includes discovering objects, scenes and people to further characterize the environment.
This document describes the CE 343 Surveying Laboratory course. The course is a 1 credit hour lab that teaches students how to use various surveying instruments through hands-on training and field work. Topics covered include chain surveying, leveling, using a theodolite, tachometry, electronic distance measurements, and plane tables. Assessment is based on projects, a midterm exam, and a final exam. The course aims to train students to apply surveying theories and use principal surveying instruments in civil engineering projects.
1) iMTFA is an incremental approach to few-shot instance segmentation that allows adding new classes without retraining.
2) It extends the MTFA baseline by training an instance feature extractor to generate discriminative embeddings for each instance, with the average embedding used as the class representative.
3) At inference, it predicts classes based on the cosine distance between ROI embeddings and stored class representatives, using class-agnostic box regression and mask prediction.
4) Experiments on COCO, VOC2007 and VOC2012 show iMTFA outperforms SOTA few-shot object detection and instance segmentation methods while enabling incremental class addition.
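The classification step described in 2)-3) can be sketched as follows: each class is stored as the normalized mean of its instance embeddings, and a query embedding is assigned to the most cosine-similar representative. The embedding dimension and data below are illustrative, not taken from the paper.

```python
# Cosine-similarity classification against stored class-mean embeddings.
import numpy as np

rng = np.random.default_rng(1)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Two classes, a few embeddings each; the class representative is the mean.
class_a = normalize(rng.normal(size=(5, 8)) + np.array([3.0] + [0.0] * 7))
class_b = normalize(rng.normal(size=(5, 8)) - np.array([3.0] + [0.0] * 7))
reps = normalize(np.stack([class_a.mean(0), class_b.mean(0)]))  # (2, 8)

# Adding a new class later is just appending one more mean: no retraining.
query = normalize(np.array([2.5, 0.1, -0.2, 0.0, 0.3, 0.1, 0.0, 0.2]))
scores = reps @ query          # cosine similarity to each representative
print(int(scores.argmax()))    # closest class index
```

The incremental property falls out of the representation: a new class only adds a row to `reps`, leaving the feature extractor untouched.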
Master thesis defence by Manuel Martos-Asensio
Advisors: Horst Eidenberger (Technische Universität Wien) and Xavier Giró-i-Nieto (Universitat Politècnica de Catalunya)
More details
https://imatge.upc.edu/web/publications/keyframe-based-video-summarization-designer
This Final Degree Work extends two previous projects: it improves the video keyframe extraction module of one of them, Designer Master, by integrating the algorithms developed in the other, Object Maps.
First, the proposed solution is explained: a shot detection method in which the input video is sampled uniformly, a cumulative pixel-to-pixel difference is computed, and a classifier decides which frames are keyframes.
Finally, to validate our approach we conducted a user study comparing both applications. Users completed a survey about summaries created with the original application and with the one developed in this project. The results showed that the improved keyframe extraction module slightly improves the application's performance and the quality of the generated summaries.
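The shot-detection step summarized above can be sketched roughly: sample frames, accumulate pixel-to-pixel differences, and emit a keyframe whenever the accumulated change passes a threshold. The threshold and the toy "video" below are illustrative; the thesis uses a trained classifier rather than a fixed threshold.

```python
# Keyframe selection from cumulative frame-to-frame pixel differences.
import numpy as np

def keyframes(frames, threshold):
    keys, acc = [0], 0.0            # first frame is always kept
    for i in range(1, len(frames)):
        acc += np.abs(frames[i] - frames[i - 1]).mean()
        if acc >= threshold:
            keys.append(i)
            acc = 0.0               # reset after emitting a keyframe
    return keys

# Toy video: constant frames with one abrupt "shot change" at index 3.
video = [np.zeros((4, 4))] * 3 + [np.ones((4, 4))] * 3
print(keyframes(video, threshold=0.5))   # -> [0, 3]
```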
Yinyin Liu presents a model for object detection and localization, called Fast-RCNN. She will show how to introduce a ROI pooling layer into neon, and how to add the PASCAL VOC dataset to interface with model training and inference. Lastly, Yinyin will run through a demo on how to apply the trained model to detect new objects.
Review: Structure Boundary Preserving Segmentation for Medical Image with Am... (Dongmin Choi)
Paper title : Structure Boundary Preserving Segmentation for Medical Image with Ambiguous Boundary (CVPR2020)
Paper link : https://openaccess.thecvf.com/content_CVPR_2020/papers/Lee_Structure_Boundary_Preserving_Segmentation_for_Medical_Image_With_Ambiguous_Boundary_CVPR_2020_paper.pdf
This document proposes a novel long-term visual tracking algorithm called FAST that can be used for target following on UAVs. FAST transforms the correlation filter tracker from the frequency domain to the spatial domain to serve as a detector for redetection. It also introduces a coarse-to-fine redetection scheme using generic object proposals for coarse selection followed by a discriminative detector for fine selection, avoiding exhaustive search. Experiments show FAST achieves real-time, automatic, smooth long-term target following on UAVs for both indoor and outdoor scenarios.
Presentation slides for our paper "Combining Adversarial and Reinforcement Learning for Video Thumbnail Selection", ACM ICMR 2021. https://doi.org/10.1145/3460426.3463630.
We developed a new method for unsupervised video thumbnail selection. The developed network architecture selects video thumbnails based on two criteria: the representativeness and the aesthetic quality of their visual content. Training relies on a combination of adversarial and reinforcement learning. The former is used to train a discriminator, whose goal is to distinguish the original from a reconstructed version of the video based on a small set of candidate thumbnails. The discriminator’s feedback is a measure of the representativeness of the selected thumbnails. This measure is combined with estimates about the aesthetic quality of the thumbnails (made using a SoA Fully Convolutional Network) to form a reward and train the thumbnail selector via reinforcement learning. Experiments on two datasets (OVP and Youtube) show the competitiveness of the proposed method against other SoA approaches. An ablation study with respect to the adopted thumbnail selection criteria documents the importance of considering the aesthetics, and the contribution of this information when used in combination with measures about the representativeness of the visual content.
VISUAL ATTENTION BASED KEYFRAMES EXTRACTION AND VIDEO SUMMARIZATION (cscpconf)
Recent developments in digital video and the drastic increase in internet use have increased the number of people searching for and watching videos online. To make video search easier, a summary may be provided along with each video. The summary should be effective enough that the user learns the content of the video without having to watch it fully, and it should consist of key frames that effectively express the content and context of the video. This work suggests a method to extract key frames that express most of the information in the video. This is achieved by quantifying the visual attention each frame commands, using a descriptor called the attention quantifier. The quantification is based on the human attention mechanism, in which color conspicuousness and motion attract more attention. Each frame is therefore assigned an attention parameter based on its color conspicuousness and motion, and key frames are extracted and summarized adaptively according to this value. This framework thus produces a meaningful video summary.
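An illustrative version of the attention-quantifier idea described above: score each frame by a color-conspicuousness term plus a motion term, then rank frames by the score. Both terms and the weighting below are simplistic stand-ins for the paper's descriptor.

```python
# Toy per-frame attention score: color conspicuousness + motion.
import numpy as np

def attention_scores(frames, w_color=0.5, w_motion=0.5):
    scores = []
    prev = frames[0]
    for f in frames:
        color = f.std()                         # crude conspicuousness proxy
        motion = np.abs(f - prev).mean()        # frame-to-frame change
        scores.append(w_color * color + w_motion * motion)
        prev = f
    return np.array(scores)

rng = np.random.default_rng(2)
flat = [np.full((8, 8), 0.5)] * 4               # dull, static frames
busy = [rng.random((8, 8)) for _ in range(2)]   # varied, changing frames
scores = attention_scores(flat + busy)
print(scores.argmax())  # one of the "busy" frames scores highest
```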
Revisiting the Notion of Diversity in Software Testing (Lionel Briand)
The document discusses the concept of diversity in software testing. It provides examples of how diversity has been applied in various testing applications, including test case prioritization and minimization, mutation analysis, and explaining errors in deep neural networks. The key aspects of diversity discussed are the representation of test cases, measures of distance or similarity between cases, and techniques for maximizing diversity. The document emphasizes that the best approach depends on factors like information access, execution costs, and the specific application context.
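One concrete instance of the ingredients listed above (a representation of test cases, a distance measure, and a technique for maximizing diversity) is greedy farthest-point selection: pick the subset of test cases whose minimum pairwise distance is largest. The feature vectors and subset size below are illustrative, not from the talk.

```python
# Greedy diversity maximization over test-case feature vectors.
import numpy as np

def greedy_diverse_subset(vectors, k):
    chosen = [0]                            # seed with the first test case
    while len(chosen) < k:
        # Pick the candidate farthest from its nearest chosen neighbor.
        best, best_dist = None, -1.0
        for i in range(len(vectors)):
            if i in chosen:
                continue
            d = min(np.linalg.norm(vectors[i] - vectors[j]) for j in chosen)
            if d > best_dist:
                best, best_dist = i, d
        chosen.append(best)
    return chosen

tests = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.1], [9.0, 0.0]])
print(greedy_diverse_subset(tests, 3))   # skips the near-duplicate cases
```

Note how the near-duplicates of test 0 are never selected, which is the intuition behind diversity-based prioritization and minimization.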
Fast object re-detection and localization in video for spatio-temporal fragme... (LinkedTV)
Fast object re-detection and localization in video for spatio-temporal fragment creation, Jul. 2013, San Jose, California, USA. Talk provided by Vasileios Mezaris.
Artifacts Detection by Extracting Edge Features and Error Block Analysis from... (Md. Mehedi Hasan)
This document presents a method for edge-based feature extraction to detect artifacts and analyze error patterns in broadcast videos. The method uses edge magnitude and direction features that are less sensitive to noise. Detected error frames are analyzed to classify error blocks based on edge direction, texture content, and shape. Experimental results show the method achieves high accuracy in detecting distorted frames and analyzing error patterns compared to other approaches. Future work will apply the error analysis to video error concealment.
Advances in this technological era have resulted in an enormous pool of information, stored in multiple places globally and in multiple formats. This article presents a methodology for extracting video lectures delivered by experts in the domain of Computer Science using a Generalized Gamma Mixture Model. Feature extraction is based on DCT transformations. To build the model, the data set is pooled from YouTube video lectures in the domain of Computer Science. The generated outputs are evaluated using precision and recall.
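The DCT-based feature step mentioned above might look roughly like this: take the 2-D DCT of a grayscale frame and keep the low-frequency coefficients as a compact feature vector. The frame and the number of kept coefficients are assumptions for illustration; the article's exact pipeline is not reproduced.

```python
# Low-frequency 2-D DCT coefficients as a compact frame feature.
import numpy as np
from scipy.fft import dctn

rng = np.random.default_rng(4)
frame = rng.random((32, 32))          # stand-in for a grayscale video frame

coeffs = dctn(frame, norm="ortho")    # 2-D DCT of the frame
feature = coeffs[:4, :4].flatten()    # keep the 16 lowest-frequency terms
print(feature.shape)                  # -> (16,)
```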
1) The document proposes an automated attendance system using facial recognition and deep learning algorithms to accurately track students' attendance by capturing a single classroom photo, detecting faces, and matching to known student identities.
2) The system aims to design and develop an automated attendance system utilizing facial recognition and deep learning as a more reliable, efficient, and accurate alternative to manual attendance tracking.
3) The proposed method involves data collection, preprocessing images to extract faces, building and training a CNN model on the faces, and testing the model on new images to predict identities and record attendance.
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD) (Saimunur Rahman)
This presentation was prepared for the ViPr Reading Group at Multimedia University, Cyberjaya. Its goal was to make the lab members aware of recent advancements in action recognition.
Fast Non-Uniform Filtering with Symmetric Weighted Integral Imagesdavidmarimon
Oral presentation at IEEE International Conference on Image Processing (ICIP), Hong Kong, September 2010.
Abstract: Non-uniform filters are frequently used in many image processing applications to describe regions or to detect specific features. However, non-uniform filtering is a computationally complex task. This paper presents a method to perform fast non-uniform filtering using a reduced number of memory accesses. The idea is based on integral images which are commonly used for box or Haar wavelet filtering. The disadvantage of those filters for several applications is their uniform shape. We describe a method to build Symmetric Weighted Integral Images that are tailored for a variety of kernels and the process to perform fast filtering with them. We show a relevant speedup when compared to Kernel Integral Images and large when compared to conventional non-uniform filtering by reducing the computational complexity.
Object Discovery using CNN Features in Egocentric VideosMarc Bolaños Solà
This document proposes a method for object discovery in egocentric videos using convolutional neural networks (CNN). The method aims to characterize the environment of the person wearing an egocentric camera. It uses an objectness detector to sample object candidates, extracts CNN features to represent objects, and employs a refill strategy and clustering to discover new concepts in an iterative manner. The method is validated on a dataset of 1,000 images labeled with the most frequent objects, outperforming state-of-the-art approaches. Future work includes discovering objects, scenes and people to further characterize the environment.
This document describes the CE 343 Surveying Laboratory course. The course is a 1 credit hour lab that teaches students how to use various surveying instruments through hands-on training and field work. Topics covered include chain surveying, leveling, using a theodolite, tachometry, electronic distance measurements, and plane tables. Assessment is based on projects, a midterm exam, and a final exam. The course aims to train students to apply surveying theories and use principal surveying instruments in civil engineering projects.
1) iMTFA is an incremental approach to few-shot instance segmentation that allows adding new classes without retraining.
2) It extends the MTFA baseline by training an instance feature extractor to generate discriminative embeddings for each instance, with the average embedding used as the class representative.
3) At inference, it predicts classes based on the cosine distance between ROI embeddings and stored class representatives, using class-agnostic box regression and mask prediction.
4) Experiments on COCO, VOC2007 and VOC2012 show iMTFA outperforms SOTA few-shot object detection and instance segmentation methods while enabling incremental class addition.
Master thesis defence by Manuel Martos-Asensio
Advisors: Horst Eidenberger (Technische Universtität Viena) and Xavier Giró-i-Nieto (Universitat Politècnica de Catalunya)
More details
https://imatge.upc.edu/web/publications/keyframe-based-video-summarization-designer
This Final Degree Work extends two previous projects and consists in carrying out an improvement of the video keyframe extraction module from one of them called Designer Master, by integrating the algorithms that were developed in the other, Object Maps.
Firstly the proposed solution is explained, which consists in a shot detection method, where the input video is sampled uniformly and afterwards, cumulative pixel-to-pixel difference is applied and a classifier decides which frames are keyframes or not.
Last, to validate our approach we conducted a user study in which both applications were compared. Users were asked to complete a survey regarding to different summaries created by means of the original application and with the one developed in this project. The results obtained were analyzed and they showed that the improvement done in the keyframes extraction module improves slightly the application performance and the quality of the generated summaries.
Yinyin Liu presents a model for object detection and localization, called Fast-RCNN. She will show how to introduce a ROI pooling layer into neon, and how to add the PASCAL VOC dataset to interface with model training and inference. Lastly, Yinyin will run through a demo on how to apply the trained model to detect new objects.
Review : Structure Boundary Preserving Segmentation for Medical Image with Am...Dongmin Choi
Paper title : Structure Boundary Preserving Segmentation for Medical Image with Ambiguous Boundary (CVPR2020)
Paper link : https://openaccess.thecvf.com/content_CVPR_2020/papers/Lee_Structure_Boundary_Preserving_Segmentation_for_Medical_Image_With_Ambiguous_Boundary_CVPR_2020_paper.pdf
This document proposes a novel long-term visual tracking algorithm called FAST that can be used for target following on UAVs. FAST transforms the correlation filter tracker from the frequency domain to the spatial domain to serve as a detector for redetection. It also introduces a coarse-to-fine redetection scheme using generic object proposals for coarse selection followed by a discriminative detector for fine selection, avoiding exhaustive search. Experiments show FAST achieves real-time, automatic, smooth long-term target following on UAVs for both indoor and outdoor scenarios.
Presentation slides for our paper "Combining Adversarial and Reinforcement Learning for Video Thumbnail Selection", ACM ICMR 2021. https://doi.org/10.1145/3460426.3463630.
We developed a new method for unsupervised video thumbnail selection. The developed network architecture selects video thumbnails based on two criteria: the representativeness and the aesthetic quality of their visual content. Training relies on a combination of adversarial and reinforcement learning. The former is used to train a discriminator, whose goal is to distinguish the original from a reconstructed version of the video based on a small set of candidate thumbnails. The discriminator’s feedback is a measure of the representativeness of the selected thumbnails. This measure is combined with estimates about the aesthetic quality of the thumbnails (made using a SoA Fully Convolutional Network) to form a reward and train the thumbnail selector via reinforcement learning. Experiments on two datasets (OVP and Youtube) show the competitiveness of the proposed method against other SoA approaches. An ablation study with respect to the adopted thumbnail selection criteria documents the importance of considering the aesthetics, and the contribution of this information when used in combination with measures about the representativeness of the visual content.
VISUAL ATTENTION BASED KEYFRAMES EXTRACTION AND VIDEO SUMMARIZATIONcscpconf
Recent developments in digital video and drastic increase of internet use have increased the
amount of people searching and watching videos online. In order to make the search of the
videos easy, Summary of the video may be provided along with each video. The video summary
provided thus should be effective so that the user would come to know the content of the video
without having to watch it fully. The summary produced should consists of the key frames that
effectively express the content and context of the video. This work suggests a method to extract
key frames which express most of the information in the video. This is achieved by quantifying
Visual attention each frame commands. Visual attention of each frame is quantified using a
descriptor called Attention quantifier. This quantification of visual attention is based on the
human attention mechanism that indicates color conspicuousness and the motion involved seek
more attention. So based on the color conspicuousness and the motion involved each frame is
given a Attention parameter. Based on the attention quantifier value the key frames are extracted and are summarized adaptively. This framework suggests a method to produces meaningful video summary.
Revisiting the Notion of Diversity in Software TestingLionel Briand
The document discusses the concept of diversity in software testing. It provides examples of how diversity has been applied in various testing applications, including test case prioritization and minimization, mutation analysis, and explaining errors in deep neural networks. The key aspects of diversity discussed are the representation of test cases, measures of distance or similarity between cases, and techniques for maximizing diversity. The document emphasizes that the best approach depends on factors like information access, execution costs, and the specific application context.
Fast object re-detection and localization in video for spatio-temporal fragme...LinkedTV
Fast object re-detection and localization in video for spatio-temporal fragment creation, Jul. 2013, San Jose, California, USA. Talk provided by Vasileios Mezaris.
Artifacts Detection by Extracting Edge Features and Error Block Analysis from...Md. Mehedi Hasan
This document presents a method for edge-based feature extraction to detect artifacts and analyze error patterns in broadcast videos. The method uses edge magnitude and direction features that are less sensitive to noise. Detected error frames are analyzed to classify error blocks based on edge direction, texture content, and shape. Experimental results show the method achieves high accuracy in detecting distorted frames and analyzing error patterns compared to other approaches. Future work will apply the error analysis to video error concealment.
The advents in this technological era have resulted into enormous pool of information. This information is
stored at multiple places globally, in multiple formats. This article highlights a methodology for extracting
the video lectures delivered by experts in the domain of Computer Science by using Generalized Gamma
Mixture Model. The feature extraction is based on the DCT transformations. In order to propose the model,
the data set is pooled from the YouTube video lectures in the domain of Computer Science. The outputs
generated are evaluated using Precision and Recall.
1) The document proposes an automated attendance system using facial recognition and deep learning algorithms to accurately track students' attendance by capturing a single classroom photo, detecting faces, and matching to known student identities.
2) The system aims to design and develop an automated attendance system utilizing facial recognition and deep learning as a more reliable, efficient, and accurate alternative to manual attendance tracking.
3) The proposed method involves data collection, preprocessing images to extract faces, building and training a CNN model on the faces, and testing the model on new images to predict identities and record attendance.
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)Saimunur Rahman
This presentation was prepared for the ViPr Reading Group at Multimedia University, Cyberjaya. Its goal was to make the lab members aware of recent advancements in action recognition.
Fast object re detection and localization in video for spatio-temporal fragme...MediaMixerCommunity
This document describes a method for fast object re-detection and localization in video. The proposed approach improves upon a baseline method through GPU processing, video frame sampling based on shot segmentation, and enhancing robustness to scale variations. Experiments on 6 videos show the complete approach achieves over 99% precision and recall while processing videos 10 times faster than real-time. The method provides accurate, fast detection of predefined objects and could enable interactive linked media applications through object labeling and fragment linking.
IRJET- Study of SVM and CNN in Semantic Concept DetectionIRJET Journal
1) The document discusses approaches for semantic concept detection in videos using techniques like support vector machines (SVM) and convolutional neural networks (CNN).
2) It proposes a concept detection system that uses SVM and CNN together, extracting features from key frames using Hu moments and classifying the features with SVM and CNN.
3) The outputs of SVM and CNN are fused to improve concept detection accuracy compared to using the classifiers individually. Fusing the two classifiers is intended to better identify the concepts in video frames.
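The fusion step described above can be sketched as a simple late fusion: each classifier emits a per-concept confidence score and the two score sets are averaged before thresholding. The concept names, scores, and the 0.5 weight/threshold below are made-up illustration values, not the paper's:

```python
# Hypothetical per-concept confidence scores from two independently
# trained classifiers (SVM and CNN) for one keyframe.
svm_scores = {"car": 0.82, "road": 0.40, "person": 0.10}
cnn_scores = {"car": 0.64, "road": 0.70, "person": 0.05}

def fuse(a, b, w=0.5):
    """Late fusion: weighted average of the two classifiers' scores."""
    return {k: w * a[k] + (1 - w) * b[k] for k in a}

fused = fuse(svm_scores, cnn_scores)
# Report concepts whose fused score clears a detection threshold.
detected = [c for c, s in sorted(fused.items(), key=lambda kv: -kv[1]) if s >= 0.5]
```

Averaging lets a concept that only one classifier is confident about (here "road") still be detected, which is the intended benefit of fusing the two models.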
Synthesizing pseudo 2.5 d content from monocular videos for mixed realityNAVER Engineering
Free-viewpoint video (FVV) is a kind of advanced media that provides a more immersive user experience than traditional media. It allows users to interact with the content by viewing it from any desired viewpoint, and is becoming a next-generation medium.
In creating FVV content, existing systems require complex and specialized capturing equipment and have low end-user usability, because considerable expertise is needed to use them. This is an inconvenience for individuals or small organizations who want to create content; it limits end users' ability to create FVV-based user-generated content (UGC) and inhibits the creation and sharing of varied content.
To tackle these problems, ParaPara is proposed in this work. ParaPara is an end-to-end system that uses a simple yet effective method to generate pseudo-2.5D FVV content from monocular videos, unlike the previously proposed systems. First, the system detects persons from the monocular video through a deep neural network, calculates the real-world homography matrix based on the minimal user interaction, and estimates the pseudo-3D positions of the detected persons. Then, person textures are extracted using general image processing algorithms and placed at the estimated real-world positions. Finally, the pseudo-2.5D content is synthesized from these elements. The content, which is synthesized by the proposed system, is implemented on Microsoft HoloLens; the user can freely place the generated content on the real world and watch it on a free viewpoint.
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...CSCJournals
In today's era of digitization and fast internet, many videos are uploaded to websites, and a mechanism is required to access these videos accurately and efficiently. Semantic concept detection achieves this task accurately and is used in many applications such as multimedia annotation, video summarization, indexing and retrieval. Video retrieval based on semantic concepts is an efficient and challenging research area. Semantic concept detection bridges the semantic gap between the low-level features extracted from a key-frame or shot of a video and their high-level interpretation as semantics. It automatically assigns labels to videos from a predefined vocabulary, and the task is treated as a supervised machine learning problem. The support vector machine (SVM) emerged as the default classifier choice for this task, but recently deep convolutional neural networks (CNNs) have shown exceptional performance in this area; CNNs, however, require large datasets for training. In this paper, we present a framework for semantic concept detection using a hybrid model of SVM and CNN. Global features (color moments, HSV histogram, wavelet transform, grey-level co-occurrence matrix and edge orientation histogram) are extracted as low-level features from an annotated ground-truth video dataset from TRECVID. In a second pipeline, deep features are extracted using a pretrained CNN. The dataset is partitioned into three segments to deal with the data-imbalance issue. The two classifiers are trained separately on all segments, and their scores are fused to detect the concepts in the test dataset. System performance is evaluated using Mean Average Precision on the multi-label dataset, and the performance of the proposed hybrid SVM-CNN framework is comparable to existing approaches.
Presentation of the paper titled "Summarizing videos using concentrated attention and considering the uniqueness and diversity of the video frames", by E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras, delivered at the ACM Int. Conf. on Multimedia Retrieval (ICMR’22), Newark, NJ, USA, June 2022. The corresponding software is available at https://github.com/e-apostolidis/CA-SUM.
This document discusses Motaz El Saban's research experience and interests which focus on analyzing, modeling, learning from, and predicting digital media content such as text, images, and speech. Some key areas of research include real-time video stitching, annotating mobile videos, object and activity recognition from videos, and facial expression recognition using deep learning techniques. The document also outlines El Saban's educational background and provides an agenda for his upcoming presentation.
SYNOPSIS on Parse representation and Linear SVM.bhavinecindus
1. The document discusses a thesis on using sparse feature parameterization and multi-kernel SVM for large scale scene classification. The objective is to improve accuracy for large datasets using sparse representations and machine learning algorithms.
2. Key challenges include high dimensionality reducing accuracy for large datasets, nonlinear distributions, and computational costs of deep learning models. The research aims to address these issues.
3. The motivation from literature shows that multi-kernel SVMs have proved effective but could be improved by minimizing redundancy and optimizing kernel parameters for feature sets.
Large-scale Video Classification with Convolutional Neural Net.docxcroysierkathey
Large-scale Video Classification with Convolutional Neural Networks
Andrej Karpathy1,2 George Toderici1 Sanketh Shetty1
[email protected][email protected][email protected]
Thomas Leung1 Rahul Sukthankar1 Li Fei-Fei2
[email protected][email protected][email protected]
1Google Research 2Computer Science Department, Stanford University
http://cs.stanford.edu/people/karpathy/deepvideo
Abstract
Convolutional Neural Networks (CNNs) have been es-
tablished as a powerful class of models for image recog-
nition problems. Encouraged by these results, we pro-
vide an extensive empirical evaluation of CNNs on large-
scale video classification using a new dataset of 1 million
YouTube videos belonging to 487 classes. We study mul-
tiple approaches for extending the connectivity of a CNN
in time domain to take advantage of local spatio-temporal
information and suggest a multiresolution, foveated archi-
tecture as a promising way of speeding up the training.
Our best spatio-temporal networks display significant per-
formance improvements compared to strong feature-based
baselines (55.3% to 63.9%), but only a surprisingly mod-
est improvement compared to single-frame models (59.3%
to 60.9%). We further study the generalization performance
of our best model by retraining the top layers on the UCF-
101 Action Recognition dataset and observe significant per-
formance improvements compared to the UCF-101 baseline
model (63.3% up from 43.9%).
1. Introduction
Images and videos have become ubiquitous on the in-
ternet, which has encouraged the development of algo-
rithms that can analyze their semantic content for vari-
ous applications, including search and summarization. Re-
cently, Convolutional Neural Networks (CNNs) [15] have
been demonstrated as an effective class of models for un-
derstanding image content, giving state-of-the-art results
on image recognition, segmentation, detection and retrieval
[11, 3, 2, 20, 9, 18]. The key enabling factors behind these
results were techniques for scaling up the networks to tens
of millions of parameters and massive labeled datasets that
can support the learning process. Under these conditions,
CNNs have been shown to learn powerful and interpretable
image features [28]. Encouraged by positive results in do-
main of images, we study the performance of CNNs in
large-scale video classification, where the networks have
access to not only the appearance information present in
single, static images, but also their complex temporal evolu-
tion. There are several challenges to extending and applying
CNNs in this setting.
From a practical standpoint, there are currently no video
classification benchmarks that match the scale and variety
of existing image datasets because videos are significantly
more difficult to collect, annotate and store. To obtain suffi-
cient amount of data needed to train our CNN architectures,
we collected a new Sports-1M dataset, which consists of 1
million YouTube videos belonging to a taxonomy ...
Explaining video summarization based on the focus of attentionVasileiosMezaris
Presentation of the paper "Explaining video summarization based on the focus of attention", by E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras, delivered at IEEE ISM 2022, Dec. 2022, Naples, Italy.
In this paper we propose a method for explaining video summarization. We start by formulating the problem as the creation of an explanation mask which indicates the parts of the video that influenced the most the estimates of a video summarization network about the frames' importance. Then, we explain how the typical analysis pipeline of attention-based networks for video summarization can be used to define explanation signals, and we examine various attention-based signals that have been studied as explanations in the NLP domain. We evaluate the performance of these signals by investigating the video summarization network's input-output relationship according to different replacement functions, and utilizing measures that quantify the capability of explanations to spot the most and least influential parts of a video. We run experiments using an attention-based network (CA-SUM) and two datasets (SumMe and TVSum) for video summarization. Our evaluations indicate the advanced performance of explanations formed using the inherent attention weights, and demonstrate the ability of our method to explain the video summarization results using clues about the focus of the attention mechanism.
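The core idea of an attention-based explanation mask can be sketched in a few lines: mark the frames with the highest attention weights as the ones that most influenced the summarizer. The weights and the 25% fraction below are invented illustration values, not the paper's evaluation protocol:

```python
# Hypothetical per-frame attention weights from a summarization network.
attention = [0.02, 0.30, 0.05, 0.25, 0.01, 0.22, 0.10, 0.05]

def explanation_mask(weights, fraction=0.25):
    """Mark the top `fraction` of frames (by attention weight) as influential."""
    k = max(1, round(len(weights) * fraction))
    top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    return [1 if i in top else 0 for i in range(len(weights))]

mask = explanation_mask(attention)  # 1 = frame flagged as influential
```

The paper then evaluates such masks by replacing the flagged (or unflagged) frames and measuring how much the network's importance estimates change.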
Content based video retrieval using discrete cosine transformnooriasukmaningtyas
A content-based video retrieval (CBVR) framework is built in this paper. One of the essential features of the video retrieval process and CBVR is color value. The discrete cosine transform (DCT) is used to extract the query video's features, which are compared with the video features stored in our database. An average result of 0.6475 was obtained using the DCT on the database we created and collected, across all categories. The technique was applied to our database of 100 videos, with 5 videos in each category.
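The retrieval idea can be sketched as follows: take the DCT of a signal derived from each video (here a made-up 1-D intensity sequence), keep the low-frequency coefficients as a compact feature, and rank database videos by distance to the query's feature. The signals, the 4-coefficient cut, and the Euclidean distance are illustration choices, not the paper's exact pipeline:

```python
import math

def dct2(signal):
    """DCT-II of a 1-D signal (e.g., a row of pixel intensities)."""
    n = len(signal)
    return [sum(x * math.cos(math.pi / n * (i + 0.5) * k)
                for i, x in enumerate(signal))
            for k in range(n)]

def feature(signal, keep=4):
    """Keep only the low-frequency coefficients as a compact feature vector."""
    return dct2(signal)[:keep]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = feature([10, 12, 11, 13, 40, 42, 41, 43])
same  = feature([10, 12, 11, 13, 40, 42, 41, 43])   # identical clip
other = feature([90, 10, 85, 12, 95, 11, 88, 9])    # very different clip
```

A matching clip yields zero distance while a dissimilar one scores much higher, which is the ordering a CBVR ranker needs.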
Similar to Query and Keyframe Representations for Ad-hoc Video Search (20)
MOVING: Applying digital science methodology for TVETMOVING Project
This document provides an overview of the MOVING project, which aims to improve digital information management skills through an open interdisciplinary platform. It discusses two use cases: one focused on legal accountants managing compliance information, and the other on junior researchers managing research literature. For each use case, the document describes the objectives, tasks, relevant personas, and a mock-up of the proposed digital tools. It also provides information on designing the tools using a human-centered approach and testing them through the MOVING technology platform. The overall goal is to empower users across different fields and backgrounds to apply data analytics tools in their work.
Learning analytics for reflective learningMOVING Project
This presentation covers how meaningful reflection guidance can be applied to motivate people to improve their own information search behaviour, and how guidance can be provided to follow a pre-defined learning path through a curriculum to improve one's own competence in information literacy.
Unesco mobileweek 2019_frontier_tech_oer-finalMOVING Project
This document discusses open educational resources (OER) and the potential for artificial intelligence (AI) to enhance OER. It describes how AI could be used to automatically ingest, clean, structure and preprocess OER from various sources. User modeling and social network modeling are discussed as ways to enable adaptive learning and personalized education. Several existing and potential AI systems are mentioned, such as tools for competency calculation, semantic processing of OER, and intelligent assistants. The growth of OER over time is shown in a graph. Cyc, an early large-scale AI knowledge base, is also summarized.
Inferring knowledge acquisition through Web navigation behaviourMOVING Project
In this presentation, UMAN explores if there is a way to measure learning and personalise the user learning experience in an unobtrusive manner. The PhD thesis in this paper proposes using data-driven methods to measure learning by mining user interaction data to identify regularities that could be indicators of learning.
The document summarizes the ITI-CERTH participation in TRECVID 2018, including their approaches to ad-hoc video search, instance search, and activity recognition. Their key approaches included using pre-trained deep convolutional neural networks (DCNNs) to extract concepts from videos, late fusion of multiple DCNNs, and combining traditional and DCNN approaches. Their system achieved a maximum infAP of 0.047 for ad-hoc video search and had high accuracy for activity recognition training.
Wissenschaft 2.0 und offene Forschungsmethoden vermitteln– Der MOOC "Science ...MOVING Project
This presentation was held on 24th of January within the lecture "Research Analytics" which is regularly organized by the Saxon State and University Library Dresden. The aim was to present the MOOC "Science 2.0 and open research methods" to researchers and librarians.
VERGE: A Multimodal Interactive Search Engine for Video Browsing and RetrievalMOVING Project
VERGE is an interactive video search engine that enables browsing & retrieval of video/image collections, and includes query submissions & a re-ranking capability.
Effective Unsupervised Author Disambiguation with Relative FrequenciesMOVING Project
The document summarizes research on using agglomerative clustering and probabilistic similarity to perform unsupervised author disambiguation. It presents the problem of linking author mentions that refer to the same real-world author when only the name is given. The method clusters author mentions based on similarity across features like terms, affiliations and co-authors. Experiments on a large dataset show that co-author features are highly important and the stopping criteria can be improved to better balance precision and recall for different problem sizes. Overall, the unsupervised approach works well but leaves room for future parameter and evaluation methodology refinements.
What to read next? Challenges and Preliminary Results in Selecting Represen...MOVING Project
1. The document presents an approach for selecting representative documents from a set of search results to provide users with an overview of the content and subtopics. It compares different document representations, clustering algorithms, and selection methods on two datasets.
2. The evaluation measures of coverage and redundancy were found to be insufficient for accurately evaluating representativeness, as the scores increased with the number of selected documents and were sometimes independent of the actual selection method.
3. The research questions explored how document representation, clustering algorithm, and selection method influence coverage and redundancy, finding the choice of clustering had the largest impact. Coverage and redundancy were found to be inflated and not directly reflect representativeness.
Qualitative Analysis of Vocabulary Evolution on the Linked Open Data CloudMOVING Project
The document analyzes the evolution of vocabularies on the Linked Open Data cloud. It selects 12 major vocabularies and analyzes over 10,000 changes across different versions. The analysis finds that most changes are internal rather than relying on external vocabularies, and that domains like publications see more changes. Understanding how vocabularies evolve can help with maintaining and developing new ontologies over time.
Analyzing the Evolution of Vocabulary Terms and Their Impact on the LOD CloudMOVING Project
Vocabularies are used for modeling data in Knowledge Graphs (KGs) like the Linked Open Data Cloud and Wikidata. During their lifetime, vocabularies are subject to changes: new terms are coined while existing terms are modified or deprecated. We first quantify the amount and frequency of changes in vocabularies. Subsequently, we investigate to which extent and when the changes are adopted in the evolution of KGs. We conduct our experiments on three large-scale KGs for which time-stamped information is available, namely the Billion Triples Challenge datasets, the Dynamic Linked Data Observatory dataset, and Wikidata. Our results show that the change frequency of terms is rather low, but changes can have high impact due to the large amount of distributed graph data on the web. Furthermore, not all coined terms are used, and most of the deprecated terms are still used by data publishers. The adoption time of terms coming from different vocabularies ranges from very fast (a few days) to very slow (a few years). Surprisingly, we could observe some adoptions before the vocabulary changes were published. Understanding the evolution of vocabulary terms is important to avoid wrong assumptions about the modeling status of data published on the web, which may result in difficulties when querying the data from distributed sources.
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...MOVING Project
This document discusses predicting the lifetime of RDF triples in Linked Open Data to help keep LOD caches up-to-date. It presents a method using linear regression to predict triple lifetime based on features like subject, predicate, and object. Evaluating on two datasets, the model predicted lifetime within 10% error. This was then used in a novel crawling strategy that outperformed existing strategies, preferentially updating triples predicted to change soon. The strategy provides an advantage by not requiring additional past data once trained.
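The regression step in the summary above can be sketched with single-feature ordinary least squares: fit lifetime against one numeric triple feature, then predict the lifetime of an unseen triple. The feature (a predicate's observed update count) and the numbers are invented illustrations of the approach, not the paper's data:

```python
# Hypothetical training data: one numeric feature per RDF triple
# (e.g., how often its predicate has changed before) and the triple's
# observed lifetime in days.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [100.0, 80.0, 60.0, 40.0, 20.0]

def fit_ols(xs, ys):
    """Ordinary least squares for a single feature: y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

a, b = fit_ols(xs, ys)
# A triple whose predicate changes often is predicted to die soon.
predicted_lifetime = a + b * 6.0
```

A crawler can then re-fetch triples in ascending order of predicted lifetime, which is the intuition behind the preferential crawling strategy the abstract describes.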
Deep Multi-task Learning with Label Correlation Constraint for Video Concept ...MOVING Project
This document proposes a deep multi-task learning method called DMTL_LC for video concept detection that jointly considers task and label relations. It extends a convolutional neural network with a two-sided network for multi-task learning and adds a label-based constraint to incorporate statistical information about pairwise label correlations. An evaluation on the TRECVID SIN 2013 dataset shows DMTL_LC outperforms other single-task and multi-task baselines, achieving a MXinfAP of 22.60%. The method contributes a deep learning approach for concept detection that leverages relationships between tasks and labels.
Generic to Specific Recognition Models for Membership Analysis in Group VideosMOVING Project
We present a novel two-phase Support Vector Machine (SVM) based specific recognition model that is learned using an optimized generic recognition model.
Dandelion Hashtable: beyond billion requests per second on a commodity serverAntonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables that go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. On a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
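The closed-addressing layout can be illustrated with a toy single-threaded sketch (none of DLHT's lock-freedom, prefetching, or resizing appears here): each bucket is a chain of fixed-size "cache lines" of at most four slots, and a delete frees its slot immediately rather than leaving a tombstone, which is the property open addressing struggles with:

```python
# Toy sketch of closed addressing with bounded chaining. BUCKET_SLOTS
# mimics how many key-value pairs fit in one cache line.
BUCKET_SLOTS = 4

class ChainedTable:
    def __init__(self, n_buckets=8):
        # Each bucket is a list of chained fixed-capacity "cache lines".
        self.buckets = [[] for _ in range(n_buckets)]

    def _chain(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        chain = self._chain(key)
        for line in chain:                      # update in place if present
            for i, (k, _) in enumerate(line):
                if k == key:
                    line[i] = (key, value)
                    return
        for line in chain:                      # reuse a free slot
            if len(line) < BUCKET_SLOTS:
                line.append((key, value))
                return
        chain.append([(key, value)])            # grow the chain by one line

    def get(self, key):
        for line in self._chain(key):
            for k, v in line:
                if k == key:
                    return v
        return None

    def delete(self, key):
        for line in self._chain(key):
            for i, (k, _) in enumerate(line):
                if k == key:
                    del line[i]                 # slot freed instantly, no tombstone
                    return True
        return False

t = ChainedTable()
for i in range(20):
    t.put(i, i * i)
```

Because most lookups touch only the first line of a short chain, the common case is close to one memory access, which is the performance argument the slides make.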
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor IvaniukFwdays
At this talk we will discuss DDoS protection tools and best practices, discuss network architectures and what AWS has to offer. Also, we will look into one of the largest DDoS attacks on Ukrainian infrastructure that happened in February 2022. We'll see, what techniques helped to keep the web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on Ukraine experience
"Choosing proper type of scaling", Olena SyrotaFwdays
Imagine an IoT processing system that is already quite mature and production-ready and for which client coverage is growing and scaling and performance aspects are life and death questions. The system has Redis, MongoDB, and stream processing based on ksqldb. In this talk, firstly, we will analyze scaling approaches and then select the proper ones for our system.
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving
What began over 115 years ago as a supplier of precision gauges to the automotive industry has evolved into being an industry leader in the manufacture of product branding, automotive cockpit trim and decorative appliance trim. Value-added services include in-house Design, Engineering, Program Management, Test Lab and Tool Shops.
What is an RPA CoE? Session 2 – CoE RolesDianaGray10
In this session, we will review the players involved in the CoE and how each role impacts opportunities.
Topics covered:
• What roles are essential?
• What place in the automation journey does each role play?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
Conversational agents, or chatbots, are increasingly used to access all sorts of services using natural language. While open-domain chatbots - like ChatGPT - can converse on any topic, task-oriented chatbots - the focus of this paper - are designed for specific tasks, like booking a flight, obtaining customer support, or setting an appointment. Like any other software, task-oriented chatbots need to be properly tested, usually by defining and executing test scenarios (i.e., sequences of user-chatbot interactions). However, there is currently a lack of methods to quantify the completeness and strength of such test scenarios, which can lead to low-quality tests, and hence to buggy chatbots.
To fill this gap, we propose adapting mutation testing (MuT) for task-oriented chatbots. To this end, we introduce a set of mutation operators that emulate faults in chatbot designs, an architecture that enables MuT on chatbots built using heterogeneous technologies, and a practical realisation as an Eclipse plugin. Moreover, we evaluate the applicability, effectiveness and efficiency of our approach on open-source chatbots, with promising results.
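One mutation operator of the kind described above can be sketched on a minimal dict-based chatbot design: delete one training phrase and check whether a test scenario "kills" the mutant by producing a different outcome. The intents, phrases, and operator name are invented illustrations, not the paper's operator set or Eclipse tooling:

```python
import copy

# Hypothetical chatbot design: intents with training phrases and a response.
design = {
    "book_flight": {"phrases": ["book a flight", "buy a plane ticket"],
                    "response": "Where would you like to fly?"},
    "cancel": {"phrases": ["cancel my booking"],
               "response": "Your booking is cancelled."},
}

def answer(design, utterance):
    """Trivial exact-match intent resolver standing in for the chatbot engine."""
    for intent in design.values():
        if utterance in intent["phrases"]:
            return intent["response"]
    return "Sorry, I did not understand."

def delete_phrase_mutant(design, intent, index):
    """Mutation operator: drop one training phrase, emulating a design fault."""
    mutant = copy.deepcopy(design)
    del mutant[intent]["phrases"][index]
    return mutant

# A scenario is (user utterance, expected reply); it kills the mutant
# if the mutant's outcome differs from the expectation.
scenario = ("buy a plane ticket", "Where would you like to fly?")
mutant = delete_phrase_mutant(design, "book_flight", 1)
killed = answer(mutant, scenario[0]) != scenario[1]
```

The fraction of mutants killed by a scenario suite is then a measure of the suite's strength, which is the quantity the paper sets out to provide.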
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...Fwdays
Direct losses from 1 minute of downtime = $5-$10 thousand. Reputation is priceless.
As part of the talk, we will consider the architectural strategies necessary for the development of highly loaded fintech solutions. We will focus on using queues and streaming to efficiently work and manage large amounts of data in real-time and to minimize latency.
We will focus special attention on the architectural patterns used in the design of the fintech system, microservices and event-driven architecture, which ensure scalability, fault tolerance, and consistency of the entire system.
QA or the Highway - Component Testing: Bridging the gap between frontend appl...zjhamm304
These are the slides for the presentation, "Component Testing: Bridging the gap between frontend applications" that was presented at QA or the Highway 2024 in Columbus, OH by Zachary Hamm.
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...Alex Pruden
Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol based on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme. The difficulty is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security.
Paper Link: https://eprint.iacr.org/2024/257
From Natural Language to Structured Solr Queries using LLMsSease
This talk draws on experimentation to enable AI applications with Solr. One important use case is to use AI for better accessibility and discoverability of the data: while user experience techniques, lexical search improvements, and data harmonization can take organizations to a good level of accessibility, a structural (or "cognitive") gap remains between data users' needs and data producers' constraints.
That is where AI – and most importantly, Natural Language Processing and Large Language Model techniques – could make a difference. This natural language, conversational engine could facilitate access and usage of the data leveraging the semantics of any data source.
The objective of the presentation is to propose a technical approach and a way forward to achieve this goal.
The key concept is to enable users to express their search queries in natural language, which the LLM then enriches, interprets, and translates into structured queries based on the Solr index’s metadata.
This approach leverages the LLM’s ability to understand the nuances of natural language and the structure of documents within Apache Solr.
The LLM acts as an intermediary agent, offering a transparent experience to users automatically and potentially uncovering relevant documents that conventional search methods might overlook. The presentation will include the results of this experimental work, lessons learned, best practices, and the scope of future work that should improve the approach and make it production-ready.
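The translation step described above can be caricatured with a rule-based stand-in for the LLM (a real system would prompt the model with the index's field metadata instead). The field names, the "recent" cutoff year, and the rules below are all invented illustrations of the natural-language-to-structured-query idea:

```python
# Mock stand-in for the LLM step: turn a natural-language request into
# Solr-style q/fq parameters. FIELDS plays the role of index metadata.
FIELDS = {"title": "text", "year": "int", "author": "text"}

def to_solr_query(nl_request):
    tokens = nl_request.lower().split()
    fq = []
    if "recent" in tokens:                     # temporal qualifier -> range filter
        fq.append("year:[2020 TO *]")
        tokens.remove("recent")
    if "by" in tokens:                         # "by <name>" -> author filter
        i = tokens.index("by")
        fq.append("author:" + tokens[i + 1])
        tokens = tokens[:i] + tokens[i + 2:]
    return {"q": "title:(" + " ".join(tokens) + ")", "fq": fq}

query = to_solr_query("recent ranking papers by smith")
```

An LLM replaces the brittle token rules with genuine language understanding, but the output contract is the same: structured `q` and `fq` parameters that Solr can execute directly.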
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
Session 1 - Intro to Robotic Process Automation.pdfUiPathCommunity
👉 Check out our full 'Africa Series - Automation Student Developers (EN)' page to register for the full program:
https://bit.ly/Automation_Student_Kickstart
In this session, we shall introduce you to the world of automation, the UiPath Platform, and guide you on how to install and setup UiPath Studio on your Windows PC.
📕 Detailed agenda:
What is RPA? Benefits of RPA?
RPA Applications
The UiPath End-to-End Automation Platform
UiPath Studio CE Installation and Setup
💻 Extra training through UiPath Academy:
Introduction to Automation
UiPath Business Automation Platform
Explore automation development with UiPath Studio
👉 Register here for our upcoming Session 2 on June 20: Introduction to UiPath Studio Fundamentals: https://community.uipath.com/events/details/uipath-lagos-presents-session-2-introduction-to-uipath-studio-fundamentals/
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...Jason Yip
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
"Scaling RAG Applications to serve millions of users", Kevin GoedeckeFwdays
How we managed to grow and scale a RAG application from zero to thousands of users in 7 months. Lessons from technical challenges around managing high load for LLMs, RAGs and Vector databases.
Leveraging the Graph for Clinical Trials and Standards
Query and Keyframe Representations for Ad-hoc Video Search
Query and Keyframe Representations for Ad-hoc Video Search
Foteini Markatopoulou1,2, Damianos Galanopoulos1, Vasileios Mezaris1, Ioannis Patras2
1Information Technologies Institute (ITI), CERTH, Thessaloniki, Greece; 2Queen Mary University of London, London, UK
Problem and motivation
• Ad-hoc video search: retrieving, from a large video collection, video fragments (keyframes) that are related to a given query
• Typical solution: treat the query as a set of simple terms
• Motivation:
  o Detecting the most useful parts of the query, e.g., subsequences that contain the main content that the user asks to retrieve
  o Combining two different measures of the distance between the video shots and the target query, calculated on concept-based and semantic embedding representations
Method outline
• (a) Concept-based keyframe representation: apply a DCNN to every keyframe
• (c) Concept-based query representation: translate the query into a set of related concepts using NLP
• (b) Semantic embeddings for the concept-based query and keyframe representations: project both into a given semantic embedding space
• Each concept is enriched with additional information retrieved from Google or Wikipedia [20]
• An inverted index structure is used to associate the query with the concepts [4]
• A semi-automatic system [21], where the user is asked to choose keywords given a test query
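The inverted-index idea above can be illustrated with a minimal sketch (not the authors' code): each word appearing in a concept's title points to that concept, so a free-text query can be mapped to the set of detectable concepts it mentions. The concept names and the example query below are illustrative assumptions.

```python
# Minimal sketch of an inverted-index lookup that associates query words
# with detectable concepts, as in step (c). Concept titles are toy values.
from collections import defaultdict

concepts = {0: "road vehicle", 1: "indoor sports", 2: "snow mountain"}

# Build the inverted index: word -> set of concept ids whose title contains it.
index = defaultdict(set)
for cid, title in concepts.items():
    for word in title.split():
        index[word].add(cid)

def query_to_concepts(query):
    """Return the ids of all concepts that share a word with the query."""
    related = set()
    for word in query.lower().split():
        related |= index.get(word, set())
    return sorted(related)

print(query_to_concepts("a vehicle driving on a snow covered road"))  # → [0, 2]
```

A real system would additionally match the enriched concept descriptions (Google/Wikipedia text), not just the concept titles.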
Proposed solution
• The two distances are combined by taking their arithmetic mean
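The combination step can be sketched as follows. This is an illustrative toy example, not the authors' exact pipeline: the keyframe and the query are each represented as a vector of concept scores, and also projected into a semantic embedding space (here, as a score-weighted sum of per-concept word vectors, which is an assumption); the final distance is the arithmetic mean of the two cosine distances.

```python
# Illustrative sketch of combining a concept-space distance and an
# embedding-space distance by their arithmetic mean. All vectors are toy values.
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy word2vec-style vectors, one row per concept, dimension 4.
concept_embeddings = np.array([[0.9, 0.1, 0.0, 0.2],
                               [0.0, 0.8, 0.3, 0.1],
                               [0.2, 0.0, 0.9, 0.4]])

def embed(concept_scores):
    """Project a concept-score vector into the semantic embedding space."""
    return concept_scores @ concept_embeddings

keyframe_scores = np.array([0.7, 0.1, 0.6])  # DCNN concept-detection scores
query_scores    = np.array([1.0, 0.0, 1.0])  # concepts selected for the query

d_concept  = cosine_distance(keyframe_scores, query_scores)          # (a)/(c)
d_semantic = cosine_distance(embed(keyframe_scores), embed(query_scores))  # (b)
final_distance = (d_concept + d_semantic) / 2.0  # arithmetic mean
```

Keyframes are then ranked by `final_distance` in ascending order, so shots that agree with the query under both representations come first.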
• Datasets: TRECVID AVS 2016, TRECVID Video Search 2008
  o Test sets: 600 and 100 hours of video, respectively
• Evaluated queries: 30 and 48, respectively
• Evaluation measure: MXinfAP (%)

A) Parameter selection: MXinfAP (%) on the AVS16 dataset, investigating the parameters of the proposed method; (a) the transformation to the semantic embedding space is ignored; (b) the final distance from the query is calculated solely in the semantic embedding space; (c) the complete process is used, i.e., the final distance is the mean of the distances calculated in (a) and (b)

B) Comparisons: MXinfAP (%) for the different compared AVS methods
• Semantic embeddings: pre-trained Google News corpus word2vec model
• Keyframe representation: 1346 concepts
  o 1000 ImageNet concepts extracted using 5 pre-trained ImageNet DCNNs, fused in terms of arithmetic mean
  o 346 TRECVID SIN concepts extracted using 2 fine-tuned DCNNs, again fused
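The score fusion described above is a plain arithmetic mean over the detectors. A minimal sketch, with toy score values (here 3 nets and 4 concepts instead of the 5 nets and 1000 concepts used in the paper):

```python
# Concept scores from several DCNNs applied to the same keyframe are
# combined by their arithmetic mean. The score matrix is a toy example.
import numpy as np

# scores[i, j] = score of concept j according to DCNN i (3 nets, 4 concepts)
scores = np.array([[0.9, 0.1, 0.4, 0.0],
                   [0.7, 0.2, 0.5, 0.1],
                   [0.8, 0.0, 0.6, 0.2]])

fused = scores.mean(axis=0)  # one fused score per concept
print(fused)
```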
                                   All  |       Excluding one step
Steps                                   | step 1  step 2  step 3  step 4  step 5
(a) Concept-based representation  5.94  |  5.92    5.74    3.96    5.95    4.53
(b) Semantic embeddings           3.77  |  3.86    2.98    3.22    3.75    2.80
(c) Combination                   6.35  |  6.51    5.77    4.37    6.27    4.99
Methods                                AVS16                    VS08
(a) Literature methods
    Tzelepis et al. [20]               4.16                     8.27
    Ueki et al. [21]                   5.65                     7.24
    Norouzi et al. [15]                3.14                     7.30
(b) Top-4 TRECVID finalists
    Top-1  Le et al. [4]               5.4    Tang et al.       6.7
    Top-2  Markatopoulou et al. [13]   5.1    Snoek et al.      5.4
    Top-3  Liang et al. [6]            4.0    Ngo et al.        4.2
    Top-4  Zhang et al. [23]           3.8    Mei et al.        4.1
Proposed                               6.35                     9.11