The paper “Query and Keyframe Representations for Ad-hoc Video Search” describes a fully automatic method that combines video concept detection and textual query analysis in order to solve the problem of ad-hoc video search.
MediaEval 2016: LAPI at Predicting Media Interestingness Task (multimediaeval)
The document describes the LAPI team's approach for the MediaEval 2016 predicting media interestingness task. They used a classical machine learning approach where descriptors were generated for videos and images, and support vector machines (SVMs) were trained on the development set. The best SVM-descriptor combinations were identified on the development set and used to predict the test set. Descriptors included HSV histograms, HOG, SIFT, LBP, GIST, and AlexNet layers. SVMs with linear, polynomial, and RBF kernels were tested. The best results on the development set had a MAP of 0.214 for images and 0.179 for videos. On the test set, their best run was above the estimated MAP on
Automatic Building Detection for Satellite Images using IGV and DSM (Amit Raikar)
This document presents a method for automatic building detection from satellite images using internal gray variance (IGV) and digital surface model (DSM). The proposed method aims to detect low-rising buildings and buildings with partially bright and partially dark rooftops more accurately than existing methods. The key steps include image enhancement, IGV feature extraction, seed point detection using the enhanced image and IGV, clustering using DSM data, binarization, thinning, shadow detection, and segmentation. Results on test satellite images show the method achieves higher detection percentages and lower branch factors than an existing method.
MediaEval 2016 - MLPBOON Predicting Media Interestingness System (multimediaeval)
Presenter: Jayneel Parekh
The MLPBOON Predicting Media Interestingness System for MediaEval 2016 In Working Notes Proceedings of the MediaEval 2016 Workshop, Hilversum, Netherlands, October 20-21, CEUR-WS.org (2016) by Jayneel Parekh, Sanjeel Parekh
Paper: http://ceur-ws.org/Vol-1739/MediaEval_2016_paper_25.pdf
Video: https://youtu.be/nAnrdYiy7nc
Abstract: This paper describes the system developed by team MLPBOON for MediaEval 2016 Predicting Media Interestingness Image Subtask. After experimenting with various features and classifiers on the development dataset, our final system involves use of CNN features (fc7 layer of AlexNet) for the input representation and logistic regression as the classifier. For the proposed method, the MAP for the best run reaches a value of 0.229.
MediaEval 2016 - UNIFESP Predicting Media Interestingness Task (multimediaeval)
Presenter: Samuel G. Fadel
UNIFESP at MediaEval 2016: Predicting Media Interestingness Task In Working Notes Proceedings of the MediaEval 2016 Workshop, Hilversum, Netherlands, October 20-21, CEUR-WS.org (2016) by Jurandy Almeida
Paper: http://ceur-ws.org/Vol-1739/MediaEval_2016_paper_28.pdf
Video: https://youtu.be/YLthKNczlcA
Abstract: This paper describes the approach proposed by UNIFESP for the MediaEval 2016 Predicting Media Interestingness Task and for its video subtask only. The proposed approach is based on combining learning-to-rank algorithms for predicting the interestingness of videos by their visual content.
1) The document presents a method for detecting building damage from very high resolution satellite images using one-class SVM classification and shadow information.
2) Initial building damage is detected using one-class SVM classification on multitemporal images. Shadows are then detected and changes in shadows over time are identified.
3) The initial damage detection results are refined by considering areas of shadow change, removing detections not near shadow changes. This combined method improved damage detection accuracy over using spectral data alone.
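The two-stage idea in 1)-3) can be sketched in code: an initial one-class SVM flags candidate damage pixels, and the result is refined by keeping only candidates near shadow-change areas. This is a minimal illustration with synthetic data; the paper's actual spectral features, SVM settings, and proximity criterion are not reproduced here.

```python
# Hedged sketch of one-class SVM damage detection refined by shadow change.
# All data is synthetic; features and thresholds are illustrative only.
import numpy as np
from sklearn.svm import OneClassSVM
from scipy.ndimage import binary_dilation

rng = np.random.default_rng(0)

# Toy "spectral" pixels: train the one-class SVM on undamaged samples only.
undamaged = rng.normal(0.0, 1.0, size=(200, 3))
svm = OneClassSVM(nu=0.1, gamma="scale").fit(undamaged)

# Score a 10x10 test image; -1 = outlier = candidate damage.
test_pixels = rng.normal(0.0, 1.0, size=(100, 3))
test_pixels[:20] += 4.0                     # injected "damage" pixels (rows 0-1)
candidate = (svm.predict(test_pixels) == -1).reshape(10, 10)

# Shadow-change mask (here placed over the same region, as if shadows changed).
shadow_change = np.zeros((10, 10), dtype=bool)
shadow_change[0:2, :] = True

# Refinement: keep only candidates within 1 pixel of a shadow change.
near_shadow = binary_dilation(shadow_change, iterations=1)
refined = candidate & near_shadow

print(int(candidate.sum()), int(refined.sum()))
```

The refinement can only remove detections, never add them, which matches the summary's description of pruning detections not near shadow changes.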
PR-232: AutoML-Zero: Evolving Machine Learning Algorithms From Scratch (Sunghoon Joo)
PR-232: AutoML-Zero: Evolving Machine Learning Algorithms From Scratch
Paper link: https://arxiv.org/abs/2003.03384
Video presentation link: https://youtu.be/J__uJ79m01Q
This document discusses applications of computer vision and provides examples. It defines computer vision as acquiring, processing, analyzing and understanding images from the real world. It notes common programming languages and libraries used like OpenCV. It then provides two examples: using camera and computer vision to track a simple pendulum experiment, and using OCR to automatically grade surveys by detecting answers. Algorithms and flowcharts are provided for both applications. The conclusion discusses more potential science experiments and uses of computer vision in education.
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization (Sunghoon Joo)
PR-272: Accelerating Large-Scale Inference with Anisotropic Vector Quantization
[Guo et al., ICML 2020]
Paper link: https://arxiv.org/abs/1908.10396
Video presentation link: https://youtu.be/cU46yR-A0cs
reviewed by Sunghoon Joo
Fast Non-Uniform Filtering with Symmetric Weighted Integral Images (davidmarimon)
Oral presentation at IEEE International Conference on Image Processing (ICIP), Hong Kong, September 2010.
Abstract: Non-uniform filters are frequently used in many image processing applications to describe regions or to detect specific features. However, non-uniform filtering is a computationally complex task. This paper presents a method to perform fast non-uniform filtering using a reduced number of memory accesses. The idea is based on integral images, which are commonly used for box or Haar wavelet filtering. The disadvantage of those filters for several applications is their uniform shape. We describe a method to build Symmetric Weighted Integral Images that are tailored for a variety of kernels, and the process to perform fast filtering with them. By reducing the computational complexity, we obtain a relevant speedup compared to Kernel Integral Images and a large one compared to conventional non-uniform filtering.
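A minimal illustration of the integral-image idea the paper builds on: once the integral image is computed, any axis-aligned box sum takes four lookups, so a box filter costs O(1) per pixel regardless of kernel size. The paper's symmetric weighted variant extends this to non-uniform kernels; only the plain box case is sketched here.

```python
# Box sums via an integral image (summed-area table).
import numpy as np

def integral_image(img):
    # Zero-padded cumulative sum so box_sum needs no boundary checks.
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    # Sum of img[r0:r1, c0:c1] from four integral-image lookups.
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 3, 3))   # -> 30.0, equals img[1:3, 1:3].sum()
```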
Object Discovery using CNN Features in Egocentric Videos (Marc Bolaños Solà)
This document proposes a method for object discovery in egocentric videos using convolutional neural networks (CNN). The method aims to characterize the environment of the person wearing an egocentric camera. It uses an objectness detector to sample object candidates, extracts CNN features to represent objects, and employs a refill strategy and clustering to discover new concepts in an iterative manner. The method is validated on a dataset of 1,000 images labeled with the most frequent objects, outperforming state-of-the-art approaches. Future work includes discovering objects, scenes and people to further characterize the environment.
This document describes the CE 343 Surveying Laboratory course. The course is a 1 credit hour lab that teaches students how to use various surveying instruments through hands-on training and field work. Topics covered include chain surveying, leveling, using a theodolite, tachometry, electronic distance measurements, and plane tables. Assessment is based on projects, a midterm exam, and a final exam. The course aims to train students to apply surveying theories and use principal surveying instruments in civil engineering projects.
1) iMTFA is an incremental approach to few-shot instance segmentation that allows adding new classes without retraining.
2) It extends the MTFA baseline by training an instance feature extractor to generate discriminative embeddings for each instance, with the average embedding used as the class representative.
3) At inference, it predicts classes based on the cosine distance between ROI embeddings and stored class representatives, using class-agnostic box regression and mask prediction.
4) Experiments on COCO, VOC2007 and VOC2012 show iMTFA outperforms SOTA few-shot object detection and instance segmentation methods while enabling incremental class addition.
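The classification step described in 2)-3) can be sketched as follows: each class is stored as the normalized mean of its instance embeddings, and a query embedding is assigned to the most cosine-similar representative. The embedding dimension and data below are illustrative, not taken from the paper.

```python
# Cosine-similarity classification against stored class-mean embeddings.
import numpy as np

rng = np.random.default_rng(1)

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Two classes, a few embeddings each; the class representative is the mean.
class_a = normalize(rng.normal(size=(5, 8)) + np.array([3.0] + [0.0] * 7))
class_b = normalize(rng.normal(size=(5, 8)) - np.array([3.0] + [0.0] * 7))
reps = normalize(np.stack([class_a.mean(0), class_b.mean(0)]))  # (2, 8)

# Adding a new class later is just appending one more mean: no retraining.
query = normalize(np.array([2.5, 0.1, -0.2, 0.0, 0.3, 0.1, 0.0, 0.2]))
scores = reps @ query          # cosine similarity to each representative
print(int(scores.argmax()))    # closest class index
```

The incremental property falls out of the representation: a new class only adds a row to `reps`, leaving the feature extractor untouched.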
Master thesis defence by Manuel Martos-Asensio
Advisors: Horst Eidenberger (Technische Universität Wien) and Xavier Giró-i-Nieto (Universitat Politècnica de Catalunya)
More details
https://imatge.upc.edu/web/publications/keyframe-based-video-summarization-designer
This Final Degree Work extends two previous projects: it improves the video keyframe extraction module of one of them, Designer Master, by integrating the algorithms developed in the other, Object Maps.
First, the proposed solution is explained: a shot detection method in which the input video is sampled uniformly, a cumulative pixel-to-pixel difference is computed, and a classifier decides which frames are keyframes.
Finally, to validate our approach we conducted a user study comparing both applications. Users completed a survey about summaries created with the original application and with the one developed in this project. The results showed that the improved keyframe extraction module slightly improves the application's performance and the quality of the generated summaries.
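The shot-detection step summarized above can be sketched roughly: sample frames, accumulate pixel-to-pixel differences, and emit a keyframe whenever the accumulated change passes a threshold. The threshold and the toy "video" below are illustrative; the thesis uses a trained classifier rather than a fixed threshold.

```python
# Keyframe selection from cumulative frame-to-frame pixel differences.
import numpy as np

def keyframes(frames, threshold):
    keys, acc = [0], 0.0            # first frame is always kept
    for i in range(1, len(frames)):
        acc += np.abs(frames[i] - frames[i - 1]).mean()
        if acc >= threshold:
            keys.append(i)
            acc = 0.0               # reset after emitting a keyframe
    return keys

# Toy video: constant frames with one abrupt "shot change" at index 3.
video = [np.zeros((4, 4))] * 3 + [np.ones((4, 4))] * 3
print(keyframes(video, threshold=0.5))   # -> [0, 3]
```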
Yinyin Liu presents a model for object detection and localization, called Fast-RCNN. She will show how to introduce a ROI pooling layer into neon, and how to add the PASCAL VOC dataset to interface with model training and inference. Lastly, Yinyin will run through a demo on how to apply the trained model to detect new objects.
Review: Structure Boundary Preserving Segmentation for Medical Image with Am... (Dongmin Choi)
Paper title : Structure Boundary Preserving Segmentation for Medical Image with Ambiguous Boundary (CVPR2020)
Paper link : https://openaccess.thecvf.com/content_CVPR_2020/papers/Lee_Structure_Boundary_Preserving_Segmentation_for_Medical_Image_With_Ambiguous_Boundary_CVPR_2020_paper.pdf
This document proposes a novel long-term visual tracking algorithm called FAST that can be used for target following on UAVs. FAST transforms the correlation filter tracker from the frequency domain to the spatial domain to serve as a detector for redetection. It also introduces a coarse-to-fine redetection scheme using generic object proposals for coarse selection followed by a discriminative detector for fine selection, avoiding exhaustive search. Experiments show FAST achieves real-time, automatic, smooth long-term target following on UAVs for both indoor and outdoor scenarios.
Presentation slides for our paper "Combining Adversarial and Reinforcement Learning for Video Thumbnail Selection", ACM ICMR 2021. https://doi.org/10.1145/3460426.3463630.
We developed a new method for unsupervised video thumbnail selection. The developed network architecture selects video thumbnails based on two criteria: the representativeness and the aesthetic quality of their visual content. Training relies on a combination of adversarial and reinforcement learning. The former is used to train a discriminator, whose goal is to distinguish the original from a reconstructed version of the video based on a small set of candidate thumbnails. The discriminator’s feedback is a measure of the representativeness of the selected thumbnails. This measure is combined with estimates about the aesthetic quality of the thumbnails (made using a SoA Fully Convolutional Network) to form a reward and train the thumbnail selector via reinforcement learning. Experiments on two datasets (OVP and Youtube) show the competitiveness of the proposed method against other SoA approaches. An ablation study with respect to the adopted thumbnail selection criteria documents the importance of considering the aesthetics, and the contribution of this information when used in combination with measures about the representativeness of the visual content.
VISUAL ATTENTION BASED KEYFRAMES EXTRACTION AND VIDEO SUMMARIZATION (cscpconf)
Recent developments in digital video and the drastic increase in internet use have increased the number of people searching for and watching videos online. To make video search easier, a summary may be provided along with each video. The summary should be effective enough that the user learns the content of the video without having to watch it fully, and it should consist of key frames that effectively express the content and context of the video. This work suggests a method to extract key frames that express most of the information in the video. This is achieved by quantifying the visual attention each frame commands, using a descriptor called the attention quantifier. The quantification is based on the human attention mechanism, in which color conspicuousness and motion attract more attention. Each frame is therefore assigned an attention parameter based on its color conspicuousness and motion, and key frames are extracted and summarized adaptively according to this value. This framework thus produces a meaningful video summary.
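An illustrative version of the attention-quantifier idea described above: score each frame by a color-conspicuousness term plus a motion term, then rank frames by the score. Both terms and the weighting below are simplistic stand-ins for the paper's descriptor.

```python
# Toy per-frame attention score: color conspicuousness + motion.
import numpy as np

def attention_scores(frames, w_color=0.5, w_motion=0.5):
    scores = []
    prev = frames[0]
    for f in frames:
        color = f.std()                         # crude conspicuousness proxy
        motion = np.abs(f - prev).mean()        # frame-to-frame change
        scores.append(w_color * color + w_motion * motion)
        prev = f
    return np.array(scores)

rng = np.random.default_rng(2)
flat = [np.full((8, 8), 0.5)] * 4               # dull, static frames
busy = [rng.random((8, 8)) for _ in range(2)]   # varied, changing frames
scores = attention_scores(flat + busy)
print(scores.argmax())  # one of the "busy" frames scores highest
```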
Revisiting the Notion of Diversity in Software Testing (Lionel Briand)
The document discusses the concept of diversity in software testing. It provides examples of how diversity has been applied in various testing applications, including test case prioritization and minimization, mutation analysis, and explaining errors in deep neural networks. The key aspects of diversity discussed are the representation of test cases, measures of distance or similarity between cases, and techniques for maximizing diversity. The document emphasizes that the best approach depends on factors like information access, execution costs, and the specific application context.
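One concrete instance of the ingredients listed above (a representation of test cases, a distance measure, and a technique for maximizing diversity) is greedy farthest-point selection: pick the subset of test cases whose minimum pairwise distance is largest. The feature vectors and subset size below are illustrative, not from the talk.

```python
# Greedy diversity maximization over test-case feature vectors.
import numpy as np

def greedy_diverse_subset(vectors, k):
    chosen = [0]                            # seed with the first test case
    while len(chosen) < k:
        # Pick the candidate farthest from its nearest chosen neighbor.
        best, best_dist = None, -1.0
        for i in range(len(vectors)):
            if i in chosen:
                continue
            d = min(np.linalg.norm(vectors[i] - vectors[j]) for j in chosen)
            if d > best_dist:
                best, best_dist = i, d
        chosen.append(best)
    return chosen

tests = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.1], [9.0, 0.0]])
print(greedy_diverse_subset(tests, 3))   # skips the near-duplicate cases
```

Note how the near-duplicates of test 0 are never selected, which is the intuition behind diversity-based prioritization and minimization.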
Fast object re-detection and localization in video for spatio-temporal fragme... (LinkedTV)
Fast object re-detection and localization in video for spatio-temporal fragment creation, Jul. 2013, San Jose, California, USA. Talk provided by Vasileios Mezaris.
Artifacts Detection by Extracting Edge Features and Error Block Analysis from... (Md. Mehedi Hasan)
This document presents a method for edge-based feature extraction to detect artifacts and analyze error patterns in broadcast videos. The method uses edge magnitude and direction features that are less sensitive to noise. Detected error frames are analyzed to classify error blocks based on edge direction, texture content, and shape. Experimental results show the method achieves high accuracy in detecting distorted frames and analyzing error patterns compared to other approaches. Future work will apply the error analysis to video error concealment.
Advances in this technological era have resulted in an enormous pool of information, stored in multiple places globally and in multiple formats. This article presents a methodology for extracting video lectures delivered by experts in the domain of Computer Science using a Generalized Gamma Mixture Model. Feature extraction is based on DCT transformations. To build the model, the data set is pooled from YouTube video lectures in the domain of Computer Science. The generated outputs are evaluated using precision and recall.
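The DCT-based feature step mentioned above might look roughly like this: take the 2-D DCT of a grayscale frame and keep the low-frequency coefficients as a compact feature vector. The frame and the number of kept coefficients are assumptions for illustration; the article's exact pipeline is not reproduced.

```python
# Low-frequency 2-D DCT coefficients as a compact frame feature.
import numpy as np
from scipy.fft import dctn

rng = np.random.default_rng(4)
frame = rng.random((32, 32))          # stand-in for a grayscale video frame

coeffs = dctn(frame, norm="ortho")    # 2-D DCT of the frame
feature = coeffs[:4, :4].flatten()    # keep the 16 lowest-frequency terms
print(feature.shape)                  # -> (16,)
```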
1) The document proposes an automated attendance system using facial recognition and deep learning algorithms to accurately track students' attendance by capturing a single classroom photo, detecting faces, and matching to known student identities.
2) The system aims to design and develop an automated attendance system utilizing facial recognition and deep learning as a more reliable, efficient, and accurate alternative to manual attendance tracking.
3) The proposed method involves data collection, preprocessing images to extract faces, building and training a CNN model on the faces, and testing the model on new images to predict identities and record attendance.
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD) (Saimunur Rahman)
This presentation was prepared for the ViPr Reading Group at Multimedia University, Cyberjaya. Its goal was to make the lab members aware of recent advancements in action recognition.
Fast Non-Uniform Filtering with Symmetric Weighted Integral Imagesdavidmarimon
Oral presentation at IEEE International Conference on Image Processing (ICIP), Hong Kong, September 2010.
Abstract: Non-uniform filters are frequently used in many image processing applications to describe regions or to detect specific features. However, non-uniform filtering is a computationally complex task. This paper presents a method to perform fast non-uniform filtering using a reduced number of memory accesses. The idea is based on integral images which are commonly used for box or Haar wavelet filtering. The disadvantage of those filters for several applications is their uniform shape. We describe a method to build Symmetric Weighted Integral Images that are tailored for a variety of kernels and the process to perform fast filtering with them. We show a relevant speedup when compared to Kernel Integral Images and large when compared to conventional non-uniform filtering by reducing the computational complexity.
Object Discovery using CNN Features in Egocentric VideosMarc Bolaños Solà
This document proposes a method for object discovery in egocentric videos using convolutional neural networks (CNN). The method aims to characterize the environment of the person wearing an egocentric camera. It uses an objectness detector to sample object candidates, extracts CNN features to represent objects, and employs a refill strategy and clustering to discover new concepts in an iterative manner. The method is validated on a dataset of 1,000 images labeled with the most frequent objects, outperforming state-of-the-art approaches. Future work includes discovering objects, scenes and people to further characterize the environment.
This document describes the CE 343 Surveying Laboratory course. The course is a 1 credit hour lab that teaches students how to use various surveying instruments through hands-on training and field work. Topics covered include chain surveying, leveling, using a theodolite, tachometry, electronic distance measurements, and plane tables. Assessment is based on projects, a midterm exam, and a final exam. The course aims to train students to apply surveying theories and use principal surveying instruments in civil engineering projects.
1) iMTFA is an incremental approach to few-shot instance segmentation that allows adding new classes without retraining.
2) It extends the MTFA baseline by training an instance feature extractor to generate discriminative embeddings for each instance, with the average embedding used as the class representative.
3) At inference, it predicts classes based on the cosine distance between ROI embeddings and stored class representatives, using class-agnostic box regression and mask prediction.
4) Experiments on COCO, VOC2007 and VOC2012 show iMTFA outperforms SOTA few-shot object detection and instance segmentation methods while enabling incremental class addition.
Master thesis defence by Manuel Martos-Asensio
Advisors: Horst Eidenberger (Technische Universtität Viena) and Xavier Giró-i-Nieto (Universitat Politècnica de Catalunya)
More details
https://imatge.upc.edu/web/publications/keyframe-based-video-summarization-designer
This Final Degree Work extends two previous projects and consists in carrying out an improvement of the video keyframe extraction module from one of them called Designer Master, by integrating the algorithms that were developed in the other, Object Maps.
Firstly the proposed solution is explained, which consists in a shot detection method, where the input video is sampled uniformly and afterwards, cumulative pixel-to-pixel difference is applied and a classifier decides which frames are keyframes or not.
Last, to validate our approach we conducted a user study in which both applications were compared. Users were asked to complete a survey regarding to different summaries created by means of the original application and with the one developed in this project. The results obtained were analyzed and they showed that the improvement done in the keyframes extraction module improves slightly the application performance and the quality of the generated summaries.
Yinyin Liu presents a model for object detection and localization, called Fast-RCNN. She will show how to introduce a ROI pooling layer into neon, and how to add the PASCAL VOC dataset to interface with model training and inference. Lastly, Yinyin will run through a demo on how to apply the trained model to detect new objects.
Review : Structure Boundary Preserving Segmentation for Medical Image with Am...Dongmin Choi
Paper title : Structure Boundary Preserving Segmentation for Medical Image with Ambiguous Boundary (CVPR2020)
Paper link : https://openaccess.thecvf.com/content_CVPR_2020/papers/Lee_Structure_Boundary_Preserving_Segmentation_for_Medical_Image_With_Ambiguous_Boundary_CVPR_2020_paper.pdf
This document proposes a novel long-term visual tracking algorithm called FAST that can be used for target following on UAVs. FAST transforms the correlation filter tracker from the frequency domain to the spatial domain to serve as a detector for redetection. It also introduces a coarse-to-fine redetection scheme using generic object proposals for coarse selection followed by a discriminative detector for fine selection, avoiding exhaustive search. Experiments show FAST achieves real-time, automatic, smooth long-term target following on UAVs for both indoor and outdoor scenarios.
Presentation slides for our paper "Combining Adversarial and Reinforcement Learning for Video Thumbnail Selection", ACM ICMR 2021. https://doi.org/10.1145/3460426.3463630.
We developed a new method for unsupervised video thumbnail selection. The developed network architecture selects video thumbnails based on two criteria: the representativeness and the aesthetic quality of their visual content. Training relies on a combination of adversarial and reinforcement learning. The former is used to train a discriminator, whose goal is to distinguish the original from a reconstructed version of the video based on a small set of candidate thumbnails. The discriminator’s feedback is a measure of the representativeness of the selected thumbnails. This measure is combined with estimates about the aesthetic quality of the thumbnails (made using a SoA Fully Convolutional Network) to form a reward and train the thumbnail selector via reinforcement learning. Experiments on two datasets (OVP and Youtube) show the competitiveness of the proposed method against other SoA approaches. An ablation study with respect to the adopted thumbnail selection criteria documents the importance of considering the aesthetics, and the contribution of this information when used in combination with measures about the representativeness of the visual content.
VISUAL ATTENTION BASED KEYFRAMES EXTRACTION AND VIDEO SUMMARIZATIONcscpconf
Recent developments in digital video and drastic increase of internet use have increased the
amount of people searching and watching videos online. In order to make the search of the
videos easy, Summary of the video may be provided along with each video. The video summary
provided thus should be effective so that the user would come to know the content of the video
without having to watch it fully. The summary produced should consists of the key frames that
effectively express the content and context of the video. This work suggests a method to extract
key frames which express most of the information in the video. This is achieved by quantifying
Visual attention each frame commands. Visual attention of each frame is quantified using a
descriptor called Attention quantifier. This quantification of visual attention is based on the
human attention mechanism that indicates color conspicuousness and the motion involved seek
more attention. So based on the color conspicuousness and the motion involved each frame is
given a Attention parameter. Based on the attention quantifier value the key frames are extracted and are summarized adaptively. This framework suggests a method to produces meaningful video summary.
Revisiting the Notion of Diversity in Software TestingLionel Briand
The document discusses the concept of diversity in software testing. It provides examples of how diversity has been applied in various testing applications, including test case prioritization and minimization, mutation analysis, and explaining errors in deep neural networks. The key aspects of diversity discussed are the representation of test cases, measures of distance or similarity between cases, and techniques for maximizing diversity. The document emphasizes that the best approach depends on factors like information access, execution costs, and the specific application context.
Fast object re-detection and localization in video for spatio-temporal fragme...LinkedTV
Fast object re-detection and localization in video for spatio-temporal fragment creation, Jul. 2013, San Jose, California, USA. Talk provided by Vasileios Mezaris.
Artifacts Detection by Extracting Edge Features and Error Block Analysis from...Md. Mehedi Hasan
This document presents a method for edge-based feature extraction to detect artifacts and analyze error patterns in broadcast videos. The method uses edge magnitude and direction features that are less sensitive to noise. Detected error frames are analyzed to classify error blocks based on edge direction, texture content, and shape. Experimental results show the method achieves high accuracy in detecting distorted frames and analyzing error patterns compared to other approaches. Future work will apply the error analysis to video error concealment.
The advents in this technological era have resulted into enormous pool of information. This information is
stored at multiple places globally, in multiple formats. This article highlights a methodology for extracting
the video lectures delivered by experts in the domain of Computer Science by using Generalized Gamma
Mixture Model. The feature extraction is based on the DCT transformations. In order to propose the model,
the data set is pooled from the YouTube video lectures in the domain of Computer Science. The outputs
generated are evaluated using Precision and Recall.
1) The document proposes an automated attendance system using facial recognition and deep learning algorithms to accurately track students' attendance by capturing a single classroom photo, detecting faces, and matching to known student identities.
2) The system aims to design and develop an automated attendance system utilizing facial recognition and deep learning as a more reliable, efficient, and accurate alternative to manual attendance tracking.
3) The proposed method involves data collection, preprocessing images to extract faces, building and training a CNN model on the faces, and testing the model on new images to predict identities and record attendance.
Reading group - Week 2 - Trajectory Pooled Deep-Convolutional Descriptors (TDD)Saimunur Rahman
This presentation was prepared for the ViPr Reading Group at Multimedia University, Cyberjaya. Its goal was to make the lab members aware of recent advancements in action recognition.
Fast object re detection and localization in video for spatio-temporal fragme...MediaMixerCommunity
This document describes a method for fast object re-detection and localization in video. The proposed approach improves upon a baseline method through GPU processing, video frame sampling based on shot segmentation, and enhancing robustness to scale variations. Experiments on 6 videos show the complete approach achieves over 99% precision and recall while processing videos 10 times faster than real-time. The method provides accurate, fast detection of predefined objects and could enable interactive linked media applications through object labeling and fragment linking.
IRJET- Study of SVM and CNN in Semantic Concept DetectionIRJET Journal
1) The document discusses approaches for semantic concept detection in videos using techniques like support vector machines (SVM) and convolutional neural networks (CNN).
2) It proposes a concept detection system that uses SVM and CNN together, extracting features from key frames using Hu moments and classifying the features with SVM and CNN.
3) The outputs of SVM and CNN are fused to improve concept detection accuracy compared to using the classifiers individually. Fusing the two classifiers is intended to better identify the concepts in video frames.
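The fusion step described above can be sketched as a simple late fusion: each classifier emits a per-concept confidence score and the two score sets are averaged before thresholding. The concept names, scores, and the 0.5 weight/threshold below are made-up illustration values, not the paper's:

```python
# Hypothetical per-concept confidence scores from two independently
# trained classifiers (SVM and CNN) for one keyframe.
svm_scores = {"car": 0.82, "road": 0.40, "person": 0.10}
cnn_scores = {"car": 0.64, "road": 0.70, "person": 0.05}

def fuse(a, b, w=0.5):
    """Late fusion: weighted average of the two classifiers' scores."""
    return {k: w * a[k] + (1 - w) * b[k] for k in a}

fused = fuse(svm_scores, cnn_scores)
# Report concepts whose fused score clears a detection threshold.
detected = [c for c, s in sorted(fused.items(), key=lambda kv: -kv[1]) if s >= 0.5]
```

Averaging lets a concept that only one classifier is confident about (here "road") still be detected, which is the intended benefit of fusing the two models.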
Synthesizing pseudo 2.5 d content from monocular videos for mixed realityNAVER Engineering
Free-viewpoint video (FVV) is a kind of advanced media that provides a more immersive user experience than traditional media. It allows users to interact with the content by viewing it from any desired viewpoint, and is becoming a next-generation medium.
In creating FVV content, existing systems require complex and specialized capturing equipment and have low end-user usability, because considerable expertise is needed to use them. This is an inconvenience for individuals or small organizations who want to create content; it limits end users' ability to create FVV-based user-generated content (UGC) and inhibits the creation and sharing of varied content.
To tackle these problems, ParaPara is proposed in this work. ParaPara is an end-to-end system that uses a simple yet effective method to generate pseudo-2.5D FVV content from monocular videos, unlike the previously proposed systems. First, the system detects persons from the monocular video through a deep neural network, calculates the real-world homography matrix based on the minimal user interaction, and estimates the pseudo-3D positions of the detected persons. Then, person textures are extracted using general image processing algorithms and placed at the estimated real-world positions. Finally, the pseudo-2.5D content is synthesized from these elements. The content, which is synthesized by the proposed system, is implemented on Microsoft HoloLens; the user can freely place the generated content on the real world and watch it on a free viewpoint.
Semantic Concept Detection in Video Using Hybrid Model of CNN and SVM Classif...CSCJournals
In today's era of digitization and fast internet, many videos are uploaded to websites, and a mechanism is required to access these videos accurately and efficiently. Semantic concept detection achieves this task accurately and is used in many applications such as multimedia annotation, video summarization, indexing and retrieval. Video retrieval based on semantic concepts is an efficient and challenging research area. Semantic concept detection bridges the semantic gap between the low-level features extracted from a key-frame or shot of a video and their high-level interpretation as semantics. It automatically assigns labels to videos from a predefined vocabulary, and the task is treated as a supervised machine learning problem. The support vector machine (SVM) emerged as the default classifier choice for this task, but recently deep convolutional neural networks (CNNs) have shown exceptional performance in this area; CNNs, however, require large datasets for training. In this paper, we present a framework for semantic concept detection using a hybrid model of SVM and CNN. Global features (color moments, HSV histogram, wavelet transform, grey-level co-occurrence matrix and edge orientation histogram) are extracted as low-level features from an annotated ground-truth video dataset from TRECVID. In a second pipeline, deep features are extracted using a pretrained CNN. The dataset is partitioned into three segments to deal with the data-imbalance issue. The two classifiers are trained separately on all segments, and their scores are fused to detect the concepts in the test dataset. System performance is evaluated using Mean Average Precision on the multi-label dataset, and the performance of the proposed hybrid SVM-CNN framework is comparable to existing approaches.
Presentation of the paper titled "Summarizing videos using concentrated attention and considering the uniqueness and diversity of the video frames", by E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras, delivered at the ACM Int. Conf. on Multimedia Retrieval (ICMR’22), Newark, NJ, USA, June 2022. The corresponding software is available at https://github.com/e-apostolidis/CA-SUM.
This document discusses Motaz El Saban's research experience and interests which focus on analyzing, modeling, learning from, and predicting digital media content such as text, images, and speech. Some key areas of research include real-time video stitching, annotating mobile videos, object and activity recognition from videos, and facial expression recognition using deep learning techniques. The document also outlines El Saban's educational background and provides an agenda for his upcoming presentation.
SYNOPSIS on Parse representation and Linear SVM.bhavinecindus
1. The document discusses a thesis on using sparse feature parameterization and multi-kernel SVM for large scale scene classification. The objective is to improve accuracy for large datasets using sparse representations and machine learning algorithms.
2. Key challenges include high dimensionality reducing accuracy for large datasets, nonlinear distributions, and computational costs of deep learning models. The research aims to address these issues.
3. The motivation from literature shows that multi-kernel SVMs have proved effective but could be improved by minimizing redundancy and optimizing kernel parameters for feature sets.
Large-scale Video Classification with Convolutional Neural Net.docxcroysierkathey
Large-scale Video Classification with Convolutional Neural Networks
Andrej Karpathy1,2 George Toderici1 Sanketh Shetty1
[email protected][email protected][email protected]
Thomas Leung1 Rahul Sukthankar1 Li Fei-Fei2
[email protected][email protected][email protected]
1Google Research 2Computer Science Department, Stanford University
http://cs.stanford.edu/people/karpathy/deepvideo
Abstract
Convolutional Neural Networks (CNNs) have been es-
tablished as a powerful class of models for image recog-
nition problems. Encouraged by these results, we pro-
vide an extensive empirical evaluation of CNNs on large-
scale video classification using a new dataset of 1 million
YouTube videos belonging to 487 classes. We study mul-
tiple approaches for extending the connectivity of a CNN
in time domain to take advantage of local spatio-temporal
information and suggest a multiresolution, foveated archi-
tecture as a promising way of speeding up the training.
Our best spatio-temporal networks display significant per-
formance improvements compared to strong feature-based
baselines (55.3% to 63.9%), but only a surprisingly mod-
est improvement compared to single-frame models (59.3%
to 60.9%). We further study the generalization performance
of our best model by retraining the top layers on the UCF-
101 Action Recognition dataset and observe significant per-
formance improvements compared to the UCF-101 baseline
model (63.3% up from 43.9%).
1. Introduction
Images and videos have become ubiquitous on the in-
ternet, which has encouraged the development of algo-
rithms that can analyze their semantic content for vari-
ous applications, including search and summarization. Re-
cently, Convolutional Neural Networks (CNNs) [15] have
been demonstrated as an effective class of models for un-
derstanding image content, giving state-of-the-art results
on image recognition, segmentation, detection and retrieval
[11, 3, 2, 20, 9, 18]. The key enabling factors behind these
results were techniques for scaling up the networks to tens
of millions of parameters and massive labeled datasets that
can support the learning process. Under these conditions,
CNNs have been shown to learn powerful and interpretable
image features [28]. Encouraged by positive results in do-
main of images, we study the performance of CNNs in
large-scale video classification, where the networks have
access to not only the appearance information present in
single, static images, but also their complex temporal evolu-
tion. There are several challenges to extending and applying
CNNs in this setting.
From a practical standpoint, there are currently no video
classification benchmarks that match the scale and variety
of existing image datasets because videos are significantly
more difficult to collect, annotate and store. To obtain suffi-
cient amount of data needed to train our CNN architectures,
we collected a new Sports-1M dataset, which consists of 1
million YouTube videos belonging to a taxonomy ...
Explaining video summarization based on the focus of attentionVasileiosMezaris
Presentation of the paper "Explaining video summarization based on the focus of attention", by E. Apostolidis, G. Balaouras, V. Mezaris, I. Patras, delivered at IEEE ISM 2022, Dec. 2022, Naples, Italy.
In this paper we propose a method for explaining video summarization. We start by formulating the problem as the creation of an explanation mask which indicates the parts of the video that influenced the most the estimates of a video summarization network about the frames' importance. Then, we explain how the typical analysis pipeline of attention-based networks for video summarization can be used to define explanation signals, and we examine various attention-based signals that have been studied as explanations in the NLP domain. We evaluate the performance of these signals by investigating the video summarization network's input-output relationship according to different replacement functions, and utilizing measures that quantify the capability of explanations to spot the most and least influential parts of a video. We run experiments using an attention-based network (CA-SUM) and two datasets (SumMe and TVSum) for video summarization. Our evaluations indicate the advanced performance of explanations formed using the inherent attention weights, and demonstrate the ability of our method to explain the video summarization results using clues about the focus of the attention mechanism.
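The core idea of an attention-based explanation mask can be sketched in a few lines: mark the frames with the highest attention weights as the ones that most influenced the summarizer. The weights and the 25% fraction below are invented illustration values, not the paper's evaluation protocol:

```python
# Hypothetical per-frame attention weights from a summarization network.
attention = [0.02, 0.30, 0.05, 0.25, 0.01, 0.22, 0.10, 0.05]

def explanation_mask(weights, fraction=0.25):
    """Mark the top `fraction` of frames (by attention weight) as influential."""
    k = max(1, round(len(weights) * fraction))
    top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    return [1 if i in top else 0 for i in range(len(weights))]

mask = explanation_mask(attention)  # 1 = frame flagged as influential
```

The paper then evaluates such masks by replacing the flagged (or unflagged) frames and measuring how much the network's importance estimates change.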
Content based video retrieval using discrete cosine transformnooriasukmaningtyas
A content-based video retrieval (CBVR) framework is built in this paper. One of the essential features of the video retrieval process and CBVR is color value. The discrete cosine transform (DCT) is used to extract the query video's features, which are compared with the video features stored in our database. An average result of 0.6475 was obtained using the DCT on the database we created and collected, across all categories. The technique was applied to our database of 100 videos, with 5 videos in each category.
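The retrieval idea can be sketched as follows: take the DCT of a signal derived from each video (here a made-up 1-D intensity sequence), keep the low-frequency coefficients as a compact feature, and rank database videos by distance to the query's feature. The signals, the 4-coefficient cut, and the Euclidean distance are illustration choices, not the paper's exact pipeline:

```python
import math

def dct2(signal):
    """DCT-II of a 1-D signal (e.g., a row of pixel intensities)."""
    n = len(signal)
    return [sum(x * math.cos(math.pi / n * (i + 0.5) * k)
                for i, x in enumerate(signal))
            for k in range(n)]

def feature(signal, keep=4):
    """Keep only the low-frequency coefficients as a compact feature vector."""
    return dct2(signal)[:keep]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = feature([10, 12, 11, 13, 40, 42, 41, 43])
same  = feature([10, 12, 11, 13, 40, 42, 41, 43])   # identical clip
other = feature([90, 10, 85, 12, 95, 11, 88, 9])    # very different clip
```

A matching clip yields zero distance while a dissimilar one scores much higher, which is the ordering a CBVR ranker needs.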
Similar to Query and Keyframe Representations for Ad-hoc Video Search (20)
MOVING: Applying digital science methodology for TVETMOVING Project
This document provides an overview of the MOVING project, which aims to improve digital information management skills through an open interdisciplinary platform. It discusses two use cases: one focused on legal accountants managing compliance information, and the other on junior researchers managing research literature. For each use case, the document describes the objectives, tasks, relevant personas, and a mock-up of the proposed digital tools. It also provides information on designing the tools using a human-centered approach and testing them through the MOVING technology platform. The overall goal is to empower users across different fields and backgrounds to apply data analytics tools in their work.
Learning analytics for reflective learningMOVING Project
This presentation covers how meaningful reflection guidance can be applied to motivate people to improve their own information search behaviour, and how guidance can be provided to follow a pre-defined learning path through a curriculum to improve one's own competence in information literacy.
Unesco mobileweek 2019_frontier_tech_oer-finalMOVING Project
This document discusses open educational resources (OER) and the potential for artificial intelligence (AI) to enhance OER. It describes how AI could be used to automatically ingest, clean, structure and preprocess OER from various sources. User modeling and social network modeling are discussed as ways to enable adaptive learning and personalized education. Several existing and potential AI systems are mentioned, such as tools for competency calculation, semantic processing of OER, and intelligent assistants. The growth of OER over time is shown in a graph. Cyc, an early large-scale AI knowledge base, is also summarized.
Inferring knowledge acquisition through Web navigation behaviourMOVING Project
In this presentation, UMAN explores if there is a way to measure learning and personalise the user learning experience in an unobtrusive manner. The PhD thesis in this paper proposes using data-driven methods to measure learning by mining user interaction data to identify regularities that could be indicators of learning.
The document summarizes the ITI-CERTH participation in TRECVID 2018, including their approaches to ad-hoc video search, instance search, and activity recognition. Their key approaches included using pre-trained deep convolutional neural networks (DCNNs) to extract concepts from videos, late fusion of multiple DCNNs, and combining traditional and DCNN approaches. Their system achieved a maximum infAP of 0.047 for ad-hoc video search and had high accuracy for activity recognition training.
Wissenschaft 2.0 und offene Forschungsmethoden vermitteln– Der MOOC "Science ...MOVING Project
This presentation was held on 24th of January within the lecture "Research Analytics" which is regularly organized by the Saxon State and University Library Dresden. The aim was to present the MOOC "Science 2.0 and open research methods" to researchers and librarians.
VERGE: A Multimodal Interactive Search Engine for Video Browsing and RetrievalMOVING Project
VERGE is an interactive video search engine that enables browsing & retrieval of video/image collections, and includes query submissions & a re-ranking capability.
Effective Unsupervised Author Disambiguation with Relative FrequenciesMOVING Project
The document summarizes research on using agglomerative clustering and probabilistic similarity to perform unsupervised author disambiguation. It presents the problem of linking author mentions that refer to the same real-world author when only the name is given. The method clusters author mentions based on similarity across features like terms, affiliations and co-authors. Experiments on a large dataset show that co-author features are highly important and the stopping criteria can be improved to better balance precision and recall for different problem sizes. Overall, the unsupervised approach works well but leaves room for future parameter and evaluation methodology refinements.
What to read next? Challenges and Preliminary Results in Selecting Represen...MOVING Project
1. The document presents an approach for selecting representative documents from a set of search results to provide users with an overview of the content and subtopics. It compares different document representations, clustering algorithms, and selection methods on two datasets.
2. The evaluation measures of coverage and redundancy were found to be insufficient for accurately evaluating representativeness, as the scores increased with the number of selected documents and were sometimes independent of the actual selection method.
3. The research questions explored how document representation, clustering algorithm, and selection method influence coverage and redundancy, finding the choice of clustering had the largest impact. Coverage and redundancy were found to be inflated and not directly reflect representativeness.
Qualitative Analysis of Vocabulary Evolution on the Linked Open Data CloudMOVING Project
The document analyzes the evolution of vocabularies on the Linked Open Data cloud. It selects 12 major vocabularies and analyzes over 10,000 changes across different versions. The analysis finds that most changes are internal rather than relying on external vocabularies, and that domains like publications see more changes. Understanding how vocabularies evolve can help with maintaining and developing new ontologies over time.
Analyzing the Evolution of Vocabulary Terms and Their Impact on the LOD CloudMOVING Project
Vocabularies are used for modeling data in Knowledge Graphs (KGs) like the Linked Open Data Cloud and Wikidata. During their lifetime, vocabularies are subject to changes: new terms are coined while existing terms are modified or deprecated. We first quantify the amount and frequency of changes in vocabularies. Subsequently, we investigate to which extent and when the changes are adopted in the evolution of KGs. We conduct our experiments on three large-scale KGs for which time-stamped information is available, namely the Billion Triples Challenge datasets, the Dynamic Linked Data Observatory dataset, and Wikidata. Our results show that the change frequency of terms is rather low, but changes can have high impact due to the large amount of distributed graph data on the web. Furthermore, not all coined terms are used, and most of the deprecated terms are still used by data publishers. The adoption time of terms coming from different vocabularies ranges from very fast (a few days) to very slow (a few years). Surprisingly, we could observe some adoptions before the vocabulary changes were published. Understanding the evolution of vocabulary terms is important to avoid wrong assumptions about the modeling status of data published on the web, which may result in difficulties when querying the data from distributed sources.
Keeping Linked Open Data Caches Up-to-date by Predicting the Life-time of RDF...MOVING Project
This document discusses predicting the lifetime of RDF triples in Linked Open Data to help keep LOD caches up-to-date. It presents a method using linear regression to predict triple lifetime based on features like subject, predicate, and object. Evaluating on two datasets, the model predicted lifetime within 10% error. This was then used in a novel crawling strategy that outperformed existing strategies, preferentially updating triples predicted to change soon. The strategy provides an advantage by not requiring additional past data once trained.
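The regression step in the summary above can be sketched with single-feature ordinary least squares: fit lifetime against one numeric triple feature, then predict the lifetime of an unseen triple. The feature (a predicate's observed update count) and the numbers are invented illustrations of the approach, not the paper's data:

```python
# Hypothetical training data: one numeric feature per RDF triple
# (e.g., how often its predicate has changed before) and the triple's
# observed lifetime in days.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [100.0, 80.0, 60.0, 40.0, 20.0]

def fit_ols(xs, ys):
    """Ordinary least squares for a single feature: y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

a, b = fit_ols(xs, ys)
# A triple whose predicate changes often is predicted to die soon.
predicted_lifetime = a + b * 6.0
```

A crawler can then re-fetch triples in ascending order of predicted lifetime, which is the intuition behind the preferential crawling strategy the abstract describes.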
Deep Multi-task Learning with Label Correlation Constraint for Video Concept ...MOVING Project
This document proposes a deep multi-task learning method called DMTL_LC for video concept detection that jointly considers task and label relations. It extends a convolutional neural network with a two-sided network for multi-task learning and adds a label-based constraint to incorporate statistical information about pairwise label correlations. An evaluation on the TRECVID SIN 2013 dataset shows DMTL_LC outperforms other single-task and multi-task baselines, achieving a MXinfAP of 22.60%. The method contributes a deep learning approach for concept detection that leverages relationships between tasks and labels.
Generic to Specific Recognition Models for Membership Analysis in Group VideosMOVING Project
We present a novel two-phase Support Vector Machine (SVM) based specific recognition model that is learned using an optimized generic recognition model.
Dandelion Hashtable: beyond billion requests per second on a commodity serverAntonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables that go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. On a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
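The closed-addressing layout can be illustrated with a toy single-threaded sketch (none of DLHT's lock-freedom, prefetching, or resizing appears here): each bucket is a chain of fixed-size "cache lines" of at most four slots, and a delete frees its slot immediately rather than leaving a tombstone, which is the property open addressing struggles with:

```python
# Toy sketch of closed addressing with bounded chaining. BUCKET_SLOTS
# mimics how many key-value pairs fit in one cache line.
BUCKET_SLOTS = 4

class ChainedTable:
    def __init__(self, n_buckets=8):
        # Each bucket is a list of chained fixed-capacity "cache lines".
        self.buckets = [[] for _ in range(n_buckets)]

    def _chain(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        chain = self._chain(key)
        for line in chain:                      # update in place if present
            for i, (k, _) in enumerate(line):
                if k == key:
                    line[i] = (key, value)
                    return
        for line in chain:                      # reuse a free slot
            if len(line) < BUCKET_SLOTS:
                line.append((key, value))
                return
        chain.append([(key, value)])            # grow the chain by one line

    def get(self, key):
        for line in self._chain(key):
            for k, v in line:
                if k == key:
                    return v
        return None

    def delete(self, key):
        for line in self._chain(key):
            for i, (k, _) in enumerate(line):
                if k == key:
                    del line[i]                 # slot freed instantly, no tombstone
                    return True
        return False

t = ChainedTable()
for i in range(20):
    t.put(i, i * i)
```

Because most lookups touch only the first line of a short chain, the common case is close to one memory access, which is the performance argument the slides make.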
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor IvaniukFwdays
At this talk we will discuss DDoS protection tools and best practices, discuss network architectures and what AWS has to offer. Also, we will look into one of the largest DDoS attacks on Ukrainian infrastructure that happened in February 2022. We'll see, what techniques helped to keep the web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on Ukraine experience
"Choosing proper type of scaling", Olena SyrotaFwdays
Imagine an IoT processing system that is already quite mature and production-ready and for which client coverage is growing and scaling and performance aspects are life and death questions. The system has Redis, MongoDB, and stream processing based on ksqldb. In this talk, firstly, we will analyze scaling approaches and then select the proper ones for our system.
Northern Engraving | Modern Metal Trim, Nameplates and Appliance PanelsNorthern Engraving
What began over 115 years ago as a supplier of precision gauges to the automotive industry has evolved into being an industry leader in the manufacture of product branding, automotive cockpit trim and decorative appliance trim. Value-added services include in-house Design, Engineering, Program Management, Test Lab and Tool Shops.
What is an RPA CoE? Session 2 – CoE RolesDianaGray10
In this session, we will review the players involved in the CoE and how each role impacts opportunities.
Topics covered:
• What roles are essential?
• What place in the automation journey does each role play?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
Conversational agents, or chatbots, are increasingly used to access all sorts of services using natural language. While open-domain chatbots - like ChatGPT - can converse on any topic, task-oriented chatbots - the focus of this paper - are designed for specific tasks, like booking a flight, obtaining customer support, or setting an appointment. Like any other software, task-oriented chatbots need to be properly tested, usually by defining and executing test scenarios (i.e., sequences of user-chatbot interactions). However, there is currently a lack of methods to quantify the completeness and strength of such test scenarios, which can lead to low-quality tests, and hence to buggy chatbots.
To fill this gap, we propose adapting mutation testing (MuT) for task-oriented chatbots. To this end, we introduce a set of mutation operators that emulate faults in chatbot designs, an architecture that enables MuT on chatbots built using heterogeneous technologies, and a practical realisation as an Eclipse plugin. Moreover, we evaluate the applicability, effectiveness and efficiency of our approach on open-source chatbots, with promising results.
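One mutation operator of the kind described above can be sketched on a minimal dict-based chatbot design: delete one training phrase and check whether a test scenario "kills" the mutant by producing a different outcome. The intents, phrases, and operator name are invented illustrations, not the paper's operator set or Eclipse tooling:

```python
import copy

# Hypothetical chatbot design: intents with training phrases and a response.
design = {
    "book_flight": {"phrases": ["book a flight", "buy a plane ticket"],
                    "response": "Where would you like to fly?"},
    "cancel": {"phrases": ["cancel my booking"],
               "response": "Your booking is cancelled."},
}

def answer(design, utterance):
    """Trivial exact-match intent resolver standing in for the chatbot engine."""
    for intent in design.values():
        if utterance in intent["phrases"]:
            return intent["response"]
    return "Sorry, I did not understand."

def delete_phrase_mutant(design, intent, index):
    """Mutation operator: drop one training phrase, emulating a design fault."""
    mutant = copy.deepcopy(design)
    del mutant[intent]["phrases"][index]
    return mutant

# A scenario is (user utterance, expected reply); it kills the mutant
# if the mutant's outcome differs from the expectation.
scenario = ("buy a plane ticket", "Where would you like to fly?")
mutant = delete_phrase_mutant(design, "book_flight", 1)
killed = answer(mutant, scenario[0]) != scenario[1]
```

The fraction of mutants killed by a scenario suite is then a measure of the suite's strength, which is the quantity the paper sets out to provide.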
"$10 thousand per minute of downtime: architecture, queues, streaming and fin...Fwdays
Direct losses from 1 minute of downtime = $5-$10 thousand. Reputation is priceless.
As part of the talk, we will consider the architectural strategies necessary for the development of highly loaded fintech solutions. We will focus on using queues and streaming to efficiently work and manage large amounts of data in real-time and to minimize latency.
We will focus special attention on the architectural patterns used in the design of the fintech system, microservices and event-driven architecture, which ensure scalability, fault tolerance, and consistency of the entire system.
QA or the Highway - Component Testing: Bridging the gap between frontend appl...zjhamm304
These are the slides for the presentation, "Component Testing: Bridging the gap between frontend applications" that was presented at QA or the Highway 2024 in Columbus, OH by Zachary Hamm.
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...Alex Pruden
Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol based on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme. The difficulty is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security.
Paper Link: https://eprint.iacr.org/2024/257
From Natural Language to Structured Solr Queries using LLMsSease
This talk draws on experimentation to enable AI applications with Solr. One important use case is to use AI for better accessibility and discoverability of the data: while user experience techniques, lexical search improvements, and data harmonization can take organizations to a good level of accessibility, a structural (or "cognitive") gap remains between data users' needs and data producers' constraints.
That is where AI – and most importantly, Natural Language Processing and Large Language Model techniques – could make a difference. This natural language, conversational engine could facilitate access and usage of the data leveraging the semantics of any data source.
The objective of the presentation is to propose a technical approach and a way forward to achieve this goal.
The key concept is to enable users to express their search queries in natural language, which the LLM then enriches, interprets, and translates into structured queries based on the Solr index’s metadata.
This approach leverages the LLM’s ability to understand the nuances of natural language and the structure of documents within Apache Solr.
The LLM acts as an intermediary agent, offering a transparent experience to users automatically and potentially uncovering relevant documents that conventional search methods might overlook. The presentation will include the results of this experimental work, lessons learned, best practices, and the scope of future work that should improve the approach and make it production-ready.
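The translation step described above can be caricatured with a rule-based stand-in for the LLM (a real system would prompt the model with the index's field metadata instead). The field names, the "recent" cutoff year, and the rules below are all invented illustrations of the natural-language-to-structured-query idea:

```python
# Mock stand-in for the LLM step: turn a natural-language request into
# Solr-style q/fq parameters. FIELDS plays the role of index metadata.
FIELDS = {"title": "text", "year": "int", "author": "text"}

def to_solr_query(nl_request):
    tokens = nl_request.lower().split()
    fq = []
    if "recent" in tokens:                     # temporal qualifier -> range filter
        fq.append("year:[2020 TO *]")
        tokens.remove("recent")
    if "by" in tokens:                         # "by <name>" -> author filter
        i = tokens.index("by")
        fq.append("author:" + tokens[i + 1])
        tokens = tokens[:i] + tokens[i + 2:]
    return {"q": "title:(" + " ".join(tokens) + ")", "fq": fq}

query = to_solr_query("recent ranking papers by smith")
```

An LLM replaces the brittle token rules with genuine language understanding, but the output contract is the same: structured `q` and `fq` parameters that Solr can execute directly.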
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
Session 1 - Intro to Robotic Process Automation.pdfUiPathCommunity
👉 Check out our full 'Africa Series - Automation Student Developers (EN)' page to register for the full program:
https://bit.ly/Automation_Student_Kickstart
In this session, we shall introduce you to the world of automation, the UiPath Platform, and guide you on how to install and setup UiPath Studio on your Windows PC.
📕 Detailed agenda:
What is RPA? Benefits of RPA?
RPA Applications
The UiPath End-to-End Automation Platform
UiPath Studio CE Installation and Setup
💻 Extra training through UiPath Academy:
Introduction to Automation
UiPath Business Automation Platform
Explore automation development with UiPath Studio
👉 Register here for our upcoming Session 2 on June 20: Introduction to UiPath Studio Fundamentals: https://community.uipath.com/events/details/uipath-lagos-presents-session-2-introduction-to-uipath-studio-fundamentals/
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...Jason Yip
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
"Scaling RAG Applications to serve millions of users", Kevin GoedeckeFwdays
How we managed to grow and scale a RAG application from zero to thousands of users in 7 months. Lessons from technical challenges around managing high load for LLMs, RAGs and Vector databases.
Leveraging the Graph for Clinical Trials and Standards
Query and Keyframe Representations for Ad-hoc Video Search
Query and Keyframe Representations for Ad-hoc Video Search
Foteini Markatopoulou1,2, Damianos Galanopoulos1, Vasileios Mezaris1, Ioannis Patras2
1Information Technologies Institute (ITI), CERTH, Thessaloniki, Greece; 2Queen Mary University of London, London, UK
Problem and motivation
• Ad-hoc video search: retrieving, from a large video collection, video fragments (keyframes) that are related to a given query
• Typical solution: treat the query as a set of simple terms
• Motivation:
  o Detecting the most useful parts of the query, e.g., subsequences that contain the main content that the user asks to retrieve
  o Combining two different measures of the distance between the video shots and the target query, calculated on concept-based and semantic embedding representations
Method outline
• (a) Concept-based keyframe representation: apply a DCNN to every keyframe
• (c) Concept-based query representation: translate the query into a set of related concepts using NLP
• (b) Semantic embeddings for the concept-based query and keyframe representations: project both into a given semantic embedding space
• Each concept is enriched with additional information retrieved from Google or Wikipedia [20]
• An inverted index structure is used to associate the query with the concepts [4]
• A semi-automatic system [21], where the user is asked to choose keywords given a test query
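The inverted-index idea above can be illustrated with a minimal sketch (not the authors' code): each word appearing in a concept's title points to that concept, so a free-text query can be mapped to the set of detectable concepts it mentions. The concept names and the example query below are illustrative assumptions.

```python
# Minimal sketch of an inverted-index lookup that associates query words
# with detectable concepts, as in step (c). Concept titles are toy values.
from collections import defaultdict

concepts = {0: "road vehicle", 1: "indoor sports", 2: "snow mountain"}

# Build the inverted index: word -> set of concept ids whose title contains it.
index = defaultdict(set)
for cid, title in concepts.items():
    for word in title.split():
        index[word].add(cid)

def query_to_concepts(query):
    """Return the ids of all concepts that share a word with the query."""
    related = set()
    for word in query.lower().split():
        related |= index.get(word, set())
    return sorted(related)

print(query_to_concepts("a vehicle driving on a snow covered road"))  # → [0, 2]
```

A real system would additionally match the enriched concept descriptions (Google/Wikipedia text), not just the concept titles.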
Proposed solution
• The two distances are combined by taking their arithmetic mean
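The combination step can be sketched as follows. This is an illustrative toy example, not the authors' exact pipeline: the keyframe and the query are each represented as a vector of concept scores, and also projected into a semantic embedding space (here, as a score-weighted sum of per-concept word vectors, which is an assumption); the final distance is the arithmetic mean of the two cosine distances.

```python
# Illustrative sketch of combining a concept-space distance and an
# embedding-space distance by their arithmetic mean. All vectors are toy values.
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy word2vec-style vectors, one row per concept, dimension 4.
concept_embeddings = np.array([[0.9, 0.1, 0.0, 0.2],
                               [0.0, 0.8, 0.3, 0.1],
                               [0.2, 0.0, 0.9, 0.4]])

def embed(concept_scores):
    """Project a concept-score vector into the semantic embedding space."""
    return concept_scores @ concept_embeddings

keyframe_scores = np.array([0.7, 0.1, 0.6])  # DCNN concept-detection scores
query_scores    = np.array([1.0, 0.0, 1.0])  # concepts selected for the query

d_concept  = cosine_distance(keyframe_scores, query_scores)          # (a)/(c)
d_semantic = cosine_distance(embed(keyframe_scores), embed(query_scores))  # (b)
final_distance = (d_concept + d_semantic) / 2.0  # arithmetic mean
```

Keyframes are then ranked by `final_distance` in ascending order, so shots that agree with the query under both representations come first.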
• Datasets: TRECVID AVS 2016, TRECVID Video Search 2008
  o Test sets: 600 and 100 hours of video, respectively
• Evaluated queries: 30 and 48, respectively
• Evaluation measure: MXinfAP (%)

A) Parameter selection: MXinfAP (%) on the AVS16 dataset, investigating the parameters of the proposed method; (a) the transformation to the semantic embedding space is ignored; (b) the final distance from the query is calculated solely in the semantic embedding space; (c) the complete process is used, i.e., the final distance is the mean of the distances calculated in (a) and (b)

B) Comparisons: MXinfAP (%) for the different compared AVS methods
• Semantic embeddings: pre-trained Google News corpus word2vec model
• Keyframe representation: 1346 concepts
  o 1000 ImageNet concepts extracted using 5 pre-trained ImageNet DCNNs, fused in terms of arithmetic mean
  o 346 TRECVID SIN concepts extracted using 2 fine-tuned DCNNs, again fused
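The score fusion described above is a plain arithmetic mean over the detectors. A minimal sketch, with toy score values (here 3 nets and 4 concepts instead of the 5 nets and 1000 concepts used in the paper):

```python
# Concept scores from several DCNNs applied to the same keyframe are
# combined by their arithmetic mean. The score matrix is a toy example.
import numpy as np

# scores[i, j] = score of concept j according to DCNN i (3 nets, 4 concepts)
scores = np.array([[0.9, 0.1, 0.4, 0.0],
                   [0.7, 0.2, 0.5, 0.1],
                   [0.8, 0.0, 0.6, 0.2]])

fused = scores.mean(axis=0)  # one fused score per concept
print(fused)
```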
                                   All  |       Excluding one step
Steps                                   | step 1  step 2  step 3  step 4  step 5
(a) Concept-based representation  5.94  |  5.92    5.74    3.96    5.95    4.53
(b) Semantic embeddings           3.77  |  3.86    2.98    3.22    3.75    2.80
(c) Combination                   6.35  |  6.51    5.77    4.37    6.27    4.99
Methods                                AVS16                    VS08
(a) Literature methods
    Tzelepis et al. [20]               4.16                     8.27
    Ueki et al. [21]                   5.65                     7.24
    Norouzi et al. [15]                3.14                     7.30
(b) Top-4 TRECVID finalists
    Top-1  Le et al. [4]               5.4    Tang et al.       6.7
    Top-2  Markatopoulou et al. [13]   5.1    Snoek et al.      5.4
    Top-3  Liang et al. [6]            4.0    Ngo et al.        4.2
    Top-4  Zhang et al. [23]           3.8    Mei et al.        4.1
Proposed                               6.35                     9.11