VERGE: A Multimodal Interactive Search Engine for Video Browsing and Retrieval
Stelios Andreadis1, Anastasia Moumtzidou1, Damianos Galanopoulos1, Foteini Markatopoulou1,2, Konstantinos Apostolidis1, Thanassis Mavropoulos1, Ilias Gialampoukidis1, Stefanos Vrochidis1, Vasileios Mezaris1, Ioannis Kompatsiaris1, and Ioannis Patras2
1Information Technologies Institute, CERTH, Thessaloniki, Greece
2School of Electronic Engineering and Computer Science, QMUL, UK
Introduction
VERGE is an interactive video search engine
Enables browsing & retrieval of video/image collections
Includes query submissions & a re-ranking capability
Friendly and efficient graphical user interface (GUI)
VERGE supports Known Item Search (KIS), Instance Search (INS) & Ad-Hoc Video Search (AVS) tasks
Participation in video-oriented benchmarks and workshops:
TRECVID (2007 – 2018)
Video Browser Showdown – VBS (2014 – 2019)

VERGE GUI
Interface & Interaction Modes
Essential improvements over the new version of the VERGE GUI that was introduced at VBS 2018
Navigation is limited to a single page for efficiency
A variety of retrieval modalities, mostly offered in a dashboard menu
Shot-based or video-based representation of results
The complete set of shots of a video can be displayed as a film-strip
Built with common web technologies, RESTful web services and a MongoDB database
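As an illustration of this stack, here is a minimal sketch of a RESTful endpoint that could serve the film-strip view from MongoDB; Flask, the "shots" collection and its field names are assumptions for illustration, not VERGE's actual schema.

    # Hypothetical film-strip endpoint (Flask + pymongo); names are illustrative.
    from flask import Flask, jsonify
    from pymongo import MongoClient

    app = Flask(__name__)
    db = MongoClient("mongodb://localhost:27017")["verge"]

    @app.route("/videos/<video_id>/shots")
    def film_strip(video_id):
        # Return all shots of a video, ordered by time, for the film-strip view.
        shots = db.shots.find({"video_id": video_id}).sort("start_time")
        return jsonify([{"shot_id": s["shot_id"], "keyframe": s["keyframe_url"]}
                        for s in shots])

    if __name__ == "__main__":
        app.run()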

System: Indexing & Retrieval Modules

Concept-based Retrieval
1000 ImageNet concepts - late fusion of 4 different state-of-the-art ImageNet-pretrained Deep Convolutional Neural Networks (DCNNs)
345 TRECVID SIN concepts - a ResNet DCNN pre-trained on ImageNet and fine-tuned on these concepts; the 323 top-performing concepts are selected
500 event-related concepts - using an existing DCNN fine-tuned on the EventNet dataset
365 place-related concepts - using an existing DCNN fine-tuned on the Places dataset
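A minimal sketch of what such late fusion could look like; averaging the per-network concept scores is an assumption here, since the text does not name the fusion operator.

    # Late fusion of concept scores from several DCNNs (averaging is assumed).
    import numpy as np

    def late_fusion(score_vectors):
        """Average the 1000-dim ImageNet concept scores of multiple DCNNs."""
        return np.mean(np.stack(score_vectors), axis=0)

    # e.g. scores from 4 different networks for one keyframe (random stand-ins):
    scores = [np.random.rand(1000) for _ in range(4)]
    fused = late_fusion(scores)
    top5 = np.argsort(fused)[::-1][:5]   # indices of the 5 strongest concepts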

Visual Similarity Search
Train GoogleNet on 5055 ImageNet concepts
Use the last pooling layer (length 1024) as the global key-frame representation
Retrieval:
Create an asymmetric distance computation index of the database vectors for fast image retrieval
Use K-Nearest Neighbours to find images visually similar to the query image
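A minimal sketch of asymmetric-distance-computation (ADC) indexing and k-NN search using FAISS; the library choice and index parameters are assumptions, but IVFPQ search does compute asymmetric distances between raw queries and product-quantized database vectors.

    # ADC index + k-NN search over 1024-dim GoogleNet pooling features (sketch).
    import numpy as np
    import faiss

    d = 1024                                           # key-frame descriptor length
    xb = np.random.rand(10000, d).astype("float32")    # database vectors (stand-in)
    xq = np.random.rand(1, d).astype("float32")        # query image descriptor

    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFPQ(quantizer, d, 256, 64, 8) # 256 lists, 64 sub-quantizers
    index.train(xb)
    index.add(xb)

    distances, neighbours = index.search(xq, 10)       # 10 visually nearest images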

Multimodal Fusion & Search
Consider high-level visual concepts:
Use 20 concepts to represent each video frame
Create a video concept vector by taking the sum or the product over its frames
Use the Cosine/Euclidean distance to calculate the distance among videos
Fast retrieval is achieved by pre-calculating all distances across the video collection
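A minimal sketch under these assumptions; the top-20 truncation follows the text, while the concept dimensionality and the choice of sum over product are illustrative.

    # Build video-level concept vectors and pre-compute pairwise cosine distances.
    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def video_vector(frame_concepts, keep=20):
        """Sum frame concept vectors, keeping the top `keep` concepts per frame."""
        vecs = []
        for f in frame_concepts:                 # one row of scores per frame
            v = np.zeros_like(f)
            top = np.argsort(f)[::-1][:keep]
            v[top] = f[top]                      # zero out all but the top-20
            vecs.append(v)
        return np.sum(vecs, axis=0)              # the text also allows the product

    videos = [video_vector(np.random.rand(30, 345)) for _ in range(100)]
    dist_matrix = squareform(pdist(np.stack(videos), metric="cosine"))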

Clustering

Image/Shot color clustering
Extraction of the MPEG-7 Color Layout descriptor from all frames
Mapping of frames to a color of an 8-color palette using the Euclidean distance
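A minimal sketch of the palette assignment; an off-the-shelf MPEG-7 Color Layout extractor is assumed to exist, and the mean RGB value below is only a simplified stand-in for that descriptor.

    # Map a frame to the nearest of 8 palette colors via Euclidean distance.
    import numpy as np

    PALETTE = np.array([[0, 0, 0], [255, 255, 255], [255, 0, 0], [0, 255, 0],
                        [0, 0, 255], [255, 255, 0], [0, 255, 255], [255, 0, 255]])

    def palette_colour(frame_rgb):
        """Assign a frame (H x W x 3 array) to the nearest palette color."""
        descriptor = frame_rgb.reshape(-1, 3).mean(axis=0)  # stand-in descriptor
        return int(np.argmin(np.linalg.norm(PALETTE - descriptor, axis=1)))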
Video clustering into topics
Exploit textual metadata (description and tags) per video
Topic modeling using Latent Dirichlet Allocation (LDA)
The most frequent terms per topic are presented
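A minimal sketch with scikit-learn; the text only names LDA, so the concrete implementation and parameters here are assumptions.

    # LDA topic modeling over video metadata; prints the top terms per topic.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    metadata = ["sunset beach waves surfing", "goal football match stadium",
                "beach holiday sea sand", "league match player goal"]

    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(metadata)

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    terms = vectorizer.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = [terms[i] for i in topic.argsort()[::-1][:4]]
        print(f"topic {k}: {top}")       # most frequent terms per topic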

Text-based Search
Leverage semantic features of the text metadata (title, description, tags, category) of each video, exploiting online databases (WordNet, BabelNet) to replace terms with their respective semantic concepts
Semantically adjacent terms offer increased versatility in the retrieved results of a user query
Combine Apache Lucene for text normalization with MongoDB for data storage, indexing and retrieval
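A minimal sketch of term expansion plus a MongoDB text query; the "videos" collection and its field names are hypothetical, and the BabelNet lookup and Lucene normalization steps are omitted.

    # WordNet-based query expansion + MongoDB full-text search (sketch).
    # Requires nltk.download("wordnet") once before use.
    from nltk.corpus import wordnet as wn
    from pymongo import MongoClient

    def expand(term):
        """Collect the term plus the lemmas of its WordNet synsets."""
        lemmas = {term}
        for synset in wn.synsets(term):
            lemmas.update(lem.name().replace("_", " ") for lem in synset.lemmas())
        return lemmas

    db = MongoClient()["verge"]
    db.videos.create_index([("title", "text"), ("description", "text"),
                            ("tags", "text")])
    query = " ".join(expand("car"))      # e.g. adds "auto", "automobile", ...
    results = db.videos.find({"$text": {"$search": query}})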

Automatic Query Formulation & Expansion
Translation of the query into a set of high-level concepts C_Q
Decomposition of the query into elementary “sub-queries”
“Sub-queries” are created using POS and NER tagging, string matching and Noun Phrase extraction
Semantic relatedness via Explicit Semantic Analysis (ESA) for every “sub-query”-concept pair
Selection of the most closely related concepts for each sub-query
Finally, C_Q is filled with the concepts that describe the input query
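A minimal sketch of the sub-query extraction step with spaCy; the ESA relatedness computation is only stubbed out, since it requires a Wikipedia-derived concept space that is not reproduced here.

    # Sub-query extraction via noun phrases, named entities and POS tags (sketch).
    import spacy

    nlp = spacy.load("en_core_web_sm")

    def sub_queries(query):
        doc = nlp(query)
        subs = {chunk.text for chunk in doc.noun_chunks}      # noun phrases
        subs.update(ent.text for ent in doc.ents)             # named entities
        subs.update(t.text for t in doc if t.pos_ == "VERB")  # action terms
        return subs

    def esa_relatedness(sub_query, concept):
        raise NotImplementedError("Explicit Semantic Analysis stub")

    for sq in sub_queries("a man riding a red bicycle in Paris"):
        print(sq)   # each sub-query is then matched to its closest concepts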

Contact point: Stefanos Vrochidis (stefanos@iti.gr)
URL: http://mklab-services.iti.gr/vbs2019