Interactive Video Search:
Where is the User
in the Age of Deep Learning?
Klaus Schoeffmann1, Werner Bailer2, Jakub Lokoc3, Cathal Gurrin4, George Awad5
Tutorial at ACM Multimedia 2018, Seoul
1…Klagenfurt University, Klagenfurt, Austria
2…JOANNEUM RESEARCH, Graz, Austria
3…Charles University, Prague, Czech Republic
4…Dublin City University, Dublin, Ireland
5…National Institute of Standards and Technology, Gaithersburg, USA
Recommended Readings
On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017. J. Lokoc, W. Bailer, K. Schoeffmann,
B. Muenzer, G. Awad, IEEE Transactions on Multimedia, 2018
Interactive video search tools: a detailed analysis of the video browser showdown 2015. Claudiu Cobârzan, Klaus Schoeffmann,
Werner Bailer, Wolfgang Hürst, Adam Blazek, Jakub Lokoc, Stefanos Vrochidis, Kai Uwe Barthel, Luca Rossetto. Multimedia
Tools Appl. 76(4): 5539-5571 (2017).
G. Awad, A. Butt, J. Fiscus, M. Michel, D. Joy, W. Kraaij, A. F. Smeaton, G. Quenot, M. Eskevich, R. Ordelman, G. J. F. Jones, and
B. Huet, “Trecvid 2017: Evaluating ad-hoc and instance video search, events detection, video captioning and hyperlinking,” in
Proceedings of TRECVID 2017, NIST, USA, 2017.
TOC 1
1. Introduction (20 min) [KS]
a. General introduction
b. Automatic vs. interactive video search
c. Where deep learning fails
d. The need for evaluation campaigns
2. Interactive video search tools (40 min) [JL]
a. Demo: VIRET (1st place at VBS2018)
b. Demo: ITEC (2nd place at VBS2018)
c. Demo: DCU Lifelogging Search Tool 2018
d. Other tools and open source software
3. Evaluation approaches (30 min) [KS]
a. Overview of evaluation approaches
b. History of selected evaluation campaigns
c. TRECVID
d. Video Browser Showdown (VBS)
e. Lifelog Search Challenge (LSC)
TOC 2
4. Task design and datasets (30 min) [KS]
a. Task types (known item search, retrieval, etc.)
b. Trade-offs: modelling real-world tasks and controlling conditions
c. Data set preparation and annotations
d. Available data sets
5. Evaluation procedures, results and metrics (30 min) [JL]
a. Repeatability
b. Modelling real-world tasks and avoiding bias
c. Examples from evaluation campaigns
6. Lessons learned from evaluation campaigns (20 min) - [JL]
a. Interactive exploration or query-and-browse?
b. How much does deep learning help in interactive settings?
c. Future challenges
7. Conclusions
a. Where is the user in the age of deep learning?
1. Introduction
Let’s Look Back a Few Years...
[Marcel Worring et al., "Where Is the User in Multimedia Retrieval?", IEEE Multimedia, Vol. 19, No. 4, Oct.-Dec. 2012, pp. 6-10]
Let’s Look Back a Few Years...
● A few statements/findings:
○ Many solutions are developed without having an explicitly defined real-world problem to
solve.
○ Performance measures focus on the quality of how we answer a query.
○ MAP has become the primary target for many researchers.
○ It is certainly weird to use MAP alone when talking about users employing multimedia
retrieval to solve their search problems.
○ As a consequence of MAP’s dominance, the field has shifted its focus too much toward
answering a query.
“Thus a better understanding of what users actually want and do
when using multimedia retrieval is needed.”
[Marcel Worring et al., "Where Is the User in Multimedia Retrieval?", IEEE Multimedia, Vol. 19, No. 4, Oct.-Dec. 2012, pp. 6-10]
How Would You Search for These Images?
How to describe the special atmosphere, the artistic content, the mood?
by marfis75
“An image tells a thousand words.”
How Would You Search for This Video Scene?
What Users Might Want...
Shortcomings of Fully Automatic Video Retrieval
● Works well if
○ Users can properly describe their needs
○ System understands search intent of users
○ There is no polysemy and no context variation
○ Content features can sufficiently describe visual content
○ Computer vision (e.g., CNN) can accurately detect semantics
● Unfortunately, for real-world problems rarely true!
“Query-and-browse results” approach
Performance of Video Retrieval
● Typically based on MAP
○ Computed for a specific query set and dataset
○ Results are still quite low (even in the age of deep learning!)
○ Also, results can vary heavily from one dataset to another, and from one query set to another
○ Example: TRECVID Ad-hoc Video Search (AVS) – automatic runs only
               2016   2017   2018
Teams             9      8     10
Runs             30     33     33
Min xInfAP        0  0.026  0.003
Max xInfAP    0.054  0.206  0.121
Median xInfAP 0.024  0.092  0.058
Dataset: IACC.3, 30 queries per year
Deep Learning Can Fail Easily
[J. Su, D.V. Vargas, and K. Sakurai. One pixel attack for fooling neural networks. 2018. arXiv]
How to deal with noisy
data/videos?
Deep Learning Can Fail Easily
Output of YOLO v2
Andrew Ng, in his talk "Artificial Intelligence is the New Electricity":
"Anything a typical human can do with < 1 s of thought we can probably now or soon automate with AI"
Deep Learning Can Fail Easily
Nguyen A, Yosinski J, Clune J. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In Computer Vision and
Pattern Recognition (CVPR '15), IEEE, 2015
The Power of Human Computation
Example from the Video Browser Showdown 2015:
System X: shot and scene detection, concept
detection (SIFT, VLAD, CNNs), similarity search.
System Y: tiny thumbnails only, powerful user.
Outperformed system X and was finally ranked 3rd!
Moumtzidou, Anastasia, et al. "VERGE in VBS 2017." International Conference on Multimedia
Modeling. Springer, Cham, 2017.
Hürst, Wolfgang, Rob van de Werken, and Miklas Hoet. "A storyboard-based interface for mobile
video browsing." International Conference on Multimedia Modeling. Springer, Cham, 2015.
Interactive Video Retrieval Approach
● Assume a smart and interactive user
○ That knows about the challenges and shortcomings of simple querying
○ But might also know how to circumvent them
○ Could be a digital native!
● Give him/her full control over the search process
○ Provide many query and interaction features
■ Querying, browsing, navigation, filtering, inspecting/watching
● Assume an iterative/exploratory search process
○ Search - Inspect - Think - Repeat
○ “Will know it when I see it”
○ Could include many iterations!
○ Instead of “query-and-browse results”
What Users Might Need...
(Screenshot annotations: concept search, browsing features, motion sketch, search history)
Hudelist, Marco A., Christian Beecks, and Klaus
Schoeffmann. "Finding the chameleon in your
video collection." Proceedings of the 7th
International Conference on Multimedia
Systems. ACM, 2016.
Typical Query Types of Video Retrieval Tools
● Query-by-text
○ Enter keywords to match with available or extracted text (e.g., metadata, OCR, ASR, concepts, objects...)
● Query-by-concept
○ Show content for a specific class/category from concept detection (e.g., from ImageNet)
● Query-by-example
○ Provide example image/scene/sound
● Query-by-filtering
○ Filter content by some metadata or content feature (time, color, edge, motion, …)
● Query-by-sketch
○ Provide sketch of image/scene
● Query-by-dataset-example
○ Look for similar but other results
● Query-by-exploration
○ Start by looking around / browsing
○ Needs appropriate visualization
● Query-by-inspection
○ Inspect single clips, navigate
Search in multimedia content (particularly video) is
a highly interactive process!
Users want to look around, try different query features,
inspect results, refine queries, and start all over again!
Automatic
Interactive
Evaluation of Interactive Video Retrieval
● Interfaces are inherently developed for human users
● Every user might be different
○ Different culture, knowledge, preferences, experiences, ...
○ Even the same user at a different time
● Video search interfaces need to be evaluated with real users...
○ No simulations!
○ User studies and campaigns (TRECVID, MediaEval, VBS, LSC)!
○ Find out how well users perform with a specific system
● ...and with real data!
○ Real videos “in the wild” (e.g., IACC.1 and V3C dataset)
○ Actual queries that would make sense in practice
○ Comparable evaluations (same data, same conditions, etc.)
International competitions
Datasets
Only same dataset, query, time, room/condition, ...
...allows for true comparative evaluation!
Where is the User in the Age of Deep Learning?
2. Interactive
Video Search Tools
Common architecture, components and top ranked tools
What are the basic video preprocessing steps?
What models are used?
Where does interactive search help?
Common Architecture
Common Architecture - Temporal Segmentation
M. Gygli. Ridiculously Fast Shot Boundary Detection with Fully
Convolutional Neural Networks. https://arxiv.org/pdf/1705.08214.pdf
1. Compute a score based on a distance of frames
2. Threshold-based decision (fixed/adaptive)
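A minimal sketch of these two steps, assuming OpenCV is available; the video path, the histogram descriptor and the fixed threshold are illustrative choices, not the method of a particular tool.

import cv2

def detect_shot_boundaries(path, threshold=0.5):
    # 1. Compute a score based on a distance of consecutive frames
    # 2. Threshold-based decision (fixed threshold here; adaptive is also common)
    cap = cv2.VideoCapture(path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Color histogram as a cheap frame descriptor (learned features work too)
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            dist = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if dist > threshold:          # large jump between frames -> cut candidate
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries

boundaries = detect_shot_boundaries("video.mp4")   # hypothetical video file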
Common Architecture - Semantic Search
Classification and embedding by
popular Deep CNNs
AlexNet (A. Krizhevsky et al., 2012)
GoogLeNet (Ch. Szegedy et al., 2015)
ResNet (K. He et al., 2015)
NasNet (B. Zoph et al., 2018)
...
Object detectors appear too (YOLO, SSD)
Joint embedding models? VQA?
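As an illustration of embedding-based semantic search, here is a sketch assuming PyTorch/torchvision and a pretrained ResNet; the keyframe and query file names are placeholders, and this is not the pipeline of any specific tool. Keyframes and a query image are mapped into the same feature space and ranked by cosine similarity.

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.resnet50(pretrained=True)
model.fc = torch.nn.Identity()   # drop the classifier head, keep the embedding
model.eval()

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

def embed(path):
    with torch.no_grad():
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        return torch.nn.functional.normalize(model(x), dim=1)

keyframes = ["kf_001.jpg", "kf_002.jpg"]           # hypothetical keyframe files
index = torch.cat([embed(p) for p in keyframes])    # one embedding row per keyframe
query = embed("query.jpg")                          # query-by-example image
scores = (index @ query.T).squeeze(1)               # cosine similarity (unit vectors)
ranking = [keyframes[i] for i in scores.argsort(descending=True).tolist()]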
Common Architecture - Sketch based Search
Sketches from memory
Just part of the scene
Edges often do not match
Colors often do not match
=> invariance needed
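A minimal sketch of a tolerant color-sketch matcher in the spirit of the points above, assuming OpenCV/NumPy; the 4x3 grid, the Lab color space and the file names are illustrative assumptions. Averaging colors over coarse cells discards the fine detail that sketches from memory get wrong.

import cv2
import numpy as np

GRID = (4, 3)   # coarse 4x3 grid of average colors

def color_signature(path):
    img = cv2.imread(path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2LAB).astype(np.float32)
    # Averaging inside each cell gives tolerance to imprecise sketches;
    # Lab distances roughly follow perceived color differences
    return cv2.resize(img, GRID, interpolation=cv2.INTER_AREA).reshape(-1, 3)

def sketch_distance(sig_a, sig_b):
    # Mean per-cell color distance; lower is more similar
    return float(np.mean(np.linalg.norm(sig_a - sig_b, axis=1)))

query = color_signature("sketch.png")               # hypothetical sketch image
ranking = sorted(["kf_001.jpg", "kf_002.jpg"],
                 key=lambda kf: sketch_distance(query, color_signature(kf)))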
Common Architecture - Limits
● Used ranking models have their limits
○ Missed frames
○ Wrong annotation
○ Inaccurate similarity function
● Still, finding a shot of a class is often easy
(see later), but finding one particular shot,
or all shots of a class, is much harder
T. Soucek. Known-Item Search in Image Datasets Using
Automatically Detected Keywords. BC thesis, 2018.
Common Architecture at VBS - Interactive Search
Hudelist & Schoeffmann. An Evaluation of Video Browsing
on Tablets with the ThumbBrowser. MMM2017
Goeau et al., Table of Video Content, ICME 2007
Aspects of Flexible Interactive Video Search
VIRET tool (Winner of VBS 2018, 3rd at LSC 2018)
Filters
Query by text
Query by color
Query by image
Video player
Top ranked frames by a query
Representative frames from the selected video
Frame-based retrieval system with temporal context visualization. Focus on simple interface!
Jakub Lokoc, Tomas Soucek, Gregor Kovalcik: Using an Interactive Video Retrieval Tool for LifeLog Data. LSC@ICMR 2018: 15-19, ACM
Jakub Lokoc, Gregor Kovalcik, Tomas Soucek: Revisiting SIRET Video Retrieval Tool. VBS@MMM 2018: 419-424, Springer
VIRET Tool (Winner of VBS 2018)
ITEC Tool
Primus, Manfred Jürgen, et al. "The ITEC Collaborative Video Search System at the Video Browser Showdown 2018." International Conference on Multimedia Modeling.
Springer, Cham, 2018.
ITEC tool (2nd at VBS 2018 and LSC 2018)
https://www.youtube.com/watch?v=CA5kr2pO5b
LSC (Geospatial Browsing)
W Hürst, K Ouwehand, M Mengerink, A Duane and C Gurrin. Geospatial Access to Lifelogging Photos in Virtual Reality. The Lifelog Search Challenge 2018 at ACM ICMR 2018.
LSC (Interactive Video Retrieval)
J. Lokoč, T. Souček and G. Kovalčík. Using an Interactive Video Retrieval Tool for LifeLog Data. The Lifelog Search Challenge 2018 at ACM ICMR 2018.
(3rd highest performing system, but the same system won VBS 2018)
LSC (LiveXplore)
A Leibetseder, B Muenzer, A Kletz, M Primus and K Schöffmann. liveXplore at the Lifelog Search Challenge 2018. The Lifelog Search Challenge 2018 at ACM ICMR 2018.
(2nd highest performing system)
VR Lifelog Search Tool (winner of LSC 2018)
Large lifelog archive with time-limited KIS topics
Multimodal (visual concept and temporal) query formulation
Ranked list of visual imagery (image per minute)
Gesture-based manipulation of results
A Duane, C Gurrin & W Hürst. Virtual Reality Lifelog Explorer for the Lifelog Search Challenge at ACM ICMR 2018. The Lifelog Search Challenge 2018 at ACM ICMR 2018.
Top Performing System.
https://www.youtube.com/watch?v=aocN9eOuRv0
vitrivr (University of Basel)
● Open-Source content-based multimedia retrieval stack
○ Supports images, music, video and 3D-models concurrently
○ Used for various applications both in and outside of academia
○ Modular architecture enables easy extension and customization
○ Compatible with all major operating systems
○ Available from vitrivr.org
● Participated several times in VBS (originally as IMOTION)
[Credit: Luca Rossetto]
vitrivr (University of Basel)
● System overview
[Credit: Luca Rossetto]
vitrivr (University of Basel)
[Credit: Luca Rossetto]
vitrivr (University of Basel)
[Credit: Luca Rossetto]
vitrivr (University of Basel)
[Credit: Luca Rossetto]
3. Evaluation Approaches
Overview of Evaluation Approaches
● Qualitative user study/survey
○ Self report: ask users about their experience with the tool, thinking aloud tests, etc.
○ Using psychophysiological measurements (e.g., electrodermal activity - EDA)
● Log-file analysis
○ Analyze server and/or client-side interaction patterns
○ Measure time needed for certain actions, etc.
● Question answering
○ Ask questions about content (open, multiple choice) to assess which content users found
● Indirect/task-based evaluation (Cranfield paradigm)
○ Pose certain tasks, measure the effectiveness of solving the task
○ Quantitative user study with many users and trials
○ Open competition, as in VBS, LSC, and TRECVID
Properties of Evaluation Approaches
● Availability and level of detail of ground truth
○ None (e.g., questionnaires, logs)
○ Detailed and complete (e.g., retrieval tasks)
● Effort during experiments
○ Low (automatic check against ground truth)
○ Moderate (answers need to be checked by a human, e.g., live judges)
○ High (observation of or interview with participants)
● Controlled conditions
○ All users in same room with same setup (typical user-study)
vs. participants via online survey
● Statistical tests!
○ We can only conclude that one interactive tool is better than
another if there is statistically significant evidence
○ Tests like ANOVA, t-tests, Wilcoxon signed-rank tests, …
○ Consider the prerequisites of the specific test (e.g., normal distribution)
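A minimal example of such a test, assuming SciPy and hypothetical paired per-task solve times for two tools; the normality check decides between a paired t-test and the Wilcoxon signed-rank test.

from scipy.stats import shapiro, ttest_rel, wilcoxon

# Hypothetical paired solve times (seconds) of the same users/tasks with two tools
tool_a = [42.1, 75.3, 33.0, 120.5, 61.2, 88.7, 54.3, 97.0]
tool_b = [55.4, 80.1, 41.2, 140.0, 59.8, 95.5, 70.2, 110.3]

diffs = [a - b for a, b in zip(tool_a, tool_b)]
_, p_norm = shapiro(diffs)                 # check the normality prerequisite first
if p_norm > 0.05:
    stat, p = ttest_rel(tool_a, tool_b)    # paired t-test
else:
    stat, p = wilcoxon(tool_a, tool_b)     # Wilcoxon signed-rank test
print("significant at alpha=0.05" if p < 0.05 else "no significant difference", p)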
Example: Comparing Tasks and User Study
● Experiment compared
○ Question answering
○ Retrieval tasks
○ User study with questionnaire
● Materials
○ Interactive search tool with keyframe visualisation
○ TRECVID BBC rushes data set (25 hrs)
○ Questionnaire adapted from TRECVID 2004
○ 19 users, doing at least 4 tasks
W. Bailer and H. Rehatschek, Comparing Fact Finding Tasks and User Survey for Evaluating a Video Browsing Tool. ACM Multimedia 2009.
Example: Comparing Tasks and User Study
● TVB1 I was familiar with the topic of the query.
● TVB3 I found that it was easy to find clips that are relevant.
● TVB4 For this topic I had enough time to find enough clips.
● TVB5 For this particular topic the tool interface allowed me to do browsing efficiently.
● TVB6 For this particular topic I was satisfied with the results of the browsing.
W. Bailer and H. Rehatschek, Comparing Fact Finding Tasks and User Survey for Evaluating a Video Browsing Tool. ACM Multimedia 2009.
Using Electrodermal Activity (EDA)
Measuring EDA during retrieval tasks (A, B, C, D) with an interactive search tool, 14 participants
C. Martinez-Peñaranda, et al., A Psychophysiological Approach to the Usability Evaluation of a Multi-view Video Browsing Tool,” MMM 2013.
History of Selected Evaluation Campaigns
● Evaluation campaigns for video analysis and search started in early 2000s
○ Most well-known are TRECVID and MediaEval (previously ImageCLEF)
○ Both spin-offs from text retrieval benchmarks
● Several campaigns include tasks that are relevant to video search
● Most tasks are designed to be fully automatic
● Some allow at least interactive submissions as an option
○ Most submissions are usually still for the automatic type
● Since 2007, live evaluations with audience have been organized at major
international conferences
○ VideOlympics, VBS, LSC
History of Selected Evaluation Campaigns
TRECVID
● Workshop series (2001 – present) → http://trecvid.nist.gov
● Started as a track in the TREC (Text REtrieval Conference) evaluation
benchmark.
● Became an independent evaluation benchmark in 2003.
● Focus: content-based video analysis, retrieval, detection, etc.
● Provides data, tasks, and uniform, appropriate scoring procedures
● Aims for realistic system tasks and test collections:
○ Unfiltered data
○ Focus on relatively high-level functionality (e.g. interactive search)
○ Measurement against human abilities
● Forum for the
○ exchange of research ideas and for
○ the discussion of research methodology – what works, what doesn’t, and why
TRECVID Philosophy
● TRECVID is a modern example of the Cranfield tradition
○ Laboratory system evaluation based on test collections
● Focus on advancing the state of the art from evaluation results
○ TRECVID’s primary aim is not competitive product benchmarking
○ Experimental workshop: sometimes experiments fail!
● Laboratory experiments (vs. e.g., observational studies)
○ Sacrifice operational realism and broad scope of conclusions
○ For control and information about causality – what works and why?
○ Results tend to be narrow, at best indicative, not final
○ Evidence grows as approaches prove themselves repeatedly, as part of various systems,
against various test data, over years
TRECVID Datasets
HAVIC
Soap opera (since 2013)
Social media
(since 2016)
Security cameras
(since 2008)
Teams actively participated (2016-2018)
INF CMU; Beijing University of Posts and Telecommunication; University Autonoma de Madrid; Shandong University; Xian JiaoTong University Singapore
kobe_nict_siegen Kobe University, Japan; National Institute of Information and Communications Technology, Japan; University of Siegen, Germany
UEC Dept. of Informatics, The University of Electro-Communications, Tokyo
ITI_CERTH Information Technology Institute, Centre for Research and Technology Hellas
ITEC_UNIKLU Klagenfurt University
NII_Hitachi_UIT National Institute Of Informatics.; Hitachi Ltd; University of Information Technology (HCM-UIT)
IMOTION University of Basel, Switzerland; University of Mons, Belgium; Koc University, Turkey
MediaMill University of Amsterdam ; Qualcomm
Vitrivr University of Basel
Waseda_Meisei Waseda University; Meisei University
VIREO City University of Hong Kong
EURECOM EURECOM
FIU_UM Florida International University, University of Miami
NECTEC National Electronics and Computer Technology Center NECTEC
RUCMM Renmin University of China
NTU_ROSE_AVS ROSE LAB, NANYANG TECHNOLOGICAL UNIVERSITY
SIRET SIRET Department of Software Engineering, Faculty of Mathematics and Physics, Charles University
UTS_ISA University of Technology Sydney
VideOlympics
● Run the same year’s TRECVID
search tasks live in front of
an audience
● Organized at CIVR 2007-2009
Photos: Cees Snoek, https://www.flickr.com/groups/civr2007/
Video Browser Showdown (VBS)
● Video search competition (annually at MMM)
○ Inspired by VideOlympics
○ Demonstrates and evaluates state-of-the-art
interactive video retrieval tools
○ Also, entertaining event during welcome reception at MMM
● Participating teams solve retrieval tasks
○ Known-item search (KIS) tasks - one result - textual or visual
○ Ad-hoc video search (AVS) tasks - many results - textual
○ In a large video archive (originally only within single ~60 min videos)
● Systems are connected to the VBS Server
○ Presents tasks in live manner
○ Evaluates submitted results of teams (penalty for false submissions)
First VBS in Klagenfurt, Austria
(only search in a single video)
Video Browser Showdown (VBS)
2012: Klagenfurt
11 teams
KIS, single video (v)
2013: Huangshan
6 teams
KIS, single video (v+t)
2014: Dublin
7 teams
KIS, single video
and 30h archive (v+t)
2015: Sydney
9 teams
KIS, 100h archive (v+t)
2016: Miami
9 teams
KIS, 250h archive (v+t)
2017: Reykjavik
6 teams
KIS, 600h archive (v+t)
AVS, 600h archive (t)
2018: Bangkok
9 teams
KIS, 600h archive (v+t)
AVS, 600h archive (t)
2019: Thessaloniki
6 teams
KIS, 1000h archive (v+t)
AVS, 1000h archive (t)
Video Browser Showdown (VBS)
VBS Server:
• Presents queries
• Shows remaining time
• Computes scores
• Shows statistics/ranking
Video Browser Showdown (VBS)
https://www.youtube.com/watch?v=tSlYFNlsn8U&t=140
Lifelog Search Challenge (LSC 2018)
● New (annual) search challenge at ACM ICMR
● Focus on a life retrieval challenge
o from multimodal lifelog data
o Motivated by the fact that ever larger personal data
archives are being gathered, and the advent of AR
technologies and the veracity of data mean that archives
of life experiences are likely to become more
commonplace.
● To be useful, the data should be searchable…
o and for lifelogs, that means interactive search
Lifelog Search Challenge (Definition)
Dodge and Kitchin (2007) refer to lifelogging as
“a form of pervasive computing, consisting of a
unified digital record of the totality of an
individual’s experiences, captured multi-modally
through digital sensors and stored permanently
as a personal multimedia archive”.
Lifelog Search Challenge (Motivation)
Lifelog Search Challenge (Lifelogging)
Lifelog Search Challenge (Data)
One month archive of multimodal lifelog
data, extracted from NTCIR-13 Lifelog
collection, including:
○ Wearable camera images at a rate of
3-5 / minute & concept annotations.
○ Biometrics
○ Activity logs
○ Media consumption
○ Content created/consumed
u1_2016-08-15_050922_1, 'indoor',
0.991932, 'person', 0.9719478,
'computer', 0.309054524
Lifelog Search Challenge (One Minute)
<minute id="496">
<location>
<name>Home</name>
</location>
<bodymetrics>
<calories>2.8</calories>
<gsr>7.03E-05</gsr>
<heart-rate>94</heart-rate>
<skin-temp>86</skin-temp>
<steps>0</steps>
</bodymetrics>
<text>author,1,big,2,dout,1,revis,1,think,1,while,1</text>
<images>
<image>
<image-id>u1_2016-08-15_050922_1</image-id>
<image-path>u1/2016-08-15/20160815_050922_000.jpg</image-path>
<annotations>'indoor', 0.985, 'computer', 0.984, 'laptop', 0.967, 'desk', 0.925</annotations>
</image>
</images>
</minute>
Lifelog Search Challenge (Topics)
<Description timestamp="0">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background.</Description>
<Description timestamp="30">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant
on my right side.</Description>
<Description timestamp="60">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant
on my right side. There are keys on the table in front of me and you can see the cafe sign on the left side. I walked to the cafe and it took less than two minutes to
get there.</Description>
<Description timestamp="90">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant
on my right side. There are keys on the table in front of me and you can see the cafe sign on the left side. I walked to the cafe and it took less than two minutes to
get there. My colleague in the foreground is wearing a white shirt and drinking coffee from a red paper cup.</Description>
<Description timestamp="120">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant
on my right side. There are keys on the table in front of me and you can see the cafe sign on the left side. I walked to the cafe and it took less than two minutes to
get there. My colleague in the foreground is wearing a white shirt and drinking coffee from a red paper cup. Immediately after having the coffee, I drive to the
shop.</Description>
<Description timestamp="150">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant
on my right side. There are keys on the table in front of me and you can see the cafe sign on the left side. I walked to the cafe and it took less than two minutes to
get there. My colleague in the foreground is wearing a white shirt and drinking coffee from a red paper cup. Immediately after having the coffee, I drive to the
shop. It is a Monday.</Description>
Temporally enhanced topic descriptions that get more detailed (easier)
every thirty seconds. The topics have one or only a few relevant items in the collection.
Lifelog Search Challenge 2018 (Six Teams)
4. Task Design and Datasets
Task types, trade-offs, datasets, annotations
Task Types: Introduction
● Searching for content can be modelled as different task types
○ Choice impacts dataset preparation, annotations, evaluation methods
○ and the way to run the experiments
● Some of the task types here have fully automatic variants…
○ out of scope, but may serve as baseline to compare to
● Tasks can be categorized by the target and the formulation of the query
○ Target: particular item vs. set or class
■ only one target item in the data set, or
■ multiple occurrences of an instance or of a class of relevant items/segments
○ Definition of query
■ example, given in a specific modality
■ precise definition vs. fuzzy idea
Task Types (at Campaigns): Overview
(Figure: task types arranged by how clear the search intent is; the query is given as an
example, visually, textually, abstractly, or not at all. VIS tasks, known-item search,
AVS tasks, and everyday web video search ("this is how I use web video search") span this
spectrum over a given video dataset.)
What is the role of similarity for KIS at Video Browser Showdown? SISAP'18, Peru
Task Type: Visual Instance Search
● User holds a digital representation
of a relevant example of the needed
information
● Example or its features can be sent
to system
● User does not need to translate
example into query representation
● e.g., trademark/logo detection
Task Types: Known Item Search (KIS)
● User sees/hears/reads a representation
○ Target item is described or presented
● Used in VBS & LSC
● One-target semantics
○ Representation of exactly one relevant item/segment in the dataset
● Models user’s (partly faded) memories
○ user has a memory of content to be found, might be fuzzy
● User must translate representation to provided query methods
○ The complexity of this translation depends significantly on the modality
■ e.g., visual is usually easier than textual, which leaves more room for interpretation
○ Relation of/to content is important too
■ e.g. searching in own life log media vs. searching in media
collection on the web
“on a busy street”
Task Types: Ad-hoc Search
● User sees/hears/reads a representation of the needed information
○ Target item is described or presented
● Many-targets semantics
○ Representation of a broader set/class of relevant items/segments
○ cf. TRECVID AVS task
● Models user’s rough memories
○ user has only a memory of the type of relevant content, not about details
● Similar issues of translating the representation as for KIS
○ but due to the broader set of relevant items, the correct interpretation of textual information is a less critical
issue
● Raises issues of what is considered within/without scope of a result set
○ e.g., partly visible, visible on a screen in the content, cartoon/drawing versions, …
○ TRECVID has developed guidelines for annotation of ground truth
Task Types: Exploration
● User does not start from a clear idea/query
of the information need
○ No concrete query, just inspects dataset
○ Browsing and exploring may lead to identifying useful
content
● Reflects a number of practical situations,
but very hard to evaluate
○ User simply cannot describe the content
○ User does not remember content but would recognize it
○ Content inspection for the sake of interest
○ Digital forensics
● No known examples of such tasks in
benchmarking campaigns due to the difficulties
with evaluation
Demo: https://www.picsbuffet.com/
Barthel, Kai Uwe, Nico Hezel, and Radek Mackowiak. "Graph-based browsing for large
video collections." International Conference on Multimedia Modeling. Springer, Cham, 2015.
Task Design is About Trade-offs: Aspects to consider
Tasks shall
○ model real-world content search problems
■ in order to assess whether tools are usable for these problems
○ set controlled conditions
■ to enable reliable assessment
○ be repeatable
■ to compare results from different evaluation sessions
○ avoid bias towards certain features or query methods
Trade-off examples (real-world modelling vs. controlled evaluation):
○ many real-world problems involve very fuzzy information needs, vs. well-defined queries are best suited for evaluation
○ users remember more about the scene when they start looking through examples, vs. information in the task should be provided at defined points in time
○ during evaluation sessions, relevant shots may be discovered and the ground truth updated, vs. for repeatable evaluation, a fixed ground truth set is desirable
○ although real-world tasks may involve time pressure, it would be best to measure the time until the task is solved, vs. time limits are needed in evaluation sessions for practical reasons
Task Selection (KIS @ VBS)
● Known duplicates:
○ List of known (partial) duplicates from matching metadata and file size
○ Content-based matches
● Uniqueness inside same and similar content:
○ Ensure unambiguous target
○ May be applied to sequence of short shots rather than single shot
● Complexity of segment:
○ Rough duration of 20s
○ Limited number of shots
● Describe-ability:
○ Textual KIS requires segments that can be described with limited amount of text
(less shots, salient location or objects, etc.)
VBS KIS Task Selection - Examples
● KIS Visual (video 37756, frame 750-1250)
○ Short shots, varying content - hard to describe as text, but
unique sequence
● KIS Textual (video 36729, frame 4047-4594)
○ @0 sec: “Shots of a factory hall from above. Workers
transporting gravel with wheelbarrows. Other workers
putting steel bars in place.”
○ @100 sec: “The hall has Cooperativa Agraria written in red
letters on the roof.”
○ @200 sec: “There are 1950s style American cars and
trucks visible in one shot.”
Presenting Queries (VBS)
● Example picture?
○ allow taking pictures of visual query clips?
● Visual
○ Play query once
■ one chance to memorize, but no chance to check possibly
relevant shot against query — in real life, one cannot visually
check, but one does not forget what one knew at query time
○ Repeat query but blur increasingly
■ basic information is there, but not possible to check details
● Textual
○ For most users, memory is also visual
○ Simulate case where retrieval expert is asked to find content
■ expert could ask questions
○ Provide incremental details about the scene (but initial piece
of information must already be unambiguous for KIS)
Task Participants
● Typically developers of tools participate in evaluation campaigns
○ They know how to translate information requests into queries
○ Knowledge of user has huge impact on performance that can be achieved
● “Novice session”
○ Invite members from the audience to use the tools, after a brief introduction
○ Provides insights about usability and complexity of tool
○ In real use cases, users are domain experts rather than retrieval experts, thus this condition
is important to test
○ Selection of novices is an issue for comparing results
○ Question of whether/how scores of expert and novice tasks shall be combined
Real-World Datasets
● Research needs reproducible results
○ standardized and free datasets are necessary
● One problem with many datasets:
○ the current state of web video in the wild is not, or no longer, accurately represented by them
[Rossetto & Schuldt]
● Hence, we also need datasets that model the real world
○ One such early effort is the V3C dataset (see later)
Rossetto, L., & Schuldt, H. (2017). Web video in numbers-an analysis of web-video metadata. arXiv preprint arXiv:1707.01340.
Videos in the Wild
Age-distribution of common video collections vs what is found in the wild
Rossetto, L., & Schuldt, H. (2017). Web video in numbers-an analysis of web-video metadata. arXiv preprint arXiv:1707.01340.
Videos in the Wild
Duration-distribution of common video collections vs what is found in the wild
Rossetto, L., & Schuldt, H. (2017). Web video in numbers-an analysis of web-video metadata. arXiv preprint arXiv:1707.01340.
Dataset Preparation and Annotations
● Data set = content + annotations for specific problem
● Today, content is everywhere
● Annotations are still hard to get
○ External data (e.g., archive documentation) is often not available at sufficient granularity and
is rarely time-indexed
○ Creation by experts is prohibitively costly
● Approaches
○ Crowdsourcing (with different notions of “crowd” impacting quality)
○ Reduce amount of annotations needed
○ Generate data set and ground truth
Collaborative Annotation
Initiatives from TRECVID participants 2003-2013
○ http://mrim.imag.fr/tvca/
○ Concept annotations for high-level feature extraction/semantic indexing tasks
○ As data sets grew in size, the percentage of the content that could be annotated declined
○ Use of active learning to select samples where annotation brings highest benefit
S. Ayache and G. Quénot, "Video Corpus Annotation using Active Learning", ECIR 2008.
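A minimal sketch of such an active-learning selection step, assuming scikit-learn; the features, labels and batch size are placeholders, and this is a simplification of the annotation system described by Ayache & Quénot, not their implementation.

import numpy as np
from sklearn.linear_model import LogisticRegression

def select_for_annotation(X_labeled, y_labeled, X_unlabeled, batch_size=50):
    # Train on the labels collected so far ...
    clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    probs = clf.predict_proba(X_unlabeled)[:, 1]
    # ... and propose the most uncertain shots (scores closest to 0.5),
    # where an annotation brings the highest expected benefit
    uncertainty = -np.abs(probs - 0.5)
    return np.argsort(uncertainty)[-batch_size:]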
Crowdsourcing with the General Public
● Use platforms like Amazon Mechanical Turk to collect data
○ Main issue, however, is that annotations are noisy and unreliable
● Solutions
○ Multiple annotations and majority votes
○ Involve tasks that help assess the confidence in a specific worker
■ e.g., asking easy questions first, to verify facts about image
○ More sophisticated aggregation strategies
● MediaEval ran tasks in 2013 and 2014
○ Annotation of fashion images and timed comments about music
B. Loni, M. Larson, A. Bozzon, L. Gottlieb, Crowdsourcing for Social Multimedia at MediaEval 2013: Challenges, Data set, and Evaluation, MediaEval WS Notes, 2013.
K. Yadati, P. S.N. Shakthinathan Chandrasekaran Ayyanathan, M. Larson, Crowdsorting Timed Comments about Music: Foundations for a New Crowdsourcing Task, MediaEval WS Notes, 2014.
Pooling
● Exhaustive relevance judgements
are costly for large data sets
● Annotate pool of top k results
returned from participating systems
● Pros
○ Efficient
○ Results are correct for all participants, not
an approximation
● Cons
○ Annotations can only be done after
experiment
○ Repeating the experiment with
new/updated systems requires updating the
annotation (or getting approximate results)
Sri Devi Ravana et al., Document-based approach to improve the accuracy of pairwise comparison in evaluating information retrieval systems, ASLIB J. Inf. Management, 67(4), 2015.
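A minimal sketch of pool construction, assuming each submitted run is simply a ranked list of shot IDs for one topic; the run contents below are hypothetical. Only the union of the top-k results of all runs is sent to assessors.

def build_pool(runs, k=100):
    """runs: list of ranked lists of shot IDs (one list per submitted run)."""
    pool = set()
    for ranked_list in runs:
        pool.update(ranked_list[:k])     # everything outside the pool stays unjudged
    return pool

# Hypothetical usage with three runs over the same topic
runs = [["s12", "s7", "s99", "s3"], ["s7", "s45", "s12"], ["s3", "s8"]]
to_judge = build_pool(runs, k=2)         # {'s12', 's7', 's45', 's3', 's8'}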
Live Annotation
● Assessment of incoming results during competition
● Used in VBS 2017-2018
● Addresses issues of incomplete or missing ground truth
○ e.g., ground truth created using pooling, or new queries
● Pros
○ Provide immediate feedback
○ Avoid biased results from ground truth pooled from other systems
● Cons
○ Done under time pressure
○ Not possible to review other similar cases - may cause inconsistency in decisions
○ Multi-annotator agreement would be needed (impacts decision time and number of annotators needed)
Live Annotation – Example from VBS 2018
● 1,848 shots judged live
○ About 40% of submitted shots were not in the TRECVID ground truth
● Verification experiment
○ 1,383 were judged again later
○ Judgements diverged for 23% of the shots; in 88% of those cases the live judgement was “incorrect”
● Judges seem to decide “incorrect” when in doubt
○ While ground truth for later use is biased, still same conditions for all teams in the room
● Need to set up clear rules for live judges
○ Like used by NIST for TRECVID annotations
(Example shots from the same video with diverging judgements: Judge 1: false, Judge 2: true, Judge 1: true, Judge 1: false)
Assembling Content and Ground Truth
● MPEG Compact Descriptor for Video Analysis
(CDVA)
○ Dataset for the evaluation of visual instance search
○ 23,000 video clips (1 min to over 1 hr)
● Annotation effort too high
○ Generate query and reference clips from three disjoint
subsets
○ Randomly embed relevant segment in noisy material
○ Apply transformations to query clips
○ Ground truth is generated from the editing scripts
○ Created 9,715 queries, 5,128 references
Process for LSC Dataset Generation
● Lifelog data has an inevitable privacy/GDPR compliance concern
● Required a cleaning/anonymization process for images, locations & words
○ Lifelogger deletes private/embarrassing images, validated by researcher
○ Images resized down (1024x768) to remove readable text
○ faces automatically & manually blurred; locations anonymized
○ Manually generated blacklist of terms for removal from textual data
Available Datasets
● Past TRECVID data
○ https://www-nlpir.nist.gov/projects/trecvid/past.data.table.html
○ Different types of usage conditions and license agreements
○ Ground truth, annotations and partly extracted features are available
● Past MediaEval data
○ http://www.multimediaeval.org/datasets/index.html
○ Mostly directly downloadable, annotations and sometimes features available
● Some freely available data sets
○ TRECVID IACC.1-3
○ TRECVID V3C1 (starting 2019), will also be used for VBS (download available)
○ BLIP 10,000 http://skuld.cs.umass.edu/traces/mmsys/2013/blip/Blip10000.html
○ YFCC100M https://webscope.sandbox.yahoo.com/catalog.php?datatype=i&did=67
○ Stanford I2V http://purl.stanford.edu/zx935qw7203
Available Datasets
● MPEG CDVA data set
○ Mixed licenses, partly CC, partly specific conditions of content owners
● NTCIR-Lifelog datasets
○ NTCIR-12 Lifelog - 90 days of mostly visual and activity data from 3 lifeloggers (100K+
images)
■ The ImageCLEF 2017 dataset is a subset of NTCIR-12
○ NTCIR-13 Lifelog - 90 days of richer media data from 2 lifeloggers (95K images)
■ LSC 2018 - 30 days of visual, activity, health, information & biometric data from one lifelogger
■ The ImageCLEF 2018 dataset is a subset of NTCIR-13
○ NTCIR-14 - 45 days of visual, biometric, health, activity data from two lifeloggers
Example: V3C Dataset
Vimeo Creative Commons Collection
○ The Vimeo Creative Commons Collection (V3C) [2] consists of ‘free’ video material sourced from the web
video platform vimeo.com. It is designed to contain a wide range of content which is representative of what
is found on the platform in general. All videos in the collection have been released by their creators under a
Creative Commons License which allows for unrestricted redistribution.
Rossetto, L., Schuldt, H., Awad, G., & Butt, A. (2019). V3C – a Research Video Collection. Proceedings of the 25th International Conference on MultiMedia Modeling.
5. Evaluation procedures,
results and metrics
Interactive and automatic retrieval
Evaluation settings for interactive retrieval tasks
● For each tool, human in the loop ...
○ Same room, projector, time pressure
○ Expert and novice users
● … compete in simulated tasks (KIS, AVS, ...)
○ Shared dataset in advance (V3C1 1000h)
○ 2V+1T KIS sessions and 2 AVS sessions
■ Tasks selected randomly and revisited
■ Tasks presented on data projector
Evaluation settings for interactive retrieval tasks
● Problem with repeatability of results
○ Human in the loop, conditions
● Evaluation provides one comparison of
tools in a shared environment with a given
set of tasks, users and shared dataset
○ Performance reflected by an overall score
Known-item search tasks at VBS 2018
Results of VBS 2018
Results - observed trends 2015-2017
2015 (100 hours) 2016 (250 hours) 2017 (600 hours)
Observation: the first AVS tasks were easier than visual KIS tasks, which in turn were easier than textual KIS tasks
J. Lokoc, W. Bailer, K. Schoeffmann, B. Muenzer, G. Awad, On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017, IEEE
Transactions on Multimedia, 2018
KIS score function (since 2018)
● Reward for solving a task
● Reward for being fast
● Fair scoring around time limit
● Penalty for wrong submissions
J. Lokoc, W. Bailer, K. Schoeffmann, B. Muenzer, G. Awad, On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017, IEEE
Transactions on Multimedia, 2018
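The following is an illustrative sketch only, not the official VBS scoring formula: it merely demonstrates the four properties listed above with assumed weights (a base reward for solving, a speed bonus that decays smoothly so the score is fair around the time limit, and a penalty per wrong submission).

def kis_score(solved, elapsed, time_limit, wrong_submissions,
              base=50.0, speed_bonus=50.0, penalty=10.0):
    # Reward for solving (base), reward for being fast (speed_bonus),
    # smooth decay around the time limit, penalty for wrong submissions
    if not solved:
        return 0.0
    time_factor = max(0.0, 1.0 - elapsed / time_limit)
    return max(0.0, base + speed_bonus * time_factor - penalty * wrong_submissions)

print(kis_score(True, 60, 300, 1))   # 50 + 50*0.8 - 10 = 80.0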
AVS score function (since 2018)
VBS 2018
VBS 2017
Score based on precision and recall
Overall scores at VBS 2018
Settings and metrics in LSC Evaluation
● Similar to the VBS… For each tool, human in the
loop ...
○ Same room, projector, time pressure
○ Expert and novice users
● … compete in simulated tasks (all KIS type)
○ Shared dataset in advance (LSC dataset, 27 days)
○ Six expert topics & 12 novice topics
■ Topics prepared by the organisers with full
(non-pooled) relevance judgements for all topics
■ Tasks presented on data projector
■ Participants submit a ‘correct’ answer to the LSC
server, which evaluates it against the ground truth.
Lifelog Search Challenge (Topics)
I am building a chair that is wooden in the late afternoon. I am at work, in an office environment (23
images, 12 minutes).
I am walking out to an airplane across the airport apron. I stayed in an airport hotel on the previous night
before checking out and walking a short distance to the airport (1 image, 1 minute).
I was in a Norwegian furniture store in a shopping mall (16 images, 9 minutes).
I was eating in a Thai restaurant (130 images, 66 minutes).
There was a large picture of a man carrying a box of tomatoes beside a child on a bicycle (185 images,
97 minutes).
I was playing a vintage car-racing game on my laptop in a hotel after flying (53 images, 27 minutes).
I was watching 'The Blues Brothers' Movie on the TV at home (82 images, 42 minutes).
LSC Score Function
Score calculated from 0 to 100, based on the
amount of time remaining. Negative scoring for
incorrect answers (each loses 10% of the available score).
The overall score is the sum of the scores for all
expert and novice topics.
Similar to VBS, there is a problem with repeatability of
results (human in the loop).
The evaluation provides one comparison of tools in a
shared environment with a given set of tasks, users
and a shared dataset.
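A minimal sketch of this scoring idea; the linear decay over the time limit and the exact handling of the 10% penalty are assumptions for illustration, not the official LSC implementation.

def lsc_topic_score(solved, elapsed, time_limit, wrong_answers):
    # Achievable score decays linearly from 100 to 0 over the time limit ...
    available = max(0.0, 100.0 * (1.0 - elapsed / time_limit))
    # ... and each incorrect answer removes 10% of the currently available score
    available *= 0.9 ** wrong_answers
    return available if solved else 0.0

print(lsc_topic_score(True, 150, 300, 1))   # 100 * 0.5 * 0.9 = 45.0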
Evaluation settings at TRECVID
● Three run types:
○ Fully Automatic
○ Manually-assisted
○ Relevance-feedback
● Query/Topics:
○ Text only
○ Text + image/video examples
● Training conditions:
○ Training data from same/cross domain as testing
○ Training data collected automatically
● Results:
System returns the top 1000 shots that most likely
satisfy the query/topic
Query Development Process
● Sample test videos (~30-40%) were viewed by 10 human assessors hired by NIST.
● Four facets describing different scenes were used (if applicable) to annotate the watched videos:
○ Who: concrete objects and beings (kinds of persons, animals, things)
○ What: what the objects and/or beings are doing (generic actions, conditions/states)
○ Where: locale, site, place, geographic, architectural, etc.
○ When: time of day, season
● Test queries were constructed from the annotated descriptions to include : Persons, Actions,
Locations, and Objects and their combinations.
Sample topics of Ad-hoc search queries
Find shots of a person holding a poster on the street at daytime
Find shots of one or more people eating food at a table indoors
Find shots of two or more cats both visible simultaneously
Find shots of a person climbing an object (such as tree, stairs, barrier)
Find shots of car driving scenes on a rainy day
Find shots of a person wearing a scarf
Find shots of destroyed buildings
Evaluation settings at TRECVID
● Usually 30 queries/topics are evaluated per year
● NIST hires 10 human assessors to:
○ Watch returned video shots
○ Judge whether a video shot satisfies the query (YES / NO vote)
● All system results per query/topic are pooled; NIST judges the top-ranked
results (rank 1 to ~200) completely and samples ranked results from 201 to 1000
to form a unique judged master set.
● The unique judged master set is divided into small pool files (~1000 shots
/ file) and given to the human assessors to watch and judge.
TRECVID evaluation framework
(Diagram: TRECVID provides the video collection and the information needs (topics/queries) to
participants; video search algorithms 1..K each return a ranked result set; the result sets are
pooled, judging 100% of the top X ranked results and Y% from rank X+1 to the bottom; human
assessors judge the video pools to produce the ground truth, from which the evaluation scores
are computed.)
Evaluation settings at TRECVID
● Basic rules for the human assessors to follow include:
○ In topic description, "contains x" or words to that effect are short for "contains x to a degree sufficient for x
to be recognizable as x to a human". This means among other things that unless explicitly stated, partial
visibility or audibility may suffice.
○ The fact that a segment contains video of physical objects representing the feature target, such as photos,
paintings, models, or toy versions of the target, will NOT be grounds for judging the feature to be true for the
segment. Containing video of the target within video may be grounds for doing so.
○ If the feature is true for some frame (sequence) within the shot, then it is true for the shot; and vice versa.
This is a simplification adopted for the benefits it affords in pooling of results and approximating the basis
for calculating recall.
○ When a topic expresses the need for x and y and ..., all of these (x and y and ...) must be perceivable
simultaneously in one or more frames of a shot in order for the shot to be considered as meeting the need.
Evaluation metric at TRECVID
● Mean extended inferred average precision (xinfAP) across all topics
○ Developed* by Emine Yilmaz and Javed A. Aslam at Northeastern University
○ Estimates average precision surprisingly well using a surprisingly small sample of judgments
from the usual submission pools (see next slide!)
○ More topics can be judged with same effort
○ The extended infAP added a stratification feature to infAP (i.e., we can sample from each stratum
with a different sampling rate)
* J.A. Aslam, V. Pavlu and E. Yilmaz, Statistical Method for System Evaluation Using Incomplete Judgments Proceedings of the 29th
ACM SIGIR Conference, Seattle, 2006.
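For reference, a minimal sketch of the average precision that (x)infAP estimates, computed here from complete judgments; the actual xinfAP estimator works from stratified samples of the judgment pools rather than full judgments.

def average_precision(ranked_shots, relevant):
    hits, precision_sum = 0, 0.0
    for rank, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / rank        # precision at each relevant rank
    return precision_sum / len(relevant) if relevant else 0.0

# Relevant shots found at ranks 1 and 3, with 4 relevant shots overall
print(average_precision(["a", "x", "b", "y"], {"a", "b", "c", "d"}))   # (1 + 2/3) / 4 ≈ 0.417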
InfAP correlation with AP
(Scatter plots: mean infAP of the 20%, 40%, 60% and 80% judgment samples plotted against the
mean infAP of the 100% sample; the mean infAP of the 100% sample equals AP.)
Automatic vs. Interactive search in AVS
Can we compare results from TRECVID (infAP) and VBS (unordered list)?
● Simulate AP from unordered list
J. Lokoc, W. Bailer, K. Schoeffmann, B. Muenzer, G. Awad, On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017, IEEE
Transactions on Multimedia, 2018
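One way to realize such a simulation is sketched below (an assumed procedure for illustration, not necessarily the exact one used in the cited paper): average AP over many random orderings of the unordered submission set, reusing the average_precision() sketch shown earlier.

import random

def simulated_ap(submitted_shots, relevant, trials=1000, seed=0):
    rng = random.Random(seed)
    shots = list(submitted_shots)
    total = 0.0
    for _ in range(trials):
        rng.shuffle(shots)               # random ordering of the unordered set
        total += average_precision(shots, relevant)
    return total / trials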
Automatic vs. Interactive search in AVS
Can we compare results from TRECVID (infAP) and VBS (unordered list)?
● Get precision at VBS recall level if ranked lists are available
J. Lokoc, W. Bailer, K. Schoeffmann, B. Muenzer, G. Awad, On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017, IEEE
Transactions on Multimedia, 2018
6. Lessons learned
Collection of our observations from TRECVID and VBS
Video Search (at TRECVID) Observations
One solution will not fit all. Investigations and discussions of video search must be
related to the searcher‘s specific needs, capabilities and history, and to the kinds
of data being searched.
The enormous and growing amounts of video require extremely large-scale
approaches to video exploitation. Much of it has little or no metadata
describing the content in any detail.
● 400 hrs of video are uploaded to YouTube per minute (as of
11/2017)
● “Over 1.9 Billion logged-in users visit YouTube each month and every day
people watch over a billion hours of video and generate billions of views.”
(https://www.youtube.com/yt/about/press/)
Video Search (at TRECVID) Observations
Multiple information sources (text, audio, video), each errorful, can yield better results when
combined than used alone…
● A human in the loop in search still makes an enormous difference.
● Text from speech via automatic speech recognition (ASR) is a powerful source of information but:
○ Its usefulness varies by video genre
○ Not everything/everyone in a video is talked about or “in the news"
○ Audible mentions are often offset in time from visibility
○ Not all languages have good ASR
● Machine learning approaches to tagging
○ yield seemingly useful results against large amounts of data when training data is sufficient
and similar to the test data (within domain)
○ but will they work well enough to be useful on highly heterogeneous video?
Video Search (at TRECVID) Observations
● Processing video using a sample of more than one frame per shot yields better results, but quickly
pushes common hardware configurations to their limits
● TRECVID systems have been looking at combining automatically derived and manually provided
evidence in search:
○ Internet Archive video will provide titles, keywords, descriptions
○ Where in the Panofsky hierarchy are the donors’ descriptions? If very personal, does that
mean less useful for other people?
● Need observational studies of real searching of various sorts using current functionality and
identifying unmet needs
VBS organization
● Test session before event - problems with submission formats etc.
● Textual KIS tasks in a special private session
○ Textual tasks are not so attractive for audience
○ Textual tasks are important and challenging
○ More time and tasks are needed to assess tool performance
● Visual and AVS tasks during welcome reception
○ “Panem et circenses” - competitions are also intended to entertain audience
○ Generally, more novice users can be invited to try the tool
VBS server
● Central element of the competition
○ Presents all tasks using data projector
○ Presents scores in all categories
○ Presents feedback for actual submissions
○ Additional logic (duplicates, flood of submissions, logs)
○ Also at LSC 2018, with a revised ranking function
● Selected issue - duplicate problem
○ IACC dataset contains numerous duplicate videos with identical visual content (but e.g.,
different language)
○ Submission was regarded as wrong although the visual content was correct
○ One actual case in 2018 had to be corrected after the event and changed the final ranking
○ Dataset design should explicitly avoid duplicates, or at least provide a list of duplicates;
moreover: server could provide more flexibility in changing judgements retrospectively
VBS server
● Issues of the simulations of KIS tasks
● How to “implant” visual memories?
○ Play scene just once - users forget the scene
○ Play scene in the loop - users exploit details -> overfitting to task presentation
○ Play scene in the loop + blur - colors can still be used, but users also forget important details
○ Play scene several times in the beginning and then show text description
● How to face ambiguities of textual KIS?
○ Simple text - not enough details, ambiguous meaning of some sentences
○ Extending text - simulation of a discussion - which details should be used first?
○ Still ambiguities -> teams should be allowed to ask some questions
AVS task and live judges at VBS
● Ambiguous task descriptions are problematic, hard to find balance
between too easy and too hard tasks
● Opinion of user vs. opinion of judge - who is right?
○ Users try to maximize score - sometimes risk wrong submission
○ Each shot is assessed just once -> the same “truth” for all teams
○ Similar to textual KIS - teams should be allowed to ask some questions
○ Teams have to read TRECVID rules for human assessors!
● Calibration of more judges
○ For more than one live judge - calibration of opinions is necessary, even during competition
● Balance the number of users for AVS tasks (ideally also for KIS tasks)
VBS interaction logging
● Until 2017, there was no connection between VBS results and the tool
features actually used to solve a task
○ VBS server received only team, video and frame IDs
○ Attempts to receive logs after competition failed
● Since 2018, an interaction log is a mandatory part of each task submission
○ How to obtain logs when the task is not solved?
○ Tools use variable modalities and interfaces - how to unify actions?
○ How to present and interpret logs?
○ How to log very frequent actions?
○ Time synchronization?
○ Log verification during test runs
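A hypothetical example of what a unified interaction log entry could look like (the field names and action categories are assumptions for illustration, not the official VBS log schema): every submission carries a time-stamped sequence of abstract action categories so that logs from tools with very different interfaces remain comparable.

import json, time

log_entry = {
    "team": "TEAM_A",                      # placeholder team ID
    "task": "KIS_V_03",                    # placeholder task ID
    "timestamp": int(time.time() * 1000),  # wall clock in ms, for synchronization
    "events": [                            # abstract categories unify different interfaces
        {"t": 1200, "category": "text", "type": "query", "value": "red car street"},
        {"t": 5400, "category": "browsing", "type": "scroll", "value": "page 3"},
        {"t": 9100, "category": "submit", "type": "shot", "value": "v37756_f00750"},
    ],
}
print(json.dumps(log_entry, indent=2))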
VBS interaction logging - 8/9 teams sent logs!
We can analyze both aggregated general statistics and user interactions in a given tool/task !!
Conclusion
Where is the User in the Age of Deep Learning?
● The complexity of tasks where AI is superior to humans is growing
○ Checkers -> Chess -> GO -> Poker -> DOTA? -> Starcraft? -> … -> guess user needs?
● Machine learning revolution - bigger/better training data
○ -> better performance
● Can we collect big training data to support interactive video retrieval?
○ To cover an open world (how many concepts, actions, … do you need)?
○ To fit needs of every user (how many contexts do we have)?
● Reinforcement learning?
Where is the User in the Age of Deep Learning?
Driver has to get carefully through many situations with just basic equipment
Q: Is this possible also for video retrieval systems?
Attribution: Can Pac Swire (away for a bit)
Driver has to rely on himself, but subsystems help (ABS, power steering, etc.)
Attribution: Grand Parc - Bordeaux
Driver just tells where to go
Attribution: Grendelkhan
Where is the User in the Age of Deep Learning?
● Users already benefit from deep learning
○ HCI support - body motion, hand gestures
○ More complete and precise automatic annotations
○ Embeddings/representations for similarity search
○ 2D/3D projections for visualization of high-dimensional data
○ Relevance feedback learning (benefit from past actions)
● Promising directions
○ One-shot learning for fast inspection of new concepts
○ Multimodal joint embeddings
○ …
○ Just A Rather Very Intelligent System (J.A.R.V.I.S.) used by Tony Stark (Iron Man) ??
Never say “this will not work!”
● If you have an idea how to solve interactive retrieval tasks - just try it!
○ Don’t be afraid your system is not specialized, you can surprise yourself and the community!
○ Paper submission in September 2019 for VBS at MMM 2020 in Seoul!
○ LSC submission in February 2019 for ICMR 2019 in Ottawa in June 2019.
○ The next TRECVID CFP will go out by mid-January, 2019.
Lokoč, Jakub, Adam Blažek, and Tomáš Skopal. "Signature-based video
browser." International Conference on Multimedia Modeling. Springer, Cham,
2014.
Del Fabro, Manfred, and Laszlo Böszörmenyi. "AAU video browser: non-sequential
hierarchical video browsing without content analysis." International
Conference on Multimedia Modeling. Springer, Berlin, Heidelberg, 2012.
Hürst, Wolfgang, Rob van de Werken, and Miklas Hoet. "A storyboard-based
interface for mobile video browsing." International Conference on Multimedia
Modeling. Springer, Cham, 2015.
Acknowledgements
This work has received funding from the European Union’s Horizon 2020
research and innovation programme, grant no. 761802, MARCONI. It was
supported also by Czech Science Foundation project Nr. 17-22224S.
Moreover, the work was also supported by the Klagenfurt University and
Lakeside Labs GmbH, Klagenfurt, Austria and funding from the European
Regional Development Fund and the Carinthian Economic Promotion Fund
(KWF) under grant KWF 20214 u. 3520/26336/38165.
  • 9. How Would You Search for This Video Scene?
  • 10. What Users Might Want...
  • 11. Shortcomings of Fully Automatic Video Retrieval ● Works well if ○ Users can properly describe their needs ○ System understands search intent of users ○ There is no polysemy and no context variation ○ Content features can sufficiently describe visual content ○ Computer vision (e.g., CNN) can accurately detect semantics ● Unfortunately, for real-world problems rarely true! “Query-and-browse results” approach
  • 12. Performance of Video Retrieval ● Typically based on MAP ○ Computed for a specific query- and dataset ○ Results are still quite low (even in the age of deep learning!) ○ Also, results can heavily vary from one dataset to another, and from one queryset to another ○ Example: TRECVID Ad-hoc Video Search (AVS) – automatic runs only 2016 2017 2018 Teams 9 8 10 Runs 30 33 33 Min xInfAP 0 0.026 0.003 Max xInfAP 0.054 0.206 0.121 Median xInfAP 0.024 0.092 0.058 Dataset: IACC.3, 30 queries per year
  • 13. Deep Learning Can Fail Easily [J. Su, D.V. Vargas, and K. Sakurai. One pixel attack for fooling neural networks. 2018. arXiv] How to deal with noisy data/videos?
  • 14. Deep Learning Can Fail Easily Output of Yolo v2 Andrew Ng talk Artificial Intelligence is the New Electricity “Anything typical human can do with < 1s of thought we can probably now or soon automate with AI”
  • 15. Deep Learning Can Fail Easily Nguyen A, Yosinski J, Clune J. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In Computer Vision and Pattern Recognition (CVPR '15), IEEE, 2015
  • 16. The Power of Human Computation Example from the Video Browser Showdown 2015: System X: shot and scene detection, concept detection (SIFT, VLAD, CNNs), similarity search. System Y: tiny thumbnails only, powerful user. Outperformed system X and was finally ranked 3rd! Moumtzidou, Anastasia, et al. "VERGE in VBS 2017." International Conference on Multimedia Modeling. Springer, Cham, 2017. Hürst, Wolfgang, Rob van de Werken, and Miklas Hoet. "A storyboard-based interface for mobile video browsing." International Conference on Multimedia Modeling. Springer, Cham, 2015.
  • 17. Interactive Video Retrieval Approach ● Assume a smart and interactive user ○ That knows about the challenges and shortcomings of simple querying ○ But might also know how to circumvent them ○ Could be a digital native! ● Give him/her full control over the search process ○ Provide many query and interaction features ■ Querying, browsing, navigation, filtering, inspecting/watching ● Assume an iterative/exploratory search process ○ Search - Inspect - Think - Repeat ○ “Will know it when I see it” ○ Could include many iterations! ○ Instead of “query-and-browse results”
  • 18. What Users Might Need... Concept Search Browsing features Motion Sketch Search History Hudelist, Marco A., Christian Beecks, and Klaus Schoeffmann. "Finding the chameleon in your video collection." Proceedings of the 7th International Conference on Multimedia Systems. ACM, 2016.
  • 19. Typical Query Types of Video Retrieval Tools ● Query-by-text ○ Enter keywords to match with available or extracted text (e.g., metadata, OCR, ASR, concepts, objects...) ● Query-by-concept ○ Show content for a specific class/category from concept detection (e.g., from ImageNet) ● Query-by-example ○ Provide example image/scene/sound ● Query-by-filtering ○ Filter content by some metadata or content feature (time, color, edge, motion, …) ● Query-by-sketch ○ Provide sketch of image/scene ● Query-by-dataset-example ○ Look for similar but other results ● Query-by-exploration ○ Start by looking around / browsing ○ Needs appropriate visualization ● Query-by-inspection ○ Inspect single clips, navigate Search in multimedia content (particularly video) is a highly interactive process! Users want to look around, try different query features, inspect results, refine queries, and start all over again! Automatic Interactive
  • 20. Evaluation of Interactive Video Retrieval ● Interfaces are inherently developed for human users ● Every user might be different ○ Different culture, knowledge, preferences, experiences, ... ○ Even the same user at a different time ● Video search interfaces need to be evaluated with real users... ○ No simulations! ○ User studies and campaigns (TRECVID, MediaEval, VBS, LSC)! ○ Find out how well users perform with a specific system ● ...and with real data! ○ Real videos “in the wild” (e.g., IACC.1 and V3C dataset) ○ Actual queries that would make sense in practice ○ Comparable evaluations (same data, same conditions, etc.) International competitions Datasets
  • 21. Only same dataset, query, time, room/condition, ... ...allows for true comparative evaluation!
  • 22. Where is the User in the Age of Deep Learning?
  • 23. 2. Interactive Video Search Tools Common architecture, components and top ranked tools
  • 24. What are basic video preprocessing steps? What models are used? Where interactive search helps? Common Architecture
  • 25. Common Architecture - Temporal Segmentation M. Gygli. Ridiculously Fast Shot Boundary Detection with Fully Convolutional Neural Networks. https://arxiv.org/pdf/1705.08214.pdf1. Compute a score based on a distance of frames 2. Threshold-based decision (fixed/adaptive)
  • 26. Common Architecture - Semantic Search Classification and embedding by popular Deep CNNs AlexNet (A. Krizhevsky et al., 2012) GoogLeNet (Ch. Szegedy et al., 2015) ResNet (K. He et al., 2015) NasNet (B. Zoph et al., 2018) ... Object detectors appear too (YOLO, SSD) Joint embedding models? VQA?
  • 27. Common Architecture - Sketch based Search Sketches from memory Just part of the scene Edges often do not match Colors often do not match => invariance needed
  • 28. Common Architecture - Limits ● Used ranking models have their limits ○ Missed frames ○ Wrong annotation ○ Inaccurate similarity function ● Still, to find a shot of a class is often easy (see later), but to find one particular shot or all shots of a class? T. Soucek. Known-Item Search in Image Datasets Using Automatically Detected Keywords. BC thesis, 2018.
  • 29. Common Architecture at VBS - Interactive Search Hudelist & Schoeffmann. An Evaluation of Video Browsing on Tablets with the ThumbBrowser. MMM2017 Goeau et al.,, Table of Video Content, ICME 2007
  • 30. Aspects of Flexible Interactive Video Search
  • 31. VIRET tool (Winner of VBS 2018, 3. at LSC 2018) Filters Query by text Query by color Query by image Video player Top ranked frames by a query Representative frames from the selected video Frame-based retrieval system with temporal context visualization. Focus on simple interface! Jakub Lokoc, Tomas Soucek, Gregor Kovalcik: Using an Interactive Video Retrieval Tool for LifeLog Data. LSC@ICMR 2018: 15-19, ACM Jakub Lokoc, Gregor Kovalcik, Tomas Soucek: Revisiting SIRET Video Retrieval Tool. VBS@MMM 2018: 419-424, Springer
  • 32. VIRET Tool (Winner of VBS 2018)
  • 33. ITEC Tool Primus, Manfred Jürgen, et al. "The ITEC Collaborative Video Search System at the Video Browser Showdown 2018." International Conference on Multimedia Modeling. Springer, Cham, 2018.
  • 34. ITEC tool (2. at VBS 2018 and LSC 2018)https://www.youtube.com/watch?v=CA5kr2pO5b
  • 35. LSC (Geospatial Browsing) W Hürst, K Ouwehand, M Mengerink, A Duane and C Gurrin. Geospatial Access to Lifelogging Photos in Virtual Reality. The Lifelog Search Challenge 2018 at ACM ICMR 2018.
  • 36. LSC (Interactive Video Retrieval) J. Lokoč, T. Souček and G. Kovalčík. Using an Interactive Video Retrieval Tool for LifeLog Data. The Lifelog Search Challenge 2018 at ACM ICMR 2018. (3rd highest performing system, but the same system won VBS 2018)
  • 37. LSC (LiveXplore) A Leibetseder, B Muenzer, A Kletz, M Primus and K Schöffmann. liveXplore at the Lifelog Search Challenge 2018. The Lifelog Search Challenge 2018 at ACM ICMR 2018. (2nd highest performing system)
  • 38. VR Lifelog Search Tool (winner of LSC 2018) Large lifelog archive with time-limited KIS topics Multimodal (visual concept and temporal) query formulation Ranked list of visual imagery (image per minute) Gesture-based manipulation of results A Duane, C Gurrin & W Hürst. Virtual Reality Lifelog Explorer for the Lifelog Search Challenge at ACM ICMR 2018. The Lifelog Search Challenge 2018 at ACM ICMR 2018. Top Performing System.
  • 40. vitrivr (University of Basel) ● Open-Source content-based multimedia retrieval stack ○ Supports images, music, video and 3D-models concurrently ○ Used for various applications both in and outside of academia ○ Modular architecture enables easy extension and customization ○ Compatible with all major operating systems ○ Available from vitrivr.org ● Participated several times in VBS (originally as IMOTION) [Credit: Luca Rossetto]
  • 41. vitrivr (University of Basel) ● System overview [Credit: Luca Rossetto]
  • 42. vitrivr (University of Basel) [Credit: Luca Rossetto]
  • 43. vitrivr (University of Basel) [Credit: Luca Rossetto]
  • 44. vitrivr (University of Basel) [Credit: Luca Rossetto]
  • 46. Overview of Evaluation Approaches ● Qualitative user study/survey ○ Self report: ask users about their experience with the tool, thinking aloud tests, etc. ○ Using psychophysiological measurements (e.g., electrodermal activity - EDA) ● Log-file analysis ○ Analyze server and/or client-side interaction patterns ○ Measure time needed for certain actions, etc. ● Question answering ○ Ask questions about content (open, multiple choice) to assess which content users found ● Indirect/task-based evaluation (Cranfield paradigm) ○ Pose certain tasks, measure the effectiveness of solving the task ○ Quantitative user study with many users and trials ○ Open competition, as in VBS, LSC, and TRECVID
  • 47. Properties of Evaluation Approaches ● Availability and level of detail of ground truth ○ None (e.g., questionnaires, logs) ○ Detailed and complete (e.g., retrieval tasks) ● Effort during experiments ○ Low (automatic check against ground truth) ○ Moderate (answers need to checked by human, e.g. live judges) ○ High (observation of or interview with participants) ● Controlled conditions ○ All users in same room with same setup (typical user-study) vs. participants via online survey ● Statistical tests! ○ We can only conclude that one interactive tool is better than the other, if there is statistically significant proof ○ Tests like ANOVA, t-tests, Wilcoxon-signed rank tests, … ○ Consider prerequisites of specific test (e.g., normal distribution)
  • 48. Example: Comparing Tasks and User Study ● Experiment compared ○ Question answering ○ Retrieval tasks ○ User study with questionnaire ● Materials ○ Interactive search tool with keyframe visualisation ○ TRECVID BBC rushes data set (25 hrs) ○ Questionnaire adapted from TRECVID 2004 ○ 19 users, doing at least 4 tasks W. Bailer and H. Rehatschek, Comparing Fact Finding Tasks and User Survey for Evaluating a Video Browsing Tool. ACM Multimedia 2009.
  • 49. Example: Comparing Tasks and User Study ● TVB1 I was familiar with the topic of the query. ● TVB3 I found that it was easy to find clips that are relevant. ● TVB4 For this topic I had enough time to find enough clips. ● TVB5 For this particular topic the tool interface allowed me to do browsing efficiently. ● TVB6 For this particular topic I was satisfied with the results of the browsing. W. Bailer and H. Rehatschek, Comparing Fact Finding Tasks and User Survey for Evaluating a Video Browsing Tool. ACM Multimedia 2009.
  • 50. Using Electrodermal Activity (EDA) Measuring EDA during retrieval tasks (A, B, C, D) with an interactive search tool, 14 participants C. Martinez-Peñaranda, et al., A Psychophysiological Approach to the Usability Evaluation of a Multi-view Video Browsing Tool,” MMM 2013.
  • 51. History of Selected Evaluation Campaigns ● Evaluation campaigns for video analysis and search started in early 2000s ○ Most well-known are TRECVID and MediaEval (previously ImageCLEF) ○ Both spin-offs from text retrieval benchmarks ● Several ones include tasks that are relevant to video search ● Most tasks are designed to be fully automatic ● Some allow at least interactive submissions as an option ○ Most submissions are usually still for the automatic type ● Since 2007, live evaluations with audience have been organized at major international conferences ○ Videolympics, VBS, LSC
  • 52. History of Selected Evaluation Campaigns
  • 53. History of Selected Evaluation Campaigns
  • 54. TRECVID ● Workshop series (2001 – present) → http://trecvid.nist.gov ● Started as a track in the TREC (Text REtrieval Conference) evaluation benchmark. ● Became an independent evaluation benchmark since 2003. ● Focus: content-based video analysis, retrieval, detection, etc. ● Provides data, tasks, and uniform, appropriate scoring procedures ● Aims for realistic system tasks and test collections: ○ Unfiltered data ○ Focus on relatively high-level functionality (e.g. interactive search) ○ Measurement against human abilities ● Forum for the ○ exchange of research ideas and for ○ the discussion of research methodology – what works, what doesn’t , and why
  • 55. TRECVID Philosophy ● TRECVID is a modern example of the Cranfield tradition ○ Laboratory system evaluation based on test collections ● Focus on advancing the state of the art from evaluation results ○ TRECVID’s primary aim is not competitive product benchmarking ○ Experimental workshop: sometimes experiments fail! ● Laboratory experiments (vs. e.g., observational studies) ○ Sacrifice operational realism and broad scope of conclusions ○ For control and information about causality – what works and why? ○ Results tend to be narrow, at best indicative, not final ○ Evidence grows as approaches prove themselves repeatedly, as part of various systems, against various test data, over years
  • 56. TRECVID Datasets HAVIC Soap opera (since 2013) Social media (since 2016) Security cameras (since 2008)
  • 57. Teams actively participated (2016-2018) INF CMU; Beijing University of Posts and Telecommunication; University Autonoma de Madrid; Shandong University; Xian JiaoTong University Singapore kobe_nict_siegen Kobe University, Japan; National Institute of Information and Communications Technology, Japan; University of Siegen, Germany UEC Dept. of Informatics, The University of Electro-Communications, Tokyo ITI_CERTH Information Technology Institute, Centre for Research and Technology Hellas ITEC_UNIKLU Klagenfurt University NII_Hitachi_UIT National Institute Of Informatics.; Hitachi Ltd; University of Information Technology (HCM-UIT) IMOTION University of Basel, Switzerland; University of Mons, Belgium; Koc University, Turkey MediaMill University of Amsterdam ; Qualcomm Vitrivr University of Basel Waseda_Meisei Waseda University; Meisei University VIREO City University of Hong Kong EURECOM EURECOM FIU_UM Florida International University, University of Miami NECTEC National Electronics and Computer Technology Center NECTEC RUCMM Renmin University of China NTU_ROSE_AVS ROSE LAB, NANYANG TECHNOLOGICAL UNIVERSITY SIRET SIRET Department of Software Engineering, Faculty of Mathematics and Physics, Charles University UTS_ISA University of Technology Sydney
  • 58. VideOlympics ● Run the same year’s TRECVID search tasks live in front of audience ● Organized at CIVR 2007-2009 Photos: Cees Snoek, https://www.flickr.com/groups/civr2007/
  • 59. Video Browser Showdown (VBS) ● Video search competition (annually at MMM) ○ Inspired by VideOlympics ○ Demonstrates and evaluates state-of-the-art interactive video retrieval tools ○ Also, entertaining event during welcome reception at MMM ● Participating teams solve retrieval tasks ○ Known-item search (KIS) tasks - one result - textual or visual ○ Ad-hoc video search (AVS) tasks - many results - textual ○ In large video archive (originally in 60 mins videos only) ● Systems are connected to the VBS Server ○ Presents tasks in live manner ○ Evaluates submitted results of teams (penalty for false submissions) First VBS in Klagenfurt, Austria (only search in a single video)
  • 60. Video Browser Showdown (VBS) 2012: Klagenfurt 11 teams KIS, single video (v) 2013: Huangshan 6 teams KIS, single video (v+t) 2014: Dublin 7 teams KIS, single video and 30h archive (v+t) 2015: Sydney 9 teams KIS, 100h archive (v+t) 2016: Miami 9 teams KIS, 250h archive (v+t) 2017: Reykjavik 6 teams KIS, 600h archive (v+t) AVS, 600h archive (t) 2018: Bangkok 9 teams KIS, 600h archive (v+t) AVS, 600h archive (t) 2019: Thessaloniki 6 teams KIS, 1000h archive (v+t) AVS, 1000h archive (t)
  • 61. Video Browser Showdown (VBS) VBS Server: • Presents queries • Shows remaining time • Computes scores • Shows statistics/ranking
  • 62.
  • 63. Video Browser Showdown (VBS)https://www.youtube.com/watch?v=tSlYFNlsn8U&t=140
  • 64. Lifelog Search Challenge (LSC 2018) ● New (annual) search challenge at ACM ICMR ● Focus on a life retrieval challenge o from multimodal lifelog data o Motivated by the fact that ever larger personal data archives are being gathered and the advent of AR technologies & veracity of data’ means that archives of life experiences are likely to become more commonplace. ● To be useful, the data should be searchable… o and for lifelogs, that means interactive search
  • 65. Lifelog Search Challenge (Definition) Dodge and Kitchin (2007), refer to lifelogging as “a form of pervasive computing, consisting of a unified digital record of the totality of an individual’s experiences, captured multi-modally through digital sensors and stored permanently as a personal multimedia archive”.
  • 66. Lifelog Search Challenge (Motivation)
  • 67. Lifelog Search Challenge (Lifelogging)
  • 68. Lifelog Search Challenge (Data) One month archive of multimodal lifelog data, extracted from NTCIR-13 Lifelog collection, including: ○ Wearable camera images at a rate of 3-5 / minute & concept annotations. ○ Biometrics ○ Activity logs ○ Media consumption ○ Content created/consumed u1_2016-08-15_050922_1, 'indoor', 0.991932, 'person', 0.9719478, 'computer', 0.309054524
  • 69. Lifelog Search Challenge (One Minute) <minute id="496"> <location> <name>Home</name> </location> <bodymetrics> <calories>2.8</calories> <gsr>7.03E-05</gsr> <heart-rate>94</heart-rate> <skin-temp>86</skin-temp> <steps>0</steps> </bodymetrics> <text>author,1,big,2,dout,1,revis,1,think,1,while,1</text> <images> <image> <image-id>u1_2016-08-15_050922_1</image-id> <image-path>u1/2016-08-15/20160815_050922_000.jpg</image-path> <annotations>'indoor', 0.985, 'computer', 0.984, 'laptop', 0.967, 'desk', 0.925</annotations> </image> </images> </minute>
  • 70. Lifelog Search Challenge (Topics) <Description timestamp="0">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background.</Description> <Description timestamp="30">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant on my right side.</Description> <Description timestamp="60">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant on my right side. There are keys on the table in front of me and you can see the cafe sign on the left side. I walked to the cafe and it took less than two minutes to get there.</Description> <Description timestamp="90">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant on my right side. There are keys on the table in front of me and you can see the cafe sign on the left side. I walked to the cafe and it took less than two minutes to get there. My colleague in the foreground is wearing a white shirt and drinking coffee from a red paper cup.</Description> <Description timestamp="120">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant on my right side. There are keys on the table in front of me and you can see the cafe sign on the left side. I walked to the cafe and it took less than two minutes to get there. My colleague in the foreground is wearing a white shirt and drinking coffee from a red paper cup. Immediately after having the coffee, I drive to the shop.</Description> <Description timestamp="150">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant on my right side. There are keys on the table in front of me and you can see the cafe sign on the left side. I walked to the cafe and it took less than two minutes to get there. My colleague in the foreground is wearing a white shirt and drinking coffee from a red paper cup. Immediately after having the coffee, I drive to the shop. It is a Monday.</Description> Temporarily enhancing topic descriptions that get more detailed (easier) every thirty seconds. The topics have 1 or few relevant items in the collection.
  • 71. Lifelog Search Challenge 2018 (Six Teams)
  • 72. 4. Task Design and Datasets Task types, trade-offs, datasets, annotations
  • 73. Task Types: Introduction ● Searching for content can be modelled as different task types ○ Choice impacts dataset preparation, annotations, evaluation methods ○ and the way to run the experiments ● Some of the task types here have fully automatic variants… ○ out of scope, but may serve as baseline to compare to ● Task can be categorized by the target and the formulation of the query ○ Particular target item vs. set or class ○ only one target item in data set, or ○ multiple occurrences of an instance, of a class of relevant items/segments ○ Definition of query ○ example, given in a specific modality ○ precise definition vs. fuzzy idea
  • 74. Task Types (at Campaigns): Overview
  • 75. Task Types (at Campaigns): Overview How clear is search intent? Known-item search AVS tasks Example Visual Textual Abstract None This is how I use web video search VIS tasks Given video dataset What is the role of similarity for KIS atVideo Browser Showdown? SISAP'18, Peru
  • 76. Task Type: Visual Instance Search ● User holds a digital representation of a relevant example of the needed information ● Example or its features can be sent to system ● User does not need to translate example into query representation ● e.g., trademark/logo detection
  • 77. Task Types: Known Item Search (KIS) ● User sees/hears/reads a representation ○ Target item is described or presented ● Used in VBS & LSC ● Exactly one target semantics ○ Representation of exactly one relevant item/segment in dataset ● Models user’s (partly faded) memories ○ user has a memory of content to be found, might be fuzzy ● User must translate representation to provided query methods ○ The complexity of this translation depends significantly on the modality ■ e.g., visual is usually easier than textual, which leaves more room for interpretation ○ Relation of/to content is important too ■ e.g. searching in own life log media vs. searching in media collection on the web “on a busy street”
  • 78. Task Types: Ad-hoc Search ● User sees/hears/reads a representation of the needed information ○ Target item is described or presented ● Many targets semantics ○ Representation of a broader set/class of relevant items/segments ○ cf. TRECVID AVS task ● Models user’s rough memories ○ user has only a memory of the type of relevant content, not about details ● Similar issues of translating the representation like for KIS ○ but due to broader set of relevant items the correct interpretation of textual information is a less critical issue ● Raises issues of what is considered within/without scope of a result set ○ e.g., partly visible, visible on a screen in the content, cartoon/drawing versions, … ○ TRECVID has developed guidelines for annotation of ground truth
  • 79. Task Types: Exploration ● User does not start from a clear idea/query of the information need ○ No concrete query, just inspects dataset ○ Browsing and exploring may lead to identifying useful content ● Reflects a number of practical situations, but very hard to evaluate ○ User simply cannot describe the content ○ User does not remember content but would recognize it ○ Dontent inspection for the sake of interest ○ Digital forensics ● No known examples of such tasks in benchmarking campaigns due to the difficulties with evaluation Demo: https://www.picsbuffet.com/ Barthel, Kai Uwe, Nico Hezel, and Radek Mackowiak. "Graph-based browsing for large video collections." International Conference on Multimedia Modeling. Springer, Cham, 2015.
  • 80. Task Design is About Trade-offs: Aspects to consider Tasks shall ○ model real-world content search problems ■ in order to assess whether tools are usable for these problems ○ set controlled conditions ■ to enable reliable assessment ○ be repeatable ■ to compare results from different evaluation sessions ○ avoid bias towards certain features or query methods many real world problems involve very fuzzy information needs well defined queries are best suited for evaluation users remember more about the scene when they start looking through examples information in the task should be provided at defined points in time during evaluation sessions, relevant shots may be discovered, and the ground truth updated for repeatable evaluation, a fixed ground truth set is desirable although real world tasks may involve time pressure, it would be best to measure the time until the task is solved time limits are needed in evaluation sessions for practical reasons
  • 81. Task Selection (KIS @ VBS) ● Known duplicates: ○ List of known (partial) duplicates from matching metadata and file size ○ Content-based matches ● Uniqueness inside same and similar content: ○ Ensure unambiguous target ○ May be applied to sequence of short shots rather than single shot ● Complexity of segment: ○ Rough duration of 20s ○ Limited number of shots ● Describe-ability: ○ Textual KIS requires segments that can be described with limited amount of text (less shots, salient location or objects, etc.)
  • 82. VBS KIS Task Selection - Examples ● KIS Visual (video 37756, frame 750-1250) ○ Short shots, varying content - hard to describe as text, but unique sequence ● KIS Textual (video 36729, frame 4047-4594) ○ @0 sec: “Shots of a factory hall from above. Workers transporting gravel with wheelbarrows. Other workers putting steel bars in place.” ○ @100 sec: “The hall has Cooperativa Agraria written in red letters on the roof.” ○ @200 sec: “There are 1950s style American cars and trucks visible in one shot.”
  • 83. Presenting Queries (VBS) ● Example picture? ○ allow taking pictures of visual query clips? ● Visual ○ Play query once ■ one chance to memorize, but no chance to check possibly relevant shot against query — in real life, one cannot visually check, but one does not forget what one knew at query time ○ Repeat query but blur increasingly ■ basic information is there, but not possible to check details ● Textual ○ User's memory is for most people also visual ○ Simulate case where retrieval expert is asked to find content ■ expert could ask questions ○ Provide incremental details about the scene (but initial piece of information must already be unambiguous for KIS)
  • 84. Task Participants ● Typically developers of tools participate in evaluation campaigns ○ They know how to translate information requests into queries ○ Knowledge of user has huge impact on performance that can be achieved ● “Novice session” ○ Invite members from the audience to use the tools, after a brief introduction ○ Provides insights about usability and complexity of tool ○ In real use cases, users are rather domain experts than retrieval experts, thus this condition is important to test ○ Selection of novices is an issue for comparing results ○ Question of whether/how scores of expert and novice tasks shall be combined
  • 85. Real-World Datasets ● Research neds reproducible results ○ standardized and free datasets are necessary ● One problem with many datasets: ○ current state of web video in the wild is not or no longer represented accurately by them [Rossetto & Schuldt] ● Hence, we also need datasets that model the real world ○ One such early effort: ○ V3C is such a dataset (see later) Rossetto, L., & Schuldt, H. (2017). Web video in numbers-an analysis of web-video metadata. arXiv preprint arXiv:1707.01340.
  • 86. Videos in the Wild Age-distribution of common video collections vs what is found in the wild Rossetto, L., & Schuldt, H. (2017). Web video in numbers-an analysis of web-video metadata. arXiv preprint arXiv:1707.01340.
  • 87. Videos in the Wild Duration-distribution of common video collections vs what is found in the wild Rossetto, L., & Schuldt, H. (2017). Web video in numbers-an analysis of web-video metadata. arXiv preprint arXiv:1707.01340.
  • 88. Dataset Preparation and Annotations ● Data set = content + annotations for specific problem ● Today, content is everywhere ● Annotations are still hard to get ○ External data (e.g., archive documentation) often not available at sufficient granularity and time-indexed ○ Creation by experts is prohibitively costly ● Approaches ○ Crowdsourcing (with different notions of “crowd” impacting quality) ○ Reduce amount of annotations needed ○ Generate data set and ground truth
  • 89. Collaborative Annotation Initiatives from TRECVID participants 2003-2013 ○ http://mrim.imag.fr/tvca/ ○ Concept annotations for high-level feature extraction/semantic indexing tasks ○ As data sets grew in size, the percentage of the content that could be annotated declined ○ Use of active learning to select samples where annotation brings highest benefit S. Ayache and G. Quénot, "Video Corpus Annotation using Active Learning", ECIR 2008.
• 90. Crowdsourcing with the General Public ● Use platforms like Amazon Mechanical Turk to collect data ○ The main issue, however, is that annotations are noisy and unreliable ● Solutions ○ Multiple annotations and majority votes (see the sketch below) ○ Include tasks that help assess the confidence in a specific worker ■ e.g., asking easy questions first, to verify facts about the image ○ More sophisticated aggregation strategies ● MediaEval ran tasks in 2013 and 2014 ○ Annotation of fashion images and timed comments about music B. Loni, M. Larson, A. Bozzon, L. Gottlieb, Crowdsourcing for Social Multimedia at MediaEval 2013: Challenges, Data set, and Evaluation, MediaEval WS Notes, 2013. K. Yadati, P. S.N. Shakthinathan Chandrasekaran Ayyanathan, M. Larson, Crowdsorting Timed Comments about Music: Foundations for a New Crowdsourcing Task, MediaEval WS Notes, 2014.
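To make the aggregation idea concrete, here is a minimal Python sketch (with hypothetical data) of majority-vote aggregation of noisy crowd labels, including a simple agreement ratio that can serve as a crude per-item confidence signal; real deployments typically add worker-quality weighting on top of this.

from collections import Counter

def aggregate_labels(annotations):
    """Majority-vote aggregation of crowd labels.

    annotations: dict mapping item_id -> list of labels from different workers.
    Returns: dict mapping item_id -> (winning label, agreement ratio).
    """
    aggregated = {}
    for item_id, labels in annotations.items():
        label, votes = Counter(labels).most_common(1)[0]
        aggregated[item_id] = (label, votes / len(labels))
    return aggregated

# Toy example: three workers annotate two images
print(aggregate_labels({
    "img_001": ["cat", "cat", "dog"],           # 2/3 agreement
    "img_002": ["indoor", "indoor", "indoor"],  # full agreement
}))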
  • 91. Pooling ● Exhaustive relevance judgements are costly for large data sets ● Annotate pool of top k results returned from participating systems ● Pros ○ Efficient ○ Results are correct for all participants, not an approximation ● Cons ○ Annotations can only be done after experiment ○ Repeating the experiment with new/updated systems requires updating the annotation (or getting approximate results) Sri Devi Ravana et al., Document-based approach to improve the accuracy of pairwise comparison in evaluating information retrieval systems, ASLIB J. Inf. Management, 67(4), 2015.
• 92. Live Annotation ● Assessment of incoming results during the competition ● Used at VBS 2017-2018 ● Addresses issues of incomplete or missing ground truth ○ e.g., ground truth created using pooling, or new queries ● Pros ○ Provides immediate feedback ○ Avoids biased results from ground truth pooled from other systems ● Cons ○ Done under time pressure ○ Not possible to review other similar cases - may cause inconsistency in decisions ○ Multi-annotator agreement would be needed (impacts decision time and number of annotators needed)
• 93. Live Annotation – Example from VBS 2018 ● 1,848 shots judged live ○ About 40% of the submitted shots were not in the TRECVID ground truth ● Verification experiment ○ 1,383 shots were judged again later ○ Judgements diverged for 23% of the shots; in 88% of those cases the live judgement was “incorrect” ● Judges seem to decide “incorrect” when in doubt ○ While the ground truth kept for later use is biased, conditions are still the same for all teams in the room ● Need to set up clear rules for live judges ○ Like those used by NIST for TRECVID annotations (Figure: examples of diverging judgements for shots from the same video)
• 94. Assembling Content and Ground Truth ● MPEG Compact Descriptors for Video Analysis (CDVA) ○ Dataset for the evaluation of visual instance search ○ 23,000 video clips (1 min to over 1 hr) ● Annotation effort too high ○ Generate query and reference clips from three disjoint subsets ○ Randomly embed the relevant segment in noisy material ○ Apply transformations to query clips ○ Ground truth is generated from the editing scripts (see the sketch below) ○ Created 9,715 queries, 5,128 references
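As an illustration only (not the actual MPEG CDVA tooling), the following Python sketch shows the general idea of generating queries and ground truth from an editing script: embed a known reference segment at a random position in distractor material, transform the result, and record where the reference went.

import random

def build_query_clip(reference_segment, distractors, transform):
    """Hypothetical generation of a query clip plus ground truth:
    embed a known reference segment into distractor material, apply a
    transformation, and derive ground truth from the editing script."""
    insert_at = random.randint(0, len(distractors))
    timeline = distractors[:insert_at] + [reference_segment] + distractors[insert_at:]
    query_clip = [transform(seg) for seg in timeline]
    editing_script = {"reference_id": reference_segment["id"], "position": insert_at}
    return query_clip, editing_script

# Toy usage: segments are dicts; the "transformation" here just tags the segment
clip, gt = build_query_clip(
    {"id": "ref_042", "frames": 250},
    [{"id": "noise_1", "frames": 300}, {"id": "noise_2", "frames": 500}],
    transform=lambda seg: {**seg, "transformed": True},
)
print(gt)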
  • 95. Process for LSC Dataset Generation ● Lifelog data has an inevitable privacy/GDPR compliance concern ● Required a cleaning/anonymization process for images, locations & words ○ Lifelogger deletes private/embarrassing images, validated by researcher ○ Images resized down (1024x768) to remove readable text ○ faces automatically & manually blurred; locations anonymized ○ Manually generated blacklist of terms for removal from textual data
  • 96. Available Datasets ● Past TRECVID data ○ https://www-nlpir.nist.gov/projects/trecvid/past.data.table.html ○ Different types of usage conditions and license agreements ○ Ground truth, annotations and partly extracted features are available ● Past MediaEval data ○ http://www.multimediaeval.org/datasets/index.html ○ Mostly directly downloadable, annotations and sometimes features available ● Some freely available data sets ○ TRECVID IACC.1-3 ○ TRECVID V3C1 (starting 2019), will also be used for VBS (download available) ○ BLIP 10,000 http://skuld.cs.umass.edu/traces/mmsys/2013/blip/Blip10000.html ○ YFCC100M https://webscope.sandbox.yahoo.com/catalog.php?datatype=i&did=67 ○ Stanford I2V http://purl.stanford.edu/zx935qw7203
  • 97. Available Datasets ● MPEG CDVA data set ○ Mixed licenses, partly CC, partly specific conditions of content owners ● NTCIR-Lifelog datasets ○ NTCIR-12 Lifelog - 90 days of mostly visual and activity data from 3 lifeloggers (100K+ images) ■ ImageCLEF 2017 dataset a subset of NTCIR-12 ○ NTCIR-13 Lifelog - 90 days of richer media data from 2 lifeloggers (95K images) ■ LSC 2018 - 30 days of visual, activity, health, information & biometric data from one lifelogger ■ ImageCLEF 2018 dataset a subset of NTCIR-13 ○ NTCIR-14 - 45 days of visual, biometric, health, activity data from two lifeloggers
  • 98. Example: V3C Dataset Vimeo Creative Commons Collection ○ The Vimeo Creative Commons Collection (V3C) [2] consists of ‘free’ video material sourced from the web video platform vimeo.com. It is designed to contain a wide range of content which is representative of what is found on the platform in general. All videos in the collection have been released by their creators under a Creative Commons License which allows for unrestricted redistribution. Rossetto, L., Schuldt, H., Awad, G., & Butt, A. (2019). V3C – a Research Video Collection. Proceedings of the 25th International Conference on MultiMedia Modeling.
  • 99. 5. Evaluation procedures, results and metrics Interactive and automatic retrieval
  • 100. Evaluation settings for interactive retrieval tasks ● For each tool, human in the loop ... ○ Same room, projector, time pressure ○ Expert and novice users ● … compete in simulated tasks (KIS, AVS, ...) ○ Shared dataset in advance (V3C1 1000h) ○ 2V+1T KIS sessions and 2 AVS sessions ■ Tasks selected randomly and revisited ■ Tasks presented on data projector
  • 101. Evaluation settings for interactive retrieval tasks ● Problem with repeatability of results ○ Human in the loop, conditions ● Evaluation provides one comparison of tools in a shared environment with a given set of tasks, users and shared dataset ○ Performance reflected by an overall score
  • 102. Known-item search tasks at VBS 2018
  • 103. Results of VBS 2018
• 104. Results - observed trends 2015-2017 (dataset sizes: 2015 - 100 hours, 2016 - 250 hours, 2017 - 600 hours) Observation: AVS tasks are easier than visual KIS, which in turn is easier than textual KIS J. Lokoc, W. Bailer, K. Schoeffmann, B. Muenzer, G. Awad, On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017, IEEE Transactions on Multimedia, 2018
  • 105. KIS score function (since 2018) ● Reward for solving a task ● Reward for being fast ● Fair scoring around time limit ● Penalty for wrong submissions J. Lokoc, W. Bailer, K. Schoeffmann, B. Muenzer, G. Awad, On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017, IEEE Transactions on Multimedia, 2018
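A minimal sketch of such a KIS-style score function, assuming illustrative constants (100-point maximum, a floor of 50 points for any correct answer within the time limit, 10 points per wrong submission); the official VBS formula uses its own constants and details.

def kis_score(time_used, time_limit, wrong_submissions, solved,
              max_score=100, min_if_solved=50, wrong_penalty=10):
    """Illustrative KIS-style scoring: reward solving, reward speed,
    keep scoring fair near the time limit, penalize wrong submissions.
    Constants are placeholders, not the official VBS values."""
    if not solved or time_used > time_limit:
        return 0
    # Linear decay from max_score to min_if_solved over the time limit,
    # so a correct answer just before the limit still earns a solid reward.
    speed_part = max_score - (max_score - min_if_solved) * time_used / time_limit
    return max(0, speed_part - wrong_penalty * wrong_submissions)

# e.g., solved after 120 of 300 seconds with one wrong submission -> 70
print(kis_score(time_used=120, time_limit=300, wrong_submissions=1, solved=True))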
• 106. AVS score function (since 2018) ● Score based on precision and recall (formulas shown for VBS 2017 and VBS 2018)
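Again only a hedged sketch: an AVS-style score can combine precision over a team's submissions with recall over the distinct relevant videos found. The actual VBS 2018 formula is more elaborate, so the function below conveys just the basic precision/recall idea.

def avs_score(correct_shots, wrong_shots, relevant_videos_found,
              total_relevant_videos, max_score=100):
    """Toy AVS-style score: precision of the submitted shots times recall
    over distinct relevant videos, scaled to max_score. Not the exact
    VBS formula."""
    submitted = correct_shots + wrong_shots
    precision = correct_shots / submitted if submitted else 0.0
    recall = (relevant_videos_found / total_relevant_videos
              if total_relevant_videos else 0.0)
    return max_score * precision * recall

# e.g., 18 correct and 2 wrong shots, covering 9 of 30 relevant videos -> 27.0
print(avs_score(correct_shots=18, wrong_shots=2, relevant_videos_found=9,
                total_relevant_videos=30))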
  • 107. Overall scores at VBS 2018
• 108. Settings and metrics in the LSC Evaluation ● Similar to VBS… for each tool, a human in the loop ... ○ Same room, projector, time pressure ○ Expert and novice users ● … competes in simulated tasks (all of KIS type) ○ Shared dataset in advance (LSC dataset - 27 days) ○ 6 expert topics & 12 novice topics ■ Topics prepared by the organisers with full (non-pooled) relevance judgements for all topics ■ Tasks presented on a data projector ■ Participants submit a ‘correct’ answer to the LSC server, which evaluates it against the ground truth.
• 109. Lifelog Search Challenge (Topics) I am building a chair that is wooden in the late afternoon. I am at work, in an office environment (23 images, 12 minutes). I am walking out to an airplane across the airport apron. I stayed in an airport hotel on the previous night before checking out and walking a short distance to the airport (1 image, 1 minute). I was in a Norwegian furniture store in a shopping mall (16 images, 9 minutes). I was eating in a Thai restaurant (130 images, 66 minutes). There was a large picture of a man carrying a box of tomatoes beside a child on a bicycle (185 images, 97 minutes). I was playing a vintage car-racing game on my laptop in a hotel after flying (53 images, 27 minutes). I was watching 'The Blues Brothers' movie on the TV at home (82 images, 42 minutes).
• 110. LSC Score Function Score calculated from 0 to 100, based on the amount of time remaining. Negative scoring for incorrect answers (lose 10% of the available score). Overall score is based on the sum of scores over all expert and novice topics. Similar to VBS, there is a problem with repeatability of results (human in the loop). The evaluation provides one comparison of tools in a shared environment with a given set of tasks, users and a shared dataset.
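A minimal sketch of the LSC scoring as described on this slide, under the assumption that each incorrect answer removes 10% of the score still available at submission time (the exact server implementation may differ).

def lsc_score(time_used, time_limit, wrong_submissions, solved,
              max_score=100, wrong_penalty=0.10):
    """Illustrative LSC-style score: 0-100 based on time remaining,
    minus 10% of the available score per incorrect answer."""
    if not solved or time_used > time_limit:
        return 0.0
    available = max_score * (time_limit - time_used) / time_limit
    return max(0.0, available * (1 - wrong_penalty * wrong_submissions))

# e.g., solved after 100 of 300 seconds with two incorrect answers first -> ~53.3
print(lsc_score(time_used=100, time_limit=300, wrong_submissions=2, solved=True))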
• 111. Evaluation settings at TRECVID ● Three run types: ○ Fully automatic ○ Manually-assisted ○ Relevance-feedback ● Query/Topics: ○ Text only ○ Text + image/video examples ● Training conditions: ○ Training data from the same/cross domain as testing ○ Training data collected automatically ● Results: the system returns the top 1000 shots that most likely satisfy the query/topic
• 112. Query Development Process ● Sample test videos (~30 - 40%) were viewed by 10 human assessors hired by NIST. ● 4 facets describing different scenes were used (if applicable) to annotate the watched videos: ○ Who: concrete objects and beings (kinds of persons, animals, things) ○ What: what are the objects and/or beings doing? (generic actions, conditions/state) ○ Where: locale, site, place, geographic, architectural, etc. ○ When: time of day, season ● Test queries were constructed from the annotated descriptions to include Persons, Actions, Locations, and Objects and their combinations.
  • 113. Sample topics of Ad-hoc search queries Find shots of a person holding a poster on the street at daytime Find shots of one or more people eating food at a table indoors Find shots of two or more cats both visible simultaneously Find shots of a person climbing an object (such as tree, stairs, barrier) Find shots of car driving scenes in a rainy day Find shots of a person wearing a scarf Find shots of destroyed buildings
• 114. Evaluation settings at TRECVID ● Usually 30 queries/topics are evaluated per year ● NIST hires 10 human assessors to: ○ Watch returned video shots ○ Judge whether a video shot satisfies the query (YES / NO vote) ● All system results per query/topic are pooled; NIST judges the top ranked results (rank 1 to ~200) at 100% and samples ranked results from 201 to 1000 to form a unique judged master set (see the pooling sketch below). ● The unique judged master set is divided into small pool files (~1000 shots / file) and given to the human assessors to watch and judge.
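The pooling step can be sketched as follows in Python; depths and sampling rate are placeholders, and NIST's actual stratified sampling procedure differs in detail.

import random

def build_judgment_pool(ranked_runs, full_depth=200, max_depth=1000,
                        sample_rate=0.2):
    """Illustrative pooling: include 100% of each run's results down to
    full_depth, plus a random sample of the results between full_depth
    and max_depth, then deduplicate into one master set for assessors."""
    pool = set()
    for run in ranked_runs:                # run = list of shot IDs, best first
        pool.update(run[:full_depth])
        tail = run[full_depth:max_depth]
        pool.update(random.sample(tail, int(len(tail) * sample_rate)))
    return pool

# Toy usage with two small overlapping "runs"
runs = [[f"shot_{i}" for i in range(300)],
        [f"shot_{i}" for i in range(100, 400)]]
print(len(build_judgment_pool(runs, full_depth=50, max_depth=300, sample_rate=0.1)))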
• 115. TRECVID evaluation framework (diagram): TRECVID participants run their video search algorithms (1..K) over the video collection for the given information needs (topics/queries); the ranked result sets are pooled; human assessors judge 100% of the top X ranked results and Y% of the results from rank X+1 to the bottom; the resulting ground truth is used to compute the evaluation scores.
  • 116. Evaluation settings at TRECVID ● Basic rules for the human assessors to follow include: ○ In topic description, "contains x" or words to that effect are short for "contains x to a degree sufficient for x to be recognizable as x to a human" . This means among other things that unless explicitly stated, partial visibility or audibility may suffice. ○ The fact that a segment contains video of physical objects representing the feature target, such as photos, paintings, models, or toy versions of the target, will NOT be grounds for judging the feature to be true for the segment. Containing video of the target within video may be grounds for doing so. ○ If the feature is true for some frame (sequence) within the shot, then it is true for the shot; and vice versa. This is a simplification adopted for the benefits it affords in pooling of results and approximating the basis for calculating recall. ○ When a topic expresses the need for x and y and ..., all of these (x and y and ...) must be perceivable simultaneously in one or more frames of a shot in order for the shot to be considered as meeting the need.
• 117. Evaluation metric at TRECVID ● Mean extended inferred average precision (xinfAP) across all topics ○ Developed* by Emine Yilmaz and Javed A. Aslam at Northeastern University ○ Estimates average precision surprisingly well using a surprisingly small sample of judgments from the usual submission pools (see next slide!) ○ More topics can be judged with the same effort ○ Extended infAP adds stratification to infAP (i.e., each stratum can be sampled with a different sample rate) * J.A. Aslam, V. Pavlu and E. Yilmaz, Statistical Method for System Evaluation Using Incomplete Judgments, Proceedings of the 29th ACM SIGIR Conference, Seattle, 2006.
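The following toy sketch only illustrates the stratified-sampling idea behind inferred measures, i.e., extrapolating from the judged samples within a stratum; it is not the xinfAP estimator itself (see the cited paper for the actual formulas).

def estimate_relevant_in_stratum(judged_relevant, judged_total, stratum_size):
    """Estimate the number of relevant shots in a whole stratum from the
    relevance rate observed in its judged sample."""
    if judged_total == 0:
        return 0.0
    return stratum_size * judged_relevant / judged_total

# e.g., 12 of 50 sampled shots in the rank 201-1000 stratum were relevant,
# so we estimate roughly 192 relevant shots among the 800 shots in the stratum
print(estimate_relevant_in_stratum(judged_relevant=12, judged_total=50,
                                   stratum_size=800))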
  • 118. InfAP correlation with AP Mean InfAP of 100% sample Mean InfAP of 100% sample Mean InfAP of 100% sample Mean InfAP of 100% sample MeanInfAPof80%sampleMeanInfAPof40%sample MeanInfAPof60%sampleMeanInfAPof20%sample Mean InfAP of 100% sample == AP
  • 119. Automatic vs. Interactive search in AVS Can we compare results from TRECVID (infAP) and VBS (unordered list)? ● Simulate AP from unordered list J. Lokoc, W. Bailer, K. Schoeffmann, B. Muenzer, G. Awad, On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017, IEEE Transactions on Multimedia, 2018
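One plausible way to obtain an AP-like number from an unordered interactive submission (an assumption for illustration; the cited TMM paper defines its own simulation) is to average AP over many random orderings of the submitted set.

import random

def simulated_ap(submitted_shots, relevant_shots, total_relevant, trials=1000):
    """Average AP over random permutations of an unordered submission set."""
    aps = []
    for _ in range(trials):
        order = list(submitted_shots)
        random.shuffle(order)
        hits, precision_sum = 0, 0.0
        for rank, shot in enumerate(order, start=1):
            if shot in relevant_shots:
                hits += 1
                precision_sum += hits / rank
        aps.append(precision_sum / total_relevant if total_relevant else 0.0)
    return sum(aps) / len(aps)

# Toy usage: 6 submitted shots, 4 of them relevant, 20 relevant shots overall
print(simulated_ap(["s1", "s2", "s3", "s4", "s5", "s6"],
                   {"s1", "s3", "s4", "s6"}, total_relevant=20))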
  • 120. Automatic vs. Interactive search in AVS Can we compare results from TRECVID (infAP) and VBS (unordered list)? ● Get precision at VBS recall level if ranked lists are available J. Lokoc, W. Bailer, K. Schoeffmann, B. Muenzer, G. Awad, On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017, IEEE Transactions on Multimedia, 2018
  • 121. 6. Lessons learned Collection of our observations from TRECVID and VBS
• 122. Video Search (at TRECVID) Observations One solution will not fit all. Investigations/discussions of video search must relate to the searcher's specific needs/capabilities/history and to the kinds of data being searched. The enormous and growing amounts of video require extremely large-scale approaches to video exploitation. Much of it has little or no metadata describing the content in any detail. ● 400 hrs of video are being uploaded to YouTube per minute (as of 11/2017) ● “Over 1.9 Billion logged-in users visit YouTube each month and every day people watch over a billion hours of video and generate billions of views.” (https://www.youtube.com/yt/about/press/)
• 123. Video Search (at TRECVID) Observations Multiple information sources (text, audio, video), each errorful, can yield better results when combined than when used alone… ● A human in the loop in search still makes an enormous difference. ● Text from speech via automatic speech recognition (ASR) is a powerful source of information, but: ○ Its usefulness varies by video genre ○ Not everything/everyone in a video is talked about or “in the news” ○ Audible mentions are often offset in time from visibility ○ Not all languages have good ASR ● Machine learning approaches to tagging ○ yield seemingly useful results against large amounts of data when training data is sufficient and similar to the test data (within domain) ○ but will they work well enough to be useful on highly heterogeneous video?
• 124. Video Search (at TRECVID) Observations ● Processing video using a sample of more than one frame per shot yields better results, but quickly pushes common hardware configurations to their limits ● TRECVID systems have been looking at combining automatically derived and manually provided evidence in search: ○ Internet Archive videos provide titles, keywords, descriptions ○ Where in the Panofsky hierarchy are the donors’ descriptions? If very personal, does that mean less useful for other people? ● Need observational studies of real searching of various sorts using current functionality and identifying unmet needs
• 125. VBS organization ● Test session before the event - problems with submission formats etc. ● Textual KIS tasks in a special private session ○ Textual tasks are not so attractive for the audience ○ Textual tasks are important and challenging ○ More time and tasks are needed to assess tool performance ● Visual and AVS tasks during the welcome reception ○ “Panem et circenses” (bread and circuses) - competitions are also intended to entertain the audience ○ Generally, more novice users can be invited to try the tools
• 126. VBS server ● Central element of the competition ○ Presents all tasks using the data projector ○ Presents scores in all categories ○ Presents feedback for current submissions ○ Additional logic (duplicates, flood of submissions, logs) ○ Also used at LSC 2018, with a revised ranking function ● Selected issue - the duplicate problem ○ The IACC dataset contains numerous duplicate videos with identical visual content (but e.g., different language) ○ A submission was regarded as wrong although the visual content was correct ○ One such case in 2018 had to be corrected after the event and changed the final ranking ○ Dataset design should explicitly avoid duplicates, or at least provide a list of duplicates; moreover, the server could provide more flexibility in changing judgements retrospectively
• 127. VBS server ● Issues with the simulation of KIS tasks ● How to “implant” visual memories? ○ Play the scene just once - users forget the scene ○ Play the scene in a loop - users exploit details -> overfitting to the task presentation ○ Play the scene in a loop + blur - colors can still be used, but users also forget important details ○ Play the scene several times in the beginning and then show a text description ● How to handle ambiguities of textual KIS? ○ Simple text - not enough details, ambiguous meaning of some sentences ○ Extending the text - simulation of a discussion - which details should be used first? ○ Still ambiguities -> teams should be allowed to ask some questions
• 128. AVS task and live judges at VBS ● Ambiguous task descriptions are problematic; it is hard to find a balance between too easy and too hard tasks ● Opinion of the user vs. opinion of the judge - who is right? ○ Users try to maximize their score - sometimes risking a wrong submission ○ Each shot is assessed just once -> the same “truth” for all teams ○ As with textual KIS - teams should be allowed to ask some questions ○ Teams have to read the TRECVID rules for human assessors! ● Calibration of multiple judges ○ With more than one live judge, calibration of opinions is necessary, even during the competition ● Balance the number of users for AVS tasks (ideally also for KIS tasks)
  • 129. VBS interaction logging ● Until 2017, there was no connection between VBS results and really used tool features to solve a task ○ VBS server received only team, video and frame IDs ○ Attempts to receive logs after competition failed ● Since 2018, an interaction log is a mandatory part of each task submission ○ How to obtain logs when the task is not solved? ○ Tools use variable modalities and interfaces - how to unify actions? ○ How to present and interpret logs? ○ How to log very frequent actions? ○ Time synchronization? ○ Log verification during test runs
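For illustration, a hypothetical unified log entry could look like the structure below; field names and categories are assumptions for this sketch, and the actual VBS logging format is defined by the organizers.

import json

# Hypothetical, simplified interaction log entry (not the official VBS schema)
log_entry = {
    "teamId": "TEAM_A",
    "memberId": 2,
    "timestamp": "2018-06-11T14:32:05.123Z",  # clocks must be synchronized
    "taskId": "KIS-V-03",
    "category": "browsing",                   # e.g., text, sketch, filter, browsing
    "action": "scroll_result_grid",
    "value": {"fromRank": 120, "toRank": 180},
}
print(json.dumps(log_entry, indent=2))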
• 130. VBS interaction logging - 8/9 teams sent logs! We can analyze both aggregated general statistics and user interactions in a given tool/task!
• 132. Where is the User in the Age of Deep Learning? ● The complexity of tasks where AI is superior to humans is obviously growing ○ Checkers -> Chess -> Go -> Poker -> DOTA? -> StarCraft? -> … -> guess user needs? ● Machine learning revolution - bigger/better training data -> better performance ● Can we collect big training data to support interactive video retrieval? ○ To cover an open world (how many concepts, actions, … do you need)? ○ To fit the needs of every user (how many contexts do we have)? ● Reinforcement learning?
• 133. Where is the User in the Age of Deep Learning? The driver has to get carefully through many situations with just basic equipment. Q: Is this also possible for video retrieval systems? Attribution: Can Pac Swire (away for a bit) The driver has to rely on themselves, but subsystems help (ABS, power steering, etc.) Attribution: Grand Parc - Bordeaux The driver just tells the car where to go. Attribution: Grendelkhan
  • 134. Where is the User in the Age of Deep Learning? ● Users already benefit from deep learning ○ HCI support - body motion, hand gestures ○ More complete and precise automatic annotations ○ Embeddings/representations for similarity search ○ 2D/3D projections for visualization of high-dimensional data ○ Relevance feedback learning (benefit from past actions) ● Promising directions ○ One-shot learning for fast inspection of new concepts ○ Multimodal joint embeddings ○ … ○ Just A Rather Very Intelligent System (J.A.R.V.I.S.) used by Tony Stark (Iron Man) ??
  • 135. Never say “this will not work!” ● If you have an idea how to solve interactive retrieval tasks - just try it! ○ Don’t be afraid your system is not specialized, you can surprise yourself and the community! ○ Paper submission in September 2019 for VBS at MMM 2020 in Seoul! ○ LSC submission in February 2019 for ICMR 2019 in Ottawa in June 2019. ○ The next TRECVID CFP will go out by mid-January, 2019. Lokoč, Jakub, Adam Blažek, and Tomáš Skopal. "Signature-based video browser." International Conference on Multimedia Modeling. Springer, Cham, 2014. Del Fabro, Manfred, and Laszlo Böszörmenyi. "AAU video browser: non- sequential hierarchical video browsing without content analysis." International Conference on Multimedia Modeling. Springer, Berlin, Heidelberg, 2012. Hürst, Wolfgang, Rob van de Werken, and Miklas Hoet. "A storyboard-based interface for mobile video browsing." International Conference on Multimedia Modeling. Springer, Cham, 2015.
  • 136. Acknowledgements This work has received funding from the European Union’s Horizon 2020 research and innovation programme, grant no. 761802, MARCONI. It was supported also by Czech Science Foundation project Nr. 17-22224S. Moreover, the work was also supported by the Klagenfurt University and Lakeside Labs GmbH, Klagenfurt, Austria and funding from the European Regional Development Fund and the Carinthian Economic Promotion Fund (KWF) under grant KWF 20214 u. 3520/26336/38165.