Interactive Video Search:
Where is the User
in the Age of Deep Learning?
Klaus Schoeffmann1, Werner Bailer2, Jakub Lokoc3, Cathal Gurrin4, George Awad5
Tutorial at ACM Multimedia 2018, Seoul
1…Klagenfurt University, Klagenfurt, Austria
2…JOANNEUM RESEARCH, Graz, Austria
3…Charles University, Prague, Czech Republic
4…Dublin City University, Dublin, Ireland
5…National Institute of Standards and Technology, Gaithersburg, USA
Recommended Readings
On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017. J. Lokoc, W. Bailer, K. Schoeffmann,
B. Muenzer, G. Awad, IEEE Transactions on Multimedia, 2018
Interactive video search tools: a detailed analysis of the video browser showdown 2015. Claudiu Cobârzan, Klaus Schoeffmann,
Werner Bailer, Wolfgang Hürst, Adam Blazek, Jakub Lokoc, Stefanos Vrochidis, Kai Uwe Barthel, Luca Rossetto. Multimedia
Tools Appl. 76(4): 5539-5571 (2017).
G. Awad, A. Butt, J. Fiscus, M. Michel, D. Joy, W. Kraaij, A. F. Smeaton, G. Quenot, M. Eskevich, R. Ordelman, G. J. F. Jones, and
B. Huet, “Trecvid 2017: Evaluating ad-hoc and instance video search, events detection, video captioning and hyperlinking,” in
Proceedings of TRECVID 2017, NIST, USA, 2017.
TOC 1
1. Introduction (20 min) [KS]
a. General introduction
b. Automatic vs. interactive video search
c. Where deep learning fails
d. The need for evaluation campaigns
2. Interactive video search tools (40 min) [JL]
a. Demo: VIRET (1st place at VBS2018)
b. Demo: ITEC (2nd place at VBS2018)
c. Demo: DCU Lifelogging Search Tool 2018
d. Other tools and open source software
3. Evaluation approaches (30 min) [KS]
a. Overview of evaluation approaches
b. History of selected evaluation campaigns
c. TRECVID
d. Video Browser Showdown (VBS)
e. Lifelog Search Challenge (LSC)
TOC 2
4. Task design and datasets (30 min) [KS]
a. Task types (known item search, retrieval, etc.)
b. Trade-offs: modelling real-world tasks and controlling conditions
c. Data set preparation and annotations
d. Available data sets
5. Evaluation procedures, results and metrics (30 min) [JL]
a. Repeatability
b. Modelling real-world tasks and avoiding bias
c. Examples from evaluation campaigns
6. Lessons learned from evaluation campaigns (20 min) - [JL]
a. Interactive exploration or query-and-browse?
b. How much does deep learning help in interactive settings?
c. Future challenges
7. Conclusions
a. Where is the user in the age of deep learning?
1. Introduction
Let’s Look Back a Few Years...
[Marcel Worring et al., "Where Is the User in Multimedia Retrieval?", IEEE Multimedia, Vol. 19, No. 4, Oct.-Dec. 2012, pp. 6-10]
Let’s Look Back a Few Years...
● A few statements/findings:
○ Many solutions are developed without having an explicitly defined real-world problem to
solve.
○ Performance measures focus on the quality of how we answer a query.
○ MAP has become the primary target for many researchers.
○ It is certainly weird to use MAP alone when talking about users employing multimedia
retrieval to solve their search problems.
○ As a consequence of MAP’s dominance, the field has shifted its focus too much toward
answering a query.
“Thus a better understanding of what users actually want and do
when using multimedia retrieval is needed.”
[Marcel Worring et al., "Where Is the User in Multimedia Retrieval?", IEEE Multimedia, Vol. 19, No. 4, Oct.-Dec. 2012, pp. 6-10]
How Would You Search for These Images?
How to describe the special atmosphere, the artistic content, the mood?
by marfis75
“An image tells a thousand words.”
How Would You Search for This Video Scene?
What Users Might Want...
Shortcomings of Fully Automatic Video Retrieval
● Works well if
○ Users can properly describe their needs
○ System understands search intent of users
○ There is no polysemy and no context variation
○ Content features can sufficiently describe visual content
○ Computer vision (e.g., CNN) can accurately detect semantics
● Unfortunately, for real-world problems rarely true!
“Query-and-browse results” approach
Performance of Video Retrieval
● Typically based on MAP
○ Computed for a specific query set and dataset
○ Results are still quite low (even in the age of deep learning!)
○ Also, results can vary heavily from one dataset to another, and from one query set to another
○ Example: TRECVID Ad-hoc Video Search (AVS) – automatic runs only
               2016   2017   2018
Teams             9      8     10
Runs             30     33     33
Min xInfAP        0  0.026  0.003
Max xInfAP    0.054  0.206  0.121
Median xInfAP 0.024  0.092  0.058
Dataset: IACC.3, 30 queries per year
Deep Learning Can Fail Easily
[J. Su, D.V. Vargas, and K. Sakurai. One pixel attack for fooling neural networks. 2018. arXiv]
How to deal with noisy
data/videos?
Deep Learning Can Fail Easily
Output of YOLO v2
Andrew Ng, in his talk "Artificial Intelligence is the New Electricity":
"Anything a typical human can do with < 1 s of thought we can probably now or soon automate with AI"
Deep Learning Can Fail Easily
Nguyen A, Yosinski J, Clune J. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In Computer Vision and
Pattern Recognition (CVPR '15), IEEE, 2015
The Power of Human Computation
Example from the Video Browser Showdown 2015:
System X: shot and scene detection, concept
detection (SIFT, VLAD, CNNs), similarity search.
System Y: tiny thumbnails only, powerful user.
Outperformed system X and was finally ranked 3rd!
Moumtzidou, Anastasia, et al. "VERGE in VBS 2017." International Conference on Multimedia
Modeling. Springer, Cham, 2017.
Hürst, Wolfgang, Rob van de Werken, and Miklas Hoet. "A storyboard-based interface for mobile
video browsing." International Conference on Multimedia Modeling. Springer, Cham, 2015.
Interactive Video Retrieval Approach
● Assume a smart and interactive user
○ That knows about the challenges and shortcomings of simple querying
○ But might also know how to circumvent them
○ Could be a digital native!
● Give him/her full control over the search process
○ Provide many query and interaction features
■ Querying, browsing, navigation, filtering, inspecting/watching
● Assume an iterative/exploratory search process
○ Search - Inspect - Think - Repeat
○ “Will know it when I see it”
○ Could include many iterations!
○ Instead of “query-and-browse results”
What Users Might Need...
(Screenshot annotations: concept search, browsing features, motion sketch, search history)
Hudelist, Marco A., Christian Beecks, and Klaus
Schoeffmann. "Finding the chameleon in your
video collection." Proceedings of the 7th
International Conference on Multimedia
Systems. ACM, 2016.
Typical Query Types of Video Retrieval Tools
● Query-by-text
○ Enter keywords to match with available or extracted text (e.g., metadata, OCR, ASR, concepts, objects...)
● Query-by-concept
○ Show content for a specific class/category from concept detection (e.g., from ImageNet)
● Query-by-example
○ Provide example image/scene/sound
● Query-by-filtering
○ Filter content by some metadata or content feature (time, color, edge, motion, …)
● Query-by-sketch
○ Provide sketch of image/scene
● Query-by-dataset-example
○ Look for similar but other results
● Query-by-exploration
○ Start by looking around / browsing
○ Needs appropriate visualization
● Query-by-inspection
○ Inspect single clips, navigate
Search in multimedia content (particularly video) is
a highly interactive process!
Users want to look around, try different query features,
inspect results, refine queries, and start all over again!
Automatic
Interactive
Evaluation of Interactive Video Retrieval
● Interfaces are inherently developed for human users
● Every user might be different
○ Different culture, knowledge, preferences, experiences, ...
○ Even the same user at a different time
● Video search interfaces need to be evaluated with real users...
○ No simulations!
○ User studies and campaigns (TRECVID, MediaEval, VBS, LSC)!
○ Find out how well users perform with a specific system
● ...and with real data!
○ Real videos “in the wild” (e.g., IACC.1 and V3C dataset)
○ Actual queries that would make sense in practice
○ Comparable evaluations (same data, same conditions, etc.)
International competitions
Datasets
Only same dataset, query, time, room/condition, ...
...allows for true comparative evaluation!
Where is the User in the Age of Deep Learning?
2. Interactive
Video Search Tools
Common architecture, components and top ranked tools
What are the basic video preprocessing steps?
What models are used?
Where does interactive search help?
Common Architecture
Common Architecture - Temporal Segmentation
M. Gygli. Ridiculously Fast Shot Boundary Detection with Fully
Convolutional Neural Networks. https://arxiv.org/pdf/1705.08214.pdf
1. Compute a score based on a distance of frames
2. Threshold-based decision (fixed/adaptive)
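A minimal sketch of these two steps, assuming OpenCV is available; the video path, the histogram descriptor and the fixed threshold are illustrative choices, not the method of a particular tool.

import cv2

def detect_shot_boundaries(path, threshold=0.5):
    # 1. Compute a score based on a distance of consecutive frames
    # 2. Threshold-based decision (fixed threshold here; adaptive is also common)
    cap = cv2.VideoCapture(path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Color histogram as a cheap frame descriptor (learned features work too)
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            dist = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if dist > threshold:          # large jump between frames -> cut candidate
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries

boundaries = detect_shot_boundaries("video.mp4")   # hypothetical video file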
Common Architecture - Semantic Search
Classification and embedding by
popular Deep CNNs
AlexNet (A. Krizhevsky et al., 2012)
GoogLeNet (Ch. Szegedy et al., 2015)
ResNet (K. He et al., 2015)
NasNet (B. Zoph et al., 2018)
...
Object detectors appear too (YOLO, SSD)
Joint embedding models? VQA?
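As an illustration of embedding-based semantic search, here is a sketch assuming PyTorch/torchvision and a pretrained ResNet; the keyframe and query file names are placeholders, and this is not the pipeline of any specific tool. Keyframes and a query image are mapped into the same feature space and ranked by cosine similarity.

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.resnet50(pretrained=True)
model.fc = torch.nn.Identity()   # drop the classifier head, keep the embedding
model.eval()

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

def embed(path):
    with torch.no_grad():
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        return torch.nn.functional.normalize(model(x), dim=1)

keyframes = ["kf_001.jpg", "kf_002.jpg"]           # hypothetical keyframe files
index = torch.cat([embed(p) for p in keyframes])    # one embedding row per keyframe
query = embed("query.jpg")                          # query-by-example image
scores = (index @ query.T).squeeze(1)               # cosine similarity (unit vectors)
ranking = [keyframes[i] for i in scores.argsort(descending=True).tolist()]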
Common Architecture - Sketch based Search
Sketches from memory
Just part of the scene
Edges often do not match
Colors often do not match
=> invariance needed
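A minimal sketch of a tolerant color-sketch matcher in the spirit of the points above, assuming OpenCV/NumPy; the 4x3 grid, the Lab color space and the file names are illustrative assumptions. Averaging colors over coarse cells discards the fine detail that sketches from memory get wrong.

import cv2
import numpy as np

GRID = (4, 3)   # coarse 4x3 grid of average colors

def color_signature(path):
    img = cv2.imread(path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2LAB).astype(np.float32)
    # Averaging inside each cell gives tolerance to imprecise sketches;
    # Lab distances roughly follow perceived color differences
    return cv2.resize(img, GRID, interpolation=cv2.INTER_AREA).reshape(-1, 3)

def sketch_distance(sig_a, sig_b):
    # Mean per-cell color distance; lower is more similar
    return float(np.mean(np.linalg.norm(sig_a - sig_b, axis=1)))

query = color_signature("sketch.png")               # hypothetical sketch image
ranking = sorted(["kf_001.jpg", "kf_002.jpg"],
                 key=lambda kf: sketch_distance(query, color_signature(kf)))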
Common Architecture - Limits
● Used ranking models have their limits
○ Missed frames
○ Wrong annotation
○ Inaccurate similarity function
● Still, finding a shot of a class is often easy
(see later), but finding one particular shot,
or all shots of a class, is much harder
T. Soucek. Known-Item Search in Image Datasets Using
Automatically Detected Keywords. BC thesis, 2018.
Common Architecture at VBS - Interactive Search
Hudelist & Schoeffmann. An Evaluation of Video Browsing
on Tablets with the ThumbBrowser. MMM2017
Goeau et al., Table of Video Content, ICME 2007
Aspects of Flexible Interactive Video Search
VIRET tool (Winner of VBS 2018, 3rd at LSC 2018)
Filters
Query by text
Query by color
Query by image
Video player
Top ranked frames by a query
Representative frames from the selected video
Frame-based retrieval system with temporal context visualization. Focus on simple interface!
Jakub Lokoc, Tomas Soucek, Gregor Kovalcik: Using an Interactive Video Retrieval Tool for LifeLog Data. LSC@ICMR 2018: 15-19, ACM
Jakub Lokoc, Gregor Kovalcik, Tomas Soucek: Revisiting SIRET Video Retrieval Tool. VBS@MMM 2018: 419-424, Springer
VIRET Tool (Winner of VBS 2018)
ITEC Tool
Primus, Manfred Jürgen, et al. "The ITEC Collaborative Video Search System at the Video Browser Showdown 2018." International Conference on Multimedia Modeling.
Springer, Cham, 2018.
ITEC tool (2nd at VBS 2018 and LSC 2018)
https://www.youtube.com/watch?v=CA5kr2pO5b
LSC (Geospatial Browsing)
W Hürst, K Ouwehand, M Mengerink, A Duane and C Gurrin. Geospatial Access to Lifelogging Photos in Virtual Reality. The Lifelog Search Challenge 2018 at ACM ICMR 2018.
LSC (Interactive Video Retrieval)
J. Lokoč, T. Souček and G. Kovalčík. Using an Interactive Video Retrieval Tool for LifeLog Data. The Lifelog Search Challenge 2018 at ACM ICMR 2018.
(3rd highest performing system, but the same system won VBS 2018)
LSC (LiveXplore)
A Leibetseder, B Muenzer, A Kletz, M Primus and K Schöffmann. liveXplore at the Lifelog Search Challenge 2018. The Lifelog Search Challenge 2018 at ACM ICMR 2018.
(2nd highest performing system)
VR Lifelog Search Tool (winner of LSC 2018)
Large lifelog archive with time-limited KIS topics
Multimodal (visual concept and temporal) query formulation
Ranked list of visual imagery (image per minute)
Gesture-based manipulation of results
A Duane, C Gurrin & W Hürst. Virtual Reality Lifelog Explorer for the Lifelog Search Challenge at ACM ICMR 2018. The Lifelog Search Challenge 2018 at ACM ICMR 2018.
Top Performing System.
https://www.youtube.com/watch?v=aocN9eOuRv0
vitrivr (University of Basel)
● Open-Source content-based multimedia retrieval stack
○ Supports images, music, video and 3D-models concurrently
○ Used for various applications both in and outside of academia
○ Modular architecture enables easy extension and customization
○ Compatible with all major operating systems
○ Available from vitrivr.org
● Participated several times in VBS (originally as IMOTION)
[Credit: Luca Rossetto]
vitrivr (University of Basel)
● System overview
[Credit: Luca Rossetto]
vitrivr (University of Basel)
[Credit: Luca Rossetto]
vitrivr (University of Basel)
[Credit: Luca Rossetto]
vitrivr (University of Basel)
[Credit: Luca Rossetto]
3. Evaluation Approaches
Overview of Evaluation Approaches
● Qualitative user study/survey
○ Self report: ask users about their experience with the tool, thinking aloud tests, etc.
○ Using psychophysiological measurements (e.g., electrodermal activity - EDA)
● Log-file analysis
○ Analyze server and/or client-side interaction patterns
○ Measure time needed for certain actions, etc.
● Question answering
○ Ask questions about content (open, multiple choice) to assess which content users found
● Indirect/task-based evaluation (Cranfield paradigm)
○ Pose certain tasks, measure the effectiveness of solving the task
○ Quantitative user study with many users and trials
○ Open competition, as in VBS, LSC, and TRECVID
Properties of Evaluation Approaches
● Availability and level of detail of ground truth
○ None (e.g., questionnaires, logs)
○ Detailed and complete (e.g., retrieval tasks)
● Effort during experiments
○ Low (automatic check against ground truth)
○ Moderate (answers need to be checked by a human, e.g., live judges)
○ High (observation of or interview with participants)
● Controlled conditions
○ All users in same room with same setup (typical user-study)
vs. participants via online survey
● Statistical tests!
○ We can only conclude that one interactive tool is better than
another if there is statistically significant evidence
○ Tests like ANOVA, t-tests, Wilcoxon signed-rank tests, …
○ Consider the prerequisites of the specific test (e.g., normal distribution)
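A minimal example of such a test, assuming SciPy and hypothetical paired per-task solve times for two tools; the normality check decides between a paired t-test and the Wilcoxon signed-rank test.

from scipy.stats import shapiro, ttest_rel, wilcoxon

# Hypothetical paired solve times (seconds) of the same users/tasks with two tools
tool_a = [42.1, 75.3, 33.0, 120.5, 61.2, 88.7, 54.3, 97.0]
tool_b = [55.4, 80.1, 41.2, 140.0, 59.8, 95.5, 70.2, 110.3]

diffs = [a - b for a, b in zip(tool_a, tool_b)]
_, p_norm = shapiro(diffs)                 # check the normality prerequisite first
if p_norm > 0.05:
    stat, p = ttest_rel(tool_a, tool_b)    # paired t-test
else:
    stat, p = wilcoxon(tool_a, tool_b)     # Wilcoxon signed-rank test
print("significant at alpha=0.05" if p < 0.05 else "no significant difference", p)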
Example: Comparing Tasks and User Study
● Experiment compared
○ Question answering
○ Retrieval tasks
○ User study with questionnaire
● Materials
○ Interactive search tool with keyframe visualisation
○ TRECVID BBC rushes data set (25 hrs)
○ Questionnaire adapted from TRECVID 2004
○ 19 users, doing at least 4 tasks
W. Bailer and H. Rehatschek, Comparing Fact Finding Tasks and User Survey for Evaluating a Video Browsing Tool. ACM Multimedia 2009.
Example: Comparing Tasks and User Study
● TVB1 I was familiar with the topic of the query.
● TVB3 I found that it was easy to find clips that are relevant.
● TVB4 For this topic I had enough time to find enough clips.
● TVB5 For this particular topic the tool interface allowed me to do browsing efficiently.
● TVB6 For this particular topic I was satisfied with the results of the browsing.
W. Bailer and H. Rehatschek, Comparing Fact Finding Tasks and User Survey for Evaluating a Video Browsing Tool. ACM Multimedia 2009.
Using Electrodermal Activity (EDA)
Measuring EDA during retrieval tasks (A, B, C, D) with an interactive search tool, 14 participants
C. Martinez-Peñaranda, et al., A Psychophysiological Approach to the Usability Evaluation of a Multi-view Video Browsing Tool,” MMM 2013.
History of Selected Evaluation Campaigns
● Evaluation campaigns for video analysis and search started in early 2000s
○ Most well-known are TRECVID and MediaEval (previously ImageCLEF)
○ Both spin-offs from text retrieval benchmarks
● Several campaigns include tasks that are relevant to video search
● Most tasks are designed to be fully automatic
● Some allow at least interactive submissions as an option
○ Most submissions are usually still for the automatic type
● Since 2007, live evaluations with audience have been organized at major
international conferences
○ VideOlympics, VBS, LSC
History of Selected Evaluation Campaigns
TRECVID
● Workshop series (2001 – present) → http://trecvid.nist.gov
● Started as a track in the TREC (Text REtrieval Conference) evaluation
benchmark.
● Became an independent evaluation benchmark in 2003.
● Focus: content-based video analysis, retrieval, detection, etc.
● Provides data, tasks, and uniform, appropriate scoring procedures
● Aims for realistic system tasks and test collections:
○ Unfiltered data
○ Focus on relatively high-level functionality (e.g. interactive search)
○ Measurement against human abilities
● Forum for the
○ exchange of research ideas and for
○ the discussion of research methodology – what works, what doesn’t, and why
TRECVID Philosophy
● TRECVID is a modern example of the Cranfield tradition
○ Laboratory system evaluation based on test collections
● Focus on advancing the state of the art from evaluation results
○ TRECVID’s primary aim is not competitive product benchmarking
○ Experimental workshop: sometimes experiments fail!
● Laboratory experiments (vs. e.g., observational studies)
○ Sacrifice operational realism and broad scope of conclusions
○ For control and information about causality – what works and why?
○ Results tend to be narrow, at best indicative, not final
○ Evidence grows as approaches prove themselves repeatedly, as part of various systems,
against various test data, over years
TRECVID Datasets
HAVIC
Soap opera (since 2013)
Social media
(since 2016)
Security cameras
(since 2008)
Teams actively participated (2016-2018)
INF CMU; Beijing University of Posts and Telecommunication; University Autonoma de Madrid; Shandong University; Xian JiaoTong University Singapore
kobe_nict_siegen Kobe University, Japan; National Institute of Information and Communications Technology, Japan; University of Siegen, Germany
UEC Dept. of Informatics, The University of Electro-Communications, Tokyo
ITI_CERTH Information Technology Institute, Centre for Research and Technology Hellas
ITEC_UNIKLU Klagenfurt University
NII_Hitachi_UIT National Institute Of Informatics.; Hitachi Ltd; University of Information Technology (HCM-UIT)
IMOTION University of Basel, Switzerland; University of Mons, Belgium; Koc University, Turkey
MediaMill University of Amsterdam ; Qualcomm
Vitrivr University of Basel
Waseda_Meisei Waseda University; Meisei University
VIREO City University of Hong Kong
EURECOM EURECOM
FIU_UM Florida International University, University of Miami
NECTEC National Electronics and Computer Technology Center NECTEC
RUCMM Renmin University of China
NTU_ROSE_AVS ROSE LAB, NANYANG TECHNOLOGICAL UNIVERSITY
SIRET SIRET Department of Software Engineering, Faculty of Mathematics and Physics, Charles University
UTS_ISA University of Technology Sydney
VideOlympics
● Run the same year’s TRECVID
search tasks live in front of
an audience
● Organized at CIVR 2007-2009
Photos: Cees Snoek, https://www.flickr.com/groups/civr2007/
Video Browser Showdown (VBS)
● Video search competition (annually at MMM)
○ Inspired by VideOlympics
○ Demonstrates and evaluates state-of-the-art
interactive video retrieval tools
○ Also, entertaining event during welcome reception at MMM
● Participating teams solve retrieval tasks
○ Known-item search (KIS) tasks - one result - textual or visual
○ Ad-hoc video search (AVS) tasks - many results - textual
○ In a large video archive (originally only within single ~60 min videos)
● Systems are connected to the VBS Server
○ Presents tasks in live manner
○ Evaluates submitted results of teams (penalty for false submissions)
First VBS in Klagenfurt, Austria
(only search in a single video)
Video Browser Showdown (VBS)
2012: Klagenfurt
11 teams
KIS, single video (v)
2013: Huangshan
6 teams
KIS, single video (v+t)
2014: Dublin
7 teams
KIS, single video
and 30h archive (v+t)
2015: Sydney
9 teams
KIS, 100h archive (v+t)
2016: Miami
9 teams
KIS, 250h archive (v+t)
2017: Reykjavik
6 teams
KIS, 600h archive (v+t)
AVS, 600h archive (t)
2018: Bangkok
9 teams
KIS, 600h archive (v+t)
AVS, 600h archive (t)
2019: Thessaloniki
6 teams
KIS, 1000h archive (v+t)
AVS, 1000h archive (t)
Video Browser Showdown (VBS)
VBS Server:
• Presents queries
• Shows remaining time
• Computes scores
• Shows statistics/ranking
Video Browser Showdown (VBS)
https://www.youtube.com/watch?v=tSlYFNlsn8U&t=140
Lifelog Search Challenge (LSC 2018)
● New (annual) search challenge at ACM ICMR
● Focus on a life retrieval challenge
o from multimodal lifelog data
o Motivated by the fact that ever larger personal data
archives are being gathered, and the advent of AR
technologies and the veracity of data mean that archives
of life experiences are likely to become more
commonplace.
● To be useful, the data should be searchable…
o and for lifelogs, that means interactive search
Lifelog Search Challenge (Definition)
Dodge and Kitchin (2007) refer to lifelogging as
“a form of pervasive computing, consisting of a
unified digital record of the totality of an
individual’s experiences, captured multi-modally
through digital sensors and stored permanently
as a personal multimedia archive”.
Lifelog Search Challenge (Motivation)
Lifelog Search Challenge (Lifelogging)
Lifelog Search Challenge (Data)
One month archive of multimodal lifelog
data, extracted from NTCIR-13 Lifelog
collection, including:
○ Wearable camera images at a rate of
3-5 / minute & concept annotations.
○ Biometrics
○ Activity logs
○ Media consumption
○ Content created/consumed
u1_2016-08-15_050922_1, 'indoor',
0.991932, 'person', 0.9719478,
'computer', 0.309054524
Lifelog Search Challenge (One Minute)
<minute id="496">
<location>
<name>Home</name>
</location>
<bodymetrics>
<calories>2.8</calories>
<gsr>7.03E-05</gsr>
<heart-rate>94</heart-rate>
<skin-temp>86</skin-temp>
<steps>0</steps>
</bodymetrics>
<text>author,1,big,2,dout,1,revis,1,think,1,while,1</text>
<images>
<image>
<image-id>u1_2016-08-15_050922_1</image-id>
<image-path>u1/2016-08-15/20160815_050922_000.jpg</image-path>
<annotations>'indoor', 0.985, 'computer', 0.984, 'laptop', 0.967, 'desk', 0.925</annotations>
</image>
</images>
</minute>
Lifelog Search Challenge (Topics)
<Description timestamp="0">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background.</Description>
<Description timestamp="30">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant
on my right side.</Description>
<Description timestamp="60">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant
on my right side. There are keys on the table in front of me and you can see the cafe sign on the left side. I walked to the cafe and it took less than two minutes to
get there.</Description>
<Description timestamp="90">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant
on my right side. There are keys on the table in front of me and you can see the cafe sign on the left side. I walked to the cafe and it took less than two minutes to
get there. My colleague in the foreground is wearing a white shirt and drinking coffee from a red paper cup.</Description>
<Description timestamp="120">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant
on my right side. There are keys on the table in front of me and you can see the cafe sign on the left side. I walked to the cafe and it took less than two minutes to
get there. My colleague in the foreground is wearing a white shirt and drinking coffee from a red paper cup. Immediately after having the coffee, I drive to the
shop.</Description>
<Description timestamp="150">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant
on my right side. There are keys on the table in front of me and you can see the cafe sign on the left side. I walked to the cafe and it took less than two minutes to
get there. My colleague in the foreground is wearing a white shirt and drinking coffee from a red paper cup. Immediately after having the coffee, I drive to the
shop. It is a Monday.</Description>
Temporally enhanced topic descriptions that get more detailed (easier)
every thirty seconds. The topics have one or only a few relevant items in the collection.
Lifelog Search Challenge 2018 (Six Teams)
4. Task Design and Datasets
Task types, trade-offs, datasets, annotations
Task Types: Introduction
● Searching for content can be modelled as different task types
○ Choice impacts dataset preparation, annotations, evaluation methods
○ and the way to run the experiments
● Some of the task types here have fully automatic variants…
○ out of scope, but may serve as baseline to compare to
● Tasks can be categorized by the target and the formulation of the query
○ Target: particular item vs. set or class
■ only one target item in the data set, or
■ multiple occurrences of an instance or of a class of relevant items/segments
○ Definition of query
■ example, given in a specific modality
■ precise definition vs. fuzzy idea
Task Types (at Campaigns): Overview
(Figure: task types arranged by how clear the search intent is; the query is given as an
example, visually, textually, abstractly, or not at all. VIS tasks, known-item search,
AVS tasks, and everyday web video search ("this is how I use web video search") span this
spectrum over a given video dataset.)
What is the role of similarity for KIS at Video Browser Showdown? SISAP'18, Peru
Task Type: Visual Instance Search
● User holds a digital representation
of a relevant example of the needed
information
● Example or its features can be sent
to system
● User does not need to translate
example into query representation
● e.g., trademark/logo detection
Task Types: Known Item Search (KIS)
● User sees/hears/reads a representation
○ Target item is described or presented
● Used in VBS & LSC
● One-target semantics
○ Representation of exactly one relevant item/segment in the dataset
● Models user’s (partly faded) memories
○ user has a memory of content to be found, might be fuzzy
● User must translate representation to provided query methods
○ The complexity of this translation depends significantly on the modality
■ e.g., visual is usually easier than textual, which leaves more room for interpretation
○ Relation of/to content is important too
■ e.g. searching in own life log media vs. searching in media
collection on the web
“on a busy street”
Task Types: Ad-hoc Search
● User sees/hears/reads a representation of the needed information
○ Target item is described or presented
● Many-targets semantics
○ Representation of a broader set/class of relevant items/segments
○ cf. TRECVID AVS task
● Models user’s rough memories
○ user has only a memory of the type of relevant content, not about details
● Similar issues of translating the representation as for KIS
○ but due to the broader set of relevant items, the correct interpretation of textual information is a less critical
issue
● Raises issues of what is considered within/without scope of a result set
○ e.g., partly visible, visible on a screen in the content, cartoon/drawing versions, …
○ TRECVID has developed guidelines for annotation of ground truth
Task Types: Exploration
● User does not start from a clear idea/query
of the information need
○ No concrete query, just inspects dataset
○ Browsing and exploring may lead to identifying useful
content
● Reflects a number of practical situations,
but very hard to evaluate
○ User simply cannot describe the content
○ User does not remember content but would recognize it
○ Content inspection for the sake of interest
○ Digital forensics
● No known examples of such tasks in
benchmarking campaigns due to the difficulties
with evaluation
Demo: https://www.picsbuffet.com/
Barthel, Kai Uwe, Nico Hezel, and Radek Mackowiak. "Graph-based browsing for large
video collections." International Conference on Multimedia Modeling. Springer, Cham, 2015.
Task Design is About Trade-offs: Aspects to consider
Tasks shall
○ model real-world content search problems
■ in order to assess whether tools are usable for these problems
○ set controlled conditions
■ to enable reliable assessment
○ be repeatable
■ to compare results from different evaluation sessions
○ avoid bias towards certain features or query methods
Trade-off examples (real-world modelling vs. controlled evaluation):
○ many real-world problems involve very fuzzy information needs, vs. well-defined queries are best suited for evaluation
○ users remember more about the scene when they start looking through examples, vs. information in the task should be provided at defined points in time
○ during evaluation sessions, relevant shots may be discovered and the ground truth updated, vs. for repeatable evaluation, a fixed ground truth set is desirable
○ although real-world tasks may involve time pressure, it would be best to measure the time until the task is solved, vs. time limits are needed in evaluation sessions for practical reasons
Task Selection (KIS @ VBS)
● Known duplicates:
○ List of known (partial) duplicates from matching metadata and file size
○ Content-based matches
● Uniqueness inside same and similar content:
○ Ensure unambiguous target
○ May be applied to sequence of short shots rather than single shot
● Complexity of segment:
○ Rough duration of 20s
○ Limited number of shots
● Describe-ability:
○ Textual KIS requires segments that can be described with limited amount of text
(less shots, salient location or objects, etc.)
VBS KIS Task Selection - Examples
● KIS Visual (video 37756, frame 750-1250)
○ Short shots, varying content - hard to describe as text, but
unique sequence
● KIS Textual (video 36729, frame 4047-4594)
○ @0 sec: “Shots of a factory hall from above. Workers
transporting gravel with wheelbarrows. Other workers
putting steel bars in place.”
○ @100 sec: “The hall has Cooperativa Agraria written in red
letters on the roof.”
○ @200 sec: “There are 1950s style American cars and
trucks visible in one shot.”
Presenting Queries (VBS)
● Example picture?
○ allow taking pictures of visual query clips?
● Visual
○ Play query once
■ one chance to memorize, but no chance to check possibly
relevant shot against query — in real life, one cannot visually
check, but one does not forget what one knew at query time
○ Repeat query but blur increasingly
■ basic information is there, but not possible to check details
● Textual
○ For most users, memory is also visual
○ Simulate case where retrieval expert is asked to find content
■ expert could ask questions
○ Provide incremental details about the scene (but initial piece
of information must already be unambiguous for KIS)
Task Participants
● Typically developers of tools participate in evaluation campaigns
○ They know how to translate information requests into queries
○ Knowledge of user has huge impact on performance that can be achieved
● “Novice session”
○ Invite members from the audience to use the tools, after a brief introduction
○ Provides insights about usability and complexity of tool
○ In real use cases, users are domain experts rather than retrieval experts, thus this condition
is important to test
○ Selection of novices is an issue for comparing results
○ Question of whether/how scores of expert and novice tasks shall be combined
Real-World Datasets
● Research needs reproducible results
○ standardized and free datasets are necessary
● One problem with many datasets:
○ the current state of web video in the wild is not, or no longer, accurately represented by them
[Rossetto & Schuldt]
● Hence, we also need datasets that model the real world
○ One such early effort is the V3C dataset (see later)
Rossetto, L., & Schuldt, H. (2017). Web video in numbers-an analysis of web-video metadata. arXiv preprint arXiv:1707.01340.
Videos in the Wild
Age-distribution of common video collections vs what is found in the wild
Rossetto, L., & Schuldt, H. (2017). Web video in numbers-an analysis of web-video metadata. arXiv preprint arXiv:1707.01340.
Videos in the Wild
Duration-distribution of common video collections vs what is found in the wild
Rossetto, L., & Schuldt, H. (2017). Web video in numbers-an analysis of web-video metadata. arXiv preprint arXiv:1707.01340.
Dataset Preparation and Annotations
● Data set = content + annotations for specific problem
● Today, content is everywhere
● Annotations are still hard to get
○ External data (e.g., archive documentation) is often not available at sufficient granularity and
is rarely time-indexed
○ Creation by experts is prohibitively costly
● Approaches
○ Crowdsourcing (with different notions of “crowd” impacting quality)
○ Reduce amount of annotations needed
○ Generate data set and ground truth
Collaborative Annotation
Initiatives from TRECVID participants 2003-2013
○ http://mrim.imag.fr/tvca/
○ Concept annotations for high-level feature extraction/semantic indexing tasks
○ As data sets grew in size, the percentage of the content that could be annotated declined
○ Use of active learning to select samples where annotation brings highest benefit
S. Ayache and G. Quénot, "Video Corpus Annotation using Active Learning", ECIR 2008.
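A minimal sketch of such an active-learning selection step, assuming scikit-learn; the features, labels and batch size are placeholders, and this is a simplification of the annotation system described by Ayache & Quénot, not their implementation.

import numpy as np
from sklearn.linear_model import LogisticRegression

def select_for_annotation(X_labeled, y_labeled, X_unlabeled, batch_size=50):
    # Train on the labels collected so far ...
    clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    probs = clf.predict_proba(X_unlabeled)[:, 1]
    # ... and propose the most uncertain shots (scores closest to 0.5),
    # where an annotation brings the highest expected benefit
    uncertainty = -np.abs(probs - 0.5)
    return np.argsort(uncertainty)[-batch_size:]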
Crowdsourcing with the General Public
● Use platforms like Amazon Mechanical Turk to collect data
○ Main issue, however, is that annotations are noisy and unreliable
● Solutions
○ Multiple annotations and majority votes
○ Involve tasks that help assess the confidence in a specific worker
■ e.g., asking easy questions first, to verify facts about image
○ More sophisticated aggregation strategies
● MediaEval ran tasks in 2013 and 2014
○ Annotation of fashion images and timed comments about music
B. Loni, M. Larson, A. Bozzon, L. Gottlieb, Crowdsourcing for Social Multimedia at MediaEval 2013: Challenges, Data set, and Evaluation, MediaEval WS Notes, 2013.
K. Yadati, P. S.N. Shakthinathan Chandrasekaran Ayyanathan, M. Larson, Crowdsorting Timed Comments about Music: Foundations for a New Crowdsourcing Task, MediaEval WS Notes, 2014.
Pooling
● Exhaustive relevance judgements
are costly for large data sets
● Annotate pool of top k results
returned from participating systems
● Pros
○ Efficient
○ Results are correct for all participants, not
an approximation
● Cons
○ Annotations can only be done after
experiment
○ Repeating the experiment with
new/updated systems requires updating the
annotation (or getting approximate results)
Sri Devi Ravana et al., Document-based approach to improve the accuracy of pairwise comparison in evaluating information retrieval systems, ASLIB J. Inf. Management, 67(4), 2015.
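A minimal sketch of pool construction, assuming each submitted run is simply a ranked list of shot IDs for one topic; the run contents below are hypothetical. Only the union of the top-k results of all runs is sent to assessors.

def build_pool(runs, k=100):
    """runs: list of ranked lists of shot IDs (one list per submitted run)."""
    pool = set()
    for ranked_list in runs:
        pool.update(ranked_list[:k])     # everything outside the pool stays unjudged
    return pool

# Hypothetical usage with three runs over the same topic
runs = [["s12", "s7", "s99", "s3"], ["s7", "s45", "s12"], ["s3", "s8"]]
to_judge = build_pool(runs, k=2)         # {'s12', 's7', 's45', 's3', 's8'}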
Live Annotation
● Assessment of incoming results during competition
● Used in VBS 2017-2018
● Addresses issues of incomplete or missing ground truth
○ e.g., ground truth created using pooling, or new queries
● Pros
○ Provide immediate feedback
○ Avoid biased results from ground truth pooled from other systems
● Cons
○ Done under time pressure
○ Not possible to review other similar cases - may cause inconsistency in decisions
○ Multi-annotator agreement would be needed (impacts decision time and number of annotators needed)
Live Annotation – Example from VBS 2018
● 1,848 shots judged live
○ About 40% of submitted shots were not in the TRECVID ground truth
● Verification experiment
○ 1,383 were judged again later
○ Judgements diverged for 23% of the shots; in 88% of those cases the live judgement was “incorrect”
● Judges seem to decide “incorrect” when in doubt
○ While ground truth for later use is biased, still same conditions for all teams in the room
● Need to set up clear rules for live judges
○ Like used by NIST for TRECVID annotations
(Example shots from the same video with diverging judgements: Judge 1: false, Judge 2: true, Judge 1: true, Judge 1: false)
Assembling Content and Ground Truth
● MPEG Compact Descriptor for Video Analysis
(CDVA)
○ Dataset for the evaluation of visual instance search
○ 23,000 video clips (1 min to over 1 hr)
● Annotation effort too high
○ Generate query and reference clips from three disjoint
subsets
○ Randomly embed relevant segment in noisy material
○ Apply transformations to query clips
○ Ground truth is generated from the editing scripts
○ Created 9,715 queries, 5,128 references
Process for LSC Dataset Generation
● Lifelog data has an inevitable privacy/GDPR compliance concern
● Required a cleaning/anonymization process for images, locations & words
○ Lifelogger deletes private/embarrassing images, validated by researcher
○ Images resized down (1024x768) to remove readable text
○ faces automatically & manually blurred; locations anonymized
○ Manually generated blacklist of terms for removal from textual data
Available Datasets
● Past TRECVID data
○ https://www-nlpir.nist.gov/projects/trecvid/past.data.table.html
○ Different types of usage conditions and license agreements
○ Ground truth, annotations and partly extracted features are available
● Past MediaEval data
○ http://www.multimediaeval.org/datasets/index.html
○ Mostly directly downloadable, annotations and sometimes features available
● Some freely available data sets
○ TRECVID IACC.1-3
○ TRECVID V3C1 (starting 2019), will also be used for VBS (download available)
○ BLIP 10,000 http://skuld.cs.umass.edu/traces/mmsys/2013/blip/Blip10000.html
○ YFCC100M https://webscope.sandbox.yahoo.com/catalog.php?datatype=i&did=67
○ Stanford I2V http://purl.stanford.edu/zx935qw7203
Available Datasets
● MPEG CDVA data set
○ Mixed licenses, partly CC, partly specific conditions of content owners
● NTCIR-Lifelog datasets
○ NTCIR-12 Lifelog - 90 days of mostly visual and activity data from 3 lifeloggers (100K+
images)
■ The ImageCLEF 2017 dataset is a subset of NTCIR-12
○ NTCIR-13 Lifelog - 90 days of richer media data from 2 lifeloggers (95K images)
■ LSC 2018 - 30 days of visual, activity, health, information & biometric data from one lifelogger
■ The ImageCLEF 2018 dataset is a subset of NTCIR-13
○ NTCIR-14 - 45 days of visual, biometric, health, activity data from two lifeloggers
Example: V3C Dataset
Vimeo Creative Commons Collection
○ The Vimeo Creative Commons Collection (V3C) [2] consists of ‘free’ video material sourced from the web
video platform vimeo.com. It is designed to contain a wide range of content which is representative of what
is found on the platform in general. All videos in the collection have been released by their creators under a
Creative Commons License which allows for unrestricted redistribution.
Rossetto, L., Schuldt, H., Awad, G., & Butt, A. (2019). V3C – a Research Video Collection. Proceedings of the 25th International Conference on MultiMedia Modeling.
5. Evaluation procedures,
results and metrics
Interactive and automatic retrieval
Evaluation settings for interactive retrieval tasks
● For each tool, human in the loop ...
○ Same room, projector, time pressure
○ Expert and novice users
● … compete in simulated tasks (KIS, AVS, ...)
○ Shared dataset in advance (V3C1 1000h)
○ 2V+1T KIS sessions and 2 AVS sessions
■ Tasks selected randomly and revisited
■ Tasks presented on data projector
Evaluation settings for interactive retrieval tasks
● Problem with repeatability of results
○ Human in the loop, conditions
● Evaluation provides one comparison of
tools in a shared environment with a given
set of tasks, users and shared dataset
○ Performance reflected by an overall score
Known-item search tasks at VBS 2018
Results of VBS 2018
Results - observed trends 2015-2017
2015 (100 hours) 2016 (250 hours) 2017 (600 hours)
Observation: the first AVS tasks were easier than visual KIS tasks, which in turn were easier than textual KIS tasks
J. Lokoc, W. Bailer, K. Schoeffmann, B. Muenzer, G. Awad, On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017, IEEE
Transactions on Multimedia, 2018
KIS score function (since 2018)
● Reward for solving a task
● Reward for being fast
● Fair scoring around time limit
● Penalty for wrong submissions
J. Lokoc, W. Bailer, K. Schoeffmann, B. Muenzer, G. Awad, On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017, IEEE
Transactions on Multimedia, 2018
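The following is an illustrative sketch only, not the official VBS scoring formula: it merely demonstrates the four properties listed above with assumed weights (a base reward for solving, a speed bonus that decays smoothly so the score is fair around the time limit, and a penalty per wrong submission).

def kis_score(solved, elapsed, time_limit, wrong_submissions,
              base=50.0, speed_bonus=50.0, penalty=10.0):
    # Reward for solving (base), reward for being fast (speed_bonus),
    # smooth decay around the time limit, penalty for wrong submissions
    if not solved:
        return 0.0
    time_factor = max(0.0, 1.0 - elapsed / time_limit)
    return max(0.0, base + speed_bonus * time_factor - penalty * wrong_submissions)

print(kis_score(True, 60, 300, 1))   # 50 + 50*0.8 - 10 = 80.0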
AVS score function (since 2018)
VBS 2018
VBS 2017
Score based on precision and recall
Overall scores at VBS 2018
Settings and metrics in LSC Evaluation
● Similar to the VBS… For each tool, human in the
loop ...
○ Same room, projector, time pressure
○ Expert and novice users
● … compete in simulated tasks (all KIS type)
○ Shared dataset in advance (LSC dataset, 27 days)
○ Six expert topics & 12 novice topics
■ Topics prepared by the organisers with full
(non-pooled) relevance judgements for all topics
■ Tasks presented on data projector
■ Participants submit a ‘correct’ answer to the LSC
server, which evaluates it against the ground truth.
Lifelog Search Challenge (Topics)
I am building a chair that is wooden in the late afternoon. I am at work, in an office environment (23
images, 12 minutes).
I am walking out to an airplane across the airport apron. I stayed in an airport hotel on the previous night
before checking out and walking a short distance to the airport (1 image, 1 minute).
I was in a Norwegian furniture store in a shopping mall (16 images, 9 minutes).
I was eating in a Thai restaurant (130 images, 66 minutes).
There was a large picture of a man carrying a box of tomatoes beside a child on a bicycle (185 images,
97 minutes).
I was playing a vintage car-racing game on my laptop in a hotel after flying (53 images, 27 minutes).
I was watching 'The Blues Brothers' Movie on the TV at home (82 images, 42 minutes).
LSC Score Function
Score calculated from 0 to 100, based on the
amount of time remaining. Negative scoring for
incorrect answers (each loses 10% of the available score).
The overall score is the sum of the scores for all
expert and novice topics.
Similar to VBS, there is a problem with repeatability of
results (human in the loop).
The evaluation provides one comparison of tools in a
shared environment with a given set of tasks, users
and a shared dataset.
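A minimal sketch of this scoring idea; the linear decay over the time limit and the exact handling of the 10% penalty are assumptions for illustration, not the official LSC implementation.

def lsc_topic_score(solved, elapsed, time_limit, wrong_answers):
    # Achievable score decays linearly from 100 to 0 over the time limit ...
    available = max(0.0, 100.0 * (1.0 - elapsed / time_limit))
    # ... and each incorrect answer removes 10% of the currently available score
    available *= 0.9 ** wrong_answers
    return available if solved else 0.0

print(lsc_topic_score(True, 150, 300, 1))   # 100 * 0.5 * 0.9 = 45.0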
Evaluation settings at TRECVID
● Three run types:
○ Fully Automatic
○ Manually-assisted
○ Relevance-feedback
● Query/Topics:
○ Text only
○ Text + image/video examples
● Training conditions:
○ Training data from same/cross domain as testing
○ Training data collected automatically
● Results:
System returns the top 1000 shots that most likely
satisfy the query/topic
Query Development Process
● Sample test videos (~30-40%) were viewed by 10 human assessors hired by NIST.
● Four facets describing different scenes were used (if applicable) to annotate the watched videos:
○ Who: concrete objects and beings (kinds of persons, animals, things)
○ What: what the objects and/or beings are doing (generic actions, conditions/states)
○ Where: locale, site, place, geographic, architectural, etc.
○ When: time of day, season
● Test queries were constructed from the annotated descriptions to include : Persons, Actions,
Locations, and Objects and their combinations.
Sample topics of Ad-hoc search queries
Find shots of a person holding a poster on the street at daytime
Find shots of one or more people eating food at a table indoors
Find shots of two or more cats both visible simultaneously
Find shots of a person climbing an object (such as tree, stairs, barrier)
Find shots of car driving scenes on a rainy day
Find shots of a person wearing a scarf
Find shots of destroyed buildings
Evaluation settings at TRECVID
● Usually 30 queries/topics are evaluated per year
● NIST hires 10 human assessors to:
○ Watch returned video shots
○ Judge whether a video shot satisfies the query (YES / NO vote)
● All system results per query/topic are pooled; NIST judges the top-ranked
results (rank 1 to ~200) completely and samples ranked results from 201 to 1000
to form a unique judged master set.
● The unique judged master set is divided into small pool files (~1000 shots
/ file) and given to the human assessors to watch and judge.
TRECVID evaluation framework
(Diagram: TRECVID provides the video collection and the information needs (topics/queries) to
participants; video search algorithms 1..K each return a ranked result set; the result sets are
pooled, judging 100% of the top X ranked results and Y% from rank X+1 to the bottom; human
assessors judge the video pools to produce the ground truth, from which the evaluation scores
are computed.)
Evaluation settings at TRECVID
● Basic rules for the human assessors to follow include:
○ In topic description, "contains x" or words to that effect are short for "contains x to a degree sufficient for x
to be recognizable as x to a human". This means among other things that unless explicitly stated, partial
visibility or audibility may suffice.
○ The fact that a segment contains video of physical objects representing the feature target, such as photos,
paintings, models, or toy versions of the target, will NOT be grounds for judging the feature to be true for the
segment. Containing video of the target within video may be grounds for doing so.
○ If the feature is true for some frame (sequence) within the shot, then it is true for the shot; and vice versa.
This is a simplification adopted for the benefits it affords in pooling of results and approximating the basis
for calculating recall.
○ When a topic expresses the need for x and y and ..., all of these (x and y and ...) must be perceivable
simultaneously in one or more frames of a shot in order for the shot to be considered as meeting the need.
Evaluation metric at TRECVID
● Mean extended inferred average precision (xinfAP) across all topics
○ Developed* by Emine Yilmaz and Javed A. Aslam at Northeastern University
○ Estimates average precision surprisingly well using a surprisingly small sample of judgments
from the usual submission pools (see next slide!)
○ More topics can be judged with same effort
○ The extended infAP added a stratification feature to infAP (i.e., we can sample from each stratum
with a different sampling rate)
* J.A. Aslam, V. Pavlu and E. Yilmaz, Statistical Method for System Evaluation Using Incomplete Judgments Proceedings of the 29th
ACM SIGIR Conference, Seattle, 2006.
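For reference, a minimal sketch of the average precision that (x)infAP estimates, computed here from complete judgments; the actual xinfAP estimator works from stratified samples of the judgment pools rather than full judgments.

def average_precision(ranked_shots, relevant):
    hits, precision_sum = 0, 0.0
    for rank, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / rank        # precision at each relevant rank
    return precision_sum / len(relevant) if relevant else 0.0

# Relevant shots found at ranks 1 and 3, with 4 relevant shots overall
print(average_precision(["a", "x", "b", "y"], {"a", "b", "c", "d"}))   # (1 + 2/3) / 4 ≈ 0.417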
InfAP correlation with AP
(Scatter plots: mean infAP of the 20%, 40%, 60% and 80% judgment samples plotted against the
mean infAP of the 100% sample; the mean infAP of the 100% sample equals AP.)
Automatic vs. Interactive search in AVS
Can we compare results from TRECVID (infAP) and VBS (unordered list)?
● Simulate AP from unordered list
J. Lokoc, W. Bailer, K. Schoeffmann, B. Muenzer, G. Awad, On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017, IEEE
Transactions on Multimedia, 2018
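One way to realize such a simulation is sketched below (an assumed procedure for illustration, not necessarily the exact one used in the cited paper): average AP over many random orderings of the unordered submission set, reusing the average_precision() sketch shown earlier.

import random

def simulated_ap(submitted_shots, relevant, trials=1000, seed=0):
    rng = random.Random(seed)
    shots = list(submitted_shots)
    total = 0.0
    for _ in range(trials):
        rng.shuffle(shots)               # random ordering of the unordered set
        total += average_precision(shots, relevant)
    return total / trials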
Automatic vs. Interactive search in AVS
Can we compare results from TRECVID (infAP) and VBS (unordered list)?
● Get precision at VBS recall level if ranked lists are available
J. Lokoc, W. Bailer, K. Schoeffmann, B. Muenzer, G. Awad, On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017, IEEE
Transactions on Multimedia, 2018
6. Lessons learned
Collection of our observations from TRECVID and VBS
Video Search (at TRECVID) Observations
One solution will not fit all. Investigations and discussions of video search must be
related to the searcher‘s specific needs, capabilities and history, and to the kinds
of data being searched.
The enormous and growing amounts of video require extremely large-scale
approaches to video exploitation. Much of it has little or no metadata
describing the content in any detail.
● 400 hrs of video are uploaded to YouTube per minute (as of
11/2017)
● “Over 1.9 Billion logged-in users visit YouTube each month and every day
people watch over a billion hours of video and generate billions of views.”
(https://www.youtube.com/yt/about/press/)
Video Search (at TRECVID) Observations
Multiple information sources (text, audio, video), each errorful, can yield better results when
combined than used alone…
● A human in the loop in search still makes an enormous difference.
● Text from speech via automatic speech recognition (ASR) is a powerful source of information but:
○ Its usefulness varies by video genre
○ Not everything/everyone in a video is talked about or “in the news"
○ Audible mentions are often offset in time from visibility
○ Not all languages have good ASR
● Machine learning approaches to tagging
○ yield seemingly useful results against large amounts of data when training data is sufficient
and similar to the test data (within domain)
○ but will they work well enough to be useful on highly heterogeneous video?
Video Search (at TRECVID) Observations
● Processing video using a sample of more than one frame per shot yields better results, but quickly
pushes common hardware configurations to their limits
● TRECVID systems have been looking at combining automatically derived and manually provided
evidence in search:
○ Internet Archive video will provide titles, keywords, descriptions
○ Where in the Panofsky hierarchy are the donors’ descriptions? If very personal, does that
mean less useful for other people?
● Need observational studies of real searching of various sorts using current functionality and
identifying unmet needs
VBS organization
● Test session before event - problems with submission formats etc.
● Textual KIS tasks in a special private session
○ Textual tasks are not so attractive for audience
○ Textual tasks are important and challenging
○ More time and tasks are needed to assess tool performance
● Visual and AVS tasks during welcome reception
○ “Panem et circenses” - competitions are also intended to entertain audience
○ Generally, more novice users can be invited to try the tool
VBS server
● Central element of the competition
○ Presents all tasks using data projector
○ Presents scores in all categories
○ Presents feedback for actual submissions
○ Additional logic (duplicates, flood of submissions, logs)
○ Also at LSC 2018, with a revised ranking function
● Selected issue - duplicate problem
○ IACC dataset contains numerous duplicate videos with identical visual content (but e.g.,
different language)
○ Submission was regarded as wrong although the visual content was correct
○ One actual case in 2018 had to be corrected after the event and changed the final ranking
○ Dataset design should explicitly avoid duplicates, or at least provide a list of duplicates;
moreover: server could provide more flexibility in changing judgements retrospectively
VBS server
● Issues of the simulations of KIS tasks
● How to “implant” visual memories?
○ Play scene just once - users forget the scene
○ Play scene in the loop - users exploit details -> overfitting to task presentation
○ Play scene in the loop + blur - colors can still be used, but users also forget important details
○ Play scene several times in the beginning and then show text description
● How to face ambiguities of textual KIS?
○ Simple text - not enough details, ambiguous meaning of some sentences
○ Extending text - simulation of a discussion - which details should be used first?
○ Still ambiguities -> teams should be allowed to ask some questions
AVS task and live judges at VBS
● Ambiguous task descriptions are problematic, hard to find balance
between too easy and too hard tasks
● Opinion of user vs. opinion of judge - who is right?
○ Users try to maximize score - sometimes risk wrong submission
○ Each shot is assessed just once -> the same “truth” for all teams
○ Similar to textual KIS - teams should be allowed to ask some questions
○ Teams have to read TRECVID rules for human assessors!
● Calibration of more judges
○ For more than one live judge - calibration of opinions is necessary, even during competition
● Balance the number of users for AVS tasks (ideally also for KIS tasks)
VBS interaction logging
● Until 2017, there was no connection between VBS results and the tool
features actually used to solve a task
○ VBS server received only team, video and frame IDs
○ Attempts to receive logs after competition failed
● Since 2018, an interaction log is a mandatory part of each task submission
○ How to obtain logs when the task is not solved?
○ Tools use variable modalities and interfaces - how to unify actions?
○ How to present and interpret logs?
○ How to log very frequent actions?
○ Time synchronization?
○ Log verification during test runs
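A hypothetical example of what a unified interaction log entry could look like (the field names and action categories are assumptions for illustration, not the official VBS log schema): every submission carries a time-stamped sequence of abstract action categories so that logs from tools with very different interfaces remain comparable.

import json, time

log_entry = {
    "team": "TEAM_A",                      # placeholder team ID
    "task": "KIS_V_03",                    # placeholder task ID
    "timestamp": int(time.time() * 1000),  # wall clock in ms, for synchronization
    "events": [                            # abstract categories unify different interfaces
        {"t": 1200, "category": "text", "type": "query", "value": "red car street"},
        {"t": 5400, "category": "browsing", "type": "scroll", "value": "page 3"},
        {"t": 9100, "category": "submit", "type": "shot", "value": "v37756_f00750"},
    ],
}
print(json.dumps(log_entry, indent=2))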
VBS interaction logging - 8/9 teams sent logs!
We can analyze both aggregated general statistics and user interactions in a given tool/task !!
Conclusion
Where is the User in the Age of Deep Learning?
● The complexity of tasks where AI is superior to humans is growing
○ Checkers -> Chess -> GO -> Poker -> DOTA? -> Starcraft? -> … -> guess user needs?
● Machine learning revolution - bigger/better training data
○ -> better performance
● Can we collect big training data to support interactive video retrieval?
○ To cover an open world (how many concepts, actions, … do you need)?
○ To fit needs of every user (how many contexts do we have)?
● Reinforcement learning?
Where is the User in the Age of Deep Learning?
Driver has to get carefully through many situations with just basic equipment
Q: Is this possible also for video retrieval systems?
Attribution: Can Pac Swire (away for a bit)
Driver has to rely on himself, but subsystems help (ABS, power steering, etc.)
Attribution: Grand Parc - Bordeaux
Driver just tells where to go
Attribution: Grendelkhan
Where is the User in the Age of Deep Learning?
● Users already benefit from deep learning
○ HCI support - body motion, hand gestures
○ More complete and precise automatic annotations
○ Embeddings/representations for similarity search
○ 2D/3D projections for visualization of high-dimensional data
○ Relevance feedback learning (benefit from past actions)
● Promising directions
○ One-shot learning for fast inspection of new concepts
○ Multimodal joint embeddings
○ …
○ Just A Rather Very Intelligent System (J.A.R.V.I.S.) used by Tony Stark (Iron Man) ??
Never say “this will not work!”
● If you have an idea how to solve interactive retrieval tasks - just try it!
○ Don’t be afraid your system is not specialized, you can surprise yourself and the community!
○ Paper submission in September 2019 for VBS at MMM 2020 in Seoul!
○ LSC submission in February 2019 for ICMR 2019 in Ottawa in June 2019.
○ The next TRECVID CFP will go out by mid-January, 2019.
Lokoč, Jakub, Adam Blažek, and Tomáš Skopal. "Signature-based video
browser." International Conference on Multimedia Modeling. Springer, Cham,
2014.
Del Fabro, Manfred, and Laszlo Böszörmenyi. "AAU video browser: non-sequential
hierarchical video browsing without content analysis." International
Conference on Multimedia Modeling. Springer, Berlin, Heidelberg, 2012.
Hürst, Wolfgang, Rob van de Werken, and Miklas Hoet. "A storyboard-based
interface for mobile video browsing." International Conference on Multimedia
Modeling. Springer, Cham, 2015.
Acknowledgements
This work has received funding from the European Union’s Horizon 2020
research and innovation programme, grant no. 761802, MARCONI. It was
supported also by Czech Science Foundation project Nr. 17-22224S.
Moreover, the work was also supported by the Klagenfurt University and
Lakeside Labs GmbH, Klagenfurt, Austria and funding from the European
Regional Development Fund and the Carinthian Economic Promotion Fund
(KWF) under grant KWF 20214 u. 3520/26336/38165.
  • 9. How Would You Search for This Video Scene?
  • 10. What Users Might Want...
  • 11. Shortcomings of Fully Automatic Video Retrieval ● Works well if ○ Users can properly describe their needs ○ System understands search intent of users ○ There is no polysemy and no context variation ○ Content features can sufficiently describe visual content ○ Computer vision (e.g., CNN) can accurately detect semantics ● Unfortunately, for real-world problems rarely true! “Query-and-browse results” approach
  • 12. Performance of Video Retrieval ● Typically based on MAP ○ Computed for a specific query- and dataset ○ Results are still quite low (even in the age of deep learning!) ○ Also, results can heavily vary from one dataset to another, and from one queryset to another ○ Example: TRECVID Ad-hoc Video Search (AVS) – automatic runs only 2016 2017 2018 Teams 9 8 10 Runs 30 33 33 Min xInfAP 0 0.026 0.003 Max xInfAP 0.054 0.206 0.121 Median xInfAP 0.024 0.092 0.058 Dataset: IACC.3, 30 queries per year
  • 13. Deep Learning Can Fail Easily [J. Su, D.V. Vargas, and K. Sakurai. One pixel attack for fooling neural networks. 2018. arXiv] How to deal with noisy data/videos?
  • 14. Deep Learning Can Fail Easily Output of Yolo v2 Andrew Ng talk Artificial Intelligence is the New Electricity “Anything typical human can do with < 1s of thought we can probably now or soon automate with AI”
  • 15. Deep Learning Can Fail Easily Nguyen A, Yosinski J, Clune J. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In Computer Vision and Pattern Recognition (CVPR '15), IEEE, 2015
  • 16. The Power of Human Computation Example from the Video Browser Showdown 2015: System X: shot and scene detection, concept detection (SIFT, VLAD, CNNs), similarity search. System Y: tiny thumbnails only, powerful user. Outperformed system X and was finally ranked 3rd! Moumtzidou, Anastasia, et al. "VERGE in VBS 2017." International Conference on Multimedia Modeling. Springer, Cham, 2017. Hürst, Wolfgang, Rob van de Werken, and Miklas Hoet. "A storyboard-based interface for mobile video browsing." International Conference on Multimedia Modeling. Springer, Cham, 2015.
  • 17. Interactive Video Retrieval Approach ● Assume a smart and interactive user ○ That knows about the challenges and shortcomings of simple querying ○ But might also know how to circumvent them ○ Could be a digital native! ● Give him/her full control over the search process ○ Provide many query and interaction features ■ Querying, browsing, navigation, filtering, inspecting/watching ● Assume an iterative/exploratory search process ○ Search - Inspect - Think - Repeat ○ “Will know it when I see it” ○ Could include many iterations! ○ Instead of “query-and-browse results”
  • 18. What Users Might Need... Concept Search Browsing features Motion Sketch Search History Hudelist, Marco A., Christian Beecks, and Klaus Schoeffmann. "Finding the chameleon in your video collection." Proceedings of the 7th International Conference on Multimedia Systems. ACM, 2016.
  • 19. Typical Query Types of Video Retrieval Tools ● Query-by-text ○ Enter keywords to match with available or extracted text (e.g., metadata, OCR, ASR, concepts, objects...) ● Query-by-concept ○ Show content for a specific class/category from concept detection (e.g., from ImageNet) ● Query-by-example ○ Provide example image/scene/sound ● Query-by-filtering ○ Filter content by some metadata or content feature (time, color, edge, motion, …) ● Query-by-sketch ○ Provide sketch of image/scene ● Query-by-dataset-example ○ Look for similar but other results ● Query-by-exploration ○ Start by looking around / browsing ○ Needs appropriate visualization ● Query-by-inspection ○ Inspect single clips, navigate Search in multimedia content (particularly video) is a highly interactive process! Users want to look around, try different query features, inspect results, refine queries, and start all over again! Automatic Interactive
  • 20. Evaluation of Interactive Video Retrieval ● Interfaces are inherently developed for human users ● Every user might be different ○ Different culture, knowledge, preferences, experiences, ... ○ Even the same user at a different time ● Video search interfaces need to be evaluated with real users... ○ No simulations! ○ User studies and campaigns (TRECVID, MediaEval, VBS, LSC)! ○ Find out how well users perform with a specific system ● ...and with real data! ○ Real videos “in the wild” (e.g., IACC.1 and V3C dataset) ○ Actual queries that would make sense in practice ○ Comparable evaluations (same data, same conditions, etc.) International competitions Datasets
  • 21. Only same dataset, query, time, room/condition, ... ...allows for true comparative evaluation!
  • 22. Where is the User in the Age of Deep Learning?
  • 23. 2. Interactive Video Search Tools Common architecture, components and top ranked tools
  • 24. What are basic video preprocessing steps? What models are used? Where interactive search helps? Common Architecture
  • 25. Common Architecture - Temporal Segmentation M. Gygli. Ridiculously Fast Shot Boundary Detection with Fully Convolutional Neural Networks. https://arxiv.org/pdf/1705.08214.pdf1. Compute a score based on a distance of frames 2. Threshold-based decision (fixed/adaptive)
  • 26. Common Architecture - Semantic Search Classification and embedding by popular Deep CNNs AlexNet (A. Krizhevsky et al., 2012) GoogLeNet (Ch. Szegedy et al., 2015) ResNet (K. He et al., 2015) NasNet (B. Zoph et al., 2018) ... Object detectors appear too (YOLO, SSD) Joint embedding models? VQA?
  • 27. Common Architecture - Sketch based Search Sketches from memory Just part of the scene Edges often do not match Colors often do not match => invariance needed
  • 28. Common Architecture - Limits ● Used ranking models have their limits ○ Missed frames ○ Wrong annotation ○ Inaccurate similarity function ● Still, to find a shot of a class is often easy (see later), but to find one particular shot or all shots of a class? T. Soucek. Known-Item Search in Image Datasets Using Automatically Detected Keywords. BC thesis, 2018.
  • 29. Common Architecture at VBS - Interactive Search Hudelist & Schoeffmann. An Evaluation of Video Browsing on Tablets with the ThumbBrowser. MMM2017 Goeau et al.,, Table of Video Content, ICME 2007
  • 30. Aspects of Flexible Interactive Video Search
  • 31. VIRET tool (Winner of VBS 2018, 3. at LSC 2018) Filters Query by text Query by color Query by image Video player Top ranked frames by a query Representative frames from the selected video Frame-based retrieval system with temporal context visualization. Focus on simple interface! Jakub Lokoc, Tomas Soucek, Gregor Kovalcik: Using an Interactive Video Retrieval Tool for LifeLog Data. LSC@ICMR 2018: 15-19, ACM Jakub Lokoc, Gregor Kovalcik, Tomas Soucek: Revisiting SIRET Video Retrieval Tool. VBS@MMM 2018: 419-424, Springer
  • 32. VIRET Tool (Winner of VBS 2018)
  • 33. ITEC Tool Primus, Manfred Jürgen, et al. "The ITEC Collaborative Video Search System at the Video Browser Showdown 2018." International Conference on Multimedia Modeling. Springer, Cham, 2018.
  • 34. ITEC tool (2. at VBS 2018 and LSC 2018)https://www.youtube.com/watch?v=CA5kr2pO5b
  • 35. LSC (Geospatial Browsing) W Hürst, K Ouwehand, M Mengerink, A Duane and C Gurrin. Geospatial Access to Lifelogging Photos in Virtual Reality. The Lifelog Search Challenge 2018 at ACM ICMR 2018.
  • 36. LSC (Interactive Video Retrieval) J. Lokoč, T. Souček and G. Kovalčík. Using an Interactive Video Retrieval Tool for LifeLog Data. The Lifelog Search Challenge 2018 at ACM ICMR 2018. (3rd highest performing system, but the same system won VBS 2018)
  • 37. LSC (LiveXplore) A Leibetseder, B Muenzer, A Kletz, M Primus and K Schöffmann. liveXplore at the Lifelog Search Challenge 2018. The Lifelog Search Challenge 2018 at ACM ICMR 2018. (2nd highest performing system)
  • 38. VR Lifelog Search Tool (winner of LSC 2018) Large lifelog archive with time-limited KIS topics Multimodal (visual concept and temporal) query formulation Ranked list of visual imagery (image per minute) Gesture-based manipulation of results A Duane, C Gurrin & W Hürst. Virtual Reality Lifelog Explorer for the Lifelog Search Challenge at ACM ICMR 2018. The Lifelog Search Challenge 2018 at ACM ICMR 2018. Top Performing System.
  • 40. vitrivr (University of Basel) ● Open-Source content-based multimedia retrieval stack ○ Supports images, music, video and 3D-models concurrently ○ Used for various applications both in and outside of academia ○ Modular architecture enables easy extension and customization ○ Compatible with all major operating systems ○ Available from vitrivr.org ● Participated several times in VBS (originally as IMOTION) [Credit: Luca Rossetto]
  • 41. vitrivr (University of Basel) ● System overview [Credit: Luca Rossetto]
  • 42. vitrivr (University of Basel) [Credit: Luca Rossetto]
  • 43. vitrivr (University of Basel) [Credit: Luca Rossetto]
  • 44. vitrivr (University of Basel) [Credit: Luca Rossetto]
  • 46. Overview of Evaluation Approaches ● Qualitative user study/survey ○ Self report: ask users about their experience with the tool, thinking aloud tests, etc. ○ Using psychophysiological measurements (e.g., electrodermal activity - EDA) ● Log-file analysis ○ Analyze server and/or client-side interaction patterns ○ Measure time needed for certain actions, etc. ● Question answering ○ Ask questions about content (open, multiple choice) to assess which content users found ● Indirect/task-based evaluation (Cranfield paradigm) ○ Pose certain tasks, measure the effectiveness of solving the task ○ Quantitative user study with many users and trials ○ Open competition, as in VBS, LSC, and TRECVID
  • 47. Properties of Evaluation Approaches ● Availability and level of detail of ground truth ○ None (e.g., questionnaires, logs) ○ Detailed and complete (e.g., retrieval tasks) ● Effort during experiments ○ Low (automatic check against ground truth) ○ Moderate (answers need to checked by human, e.g. live judges) ○ High (observation of or interview with participants) ● Controlled conditions ○ All users in same room with same setup (typical user-study) vs. participants via online survey ● Statistical tests! ○ We can only conclude that one interactive tool is better than the other, if there is statistically significant proof ○ Tests like ANOVA, t-tests, Wilcoxon-signed rank tests, … ○ Consider prerequisites of specific test (e.g., normal distribution)
  • 48. Example: Comparing Tasks and User Study ● Experiment compared ○ Question answering ○ Retrieval tasks ○ User study with questionnaire ● Materials ○ Interactive search tool with keyframe visualisation ○ TRECVID BBC rushes data set (25 hrs) ○ Questionnaire adapted from TRECVID 2004 ○ 19 users, doing at least 4 tasks W. Bailer and H. Rehatschek, Comparing Fact Finding Tasks and User Survey for Evaluating a Video Browsing Tool. ACM Multimedia 2009.
  • 49. Example: Comparing Tasks and User Study ● TVB1 I was familiar with the topic of the query. ● TVB3 I found that it was easy to find clips that are relevant. ● TVB4 For this topic I had enough time to find enough clips. ● TVB5 For this particular topic the tool interface allowed me to do browsing efficiently. ● TVB6 For this particular topic I was satisfied with the results of the browsing. W. Bailer and H. Rehatschek, Comparing Fact Finding Tasks and User Survey for Evaluating a Video Browsing Tool. ACM Multimedia 2009.
  • 50. Using Electrodermal Activity (EDA) Measuring EDA during retrieval tasks (A, B, C, D) with an interactive search tool, 14 participants C. Martinez-Peñaranda, et al., A Psychophysiological Approach to the Usability Evaluation of a Multi-view Video Browsing Tool,” MMM 2013.
  • 51. History of Selected Evaluation Campaigns ● Evaluation campaigns for video analysis and search started in early 2000s ○ Most well-known are TRECVID and MediaEval (previously ImageCLEF) ○ Both spin-offs from text retrieval benchmarks ● Several ones include tasks that are relevant to video search ● Most tasks are designed to be fully automatic ● Some allow at least interactive submissions as an option ○ Most submissions are usually still for the automatic type ● Since 2007, live evaluations with audience have been organized at major international conferences ○ Videolympics, VBS, LSC
  • 52. History of Selected Evaluation Campaigns
  • 53. History of Selected Evaluation Campaigns
  • 54. TRECVID ● Workshop series (2001 – present) → http://trecvid.nist.gov ● Started as a track in the TREC (Text REtrieval Conference) evaluation benchmark. ● Became an independent evaluation benchmark since 2003. ● Focus: content-based video analysis, retrieval, detection, etc. ● Provides data, tasks, and uniform, appropriate scoring procedures ● Aims for realistic system tasks and test collections: ○ Unfiltered data ○ Focus on relatively high-level functionality (e.g. interactive search) ○ Measurement against human abilities ● Forum for the ○ exchange of research ideas and for ○ the discussion of research methodology – what works, what doesn’t , and why
  • 55. TRECVID Philosophy ● TRECVID is a modern example of the Cranfield tradition ○ Laboratory system evaluation based on test collections ● Focus on advancing the state of the art from evaluation results ○ TRECVID’s primary aim is not competitive product benchmarking ○ Experimental workshop: sometimes experiments fail! ● Laboratory experiments (vs. e.g., observational studies) ○ Sacrifice operational realism and broad scope of conclusions ○ For control and information about causality – what works and why? ○ Results tend to be narrow, at best indicative, not final ○ Evidence grows as approaches prove themselves repeatedly, as part of various systems, against various test data, over years
  • 56. TRECVID Datasets HAVIC Soap opera (since 2013) Social media (since 2016) Security cameras (since 2008)
  • 57. Teams actively participated (2016-2018) INF CMU; Beijing University of Posts and Telecommunication; University Autonoma de Madrid; Shandong University; Xian JiaoTong University Singapore kobe_nict_siegen Kobe University, Japan; National Institute of Information and Communications Technology, Japan; University of Siegen, Germany UEC Dept. of Informatics, The University of Electro-Communications, Tokyo ITI_CERTH Information Technology Institute, Centre for Research and Technology Hellas ITEC_UNIKLU Klagenfurt University NII_Hitachi_UIT National Institute Of Informatics.; Hitachi Ltd; University of Information Technology (HCM-UIT) IMOTION University of Basel, Switzerland; University of Mons, Belgium; Koc University, Turkey MediaMill University of Amsterdam ; Qualcomm Vitrivr University of Basel Waseda_Meisei Waseda University; Meisei University VIREO City University of Hong Kong EURECOM EURECOM FIU_UM Florida International University, University of Miami NECTEC National Electronics and Computer Technology Center NECTEC RUCMM Renmin University of China NTU_ROSE_AVS ROSE LAB, NANYANG TECHNOLOGICAL UNIVERSITY SIRET SIRET Department of Software Engineering, Faculty of Mathematics and Physics, Charles University UTS_ISA University of Technology Sydney
  • 58. VideOlympics ● Run the same year’s TRECVID search tasks live in front of audience ● Organized at CIVR 2007-2009 Photos: Cees Snoek, https://www.flickr.com/groups/civr2007/
  • 59. Video Browser Showdown (VBS) ● Video search competition (annually at MMM) ○ Inspired by VideOlympics ○ Demonstrates and evaluates state-of-the-art interactive video retrieval tools ○ Also, entertaining event during welcome reception at MMM ● Participating teams solve retrieval tasks ○ Known-item search (KIS) tasks - one result - textual or visual ○ Ad-hoc video search (AVS) tasks - many results - textual ○ In large video archive (originally in 60 mins videos only) ● Systems are connected to the VBS Server ○ Presents tasks in live manner ○ Evaluates submitted results of teams (penalty for false submissions) First VBS in Klagenfurt, Austria (only search in a single video)
  • 60. Video Browser Showdown (VBS) 2012: Klagenfurt 11 teams KIS, single video (v) 2013: Huangshan 6 teams KIS, single video (v+t) 2014: Dublin 7 teams KIS, single video and 30h archive (v+t) 2015: Sydney 9 teams KIS, 100h archive (v+t) 2016: Miami 9 teams KIS, 250h archive (v+t) 2017: Reykjavik 6 teams KIS, 600h archive (v+t) AVS, 600h archive (t) 2018: Bangkok 9 teams KIS, 600h archive (v+t) AVS, 600h archive (t) 2019: Thessaloniki 6 teams KIS, 1000h archive (v+t) AVS, 1000h archive (t)
  • 61. Video Browser Showdown (VBS) VBS Server: • Presents queries • Shows remaining time • Computes scores • Shows statistics/ranking
  • 62.
  • 63. Video Browser Showdown (VBS)https://www.youtube.com/watch?v=tSlYFNlsn8U&t=140
  • 64. Lifelog Search Challenge (LSC 2018) ● New (annual) search challenge at ACM ICMR ● Focus on a life retrieval challenge o from multimodal lifelog data o Motivated by the fact that ever larger personal data archives are being gathered and the advent of AR technologies & veracity of data’ means that archives of life experiences are likely to become more commonplace. ● To be useful, the data should be searchable… o and for lifelogs, that means interactive search
  • 65. Lifelog Search Challenge (Definition) Dodge and Kitchin (2007), refer to lifelogging as “a form of pervasive computing, consisting of a unified digital record of the totality of an individual’s experiences, captured multi-modally through digital sensors and stored permanently as a personal multimedia archive”.
  • 66. Lifelog Search Challenge (Motivation)
  • 67. Lifelog Search Challenge (Lifelogging)
  • 68. Lifelog Search Challenge (Data) One month archive of multimodal lifelog data, extracted from NTCIR-13 Lifelog collection, including: ○ Wearable camera images at a rate of 3-5 / minute & concept annotations. ○ Biometrics ○ Activity logs ○ Media consumption ○ Content created/consumed u1_2016-08-15_050922_1, 'indoor', 0.991932, 'person', 0.9719478, 'computer', 0.309054524
  • 69. Lifelog Search Challenge (One Minute) <minute id="496"> <location> <name>Home</name> </location> <bodymetrics> <calories>2.8</calories> <gsr>7.03E-05</gsr> <heart-rate>94</heart-rate> <skin-temp>86</skin-temp> <steps>0</steps> </bodymetrics> <text>author,1,big,2,dout,1,revis,1,think,1,while,1</text> <images> <image> <image-id>u1_2016-08-15_050922_1</image-id> <image-path>u1/2016-08-15/20160815_050922_000.jpg</image-path> <annotations>'indoor', 0.985, 'computer', 0.984, 'laptop', 0.967, 'desk', 0.925</annotations> </image> </images> </minute>
  • 70. Lifelog Search Challenge (Topics) <Description timestamp="0">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background.</Description> <Description timestamp="30">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant on my right side.</Description> <Description timestamp="60">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant on my right side. There are keys on the table in front of me and you can see the cafe sign on the left side. I walked to the cafe and it took less than two minutes to get there.</Description> <Description timestamp="90">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant on my right side. There are keys on the table in front of me and you can see the cafe sign on the left side. I walked to the cafe and it took less than two minutes to get there. My colleague in the foreground is wearing a white shirt and drinking coffee from a red paper cup.</Description> <Description timestamp="120">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant on my right side. There are keys on the table in front of me and you can see the cafe sign on the left side. I walked to the cafe and it took less than two minutes to get there. My colleague in the foreground is wearing a white shirt and drinking coffee from a red paper cup. Immediately after having the coffee, I drive to the shop.</Description> <Description timestamp="150">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant on my right side. There are keys on the table in front of me and you can see the cafe sign on the left side. I walked to the cafe and it took less than two minutes to get there. My colleague in the foreground is wearing a white shirt and drinking coffee from a red paper cup. Immediately after having the coffee, I drive to the shop. It is a Monday.</Description> Temporarily enhancing topic descriptions that get more detailed (easier) every thirty seconds. The topics have 1 or few relevant items in the collection.
  • 71. Lifelog Search Challenge 2018 (Six Teams)
  • 72. 4. Task Design and Datasets Task types, trade-offs, datasets, annotations
  • 73. Task Types: Introduction ● Searching for content can be modelled as different task types ○ Choice impacts dataset preparation, annotations, evaluation methods ○ and the way to run the experiments ● Some of the task types here have fully automatic variants… ○ out of scope, but may serve as baseline to compare to ● Task can be categorized by the target and the formulation of the query ○ Particular target item vs. set or class ○ only one target item in data set, or ○ multiple occurrences of an instance, of a class of relevant items/segments ○ Definition of query ○ example, given in a specific modality ○ precise definition vs. fuzzy idea
  • 74. Task Types (at Campaigns): Overview
  • 75. Task Types (at Campaigns): Overview How clear is search intent? Known-item search AVS tasks Example Visual Textual Abstract None This is how I use web video search VIS tasks Given video dataset What is the role of similarity for KIS atVideo Browser Showdown? SISAP'18, Peru
  • 76. Task Type: Visual Instance Search ● User holds a digital representation of a relevant example of the needed information ● Example or its features can be sent to system ● User does not need to translate example into query representation ● e.g., trademark/logo detection
  • 77. Task Types: Known Item Search (KIS) ● User sees/hears/reads a representation ○ Target item is described or presented ● Used in VBS & LSC ● Exactly one target semantics ○ Representation of exactly one relevant item/segment in dataset ● Models user’s (partly faded) memories ○ user has a memory of content to be found, might be fuzzy ● User must translate representation to provided query methods ○ The complexity of this translation depends significantly on the modality ■ e.g., visual is usually easier than textual, which leaves more room for interpretation ○ Relation of/to content is important too ■ e.g. searching in own life log media vs. searching in media collection on the web “on a busy street”
  • 78. Task Types: Ad-hoc Search ● User sees/hears/reads a representation of the needed information ○ Target item is described or presented ● Many targets semantics ○ Representation of a broader set/class of relevant items/segments ○ cf. TRECVID AVS task ● Models user’s rough memories ○ user has only a memory of the type of relevant content, not about details ● Similar issues of translating the representation like for KIS ○ but due to broader set of relevant items the correct interpretation of textual information is a less critical issue ● Raises issues of what is considered within/without scope of a result set ○ e.g., partly visible, visible on a screen in the content, cartoon/drawing versions, … ○ TRECVID has developed guidelines for annotation of ground truth
  • 79. Task Types: Exploration ● User does not start from a clear idea/query of the information need ○ No concrete query, just inspects dataset ○ Browsing and exploring may lead to identifying useful content ● Reflects a number of practical situations, but very hard to evaluate ○ User simply cannot describe the content ○ User does not remember content but would recognize it ○ Dontent inspection for the sake of interest ○ Digital forensics ● No known examples of such tasks in benchmarking campaigns due to the difficulties with evaluation Demo: https://www.picsbuffet.com/ Barthel, Kai Uwe, Nico Hezel, and Radek Mackowiak. "Graph-based browsing for large video collections." International Conference on Multimedia Modeling. Springer, Cham, 2015.
  • 80. Task Design is About Trade-offs: Aspects to consider Tasks shall ○ model real-world content search problems ■ in order to assess whether tools are usable for these problems ○ set controlled conditions ■ to enable reliable assessment ○ be repeatable ■ to compare results from different evaluation sessions ○ avoid bias towards certain features or query methods many real world problems involve very fuzzy information needs well defined queries are best suited for evaluation users remember more about the scene when they start looking through examples information in the task should be provided at defined points in time during evaluation sessions, relevant shots may be discovered, and the ground truth updated for repeatable evaluation, a fixed ground truth set is desirable although real world tasks may involve time pressure, it would be best to measure the time until the task is solved time limits are needed in evaluation sessions for practical reasons
  • 81. Task Selection (KIS @ VBS) ● Known duplicates: ○ List of known (partial) duplicates from matching metadata and file size ○ Content-based matches ● Uniqueness inside same and similar content: ○ Ensure unambiguous target ○ May be applied to sequence of short shots rather than single shot ● Complexity of segment: ○ Rough duration of 20s ○ Limited number of shots ● Describe-ability: ○ Textual KIS requires segments that can be described with limited amount of text (less shots, salient location or objects, etc.)
  • 82. VBS KIS Task Selection - Examples ● KIS Visual (video 37756, frame 750-1250) ○ Short shots, varying content - hard to describe as text, but unique sequence ● KIS Textual (video 36729, frame 4047-4594) ○ @0 sec: “Shots of a factory hall from above. Workers transporting gravel with wheelbarrows. Other workers putting steel bars in place.” ○ @100 sec: “The hall has Cooperativa Agraria written in red letters on the roof.” ○ @200 sec: “There are 1950s style American cars and trucks visible in one shot.”
  • 83. Presenting Queries (VBS) ● Example picture? ○ allow taking pictures of visual query clips? ● Visual ○ Play query once ■ one chance to memorize, but no chance to check possibly relevant shot against query — in real life, one cannot visually check, but one does not forget what one knew at query time ○ Repeat query but blur increasingly ■ basic information is there, but not possible to check details ● Textual ○ User's memory is for most people also visual ○ Simulate case where retrieval expert is asked to find content ■ expert could ask questions ○ Provide incremental details about the scene (but initial piece of information must already be unambiguous for KIS)
  • 84. Task Participants ● Typically developers of tools participate in evaluation campaigns ○ They know how to translate information requests into queries ○ Knowledge of user has huge impact on performance that can be achieved ● “Novice session” ○ Invite members from the audience to use the tools, after a brief introduction ○ Provides insights about usability and complexity of tool ○ In real use cases, users are rather domain experts than retrieval experts, thus this condition is important to test ○ Selection of novices is an issue for comparing results ○ Question of whether/how scores of expert and novice tasks shall be combined
  • 85. Real-World Datasets ● Research neds reproducible results ○ standardized and free datasets are necessary ● One problem with many datasets: ○ current state of web video in the wild is not or no longer represented accurately by them [Rossetto & Schuldt] ● Hence, we also need datasets that model the real world ○ One such early effort: ○ V3C is such a dataset (see later) Rossetto, L., & Schuldt, H. (2017). Web video in numbers-an analysis of web-video metadata. arXiv preprint arXiv:1707.01340.
  • 86. Videos in the Wild Age-distribution of common video collections vs what is found in the wild Rossetto, L., & Schuldt, H. (2017). Web video in numbers-an analysis of web-video metadata. arXiv preprint arXiv:1707.01340.
  • 87. Videos in the Wild Duration-distribution of common video collections vs what is found in the wild Rossetto, L., & Schuldt, H. (2017). Web video in numbers-an analysis of web-video metadata. arXiv preprint arXiv:1707.01340.
  • 88. Dataset Preparation and Annotations ● Data set = content + annotations for specific problem ● Today, content is everywhere ● Annotations are still hard to get ○ External data (e.g., archive documentation) often not available at sufficient granularity and time-indexed ○ Creation by experts is prohibitively costly ● Approaches ○ Crowdsourcing (with different notions of “crowd” impacting quality) ○ Reduce amount of annotations needed ○ Generate data set and ground truth
  • 89. Collaborative Annotation Initiatives from TRECVID participants 2003-2013 ○ http://mrim.imag.fr/tvca/ ○ Concept annotations for high-level feature extraction/semantic indexing tasks ○ As data sets grew in size, the percentage of the content that could be annotated declined ○ Use of active learning to select samples where annotation brings highest benefit S. Ayache and G. Quénot, "Video Corpus Annotation using Active Learning", ECIR 2008.
• 90. Crowdsourcing with the General Public ● Use platforms like Amazon Mechanical Turk to collect data ○ The main issue, however, is that annotations are noisy and unreliable ● Solutions ○ Multiple annotations and majority votes (see the sketch below) ○ Include tasks that help assess the confidence in a specific worker ■ e.g., asking easy questions first, to verify facts about the image ○ More sophisticated aggregation strategies ● MediaEval ran tasks in 2013 and 2014 ○ Annotation of fashion images and timed comments about music B. Loni, M. Larson, A. Bozzon, L. Gottlieb, Crowdsourcing for Social Multimedia at MediaEval 2013: Challenges, Data set, and Evaluation, MediaEval WS Notes, 2013. K. Yadati, P. S.N. Shakthinathan Chandrasekaran Ayyanathan, M. Larson, Crowdsorting Timed Comments about Music: Foundations for a New Crowdsourcing Task, MediaEval WS Notes, 2014.
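To make the aggregation idea concrete, here is a minimal Python sketch (with hypothetical data) of majority-vote aggregation of noisy crowd labels, including a simple agreement ratio that can serve as a crude per-item confidence signal; real deployments typically add worker-quality weighting on top of this.

from collections import Counter

def aggregate_labels(annotations):
    """Majority-vote aggregation of crowd labels.

    annotations: dict mapping item_id -> list of labels from different workers.
    Returns: dict mapping item_id -> (winning label, agreement ratio).
    """
    aggregated = {}
    for item_id, labels in annotations.items():
        label, votes = Counter(labels).most_common(1)[0]
        aggregated[item_id] = (label, votes / len(labels))
    return aggregated

# Toy example: three workers annotate two images
print(aggregate_labels({
    "img_001": ["cat", "cat", "dog"],           # 2/3 agreement
    "img_002": ["indoor", "indoor", "indoor"],  # full agreement
}))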
  • 91. Pooling ● Exhaustive relevance judgements are costly for large data sets ● Annotate pool of top k results returned from participating systems ● Pros ○ Efficient ○ Results are correct for all participants, not an approximation ● Cons ○ Annotations can only be done after experiment ○ Repeating the experiment with new/updated systems requires updating the annotation (or getting approximate results) Sri Devi Ravana et al., Document-based approach to improve the accuracy of pairwise comparison in evaluating information retrieval systems, ASLIB J. Inf. Management, 67(4), 2015.
• 92. Live Annotation ● Assessment of incoming results during the competition ● Used at VBS 2017-2018 ● Addresses issues of incomplete or missing ground truth ○ e.g., ground truth created using pooling, or new queries ● Pros ○ Provides immediate feedback ○ Avoids biased results from ground truth pooled from other systems ● Cons ○ Done under time pressure ○ Not possible to review other similar cases - may cause inconsistency in decisions ○ Multi-annotator agreement would be needed (impacts decision time and number of annotators needed)
• 93. Live Annotation – Example from VBS 2018 ● 1,848 shots judged live ○ About 40% of the submitted shots were not in the TRECVID ground truth ● Verification experiment ○ 1,383 shots were judged again later ○ Judgements diverged for 23% of the shots; in 88% of those cases the live judgement was “incorrect” ● Judges seem to decide “incorrect” when in doubt ○ While the ground truth kept for later use is biased, conditions are still the same for all teams in the room ● Need to set up clear rules for live judges ○ Like those used by NIST for TRECVID annotations (Figure: examples of diverging judgements for shots from the same video)
• 94. Assembling Content and Ground Truth ● MPEG Compact Descriptors for Video Analysis (CDVA) ○ Dataset for the evaluation of visual instance search ○ 23,000 video clips (1 min to over 1 hr) ● Annotation effort too high ○ Generate query and reference clips from three disjoint subsets ○ Randomly embed the relevant segment in noisy material ○ Apply transformations to query clips ○ Ground truth is generated from the editing scripts (see the sketch below) ○ Created 9,715 queries, 5,128 references
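As an illustration only (not the actual MPEG CDVA tooling), the following Python sketch shows the general idea of generating queries and ground truth from an editing script: embed a known reference segment at a random position in distractor material, transform the result, and record where the reference went.

import random

def build_query_clip(reference_segment, distractors, transform):
    """Hypothetical generation of a query clip plus ground truth:
    embed a known reference segment into distractor material, apply a
    transformation, and derive ground truth from the editing script."""
    insert_at = random.randint(0, len(distractors))
    timeline = distractors[:insert_at] + [reference_segment] + distractors[insert_at:]
    query_clip = [transform(seg) for seg in timeline]
    editing_script = {"reference_id": reference_segment["id"], "position": insert_at}
    return query_clip, editing_script

# Toy usage: segments are dicts; the "transformation" here just tags the segment
clip, gt = build_query_clip(
    {"id": "ref_042", "frames": 250},
    [{"id": "noise_1", "frames": 300}, {"id": "noise_2", "frames": 500}],
    transform=lambda seg: {**seg, "transformed": True},
)
print(gt)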
  • 95. Process for LSC Dataset Generation ● Lifelog data has an inevitable privacy/GDPR compliance concern ● Required a cleaning/anonymization process for images, locations & words ○ Lifelogger deletes private/embarrassing images, validated by researcher ○ Images resized down (1024x768) to remove readable text ○ faces automatically & manually blurred; locations anonymized ○ Manually generated blacklist of terms for removal from textual data
  • 96. Available Datasets ● Past TRECVID data ○ https://www-nlpir.nist.gov/projects/trecvid/past.data.table.html ○ Different types of usage conditions and license agreements ○ Ground truth, annotations and partly extracted features are available ● Past MediaEval data ○ http://www.multimediaeval.org/datasets/index.html ○ Mostly directly downloadable, annotations and sometimes features available ● Some freely available data sets ○ TRECVID IACC.1-3 ○ TRECVID V3C1 (starting 2019), will also be used for VBS (download available) ○ BLIP 10,000 http://skuld.cs.umass.edu/traces/mmsys/2013/blip/Blip10000.html ○ YFCC100M https://webscope.sandbox.yahoo.com/catalog.php?datatype=i&did=67 ○ Stanford I2V http://purl.stanford.edu/zx935qw7203
  • 97. Available Datasets ● MPEG CDVA data set ○ Mixed licenses, partly CC, partly specific conditions of content owners ● NTCIR-Lifelog datasets ○ NTCIR-12 Lifelog - 90 days of mostly visual and activity data from 3 lifeloggers (100K+ images) ■ ImageCLEF 2017 dataset a subset of NTCIR-12 ○ NTCIR-13 Lifelog - 90 days of richer media data from 2 lifeloggers (95K images) ■ LSC 2018 - 30 days of visual, activity, health, information & biometric data from one lifelogger ■ ImageCLEF 2018 dataset a subset of NTCIR-13 ○ NTCIR-14 - 45 days of visual, biometric, health, activity data from two lifeloggers
  • 98. Example: V3C Dataset Vimeo Creative Commons Collection ○ The Vimeo Creative Commons Collection (V3C) [2] consists of ‘free’ video material sourced from the web video platform vimeo.com. It is designed to contain a wide range of content which is representative of what is found on the platform in general. All videos in the collection have been released by their creators under a Creative Commons License which allows for unrestricted redistribution. Rossetto, L., Schuldt, H., Awad, G., & Butt, A. (2019). V3C – a Research Video Collection. Proceedings of the 25th International Conference on MultiMedia Modeling.
  • 99. 5. Evaluation procedures, results and metrics Interactive and automatic retrieval
  • 100. Evaluation settings for interactive retrieval tasks ● For each tool, human in the loop ... ○ Same room, projector, time pressure ○ Expert and novice users ● … compete in simulated tasks (KIS, AVS, ...) ○ Shared dataset in advance (V3C1 1000h) ○ 2V+1T KIS sessions and 2 AVS sessions ■ Tasks selected randomly and revisited ■ Tasks presented on data projector
  • 101. Evaluation settings for interactive retrieval tasks ● Problem with repeatability of results ○ Human in the loop, conditions ● Evaluation provides one comparison of tools in a shared environment with a given set of tasks, users and shared dataset ○ Performance reflected by an overall score
  • 102. Known-item search tasks at VBS 2018
  • 103. Results of VBS 2018
• 104. Results - observed trends 2015-2017 (dataset sizes: 2015 - 100 hours, 2016 - 250 hours, 2017 - 600 hours) Observation: AVS tasks are easier than visual KIS, which in turn is easier than textual KIS J. Lokoc, W. Bailer, K. Schoeffmann, B. Muenzer, G. Awad, On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017, IEEE Transactions on Multimedia, 2018
  • 105. KIS score function (since 2018) ● Reward for solving a task ● Reward for being fast ● Fair scoring around time limit ● Penalty for wrong submissions J. Lokoc, W. Bailer, K. Schoeffmann, B. Muenzer, G. Awad, On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017, IEEE Transactions on Multimedia, 2018
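A minimal sketch of such a KIS-style score function, assuming illustrative constants (100-point maximum, a floor of 50 points for any correct answer within the time limit, 10 points per wrong submission); the official VBS formula uses its own constants and details.

def kis_score(time_used, time_limit, wrong_submissions, solved,
              max_score=100, min_if_solved=50, wrong_penalty=10):
    """Illustrative KIS-style scoring: reward solving, reward speed,
    keep scoring fair near the time limit, penalize wrong submissions.
    Constants are placeholders, not the official VBS values."""
    if not solved or time_used > time_limit:
        return 0
    # Linear decay from max_score to min_if_solved over the time limit,
    # so a correct answer just before the limit still earns a solid reward.
    speed_part = max_score - (max_score - min_if_solved) * time_used / time_limit
    return max(0, speed_part - wrong_penalty * wrong_submissions)

# e.g., solved after 120 of 300 seconds with one wrong submission -> 70
print(kis_score(time_used=120, time_limit=300, wrong_submissions=1, solved=True))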
• 106. AVS score function (since 2018) ● Score based on precision and recall (formulas shown for VBS 2017 and VBS 2018)
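Again only a hedged sketch: an AVS-style score can combine precision over a team's submissions with recall over the distinct relevant videos found. The actual VBS 2018 formula is more elaborate, so the function below conveys just the basic precision/recall idea.

def avs_score(correct_shots, wrong_shots, relevant_videos_found,
              total_relevant_videos, max_score=100):
    """Toy AVS-style score: precision of the submitted shots times recall
    over distinct relevant videos, scaled to max_score. Not the exact
    VBS formula."""
    submitted = correct_shots + wrong_shots
    precision = correct_shots / submitted if submitted else 0.0
    recall = (relevant_videos_found / total_relevant_videos
              if total_relevant_videos else 0.0)
    return max_score * precision * recall

# e.g., 18 correct and 2 wrong shots, covering 9 of 30 relevant videos -> 27.0
print(avs_score(correct_shots=18, wrong_shots=2, relevant_videos_found=9,
                total_relevant_videos=30))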
  • 107. Overall scores at VBS 2018
• 108. Settings and metrics in the LSC Evaluation ● Similar to VBS… for each tool, a human in the loop ... ○ Same room, projector, time pressure ○ Expert and novice users ● … competes in simulated tasks (all of KIS type) ○ Shared dataset in advance (LSC dataset - 27 days) ○ 6 expert topics & 12 novice topics ■ Topics prepared by the organisers with full (non-pooled) relevance judgements for all topics ■ Tasks presented on a data projector ■ Participants submit a ‘correct’ answer to the LSC server, which evaluates it against the ground truth.
• 109. Lifelog Search Challenge (Topics) I am building a chair that is wooden in the late afternoon. I am at work, in an office environment (23 images, 12 minutes). I am walking out to an airplane across the airport apron. I stayed in an airport hotel on the previous night before checking out and walking a short distance to the airport (1 image, 1 minute). I was in a Norwegian furniture store in a shopping mall (16 images, 9 minutes). I was eating in a Thai restaurant (130 images, 66 minutes). There was a large picture of a man carrying a box of tomatoes beside a child on a bicycle (185 images, 97 minutes). I was playing a vintage car-racing game on my laptop in a hotel after flying (53 images, 27 minutes). I was watching 'The Blues Brothers' movie on the TV at home (82 images, 42 minutes).
• 110. LSC Score Function Score calculated from 0 to 100, based on the amount of time remaining. Negative scoring for incorrect answers (lose 10% of the available score). Overall score is based on the sum of scores over all expert and novice topics. Similar to VBS, there is a problem with repeatability of results (human in the loop). The evaluation provides one comparison of tools in a shared environment with a given set of tasks, users and a shared dataset.
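A minimal sketch of the LSC scoring as described on this slide, under the assumption that each incorrect answer removes 10% of the score still available at submission time (the exact server implementation may differ).

def lsc_score(time_used, time_limit, wrong_submissions, solved,
              max_score=100, wrong_penalty=0.10):
    """Illustrative LSC-style score: 0-100 based on time remaining,
    minus 10% of the available score per incorrect answer."""
    if not solved or time_used > time_limit:
        return 0.0
    available = max_score * (time_limit - time_used) / time_limit
    return max(0.0, available * (1 - wrong_penalty * wrong_submissions))

# e.g., solved after 100 of 300 seconds with two incorrect answers first -> ~53.3
print(lsc_score(time_used=100, time_limit=300, wrong_submissions=2, solved=True))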
• 111. Evaluation settings at TRECVID ● Three run types: ○ Fully automatic ○ Manually-assisted ○ Relevance-feedback ● Query/Topics: ○ Text only ○ Text + image/video examples ● Training conditions: ○ Training data from the same/cross domain as testing ○ Training data collected automatically ● Results: the system returns the top 1000 shots that most likely satisfy the query/topic
• 112. Query Development Process ● Sample test videos (~30 - 40%) were viewed by 10 human assessors hired by NIST. ● 4 facets describing different scenes were used (if applicable) to annotate the watched videos: ○ Who: concrete objects and beings (kinds of persons, animals, things) ○ What: what are the objects and/or beings doing? (generic actions, conditions/state) ○ Where: locale, site, place, geographic, architectural, etc. ○ When: time of day, season ● Test queries were constructed from the annotated descriptions to include Persons, Actions, Locations, and Objects and their combinations.
  • 113. Sample topics of Ad-hoc search queries Find shots of a person holding a poster on the street at daytime Find shots of one or more people eating food at a table indoors Find shots of two or more cats both visible simultaneously Find shots of a person climbing an object (such as tree, stairs, barrier) Find shots of car driving scenes in a rainy day Find shots of a person wearing a scarf Find shots of destroyed buildings
• 114. Evaluation settings at TRECVID ● Usually 30 queries/topics are evaluated per year ● NIST hires 10 human assessors to: ○ Watch returned video shots ○ Judge whether a video shot satisfies the query (YES / NO vote) ● All system results per query/topic are pooled; NIST judges the top ranked results (rank 1 to ~200) at 100% and samples ranked results from 201 to 1000 to form a unique judged master set (see the pooling sketch below). ● The unique judged master set is divided into small pool files (~1000 shots / file) and given to the human assessors to watch and judge.
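The pooling step can be sketched as follows in Python; depths and sampling rate are placeholders, and NIST's actual stratified sampling procedure differs in detail.

import random

def build_judgment_pool(ranked_runs, full_depth=200, max_depth=1000,
                        sample_rate=0.2):
    """Illustrative pooling: include 100% of each run's results down to
    full_depth, plus a random sample of the results between full_depth
    and max_depth, then deduplicate into one master set for assessors."""
    pool = set()
    for run in ranked_runs:                # run = list of shot IDs, best first
        pool.update(run[:full_depth])
        tail = run[full_depth:max_depth]
        pool.update(random.sample(tail, int(len(tail) * sample_rate)))
    return pool

# Toy usage with two small overlapping "runs"
runs = [[f"shot_{i}" for i in range(300)],
        [f"shot_{i}" for i in range(100, 400)]]
print(len(build_judgment_pool(runs, full_depth=50, max_depth=300, sample_rate=0.1)))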
• 115. TRECVID evaluation framework (diagram): TRECVID participants run their video search algorithms (1..K) over the video collection for the given information needs (topics/queries); the ranked result sets are pooled; human assessors judge 100% of the top X ranked results and Y% of the results from rank X+1 to the bottom; the resulting ground truth is used to compute the evaluation scores.
  • 116. Evaluation settings at TRECVID ● Basic rules for the human assessors to follow include: ○ In topic description, "contains x" or words to that effect are short for "contains x to a degree sufficient for x to be recognizable as x to a human" . This means among other things that unless explicitly stated, partial visibility or audibility may suffice. ○ The fact that a segment contains video of physical objects representing the feature target, such as photos, paintings, models, or toy versions of the target, will NOT be grounds for judging the feature to be true for the segment. Containing video of the target within video may be grounds for doing so. ○ If the feature is true for some frame (sequence) within the shot, then it is true for the shot; and vice versa. This is a simplification adopted for the benefits it affords in pooling of results and approximating the basis for calculating recall. ○ When a topic expresses the need for x and y and ..., all of these (x and y and ...) must be perceivable simultaneously in one or more frames of a shot in order for the shot to be considered as meeting the need.
• 117. Evaluation metric at TRECVID ● Mean extended inferred average precision (xinfAP) across all topics ○ Developed* by Emine Yilmaz and Javed A. Aslam at Northeastern University ○ Estimates average precision surprisingly well using a surprisingly small sample of judgments from the usual submission pools (see next slide!) ○ More topics can be judged with the same effort ○ Extended infAP adds stratification to infAP (i.e., each stratum can be sampled with a different sample rate) * J.A. Aslam, V. Pavlu and E. Yilmaz, Statistical Method for System Evaluation Using Incomplete Judgments, Proceedings of the 29th ACM SIGIR Conference, Seattle, 2006.
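The following toy sketch only illustrates the stratified-sampling idea behind inferred measures, i.e., extrapolating from the judged samples within a stratum; it is not the xinfAP estimator itself (see the cited paper for the actual formulas).

def estimate_relevant_in_stratum(judged_relevant, judged_total, stratum_size):
    """Estimate the number of relevant shots in a whole stratum from the
    relevance rate observed in its judged sample."""
    if judged_total == 0:
        return 0.0
    return stratum_size * judged_relevant / judged_total

# e.g., 12 of 50 sampled shots in the rank 201-1000 stratum were relevant,
# so we estimate roughly 192 relevant shots among the 800 shots in the stratum
print(estimate_relevant_in_stratum(judged_relevant=12, judged_total=50,
                                   stratum_size=800))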
  • 118. InfAP correlation with AP Mean InfAP of 100% sample Mean InfAP of 100% sample Mean InfAP of 100% sample Mean InfAP of 100% sample MeanInfAPof80%sampleMeanInfAPof40%sample MeanInfAPof60%sampleMeanInfAPof20%sample Mean InfAP of 100% sample == AP
  • 119. Automatic vs. Interactive search in AVS Can we compare results from TRECVID (infAP) and VBS (unordered list)? ● Simulate AP from unordered list J. Lokoc, W. Bailer, K. Schoeffmann, B. Muenzer, G. Awad, On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017, IEEE Transactions on Multimedia, 2018
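One plausible way to obtain an AP-like number from an unordered interactive submission (an assumption for illustration; the cited TMM paper defines its own simulation) is to average AP over many random orderings of the submitted set.

import random

def simulated_ap(submitted_shots, relevant_shots, total_relevant, trials=1000):
    """Average AP over random permutations of an unordered submission set."""
    aps = []
    for _ in range(trials):
        order = list(submitted_shots)
        random.shuffle(order)
        hits, precision_sum = 0, 0.0
        for rank, shot in enumerate(order, start=1):
            if shot in relevant_shots:
                hits += 1
                precision_sum += hits / rank
        aps.append(precision_sum / total_relevant if total_relevant else 0.0)
    return sum(aps) / len(aps)

# Toy usage: 6 submitted shots, 4 of them relevant, 20 relevant shots overall
print(simulated_ap(["s1", "s2", "s3", "s4", "s5", "s6"],
                   {"s1", "s3", "s4", "s6"}, total_relevant=20))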
  • 120. Automatic vs. Interactive search in AVS Can we compare results from TRECVID (infAP) and VBS (unordered list)? ● Get precision at VBS recall level if ranked lists are available J. Lokoc, W. Bailer, K. Schoeffmann, B. Muenzer, G. Awad, On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017, IEEE Transactions on Multimedia, 2018
  • 121. 6. Lessons learned Collection of our observations from TRECVID and VBS
• 122. Video Search (at TRECVID) Observations One solution will not fit all. Investigations/discussions of video search must relate to the searcher's specific needs/capabilities/history and to the kinds of data being searched. The enormous and growing amounts of video require extremely large-scale approaches to video exploitation. Much of it has little or no metadata describing the content in any detail. ● 400 hrs of video are being uploaded to YouTube per minute (as of 11/2017) ● “Over 1.9 Billion logged-in users visit YouTube each month and every day people watch over a billion hours of video and generate billions of views.” (https://www.youtube.com/yt/about/press/)
• 123. Video Search (at TRECVID) Observations Multiple information sources (text, audio, video), each errorful, can yield better results when combined than when used alone… ● A human in the loop in search still makes an enormous difference. ● Text from speech via automatic speech recognition (ASR) is a powerful source of information, but: ○ Its usefulness varies by video genre ○ Not everything/everyone in a video is talked about or “in the news” ○ Audible mentions are often offset in time from visibility ○ Not all languages have good ASR ● Machine learning approaches to tagging ○ yield seemingly useful results against large amounts of data when training data is sufficient and similar to the test data (within domain) ○ but will they work well enough to be useful on highly heterogeneous video?
• 124. Video Search (at TRECVID) Observations ● Processing video using a sample of more than one frame per shot yields better results, but quickly pushes common hardware configurations to their limits ● TRECVID systems have been looking at combining automatically derived and manually provided evidence in search: ○ Internet Archive videos provide titles, keywords, descriptions ○ Where in the Panofsky hierarchy are the donors’ descriptions? If very personal, does that mean less useful for other people? ● Need observational studies of real searching of various sorts using current functionality and identifying unmet needs
• 125. VBS organization ● Test session before the event - problems with submission formats etc. ● Textual KIS tasks in a special private session ○ Textual tasks are not so attractive for the audience ○ Textual tasks are important and challenging ○ More time and tasks are needed to assess tool performance ● Visual and AVS tasks during the welcome reception ○ “Panem et circenses” (bread and circuses) - competitions are also intended to entertain the audience ○ Generally, more novice users can be invited to try the tools
• 126. VBS server ● Central element of the competition ○ Presents all tasks using the data projector ○ Presents scores in all categories ○ Presents feedback for current submissions ○ Additional logic (duplicates, flood of submissions, logs) ○ Also used at LSC 2018, with a revised ranking function ● Selected issue - the duplicate problem ○ The IACC dataset contains numerous duplicate videos with identical visual content (but e.g., different language) ○ A submission was regarded as wrong although the visual content was correct ○ One such case in 2018 had to be corrected after the event and changed the final ranking ○ Dataset design should explicitly avoid duplicates, or at least provide a list of duplicates; moreover, the server could provide more flexibility in changing judgements retrospectively
• 127. VBS server ● Issues with the simulation of KIS tasks ● How to “implant” visual memories? ○ Play the scene just once - users forget the scene ○ Play the scene in a loop - users exploit details -> overfitting to the task presentation ○ Play the scene in a loop + blur - colors can still be used, but users also forget important details ○ Play the scene several times in the beginning and then show a text description ● How to handle ambiguities of textual KIS? ○ Simple text - not enough details, ambiguous meaning of some sentences ○ Extending the text - simulation of a discussion - which details should be used first? ○ Still ambiguities -> teams should be allowed to ask some questions
• 128. AVS task and live judges at VBS ● Ambiguous task descriptions are problematic; it is hard to find a balance between too easy and too hard tasks ● Opinion of the user vs. opinion of the judge - who is right? ○ Users try to maximize their score - sometimes risking a wrong submission ○ Each shot is assessed just once -> the same “truth” for all teams ○ As with textual KIS - teams should be allowed to ask some questions ○ Teams have to read the TRECVID rules for human assessors! ● Calibration of multiple judges ○ With more than one live judge, calibration of opinions is necessary, even during the competition ● Balance the number of users for AVS tasks (ideally also for KIS tasks)
  • 129. VBS interaction logging ● Until 2017, there was no connection between VBS results and really used tool features to solve a task ○ VBS server received only team, video and frame IDs ○ Attempts to receive logs after competition failed ● Since 2018, an interaction log is a mandatory part of each task submission ○ How to obtain logs when the task is not solved? ○ Tools use variable modalities and interfaces - how to unify actions? ○ How to present and interpret logs? ○ How to log very frequent actions? ○ Time synchronization? ○ Log verification during test runs
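For illustration, a hypothetical unified log entry could look like the structure below; field names and categories are assumptions for this sketch, and the actual VBS logging format is defined by the organizers.

import json

# Hypothetical, simplified interaction log entry (not the official VBS schema)
log_entry = {
    "teamId": "TEAM_A",
    "memberId": 2,
    "timestamp": "2018-06-11T14:32:05.123Z",  # clocks must be synchronized
    "taskId": "KIS-V-03",
    "category": "browsing",                   # e.g., text, sketch, filter, browsing
    "action": "scroll_result_grid",
    "value": {"fromRank": 120, "toRank": 180},
}
print(json.dumps(log_entry, indent=2))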
• 130. VBS interaction logging - 8/9 teams sent logs! We can analyze both aggregated general statistics and user interactions in a given tool/task!
• 132. Where is the User in the Age of Deep Learning? ● The complexity of tasks where AI is superior to humans is obviously growing ○ Checkers -> Chess -> Go -> Poker -> DOTA? -> StarCraft? -> … -> guess user needs? ● Machine learning revolution - bigger/better training data -> better performance ● Can we collect big training data to support interactive video retrieval? ○ To cover an open world (how many concepts, actions, … do you need)? ○ To fit the needs of every user (how many contexts do we have)? ● Reinforcement learning?
• 133. Where is the User in the Age of Deep Learning? The driver has to get carefully through many situations with just basic equipment. Q: Is this also possible for video retrieval systems? Attribution: Can Pac Swire (away for a bit) The driver has to rely on themselves, but subsystems help (ABS, power steering, etc.) Attribution: Grand Parc - Bordeaux The driver just tells the car where to go. Attribution: Grendelkhan
  • 134. Where is the User in the Age of Deep Learning? ● Users already benefit from deep learning ○ HCI support - body motion, hand gestures ○ More complete and precise automatic annotations ○ Embeddings/representations for similarity search ○ 2D/3D projections for visualization of high-dimensional data ○ Relevance feedback learning (benefit from past actions) ● Promising directions ○ One-shot learning for fast inspection of new concepts ○ Multimodal joint embeddings ○ … ○ Just A Rather Very Intelligent System (J.A.R.V.I.S.) used by Tony Stark (Iron Man) ??
  • 135. Never say “this will not work!” ● If you have an idea how to solve interactive retrieval tasks - just try it! ○ Don’t be afraid your system is not specialized, you can surprise yourself and the community! ○ Paper submission in September 2019 for VBS at MMM 2020 in Seoul! ○ LSC submission in February 2019 for ICMR 2019 in Ottawa in June 2019. ○ The next TRECVID CFP will go out by mid-January, 2019. Lokoč, Jakub, Adam Blažek, and Tomáš Skopal. "Signature-based video browser." International Conference on Multimedia Modeling. Springer, Cham, 2014. Del Fabro, Manfred, and Laszlo Böszörmenyi. "AAU video browser: non- sequential hierarchical video browsing without content analysis." International Conference on Multimedia Modeling. Springer, Berlin, Heidelberg, 2012. Hürst, Wolfgang, Rob van de Werken, and Miklas Hoet. "A storyboard-based interface for mobile video browsing." International Conference on Multimedia Modeling. Springer, Cham, 2015.
  • 136. Acknowledgements This work has received funding from the European Union’s Horizon 2020 research and innovation programme, grant no. 761802, MARCONI. It was supported also by Czech Science Foundation project Nr. 17-22224S. Moreover, the work was also supported by the Klagenfurt University and Lakeside Labs GmbH, Klagenfurt, Austria and funding from the European Regional Development Fund and the Carinthian Economic Promotion Fund (KWF) under grant KWF 20214 u. 3520/26336/38165.