
Interactive Video Search: Where is the User in the Age of Deep Learning?

These are the slides of our tutorial presented on Monday, October 22, 2018, at ACM Multimedia 2018 in Seoul.


  1. 1. Interactive Video Search: Where is the User in the Age of Deep Learning? Klaus Schoeffmann1, Werner Bailer2, Jakub Lokoc3, Cathal Gurrin4, George Awad5 Tutorial at ACM Multimedia 2018, Seoul 1…Klagenfurt University, Klagenfurt, Austria 2…JOANNEUM RESEARCH, Graz, Austria 3…Charles University, Prague, Czech Republic 4…Dublin City University, Dublin, Ireland 5…National Institute of Standards and Technology, Gaithersburg, USA
  2. 2. Recommended Readings On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017. J. Lokoc, W. Bailer, K. Schoeffmann, B. Muenzer, G. Awad, IEEE Transactions on Multimedia, 2018. Interactive video search tools: a detailed analysis of the video browser showdown 2015. Claudiu Cobârzan, Klaus Schoeffmann, Werner Bailer, Wolfgang Hürst, Adam Blazek, Jakub Lokoc, Stefanos Vrochidis, Kai Uwe Barthel, Luca Rossetto. Multimedia Tools Appl. 76(4): 5539-5571 (2017). G. Awad, A. Butt, J. Fiscus, M. Michel, D. Joy, W. Kraaij, A. F. Smeaton, G. Quenot, M. Eskevich, R. Ordelman, G. J. F. Jones, and B. Huet, “Trecvid 2017: Evaluating ad-hoc and instance video search, events detection, video captioning and hyperlinking,” in Proceedings of TRECVID 2017, NIST, USA, 2017.
  3. 3. TOC 1 1. Introduction (20 min) [KS] a. General introduction b. Automatic vs. interactive video search c. Where deep learning fails d. The need for evaluation campaigns 2. Interactive video search tools (40 min) [JL] a. Demo: VIRET (1st place at VBS2018) b. Demo: ITEC (2nd place at VBS2018) c. Demo: DCU Lifelogging Search Tool 2018 d. Other tools and open source software 3. Evaluation approaches (30 min) [KS] a. Overview of evaluation approaches b. History of selected evaluation campaigns c. TRECVID d. Video Browser Showdown (VBS) e. Lifelog Search Challenge (LSC)
  4. 4. TOC 2 4. Task design and datasets (30 min) [KS] a. Task types (known item search, retrieval, etc.) b. Trade-offs: modelling real-world tasks and controlling conditions c. Data set preparation and annotations d. Available data sets 5. Evaluation procedures, results and metrics (30 min) [JL] a. Repeatability b. Modelling real-world tasks and avoiding bias c. Examples from evaluation campaigns 6. Lessons learned from evaluation campaigns (20 min) - [JL] a. Interactive exploration or query-and-browse? b. How much does deep learning help in interactive settings? c. Future challenges 7. Conclusions a. Where is the user in the age of deep learning?
  5. 5. 1. Introduction
  6. 6. Let’s Look Back a Few Years... [Marcel Worring et al., „Where Is the User in Multimedia Retrieval?“, IEEE Multimedia, Vol. 19, No. 4, Oct.-Dec. 2012, pp. 6-10 ]
  7. 7. Let’s Look Back a Few Years... ● A few statements/findings: ○ Many solutions are developed without having an explicitly defined real-world problem to solve. ○ Performance measures focus on the quality of how we answer a query. ○ MAP has become the primary target for many researchers. ○ It is certainly weird to use MAP alone when talking about users employing multimedia retrieval to solve their search problems. ○ As a consequence of MAP’s dominance, the field has shifted its focus too much toward answering a query. “Thus a better understanding of what users actually want and do when using multimedia retrieval is needed.” [Marcel Worring et al., „Where Is the User in Multimedia Retrieval?“, IEEE Multimedia, Vol. 19, No. 4, Oct.-Dec. 2012, pp. 6-10 ]
  8. 8. How Would You Search for These Images? How to describe the special atmosphere, the artistic content, the mood? by marfis75 “An image tells a thousand words.”
  9. 9. How Would You Search for This Video Scene?
  10. 10. What Users Might Want...
  11. 11. Shortcomings of Fully Automatic Video Retrieval ● Works well if ○ Users can properly describe their needs ○ System understands search intent of users ○ There is no polysemy and no context variation ○ Content features can sufficiently describe visual content ○ Computer vision (e.g., CNN) can accurately detect semantics ● Unfortunately, these conditions are rarely all true for real-world problems! “Query-and-browse results” approach
  12. 12. Performance of Video Retrieval ● Typically based on MAP ○ Computed for a specific query set and dataset ○ Results are still quite low (even in the age of deep learning!) ○ Also, results can vary heavily from one dataset to another, and from one query set to another ○ Example: TRECVID Ad-hoc Video Search (AVS), automatic runs only (dataset: IACC.3, 30 queries per year): 2016 – 9 teams, 30 runs, xInfAP min 0 / median 0.024 / max 0.054; 2017 – 8 teams, 33 runs, xInfAP min 0.026 / median 0.092 / max 0.206; 2018 – 10 teams, 33 runs, xInfAP min 0.003 / median 0.058 / max 0.121
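Since MAP and its xInfAP variant are central to this argument, a minimal sketch (in Python, with illustrative shot and query IDs that are not from any real run) of how (mean) average precision is computed over ranked result lists may be helpful:

```python
# Minimal sketch: (mean) average precision over ranked result lists.
# `run` maps each query to its ranked list of shot IDs; `qrels` maps each
# query to the set of relevant shot IDs. Names are illustrative only.

def average_precision(ranked, relevant):
    """AP = mean of precision values at the ranks of relevant results."""
    hits, precisions = 0, []
    for rank, shot in enumerate(ranked, start=1):
        if shot in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(run, qrels):
    return sum(average_precision(run[q], qrels[q]) for q in qrels) / len(qrels)

# Example: one query with relevant shots {"s2", "s5"}.
print(mean_average_precision({"q1": ["s1", "s2", "s3", "s5"]},
                             {"q1": {"s2", "s5"}}))  # -> (0.5 + 0.5) / 2 = 0.5
```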
  13. 13. Deep Learning Can Fail Easily [J. Su, D.V. Vargas, and K. Sakurai. One pixel attack for fooling neural networks. 2018. arXiv] How to deal with noisy data/videos?
  14. 14. Deep Learning Can Fail Easily Output of YOLO v2 Andrew Ng talk “Artificial Intelligence is the New Electricity”: “Anything a typical human can do with < 1s of thought we can probably now or soon automate with AI”
  15. 15. Deep Learning Can Fail Easily Nguyen A, Yosinski J, Clune J. Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In Computer Vision and Pattern Recognition (CVPR '15), IEEE, 2015
  16. 16. The Power of Human Computation Example from the Video Browser Showdown 2015: System X: shot and scene detection, concept detection (SIFT, VLAD, CNNs), similarity search. System Y: tiny thumbnails only, powerful user. Outperformed system X and was finally ranked 3rd! Moumtzidou, Anastasia, et al. "VERGE in VBS 2017." International Conference on Multimedia Modeling. Springer, Cham, 2017. Hürst, Wolfgang, Rob van de Werken, and Miklas Hoet. "A storyboard-based interface for mobile video browsing." International Conference on Multimedia Modeling. Springer, Cham, 2015.
  17. 17. Interactive Video Retrieval Approach ● Assume a smart and interactive user ○ That knows about the challenges and shortcomings of simple querying ○ But might also know how to circumvent them ○ Could be a digital native! ● Give him/her full control over the search process ○ Provide many query and interaction features ■ Querying, browsing, navigation, filtering, inspecting/watching ● Assume an iterative/exploratory search process ○ Search - Inspect - Think - Repeat ○ “Will know it when I see it” ○ Could include many iterations! ○ Instead of “query-and-browse results”
  18. 18. What Users Might Need... Concept Search Browsing features Motion Sketch Search History Hudelist, Marco A., Christian Beecks, and Klaus Schoeffmann. "Finding the chameleon in your video collection." Proceedings of the 7th International Conference on Multimedia Systems. ACM, 2016.
  19. 19. Typical Query Types of Video Retrieval Tools ● Query-by-text ○ Enter keywords to match with available or extracted text (e.g., metadata, OCR, ASR, concepts, objects...) ● Query-by-concept ○ Show content for a specific class/category from concept detection (e.g., from ImageNet) ● Query-by-example ○ Provide example image/scene/sound ● Query-by-filtering ○ Filter content by some metadata or content feature (time, color, edge, motion, …) ● Query-by-sketch ○ Provide sketch of image/scene ● Query-by-dataset-example ○ Look for similar but other results ● Query-by-exploration ○ Start by looking around / browsing ○ Needs appropriate visualization ● Query-by-inspection ○ Inspect single clips, navigate Search in multimedia content (particularly video) is a highly interactive process! Users want to look around, try different query features, inspect results, refine queries, and start all over again! Automatic Interactive
  20. 20. Evaluation of Interactive Video Retrieval ● Interfaces are inherently developed for human users ● Every user might be different ○ Different culture, knowledge, preferences, experiences, ... ○ Even the same user at a different time ● Video search interfaces need to be evaluated with real users... ○ No simulations! ○ User studies and campaigns (TRECVID, MediaEval, VBS, LSC)! ○ Find out how well users perform with a specific system ● ...and with real data! ○ Real videos “in the wild” (e.g., IACC.1 and V3C dataset) ○ Actual queries that would make sense in practice ○ Comparable evaluations (same data, same conditions, etc.) International competitions Datasets
  21. 21. Only same dataset, query, time, room/condition, ... ...allows for true comparative evaluation!
  22. 22. Where is the User in the Age of Deep Learning?
  23. 23. 2. Interactive Video Search Tools Common architecture, components and top ranked tools
  24. 24. Common Architecture: What are the basic video preprocessing steps? What models are used? Where does interactive search help?
  25. 25. Common Architecture - Temporal Segmentation M. Gygli. Ridiculously Fast Shot Boundary Detection with Fully Convolutional Neural Networks. https://arxiv.org/pdf/1705.08214.pdf 1. Compute a score based on a distance of frames 2. Threshold-based decision (fixed/adaptive)
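As an illustration of the classic two-step recipe on this slide, here is a minimal sketch of a histogram-difference baseline with a fixed threshold (assuming OpenCV; this is not the fully convolutional approach of the cited Gygli paper):

```python
# Minimal sketch of threshold-based shot boundary detection: a classic
# histogram-difference baseline, not the FCN method cited above.
import cv2

def detect_shot_boundaries(video_path, threshold=0.5):
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, frame_idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # 1. Score based on a distance of consecutive frames
            score = 1.0 - cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            # 2. Fixed-threshold decision (an adaptive variant could compare
            #    against a sliding-window statistic of recent scores instead)
            if score > threshold:
                boundaries.append(frame_idx)
        prev_hist, frame_idx = hist, frame_idx + 1
    cap.release()
    return boundaries
```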
  26. 26. Common Architecture - Semantic Search Classification and embedding by popular Deep CNNs AlexNet (A. Krizhevsky et al., 2012) GoogLeNet (Ch. Szegedy et al., 2015) ResNet (K. He et al., 2015) NasNet (B. Zoph et al., 2018) ... Object detectors appear too (YOLO, SSD) Joint embedding models? VQA?
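A hedged sketch of how such a network can serve as a feature extractor (assuming PyTorch and torchvision ≥ 0.13, which the slide does not prescribe): the resulting vectors can feed nearest-neighbour similarity search, while keeping the original classifier head would instead yield ImageNet concept scores.

```python
# Hedged sketch: keyframe embeddings from a pretrained CNN (torchvision
# ResNet-50). With the classifier head removed, the 2048-d output is a
# descriptor for similarity search; with the head kept, the softmax output
# gives ImageNet concept scores. Assumes torch, torchvision and PIL.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = torch.nn.Identity()   # drop the classifier -> 2048-d embedding
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed_keyframe(path):
    with torch.no_grad():
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        return model(x).squeeze(0)          # shape: (2048,)
```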
  27. 27. Common Architecture - Sketch-based Search Sketches from memory: often only part of the scene; edges often do not match; colors often do not match => invariance needed
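One common way to obtain such invariance is to compare very coarse position-color signatures rather than exact pixels or edges. The sketch below (assuming NumPy and Pillow; an illustration, not the method of any particular VBS tool) reduces an image to an N x N grid of average colors:

```python
# Hedged sketch: a coarse position-color signature for sketch-based search.
# The image is shrunk to an n x n grid of averaged colors; comparing grids
# cell by cell tolerates imprecise edges and colors in the user's sketch.
import numpy as np
from PIL import Image

def color_grid(path, n=4):
    img = np.asarray(Image.open(path).convert("RGB").resize((n, n), Image.BILINEAR))
    return img.astype(np.float32) / 255.0      # shape: (n, n, 3)

def grid_distance(grid_a, grid_b):
    # Mean per-cell Euclidean distance in RGB; smaller means more similar.
    return float(np.linalg.norm(grid_a - grid_b, axis=-1).mean())
```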
  28. 28. Common Architecture - Limits ● The ranking models used have their limits ○ Missed frames ○ Wrong annotations ○ Inaccurate similarity function ● Still, finding a shot of a class is often easy (see later), but finding one particular shot, or all shots of a class? T. Soucek. Known-Item Search in Image Datasets Using Automatically Detected Keywords. BC thesis, 2018.
  29. 29. Common Architecture at VBS - Interactive Search Hudelist & Schoeffmann. An Evaluation of Video Browsing on Tablets with the ThumbBrowser. MMM2017 Goeau et al., Table of Video Content, ICME 2007
  30. 30. Aspects of Flexible Interactive Video Search
  31. 31. VIRET tool (Winner of VBS 2018, 3rd at LSC 2018) Filters Query by text Query by color Query by image Video player Top ranked frames by a query Representative frames from the selected video Frame-based retrieval system with temporal context visualization. Focus on simple interface! Jakub Lokoc, Tomas Soucek, Gregor Kovalcik: Using an Interactive Video Retrieval Tool for LifeLog Data. LSC@ICMR 2018: 15-19, ACM Jakub Lokoc, Gregor Kovalcik, Tomas Soucek: Revisiting SIRET Video Retrieval Tool. VBS@MMM 2018: 419-424, Springer
  32. 32. VIRET Tool (Winner of VBS 2018)
  33. 33. ITEC Tool Primus, Manfred Jürgen, et al. "The ITEC Collaborative Video Search System at the Video Browser Showdown 2018." International Conference on Multimedia Modeling. Springer, Cham, 2018.
  34. 34. ITEC tool (2nd at VBS 2018 and LSC 2018) https://www.youtube.com/watch?v=CA5kr2pO5b
  35. 35. LSC (Geospatial Browsing) W Hürst, K Ouwehand, M Mengerink, A Duane and C Gurrin. Geospatial Access to Lifelogging Photos in Virtual Reality. The Lifelog Search Challenge 2018 at ACM ICMR 2018.
  36. 36. LSC (Interactive Video Retrieval) J. Lokoč, T. Souček and G. Kovalčík. Using an Interactive Video Retrieval Tool for LifeLog Data. The Lifelog Search Challenge 2018 at ACM ICMR 2018. (3rd highest performing system, but the same system won VBS 2018)
  37. 37. LSC (LiveXplore) A Leibetseder, B Muenzer, A Kletz, M Primus and K Schöffmann. liveXplore at the Lifelog Search Challenge 2018. The Lifelog Search Challenge 2018 at ACM ICMR 2018. (2nd highest performing system)
  38. 38. VR Lifelog Search Tool (winner of LSC 2018) Large lifelog archive with time-limited KIS topics Multimodal (visual concept and temporal) query formulation Ranked list of visual imagery (image per minute) Gesture-based manipulation of results A Duane, C Gurrin & W Hürst. Virtual Reality Lifelog Explorer for the Lifelog Search Challenge at ACM ICMR 2018. The Lifelog Search Challenge 2018 at ACM ICMR 2018. Top Performing System.
  39. 39. https://www.youtube.com/watch?v=aocN9eOuRv0
  40. 40. vitrivr (University of Basel) ● Open-Source content-based multimedia retrieval stack ○ Supports images, music, video and 3D-models concurrently ○ Used for various applications both in and outside of academia ○ Modular architecture enables easy extension and customization ○ Compatible with all major operating systems ○ Available from vitrivr.org ● Participated several times in VBS (originally as IMOTION) [Credit: Luca Rossetto]
  41. 41. vitrivr (University of Basel) ● System overview [Credit: Luca Rossetto]
  42. 42. vitrivr (University of Basel) [Credit: Luca Rossetto]
  43. 43. vitrivr (University of Basel) [Credit: Luca Rossetto]
  44. 44. vitrivr (University of Basel) [Credit: Luca Rossetto]
  45. 45. 3. Evaluation Approaches
  46. 46. Overview of Evaluation Approaches ● Qualitative user study/survey ○ Self report: ask users about their experience with the tool, thinking aloud tests, etc. ○ Using psychophysiological measurements (e.g., electrodermal activity - EDA) ● Log-file analysis ○ Analyze server and/or client-side interaction patterns ○ Measure time needed for certain actions, etc. ● Question answering ○ Ask questions about content (open, multiple choice) to assess which content users found ● Indirect/task-based evaluation (Cranfield paradigm) ○ Pose certain tasks, measure the effectiveness of solving the task ○ Quantitative user study with many users and trials ○ Open competition, as in VBS, LSC, and TRECVID
  47. 47. Properties of Evaluation Approaches ● Availability and level of detail of ground truth ○ None (e.g., questionnaires, logs) ○ Detailed and complete (e.g., retrieval tasks) ● Effort during experiments ○ Low (automatic check against ground truth) ○ Moderate (answers need to be checked by a human, e.g., live judges) ○ High (observation of or interview with participants) ● Controlled conditions ○ All users in same room with same setup (typical user study) vs. participants via online survey ● Statistical tests! ○ We can only conclude that one interactive tool is better than another if there is statistically significant proof ○ Tests like ANOVA, t-tests, Wilcoxon signed-rank tests, … ○ Consider prerequisites of specific test (e.g., normal distribution)
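For the statistical-test bullet, a minimal sketch of a paired comparison of two tools on the same tasks using the Wilcoxon signed-rank test (assuming SciPy; the solve times are made-up illustrative numbers):

```python
# Minimal sketch: paired comparison of two interactive tools on the same tasks
# using the Wilcoxon signed-rank test (non-parametric, no normality assumption).
# The per-task solve times below are made-up illustrative numbers.
from scipy.stats import wilcoxon

times_tool_a = [75, 120, 48, 210, 95, 60, 180, 130]   # seconds per task
times_tool_b = [90, 150, 55, 240, 92, 85, 300, 160]

stat, p_value = wilcoxon(times_tool_a, times_tool_b)
print(f"W={stat:.1f}, p={p_value:.3f}")
# Only claim that tool A is better if p is below the chosen significance level
# (and check the prerequisites of the test, e.g., paired measurements).
```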
  48. 48. Example: Comparing Tasks and User Study ● Experiment compared ○ Question answering ○ Retrieval tasks ○ User study with questionnaire ● Materials ○ Interactive search tool with keyframe visualisation ○ TRECVID BBC rushes data set (25 hrs) ○ Questionnaire adapted from TRECVID 2004 ○ 19 users, doing at least 4 tasks W. Bailer and H. Rehatschek, Comparing Fact Finding Tasks and User Survey for Evaluating a Video Browsing Tool. ACM Multimedia 2009.
  49. 49. Example: Comparing Tasks and User Study ● TVB1 I was familiar with the topic of the query. ● TVB3 I found that it was easy to find clips that are relevant. ● TVB4 For this topic I had enough time to find enough clips. ● TVB5 For this particular topic the tool interface allowed me to do browsing efficiently. ● TVB6 For this particular topic I was satisfied with the results of the browsing. W. Bailer and H. Rehatschek, Comparing Fact Finding Tasks and User Survey for Evaluating a Video Browsing Tool. ACM Multimedia 2009.
  50. 50. Using Electrodermal Activity (EDA) Measuring EDA during retrieval tasks (A, B, C, D) with an interactive search tool, 14 participants C. Martinez-Peñaranda, et al., A Psychophysiological Approach to the Usability Evaluation of a Multi-view Video Browsing Tool,” MMM 2013.
  51. 51. History of Selected Evaluation Campaigns ● Evaluation campaigns for video analysis and search started in early 2000s ○ Most well-known are TRECVID and MediaEval (previously ImageCLEF) ○ Both spin-offs from text retrieval benchmarks ● Several ones include tasks that are relevant to video search ● Most tasks are designed to be fully automatic ● Some allow at least interactive submissions as an option ○ Most submissions are usually still for the automatic type ● Since 2007, live evaluations with audience have been organized at major international conferences ○ Videolympics, VBS, LSC
  52. 52. History of Selected Evaluation Campaigns
  53. 53. History of Selected Evaluation Campaigns
  54. 54. TRECVID ● Workshop series (2001 – present) → http://trecvid.nist.gov ● Started as a track in the TREC (Text REtrieval Conference) evaluation benchmark. ● Became an independent evaluation benchmark in 2003. ● Focus: content-based video analysis, retrieval, detection, etc. ● Provides data, tasks, and uniform, appropriate scoring procedures ● Aims for realistic system tasks and test collections: ○ Unfiltered data ○ Focus on relatively high-level functionality (e.g. interactive search) ○ Measurement against human abilities ● Forum for the ○ exchange of research ideas and for ○ the discussion of research methodology – what works, what doesn’t, and why
  55. 55. TRECVID Philosophy ● TRECVID is a modern example of the Cranfield tradition ○ Laboratory system evaluation based on test collections ● Focus on advancing the state of the art from evaluation results ○ TRECVID’s primary aim is not competitive product benchmarking ○ Experimental workshop: sometimes experiments fail! ● Laboratory experiments (vs. e.g., observational studies) ○ Sacrifice operational realism and broad scope of conclusions ○ For control and information about causality – what works and why? ○ Results tend to be narrow, at best indicative, not final ○ Evidence grows as approaches prove themselves repeatedly, as part of various systems, against various test data, over years
  56. 56. TRECVID Datasets HAVIC Soap opera (since 2013) Social media (since 2016) Security cameras (since 2008)
  57. 57. Teams actively participated (2016-2018) INF CMU; Beijing University of Posts and Telecommunication; University Autonoma de Madrid; Shandong University; Xian JiaoTong University Singapore kobe_nict_siegen Kobe University, Japan; National Institute of Information and Communications Technology, Japan; University of Siegen, Germany UEC Dept. of Informatics, The University of Electro-Communications, Tokyo ITI_CERTH Information Technology Institute, Centre for Research and Technology Hellas ITEC_UNIKLU Klagenfurt University NII_Hitachi_UIT National Institute Of Informatics.; Hitachi Ltd; University of Information Technology (HCM-UIT) IMOTION University of Basel, Switzerland; University of Mons, Belgium; Koc University, Turkey MediaMill University of Amsterdam ; Qualcomm Vitrivr University of Basel Waseda_Meisei Waseda University; Meisei University VIREO City University of Hong Kong EURECOM EURECOM FIU_UM Florida International University, University of Miami NECTEC National Electronics and Computer Technology Center NECTEC RUCMM Renmin University of China NTU_ROSE_AVS ROSE LAB, NANYANG TECHNOLOGICAL UNIVERSITY SIRET SIRET Department of Software Engineering, Faculty of Mathematics and Physics, Charles University UTS_ISA University of Technology Sydney
  58. 58. VideOlympics ● Run the same year’s TRECVID search tasks live in front of audience ● Organized at CIVR 2007-2009 Photos: Cees Snoek, https://www.flickr.com/groups/civr2007/
  59. 59. Video Browser Showdown (VBS) ● Video search competition (annually at MMM) ○ Inspired by VideOlympics ○ Demonstrates and evaluates state-of-the-art interactive video retrieval tools ○ Also, entertaining event during welcome reception at MMM ● Participating teams solve retrieval tasks ○ Known-item search (KIS) tasks - one result - textual or visual ○ Ad-hoc video search (AVS) tasks - many results - textual ○ In large video archive (originally in 60 mins videos only) ● Systems are connected to the VBS Server ○ Presents tasks in live manner ○ Evaluates submitted results of teams (penalty for false submissions) First VBS in Klagenfurt, Austria (only search in a single video)
  60. 60. Video Browser Showdown (VBS) 2012: Klagenfurt 11 teams KIS, single video (v) 2013: Huangshan 6 teams KIS, single video (v+t) 2014: Dublin 7 teams KIS, single video and 30h archive (v+t) 2015: Sydney 9 teams KIS, 100h archive (v+t) 2016: Miami 9 teams KIS, 250h archive (v+t) 2017: Reykjavik 6 teams KIS, 600h archive (v+t) AVS, 600h archive (t) 2018: Bangkok 9 teams KIS, 600h archive (v+t) AVS, 600h archive (t) 2019: Thessaloniki 6 teams KIS, 1000h archive (v+t) AVS, 1000h archive (t)
  61. 61. Video Browser Showdown (VBS) VBS Server: • Presents queries • Shows remaining time • Computes scores • Shows statistics/ranking
  62. 62. Video Browser Showdown (VBS) https://www.youtube.com/watch?v=tSlYFNlsn8U&t=140
  63. 63. Lifelog Search Challenge (LSC 2018) ● New (annual) search challenge at ACM ICMR ● Focus on a lifelog retrieval challenge ○ over multimodal lifelog data ○ Motivated by the fact that ever larger personal data archives are being gathered; the advent of AR technologies and the increasing veracity of data means that archives of life experiences are likely to become more commonplace ● To be useful, the data should be searchable… ○ and for lifelogs, that means interactive search
  64. 64. Lifelog Search Challenge (Definition) Dodge and Kitchin (2007) refer to lifelogging as “a form of pervasive computing, consisting of a unified digital record of the totality of an individual’s experiences, captured multi-modally through digital sensors and stored permanently as a personal multimedia archive”.
  65. 65. Lifelog Search Challenge (Motivation)
  66. 66. Lifelog Search Challenge (Lifelogging)
  67. 67. Lifelog Search Challenge (Data) One month archive of multimodal lifelog data, extracted from NTCIR-13 Lifelog collection, including: ○ Wearable camera images at a rate of 3-5 / minute & concept annotations. ○ Biometrics ○ Activity logs ○ Media consumption ○ Content created/consumed u1_2016-08-15_050922_1, 'indoor', 0.991932, 'person', 0.9719478, 'computer', 0.309054524
  68. 68. Lifelog Search Challenge (One Minute)
  <minute id="496">
    <location>
      <name>Home</name>
    </location>
    <bodymetrics>
      <calories>2.8</calories>
      <gsr>7.03E-05</gsr>
      <heart-rate>94</heart-rate>
      <skin-temp>86</skin-temp>
      <steps>0</steps>
    </bodymetrics>
    <text>author,1,big,2,dout,1,revis,1,think,1,while,1</text>
    <images>
      <image>
        <image-id>u1_2016-08-15_050922_1</image-id>
        <image-path>u1/2016-08-15/20160815_050922_000.jpg</image-path>
        <annotations>'indoor', 0.985, 'computer', 0.984, 'laptop', 0.967, 'desk', 0.925</annotations>
      </image>
    </images>
  </minute>
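A minimal sketch of how a search tool might read such a <minute> record, using only the Python standard library; the element names follow the sample above, everything else is illustrative:

```python
# Minimal sketch: reading one <minute> record of the LSC collection with the
# standard library; element names follow the sample shown on the slide.
import xml.etree.ElementTree as ET

def parse_minute(xml_string):
    minute = ET.fromstring(xml_string)
    record = {
        "id": minute.get("id"),
        "location": minute.findtext("location/name"),
        "heart_rate": minute.findtext("bodymetrics/heart-rate"),
        "images": [],
    }
    for image in minute.findall("images/image"):
        record["images"].append({
            "id": image.findtext("image-id"),
            "path": image.findtext("image-path"),
            "annotations": image.findtext("annotations"),
        })
    return record
```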
  69. 69. Lifelog Search Challenge (Topics)
  <Description timestamp="0">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background.</Description>
  <Description timestamp="30">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant on my right side.</Description>
  <Description timestamp="60">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant on my right side. There are keys on the table in front of me and you can see the cafe sign on the left side. I walked to the cafe and it took less than two minutes to get there.</Description>
  <Description timestamp="90">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant on my right side. There are keys on the table in front of me and you can see the cafe sign on the left side. I walked to the cafe and it took less than two minutes to get there. My colleague in the foreground is wearing a white shirt and drinking coffee from a red paper cup.</Description>
  <Description timestamp="120">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant on my right side. There are keys on the table in front of me and you can see the cafe sign on the left side. I walked to the cafe and it took less than two minutes to get there. My colleague in the foreground is wearing a white shirt and drinking coffee from a red paper cup. Immediately after having the coffee, I drive to the shop.</Description>
  <Description timestamp="150">In a coffee shop with my colleague in the afternoon called the Helix with at least one person in the background and a plastic plant on my right side. There are keys on the table in front of me and you can see the cafe sign on the left side. I walked to the cafe and it took less than two minutes to get there. My colleague in the foreground is wearing a white shirt and drinking coffee from a red paper cup. Immediately after having the coffee, I drive to the shop. It is a Monday.</Description>
  Temporally enhanced topic descriptions that get more detailed (easier) every thirty seconds. The topics have one or only a few relevant items in the collection.
  70. 70. Lifelog Search Challenge 2018 (Six Teams)
  71. 71. 4. Task Design and Datasets Task types, trade-offs, datasets, annotations
  72. 72. Task Types: Introduction ● Searching for content can be modelled as different task types ○ Choice impacts dataset preparation, annotations, evaluation methods ○ and the way to run the experiments ● Some of the task types here have fully automatic variants… ○ out of scope, but they may serve as baselines to compare to ● Tasks can be categorized by the target and the formulation of the query ○ Particular target item vs. set or class ○ only one target item in the dataset, or ○ multiple occurrences of an instance, or a class of relevant items/segments ○ Definition of query ○ example, given in a specific modality ○ precise definition vs. fuzzy idea
  73. 73. Task Types (at Campaigns): Overview
  74. 74. Task Types (at Campaigns): Overview (figure): how clear is the search intent? Over a given video dataset, tasks range from known-item search and VIS tasks to AVS tasks, with the query given as an example, visually, textually, abstractly, or not at all (“this is how I use web video search”). [What is the role of similarity for KIS at Video Browser Showdown? SISAP'18, Peru]
  75. 75. Task Type: Visual Instance Search ● User holds a digital representation of a relevant example of the needed information ● Example or its features can be sent to system ● User does not need to translate example into query representation ● e.g., trademark/logo detection
  76. 76. Task Types: Known Item Search (KIS) ● User sees/hears/reads a representation ○ Target item is described or presented ● Used in VBS & LSC ● Exactly one target semantics ○ Representation of exactly one relevant item/segment in dataset ● Models user’s (partly faded) memories ○ user has a memory of content to be found, might be fuzzy ● User must translate representation to provided query methods ○ The complexity of this translation depends significantly on the modality ■ e.g., visual is usually easier than textual, which leaves more room for interpretation ○ Relation of/to content is important too ■ e.g. searching in own life log media vs. searching in media collection on the web “on a busy street”
  77. 77. Task Types: Ad-hoc Search ● User sees/hears/reads a representation of the needed information ○ Target item is described or presented ● Many targets semantics ○ Representation of a broader set/class of relevant items/segments ○ cf. TRECVID AVS task ● Models user’s rough memories ○ user has only a memory of the type of relevant content, not about details ● Similar issues of translating the representation like for KIS ○ but due to broader set of relevant items the correct interpretation of textual information is a less critical issue ● Raises issues of what is considered within/without scope of a result set ○ e.g., partly visible, visible on a screen in the content, cartoon/drawing versions, … ○ TRECVID has developed guidelines for annotation of ground truth
  78. 78. Task Types: Exploration ● User does not start from a clear idea/query of the information need ○ No concrete query, just inspects dataset ○ Browsing and exploring may lead to identifying useful content ● Reflects a number of practical situations, but very hard to evaluate ○ User simply cannot describe the content ○ User does not remember content but would recognize it ○ Content inspection for the sake of interest ○ Digital forensics ● No known examples of such tasks in benchmarking campaigns due to the difficulties with evaluation Demo: https://www.picsbuffet.com/ Barthel, Kai Uwe, Nico Hezel, and Radek Mackowiak. "Graph-based browsing for large video collections." International Conference on Multimedia Modeling. Springer, Cham, 2015.
  79. 79. Task Design is About Trade-offs: Aspects to consider. Tasks shall ○ model real-world content search problems ■ in order to assess whether tools are usable for these problems ○ set controlled conditions ■ to enable reliable assessment ○ be repeatable ■ to compare results from different evaluation sessions ○ avoid bias towards certain features or query methods. Trade-offs: many real-world problems involve very fuzzy information needs vs. well-defined queries are best suited for evaluation; users remember more about the scene when they start looking through examples vs. information in the task should be provided at defined points in time; during evaluation sessions, relevant shots may be discovered and the ground truth updated vs. for repeatable evaluation, a fixed ground truth set is desirable; although real-world tasks may involve time pressure, it would be best to measure the time until the task is solved vs. time limits are needed in evaluation sessions for practical reasons
  80. 80. Task Selection (KIS @ VBS) ● Known duplicates: ○ List of known (partial) duplicates from matching metadata and file size ○ Content-based matches ● Uniqueness inside same and similar content: ○ Ensure unambiguous target ○ May be applied to sequence of short shots rather than single shot ● Complexity of segment: ○ Rough duration of 20s ○ Limited number of shots ● Describe-ability: ○ Textual KIS requires segments that can be described with limited amount of text (less shots, salient location or objects, etc.)
  81. 81. VBS KIS Task Selection - Examples ● KIS Visual (video 37756, frame 750-1250) ○ Short shots, varying content - hard to describe as text, but unique sequence ● KIS Textual (video 36729, frame 4047-4594) ○ @0 sec: “Shots of a factory hall from above. Workers transporting gravel with wheelbarrows. Other workers putting steel bars in place.” ○ @100 sec: “The hall has Cooperativa Agraria written in red letters on the roof.” ○ @200 sec: “There are 1950s style American cars and trucks visible in one shot.”
  82. 82. Presenting Queries (VBS) ● Example picture? ○ allow taking pictures of visual query clips? ● Visual ○ Play query once ■ one chance to memorize, but no chance to check possibly relevant shot against query — in real life, one cannot visually check, but one does not forget what one knew at query time ○ Repeat query but blur increasingly ■ basic information is there, but not possible to check details ● Textual ○ User's memory is for most people also visual ○ Simulate case where retrieval expert is asked to find content ■ expert could ask questions ○ Provide incremental details about the scene (but initial piece of information must already be unambiguous for KIS)
  83. 83. Task Participants ● Typically developers of tools participate in evaluation campaigns ○ They know how to translate information requests into queries ○ Knowledge of user has huge impact on performance that can be achieved ● “Novice session” ○ Invite members from the audience to use the tools, after a brief introduction ○ Provides insights about usability and complexity of tool ○ In real use cases, users are rather domain experts than retrieval experts, thus this condition is important to test ○ Selection of novices is an issue for comparing results ○ Question of whether/how scores of expert and novice tasks shall be combined
  84. 84. Real-World Datasets ● Research needs reproducible results ○ standardized and free datasets are necessary ● One problem with many datasets: ○ the current state of web video in the wild is not, or is no longer, represented accurately by them [Rossetto & Schuldt] ● Hence, we also need datasets that model the real world ○ One such early effort is the V3C dataset (see later) Rossetto, L., & Schuldt, H. (2017). Web video in numbers - an analysis of web-video metadata. arXiv preprint arXiv:1707.01340.
  85. 85. Videos in the Wild Age-distribution of common video collections vs what is found in the wild Rossetto, L., & Schuldt, H. (2017). Web video in numbers-an analysis of web-video metadata. arXiv preprint arXiv:1707.01340.
  86. 86. Videos in the Wild Duration-distribution of common video collections vs what is found in the wild Rossetto, L., & Schuldt, H. (2017). Web video in numbers-an analysis of web-video metadata. arXiv preprint arXiv:1707.01340.
  87. 87. Dataset Preparation and Annotations ● Data set = content + annotations for specific problem ● Today, content is everywhere ● Annotations are still hard to get ○ External data (e.g., archive documentation) often not available at sufficient granularity and time-indexed ○ Creation by experts is prohibitively costly ● Approaches ○ Crowdsourcing (with different notions of “crowd” impacting quality) ○ Reduce amount of annotations needed ○ Generate data set and ground truth
  88. 88. Collaborative Annotation Initiatives from TRECVID participants 2003-2013 ○ http://mrim.imag.fr/tvca/ ○ Concept annotations for high-level feature extraction/semantic indexing tasks ○ As data sets grew in size, the percentage of the content that could be annotated declined ○ Use of active learning to select samples where annotation brings highest benefit S. Ayache and G. Quénot, "Video Corpus Annotation using Active Learning", ECIR 2008.
  89. 89. Crowdsourcing with the General Public ● Use platforms like Amazon Mechanical Turk to collect data ○ Main issue, however, is that annotations are noisy and unreliable ● Solutions ○ Multiple annotations and majority votes ○ Involve tasks that help assessing the confidence to a specific worker ■ e.g., asking easy questions first, to verify facts about image ○ More sophisticated aggregation strategies ● MediaEval ran tasks in 2013 and 2014 ○ Annotation of fashion images and timed comments about music B. Loni, M. Larson, A. Bozzon, L. Gottlieb, Crowdsourcing for Social Multimedia at MediaEval 2013: Challenges, Data set, and Evaluation, MediaEval WS Notes, 2013. K. Yadati, P. S.N. Shakthinathan Chandrasekaran Ayyanathan, M. Larson, Crowdsorting Timed Comments about Music: Foundations for a New Crowdsourcing Task, MediaEval WS Notes, 2014.
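For the majority-vote bullet above, a minimal sketch of aggregating noisy crowd labels (the labels are illustrative only):

```python
# Minimal sketch of majority-vote aggregation of noisy crowd annotations.
# `labels` maps each item to the list of labels collected from workers.
from collections import Counter

def majority_vote(labels):
    aggregated = {}
    for item, votes in labels.items():
        label, count = Counter(votes).most_common(1)[0]
        aggregated[item] = (label, count / len(votes))   # label + agreement ratio
    return aggregated

print(majority_vote({"img1": ["dress", "dress", "skirt"],
                     "img2": ["shoe", "shoe", "shoe"]}))
# {'img1': ('dress', 0.67), 'img2': ('shoe', 1.0)} (agreement ratios approximate)
```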
  90. 90. Pooling ● Exhaustive relevance judgements are costly for large data sets ● Annotate pool of top k results returned from participating systems ● Pros ○ Efficient ○ Results are correct for all participants, not an approximation ● Cons ○ Annotations can only be done after experiment ○ Repeating the experiment with new/updated systems requires updating the annotation (or getting approximate results) Sri Devi Ravana et al., Document-based approach to improve the accuracy of pairwise comparison in evaluating information retrieval systems, ASLIB J. Inf. Management, 67(4), 2015.
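A minimal sketch of the pooling idea for one topic (run and shot IDs are illustrative):

```python
# Minimal sketch of pooling: only the union of the top-k results of all
# submitted runs is sent to the assessors for relevance judgment.
def build_pool(runs, k=100):
    """runs: dict run_id -> ranked list of shot IDs (for one topic)."""
    pool = set()
    for ranked in runs.values():
        pool.update(ranked[:k])
    return pool

# Shots outside the pool are treated as not relevant (or sampled, as in infAP),
# which is why repeating the experiment with a new system may require
# additional judgments.
```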
  91. 91. Live Annotation ● Assessment of incoming results during competition ● Used in VBS 2017-2018 ● Addresses issues of incomplete or missing ground truth ○ e.g., created using pooling, or new queries ● Pros ○ Provides immediate feedback ○ Avoids biased results from ground truth pooled from other systems ● Cons ○ Done under time pressure ○ Not possible to review other similar cases - may cause inconsistency in decisions ○ Multi-annotator agreement would be needed (impacts decision time and number of annotators needed)
  92. 92. Live Annotation – Example from VBS 2018 ● 1,848 shots judged live ○ About 40% of submitted shots were not in the TRECVID ground truth ● Verification experiment ○ 1,383 were judged again later ○ Judgements diverged for 23% of the shots; in 88% of those cases the live judgement was “incorrect” ● Judges seem to decide “incorrect” when in doubt ○ While the ground truth for later use is biased, conditions were still the same for all teams in the room ● Need to set up clear rules for live judges ○ Like those used by NIST for TRECVID annotations (Figure: examples of diverging judgements for shots of the same video, e.g., Judge 1: false / Judge 2: true)
  93. 93. Assembling Content and Ground Truth ● MPEG Compact Descriptors for Video Analysis (CDVA) ○ Dataset for the evaluation of visual instance search ○ 23,000 video clips (from 1 min to over 1 hr) ● Annotation effort too high ○ Generate query and reference clips from three disjoint subsets ○ Randomly embed relevant segment in noisy material ○ Apply transformations to query clips ○ Ground truth is generated from the editing scripts ○ Created 9,715 queries, 5,128 references
  94. 94. Process for LSC Dataset Generation ● Lifelog data has an inevitable privacy/GDPR compliance concern ● Required a cleaning/anonymization process for images, locations & words ○ Lifelogger deletes private/embarrassing images, validated by researcher ○ Images resized down (1024x768) to remove readable text ○ faces automatically & manually blurred; locations anonymized ○ Manually generated blacklist of terms for removal from textual data
  95. 95. Available Datasets ● Past TRECVID data ○ https://www-nlpir.nist.gov/projects/trecvid/past.data.table.html ○ Different types of usage conditions and license agreements ○ Ground truth, annotations and partly extracted features are available ● Past MediaEval data ○ http://www.multimediaeval.org/datasets/index.html ○ Mostly directly downloadable, annotations and sometimes features available ● Some freely available data sets ○ TRECVID IACC.1-3 ○ TRECVID V3C1 (starting 2019), will also be used for VBS (download available) ○ BLIP 10,000 http://skuld.cs.umass.edu/traces/mmsys/2013/blip/Blip10000.html ○ YFCC100M https://webscope.sandbox.yahoo.com/catalog.php?datatype=i&did=67 ○ Stanford I2V http://purl.stanford.edu/zx935qw7203
  96. 96. Available Datasets ● MPEG CDVA data set ○ Mixed licenses, partly CC, partly specific conditions of content owners ● NTCIR-Lifelog datasets ○ NTCIR-12 Lifelog - 90 days of mostly visual and activity data from 3 lifeloggers (100K+ images) ■ ImageCLEF 2017 dataset a subset of NTCIR-12 ○ NTCIR-13 Lifelog - 90 days of richer media data from 2 lifeloggers (95K images) ■ LSC 2018 - 30 days of visual, activity, health, information & biometric data from one lifelogger ■ ImageCLEF 2018 dataset a subset of NTCIR-13 ○ NTCIR-14 - 45 days of visual, biometric, health, activity data from two lifeloggers
  97. 97. Example: V3C Dataset Vimeo Creative Commons Collection ○ The Vimeo Creative Commons Collection (V3C) [2] consists of ‘free’ video material sourced from the web video platform vimeo.com. It is designed to contain a wide range of content which is representative of what is found on the platform in general. All videos in the collection have been released by their creators under a Creative Commons License which allows for unrestricted redistribution. Rossetto, L., Schuldt, H., Awad, G., & Butt, A. (2019). V3C – a Research Video Collection. Proceedings of the 25th International Conference on MultiMedia Modeling.
  98. 98. 5. Evaluation procedures, results and metrics Interactive and automatic retrieval
  99. 99. Evaluation settings for interactive retrieval tasks ● For each tool, human in the loop ... ○ Same room, projector, time pressure ○ Expert and novice users ● … compete in simulated tasks (KIS, AVS, ...) ○ Shared dataset in advance (V3C1 1000h) ○ 2V+1T KIS sessions and 2 AVS sessions ■ Tasks selected randomly and revisited ■ Tasks presented on data projector
  100. 100. Evaluation settings for interactive retrieval tasks ● Problem with repeatability of results ○ Human in the loop, conditions ● Evaluation provides one comparison of tools in a shared environment with a given set of tasks, users and shared dataset ○ Performance reflected by an overall score
  101. 101. Known-item search tasks at VBS 2018
  102. 102. Results of VBS 2018
  103. 103. Results - observed trends 2015-2017 2015 (100 hours) 2016 (250 hours) 2017 (600 hours) Observation: First AVS easier than Visual KIS easier than Textual KIS J. Lokoc, W. Bailer, K. Schoeffmann, B. Muenzer, G. Awad, On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017, IEEE Transactions on Multimedia, 2018
  104. 104. KIS score function (since 2018) ● Reward for solving a task ● Reward for being fast ● Fair scoring around time limit ● Penalty for wrong submissions J. Lokoc, W. Bailer, K. Schoeffmann, B. Muenzer, G. Awad, On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017, IEEE Transactions on Multimedia, 2018
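The exact formula is defined in the cited paper and implemented in the VBS server; purely as an illustration of the listed ingredients (not the official formula), a KIS-style score could look like this:

```python
# Illustrative sketch only (not the official VBS formula from the cited paper):
# a KIS score combining the listed ingredients - a reward for solving, a linear
# reward for being fast, and a penalty per wrong submission before the correct one.
def kis_score_sketch(solved, elapsed, time_limit, wrong_submissions,
                     base=50, speed_bonus=50, penalty=10):
    if not solved:
        return 0
    time_factor = max(0.0, 1.0 - elapsed / time_limit)   # fast -> close to 1
    score = base + speed_bonus * time_factor - penalty * wrong_submissions
    return max(0, score)

# Solved after 120 s of a 300 s limit, with one wrong submission.
print(kis_score_sketch(True, elapsed=120, time_limit=300, wrong_submissions=1))  # 70.0
```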
  105. 105. AVS score function (since 2018) VBS 2018 VBS 2017 Score based on precision and recall
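Again only as an illustration (not the official VBS 2017/2018 formula): an AVS-style score can combine precision over a team's submissions with recall relative to the pooled set of relevant shots found by all teams, which is the general idea stated on the slide.

```python
# Illustrative sketch only (not the official VBS formula): an AVS-style score
# rewarding both precision and recall of a team's submitted shots.
def avs_score_sketch(correct, wrong, total_relevant_found_by_all):
    submitted = correct + wrong
    if submitted == 0 or total_relevant_found_by_all == 0:
        return 0.0
    precision = correct / submitted
    recall = correct / total_relevant_found_by_all
    return 100.0 * precision * recall          # crude combination for illustration

print(avs_score_sketch(correct=8, wrong=2, total_relevant_found_by_all=20))  # 32.0
```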
  106. 106. Overall scores at VBS 2018
  107. 107. Settings and metrics in LSC Evaluation ● Similar to the VBS… For each tool, human in the loop ... ○ Same room, projector, time pressure ○ Expert and novice users ● … compete in simulated tasks (all KIS type) ○ Shared dataset in advance (LSC dataset, 27 days) ○ Six expert topics and 12 novice topics ■ Topics prepared by the organisers with full (non-pooled) relevance judgements for all topics ■ Tasks presented on data projector ■ Participants submit a ‘correct’ answer to the LSC server, which evaluates it against the ground truth.
  108. 108. Lifelog Search Challenge (Topics) I am building a chair that is wooden in the late afternoon. I am at work, in an office environment (23 images, 12 minutes). I am walking out to an airplane across the airport apron. I stayed in an airport hotel on the previous night before checking out and walking a short distance to the airport (1 image, 1 minute). I was in a Norwegian furniture store in a shopping mall (16 images, 9 minutes). I was eating in a Thai restaurant (130 images, 66 minutes). There was a large picture of a man carrying a box of tomatoes beside a child on a bicycle (185 images, 97 minutes). I was playing a vintage car-racing game on my laptop in a hotel after flying (53 images, 27 minutes). I was watching 'The Blues Brothers' Movie on the TV at home (82 images, 42 minutes).
  109. 109. LSC Score Function Score calculated from 0 to 100, based on the amount of time remaining. Negative scoring for incorrect answers (lose 10% of available score). Overall score is based on the sum of scores for all expert and novice topics. Similar to VBS, a problem with repeatability of results (human in the loop). Evaluation provides one comparison of tools in a shared environment with a given set of tasks, users and shared dataset.
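A sketch of this scoring scheme as described above (exact rounding and server details may differ):

```python
# Sketch of the LSC-style scoring described on the slide: a 0-100 score
# proportional to the time remaining, where each incorrect submission removes
# 10% of the score still available. Details may differ from the actual server.
def lsc_topic_score(solved, elapsed, time_limit, wrong_submissions):
    if not solved or elapsed > time_limit:
        return 0.0
    available = 100.0 * (time_limit - elapsed) / time_limit
    available *= 0.9 ** wrong_submissions   # lose 10% of available per wrong answer
    return available

# Solved after 60 s of a 300 s limit, with two wrong submissions.
print(lsc_topic_score(True, elapsed=60, time_limit=300, wrong_submissions=2))  # 64.8
```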
  110. 110. Evaluation settings at TRECVID ● Three run types: ○ Fully automatic ○ Manually-assisted ○ Relevance-feedback ● Query/Topics: ○ Text only ○ Text + image/video examples ● Training conditions: ○ Training data from same/cross domain as testing ○ Training data collected automatically ● Results: System returns the top 1000 shots that most likely satisfy the query/topic
  111. 111. Query Development Process ● Sample test videos (~30 - 40%) were viewed by 10 human assessors hired by NIST. ● 4 facets describing different scenes were used (if applicable) to annotate the watched videos: ○ Who: concrete objects and beings (kinds of persons, animals, things) ○ What: what are the objects and/or beings doing? (generic actions, conditions/state) ○ Where: locale, site, place, geographic, architectural, etc. ○ When: time of day, season ● Test queries were constructed from the annotated descriptions to include: Persons, Actions, Locations, and Objects and their combinations.
  112. 112. Sample topics of Ad-hoc search queries Find shots of a person holding a poster on the street at daytime Find shots of one or more people eating food at a table indoors Find shots of two or more cats both visible simultaneously Find shots of a person climbing an object (such as tree, stairs, barrier) Find shots of car driving scenes in a rainy day Find shots of a person wearing a scarf Find shots of destroyed buildings
  113. 113. Evaluation settings at TRECVID ● Usually 30 queries/topics are evaluated per year ● NIST hires 10 human assessors to: ○ Watch returned video shots ○ Judge if a video shot satisfies the query (YES/NO vote) ● All system results per query/topic are pooled; NIST judges the top ranked results (rank 1 to ~200) at 100% and samples ranked results from 201 to 1000 to form a unique judged master set. ● The unique judged master set is divided into small pool files (~1000 shots/file) and given to the human assessors to watch and judge.
  114. 114. TRECVID evaluation framework (diagram): a video collection and information needs (topics/queries) are fed to the participants’ video search algorithms 1..K; their ranked result sets are pooled into video pools; human assessors judge 100% of the top X ranked results and Y% from rank X+1 to the bottom; the resulting ground truth yields the evaluation scores.
  115. 115. Evaluation settings at TRECVID ● Basic rules for the human assessors to follow include: ○ In topic description, "contains x" or words to that effect are short for "contains x to a degree sufficient for x to be recognizable as x to a human" . This means among other things that unless explicitly stated, partial visibility or audibility may suffice. ○ The fact that a segment contains video of physical objects representing the feature target, such as photos, paintings, models, or toy versions of the target, will NOT be grounds for judging the feature to be true for the segment. Containing video of the target within video may be grounds for doing so. ○ If the feature is true for some frame (sequence) within the shot, then it is true for the shot; and vice versa. This is a simplification adopted for the benefits it affords in pooling of results and approximating the basis for calculating recall. ○ When a topic expresses the need for x and y and ..., all of these (x and y and ...) must be perceivable simultaneously in one or more frames of a shot in order for the shot to be considered as meeting the need.
  116. 116. Evaluation metric at TRECVID ● Mean extended inferred average precision (xinfAP) across all topics ○ Developed* by Emine Yilmaz and Javed A. Aslam at Northeastern University ○ Estimates average precision surprisingly well using a surprisingly small sample of judgments from the usual submission pools (see next slide!) ○ More topics can be judged with the same effort ○ Extended infAP adds stratification to infAP (i.e., we can sample from each stratum with a different sampling rate) * J.A. Aslam, V. Pavlu and E. Yilmaz, A Statistical Method for System Evaluation Using Incomplete Judgments. Proceedings of the 29th ACM SIGIR Conference, Seattle, 2006.
  117. 117. InfAP correlation with AP (figure): scatter plots of mean infAP computed from 20%, 40%, 60% and 80% judgment samples against mean infAP of the 100% sample; the mean infAP of the 100% sample equals AP.
  118. 118. Automatic vs. Interactive search in AVS Can we compare results from TRECVID (infAP) and VBS (unordered list)? ● Simulate AP from unordered list J. Lokoc, W. Bailer, K. Schoeffmann, B. Muenzer, G. Awad, On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017, IEEE Transactions on Multimedia, 2018
  119. 119. Automatic vs. Interactive search in AVS Can we compare results from TRECVID (infAP) and VBS (unordered list)? ● Get precision at VBS recall level if ranked lists are available J. Lokoc, W. Bailer, K. Schoeffmann, B. Muenzer, G. Awad, On influential trends in interactive video retrieval: Video Browser Showdown 2015-2017, IEEE Transactions on Multimedia, 2018
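A minimal sketch of this second comparison option: measure the precision of an automatic run's ranked list at the rank where it has retrieved as many relevant shots as the interactive team found (the IDs are illustrative):

```python
# Minimal sketch: precision of an automatic run's ranked list at the recall
# level reached by an interactive VBS team, i.e., at the rank where the run
# has retrieved the same number of relevant shots.
def precision_at_vbs_recall(ranked, relevant, vbs_relevant_found):
    hits = 0
    for rank, shot in enumerate(ranked, start=1):
        if shot in relevant:
            hits += 1
            if hits == vbs_relevant_found:
                return hits / rank
    return None   # the run never reaches that recall level

print(precision_at_vbs_recall(["a", "x", "b", "y", "c"],
                              relevant={"a", "b", "c"},
                              vbs_relevant_found=2))   # 2 hits at rank 3 -> 0.667
```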
  120. 120. 6. Lessons learned Collection of our observations from TRECVID and VBS
  121. 121. Video Search (at TRECVID) Observations One solution will not fit all. Investigations/discussions of video search must be related to the searcher’s specific needs/capabilities/history and to the kinds of data being searched. The enormous and growing amounts of video require extremely large-scale approaches to video exploitation. Much of it has little or no metadata describing the content in any detail. ● 400 hrs of video are being uploaded to YouTube per minute (as of 11/2017) ● “Over 1.9 Billion logged-in users visit YouTube each month and every day people watch over a billion hours of video and generate billions of views.” (https://www.youtube.com/yt/about/press/)
  122. 122. Video Search (at TRECVID) Observations Multiple information sources (text, audio, video), each errorful, can yield better results when combined than used alone… ● A human in the loop in search still makes an enormous difference. ● Text from speech via automatic speech recognition (ASR) is a powerful source of information but: ○ Its usefulness varies by video genre ○ Not everything/one in a video is talked about, “in the news" ○ Audible mentions are often offset in time from visibility ○ Not all languages have good ASR ● Machine learning approaches to tagging ○ yield seemingly useful results against large amounts of data when training data is sufficient and similar to the test data (within domain) ○ but will they work well enough to be useful on highly heterogeneous video?
  123. 123. Video Search (at TRECVID) Observations ● Processing video using a sample of more than one frame per shot yields better results, but quickly pushes common hardware configurations to their limits ● TRECVID systems have been looking at combining automatically derived and manually provided evidence in search: ○ Internet Archive video will provide titles, keywords, descriptions ○ Where in the Panofsky hierarchy are the donors’ descriptions? If very personal, does that mean less useful for other people? ● Need observational studies of real searching of various sorts using current functionality and identifying unmet needs
  124. 124. VBS organization ● Test session before event - problems with submission formats etc. ● Textual KIS tasks in a special private session ○ Textual tasks are not so attractive for audience ○ Textual tasks are important and challenging ○ More time and tasks are needed to assess tool performance ● Visual and AVS tasks during welcome reception ○ “Panem et circenses” - competitions are also intended to entertain audience ○ Generally, more novice users can be invited to try the tool
  125. 125. VBS server ● Central element of the competition ○ Presents all tasks using data projector ○ Presents scores in all categories ○ Presents feedback for actual submissions ○ Additional logic (duplicates, flood of submissions, logs) ○ Also at LSC 2018, with a revised ranking function ● Selected issue - duplicate problem ○ IACC dataset contains numerous duplicate videos with identical visual content (but e.g., different language) ○ Submission was regarded as wrong although the visual content was correct ○ One actual case in 2018, had to be corrected after the event and changed the final ranking ○ Dataset design should explicitly avoid duplicates, or at least provide a list of duplicates; moreover: server could provide more flexibility in changing judgements retrospectively
  126. 126. VBS server ● Issues of the simulation of KIS tasks ● How to “implant” visual memories? ○ Play scene just once - users forget the scene ○ Play scene in a loop - users exploit details -> overfitting to the task presentation ○ Play scene in a loop + blur - colors can still be used, but users also forget important details ○ Play scene several times in the beginning and then show a text description ● How to face ambiguities of textual KIS? ○ Simple text - not enough details, ambiguous meaning of some sentences ○ Extending text - simulation of a discussion - which details should be used first? ○ Still ambiguities -> teams should be allowed to ask some questions
  127. 127. AVS task and live judges at VBS ● Ambiguous task descriptions are problematic; hard to find a balance between too easy and too hard tasks ● Opinion of user vs. opinion of judge - who is right? ○ Users try to maximize score - sometimes risk a wrong submission ○ Each shot is assessed just once -> the same “truth” for all teams ○ Similar to textual KIS - teams should be allowed to ask some questions ○ Teams have to read the TRECVID rules for human assessors! ● Calibration of multiple judges ○ For more than one live judge, calibration of opinions is necessary, even during the competition ● Balance the number of users for AVS tasks (ideally also for KIS tasks)
  128. 128. VBS interaction logging ● Until 2017, there was no connection between VBS results and really used tool features to solve a task ○ VBS server received only team, video and frame IDs ○ Attempts to receive logs after competition failed ● Since 2018, an interaction log is a mandatory part of each task submission ○ How to obtain logs when the task is not solved? ○ Tools use variable modalities and interfaces - how to unify actions? ○ How to present and interpret logs? ○ How to log very frequent actions? ○ Time synchronization? ○ Log verification during test runs
  129. 129. VBS interaction logging - 8/9 teams sent logs! We can analyze both aggregated general statistics and user interactions in a given tool/task!!
  130. 130. Conclusion
  131. 131. Where is the User in the Age of Deep Learning? ● The complexity of tasks where AI is superior to humans is obviously growing ○ Checkers -> Chess -> GO -> Poker -> DOTA? -> Starcraft? -> … -> guess user needs? ● Machine learning revolution - bigger/better training data -> better performance ● Can we collect big training data to support interactive video retrieval? ○ To cover an open world (how many concepts, actions, … do you need)? ○ To fit the needs of every user (how many contexts do we have)? ● Reinforcement learning?
  132. 132. Where is the User in the Age of Deep Learning? Driver has to get carefully through many situations with just basic equipment Q: Is this possible also for video retrieval systems? Attribution: Can Pac Swire (away for a bit) Driver has to rely on himself but subsystems help (ABS, power steering, etc.) Attribution: Grand Parc - Bordeaux Driver just tells where to go Attribution: Grendelkhan
  133. 133. Where is the User in the Age of Deep Learning? ● Users already benefit from deep learning ○ HCI support - body motion, hand gestures ○ More complete and precise automatic annotations ○ Embeddings/representations for similarity search ○ 2D/3D projections for visualization of high-dimensional data ○ Relevance feedback learning (benefit from past actions) ● Promising directions ○ One-shot learning for fast inspection of new concepts ○ Multimodal joint embeddings ○ … ○ Just A Rather Very Intelligent System (J.A.R.V.I.S.) used by Tony Stark (Iron Man) ??
  134. 134. Never say “this will not work!” ● If you have an idea how to solve interactive retrieval tasks - just try it! ○ Don’t be afraid your system is not specialized, you can surprise yourself and the community! ○ Paper submission in September 2019 for VBS at MMM 2020 in Seoul! ○ LSC submission in February 2019 for ICMR 2019 in Ottawa in June 2019. ○ The next TRECVID CFP will go out by mid-January, 2019. Lokoč, Jakub, Adam Blažek, and Tomáš Skopal. "Signature-based video browser." International Conference on Multimedia Modeling. Springer, Cham, 2014. Del Fabro, Manfred, and Laszlo Böszörmenyi. "AAU video browser: non-sequential hierarchical video browsing without content analysis." International Conference on Multimedia Modeling. Springer, Berlin, Heidelberg, 2012. Hürst, Wolfgang, Rob van de Werken, and Miklas Hoet. "A storyboard-based interface for mobile video browsing." International Conference on Multimedia Modeling. Springer, Cham, 2015.
  135. 135. Acknowledgements This work has received funding from the European Union’s Horizon 2020 research and innovation programme, grant no. 761802, MARCONI. It was supported also by Czech Science Foundation project Nr. 17-22224S. Moreover, the work was also supported by the Klagenfurt University and Lakeside Labs GmbH, Klagenfurt, Austria and funding from the European Regional Development Fund and the Carinthian Economic Promotion Fund (KWF) under grant KWF 20214 u. 3520/26336/38165.
