Audiovisual content exploitation JTS2010

On audiovisual content exploitation at Netherlands Institute for Sound and Vision and the crowdsourcing application Pinkpop

Slide notes
  • €173 million over 7 years (duration 2007-2014), with an obligation to earn part of the investment back: €19 million during the project period.
  • The context database also contains other information sources.
  • The amount of footage for each festival year varies from only a summary to almost unabridged concert recordings, even including raw, unpublished footage.
  • In contrast to domains like news video, where the number of visual concepts is unrestricted, the number of concepts that may appear in a concert is more or less fixed. A band plays on stage for an audience. Thus, major concepts are related to the roles of the band members, e.g. lead singer or guitarist, and the types of instruments that they play, e.g. drums or keyboard. Although quite many instruments exist, most bands typically use guitars, drums, and keyboards. We chose 12 concert concepts based on frequency, visual detection feasibility, previous mention in the literature [3, 10], and expected utility for concert video users. For each concept we annotated several hundred examples using the annotation tool depicted in Figure 3 [1]. The 12 concert concepts are depicted in Figure 4.
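
The note describes what the detectors must recognize but not how they are built; as a minimal, hypothetical sketch of the general per-concept approach (scikit-learn here is an illustrative substitute, not the toolchain used in the source), assuming per-frame features have already been extracted:

```python
# Hypothetical sketch: one binary detector per concert concept, trained
# on manually labeled example frames. Feature extraction is assumed to
# have happened already (e.g., color/texture descriptors per frame).
from sklearn.svm import SVC

CONCEPTS = ["lead singer", "guitarist", "drums", "keyboard"]  # 4 of the 12

def train_detectors(features, labels):
    """features: (n_frames, n_dims) array; labels: concept -> 0/1 array."""
    detectors = {}
    for concept in CONCEPTS:
        clf = SVC(kernel="rbf", probability=True)
        clf.fit(features, labels[concept])
        detectors[concept] = clf
    return detectors

def score_frames(detectors, features):
    """Per-frame confidence score for every concept, in [0, 1]."""
    return {c: clf.predict_proba(features)[:, 1] for c, clf in detectors.items()}
```
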
  • Automatic speech recognition (ASR) technology was used to attach browsing functionality to the interview fragments in the collection. Speech transcripts were generated using the SHoUT ASR toolkit [4] and post-processed to generate a filtered term frequency list that is most likely to represent the contents of the interviews, based on tf.idf statistics. This list was then used to create a time-synchronized term cloud. Each word in the cloud is clickable, enabling users to jump to the part of the interview where the word is mentioned.
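
The exact post-processing is not spelled out; assuming the ASR output can be reduced to (word, start_time) pairs per interview, a rough sketch of the tf.idf filtering and the time-synchronized cloud could look like this:

```python
# Rough sketch: tf.idf-filtered, time-synchronized term cloud from ASR
# output, assumed to be (word, start_time_in_seconds) pairs. The corpus
# is assumed to include the transcript itself, so df >= 1.
import math
from collections import Counter

def tfidf_terms(transcript, corpus, top_k=30):
    """Rank the words of one transcript against a corpus of transcripts."""
    words = [w for w, _ in transcript]
    tf = Counter(words)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if any(w == word for w, _ in doc))
        scores[word] = (count / len(words)) * math.log(len(corpus) / df)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

def term_cloud(transcript, corpus):
    """Map each selected term to the timestamps where it is spoken, so a
    click on the term can seek the player to those positions."""
    terms = set(tfidf_terms(transcript, corpus))
    cloud = {}
    for word, t in transcript:
        if word in terms:
            cloud.setdefault(word, []).append(t)
    return cloud
```
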
  • The main mode of user interaction with our video search engine is a timeline-based video player, see Figure 2. The player enables users to watch and navigate through a single concert video. Little colored dots on the timeline mark the location of an interesting fragment corresponding to an automatically derived label. To inspect the label and the duration of the fragment, users simply move their mouse cursor over the colored dot. By clicking the dot, the player instantly starts playback at that specific moment in the video. If needed, the user can manually select more concept labels in the panel on the left of the video player. If the timeline becomes too crowded as a result of multiple labels, the user may decide to zoom in on the timeline. Besides providing feedback on the automatically detected labels, we also allow our users to comment on individual fragments, share a fragment through e-mail or Twitter, and embed the integrated video player, including the crowdsourcing mechanism, on different websites.
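
The player's internal data format is not described; purely as an illustration, with hypothetical field names, the timeline dots could be driven by fragment records like these:

```python
# Hypothetical sketch of the fragment records a timeline player could
# consume: one colored dot per record, colored by concept label.
fragments = [
    {"label": "guitarist", "start": 125.0, "end": 151.5, "score": 0.91},
    {"label": "drums",     "start": 310.2, "end": 322.8, "score": 0.84},
]

def dots_for_timeline(fragments, selected_labels):
    """Keep only the labels the user enabled in the left-hand panel."""
    return [f for f in fragments if f["label"] in selected_labels]
```
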
  • In order to find a balance between an appealing user experience and maximized user participation, we motivate online users to participate by providing them with access to a selection of exclusive, full-length concert videos. The users watch the videos without interruption and are encouraged to provide their feedback via graphical overlays that appear on top of the video, see Figure 1. The threshold to participate is deliberately kept low. Users do not need to sign up and can provide their feedback just by clicking buttons. With the thumbs-up button they indicate that they agree with the automatically detected label for the video fragment. If they press the thumbs-down button, the user is asked to correct the label. Within a few clicks the user can select another pre-defined label or create a new label on demand. In addition, users are allowed to indicate whether the start or end of the fragment was inconsistent with the label. All user feedback is stored in a database together with the users' IP addresses and user sessions.
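
The storage backend is not specified beyond "a database"; a minimal sketch of the kind of record described here, with hypothetical table and field names, might be:

```python
# Hypothetical sketch: one row per feedback action, as described in the
# notes (thumbs-up/down, optional corrected label, IP address, session).
import sqlite3

schema = """
CREATE TABLE IF NOT EXISTS feedback (
    fragment_id     TEXT NOT NULL,
    judgment        TEXT NOT NULL,   -- 'thumbs_up' or 'thumbs_down'
    corrected_label TEXT,            -- only set after a thumbs-down
    boundary_issue  TEXT,            -- 'start', 'end', or NULL
    ip_address      TEXT NOT NULL,
    session_id      TEXT NOT NULL,
    created_at      TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
"""

conn = sqlite3.connect("feedback.db")
conn.execute(schema)
conn.execute(
    "INSERT INTO feedback (fragment_id, judgment, corrected_label,"
    " boundary_issue, ip_address, session_id) VALUES (?, ?, ?, ?, ?, ?)",
    ("fragment-0042", "thumbs_down", "keyboard", None, "192.0.2.1", "abc123"),
)
conn.commit()
```
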
Transcript

Audiovisual content exploitation in the networked information society
Crowdsourcing Rock 'n Roll Multimedia Retrieval
Roeland Ordelman, Research & Development, Netherlands Institute for Sound and Vision, rordelman@beeldengeluid.nl

Contents
  • AV content exploitation, annotation technology and user needs
  • NISV context: digitization in Images of the Future
  • Annotation technology for enabling access
  • Annotation technology and user needs
  • Example: Crowdsourcing Rock 'n Roll Multimedia Retrieval

NISV context
  • 700,000+ hours of radio, television, documentaries, films and music; over 2 million photographs; 20,000 objects like cameras, televisions, radios, costumes and pieces of scenery
  • and growing:
    • digitally born television and radio programs made by the Dutch public broadcasting companies (video: ~15,000 hours/year)
    • PROARCHIVE: archiving service
    • selection of (Dutch) AV content from the web

LARGE DIGITIZATION PROGRAM: Images of the Future

Images of the Future
  • Selection, restoration, digitization, encoding and storage of 137,000 hours of video, 20,000 hours of film, 124,000 hours of audio and more than three million photographs
  • Three goals:
    • Safeguarding heritage for future generations
    • Creating socio-economic value ("unlock the social and economic potential of the collections")
    • Innovation: new infrastructure for strengthening the knowledge economy

INVESTMENTS, BUSINESS MODELS
The cultural heritage sector is challenged to re-evaluate its business models

Business model
  • The total investment of this initiative sums up to 173 million euro
  • A strong business model is necessary to support this kind of investment and prove that such an investment will result in long-term socio-economic returns
  • The outcome of a cost-benefit analysis was positive: "The total balance of costs and returns of restoring, preserving and digitising audio-visual material (excluding costs of tax payments) will be between 20+ and 60+ million."
  • Economic benefits:
    • Direct effects of the investment are revenues from sales, access for specific user groups, the repartition of copyright for the use of the material, and so on
    • The indirect effects concern the product markets and the labour market
  • Social benefits:
    • conservation of culture, reinforcement of cultural awareness, reinforcement of democracy through the accessibility of information, increase in multimedia literacy, and contribution to the Lisbon goals set by the EU
http://www.prestoprime.org/project/public.en.html

Content exploitation: from content is king ...
... to metadata rules

Manual annotation: costly & limited

(SEMI-)AUTOMATIC ANNOTATION: DECADE+ RESEARCH EFFORTS

Research on automatic annotation
  • automatic information extraction based on:
    • visual features
    • information from audio
    • crowdsourcing
  • deploying collateral data sources:
    • subtitles, production scripts, meeting minutes, slides

Various (laboratory) showcases. Commercial systems (e.g., blinkx, Google). Progress? Yes!

Work in progress
  • institutional: reorganisation of traditional archival workflows
  • national: development of common services
    • OAI, Persistent Identifiers, ASR service, Vocabulary Repositories
  • commercial: uptake by MNCs (Google and Microsoft) and SMEs
  • individual: bring about a shift regarding the defensive attitude of content owners towards
    • opening up their funded and protected archives
    • use of possibly noisy content descriptions (trust/reliability)

Automatic annotation: NISV as a user
  • Participation in international research projects
    • Video Active, MultiMATCH, VIDI-Video, LiWA, P2P-Fusion, Sterna, EUscreen, PrestoPrime
  • Collaboration agreement with Dutch research institutes
    • Researchers stationed at Sound and Vision
    • Provide data (TRECVID, VideoCLEF)
  • Research environment: exact copy of the iMMix production environment for testing new technology
    • speech recognition
    • video analysis
    • fingerprinting
    • linking of context data (web, program guide, production data)

Media professionals, journalists, researchers, educators, general public: disparity between technology and user needs

User perspective
  • Rapidly evolving networked information society
  • Opening up
  • Focus on community-specific requirements
    • search needs
    • presentation/interaction needs
  • Draw communities into libraries

Community-specific requirements: from document-level search to fragment-level search

Broadcast professionals (in: Huurnink, Hollink, van den Heuvel 2009, submitted)

User survey (broadcast professionals)

Researchers
  • Verteld Verleden aims at establishing a shared information space on distributed Dutch Oral History collections:
    • distributed collections (harvested via OAI)
    • search & interlink collections via centralized search
  • project goals:
    • provide a demonstrator portal to show how technology could help researchers
    • acquire information on specific user requirements: search, collaboration, linking, privacy, dedicated work space
http://www.verteldverleden.org
(a minimal OAI-PMH harvesting sketch follows below)
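
The slides name OAI as the harvesting route but give no detail; as a minimal sketch of OAI-PMH harvesting, with a hypothetical endpoint URL:

```python
# Minimal OAI-PMH harvesting sketch (the endpoint URL is hypothetical).
# ListRecords returns batches; a resumptionToken pages through the rest.
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ENDPOINT = "https://example.org/oai"  # placeholder collection endpoint

def harvest(endpoint=ENDPOINT, prefix="oai_dc"):
    url = f"{endpoint}?verb=ListRecords&metadataPrefix={prefix}"
    while url:
        with urllib.request.urlopen(url) as resp:
            root = ET.fromstring(resp.read())
        for record in root.iter(f"{OAI}record"):
            yield record
        token = root.find(f"{OAI}ListRecords/{OAI}resumptionToken")
        if token is not None and token.text:
            url = f"{endpoint}?verb=ListRecords&resumptionToken={token.text}"
        else:
            url = None
```
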

Draw communities into libraries

Goals
  • exploiting community tagging (tagging games, etc.)
  • exploring the wisdom of crowds by hooking up with user communities (e.g., everyone-as-commentator, unexpected experts)
  • capturing relevant information from the internet and aligning this with archived items
  • finding new ways for communities to interact with the data

Technology perspective
  • provide anchor points for linking up with the 'cloud' (entity detection, segmentation, cross-collection SID, etc.): people, places, events, topics, quotes, etc.
  • synchronization of web content/UGC with AV documents
  • users in the loop: UGC for adapting/training analysis tools
  • technology-aided annotation: Documentalist Support System
    • provide the documentalist/archivist with relevant context during manual annotation
(an illustrative entity-detection sketch follows below)
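
The slide lists entity detection as one way to create anchor points; purely as an illustration (not NISV's actual pipeline, and the spaCy model name is just an example), anchor points could be extracted from a catalog description like this:

```python
# Illustrative sketch: extract people/places/organizations from a
# program description to serve as anchor points for linking.
import spacy

nlp = spacy.load("nl_core_news_sm")  # Dutch model, chosen as an example

def anchor_points(description: str):
    """Return (text, entity_type) pairs usable as link anchors."""
    doc = nlp(description)
    return [(ent.text, ent.label_) for ent in doc.ents]
```
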

WEB-ARCHIVING: COLLECT CONTEXT DATA FROM THE WEB

Web-archiving
  • extend the Sound and Vision archive with audiovisual content from the internet
  • archive internet web content
    • preserve broadcast-related websites
    • to use as context information for audiovisual data in the Sound and Vision archive

Diagram: audiovisual internet content and broadcast-related internet content feed the iMMix AV archive and the web archive, which provide context for each other.

Special use case: documentalist support
  • in the process of generating metadata for an archived AV item, a documentalist searches for relevant information on this item, for example on the internet
  • internet search might fail, as such information is typically available only for a limited amount of time
  • the "internet archive" works as a "context database" for relevant internet context

Internet context that may "disappear" but could be used as information for describing television broadcasts

Tagging game: www.waisda.nl

Crowdsourcing Rock 'n Roll Multimedia Retrieval
  • Netherlands Institute for Sound and Vision
  • University of Amsterdam – Visual Search (Cees Snoek)
  • University of Twente – Speech Recognition (Franciska de Jong)
  • VideoDock – User Interface (Bauke Freiburg)

Background
  • 40th birthday of the popular annual Dutch rock festival Pinkpop
  • footage ranges from only a summary to almost unabridged recordings, even including raw, unpublished footage as well as interviews
  • collection digitized in Images for the Future
  • goal: build an application for showcasing the history of the festival in an attractive way using state-of-the-art technology

Rationale
  • Use state-of-the-art visual analysis to allow browsing the collection on the basis of visual concert concepts
  • Use speech recognition for browsing interviews
  • Exploit the popularity of the festival to get the rock 'n roll enthusiast community into the loop:
    • general feedback on the technology
    • improve and extend automatic labeling
    • share video fragments

IPR
  • Various Dutch broadcasters hold the copyrights to the content
  • Granted dispensation to use the content to enable a large-scale study of community-aided annotation and verification via an open internet platform:
    • for a limited time period of three months
    • video displayed in a secured player
    • (access to experimental results)

Visual search
  • visual concept detection: for each concept a 'detector' is trained on the basis of manually labeled training data
  • the number of concepts in concerts is more or less fixed (in contrast to the broadcast news domain); 12 were chosen based on:
    • frequency
    • visual detection feasibility
    • previous mention in the literature
    • expected utility for users
  • for each concept several hundred examples were labeled

Fragment-level concept detection
  • video fragments instead of more technically defined shots or keyframes
  • the fragment algorithm finds the longest fragments with the highest average scores for a specific concert concept
  • only the top-n fragments per concert concept are loaded in the video player
(a sketch of one plausible selection pass follows below)
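
The fragment algorithm itself is not given; one plausible reading, sketched here with assumed threshold and top-n values, grows contiguous runs of high-scoring frames and ranks them by average score, preferring longer runs on ties:

```python
# Hypothetical sketch of fragment selection: group consecutive frames
# whose concept score clears a threshold, then keep the top-n runs
# ranked by average score and length. Threshold and n are assumptions.
def select_fragments(scores, fps=25.0, threshold=0.5, top_n=10):
    """scores: per-frame detector scores for one concert concept.
    Returns (start_sec, end_sec, avg_score) for the chosen fragments."""
    runs, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(scores)))

    def rank(run):
        a, b = run
        avg = sum(scores[a:b]) / (b - a)
        return (avg, b - a)  # higher average first, longer runs on ties

    runs.sort(key=rank, reverse=True)
    return [(a / fps, b / fps, sum(scores[a:b]) / (b - a))
            for a, b in runs[:top_n]]
```
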

Speech recognition
  • Speech transcripts generated by the open-source speech recognition toolkit SHoUT, developed in the MultimediaN and CATCH projects
  • Words in the transcripts have time labels
  • Transcripts converted to a filtered term frequency list on the basis of tf.idf statistics for generating a time-synchronized term cloud:
    • jump to relevant interview parts via terms

Player
  • timeline-based video player
  • colored dots represent concepts
  • clicking a dot starts playback
  • feedback window:
    • right/wrong label
    • comment
    • share (email/Twitter)
    • embed the integrated video player, including the crowdsourcing mechanism

Encouraging user feedback
  • balance between an appealing user experience and maximized user participation
  • full-length concert videos (no 'commercials')
  • no interruptions, only graphical overlays
  • participation threshold kept low:
    • no signing up
    • just click buttons (thumbs-up/down)
  • all user feedback stored in a database together with IP addresses and user sessions

DEMO

Preliminary results
  • 12,563 visitors, of which 9,595 unique, in 3 months
  • visitors watched on average 3.5 pages, with an average viewing time of 4.57 minutes
  • the busiest day was December 3, with 1,566 visitors, immediately after launch and media attention
  • most traffic (65%) originated from 255 referrer sites, the best being:
    • pinkpop.nl (festival organization)
    • oor.nl (music magazine)
    • Google
  • users provided feedback more than 4,000 times
  • we are currently investigating how this feedback can be exploited to improve the automated multimedia analysis results

Wrap up
  • the value of an archive is strongly related to access opportunities
  • access is to a large extent technology-driven
  • but next to technology development we need to make a shift:
    • from a 'laboratory view' on users to drawing users and communities into the loop
  • NISV is aiming towards this two-way strategy:
    • incorporate advanced access technology
    • discuss access requirements with the stakeholders
