
Overview of the MediaEval 2012 Tagging Task


  1. 2012 Tagging Task Overview
     Christoph Kofler, Delft University of Technology
     Sebastian Schmiedeke, Technical University of Berlin
     Isabelle Ferrané, University of Toulouse (IRIT)
     MediaEval Workshop, 4-5 October 2012, Pisa, Italy
  2. Motivations
     • MediaEval
       – evaluate new algorithms for multimedia access and retrieval
       – emphasize the "multi" in multimedia
       – focus on the human and social aspects of multimedia tasks
     • Tagging Task
       – focus on semi-professional video on the Internet
       – use features derived from various media or information sources: speech, audio, visual content, associated metadata, social information
       – focus on tags related to the genre of the video
  3. History
     • Tagging task at MediaEval
       – 2010: The Tagging Task (Wild Wild Web version): prediction of user tags → too many tags, great diversity
       – 2011: The Genre Tagging Task: genres from blip.tv labels (26 labels)
       – 2012: The Tagging Task: same labels but much more development/test data
     • MediaEval joined by
       – 2009 & 2010: internal Quaero campaigns (Video Genre Classification) → too few participants
       – 2011 & 2012: Tagging task as an external Quaero evaluation
  4. Datasets
     • A set of videos (ME12TT)
       – created by the PetaMedia Network of Excellence
       – downloaded from blip.tv
       – episodes of shows mentioned in Twitter messages
       – licensed under Creative Commons
       – 14,838 episodes from 2,249 shows, ~3,260 hours of data
       – extension of the MediaEval Wild Wild Web dataset (2010)
     • Split into development and test sets
       – 2011: 247 for development / 1,727 for test (1,974 videos)
       – 2012: 5,288 for development / 9,550 for test (7.5 times more)
  5. Genres
     • 26 genre labels from blip.tv ... same as in 2011: 25 genres + 1 default_category
       1000 art                        1001 autos_and_vehicles
       1002 business                   1003 citizen_journalism
       1004 comedy                     1005 conferences_and_other_events
       1006 default_category           1007 documentary
       1008 educational                1009 food_and_drink
       1010 gaming                     1011 health
       1012 literature                 1013 movies_and_television
       1014 music_and_entertainment    1015 personal_or_auto-biographical
       1016 politics                   1017 religion
       1018 school_and_education       1019 sports
       1020 technology                 1021 the_environment
       1022 the_mainstream_media       1023 travel
       1024 videoblogging              1025 web_development_and_sites
  6. Information available (1/2)
     • From different sources
       – title, description, user tags, uploader ID, duration
       – tweets mentioning the shows
       – automatic processing:
         • automatic speech recognition (ASR): English transcription, plus some other languages (new in 2012)
         • shot boundaries and 1 keyframe per shot (new in 2012)
  7. Information available (2/2)
     • Focus on speech data
       – LIUM transcripts (5,084 files from dev / 6,879 files from test)
         • English
         • one-best hypotheses (NIST CTM format)
         • word lattices (SLF HTK format), 4-gram topology (new in 2012)
         • confusion networks (ATT FSM-like format)
       – LIMSI-VOCAPIA transcripts (5,237 files from dev / 7,215 files from test)
         • English, French, Spanish, Dutch
         • language identification strategy (language confidence score): if the score is above 0.8, the transcription is based on the detected language; otherwise, the best score between English and the detected language is used (a minimal code sketch follows this slide)
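     A minimal Python sketch of this selection rule, assuming a per-file detected language, its confidence score, and per-language transcription scores are available; the function and field names are illustrative, not the actual LIMSI-VOCAPIA interface:

         # Illustrative sketch of the language-selection rule described above; the data
         # structures are assumptions, not the actual LIMSI-VOCAPIA tool's API.
         def choose_transcription_language(detected_lang, lang_confidence, scores):
             """scores maps a language code to its transcription score, e.g. {"en": 0.71, "fr": 0.83}."""
             if lang_confidence > 0.8:
                 # Confident identification: transcribe in the detected language.
                 return detected_lang
             # Otherwise pick whichever scores best between English and the detected language.
             return max(["en", detected_lang], key=lambda lang: scores.get(lang, 0.0))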
  8. Task Goal
     • Same as in 2011
       – searching and browsing the Internet for video
       – using genre as a search or organizational criterion
       → videos may not be accurately or adequately tagged
       → automatically assign genre labels using features derived from speech, audio, visual content, and associated textual or social information
     • What's new in 2012
       – provide a huge amount of data (7.5 times more)
       – enable information retrieval as well as classification approaches, with more balanced datasets (1/3 dev, 2/3 test)
       – each genre was "equally" distributed between both sets
  9. Genre distribution over datasets
     [Charts on the slide: Part 1 covers genres 1000 to 1012; Part 2 covers genres 1013 to 1025]
  10. Evaluation Protocol
      • Task: predict the genre label for each video of the test set
      • Submissions: up to 5 runs, representing different approaches
        – RUN 1: audio and/or visual information (including information about shots and keyframes)
        – RUN 2: ASR transcripts
        – RUN 3: all data except metadata
        – RUN 4: all data except the uploader ID (new in 2012; the ID was used in the 2011 campaign)
        – RUN 5: all data
      • Ground truth: genre label associated with each video
      • Metric: Mean Average Precision (MAP), which evaluates ranked retrieval results over a set of queries Q (a sketch of the computation follows this slide)
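      A minimal sketch of how MAP can be computed for this task, assuming each genre label acts as a query and each run returns a ranked list of test videos per genre; variable names are illustrative:

          # Illustrative MAP computation: each query (here, a genre label) has a ranked list of
          # retrieved video IDs and a set of ground-truth relevant video IDs for that genre.
          def average_precision(ranked_ids, relevant_ids):
              hits, precision_sum = 0, 0.0
              for rank, video_id in enumerate(ranked_ids, start=1):
                  if video_id in relevant_ids:
                      hits += 1
                      precision_sum += hits / rank
              return precision_sum / len(relevant_ids) if relevant_ids else 0.0

          def mean_average_precision(queries):
              """queries: list of (ranked_ids, relevant_ids) pairs, one per genre."""
              return sum(average_precision(r, rel) for r, rel in queries) / len(queries)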
  11. Participants
      • 2012: 20 registered, 6 submissions (10 in 2011)
        – 5 veterans, 1 new participant
        – 3 organiser-connected teams, 5 countries
      • System / participant / supporting project:
        – KIT: Karlsruhe Institute of Technology, Germany (Quaero)
        – UNICAMP-UFMG (new): University of Campinas & Federal University of Minas Gerais, Brazil (FAPEMIG, FAPESP, CAPES & CNPq)
        – ARF: University Politehnica of Bucharest, Romania; Johannes Kepler University & Research Institute of Artificial Intelligence, Austria; Polytech Annecy-Chambery, France (EXCEL POSDRU)
        – TUB: Technical University of Berlin, Germany (EU FP7 VideoSense)
        – TUD-MM: Delft University of Technology, The Netherlands
        – TUD: Delft University of Technology, The Netherlands
  12. Features
      • Features used (colour-coded on the slide as: in 2011 only / in 2011 & 2012 / in 2012 only); a small feature-extraction sketch follows this slide
        – ASR: transcription in English only, or with translation of non-English; stop-word filtering; semantic similarity; LIMSI and LIUM transcripts; stemming; TF-IDF; top terms; bags of words (BoW); LDA
        – Audio: MFCC, LPC, LPS, ZCR, spectrogram as an image, rhythm, timbre, onset strength, energy, loudness, ...
        – Visual content: on images: colour, texture, SIFT / rgbSIFT / SURF / HoG, face detection; on video: self-similarity matrix, shot boundaries, shot length, transitions between shots, motion features; on keyframes: bags of visual words
        – Metadata: title, tags, filename, show ID, description, uploader ID, BoW
        – Others: videos from YouTube; web pages from Google, blip.tv and Wikipedia; synonyms, hyponyms, domain terms and BoW from WordNet; social data from Delicious; video distribution over genres (dev)
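      A minimal sketch of the kind of textual feature extraction listed above (stop-word filtering, top terms, TF-IDF bags of words over ASR transcripts or metadata), using scikit-learn for convenience; the example documents are invented placeholders, not dataset content:

          # Illustrative TF-IDF / bag-of-words extraction over ASR transcripts or metadata text.
          from sklearn.feature_extraction.text import TfidfVectorizer

          documents = [
              "rumpole angel death john mortimer book review",   # placeholder: one text blob per video
              "gameplay walkthrough level boss strategy",
          ]
          vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)  # stop-word filtering + top terms
          tfidf_matrix = vectorizer.fit_transform(documents)                     # videos x terms sparse matrix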
  13. Methods (1/2)
      • Machine learning approach (ML); a small pipeline sketch follows this slide
        – parameter extraction from audio, visual and textual data, early fusion
        – feature transformation (PCA), selection (mutual information, term frequency), dimension reduction (LDA)
        – classification methods, supervised or unsupervised: k-NN, SVM, Naive Bayes, DBN, GMM, NN; k-means (clustering); CRF (conditional random fields), decision trees (random forest)
        – training step, cross-validation approach, and stacking
        – fusion of classifier results / late fusion / majority voting
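      A minimal sketch of one such pipeline (PCA followed by an SVM per modality, then majority voting across modalities); this is an assumption-laden illustration, not any participant's actual system:

          # Illustrative per-modality classification with late fusion by majority voting.
          # Feature matrices and labels are placeholders standing in for the features of slide 12.
          from collections import Counter
          from sklearn.decomposition import PCA
          from sklearn.pipeline import make_pipeline
          from sklearn.svm import LinearSVC

          def train_modality_classifier(features, labels):
              # One classifier per modality (e.g. audio, visual, ASR text), with PCA for dimension reduction.
              return make_pipeline(PCA(n_components=50), LinearSVC()).fit(features, labels)

          def predict_by_majority_vote(models, features_per_modality):
              # Late fusion: each modality votes for a genre; the most frequent label wins.
              votes = [model.predict(feats)[0] for model, feats in zip(models, features_per_modality)]
              return Counter(votes).most_common(1)[0][0]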
  14. Methods (2/2)
      • Information retrieval approach (IR)
        – text preprocessing and text indexing
        – query and ranking list, query expansion and re-ranking methods
        – fusion of ranked lists from different modalities (reciprocal rank fusion, RRF; see the sketch after this slide)
        – selection of the category with the highest ranking score
      • Evolution since 2011
        – 2011: two distinct communities, ML or IR approach
        – 2012: mainly ML approaches, or mixed ones
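      A minimal sketch of reciprocal rank fusion over ranked lists coming from different modalities; the constant k = 60 is the value commonly used in the RRF literature, not necessarily the value participants chose:

          # Illustrative reciprocal rank fusion (RRF): each list contributes 1 / (k + rank) per item.
          from collections import defaultdict

          def reciprocal_rank_fusion(ranked_lists, k=60):
              scores = defaultdict(float)
              for ranked in ranked_lists:                            # one ranked list of genre labels per modality
                  for rank, label in enumerate(ranked, start=1):
                      scores[label] += 1.0 / (k + rank)
              return sorted(scores, key=scores.get, reverse=True)    # fused ranking, best first

          # e.g. fusing an ASR-based ranking with a metadata-based ranking for one test video:
          fused = reciprocal_rank_fusion([["comedy", "literature", "music"], ["literature", "comedy", "travel"]])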
  15. Resources
      • Which ones were used?
        [Table on the slide marks, for each system (KIT, UNICAMP-UFMG, ARF, TUB, TUD-MM, TUD), whether AUDIO, ASR, VISUAL, METADATA, SOCIAL or OTHER resources were used]
      • Evolution since 2011
        – 2011: use of external data, mainly from the web, and of social data; 1 participant especially interested in the social aspect
        – 2012: no external data, no social data
  16. Main results (1/2)
      • Each participant's best result (system: best run, MAP, approach, features, method)
        – KIT: Run 3, 0.3499, ML, colour, texture, rgbSIFT, SVM (Run 6*: 0.3581, adding the video distribution over genres)
        – UNICAMP-UFMG: Run 4, 0.2112, ML, BoW, stacking
        – ARF: Run 5, 0.3793, ML, TF-IDF on metadata and LIMSI ASR, linear SVM
        – TUB: Run 4, 0.5225, ML, BoW on metadata, mutual information + Naive Bayes
        – TUD-MM: Run 4, 0.3675, ML & IR, TF on visual words + ASR & metadata, linear SVM + reciprocal rank fusion
        – TUD: Run 2, 0.25, ML, LIUM ASR one-best, DBN
      • Baseline results: all videos in the default category, MAP = 0.0063; videos randomly classified, MAP = 0.002
  17. Main results (2/2)
      • Official run comparison (MAP); runs: Run 1 audio/visual, Run 2 ASR, Run 3 excluding metadata, Run 4 excluding uploader ID, Run 5 all data, plus other runs
        – KIT: 0.3008 (visual1), 0.3461, 0.2329 (visual2), 0.1448, 0.3499 (fusion), 0.3581
        – UNICAMP-UFMG: 0.1238, 0.2112
        – ARF: 0.1941 (visual & audio), 0.2174, 0.2204, 0.3793, 0.1892 (audio)
        – TUB: 0.2301, 0.1035, 0.2259, 0.5225, 0.3304
        – TUD-MM: 0.0061, 0.3127, 0.2279, 0.3675, 0.2157, 0.0577, 0.0047
        – TUD: 0.23 / 0.25, 0.10 / 0.09
      • Baseline results: all videos in the default category, MAP = 0.0063; videos randomly classified, MAP = 0.002
  18. Lessons learned or open questions (1/3)
      • About data
        – 2011: small dataset for development, ~247 videos; difficulties in training models, external resources required
        – 2012: huge amount of data for development, ~5,288 videos; enough to train models in a machine learning approach
        → impact on the type of methods used (ML versus IR) and on the need for / use of external data
        → no use of social data this year: is it a question of community? It can be disappointing with respect to the MediaEval motivations
  19. Lessons learned or open questions (2/3)
      • About results
        – 2011: best system, MAP 0.5626, using audio, ASR, visual, metadata including the uploader ID, and external data from Google and YouTube
        – 2012: best non-organiser-connected system, MAP 0.3793, using TF-IDF on ASR and metadata including the uploader ID, no visual data; best organiser-connected system, MAP 0.5225, using BoW on metadata without the uploader ID, no visual data
        → results are difficult to compare given the great diversity of features, methods, and systems combining both
        → monomedia (visual only; ASR only) or multimedia contributions
        → a failure analysis should help to understand "what impacts what?"
  20. Lessons learned or open questions (3/3)
      • About the metric
        – MAP as the official metric
        – some participants provided other types of results, in terms of correct classification rate or F-score, or detailed AP results per genre
        → would analysing the confusion between genres be of interest?
      • About genre labels
        – the labels provided by blip.tv cover two aspects:
          • topics: autos_and_vehicles, health, religion, ...
          • real genres: comedy, documentary, videoblogging, ...
        → would making a distinction between form and content be of interest?
  21. Conclusion
      • What to do next time?
        – Has everything been said? Should we leave the task unchanged?
        – If not, we have to define another orientation. Should we focus on another aspect of the content? Interaction, mood, user intention regarding the query
        – Define what needs to be changed: data, goals and use cases, metric, ...
      → a lot of points need to be considered
  22. Tagging Task Overview: more details about metadata
  23. Tagging Task Overview (outline)
      • Motivations & History
      • Datasets, Genres, Metadata & Examples
      • Task Goal & Evaluation Protocol
      • Participants, Features & Methods
      • Resources & Main Results
      • Conclusion
  24. Tagging Task Overview: Examples
      • Example of metadata from video Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659.flv
        Tag: 1012 literature (genre label from blip.tv)

        <video>
          <title> <![CDATA[One Minute Rumpole and the Angel Of Death]]> </title>
          <description> <![CDATA["Rumpole and the Angel of Death," by John Mortimer, …]]> </description>
          <explicit> false </explicit>
          <duration> 66 </duration>
          <url> http://blip.tv/file/1271048 </url>
          <license>
            <type> Creative Commons Attribution-NonCommercial 2.0 </type>
            <id> 4 </id>
          </license>
          …
  25. Tagging Task Overview: Examples (continued)
      • Example of metadata (continued)

        <tags>                        <!-- tags given by the uploader -->
          <string> oneminutecritic </string>
          <string> fvrl </string>
          <string> vancouver </string>
          <string> library </string>
          <string> books </string>
        </tags>
        <uploader>
          <uid> 112708 </uid>         <!-- ID of the uploader -->
          <login> crashsolo </login>
        </uploader>
        <file>
          <filename> Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659.flv </filename>
          <link> http://blip.tv/file/get/… </link>
          <size> 3745110 </size>
        </file>
        <comments />
        </video>
  26. Tagging Task Overview: Examples
      • Video data (420,000 shots and keyframes)
        Video: Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659.flv
        Tag: 1012 literature (genre label from blip.tv)

        <?xml version="1.0" encoding="utf-8" ?>
        <Segmentation>
          <CreationID> Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659 </CreationID>
          <InitialFrameID> Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659/394.jpg </InitialFrameID>
          <Segments type="SHOT">
            <!-- shot boundaries; one keyframe per shot -->
            <Segment start="T00:00:00:000F1000" end="T00:00:56:357F1000">
              <Index> 0 </Index>
              <KeyFrameID time="T00:00:28:142F1000"> CrashsoloOneMinuteRumpoleAndTheAngelOfDeath659/394.jpg </KeyFrameID>
            </Segment>
            ...
          </Segments>
        </Segmentation>
  27. Tagging Task Overview: Metadata
      • Social data (8,856 unique Twitter users)
        Video: Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659.flv
        Tag: 1012 literature (genre label from blip.tv)
        Code: 1271048
        Post: http://blip.tv/file/1271048
        Tweet from http://twitter.com/crashsolo: "Posted One Minute Rumpole and the Angel Of Death to blip.tv: http://blip.tv/file/1271048"
      • Level 0: the user uploads a file on blip.tv and posts a tweet; level 1: the user's contacts; level 2: the contacts' own contacts
