Overview of the MediaEval 2012 Tagging Task
  • In this overview, I won't detail again the main motivations of MediaEval; I will jump directly to what motivated the Tagging Task. The idea was first to focus on multimedia content available on the Internet, through semi-professional videos provided by Blip.tv. To take into account the "multi" aspect of multimedia, the use of features derived from various media was encouraged: speech, audio, visual content and associated metadata. The social aspect was introduced through the use of social data, and genre was considered as a means to access and retrieve such multimedia content.
  • A little bit of history first, before giving details about submissions and results. In 2010, the Tagging Task at MediaEval aimed at predicting the tags freely given by users uploading a video, but the great diversity and number of tags made the task difficult. In 2011 the idea was to reduce the number of tags by focusing on the genre labels provided by Blip.tv, and a subset of the data collected for the Wild Wild Web task was used. In 2012 we kept the same goal, but the amount of data was increased. At the same time, within the Quaero project, a video genre classification task was open to project participants. This task was evaluated internally in 2009 and 2010, but the small number of participants did not make it challenging. So for two years Quaero has joined MediaEval, and the Tagging Task evaluated at MediaEval became an external evaluation of the Quaero multimedia content recognition task.
  • The Tagging Task is based on a corpus created by the PetaMedia Network of Excellence. Videos were downloaded from blip.tv and selected according to two main criteria: the mention on Twitter of the corresponding show, and a Creative Commons license. More than fourteen thousand episodes were selected, representing more than three thousand hours of data. The corpus used this year is an extension of the dataset proposed by the Wild Wild Web task. All the videos were split into two subsets: one for development and one for test. In 2012 the amount of data was 7.5 times bigger: more than five thousand videos for the development set and almost ten thousand for the test set.
  • As last year, these videos are organised around 25 genres. Since the number of videos per genre is not balanced, a default category was created to group genres with fewer than 10 episodes. Some genres relate more to a topic, like sports or literature; others to a 'real genre', like comedy, documentary or videoblogging.
  • Each video comes along with different sources of information about its content. An XML file contains metadata from blip.tv: the title, the video description, the tags given by the user who uploaded the video, the uploader ID, and so on. Information about different levels of tweets mentioning the episode is also available. Automatic processing results are provided, giving information on the speech and visual content. For speech, as last year, LIMSI-VOCAPIA, a French research and development group, provided speech recognition transcripts in English and other languages. This year the speech research group of a French university (LIUM) also provided transcripts for English content. The visual content analysis was provided by the Technical University of Berlin. Let's show some examples.
  • Depending on the provider, the speech transcripts are available in different forms. LIUM provided one-best hypotheses, as well as word lattices and confusion networks. LIMSI-VOCAPIA provided transcripts in different languages according to a language identification strategy based on a language confidence score: above a threshold, the transcription uses the detected language; below it, the best score between English and the detected language decides.
  • The goal of the Tagging Task this year is the same as in 2011. When searching and browsing the Internet for videos, using the genre as a search or organisational criterion could be a good idea, but the main drawback is that most of the videos available may not be accurately or adequately tagged. So the idea of the Tagging Task is to automatically assign a genre label using sets of features derived from different media. That is what we did in 2011. So what is new in 2012? We provide a huge amount of data, to enable information retrieval as well as classification or machine learning approaches, with a more balanced dataset: a third of the videos for development, the other two thirds for test, each genre being balanced between dev and test sets.
  • The evaluation protocol to be followed by the participants was to use these data, separately or in a combined way, to predict the genre label for each video of the test set. In order to compare the results as well as the impact of the different media, 5 official runs were defined. Each participant had to submit up to 5 of these runs; extra runs were also allowed. The first run was based on audio and/or visual content. The second one had to focus on speech, through ASR results. The last three runs enable studying the impact of metadata: no metadata; metadata except the uploader ID, which boosted results last year; and all the metadata available. The groundtruth is the genre label associated with each video on blip.tv. The official metric is Mean Average Precision, evaluating the ranked retrieval results with respect to a set of queries concerning the genre.
  • This year we had 6 participating teams, fewer than last year's 10 submissions. Four of these participants were here last year, and we had a new one from Brazil. As they will present themselves just after, I am not going to detail this. Some systems are shown in grey because a task organisation member contributed to that work.
  • Emphasise the difference between the datasets.
  • I will talk about what motivates this task, and say a few words about its history. I will briefly describe the datasets, their content and the genres taken into account in this task. Then I will present the goal of the task and the evaluation protocol followed. I will try to give a synthetic view of the various features and methods used by the participants, give an overview of the resources actually used, and present the main results obtained. Before each participant gives you more details about their experiments, I will give some elements of conclusion.
  • In the metadata from Blip.tv we can find the genre label, the title, and the description.
  • A set of tags freely given by the uploader, the uploader ID, and other information.
  • The visual information gives a shot segmentation of the video, with, for each shot, its boundaries and one keyframe.
  • The social data from Twitter are based on the uploader's tweet announcing the video post, and on his contacts, and his contacts' own contacts, relaying the tweet.

Overview of the MediaEval 2012 Tagging Task: Presentation Transcript

  • 2012 Tagging Task Overview. Christoph Kofler, Delft University of Technology; Sebastian Schmiedeke, Technical University of Berlin; Isabelle Ferrané, University of Toulouse (IRIT). MediaEval Workshop, 4-5 October 2012, Pisa, Italy.
  • Motivations
    • MediaEval
      – evaluate new algorithms for multimedia access and retrieval
      – emphasize the multi in multimedia
      – focus on human and social aspects of multimedia tasks
    • Tagging Task
      – focus on semi-professional video on the Internet
      – use features derived from various media or information sources: speech, audio, visual content or associated metadata; social information
      – focus on tags related to the genre of the video
  • History
    • Tagging task at MediaEval
      – 2010: The Tagging Task (Wild Wild Web version): prediction of user tags; too many tags, great diversity
      – 2011: The Genre Tagging Task: genres from blip.tv labels (26 labels)
      – 2012: The Tagging Task: same labels but much more dev/test data
    • MediaEval joined by Quaero
      – 2009 & 2010: internal Quaero campaigns (Video Genre Classification); too few participants
      – 2011 & 2012: Tagging Task as an external Quaero evaluation
  • Datasets
    • A set of videos (ME12TT)
      – created by the PetaMedia Network of Excellence
      – downloaded from blip.tv
      – episodes of shows mentioned in Twitter messages
      – licensed under Creative Commons
      – 14,838 episodes from 2,249 shows, ~3,260 hours of data
      – extension of the MediaEval Wild Wild Web dataset (2010)
    • Split into development and test sets
      – 2011: 247 for development / 1,727 for test (1,974 videos)
      – 2012: 5,288 for development / 9,550 for test (7.5 times more)
  • Genres
    • 26 genre labels from blip.tv, same as in 2011: 25 genres + 1 default_category
      1000 art; 1001 autos_and_vehicles; 1002 business; 1003 citizen_journalism; 1004 comedy; 1005 conferences_and_other_events; 1006 default_category; 1007 documentary; 1008 educational; 1009 food_and_drink; 1010 gaming; 1011 health; 1012 literature; 1013 movies_and_television; 1014 music_and_entertainment; 1015 personal_or_auto-biographical; 1016 politics; 1017 religion; 1018 school_and_education; 1019 sports; 1020 technology; 1021 the_environment; 1022 the_mainstream_media; 1023 travel; 1024 videoblogging; 1025 web_development_and_sites
  • Information available (1/2)
    • From different sources
      – title, description, user tags, uploader ID, duration
      – tweets mentioning the shows
      – automatic processing:
        • automatic speech recognition (ASR): English transcripts, plus some other languages (new)
        • shot boundaries and 1 keyframe per shot
  • Information available (2/2)
    • Focus on speech data
      – LIUM transcripts (5,084 files from dev / 6,879 files from test)
        • English
        • one-best hypotheses (NIST CTM format)
        • word lattices (SLF HTK format), 4-gram topology (new)
        • confusion networks (ATT FSM-like format)
      – LIMSI-VOCAPIA transcripts (5,237 files from dev / 7,215 files from test)
        • English, French, Spanish, Dutch
        • language identification strategy (language confidence score): if score > 0.8, transcription based on the detected language; else the best score between English and the detected language decides
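The LIMSI-VOCAPIA selection rule above can be sketched as a small function. This is a hypothetical illustration of the decision logic described on the slide (the function name and score arguments are invented for the example), not the actual system code.

```python
# Sketch of the language-selection rule: trust the detected language when
# the identification confidence is high enough, otherwise fall back to
# whichever of English or the detected language scored best.

THRESHOLD = 0.8  # confidence threshold stated on the slide

def pick_transcription(detected_lang, confidence, english_score, detected_score):
    """Return the language whose transcription is kept for a video."""
    if confidence > THRESHOLD:
        return detected_lang
    # Low confidence: compare English against the detected language.
    return "English" if english_score >= detected_score else detected_lang

print(pick_transcription("French", 0.92, 0.30, 0.55))
print(pick_transcription("Dutch", 0.50, 0.70, 0.40))
```

With a confident detection the French transcript is kept; with a low-confidence Dutch detection the English score wins.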
  • Task Goal
    • Same as in 2011
      – searching and browsing the Internet for video
      – using genre as a search or organisational criterion
      – videos may not be accurately or adequately tagged
      – automatically assign genre labels using features derived from speech, audio, visual content, associated textual or social information
    • What's new in 2012
      – a huge amount of data (7.5 times more)
      – enable information retrieval as well as classification approaches, with more balanced datasets (1/3 dev, 2/3 test)
      – each genre was "equally" distributed between both sets
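The per-genre 1/3 dev / 2/3 test split described above amounts to a stratified partition. A minimal sketch, assuming videos are given as (id, genre) pairs; the function name, shuffling and seed are illustrative assumptions, not the organisers' actual script.

```python
import random
from collections import defaultdict

def stratified_split(videos, seed=0):
    """Split (video_id, genre) pairs so each genre contributes
    one third to development and two thirds to test."""
    by_genre = defaultdict(list)
    for vid, genre in videos:
        by_genre[genre].append(vid)

    rng = random.Random(seed)
    dev, test = [], []
    for genre, vids in by_genre.items():
        rng.shuffle(vids)             # avoid ordering bias within a genre
        cut = len(vids) // 3          # one third for development
        dev.extend(vids[:cut])
        test.extend(vids[cut:])
    return dev, test

videos = [(i, "comedy") for i in range(9)] + [(i + 9, "art") for i in range(6)]
dev, test = stratified_split(videos)
print(len(dev), len(test))
```

With 9 comedy and 6 art videos, development gets 3 + 2 = 5 videos and test gets the remaining 10, preserving the genre ratio in both sets.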
  • Genre distribution over datasets (figure; part 1: genres 1000 to 1012, part 2: genres 1013 to 1025)
  • Evaluation Protocol
    • Task: predict the genre label for each video of the test set
    • Submissions: up to 5 runs, representing different approaches
      – Run 1: audio and/or visual information (including information about shots and keyframes)
      – Run 2: ASR transcripts
      – Run 3: all data except metadata
      – Run 4: all data except the uploader ID (new; the ID was used in the 2011 campaign)
      – Run 5: all data
    • Groundtruth: genre label associated with each video
    • Metric: Mean Average Precision (MAP), to evaluate the ranked retrieval results over a set of queries Q
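The MAP metric above can be stated compactly: for each genre query, average the precision at each rank where a relevant video appears, then average over all queries. A minimal self-contained sketch (not the official scoring tool):

```python
def average_precision(ranked, relevant):
    """AP for one query: mean of precision@k over ranks k
    where a relevant video appears."""
    hits, total = 0, 0.0
    for k, video in enumerate(ranked, start=1):
        if video in relevant:
            hits += 1
            total += hits / k          # precision at this relevant rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """`queries` is a list of (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, s) for r, s in queries) / len(queries)

# Toy example: relevant videos "a" and "c" ranked 1st and 3rd.
print(average_precision(["a", "b", "c"], {"a", "c"}))  # (1/1 + 2/3) / 2
```

For the toy query the AP is (1 + 2/3) / 2 = 5/6, rewarding systems that rank relevant videos early.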
  • Participants
    • 2012: 20 registered, 6 submissions (10 in 2011)
      – 5 veterans, 1 new participant
      – 3 organiser-connected teams, 5 countries
    • Systems and supporting projects
      – KIT: Karlsruhe Institute of Technology, Germany (Quaero)
      – UNICAMP-UFMG (new): University of Campinas & Federal University of Minas Gerais, Brazil (FAPEMIG, FAPESP, CAPES & CNPq)
      – ARF: University Politehnica of Bucharest, Romania; Johannes Kepler University & Research Institute of Artificial Intelligence, Austria; Polytech Annecy-Chambery, France (EXCEL POSDRU)
      – TUB: Technical University of Berlin, Germany (EU FP7 VideoSense)
      – TUD-MM: Delft University of Technology, The Netherlands
      – TUD: Delft University of Technology, The Netherlands
  • Features (used in 2011 only / in 2011 & 2012 / in 2012 only)
      – ASR: transcripts in English only, or with translation of non-English content; stop-word filtering; semantic similarity; LIMSI and LIUM transcripts; stemming; TF-IDF; bags of words (BoW); top terms; LDA
      – Audio: MFCC, LPC, LPS, ZCR; spectrogram as an image; rhythm, timbre, onset strength, energy, loudness, ...
      – Visual content: on images: colour, texture, SIFT / rgbSIFT / SURF / HoG, face detection; on video: self-similarity matrix, shot boundaries, shot length, transitions between shots, motion features; on keyframes: bags of visual words
      – Metadata: title, tags, filename, show ID, description, uploader ID, BoW
      – Others: videos from YouTube and blip.tv; web pages from Google, Wikipedia; synonyms, hyponyms, domain terms from WordNet; BoW from Delicious; video distribution over genres (dev); social data
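Several systems above build TF-IDF weights over ASR or metadata bags of words. A pure-Python sketch of that feature scheme (an illustration of the general technique, not any participant's implementation):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF vectors for tokenized documents.

    `docs` is a list of token lists (e.g. words of an ASR transcript).
    Returns one {term: weight} dict per document, where weight is
    term frequency times log(N / document frequency).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

docs = [["comedy", "show"], ["comedy", "news"]]
vectors = tf_idf(docs)
```

Here "comedy" appears in every document, so its IDF (and weight) is zero, while "show" and "news" get positive weights: exactly the effect that makes TF-IDF useful for genre discrimination.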
  • Methods (1/2)
    • Machine learning approach (ML)
      – parameter extraction from audio, visual, textual data; early fusion
      – feature transformation (PCA), selection (mutual information, term frequency), dimension reduction (LDA)
      – classification methods, supervised or unsupervised: k-NN, SVM, Naive Bayes, DBN, GMM, NN; k-means (clustering); CRF (Conditional Random Fields), decision trees (Random Forest)
      – training step, cross-validation approach and stacking
      – fusion of classifier results / late fusion / majority voting
  • Methods (2/2)
    • Information retrieval approach (IR)
      – text preprocessing & text indexing
      – query and ranking list, query expansion and re-ranking methods
      – fusion of ranked lists from different modalities (RRF)
      – selection of the category with the highest ranking score
    • Evolution since 2011
      – 2011: 2 distinct communities, ML or IR approach
      – 2012: mainly ML approaches, or mixed ones
  • Resources
    • Which ones were used? (table of systems KIT, UNICAMP-UFMG, ARF, TUB, TUD-MM, TUD against resource types: audio, ASR, visual, metadata, social, other)
    • Evolution since 2011
      – 2011: use of external data, mainly from the web, and of social data; 1 participant especially interested in the social aspect
      – 2012: no external data, no social data
  • Main results (1/2): each participant's best result (system / best run / MAP / approach / features / method)
      – KIT: Run 3, 0.3499 (extra Run 6*, 0.3581): ML; colour, texture, rgbSIFT (+ video distribution over genres); SVM
      – UNICAMP-UFMG: Run 4, 0.2112: ML; BoW; stacking
      – ARF: Run 5, 0.3793: ML; TF-IDF on metadata & LIMSI ASR; linear SVM
      – TUB: Run 4, 0.5225: ML; BoW on metadata; mutual information, Naive Bayes
      – TUD-MM: Run 4, 0.3675: ML & IR; TF on visual words + ASR & metadata; linear SVM, Reciprocal Rank Fusion
      – TUD: Run 2, 0.25: ML; LIUM ASR one-best; DBN
    • Baseline results: all videos in the default category, MAP = 0.0063; videos randomly classified, MAP = 0.002
  • Main results (2/2): official run comparison (MAP); columns: Run 1 (audio/visual), Run 2 (ASR), Run 3 (exc. metadata), Run 4 (exc. uploader ID), Run 5 (all), other runs
      – KIT: 0.3008 (visual 1), 0.3461, 0.2329 (visual 2), 0.1448, 0.3499 (fusion), 0.3581
      – UNICAMP-UFMG: 0.1238, 0.2112
      – ARF: 0.1941 (visual & audio) / 0.1892 (audio), 0.2174, 0.2204, 0.3793
      – TUB: 0.2301, 0.1035, 0.2259, 0.5225, 0.3304
      – TUD-MM: 0.0061, 0.3127, 0.2279, 0.3675, 0.2157; other runs: 0.0577, 0.0047
      – TUD: 0.23 / 0.25, 0.10 / 0.09
    • Baseline results: all videos in the default category, MAP = 0.0063; videos randomly classified, MAP = 0.002
  • Lessons learned or open questions (1/3)
    • About data
      – 2011: small development dataset (~247 videos); difficulties to train models, external resources required
      – 2012: huge amount of development data (~5,288 videos); enough to train models in a machine learning approach
      – impact on the type of methods used (ML versus IR) and on the need/use of external data
      – no use of social data this year: is it a question of community? It can be disappointing regarding the MediaEval motivations
  • Lessons learned or open questions (2/3)
    • About results
      – 2011: best system, MAP 0.5626, using audio, ASR, visual, metadata including the uploader ID, and external data from Google and YouTube
      – 2012: best non-organiser-connected system, MAP 0.3793 (TF-IDF on ASR, metadata including the uploader ID, no visual data); best organiser-connected system, MAP 0.5225 (BoW on metadata without the uploader ID, no visual data)
      – results difficult to compare given the great diversity of features, of methods, and of systems combining both
      – monomedia (visual only; ASR only) or multimedia contributions
      – a failure analysis should help to understand "what impacts what?"
  • Lessons learned or open questions (3/3)
    • About the metric
      – MAP as the official metric
      – some participants provided other types of results, in terms of correct classification rate or F-score, or detailed AP results per genre
      – would analysing the confusion between genres be of interest?
    • About genre labels
      – labels provided by blip.tv cover two aspects: topics (Autos_and_vehicles, Health, Religion, ...) and real genres (Comedy, Documentary, Videoblogging, ...)
      – would making a distinction between form and content be of interest?
  • Conclusion: what to do next time?
      – Has everything been said? Should we leave the task unchanged?
      – If not, we have to define another orientation: should we focus on another aspect of the content (interaction, mood, user intention regarding the query)?
      – Define what needs to be changed: data, goals and use cases, metric, ...
      – A lot of points need to be considered
  • Tagging Task Overview: more details about metadata
  • Tagging Task Overview (outline)
      – Motivations & History
      – Datasets, Genres, Metadata & Examples
      – Task Goal & Evaluation Protocol
      – Participants, Features & Methods
      – Resources & Main Results
      – Conclusion
  • Tagging Task Overview, Examples: metadata for video Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659.flv (tag: 1012 literature, the genre label from blip.tv)
      <video>
        <title> <![CDATA[One Minute Rumpole and the Angel Of Death]]> </title>
        <description> <![CDATA["Rumpole and the Angel of Death," by John Mortimer, …]]> </description>
        <explicit> false </explicit>
        <duration> 66 </duration>
        <url> http://blip.tv/file/1271048 </url>
        <license> <type> Creative Commons Attribution-NonCommercial 2.0 </type> <id> 4 </id> </license>
        …
  • Tagging Task Overview, Examples: metadata (continued)
      <tags>                                    tags given by the uploader
        <string> oneminutecritic </string>
        <string> fvrl </string>
        <string> vancouver </string>
        <string> library </string>
        <string> books </string>
      </tags>
      <uploader>                                ID of the uploader
        <uid> 112708 </uid>
        <login> crashsolo </login>
      </uploader>
      <file>
        <filename> Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659.flv </filename>
        <link> http://blip.tv/file/get/… </link>
        <size> 3745110 </size>
      </file>
      <comments />
      </video>
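Fields like those in the metadata example above can be extracted with the Python standard library. A minimal sketch; the trimmed XML string stands in for a real blip.tv metadata file, with element names taken from the example.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for a blip.tv metadata file (same element names
# as the example above).
SAMPLE = """<video>
  <title><![CDATA[One Minute Rumpole and the Angel Of Death]]></title>
  <tags>
    <string>oneminutecritic</string>
    <string>library</string>
  </tags>
  <uploader><uid>112708</uid><login>crashsolo</login></uploader>
</video>"""

root = ET.fromstring(SAMPLE)
title = root.findtext("title")                       # CDATA becomes plain text
tags = [s.text for s in root.findall("tags/string")]  # user-given tags
uploader_id = root.findtext("uploader/uid")          # the ID that boosted 2011 results

print(title, tags, uploader_id)
```

Participants feeding metadata into BoW or TF-IDF features would typically concatenate the title, description and tags retrieved this way.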
  • Tagging Task Overview, Examples: video data (420,000 shots and keyframes), for video Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659.flv (tag: 1012 literature, the genre label from blip.tv)
      <?xml version="1.0" encoding="utf-8" ?>
      <Segmentation>
        <CreationID> Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659 </CreationID>
        <InitialFrameID> Crashsolo-OneMinuteRumpoleAndTheAngelOfDeath659/394.jpg </InitialFrameID>
        <Segments type="SHOT">
          <Segment start="T00:00:00:000F1000" end="T00:00:56:357F1000">   shot boundaries
            <Index> 0 </Index>
            <KeyFrameID time="T00:00:28:142F1000"> CrashsoloOneMinuteRumpoleAndTheAngelOfDeath659/394.jpg </KeyFrameID>   one keyframe per shot
          </Segment>
          ...
        </Segments>
      </Segmentation>
  • Tagging Task Overview, Metadata: social data (8,856 unique Twitter users), for the same video (tag: 1012 literature; code: 1271048; post: http://blip.tv/file/1271048)
      http://twitter.com/crashsolo: "Posted One Minute Rumpole and the Angel Of Death to blip.tv: http://blip.tv/file/1271048"
      – Level 0: the user uploads a file on blip.tv and posts a tweet
      – Level 1: the user's contacts relay the tweet
      – Level 2: the contacts' own contacts relay it