Multimodal Music Tagging Task Overview
1. Multimodal Music Tagging Task
Nicola Orio – University of Padova
Cynthia C. S. Liem – Delft University of Technology
Geoffroy Peeters – UMR STMS IRCAM-CNRS, Paris
Markus Schedl – Johannes Kepler University, Linz
MediaEval, Pisa 05/10/2012 MusiClef: Multimodal Music Tagging Task 1
2. Multimodal music tagging
• Definition
• Songs of a commercial music library need to be categorized according to their usage in TV and radio broadcasts (e.g. soundtracks, jingles)
• Practical motivation
• The search for suitable music for video productions is a major activity for professionals and lay users alike
• Collaborative filtering systems are increasingly taking over this role
• Notwithstanding their known limitations: long tail, cold start…
• Annotating professional music libraries is another important professional activity
3. Human assessment
Different sources of information are routinely exploited by professionals to overcome the limitations of individual media
4. Goals of MusiClef
• To focus evaluation on professional application scenarios
• Textual description of music items
• To enable replication of experiments and results
• The feature extraction phase is crucial – released features computed with a public, open-source library (MIRToolbox)
• To promote the exploitation of multimodal sources of information
• Content (audio) + Context (tags & webpages)
• To disseminate music related initiatives
• Outside the music information retrieval community
5. Evaluation initiatives – 1
• MIREX (since 2004)
• Community-based selection of tasks
• Many tasks address audio feature extraction algorithms
• Participants submit algorithms that are run by organizers
• Music files are not shared with participants
• Million Song Dataset (since 2011)
• Task on music recommendation proposed by organizers
• Audio features are computed using proprietary algorithms
• Only features are shared with participants
6. Evaluation initiatives – 2
• Quaero-Eval (since 2012)
• Tasks agreed with participants
• Strategies to ensure public access to evaluation results
• Participants run training experiments on a shared repository
• Test-set runs are performed by the organizers
7. Test collection – 1
• Individual songs of pop and rock music
• 1355 songs (from 218 artists)
• train (975) and test (380) split
• Social tags
• Gathered from Last.fm API
• Multilingual sets of Web pages related to artists+albums
• Mined by querying Google
• Acoustic features: MFCCs (using MIRToolbox) with a window length of 200 ms and 50% overlap
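As a rough sketch of the windowing behind the released features, the framing step can be written as follows (illustrative Python; the actual extraction used MIRToolbox in MATLAB, and the 44.1 kHz sample rate here is an assumption, not stated on the slide):

```python
import numpy as np

def frame_signal(signal, sr=44100, win_ms=200, overlap=0.5):
    """Slice a mono signal into overlapping analysis windows
    (200 ms frames with 50% overlap, as used for the released MFCCs)."""
    win = int(sr * win_ms / 1000)      # samples per window
    hop = int(win * (1 - overlap))     # hop size giving 50% overlap
    n_frames = 1 + max(0, (len(signal) - win) // hop)
    return np.stack([signal[i * hop : i * hop + win]
                     for i in range(n_frames)])

# 1 second of silence at 44.1 kHz -> nine 200 ms frames
frames = frame_signal(np.zeros(44100))
```

MFCCs would then be computed per frame; only the framing parameters are fixed by the task description.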
8. Test collection – 2
• Test collection created starting from Rolling Stone's “500 Greatest Songs of All Time”
• Expected high number of social tags and web pages
• Ground truth created by experts in the domain
• 355 tags selected (167 genre, 288 usage)
• Tags associated with fewer than 20 songs were discarded
• Reference implementation in Matlab
• Participants have an example for running a complete experiment
• Evaluation code already made available
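The tag-pruning step above can be sketched as follows (illustrative Python; `song_tags` and `filter_tags` are hypothetical names, not part of the released MATLAB code):

```python
from collections import Counter

def filter_tags(song_tags, min_songs=20):
    """Keep only tags associated with at least `min_songs` songs.
    `song_tags` maps a song id to its set of tags."""
    counts = Counter(tag for tags in song_tags.values() for tag in tags)
    return {tag for tag, c in counts.items() if c >= min_songs}
```

With the threshold of 20 used for the collection, rare tags that cannot support a meaningful train/test evaluation are dropped.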
9. Evaluation measures
• Standard IR measures
• Accuracy
• Precision
• Recall
• Specificity
• F-measure
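All five measures can be computed per tag from a binary confusion matrix; a minimal Python sketch (the released evaluation code itself is in MATLAB):

```python
def binary_measures(tp, fp, tn, fn):
    """Standard IR measures from per-tag confusion counts:
    true/false positives (tp, fp) and true/false negatives (tn, fn)."""
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    precision   = tp / (tp + fp) if tp + fp else 0.0
    recall      = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f_measure   = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return accuracy, precision, recall, specificity, f_measure
```

Overall scores are then obtained by averaging these per-tag values across all tags.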
10. Examining tags more closely
• Some tags are more equal than others…
e.g. 'hard rock', 'ballroom', 'melancholic', 'travel', 'countryside', 'bright'
• Thus, we propose to also analyze results employing a higher-level tag categorization
11. Tag categorization – 1
• Affective, mood-related aspects:
• activity: the amount of perceived music activity, without implying strong positive or negative affective qualities (e.g. 'fast', 'mellow', 'lazy')
• affective state: affective qualities that can only be connected and attributed to living beings (e.g. 'aggressive', 'hopeful')
• atmosphere: affective qualities that can be connected to environments (e.g. 'chaotic', 'intimate').
12. Tag categorization – 2
• Situation, time and space aspects of the music:
• Physical situation: concrete physical environments (e.g. 'city', 'night').
• Occasion: implications of time and space, typically connected to social events (e.g. 'holiday', 'glamour').
• Sociocultural genre (e.g. 'new wave', 'r&b', 'punk')
• Sound qualities:
• timbral aspects (e.g. 'acoustic', 'bright')
• temporal aspects (e.g. 'beat', 'groove').
• Other (e.g. 'catchy', 'evocative').
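A possible lookup for this categorization, built from the example tags on these two slides only (illustrative Python; the organizers' full mapping over all task tags is not reproduced here):

```python
# Hypothetical tag -> category table, using only the slide examples.
TAG_CATEGORY = {
    'fast': 'activity', 'mellow': 'activity', 'lazy': 'activity',
    'aggressive': 'affective state', 'hopeful': 'affective state',
    'chaotic': 'atmosphere', 'intimate': 'atmosphere',
    'city': 'situation: physical', 'night': 'situation: physical',
    'holiday': 'situation: occasion', 'glamour': 'situation: occasion',
    'new wave': 'sociocultural genre', 'r&b': 'sociocultural genre',
    'punk': 'sociocultural genre',
    'acoustic': 'sound: timbral', 'bright': 'sound: timbral',
    'beat': 'sound: temporal', 'groove': 'sound: temporal',
    'catchy': 'other', 'evocative': 'other',
}

def categorize(tag):
    """Map a social tag to its higher-level category ('other' if unknown)."""
    return TAG_CATEGORY.get(tag.lower(), 'other')
```

Per-tag results can then be aggregated per category, which is how the category-level baseline analysis below is framed.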
13. Reference implementation
• Made in MATLAB and released publicly
• Simple and straightforward approaches:
• Individual GMMs for audio, user tags, web pages
• Tagging process: 1-NN classification using symmetrized KL divergence
• Scenarios tested:
• Audio, user tags, web pages individually
• Majority vote
• Union
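A minimal sketch of the tagging and fusion steps in Python, simplifying each modality's GMM to a single diagonal Gaussian so the symmetrized KL divergence has a closed form (the actual reference implementation is in MATLAB and uses GMMs; all names here are hypothetical):

```python
from collections import Counter
import numpy as np

def kl_diag(m1, v1, m2, v2):
    """KL divergence N(m1, v1) || N(m2, v2) for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def skl(m1, v1, m2, v2):
    """Symmetrized KL, used as the 1-NN distance."""
    return kl_diag(m1, v1, m2, v2) + kl_diag(m2, v2, m1, v1)

def tag_1nn(query, train):
    """Assign the tag set of the nearest training song.
    `query` is (mean, var); `train` is a list of (mean, var, tags)."""
    dists = [skl(query[0], query[1], m, v) for m, v, _ in train]
    return train[int(np.argmin(dists))][2]

def fuse(tag_sets, mode='majority'):
    """Combine per-modality tag sets (audio, user tags, web pages)."""
    if mode == 'union':
        return set().union(*tag_sets)
    # majority vote: keep tags predicted by more than half the modalities
    counts = Counter(t for s in tag_sets for t in s)
    return {t for t, c in counts.items() if c > len(tag_sets) / 2}
```

Each modality produces its own 1-NN tag prediction; `fuse` then implements the majority-vote and union scenarios evaluated below.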
14. Baseline results – 1
• Evaluation of the submitted runs and of the reference implementation
• Results with different modalities over the full dataset
strategy     accuracy   recall   precision   specificity   F-measure
audio         0.894     0.148     0.127        0.939         0.126
tags          0.898     0.061     0.039        0.942         0.037
web pages     0.897     0.050     0.007        0.954         0.011
majority      0.880     0.123     0.086        0.922         0.086
union         0.824     0.240     0.115        0.845         0.134
15. Baseline results – 2
[Figure: baseline results broken down by tag category]
1. activity, energy
2. affective state
3. atmosphere
4. other
5. situation: occasion
6. situation: physical
7. sociocultural genre
8. sound: temporal
9. sound: timbral
16. Participation
• Initially a lot of interest: about 8 parties explicitly expressed interest
• But ultimately just one participant (LUTIN UserLab)
• Aggregation of estimators
• Currently investigating what happened to the 7 others
• So far, it appears ISMIR 2012 was inconveniently close
• The 3 other MusiClef co-organizers will discuss this there
17. Conclusions
• We established a multimodal music tagging benchmark task
• Special effort in facilitating deeper tag analysis
• We would like a 2013 multimodal music benchmark task
• Depending on survey input
• Depending on your input
18. Thank you for your attention!
For contact and more information: musiclef@dei.unipd.it