Emotional expression is an important property of music, so its emotional characteristics are a natural basis for music indexing and recommendation. The Emotion in Music task addresses automatic music emotion prediction and was held for the second year in 2014. Compared to the previous year, we modified the task by offering a new feature development subtask and releasing a new evaluation set. We employed a crowdsourcing approach to collect the data, using Amazon Mechanical Turk. The dataset consists of music licensed under Creative Commons from the Free Music Archive, which can be shared freely without restrictions. In this paper we describe the dataset collection, annotations, and evaluation criteria, as well as the required and optional runs.
http://ceur-ws.org/Vol-1263/mediaeval2014_submission_33.pdf
Emotion in Music Task at MediaEval 2014
1. Emotion in Music: Task Overview
Anna Aljanaki (1), Mohammad Soleymani (2), Yi-Hsuan Yang (3)
(1) Utrecht University, Netherlands
(2) University of Geneva, Switzerland
(3) Academia Sinica, Taiwan
16-17 October, MediaEval 2014
2. Task definition
Description
- A benchmark for music emotion recognition systems (similar to, but distinct from, MIREX)
- Focusing on audio analysis (optionally, metadata)
Two subtasks
- Dynamic task (required): predict arousal and valence values for a song every 0.5 s.
- Feature design task: design new or rework existing audio features to estimate emotion for the whole 45 s musical excerpt or dynamically.
3. Ground truth
Development set
- Collected for the Emotion in Music Brave New Task in 2013.
- 744 files.
- 10 annotators per file.
Test set
- Additional data collected in 2014.
- 1000 files.
- 10 annotators per file.
4. Ground truth. Music
- 1744 musical excerpts of 45 seconds (randomly sampled) from the Free Music Archive (freemusicarchive.org).
- Curated music licensed under Creative Commons.
- Manually checked for quality.
- 10 genres: Rock, Pop, Electronic, Hip-Hop, Classical, Soul and RnB, Country, Folk, International, Jazz
5. Ground truth. Annotations.
Collecting annotations.
- Amazon Mechanical Turk (mturk.com).
- 10 Mechanical Turk workers annotated each song.
- We averaged the 10 annotations and provided participants with:
  - Continuous annotations of valence and arousal (1 label every 0.5 s).
  - Static annotations of valence and arousal for each file (independent of the continuous ones).
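The averaging step above can be sketched as follows. This is a minimal illustration, assuming every worker's continuous annotation is already sampled on the same 0.5 s grid and has the same length. The function name `average_annotations` and the sample ratings are hypothetical, and any normalization applied before averaging is not shown.

```python
def average_annotations(worker_series):
    """Average several equally long annotation series frame by frame."""
    if not worker_series:
        raise ValueError("need at least one annotation series")
    length = len(worker_series[0])
    if any(len(series) != length for series in worker_series):
        raise ValueError("all series must have the same number of frames")
    n_workers = len(worker_series)
    # zip(*...) groups the values different workers gave to the same frame
    return [sum(frame) / n_workers for frame in zip(*worker_series)]


# Hypothetical arousal ratings from three workers over four 0.5 s frames
workers = [
    [0.25, 0.5, 0.5, 0.75],
    [0.0, 0.25, 0.5, 0.5],
    [0.5, 0.75, 0.5, 0.25],
]
mean_arousal = average_annotations(workers)
```

The same function would apply to valence, with one call per song and dimension.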
6. Ground truth. Annotations.
Worker Instructions on Valence and Arousal Space
The workers received the following instructions introducing the valence-arousal space.
- Valence refers to the degree of positive or negative emotions one experiences from a given piece of music.
  - Positive valence: happiness, joy, excitement.
  - Negative valence: sadness, fear, anxiety, anger.
- Arousal refers to the intensity of the music clip.
  - High arousal: loud, energetic, emotionally engaging.
  - Low arousal: quiet, peaceful, repetitive.
8. Ground truth. Annotations.
Some statistics
- 250 out of 424 workers (59%) passed the qualification test.
- Annotators took 10.5 minutes on average to complete a task (3 songs), and we paid $0.40 per task.
- 99% of the time, the song was unfamiliar to the annotator.
- In general, annotators enjoyed the music (on a scale from 1 to 5, mean liking = 3.32 ± 1.22, median = 4).
9. Ground truth. Annotations.
Static annotations.
A measure of inter-annotator agreement, Krippendorff’s alpha:
- Valence: 0.22
- Arousal: 0.37
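For readers unfamiliar with the statistic, the sketch below computes Krippendorff's alpha as 1 minus the ratio of observed to expected disagreement. It assumes the interval (squared-difference) metric and one list of ratings per song; the slides do not state which metric or preprocessing was used, so the function name `krippendorff_alpha_interval` and the toy ratings are illustrative only.

```python
def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval (squared-difference) metric.

    `units` holds one list of ratings per rated item. Items with fewer
    than two ratings cannot be paired and are skipped.
    """
    units = [u for u in units if len(u) >= 2]
    pooled = [v for u in units for v in u]
    n = len(pooled)
    if n < 2:
        raise ValueError("need at least two pairable ratings")

    # Observed disagreement: squared differences between ratings of the same item
    d_o = 0.0
    for u in units:
        within = sum((a - b) ** 2 for a in u for b in u)  # a == b contributes 0
        d_o += within / (len(u) - 1)
    d_o /= n

    # Expected disagreement: squared differences between all pooled ratings
    d_e = sum((a - b) ** 2 for a in pooled for b in pooled) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("all ratings are identical; alpha is undefined")
    return 1.0 - d_o / d_e


# Toy data (assumption): three songs, two raters each, in full agreement
alpha_perfect = krippendorff_alpha_interval([[1, 1], [2, 2], [3, 3]])
# Toy data (assumption): raters disagree systematically on every song
alpha_disagree = krippendorff_alpha_interval([[1, 2], [1, 2]])
```

Values close to 1 indicate strong agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.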
10. Ground truth. Annotations.
Dynamic annotations.
A measure of inter-annotator agreement, Kendall’s W, after discarding the first 15 seconds:
- Valence: 0.16 ± 0.11
- Arousal: 0.2 ± 0.13
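Kendall's W can be computed from rank sums, as in the sketch below. It assumes that, for a single song, each annotator is a rater and each time frame is a ranked item, and that W is then summarized across songs; the slides do not spell this out. The formula used is the basic one without tie correction, and the function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def kendalls_w(ratings):
    """Kendall's W for m raters (rows) scoring the same n items (columns)."""
    m = len(ratings)
    n = len(ratings[0])
    if m < 2 or n < 2:
        raise ValueError("need at least two raters and two items")
    ranked = [average_ranks(row) for row in ratings]
    rank_sums = [sum(column) for column in zip(*ranked)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    # W = 12 S / (m^2 (n^3 - n)), without correction for ties
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Toy data (assumption): two annotators, three time frames
w_agree = kendalls_w([[0.1, 0.4, 0.3], [0.2, 0.9, 0.5]])  # same ordering
w_oppose = kendalls_w([[1, 2, 3], [3, 2, 1]])             # reversed ordering
```

W ranges from 0 (no agreement on the ordering of frames) to 1 (identical orderings).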
11. Evaluation
Dynamic subtask evaluation
We use Pearson’s correlation coefficient (rho) and RMSE as metrics in the following steps:
1. Calculate Pearson’s rho between predictions and ground truth for each song separately.
2. Average across songs, separately for valence and for arousal.
3. Rank all submissions for each dimension based on the averaged rho.
4. If the difference between two submissions is not significant according to the one-sided Wilcoxon test (p > 0.05), use RMSE to break the tie.
5. If the ranking changes, repeat the significance test between neighbouring pairs (as in bubble sort).
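Steps 1 and 2 of the procedure above can be sketched as follows, for one run and one dimension. The dictionaries keyed by song identifier, the function names, and the choice to score a constant series as rho = 0 are assumptions made for illustration; the Wilcoxon test and the RMSE tie-break are not shown.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equally long series."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant series is scored as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def rmse(pred, truth):
    """Root mean squared error between predictions and ground truth."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def evaluate_run(predictions, ground_truth):
    """Return (mean rho, mean RMSE) across songs for one dimension."""
    rhos, errors = [], []
    for song_id, truth in ground_truth.items():
        pred = predictions[song_id]
        rhos.append(pearson(pred, truth))  # step 1: one rho per song
        errors.append(rmse(pred, truth))
    # step 2: average across songs
    return sum(rhos) / len(rhos), sum(errors) / len(errors)


# Hypothetical arousal predictions and ground truth for two songs
predictions = {"song_a": [1, 2, 3], "song_b": [3, 2, 1]}
ground_truth = {"song_a": [2, 4, 6], "song_b": [1, 2, 3]}
mean_rho, mean_rmse = evaluate_run(predictions, ground_truth)
```

In this toy case one song is perfectly correlated and the other perfectly anti-correlated, so the averaged rho is 0 even though neither prediction is uninformative, which illustrates why RMSE is kept as a second metric.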
Feature design subtask evaluation
Same procedure, but Pearson’s rho is calculated for all the songs in the test set at once.
12. Baseline
The organizers did not submit runs of their own and provided only a simple baseline for participants to beat.
- Five features: spectral flux, HCDF (harmonic change detection function), loudness, roughness and zero-crossing rate.
- Linear regression.
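A linear regression of this kind can be sketched in plain Python by solving the normal equations. This is an illustration of the technique only, not the organizers' code: feature extraction is omitted, the toy data use two made-up features instead of the five listed, and the names `fit_linear_regression` and `predict` are hypothetical.

```python
def fit_linear_regression(features, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    Returns the weights [intercept, w_1, ..., w_k].
    """
    rows = [[1.0] + list(r) for r in features]  # prepend the intercept column
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("features are linearly dependent")
        a[col], a[pivot] = a[pivot], a[col]
        pivot_value = a[col][col]
        a[col] = [v / pivot_value for v in a[col]]
        for r in range(k):
            if r != col:
                factor = a[r][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[i][k] for i in range(k)]


def predict(weights, feature_vector):
    """Apply the fitted weights to one feature vector."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], feature_vector))


# Toy data (assumption): targets follow y = 1 + 2*f1 + 3*f2 exactly
train_x = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
train_y = [1, 3, 4, 6, 8]
weights = fit_linear_regression(train_x, train_y)
estimate = predict(weights, [2, 2])
```

For the dynamic task, such a model would be applied to the feature vector of each 0.5 s frame, once for arousal and once for valence.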
13. Results - Arousal
7 teams crossed the finish line, 6 teams beat the baseline (at
least for arousal).
Dynamic task
Rank  Team              Arousal ρ      RMSE
1     TUMMISP           0.35 ± 0.45    0.1 ± 0.05
2     SAIL              0.28 ± 0.50    0.13 ± 0.07
3     UoA               0.21 ± 0.57    0.08 ± 0.05
4     Beatsens          0.23 ± 0.56    0.12 ± 0.05
5     Rainbow           0.18 ± 0.60    0.12 ± 0.07
6     THUHCSIL          0.17 ± 0.41    0.12 ± 0.05
7     Baseline          0.18 ± 0.36    0.14 ± 0.06
8     Average baseline  0              0.39 ± 0.03
14. Results - Valence
Dynamic task
Teams ranked above the baseline beat it; the other teams share its rank.
Rank  Team              Valence ρ      RMSE
1     TUMMISP           0.20 ± 0.49    0.08 ± 0.05
2     Beatsens          0.12 ± 0.55    0.09 ± 0.05
3     SAIL              0.15 ± 0.5     0.10 ± 0.06
4     UoA               0.17 ± 0.5     0.14 ± 0.07
5     THUHCSIL          0.10 ± 0.37    0.09 ± 0.05
5     Rainbow           0.07 ± 0.29    0.10 ± 0.06
5     Baseline          0.11 ± 0.34    0.10 ± 0.06
6     Average baseline  0              0.34 ± 0.03
15. Results
Only one team designed new features.
Feature design - static evaluation.
        Arousal          Valence
        [?]²    RMSE     [?]²    RMSE
SAIL    0.53    0.32     0.28    0.27
Feature design - dynamic evaluation.
        Arousal          Valence
        ρ       RMSE     ρ       RMSE
SAIL    0.22    0.12     0.11    0.09
18. Approaches
Beatsens
- 54 features from MIRToolbox.
- Annotations are modeled as a continuous conditional random field (CCRF) process.
- SVR is used as the base regressor.
- Best performance is achieved by a combination of spectral, dynamic and rhythmic features, of which the most important were MFCCs.
19. Approaches
SAIL
Designed three types of new features:
1. Compressibility features
2. Median spectral band energy
3. Spectral centre of mass
Uses partial least squares regression in combination with Haar coefficients to predict the dynamic ratings based on features from the whole song.