The Acoustic Emotion Gaussians Model for Emotion-based Music Annotation and Retrieval

1
The Acoustic Emotion Gaussians
Model for Emotion-based Music
Annotation and Retrieval
Ju-Chiang Wang, Yi-Hsuan Yang,
Hsin-Min Wang, and Skyh-Kang Jeng
Academia Sinica,
National Taiwan University,
Taipei, Taiwan

2
Outline
• Introduction
• Related Work
• The Acoustic Emotion Gaussians (AEG)
Model
• Music Emotion Annotation and Retrieval
• Evaluation and Result
• Conclusion and Future Work

3
Introduction
• One of the most exciting but challenging
endeavors in music information retrieval (MIR)
– Develop a computational model that comprehends
the affective content of music signals
• Why is emotion so important to MIR system?
– Music is the finest language of emotion
– We use music to convey or modulate emotion
– Smaller semantic gap, comparing to genre
– Each state in our daily life contains emotion,
context-dependent music recommendation

4
Dimensional Emotion:
The Valence-Arousal(Activation) Model
• Emotions are considered as numerical values (instead of
discrete labels) over a number of emotion dimensions
• Good visualization, intuitive, a unified model
• Easy to capture temporal change of emotion
Mufin Player
Mr. Emo developed by Yang and Chen

5
The Valence-Arousal Annotation
• Emotion is subjective, different emotion may be elicited
from a song in the VA space
• Assumption: the VA annotation of a song can be drawn
from a Gaussian distribution, as observed above
• Subjectivity issue: observed by multiple subjects
• Temporal change: summarize the scope of changes

6
Related Work:
Regression for Gaussian Parameters
• The Gaussian-parameter approach directly learns five
regression models to predict the mean, variance, and
covariance of valence and arousal, respectively
• Without a joint modeling and estimation for the Gaussian
parameters
x
Regressor 1
Regressor 2
Regressor 3
Regressor 4
Regressor 5
mVal
mAro
sVal-Aro
sAro-Aro
sVal-Val

7
The Acoustic Emotion Gaussians Model for
Modeling between VA and Acoustic Feature
• A principled probabilistic/statistical approach
• Represent the acoustic features of a song by a probabilistic
histogram vector
• Develop a model to comprehend the relationship between acoustic
features and VA space (annotations)
Acoustic GMM Posterior Distributions

8
AEG: Construct Feature Reference Model
Global Set  of frame
vectors randomly
selected from each track
…
A1 N2
NK-1
NK N3N4
Global GMM for acoustic
feature encoding
EM Training
A Universal
Music Database
Acoustic GMM
Music Tracks
& Audio Signal
Frame-based Features
… …
… …

9
Represent a Song into Probabilistic Space
1
2
K-1
K…
Posterior
Probabilities over
the Acoustic GMM
…
A1
A2
AK-1
Acoustic GMM
AK
…
Feature Vectors
Histogram:
Acoustic GMM Posterior
prob
Each dim corresponds to a specific acoustic pattern,
called a latent feature class (or audio word)
1 2 K-1 K…

10
Generative Process of VA GMM
• Key idea: Each component VA Gaussian corresponds to
a latent feature class (a specific acoustic pattern)
Audio Signal
of Each Clip
A Mixture of Gaussians
in the VA Space
…
A1
A2
AK-1
Acoustic GMM
AK
1
2
K-1
K
…

11
Total Likelihood Function of VA GMM
• To cover the subjectivity, each training clip is annotated
by multiple subjects {uj}, the corresponding annotation ej
• An annotated corpus: assume each annotation eij of clip
si can be generated by a weighted VA GMM with {qik}!
• Generating the Corpus-level likelihood and maximize it
using the EM algorithm
1 1 1 1
( | ) ( | ) ( | , )
jU KN N
i ik ij k k
i i j k
p p s q
= = = =
= = å E E e  m S

1
( | ) ( | , )
K
ij i ik ij k k
k
p s q
=
= åe e Sm
Acoustic GMM posterior
Clip-level likelihood:
Each annotation contributes equally
parameters of each
latent VA Gaussian to learn
Annotation-level
Likelihood

12
User Prior Model
• Some annotations could be outliers
• The prior weight of each annotation can be described by
the likelihood over the clip-level annotation Gaussian
– Larger B indicates lower label consistency (higher uncertainty)
– Smaller likelihood implies the annotation could be an outlier
( | , ) ( | , , )jp u s s=e e a B
,
( | , )
( | )
( | , )
j
j s j
ju
p u s
p u s
p u s
g ¬ =
å
e
e

13
Integrating the Annotation (User) Prior
• Integrating Acoustic GMM Posterior and Annotation Prior
into the Generative Process
1 1 1
1 1 1
( | ) ( | ) ( | ) ( | )
( | , )
j
j
UN N
i ij i ij i
i i j
U KN
ij ik ij k k
i j k
p p s p u s p s
g q
= = =
= = =
= =
=
å 
å å
E E e
e

 m S
Clip-level likelihood:
prior weighted sum over
annotation-level likelihood
Annotation Prior Acoustic GMM posterior

14
The Objective Function
• Take log of p(E| ), and according to Jensen’s inequality
we derive the lower bound
where
• Then, we maximize Lbound with the EM-Algorithm
1 1 1
1 1 1
log ( | ) log ( | , )
log ( | , )
j
j
UN K
ij ik ij k k
i j k
UN K
bound ij ik ij k k
i j k
p D
L
g q
g q
= = =
= = =
=
³ =
å å å
åå å
E e
e


m
m
S
S
1 1
1
jUN
ij
i j
g
= =
=åå
two-layer log sum
one-layer log sum
parameters to learn

15
The Learning of VA GMM on MER60
Iter=8Iter=4
Iter=32Iter=16
Iter=2

16
Music Emotion Annotation
• Given the acoustic GMM posterior of a test song, predict
the emotion as a single VA Gaussian
1
2
K-1
K
…
Acoustic GMM Posterior Learned VA GMM Predicted Single Gaussian
1
ˆˆ( | ) ( | , )
K
k ij k k
k
p s q
=
= åe e m S
^
^
^
^
…
{ , }*
m *
S

17
Find the Representative Gaussian
• Minimize the cumulative weighted relative entropy
– The representative Gaussian has the minimal cumulative
distance from all the component VA Gaussians
• The optimal parameters of the Gaussian are
( )KL
{ , }
1
ˆ( | , ) arg min ( | , ) || ( | , )
K
k k k
k
p D p pq* *
=
= åe e e
S
S S S
m
m m m
*
1
ˆ
K
k k
k
q
=
= åm m
( )* * *
1
ˆ ( )( )
K
T
k k k k
k
q
=
= + - -åS S m m m m

18
Emotion-Based Music Retrieval
Approach Indexing Matching
Fold-In Acoustic GMM Posterior Cosine Sim (K-dim)
Emotion Prediction Predicted VA Gaussian Gaussian Likelihood

19
The Fold-In Approach
l1
l2
lK-1
lK
…
The Learned VA GMM A VA Point Query
Fold In
The query is Dominated by
the VA Gaussian of A2
Pseudo
Song
Distribution
1
ˆ ˆarg max log ( | , )
K
k k k
k
pl
=
= å e
l
l m S ˆe
Using the EM algorithm
1
2
K-1
K
…
Acoustic
GMM
Posterior
Music
Database

20
Evaluation – Dataset
• Two corpora used: MER60 and MTurk
• MER60
– 60 music clips, each is 30-second
– 99 subjects in total, making each clip annotated by 40 subjects
– The VA values are entered by clicking on the emotion space on a
computer display
• MTurk
– 240 clips, each is 15-second
– Collected via Amazon's Mechanical Turk
– Each subject rated the per-second VA values for 11 randomly-
selected clips using a graphical interface
– Automatic verification step employed, finalizing each clip with 7
to 23 subjects

21
Evaluation – Acoustic Features
• Adopt the bag-of-frames representation
• All the frames of a clip are aggregated into the acoustic
GMM posterior and perform the analysis of emotion at
the clip-level, instead of frame-level
• MER60: extracted by MIRToolbox
– Dynamic, spectral, timbre (including 13 MFCCs, 13 delta MFCCs,
and 13 delta-delta MFCCs), and tonal
– 70-dim all concatenation or 39-dim MFCCs
• MTurk: provided by Schmidt et al.
– MFCCs, chroma, spectrum descriptors, and spectral contrast
– 50-dim all concatenation, 20-dim MFCCs, or 14-dim spectral
contrast

22
Evaluation Metric for Emotion Annotation
• Average KL divergence (AKL)
– Measure the KL divergence from the predicted VA Gaussian of a
test clip to its ground truth VA Gaussian
• Average Mean Distance (AMD)
– Measure the Euclidean distance between the mean vectors of
the predicted and ground truth VA Gaussians
( )1 1 1
P G P G P G G P G
1
tr( ) log ( ) ( ) 2
2
T- - -
- + - - -m m m mS S S S S
P G P G( ) ( )T
- -m m m m

23
Result for Emotion Annotation
• MER60, leave-one-out train and test
• MTurk, 70%-30% randomly splitting train and test
Smaller
Better

24
Summary for Emotion Annotation
• The performance saturates when K is sufficiently large
• Larger scale corpus prefers larger K (feature resolution)
• Annotation prior is effective for the AKL performance
• For MER60, 70-D concat feature performs the best
• For MTurk, using MFCCs alone is more effective
• MTurk is easier and presents smaller performance scale

25
Result for Music Retrieval
• MTurk: 2,520 clips training, 1,080 clips for retrieval database
• Evaluate the ranking using the Normalized Discounted
Cumulative Gain (NDCG) with 5, 10, and 20 retrieved clips
2 2
( )1
NDCG @ (1)
log
P
iP
R i
P R
Z i=
ì üï ïï ï= +í ý
ï ïï ïî þ
å
Larger Better

26
Conclusion and Future Work
• The AEG model provides a principled probabilistic
framework that is technically sound, and also unifies
the emotion-based music annotation and retrieval
• AEG can better take into account the subjective
nature of emotion perception
• Transparency and interpretability of the model
learning and semantic mapping processes
• The potential for incorporating multi-modal content
• Dynamic personalization via model adaptation
• Alignment among multi-modal emotion semantics

27
Appendix: PWKL for Emotion Corpus
PWKL
5.095
1.985
• PWKL: the diversity of ground truth among all songs in a
corpus, the larger the more diverse
• We compute the (pair-wise) KL divergence between the
ground truth annotation Gaussians of each pair of clips
in a corpus
• MTurk is easier, since a safer prediction, the origin, can
gain good performance

The Acoustic Emotion Gaussians Model for Emotion-based Music Annotation and Retrieval

Recommended

Recommended

More Related Content

Similar to The Acoustic Emotion Gaussians Model for Emotion-based Music Annotation and Retrieval

Similar to The Acoustic Emotion Gaussians Model for Emotion-based Music Annotation and Retrieval (20)

Recently uploaded

Recently uploaded (20)

The Acoustic Emotion Gaussians Model for Emotion-based Music Annotation and Retrieval