1. The document presents a method for personalized music emotion recognition via model adaptation. It develops a probabilistic Acoustic Emotion Gaussians (AEG) model to represent emotions in music as Gaussians over the valence and arousal dimensions.
2. It then describes a technique to personalize the AEG model for individual users via Maximum A Posteriori (MAP) adaptation, using a user's own music annotations to update the model parameters.
3. An evaluation shows the personalized AEG model achieves improved music emotion recognition performance compared to the general AEG model, demonstrating the effectiveness of the proposed adaptation method.
Personalized Music Emotion Recognition via Model Adaptation
Slide 1
Personalized Music Emotion Recognition via Model Adaptation
Ju-Chiang Wang, Yi-Hsuan Yang,
Hsin-Min Wang, and Shyh-Kang Jeng
Academia Sinica,
National Taiwan University,
Taipei, Taiwan
Slide 2
Outline
• Introduction
• The Acoustic Emotion Gaussians (AEG) Model
• Personalization via MAP Adaptation
• Music Emotion Recognition using AEG
• Evaluation and Results
• Conclusion
Slide 3
Introduction
• Developing a computational model that
comprehends the affective content from musical
audio signal, for automatic music emotion
recognition and content-based music retrieval
• Emotion perception in music is in nature
subjective (fairly user-dependent)
– A general music emotion recognition (MER) system
could be insufficient
– One’s personal device is desirable to understand
his/her perception of music emotion
– Adaptive MER method, efficient and effective
Slide 4
Basic Idea
• The UBM-GMM framework for speaker adaptation
– The state of the art in speaker recognition
– A large background GMM (the universal background model, UBM) represents the speaker-independent distribution of acoustic features
– A speaker-dependent GMM is obtained via model adaptation using the speech data of a specific speaker
• An adaptive MER method for personalization, by analogy
– A probabilistic background emotion model learns the broad emotion perception of music from general users
– The background emotion model is then personalized via model adaptation in an online and dynamic fashion
Slide 5
Multi-Dimensional Emotion
• Emotions are considered as numerical values
(instead of discrete labels) over two emotion
dimensions, i.e., Valence and Arousal (Activation)
• Good visualization, a unified model
[Figure: the emotion plane of Mr. Emo, developed by Yang and Chen]
Slide 6
The Valence-Arousal Annotations
• Different emotions may be elicited by the same song
• Assumption: the VA annotations of a song can be drawn from a Gaussian distribution, as observed in practice
• Learn from the multiple annotations and the acoustic features of the corresponding song
• Predict the emotion of a song as a single Gaussian
Slide 7
The Acoustic Emotion Gaussians Model
• Represent the acoustic features of a song by a probabilistic histogram vector
• Develop a model that captures the relationship between acoustic features and VA annotations
– Wang et al. (2012), “The acoustic emotion Gaussians model for emotion-based music annotation and retrieval,” Proc. ACM Multimedia (full paper)
[Figure: acoustic GMM posterior distributions]
Slide 8
Construct Feature Reference Model
[Figure: pipeline for constructing the feature reference model. Music tracks and audio signals from a universal music database are converted to frame-based features; a global set of frame vectors, randomly selected from each track, is used for EM training of the acoustic GMM with components A1, ..., AK, each representing a specific acoustic pattern.]
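The slide describes this pipeline at the diagram level only; a minimal sketch with scikit-learn follows, assuming frame features are already extracted per track. The component count, per-track sampling size, and function name are illustrative choices, not the authors' configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_acoustic_gmm(tracks, n_components=32, frames_per_track=500, seed=0):
    """Fit the acoustic GMM (feature reference model) with EM.

    tracks: list of (n_frames_i, dim) arrays of frame-based features,
            one array per song in the universal music database.
    """
    rng = np.random.default_rng(seed)
    pooled = []
    for feats in tracks:
        # Randomly select frame vectors from each track for the global set.
        n = min(frames_per_track, len(feats))
        pooled.append(feats[rng.choice(len(feats), size=n, replace=False)])
    pooled = np.vstack(pooled)

    # Each learned component A_k represents a specific acoustic pattern.
    gmm = GaussianMixture(n_components=n_components, covariance_type="full",
                          random_state=seed)
    gmm.fit(pooled)
    return gmm
```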
Slide 9
Represent a Song in a Probabilistic Space
[Figure: a song's frame-based feature vectors are mapped to posterior probabilities over the acoustic GMM components A1, ..., AK, yielding a K-bin probabilistic histogram (the acoustic GMM posterior).]
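Continuing the sketch, a song's probabilistic histogram can be computed from the fitted acoustic GMM; averaging the per-frame posteriors is one natural aggregation, assumed here since the slide leaves the pooling step implicit.

```python
import numpy as np

def acoustic_posterior_histogram(gmm, feats):
    """Map a song's frame features (n_frames, dim) to a K-bin histogram.

    Averages the per-frame posterior probabilities over the acoustic GMM
    components A_1..A_K; the result sums to 1.
    """
    frame_posteriors = gmm.predict_proba(feats)  # p(A_k | x_t), shape (n_frames, K)
    return frame_posteriors.mean(axis=0)         # histogram theta, shape (K,)
```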
Slide 10
Generative Process of VA GMM
• Key idea: each component in the acoustic GMM can generate a corresponding component Gaussian in the VA space
[Figure: the audio signal of each clip, represented over the acoustic GMM components A1, ..., AK (viewed as a set of acoustic codewords), generates a mixture of Gaussians in the VA space.]
Slide 11
The Likelihood Function of VA GMM
• Each training clip is annotated by multiple users {uj}, indexed by j
• For an annotated corpus, assume each annotation eij of clip si is generated by a VA GMM weighted by {qik}
• Form the corpus-level likelihood and maximize it using the EM algorithm (a worked sketch follows after the equations)
Clip-level likelihood (each user contributes equally):
$$p(\mathbf{e}_{ij} \mid s_i) = \sum_{k=1}^{K} q_{ik}\, \mathcal{N}(\mathbf{e}_{ij} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
Corpus-level likelihood over all clips and annotations:
$$p(\mathbf{E} \mid \mathcal{S}) = \prod_{i=1}^{N} \prod_{j=1}^{U_i} \sum_{k=1}^{K} q_{ik}\, \mathcal{N}(\mathbf{e}_{ij} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
where $q_{ik}$ is the acoustic GMM posterior of component $k$ for clip $s_i$, and $\{\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}$ are the parameters of each latent VA Gaussian to be learned.
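As a worked illustration (not the authors' implementation), the sketch below evaluates the corpus-level log-likelihood given the acoustic GMM posteriors and the current VA Gaussian parameters; the EM updates that maximize it are omitted.

```python
import numpy as np
from scipy.stats import multivariate_normal

def corpus_log_likelihood(annotations, q, means, covs):
    """Corpus-level log-likelihood of the VA GMM.

    annotations: list over clips; annotations[i] is a (U_i, 2) array of
                 VA labels e_ij from the users who rated clip s_i.
    q:           (N, K) array of acoustic GMM posteriors q_ik.
    means, covs: VA Gaussian parameters, shapes (K, 2) and (K, 2, 2).
    """
    K = len(means)
    total = 0.0
    for e_i, q_i in zip(annotations, q):
        # Clip level: p(e_ij | s_i) = sum_k q_ik N(e_ij | mu_k, Sigma_k)
        dens = np.column_stack([
            multivariate_normal.pdf(e_i, mean=means[k], cov=covs[k])
            for k in range(K)
        ])                                  # shape (U_i, K)
        total += np.log(dens @ q_i).sum()   # each user contributes equally
    return total
```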
Slide 12
Personalizing VA GMM via MAP
• Apply Maximum A Posteriori (MAP) adaptation
• Suppose we have a set of personally annotated songs $\{\mathbf{e}_i, q_i\}$, $i = 1, \ldots, M$
• The posterior probability over each component $z_k$ for $\mathbf{e}_i$ is
$$p(z_k \mid \mathbf{e}_i, q_i) = \frac{q_{ik}\, \mathcal{N}(\mathbf{e}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{q=1}^{K} q_{iq}\, \mathcal{N}(\mathbf{e}_i \mid \boldsymbol{\mu}_q, \boldsymbol{\Sigma}_q)}$$
• The expected sufficient statistics, weighted by these posteriors and the annotations $\mathbf{e}_i$, are
$$E_k(\boldsymbol{\mu}) \leftarrow \frac{\sum_{i=1}^{M} p(z_k \mid \mathbf{e}_i, q_i)\, \mathbf{e}_i}{\sum_{i=1}^{M} p(z_k \mid \mathbf{e}_i, q_i)}, \qquad E_k(\boldsymbol{\Sigma}) \leftarrow \frac{\sum_{i=1}^{M} p(z_k \mid \mathbf{e}_i, q_i)\, \mathbf{e}_i \mathbf{e}_i^{T}}{\sum_{i=1}^{M} p(z_k \mid \mathbf{e}_i, q_i)}$$
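A minimal sketch of these two steps, assuming the target user's annotations and their clips' acoustic GMM posteriors are given as arrays; the function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def map_sufficient_stats(E_pers, q, means, covs):
    """Posteriors and expected sufficient statistics from the personally
    annotated songs {e_i, q_i}, i = 1..M.

    E_pers: (M, 2) personal VA annotations; q: (M, K) posteriors q_ik.
    """
    M, K = q.shape
    dens = np.column_stack([
        multivariate_normal.pdf(E_pers, mean=means[k], cov=covs[k])
        for k in range(K)
    ])                                            # N(e_i | mu_k, Sigma_k)
    resp = q * dens
    resp /= resp.sum(axis=1, keepdims=True)       # p(z_k | e_i, q_i), (M, K)

    M_k = resp.sum(axis=0)                        # effective count per component
    E_mu = (resp.T @ E_pers) / M_k[:, None]       # E_k(mu), shape (K, 2)
    E_sigma = np.einsum("ik,id,ie->kde", resp, E_pers, E_pers) \
              / M_k[:, None, None]                # E_k(Sigma), shape (K, 2, 2)
    return resp, M_k, E_mu, E_sigma
```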
Slide 13
MAP for GMM: Parameter Interpolation
• The updated parameters for the personalized VA GMM can be derived by interpolation:
$$\boldsymbol{\mu}'_k \leftarrow \alpha_k E_k(\boldsymbol{\mu}) + (1 - \alpha_k)\, \boldsymbol{\mu}_k$$
$$\boldsymbol{\Sigma}'_k \leftarrow \alpha_k E_k(\boldsymbol{\Sigma}) + (1 - \alpha_k)\left(\boldsymbol{\Sigma}_k + \boldsymbol{\mu}_k \boldsymbol{\mu}_k^{T}\right) - \boldsymbol{\mu}'_k \boldsymbol{\mu}'^{T}_k$$
• The effective number of component $z_k$ for the target user is
$$M_k = \sum_{i=1}^{M} p(z_k \mid \mathbf{e}_i, q_i)$$
• The data-dependent interpolation factors can be set by
$$\alpha_k = \frac{M_k}{M_k + r},$$
with $r$ a relevance constant; the parameters thus interpolate between the personal expectation and the background model.
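Given those statistics, the interpolation is a few lines. The relevance factor r below follows the standard UBM-GMM convention and is an assumption, since the slide leaves the constant in the interpolation factor implicit.

```python
import numpy as np

def map_update(means, covs, M_k, E_mu, E_sigma, r=16.0):
    """MAP interpolation between the personal statistics and the
    background VA GMM parameters (means, covs)."""
    alpha = M_k / (M_k + r)                  # data-dependent factors alpha_k
    a = alpha[:, None]
    mu_new = a * E_mu + (1.0 - a) * means    # mu'_k

    a2 = alpha[:, None, None]
    bg_outer = np.einsum("kd,ke->kde", means, means)     # mu_k mu_k^T
    new_outer = np.einsum("kd,ke->kde", mu_new, mu_new)  # mu'_k mu'_k^T
    cov_new = a2 * E_sigma + (1.0 - a2) * (covs + bg_outer) - new_outer
    return mu_new, cov_new
```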
Slide 14
Graphical Interpretation – MAP Adaptation
[Figure: graphical interpretation of MAP adaptation; the component VA Gaussians are shifted by the data-dependent interpolation factors according to the acoustic GMM posterior.]
• The personal annotations can be applied even to clips outside the background training set
Slide 15
Music Emotion Recognition
• Given the acoustic GMM posterior of a test song, predict
the emotion as a single VA Gaussian
[Figure: acoustic GMM posterior → learned VA GMM → predicted single Gaussian]
$$p(\hat{\mathbf{e}} \mid s) = \sum_{k=1}^{K} \hat{q}_k\, \mathcal{N}(\hat{\mathbf{e}} \mid \hat{\boldsymbol{\mu}}_k, \hat{\boldsymbol{\Sigma}}_k) \;\Rightarrow\; \{\boldsymbol{\mu}^*, \boldsymbol{\Sigma}^*\}$$
where the hats denote the adapted parameters and the test song's posterior weights, and the mixture is collapsed into the single Gaussian $\{\boldsymbol{\mu}^*, \boldsymbol{\Sigma}^*\}$ (next slide).
Slide 16
Find the Representative Gaussian
• Minimize the cumulative weighted relative entropy
– The representative Gaussian has the minimal cumulative distance from all the component VA Gaussians:
$$\{\boldsymbol{\mu}^*, \boldsymbol{\Sigma}^*\} = \arg\min_{\{\boldsymbol{\mu}, \boldsymbol{\Sigma}\}} \sum_{k=1}^{K} \hat{q}_k\, D_{\mathrm{KL}}\!\left(\mathcal{N}(\mathbf{e} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \,\|\, \mathcal{N}(\mathbf{e} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})\right)$$
• The optimal parameters of the Gaussian are (a moment-matching sketch follows below)
$$\boldsymbol{\mu}^* = \sum_{k=1}^{K} \hat{q}_k\, \boldsymbol{\mu}_k, \qquad \boldsymbol{\Sigma}^* = \sum_{k=1}^{K} \hat{q}_k \left(\boldsymbol{\Sigma}_k + (\boldsymbol{\mu}_k - \boldsymbol{\mu}^*)(\boldsymbol{\mu}_k - \boldsymbol{\mu}^*)^{T}\right)$$
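Since both optimal parameters are moment matches, the reduction to the representative Gaussian is a short computation; a sketch under the same array conventions as the earlier snippets:

```python
import numpy as np

def representative_gaussian(q_hat, means, covs):
    """Collapse the predicted VA GMM into the single Gaussian minimizing
    the cumulative weighted KL divergence (moment matching).

    q_hat: (K,) acoustic GMM posterior of the test song.
    """
    mu_star = q_hat @ means                       # mu* = sum_k q_k mu_k
    diff = means - mu_star                        # mu_k - mu*, shape (K, 2)
    cov_star = np.einsum("k,kde->de", q_hat,
                         covs + np.einsum("kd,ke->kde", diff, diff))
    return mu_star, cov_star
```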
Slide 17
Evaluation – Dataset and Acoustic Features
• MER60
– 60 music clips, each 30 seconds long
– 99 users in total; each clip annotated by 40 subjects
– 6 users annotated all the clips
– Personalization is evaluated on these 6 users
• Bag-of-frames representation; emotion is analyzed at the clip level instead of the frame level (a feature-extraction sketch follows below)
– 70 dimensions: dynamic, spectral, timbre (13 MFCCs, 13 delta MFCCs, and 13 delta-delta MFCCs), and tonal features
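For concreteness, a sketch of extracting the 39 MFCC-based dimensions with librosa; the file path is hypothetical, and the dynamic, spectral, and tonal descriptors that complete the 70 dimensions are not shown.

```python
import numpy as np
import librosa

# Load a 30-second clip (the path is illustrative).
y, sr = librosa.load("clip.wav", duration=30.0)

# 13 MFCCs plus delta and delta-delta coefficients (39 of the 70 dims).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

# Frame-based feature matrix, shape (n_frames, 39).
feats = np.vstack([mfcc, delta, delta2]).T
```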
Slide 18
Evaluation – Incremental Setting
• Incremental adaptation experiment per target user (see the skeleton below)
– Randomly split all the clips (with annotations) into 6 folds
– Perform 6-fold cross-validation:
• Hold out one fold for testing
• Use all annotations in the remaining 5 folds, except the target user's, to train a background VA GMM
• Add one fold of the target user's annotations to the adaptation pool at a time (a loop of P = 5 iterations):
– Use the adaptation pool to adapt the background VA GMM
– Evaluate the prediction performance on the test fold
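A skeleton of this protocol; the modeling steps are injected as callables (train_background, adapt, and evaluate are placeholders for the procedures on the previous slides), so only the fold logic from the slide is encoded here.

```python
import numpy as np

def incremental_adaptation(n_clips, train_background, adapt, evaluate,
                           n_folds=6, seed=0):
    """Incremental 6-fold CV for one target user.

    train_background(train_idx): background VA GMM from all annotations
        on train_idx except the target user's.
    adapt(background, pool_idx):  MAP-adapted model from the target
        user's annotations on pool_idx.
    evaluate(model, test_idx):    prediction performance on the test fold.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_clips), n_folds)
    for t in range(n_folds):
        train_folds = [f for i, f in enumerate(folds) if i != t]
        background = train_background(np.concatenate(train_folds))
        pool = []
        for fold in train_folds:              # P = 5 adaptation iterations
            pool.extend(fold)                 # grow the adaptation pool
            personalized = adapt(background, np.array(pool))
            evaluate(personalized, folds[t])
```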
Slide 19
Evaluation – Results
• Metric (ALL): the log-likelihood of the target user's ground-truth annotation under the predicted Gaussian, averaged over the test clips (see the snippet below)
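Assuming ALL abbreviates the average log-likelihood over the test clips (the slide does not expand the acronym), the metric reduces to a few lines:

```python
import numpy as np
from scipy.stats import multivariate_normal

def average_log_likelihood(predictions, ground_truth):
    """Mean log-likelihood of the target user's ground-truth VA
    annotations under the predicted Gaussians, one per test clip.

    predictions:  list of (mu_star, cov_star) pairs.
    ground_truth: (N, 2) array of the user's annotations.
    """
    scores = [multivariate_normal.logpdf(e, mean=mu, cov=cov)
              for (mu, cov), e in zip(predictions, ground_truth)]
    return float(np.mean(scores))
```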
Slide 20
Conclusion and Future Work
• The AEG model provides a principled probabilistic framework that is technically sound and flexible for adaptation
• We have presented a novel MAP-based adaptation technique that is very efficient for personalizing the AEG model
• We demonstrated the effectiveness of the proposed method for personalizing MER in an incremental learning manner
• Future work: investigate maximum likelihood linear regression (MLLR), which learns a linear transformation over the parameters of the AEG model