1
The Acoustic Emotion Gaussians
Model for Emotion-based Music
Annotation and Retrieval
Ju-Chiang Wang, Yi-Hsuan Yang,
Hsin-Min Wang, and Shyh-Kang Jeng
Academia Sinica,
National Taiwan University,
Taipei, Taiwan
2
Outline
• Introduction
• Related Work
• The Acoustic Emotion Gaussians (AEG)
Model
• Music Emotion Annotation and Retrieval
• Evaluation and Result
• Conclusion and Future Work
3
Introduction
• One of the most exciting but challenging
endeavors in music information retrieval (MIR)
– Develop a computational model that comprehends
the affective content of music signals
• Why is emotion so important to MIR systems?
– Music is the finest language of emotion
– We use music to convey or modulate emotion
– Smaller semantic gap compared to genre
– Every situation in daily life carries emotion, enabling context-dependent music recommendation
4
Dimensional Emotion:
The Valence-Arousal (Activation) Model
• Emotions are represented as numerical values (instead of discrete labels) along a number of emotion dimensions
• Good visualization, intuitive, a unified model
• Easy to capture the temporal change of emotion
Mufin Player
Mr. Emo developed by Yang and Chen
5
The Valence-Arousal Annotation
• Emotion is subjective: different emotions may be elicited by the same song in the VA space
• Assumption: the VA annotations of a song can be drawn from a Gaussian distribution, as observed above
• Subjectivity issue: each song is observed by multiple subjects
• Temporal change: summarized by the scope of the changes
6
Related Work:
Regression for Gaussian Parameters
• The Gaussian-parameter approach directly learns five regression models that separately predict the two means, two variances, and the covariance of valence and arousal
• No joint modeling or estimation of the Gaussian parameters
[Diagram: a feature vector $\mathbf{x}$ feeds five independent regressors that predict $\mu_{\text{Val}}$, $\mu_{\text{Aro}}$, $\sigma_{\text{Val-Val}}$, $\sigma_{\text{Val-Aro}}$, and $\sigma_{\text{Aro-Aro}}$]
7
The Acoustic Emotion Gaussians Model for
Modeling the Relationship between VA and Acoustic Features
• A principled probabilistic/statistical approach
• Represent the acoustic features of a song by a probabilistic histogram vector
• Develop a model that comprehends the relationship between acoustic features and the VA space (annotations)
[Figure: acoustic GMM posterior distributions]
8
AEG: Construct Feature Reference Model
[Diagram: from a universal music database, music tracks and audio signals are converted to frame-based features; a global set of frame vectors is randomly selected from each track, and EM training on this pooled set yields the global acoustic GMM $\{A_1, \ldots, A_K\}$ used for feature encoding]
9
Representing a Song in a Probabilistic Space
[Diagram: the feature vectors of a song are evaluated against the acoustic GMM $\{A_1, \ldots, A_K\}$; the posterior probabilities over the $K$ components are aggregated into a histogram, the acoustic GMM posterior]
• Each dimension corresponds to a specific acoustic pattern, called a latent feature class (or audio word)
10
Generative Process of VA GMM
• Key idea: Each component VA Gaussian corresponds to
a latent feature class (a specific acoustic pattern)
[Diagram: the audio signal of each clip, encoded over the acoustic GMM $\{A_1, \ldots, A_K\}$, generates a mixture of $K$ Gaussians in the VA space]
11
Total Likelihood Function of VA GMM
• To cover subjectivity, each training clip $s_i$ is annotated by multiple subjects $\{u_j\}$, with corresponding annotations $\{\mathbf{e}_{ij}\}$
• An annotated corpus: assume each annotation $\mathbf{e}_{ij}$ of clip $s_i$ is generated by a VA GMM weighted by the acoustic GMM posterior $\{\theta_{ik}\}$
• Form the corpus-level likelihood and maximize it using the EM algorithm

Annotation-level likelihood, where $\{\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}$ are the parameters of each latent VA Gaussian to learn:

$$p(\mathbf{e}_{ij} \mid s_i) = \sum_{k=1}^{K} \theta_{ik}\, \mathcal{N}(\mathbf{e}_{ij} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

Corpus-level likelihood (each annotation contributes equally to its clip-level likelihood):

$$p(\mathbf{E} \mid \boldsymbol{\theta}) = \prod_{i=1}^{N} p(\mathbf{E}_i \mid \boldsymbol{\theta}) = \prod_{i=1}^{N} \prod_{j=1}^{U_i} \sum_{k=1}^{K} \theta_{ik}\, \mathcal{N}(\mathbf{e}_{ij} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
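As a concrete sketch, the annotation-level likelihood is just a 2-D GMM evaluated with the clip's acoustic GMM posterior as mixture weights. All parameter values below are made up for illustration:

```python
import numpy as np

def gauss2_pdf(e, mu, cov):
    """Density of a 2-D Gaussian N(e | mu, cov)."""
    diff = e - mu
    norm = 2 * np.pi * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def annotation_likelihood(e, theta, mus, covs):
    """p(e | s) = sum_k theta_k N(e | mu_k, Sigma_k): the latent VA
    Gaussians weighted by the clip's acoustic GMM posterior theta."""
    return sum(t * gauss2_pdf(e, m, c) for t, m, c in zip(theta, mus, covs))

# Hypothetical learned VA Gaussians for K = 2 latent feature classes.
mus = [np.array([0.5, 0.5]), np.array([-0.5, -0.3])]
covs = [0.1 * np.eye(2), 0.2 * np.eye(2)]
theta = np.array([0.7, 0.3])       # acoustic GMM posterior of the clip
lik = annotation_likelihood(np.array([0.4, 0.4]), theta, mus, covs)
assert lik > 0                     # a valid density value
```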
12
User Prior Model
• Some annotations could be outliers
• The prior weight of each annotation is given by its likelihood under the clip-level annotation Gaussian $\mathcal{N}(\mathbf{a}_s, \mathbf{B}_s)$
– A larger $\mathbf{B}_s$ indicates lower label consistency (higher uncertainty)
– A smaller likelihood implies the annotation could be an outlier

$$p(\mathbf{e}_j \mid u_j, s) = \mathcal{N}(\mathbf{e}_j \mid \mathbf{a}_s, \mathbf{B}_s)$$

$$\gamma_{sj} \leftarrow p(u_j \mid s) = \frac{p(\mathbf{e}_j \mid u_j, s)}{\sum_{j'} p(\mathbf{e}_{j'} \mid u_{j'}, s)}$$
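A minimal sketch of this prior computation (hypothetical data; the clip-level Gaussian is fit here by simple moment estimates):

```python
import numpy as np

def gauss2_pdf(e, mu, cov):
    diff = e - mu
    norm = 2 * np.pi * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def annotation_prior(annotations):
    """Fit the clip-level Gaussian (a_s, B_s) to all VA annotations of one
    clip by moment estimates, weight each annotation by its likelihood
    under it, and normalize; outliers receive small weights."""
    E = np.asarray(annotations)                  # (U, 2)
    a = E.mean(axis=0)
    B = np.cov(E, rowvar=False) + 1e-6 * np.eye(2)
    lik = np.array([gauss2_pdf(e, a, B) for e in E])
    return lik / lik.sum()                       # gamma_sj, sums to 1

# Four consistent annotations plus one obvious outlier (made-up values).
anns = np.array([[0.5, 0.4], [0.6, 0.3], [0.4, 0.5], [0.55, 0.45],
                 [-0.9, -0.8]])
gamma = annotation_prior(anns)
assert np.isclose(gamma.sum(), 1.0)
assert gamma[-1] == gamma.min()    # the outlier gets the least weight
```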
13
Integrating the Annotation (User) Prior
• Integrate the acoustic GMM posterior and the annotation prior into the generative process

$$p(\mathbf{E} \mid \boldsymbol{\theta}) = \prod_{i=1}^{N} p(\mathbf{E}_i \mid s_i) = \prod_{i=1}^{N} \sum_{j=1}^{U_i} p(u_j \mid s_i)\, p(\mathbf{e}_{ij} \mid s_i) = \prod_{i=1}^{N} \sum_{j=1}^{U_i} \gamma_{ij} \sum_{k=1}^{K} \theta_{ik}\, \mathcal{N}(\mathbf{e}_{ij} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

Clip-level likelihood: a prior-weighted sum over the annotation-level likelihoods, with annotation prior $\gamma_{ij}$ and acoustic GMM posterior $\theta_{ik}$
14
The Objective Function
• Take the log of $p(\mathbf{E} \mid \boldsymbol{\theta})$; by Jensen's inequality we derive the lower bound

$$\log p(\mathbf{E} \mid \boldsymbol{\theta}) = \sum_{i=1}^{N} \log \sum_{j=1}^{U_i} \gamma_{ij} \sum_{k=1}^{K} \theta_{ik}\, \mathcal{N}(\mathbf{e}_{ij} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \quad \text{(two-layer log-sum)}$$

$$\geq L_{\text{bound}} = \sum_{i=1}^{N} \sum_{j=1}^{U_i} \gamma_{ij} \log \sum_{k=1}^{K} \theta_{ik}\, \mathcal{N}(\mathbf{e}_{ij} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \quad \text{(one-layer log-sum)}$$

where $\sum_{j=1}^{U_i} \gamma_{ij} = 1$ for each clip, and $\{\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}$ are the parameters to learn
• Then, we maximize $L_{\text{bound}}$ with the EM algorithm
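The EM optimization can be sketched in a few lines. This is a hedged, minimal re-implementation (not the authors' code): annotations are stacked into one array, the annotation prior is taken as uniform, and only the component means are updated in the M-step; all names and data are hypothetical.

```python
import numpy as np

def log_gauss(E, mu, cov):
    """log N(e | mu, cov) for every row of E (2-D Gaussians)."""
    d = E - mu
    inv = np.linalg.inv(cov)
    return (-np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(cov))
            - 0.5 * np.einsum('nd,de,ne->n', d, inv, d))

def lower_bound(E, w, theta, mus, covs):
    """L_bound: annotations stacked in E (M, 2), w[m] the annotation prior
    gamma for row m, theta (M, K) the clip posterior repeated per row."""
    logp = np.stack([log_gauss(E, m, c) for m, c in zip(mus, covs)], axis=1)
    return float(np.sum(w * np.log(np.sum(theta * np.exp(logp), axis=1))))

def em_update_means(E, w, theta, mus, covs):
    """One E-step (responsibilities) and the prior-weighted M-step for the
    component means; covariances are kept fixed in this sketch."""
    logp = np.stack([log_gauss(E, m, c) for m, c in zip(mus, covs)], axis=1)
    z = theta * np.exp(logp)
    z /= z.sum(axis=1, keepdims=True)            # responsibilities (M, K)
    wz = w[:, None] * z
    return [(wz[:, k:k + 1] * E).sum(0) / wz[:, k].sum()
            for k in range(len(mus))]

rng = np.random.default_rng(1)
E = np.vstack([rng.normal([0.6, 0.5], 0.1, (30, 2)),
               rng.normal([-0.5, -0.4], 0.1, (30, 2))])
w = np.full(60, 1.0)                             # uniform annotation prior
theta = np.tile([0.5, 0.5], (60, 1))             # flat acoustic posteriors
mus = [np.array([0.2, 0.2]), np.array([-0.2, -0.2])]
covs = [0.2 * np.eye(2), 0.2 * np.eye(2)]
before = lower_bound(E, w, theta, mus, covs)
mus = em_update_means(E, w, theta, mus, covs)
after = lower_bound(E, w, theta, mus, covs)
assert after >= before                           # EM never lowers the bound
```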
15
The Learning of VA GMM on MER60
[Figure: the VA GMM learned on MER60 after 2, 4, 8, 16, and 32 EM iterations]
16
Music Emotion Annotation
• Given the acoustic GMM posterior $\{\theta_k\}$ of a test song $s$, predict its emotion as a single VA Gaussian $\{\boldsymbol{\mu}^*, \boldsymbol{\Sigma}^*\}$

[Diagram: acoustic GMM posterior → learned VA GMM → predicted single Gaussian]

$$p(\mathbf{e} \mid s) = \sum_{k=1}^{K} \theta_k\, \mathcal{N}(\mathbf{e} \mid \hat{\boldsymbol{\mu}}_k, \hat{\boldsymbol{\Sigma}}_k)$$

where the hats denote the learned VA GMM parameters
17
Find the Representative Gaussian
• Minimize the cumulative weighted relative entropy
– The representative Gaussian has the minimal cumulative distance from all the component VA Gaussians

$$\mathcal{N}(\mathbf{e} \mid \boldsymbol{\mu}^*, \boldsymbol{\Sigma}^*) = \arg\min_{\{\boldsymbol{\mu}, \boldsymbol{\Sigma}\}} \sum_{k=1}^{K} \theta_k\, D_{\text{KL}}\!\left( \mathcal{N}(\mathbf{e} \mid \hat{\boldsymbol{\mu}}_k, \hat{\boldsymbol{\Sigma}}_k) \,\middle\|\, \mathcal{N}(\mathbf{e} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \right)$$

• The optimal parameters of the Gaussian are

$$\boldsymbol{\mu}^* = \sum_{k=1}^{K} \theta_k \hat{\boldsymbol{\mu}}_k, \qquad \boldsymbol{\Sigma}^* = \sum_{k=1}^{K} \theta_k \left( \hat{\boldsymbol{\Sigma}}_k + (\hat{\boldsymbol{\mu}}_k - \boldsymbol{\mu}^*)(\hat{\boldsymbol{\mu}}_k - \boldsymbol{\mu}^*)^T \right)$$
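These closed-form parameters are the standard moment-matching collapse of a mixture into a single Gaussian; a sketch with made-up values:

```python
import numpy as np

def collapse_gmm(theta, mus, covs):
    """Collapse a VA GMM into one Gaussian using the closed-form
    minimizers above: the weighted mean, plus within- and
    between-component covariance (standard moment matching)."""
    theta = np.asarray(theta)
    mus = np.asarray(mus)                  # (K, 2)
    mu_star = theta @ mus                  # sum_k theta_k mu_k
    sigma_star = np.zeros((2, 2))
    for t, m, c in zip(theta, mus, covs):
        d = (m - mu_star)[:, None]
        sigma_star += t * (c + d @ d.T)
    return mu_star, sigma_star

# Hypothetical K = 2 VA GMM and posterior weights.
theta = [0.6, 0.4]
mus = [np.array([0.5, 0.5]), np.array([-0.5, -0.5])]
covs = [0.1 * np.eye(2), 0.1 * np.eye(2)]
mu_s, cov_s = collapse_gmm(theta, mus, covs)
assert np.allclose(mu_s, [0.1, 0.1])       # 0.6*0.5 + 0.4*(-0.5) = 0.1
```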
18
Emotion-Based Music Retrieval
Approach           | Indexing               | Matching
-------------------|------------------------|---------------------------
Fold-In            | Acoustic GMM posterior | Cosine similarity (K-dim)
Emotion Prediction | Predicted VA Gaussian  | Gaussian likelihood
19
The Fold-In Approach
[Diagram: a VA point query $\hat{\mathbf{e}}$ is folded into the learned VA GMM, producing a pseudo-song distribution $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_K)$, which is matched against the acoustic GMM posteriors of the music database; a query dominated by the VA Gaussian of $A_2$ places most of its weight on $\lambda_2$]

$$\boldsymbol{\lambda}^* = \arg\max_{\boldsymbol{\lambda}} \sum_{k=1}^{K} \lambda_k \log \mathcal{N}(\hat{\mathbf{e}} \mid \hat{\boldsymbol{\mu}}_k, \hat{\boldsymbol{\Sigma}}_k)$$

solved using the EM algorithm
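Once the pseudo-song distribution is obtained, the Fold-In matching step from the retrieval table reduces to a cosine similarity in the K-dimensional posterior space. A sketch with made-up 4-dimensional posteriors:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two K-dim posterior vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_songs(pseudo_song, song_posteriors):
    """Rank database songs by cosine similarity between the query's
    pseudo-song distribution and each song's acoustic GMM posterior."""
    sims = [cosine(pseudo_song, p) for p in song_posteriors]
    return sorted(range(len(sims)), key=lambda i: -sims[i])

lam = np.array([0.1, 0.7, 0.1, 0.1])     # query weight mostly on class 2
db = [np.array([0.25, 0.25, 0.25, 0.25]),
      np.array([0.05, 0.8, 0.1, 0.05]),  # also concentrated on class 2
      np.array([0.7, 0.1, 0.1, 0.1])]
assert rank_songs(lam, db)[0] == 1       # the matching song ranks first
```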
20
Evaluation – Dataset
• Two corpora used: MER60 and MTurk
• MER60
– 60 music clips, each 30 seconds long
– 99 subjects in total; each clip is annotated by 40 subjects
– The VA values are entered by clicking on the emotion space on a computer display
• MTurk
– 240 clips, each 15 seconds long
– Collected via Amazon's Mechanical Turk
– Each subject rated the per-second VA values for 11 randomly selected clips using a graphical interface
– An automatic verification step was employed, leaving each clip with 7 to 23 subjects
21
Evaluation – Acoustic Features
• Adopt the bag-of-frames representation
• All frames of a clip are aggregated into the acoustic GMM posterior, and emotion analysis is performed at the clip level instead of the frame level
• MER60: extracted with MIRToolbox
– Dynamic, spectral, timbre (including 13 MFCCs, 13 delta MFCCs, and 13 delta-delta MFCCs), and tonal features
– 70-dim full concatenation or 39-dim MFCCs
• MTurk: features provided by Schmidt et al.
– MFCCs, chroma, spectrum descriptors, and spectral contrast
– 50-dim full concatenation, 20-dim MFCCs, or 14-dim spectral contrast
22
Evaluation Metric for Emotion Annotation
• Average KL divergence (AKL)
– The KL divergence from the predicted VA Gaussian $P$ of a test clip to its ground-truth VA Gaussian $G$:

$$D_{\text{KL}}(P \,\|\, G) = \frac{1}{2} \left( \operatorname{tr}(\boldsymbol{\Sigma}_G^{-1} \boldsymbol{\Sigma}_P) - \log \left| \boldsymbol{\Sigma}_G^{-1} \boldsymbol{\Sigma}_P \right| + (\boldsymbol{\mu}_P - \boldsymbol{\mu}_G)^T \boldsymbol{\Sigma}_G^{-1} (\boldsymbol{\mu}_P - \boldsymbol{\mu}_G) - 2 \right)$$

• Average Mean Distance (AMD)
– The Euclidean distance between the mean vectors of the predicted and ground-truth VA Gaussians:

$$\sqrt{(\boldsymbol{\mu}_P - \boldsymbol{\mu}_G)^T (\boldsymbol{\mu}_P - \boldsymbol{\mu}_G)}$$
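Both metrics compute directly from the formulas above. A sketch (AMD is taken here as the plain Euclidean norm of the mean difference):

```python
import numpy as np

def kl_gauss(mu_p, cov_p, mu_g, cov_g):
    """KL divergence from N(mu_p, cov_p) to N(mu_g, cov_g) in 2-D,
    matching the AKL formula above (the -2 term is the dimensionality)."""
    inv_g = np.linalg.inv(cov_g)
    d = mu_p - mu_g
    return 0.5 * (np.trace(inv_g @ cov_p)
                  - np.log(np.linalg.det(inv_g @ cov_p))
                  + d @ inv_g @ d - 2)

def mean_distance(mu_p, mu_g):
    """AMD: Euclidean distance between the mean vectors."""
    return float(np.sqrt((mu_p - mu_g) @ (mu_p - mu_g)))

mu = np.array([0.2, 0.3])
cov = 0.1 * np.eye(2)
assert np.isclose(kl_gauss(mu, cov, mu, cov), 0.0)   # identical Gaussians
assert kl_gauss(mu, cov, mu + 1.0, cov) > 0
```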
23
Result for Emotion Annotation
• MER60: leave-one-out training and testing
• MTurk: random 70%/30% train/test split
[Figure: annotation results; smaller AKL/AMD is better]
24
Summary for Emotion Annotation
• Performance saturates once K is sufficiently large
• A larger corpus prefers a larger K (finer feature resolution)
• The annotation prior is effective for the AKL performance
• For MER60, the 70-D concatenated feature performs best
• For MTurk, using MFCCs alone is more effective
• MTurk is easier and exhibits a smaller performance range
25
Result for Music Retrieval
• MTurk: 2,520 clips for training, 1,080 clips for the retrieval database
• Evaluate the ranking using the Normalized Discounted Cumulative Gain (NDCG) with 5, 10, and 20 retrieved clips

$$\text{NDCG}@P = \frac{1}{Z_P} \left\{ R(1) + \sum_{i=2}^{P} \frac{R(i)}{\log_2 i} \right\}$$
(Larger is better)
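A minimal sketch of the metric as defined above (function name hypothetical; $Z_P$ is computed as the DCG of the ideal, descending-relevance ordering):

```python
import math

def ndcg_at_p(relevances, ideal, P):
    """NDCG@P: DCG of the returned ranking, normalized by Z_P, the DCG of
    the ideal ranking of the same items sorted by relevance."""
    def dcg(rel):
        rel = rel[:P]
        return rel[0] + sum(r / math.log2(i)
                            for i, r in enumerate(rel[1:], start=2))
    return dcg(relevances) / dcg(sorted(ideal, reverse=True))

rel = [3, 2, 1, 0]
assert ndcg_at_p(rel, rel, 4) == 1.0        # a perfect ranking scores 1.0
assert ndcg_at_p([0, 1, 2, 3], rel, 4) < 1.0
```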
26
Conclusion and Future Work
• The AEG model provides a principled probabilistic framework that is technically sound and unifies emotion-based music annotation and retrieval
• AEG better accounts for the subjective nature of emotion perception
• Transparency and interpretability of the model learning and semantic-mapping processes
• Potential for incorporating multi-modal content
• Dynamic personalization via model adaptation
• Alignment among multi-modal emotion semantics
27
Appendix: PWKL for Emotion Corpus
• PWKL measures the diversity of the ground truth among all songs in a corpus; the larger, the more diverse
• We compute the pairwise KL divergence between the ground-truth annotation Gaussians of every pair of clips in a corpus
• MTurk is easier, since a safe prediction at the origin already achieves good performance
[Table: PWKL values for the two corpora: 5.095 and 1.985]
More Related Content

Similar to The Acoustic Emotion Gaussians Model for Emotion-based Music Annotation and Retrieval

Fundamentals of music processing chapter 5 발표자료
Fundamentals of music processing chapter 5 발표자료Fundamentals of music processing chapter 5 발표자료
Fundamentals of music processing chapter 5 발표자료
Jeong Choi
 
P4_Predictive_Modeling_Speech.pdf
P4_Predictive_Modeling_Speech.pdfP4_Predictive_Modeling_Speech.pdf
P4_Predictive_Modeling_Speech.pdf
Yonas D. Ebren
 
AC overview
AC overviewAC overview
AC overview
WarNik Chow
 
"Speech recognition" - Hidden Markov Models @ Papers We Love Bucharest
"Speech recognition" - Hidden Markov Models @ Papers We Love Bucharest"Speech recognition" - Hidden Markov Models @ Papers We Love Bucharest
"Speech recognition" - Hidden Markov Models @ Papers We Love Bucharest
Stefan Adam
 
Weakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-AttentionWeakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-Attention
NU_I_TODALAB
 
Music genre prediction
Music genre predictionMusic genre prediction
Music genre prediction
Anusha Chavva
 
Waveform_codingUNIT-II_DC_-PPT.pptx
Waveform_codingUNIT-II_DC_-PPT.pptxWaveform_codingUNIT-II_DC_-PPT.pptx
Waveform_codingUNIT-II_DC_-PPT.pptx
KIRUTHIKAAR2
 
Deep Learning Meetup #5
Deep Learning Meetup #5Deep Learning Meetup #5
Deep Learning Meetup #5
Aloïs Gruson
 
Waveform_codingUNIT-II_DC_-PPT.pptx
Waveform_codingUNIT-II_DC_-PPT.pptxWaveform_codingUNIT-II_DC_-PPT.pptx
Waveform_codingUNIT-II_DC_-PPT.pptx
KIRUTHIKAAR2
 
adaptive equa.ppt
adaptive equa.pptadaptive equa.ppt
adaptive equa.ppt
mohamadfarzansabahi1
 
Feasibility of EEG Super-Resolution Using Deep Convolutional Networks
Feasibility of EEG Super-Resolution Using Deep Convolutional NetworksFeasibility of EEG Super-Resolution Using Deep Convolutional Networks
Feasibility of EEG Super-Resolution Using Deep Convolutional Networks
Sangjun Han
 
Music Gesture for Visual Sound Separation
Music Gesture for Visual Sound SeparationMusic Gesture for Visual Sound Separation
Music Gesture for Visual Sound Separation
ivaderivader
 
Emotion based music player
Emotion based music playerEmotion based music player
Emotion based music player
Nizam Muhammed
 
TAAI 2016 Keynote Talk: It is all about AI
TAAI 2016 Keynote Talk: It is all about AITAAI 2016 Keynote Talk: It is all about AI
TAAI 2016 Keynote Talk: It is all about AI
Yi-Shin Chen
 
Dcase2016 oral presentation - Experiments on DCASE 2016: Acoustic Scene Class...
Dcase2016 oral presentation - Experiments on DCASE 2016: Acoustic Scene Class...Dcase2016 oral presentation - Experiments on DCASE 2016: Acoustic Scene Class...
Dcase2016 oral presentation - Experiments on DCASE 2016: Acoustic Scene Class...
Ankit Shah
 
Decoding Brain oscillations during naturalistic scenarios
Decoding Brain oscillations during naturalistic scenariosDecoding Brain oscillations during naturalistic scenarios
Decoding Brain oscillations during naturalistic scenarios
KrishnaPrasad194459
 
Support Vector Machine Techniques for Nonlinear Equalization
Support Vector Machine Techniques for Nonlinear EqualizationSupport Vector Machine Techniques for Nonlinear Equalization
Support Vector Machine Techniques for Nonlinear Equalization
Shamman Noor Shoudha
 
Icmmse slides
Icmmse slidesIcmmse slides
Icmmse slides
Manoj Shukla
 
Btp 1st
Btp 1stBtp 1st
Btp 1st
Dinesh Yadav
 
Oceans13 Presentation
Oceans13 PresentationOceans13 Presentation
Oceans13 Presentation
Ahmad ElMoslimany
 

Similar to The Acoustic Emotion Gaussians Model for Emotion-based Music Annotation and Retrieval (20)

Fundamentals of music processing chapter 5 발표자료
Fundamentals of music processing chapter 5 발표자료Fundamentals of music processing chapter 5 발표자료
Fundamentals of music processing chapter 5 발표자료
 
P4_Predictive_Modeling_Speech.pdf
P4_Predictive_Modeling_Speech.pdfP4_Predictive_Modeling_Speech.pdf
P4_Predictive_Modeling_Speech.pdf
 
AC overview
AC overviewAC overview
AC overview
 
"Speech recognition" - Hidden Markov Models @ Papers We Love Bucharest
"Speech recognition" - Hidden Markov Models @ Papers We Love Bucharest"Speech recognition" - Hidden Markov Models @ Papers We Love Bucharest
"Speech recognition" - Hidden Markov Models @ Papers We Love Bucharest
 
Weakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-AttentionWeakly-Supervised Sound Event Detection with Self-Attention
Weakly-Supervised Sound Event Detection with Self-Attention
 
Music genre prediction
Music genre predictionMusic genre prediction
Music genre prediction
 
Waveform_codingUNIT-II_DC_-PPT.pptx
Waveform_codingUNIT-II_DC_-PPT.pptxWaveform_codingUNIT-II_DC_-PPT.pptx
Waveform_codingUNIT-II_DC_-PPT.pptx
 
Deep Learning Meetup #5
Deep Learning Meetup #5Deep Learning Meetup #5
Deep Learning Meetup #5
 
Waveform_codingUNIT-II_DC_-PPT.pptx
Waveform_codingUNIT-II_DC_-PPT.pptxWaveform_codingUNIT-II_DC_-PPT.pptx
Waveform_codingUNIT-II_DC_-PPT.pptx
 
adaptive equa.ppt
adaptive equa.pptadaptive equa.ppt
adaptive equa.ppt
 
Feasibility of EEG Super-Resolution Using Deep Convolutional Networks
Feasibility of EEG Super-Resolution Using Deep Convolutional NetworksFeasibility of EEG Super-Resolution Using Deep Convolutional Networks
Feasibility of EEG Super-Resolution Using Deep Convolutional Networks
 
Music Gesture for Visual Sound Separation
Music Gesture for Visual Sound SeparationMusic Gesture for Visual Sound Separation
Music Gesture for Visual Sound Separation
 
Emotion based music player
Emotion based music playerEmotion based music player
Emotion based music player
 
TAAI 2016 Keynote Talk: It is all about AI
TAAI 2016 Keynote Talk: It is all about AITAAI 2016 Keynote Talk: It is all about AI
TAAI 2016 Keynote Talk: It is all about AI
 
Dcase2016 oral presentation - Experiments on DCASE 2016: Acoustic Scene Class...
Dcase2016 oral presentation - Experiments on DCASE 2016: Acoustic Scene Class...Dcase2016 oral presentation - Experiments on DCASE 2016: Acoustic Scene Class...
Dcase2016 oral presentation - Experiments on DCASE 2016: Acoustic Scene Class...
 
Decoding Brain oscillations during naturalistic scenarios
Decoding Brain oscillations during naturalistic scenariosDecoding Brain oscillations during naturalistic scenarios
Decoding Brain oscillations during naturalistic scenarios
 
Support Vector Machine Techniques for Nonlinear Equalization
Support Vector Machine Techniques for Nonlinear EqualizationSupport Vector Machine Techniques for Nonlinear Equalization
Support Vector Machine Techniques for Nonlinear Equalization
 
Icmmse slides
Icmmse slidesIcmmse slides
Icmmse slides
 
Btp 1st
Btp 1stBtp 1st
Btp 1st
 
Oceans13 Presentation
Oceans13 PresentationOceans13 Presentation
Oceans13 Presentation
 

Recently uploaded

How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
Claudio Di Ciccio
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
Zilliz
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 

Recently uploaded (20)

How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”“I’m still / I’m still / Chaining from the Block”
“I’m still / I’m still / Chaining from the Block”
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 

The Acoustic Emotion Gaussians Model for Emotion-based Music Annotation and Retrieval

  • 1. 1 The Acoustic Emotion Gaussians Model for Emotion-based Music Annotation and Retrieval Ju-Chiang Wang, Yi-Hsuan Yang, Hsin-Min Wang, and Skyh-Kang Jeng Academia Sinica, National Taiwan University, Taipei, Taiwan
  • 2. 2 Outline • Introduction • Related Work • The Acoustic Emotion Gaussians (AEG) Model • Music Emotion Annotation and Retrieval • Evaluation and Result • Conclusion and Future Work
  • 3. 3 Introduction • One of the most exciting but challenging endeavors in music information retrieval (MIR) – Develop a computational model that comprehends the affective content of music signals • Why is emotion so important to MIR system? – Music is the finest language of emotion – We use music to convey or modulate emotion – Smaller semantic gap, comparing to genre – Each state in our daily life contains emotion, context-dependent music recommendation
  • 4. 4 Dimensional Emotion: The Valence-Arousal(Activation) Model • Emotions are considered as numerical values (instead of discrete labels) over a number of emotion dimensions • Good visualization, intuitive, a unified model • Easy to capture temporal change of emotion Mufin Player Mr. Emo developed by Yang and Chen
  • 5. 5 The Valence-Arousal Annotation • Emotion is subjective, different emotion may be elicited from a song in the VA space • Assumption: the VA annotation of a song can be drawn from a Gaussian distribution, as observed above • Subjectivity issue: observed by multiple subjects • Temporal change: summarize the scope of changes
  • 6. 6 Related Work: Regression for Gaussian Parameters • The Gaussian-parameter approach directly learns five regression models to predict the mean, variance, and covariance of valence and arousal, respectively • Without a joint modeling and estimation for the Gaussian parameters x Regressor 1 Regressor 2 Regressor 3 Regressor 4 Regressor 5 mVal mAro sVal-Aro sAro-Aro sVal-Val
  • 7. 7 The Acoustic Emotion Gaussians Model for Modeling between VA and Acoustic Feature • A principled probabilistic/statistical approach • Represent the acoustic features of a song by a probabilistic histogram vector • Develop a model to comprehend the relationship between acoustic features and VA space (annotations) Acoustic GMM Posterior Distributions
  • 8. 8 AEG: Construct Feature Reference Model Global Set  of frame vectors randomly selected from each track … A1 N2 NK-1 NK N3N4 Global GMM for acoustic feature encoding EM Training A Universal Music Database Acoustic GMM Music Tracks & Audio Signal Frame-based Features … … … …
  • 9. 9 Represent a Song into Probabilistic Space 1 2 K-1 K… Posterior Probabilities over the Acoustic GMM … A1 A2 AK-1 Acoustic GMM AK … Feature Vectors Histogram: Acoustic GMM Posterior prob Each dim corresponds to a specific acoustic pattern, called a latent feature class (or audio word) 1 2 K-1 K…
  • 10. 10 Generative Process of VA GMM • Key idea: Each component VA Gaussian corresponds to a latent feature class (a specific acoustic pattern) Audio Signal of Each Clip A Mixture of Gaussians in the VA Space … A1 A2 AK-1 Acoustic GMM AK 1 2 K-1 K …
  • 11. 11 Total Likelihood Function of VA GMM • To cover the subjectivity, each training clip is annotated by multiple subjects {uj}, the corresponding annotation ej • An annotated corpus: assume each annotation eij of clip si can be generated by a weighted VA GMM with {qik}! • Generating the Corpus-level likelihood and maximize it using the EM algorithm 1 1 1 1 ( | ) ( | ) ( | , ) jU KN N i ik ij k k i i j k p p s q = = = = = = å E E e  m S  1 ( | ) ( | , ) K ij i ik ij k k k p s q = = åe e Sm Acoustic GMM posterior Clip-level likelihood: Each annotation contributes equally parameters of each latent VA Gaussian to learn Annotation-level Likelihood
  • 12. 12 User Prior Model • Some annotations could be outliers • The prior weight of each annotation can be described by the likelihood over the clip-level annotation Gaussian – Larger B indicates lower label consistency (higher uncertainty) – Smaller likelihood implies the annotation could be an outlier ( | , ) ( | , , )jp u s s=e e a B , ( | , ) ( | ) ( | , ) j j s j ju p u s p u s p u s g ¬ = å e e
  • 13. 13 Integrating the Annotation (User) Prior • Integrating Acoustic GMM Posterior and Annotation Prior into the Generative Process 1 1 1 1 1 1 ( | ) ( | ) ( | ) ( | ) ( | , ) j j UN N i ij i ij i i i j U KN ij ik ij k k i j k p p s p u s p s g q = = = = = = = = = å  å å E E e e   m S Clip-level likelihood: prior weighted sum over annotation-level likelihood Annotation Prior Acoustic GMM posterior
  • 14. 14 The Objective Function • Take log of p(E| ), and according to Jensen’s inequality we derive the lower bound where • Then, we maximize Lbound with the EM-Algorithm 1 1 1 1 1 1 log ( | ) log ( | , ) log ( | , ) j j UN K ij ik ij k k i j k UN K bound ij ik ij k k i j k p D L g q g q = = = = = = = ³ = å å å åå å E e e   m m S S 1 1 1 jUN ij i j g = = =åå two-layer log sum one-layer log sum parameters to learn
  • 15. 15 The Learning of VA GMM on MER60 Iter=8Iter=4 Iter=32Iter=16 Iter=2
  • 16. 16 Music Emotion Annotation • Given the acoustic GMM posterior of a test song, predict the emotion as a single VA Gaussian 1 2 K-1 K … Acoustic GMM Posterior Learned VA GMM Predicted Single Gaussian 1 ˆˆ( | ) ( | , ) K k ij k k k p s q = = åe e m S ^ ^ ^ ^ … { , }* m * S
  • 17. 17 Find the Representative Gaussian • Minimize the cumulative weighted relative entropy – The representative Gaussian has the minimal cumulative distance from all the component VA Gaussians • The optimal parameters of the Gaussian are ( )KL { , } 1 ˆ( | , ) arg min ( | , ) || ( | , ) K k k k k p D p pq* * = = åe e e S S S S m m m m * 1 ˆ K k k k q = = åm m ( )* * * 1 ˆ ( )( ) K T k k k k k q = = + - -åS S m m m m
• 18. 18 Emotion-Based Music Retrieval
  • Two approaches, by indexing and matching scheme:
    – Fold-In: index by the acoustic GMM posterior; match with cosine similarity (K-dim)
    – Emotion Prediction: index by the predicted VA Gaussian; match with the Gaussian likelihood
• 19. 19 The Fold-In Approach
  • A VA point query $\hat{\mathbf{e}}$ is folded into the learned VA GMM to obtain a pseudo-song distribution $\boldsymbol{\lambda} = [\lambda_1, \ldots, \lambda_K]$, which is then matched against the acoustic GMM posteriors of the music database:

  $\boldsymbol{\lambda}^* = \arg\max_{\boldsymbol{\lambda}} \log \sum_{k=1}^{K} \lambda_k\, \mathcal{N}(\hat{\mathbf{e}} \mid \hat{\boldsymbol{\mu}}_k, \hat{\boldsymbol{\Sigma}}_k)$, solved using the EM algorithm

  [Figure: in the example, the query is dominated by the VA Gaussian labeled A2]
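A minimal EM sketch for the fold-in weights of a single query point (function name and toy VA GMM are hypothetical). With only one observation, the M-step simply sets λ to the current responsibilities, so the weights concentrate on the component that dominates the query, consistent with the figure.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fold_in(query, mus, covs, n_iter=50):
    """EM for the pseudo-song weights lambda of a single VA query point:
    maximize log sum_k lambda_k N(query | mu_k, Sigma_k)."""
    lik = np.array([multivariate_normal(m, c).pdf(query)
                    for m, c in zip(mus, covs)])
    lam = np.full(len(mus), 1.0 / len(mus))      # uniform initialization
    for _ in range(n_iter):
        resp = lam * lik
        lam = resp / resp.sum()   # one observation: M-step = responsibilities
    return lam

# Toy VA GMM (hypothetical): a "happy" and a "sad" component.
mus = np.array([[0.8, 0.8], [-0.8, -0.8]])
covs = np.array([0.1 * np.eye(2), 0.1 * np.eye(2)])
lam = fold_in(np.array([0.7, 0.7]), mus, covs)   # mass concentrates on component 1
```

The resulting λ is a K-dimensional vector in the same space as the acoustic GMM posteriors, so database songs can be ranked by cosine similarity against it.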
• 20. 20 Evaluation – Dataset
  • Two corpora are used: MER60 and MTurk
  • MER60
    – 60 music clips, each 30 seconds long
    – 99 subjects in total; each clip annotated by 40 subjects
    – VA values entered by clicking on the emotion space shown on a computer display
  • MTurk
    – 240 clips, each 15 seconds long
    – Collected via Amazon Mechanical Turk
    – Each subject rated per-second VA values for 11 randomly selected clips using a graphical interface
    – An automatic verification step was employed, leaving each clip with 7 to 23 subjects
• 21. 21 Evaluation – Acoustic Features
  • Adopt the bag-of-frames representation
  • All frames of a clip are aggregated into the acoustic GMM posterior, so emotion analysis is performed at the clip level instead of the frame level
  • MER60: extracted with MIRToolbox
    – Dynamic, spectral, timbre (including 13 MFCCs, 13 delta MFCCs, and 13 delta-delta MFCCs), and tonal features
    – 70-D full concatenation or 39-D MFCCs
  • MTurk: features provided by Schmidt et al.
    – MFCCs, chroma, spectrum descriptors, and spectral contrast
    – 50-D full concatenation, 20-D MFCCs, or 14-D spectral contrast
• 22. 22 Evaluation Metrics for Emotion Annotation
  • Average KL divergence (AKL)
    – The KL divergence from the predicted VA Gaussian $P$ of a test clip to its ground-truth VA Gaussian $G$, averaged over test clips:

  $D_{\mathrm{KL}}(P \,\|\, G) = \frac{1}{2}\Big(\mathrm{tr}(\boldsymbol{\Sigma}_G^{-1}\boldsymbol{\Sigma}_P) + (\boldsymbol{\mu}_P - \boldsymbol{\mu}_G)^T \boldsymbol{\Sigma}_G^{-1} (\boldsymbol{\mu}_P - \boldsymbol{\mu}_G) - 2 + \log\frac{|\boldsymbol{\Sigma}_G|}{|\boldsymbol{\Sigma}_P|}\Big)$

  • Average mean distance (AMD)
    – The Euclidean distance between the mean vectors of the predicted and ground-truth VA Gaussians:

  $\sqrt{(\boldsymbol{\mu}_P - \boldsymbol{\mu}_G)^T(\boldsymbol{\mu}_P - \boldsymbol{\mu}_G)}$
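Both metrics are short closed-form computations; a sketch for the 2-D (valence-arousal) case, with illustrative function names:

```python
import numpy as np

def akl(mu_p, cov_p, mu_g, cov_g):
    """KL divergence D(P || G) between two 2-D Gaussians (predicted P,
    ground truth G); averaging this over test clips gives the AKL."""
    d = mu_p - mu_g
    inv_g = np.linalg.inv(cov_g)
    return 0.5 * (np.trace(inv_g @ cov_p) + d @ inv_g @ d - 2.0
                  + np.log(np.linalg.det(cov_g) / np.linalg.det(cov_p)))

def amd(mu_p, mu_g):
    """Euclidean distance between predicted and ground-truth mean vectors."""
    return float(np.linalg.norm(mu_p - mu_g))
```

The constant 2 in the KL term is the dimensionality of the VA space; identical Gaussians give a divergence of zero, as expected.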
• 23. 23 Results for Emotion Annotation
  • MER60: leave-one-out training and testing
  • MTurk: random 70%/30% train/test split
  [Figure: AKL and AMD results; smaller is better]
• 24. 24 Summary for Emotion Annotation
  • Performance saturates once K is sufficiently large
  • A larger corpus favors a larger K (finer feature resolution)
  • The annotation prior is effective for the AKL performance
  • For MER60, the 70-D concatenated feature performs best
  • For MTurk, using MFCCs alone is more effective
  • MTurk is easier and exhibits a smaller performance range
• 25. 25 Results for Music Retrieval
  • MTurk: 2,520 clips for training, 1,080 clips for the retrieval database
  • Rankings are evaluated with the Normalized Discounted Cumulative Gain (NDCG) at 5, 10, and 20 retrieved clips:

  $\mathrm{NDCG}@P = \frac{1}{Z_P}\Big(R(1) + \sum_{i=2}^{P} \frac{R(i)}{\log_2 i}\Big)$

  [Figure: NDCG results; larger is better]
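A small sketch of this NDCG formulation, where $R(i)$ is the relevance of the clip at rank $i$ and the normalizer $Z_P$ is the DCG of the ideal (relevance-sorted) ranking. The function name is illustrative:

```python
import numpy as np

def ndcg_at_p(rels, P):
    """NDCG@P: discounted cumulative gain of the top-P relevance scores
    R(1) + sum_{i=2..P} R(i)/log2(i), normalized by the ideal ranking's DCG."""
    def dcg(r):
        r = np.asarray(r, dtype=float)[:P]
        # Rank-1 discount is 1; rank i >= 2 is discounted by 1/log2(i).
        discounts = np.concatenate(([1.0],
                                    1.0 / np.log2(np.arange(2, len(r) + 1))))
        return float((r * discounts).sum())
    ideal = sorted(rels, reverse=True)
    return dcg(rels) / dcg(ideal)
```

A perfectly ordered ranking scores 1.0; any other ordering of the same relevance scores scores strictly less (down to the worst ordering's ratio).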
• 26. 26 Conclusion and Future Work
  • The AEG model provides a principled, technically sound probabilistic framework that unifies emotion-based music annotation and retrieval
  • AEG better accounts for the subjective nature of emotion perception
  • The model learning and semantic mapping processes are transparent and interpretable
  • Future directions: incorporating multi-modal content, dynamic personalization via model adaptation, and alignment among multi-modal emotion semantics
• 27. 27 Appendix: PWKL for the Emotion Corpora
  • PWKL measures the diversity of the ground truth among all songs in a corpus; the larger, the more diverse
  • We compute the pair-wise KL divergence between the ground-truth annotation Gaussians of each pair of clips in a corpus
    – MER60: 5.095; MTurk: 1.985
  • MTurk is easier: a safe prediction at the origin already achieves good performance