This document describes research on dynamic music emotion recognition using state-space models. It covers two subtasks: feature development for static affect prediction and new modeling approaches for dynamic affect prediction. The authors extract audio features from music clips and use Kalman filters and Gaussian process state-space models (GP-SSMs) to model affect trajectories over time. Training and evaluation on the development data show that clustering the training set was necessary to train the GP-SSM but did not improve Kalman filter performance; the two model types perform similarly under similar conditions, with the GP-SSM slightly better for valence estimation. The official baseline audio features also compare well against the other feature sets.
Dynamic Music Emotion Recognition Using State-Space Models
1. Dynamic Music Emotion Recognition Using State-Space Models
Team UoA
Konstantin Markov*, Tomoko Matsui**
*The University of Aizu, Japan
**Institute of Statistical Mathematics, Japan
2. Subtask focus
Subtask 1: Feature development
Static affect prediction
New features
Signal Processing challenge
Subtask 2: Dynamic estimation
Dynamic affect prediction
New modeling approaches
Machine learning challenge
4. Affect trajectory
Dynamic emotion recognition is the estimation of the affect trajectory over time!
Apply time-series analysis tools:
Trajectory estimation is a time-series filtering/smoothing task.
State-Space Models (SSMs) are well suited to this task.
6. Gaussian Filtering/Smoothing
Given a State-Space Model with hidden states x_t and observations y_t:
The task of filtering is to approximate
p(x_t | y_{1:t}), t = 1, …, T
The task of smoothing is to approximate
p(x_t | y_{1:T}), t = 1, …, T
When these posteriors are approximated by Gaussian distributions, the task is called Gaussian filtering/smoothing.
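For a linear-Gaussian SSM the filtering posterior is exactly Gaussian and is computed by the Kalman filter; the RTS smoother then adds a backward pass over the filtered estimates. A minimal scalar sketch of the filtering recursion (hypothetical parameter names, NumPy only):

```python
import numpy as np

def kalman_filter(y, A, C, Q, R, mu0, P0):
    """Exact Gaussian filtering p(x_t | y_1:t) for a scalar linear-Gaussian SSM:
       x_t = A x_{t-1} + v_t,  v_t ~ N(0, Q)
       y_t = C x_t + w_t,      w_t ~ N(0, R)
    Returns the filtered means and variances for t = 1..T."""
    mus, Ps = [], []
    mu, P = mu0, P0
    for yt in y:
        # Predict step: p(x_t | y_1:t-1)
        mu_p = A * mu
        P_p = A * P * A + Q
        # Update step: fold in the new observation y_t
        K = P_p * C / (C * P_p * C + R)   # Kalman gain
        mu = mu_p + K * (yt - C * mu_p)
        P = (1 - K * C) * P_p
        mus.append(mu)
        Ps.append(P)
    return np.array(mus), np.array(Ps)
```

The filtered variance shrinks as observations accumulate; the smoother would revisit these estimates using the full sequence y_{1:T}.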
9. Gaussian Process SSM (GP-SSM)
Gaussian Processes based state-space model:
x_t = f(x_{t-1}) + υ_t,  f(x) ~ GP(0, K_f)
y_t = g(x_t) + ν_t,      g(x) ~ GP(0, K_g)
Advantages:
Non-linear, Non-parametric
Flexible.
Disadvantages:
No standard algorithms for training and inference.
Analytic moment-matching approximation (Deisenroth, 2012).
Computationally expensive.
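As a building block, plain GP regression with an RBF kernel illustrates how a nonparametric function estimate with uncertainty is obtained; this is only the regression step used to learn f or g, not the moment-matching inference itself (a NumPy-only sketch with hypothetical helper names):

```python
import numpy as np

def rbf(A, B, ell=1.0, sf=1.0):
    """Squared-exponential kernel K(a, b) = sf^2 exp(-|a - b|^2 / (2 ell^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf ** 2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(Xtr, ytr, Xte, noise=0.1):
    """GP regression posterior mean and variance at test inputs Xte."""
    K = rbf(Xtr, Xtr) + noise ** 2 * np.eye(len(Xtr))
    Ks = rbf(Xte, Xtr)
    Kss = rbf(Xte, Xte)
    # Solve via Cholesky for numerical stability
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, ytr))
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = np.diag(Kss) - (v ** 2).sum(0)
    return mean, var
```

Far from the training inputs the posterior reverts to the prior, so the predictive variance grows, which is what makes the GP-SSM's uncertainty estimates useful.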
10. Experiments
Feature extraction.
Marsyas tool.
mfcc – Mel-frequency cepstral coefficients.
spfe – zero-crossings, spectral flux, centroid, rolloff.
scf – spectral crest factor.
baseline – features used in the official baseline system.
Independent state and observation model learning.
Multivariate linear regression for the KF.
GP regression learning for the GP-SSM.
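Independent learning of the two linear models for the KF can be sketched with ordinary least squares: regress x_t on x_{t-1} for the state model and y_t on x_t for the observation model (hypothetical function name; here the states would be the annotated A-V trajectories and the observations the audio features):

```python
import numpy as np

def fit_linear_ssm(X, Y):
    """Least-squares fit of a linear-Gaussian SSM
       x_t = A x_{t-1} + v_t,  v_t ~ N(0, Q)
       y_t = C x_t + w_t,      w_t ~ N(0, R)
    from a state trajectory X (T x dx) and observations Y (T x dy)."""
    # Transition model: regress x_t on x_{t-1}
    M, _, _, _ = np.linalg.lstsq(X[:-1], X[1:], rcond=None)
    Q = np.cov((X[1:] - X[:-1] @ M).T)   # process-noise covariance
    # Observation model: regress y_t on x_t
    N, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)
    R = np.cov((Y - X @ N).T)            # observation-noise covariance
    return M.T, Q, N.T, R                # A, Q, C, R
```

Because the two regressions are independent, the state and observation models can be trained separately and then combined in the filter.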
11. Experiments
Development data.
Training set – 600 clips.
Validation set – 144 clips.
Training set clustering:
Four clusters based on the clips' static A-V vectors.
Separate SSM trained for each cluster.
Maximum-likelihood based model selection.
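The clustering and model-selection scheme can be sketched as follows: k-means on the per-clip static A-V vectors partitions the training set, one SSM is trained per cluster, and at test time the cluster model with the highest likelihood for the clip's observations is chosen (minimal sketch, hypothetical helper names):

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means: cluster clips by their static arousal-valence vectors."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assign each clip to the nearest cluster center
        labels = np.argmin(((points[:, None] - centers) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned clips
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

def select_model(log_likelihoods):
    """Maximum-likelihood model selection: pick the cluster-specific SSM
    whose log-likelihood of the test clip's observations is highest."""
    return int(np.argmax(log_likelihoods))
```

One SSM would then be trained on the clips of each cluster, and `select_model` applied to the per-model likelihoods of a test clip.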
12. Results on Development Data
AROUSAL – SINGLE MODEL
Feature      KF R     KF RMSE   RTS R    RTS RMSE
mfcc         0.2062   0.2894    0.1070   0.3008
mfcc+spfe    0.2326   0.2378    0.0894   0.2291
mfcc+scf     0.1171   0.2288    0.1611   0.2188
baseline     0.2791   0.3631    0.1898   0.4027

AROUSAL – MULTIPLE MODELS
Feature      KF R     KF RMSE   RTS R    RTS RMSE
mfcc         0.1698   0.1384    0.0991   0.1284
mfcc+spfe    0.2022   0.1290    0.1246   0.1277
mfcc+scf     0.0059   0.1613    0.0253   0.1615
baseline     0.0212   0.2276    0.0236   0.2259

(KF = Kalman filter, RTS = RTS smoother)
13. Results on Development Data
VALENCE – SINGLE MODEL
Feature      KF R     KF RMSE   RTS R    RTS RMSE
mfcc         0.0411   0.3131    0.0598   0.3542
mfcc+spfe    0.0304   0.3100    0.0725   0.3495
mfcc+scf     0.1545   0.3346    0.1401   0.3616
baseline     0.0753   0.1341    0.0779   0.1499

VALENCE – MULTIPLE MODELS
Feature      KF R     KF RMSE   RTS R    RTS RMSE
mfcc         -0.082   0.1847    -0.042   0.1915
mfcc+spfe    -0.054   0.1866    -0.068   0.1914
mfcc+scf     0.0149   0.1688    -0.008   0.1703
baseline     -0.080   0.2425    -0.058   0.2497

(KF = Kalman filter, RTS = RTS smoother)
16. Conclusions
Training data clustering:
Did not improve the Kalman filter performance.
Was, however, the only way the GP-SSM could be trained.
KF vs. GP-SSM:
Similar performance under similar conditions.
GP-SSM slightly better for valence estimation.
Feature types:
The baseline features perform well.
No definite winner.