This working notes paper describes the system proposed by the THU-HCSIL team for dynamic music emotion recognition. The procedure is divided into two modules: feature extraction and regression. Both feature selection and feature combination are used to form the final THU feature set. In the regression module, a booster-based multi-level regression method is presented, which significantly outperforms the baseline on the test data in the RMSE metric for the dynamic task.
http://ceur-ws.org/Vol-1263/mediaeval2014_submission_15.pdf
MediaEval 2014: THU-HCSIL Approach to Emotion in Music Task using Multi-level Regression
Yuchao Fan, Mingxing Xu
THU-HCSIL Team
Lab of Human Computer Speech Interaction, Key Lab of Pervasive Computing
Department of Computer Science and Technology, Tsinghua University
Outline
! Feature Extraction
  Feature selection
  Feature combination
! Regression
  Data Observation
  Framework of Proposed Approach
! Evaluation
! Conclusion
Feature Selection and Combination
! Original Feature: EMO-LARGE features
  Provided by the organizers
  Extracted with openSMILE
  6552 dimensions
  56 low-level descriptors
Feature Selection and Combination
! Feature Selection by Ranking Feature Importance
  Tree-based estimators: Random Forest, Extra-Trees
  ! A multitude of decision trees; feature importance computed through permutation
  ! The two differ in node-splitting strategy
  ! Toolbox: scikit-learn
! Selected Feature
  F_n^RF: the first n important features in Random Forest (RF)
  F_n^ET: the first n important features in Extra-Trees (ET)
  Selected Feature (SLT): F_n^SLT = F_n^RF ∩ F_n^ET
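A minimal sketch of this selection step, assuming the 6552-dim feature matrix and per-segment labels are NumPy arrays. Note that scikit-learn's built-in feature_importances_ is impurity-based; it stands in here for the permutation-based ranking mentioned above (sklearn.inspection.permutation_importance would be the closer match). All names, sizes and hyperparameters are illustrative, not the authors' code.

import numpy as np
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 6552))   # placeholder for the 6552-dim EMO-LARGE matrix
y = rng.uniform(-0.5, 0.5, 300)        # placeholder per-segment arousal labels

def top_n(model, n):
    """Fit the ensemble and return the indices of its n most important features."""
    model.fit(X, y)
    return set(np.argsort(model.feature_importances_)[::-1][:n])

n = 1280
f_rf = top_n(RandomForestRegressor(n_estimators=50, random_state=0), n)
f_et = top_n(ExtraTreesRegressor(n_estimators=50, random_state=0), n)

# F_n^SLT = F_n^RF ∩ F_n^ET: keep only features both ensembles rank in their top n
selected = sorted(f_rf & f_et)
X_slt = X[:, selected]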
Feature Selection and Combination

Selected Feature   Arousal   Valence
n                  1280      1280
Dimension          661       522

[Figure: THU feature performance for arousal: RMSE vs. n for n = 80, 160, 320, 640, 1280, 2560, 5120; RMSE axis from 0.17 to 0.19.]
Feature Selection and Combination
! Supplement Feature
  Music-related, 217 dimensions in total
! THU feature set = {Selected Feature ∪ Supplement Feature}
  Arousal: 878 dims (661 + 217)
  Valence: 739 dims (522 + 217)
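For completeness, a one-line sketch of the combination step; both blocks are random placeholders here, whereas the real supplement block would hold the 217 music-related descriptors.

import numpy as np

rng = np.random.default_rng(0)
X_slt = rng.standard_normal((300, 661))   # placeholder: selected arousal features
X_supp = rng.standard_normal((300, 217))  # placeholder: music-related supplement features
X_thu = np.hstack([X_slt, X_supp])        # THU feature set: 661 + 217 = 878 dims
assert X_thu.shape[1] == 878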
Ground Truth Distribution of Dev. Set

[Figure: ground-truth distributions on the dev. set; left panel: Arousal, right panel: Valence.]
Ground Truth Distribution (4x4)
[Figure: ground-truth distribution binned on a 4x4 grid.]

Ground Truth Distribution (6x6)
[Figure: ground-truth distribution binned on a 6x6 grid.]

Ground Truth Distribution (8x8)
[Figure: ground-truth distribution binned on an 8x8 grid.]

Ground Truth Distribution (10x10)
[Figure: ground-truth distribution binned on a 10x10 grid.]
Global Reg. VS Local Reg.
! Arousal RMSE distribution
  Left: Global Reg., RMSE avg. = 0.1764
  Right: Local Reg., RMSE avg. = 0.1490
[Figure: arousal RMSE distributions for global (left) and local (right) regression.]
Global Reg. VS Local Reg.
! Valence RMSE distribution
  Left: Global Reg., RMSE avg. = 0.1785
  Right: Local Reg., RMSE avg. = 0.1517
[Figure: valence RMSE distributions for global (left) and local (right) regression.]
Global Reg. VS Local Reg.

System        Arousal RMSE   Valence RMSE
Global Reg.   0.1764         0.1785
Local Reg.    0.1490         0.1517
Ground Truth Distribution of Regression Results

result ∈ [−0.5, −0.25):  mean = −0.2770, std = 0.1439
result ∈ [−0.25, 0):     mean = −0.0827, std = 0.1816
result ∈ [0, 0.25):      mean = 0.1548, std = 0.2304
result ∈ [0.25, 0.5):    mean = 0.3697, std = 0.1917

[Figures: one ground-truth histogram per range of the global regression result.]
Ground Truth Distribution of Regression Results

[Figure: the four per-range ground-truth distributions above, shown side by side as Range 1 to Range 4.]
Framework of Proposed Approach

[Figure 1: process framework of the multi-level regression system. Training: the training set X^Tr trains a global model R^G via the booster trainer; boundary computation splits X^Tr into subsets X^Tr_1, ..., X^Tr_6, each of which trains a local model R^L_1, ..., R^L_6. Testing: a test sample x is first regressed by R^G (result r0); subset selection Ind(x) routes x to the local model R^L_i, whose regression result r1 is refined by post-smoothing.]
Algorithm Details
Performance can be improved by reducing the scope of each model, so we propose a 2-level regression method; Figure 1 shows its framework.

The first-level regression is simply the naive procedure commonly used in a regression module: for x in the test set X, we predict the result r0 = f(x, R^G) with the model R^G, a global model trained on the entire training set X^Tr.

For each emotion dimension, we divide [−0.75, 0.75] (based on the ground-truth distribution of X^Tr) into 6 bins of equal length (0.25). For x in X, its bin index is i = Ind(x) = ⌈(f(x, R^G) + 0.75) / 0.25⌉. The second-level regression result of x is r1 = f(x, R^L_i), where R^L_i is the local model for the ith bin, trained on the subset of the training set X^Tr_i = {x | x ∈ X^Tr and g(x) ∈ [a_i, b_i]}; here g(x) is the ground truth of x, and the boundaries {a_i, b_i} are computed on the dev set X^D as follows. We first define G_i = {g(x) | Ind(x) = i, x ∈ X^D}. After investigating the distribution of G_i, we select a_i = min{g | CDF_i(g) ≥ 1.5%} and b_i = max{g | CDF_i(g) ≤ 98.5%}, where CDF_i is the cumulative distribution function of G_i, in order to eliminate outliers and fix the confidence level on the dev set X^D at 97%.

All results below are based on the THU feature set, and all experiments are for the dynamic task.
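A minimal sketch of the bin-index and boundary computation just described, assuming NumPy arrays r0_dev (first-level dev-set predictions) and g_dev (dev-set ground truth); all names and the placeholder data are illustrative, not the authors' code.

import numpy as np

def bin_index(r0):
    """Ind(x) = ceil((r0 + 0.75) / 0.25), clipped to the 6 bins covering [-0.75, 0.75]."""
    return int(np.clip(np.ceil((r0 + 0.75) / 0.25), 1, 6))

def bin_boundaries(r0_dev, g_dev):
    """a_i / b_i as the 1.5th / 98.5th percentiles of G_i (central 97% of its mass)."""
    idx = np.array([bin_index(r) for r in r0_dev])
    return {i: (np.percentile(g_dev[idx == i], 1.5),
                np.percentile(g_dev[idx == i], 98.5))
            for i in range(1, 7) if np.any(idx == i)}

rng = np.random.default_rng(0)
r0_dev = rng.uniform(-0.7, 0.7, 1000)                       # placeholder predictions
g_dev = np.clip(r0_dev + rng.normal(0, 0.1, 1000), -1, 1)   # placeholder ground truth
print(bin_boundaries(r0_dev, g_dev))

The local training subset for bin i is then X^Tr_i = {x in X^Tr : a_i <= g(x) <= b_i}.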
How to Select Regression Models?

Performance of Different Regression Models
• 5-fold cross-validation, song-independent (folds separated by song IDs randomly; see the sketch below)
• 1/5 of the training set is used as the dev. dataset

Table 2: Regression results with different models. ET = Extra-Trees, RF = Random Forest.

Model     Tool          Arousal MAE / RMSE   Valence MAE / RMSE
Linear    WEKA          0.159 / 0.203        0.158 / 0.199
Pace      WEKA          0.153 / 0.195        0.154 / 0.197
M5Rule    WEKA          0.154 / 0.195        0.159 / 0.205
ET        scikit-learn  0.150 / 0.188        0.149 / 0.191
RF        scikit-learn  0.142 / 0.177        0.143 / 0.183
Booster   XGBoost       0.141 / 0.176        0.138 / 0.178
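A minimal sketch of song-independent cross-validation with scikit-learn's GroupKFold, which keeps all segments of a song in one fold; GroupKFold is deterministic, so it only approximates the random song-level split the slide describes (GroupShuffleSplit would randomize). Arrays and sizes are placeholders.

import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.standard_normal((900, 878))      # placeholder THU arousal features
y = rng.uniform(-0.5, 0.5, 900)          # placeholder arousal labels
song_ids = np.repeat(np.arange(30), 30)  # placeholder: 30 songs x 30 segments

rmses = []
for tr, te in GroupKFold(n_splits=5).split(X, y, groups=song_ids):
    model = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X[tr], y[tr])
    rmses.append(mean_squared_error(y[te], model.predict(X[te])) ** 0.5)
print(f"mean RMSE over 5 song-independent folds: {np.mean(rmses):.3f}")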
One-level VS Multi-level Regression

Table 3: RMSE of one-level regression and multi-level regression.

Regression Strategy   Arousal RMSE   Valence RMSE
One-level             0.176          0.178
Multi-level           0.175          0.177

NOTES: XGBoost, an open-source toolbox, was used for regression in both cases.
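A minimal sketch of the booster-based (one-level) regression via XGBoost's scikit-learn API; hyperparameters and array names are illustrative, not the authors' settings.

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X_tr = rng.standard_normal((900, 878))   # placeholder THU features
y_tr = rng.uniform(-0.5, 0.5, 900)       # placeholder arousal labels
X_te = rng.standard_normal((90, 878))    # placeholder test segments

# One-level: a single global booster fitted on the whole training set
booster = xgb.XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.1)
booster.fit(X_tr, y_tr)
r0 = booster.predict(X_te)               # first-level predictions r0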
Evaluation on Test Data

Runs 1-4 are for the dynamic task and run 5 is for the static task. The XGBoost lib is employed for regression in all runs.

• Dynamic task (subtask 2):
  Run 1. One-level regression.
  Run 2. Two-level regression.
  Run 3. Run 1 + smoothing. For each clip, we replace every segment result of Run 1 with the average of itself and its 4 adjacent segment results (a sketch follows the table below):
    r_run3(i) = mean(r_run1(i−2), ..., r_run1(i+2)), i = 3, 4, ..., 88.
  Run 4. Run 2 + smoothing.
• Static task (subtask 1):
  Run 5. For each clip, average over the 90 segment results from Run 2.

Table 4: Results on required test data.

Task      Run        Arousal APC / RMSE   Valence APC / RMSE
Dynamic   baseline   0.180 / 0.270        0.110 / 0.190
          1          0.128 / 0.124        0.064 / 0.100
          2          0.127 / 0.125        0.074 / 0.099
          3          0.170 / 0.119        0.091 / 0.094
          4          0.167 / 0.120        0.098 / 0.094
Static    5          0.770 / 0.107        0.459 / 0.097

The results show that our system outperforms the baseline on the RMSE metric.
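A minimal sketch of the Run 3 post-smoothing rule above; the array name r_run1 is illustrative. Edge segments (i < 3 or i > 88) are left unsmoothed, matching the stated index range.

import numpy as np

def smooth(r):
    """Run 3 rule: r holds 90 per-segment predictions for one clip (i = 1..90)."""
    out = r.copy()
    for i in range(2, 88):                 # 0-based positions for i = 3, ..., 88
        out[i] = r[i - 2 : i + 3].mean()   # mean of r(i-2), ..., r(i+2)
    return out

rng = np.random.default_rng(0)
r_run1 = rng.uniform(-0.5, 0.5, 90)        # placeholder Run 1 predictions
r_run3 = smooth(r_run1)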
Conclusion
! Randomized-decision-tree-based feature ranking for feature selection
! Multi-level regression proposed, consisting of a global regression and several local regressions
! Post-smoothing adopted for prediction refinement

In the future, we will focus on more sophisticated methods to define the neighbor set for training local regression models, and on different smoothing strategies for post-processing.
Questions? / Comments? / Suggestions?

THANKS