MELODY EXTRACTION ON VOCAL SEGMENTS
USING MULTI-COLUMN DEEP NEURAL NETWORKS
Sangeun Kum, Changhyun Oh, Juhan Nam
keums@kaist.ac.kr
Music and Audio Computing Lab.
Korea Advanced Institute of Science and Technology
11 Aug. 2016
Melody extraction
: from polyphonic music
 Definition:
Automatically obtain the f0 curve of the predominant melodic line drawn from multiple sources [1]
[1] Bittner, R. M., Salamon, J., Essid, S., & Bello, J. P. "Melody extraction by contour classification." In Proc. ISMIR, 2015, pp. 500-506.
<An example of melody extraction, Beyoncé – ‘Halo’>
Melody extraction algorithms

 Salience-based approaches
 Source-separation-based approaches
 Data-driven approaches

[2] Salamon, Justin, et al. "Melody extraction from polyphonic music signals: Approaches, applications, and challenges." IEEE Signal Processing Magazine 31.2 (2014): 118-134.
Melody extraction algorithms
: Data-driven approaches

 Support vector machine note classifier [3]
 Pitch labels: 60 MIDI notes (G2–F#7)
 Resolution = 1 semitone
 Loses detailed information about singing styles (ex: vibrato, transition patterns)

<Figure: posteriorgram [3] — data → support vector machine → 60-note MIDI-scale posteriorgram (G2–F#7) → HMM>

[3] Ellis, Daniel PW, and Graham E. Poliner. "Classification-based melody transcription." Machine Learning 65.2 (2006): 439-456.
Addressed issues

1. Deep neural network
2. Classification-based approach: high resolution
3. Data augmentation
4. Singing voice detector
Deep neural networks
: Configuration
<Figure: DNN configuration — multi-frame spectrogram input layer → hidden layers (512-512-256) → output layer spanning D2–F#5>
 Input:
• Multi-frame spectrogram
• Train: singing-voice frames only
• Test: all frames
 Hidden layers: 512-512-256
 Output:
• Range: D2 – F#5
• Layer size: 41, 81, or 161
 Nonlinear function: ReLU
 Output layer: sigmoid
 Optimizer: RMSprop
 Dropout: 20%
 Implemented in Keras
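The configuration above maps directly onto a small feed-forward network. A minimal sketch of one column, assuming the Keras Sequential API; the loss choice is an assumption, since the slide specifies only the sigmoid output and RMSprop.

    # One DNN column: 512-512-256 hidden units, ReLU, 20% dropout,
    # sigmoid output of 41/81/161 classes depending on the resolution.
    # Input size assumes 256 spectrogram bins x 11 frames.
    from keras.models import Sequential
    from keras.layers import Dense, Dropout

    def build_column(input_dim=256 * 11, n_classes=41):
        model = Sequential()
        model.add(Dense(512, activation='relu', input_dim=input_dim))
        model.add(Dropout(0.2))
        model.add(Dense(512, activation='relu'))
        model.add(Dropout(0.2))
        model.add(Dense(256, activation='relu'))
        model.add(Dropout(0.2))
        # Sigmoid output (the talk notes it worked slightly better than softmax)
        model.add(Dense(n_classes, activation='sigmoid'))
        # Loss choice is an assumption, not stated on the slide
        model.compile(optimizer='rmsprop', loss='binary_crossentropy')
        return model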
Motivation
: Classification accuracy & pitch resolution

<Fig. (b) Classification accuracy on the validation set — as pitch resolution increases (res_1 → res_2 → res_4), classification accuracy drops: a trade-off between high pitch resolution and high classification accuracy>
<Fig. (a) Multi-column DNN — one DNN column per resolution (res_1, res_2, res_4)>
[4] Ciregan, Dan, Ueli Meier, and Jürgen Schmidhuber. "Multi-column deep neural networks for image classification." Computer Vision and Pattern Recognition (CVPR), 2012.
[5] Agostinelli, Forest, Michael R. Anderson, and Honglak Lee. "Adaptive multi-column deep neural networks with application to robust image denoising." Advances in Neural Information Processing Systems, 2013.
Motivation
: Multi-column DNN

 Originally devised as an ensemble method to improve DNN performance on image classification [4]
 Also applied to robust image denoising, with each column trained on a different noise type [5]
Proposed method
: Architecture of multi-column DNN (MCDNN)

 ‘res_1’ → 1 semitone (ex: 40, 41, 42, …)
 ‘res_2’ → 0.5 semitone (ex: 40, 40.5, 41, …)
 ‘res_N’ → 1/N semitone (a quantization sketch follows below)
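A minimal sketch of the label quantization this resolution scheme implies, assuming the D2–F#5 range from the configuration slide; the helper name and rounding convention are illustrative, not from the paper.

    # Map an f0 value (Hz) to a class index at 1/n-semitone resolution over
    # D2-F#5 (MIDI 38-78); yields 41, 81, or 161 labels for n = 1, 2, 4.
    import numpy as np

    def f0_to_label(f0_hz, n=4, midi_min=38.0):
        midi = 69.0 + 12.0 * np.log2(f0_hz / 440.0)  # Hz -> MIDI note number
        return int(round((midi - midi_min) * n))     # 0 .. 40*n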
 Output layer sizes: res_1 → 41 labels (1 semitone), res_2 → 81 (0.5 semitone), res_4 → 161 (0.25 semitone)
 Lower-resolution predictions are expanded to 161 bins by replicating each element, and the column posteriors are multiplied together, which corresponds to summing their log-likelihoods (a sketch follows below)
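A sketch of that combination step, assuming NumPy arrays of per-frame posteriors from each column; the boundary handling in the replication is an assumption, since the slide only states that elements are replicated to a common size.

    import numpy as np

    def expand(p, target=161):
        # Replicate each element until the vector reaches the target size;
        # np.repeat plus trimming is one simple realization of the replication.
        factor = int(np.ceil(target / len(p)))
        return np.repeat(p, factor)[:target]

    def combine_columns(p1, p2, p4):
        # p1: 41 labels (res_1), p2: 81 (res_2), p4: 161 (res_4).
        # Multiplying the expanded posteriors corresponds to summing the
        # columns' log-likelihoods; renormalize per frame.
        combined = expand(p1) * expand(p2) * p4
        return combined / combined.sum()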
Training Datasets

 RWC Database [6]
• American and Japanese pop music
• 100 songs: training set (85 songs), validation set (15 songs)
 Data augmentation (see the sketch below)
• Pitch-shifted songs (±1, ±2 semitones)
• 100 → 500 songs
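A minimal sketch of the pitch-shifting augmentation, assuming librosa and soundfile; the file naming is illustrative.

    # Create +/-1 and +/-2 semitone pitch-shifted copies of a song,
    # turning 100 songs into 500, as on the slide.
    import librosa
    import soundfile as sf

    def augment(path):
        y, sr = librosa.load(path, sr=None)
        for steps in (-2, -1, 1, 2):
            shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)
            sf.write(f"{path}_shift{steps:+d}.wav", shifted, sr)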
<Figure: data augmentation — the RWC set (100 songs) plus four pitch-shifted copies (±1, ±2 semitones) form the MCDNN training data>
 MedleyDB [7]
• 122 songs in total; 60 vocal songs used
• Genres: Singer/Songwriter, Classical, Rock, Folk, Pop, Musical Theatre
[6] Goto, Masataka, et al. "RWC Music Database: Popular, Classical and Jazz Music Databases." ISMIR. Vol. 2. 2002.
[7] Bittner, Rachel M., et al. "MedleyDB: A Multitrack Dataset for Annotation-Intensive MIR Research." ISMIR. 2014.
Temporal smoothing by HMM
: Viterbi decoding

 By Bayes' theorem, the DNN posterior p(q_t | x_t) is converted into a scaled likelihood using the pitch prior p(q_t):
   p(x_t | q_t) ∝ p(q_t | x_t) / p(q_t)
 Viterbi decoding then finds the most likely pitch sequence under the transition probabilities:
   q* = argmax over q_1..q_T of ∏_t p(x_t | q_t) · p(q_t | q_t−1)
 The prior and transition matrix are estimated from the ground truth of the training set, following Ellis and Poliner [3]
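A minimal Viterbi-smoothing sketch under these definitions, assuming `post` is a (T, N) array of per-frame DNN posteriors and `prior` / `trans` were estimated from training ground truth; the epsilon flooring is an implementation assumption.

    import numpy as np

    def viterbi_smooth(post, prior, trans, eps=1e-12):
        # Bayes' rule: scaled log-likelihood = log posterior - log prior.
        loglik = np.log(post + eps) - np.log(prior + eps)
        logtrans = np.log(trans + eps)
        T, N = loglik.shape
        delta = np.empty((T, N))           # best log-score ending in each state
        psi = np.zeros((T, N), dtype=int)  # backpointers
        delta[0] = np.log(prior + eps) + loglik[0]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + logtrans  # (from, to)
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + loglik[t]
        # Backtrace the best state sequence.
        path = np.empty(T, dtype=int)
        path[-1] = delta[-1].argmax()
        for t in range(T - 2, -1, -1):
            path[t] = psi[t + 1, path[t + 1]]
        return path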
Singing voice detection
: Energy-based approach

 Spectral energy in the 200–1800 Hz band, where the singing-voice level is high
 The sum is normalized by the median energy in the band

<Figure: pipeline — DNN posteriors → HMM smoothing → singing voice detection (SVD) → melody contour>
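A sketch of this energy-based detector, assuming a magnitude spectrogram and its frequency axis; the threshold value is an assumption, not given on the slide.

    import numpy as np

    def detect_voice(S, freqs, threshold=1.0):
        # S: magnitude spectrogram (freq_bins x frames); freqs in Hz.
        band = (freqs >= 200) & (freqs <= 1800)  # vocal-dominant band
        energy = S[band].sum(axis=0)             # per-frame band energy
        energy /= np.median(energy)              # normalize by median band energy
        return energy > threshold                # boolean voiced-frame mask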
Results
The classification accuracy
: Multi-frame spectrogram

<Figure: classification accuracy on the validation set — accuracy increases up to 11 input frames, then converges>
Results
Classification accuracy
: Data augmentation

- RWC (100 songs)
- RWC (100 songs) + pitch-shifted RWC (100 × 4 songs)
- RWC (100 songs) + pitch-shifted RWC (400 songs) + MedleyDB (60 songs)
Results
Temporal smoothing by HMM
: Performance of smoothing
Results
A case example of melody extraction on an opera song (ADC2004)

 SCDNN (res=4)
 SCDNN (res=4) + data augmentation
 1-2-4 MCDNN + data augmentation
Evaluation
: Test Datasets

 ADC2004 [8]
• 12 songs
• Rock, R&B, Pop, Jazz, Opera
 MIREX05 [9]
• 25 songs in total; 13 used (9 vocal songs + 4 instrumental songs)
 MIR-1k [10]
• 1000 vocal songs

[8,9] http://labrosa.ee.columbia.edu/projects/melody
[10] https://sites.google.com/site/unvoicedsoundseparation/mir-1k
Evaluation
Single-column vs. multi-column
: Raw / chroma accuracy

 Test set
• Case 1: all songs (including songs without vocals)
• Case 2: singing-voice songs only
 The voiced frames are assumed to be perfectly detected.
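The talk mentions scoring with mir_eval; a minimal sketch of computing these metrics, assuming reference and estimated melody arrays (times in seconds, frequencies in Hz, 0 for unvoiced frames).

    import mir_eval

    def score_melody(ref_time, ref_freq, est_time, est_freq):
        scores = mir_eval.melody.evaluate(ref_time, ref_freq, est_time, est_freq)
        return (scores['Raw Pitch Accuracy'],    # correct pitch on voiced frames
                scores['Raw Chroma Accuracy'],   # octave-agnostic pitch accuracy
                scores['Overall Accuracy'])      # includes voicing decisions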
Evaluation
Comparison to state-of-the-art methods
: ADC2004

Evaluation
Comparison to state-of-the-art methods
: MIREX05
Conclusion

 Summary
• Multi-frame spectrogram
• Data augmentation
• Multi-column DNN
• HMM-based smoothing
 Limitation & future work
• Works only for singing-voice melody
• Better singing voice detection
• Replace the HMM with an RNN
Limitation & future work

 Works only for singing-voice melody
 Better singing voice detection needed

<Figure: multi-column DNN (1-2-4) accuracy — songs with vocals vs. songs with vocals & non-vocals>
Thank you
keums@kaist.ac.kr
Appendix
Pre-processing

 Resample to 8 kHz
 Merge stereo channels into mono
 STFT:
• FFT size: 1024 (1 bin = 7.81 Hz)
• Window size: 1024 (Hann)
• Hop size: 80 (1 frame = 10 ms)
• Magnitude compressed on a log scale
• Using 256 bins (0–2000 Hz: vocal range)
 Multi-frame input
• 11-frame spectrogram per example
• Only voiced frames used for training

[3] Ellis, Daniel PW, and Graham E. Poliner. "Classification-based melody transcription." Machine Learning 65.2 (2006): 439-456.
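A sketch of this pre-processing chain, assuming librosa; the log1p compression and the frame-stacking helper are implementation assumptions consistent with the slide's numbers.

    import numpy as np
    import librosa

    def preprocess(path, n_context=11):
        y, sr = librosa.load(path, sr=8000, mono=True)    # resample, merge to mono
        S = np.abs(librosa.stft(y, n_fft=1024, hop_length=80,
                                window='hann'))           # 1 bin = 7.81 Hz, 10 ms hop
        S = np.log1p(S[:256])                             # log scale, 0-2 kHz band
        # Stack each frame with its neighbors into an 11-frame example.
        pad = n_context // 2
        Sp = np.pad(S, ((0, 0), (pad, pad)), mode='edge')
        examples = [Sp[:, t:t + n_context].ravel() for t in range(S.shape[1])]
        return np.stack(examples)                         # (n_frames, 256 * 11)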
Appendix
Comparison to state-of-the-art methods
: MIR-1k

Appendix
MIREX 2016 results
Appendix
Multi-frame spectrogram

<Figure: a multi-frame example is built by stacking each spectrogram frame with its neighboring frames>

Editor's Notes

  • #2 Thanks for the introduction. Hello, I'm Sangeun Kum. This is the final presentation of ISMIR 2016, and I'll talk about my research.
  • #3 Melody extraction is automatically obtaining the fundamental frequency curve of the predominant melodic line from polyphonic music. This is an example of melody extraction using our proposed method. Let's follow the pitch line while listening to 'Halo'. Pretty good? As you know, all demos are perfect.
  • #4 I took this table from Salamon's review paper. Various algorithms have been proposed so far; they can be broadly classified into three categories. Salience-based approaches use a salience function to estimate the salience of each possible pitch value. Source-separation-based approaches isolate the melody source from the mixture. These two approaches make up the majority of melody extraction algorithms. On the other hand, the data-driven approach has rarely been attempted.
  • #5 In 2006, Ellis and Poliner proposed a fully data-driven method using a support vector machine to classify 60 MIDI notes from a spectrogram. The resolution of the output is one semitone, so it can lose detailed information about singing styles such as vibrato. Last year, Bittner et al. proposed an algorithm using a classifier to predict melody contours. However, the data-driven approach is still rarely attempted.
  • #6 Therefore, we addressed some issues. No one had attempted to use a deep neural network to extract melody. You know, deep learning is a really hot keyword in research these days (although the deep learning session started after the banquet). Anyway, deep learning has proved to perform well given sufficient labeled data and computing power, so we tried to use it.
  • #7 Our deep learning method is based on a classification approach, so we tried to maintain both high accuracy and high resolution.
  • #8 The third point is that melody-labeled public datasets are not widely available and manual labeling is laborious. Therefore, it is desirable to augment existing datasets.
  • #9 The last one is a singing voice detector to obtain high overall accuracy.
  • #10 We configure the DNN as follows. We train the DNNs using only the voiced frames of the training set, and we take a multi-frame spectrogram as input to capture contextual information. The pitch labels cover D2 to F#5 with different resolutions. For the output layer, we use the sigmoid function instead of the softmax function because the sigmoid worked slightly better in our experiments. We optimize the objective function using RMSprop and 20% dropout on all hidden layers to avoid overfitting to the training set. For fast computing, we run the code using Keras, a deep learning library in Python, on a computer with two GPUs.
  • #11 We then check the validation accuracy. Figure (b) shows the classification accuracy of each DNN with different resolutions. You can see that as the resolution increases, the accuracy drops quite significantly. There is a trade-off between pitch resolution and classification accuracy, but we need both. So, in order to take advantage of both, we combine the outputs of the DNNs.
  • #12 The MCDNN was originally devised as an ensemble method to improve the performance of DNNs for image classification. Several deep neural columns become experts on inputs in different ways; therefore, by averaging their predictions, we can decrease the errors. It was applied to image denoising as well. In that approach, each column was trained on a different type of noise and the outputs were weighted to handle the noise types. Our proposed model sits halfway between these two approaches.
  • #13 This is the architecture of our proposed method. By using the multi-column DNN, our model produces a finer pitch resolution more accurately. Each of the DNN columns takes multi-frame spectrogram input to capture contextual information from neighboring frames, and each column predicts pitch labels with a different resolution. The lowest resolution is 1 semitone; the next one has a resolution twice as high. Given the outputs of the columns, we compute the combined posterior probability.
  • #14 Pitch predictions with lower resolutions are expanded by replicating each element so that the output sizes are the same for all columns. Mathematically, we multiply all probabilities together, which corresponds to summing the log-likelihoods of the predictions.
  • #15 For temporal smoothing, we use Viterbi decoding and choose singing-voice frames using the SVD. Finally, we get the melody pitch contour.
  • #16 We use the RWC pop music database as our main training set. Also, in order to enlarge the training set, we augment it by pitch-shifting by ±1 and ±2 semitones.
  • #17 The RWC database includes only pop music, and melody contours tend to have different styles depending on genre. So, for genre diversity and to avoid overfitting, we use 60 vocal tracks from the MedleyDB dataset as an additional training set.
  • #18 We conduct Viterbi decoding based on a hidden Markov model for temporal smoothing.
  • #19 We follow Ellis and Poliner's steps: we estimate the prior and transition matrix from the ground truth of the training set, and then use the DNN predictions over whole tracks as posterior probabilities.
  • #20 The DNN is trained with only voiced frames for pitch classification. Therefore, a singing voice detection step is necessary in the test phase. However, singing voice detection itself is a challenging task and not our main concern in this paper, so we use a simple energy-based singing voice detector.
  • #21 This is the results part. As mentioned, our model takes multiple spectrogram frames to capture contextual information. To find an optimal size, we experimented with multi-frame inputs taken from neighboring spectrogram frames. The accuracy increases up to 11 frames and then converges to a certain level. This is expected because pitch contours usually have continuous curve patterns, and these temporal features can be captured better by taking multiple frames. For the following experiments, we fix the input size to 11 frames.
  • #22 This figure shows the classification accuracy for varying pitch resolutions when the pitch-shifted RWC data and the MedleyDB data are added to the training data pool in turn. Overall, the accuracy increases by 2 to 3% with the additional sets.
  • #23 This table shows the results as performance increments after applying Viterbi decoding to the 1-2-4 multi-column DNN on the test sets. It helps capture long-term temporal dependencies.
  • #24 Here we verify this by illustrating three examples from different models. We selected an opera song from the ADC2004 dataset because it has dynamic pitch motions such as high pitches and strong vibrato. This one is from the single-column DNN with a pitch resolution of 4, trained only on the RWC dataset.
  • #25 This one is from the same SCDNN but trained with additional data. Compared to the first model, the additional songs help track the vibrato, but the second model still misses the whole excursion.
  • #26 The right one is from the 1-2-4 multi-column DNN. With the additional resolutions, the multi-column DNN makes further improvement, tracking the pitch contours quite precisely.
  • #27 We examine our proposed model with three public datasets using mir-eval.
  • #28 Because our model handles singing voice only, we evaluate it on all songs (case 1) and on those with singing voices (case 2) separately. We assumed the voiced frames are perfectly detected in order to verify the performance of the classifier. As you can see, the results of the multi-column DNN are better than those of the SCDNN (compare the blue and yellow bars). Also, the MCDNN increases the accuracies on the sets with singing voices (compare cases 1 and 2). This indicates that our model is specialized for singing-voice songs.
  • #29 We compare our proposed method, using the energy-based SVD, with state-of-the-art algorithms, which are all based on pitch-salience methods. This is the result on the ADC dataset. Unfortunately, the performance is poor when the test set includes all songs; however, after excluding non-vocal songs, the result is not bad.
  • #30 This is the result on the MIREX05 dataset. The accuracies are comparable to some of the algorithms when the test sets include singing vocals.
  • #31 To summarize, in this paper we proposed a novel data-driven melody extraction algorithm using multi-column deep neural networks. We showed how the data-driven approach can be improved by different settings of the model, such as data augmentation and the multi-column DNN.
  • #32 The limitation of this model is that it works well only for singing voice, because we trained it only with vocal songs. However, this also indicates that our model can be improved into a general melody extractor if a sufficient amount of instrumental pieces is included in the training sets. Since we used a simple energy-based singing voice detector, the performance of our model has limitations. However, the results show that, with a better voice detector, our model can be improved up to the perfect voice-detection case.
  • #33 And we'll replace the HMM with an RNN, end to end.
  • #34 Thanks for your attention.
  • #35 In the pre-processing step, we resample the audio files to 8 kHz and merge the stereo channels into mono. We then compute the spectrogram with a Hann window and a hop size of 80 samples, and finally compress the magnitude on a log scale. We use only 256 bins, from 0 to 2 kHz, where human singing voices have a relatively greater level than the background music. And we use only voiced frames for training.
  • #36 And this is the result on the MIR-1k dataset, along with ADC04 and MIREX05.