MELODY EXTRACTION ON VOCAL SEGMENTS
USING MULTI-COLUMN DEEP NEURAL NETWORKS
Sangeun Kum, Changhyun Oh, Juhan Nam
keums@kaist.ac.kr
Music and Audio Computing Lab.
Korea Advanced Institute of Science and Technology
11 Aug. 2016
Melody extraction
: from polyphonic music
 Definition:
Automatically obtain the f0 curve of the predominant melodic line drawn from multiple sources [1]
[1] Bittner, R. M., Salamon, J., Essid, S., & Bello, J. P. "Melody extraction by contour classification." In Proc. ISMIR, 2015, pp. 500-506.
<An example of melody extraction, Beyoncé – ‘Halo’>
Melody extraction algorithms

 Salience-based approaches
 Source-separation-based approaches
 Data-driven approaches

[2] Salamon, Justin, et al. "Melody extraction from polyphonic music signals: Approaches, applications, and challenges." IEEE Signal Processing Magazine 31.2 (2014): 118-134.
Melody extraction algorithms
: Data-driven approaches

 Support vector machine note classifier [3]
 Pitch labels: 60 MIDI notes (G2–F#7)
 Resolution = 1 semitone
 Loses detailed information about singing styles (ex: vibrato, transition patterns)

<Figure: posteriorgram [3] — data → support vector machine → 60-note MIDI-scale posteriorgram (G2–F#7) → HMM>

[3] Ellis, Daniel PW, and Graham E. Poliner. "Classification-based melody transcription." Machine Learning 65.2 (2006): 439-456.
Addressed issues

1. Deep neural network
2. Classification-based approach: high resolution
3. Data augmentation
4. Singing voice detector
Deep neural networks
: Configuration
<Figure: DNN configuration — multi-frame spectrogram input layer → hidden layers (512-512-256) → output layer spanning D2–F#5>
 Input:
• Multi-frame spectrogram
• Train: singing-voice frames only
• Test: all frames
 Hidden layers: 512-512-256
 Output:
• Range: D2 – F#5
• Layer size: 41, 81, or 161
 Nonlinear function: ReLU
 Output layer: sigmoid
 Optimizer: RMSprop
 Dropout: 20%
 Implemented in Keras
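The configuration above maps directly onto a small feed-forward network. A minimal sketch of one column, assuming the Keras Sequential API; the loss choice is an assumption, since the slide specifies only the sigmoid output and RMSprop.

    # One DNN column: 512-512-256 hidden units, ReLU, 20% dropout,
    # sigmoid output of 41/81/161 classes depending on the resolution.
    # Input size assumes 256 spectrogram bins x 11 frames.
    from keras.models import Sequential
    from keras.layers import Dense, Dropout

    def build_column(input_dim=256 * 11, n_classes=41):
        model = Sequential()
        model.add(Dense(512, activation='relu', input_dim=input_dim))
        model.add(Dropout(0.2))
        model.add(Dense(512, activation='relu'))
        model.add(Dropout(0.2))
        model.add(Dense(256, activation='relu'))
        model.add(Dropout(0.2))
        # Sigmoid output (the talk notes it worked slightly better than softmax)
        model.add(Dense(n_classes, activation='sigmoid'))
        # Loss choice is an assumption, not stated on the slide
        model.compile(optimizer='rmsprop', loss='binary_crossentropy')
        return model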
Motivation
: Classification accuracy & pitch resolution

<Fig. (b) Classification accuracy on the validation set — as pitch resolution increases (res_1 → res_2 → res_4), classification accuracy drops: a trade-off between high pitch resolution and high classification accuracy>
<Fig. (a) Multi-column DNN — one DNN column per resolution (res_1, res_2, res_4)>
[4] Ciregan, Dan, Ueli Meier, and Jürgen Schmidhuber. "Multi-column deep neural networks for image classification." Computer Vision and Pattern Recognition (CVPR), 2012.
[5] Agostinelli, Forest, Michael R. Anderson, and Honglak Lee. "Adaptive multi-column deep neural networks with application to robust image denoising." Advances in Neural Information Processing Systems, 2013.
Motivation
: Multi-column DNN

 Originally devised as an ensemble method to improve DNN performance on image classification [4]
 Also applied to robust image denoising, with each column trained on a different noise type [5]
Proposed method
: Architecture of multi-column DNN (MCDNN)

 ‘res_1’ → 1 semitone (ex: 40, 41, 42, …)
 ‘res_2’ → 0.5 semitone (ex: 40, 40.5, 41, …)
 ‘res_N’ → 1/N semitone (a quantization sketch follows below)
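A minimal sketch of the label quantization this resolution scheme implies, assuming the D2–F#5 range from the configuration slide; the helper name and rounding convention are illustrative, not from the paper.

    # Map an f0 value (Hz) to a class index at 1/n-semitone resolution over
    # D2-F#5 (MIDI 38-78); yields 41, 81, or 161 labels for n = 1, 2, 4.
    import numpy as np

    def f0_to_label(f0_hz, n=4, midi_min=38.0):
        midi = 69.0 + 12.0 * np.log2(f0_hz / 440.0)  # Hz -> MIDI note number
        return int(round((midi - midi_min) * n))     # 0 .. 40*n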
 Output layer sizes: res_1 → 41 labels (1 semitone), res_2 → 81 (0.5 semitone), res_4 → 161 (0.25 semitone)
 Lower-resolution predictions are expanded to 161 bins by replicating each element, and the column posteriors are multiplied together, which corresponds to summing their log-likelihoods (a sketch follows below)
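A sketch of that combination step, assuming NumPy arrays of per-frame posteriors from each column; the boundary handling in the replication is an assumption, since the slide only states that elements are replicated to a common size.

    import numpy as np

    def expand(p, target=161):
        # Replicate each element until the vector reaches the target size;
        # np.repeat plus trimming is one simple realization of the replication.
        factor = int(np.ceil(target / len(p)))
        return np.repeat(p, factor)[:target]

    def combine_columns(p1, p2, p4):
        # p1: 41 labels (res_1), p2: 81 (res_2), p4: 161 (res_4).
        # Multiplying the expanded posteriors corresponds to summing the
        # columns' log-likelihoods; renormalize per frame.
        combined = expand(p1) * expand(p2) * p4
        return combined / combined.sum()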
Training Datasets

 RWC Database [6]
• American and Japanese pop music
• 100 songs: training set (85 songs), validation set (15 songs)
 Data augmentation (see the sketch below)
• Pitch-shifted songs (±1, ±2 semitones)
• 100 → 500 songs
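A minimal sketch of the pitch-shifting augmentation, assuming librosa and soundfile; the file naming is illustrative.

    # Create +/-1 and +/-2 semitone pitch-shifted copies of a song,
    # turning 100 songs into 500, as on the slide.
    import librosa
    import soundfile as sf

    def augment(path):
        y, sr = librosa.load(path, sr=None)
        for steps in (-2, -1, 1, 2):
            shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)
            sf.write(f"{path}_shift{steps:+d}.wav", shifted, sr)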
<Figure: data augmentation — the RWC set (100 songs) plus four pitch-shifted copies (±1, ±2 semitones) form the MCDNN training data>
 MedleyDB [7]
• 122 songs in total; 60 vocal songs used
• Genres: Singer/Songwriter, Classical, Rock, Folk, Pop, Musical Theatre
[6] Goto, Masataka, et al. "RWC Music Database: Popular, Classical and Jazz Music Databases." ISMIR. Vol. 2. 2002.
[7] Bittner, Rachel M., et al. "MedleyDB: A Multitrack Dataset for Annotation-Intensive MIR Research." ISMIR. 2014.
Temporal smoothing by HMM
: Viterbi decoding

 By Bayes' theorem, the DNN posterior p(q_t | x_t) is converted into a scaled likelihood using the pitch prior p(q_t):
   p(x_t | q_t) ∝ p(q_t | x_t) / p(q_t)
 Viterbi decoding then finds the most likely pitch sequence under the transition probabilities:
   q* = argmax over q_1..q_T of ∏_t p(x_t | q_t) · p(q_t | q_t−1)
 The prior and transition matrix are estimated from the ground truth of the training set, following Ellis and Poliner [3]
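A minimal Viterbi-smoothing sketch under these definitions, assuming `post` is a (T, N) array of per-frame DNN posteriors and `prior` / `trans` were estimated from training ground truth; the epsilon flooring is an implementation assumption.

    import numpy as np

    def viterbi_smooth(post, prior, trans, eps=1e-12):
        # Bayes' rule: scaled log-likelihood = log posterior - log prior.
        loglik = np.log(post + eps) - np.log(prior + eps)
        logtrans = np.log(trans + eps)
        T, N = loglik.shape
        delta = np.empty((T, N))           # best log-score ending in each state
        psi = np.zeros((T, N), dtype=int)  # backpointers
        delta[0] = np.log(prior + eps) + loglik[0]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + logtrans  # (from, to)
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + loglik[t]
        # Backtrace the best state sequence.
        path = np.empty(T, dtype=int)
        path[-1] = delta[-1].argmax()
        for t in range(T - 2, -1, -1):
            path[t] = psi[t + 1, path[t + 1]]
        return path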
Singing voice detection
: Energy-based approach

 Spectral energy in the 200–1800 Hz band, where the singing-voice level is high
 The sum is normalized by the median energy in the band

<Figure: pipeline — DNN posteriors → HMM smoothing → singing voice detection (SVD) → melody contour>
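A sketch of this energy-based detector, assuming a magnitude spectrogram and its frequency axis; the threshold value is an assumption, not given on the slide.

    import numpy as np

    def detect_voice(S, freqs, threshold=1.0):
        # S: magnitude spectrogram (freq_bins x frames); freqs in Hz.
        band = (freqs >= 200) & (freqs <= 1800)  # vocal-dominant band
        energy = S[band].sum(axis=0)             # per-frame band energy
        energy /= np.median(energy)              # normalize by median band energy
        return energy > threshold                # boolean voiced-frame mask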
Results
The classification accuracy
: Multi-frame spectrogram

<Figure: classification accuracy on the validation set — accuracy increases up to 11 input frames, then converges>
Results
Classification accuracy
: Data augmentation

- RWC (100 songs)
- RWC (100 songs) + pitch-shifted RWC (100 × 4 songs)
- RWC (100 songs) + pitch-shifted RWC (400 songs) + MedleyDB (60 songs)
Results
Temporal smoothing by HMM
: Performance of smoothing
Results
A case example of melody extraction on an opera song (ADC2004)

 SCDNN (res=4)
 SCDNN (res=4) + data augmentation
 1-2-4 MCDNN + data augmentation
Evaluation
: Test Datasets

 ADC2004 [8]
• 12 songs
• Rock, R&B, Pop, Jazz, Opera
 MIREX05 [9]
• 25 songs in total; 13 used (9 vocal songs + 4 instrumental songs)
 MIR-1k [10]
• 1000 vocal songs

[8,9] http://labrosa.ee.columbia.edu/projects/melody
[10] https://sites.google.com/site/unvoicedsoundseparation/mir-1k
Evaluation
Single-column vs. multi-column
: Raw / chroma accuracy

 Test set
• Case 1: all songs (including songs without vocals)
• Case 2: singing-voice songs only
 The voiced frames are assumed to be perfectly detected.
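The talk mentions scoring with mir_eval; a minimal sketch of computing these metrics, assuming reference and estimated melody arrays (times in seconds, frequencies in Hz, 0 for unvoiced frames).

    import mir_eval

    def score_melody(ref_time, ref_freq, est_time, est_freq):
        scores = mir_eval.melody.evaluate(ref_time, ref_freq, est_time, est_freq)
        return (scores['Raw Pitch Accuracy'],    # correct pitch on voiced frames
                scores['Raw Chroma Accuracy'],   # octave-agnostic pitch accuracy
                scores['Overall Accuracy'])      # includes voicing decisions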
Evaluation
Comparison to state-of-the-art methods
: ADC2004

Evaluation
Comparison to state-of-the-art methods
: MIREX05
Conclusion

 Summary
• Multi-frame spectrogram
• Data augmentation
• Multi-column DNN
• HMM-based smoothing
 Limitation & future work
• Works only for singing-voice melody
• Better singing voice detection
• Replace the HMM with an RNN
Limitation & future work

 Works only for singing-voice melody
 Better singing voice detection needed

<Figure: multi-column DNN (1-2-4) accuracy — songs with vocals vs. songs with vocals & non-vocals>
Thank you
keums@kaist.ac.kr
Appendix
Pre-processing

 Resample to 8 kHz
 Merge stereo channels into mono
 STFT:
• FFT size: 1024 (1 bin = 7.81 Hz)
• Window size: 1024 (Hann)
• Hop size: 80 (1 frame = 10 ms)
• Magnitude compressed on a log scale
• Using 256 bins (0–2000 Hz: vocal range)
 Multi-frame input
• 11-frame spectrogram per example
• Only voiced frames used for training

[3] Ellis, Daniel PW, and Graham E. Poliner. "Classification-based melody transcription." Machine Learning 65.2 (2006): 439-456.
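A sketch of this pre-processing chain, assuming librosa; the log1p compression and the frame-stacking helper are implementation assumptions consistent with the slide's numbers.

    import numpy as np
    import librosa

    def preprocess(path, n_context=11):
        y, sr = librosa.load(path, sr=8000, mono=True)    # resample, merge to mono
        S = np.abs(librosa.stft(y, n_fft=1024, hop_length=80,
                                window='hann'))           # 1 bin = 7.81 Hz, 10 ms hop
        S = np.log1p(S[:256])                             # log scale, 0-2 kHz band
        # Stack each frame with its neighbors into an 11-frame example.
        pad = n_context // 2
        Sp = np.pad(S, ((0, 0), (pad, pad)), mode='edge')
        examples = [Sp[:, t:t + n_context].ravel() for t in range(S.shape[1])]
        return np.stack(examples)                         # (n_frames, 256 * 11)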
Appendix
Comparison to state-of-the-art methods
: MIR-1k

Appendix
MIREX 2016 results
Appendix
Multi-frame spectrogram

<Figure: a multi-frame example is built by stacking each spectrogram frame with its neighboring frames>

Editor's Notes

  • #2 Thanks for the introduction. Hello, I'm Sangeun Kum. This is the final presentation of ISMIR 2016, and I'll talk about my research.
  • #3 Melody extraction is automatically obtaining the fundamental frequency curve of the predominant melodic line from polyphonic music. This is an example of melody extraction using our proposed method. Let's follow the pitch line while listening to 'Halo'. Pretty good? As you know, all demos are perfect.
  • #4 I took this table from Salamon's review paper. Various algorithms have been proposed so far; they can be broadly classified into three categories. Salience-based approaches use a salience function to estimate the salience of each possible pitch value. Source-separation-based approaches isolate the melody source from the mixture. These two approaches make up the majority of melody extraction algorithms. On the other hand, the data-driven approach has rarely been attempted.
  • #5 In 2006, Ellis and Poliner proposed a fully data-driven method using a support vector machine to classify 60 MIDI notes from a spectrogram. The resolution of the output is one semitone, so it can lose detailed information about singing styles such as vibrato. Last year, Bittner et al. proposed an algorithm using a classifier to predict melody contours. However, the data-driven approach is still rarely attempted.
  • #6 Therefore, we addressed some issues. No one had attempted to use a deep neural network to extract melody. You know, deep learning is a really hot keyword in research these days (although the deep learning session started after the banquet). Anyway, deep learning has proved to perform well given sufficient labeled data and computing power, so we tried to use it.
  • #7 Our deep learning method is based on a classification approach, so we tried to maintain both high accuracy and high resolution.
  • #8 The third point is that melody-labeled public datasets are not widely available and manual labeling is laborious. Therefore, it is desirable to augment existing datasets.
  • #9 The last one is a singing voice detector to obtain high overall accuracy.
  • #10 We configure the DNN as follows. We train the DNNs using only the voiced frames of the training set, and we take a multi-frame spectrogram as input to capture contextual information. The pitch labels cover D2 to F#5 with different resolutions. For the output layer, we use the sigmoid function instead of the softmax function because the sigmoid worked slightly better in our experiments. We optimize the objective function using RMSprop and 20% dropout on all hidden layers to avoid overfitting to the training set. For fast computing, we run the code using Keras, a deep learning library in Python, on a computer with two GPUs.
  • #11 We then check the validation accuracy. Figure (b) shows the classification accuracy of each DNN with different resolutions. You can see that as the resolution increases, the accuracy drops quite significantly. There is a trade-off between pitch resolution and classification accuracy, but we need both. So, in order to take advantage of both, we combine the outputs of the DNNs.
  • #12 The MCDNN was originally devised as an ensemble method to improve the performance of DNNs for image classification. Several deep neural columns become experts on inputs in different ways; therefore, by averaging their predictions, we can decrease the errors. It was applied to image denoising as well. In that approach, each column was trained on a different type of noise and the outputs were weighted to handle the noise types. Our proposed model sits halfway between these two approaches.
  • #13 This is the architecture of our proposed method. By using the multi-column DNN, our model produces a finer pitch resolution more accurately. Each of the DNN columns takes multi-frame spectrogram input to capture contextual information from neighboring frames, and each column predicts pitch labels with a different resolution. The lowest resolution is 1 semitone; the next one has a resolution twice as high. Given the outputs of the columns, we compute the combined posterior probability.
  • #14 Pitch predictions with lower resolutions are expanded by replicating each element so that the output sizes are the same for all columns. Mathematically, we multiply all probabilities together, which corresponds to summing the log-likelihoods of the predictions.
  • #15 For temporal smoothing, we use Viterbi decoding and choose singing-voice frames using the SVD. Finally, we get the melody pitch contour.
  • #16 We use the RWC pop music database as our main training set. Also, in order to enlarge the training set, we augment it by pitch-shifting by ±1 and ±2 semitones.
  • #17 The RWC database includes only pop music, and melody contours tend to have different styles depending on genre. So, for genre diversity and to avoid overfitting, we use 60 vocal tracks from the MedleyDB dataset as an additional training set.
  • #18 We conduct Viterbi decoding based on a hidden Markov model for temporal smoothing.
  • #19 We follow Ellis and Poliner's steps: we estimate the prior and transition matrix from the ground truth of the training set, and then use the DNN predictions over whole tracks as posterior probabilities.
  • #20 The DNN is trained with only voiced frames for pitch classification. Therefore, a singing voice detection step is necessary in the test phase. However, singing voice detection itself is a challenging task and not our main concern in this paper, so we use a simple energy-based singing voice detector.
  • #21 This is the results part. As mentioned, our model takes multiple spectrogram frames to capture contextual information. To find an optimal size, we experimented with multi-frame inputs taken from neighboring spectrogram frames. The accuracy increases up to 11 frames and then converges to a certain level. This is expected because pitch contours usually have continuous curve patterns, and these temporal features can be captured better by taking multiple frames. For the following experiments, we fix the input size to 11 frames.
  • #22 This figure shows the classification accuracy for varying pitch resolutions when the pitch-shifted RWC data and the MedleyDB data are added to the training data pool in turn. Overall, the accuracy increases by 2 to 3% with the additional sets.
  • #23 This table shows the results as performance increments after applying Viterbi decoding to the 1-2-4 multi-column DNN on the test sets. It helps capture long-term temporal dependencies.
  • #24 Here we verify this by illustrating three examples from different models. We selected an opera song from the ADC2004 dataset because it has dynamic pitch motions such as high pitches and strong vibrato. This one is from the single-column DNN with a pitch resolution of 4, trained only on the RWC dataset.
  • #25 This one is from the same SCDNN but trained with additional data. Compared to the first model, the additional songs help track the vibrato, but the second model still misses the whole excursion.
  • #26 The right one is from the 1-2-4 multi-column DNN. With the additional resolutions, the multi-column DNN makes further improvement, tracking the pitch contours quite precisely.
  • #27 We examine our proposed model with three public datasets using mir-eval.
  • #28 Because our model handles singing voice only, we evaluate it on all songs (case 1) and on those with singing voices (case 2) separately. We assumed the voiced frames are perfectly detected in order to verify the performance of the classifier. As you can see, the results of the multi-column DNN are better than those of the SCDNN (compare the blue and yellow bars). Also, the MCDNN increases the accuracies on the sets with singing voices (compare cases 1 and 2). This indicates that our model is specialized for singing-voice songs.
  • #29 We compare our proposed method, using the energy-based SVD, with state-of-the-art algorithms, which are all based on pitch-salience methods. This is the result on the ADC dataset. Unfortunately, the performance is poor when the test set includes all songs; however, after excluding non-vocal songs, the result is not bad.
  • #30 This is the result on the MIREX05 dataset. The accuracies are comparable to some of the algorithms when the test sets include singing vocals.
  • #31 To summarize, in this paper we proposed a novel data-driven melody extraction algorithm using multi-column deep neural networks. We showed how the data-driven approach can be improved by different settings of the model, such as data augmentation and the multi-column DNN.
  • #32 The limitation of this model is that it works well only for singing voice, because we trained it only with vocal songs. However, this also indicates that our model can be improved into a general melody extractor if a sufficient amount of instrumental pieces is included in the training sets. Since we used a simple energy-based singing voice detector, the performance of our model has limitations. However, the results show that, with a better voice detector, our model can be improved up to the perfect voice-detection case.
  • #33 And we'll replace the HMM with an RNN, end to end.
  • #34 Thanks for your attention.
  • #35 In the pre-processing step, we resample the audio files to 8 kHz and merge the stereo channels into mono. We then compute the spectrogram with a Hann window and a hop size of 80 samples, and finally compress the magnitude on a log scale. We use only 256 bins, from 0 to 2 kHz, where human singing voices have a relatively greater level than the background music. And we use only voiced frames for training.
  • #36 And this is the result on the MIR-1k dataset, along with ADC04 and MIREX05.