ISMIR 2016_Melody Extraction

ISMIR 2016: Deep learning session (oral presentation)

Published in: Engineering

  1. 1. MELODY EXTRACTION ON VOCAL SEGMENTS USING MULTI-COLUMN DEEP NEURAL NETWORKS. Sangeun Kum, Changhyun Oh, Juhan Nam (keums@kaist.ac.kr). Music and Audio Computing Lab, Korea Advanced Institute of Science and Technology. 11 Aug. 2016
  2. 2. Melody extraction from polyphonic music
     • Definition: automatically obtain the f0 curve of the predominant melodic line drawn from multiple sources [1]
     • Figure: an example of melody extraction (Beyonce, 'Halo')
     [1] Bittner, R. M., Salamon, J., Essid, S., & Bello, J. P. "Melody extraction by contour classification." In Proc. ISMIR (pp. 500-506).
  3. 3. Melody extraction algorithms [2]
     • Salience-based approaches
     • Source-separation-based approaches
     • Data-driven approaches
     [2] Salamon, Justin, et al. "Melody extraction from polyphonic music signals: Approaches, applications, and challenges." IEEE Signal Processing Magazine 31.2 (2014): 118-134.
  4. 4. Melody extraction algorithms: Data-driven approaches
     • Support vector machine (SVM) note classifier [3]
     • Pitch labels: 60 MIDI notes (G2 to F#7)
     • Resolution: 1 semitone
     • Loses detailed information about singing style, e.g. vibrato and transition patterns
     • Diagram labels: Data, Support Vector Machine (60 MIDI notes, G2 to F#7), Posteriorgram, HMM
     • Figure: posteriorgram [3]
     [3] Ellis, Daniel P. W., and Graham E. Poliner. "Classification-based melody transcription." Machine Learning 65.2 (2006): 439-456.
  5. 5. Addressed issues: (1) deep neural network; (2) classification-based approach with high resolution; (3) data augmentation; (4) singing voice detector
  9. 9. Deep neural networks: Configuration
     • Input: multi-frame spectrogram (training: singing voice frames only; test: all frames)
     • Hidden layers: 512-512-256 units
     • Output: pitch range D2 to F#5; output layer size 41, 81, or 161 units
     • Nonlinear function: ReLU
     • Output layer activation: sigmoid
     • Optimizer: RMSprop
     • Dropout: 20%
     • Implemented with Keras
  10. 10. Motivation: Classification accuracy and pitch resolution
      • Fig. (a): Multi-column DNN, with one DNN column per resolution (res_1, res_2, res_4)
      • Fig. (b): Classification accuracy on the validation set for res_1, res_2, and res_4, annotated "high pitch resolution" and "high classification accuracy"
      [4] Ciregan, Dan, Ueli Meier, and Jürgen Schmidhuber. "Multi-column deep neural networks for image classification." Computer Vision and Pattern Recognition (CVPR), 2012.
      [5] Agostinelli, Forest, Michael R. Anderson, and Honglak Lee. "Adaptive multi-column deep neural networks with application to robust image denoising." Advances in Neural Information Processing Systems. 2013.
  11. 11. Motivation: Multi-column DNN
      [4] Ciregan, Dan, Ueli Meier, and Jürgen Schmidhuber. "Multi-column deep neural networks for image classification." Computer Vision and Pattern Recognition (CVPR), 2012.
      [5] Agostinelli, Forest, Michael R. Anderson, and Honglak Lee. "Adaptive multi-column deep neural networks with application to robust image denoising." Advances in Neural Information Processing Systems. 2013.
  12. 12. Proposed method: Architecture of multi-column DNN (MCDNN)
      • Columns: res_1, res_2, res_4
      • 'res_1': 1-semitone resolution, e.g. 40, 41, 42, …
      • 'res_2': 0.5-semitone resolution, e.g. 40, 40.5, 41, …
      • 'res_N': 1/N-semitone resolution
  13. 13. Proposed method: Architecture of multi-column DNN (MCDNN)
      • Diagram labels: 1 semitone, #41; 0.5 semitone, #81; 0.25 semitone, #161 (the diagram also carries two further #161 labels)
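The label counts on this slide (41, 81, 161) follow from the D2 to F#5 range divided into steps of 1/N semitone. The sketch below illustrates that arithmetic and one possible way to quantize an f0 value into a class index. The MIDI numbers for D2 and F#5 (38 and 78), the A4 = 440 Hz reference, and all function names are assumptions made for illustration, not details taken from the slides.

```python
import math

# Assumed MIDI note numbers for the slide's pitch range (D2 = 38, F#5 = 78).
LOW_MIDI = 38
HIGH_MIDI = 78


def num_classes(res):
    """Number of pitch classes at 1/res-semitone resolution."""
    return (HIGH_MIDI - LOW_MIDI) * res + 1


def hz_to_midi(f0_hz):
    """Convert Hz to a fractional MIDI note number (A4 = 440 Hz = MIDI 69)."""
    return 69.0 + 12.0 * math.log2(f0_hz / 440.0)


def pitch_to_class(f0_hz, res):
    """Quantize f0 to the nearest class index, or None if outside the range."""
    idx = round((hz_to_midi(f0_hz) - LOW_MIDI) * res)
    if 0 <= idx < num_classes(res):
        return idx
    return None


def class_to_midi(idx, res):
    """Map a class index back to its (fractional) MIDI note number."""
    return LOW_MIDI + idx / res
```

With these assumptions, `num_classes(1)`, `num_classes(2)`, and `num_classes(4)` reproduce the three layer sizes shown on the slide.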
  16. 16. Training Datasets
      • RWC Database [6]: American and Japanese pop music; 100 songs, split into a training set (85 songs) and a validation set (15 songs)
      • Data augmentation: pitch-shifted songs (±1, ±2 semitones), 100 → 500 songs (RWC original plus RWC +1, -1, +2, and -2 semitones)
      • MedleyDB [7]: 122 songs in total, 60 of them vocal songs; genres: Singer/Songwriter, Classical, Rock, Folk, Pop, Musical Theatre
      • Diagram labels: Training Data (RWC 100 songs, pitch-shifted RWC, MedleyDB 60 songs), Test Data, MCDNN, Melody
      [6] Goto, Masataka, et al. "RWC Music Database: Popular, Classical and Jazz Music Databases." ISMIR. Vol. 2. 2002.
      [7] Bittner, Rachel M., et al. "MedleyDB: A Multitrack Dataset for Annotation-Intensive MIR Research." ISMIR. 2014.
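Pitch-shifting a song by n semitones scales every frequency by 2^(n/12), so the f0 annotations have to be shifted along with the audio. The sketch below shows only the label side of that step. The convention that 0.0 marks an unvoiced frame and the helper names are assumptions, and the audio itself would need a separate pitch-shifting tool that is not shown here.

```python
# Semitone shifts listed on the slide.
SHIFTS = (-2, -1, 1, 2)


def shift_ratio(semitones):
    """Frequency ratio for a shift of the given number of semitones."""
    return 2.0 ** (semitones / 12.0)


def shift_f0_track(f0_hz, semitones):
    """Shift voiced f0 labels; 0.0 is assumed to mark unvoiced frames."""
    ratio = shift_ratio(semitones)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


def augmented_label_sets(f0_hz):
    """Return the original labels plus one shifted copy per entry in SHIFTS."""
    label_sets = {0: list(f0_hz)}
    for semitones in SHIFTS:
        label_sets[semitones] = shift_f0_track(f0_hz, semitones)
    return label_sets
```

Each song yields five label sets (the original plus four shifts), which matches the 100 → 500 song count on the slide.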
  17. 17. Temporal smoothing by HMM: Viterbi decoding. Diagram labels: res_1, res_2, res_4
  18. 18. Temporal smoothing by HMM: Viterbi decoding
      • Diagram labels: Bayes' theorem, Viterbi decoding, transition, prior, posterior [3]
      [3] Ellis, Daniel P. W., and Graham E. Poliner. "Classification-based melody transcription." Machine Learning 65.2 (2006): 439-456.
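The slide names a posterior, a prior, a transition model, Bayes' theorem, and Viterbi decoding without giving formulas. One common reading, assumed here, is that each frame's class posterior is divided by the class prior to obtain a scaled likelihood, and the Viterbi algorithm then finds the best state path under the transition probabilities. The function name and argument layout below are hypothetical, and the numbers in any call would be illustrative rather than the authors' values.

```python
import math


def viterbi_smooth(posteriors, priors, transitions):
    """Most likely state path given per-frame class posteriors.

    posteriors:  list of frames, each a list of P(state | frame)
    priors:      list of P(state)
    transitions: transitions[i][j] = P(state j at t | state i at t-1)
    """
    if not posteriors:
        return []
    eps = 1e-12  # floor to avoid log(0)
    n_states = len(priors)

    def log(x):
        return math.log(max(x, eps))

    def emission(t, j):
        # Bayes' theorem: likelihood is proportional to posterior / prior.
        return log(posteriors[t][j]) - log(priors[j])

    score = [log(priors[j]) + emission(0, j) for j in range(n_states)]
    backpointers = []
    for t in range(1, len(posteriors)):
        new_score, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states),
                         key=lambda i: score[i] + log(transitions[i][j]))
            pointers.append(best_i)
            new_score.append(score[best_i] + log(transitions[best_i][j])
                             + emission(t, j))
        score = new_score
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With "sticky" transitions (high self-transition probability), a single frame whose posterior weakly favors another state is smoothed away, while sustained strong evidence still changes the decoded state.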
  19. 19. Singing voice detection (SVD): Energy-based approach
      • Diagram labels: HMM, SVD
      • Spectral energy in the 200 to 1800 Hz band, where the singing voice level is high
      • The summed energy is normalized by the median energy in the band
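A minimal sketch of an energy-based detector along these lines: sum the spectral power between 200 and 1800 Hz for each frame, divide by the median of those sums, and compare against a threshold. Taking the median over the frames of the track, the threshold value, and the function names are all assumptions, since the slide does not specify them.

```python
from statistics import median


def band_energy(power_frame, bin_hz, lo_hz=200.0, hi_hz=1800.0):
    """Sum the power of bins whose center frequency lies in [lo_hz, hi_hz]."""
    return sum(p for k, p in enumerate(power_frame)
               if lo_hz <= k * bin_hz <= hi_hz)


def detect_voice(power_frames, bin_hz, threshold=1.0):
    """Flag frames whose median-normalized band energy exceeds `threshold`.

    The threshold is a hypothetical tuning parameter, not a value from the slides.
    """
    energies = [band_energy(frame, bin_hz) for frame in power_frames]
    med = median(energies) if energies else 0.0
    if med <= 0.0:
        return [False] * len(energies)
    return [energy / med > threshold for energy in energies]
```

Normalizing by the median makes the decision relative to the typical level of the recording, so the same threshold can be applied to loud and quiet tracks.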
  20. 20. Results: Classification accuracy with the multi-frame spectrogram. Figure: classification accuracy on the validation set
  21. 21. Results: Classification accuracy with data augmentation. Training sets compared: (1) RWC (100 songs); (2) RWC (100 songs) + pitch-shifted RWC (100 x 4 songs); (3) RWC (100 songs) + pitch-shifted RWC (400 songs) + MedleyDB (60 songs)
  22. 22. Results: Temporal smoothing by HMM, performance of smoothing
  23. 23. Results: A case example of melody extraction on an opera song (ADC2004), SCDNN (res=4)
  24. 24. Results: A case example of melody extraction on an opera song (ADC2004), SCDNN (res=4) + data augmentation
  25. 25. Results: A case example of melody extraction on an opera song (ADC2004), 1-2-4 MCDNN + data augmentation
  26. 26. Evaluation: Test Datasets
      • ADC2004 [8]: 12 songs; Rock, R&B, Pop, Jazz, Opera
      • MIREX05 [9]: 25 songs in total; 13 songs (9 vocal songs + 4 instrumental songs)
      • MIR-1k [10]: 1000 vocal songs
      • Diagram labels: Training Data, Test Data, MCDNN, Melody
      [8, 9] http://labrosa.ee.columbia.edu/projects/melody
      [10] https://sites.google.com/site/unvoicedsoundseparation/mir-1k
  27. 27. Evaluation: Single-column vs. multi-column DNN, raw pitch accuracy and raw chroma accuracy
      • Test data sets: Case 1, all songs (including songs without vocals); Case 2, songs with singing voice
      • The voiced frames are assumed to be perfectly detected
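For readers unfamiliar with the two metrics, the sketch below computes raw pitch accuracy (RPA) and raw chroma accuracy (RCA) over the frames that are voiced in the reference, which is consistent with the slide's assumption of perfect voicing detection. The 50-cent tolerance is the value commonly used in melody extraction evaluations and is assumed here, as the slide does not state it. The convention that 0 marks an unvoiced frame and the function names are also assumptions.

```python
import math


def cents(f_est, f_ref):
    """Signed pitch difference between two frequencies, in cents."""
    return 1200.0 * math.log2(f_est / f_ref)


def raw_pitch_and_chroma_accuracy(est_hz, ref_hz, tol_cents=50.0):
    """Return (RPA, RCA) over frames that are voiced in the reference (ref > 0)."""
    voiced = [(e, r) for e, r in zip(est_hz, ref_hz) if r > 0]
    if not voiced:
        return 0.0, 0.0
    pitch_hits = 0
    chroma_hits = 0
    for e, r in voiced:
        if e <= 0:
            continue  # no pitch estimate for this frame: counts as a miss
        diff = cents(e, r)
        if abs(diff) <= tol_cents:
            pitch_hits += 1
        # Fold the difference into one octave so octave errors are forgiven.
        folded = diff - 1200.0 * round(diff / 1200.0)
        if abs(folded) <= tol_cents:
            chroma_hits += 1
    return pitch_hits / len(voiced), chroma_hits / len(voiced)
```

An estimate that is exactly one octave off counts as a miss for RPA but as a hit for RCA, which is why RCA is never lower than RPA.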
  28. 28. Evaluation: Comparison to state-of-the-art methods on ADC2004
  29. 29. Evaluation: Comparison to state-of-the-art methods on MIREX05
  30. 30. Conclusion
      • Summary: multi-frame spectrogram; data augmentation; multi-column DNN; HMM-based smoothing
      • Limitations and future work: works only for singing voice melody; singing voice detection; replace the HMM with an RNN
  31. 31. Limitations and future work: works only for singing voice melody; singing voice detection. Figure: multi-column DNN (1-2-4) on songs with vocal and on songs with vocal & non-vocal
  33. 33. Thank you keums@kaist.ac.kr
  34. 34. Appendix: Pre-processing
      • Resample to 8 kHz
      • Merge the stereo channels into mono
      • STFT: FFT size 1024 (1 bin = 7.81 Hz); window size 1024 (Hann); hop size 80 (1 frame = 10 ms); magnitude compressed on a log scale; 256 bins used (0 to 2000 Hz, the vocal range)
      • Multi-frame: 11 spectrogram frames per example
      • Diagram label: voice frame
      [3] Ellis, Daniel P. W., and Graham E. Poliner. "Classification-based melody transcription." Machine Learning 65.2 (2006): 439-456.
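The figures on this slide can be checked with a few lines of arithmetic, and the multi-frame input can be sketched as a simple stacking step. In the sketch below, the choice to center the 11-frame window on the current frame and to repeat edge frames as padding is an assumption, as are the constant and function names. The STFT itself is not shown.

```python
SAMPLE_RATE = 8000   # Hz, after resampling
FFT_SIZE = 1024
HOP_SIZE = 80        # samples
NUM_BINS = 256       # lowest bins kept
CONTEXT = 11         # frames per example

bin_hz = SAMPLE_RATE / FFT_SIZE             # width of one frequency bin in Hz
frame_ms = 1000.0 * HOP_SIZE / SAMPLE_RATE  # time step between frames in ms
max_hz = NUM_BINS * bin_hz                  # upper edge of the kept band in Hz


def stack_frames(spectrogram, context=CONTEXT):
    """Build one multi-frame example per time step.

    The window is centered on the current frame and edge frames are
    repeated as padding; both choices are assumptions.
    """
    half = context // 2
    n_frames = len(spectrogram)
    examples = []
    for t in range(n_frames):
        window = []
        for offset in range(-half, half + 1):
            idx = min(max(t + offset, 0), n_frames - 1)
            window.extend(spectrogram[idx])
        examples.append(window)
    return examples
```

Under these assumptions each example has 11 x 256 = 2816 values, and `bin_hz`, `frame_ms`, and `max_hz` agree with the 7.81 Hz, 10 ms, and 2000 Hz figures on the slide.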
  35. 35. Appendix: Comparison to state-of-the-art methods on MIR-1k
  36. 36. Appendix: MIREX 2016 results
  37. 37. Appendix: Multi-frame spectrogram (figure)