
[Tutorial] Computational Approaches to Melodic Analysis of Indian Art Music

Hands-on tutorial given at IISc Bengaluru, 2016.



  1. 1. Computational Approaches to Melodic Analysis of Indian Art Music. Indian Institute of Science, Bengaluru, India, 2016. Sankalp Gulati, Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain.
  2. 2. Tonic Melody Intonation Raga Motifs Similarity Melodic description
  3. 3. Tonic Identification
  4. 4. Tonic Identification
     (Figures: spectrogram of an excerpt, time in s vs. frequency in Hz, and a multi-pitch histogram, frequency in bins of 10 cents with reference 55 Hz, vs. normalized salience, with peaks f2–f6 and the tonic marked.)
     Cues exploited by signal-processing and learning approaches:
     • Tanpura / drone background sound
     • Extent of gamakas on the Sa and Pa svaras
     • Vadi and samvadi svaras of the rāga
     Accuracy: ~90%
     References:
     S. Gulati, A. Bellur, J. Salamon, H. G. Ranjani, V. Ishwar, H. A. Murthy, and X. Serra. Automatic tonic identification in Indian art music: approaches and evaluation. Journal of New Music Research, 43(1):55–73, 2014.
     Salamon, J., Gulati, S., & Serra, X. (2012). A multipitch approach to tonic identification in Indian classical music. In Proc. of Int. Conf. on Music Information Retrieval (ISMIR), pp. 499–504, Porto, Portugal.
     Bellur, A., Ishwar, V., Serra, X., & Murthy, H. (2012). A knowledge-based signal processing approach to tonic identification in Indian classical music. In Proc. of the 2nd CompMusic Workshop, pp. 113–118, Istanbul, Turkey.
     Ranjani, H. G., Arthi, S., & Sreenivas, T. V. (2011). Carnatic music analysis: Shadja, swara identification and raga verification in Alapana using stochastic models. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 29–32, New Paltz, NY.
  5. 5. Tonic Identification: Multipitch Approach
     • Audio example
     • Utilizing the drone sound
     • Multi-pitch analysis (vocals and drone)
     Reference: J. Salamon, E. Gómez, and J. Bonada. Sinusoid extraction and salience function design for predominant melody estimation. In Proc. 14th Int. Conf. on Digital Audio Effects (DAFx-11), pp. 73–80, Paris, France, Sep. 2011.
  6. 6. Tonic Identification: Block Diagram
     Audio → Sinusoid extraction (STFT → spectral peak picking → frequency/amplitude correction) → sinusoids
     → Salience function computation (bin salience mapping → harmonic summation) → time-frequency salience
     → Tonic candidate generation (salience peak picking → multi-pitch histogram → histogram peak picking) → tonic candidates
  7. 7. Tonic Identification: Signal Processing
     • STFT
       – Hop size: 11 ms
       – Window length: 46 ms
       – Window type: Hamming
       – FFT size: 8192 points
  8. 8. Tonic Identification: Signal Processing
     • Spectral peak picking
       – Absolute threshold: −60 dB
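     A minimal sketch of these first two steps (slides 7 and 8) using Essentia's standard-mode Python bindings. It assumes 44.1 kHz mono audio, so 46 ms ≈ 2048 samples and 11 ms ≈ 485 samples; the file name is hypothetical and the exact threshold mapping is approximate:

     import essentia.standard as estd

     audio = estd.MonoLoader(filename='recording.mp3', sampleRate=44100)()  # hypothetical file

     # Slide 7: STFT with a 46 ms Hamming window, 11 ms hop, zero-padded to 8192 points
     frame_size, hop_size, fft_size = 2048, 485, 8192
     window = estd.Windowing(type='hamming', zeroPadding=fft_size - frame_size)
     spectrum = estd.Spectrum(size=fft_size)

     # Slide 8: keep only spectral peaks above an absolute threshold of about -60 dB
     peaks = estd.SpectralPeaks(magnitudeThreshold=0.001,  # ~10**(-60/20) in linear magnitude
                                maxFrequency=7200, sampleRate=44100)

     for frame in estd.FrameGenerator(audio, frameSize=frame_size, hopSize=hop_size):
         freqs, mags = peaks(spectrum(window(frame)))
         # freqs/mags feed the salience function described on the next slides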
  9. 9. Tonic Identification: Signal Processing
     • Frequency/amplitude correction
       – Parabolic interpolation
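     The parabolic correction is a standard three-point fit around a detected peak bin. A small NumPy sketch (a generic illustration, not the tutorial's code) of how a peak at bin k is refined:

     import numpy as np

     def interpolate_peak(mag_db, k):
         """Refine the position and amplitude of a spectral peak at bin k by fitting
         a parabola through the (dB) magnitudes at bins k-1, k, k+1."""
         a, b, c = mag_db[k - 1], mag_db[k], mag_db[k + 1]
         offset = 0.5 * (a - c) / (a - 2 * b + c)   # fractional bin offset in (-0.5, 0.5)
         peak_bin = k + offset                      # corrected frequency location (bins)
         peak_mag = b - 0.25 * (a - c) * offset     # corrected magnitude (dB)
         return peak_bin, peak_mag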
  10. 10. Tonic Identification: Signal Processing
     • Harmonic summation (bin salience mapping)
       – Spectrum considered: 55–7200 Hz
       – Frequency range: 55–1760 Hz
       – Base frequency: 55 Hz
       – Bin resolution: 10 cents per bin (120 bins per octave)
       – Number of octaves: 5
       – Maximum number of harmonics: 20
       – Squared-cosine weighting window across 50 cents
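     A rough NumPy sketch of the harmonic summation with the parameters on this slide (bin mapping B(f) = 120·log2(f/55), 600 bins over 5 octaves, up to 20 harmonics, squared-cosine weighting over ±50 cents). The per-harmonic decay factor alpha is not stated on the slide; 0.8, the value used in Salamon & Gómez (2012), is assumed here:

     import numpy as np

     N_BINS, BIN_CENTS, F_REF = 600, 10, 55.0     # 5 octaves, 10 cents per bin, reference 55 Hz

     def salience_function(peak_freqs, peak_mags, n_harmonics=20, alpha=0.8):
         """Spread every spectral peak over the bins of its candidate fundamentals."""
         salience = np.zeros(N_BINS)
         for f, a in zip(peak_freqs, peak_mags):
             for h in range(1, n_harmonics + 1):
                 f0 = f / h                                    # candidate fundamental
                 if not (55.0 <= f0 <= 1760.0):
                     continue
                 bin_f0 = 1200.0 * np.log2(f0 / F_REF) / BIN_CENTS
                 lo, hi = int(np.ceil(bin_f0 - 5)), int(np.floor(bin_f0 + 5))
                 for b in range(max(lo, 0), min(hi, N_BINS - 1) + 1):
                     d = abs(b - bin_f0) / 5.0                 # distance within the ±50 cent window
                     w = np.cos(d * np.pi / 2) ** 2            # squared-cosine weight
                     salience[b] += w * (alpha ** (h - 1)) * a
         return salience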
  11. 11. Tonic Identification: Signal Processing
     • Tonic candidate generation (multi-pitch histogram)
       – Number of salience peaks per frame: 5
       – Frequency range: 110–550 Hz
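     The whole chain up to this point (per-frame salience peaks → multi-pitch histogram → histogram peaks) is packaged in Essentia as the TonicIndianArtMusic algorithm. A minimal sketch of calling it, with a hypothetical file name and default parameters, which largely correspond to the values on slides 7–11:

     import essentia.standard as estd

     audio = estd.MonoLoader(filename='recording.mp3', sampleRate=44100)()
     tonic_hz = estd.TonicIndianArtMusic()(audio)   # single tonic estimate for the piece
     print('Estimated tonic: %.2f Hz' % tonic_hz)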
  12. 12. Tonic Identification: Feature Extraction
     • Identifying the tonic in the correct octave using the multi-pitch histogram
     • Classification-based template learning
     • The class of an instance is the rank of the tonic peak
     (Figure: multi-pitch histogram, frequency in bins of 10 cents with reference 55 Hz, vs. normalized salience, with peaks f2–f5 marked.)
  13. 13. Tonic Identification: Classification
     • Decision tree on the histogram peak features f2, f3 and f5 (splits such as f2 ≤ 5 / > 5, f3 ≤ −7 / > −7, f5 ≤ −6 / > −6), whose leaves give the rank (1st to 5th) of the tonic peak; the highest peak corresponds to either Sa or Pa (figure: tree over frequency and salience features).
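     A sketch of how such a tree could be trained with scikit-learn. The feature matrix and labels here are hypothetical: each row holds the distances (in semitones) of the 2nd–5th highest multi-pitch histogram peaks from the highest one, and the label is the rank of the peak that matches the annotated tonic. This is an illustration of the classification step, not the tutorial's own code:

     import numpy as np
     from sklearn.tree import DecisionTreeClassifier

     # X: one row per recording, columns e.g. [f2, f3, f4, f5] = semitone distances of the
     #    2nd-5th highest histogram peaks from the highest peak (hypothetical precomputed file).
     # y: rank (1-5) of the peak corresponding to the annotated tonic.
     X = np.load('tonic_features.npy')
     y = np.load('tonic_peak_ranks.npy')

     clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
     predicted_rank = clf.predict(X[:1])    # which histogram peak is the tonic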
  14. 14. Tonic Identification: Results. S. Gulati, A. Bellur, J. Salamon, H. G. Ranjani, V. Ishwar, H. A. Murthy, and X. Serra. Automatic tonic identification in Indian art music: approaches and evaluation. Journal of New Music Research, 43(1):55–73, 2014.
  15. 15. Predominant Pitch Estimation
  16. 16. Pitch Estimation Algorithms
     • Time-domain approaches
       – ACF-based (Rabiner, 1977)
       – AMDF-based: YIN (de Cheveigné & Kawahara, 2002)
     • Frequency-domain approaches
       – Two-way mismatch (Maher & Beauchamp, 1994)
       – Subharmonic summation (Hermes, 1988)
     • Multi-pitch approaches
       – Source-separation based (Klapuri, 2003)
       – Harmonic summation: Melodia (Salamon & Gómez, 2012)
     References:
     Rabiner, L. (1977). On the use of autocorrelation analysis for pitch detection. IEEE Transactions on Acoustics, Speech, and Signal Processing, 25(1), 24–33.
     De Cheveigné, A., & Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4), 1917–1930.
     Medan, Y., & Yair, E. (1991). Super resolution pitch determination of speech signals. IEEE Transactions on Signal Processing, 39(1), 40–48.
     Maher, R., & Beauchamp, J. W. (1994). Fundamental frequency estimation of musical signals using a two-way mismatch procedure. The Journal of the Acoustical Society of America, 95(4), 2254–2263.
     Hermes, D. (1988). Measurement of pitch by subharmonic summation. The Journal of the Acoustical Society of America, 83, 257–264.
     Klapuri, A. (2003). Multiple fundamental frequency estimation based on harmonicity and spectral smoothness. IEEE Transactions on Speech and Audio Processing, 11(6), 804–816.
     Salamon, J., & Gómez, E. (2012). Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6), 1759–1770.
  17. 17. Pitch Estimation Algorithms (content repeated from slide 16)
  18. 18. Predominant Pitch Estimation: YIN
     Key quantities from de Cheveigné & Kawahara (2002), shown on the slide as plots of the signal, its autocorrelation, the difference function and the cumulative mean normalized difference function (all against lag in samples):
     Autocorrelation: $r_t(\tau) = \sum_{j=t+1}^{t+W} x_j x_{j+\tau}$ (Eq. 1), or with an integration window that shrinks with the lag, $r'_t(\tau) = \sum_{j=t+1}^{t+W-\tau} x_j x_{j+\tau}$ (Eq. 2), whose envelope tapers to zero at larger $\tau$. The ACF compares the signal with its shifted self and is related to the AMDF, cepstrum and "spectral whitening" methods; the choice of window and of $\tau_{\max}$ biases the estimator towards too-high or too-low F0 errors.
     Difference function: a signal with period $T$ satisfies $\sum_{j=t+1}^{t+W} (x_j - x_{j+T})^2 = 0$ (Eq. 5), so an unknown period can be found by searching for the dips of $d_t(\tau) = \sum_{j=1}^{W} (x_j - x_{j+\tau})^2$ (Eq. 6), which is zero at all multiples of the period.
     Cumulative mean normalized difference function: $d'_t(\tau) = 1$ for $\tau = 0$ and $d_t(\tau) \big/ \big[\tfrac{1}{\tau}\sum_{j=1}^{\tau} d_t(j)\big]$ otherwise; it starts at 1 and remains high until the dip at the period.
     Gross error rates for the cumulated steps of YIN (25 ms integration window, one-sample hop, 40–800 Hz search range, threshold 0.1): step 1: 10.0%, step 2: 1.95%, step 3: 1.69%, step 4: 0.78%, step 5: 0.77%, step 6: 0.50%.
     De Cheveigné, A., & Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4), 1917–1930.
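     These two functions are short to write down. A compact NumPy sketch of the difference function and its cumulative mean normalized version, using the slide's 40–800 Hz search range and 0.1 threshold; this is a didactic re-implementation of the core steps only, with the later refinement steps of full YIN omitted:

     import numpy as np

     def yin_frame_f0(x, fs, f_min=40.0, f_max=800.0, threshold=0.1):
         """Core YIN steps for one frame x: difference function, cumulative mean
         normalization, absolute threshold (parabolic refinement omitted).
         x must be longer than fs/f_min samples so the integration window W is positive."""
         tau_min, tau_max = int(fs / f_max), int(fs / f_min)
         W = len(x) - tau_max                           # integration window, Eq. (6)
         d = np.array([np.sum((x[:W] - x[tau:tau + W]) ** 2) for tau in range(tau_max + 1)])
         d_norm = np.ones_like(d)                       # cumulative mean normalized difference
         d_norm[1:] = d[1:] * np.arange(1, tau_max + 1) / (np.cumsum(d[1:]) + 1e-12)
         below = np.where(d_norm[tau_min:] < threshold)[0]
         tau = tau_min + (below[0] if below.size else np.argmin(d_norm[tau_min:]))
         return fs / tau                                # F0 in Hz

     # e.g. f0 = yin_frame_f0(audio_frame, 44100.0)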
  19. 19. Predominant Pitch Estimation: YIN
  20. 20. Predominant Pitch Estimation: Melodia Salamon, J., & Gómez, E. (2012, August). Melody Extraction From Polyphonic Music Signals Using Pitch Contour Characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20 (6), 1759–1770.
  21. 21. Predominant Pitch Estimation: Melodia
  22. 22. Predominant Pitch Estimation: Melodia: audio → spectrogram → spectral peaks
  23. 23. Predominant Pitch Estimation: Melodia: spectral peaks → time-frequency salience
  24. 24. Predominant Pitch Estimation: Melodia: time-frequency salience → salience peaks → contours
  25. 25. Predominant Pitch Estimation: Melodia: contours → predominant melody contours
  26. 26. Essentia implementation of Melodia
  27. 27. Essentia implementation of Melodia
  28. 28. Essentia implementation of Melodia
  29. 29. Essentia implementation of Melodia
  30. 30. Essentia implementation of Melodia
  31. 31. Essentia implementation of Melodia
  32. 32. Essentia implementation of Melodia
  33. 33. Essentia implementation of Melodia
  34. 34. Essentia implementation of Melodia Audio Spectrogram
  35. 35. Essentia implementation of Melodia
  36. 36. Essentia implementation of Melodia Spectral peaks Spectrogram
  37. 37. Essentia implementation of Melodia
  38. 38. Essentia implementation of Melodia Time-frequency salience Spectral peaks
  39. 39. Essentia implementation of Melodia
  40. 40. Essentia implementation of Melodia Salience peaks Time-frequency salience
  41. 41. Essentia implementation of Melodia
  42. 42. Essentia implementation of Melodia All contours Salience peaks
  43. 43. Essentia implementation of Melodia
  44. 44. Essentia implementation of Melodia Predominant melody contours All contours
  45. 45. Essentia implementation of Melodia
  46. 46. Essentia implementation of Melodia
  47. 47. Essentia implementation of Melodia
  48. 48. Essentia implementation of Melodia
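     The code shown on slides 26–48 is not captured in this transcript. A minimal sketch of what the Essentia call looks like from Python; in recent Essentia versions the algorithm is exposed as PredominantPitchMelodia (PredominantMelody in older releases), and the file name here is hypothetical:

     import numpy as np
     import essentia.standard as estd

     audio = estd.MonoLoader(filename='recording.mp3', sampleRate=44100)()
     eq_audio = estd.EqualLoudness()(audio)            # Melodia expects equal-loudness filtered input

     melodia = estd.PredominantPitchMelodia(frameSize=2048, hopSize=128)
     pitch_hz, pitch_confidence = melodia(eq_audio)    # one F0 value (0 = unvoiced) per hop

     time_s = np.arange(len(pitch_hz)) * 128 / 44100.0 # time axis for plotting the contour

     The step-by-step build-up on these slides (spectrogram → spectral peaks → time-frequency salience → salience peaks → contours → predominant melody) corresponds to intermediate algorithms Essentia also exposes individually: SpectralPeaks, PitchSalienceFunction, PitchSalienceFunctionPeaks, PitchContours and PitchContoursMelody.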
  49. 49. Predominant Pitch Estimation: Melodia
  50. 50. What about loudness and timbre?
  51. 51. What about loudness and timbre?
  52. 52. Loudness features in Essentia
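     The list shown on this slide is not captured in the transcript. Among the loudness-related algorithms Essentia ships are Loudness (Stevens' power law), LoudnessVickers and RMS. A tiny frame-wise sketch with a hypothetical input file:

     import essentia.standard as estd

     audio = estd.MonoLoader(filename='recording.mp3', sampleRate=44100)()
     loudness = estd.Loudness()          # Stevens' power law loudness of a frame
     frame_loudness = [loudness(frame) for frame in
                       estd.FrameGenerator(audio, frameSize=2048, hopSize=512)]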
  53. 53. Loudness of predominant voice
  54. 54. Loudness of predominant voice (figure: time-frequency plane)
  55. 55. Loudness of predominant voice (figure: time-frequency plane)
  56. 56. Loudness of predominant voice (figure: time-frequency plane with the F0 track)
  57. 57. Loudness of predominant voice (figure: time-frequency plane with the F0 track)
  58. 58. Loudness of predominant voice (figure: time-frequency plane with the F0 track)
  59. 59. Loudness of predominant voice (figure: time-frequency plane with the F0 track)
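     The idea sketched on slides 53–59 is to measure loudness only at the harmonics of the predominant F0 rather than over the full spectrum. A simplified NumPy sketch, a rough interpretation of the figures rather than the tutorial's code: for each frame, sample the magnitude spectrum at multiples of that frame's F0 and sum their energy.

     import numpy as np

     def harmonic_loudness_db(mag_spectrum, f0_hz, fs, n_harmonics=20):
         """Sum spectral energy at the first n_harmonics multiples of f0, in dB.
         mag_spectrum: linear magnitude spectrum of one frame (length n_fft//2 + 1)."""
         if f0_hz <= 0:                                  # unvoiced frame
             return -120.0
         n_bins = len(mag_spectrum)
         bin_hz = (fs / 2.0) / (n_bins - 1)
         energy = 0.0
         for h in range(1, n_harmonics + 1):
             k = int(round(h * f0_hz / bin_hz))          # nearest bin to the h-th harmonic
             if k >= n_bins:
                 break
             energy += mag_spectrum[k] ** 2
         return 10.0 * np.log10(energy + 1e-12)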
  60. 60. Loudness of predominant voice: example
  61. 61. Spectral centroid of predominant voice
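     Analogously, the spectral centroid can be restricted to the predominant voice by averaging only the harmonic frequencies, weighted by their magnitudes. A sketch reusing the same harmonic sampling as above (again an interpretation, not the tutorial's code):

     import numpy as np

     def harmonic_spectral_centroid(mag_spectrum, f0_hz, fs, n_harmonics=20):
         """Magnitude-weighted mean frequency (Hz) of the harmonics of f0."""
         if f0_hz <= 0:
             return 0.0
         n_bins = len(mag_spectrum)
         bin_hz = (fs / 2.0) / (n_bins - 1)
         freqs, mags = [], []
         for h in range(1, n_harmonics + 1):
             k = int(round(h * f0_hz / bin_hz))
             if k >= n_bins:
                 break
             freqs.append(k * bin_hz)
             mags.append(mag_spectrum[k])
         mags = np.asarray(mags)
         return float(np.sum(np.asarray(freqs) * mags) / (np.sum(mags) + 1e-12))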
  62. 62. CompMusic: Dunya
  63. 63. CompMusic: Dunya API Internet
  64. 64. CompMusic: Dunya Web
  65. 65. CompMusic: Dunya API https://github.com/MTG/pycompmusic
  66. 66. Dunya API Examples
     • Metadata
     • Features
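     A hedged sketch of what such calls look like with the pycompmusic client linked on slide 65. An API token from the Dunya website is required, and the function names follow the pycompmusic repository but should be checked against its current documentation:

     from compmusic import dunya

     dunya.set_token('YOUR_DUNYA_API_TOKEN')          # token obtained from the Dunya website

     # Metadata: list Carnatic recordings and fetch details of one of them
     recordings = dunya.carnatic.get_recordings()
     info = dunya.carnatic.get_recording(recordings[0]['mbid'])
     print(info['title'])

     # Audio / features: download the audio for a recording by its MusicBrainz ID
     dunya.carnatic.download_mp3(recordings[0]['mbid'], '/tmp')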
