Predominant fundamental frequency estimation versus singing voice separation for the automatic transcription of accompanied flamenco singing

Emilia Gómez1, Francisco Cañadas2, Justin Salamon1, Jordi Bonada1, Pedro Vera2, Pablo Cabañas2

1 Music Technology Group, Universitat Pompeu Fabra
2 Universidad de Jaén





emilia.gomez@upf.edu	
  
To future ISMIR organizers	

 Minimizing the “banquet/last day” effect:	


 ‣     Schedule the best paper presentation	

 ‣     Convert it to a poster session	

 ‣     Invite a great keynote speaker	

 ‣     ...	





This talk & ISMIR 2012


 ‣     Musical cultures	

 ‣     Music transcription (Benetos et al.)	

 ‣     Predominant f0 estimation (Salamon et al.) 	

 ‣     Onset detection (Böck et al.)	

 ‣     NMF (Boulanger-Lewandowski et al., Kirchhoff et al.), Singing voice
       separation (Sprechmann et al.)

 ‣     Ground truth & evaluation (Peeters & Fort; Urbano et al.)

 ‣     Flamenco (Pikrakis et al.)

 ‣     Singing (Devaney et al., Proutskova et al., Lagrange et al., Ross et al., Koduri
       et al.)




Flamenco singing	

‣     Music tradition from Andalusia, south of Spain.	

‣     Singing tradition (Gamboa, 2005): cante. 	

‣     Accompanying instruments: 	

       ‣  Flamenco guitar: toque.	

       ‣  Other instruments: claps (palmas), rhythmic
          feet (zapateado), percussion (cajón)	





Music material	


‣    Previous work on a cappella singing (Mora et al.
     2012, Gómez and Bonada 2012)


‣    Focus on accompanied styles:
     Fandangos, 4 variants (Valverde,
     Almonaster, Calañas, Valiente-Alosno,
     Valiente-Huelva)	





Arcángel





                           http://www.youtube.com/watch?v=p2hTeDJblBs
Flamenco singing transcription	


 ‣     Tedious	

 ‣     No standard methodology	

 ‣     ‘Computer-assisted’
       transcription	

 ‣     Note-level	





                                    Donnier (2011)	

Automatic singing transcription	



Challenges	


  ‣     General: singing voice	

  ‣     Specific: 	

         ‣  Polyphonic material	

         ‣  Ornamentation, melisma	

         ‣  Recording conditions (e.g. reverb, noise)

         ‣  Voice quality

         ‣  Tuning

Audio example: Fandango (Cojo de Málaga), 1921





Approach	



 ‣     System based on previous work (Bonada et al. 2010), used in
       online castings for TV shows.



                 Singing voice f0 estimation → Note transcription





(1) Separation-based approach (UJA)	



Singing voice separation	


    ‣     A mixture spectrogram X is factorized into three
          different spectrograms:	

           ‣  Percussive (Xp): smoothness in f, sparseness in t	

           ‣  Harmonic (Xh): sparseness in f, smoothness in t	

           ‣  Vocal (Xv): sparseness in f, sparseness in t	

    ‣     Our NMF proposal does not use any clustering
          process to discriminate bases





(1) Separation-based approach (UJA)	



Singing voice separation	


    ‣     Stages:	

           1.  Segmentation: manual labelling.	

           2.  Training: learn percussive and harmonic basis vectors
               from instrumental regions, using an unsupervised NMF
               percussive/harmonic separation approach.	

           3.  Separation: Xv is extracted from the vocal regions by
               keeping the percussive and harmonic basis vectors
               fixed from the previous stage. 	
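The training/separation split in stages 2–3 can be sketched with plain multiplicative-update NMF. This is an illustrative reconstruction, not the authors' code: it uses an unconstrained Euclidean cost and omits the smoothness/sparseness penalties described on the previous slide; `separate_vocals` and all parameter values are hypothetical.

```python
import numpy as np

def separate_vocals(X, W_fixed, n_vocal=4, n_iter=500, eps=1e-9):
    """Sketch of the fixed-bases separation stage: X ≈ [W_fixed | W_v] H.
    The accompaniment bases W_fixed (learned on instrumental regions)
    stay frozen; only the vocal bases W_v and the activations H are
    updated, with plain Euclidean multiplicative updates."""
    n_bins, n_frames = X.shape
    k_f = W_fixed.shape[1]
    rng = np.random.default_rng(0)
    W_v = rng.random((n_bins, n_vocal)) + eps
    H = rng.random((k_f + n_vocal, n_frames)) + eps
    for _ in range(n_iter):
        W = np.hstack([W_fixed, W_v])
        # standard multiplicative update for all activations
        H *= (W.T @ X) / (W.T @ (W @ H) + eps)
        W = np.hstack([W_fixed, W_v])
        # update only the vocal columns of W; W_fixed stays frozen
        W_v *= (X @ H[k_f:].T) / ((W @ H) @ H[k_f:].T + eps)
    X_vocal = W_v @ H[k_f:]
    X_accomp = W_fixed @ H[:k_f]
    return X_vocal, X_accomp
```

Freezing the accompaniment bases is what makes the decomposition identifiable without any clustering step: whatever the frozen bases cannot explain is pushed into the free vocal bases.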





(1) Separation-based approach (UJA)	



Monophonic f0 estimation	

   ‣     Cumulative mean normalized difference function (de Cheveigné and
         Kawahara, 2002).	

          ‣  Indicates the cost of having a period equal to τ at time frame t	

          ‣  f0 sequence: lowest-cost path. Dynamic programming	

          ‣  Step-by-step along time. Continuous and smooth f0 contour	
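A minimal sketch of the cumulative mean normalized difference function and a simple period decision (the absolute-threshold rule from the YIN paper; the dynamic-programming path search of the actual system is omitted). Function names and parameters are illustrative.

```python
import numpy as np

def cmndf(frame, tau_max):
    """Cumulative mean normalized difference function d'(tau)
    (de Cheveigné & Kawahara, 2002)."""
    d = np.zeros(tau_max)
    for tau in range(1, tau_max):
        diff = frame[:len(frame) - tau] - frame[tau:]
        d[tau] = np.dot(diff, diff)
    dprime = np.ones(tau_max)          # d'(0) = 1 by definition
    cum = np.cumsum(d[1:])
    dprime[1:] = d[1:] * np.arange(1, tau_max) / np.maximum(cum, 1e-12)
    return dprime

def f0_estimate(frame, sr, f_min=80.0, f_max=1000.0, threshold=0.1):
    """Pick the smallest period whose d' dips below the threshold,
    then slide to the local minimum (no DP smoothing here)."""
    tau_max = int(sr / f_min)
    tau_min = int(sr / f_max)
    dp = cmndf(frame, tau_max)
    below = np.flatnonzero(dp[tau_min:] < threshold)
    if below.size:
        tau = tau_min + int(below[0])
        while tau + 1 < tau_max and dp[tau + 1] < dp[tau]:
            tau += 1
    else:
        tau = tau_min + int(np.argmin(dp[tau_min:]))
    return sr / tau
```

Taking the smallest sub-threshold period (rather than the global minimum) is what avoids octave-low errors; the frame-by-frame DP path search then enforces the continuous, smooth f0 contour mentioned above.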





(2) Predominant f0 estimation (MTG)	



‣       More details (Salamon et al. @ ISMIR)	

‣       Default parameters (MTG)	

‣       Per-excerpt adapted parameters
        (MTGAdaptedParam):	

          ‣    Minimum and maximum frequency
               threshold	

          ‣    Strictness of the voicing filter	

[Plot: estimated f0 contour over the mix spectrogram, Fandango de Valverde (Raya)]





Approach	





                 Singing voice f0 estimation → Note transcription





Note segmentation	



 ‣     Tuning frequency estimation: 	

        1.  Histogram of f0 deviations, 1 cent resolution	

        2.  Give more weight to stable frames (low f0 derivative)	

        3.  Use a bell-shaped window to assign f0 values to histogram
            bins

        4.  The maximum of the histogram (bmax) determines the
            estimated tuning frequency fref = 440·2^(bmax/1200)
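Steps 1–4 above can be sketched as follows, under simplifying assumptions: no stability weighting (step 2) and rectangular rather than bell-shaped bins (step 3). The function name is hypothetical.

```python
import numpy as np

def tuning_frequency(f0_hz, resolution=1):
    """Estimate the tuning frequency from a histogram of f0
    cent-deviations relative to 440 Hz (1-cent bins by default)."""
    cents = 1200 * np.log2(np.asarray(f0_hz, float) / 440.0)
    # deviation from the nearest equal-tempered semitone, in cents
    dev = ((cents + 50) % 100) - 50
    edges = np.arange(-50, 51, resolution)
    hist, edges = np.histogram(dev, bins=edges)
    b_max = edges[np.argmax(hist)] + resolution / 2   # bin centre
    return 440.0 * 2 ** (b_max / 1200.0)
```

For example, f0 values drawn from a scale tuned to 445 Hz all deviate by about +19.6 cents from the 440 Hz grid, so the histogram peaks there and fref comes back close to 445 Hz.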





Note segmentation

 ‣     Short note transcription: dynamic programming (DP) algorithm.

 ‣     The likelihood of a candidate note np_i combines four criteria: duration (Ld),
       pitch (Lc), existence of voiced and unvoiced frames (Lv), and low-level
       features related to stability (Ls):

           L(np_i) = Ld(np_i) · Lc(np_i) · Lv(np_i) · Ls(np_i)

        ‣  Duration: Ld is small for very short and very long durations

        ‣  Pitch: Lc is higher the closer the frame f0 values are to the note
           nominal pitch cp_i, giving more relevance to frames with low f0
           derivative

        ‣  Voicing: Lv makes segments with a high percentage of unvoiced
           frames unlikely to be a voiced note, and vice versa

        ‣  Stability: Ls considers that a voiced note is unlikely to have fast
           and significant timbre or energy changes in the middle (not in
           contradiction with the smooth vowel changes characteristic of
           flamenco singing)

[Figure: DP trellis over frame index k and note pitch index j; node (k, j) connects back to frames k−dmax … k−dmin]
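The DP search over candidate notes can be sketched as a toy segmenter. This is not the paper's implementation: it works with log-likelihood sums instead of the product, uses simplified stand-ins for the duration and pitch terms, and omits the voicing and stability terms entirely.

```python
import numpy as np

def segment_notes(f0_cents, d_min=3, d_max=12):
    """Toy DP: best[k] = max over j of best[j] + logL(segment j..k),
    with segment length constrained to [d_min, d_max] frames."""
    f0_cents = np.asarray(f0_cents, float)
    n = len(f0_cents)

    def log_like(j, k):
        seg = f0_cents[j:k]
        dur = k - j
        # duration term: penalize very short and very long notes
        ld = -abs(dur - (d_min + d_max) / 2) / d_max
        # pitch term: frames far from the segment pitch are unlikely
        lc = -np.mean(np.abs(seg - np.median(seg))) / 100
        return ld + lc

    best = np.full(n + 1, -np.inf)
    best[0] = 0.0
    back = np.zeros(n + 1, dtype=int)
    for k in range(1, n + 1):
        for j in range(max(0, k - d_max), k - d_min + 1):
            if best[j] == -np.inf:
                continue
            score = best[j] + log_like(j, k)
            if score > best[k]:
                best[k] = score
                back[k] = j
    bounds = [n]                     # backtrack note boundaries
    while bounds[-1] > 0:
        bounds.append(int(back[bounds[-1]]))
    return bounds[::-1]
```

On a contour with two flat pitches the optimal path places the note boundary exactly at the pitch change, because any segment crossing it pays the pitch-deviation penalty.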
Note transcription	




 ‣     Iterative note transcription:	

         1.  Note consolidation: merge consecutive notes with the same
             pitch and a soft transition in terms of energy and timbre
             (stability below a threshold)

         2.  Tuning frequency refinement: consider note pitch values,
             giving higher weight to longer and louder notes	

         3.  Note pitch refinement.	
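Step 1 (note consolidation) can be sketched as a single pass over the note list. The stability score here is a hypothetical stand-in for the energy/timbre transition criterion, and the tuple layout is illustrative.

```python
def consolidate(notes, stability_threshold=0.1):
    """Merge consecutive notes with the same pitch when the transition
    is soft (stability score below the threshold).  A note is a tuple
    (onset, offset, midi_pitch, transition_stability)."""
    merged = [list(notes[0])]
    for onset, offset, pitch, stab in notes[1:]:
        if pitch == merged[-1][2] and stab < stability_threshold:
            merged[-1][1] = offset      # extend the previous note
        else:
            merged.append([onset, offset, pitch, stab])
    return [tuple(n) for n in merged]
```
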





Evaluation strategy	


 ‣     Music material: 	

        ‣  30 excerpts, mean duration = 53.48 s, 2392 notes

        ‣  Variety of singers, recording conditions.	

 ‣     Ground truth (big problem!):	

        ‣  All perceptible notes (including ornamentations)	

        ‣  Equal-tempered chromatic scale	

        ‣  Discussion of working examples with flamenco experts	

        ‣  Annotations by a single subject	

 ‣     Evaluation measures (another big problem!) proposed by MIREX
       (Audio Melody Extraction task, on a frame basis, comparing
       quantized pitch values)	
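The frame-based measures can be sketched as follows. This is a simplified reading of the MIREX Audio Melody Extraction measures, not their exact definitions: here pitch is only scored on frames the estimator marks as voiced, whereas MIREX scores pitch independently of the voicing decision.

```python
import numpy as np

def melody_metrics(ref_voiced, ref_cents, est_voiced, est_cents, tol=50):
    """Frame-based melody measures: raw pitch accuracy over
    reference-voiced frames, voicing false alarm rate over
    reference-unvoiced frames, and overall accuracy."""
    ref_voiced = np.asarray(ref_voiced, bool)
    est_voiced = np.asarray(est_voiced, bool)
    ref_cents = np.asarray(ref_cents, float)
    est_cents = np.asarray(est_cents, float)
    # a frame is pitch-correct if both mark it voiced and the
    # (quantized) pitches agree within the tolerance in cents
    correct = ref_voiced & est_voiced & (np.abs(est_cents - ref_cents) <= tol)
    raw_pitch = correct[ref_voiced].mean()
    false_alarm = (est_voiced & ~ref_voiced).sum() / max((~ref_voiced).sum(), 1)
    overall = (correct | (~est_voiced & ~ref_voiced)).mean()
    return raw_pitch, false_alarm, overall
```
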



Results	


‣      Satisfying results for both strategies.	

‣      Good guitar timbre estimation in our
       separation-based approach, but it
       requires manual segmentation.

‣      Predominant f0 estimation (MTG)
       yields slightly higher accuracy and is
       fully automatic.

‣      Best results when adapting parameters
       (84.68% overall accuracy, 77.92%
       pitch accuracy)

‣      Voicing false alarm rate around 10%:
       the guitar is detected as melody.

‣      Better results than for a cappella
       singing; no tuning errors.




Qualitative error analysis	


 ‣     Limitations: 	

        ‣  F0 estimation:	

            ‣  Highly accompanied sections: voicing, fifth/octave
               errors

        ‣  Note segmentation & labelling:

            ‣  Highly ornamented sections

        ‣  Overall agreement





Case study	


 ‣     Fandango de Valverde, Raya	





Conclusions	


 ‣     Adaptive algorithms according to repertoire & use-
       case

 ‣     Limitations & challenges:

        ‣  F0 estimation: voicing

        ‣  Note transcription: onset detection, pitch labelling.

 ‣     Accurate enough for higher-level analyses: similarity,
       style classification, motive analysis
       (COFLA: COmputational analysis of FLAmenco music,
       http://mtg.upf.edu/research/projects/cofla)


                           Thanks!	

