Gomezetal ismir2012

Predominant fundamental frequency estimation
versus singing voice separation
for the automatic transcription
of accompanied flamenco singing

Emilia Gómez1, Francisco Cañadas2, Justin Salamon1, Jordi Bonada1,
Pedro Vera2, Pablo Cabañas2

1 Music Technology Group, Universitat Pompeu Fabra

2 Universidad de Jaén

emilia.gomez@upf.edu

To future ISMIR organizers

Minimizing the “banquet/last day” effect:

‣  Schedule the best paper presentation

‣  Convert it to a poster session

‣  Invite a great keynote speaker

‣  ...


This talk @ ISMIR 2012

‣  Musical cultures

‣  Music transcription (Benetos et al.)

‣  Predominant f0 estimation (Salamon et al.)

‣  Onset detection (Böck et al.)

‣  NMF (Boulanger-Lewandowski et al., Kirchhoff et al.), Singing voice
separation (Sprechmann et al.)

‣  Ground truth evaluation (Peeters and Fort; Urbano et al.)

‣  Flamenco (Pikrakis et al.)

‣  Singing (Devaney et al., Proutskova et al., Lagrange et al., Ross et al., Koduri
et al.)



Flamenco singing

‣  Music tradition from Andalusia, south of Spain.

‣  Singing tradition (Gamboa, 2005): cante.

‣  Accompanying instruments:

‣  Flamenco guitar: toque.

‣  Other instruments: claps (palmas), rhythmic
feet (zapateado), percussion (cajón)


Music material

‣  Previous work on a cappella (Mora et al.
2012, Gómez and Bonada 2012)

‣  Focus on accompanied styles:
Fandangos, 4 variants (Valverde,
Almonaster, Calañas, Valiente-Alosno,
Valiente-Huelva)


Arcangel

http://www.youtube.com/watch?v=p2hTeDJblBs

Flamenco singing transcription

‣  Tedious

‣  No standard methodology

‣  ‘Computer-assisted’
transcription

‣  Note-level

Donnier (2011)


Automatic singing transcription

Challenges

‣  General: singing voice

‣  Specific:

‣  Polyphonic material

‣  Ornamentation, melisma

‣  Recording conditions
(e.g. reverb, noise)

Fandango (Cojo de Málaga) 1921

‣  Voice quality

‣  Tuning


Approach

‣  System based on previous work (Bonada et al., 2010), used in
online castings for TV shows.

Singing voice f0 estimation → Note transcription


Approach

Singing voice
f0 estimation

Note transcription


(1) Separation-based approach (UJA)

Singing voice separation

‣  A mixture spectrogram X is factorized into three
different spectrograms:

‣  Percussive (Xp): smoothness in f, sparseness in t

‣  Harmonic (Xh): sparseness in f, smoothness in t

‣  Vocal (Xv): sparseness in f, sparseness in t

‣  Our NMF proposal does not use any clustering
process to discriminate between basis vectors



Singing voice separation

‣  Stages:

1.  Segmentation: manual labelling.

2.  Training: learn percussive and harmonic basis vectors
from instrumental regions, using an unsupervised NMF
percussive/harmonic separation approach.

3.  Separation: Xv is extracted from the vocal regions by
keeping the percussive and harmonic basis vectors
fixed from the previous stage.
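
The separation stage above can be sketched as an NMF in which the instrumental basis vectors stay fixed and only the vocal basis vectors and all activations are updated. This is a minimal illustration, not the UJA implementation: it assumes a Euclidean cost with standard multiplicative updates, omits the smoothness and sparseness constraints listed earlier, and the names `separate_vocal`, `w_fixed`, and `n_vocal` are hypothetical.

```python
import random

EPS = 1e-9  # avoids division by zero in the multiplicative updates


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def scale(m, num, den):
    """Element-wise multiplicative update: m * num / den."""
    return [[v * n / (d + EPS) for v, n, d in zip(mr, nr, dr)]
            for mr, nr, dr in zip(m, num, den)]


def separate_vocal(x, w_fixed, n_vocal=1, n_iter=200, seed=0):
    """Fit x ~= [w_fixed | w_vocal] @ h, updating only w_vocal and h.

    x is a non-negative magnitude spectrogram (bins x frames) and w_fixed
    holds the percussive/harmonic bases learned from instrumental regions.
    Returns (vocal spectrogram estimate, full reconstruction).
    """
    rng = random.Random(seed)
    n_bins, n_frames, n_fixed = len(x), len(x[0]), len(w_fixed[0])
    w_vocal = [[rng.random() + 0.1 for _ in range(n_vocal)] for _ in range(n_bins)]
    h = [[rng.random() + 0.1 for _ in range(n_frames)]
         for _ in range(n_fixed + n_vocal)]
    for _ in range(n_iter):
        w = [wf + wv for wf, wv in zip(w_fixed, w_vocal)]  # concatenate columns
        wt = transpose(w)
        # Update all activations (instrumental and vocal).
        h = scale(h, matmul(wt, x), matmul(matmul(wt, w), h))
        # Update the vocal bases only; w_fixed is never modified.
        hv_t = transpose(h[n_fixed:])
        w_vocal = scale(w_vocal, matmul(x, hv_t), matmul(matmul(w, h), hv_t))
    w = [wf + wv for wf, wv in zip(w_fixed, w_vocal)]
    return matmul(w_vocal, h[n_fixed:]), matmul(w, h)
```

In practice a numerical library would replace the list-based helpers; plain Python is used here only to keep the sketch dependency-free.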



Monophonic f0 estimation

‣  Cumulative mean normalized difference function (de Cheveigné and
Kawahara, 2002).

‣  Indicates the cost of having a period equal to τ at time frame t

‣  f0 sequence: lowest-cost path. Dynamic programming

‣  Step-by-step along time. Continuous and smooth f0 contour
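
As a rough illustration of the two ideas above, the sketch below computes the cumulative mean normalized difference function for each frame and then picks the lowest-cost lag sequence with a simple dynamic programme. The transition penalty (absolute log-ratio of consecutive lags) and all function and parameter names are assumptions made for this example, not the penalty or interface used in the system presented here.

```python
import math


def cmndf(frame, max_lag):
    """Cumulative mean normalized difference d'(tau) for tau = 0..max_lag."""
    n = len(frame) - max_lag
    diff = [0.0] * (max_lag + 1)
    for tau in range(1, max_lag + 1):
        diff[tau] = sum((frame[i] - frame[i + tau]) ** 2 for i in range(n))
    out = [1.0] * (max_lag + 1)  # d'(0) is defined as 1
    running = 0.0
    for tau in range(1, max_lag + 1):
        running += diff[tau]
        out[tau] = diff[tau] * tau / running if running > 0 else 1.0
    return out


def lowest_cost_path(costs, lags, jump_penalty=1.0):
    """Return the index sequence minimizing frame costs plus jump penalties."""
    n = len(lags)
    total = list(costs[0])
    backpointers = []
    for frame_cost in costs[1:]:
        new_total, pointers = [], []
        for j in range(n):
            def arrival(i):
                # Cost of reaching lag j from lag i at the previous frame.
                return total[i] + jump_penalty * abs(math.log2(lags[j] / lags[i]))
            best = min(range(n), key=arrival)
            pointers.append(best)
            new_total.append(arrival(best) + frame_cost[j])
        total = new_total
        backpointers.append(pointers)
    state = min(range(n), key=total.__getitem__)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def track_f0(signal, sample_rate, frame_len, hop, min_lag, max_lag):
    """Frame-wise f0 in Hz from the lowest-cost lag path."""
    lags = list(range(min_lag, max_lag + 1))
    starts = range(0, len(signal) - frame_len + 1, hop)
    costs = [cmndf(signal[s:s + frame_len], max_lag)[min_lag:] for s in starts]
    return [sample_rate / lags[i] for i in lowest_cost_path(costs, lags)]
```

The penalty on lag jumps is what produces a continuous, smooth contour: an isolated low-cost lag far from its neighbours is rejected in favour of a slightly costlier lag that stays on the path.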


(2) Predominant f0 estimation (MTG)


‣  More details (Salamon et al. @ ISMIR)

‣  Default parameters (MTG)

‣  Per-excerpt adapted parameters
(MTGAdaptedParam):

‣  Minimum and maximum frequency
threshold

‣  Strictness of the voicing filter

[Audio example: Fandango de Valverde (Raya); original mix and extracted f0]

Approach

Note transcription

Singing voice
f0 estimation


Note segmentation

‣  Tuning frequency estimation:

1.  Histogram of f0 deviations, 1 cent resolution

2.  Give more weight to stable frames (low f0 derivative)

3.  Use a bell-shape window to assign f0 values to histogram
bins

4.  The maximum of the histogram ($b_{max}$) determines the
estimated tuning frequency $f_{ref} = 440 \cdot 2^{b_{max}/1200}$
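
The four steps above can be sketched as follows. The stability weight (inverse of the frame-to-frame change in cents) and the Gaussian bell window with its width are assumptions chosen for illustration; the slides do not specify them. Input frames are assumed to be voiced (f0 > 0).

```python
import math


def estimate_tuning(f0_hz, window_std=3.0, window_half_width=9):
    """Estimate the tuning reference (Hz) from voiced f0 values in Hz."""
    hist = [0.0] * 100  # bin index b + 50 holds deviation b cents, b in [-50, 50)
    cents = [1200.0 * math.log2(f / 440.0) for f in f0_hz]
    for k, c in enumerate(cents):
        previous = cents[k - 1] if k > 0 else c
        # Step 2: stable frames (small f0 derivative) get more weight.
        weight = 1.0 / (1.0 + abs(c - previous))
        # Step 1: deviation from the nearest equal-tempered semitone.
        deviation = (c + 50.0) % 100.0 - 50.0
        center = round(deviation)
        # Step 3: spread the frame over neighbouring bins with a bell window.
        for offset in range(-window_half_width, window_half_width + 1):
            b = center + offset
            bell = math.exp(-0.5 * ((b - deviation) / window_std) ** 2)
            hist[(b + 50) % 100] += weight * bell  # the histogram is circular
    # Step 4: the histogram maximum gives the tuning deviation in cents.
    b_max = max(range(100), key=hist.__getitem__) - 50
    return 440.0 * 2.0 ** (b_max / 1200.0)
```

For example, a melody sung consistently 20 cents sharp yields a reference of 440 · 2^(20/1200) Hz, roughly 445.1 Hz.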


Note segmentation

‣  Short note transcription: Dynamic programming (DP) algorithm.

$L(n_{p_i}) = L_d(n_{p_i}) \cdot L_c(n_{p_i}) \cdot L_v(n_{p_i}) \cdot L_s(n_{p_i})$    (8)

‣  Duration: small for short and long durations

‣  Pitch: more weight to frames with low f0 derivative

‣  Voicing: according to the % of voiced frames

‣  Stability: a voiced note should be more or less stable in timbre and energy

[Figure: DP lattice; horizontal axis: frame index (k-dmax, k-dmin, k); vertical axis: note pitch index (node k, j)]

Paper excerpt shown on the slide:

[...] following criteria: duration ($L_d$), pitch ($L_c$), existence of voiced and unvoiced frames ($L_v$), and low-level features related to stability ($L_s$).

Duration likelihood $L_d$ is set so that it is small for short and long durations. Pitch likelihood $L_c$ is defined so that it is higher the closer the frame f0 values are to the note nominal pitch $c_{p_i}$, giving more relevance to frames with low f0 derivative values. The voicing likelihood $L_v$ is defined so that segments with a high percentage of unvoiced frames are unlikely to be a voiced note, while segments with a high percentage of voiced frames are unlikely to be an unvoiced note. Finally, the stability likelihood $L_s$ considers that a voiced note is unlikely to have fast and significant timbre or energy changes in the middle. Note that this is not in contradiction with smooth vowel changes, characteristic of flamenco singing.
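
To make the segmentation idea concrete, here is a simplified segmental dynamic programme that maximizes the product of note likelihoods (as a sum of logs). It is only a sketch under strong assumptions: just the duration and pitch terms are modelled, with hypothetical log-normal and Gaussian shapes, while the voicing and stability terms are taken as 1. All names and default values are illustrative.

```python
import math


def log_duration(d, preferred=6.0, spread=0.6):
    """Log duration likelihood: small for very short and very long notes."""
    return -0.5 * (math.log(d / preferred) / spread) ** 2


def log_pitch(frames, pitch, std=0.5):
    """Log pitch likelihood: higher when frame f0 is close to the note pitch."""
    return sum(-0.5 * ((f - pitch) / std) ** 2 for f in frames)


def segment_notes(f0_semitones, d_min=2, d_max=20):
    """Split a voiced f0 contour (in semitones) into (start, end, pitch) notes."""
    n = len(f0_semitones)
    best = [-math.inf] * (n + 1)  # best[k]: best log-likelihood up to frame k
    best[0] = 0.0
    back = [None] * (n + 1)
    for k in range(1, n + 1):
        for d in range(d_min, min(d_max, k) + 1):
            start = k - d
            if best[start] == -math.inf:
                continue
            frames = f0_semitones[start:k]
            # For a Gaussian pitch term, the best integer pitch is the one
            # nearest to the mean, so candidate pitches need not be enumerated.
            pitch = round(sum(frames) / d)
            score = best[start] + log_duration(d) + log_pitch(frames, pitch)
            if score > best[k]:
                best[k], back[k] = score, (start, pitch)
    if back[n] is None:
        raise ValueError("contour cannot be tiled with the allowed durations")
    notes, k = [], n
    while k > 0:
        start, pitch = back[k]
        notes.append((start, k, pitch))
        k = start
    return notes[::-1]
```

Each node of the lattice looks back between `d_min` and `d_max` frames, which corresponds to the k-dmax to k-dmin range in the figure.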

Note transcription

‣  Iterative note transcription:

1.  Note consolidation: consecutive notes with same pitch and
soft transition in terms of energy and timbre (stability
below a threshold)

2.  Tuning frequency refinement: consider note pitch values,
giving higher weight to longer and louder notes

3.  Note pitch refinement.
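
The consolidation step can be illustrated with a short sketch. It assumes notes are `(onset, offset, pitch)` tuples and that a hypothetical `boundary_change` list holds one energy/timbre change measure per boundary between consecutive notes; how that measure is computed, and the threshold value, are not given in the slides.

```python
def consolidate(notes, boundary_change, threshold=0.2):
    """Merge consecutive same-pitch notes joined by a soft transition.

    boundary_change[i] describes the boundary between notes[i] and notes[i + 1];
    values below the threshold are treated as soft transitions.
    """
    if not notes:
        return []
    merged = [notes[0]]
    for note, change in zip(notes[1:], boundary_change):
        onset, offset, pitch = note
        prev_onset, prev_offset, prev_pitch = merged[-1]
        adjacent = onset == prev_offset
        if adjacent and pitch == prev_pitch and change < threshold:
            merged[-1] = (prev_onset, offset, pitch)  # extend the previous note
        else:
            merged.append(note)
    return merged
```

Because the step is applied iteratively together with tuning and pitch refinement, notes whose pitch label changes after refinement may become mergeable in a later pass.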


Evaluation strategy

‣  Music material:

‣  30 excerpts, mean duration 53.48 seconds, 2392 notes

‣  Variety of singers, recording conditions.

‣  Ground truth (big problem!):

‣  All perceptible notes (including ornamentations)

‣  Equal-tempered chromatic scale

‣  Discussion of working examples with flamenco experts

‣  Annotations by a single subject

‣  Evaluation measures (another big problem!) proposed by MIREX
(Audio Melody Extraction task, on a frame basis, comparing
quantized pitch values)
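
A frame-based comparison of quantized pitch values can be sketched as below. It assumes both sequences are aligned frame by frame, with a MIDI note number for voiced frames and `None` for unvoiced ones. It is a simplification of the MIREX measures: in particular, an estimate labelled unvoiced always counts as a pitch error here. The function and key names are hypothetical.

```python
def melody_metrics(reference, estimate):
    """Frame-based voicing and pitch measures on quantized pitch values."""
    pairs = list(zip(reference, estimate))
    ref_voiced = sum(r is not None for r, _ in pairs)
    ref_unvoiced = len(pairs) - ref_voiced
    # Frames voiced in both sequences.
    hits = sum(r is not None and e is not None for r, e in pairs)
    # Frames the estimate marks as voiced although the reference is unvoiced.
    false_alarms = sum(r is None and e is not None for r, e in pairs)
    # Voiced reference frames with the same quantized pitch in the estimate.
    pitch_ok = sum(r is not None and e == r for r, e in pairs)
    both_unvoiced = sum(r is None and e is None for r, e in pairs)

    def ratio(count, total):
        return count / total if total else 0.0

    return {
        "voicing_recall": ratio(hits, ref_voiced),
        "voicing_false_alarm": ratio(false_alarms, ref_unvoiced),
        "pitch_accuracy": ratio(pitch_ok, ref_voiced),
        "overall_accuracy": ratio(pitch_ok + both_unvoiced, len(pairs)),
    }
```

A guitar note picked up during a vocal pause raises the voicing false alarm rate without affecting pitch accuracy, which is why the two are reported separately.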


Results

‣  Satisfactory results for both strategies.

‣  Good guitar timbre estimation in our
separation-based approach, which
requires manual segmentation.

‣  Predominant f0 estimation (MTG)
yields slightly higher accuracy and is fully
automatic.

‣  Best results adapting parameters
(84.68% overall accuracy, 77.92% pitch accuracy)

‣  Voicing false alarm rate around 10%:
the guitar is detected as melody.

‣  Better results than for a cappella
singing, no tuning errors.


Qualitative error analysis

‣  Limitations:

‣  F0 estimation:

‣  Highly accompanied sections: voicing, fifth/octave
errors

‣  Note segmentation labelling

‣  Highly ornamented sections

‣  Overall agreement:


Case study

‣  Fandango de Valverde, Raya


Case study


Conclusions

‣  Adaptive algorithms according to repertoire and use case

‣  Limitations and challenges:

‣  F0 estimation: voicing

‣  Note transcription: onset detection, pitch labelling.

‣  Accurate enough for higher level analyses: similarity,
style classification, motive analysis
(COFLA, COmputation FLAmenco:
http://mtg.upf.edu/research/projects/cofla)

Thanks!

