SINGING VOICE MELODY TRANSCRIPTION USING DEEP NEURAL
NETWORKS
François Rigaud and Mathieu Radenen
Audionamix R&D
171 quai de Valmy, 75010 Paris, France
<firstname>.<lastname>@audionamix.com
ABSTRACT
This paper presents a system for the transcription of
singing voice melodies in polyphonic music signals based
on Deep Neural Network (DNN) models. In particular, a
new DNN system is introduced for performing the f0 es-
timation of the melody, and another DNN, inspired from
recent studies, is learned for segmenting vocal sequences.
Preparation of the data and learning configurations related
to the specificity of both tasks are described. The perfor-
mance of the melody f0 estimation system is compared
with a state-of-the-art method and exhibits higher accuracy through better generalization on two different music databases. Insights into the global functioning of this DNN
are proposed. Finally, an evaluation of the global system
combining the two DNNs for singing voice melody tran-
scription is presented.
1. INTRODUCTION
The automatic transcription of the main melody from poly-
phonic music signals is a major task of Music Information
Retrieval (MIR) research [19]. Indeed, besides applica-
tions to musicological analysis or music practice, the use
of the main melody as prior information has been shown
useful in various types of higher-level tasks such as music
genre classification [20], music retrieval [21], music de-
soloing [4, 18] or lyrics alignment [15, 23]. From a sig-
nal processing perspective, the main melody can be repre-
sented by sequences of fundamental frequency (f0) defined
on voicing instants, i.e. on portions where the instrument
producing the melody is active. Hence, main melody tran-
scription algorithms usually follow two main processing
steps. First, a representation emphasizing the most likely
f0s over time is computed, e.g. under the form of a salience
matrix [19], a vocal source activation matrix [4] or an en-
hanced spectrogram [22]. Second, a binary classification
of the selected f0s between melodic and background con-
tent is performed using melodic contour detection/tracking
and voicing detection.
© François Rigaud and Mathieu Radenen. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: François Rigaud and Mathieu Radenen. “Singing Voice Melody Transcription using Deep Neural Networks”, 17th International Society for Music Information Retrieval Conference, 2016.
In this paper we propose to tackle the melody transcrip-
tion task as a supervised classification problem where each
time frame of signal has to be assigned to a pitch class
when a melody is present and an ‘unvoiced’ class when it is
not. Such an approach has been proposed in [5], where melody transcription is performed by applying Support Vector Machines to input features composed of Short-Time Fourier Transforms (STFT). Similarly, for noisy speech signals,
f0 estimation algorithms based on Deep Neural Networks
(DNN) have been introduced in [9,12].
Following such fully data driven approaches we intro-
duce a singing voice melody transcription system com-
posed of two DNN models respectively used to perform
the f0 estimation task and the Voice Activity Detection
(VAD) task. The main contribution of this paper is to
present a DNN architecture able to discriminate the differ-
ent f0s from low-level features, namely spectrogram data.
Compared to a well-known state-of-the-art method [19],
it shows significant improvements in terms of f0 accu-
racy through an increase of robustness with regard to mu-
sical genre and a reduction of octave-related errors. By
analyzing the weights of the network, the DNN is shown to be somehow equivalent to a simple harmonic-sum method for which the parameters usually set empirically are here automatically learned from the data, and where the succession of non-linear layers likely increases the discrimination power between harmonically related f0s. For the task of VAD, another DNN model, inspired by [13], is learned. For
both models, special care is taken to prevent over-fitting
issues by using different databases and perturbing the data
with audio degradations. Performance of the whole system
is finally evaluated and shows promising results.
The rest of the paper is organized as follows. Section 2
presents an overview of the whole system. Sections 3 and
4 introduce the DNN models and detail the learning con-
figurations respectively for the VAD and the f0 estimation
task. Then, Section 5 presents an evaluation of the system
and Section 6 concludes the study.
2. SYSTEM OVERVIEW
2.1 Global architecture
The proposed system, displayed on Figure 1, is composed
of two independent parallel DNN blocks that perform re-
spectively the f0 melody estimation and the VAD.
Figure 1: Architecture of the proposed system for singing
voice melody transcription.
In contrast with [9, 12], which propose a single DNN model to perform both tasks, we did not find such a unified functional architecture able to successfully discriminate a time frame between quantized f0 and ‘unvoiced’ classes. In-
deed the models presented in these studies are designed
for speech signals mixed with background noise for which
the discrimination between a frame of noise and a frame
of speech is very likely related to the presence or absence
of a pitched structure, which is also probably the kind of
information on which the system relies to estimate the f0.
Conversely, with music signals both the melody and the
accompaniment exhibit harmonic structures and the voic-
ing discrimination usually requires different levels of in-
formation, e.g. under the form of timbral features such as
Mel-Frequency Cepstral Coefficients.
Another characteristic of the proposed system is the par-
allel architecture that allows considering different types of
input data for the two DNNs and which arises from the
application restricted to vocal melodies. Indeed, unlike
generic systems dealing with main melody transcription of
different instruments (often within a same piece of music)
which usually process the f0 estimation and the voicing de-
tection sequentially, the focus on singing voice here hardly
allows for a voicing detection relying only on the distri-
bution and statistics of the candidate pitch contours and/or
their energy [2, 19]. Thus, this constraint requires building
a specific VAD system that should learn to discriminate
the timbre of a vocal melody from an instrumental melody,
such as for example played by a saxophone.
2.2 Signal decomposition
As shown on Figure 1, both DNN models are preceded by
a signal decomposition. At the input of the global system,
audio signals are first converted to mono and re-sampled to
16 kHz. Then, following [13], it is proposed to provide the
DNNs with a set of pre-decomposed signals obtained by
applying a double-stage Harmonic/Percussive Source Sep-
aration (HPSS) [6,22] on the input mixture signal. The key
idea behind double-stage HPSS is to consider that within a
mix, melodic signals are usually less stable/stationary than
the background ‘harmonic’ instruments (such as a bass or
a piano), but more than the percussive instruments (such
as the drums). Thus, depending on the frequency resolution used to compute the STFT, applying a harmonic/percussive decomposition to a mixture spectrogram leads to a rough separation where the melody is mainly extracted either in the harmonic or in the percussive content.
Using such pre-processing, 4 different signals are ob-
tained. First, the input signal s is decomposed into the sum
of h1 and p1 using a high-frequency resolution STFT (typ-
ically with a window of about 300 ms) where p1 should
mainly contain the melody and the drums, and h1 the
remaining stable instrument signals. Second, p1 is fur-
ther decomposed into the sum of h2 and p2 using a low-
frequency resolution STFT (typically with a window of
about 30 ms), where h2 mainly contains the melody, and
p2 the drums. As presented later in Sections 3 and 4, different subsets of these 4 signals will be used to experimentally determine optimal DNN models.
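For illustration, a minimal sketch of such a double-stage HPSS front-end is given below. It assumes a standard median-filtering HPSS as provided by librosa (the paper does not specify the implementation used), 16 kHz audio, and window lengths approximating the ~300 ms and ~30 ms resolutions mentioned above; the function name is ours.

```python
import librosa

def double_stage_hpss(s, sr=16000):
    """Return (h1, p1, h2, p2) waveforms such that s ~= h1 + h2 + p2 (approximately)."""
    # Stage 1: long analysis window (4096 samples ~ 256 ms at 16 kHz):
    # the melody and the drums end up mostly in the percussive part p1.
    S1 = librosa.stft(s, n_fft=4096, hop_length=1024)
    H1, P1 = librosa.decompose.hpss(S1)
    h1 = librosa.istft(H1, hop_length=1024, length=len(s))
    p1 = librosa.istft(P1, hop_length=1024, length=len(s))

    # Stage 2: short analysis window (512 samples ~ 32 ms), applied to p1:
    # the melody ends up mostly in h2, the drums in p2.
    S2 = librosa.stft(p1, n_fft=512, hop_length=128)
    H2, P2 = librosa.decompose.hpss(S2)
    h2 = librosa.istft(H2, hop_length=128, length=len(s))
    p2 = librosa.istft(P2, hop_length=128, length=len(s))
    return h1, p1, h2, p2
```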
2.3 Learning data
Several annotated databases composed of polyphonic mu-
sic with transcribed melodies are used for building the
train, validation and test datasets used for the learning (cf.
Sections 3 and 4) and the evaluation (cf. Section 5) of the
DNNs. In particular, a subset of RWC Popular Music and
Royalty Free Music [7] and MIR-1K [10] databases are
used for the train dataset, and the recent databases Med-
leyDB [1] and iKala [3] are split between train, validation
and test datasets. Note that for iKala the vocal and instru-
mental tracks are mixed with a relative gain of 0 dB.
Also, in order to minimize over-fitting issues and to in-
crease the robustness of the system with respect to audio
equalization and encoding degradations, we use the Audio
Degradation Toolbox [14]. Thus, several files composing
the train and validation datasets (50% for the VAD task and
25% for the f0 estimation task) are duplicated with one de-
graded version, the degradation type being randomly chosen amongst those available that preserve the alignment between the audio and the annotation (e.g. not producing time/pitch warping or overly long reverberation effects).
3. VOICE ACTIVITY DETECTION WITH DEEP
NEURAL NETWORKS
This section briefly describes the process for learning the
DNN used to perform the VAD. It is largely inspired by a previous study presented in more detail in [13]. A similar
architecture of deep recurrent neural network composed of
Bidirectional Long Short-Term Memory (BLSTM) [8] is
used. In our case the architecture is arbitrarily fixed to 3
BLSTM layers of 50 units each and a final feed-forward lo-
gistic output layer with one unit. As in [13], different combinations of the pre-decomposed signals (cf. Section
2.2) are considered to determine an optimal network: s,
p1, h2, h1p1, h2p2 and h1h2p2. For each of these pre-
decomposed signals, timbral features are computed under
the form of mel-frequency spectrograms obtained using a
STFT with 32 ms long Hamming windows and 75 % of
overlap, and 40 triangular filters distributed on a mel scale
between 0 and 8000 Hz. Then, each feature of the input data is normalized using the mean and variance computed over the train dataset. Contrary to [13], the learning is performed in a single step, i.e. without adopting a layer-by-layer training.

Figure 2: VAD network illustration (three BLSTM layers of 50 units each and one logistic output unit, shown with an example input feature sequence and output voicing probability over time).
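As an illustration, the feature extraction just described could be sketched as follows, assuming librosa (the authors' code is not provided); the helper names and the stacking of the per-signal 40-band mel spectrograms into a single feature matrix are our assumptions, consistent with the description above.

```python
import numpy as np
import librosa

def vad_features(signals, sr=16000):
    """signals: list of pre-decomposed waveforms (e.g. [h1, h2, p2]).
    Returns a (40 * len(signals), n_frames) feature matrix."""
    feats = []
    for sig in signals:
        mel = librosa.feature.melspectrogram(
            y=sig, sr=sr,
            n_fft=512, hop_length=128, window="hamming",  # 32 ms windows, 75% overlap
            n_mels=40, fmin=0.0, fmax=8000.0)
        feats.append(mel)
    return np.vstack(feats)

def normalize(feats, train_mean, train_std):
    # Per-feature statistics (train_mean, train_std) are computed over the train dataset.
    return (feats - train_mean[:, None]) / (train_std[:, None] + 1e-8)
```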
Finally, the best architecture is obtained for the combi-
nation of h1, h2 and p2 signals, thus for an input of size
120, which corresponds to using all the information present in the original signal (s = h1 + h2 + p2). An
illustration of this network is presented in Figure 2.
A simple post-processing of the DNN output, consisting in thresholding at 0.5, is finally applied to make the binary voicing decision for each frame.
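A hedged PyTorch sketch of this VAD network (three BLSTM layers of 50 units, one logistic output unit, followed by the 0.5 threshold) is given below; it is an illustrative re-implementation under our own assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class VadNet(nn.Module):
    def __init__(self, n_features=120, hidden=50):
        super().__init__()
        # Three stacked bidirectional LSTM layers with 50 units per direction.
        self.blstm = nn.LSTM(n_features, hidden, num_layers=3,
                             bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, 1)  # final logistic output unit

    def forward(self, x):                    # x: (batch, n_frames, n_features)
        h, _ = self.blstm(x)
        return torch.sigmoid(self.out(h)).squeeze(-1)  # per-frame voicing probability

# Binary voicing decision for a batch of feature sequences `feats`:
# voiced = VadNet()(feats) > 0.5
```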
4. F0 ESTIMATION WITH DEEP NEURAL
NETWORKS
This section presents in detail the learning configuration
for the DNN used for performing the f0 estimation task.
An interpretation of the network functioning is finally pre-
sented.
4.1 Preparation of learning data
As proposed in [5] we decide to keep low-level features
to feed the DNN model. Compared to [12] and [9] which
use as input pre-computed representations known for high-
lighting the periodicity of pitched sounds (respectively
based on an auto-correlation and a harmonic filtering), we
expect here the network to be able to learn an optimal
transformation automatically from spectrogram data. Thus
the set of selected features consists of log-spectrograms
(logarithm of the modulus of the STFT) computed from a
Hamming window of duration 64 ms (1024 samples for a
sampling frequency of 16000 Hz) with an overlap of 0.75,
and from which frequencies below 50 Hz and above 4000
Hz are discarded. For each music excerpt the correspond-
ing log-spectrogram is rescaled between 0 and 1. Since, as
described in Section 2.1, the VAD is performed by a sec-
ond independent system, all time frames for which no vo-
cal melody is present are removed from the dataset. These
features are computed independently for 3 different types
of input signal for which the melody should be more or less
emphasized: s, p1 and h2 (cf. Section 2.2).
For the output, the f0s are quantized between C#2 (f0 ≈ 69.29 Hz) and C#6 (f0 ≈ 1108.73 Hz) with a spacing of an eighth of a tone, thus leading to a total of 193 classes.
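A small sketch of this output quantization is given below (our own illustrative helpers, not the authors' code); the 48 steps per octave correspond to the eighth-of-a-tone spacing, so the 4 octaves between C#2 and C#6 yield 193 classes.

```python
import numpy as np

F0_MIN = 69.29      # C#2
N_CLASSES = 193     # 4 octaves * 48 eighth-of-a-tone steps per octave + 1

def f0_to_class(f0_hz):
    """Map an f0 in Hz to the nearest eighth-of-a-tone class index in [0, 192]."""
    idx = np.round(48.0 * np.log2(f0_hz / F0_MIN)).astype(int)
    return np.clip(idx, 0, N_CLASSES - 1)

def class_to_f0(idx):
    """Inverse mapping from class index to the class center frequency in Hz."""
    return F0_MIN * 2.0 ** (idx / 48.0)
```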
The train and validation datasets, including the audio-degraded versions, are finally composed of 22877 and 3394 melodic sequences respectively, for total durations of about 220 minutes and 29 minutes.
4.2 Training
Several experiments have been run to determine a func-
tional DNN architecture. In particular, two types of neuron
units have been considered: the standard feed-forward sig-
moid unit and the Bidirectional Long Short-Term Memory
(BLSTM) recurrent unit [8].
For each test, the weights of the network are initialized
randomly according to a Gaussian distribution with 0 mean
and a standard deviation of 0.1, and optimized to mini-
mize the cross-entropy error function. The learning is then
performed by means of a stochastic gradient descent with
shuffled mini-batches composed of 30 melodic sequences,
a learning rate of 10^−7 and a momentum of 0.9. The op-
timization is run for a maximum of 10000 epochs and an
early stopping is applied if no decrease is observed on the
validation set error during 100 consecutive epochs. In ad-
dition to the use of audio degradations during the prepa-
ration of the data for preventing over-fitting (cf. Section
2.3), the training examples are slightly corrupted during
the learning by adding Gaussian noise with variance 0.05
at each epoch.
Among the different architectures tested, the best clas-
sification performance is obtained for the input signal p1
(slightly better than for s, i.e. without pre-separation) by
a 2-hidden layer feed-forward network with 500 sigmoid
units each, and a 193 output softmax layer. An illustration
of this network is presented in Figure 3. Interestingly, for
that configuration the learning did not suffer from overfitting, so that it ended at the maximum number of epochs,
thus without early stopping.
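As an illustration, the selected architecture and training setup could be sketched in PyTorch as follows. This is not the authors' implementation: the input dimensionality (the number of retained spectral bins between 50 and 4000 Hz) is left as a placeholder, the softmax is folded into the cross-entropy loss, and the data handling is simplified to per-frame mini-batches.

```python
import torch
import torch.nn as nn

N_INPUTS = 760  # placeholder: number of retained spectral bins, depends on the exact front-end

# Two hidden feed-forward layers of 500 sigmoid units and a 193-class output layer.
model = nn.Sequential(
    nn.Linear(N_INPUTS, 500), nn.Sigmoid(),
    nn.Linear(500, 500), nn.Sigmoid(),
    nn.Linear(500, 193),  # softmax is applied implicitly by the loss below
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-7, momentum=0.9)

def train_step(x, targets):
    """One mini-batch update; x: (n_frames, N_INPUTS), targets: (n_frames,) class indices."""
    x = x + 0.05 ** 0.5 * torch.randn_like(x)  # additive Gaussian noise with variance 0.05
    optimizer.zero_grad()
    loss = criterion(model(x), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```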
While the temporal continuity of the f0 along time-
frames should provide valuable information, the use of
BLSTM recurrent layers (alone or in combination with
feed-forward sigmoid layers) did not lead to efficient systems. Further experiments should be conducted to enforce the inclusion of such temporal context in a feed-forward DNN architecture, for instance by concatenating several consecutive time frames in the input.

Figure 3: f0 estimation network illustration (two feed-forward layers of 500 logistic units each and a 193-unit softmax output layer, shown with an example input log-spectrogram and output class activations over time).
4.3 Post-processing
The output layer of the DNN composed of softmax units
returns an f0 probability distribution for each time frame
that can be seen for a full piece of music as a pitch ac-
tivation matrix. In order to take a final decision that accounts for the continuity of the f0 along melodic sequences, a Viterbi tracking is finally applied on the network output [5, 9, 12]. For that, the transition log-probability between two consecutive time frames and two f0 classes is simply, and somewhat arbitrarily, set inversely proportional to their absolute difference in semitones. For further improvement of the system, such a transition matrix could be learned from
the data [5], however this simple rule gives interesting per-
formance gains (when compared to a simple ‘maximum
picking’ post-processing without temporal context) while
potentially reducing the risk of over-fitting to a particular
music style.
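For illustration, a minimal sketch of this Viterbi decoding is given below. The transition rule (a log-probability that decreases with the distance between classes, scaled by a factor alpha) is our reading of the description above; the scaling factor and the function name are assumptions.

```python
import numpy as np

def viterbi_decode(posteriors, alpha=1.0):
    """posteriors: (n_frames, 193) softmax outputs; returns the best class index path."""
    n_frames, n_classes = posteriors.shape
    # Distance between classes in semitones (classes are spaced by a quarter of a semitone).
    semitone_dist = np.abs(np.arange(n_classes)[:, None] - np.arange(n_classes)[None, :]) / 4.0
    log_trans = -alpha * semitone_dist          # penalize large f0 jumps between frames
    log_emit = np.log(posteriors + 1e-12)

    score = log_emit[0].copy()
    back = np.zeros((n_frames, n_classes), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + log_trans       # cand[i, j]: previous class i -> current class j
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(n_classes)] + log_emit[t]

    path = np.zeros(n_frames, dtype=int)
    path[-1] = np.argmax(score)
    for t in range(n_frames - 1, 0, -1):        # backtrack the best path
        path[t - 1] = back[t, path[t]]
    return path
```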
Figure 4: Display of the weights for the two sigmoid feed-forward layers (top) and the softmax layer (bottom) of the DNN learned for the f0 estimation task.
4.4 Network weights interpretation
We propose here to have an insight into the network func-
tioning for this specific task of f0 estimation by analyzing
the weights of the DNN. The input is a short-time spec-
trum and the output corresponds to an activation vector for
which a single element (the actual f0 of the melody at that
time frame) should be predominant. In that case, it is rea-
sonable to expect that the DNN somehow behaves like a
harmonic-sum operator.
While the visualization of the distribution of the hidden-layer weights usually does not provide straightforward cues to analyse the functioning of a DNN (cf. Figure 4), we consider a simplified network for which it is assumed that each feed-forward logistic unit is working in the linear
regime. Thus, removing the non-linear operations, the output of a feed-forward layer with index $l$ composed of $N_l$ units writes

$$x_l = W_l \cdot x_{l-1} + b_l, \qquad (1)$$

where $x_l \in \mathbb{R}^{N_l}$ (resp. $x_{l-1} \in \mathbb{R}^{N_{l-1}}$) corresponds to the output vector of layer $l$ (resp. $l-1$), $W_l \in \mathbb{R}^{N_l \times N_{l-1}}$ is the weight matrix and $b_l \in \mathbb{R}^{N_l}$ the bias vector. Using this expression, the output of a layer with index $L$, expressed as the propagation of the input $x_0$ through the linear network, also writes

$$x_L = W \cdot x_0 + b, \qquad (2)$$

where $W = \prod_{l=1}^{L} W_l$ corresponds to a global weight matrix, and $b$ to a global bias that depends on the set of parameters $\{W_l, b_l, \forall l \in [1..L]\}$.
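A small numerical sketch of this linearization is given below (illustrative only): given the per-layer weight matrices and bias vectors of the trained network, the equivalent global map $(W, b)$ of Eq. (2) is obtained by propagating an identity map through the layers.

```python
import numpy as np

def linearize(weights, biases):
    """weights: [W_1, ..., W_L] with W_l of shape (N_l, N_{l-1});
    biases: [b_1, ..., b_L]; returns the global (W, b) of the linearized network."""
    W = np.eye(weights[0].shape[1])      # start from the identity on the input space
    b = np.zeros(weights[0].shape[1])
    for W_l, b_l in zip(weights, biases):
        W = W_l @ W                      # accumulate the global weight matrix
        b = W_l @ b + b_l                # accumulate the global bias
    return W, b                          # W has shape (N_L, N_0): one row per output class
```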
As mentioned above, in our case $x_0$ is a short-time spectrum and $x_L$ is an f0 activation vector. The global weight matrix should thus present some characteristics of a pitch detector. Indeed, as displayed on Figure 5a, the matrix $W$ for the learned DNN (which is thus the product of the 3 weight matrices depicted on Figure 4) exhibits a harmonic structure for most output classes of f0s; except for
some f0s in the low and high frequency range for which no or too few examples are present in the learning data.

Figure 5: Linearized DNN illustration. (a) Visualization of the (transposed) weight matrix $W$: the x-axis corresponds to the output class indices (the f0s) and the y-axis represents the input feature indices (frequency channels of the input spectrum). (b) Weights for the f0 output class with index 100.
Most approaches dealing with main melody transcription usually rely on such types of transformations to
compute a representation emphasizing f0 candidates (or
salience function) and are usually partly based on hand-
crafted designs [11, 17, 19]. Interestingly, using a fully
data driven method as proposed, parameters of a compara-
ble weighted harmonic summation algorithm (such as the
number of harmonics to consider for each note and their
respective weights) do not have to be defined. This can be
observed in more detail on Figure 5b, which depicts the linearized network weights for the class index 100 (f0 ≈ 289.43 Hz). Moreover, while this interpretation assumes a
linear network, one can expect that the non-linear opera-
tions actually present in the network help in enhancing the
discrimination between the different f0 classes.
5. EVALUATION
5.1 Experimental procedure
Two different test datasets composed of full music ex-
cerpts (i.e. vocal and non vocal portions) are used for
the evaluation. One is composed of 17 tracks from Med-
leyDB (last songs comprising vocal melodies, from Mu-
sicDelta Reggae to Wolf DieBekherte, for a total of ∼ 25.5
min of vocal portions) and the other is composed of 63
tracks from iKala (from 54223 chorus to 90587 verse for
a total of ∼ 21 min of vocal portions).
The evaluation is conducted in two steps. First the per-
formance of the f0 estimation DNN taken alone (thus with-
out voicing detection) is compared with the state-of-the-art system melodia [19] using f0 accuracy metrics. Second, the performance of our complete singing voice transcription system (VAD and f0 estimation) is evaluated on the same datasets. Since our system is restricted to the transcription of vocal melodies and, to our knowledge, all available state-of-the-art systems are designed to target the main melody, this final evaluation presents the results of our system without comparison to a reference.
For all tasks and systems, the evaluation metrics are
computed using the mir_eval library [16]. For Section 5.3, some additional metrics related to voicing detection, namely precision, f-measure and voicing accuracy, were not present in the original mir_eval code and thus were added for our experiments.
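For reference, the added voicing metrics can be computed from binary per-frame voicing decisions as sketched below; this is an assumed implementation (in particular, voicing accuracy is taken here as the fraction of frames whose voicing label is correctly predicted, which the paper does not define explicitly).

```python
import numpy as np

def voicing_metrics(ref_voicing, est_voicing):
    """ref_voicing, est_voicing: boolean per-frame voicing labels of equal length."""
    ref, est = np.asarray(ref_voicing, bool), np.asarray(est_voicing, bool)
    tp = np.sum(ref & est)                 # correctly detected voiced frames
    fp = np.sum(~ref & est)                # false alarms
    fn = np.sum(ref & ~est)                # missed voiced frames
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f_measure = 2 * precision * recall / max(precision + recall, 1e-12)
    voicing_accuracy = np.mean(ref == est) # fraction of correctly classified frames
    return precision, recall, f_measure, voicing_accuracy
```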
5.2 f0 estimation task
The performance of the DNN performing the f0 estimation task is first compared to the melodia system [19] using
the plug-in implementation with f0 search range limits set
equal to those of our system (69.29-1108.73 Hz, cf. Sec.
4.1) and with remaining parameters left to default values.
For each system and each music track the performance is
evaluated in terms of raw pitch accuracy (RPA) and raw
chroma accuracy (RCA). These metrics are computed on
vocal segments (i.e. without accounting for potential voic-
ing detection errors) for an f0 tolerance of 50 cents.
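A sketch of how these scores can be obtained with mir_eval [16] (the library named in Section 5.1) is shown below; the wrapper function is ours.

```python
import mir_eval

def rpa_rca(ref_time, ref_freq, est_time, est_freq):
    """Raw pitch and raw chroma accuracy from reference and estimated f0 tracks."""
    ref_v, ref_c, est_v, est_c = mir_eval.melody.to_cent_voicing(
        ref_time, ref_freq, est_time, est_freq)
    rpa = mir_eval.melody.raw_pitch_accuracy(ref_v, ref_c, est_v, est_c, cent_tolerance=50)
    rca = mir_eval.melody.raw_chroma_accuracy(ref_v, ref_c, est_v, est_c, cent_tolerance=50)
    return rpa, rca
```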
The results are presented on Figure 6 under the form
of a box plot where, for each metric and dataset, the ends
of the dashed vertical bars delimit the lowest and highest
scores obtained, the 3 vertical bars composing each center
box respectively correspond to the first quartile, the median
and the third quartile of the distribution, and finally the
star markers represent the mean. Both systems are char-
acterized by more widespread distributions for MedleyDB
than for iKala. This reflects the fact that MedleyDB is
more heterogeneous in musical genres and recording con-
ditions than iKala. On iKala, the DNN performs slightly
better than melodia when comparing the means. On Med-
leyDB, the gap between the two systems increases signif-
icantly. The DNN system seems much less affected by the variability of the music examples and clearly improves the mean RPA by 20% (62.13% for melodia and 82.48% for the DNN). Additionally, while exhibiting more similar distributions of RPA and RCA, the DNN tends to produce fewer octave detection errors. It should be noted that this result does not take into account the recent post-processing improvement proposed for melodia [2], yet it shows the interest of using such a DNN approach to compute an enhanced pitch salience matrix which, simply combined with a Viterbi post-processing, achieves good performance.
5.3 Singing voice transcription task
Figure 6: Comparative evaluation of the proposed DNN (in black) and melodia (in gray) on MedleyDB (left) and iKala (right) test sets for an f0 vocal melody estimation task.

The evaluation of the global system is finally performed on the same two test datasets. The results are displayed as
boxplots (cf. the description in Section 5.2) on Figures 7a and 7b, respectively for the iKala and the MedleyDB datasets. Five
metrics are computed to evaluate the voicing detection,
namely the precision (P), the recall (R), the f-measure (F),
the false alarm rate (FA) and the voicing accuracy (VA). A
sixth metric of overall accuracy (OA) is also presented for
assessing the global performance of the complete singing
voice melody transcription system.
In accordance with the previous evaluation, the results
on MedleyDB are characterized by much more variance
than on iKala. In particular, the voicing precision of the system (i.e. its ability to provide correct detections, no matter the number of missed voiced frames) is significantly degraded on MedleyDB. Conversely, the voicing recall, which evaluates the ability of the system to detect all voiced portions actually present no matter the number of false alarms, remains relatively good on MedleyDB. Com-
bining both metrics, a mean f-measure of 93.15 % and
79.19 % are respectively obtained on iKala and MedleyDB
test datasets.
Finally, the mean scores of overall accuracy obtained
for the global system are equal to 85.06 % and 75.03 %
respectively for iKala and MedleyDB databases.
6. CONCLUSION
This paper introduced a system for the transcription of
singing voice melodies composed of two DNN models. In
particular, a new system able to learn a representation emphasizing melodic lines from low-level data composed of
spectrograms has been proposed for the estimation of the
f0. For this DNN, the performance evaluation shows a rel-
atively good generalization (when compared to a reference
system) on two different test datasets and increased robustness on western music recordings that tend to be representative of current music industry productions. While
for these experiments the systems have been learned from
a relatively low amount of data, the robustness, particu-
larly for the task of VAD, could very likely be improved
by increasing the number of training examples.
Figure 7: Voicing detection and overall performance of the proposed system for the (a) iKala and (b) MedleyDB test datasets (box plots of P, R, F, FA, VA and OA).
7. REFERENCES
[1] R. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Can-
nam, and J. Bello. MedleyDB: A multitrack dataset for
annotation-intensive MIR research. In Proc. of the 15th
Int. Society for Music Information Retrieval (ISMIR)
Conference, October 2014.
[2] R. M. Bittner, J. Salamon, S. Essid, and J. P. Bello.
Melody extraction by contour classification. In Proc.
of the 16th Int. Society for Music Information Retrieval
(ISMIR) Conference, October 2015.
[3] T.-S. Chan, T.-C. Yeh, Z.-C. Fan, H.-W. Chen, L. Su,
Y.-H. Yang, and R. Jang. Vocal activity informed singing voice separation with the iKala dataset. In Proc.
of IEEE Int. Conf. on Acoustics, Speech and Signal
Processing (ICASSP), pages 718–722, April 2015.
[4] J.-L. Durrieu, G. Richard, and B. David. An iterative
approach to monaural musical mixture de-soloing. In
Proc. of IEEE Int. Conf. on Acoustics, Speech and Sig-
nal Processing (ICASSP), pages 105–108, April 2009.
[5] D. P. W. Ellis and G. E. Poliner. Classification-based
melody transcription. Machine Learning, 65(2):439–
456, 2006.
[6] D. FitzGerald and M. Gainza. Single channel vo-
cal separation using median filtering and factorisa-
tion techniques. ISAST Trans. on Electronic and Signal
Processing, 4(1):62–73, 2010.
[7] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka.
RWC Music Database: Popular, classical, and jazz music databases. In Proc. of the 3rd Int. Society for Music
Information Retrieval (ISMIR) Conference, pages 287–
288, October 2002.
[8] A. Graves, A.-R. Mohamed, and G. Hinton. Speech
recognition with deep recurrent neural networks. In
Proc. of IEEE Int. Conf. on Acoustics, Speech and
Signal Processing (ICASSP), pages 6645–6649, May
2013.
[9] K. Han and D. L. Wang. Neural network based
pitch tracking in very noisy speech. IEEE/ACM
Trans. on Audio, Speech, and Language Processing,
22(12):2158–2168, October 2014.
[10] C.-L. Hsu and J.-S. R. Jang. On the improvement of
singing voice separation for monaural recordings using
the MIR-1K dataset. IEEE Trans. on Audio, Speech, and
Language Processing, 18(2):310–319, 2010.
[11] S. Jo, S. Joo, and C. D. Yoo. Melody pitch estima-
tion based on range estimation and candidate extrac-
tion using harmonic structure model. In Proc. of IN-
TERSPEECH, pages 2902–2905, 2010.
[12] B. S. Lee and D. P. W. Ellis. Noise robust pitch tracking
by subband autocorrelation classification. In Proc. of
INTERSPEECH, 2012.
[13] S. Leglaive, R. Hennequin, and R. Badeau. Singing
voice detection with deep recurrent neural networks. In
Proc. of IEEE Int. Conf. on Acoustics, Speech and Sig-
nal Processing (ICASSP), pages 121–125, April 2015.
[14] M. Mauch and S. Ewert. The audio degradation tool-
box and its application to robustness evaluation. In
Proc. of the 14th Int. Society for Music Information Re-
trieval (ISMIR) Conference, November 2013.
[15] A. Mesaros and T. Virtanen. Automatic alignment of
music audio and lyrics. In Proc. of 11th Int. Conf. on
Digital Audio Effects (DAFx), September 2008.
[16] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon,
O. Nieto, D. Liang, and D. P. W. Ellis. mir_eval: a
transparent implementation of common MIR metrics.
In Proc. of the 15th Int. Society for Music Information
Retrieval (ISMIR) Conference, October 2014.
[17] M. Ryynänen and A. P. Klapuri. Automatic transcrip-
tion of melody, bass line, and chords in polyphonic mu-
sic. Computer Music Journal, 32(3):72–86, 2008.
[18] M. Ryynänen, T. Virtanen, J. Paulus, and A. Kla-
puri. Accompaniment separation and karaoke applica-
tion based on automatic melody transcription. In Proc.
of the IEEE Int. Conf. on Multimedia and Expo, pages
1417–1420, April 2008.
[19] J. Salamon, E. Gómez, D. P. W. Ellis, and G. Richard. Melody extraction from polyphonic music signals: Approaches, applications, and challenges. IEEE Signal
Processing Magazine, 31(2):118–134, March 2014.
[20] J. Salamon, B. Rocha, and E. Gómez. Musical genre
classification using melody features extracted from
polyphonic music. In Proc. of IEEE Int. Conf. on
Acoustics, Speech and Signal Processing (ICASSP),
pages 81–84, March 2012.
[21] J. Salamon, J. Serrà, and E. Gómez. Tonal representa-
tions for music retrieval: From version identification to
query-by-humming. Int. Jour. of Multimedia Informa-
tion Retrieval, special issue on Hybrid Music Informa-
tion Retrieval, 2(1):45–58, 2013.
[22] H. Tachibana, T. Ono, N. Ono, and S. Sagayama.
Melody line estimation in homophonic music au-
dio signals based on temporal-variability of melodic
source. In Proc. of IEEE Int. Conf. on Acoustics,
Speech and Signal Processing (ICASSP), pages 425–
428, March 2010.
[23] C. H. Wong, W. M. Szeto, and K. H. Wong. Automatic
lyrics alignment for Cantonese popular music. Multi-
media Systems, 4-5(12):307–323, 2007.
More Related Content

What's hot

Paper id 22201419
Paper id 22201419Paper id 22201419
Paper id 22201419IJRAT
 
LTE in a Nutshell: System Overview
LTE in a Nutshell: System OverviewLTE in a Nutshell: System Overview
LTE in a Nutshell: System Overview
Frank Rayal
 
Sparse channel estimation for underwater acoustic communication using compres...
Sparse channel estimation for underwater acoustic communication using compres...Sparse channel estimation for underwater acoustic communication using compres...
Sparse channel estimation for underwater acoustic communication using compres...
IAEME Publication
 
Multiplexing and Multiple Access
Multiplexing and Multiple AccessMultiplexing and Multiple Access
Multiplexing and Multiple Access
Ridwanul Hoque
 
SINGING-VOICE SEPARATION FROM MONAURAL RECORDINGS USING ROBUST PRINCIPAL COMP...
SINGING-VOICE SEPARATION FROM MONAURAL RECORDINGS USING ROBUST PRINCIPAL COMP...SINGING-VOICE SEPARATION FROM MONAURAL RECORDINGS USING ROBUST PRINCIPAL COMP...
SINGING-VOICE SEPARATION FROM MONAURAL RECORDINGS USING ROBUST PRINCIPAL COMP...chakravarthy Gopi
 
Use of NS-2 to Simulate MANET Routing Algorithms
Use of NS-2 to Simulate MANET Routing AlgorithmsUse of NS-2 to Simulate MANET Routing Algorithms
Use of NS-2 to Simulate MANET Routing AlgorithmsGiancarlo Romeo
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
inventionjournals
 
IR UWB TOA Estimation Techniques and Comparison
IR UWB TOA Estimation Techniques and ComparisonIR UWB TOA Estimation Techniques and Comparison
IR UWB TOA Estimation Techniques and Comparison
inventionjournals
 
Adv multimedia2k7 1_s
Adv multimedia2k7 1_sAdv multimedia2k7 1_s
Adv multimedia2k7 1_sKevin Man
 
Investigation and Analysis of SNR Estimation in OFDM system
Investigation and Analysis of SNR Estimation in OFDM systemInvestigation and Analysis of SNR Estimation in OFDM system
Investigation and Analysis of SNR Estimation in OFDM system
IOSR Journals
 
MIMO Channel Estimation Using the LS and MMSE Algorithm
MIMO Channel Estimation Using the LS and MMSE AlgorithmMIMO Channel Estimation Using the LS and MMSE Algorithm
MIMO Channel Estimation Using the LS and MMSE Algorithm
IOSRJECE
 
Cd31527532
Cd31527532Cd31527532
Cd31527532
IJERA Editor
 
Supporting efficient and scalable multicasting
Supporting efficient and scalable multicastingSupporting efficient and scalable multicasting
Supporting efficient and scalable multicasting
ingenioustech
 
Generating Musical Notes and Transcription using Deep Learning
Generating Musical Notes and Transcription using Deep LearningGenerating Musical Notes and Transcription using Deep Learning
Generating Musical Notes and Transcription using Deep Learning
Varad Meru
 

What's hot (19)

Paper id 22201419
Paper id 22201419Paper id 22201419
Paper id 22201419
 
LTE in a Nutshell: System Overview
LTE in a Nutshell: System OverviewLTE in a Nutshell: System Overview
LTE in a Nutshell: System Overview
 
report
reportreport
report
 
Sparse channel estimation for underwater acoustic communication using compres...
Sparse channel estimation for underwater acoustic communication using compres...Sparse channel estimation for underwater acoustic communication using compres...
Sparse channel estimation for underwater acoustic communication using compres...
 
Multiplexing and Multiple Access
Multiplexing and Multiple AccessMultiplexing and Multiple Access
Multiplexing and Multiple Access
 
Gprs
GprsGprs
Gprs
 
SINGING-VOICE SEPARATION FROM MONAURAL RECORDINGS USING ROBUST PRINCIPAL COMP...
SINGING-VOICE SEPARATION FROM MONAURAL RECORDINGS USING ROBUST PRINCIPAL COMP...SINGING-VOICE SEPARATION FROM MONAURAL RECORDINGS USING ROBUST PRINCIPAL COMP...
SINGING-VOICE SEPARATION FROM MONAURAL RECORDINGS USING ROBUST PRINCIPAL COMP...
 
Use of NS-2 to Simulate MANET Routing Algorithms
Use of NS-2 to Simulate MANET Routing AlgorithmsUse of NS-2 to Simulate MANET Routing Algorithms
Use of NS-2 to Simulate MANET Routing Algorithms
 
CourseworkRadCom2011
CourseworkRadCom2011CourseworkRadCom2011
CourseworkRadCom2011
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
info
infoinfo
info
 
IR UWB TOA Estimation Techniques and Comparison
IR UWB TOA Estimation Techniques and ComparisonIR UWB TOA Estimation Techniques and Comparison
IR UWB TOA Estimation Techniques and Comparison
 
Adv multimedia2k7 1_s
Adv multimedia2k7 1_sAdv multimedia2k7 1_s
Adv multimedia2k7 1_s
 
Investigation and Analysis of SNR Estimation in OFDM system
Investigation and Analysis of SNR Estimation in OFDM systemInvestigation and Analysis of SNR Estimation in OFDM system
Investigation and Analysis of SNR Estimation in OFDM system
 
BMSB_Qian
BMSB_QianBMSB_Qian
BMSB_Qian
 
MIMO Channel Estimation Using the LS and MMSE Algorithm
MIMO Channel Estimation Using the LS and MMSE AlgorithmMIMO Channel Estimation Using the LS and MMSE Algorithm
MIMO Channel Estimation Using the LS and MMSE Algorithm
 
Cd31527532
Cd31527532Cd31527532
Cd31527532
 
Supporting efficient and scalable multicasting
Supporting efficient and scalable multicastingSupporting efficient and scalable multicasting
Supporting efficient and scalable multicasting
 
Generating Musical Notes and Transcription using Deep Learning
Generating Musical Notes and Transcription using Deep LearningGenerating Musical Notes and Transcription using Deep Learning
Generating Musical Notes and Transcription using Deep Learning
 

Viewers also liked

(2016) Hennequin and Rigaud - Long-Term Reverberation Modeling for Under-Dete...
(2016) Hennequin and Rigaud - Long-Term Reverberation Modeling for Under-Dete...(2016) Hennequin and Rigaud - Long-Term Reverberation Modeling for Under-Dete...
(2016) Hennequin and Rigaud - Long-Term Reverberation Modeling for Under-Dete...François Rigaud
 
Como hacer Marketing en WhatsApp
Como hacer Marketing en WhatsAppComo hacer Marketing en WhatsApp
Como hacer Marketing en WhatsApp
José Luis Mijangos Santiago
 
the 21st century teacher
the 21st century teacherthe 21st century teacher
the 21st century teacher
charmesintos123
 
Presentación
PresentaciónPresentación
Presentación
clarahid
 
McDonald's case study
McDonald's case studyMcDonald's case study
McDonald's case study
Nirmal Padgelwar
 
Infografia personal
Infografia personalInfografia personal
Infografia personal
adrian quinatoa
 
Abc de cardio 2016 2
Abc de cardio 2016 2Abc de cardio 2016 2
Abc de cardio 2016 2
GERARDO SELA BAYARDO
 
Harinanden New Resume v5
Harinanden New Resume v5Harinanden New Resume v5
Harinanden New Resume v5Harinanden VK
 
Slow U.S. GDP for Second Quarter 2016
Slow U.S. GDP for Second Quarter 2016Slow U.S. GDP for Second Quarter 2016
Slow U.S. GDP for Second Quarter 2016
Brian Kasal
 
2012 Annual Report
2012 Annual Report2012 Annual Report
2012 Annual Report
Amanda Wright
 
Hidrops fetal por sífilis connatal grave en r np t
Hidrops fetal por sífilis connatal grave en r np tHidrops fetal por sífilis connatal grave en r np t
Hidrops fetal por sífilis connatal grave en r np t
alfredo gomez
 
completed-transcript-9240608
completed-transcript-9240608completed-transcript-9240608
completed-transcript-9240608Farrah Ranzino
 
133295314 obtencion-de-acetileno-informe-de-quimica-lab
133295314 obtencion-de-acetileno-informe-de-quimica-lab133295314 obtencion-de-acetileno-informe-de-quimica-lab
133295314 obtencion-de-acetileno-informe-de-quimica-lab
Alexander Santibañez Solano
 

Viewers also liked (13)

(2016) Hennequin and Rigaud - Long-Term Reverberation Modeling for Under-Dete...
(2016) Hennequin and Rigaud - Long-Term Reverberation Modeling for Under-Dete...(2016) Hennequin and Rigaud - Long-Term Reverberation Modeling for Under-Dete...
(2016) Hennequin and Rigaud - Long-Term Reverberation Modeling for Under-Dete...
 
Como hacer Marketing en WhatsApp
Como hacer Marketing en WhatsAppComo hacer Marketing en WhatsApp
Como hacer Marketing en WhatsApp
 
the 21st century teacher
the 21st century teacherthe 21st century teacher
the 21st century teacher
 
Presentación
PresentaciónPresentación
Presentación
 
McDonald's case study
McDonald's case studyMcDonald's case study
McDonald's case study
 
Infografia personal
Infografia personalInfografia personal
Infografia personal
 
Abc de cardio 2016 2
Abc de cardio 2016 2Abc de cardio 2016 2
Abc de cardio 2016 2
 
Harinanden New Resume v5
Harinanden New Resume v5Harinanden New Resume v5
Harinanden New Resume v5
 
Slow U.S. GDP for Second Quarter 2016
Slow U.S. GDP for Second Quarter 2016Slow U.S. GDP for Second Quarter 2016
Slow U.S. GDP for Second Quarter 2016
 
2012 Annual Report
2012 Annual Report2012 Annual Report
2012 Annual Report
 
Hidrops fetal por sífilis connatal grave en r np t
Hidrops fetal por sífilis connatal grave en r np tHidrops fetal por sífilis connatal grave en r np t
Hidrops fetal por sífilis connatal grave en r np t
 
completed-transcript-9240608
completed-transcript-9240608completed-transcript-9240608
completed-transcript-9240608
 
133295314 obtencion-de-acetileno-informe-de-quimica-lab
133295314 obtencion-de-acetileno-informe-de-quimica-lab133295314 obtencion-de-acetileno-informe-de-quimica-lab
133295314 obtencion-de-acetileno-informe-de-quimica-lab
 

Similar to (2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neural Networks

A Combined Sub-Band And Reconstructed Phase Space Approach To Phoneme Classif...
A Combined Sub-Band And Reconstructed Phase Space Approach To Phoneme Classif...A Combined Sub-Band And Reconstructed Phase Space Approach To Phoneme Classif...
A Combined Sub-Band And Reconstructed Phase Space Approach To Phoneme Classif...
April Smith
 
Speech Analysis and synthesis using Vocoder
Speech Analysis and synthesis using VocoderSpeech Analysis and synthesis using Vocoder
Speech Analysis and synthesis using Vocoder
IJTET Journal
 
Demixing Commercial Music Productions via Human-Assisted Time-Frequency Masking
Demixing Commercial Music Productions via Human-Assisted Time-Frequency MaskingDemixing Commercial Music Productions via Human-Assisted Time-Frequency Masking
Demixing Commercial Music Productions via Human-Assisted Time-Frequency Masking
dhia_naruto
 
(2012) Rigaud, Falaize, David, Daudet - Does Inharmonicity Improve an NMF-Bas...
(2012) Rigaud, Falaize, David, Daudet - Does Inharmonicity Improve an NMF-Bas...(2012) Rigaud, Falaize, David, Daudet - Does Inharmonicity Improve an NMF-Bas...
(2012) Rigaud, Falaize, David, Daudet - Does Inharmonicity Improve an NMF-Bas...François Rigaud
 
(2013) Rigaud et al. - A parametric model and estimation techniques for the i...
(2013) Rigaud et al. - A parametric model and estimation techniques for the i...(2013) Rigaud et al. - A parametric model and estimation techniques for the i...
(2013) Rigaud et al. - A parametric model and estimation techniques for the i...François Rigaud
 
Handling Ihnarmonic Series with Median-Adjustive Trajectories
Handling Ihnarmonic Series with Median-Adjustive TrajectoriesHandling Ihnarmonic Series with Median-Adjustive Trajectories
Handling Ihnarmonic Series with Median-Adjustive TrajectoriesMatthieu Hodgkinson
 
F010334548
F010334548F010334548
F010334548
IOSR Journals
 
Teager Energy Operation on Wavelet Packet Coefficients for Enhancing Noisy Sp...
Teager Energy Operation on Wavelet Packet Coefficients for Enhancing Noisy Sp...Teager Energy Operation on Wavelet Packet Coefficients for Enhancing Noisy Sp...
Teager Energy Operation on Wavelet Packet Coefficients for Enhancing Noisy Sp...
CSCJournals
 
Performance analysis of the convolutional recurrent neural network on acousti...
Performance analysis of the convolutional recurrent neural network on acousti...Performance analysis of the convolutional recurrent neural network on acousti...
Performance analysis of the convolutional recurrent neural network on acousti...
journalBEEI
 
METHOD FOR REDUCING OF NOISE BY IMPROVING SIGNAL-TO-NOISE-RATIO IN WIRELESS LAN
METHOD FOR REDUCING OF NOISE BY IMPROVING SIGNAL-TO-NOISE-RATIO IN WIRELESS LANMETHOD FOR REDUCING OF NOISE BY IMPROVING SIGNAL-TO-NOISE-RATIO IN WIRELESS LAN
METHOD FOR REDUCING OF NOISE BY IMPROVING SIGNAL-TO-NOISE-RATIO IN WIRELESS LAN
IJNSA Journal
 
A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...
A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...
A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...
IRJET Journal
 
Sound event detection using deep neural networks
Sound event detection using deep neural networksSound event detection using deep neural networks
Sound event detection using deep neural networks
TELKOMNIKA JOURNAL
 
Jenner_ST_FB_Jan_2016
Jenner_ST_FB_Jan_2016Jenner_ST_FB_Jan_2016
Jenner_ST_FB_Jan_2016Hans Ecke
 
Voice biometric recognition
Voice biometric recognitionVoice biometric recognition
Voice biometric recognition
phyuhsan
 
High Quality Arabic Concatenative Speech Synthesis
High Quality Arabic Concatenative Speech SynthesisHigh Quality Arabic Concatenative Speech Synthesis
High Quality Arabic Concatenative Speech Synthesis
sipij
 
Improvement of minimum tracking in Minimum Statistics noise estimation method
Improvement of minimum tracking in Minimum Statistics noise estimation methodImprovement of minimum tracking in Minimum Statistics noise estimation method
Improvement of minimum tracking in Minimum Statistics noise estimation method
CSCJournals
 
A NOVEL ENF EXTRACTION APPROACH FOR REGION-OF-RECORDING IDENTIFICATION OF MED...
A NOVEL ENF EXTRACTION APPROACH FOR REGION-OF-RECORDING IDENTIFICATION OF MED...A NOVEL ENF EXTRACTION APPROACH FOR REGION-OF-RECORDING IDENTIFICATION OF MED...
A NOVEL ENF EXTRACTION APPROACH FOR REGION-OF-RECORDING IDENTIFICATION OF MED...
cscpconf
 
Real Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform DomainReal Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform Domain
Willy Marroquin (WillyDevNET)
 

Similar to (2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neural Networks (20)

A Combined Sub-Band And Reconstructed Phase Space Approach To Phoneme Classif...
A Combined Sub-Band And Reconstructed Phase Space Approach To Phoneme Classif...A Combined Sub-Band And Reconstructed Phase Space Approach To Phoneme Classif...
A Combined Sub-Band And Reconstructed Phase Space Approach To Phoneme Classif...
 
Speech Analysis and synthesis using Vocoder
Speech Analysis and synthesis using VocoderSpeech Analysis and synthesis using Vocoder
Speech Analysis and synthesis using Vocoder
 
Demixing Commercial Music Productions via Human-Assisted Time-Frequency Masking
Demixing Commercial Music Productions via Human-Assisted Time-Frequency MaskingDemixing Commercial Music Productions via Human-Assisted Time-Frequency Masking
Demixing Commercial Music Productions via Human-Assisted Time-Frequency Masking
 
(2012) Rigaud, Falaize, David, Daudet - Does Inharmonicity Improve an NMF-Bas...
(2012) Rigaud, Falaize, David, Daudet - Does Inharmonicity Improve an NMF-Bas...(2012) Rigaud, Falaize, David, Daudet - Does Inharmonicity Improve an NMF-Bas...
(2012) Rigaud, Falaize, David, Daudet - Does Inharmonicity Improve an NMF-Bas...
 
(2013) Rigaud et al. - A parametric model and estimation techniques for the i...
(2013) Rigaud et al. - A parametric model and estimation techniques for the i...(2013) Rigaud et al. - A parametric model and estimation techniques for the i...
(2013) Rigaud et al. - A parametric model and estimation techniques for the i...
 
Handling Ihnarmonic Series with Median-Adjustive Trajectories
Handling Ihnarmonic Series with Median-Adjustive TrajectoriesHandling Ihnarmonic Series with Median-Adjustive Trajectories
Handling Ihnarmonic Series with Median-Adjustive Trajectories
 
F010334548
F010334548F010334548
F010334548
 
H0814247
H0814247H0814247
H0814247
 
Teager Energy Operation on Wavelet Packet Coefficients for Enhancing Noisy Sp...
Teager Energy Operation on Wavelet Packet Coefficients for Enhancing Noisy Sp...Teager Energy Operation on Wavelet Packet Coefficients for Enhancing Noisy Sp...
Teager Energy Operation on Wavelet Packet Coefficients for Enhancing Noisy Sp...
 
Performance analysis of the convolutional recurrent neural network on acousti...
Performance analysis of the convolutional recurrent neural network on acousti...Performance analysis of the convolutional recurrent neural network on acousti...
Performance analysis of the convolutional recurrent neural network on acousti...
 
METHOD FOR REDUCING OF NOISE BY IMPROVING SIGNAL-TO-NOISE-RATIO IN WIRELESS LAN
METHOD FOR REDUCING OF NOISE BY IMPROVING SIGNAL-TO-NOISE-RATIO IN WIRELESS LANMETHOD FOR REDUCING OF NOISE BY IMPROVING SIGNAL-TO-NOISE-RATIO IN WIRELESS LAN
METHOD FOR REDUCING OF NOISE BY IMPROVING SIGNAL-TO-NOISE-RATIO IN WIRELESS LAN
 
A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...
A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...
A Noise Reduction Method Based on Modified Least Mean Square Algorithm of Rea...
 
Sound event detection using deep neural networks
Sound event detection using deep neural networksSound event detection using deep neural networks
Sound event detection using deep neural networks
 
Jenner_ST_FB_Jan_2016
Jenner_ST_FB_Jan_2016Jenner_ST_FB_Jan_2016
Jenner_ST_FB_Jan_2016
 
Voice biometric recognition
Voice biometric recognitionVoice biometric recognition
Voice biometric recognition
 
High Quality Arabic Concatenative Speech Synthesis
High Quality Arabic Concatenative Speech SynthesisHigh Quality Arabic Concatenative Speech Synthesis
High Quality Arabic Concatenative Speech Synthesis
 
Paper1
Paper1Paper1
Paper1
 
Improvement of minimum tracking in Minimum Statistics noise estimation method
Improvement of minimum tracking in Minimum Statistics noise estimation methodImprovement of minimum tracking in Minimum Statistics noise estimation method
Improvement of minimum tracking in Minimum Statistics noise estimation method
 
A NOVEL ENF EXTRACTION APPROACH FOR REGION-OF-RECORDING IDENTIFICATION OF MED...
A NOVEL ENF EXTRACTION APPROACH FOR REGION-OF-RECORDING IDENTIFICATION OF MED...A NOVEL ENF EXTRACTION APPROACH FOR REGION-OF-RECORDING IDENTIFICATION OF MED...
A NOVEL ENF EXTRACTION APPROACH FOR REGION-OF-RECORDING IDENTIFICATION OF MED...
 
Real Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform DomainReal Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform Domain
 

(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neural Networks

  • 1. SINGING VOICE MELODY TRANSCRIPTION USING DEEP NEURAL NETWORKS Franc¸ois Rigaud and Mathieu Radenen Audionamix R&D 171 quai de Valmy, 75010 Paris, France <firstname>.<lastname>@audionamix.com ABSTRACT This paper presents a system for the transcription of singing voice melodies in polyphonic music signals based on Deep Neural Network (DNN) models. In particular, a new DNN system is introduced for performing the f0 es- timation of the melody, and another DNN, inspired from recent studies, is learned for segmenting vocal sequences. Preparation of the data and learning configurations related to the specificity of both tasks are described. The perfor- mance of the melody f0 estimation system is compared with a state-of-the-art method and exhibits highest accu- racy through a better generalization on two different music databases. Insights into the global functioning of this DNN are proposed. Finally, an evaluation of the global system combining the two DNNs for singing voice melody tran- scription is presented. 1. INTRODUCTION The automatic transcription of the main melody from poly- phonic music signals is a major task of Music Information Retrieval (MIR) research [19]. Indeed, besides applica- tions to musicological analysis or music practice, the use of the main melody as prior information has been shown useful in various types of higher-level tasks such as music genre classification [20], music retrieval [21], music de- soloing [4, 18] or lyrics alignment [15, 23]. From a sig- nal processing perspective, the main melody can be repre- sented by sequences of fundamental frequency (f0) defined on voicing instants, i.e. on portions where the instrument producing the melody is active. Hence, main melody tran- scription algorithms usually follow two main processing steps. First, a representation emphasizing the most likely f0s over time is computed, e.g. under the form of a salience matrix [19], a vocal source activation matrix [4] or an en- hanced spectrogram [22]. Second, a binary classification of the selected f0s between melodic and background con- tent is performed using melodic contour detection/tracking and voicing detection. c Franc¸ois Rigaud and Mathieu Radenen. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Franc¸ois Rigaud and Mathieu Radenen. “Singing Voice Melody Transcription using Deep Neural Networks”, 17th International Society for Music Information Retrieval Conference, 2016. In this paper we propose to tackle the melody transcrip- tion task as a supervised classification problem where each time frame of signal has to be assigned into a pitch class when a melody is present and an ‘unvoiced’ class when it is not. Such approach has been proposed in [5] where melody transcription is performed applying Support Vector Ma- chine on input features composed of Short-Time Fourier Transforms (STFT). Similarly for noisy speech signals, f0 estimation algorithms based on Deep Neural Networks (DNN) have been introduced in [9,12]. Following such fully data driven approaches we intro- duce a singing voice melody transcription system com- posed of two DNN models respectively used to perform the f0 estimation task and the Voice Activity Detection (VAD) task. The main contribution of this paper is to present a DNN architecture able to discriminate the differ- ent f0s from low-level features, namely spectrogram data. 
Compared to a well-known state-of-the-art method [19], it shows significant improvements in terms of f0 accu- racy through an increase of robustness with regard to mu- sical genre and a reduction of octave-related errors. By analyzing the weights of the network, the DNN is shown somehow equivalent to a simple harmonic-sum method for which the parameters usually set empirically are here auto- matically learned from the data and where the succession of non-linear layers likely increases the power of discrim- ination of harmonically-related f0. For the task of VAD, another DNN model, inspired from [13] is learned. For both models, special care is taken to prevent over-fitting issues by using different databases and perturbing the data with audio degradations. Performance of the whole system is finally evaluated and shows promising results. The rest of the paper is organized as follows. Section 2 presents an overview of the whole system. Sections 3 and 4 introduce the DNN models and detail the learning con- figurations respectively for the VAD and the f0 estimation task. Then, Section 5 presents an evaluation of the system and Section 6 concludes the study. 2. SYSTEM OVERVIEW 2.1 Global architecture The proposed system, displayed on Figure 1, is composed of two independent parallel DNN blocks that perform re- spectively the f0 melody estimation and the VAD. 737
  • 2. Figure 1: Architecture of the proposed system for singing voice melody transcription. In contrast with [9,12] that propose a single DNN model to perform both tasks, we did not find such unified func- tional architecture able to discriminate successfully a time frame between quantified f0s and ‘unvoiced’ classes. In- deed the models presented in these studies are designed for speech signals mixed with background noise for which the discrimination between a frame of noise and a frame of speech is very likely related to the presence or absence of a pitched structure, which is also probably the kind of information on which the system relies to estimate the f0. Conversely, with music signals both the melody and the accompaniment exhibit harmonic structures and the voic- ing discrimination usually requires different levels of in- formation, e.g. under the form of timbral features such as Mel-Frequency Cepstral Coefficients. Another characteristic of the proposed system is the par- allel architecture that allows considering different types of input data for the two DNNs and which arises from the application restricted to vocal melodies. Indeed, unlike generic systems dealing with main melody transcription of different instruments (often within a same piece of music) which usually process the f0 estimation and the voicing de- tection sequentially, the focus on singing voice here hardly allows for a voicing detection relying only on the distri- bution and statistics of the candidate pitch contours and/or their energy [2,19]. Thus, this constraint requires to build a specific VAD system that should learn to discriminate the timbre of a vocal melody from an instrumental melody, such as for example played by a saxophone. 2.2 Signal decomposition As shown on Figure 1, both DNN models are preceded by a signal decomposition. At the input of the global system, audio signals are first converted to mono and re-sampled to 16 kHz. Then, following [13], it is proposed to provide the DNNs with a set of pre-decomposed signals obtained by applying a double-stage Harmonic/Percussive Source Sep- aration (HPSS) [6,22] on the input mixture signal. The key idea behind double-stage HPSS is to consider that within a mix, melodic signals are usually less stable/stationary than the background ‘harmonic’ instruments (such as a bass or a piano), but more than the percussive instruments (such as the drums). Thus, according to the frequency reso- lution that is used to compute a STFT, applying a har- monic/percussive decomposition on a mixture spectrogram lead to a rough separation where the melody is mainly ex- tracted either in the harmonic or in the percussive content. Using such pre-processing, 4 different signals are ob- tained. First, the input signal s is decomposed into the sum of h1 and p1 using a high-frequency resolution STFT (typ- ically with a window of about 300 ms) where p1 should mainly contain the melody and the drums, and h1 the remaining stable instrument signals. Second, p1 is fur- ther decomposed into the sum of h2 and p2 using a low- frequency resolution STFT (typically with a window of about 30 ms), where h2 mainly contains the melody, and p2 the drums. As presented latter in Sections 3 and 4, dif- ferent types of these 4 signals or combinations of them will be used to experimentally determine optimal DNN models. 
2.3 Learning data

Several annotated databases composed of polyphonic music with transcribed melodies are used for building the train, validation and test datasets used for the learning (cf. Sections 3 and 4) and the evaluation (cf. Section 5) of the DNNs. In particular, subsets of the RWC Popular Music and Royalty Free Music [7] and MIR-1k [10] databases are used for the train dataset, and the recent databases MedleyDB [1] and iKala [3] are split between the train, validation and test datasets. Note that for iKala the vocal and instrumental tracks are mixed with a relative gain of 0 dB.

Also, in order to minimize over-fitting issues and to increase the robustness of the system with respect to audio equalization and encoding degradations, we use the Audio Degradation Toolbox [14]. Thus, several files composing the train and validation datasets (50% for the VAD task and 25% for the f0 estimation task) are duplicated with one degraded version, the degradation type being randomly chosen amongst those that preserve the alignment between the audio and the annotation (e.g. that do not produce time/pitch warping or overly long reverberation effects).

3. VOICE ACTIVITY DETECTION WITH DEEP NEURAL NETWORKS

This section briefly describes the process for learning the DNN used to perform the VAD. It is largely inspired from a previous study presented in more detail in [13]. A similar deep recurrent neural network architecture composed of Bidirectional Long Short-Term Memory (BLSTM) layers [8] is used. In our case the architecture is arbitrarily fixed to 3 BLSTM layers of 50 units each and a final feed-forward logistic output layer with one unit. As in [13], different combinations of the pre-decomposed signals (cf. Section 2.2) are considered to determine an optimal network: s, p1, h2, h1p1, h2p2 and h1h2p2. For each of these pre-decomposed signals, timbral features are computed under the form of mel-frequency spectrograms obtained using a STFT with 32 ms long Hamming windows and 75% overlap, and 40 triangular filters distributed on a mel scale between 0 and 8000 Hz.
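As an illustration, these VAD input features could be computed as follows. This is a sketch assuming librosa; whether the original features are magnitude or power mel spectrograms, and the exact filterbank details of [13], are not specified.

```python
import numpy as np
import librosa

def vad_features(y, sr=16000):
    """40-band mel spectrogram: 32 ms Hamming window, 75% overlap, 0-8000 Hz."""
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=512, win_length=512, hop_length=128,   # 32 ms window, 8 ms hop
        window="hamming", power=1.0,                  # magnitude (assumed)
        n_mels=40, fmin=0.0, fmax=8000.0,
    )
    return mel.T                                      # one 40-dimensional vector per frame

def normalize(features, mean, std):
    """Per-feature standardization with statistics computed over the train dataset."""
    return (features - mean) / std
```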
Then, each feature of the input data is normalized using the mean and variance computed over the train dataset. Contrary to [13], the learning is performed in a single step, i.e. without adopting a layer-by-layer training.

Finally, the best architecture is obtained for the combination of the h1, h2 and p2 signals, thus for an input of size 120, which corresponds to using all the information present in the original signal (s = h1 + h2 + p2). An illustration of this network is presented in Figure 2. A simple post-processing of the DNN output, consisting in thresholding it at 0.5, is finally applied to take the binary voicing decision for each frame.

Figure 2: VAD network illustration.

4. F0 ESTIMATION WITH DEEP NEURAL NETWORKS

This section presents in detail the learning configuration for the DNN used to perform the f0 estimation task. An interpretation of the network functioning is finally presented.

4.1 Preparation of learning data

As proposed in [5], we decide to keep low-level features to feed the DNN model. Compared to [12] and [9], which use as input pre-computed representations known for highlighting the periodicity of pitched sounds (respectively based on an auto-correlation and a harmonic filtering), we expect here the network to be able to learn an optimal transformation automatically from spectrogram data. Thus, the set of selected features consists of log-spectrograms (logarithm of the modulus of the STFT) computed with a 64 ms Hamming window (1024 samples at a sampling frequency of 16000 Hz) and an overlap of 0.75, and from which frequencies below 50 Hz and above 4000 Hz are discarded. For each music excerpt the corresponding log-spectrogram is rescaled between 0 and 1. Since, as described in Section 2.1, the VAD is performed by a second independent system, all time frames for which no vocal melody is present are removed from the dataset. These features are computed independently for 3 different types of input signal for which the melody should be more or less emphasized: s, p1 and h2 (cf. Section 2.2).

For the output, the f0s are quantized between C#2 (f0 ≈ 69.29 Hz) and C#6 (f0 ≈ 1108.73 Hz) with a spacing of an eighth of a tone, thus leading to a total of 193 classes.

The train and validation datasets, including the audio-degraded versions, are finally composed of 22877 and 3394 melodic sequences, respectively, for total durations of about 220 and 29 minutes.
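The eighth-of-a-tone grid corresponds to 48 classes per octave over the 4 octaves from C#2 to C#6, hence 4 x 48 + 1 = 193 classes. A small sketch of the mapping between f0 values and class indices (helper names are ours):

```python
import numpy as np

F_MIN = 69.29      # C#2, center frequency of the lowest class
N_CLASSES = 193    # 48 eighth-of-a-tone steps per octave, over 4 octaves, plus 1

def f0_to_class(f0_hz):
    """Quantize an f0 in Hz to the nearest of the 193 pitch classes."""
    idx = int(round(48.0 * np.log2(f0_hz / F_MIN)))
    return int(np.clip(idx, 0, N_CLASSES - 1))

def class_to_f0(idx):
    """Center frequency (Hz) of a pitch class."""
    return F_MIN * 2.0 ** (idx / 48.0)
```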
4.2 Training

Several experiments have been run to determine a functional DNN architecture. In particular, two types of neuron units have been considered: the standard feed-forward sigmoid unit and the Bidirectional Long Short-Term Memory (BLSTM) recurrent unit [8]. For each test, the weights of the network are initialized randomly according to a Gaussian distribution with zero mean and a standard deviation of 0.1, and optimized to minimize the cross-entropy error function. The learning is then performed by means of stochastic gradient descent with shuffled mini-batches composed of 30 melodic sequences, a learning rate of 10^-7 and a momentum of 0.9. The optimization is run for a maximum of 10000 epochs, and early stopping is applied if no decrease of the validation set error is observed during 100 consecutive epochs. In addition to the use of audio degradations during the preparation of the data for preventing over-fitting (cf. Section 2.3), the training examples are slightly corrupted during the learning by adding Gaussian noise with variance 0.05 at each epoch.

Among the different architectures tested, the best classification performance is obtained for the input signal p1 (slightly better than for s, i.e. without pre-separation) by a 2-hidden-layer feed-forward network with 500 sigmoid units each and a 193-unit softmax output layer. An illustration of this network is presented in Figure 3. Interestingly, for that configuration the learning did not suffer from over-fitting, so that it ended at the maximum number of epochs, thus without early stopping.

Figure 3: f0 estimation network illustration.

While the temporal continuity of the f0 along time frames should provide valuable information, the use of BLSTM recurrent layers (alone or in combination with feed-forward sigmoid layers) did not lead to efficient systems. Further experiments should be conducted to enforce the inclusion of such temporal context in a feed-forward DNN architecture, for instance by concatenating several consecutive time frames in the input.
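For reference, the selected architecture and training configuration could be sketched as follows with a modern toolkit. This is a re-expression in Keras, not the original implementation; the input dimension n_in (the number of retained frequency bins between 50 Hz and 4000 Hz) is assumed, and the GaussianNoise layer only approximates the per-epoch input corruption described above.

```python
import numpy as np
from tensorflow import keras

n_in = 256        # assumed: number of spectrum bins kept between 50 and 4000 Hz
n_classes = 193
init = keras.initializers.RandomNormal(mean=0.0, stddev=0.1)

model = keras.Sequential([
    keras.layers.Input(shape=(n_in,)),
    keras.layers.GaussianNoise(np.sqrt(0.05)),   # variance 0.05, active at training time only
    keras.layers.Dense(500, activation="sigmoid", kernel_initializer=init),
    keras.layers.Dense(500, activation="sigmoid", kernel_initializer=init),
    keras.layers.Dense(n_classes, activation="softmax", kernel_initializer=init),
])

model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=1e-7, momentum=0.9),
    loss="categorical_crossentropy",
)
# Training would then use shuffled mini-batches of 30 melodic sequences, at most
# 10000 epochs, and early stopping after 100 epochs without validation improvement,
# e.g. via keras.callbacks.EarlyStopping(patience=100).
```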
4.3 Post-processing

The output layer of the DNN, composed of softmax units, returns an f0 probability distribution for each time frame, which can be seen for a full piece of music as a pitch activation matrix. In order to take a final decision that accounts for the continuity of the f0 along melodic sequences, a Viterbi tracking is finally applied on the network output [5, 9, 12]. For that, the log-probability of the transition between two consecutive time frames and two f0 classes is simply and arbitrarily set inversely proportional to their absolute difference in semitones. For further improvement of the system, such a transition matrix could be learned from the data [5]; however, this simple rule gives interesting performance gains (when compared to a simple 'maximum picking' post-processing without temporal context) while potentially reducing the risk of over-fitting to a particular music style.
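A minimal NumPy sketch of this decoding step is given below. The paper only states that the transition log-probability decreases with the pitch distance; the exact functional form and the scaling factor alpha used here are assumptions.

```python
import numpy as np

def viterbi_f0(posteriors, alpha=1.0):
    """Decode a pitch-class path from the per-frame softmax outputs.

    posteriors: (T, 193) array of class probabilities.
    Transition log-probabilities are set to -alpha * |i - j| / 4, i.e. decreasing
    with the absolute class difference expressed in semitones (4 eighth-tone
    classes per semitone); alpha is an assumed scaling constant.
    """
    T, K = posteriors.shape
    idx = np.arange(K)
    log_trans = -alpha * np.abs(idx[:, None] - idx[None, :]) / 4.0
    log_obs = np.log(posteriors + 1e-12)

    delta = log_obs[0].copy()               # best log-score ending in each class
    back = np.zeros((T, K), dtype=int)      # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans             # (previous class, current class)
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], idx] + log_obs[t]

    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```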
Figure 4: Display of the weights of the two sigmoid feed-forward layers (top) and of the softmax layer (bottom) of the DNN learned for the f0 estimation task.

4.4 Network weights interpretation

We propose here to gain an insight into the network functioning for this specific task of f0 estimation by analyzing the weights of the DNN. The input is a short-time spectrum and the output corresponds to an activation vector in which a single element (the actual f0 of the melody at that time frame) should be predominant. In that case, it is reasonable to expect that the DNN somehow behaves like a harmonic-sum operator.

While the visualization of the distribution of the hidden-layer weights usually does not provide straightforward cues for analysing a DNN functioning (cf. Figure 4), we consider a simplified network for which it is assumed that each feed-forward logistic unit is working in the linear regime. Thus, removing the non-linear operations, the output of a feed-forward layer with index $l$ composed of $N_l$ units writes

$$x_l = W_l \cdot x_{l-1} + b_l, \qquad (1)$$

where $x_l \in \mathbb{R}^{N_l}$ (resp. $x_{l-1} \in \mathbb{R}^{N_{l-1}}$) corresponds to the output vector of layer $l$ (resp. $l-1$), $W_l \in \mathbb{R}^{N_l \times N_{l-1}}$ is the weight matrix and $b_l \in \mathbb{R}^{N_l}$ the bias vector. Using this expression, the output of the layer with index $L$, expressed as the propagation of the input $x_0$ through the linear network, also writes

$$x_L = W \cdot x_0 + b, \qquad (2)$$

where $W = \prod_{l=1}^{L} W_l$ corresponds to a global weight matrix, and $b$ to a global bias that depends on the set of parameters $\{W_l, b_l, \forall l \in [1..L]\}$.

As mentioned above, in our case $x_0$ is a short-time spectrum and $x_L$ is an f0 activation vector. The global weight matrix should thus present some characteristics of a pitch detector. Indeed, as displayed on Figure 5a, the matrix $W$ for the learned DNN (which is thus the product of the 3 weight matrices depicted in Figure 4) exhibits a harmonic structure for most output classes of f0, except for some f0s in the low and high frequency ranges for which no or too few examples are present in the learning data.
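This linearization is easy to reproduce once the learned parameters are exported as NumPy arrays; a small sketch, with hypothetical variable names, is given below.

```python
import numpy as np

def collapse_linear(weights, biases):
    """Collapse a stack of affine layers (non-linearities dropped) into W and b,
    so that x_L = W @ x_0 + b, cf. Eqs. (1)-(2).

    weights: [W_1, ..., W_L] with W_l of shape (N_l, N_{l-1});
    biases:  [b_1, ..., b_L] with b_l of shape (N_l,).
    """
    W, b = weights[0], biases[0]
    for W_l, b_l in zip(weights[1:], biases[1:]):
        b = W_l @ b + b_l        # propagate the accumulated bias through the next layer
        W = W_l @ W              # accumulate the global weight matrix
    return W, b

# Row k of W is then the spectral template whose inner product with the input
# spectrum scores the k-th f0 class (Figure 5b shows one such row).
```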
Figure 5: Linearized DNN illustration. (a) Visualization of the (transposed) weight matrix W: the x-axis corresponds to the output class indices (the f0s) and the y-axis represents the input feature indices (frequency channels of the input spectrum). (b) Weights for the f0 output class with index 100.

Most approaches dealing with main melody transcription usually rely on this type of transformation to compute a representation emphasizing f0 candidates (or salience function), and they are usually partly based on hand-crafted designs [11, 17, 19]. Interestingly, using a fully data-driven method as proposed here, the parameters of a comparable weighted harmonic summation algorithm (such as the number of harmonics to consider for each note and their respective weights) do not have to be defined. This can be observed in more detail on Figure 5b, which depicts the linearized network weights for the class with index 100 (f0 ≈ 289.43 Hz). Moreover, while this interpretation assumes a linear network, one can expect that the non-linear operations actually present in the network help in enhancing the discrimination between the different f0 classes.

5. EVALUATION

5.1 Experimental procedure

Two different test datasets composed of full music excerpts (i.e. vocal and non-vocal portions) are used for the evaluation. One is composed of 17 tracks from MedleyDB (the last songs comprising vocal melodies, from MusicDelta Reggae to Wolf DieBekherte, for a total of about 25.5 min of vocal portions) and the other is composed of 63 tracks from iKala (from 54223 chorus to 90587 verse, for a total of about 21 min of vocal portions).

The evaluation is conducted in two steps. First, the performance of the f0 estimation DNN taken alone (thus without voicing detection) is compared with the state-of-the-art system melodia [19] using f0 accuracy metrics. Second, the performance of our complete singing voice transcription system (VAD and f0 estimation) is evaluated on the same datasets. Since our system is restricted to the transcription of vocal melodies and since, to our knowledge, all available state-of-the-art systems are designed to target the main melody, this final evaluation presents the results for our system without comparison with a reference.

For all tasks and systems, the evaluation metrics are computed using the mir_eval library [16]. For Section 5.3, some additional metrics related to voicing detection, namely precision, f-measure and voicing accuracy, were not present in the original mir_eval code and were thus added for our experiments.

5.2 f0 estimation task

The performance of the DNN performing the f0 estimation task is first compared to the melodia system [19], using the plug-in implementation with f0 search range limits set equal to those of our system (69.29-1108.73 Hz, cf. Section 4.1) and with the remaining parameters left to their default values. For each system and each music track the performance is evaluated in terms of raw pitch accuracy (RPA) and raw chroma accuracy (RCA). These metrics are computed on vocal segments (i.e. without accounting for potential voicing detection errors) for an f0 tolerance of 50 cents.
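For instance, the RPA and RCA scores for one track can be obtained directly from mir_eval, as in the sketch below; the file names are hypothetical and unvoiced frames are encoded with non-positive frequencies, following the library's convention.

```python
import mir_eval

# Reference annotation and system output as (time, f0) series; zero/negative f0
# marks unvoiced frames.
ref_time, ref_freq = mir_eval.io.load_time_series("reference_melody.csv")
est_time, est_freq = mir_eval.io.load_time_series("estimated_melody.csv")

scores = mir_eval.melody.evaluate(ref_time, ref_freq, est_time, est_freq)
print(scores["Raw Pitch Accuracy"])    # RPA, 50-cent tolerance by default
print(scores["Raw Chroma Accuracy"])   # RCA, octave errors forgiven
print(scores["Overall Accuracy"])      # OA, voicing decisions included
```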
The results are presented in Figure 6 in the form of a box plot where, for each metric and dataset, the ends of the dashed vertical bars delimit the lowest and highest scores obtained, the 3 vertical bars composing each center box respectively correspond to the first quartile, the median and the third quartile of the distribution, and the star markers represent the mean. Both systems are characterized by more widespread distributions for MedleyDB than for iKala. This reflects the fact that MedleyDB is more heterogeneous in musical genres and recording conditions than iKala. On iKala, the DNN performs slightly better than melodia when comparing the means. On MedleyDB, the gap between the two systems increases significantly. The DNN system seems much less affected by the variability of the music examples and clearly improves the mean RPA, by about 20 percentage points (62.13% for melodia vs. 82.48% for the DNN). Additionally, while exhibiting more similar distributions of RPA and RCA, the DNN tends to produce fewer octave detection errors. It should be noted that this result does not take into account the recent post-processing improvement proposed for melodia [2]; yet it shows the interest of using such a DNN approach to compute an enhanced pitch salience matrix which, simply combined with a Viterbi post-processing, achieves good performance.

Figure 6: Comparative evaluation of the proposed DNN (in black) and melodia (in gray) on the MedleyDB (left) and iKala (right) test sets for the vocal melody f0 estimation task.
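Since the voicing precision, f-measure and voicing accuracy reported next were added on top of mir_eval for these experiments, the following is a frame-level sketch of how such metrics can be derived from binary voicing decisions, assuming boolean arrays sampled on a common time grid.

```python
import numpy as np

def voicing_scores(ref_voicing, est_voicing):
    """Precision, recall, f-measure, false-alarm rate and voicing accuracy
    computed from frame-level boolean voicing arrays."""
    ref = np.asarray(ref_voicing, dtype=bool)
    est = np.asarray(est_voicing, dtype=bool)
    tp = np.sum(ref & est)      # voiced frames correctly detected
    fp = np.sum(~ref & est)     # unvoiced frames labelled voiced
    fn = np.sum(ref & ~est)     # voiced frames missed
    tn = np.sum(~ref & ~est)    # unvoiced frames correctly rejected
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f_measure = 2 * precision * recall / max(precision + recall, 1e-12)
    false_alarm = fp / max(fp + tn, 1)
    accuracy = (tp + tn) / max(len(ref), 1)
    return precision, recall, f_measure, false_alarm, accuracy
```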
5.3 Singing voice transcription task

The evaluation of the global system is finally performed on the same two test datasets. The results are displayed as box plots (cf. the description in Section 5.2) in Figures 7a and 7b, respectively for the iKala and the MedleyDB datasets. Five metrics are computed to evaluate the voicing detection, namely the precision (P), the recall (R), the f-measure (F), the false alarm rate (FA) and the voicing accuracy (VA). A sixth metric, the overall accuracy (OA), is also presented to assess the global performance of the complete singing voice melody transcription system.

In accordance with the previous evaluation, the results on MedleyDB are characterized by much more variance than on iKala. In particular, the voicing precision of the system (i.e. its ability to provide correct detections, no matter the number of forgotten voiced frames) is significantly degraded on MedleyDB. Conversely, the voicing recall, which evaluates the ability of the system to detect all voiced portions actually present no matter the number of false alarms, remains relatively good on MedleyDB. Combining both metrics, mean f-measures of 93.15% and 79.19% are obtained on the iKala and MedleyDB test datasets, respectively. Finally, the mean overall accuracy scores of the global system are equal to 85.06% and 75.03%, respectively, for the iKala and MedleyDB databases.

Figure 7: Voicing detection and overall performance of the proposed system on the iKala (a) and MedleyDB (b) test datasets.

6. CONCLUSION

This paper introduced a system for the transcription of singing voice melodies composed of two DNN models. In particular, a new system able to learn a representation emphasizing melodic lines from low-level data composed of spectrograms has been proposed for the estimation of the f0. For this DNN, the performance evaluation shows a relatively good generalization (when compared to a reference system) on two different test datasets and an increased robustness to western music recordings that tend to be representative of current music industry productions. While for these experiments the systems have been learned from a relatively low amount of data, the robustness, particularly for the task of VAD, could very likely be improved by increasing the number of training examples.

7. REFERENCES

[1] R. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. Bello. MedleyDB: A multitrack dataset for annotation-intensive MIR research. In Proc. of the 15th Int. Society for Music Information Retrieval (ISMIR) Conference, October 2014.

[2] R. M. Bittner, J. Salamon, S. Essid, and J. P. Bello. Melody extraction by contour classification. In Proc. of the 16th Int. Society for Music Information Retrieval (ISMIR) Conference, October 2015.

[3] T.-S. Chan, T.-C. Yeh, Z.-C. Fan, H.-W. Chen, L. Su, Y.-H. Yang, and R. Jang. Vocal activity informed singing voice separation with the iKala dataset. In Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 718-722, April 2015.

[4] J.-L. Durrieu, G. Richard, and B. David. An iterative approach to monaural musical mixture de-soloing. In Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 105-108, April 2009.

[5] D. P. W. Ellis and G. E. Poliner. Classification-based melody transcription. Machine Learning, 65(2):439-456, 2006.
[6] D. FitzGerald and M. Gainza. Single channel vocal separation using median filtering and factorisation techniques. ISAST Trans. on Electronic and Signal Processing, 4(1):62-73, 2010.
[7] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka. RWC Music Database: Popular, classical, and jazz music databases. In Proc. of the 3rd Int. Society for Music Information Retrieval (ISMIR) Conference, pages 287-288, October 2002.

[8] A. Graves, A.-R. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 6645-6649, May 2013.

[9] K. Han and D. L. Wang. Neural network based pitch tracking in very noisy speech. IEEE/ACM Trans. on Audio, Speech, and Language Processing, 22(12):2158-2168, October 2014.

[10] C.-L. Hsu and J.-S. R. Jang. On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset. IEEE Trans. on Audio, Speech, and Language Processing, 18(2):310-319, 2010.

[11] S. Jo, S. Joo, and C. D. Yoo. Melody pitch estimation based on range estimation and candidate extraction using harmonic structure model. In Proc. of INTERSPEECH, pages 2902-2905, 2010.

[12] B. S. Lee and D. P. W. Ellis. Noise robust pitch tracking by subband autocorrelation classification. In Proc. of INTERSPEECH, 2012.

[13] S. Leglaive, R. Hennequin, and R. Badeau. Singing voice detection with deep recurrent neural networks. In Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 121-125, April 2015.

[14] M. Mauch and S. Ewert. The audio degradation toolbox and its application to robustness evaluation. In Proc. of the 14th Int. Society for Music Information Retrieval (ISMIR) Conference, November 2013.

[15] A. Mesaros and T. Virtanen. Automatic alignment of music audio and lyrics. In Proc. of the 11th Int. Conf. on Digital Audio Effects (DAFx), September 2008.

[16] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis. mir_eval: A transparent implementation of common MIR metrics. In Proc. of the 15th Int. Society for Music Information Retrieval (ISMIR) Conference, October 2014.

[17] M. Ryynänen and A. P. Klapuri. Automatic transcription of melody, bass line, and chords in polyphonic music. Computer Music Journal, 32(3):72-86, 2008.

[18] M. Ryynänen, T. Virtanen, J. Paulus, and A. Klapuri. Accompaniment separation and karaoke application based on automatic melody transcription. In Proc. of the IEEE Int. Conf. on Multimedia and Expo, pages 1417-1420, April 2008.

[19] J. Salamon, E. Gómez, D. P. W. Ellis, and G. Richard. Melody extraction from polyphonic music signals: Approaches, applications, and challenges. IEEE Signal Processing Magazine, 31(2):118-134, March 2014.

[20] J. Salamon, B. Rocha, and E. Gómez. Musical genre classification using melody features extracted from polyphonic music. In Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 81-84, March 2012.

[21] J. Salamon, J. Serrà, and E. Gómez. Tonal representations for music retrieval: From version identification to query-by-humming. Int. Jour. of Multimedia Information Retrieval, special issue on Hybrid Music Information Retrieval, 2(1):45-58, 2013.

[22] H. Tachibana, T. Ono, N. Ono, and S. Sagayama. Melody line estimation in homophonic music audio signals based on temporal-variability of melodic source. In Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 425-428, March 2010.

[23] C. H. Wong, W. M. Szeto, and K. H. Wong. Automatic lyrics alignment for Cantonese popular music. Multimedia Systems, 12(4-5):307-323, 2007.