This document presents a system for transcribing singing voice melodies from polyphonic music signals using deep neural networks (DNNs). It introduces two DNN models: one for fundamental frequency (f0) estimation of the melody, and another for voice activity detection. The f0 estimation DNN takes log-spectrograms as input and classifies each time frame into one of 193 pitch classes or an unvoiced class. It achieves higher accuracy than a state-of-the-art method through better generalization across genres and reduction of octave errors. The system evaluates the DNNs on various music databases and shows promising results for singing voice melody transcription.
PRACTICAL PARTIAL DECODE AND FORWARD ENCODING SCHEME FOR RELAY CHANNEL (ijwmn)
In this paper, a source-destination pair augmented by a half-duplex relay is considered. Two practical partial decode-and-forward encoding schemes are proposed. In these transmission schemes, the relay may decode the source's signal either partially or completely. In each encoding technique, a two-phase transmission scheme is developed in which the relay can partially listen to the source in the first phase until it can generate the source's message and then, in the second phase, forward it to the destination. By employing these transmission phases, the achievable rates are obtained. Rigorous numerical examples are presented to i) show the value of power allocation between the source and the relay, and ii) optimize the length of the listening phase.
A Subspace Method for Blind Channel Estimation in CP-free OFDM Systems (CSCJournals)
In this paper, a subspace method is proposed for blind channel estimation in orthogonal frequency-division multiplexing (OFDM) systems over time-dispersive channels. The proposed method does not require a cyclic prefix (CP) and thus leads to higher spectral efficiency. By exploiting the block Toeplitz structure of the channel matrix, the proposed blind estimation method performs satisfactorily with very few received OFDM blocks. Numerical simulations demonstrate the superior performance of the proposed algorithm over methods reported earlier in the literature.
IMPROVING TRANSMISSION EFFICIENCY IN OPTICAL COMMUNICATION (radziwil)
This work addresses the problem of inefficient bandwidth utilization in optical telecommunication systems caused by network impairments such as latency, congestion, packet loss ratio and error ratio. The proposed solution is a software-based system that implements the User Datagram Protocol (UDP) on the Java platform together with custom-designed transmission control mechanisms.
Streaming Audio Using MPEG–7 Audio Spectrum Envelope to Enable Self-similarit... (TELKOMNIKA JOURNAL)
Traditional packet-level Forward Error Correction approaches can limit errors for small sporadic network losses, but when large portions of the stream drop out, listening quality becomes an issue. Services such as audio-on-demand drastically increase the loads on networks; therefore new, robust and highly efficient coding algorithms are necessary. One method overlooked to date, which can work alongside existing audio compression schemes, is to take account of the semantics and natural repetition of music through metadata tagging. Similarity detection within polyphonic audio has presented problematic challenges within the field of Music Information Retrieval. We present a system which works at the content level, thus rendering it applicable in existing streaming services. The MPEG–7 Audio Spectrum Envelope (ASE) provides features for extraction which, combined with k-means clustering, enable self-similarity detection within polyphonic audio.
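As a rough illustration of the clustering step, the sketch below (not the paper's implementation; the band layout, frame sizes and all parameters are invented) computes ASE-like log band energies on a toy two-section signal and groups frames with a minimal k-means, so that frames sharing a label are treated as mutually similar:

```python
import numpy as np

rng = np.random.default_rng(2)
sr = 8000
t = np.arange(sr * 4) / sr
# toy "song": sections A (440 Hz) and B (660 Hz) alternating, 1 s each
freqs = np.repeat([440.0, 660.0, 440.0, 660.0], sr)
x = np.sin(2 * np.pi * freqs * t) + 0.01 * rng.normal(size=t.size)

frame, hop, n_bands, k = 1024, 512, 16, 2
windows = np.lib.stride_tricks.sliding_window_view(x, frame)[::hop]
spec = np.abs(np.fft.rfft(windows * np.hanning(frame), axis=1)) ** 2
# ASE-like features: log energies in equal-width frequency bands
bands = np.array_split(spec, n_bands, axis=1)
feats = np.log10(np.stack([b.sum(axis=1) for b in bands], axis=1) + 1e-12)

# minimal k-means (k=2); deterministic init with one frame from each section
cent = feats[[10, 25]].copy()
for _ in range(20):
    lab = np.argmin(((feats[:, None] - cent[None]) ** 2).sum(-1), axis=1)
    cent = np.stack([feats[lab == j].mean(axis=0) if np.any(lab == j) else cent[j]
                     for j in range(k)])

# frames with the same cluster label are treated as mutually similar
S = (lab[:, None] == lab[None, :]).astype(int)
print("self-similarity matrix:", S.shape)
```

Repeated sections then show up as blocks of ones in `S`, which is the information a streaming client could exploit to conceal a large dropout with a previously received similar segment.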
On the Performance Analysis of Multi-antenna Relaying System over Rayleigh Fa... (IDES Editor)
In this work, the end-to-end performance of an amplify-and-forward multi-antenna infrastructure-based relay (fixed relay) system over a flat Rayleigh fading channel is investigated. New closed-form expressions for the statistics of the received signal-to-noise ratio (SNR) are presented and applied to study the outage probability and the average bit error rate of digital receivers. The results reveal that the system performance improves significantly (roughly 3 dB) for M=2 over that for M=1 at both low and high signal-to-noise ratio. However, little additional performance improvement can be achieved for M>2 relative to M=2 at high SNR.
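The diversity trend described above can be checked with a quick Monte Carlo sketch. This is not the paper's closed-form analysis; it simply simulates outage for maximal-ratio combining of M i.i.d. Rayleigh branches (the threshold and average SNR values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 200_000
gamma_bar = 10.0        # average per-branch SNR (10 dB)
gamma_th = 5.0          # outage threshold SNR

out = {}
for M in (1, 2, 4):
    # branch powers |h|^2 are exponential under Rayleigh fading; MRC adds SNRs
    g = gamma_bar * rng.exponential(size=(trials, M)).sum(axis=1)
    out[M] = (g < gamma_th).mean()
    print(f"M={M}: outage probability ~ {out[M]:.4f}")
```

The outage drop from M=1 to M=2 is large, while adding further antennas yields diminishing returns, matching the qualitative conclusion of the abstract.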
A COMPARATIVE PERFORMANCE STUDY OF OFDM SYSTEM WITH THE IMPLEMENTATION OF COM... (ijcsa)
This paper presents a comparative performance analysis of a wireless orthogonal frequency division multiplexing (OFDM) system with the implementation of a comb-type pilot-based channel estimation algorithm over frequency-selective multipath fading channels. The Minimum Mean Square Error (MMSE) method is used for channel estimation at the pilot frequencies. For channel estimation at the data frequencies, different interpolation techniques such as low-pass, linear, and second-order interpolation are employed. The OFDM system simulation has been carried out with Matlab, and the performance is analyzed in terms of bit error rate (BER) for various signal mappings (BPSK, QPSK, 4QAM, 16QAM, and 64QAM) and channel (Rayleigh and Rician) conditions. The impact of the selected number of channel taps on the BER performance is also investigated.
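A minimal sketch of comb-type pilot estimation, assuming unit-amplitude pilots and a plain LS estimate at the pilot tones followed by linear interpolation (the paper uses MMSE at the pilots; LS is substituted here for brevity, and all dimensions are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 64, 8                      # subcarriers, pilot spacing (comb type)
h = (rng.normal(size=4) + 1j * rng.normal(size=4)) / np.sqrt(2)  # 4-tap channel
H = np.fft.fft(h, N)              # true channel frequency response

X = np.ones(N, dtype=complex)     # transmitted symbols (pilots set to 1)
noise = 0.05 * (rng.normal(size=N) + 1j * rng.normal(size=N))
Y = H * X + noise                 # received frequency-domain symbol

pilots = np.arange(0, N, P)       # comb pilot subcarrier indices
H_p = Y[pilots] / X[pilots]       # LS estimate at the pilot tones

# linear interpolation to the data subcarriers (real/imag parts separately)
k = np.arange(N)
H_hat = np.interp(k, pilots, H_p.real) + 1j * np.interp(k, pilots, H_p.imag)

mse = np.mean(np.abs(H_hat - H) ** 2) / np.mean(np.abs(H) ** 2)
print(f"normalized MSE of the interpolated estimate: {mse:.4f}")
```

Swapping `np.interp` for a low-pass or second-order interpolator is exactly the design axis the paper compares.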
International Journal of Engineering and Science Invention (IJESI) (inventionjournals)
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews across the whole field of engineering, science and technology, including new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
Investigation and Analysis of SNR Estimation in OFDM system (IOSR Journals)
Estimation of the signal-to-noise ratio (SNR) of the received signal is needed to transmit signals effectively in modern communication systems. The performance of existing non-data-aided (NDA) SNR estimation methods is substantially degraded for high-level modulation schemes such as M-ary amplitude and phase shift keying (APSK) or quadrature amplitude modulation (QAM). In this paper an SNR estimation method is proposed which uses the zero-point auto-correlation of the received signal per block and the auto-/cross-correlation of the decision feedback signal in an orthogonal frequency division multiplexing (OFDM) system. The proposed method comes in two types: Type 1 estimates SNR by the zero-point auto-correlation of the decision feedback signal based on the second-moment property, while Type 2 uses both the zero-point auto-correlation and the cross-correlation based on the fourth-moment property. In block-by-block reception of an OFDM system, these two SNR estimation methods are practical to implement because they are correlation-based, and they show more stable estimation performance than earlier SNR estimation methods.
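A classic blind estimator in the same moment-based spirit is the M2M4 estimator, which recovers SNR from the second and fourth moments of the received samples. This sketch (not the paper's exact Type 1/Type 2 algorithms) applies it to QPSK in AWGN:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
true_snr_db = 10.0
n_pow = 10 ** (-true_snr_db / 10)          # noise power for unit signal power

# QPSK symbols (constant modulus) plus complex white Gaussian noise
sym = (rng.choice([-1, 1], size=n) + 1j * rng.choice([-1, 1], size=n)) / np.sqrt(2)
noise = np.sqrt(n_pow / 2) * (rng.normal(size=n) + 1j * rng.normal(size=n))
r = sym + noise

# blind M2M4 estimate: for a constant-modulus signal, S^2 = 2*M2^2 - M4
m2 = np.mean(np.abs(r) ** 2)
m4 = np.mean(np.abs(r) ** 4)
s_hat = np.sqrt(max(2 * m2 ** 2 - m4, 0.0))   # estimated signal power
n_hat = m2 - s_hat                            # estimated noise power
snr_hat_db = 10 * np.log10(s_hat / n_hat)
print(f"true SNR 10.0 dB, M2M4 estimate ~ {snr_hat_db:.2f} dB")
```

Like the paper's methods, this needs no training data; the identity `S^2 = 2*M2^2 - M4` holds for constant-modulus constellations, which is why performance degrades for multi-level schemes such as 16-QAM, the very problem the paper targets.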
MIMO Channel Estimation Using the LS and MMSE Algorithm (IOSRJECE)
Wireless communication technology has developed considerably over the past few years. Multiple Input Multiple Output (MIMO) is one of the techniques used to enhance data rates, in which multiple antennas are employed at both the transmitter and the receiver. Multiple signals are transmitted from different antennas at the transmitter using the same frequency, separated in space. Various channel estimation techniques are employed in order to judge the physical effects of the propagation medium. In this paper, we analyze and implement estimation techniques for MIMO systems such as Least Squares (LS) and Minimum Mean Square Error (MMSE); these techniques are then compared in their ability to effectively estimate the channel in a MIMO system. The results demonstrate that the SNR required to support a given bit error rate varies with the degree of correlation between the transmitting and receiving antennas. In addition, it is illustrated that when the number of transmitter and receiver antennas increases, the performance of TBCE schemes significantly improves. The performance of MMSE and LS estimation is the same for all kinds of modulation at small SNR values, but as the SNR increases the performance gap keeps widening.
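A minimal sketch of the two estimators for a training-based setup, assuming orthogonal pilot rows, i.i.d. CN(0,1) channel entries as the prior, and a known noise variance for the MMSE shrinkage (all dimensions are invented):

```python
import numpy as np

rng = np.random.default_rng(4)
Nt, Nr, T = 4, 64, 4              # tx antennas, rx antennas, training length
sigma2 = 1.0                      # noise variance (assumed known for MMSE)

# orthogonal training: rows of a DFT matrix, so P @ P.conj().T = T * I
P = np.exp(-2j * np.pi * np.outer(np.arange(Nt), np.arange(T)) / T)

H = (rng.normal(size=(Nr, Nt)) + 1j * rng.normal(size=(Nr, Nt))) / np.sqrt(2)
W = np.sqrt(sigma2 / 2) * (rng.normal(size=(Nr, T)) + 1j * rng.normal(size=(Nr, T)))
Y = H @ P + W                     # received training observations

H_ls = Y @ np.linalg.pinv(P)      # least-squares estimate (no prior)
# MMSE estimate: LS shrunk toward the zero-mean channel prior
H_mmse = Y @ P.conj().T @ np.linalg.inv(P @ P.conj().T + sigma2 * np.eye(Nt))

results = {}
for name, Hh in (("LS", H_ls), ("MMSE", H_mmse)):
    results[name] = np.mean(np.abs(Hh - H) ** 2)
    print(f"{name} MSE: {results[name]:.4f}")
```

At low SNR the shrinkage buys MMSE a visible MSE advantage; as `sigma2` shrinks the two estimates coincide, matching the abstract's observation that the estimators agree at one end of the SNR range and diverge at the other.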
Generating Musical Notes and Transcription using Deep Learning (Varad Meru)
Music has always been one of the most followed art forms, and a lot of research has gone into understanding it. In recent years, deep learning approaches for building unsupervised hierarchical representations from unlabeled data have gained significant interest. Progress in fields such as image processing and natural language processing has been substantial, but to my knowledge, methods for learning representations from auditory data have not been studied extensively. In this project I try two methods for generating music from a range of musical inputs, from MIDI to complex WAV formats. I use RNN-RBMs and CDBNs to explore music.
Speech Analysis and Synthesis Using a Vocoder (IJTET Journal)
Abstract— In this paper, a speech analysis and synthesis system using a vocoder is proposed. Voice conversion systems do not create new speech signals, but transform existing ones. The proposed speech vocoding differs from conventional speech coding: the speech signal is analyzed and represented with a smaller number of bits, so that bandwidth efficiency can be increased, and the speech signal is then synthesized from the received bits of information. Three aspects of the analysis are discussed: pitch refinement, spectral envelope estimation, and maximum voiced frequency estimation. A quasi-harmonic analysis model can be used to implement a pitch refinement algorithm which improves the accuracy of the spectral estimation, and a harmonic-plus-noise model is used to reconstruct the speech signal from the parameters. The final aim is to achieve the highest possible resynthesis quality using the lowest possible number of bits to transmit the speech signal. Future work aims at incorporating phase information into the analysis and modeling process and at synthesizing these three aspects over different pitch periods.
Demixing Commercial Music Productions via Human-Assisted Time-Frequency Masking (dhia_naruto)
This convention paper has been reproduced from the author’s advance manuscript, without editing, corrections, or
consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be
obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York
10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is
not permitted without direct permission from the Journal of the Audio Engineering Society.
Teager Energy Operation on Wavelet Packet Coefficients for Enhancing Noisy Sp... (CSCJournals)
In this paper, a new thresholding-based speech enhancement approach is presented, where the threshold is statistically determined by applying the Teager energy operation to the Wavelet Packet (WP) coefficients of noisy speech. The threshold thus obtained is applied to the WP coefficients of the noisy speech using a hard thresholding function in order to obtain enhanced speech. Detailed simulations are carried out in the presence of white, car, pink, and babble noises to evaluate the performance of the proposed method. Standard objective measures, spectrogram representations and subjective listening tests show that the proposed method outperforms existing state-of-the-art thresholding-based speech enhancement approaches for noisy speech from high to low SNR levels.
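The idea can be sketched in plain NumPy; for brevity this toy version thresholds DFT coefficients instead of wavelet packet coefficients, and derives the threshold from the Teager energy of the coefficient magnitudes (a stand-in for the paper's statistically determined, subband-dependent threshold):

```python
import numpy as np

def teager(v):
    # discrete Teager energy operator: psi[n] = v[n]^2 - v[n-1]*v[n+1]
    psi = np.empty_like(v)
    psi[1:-1] = v[1:-1] ** 2 - v[:-2] * v[2:]
    psi[0], psi[-1] = psi[1], psi[-2]
    return psi

rng = np.random.default_rng(5)
n = 4096
clean = np.sin(2 * np.pi * 82 * np.arange(n) / n)   # tone centered on FFT bin 82
noisy = clean + 0.3 * rng.normal(size=n)

C = np.fft.rfft(noisy)
mag = np.abs(C)
# Teager-energy-guided threshold on the coefficient magnitudes
thr = np.sqrt(np.abs(teager(mag)).mean())
enhanced = np.fft.irfft(np.where(mag > thr, C, 0), n)  # hard thresholding

snr_in = 10 * np.log10(np.sum(clean**2) / np.sum((noisy - clean) ** 2))
snr_out = 10 * np.log10(np.sum(clean**2) / np.sum((enhanced - clean) ** 2))
print(f"SNR in {snr_in:.1f} dB -> out {snr_out:.1f} dB")
```

The Teager energy is dominated by coefficients carrying coherent signal, so the derived threshold sits above the noise floor but below the signal peaks, which is the property the paper exploits per WP subband.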
Performance analysis of the convolutional recurrent neural network on acousti... (journalBEEI)
In this study, we attempted to find the optimal hyper-parameters of the convolutional recurrent neural network (CRNN) by investigating its performance on acoustic event detection. Important hyper-parameters such as the input segment length, learning rate, and criterion for the convergence test, were determined experimentally. Additionally, the effects of batch normalization and dropout on the performance were measured experimentally to obtain their optimal combination. Further, we studied the effects of varying the batch data on every iteration during the training. From the experimental results using the TUT sound events synthetic 2016 database, we obtained optimal performance with a learning rate of 1/10000. We found that a longer input segment length aided performance improvement, and batch normalization was far more effective than dropout. Finally, performance improvement was clearly observed by varying the starting points of the batch data for each iteration during the training.
METHOD FOR REDUCING NOISE BY IMPROVING SIGNAL-TO-NOISE RATIO IN WIRELESS LAN (IJNSA Journal)
The signal-to-noise ratio (SNR) is one of the important measures for reducing noise. A technique that uses a linear prediction error filter (LPEF) and an adaptive digital filter (ADF) to achieve noise reduction in speech and images degraded by additive background noise is proposed. Since a speech signal can be represented as stationary over a short interval of time, most of the speech signal can be predicted by the LPEF. This estimation is performed by the ADF, which is used for system identification. Noise reduction is achieved by subtracting the reconstructed noise from the speech degraded by additive background noise. Most MR image acceleration methods suffer from degradation of the acquired images, often correlated with the degree of acceleration; however, Wideband MRI is a novel technique that transcends such flaws. In this paper we propose the LPEF and ADF for reducing noise in speech, and we also demonstrate that Wideband MRI is capable of obtaining images with quality identical to conventional MR images in terms of SNR in a wireless LAN.
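The LPEF idea, in which the predictable narrowband part of the signal passes through the predictor while broadband noise ends up in the prediction error, can be sketched with a simple LMS adaptive predictor on a sinusoid in white noise (an adaptive-line-enhancer toy, not the paper's full LPEF+ADF system; all parameters are invented):

```python
import numpy as np

rng = np.random.default_rng(6)
n, order, delay, mu = 20000, 32, 1, 0.002
clean = np.sin(2 * np.pi * 0.05 * np.arange(n))   # predictable "speech" part
x = clean + 0.5 * rng.normal(size=n)              # plus white background noise

w = np.zeros(order)
y = np.zeros(n)
for i in range(order + delay, n):
    u = x[i - delay - order + 1:i - delay + 1][::-1]  # past samples, newest first
    y[i] = w @ u                                      # one-step prediction
    e = x[i] - y[i]                                   # prediction error (noise-like)
    w += mu * e * u                                   # LMS weight update

tail = slice(n // 2, None)                            # evaluate after convergence
err_in = np.mean((x[tail] - clean[tail]) ** 2)
err_out = np.mean((y[tail] - clean[tail]) ** 2)
print(f"MSE vs clean: noisy {err_in:.3f} -> predictor output {err_out:.3f}")
```

The delay decorrelates the white noise between the regressor and the target, so only the correlated (predictable) component survives at the predictor output.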
We applied various architectures of deep neural networks to sound event detection and compared their performance on two different datasets. Feed-forward neural networks (FNN), convolutional neural networks (CNN), recurrent neural networks (RNN) and convolutional recurrent neural networks (CRNN) were implemented using hyper-parameters optimized for each architecture and dataset. The results show that the performance of deep neural networks varies significantly depending on the learning rate, which can be optimized by conducting a series of experiments on the validation data over predetermined ranges. Among the implemented architectures, the CRNN performed best under all testing conditions, followed by the CNN. Although the RNN was effective in tracking time-correlation information in audio signals, it exhibited inferior performance compared to the CNN and the CRNN. Accordingly, it is necessary to develop more optimization strategies for implementing RNNs in sound event detection.
High Quality Arabic Concatenative Speech Synthesis (sipij)
This paper describes the implementation of TD-PSOLA tools to improve the quality of an Arabic Text-to-Speech (TTS) system based on diphone concatenation with a TD-PSOLA modification synthesizer. It presents techniques to improve the precision of prosodic modifications in Arabic speech synthesis using the TD-PSOLA (Time Domain Pitch Synchronous Overlap-Add) method, an approach based on the decomposition of the signal into overlapping frames synchronized with the pitch period. The main objective is to preserve the consistency and accuracy of the pitch marks after prosodic modifications of the speech signal, together with adjustment and optimisation of the diphone database with integrated vowels.
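The core TD-PSOLA mechanism can be sketched for the simplest case of a constant-pitch voiced sound: extract two-period Hann-windowed grains at the pitch marks and overlap-add them at a rescaled spacing. Real systems must locate the pitch marks and handle time-varying f0; both are assumed given in this toy:

```python
import numpy as np

def psola_time_stretch(x, fs, f0, rate):
    """Toy TD-PSOLA time-scale modification for a constant-pitch voiced sound."""
    period = int(round(fs / f0))
    marks = np.arange(period, len(x) - period, period)  # analysis pitch marks
    out_len = int(len(x) * rate)
    y = np.zeros(out_len)
    norm = np.zeros(out_len)
    w = np.hanning(2 * period)
    for m in marks:
        grain = x[m - period:m + period] * w            # two-period grain
        c = int(m * rate)                               # synthesis position
        if c - period < 0 or c + period > out_len:
            continue
        y[c - period:c + period] += grain               # overlap-add
        norm[c - period:c + period] += w
    return y / np.maximum(norm, 1e-8)                   # window-sum normalization

fs, f0 = 16000, 200.0
t = np.arange(fs // 2) / fs
x = np.sin(2 * np.pi * f0 * t)                          # toy voiced signal
y = psola_time_stretch(x, fs, f0, rate=1.5)             # 1.5x longer, same pitch
print(len(x), "->", len(y))
```

Because each grain keeps its local waveform shape and the grains are re-spaced rather than resampled, the duration changes while the pitch period inside each grain, and hence the perceived pitch, is preserved.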
Improvement of minimum tracking in Minimum Statistics noise estimation method (CSCJournals)
Noise spectrum estimation is a fundamental component of speech enhancement and speech recognition systems. In this paper we propose a new method for minimum tracking in the Minimum Statistics (MS) noise estimation method. This noise estimation algorithm is proposed for highly non-stationary noise environments. The improvement was confirmed with formal listening tests, which indicated that the proposed noise estimation algorithm, when integrated into speech enhancement, was preferred over other noise estimation algorithms.
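The baseline MS principle, namely that the minimum of a recursively smoothed power spectrum over a sliding window tracks the noise floor even while speech is active, can be sketched as follows (bias compensation and the paper's improved minimum tracking are omitted; all parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
frames, bins = 300, 129
noise_psd = np.linspace(0.5, 2.0, bins)             # stationary colored noise PSD
p = rng.exponential(noise_psd, size=(frames, bins)) # noisy periodograms
p[100:200, 20:60] += 8.0                            # add a speech-active region

alpha, D = 0.85, 75                                 # smoothing factor, window
p_s = np.empty_like(p)
p_s[0] = p[0]
for t in range(1, frames):                          # recursive PSD smoothing
    p_s[t] = alpha * p_s[t - 1] + (1 - alpha) * p[t]

noise_hat = np.empty_like(p)                        # minimum over sliding window
for t in range(frames):
    noise_hat[t] = p_s[max(0, t - D + 1):t + 1].min(axis=0)

print("true noise PSD at bin 100:", round(noise_psd[100], 3))
print("MS estimate (biased low):  ", round(noise_hat[-1, 100], 3))
```

Taking a minimum systematically underestimates the mean noise power, which is why the full MS algorithm multiplies the tracked minimum by a bias compensation factor.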
A NOVEL ENF EXTRACTION APPROACH FOR REGION-OF-RECORDING IDENTIFICATION OF MED... (cscpconf)
The electric network frequency (ENF) of power lines leaves its trace in nearby media recordings. The ENF signals vary in a consistent way in a given power grid. Therefore, it is possible to develop signal processing and machine learning techniques to identify the grid of origin by extracting attributes of the embedded ENF signal in recorded audio. This paper presents a model based on a novel ENF extraction technique with training on audio and power recordings from different grids. The proposed approach is based on correcting erroneously selected peaks from the Short Time Fourier Transform (STFT) by leveraging time correlations. These peaks are mistakenly taken for the frequency component belonging to the embedded ENF signal of the power grid and are corrected by the algorithm. Results on a test set of 50 recordings from nine different locations demonstrate the effectiveness of the proposed approach with an overall accuracy of 88%.
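The extraction-plus-correction idea can be sketched on synthetic data: pick the STFT magnitude peak in a narrow band around the nominal mains frequency in each frame, then use temporal continuity to replace peaks that jump away from their neighborhood median (all parameters here are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(8)
fs, dur = 800, 60
t = np.arange(fs * dur) / fs
# synthetic ENF: nominal 50 Hz plus a slow random walk of a few mHz per step
enf = 50 + np.cumsum(0.0005 * rng.normal(size=t.size))
x = np.sin(2 * np.pi * np.cumsum(enf) / fs) + 0.5 * rng.normal(size=t.size)

win, hop = 8 * fs, 4 * fs                       # 8 s analysis windows, 4 s hop
freqs = np.fft.rfftfreq(win, 1 / fs)
band = (freqs > 49.5) & (freqs < 50.5)          # search band around nominal ENF
est = []
for start in range(0, t.size - win + 1, hop):
    seg = x[start:start + win] * np.hanning(win)
    mag = np.abs(np.fft.rfft(seg))
    est.append(freqs[band][np.argmax(mag[band])])
est = np.array(est)

# exploit time correlation: peaks that jump away from the local median are
# treated as erroneously selected STFT peaks and replaced by that median
med = np.array([np.median(est[max(0, i - 2):i + 3]) for i in range(len(est))])
corrected = np.where(np.abs(est - med) > 0.05, med, est)
print("corrected ENF trajectory (Hz):", corrected[:5].round(3))
```

The corrected trajectory is what a grid classifier would consume; grids differ in the statistics of these slow frequency wanders, which is the attribute the paper's model learns.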
Real Time Speech Enhancement in the Waveform Domain
We present a causal speech enhancement model working on the raw waveform that runs in real time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip-connections. It is optimized in both the time and frequency domains, using multiple loss functions. Empirical evidence shows that it is capable of removing various kinds of background noise, including stationary and non-stationary noises, as well as room reverb. Additionally, we suggest a set of data augmentation techniques applied directly on the raw waveform which further improve model performance and its generalization abilities. We perform evaluations on several standard benchmarks, both using objective metrics and human judgements. The proposed model matches the state-of-the-art performance of both causal and non-causal methods while working directly on the raw waveform.
Index Terms: Speech enhancement, speech denoising, neural networks, raw waveform
SINGING VOICE MELODY TRANSCRIPTION USING DEEP NEURAL NETWORKS
François Rigaud and Mathieu Radenen
Audionamix R&D
171 quai de Valmy, 75010 Paris, France
<firstname>.<lastname>@audionamix.com
ABSTRACT
This paper presents a system for the transcription of singing voice melodies in polyphonic music signals based on Deep Neural Network (DNN) models. In particular, a new DNN system is introduced for performing the f0 estimation of the melody, and another DNN, inspired by recent studies, is learned for segmenting vocal sequences. Preparation of the data and learning configurations related to the specificity of both tasks are described. The performance of the melody f0 estimation system is compared with a state-of-the-art method and exhibits higher accuracy through a better generalization on two different music databases. Insights into the global functioning of this DNN are proposed. Finally, an evaluation of the global system combining the two DNNs for singing voice melody transcription is presented.
1. INTRODUCTION
The automatic transcription of the main melody from polyphonic music signals is a major task of Music Information Retrieval (MIR) research [19]. Indeed, besides applications to musicological analysis or music practice, the use of the main melody as prior information has been shown useful in various types of higher-level tasks such as music genre classification [20], music retrieval [21], music de-soloing [4, 18] or lyrics alignment [15, 23]. From a signal processing perspective, the main melody can be represented by sequences of fundamental frequency (f0) defined on voicing instants, i.e. on portions where the instrument producing the melody is active. Hence, main melody transcription algorithms usually follow two main processing steps. First, a representation emphasizing the most likely f0s over time is computed, e.g. under the form of a salience matrix [19], a vocal source activation matrix [4] or an enhanced spectrogram [22]. Second, a binary classification of the selected f0s between melodic and background content is performed using melodic contour detection/tracking and voicing detection.
© François Rigaud and Mathieu Radenen. Licensed under
a Creative Commons Attribution 4.0 International License (CC BY 4.0).
Attribution: François Rigaud and Mathieu Radenen. “Singing Voice
Melody Transcription using Deep Neural Networks”, 17th International
Society for Music Information Retrieval Conference, 2016.
In this paper we propose to tackle the melody transcrip-
tion task as a supervised classification problem where each
time frame of signal has to be assigned into a pitch class
when a melody is present and an ‘unvoiced’ class when it is
not. Such an approach was proposed in [5], where melody
transcription is performed by applying a Support Vector
Machine to input features composed of Short-Time Fourier
Transform (STFT) frames. Similarly, for noisy speech signals,
f0 estimation algorithms based on Deep Neural Networks
(DNN) have been introduced in [9,12].
Following such fully data driven approaches we intro-
duce a singing voice melody transcription system com-
posed of two DNN models respectively used to perform
the f0 estimation task and the Voice Activity Detection
(VAD) task. The main contribution of this paper is to
present a DNN architecture able to discriminate the differ-
ent f0s from low-level features, namely spectrogram data.
Compared to a well-known state-of-the-art method [19],
it shows significant improvements in terms of f0 accu-
racy through an increase of robustness with regard to mu-
sical genre and a reduction of octave-related errors. By
analyzing the weights of the network, the DNN is shown
to be roughly equivalent to a simple harmonic-sum method
whose parameters, usually set empirically, are here
automatically learned from the data, and where the succession
of non-linear layers likely increases the power of
discrimination between harmonically related f0s. For the task of VAD,
another DNN model, inspired by [13], is learned. For
both models, special care is taken to prevent over-fitting
issues by using different databases and perturbing the data
with audio degradations. Performance of the whole system
is finally evaluated and shows promising results.
The rest of the paper is organized as follows. Section 2
presents an overview of the whole system. Sections 3 and
4 introduce the DNN models and detail the learning con-
figurations respectively for the VAD and the f0 estimation
task. Then, Section 5 presents an evaluation of the system
and Section 6 concludes the study.
2. SYSTEM OVERVIEW
2.1 Global architecture
The proposed system, displayed on Figure 1, is composed
of two independent parallel DNN blocks that perform re-
spectively the f0 melody estimation and the VAD.
Figure 1: Architecture of the proposed system for singing
voice melody transcription.
In contrast with [9, 12], which propose a single DNN model
to perform both tasks, we did not find such a unified
functional architecture able to successfully discriminate a time
frame between quantized f0 and ‘unvoiced’ classes. Indeed,
the models presented in these studies are designed
for speech signals mixed with background noise for which
the discrimination between a frame of noise and a frame
of speech is very likely related to the presence or absence
of a pitched structure, which is also probably the kind of
information on which the system relies to estimate the f0.
Conversely, with music signals both the melody and the
accompaniment exhibit harmonic structures and the voic-
ing discrimination usually requires different levels of in-
formation, e.g. under the form of timbral features such as
Mel-Frequency Cepstral Coefficients.
Another characteristic of the proposed system is the par-
allel architecture that allows considering different types of
input data for the two DNNs and which arises from the
application restricted to vocal melodies. Indeed, unlike
generic systems dealing with main melody transcription of
different instruments (often within a same piece of music)
which usually process the f0 estimation and the voicing
detection sequentially, the focus on singing voice here hardly
allows for a voicing detection relying only on the distribution
and statistics of the candidate pitch contours and/or
their energy [2, 19]. Thus, this constraint requires building
a specific VAD system that should learn to discriminate
the timbre of a vocal melody from an instrumental melody,
such as one played by a saxophone.
2.2 Signal decomposition
As shown on Figure 1, both DNN models are preceded by
a signal decomposition. At the input of the global system,
audio signals are first converted to mono and re-sampled to
16 kHz. Then, following [13], it is proposed to provide the
DNNs with a set of pre-decomposed signals obtained by
applying a double-stage Harmonic/Percussive Source Sep-
aration (HPSS) [6,22] on the input mixture signal. The key
idea behind double-stage HPSS is to consider that within a
mix, melodic signals are usually less stable/stationary than
the background ‘harmonic’ instruments (such as a bass or
a piano), but more than the percussive instruments (such
as the drums). Thus, depending on the frequency resolution
used to compute the STFT, applying a harmonic/percussive
decomposition to a mixture spectrogram
leads to a rough separation where the melody is mainly
extracted either in the harmonic or in the percussive content.
Using such pre-processing, 4 different signals are ob-
tained. First, the input signal s is decomposed into the sum
of h1 and p1 using a high-frequency resolution STFT (typ-
ically with a window of about 300 ms) where p1 should
mainly contain the melody and the drums, and h1 the
remaining stable instrument signals. Second, p1 is fur-
ther decomposed into the sum of h2 and p2 using a low-
frequency resolution STFT (typically with a window of
about 30 ms), where h2 mainly contains the melody, and
p2 the drums. As presented later in Sections 3 and 4, different
subsets of these 4 signals, or combinations of them, will
be used to experimentally determine the optimal DNN models.
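The double-stage decomposition can be sketched with the median-filtering HPSS of [6]: within one magnitude spectrogram, harmonic content is smooth along time while percussive content is smooth along frequency. Below is a minimal numpy/scipy sketch run on synthetic spectrograms; the `hpss` helper, the kernel size, and the hard masking are illustrative assumptions (the real system also re-computes the STFT of p1 with a shorter window rather than reusing the masked spectrogram):

```python
import numpy as np
from scipy.ndimage import median_filter

def hpss(mag, kernel=17):
    """One median-filtering HPSS pass on a magnitude spectrogram
    (freq x time): harmonic energy is smooth along time, percussive
    energy is smooth along frequency (FitzGerald-style)."""
    harm = median_filter(mag, size=(1, kernel))   # smooth across time
    perc = median_filter(mag, size=(kernel, 1))   # smooth across frequency
    mask_h = harm >= perc                         # hard, complementary masks
    return mag * mask_h, mag * (~mask_h)

# Double-stage scheme of Section 2.2; the spectrograms here are synthetic
# placeholders for the ~300 ms and ~30 ms window analyses.
rng = np.random.default_rng(0)
S_long = rng.random((513, 200))   # high-frequency-resolution |STFT| of s
h1, p1 = hpss(S_long)             # h1: stable instruments, p1: melody + drums
h2, p2 = hpss(p1)                 # h2: melody, p2: drums
assert np.allclose(h1 + p1, S_long)   # the two parts sum back to the input
```

Because the masks are complementary, each stage conserves the input spectrogram exactly, mirroring the additive decompositions s = h1 + p1 and p1 = h2 + p2 described above.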
2.3 Learning data
Several annotated databases composed of polyphonic mu-
sic with transcribed melodies are used for building the
train, validation and test datasets used for the learning (cf.
Sections 3 and 4) and the evaluation (cf. Section 5) of the
DNNs. In particular, a subset of RWC Popular Music and
Royalty Free Music [7] and MIR-1k [10] databases are
used for the train dataset, and the recent databases Med-
leyDB [1] and iKala [3] are split between train, validation
and test datasets. Note that for iKala the vocal and instru-
mental tracks are mixed with a relative gain of 0 dB.
Also, in order to minimize over-fitting issues and to in-
crease the robustness of the system with respect to audio
equalization and encoding degradations, we use the Audio
Degradation Toolbox [14]. Thus, several files composing
the train and validation datasets (50% for the VAD task and
25% for the f0 estimation task) are duplicated with one
degraded version, the degradation type being randomly chosen
among those that preserve the alignment between the audio
and the annotation (e.g. excluding degradations that produce
time/pitch warping or overly long reverberation effects).
3. VOICE ACTIVITY DETECTION WITH DEEP
NEURAL NETWORKS
This section briefly describes the process for learning the
DNN used to perform the VAD. It is largely inspired by
a previous study presented in more detail in [13]. A similar
architecture of deep recurrent neural network composed of
Bidirectional Long Short-Term Memory (BLSTM) [8] is
used. In our case the architecture is arbitrarily fixed to 3
BLSTM layers of 50 units each and a final feed-forward lo-
gistic output layer with one unit. As in [13], different
combinations of the pre-decomposed signals (cf. Section
2.2) are considered to determine an optimal network: s,
p1, h2, h1p1, h2p2 and h1h2p2. For each of these
pre-decomposed signals, timbral features are computed in
the form of mel-frequency spectrograms obtained using a
STFT with 32 ms long Hamming windows and 75 % of
overlap, and 40 triangular filters distributed on a mel scale
between 0 and 8000 Hz. Then, each feature of the input
Proceedings of the 17th ISMIR Conference, New York City, USA, August 7-11, 2016

Figure 2: VAD network illustration.
data is normalized using the mean and variance computed
over the train dataset. Contrary to [13], the learning is
performed in a single step, i.e. without adopting layer-by-layer
training.
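The 40-band mel-spectrogram features described above can be sketched with a standard triangular filterbank; a 32 ms window at 16 kHz gives 512-sample frames and 257 rFFT bins. The construction below is a minimal numpy sketch (the exact filter normalization used in [13] is not specified in the paper, so values are illustrative):

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy mel formula
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, sr=16000, fmin=0.0, fmax=8000.0):
    """Triangular filters spaced uniformly on the mel scale, matching the
    paper's 40 bands between 0 and 8000 Hz."""
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                 # rising edge of the triangle
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling edge
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

fb = mel_filterbank()
# placeholder |STFT| frames standing in for one pre-decomposed signal
spec = np.abs(np.random.default_rng(1).standard_normal((257, 100)))
mel_spec = fb @ spec     # 40 x 100 mel-spectrogram feature matrix
```

In the full system one such 40-band block is computed per pre-decomposed signal (hence the 120-dimensional input for the h1, h2, p2 combination) before per-feature mean/variance normalization.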
Finally, the best architecture is obtained for the combi-
nation of h1, h2 and p2 signals, thus for an input of size
120, which corresponds to a use of the whole information
present in the original signal (s = h1 + h2 + p2). An
illustration of this network is presented in Figure 2.
A simple post-processing, consisting of thresholding the
DNN output at 0.5, is finally applied to take the binary
voicing decision for each frame.
4. F0 ESTIMATION WITH DEEP NEURAL
NETWORKS
This section presents in detail the learning configuration
for the DNN used for performing the f0 estimation task.
An interpretation of the network functioning is finally pre-
sented.
4.1 Preparation of learning data
As proposed in [5], we decide to keep low-level features
to feed the DNN model. Compared to [12] and [9], which
use as input pre-computed representations known for
highlighting the periodicity of pitched sounds (respectively
based on an auto-correlation and a harmonic filtering), we
expect here the network to be able to learn an optimal
transformation automatically from spectrogram data. Thus
the set of selected features consists of log-spectrograms
(logarithm of the modulus of the STFT) computed from a
Hamming window of duration 64 ms (1024 samples for a
sampling frequency of 16000 Hz) with an overlap of 0.75,
and from which frequencies below 50 Hz and above 4000
Hz are discarded. For each music excerpt the correspond-
ing log-spectrogram is rescaled between 0 and 1. Since, as
described in Section 2.1, the VAD is performed by a sec-
ond independent system, all time frames for which no vo-
cal melody is present are removed from the dataset. These
features are computed independently for 3 different types
of input signal for which the melody should be more or less
emphasized: s, p1 and h2 (cf. Section 2.2).
For the output, the f0s are quantized between C#2
(f0 ≈ 69.29 Hz) and C#6 (f0 ≈ 1108.73 Hz) with a spacing
of an eighth of a tone, thus leading to a total of 193
classes.
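The class grid is easy to reproduce: an eighth of a tone is a quarter semitone (25 cents), so the four octaves from C#2 to C#6 give 4 × 48 = 192 steps, hence 193 classes. A short sketch (the helper `f0_to_class` is hypothetical, not from the paper):

```python
import numpy as np

# 193 classes from C#2 to C#6 with eighth-of-a-tone (quarter-semitone,
# i.e. 25-cent) spacing: 4 octaves = 48 quarter-semitones per octave.
F_MIN = 69.29                                        # C#2 in Hz (Section 4.1)
f0_grid = F_MIN * 2.0 ** (np.arange(193) / 48.0)     # 48 steps per octave

def f0_to_class(f0_hz):
    """Map a fundamental frequency to the nearest of the 193 classes."""
    return int(np.argmin(np.abs(np.log2(f0_grid) - np.log2(f0_hz))))

# the top of the grid recovers C#6 (~1108.7 Hz)
assert abs(f0_grid[-1] - 1108.73) < 1.0
```

As a sanity check, A4 (440 Hz) falls on the grid: it sits 2 octaves and 32 quarter-semitones above C#2, i.e. class index 128 in this 0-based numbering.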
The train and validation datasets including audio de-
graded versions are finally composed of, respectively,
22877 melodic sequences (resp. 3394) for a total duration
of about 220 minutes (resp. 29 min).
4.2 Training
Several experiments have been run to determine a func-
tional DNN architecture. In particular, two types of neuron
units have been considered: the standard feed-forward sig-
moid unit and the Bidirectional Long Short-Term Memory
(BLSTM) recurrent unit [8].
For each test, the weights of the network are initialized
randomly according to a Gaussian distribution with 0 mean
and a standard deviation of 0.1, and optimized to mini-
mize the cross-entropy error function. The learning is then
performed by means of a stochastic gradient descent with
shuffled mini-batches composed of 30 melodic sequences,
a learning rate of 10^-7 and a momentum of 0.9. The
optimization is run for a maximum of 10000 epochs and
early stopping is applied if no decrease is observed on the
validation set error during 100 consecutive epochs. In ad-
dition to the use of audio degradations during the prepa-
ration of the data for preventing over-fitting (cf. Section
2.3), the training examples are slightly corrupted during
the learning by adding a Gaussian noise with variance 0.05
at each epoch.
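A toy numpy sketch of this training setup (sigmoid hidden layers, softmax cross-entropy, SGD with momentum, and Gaussian input-noise injection) is shown below. Dimensions, the learning rate, and the epoch count are shrunk for illustration and intentionally do not match the full-scale 500-500-193 configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Tiny stand-in for the 500-500-193 network; weights drawn from N(0, 0.1)
# as in the paper, biases at zero.
dims = [30, 50, 50, 10]
W = [rng.normal(0.0, 0.1, (i, o)) for i, o in zip(dims[:-1], dims[1:])]
b = [np.zeros(o) for o in dims[1:]]
vW = [np.zeros_like(w) for w in W]      # momentum buffers
lr, mom, noise_std = 5e-2, 0.9, 0.05    # paper uses lr = 1e-7 at full scale

X = rng.random((30, dims[0]))           # one mini-batch of 30 frames
y = rng.integers(0, dims[-1], size=30)  # target f0 classes

for epoch in range(200):
    Xn = X + rng.normal(0.0, noise_std, X.shape)  # input-noise injection
    # forward pass: two sigmoid layers + softmax output
    h1 = sigmoid(Xn @ W[0] + b[0])
    h2 = sigmoid(h1 @ W[1] + b[1])
    p = softmax(h2 @ W[2] + b[2])
    # backward pass for the cross-entropy loss
    d3 = p.copy(); d3[np.arange(len(y)), y] -= 1.0; d3 /= len(y)
    d2 = (d3 @ W[2].T) * h2 * (1 - h2)
    d1 = (d2 @ W[1].T) * h1 * (1 - h1)
    grads = [Xn.T @ d1, h1.T @ d2, h2.T @ d3]
    for k in range(3):                   # SGD with momentum
        vW[k] = mom * vW[k] - lr * grads[k]
        W[k] += vW[k]
        b[k] -= lr * [d1, d2, d3][k].sum(axis=0)

# final cross-entropy evaluated on the clean (noise-free) batch
h1c = sigmoid(X @ W[0] + b[0])
h2c = sigmoid(h1c @ W[1] + b[1])
pc = softmax(h2c @ W[2] + b[2])
loss = -np.log(pc[np.arange(len(y)), y]).mean()
```

After training, `loss` drops below the chance level of ln(10) ≈ 2.30 for this toy batch; the noise is resampled at every epoch, as in the paper's data-corruption scheme.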
Among the different architectures tested, the best
classification performance is obtained for the input signal p1
(slightly better than for s, i.e. without pre-separation) by
a 2-hidden-layer feed-forward network with 500 sigmoid
units per layer and a 193-unit softmax output layer. An illustration
of this network is presented in Figure 3. Interestingly, for
that configuration the learning did not suffer from over-fitting,
so that it ended at the maximum number of epochs,
thus without early stopping.
While the temporal continuity of the f0 along time-
frames should provide valuable information, the use of
BLSTM recurrent layers (alone or in combination with
feed-forward sigmoid layers) did not lead to efficient
systems.

Figure 3: f0 estimation network illustration.
the inclusion of such temporal context in a feed-forward
DNN architecture, for instance by concatenating several
consecutive time frames in the input.
4.3 Post-processing
The output layer of the DNN, composed of softmax units,
returns an f0 probability distribution for each time frame,
which can be seen, for a full piece of music, as a pitch
activation matrix. In order to take a final decision that
accounts for the continuity of the f0 along melodic sequences,
a Viterbi tracking is finally applied to the network
output [5, 9, 12]. For that, the log-probability of the transition
between two consecutive time frames and two f0 classes is
simply set inversely proportional to their absolute
difference in semitones. For further improvement of
the system, such a transition matrix could be learned from
the data [5]; however, this simple rule yields appreciable
performance gains (when compared to a simple per-frame
maximum-picking post-processing without temporal context) while
potentially reducing the risk of over-fitting to a particular
music style.
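Such a Viterbi pass over the per-frame posteriors can be sketched as follows. The `penalty` constant and its exact scaling are illustrative assumptions; the paper only states that the transition log-probability decreases with the absolute pitch distance:

```python
import numpy as np

def viterbi(log_post, penalty=0.5):
    """Viterbi decoding over per-frame f0 log-posteriors of shape (T, K).
    The transition log-probability between classes i and j is taken as
    -penalty * |i - j| / 4 (classes are a quarter semitone apart), a
    simple stand-in for the rule described above."""
    T, K = log_post.shape
    idx = np.arange(K)
    log_trans = -penalty * np.abs(idx[:, None] - idx[None, :]) / 4.0
    delta = log_post[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # (from-class, to-class)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_post[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):            # backtrace
        path[t] = back[t + 1, path[t + 1]]
    return path

# toy posteriors: a noisy but continuous pitch trajectory is recovered
rng = np.random.default_rng(2)
post = rng.random((50, 193)) * 0.1
true_path = (60 + 5 * np.sin(np.arange(50) / 8.0)).astype(int)
post[np.arange(50), true_path] += 1.0
decoded = viterbi(np.log(post + 1e-9))
```

On this toy example the decoded path follows the planted trajectory, since the per-frame evidence dominates the small transition penalty between neighboring classes.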
Figure 4: Display of the weights for the two sigmoid feed-forward layers (top) and the softmax layer (bottom) of the DNN learned for the f0 estimation task.
4.4 Network weights interpretation
We propose here to gain insight into the network
functioning for this specific task of f0 estimation by analyzing
the weights of the DNN. The input is a short-time
spectrum and the output corresponds to an activation vector for
which a single element (the actual f0 of the melody at that
time frame) should be predominant. In that case, it is rea-
sonable to expect that the DNN somehow behaves like a
harmonic-sum operator.
While the visualization of the distribution of the hidden-layer
weights usually does not provide straightforward cues for
analyzing a DNN's functioning (cf. Figure 4), we consider a
simplified network in which each feed-forward logistic unit is
assumed to operate in the linear regime. Thus, removing the
non-linear operations, the output of a feed-forward layer with
index l composed of N_l units writes

x_l = W_l · x_{l-1} + b_l,    (1)

where x_l ∈ R^{N_l} (resp. x_{l-1} ∈ R^{N_{l-1}}) corresponds to the
output vector of layer l (resp. l-1), W_l ∈ R^{N_l × N_{l-1}} is
the weight matrix and b_l ∈ R^{N_l} the bias vector. Using this
expression, the output of the layer with index L, expressed as
the propagation of the input x_0 through the linear network,
also writes

x_L = W · x_0 + b,    (2)

where W = ∏_{l=1}^{L} W_l corresponds to a global weight matrix,
and b to a global bias that depends on the set of parameters
{W_l, b_l, ∀l ∈ [1..L]}.
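The collapse of the linearized network into a single affine map is easy to verify numerically; the sketch below uses small random matrices standing in for the learned layers (dimensions shrunk for illustration), with W_l of shape (N_l, N_{l-1}) as in Eq. (1):

```python
import numpy as np

rng = np.random.default_rng(3)
# three layer matrices standing in for the learned network, W_l: (N_l, N_{l-1})
W1, W2, W3 = (rng.standard_normal((6, 8)),
              rng.standard_normal((6, 6)),
              rng.standard_normal((4, 6)))
b1, b2, b3 = (rng.standard_normal(6),
              rng.standard_normal(6),
              rng.standard_normal(4))

x0 = rng.standard_normal(8)       # input "spectrum"
# layer-by-layer propagation with the non-linearities removed (Eq. 1)
x1 = W1 @ x0 + b1
x2 = W2 @ x1 + b2
x3 = W3 @ x2 + b3

# collapsed form (Eq. 2): one global weight matrix and one global bias
W = W3 @ W2 @ W1
b = W3 @ (W2 @ b1 + b2) + b3
assert np.allclose(x3, W @ x0 + b)
```

The row of W associated with one output class is exactly the linear "template" applied to the input spectrum, which is why the learned W can be read as a bank of harmonic combs, as discussed next.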
As mentioned above, in our case x0 is a short-time
spectrum and xL is an f0 activation vector. The global weight
matrix should thus present some characteristics of a pitch
detector. Indeed, as displayed in Figure 5a, the matrix
W for the learned DNN (which is thus the product of the
3 weight matrices depicted in Figure 4) exhibits a
harmonic structure for most output classes of f0s, except for
Figure 5: Linearized DNN illustration. (a) Visualization of the (transposed) weight matrix W. The x-axis corresponds to the output class indices (the f0s) and the y-axis represents the input feature indices (frequency channel of the spectrum input). (b) Weights display for the f0 output class with index 100.
some f0s in the low and high frequency range for which no
or too few examples are present in the learning data.
Most approaches dealing with main melody transcription
usually rely on this type of transformation to
compute a representation emphasizing f0 candidates (or a
salience function), and are often partly based on hand-crafted
designs [11, 17, 19]. Interestingly, using a fully
data-driven method as proposed, the parameters of a comparable
weighted harmonic summation algorithm (such as the
number of harmonics to consider for each note and their
respective weights) do not have to be defined. This can be
observed in more detail in Figure 5b, which depicts the
linearized network weights for the class with index 100 (f0 ≈
289.43 Hz). Moreover, while this interpretation assumes a
linear network, one can expect that the non-linear opera-
tions actually present in the network help in enhancing the
discrimination between the different f0 classes.
5. EVALUATION
5.1 Experimental procedure
Two different test datasets composed of full music ex-
cerpts (i.e. vocal and non vocal portions) are used for
the evaluation. One is composed of 17 tracks from Med-
leyDB (last songs comprising vocal melodies, from Mu-
sicDelta Reggae to Wolf DieBekherte, for a total of ∼ 25.5
min of vocal portions) and the other is composed of 63
tracks from iKala (from 54223 chorus to 90587 verse for
a total of ∼ 21 min of vocal portions).
The evaluation is conducted in two steps. First, the
performance of the f0 estimation DNN taken alone (thus
without voicing detection) is compared with the state-of-the-art
system melodia [19] using f0 accuracy metrics. Second,
the performance of our complete singing voice
transcription system (VAD and f0 estimation) is evaluated on
transcription of vocal melodies and, to our knowledge,
all available state-of-the-art systems are designed to target
the main melody, this final evaluation presents the results for
our system without comparison to a reference.
For all tasks and systems, the evaluation metrics are
computed using the mir_eval library [16]. For Section
5.3, some additional metrics related to voicing detection,
namely precision, f-measure and voicing accuracy, were
not present in the original mir_eval code and thus were
added for our experiments.
5.2 f0 estimation task
The performance of the DNN performing the f0 estimation
task is first compared to the melodia system [19], using
the plug-in implementation with the f0 search range limits
set equal to those of our system (69.29-1108.73 Hz, cf. Sec.
4.1) and with the remaining parameters left at their default values.
For each system and each music track the performance is
evaluated in terms of raw pitch accuracy (RPA) and raw
chroma accuracy (RCA). These metrics are computed on
vocal segments (i.e. without accounting for potential voic-
ing detection errors) for a f0 tolerance of 50 cents.
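For reference, the two metrics can be sketched as follows: a simplified, voiced-frames-only version of what mir_eval computes, where the folding of the error into [-600, 600) cents is what makes RCA forgive octave errors:

```python
import numpy as np

def cents(f_est, f_ref):
    """Signed pitch error in cents between estimate and reference (Hz)."""
    return 1200.0 * np.log2(f_est / f_ref)

def raw_pitch_accuracy(f_est, f_ref, tol=50.0):
    """Fraction of voiced frames whose estimate is within `tol` cents
    of the reference."""
    return np.mean(np.abs(cents(f_est, f_ref)) <= tol)

def raw_chroma_accuracy(f_est, f_ref, tol=50.0):
    """Same, but octave errors are forgiven by folding the error into
    the range [-600, 600) cents."""
    c = cents(f_est, f_ref)
    folded = (c + 600.0) % 1200.0 - 600.0
    return np.mean(np.abs(folded) <= tol)

ref = np.array([220.0, 220.0, 330.0, 440.0])
est = np.array([221.0, 440.0, 330.0, 441.0])  # frame 2 is an octave error
rpa = raw_pitch_accuracy(est, ref)   # 3/4: the octave error is penalized
rca = raw_chroma_accuracy(est, ref)  # 4/4: chroma folding forgives it
```

The gap between RPA and RCA on a given system thus directly measures how much of its error budget is octave-related, which is the comparison made in Figure 6.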
The results are presented in Figure 6 in the form
of a box plot where, for each metric and dataset, the ends
of the dashed vertical bars delimit the lowest and highest
scores obtained, the 3 vertical bars composing each center
box respectively correspond to the first quartile, the median
and the third quartile of the distribution, and finally the
star markers represent the mean. Both systems are char-
acterized by more widespread distributions for MedleyDB
than for iKala. This reflects the fact that MedleyDB is
more heterogeneous in musical genres and recording con-
ditions than iKala. On iKala, the DNN performs slightly
better than melodia when comparing the means. On Med-
leyDB, the gap between the two systems increases signif-
icantly. The DNN system seems much less affected by
the variability of the music examples and clearly improves
the mean RPA by about 20 percentage points (62.13% for melodia and 82.48%
for the DNN). Additionally, while exhibiting more similar
distributions of RPA and RCA, the DNN tends to produce
fewer octave detection errors. It should be noted that this
result does not take into account the recent post-processing
improvement proposed for melodia [2]; yet it shows the
interest of using such a DNN approach to compute an
enhanced pitch salience matrix which, simply combined with
a Viterbi post-processing, achieves good performance.
5.3 Singing voice transcription task
The evaluation of the global system is finally performed
on the two same test datasets. The results are displayed as
Figure 6: Comparative evaluation of the proposed DNN (in black) and melodia (in gray) on the MedleyDB (left) and iKala (right) test sets for a vocal melody f0 estimation task.
boxplots (cf. the description in Section 5.2) in Figures 7a and 7b
respectively for the iKala and the MedleyDB datasets. Five
metrics are computed to evaluate the voicing detection,
namely the precision (P), the recall (R), the f-measure (F),
the false alarm rate (FA) and the voicing accuracy (VA). A
sixth metric of overall accuracy (OA) is also presented for
assessing the global performance of the complete singing
voice melody transcription system.
In accordance with the previous evaluation, the results
on MedleyDB are characterized by much more variance
than on iKala. In particular, the voicing precision of the
system (i.e. its ability to provide correct detections, no
matter the number of missed voiced frames) is significantly
degraded on MedleyDB. Conversely, the voicing
recall, which evaluates the ability of the system to detect all
voiced portions actually present, no matter the number of
false alarms, remains relatively good on MedleyDB. Combining
both metrics, mean f-measures of 93.15% and
79.19% are obtained on the iKala and MedleyDB
test datasets, respectively.
Finally, the mean scores of overall accuracy obtained
for the global system are equal to 85.06 % and 75.03 %
respectively for iKala and MedleyDB databases.
6. CONCLUSION
This paper introduced a system for the transcription of
singing voice melodies composed of two DNN models. In
particular, a new system able to learn a representation
emphasizing melodic lines from low-level spectrogram data
has been proposed for the estimation of the
f0. For this DNN, the performance evaluation shows
relatively good generalization (when compared to a reference
system) on two different test datasets and increased
robustness to western music recordings that tend to be
representative of current music industry productions. While
for these experiments the systems have been learned from
a relatively low amount of data, the robustness, particu-
larly for the task of VAD, could very likely be improved
by increasing the number of training examples.
Figure 7: Voicing detection and overall performance of the proposed system on the (a) iKala and (b) MedleyDB test datasets.
7. REFERENCES
[1] R. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Can-
nam, and J. Bello. MedleyDB: A multitrack dataset for
annotation-intensive MIR research. In Proc. of the 15th
Int. Society for Music Information Retrieval (ISMIR)
Conference, October 2014.
[2] R. M. Bittner, J. Salamon, S. Essid, and J. P. Bello.
Melody extraction by contour classification. In Proc.
of the 16th Int. Society for Music Information Retrieval
(ISMIR) Conference, October 2015.
[3] T.-S. Chan, T.-C. Yeh, Z.-C. Fan, H.-W. Chen, L. Su,
Y.-H. Yang, and R. Jang. Vocal activity informed
singing voice separation with the ikala dataset. In Proc.
of IEEE Int. Conf. on Acoustics, Speech and Signal
Processing (ICASSP), pages 718–722, April 2015.
[4] J.-L. Durrieu, G. Richard, and B. David. An iterative
approach to monaural musical mixture de-soloing. In
Proc. of IEEE Int. Conf. on Acoustics, Speech and Sig-
nal Processing (ICASSP), pages 105–108, April 2009.
[5] D. P. W. Ellis and G. E. Poliner. Classification-based
melody transcription. Machine Learning, 65(2):439–
456, 2006.
[6] D. FitzGerald and M. Gainza. Single channel vo-
cal separation using median filtering and factorisa-
tion techniques. ISAST Trans. on Electronic and Signal
Processing, 4(1):62–73, 2010.
[7] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka.
RWC Music Database: popular, classical, and jazz music
databases. In Proc. of the 3rd Int. Society for Music
Information Retrieval (ISMIR) Conference, pages 287–
288, October 2002.
[8] A. Graves, A.-R. Mohamed, and G. Hinton. Speech
recognition with deep recurrent neural networks. In
Proc. of IEEE Int. Conf. on Acoustics, Speech and
Signal Processing (ICASSP), pages 6645–6649, May
2013.
[9] K. Han and D. Wang. Neural network based
pitch tracking in very noisy speech. IEEE/ACM
Trans. on Audio, Speech, and Language Processing,
22(12):2158–2168, October 2014.
[10] C.-L. Hsu and J.-S. R. Jang. On the improvement of
singing voice separation for monaural recordings using
the mir-1k dataset. IEEE Trans. on Audio, Speech, and
Language Processing, 18(2):310–319, 2010.
[11] S. Jo, S. Joo, and C. D. Yoo. Melody pitch estima-
tion based on range estimation and candidate extrac-
tion using harmonic structure model. In Proc. of IN-
TERSPEECH, pages 2902–2905, 2010.
[12] B. S. Lee and D. P. W. Ellis. Noise robust pitch tracking
by subband autocorrelation classification. In Proc. of
INTERSPEECH, 2012.
[13] S. Leglaive, R. Hennequin, and R. Badeau. Singing
voice detection with deep recurrent neural networks. In
Proc. of IEEE Int. Conf. on Acoustics, Speech and Sig-
nal Processing (ICASSP), pages 121–125, April 2015.
[14] M. Mauch and S. Ewert. The audio degradation tool-
box and its application to robustness evaluation. In
Proc. of the 14th Int. Society for Music Information Re-
trieval (ISMIR) Conference, November 2013.
[15] A. Mesaros and T. Virtanen. Automatic alignment of
music audio and lyrics. In Proc. of 11th Int. Conf. on
Digital Audio Effects (DAFx), September 2008.
[16] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon,
O. Nieto, D. Liang, and D. P. W. Ellis. mir_eval: a
transparent implementation of common MIR metrics.
In Proc. of the 15th Int. Society for Music Information
Retrieval (ISMIR) Conference, October 2014.
[17] M. Ryynänen and A. P. Klapuri. Automatic transcription
of melody, bass line, and chords in polyphonic music.
Computer Music Journal, 32(3):72–86, 2008.
[18] M. Ryynänen, T. Virtanen, J. Paulus, and A. Klapuri.
Accompaniment separation and karaoke application
based on automatic melody transcription. In Proc.
of the IEEE Int. Conf. on Multimedia and Expo, pages
1417–1420, April 2008.
[19] J. Salamon, E. Gómez, D. P. W. Ellis, and G. Richard.
Melody extraction from polyphonic music signals: Approaches,
applications, and challenges. IEEE Signal
Processing Magazine, 31(2):118–134, March 2014.
[20] J. Salamon, B. Rocha, and E. Gómez. Musical genre
classification using melody features extracted from
polyphonic music. In Proc. of IEEE Int. Conf. on
Acoustics, Speech and Signal Processing (ICASSP),
pages 81–84, March 2012.
[21] J. Salamon, J. Serrà, and E. Gómez. Tonal representations
for music retrieval: From version identification to
query-by-humming. Int. Jour. of Multimedia Informa-
tion Retrieval, special issue on Hybrid Music Informa-
tion Retrieval, 2(1):45–58, 2013.
[22] H. Tachibana, T. Ono, N. Ono, and S. Sagayama.
Melody line estimation in homophonic music au-
dio signals based on temporal-variability of melodic
source. In Proc. of IEEE Int. Conf. on Acoustics,
Speech and Signal Processing (ICASSP), pages 425–
428, March 2010.
[23] C. H. Wong, W. M. Szeto, and K. H. Wong. Automatic
lyrics alignment for Cantonese popular music. Multimedia
Systems, 12(4-5):307–323, 2007.