(2016) Hennequin and Rigaud - Long-Term Reverberation Modeling for Under-Determined Audio Source Separation with Application to Vocal Melody Extraction
LONG-TERM REVERBERATION MODELING FOR UNDER-DETERMINED AUDIO SOURCE SEPARATION WITH APPLICATION TO VOCAL MELODY EXTRACTION

Romain Hennequin
Deezer R&D
10 rue d'Athènes, 75009 Paris, France
rhennequin@deezer.com

François Rigaud
Audionamix R&D
171 quai de Valmy, 75010 Paris, France
francois.rigaud@audionamix.com
ABSTRACT

In this paper, we present a way to model long-term reverberation effects in under-determined source separation algorithms based on a non-negative decomposition framework. A general model for the sources affected by reverberation is introduced and update rules for the estimation of the parameters are presented. Combined with a well-known source-filter model for singing voice, an application to the extraction of reverberated vocal tracks from polyphonic music signals is proposed. Finally, an objective evaluation of this application is described. Performance improvements are obtained compared to the same model without reverberation modeling, in particular by significantly reducing the amount of interference between sources.
1. INTRODUCTION

Under-determined audio source separation has been a key topic in audio signal processing for the last two decades. It consists in isolating different meaningful 'parts' of the sound, such as for instance the lead vocal from the accompaniment in a song, or the dialog from the background music and effects in a movie soundtrack. Non-negative decompositions such as Non-negative Matrix Factorization [5] and its derivatives have been very popular in this research area for the last decade and have achieved state-of-the-art performance [3, 9, 12].

In music recordings, the vocal track generally contains reverberation that is either naturally present due to the recording environment or artificially added during the mixing process. For source separation algorithms, the effects of reverberation are usually not explicitly modeled and thus not properly extracted with the corresponding sources. Some studies [1, 2] introduce a model for the effect of spatial diffusion caused by the reverberation for a multichannel source separation application. In [7] a model for the dereverberation of spectrograms is presented for the case of long reverberations, i.e. when the reverberation time is longer than the length of the analysis window.

© Romain Hennequin, François Rigaud. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Romain Hennequin, François Rigaud. "Long-term reverberation modeling for under-determined audio source separation with application to vocal melody extraction.", 17th International Society for Music Information Retrieval Conference, 2016. This work was done while the first author was working for Audionamix.
We propose in this paper to extend the reverberation model proposed in [7] to a source separation application that allows extracting the reverberation of a specific source together with its dry signal. The reverberation model is first introduced in a general framework for which no assumption is made about the spectrogram of the dry sources. At this stage, and as is often the case in demixing applications, the estimation problem is ill-posed (optimization of a non-convex cost function with local minima, result highly dependent on the initialization, ...) and requires the incorporation of some knowledge about the source signals. In [7], this issue is addressed with a sparsity prior on the unreverberated spectrogram model. Alternatively, the source spectrograms can be structured by using models of non-negative decompositions with constraints (e.g. harmonic structure of the source's tones, sparsity of the activations) and/or by guiding the estimation process with prior information (e.g. source activation, multi-pitch transcription). Thus in this paper we propose to combine the generic reverberation model with a well-known source/filter model of singing voice [3]. A modified version of the original voice extraction algorithm is described and evaluated on an application to the extraction of reverberated vocal melodies from polyphonic music signals.

Note that unlike usual applications of reverberation modeling, we do not aim at extracting dereverberated sources; rather, we try to accurately extract both the dry signal and the reverberation within the same track. Thus, the designation source separation is not completely in accordance with our application, which targets more precisely stem separation.

The rest of the paper is organized as follows: Section 2 presents the general model for a reverberated source and Section 3 introduces the update rule used for its estimation. In Section 4, a practical implementation for which the reverberation model is combined with a source/filter model is presented. Then, Section 5 presents experimental results that demonstrate the ability of our algorithm to properly extract vocals affected by reverberation. Finally, conclusions are drawn in Section 6.
Notation

• Matrices are denoted by bold capital letters: M. The coefficient at row f and column t of matrix M is denoted by M_{f,t}.
• Vectors are denoted by bold lower case letters: v.
• Matrix or vector sizes are denoted by capital letters: T, whereas indexes are denoted by lower case letters: t.
• Scalars are denoted by italic lower case letters: s.
• ⊙ stands for element-wise matrix multiplication (Hadamard product) and M^λ stands for element-wise exponentiation of matrix M with exponent λ.
2. GENERAL MODEL

For the sake of clarity, we will present the signal model for mono signals only, although it can be easily generalized to multichannel signals as in [6]. In the experimental Section 5, a stereo signal model is actually used.

2.1 Non-negative Decomposition

Most source separation algorithms based on a non-negative decomposition assume that the non-negative mixture spectrogram V (usually the modulus or the squared modulus of a time-frequency representation such as the Short-Time Fourier Transform (STFT)), which is an F × T non-negative matrix, can be approximated as the sum of K source model spectrograms V̂^k, which are also non-negative:

    V ≈ V̂ = Σ_{k=1}^{K} V̂^k    (1)

Various structured matrix decompositions have been proposed for the source models V̂^k, such as, to name a few, standard NMF [8], source/filter modeling [3] or harmonic source modeling [10].
2.2 Reverberation Model

In the time domain, a time-invariant reverberation can be accurately modeled using a convolution with a filter and thus be written as:

    y = h ∗ x,    (2)

where x is the dry signal, h is the impulse response of the reverberation filter and y is the reverberated signal.

For short-term convolution, this expression can be approximated by a multiplication in the frequency domain, as proposed in [6]:

    y_t = h ⊙ x_t,    (3)

where x_t (respectively y_t) is the modulus of the t-th frame of the STFT of x (respectively y) and h is the modulus of the Fourier transform of h.

For long-term convolution, this approximation does not hold. The support of typical reverberation filters is generally greater than half a second, which is way too long for an STFT analysis window in this kind of application. In this case, as suggested in [7], we can use another approximation, which is a convolution in each frequency channel:

    y^f = h^f ∗ x^f,    (4)

where y^f, h^f and x^f are the f-th frequency channel of the STFT of respectively y, h and x.

Then, starting from a dry spectrogram model V̂^{dry,k} of a source with index k, the reverberated model of the same source is obtained using the following non-negative approximation:

    V̂^{rev,k}_{f,t} = Σ_{τ=1}^{T_k} V̂^{dry,k}_{f,t−τ+1} R^k_{f,τ}    (5)

where R^k is the F × T_k non-negative reverberation matrix of model k to be estimated.

The model of Equation (5) makes it possible to take long-term effects of reverberation into account and generalizes short-term convolution models such as the one proposed in [6], since when T_k = 1 the model corresponds to the short-term convolution approximation.
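As an illustrative sketch (not the authors' code; the function and array names are ours), the reverberated model of Equation (5) can be written in NumPy as:

```python
import numpy as np

def reverberate(V_dry, R):
    """Eq. (5): V_rev[f, t] = sum_tau V_dry[f, t - tau] * R[f, tau] (0-based).

    V_dry: (F, T) non-negative dry spectrogram model.
    R:     (F, Tk) non-negative reverberation matrix.
    """
    F, T = V_dry.shape
    Tk = R.shape[1]
    V_rev = np.zeros((F, T))
    for tau in range(Tk):
        # shift the dry spectrogram by tau frames and weight by R[:, tau]
        V_rev[:, tau:] += R[:, tau:tau + 1] * V_dry[:, :T - tau]
    return V_rev
```

With Tk = 1 the loop reduces to a per-frequency scaling of each frame, i.e. the short-term model of Equation (3).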
3. ALGORITHM

3.1 Non-negative Decomposition Algorithms

The approximation of Equation (1) is generally quantified using a divergence (a measure of dissimilarity) between V and V̂, to be minimized with respect to the set of parameters Λ of all the models:

    C(Λ) = D(V | V̂(Λ))    (6)

A commonly used class of divergences is the element-wise β-divergence, which encompasses the Itakura-Saito divergence (β = 0), the Kullback-Leibler divergence (β = 1) and the squared Frobenius distance (β = 2) [4]. The global cost then writes:

    C(Λ) = Σ_{f,t} d_β(V_{f,t} | V̂_{f,t}(Λ)).    (7)

The problem being non-convex, the minimization is generally done using alternating update rules on each parameter of Λ. The update rule for a parameter Θ is commonly obtained using a heuristic consisting in decomposing the gradient of the cost function with respect to this parameter as a difference of two positive terms, such as

    ∇_Θ C = P_Θ − M_Θ,  P_Θ ≥ 0,  M_Θ ≥ 0,    (8)

and then updating the parameter according to:

    Θ ← Θ ⊙ (M_Θ / P_Θ).    (9)

This kind of update rule ensures that the parameter remains non-negative. Moreover, the parameter is updated in a descent direction, or remains constant if the partial derivative is zero. In some cases (including the update rules we will present), it is possible to prove using a Majorize-Minimization (MM) approach [4] that the multiplicative update rules actually lead to a decrease of the cost function.

Using such an approach, the update rules for a standard NMF model V̂ = WH can be expressed as:

    H ← H ⊙ [W^T (V̂^{β−2} ⊙ V)] / [W^T V̂^{β−1}],    (10)

    W ← W ⊙ [(V̂^{β−2} ⊙ V) H^T] / [V̂^{β−1} H^T].    (11)
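For illustration, the standard multiplicative updates (10)-(11) can be sketched in NumPy as follows (a minimal sketch under our own naming; the paper provides no code, and all exponentiations and divisions are element-wise):

```python
import numpy as np

def nmf_beta(V, K, beta=2.0, n_iter=100, seed=0):
    """Multiplicative updates (10)-(11) for V ≈ WH under the beta-divergence.

    V: (F, T) non-negative mixture spectrogram; K: number of components.
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    # small positive offsets keep every entry strictly positive
    W = rng.random((F, K)) + 1e-3
    H = rng.random((K, T)) + 1e-3
    for _ in range(n_iter):
        Vh = W @ H
        H *= (W.T @ (Vh ** (beta - 2) * V)) / (W.T @ Vh ** (beta - 1))
        Vh = W @ H
        W *= ((Vh ** (beta - 2) * V) @ H.T) / (Vh ** (beta - 1) @ H.T)
    return W, H
```

For β in [1, 2] the MM argument mentioned above guarantees that each sweep does not increase the cost, which is easy to check numerically.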
3.2 Estimation of the Reverberation Matrix

When the dry models V̂^{dry,k} are fixed, the reverberation matrices can be estimated using the following update rule, applied successively to each reverberation matrix:

    R^k ← R^k ⊙ [(V̂^{β−2} ⊙ V) ∗_t V̂^{dry,k}] / [V̂^{β−1} ∗_t V̂^{dry,k}]    (12)

where ∗_t stands for time-convolution:

    (V̂^{β−1} ∗_t V̂^{dry,k})_{f,t} = Σ_{τ=t}^{T} V̂^{β−1}_{f,τ} V̂^{dry,k}_{f,τ−t+1}.    (13)

The update rule (12), obtained using the procedure described in Section 3.1, ensures the non-negativity of R^k. This update can be obtained using the MM approach, which ensures that the cost function will not increase.
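The update (12), together with the time-convolution (13), can be sketched as follows (a sketch only; the helper `timecorr` and all names are ours):

```python
import numpy as np

def timecorr(A, X, Tk):
    """Eq. (13), 0-based: out[f, t] = sum_{tau >= t} A[f, tau] * X[f, tau - t]."""
    F, T = A.shape
    out = np.zeros((F, Tk))
    for t in range(Tk):
        out[:, t] = np.sum(A[:, t:] * X[:, :T - t], axis=1)
    return out

def update_R(R, V, V_hat, V_dry, beta=1.0):
    """Multiplicative update (12) for the reverberation matrix R.

    V: mixture spectrogram, V_hat: current mixture model,
    V_dry: fixed dry model of the source, all (F, T) and positive.
    """
    Tk = R.shape[1]
    num = timecorr(V_hat ** (beta - 2) * V, V_dry, Tk)
    den = timecorr(V_hat ** (beta - 1), V_dry, Tk)
    return R * num / den
```

At a fixed point where the model matches the mixture (V̂ = V), the numerator and denominator coincide and R is left unchanged, as expected of a multiplicative update.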
3.3 Estimation with Other Free Models

A general drawback of source separation models that do not explicitly account for reverberation effects is that the reverberation affecting a given source is usually spread among all separated sources. However, this issue can still arise with a proper model of reverberation if the dry model of the reverberated source is not constrained enough. Indeed, using the generic reverberation model of Equation (5), the reverberation of a source can still be incorrectly modeled during the optimization by other source models having more degrees of freedom. A possible solution for enforcing a correct optimization of the reverberated model is to further constrain the structure of the dry model spectrogram, e.g. through the inclusion of sparsity or pitch activation constraints, and potentially to adopt a sequential estimation scheme. For instance, first discarding the reverberation model, an initial rough estimate of the dry source may be produced. Second, considering the reverberation model, the dry model previously estimated can be refined while estimating the reverberation matrix at the same time. Such an approach is described in the following section for a practical implementation of the algorithm applied to the problem of lead vocal extraction from polyphonic music signals.
4. APPLICATION TO VOICE EXTRACTION

In this section we propose an implementation of our reverberation model in a practical case: we use Durrieu's algorithm [3] for lead vocal isolation and add the reverberation model on top of the voice model.

4.1 Base Voice Extraction Algorithm

Durrieu's algorithm for lead vocal isolation in a song is based on a source/filter model for the voice.

4.1.1 Model

The non-negative mixture spectrogram model consists in the sum of a voice spectrogram model based on a source/filter model and a music spectrogram model based on a standard NMF:

    V ≈ V̂ = V̂^{voice} + V̂^{music}.    (14)

The voice model is based on a source/filter speech production model:

    V̂^{voice} = (W_{F0} H_{F0}) ⊙ (W_K H_K).    (15)

The first factor (W_{F0} H_{F0}) is the source part, corresponding to the excitation of the vocal folds: W_{F0} is a matrix of fixed harmonic atoms and H_{F0} contains the activations of these atoms over time. The second factor (W_K H_K) is the filter part, corresponding to the resonance of the vocal tract: W_K is a matrix of smooth filter atoms and H_K contains the activations of these atoms over time.

The background music model is a generic NMF:

    V̂^{music} = W_R H_R.    (16)
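A minimal numeric sketch of the mixture model (14)-(16) follows; the dimensions and all variable names are of our choosing, and the atoms are random placeholders rather than actual harmonic or filter dictionaries:

```python
import numpy as np

# Illustrative dimensions (ours): F bins, T frames, NF0 harmonic atoms,
# NK filter atoms, NR music atoms.
F, T, NF0, NK, NR = 6, 5, 4, 3, 2
rng = np.random.default_rng(0)
W_F0 = rng.random((F, NF0))   # fixed harmonic atoms (source part)
H_F0 = rng.random((NF0, T))   # activations of the harmonic atoms
W_K = rng.random((F, NK))     # smooth filter atoms (vocal tract)
H_K = rng.random((NK, T))     # activations of the filter atoms
W_R = rng.random((F, NR))     # background music atoms
H_R = rng.random((NR, T))     # activations of the music atoms

V_voice = (W_F0 @ H_F0) * (W_K @ H_K)   # Eq. (15): source/filter voice model
V_music = W_R @ H_R                     # Eq. (16): NMF music model
V_hat = V_voice + V_music               # Eq. (14): mixture model
```

The Hadamard product in Eq. (15) is what couples the excitation and vocal-tract factors bin by bin, rather than the plain matrix product used in the music NMF.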
4.1.2 Algorithm

The matrices H_{F0}, W_K, H_K, W_R and H_R are estimated by minimizing the element-wise Itakura-Saito divergence between the original mixture power spectrogram and the mixture model:

    C(H_{F0}, W_K, H_K, W_R, H_R) = Σ_{f,t} d_{IS}(V_{f,t} | V̂_{f,t}),    (17)

where d_{IS}(x, y) = x/y − log(x/y) − 1. The minimization is achieved using multiplicative update rules.
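The cost (17) is straightforward to express directly (a sketch; function names are ours):

```python
import numpy as np

def d_is(x, y):
    """Element-wise Itakura-Saito divergence d_IS(x, y) = x/y - log(x/y) - 1."""
    r = x / y
    return r - np.log(r) - 1.0

def is_cost(V, V_hat):
    """Eq. (17): summed IS divergence between mixture and model spectrograms."""
    return np.sum(d_is(V, V_hat))
```

Like every β-divergence, d_IS is zero if and only if the two arguments are equal, and positive otherwise; unlike the Frobenius distance it is scale-invariant, which is why it is a common choice for power spectrograms.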
The estimation is done in three steps:

1. A first round of parameter estimation is performed by iterating the multiplicative update rules.
2. The matrix H_{F0} is processed using Viterbi decoding to track the main melody, and is then thresholded so that coefficients too far from the melody are set to zero.
3. Parameters are re-estimated as in the first step, but using the thresholded version of H_{F0} for the initialization.
4. 4.2 Inclusion of the reverberation model
As stated in Section 3, the dry spectrogram model (i.e. the
spectrogram model for the source without reverberation) has to
be sufficiently constrained in order to accurately estimate the
reverberation part. This constraint is obtained here through the
use of a fixed harmonic dictionary WF0 and, above all, through
the thresholding of the matrix HF0, which enforces the sparsity
of the activations.
We thus introduce the reverberation model after the
thresholding of the matrix HF0. The first two steps then remain
the same as presented in Section 4.1.2. In the third step, the
dry voice model of Equation (15) is replaced by a reverberated
voice model following Equation (5):
V̂rev. voice_{f,t} = Σ_{τ=1}^{T} V̂voice_{f,t−τ+1} R_{f,τ}.    (18)
For the parameter re-estimation of step 3, the multiplicative
update rule of R is given by Equation (12). For the other
parameters of the voice model, the update rules from [3] are
modified to take the reverberation model into account:
HF0 ← HF0 ⊙ [WF0^T ((WK HK) ⊙ (R ∗t (V̂^(β−2) ⊙ V)))]
           / [WF0^T ((WK HK) ⊙ (R ∗t V̂^(β−1)))]    (19)

HK ← HK ⊙ [WK^T ((WF0 HF0) ⊙ (R ∗t (V̂^(β−2) ⊙ V)))]
         / [WK^T ((WF0 HF0) ⊙ (R ∗t V̂^(β−1)))]    (20)

WK ← WK ⊙ [((WF0 HF0) ⊙ (R ∗t (V̂^(β−2) ⊙ V))) HK^T]
         / [((WF0 HF0) ⊙ (R ∗t V̂^(β−1))) HK^T]    (21)
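As a rough numpy rendering of update (19): the gradient of the convolution in Equation (18) turns the operator ∗t into a frequency-wise correlation with R, which is what the helper below implements. This is a sketch under my own reading of the notation (the function names, the placeholder V̂, and the choice β = 0 for the Itakura-Saito case are assumptions, not the paper's code):

```python
import numpy as np

def correlate_t(R, X):
    """Frequency-wise correlation along time: the adjoint of the
    truncated convolution of equation (18), as appears in gradients."""
    F, T_frames = X.shape
    T_rev = R.shape[1]
    out = np.zeros_like(X)
    for f in range(F):
        # time-reversed kernel => correlation; keep the aligned part
        out[f] = np.convolve(X[f], R[f][::-1])[T_rev - 1:][:T_frames]
    return out

def update_HF0(W_F0, H_F0, W_K, H_K, R, V, V_hat, beta=0.0):
    """Sketch of multiplicative update (19); beta = 0 corresponds to
    the Itakura-Saito divergence used in the paper."""
    filt = W_K @ H_K
    num = W_F0.T @ (filt * correlate_t(R, V_hat ** (beta - 2) * V))
    den = W_F0.T @ (filt * correlate_t(R, V_hat ** (beta - 1)))
    return H_F0 * num / np.maximum(den, 1e-12)

# Toy run with random non-negative data (V_hat would normally come
# from the full reverberated mixture model).
rng = np.random.default_rng(1)
F, T_frames, NF0, NK, T_rev = 8, 20, 3, 2, 4
W_F0 = rng.random((F, NF0)); H_F0 = rng.random((NF0, T_frames))
W_K = rng.random((F, NK)); H_K = rng.random((NK, T_frames))
R = rng.random((F, T_rev))
V = rng.random((F, T_frames)) + 0.1
V_hat = rng.random((F, T_frames)) + 0.1
H_F0_new = update_HF0(W_F0, H_F0, W_K, H_K, R, V, V_hat)
print(H_F0_new.shape)  # (3, 20)
```

Note how the update preserves non-negativity by construction: every factor in the numerator and denominator is non-negative.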
The update rules for the parameters of the music model (HR and
WR) are unchanged and thus identical to those given in
Equations (10) and (11).
5. EXPERIMENTAL RESULTS
5.1 Experimental setup
We tested the reverberation model that we proposed with
the algorithm presented in Section 4 on a task of lead vocal
extraction in a song. In order to assess the improvement of
our model over the existing one, we ran the separation with
and without reverberation modeling.
We used a database composed of 9 song excerpts of
professionally produced music, with a total duration of about
10 minutes. As reverberation modeling only makes sense when a
significant amount of reverberation is present, all the selected
excerpts contain a fair amount of it. This reverberation was
already present in the separated tracks and was not added
artificially by ourselves. On some excerpts the reverberation
is time-variant: active on some parts and inactive on others, a
ducking echo effect, etc. Some short excerpts, as well as the
separation results, can be played on the companion website 1.
Spectrograms were computed as the squared modulus of the STFT
of the signal sampled at 44100 Hz, with a 4096-sample (92.9 ms)
Hamming window and 75% overlap. The length T of the
reverberation matrix was arbitrarily fixed to 52 frames (about
1.2 s) so as to be sufficient for long reverberations.
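The quoted durations follow directly from the STFT parameters (the hop length of 1024 samples below is implied by the 75% overlap, not stated explicitly in the text):

```python
sr = 44100        # sampling rate (Hz)
win = 4096        # Hamming window length in samples
hop = win // 4    # 75% overlap => hop of 1024 samples

print(win / sr)      # ~0.0929 s per window, i.e. the quoted 92.9 ms
T = 52               # reverberation length in frames
print(T * hop / sr)  # ~1.21 s spanned by the reverberation matrix
```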
5.2 Results
In order to quantify the results, we use the standard source
separation metrics described in [11]: Signal to Distortion
Ratio (SDR), Signal to Artifacts Ratio (SAR) and Signal to
Interference Ratio (SIR).
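To give a feel for these metrics: the BSS-eval framework of [11] first decomposes the estimation error into interference and artifact components before forming the ratios; the sketch below skips that decomposition and lumps all error together, so it is only a simplified illustration of the SDR, not the evaluated metric:

```python
import numpy as np

def simple_sdr(reference, estimate):
    """Simplified signal-to-distortion ratio in dB:
    10 * log10(||s||^2 / ||s - s_hat||^2)."""
    err = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(err ** 2))

# A 440 Hz reference corrupted by a small 1 kHz interference tone.
t = np.linspace(0, 1, 44100, endpoint=False)
s = np.sin(2 * np.pi * 440 * t)
noisy = s + 0.1 * np.sin(2 * np.pi * 1000 * t)
print(simple_sdr(s, noisy))  # 20 dB: the error has 1% of the signal power
```

Higher values mean a cleaner estimate; the full BSS-eval SIR and SAR isolate, respectively, how much of that error comes from the other sources and from processing artifacts.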
The results are presented in Figure 1 for the evaluation of the
extracted voice signals and in Figure 2 for the extracted music
signals. The oracle performance, obtained using the actual
spectrograms of the sources to compute the separation masks, is
also reported. As we can see, adding the reverberation modeling
improves all these metrics. The SIR of the extracted voice is
particularly improved in Figure 1 (by more than 5 dB): this is
mainly because, without the reverberation model, a large part
of the reverberation of the voice leaks into the music model.
This phenomenon is also clearly audible in excerpts with strong
reverberation: with the reverberation model, the long
reverberation tail is mainly heard within the separated voice
and is almost inaudible within the separated music. On the
other hand, vocals extracted with the reverberation model tend
to contain more audible interference. This is partly due to the
fact that the pre-estimation of the dry model (steps 1 and 2 of
the base algorithm) is not interference-free, so that applying
the reverberation model increases the energy of these
interferences.
Figure 1. Experimental separation results for the voice
stem.
1 http://romain-hennequin.fr/En/demo/reverb_separation/reverb.html
Figure 2. Experimental separation results for the music stem.
6. CONCLUSION
In this paper we proposed a method to model long-term effects
of reverberation in a source separation application for which a
constrained model of the dry source is available. Future work
should focus on speeding up the algorithm, since the multiple
convolutions performed at each iteration can be time-consuming.
Developing methods to estimate the reverberation duration (of a
specific source within a mix) would also make it possible to
automate the whole process. It could also be interesting to add
spatial modeling for multichannel processing, using full-rank
spatial covariance matrices and multichannel reverberation
matrices.
7. REFERENCES
[1] Simon Arberet, Alexey Ozerov, Ngoc Q. K. Duong, Emmanuel
Vincent, Frédéric Bimbot, and Pierre Vandergheynst. Nonnegative
matrix factorization and spatial covariance model for
under-determined reverberant audio source separation. In
International Conference on Information Sciences, Signal
Processing and their Applications, pages 1–4, May 2010.
[2] Ngoc Q. K. Duong, Emmanuel Vincent, and Rémi Gribonval.
Under-determined reverberant audio source separation using a
full-rank spatial covariance model. IEEE Transactions on Audio,
Speech, and Language Processing, 18(7):1830–1840, Sept 2010.
[3] Jean-Louis Durrieu, Gaël Richard, and Bertrand David. An
iterative approach to monaural musical mixture de-soloing. In
International Conference on Acoustics, Speech, and Signal
Processing (ICASSP), pages 105–108, Taipei, Taiwan, April 2009.
[4] Cédric Févotte and Jérôme Idier. Algorithms for
nonnegative matrix factorization with the beta-divergence.
Neural Computation, 23(9):2421–2456, September 2011.
[5] Daniel D. Lee and H. Sebastian Seung. Learning the parts
of objects by non-negative matrix factorization. Nature,
401(6755):788–791, October 1999.
[6] Alexey Ozerov and Cédric Févotte. Multichannel nonnegative
matrix factorization in convolutive mixtures for audio source
separation. IEEE Transactions on Audio, Speech and Language
Processing, 18(3):550–563, 2010.
[7] Rita Singh, Bhiksha Raj, and Paris Smaragdis.
Latent-variable decomposition based dereverberation of monaural
and multi-channel signals. In IEEE International Conference on
Acoustics, Speech and Signal Processing, Dallas, Texas, USA,
March 2010.
[8] Paris Smaragdis and Judith C. Brown. Non-negative matrix
factorization for polyphonic music transcription. In Workshop
on Applications of Signal Processing to Audio and Acoustics,
pages 177–180, New Paltz, NY, USA, October 2003.
[9] Paris Smaragdis, Bhiksha Raj, and Madhusudana Shashanka.
Supervised and semi-supervised separation of sounds from
single-channel mixtures. In 7th International Conference on
Independent Component Analysis and Signal Separation, London,
UK, September 2007.
[10] Emmanuel Vincent, Nancy Bertin, and Roland Badeau.
Adaptive harmonic spectral decomposition for multiple pitch
estimation. IEEE Transactions on Audio, Speech and Language
Processing, 18(3):528–537, March 2010.
[11] Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte.
Performance measurement in blind audio source separation. IEEE
Transactions on Audio, Speech and Language Processing,
14(4):1462–1469, July 2006.
[12] Tuomas Virtanen. Monaural sound source separation by
nonnegative matrix factorization with temporal continuity and
sparseness criteria. IEEE Transactions on Audio, Speech and
Language Processing, 15(3):1066–1074, March 2007.