SlideShare a Scribd company logo
LONG-TERM REVERBERATION MODELING FOR
UNDER-DETERMINED AUDIO SOURCE SEPARATION WITH
APPLICATION TO VOCAL MELODY EXTRACTION.
Romain Hennequin
Deezer R&D
10 rue d’Ath`enes, 75009 Paris, France
rhennequin@deezer.com
Franc¸ois Rigaud
Audionamix R&D
171 quai de Valmy, 75010 Paris, France
francois.rigaud@audionamix.com
ABSTRACT
In this paper, we present a way to model long-term rever-
beration effects in under-determined source separation al-
gorithms based on a non-negative decomposition frame-
work. A general model for the sources affected by rever-
beration is introduced and update rules for the estimation
of the parameters are presented. Combined with a well-
known source-filter model for singing voice, an applica-
tion to the extraction of reverberated vocal tracks from
polyphonic music signals is proposed. Finally, an objec-
tive evaluation of this application is described. Perfor-
mance improvements are obtained compared to the same
model without reverberation modeling, in particular by
significantly reducing the amount of interference between
sources.
1. INTRODUCTION
Under-determined audio source separation has been a key
topic in audio signal processing for the last two decades.
It consists in isolating different meaningful ‘parts’ of the
sound, such as for instance the lead vocal from the ac-
companiment in a song, or the dialog from the background
music and effects in a movie soundtrack. Non-negative
decompositions such as Non-negative Matrix Factoriza-
tion [5] and its derivative have been very popular in this
research area for the last decade and have achieved state-
of-the art performances [3,9,12].
In music recordings, the vocal track generally contains
reverberation that is either naturally present due to the
recording environment or artificially added during the mix-
ing process. For source separation algorithms, the effects
of reverberation are usually not explicitly modeled and
thus not properly extracted with the corresponding sources.
Some studies [1, 2] introduce a model for the effect of
spatial diffusion caused by the reverberation for a multi-
channel source separation application. In [7] a model for
c Romain Hennequin, Franc¸ois Rigaud. Licensed under
a Creative Commons Attribution 4.0 International License (CC BY 4.0).
Attribution: Romain Hennequin, Franc¸ois Rigaud. “Long-term rever-
beration modeling for under-determined audio source separation with ap-
plication to vocal melody extraction.”, 17th International Society for Mu-
sic Information Retrieval Conference, 2016. This work has been done
when the first author was working for Audionamix.
the dereverberation of spectrograms is presented for the
case of long reverberations, i.e. when the reverberation
time is longer than the length of the analysis window.
We propose in this paper to extend the model of re-
verberation proposed in [7] to a source separation appli-
cation that allows extracting the reverberation of a spe-
cific source together with its dry signal. The reverbera-
tion model is introduced first in a general framework for
which no assumption is made about the spectrogram of the
dry sources. At this state, and as often in demixing appli-
cation, the estimation problem is ill-posed (optimization
of a non-convex cost function with local minima, result
highly dependent on the initialization, ...) and requires the
incorporation of some knowledge about the source signals.
In [7], this issue is dealt with a sparsity prior on the un-
reverberated spectrogram model. Alternatively, the spec-
trogram sources can be structured by using models of non-
negative decompositions with constraints (e.g. harmonic
structure of the source’s tones, sparsity of the activations)
and/or by guiding the estimation process with prior infor-
mation (e.g. source activation, multi-pitch transcription).
Thus in this paper we propose to combine the generic re-
verberation model with a well-known source/filter model
of singing voice [3]. A modified version of the original
voice extraction algorithm is described and evaluated on an
application to the extraction of reverberated vocal melodies
from polyphonic music signals.
Note that unlike usual application of reverberation mod-
eling, we do not aim at extracting dereverbated sources but
we try to extract accurately both the dry signal and the re-
verberation within the same track. Thus, the designation
source separation is not completely in accordance with our
application which targets more precisely stem separation.
The rest of the paper is organized as follows: Section 2
presents the general model for a reverberated source and
Section 3 introduces the update rule used for its estima-
tion. In Section 4, a practical implementation for which
the reverberation model is combined with a source/filter
model is presented. Then, Section 5 presents experimental
results that demonstrate the ability of our algorithm to
extract properly vocals affected by reverberation. Finally,
conclusions are drawn in Section 6.
206
Notation
• Matrices are denoted by bold capital letters: M. The
coefficients at row f and column t of matrix M is
denoted by Mf,t.
• Vectors are denoted by bold lower case letters: v.
• Matrix or vector sizes are denoted by capital letters:
T, whereas indexes are denoted with lower case let-
ters: t.
• Scalars are denoted by italic lower case letters: s.
• stands for element-wise matrix multiplication
(Hadamard product) and M λ
stands for element-
wise exponentiation of matrix M with exponent λ.
2. GENERAL MODEL
For the sake of clarity, we will present the signal model for
mono signals only although it can be easily generalized to
multichannel signals as in [6]. In the experimental Section
5, a stereo signal model is actually used.
2.1 Non-negative Decomposition
Most source separation algorithms based on a non-negative
decomposition assume that the non-negative mixture spec-
trogram V (usually the modulus or the squared modulus
of a time-frequency representation such as the Short Time
Fourier Transform (STFT)) which is a F ×T non-negative
matrix, can be approximated as the sum of K source model
spectrograms ˆVk
, which are also non-negative:
V ≈ ˆV =
K
k=1
ˆVk
(1)
Various structured matrix decomposition have been pro-
posed for the source models ˆVk
, such as, to name a few,
standard NMF [8], source/filter modeling [3] or harmonic
source modeling [10].
2.2 Reverberation Model
In the time-domain, a time-invariant reverberation can be
accurately modeled using a convolution with a filter and
thus be written as:
y = h ∗ x, (2)
where x is the dry signal, h is the impulse response of the
reverberation filter and y is the reverberated signal.
For short-term convolution, this expression can be ap-
proximated by a multiplication in the frequency domain
such as proposed in [6] :
yt = h xt, (3)
where xt (respectively yt) is the modulus of the t-th frame
of the STFT of x (respectively y) and h is the modulus of
the Fourier transform of h.
For long-term convolution, this approximation does not
hold. The support of typical reverberation filters are gener-
ally greater than half a second which is way too long for a
STFT analysis window in this kind of application. In this
case, as suggested in [7], we can use an other approxima-
tion which is a convolution in each frequency channel :
yf
= h
f
∗ xf
, (4)
Where yf
, h
f
and xf
are the f-th frequency channel of
the STFT of respectively y, h and x.
Then, starting from a dry spectrogram model ˆVdry,k
of
a source with index k, the reverberated model of the same
source is obtained using the following non-negative ap-
proximation:
ˆVrev,k
f,t =
Tk
τ=1
ˆVdry,k
f,t−τ+1Rk
f,τ (5)
where Rk
is the F × Tk non-negative reverberation matrix
of model k to be estimated.
The model of Equation (5) makes it possible to take
long-term effects of reverberation into account and gen-
eralizes short-term convolution models as proposed in [6]
since when Tk = 1, the model corresponds to the short-
term convolution approximation.
3. ALGORITHM
3.1 Non-negative decomposition algorithms
The approximation of Equation (1) is generally quantified
using a divergence (a measure of dissimilarity) between V
and ˆV to be minimized with respect to the set of parame-
ters Λ of all the models:
C(Λ) = D(V| ˆV(Λ)) (6)
A commonly used class of divergence is the element-
wise β-divergence which encompasses the Itakura-Saito
divergence (β = 0), the Kullback-Leibler divergence (β =
1) and the squared Frobenius distance (β = 2) [4]. The
global cost then writes:
C(Λ) =
f,t
dβ(Vf,t| ˆVf,t(Λ)). (7)
The problem being not convex, the minimization is gener-
ally done using alternating update rules on each parame-
ters of Λ. The update rule for a parameter Θ is commonly
obtained using an heuristic consisting in decomposing the
gradient of the cost-function with respect to this parameter
as a difference of two positive terms, such as
ΘC = PΘ − MΘ, PΘ ≥ 0, MΘ ≥ 0, (8)
and then by updating the parameter according to:
Θ ← Θ
MΘ
PΘ
. (9)
This kind of update rule ensures that the parameter re-
mains non-negative. Moreover the parameter is updated in
a direction descent or remains constant if the partial deriva-
tive is zero. In some cases (including the update rules
Proceedings of the 17th ISMIR Conference, New York City, USA, August 7-11, 2016 207
we will present), it is possible to prove using a Majorize-
Minimization (MM) approach [4] that the multiplicative
update rules actually lead to a decrease of the cost func-
tion.
Using such an approach, the update rules for a standard
NMF model ˆV = WH can be expressed as:
H ← H
WT ˆV β−2
V
WT ˆV β−1
, (10)
W ← W
ˆV β−2
V HT
ˆV β−1HT
. (11)
3.2 Estimation of the reverberation matrix
When dry models ˆVdry,k
are fixed, the reverberation matrix
can be estimated using the following update rule applied
successively on each reverberation matrix:
Rk
← Rk
ˆV β−2
V ∗t
ˆVdry,k
ˆV β−1 ∗t
ˆVdry,k
(12)
where ∗t stands for time-convolution:
ˆV β−1
∗t
ˆVdry,k
f,τ
=
T
τ=t
ˆV β−1
f,τ
ˆVdry,k
f,τ−t+1. (13)
The update rule (12) obtained using the procedure de-
scribed in Section 3.1 ensures the non-negativity of Rk
.
This update can be obtained using the MM approach which
ensures that the cost-function will not increase.
3.3 Estimation with other free model
A general drawback of source separation models that do
not explicitly account for reverberation effects is that the
reverberation affecting a given source is usually spread
among all separated sources. However, this issue can still
arise with a proper model of reverberation if the dry model
of the reverberated source is not constrained enough. In-
deed using the generic reverberation model of Equation
(5), the reverberation of a source can still be incorrectly
modeled during the optimization by other source models
having more degrees of freedom. A possible solution for
enforcing a correct optimization of the reverberated model
is to further constrain the structure of the dry model spec-
trogram, e.g. through the inclusion of sparsity or pitch ac-
tivation constraints, and potentially to adopt a sequential
estimation scheme. For instance, first discarding the re-
verberation model, a first rough estimate of the dry source
may be produced. Second, considering the reverberation
model, the dry model previously estimated can be refined
while estimating at the same time the reverberation matrix.
Such an approach is described in the following section for
a practical implementation of the algorithm to the problem
of lead vocal extraction from polyphonic music signals.
4. APPLICATION TO VOICE EXTRACTION
In this section we propose an implementation of our rever-
beration model in a practical case: we use Durrieu’s algo-
rithm [3] for lead vocal isolation and add the reverberation
model over the voice model.
4.1 Base voice extraction algorithm
Durrieu’s algorithm for lead vocal isolation in a song is
based on a source/filter model for the voice.
4.1.1 Model
The non-negative mixture spectrogram model consists
in the sum of a voice spectrogram model based on a
source/filter model and a music spectrogram model based
on a standard NMF:
V ≈ ˆV = ˆVvoice
+ ˆVmusic
. (14)
The voice model is based on a source/filter speech pro-
duction model:
ˆVvoice
= (WF0HF0) (WKHK). (15)
The first factor (WF0HF 0) is the source part correspond-
ing to the excitation of the vocal folds: WF0 is a matrix
of fixed harmonic atoms and HF0 is the activation of these
atoms over time. The second factor (WKHK) is the filter
part corresponding to the resonance of the vocal tract: WK
is a matrix of smooth filter atoms and HK is the activation
of these atoms over time.
The background music model is a generic NMF:
ˆVmusic
= WRHR. (16)
4.1.2 Algorithm
Matrices HF0, WK, HK, WR and HR are estimated mini-
mizing the element-wise Itakura-Saito divergence between
the original mixture power spectrogram and the mixture
model:
C(HF0, WK, HK, WR, HR) =
f,t
dIS(Vf,t| ˆVf,t), (17)
where dIS(x, y) = x
y − log(x
y ) − 1. The minimization is
achieved using multiplicative update rules.
The estimation is done in three steps:
1. A first step of parameter estimation is done using
iteratively the multiplicative update rules.
2. The matrix HF0 is processed using a Viterbi decod-
ing for tracking the main melody and is then thresh-
olded so that coefficients too far from the melody are
set to zero.
3. Parameters are re-estimated as in the first step but
using the thresholded version of HF0 for the initial-
ization.
208 Proceedings of the 17th ISMIR Conference, New York City, USA, August 7-11, 2016
4.2 Inclusion of the reverberation model
As stated in Section 3, the dry spectrogram model (i.e. the
spectrogram model for the source without reverberation)
has to be sufficiently constrained in order to accurately es-
timate the reverberation part. This constraint is here ob-
tained through the use of a fixed harmonic dictionary WF0
and mostly, by the thresholding of the matrix HF0 that en-
forces the sparsity of the activations.
We thus introduce the reverberation model after the step
of thresholding of the matrix HF0. The two first steps then
remains the same as presented in Section 4.1.2. In the third
step, the dry voice model of Equation (15) is replaced by a
reverberated voice model following Equation (5):
ˆVrev. voice
f,t =
T
τ=1
ˆVvoice
f,t−τ+1Rf,τ . (18)
For the parameter re-estimation of step 3, the multi-
plicative update rule of R is given by Equation (12). For
the other parameters of the voice model, the update rules
from [3] are modified to take the reverberation model into
account:
HF0 ← HF0
WT
F0 (WKHK) R ∗t ( ˆV β−2
V)
WT
F0 (WKHK) R ∗t
ˆV β−1
(19)
HK ← HK
WT
K (WF0HF0) R ∗t ( ˆV β−2
V)
WT
K (WF0HF0) R ∗t
ˆV β−1
(20)
WK ← WK
(WF0HF0) R ∗t ( ˆV β−2
V) HT
K
(WF0HF0) R ∗t
ˆV β−1 HT
K
(21)
The update rules for the parameters of the music model
(HR and WR), are unchanged and thus identical to those
given in Equations (10) and (11).
5. EXPERIMENTAL RESULTS
5.1 Experimental setup
We tested the reverberation model that we proposed with
the algorithm presented in Section 4 on a task of lead vocal
extraction in a song. In order to assess the improvement of
our model over the existing one, we ran the separation with
and without reverberation modeling.
We used a database composed of 9 song excerpts of
professionally produced music. The total duration of all
excerpts was about 10 minutes. As the use of rever-
beration modeling only makes sense if there is a signifi-
cant amount of it, all the selected excerpts contains a fair
amount of reverberation. This reverberation was already
present in the separated tracks and was not added artifi-
cially by ourselves. On some excerpts, the reverberation
is time-variant: active on some parts and inactive on other,
ducking echo effect ...Some short excerpts, as well as the
separation results, can be played on the companion web-
site 1
.
Spectrograms were computed as the squared modulus
of the STFT of the signal sampled at 44100 Hz, with 4096-
sample (92.9 ms) long Hamming window with 75% over-
lap. The length T of the reverberation matrix was arbitrar-
ily fixed to 52 frames (which corresponds to about 1.2 s)
in order to be sufficient for long reverberations.
5.2 Results
In order to quantify the results we use standard metrics of
source separation as described in [11]: Signal to Distorsion
Ratio (SDR), Signal to Artefact Ratio (SAR) and Signal to
Interference Ratio (SIR).
The results are presented in Figure 1 for the evaluation
of the extracted voice signals and in Figure 2 for the ex-
tracted music signals. The oracle performance, obtained
using the actual spectrograms of the sources to compute
the separation masks, are also reported. As we can see,
adding the reverberation modeling increases all these met-
rics. The SIR is particularly increased in Figure 1 (more
than 5dB): this is mainly because without the reverbera-
tion model, a large part of the reverberation of the voice
leaks in the music model. This is a phenomenon which is
also clearly audible in excerpts with strong reverberation:
using the reverberation model, the long reverberation tail is
mainly heard within the separated voice and is almost not
audible within the separated music. In return, extracted
vocals with the reverberation model tend to have more au-
dible interferences. This result is in part due to the fact
that the pre-estimation of the dry model (step 1 and 2 of
the base algorithm) is not interference-free, so that apply-
ing the reverberation model increases the energy of these
interferences.
Figure 1. Experimental separation results for the voice
stem.
1 http://romain-hennequin.fr/En/demo/reverb_
separation/reverb.html
Proceedings of the 17th ISMIR Conference, New York City, USA, August 7-11, 2016 209
Figure 2. Experimental separation results for the music
stem.
6. CONCLUSION
In this paper we proposed a method to model long-term ef-
fects of reverberation in a source separation application for
which a constrained model of the dry source is available.
Future work should focus on speeding up the algorithm
since, multiple convolutions at each iteration can be time-
consuming. Developing methods to estimate the reverber-
ation duration (of a specific source within a mix) would
also make it possible to automate the whole process. It
could also be interesting to add spatial modeling for multi-
channel processing using full rank spatial variance matrix
and multichannel reverberation matrices.
7. REFERENCES
[1] Simon Arberet, Alexey Ozerov, Ngoc Q. K. Duong,
Emmanuel Vincent, Fr´ed´eric Bimbot, and Pierre Van-
dergheynst. Nonnegative matrix factorization and spa-
tial covariance model for under-determined reverberant
audio source separation. In International Conference
on Information Sciences, Signal Processing and their
applications, pages 1–4, May 2010.
[2] Ngoc Q. K. Duong, Emmanuel Vincent, and R´emi
Gribonval. Under-determined reverberant audio source
separation using a full-rank spatial covariance model.
IEEE Transactions on Audio, Speech, and Language
Processing, 18(7):1830 – 1840, Sept 2010.
[3] Jean-Louis Durrieu, Ga¨el Richard, and Bertrand
David. An iterative approach to monaural musical mix-
ture de-soloing. In International Conference on Acous-
tics, Speech, and Signal Processing (ICASSP), pages
105–108, Taipei, Taiwan, April 2009.
[4] C´edric F´evotte and J´erˆome Idier. Algorithms for non-
negative matrix factorization with the beta-divergence.
Neural Computation, 23(9):2421–2456, September
2011.
[5] Daniel D. Lee and H. Sebastian Seung. Learning the
parts of objects by non-negative matrix factorization.
Nature, 401(6755):788–791, October 1999.
[6] Alexey Ozerov and C´edric F´evotte. Multichannel non-
negative matrix factorization in convolutive mixtures
for audio source separation. IEEE Transactions on Au-
dio, Speech and Language Processing, 18(3):550–563,
2010.
[7] Rita Singh, Bhiksha Raj, and Paris Smaragdis.
Latent-variable decomposition based dereverberation
of monaural and multi-channel signals. In IEEE Inter-
national Conference on Audio and Speech Signal Pro-
cessing, Dallas, Texas, USA, March 2010.
[8] Paris Smaragdis and Judith C. Brown. Non-negative
matrix factorization for polyphonic music transcrip-
tion. In Workshop on Applications of Signal Processing
to Audio and Acoustics, pages 177 – 180, New Paltz,
NY, USA, October 2003.
[9] Paris Smaragdis, Bhiksha Raj, and Madhusudana
Shashanka. Supervised and semi-supervised separation
of sounds from single-channel mixtures. In 7th Inter-
national Conference on Independent Component Anal-
ysis and Signal Separation, London, UK, September
2007.
[10] Emmanuel Vincent, Nancy Bertin, and Roland Badeau.
Adaptive harmonic spectral decomposition for mul-
tiple pitch estimation. IEEE Transactions on Au-
dio, Speech and Language Processing, 18(3):528–537,
March 2010.
[11] Emmanuel Vincent, R´emi Gribonval, and C´edric
F´evotte. Performance measurement in blind audio
source separation. IEEE Transactions on Audio,
Speech and Language Processing, 14(4):1462–1469,
July 2006.
[12] Tuomas Virtanen. Monaural sound source separation
by nonnegative matrix factorization with temporal con-
tinuity and sparseness criteria. IEEE Transactions on
Audio, Speech and Language Processing, 15(3):1066–
1074, March 2007.
210 Proceedings of the 17th ISMIR Conference, New York City, USA, August 7-11, 2016

More Related Content

What's hot

Gst08 tutorial-fast fourier sampling
Gst08 tutorial-fast fourier samplingGst08 tutorial-fast fourier sampling
Gst08 tutorial-fast fourier sampling
ssuser8d9d45
 
F04924352
F04924352F04924352
F04924352
IOSR-JEN
 
Frequency offset estimation with fast acquisition in OFDM system assignment h...
Frequency offset estimation with fast acquisition in OFDM system assignment h...Frequency offset estimation with fast acquisition in OFDM system assignment h...
Frequency offset estimation with fast acquisition in OFDM system assignment h...
Sample Assignment
 
Ee463 communications 2 - lab 1 - loren schwappach
Ee463   communications 2 - lab 1 - loren schwappachEe463   communications 2 - lab 1 - loren schwappach
Ee463 communications 2 - lab 1 - loren schwappachLoren Schwappach
 
5_1555_TH4.T01.5_suwa.ppt
5_1555_TH4.T01.5_suwa.ppt5_1555_TH4.T01.5_suwa.ppt
5_1555_TH4.T01.5_suwa.pptgrssieee
 
Frame Synchronization for OFDMA mode of WMAN
Frame Synchronization for OFDMA mode of WMANFrame Synchronization for OFDMA mode of WMAN
Frame Synchronization for OFDMA mode of WMAN
Pushpa Kotipalli
 
A new architecture for fast ultrasound imaging
A new architecture for fast ultrasound imagingA new architecture for fast ultrasound imaging
A new architecture for fast ultrasound imagingJose Miguel Moreno
 
Ch7 noise variation of different modulation scheme pg 63
Ch7 noise variation of different modulation scheme pg 63Ch7 noise variation of different modulation scheme pg 63
Ch7 noise variation of different modulation scheme pg 63
Prateek Omer
 
Aliasing and Antialiasing filter
Aliasing and Antialiasing filterAliasing and Antialiasing filter
Aliasing and Antialiasing filter
Suresh Mohta
 
Blind Estimation of Carrier Frequency Offset in Multicarrier Communication Sy...
Blind Estimation of Carrier Frequency Offset in Multicarrier Communication Sy...Blind Estimation of Carrier Frequency Offset in Multicarrier Communication Sy...
Blind Estimation of Carrier Frequency Offset in Multicarrier Communication Sy...
IDES Editor
 
Sine and Cosine Fresnel Transforms
Sine and Cosine Fresnel TransformsSine and Cosine Fresnel Transforms
Sine and Cosine Fresnel Transforms
CSCJournals
 
Alamouti ofdm
Alamouti ofdmAlamouti ofdm
Alamouti ofdm
csandit
 
DIGITAL SIGNAL PROCESSING: Sampling and Reconstruction on MATLAB
DIGITAL SIGNAL PROCESSING: Sampling and Reconstruction on MATLABDIGITAL SIGNAL PROCESSING: Sampling and Reconstruction on MATLAB
DIGITAL SIGNAL PROCESSING: Sampling and Reconstruction on MATLAB
Martin Wachiye Wafula
 
Sound Source Localization with microphone arrays
Sound Source Localization with microphone arraysSound Source Localization with microphone arrays
Sound Source Localization with microphone arrays
Ramin Anushiravani
 
An improved dft based channel estimation
An improved dft based channel estimationAn improved dft based channel estimation
An improved dft based channel estimation
sakru naik
 
Stability of Target Resonance Modes: Ina Quadrature Polarization Context
Stability of Target Resonance Modes: Ina Quadrature Polarization ContextStability of Target Resonance Modes: Ina Quadrature Polarization Context
Stability of Target Resonance Modes: Ina Quadrature Polarization Context
IJERA Editor
 
An agent based particle swarm optimization for papr reduction of ofdm systems
An agent based particle swarm optimization for papr reduction of ofdm systemsAn agent based particle swarm optimization for papr reduction of ofdm systems
An agent based particle swarm optimization for papr reduction of ofdm systemsaliasghar1989
 

What's hot (20)

Gst08 tutorial-fast fourier sampling
Gst08 tutorial-fast fourier samplingGst08 tutorial-fast fourier sampling
Gst08 tutorial-fast fourier sampling
 
F04924352
F04924352F04924352
F04924352
 
Frequency offset estimation with fast acquisition in OFDM system assignment h...
Frequency offset estimation with fast acquisition in OFDM system assignment h...Frequency offset estimation with fast acquisition in OFDM system assignment h...
Frequency offset estimation with fast acquisition in OFDM system assignment h...
 
Ax26326329
Ax26326329Ax26326329
Ax26326329
 
Ee463 communications 2 - lab 1 - loren schwappach
Ee463   communications 2 - lab 1 - loren schwappachEe463   communications 2 - lab 1 - loren schwappach
Ee463 communications 2 - lab 1 - loren schwappach
 
cr1503
cr1503cr1503
cr1503
 
5_1555_TH4.T01.5_suwa.ppt
5_1555_TH4.T01.5_suwa.ppt5_1555_TH4.T01.5_suwa.ppt
5_1555_TH4.T01.5_suwa.ppt
 
Frame Synchronization for OFDMA mode of WMAN
Frame Synchronization for OFDMA mode of WMANFrame Synchronization for OFDMA mode of WMAN
Frame Synchronization for OFDMA mode of WMAN
 
A new architecture for fast ultrasound imaging
A new architecture for fast ultrasound imagingA new architecture for fast ultrasound imaging
A new architecture for fast ultrasound imaging
 
Ch7 noise variation of different modulation scheme pg 63
Ch7 noise variation of different modulation scheme pg 63Ch7 noise variation of different modulation scheme pg 63
Ch7 noise variation of different modulation scheme pg 63
 
Aliasing and Antialiasing filter
Aliasing and Antialiasing filterAliasing and Antialiasing filter
Aliasing and Antialiasing filter
 
Blind Estimation of Carrier Frequency Offset in Multicarrier Communication Sy...
Blind Estimation of Carrier Frequency Offset in Multicarrier Communication Sy...Blind Estimation of Carrier Frequency Offset in Multicarrier Communication Sy...
Blind Estimation of Carrier Frequency Offset in Multicarrier Communication Sy...
 
Sine and Cosine Fresnel Transforms
Sine and Cosine Fresnel TransformsSine and Cosine Fresnel Transforms
Sine and Cosine Fresnel Transforms
 
45
4545
45
 
Alamouti ofdm
Alamouti ofdmAlamouti ofdm
Alamouti ofdm
 
DIGITAL SIGNAL PROCESSING: Sampling and Reconstruction on MATLAB
DIGITAL SIGNAL PROCESSING: Sampling and Reconstruction on MATLABDIGITAL SIGNAL PROCESSING: Sampling and Reconstruction on MATLAB
DIGITAL SIGNAL PROCESSING: Sampling and Reconstruction on MATLAB
 
Sound Source Localization with microphone arrays
Sound Source Localization with microphone arraysSound Source Localization with microphone arrays
Sound Source Localization with microphone arrays
 
An improved dft based channel estimation
An improved dft based channel estimationAn improved dft based channel estimation
An improved dft based channel estimation
 
Stability of Target Resonance Modes: Ina Quadrature Polarization Context
Stability of Target Resonance Modes: Ina Quadrature Polarization ContextStability of Target Resonance Modes: Ina Quadrature Polarization Context
Stability of Target Resonance Modes: Ina Quadrature Polarization Context
 
An agent based particle swarm optimization for papr reduction of ofdm systems
An agent based particle swarm optimization for papr reduction of ofdm systemsAn agent based particle swarm optimization for papr reduction of ofdm systems
An agent based particle swarm optimization for papr reduction of ofdm systems
 

Viewers also liked

(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...
(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...
(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...François Rigaud
 
3D Model preparation of a landscape
3D Model preparation of a landscape3D Model preparation of a landscape
3D Model preparation of a landscape
Srijit Sharma
 
Exposé fiche-métier-formateur-delf-dalf
Exposé fiche-métier-formateur-delf-dalfExposé fiche-métier-formateur-delf-dalf
Exposé fiche-métier-formateur-delf-dalf
Sihem Berriri
 
NHT GLOBAL PRESENTACION PARTE I
NHT GLOBAL PRESENTACION PARTE INHT GLOBAL PRESENTACION PARTE I
NHT GLOBAL PRESENTACION PARTE I
NHT GLOBAL LATINO
 
Infografia personal
Infografia personalInfografia personal
Infografia personal
adrian quinatoa
 
Hidrops fetal por sífilis connatal grave en r np t
Hidrops fetal por sífilis connatal grave en r np tHidrops fetal por sífilis connatal grave en r np t
Hidrops fetal por sífilis connatal grave en r np t
alfredo gomez
 
Casos
CasosCasos
Hígado y Vías biliares
Hígado y Vías biliaresHígado y Vías biliares
Hígado y Vías biliares
Verónica Marycruz
 
completed-transcript-9240608
completed-transcript-9240608completed-transcript-9240608
completed-transcript-9240608Farrah Ranzino
 
Hojadevidapersonal1.docx
Hojadevidapersonal1.docxHojadevidapersonal1.docx
Hojadevidapersonal1.docx
Aleja Chalama
 
Infografia redes sociales
Infografia redes socialesInfografia redes sociales
Infografia redes sociales
adrian quinatoa
 
Tarjetas de rorschach
Tarjetas de rorschachTarjetas de rorschach
Tarjetas de rorschach
susana talamantes hernàndez
 

Viewers also liked (14)

(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...
(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...
(2016) Rigaud and Radenen - Singing Voice Melody Transcription Using Deep Neu...
 
3D Model preparation of a landscape
3D Model preparation of a landscape3D Model preparation of a landscape
3D Model preparation of a landscape
 
Exposé fiche-métier-formateur-delf-dalf
Exposé fiche-métier-formateur-delf-dalfExposé fiche-métier-formateur-delf-dalf
Exposé fiche-métier-formateur-delf-dalf
 
Certificates of completion 10-25-16
Certificates of completion 10-25-16Certificates of completion 10-25-16
Certificates of completion 10-25-16
 
NHT GLOBAL PRESENTACION PARTE I
NHT GLOBAL PRESENTACION PARTE INHT GLOBAL PRESENTACION PARTE I
NHT GLOBAL PRESENTACION PARTE I
 
Infografia personal
Infografia personalInfografia personal
Infografia personal
 
Hidrops fetal por sífilis connatal grave en r np t
Hidrops fetal por sífilis connatal grave en r np tHidrops fetal por sífilis connatal grave en r np t
Hidrops fetal por sífilis connatal grave en r np t
 
Casos
CasosCasos
Casos
 
Hígado y Vías biliares
Hígado y Vías biliaresHígado y Vías biliares
Hígado y Vías biliares
 
completed-transcript-9240608
completed-transcript-9240608completed-transcript-9240608
completed-transcript-9240608
 
Hojadevidapersonal1.docx
Hojadevidapersonal1.docxHojadevidapersonal1.docx
Hojadevidapersonal1.docx
 
Forxtrust
ForxtrustForxtrust
Forxtrust
 
Infografia redes sociales
Infografia redes socialesInfografia redes sociales
Infografia redes sociales
 
Tarjetas de rorschach
Tarjetas de rorschachTarjetas de rorschach
Tarjetas de rorschach
 

Similar to (2016) Hennequin and Rigaud - Long-Term Reverberation Modeling for Under-Determined Audio Source Separation with Application to Vocal Melody Extraction

Sparsity based Joint Direction-of-Arrival and Offset Frequency Estimator
Sparsity based Joint Direction-of-Arrival and Offset Frequency EstimatorSparsity based Joint Direction-of-Arrival and Offset Frequency Estimator
Sparsity based Joint Direction-of-Arrival and Offset Frequency Estimator
Jason Fernandes
 
129966864405916304[1]
129966864405916304[1]129966864405916304[1]
129966864405916304[1]威華 王
 
Bz33462466
Bz33462466Bz33462466
Bz33462466
IJERA Editor
 
Wavelets presentation
Wavelets presentationWavelets presentation
Wavelets presentation
Asaph Muhumuza
 
Projected Barzilai-Borwein Methods Applied to Distributed Compressive Spectru...
Projected Barzilai-Borwein Methods Applied to Distributed Compressive Spectru...Projected Barzilai-Borwein Methods Applied to Distributed Compressive Spectru...
Projected Barzilai-Borwein Methods Applied to Distributed Compressive Spectru...
Polytechnique Montreal
 
A time domain clean approach for the identification of acoustic moving source...
A time domain clean approach for the identification of acoustic moving source...A time domain clean approach for the identification of acoustic moving source...
A time domain clean approach for the identification of acoustic moving source...
André Moreira
 
MULTI RESOLUTION LATTICE DISCRETE FOURIER TRANSFORM (MRL-DFT)
MULTI RESOLUTION LATTICE DISCRETE FOURIER TRANSFORM (MRL-DFT)MULTI RESOLUTION LATTICE DISCRETE FOURIER TRANSFORM (MRL-DFT)
MULTI RESOLUTION LATTICE DISCRETE FOURIER TRANSFORM (MRL-DFT)
csandit
 
What Is Fourier Transform
What Is Fourier TransformWhat Is Fourier Transform
What Is Fourier Transform
Patricia Viljoen
 
A Gaussian Clustering Based Voice Activity Detector for Noisy Environments Us...
A Gaussian Clustering Based Voice Activity Detector for Noisy Environments Us...A Gaussian Clustering Based Voice Activity Detector for Noisy Environments Us...
A Gaussian Clustering Based Voice Activity Detector for Noisy Environments Us...
CSCJournals
 
08039246
0803924608039246
08039246
Thilip Kumar
 
Acoustic Model.pptx
Acoustic Model.pptxAcoustic Model.pptx
Acoustic Model.pptx
ManiMaran230751
 
Recovery of low frequency Signals from noisy data using Ensembled Empirical M...
Recovery of low frequency Signals from noisy data using Ensembled Empirical M...Recovery of low frequency Signals from noisy data using Ensembled Empirical M...
Recovery of low frequency Signals from noisy data using Ensembled Empirical M...
inventionjournals
 
Construction of The Sampled Signal Up To Any Frequency While Keeping The Samp...
Construction of The Sampled Signal Up To Any Frequency While Keeping The Samp...Construction of The Sampled Signal Up To Any Frequency While Keeping The Samp...
Construction of The Sampled Signal Up To Any Frequency While Keeping The Samp...
CSCJournals
 
A novel and efficient mixed-signal compressed sensing for wide-band cognitive...
A novel and efficient mixed-signal compressed sensing for wide-band cognitive...A novel and efficient mixed-signal compressed sensing for wide-band cognitive...
A novel and efficient mixed-signal compressed sensing for wide-band cognitive...
Polytechnique Montreal
 
Speech Analysis and synthesis using Vocoder
Speech Analysis and synthesis using VocoderSpeech Analysis and synthesis using Vocoder
Speech Analysis and synthesis using Vocoder
IJTET Journal
 
Packets Wavelets and Stockwell Transform Analysis of Femoral Doppler Ultrasou...
Packets Wavelets and Stockwell Transform Analysis of Femoral Doppler Ultrasou...Packets Wavelets and Stockwell Transform Analysis of Femoral Doppler Ultrasou...
Packets Wavelets and Stockwell Transform Analysis of Femoral Doppler Ultrasou...
IJECEIAES
 
Analysis the results_of_acoustic_echo_cancellation_for_speech_processing_usin...
Analysis the results_of_acoustic_echo_cancellation_for_speech_processing_usin...Analysis the results_of_acoustic_echo_cancellation_for_speech_processing_usin...
Analysis the results_of_acoustic_echo_cancellation_for_speech_processing_usin...Venkata Sudhir Vedurla
 

Similar to (2016) Hennequin and Rigaud - Long-Term Reverberation Modeling for Under-Determined Audio Source Separation with Application to Vocal Melody Extraction (20)

Sparsity based Joint Direction-of-Arrival and Offset Frequency Estimator
Sparsity based Joint Direction-of-Arrival and Offset Frequency EstimatorSparsity based Joint Direction-of-Arrival and Offset Frequency Estimator
Sparsity based Joint Direction-of-Arrival and Offset Frequency Estimator
 
129966864405916304[1]
129966864405916304[1]129966864405916304[1]
129966864405916304[1]
 
Paper1
Paper1Paper1
Paper1
 
Bz33462466
Bz33462466Bz33462466
Bz33462466
 
Bz33462466
Bz33462466Bz33462466
Bz33462466
 
Wavelets presentation
Wavelets presentationWavelets presentation
Wavelets presentation
 
Projected Barzilai-Borwein Methods Applied to Distributed Compressive Spectru...
Projected Barzilai-Borwein Methods Applied to Distributed Compressive Spectru...Projected Barzilai-Borwein Methods Applied to Distributed Compressive Spectru...
Projected Barzilai-Borwein Methods Applied to Distributed Compressive Spectru...
 
A time domain clean approach for the identification of acoustic moving source...
A time domain clean approach for the identification of acoustic moving source...A time domain clean approach for the identification of acoustic moving source...
A time domain clean approach for the identification of acoustic moving source...
 
MULTI RESOLUTION LATTICE DISCRETE FOURIER TRANSFORM (MRL-DFT)
MULTI RESOLUTION LATTICE DISCRETE FOURIER TRANSFORM (MRL-DFT)MULTI RESOLUTION LATTICE DISCRETE FOURIER TRANSFORM (MRL-DFT)
MULTI RESOLUTION LATTICE DISCRETE FOURIER TRANSFORM (MRL-DFT)
 
What Is Fourier Transform
What Is Fourier TransformWhat Is Fourier Transform
What Is Fourier Transform
 
A Gaussian Clustering Based Voice Activity Detector for Noisy Environments Us...
A Gaussian Clustering Based Voice Activity Detector for Noisy Environments Us...A Gaussian Clustering Based Voice Activity Detector for Noisy Environments Us...
A Gaussian Clustering Based Voice Activity Detector for Noisy Environments Us...
 
08039246
0803924608039246
08039246
 
Acoustic Model.pptx
Acoustic Model.pptxAcoustic Model.pptx
Acoustic Model.pptx
 
Recovery of low frequency Signals from noisy data using Ensembled Empirical M...
Recovery of low frequency Signals from noisy data using Ensembled Empirical M...Recovery of low frequency Signals from noisy data using Ensembled Empirical M...
Recovery of low frequency Signals from noisy data using Ensembled Empirical M...
 
Construction of The Sampled Signal Up To Any Frequency While Keeping The Samp...
Construction of The Sampled Signal Up To Any Frequency While Keeping The Samp...Construction of The Sampled Signal Up To Any Frequency While Keeping The Samp...
Construction of The Sampled Signal Up To Any Frequency While Keeping The Samp...
 
A novel and efficient mixed-signal compressed sensing for wide-band cognitive...
A novel and efficient mixed-signal compressed sensing for wide-band cognitive...A novel and efficient mixed-signal compressed sensing for wide-band cognitive...
A novel and efficient mixed-signal compressed sensing for wide-band cognitive...
 
Speech Analysis and synthesis using Vocoder
Speech Analysis and synthesis using VocoderSpeech Analysis and synthesis using Vocoder
Speech Analysis and synthesis using Vocoder
 
L046056365
L046056365L046056365
L046056365
 
Packets Wavelets and Stockwell Transform Analysis of Femoral Doppler Ultrasou...
Packets Wavelets and Stockwell Transform Analysis of Femoral Doppler Ultrasou...Packets Wavelets and Stockwell Transform Analysis of Femoral Doppler Ultrasou...
Packets Wavelets and Stockwell Transform Analysis of Femoral Doppler Ultrasou...
 
Analysis the results_of_acoustic_echo_cancellation_for_speech_processing_usin...
Analysis the results_of_acoustic_echo_cancellation_for_speech_processing_usin...Analysis the results_of_acoustic_echo_cancellation_for_speech_processing_usin...
Analysis the results_of_acoustic_echo_cancellation_for_speech_processing_usin...
 

(2016) Hennequin and Rigaud - Long-Term Reverberation Modeling for Under-Determined Audio Source Separation with Application to Vocal Melody Extraction

  • 1. LONG-TERM REVERBERATION MODELING FOR UNDER-DETERMINED AUDIO SOURCE SEPARATION WITH APPLICATION TO VOCAL MELODY EXTRACTION. Romain Hennequin Deezer R&D 10 rue d’Ath`enes, 75009 Paris, France rhennequin@deezer.com Franc¸ois Rigaud Audionamix R&D 171 quai de Valmy, 75010 Paris, France francois.rigaud@audionamix.com ABSTRACT In this paper, we present a way to model long-term rever- beration effects in under-determined source separation al- gorithms based on a non-negative decomposition frame- work. A general model for the sources affected by rever- beration is introduced and update rules for the estimation of the parameters are presented. Combined with a well- known source-filter model for singing voice, an applica- tion to the extraction of reverberated vocal tracks from polyphonic music signals is proposed. Finally, an objec- tive evaluation of this application is described. Perfor- mance improvements are obtained compared to the same model without reverberation modeling, in particular by significantly reducing the amount of interference between sources. 1. INTRODUCTION Under-determined audio source separation has been a key topic in audio signal processing for the last two decades. It consists in isolating different meaningful ‘parts’ of the sound, such as for instance the lead vocal from the ac- companiment in a song, or the dialog from the background music and effects in a movie soundtrack. Non-negative decompositions such as Non-negative Matrix Factoriza- tion [5] and its derivative have been very popular in this research area for the last decade and have achieved state- of-the art performances [3,9,12]. In music recordings, the vocal track generally contains reverberation that is either naturally present due to the recording environment or artificially added during the mix- ing process. For source separation algorithms, the effects of reverberation are usually not explicitly modeled and thus not properly extracted with the corresponding sources. Some studies [1, 2] introduce a model for the effect of spatial diffusion caused by the reverberation for a multi- channel source separation application. In [7] a model for c Romain Hennequin, Franc¸ois Rigaud. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Romain Hennequin, Franc¸ois Rigaud. “Long-term rever- beration modeling for under-determined audio source separation with ap- plication to vocal melody extraction.”, 17th International Society for Mu- sic Information Retrieval Conference, 2016. This work has been done when the first author was working for Audionamix. the dereverberation of spectrograms is presented for the case of long reverberations, i.e. when the reverberation time is longer than the length of the analysis window. We propose in this paper to extend the model of re- verberation proposed in [7] to a source separation appli- cation that allows extracting the reverberation of a spe- cific source together with its dry signal. The reverbera- tion model is introduced first in a general framework for which no assumption is made about the spectrogram of the dry sources. At this state, and as often in demixing appli- cation, the estimation problem is ill-posed (optimization of a non-convex cost function with local minima, result highly dependent on the initialization, ...) and requires the incorporation of some knowledge about the source signals. In [7], this issue is dealt with a sparsity prior on the un- reverberated spectrogram model. Alternatively, the spec- trogram sources can be structured by using models of non- negative decompositions with constraints (e.g. harmonic structure of the source’s tones, sparsity of the activations) and/or by guiding the estimation process with prior infor- mation (e.g. source activation, multi-pitch transcription). Thus in this paper we propose to combine the generic re- verberation model with a well-known source/filter model of singing voice [3]. A modified version of the original voice extraction algorithm is described and evaluated on an application to the extraction of reverberated vocal melodies from polyphonic music signals. Note that unlike usual application of reverberation mod- eling, we do not aim at extracting dereverbated sources but we try to extract accurately both the dry signal and the re- verberation within the same track. Thus, the designation source separation is not completely in accordance with our application which targets more precisely stem separation. The rest of the paper is organized as follows: Section 2 presents the general model for a reverberated source and Section 3 introduces the update rule used for its estima- tion. In Section 4, a practical implementation for which the reverberation model is combined with a source/filter model is presented. Then, Section 5 presents experimental results that demonstrate the ability of our algorithm to extract properly vocals affected by reverberation. Finally, conclusions are drawn in Section 6. 206
  • 2. Notation • Matrices are denoted by bold capital letters: M. The coefficients at row f and column t of matrix M is denoted by Mf,t. • Vectors are denoted by bold lower case letters: v. • Matrix or vector sizes are denoted by capital letters: T, whereas indexes are denoted with lower case let- ters: t. • Scalars are denoted by italic lower case letters: s. • stands for element-wise matrix multiplication (Hadamard product) and M λ stands for element- wise exponentiation of matrix M with exponent λ. 2. GENERAL MODEL For the sake of clarity, we will present the signal model for mono signals only although it can be easily generalized to multichannel signals as in [6]. In the experimental Section 5, a stereo signal model is actually used. 2.1 Non-negative Decomposition Most source separation algorithms based on a non-negative decomposition assume that the non-negative mixture spec- trogram V (usually the modulus or the squared modulus of a time-frequency representation such as the Short Time Fourier Transform (STFT)) which is a F ×T non-negative matrix, can be approximated as the sum of K source model spectrograms ˆVk , which are also non-negative: V ≈ ˆV = K k=1 ˆVk (1) Various structured matrix decomposition have been pro- posed for the source models ˆVk , such as, to name a few, standard NMF [8], source/filter modeling [3] or harmonic source modeling [10]. 2.2 Reverberation Model In the time-domain, a time-invariant reverberation can be accurately modeled using a convolution with a filter and thus be written as: y = h ∗ x, (2) where x is the dry signal, h is the impulse response of the reverberation filter and y is the reverberated signal. For short-term convolution, this expression can be ap- proximated by a multiplication in the frequency domain such as proposed in [6] : yt = h xt, (3) where xt (respectively yt) is the modulus of the t-th frame of the STFT of x (respectively y) and h is the modulus of the Fourier transform of h. For long-term convolution, this approximation does not hold. The support of typical reverberation filters are gener- ally greater than half a second which is way too long for a STFT analysis window in this kind of application. In this case, as suggested in [7], we can use an other approxima- tion which is a convolution in each frequency channel : yf = h f ∗ xf , (4) Where yf , h f and xf are the f-th frequency channel of the STFT of respectively y, h and x. Then, starting from a dry spectrogram model ˆVdry,k of a source with index k, the reverberated model of the same source is obtained using the following non-negative ap- proximation: ˆVrev,k f,t = Tk τ=1 ˆVdry,k f,t−τ+1Rk f,τ (5) where Rk is the F × Tk non-negative reverberation matrix of model k to be estimated. The model of Equation (5) makes it possible to take long-term effects of reverberation into account and gen- eralizes short-term convolution models as proposed in [6] since when Tk = 1, the model corresponds to the short- term convolution approximation. 3. ALGORITHM 3.1 Non-negative decomposition algorithms The approximation of Equation (1) is generally quantified using a divergence (a measure of dissimilarity) between V and ˆV to be minimized with respect to the set of parame- ters Λ of all the models: C(Λ) = D(V| ˆV(Λ)) (6) A commonly used class of divergence is the element- wise β-divergence which encompasses the Itakura-Saito divergence (β = 0), the Kullback-Leibler divergence (β = 1) and the squared Frobenius distance (β = 2) [4]. The global cost then writes: C(Λ) = f,t dβ(Vf,t| ˆVf,t(Λ)). (7) The problem being not convex, the minimization is gener- ally done using alternating update rules on each parame- ters of Λ. The update rule for a parameter Θ is commonly obtained using an heuristic consisting in decomposing the gradient of the cost-function with respect to this parameter as a difference of two positive terms, such as ΘC = PΘ − MΘ, PΘ ≥ 0, MΘ ≥ 0, (8) and then by updating the parameter according to: Θ ← Θ MΘ PΘ . (9) This kind of update rule ensures that the parameter re- mains non-negative. Moreover the parameter is updated in a direction descent or remains constant if the partial deriva- tive is zero. In some cases (including the update rules Proceedings of the 17th ISMIR Conference, New York City, USA, August 7-11, 2016 207
  • 3. we will present), it is possible to prove using a Majorize- Minimization (MM) approach [4] that the multiplicative update rules actually lead to a decrease of the cost func- tion. Using such an approach, the update rules for a standard NMF model ˆV = WH can be expressed as: H ← H WT ˆV β−2 V WT ˆV β−1 , (10) W ← W ˆV β−2 V HT ˆV β−1HT . (11) 3.2 Estimation of the reverberation matrix When dry models ˆVdry,k are fixed, the reverberation matrix can be estimated using the following update rule applied successively on each reverberation matrix: Rk ← Rk ˆV β−2 V ∗t ˆVdry,k ˆV β−1 ∗t ˆVdry,k (12) where ∗t stands for time-convolution: ˆV β−1 ∗t ˆVdry,k f,τ = T τ=t ˆV β−1 f,τ ˆVdry,k f,τ−t+1. (13) The update rule (12) obtained using the procedure de- scribed in Section 3.1 ensures the non-negativity of Rk . This update can be obtained using the MM approach which ensures that the cost-function will not increase. 3.3 Estimation with other free model A general drawback of source separation models that do not explicitly account for reverberation effects is that the reverberation affecting a given source is usually spread among all separated sources. However, this issue can still arise with a proper model of reverberation if the dry model of the reverberated source is not constrained enough. In- deed using the generic reverberation model of Equation (5), the reverberation of a source can still be incorrectly modeled during the optimization by other source models having more degrees of freedom. A possible solution for enforcing a correct optimization of the reverberated model is to further constrain the structure of the dry model spec- trogram, e.g. through the inclusion of sparsity or pitch ac- tivation constraints, and potentially to adopt a sequential estimation scheme. For instance, first discarding the re- verberation model, a first rough estimate of the dry source may be produced. Second, considering the reverberation model, the dry model previously estimated can be refined while estimating at the same time the reverberation matrix. Such an approach is described in the following section for a practical implementation of the algorithm to the problem of lead vocal extraction from polyphonic music signals. 4. APPLICATION TO VOICE EXTRACTION In this section we propose an implementation of our rever- beration model in a practical case: we use Durrieu’s algo- rithm [3] for lead vocal isolation and add the reverberation model over the voice model. 4.1 Base voice extraction algorithm Durrieu’s algorithm for lead vocal isolation in a song is based on a source/filter model for the voice. 4.1.1 Model The non-negative mixture spectrogram model consists in the sum of a voice spectrogram model based on a source/filter model and a music spectrogram model based on a standard NMF: V ≈ ˆV = ˆVvoice + ˆVmusic . (14) The voice model is based on a source/filter speech pro- duction model: ˆVvoice = (WF0HF0) (WKHK). (15) The first factor (WF0HF 0) is the source part correspond- ing to the excitation of the vocal folds: WF0 is a matrix of fixed harmonic atoms and HF0 is the activation of these atoms over time. The second factor (WKHK) is the filter part corresponding to the resonance of the vocal tract: WK is a matrix of smooth filter atoms and HK is the activation of these atoms over time. The background music model is a generic NMF: ˆVmusic = WRHR. (16) 4.1.2 Algorithm Matrices HF0, WK, HK, WR and HR are estimated mini- mizing the element-wise Itakura-Saito divergence between the original mixture power spectrogram and the mixture model: C(HF0, WK, HK, WR, HR) = f,t dIS(Vf,t| ˆVf,t), (17) where dIS(x, y) = x y − log(x y ) − 1. The minimization is achieved using multiplicative update rules. The estimation is done in three steps: 1. A first step of parameter estimation is done using iteratively the multiplicative update rules. 2. The matrix HF0 is processed using a Viterbi decod- ing for tracking the main melody and is then thresh- olded so that coefficients too far from the melody are set to zero. 3. Parameters are re-estimated as in the first step but using the thresholded version of HF0 for the initial- ization. 208 Proceedings of the 17th ISMIR Conference, New York City, USA, August 7-11, 2016
  • 4. 4.2 Inclusion of the reverberation model As stated in Section 3, the dry spectrogram model (i.e. the spectrogram model for the source without reverberation) has to be sufficiently constrained in order to accurately es- timate the reverberation part. This constraint is here ob- tained through the use of a fixed harmonic dictionary WF0 and mostly, by the thresholding of the matrix HF0 that en- forces the sparsity of the activations. We thus introduce the reverberation model after the step of thresholding of the matrix HF0. The two first steps then remains the same as presented in Section 4.1.2. In the third step, the dry voice model of Equation (15) is replaced by a reverberated voice model following Equation (5): ˆVrev. voice f,t = T τ=1 ˆVvoice f,t−τ+1Rf,τ . (18) For the parameter re-estimation of step 3, the multi- plicative update rule of R is given by Equation (12). For the other parameters of the voice model, the update rules from [3] are modified to take the reverberation model into account: HF0 ← HF0 WT F0 (WKHK) R ∗t ( ˆV β−2 V) WT F0 (WKHK) R ∗t ˆV β−1 (19) HK ← HK WT K (WF0HF0) R ∗t ( ˆV β−2 V) WT K (WF0HF0) R ∗t ˆV β−1 (20) WK ← WK (WF0HF0) R ∗t ( ˆV β−2 V) HT K (WF0HF0) R ∗t ˆV β−1 HT K (21) The update rules for the parameters of the music model (HR and WR), are unchanged and thus identical to those given in Equations (10) and (11). 5. EXPERIMENTAL RESULTS 5.1 Experimental setup We tested the reverberation model that we proposed with the algorithm presented in Section 4 on a task of lead vocal extraction in a song. In order to assess the improvement of our model over the existing one, we ran the separation with and without reverberation modeling. We used a database composed of 9 song excerpts of professionally produced music. The total duration of all excerpts was about 10 minutes. As the use of rever- beration modeling only makes sense if there is a signifi- cant amount of it, all the selected excerpts contains a fair amount of reverberation. This reverberation was already present in the separated tracks and was not added artifi- cially by ourselves. On some excerpts, the reverberation is time-variant: active on some parts and inactive on other, ducking echo effect ...Some short excerpts, as well as the separation results, can be played on the companion web- site 1 . Spectrograms were computed as the squared modulus of the STFT of the signal sampled at 44100 Hz, with 4096- sample (92.9 ms) long Hamming window with 75% over- lap. The length T of the reverberation matrix was arbitrar- ily fixed to 52 frames (which corresponds to about 1.2 s) in order to be sufficient for long reverberations. 5.2 Results In order to quantify the results we use standard metrics of source separation as described in [11]: Signal to Distorsion Ratio (SDR), Signal to Artefact Ratio (SAR) and Signal to Interference Ratio (SIR). The results are presented in Figure 1 for the evaluation of the extracted voice signals and in Figure 2 for the ex- tracted music signals. The oracle performance, obtained using the actual spectrograms of the sources to compute the separation masks, are also reported. As we can see, adding the reverberation modeling increases all these met- rics. The SIR is particularly increased in Figure 1 (more than 5dB): this is mainly because without the reverbera- tion model, a large part of the reverberation of the voice leaks in the music model. This is a phenomenon which is also clearly audible in excerpts with strong reverberation: using the reverberation model, the long reverberation tail is mainly heard within the separated voice and is almost not audible within the separated music. In return, extracted vocals with the reverberation model tend to have more au- dible interferences. This result is in part due to the fact that the pre-estimation of the dry model (step 1 and 2 of the base algorithm) is not interference-free, so that apply- ing the reverberation model increases the energy of these interferences. Figure 1. Experimental separation results for the voice stem. 1 http://romain-hennequin.fr/En/demo/reverb_ separation/reverb.html Proceedings of the 17th ISMIR Conference, New York City, USA, August 7-11, 2016 209
  • 5. Figure 2. Experimental separation results for the music stem. 6. CONCLUSION In this paper we proposed a method to model long-term ef- fects of reverberation in a source separation application for which a constrained model of the dry source is available. Future work should focus on speeding up the algorithm since, multiple convolutions at each iteration can be time- consuming. Developing methods to estimate the reverber- ation duration (of a specific source within a mix) would also make it possible to automate the whole process. It could also be interesting to add spatial modeling for multi- channel processing using full rank spatial variance matrix and multichannel reverberation matrices. 7. REFERENCES [1] Simon Arberet, Alexey Ozerov, Ngoc Q. K. Duong, Emmanuel Vincent, Fr´ed´eric Bimbot, and Pierre Van- dergheynst. Nonnegative matrix factorization and spa- tial covariance model for under-determined reverberant audio source separation. In International Conference on Information Sciences, Signal Processing and their applications, pages 1–4, May 2010. [2] Ngoc Q. K. Duong, Emmanuel Vincent, and R´emi Gribonval. Under-determined reverberant audio source separation using a full-rank spatial covariance model. IEEE Transactions on Audio, Speech, and Language Processing, 18(7):1830 – 1840, Sept 2010. [3] Jean-Louis Durrieu, Ga¨el Richard, and Bertrand David. An iterative approach to monaural musical mix- ture de-soloing. In International Conference on Acous- tics, Speech, and Signal Processing (ICASSP), pages 105–108, Taipei, Taiwan, April 2009. [4] C´edric F´evotte and J´erˆome Idier. Algorithms for non- negative matrix factorization with the beta-divergence. Neural Computation, 23(9):2421–2456, September 2011. [5] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, October 1999. [6] Alexey Ozerov and C´edric F´evotte. Multichannel non- negative matrix factorization in convolutive mixtures for audio source separation. IEEE Transactions on Au- dio, Speech and Language Processing, 18(3):550–563, 2010. [7] Rita Singh, Bhiksha Raj, and Paris Smaragdis. Latent-variable decomposition based dereverberation of monaural and multi-channel signals. In IEEE Inter- national Conference on Audio and Speech Signal Pro- cessing, Dallas, Texas, USA, March 2010. [8] Paris Smaragdis and Judith C. Brown. Non-negative matrix factorization for polyphonic music transcrip- tion. In Workshop on Applications of Signal Processing to Audio and Acoustics, pages 177 – 180, New Paltz, NY, USA, October 2003. [9] Paris Smaragdis, Bhiksha Raj, and Madhusudana Shashanka. Supervised and semi-supervised separation of sounds from single-channel mixtures. In 7th Inter- national Conference on Independent Component Anal- ysis and Signal Separation, London, UK, September 2007. [10] Emmanuel Vincent, Nancy Bertin, and Roland Badeau. Adaptive harmonic spectral decomposition for mul- tiple pitch estimation. IEEE Transactions on Au- dio, Speech and Language Processing, 18(3):528–537, March 2010. [11] Emmanuel Vincent, R´emi Gribonval, and C´edric F´evotte. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech and Language Processing, 14(4):1462–1469, July 2006. [12] Tuomas Virtanen. Monaural sound source separation by nonnegative matrix factorization with temporal con- tinuity and sparseness criteria. IEEE Transactions on Audio, Speech and Language Processing, 15(3):1066– 1074, March 2007. 210 Proceedings of the 17th ISMIR Conference, New York City, USA, August 7-11, 2016