1. Comparison of Single Channel
Blind Dereverberation Methods
for Speech Signals
Deha Deniz Türköz - MSc Thesis
Thesis Supervisor: Hakan Erdoğan
Sabancı Üniversitesi
27.06.2016
2. OUTLINE
1) Introduction
2) Background
a) Features of speech
b) Reverberation model
c) Room impulse response (RIR)
d) Non-negative matrix factorization (NMF)
3) Blind-Dereverberation Methods
a) Delayed linear prediction (DLP)
b) Weighted prediction error (G-WPE)
c) Laplacian based Weighted Prediction Error (L-WPE)
d) NMF based spectral modeling (NMF+N-CTF)
e) Sparsity penalized weighted least squares method (SPWLS)
4) Experiments and Comparisons
5) Discussion and Conclusion
4. 1. Introduction
Reverberation:
● is an effect that occurs in speech signals due to
reflections from walls,
● decreases speech intelligibility,
● degrades applications such as ASR and hands-free
teleconferencing,
● can be modeled with an LTI filter.
5. ● If the filter h is known, then the clean signal s can be
recovered with a simple deconvolution operation called
dereverberation.
● In most cases h and s are unknown and x is the only known
signal. Estimating h and s from x is called the
"blind-dereverberation problem", which is the main subject
of this work.
1. Introduction
6. The aim of this work is to compare the existing
blind-dereverberation methods
○ DLP: delayed linear prediction,
○ G-WPE: Gaussian based weighted prediction error,
○ L-WPE: Laplacian based weighted prediction error,
○ NMF+N-CTF: NMF based spectral-temporal modeling,
and to offer a new algorithm called
○ SPWLS: sparsity penalized weighted least squares.
1. Introduction
8. 2a. Features of Speech
● Speech is a signal created by the human vocal system.
● The input of the vocal tract is called the glottal signal:
○ white noise (unvoiced sounds),
○ impulse train (voiced sounds).
● The vocal tract can be modeled as an all-pole filter,
which means speech production is a simple LTI filtering
operation on a glottal signal.
9. 2a. Features of Speech
● Speech signals are non-stationary.
● General approach: divide the signal into small time
segments and assume each of them is stationary.
● To analyze speech: the short-time Fourier transform (STFT).
● The STFT divides the speech signal into overlapping
segments called frames by using a window function, then
calculates the DFT of these frames.
10. 2a. Features of Speech
Formulation of the STFT:
X(n,k) = sum_{m=0}^{N-1} W[m] x[m + nL] e^(-j2πkm/N)
L: frame shift,
N: frame size,
X(n,k): discrete STFT coefficients of the speech signal x[m]
at frame n and frequency bin k,
W[m]: Hamming window.
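The formulation above can be sketched in a few lines of code. A minimal pure-Python STFT, assuming a Hamming window and the L/N notation of this slide (function names are illustrative, not from the thesis):

```python
import cmath, math

def hamming(N):
    # Hamming window of length N
    return [0.54 - 0.46 * math.cos(2 * math.pi * m / (N - 1)) for m in range(N)]

def stft(x, N=8, L=4):
    # X[n][k] = sum_m W[m] * x[m + n*L] * exp(-j*2*pi*k*m/N)
    W = hamming(N)
    frames = []
    for start in range(0, len(x) - N + 1, L):
        seg = [W[m] * x[start + m] for m in range(N)]
        frames.append([sum(seg[m] * cmath.exp(-2j * math.pi * k * m / N)
                           for m in range(N)) for k in range(N)])
    return frames  # one row of N complex DFT coefficients per frame

# usage: a pure tone at bin 2 shows energy concentrated at that bin
x = [math.cos(2 * math.pi * 2 * t / 8) for t in range(32)]
X = stft(x, N=8, L=4)
```

For a pure tone, each frame's energy concentrates at the tone's frequency bin (and its mirror bin).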
11. 2a. Features of Speech
● The STFT of a signal can be interpreted as a matrix whose
columns hold the complex DFT coefficients of each frame.
12. 2a. Features of Speech
● To visualize how a signal's frequency content changes with
respect to time: the spectrogram.
● The spectrogram S(n,k) uses the power spectral density
(PSD) values of the STFT matrix X(n,k) as intensity values
in a 2D image:
S(n,k) = |X(n,k)|^2
14. 2b. Reverberation Model
● A reverberation environment can be modeled as an LTI
filter, which is called the room impulse response (RIR).
● Reverberation model:
x(t) = (h * s)(t) = sum_l h(l) s(t - l)
h(t): RIR, unknown
s(t): clean signal (anechoic signal), unknown
x(t): reverberated signal (echoed signal), known
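The reverberation model is plain convolution of the clean signal with the RIR; a small sketch with made-up toy values:

```python
def convolve(h, s):
    # x(t) = sum_l h(l) * s(t - l): LTI filtering of the clean signal s by the RIR h
    x = [0.0] * (len(s) + len(h) - 1)
    for t in range(len(x)):
        for l in range(len(h)):
            if 0 <= t - l < len(s):
                x[t] += h[l] * s[t - l]
    return x

s = [1.0, 0.0, 0.0, 0.0]    # a single clean impulse
h = [1.0, 0.0, 0.5, 0.25]   # toy RIR: direct path plus decaying echoes
x = convolve(h, s)          # the reverberated signal repeats the RIR shape
```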
17. 2c. Room Impulse Response (RIR)
The length of the RIR depends on
● room size,
● room temperature,
● room shape,
● the microphone's distance to the speech source,
● absorption of sound in the room.
RT60: the time required for the reflected signal to drop by
60 dB.
● The RIR shows FIR filter characteristics.
18. 2c. Room Impulse Response (RIR)
Usually the RIR is divided into two parts:
1. Early reverberation
2. Late reverberation: the most detrimental part of the echo
x(t) = d(t) + r(t) + n(t)
n(t): noise
d(t): early echo + clean signal
(desired signal)
r(t): late echo
Lh: the length of the RIR
h(t): RIR (early echo + late echo)
19. 2c. Room Impulse Response (RIR)
Then, the early and late reverberations are
d(t) = sum_{l=0}^{D-1} h(l) s(t - l)
r(t) = sum_{l=D}^{Lh-1} h(l) s(t - l)
D: the length of early reverberation
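The early/late split above can be illustrated by accumulating the first D taps of the RIR into d(t) and the rest into r(t) (toy values, not real RIR data):

```python
def split_reverberation(h, s, D):
    # d(t) = sum_{l=0}^{D-1} h(l) s(t-l)   (clean signal + early echo)
    # r(t) = sum_{l=D}^{Lh-1} h(l) s(t-l)  (late echo)
    T = len(s) + len(h) - 1
    d = [0.0] * T
    r = [0.0] * T
    for t in range(T):
        for l in range(len(h)):
            if 0 <= t - l < len(s):
                part = d if l < D else r
                part[t] += h[l] * s[t - l]
    return d, r

h = [1.0, 0.5, 0.2, 0.1]   # toy RIR
s = [1.0, -1.0]
d, r = split_reverberation(h, s, D=2)
# d(t) + r(t) reproduces the full convolution x(t)
```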
22. 2d. Non-negative Matrix Factorization (NMF)
NMF: decomposition of a matrix V as the product of two
matrices B and G with non-negative entries, V ≈ BG.
B: basis or dictionary matrix, G: weight or gains matrix.
● This problem can be interpreted as an optimization
problem as follows:
min_{B,G ≥ 0} C(V, BG)
where C is the cost function measuring the distance
between V and BG.
23. 2d. Non-negative Matrix Factorization (NMF)
● The columns of B are called basis vectors.
● The number of columns of B is kept smaller than the size
of V.
● Iterative algorithms are utilized to solve the NMF
problem, since there is no unique solution.
● The initial B & G matrices can be randomized positive
numbers, or supervised matrices for fast convergence.
● Popular distance functions between V and BG are:
○ Euclidean distance,
○ Kullback-Leibler (KL) divergence,
○ Itakura-Saito (IS) divergence.
24. 2d. Non-negative Matrix Factorization (NMF)
The Kullback-Leibler divergence between V and BG is defined as [6]:
C_KL(V || BG) = sum_{i,j} [ V_ij log( V_ij / (BG)_ij ) - V_ij + (BG)_ij ]
(in matrix form this is written using "1", the matrix of
ones of the same size as V).
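The KL-based NMF problem can be solved with the well-known Lee-Seung multiplicative updates [6]; a compact pure-Python sketch on a tiny rank-1 matrix (sizes, seed, and iteration count are arbitrary choices for illustration):

```python
import random

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def kl_nmf(V, R, iters=200, seed=0):
    # V (I x J) ~ B (I x R) @ G (R x J), all entries non-negative,
    # minimizing KL divergence via Lee-Seung multiplicative updates
    rng = random.Random(seed)
    I, J = len(V), len(V[0])
    B = [[rng.uniform(0.1, 1.0) for _ in range(R)] for _ in range(I)]
    G = [[rng.uniform(0.1, 1.0) for _ in range(J)] for _ in range(R)]
    for _ in range(iters):
        BG = matmul(B, G)
        # G[r][j] *= (sum_i B[i][r] V[i][j] / BG[i][j]) / (sum_i B[i][r])
        for r in range(R):
            for j in range(J):
                num = sum(B[i][r] * V[i][j] / BG[i][j] for i in range(I))
                den = sum(B[i][r] for i in range(I))
                G[r][j] *= num / den
        BG = matmul(B, G)
        # B[i][r] *= (sum_j G[r][j] V[i][j] / BG[i][j]) / (sum_j G[r][j])
        for i in range(I):
            for r in range(R):
                num = sum(G[r][j] * V[i][j] / BG[i][j] for j in range(J))
                den = sum(G[r][j] for j in range(J))
                B[i][r] *= num / den
    return B, G

V = [[1.0, 2.0, 1.0], [2.0, 4.0, 2.0], [3.0, 6.0, 3.0]]  # exact rank-1 matrix
B, G = kl_nmf(V, R=1)
# BG closely reconstructs V since V has exact rank 1
```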
25. 2d. Non-negative Matrix Factorization (NMF)
● NMF is a non-convex problem and has multiple local
minima. As a result, B and G can vary for the same matrix
V.
● NMF is a common method used in speech processing, deep
learning, clustering, and computer vision.
● In speech processing, NMF has applications in audio
source separation, source/filter modeling, blind
dereverberation [3][4], speech denoising, and so on.
28. 3a. Delayed Linear Prediction (DLP)
● DLP estimates inverse filter coefficients from the
reverberated signal.
● An inverse filter of length Lw can be used to
approximately obtain a dereverberated signal as:
d(t) = x(t) - sum_{tau=D}^{D+Lw-1} c(tau) x(t - tau)
● In matrix form, this can be written as d = x - Xc, where
the rows of X contain delayed samples of x.
30. 3a. Delayed Linear Prediction (DLP)
● This means the desired signal can be estimated using only
the reverberated signal and its past samples.
● Then, the inverse filter is
w = [1, 0, ..., 0, -c(D), ..., -c(D+Lw-1)].
● The number of leading zeros in the inverse filter vector
is equal to the delay D.
● In conclusion, the DLP algorithm is a simple technique to
achieve dereverberation.
● However, it may not work well in many cases, because the
inverse filter is constrained to be FIR.
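A minimal time-domain sketch of the DLP idea, assuming the standard delayed-linear-prediction formulation: fit prediction coefficients on samples delayed by at least D via least squares, then keep the prediction residual as the desired signal. This is an illustration (normal equations solved by plain Gaussian elimination), not the Levinson-Durbin implementation used in the thesis.

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting for A z = b
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z

def dlp(x, D, Lw):
    # Fit c minimizing sum_t (x(t) - sum_tau c(tau) x(t - D - tau))^2,
    # then return the residual d(t) = x(t) - prediction (the desired signal)
    T = len(x)
    rows = [[x[t - D - tau] for tau in range(Lw)] for t in range(D + Lw, T)]
    y = [x[t] for t in range(D + Lw, T)]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(Lw)] for i in range(Lw)]
    b = [sum(row[i] * yi for row, yi in zip(rows, y)) for i in range(Lw)]
    c = solve(A, b)
    d = list(x)
    for t in range(D + Lw, T):
        d[t] = x[t] - sum(c[tau] * x[t - D - tau] for tau in range(Lw))
    return d
```

On a signal that exactly follows a delayed recursion, the residual vanishes wherever the prediction applies.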
32. 3b. Weighted Prediction Error (G-WPE)
Assumption 1: the speech signal has a local Gaussian
distribution over small frames of length Lf,
Assumption 2: samples are mutually uncorrelated after a
certain distance,
Assumption 3: the variance is constant within short-time
frames of size Lf.
33. 3b. Weighted Prediction Error (G-WPE)
● Dereverberation can be done both in the time domain and in
the STFT domain.
● Using the time domain is very costly because of the very
large matrices involved, so the STFT domain will be used.
● The probability density function of the desired signal in
the STFT domain:
p(d(n,k)) = (1 / (π σ²(n,k))) exp( -|d(n,k)|² / σ²(n,k) )
n: frame number, k: frequency bin, σ²(n,k): time-varying
variance.
34. 3b. Weighted Prediction Error (G-WPE)
● Variance values alter only with respect to time frames.
● Apply likelihood maximization to the Gaussian pdf. Then,
the log likelihood function for the dereverberation
process in the STFT domain becomes:
L(k) = -sum_n [ log( π σ²(n,k) ) + |d(n,k)|² / σ²(n,k) ]
The parameter vector for likelihood maximization collects the
prediction coefficients and the variances.
35. 3b. Weighted Prediction Error (G-WPE)
Maximizing this function with respect to the parameter vector
cannot be achieved analytically; there is no closed-form
solution. Thus, an iterative algorithm is needed.
36. 3b. Weighted Prediction Error (G-WPE)
A two-step procedure was proposed in [1] to solve the
likelihood maximization problem:
1. Keep the variances fixed and solve for the prediction
coefficients to maximize the likelihood, then obtain the
desired-signal estimate d(n,k);
2. Keep the coefficients fixed and update the variances;
and so on, until a convergence criterion is satisfied or a
maximum number of iterations is completed.
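For a single frequency bin, the two-step G-WPE procedure can be sketched as below (a simplified single-channel illustration; the variance floor ε and the toy sizes D and Lw are assumptions for the sketch, not the thesis's settings):

```python
def wls_solve(A, b):
    # Gaussian elimination with partial pivoting for complex A z = b
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0j] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z

def gwpe_bin(x, D=1, Lw=2, iters=5, eps=1e-6):
    # x: complex STFT coefficients of one frequency bin over frames
    N = len(x)
    d = list(x)  # initial desired-signal estimate
    for _ in range(iters):
        # Step 2 (variance update): sigma2(n) = |d(n)|^2, floored by eps
        s2 = [max(abs(v) ** 2, eps) for v in d]
        # Step 1 (coefficients): weighted normal equations A g = b, weights 1/sigma2
        rows = [[x[n - D - t] for t in range(Lw)] for n in range(D + Lw, N)]
        w = [1.0 / s2[n] for n in range(D + Lw, N)]
        A = [[sum(wn * r[i].conjugate() * r[j] for wn, r in zip(w, rows))
              for j in range(Lw)] for i in range(Lw)]
        b = [sum(wn * r[i].conjugate() * x[n]
                 for wn, r, n in zip(w, rows, range(D + Lw, N)))
             for i in range(Lw)]
        g = wls_solve(A, b)
        # d(n) = x(n) - sum_t g(t) x(n - D - t)
        d = list(x)
        for n in range(D + Lw, N):
            d[n] = x[n] - sum(g[t] * x[n - D - t] for t in range(Lw))
    return d
```

On a synthetic bin that exactly follows a two-tap delayed recursion, the predictable (reverberant) part is removed almost completely.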
39. 3c. Laplacian Based Weighted Prediction Error (L-WPE)
L-WPE in [2] suggests that speech can be modeled more
precisely with a Laplacian model than with a Gaussian model
in the STFT domain.
● Assumption 1: the speech signal has a local Laplacian
distribution over small frames of length Lf,
● Assumption 2: the STFT coefficients of the desired signal
have, for each time-frequency bin, an equal variance for
their independent imaginary and real parts.
40. 3c. Laplacian Based Weighted Prediction Error (L-WPE)
Then, the pdf of the Laplacian model is obtained.
As in the G-WPE method, maximum likelihood estimation (ML)
will be utilized for the parameter vector, which yields the
likelihood function.
41. 3c. Laplacian Based Weighted Prediction Error (L-WPE)
There is no closed form for the likelihood function, so it is
solved numerically:
1. Keep the variances fixed and solve for the prediction
coefficients to maximize the likelihood (or minimize the
l1 norm), then obtain the desired-signal estimate;
2. Keep the coefficients fixed and update the variances.
Step 1: fix the variances & update the coefficients.
The likelihood function can be rewritten in terms of the
coefficients.
43. 3c. Laplacian Based Weighted Prediction Error (L-WPE)
Then, the problem can be cast as a linear programming problem.
44. 3c. Laplacian Based Weighted Prediction Error (L-WPE)
Step 2: fix the coefficients & update the variances.
After writing the log likelihood and maximizing it with
respect to the variance, a closed-form solution for the
variance is obtained.
● These two steps proceed until a convergence criterion is
satisfied or the maximum number of iterations has been
reached.
47. 3d. NMF based spectral modeling (NMF+N-CTF)
● The method in [3] is a combined version of the
non-negative convolutive transfer function (N-CTF) model
and non-negative matrix factorization (NMF).
● N-CTF model assumption: for each frequency bin, the
convolution (across frames) of the power spectrograms of
the clean speech signal and the RIR gives the power
spectrogram of the reverberated signal:
|X(n,k)|² ≈ sum_l |H(l,k)|² |S(n-l,k)|²
48. 3d. NMF based spectral modeling (NMF+N-CTF)
Assumptions:
● The phase elements of the STFT coefficients at different
frames are mutually independent.
● Each coefficient is a zero-mean random variable with a
Gaussian distribution.
● The clean signal & RIR spectral coefficients are mutually
independent.
For simplicity, set x(n,k) = |X(n,k)|², and likewise for
s(n,k) and h(n,k). (This notation differs from the other
methods.)
49. 3d. NMF based spectral modeling (NMF+N-CTF)
The Kullback-Leibler (KL) divergence will be used to estimate
the power spectrogram of s(n,k) from the previous equation,
minimizing KL( x(n,k) || sum_l h(l,k) s(n-l,k) ),
where sum_l h(l,k) s(n-l,k) is the estimated power
spectrogram of the reverberated signal.
50. 3d. NMF based spectral modeling (NMF+N-CTF)
To acquire a more accurate estimation, the sparsity of the
clean speech spectrogram can be added as a regularization
term with a weight.
As a non-negativity constraint, s(n,k) and h(n,k) are
expected to be greater than zero.
51. 3d. NMF based spectral modeling (NMF+N-CTF)
This model can be solved with an iterative update scheme.
52. 3d. NMF based spectral modeling (NMF+N-CTF)
Let's add the NMF approach:
the clean speech magnitude spectrogram S can be formulated
as the product of a dictionary matrix B and a weight matrix
G, S ≈ BG, where
R: the number of basis vectors in the dictionary matrix B
(the dictionary size); R < N, the number of frames of S.
53. 3d. NMF based spectral modeling (NMF+N-CTF)
After combining the N-CTF and NMF methods, the problem
becomes a joint optimization over h, B, and G.
Approach: keep two of the factors fixed and update the
remaining one in turn, until a convergence criterion has been
met or the maximum number of iterations has been reached.
55. 3d. NMF based spectral modeling (NMF+N-CTF)
● To remove scale ambiguity, after each iteration each
column of B is normalized to sum to one.
● The columns of H are element-wise divided by the first
column of H.
● The nature of an RIR is a train of decaying impulses.
● The mapping coefficient matrix between the clean speech
signal and the reverberated speech signal can be
formulated accordingly.
57. 3d. NMF based spectral modeling (NMF+N-CTF)
● The basis matrix B and the weight matrix G are initialized
with randomized non-negative numbers for the online
method.
● B & G can be initialized with supervised matrices to
increase efficiency.
● In this work, we employ the online method.
59. 3e. Sparsity Penalized Weighted Least Squares Method (SPWLS)
❖ SPWLS combines the idea of variance normalization via a
weight matrix with the sparsity property of speech
spectrogram matrices.
❖ To enforce sparsity of a variable, l1-norm regularization
is generally used.
❖ With l1 regularization, the optimization problem, also
known as the Lasso problem, requires an iterative
algorithm to solve.
❖ Some popular algorithms to solve the Lasso problem are
➢ ISTA (iterative shrinkage and thresholding algorithm) [7],
➢ FISTA,
➢ SALSA.
60. 3e. Sparsity Penalized Weighted Least Squares Method (SPWLS)
The convolution equation (in the STFT domain, with fixed
frequency k) can be rewritten in matrix form as:
x = H s + n
Then, with an l1 regularization term for sparsity, we need to
solve the Lasso problem:
min_s ||x - H s||^2 + λ ||s||_1
n: noise signal, s: clean speech signal, x: reverberated
signal, H: convolution matrix of the RIR.
61. 3e. Sparsity Penalized Weighted Least Squares Method (SPWLS)
● Add weights to the problem, as in the L-WPE and G-WPE
methods.
● Add an extra regularization on the norm of the filter h to
make sure we do not get a trivial solution.
● Our optimization loss function becomes a weighted Lasso
objective plus a penalty keeping the norm of h near a
target value, where
λ: regularization parameter,
W: diagonal weight matrix with 1/(std) values,
k: frequency index (fixed), n: frame index.
62. 3e. Sparsity Penalized Weighted Least Squares Method (SPWLS)
● The objective is non-differentiable because of the l1
term.
● s & h need to be calculated numerically with an iterative
approach.
● Our approach requires a good initialization for s & h,
which can be obtained from an earlier method such as
G-WPE.
● Our approach: perform alternating updates of s and h, each
minimizing the objective function with respect to the
corresponding variable.
● For updating s & h, the ISTA algorithm is utilized.
63. 3e. Sparsity Penalized Weighted Least Squares Method (SPWLS)
ISTA minimizes functions of the form f(s) + g(s), where the
first function is differentiable and the second function is
usually not differentiable, but simple.
Step 1 to update s: take a gradient descent step on the first
function f(.):
s_tmp = s_i - μ ∇f(s_i)    (i: iteration index)
The result is an intermediate solution.
● The gradient of the first function f(.) is:
∇f(s) = -2 H^H W (x - H s)
64. 3e. Sparsity Penalized Weighted Least Squares Method (SPWLS)
μ: positive step size parameter, indicating how far we move
along the negative gradient.
Step 2 to update s: a proximal operator step of g(.) is
performed around that intermediate solution as follows:
s_{i+1} = soft(s_tmp, a)
The proximal step corresponds to a thresholding/shrinkage
operation for the l1-norm penalty:
soft(z, a) = (z / |z|) max(|z| - a, 0)
Basically, this step erases the components with small energy
and shrinks the other parts (a = λμ for our algorithm).
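The two ISTA steps can be demonstrated on a tiny real-valued Lasso problem (H, λ, and the step size are toy values; SPWLS applies the same steps to complex STFT data with the weighted objective):

```python
def soft(z, a):
    # proximal operator of a*|.|: zero small components, shrink the rest
    return [max(abs(v) - a, 0.0) * (1 if v >= 0 else -1) for v in z]

def ista(H, x, lam, mu, iters=500):
    # minimize ||x - H s||^2 + lam * ||s||_1
    n = len(H[0])
    s = [0.0] * n
    for _ in range(iters):
        Hs = [sum(H[i][j] * s[j] for j in range(n)) for i in range(len(H))]
        grad = [-2 * sum(H[i][j] * (x[i] - Hs[i]) for i in range(len(H)))
                for j in range(n)]
        # gradient step, then proximal (soft-threshold) step with a = lam * mu
        s = soft([s[j] - mu * grad[j] for j in range(n)], lam * mu)
    return s

# with H = identity, the minimizer is soft-thresholding of x itself
H = [[1.0, 0.0], [0.0, 1.0]]
s = ista(H, [3.0, 0.2], lam=1.0, mu=0.4)
```

The small component (0.2) is driven exactly to zero while the large one (3.0) is shrunk, which is the sparsifying behavior described above.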
65. 3e. Sparsity Penalized Weighted Least Squares Method (SPWLS)
● After the update of s, we update the W matrix according to
the new variance values of s.
● Now we need to solve the problem for h, updating it
accordingly.
● Use ISTA again:
Step 1 to update h: the minimizer of f(.) is a simple
least-squares problem with an exact solution.
66. 3e. Sparsity Penalized Weighted Least Squares Method (SPWLS)
Step 2 to update h: a proximal operation step for the
regularization of h.
● The step size parameter for the inner gradient descent
iteration for s can be set to change at each iteration,
decaying from an initial step size according to
hyperparameters and the inner and outer iteration indices.
69. TEST DATA
Experiment 1: 3 male & 3 female clean voices convolved with 6
different RIR samples, with 30 dB and 60 dB additive noise
(for the DLP, G-WPE, NMF+N-CTF, and SPWLS methods).
72 different samples have been dereverberated.
Experiment 2: 1 male and 1 female clean voice convolved with
5 different RIR samples, with 30 dB and 60 dB additive noise
(for all methods).
20 different samples have been dereverberated.
● Test data has been taken from the "REVERB Challenge" data
set.
70. TEST DATA
● The sampling frequency was 16 kHz for all files.
● Reverberation times (RT60) were 0.17, 0.11, 0.95, 0.33,
0.54, and 0.35 s respectively.
● The L-WPE method was not run only for the RT60 = 0.95 s
case, due to its excessive run time.
● As additive noise, cafe environment noise at 30 dB and
60 dB levels has been used.
71. SETUP
● The delayed frame size D was set to 3 frames for the
G-WPE, L-WPE, and DLP methods.
● Lf, the number of frames used for variance calculations,
is set to 1 frame for the G-WPE, L-WPE, and SPWLS methods.
● The iteration number for the G-WPE, L-WPE, and SPWLS
methods is set to 5.
● The iteration number for the NMF+N-CTF method is set to
100.
● STFT parameters: hop size = 10 ms, window size = 30 ms.
● Minimum variance to avoid divisions by zero: v = 1e-6.
● The number of STFT frames used for prediction changes
with respect to the RT60 estimate of each recording.
72. SETUP
SPWLS parameters specific to this method are
● step size = 1e-7,
● ISTA regularization parameter = 1e5,
● inner iteration number for ISTA, i = 10,
● ISTA regularization parameter for the filter = 10.
● The SPWLS initialization for the RIR, H, is set as the
output of the G-WPE method.
The NMF+N-CTF method has
● a dictionary matrix size ndict of 100,
● and uses the online variant of the method.
73. COMPUTATIONAL EFFICIENCY
● All the algorithms are implemented in MATLAB on a
computer with an Intel Xeon CPU at 2.5 GHz.
● The fastest one is the SPWLS method; then G-WPE, DLP,
NMF+N-CTF, and L-WPE come in order.
● L-WPE is very slow due to its internal linear programming
(LP) part; the CVX toolbox for MATLAB is utilized for the
LP part.
● Run times for the data with RT60 = 0.54 s:
○ L-WPE: ~one day
○ NMF+N-CTF: ~1.5 hours (with 100 iterations, ndict = 100)
○ G-WPE: ~4 min (5 iterations)
○ SPWLS: ~2 min (5 iterations)
○ DLP: ~3 min (1 iteration) - implemented with the
Levinson-Durbin algorithm
74. TEST METHODS
● The accuracy of the dereverberation process is measured
with an average cepstral distortion (CD) test over short
time frames.
● CD is a popular measure of speech quality between the
clean signal and the reconstructed signal.
c_k: the clean speech signal's cepstral coefficients from
1st to 12th order,
ĉ_k: the estimated speech signal's cepstral coefficients
from 1st to 12th order,
c_0: the zero-order coefficient, denoting the power
spectrum envelope in dB.
● The CD between similar signals converges to 0.
● Our aim is to keep the CD as small as possible after the
dereverberation process.
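A sketch of the CD measure under the common log-spectral convention (cepstra from the inverse DFT of the log magnitude spectrum; the 10/ln 10 constant and the order range 1-12 are the usual choices and may differ in detail from the thesis):

```python
import cmath, math

def real_cepstrum(frame):
    N = len(frame)
    # log magnitude spectrum (floored to avoid log(0)), then inverse DFT
    logmag = [math.log(max(abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / N)
                                   for t in range(N))), 1e-10)) for k in range(N)]
    return [sum(logmag[k] * cmath.exp(2j * math.pi * k * q / N)
                for k in range(N)).real / N for q in range(N)]

def cepstral_distortion(frame1, frame2, order=12):
    # CD = (10/ln 10) * sqrt(2 * sum_{k=1}^{order} (c_k - c_hat_k)^2),
    # excluding c_0 (the power envelope term)
    c1 = real_cepstrum(frame1)
    c2 = real_cepstrum(frame2)
    return (10.0 / math.log(10)) * math.sqrt(
        2.0 * sum((c1[q] - c2[q]) ** 2 for q in range(1, order + 1)))
```

Identical frames give CD = 0, and the CD grows as the spectral envelopes diverge, matching the behavior described above.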
75. TEST METHODS
● STOI, the short-time objective intelligibility measure:
for short-time frames, STOI compares the temporal
envelopes of the clean and dereverberated speech in terms
of correlation coefficients.
● PESQ, Perceptual Evaluation of Speech Quality: a common
standardized test method for speech quality. Three types
of PESQ measure are applied.
● A signal-to-noise ratio (SNR) test between the clean
signal and the dereverberated signal.
● Segmental SNR (segSNR): SNR results for short time
frames.
104. DISCUSSION & CONCLUSION
● The best test results belong to the L-WPE method.
● Considering time efficiency together with test results,
G-WPE works better and could suit real-time applications.
● The L-WPE algorithm is much more complex than G-WPE
because of its linear programming part; thus, it runs very
slowly.
● NMF+N-CTF results:
○ the algorithm converges,
○ test results are not as good as reported in the paper,
○ the method could perform better with a good
initialization or a supervised dictionary matrix,
○ increasing the dictionary size improves test results,
but increasing the iteration number does not always
improve them,
○ no phase information is used.
105. DISCUSSION & CONCLUSION
● L-WPE was slower and G-WPE faster than DLP for one iteration.
● SPWLS could not show good performance in CD. To improve
the performance, more constraints can be set on h; in
SPWLS we try to eliminate the whole echo, not only the
late part as in G-WPE, L-WPE & DLP. Also, the step size
might be decreased.
● SPWLS shows promise thanks to its time efficiency, SNR,
and PESQ results.
● The spectrogram results show that L-WPE and G-WPE
successfully eliminate the late reverberant parts.
● DLP is utilized just to make comparisons with the L-WPE
and G-WPE methods, since they are rooted in the DLP
method. As expected, L-WPE and G-WPE are better.
106. REFERENCES
[1] Nakatani, Tomohiro, et al. "Speech dereverberation based on variance-normalized delayed linear prediction." IEEE Transactions
on Audio, Speech, and Language Processing 18.7 (2010): 1717-1731.
[2] Jukić, Ante, and Simon Doclo. "Speech dereverberation using weighted prediction error with Laplacian model of the desired
signal." 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014.
[3] Mohammadiha, Nasser, Paris Smaragdis, and Simon Doclo. "Joint acoustic and spectral modeling for speech dereverberation
using non-negative representations." 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
IEEE, 2015.
[4] Mohammadiha, Nasser, and Simon Doclo. "Speech dereverberation using non-negative convolutive transfer function and
spectro-temporal modeling." IEEE/ACM Transactions on Audio, Speech, and Language Processing 24.2 (2016): 276-289.
[5] Selesnick, Ivan. "Introduction to sparsity in signal processing." Connexions (2012).
[6] Lee, Daniel D., and H. Sebastian Seung. "Algorithms for non-negative matrix factorization." Advances in Neural Information
Processing Systems. 2001.
[7] Combettes, Patrick L., and Jean-Christophe Pesquet. "Proximal splitting methods in signal processing." Fixed-Point Algorithms
for Inverse Problems in Science and Engineering. Springer New York, 2011. 185-212.