1) The document presents a frame-online DNN-WPE dereverberation method that uses a neural network to directly estimate the power spectral density (PSD) of target speech from observations for use in weighted prediction error (WPE) dereverberation.
2) Experiments on the REVERB challenge and WSJ+VoiceHome datasets show the DNN-based method improves word error rates over unprocessed baselines, particularly for online processing with two microphones.
3) While performance degrades with increased latency constraints and complex background noise, the DNN-WPE still provides an improvement over no enhancement and enables real-time dereverberation.
4. Introduction
◉ Apart from interfering sources and noise, reverberation has the most severe impact on the intelligibility
of the target speech signal.
◉ The interfering signals and noise are commonly suppressed by beamforming, while reverberation can be
addressed with signal processing aimed at suppressing the late reverberation.
◉ The Weighted Prediction Error (WPE) method showed very good performance in the REVERB
challenge and is used in commercially successful products such as the Google Home.
◉ WPE dereverberates the signal by estimating an inverse filter which is then used to subtract the
reverberation tail from the observation.
5. Introduction
◉ By design, the quality of this filter mainly depends on the estimate of the power spectral density (PSD)
of the target, i.e., the anechoic speech signal and its early reflections, referred to as "direct speech".
◉ Since this target is unknown, the conventional WPE works iteratively by alternating between two
steps:
1. dereverberating the signal by filters calculated using estimated direct speech PSD;
2. estimating the direct speech PSD using the filtered signal.
◉ This, however, makes vanilla WPE an inherently offline and computationally expensive method.
6. Introduction
◉ To overcome this dependence on iterative PSD estimation:
A. <Kinoshita et al.> proposed to utilize a neural network to directly estimate the PSD from the
observation. Furthermore, this enables a blockwise online usage of WPE.
B. <Caroselli et al.> use a smoothed version of the observed signal PSD as an approximation for the
direct signal PSD and also derive a recursive formulation of WPE. This enables
real-time, frame-by-frame dereverberation.
◉ In this work, we consider a combination of both approaches.
◉ We investigate whether the recursive formulation can profit from a more sophisticated PSD estimated by
a neural network, and how the latency constraint impacts the overall performance of the system
compared to the offline and block-online variants.
8. Task
◉ Using D microphones, we observe a signal which is represented as a D-dimensional vector in
the short-time Fourier transform (STFT) domain.
◉ We assume that for ASR the early part of the room impulse response (RIR) is beneficial, whereas the
reverberation tail deteriorates the recognition and should therefore be suppressed.
◉ Note that we explicitly allow RIRs longer than the length of a DFT window.
◉ The target ("direct speech") is the source signal convolved with the early part of the RIR.
9. WPE
◉ The underlying idea of WPE is to estimate the reverberation tail of the signal and subtract it from the
observation to obtain an optimal estimate of the direct speech in a maximum likelihood sense.
◉ With K filter taps and a delay of △ frames (chosen to avoid whitening the speech source), the stacked
past observations are filtered and subtracted from the current frame:

x̂_{t,f} = y_{t,f} − G_f^H ỹ_{t−△,f}

where ỹ_{t−△,f} stacks the K delayed frames of all microphones (integrating all mics for the
estimation, written as a matrix multiplication), G_f denotes the filter weights, (·)* the conjugate,
and (·)^H the conjugate transpose; the monaural case follows with a single microphone.
◉ The number of taps (K) determines the amount of memory needed to implement the filter.
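The subtraction step above can be sketched for a single frequency bin as follows (a minimal NumPy sketch; the function name and the reference-channel convention are assumptions, not the paper's implementation):

```python
import numpy as np

def dereverberate(Y, G, taps, delay):
    """Apply a given WPE prediction filter for one frequency bin.
    Y: (D, T) complex STFT frames (D mics, T frames);
    G: (taps*D,) filter weights.
    Returns the dereverberated reference channel x_hat = y - G^H y_tilde."""
    D, T = Y.shape
    # Stack `taps` past frames, starting `delay` frames back, so the
    # prediction cannot cancel the direct speech (zero-padded at the start).
    Y_tilde = np.zeros((taps * D, T), dtype=complex)
    for k in range(taps):
        s = delay + k
        Y_tilde[k * D:(k + 1) * D, s:] = Y[:, :T - s]
    # Subtract the predicted reverberation tail from the reference channel.
    return Y[0] - G.conj() @ Y_tilde
```

With an all-zero filter the observation passes through unchanged; the filter weights G are what WPE actually estimates.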
10. WPE
◉ From the assumption that the direct speech x_{t,f} follows a zero-mean complex Gaussian distribution
with a time-varying variance (PSD) λ_{t,f}, there is no closed-form solution for the likelihood
optimization, but an iterative procedure which alternates between estimating the filter coefficients and
the variance exists:
Step 1) dereverberate the signal with filter coefficients G_f computed for each frequency from the
weighted correlation statistics of the observation;
Step 2) re-estimate the time-varying PSD from the dereverberated signal, λ̂_{t,f} = |x̂_{t,f}|².
◉ Once we have an estimator for the PSD which relies only on the observation (e.g., a DNN estimator),
the block-online solution is straightforward.
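The alternating procedure can be illustrated per frequency bin as a self-contained sketch (hypothetical helper name; initializing with the observed PSD corresponds to the crude approximation, and the regularization constants are assumptions):

```python
import numpy as np

def iterative_wpe(Y, taps=4, delay=2, iterations=3):
    """Vanilla (offline) WPE for one frequency bin: alternate between
    solving for the prediction filter and re-estimating the direct-speech
    PSD from the dereverberated signal.  Y: (D, T) complex STFT frames."""
    D, T = Y.shape
    # Build the stacked, delayed observation matrix once.
    Y_tilde = np.zeros((taps * D, T), dtype=complex)
    for k in range(taps):
        s = delay + k
        Y_tilde[k * D:(k + 1) * D, s:] = Y[:, :T - s]
    x_hat = Y[0].copy()                # initialise with the observation
    for _ in range(iterations):
        lam = np.maximum(np.abs(x_hat) ** 2, 1e-10)   # PSD estimate
        w = 1.0 / lam                                  # per-frame weights
        R = (Y_tilde * w) @ Y_tilde.conj().T           # weighted correlation
        r = (Y_tilde * w) @ Y[0].conj()
        G = np.linalg.solve(R + 1e-8 * np.eye(taps * D), r)
        x_hat = Y[0] - G.conj() @ Y_tilde              # dereverberate
    return x_hat
```

Each pass re-weights the correlation statistics by the inverse of the current PSD estimate, which is what makes the procedure iterative in the first place.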
11. DNN variants
◉ Vanilla WPE starts from a crude approximation: the PSD of the direct speech is taken to be equal to
that of the observation, with averaging over a longer sequence to mitigate the effect of the
approximation error.
◉ DNN-WPE replaces this approximation with a PSD estimated by a neural network, which makes it
suitable for block-online processing with lower latency and fewer iterations.
12. Recursive variants
◉ To arrive at the recursive variant, which is causal and more flexible in time, the correlation matrix is
estimated with an exponentially decaying window (forgetting factor α):

R_{t,f} = α R_{t−1,f} + ỹ_{t−△,f} ỹ_{t−△,f}^H / λ_{t,f}

◉ This leads to a recursive least-squares style solution with time-varying updates of the filter
coefficients and of the inverse correlation matrix.
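Assuming the decaying-window statistics admit a standard weighted recursive-least-squares update via the matrix inversion lemma, a frame-by-frame sketch could look as follows (the exact update equations of <Caroselli et al.> may differ in detail; names, initialization, and constants are assumptions):

```python
import numpy as np

def recursive_wpe(Y, psd, taps=4, delay=2, alpha=0.99):
    """Causal, frame-by-frame WPE for one frequency bin as a weighted
    RLS recursion.  Y: (D, T) complex STFT frames; psd: (T,) target-PSD
    estimate (e.g. from smoothing or a DNN); alpha: forgetting factor."""
    D, T = Y.shape
    KD = taps * D
    G = np.zeros(KD, dtype=complex)            # prediction filter
    R_inv = np.eye(KD, dtype=complex) * 1e2    # inverse correlation matrix
    x_hat = np.zeros(T, dtype=complex)
    for t in range(T):
        # Stack K past frames starting `delay` frames back (zero-padded).
        y_tilde = np.zeros(KD, dtype=complex)
        for k in range(taps):
            s = t - delay - k
            if s >= 0:
                y_tilde[k * D:(k + 1) * D] = Y[:, s]
        lam = max(psd[t], 1e-10)
        # A-priori dereverberated frame (prediction error of the filter).
        x_hat[t] = Y[0, t] - G.conj() @ y_tilde
        # RLS gain, then update R^{-1} and the filter weights.
        Ry = R_inv @ y_tilde
        gain = Ry / (alpha * lam + y_tilde.conj() @ Ry)
        G = G + gain * np.conj(x_hat[t])
        R_inv = (R_inv - np.outer(gain, y_tilde.conj() @ R_inv)) / alpha
    return x_hat
```

Because each frame is processed once and only the previous statistics are kept, the memory and latency stay constant, which is what enables real-time operation.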
13. Recursive variants
◉ <Caroselli et al.> approximate the PSD of the target signal using a smoothed PSD of the observation,
averaging the observed power over a window of left and right context frames.
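A minimal sketch of this smoothing, assuming the power is averaged over microphones and over a window of `left` past and `right` future frames (the names are placeholders, not the paper's notation):

```python
import numpy as np

def smoothed_psd(Y, left=1, right=0):
    """Approximate the direct-speech PSD for one frequency bin by
    smoothing the observed power over `left` past and `right` future
    frames, averaged over microphones.  Y: (D, T) complex STFT frames."""
    power = np.mean(np.abs(Y) ** 2, axis=0)          # (T,) observed power
    T = len(power)
    out = np.empty(T)
    for t in range(T):
        lo, hi = max(0, t - left), min(T, t + right + 1)
        out[t] = power[lo:hi].mean()                 # local average
    return out
```

Note that any right context (`right > 0`) introduces look-ahead and thus latency, which is why a causal setting would keep it at zero.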
15. PSD estimator
◉ Empirically, we use 5/10 filter taps (K) and uniformly sample the delay (△) to be in the range [1, 4].
◉ According to preliminary experiments, the DFT window size is set to 512 samples (32 ms) and the shift
to 128 samples (8 ms), as well as the decaying scalar for the recursive WPE variant and
3 iterations for vanilla WPE.
◉ For smoothing the PSD, we use a left/right context size of 1/0.
◉ We then choose the best configuration according to the results on the development set.
16. ASR setting
◉ The acoustic model is a wide bi-directional residual network (WBRN)
trained on multi-condition data.
◉ The decoder uses the standard 3-gram WSJ0 language model.
◉ Since WPE preserves the number of channels and our acoustic model
is a single-channel model, we always only use the first microphone
channel for decoding.
17. Evaluation
◉ To compare the described approaches, we evaluate the systems in terms of word error rates
(WERs) on the data of the REVERB challenge as well as on Wall Street Journal (WSJ)+VoiceHome
data.
◉ We evaluate the performance on the eight-channel as well as on the two-channel task.
◉ The baselines are Unprocessed and Iteration. The former uses no enhancement and the latter is
vanilla WPE.
◉ Those baselines are compared with Smooth, where the PSD is approximated by smoothing the
observation, and DNN, which uses a separately trained PSD estimation network.
18. Evaluation
◉ These systems are evaluated for three different latency constraints (if applicable):
➢ offline, the whole utterance is available for processing;
➢ block-online, the system is provided with blocks of 2 s;
➢ online, the system operates on a frame-by-frame basis.
◉ The former two settings both use the WPE formulation outlined on P.10, while the online setting
uses the recursive formulation defined on P.12.
20. REVERB
◉ In the worst case (2 microphones, online), the DNN-based system still yields an improvement of over
10 % over the unprocessed baseline.
◉ In the offline scenario, it is able to match the result of the iterating baseline while avoiding the
iterations. (Vanilla WPE proves to be a strong baseline.)
◉ All configurations outperform the unprocessed baseline, and results get worse with tighter latency,
as expected. For reference, a GMM-based Kaldi system reaches 31 % WER and a CHiME-3 recipe DNN 24 %.
21. WSJ+VoiceHome
◉ All systems again outperform the unprocessed baseline, but by a smaller margin, and results again
get worse with tighter latency, as expected.
◉ The VoiceHome background noise is very dynamic and typically found in households, e.g., a vacuum
cleaner, dish washing, or interviews on television.
◉ Given that WPE itself omits noise in its formulation, this outcome was not expected and might
indicate that the training target for the DNN can be improved when complicated noise is present.
23. Conclusions
◉ In this paper, we show that the recursive WPE formulation can be improved with a more
sophisticated PSD estimator, resulting in a 5–10 % relative reduction in WER.
◉ Furthermore, it enables operating on a frame-by-frame basis and can thus be deployed in a
real-time scenario.
◉ We also find that simply smoothing the observation results in surprisingly good performance, even
if noise is present.
◉ In future work, we plan to investigate whether we can train the DNN-based PSD estimator by directly
minimizing the loss function of the acoustic model. That way, we avoid explicitly specifying what
the direct signal is and also eliminate the need for parallel training data.