Frame-Online DNN-WPE
Dereverberation
Presenter : 何冠勳 61047017s
Date : 2022/10/31
Jahn Heymann et al.,
“Frame-Online DNN-WPE Dereverberation,”
in IWAENC 2018
Outline
2
1 Introduction
2 Background
3 Experiments
4 Results
5 Conclusions
Introduction
What? Why?
1
3
Introduction
4
◉ Apart from interfering sources and noise, reverberation has the most severe impact on the intelligibility
of the target speech signal.
◉ The interfering signals and noise are commonly suppressed by beamforming, while reverberation
can be addressed with signal processing aimed at suppressing the late reverberation.
◉ The Weighted Prediction Error (WPE) method showed very good performance in the REVERB
challenge and is used in commercially successful products such as the Google Home.
◉ WPE dereverberates the signal by estimating an inverse filter which is then used to subtract the
reverberation tail from the observation.
Introduction
5
◉ A drawback of this design is that the quality of the filter mainly depends on the estimate of the power
spectral density (PSD) of the target, i.e., the anechoic speech signal and its early reflections, referred to as
"direct speech".
◉ Since this target is unknown, the conventional WPE works iteratively by alternating between two
steps:
1. dereverberating the signal by filters calculated using estimated direct speech PSD;
2. estimating the direct speech PSD using the filtered signal.
◉ This, however, inherently makes the vanilla WPE an offline method and computationally expensive.
Introduction
6
◉ To overcome this dependency issue:
A. <Kinoshita et al.> proposed using a neural network to estimate the PSD directly from the
observation. Furthermore, this enables blockwise online usage of WPE.
B. <Caroselli et al.> use a smoothed version of the observed signal PSD as an approximation for the
direct signal PSD and also create a recursive formulation of WPE. This option enables
real-time, frame-by-frame dereverberation.
◉ In this work, we consider a combination of both approaches.
◉ We investigate whether the recursive formulation can profit from a more sophisticated PSD estimated by
a neural network, and how the latency constraint impacts the overall performance of the system
compared to the offline and block-online variants.
Background
2
7
- Task
- WPE
- DNN variants
- Recursive variants
Task
8
◉ Using $D$ microphones, we observe a signal which is represented as the $D$-dimensional vector $\mathbf{y}_{t,f}$ in
the short-time Fourier transform (STFT) domain.
◉ We assume that, for ASR, the early part of the room impulse response (RIR) is beneficial, whereas the
reverberation tail deteriorates the recognition and should therefore be suppressed.
◉ Note that we explicitly allow RIRs longer than the length of a DFT window.
◉ The target is the source signal convolved with the early part of the RIR.
WPE
9
◉ The underlying idea of WPE is to estimate the reverberation tail of the signal and subtract it from the
observation to obtain an optimal estimate of the direct speech in a maximum likelihood sense.
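◉ Monaural case: the tail is predicted from past frames of the single channel using the filter weights.
◉ Multi-channel case: all microphones are integrated for the estimation, and the prediction can be written as a
matrix multiplication.
◉ The filter takes K past frames, with a delay of △ frames to avoid whitening the speech source.
◉ Notation: * denotes the complex conjugate; H denotes the conjugate transpose.
◉ The number of taps (K) determines the amount of memory needed to implement the filter.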
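The prediction described on this slide can be sketched in equations as follows. The notation is an assumption for illustration: $\mathbf{y}_{t,f}$ is the observation over all microphones at frame $t$ and frequency $f$, $\tilde{\mathbf{y}}_{t-\Delta,f}$ stacks the $K$ delayed frames $\mathbf{y}_{t-\Delta,f}, \dots, \mathbf{y}_{t-\Delta-K+1,f}$, and $g_{k,f}$ and $\mathbf{G}_f$ hold the filter weights.

Monaural:

$$\hat{x}_{t,f} = y_{t,f} - \sum_{k=\Delta}^{\Delta+K-1} g_{k,f}^{*}\, y_{t-k,f}$$

Multi-channel, in matrix form:

$$\hat{\mathbf{x}}_{t,f} = \mathbf{y}_{t,f} - \mathbf{G}_f^{\mathsf{H}}\, \tilde{\mathbf{y}}_{t-\Delta,f}$$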
WPE
10
◉ Under the assumption that the direct speech follows a zero-mean complex Gaussian distribution with an unknown,
time-varying variance (its PSD), there is no closed-form solution for the likelihood
optimization, but there is an iterative procedure which alternates between estimating the filter coefficients and
the variance, separately for each frequency.
◉ Once we have an estimator for the PSD which relies only on the observation (e.g., a DNN estimator), the
block-online solution is straightforward.
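The alternation between the filter step and the variance step can be sketched as follows for a single frequency bin. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names, the regularization term `reg`, the PSD floor and the default parameter values are all hypothetical choices made for illustration.

```python
import numpy as np

def stack_delayed(Y, taps, delay):
    """Stack `taps` delayed frames of Y (D, T) into an array of shape (D * taps, T).

    Column t holds y[t - delay], ..., y[t - delay - taps + 1];
    frames before the start of the signal are treated as zeros.
    """
    D, T = Y.shape
    Y_tilde = np.zeros((D * taps, T), dtype=Y.dtype)
    for k in range(taps):
        shift = delay + k
        if shift < T:
            Y_tilde[k * D:(k + 1) * D, shift:] = Y[:, :T - shift]
    return Y_tilde

def estimate_filter(Y, Y_tilde, psd, reg=1e-8):
    """Filter step: weighted least squares given a PSD estimate of shape (T,)."""
    weighted = Y_tilde / psd                 # scale every frame by 1 / psd[t]
    R = weighted @ Y_tilde.conj().T          # correlation matrix, (D*taps, D*taps)
    P = weighted @ Y.conj().T                # cross-correlation, (D*taps, D)
    # Assumed diagonal loading, only to keep the linear solve well conditioned.
    R = R + reg * np.trace(R).real / R.shape[0] * np.eye(R.shape[0])
    return np.linalg.solve(R, P)             # filter G, (D*taps, D)

def wpe_one_bin(Y, taps=10, delay=3, iterations=3, floor=1e-10):
    """Iterative WPE for one frequency bin; Y is complex with shape (D, T)."""
    Y_tilde = stack_delayed(Y, taps, delay)
    X = Y.copy()
    for _ in range(iterations):
        # Variance step: PSD from the current estimate, averaged over channels.
        psd = np.maximum(np.mean(np.abs(X) ** 2, axis=0), floor)
        # Filter step, then subtract the predicted reverberation tail.
        G = estimate_filter(Y, Y_tilde, psd)
        X = Y - G.conj().T @ Y_tilde
    return X
```

In the first pass the PSD is taken from the observation itself, which corresponds to the crude initialization of vanilla WPE; replacing the variance step with an external estimator removes the need to iterate.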
DNN variants
11
◉ Vanilla WPE starts from a crude approximation (the PSD of the direct speech is set equal to that of the
observation) and averages over a longer sequence to mitigate the effect of the approximation error.
◉ DNN-WPE is suitable for block-online processing, with lower latency and fewer iterations.
Recursive variants
12
◉ To arrive at the recursive variant, which is causal and more flexible in time, the correlation matrix is
estimated with a decaying window.
◉ This leads to a solution in which the filter is updated at every frame and is therefore time-varying.
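A frame-by-frame update of this kind can be sketched with a standard recursive least squares (RLS) recursion. This is a sketch under assumptions, not the exact update equations from the slide: the function name, the identity initialization of the inverse correlation matrix and the default `alpha=0.99` are hypothetical, and the PSD is assumed to be supplied per frame by some external estimator.

```python
import numpy as np

def recursive_wpe_one_bin(Y, psd, taps=10, delay=3, alpha=0.99):
    """Frame-online WPE for one frequency bin.

    Y:   complex observation, shape (D, T)
    psd: positive PSD estimate per frame, shape (T,)
    """
    D, T = Y.shape
    DK = D * taps
    R_inv = np.eye(DK, dtype=complex)        # inverse correlation matrix (assumed init)
    G = np.zeros((DK, D), dtype=complex)     # time-varying filter
    # Zero-pad the past so that every frame has `taps` delayed frames available.
    Y_pad = np.concatenate(
        [np.zeros((D, delay + taps - 1), dtype=complex), Y], axis=1)
    X = np.zeros((D, T), dtype=complex)
    for t in range(T):
        # Stacked frames t - delay, ..., t - delay - taps + 1 (past frames only).
        y_tilde = Y_pad[:, t:t + taps][:, ::-1].T.reshape(-1)
        # Dereverberate with the filter from the previous frame.
        x = Y[:, t] - G.conj().T @ y_tilde
        # RLS gain with forgetting factor alpha and weight 1 / psd[t].
        num = R_inv @ y_tilde
        gain = num / (alpha * psd[t] + y_tilde.conj() @ num)
        # Update the inverse correlation matrix and the filter.
        R_inv = (R_inv - np.outer(gain, y_tilde.conj() @ R_inv)) / alpha
        G = G + np.outer(gain, x.conj())
        X[:, t] = x
    return X
```

Because each output frame depends only on the current and past observations, the loop body can run as soon as a new STFT frame arrives.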
Recursive variants
13
◉ <Caroselli et al.> approximate the PSD of the target signal with a smoothed PSD of the observation,
averaged over a left and right context of frames.
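The smoothing can be sketched as a moving average of the observed power over channels and neighbouring frames. This is a minimal sketch, not the exact formula from the slide: the function name, the handling of the signal edges (averaging only over available frames) and the floor value are assumptions.

```python
import numpy as np

def smoothed_psd(Y, left=1, right=0, floor=1e-10):
    """Approximate the target PSD by smoothing the observed power.

    Y: complex observation for one frequency bin, shape (D, T).
    Returns an array of shape (T,): the power averaged over all channels
    and over the frames t - left, ..., t + right.
    """
    power = np.mean(np.abs(Y) ** 2, axis=0)      # average over channels, (T,)
    T = power.shape[0]
    psd = np.empty(T)
    for t in range(T):
        lo = max(0, t - left)                    # clip the context at the edges
        hi = min(T, t + right + 1)
        psd[t] = power[lo:hi].mean()
    return np.maximum(psd, floor)                # avoid division by zero later
```

With `right=0` the estimate uses no future frames, so it adds no look-ahead latency to frame-online processing.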
Experiments
3
14
- PSD estimator
- ASR setting
- Evaluation
PSD estimator
15
◉ Empirically, we use 5/10 filter taps (K) and uniformly sample the delay (△) in the range
[1, 4].
◉ According to preliminary experiments, the DFT window size is set to 512 samples (32 ms) and the shift to
128 samples (8 ms); preliminary experiments also fix the decaying scalar for the recursive WPE variant and
the number of iterations (3) for vanilla WPE.
◉ For the smoothed PSD, we use a left/right context size of 1/0.
◉ We then choose the best configuration according to the results on the development set.
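The STFT figures above imply a few derived quantities, worked out in the short sketch below. The 2 s block length is taken from the block-online evaluation setting; the one-sided spectrum and the absence of padding are assumptions made for this arithmetic.

```python
WINDOW_SAMPLES = 512   # DFT window size
WINDOW_MS = 32         # window duration
SHIFT_SAMPLES = 128    # frame shift
BLOCK_SECONDS = 2      # block length of the block-online setting

# 512 samples spanning 32 ms implies the sampling rate.
sample_rate_hz = WINDOW_SAMPLES * 1000 // WINDOW_MS

# Consistency check: 128 samples at that rate correspond to the stated 8 ms.
shift_ms = SHIFT_SAMPLES * 1000 / sample_rate_hz

# Frames in one 2 s block, assuming no padding at the block edges.
block_samples = BLOCK_SECONDS * sample_rate_hz
frames_per_block = 1 + (block_samples - WINDOW_SAMPLES) // SHIFT_SAMPLES

# Frequency bins of a one-sided spectrum; WPE runs independently in each bin.
frequency_bins = WINDOW_SAMPLES // 2 + 1

print(sample_rate_hz, shift_ms, frames_per_block, frequency_bins)
```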
ASR setting
16
◉ The acoustic model is a wide bi-directional residual network (WBRN),
trained on multi-condition data.
◉ The decoder uses the standard 3-gram WSJ0 language model.
◉ Since WPE preserves the number of channels and our acoustic model
is a single-channel model, we always use only the first microphone
channel for decoding.
Evaluation
◉ To compare the described approaches, we evaluate the systems in terms of word error rates
(WERs) on the data of the REVERB challenge as well as on Wall Street Journal (WSJ)+VoiceHome
data.
◉ We evaluate the performance on the eight-channel as well as on the two-channel task.
◉ The baselines are Unprocessed and Iteration. The former uses no enhancement and the latter is
vanilla WPE.
◉ Those baselines are compared with Smooth, where the PSD is approximated by smoothing the
observation, and DNN, which uses a separately trained PSD estimation network.
17
Evaluation
◉ These systems are evaluated for three different latency constraints (if applicable):
➢ offline, the whole utterance is available for processing;
➢ block-online, the system is provided with blocks of 2 s;
➢ online, the system operates on a frame-by-frame basis.
◉ The former two settings both use the WPE formulation outlined on slide 10, while the online setting
uses the recursive formulation defined on slide 12.
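The offline and block-online settings differ only in how many frames are handed to the enhancement at once, which can be sketched as follows. The helper name and the choice to process each block independently (no statistics carried over between blocks) are assumptions for illustration; `enhance` stands for any function that maps a block of STFT frames to an enhanced block of the same shape.

```python
import numpy as np

def process_blockwise(Y, block_frames, enhance):
    """Apply `enhance` independently to consecutive blocks along the time axis.

    Y: array whose last axis is time, e.g. shape (D, T).
    block_frames: frames per block; use T for the offline setting.
    """
    T = Y.shape[-1]
    outputs = [
        enhance(Y[..., start:start + block_frames])
        for start in range(0, T, block_frames)
    ]
    return np.concatenate(outputs, axis=-1)
```

The frame-online setting does not fit this pattern, because the recursive formulation carries its filter state from one frame to the next instead of starting over for every block.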
18
Results
- REVERB
- WSJ+VoiceHome
4
19
REVERB
20
◉ In the worst case (2 microphones, online), the DNN-based system still yields an improvement of over
10 % over the unprocessed baseline.
◉ In the offline scenario, it is able to match the result of the iterating baseline but avoids the iterations.
(Vanilla WPE proves to be a strong baseline.)
◉ Table notes: all systems are better than unprocessed; results get worse as expected.
◉ For reference: GMM-based Kaldi system: 31 %; CHiME-3 recipe DNN: 24 %.
WSJ+VoiceHome
21
◉ Table notes: all systems are better than unprocessed, but by a smaller margin; results get worse as expected.
◉ The VoiceHome background noise is very dynamic and typical of households, e.g., a vacuum
cleaner, dish washing, or interviews on television.
◉ Given that WPE itself omits noise in its formulation, this outcome was not expected and might
indicate that the training target for the DNN can be improved when complicated noise is present.
Conclusions
5
22
Conclusions
◉ In this paper, we show that the recursive WPE formulation can be improved with a more
sophisticated PSD estimator, resulting in a 5 %-10 % relative reduction in WER.
◉ Furthermore, it enables operating on a frame-by-frame basis and can thus be deployed in a
real-time scenario.
◉ We also find that simply smoothing the observation results in surprisingly good performance, even
if noise is present.
◉ In future work, we plan to investigate whether we can train the DNN-based PSD estimator by directly
minimizing the loss function of the acoustic model. That way, we avoid explicitly specifying what
the direct signal is and also eliminate the need for parallel training data.
23
Reference
➜ Official code: https://github.com/fgnt/nara_wpe (in TensorFlow v1.6)
➜ PyTorch version: https://github.com/nttcslab-sp/dnn_wpe
➜ Chainer version: https://github.com/sas91/jhu-neural-wpe
➜ PPT by <Kinoshita et al.>:
https://www.kecl.ntt.co.jp/icl/signal/kinoshita/publications/Interspeech17/Interspeech_keisuke_kinoshita_DNN-WPE.pdf
➜ Vanilla WPE: https://github.com/helianvine/fdndlp, http://www.kecl.ntt.co.jp/icl/signal/wpe/
24
Any questions ?
You can find me at
➢ jasonho610@gmail.com
➢ NTNU-SMIL
Thanks!
25
