SlideShare a Scribd company logo
Frame-Online DNN-WPE
Dereverberation
Presenter : 何冠勳 61047017s
Date : 2022/10/31
Jahn Heymann et al.,
“Frame-Online DNN-WPE Dereverberation,”
in IWAENC 2018
Outline
2
1 3 5
Introduction Experiments Conclusions
4
2
Background Results
Introduction
What? Why?
1
3
Introduction
4
◉ Apart from interfering sources and noise, reverberation has the most severe impact on the intelligibility
of the target speech signal.
◉ The interfering signals and noise are commonly suppressed by beamforming, while the latter
impairment can be addressed with signal processing aimed at suppressing the late reverberation.
◉ The Weighted Prediction Error (WPE) method showed very good performance in the REVERB
challenge and is used in commercially successful products such as the Google Home.
◉ WPE dereverberates the signal by estimating an inverse filter which is then used to subtract the
reverberation tail from the observation.
Introduction
5
◉ With its drawback in design, the quality of this filter mainly depends on the estimation of the power
spectral density (PSD) of the target, i.e., the anechoic speech signal and its early reflections, referred as
”direct speech”.
◉ Since this target is unknown, the conventional WPE works iteratively by alternating between two
steps:
1. dereverberating the signal by filters calculated using estimated direct speech PSD;
2. estimating the direct speech PSD using the filtered signal.
◉ This, however, inherently makes the vanilla WPE an offline method and computationally expensive.
Introduction
6
◉ To overcome this dependency issue:
A. <Kinoshita et al.> proposed to utilize a neural network to directly estimate the PSD from the
observation. Furthermore, this enable a blockwise online usage of WPE.
B. <Caroselli et al.> use a smoothed version of the observed signal PSD as an approximation for the
direct signal PSD and also create a recursive formulation of WPE. This option enable a
real-time, frame-by-frame dereverberation.
◉ In this work, we consider a combination of both approaches.
◉ We investigate if the recursive formulation can profit from a more sophisticated PSD estimated by
neural network and how the latency constraint impacts the overall performance of the system
compared to the offline and block-online variant.
Background
2
7
- Task
- WPE
- DNN variants
- Recursive variants
Task
8
◉ Using microphones, we observe a signal which is represented as the - dimensional vector in
the short time Fourier transformation (STFT) domain.
◉ We assume, that for ASR the early part of the room impulse response (RIR) is beneficial whereas the
reverberation tail deteriorates the recognition and should therefore be suppressed.
◉ Note that we explicitly allow RIRs longer than the length of an DFT window.
source signal convolved with the early part of the RIR
WPE
9
◉ The underlying idea of WPE is to estimate the reverberation tail of the signal and subtract it from the
observation to obtain an optimal estimate of the direct speech in a maximum likelihood sense.
(multi-channel)
(monaural)
filter weights
integrating all mics for estimation
written in matrix multiplication
taking K points, with △ delays to avoid
whitening the speech source
* : conjugate
H : conjugate and transpose
The number of taps (K) : the amount of the memory needed to implement the filter.
WPE
10
◉ From the assumption that , there is no closed form solution for the likelihood
optimization, but an iterative procedure which alternates between estimating the filter coefficients and
the variance exists :
Step 1) Step 2)
◉ Once we have an estimator for the PSD which only relies on the observation, the block-online
solution is straight-forward. (e.g., a DNN estimator)
time-varying
for each frequency
DNN variants
11
Crude approximation
(PSD of direct speech
is equal to that of
observation)
averaging on longer sequence to mitigate the
effect of approximation error
Suitable for block-online processing with
lower latency and less iterations
(Vanilla WPE) (DNN-WPE)
Recursive variants
12
◉ To arrive at the recursive variant, which is causal and more flexible in time, the correlation matrix is
estimated with a decaying window :
◉ This leads to a solution with the following updates :
time-varying
Recursive variants
13
◉ <Caroselli et al.> approximate the PSD of the target signal using a smoothed PSD of the observation
considering left and right context :
Experiments
3
14
- PSD estimator
- ASR setting
- Evaluation
PSD estimator
15
◉ Empirically, we use 5/10 filter taps (K) and uniformly sample the delay (△) to be in the range
[1, 4] .
◉ According to preliminary experiments, the DFT window size is set as 512 (32 ms) and shift as
128 (8 ms) as well as (decaying scalar) for the recursive WPE variant and
3 iterations for vanilla WPE.
◉ For smoothing PSD, we use left/right context size of 1/0 .
◉ We then choose the best configuration according to the results on the development set.
ASR setting
16
◉ The acoustic model is wide bi-directional residual network (WBRN),
trained on multi-condition data.
◉ The decoder uses the standard 3-gram WSJ0 language model.
◉ Since WPE preserves the number of channels and our acoustic model
is a single-channel model, we always only use the first microphone
channel for decoding.
Evaluation
◉ To compare the described approaches, we evaluate the systems in terms of word error rates
(WERs) on the data of the REVERB challenge as well as on Wall Street Journal (WSJ)+VoiceHome
data.
◉ We evaluate the performance on the eight as well as on the two channel task.
◉ The baselines are Unprocessed and Iteration. The former uses no enhancement and the latter is
vanilla WPE.
◉ Those baselines are compared with Smooth, where the PSD is approximated by smoothing the
observation and DNN, which uses a separately trained PSD estimation network.
17
Evaluation
◉ These systems are evaluated for three different latency constraints (if applicable):
➢ offline, the whole utterance is available for processing;
➢ block-online, the system is provided with blocks of 2 s;
➢ online, the system operates on a frame-by-frame basis.
◉ The former two settings both use the WPE formulation as outlined at P.10, while the online setting
uses the recursive formulation defined in P.12.
18
Results
- REVERB
- WSJ+VoiceHome
4
19
REVERB
20
◉ In the worst case (2 microphones, online), the DNN-based system still yields an improvement of over
10 % over the unprocessed baseline.
◉ In the offline scenario, it is able to match the result of the iterating baseline but avoids the iterations.
(The vanilla WPE proves to be a strong one.)
all better than unprocessed
getting worse as expected
GMM-based Kaldi system : 31 %
CHiME-3 recipe DNN : 24 %
WSJ+VoiceHome
21
all better than unprocessed, but smaller margin
getting worse as expected
◉ The VoiceHome background noise is very dynamic and typically found in households e.g., vacuum
cleaner, dish washing or interviews on television.
◉ Given that WPE itself omits noise in its formulation, this outcome was not expected and might
indicate that the training target for the DNN can be improved when complicated noise is present.
Conclusions
5
22
Conclusions
◉ In this paper, we show that the recursive WPE formulation can be improved with a more
sophisticated PSD estimator, resulting in a 5 % - 10 % reduction in WER relative.
◉ Furthermore, it enables operating on a frame-by-frame basis and can thus be deployed in a
real-time scenario.
◉ We also find that simply smoothing the observation results in surprisingly good performance, even
if noise is present.
◉ In future work, we plan to investigate if we can train the DNN based PSD estimator by directly
minimizing the loss function of the acoustic model. That way, we avoid to explicitly specify what
the direct signal is and also eliminate the need for parallel training data.
23
Reference
➜ Official code: https://github.com/fgnt/nara_wpe (in Tensorflow v1.6)
➜ PyTorch version: https://github.com/nttcslab-sp/dnn_wpe
➜ Chainer version: https://github.com/sas91/jhu-neural-wpe
➜ PPT by <Kinoshita et al.>:
https://www.kecl.ntt.co.jp/icl/signal/kinoshita/publications/Interspeech17/Interspeech_keisuke_kinoshita_DNN-
WPE.pdf
➜ Vanilla WPE: https://github.com/helianvine/fdndlp, http://www.kecl.ntt.co.jp/icl/signal/wpe/
24
Any questions ?
You can find me at
➢ jasonho610@gmail.com
➢ NTNU-SMIL
Thanks!
25

More Related Content

Similar to Frame-Online DNN-WPE Dereverberation.pdf

Machine Learning approaches at video compression
Machine Learning approaches at video compression Machine Learning approaches at video compression
Machine Learning approaches at video compression
Roberto Iacoviello
 
An Effective Approach for Colour Image Transmission using DWT Over OFDM for B...
An Effective Approach for Colour Image Transmission using DWT Over OFDM for B...An Effective Approach for Colour Image Transmission using DWT Over OFDM for B...
An Effective Approach for Colour Image Transmission using DWT Over OFDM for B...
IJMTST Journal
 
Audio Inpainting with D2WGAN.pdf
Audio Inpainting with D2WGAN.pdfAudio Inpainting with D2WGAN.pdf
Audio Inpainting with D2WGAN.pdf
ssuser849b73
 
GPS-Out LIMITATIONS AND NGS IMPROVEMENTS.pptx
GPS-Out LIMITATIONS AND NGS IMPROVEMENTS.pptxGPS-Out LIMITATIONS AND NGS IMPROVEMENTS.pptx
GPS-Out LIMITATIONS AND NGS IMPROVEMENTS.pptx
ssuser9ffba4
 
Ieee transactions on 2018 TOPICS with Abstract in audio, speech, and language...
Ieee transactions on 2018 TOPICS with Abstract in audio, speech, and language...Ieee transactions on 2018 TOPICS with Abstract in audio, speech, and language...
Ieee transactions on 2018 TOPICS with Abstract in audio, speech, and language...
tsysglobalsolutions
 
Emotion and Theme Recognition in Music with Frequency-Aware RF-Regularized CNNs
Emotion and Theme Recognition in Music with Frequency-Aware RF-Regularized CNNsEmotion and Theme Recognition in Music with Frequency-Aware RF-Regularized CNNs
Emotion and Theme Recognition in Music with Frequency-Aware RF-Regularized CNNs
multimediaeval
 
Video capacity of WLANs with a multiuser perceptual quality constraint
Video capacity of WLANs with a multiuser perceptual quality constraintVideo capacity of WLANs with a multiuser perceptual quality constraint
Video capacity of WLANs with a multiuser perceptual quality constraint
Shivaditya Jatar
 
Translatotron review
Translatotron reviewTranslatotron review
Translatotron review
June-Woo Kim
 
Power allocation for capacity maximization in eigen mimo with output snr cons...
Power allocation for capacity maximization in eigen mimo with output snr cons...Power allocation for capacity maximization in eigen mimo with output snr cons...
Power allocation for capacity maximization in eigen mimo with output snr cons...
Zac Darcy
 
2010 15 vo
2010 15 vo2010 15 vo
2010 15 vo
Balaji Pillai
 
Automated Speech Recognition
Automated Speech Recognition Automated Speech Recognition
Automated Speech Recognition
Pruthvij Thakar
 
Convolutional Neural Networks for Natural Language Processing / Stanford cs22...
Convolutional Neural Networks for Natural Language Processing / Stanford cs22...Convolutional Neural Networks for Natural Language Processing / Stanford cs22...
Convolutional Neural Networks for Natural Language Processing / Stanford cs22...
changedaeoh
 
Power Allocation for Capacity Maximization in Eigen-Mimo with Output SNR Cons...
Power Allocation for Capacity Maximization in Eigen-Mimo with Output SNR Cons...Power Allocation for Capacity Maximization in Eigen-Mimo with Output SNR Cons...
Power Allocation for Capacity Maximization in Eigen-Mimo with Output SNR Cons...
Zac Darcy
 
POWER ALLOCATION FOR CAPACITY MAXIMIZATION IN EIGEN-MIMO WITH OUTPUT SNR CONS...
POWER ALLOCATION FOR CAPACITY MAXIMIZATION IN EIGEN-MIMO WITH OUTPUT SNR CONS...POWER ALLOCATION FOR CAPACITY MAXIMIZATION IN EIGEN-MIMO WITH OUTPUT SNR CONS...
POWER ALLOCATION FOR CAPACITY MAXIMIZATION IN EIGEN-MIMO WITH OUTPUT SNR CONS...
Zac Darcy
 
H43044145
H43044145H43044145
H43044145
IJERA Editor
 
Relay selection for geographical forwarding in sleep wake cycling wireless se...
Relay selection for geographical forwarding in sleep wake cycling wireless se...Relay selection for geographical forwarding in sleep wake cycling wireless se...
Relay selection for geographical forwarding in sleep wake cycling wireless se...
JPINFOTECH JAYAPRAKASH
 
Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...
Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...
Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...
csandit
 
SpecAugment review
SpecAugment reviewSpecAugment review
SpecAugment review
June-Woo Kim
 
Parallel WaveGAN review
Parallel WaveGAN reviewParallel WaveGAN review
Parallel WaveGAN review
June-Woo Kim
 
Visual prompt tuning
Visual prompt tuningVisual prompt tuning
Visual prompt tuning
taeseon ryu
 

Similar to Frame-Online DNN-WPE Dereverberation.pdf (20)

Machine Learning approaches at video compression
Machine Learning approaches at video compression Machine Learning approaches at video compression
Machine Learning approaches at video compression
 
An Effective Approach for Colour Image Transmission using DWT Over OFDM for B...
An Effective Approach for Colour Image Transmission using DWT Over OFDM for B...An Effective Approach for Colour Image Transmission using DWT Over OFDM for B...
An Effective Approach for Colour Image Transmission using DWT Over OFDM for B...
 
Audio Inpainting with D2WGAN.pdf
Audio Inpainting with D2WGAN.pdfAudio Inpainting with D2WGAN.pdf
Audio Inpainting with D2WGAN.pdf
 
GPS-Out LIMITATIONS AND NGS IMPROVEMENTS.pptx
GPS-Out LIMITATIONS AND NGS IMPROVEMENTS.pptxGPS-Out LIMITATIONS AND NGS IMPROVEMENTS.pptx
GPS-Out LIMITATIONS AND NGS IMPROVEMENTS.pptx
 
Ieee transactions on 2018 TOPICS with Abstract in audio, speech, and language...
Ieee transactions on 2018 TOPICS with Abstract in audio, speech, and language...Ieee transactions on 2018 TOPICS with Abstract in audio, speech, and language...
Ieee transactions on 2018 TOPICS with Abstract in audio, speech, and language...
 
Emotion and Theme Recognition in Music with Frequency-Aware RF-Regularized CNNs
Emotion and Theme Recognition in Music with Frequency-Aware RF-Regularized CNNsEmotion and Theme Recognition in Music with Frequency-Aware RF-Regularized CNNs
Emotion and Theme Recognition in Music with Frequency-Aware RF-Regularized CNNs
 
Video capacity of WLANs with a multiuser perceptual quality constraint
Video capacity of WLANs with a multiuser perceptual quality constraintVideo capacity of WLANs with a multiuser perceptual quality constraint
Video capacity of WLANs with a multiuser perceptual quality constraint
 
Translatotron review
Translatotron reviewTranslatotron review
Translatotron review
 
Power allocation for capacity maximization in eigen mimo with output snr cons...
Power allocation for capacity maximization in eigen mimo with output snr cons...Power allocation for capacity maximization in eigen mimo with output snr cons...
Power allocation for capacity maximization in eigen mimo with output snr cons...
 
2010 15 vo
2010 15 vo2010 15 vo
2010 15 vo
 
Automated Speech Recognition
Automated Speech Recognition Automated Speech Recognition
Automated Speech Recognition
 
Convolutional Neural Networks for Natural Language Processing / Stanford cs22...
Convolutional Neural Networks for Natural Language Processing / Stanford cs22...Convolutional Neural Networks for Natural Language Processing / Stanford cs22...
Convolutional Neural Networks for Natural Language Processing / Stanford cs22...
 
Power Allocation for Capacity Maximization in Eigen-Mimo with Output SNR Cons...
Power Allocation for Capacity Maximization in Eigen-Mimo with Output SNR Cons...Power Allocation for Capacity Maximization in Eigen-Mimo with Output SNR Cons...
Power Allocation for Capacity Maximization in Eigen-Mimo with Output SNR Cons...
 
POWER ALLOCATION FOR CAPACITY MAXIMIZATION IN EIGEN-MIMO WITH OUTPUT SNR CONS...
POWER ALLOCATION FOR CAPACITY MAXIMIZATION IN EIGEN-MIMO WITH OUTPUT SNR CONS...POWER ALLOCATION FOR CAPACITY MAXIMIZATION IN EIGEN-MIMO WITH OUTPUT SNR CONS...
POWER ALLOCATION FOR CAPACITY MAXIMIZATION IN EIGEN-MIMO WITH OUTPUT SNR CONS...
 
H43044145
H43044145H43044145
H43044145
 
Relay selection for geographical forwarding in sleep wake cycling wireless se...
Relay selection for geographical forwarding in sleep wake cycling wireless se...Relay selection for geographical forwarding in sleep wake cycling wireless se...
Relay selection for geographical forwarding in sleep wake cycling wireless se...
 
Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...
Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...
Objective Evaluation of a Deep Neural Network Approach for Single-Channel Spe...
 
SpecAugment review
SpecAugment reviewSpecAugment review
SpecAugment review
 
Parallel WaveGAN review
Parallel WaveGAN reviewParallel WaveGAN review
Parallel WaveGAN review
 
Visual prompt tuning
Visual prompt tuningVisual prompt tuning
Visual prompt tuning
 

More from ssuser849b73

Speech Separation under Reverberant Condition.pdf
Speech Separation under Reverberant Condition.pdfSpeech Separation under Reverberant Condition.pdf
Speech Separation under Reverberant Condition.pdf
ssuser849b73
 
Transformer-based SE.pptx
Transformer-based SE.pptxTransformer-based SE.pptx
Transformer-based SE.pptx
ssuser849b73
 
WaveNet.pdf
WaveNet.pdfWaveNet.pdf
WaveNet.pdf
ssuser849b73
 
A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech E...
A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech E...A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech E...
A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech E...
ssuser849b73
 
Wavesplit.pdf
Wavesplit.pdfWavesplit.pdf
Wavesplit.pdf
ssuser849b73
 
Sudormrf.pdf
Sudormrf.pdfSudormrf.pdf
Sudormrf.pdf
ssuser849b73
 
EEND-SS.pdf
EEND-SS.pdfEEND-SS.pdf
EEND-SS.pdf
ssuser849b73
 

More from ssuser849b73 (7)

Speech Separation under Reverberant Condition.pdf
Speech Separation under Reverberant Condition.pdfSpeech Separation under Reverberant Condition.pdf
Speech Separation under Reverberant Condition.pdf
 
Transformer-based SE.pptx
Transformer-based SE.pptxTransformer-based SE.pptx
Transformer-based SE.pptx
 
WaveNet.pdf
WaveNet.pdfWaveNet.pdf
WaveNet.pdf
 
A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech E...
A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech E...A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech E...
A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech E...
 
Wavesplit.pdf
Wavesplit.pdfWavesplit.pdf
Wavesplit.pdf
 
Sudormrf.pdf
Sudormrf.pdfSudormrf.pdf
Sudormrf.pdf
 
EEND-SS.pdf
EEND-SS.pdfEEND-SS.pdf
EEND-SS.pdf
 

Recently uploaded

132/33KV substation case study Presentation
132/33KV substation case study Presentation132/33KV substation case study Presentation
132/33KV substation case study Presentation
kandramariana6
 
Data Driven Maintenance | UReason Webinar
Data Driven Maintenance | UReason WebinarData Driven Maintenance | UReason Webinar
Data Driven Maintenance | UReason Webinar
UReason
 
SCALING OF MOS CIRCUITS m .pptx
SCALING OF MOS CIRCUITS m                 .pptxSCALING OF MOS CIRCUITS m                 .pptx
SCALING OF MOS CIRCUITS m .pptx
harshapolam10
 
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
upoux
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
MDSABBIROJJAMANPAYEL
 
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
171ticu
 
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 08 Doors and Windows.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 08 Doors and Windows.pdf2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 08 Doors and Windows.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 08 Doors and Windows.pdf
Yasser Mahgoub
 
VARIABLE FREQUENCY DRIVE. VFDs are widely used in industrial applications for...
VARIABLE FREQUENCY DRIVE. VFDs are widely used in industrial applications for...VARIABLE FREQUENCY DRIVE. VFDs are widely used in industrial applications for...
VARIABLE FREQUENCY DRIVE. VFDs are widely used in industrial applications for...
PIMR BHOPAL
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
IJECEIAES
 
Mechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdfMechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdf
21UME003TUSHARDEB
 
morris_worm_intro_and_source_code_analysis_.pdf
morris_worm_intro_and_source_code_analysis_.pdfmorris_worm_intro_and_source_code_analysis_.pdf
morris_worm_intro_and_source_code_analysis_.pdf
ycwu0509
 
ITSM Integration with MuleSoft.pptx
ITSM  Integration with MuleSoft.pptxITSM  Integration with MuleSoft.pptx
ITSM Integration with MuleSoft.pptx
VANDANAMOHANGOUDA
 
Curve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods RegressionCurve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods Regression
Nada Hikmah
 
Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...
Prakhyath Rai
 
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
Gino153088
 
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by AnantLLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
Anant Corporation
 
AI for Legal Research with applications, tools
AI for Legal Research with applications, toolsAI for Legal Research with applications, tools
AI for Legal Research with applications, tools
mahaffeycheryld
 
CEC 352 - SATELLITE COMMUNICATION UNIT 1
CEC 352 - SATELLITE COMMUNICATION UNIT 1CEC 352 - SATELLITE COMMUNICATION UNIT 1
CEC 352 - SATELLITE COMMUNICATION UNIT 1
PKavitha10
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
nedcocy
 
CompEx~Manual~1210 (2).pdf COMPEX GAS AND VAPOURS
CompEx~Manual~1210 (2).pdf COMPEX GAS AND VAPOURSCompEx~Manual~1210 (2).pdf COMPEX GAS AND VAPOURS
CompEx~Manual~1210 (2).pdf COMPEX GAS AND VAPOURS
RamonNovais6
 

Recently uploaded (20)

132/33KV substation case study Presentation
132/33KV substation case study Presentation132/33KV substation case study Presentation
132/33KV substation case study Presentation
 
Data Driven Maintenance | UReason Webinar
Data Driven Maintenance | UReason WebinarData Driven Maintenance | UReason Webinar
Data Driven Maintenance | UReason Webinar
 
SCALING OF MOS CIRCUITS m .pptx
SCALING OF MOS CIRCUITS m                 .pptxSCALING OF MOS CIRCUITS m                 .pptx
SCALING OF MOS CIRCUITS m .pptx
 
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
一比一原版(osu毕业证书)美国俄勒冈州立大学毕业证如何办理
 
Properties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptxProperties Railway Sleepers and Test.pptx
Properties Railway Sleepers and Test.pptx
 
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样学校原版美国波士顿大学毕业证学历学位证书原版一模一样
学校原版美国波士顿大学毕业证学历学位证书原版一模一样
 
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 08 Doors and Windows.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 08 Doors and Windows.pdf2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 08 Doors and Windows.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 08 Doors and Windows.pdf
 
VARIABLE FREQUENCY DRIVE. VFDs are widely used in industrial applications for...
VARIABLE FREQUENCY DRIVE. VFDs are widely used in industrial applications for...VARIABLE FREQUENCY DRIVE. VFDs are widely used in industrial applications for...
VARIABLE FREQUENCY DRIVE. VFDs are widely used in industrial applications for...
 
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
 
Mechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdfMechanical Engineering on AAI Summer Training Report-003.pdf
Mechanical Engineering on AAI Summer Training Report-003.pdf
 
morris_worm_intro_and_source_code_analysis_.pdf
morris_worm_intro_and_source_code_analysis_.pdfmorris_worm_intro_and_source_code_analysis_.pdf
morris_worm_intro_and_source_code_analysis_.pdf
 
ITSM Integration with MuleSoft.pptx
ITSM  Integration with MuleSoft.pptxITSM  Integration with MuleSoft.pptx
ITSM Integration with MuleSoft.pptx
 
Curve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods RegressionCurve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods Regression
 
Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...Software Engineering and Project Management - Introduction, Modeling Concepts...
Software Engineering and Project Management - Introduction, Modeling Concepts...
 
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
 
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by AnantLLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
 
AI for Legal Research with applications, tools
AI for Legal Research with applications, toolsAI for Legal Research with applications, tools
AI for Legal Research with applications, tools
 
CEC 352 - SATELLITE COMMUNICATION UNIT 1
CEC 352 - SATELLITE COMMUNICATION UNIT 1CEC 352 - SATELLITE COMMUNICATION UNIT 1
CEC 352 - SATELLITE COMMUNICATION UNIT 1
 
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
一比一原版(爱大毕业证书)爱荷华大学毕业证如何办理
 
CompEx~Manual~1210 (2).pdf COMPEX GAS AND VAPOURS
CompEx~Manual~1210 (2).pdf COMPEX GAS AND VAPOURSCompEx~Manual~1210 (2).pdf COMPEX GAS AND VAPOURS
CompEx~Manual~1210 (2).pdf COMPEX GAS AND VAPOURS
 

Frame-Online DNN-WPE Dereverberation.pdf

  • 1. Frame-Online DNN-WPE Dereverberation Presenter : 何冠勳 61047017s Date : 2022/10/31 Jahn Heymann et al., “Frame-Online DNN-WPE Dereverberation,” in IWAENC 2018
  • 2. Outline 2 1 3 5 Introduction Experiments Conclusions 4 2 Background Results
  • 4. Introduction 4 ◉ Apart from interfering sources and noise, reverberation has the most severe impact on the intelligibility of the target speech signal. ◉ The interfering signals and noise are commonly suppressed by beamforming, while the latter impairment can be addressed with signal processing aimed at suppressing the late reverberation. ◉ The Weighted Prediction Error (WPE) method showed very good performance in the REVERB challenge and is used in commercially successful products such as the Google Home. ◉ WPE dereverberates the signal by estimating an inverse filter which is then used to subtract the reverberation tail from the observation.
  • 5. Introduction 5 ◉ With its drawback in design, the quality of this filter mainly depends on the estimation of the power spectral density (PSD) of the target, i.e., the anechoic speech signal and its early reflections, referred as ”direct speech”. ◉ Since this target is unknown, the conventional WPE works iteratively by alternating between two steps: 1. dereverberating the signal by filters calculated using estimated direct speech PSD; 2. estimating the direct speech PSD using the filtered signal. ◉ This, however, inherently makes the vanilla WPE an offline method and computationally expensive.
  • 6. Introduction 6 ◉ To overcome this dependency issue: A. <Kinoshita et al.> proposed to utilize a neural network to directly estimate the PSD from the observation. Furthermore, this enable a blockwise online usage of WPE. B. <Caroselli et al.> use a smoothed version of the observed signal PSD as an approximation for the direct signal PSD and also create a recursive formulation of WPE. This option enable a real-time, frame-by-frame dereverberation. ◉ In this work, we consider a combination of both approaches. ◉ We investigate if the recursive formulation can profit from a more sophisticated PSD estimated by neural network and how the latency constraint impacts the overall performance of the system compared to the offline and block-online variant.
  • 7. Background 2 7 - Task - WPE - DNN variants - Recursive variants
  • 8. Task 8 ◉ Using microphones, we observe a signal which is represented as the - dimensional vector in the short time Fourier transformation (STFT) domain. ◉ We assume, that for ASR the early part of the room impulse response (RIR) is beneficial whereas the reverberation tail deteriorates the recognition and should therefore be suppressed. ◉ Note that we explicitly allow RIRs longer than the length of an DFT window. source signal convolved with the early part of the RIR
  • 9. WPE 9 ◉ The underlying idea of WPE is to estimate the reverberation tail of the signal and subtract it from the observation to obtain an optimal estimate of the direct speech in a maximum likelihood sense. (multi-channel) (monaural) filter weights integrating all mics for estimation written in matrix multiplication taking K points, with △ delays to avoid whitening the speech source * : conjugate H : conjugate and transpose The number of taps (K) : the amount of the memory needed to implement the filter.
  • 10. WPE 10 ◉ From the assumption that , there is no closed form solution for the likelihood optimization, but an iterative procedure which alternates between estimating the filter coefficients and the variance exists : Step 1) Step 2) ◉ Once we have an estimator for the PSD which only relies on the observation, the block-online solution is straight-forward. (e.g., a DNN estimator) time-varying for each frequency
  • 11. DNN variants 11 Crude approximation (PSD of direct speech is equal to that of observation) averaging on longer sequence to mitigate the effect of approximation error Suitable for block-online processing with lower latency and less iterations (Vanilla WPE) (DNN-WPE)
  • 12. Recursive variants 12 ◉ To arrive at the recursive variant, which is causal and more flexible in time, the correlation matrix is estimated with a decaying window : ◉ This leads to a solution with the following updates : time-varying
  • 13. Recursive variants 13 ◉ <Caroselli et al.> approximate the PSD of the target signal using a smoothed PSD of the observation considering left and right context :
  • 14. Experiments 3 14 - PSD estimator - ASR setting - Evaluation
  • 15. PSD estimator 15 ◉ Empirically, we use 5/10 filter taps (K) and uniformly sample the delay (△) to be in the range [1, 4] . ◉ According to preliminary experiments, the DFT window size is set as 512 (32 ms) and shift as 128 (8 ms) as well as (decaying scalar) for the recursive WPE variant and 3 iterations for vanilla WPE. ◉ For smoothing PSD, we use left/right context size of 1/0 . ◉ We then choose the best configuration according to the results on the development set.
  • 16. ASR setting 16 ◉ The acoustic model is wide bi-directional residual network (WBRN), trained on multi-condition data. ◉ The decoder uses the standard 3-gram WSJ0 language model. ◉ Since WPE preserves the number of channels and our acoustic model is a single-channel model, we always only use the first microphone channel for decoding.
  • 17. Evaluation ◉ To compare the described approaches, we evaluate the systems in terms of word error rates (WERs) on the data of the REVERB challenge as well as on Wall Street Journal (WSJ)+VoiceHome data. ◉ We evaluate the performance on the eight as well as on the two channel task. ◉ The baselines are Unprocessed and Iteration. The former uses no enhancement and the latter is vanilla WPE. ◉ Those baselines are compared with Smooth, where the PSD is approximated by smoothing the observation and DNN, which uses a separately trained PSD estimation network. 17
  • 18. Evaluation ◉ These systems are evaluated for three different latency constraints (if applicable): ➢ offline, the whole utterance is available for processing; ➢ block-online, the system is provided with blocks of 2 s; ➢ online, the system operates on a frame-by-frame basis. ◉ The former two settings both use the WPE formulation as outlined at P.10, while the online setting uses the recursive formulation defined in P.12. 18
  • 20. REVERB 20 ◉ In the worst case (2 microphones, online), the DNN-based system still yields an improvement of over 10 % over the unprocessed baseline. ◉ In the offline scenario, it is able to match the result of the iterating baseline but avoids the iterations. (The vanilla WPE proves to be a strong one.) all better than unprocessed getting worse as expected GMM-based Kaldi system : 31 % CHiME-3 recipe DNN : 24 %
  • 21. WSJ+VoiceHome 21 all better than unprocessed, but smaller margin getting worse as expected ◉ The VoiceHome background noise is very dynamic and typically found in households e.g., vacuum cleaner, dish washing or interviews on television. ◉ Given that WPE itself omits noise in its formulation, this outcome was not expected and might indicate that the training target for the DNN can be improved when complicated noise is present.
  • 23. Conclusions ◉ In this paper, we show that the recursive WPE formulation can be improved with a more sophisticated PSD estimator, resulting in a 5 % - 10 % reduction in WER relative. ◉ Furthermore, it enables operating on a frame-by-frame basis and can thus be deployed in a real-time scenario. ◉ We also find that simply smoothing the observation results in surprisingly good performance, even if noise is present. ◉ In future work, we plan to investigate if we can train the DNN based PSD estimator by directly minimizing the loss function of the acoustic model. That way, we avoid to explicitly specify what the direct signal is and also eliminate the need for parallel training data. 23
  • 24. Reference ➜ Official code: https://github.com/fgnt/nara_wpe (in Tensorflow v1.6) ➜ PyTorch version: https://github.com/nttcslab-sp/dnn_wpe ➜ Chainer version: https://github.com/sas91/jhu-neural-wpe ➜ PPT by <Kinoshita et al.>: https://www.kecl.ntt.co.jp/icl/signal/kinoshita/publications/Interspeech17/Interspeech_keisuke_kinoshita_DNN- WPE.pdf ➜ Vanilla WPE: https://github.com/helianvine/fdndlp, http://www.kecl.ntt.co.jp/icl/signal/wpe/ 24
  • 25. Any questions ? You can find me at ➢ jasonho610@gmail.com ➢ NTNU-SMIL Thanks! 25