A Conformer-based ASR Frontend for Joint
Acoustic Echo Cancellation,
Speech Enhancement and Speech Separation
Tom O’Malley, Arun Narayanan, Quan Wang, Alex Park,
James Walker, Nathan Howard
Google LLC, U.S.A
ASRU 2021
Presenter: 何冠勳
Table of Contents
• Introduction
• Architecture
• Experiments
• Results
• Conclusion
Introduction
Why & What ?
Introduction
• While ASR systems have improved significantly over the years, various factors in real-world conditions still significantly degrade performance.
• Background interference can be broadly classified into three groups:
  - Device echo
  - Background noise
  - Competing speech
• These three classes of interference have been addressed in the literature, but typically in isolation.
Introduction
• It is well known that improving speech quality does not always improve ASR performance, since the artifacts introduced by non-linear processing can adversely affect ASR.
• One way to mitigate this is to train the enhancement frontend jointly with the backend ASR model.
• In this work, the following inputs are assumed to be available:
  - A reference signal
  - A noise context
  - A target speaker embedding
Introduction
• A single model then processes these contextual signals to produce enhanced features that are passed to the ASR system.
• The model is based on the Conformer, which has been shown to be especially well suited to speech tasks such as ASR, enhancement and separation.
• Our results show that a joint model can work almost as well as task-specific models.
ASR <Analyzing Robustness of End-to-End Neural Models for Automatic Speech Recognition>
ASR <Improving Mandarin Speech Recognition with Block-augmented Transformer>
Sep+Enh <Multichannel Speech Separation with Narrow-band Conformer>
Dia+ASR <The RoyalFlush System of Speech Recognition for M2MeT Challenge>
Architecture
- Features
- En/Decoder
- ASR
- Objective
- Loss
- Inference
[Figure: model architecture diagram. Labels recoverable from the slide: stacked, FiLM (three occurrences), 6 secs, spectral-based, ASR (pre-trained), frozen, stacked & subsampled.]
Features, Encoder, Decoder
• Log Mel-filterbank energies (LFBE) are computed as features for the reference signal and the noise context.
• The speaker embeddings (d-vectors) are computed using a text-independent speaker recognition model trained with the generalized end-to-end extended-set softmax loss.
• The primary encoder and the noise context encoder are self-attentive, while the cross-attention encoder attends over the noise context and the output of the primary encoder.
• The decoder consists of a simple projection with a sigmoid activation.
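A minimal sketch of two pieces named on this slide and in the architecture figure: FiLM conditioning and the sigmoid projection decoder. FiLM (feature-wise linear modulation) scales and shifts each feature channel. Where the FiLM layers sit in the model and how the scale and shift are derived from the d-vector are not stated in the slide text, so `gamma`, `beta`, `weights` and `bias` below are hypothetical toy values, not the authors' parameters.

```python
import math

def film(frames, gamma, beta):
    """Feature-wise linear modulation: per-channel scale and shift of every frame."""
    return [[g * x + b for x, g, b in zip(frame, gamma, beta)] for frame in frames]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def decode_mask(frames, weights, bias):
    """Linear projection followed by a sigmoid, giving mask values in (0, 1)."""
    return [
        [sigmoid(sum(w * x for w, x in zip(row, frame)) + b)
         for row, b in zip(weights, bias)]
        for frame in frames
    ]

# Hypothetical toy values: one frame with two channels.
frames = [[1.0, 2.0]]
gamma, beta = [2.0, 0.5], [0.0, 1.0]    # assumed to come from the d-vector in practice
modulated = film(frames, gamma, beta)   # [[2.0, 2.0]]
weights, bias = [[1.0, -1.0], [0.0, 0.0]], [0.0, 0.0]
mask = decode_mask(modulated, weights, bias)
print(modulated, mask)
```

The sigmoid keeps every decoder output strictly between 0 and 1, which is what allows it to be read as a ratio mask.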
ASR
• The ASR model is an RNN transducer (RNN-T) with an LSTM-based encoder. Its training data comes from varied domains, including VoiceSearch, Farfield, Telephony and YouTube.
• The utterances total ∼400k hours of speech. Additionally, a room simulator is used to generate noisy versions of these datasets at SNRs ranging from 0 dB to 30 dB and reverberation times ranging from 0 ms to 900 ms.
• ASR features: 128-dimensional LFBE computed over 32 ms windows with a 10 ms hop; 4 contiguous frames are then stacked and the result is subsampled by a factor of 3.
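The stacking and subsampling step can be sketched as follows. With a 10 ms hop and a subsampling factor of 3, the ASR sees one stacked frame every 30 ms, each of dimension 4 x 128 = 512. Whether the stack covers past or future frames, and how utterance edges are padded, is not stated on the slide; this sketch assumes each stack starts at the kept frame and drops incomplete windows.

```python
def stack_and_subsample(frames, stack=4, stride=3):
    """Concatenate `stack` contiguous frames, keeping one stacked frame every `stride` frames."""
    stacked = []
    # Only full windows are kept (an assumption; edge handling is not given in the slides).
    for start in range(0, len(frames) - stack + 1, stride):
        window = frames[start:start + stack]
        stacked.append([value for frame in window for value in frame])
    return stacked

# Toy input: 10 frames of 128-dimensional features; each value is the frame index.
frames = [[float(t)] * 128 for t in range(10)]
out = stack_and_subsample(frames)
print(len(out), len(out[0]))  # 3 512
```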
Objective, Loss
• The ideal ratio mask (IRM) is used as the training target. Using the IRM allows enhancement directly in the ASR feature space, without any need to reconstruct the waveform.
• The spectral loss consists of the L1 and L2 distances between the estimated ratio mask and the ideal ratio mask.
• Only the encoder of the ASR model is used to compute the ASR loss, which is the L2 distance between the ASR encoder outputs for the target features and for the enhanced features.
• The goal of the ASR loss is to make the enhancement more attuned to the ASR model.
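A rough sketch of the two loss terms described above. The slide does not give the exact IRM formula, the normalization of each term, or how the two losses are weighted, so those details are assumptions here: the IRM is taken as speech energy over speech-plus-noise energy, and both distances are averaged over elements. `toy_encoder` is a hypothetical stand-in for the frozen ASR encoder.

```python
def ideal_ratio_mask(speech, noise, eps=1e-8):
    """Assumed IRM: speech energy over speech-plus-noise energy, per time-frequency bin."""
    return [[s / (s + n + eps) for s, n in zip(s_row, n_row)]
            for s_row, n_row in zip(speech, noise)]

def spectral_loss(mask_est, mask_ref):
    """Mean absolute (L1) plus mean squared (L2) difference between two masks."""
    diffs = [e - r for e_row, r_row in zip(mask_est, mask_ref)
             for e, r in zip(e_row, r_row)]
    l1 = sum(abs(d) for d in diffs) / len(diffs)
    l2 = sum(d * d for d in diffs) / len(diffs)
    return l1 + l2

def asr_loss(encoder, target_feats, enhanced_feats):
    """Mean squared difference between encoder outputs for target and enhanced features."""
    ref, est = encoder(target_feats), encoder(enhanced_feats)
    return sum((a - b) ** 2 for a, b in zip(ref, est)) / len(ref)

def toy_encoder(feats):
    # Hypothetical stand-in for the frozen ASR encoder: one number per frame.
    return [sum(frame) for frame in feats]

speech, noise = [[3.0, 1.0]], [[1.0, 1.0]]
irm = ideal_ratio_mask(speech, noise)          # approximately [[0.75, 0.5]]
loss_spec = spectral_loss([[0.5, 0.5]], irm)
loss_asr = asr_loss(toy_encoder, [[1.0, 2.0]], [[1.0, 1.0]])
print(irm, loss_spec, loss_asr)
```

Because the ASR encoder is frozen, the ASR term only shapes the frontend: gradients would flow through the encoder back to the enhanced features without updating the recognizer.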
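Inference
• Prior work has shown that scaling the mask estimate improves ASR performance, since the ASR model is sensitive to artifacts.
[Equation: mask scaling with parameters α and β; shown as an image on the slide and not preserved in the text]
where α and β are set to 0.5 and 0.01, respectively.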
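The scaling equation on this slide is not reproduced in the text. One form consistent with the stated parameters, assumed here rather than taken from the slide, floors the estimated mask at β and raises it to the power α before applying it to the noisy features: X_enh = X * max(M, β)^α. The sketch below applies that assumed form in the linear energy domain.

```python
def scale_mask(mask, alpha=0.5, beta=0.01):
    """Assumed scaling: floor each mask value at beta, then raise it to the power alpha."""
    return [[max(m, beta) ** alpha for m in row] for row in mask]

def apply_mask(noisy_energy, mask):
    """Element-wise product of noisy filterbank energies and the mask."""
    return [[x * m for x, m in zip(x_row, m_row)]
            for x_row, m_row in zip(noisy_energy, mask)]

raw_mask = [[0.0, 0.25, 1.0]]
scaled = scale_mask(raw_mask)                      # approximately [[0.1, 0.5, 1.0]]
enhanced = apply_mask([[4.0, 4.0, 4.0]], scaled)
print(scaled, enhanced)
```

Under this assumption, an exponent below 1 pushes mask values in (0, 1) toward 1, and the floor stops any bin from being zeroed out, so suppression is softened in exchange for fewer artifacts.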
Experiments
- Datasets
- Training
Datasets
• The speech enhancement training set is created by passing clean utterances through a room simulator that first adds reverberation, followed by two kinds of interference:
  - Librispeech + noise (from Getty Audio / YouTube Audio Library)
  - Librispeech + competing speaker (from Librispeech)
• There are two types of AEC training data:
  - A reference signal is played out through a reverberant room simulator and then recorded.
  - Re-recorded utterances drawn from a dataset collected internally for text-to-speech (TTS) purposes. This set is additionally augmented with actual TTS utterances.
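The interference-adding step can be illustrated with a minimal sketch that mixes a speech signal with noise or a competing speaker at a chosen signal-to-noise ratio. The slides do not state the mixing levels used for the enhancement set (the 0 dB to 30 dB range is given only for the ASR training data), and reverberation is left out, so the function and its toy inputs are illustrative assumptions. Both signals are assumed to have equal length and non-zero power.

```python
import math

def mean_power(signal):
    return sum(x * x for x in signal) / len(signal)

def mix_at_snr(speech, interference, snr_db):
    """Scale the interference so speech power / interference power matches snr_db, then add."""
    target_ratio = 10 ** (snr_db / 10)
    gain = math.sqrt(mean_power(speech) / (mean_power(interference) * target_ratio))
    scaled = [gain * x for x in interference]
    return [s + n for s, n in zip(speech, scaled)], scaled

speech = [1.0, -1.0, 1.0, -1.0]
interference = [0.5, 0.5, -0.5, -0.5]
mixture, scaled = mix_at_snr(speech, interference, snr_db=0.0)
print(mixture)  # [2.0, 0.0, 0.0, -2.0]
```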
Training
LFBE
  Dim: 128
  Window size: 32 ms
  Hop size: 10 ms
D-vector
  Dim: 256
Conformer
  Causal, masked attention
  Dim: 256
  FFW size: x 6, x 8
AEC-only (#conformer)
  Primary encoder: 6
SE-only & Joint (#conformer)
  Primary encoder: 2
  Noise context encoder: 2
  Cross-attention encoder: 2
Results
- AEC-only
- SE-only
- Joint
[Result slides: AEC-only (3 slides), SE-only (3 slides), Joint (4 slides). The result tables and figures are images and are not preserved in the text.]
Conclusions
What has been done?
Conclusions
• We present a frontend for improving the robustness of ASR that jointly implements three modules within a single model: acoustic echo cancellation, speech enhancement, and speech separation.
• This is achieved by making use of
  (1) a reference signal of the playback audio
  (2) a noise context
  (3) a target speaker embedding
• We present detailed evaluations showing that the joint model performs almost as well as the task-specific models, and significantly reduces word error rate in noisy conditions even when using a large-scale state-of-the-art ASR model.
Enhancement results on Delta dataset:
AOI / Inter-training / Elderly
CER (%)          SMIL-635hrTW   SepFormer (WHAMR)   SepFormer (WHAM16k)   SepFormer (WHAM)
AOI              27.27          29.15               36.32                 31.84
Elder            28.31          47.99               41.57                 50.80
Inter-training   14.18          22.80               18.76                 23.51
THANK YOU
Any questions?
You can find me at
📨 jasonho610@gmail.com NTNU-SMIL
