A Conformer-based ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement and Speech Separation .pptx

A Conformer-based ASR Frontend for Joint
Acoustic Echo Cancellation,
Speech Enhancement and Speech Separation
Tom O’Malley, Arun Narayanan, Quan Wang, Alex Park,
James Walker, Nathan Howard
Google LLC, U.S.A
ASRU 2021
Presenter: 何冠勳

Table of Contents
• Introduction
• Architecture
• Experiments
• Results
• Conclusion

Introduction
• While ASR systems have improved significantly over the years, various factors in real-world conditions still degrade performance substantially.
• Background interference can be broadly classified into three groups:
  - Device echo
  - Background noise
  - Competing speech
• The three classes of interference mentioned above have been addressed in the
literature, typically in isolation.

Introduction
• It is well known that improving speech quality does not always improve ASR performance, since the artifacts introduced by non-linear processing can adversely affect ASR.
• One way to mitigate this is to jointly train the enhancement frontend together with the backend ASR model.
• This work makes the following assumptions:
  - A reference signal is available
  - A noise context is available
  - A target speaker embedding is available

Introduction
• A single model then processes these contextual signals to produce enhanced features that are passed to the ASR system.
• The model is based on the Conformer, which has been shown to be especially well suited for speech tasks such as ASR, enhancement and separation.
• The results show that a joint model can work almost as well as task-specific models.
ASR <Analyzing Robustness of End-to-End Neural Models for Automatic Speech Recognition>
ASR <Improving Mandarin Speech Recognition with Block-augmented Transformer>
Sep+Enh <Multichannel Speech Separation with Narrow-band Conformer>
Dia+ASR <The RoyalFlush System of Speech Recognition for M2MeT Challenge>

Architecture
- Features
- En/Decoder
- ASR
- Objective
- Loss
- Inference

[Figure: model architecture diagram. Labels recovered from the slide: stacked; FiLM (three occurrences); (6 secs); Spectral-based; ASR (pre-trained, frozen); (Stacked & Subsampled).]

Features, Encoder, Decoder
• We compute log Mel-filterbank energies (LFBE) as features of the reference signal and the noise context.
• The speaker embeddings (d-vectors) are computed using a text-independent speaker recognition model trained with the generalized end-to-end extended-set softmax loss.
• While the primary encoder and the noise context encoder use self-attention, the cross-attention encoder attends over the noise context and the output of the primary encoder.
• The decoder consists of a simple projection with a sigmoid activation.
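
As a rough illustration of the decoder bullet above, the sketch below projects one encoder output frame and applies a sigmoid so that each output lies in (0, 1), as a ratio-mask estimate would. The function names, the weights and the toy dimensions are hypothetical and are not taken from the paper.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_mask(encoded: List[float],
                weights: List[List[float]],
                bias: List[float]) -> List[float]:
    """Project one encoder frame to per-channel values in (0, 1).

    `weights` has one row per output channel (hypothetical layout).
    """
    return [
        sigmoid(sum(w * h for w, h in zip(row, encoded)) + b)
        for row, b in zip(weights, bias)
    ]


# Toy example: a 3-dimensional encoder frame and 2 output channels.
encoded_frame = [0.2, -0.4, 1.0]
weights = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
bias = [0.0, 0.0]
mask = decode_mask(encoded_frame, weights, bias)
```

With all-zero weights the first channel is exactly 0.5; the second channel is above 0.5 because its pre-activation is positive.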

ASR
• The ASR model is an RNN transducer (RNN-T) model with an LSTM-based encoder. The training data for this model comes from varied domains, including VoiceSearch, Farfield, Telephony and YouTube.
• The utterances total ∼400k hours of speech. Additionally, a room simulator is used to generate noisy versions of these datasets at SNRs ranging from 0 dB to 30 dB and reverberation times ranging from 0 msec to 900 msec.
• The features used by the ASR model are 128-dimensional LFBE features computed over 32 ms windows with a 10 ms hop; 4 contiguous frames are then stacked, and the result is subsampled by a factor of 3.
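
The stacking and subsampling step can be sketched as follows. This is a minimal illustration, not the ASR system's actual code: the function name is hypothetical, and dropping the trailing frames that do not fill a full window (instead of padding) is an assumption.

```python
from typing import List

Frame = List[float]


def stack_and_subsample(frames: List[Frame],
                        stack: int = 4,
                        stride: int = 3) -> List[Frame]:
    """Concatenate `stack` contiguous frames, advancing `stride` frames each step."""
    stacked = []
    # Assumption: only start positions with a full window are used (no padding).
    for start in range(0, len(frames) - stack + 1, stride):
        window = frames[start:start + stack]
        stacked.append([value for frame in window for value in frame])
    return stacked


# Ten toy frames of dimension 2 (real LFBE frames would have 128 values).
frames = [[float(t), float(t) + 0.5] for t in range(10)]
out = stack_and_subsample(frames)
```

With 128-dimensional input frames at a 10 ms hop, each stacked frame has 4 × 128 = 512 values and the frame period becomes 3 × 10 ms = 30 ms.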

Objective, Loss
• We use the ideal ratio mask (IRM) as the training target. Using the IRM allows enhancement to be done directly in the ASR feature space, without any need to reconstruct the waveform.
• The spectral loss consists of the L1 and L2 distances between the estimated ratio mask and the ideal ratio mask.
• Only the encoder of the ASR model is used for computing the ASR loss. This loss is computed as the L2 distance between the outputs of the ASR encoder for the target features and for the enhanced features.
• The goal of the ASR loss is to make the enhancement more attuned to the ASR model.
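
The two losses can be sketched as below. Several details are assumptions made for illustration: the IRM is taken as speech energy divided by speech-plus-noise energy (one common definition), the L1 and L2 terms are averaged over all elements and summed with equal weight, and `toy_encoder` is a hypothetical stand-in for the frozen ASR encoder.

```python
from typing import Callable, List

Matrix = List[List[float]]  # indexed as [time][channel]


def ideal_ratio_mask(speech: Matrix, noise: Matrix, eps: float = 1e-8) -> Matrix:
    """Assumed IRM definition: speech energy over speech-plus-noise energy."""
    return [[s / (s + n + eps) for s, n in zip(s_row, n_row)]
            for s_row, n_row in zip(speech, noise)]


def _diffs(a: Matrix, b: Matrix) -> List[float]:
    """Element-wise differences of two equally shaped matrices, flattened."""
    return [x - y for a_row, b_row in zip(a, b) for x, y in zip(a_row, b_row)]


def spectral_loss(est_mask: Matrix, irm: Matrix) -> float:
    """Mean L1 plus mean L2 distance between estimated and ideal masks."""
    d = _diffs(est_mask, irm)
    return sum(abs(v) for v in d) / len(d) + sum(v * v for v in d) / len(d)


def asr_loss(encoder: Callable[[Matrix], Matrix],
             target_feats: Matrix, enhanced_feats: Matrix) -> float:
    """Mean squared distance between encoder outputs for target and enhanced features."""
    d = _diffs(encoder(target_feats), encoder(enhanced_feats))
    return sum(v * v for v in d) / len(d)


def toy_encoder(feats: Matrix) -> Matrix:
    """Hypothetical stand-in for the frozen ASR encoder."""
    return [[2.0 * v for v in row] for row in feats]


irm = ideal_ratio_mask([[1.0, 3.0]], [[1.0, 1.0]])   # approx [[0.5, 0.75]]
spec = spectral_loss([[0.5, 0.25]], irm)             # approx 0.25 + 0.125
asr = asr_loss(toy_encoder, [[1.0, 2.0]], [[1.0, 1.0]])
```

In training, gradients would flow through the frozen encoder into the frontend only; that part is not shown here.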

Inference
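• Prior work has shown that scaling the mask estimate improves ASR performance, since the ASR model is sensitive to artifacts:
$$\hat{X} = X \odot \max(\hat{M}, \beta)^{\alpha},$$
where $X$ denotes the noisy features, $\hat{M}$ the estimated mask and $\hat{X}$ the enhanced features, and α and β are set to 0.5 and 0.01, respectively.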
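
A minimal sketch of mask scaling at inference, assuming the estimated mask is floored at β and raised to the power α before being multiplied element-wise with the noisy features. That form, the function names, and the choice to apply the mask to linear (not log-compressed) filterbank energies are assumptions of this sketch.

```python
from typing import List

Matrix = List[List[float]]  # indexed as [time][channel]


def scale_mask(mask: Matrix, alpha: float = 0.5, beta: float = 0.01) -> Matrix:
    """Floor each mask value at `beta`, then raise it to the power `alpha`."""
    return [[max(m, beta) ** alpha for m in row] for row in mask]


def apply_mask(noisy: Matrix, mask: Matrix) -> Matrix:
    """Element-wise product of noisy features and the (scaled) mask."""
    return [[x * m for x, m in zip(x_row, m_row)]
            for x_row, m_row in zip(noisy, mask)]


est_mask = [[0.0, 0.25, 1.0]]
scaled = scale_mask(est_mask)                       # approx [[0.1, 0.5, 1.0]]
enhanced = apply_mask([[4.0, 4.0, 4.0]], scaled)    # approx [[0.4, 2.0, 4.0]]
```

With α = 0.5 every mask value below 1 moves closer to 1, and the floor keeps fully suppressed bins from reaching zero, so suppression is less aggressive than with the raw mask.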

Experiments
- Datasets
- Training

Datasets
• The speech enhancement training set is created by passing the clean utterances through a room simulator that first adds reverberation, followed by two kinds of interference:
  - Librispeech + noise (from Getty Audio / YouTube Audio Library)
  - Librispeech + competing speaker (from Librispeech)
• There are two types of AEC training data:
  - Simulated: a reference signal is played out through a reverberant room simulator and then recorded.
  - Re-recorded: utterances drawn from a dataset collected internally for text-to-speech (TTS) purposes are re-recorded. This set is additionally augmented with actual TTS utterances.

Training
LFBE
  - Dim: 128
  - Window size: 32 ms
  - Hop size: 10 ms
D-vector
  - Dim: 256
Conformer
  - Causal
  - Masked-attention
  - Dim: 256
  - FFW size: x 6, x 8
AEC-only (#conformer)
  - Primary encoder: 6
SE-only & Joint (#conformer)
  - Primary encoder: 2
  - Noise context encoder: 2
  - Cross-attention encoder: 2

Results
- AEC-only
- SE-only
- Joint

Conclusions
What has been done?

Conclusions
• We present a frontend for improving the robustness of ASR that jointly implements three modules within a single model: acoustic echo cancellation, speech enhancement, and speech separation.
• This is achieved by making use of
  (1) a reference signal of the playback audio
  (2) a noise context
  (3) a target speaker embedding
• We present detailed evaluations to show that the joint model performs almost as well as the task-specific models, and significantly reduces word error rate in noisy conditions even when using a large-scale state-of-the-art ASR model.

Enhancement results on Delta dataset:
AOI / Inter-training / Elderly

CER (%)         SMIL-635hrTW  SepFormer (WHAMR)  SepFormer (WHAM16k)  SepFormer (WHAM)
AOI             27.27         29.15              36.32                31.84
Elder           28.31         47.99              41.57                50.80
Inter-training  14.18         22.80              18.76                23.51
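
For reference, character error rate (CER) is typically computed as the character-level edit distance between hypothesis and reference, divided by the reference length. The sketch below illustrates that definition only; it is not the evaluation script behind the numbers above, and the function names are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) over characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer_percent(ref: str, hyp: str) -> float:
    """CER in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


example = cer_percent("abcd", "abed")  # one substitution out of four characters
```

Because insertions count as errors, CER can exceed 100% when the hypothesis is much longer than the reference.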

THANK YOU
Any questions?
You can find me at
📨 jasonho610@gmail.com NTNU-SMIL
