1. Speech Separation under
Reverberant Condition
Presenter : 何冠勳 61047017s
Date : 2022/08/31
1: Tobias Cord-Landwehr et al., “Monaural Source Separation: from Anechoic to Reverberant Environments”, IWAENC 2022
2: Cem Subakan et al., “On Using Transformers for Speech-Separation”, IEEE Signal Processing Letters
3: Jens Heitkaemper et al., “Demystifying TasNet: a Dissecting Approach”, ICASSP 2020
4. What is Speech Separation?
See pages 1 ~ 5 in the slides by Professor HUNG-YI LEE.
<Speech Separation - HUNG-YI LEE>
5. History
◉ Starting with the seminal papers on deep clustering and permutation invariant training (PIT),
improvements have been achieved by combining the two in a multi-objective training criterion,
or replacing the STFT with a learnable encoder and decoder, e.g. Conv-TasNet,
or accounting for short- and longer-term correlations in the signal, e.g. DPRNN,
or conditioning on simultaneously computed speaker centroids, e.g. Wavesplit.
◉ Overall, this has led to an improvement in SI-SDR from roughly 10 dB to over 20 dB on the standard
WSJ0-2mix dataset, which consists of artificial mixtures of anechoic speech.
anechoic (adj.): free from echoes and reverberations
8. Datasets
◉ Common datasets include:
➢ anechoic: WSJ0mix, LibriMix
➢ noisy: WHAM!, LibriMix
➢ reverberant: SMS-WSJ
➢ noisy & reverberant: WHAMR!
◉ There are also data for the speech enhancement task in WHAM!, WHAMR! and LibriMix.
◉ Beyond utterance-wise mixtures, there are also continuous speech separation corpora, such as LibriCSS.
9. Non-anechoic
◉ However, an anechoic environment is a rather unrealistic assumption as in a real-world scenario, the
superposition of the speech of two or more speakers typically occurs in a distant microphone setting.
◉ In particular, reverberation has been considered more challenging than noise.
◉ WHAMR! and SMS-WSJ are two widely used datasets for research on source separation for
reverberant mixtures. Both contain artificially reverberated utterances from the WSJ corpus.
◉ Source separation performance is 2 ~ 8 dB on WHAMR! and 5 ~ 6 dB on SMS-WSJ for single-channel
input and single-stage processing, which is much worse than the performance on anechoic mixtures.
10. Non-anechoic
◉ In [1], they aim to explore which of the recent innovations that proved useful for the separation of
anechoic mixtures are also beneficial in the reverberant case, in order to propose some guidelines on
how to adjust a separation system to reverberated input.
◉ They take the SepFormer architecture, which achieves state-of-the-art performance on both
WSJ0-2mix and WHAMR!, and the traditional PIT-BLSTM model, and evaluate both models on the
SMS-WSJ dataset.
◉ By modifying and optimizing each model w.r.t. loss function, encoder/decoder architecture,
resolution and representation, they adapt the models to mitigate the performance degradation between
the anechoic and reverberant scenarios.
11. Non-anechoic
◉ In [2], they expand the study by providing additional experiments and insights on more realistic and
challenging datasets such as Libri2/3-Mix, which include longer mixtures, WHAM! and WHAMR!,
featuring noisy and noisy & reverberant conditions respectively.
◉ Moreover, on WHAM! and on WHAMR! datasets they also provide results for speech enhancement.
◉ Another contribution of [2] is investigating different types of self-attention for speech separation with
and without the dual-path mechanism. Namely, they compare the vanilla Transformer with
Longformer, Linformer, and Reformer.
13. Objectives
◉ In this presentation, we expect to obtain:
✓ separation performance on WHAM!, WHAMR!, LibriMix and SMS-WSJ,[1][2]
✓ enhancement performance on WHAM! and WHAMR!, [2]
✓ modifications that are rewarding to adapt to the reverberant condition,[1]
✓ analysis on those modifications[1]
and
✓ comparison among Transformer variants. [2]
15. Basic Architecture
◉ Encoder module is used to transform short segments of the mixture waveform into their
corresponding representations in an intermediate feature space.
◉ This representation is then used to estimate a multiplicative function (separation mask) for each
source at each time step.
◉ The source waveforms are then reconstructed by transforming the masked encoder features
using a decoder module.
(Encoder: conv1d ; Decoder: transposed conv1d)
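The encoder/mask/decoder pipeline above can be sketched in a few lines of NumPy; the basis filters and the masks below are random stand-ins for quantities the network would actually learn, and all sizes are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (kernel 16, stride 8, 64 basis filters, 2 speakers).
L, stride, F, K = 16, 8, 64, 2
T = 8000

basis = rng.standard_normal((F, L)) * 0.1   # stand-in for learned encoder filters

def encode(x):
    """conv1d: frame the waveform and project each frame onto the basis (+ ReLU)."""
    n_frames = (len(x) - L) // stride + 1
    frames = np.stack([x[i * stride : i * stride + L] for i in range(n_frames)])
    return np.maximum(frames @ basis.T, 0.0)            # (n_frames, F)

def decode(rep):
    """transposed conv1d: project frames back and overlap-add in the time domain."""
    out = np.zeros(rep.shape[0] * stride + L)
    for i, r in enumerate(rep):
        out[i * stride : i * stride + L] += r @ basis
    return out

x = rng.standard_normal(T)                  # the mixture waveform
h = encode(x)                               # intermediate representation
masks = rng.random((K, *h.shape))
masks /= masks.sum(axis=0)                  # dummy multiplicative masks, one per speaker
estimates = [decode(m * h) for m in masks]  # masked features -> source waveforms
```

Each estimated source is obtained by masking the shared representation and running it through the same decoder, exactly the three-stage structure described above.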
17. Metrics
See pages 6 ~ 9 in the slides by Professor HUNG-YI LEE.
<Speech Separation - HUNG-YI LEE>
18. PIT
See pages 10 ~ 11 in the slides by Professor HUNG-YI LEE.
<Speech Separation - HUNG-YI LEE>
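As a minimal illustration of utterance-level PIT (my own sketch, not code from the referenced slides): the loss is evaluated for every speaker permutation and the best assignment is kept, which resolves the output-ordering ambiguity.

```python
import numpy as np
from itertools import permutations

def pit_mse(estimates, targets):
    """Utterance-level PIT: evaluate every speaker permutation of the model
    outputs against the targets and keep the one with the lowest MSE."""
    K = len(targets)
    best_loss, best_perm = np.inf, None
    for perm in permutations(range(K)):
        loss = np.mean([np.mean((estimates[p] - targets[k]) ** 2)
                        for k, p in enumerate(perm)])
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

With K speakers this evaluates K! permutations, which is why PIT is only practical for small speaker counts.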
19. ◉ Transformers enable direct and accurate context modeling of
longer-term dependencies, which renders them suitable for
audio processing, especially speech separation, where
long-term modeling has been shown to impact performance
significantly.
◉ However, to avoid confusion, the Transformer in this paper
refers specifically to the encoder part, and it comprises
three core modules: scaled dot-product attention, multi-head
attention and a position-wise feed-forward network.
◉ [Machine Learning 2021] Transformer (Part 1),
[Machine Learning 2021] Transformer (Part 2),
Understanding the Attention and Transformer Architecture
Transformer
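As a reminder, the first of these modules is the textbook operation below (a generic sketch, not code from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted sum of values
```

Multi-head attention simply runs several of these in parallel on projected inputs and concatenates the results.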
27. SepFormer
◉ For a mixture x ∈ R^T, the encoder learns an STFT-like representation h = ReLU(conv1d(x)) ∈ R^{F×T′}.
◉ The masking network is fed by h and estimates masks m_1, …, m_K for each of the K speakers:
1. Linear + Chunking (segmentation + 50% overlap) : h ↦ h′ ∈ R^{F×C×N_c}
2. SepFormer block (intra-Transformer + inter-Transformer) : h′ ↦ h″ = SepFormer(h′)
3. Linear + PReLU : h″ ↦ h‴
28. SepFormer
4. Overlap-add : the chunks are merged back, h‴ ↦ h⁗ ∈ R^{(K·F)×T′}
5. FFW + PReLU : h⁗ ↦ masks m_k ∈ R^{F×T′}
◉ The input to the decoder is the element-wise multiplication between the masks and the initial
representation: ŝ_k = Decoder(m_k ⊙ h)
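The chunking (step 1) and overlap-add (step 4) bookkeeping can be sketched as follows; the 50% overlap follows the scheme described above, while the exact zero-padding policy is my own assumption:

```python
import numpy as np

def chunk(h, C):
    """Split (T', F) into 50%-overlapping chunks of length C -> (n_chunks, C, F)."""
    hop = C // 2
    Tp, F = h.shape
    n_chunks = int(np.ceil(max(Tp - C, 0) / hop)) + 1
    pad = (n_chunks - 1) * hop + C - Tp           # zero-pad so chunks tile exactly
    h = np.pad(h, ((0, pad), (0, 0)))
    return np.stack([h[i * hop : i * hop + C] for i in range(n_chunks)]), pad

def overlap_add(chunks, pad):
    """Inverse of chunk(): sum the overlapping chunks back onto the time axis."""
    n_chunks, C, F = chunks.shape
    hop = C // 2
    out = np.zeros(((n_chunks - 1) * hop + C, F))
    for i in range(n_chunks):
        out[i * hop : i * hop + C] += chunks[i]
    return out[: out.shape[0] - pad] if pad else out
```

The round trip preserves the original shape, which is what the masking network relies on.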
29. SepFormer Block
◉ The Intra-Transformer processes the second dimension of h′ (the within-chunk axis), and thus acts on each chunk
independently, modeling the short-term dependencies within each chunk.
◉ Next, we permute the last two dimensions (denoted as P), and the Inter-Transformer is
applied to model the transitions across chunks.
◉ Overall transformation : SepFormer(h′) = P(f_inter(P(f_intra(h′))))
◉ The above constitutes one SepFormer block, and N such blocks can be stacked.
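A shape-level sketch of one dual-path block; `f_intra` and `f_inter` are toy stand-ins for the actual Transformer layers:

```python
import numpy as np

def dual_path_block(h, f_intra, f_inter):
    """h: (n_chunks, C, F). f_intra models within-chunk (short-term) structure;
    f_inter models across-chunk (long-term) structure after swapping axes."""
    h = f_intra(h)                      # sequence axis = within-chunk positions
    h = np.transpose(h, (1, 0, 2))      # (C, n_chunks, F): chunks become the sequence
    h = f_inter(h)                      # sequence axis = chunk index
    return np.transpose(h, (1, 0, 2))   # back to (n_chunks, C, F)

# Toy stand-in for a Transformer layer: mix information along the sequence axis.
mix_seq = lambda x: x + x.mean(axis=1, keepdims=True)

h = np.random.default_rng(0).standard_normal((7, 250, 64))
out = dual_path_block(h, mix_seq, mix_seq)
```

The axis permutation is the whole trick: the same sequence model alternately sees short windows and the slow chunk-to-chunk evolution.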
32. SepFormer
◉ Parameters:
○ Encoder basis 256 ; Kernel size 16 ; Chunk size 250 ;
○ Transformers 8 ; SepFormer blocks 2 ; Attention heads 8 ; Dimension 1024 ;
○ Optimizer Adam ; Loss negative SI-SNR ; Learning rate 1.5e-4 ; Batch size 1 ; Epochs 200 .
◉ They explored the use of dynamic mixing data augmentation, which consists of the on-the-fly creation of
new mixtures from single-speaker sources, along with speed perturbation in [95%, 105%].
◉ The training process also applies learning-rate halving, gradient clipping and mixed precision.
Mixed-precision training is a technique for substantially reducing neural-net training time by
performing as many operations as possible in fp16 instead of fp32.
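The negative SI-SNR loss listed above can be written compactly; this is the standard scale-invariant form (mean removal, projection onto the reference, then the log ratio):

```python
import numpy as np

def neg_si_snr(est, ref, eps=1e-8):
    """Negative scale-invariant SNR between a 1-D estimate and reference."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference: the scaled target component.
    s_target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    e_noise = est - s_target
    ratio = np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps)
    return -10.0 * np.log10(ratio + eps)
```

Because of the projection, rescaling the estimate does not change the loss, which is why it pairs well with learnable (unnormalized) encoders.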
33. SepFormer
◉ When using dynamic mixing, SepFormer
achieves state-of-the-art performance.
◉ SepFormer outperforms previous
systems without using dynamic mixing
except Wavesplit, which uses speaker
identity as additional information.
34. SepFormer
◉ A respectable performance of 19.2 dB is obtained
even when we use a single layer Transformer for
the Inter-Transformer. This suggests that the
Intra-Transformer, and thus local processing, has
a greater influence on the performance.
◉ It also emerges that positional encoding is helpful.
A similar outcome has been observed in T-gsa for
speech enhancement.
◉ Finally, it can be observed that dynamic mixing
helps the performance drastically.
35. SepFormer
❗ However, in my experience the memory consumption of SepFormer during training is, on the
contrary, so large that it cannot fit on one 24 GB card.
❗ As a result, they can only train with a batch size of one on a 32 GB card.
36. PIT-BLSTM
◉ Encoder STFT ; Decoder iSTFT ;
Masking net 3 BLSTM + 2 FC ;
Hidden units 600 ;
Window size 512 ;
Frame shift 128 ;
Feature dim 257 ;
Loss MSE (on magnitude).
37. PIT-BLSTM
◉ In [1], SepFormer is trained with a soft-thresholded time-domain negative SDR:

L(ŝ, s) = −10 log₁₀( ‖s‖² / ( ‖ŝ − s‖² + τ‖s‖² ) ) , with τ = 10^(−SDRmax/10)

❓ This loss decreases the contribution of well-separated examples to the gradient, encouraging
the model to focus more on enhancing examples with a low SDR than on those that already show
a good separation.
(I argue that such a modification is mainly for stability, e.g. when the distance between estimation
and ground truth is close enough, like the evaluation in Sudormrf.)
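A NumPy version of this loss, assuming the common soft-thresholded form with an SDR ceiling `sdr_max` (the exact constant used in [1] may differ):

```python
import numpy as np

def neg_thresholded_sdr(est, ref, sdr_max=30.0, eps=1e-8):
    """Negative soft-thresholded SDR: the tau*||ref||^2 term in the denominator
    caps the achievable SDR at sdr_max dB, so nearly perfect estimates stop
    contributing large gradients."""
    tau = 10.0 ** (-sdr_max / 10.0)
    num = np.dot(ref, ref)
    den = np.dot(est - ref, est - ref) + tau * num + eps
    return -10.0 * np.log10(num / den + eps)
```

A perfect estimate reaches exactly −sdr_max rather than −∞, which is the "soft threshold" in action.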
39. Comparison
◉ In [1], SepFormer (small) reduces the original number of
intra- and inter-Transformer layers to 4 each. This
modification yields an SDR about 1 dB lower on
WSJ0-2mix, but significantly reduces the number of
parameters.
❗ Note that the memory footprint of SepFormer (small) is still 16 times larger than that of the PIT-BLSTM, so a
complexity comparison based purely on parameter counts is not fair.
41. LibriMix
◉ LibriMix is proposed as an alternative to WSJ0-2/3Mix with more diverse and longer utterances.
◉ The network is trained to perform separation and de-noising at the same time.
42. WHAM! & WHAMR!
◉ In WHAM! dataset, the model learns to de-noise while
also separating, and in WHAMR! dataset, the model learns
to de-noise and de-reverberate in addition to separating.
◉ SepFormer outperforms the previously proposed methods
even without the use of DM, which further improves the
performance.
43. Speech Enhancement
◉ In addition to training SepFormer, we also trained
Bidirectional LSTM, CNNTransformer and CFDN models
from SpeechBrain, which are trained to minimize the MSE
between the estimated magnitude spectrogram and the
spectrogram of the clean signal.
<The SpeechBrain Toolkit>
44. SMS-WSJ
◉ It can be seen that both systems degrade under the presence of reverberation. However, the
separation performance of the SepFormer degrades by more than 10 dB in terms of SDR and
almost 30 percentage points regarding the WER.
(WERs are evaluated with the Kaldi baseline provided in SMS-WSJ.)
45. Optimization
◉ Apart from the discrepancy in masking-net structure, there are also other differences
between the two models, namely the loss function, the encoder/decoder architecture and the resolution.
◉ The choice of representation, e.g. magnitude, phase or both, is also worth exploring.
◉ But first, they check whether the degradation is on account of artifacts. The method to
mitigate that is adding white Gaussian noise at an SNR of 25 dB to the separated audio
files before they are input to the speech recognizer.
<White Gaussian Noise>
<How Bad Are Artifacts?: Analyzing the Impact of Speech Enhancement Errors on ASR>
46. Optim.
◉ Even more so, these modifications work well both with and without reverberation and lead to a
reduction in WER of more than 20 percentage points for both scenarios.
◉ As for SepFormer, it is striking that it is no longer superior to PIT-BLSTM for reverberant data.
47. Optim. - En/Decoder
◉ There is a large mismatch between the window and frame sizes of the two systems. It is necessary
to verify whether a violation of the Multiplicative Transfer Function Approximation (MTFA),
caused by the small window size, contributes to the system's deterioration under
reverberation.
◉ The MTFA is a long-standing problem for any application that uses the STFT in a reverberant scenario.
When switching from a fixed STFT encoder to a learnable encoder, the overall structure of the
system stays the same. Therefore, it can be assumed that similar issues arise with the learnable
encoder.
48. MTFA
◉ Mask-based separation relies on the sparsity and orthogonality of the sources in the domain
where the masks are computed. In the case of the STFT domain, this means that a time-frequency bin
of a mixture will typically be dominated by a single source:

X(t, f) ≈ Σ_k S_k(t, f) · H_k(f) ,

where S_k(t, f) and H_k(f) are the STFT representation of the k-th
source signal and the transfer function of the Room Impulse Response (RIR) from the k-th source to the
microphone, respectively.
◉ Note that the equation assumes that the convolution of the source signal with the
RIR corresponds to a multiplication in the STFT domain. This so-called Multiplicative Transfer
Function Approximation, however, only holds true if the temporal extent of the RIR is smaller than the
STFT analysis window.
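The breakdown of the MTFA can be checked numerically. In this toy setup (my own, with a synthetic exponentially decaying random RIR), the relative error of the multiplicative approximation stays small when the RIR fits inside the analysis window and grows drastically when it does not:

```python
import numpy as np

rng = np.random.default_rng(0)

def stft(x, win, hop):
    frames = np.stack([x[i : i + win] * np.hanning(win)
                       for i in range(0, len(x) - win, hop)])
    return np.fft.rfft(frames, axis=1)

def mtfa_error(rir_len, win=512):
    """Relative error of STFT(s * h) ~= STFT(s) . FFT(h) for a toy
    exponentially decaying random RIR of length rir_len."""
    s = rng.standard_normal(8192)
    h = rng.standard_normal(rir_len) * np.exp(-np.arange(rir_len) / (rir_len / 4))
    y = np.convolve(s, h)[: len(s)]            # reverberant signal
    Y, S = stft(y, win, win // 2), stft(s, win, win // 2)
    H = np.fft.rfft(h, win)                    # note: rfft crops h if rir_len > win
    return np.linalg.norm(Y - S * H) / np.linalg.norm(Y)

short = mtfa_error(rir_len=32)     # RIR well inside the 512-sample window
long_ = mtfa_error(rir_len=4096)   # RIR much longer than the window
```

This mirrors the argument above: with a long RIR, a single per-frequency multiplication cannot represent the energy smeared across frames.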
49.
Proof of “convolution in time = multiplication in frequency”
A convolution of f and g is defined as:
(f ∗ g)(t) = ∫ f(τ) g(t − τ) dτ
So, its Fourier transform will be:
F{f ∗ g}(ω) = ∫ [ ∫ f(τ) g(t − τ) dτ ] e^{−jωt} dt
Since f, g ∈ L¹(R), we may apply Fubini's theorem (i.e. interchange the order of integration):
F{f ∗ g}(ω) = ∫ f(τ) e^{−jωτ} [ ∫ g(t − τ) e^{−jω(t−τ)} dt ] dτ = F(ω) · G(ω)
(For details, see a DSP course.)
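The discrete analogue is easy to verify with NumPy's FFT, provided we zero-pad so that circular convolution coincides with linear convolution:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(64)
g = rng.standard_normal(64)

n = len(f) + len(g) - 1                    # length of the linear convolution
lhs = np.fft.fft(np.convolve(f, g), n)     # F{f * g}
rhs = np.fft.fft(f, n) * np.fft.fft(g, n)  # F{f} . F{g}
```

Without the zero-padding to length n, the FFT product would instead correspond to circular convolution, which is exactly the kind of mismatch the MTFA discussion is about.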
50. “
The multiplication is NOT accurate in the STFT domain!
In particular, this is on account of the short-time windowing.
<On MTF Approximation in the STFT Domain>
<STFT: Influence of Window Function>
51.
◉ Expected behavior in anechoic conditions :
✓ Reducing the frame shift leads to an improvement in SDR and WER. The recommended
analysis window size and shift of 16 and 8 samples (i.e. 2 ms and 1 ms), respectively, provides
the best results.
✓ Furthermore, the learnable encoder proves superior to the STFT encoder.
52.
◉ Unexpected behaviour:
✗ While the STFT encoder is significantly worse than a learnable encoder for a small window
size and shift, it becomes on par with or even outperforms the learnable encoder for an
increased window size of 32 ms (which avoids the violation of the MTFA).
✗ Interestingly, the overall best results of the SepFormer are achieved with the STFT.
53.
◉ W-Disjoint Orthogonality (WDO) is a score which
measures the orthogonality of the single-speaker utterances
in the latent space (though it is not highly correlated with separation performance).
◉ Reverberation shifts the representations from high orthogonality to relatively low orthogonality,
which is mitigated by a larger window. This indicates that the learnable encoder is able to
compensate for the effects to some degree, but that choosing a large enough window size is
mandatory to stabilize the performance under reverberation.
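A rough disjointness proxy in the STFT domain (my own simplification, not the exact WDO definition used in [1]): the fraction of total energy lying in time-frequency bins where one source clearly dominates the other.

```python
import numpy as np

def stft_mag(x, win=256, hop=128):
    frames = np.stack([x[i : i + win] * np.hanning(win)
                       for i in range(0, len(x) - win, hop)])
    return np.abs(np.fft.rfft(frames, axis=1))

def dominance_score(s1, s2, margin_db=10.0):
    """Fraction of total spectrogram energy in time-frequency bins where one
    source exceeds the other by margin_db -- a crude disjointness measure."""
    A, B = stft_mag(s1), stft_mag(s2)
    ratio_db = 20 * np.log10((A + 1e-12) / (B + 1e-12))
    dominated = np.abs(ratio_db) > margin_db
    energy = A ** 2 + B ** 2
    return float((energy * dominated).sum() / energy.sum())
```

Spectrally disjoint signals score near 1 (masking can cleanly pick bins), while heavily overlapping signals score near 0, which is the regime where reverberation pushes the sources.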
54. Optim. - Representation
◉ A significant difference between PIT-BLSTM and SepFormer is that the PIT model estimates the
masks based on the magnitude spectrogram only, whereas the SepFormer mask estimator implicitly
has access to both magnitude and phase.
◉ Results show that the phase information is not helpful for the SepFormer in the reverberant
scenario. Even more so, omitting the phase information leads to better system performance.
55. Optim. - Representation
◉ Firstly, using only the magnitude spectrogram allows a larger window of 512 samples while keeping the
size of the separator identical, increasing the temporal context of each frame even further.
◉ Secondly, papers have shown that the phase becomes less informative while the magnitude
becomes more informative as the frame size increases.
◉ This matches the large window sizes that were shown to be necessary for the reverberant scenario.
◉ An interesting finding is that using the magnitude with the frame shift increased from 16 to 128 samples
improves both signal-level metrics and the WER.
<Phase-Aware Deep Speech Enhancement: It's All About The Frame Length>
56. Summary
◉ When only the network of the separator is different, i.e., intra/inter-transformer layers vs BLSTM
layers, the performances are close.
◉ However, this modified SepFormer only shows a marginally better SDR and an improvement of 1
percentage point in the WER.
(Compared configurations: STFT + SDR loss + added Gaussian noise ; learned en/decoder with window 16 ; STFT with window 512.)
57. My Perspective
◉ What possibly causes the inconsistent performance on WHAMR! (state-of-the-art) and SMS-WSJ (only marginally
better)?
➜ In [1] and [2], the configurations of the two differ.
➜ If trained properly, the obvious benefit of having more parameters is that the model can
represent much more complicated functions than one with fewer parameters.
(If not, having more parameters increases the risk of overfitting.)
◉ What is the consensuses in separation, even under reverberant? (cont.)
58. My Perspective
                       Time-domain loss | Fine-grained resolution | Learned en/decoder
Anechoic
  SepFormer[2]           (16/8k = 2 ms)
  Conv-TasNet[3]         (0.5 ms) (improv. < 2 dB)
Reverberant
  SepFormer (small)[1]   (64 ms) (improv. < 1 dB)
  SepFormer[2]           (2 ms)
  Conv-TasNet[3]         (8 ms) (improv. < 2 dB)
(Legend: Agree / Disagree)
<Preference for 20-40 ms window duration in speech analysis>
Typical frame lengths in speech processing, around 32 ms, correspond to
an interval that is short enough to be considered quasi-stationary but
long enough to cover multiple fundamental periods of voiced speech,
whose fundamental period lies between 2 ms and 12.5 ms.
60. My Perspective
➜ There is consensus for the anechoic environment.
61. My Perspective
➜ Not a crucial factor, but FFC (Fast Fourier Convolution) may help.
<Fast Fourier Convolution>
62. My Perspective
➜ Perhaps a multi-resolution encoder is a solution!
<Multi-encoder multi-resolution framework for end-to-end speech recognition>
63. My Perspective
◉ Is SDR the right metric for obtaining a better WER?
<Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR>
67. Conclusions
◉ In [1][2], we have explored speech separation and enhancement based on SepFormer, an
attention-only masking network that is able to achieve state-of-the-art performance in anechoic
conditions.
◉ The experiments in [1] and [3] raise the issue of whether the improvements that have proven
useful for the separation of anechoic mixtures are futile for the more realistic case of
reverberant source separation.
◉ We also investigate the use of more efficient self-attention mechanisms such as Longformer,
Linformer and Reformer.
◉ Can we find a versatile model that can separate speech in both anechoic and reverberant
condition but with less parameters or even model size?