1. Speech Separation under
Reverberant Condition
Presenter : 何冠勳 61047017s
Date : 2022/08/31
1: Tobias Cord-Landwehr et al., “Monaural Source Separation: from Anechoic to Reverberant Environments”, IWAENC 2022
2: Cem Subakan et al., “On Using Transformers for Speech-Separation”, IEEE Signal Processing Letters
3: Jens Heitkaemper et al., “Demystifying TasNet: a Dissecting Approach”, ICASSP 2020
4. What is Speech Separation?
See pages 1 ~ 5 in the slides by Professor HUNG-YI LEE.
<Speech Separation - HUNG-YI LEE>
5. History
◉ Starting with the seminal papers on deep clustering and permutation invariant training (PIT),
improvements have been achieved by combining the two in a multi-objective training criterion,
or replacing the STFT with a learnable encoder and decoder, e.g. Conv-TasNet,
or accounting for short- and longer-term correlations in the signal, e.g. DPRNN,
or conditioning on simultaneously computed speaker centroids, e.g. Wavesplit.
◉ Overall, this has led to an improvement in SI-SDR from roughly 10 dB to over 20 dB on the standard
WSJ0-2mix dataset, which consists of artificial mixtures of anechoic speech.
anechoic (adj.): free from echoes and reverberations
8. Datasets
◉ Common datasets include:
➢ anechoic: WSJ0mix, LibriMix
➢ noisy: WHAM!, LibriMix
➢ reverberant: SMS-WSJ
➢ noisy & reverberant: WHAMR!
◉ There are also data for the speech enhancement task in WHAM!, WHAMR! and LibriMix.
◉ Beyond utterance-wise mixtures, there are also continuous speech separation corpora, such as LibriCSS.
9. Non-anechoic
◉ However, an anechoic environment is a rather unrealistic assumption as in a real-world scenario, the
superposition of the speech of two or more speakers typically occurs in a distant microphone setting.
◉ In particular, reverberation has been considered more challenging than noise.
◉ WHAMR! and SMS-WSJ are two widely used datasets for research on source separation for
reverberant mixtures. Both contain artificially reverberated utterances from the WSJ corpus.
◉ Source separation performance is 2 ~ 8 dB on WHAMR! and 5 ~ 6 dB on SMS-WSJ for single-channel
input and single-stage processing, which is much worse than the performance on anechoic mixtures.
10. Non-anechoic
◉ In [1], they aim to explore which of the recent innovations that proved useful for the separation of
anechoic mixtures are also beneficial in the reverberant case, in order to propose some guidelines on
how to adjust a separation system to reverberated input.
◉ They take the SepFormer architecture, which achieves state-of-the-art performance on both
WSJ0-2mix and WHAMR!, and the traditional PIT-BLSTM model, and evaluate both models on the
SMS-WSJ dataset.
◉ By modifying and optimizing each model w.r.t. loss function, encoder/decoder architecture,
resolution and representation, they adapt the models to mitigate the performance degradation between
the anechoic and reverberant scenarios.
11. Non-anechoic
◉ In [2], they expand the study by providing additional experiments and insights on more realistic and
challenging datasets such as Libri2/3-Mix, which include longer mixtures, WHAM! and WHAMR!,
featuring noisy and noisy & reverberant conditions respectively.
◉ Moreover, on WHAM! and on WHAMR! datasets they also provide results for speech enhancement.
◉ Another contribution of [2] is investigating different types of self-attention for speech separation with
and without the dual-path mechanism. Namely, they compare the vanilla Transformer with
Longformer, Linformer, and Reformer.
13. Objectives
◉ In this presentation, we expect to obtain:
✓ separation performance on WHAM!, WHAMR!, LibriMix and SMS-WSJ,[1][2]
✓ enhancement performance on WHAM! and WHAMR!, [2]
✓ modifications that are rewarding to adapt to the reverberant condition,[1]
✓ analysis on those modifications[1]
and
✓ comparison among Transformer variants. [2]
15. Basic Architecture
◉ Encoder module is used to transform short segments of the mixture waveform into their
corresponding representations in an intermediate feature space.
◉ This representation is then used to estimate a multiplicative function (separation mask) for each
source at each time step.
◉ The source waveforms are then reconstructed by transforming the masked encoder features
using a decoder module.
(Encoder: conv1d ; Decoder: transposed conv1d)
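The encoder/mask/decoder pipeline above can be sketched in a few lines of NumPy; the basis filters and the masks below are random stand-ins for quantities the network would actually learn, and all sizes are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (kernel 16, stride 8, 64 basis filters, 2 speakers).
L, stride, F, K = 16, 8, 64, 2
T = 8000

basis = rng.standard_normal((F, L)) * 0.1   # stand-in for learned encoder filters

def encode(x):
    """conv1d: frame the waveform and project each frame onto the basis (+ ReLU)."""
    n_frames = (len(x) - L) // stride + 1
    frames = np.stack([x[i * stride : i * stride + L] for i in range(n_frames)])
    return np.maximum(frames @ basis.T, 0.0)            # (n_frames, F)

def decode(rep):
    """transposed conv1d: project frames back and overlap-add in the time domain."""
    out = np.zeros(rep.shape[0] * stride + L)
    for i, r in enumerate(rep):
        out[i * stride : i * stride + L] += r @ basis
    return out

x = rng.standard_normal(T)                  # the mixture waveform
h = encode(x)                               # intermediate representation
masks = rng.random((K, *h.shape))
masks /= masks.sum(axis=0)                  # dummy multiplicative masks, one per speaker
estimates = [decode(m * h) for m in masks]  # masked features -> source waveforms
```

Each estimated source is obtained by masking the shared representation and running it through the same decoder, exactly the three-stage structure described above.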
17. Metrics
See pages 6 ~ 9 in the slides by Professor HUNG-YI LEE.
<Speech Separation - HUNG-YI LEE>
18. PIT
See pages 10 ~ 11 in the slides by Professor HUNG-YI LEE.
<Speech Separation - HUNG-YI LEE>
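As a minimal illustration of utterance-level PIT (my own sketch, not code from the referenced slides): the loss is evaluated for every speaker permutation and the best assignment is kept, which resolves the output-ordering ambiguity.

```python
import numpy as np
from itertools import permutations

def pit_mse(estimates, targets):
    """Utterance-level PIT: evaluate every speaker permutation of the model
    outputs against the targets and keep the one with the lowest MSE."""
    K = len(targets)
    best_loss, best_perm = np.inf, None
    for perm in permutations(range(K)):
        loss = np.mean([np.mean((estimates[p] - targets[k]) ** 2)
                        for k, p in enumerate(perm)])
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

With K speakers this evaluates K! permutations, which is why PIT is only practical for small speaker counts.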
19. ◉ Transformers enable direct and accurate context modeling of
longer-term dependencies, which renders them suitable for
audio processing, especially speech separation, where
long-term modeling has been shown to impact performance
significantly.
◉ However, to avoid confusion, the Transformer in this paper
refers specifically to the encoder part, and it comprises
three core modules: scaled dot-product attention, multi-head
attention and a position-wise feed-forward network.
◉ [Machine Learning 2021] Transformer (Part 1),
[Machine Learning 2021] Transformer (Part 2),
Understanding the Attention and Transformer Architecture
Transformer
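As a reminder, the first of these modules is the textbook operation below (a generic sketch, not code from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted sum of values
```

Multi-head attention simply runs several of these in parallel on projected inputs and concatenates the results.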
27. SepFormer
◉ For a mixture x ∈ R^T, the encoder learns an STFT-like representation h = ReLU(conv1d(x)) ∈ R^{F×T′}.
◉ The masking network is fed by h and estimates masks m_1, …, m_K for each of the K speakers:
1. Linear + Chunking (segmentation + 50% overlap) : h ↦ h′ ∈ R^{F×C×N_c}
2. SepFormer block (intra-Transformer + inter-Transformer) : h′ ↦ h″ = SepFormer(h′)
3. Linear + PReLU : h″ ↦ h‴
28. SepFormer
4. Overlap-add : the chunks are merged back, h‴ ↦ h⁗ ∈ R^{(K·F)×T′}
5. FFW + PReLU : h⁗ ↦ masks m_k ∈ R^{F×T′}
◉ The input to the decoder is the element-wise multiplication between the masks and the initial
representation: ŝ_k = Decoder(m_k ⊙ h)
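The chunking (step 1) and overlap-add (step 4) bookkeeping can be sketched as follows; the 50% overlap follows the scheme described above, while the exact zero-padding policy is my own assumption:

```python
import numpy as np

def chunk(h, C):
    """Split (T', F) into 50%-overlapping chunks of length C -> (n_chunks, C, F)."""
    hop = C // 2
    Tp, F = h.shape
    n_chunks = int(np.ceil(max(Tp - C, 0) / hop)) + 1
    pad = (n_chunks - 1) * hop + C - Tp           # zero-pad so chunks tile exactly
    h = np.pad(h, ((0, pad), (0, 0)))
    return np.stack([h[i * hop : i * hop + C] for i in range(n_chunks)]), pad

def overlap_add(chunks, pad):
    """Inverse of chunk(): sum the overlapping chunks back onto the time axis."""
    n_chunks, C, F = chunks.shape
    hop = C // 2
    out = np.zeros(((n_chunks - 1) * hop + C, F))
    for i in range(n_chunks):
        out[i * hop : i * hop + C] += chunks[i]
    return out[: out.shape[0] - pad] if pad else out
```

The round trip preserves the original shape, which is what the masking network relies on.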
29. SepFormer Block
◉ The Intra-Transformer processes the second dimension of h′ (the within-chunk axis), and thus acts on each chunk
independently, modeling the short-term dependencies within each chunk.
◉ Next, we permute the last two dimensions (denoted as P), and the Inter-Transformer is
applied to model the transitions across chunks.
◉ Overall transformation : SepFormer(h′) = P(f_inter(P(f_intra(h′))))
◉ The above constitutes one SepFormer block, and N such blocks can be stacked.
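A shape-level sketch of one dual-path block; `f_intra` and `f_inter` are toy stand-ins for the actual Transformer layers:

```python
import numpy as np

def dual_path_block(h, f_intra, f_inter):
    """h: (n_chunks, C, F). f_intra models within-chunk (short-term) structure;
    f_inter models across-chunk (long-term) structure after swapping axes."""
    h = f_intra(h)                      # sequence axis = within-chunk positions
    h = np.transpose(h, (1, 0, 2))      # (C, n_chunks, F): chunks become the sequence
    h = f_inter(h)                      # sequence axis = chunk index
    return np.transpose(h, (1, 0, 2))   # back to (n_chunks, C, F)

# Toy stand-in for a Transformer layer: mix information along the sequence axis.
mix_seq = lambda x: x + x.mean(axis=1, keepdims=True)

h = np.random.default_rng(0).standard_normal((7, 250, 64))
out = dual_path_block(h, mix_seq, mix_seq)
```

The axis permutation is the whole trick: the same sequence model alternately sees short windows and the slow chunk-to-chunk evolution.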
32. SepFormer
◉ Parameters:
○ Encoder basis 256 ; Kernel size 16 ; Chunk size 250 ;
○ Transformers 8 ; SepFormer blocks 2 ; Attention heads 8 ; Dimension 1024 ;
○ Optimizer Adam ; Loss negative SI-SNR ; Learning rate 1.5e-4 ; Batch size 1 ; Epochs 200 .
◉ They explored the use of dynamic mixing data augmentation, which consists of the on-the-fly creation of
new mixtures from single-speaker sources, along with speed perturbation in [95%, 105%].
◉ The training process also applies learning-rate halving, gradient clipping and mixed precision.
Mixed-precision training is a technique for substantially reducing neural-net training time by
performing as many operations as possible in fp16 instead of fp32.
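The negative SI-SNR loss listed above can be written compactly; this is the standard scale-invariant form (mean removal, projection onto the reference, then the log ratio):

```python
import numpy as np

def neg_si_snr(est, ref, eps=1e-8):
    """Negative scale-invariant SNR between a 1-D estimate and reference."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference: the scaled target component.
    s_target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    e_noise = est - s_target
    ratio = np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps)
    return -10.0 * np.log10(ratio + eps)
```

Because of the projection, rescaling the estimate does not change the loss, which is why it pairs well with learnable (unnormalized) encoders.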
33. SepFormer
◉ When using dynamic mixing, SepFormer
achieves state-of-the-art performance.
◉ SepFormer outperforms previous
systems without using dynamic mixing
except Wavesplit, which uses speaker
identity as additional information.
34. SepFormer
◉ A respectable performance of 19.2 dB is obtained
even when we use a single layer Transformer for
the Inter-Transformer. This suggests that the
Intra-Transformer, and thus local processing, has
a greater influence on the performance.
◉ It also emerges that positional encoding is helpful.
A similar outcome has been observed in T-gsa for
speech enhancement.
◉ Finally, it can be observed that dynamic mixing
helps the performance drastically.
35. SepFormer
❗ However, in my experience the memory consumption of SepFormer during training is, on the
contrary, so large that it cannot fit on one 24 GB card.
❗ As a result, they can only train with a batch size of one on a 32 GB card.
36. PIT-BLSTM
◉ Encoder STFT ; Decoder iSTFT ;
Masking net 3 BLSTM + 2 FC ;
Hidden units 600 ;
Window size 512 ;
Frame shift 128 ;
Feature dim 257 ;
Loss MSE (on magnitude).
37. PIT-BLSTM
◉ In [1], SepFormer is trained with a soft-thresholded time-domain negative SDR:

L(ŝ, s) = −10 log₁₀( ‖s‖² / ( ‖ŝ − s‖² + τ‖s‖² ) ) , with τ = 10^(−SDRmax/10)

❓ This loss decreases the contribution of well-separated examples to the gradient, encouraging
the model to focus more on enhancing examples with a low SDR than on those that already show
a good separation.
(I argue that such a modification is mainly for stability, e.g. when the distance between estimation
and ground truth is close enough, like the evaluation in Sudormrf.)
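A NumPy version of this loss, assuming the common soft-thresholded form with an SDR ceiling `sdr_max` (the exact constant used in [1] may differ):

```python
import numpy as np

def neg_thresholded_sdr(est, ref, sdr_max=30.0, eps=1e-8):
    """Negative soft-thresholded SDR: the tau*||ref||^2 term in the denominator
    caps the achievable SDR at sdr_max dB, so nearly perfect estimates stop
    contributing large gradients."""
    tau = 10.0 ** (-sdr_max / 10.0)
    num = np.dot(ref, ref)
    den = np.dot(est - ref, est - ref) + tau * num + eps
    return -10.0 * np.log10(num / den + eps)
```

A perfect estimate reaches exactly −sdr_max rather than −∞, which is the "soft threshold" in action.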
39. Comparison
◉ In [1], SepFormer (small) reduces the original number of
intra- and inter-Transformer layers to 4 each. This
modification yields an SDR about 1 dB lower on
WSJ0-2mix, but significantly reduces the number of
parameters.
❗ Note that the memory footprint of SepFormer (small) is still 16 times larger than that of the PIT-BLSTM, so a
complexity comparison based purely on parameter counts is not fair.
41. LibriMix
◉ LibriMix is proposed as an alternative to WSJ0-2/3Mix with more diverse and longer utterances.
◉ The network is trained to perform separation and de-noising at the same time.
42. WHAM! & WHAMR!
◉ In WHAM! dataset, the model learns to de-noise while
also separating, and in WHAMR! dataset, the model learns
to de-noise and de-reverberate in addition to separating.
◉ SepFormer outperforms the previously proposed methods
even without the use of DM, which further improves the
performance.
43. Speech Enhancement
◉ In addition to training SepFormer, we also trained
Bidirectional LSTM, CNNTransformer and CFDN models
from SpeechBrain, which are trained to minimize the MSE
between the estimated magnitude spectrogram and the
spectrogram of the clean signal.
<The SpeechBrain Toolkit>
44. SMS-WSJ
◉ It can be seen that both systems degrade under the presence of reverberation. However, the
separation performance of the SepFormer degrades by more than 10 dB in terms of SDR and
almost 30 percentage points regarding the WER.
(WERs are evaluated with the Kaldi baseline provided in SMS-WSJ.)
45. Optimization
◉ Apart from the discrepancy in masking-net structure, there are also other differences
between the two models, namely the loss function, the encoder/decoder architecture and the resolution.
◉ The choice of representation, e.g. magnitude, phase or both, is also worth exploring.
◉ But first, they check whether the degradation is on account of artifacts. The method to
mitigate that is adding white Gaussian noise at an SNR of 25 dB to the separated audio
files before they are input to the speech recognizer.
<White Gaussian Noise>
<How Bad Are Artifacts?: Analyzing the Impact of Speech Enhancement Errors on ASR>
46. Optim.
◉ Even more so, these modifications work well both with and without reverberation and lead to a
reduction in WER of more than 20 percentage points for both scenarios.
◉ As for SepFormer, it is striking that it is no longer superior to PIT-BLSTM for reverberant data.
47. Optim. - En/Decoder
◉ There is a large mismatch between the window and frame sizes of the two systems. It is necessary
to verify whether a violation of the Multiplicative Transfer Function Approximation (MTFA),
caused by the small window size, contributes to the system's deterioration under
reverberation.
◉ The MTFA is a long-standing problem for any application that uses the STFT in a reverberant scenario.
When switching from a fixed STFT encoder to a learnable encoder, the overall structure of the
system stays the same. Therefore, it can be assumed that similar issues arise with the learnable
encoder.
48. MTFA
◉ Mask-based separation relies on the sparsity and orthogonality of the sources in the domain
where the masks are computed. In the case of the STFT domain, this means that a time-frequency bin
of a mixture will typically be dominated by a single source:

X(t, f) ≈ Σ_k S_k(t, f) · H_k(f) ,

where S_k(t, f) and H_k(f) are the STFT representation of the k-th
source signal and the transfer function of the Room Impulse Response (RIR) from the k-th source to the
microphone, respectively.
◉ Note that the equation assumes that the convolution of the source signal with the
RIR corresponds to a multiplication in the STFT domain. This so-called Multiplicative Transfer
Function Approximation, however, only holds true if the temporal extent of the RIR is smaller than the
STFT analysis window.
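The breakdown of the MTFA can be checked numerically. In this toy setup (my own, with a synthetic exponentially decaying random RIR), the relative error of the multiplicative approximation stays small when the RIR fits inside the analysis window and grows drastically when it does not:

```python
import numpy as np

rng = np.random.default_rng(0)

def stft(x, win, hop):
    frames = np.stack([x[i : i + win] * np.hanning(win)
                       for i in range(0, len(x) - win, hop)])
    return np.fft.rfft(frames, axis=1)

def mtfa_error(rir_len, win=512):
    """Relative error of STFT(s * h) ~= STFT(s) . FFT(h) for a toy
    exponentially decaying random RIR of length rir_len."""
    s = rng.standard_normal(8192)
    h = rng.standard_normal(rir_len) * np.exp(-np.arange(rir_len) / (rir_len / 4))
    y = np.convolve(s, h)[: len(s)]            # reverberant signal
    Y, S = stft(y, win, win // 2), stft(s, win, win // 2)
    H = np.fft.rfft(h, win)                    # note: rfft crops h if rir_len > win
    return np.linalg.norm(Y - S * H) / np.linalg.norm(Y)

short = mtfa_error(rir_len=32)     # RIR well inside the 512-sample window
long_ = mtfa_error(rir_len=4096)   # RIR much longer than the window
```

This mirrors the argument above: with a long RIR, a single per-frequency multiplication cannot represent the energy smeared across frames.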
49.
Proof of “convolution in time = multiplication in frequency”
A convolution of f and g is defined as:
(f ∗ g)(t) = ∫ f(τ) g(t − τ) dτ
So, its Fourier transform will be:
F{f ∗ g}(ω) = ∫ [ ∫ f(τ) g(t − τ) dτ ] e^{−jωt} dt
Since f, g ∈ L¹(R), we may apply Fubini's theorem (i.e. interchange the order of integration):
F{f ∗ g}(ω) = ∫ f(τ) e^{−jωτ} [ ∫ g(t − τ) e^{−jω(t−τ)} dt ] dτ = F(ω) · G(ω)
(For details, see a DSP course.)
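The discrete analogue is easy to verify with NumPy's FFT, provided we zero-pad so that circular convolution coincides with linear convolution:

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.standard_normal(64)
g = rng.standard_normal(64)

n = len(f) + len(g) - 1                    # length of the linear convolution
lhs = np.fft.fft(np.convolve(f, g), n)     # F{f * g}
rhs = np.fft.fft(f, n) * np.fft.fft(g, n)  # F{f} . F{g}
```

Without the zero-padding to length n, the FFT product would instead correspond to circular convolution, which is exactly the kind of mismatch the MTFA discussion is about.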
50. “
The multiplication is NOT accurate in the STFT domain!
In particular, this is on account of the short-time windowing.
<On MTF Approximation in the STFT Domain>
<STFT: Influence of Window Function>
51.
◉ Expected behavior in anechoic conditions :
✓ Reducing the frame shift leads to an improvement in SDR and WER. The recommended
analysis window size and shift of 16 and 8 samples (i.e. 2 ms and 1 ms), respectively, provides
the best results.
✓ Furthermore, the learnable encoder proves superior to the STFT encoder.
52.
◉ Unexpected behaviour:
✗ While the STFT encoder is significantly worse than a learnable encoder for a small window
size and shift, it becomes on par with or even outperforms the learnable encoder for an
increased window size of 32 ms (which avoids the violation of the MTFA).
✗ Interestingly, the overall best results of the SepFormer are achieved with the STFT.
53.
◉ W-Disjoint Orthogonality (WDO) is a score which
measures the orthogonality of the single-speaker utterances
in the latent space (though it is not highly correlated with separation performance).
◉ Reverberation shifts the representations from high orthogonality to relatively low orthogonality,
which is mitigated by a larger window. This indicates that the learnable encoder is able to
compensate for the effects to some degree, but that choosing a large enough window size is
mandatory to stabilize the performance under reverberation.
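A rough disjointness proxy in the STFT domain (my own simplification, not the exact WDO definition used in [1]): the fraction of total energy lying in time-frequency bins where one source clearly dominates the other.

```python
import numpy as np

def stft_mag(x, win=256, hop=128):
    frames = np.stack([x[i : i + win] * np.hanning(win)
                       for i in range(0, len(x) - win, hop)])
    return np.abs(np.fft.rfft(frames, axis=1))

def dominance_score(s1, s2, margin_db=10.0):
    """Fraction of total spectrogram energy in time-frequency bins where one
    source exceeds the other by margin_db -- a crude disjointness measure."""
    A, B = stft_mag(s1), stft_mag(s2)
    ratio_db = 20 * np.log10((A + 1e-12) / (B + 1e-12))
    dominated = np.abs(ratio_db) > margin_db
    energy = A ** 2 + B ** 2
    return float((energy * dominated).sum() / energy.sum())
```

Spectrally disjoint signals score near 1 (masking can cleanly pick bins), while heavily overlapping signals score near 0, which is the regime where reverberation pushes the sources.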
54. Optim. - Representation
◉ A significant difference between PIT-BLSTM and SepFormer is that the PIT model estimates the
masks based on the magnitude spectrogram only, whereas the SepFormer mask estimator implicitly
has access to both magnitude and phase.
◉ Results show that the phase information is not helpful for the SepFormer in the reverberant
scenario. Even more so, omitting the phase information leads to better system performance.
55. Optim. - Representation
◉ Firstly, using only the magnitude spectrogram allows a larger window of 512 samples while keeping the
size of the separator identical, increasing the temporal context of each frame even further.
◉ Secondly, papers have shown that the phase becomes less informative while the magnitude
becomes more informative as the frame size increases.
◉ This matches the large window sizes that were shown to be necessary for the reverberant scenario.
◉ An interesting finding is that using the magnitude with the frame shift increased from 16 to 128 samples
improves both signal-level metrics and the WER.
<Phase-Aware Deep Speech Enhancement: It's All About The Frame Length>
56. Summary
◉ When only the network of the separator is different, i.e., intra/inter-transformer layers vs BLSTM
layers, the performances are close.
◉ However, this modified SepFormer only shows a marginally better SDR and an improvement of 1
percentage point in the WER.
(Compared configurations: STFT + SDR loss + added Gaussian noise ; learned en/decoder with window 16 ; STFT with window 512.)
57. My Perspective
◉ What possibly causes the inconsistent performance on WHAMR! (state-of-the-art) and SMS-WSJ (only marginally
better)?
➜ In [1] and [2], the configurations of the two differ.
➜ If trained properly, the obvious benefit of having more parameters is that the model can
represent much more complicated functions than one with fewer parameters.
(If not, having more parameters increases the risk of overfitting.)
◉ What is the consensuses in separation, even under reverberant? (cont.)
58. My Perspective
                       Time-domain loss | Fine-grained resolution | Learned en/decoder
Anechoic
  SepFormer[2]           (16/8k = 2 ms)
  Conv-TasNet[3]         (0.5 ms) (improv. < 2 dB)
Reverberant
  SepFormer (small)[1]   (64 ms) (improv. < 1 dB)
  SepFormer[2]           (2 ms)
  Conv-TasNet[3]         (8 ms) (improv. < 2 dB)
(Legend: Agree / Disagree)
<Preference for 20-40 ms window duration in speech analysis>
Typical frame lengths in speech processing, around 32 ms, correspond to
an interval that is short enough to be considered quasi-stationary but
long enough to cover multiple fundamental periods of voiced speech,
whose fundamental period lies between 2 ms and 12.5 ms.
60. My Perspective
➜ There is consensus for the anechoic environment.
61. My Perspective
➜ Not a crucial factor, but FFC (Fast Fourier Convolution) may help.
<Fast Fourier Convolution>
62. My Perspective
➜ Perhaps a multi-resolution encoder is a solution!
<Multi-encoder multi-resolution framework for end-to-end speech recognition>
63. My Perspective
◉ Is SDR the right metric for obtaining a better WER?
<Speech Enhancement with LSTM Recurrent Neural Networks and its Application to Noise-Robust ASR>
67. Conclusions
◉ In [1][2], we have explored speech separation and enhancement based on SepFormer, an
attention-only masking network that is able to achieve state-of-the-art performance in anechoic
conditions.
◉ The experiments in [1] and [3] raise the issue of whether the improvements that have proven
useful for the separation of anechoic mixtures are futile for the more realistic case of
reverberant source separation.
◉ We also investigate the use of more efficient self-attention mechanisms such as Longformer,
Linformer and Reformer.
◉ Can we find a versatile model that can separate speech in both anechoic and reverberant
condition but with less parameters or even model size?