SepFormer and DPTNet are Transformer-based models for monaural speech separation that achieve state-of-the-art performance. SepFormer uses dual-path Transformers to model short and long-term dependencies without RNNs, allowing parallel processing. DPTNet introduces an improved Transformer with a recurrent layer to directly model contextual information in speech sequences. Experiments on standard datasets show SepFormer achieves SOTA results and is faster to train and infer than RNN baselines like DPRNN. Both models obtain competitive separation but SepFormer has advantages in parallelization and efficiency due to its RNN-free design.
Speech Separation Models: SepFormer and DPTNet
1. Speech Separation by Transformer:
SepFormer1
& DPTNet2
Presenter : 何冠勳 61047017s
Date : 2022/06/23
1: Cem Subakan, Mirco Ravanelli, Samu Cornell, Mirko Bronzi, Jianyuan Zhong, Published in arXiv:2010.13154v2 [eess.AS] 8 Mar 2021
2: Jingjing Chen, Qirong Mao, Dong Liu, Published in arXiv:2007.13975v3 [eess.AS] 14 Aug 2020
4. Why Transformer?
◉ A Transformer model is a neural network that learns context and thus meaning by tracking
relationships in sequential data like the words in this sentence.
◉ Before the deep learning era, many traditional methods were introduced for this task, such as
non-negative matrix factorization (NMF), computational auditory scene analysis (CASA) and
probabilistic models. These only work with closed-set speakers.
◉ Deep learning techniques for monaural speech separation can be divided into two categories:
time-frequency domain methods and end-to-end time-domain approaches.
In time-frequency domain methods, the phase reconstruction problem is non-trivial.
6. Why Transformer?
◉ RNN based models need to pass information through many intermediate states, while the CNN
based models suffer from limited receptive fields.
◉ The inherently sequential nature of RNNs impairs an effective parallelization of the computations.
This bottleneck is particularly evident when processing large datasets with long sequences.
◉ Fortunately, the Transformer, based on the self-attention mechanism, resolves this problem by
letting every position interact directly with the inputs.
◉ Transformers completely avoid this bottleneck by eliminating recurrence and replacing it with a
fully attention-based mechanism.
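To make the contrast concrete, here is a minimal single-head scaled dot-product self-attention sketch in NumPy (no learned projections or masking, which real Transformers add): every time step attends to every other in a single matrix product, with no sequential state to carry forward.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention (single head, no projections).

    All pairwise interactions are computed in one (T, T) matrix product,
    so no information has to be passed through intermediate states.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # (T, T) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ X                               # context-mixed representation

T, d = 5, 8
X = np.random.randn(T, d)
Y = self_attention(X)
assert Y.shape == (T, d)
```

Note that the (T, T) score matrix is also the source of the quadratic complexity that the dual-path framework later mitigates.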
7. What is SepFormer?
◉ Current systems rely, in large part, on the learned-domain masking strategy popularized by
Conv-TasNet.
◉ Building on this, the dual-path mechanism (introduced in dual-path RNN) has demonstrated that
better long-term modeling is crucial to improving separation performance.
8. What is SepFormer?
◉ In this paper, a novel model called SepFormer is proposed, which is mainly composed of multi-head
attention and feed-forward layers.
◉ SepFormer adopts the dual-path framework and replaces RNNs with intra-/inter-Transformers that learn
both short- and long-term dependencies. The dual-path framework mitigates the quadratic
complexity of Transformers, since each Transformer in the dual-path framework processes smaller chunks.
◉ SepFormer not only processes all the time steps in parallel but also achieves competitive performance
when downsampling the encoded representation. This makes the proposed architecture significantly
faster and less memory demanding than the latest RNN-based separation models.
9. What is DPTNet?
◉ DPTNet is an end-to-end monaural speech separation model, which introduces an improved Transformer to
allow direct context-aware modeling of the speech sequences.
◉ This is the first work that introduces direct context-aware modeling into speech separation. The
method lets the elements in speech sequences interact directly, which benefits
information transmission.
◉ To account for the order of the signal, DPTNet integrates a recurrent neural network into the original
Transformer so that it can learn the order information of the speech sequences without positional encodings.
10. Note!
Although DPTNet is shown to outperform the standard
DPRNN, such an architecture still embeds an RNN,
effectively negating the parallelization capability of
pure-attention models.
12. Architecture
◉ Encoder module is used to transform short segments of the mixture waveform into their
corresponding representations in an intermediate feature space.
◉ This representation is then used to estimate a multiplicative function (separation mask) for
each source at each time step.
◉ The source waveforms are then reconstructed by transforming the masked encoder features
using a decoder module.
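A toy sketch of this encoder → mask → decoder data flow, with a random basis standing in for the learned conv1d filters (the shapes and variable names here are illustrative, not the papers'):

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 16, 32                    # frame (kernel) size, number of basis filters
basis = rng.standard_normal((N, L))

def encode(x):
    # Non-overlapping frames for simplicity (the real models use a stride of L/2).
    frames = x.reshape(-1, L)            # (T', L)
    return frames @ basis.T              # (T', N) latent representation

def decode(h):
    # Mirror of the encoder, standing in for the transpose-conv1d.
    return (h @ basis).reshape(-1)       # back to waveform samples

x = rng.standard_normal(L * 10)          # toy mixture waveform
h = encode(x)                            # encoder output
mask = 1 / (1 + np.exp(-rng.standard_normal(h.shape)))  # sigmoid mask in [0, 1]
s_hat = decode(mask * h)                 # masked features -> estimated source
assert s_hat.shape == x.shape
```

In the real models one mask is estimated per speaker, and the learned basis makes the masked reconstruction meaningful; here the point is only the shape bookkeeping.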
(Figure: encoder = conv1d, decoder = transpose-conv1d)
15. Separator - SepFormer
◉ For a mixture x ∈ ℝ^T, the encoder learns an STFT-like representation: h = ReLU(conv1d(x)) ∈ ℝ^{F×T'}
◉ The masking network is fed by h and estimates masks m_1, …, m_{N_s} for each of the N_s
speakers.
1. Linear + Chunking (segmentation with 50% overlap) : h → h' ∈ ℝ^{F×C×N_c}
2. SepFormer block (intra-Transformer + inter-Transformer) : h'' = SepFormer(h')
3. Linear + PReLU : h''' = PReLU(Linear(h''))
16. Separator - SepFormer (cont.)
4. Overlap-add : the chunks are recombined into h'''' ∈ ℝ^{F×T'}
5. FFW + PReLU : yields the masks m_1, …, m_{N_s}
◉ The input to the decoder is the element-wise multiplication between each mask and the initial
representation: ŝ_k = decoder(m_k ⊙ h)
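The chunking and overlap-add bookkeeping can be sketched as follows (a simplified NumPy version with 50% overlap; the chunk size and shapes are illustrative):

```python
import numpy as np

def chunk(h, C):
    """Split a (T, F) sequence into 50%-overlapping chunks of length C."""
    hop = C // 2
    T = h.shape[0]
    n = (T - C) // hop + 1
    return np.stack([h[i * hop: i * hop + C] for i in range(n)])  # (n, C, F)

def overlap_add(chunks, C):
    """Inverse of chunking: sum overlapping chunks back onto the time axis."""
    hop = C // 2
    n, _, F = chunks.shape
    out = np.zeros(((n - 1) * hop + C, F))
    for i, c in enumerate(chunks):
        out[i * hop: i * hop + C] += c
    return out

h = np.arange(16, dtype=float).reshape(8, 2)   # toy sequence: T=8, F=2
ch = chunk(h, C=4)                             # (3, 4, 2): 3 chunks of length 4
rec = overlap_add(ch, C=4)
assert rec.shape == h.shape                    # time length is recovered
```

Interior samples are summed twice by the 50% overlap; in practice this is compensated by the synthesis window or absorbed by the learned layers.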
17. SepFormer block
◉ The Intra-Transformer processes the second dimension of h', and thus acts on each chunk
independently, modeling the short-term dependencies within each chunk.
◉ Next, we permute the last two dimensions (denoted as P), and the Inter-Transformer is
applied to model the transitions across chunks.
◉ Overall transformation : h'' = f_inter(P(f_intra(h')))
◉ The above constitutes a SepFormer block, and N such blocks can be stacked.
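The permutation bookkeeping of one such block can be sketched with placeholder transforms (identity functions stand in for the learned intra-/inter-Transformers, which are assumptions of this sketch):

```python
import numpy as np

def dual_path_block(h, f_intra, f_inter):
    """h: (n_chunks, C, F). Intra acts within each chunk, inter across chunks."""
    h = np.stack([f_intra(c) for c in h])   # short-term: one (C, F) chunk at a time
    h = h.transpose(1, 0, 2)                # P: swap chunk and within-chunk axes
    h = np.stack([f_inter(c) for c in h])   # long-term: one (n_chunks, F) slice at a time
    return h.transpose(1, 0, 2)             # permute back to (n_chunks, C, F)

def identity(x):
    return x

h = np.random.randn(3, 4, 2)
out = dual_path_block(h, identity, identity)
assert out.shape == h.shape                 # permutations cancel out
```

With identity transforms the block is a no-op, which verifies that the two permutations are consistent; the real block applies a Transformer stack at each of the two stages.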
18. SepFormer block
◉ The Transformer used here closely resembles the
original one.
◉ However, to avoid confusion, "Transformer" in this
paper refers specifically to the encoder part, which is
comprised of three core modules: scaled dot-product
attention, multi-head attention and a position-wise
feed-forward network.
◉ The pre-layer-normalization variant follows "On Layer
Normalization in the Transformer Architecture".
21. Improved Transformer
◉ The original Transformer adds positional encodings
to the input embeddings to represent order
information, using sine-and-cosine functions
or learned parameters.
◉ However, we find that the positional encodings
are not suitable for the dual-path network and
usually lead to model divergence in the training
process.
◉ To learn the order information, we replace the
first fully connected layer of the feed-forward
network with an RNN.
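A minimal NumPy sketch of this idea, with random matrices standing in for learned weights (all names and sizes here are illustrative): the first linear layer of the feed-forward block is replaced by a simple Elman-style recurrence, making the block order-aware without positional encodings.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hdim = 8, 16

# Hypothetical weights standing in for learned parameters.
W_in  = rng.standard_normal((d, hdim)) * 0.1     # input -> hidden (RNN in-weights)
W_rec = rng.standard_normal((hdim, hdim)) * 0.1  # hidden -> hidden (recurrence)
W_out = rng.standard_normal((hdim, d)) * 0.1     # second feed-forward layer, unchanged

def ffw_with_rnn(x):
    """Feed-forward block whose first linear layer is replaced by an RNN,
    so the block sees the sequence order directly."""
    h = np.zeros(hdim)
    out = []
    for t in range(x.shape[0]):                  # order-dependent recurrence
        h = np.tanh(x[t] @ W_in + h @ W_rec)
        out.append(np.maximum(h, 0) @ W_out)     # ReLU, then second linear layer
    return np.stack(out)

x = rng.standard_normal((5, d))
y = ffw_with_rnn(x)
assert y.shape == x.shape
# Reversing the input changes the output: the block is order-aware,
# unlike a plain position-wise feed-forward network.
assert not np.allclose(ffw_with_rnn(x[::-1])[::-1], y)
```

The real DPTNet uses a learned RNN here inside each improved-Transformer layer; the recurrence is exactly what reintroduces the sequential bottleneck noted above.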
22. Overview
                 SepFormer                          DPTNet
Encoder          Conv1d                             Conv1d
Decoder          Transpose-conv1d                   Transpose-conv1d
Main separator   Transformer (encoder part)         Transformer (encoder part)
Idea             Pre-layer normalization applied;   RNN used in FFW;
                 positional encoding retained       positional encoding eliminated
24. Dataset
◉ SepFormer is evaluated on WSJ0-2mix and WSJ0-3mix.
◉ DPTNet is evaluated on WSJ0-2mix and Libri2mix.
◉ All waveforms are sampled at 8 kHz.
‼ On the Asteroid and SpeechBrain toolkits, there are pre-trained models trained on WHAM! or
WHAMR! that achieve fair performance, regardless of whether the task is separation or
enhancement.
25. Training objectives
◉ The objective of training the end-to-end system is maximizing the improvement of the
scale-invariant signal-to-noise ratio (SI-SNR), replacing the standard signal-to-distortion
ratio (SDR).
◉ Utterance-level permutation invariant training (uPIT) is applied during training to address the
source permutation problem.
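A compact sketch of SI-SNR and utterance-level PIT (brute force over speaker permutations, which is feasible for 2-3 speakers; the function names are mine, not from the papers):

```python
import numpy as np
from itertools import permutations

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB, on zero-mean signals."""
    est, ref = est - est.mean(), ref - ref.mean()
    proj = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref   # target component
    noise = est - proj                                          # residual
    return 10 * np.log10(np.dot(proj, proj) / (np.dot(noise, noise) + eps))

def upit_si_snr(ests, refs):
    """uPIT: score the best assignment of estimates to references."""
    best = -np.inf
    for perm in permutations(range(len(refs))):
        score = np.mean([si_snr(ests[i], refs[p]) for i, p in enumerate(perm)])
        best = max(best, score)
    return best

t = np.linspace(0, 1, 8000)
s1, s2 = np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 220 * t)
# Swapped, rescaled estimates: uPIT finds the correct pairing, and the
# rescaling does not hurt because SI-SNR is scale-invariant.
assert upit_si_snr([0.5 * s2, 2.0 * s1], [s1, s2]) > 50
```

During training the negative of this quantity is minimized as the loss.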
26. Configurations - SepFormer
◉ Hyperparameters:
○ Encoder basis 256 ; kernel size 16 ; chunk size 250 .
○ Transformer layers 8 ; SepFormer blocks 2 ; attention heads 8 ; FFW dimension 1024 .
◉ We explored dynamic mixing data augmentation, which consists of the on-the-fly creation of
new mixtures from single-speaker sources, along with speed perturbation in [95%, 105%].
◉ The training process also applies learning-rate halving, gradient clipping and mixed precision.
◉ Mixed-precision training is a technique for substantially reducing neural-net training time by
performing as many operations as possible in half-precision floating point (fp16) instead of
single-precision floating point (fp32).
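The fp16/fp32 trade-off can be seen directly in NumPy: half precision halves the memory per value but also loses small increments, which is why mixed-precision training keeps numerically sensitive operations in fp32.

```python
import numpy as np

# fp16 halves the memory per value ...
a32 = np.ones(1024, dtype=np.float32)
a16 = a32.astype(np.float16)
assert a16.nbytes == a32.nbytes // 2

# ... at the cost of precision: adding 4e-4 to 1.0 is lost in fp16
# (its spacing around 1.0 is about 1e-3), but not in fp32.
assert np.float32(1) + np.float32(4e-4) != np.float32(1)
assert np.float16(1) + np.float16(4e-4) == np.float16(1)
```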
27. Configurations - DPTNet
◉ Hyperparameters:
○ Encoder basis 64 ; kernel size 2 ; segmentation size not specified .
○ Transformer layers 6 ; DPT blocks 6 ; attention heads 4 ; dimension not specified .
◉ The training process also applies early stopping and gradient clipping.
◉ The learning rate schedule is set as
lr(step) = k · d_model^(-0.5) · step · warmup_n^(-1.5) during warm-up,
where k is a scaling constant, d_model is the model dimension, and warmup_n is the number of warm-up steps.
◉ We increase the learning rate linearly for the first warmup_n training steps, and then decay it by 0.98
for every two epochs.
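The schedule above can be sketched as a small function; k, d_model and warmup_n are passed as parameters since the slide does not restate their values, and the values used in the assertions below are illustrative only.

```python
def dptnet_lr(step, epoch, k, d_model, warmup_n):
    """Linear warm-up for warmup_n steps, then 0.98 decay every two epochs.

    A sketch of the schedule described on the slide; the constants are
    left as parameters rather than guessed.
    """
    if step <= warmup_n:
        return k * d_model ** -0.5 * step * warmup_n ** -1.5
    peak = k * d_model ** -0.5 * warmup_n ** -0.5   # value reached at end of warm-up
    return peak * 0.98 ** (epoch // 2)              # decay every two epochs

# The rate grows linearly during warm-up ...
assert dptnet_lr(100, 0, k=0.2, d_model=64, warmup_n=4000) < \
       dptnet_lr(200, 0, k=0.2, d_model=64, warmup_n=4000)
# ... and decays afterwards as epochs accumulate.
assert dptnet_lr(5000, 10, k=0.2, d_model=64, warmup_n=4000) > \
       dptnet_lr(5000, 12, k=0.2, d_model=64, warmup_n=4000)
```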
29. SepFormer
◉ When using dynamic mixing, SepFormer
achieves state-of-the-art performance.
◉ SepFormer outperforms previous
systems without using dynamic mixing
except Wavesplit, which uses speaker
identity as additional information.
30. SepFormer
◉ SepFormer obtains the state-of-the-art performance
with an SI-SNRi of 19.5 dB and an SDRi of 19.7 dB.
◉ Our results on WSJ0mix show that it is possible to
achieve state-of-the-art performance in separation
with an RNN-free Transformer-based model.
◉ The big advantage of SepFormer over RNN-based
systems is the possibility to parallelize the computations
over different time steps.
31. SepFormer
◉ A respectable performance of 19.2 dB is obtained
even when we use a single layer Transformer for
the Inter-Transformer. This suggests that the
Intra-Transformer, and thus local processing, has
a greater influence on the performance.
◉ It also emerges that positional encoding is helpful
(e.g., see the green line). A similar outcome has been
observed in T-GSA for speech enhancement.
◉ Finally, it can be observed that dynamic mixing
helps the performance drastically.
32. DPTNet
◉ Compared to those in the WSJ0-2mix corpus, the mixtures in Libri2mix are more difficult to
separate.
◉ The results show that direct context-aware modeling is still significantly superior to the RNN
method. This demonstrates the generalization ability of the Transformer and further confirms its
effectiveness.
33. Comparison - time & space
◉ The leftmost graph compares the training speed of SepFormer, DPRNN and DPTNet.
◉ The middle and right graphs compare the average inference time and total memory allocation of
Wavesplit, SepFormer, DPRNN and DPTNet.
34. Comparison - time & space
◉ From this analysis, it emerges that SepFormer is not only faster but also less memory demanding
than DPTNet, DPRNN, and Wavesplit.
◉ Such a level of computational efficiency is achieved even though the proposed SepFormer
employs more parameters than others.
◉ This is not only due to the superior parallelization capabilities of the Transformer, but also
because the best performance is achieved with a stride factor of 8 samples, against a stride of 1
for DPRNN and DPTNet.
◉ Window size defines the memory footprint, and hop size defines the minimum latency of the
system.
35. Inference experience
◉ SepFormer is hard to train and converges slowly due to its large number of parameters,
although the original papers claimed the opposite.
◉ Based on Sam Tsao's training experience, an RNN is more effective at memorizing context
than a linear network.
◉ To cope with that, we often use a bigger input size for the linear layers, which also explains
where the massive parameter count comes from.
36. Overview
◉ In terms of performance, we can see that without dynamic mixing DPTNet is competitive
with SepFormer.
◉ Consequently, we can argue that the reason Wavesplit+DM gains the second-best
performance is the same as the reason SepFormer+DM gains the best: dynamic mixing makes
the separation models more robust.
◉ However, the drawback embedded in the RNN is undeniable.
38. Conclusion
◉ Both networks learn short- and long-term dependencies using a multi-scale approach and
obtain fair performance. The dual-path Transformer network can model different levels of
the speech sequence while conditioning directly on context.
◉ DPTNet learns the order information in speech sequences with an RNN rather than positional
encodings, while SepFormer is an RNN-free architecture based on a pre-layer-normalization Transformer.
◉ Given that
1) with a pure Transformer, the computations over different time steps can be parallelized, and
2) SepFormer achieves competitive performance even when subsampling the encoded
representation by a factor of 8,
these two properties lead to a significant speed-up at training/inference time and a significant
reduction of memory usage.
39. Separation results on Delta dataset:
AOI / Inter-training / Elderly
using SuDoRM-RF
Presenter : 何冠勳 61047017s
Date : 2022/06/23