Speech Separation by Transformer:
SepFormer1
& DPTNet2
Presenter : 何冠勳 61047017s
Date : 2022/06/23
1: Cem Subakan, Mirco Ravanelli, et al., "Attention Is All You Need in Speech Separation," arXiv:2010.13154v2 [eess.AS], 8 Mar 2021
2: Jingjing Chen, Qirong Mao, Dong Liu, "Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation," arXiv:2007.13975v3 [eess.AS], 14 Aug 2020
Outline
2
1. Introduction
2. Architecture
3. Experiments
4. Results
5. Conclusions
Introduction
Why Transformer?
What is SepFormer & DPTNet?
1
3
Why Transformer?
◉ A Transformer model is a neural network that learns context and thus meaning by tracking
relationships in sequential data like the words in this sentence.
◉ Before the deep learning era, many traditional methods were introduced for this task, such as
non-negative matrix factorization (NMF), computational auditory scene analysis (CASA) and
probabilistic models. These only work in a closed-set speaker setting.
◉ Deep learning techniques for monaural speech separation can be divided into two categories:
time-frequency domain methods and end-to-end time-domain approaches.
For time-frequency domain methods, the phase problem is a non-trivial problem.
4
Why Transformer?
5
[Figures: non-negative matrix factorization (NMF); computational auditory scene analysis (CASA)]
Why Transformer?
◉ RNN-based models need to pass information through many intermediate states, while CNN-based
models suffer from limited receptive fields.
◉ The inherently sequential nature of RNNs impairs an effective parallelization of the computations.
This bottleneck is particularly evident when processing large datasets with long sequences.
◉ Fortunately, the Transformer, built on the self-attention mechanism, resolves this problem by
letting all inputs interact with each other directly (a minimal sketch follows below).
◉ Transformers completely avoid this bottleneck by eliminating recurrence and replacing it with a
fully attention-based mechanism.
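As a rough illustration (not code from either paper), a minimal scaled dot-product self-attention, with the Q/K/V projections omitted for brevity, shows how every time step interacts with every other one in a single matrix operation:

```python
import torch

def self_attention(x):
    """Minimal single-head scaled dot-product self-attention.

    x: (batch, time, dim). Every time step attends to every other time
    step directly, so no information flows through recurrent states.
    """
    d = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d ** 0.5   # (batch, time, time)
    weights = torch.softmax(scores, dim=-1)       # attention weights
    return weights @ x                            # (batch, time, dim)

# toy usage: 4 sequences, 100 frames, 64 features
out = self_attention(torch.randn(4, 100, 64))
```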
6
What is SepFormer?
◉ Current systems rely, in large part, on the learned-domain masking strategy popularized by
Conv-TasNet.
◉ Building on this, the dual-path mechanism (introduced in dual-path RNN, DPRNN) has demonstrated
that better long-term modeling is crucial for improving separation performance.
7
What is SepFormer?
◉ In this paper, a novel model called SepFormer is proposed, which is mainly composed of multi-head
attention and feed-forward layers.
◉ SepFormer adopts the dual-path framework and replaces RNNs with intra/inter-Transformers that learn
both short- and long-term dependencies. The dual-path framework helps mitigate the quadratic
complexity of Transformers, since each Transformer only processes smaller chunks (see the complexity
sketch below).
◉ SepFormer not only processes all the time steps in parallel but also achieves competitive performance
when downsampling the encoded representation. This makes the proposed architecture significantly
faster and less memory demanding than the latest RNN-based separation models.
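A back-of-envelope complexity sketch (my own, not an equation from the paper): splitting a sequence of length T into chunks of size C replaces one quadratic attention with many small ones.

```latex
% Full self-attention over T frames:
\mathcal{O}(T^2)
% Dual-path: T/C intra-chunk attentions of size C, plus C inter-chunk
% attentions over the T/C chunk positions:
\underbrace{\tfrac{T}{C}\cdot C^{2}}_{\text{intra}}
+ \underbrace{C\cdot\left(\tfrac{T}{C}\right)^{2}}_{\text{inter}}
= TC + \tfrac{T^{2}}{C},
\qquad \text{minimized at } C \approx \sqrt{T}
\;\Rightarrow\; \mathcal{O}\!\left(T^{3/2}\right).
```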
8
What is DPTNet?
◉ DPTNet is an end-to-end monaural speech separation model that introduces an improved Transformer to
allow direct context-aware modeling of speech sequences.
◉ This is the first work that introduces direct context-aware modeling into speech separation. This
method enables the elements of a speech sequence to interact directly, which is beneficial to
information transmission.
◉ To account for signal order, DPTNet integrates a recurrent neural network into the original
Transformer so that it can learn the order information of speech sequences without positional encodings.
9
“
Note!
Although DPTNet is shown to outperform the standard
DPRNN, such an architecture still embeds an RNN,
effectively negating the parallelization capability of
pure-attention models.
10
Architecture
- Architecture
- Separator
➔ SepFormer
➔ DPTNet
- Overview
2
11
Architecture
◉ Encoder module is used to transform short segments of the mixture waveform into their
corresponding representations in an intermediate feature space.
◉ This representation is then used to estimate a multiplicative function (separation mask) for
each source at each time step.
◉ The source waveforms are then reconstructed by transforming the masked encoder features
using a decoder module.
12
[Figure: the encoder is a conv1d layer and the decoder is a transpose-conv1d layer]
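A hedged PyTorch sketch of this encoder / masking / decoder pattern; the layer sizes and the trivial mask estimator are placeholders, not the papers' actual hyperparameters or masking networks:

```python
import torch
import torch.nn as nn

class MaskingSeparator(nn.Module):
    """Generic learned-domain masking skeleton: conv1d encoder,
    mask estimator, transpose-conv1d decoder."""

    def __init__(self, n_filters=256, kernel_size=16, stride=8, n_src=2):
        super().__init__()
        self.n_src = n_src
        self.encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)
        # stand-in for the dual-path masking network
        self.masker = nn.Sequential(
            nn.Conv1d(n_filters, n_filters * n_src, 1), nn.ReLU())
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride, bias=False)

    def forward(self, mix):                        # mix: (batch, 1, samples)
        h = torch.relu(self.encoder(mix))          # (batch, F, T')
        masks = self.masker(h).chunk(self.n_src, dim=1)
        # decode each masked representation back to the time domain
        return [self.decoder(m * h) for m in masks]

est_sources = MaskingSeparator()(torch.randn(4, 1, 16000))
```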
Dual-path mechanism
13
[Figure: dual-path mechanism; intra-chunk processing models dependencies within each chunk,
inter-chunk processing models dependencies across chunks]
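A minimal sketch (assuming 50% overlap between chunks, which is a typical choice for dual-path models rather than a value quoted from these slides) of how segmentation produces the chunked tensor that the intra- and inter-chunk stages operate on:

```python
import torch

def chunk(h, chunk_size=250):
    """Split (batch, feat, time) into overlapping chunks.

    Returns (batch, feat, chunk_size, n_chunks) with 50% overlap,
    zero-padding the end so no frames are dropped.
    """
    hop = chunk_size // 2
    pad = (hop - (h.size(-1) - chunk_size) % hop) % hop
    h = torch.nn.functional.pad(h, (0, pad))
    return h.unfold(2, chunk_size, hop).permute(0, 1, 3, 2)

h = torch.randn(4, 256, 1999)    # encoder output (batch, feat, time)
chunks = chunk(h)                # (4, 256, 250, n_chunks)
# intra-chunk modeling attends over dim=2 (within each chunk),
# inter-chunk modeling attends over dim=3 (across chunks).
```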
Separator - SepFormer
14
Separator - SepFormer
◉ For the input mixture, the encoder learns an STFT-like representation.
◉ The masking network is fed with this representation and estimates masks for each of the
speakers (the full pipeline is sketched in equations after step 5 below):
1. Linear + chunking (segmentation with overlap)
2. SepFormer block (intra-Transformer + inter-Transformer)
3. Linear + PReLU
15
Separator - SepFormer (cont.)
4. Overlap-add
5. FFW + PReLU
◉ The input to the decoder is the element-wise multiplication between the estimated masks and the
initial encoder representation, as sketched below.
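A hedged reconstruction of the equations that were rendered as images on these slides, following my reading of the SepFormer paper; the symbols are illustrative rather than quoted:

```latex
% Encoder (x: mixture waveform, h: STFT-like representation)
h = \mathrm{ReLU}\!\left(\mathrm{conv1d}(x)\right),
  \qquad x \in \mathbb{R}^{T},\; h \in \mathbb{R}^{F \times T'}

% Masking network, following steps 1-5 above
h_1 = \mathrm{chunk}\!\left(\mathrm{Linear}(h)\right)               % 1. linear + chunking
h_2 = \mathrm{SepFormer}(h_1)                                       % 2. dual-path blocks
h_3 = \mathrm{PReLU}\!\left(\mathrm{Linear}(h_2)\right)             % 3. linear + PReLU
h_4 = \mathrm{overlap\text{-}add}(h_3)                              % 4. overlap-add
[m_1,\dots,m_{N_s}] = \mathrm{PReLU}\!\left(\mathrm{FFW}(h_4)\right)% 5. FFW + PReLU

% Decoder: masked representation back to the time domain
\hat{s}_k = \mathrm{conv1d\text{-}transpose}\!\left(m_k \odot h\right),
  \qquad k = 1,\dots,N_s
```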
16
SepFormer block
◉ The Intra-Transformer processes the second dimension of the chunked representation, and thus acts
on each chunk independently, modeling the short-term dependencies within each chunk.
◉ Next, the last two dimensions are permuted, and the Inter-Transformer is applied to model the
transitions across chunks.
◉ The overall transformation is sketched below.
◉ Together, these operations constitute one SepFormer block, and N such blocks can be stacked.
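Sketching the overall transformation, reusing h_1 and h_2 from the sketch above and writing P for the permutation of the last two dimensions (notation assumed, not quoted from the paper):

```latex
h_2 \;=\; \mathcal{P}\!\left( f_{\mathrm{inter}}\!\left(
          \mathcal{P}\!\left( f_{\mathrm{intra}}(h_1) \right) \right) \right)
% f_intra: Intra-Transformer over the chunk dimension
% f_inter: Inter-Transformer over the chunk-index dimension
% P: permutation of the last two dimensions
```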
17
SepFormer block
◉ The Transformer used here closely resembles the original one.
◉ However, to avoid confusion, "Transformer" in this paper refers specifically to the encoder part,
which comprises three core modules: scaled dot-product attention, multi-head attention and a
position-wise feed-forward network.
◉ See "On Layer Normalization in the Transformer Architecture" (the Pre-LN variant is adopted).
18
SepFormer block
◉ Transformer procedure (Pre-LN setting), with the corresponding equations sketched below:
1. Positional encoding
2. Layer norm + multi-head attention
3. Layer norm + feed-forward + residuals
4. Repeat for the remaining blocks
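In equations, one Pre-LN Transformer layer applied to an input z looks as follows (standard formulation; the symbols are my own shorthand):

```latex
z_0 = z + e                                              % 1. add positional encoding e
z_1 = z_0 + \mathrm{MHA}\!\left(\mathrm{LN}(z_0)\right)  % 2. pre-LN multi-head attention + residual
z_2 = z_1 + \mathrm{FFW}\!\left(\mathrm{LN}(z_1)\right)  % 3. pre-LN feed-forward + residual
% 4. repeat the two residual sub-layers for each stacked layer
```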
19
Separator - DPTNet
20
Improved Transformer
◉ The original Transformer adds positional encodings, either sine-and-cosine functions or learned
parameters, to the input embeddings to represent order information.
◉ However, we find that positional encodings are not suitable for the dual-path network and usually
lead to model divergence during training.
◉ To learn the order information, we replace the first fully connected layer of the feed-forward
network with an RNN, as sketched below.
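A hedged PyTorch sketch of this idea; the dimensions and dropout are placeholders, not DPTNet's exact configuration:

```python
import torch
import torch.nn as nn

class RNNFeedForward(nn.Module):
    """Feed-forward sub-layer whose first linear map is replaced by a
    bidirectional LSTM, in the spirit of DPTNet's improved Transformer."""

    def __init__(self, d_model=64, d_hidden=128, dropout=0.1):
        super().__init__()
        self.rnn = nn.LSTM(d_model, d_hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * d_hidden, d_model)   # project back to the model dimension
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                 # x: (batch, time, d_model)
        h, _ = self.rnn(x)                # order-aware replacement for the first Linear
        return self.drop(self.out(torch.relu(h)))

y = RNNFeedForward()(torch.randn(4, 250, 64))   # same shape in, same shape out
```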
21
Overview
22
SepFormer vs. DPTNet:
◉ Encoder: Conv1d (both)
◉ Decoder: Transpose-conv1d (both)
◉ Main separator: Transformer (encoder part) (both)
◉ Idea: SepFormer applies pre-layer normalization and retains positional encoding; DPTNet uses an
RNN in the FFW and eliminates positional encoding.
Experiments
- Dataset
- Training objective
- Configurations
3
23
Dataset
◉ SepFormer is evaluated on WSJ0mix (WSJ0-2mix and WSJ0-3mix).
◉ DPTNet is evaluated on WSJ0-2mix and Libri2mix.
◉ All waveforms are sampled at 8kHz.
‼ On the Asteroid and SpeechBrain toolkits, there are pre-trained models trained on WHAM! or
WHAMR! that achieve fair performance, regardless of whether the task is separation or
enhancement.
24
Training objectives
◉ The objective of training the end-to-end system is to maximize the improvement of the
scale-invariant source-to-noise ratio (SI-SNR, defined below), replacing the standard
source-to-distortion ratio (SDR).
◉ Utterance-level permutation invariant training (uPIT) is applied during training to address the
source permutation problem.
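For reference, SI-SNR between an estimate and a target is commonly defined as below, and uPIT picks, per utterance, the speaker permutation that maximizes it:

```latex
s_{\mathrm{target}} = \frac{\langle \hat{s}, s \rangle\, s}{\lVert s \rVert^{2}}, \qquad
e_{\mathrm{noise}} = \hat{s} - s_{\mathrm{target}}, \qquad
\mathrm{SI\text{-}SNR}(\hat{s}, s) =
  10 \log_{10} \frac{\lVert s_{\mathrm{target}} \rVert^{2}}{\lVert e_{\mathrm{noise}} \rVert^{2}}

% uPIT training loss over the N_s sources:
\mathcal{L} = -\max_{\pi \in \Pi}\;
  \frac{1}{N_s} \sum_{k=1}^{N_s} \mathrm{SI\text{-}SNR}\!\left(\hat{s}_{\pi(k)}, s_k\right)
```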
25
Configurations - SepFormer
◉ Numbers of
○ Encoder basis 256 ; Kernel size 16 ; Chunk size 250 .
○ Transformers 8 ; SepFormer blocks 2 ; Attention heads 8 ; Dimension 1024 .
◉ We explored dynamic mixing (DM) data augmentation, which consists of the on-the-fly creation of
new mixtures from single-speaker sources, along with speed perturbation in the range [95%, 105%].
◉ The training process also applies learning-rate halving, gradient clipping and mixed precision.
◉ Mixed-precision training is a technique for substantially reducing neural-network training time by
performing as many operations as possible in half-precision floating point (fp16) instead of
single-precision floating point (fp32); a generic example follows.
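As an illustration only, a generic PyTorch automatic-mixed-precision training step (not the authors' training script) shows how fp16 computation is combined with fp32 parameter updates:

```python
import torch

scaler = torch.cuda.amp.GradScaler()          # keeps fp16 gradients from underflowing

def train_step(model, optimizer, mixture, targets, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():           # forward pass runs in fp16 where safe
        est = model(mixture)
        loss = loss_fn(est, targets)
    scaler.scale(loss).backward()             # backward on the scaled loss
    scaler.step(optimizer)                    # unscale gradients, update in fp32
    scaler.update()
    return loss.item()
```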
26
Configurations - DPTNet
◉ Numbers of
○ Encoder basis 64 ; Kernel size 2 ; Segmentation size not specified .
○ Transformers 6 ; DPT blocks 6 ; Attention heads 4 ; Dimension not specified .
◉ The training process also applies early stopping and gradient clipping.
◉ The learning-rate schedule combines a linear warm-up with exponential decay, as sketched below.
◉ The learning rate is increased linearly for the first warm-up training steps, and then decayed by a
factor of 0.98 every two epochs.
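The general shape of the schedule is sketched below; the constants k1, k2 and warmup_n are placeholders for the values elided on the slide:

```latex
lr(n) =
\begin{cases}
  k_1 \cdot d_{\mathrm{model}}^{-0.5} \cdot n \cdot \mathrm{warmup}_n^{-1.5},
    & n \le \mathrm{warmup}_n \quad \text{(linear warm-up)} \\[4pt]
  k_2 \cdot 0.98^{\lfloor \mathrm{epoch}/2 \rfloor},
    & n > \mathrm{warmup}_n \quad \text{(decay by 0.98 every two epochs)}
\end{cases}
```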
27
Results
- SepFormer
- DPTNet
- Comparison
- Inference experience
- Overview
4
28
SepFormer
◉ When using dynamic mixing, SepFormer
achieves state-of-the-art performance.
◉ SepFormer outperforms previous
systems without using dynamic mixing
except Wavesplit, which uses speaker
identity as additional information.
29
SepFormer
◉ SepFormer obtains state-of-the-art performance
with an SI-SNRi of 19.5 dB and an SDRi of 19.7 dB.
◉ Our results on WSJ0mix show that it is possible to
achieve state-of-the-art performance in separation
with an RNN-free Transformer-based model.
◉ The big advantage of SepFormer over RNN-based
systems is the possibility to parallelize the computations
over different time steps.
30
SepFormer
◉ A respectable performance of 19.2 dB is obtained
even when we use a single layer Transformer for
the Inter-Transformer. This suggests that the
Intra-Transformer, and thus local processing, has
a greater influence on the performance.
◉ It also emerges that positional encoding is helpful
(e.g., see the green line). A similar outcome has been
observed in T-GSA for speech enhancement.
◉ Finally, it can be observed that dynamic mixing
helps the performance drastically.
31
DPTNet
32
◉ Compared to those in the WSJ0-2mix corpus, the mixtures in Libri2mix are more difficult to
separate.
◉ The results show that direct context-aware modeling is still significantly superior to the RNN
method. This demonstrates the generalization ability of the Transformer and further confirms its
effectiveness.
Comparison - time & space
◉ The far-left graph compares the training speed of SepFormer, DPRNN and DPTNet.
◉ The middle and right graphs compare the average inference time and total memory allocation of
Wavesplit, SepFormer, DPRNN and DPTNet.
33
Comparison - time & space
◉ From this analysis, it emerges that SepFormer is not only faster but also less memory demanding
than DPTNet, DPRNN, and Wavesplit.
◉ Such a level of computational efficiency is achieved even though the proposed SepFormer
employs more parameters than others.
◉ This is not only due to the superior parallelization capabilities of the Transformer, but also
because the best performance is achieved with a stride factor of 8 samples, against a stride of 1
for DPRNN and DPTNet.
◉ The window size defines the memory footprint and the hop size defines the minimum latency of the
system.
34
Inference experience
◉ SepFormer is hard to train and converges slowly due to its large number of parameters,
although the original papers claim the opposite.
◉ Based on Sam Tsao's training experience, RNNs are more effective at memorizing context
than linear networks.
◉ To compensate, a larger input size is often used for the linear layers, which also explains where
the massive parameter count comes from.
35
Overview
◉ In terms of performance, we can see that, without dynamic mixing, DPTNet is competitive
with SepFormer.
◉ Consequently, we can argue that the reason Wavesplit+DM attains the second-best
performance is the same reason SepFormer+DM attains the best: the design of DM makes the
separation models more robust.
◉ However, the inherent drawback of embedding an RNN is undeniable.
36
Conclusions
5
37
Conclusion
◉ Both networks learn short- and long-term dependencies using a multi-scale approach and
obtain fair performance. The dual-path Transformer networks are able to model different levels of
the speech sequence while directly conditioning on context.
◉ DPTNet learns the order information of speech sequences through an RNN rather than positional
encodings, while SepFormer is an RNN-free architecture based on the pre-LN Transformer.
◉ Two facts stand out:
1) with a pure Transformer, the computations over different time steps can be parallelized, and
2) SepFormer achieves competitive performance even when subsampling the encoded
representation by a factor of 8.
Together, these properties lead to a significant speed-up at training/inference time and a significant
reduction in memory usage.
38
Separation results on Delta dataset:
AOI / Inter-training / Elderly
using SuDoRM-RF
Presenter : 何冠勳 61047017s
Date : 2022/06/23
Results
40
[Audio demos: separated Spk1 and Spk2 plus the original mixture, for the AOI, Inter-training and
Elderly sets]
Any questions?
You can find me at
◉ jasonho610@gmail.com
◉ NTNU-SMIL
Thanks!
41
