Speech Separation by Transformer:
SepFormer1
& DPTNet2
Presenter : 何冠勳 61047017s
Date : 2022/06/23
1: Cem Subakan, Mirco Ravanelli, Samu Cornell, Mirko Bronzi, Jianyuan Zhong, Published in arXiv:2010.13154v2 [eess.AS] 8 Mar 2021
2: Jingjing Chen, Qirong Mao, Dong Liu, Published in arXiv:2007.13975v3 [eess.AS] 14 Aug 2020
Outline
2
1 Introduction
2 Architecture
3 Experiments
4 Results
5 Conclusions
Introduction
Why Transformer?
What is SepFormer & DPTNet?
1
3
Why Transformer?
◉ A Transformer model is a neural network that learns context, and thus meaning, by tracking
relationships in sequential data, such as the words in this sentence.
◉ Before the deep learning era, many traditional methods were introduced for this task, such as
non-negative matrix factorization (NMF), computational auditory scene analysis (CASA) and
probabilistic models. These only work for closed-set speakers.
◉ Deep learning techniques for monaural speech separation can be divided into two categories:
time-frequency domain methods and end-to-end time-domain approaches.
For the former, reconstructing the phase is a non-trivial problem.
4
Why Transformer?
5
Non-negative Matrix Factorization
Computational Auditory Scene Analysis
Why Transformer?
◉ RNN-based models need to pass information through many intermediate states, while CNN-based
models suffer from limited receptive fields.
◉ The inherently sequential nature of RNNs impairs effective parallelization of the computations.
This bottleneck is particularly evident when processing large datasets with long sequences.
◉ Fortunately, the Transformer, based on the self-attention mechanism, can resolve this problem by
letting every input element attend directly to every other.
◉ Transformers completely avoid this bottleneck by eliminating recurrence and replacing it with a
fully attention-based mechanism.
6
What is SepFormer?
◉ Current systems rely, in large part, on the learned-domain masking strategy popularized by
Conv-TasNet.
◉ Building on this, the dual-path mechanism (introduced in dual-path RNN) has demonstrated that
better long-term modeling is crucial to improving separation performance.
7
What is SepFormer?
◉ In this paper, a novel model called SepFormer is proposed, which is mainly composed of multi-head
attention and feed-forward layers.
◉ SepFormer adopts the dual-path framework and replaces RNNs with intra/inter-Transformers that learn
both short- and long-term dependencies. The dual-path framework helps mitigate the quadratic
complexity of Transformers, as the Transformers process smaller chunks.
◉ SepFormer not only processes all the time steps in parallel but also achieves competitive performance
when downsampling the encoded representation. This makes the proposed architecture significantly
faster and less memory demanding than the latest RNN-based separation models.
8
What is DPTNet?
◉ DPTNet is an end-to-end monaural speech separation model which introduces an improved Transformer
to allow direct context-aware modeling of speech sequences.
◉ This is the first work to introduce direct context-aware modeling into speech separation. The
method enables the elements of a speech sequence to interact directly, which benefits
information transmission.
◉ To account for the order of the signal, DPTNet integrates a recurrent neural network into the original
Transformer so that it can learn the order information of speech sequences without positional encodings.
9
“
Note!
Although DPTNet is shown to outperform the standard
DPRNN, such an architecture still embeds an RNN,
effectively negating the parallelization capability of
pure-attention models.
10
Architecture
- Architecture
- Separator
➔ SepFormer
➔ DPTNet
- Overview
2
11
Architecture
◉ The encoder module transforms short segments of the mixture waveform into their
corresponding representations in an intermediate feature space.
◉ This representation is then used to estimate a multiplicative function (separation mask) for
each source at each time step.
◉ The source waveforms are then reconstructed by transforming the masked encoder features
with a decoder module (a minimal sketch of this pipeline follows below).
12
conv1d transpose-conv1d
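To make the pipeline concrete, here is a minimal PyTorch-style sketch of the encoder/masker/decoder structure described above. The module sizes and the trivial 1x1-conv masker are illustrative assumptions; the actual separator is the dual-path network described next.

```python
import torch
import torch.nn as nn

class EncoderMaskerDecoder(nn.Module):
    """Sketch of the masking pipeline: encode -> estimate masks -> mask -> decode."""
    def __init__(self, n_filters=256, kernel_size=16, n_src=2):
        super().__init__()
        stride = kernel_size // 2
        self.encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)
        # Placeholder masker; in SepFormer/DPTNet this is the dual-path separator.
        self.masker = nn.Conv1d(n_filters, n_filters * n_src, kernel_size=1)
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride, bias=False)
        self.n_src = n_src

    def forward(self, mixture):                      # mixture: (batch, 1, time)
        h = torch.relu(self.encoder(mixture))        # STFT-like representation
        masks = torch.relu(self.masker(h))           # one mask per source
        masks = masks.view(h.size(0), self.n_src, h.size(1), -1)
        # Element-wise multiplication of the masks with the encoded mixture, then decode.
        return torch.stack([self.decoder(h * masks[:, k]) for k in range(self.n_src)], dim=1)
```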
Dual-path mechanism
13
Process intra-chunk
dependencies
Process inter-chunk
dependencies
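The chunking step of the dual-path mechanism can be sketched as follows (PyTorch; the 50% overlap and the SepFormer chunk size of 250 are assumed defaults):

```python
import torch
import torch.nn.functional as F

def chunk(x, chunk_size=250, hop=125):
    """Split (batch, channels, frames) into overlapping chunks:
    returns (batch, channels, chunk_size, n_chunks)."""
    pad = (hop - (x.size(-1) - chunk_size) % hop) % hop   # pad so the chunks tile the sequence
    x = F.pad(x, (0, pad))
    return x.unfold(-1, chunk_size, hop).permute(0, 1, 3, 2)
```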
Separator - SepFormer
14
Separator - SepFormer
◉ For mixture , the encoder learns an STFT-like representation:
◉ The masking network is fed by and estimates masks for each of the
speakers.
1. Linear + Chunking (segmentation + overlap) :
2. SepFormer block (intra-Transformer + inter-Transformer) :
3. Linear + PReLU :
15
Separator - SepFormer (cont.)
4. Overlap-add :
5. FFW + PReLU :
◉ The input to the decoder is the element-wise multiplication between masks and initial
representation:
16
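The overlap-add step in point 4 inverts the chunking shown earlier; a simple sketch under the same 50%-overlap assumption:

```python
import torch

def overlap_add(chunks, hop=125):
    """Fold (batch, channels, chunk_size, n_chunks) back to (batch, channels, frames)
    by summing the overlapping chunks at their original positions."""
    b, c, chunk_size, n_chunks = chunks.shape
    out = chunks.new_zeros(b, c, (n_chunks - 1) * hop + chunk_size)
    for i in range(n_chunks):
        out[:, :, i * hop : i * hop + chunk_size] += chunks[:, :, :, i]
    return out
```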
SepFormer block
◉ The Intra-Transformer processes the second dimension of , and thus acts on each chunk
independently, modeling the short-term dependencies within each chunk.
◉ Next, we permute the last two dimensions (denoted as ), and the Inter-Transformer is
applied to model the transitions across chunks.
◉ Overall transformation :
◉ The above constitutes one SepFormer block, and N such blocks can be stacked (see the sketch below).
17
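A sketch of one SepFormer block in PyTorch. nn.TransformerEncoder stands in for the paper's Pre-LN Transformer with positional encoding (detailed on the next slide); the dimensions follow the reported configuration but are otherwise assumptions.

```python
import torch.nn as nn

class DualPathBlock(nn.Module):
    """One dual-path block: intra-Transformer over the chunk (short-term) dimension,
    inter-Transformer over the chunk-index (long-term) dimension."""
    def __init__(self, d_model=256, nhead=8, num_layers=8):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=1024,
                                                   norm_first=True, batch_first=True)
        self.intra = nn.TransformerEncoder(layer(), num_layers)
        self.inter = nn.TransformerEncoder(layer(), num_layers)

    def forward(self, x):                               # x: (batch, d_model, chunk, n_chunks)
        b, d, k, s = x.shape
        h = x.permute(0, 3, 2, 1).reshape(b * s, k, d)  # attend within each chunk
        h = self.intra(h).view(b, s, k, d)
        h = h.permute(0, 2, 1, 3).reshape(b * k, s, d)  # attend across chunks
        h = self.inter(h).view(b, k, s, d)
        return h.permute(0, 3, 1, 2)                    # back to (batch, d_model, chunk, n_chunks)
```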
SepFormer block
◉ The Transformer used here closely resembles the
original one.
◉ However, to avoid confusion, "Transformer" in this
paper refers specifically to the encoder part, which is
comprised of three core modules: scaled dot-product
attention, multi-head attention and a position-wise
feed-forward network.
◉ See also: "On Layer Normalization in the Transformer
Architecture".
18
SepFormer block
◉ Transformer procedure (Pre-LN setting):
1. Positional encoding :
2. Layer norm + Multi-head attention :
3. Layer norm + Feed forward + Residuals :
4. Repeat blocks
19
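The Pre-LN procedure above, written out as a single layer (a sketch; the sinusoidal positional encoding is assumed to be added to the input before the first layer, and the sizes are illustrative):

```python
import torch.nn as nn

class PreLNTransformerLayer(nn.Module):
    """Pre-LN layer: LayerNorm before multi-head attention and before the
    feed-forward network, each followed by a residual connection."""
    def __init__(self, d_model=256, nhead=8, d_ff=1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):                                    # x: (batch, seq, d_model)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]    # LN -> MHA -> residual
        x = x + self.ffn(self.norm2(x))                      # LN -> FFN -> residual
        return x
```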
Separator - DPTNet
20
Improved Transformer
◉ The original Transformer adds positional encodings
to the input embeddings to represent order
information, using either sine-and-cosine functions
or learned parameters.
◉ However, the authors find that positional encodings
are not suitable for the dual-path network and
usually lead to model divergence during training.
◉ To learn the order information, they replace the
first fully connected layer of the feed-forward
network with an RNN (sketched below).
21
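A sketch of DPTNet's modified feed-forward sub-layer: the first fully connected layer is replaced with a (bidirectional) RNN so the model can capture order information without positional encodings. The hidden sizes are illustrative assumptions.

```python
import torch.nn as nn

class RNNFeedForward(nn.Module):
    """Feed-forward network whose first Linear layer is replaced by an RNN."""
    def __init__(self, d_model=64, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(d_model, hidden, batch_first=True, bidirectional=True)
        self.linear = nn.Linear(2 * hidden, d_model)

    def forward(self, x):                 # x: (batch, seq, d_model)
        h, _ = self.rnn(x)                # the RNN replaces the first Linear + activation
        return self.linear(h)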
Overview
22
                  SepFormer                          DPTNet
Encoder           Conv1d                             Conv1d
Decoder           Transpose-conv1d                   Transpose-conv1d
Main separator    Transformer (encoder part)         Transformer (encoder part)
Idea              Pre-layer normalization applied;   RNN used in the FFW;
                  positional encoding retained       positional encoding eliminated
Experiments
- Dataset
- Training objective
- Configurations
3
23
Dataset
◉ SepFormer is evaluated on WSJ0mix.
◉ DPTNet is evaluated on WSJ0-2mix and Libri2mix.
◉ All waveforms are sampled at 8 kHz.
‼ In the Asteroid and SpeechBrain toolkits, there are pre-trained models trained on WHAM! or
WHAMR! that achieve fair performance, regardless of whether the task is separation or
enhancement.
24
Training objectives
◉ The training objective of the end-to-end system is to maximize the improvement of the
scale-invariant source-to-noise ratio (SI-SNR), which replaces the standard source-to-distortion
ratio (SDR). A loss sketch follows below.
◉ Utterance-level permutation invariant training (uPIT) is applied during training to address the
source permutation problem.
25
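A sketch of the SI-SNR metric and a two-speaker uPIT loss built on it (PyTorch; the eps constant and the brute-force permutation search are implementation assumptions):

```python
import torch

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB for (..., time) tensors."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to get the scaled target.
    s_target = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = est - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)

def upit_si_snr_loss(est, ref):
    """Utterance-level PIT for two sources: est, ref are (batch, 2, time)."""
    scores = torch.stack([si_snr(est, ref).mean(-1),                  # identity permutation
                          si_snr(est.flip(1), ref).mean(-1)], dim=1)  # swapped permutation
    best, _ = scores.max(dim=1)          # keep the best assignment per utterance
    return -best.mean()                  # maximize SI-SNR -> minimize its negative
```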
Configurations - SepFormer
◉ Hyperparameters:
○ Encoder basis functions 256 ; kernel size 16 ; chunk size 250 .
○ Transformers 8 ; SepFormer blocks 2 ; attention heads 8 ; dimension 1024 .
◉ The authors explored dynamic mixing data augmentation, which consists of creating new mixtures
on the fly from single-speaker sources, along with speed perturbation in [95%, 105%].
◉ The training process also applies learning rate halving, gradient clipping and mixed precision.
◉ Mixed-precision training is a technique for substantially reducing neural network training time by
performing as many operations as possible in half-precision floating point (fp16) instead of
single-precision floating point (fp32); a minimal sketch follows below.
26
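As an illustration, a minimal PyTorch automatic-mixed-precision training step with gradient clipping; model, optimizer and train_loader are hypothetical names, and the clipping threshold is an assumption, not the paper's setting:

```python
import torch

scaler = torch.cuda.amp.GradScaler()
for mixture, sources in train_loader:              # hypothetical DataLoader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                # run the forward pass in fp16 where safe
        est_sources = model(mixture)
        loss = upit_si_snr_loss(est_sources, sources)
    scaler.scale(loss).backward()                  # scale the loss to avoid fp16 underflow
    scaler.unscale_(optimizer)                     # unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    scaler.step(optimizer)
    scaler.update()
```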
Configurations - DPTNet
◉ Hyperparameters:
○ Encoder basis functions 64 ; kernel size 2 ; segmentation size not specified .
○ Transformers 6 ; DPT blocks 6 ; attention heads 4 ; dimension not specified .
◉ The training process also applies early stopping and gradient clipping.
◉ The learning rate schedule is set as
, where , , .
◉ The learning rate is increased linearly for the first warm-up training steps, and then decayed by 0.98
every two epochs (sketched below).
27
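A sketch of the described schedule (the base learning rate and the number of warm-up steps are assumptions, not the paper's exact constants):

```python
def dptnet_lr(step, epoch, base_lr=1e-3, warmup_steps=4000):
    """Linear warm-up over the first warmup_steps, then a 0.98 decay every two epochs."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * 0.98 ** (epoch // 2)
```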
Results
- SepFormer
- DPTNet
- Comparison
- Inference experience
- Overview
4
28
SepFormer
◉ When using dynamic mixing, SepFormer
achieves state-of-the-art performance.
◉ Without dynamic mixing, SepFormer outperforms
previous systems except Wavesplit, which uses speaker
identity as additional information.
29
SepFormer
◉ SepFormer obtains the state-of-the-art performance
with an SI-SNRi of 19.5 dB and an SDRi of 19.7 dB.
◉ Our results on WSJ0mix show that it is possible to
achieve state-of-the-art performance in separation
with an RNN-free Transformer-based model.
◉ The big advantage of SepFormer over RNN-based
systems is the possibility to parallelize the computations
over different time steps.
30
SepFormer
◉ A respectable performance of 19.2 dB is obtained
even when we use a single-layer Transformer for
the Inter-Transformer. This suggests that the
Intra-Transformer, and thus local processing, has
a greater influence on performance.
◉ It also emerges that positional encoding is helpful
(e.g. see the green line). A similar outcome has been
observed in T-GSA for speech enhancement.
◉ Finally, it can be observed that dynamic mixing
improves performance drastically.
31
DPTNet
32
◉ Compared to those in the WSJ0-2mix corpus, the mixtures in Libri2mix are more difficult to
separate.
◉ The results show that direct context-aware modeling remains significantly superior to the RNN
method. This demonstrates the generalization ability of the Transformer and further confirms its
effectiveness.
Comparison - time & space
◉ The far-left graph compares the training speed of SepFormer, DPRNN and DPTNet.
◉ The middle and right graphs compare the average inference time and total memory allocation of
Wavesplit, SepFormer, DPRNN and DPTNet.
33
Comparison - time & space
◉ From this analysis, it emerges that SepFormer is not only faster but also less memory demanding
than DPTNet, DPRNN, and Wavesplit.
◉ This level of computational efficiency is achieved even though SepFormer
employs more parameters than the others.
◉ This is due not only to the superior parallelization capabilities of the Transformer, but also to the
fact that the best performance is achieved with a stride factor of 8 samples, against a stride of 1
for DPRNN and DPTNet.
◉ The window size defines the memory footprint, and the hop size defines the minimum latency of the
system.
34
Inference experience
◉ SepFormer is hard to train and converges slowly due to its large number of parameters,
although the original paper claims the opposite.
◉ Based on Sam Tsao's training experience, an RNN is more effective at memorizing context
than a linear network.
◉ To cope with that, we often use a larger input size for the linear layers, which also explains where the
massive number of parameters comes from.
35
Overview
◉ In terms of performance, we can see that without dynamic mixing DPTNet is competitive
with SepFormer.
◉ Consequently, we can argue that the reason Wavesplit+DM achieves the second-best
performance is the same as the reason SepFormer+DM achieves the best; namely, dynamic
mixing makes the separation models more robust.
◉ However, the drawback embedded in the RNN is undeniable.
36
Conclusions
5
37
Conclusion
◉ Both networks learn short- and long-term dependencies using a multi-scale approach and
achieve fair performance. The dual-path Transformer network is able to model different levels of
the speech sequence while conditioning directly on context.
◉ DPTNet learns the order information of speech sequences with an RNN instead of positional
encodings, while SepFormer is an RNN-free architecture based on pre-layer-normalized Transformers.
◉ Given that
1) with a pure Transformer, the computations over different time steps can be parallelized, and
2) SepFormer achieves competitive performance even when subsampling the encoded
representation by a factor of 8,
these two properties lead to a significant speed-up at training/inference time and a significant
reduction in memory usage.
38
Separation results on Delta dataset:
AOI / Inter-training / Elderly
using SuDoRM-RF
Presenter : 何冠勳 61047017s
Date : 2022/06/23
Results
40
Spk1 Spk2 Original
AOI
Inter-training
Elderly
Any questions ?
You can find me at
◉ jasonho610@gmail.com
◉ NTNU-SMIL
Thanks!
41