Compute and memory efficient
universal sound source separation
Authors: Efthymios Tzinis · Zhepei Wang · Xilin Jiang · Paris Smaragdis
Published in 2020 IEEE 30th International Workshop on MLSP, and
arXiv:2103.02644v2 [cs.SD] 14 Jul 2021
Presenter: 何冠勳
1
Outline
– Introduction
– Architecture
– Experiment
– Result & Discussion
– Conclusions
2
Introduction
Why & What ?
3
Introduction
– There are three fields in audio separation: speech separation, universal sound
separation and music source separation.
– Notable previous works:
– Conv-TasNet
– DPRNN
– DPTNet
– Demucs
– Two step sound source separation
– Despite the dramatic advances in source separation performance, the
computational complexity of the aforementioned methods might hinder their
extensive usage across multiple devices.
– Additionally, training such systems is also an expensive computational undertaking
which can amount to significant costs.
4
Introduction
– Several studies, mainly in the image domain, have introduced more efficient
architectures in order to overcome the growing concern of large models with high
computational requirements.
– E.g. depth-wise separable convolutions, dilated convolutions, meta-learning
– Despite the recent success of low-resource architectures in the image domain, little
progress has been made towards proposing efficient architectures for audio tasks
and especially source separation.
– Modern approaches, mainly in speech enhancement and music source separation,
have been focusing on developing models which are capable of real-time inference.
5
Introduction
– In this study, we propose a novel neural network architecture for audio source
separation, following a more holistic view of the computational resources taken into
consideration (FLOPs, latency and total memory requirements).
– We name the model SuDoRM-RF: SUccessive DOwnsampling and Resampling of Multi-
Resolution Features.
– We also propose improved versions and causal variations.
– The separation performance is comparable to, or even better than, several recent
SOTA models on speech, environmental and universal sound separation tasks, with
significantly lower computational requirements.
6
Architecture
- Encoder
- Separator
- U-ConvBlock
- Decoder
- Variations
7
Architecture
– On par with many SOTA approaches, SuDoRM-RF performs end-to-end audio
source separation using a mask-based architecture with adaptive encoder and
decoder basis.
– We have extended our basic model in order to also remove the mask estimation
process by introducing SuDoRM-RF++ that directly estimates the latent
representations of the sources.
– Besides the original model, we also propose
– an improved version, SuDoRM-RF++
– SuDoRM-RF++ with group communication
– a causal variation, C-SuDoRM-RF++
9
Architecture – Encoder
– The encoder E consists of a one-dimensional convolution with kernel size K_E and
stride K_E/2, similar to Conv-TasNet.
– We force the output of the encoder to be strictly non-negative by applying a
rectified linear unit (ReLU) activation on top of the output of the 1D convolution.
(channel, kernel, stride)
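A minimal PyTorch sketch of such an encoder (module and argument names are illustrative, not the authors' code; only the conv-plus-ReLU structure and the kernel/stride relation come from the slide):

```python
import torch.nn as nn

class Encoder(nn.Module):
    """1D convolutional front-end with stride K_E/2, forced non-negative via ReLU."""
    def __init__(self, basis_channels=512, kernel_size=21):
        super().__init__()
        self.conv = nn.Conv1d(1, basis_channels, kernel_size=kernel_size,
                              stride=kernel_size // 2, padding=kernel_size // 2)
        self.relu = nn.ReLU()

    def forward(self, mixture):                  # mixture: (batch, 1, time)
        return self.relu(self.conv(mixture))     # (batch, basis_channels, frames)
```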
10
Architecture – Separator
– In essence, the separator S module performs the following five transformations
(steps 1–5, detailed on the next slides) to the encoded mixture representation.
11
Architecture – Separator
1. Projects the encoded mixture representation to a new channel space through a
layer normalization followed by a point-wise convolution.
2. Performs repetitive non-linear transformations provided by B U-convolutional
blocks (U-ConvBlocks). The output of the i-th U-ConvBlock is used as input for the
(i + 1)-th block.
3. Aggregates the information over multiple channels by applying a regular one-
dimensional convolution for each source on the transposed feature representation.
12
Architecture – Separator
3. (cont.) This step was introduced in Two-step sound source separation and
empirically shown to make the training process more stable.
4. Combines the aforementioned latent codes for all sources by performing a
softmax operation in order to get mask estimates, where mask coefficients all lie
in [0,1].
5. Estimates a latent representation for each source by multiplying element-wise
the encoded mixture representation with the corresponding mask.
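Putting steps 1–5 together, a simplified PyTorch sketch of the separator's forward pass might look as follows (the U-ConvBlocks are stubbed with identity layers here and are sketched on the U-ConvBlock slide; channel counts and layer choices are assumptions, not the authors' exact implementation):

```python
import torch
import torch.nn as nn

class Separator(nn.Module):
    """Steps 1-5: project, B U-ConvBlocks, per-source aggregation, softmax masks, multiply."""
    def __init__(self, enc_channels=512, mid_channels=128, num_blocks=16, num_sources=2):
        super().__init__()
        self.num_sources = num_sources
        # step 1: normalization followed by a point-wise (1x1) convolution
        self.norm = nn.GroupNorm(1, enc_channels)
        self.bottleneck = nn.Conv1d(enc_channels, mid_channels, kernel_size=1)
        # step 2: B successive U-ConvBlocks (identity stand-ins in this sketch)
        self.blocks = nn.ModuleList(nn.Identity() for _ in range(num_blocks))
        # step 3: aggregate channels into one latent code per source
        self.mask_conv = nn.Conv1d(mid_channels, num_sources * enc_channels, kernel_size=1)

    def forward(self, v):                          # v: (batch, enc_channels, frames)
        y = self.bottleneck(self.norm(v))          # step 1
        for block in self.blocks:                  # step 2
            y = block(y)
        m = self.mask_conv(y)                      # step 3 (simplified aggregation)
        m = m.view(m.shape[0], self.num_sources, -1, m.shape[-1])
        m = torch.softmax(m, dim=1)                # step 4: masks in [0, 1] across sources
        return m * v.unsqueeze(1)                  # step 5: per-source latent representations
```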
13
Architecture – U-ConvBlock
– U-ConvBlock uses a block structure that resembles a depth-wise separable
convolution but, unlike Conv-TasNet, it does not produce separate skip-connection and
residual outputs.
– It extracts information from multiple resolutions using Q successive temporal
downsampling and Q upsampling operations, resembling a U-Net architecture.
– We postulate that this resampling procedure, which extracts features at multiple
scales, combined with the efficient increase of the effective receptive field, accounts
for the model's strong performance at low computational cost.
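A hedged PyTorch sketch of one U-ConvBlock with Q resampling stages (layer ordering, normalization placement and the final residual addition are simplifications of the published block, not an exact reproduction):

```python
import torch.nn as nn
import torch.nn.functional as F

class UConvBlock(nn.Module):
    """Expand channels, Q temporal downsamplings, Q upsamplings, project back."""
    def __init__(self, in_channels=128, expanded_channels=512, q=4, kernel=5):
        super().__init__()
        self.expand = nn.Conv1d(in_channels, expanded_channels, kernel_size=1)
        # Q depth-wise convolutions, each halving the temporal resolution
        # (the point-wise half of the separable convolution is omitted for brevity)
        self.down = nn.ModuleList(
            nn.Conv1d(expanded_channels, expanded_channels, kernel_size=kernel,
                      stride=2, padding=kernel // 2, groups=expanded_channels)
            for _ in range(q))
        self.project = nn.Conv1d(expanded_channels, in_channels, kernel_size=1)

    def forward(self, x):                          # x: (batch, in_channels, frames)
        y = self.expand(x)
        scales = []
        for down in self.down:                     # successive downsampling
            scales.append(y)
            y = down(y)
        for scale in reversed(scales):             # successive upsampling + merging
            y = F.interpolate(y, size=scale.shape[-1], mode='nearest') + scale
        return self.project(y) + x                 # single output path back to in_channels
```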
14
Architecture – U-ConvBlock
15
16
Architecture – Decoder
– Our decoder module is the final step, transforming the latent representation of
each source back to the time domain.
– Each latent source representation is fed through a different transposed-convolution
decoder.
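A minimal sketch of one such transposed-convolution decoder, mirroring the encoder settings above (values are illustrative). In the original SuDoRM-RF one decoder is instantiated per source; SuDoRM-RF++ later shares a single one:

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Transposed 1D convolution mapping a latent source representation back to a waveform."""
    def __init__(self, basis_channels=512, kernel_size=21):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(basis_channels, 1, kernel_size=kernel_size,
                                         stride=kernel_size // 2, padding=kernel_size // 2)

    def forward(self, latent):                     # latent: (batch, basis_channels, frames)
        return self.deconv(latent)                 # (batch, 1, time)
```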
17
Architecture – Variations 1
– In the improved version of the proposed architecture, namely SuDoRM-RF++, the
model directly estimates the latent representation for each target signal and uses
only one decoder module.
18
Architecture – Variations 1
– Differences from the original version:
– Replace the mask estimation and element-wise multiplication process with a direct
estimation of the latent target signals.
– Use only one trainable decoder.
– Replace the layer normalization layers with global layer normalization.
– Simplify the activation layers by using PReLU activations with only one shared
learnable parameter.
– We would like to underline that original SuDoRM-RF models could potentially outperform
the improved SuDoRM-RF++ variation in cases where the direct estimation of the latent
targets would be more difficult.
– Moreover, the alternative of having two decoders, as proposed in the initial version,
might be more useful when one wants to solve an audio source separation problem
with two distinct classes of sounds (e.g. speech enhancement), where each decoder
can be fine-tuned to its own class.
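The first two differences can be illustrated with a small self-contained tensor example (shapes and the stand-in tensors are purely illustrative):

```python
import torch

batch, num_sources, channels, frames = 1, 2, 512, 100
v = torch.relu(torch.randn(batch, channels, frames))       # encoded mixture (stand-in)
y = torch.randn(batch, num_sources * channels, frames)     # separator output (stand-in)

# Original SuDoRM-RF: softmax masks across sources, applied element-wise to the
# mixture code, with one decoder per source afterwards.
masks = torch.softmax(y.view(batch, num_sources, channels, frames), dim=1)
masked_latents = masks * v.unsqueeze(1)                     # (batch, sources, channels, frames)

# SuDoRM-RF++: the separator output is treated directly as the per-source latent
# targets, all fed through one shared decoder.
direct_latents = y.view(batch, num_sources, channels, frames)
```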
19
Architecture – Variations 2
– We also propose a new variation of our model, namely, SuDoRM-RF++ GC, where
we combine group communication.
– In the proposed architecture, the intermediate representations are processed in
groups of channel sub-bands.
– We divide the channels of each 1 × 1 convolutional block into 16 groups and we
process them first independently by sharing the parameters across all groups of
sub-bands.
– Afterwards, we apply a self-attention module to combine them.
– The resulting architecture leads to a significant reduction in the number of
trainable parameters.
arXiv:2011.08397v3 [eess.AS] 20 Nov 2020 : ULTRA-LIGHTWEIGHT SPEECH SEPARATION VIA GROUP COMMUNICATION
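A rough PyTorch sketch of the group-communication idea, shared per-group processing followed by cross-group self-attention (this only illustrates the mechanism; the attention design in the cited Group Communication paper and in SuDoRM-RF++ GC differs in its details):

```python
import torch.nn as nn

class GroupComm1x1(nn.Module):
    """1x1 conv applied to each channel group with shared weights, then attention over groups."""
    def __init__(self, channels=128, num_groups=16, num_heads=1):
        super().__init__()
        assert channels % num_groups == 0
        self.num_groups = num_groups
        group_size = channels // num_groups
        self.shared_conv = nn.Conv1d(group_size, group_size, 1)   # shared across all groups
        self.attn = nn.MultiheadAttention(group_size, num_heads, batch_first=True)

    def forward(self, x):                              # x: (batch, channels, frames)
        b, c, t = x.shape
        g = x.reshape(b * self.num_groups, c // self.num_groups, t)
        g = self.shared_conv(g)                        # identical 1x1 conv per group
        g = g.reshape(b, self.num_groups, -1, t)
        # let the groups exchange information through self-attention, per time frame
        q = g.permute(0, 3, 1, 2).reshape(b * t, self.num_groups, -1)
        q, _ = self.attn(q, q, q)
        g = q.reshape(b, t, self.num_groups, -1).permute(0, 2, 3, 1)
        return g.reshape(b, c, t)
```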
21
Architecture – Variations 3
– Our latest extension of the proposed
model, named C-SuDoRM-RF++, is
designed to run online and enable
streamable extensions for real-time
applications.
– Differences from SuDoRM-RF++:
– Replace all non-causal convolutions
with causal counterparts.
– Remove all the normalization layers.
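Making a convolution causal amounts to padding only on the left, so no output frame depends on future samples; a minimal sketch (C-SuDoRM-RF++ additionally drops the normalization layers, as listed above):

```python
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D convolution padded only on the past side, so outputs never see future samples."""
    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size, dilation=dilation)

    def forward(self, x):                               # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.left_pad, 0)))  # pad the past only, length preserved
```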
22
Experiment
- Datasets
- Data preprocessing
- Loss function
- Evaluation metrics
- Model configurations
23
Experiment – Datasets
– Speech: WSJ0-2mix (2 active speakers)
– Speaker mixtures are generated by randomly mixing utterances from two active
WSJ0 speakers at a random SNR between −5 and 5 dB.
– Non-speech: ESC50 (2 active sources)
– ESC50 consists of a wide variety of sounds (non-speech human sounds, animal
sounds, natural soundscapes, interior sounds and urban noises). For each data
sample, two audio sources are mixed with a random SNR between −2.5 and
2.5dB where each source belongs to a distinct sound category from a total of
50.
– Universal sound separation: FUSS (1~4 active sources)
– We also evaluate our models under a purely universal sound separation setup
where we do not know how many sources are active in each input mixture.
24
Experiment – Preprocessing
– For fixed number of sources:
1. randomly choosing two sound classes or speakers
2. randomly cropping 4 s segments from the two source audio files
3. mixing the source segments with a random SNR
4. generating 20,000 new training mixtures for each epoch
5. 3,000 mixtures each for the validation and test sets
6. down-sampling each audio clip to 8 kHz
– For variable number of sources:
1. using the same dataset splits as the ones provided (10 s, 16 kHz)
2. generating 20,000 new training mixtures for each epoch
3. 5,000 and 3,000 mixtures for the validation and test sets, respectively
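A hedged sketch of the fixed-number-of-sources mixing step described above (file loading, class bookkeeping and resampling are left out; the 4 s segment length, 8 kHz rate and SNR range come from this slide, and the helper names are made up for illustration):

```python
import numpy as np

def crop_random(signal, seg_len):
    """Take a random contiguous segment of seg_len samples (zero-pad if the clip is shorter)."""
    if len(signal) < seg_len:
        signal = np.pad(signal, (0, seg_len - len(signal)))
    start = np.random.randint(0, len(signal) - seg_len + 1)
    return signal[start:start + seg_len]

def make_mixture(source_a, source_b, sr=8000, seconds=4, snr_range=(-5.0, 5.0)):
    """Crop both sources, rescale the second one to a random SNR, and sum them."""
    seg_len = sr * seconds
    s1, s2 = crop_random(source_a, seg_len), crop_random(source_b, seg_len)
    snr_db = np.random.uniform(*snr_range)
    # scale s2 so that 10*log10(power(s1) / power(s2)) equals the sampled SNR
    scale = np.sqrt(np.sum(s1 ** 2) / (np.sum(s2 ** 2) * 10 ** (snr_db / 10) + 1e-9))
    s2 = s2 * scale
    return s1 + s2, np.stack([s1, s2])              # mixture and the rescaled targets
```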
25
Experiment – Loss
– For fixed number of sources:
– For variable number of sources:
– N denotes the maximum number of sources and N′ the number of active sources.
– The loss function is defined following the FUSS baseline.
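The loss equations were figures in the original slides; as a hedged reconstruction from the surrounding text (the fixed-source case is the usual negative permutation-invariant SI-SDR, and the variable-source case follows the FUSS baseline by averaging an SNR-type objective over the N′ active sources), they take roughly this form:

```latex
% Fixed number of sources: negative permutation-invariant SI-SDR over all N sources
\mathcal{L}_{\text{fixed}}
  = -\max_{\pi \in \mathcal{P}_N} \frac{1}{N} \sum_{i=1}^{N}
    \operatorname{SI\text{-}SDR}\!\left(s_i, \hat{s}_{\pi(i)}\right)

% Variable number of sources (FUSS-style): only the N' active sources contribute,
% while the permutation is still searched over all N estimated signals
\mathcal{L}_{\text{var}}
  = -\max_{\pi \in \mathcal{P}_N} \frac{1}{N'} \sum_{i=1}^{N'}
    \operatorname{SNR}\!\left(s_i, \hat{s}_{\pi(i)}\right)
```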
26
Experiment – Evaluation
– In order to evaluate the performance of our models we use a stable version of the
permutation invariant SI-SDRi.
– For the case where inactive sources exist, i.e. N′ < N, we omit computing the
aforementioned metric on them, as it yields infinite values.
– The * sign denotes the best permutation over the set of all possible permutations;
ε = 1e−9 and τ = 1e−3 resolve numerical stability issues created by zero target
signals.
arXiv:2011.00803v1 [cs.SD] 2 Nov 2020: WHAT’S ALL THE FUSS ABOUT FREE UNIVERSAL SOUND SEPARATION DATA?
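A hedged PyTorch sketch of a stabilized SI-SDR along these lines (ε = 1e−9 and τ = 1e−3 are the values from the slide; the exact placement of the stabilizers in the paper and the FUSS reference may differ slightly):

```python
import torch

def stabilized_si_sdr(target, estimate, eps=1e-9, tau=1e-3):
    """SI-SDR guarded against zero (inactive) target signals."""
    # optimal scaling of the target towards the estimate
    alpha = (estimate * target).sum(-1, keepdim=True) / (target.pow(2).sum(-1, keepdim=True) + eps)
    scaled_target = alpha * target
    error = estimate - scaled_target
    # tau * ||target||^2 bounds the metric when the error vanishes,
    # and eps keeps the ratio finite when the target itself is all zeros
    ratio = scaled_target.pow(2).sum(-1) / (error.pow(2).sum(-1) + tau * target.pow(2).sum(-1) + eps)
    return 10 * torch.log10(ratio + eps)
```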
28
Experiment – Model Conf.
– For the encoder/decoder we use a kernel size of 21 for input mixtures sampled at
8 kHz and 41 for 16 kHz. The number of bases (channels) is 512.
– For the configuration of each U-ConvBlock we set the input number of channels to
128, the number of successive resampling operations to 4, and the expanded number
of channels to 512. In each subsampling operation we reduce the temporal dimension
by a factor of 2, and all depth-wise separable convolutions have a kernel length of 5
and a stride of 2.
– For C-SuDoRM-RF++, we increase the number of input channels to 256 and the
default kernel length to 11 in order to increase the receptive field.
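Gathered in one place, the values from this slide look roughly like the following configuration (the key names are illustrative, not the authors' argument names):

```python
# Hyperparameters transcribed from the slide.
SUDORMRF_CONFIG = {
    "encoder_kernel": 21,           # 41 for mixtures sampled at 16 kHz
    "encoder_stride": 21 // 2,      # always kernel // 2
    "encoder_channels": 512,        # number of learned bases (channels)
    "block_in_channels": 128,       # U-ConvBlock input channels
    "block_expanded_channels": 512,
    "resampling_depth": 4,          # Q successive down/upsampling operations
    "resampling_kernel": 5,         # depth-wise separable convs, stride 2
}

# C-SuDoRM-RF++ overrides for a larger receptive field (as stated on this slide)
CAUSAL_OVERRIDES = {
    "block_in_channels": 256,
    "default_kernel": 11,           # "default kernel length" from the slide; which convs
                                    # this covers is not specified here
}
```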
29
Experiment – Model Conf.
– For brevity, SuDoRM-RF 2.0x, SuDoRM-RF 1.0x, SuDoRM-RF 0.5x and SuDoRM-RF
0.25x denote configurations with 32, 16, 8 and 4 U-ConvBlocks, respectively.
– The same applies to the improved version SuDoRM-RF++ and its causal variation C-
SuDoRM-RF++.
30
Experiment – Comparison
– One of the main goals of this study is to propose models for audio source
separation which could be trained using limited computational resources and
deployed easily on a mobile or edge-computing device.
– We consider the following aspects:
1. Number of executed floating point operations (FLOPs).
2. Number of trainable parameters.
3. Memory allocation required on the device for a single pass.
4. Time for completing each process.
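Trainable parameters and wall-clock time per pass can be measured with plain PyTorch as sketched below (FLOP counting and peak-memory measurement need a profiler and are omitted here; the function names are made up):

```python
import time
import torch

def count_parameters(model):
    """Number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def time_forward(model, input_shape=(1, 1, 8000 * 4), repeats=10):
    """Average wall-clock time of a CPU forward pass on a 4 s, 8 kHz input."""
    model.eval()
    x = torch.randn(*input_shape)
    with torch.no_grad():
        model(x)                                  # warm-up
        start = time.time()
        for _ in range(repeats):
            model(x)
    return (time.time() - start) / repeats
```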
31
Result & Discussion
- Two active sources
- FLOPs
- Training cost
- Memory
- Variable active sources
- Causal capability
32
Result – Two sources
– In Table 1, we focus on the computational aspects required during a forward pass
of those models on a CPU.
– In Table 2, the same computational resource requirements are shown for a
backward pass on GPU as well as the number of trainable parameters.
36
Result – FLOPs
– One computational resource of primary interest is the number of FLOPs required
during inference.
– In Figure 4, SuDoRM-RF models scale well as we increase the number of U-
ConvBlocks B from 4 → 8 → 16.
– SuDoRM-RF++ achieves similar or even better results than the original version of
the proposed model SuDoRM-RF with a lower number of FLOPs both in forward
and backward for a similar number of parameters and execution time.
– A significant drop in the absolute number of FLOPs is also obtained by combining
the group communication mechanism with SuDoRM-RF++.
37
Result – Training cost
– Usually one of the most
detrimental factors for training
deep learning models is the
requirement of allocating
multiple GPU devices for
several days or weeks until an
adequate performance is
obtained on the validation set.
– SuDoRM-RF 1.0x has a faster
convergence.
38
Result – Memory
– Group communication combined with SuDoRM-RF++ is one of the most effective
ways to reduce the number of trainable parameters caused by the bottleneck
dense layers between the U-ConvBlocks.
– However, the trainable parameters comprise only a small portion of the total
amount of memory required for a single forward or backward pass. The space
complexity can easily be dominated by the storage of intermediate representations
rather than by the parameters' memory footprint.
– It could become even worse when multiple skip connections are present, gradients
from multiple layers have to be stored or implementations require augmented
matrices (dilated, transposed convolutions, etc.).
39
Result – WSJ0-2mix
– We perform an ablation study of SuDoRM-RF++ on the WSJ0-2mix dataset.
– The stride of the encoder and decoder is always defined as kernel//2. By decreasing
the stride, we force the model to perform more computations and estimate the signal
at a finer scale, closer to the time-domain resolution, which leads to better results.
– GLN significantly helps our model to reach a better
solution.
– Using more U-ConvBlocks yields better results.
40
Result – FUSS
– We see that by increasing our SuDoRM-RF and SuDoRM-RF++ model sizes to match
the size of TDCNN++, we can match its performance.
– For the single-source mixtures we see that our models perform worse than
TDCNN++; however, above 25 dB it is difficult even for a human listener to perceive
the subtle artifacts, which are barely audible.
41
Result – Causal setup
– We see that we are able to obtain competitive separation performance for all
configurations while remaining ≈ 10 to 20 times faster than real time.
– However, the performance still needs some improvement.
42
Conclusions
What has been done?
43
Conclusions
– Our experiments suggest that SuDoRM-RF models
a) can be deployed on devices with limited resources,
b) can be trained significantly faster while achieving good separation performance, and
c) scale well when increasing the number of parameters.
– We show that these models can perform similarly to or even better than recent
state-of-the-art models while requiring significantly fewer computational resources in
terms of FLOPs, memory and time, for experiments with both a fixed and a variable
number of sources.
THANK YOU
Any questions?
You can find me at
◉ jasonho610@gmail.com ◉ NTNU-SMIL