Compute and memory efficient
universal sound source separation
Authors: Efthymios Tzinis · Zhepei Wang · Xilin Jiang · Paris Smaragdis
Published in 2020 IEEE 30th International Workshop on MLSP, and
arXiv:2103.02644v2 [cs.SD] 14 Jul 2021
Presenter: 何冠勳
1
Outline
– Introduction
– Architecture
– Experiment
– Result & Discussion
– Conclusions
2
Introduction
Why & What ?
3
Introduction
– There are three fields in audio separation: speech separation, universal sound
separation and music source separation.
– Notable previous works include:
– Conv-TasNet
– DPRNN
– DPTNet
– Demucs
– Two step sound source separation
– Despite the dramatic advances in source separation performance, the
computational complexity of the aforementioned methods might hinder their
extensive usage across multiple devices.
– Additionally, training such systems is also an expensive computational undertaking
which can amount to significant costs.
4
Introduction
– Several studies, mainly in the image domain, have introduced more efficient
architectures in order to overcome the growing concern of large models with high
computational requirements.
– E.g. depth-wise separable convolutions, dilated convolutions, meta-learning
– Despite the recent success of low-resource architectures in the image domain, little
progress has been made towards proposing efficient architectures for audio tasks
and especially source separation.
– Modern approaches, mainly in speech enhancement and music source separation,
have been focusing on developing models which are capable of real-time inference.
5
Introduction
– In this study, we propose a novel neural network architecture for audio source
separation while following a more holistic approach in terms of computational
resources that we take into consideration (FLOPs, latency and total memory
requirements).
– We entitle our model SuDoRM-RF: SUccessive DOwnsampling and Resampling of Multi-Resolution Features.
– We also propose improved versions and causal variations.
– The separation performance is comparable to, or even better than, that of several recent SOTA models on speech, environmental and universal sound separation tasks, with significantly lower computational requirements.
6
Architecture
- Encoder
- Separator
- U-ConvBlock
- Decoder
- Variations
7
Architecture
– In line with many SOTA approaches, SuDoRM-RF performs end-to-end audio source separation using a mask-based architecture with adaptive encoder and decoder bases.
– We have extended our basic model in order to also remove the mask estimation
process by introducing SuDoRM-RF++ that directly estimates the latent
representations of the sources.
– Besides the original model, we also propose:
– improved version SuDoRM-RF++
– SuDoRM-RF++ with group communication
– causal variation C-SuDoRM-RF++
9
Architecture – Encoder
– The encoder E architecture consists of a one-dimensional convolution with kernel
size KE and stride equal to KE/2 similar to Conv-TasNet.
– We force the output of the encoder to be strictly non-negative by applying a rectified linear unit (ReLU) activation on top of the output of the 1D convolution (a sketch follows below).
(channel, kernel, stride)
10
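Below is a minimal PyTorch sketch of the encoder described above, assuming the 8 kHz configuration reported later in the deck (512 basis filters, kernel size K_E = 21, stride K_E // 2 = 10). The class name and the padding-free layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """1-D convolutional front-end: kernel K_E, stride K_E // 2, non-negative output."""
    def __init__(self, n_basis: int = 512, kernel_size: int = 21):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=1, out_channels=n_basis,
                              kernel_size=kernel_size, stride=kernel_size // 2, bias=False)
        self.relu = nn.ReLU()  # forces the encoded mixture representation to be non-negative

    def forward(self, mixture: torch.Tensor) -> torch.Tensor:
        # mixture: (batch, 1, time) -> encoded representation: (batch, n_basis, frames)
        return self.relu(self.conv(mixture))

# Example: a 4-second mixture sampled at 8 kHz
v = Encoder()(torch.randn(1, 1, 4 * 8000))
print(v.shape)  # (1, 512, frames), with frames roughly time // 10
```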
Architecture – Separator
– In essence, the separator S module performs the following transformations to the
encoded mixture representation
11
Architecture – Separator
1. Projects the encoded mixture representation to a new channel space through a layer normalization followed by a point-wise convolution.
2. Performs repetitive non-linear transformations provided by B U-ConvBlocks; the output of the i-th U-ConvBlock is used as the input of the (i + 1)-th block.
3. Aggregates the information over multiple channels by applying a regular one-dimensional convolution for each source on the transposed feature representation.
12
Architecture – Separator
3. (cont.) This step has been introduced in Two step sound source separation and
empirically shown to make the training process more stable.
4. Combines the aforementioned latent codes for all sources by performing a
softmax operation in order to get mask estimates, where mask coefficients all lie
in [0,1].
5. Estimates a latent representation for each source by multiplying element-wise the encoded mixture representation with the corresponding mask (all five steps are assembled in the sketch below).
13
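Assembling the five steps above, here is a minimal PyTorch sketch of the mask-based separator. The stand-in blocks take the place of the real U-ConvBlocks (sketched after the next slides), and a single point-wise convolution stands in for the per-source convolutions of step 3; all names and shapes are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class Separator(nn.Module):
    """Mask-based separator sketch: project -> B blocks -> per-source conv -> softmax masks."""
    def __init__(self, enc_channels=512, in_channels=128, num_blocks=16, num_sources=2):
        super().__init__()
        self.num_sources, self.enc_channels = num_sources, enc_channels
        # step 1: (global) layer normalization + point-wise convolution to a new channel space
        self.norm = nn.GroupNorm(1, enc_channels)
        self.project = nn.Conv1d(enc_channels, in_channels, kernel_size=1)
        # step 2: B blocks applied sequentially (stand-ins for U-ConvBlocks)
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv1d(in_channels, in_channels, 5, padding=2), nn.PReLU())
            for _ in range(num_blocks))
        # step 3: aggregate channels into one latent code per source
        self.to_sources = nn.Conv1d(in_channels, num_sources * enc_channels, kernel_size=1)

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: encoded mixture representation, (batch, enc_channels, frames)
        y = self.project(self.norm(v))
        for block in self.blocks:                       # output of block i feeds block i + 1
            y = block(y)
        m = self.to_sources(y).view(v.size(0), self.num_sources, self.enc_channels, -1)
        masks = torch.softmax(m, dim=1)                 # step 4: mask coefficients in [0, 1]
        return masks * v.unsqueeze(1)                   # step 5: per-source latent representations
```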
Architecture – U-ConvBlock
– U-ConvBlock uses a block structure which resembles a depth-wise separable convolution but, unlike Conv-TasNet, without separate skip and residual connections.
– It extracts information from multiple resolutions using Q successive temporal downsampling and Q upsampling operations, resembling a U-Net architecture (see the sketch below).
– We postulate that this resampling procedure of extracting features at multiple scales, combined with the efficient increase of the effective receptive field, is what makes the block effective at a low computational cost.
14
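A simplified, hedged sketch of the U-ConvBlock idea: expand the channels, apply Q depth-wise separable convolutions that each halve the temporal resolution, then upsample and fuse the multi-resolution features in a U-Net fashion. The exact activation/normalization placement and the addition of the block input at the end are assumptions of this sketch rather than the authors' exact block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UConvBlock(nn.Module):
    """Sketch: channel expansion, Q temporal downsamplings, U-Net style upsampling and fusion."""
    def __init__(self, in_channels=128, expand_channels=512, depth=4, kernel=5):
        super().__init__()
        self.expand = nn.Conv1d(in_channels, expand_channels, 1)
        # Q successive depth-wise separable convolutions, each halving the temporal dimension
        self.down = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(expand_channels, expand_channels, kernel, stride=2,
                          padding=kernel // 2, groups=expand_channels),   # depth-wise
                nn.Conv1d(expand_channels, expand_channels, 1),           # point-wise
                nn.PReLU())
            for _ in range(depth))
        self.project = nn.Conv1d(expand_channels, in_channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_channels, frames)
        scales = [self.expand(x)]
        for layer in self.down:                 # collect features at progressively coarser scales
            scales.append(layer(scales[-1]))
        out = scales[-1]
        for feat in reversed(scales[:-1]):      # upsample coarse features and fuse with finer ones
            out = F.interpolate(out, size=feat.shape[-1], mode='nearest') + feat
        return x + self.project(out)            # add the block input back (an assumption here)
```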
Architecture – U-ConvBlock
15
16
Architecture – Decoder
– Our decoder module is the final step in order to transform the latent space
representation for each source back to the time domain.
– Each latent source representation is fed through a different transposed-convolution decoder (sketched below).
17
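A short sketch of the decoding step for the original (mask-based) model: one transposed-convolution decoder per source, with kernel and stride mirroring the encoder. Values and names are illustrative.

```python
import torch
import torch.nn as nn

num_sources, enc_channels, kernel = 2, 512, 21
decoders = nn.ModuleList(
    nn.ConvTranspose1d(enc_channels, 1, kernel_size=kernel, stride=kernel // 2)
    for _ in range(num_sources))

# latents: per-source latent representations from the separator, (batch, sources, channels, frames)
latents = torch.randn(1, num_sources, enc_channels, 100)
waveforms = torch.cat([decoders[i](latents[:, i]) for i in range(num_sources)], dim=1)
print(waveforms.shape)  # (batch, num_sources, time)
```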
Architecture – Variations 1
– In the improved version of the proposed architecture, namely, SuDoRM-RF++ , the
model estimates directly the latent representation for each target signal and uses
only one decoder module.
18
Architecture – Variations 1
– Differences from the original version:
– Replace the mask estimation and element-wise multiplication with a direct estimation of the latent target signals (sketched below).
– Use only one trainable decoder.
– Replace the layer normalization layers with global layer normalization.
– Simplify the activation layers by using PReLU activations with only one shared learnable parameter.
– We would like to underline that the original SuDoRM-RF models could potentially outperform the improved SuDoRM-RF++ variation in cases where the direct estimation of the latent targets is more difficult.
– Moreover, the alternative of using two decoders, as proposed in the initial version, might be more useful when solving an audio source separation problem with two distinct classes of sounds (e.g. speech enhancement), where each decoder could be fine-tuned separately.
19
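A hedged sketch of how the listed changes alter the tail of the model: the separator output is mapped directly to per-source latent representations (no softmax mask, no element-wise product) and a single shared transposed-convolution decoder reconstructs every source. Class and parameter names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class MaskFreeHead(nn.Module):
    """SuDoRM-RF++-style tail: direct latent estimation followed by one shared decoder."""
    def __init__(self, in_channels=128, enc_channels=512, num_sources=2, kernel=21):
        super().__init__()
        self.num_sources, self.enc_channels = num_sources, enc_channels
        self.to_latents = nn.Conv1d(in_channels, num_sources * enc_channels, kernel_size=1)
        self.decoder = nn.ConvTranspose1d(enc_channels, 1, kernel, stride=kernel // 2)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: separator features, (batch, in_channels, frames)
        z = self.to_latents(y)                         # direct latent targets, no masking step
        z = z.view(y.size(0), self.num_sources, self.enc_channels, -1)
        b, s, c, t = z.shape
        return self.decoder(z.reshape(b * s, c, t)).view(b, s, -1)   # (batch, sources, time)
```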
Architecture – Variations 2
– We also propose a new variation of our model, namely SuDoRM-RF++ GC, where we incorporate group communication.
– In the proposed architecture, the intermediate representations are processed in groups of sub-bands of channels.
– We divide the channels of each 1 × 1 convolutional block into 16 groups and first process them independently, sharing the parameters across all groups of sub-bands.
– Afterwards, we apply a self-attention module to combine them (a sketch follows below).
– The resulting architecture leads to a significant reduction in the number of trainable parameters.
arXiv:2011.08397v3 [eess.AS] 20 Nov 2020 : ULTRA-LIGHTWEIGHT SPEECH SEPARATION VIA GROUP COMMUNICATION
21
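A hedged sketch of the group-communication idea: the channels are split into 16 groups, a single shared transform processes every group, and a self-attention module then lets the groups exchange information. The module layout, the use of nn.MultiheadAttention and the per-frame attention arrangement are illustrative assumptions, not the exact design of the cited GroupComm paper.

```python
import torch
import torch.nn as nn

class GroupCommunication(nn.Module):
    """Sketch of group communication: shared per-group processing + attention across groups."""
    def __init__(self, channels=128, num_groups=16):
        super().__init__()
        assert channels % num_groups == 0
        self.num_groups = num_groups
        self.group_dim = channels // num_groups
        # the same (shared) transform is applied to every group of sub-band channels
        self.shared = nn.Conv1d(self.group_dim, self.group_dim, kernel_size=1)
        # self-attention lets the groups exchange information afterwards
        self.attn = nn.MultiheadAttention(embed_dim=self.group_dim, num_heads=1, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        b, c, t = x.shape
        g = x.view(b, self.num_groups, self.group_dim, t)
        g = self.shared(g.reshape(b * self.num_groups, self.group_dim, t))
        g = g.reshape(b, self.num_groups, self.group_dim, t)
        # attend across the group dimension, independently at every time frame
        q = g.permute(0, 3, 1, 2).reshape(b * t, self.num_groups, self.group_dim)
        out, _ = self.attn(q, q, q)
        out = out.reshape(b, t, self.num_groups, self.group_dim)
        return out.permute(0, 2, 3, 1).reshape(b, c, t)
```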
Architecture – Variations 3
– Our latest extension of the proposed model, named C-SuDoRM-RF++, is designed to run online and enable streamable inference for real-time applications.
– Differences from SuDoRM-RF++:
– Replace all non-causal convolutions with causal counterparts (see the sketch below).
– Remove all the normalization layers.
22
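A minimal sketch of the two listed changes: convolutions become causal by padding only on the left, and no normalization layers are used. The defaults (256 channels, kernel length 11) follow the model-configuration slide later in the deck, but the module itself is an illustration rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDepthwiseSeparableConv(nn.Module):
    """Causal drop-in for the non-causal depth-wise separable convolutions (sketch)."""
    def __init__(self, channels=256, kernel=11, stride=1):
        super().__init__()
        self.pad = kernel - 1                              # pad only on the left -> causal
        self.depthwise = nn.Conv1d(channels, channels, kernel, stride=stride, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, 1)
        self.act = nn.PReLU()
        # note: no normalization layers in the causal C-SuDoRM-RF++ variation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames); output at time t depends only on inputs up to t
        x = F.pad(x, (self.pad, 0))
        return self.act(self.pointwise(self.depthwise(x)))
```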
Experiment
- Datasets
- Data preprocessing
- Loss function
- Evaluation metrics
- Model configurations
23
Experiment – Datasets
– Speech: WSJ0-2mix (2 active speakers)
– Speaker mixtures are generated by randomly mixing WSJ0 utterances from two active speakers at a random SNR between −5 and 5 dB.
– Non-speech: ESC50 (2 active sources)
– ESC50 consists of a wide variety of sounds (non-speech human sounds, animal
sounds, natural soundscapes, interior sounds and urban noises). For each data
sample, two audio sources are mixed with a random SNR between −2.5 and
2.5dB where each source belongs to a distinct sound category from a total of
50.
– Universal sound separation: FUSS (1~4 active sources)
– We also evaluate our models under a purely universal sound separation setup
where we do not know how many sources are active in each input mixture.
24
Experiment – Preprocessing
– For fixed number of sources:
1. randomly choosing two sound classes or speakers
2. randomly cropping 4 s segments from the two source audio files
3. mixing the source segments at a random SNR (sketched below)
4. generating 20,000 new training mixtures for each epoch
5. using 3,000 mixtures for the validation and test sets
6. down-sampling each audio clip to 8 kHz
– For variable number of sources:
1. using the same dataset splits as the ones provided (10 s, 16 kHz)
2. generating 20,000 new training mixtures for each epoch
3. using 5,000 and 3,000 mixtures for the validation and test sets, respectively
25
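As referenced in step 3 above, a hedged sketch of the on-the-fly mixture generation for the fixed-source case: two equal-length source segments are combined at a random SNR. The function name and the exact rescaling convention are assumptions, not the authors' pipeline.

```python
import torch

def mix_at_random_snr(s1: torch.Tensor, s2: torch.Tensor,
                      snr_range=(-5.0, 5.0), eps: float = 1e-8):
    """Mix two equal-length source segments at a random SNR (in dB)."""
    snr = torch.empty(1).uniform_(*snr_range)
    # scale s2 so that 10 * log10(power(s1) / power(s2)) equals the sampled SNR
    p1, p2 = s1.pow(2).mean(), s2.pow(2).mean()
    scale = torch.sqrt(p1 / (p2 + eps) / (10.0 ** (snr / 10.0)))
    s2 = s2 * scale
    # return the mixture and the (rescaled) targets actually used in the mix
    return s1 + s2, torch.stack([s1, s2])

# Example: two random 4-second sources at 8 kHz
s1, s2 = torch.randn(4 * 8000), torch.randn(4 * 8000)
mixture, targets = mix_at_random_snr(s1, s2)
```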
Experiment – Loss
– For fixed number of sources:
– For variable number of sources:
– N denotes the maximum number of sources and N′ denotes the number of active sources.
– The loss function follows the FUSS baseline (a sketch of the fixed-source case follows below).
26
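The loss equations on this slide are images and are not reproduced here. For the fixed number of sources, the training objective is a permutation-invariant negative SI-SDR; the following is a minimal sketch of that case, not the authors' exact implementation (the variable-source FUSS-style loss additionally handles inactive sources and is omitted).

```python
import itertools
import torch

def si_sdr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Scale-invariant SDR in dB for signals of shape (..., time)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)

def pit_neg_si_sdr(est: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """Permutation-invariant negative SI-SDR for a fixed number of sources.
    est, ref: (batch, sources, time)."""
    n = est.size(1)
    losses = []
    for perm in itertools.permutations(range(n)):
        losses.append(-si_sdr(est[:, list(perm)], ref).mean(dim=-1))  # mean over sources
    return torch.stack(losses, dim=-1).min(dim=-1).values.mean()      # best permutation per item
```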
Experiment – Evaluation
– In order to evaluate the performance of our models, we use a stable version of the permutation-invariant SI-SDRi.
– For the case where non-active sources exist (i.e. Nʹ < N), we omit computing the aforementioned metric for those sources, as it yields infinite values.
– The * sign denotes the best permutation over the set of all possible permutations; ε = 1e−9 and τ = 1e−3 resolve numerical stability issues caused by zero target signals (a sketch follows below).
arXiv:2011.00803v1 [cs.SD] 2 Nov 2020: WHAT’S ALL THE FUSS ABOUT FREE UNIVERSAL SOUND SEPARATION DATA?
28
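Since the metric's formula is also an image on the slide, here is a hedged sketch of a thresholded, numerically stable SNR in the spirit of the FUSS baseline metric; the exact placement of ε and τ is my assumption and should be checked against the cited paper. SI-SDRi is then the difference between this value for the estimate and for the unprocessed mixture.

```python
import torch

def stable_snr(est: torch.Tensor, ref: torch.Tensor,
               tau: float = 1e-3, eps: float = 1e-9) -> torch.Tensor:
    """Thresholded SNR in dB that stays finite for (near-)silent target signals.
    est, ref: (..., time)."""
    num = ref.pow(2).sum(-1)
    den = (ref - est).pow(2).sum(-1) + tau * num + eps   # tau caps the achievable SNR
    return 10 * torch.log10(num / den + eps)
```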
Experiment – Model Conf.
– For the encoder/decoder we use a kernel size of 21 for input mixtures sampled at 8 kHz and a kernel size of 41 for 16 kHz. The number of basis filters (channels) is 512.
– For the configuration of each U-ConvBlock we set the input number of channels
equal to 128, the number of successive resampling operations equal to 4 and, the
expanded number of channels equal to 512. In each subsampling operation we
reduce the temporal dimension by a factor of 2 and all depth-wise separable
convolutions have a kernel length of 5 and a stride of 2.
– For C-SuDoRM-RF++, we increase the number of input channels to 256 and the
default kernel length to 11 in order to increase the receptive field.
29
Experiment – Model Conf.
– For brevity, SuDoRM-RF 2.0x, SuDoRM-RF 1.0x, SuDoRM-RF 0.5x and SuDoRM-RF 0.25x denote models consisting of 32, 16, 8 and 4 U-ConvBlocks, respectively.
– The same naming applies to the improved version SuDoRM-RF++ and its causal variation C-SuDoRM-RF++.
30
Experiment – Comparison
– One of the main goals of this study is to propose models for audio source
separation which could be trained using limited computational resources and
deployed easily on a mobile or edge-computing device.
– We consider the following aspects (a simple measurement sketch follows this list):
1. Number of executed floating point operations (FLOPs).
2. Number of trainable parameters.
3. Memory allocation required on the device for a single pass.
4. Time for completing each process.
31
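A small sketch of how two of the listed quantities, trainable parameters and per-pass execution time, can be measured with plain PyTorch; FLOP counting and peak-memory profiling require extra tooling and are not shown. The stand-in model and numbers are purely illustrative.

```python
import time
import torch

def count_trainable_params(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def time_forward(model: torch.nn.Module, x: torch.Tensor, repeats: int = 10) -> float:
    """Average wall-clock time of a forward pass in seconds."""
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(repeats):
            model(x)
    return (time.perf_counter() - start) / repeats

# Stand-in model and input: a single encoder-like convolution on a 4 s, 8 kHz mixture.
model = torch.nn.Conv1d(1, 512, kernel_size=21, stride=10)
x = torch.randn(1, 1, 4 * 8000)
print(count_trainable_params(model), "parameters,", time_forward(model, x), "s per forward pass")
```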
Result & Discussion
- Two active sources
- FLOPs
- Training cost
- Memory
- Variable active sources
- Causal capability
32
Result – Two sources
– In Table 1, we focus on the computational resources required during a forward pass of those models on a CPU.
– In Table 2, the same computational resource requirements are shown for a
backward pass on GPU as well as the number of trainable parameters.
36
Result – FLOPs
– One computational resource of particular interest is the number of FLOPs required during inference.
– In Figure 4, SuDoRM-RF models scale well as we increase the number of U-
ConvBlocks B from 4 → 8 → 16.
– SuDoRM-RF++ achieves similar or even better results than the original version of
the proposed model SuDoRM-RF with a lower number of FLOPs both in forward
and backward for a similar number of parameters and execution time.
– A significant drop in the absolute number of FLOPs is also obtained by adding the group communication mechanism.
37
Result – Training cost
– Usually one of the most
detrimental factors for training
deep learning models is the
requirement of allocating
multiple GPU devices for
several days or weeks until an
adequate performance is
obtained on the validation set.
– SuDoRM-RF 1.0x converges faster.
38
Result – Memory
– Group communication combined with SuDoRM-RF++ is one of the most effective
ways to reduce the number of trainable parameters caused by the bottleneck
dense layers between the U-ConvBlocks.
– However, the trainable parameters comprise only a small portion of the total memory required for a single forward or backward pass. The space complexity can easily be dominated by the storage of intermediate representations rather than by the parameters themselves.
– It could become even worse when multiple skip connections are present, gradients
from multiple layers have to be stored or implementations require augmented
matrices (dilated, transposed convolutions, etc.).
39
Result – WSJ0-2mix
– We perform an ablation study of SuDoRM-RF++ on the WSJ0-2mix dataset.
– The stride of the encoder and decoder is always defined as kernel//2. By decreasing the stride, we force the model to perform more computations and estimate the signal at a finer scale, closer to the time-domain resolution, which leads to better results.
– GLN (global layer normalization) significantly helps our model reach a better solution.
– Using more U-ConvBlocks yields better results.
40
Result – FUSS
– We see that by increasing our SuDoRM-RF and SuDoRM-RF++ model sizes to match the size of TDCN++, we can match its performance.
– For the single-source mixtures, our models perform worse than TDCN++; however, above 25 dB it is difficult even for a human listener to perceive the remaining artifacts, which are barely audible.
41
Result – Causal setup
– We see that we are able to obtain competitive separation performance for all configurations while remaining ≈ 10 to 20 times faster than real time.
– However, the performance still leaves room for improvement.
42
Conclusions
What has been done?
43
Conclusions
– Our experiments suggest that SuDoRM-RF models
a) could be deployed on devices with limited resources,
b) can be trained significantly faster while achieving good separation performance, and
c) scale well when increasing the number of parameters.
– We show that these models can perform similarly to, or even better than, recent state-of-the-art models while requiring significantly fewer computational resources in terms of FLOPs, memory and time, for experiments with both a fixed and a variable number of sources.
THANK YOU
Any questions?
You can find me at
◉ jasonho610@gmail.com ◉ NTNU-SMIL
