Wavesplit:
End-to-End Speech Separation by
Speaker Clustering
Presenter : 何冠勳 61047017s
Date : 2022/07/27
Neil Zeghidour, David Grangier from Google Research, Brain Team, IEEE/ACM Transactions on Audio,
Speech, and Language Processing, vol. 29, pp. 2840-2849, 2021
Outline
1. Introduction
2. Related Work
3. Wavesplit
4. Experiments & Results
5. Samples
6. Conclusions
Introduction
- Why Wavesplit?
- What is Wavesplit?
1
3
Why Wavesplit?
◉ Source separation is the ill-posed problem of recovering multiple sources from a single mixture. An
additional difficulty arises when the sources to separate belong to the same class of signals,
e.g. isolating the electric consumption of individual appliances from a meter reading,
separating overlapped fingerprints, or
retrieving individual compounds in chemical mixtures from spectroscopy.
◉ Another difficulty is permutation: the predicted channels may be well separated, yet their
assignment to sources is not necessarily consistent over time.
◉ Thus, designing a model that maintains a consistent assignment between the ground-truth (GT) and
the predicted channels is crucial for this task.
4
What is Wavesplit?
◉ Wavesplit aims at separating novel sources at test time (called open speaker separation in
speech) while leveraging source identities during training.
◉ The training objective identifies instantaneous speaker representations such that
➢ these representations can be grouped into individual speaker clusters and
➢ the cluster centroids provide a long-term speaker representation, which then conditions the
reconstruction of individual speaker signals.
5
What is Wavesplit?
◉ The contributions of this work are five-fold:
✓ leverage training speaker identities automatically, without needing any prior information
about the test speakers,
✓ limit channel swap by aggregating information over the sequence and using clustering to assign speakers,
✓ report state-of-the-art results on WSJ0-2/3mix (clean), WHAM! (noisy), WHAMR! (noisy and
reverberant), and Libri2/3mix (clean and noisy),
✓ analyze the empirical advantages and drawbacks of the method,
✓ show that the approach is generic and can be applied to non-speech tasks, by separating
maternal and fetal heart rates from a single abdominal electrode recording.
6
Maternal: of the mother; fetal: of the fetus
7
A view of similar application
Related Work
- Clustering
- Utterance-level Permutation-Invariant Training
- Speaker Representations
- Electrocardiograms
2
8
Clustering
◉ In the time-frequency (TF) domain, previous work has historically relied on learning TF masks. These
methods divide the input mixture into TF bins (TFBs) using the STFT and identify the source of each bin
from the magnitude. Source spectrograms can then be produced by masking the TFBs, and source signals
are later estimated by phase reconstruction.
◉ Clustering-based models learn a latent representation for each TFB such that the distance between TFBs from
the same source is lower than the distance between TFBs from different sources.
◉ Wavesplit also relies on clustering, but its representations are instead learned to
➢ predict the training speaker identity and
➢ provide conditioning variables to the separation network.
9
uPIT
◉ Permutation-Invariant Training (PIT) avoids clustering and predicts multiple masks directly. The
predictions are compared to the GT by searching over all permutations for the one that minimizes the
total loss.
◉ Among PIT systems, time-domain approaches avoid phase reconstruction and the degradation it causes.
◉ Wavesplit is also a time-domain approach, but it solves the permutation problem prior to signal
estimation: at training time, the latent source representations are ordered to best match the labels
before conditioning the separation network.
(Figure labels: conv1d, transpose-conv1d)
10
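The permutation search used by PIT can be sketched in a few lines. This is a minimal illustration, not the implementation of any cited system: the helper names `mse` and `pit_loss` are hypothetical, and mean squared error stands in for whatever loss a given system actually uses.

```python
from itertools import permutations

def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def pit_loss(estimates, targets, loss_fn=mse):
    """Return (best_loss, best_perm): the lowest total loss over all
    assignments of estimated channels to ground-truth channels."""
    n = len(targets)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        # perm[i] is the index of the estimate assigned to target i
        total = sum(loss_fn(estimates[perm[i]], targets[i]) for i in range(n))
        if total < best_loss:
            best_loss, best_perm = total, perm
    return best_loss, best_perm

targets = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
estimates = [[0.0, 2.0, 0.1], [1.0, 0.0, 1.0]]  # channels come out swapped
loss, perm = pit_loss(estimates, targets)
```

Here the best permutation maps target 0 to estimate 1 and target 1 to estimate 0. The search enumerates N! assignments, which is cheap for two or three speakers.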
Speaker Representations
◉ As many studies show, discriminative speaker representations can be extracted from short
speech segments to help separation.
◉ However, the representation encoders in previous work are either trained separately or
require a clean enrollment sequence.
◉ Conversely, Wavesplit jointly learns to identify and separate sources.
11
Electrocardiograms
◉ Separation of fetal and maternal heart rates from abdominal electrocardiograms (ECGs)
allows for affordable, non-invasive monitoring of fetal health during pregnancy and labour,
as an alternative to fetal scalp ECG.
◉ Source separation is an intermediate step for detecting fetal heart rate peaks. This work
evaluates Wavesplit for separating maternal and fetal heart rate signals.
12
Wavesplit
- Problem
- Architecture
- Training Objective
- Training Algorithm
3
13
Wavesplit
◉ Wavesplit combines two convolutional subnetworks that are jointly trained:
➢ The speaker stack maps the mixture to a set of vectors representing the recorded speakers.
➢ The separation stack consumes both the mixture and the set of speaker representations from the
speaker stack. It then produces a multi-channel audio output with separated speech from each
speaker.
◉ At training time, speaker labels are used to learn a vector representation per speaker such that the
inter-speaker distances are large, while the intra-speaker distances are small.
◉ At the same time, this representation allows the separation stack to reconstruct the clean signals,
conditioned on it.
◉ At test time, the speaker stack relies on clustering to identify a centroid representation per speaker.
14
Wavesplit
◉ With joint training, the speaker representation is optimized not only for identification but also for
the reconstruction of separated speech. In contrast with uPIT, Wavesplit conditions decoding on
a speaker representation that is valid for the whole sequence.
◉ It is less prone to channel swap since clustering assigns a persistent source representation to
each channel. In other words, the separation stack is conditioned with speaker vectors ordered
consistently with the labels.
15
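Problem
◉ In the separation task, training searches for a model $f$ that maximizes the separation quality
$\mathcal{L}(\hat{y}, y) = \max_{\sigma \in S_N} \frac{1}{N} \sum_{i=1}^{N} \ell(y^i, \hat{y}^{\sigma(i)})$, with $\hat{y} = f(x)$,
where $x$ is the input mixture, $y^i$ and $\hat{y}^i$ are the GT and estimated sources, $N$ is the number of
sources, $S_N$ is the set of permutations of the $N$ channels, and $\ell$ is the quality measurement.
◉ As for the quality measurement $\ell$, the literature typically relies on SDR to assess reconstruction
quality.
16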
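As a rough illustration of the quality measurement, the sketch below computes a simplified SDR in decibels as the ratio of reference energy to error energy. This is an assumption made for brevity: the SDR reported in the literature usually comes from a dedicated evaluation toolkit and allows for additional distortion terms. The function name `sdr_db` is hypothetical.

```python
import math

def sdr_db(reference, estimate, eps=1e-12):
    """Simplified signal-to-distortion ratio in dB:
    10 * log10(||reference||^2 / ||reference - estimate||^2)."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps avoids division by zero and log(0) for perfect or silent signals
    return 10.0 * math.log10((signal + eps) / (error + eps))

reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]   # 10% amplitude error on every sample
score = sdr_db(reference, estimate)
```

With a 10% amplitude error the energy ratio is 100, i.e. about 20 dB. Higher is better, which is why training minimizes the negative SDR.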
Architecture
◉ Wavesplit is a residual convolutional network with two stacks.
◉ The speaker stack produces speaker representations at each time step and then performs an
aggregation over the whole sequence.
◉ Intuitively, the speaker stack produces a latent representation of each speaker at every time step. It is
important to note that the speaker stack is not required to order speakers consistently across a sequence.
17
Architecture
◉ At the end of the sequence, the aggregation step groups all vectors by speaker and outputs N
summary vectors for the whole sequence. K-means clustering performs this aggregation at
inference and returns the centroids of the N identified clusters.
◉ $c^i$: speaker centroid, $h_t^i$: speaker vector
◉ During training, clustering is not used. Speaker centroids are derived by grouping speaker vectors
by speaker identity, relying on the speaker training objective.
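The inference-time aggregation can be sketched with a small K-means over all speaker vectors of a sequence. This is a toy version under stated assumptions: vectors are plain Python lists, the farthest-point initialisation is an illustrative choice that keeps the example deterministic, and the names `kmeans`, `sq_dist` and `mean` are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(vectors, k, iters=20):
    """Plain Lloyd iterations; returns k centroids."""
    # Deterministic farthest-point initialisation (illustrative assumption).
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            j = min(range(k), key=lambda i: sq_dist(v, centroids[i]))
            groups[j].append(v)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

# Speaker vectors for T=3 time steps and N=2 speakers.
# The order of the two vectors within a step is arbitrary.
h = [
    [[0.1, 0.0], [1.0, 1.1]],
    [[0.9, 1.0], [0.0, 0.1]],   # order swapped at this step
    [[-0.1, 0.0], [1.1, 0.9]],
]
flat = [v for step in h for v in step]   # pool all T*N vectors
centroids = sorted(kmeans(flat, k=2))
```

Because the vectors are pooled before clustering, the per-step ordering does not matter: the two centroids come out the same however each step happens to order its speakers.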
18
Architecture
◉ The separation stack maps the mixture and the speaker centroids into an N-channel signal.
◉ Inspired by Conv-TasNet, they rely on a residual convolutional architecture for both stacks. Each
residual block in the speaker stack is composed of a dilated convolution, a non-linearity and layer
normalization.
19
Architecture
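◉ The residual blocks of the separation stack are conditioned on the speaker centroids through
FiLM (Feature-wise Linear Modulation):
$\mathrm{FiLM}(z_l, c) = \gamma_l(c) \odot z_l + \beta_l(c)$, where $z_l$ denotes the features of block $l$, and $\gamma_l$ and $\beta_l$ are different linear projections of $c$, the
concatenation of speaker centroids.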
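A minimal sketch of FiLM conditioning follows, assuming the generic form of feature-wise modulation: a scale and a shift, both obtained as linear projections of the concatenated centroids, applied to every time step. Where exactly the modulation sits inside a residual block is not specified here, and the names `film` and `linear` as well as the toy weights are hypothetical.

```python
def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def film(features, centroids, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift each feature channel using projections of the centroids."""
    c = [value for centroid in centroids for value in centroid]  # concatenation
    gamma = linear(w_gamma, b_gamma, c)
    beta = linear(w_beta, b_beta, c)
    # features[t][k] is channel k at time step t; gamma and beta are shared over t
    return [[g * f + b for f, g, b in zip(frame, gamma, beta)] for frame in features]

centroids = [[1.0, 0.0], [0.0, 2.0]]   # N=2 centroids of dimension 2
features = [[1.0, 1.0], [2.0, -1.0]]   # T=2 frames, 2 channels
w_gamma = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
w_beta = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
out = film(features, centroids, w_gamma, [0.0, 0.0], w_beta, [0.5, 0.5])
```

In this toy setting the projections give a scale of [1, 2] and a shift of [0.5, 0.5]. Compared with purely additive conditioning, the multiplicative term lets the centroids gate individual channels.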
20
Training Objectives
◉ Speaker identities are labeled in most separation datasets but mostly unused in the source
separation literature. Wavesplit exploits this information to build an internal model of each source
and improve long-term separation.
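◉ The speaker vector objective encourages the speaker stack outputs to have small intra-speaker
and large inter-speaker distances. At each time step $t$,
$\mathcal{L}_{speaker}(h_t) = \min_{\sigma \in S_N} \sum_{i=1}^{N} \ell_{speaker}(h_t^{\sigma(i)}, s_i)$,
where $s_1, \dots, s_N$ denote the corresponding speakers, and $\ell_{speaker}$ defines
a loss function between a vector of $h_t$ and a speaker identity $s_i$.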
21
Training Objectives
◉ The minimum over permutations expresses that each identity should be identified at each time
step, in any arbitrary order. The best permutation at each time step is used to re-order
the speaker vectors in an order consistent with the training labels.
◉ This allows the speaker vectors originating from the same speaker to be averaged into centroids at
training time. Moreover, the separation stack is trained from different permutations of the same
labels.
⨷ The speaker stack may stabilize the speaker assignments within an utterance, but the
assignments across different utterances remain arbitrary. In other words, obtaining a
diarization-like result still requires some clustering method.
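The training-time procedure described above (pick the best permutation at each time step, re-order, then average per speaker) can be sketched as follows. The squared distance to a per-speaker embedding is used as a stand-in for the speaker loss, which is an assumption for illustration; `training_centroids` is a hypothetical name.

```python
from itertools import permutations

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def training_centroids(h, target_embeddings):
    """h[t][j] is the j-th speaker vector at step t; target_embeddings[i] is the
    embedding of the i-th labelled speaker. Returns one centroid per label."""
    n, dim = len(target_embeddings), len(target_embeddings[0])
    sums = [[0.0] * dim for _ in range(n)]
    for vectors in h:
        # Best permutation at this step: vectors[perm[i]] is matched to speaker i.
        perm = min(
            permutations(range(n)),
            key=lambda p: sum(sq_dist(vectors[p[i]], target_embeddings[i]) for i in range(n)),
        )
        for i in range(n):
            for d in range(dim):
                sums[i][d] += vectors[perm[i]][d]
    # Average the re-ordered vectors over time.
    return [[s / len(h) for s in row] for row in sums]

embeddings = [[0.0, 0.0], [1.0, 1.0]]   # embeddings of the two labelled speakers
h = [
    [[0.2, 0.0], [1.0, 0.8]],
    [[1.0, 1.2], [0.0, 0.0]],           # swapped order at this step
]
centroids = training_centroids(h, embeddings)
```

The returned centroids follow the label order, so the separation stack always sees speaker vectors ordered consistently with the targets, and no clustering is needed during training.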
22
Training Objectives
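◉ Three alternatives are explored for the speaker loss $\ell_{speaker}$. All three maintain an embedding table $E$ over
training speakers.
1. The distance objective favors small distances between a speaker vector $h_t^j$ and the corresponding embedding $E_{s_i}$ while
enforcing the distance between different speaker vectors to be larger than a margin of 1:
$\ell_{dist}(h_t^j, s_i) = \lVert h_t^j - E_{s_i} \rVert^2 + \sum_{k \neq j} \max(0, 1 - \lVert h_t^j - h_t^k \rVert^2)$.
2. The local classifier objective discriminates among the speakers present in the
sequence:
$\ell_{local}(h_t^j, s_i) = d(h_t^j, E_{s_i}) + \log \sum_{k=1}^{N} \exp(-d(h_t^j, E_{s_k}))$,
where $d(a, b) = \alpha \lVert a - b \rVert^2 + \beta$ with learned scalar parameters $\alpha \geq 0$ and $\beta$.
In both objectives, the second term is the discriminative term.
23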
Training Objectives
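3. The global classifier objective is similar to the local one, except that the partition function is computed over all speakers in
the training set:
$\ell_{global}(h_t^j, s_i) = d(h_t^j, E_{s_i}) + \log \sum_{k} \exp(-d(h_t^j, E_k))$,
where $k$ ranges over all training speakers and $d$ is the same scaled distance used by the local classifier. The second term is again the discriminative term.
◉ The reconstruction objective aims at optimizing the separation quality.
◉ For the reconstruction loss, negative SDR is used with clipping to limit the influence of the best training predictions.
24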
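The three speaker losses can be compared side by side in a small sketch. It follows the verbal descriptions above, with assumptions: the distance is a scaled squared Euclidean distance whose scale and offset (`alpha`, `beta`) are fixed constants here although they are learned in the model, and the function names are hypothetical. The local and global classifiers share one function and differ only in the set of candidate speakers.

```python
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def distance_loss(h_t, j, target_embedding):
    """Pull vector j toward its speaker embedding; push the other vectors of the
    same time step at least a margin of 1 away (hinge term)."""
    pull = sq_dist(h_t[j], target_embedding)
    push = sum(max(0.0, 1.0 - sq_dist(h_t[j], h_t[k])) for k in range(len(h_t)) if k != j)
    return pull + push

def classifier_loss(vector, target_index, candidates, alpha=1.0, beta=0.0):
    """Negative log-softmax over -d(vector, E_k), with d = alpha * ||.||^2 + beta.
    Local variant: candidates are the speakers present in the sequence.
    Global variant: candidates are all training speakers."""
    d = [alpha * sq_dist(vector, e) + beta for e in candidates]
    log_partition = math.log(sum(math.exp(-x) for x in d))
    return d[target_index] + log_partition

table = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]   # toy embedding table, 3 training speakers
h_t = [[0.0, 0.0], [0.5, 0.0]]                 # two speaker vectors at one time step
dist = distance_loss(h_t, 0, table[0])
local = classifier_loss(h_t[0], 0, [table[0], table[2]])   # speakers 0 and 2 are present
global_ = classifier_loss(h_t[0], 0, table)                # all training speakers
```

In this example the two speaker vectors are closer than the margin, so the distance loss is non-zero. The global loss is never smaller than the local one, because its partition function sums over a superset of speakers.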
Training Objectives
◉ Different forms of regularization are also considered to improve generalization to new speakers:
➢ adding Gaussian noise to speaker centroids,
➢ replacing full centroids with zeros (dropout),
➢ replacing some centroids with a linear combination with other centroids (speaker mixup).
◉ Finally, entropy regularization is applied, which favors well-separated embeddings for the training
speakers.
◉ In conclusion, the total loss is a weighted sum of the speaker loss, the reconstruction loss and the regularization term.
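The three centroid regularizers listed above could look roughly like this. It is a sketch under assumptions: mixup is implemented as a convex combination with a centroid drawn from a caller-supplied `pool`, which is only one reading of "linear combination with other centroids", and all function names are hypothetical. The rates in the example are the selected values quoted in the hyperparameter slide.

```python
import random

def add_gaussian_noise(centroids, std, rng):
    """Add independent Gaussian noise to every centroid coordinate."""
    return [[v + rng.gauss(0.0, std) for v in c] for c in centroids]

def speaker_dropout(centroids, rate, rng):
    """Replace a whole centroid with zeros with probability `rate`."""
    return [[0.0] * len(c) if rng.random() < rate else list(c) for c in centroids]

def speaker_mixup(centroids, pool, rate, rng):
    """With probability `rate`, replace a centroid by a convex combination of
    itself and a centroid drawn from `pool` (assumed: other centroids)."""
    mixed = []
    for c in centroids:
        if rng.random() < rate:
            other = rng.choice(pool)
            lam = rng.random()
            c = [lam * a + (1.0 - lam) * b for a, b in zip(c, other)]
        mixed.append(list(c))
    return mixed

rng = random.Random(0)
c = [[1.0, 2.0], [3.0, 4.0]]
noisy = add_gaussian_noise(c, 0.2, rng)
dropped = speaker_dropout(noisy, 0.4, rng)
regularized = speaker_mixup(dropped, c, 0.5, rng)
```

All three operate on the centroids only, so they perturb what the separation stack is conditioned on without touching the audio.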
25
Training Algorithm
◉ Model training optimizes the total loss with the Adam optimizer over mini-batches of fixed-size
windows. Experiments show that Wavesplit performs well for a wide range of window sizes starting
at 750 ms, unlike most uPIT approaches, which require longer segments (∼4 s).
◉ The training set is shuffled at each epoch, and a window starting point is uniformly sampled each time
a sequence is visited.
◉ They explored dynamic mixing, a data augmentation that consists of creating new mixtures on the fly
from single-speaker sources.
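Window sampling and dynamic mixing can be illustrated with toy signals. This is a sketch only: real pipelines typically also rescale the sources before summing, which is omitted here, and the names `sample_window` and `dynamic_mix` are hypothetical.

```python
import random

def sample_window(sequence, window, rng):
    """Uniformly sample the start of a fixed-size window inside `sequence`."""
    start = rng.randrange(len(sequence) - window + 1)
    return sequence[start:start + window]

def dynamic_mix(sources, window, rng):
    """Create a new mixture on the fly from single-speaker recordings:
    crop one window per source, then sum the crops sample by sample."""
    crops = [sample_window(s, window, rng) for s in sources]
    mixture = [sum(samples) for samples in zip(*crops)]
    return mixture, crops

rng = random.Random(0)
speaker_a = [0.01 * i for i in range(100)]          # toy "recordings"
speaker_b = [(-1) ** i * 0.5 for i in range(80)]
mixture, targets = dynamic_mix([speaker_a, speaker_b], window=16, rng=rng)
```

Each call yields a different mixture and its aligned targets, so the model rarely sees the same training example twice.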
26
27
Experiments &
Results
- Hyperparameter
- Clean Settings
- Noisy Settings
- Ablation
- Electrocardiogram
4
28
Experiments
◉ The experiments rely on the 8 kHz data from WSJ0-2/3mix (clean), WHAM! (noisy), WHAMR!
(noisy and reverberant), and Libri2/3mix (clean and noisy).
◉ They further evaluate variants of Wavesplit with different loss functions and architectural alternatives.
They conduct an error analysis examining a small fraction of sequences with a strong negative impact
on overall performance.
◉ The evaluation uses SDR and SI-SDR, and
results are reported as improvements over the input mixture (SDRi, SI-SDRi).
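To make "reported as improvements" concrete, the sketch below computes SI-SDR with its usual definition (project the estimate onto the reference, then compare target and residual energies) and the improvement as the metric gain over the unprocessed mixture. The function names and the toy signals are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def si_sdr_db(reference, estimate, eps=1e-12):
    """Scale-invariant SDR: project the estimate onto the reference first."""
    scale = dot(estimate, reference) / (dot(reference, reference) + eps)
    target = [scale * r for r in reference]
    noise = [e - t for e, t in zip(estimate, target)]
    return 10.0 * math.log10((dot(target, target) + eps) / (dot(noise, noise) + eps))

def improvement_db(metric, reference, estimate, mixture):
    """Metric gain of the estimate over using the mixture as the estimate."""
    return metric(reference, estimate) - metric(reference, mixture)

reference = [1.0, -1.0, 1.0, -1.0]
interferer = [0.5, 0.5, -0.5, -0.5]            # orthogonal to the reference
mixture = [r + i for r, i in zip(reference, interferer)]
estimate = [r + 0.1 * i for r, i in zip(reference, interferer)]
gain = improvement_db(si_sdr_db, reference, estimate, mixture)
```

Attenuating the interferer by a factor of 10 in amplitude gives an improvement of about 20 dB in this example, and rescaling the estimate leaves the SI-SDR unchanged.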
29
Hyperparameter
◉ Preliminary experiments on WSJ0-2mix drove the architecture choices for subsequent experiments.
➢ Both stacks have a latent dimension of 512, and the dilated convolutions have a kernel size of
3 without striding.
➢ The speaker stack is 14 layers deep, and its dilation grows exponentially from $2^0$ to $2^{13}$.
➢ The separation stack has 40 layers with the dilation pattern $2^{l \bmod 10}$.
➢ For the loss of the speaker stack, they found the global classifier to be the most effective.
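The dilation patterns above translate directly into receptive-field sizes. The arithmetic below assumes one unstrided convolution of kernel size 3 per layer and layer indices starting at 0; `receptive_field` is a hypothetical helper.

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field (in samples) of stacked, unstrided dilated convolutions:
    each layer adds (kernel_size - 1) * dilation samples of context."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

speaker_dilations = [2 ** l for l in range(14)]             # 2^0 ... 2^13
separation_dilations = [2 ** (l % 10) for l in range(40)]   # pattern repeats every 10 layers

speaker_rf = receptive_field(speaker_dilations)        # 32767 samples
separation_rf = receptive_field(separation_dilations)  # 8185 samples
```

Under these assumptions and at 8 kHz, the speaker stack sees roughly 4.1 s of context and the separation stack roughly 1 s.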
30
Hyperparameter
➢ Other parameters (selected value, followed by the values searched):
■ learning rate of 1e−3 in [1e−3, 2e−3, 3e−3];
■ speaker loss weight of 2 in [1, 2, 5];
■ regularization weight of 0.3 in [0, 0.2, 0.3, 0.5];
■ Gaussian noise with a standard deviation of 0.2 in [0, 0.1, 0.2, 0.3];
■ speaker dropout rate of 0.4 in [0, 0.2, 0.4, 0.6];
■ speaker mixup rate of 0.5 in [0, 0.5, 1];
■ clipping threshold on the negative SDR loss of 30 for clean data and 27 for noisy
data in [22, 27, 30].
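The sketch below shows how the selected weights and the clipping threshold might enter the loss. It assumes that clipping means capping the SDR at the threshold before negating it, and that the total loss is a plain weighted sum with the reconstruction term unweighted; both are assumptions for illustration, and the function names are hypothetical.

```python
def clipped_neg_sdr(sdr_db, threshold):
    """Negative SDR loss where SDR values above `threshold` are clipped,
    so examples that are already well separated stop lowering the loss."""
    return -min(sdr_db, threshold)

def total_loss(speaker_loss, reconstruction_loss, regularization,
               speaker_weight=2.0, regularization_weight=0.3):
    """Weighted sum of the three terms (weights are the selected values above)."""
    return (speaker_weight * speaker_loss
            + reconstruction_loss
            + regularization_weight * regularization)

easy = clipped_neg_sdr(35.0, threshold=30.0)   # capped at the clean-data threshold
hard = clipped_neg_sdr(12.5, threshold=30.0)   # below the threshold, unchanged
loss = total_loss(speaker_loss=1.0, reconstruction_loss=hard, regularization=0.5)
```

With a threshold of 30, an example at 35 dB contributes the same loss as one at 30 dB, so the optimizer has nothing left to gain on it.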
31
Clean settings
→ Unlike the test data, the validation data contains the same speakers
as the training set. They observe that test examples with
low SDRi are sequences where both speakers are close to
the same training speaker identity (speaker confusion).
→ The oracle permutes the predicted samples across channels
and reports the best permutation, showing that most of the
errors are channel assignment errors.
32
Clean settings
◉ WSJ0-2mix recordings have a single dominant speaker, and it is
plausible that PIT implicitly relies on this bias to resolve the
channel assignment ambiguity.
◉ To test this, 10-times-longer test sequences are created by concatenating
test sequences with the same pair of speakers, such that the loudest speaker
changes between consecutive concatenated sequences.
33
◉ The table shows the advantage of Wavesplit, which explicitly models sources in a time-independent
fashion rather than relying on implicit rules that exploit biases in the training data. This is remarkable
since Wavesplit is trained on 1 s windows, compared to longer 4 s windows for both PIT models.
Noisy settings
◉ In the WHAM! and WHAMR! scenarios, the task is even harder, as the model should predict clean
signals without reverberation, i.e. jointly address denoising, dereverberation and source separation.
◉ For both datasets, dynamic mixing is also adopted.
34
Both settings
→ LibriMix is a good testbed for Wavesplit, as its training set contains significantly more speakers than
WSJ0-2/3mix which improves the robustness of the speaker stack.
→ Moreover, the results are in the same range as the equivalent condition in WSJ0-2/3mix and WHAM!,
which again confirms the robustness of the method across datasets.
35
Ablation Study
◉ Although the distance loss is common in distance learning for clustering, the global classifier
reports better results. Similarly, the local classifier loss yields slower training and worse
generalization.
◉ The table also reports the advantage of FiLM
conditioning compared to standard additive conditioning.
Not only are the reported SDRs better, but FiLM also allows using
a higher learning rate and yields faster training.
◉ Regularization helps improve the performance as well.
36
Electrocardiogram
◉ The Wavesplit separation method can be applied beyond speech. An ECG reports voltage time series of
the electrical activity of the heart from electrodes placed on the skin.
◉ During pregnancy, the ECG informs about the function of the fetal heart, but maternal and fetal ECGs
are mixed. The goal is to separate these signals from a single noisy electrode recording.
◉ In this context, the source/speaker stack learns a representation of a mother and her fetus to
condition the separation stack.
37
Electrocardiogram
◉ They train a single model independently of the electrode's location and rely on the same
architecture as in the speech experiments, only validating regularization parameters.
38
Electrocardiogram
39
Samples
Let’s hear it!
5
40
Conclusions
6
41
Conclusion
◉ This work introduces Wavesplit, a neural network for source separation. From the input mixture,
the model extracts a representation for each source and estimates the separated signals
conditioned on the inferred representations.
◉ Separation with Wavesplit relies on a single consistent representation of each source regardless
of the input signal length. This is advantageous on long recordings, as shown by the experiments.
◉ Wavesplit redefines the state of the art on several datasets, in both clean and noisy conditions.
◉ Experiments also show the benefits of Wavesplit on fetal/maternal heart rate separation from ECGs.
These results open perspectives in other separation domains.
42
Any questions ?
You can find me at
◉ jasonho610@gmail.com
◉ NTNU-SMIL
Thanks!
43

(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 

Wavesplit.pdf

  • 1. Wavesplit: End-to-End Speech Separation by Speaker Clustering Presenter : 何冠勳 61047017s Date : 2022/07/27 Neil Zeghidour, David Grangier from Google Research, Brain Team, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2840-2849, 2021
  • 2. Outline: 1 Introduction, 2 Related Work, 3 Wavesplit, 4 Experiments & Results, 5 Samples, 6 Conclusions
  • 4. Why Wavesplit? ◉ Source separation is the ill-posed problem of recovering multiple sources from a single mixture. An additional difficulty arises when the sources to separate belong to the same class of signals, e.g. isolating appliance electric consumption from meter readings, separating overlapped fingerprints, or retrieving individual compounds in chemical mixtures from spectroscopy. ◉ Another difficulty is permutation: the predicted channels are separated but not necessarily assigned consistently over time. ◉ Thus, designing a model that maintains a consistent assignment between the ground truth (GT) and the predicted channels is crucial for this task.
  • 5. What is Wavesplit? ◉ This approach, Wavesplit, aims at separating novel sources at test time (called open speaker separation in speech) while leveraging source identities during training. ◉ The training objective identifies instantaneous speaker representations such that ➢ these representations can be grouped into individual speaker clusters and ➢ the cluster centroids provide a long-term speaker representation that then conditions the reconstruction of the individual speaker signals.
  • 6. What is Wavesplit? ◉ The contributions of this work are five-fold. Wavesplit: ✓ automatically leverages training speaker embeddings without needing any prerequisite information, ✓ limits channel swap by aggregating information and using clustering to assign speakers, ✓ reports state-of-the-art results on WSJ0-2/3mix (clean), WHAM! (noisy), WHAMR! (noisy and reverberant) and Libri2/3mix (clean and noisy). The authors also ✓ analyze the empirical advantages and drawbacks of the method and ✓ show that the approach is generic and can be applied to non-speech tasks, by separating maternal and fetal heart rate from a single abdominal electrode recording. (Glossary: maternal = of the mother; fetal = of the fetus.)
  • 7. A view of a similar application
  • 8. Section 2: Related Work - Clustering - Utterance-level Permutation-Invariant Training - Speaker Representations - Electrocardiograms
  • 9. Clustering ◉ In the time-frequency (TF) domain, previous work has historically relied on learning TF masks. These methods divide the input mixture into time-frequency bins (TFBs) using the short-time Fourier transform (STFT) and identify the source from the magnitude. Source spectrograms are then produced by masking the TFBs, and source signals are later estimated by phase reconstruction. ◉ Clustering-based models learn a latent representation for each TFB such that the distance between TFBs from the same source is lower than the distance between TFBs from different sources. ◉ Wavesplit also relies on clustering, but its representations are instead learned to ➢ predict the training speaker identity and ➢ provide conditioning variables to the separation network.
  • 10. uPIT ◉ Permutation-Invariant Training (PIT) avoids clustering and predicts multiple masks directly. The predictions are compared to the ground truth (GT) by searching over permutations for the one that minimizes the total loss. ◉ Among PIT systems, time-domain approaches avoid phase reconstruction and the degradation it causes. [Figure labels: conv1d, transpose-conv1d] ◉ Wavesplit is also a time-domain approach, but it solves the permutation problem prior to signal estimation: at training time, the latent source representations are ordered to best match the labels before conditioning the separation network.
  • 11. Speaker Representations ◉ As many studies show, discriminative speaker representations can be extracted from short speech segments to help separation. ◉ However, the representation encoders in previous work are either trained separately or require a clean enrollment sequence. ◉ Conversely, Wavesplit jointly learns to identify and separate sources.
  • 12. Electrocardiogram ◉ Separating fetal and maternal heart rates from abdominal electrocardiograms (ECGs) allows for affordable, non-invasive monitoring of fetal health during pregnancy and labour, as an alternative to fetal scalp ECG. ◉ Source separation is an intermediate step for detecting fetal heart rate peaks. This work evaluates Wavesplit for separating maternal and fetal heart rate signals.
  • 13. Section 3: Wavesplit - Problem - Architecture - Training Objective - Training Algorithm
  • 14. Wavesplit ◉ Wavesplit combines two convolutional subnetworks that are jointly trained: ➢ The speaker stack maps the mixture to a set of vectors representing the recorded speakers. ➢ The separation stack consumes both the mixture and the set of speaker representations from the speaker stack. It then produces a multi-channel audio output with the separated speech of each speaker. ◉ At training time, speaker labels are used to learn a vector representation per speaker such that inter-speaker distances are large while intra-speaker distances are small. ◉ At the same time, this representation allows the separation stack to reconstruct the clean signals conditionally. ◉ At test time, the speaker stack relies on clustering to identify a centroid representation per speaker.
  • 15. Wavesplit ◉ With joint training, the speaker representation is optimized not only for identification but also for the reconstruction of separated speech. In contrast with uPIT, Wavesplit conditions decoding on a speaker representation that is valid for the whole sequence. ◉ It is less prone to channel swap since clustering assigns a persistent source representation to each channel. In other words, the separation stack is conditioned on speaker vectors ordered consistently with the labels.
  • 16. Problem ◉ In the separation task, we search for a model that maximizes the separation quality. [Equation annotations: quality; estimated sources and GT; permutations; number of sources; quality measurement; model with mixture as input] ◉ For the quality measurement, the literature typically relies on the signal-to-distortion ratio (SDR) to assess reconstruction quality.
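The problem statement above can be illustrated with a minimal sketch of a permutation-invariant quality score. It assumes a simple energy-ratio form of SDR and a brute-force search over channel permutations; the function names `sdr_db` and `permutation_invariant_quality` are hypothetical and not taken from the paper.

```python
import itertools
import numpy as np

def sdr_db(reference, estimate, eps=1e-8):
    """Signal-to-distortion ratio in dB (simple energy-ratio form, an assumption)."""
    noise = reference - estimate
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(noise ** 2) + eps))

def permutation_invariant_quality(references, estimates):
    """Return the best mean SDR over all channel permutations.

    references, estimates: arrays of shape (N, T) for N sources.
    The returned permutation maps reference i to estimate perm[i].
    """
    n = references.shape[0]
    best_score, best_perm = -np.inf, None
    for perm in itertools.permutations(range(n)):
        score = np.mean([sdr_db(references[i], estimates[p]) for i, p in enumerate(perm)])
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm
```

The exhaustive search costs N! evaluations, which is acceptable for the two or three sources considered here.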
  • 17. Architecture ◉ Wavesplit is a residual convolutional network with two stacks. ◉ The speaker stack produces speaker representations (speaker vectors) at each time step and then performs an aggregation over the whole sequence. ◉ Intuitively, the speaker stack produces a latent representation of each speaker at every time step. Importantly, it is not required to order speakers consistently across a sequence.
  • 18. Architecture ◉ At the end of the sequence, the aggregation step groups all speaker vectors by speaker and outputs N summary vectors (speaker centroids) for the whole sequence. At inference, K-means clustering performs this aggregation and returns the centroids of the N identified clusters. ◉ During training, clustering is not used. Speaker centroids are derived by grouping speaker vectors by speaker identity, relying on the speaker training objective.
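The inference-time aggregation can be sketched with a plain K-means loop over the per-time-step speaker vectors. This is a minimal illustration under assumptions: the function name `kmeans_centroids`, the random initialization and the fixed iteration count are choices made here, not details given in the slides.

```python
import numpy as np

def kmeans_centroids(vectors, n_clusters, n_iters=20, seed=0):
    """Cluster speaker vectors of shape (M, D) and return (centroids, labels).

    With T time steps and N vectors per step, M = T * N after flattening.
    """
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen input vectors (an assumption).
    init = rng.choice(len(vectors), size=n_clusters, replace=False)
    centroids = vectors[init].copy()
    for _ in range(n_iters):
        # Assign every vector to its nearest centroid.
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=-1)
        labels = np.argmin(dists, axis=1)
        # Move each centroid to the mean of its members, skipping empty clusters.
        for k in range(n_clusters):
            members = vectors[labels == k]
            if len(members) > 0:
                centroids[k] = members.mean(axis=0)
    return centroids, labels
```

For speaker vectors stored as an array of shape (T, N, D), a call such as `kmeans_centroids(h.reshape(-1, h.shape[-1]), n_clusters=N)` would yield one centroid per speaker.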
  • 19. Architecture ◉ The separation stack maps the mixture and the speaker centroids into an N-channel signal. ◉ Inspired by Conv-TasNet, the authors rely on a residual convolutional architecture for both stacks. Each residual block in the speaker stack is composed of a dilated convolution, a non-linearity and layer normalization.
  • 20. Architecture ◉ The residual blocks of the separation stack are conditioned on the speaker centroids using FiLM (Feature-wise Linear Modulation). FiLM applies a feature-wise scale and shift whose coefficients are two different linear projections of the concatenation of the speaker centroids.
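As a rough illustration of FiLM conditioning, the sketch below scales and shifts each feature channel using two linear projections of a conditioning vector. The shapes, the bias terms and the function name `film` are assumptions made for this example; where exactly the modulation sits inside a Wavesplit residual block is not specified here.

```python
import numpy as np

def film(features, conditioning, w_scale, b_scale, w_shift, b_shift):
    """Feature-wise linear modulation.

    features:     (C, T) activations of one block.
    conditioning: (K,) concatenation of the speaker centroids.
    w_scale, w_shift: (C, K) projection matrices; b_scale, b_shift: (C,) biases.
    """
    scale = w_scale @ conditioning + b_scale  # one scale per channel
    shift = w_shift @ conditioning + b_shift  # one shift per channel
    # Broadcast the per-channel coefficients over the time axis.
    return scale[:, None] * features + shift[:, None]
```

Because the same centroids modulate every time step, the conditioning stays constant over the whole sequence.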
  • 21. Training Objectives ◉ Speaker identities are labeled in most separation datasets but mostly unused in the source separation literature. Wavesplit exploits this information to build an internal model of each source and improve long-term separation. ◉ The speaker vector objective encourages the speaker stack outputs to have small intra-speaker and large inter-speaker distances. It is built from a loss function defined between a speaker vector and a speaker identity, evaluated against the speakers present in the sequence.
  • 22. Training Objectives ◉ The minimum over permutations expresses that each identity should be identified at each time step, in any arbitrary order. The best permutation at each time step is used to re-order the speaker vectors consistently with the training labels. ◉ This allows the speaker vectors originating from the same speaker to be averaged into centroids at training time. Moreover, the separation stack is trained from different permutations of the same labels. ⨷ The speaker stack may stabilize the speaker assignments within an utterance, but the speaker assignments across different utterances remain undetermined. In other words, to obtain a diarization-like result, one must still apply a clustering method.
  • 23. Training Objectives ◉ Three alternatives are explored for the speaker loss. All three maintain an embedding table over training speakers. 1. The distance objective favors small distances between a speaker vector and the corresponding embedding, while enforcing the distance between different speaker vectors to be larger than a margin of 1. 2. The local classifier objective discriminates among the speakers present in the sequence, with learned scalar parameters. [Equation annotation: discriminative term]
  • 24. Training Objectives 3. The global classifier objective is similar to the local classifier, except that the partition function is computed over all speakers in the training set. [Equation annotation: discriminative term] ◉ The reconstruction objective aims at optimizing the separation quality. ◉ For the reconstruction loss, negative SDR is used with clipping to limit the influence of the best training predictions.
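One plausible reading of the clipped reconstruction loss is sketched below: the SDR is capped at a threshold before being negated, so examples that are already reconstructed very well stop contributing larger rewards. The exact clipping form and the name `clipped_negative_sdr` are assumptions; the default threshold of 30 follows the value listed for clean data on the hyperparameter slide.

```python
import numpy as np

def clipped_negative_sdr(reference, estimate, clip_db=30.0, eps=1e-8):
    """Negative SDR in dB, with the SDR capped at clip_db (assumed clipping form)."""
    noise = reference - estimate
    sdr = 10.0 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(noise ** 2) + eps))
    # Predictions better than clip_db all receive the same loss value.
    return -min(float(sdr), clip_db)
```

A training framework would need a differentiable equivalent of `min`, but the behaviour is the same.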
  • 25. Training Objectives ◉ Different forms of regularization are also considered to improve generalization to new speakers: ➢ adding Gaussian noise to speaker centroids, ➢ replacing full centroids with zeros (dropout), ➢ replacing some centroids with a linear combination with other centroids (speaker mixup). ◉ Finally, entropy regularization is applied, which favors well-separated embeddings for the training speakers. ◉ In conclusion, the total loss combines the reconstruction loss, the speaker loss and the entropy regularization.
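The three centroid regularizers can be sketched as follows. How they are combined, the order in which they are applied and the choice of mixing partner are assumptions made for this illustration; `regularize_centroids` is a hypothetical helper, and its defaults mirror the values reported on the hyperparameter slide.

```python
import numpy as np

def regularize_centroids(centroids, rng, noise_std=0.2, dropout_rate=0.4, mixup_rate=0.5):
    """Apply noise, dropout and mixup to float centroids of shape (N, D)."""
    # Gaussian noise added to every centroid.
    noisy = centroids + rng.normal(0.0, noise_std, size=centroids.shape)
    out = noisy.copy()
    n = len(noisy)
    for i in range(n):
        if rng.random() < dropout_rate:
            # Speaker dropout: replace the full centroid with zeros.
            out[i] = 0.0
        elif rng.random() < mixup_rate:
            # Speaker mixup: linear combination with a randomly chosen centroid.
            j = int(rng.integers(n))
            lam = rng.random()
            out[i] = lam * noisy[i] + (1.0 - lam) * noisy[j]
    return out
```

Regularization of this kind is applied at training time only; at test time the centroids would be used unchanged.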
  • 26. Training Algorithm ◉ Model training optimizes the total loss with the Adam optimizer over mini-batches of fixed-size windows. Experiments show that Wavesplit performs well for a wide range of window sizes starting at 750 ms, unlike most uPIT approaches, which require longer segments (∼4 s). ◉ The training set is shuffled at each epoch, and a window starting point is uniformly sampled each time a sequence is visited. ◉ The authors also explored dynamic mixing, a data augmentation that creates new mixtures on the fly from single-speaker sources.
  • 28. Section 4: Experiments & Results - Hyperparameter - Clean Settings - Noisy Settings - Ablation - Electrocardiogram
  • 29. Experiments ◉ The experiments rely on the 8 kHz data from WSJ0-2/3mix (clean), WHAM! (noisy), WHAMR! (noisy and reverberant) and Libri2/3mix (clean and noisy). ◉ The authors further evaluate variants of Wavesplit with different loss functions and architectural alternatives. They conduct an error analysis examining a small fraction of sequences that have a strong negative impact on overall performance. ◉ The evaluation uses SDR and scale-invariant SDR (SI-SDR), and results are reported as improvements.
  • 30. Hyperparameter ◉ Preliminary experiments on WSJ0-2mix drove the architecture choices for subsequent experiments. ➢ Both stacks have a latent dimension of 512, and the dilated convolutions have a kernel size of 3 without striding. ➢ The speaker stack is 14 layers deep, and its dilation grows exponentially from 2^0 to 2^13. ➢ The separation stack has 40 layers with the dilation pattern 2^(l mod 10). ➢ For the speaker stack loss, the global classifier was found to be the most effective.
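To make the dilation schedules concrete, the sketch below lists the per-layer dilations and computes the resulting receptive field. It reads the dilation patterns on the slide as 2^0 to 2^13 for the speaker stack and 2^(l mod 10) for the separation stack, and it assumes one non-strided convolution of kernel size 3 per layer, which is a simplification of the actual residual blocks.

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field, in samples, of stacked non-strided dilated convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)

# Speaker stack: 14 layers, dilation doubling at every layer.
speaker_dilations = [2 ** l for l in range(14)]

# Separation stack: 40 layers, the pattern 2^(l mod 10) repeats every 10 layers.
separation_dilations = [2 ** (l % 10) for l in range(40)]

print(receptive_field(speaker_dilations))
print(receptive_field(separation_dilations))
```

Under these assumptions the speaker stack spans 32,767 samples (about 4.1 s at 8 kHz) and the separation stack 8,185 samples (about 1 s).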
  • 31. Hyperparameter ➢ Other parameters (selected value, followed by the values searched): ■ learning rate of 1e−3 in [1e−3, 2e−3, 3e−3]; ■ speaker loss weight of 2 in [1, 2, 5]; ■ regularization weight of 0.3 in [0, 0.2, 0.3, 0.5]; ■ Gaussian noise with a standard deviation of 0.2 in [0, 0.1, 0.2, 0.3]; ■ speaker dropout rate of 0.4 in [0, 0.2, 0.4, 0.6]; ■ speaker mixup rate of 0.5 in [0, 0.5, 1]; ■ clipping threshold on the negative SDR loss of 30 for clean data and 27 for noisy data in [22, 27, 30].
  • 32. Clean settings → Unlike the test data, the validation data contains the same speakers as the training set. The authors observe that test examples with low SDR improvement (SDRi) are sequences where both speakers are close to the same training speaker identity (confusing speakers). → The oracle permutes the predicted samples across channels and reports the best permutation, showing that most of the errors are channel assignment errors.
  • 33. Clean settings ◉ WSJ0-2mix recordings have a single dominant speaker, and it is plausible that PIT implicitly relies on this bias to resolve the channel assignment ambiguity. ◉ To test this, test sequences 10 times longer are created by concatenating test sequences with the same pair of speakers, such that the loudest speaker changes between consecutive concatenated sequences. ◉ The table shows the advantage of Wavesplit, which explicitly models sources in a time-independent fashion rather than relying on implicit rules that exploit biases in the training data. This is remarkable since Wavesplit is trained on 1 s windows, compared to longer 4 s windows for both PIT models.
  • 34. Noisy settings ◉ In the WHAM! and WHAMR! scenarios, the task is even harder, as the model should predict clean signals without reverberation, i.e. jointly address denoising, dereverberation and source separation. ◉ For both datasets, dynamic mixing is also adopted.
  • 35. Both settings → LibriMix is a good testbed for Wavesplit, as its training set contains significantly more speakers than WSJ0-2/3mix, which improves the robustness of the speaker stack. → Moreover, the results are in the same range as for the equivalent conditions in WSJ0-2/3mix and WHAM!, which again confirms the robustness of the method across datasets.
  • 36. Ablation Study ◉ Although the distance loss is common in distance learning for clustering, the global classifier reports better results. Similarly, the local classifier loss yields slower training and worse generalization. ◉ The table also reports the advantage of FiLM conditioning over standard additive conditioning. Not only are the reported SDRs better, but FiLM also allows a higher learning rate and yields faster training. ◉ Regularization helps improve the performance as well.
  • 37. Electrocardiogram ◉ The Wavesplit separation method can be applied beyond speech. An ECG reports voltage time series of the electrical activity of the heart from electrodes placed on the skin. ◉ During pregnancy, the ECG informs about the function of the fetal heart, but the maternal and fetal ECGs are mixed. The goal is to separate these signals from a single noisy electrode recording. ◉ In this context, the source/speaker stack learns a representation of a mother and her fetus to condition the separation stack.
  • 38. Electrocardiogram ◉ The authors train a single model independently of the electrode's location and rely on the same architecture as in the speech experiments, only validating regularization parameters.
  • 42. Conclusion ◉ This work introduces Wavesplit, a neural network for source separation. From the input mixture, the model extracts a representation for each source and estimates the separated signals conditioned on the inferred representations. ◉ Separation with Wavesplit relies on a single consistent representation of each source regardless of the input signal length. This is advantageous on long recordings, as shown by the experiments. ◉ Wavesplit redefines the state of the art on several datasets, in both clean and noisy conditions. ◉ Experiments also show the benefits of Wavesplit on fetal/maternal heart rate separation from ECGs. These results open perspectives in other separation domains.
  • 43. Any questions? You can find me at ◉ jasonho610@gmail.com ◉ NTNU-SMIL Thanks!