1. Wavesplit:
End-to-End Speech Separation by
Speaker Clustering
Presenter : 何冠勳 61047017s
Date : 2022/07/27
Neil Zeghidour, David Grangier from Google Research, Brain Team, IEEE/ACM Transactions on Audio,
Speech, and Language Processing, vol. 29, pp. 2840-2849, 2021
4. Why Wavesplit?
◉ Source separation is an ill-posed problem of separating multiple sources from a single mixture. An
additional difficulty arises when the sources to separate belong to the same class of signals.
e.g. isolating appliance electric consumption from meter reading
separating overlapped fingerprints
retrieving individual compounds in chemical mixtures from spectroscopy
◉ Another difficulty regards permutation where predicted channels are separated but not
neccesarily inconsistent along time.
◉ Thus, designing a model that maintains a consistent assignment between the GT and the predicted
channels is crucial for this task.
4
5. What is Wavesplit?
◉ This approach, Wavesplit, aims at separating novel sources at test time, called open speaker
separation in speech, but leverages source identities during training.
◉ The training objective identifies instantaneous speaker representations such that
➢ these representations can be grouped into individual speaker clusters and
➢ the cluster centroids provide a long-term speaker representation then condition the
reconstruction of individual speaker signals.
5
6. What is Wavesplit?
◉ In this work, the contributions are five-fold:
✓ automatically leverage training speaker embeddings but do not need any prerequisite
information,
✓ limits channel-swap by aggregate information and use clustering to assign speaker,
✓ report state-of-the-art results WSJ0-2/3mix (clean), WHAM! (noisy) and WHAMR! (noisy and
reverberant), Libri2/3mix (clean and noisy) then,
✓ analyze the empirical advantages and drawbacks of our method,
✓ show that this approach is generic and can be applied to non-speech tasks, by separating
maternal and fetal heart rate from a single abdominal electrode graph.
6
Maternal: 母系的、Fetal: 胎兒的
8. Related Work
- Clustering
- Utterance-level Permutation-Invariant Training
- Speaker Representations
- Electrocardiograms
2
8
9. Clustering
◉ In TF-domain methods, the previous works historically relied on learning TF masks. They divide the
input mixture in TFBs using STFT, and identify the source from the magnitude. Source spectrograms
can then be produced by masking the TFBs, and source signals are later estimated by phase
reconstruction.
◉ These models learns a latent representation for each TFB such that the distance between TFBs from
the same source is lower than the distance between TFBs from different sources.
◉ Wavesplit also relies on clustering but instead its representations are learned to
➢ predict training speaker identity and
➢ provide conditioning variables to separation network.
9
10. uPIT
◉ Permutation-Invariant Training (PIT) avoids clustering and predicts multiple masks directly. The
predictions are compared to the GT by searching over the permutations that minimize the total
loss.
◉ Among PIT systems, time domain approaches avoid phase reconstruction and its degradation.
◉
10
conv1d transpose-conv1d
◉ Wavesplit is also a time domain approach but it solves
the permutation problem prior to signal estimation:
At train time, the latent source representations are
ordered to best match the labels prior to conditioning
the separation network.
11. Speaker Representations
◉ As many study shows, discriminative speaker representations can be extracted from short
speech segments to help separation.
◉ However, the representations encoder in previous works are either separately trained or
require a clean enrollment sequence.
◉ Conversely, Wavesplit jointly learns to identify and separate sources.
11
12. Electro-cardio-gram
◉ Separation of fetal and maternal heart rates from abdominal electrocardiograms (ECGs)
allows for affordable, non-invasive monitoring of fetal health during pregnancy and labour,
as an alternative to fetal scalp ECG.
◉ Source separation is an intermediate step for detecting fetal heart rate peaks. This work
evaluates Wavesplit for separating maternal and fetal heart rate signals.
12
14. Wavesplit
◉ Wavesplit combines two convolutional subnetworks that are jointly trained:
➢ The speaker stack maps the mixture to a set of vectors representing the recorded speakers.
➢ The separation stack consumes both the mixture and the set of speaker representations from the
speaker stack. It then produces a multi-channel audio output with separated speech from each
speaker.
◉ At training time, speaker labels are used to learn a vector representation per speaker such that the
inter-speaker distances are large, while the intra-speaker distances are small.
◉ At the same time, this representation allow the separation stack to reconstruct the clean signals
conditionally.
◉ At test time, the speaker stack relies on clustering to identify a centroid representation per speaker.
14
15. Wavesplit
◉ With joint training, the speaker representation is not solely optimized for identification but also for
the reconstruction of separated speech. In contrast with uPIT, Wavesplit condition decoding with
a speaker representation valid for the whole sequence.
◉ It is less prone to channel swap since clustering assigns a persistent source representation to
each channel. In other words, the separation stack is conditioned with speaker vectors ordered
consistently with the labels.
15
16. Problem
16
quality
Estimated sources and GT permutations
# of sources
quality measurement Model with mixture as input
◉ In separation task, the model search for a model that can maximize the separation quality .
◉ As for quality measurement , literature typically relies on SDR to assess reconstruction
quality.
17. Architecture
◉ Wavesplit is a residual convolutional network with two stacks.
◉ The speaker stack produces speaker representations at each time step and then performs an
aggregation over the whole sequence.
◉ Intuitively, produces a latent representation of each speaker at every time step. It is
important to note that is not required to order speakers consistently across a sequence.
17
speaker vectors
18. Architecture
◉ At the end of the sequence, the aggregation step groups all vectors by speaker and outputs N
summary vectors for the whole sequence. K-means clustering performs this aggregation at
inference and returns the centroids of the N identified clusters.
◉ : speaker centroid, : speaker vector
◉ During training, clustering is not used. Speaker centroids are derived by grouping speaker vectors
by speaker identity, relying on the speaker training objective.
18
19. Architecture
◉ The separation stack maps the mixture and the speaker centroids into an N-channel signal.
◉ Inspired by Conv-TasNet, they rely on a residual convolutional architecture for both stacks. Each
residual block in the speaker stack composes a dilated convolution, a non-linearity and layer
normalization.
19
20. Architecture
◉ The residual blocks of the separation stack are conditioned by the speaker centroids relying on
FiLM, Feature-wise Linear Modulation.
, where and are different linear projections of , the
concatenation of speaker centroids.
20
21. Training Objectives
◉ Speaker identities are labeled in most separation datasets but mostly unused in the source
separation literature. Wavesplit exploits this information to build an internal model of each source
and improve long-term separation.
◉ The speaker vector objective encourages the speaker stack outputs to have small intra-speaker
and large inter-speaker distances.
,where denotes the corresponding speakers, and defines
a loss function between a vector of and a speaker identity .
21
22. Training Objectives
◉ The minimum over permutations expresses that each identity should be identified at each time
step, in any arbitrary order. The best permutation ( ) at each time-step is used to re-order
the speaker vectors in an order consistent with the training labels.
◉ This allows averaging the speaker vectors originating from the same speaker to be the centroids at
training time. Moreover, the separation stack is trained from different permutations of the same
labels.
⨷ The speaker stack might stabilize the speaker assignments within intra-utterance training, but the
speaker assignments between difference utterances is still agnostic. In other words, to obtain a
diarization-like result must still employ any clustering method.
22
23. Training Objectives
◉ Three alternative are explored for . All three maintain an embedding table over
training speakers.
1. favors small distances between a speaker vector and the corresponding embedding while
enforcing the distance between different speaker vectors to be larger than a margin of 1.
2. is a local classifier objective which discriminates among the speakers present in the
sequence.
,where with learned scalar parameters .
23
discriminative term
discriminative term
24. Training Objectives
3. is similar to the last one, except that the partition function is computed over all speakers in
the training set.
◉ The reconstruction objective aims at optimizing the separation quality.
◉ For , negative SDR is used with a clipping to limit the influence of the best training predictions.
24
discriminative term
25. Training Objectives
◉ Different forms of regularization is as well considered to improve generalization to new speakers.
➢ Adding Gaussian noise to speaker centroids
➢ Replacing full centroids with zeros (dropout)
➢ Replacing some centroids with a linear combination with other centroids (speaker mixup)
◉ Finally, entropy regularization is applied which favors well separated embeddings for the training
speakers with.
◉ In conclusion, the total loss is
25
26. Training Algorithm
◉ Model training optimizes the total loss function with Adam optimizer by mini batches of fixed-size
windows. Experiment shows that Wavesplit performs well for a wide-range of window sizes starting
at 750ms, unlike most uPIT approaches that require longer segments (∼ 4s).
◉ The training set is shuffled at each epoch and a window starting point is uniformly sampled each time
a sequence is visited.
◉ They explored the use of dynamic mixing data augmentation which consists in on-the-fly creation of
new mixtures from single speaker sources.
26
29. Experiments
◉ The experiments rely on the 8kHz data from WSJ0-2/3mix (clean), WHAM! (noisy) and WHAMR!
(noisy and reverberant), Libri2/3mix (clean and noisy)
◉ They further evaluate variants of Wavesplit with different loss functions and architectural alternatives.
They conduct an error analysis examining a small fraction of sequences with a strong negative impact
on overall performance.
◉ The evaluation uses SDR and SI-SDR and
they report results as improvements.
29
30. Hyperparameter
◉ Preliminary experiments on WSJ0-2mix drove our architecture choices for subsequent experiments.
➢ Both stacks have a latent dimension of 512 and the dilated convolutions have a kernel size of
3 without striding.
➢ The speaker stack is 14 -layer deep and dilation grows exponentially from 20
~ 213
.
➢ The separation stack has 40 layers with the dilation pattern 2l mod 10
.
➢ For loss of speaker stack, they found the global classifier to be the most effective.
30
31. Hyperparameter
➢ Other parameters:
■ learning rate as 1e−3 in [1e−3, 2e−3, 3e−3];
■ speaker loss weight of 2 in [1, 2, 5];
■ regularization weight at 0.3 in [0, 0.2, 0.3, 0.5];
■ gaussian noise with std at 0.2 in [0, 0.1, 0.2, 0.3];
■ speaker dropout rate of 0.4 in [0, 0.2, 0.4, 0.6];
■ speaker mixup rate of 0.5 in [0, 0.5, 1].
■ the clipping threshold on the negative SDR loss as 30 for clean data and 27 for noisy
data within [22,27,30].
31
32. Clean settings
→ Unlike test data, validation data contains the same speakers
as the training set and they observe that test examples with
low SDRi are sequences where both speakers are close to
the same training speaker identity (confusing speaker).
→ Our oracle permutes the predicted samples across channels,
and reports the best permutation, showing that most of the
errors are channel assignment errors.
32
33. Clean settings
◉ WSJ0-2mix recordings have a single dominant speaker, and it is
speculatable that PIT might implicitly rely on this bias to address the
channel assignment ambiguity.
◉ To test that, 10 times longer test sequences are created, by concatenating
test sequences with the same pair of speakers, also the loudest speaker
changes between each concatenated sequence.
33
◉ Table shows the advantage of Wavesplit, which explicitly models sources in a time-independent
fashion rather than relying on implicit rules that exploit biases in the training data. This is remarkable
since Wavesplit is trained on 1s long windows compared to longer 4s windows for both PIT models.
34. Noisy settings
◉ Under the scenario of WHAM! and WHAMR!, the task is even harder as the model should predict clean
signals without reverberation, i.e. jointly addressing denoising, dereverberation and source separation.
◉ For both dataset, dynamic mixing is also adapted.
34
35. Both settings
→ LibriMix is a good testbed for Wavesplit, as its training set contains significantly more speakers than
WSJ0-2/3mix which improves the robustness of the speaker stack.
→ Moreover, the results are in the same range as the equivalent condition in WSJ0-2/3mix and WHAM!,
which again confirms the robustness of the method across datasets.
35
36. Ablation Study
◉ Although the distance loss is common in distance learning for clustering, the global classifier
reports better results. Similarly, local classifier loss yields a slower training and worse
genereralization.
◉ Table also reports the advantage of FiLM
conditioning compared to standard additive conditioning.
Not only reported SDRs are better but FiLM allows using
a higher learning rate and yields faster training.
◉ Regularization helps improve the peformance as well.
36
37. Electrocardiogram
◉ Wavesplit separation method can be applied beyond speech. ECG reports voltage time series of
the electrical activity of the heart from electrodes placed on the skin.
◉ During pregnancy, ECG informs about the function of the fetal heart but maternal and fetal ECG
are mixed. The goal is to separate these signals from a single noisy electrode recording.
◉ In this context, the source/speaker stack learns a representation of a mother and its fetus to
condition the separation stack.
37
38. Electrocardiogram
◉ They train a single model independently of the electrode’s location and rely on the same
architecture as in our speech experiments, only validating regularization parameters.
38
42. Conclusion
◉ This work introduce Wavesplit, a neural network for source separation. From the input mixture,
the model extracts a representation for each source and estimates the separated signals
conditioned on the inferred representations.
◉ Separation with Waveplit relies on a single consistent representation of each source regardless
of the input signal length. This is advantageous on long recordings, as shown by the experiments.
◉ Wavesplit redefines the state-of-the-art on several datasets, both in clean and noisy conditions.
◉ Experiments also the benefits of Wavesplit on fetal/maternal heart rate separation from ECGs.
These results open perspectives in other separation domains.
42
43. Any questions ?
You can find me at
◉ jasonho610@gmail.com
◉ NTNU-SMIL
Thanks!
43