EEND-SS is a joint model that performs speaker diarization, speech separation, and speaker counting in an end-to-end manner. It combines the EEND-EDA diarization model and Conv-TasNet separation model. EEND-SS uses a multi-task learning approach and multiple 1x1 convolution layers to estimate separation masks for a flexible number of speakers. Experimental results on the LibriMix dataset show it outperforms baseline methods in metrics like SDRi and DER, handling both a fixed and variable number of speakers.
Introduction
– Tasks:
  • Speaker diarization (SD): who spoke when
  • Speech separation (SS): (who) spoke what
– Nowadays, fully end-to-end neural diarization (EEND) systems can handle speaker
overlap by training on overlapped speech data.
– One drawback of EEND is that the number of speakers must be known and fixed
beforehand.
– To cope with that, one can either 1) set a maximum number of speakers, or
2) extract speakers iteratively, one at a time.
– EEND with Encoder-Decoder-based Attractor calculation (EEND-EDA) uses LSTM
encoder-decoder-based attractors to count speakers within diarization, handling
a flexible number of speakers.
Introduction
– Even though speaker diarization and speech separation are often used together,
their optimal order is not fixed; it varies with the scenario and dataset.
– Our solution is to unify these models as a single neural network and jointly train it
with multi-task learning so that both tasks can benefit from each other.
– We combine Conv-TasNet and EEND-EDA into a novel model, named EEND-SS.
– We also propose the multiple 1×1 convolutional layer (MCL) architecture for
estimating the separation masks corresponding to the number of speakers, and a
post-processing technique for refining the separated speech signals.
– Experimental results show that EEND-SS can improve both tasks.
Architecture
– First, we define the tasks the model tackles.
– Given a mixture x = Σ_{c=1..C} s_c + n of C speaker sources s_c and noise n:
– SD aims to estimate the frame-level speech activities ŷ_{t,c} ∈ {0, 1}.
– SS aims to estimate the individual source signals ŝ_1, …, ŝ_C.
– Speaker counting aims to estimate the number of speakers Ĉ.
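As a toy illustration of these three targets (the shapes and names below are hypothetical, not from the paper), a mixture and its reference labels can be laid out as follows:

```python
import numpy as np

# Toy illustration (not the model): C = 2 speaker sources plus noise.
rng = np.random.default_rng(0)
C, T = 2, 8000                                 # speakers, waveform samples
sources = 0.1 * rng.standard_normal((C, T))    # s_1 ... s_C (SS targets)
noise = 0.01 * rng.standard_normal(T)          # n

# The observed mixture is the sum of all sources plus the noise.
x = sources.sum(axis=0) + noise

# SD target: frame-level activities y[t, c] in {0, 1}; here speaker 0
# talks in the first half of the frames and speaker 1 in the second.
frames = 100
y = np.zeros((frames, C), dtype=int)
y[: frames // 2, 0] = 1
y[frames // 2 :, 1] = 1

# Speaker-counting target: C itself.
```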
Architecture – EEND
– EEND is a method for estimating the
speech activities of each speaker from a
multiple speaker input mixture using an
end-to-end neural network.
– This version is the self-attentive one
(SA-EEND).
Architecture – EEND
– The loss function is defined as:
  L_diar = (1/(T·C)) · min_{φ ∈ Φ} Σ_{t=1..T} BCE(y_t^φ, ŷ_t),
  where Φ is the set of all possible speaker permutations.
– BCE is the binary cross entropy, defined as:
  BCE(y_t, ŷ_t) = Σ_{c=1..C} [ −y_{t,c} log ŷ_{t,c} − (1 − y_{t,c}) log(1 − ŷ_{t,c}) ]
– As before, a drawback of EEND is that the number of speakers is fixed.
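A minimal numpy sketch of this permutation-invariant loss, brute-forcing the minimum over permutations (fine for small C; the toy tensors are illustrative):

```python
import numpy as np
from itertools import permutations

def bce(y, p, eps=1e-8):
    """Binary cross entropy summed over speakers for one frame."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

def pit_diar_loss(y_true, y_pred):
    """Permutation-invariant diarization loss: minimum BCE over all
    speaker-order permutations phi, normalized by T * C."""
    T, C = y_true.shape
    return min(
        sum(bce(y_true[t, list(phi)], y_pred[t]) for t in range(T))
        for phi in permutations(range(C))
    ) / (T * C)

y_true = np.array([[1, 0], [1, 1], [0, 1]])              # reference activities
y_pred = np.array([[0.1, 0.9], [0.8, 0.9], [0.9, 0.2]])  # speakers swapped
# The swapped permutation gives the lower loss, so the model is not
# penalized for emitting speakers in a different order.
loss = pit_diar_loss(y_true, y_pred)
```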
Architecture – EEND-EDA
• To overcome this difficulty, EEND-EDA was
proposed, which handles a flexible
number of speakers by predicting speaker
existence.
• An LSTM-based encoder-decoder generates
speaker-wise attractors a_1, a_2, …
until a stopping criterion is satisfied.
• The number of speakers is estimated
using the attractor existence probabilities
p_c = σ(Linear(a_c)) (shown as p in the figure).
• The posterior speech activity probabilities can be
calculated by a matrix multiplication between the
frame embeddings and the attractors: ŷ_{t,c} = σ(e_t^T a_c).
We obtain the speech activity ŷ_{t,c} for each
speaker c and time frame t by comparing the
posterior probability with a given threshold θ.
Here the label sequence l = [1, …, 1, 0]
indicates that the first C attractors
correspond to active speakers.
The attractor existence loss is defined as
L_exist = (1/(C+1)) · BCE(l, p).
During inference, an attractor is counted toward Ĉ
if its existence probability p_c is greater than a threshold τ.
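The thresholding steps above can be sketched with random stand-ins for the learned quantities (E, A, w, b are hypothetical placeholders for the model's embeddings, attractors, and linear layer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical frame embeddings E and candidate attractors A; w, b stand
# in for the learned linear layer of the existence estimator.
rng = np.random.default_rng(1)
D, T = 4, 6
E = rng.standard_normal((T, D))   # frame embeddings e_t
A = rng.standard_normal((3, D))   # candidate attractors a_c
w, b = rng.standard_normal(D), 0.0

# Existence probability per attractor: p_c = sigmoid(w^T a_c + b).
p = sigmoid(A @ w + b)

# Posterior activities via one matrix multiplication: sigmoid(E A^T).
posteriors = sigmoid(E @ A.T)     # shape (T, num_attractors)

# Inference-time counting: attractors with p_c above tau are kept.
tau, theta = 0.5, 0.5
C_hat = int((p > tau).sum())
activities = (posteriors[:, :C_hat] > theta).astype(int)
```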
Architecture – Conv-TasNet
– Conv-TasNet consists of:
  • Encoder, decoder, and separator (mask estimation)
  • A temporal convolutional network (TCN)
  • Dilated convolutional blocks, residual paths, skip-connections, …
– It is a benchmark model for SS.
– The loss function uses scale-invariant SDR with permutation-invariant training.
(Figure: normal vs. dilated convolution, with residual connections)
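A minimal numpy sketch of the scale-invariant SDR metric itself (a simplified single-source version, without the permutation search):

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant SDR in dB: project the estimate onto the
    reference, then compare target energy with residual distortion."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    distortion = estimate - target
    return 10 * np.log10((target @ target) / (distortion @ distortion + eps))

rng = np.random.default_rng(0)
s = rng.standard_normal(8000)
# A rescaled copy of the reference is a perfect estimate under SI-SDR,
# which is exactly the scale invariance the metric is named for.
perfect = 0.5 * s
noisy = s + rng.standard_normal(8000)
```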
Architecture – EEND-SS
– It includes the encoder, separator, decoder, and diarization modules.
– The encoder, separator, and decoder modules in EEND-SS are equivalent to those
of Conv-TasNet, except for the last 1×1 convolution layer in the separator.
– The diarization module architecture is the same as EEND-EDA except for the input
features. EEND-SS uses learned TCN bottleneck features.
– Optionally, the diarization module may also include the log-mel filterbank features
(LMF) concatenated with TCN bottleneck features (depicted as dotted lines).
– In EEND-SS, to allow masks for a variable number of speakers, multiple 1×1
convolutional layers (MCL) are used.
– For the number of speakers C, the oracle number is used during training, and the
number Ĉ estimated by the diarization module is used during inference.
Architecture – Training, Inference
– For joint training, EEND-SS is trained with a multi-task cost function:
  L = γ₁·L_ss + γ₂·L_diar + γ₃·L_exist,
  where γ₁, γ₂, γ₃ are weights chosen empirically.
– For inference, we utilize the following 2-pass procedure:
1) Obtain the SD result and the estimated number of speakers Ĉ from the input
speech mixture.
2) Select the MCL branch corresponding to Ĉ masks, then obtain the separated
speech signals.
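A schematic of this 2-pass procedure, with `diarize` and the per-count mask heads `mcl` as hypothetical stand-ins for the trained modules:

```python
import numpy as np

def diarize(features):
    # Pass-1 stand-in: return speech activities and an estimated count.
    activities = (features.mean(axis=1, keepdims=True) > 0).astype(int)
    return activities, 2

def two_pass_inference(mixture_features, mcl):
    # Pass 1: diarization yields activities and the speaker count C_hat.
    activities, c_hat = diarize(mixture_features)
    # Pass 2: pick the 1x1-conv head matching C_hat to emit C_hat masks.
    mask_head = mcl[c_hat]
    masks = mask_head(mixture_features)
    return activities, masks

# Hypothetical heads: one per supported speaker count, each emitting
# that many (uniform, toy) masks over the bottleneck features.
mcl = {c: (lambda f, c=c: np.stack([np.full(f.shape, 1.0 / c)] * c))
       for c in (1, 2, 3)}
feats = np.ones((10, 4))
acts, masks = two_pass_inference(feats, mcl)
```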
Architecture – Post-processing
– We also propose post-processing to refine the separated speech signals with
the estimated speech activity, formulated as:
  s̃_c = ŷ_c ⊙ ŝ_{ξ(c)},
where ⊙ denotes element-wise multiplication, ŷ_c is the speech activity upsampled
to waveform length, and ξ aligns the separation outputs to the diarized speakers
via a correlation function.
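A sketch of this refinement, assuming ξ is obtained by correlating each separated signal's energy with each upsampled activity (the toy signals and function names are illustrative):

```python
import numpy as np

def upsample(activity, length):
    """Repeat frame-level 0/1 activity to waveform length."""
    reps = int(np.ceil(length / len(activity)))
    return np.repeat(activity, reps)[:length]

def postprocess(separated, activities):
    C, length = separated.shape
    ups = np.stack([upsample(a, length) for a in activities.T])
    # Correlate each signal's energy with each upsampled activity to
    # align separation outputs with diarized speakers (the mapping xi).
    energy = separated ** 2
    corr = energy @ ups.T
    xi = corr.argmax(axis=1)
    # Refine: mask each separated signal with its matched activity.
    return separated * ups[xi]

sep = np.array([[1.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 1.0]])
acts = np.array([[1, 0], [0, 1]])  # frame-level activities (T'=2, C=2)
out = postprocess(sep, acts)
```

With perfectly matching activities, the refinement leaves the toy signals unchanged; with noisy separations, it zeroes out residual leakage in frames where the matched speaker is silent.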
Experiments – Datasets
– For the training and evaluation, we used the LibriMix dataset.
– LibriMix uses speech samples from LibriSpeech train-clean-100/dev-clean/test-clean
and the noise samples from WHAM!.
– Libri2Mix / Libri3Mix
– https://arxiv.org/pdf/2005.11262.pdf
Experiments – Configurations
– For the SS (separation) module:
  • Encoder/Decoder: kernel size 16, representation dim 512
  • Separator: feature dim 128, 3 repeats, 8 conv-blocks per repeat
– For the SD (diarization) module:
  • Input: 2D-conv with 1/8 sub-sampling
  • Transformer: 4 repeats, 4 attention heads
Experiments – Configurations
– For EEND-SS:
  • LMF: dim 80, window length 512, shift 64
  • Thresholds θ, τ: 0.5
  • Weights γ₁, γ₂, γ₃: 1, 0.2, 0.2 (chosen empirically)
  • Learning rate: 10⁻³, halved if no improvement
  • Batch size: 16
Results – Fixed
– We show the effectiveness of joint speech separation and speaker diarization based
on the proposed method for fixed numbers of speakers.
Results – Fixed
– We tested our proposed method on
Libri2Mix max mode to compare the
diarization performance with EEND-
based models reported in SUPERB.
https://superbbenchmark.org/
– The results indicate room for further
improvement using self-supervised
features instead of LMF, which is left
for future work.
Results – Flexible
– Next, we evaluated our method on the
2/3-speaker mixed condition created by
combining the Libri2Mix and Libri3Mix datasets.
– In this experiment, the number of reference
speech signals and the separated speech
signals may differ due to speaker counting
error.
– To handle this mismatch, we append silent audio
signals to the reference.
– EEND-SS outperformed the baseline
methods in all the metrics, even when Conv-
TasNet was given the oracle number of speakers.
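The silence-padding step can be sketched as follows (function name hypothetical):

```python
import numpy as np

def pad_references(references, num_estimated):
    """When the estimated speaker count exceeds the reference count,
    append silent (all-zero) reference signals so both sets have equal
    size before computing separation metrics."""
    c_ref, length = references.shape
    if num_estimated <= c_ref:
        return references
    silence = np.zeros((num_estimated - c_ref, length))
    return np.vstack([references, silence])

refs = np.ones((2, 5))       # two reference signals of 5 samples each
padded = pad_references(refs, 3)   # model over-counted: 3 speakers
```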
Conclusions
– In this paper, we proposed a framework for jointly and directly optimizing speaker
diarization, speech separation, and speaker counting.
– Additionally, we proposed a post-processing mechanism for refining separated
speech with speech activity.
– The experimental results show that the proposed method outperforms baseline
systems evaluated with both fixed and flexible numbers of speakers.
– Future work includes using other separation techniques, as well as using the
features from self-supervised pretrained models.
Conclusions
– SSGD – Speech Separation guided Diarization
http://staff.ustc.edu.cn/~jundu/Publications/publications/xinfang.pdf
– CSS – Continuous Speech Separation
https://arxiv.org/pdf/2001.11482.pdf
– WaveSplit (uses speaker embeddings to gain performance on SS)
https://arxiv.org/pdf/2002.08933.pdf