EEND-SS is a joint model that performs speaker diarization, speech separation, and speaker counting in an end-to-end manner. It combines the EEND-EDA diarization model and Conv-TasNet separation model. EEND-SS uses a multi-task learning approach and multiple 1x1 convolution layers to estimate separation masks for a flexible number of speakers. Experimental results on the LibriMix dataset show it outperforms baseline methods in metrics like SDRi and DER, handling both a fixed and variable number of speakers.
Introduction
– Tasks:
  • Speaker diarization (SD): who spoke when
  • Speech separation (SS): (who) spoke what
– Nowadays, fully end-to-end neural diarization (EEND) systems can handle speaker
overlap by training on overlapped speech data.
– One drawback of EEND is that the number of speakers must be known and fixed
beforehand.
– To cope with that, one can either 1) set a maximum number of speakers, or
2) extract speakers iteratively, one at a time.
– EEND with Encoder-Decoder-based Attractor calculation (EEND-EDA) uses LSTM
encoder-decoder-based attractors to count speakers within diarization, handling
a flexible number of speakers.
Introduction
– Even though speaker diarization and speech separation are often used together,
their optimal order is not fixed; it varies with the scenario and dataset.
– Our solution is to unify these models as a single neural network and jointly train it
with multi-task learning so that both tasks can benefit from each other.
– We combine Conv-TasNet and EEND-EDA into a novel model, named EEND-SS.
– We also propose the multiple 1×1 convolutional layer (MCL) architecture for
estimating the separation masks corresponding to the number of speakers, and a
post-processing technique for refining the separated speech signals.
– Experimental results show that EEND-SS can improve both tasks.
Architecture
– First, we define the tasks the model tackles.
– Given a mixture x = Σ_{c=1..C} s_c + n of C speaker sources s_c and noise n:
– SD aims to estimate the frame-level speech activities ŷ_{t,c} ∈ {0, 1}.
– SS aims to estimate the individual source signals ŝ_1, …, ŝ_C.
– Speaker counting aims to estimate the number of speakers Ĉ.
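As a toy illustration of these three targets (the shapes and names below are hypothetical, not from the paper), a mixture and its reference labels can be laid out as follows:

```python
import numpy as np

# Toy illustration (not the model): C = 2 speaker sources plus noise.
rng = np.random.default_rng(0)
C, T = 2, 8000                                 # speakers, waveform samples
sources = 0.1 * rng.standard_normal((C, T))    # s_1 ... s_C (SS targets)
noise = 0.01 * rng.standard_normal(T)          # n

# The observed mixture is the sum of all sources plus the noise.
x = sources.sum(axis=0) + noise

# SD target: frame-level activities y[t, c] in {0, 1}; here speaker 0
# talks in the first half of the frames and speaker 1 in the second.
frames = 100
y = np.zeros((frames, C), dtype=int)
y[: frames // 2, 0] = 1
y[frames // 2 :, 1] = 1

# Speaker-counting target: C itself.
```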
Architecture – EEND
– EEND is a method for estimating the
speech activities of each speaker from a
multiple speaker input mixture using an
end-to-end neural network.
– This version is the self-attentive one
(SA-EEND).
Architecture – EEND
– The loss function is defined as:
  L_diar = (1/(T·C)) · min_{φ ∈ Φ} Σ_{t=1..T} BCE(y_t^φ, ŷ_t),
  where Φ is the set of all possible speaker permutations.
– BCE is the binary cross entropy, defined as:
  BCE(y_t, ŷ_t) = Σ_{c=1..C} [ −y_{t,c} log ŷ_{t,c} − (1 − y_{t,c}) log(1 − ŷ_{t,c}) ]
– As before, a drawback of EEND is that the number of speakers is fixed.
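A minimal numpy sketch of this permutation-invariant loss, brute-forcing the minimum over permutations (fine for small C; the toy tensors are illustrative):

```python
import numpy as np
from itertools import permutations

def bce(y, p, eps=1e-8):
    """Binary cross entropy summed over speakers for one frame."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

def pit_diar_loss(y_true, y_pred):
    """Permutation-invariant diarization loss: minimum BCE over all
    speaker-order permutations phi, normalized by T * C."""
    T, C = y_true.shape
    return min(
        sum(bce(y_true[t, list(phi)], y_pred[t]) for t in range(T))
        for phi in permutations(range(C))
    ) / (T * C)

y_true = np.array([[1, 0], [1, 1], [0, 1]])              # reference activities
y_pred = np.array([[0.1, 0.9], [0.8, 0.9], [0.9, 0.2]])  # speakers swapped
# The swapped permutation gives the lower loss, so the model is not
# penalized for emitting speakers in a different order.
loss = pit_diar_loss(y_true, y_pred)
```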
Architecture – EEND-EDA
• To overcome this difficulty, EEND-EDA was
proposed, which handles a flexible
number of speakers by predicting speaker
existence.
• An LSTM-based encoder-decoder generates
speaker-wise attractors a_1, a_2, …
until a stopping criterion is satisfied.
• The number of speakers is estimated
using the attractor existence probabilities
p_c = σ(Linear(a_c)) (shown as p in the figure).
• The posterior speech activity probabilities can be
calculated by a matrix multiplication between the
frame embeddings and the attractors: ŷ_{t,c} = σ(e_t^T a_c).
We obtain the speech activity ŷ_{t,c} for each
speaker c and time frame t by comparing the
posterior probability with a given threshold θ.
Here the label sequence l = [1, …, 1, 0]
indicates that the first C attractors
correspond to active speakers.
The attractor existence loss is defined as
L_exist = (1/(C+1)) · BCE(l, p).
During inference, an attractor is counted toward Ĉ
if its existence probability p_c is greater than a threshold τ.
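The thresholding steps above can be sketched with random stand-ins for the learned quantities (E, A, w, b are hypothetical placeholders for the model's embeddings, attractors, and linear layer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical frame embeddings E and candidate attractors A; w, b stand
# in for the learned linear layer of the existence estimator.
rng = np.random.default_rng(1)
D, T = 4, 6
E = rng.standard_normal((T, D))   # frame embeddings e_t
A = rng.standard_normal((3, D))   # candidate attractors a_c
w, b = rng.standard_normal(D), 0.0

# Existence probability per attractor: p_c = sigmoid(w^T a_c + b).
p = sigmoid(A @ w + b)

# Posterior activities via one matrix multiplication: sigmoid(E A^T).
posteriors = sigmoid(E @ A.T)     # shape (T, num_attractors)

# Inference-time counting: attractors with p_c above tau are kept.
tau, theta = 0.5, 0.5
C_hat = int((p > tau).sum())
activities = (posteriors[:, :C_hat] > theta).astype(int)
```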
Architecture – Conv-TasNet
– Conv-TasNet consists of:
  • Encoder, decoder, and separator (mask estimation)
  • A temporal convolutional network (TCN)
  • Dilated convolutional blocks, residual paths, skip-connections, …
– It is a benchmark model for SS.
– The loss function uses scale-invariant SDR with permutation-invariant training.
(Figure: normal vs. dilated convolution, with residual connections)
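A minimal numpy sketch of the scale-invariant SDR metric itself (a simplified single-source version, without the permutation search):

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant SDR in dB: project the estimate onto the
    reference, then compare target energy with residual distortion."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    distortion = estimate - target
    return 10 * np.log10((target @ target) / (distortion @ distortion + eps))

rng = np.random.default_rng(0)
s = rng.standard_normal(8000)
# A rescaled copy of the reference is a perfect estimate under SI-SDR,
# which is exactly the scale invariance the metric is named for.
perfect = 0.5 * s
noisy = s + rng.standard_normal(8000)
```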
Architecture – EEND-SS
– It includes the encoder, separator, decoder, and diarization modules.
– The encoder, separator, and decoder modules in EEND-SS are equivalent to those
of Conv-TasNet, except for the last 1×1 convolution layer in the separator.
– The diarization module architecture is the same as EEND-EDA except for the input
features. EEND-SS uses learned TCN bottleneck features.
– Optionally, the diarization module may also include the log-mel filterbank features
(LMF) concatenated with TCN bottleneck features (depicted as dotted lines).
– In EEND-SS, to allow masks for a variable number of speakers, multiple 1×1
convolutional layers (MCL) are used.
– For the number of speakers C, the oracle number is used during training, and the
number Ĉ estimated by the diarization module is used during inference.
Architecture – Training, Inference
– For joint training, EEND-SS is trained with a multi-task cost function:
  L = γ₁·L_ss + γ₂·L_diar + γ₃·L_exist,
  where γ₁, γ₂, γ₃ are weights chosen empirically.
– For inference, we utilize the following 2-pass procedure:
1) Obtain the SD result and the estimated number of speakers Ĉ from the input
speech mixture.
2) Select the MCL branch corresponding to Ĉ masks, then obtain the separated
speech signals.
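A schematic of this 2-pass procedure, with `diarize` and the per-count mask heads `mcl` as hypothetical stand-ins for the trained modules:

```python
import numpy as np

def diarize(features):
    # Pass-1 stand-in: return speech activities and an estimated count.
    activities = (features.mean(axis=1, keepdims=True) > 0).astype(int)
    return activities, 2

def two_pass_inference(mixture_features, mcl):
    # Pass 1: diarization yields activities and the speaker count C_hat.
    activities, c_hat = diarize(mixture_features)
    # Pass 2: pick the 1x1-conv head matching C_hat to emit C_hat masks.
    mask_head = mcl[c_hat]
    masks = mask_head(mixture_features)
    return activities, masks

# Hypothetical heads: one per supported speaker count, each emitting
# that many (uniform, toy) masks over the bottleneck features.
mcl = {c: (lambda f, c=c: np.stack([np.full(f.shape, 1.0 / c)] * c))
       for c in (1, 2, 3)}
feats = np.ones((10, 4))
acts, masks = two_pass_inference(feats, mcl)
```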
Architecture – Post-processing
– We also propose post-processing to refine the separated speech signals with
the estimated speech activity, formulated as:
  s̃_c = ŷ_c ⊙ ŝ_{ξ(c)},
where ⊙ denotes element-wise multiplication, ŷ_c is the speech activity upsampled
to waveform length, and ξ aligns the separation outputs to the diarized speakers
via a correlation function.
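A sketch of this refinement, assuming ξ is obtained by correlating each separated signal's energy with each upsampled activity (the toy signals and function names are illustrative):

```python
import numpy as np

def upsample(activity, length):
    """Repeat frame-level 0/1 activity to waveform length."""
    reps = int(np.ceil(length / len(activity)))
    return np.repeat(activity, reps)[:length]

def postprocess(separated, activities):
    C, length = separated.shape
    ups = np.stack([upsample(a, length) for a in activities.T])
    # Correlate each signal's energy with each upsampled activity to
    # align separation outputs with diarized speakers (the mapping xi).
    energy = separated ** 2
    corr = energy @ ups.T
    xi = corr.argmax(axis=1)
    # Refine: mask each separated signal with its matched activity.
    return separated * ups[xi]

sep = np.array([[1.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 1.0]])
acts = np.array([[1, 0], [0, 1]])  # frame-level activities (T'=2, C=2)
out = postprocess(sep, acts)
```

With perfectly matching activities, the refinement leaves the toy signals unchanged; with noisy separations, it zeroes out residual leakage in frames where the matched speaker is silent.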
Experiments – Datasets
– For the training and evaluation, we used the LibriMix dataset.
– LibriMix uses speech samples from LibriSpeech train-clean-100/dev-clean/test-clean
and the noise samples from WHAM!.
– Libri2Mix / Libri3Mix
– https://arxiv.org/pdf/2005.11262.pdf
Experiments – Configurations
– For the SS (separation) module:
  • Encoder/Decoder: kernel size 16, representation dim 512
  • Separator: feature dim 128, 3 repeats, 8 conv-blocks per repeat
– For the SD (diarization) module:
  • Input: 2D-conv with 1/8 sub-sampling
  • Transformer: 4 repeats, 4 attention heads
Experiments – Configurations
– For EEND-SS:
  • LMF: dim 80, window length 512, shift 64
  • Thresholds θ, τ: 0.5
  • Weights γ₁, γ₂, γ₃: 1, 0.2, 0.2 (chosen empirically)
  • Learning rate: 10⁻³, halved if no improvement
  • Batch size: 16
Results – Fixed
– We show the effectiveness of joint speech separation and speaker diarization based
on the proposed method for fixed numbers of speakers.
Results – Fixed
– We tested our proposed method on
Libri2Mix max mode to compare the
diarization performance with EEND-
based models reported in SUPERB.
https://superbbenchmark.org/
– The results indicate room for further
improvement using self-supervised
features instead of LMF, which is left
for future work.
Results – Flexible
– Next, we evaluated our method on the
2/3-speaker mixed condition created by
combining the Libri2Mix and Libri3Mix datasets.
– In this experiment, the number of reference
speech signals and the separated speech
signals may differ due to speaker counting
error.
– To handle this mismatch, we append silent audio
signals to the reference.
– EEND-SS outperformed the baseline
methods in all the metrics, even when Conv-
TasNet was given the oracle number of speakers.
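The silence-padding step can be sketched as follows (function name hypothetical):

```python
import numpy as np

def pad_references(references, num_estimated):
    """When the estimated speaker count exceeds the reference count,
    append silent (all-zero) reference signals so both sets have equal
    size before computing separation metrics."""
    c_ref, length = references.shape
    if num_estimated <= c_ref:
        return references
    silence = np.zeros((num_estimated - c_ref, length))
    return np.vstack([references, silence])

refs = np.ones((2, 5))       # two reference signals of 5 samples each
padded = pad_references(refs, 3)   # model over-counted: 3 speakers
```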
Conclusions
– In this paper, we proposed a framework for jointly and directly optimizing speaker
diarization, speech separation, and speaker counting.
– Additionally, we proposed a post-processing mechanism for refining separated
speech with speech activity.
– The experimental results show that the proposed method outperforms baseline
systems evaluated with both fixed and flexible numbers of speakers.
– Future work includes using other separation techniques, as well as using the
features from self-supervised pretrained models.
Conclusions
– SSGD – Speech Separation guided Diarization
http://staff.ustc.edu.cn/~jundu/Publications/publications/xinfang.pdf
– CSS – Continuous Speech Separation
https://arxiv.org/pdf/2001.11482.pdf
– WaveSplit (uses speaker embeddings to gain performance on SS)
https://arxiv.org/pdf/2002.08933.pdf