SlideShare a Scribd company logo
1 of 33
Download to read offline
EEND-SS:
Joint End-to-End Neural Speaker Diarization and
Speech Separation for Flexible Number of Speakers
Authors: Yushi Ueda, Soumi Maiti, Shinji Watanabe, Chunlei Zhang,
Meng Yu, Shi-Xiong Zhang, Yong Xu
Published in arXiv:2203.17068v1 [eess.AS] 31 Mar 2022
Presenter: 何冠勳
1
Table of Contents
– Introduction
– Architecture
– Experiments
– Results
– Conclusions
2
Introduction
Why & What ?
3
Introduction
– Task of
Speaker diarization (SD) : who spoke when
Speech separation (SS) : (who) spoke what
– Nowadays, fully end-to-end neural diarization (EEND) systems can handle speaker
overlap by training with the speaker overlap data.
– One drawback of EEND is the number of speakers must be known and fixed
beforehand.
– To cope with that, either 1) set a maximum speaker or 2) extract and listen
iteratively.
– EEND with Encoder-Decoder-based Attractor calculation (EEND-EDA) uses LSTM
encoder-decoder based attractors to estimate the maximum speaker counting
within diarization.
4
Introduction
– Even though speaker diarization and speech separation are often used together,
their optimal order is not fixed, and ordering varies with the scenario and dataset.
– Our solution is to unify these models as a single neural network and jointly train it
with multi-task learning so that both tasks can benefit from each other.
– We combine Conv-TasNet & EEND-EDA as a novel model, entitled EEND-SS.
– We also propose the multiple 1×1 convolutional layer (MCL) architecture for
estimating the separation masks corresponding to the number of speakers, and a
post-processing technique for refining the separated speech signals.
– Experimental results show that EEND-SS can improve both task.
5
Architecture
- EEND
- EEND-EDA
- Conv-TasNet
- EEND-SS
- Training, Inference
- Post-prosessing
6
Architecture
– Before that, we define the tasks model tackles.
– Given mixture of speakers and noises ,
– SD aims to estimate .
– SS aims to estimate .
– Speaker counting aims to estimate the number of speakers .
7
Architecture – EEND
– EEND is a method for estimating the
speech activities of each speaker from a
multiple speaker input mixture using an
end-to-end neural network.
– This version is the self-attentive one
(SA-EEND).
PIT
Acoustic Features
Embeddings
t
C
Posterior Speech Activity Prob.
Ground Truth
9
Architecture – EEND
– Loss function is defined as:
,where is the set of all possible permutations
– is the binary cross entropy, defined as:
– Same, drawback of EEND is that the number of speakers is fixed.
10
Architecture – EEND-EDA • To overcome this difficulty, EEND-EDA was
proposed, which handles a flexible
number of speakers by predicting speaker
existence.
• LSTM-based encoder-decoder generates
speaker-wise attractors
until a stopping criterion is satisfied.
• The number of speakers is estimated by
using (equivalent to p on the figure).
• The posterior probabilities can be
calculated by matrix multiplication.
We obtain for each speaker and time
frame, by comparing the posterior
probability with the given threshold
.
Here indicates C speakers is active.
The loss function is defined as
.
During inference, C is counted if p
is greater than a threshold .
12
Architecture – Conv-TasNet
13
Architecture – Conv-TasNet
– Conv-TasNet consists of
üEncoder, decoder, separator (mask estimation)
üTemporal convolutional network (TCN)
üDilated convolutional blocks, residual path, skip-connection …
– It is the benchmark model on SS.
– The loss function uses
Normal convolution Dilated convolution
Residual
14
Architecture – EEND-SS
– It includes the encoder, separator, decoder, and diarization modules.
– The encoder, separator, and decoder modules in EEND-SS are equivalent to those
of Conv-TasNet, except for the last 1×1 convolution layer in the separator.
– The diarization module architecture is the same as EEND-EDA except for the input
features. EEND-SS uses learned TCN bottleneck features.
– Optionally, the diarization module may also include the log-mel filterbank features
(LMF) concatenated with TCN bottleneck features (depicted as dotted lines).
– In EEND-SS, to allow masks for a variable number of speakers, MCL are used.
– For C, , the oracle number is used during training, and the number estimated by
the diarization module is used during inference.
16
Architecture – Training, Inference
– For jointly training, EEND-SS is trained with a multi-task cost function
,where are the weights that are chosen empirically.
– For inference, we utilize the following 2-pass procedure:
1) Obtain SD result & the number of speakers from input speech mixture.
2) Select MCL corresponding to number of masks, then obtain separated speech
signals.
17
Architecture – Post-processing
– We also propose to utilize post-processing for refining the separated speech signals
with speech activity, formulated as:
, where denotes element-wise multiplication and denotes a correlation
function.
18
Experiments
- Datasets
- Configurations
- Evaluation metrics
19
Experiments – Datasets
– For the training and evaluation, we used the LibriMix dataset.
– LibriMix uses speech samples from LibriSpeech train-clean100/dev-clean/test-clean
and the noise samples from WHAM!.
– Libri2Mix / Libri3Mix
– https://arxiv.org/pdf/2005.11262.pdf
20
Experiments – Configurations
– For SS section:
– For DS section:
Encoder / Decoder Kernel size 16
Representation dim 512
Separator Feature dim 128
Repeat 3
Conv-block 8 per layer
Input 2D-conv 1/8 sub-sampling
Transformer Repeat 4
Attention heads 4
21
Experiments – Configurations
– For EEND-SS:
LMF dim 80
Spectra length 512
shift 64
𝜽, 𝝉 0.5
𝜸𝟏, 𝜸𝟐 , 𝜸𝟑 1, 0.2, 0.2 empirically
Learning rate 10$%, halved if not improve
Batch 16
22
Experiments – Evaluation Metrics
– SDRi
– SI-SDRi
– STOI
– DER (diarization error rate)
https://pyannote.github.io/pyannote-metrics/reference.html#diarization
– SCA (speaker counting accuracy)
23
Results
- Fixed number of speakers
- Flexible number of speakers
24
Results – Fixed
– We show the effectiveness of joint speech separation and speaker diarization based
on the proposed method for fixed numbers of speakers.
25
Results – Fixed
– We tested our proposed method on
Libri2Mix max mode to compare the
diarization performance with EEND-
based models reported in SUPERB.
https://superbbenchmark.org/
– The results indicate room for further
improvement using self-supervised
features instead of LMF, which is left
for future work.
26
Result – Flexible
– Next, we evaluated our method on the
2/3speaker mixed condition created by
combining both Libri2/3Mix datasets.
– In this experiment, the number of reference
speech signals and the separated speech
signals may differ due to speaker counting
error.
– To solve that, we append silent audio
signals to the reference.
– EEND-SS outperformed the baseline
methods in all the metrics, even if Conv-
TasNet used the oracle number.
27
Conclusions
What has been done?
28
Conclusions
– In this paper, we proposed a framework for jointly and directly optimizing speaker
diarization, speech separation, and speaker counting.
– Additionally, a post-processing mechanism is proposed for refining separated speech
with speech activity.
– The experimental results show that the proposed method outperforms baseline
systems evaluated with both fixed and flexible numbers of speakers.
– Future work includes using other separation techniques, as well as using the
features from self-supervised pretrained models.
29
Conclusions
– SSGD – Speech Separation guided Diarization
http://staff.ustc.edu.cn/~jundu/Publications/publications/xinfang.pdf
– CSS – Continuous Speech Separation
https://arxiv.org/pdf/2001.11482.pdf
– WaveSplit (uses speaker embeddings to gain performance on SS)
https://arxiv.org/pdf/2002.08933.pdf
Separation results on Delta dataset:
AOI / Inter-training / Elderly
31
Pretrained on
Improved_Sudormrf
_U36_Bases4096_W
HAMRexclmark.pt
Spk1 Spk2 Original
AOI
Inter-training
Elderly
THANK YOU
Any questions?
You can find me at
📨 jasonho610@gmail.com 🥽 NTNU-SMIL

More Related Content

Similar to EEND-SS.pdf

IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)IRJET Journal
 
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...kevig
 
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...kevig
 
Low power fpga solution for dab audio decoder
Low power fpga solution for dab audio decoderLow power fpga solution for dab audio decoder
Low power fpga solution for dab audio decodereSAT Publishing House
 
Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R...
Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R...Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R...
Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R...IJERA Editor
 
Speech Separation under Reverberant Condition.pdf
Speech Separation under Reverberant Condition.pdfSpeech Separation under Reverberant Condition.pdf
Speech Separation under Reverberant Condition.pdfssuser849b73
 
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...ijnlc
 
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIO...
ANALYZING ARCHITECTURES FOR NEURAL  MACHINE TRANSLATION USING LOW  COMPUTATIO...ANALYZING ARCHITECTURES FOR NEURAL  MACHINE TRANSLATION USING LOW  COMPUTATIO...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIO...kevig
 
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...kevig
 
111111111111111111111111111111111789.ppt
111111111111111111111111111111111789.ppt111111111111111111111111111111111789.ppt
111111111111111111111111111111111789.pptAllamJayaPrakash
 
111111111111111111111111111111111789.ppt
111111111111111111111111111111111789.ppt111111111111111111111111111111111789.ppt
111111111111111111111111111111111789.pptAllamJayaPrakash
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPAnil Bohare
 
Realization and design of a pilot assist decision making system based on spee...
Realization and design of a pilot assist decision making system based on spee...Realization and design of a pilot assist decision making system based on spee...
Realization and design of a pilot assist decision making system based on spee...csandit
 
DESIGN OF DOUBLE PRECISION FLOATING POINT MULTIPLICATION ALGORITHM WITH VECTO...
DESIGN OF DOUBLE PRECISION FLOATING POINT MULTIPLICATION ALGORITHM WITH VECTO...DESIGN OF DOUBLE PRECISION FLOATING POINT MULTIPLICATION ALGORITHM WITH VECTO...
DESIGN OF DOUBLE PRECISION FLOATING POINT MULTIPLICATION ALGORITHM WITH VECTO...jmicro
 
Copy of colloquium 3 latest
Copy of  colloquium 3 latestCopy of  colloquium 3 latest
Copy of colloquium 3 latestshaik fairooz
 

Similar to EEND-SS.pdf (20)

IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)FORECASTING MUSIC GENRE (RNN - LSTM)
FORECASTING MUSIC GENRE (RNN - LSTM)
 
Real Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform DomainReal Time Speech Enhancement in the Waveform Domain
Real Time Speech Enhancement in the Waveform Domain
 
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...
STREAMING PUNCTUATION: A NOVEL PUNCTUATION TECHNIQUE LEVERAGING BIDIRECTIONAL...
 
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...
Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional...
 
Low power fpga solution for dab audio decoder
Low power fpga solution for dab audio decoderLow power fpga solution for dab audio decoder
Low power fpga solution for dab audio decoder
 
Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R...
Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R...Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R...
Compressive Sensing in Speech from LPC using Gradient Projection for Sparse R...
 
Speech Separation under Reverberant Condition.pdf
Speech Separation under Reverberant Condition.pdfSpeech Separation under Reverberant Condition.pdf
Speech Separation under Reverberant Condition.pdf
 
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
 
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIO...
ANALYZING ARCHITECTURES FOR NEURAL  MACHINE TRANSLATION USING LOW  COMPUTATIO...ANALYZING ARCHITECTURES FOR NEURAL  MACHINE TRANSLATION USING LOW  COMPUTATIO...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIO...
 
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
 
111111111111111111111111111111111789.ppt
111111111111111111111111111111111789.ppt111111111111111111111111111111111789.ppt
111111111111111111111111111111111789.ppt
 
111111111111111111111111111111111789.ppt
111111111111111111111111111111111789.ppt111111111111111111111111111111111789.ppt
111111111111111111111111111111111789.ppt
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMP
 
microprocessor
 microprocessor microprocessor
microprocessor
 
Realization and design of a pilot assist decision making system based on spee...
Realization and design of a pilot assist decision making system based on spee...Realization and design of a pilot assist decision making system based on spee...
Realization and design of a pilot assist decision making system based on spee...
 
H0814247
H0814247H0814247
H0814247
 
DESIGN OF DOUBLE PRECISION FLOATING POINT MULTIPLICATION ALGORITHM WITH VECTO...
DESIGN OF DOUBLE PRECISION FLOATING POINT MULTIPLICATION ALGORITHM WITH VECTO...DESIGN OF DOUBLE PRECISION FLOATING POINT MULTIPLICATION ALGORITHM WITH VECTO...
DESIGN OF DOUBLE PRECISION FLOATING POINT MULTIPLICATION ALGORITHM WITH VECTO...
 
מצגת פרויקט
מצגת פרויקטמצגת פרויקט
מצגת פרויקט
 
Copy of colloquium 3 latest
Copy of  colloquium 3 latestCopy of  colloquium 3 latest
Copy of colloquium 3 latest
 

Recently uploaded

Artificial Intelligence in due diligence
Artificial Intelligence in due diligenceArtificial Intelligence in due diligence
Artificial Intelligence in due diligencemahaffeycheryld
 
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...drjose256
 
Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1T.D. Shashikala
 
Circuit Breakers for Engineering Students
Circuit Breakers for Engineering StudentsCircuit Breakers for Engineering Students
Circuit Breakers for Engineering Studentskannan348865
 
Seizure stage detection of epileptic seizure using convolutional neural networks
Seizure stage detection of epileptic seizure using convolutional neural networksSeizure stage detection of epileptic seizure using convolutional neural networks
Seizure stage detection of epileptic seizure using convolutional neural networksIJECEIAES
 
8th International Conference on Soft Computing, Mathematics and Control (SMC ...
8th International Conference on Soft Computing, Mathematics and Control (SMC ...8th International Conference on Soft Computing, Mathematics and Control (SMC ...
8th International Conference on Soft Computing, Mathematics and Control (SMC ...josephjonse
 
Software Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdfSoftware Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdfssuser5c9d4b1
 
Adsorption (mass transfer operations 2) ppt
Adsorption (mass transfer operations 2) pptAdsorption (mass transfer operations 2) ppt
Adsorption (mass transfer operations 2) pptjigup7320
 
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdfInvolute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdfJNTUA
 
Diploma Engineering Drawing Qp-2024 Ece .pdf
Diploma Engineering Drawing Qp-2024 Ece .pdfDiploma Engineering Drawing Qp-2024 Ece .pdf
Diploma Engineering Drawing Qp-2024 Ece .pdfJNTUA
 
What is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, FunctionsWhat is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, FunctionsVIEW
 
Passive Air Cooling System and Solar Water Heater.ppt
Passive Air Cooling System and Solar Water Heater.pptPassive Air Cooling System and Solar Water Heater.ppt
Passive Air Cooling System and Solar Water Heater.pptamrabdallah9
 
handbook on reinforce concrete and detailing
handbook on reinforce concrete and detailinghandbook on reinforce concrete and detailing
handbook on reinforce concrete and detailingAshishSingh1301
 
Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)Ramkumar k
 
Basics of Relay for Engineering Students
Basics of Relay for Engineering StudentsBasics of Relay for Engineering Students
Basics of Relay for Engineering Studentskannan348865
 
Maximizing Incident Investigation Efficacy in Oil & Gas: Techniques and Tools
Maximizing Incident Investigation Efficacy in Oil & Gas: Techniques and ToolsMaximizing Incident Investigation Efficacy in Oil & Gas: Techniques and Tools
Maximizing Incident Investigation Efficacy in Oil & Gas: Techniques and Toolssoginsider
 
21scheme vtu syllabus of visveraya technological university
21scheme vtu syllabus of visveraya technological university21scheme vtu syllabus of visveraya technological university
21scheme vtu syllabus of visveraya technological universityMohd Saifudeen
 
engineering chemistry power point presentation
engineering chemistry  power point presentationengineering chemistry  power point presentation
engineering chemistry power point presentationsj9399037128
 
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdflitvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdfAlexander Litvinenko
 
UNIT-2 image enhancement.pdf Image Processing Unit 2 AKTU
UNIT-2 image enhancement.pdf Image Processing Unit 2 AKTUUNIT-2 image enhancement.pdf Image Processing Unit 2 AKTU
UNIT-2 image enhancement.pdf Image Processing Unit 2 AKTUankushspencer015
 

Recently uploaded (20)

Artificial Intelligence in due diligence
Artificial Intelligence in due diligenceArtificial Intelligence in due diligence
Artificial Intelligence in due diligence
 
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
Tembisa Central Terminating Pills +27838792658 PHOMOLONG Top Abortion Pills F...
 
Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1Research Methodolgy & Intellectual Property Rights Series 1
Research Methodolgy & Intellectual Property Rights Series 1
 
Circuit Breakers for Engineering Students
Circuit Breakers for Engineering StudentsCircuit Breakers for Engineering Students
Circuit Breakers for Engineering Students
 
Seizure stage detection of epileptic seizure using convolutional neural networks
Seizure stage detection of epileptic seizure using convolutional neural networksSeizure stage detection of epileptic seizure using convolutional neural networks
Seizure stage detection of epileptic seizure using convolutional neural networks
 
8th International Conference on Soft Computing, Mathematics and Control (SMC ...
8th International Conference on Soft Computing, Mathematics and Control (SMC ...8th International Conference on Soft Computing, Mathematics and Control (SMC ...
8th International Conference on Soft Computing, Mathematics and Control (SMC ...
 
Software Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdfSoftware Engineering Practical File Front Pages.pdf
Software Engineering Practical File Front Pages.pdf
 
Adsorption (mass transfer operations 2) ppt
Adsorption (mass transfer operations 2) pptAdsorption (mass transfer operations 2) ppt
Adsorption (mass transfer operations 2) ppt
 
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdfInvolute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
Involute of a circle,Square, pentagon,HexagonInvolute_Engineering Drawing.pdf
 
Diploma Engineering Drawing Qp-2024 Ece .pdf
Diploma Engineering Drawing Qp-2024 Ece .pdfDiploma Engineering Drawing Qp-2024 Ece .pdf
Diploma Engineering Drawing Qp-2024 Ece .pdf
 
What is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, FunctionsWhat is Coordinate Measuring Machine? CMM Types, Features, Functions
What is Coordinate Measuring Machine? CMM Types, Features, Functions
 
Passive Air Cooling System and Solar Water Heater.ppt
Passive Air Cooling System and Solar Water Heater.pptPassive Air Cooling System and Solar Water Heater.ppt
Passive Air Cooling System and Solar Water Heater.ppt
 
handbook on reinforce concrete and detailing
handbook on reinforce concrete and detailinghandbook on reinforce concrete and detailing
handbook on reinforce concrete and detailing
 
Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)Theory of Time 2024 (Universal Theory for Everything)
Theory of Time 2024 (Universal Theory for Everything)
 
Basics of Relay for Engineering Students
Basics of Relay for Engineering StudentsBasics of Relay for Engineering Students
Basics of Relay for Engineering Students
 
Maximizing Incident Investigation Efficacy in Oil & Gas: Techniques and Tools
Maximizing Incident Investigation Efficacy in Oil & Gas: Techniques and ToolsMaximizing Incident Investigation Efficacy in Oil & Gas: Techniques and Tools
Maximizing Incident Investigation Efficacy in Oil & Gas: Techniques and Tools
 
21scheme vtu syllabus of visveraya technological university
21scheme vtu syllabus of visveraya technological university21scheme vtu syllabus of visveraya technological university
21scheme vtu syllabus of visveraya technological university
 
engineering chemistry power point presentation
engineering chemistry  power point presentationengineering chemistry  power point presentation
engineering chemistry power point presentation
 
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdflitvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
litvinenko_Henry_Intrusion_Hong-Kong_2024.pdf
 
UNIT-2 image enhancement.pdf Image Processing Unit 2 AKTU
UNIT-2 image enhancement.pdf Image Processing Unit 2 AKTUUNIT-2 image enhancement.pdf Image Processing Unit 2 AKTU
UNIT-2 image enhancement.pdf Image Processing Unit 2 AKTU
 

EEND-SS.pdf

  • 1. EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers Authors: Yushi Ueda, Soumi Maiti, Shinji Watanabe, Chunlei Zhang, Meng Yu, Shi-Xiong Zhang, Yong Xu Published in arXiv:2203.17068v1 [eess.AS] 31 Mar 2022 Presenter: 何冠勳
  • 2. 1 Table of Contents – Introduction – Architecture – Experiments – Results – Conclusions
  • 4. 3 Introduction – Task of Speaker diarization (SD) : who spoke when Speech separation (SS) : (who) spoke what – Nowadays, fully end-to-end neural diarization (EEND) systems can handle speaker overlap by training with the speaker overlap data. – One drawback of EEND is the number of speakers must be known and fixed beforehand. – To cope with that, either 1) set a maximum speaker or 2) extract and listen iteratively. – EEND with Encoder-Decoder-based Attractor calculation (EEND-EDA) uses LSTM encoder-decoder based attractors to estimate the maximum speaker counting within diarization.
  • 5. 4 Introduction – Even though speaker diarization and speech separation are often used together, their optimal order is not fixed, and ordering varies with the scenario and dataset. – Our solution is to unify these models as a single neural network and jointly train it with multi-task learning so that both tasks can benefit from each other. – We combine Conv-TasNet & EEND-EDA as a novel model, entitled EEND-SS. – We also propose the multiple 1×1 convolutional layer (MCL) architecture for estimating the separation masks corresponding to the number of speakers, and a post-processing technique for refining the separated speech signals. – Experimental results show that EEND-SS can improve both task.
  • 6. 5 Architecture - EEND - EEND-EDA - Conv-TasNet - EEND-SS - Training, Inference - Post-prosessing
  • 7. 6 Architecture – Before that, we define the tasks model tackles. – Given mixture of speakers and noises , – SD aims to estimate . – SS aims to estimate . – Speaker counting aims to estimate the number of speakers .
  • 8. 7 Architecture – EEND – EEND is a method for estimating the speech activities of each speaker from a multiple speaker input mixture using an end-to-end neural network. – This version is the self-attentive one (SA-EEND).
  • 10. 9 Architecture – EEND – Loss function is defined as: ,where is the set of all possible permutations – is the binary cross entropy, defined as: – Same, drawback of EEND is that the number of speakers is fixed.
  • 11. 10 Architecture – EEND-EDA • To overcome this difficulty, EEND-EDA was proposed, which handles a flexible number of speakers by predicting speaker existence. • LSTM-based encoder-decoder generates speaker-wise attractors until a stopping criterion is satisfied. • The number of speakers is estimated by using (equivalent to p on the figure). • The posterior probabilities can be calculated by matrix multiplication.
  • 12. We obtain for each speaker and time frame, by comparing the posterior probability with the given threshold . Here indicates C speakers is active. The loss function is defined as . During inference, C is counted if p is greater than a threshold .
  • 14. 13 Architecture – Conv-TasNet – Conv-TasNet consists of üEncoder, decoder, separator (mask estimation) üTemporal convolutional network (TCN) üDilated convolutional blocks, residual path, skip-connection … – It is the benchmark model on SS. – The loss function uses Normal convolution Dilated convolution Residual
  • 15. 14 Architecture – EEND-SS – It includes the encoder, separator, decoder, and diarization modules. – The encoder, separator, and decoder modules in EEND-SS are equivalent to those of Conv-TasNet, except for the last 1×1 convolution layer in the separator. – The diarization module architecture is the same as EEND-EDA except for the input features. EEND-SS uses learned TCN bottleneck features. – Optionally, the diarization module may also include the log-mel filterbank features (LMF) concatenated with TCN bottleneck features (depicted as dotted lines). – In EEND-SS, to allow masks for a variable number of speakers, MCL are used. – For C, , the oracle number is used during training, and the number estimated by the diarization module is used during inference.
  • 16.
  • 17. 16 Architecture – Training, Inference – For jointly training, EEND-SS is trained with a multi-task cost function ,where are the weights that are chosen empirically. – For inference, we utilize the following 2-pass procedure: 1) Obtain SD result & the number of speakers from input speech mixture. 2) Select MCL corresponding to number of masks, then obtain separated speech signals.
  • 18. 17 Architecture – Post-processing – We also propose to utilize post-processing for refining the separated speech signals with speech activity, formulated as: , where denotes element-wise multiplication and denotes a correlation function.
  • 20. 19 Experiments – Datasets – For the training and evaluation, we used the LibriMix dataset. – LibriMix uses speech samples from LibriSpeech train-clean100/dev-clean/test-clean and the noise samples from WHAM!. – Libri2Mix / Libri3Mix – https://arxiv.org/pdf/2005.11262.pdf
  • 21. 20 Experiments – Configurations – For SS section: – For DS section: Encoder / Decoder Kernel size 16 Representation dim 512 Separator Feature dim 128 Repeat 3 Conv-block 8 per layer Input 2D-conv 1/8 sub-sampling Transformer Repeat 4 Attention heads 4
  • 22. 21 Experiments – Configurations – For EEND-SS: LMF dim 80 Spectra length 512 shift 64 𝜽, 𝝉 0.5 𝜸𝟏, 𝜸𝟐 , 𝜸𝟑 1, 0.2, 0.2 empirically Learning rate 10$%, halved if not improve Batch 16
  • 23. 22 Experiments – Evaluation Metrics – SDRi – SI-SDRi – STOI – DER (diarization error rate) https://pyannote.github.io/pyannote-metrics/reference.html#diarization – SCA (speaker counting accuracy)
  • 24. 23 Results - Fixed number of speakers - Flexible number of speakers
  • 25. 24 Results – Fixed – We show the effectiveness of joint speech separation and speaker diarization based on the proposed method for fixed numbers of speakers.
  • 26. 25 Results – Fixed – We tested our proposed method on Libri2Mix max mode to compare the diarization performance with EEND- based models reported in SUPERB. https://superbbenchmark.org/ – The results indicate room for further improvement using self-supervised features instead of LMF, which is left for future work.
  • 27. 26 Result – Flexible – Next, we evaluated our method on the 2/3speaker mixed condition created by combining both Libri2/3Mix datasets. – In this experiment, the number of reference speech signals and the separated speech signals may differ due to speaker counting error. – To solve that, we append silent audio signals to the reference. – EEND-SS outperformed the baseline methods in all the metrics, even if Conv- TasNet used the oracle number.
  • 29. 28 Conclusions – In this paper, we proposed a framework for jointly and directly optimizing speaker diarization, speech separation, and speaker counting. – Additionally, a post-processing mechanism is proposed for refining separated speech with speech activity. – The experimental results show that the proposed method outperforms baseline systems evaluated with both fixed and flexible numbers of speakers. – Future work includes using other separation techniques, as well as using the features from self-supervised pretrained models.
  • 30. 29 Conclusions – SSGD – Speech Separation guided Diarization http://staff.ustc.edu.cn/~jundu/Publications/publications/xinfang.pdf – CSS – Continuous Speech Separation https://arxiv.org/pdf/2001.11482.pdf – WaveSplit (uses speaker embeddings to gain performance on SS) https://arxiv.org/pdf/2002.08933.pdf
  • 31. Separation results on Delta dataset: AOI / Inter-training / Elderly
  • 33. THANK YOU Any questions? You can find me at 📨 jasonho610@gmail.com 🥽 NTNU-SMIL