Attention Back-end for Automatic Speaker Verification with Multiple Enrollment Utterances Chang Zeng, Xin Wang, Erica Cooper, Xiaoxiao Miao, Junichi Yamagishi Paper ID: 2742
Attention Back-end for Automatic
Speaker Verification with Multiple
Enrollment Utterances
Chang Zeng1,2, Xin Wang1, Erica Cooper1, Xiaoxiao Miao1, Junichi Yamagishi1,2
1National Institute of Informatics, Japan 2SOKENDAI, Japan
Monday, 9th May, 2022
Outline
• 1. Introduction
• 2. Conventional Back-end Models
• 3. Attention Back-end Model
• 3.1 Model architecture
• 3.2 Trials sampling method
• 3.3 Loss functions
• 4. Experimental Results
• 4.1 Datasets
• 4.2 Result and analysis
• 4.3 Ablation study
• 5. Conclusions
1. Introduction
• ASV framework
• Front-end speaker encoder
• TDNN, ECAPA-TDNN
• ResNet series
• Back-end model
• Cosine similarity
• Probabilistic linear discriminant analysis (PLDA)
• Neural PLDA (NPLDA) [1]
• Single enrollment vs. multiple enrollments
  • The multiple-enrollment case is more robust in real-world settings;
  • Multiple enrollments contain more acoustic variations of the enrolled speaker.
• Common operations for the multiple-enrollment case
  • Concatenate waveforms or acoustic features
  • Average speaker embeddings
1. Introduction
• Can we use multiple enrollments more efficiently in a neural back-end?
  • Learn intra-relationships among the speaker embeddings of multiple enrollments;
  • Aggregate a varying number of enrollment speaker embeddings with adaptive weights.
2. Conventional Back-end Models
[Figure: two conventional back-ends. Left: enrollment utterances x(1), x(2), x(3) are concatenated into one input y, passed through the speaker encoder Enc(·), and the resulting embedding is scored against the test embedding q with cosine similarity or PLDA. Right: each utterance is encoded separately into embeddings e1, e2, e3, which are averaged before scoring.]
• In the multiple-enrollment case, the enrollment speaker has an enrollment set of 𝐾 utterances, where 𝐾 varies from speaker to speaker.
• Option 1: concatenate all enrollment utterances into one long waveform.
• Option 2: extract a speaker embedding from each utterance, then average all the speaker embeddings.
Note: PLDA scoring can, in theory, be extended to handle multiple enrollments directly, without concatenation or averaging. For details, please refer to [2,3].
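The average-then-cosine baseline above can be sketched in a few lines of numpy. This is an illustrative reference implementation only; the function name and toy embeddings are not from the paper:

```python
import numpy as np

def cosine_score(enroll_embs, test_emb):
    """Average the K enrollment embeddings into a centroid, then score
    the trial with cosine similarity against the test embedding."""
    centroid = np.mean(enroll_embs, axis=0)
    return float(centroid @ test_emb /
                 (np.linalg.norm(centroid) * np.linalg.norm(test_emb)))

# Toy trial: three enrollment embeddings of one speaker, dimension 4.
enroll = np.array([[1.0,  0.0, 0.0, 0.0],
                   [0.9,  0.1, 0.0, 0.0],
                   [1.1, -0.1, 0.0, 0.0]])
same = cosine_score(enroll, np.array([1.0, 0.0, 0.0, 0.0]))  # near 1
diff = cosine_score(enroll, np.array([0.0, 1.0, 0.0, 0.0]))  # near 0
```

Note that averaging gives every enrollment utterance equal weight regardless of its quality, which is exactly the limitation the attention back-end addresses.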
3. Attention Back-end Model
• Model architecture
  • Extract the enrollment speaker embeddings and the test speaker embedding 𝒒.
[Figure: speaker encoders produce the enrollment embeddings, which pass through scaled-dot self-attention and feed-forward self-attention; the aggregated embedding is compared with the test embedding via Cos(·,·), and the score is calibrated by logistic regression.]
• Exploring intra-relationships among all enrollment speaker embeddings
  • Scaled-dot self-attention (SDSA) [4]
  • All enrollment speaker embeddings are stacked into a matrix 𝑬.
• Aggregating a varying number of enrollment speaker embeddings with adaptive weights
  • Feed-forward self-attention (FFSA) [5]
• Score calibration
  • Logistic regression (LR) for score calibration
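The forward computation described above can be sketched in numpy. This is an illustrative rendition, not the authors' implementation: the projection `W`, vector `v`, and calibration parameters `a`, `b` stand in for learned weights, while the attention formulas follow the standard SDSA [4] and FFSA [5] definitions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_backend_score(E, q, W, v, a=10.0, b=-5.0):
    """Score one trial: E is the (K, D) matrix of enrollment embeddings,
    q the test embedding. W, v, a, b stand in for learned parameters."""
    D = E.shape[1]
    # 1) Scaled-dot self-attention: intra-relationships among enrollments.
    H = softmax(E @ E.T / np.sqrt(D), axis=-1) @ E       # (K, D)
    # 2) Feed-forward self-attention: K adaptive aggregation weights.
    w = softmax(np.tanh(H @ W) @ v)                      # (K,)
    e = w @ H                                            # aggregated embedding, (D,)
    # 3) Cosine similarity, then logistic-regression calibration.
    s = e @ q / (np.linalg.norm(e) * np.linalg.norm(q))
    return 1.0 / (1.0 + np.exp(-(a * s + b)))
```

Because the FFSA weights are a softmax over however many rows E has, the same network handles any number of enrollment utterances K.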
3. Attention Back-end Model
• Trials sampling method for training
  • Why: introduce the multiple-enrollment process into the training stage.
  • How:
    1. Load a speaker-balanced mini-batch from the dataset.
    2. Rearrange the mini-batch into trial pairs of (test, enrollments).
    3. Positive pair: the test trial and the enrollment data come from the same speaker; one utterance is selected as the test trial, and the rest are left for enrollment.
    4. Negative pair: the same test trial is paired with the enrollment sets of the other speakers in the mini-batch.
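The sampling steps above can be sketched on a toy mini-batch (speaker and utterance names are illustrative placeholders for embeddings):

```python
import random

def sample_trials(minibatch):
    """Rearrange a speaker-balanced mini-batch {speaker: [utterances]}
    into (test, enrollment_set, label) trial pairs."""
    trials = []
    for spk, utts in minibatch.items():
        utts = list(utts)
        random.shuffle(utts)
        test, enroll = utts[0], utts[1:]      # one test, the rest enroll
        trials.append((test, enroll, 1))      # positive pair
        for other, other_utts in minibatch.items():
            if other != spk:                  # negative pairs within batch
                trials.append((test, list(other_utts), 0))
    return trials

batch = {"spk_a": ["a1", "a2", "a3"], "spk_b": ["b1", "b2", "b3"]}
trials = sample_trials(batch)   # 2 positive + 2 negative pairs
```

Every mini-batch thus yields both positive and negative multiple-enrollment trials without any extra data loading.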
3. Attention Back-end Model
• Loss functions
  • A weighted sum of the binary cross-entropy (BCE) loss and the generalized end-to-end (GE2E) loss:
    L = (1 − 𝜆)·BCE + 𝜆·GE2E
[Figure: the logistic-regression output and the trial label feed the combined loss (1 − 𝜆)·BCE + 𝜆·GE2E.]
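A minimal sketch of the combined objective. For brevity the GE2E term is passed in as a precomputed scalar; in the actual model it is computed from the speaker embeddings:

```python
import numpy as np

def combined_loss(scores, labels, ge2e_term, lam=0.5):
    """L = (1 - lam) * BCE + lam * GE2E, as on the slide.
    `scores` are calibrated trial scores in (0, 1); `ge2e_term` is a
    precomputed scalar standing in for the GE2E loss."""
    eps = 1e-7
    scores = np.clip(scores, eps, 1.0 - eps)
    bce = -np.mean(labels * np.log(scores)
                   + (1 - labels) * np.log(1.0 - scores))
    return (1.0 - lam) * bce + lam * ge2e_term
```

Setting 𝜆 = 0 or 𝜆 = 1 recovers the BCE-only and GE2E-only ablations reported later.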
4. Experimental Results
• Datasets
  • Single-enrollment case: VoxCeleb1&2 [6,7] (source: https://www.robots.ox.ac.uk/~vgg/data/voxceleb/)
  • Multiple-enrollment case: CNCeleb1&2 [8,9] (source: http://cnceleb.org/dataset)
  • Note: the number of enrollment utterances in the CNCeleb evaluation protocol varies across speakers!
4. Experimental Results
• Speaker encoders
  • Without frame-level attention: TDNN with statistics pooling (TDNN); ResNet34
  • With frame-level attention: TDNN with attentive statistics pooling (TDNN-ASP); ECAPA-TDNN
4. Experimental Results
• Result and analysis
• Single-enrollment case on VoxCeleb
  • Our back-end model is comparable with PLDA in terms of EER, even though there are no intra-relationships to exploit when only one enrollment utterance is available;
  • On minDCF, the proposed model slightly outperforms PLDA, showing that it also works reasonably well in the single-enrollment case.
• Multiple-enrollment case on CNCeleb
  • For the Concat and Mean operations, we report the best performance of the cosine-similarity and PLDA models.
  • Regardless of the criterion, our proposed attention back-end achieves the best performance.
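The EER figures compared above can be computed from lists of target and non-target trial scores. A standard threshold-sweep sketch (not the authors' evaluation code):

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Equal error rate: the threshold at which the false-rejection
    rate (FRR) and false-acceptance rate (FAR) are closest."""
    scores = np.concatenate([np.asarray(target_scores, float),
                             np.asarray(nontarget_scores, float)])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    order = np.argsort(scores)
    labels = labels[order]
    # Sweeping the threshold over the sorted scores:
    frr = np.cumsum(labels) / labels.sum()                   # targets rejected
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # non-targets accepted
    i = np.argmin(np.abs(frr - far))
    return (frr[i] + far[i]) / 2.0
```

minDCF is obtained from the same FRR/FAR sweep, but with application-dependent costs and priors weighting the two error types instead of equating them.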
4. Experimental Results
• Ablation study
  • The second and third rows of the table show that both self-attention mechanisms contributed to the improvement.
  • The fourth row shows that LR-based calibration was another essential component of the proposed model.
  • Comparing the fifth and sixth rows, the model trained with BCE loss only had a lower EER than the one trained with GE2E loss only, while the GE2E-only model gave a lower minDCF(0.001). Combining BCE and GE2E thus yielded the lowest values on all metrics.
[Figure: the full back-end — speaker encoders, scaled-dot self-attention, feed-forward self-attention, Cos(·,·) scoring, logistic regression — trained with (1 − 𝜆)·BCE + 𝜆·GE2E; the ablation replaces attention aggregation with a simple mean.]
5. Conclusions
• Conventional back-end models such as cosine similarity and PLDA do not work well in the multiple-enrollment scenario.
• An utterance-level attention mechanism can further improve performance, even when frame-level and spatial attention mechanisms are already applied in the speaker encoder.
6. References
1) S. Ramoji, P. Krishnan, and S. Ganapathy, "Neural PLDA modeling for end-to-end speaker verification," arXiv preprint arXiv:2008.04527, 2020.
2) K. A. Lee, A. Larcher, C. H. You, B. Ma, and H. Li, "Multi-session PLDA scoring of i-vector for partially open-set speaker detection," in Interspeech, 2013.
3) P. Rajan, A. Afanasyev, V. Hautamäki, and T. Kinnunen, "From single to multiple enrollment i-vectors: Practical PLDA scoring variants for speaker verification," Digital Signal Processing, vol. 31, pp. 93–101, 2014.
4) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in NIPS, 2017.
5) Z. Lin, M. Feng, C. N. dos Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, "A structured self-attentive sentence embedding," in ICLR, 2017.
6) A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Interspeech, 2017.
7) J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in Interspeech, 2018.
8) Y. Fan, J. Kang, L. Li, K. Li, H. Chen, S. Cheng, P. Zhang, Z. Zhou, Y. Cai, and D. Wang, "CN-Celeb: A challenging Chinese speaker recognition dataset," in ICASSP, 2020, pp. 7604–7608.
9) L. Li, R. Liu, J. Kang, Y. Fan, H. Cui, Y. Cai, R. Vipperla, T. F. Zheng, and D. Wang, "CN-Celeb: Multi-genre speaker recognition," arXiv preprint arXiv:2012.12468, 2020.
Thanks!
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
 

Recently uploaded

Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
shyamraj55
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdfAI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
Techgropse Pvt.Ltd.
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
Zilliz
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 

Recently uploaded (20)

Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdfAI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Infrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI modelsInfrastructure Challenges in Scaling RAG with Custom AI models
Infrastructure Challenges in Scaling RAG with Custom AI models
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 

Attention Back-end for Automatic Speaker Verification with Multiple Enrollment Utterances

  • 1. Attention Back-end for Automatic Speaker Verification with Multiple Enrollment Utterances. Chang Zeng1,2, Xin Wang1, Erica Cooper1, Xiaoxiao Miao1, Junichi Yamagishi1,2. 1National Institute of Informatics, Japan; 2SOKENDAI, Japan. Monday, 9 May 2022. Paper ID: 2742.
  • 2. Outline: 1. Introduction • 2. Conventional Back-end Models • 3. Attention Back-end Model (3.1 Model architecture; 3.2 Trials sampling method; 3.3 Loss functions) • 4. Experimental Results (4.1 Datasets; 4.2 Results and analysis; 4.3 Ablation study) • 5. Conclusions
  • 3. 1. Introduction • ASV framework: a front-end speaker encoder (TDNN, ECAPA-TDNN, or the ResNet series) and a back-end model (cosine similarity; probabilistic linear discriminant analysis, PLDA; neural PLDA, NPLDA [1]). • Single enrollment vs. multiple enrollments: the multiple-enrollment case is more robust in real-world settings, and multiple enrollments contain more of the enrolled speaker's acoustic variation. • Common operations for the multiple-enrollment case: concatenate waveforms or acoustic features, or average speaker embeddings.
  • 4. 1. Introduction • Can we use multiple enrollments more efficiently in a neural back-end? • Learning intra-relationships among the speaker embeddings of multiple enrollments; • Aggregating a varying number of enrollment speaker embeddings with adaptive weights.
  • 6. 2. Conventional Back-end Models. [Figure: two pipelines for the multiple-enrollment case; left, enrollment waveforms x(1)...x(3) are concatenated before a single speaker encoder Enc(·); right, per-utterance embeddings e1, e2, e3 are averaged; either way, a cosine score or PLDA compares the result with the test embedding q.] In the multiple-enrollment case, the enrolled speaker has an enrollment set of K utterances, where K varies. • Option 1: concatenate all enrollment utterances into one long waveform. • Option 2: extract a speaker embedding from each utterance, then average all speaker embeddings. Note: in theory, PLDA scoring can be extended to the multiple-enrollment case without concatenation or averaging; for details, see [2, 3].
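The averaging pipeline above can be sketched in a few lines (a minimal numpy sketch; `cosine_score` is an illustrative name, not from the paper's code):

```python
import numpy as np

def cosine_score(enroll_embs, test_emb):
    """Average K enrollment embeddings into one speaker centroid, then
    score the test embedding against it with cosine similarity."""
    centroid = enroll_embs.mean(axis=0)  # (D,) reduces to the single-enrollment case
    num = float(centroid @ test_emb)
    den = float(np.linalg.norm(centroid) * np.linalg.norm(test_emb))
    return num / den

# K = 3 enrollment embeddings of dimension D = 4; K may vary per speaker
rng = np.random.default_rng(0)
enroll = rng.normal(size=(3, 4))
test = enroll.mean(axis=0)  # a test embedding equal to the centroid itself
```

Note that after the averaging step the enrollment utterances never interact with each other, which is exactly the limitation the attention back-end addresses.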
  • 8. 3. Attention Back-end Model. Model architecture: [Figure: per-utterance speaker encoders feed enrollment embeddings into scaled-dot self-attention and feed-forward self-attention; the pooled vector is compared with the test embedding via Cos(·,·) and calibrated by logistic regression.] • Extract the enrollment speaker embeddings and the test speaker embedding q. • Explore intra-relationships among all enrollment speaker embeddings with scaled-dot self-attention (SDSA) [4], stacking the enrollment embeddings as a matrix E. • Aggregate a varying number of enrollment speaker embeddings with adaptive weights via feed-forward self-attention (FFSA) [5]. • Score calibration: logistic regression (LR).
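The two attention stages and the calibrated cosine score can be sketched as follows (a minimal numpy sketch; the FFSA parameters `W`, `v` and the calibration parameters `a`, `b` are learned in the paper, and are placeholders here for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_backend_score(E, q, W, v, a=10.0, b=-5.0):
    """E: (K, D) enrollment embeddings (K may vary), q: (D,) test embedding.
    SDSA models intra-relationships among enrollments, FFSA aggregates
    them with adaptive weights, and LR calibrates the cosine score."""
    K, D = E.shape
    # Scaled-dot self-attention: queries = keys = values = E
    H = softmax(E @ E.T / np.sqrt(D), axis=-1) @ E   # (K, D) refined embeddings
    # Feed-forward self-attention: one adaptive weight per enrollment
    w = softmax(np.tanh(H @ W) @ v, axis=0)          # (K,) sums to 1
    h = w @ H                                        # (D,) speaker representation
    cos = float(h @ q) / float(np.linalg.norm(h) * np.linalg.norm(q))
    return 1.0 / (1.0 + np.exp(-(a * cos + b)))      # logistic-regression calibration

rng = np.random.default_rng(0)
E, q = rng.normal(size=(3, 8)), rng.normal(size=8)
W, v = rng.normal(size=(8, 8)), rng.normal(size=8)
s = attention_backend_score(E, q, W, v)  # a calibrated score in (0, 1)
```

Because the FFSA weights are normalized over K, the same model handles any number of enrollment utterances without retraining.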
  • 9. 3. Attention Back-end Model. Trials sampling method for training. • Why: introduce the multiple-enrollment process into the training stage. • How: load a speaker-balanced mini-batch from the dataset, then rearrange the mini-batch to form (test, enrollments) trial pairs. • Positive pair: the test trial and the enrollment data come from the same speaker; one utterance is selected as the test trial, and the rest are left for enrollment. • Negative pair: the same test trial paired with enrollment data from the other speakers in the mini-batch.
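The rearrangement can be sketched as follows (pure-Python sketch; one plausible reading of the slide, in which each negative pair reuses another speaker's full utterance set as enrollment):

```python
def make_trials(batch):
    """batch: dict mapping speaker -> list of utterance (embedding) ids,
    with the same count per speaker (speaker-balanced mini-batch).
    Returns (test, enrollments, label) trials with label 1 for positives."""
    trials = []
    for spk, utts in batch.items():
        for i, test in enumerate(utts):
            # Positive pair: the rest of the same speaker's data enrolls
            trials.append((test, utts[:i] + utts[i + 1:], 1))
            # Negative pairs: enrollment data from every other speaker
            for other, other_utts in batch.items():
                if other != spk:
                    trials.append((test, other_utts, 0))
    return trials

# 3 speakers x 4 utterances, as in the example on the slide
batch = {s: [f"{s}{i}" for i in range(1, 5)] for s in "ABC"}
trials = make_trials(batch)
```

With 3 speakers of 4 utterances each, this yields 12 positive pairs (each utterance once as test) and 24 negative pairs (each test against the 2 other speakers).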
  • 10. 3. Attention Back-end Model. Loss functions: a weighted sum of the binary cross-entropy (BCE) loss and the generalized end-to-end (GE2E) loss, i.e. (1 - λ)*BCE + λ*GE2E, applied after the logistic-regression calibration using the trial labels.
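A simplified numpy sketch of the combined objective (the exact GE2E variant and the weight λ follow the paper; here a softmax-style GE2E on a small similarity matrix stands in for illustration):

```python
import numpy as np

def bce_loss(scores, labels):
    """Binary cross-entropy over calibrated trial scores in (0, 1)."""
    s = np.clip(scores, 1e-7, 1.0 - 1e-7)
    return float(-np.mean(labels * np.log(s) + (1 - labels) * np.log(1 - s)))

def ge2e_loss(sim):
    """Softmax-variant GE2E on an (N, N) similarity matrix: entry [i, k]
    compares speaker i's test embedding with speaker k's enrollment
    centroid, so the diagonal holds the genuine pairs."""
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - np.diag(sim)))

def combined_loss(scores, labels, sim, lam=0.5):
    """(1 - lambda) * BCE + lambda * GE2E, as on the slide."""
    return (1.0 - lam) * bce_loss(scores, labels) + lam * ge2e_loss(sim)

scores, labels = np.array([0.9, 0.1]), np.array([1.0, 0.0])
loss = combined_loss(scores, labels, 10.0 * np.eye(2))
```

The BCE term supervises individual trial decisions, while the GE2E term pulls each test embedding toward its own speaker's centroid and away from the others.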
  • 12. 4. Experimental Results. Datasets: • Single-enrollment case: VoxCeleb1&2 [6, 7] (source: https://www.robots.ox.ac.uk/~vgg/data/voxceleb/). • Multiple-enrollment case: CNCeleb1&2 [8, 9] (source: http://cnceleb.org/dataset). Note: the number of enrollment utterances in the CNCeleb evaluation protocol varies!
  • 13. 4. Experimental Results. Speaker encoders: • Without frame-level attention: TDNN with statistics pooling (TDNN), ResNet34. • With frame-level attention: TDNN with attentive statistics pooling (TDNN-ASP), ECAPA-TDNN.
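The difference between the two pooling families can be illustrated with a short numpy sketch (the scoring vector `w` is learned in real encoders; with `w = 0` the attention is uniform and the two poolings coincide):

```python
import numpy as np

def stats_pooling(frames):
    """Plain statistics pooling (TDNN): concatenate the mean and standard
    deviation over T frames. frames: (T, D) -> (2D,) utterance vector."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def attentive_stats_pooling(frames, w):
    """Attentive statistics pooling: frame-level attention weights produce
    a weighted mean and weighted standard deviation. w: (D,) scoring vector."""
    a = np.exp(frames @ w)
    a = a / a.sum()                        # (T,) attention over frames
    mu = a @ frames                        # weighted mean, (D,)
    var = a @ (frames ** 2) - mu ** 2      # weighted variance, (D,)
    return np.concatenate([mu, np.sqrt(np.clip(var, 0.0, None))])

rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 4))          # T = 10 frames, D = 4 features
```

A learned `w` lets the encoder emphasize informative frames instead of weighting all frames equally, which is the "frame-level attention" distinction drawn on the slide.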
  • 14. 4. Experimental Results. Results and analysis. • Single-enrollment case on VoxCeleb: our back-end model is comparable with PLDA in terms of EER, even though there are no intra-relationships to exploit with a single enrollment utterance; on minDCF, the proposed model slightly outperforms PLDA, showing that it works reasonably well in the single-enrollment case. • Multiple-enrollment case on CNCeleb: for the Concat and Mean operations, we report the best performance of the cosine-similarity and PLDA models; regardless of the criterion, our proposed attention back-end achieves the best performance.
  • 15. 4. Experimental Results. Ablation study. • The second and third lines of the table show that both self-attention mechanisms contributed to the improvement. • The fourth line shows that LR for calibration was another essential component of the proposed model. • Comparing the fifth and sixth lines, the model using BCE loss alone had a lower EER than the one using GE2E loss alone, while the GE2E-only model had a lower minDCF(0.001); combining BCE and GE2E thus gave the lowest values on all metrics. [Figure: the attention back-end with ablated variants, e.g. the attention pooling replaced by a simple mean.]
  • 17. 5. Conclusions. • Conventional back-end models such as cosine similarity and PLDA do not work well in the multiple-enrollment scenario. • An utterance-level attention mechanism can further improve performance, even when frame-level and spatial attention mechanisms are already applied in the speaker encoder.
  • 18. 6. References. 1) S. Ramoji, P. Krishnan, and S. Ganapathy, "Neural PLDA modeling for end-to-end speaker verification," arXiv preprint arXiv:2008.04527, 2020. 2) Kong Aik Lee, Anthony Larcher, Chang Huai You, Bin Ma, and Haizhou Li, "Multi-session PLDA scoring of i-vector for partially open-set speaker detection," in Interspeech, 2013. 3) Padmanabhan Rajan, Anton Afanasyev, Ville Hautamäki, and Tomi Kinnunen, "From single to multiple enrollment i-vectors: Practical PLDA scoring variants for speaker verification," Digital Signal Processing, vol. 31, pp. 93–101, 2014. 4) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in NIPS, 2017. 5) Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio, "A structured self-attentive sentence embedding," in ICLR, 2017. 6) A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset," in Interspeech, 2017. 7) J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in Interspeech, 2018. 8) Yue Fan, J. W. Kang, L. T. Li, K. C. Li, H. L. Chen, S. T. Cheng, P. Y. Zhang, Z. Y. Zhou, Y. Q. Cai, and Dong Wang, "CN-Celeb: a challenging Chinese speaker recognition dataset," in ICASSP, 2020, pp. 7604–7608. 9) Lantian Li, Ruiqi Liu, Jiawen Kang, Yue Fan, Hao Cui, Yunqi Cai, Ravichander Vipperla, Thomas Fang Zheng, and Dong Wang, "CN-Celeb: multi-genre speaker recognition," arXiv preprint arXiv:2012.12468, 2020.
  • 19. Thanks!

Editor's Notes

  1. Hello everyone. Today I will present our paper, titled "Attention Back-end for Automatic Speaker Verification with Multiple Enrollment Utterances."
  2. First, let's look at this figure. It shows the general architecture of a speaker verification system: a front-end speaker encoder that extracts speaker embeddings, and a back-end model that measures pair-wise similarity. Typical speaker encoders include the time-delay neural network (TDNN) and its variant ECAPA-TDNN; ResNet is another choice. As for the back-end model, cosine similarity is the simplest, measuring the similarity between enrollment and test utterances. Another is probabilistic linear discriminant analysis (PLDA), which can alleviate the channel-mismatch problem. Inspired by PLDA, the neural PLDA model was proposed to simulate the behavior of PLDA with a neural network. Previous studies have mostly focused on the single-enrollment case, and little literature shows how to handle multiple enrollments. There are, however, two common operations for the multiple-enrollment case: concatenating the waveforms or acoustic features of the enrollment utterances, or averaging the speaker embeddings of the enrollment utterances.
  3. Let's look at the multiple-enrollment case. Suppose an enrolled speaker has K utterances, where K varies. We can concatenate all utterances into one long utterance, as the figure shows; this operation reduces the multiple-enrollment case to the single-enrollment case. Alternatively, we can extract a speaker embedding from each utterance and average these embeddings into a centroid vector that represents the speaker. Either way, the subsequent processing is the same as in the single-enrollment case.
  4. No matter which method we use, the multiple-enrollment case is reduced to the single-enrollment case, and we can use a cosine score or a PLDA model to measure the similarity between the enrolled speaker and the test speaker. However, there is then no interaction among the enrollment utterances.
  5. First, let's look at the model architecture. As before, we extract a speaker embedding for each enrollment utterance. These embeddings are stacked as a matrix, denoted by capital E. Scaled-dot self-attention then computes the intra-relationships among the embeddings; we believe that in this way some latent factors are emphasized and others de-emphasized. The output matrix of the scaled-dot self-attention is aggregated by the following feed-forward attention module into a vector, denoted h, that represents the enrolled speaker. Finally, we compute the cosine similarity between h and the test speaker embedding q, and a logistic regression is used for score calibration.
  6. Next, I will introduce our trials sampling method. It introduces the multiple-enrollment process into the training stage, so the interactions among embeddings can update the network parameters through backpropagation. I will go through it step by step. First, we load a speaker-balanced mini-batch from the dataset, meaning every speaker contributes the same number of embeddings; for example, in the table on the right there are three speakers A, B, and C, each with four utterances indexed 1 to 4. The data in this mini-batch are then rearranged into (test, enrollments) trial pairs. For a positive pair, the test embedding and the enrollment data come from the same speaker: when one embedding is selected as the test embedding, the rest of that speaker's data are used for enrollment. For a negative pair, the test embedding (marked by the check symbol) is paired with enrollment data (marked by the three cross symbols) from the other speakers in the mini-batch.
  7. This page shows the datasets used in our experiments. On the left are VoxCeleb1 & 2, which contain more than 7,000 speakers and more than 2,000 hours of speech; all utterances come from YouTube videos of celebrities speaking English. On the right are CNCeleb1 & 2, Chinese datasets with more than 1,200 hours of speech. Unlike the VoxCeleb datasets, which contain only interview data, CNCeleb covers 11 different genres, including hard scenarios such as singing and entertainment.
  8. There are several variants of the PLDA model; here we only consider the widely used Gaussian PLDA. Although some literature, such as [2] and [3], has adapted the PLDA scoring method to multi-session enrollment, the reported performance was not ideal even compared with simple averaging or concatenation.