SGEM: Test-Time Adaptation
for Automatic Speech Recognition via
Sequential-Level Generalized Entropy
Minimization
2023.08.23
Changhun Kim, Joonhyung Park, Hajin Shim and Eunho Yang
Master's Student @ MLILAB, KAIST AI
changhun.kim@kaist.ac.kr
Table of Contents
• Introduction
• Method
  • Beam Search-Based Logit Acquisition
  • Generalized Entropy Minimization
  • Negative Sampling
• Experiments
  • Experimental Setup
  • Main Results
  • Non-Native English Speech Corpora / Data Deficient Condition / Ablation Study
  • Adaptation Example
• Conclusion
Introduction
• Automatic speech recognition (ASR) models are frequently exposed to distribution shifts.
• Distribution shifts between the source and target domains severely degrade ASR performance.
[Figure: examples of distribution shift between the train set and the test set: American English vs. British English, and severe background noise.]
• Unsupervised domain adaptation (UDA) jointly trains the ASR model on labeled source-domain data and unlabeled target-domain data.
• Limitations of UDA
  • Source data might not be accessible due to privacy or storage issues.
  • It restricts the generalization capacity to the pre-collected target data only.
• SUTA [INTERSPEECH'22] proposed a test-time adaptation (TTA) strategy for ASR models.
  • It fine-tunes the pre-trained ASR model using unlabeled test instances, without source data.
  • It uses unsupervised objectives such as entropy minimization and minimum class confusion.
Motivation
• Previous work targets CTC-based models and relies on naïve greedy decoding.
• It naïvely adopts TTA methods from computer vision at the frame level.
• Can we consider the sequential nature of ASR output and design speech-specific components?
Goal
• Adapt the ASR model at the sequence level by considering the nature of speech.
• Achieve state-of-the-art performance by developing novel unsupervised objectives.
Method
• Beam Search-Based Logit Acquisition
  • Frame-level greedy adaptation considers the joint probability of a sequence myopically over timesteps.
  [Figure: for the utterance "hello", the beam search output "hello" gives correct supervision (p(hello) = 0.138, p(hello) = 0.211), while the greedy search output "gello" gives wrong supervision (p(gello) = 0.279, p(gello) = 0.342).]
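The difference between greedy and beam search decoding can be illustrated with a small sketch. The conditional probabilities below are invented for illustration (they are not the values from the figure), and `COND_PROBS`, `greedy_decode`, and `beam_search` are hypothetical names, not part of SGEM's code. The sketch assumes a toy autoregressive model over two timesteps.

```python
import math

# Hypothetical toy model: next-token probabilities given the prefix so far.
# All numbers are invented for illustration only.
COND_PROBS = {
    (): {"g": 0.55, "h": 0.45},
    ("g",): {"a": 0.6, "o": 0.4},
    ("h",): {"i": 0.9, "a": 0.1},
}


def next_log_probs(prefix):
    """Return log p(token | prefix) for every candidate next token."""
    return {tok: math.log(p) for tok, p in COND_PROBS[tuple(prefix)].items()}


def greedy_decode(length):
    """Pick the locally most probable token at each timestep."""
    seq, score = [], 0.0
    for _ in range(length):
        log_probs = next_log_probs(seq)
        tok = max(log_probs, key=log_probs.get)
        seq.append(tok)
        score += log_probs[tok]
    return seq, score


def beam_search(length, beam_width):
    """Keep the beam_width best prefixes by joint log-probability."""
    beams = [([], 0.0)]
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


greedy_seq, greedy_score = greedy_decode(2)
beam_seq, beam_score = beam_search(2, beam_width=2)
print(greedy_seq, math.exp(greedy_score))
print(beam_seq, math.exp(beam_score))
```

In this toy setting greedy decoding commits to the locally best first token and ends with a lower joint probability than the sequence found by beam search, which is the myopia the slide refers to.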
  • Instead, we exploit beam search decoding to find the most plausible output sequence $\hat{\mathbf{y}} = (\hat{y}_1, \cdots, \hat{y}_L)$,
  • and pass $\hat{\mathbf{y}}$ to the model to acquire the $i$-th logit $\mathbf{o}_i = (o_{i1}, \cdots, o_{iC})$ for $i \in \{1, \cdots, L\}$,
  • where $o_{ij} = \log p(y_i = j \mid \hat{y}_{<i}, x, \theta)$.
  • Logits obtained from beam search are more accurate and naturally aligned with the ASR decoding strategy.
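As a rough illustration of the logit acquisition step, the sketch below feeds a fixed sequence back through a model one prefix at a time (teacher forcing) and collects the log-probability vector at each position. The model here is a hypothetical stand-in (`toy_next_probs`) with made-up weights, and the vocabulary and function names are assumptions for illustration, not SGEM's implementation.

```python
import math

# Hypothetical vocabulary of C = 3 classes.
VOCAB = ["a", "b", "<eos>"]


def toy_next_probs(prefix):
    """Stand-in for p(y_i = j | prefix, x, theta); the weights are invented."""
    last = prefix[-1] if prefix else None
    weights = [
        1.0 + (last == "b"),       # "a" is more likely after "b"
        1.0 + (last == "a"),       # "b" is more likely after "a"
        0.5 + 0.5 * len(prefix),   # "<eos>" becomes more likely over time
    ]
    total = sum(weights)
    return [w / total for w in weights]


def acquire_log_probs(y_hat):
    """Return one log-probability vector per position of y_hat.

    Position i is conditioned on the decoded prefix y_hat[:i], so the result
    has L rows (sequence length) and C columns (vocabulary size).
    """
    return [
        [math.log(p) for p in toy_next_probs(y_hat[:i])]
        for i in range(len(y_hat))
    ]


y_hat = ["a", "b", "a"]  # pretend this came from beam search
logits = acquire_log_probs(y_hat)
for i, row in enumerate(logits):
    print(i, [round(v, 3) for v in row])
```

Each row corresponds to one position of the decoded sequence, so an unsupervised loss can then be computed per position and averaged over the sequence.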
• Generalized Entropy Minimization
  • Shannon entropy $-\sum_{j=1}^{C} \mathbb{P}(X = j) \log \mathbb{P}(X = j)$ is the special case of Rényi entropy obtained as the hyperparameter $\alpha \to 1$.
  • Rényi entropy with hyperparameter $\alpha \in (0, 1) \cup (1, \infty)$ is defined as $H_\alpha(X) = \frac{1}{1 - \alpha} \log \sum_{j=1}^{C} \mathbb{P}(X = j)^{\alpha}$.
  • We hypothesize that there exists an optimal $\alpha$ for TTA and define the generalized entropy minimization loss using Rényi entropy with this $\alpha$.
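The relationship between Shannon and Rényi entropy can be checked numerically. The sketch below uses the standard Rényi entropy definition; the `gem_loss` function, which averages the Rényi entropy over the decoding positions, is an assumed aggregation for illustration and may differ from the exact loss in the original slides.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def renyi_entropy(probs, alpha):
    """Renyi entropy in nats for alpha in (0, 1) or (1, inf)."""
    if alpha <= 0.0 or alpha == 1.0:
        raise ValueError("alpha must be positive and different from 1")
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


def gem_loss(step_probs, alpha):
    """Mean Renyi entropy over the L positions (assumed aggregation)."""
    return sum(renyi_entropy(p, alpha) for p in step_probs) / len(step_probs)


probs = [0.7, 0.2, 0.1]
print(shannon_entropy(probs))
print(renyi_entropy(probs, 1.0001))  # close to the Shannon value
print(renyi_entropy(probs, 2.0))
print(gem_loss([probs, [0.25, 0.25, 0.25, 0.25]], alpha=2.0))
```

Choosing `alpha` close to 1 recovers Shannon entropy minimization, while other values weight the confident and unconfident classes differently, which is what makes `alpha` a tunable hyperparameter.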
• Negative Sampling
  • The negative sampling loss penalizes the probabilities of low-confidence classes.
  • Even if the model's prediction is incorrect, the ground-truth label is likely to be among the top-k classes with the highest probability.
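One way to read the negative sampling idea is sketched below: at each position, every class outside the top-k is treated as a negative class and its probability is pushed toward zero with a `-log(1 - p)` penalty. The selection rule (top-k), the penalty form, and the averaging over positions are assumptions made for this sketch; the loss used in SGEM may select negatives or normalize differently.

```python
import math


def negative_sampling_loss(step_probs, k):
    """Penalize the probability mass assigned to classes outside the top-k.

    step_probs: list of probability vectors, one per decoding position.
    k: number of highest-probability classes to protect at each position.
    """
    total = 0.0
    for probs in step_probs:
        # Rank class indices from most to least probable.
        ranked = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
        negatives = ranked[k:]
        # -log(1 - p) is zero when p = 0 and grows as p increases.
        total += -sum(math.log(1.0 - probs[j]) for j in negatives)
    return total / len(step_probs)


loss_spread = negative_sampling_loss([[0.6, 0.3, 0.05, 0.05]], k=2)
loss_sharp = negative_sampling_loss([[0.7, 0.3, 0.0, 0.0]], k=2)
print(loss_spread, loss_sharp)
```

Because the top-k classes are never penalized, the ground-truth label is left untouched whenever it falls inside the top-k, even if it is not the top-1 prediction.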
Experiments
• Experimental Setup
  • Source ASR Models
    • CTC-based model: wav2vec 2.0
    • Conformer: Conformer-CTC
    • Transducer: Conformer-Transducer
    • Language model: 4-gram language model
  • Datasets
    • Unseen speakers/words: CHiME-3 (CH), TED-LIUM 2 (TD), Common Voice (CV), Valentini (VA)
    • Background noise: LibriSpeech test-other dataset + noises sampled from the MS-SNSD noise test set
      • Air conditioner (AC), airport announcement (AA), babble (BA), copy machine (CM), munching (MU), neighbors (NB), shutting door (SD), typing (TP)
    • Non-native English speech corpora: L2-Arctic
• Main Result: Greedy Decoding
• Main Result: Beam Search Decoding
• Non-Native English Speech Corpora
• Data Deficient Condition
• Ablation Study
• Adaptation Example
  • Before Adaptation (WER: 25%): "What is it perhaps I can ilp yo"
  • After Adaptation (WER: 0%): "What is it perhaps I can help you"
  • Ground Truth: "What is it perhaps I can help you"
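The WER figures in the adaptation example can be reproduced with a standard word-level edit distance. The sketch below is a generic implementation (the function name `word_error_rate` is chosen here for illustration), assuming whitespace tokenization and case-sensitive matching.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the processed reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


truth = "What is it perhaps I can help you"
before = "What is it perhaps I can ilp yo"
after = "What is it perhaps I can help you"
print(word_error_rate(truth, before))  # 0.25
print(word_error_rate(truth, after))   # 0.0
```

The unadapted output has two substituted words out of eight reference words, which gives the 25% WER shown in the example.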
Conclusion
• Conclusion
  • We have proposed SGEM, an effective single-utterance TTA framework for general ASR models.
  • SGEM achieved state-of-the-art results in almost every setting, including harsh conditions such as non-native English corpora and the data deficient condition.
  • SGEM sheds light on the careful design of speech-specific components when devising test-time adaptation methods for ASR models.
• Limitation
  • Adaptation cost is high (0.771 seconds for a 1-second utterance).
  • Performance is quite sensitive to hyperparameters such as the learning rate.
C. Kim, INTERSPEECH 2023, MLILAB, KAISTAI
