1. SGEM: Test-Time Adaptation
for Automatic Speech Recognition via
Sequential-Level Generalized Entropy
Minimization
2023.08.23
Changhun Kim, Joonhyung Park, Hajin Shim and Eunho Yang
Master's Student @ MLILAB, KAIST AI
changhun.kim@kaist.ac.kr
2. Table of Contents
• Introduction
• Method
  • Beam Search-Based Logit Acquisition
  • Generalized Entropy Minimization
  • Negative Sampling
• Experiments
  • Experimental Setup
  • Main Results
    • Non-Native English Speech Corpora / Data Deficient Condition / Ablation Study
  • Adaptation Example
• Conclusion
4. Introduction
• Automatic speech recognition (ASR) models are frequently exposed to distribution shifts.
• Distribution shifts between the source and target domains severely degrade ASR performance.
[Figure: examples of distribution shift between train and test sets, e.g., severe background noise and American vs. British English accents]
5. Introduction
• Unsupervised domain adaptation (UDA) jointly trains the ASR model on labeled source-domain data and unlabeled target-domain data.
• Limitations of UDA
  • Source data might not be accessible due to privacy/storage issues.
  • It restricts the generalization capacity to the pre-collected target data only.
6. Introduction
• SUTA [INTERSPEECH'22] suggested a test-time adaptation (TTA) strategy for ASR models.
  • It fine-tunes the pre-trained ASR model using unlabeled test instances without source data.
  • It utilizes unsupervised objectives such as entropy minimization and minimum class confusion.
7. Introduction
Motivation
• Previous work targets CTC-based models, which rely on naïve greedy decoding.
• It naïvely adopts TTA methods from computer vision at the frame level.
• Can we consider the sequential nature of ASR output and design speech-specific components?
Goal
• Adapt the ASR model by considering the nature of speech at the sequence level.
• Achieve state-of-the-art performance by developing novel unsupervised objectives.
10. Method
• Beam Search-Based Logit Acquisition
  • Frame-level greedy adaptation considers the joint probability of a sequence myopically over timesteps, as the toy example below illustrates.
[Figure: the greedy search output "gello" has higher frame-level probabilities (p = 0.279, 0.342) than the beam search output "hello" (p = 0.138, 0.211), yet "gello" provides wrong supervision while "hello" provides correct supervision]
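To make the myopia concrete, here is a minimal, self-contained Python sketch (with made-up probabilities, not the paper's numbers) where greedy decoding picks the locally best first token yet ends with a lower joint probability than the beam-search hypothesis:

```python
import math

# Toy conditional distributions (assumed numbers for illustration only):
# token probabilities at each step, given the prefix decoded so far.
cond = {
    (): {"h": 0.45, "g": 0.55},          # step 1: "g" wins locally
    ("h",): {"e": 0.90, "x": 0.10},      # step 2 given "h"
    ("g",): {"e": 0.40, "x": 0.60},      # step 2 given "g"
}

def greedy_decode(steps=2):
    """Pick the locally most probable token at every step."""
    seq = ()
    for _ in range(steps):
        seq += (max(cond[seq], key=cond[seq].get),)
    return seq

def beam_decode(k=2, steps=2):
    """Keep the k best partial hypotheses by joint log-probability."""
    beams = [((), 0.0)]
    for _ in range(steps):
        cands = [(seq + (tok,), logp + math.log(p))
                 for seq, logp in beams for tok, p in cond[seq].items()]
        beams = sorted(cands, key=lambda b: -b[1])[:k]
    return beams[0][0]

print(greedy_decode())  # ('g', 'x'): joint p = 0.55 * 0.60 = 0.330
print(beam_decode())    # ('h', 'e'): joint p = 0.45 * 0.90 = 0.405
```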
11. Method
• Beam Search-Based Logit Acquisition
  • Instead, we exploit beam search decoding and find the most plausible output sequence $\hat{y} = (\hat{y}_1, \cdots, \hat{y}_L)$,
  • and pass $\hat{y}$ through the model to acquire the $i$-th logit $\hat{o}_i = (o_{i1}, \cdots, o_{iC})$ for $i \in \{1, \cdots, L\}$,
  • where $o_{ic} = \log p(y_i = c \mid \hat{y}_{<i}, x, \theta)$.
  • Logits obtained from beam search are more accurate and naturally aligned with the ASR decoding strategy (see the sketch below).
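A minimal sketch of this logit-acquisition step, assuming a hypothetical model interface (`beam_search` returning token ids and `decoder_logits` running a teacher-forced pass; neither name comes from the paper):

```python
import torch
import torch.nn.functional as F

def acquire_logits(model, x, beam_size=5):
    """Sketch: decode with beam search, then re-run the decoder with
    teacher forcing on the hypothesis y_hat to collect per-step logits.
    `model.beam_search` / `model.decoder_logits` are assumed interfaces."""
    with torch.no_grad():
        y_hat = model.beam_search(x, beam_size=beam_size)  # (L,) token ids
    logits = model.decoder_logits(x, y_hat)                # (L, C), grads flow
    # o_{ic} = log p(y_i = c | y_hat_{<i}, x, theta)
    return y_hat, F.log_softmax(logits, dim=-1)
```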
12. Method
• Generalized Entropy Minimization
  • Shannon entropy $-\sum_{c=1}^{C} p(y=c) \log p(y=c)$ is a specialized version of Rényi entropy with hyperparameter $\alpha \to 1$.
  • Rényi entropy with hyperparameter $\alpha \in (0, 1) \cup (1, \infty)$ is defined as follows: $H_\alpha(p) = \frac{1}{1-\alpha} \log \sum_{c=1}^{C} p(y=c)^\alpha$.
  • We hypothesize that there exists an optimal $\alpha$ for TTA and define the generalized entropy minimization loss as the Rényi entropy of the per-step predictive distributions $\mathrm{softmax}(\hat{o}_i)$, aggregated over the $L$ decoding steps (sketched below).
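A minimal PyTorch sketch of this loss; averaging the per-step Rényi entropies over the L decoding steps is an assumption of this sketch, not necessarily the paper's exact aggregation:

```python
import torch

def gem_loss(logits: torch.Tensor, alpha: float) -> torch.Tensor:
    """Generalized entropy minimization loss for logits of shape (L, C).
    alpha -> 1 recovers Shannon entropy; averaging over steps is an
    assumption of this sketch."""
    p = logits.softmax(dim=-1)                               # (L, C)
    if abs(alpha - 1.0) < 1e-6:                              # Shannon limit
        return -(p * p.clamp_min(1e-12).log()).sum(dim=-1).mean()
    h = p.pow(alpha).sum(dim=-1).log() / (1.0 - alpha)       # Renyi entropy
    return h.mean()

# Usage: logits = torch.randn(10, 32, requires_grad=True)
#        gem_loss(logits, alpha=0.8).backward()
```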
13. Method
• Negative Sampling
  • The negative sampling loss penalizes the probabilities of low-confidence classes (see the sketch below).
  • Even if the model's prediction is incorrect, the ground-truth label is likely to be included in the top-k classes with the highest probabilities.
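One plausible instantiation in PyTorch; this is a sketch under assumptions, and the exact top-k threshold and loss form may differ from the paper's:

```python
import torch

def negative_sampling_loss(logits: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Push down probability mass on the non-top-k ("negative") classes,
    which are assumed not to contain the ground-truth label.
    logits: (L, C); k: number of high-confidence classes to spare."""
    p = logits.softmax(dim=-1)                      # (L, C)
    topk_idx = p.topk(k, dim=-1).indices
    negative = torch.ones_like(p, dtype=torch.bool)
    negative.scatter_(-1, topk_idx, False)          # mask out top-k classes
    # -log(1 - p_c) grows as a negative class gains probability,
    # so minimizing it suppresses the low-confidence classes.
    return -torch.log1p(-p[negative].clamp(max=1 - 1e-6)).mean()
```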
18. Experiments
• Adaptation Example
  • Before Adaptation (WER: 25%): "What is it perhaps I can ilp yo"
  • After Adaptation (WER: 0%): "What is it perhaps I can help you"
  • Ground Truth: "What is it perhaps I can help you"
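For reference, the 25% figure follows from 2 word errors ("ilp", "yo") over the 8 reference words; a minimal WER check using the standard word-level Levenshtein distance (not the authors' evaluation script):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.lower().split(), hyp.lower().split()
    d = list(range(len(h) + 1))                     # DP row over hyp prefixes
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            prev, d[j] = d[j], min(d[j] + 1,             # deletion
                                   d[j - 1] + 1,         # insertion
                                   prev + (rw != hw))    # substitution
    return d[len(h)] / len(r)

print(wer("what is it perhaps i can help you",
          "what is it perhaps i can ilp yo"))  # 0.25
```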
20. Conclusion
• Conclusion
  • We have suggested SGEM, an effective single-utterance TTA framework for general ASR models.
  • SGEM achieved state-of-the-art results in almost every setting, including harsh conditions such as non-native English corpora and the data-deficient condition.
  • SGEM sheds light on the careful design of speech-specific components when devising test-time adaptation methods for ASR models.
• Limitations
  • The adaptation cost is high (0.771 seconds for a 1-second utterance).
  • Hyperparameters such as the learning rate are quite sensitive.