1. SGEM: Test-Time Adaptation
for Automatic Speech Recognition via
Sequential-Level Generalized Entropy
Minimization
2023.08.23
Changhun Kim, Joonhyung Park, Hajin Shim and Eunho Yang
Master's Student @ MLILAB, KAIST AI
changhun.kim@kaist.ac.kr
2. Table of Contents
• Introduction
• Method
  • Beam Search-Based Logit Acquisition
  • Generalized Entropy Minimization
  • Negative Sampling
• Experiments
  • Experimental Setup
  • Main Results
    • Non-Native English Speech Corpora / Data Deficient Condition / Ablation Study
  • Adaptation Example
• Conclusion
4. Introduction
• Automatic speech recognition (ASR) models are frequently exposed to distribution shifts.
• Distribution shifts between the source and target domains severely degrade ASR performance.
[Figure: examples of distribution shift between train and test sets, e.g., severe background noise and American vs. British English accents]
5. Introduction
• Unsupervised domain adaptation (UDA) jointly trains the ASR model on labeled source-domain data and unlabeled target-domain data.
• Limitations of UDA
  • Source data might not be accessible due to privacy/storage issues.
  • It restricts the generalization capacity to the pre-collected target data only.
6. Introduction
• SUTA [INTERSPEECH'22] suggested a test-time adaptation (TTA) strategy for ASR models.
  • It fine-tunes the pre-trained ASR model using unlabeled test instances without source data.
  • It utilizes unsupervised objectives such as entropy minimization and minimum class confusion.
7. Introduction
Motivation
• Previous work targets CTC-based models, which rely on naïve greedy decoding.
• It naïvely adopts TTA methods from computer vision at the frame level.
• Can we consider the sequential nature of ASR output and design speech-specific components?
Goal
• Adapt the ASR model by considering the nature of speech at the sequence level.
• Achieve state-of-the-art performance by developing novel unsupervised objectives.
10. Method
• Beam Search-Based Logit Acquisition
  • Frame-level greedy adaptation considers the joint probability of a sequence myopically over timesteps, as the toy example below illustrates.
[Figure: the greedy search output "gello" has higher frame-level probabilities (p = 0.279, 0.342) than the beam search output "hello" (p = 0.138, 0.211), yet "gello" provides wrong supervision while "hello" provides correct supervision]
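To make the myopia concrete, here is a minimal, self-contained Python sketch (with made-up probabilities, not the paper's numbers) where greedy decoding picks the locally best first token yet ends with a lower joint probability than the beam-search hypothesis:

```python
import math

# Toy conditional distributions (assumed numbers for illustration only):
# token probabilities at each step, given the prefix decoded so far.
cond = {
    (): {"h": 0.45, "g": 0.55},          # step 1: "g" wins locally
    ("h",): {"e": 0.90, "x": 0.10},      # step 2 given "h"
    ("g",): {"e": 0.40, "x": 0.60},      # step 2 given "g"
}

def greedy_decode(steps=2):
    """Pick the locally most probable token at every step."""
    seq = ()
    for _ in range(steps):
        seq += (max(cond[seq], key=cond[seq].get),)
    return seq

def beam_decode(k=2, steps=2):
    """Keep the k best partial hypotheses by joint log-probability."""
    beams = [((), 0.0)]
    for _ in range(steps):
        cands = [(seq + (tok,), logp + math.log(p))
                 for seq, logp in beams for tok, p in cond[seq].items()]
        beams = sorted(cands, key=lambda b: -b[1])[:k]
    return beams[0][0]

print(greedy_decode())  # ('g', 'x'): joint p = 0.55 * 0.60 = 0.330
print(beam_decode())    # ('h', 'e'): joint p = 0.45 * 0.90 = 0.405
```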
11. Method
• Beam Search-Based Logit Acquisition
  • Instead, we exploit beam search decoding and find the most plausible output sequence $\hat{y} = (\hat{y}_1, \cdots, \hat{y}_L)$,
  • and pass $\hat{y}$ through the model to acquire the $i$-th logit $\hat{o}_i = (o_{i1}, \cdots, o_{iC})$ for $i \in \{1, \cdots, L\}$,
  • where $o_{ic} = \log p(y_i = c \mid \hat{y}_{<i}, x, \theta)$.
  • Logits obtained from beam search are more accurate and naturally aligned with the ASR decoding strategy (see the sketch below).
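A minimal sketch of this logit-acquisition step, assuming a hypothetical model interface (`beam_search` returning token ids and `decoder_logits` running a teacher-forced pass; neither name comes from the paper):

```python
import torch
import torch.nn.functional as F

def acquire_logits(model, x, beam_size=5):
    """Sketch: decode with beam search, then re-run the decoder with
    teacher forcing on the hypothesis y_hat to collect per-step logits.
    `model.beam_search` / `model.decoder_logits` are assumed interfaces."""
    with torch.no_grad():
        y_hat = model.beam_search(x, beam_size=beam_size)  # (L,) token ids
    logits = model.decoder_logits(x, y_hat)                # (L, C), grads flow
    # o_{ic} = log p(y_i = c | y_hat_{<i}, x, theta)
    return y_hat, F.log_softmax(logits, dim=-1)
```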
12. Method
• Generalized Entropy Minimization
  • Shannon entropy $-\sum_{c=1}^{C} p(y=c) \log p(y=c)$ is a specialized version of Rényi entropy with hyperparameter $\alpha \to 1$.
  • Rényi entropy with hyperparameter $\alpha \in (0, 1) \cup (1, \infty)$ is defined as follows: $H_\alpha(p) = \frac{1}{1-\alpha} \log \sum_{c=1}^{C} p(y=c)^\alpha$.
  • We hypothesize that there exists an optimal $\alpha$ for TTA and define the generalized entropy minimization loss as the Rényi entropy of the per-step predictive distributions $\mathrm{softmax}(\hat{o}_i)$, aggregated over the $L$ decoding steps (sketched below).
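A minimal PyTorch sketch of this loss; averaging the per-step Rényi entropies over the L decoding steps is an assumption of this sketch, not necessarily the paper's exact aggregation:

```python
import torch

def gem_loss(logits: torch.Tensor, alpha: float) -> torch.Tensor:
    """Generalized entropy minimization loss for logits of shape (L, C).
    alpha -> 1 recovers Shannon entropy; averaging over steps is an
    assumption of this sketch."""
    p = logits.softmax(dim=-1)                               # (L, C)
    if abs(alpha - 1.0) < 1e-6:                              # Shannon limit
        return -(p * p.clamp_min(1e-12).log()).sum(dim=-1).mean()
    h = p.pow(alpha).sum(dim=-1).log() / (1.0 - alpha)       # Renyi entropy
    return h.mean()

# Usage: logits = torch.randn(10, 32, requires_grad=True)
#        gem_loss(logits, alpha=0.8).backward()
```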
13. Method
• Negative Sampling
  • The negative sampling loss penalizes the probabilities of low-confidence classes (see the sketch below).
  • Even if the model's prediction is incorrect, the ground-truth label is likely to be included in the top-k classes with the highest probabilities.
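One plausible instantiation in PyTorch; this is a sketch under assumptions, and the exact top-k threshold and loss form may differ from the paper's:

```python
import torch

def negative_sampling_loss(logits: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Push down probability mass on the non-top-k ("negative") classes,
    which are assumed not to contain the ground-truth label.
    logits: (L, C); k: number of high-confidence classes to spare."""
    p = logits.softmax(dim=-1)                      # (L, C)
    topk_idx = p.topk(k, dim=-1).indices
    negative = torch.ones_like(p, dtype=torch.bool)
    negative.scatter_(-1, topk_idx, False)          # mask out top-k classes
    # -log(1 - p_c) grows as a negative class gains probability,
    # so minimizing it suppresses the low-confidence classes.
    return -torch.log1p(-p[negative].clamp(max=1 - 1e-6)).mean()
```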
18. Experiments
• Adaptation Example
  • Before Adaptation (WER: 25%): "What is it perhaps I can ilp yo"
  • After Adaptation (WER: 0%): "What is it perhaps I can help you"
  • Ground Truth: "What is it perhaps I can help you"
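For reference, the 25% figure follows from 2 word errors ("ilp", "yo") over the 8 reference words; a minimal WER check using the standard word-level Levenshtein distance (not the authors' evaluation script):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.lower().split(), hyp.lower().split()
    d = list(range(len(h) + 1))                     # DP row over hyp prefixes
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            prev, d[j] = d[j], min(d[j] + 1,             # deletion
                                   d[j - 1] + 1,         # insertion
                                   prev + (rw != hw))    # substitution
    return d[len(h)] / len(r)

print(wer("what is it perhaps i can help you",
          "what is it perhaps i can ilp yo"))  # 0.25
```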
20. Conclusion
• Conclusion
  • We have suggested SGEM, an effective single-utterance TTA framework for general ASR models.
  • SGEM achieved state-of-the-art results in almost every setting, including harsh conditions such as non-native English corpora and the data-deficient condition.
  • SGEM sheds light on the careful design of speech-specific components when devising test-time adaptation methods for ASR models.
• Limitations
  • The adaptation cost is high (0.771 seconds for a 1-second utterance).
  • Hyperparameters such as the learning rate are quite sensitive.