This document presents a method for cross-modal knowledge distillation from text to speech models using dropout-based teacher confidence. It leverages large pre-trained language models trained on text for speech tasks by distilling knowledge from a text-based BERT model into an end-to-end speech understanding model. Dropout perturbations on the teacher's logits yield a confidence score for each sample, which weights the distillation loss and guides the training of the student speech model, especially when combined with scheduling strategies such as decaying and triangular schedules. Evaluated on a public spoken language understanding dataset, the method improves over baseline distillation, demonstrating the effectiveness of using teacher confidence to manage knowledge transfer across modalities.
1. Cross-Modal Knowledge Distillation With
Dropout-Based Confidence
2022. 11. 9, @APSIPA ASC 2022
Won Ik Cho¹, Jeunghun Kim², Nam Soo Kim²
Samsung Advanced Institute of Technology¹
Department of ECE and INMC, Seoul National University²
3. Motivation
• Text and speech: two main media of communication
• Training models with speech is more difficult
Why?
• Scarce amount of data
• Difficult to control the generation and storage of recordings
(Figure: “THIS IS A SPEECH”; difference in search results for ‘English’ in the ELRA catalog)
4. Motivation
• Pretrained language models
Mainly developed for text-based systems
• ELMo, BERT, GPTs …
Based on huge amounts of raw corpora
• Trained with simple but non-task-specific objectives
• How to leverage pretrained LMs in speech processing?
Direct use?
• Only if the ASR output is accurate
Training LMs with erroneous speech transcriptions?
• Okay, but cannot cover all possible cases, and requires scripts for various scenarios
Distillation?
Distillation?
6. Task and Dataset
• Task: Spoken language understanding
Fluent Speech Commands
• 16 kHz, single-channel; 30,043 audio files
• Each audio labeled with three slots: action / object / location
• 248 different phrases spoken by 97 speakers (77/10/10)
• Multi-label classification problem
Why Fluent Speech Commands? (introduced in Lugosch et al., 2019)
• Google Speech Commands:
– Only short keywords, thus not a full SLU task
• ATIS
– Not publicly available
• Grabo, Domotica, Patcor
– Free, but only a small number of speakers and phrases
• Snips audio
– Variety of phrases, but less audio
7. Related Work
• ASR-NLU pipelines
Conventional approaches
Best if an accurate ASR is guaranteed
Easier to interpret the issue and enhance partial modules
• End-to-end SLU
Less prone to ASR errors
Non-textual information might be preserved as well
• Pretrained LMs
Takes advantage of massive textual knowledge
High performance, freely available modules
• Knowledge distillation
Adaptive to various training schemes
Cross-modal application is possible
9. Related Work
• End-to-end SLU
Lugosch et al. "Speech Model Pre-training for End-to-End Spoken
Language Understanding." INTERSPEECH 2019.
11. Related Work
• End-to-end speech processing + PLM
Chuang et al. “SpeechBERT: Cross-Modal Pre-Trained Language Model for End-to-End Spoken Question Answering.” INTERSPEECH 2020.
12. Related Work
• End-to-end speech processing + KD
Liu et al. “End-to-End Speech Translation with Knowledge Distillation.” INTERSPEECH 2019.
13. Related Work
• End-to-end SLU + PLM + Cross-modal KD
Cho et al. “Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation.” INTERSPEECH 2020.
14. Related Work
• End-to-end SLU
Backbone: Lugosch et al. (2019)
• Phoneme module (SincNet layer)
• Word module
– BiGRU-based, with dropout/pooling
• Intent module
– Sequential prediction of the three slots
– Also implemented with BiGRU
(Ravanelli and Bengio, 2018)
From an earlier version of Wang et al. (2020)
16. Related Work
• PLM
Fine-tuning the pretrained model
• BERT-Base (Devlin et al., 2018)
– Bidirectional encoder representations from Transformers (BERT)
• Hugging Face PyTorch wrapper
18. Related Work
• Cross-modal KD
Distillation as a teacher-student learning
• Loss1 = f(answer, inference_s)
• Loss2 = g(inference_s, inference_t)
• Different input, same task?
– e.g., speech translation
Total Loss = Loss1 + Loss2
Distilled knowledge
(Liu et al., 2019)
19. Method
• Cross-modal KD
What determines the loss?
• WHO TEACHES
– BERT-based text inference model
• HOW IS THE LOSS CALCULATED
– MAE, MSE between logits
• HOW MUCH THE GUIDANCE INFLUENCES
– Time-dependent factors (scheduling)
– Sample/batch-level factors
(Cho et al., 2020)
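The loss formulation above can be sketched as follows. This is a minimal illustration with plain Python lists of logits; the function names and the scalar weight `lam` (supplied by scheduling and/or sample-level factors) are ours, not from the paper.

```python
import math

def cross_entropy(logits, gold):
    """Loss1: cross-entropy of the student's logits against the gold label."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[gold]

def mae_kd(student_logits, teacher_logits):
    """Loss2: mean absolute error between student and teacher logits,
    the distillation loss found effective in the previous study."""
    return sum(abs(s - t) for s, t in zip(student_logits, teacher_logits)) / len(student_logits)

def total_loss(student_logits, teacher_logits, gold, lam):
    """Total loss = Loss1 + lam * Loss2; lam controls the teacher's influence."""
    return cross_entropy(student_logits, gold) + lam * mae_kd(student_logits, teacher_logits)
```

With `lam = 0` the student trains on the gold labels alone; larger `lam` pulls the student's logits toward the teacher's.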
20. Method
• Cross-modal KD
How can we determine the influence of distillation?
• Scheduling (suggested in Cho et al., 2020)
– Decaying
– Triangular
• Sample/batch-level factors
– Error rate (per batch)
– Entropy (averaged across batch; Kwon et al., 2020)
– Dropout-based confidence (averaged across batch; proposed)
» For N dropout-perturbed outputs, the KLD with the original logits is averaged
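As a rough sketch of the two scheduling shapes: the exact parametrization in the paper may differ, so the linear ramps and the `total_steps` argument here are our assumptions.

```python
def decaying(step, total_steps):
    """Decaying schedule: the teacher's influence shrinks linearly to zero."""
    return max(0.0, 1.0 - step / total_steps)

def triangular(step, total_steps):
    """Triangular schedule: influence ramps up to its peak at the midpoint
    of training, then ramps back down."""
    half = total_steps / 2.0
    return step / half if step <= half else max(0.0, (total_steps - step) / half)
```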
21. Method
• Dropout-based confidence
Idea: Distillation is more meaningful if the reliability of the prediction is guaranteed
• Reliability check – perturb the distribution and see how consistent it remains with the original distribution (how robust is it?)
• Giving perturbation – assign a dropout layer and measure the KLD between the perturbed output and the original one (for multiple dropout scenarios)
– C: # Output classes
– Q: Dropout layer set
– p: Dropout rate
– N: Number of dropouts
– T: Teacher output (original distribution)
– q(T): Output after dropout layer q
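A minimal sketch of the quantity defined above, assuming inverted dropout is applied directly to the teacher logits; the exact placement of the dropout layer in the paper may differ, and the function names are ours.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kld(p_dist, q_dist, eps=1e-12):
    """KL divergence D(p || q) between two categorical distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p_dist, q_dist))

def dropout_confidence(teacher_logits, p=0.1, n=100, seed=0):
    """Average, over n dropout masks with rate p, of the KLD between the
    teacher's original distribution T and the perturbed one q(T).
    A smaller value means the prediction is more consistent under perturbation."""
    rng = random.Random(seed)
    base = softmax(teacher_logits)
    total = 0.0
    for _ in range(n):
        # inverted dropout: zero each logit with probability p, rescale survivors
        perturbed = [0.0 if rng.random() < p else x / (1.0 - p) for x in teacher_logits]
        total += kld(base, softmax(perturbed))
    return total / n
```

How this averaged KLD is mapped to the final per-sample weight on the KD loss is not spelled out in this sketch.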
22. Method
• Hyperparameter search
Pilot with a toy set
• D is relatively robust to C and N
– Unlike when using entropy, the dropout-based scheme does not depend on the number of output classes
– As N increases, the curve is smoothed, while the overall tendency is not affected
• The only factor affecting D is p
– Empirically set N=100 and p=0.1 after experiments
23. Results
• Comparison with the baseline
Baseline performs well with triangular scheduling (1.00%) and error rate (1.00%)
Proposed method is not effective alone (1.05%), without scheduling
Proposed method is most significant with decaying (0.97%) and triangular scheduling (0.92%)
24. Results
• Discussion
Confidence modeling works
• Strategies
– Error rate – Student performance
– Entropy – Teacher inference distribution
– Dropout – Teacher confidence
• Error rate adapts the student to the gold label, while the others decide the weight independently of the varying student performance
• This prevents the situation where the gold label might not be the true ‘answer’, causing overfitting
Confidence helps scheduled KD
• Wide applicability of the proposed strategy, even alongside mechanical scheduling schemes
25. Conclusion
• Search for schemes to manage the teacher’s influence in cross-modal distillation
• Dropout-based confidence automatically induces the weight that decides the influence of the KD loss, in a sample-wise manner
• The effect of dropout-based confidence and its alignment with scheduling strategies are verified on a public SLU dataset
Hi, this is Won Ik Cho from Samsung Advanced Institute of Technology. Sorry for participating remotely due to some travel issues. I will present our work on cross-modal knowledge distillation, done while I was at Seoul National University.
First, we will review the literature on spoken language understanding and cross-modal distillation with relevant works, and then demonstrate how we extend conventional cross-modal distillation with dropout-based confidence.
First, we want to talk about the background of using textual information in speech tasks. Text and speech are the two main media of communication, but model training with speech is usually regarded as more difficult. The main reason is the scarce amount of data; furthermore, it is difficult to control the generation and storage of recordings, which again contributes to the scarcity of speech data.
In this regard, considering that speech contains the semantic and syntactic information that text data contains, it would be beneficial and efficient to utilize widely used pretrained language models in spoken language processing. They are mainly developed for text-based systems and are based on huge amounts of raw corpora as pretraining data. Usually they are trained with simple but non-task-specific objectives, in a self-supervised manner. Then, how can we leverage such pretrained LMs in speech and spoken language processing? Direct use is possible only if the ASR output is assumed to be accurate. Alternatively, we can train LMs on possibly erroneous speech transcriptions, but this is quite heuristic and may depend on the ASR system that is used. Thus, we instead consider distilling information from the pretrained or fine-tuned model, possibly in a task-specific manner.
However, one issue with distillation is that the teacher is not always fully confident about its inference, since there are always tricky examples that even the teacher finds difficult. In this study, we tackle these cases: how should knowledge distillation be controlled in such uncertain situations? We examine how this can be handled within cross-modal distillation for spoken language understanding.
Here, our task is spoken language understanding, and we use Fluent Speech Commands, which is widely used for SLU. It consists of 30,043 single-channel 16 kHz audio files, where each audio is labeled with three slots: action, object, and location. It is composed of 248 different phrases spoken by 97 speakers, and the task is formulated as a multi-label classification problem. Compared to SLU datasets such as Google Speech Commands with only short keywords, ATIS which is not publicly available, Grabo with its simple composition, and Snips with less audio, Fluent Speech Commands seems the more appropriate candidate for testing spoken language understanding with distillation techniques.
So far we have discussed what to do and how we plan to do it; now we look at how it has been done. First, with conventional ASR-NLU pipelines, as noted, the highest performance can be obtained if an accurate ASR is guaranteed. Above all, it is easier to interpret issues and enhance individual modules. In contrast, end-to-end SLU approaches are less prone to ASR errors, and non-textual information such as acoustics might be preserved as well.
The next two items concern pretrained language models and how we utilize them. PLMs take advantage of massive textual knowledge, and high-performance, freely available modules are now distributed. To leverage them, we bring in knowledge distillation techniques, which are adaptive to various training schemes and where a cross-modal approach is possible.
Going deeper into end-to-end SLU, we mainly refer to Lugosch et al. (2019), where the Fluent Speech Commands dataset and its baseline were released. The phoneme-level and word-level classifiers, based on SincNet and RNNs respectively, were pretrained and used. The final intent inference module, also based on an RNN, predicts the slots, and the best performance was obtained by freezing all the pretrained layers. That is, the word-level posterior provided by the ASR-pretrained module largely helps the intent inference module, while still allowing an end-to-end approach.
We also look at recent pretrained LMs, starting from the RNN-based ELMo and the Transformer-based BERT. In particular the prevalent BERT, pretrained with the objectives of masked word prediction and sentence relevance checking, has shown its power across syntactic and semantic tasks, and here it is expected to boost performance in understanding spoken language.
These approaches started to be combined, as in SpeechBERT, where text and the corresponding audio are trained simultaneously so that the representation of speech utterances can benefit from text understanding. This seems powerful for SLU, but the approach also requires a new format of LM pretraining.
In terms of knowledge distillation, Liu et al. suggested leveraging machine translation for the corresponding speech translation task, using the inference of the teacher MT module as a KD loss that benefits the speech-based inference of the student ST module. Our task differs in two ways: first, ours seeks representations of text and speech within a single language; second, our teacher and student have different architectures.
We describe the approach of our previous paper below. The SLU module on the left takes audio as input; the pretrained module yields a word-posterior-level sequence, which is fed to the RNN-based intent prediction module. In this phase, the prediction is guided by the logit inference of the fine-tuned LM on the script that corresponds to the given audio.
In detail, the end-to-end SLU backbone is adopted from Lugosch et al., consisting of a SincNet-based phoneme module, a BiGRU-based word module with dropout and pooling, and an intent module that sequentially predicts the three slots and is also implemented with a BiGRU.
The baseline experiment shows that the highest accuracy is obtained with the pretrained layers frozen, or with at least the word layer unfrozen. The result remained convincing with only 10% of the training dataset.
Second, regarding the PLM, to fine-tune BERT from a publicly available model, we adopted the Hugging Face PyTorch wrapper of the Google BERT.
To fit the fine-tuning strategy, the original FSC ground-truth scripts and labels were transformed into the proper format. The fine-tuning was done in a straightforward manner, for 50 epochs.
Next, on cross-modal KD: we regard it as teacher-student learning where the KD loss is added to the original loss as distilled knowledge. By cross-modal we mean a different input modality, which can be considered as different input for the same task, in line with the previously discussed speech translation.
In detail, we formulate our cross-modal knowledge distillation as below, as a weighted sum of the cross-entropy loss and the KD loss. Mainly three factors affect the final KD loss. First, who teaches is important; here we adopt the BERT-based text inference model used in Cho et al., which achieved an almost perfect score on the text test set with ground-truth labels. Next, the kind of loss used is also important; referring again to the previous study, an MAE loss between logits is effective in this case. However, it is not trivial to determine the value of lambda. There are mainly two kinds of factors: time-dependent factors realized as scheduling, and sample- or batch-level factors such as the error rate.
Let us take a deeper look into determining the influence of distillation. In Cho et al., using the same dataset, decaying and triangular schedulings were used to manage the influence of the teacher. These are closer to mechanical control, compared to sample- or batch-level factors that change according to the samples actually contained in the dataset. The error rate is calculated per batch, referring to the accuracy of the teacher model on the batch. The entropy is calculated from the logit distribution per sample and divided by the number of output classes for normalization. Finally, the proposed factor, dropout-based confidence, returns the Kullback-Leibler divergence from the original logits averaged over N, where N is the number of dropout-perturbed outputs.
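The entropy factor described here can be sketched as follows; a minimal illustration in which we follow the talk's description of dividing the per-sample entropy by the number of output classes, with function names of our own choosing.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def normalized_entropy(logits):
    """Per-sample entropy of the teacher's output distribution, divided by
    the number of output classes: a flatter distribution gives a higher value."""
    probs = softmax(logits)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / len(logits)

def batch_entropy_factor(batch_logits):
    """Batch-level factor: normalized entropies averaged across the batch."""
    return sum(normalized_entropy(l) for l in batch_logits) / len(batch_logits)
```

Note that the maximum entropy for C classes is log C, so dividing by C (rather than log C) still leaves a mild dependence on the class count, which is the sensitivity the dropout-based scheme avoids.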
We explain our idea in more detail. First, we assume that distillation is more meaningful if the reliability of the prediction is guaranteed. The reliability check is done by perturbing the distribution and seeing how consistent it remains with the original distribution; that is, we check the robustness of the inference. Concretely, perturbation means assigning a dropout layer and measuring the KLD between the perturbed output and the original one, over multiple dropout scenarios. In the formula below, C denotes the number of output classes, Q the dropout layer set, p the dropout rate, N the number of dropouts, T the teacher output, and q(T) the output after passing through dropout layer q.
We also conducted a hyperparameter search with a toy numerical set, assuming a vector in which a center component is significantly higher than the others, which are uniformly distributed. We first found that the dropout-based confidence is relatively robust to C, compared to entropy, and also to N, in that increasing N smooths the overall curve without changing its tendency. We confirmed that the only factor that affects D is p, and after experiments, we empirically set N to 100 and p to 0.1.
We compare the training results with various baselines. Among the models that use distillation, we found that the baselines already performed well with triangular scheduling, or with the scheme using the error rate. The proposed method is not effective alone, without scheduling, showing performance similar to the phoneme posterior model or the entropy-based model. Up to that point, the ERNIE-based phoneme posterior model had shown the best performance. However, our method outperformed the baselines with decaying and triangular scheduling.
From the results, we conclude that confidence modeling works in cross-modal distillation for spoken language understanding when accompanied by a proper scheduling scheme. Reviewing the three strategies: the error rate concerns student performance, while the others concern teacher behavior. The error rate adapts the student to the gold label, while the others decide the KD weight independently of the varying student performance. This prevents the situation where the gold label might not be the true ‘answer’, causing overfitting. We also saw that confidence helps scheduled KD, and that scheduling enhances the utility of the dropout-based approach. This suggests the wide applicability of the proposed scheme, given that scheduling and automatic decision schemes are two independent factors deciding the influence of the KD loss.
In this study, we searched for schemes to manage the teacher’s influence in cross-modal distillation. We found that dropout-based confidence automatically induces the weight that decides the influence of the KD loss, in a sample-wise manner. The effect of dropout-based confidence and its alignment with scheduling strategies were verified on a public SLU dataset. Thanks for listening to our talk.