Cross-Modal Knowledge Distillation With
Dropout-Based Confidence
2022. 11. 9, @APSIPA ASC 2022
Won Ik Cho¹, Jeunghun Kim², Nam Soo Kim²
Samsung Advanced Institute of Technology¹
Department of ECE and INMC, Seoul National University²
Contents
• Motivation
• Task and Dataset
• Related Work
• Method
• Results
• Conclusion
1
Motivation
• Text and speech: Two main media of communication
• More difficult to train with speech
 Why?
• Scarce amount of data
• Difficult to control the generation
and storage of the recordings
2
“THIS IS A SPEECH”
Difference in search results for ‘English’ in the ELRA catalog
Motivation
• Pretrained language models
 Mainly developed for text-based systems
• ELMo, BERT, GPTs …
 Based on huge amounts of raw corpora
• Trained with simple but non-task-specific objectives
• How to leverage pretrained LMs in speech processing?
 Direct use?
• Only if the ASR output is accurate
 Training LMs with erroneous speech transcriptions?
• Okay, but cannot cover all possible cases, and requires scripts for various scenarios
 Distillation?
3
Motivation
• Teacher confidence
 How should knowledge distillation be managed depending on the
uncertainty of the inference?
4
Task and Dataset
• Task: Spoken language understanding
 Fluent Speech Commands (FSC)
• 16 kHz, single-channel; 30,043 audio files
• Each audio is labeled with three slots: action / object / location
• 248 different phrases spoken by 97 speakers (77/10/10 train/valid/test split)
• Multi-label classification problem (see the example after this slide)
 Why Fluent Speech Commands? (introduced in Lugosch et al., 2019)
• Google Speech Commands:
– Only short keywords, thus not an SLU task
• ATIS
– Not publicly available
• Grabo, Domotica, Patcor
– Free, but only a small number of speakers and phrases
• Snips audio
– A variety of phrases, but less audio
5
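For concreteness, here is a sketch of what one labeled entry looks like. The field names are hypothetical, not the dataset's exact schema; the utterance-to-slot mapping follows the example given in Lugosch et al. (2019).

```python
# One hypothetical FSC-style entry: an utterance mapped to three slot labels.
example = {
    "transcript": "Turn on the lights in the kitchen",
    "action": "activate",
    "object": "lights",
    "location": "kitchen",
}
```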
Related Work
• ASR-NLU pipelines
 Conventional approaches
 Best if an accurate ASR is guaranteed
 Easier to interpret issues and enhance individual modules
• End-to-end SLU
 Less prone to ASR errors
 Non-textual information might be preserved as well
• Pretrained LMs
 Take advantage of massive textual knowledge
 High performance, freely available modules
• Knowledge distillation
 Adaptive to various training schemes
 Cross-modal application is feasible
6
Related Work
• End-to-end SLU
 Lugosch et al. "Speech Model Pre-training for End-to-End Spoken Language Understanding." INTERSPEECH 2019.
8
Related Work
• Pretrained LMs
 Transformer architectures
9
Related Work
• End-to-end speech processing + PLM
 Chuang et al. "SpeechBERT: Cross-Modal Pre-Trained Language Model for End-to-End Spoken Question Answering." INTERSPEECH 2020.
10
Related Work
• End-to-end speech processing + KD
 Liu et al. "End-to-End Speech Translation with Knowledge Distillation." INTERSPEECH 2019.
11
Related Work
• End-to-end SLU + PLM + Cross-modal KD
 Cho et al., "Speech to text adaptation: Towards an efficient cross-modal distillation," INTERSPEECH 2020.
12
Related Work
• End-to-end SLU
 Backbone: Lugosch et al. (2019); a rough code sketch follows this slide
• Phoneme module (SincNet layer)
• Word module
– BiGRU-based, with dropout/pooling
• Intent module
– Sequential prediction of the three slots
– Also implemented with BiGRU
13
(Ravanelli and Bengio, 2018)
(From a previous version of Wang et al., 2020)
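The following is a rough PyTorch sketch of this backbone as described on the slide. Hidden sizes, the convolution settings, and the slot-logit count are placeholders, and the SincNet front-end (parameterized sinc filters) is approximated here by a plain Conv1d.

```python
import torch
import torch.nn as nn

class E2ESLUBackbone(nn.Module):
    """Rough sketch of the Lugosch et al. (2019) backbone described above;
    layer sizes are placeholders, not the paper's values."""
    def __init__(self, hidden=128, n_slot_logits=31):
        super().__init__()
        self.phoneme = nn.Sequential(                      # phoneme module
            nn.Conv1d(1, hidden, kernel_size=401, stride=160),  # stand-in for SincNet
            nn.ReLU(),
        )
        self.word = nn.GRU(hidden, hidden, bidirectional=True,
                           batch_first=True)               # word module (BiGRU)
        self.drop = nn.Dropout(0.5)                        # dropout between modules
        self.intent = nn.GRU(2 * hidden, hidden, bidirectional=True,
                             batch_first=True)             # intent module (BiGRU)
        self.head = nn.Linear(2 * hidden, n_slot_logits)   # action/object/location logits

    def forward(self, wav):                                # wav: (batch, 1, samples)
        feats = self.phoneme(wav).transpose(1, 2)          # (batch, frames, hidden)
        w, _ = self.word(feats)
        h, _ = self.intent(self.drop(w))
        return self.head(h[:, -1])                         # pooled final state -> logits

logits = E2ESLUBackbone()(torch.randn(2, 1, 16000))        # 1 s of 16 kHz audio
```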
Related Work
• End-to-end SLU
14
Related Work
• PLM
 Fine-tuning the pretrained model
• BERT-Base (Devlin et al., 2018)
– Bidirectional encoder representations from Transformers (BERT)
• Hugging Face PyTorch wrapper
15
Related Work
• PLM
 Fine-tuning with FSC ground-truth scripts (a minimal sketch follows this slide)
16
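A minimal sketch of such fine-tuning with the Hugging Face wrapper, assuming the three slots are flattened into a single intent id; the label count of 31 and the example label are placeholders, not the talk's stated setup.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=31)   # placeholder intent count

batch = tokenizer(["turn on the lights in the kitchen"],
                  return_tensors="pt", padding=True)
labels = torch.tensor([0])                # hypothetical intent id for this phrase
loss = model(**batch, labels=labels).loss
loss.backward()                           # optimizer step omitted for brevity
```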
Related Work
• Cross-modal KD
 Distillation as teacher-student learning
• Loss1 = f(answer, inference_student)
• Loss2 = g(inference_student, inference_teacher)
• Different input, same task?
– e.g., speech translation
17
Total Loss = Loss1 + Loss2
Distilled knowledge
(Liu et al., 2019)
Method
• Cross-modal KD
 What determines the loss?
• WHO TEACHES
– BERT-based text inference model
• HOW IS THE LOSS CALCULATED
– MAE, MSE between logits (see the sketch after this slide)
• HOW MUCH THE GUIDANCE INFLUENCES
– Time-dependent factors (scheduling)
– Sample/batch-level factors
18
(Cho et al., 2020)
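Put together, a minimal sketch of the resulting objective. The composition is assumed from this slide: cross-entropy to the gold label plus a λ-weighted MAE between student and teacher logits; the paper may combine the terms differently.

```python
import torch.nn.functional as F

def slu_total_loss(student_logits, teacher_logits, labels, lam):
    ce = F.cross_entropy(student_logits, labels)              # Loss1: vs. gold label
    kd = F.l1_loss(student_logits, teacher_logits.detach())   # Loss2: MAE between logits
    return ce + lam * kd                                      # lam: guidance weight
```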
Method
• Cross-modal KD
 How can we determine the influence of distillation?
• Scheduling (suggested in Cho et al., 2020; sketched after this slide)
– Decaying
– Triangular
• Sample/batch-level factors
– Error rate (per batch)
– Entropy (averaged across the batch; Kwon et al., 2020)
– Dropout-based confidence (averaged across the batch; proposed)
» For N dropout-perturbed outputs, the KL divergence from the original logits is averaged
19
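A sketch of the two time-dependent schedules for the KD weight λ; linear shapes are assumed here, and the exact functional forms in Cho et al. (2020) may differ.

```python
def decaying(epoch, total_epochs, lam_max=1.0):
    # KD weight decays linearly from lam_max toward 0 over training.
    return lam_max * (1.0 - epoch / total_epochs)

def triangular(epoch, total_epochs, lam_max=1.0):
    # KD weight rises linearly to lam_max at mid-training, then falls back to 0.
    half = total_epochs / 2.0
    return lam_max * (1.0 - abs(epoch - half) / half)
```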
Method
• Dropout-based confidence
 Idea: Distillation is more meaningful when the reliability of the prediction is guaranteed
• Reliability check – Perturb the distribution and check how consistent it remains with the original distribution (how robust is it?)
• Giving perturbation – Apply a dropout layer and compute the KLD between the perturbed output and the original one, over multiple dropout scenarios (see the sketch after this slide)
– C: # Output classes
– Q: Dropout layer set
– p: Dropout rate
– N: Number of dropouts
– T: Teacher output (original distribution)
– q(T): Output after dropout layer q
20
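From these definitions, a plausible form is D = (1/N) Σᵢ KL(T ∥ qᵢ(T)). Below is a minimal sketch under that assumption; the paper's exact normalization (e.g., by C) may differ.

```python
import torch
import torch.nn.functional as F

def dropout_confidence(teacher_logits, n=100, p=0.1):
    t_prob = F.softmax(teacher_logits, dim=-1)                # T
    kld = 0.0
    for _ in range(n):                                        # N dropout draws
        q_t = F.dropout(teacher_logits, p=p, training=True)   # q(T)
        kld += F.kl_div(F.log_softmax(q_t, dim=-1), t_prob,
                        reduction='batchmean')                # KL(T || q(T))
    return kld / n  # low D: perturbation-robust, hence confident, teacher
```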
Method
• Hyperparameter search
 Pilot with a toy set (reproduced in a sketch after this slide)
• D is relatively robust to C and N
– Unlike entropy, the dropout-based scheme does not depend on the number of output classes
– As N increases, the curve is smoothed, while the overall tendency is unaffected
• The only factor affecting D is p
– N=100 and p=0.1 were set empirically after experiments
21
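A toy check in the spirit of this pilot, reusing the dropout_confidence sketch above. The setup is assumed: one peaked center component over a uniform background, with the class count C varied.

```python
import torch

for C in (8, 31, 100):
    logits = torch.ones(1, C)
    logits[0, C // 2] = 6.0  # significantly higher center component
    print(C, float(dropout_confidence(logits, n=100, p=0.1)))
    # Expected per the slide: D varies little with C (and N); p is the main factor.
```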
Results
• Comparison with the baseline
 Baseline performs well with triangular scheduling (1.00%) and with the error rate (1.00%)
 The proposed method is not effective alone (1.05%), without scheduling
 The proposed method is most significant with decaying (0.97%) and triangular scheduling (0.92%)
22
Results
• Discussion
 Confidence modeling works
• Strategies
– Error rate – Student performance
– Entropy – Teacher inference distribution
– Dropout – Teacher confidence
• Error rate adapts the student to the gold label, while the others decide the weight independently of the varying student performance
• Prevents situations where the gold label might not be the true ‘answer’ and cause overfitting
 Confidence helps scheduled KD
• Wide applicability of the proposed strategy, even alongside mechanical scheduling schemes
23
Conclusion
• We searched for schemes to manage the teacher’s influence in cross-modal distillation
• Dropout-based confidence automatically induces the weight that decides the influence of the KD loss, in a sample-wise manner
• The effect of dropout-based confidence and its alignment with scheduling strategies are verified on a public SLU dataset
24
Thank you!
Editor's Notes

  1. Hi, this is Won Ik Cho from Samsung Advanced Institute of Technology. Sorry for participating remotely due to some travel issues. I will present our work on cross-modal knowledge distillation, done while I was at Seoul National University.
  2. First, we will wrap up the literature on spoken language understanding and cross-modal distillation with relevant works, and demonstrate how we tackle conventional cross-modal distillation with dropout-based confidence.
  3. First, we want to discuss the background of using textual information in speech tasks. Text and speech are the two main media of communication, but model training with speech is usually regarded as more difficult. The main reason is the scarce amount of data; furthermore, it is difficult to control the generation and storage of recordings, which again leads to the scarcity of speech data.
  4. In this regard, considering that speech carries the semantic and syntactic information that text data contains, it would be highly beneficial and efficient to utilize widely used pretrained language models in spoken language processing. They are mainly developed for text-based systems and are based on huge amounts of raw corpora as pretraining data. Usually they are trained with simple but non-task-specific objectives, in a self-supervised manner. Then, how can we leverage such pretrained LMs in speech and spoken language processing? If we are to use them directly, we can do so only on ASR output, assuming it is accurate. Or, we can train LMs on possibly erroneous speech transcriptions. However, this is quite heuristic and may depend on the ASR system used. Thus, we can think of distilling information from the pretrained or fine-tuned model, possibly in a task-specific manner.
  5. However, one caveat of distillation is that the teacher is not always fully confident about its inference, since there are always tricky examples that even teachers find difficult to infer. In this study, we tackle these cases: how should knowledge distillation be controlled in such uncertain situations? We take a look at how this is dealt with in cross-modal distillation for spoken language understanding.
  6. Here, our task is spoken language understanding, and we use Fluent Speech Commands, which is prevalently used for the SLU task. It consists of 30,043 16 kHz, single-channel audio files, where each audio is labeled with three slots: action, object, and location. It is composed of 248 different phrases spoken by 97 speakers, and is formulated as a multi-label classification problem. Compared to SLU datasets such as Google Speech Commands with only short keywords, ATIS which is not publicly available, Grabo with its simple composition, and Snips with a smaller amount of audio, Fluent Speech Commands seems the more appropriate candidate for testing spoken language understanding with a distillation technique.
  7. So far we've discussed what to do and how we intend to do it; now we look at how it has been done. First, with conventional ASR-NLU pipelines, as said, the highest performance can be obtained if an accurate ASR is guaranteed. Above all, it is easier to interpret issues and enhance the individual modules. In contrast, end-to-end SLU technologies are less prone to ASR errors, and non-textual information such as acoustics may be preserved as well.
  8. The next two points concern the literature on pretrained language models and how to utilize them. First, PLMs take advantage of massive textual knowledge, and high-performance, freely available modules are currently distributed. To leverage them, we bring in knowledge distillation techniques, which are adaptive to various training schemes and for which a cross-modal approach is feasible.
  9. Going deeper into end-to-end SLU, we mainly referred to Lugosch et al. (2019), where the Fluent Speech Commands dataset and its baseline were released. Phoneme-level and word-level classifiers, based on SincNet and RNNs respectively, were pretrained and used. The final intent inference module, which also builds on an RNN, predicts the slots, and the best performance was obtained by freezing all the pretrained layers. That is, the word-level posterior provided by the ASR-pretrained module largely helps the intent inference module, while still allowing an end-to-end approach.
  10. We also take a look at recent pretrained LMs, which started off from RNN-based ELMo and Transformer-based BERT. Especially the prevalent BERT, which is pretrained with the objectives of masked word prediction and sentence relevance checking, has shown its power on syntactic and semantic tasks, and here it is expected to boost the performance in understanding spoken language.
  11. These approaches started to be combined, as in SpeechBERT, where the text and the corresponding audio source are trained simultaneously so that the representation of speech utterances can take advantage of text understanding. This seems powerful for SLU, but the approach also requires a new format of LM pretraining.
  12. In view of knowledge distillation, Liu et al. suggested leveraging machine translation for the corresponding speech translation, using the inference of the teacher MT module as a KD loss that benefits the speech-based inference of the student ST module. Our task has two differences: first, ours seeks representations of text and speech within a single language, and second, the teacher and student have different architectures.
  13. We describe the approach of our previous paper below. The SLU module on the left takes audio as input, where the pretrained module yields a word posterior-level sequence that is in turn fed to the RNN-based intent prediction module. In this phase, the prediction is guided by the logit inference of the fine-tuned LM, whose input script corresponds to the given audio.
  14. In detail, the end-to-end SLU backbone is adopted from Lugosch et al., consisting of a SincNet-based phoneme module, a BiGRU-based word module with dropout and pooling, and an intent module that sequentially predicts the three slots and is also implemented with a BiGRU.
  15. The baseline experiment shows that the highest accuracy is obtained with the pretrained layers frozen, or with only the word layer unfrozen. Also, the result remained convincing with only 10% of the training dataset.
  16. Second, on the PLMs: to fine-tune BERT from a publicly available model, we adopted a Hugging Face PyTorch wrapper of the Google BERT.
  17. To fit the fine-tuning strategy, the original FSC ground-truth scripts and labels were transformed into the proper format. The fine-tuning was done in a straightforward manner over 50 epochs.
  18. Next, on cross-modal KD: we regard it as teacher-student learning where the KD loss is added to the original loss as distilled knowledge. By cross-modal we mean a different input modality, which can be considered a different input for the same task, in line with the previously discussed speech translation.
  19. In detail, we formulate our cross-modal knowledge distillation as below, as the weighted sum of a cross-entropy loss and a KD loss. Mainly three factors affect the final KD loss. First, who teaches is important; here we adopt the BERT-based text inference model used in Cho et al., which achieved an almost perfect score on the text test set with ground-truth labels. Next, which kind of loss is used also matters; referring again to the previous study, the MAE loss between logits is effective in this case. However, it is not trivial to determine the value of lambda. There are mainly two kinds of factors: time-dependent factors realized as scheduling, and sample- or batch-level factors such as the error rate.
  20. Let’s take a deeper look into determining the influence of distillation. In Cho et al., using the same dataset, decaying and triangular schedulings were used to manage the influence of the teacher. These are closer to mechanical control, compared to sample- or batch-level factors that change according to the samples actually contained in the batch. The error rate is calculated per batch, referring to the accuracy of the teacher model on that batch. Entropy is calculated from the logit distribution per sample and is divided by the number of output classes for normalization. Finally, the proposed measure, dropout-based confidence, returns the Kullback-Leibler divergence from the original logits, averaged over N, where N is the number of dropout-perturbed outputs.
  21. We explain our idea in more detail. First, we assume that distillation is more meaningful when the reliability of the prediction is guaranteed. Here, the reliability check is done by perturbing the distribution and checking how consistent it remains with the original distribution; that is, we check the robustness of the inference. Giving perturbation here means assigning a dropout layer and computing the KLD between the perturbed output and the original one, over multiple dropout scenarios. In the formula below, C denotes the number of output classes, Q the dropout layer set, p the dropout rate, N the number of dropouts, T the teacher output, and q(T) the output after passing through dropout layer q.
  22. We also conducted a hyperparameter search on a toy numerical set, assuming a vector in which a center component is significantly higher than the others, which are uniformly distributed. We first found that the dropout-based confidence is relatively robust to C, compared to entropy, and also to N, given that increasing N smooths the overall curve but does not change the tendency. We checked that the only factor affecting D is p, and after experiments we empirically set N to 100 and p to 0.1.
  23. We compare the training results with various baselines. First, among the models that use distillation, we found that the baselines already performed well with triangular scheduling, or in the scenario using the error rate. The proposed method seemed not effective on its own, without scheduling, showing performance similar to the phoneme posterior model or the entropy-based model. Up to that point, the ERNIE-based phoneme posterior model showed the best performance. However, our method outperformed the baselines with decaying and triangular scheduling.
  24. From the results, we want to convey that confidence modeling works in cross-modal distillation for spoken language understanding when accompanied by a proper scheduling scheme. Reviewing the three strategies: the error rate regards student performance, while the others concern teacher behavior. The error rate adapts the student to the gold label, while the others decide the KD weight independently of the varying student performance. This prevents situations where the gold label might not be the true ‘answer’ and cause overfitting. Also, we saw that confidence helps scheduled KD, and that scheduling in turn enhances the utility of the dropout-based approach. This suggests wide applicability of the proposed scheme, given that scheduling and automatic decision schemes are two independent factors deciding the influence of the KD loss.
  25. In our study, we searched for schemes to manage the teacher’s influence in cross-modal distillation. We found that dropout-based confidence automatically induces the weight that decides the influence of the KD loss, in a sample-wise manner. The effect of dropout-based confidence and its alignment with scheduling strategies were verified on a public SLU dataset. Thanks for listening to our talk.