Cross-Modal Knowledge Distillation With
Dropout-Based Confidence
2022. 11. 9, @APSIPA ASC 2022
Won Ik Cho¹, Jeunghun Kim², Nam Soo Kim²
Samsung Advanced Institute of Technology¹
Department of ECE and INMC, Seoul National University²
Contents
• Motivation
• Task and Dataset
• Related Work
• Method
• Results
• Conclusion
1
Motivation
• Text and speech: Two main media of communication
• More difficult to train with speech
 Why?
• Scarce amount of data
• Difficult to control the generation
and storage of the recordings
2
“THIS IS A SPEECH”
Difference in search results for ‘English’ in the ELRA catalog
Motivation
• Pretrained language models
 Mainly developed for text-based systems
• ELMo, BERT, GPTs …
 Based on huge amounts of raw corpora
• Trained with simple but non-task-specific objectives
• How to leverage pretrained LMs in speech processing?
 Direct use?
• Only if the ASR output is accurate
 Training LMs with erroneous speech transcriptions?
• Feasible, but cannot cover all possible cases, and requires scripts for various scenarios
 Distillation?
3
Motivation
• Teacher confidence
 How should knowledge distillation be managed depending on the
uncertainty of the inference?
4
Task and Dataset
• Task: Spoken language understanding
 Fluent Speech Commands (FSC)
• 16 kHz, single-channel; 30,043 audio files
• Each audio labeled with three slots: action / object / location
• 248 different phrases spoken by 97 speakers (77/10/10 train/valid/test split)
• Multi-label classification problem (an example follows this slide)
 Why Fluent Speech Commands? (introduced in Lugosch et al., 2019)
• Google Speech Commands:
– Only short keywords, thus not a full SLU task
• ATIS
– Not publicly available
• Grabo, Domotica, Patcor
– Free, but only a small number of speakers and phrases
• Snips audio
– Variety of phrases, but less audio
5
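To make the multi-label setup concrete, here is a minimal FSC-style example; the exact utterance and slot vocabulary below are representative rather than quoted from a specific file in the corpus.

```python
# One FSC-style item: an utterance mapped to three slot labels.
# Phrasing and label names are illustrative, not taken from a specific entry.
example = {
    "transcript": "Turn on the lights in the kitchen",
    "labels": {
        "action": "activate",
        "object": "lights",
        "location": "kitchen",
    },
}

# Each audio clip gets one value per slot, so the model jointly
# classifies the (action, object, location) triple for every utterance.
```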
Related Work
• ASR-NLU pipelines
 Conventional approaches
 Best if an accurate ASR is guaranteed
 Easier to interpret the issue and enhance partial modules
• End-to-end SLU
 Less prone to ASR errors
 Non-textual information might be preserved as well
• Pretrained LMs
 Take advantage of massive textual knowledge
 High-performance, freely available modules
• Knowledge distillation
 Adaptable to various training schemes
 Cross-modal application is possible
6
Related Work
• ASR-NLU pipelines
 Conventional approaches
 Best if an accurate ASR is guaranteed
 Easier to interpret the issue and enhance partial modules
• End-to-end SLU
 Less prone to ASR errors
 Non-textual information might be preserved as well
• Pretrained LMs
 Take advantage of massive textual knowledge
 High-performance, freely available modules
• Knowledge distillation
 Adaptable to various training schemes
 Cross-modal application is possible
7
Related Work
• End-to-end SLU
 Lugosch et al. "Speech Model Pre-training for End-to-End Spoken
Language Understanding." INTERSPEECH 2019.
8
Related Work
9
• Pretrained LMs
 Transformer architectures
Related Work
• End-to-end speech processing + PLM
 Chuang et al.
“SpeechBERT: Cross-Modal
Pre-Trained Language Model
for End-to-End Spoken Question
Answering." INTERSPEECH 2020.
10
Related Work
• End-to-end speech processing + KD
 Liu et al. "End-to-End
Speech Translation with Knowledge
Distillation." INTERSPEECH 2019.
11
Related Work
• End-to-end SLU + PLM + Cross-modal KD
 Cho et al., "Speech to text adaptation: Towards an efficient cross-modal distillation," INTERSPEECH 2020.
12
Related Work
• End-to-end SLU
 Backbone: Lugosch et al. (2019)
• Phoneme module (SincNet layer)
• Word module
– BiGRU-based, with dropout/pooling
• Intent module
– Sequential prediction of the three slots
– Also implemented with BiGRU (a simplified sketch follows this slide)
13
(Ravanelli and Bengio, 2018)
From previous ver. of Wang et al. (2020)
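A simplified sketch of the student pipeline above is given below. The SincNet front-end, layer sizes, and output dimension are placeholders standing in for the actual Lugosch et al. (2019) configuration, so this illustrates the phoneme → word → intent structure rather than the released model.

```python
import torch
import torch.nn as nn

class SLUStudent(nn.Module):
    """Lugosch-style SLU stack: phoneme module -> word module -> intent module.

    A plain Conv1d stands in for the SincNet layer, and the hidden sizes and
    output dimension are illustrative only.
    """
    def __init__(self, hidden=128, n_outputs=31):
        super().__init__()
        self.frontend = nn.Conv1d(1, 64, kernel_size=401, stride=160)  # stand-in for SincNet
        self.phoneme_rnn = nn.GRU(64, hidden, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.5)
        self.word_rnn = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.intent_rnn = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_outputs)  # logits over the slot values

    def forward(self, wav):                        # wav: (batch, samples)
        x = self.frontend(wav.unsqueeze(1))        # (batch, 64, frames)
        x = x.transpose(1, 2)                      # (batch, frames, 64)
        x, _ = self.phoneme_rnn(x)                 # phoneme module
        x, _ = self.word_rnn(self.dropout(x))      # word module with dropout
        x, _ = self.intent_rnn(self.dropout(x))    # intent module
        return self.classifier(x.mean(dim=1))      # pooled logits for the three slots
```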
Related Work
• End-to-end SLU
14
Related Work
• PLM
 Fine-tuning the pretrained model
• BERT-Base (Devlin et al., 2018)
– Bidirectional encoder representations from Transformers (BERT)
• Hugging Face PyTorch wrapper
15
Related Work
• PLM
 Fine-tuning with FSC ground-truth scripts! (a fine-tuning sketch follows this slide)
16
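A hedged sketch of the teacher side: fine-tuning BERT-Base on the FSC ground-truth transcripts with the Hugging Face `transformers` library. The three separate classification heads and their sizes are assumptions for illustration; the original work only specifies BERT-Base fine-tuned on the FSC scripts through a Hugging Face PyTorch wrapper.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class BertSlotTeacher(nn.Module):
    """BERT-Base with one classification head per slot (action/object/location).

    Head sizes are placeholders; the fine-tuning data would be the FSC
    ground-truth transcripts paired with their slot labels.
    """
    def __init__(self, n_action=6, n_object=14, n_location=4):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, n) for n in (n_action, n_object, n_location)]
        )

    def forward(self, input_ids, attention_mask):
        pooled = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        return [head(pooled) for head in self.heads]  # one logit vector per slot

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
batch = tokenizer(["Turn on the lights in the kitchen"], return_tensors="pt", padding=True)
teacher = BertSlotTeacher()
slot_logits = teacher(batch["input_ids"], batch["attention_mask"])  # later used as distillation targets
```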
Related Work
• Cross-modal KD
 Distillation as a teacher-student learning
• Loss1 = f(answer, inference_s)
• Loss2 = g(inference_s, inference_t)
• Different input, same task?
– e.g., speech translation
17
Total Loss = Loss1 + Loss2
Distilled knowledge
(Liu et al., 2019)
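Spelling out the formula on this slide (a sketch: subscripts s and t denote student and teacher, CE is cross-entropy against the gold answer, d is a distance between logits such as MAE or MSE, and the weight λ on the distillation term is introduced on the following Method slides):

```latex
\mathrm{Loss}_1 = f(\mathrm{answer},\, \mathrm{inference}_s) = \mathrm{CE}(y,\, \hat{y}_s)
\qquad
\mathrm{Loss}_2 = g(\mathrm{inference}_s,\, \mathrm{inference}_t) = d(z_s,\, z_t)
\qquad
\mathrm{Total\ Loss} = \mathrm{Loss}_1 + \lambda\,\mathrm{Loss}_2
```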
Method
• Cross-modal KD
 What determines the loss?
• WHO TEACHES
– BERT-based text inference model
• HOW IS THE LOSS CALCULATED
– MAE, MSE between logits
• HOW MUCH THE GUIDANCE INFLUENCES (the weight λ; a loss sketch follows this slide)
– Time-dependent factors (scheduling)
– Sample/batch-level factors
18
(Cho et al., 2020)
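A minimal sketch of how the teacher's guidance could enter training with the factors listed above: cross-entropy against the gold label plus an MAE term between student and teacher logits, weighted by λ. Function and variable names are mine; only the BERT teacher, the MAE-between-logits loss, and the existence of a weight λ come from the slide and Cho et al. (2020).

```python
import torch
import torch.nn.functional as F

def kd_training_loss(student_logits, teacher_logits, target, lam):
    """Task loss plus weighted logit-matching distillation loss.

    `lam` is the distillation weight; how it is set (scheduling and/or
    sample/batch-level confidence factors) is the subject of the next slides.
    """
    ce = F.cross_entropy(student_logits, target)       # Loss1: gold answer vs. student
    kd = F.l1_loss(student_logits, teacher_logits)     # Loss2: MAE between logits
    return ce + lam * kd

# Usage sketch with random tensors in place of real model outputs.
student_logits = torch.randn(8, 31, requires_grad=True)
teacher_logits = torch.randn(8, 31)
target = torch.randint(0, 31, (8,))
loss = kd_training_loss(student_logits, teacher_logits, target, lam=0.5)
loss.backward()
```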
Method
• Cross-modal KD
 How can we determine the influence of distillation?
• Scheduling (suggested in Cho et al., 2020)
– Decaying
– Triangular
• Sample/batch-level factors
– Error rate (per batch)
– Entropy (averaged across batch; Kwon et al., 2020)
– Dropout-based confidence (averaged across batch; proposed)
» For N dropout-perturbed outputs, the KLD from the original logits is averaged (sketches follow this slide)
19
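A sketch of the two kinds of factors above: time-dependent schedules for the KD weight, and a batch-level factor derived from the teacher's output entropy. The exact functional forms are illustrative; the decaying and triangular schedules follow the general shapes suggested in Cho et al. (2020), and the entropy-to-weight mapping is my assumption (the sketch normalizes by log C, the maximum entropy, rather than by C itself).

```python
import math
import torch
import torch.nn.functional as F

def decaying_lambda(epoch, total_epochs, lam_max=1.0):
    """KD weight that shrinks linearly over training (one plausible decaying schedule)."""
    return lam_max * (1.0 - epoch / total_epochs)

def triangular_lambda(epoch, total_epochs, lam_max=1.0):
    """KD weight that rises to a mid-training peak and then falls back."""
    half = total_epochs / 2.0
    return lam_max * (1.0 - abs(epoch - half) / half)

def entropy_factor(teacher_logits):
    """Batch-level factor from the teacher's output entropy, normalized by its maximum.

    Low average entropy (a peaked teacher distribution) yields a factor near 1.
    """
    probs = F.softmax(teacher_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    normalized = entropy / math.log(teacher_logits.size(-1))   # divide by log C
    return 1.0 - normalized.mean()
```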
Method
• Dropout-based confidence
 Idea: Distillation will be more meaningful if the reliability of the prediction is guaranteed
• Reliability check – Perturb the distribution and see how consistent it remains with the original distribution (how robust is it?)
• Giving perturbation – Apply a dropout layer and measure the KLD between the perturbed output and the original one, for multiple dropout scenarios (formalized after this slide)
– C: # Output classes
– Q: Dropout layer set
– p: Dropout rate
– N: Number of dropouts
– T: Teacher output (original distribution)
– q(T): Output after dropout layer q
20
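With the symbols defined above, one plausible reading of the proposed confidence is the KL divergence between the original teacher distribution and each of its N dropout-perturbed versions, averaged over the dropout scenarios; the exact normalization and KL direction in the paper's formula may differ.

```latex
D \;=\; \frac{1}{N} \sum_{q \in Q} \mathrm{KLD}\big(T \,\|\, q(T)\big), \qquad |Q| = N
```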
Method
• Hyperparameter searching
 Pilot study with a toy set
• D is relatively robust to C and N
– The dropout-based scheme does not depend on the number of output classes, unlike entropy
– As N increases, the curve is smoothed, while the overall tendency is not affected
• The only factor noticeably affecting D is p
– Empirically set N=100 and p=0.1 after experiments (a computation sketch follows this slide)
21
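A computation sketch of the dropout-based confidence using the empirically chosen N=100 and p=0.1. Applying dropout directly to the teacher output and mapping low divergence to a high KD weight are my assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def dropout_confidence(teacher_logits, n=100, p=0.1):
    """Average KLD between the teacher distribution and N dropout-perturbed versions.

    Smaller values mean the teacher's inference is more robust to perturbation.
    """
    ref_log_prob = F.log_softmax(teacher_logits, dim=-1)               # original distribution T
    ref_prob = ref_log_prob.exp()
    total = 0.0
    for _ in range(n):                                                  # N dropout scenarios q in Q
        perturbed = F.dropout(teacher_logits, p=p, training=True)      # q(T)
        pert_log_prob = F.log_softmax(perturbed, dim=-1)
        total = total + F.kl_div(pert_log_prob, ref_prob, reduction="batchmean")
    return total / n

teacher_logits = torch.randn(8, 31)
d = dropout_confidence(teacher_logits)      # averaged across the batch, as on the Method slide
kd_weight = torch.exp(-d)                   # illustrative mapping: low divergence -> strong distillation
```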
Results
• Comparison with the baseline
 Baseline performs well with triangular scheduling (1.00%) and error rate (1.00%)
 Proposed method is not effective alone (1.05%), without scheduling
 Proposed method is most significant with decaying (0.97%) and triangular scheduling (0.92%)
22
Results
• Discussion
 Confidence modeling works
• Strategies
– Error rate – Student performance
– Entropy – Teacher inference distribution
– Dropout – Teacher confidence
• Error rate adapts the student to the gold label, while the others decide the weight independently of the varying student performance
• Prevents situations where the gold label might not be the real ‘answer’ and would cause over-fitting
 Confidence helps scheduled KD
• Wide applicability of the proposed strategy, even in combination with mechanical scheduling schemes
23
Conclusion
• Searched for schemes to manage the teacher’s influence in cross-modal distillation
• Dropout-based confidence automatically induces the weight that decides the influence of the KD loss, in a sample-wise manner
• The effect of dropout-based confidence and its alignment with scheduling strategies are verified on a public SLU dataset
24
Thank you!

Editor's Notes

  1. Hi, this is Won Ik Cho from Samsung Advanced Institute of Technology. Sorry for participating remotely due to some travel issues. I will present the work on cross-modal knowledge distillation done while I was at Seoul National University.
  2. First, we will wrap up the literature on spoken language understanding and cross-modal distillation with relevant works, and demonstrate how we tackled the conventional cross-modal distillation with dropout-based confidence.
  3. First, we want to talk about the background of using textual information in speech tasks. Text and speech are the two main media of communication, but it is usually regarded that model training with speech is more difficult. The main reason is the scarce amount of data; furthermore, it is difficult to control the generation and storage of the recordings. This again leads to the scarcity of speech data.
  4. In this regard, considering that speech contains the semantic and syntactic information that language and text data contain, it would be much more beneficial and efficient to utilize widely used pretrained language models in spoken language processing. They are mainly developed for text-based systems, and are based on huge amounts of raw corpora as pretraining data. Usually they are trained with simple but non-task-specific objectives, in terms of self-supervised learning. Then, how can we leverage such pretrained LMs in speech and spoken language processing? If we are to use them directly, we can do so only via ASR output, assuming that it is accurate. Or, we can train LMs with probably erroneous speech transcriptions. However, this is quite heuristic and may depend on the ASR system that is used. Thus, we can think of distilling some information from the pretrained or fine-tuned model, possibly in a task-specific manner.
  5. However, one caveat of distillation is that the teacher is not always fully confident about its inference, since there are always some tricky examples that even teachers find difficult to infer. In this study, we tackle these cases: how should the knowledge distillation be controlled in these uncertain situations? We take a look at how this is dealt with within a cross-modal distillation for spoken language understanding.
  6. Here, our task is spoken language understanding, and we use Fluent Speech Commands, which is widely used for the SLU task. It consists of 16 kHz, single-channel audio, 30,043 files, where each audio clip is labeled with three slots: action, object, and location. It is composed of 248 different phrases spoken by 97 speakers, and it is formulated as a multi-label classification problem. Compared to SLU datasets such as Google Speech Commands with short keywords, ATIS which is not publicly available, Grabo with a simple composition, and Snips with a smaller amount of audio, Fluent Speech Commands seems to be a more appropriate candidate for testing the spoken language understanding task with respect to the distillation technique.
  7. So far we have discussed what to do and how we are going to do it, and now we take a look at how it has been done. First, for conventional ASR-NLU pipelines, as said, the highest performance can be obtained if an accurate ASR is guaranteed. Above all, it is easier to interpret issues and enhance the partial modules. In contrast, end-to-end SLU technologies are less prone to ASR errors, and non-textual information such as acoustics might be preserved as well.
  8. The next two items concern the literature on pretrained language models and how we are to utilize them. First, PLMs take advantage of massive textual knowledge, and currently, high-performance, freely available modules are distributed. Therefore, to leverage them, we bring in knowledge distillation technologies, which are adaptable to various training schemes and where a cross-modal approach is possible.
  9. Going deeper on end-to-end SLU, we mainly referred to Lugosch et al. 2019, where the Fluent Speech Commands dataset and its baseline were released. The phoneme-level and word-level classifiers, based on SincNet and RNN respectively, were pretrained and used. The final intent inference module, which is also based on an RNN, predicts the slots, and the best performance was obtained by freezing all the pretrained layers. That is, the word-level posterior provided by the ASR-pretrained module largely helps the intent inference module, while still allowing the end-to-end approach.
  10. Also, we take a look on recent pretrained LMs, which start off from RNN-based ELMo and Transformer-based BERT. Especially the prevalent BERT, which is pretrained with the objective of masked word prediction and sentence relevance checking, has shown its power among the syntactic and semantic tasks, and here it is expected to boost the performance in understanding spoken language.
  11. These approaches started to be combined, as in SpeechBERT, where the text and the corresponding audio source are trained simultaneously so that the representation of the speech utterances can take advantage of the text understanding. This seems powerful for SLU, but the approach also requires a new format of LM pretraining.
  12. In view of knowledge distillation, Liu et al. suggested leveraging machine translation for the corresponding speech translation, using the inference of the teacher MT module as a KD loss that benefits the speech-based inference of the student ST module. Our task has two differences: first, ours seeks representations of text and speech within a single language, and second, the teacher and student have different architectures.
  13. We describe the approach in our previous paper below. The left SLU module adopts audio as input, where the pretrained module yields a word posterior-level sequence which is again fed to the RNN-based intent prediction module. In this phase, the prediction is guided by the logit inference of the fine-tuned LM. The script corresponds to the given audio.
  14. In detail, the end-to-end SLU backbone is adopted from Lugosch et al., consisting of a SincNet-based phoneme module, a BiGRU-based word module with dropout and pooling, and an intent module which sequentially predicts the three slots and is also implemented with a BiGRU.
  15. The baseline experiment shows that the highest accuracy is obtained with the frozen model, or with at least the word layer unfrozen. Also, the result was still convincing with only 10% of the training dataset.
  16. Secondly on the PLMs, to fine-tune the BERT based on a publicly available model, we adopted a Hugging Face PyTorch wrapper of the Google BERT.
  17. To fit the fine-tuning strategy, the original FSC ground truth scripts and labels were transformed into the proper format. The fine-tuning was done in a straightforward manner and with 50 epochs.
  18. Next, on cross-modal KD, we regard it as a teacher-student learning where the KD loss is augmented to original loss as a distilled knowledge. By cross-modal we mean the different input modality, which can be considered as a different input with same task, in concurrence with the previously discussed speech translation.
  19. In detail, for our method of cross-modal knowledge distillation, we formulate it as below, as the weighted sum of the cross-entropy loss and the KD loss. Mainly three factors affect the final KD loss. First, who teaches is important, and here we adopt the BERT-based text inference model used in Cho et al., which achieved an almost perfect score on the text test set with ground-truth labels. Next, which kind of loss is used is also important; here we also refer to the previous study, and the MAE loss between logits is effective in this case. However, it is not trivial to determine the value of lambda. There are mainly two kinds of factors, namely time-dependent factors that are calculated as scheduling, and sample- or batch-level factors such as error rate.
  20. Let’s take a deeper look into determining the influence of distillation. In Cho et al., using the same dataset, decaying and triangular schedulings were used to manage the influence of the teacher. They are closer to mechanical control, compared to sample- or batch-level factors that change according to the samples that are actually contained in the dataset. The error rate is calculated per batch, referring to the accuracy of the teacher model on the batch. The entropy is calculated from the logit distribution per sample and is divided by the number of output classes for normalization. Next, the proposed measure, dropout-based confidence, returns the Kullback-Leibler divergence from the original logits averaged over N, where N is the number of dropout-perturbed outputs.
  21. We explain our idea in more detail. First, we thought that distillation will be more meaningful if the reliability of the prediction is guaranteed. Here, the reliability check is done by perturbing the distribution and seeing how consistent it is with the original distribution. That is, we check the robustness of the inference. Finally, giving perturbation here means assigning a dropout layer and measuring the KLD between the perturbed output and the original one, and conducting this for multiple dropout scenarios. In the formula below, C denotes the number of output classes, Q the dropout layer set, p the dropout rate, N the number of dropouts, T the teacher output, and q(T) the output after passing through the dropout layer q.
  22. We also conducted a hyperparameter search with a toy numerical set, assuming a vector where a center component is significantly high compared to the others, which are uniformly distributed. We first found that the dropout-based confidence is relatively robust to C, compared to when using entropy, and also to N, considering that increasing N smooths the total curve but does not change the tendency. We checked that the only factor that affects D here is p, and after experiments, we empirically set N to 100 and p to 0.1.
  23. We compare the training results with various baselines. First, among models that use distillation, we found that baseline models already performed well with triangular scheduling, or in the scenario using the error rate. It seemed that the proposed method is not effective alone without scheduling, showing similar performance to the phoneme posterior model or the entropy-based model. So far, the ERNIE-based phoneme posterior model had shown the best performance. However, our method outperformed the baselines with decaying and triangular scheduling.
  24. From the results, we want to say that confidence modeling works in cross-modal distillation for spoken language understanding when accompanied by a proper scheduling scheme. Reviewing the three strategies, the error rate regards the student performance while the others concern the teacher's behavior. The error rate adapts the student to the gold label, while the others decide the KD weight independently of the varying student performance. This prevents situations where the gold label might not be the real 'answer' and would cause overfitting. Also, we saw that confidence helps scheduled KD, and that scheduling enhances the utility of the dropout-based approach. This suggests the wide applicability of the proposed scheme, considering that scheduling and automatic decision schemes are two independent factors deciding the influence of the KD loss.
  25. In our study, we searched for schemes to manage the teacher's influence in cross-modal distillation. We found that dropout-based confidence automatically induces the weight that decides the influence of the KD loss, in a sample-wise manner. The effect of dropout-based confidence and its alignment with scheduling strategies were verified on a public SLU dataset. Thanks for listening to our talk.