This document presents a method for cross-modal knowledge distillation from text to speech models using dropout-based teacher confidence. It leverages large pre-trained language models trained on text for speech tasks by distilling knowledge from a text-based BERT model into an end-to-end speech understanding model. Dropout perturbations on the teacher's logits yield a confidence score for each sample, which weights the distillation loss and guides the training of the student speech model, especially when combined with scheduling strategies such as decaying and triangular schedules. Evaluated on a public spoken language understanding dataset, the method improves over baseline distillation, demonstrating the effectiveness of using teacher confidence to manage knowledge transfer across modalities.
1. Cross-Modal Knowledge Distillation With
Dropout-Based Confidence
2022. 11. 9, @APSIPA ASC 2022
Won Ik Cho¹, Jeunghun Kim², Nam Soo Kim²
Samsung Advanced Institute of Technology¹
Department of ECE and INMC, Seoul National University²
3. Motivation
• Text and speech: two main media of communication
• Training models with speech is more difficult
Why?
• Scarce amount of data
• Difficult to control the generation and storage of recordings
(Figure: “THIS IS A SPEECH”; difference in search results for ‘English’ in the ELRA catalog)
4. Motivation
• Pretrained language models
Mainly developed for text-based systems
• ELMo, BERT, GPTs …
Based on huge amounts of raw corpora
• Trained with simple but non-task-specific objectives
• How to leverage pretrained LMs in speech processing?
Direct use?
• Only if the ASR output is accurate
Training LMs with erroneous speech transcriptions?
• Okay, but cannot cover all possible cases, and requires scripts for various scenarios
Distillation?
Distillation?
6. Task and Dataset
• Task: Spoken language understanding
Fluent Speech Commands
• 16 kHz, single-channel; 30,043 audio files
• Each audio labeled with three slots: action / object / location
• 248 different phrases spoken by 97 speakers (77/10/10)
• Multi-label classification problem
Why Fluent Speech Commands? (introduced in Lugosch et al., 2019)
• Google Speech Commands:
– Only short keywords, thus not a full SLU task
• ATIS
– Not publicly available
• Grabo, Domotica, Patcor
– Free, but only a small number of speakers and phrases
• Snips audio
– Variety of phrases, but less audio
7. Related Work
• ASR-NLU pipelines
Conventional approaches
Best if an accurate ASR is guaranteed
Easier to interpret the issue and enhance partial modules
• End-to-end SLU
Less prone to ASR errors
Non-textual information might be preserved as well
• Pretrained LMs
Takes advantage of massive textual knowledge
High performance, freely available modules
• Knowledge distillation
Adaptive to various training schemes
Cross-modal application is possible
9. Related Work
• End-to-end SLU
Lugosch et al. "Speech Model Pre-training for End-to-End Spoken
Language Understanding." INTERSPEECH 2019.
11. Related Work
• End-to-end speech processing + PLM
Chuang et al. “SpeechBERT: Cross-Modal Pre-Trained Language Model for End-to-End Spoken Question Answering.” INTERSPEECH 2020.
12. Related Work
• End-to-end speech processing + KD
Liu et al. “End-to-End Speech Translation with Knowledge Distillation.” INTERSPEECH 2019.
13. Related Work
• End-to-end SLU + PLM + Cross-modal KD
Cho et al. “Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation.” INTERSPEECH 2020.
14. Related Work
• End-to-end SLU
Backbone: Lugosch et al. (2019)
• Phoneme module (SincNet layer)
• Word module
– BiGRU-based, with dropout/pooling
• Intent module
– Sequential prediction of the three slots
– Also implemented with BiGRU
(Ravanelli and Bengio, 2018)
From an earlier version of Wang et al. (2020)
16. Related Work
• PLM
Fine-tuning the pretrained model
• BERT-Base (Devlin et al., 2018)
– Bidirectional encoder representations from Transformers (BERT)
• Hugging Face PyTorch wrapper
18. Related Work
• Cross-modal KD
Distillation as a teacher-student learning
• Loss1 = f(answer, inference_s)
• Loss2 = g(inference_s, inference_t)
• Different input, same task?
– e.g., speech translation
Total Loss = Loss1 + Loss2
Distilled knowledge
(Liu et al., 2019)
19. Method
• Cross-modal KD
What determines the loss?
• WHO TEACHES
– BERT-based text inference model
• HOW IS THE LOSS CALCULATED
– MAE, MSE between logits
• HOW MUCH THE GUIDANCE INFLUENCES
– Time-dependent factors (scheduling)
– Sample/batch-level factors
(Cho et al., 2020)
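The loss formulation above can be sketched as follows. This is a minimal illustration with plain Python lists of logits; the function names and the scalar weight `lam` (supplied by scheduling and/or sample-level factors) are ours, not from the paper.

```python
import math

def cross_entropy(logits, gold):
    """Loss1: cross-entropy of the student's logits against the gold label."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[gold]

def mae_kd(student_logits, teacher_logits):
    """Loss2: mean absolute error between student and teacher logits,
    the distillation loss found effective in the previous study."""
    return sum(abs(s - t) for s, t in zip(student_logits, teacher_logits)) / len(student_logits)

def total_loss(student_logits, teacher_logits, gold, lam):
    """Total loss = Loss1 + lam * Loss2; lam controls the teacher's influence."""
    return cross_entropy(student_logits, gold) + lam * mae_kd(student_logits, teacher_logits)
```

With `lam = 0` the student trains on the gold labels alone; larger `lam` pulls the student's logits toward the teacher's.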
20. Method
• Cross-modal KD
How can we determine the influence of distillation?
• Scheduling (suggested in Cho et al., 2020)
– Decaying
– Triangular
• Sample/batch-level factors
– Error rate (per batch)
– Entropy (averaged across batch; Kwon et al., 2020)
– Dropout-based confidence (averaged across batch; proposed)
» For N dropout-perturbed outputs, the KLD with the original logits is averaged
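As a rough sketch of the two scheduling shapes: the exact parametrization in the paper may differ, so the linear ramps and the `total_steps` argument here are our assumptions.

```python
def decaying(step, total_steps):
    """Decaying schedule: the teacher's influence shrinks linearly to zero."""
    return max(0.0, 1.0 - step / total_steps)

def triangular(step, total_steps):
    """Triangular schedule: influence ramps up to its peak at the midpoint
    of training, then ramps back down."""
    half = total_steps / 2.0
    return step / half if step <= half else max(0.0, (total_steps - step) / half)
```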
21. Method
• Dropout-based confidence
Idea: Distillation is more meaningful if the reliability of the prediction is guaranteed
• Reliability check – perturb the distribution and see how consistent it remains with the original distribution (how robust is it?)
• Giving perturbation – assign a dropout layer and measure the KLD between the perturbed output and the original one (for multiple dropout scenarios)
– C: # Output classes
– Q: Dropout layer set
– p: Dropout rate
– N: Number of dropouts
– T: Teacher output (original distribution)
– q(T): Output after dropout layer q
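A minimal sketch of the quantity defined above, assuming inverted dropout is applied directly to the teacher logits; the exact placement of the dropout layer in the paper may differ, and the function names are ours.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kld(p_dist, q_dist, eps=1e-12):
    """KL divergence D(p || q) between two categorical distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p_dist, q_dist))

def dropout_confidence(teacher_logits, p=0.1, n=100, seed=0):
    """Average, over n dropout masks with rate p, of the KLD between the
    teacher's original distribution T and the perturbed one q(T).
    A smaller value means the prediction is more consistent under perturbation."""
    rng = random.Random(seed)
    base = softmax(teacher_logits)
    total = 0.0
    for _ in range(n):
        # inverted dropout: zero each logit with probability p, rescale survivors
        perturbed = [0.0 if rng.random() < p else x / (1.0 - p) for x in teacher_logits]
        total += kld(base, softmax(perturbed))
    return total / n
```

How this averaged KLD is mapped to the final per-sample weight on the KD loss is not spelled out in this sketch.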
22. Method
• Hyperparameter search
Pilot with a toy set
• D is relatively robust to C and N
– Unlike when using entropy, the dropout-based scheme does not depend on the number of output classes
– As N increases, the curve is smoothed, while the overall tendency is not affected
• The only factor affecting D is p
– Empirically set N=100 and p=0.1 after experiments
23. Results
• Comparison with the baseline
Baseline performs well with triangular scheduling (1.00%) and error rate (1.00%)
Proposed method is not effective alone (1.05%), without scheduling
Proposed method is most significant with decaying (0.97%) and triangular scheduling (0.92%)
24. Results
• Discussion
Confidence modeling works
• Strategies
– Error rate – Student performance
– Entropy – Teacher inference distribution
– Dropout – Teacher confidence
• Error rate adapts the student to the gold label, while the others decide the weight independently of the varying student performance
• This prevents the situation where the gold label might not be the true ‘answer’, causing overfitting
Confidence helps scheduled KD
• Wide applicability of the proposed strategy, even alongside mechanical scheduling schemes
25. Conclusion
• Search for schemes to manage the teacher’s influence in cross-modal distillation
• Dropout-based confidence automatically induces the weight that decides the influence of the KD loss, in a sample-wise manner
• The effect of dropout-based confidence and its alignment with scheduling strategies are verified on a public SLU dataset
Hi, this is Won Ik Cho from Samsung Advanced Institute of Technology. Sorry for participating remotely due to some travel issues. I will present our work on cross-modal knowledge distillation, done while I was at Seoul National University.
First, we will review the literature on spoken language understanding and cross-modal distillation with relevant works, and then demonstrate how we extend conventional cross-modal distillation with dropout-based confidence.
First, we want to talk about the background of using textual information in speech tasks. Text and speech are the two main media of communication, but model training with speech is usually regarded as more difficult. The main reason is the scarce amount of data; furthermore, it is difficult to control the generation and storage of recordings, which again contributes to the scarcity of speech data.
In this regard, considering that speech contains the semantic and syntactic information that text data contains, it would be beneficial and efficient to utilize widely used pretrained language models in spoken language processing. They are mainly developed for text-based systems and are based on huge amounts of raw corpora as pretraining data. Usually they are trained with simple but non-task-specific objectives, in a self-supervised manner. Then, how can we leverage such pretrained LMs in speech and spoken language processing? Direct use is possible only if the ASR output is assumed to be accurate. Alternatively, we can train LMs on possibly erroneous speech transcriptions, but this is quite heuristic and may depend on the ASR system that is used. Thus, we instead consider distilling information from the pretrained or fine-tuned model, possibly in a task-specific manner.
However, one issue with distillation is that the teacher is not always fully confident about its inference, since there are always tricky examples that even the teacher finds difficult. In this study, we tackle these cases: how should knowledge distillation be controlled in such uncertain situations? We examine how this can be handled within cross-modal distillation for spoken language understanding.
Here, our task is spoken language understanding, and we use Fluent Speech Commands, which is widely used for SLU. It consists of 30,043 single-channel 16 kHz audio files, where each audio is labeled with three slots: action, object, and location. It is composed of 248 different phrases spoken by 97 speakers, and the task is formulated as a multi-label classification problem. Compared to SLU datasets such as Google Speech Commands with only short keywords, ATIS which is not publicly available, Grabo with its simple composition, and Snips with less audio, Fluent Speech Commands seems the more appropriate candidate for testing spoken language understanding with distillation techniques.
So far we have discussed what to do and how we plan to do it; now we look at how it has been done. First, with conventional ASR-NLU pipelines, as noted, the highest performance can be obtained if an accurate ASR is guaranteed. Above all, it is easier to interpret issues and enhance individual modules. In contrast, end-to-end SLU approaches are less prone to ASR errors, and non-textual information such as acoustics might be preserved as well.
The next two items concern pretrained language models and how we utilize them. PLMs take advantage of massive textual knowledge, and high-performance, freely available modules are now distributed. To leverage them, we bring in knowledge distillation techniques, which are adaptive to various training schemes and where a cross-modal approach is possible.
Going deeper into end-to-end SLU, we mainly refer to Lugosch et al. (2019), where the Fluent Speech Commands dataset and its baseline were released. The phoneme-level and word-level classifiers, based on SincNet and RNNs respectively, were pretrained and used. The final intent inference module, also based on an RNN, predicts the slots, and the best performance was obtained by freezing all the pretrained layers. That is, the word-level posterior provided by the ASR-pretrained module largely helps the intent inference module, while still allowing an end-to-end approach.
We also look at recent pretrained LMs, starting from the RNN-based ELMo and the Transformer-based BERT. In particular the prevalent BERT, pretrained with the objectives of masked word prediction and sentence relevance checking, has shown its power across syntactic and semantic tasks, and here it is expected to boost performance in understanding spoken language.
These approaches started to be combined, as in SpeechBERT, where text and the corresponding audio are trained simultaneously so that the representation of speech utterances can benefit from text understanding. This seems powerful for SLU, but the approach also requires a new format of LM pretraining.
In terms of knowledge distillation, Liu et al. suggested leveraging machine translation for the corresponding speech translation task, using the inference of the teacher MT module as a KD loss that benefits the speech-based inference of the student ST module. Our task differs in two ways: first, ours seeks representations of text and speech within a single language; second, our teacher and student have different architectures.
We describe the approach of our previous paper below. The SLU module on the left takes audio as input; the pretrained module yields a word-posterior-level sequence, which is fed to the RNN-based intent prediction module. In this phase, the prediction is guided by the logit inference of the fine-tuned LM on the script that corresponds to the given audio.
In detail, the end-to-end SLU backbone is adopted from Lugosch et al., consisting of a SincNet-based phoneme module, a BiGRU-based word module with dropout and pooling, and an intent module that sequentially predicts the three slots and is also implemented with a BiGRU.
The baseline experiment shows that the highest accuracy is obtained with the pretrained layers frozen, or with at least the word layer unfrozen. The result remained convincing with only 10% of the training dataset.
Second, regarding the PLM, to fine-tune BERT from a publicly available model, we adopted the Hugging Face PyTorch wrapper of the Google BERT.
To fit the fine-tuning strategy, the original FSC ground-truth scripts and labels were transformed into the proper format. The fine-tuning was done in a straightforward manner, for 50 epochs.
Next, on cross-modal KD: we regard it as teacher-student learning where the KD loss is added to the original loss as distilled knowledge. By cross-modal we mean a different input modality, which can be considered as different input for the same task, in line with the previously discussed speech translation.
In detail, we formulate our cross-modal knowledge distillation as below, as a weighted sum of the cross-entropy loss and the KD loss. Mainly three factors affect the final KD loss. First, who teaches is important; here we adopt the BERT-based text inference model used in Cho et al., which achieved an almost perfect score on the text test set with ground-truth labels. Next, the kind of loss used is also important; referring again to the previous study, an MAE loss between logits is effective in this case. However, it is not trivial to determine the value of lambda. There are mainly two kinds of factors: time-dependent factors realized as scheduling, and sample- or batch-level factors such as the error rate.
Let us take a deeper look into determining the influence of distillation. In Cho et al., using the same dataset, decaying and triangular schedulings were used to manage the influence of the teacher. These are closer to mechanical control, compared to sample- or batch-level factors that change according to the samples actually contained in the dataset. The error rate is calculated per batch, referring to the accuracy of the teacher model on the batch. The entropy is calculated from the logit distribution per sample and divided by the number of output classes for normalization. Finally, the proposed factor, dropout-based confidence, returns the Kullback-Leibler divergence from the original logits averaged over N, where N is the number of dropout-perturbed outputs.
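The entropy factor described here can be sketched as follows; a minimal illustration in which we follow the talk's description of dividing the per-sample entropy by the number of output classes, with function names of our own choosing.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def normalized_entropy(logits):
    """Per-sample entropy of the teacher's output distribution, divided by
    the number of output classes: a flatter distribution gives a higher value."""
    probs = softmax(logits)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / len(logits)

def batch_entropy_factor(batch_logits):
    """Batch-level factor: normalized entropies averaged across the batch."""
    return sum(normalized_entropy(l) for l in batch_logits) / len(batch_logits)
```

Note that the maximum entropy for C classes is log C, so dividing by C (rather than log C) still leaves a mild dependence on the class count, which is the sensitivity the dropout-based scheme avoids.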
We explain our idea in more detail. First, we assume that distillation is more meaningful if the reliability of the prediction is guaranteed. The reliability check is done by perturbing the distribution and seeing how consistent it remains with the original distribution; that is, we check the robustness of the inference. Concretely, perturbation means assigning a dropout layer and measuring the KLD between the perturbed output and the original one, over multiple dropout scenarios. In the formula below, C denotes the number of output classes, Q the dropout layer set, p the dropout rate, N the number of dropouts, T the teacher output, and q(T) the output after passing through dropout layer q.
We also conducted a hyperparameter search with a toy numerical set, assuming a vector in which a center component is significantly higher than the others, which are uniformly distributed. We first found that the dropout-based confidence is relatively robust to C, compared to entropy, and also to N, in that increasing N smooths the overall curve without changing its tendency. We confirmed that the only factor that affects D is p, and after experiments, we empirically set N to 100 and p to 0.1.
We compare the training results with various baselines. Among the models that use distillation, we found that the baselines already performed well with triangular scheduling, or with the scheme using the error rate. The proposed method is not effective alone, without scheduling, showing performance similar to the phoneme posterior model or the entropy-based model. Up to that point, the ERNIE-based phoneme posterior model had shown the best performance. However, our method outperformed the baselines with decaying and triangular scheduling.
From the results, we conclude that confidence modeling works in cross-modal distillation for spoken language understanding when accompanied by a proper scheduling scheme. Reviewing the three strategies: the error rate concerns student performance, while the others concern teacher behavior. The error rate adapts the student to the gold label, while the others decide the KD weight independently of the varying student performance. This prevents the situation where the gold label might not be the true ‘answer’, causing overfitting. We also saw that confidence helps scheduled KD, and that scheduling enhances the utility of the dropout-based approach. This suggests the wide applicability of the proposed scheme, given that scheduling and automatic decision schemes are two independent factors deciding the influence of the KD loss.
In this study, we searched for schemes to manage the teacher’s influence in cross-modal distillation. We found that dropout-based confidence automatically induces the weight that decides the influence of the KD loss, in a sample-wise manner. The effect of dropout-based confidence and its alignment with scheduling strategies were verified on a public SLU dataset. Thanks for listening to our talk.