Investigating Pre-Trained SSL Models in Limited Data NLU
1. Fuse and Adapt: Investigating the Use of Pre-Trained
Self-Supervised Learning Models in Limited Data NLU
Problems
Shamane Siriwardhana
Supervised by Associate Professor Suranga Nanayakkara, Professor Mark Billinghurst & Dr Elliott Wen
2. Deep Learning and the Cake Analogy 2.0
(Yann LeCun, Chief AI Scientist at Meta)
Cherry - Reinforcement Learning
Icing - Supervised Learning
Cake - Self-Supervised Learning
3. Self-Supervised Learning (SSL)
● Is Self-Supervised Learning the future of AI?
72K GitHub stars since 2020 · 50K citations since 2019 · $5M to train a single model
4. SSL Workflow
● Eliminates the need for humans to label data.
● Models can use naturally available context as labels.
Pretext task, e.g., Masked Language Modeling
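As a rough illustration (not taken from the thesis), the sketch below shows how a masked-language-modeling pretext task turns naturally available context into labels; the masking rate and token ids are placeholder values.

```python
import torch

def mask_tokens(input_ids, mask_token_id, mask_prob=0.15):
    """Randomly hide tokens; the original ids become the labels (-100 elsewhere is ignored by the loss)."""
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mask_prob
    labels[~mask] = -100                     # only masked positions contribute to the loss
    corrupted = input_ids.clone()
    corrupted[mask] = mask_token_id          # replace masked positions with the [MASK] id
    return corrupted, labels

# Toy usage; a real setup would take the ids and the [MASK] id from a tokenizer.
ids = torch.randint(0, 1000, (2, 16))
masked_ids, labels = mask_tokens(ids, mask_token_id=103)
```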
5. High Availability of Pre-trained SSL Models
Pioneers in pre-training
Open-sourced model checkpoints
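For example, an open-sourced checkpoint can be reused in a few lines with the Hugging Face `transformers` library; "roberta-large" here is just one example of a publicly released model, and any other checkpoint would load the same way.

```python
# Loading an open-sourced pre-trained checkpoint with Hugging Face `transformers`.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModel.from_pretrained("roberta-large")

inputs = tokenizer("Self-supervised learning reuses pre-trained checkpoints.", return_tensors="pt")
outputs = model(**inputs)            # last_hidden_state: (batch, seq_len, 1024) for roberta-large
print(outputs.last_hidden_state.shape)
```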
6. The Focus of the Thesis: Utilization of Pre-trained SSL Models
7. Research Questions
Fusion of multimodal pre-trained SSL features and models
RQ1: How to fuse multimodal features extracted from frozen pre-trained SSL models?
RQ2: How to fuse two pre-trained transformer-based architectures in multimodal settings?
Domain adaptation of pre-trained SSL models with fine-tuning mechanisms
RQ3: How to adapt a generative pre-trained SSL model when there is no high-quality training data?
RQ4: How to domain-adapt compound neural architectures that consist of several pre-trained SSL models?
8. Fusion of Multimodal Features Extracted from Three Different Frozen Pre-trained SSL Models
Frozen models
Multimodal Emotion Recognition:
● Data is challenging to collect and annotate
RQ1
9. Multimodal Frozen SSL Networks
● FabNet - Video
○ Convolution-based architecture
○ Vector size: 256
○ Seq-len: frames in the video
● Wav2Vec - Speech
○ Temporal convolutions (TC)
○ Vector size: 512
○ Seq-len: strides in the TC
● RoBERTa - Text
○ Transformer
○ Vector size: 1024
○ Seq-len: number of words
RQ1
10. SSL-Embedding Fusion Transformer
Ablation studies on the CMU-MOSEI dataset
Model comparisons
Proposed transformer-based fusion
Multimodal emotion recognition with transformer-based self supervised feature
fusion (Siriwardhana et al. 2020)
RQ1
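A minimal sketch of the idea (not the exact SSL-Embedding Fusion Transformer from the paper): each frozen modality embedding (256/512/1024 dimensions, as listed on the previous slide) is projected into a shared space and fused with a self-attention encoder; the projection sizes, pooling, and number of classes are placeholder choices.

```python
import torch
import torch.nn as nn

class SSLFusion(nn.Module):
    """Illustrative fusion of frozen SSL features with different vector sizes and seq-lens."""
    def __init__(self, d_model=256, num_classes=7):
        super().__init__()
        # Per-modality projections into a shared space (dims taken from the slides).
        self.proj_video = nn.Linear(256, d_model)    # FabNet
        self.proj_audio = nn.Linear(512, d_model)    # Wav2Vec
        self.proj_text = nn.Linear(1024, d_model)    # RoBERTa
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)   # self-attention over all tokens
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, video, audio, text):
        tokens = torch.cat([self.proj_video(video),
                            self.proj_audio(audio),
                            self.proj_text(text)], dim=1)    # concatenate along the sequence axis
        fused = self.fusion(tokens).mean(dim=1)               # pool the fused sequence
        return self.classifier(fused)

# Toy inputs: (batch, seq_len, feature_dim) from the three frozen extractors.
logits = SSLFusion()(torch.randn(2, 20, 256), torch.randn(2, 40, 512), torch.randn(2, 12, 1024))
```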
11. Some Findings
❖ Dense SSL features extracted from different SSL models have robust
representational capabilities.
➢ They can be fused with transformer-based fusion mechanisms
➢ Self-attention plays an important role when combining sequential
embeddings
❖ Feature fusion while keeping the pre-trained models frozen is important
➢ When pre-trained models have a vast number of parameters
➢ E.g., GPT-3 consists of 175 billion parameters.
RQ1
Findings related to RQ1 have been presented as a journal paper in IEEE Access 2020: S. Siriwardhana, T. Kaluarachchi, M. Billinghurst and S. Nanayakkara,
"Multimodal Emotion Recognition With Transformer-Based Self Supervised Feature Fusion," in IEEE Access, vol. 8, pp. 176274-176285, 2020, doi:
10.1109/ACCESS.2020.3026823.
Impact Factor - 3.367
12. Fusion of Two Transformer Architectures in Multimodal Settings
● Represent different modalities with transformer-based pre-trained models
● Utilizing architectural properties in the fusion
RQ2
13. Transformer models
● RoBERTa (unfrozen) - Text
○ Transformer
○ Vector size: 1024
○ Seq-len: number of words
● Speech-BERT (unfrozen) - Speech
○ Transformer
○ Vector size: 1024
○ Seq-len: sampling frequency
RQ2
14. Shallow vs Co-attentional fusion
Shallow fusion · Co-attentional fusion
Fusion mechanisms
Model comparisons
Ablation studies
Jointly fine-tuning "BERT-like" self supervised models to improve multimodal speech emotion recognition (Siriwardhana et al. (2020))
RQ2
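A minimal sketch of shallow fusion under these assumptions: two unfrozen transformer encoders each provide a [CLS]-style pooled vector, which are concatenated and classified. The speech encoder below reuses a text checkpoint purely as a stand-in; the actual work uses a Speech-BERT model, and the number of classes is a placeholder.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class ShallowFusion(nn.Module):
    """Illustrative shallow fusion: concatenate the [CLS] vectors of two transformers."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("roberta-large")     # unfrozen text transformer
        # Stand-in only: replace with the actual Speech-BERT-style checkpoint used in the work.
        self.speech_encoder = AutoModel.from_pretrained("roberta-large")
        self.classifier = nn.Linear(1024 + 1024, num_classes)

    def forward(self, text_inputs, speech_inputs):
        # text_inputs / speech_inputs are tokenizer outputs (input_ids, attention_mask, ...).
        text_cls = self.text_encoder(**text_inputs).last_hidden_state[:, 0]      # [CLS] position
        speech_cls = self.speech_encoder(**speech_inputs).last_hidden_state[:, 0]
        return self.classifier(torch.cat([text_cls, speech_cls], dim=-1))
```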
15. Some Findings
❖ Pre-trained SSL models with transformer-based architectures can be fused together easily
➢ Employing unique properties like the [CLS] token
➢ Shallow fusion
❖ Transformer-based SSL models can be fine-tuned stably even with small amounts of data
➢ Fine-tuning is stable with lower learning rates
❖ The Transformer architecture is becoming increasingly ubiquitous in self-supervised learning
➢ Transformer-based models represent different data modalities
RQ2
Findings related to RQ2 have been presented as a full conference paper at Interspeech 2020: S. Siriwardhana, Reis A, Weerasakera R, Nanayakkara S. "Jointly
Fine-Tuning BERT-like Self Supervised Models to Improve Multimodal Speech Emotion Recognition." Proceedings of the Annual Conference of the
International Speech Communication Association, INTERSPEECH. Vol. 2020.
H index - 100
16. Domain adaptation of the generative BART model when high-quality training data is missing
Autobiographical Text Summarization
● Privacy issues
● Different language patterns
● Scarcity of records and gold-standard summaries
● BART transformer - generates text
● Works well for generation benchmarks
● Sequence-to-Sequence architecture
RQ3
17. Utilization of Reddit data and high-quality news data
Thread
Title
● News summarization
● Fundamental task
● Gold standard datasets
● Closely related dataset to the domain
● Titles only
RQ3
18. Mix Distribution Multitask Learning
● Fine-tuning BART for autobiographical summarization with:
○ Domain-specific weakly labeled dataset
○ Task-specific dataset with gold-standard labels
Model comparison · Factual consistency (FactCC)
Abstractive Summarization System for Autobiographical Text (Siriwardhana et al. (2022))
RQ3
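A hedged sketch of the mixing idea: while fine-tuning BART, batches are sampled either from the weakly labeled domain-specific set (e.g., Reddit thread-title pairs) or from the gold-standard news set. The sampling ratio, the placeholder data, and the omission of label padding masks are simplifications, not the paper's exact recipe.

```python
import random
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def make_batch(pairs):
    """pairs: list of (source_text, summary_text); pad-token masking in labels is omitted for brevity."""
    src = tokenizer([s for s, _ in pairs], return_tensors="pt", padding=True, truncation=True)
    tgt = tokenizer([t for _, t in pairs], return_tensors="pt", padding=True, truncation=True)
    return src, tgt.input_ids

# weak_pairs: domain-specific, weakly labeled (e.g., Reddit thread -> title);
# gold_pairs: task-specific gold-standard news summaries. Both are placeholders here.
def training_step(weak_pairs, gold_pairs, weak_ratio=0.5):
    pairs = weak_pairs if random.random() < weak_ratio else gold_pairs   # sample one distribution
    src, labels = make_batch(pairs)
    loss = model(**src, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```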
19. Human Studies
Abstractive Summarization System for Autobiographical Text (Siriwardhana et al. (2022))
Model comparison with MTurk participants
SummarizeMe (Digital Diary) - User study conducted with 75 users
RQ3
20. Some of the Findings
❖ SSL models like BART have strong language generation capabilities
➢ Such models have seen a large amount of data during pre-training
➢ BART-like models can perform well even without high-quality data
❖ Data-centric approaches are crucial when adapting to tasks like autobiographical text summarization
➢ Designing better mechanisms to make use of available domain-specific data
❖ Human studies are essential and beneficial for evaluating generative models
Findings related to RQ3 have been submitted as a journal paper to ISR 2022: S. Siriwardhana, Kaluarachchi T, Chithralekha G, Scholl P, Dissanayake V,
Nanayakkara S. ``SummarizeMe: Abstractive Summarization System for Autobiographical Text'' Information Systems Research (ISR) 2022
[Under review]
RQ3
21. Domain Adaptation of Compound Neural Architectures with Several Pre-trained SSL Models
● Retrieval Augmented Generation (RAG) model (Meta)
● Combines information retrieval and seq2seq generation
● DPR neural retriever and BART generator
RQ4
● Open Domain Question Answering (ODQA)
● Works well for Wikipedia-based knowledge bases
● Less work on domain adaptation of ODQA
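For reference, the open-sourced RAG checkpoint (a DPR retriever plus a BART generator) can be run roughly as below with the Hugging Face `transformers` library; exact arguments vary across library versions, and the dummy index is only for illustration.

```python
# Minimal sketch of running the released RAG checkpoint (DPR retriever + BART generator).
from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained("facebook/rag-sequence-nq",
                                         index_name="exact", use_dummy_dataset=True)
model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)

inputs = tokenizer("what is self-supervised learning?", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])      # retrieve passages, then generate
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```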
23. RAG-end2end and Introduction of an Auxiliary Signal
Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question
Answering (Siriwardhana et al - 2022)
End-to-end RAG retriever · Reconstruction auxiliary signal
RQ4
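Conceptually, RAG-end2end adds the reconstruction auxiliary signal as a second loss term while letting gradients reach the retriever's question encoder. The sketch below is an assumption-level illustration: `rag_model` is assumed to return a `.loss`, and the weighting and batch construction are placeholders, not the paper's exact formulation.

```python
# Illustrative training step: main QA loss plus an auxiliary reconstruction loss
# (regenerating a statement from the retrieved passages).
def rag_end2end_step(rag_model, qa_batch, reconstruction_batch, optimizer, aux_weight=0.5):
    qa_loss = rag_model(**qa_batch).loss                      # question -> answer
    recon_loss = rag_model(**reconstruction_batch).loss       # auxiliary: reconstruct a statement
    loss = qa_loss + aux_weight * recon_loss
    loss.backward()                                           # gradients also reach the retriever's
    optimizer.step()                                          # question encoder (end-to-end training)
    optimizer.zero_grad()
    return loss.item()
```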
24. End-to-end retriever training improves the domain adaptation
RAG-end2end and auxiliary signals can improve the overall results
Empowering further research in the paradigm of retrieval augmentation
RQ4
Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question Answering (Siriwardhana
et al - 2022)
25. Some of the Findings
❖ Different SSL pre-trained models can be combined to create effective RAG-like pipelines.
❖ Retrieval models play a vital role in the domain adaptation of RAG.
➢ Neural retrieval models like DPR benefit from domain-specific fine-tuning
since they are mainly trained with Wiki-based data.
❖ Auxiliary signals can improve the process of domain adaptation.
➢ A solution to the scarcity of domain-specific labeled data.
Findings related to RQ4 have been accepted as a journal paper in TACL 2022: Siriwardhana S, Weerasakera R, Kaluarachchi T, Wen E, Rana R, Nanayakkara S.
``Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open-Domain Question-Answering'' Transactions of the Association
for Computational Linguistics, TACL 2022 (to be presented at EMNLP 2022)
Impact Factor - 9.194
RQ4
27. Future Work
❖ Model compression techniques are important
➢ SSL model checkpoints are large
➢ Model Pruning and distillation
➢ Can support recent paradigms like federated learning
Model pruning · Federated learning · Knowledge distillation
❖ Human-centric model evaluation is gaining importance
➢ Hallucinations and factual consistency are a significant area
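As one concrete example of the distillation direction mentioned above, the standard knowledge-distillation objective blends temperature-softened teacher targets with the usual hard-label loss; the temperature and weighting below are placeholder values, not results from this thesis.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Standard knowledge-distillation objective: soft targets from the (large) teacher
    plus the hard-label cross-entropy; temperature and alpha are placeholders."""
    soft = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                    F.softmax(teacher_logits / temperature, dim=-1),
                    reduction="batchmean") * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy example: a 10-class task with batch size 4.
loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10), torch.randint(0, 10, (4,)))
```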
28. Future Work
❖ Retrieval Augmentation could play a significant role in the field of AI
➢ Could be an alternative to billion-dollar large pre-trained models
➢ Are we getting closer to human-like intelligence?
Retrieval augmentation lets models go beyond parametric memory (Source: RETRO, DeepMind (2022))
29. Directly related publications
● S. Siriwardhana, T. Kaluarachchi, M. Billinghurst and S. Nanayakkara, "Multimodal Emotion Recognition With Transformer-
Based Self Supervised Feature Fusion," in IEEE Access 2020.
● Siriwardhana S, Reis A, Weerasakera R, Nanayakkara S. ``Jointly Fine-Tuning BERT-like Self Supervised Models to
Improve Multimodal Speech Emotion Recognition.'' Proceedings of the International Speech Communication Association,
INTERSPEECH. 2020.
● Siriwardhana S, Kaluarachchi T, Scholl P, Dissanayake V, Nanayakkara S. ``SummarizeMe: Abstractive Summarization
System for Autobiographical Text'' Information Systems Research (ISR) 2022 [Under review]
● Siriwardhana S, Weerasakera R, Kaluarachchi T, Wen E, Rana R, Nanayakkara S. ``Improving the Domain Adaptation of
Retrieval Augmented Generation (RAG) Models for Open-Domain Question-Answering'' Transactions of the Association for
Computational Linguistics, TACL 2022 (to be presented at EMNLP 2022)
Other Publications
● Wen, E., Kaluarachchi, T., Siriwardhana, S., Tang, V., Billinghurst, M., Lindeman, R.W., Yao, R., Lin, J. and
Nanayakkara, S.C., 2022. VRhook: A Data Collection Tool for VR Motion Sickness Research. Proceedings of the Annual
Conference of the User Interface Software and Technology UIST ’22.
● Kaluarachchi, T., Siriwardhana, S., Wen, E., and Nanayakkara, S., A Corneal Surface Reflections-Based Intelligent
System for Lifelogging Application. International Journal of Human Computer Interaction (IJHCI) 22(4), [Under Review]
30. ❖ My supervisors
➢ Prof. Suranga Nanayakkara
➢ Prof. Mark Billinghurst
➢ Dr. Elliott Wen
❖ Examination committee members
➢ Assoc Prof Kwan Hui Lim
➢ Assoc Prof Alan Wang
❖ The University of Auckland Doctoral Scholarship Programme
❖ All my Co-authors and lab members
Acknowledgements
36. Difference between IMA and Co-attention
● Co-attention does not need modifications such as adding a class token or a few transformer layers
IMA modification · Co-attention
38. DL features vs SSL features
● CNN Features off-the-shelf: an Astounding Baseline for Recognition (2014)
● PASS: An ImageNet replacement for self-supervised pretraining without humans
(2021)
● Efficient Self-supervised Vision Transformers for Representation Learning (2022)
(“When transferring to downstream linear classification tasks, EsViT outperforms its
supervised counterpart on 17 out of 18 datasets. ”)
● Transfer Learning or Self-supervised Learning? A Tale of Two Pretraining Paradigms
(2019)
● How Well Do Self-Supervised Models Transfer? (CVPR2022)
40. GPT-3 (Open AI)
● $12M to train
● 175 billion parameters (365 GB)
● The bigger, the better
Editor's Notes
Deep Learning is important
It has its limitations
SSL as a savior
It is becoming so popular
What is SSL
It has two phases
What is a pretext task
The first phase is pretext training, which can take a lot of computational power
But we do not have to worry about pre-training all the time
Major research labs open-source these models
The focus of the thesis is the utilization of pre-trained models
On downstream tasks, especially where we do not have much training data
Research questions are separated into two main areas: fusion and adaptation
How to utilize features extracted from pre-trained frozen SSL models
Conducted my experiments in multimodal emotion recognition
Why multimodal? Because it is a challenging area
I used three different frozen networks with different vector sizes and seq-lengths, so it is not trivial to connect these dense embeddings
So I introduced a transformer-based fusion mechanism. It showed competitive results
Motivated by the first research question, I conducted the experiments on multimodal emotion recognition
When SSL first appeared, the Transformer architecture was mainly for text
Then it was introduced to speech
So can we use some special properties like the [CLS] token
Two transformers to represent both text and speech
Having the same transformer architecture could help the fusion and improve the results
Two fusion mechanisms, one employing the direct architectural properties
Results on the IEMOCAP dataset
How DL has improved the field of NLU, but why there are still problems due to the scarcity of data.