Investigating Pre-Trained SSL Models in Limited Data NLU
1. Fuse and Adapt: Investigating the Use of Pre-Trained
Self-Supervised Learning Models in Limited Data NLU
Problems
Shamane Siriwardhana
Supervised by Associate Professor Suranga Nanayakkara, Professor Mark Billinghurst & Dr Elliott Wen
2. Deep Learning and the Cake Analogy 2.0
(Yann LeCun, Chief AI Scientist at Meta)
Cherry - Reinforcement Learning
Icing - Supervised Learning
Cake - Self-Supervised Learning
3. Self-Supervised Learning (SSL)
● Is Self-Supervised Learning the future of AI?
72K GitHub stars since 2020 · 50K citations since 2019 · $5M to train a single model
4. SSL Workflow
● Eliminates the need for humans to label data.
● Models can use naturally available context as labels.
Pretext task, e.g., Masked Language Modeling
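As a rough illustration (not taken from the thesis), the sketch below shows how a masked-language-modeling pretext task turns naturally available context into labels; the masking rate and token ids are placeholder values.

```python
import torch

def mask_tokens(input_ids, mask_token_id, mask_prob=0.15):
    """Randomly hide tokens; the original ids become the labels (-100 elsewhere is ignored by the loss)."""
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mask_prob
    labels[~mask] = -100                     # only masked positions contribute to the loss
    corrupted = input_ids.clone()
    corrupted[mask] = mask_token_id          # replace masked positions with the [MASK] id
    return corrupted, labels

# Toy usage; a real setup would take the ids and the [MASK] id from a tokenizer.
ids = torch.randint(0, 1000, (2, 16))
masked_ids, labels = mask_tokens(ids, mask_token_id=103)
```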
5. High Availability of Pre-trained SSL Models
Pioneers in pre-training
Open-sourced model checkpoints
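For example, an open-sourced checkpoint can be reused in a few lines with the Hugging Face `transformers` library; "roberta-large" here is just one example of a publicly released model, and any other checkpoint would load the same way.

```python
# Loading an open-sourced pre-trained checkpoint with Hugging Face `transformers`.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModel.from_pretrained("roberta-large")

inputs = tokenizer("Self-supervised learning reuses pre-trained checkpoints.", return_tensors="pt")
outputs = model(**inputs)            # last_hidden_state: (batch, seq_len, 1024) for roberta-large
print(outputs.last_hidden_state.shape)
```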
6. The Focus of the Thesis: Utilization of Pre-trained SSL Models
7. Research Questions
Fusion of multimodal pre-trained SSL features and models
RQ1: How to fuse multimodal features extracted from frozen pre-trained SSL models?
RQ2: How to fuse two pre-trained transformer-based architectures in multimodal settings?
Domain adaptation of pre-trained SSL models with fine-tuning mechanisms
RQ3: How to adapt a generative pre-trained SSL model when there is no high-quality training data?
RQ4: How to domain-adapt compound neural architectures that consist of several pre-trained SSL models?
8. Fusion of Multimodal Features Extracted from Three Different Frozen Pre-trained SSL Models
Frozen models
Multimodal Emotion Recognition:
● Data is challenging to collect and annotate
RQ1
9. Multimodal Frozen SSL Networks
● FabNet - Video
○ Convolution-based architecture
○ Vector size: 256
○ Seq-len: frames in the video
● Wav2Vec - Speech
○ Temporal convolutions (TC)
○ Vector size: 512
○ Seq-len: strides in the TC
● RoBERTa - Text
○ Transformer
○ Vector size: 1024
○ Seq-len: number of words
RQ1
10. SSL-Embedding Fusion Transformer
Ablation studies on the CMU-MOSEI dataset
Model comparisons
Proposed transformer-based fusion
Multimodal emotion recognition with transformer-based self supervised feature
fusion (Siriwardhana et al. 2020)
RQ1
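A minimal sketch of the idea (not the exact SSL-Embedding Fusion Transformer from the paper): each frozen modality embedding (256/512/1024 dimensions, as listed on the previous slide) is projected into a shared space and fused with a self-attention encoder; the projection sizes, pooling, and number of classes are placeholder choices.

```python
import torch
import torch.nn as nn

class SSLFusion(nn.Module):
    """Illustrative fusion of frozen SSL features with different vector sizes and seq-lens."""
    def __init__(self, d_model=256, num_classes=7):
        super().__init__()
        # Per-modality projections into a shared space (dims taken from the slides).
        self.proj_video = nn.Linear(256, d_model)    # FabNet
        self.proj_audio = nn.Linear(512, d_model)    # Wav2Vec
        self.proj_text = nn.Linear(1024, d_model)    # RoBERTa
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)   # self-attention over all tokens
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, video, audio, text):
        tokens = torch.cat([self.proj_video(video),
                            self.proj_audio(audio),
                            self.proj_text(text)], dim=1)    # concatenate along the sequence axis
        fused = self.fusion(tokens).mean(dim=1)               # pool the fused sequence
        return self.classifier(fused)

# Toy inputs: (batch, seq_len, feature_dim) from the three frozen extractors.
logits = SSLFusion()(torch.randn(2, 20, 256), torch.randn(2, 40, 512), torch.randn(2, 12, 1024))
```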
11. Some Findings
❖ Dense SSL features extracted from different SSL models have robust
representational capabilities.
➢ They can be fused with transformer-based fusion mechanisms
➢ Self-attention plays an important role when combining sequential
embeddings
❖ Feature fusion while keeping the pre-trained models frozen is important
➢ When pre-trained models have a vast number of parameters
➢ E.g., GPT-3 consists of 175 billion parameters.
RQ1
Findings related to RQ1 have been presented as a journal paper in IEEE Access 2020: S. Siriwardhana, T. Kaluarachchi, M. Billinghurst and S. Nanayakkara,
"Multimodal Emotion Recognition With Transformer-Based Self Supervised Feature Fusion," in IEEE Access, vol. 8, pp. 176274-176285, 2020, doi:
10.1109/ACCESS.2020.3026823.
Impact Factor - 3.367
12. Fusion of Two Transformer Architectures in Multimodal Settings
● Represent different modalities with transformer-based pre-trained models
● Utilizing architectural properties in the fusion
RQ2
13. Transformer models
● RoBERTa (unfrozen) - Text
○ Transformer
○ Vector size: 1024
○ Seq-len: number of words
● Speech-BERT (unfrozen) - Speech
○ Transformer
○ Vector size: 1024
○ Seq-len: sampling frequency
RQ2
14. Shallow vs Co-attentional fusion
Shallow fusion · Co-attentional fusion
Fusion mechanisms
Model comparisons
Ablation studies
Jointly fine-tuning "BERT-like" self supervised models to improve multimodal speech emotion recognition (Siriwardhana et al. (2020))
RQ2
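A minimal sketch of shallow fusion under these assumptions: two unfrozen transformer encoders each provide a [CLS]-style pooled vector, which are concatenated and classified. The speech encoder below reuses a text checkpoint purely as a stand-in; the actual work uses a Speech-BERT model, and the number of classes is a placeholder.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class ShallowFusion(nn.Module):
    """Illustrative shallow fusion: concatenate the [CLS] vectors of two transformers."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("roberta-large")     # unfrozen text transformer
        # Stand-in only: replace with the actual Speech-BERT-style checkpoint used in the work.
        self.speech_encoder = AutoModel.from_pretrained("roberta-large")
        self.classifier = nn.Linear(1024 + 1024, num_classes)

    def forward(self, text_inputs, speech_inputs):
        # text_inputs / speech_inputs are tokenizer outputs (input_ids, attention_mask, ...).
        text_cls = self.text_encoder(**text_inputs).last_hidden_state[:, 0]      # [CLS] position
        speech_cls = self.speech_encoder(**speech_inputs).last_hidden_state[:, 0]
        return self.classifier(torch.cat([text_cls, speech_cls], dim=-1))
```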
15. Some Findings
❖ Pre-trained SSL models with transformer-based architectures can be fused together easily
➢ Employing unique properties like the [CLS] token
➢ Shallow fusion
❖ Transformer-based SSL models can be fine-tuned stably even with small amounts of data
➢ Fine-tuning is stable with lower learning rates
❖ The Transformer architecture is becoming increasingly ubiquitous in self-supervised learning
➢ Transformer-based models represent different data modalities
RQ2
Findings related to RQ2 have been presented as a full conference paper at Interspeech 2020: S. Siriwardhana, Reis A, Weerasakera R, Nanayakkara S. "Jointly
Fine-Tuning BERT-like Self Supervised Models to Improve Multimodal Speech Emotion Recognition." Proceedings of the Annual Conference of the
International Speech Communication Association, INTERSPEECH. Vol. 2020.
H index - 100
16. Domain adaptation of the generative BART model when high-quality training data is missing
Autobiographical Text Summarization
● Privacy issues
● Different language patterns
● Scarcity of records and gold-standard summaries
● BART transformer - generates text
● Works well for generation benchmarks
● Sequence-to-Sequence architecture
RQ3
17. Utilization of Reddit data and high-quality news data
Thread
Title
● News summarization
● Fundamental task
● Gold standard datasets
● Closely related dataset to the domain
● Titles only
RQ3
18. Mix Distribution Multitask Learning
● Fine-tuning BART for autobiographical summarization with:
○ Domain-specific weakly labeled dataset
○ Task-specific dataset with gold-standard labels
Model comparison · Factual consistency (FactCC)
Abstractive Summarization System for Autobiographical Text (Siriwardhana et al. (2022))
RQ3
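A hedged sketch of the mixing idea: while fine-tuning BART, batches are sampled either from the weakly labeled domain-specific set (e.g., Reddit thread-title pairs) or from the gold-standard news set. The sampling ratio, the placeholder data, and the omission of label padding masks are simplifications, not the paper's exact recipe.

```python
import random
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def make_batch(pairs):
    """pairs: list of (source_text, summary_text); pad-token masking in labels is omitted for brevity."""
    src = tokenizer([s for s, _ in pairs], return_tensors="pt", padding=True, truncation=True)
    tgt = tokenizer([t for _, t in pairs], return_tensors="pt", padding=True, truncation=True)
    return src, tgt.input_ids

# weak_pairs: domain-specific, weakly labeled (e.g., Reddit thread -> title);
# gold_pairs: task-specific gold-standard news summaries. Both are placeholders here.
def training_step(weak_pairs, gold_pairs, weak_ratio=0.5):
    pairs = weak_pairs if random.random() < weak_ratio else gold_pairs   # sample one distribution
    src, labels = make_batch(pairs)
    loss = model(**src, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```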
19. Human Studies
Abstractive Summarization System for Autobiographical Text (Siriwardhana et al. (2022))
Model comparison with MTurk participants
SummarizeMe (Digital Diary) - User study conducted with 75 users
RQ3
20. Some of the Findings
❖ SSL models like BART have strong language generation capabilities
➢ Such models have seen a large amount of data during pre-training
➢ BART-like models can perform well even without high-quality data
❖ Data-centric approaches are crucial when adapting to tasks like autobiographical text summarization
➢ Designing better mechanisms to make use of available domain-specific data
❖ Human studies are essential and beneficial for evaluating generative models
Findings related to RQ3 have been submitted as a journal paper to ISR 2022: S. Siriwardhana, Kaluarachchi T, Chithralekha G, Scholl P, Dissanayake V,
Nanayakkara S. ``SummarizeMe: Abstractive Summarization System for Autobiographical Text'' Information Systems Research (ISR) 2022
[Under review]
RQ3
21. Domain Adaptation of Compound Neural Architectures with Several Pre-trained SSL Models
● Retrieval Augmented Generation (RAG) model (Meta)
● Combines information retrieval and seq2seq generation
● DPR neural retriever and BART generator
RQ4
● Open Domain Question Answering (ODQA)
● Works well for Wikipedia-based knowledge bases
● Less work on domain adaptation of ODQA
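For reference, the open-sourced RAG checkpoint (a DPR retriever plus a BART generator) can be run roughly as below with the Hugging Face `transformers` library; exact arguments vary across library versions, and the dummy index is only for illustration.

```python
# Minimal sketch of running the released RAG checkpoint (DPR retriever + BART generator).
from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained("facebook/rag-sequence-nq",
                                         index_name="exact", use_dummy_dataset=True)
model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)

inputs = tokenizer("what is self-supervised learning?", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])      # retrieve passages, then generate
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```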
23. RAG-end2end and Introduction of an Auxiliary Signal
Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question
Answering (Siriwardhana et al - 2022)
End-to-end RAG retriever · Reconstruction auxiliary signal
RQ4
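Conceptually, RAG-end2end adds the reconstruction auxiliary signal as a second loss term while letting gradients reach the retriever's question encoder. The sketch below is an assumption-level illustration: `rag_model` is assumed to return a `.loss`, and the weighting and batch construction are placeholders, not the paper's exact formulation.

```python
# Illustrative training step: main QA loss plus an auxiliary reconstruction loss
# (regenerating a statement from the retrieved passages).
def rag_end2end_step(rag_model, qa_batch, reconstruction_batch, optimizer, aux_weight=0.5):
    qa_loss = rag_model(**qa_batch).loss                      # question -> answer
    recon_loss = rag_model(**reconstruction_batch).loss       # auxiliary: reconstruct a statement
    loss = qa_loss + aux_weight * recon_loss
    loss.backward()                                           # gradients also reach the retriever's
    optimizer.step()                                          # question encoder (end-to-end training)
    optimizer.zero_grad()
    return loss.item()
```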
24. End-to-end retriever training improves the domain adaptation
RAG-end2end and auxiliary signals can improve the overall results
Empowering further research in the paradigm of retrieval augmentation
RQ4
Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question Answering (Siriwardhana
et al - 2022)
25. Some of the Findings
❖ Different SSL pre-trained models can be combined to create effective RAG-like pipelines.
❖ Retrieval models play a vital role in the domain adaptation of RAG.
➢ Neural retrieval models like DPR benefit from domain-specific fine-tuning
since they are mainly trained with Wiki-based data.
❖ Auxiliary signals can improve the process of domain adaptation.
➢ A solution to the scarcity of domain-specific labeled data.
Findings related to RQ4 have been accepted as a journal paper in TACL 2022: Siriwardhana S, Weerasakera R, Kaluarachchi T, Wen E, Rana R, Nanayakkara S.
``Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open-Domain Question-Answering'' Transactions of the Association
for Computational Linguistics, TACL 2022 (to be presented at EMNLP 2022)
Impact Factor - 9.194
RQ4
27. Future Work
❖ Model compression techniques are important
➢ SSL model checkpoints are large
➢ Model Pruning and distillation
➢ Can support recent paradigms like federated learning
Model pruning · Federated learning · Knowledge distillation
❖ Human-centric model evaluation is gaining importance
➢ Hallucinations and factual consistency are a significant area
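As one concrete example of the distillation direction mentioned above, the standard knowledge-distillation objective blends temperature-softened teacher targets with the usual hard-label loss; the temperature and weighting below are placeholder values, not results from this thesis.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Standard knowledge-distillation objective: soft targets from the (large) teacher
    plus the hard-label cross-entropy; temperature and alpha are placeholders."""
    soft = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                    F.softmax(teacher_logits / temperature, dim=-1),
                    reduction="batchmean") * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy example: a 10-class task with batch size 4.
loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10), torch.randint(0, 10, (4,)))
```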
28. Future Work
❖ Retrieval Augmentation could play a significant role in the field of AI
➢ Could be an alternative to billion-dollar large pre-trained models
➢ Are we getting closer to human-like intelligence?
Retrieval augmentation lets models go beyond parametric memory (Source: RETRO, DeepMind (2022))
29. Directly related publications
● S. Siriwardhana, T. Kaluarachchi, M. Billinghurst and S. Nanayakkara, "Multimodal Emotion Recognition With Transformer-
Based Self Supervised Feature Fusion," in IEEE Access 2020.
● Siriwardhana S, Reis A, Weerasakera R, Nanayakkara S. ``Jointly Fine-Tuning BERT-like Self Supervised Models to
Improve Multimodal Speech Emotion Recognition.'' Proceedings of the International Speech Communication Association,
INTERSPEECH. 2020.
● Siriwardhana S, Kaluarachchi T, Scholl P, Dissanayake V, Nanayakkara S. ``SummarizeMe: Abstractive Summarization
System for Autobiographical Text'' Information Systems Research (ISR) 2022 [Under review]
● Siriwardhana S, Weerasakera R, Kaluarachchi T, Wen E, Rana R, Nanayakkara S. ``Improving the Domain Adaptation of
Retrieval Augmented Generation (RAG) Models for Open-Domain Question-Answering'' Transactions of the Association for
Computational Linguistics, TACL 2022 (to be presented at EMNLP 2022)
Other Publications
● Wen, E., Kaluarachchi, T., Siriwardhana, S., Tang, V., Billinghurst, M., Lindeman, R.W., Yao, R., Lin, J. and
Nanayakkara, S.C., 2022. VRhook: A Data Collection Tool for VR Motion Sickness Research. Proceedings of the Annual
Conference of the User Interface Software and Technology UIST ’22.
● Kaluarachchi, T., Siriwardhana, S., Wen, E., and Nanayakkara, S., A Corneal Surface Reflections-Based Intelligent
System for Lifelogging Application. International Journal of Human Computer Interaction (IJHCI) 22(4), [Under Review]
30. ❖ My supervisors
➢ Prof. Suranga Nanayakkara
➢ Prof. Mark Billinghurst
➢ Dr. Elliott Wen
❖ Examination committee members
➢ Assoc Prof Kwan Hui Lim
➢ Assoc Prof Alan Wang
❖ The University of Auckland Doctoral Scholarship Programme
❖ All my Co-authors and lab members
Acknowledgements
36. Difference between IMA and Co-attention
● Co-attention does not need modifications such as adding a class token or a few transformer layers
IMA modification · Co-attention
38. DL features vs SSL features
● CNN Features off-the-shelf: an Astounding Baseline for Recognition (2014)
● PASS: An ImageNet replacement for self-supervised pretraining without humans
(2021)
● Efficient Self-supervised Vision Transformers for Representation Learning (2022)
(“When transferring to downstream linear classification tasks, EsViT outperforms its
supervised counterpart on 17 out of 18 datasets. ”)
● Transfer Learning or Self-supervised Learning? A Tale of Two Pretraining Paradigms
(2019)
● How Well Do Self-Supervised Models Transfer? (CVPR2022)
40. GPT-3 (Open AI)
● $12M to train
● 175 billion parameters (365 GB)
● The bigger, the better
Editor's Notes
Deep Learning is important
It has its limitations
SSL as a savior
It is becoming so popular
What is SSL
It has two phases
What is a pretext task
The first phase is pretext training, which can take a lot of computational power
But we do not have to worry about pre-training all the time
Major research labs open-source these models
The focus of the thesis is the utilization of pre-trained models
On downstream tasks, especially where we do not have much training data
Research questions are separated into two main areas: fusion and adaptation
How to utilize features extracted from pre-trained frozen SSL models
Conducted my experiments in multimodal emotion recognition
Why multimodal? Because it is a challenging area
I used three different frozen networks with different vector sizes and seq-lengths, so it is not trivial to connect these dense embeddings
So I introduced a transformer-based fusion mechanism. It showed competitive results
Motivated by the first research question, I conducted the experiments on multimodal emotion recognition
When SSL first appeared, the Transformer architecture was mainly for text
Then it was introduced to speech
So can we use some special properties like the [CLS] token
Two transformers to represent both text and speech
Having the same transformer architecture could help the fusion and improve the results
Two fusion mechanisms, one employing the direct architectural properties
Results on the IEMOCAP dataset
How DL has improved the field of NLU, but why there are still problems due to the scarcity of data.