Fuse and Adapt: Investigating the Use of Pre-Trained
Self-Supervised Learning Models in Limited Data NLU
Problems
Shamane Siriwardhana
Supervised by Associate Professor Suranga Nanayakkara, Professor Mark Billinghurst & Dr Elliott Wen
1
Deep Learning and the Cake Analogy 2.0
(Yann LeCun, Chief AI Scientist at Meta)
Cherry - Reinforcement
Learning
Icing - Supervised Learning
Cake - Self-supervised
Learning
● Is Self-Supervised Learning the future of AI?
2
Self-Supervised Learning (SSL)
72K GitHub stars since 2020
50K citations since 2019
$5M to train a single model
● Eliminates the need for humans to label data.
● Models can use naturally available context as labels.
3
SSL workflow
Pretext task, e.g., Masked Language Modeling
SSL Workflow
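To make the pretext-task idea above concrete, here is a minimal sketch (illustrative only, not from the thesis) of how masked language modeling turns raw text into self-supervised training pairs, using the public `roberta-base` tokenizer from Hugging Face; the 15% masking rate follows the standard BERT/RoBERTa recipe.

```python
# Minimal masked-language-modeling pretext example (illustrative only).
# Assumes the Hugging Face `transformers` library and the public
# "roberta-base" checkpoint; 15% masking follows the BERT/RoBERTa recipe.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

text = "Self-supervised learning uses naturally available context as labels."
enc = tokenizer(text, return_tensors="pt")
input_ids = enc["input_ids"].clone()

# Labels are the original tokens; positions we do not mask are ignored (-100).
labels = input_ids.clone()
prob = torch.full(labels.shape, 0.15)
mask = torch.bernoulli(prob).bool()

# Never mask special tokens such as <s> and </s>.
special = torch.tensor(
    tokenizer.get_special_tokens_mask(input_ids[0].tolist(), already_has_special_tokens=True)
).bool()
mask &= ~special

input_ids[mask] = tokenizer.mask_token_id
labels[~mask] = -100  # loss is computed only on the masked positions

print(tokenizer.decode(input_ids[0]))
```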
High Availability of Pre-trained SSL Models
4
Pioneers in pre-training
Open-sourced model checkpoints
The Focus of the Thesis: Utilization of Pre-trained SSL Models
5
6
Research Questions
Fusion of multimodal pre-trained SSL features and models
RQ1: How to fuse multimodal features extracted from frozen pre-trained SSL models?
RQ2: How to fuse two pre-trained transformer-based architectures in multimodal settings?
Domain adaptation of pre-trained SSL models with fine-tuning mechanisms
RQ3: How to adapt a generative pre-trained SSL model when no high-quality training data is available?
RQ4: How to domain-adapt compound neural architectures that consist of several pre-trained SSL models?
7
Fusion of Multimodal Features Extracted from Three Different Frozen
Pre-trained SSL Models
Frozen models
Multimodal Emotion Recognition:
● Challenging to collect and annotate data
RQ1
8
Multimodal Frozen SSL Networks
● Fabnet - Video
○ Convolution-based architecture
○ Vector size: 256
○ Seq-len: frames in the video
● Wav2Vec - Speech
○ Temporal convolutions (TC)
○ Vector size: 512
○ Seq-len: strides of the temporal convolutions
● RoBERTa - Text
○ Transformer architecture
○ Vector size: 1024
○ Seq-len: number of words
RQ1
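As a rough illustration of how such frozen features can be extracted, the sketch below freezes a public RoBERTa-large checkpoint and takes its 1024-dimensional hidden states as text features; the speech (Wav2Vec) and video (Fabnet) encoders would be handled analogously. This is an assumption-laden sketch, not the exact thesis pipeline.

```python
# Sketch of extracting frozen text features with RoBERTa-large (1024-dim).
# Analogous steps apply to the speech (Wav2Vec) and video (Fabnet) encoders.
# Illustrative only; not the exact feature-extraction pipeline from the thesis.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
encoder = AutoModel.from_pretrained("roberta-large")
encoder.eval()                       # disable dropout
for p in encoder.parameters():
    p.requires_grad = False          # keep the pre-trained model frozen

with torch.no_grad():
    batch = tokenizer(["I am thrilled about these results!"], return_tensors="pt")
    out = encoder(**batch)
    text_feats = out.last_hidden_state   # shape: (batch, num_subword_tokens, 1024)

print(text_feats.shape)
```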
9
SSL-Embedding Fusion Transformer
Ablation Studies on CMU-MOSEI dataset
Model comparisons
Proposed transformer-based fusion
Multimodal emotion recognition with transformer-based self supervised feature
fusion (Siriwardhana et al. 2020)
RQ1
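A simplified sketch of the fusion idea follows: each frozen modality sequence (256-d video, 512-d speech, 1024-d text) is projected into a shared space, prepended with a learnable classification token, and passed through a small self-attention encoder. Layer counts, pooling, and the number of classes are illustrative assumptions, not the exact SSL-Embedding Fusion Transformer from the paper.

```python
# Simplified sketch of transformer-based fusion of frozen SSL features
# (video 256-d, speech 512-d, text 1024-d). Dimensions follow the slide;
# layer counts, pooling, and class count are illustrative assumptions.
import torch
import torch.nn as nn

class SSLEmbeddingFusion(nn.Module):
    def __init__(self, d_model=256, n_classes=6):
        super().__init__()
        # Per-modality projections into a shared embedding space.
        self.proj_video = nn.Linear(256, d_model)
        self.proj_speech = nn.Linear(512, d_model)
        self.proj_text = nn.Linear(1024, d_model)
        self.cls = nn.Parameter(torch.randn(1, 1, d_model))  # learnable fusion token
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, video, speech, text):
        # Each input: (batch, modality_seq_len, modality_feature_dim).
        tokens = torch.cat(
            [self.proj_video(video), self.proj_speech(speech), self.proj_text(text)], dim=1
        )
        cls = self.cls.expand(tokens.size(0), -1, -1)
        fused = self.fusion(torch.cat([cls, tokens], dim=1))
        return self.head(fused[:, 0])   # classify from the fused token position

model = SSLEmbeddingFusion()
logits = model(torch.randn(2, 40, 256), torch.randn(2, 60, 512), torch.randn(2, 20, 1024))
print(logits.shape)   # torch.Size([2, 6])
```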
10
Some Findings
❖ Dense SSL features extracted from different SSL models have robust
representational capabilities.
➢ They can be fused with transformer-based fusion mechanisms
➢ Self-attention plays an important role when combining sequential
embeddings
❖ Feature fusion while keeping the pre-trained models frozen is important
➢ When pre-trained models have a vast number of parameters
➢ E.g., GPT-3 consists of 175 billion parameters.
RQ1
Findings related to RQ1 have been presented as a journal paper in IEEE Access 2020. S. Siriwardhana, T. Kaluarachchi, M. Billinghurst and S. Nanayakkara,
"Multimodal Emotion Recognition With Transformer-Based Self Supervised Feature Fusion," in IEEE Access, vol. 8, pp. 176274-176285, 2020, doi:
10.1109/ACCESS.2020.3026823.
Impact Factor - 3.367
11
Fusion of Two Transformer Architectures in
Multimodal Settings
● Represent different modalities with transformer-based pre-trained models
● Utilize architectural properties in the fusion
RQ2
12
Transformer models
● RoBERTa (unfrozen) - Text
○ Transformer architecture
○ Vector size: 1024
○ Seq-len: number of words
● Speech-BERT (unfrozen) - Speech
○ Transformer architecture
○ Vector size: 1024
○ Seq-len: sampling frequency
RQ2
13
Shallow vs Co-attentional fusion
Shallow Fusion Co-attentional fusion
Fusion mechanisms
Model comparisons
Ablation studies
Jointly Fine-Tuning "BERT-like" Self Supervised Models to Improve Multimodal
Speech Emotion Recognition (Siriwardhana et al., 2020)
RQ2
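The sketch below illustrates the shallow-fusion idea: concatenate the [CLS]/&lt;s&gt; embeddings of two unfrozen BERT-like encoders and classify. Here `roberta-base` stands in for both RoBERTa and a Speech-BERT-style model, which is an assumption for illustration; co-attentional fusion would instead cross-attend between the two token sequences.

```python
# Sketch of "shallow" fusion: concatenate the [CLS]/<s> embeddings of two
# unfrozen BERT-like encoders and classify. Co-attentional fusion would
# instead cross-attend between the two token sequences. Illustrative only:
# "roberta-base" stands in for both the text and speech encoders here.
import torch
import torch.nn as nn
from transformers import AutoModel

class ShallowFusionClassifier(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained("roberta-base")    # stand-in for RoBERTa-large
        self.speech_encoder = AutoModel.from_pretrained("roberta-base")  # stand-in for a Speech-BERT model
        dim = self.text_encoder.config.hidden_size
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, text_inputs, speech_inputs):
        # Take the first-position ([CLS]/<s>) embedding from each encoder.
        text_cls = self.text_encoder(**text_inputs).last_hidden_state[:, 0]
        speech_cls = self.speech_encoder(**speech_inputs).last_hidden_state[:, 0]
        return self.head(torch.cat([text_cls, speech_cls], dim=-1))
```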
14
Some Findings
❖ Pre-trained SSL models with transformer-based architectures can be fused together easily
➢ Employing unique properties like the [CLS] token
➢ Shallow fusion
❖ Transformer-based SSL models can be fine-tuned stably even with small amounts of data
➢ Fine-tuning remains stable with lower learning rates
❖ The Transformer architecture is becoming increasingly ubiquitous in self-supervised
learning
➢ Transformer-based models represent different data modalities
RQ2
Findings related to RQ2 have been presented as a full conference paper at Interspeech 2020. S. Siriwardhana, Reis A, Weerasakera R, Nanayakkara S. "Jointly
Fine-Tuning BERT-like Self Supervised Models to Improve Multimodal Speech Emotion Recognition." Proceedings of the Annual Conference of the
International Speech Communication Association, INTERSPEECH. Vol. 2020.
H-index - 100
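As a hedged illustration of the stability point above, a typical joint fine-tuning setup pairs a low learning rate with linear warmup; the specific values (1e-5, 10% warmup) are common defaults, not settings reported in the paper.

```python
# Illustrative optimizer setup for stable fine-tuning of transformer SSL
# models on a small dataset: a low learning rate with linear warmup.
# The values (1e-5, 10% warmup, 1000 steps) are common defaults, not thesis settings.
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=4)
optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)

num_training_steps = 1000
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),
    num_training_steps=num_training_steps,
)
```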
15
Domain Adaptation of a Generative BART Model When High-Quality
Training Data Is Missing
Autobiographical Text Summarization
● Privacy issues
● Different language patterns
● Scarcity of records and gold-standard summaries
● BART transformer - generates text
● Works well on generation benchmarks
● Sequence-to-sequence architecture
RQ3
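For reference, generating an abstractive summary with a public BART seq2seq checkpoint takes only a few lines; `facebook/bart-large-cnn` is used here purely as a stand-in for the fine-tuned model in the thesis.

```python
# Minimal abstractive summarization with a public BART seq2seq checkpoint.
# "facebook/bart-large-cnn" is a stand-in for the model fine-tuned in the thesis.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

text = (
    "Today I finally presented my project to the whole team. I was nervous all "
    "morning, but the demo went smoothly and my manager suggested we submit it "
    "to the internal innovation showcase next month."
)
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(**inputs, num_beams=4, max_length=60, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```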
16
Utilization of Reddit data and high-quality news data
Thread
Title
● News summarization
● Fundamental task
● Gold-standard datasets
● Reddit: a dataset closely related to the target domain
● Only thread titles are available as weak summaries
RQ3
17
Mix Distribution Multitask Learning
● Fine-tuning BART for autobiographical summarization with:
○ Domain-specific weakly labeled dataset
○ Task-specific dataset with gold-standard labels
Model Comparison Factual consistency (FactCC)
Abstractive Summarization System for Autobiographical Text (Siriwardhana et al. (2022))
RQ3
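One plausible way to realize the mixed-distribution idea is to interleave batches drawn from the weakly labeled domain corpus (e.g., Reddit body/title pairs) and the gold-standard task corpus during fine-tuning; the sketch below uses a 0.7/0.3 mixing ratio as an illustrative assumption, not a value from the thesis.

```python
# Illustrative "mix distribution" sampling: interleave a weakly labelled,
# domain-specific corpus with a smaller gold-standard summarization corpus
# during fine-tuning. The 0.7/0.3 ratio and placeholder data are assumptions.
import random

weak_domain_pairs = [("reddit post body ...", "reddit post title ...")] * 1000  # weak labels
gold_task_pairs   = [("news article ...", "human-written summary ...")] * 200   # gold labels

def mixed_batches(batch_size=8, steps=100, p_weak=0.7, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        # Pick the source distribution for this batch, then sample a batch from it.
        source = weak_domain_pairs if rng.random() < p_weak else gold_task_pairs
        yield rng.sample(source, batch_size)

for batch in mixed_batches(steps=3):
    documents, summaries = zip(*batch)
    # ...tokenize and feed (documents, summaries) to the seq2seq model here...
    print(len(documents), "examples drawn")
```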
18
Human Studies
Abstractive Summarization System for Autobiographical Text (Siriwardhana et al. (2022))
Model comparison with MTurk participants
SummarizeMe (Digital Diary) - User study conducted with 75 users
RQ3
19
Some of the Findings
❖ SSL models like BART possess strong language generation capabilities
➢ Such models have seen a large amount of data during pre-training
➢ BART-like models can perform well even without high-quality data
❖ Data-centric approaches are crucial when adapting to tasks like autobiographical
text summarization
➢ Designing better mechanisms to make use of available domain-specific data
❖ Human studies are essential and beneficial for evaluating generative models
Findings related to RQ3 have been submitted as a journal paper to Information Systems Research (ISR) 2022. S. Siriwardhana, Kalurachchi T, Chithralekha G, Scholl P, Dissanayake V,
Nanayakkara S. ``SummarizeMe: Abstractive Summarization System for Autobiographical Text'' Information Systems Research (ISR)
2022 [Under review]
RQ3
20
Domain Adaptation of Compound Neural Architectures with Several
Pre-trained SSL Models
● Retrieval Augmented Generation (RAG) model (Meta)
● Combines information retrieval and seq2seq generation
● DPR neural retriever and BART generator
RQ4
● Open Domain Question Answering (ODQA)
● Works well with Wikipedia-based knowledge bases
● Little prior work on the domain adaptation of ODQA
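For orientation, the base RAG model (DPR retriever plus BART generator) can be loaded through the Hugging Face `transformers` API roughly as follows; the dummy index keeps the example small and requires the `datasets` and `faiss` packages. This shows the off-the-shelf model, not the thesis's domain-adapted RAG-end2end variant.

```python
# Loading Meta's public RAG checkpoint (DPR retriever + BART generator) via the
# Hugging Face `transformers` API, following the library's documented example.
# The dummy index keeps the demo small; real use needs a full passage index.
from transformers import RagRetriever, RagSequenceForGeneration, RagTokenizer

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)

inputs = tokenizer("who wrote the origin of species?", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```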
21
Domain Adaptation of the RAG
RQ4
22
RAG-end2end and Introduction of an Auxiliary Signal
Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question
Answering (Siriwardhana et al - 2022)
End-to-End RAG retriever Reconstruction auxiliary signal
RQ4
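Conceptually, the auxiliary signal adds a second loss term to the standard answer-generation objective, with both terms back-propagating through the retriever and the generator when training end-to-end. The sketch below is only a schematic of that combination; the 0.5 weight and the scalar loss values are assumptions, not the paper's exact formulation.

```python
# Conceptual sketch: combine the main answer-generation loss with an auxiliary
# reconstruction loss when training retriever and generator end-to-end.
# The 0.5 weighting and the example values are assumptions for illustration.
import torch

def total_loss(answer_generation_loss: torch.Tensor,
               reconstruction_loss: torch.Tensor,
               aux_weight: float = 0.5) -> torch.Tensor:
    """Both terms back-propagate into the DPR retriever and the BART generator."""
    return answer_generation_loss + aux_weight * reconstruction_loss

loss = total_loss(torch.tensor(2.31), torch.tensor(1.07))
print(loss)   # tensor(2.8450)
```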
23
End-to-end retriever training improves domain adaptation
RAG-end2end and auxiliary signals can improve the overall results
Empowering further research in the paradigm of retrieval augmentation
RQ4
Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question Answering (Siriwardhana
et al - 2022)
24
Some of the Findings
❖ Different SSL pre-trained models can be combined to create effective RAG-like
pipelines.
❖ Retrieval models play a vital role in the domain adaptation of RAG.
➢ Neural retrieval models like DPR benefit from domain-specific fine-tuning
since they are mainly trained with Wiki-based data.
❖ Auxiliary signals can improve the process of domain adaptation.
➢ A solution to the scarcity of domain-specific labeled data.
Findings related to RQ4 have been accepted as a journal paper in TACL 2022. Siriwardhana S, Weerasakera R, Kalurachchi T, Wen E, Rana R, Nanayakkara S.
``Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open-Domain Question-Answering'' Transactions of the Association
for Computational Linguistics (TACL) 2022 (to be presented at EMNLP 2022)
Impact Factor - 9.194
RQ4
25
Summary of the thesis
26
❖ Model compression techniques are important
➢ SSL model checkpoints are large
➢ Model pruning and distillation
➢ Can support recent paradigms like federated learning
Model pruning Federated learning
Knowledge distillation
❖ Human-centric model evaluation is getting stronger
➢ Hallucination and factual consistency are significant research areas
Future Work
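As a generic example of the distillation route named above, the standard knowledge-distillation loss softens teacher and student logits with a temperature and minimizes their KL divergence; this is a textbook sketch, not a method from the thesis.

```python
# Standard knowledge-distillation loss: soften teacher and student logits with a
# temperature and minimize their KL divergence. A generic sketch of the technique
# named on the slide, not a method from the thesis.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10))
print(loss)
```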
Future Work
27
❖ Retrieval Augmentation could play a significant role in the field of AI
➢ Could be an alternative to billion-dollar large pre-trained models
➢ Are we getting closer to human-like intelligence?
Retrieval augmentation enables models to go beyond a
parametric memory (Source: RETRO, DeepMind (2022))
Directly related publications
● S. Siriwardhana, T. Kaluarachchi, M. Billinghurst and S. Nanayakkara, "Multimodal Emotion Recognition With Transformer-
Based Self Supervised Feature Fusion," in IEEE Access 2020.
● Siriwardhana S, Reis A, Weerasakera R, Nanayakkara S. ``Jointly Fine-Tuning BERT-like Self Supervised Models to
Improve Multimodal Speech Emotion Recognition.'' Proceedings of the International Speech Communication Association,
INTERSPEECH. 2020.
● Siriwardhana S, Kalurachchi T, Scholl P, Dissanayake V, Nanayakkara S. ``SummarizeMe: Abstractive Summarization
System for Autobiographical Text'' Proceedings of the Information System Research (ISR) 2022 [Under review]
● Siriwardhana S, Weerasakera R, Kalurachchi T, Wen E, Rana R, Nanayakkara S. ``Improving the Domain Adaptation of
Retrieval Augmented Generation (RAG) Models for Open-Domain Question-Answering'' Transactions of the Association for
Computational Linguistics (TACL) 2022 (to be presented at EMNLP 2022)
29
Other Publications
● Wen, E., Kaluarachchi, T., Siriwardhana, S., Tang, V., Billinghurst, M., Lindeman, R.W., Yao, R., Lin, J. and
Nanayakkara, S.C., 2022. VRhook: A Data Collection Tool for VR Motion Sickness Research. Proceedings of the Annual
Conference of the User Interface Software and Technology UIST ’22.
● Kaluarachchi, T., Siriwardhana, S., Wen, E., and Nanayakkara, S., A Corneal Surface Reflections-Based Intelligent
System for Lifelogging Application. International Journal of Human Computer Interaction (IJHCI) 22(4), [Under Review]
❖ My supervisors
➢ Prof. Suranga Nanayakkara
➢ Prof. Mark Billinghurst
➢ Dr. Elliott Wen
❖ Examination committee members
➢ Assoc Prof Kwan Hui Lim
➢ Assoc Prof Alan Wang
❖ The University of Auckland Doctoral Scholarship Programme
❖ All my Co-authors and lab members
30
Acknowledgement
Thank you!
Appendix
● Pre-training is expensive
○ Training a single large model can cost millions of dollars
30
32
● Huge carbon footprint
Performance Matters!
32
● Pre-trained SSL models are performing exceptionally well for many tasks.
36
● Retrieval augmented models have some important qualities
Difference between IMA and Co-attention
● Co-attention does not require modifications such as adding a class token or a few transformer layers
37
IMA modification Co-attention
DL features vs SSL features
38
DL features vs SSL features
39
● CNN Features off-the-shelf: an Astounding Baseline for Recognition (2014)
● PASS: An ImageNet replacement for self-supervised pretraining without humans
(2021)
● Efficient Self-supervised Vision Transformers for Representation Learning (2022)
(“When transferring to downstream linear classification tasks, EsViT outperforms its
supervised counterpart on 17 out of 18 datasets. ”)
● Transfer Learning or Self-supervised Learning? A Tale of Two Pretraining Paradigms
(2019)
● How Well Do Self-Supervised Models Transfer? (CVPR2022)
40
DL features vs SSL features
GPT-3 (OpenAI)
● $12M to train
● 175 billion parameters (365 GB)
● The bigger, the better
33


Editor's Notes

  • #3 Deep Learning is important, but it has its limitations. SSL is a savior, and it is becoming very popular.
  • #4 Deep Learning is important, but it has its limitations. SSL is a savior, and it is becoming very popular.
  • #5 What is SSL? It has two phases. What is a pretext task? The first phase is the pretext task, which can take a lot of computational power.
  • #6 But we do not have to worry about pre-training all the time; the big players open-source these models.
  • #7 The focus of the thesis is the utilization of pre-trained models on downstream tasks, especially where we do not have much training data.
  • #8 The research questions are separated into two main areas: fusion and adaptation.
  • #9 How to utilize features extracted from pre-trained frozen SSL models. I conducted my experiments on multimodal emotion recognition. Why multimodal? Because it is a challenging area.
  • #10 I used three different frozen networks with different vector sizes and sequence lengths, so it is not trivial to connect these dense embeddings.
  • #11 So I introduced a transformer-based fusion mechanism. It showed competitive results.
  • #13 Motivated by the first research question, I conducted the experiments on multimodal emotion recognition. When SSL first emerged, the Transformer architecture was mainly used for text; then it was introduced to speech. So, can we use special properties like the [CLS] token?
  • #14 Two transformers represent text and speech. Sharing the same transformer architecture could help the fusion and improve the results.
  • #15 Two fusion mechanisms, one employing direct architectural properties. Results on the IEMOCAP dataset.
  • #33 How DL has improved the field of NLU, but why there are still problems due to the scarcity of data.