MULTI MODALITY
FOR THE
UNINITIATED
– SIDDHARTH SHARMA, HARMFUL INFO PROBLEM, CI
FEATURE TYPES
https://www.internalfb.com/intern/wiki/Ads-ranking-feature/#type-of-features
SIGNAL DIVERSITY
Deep Neural Networks for YouTube Recommendations
Feed Ranking
FUSION
A LONG TIME AGO IN A GALAXY FAR, FAR AWAY....
Wide & Deep Learning for Recommender Systems
HOW TO MERGE THESE FEATURES ?
Simplest Approach – Concatenate
https://www.internalfb.com/intern/wiki/Facebook_AI_Multimodal_(FAIM)/Model_Architectures/Non-temporal_Models/ConcatMLP_Fusion_Model/
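A minimal sketch of concatenation fusion, assuming NumPy, randomly initialized weights, and made-up feature sizes. The function name `concat_mlp_fusion` and the three example feature groups are hypothetical; this is not the ConcatMLP model behind the link above, only the general idea of concatenating per-modality vectors and feeding them to an MLP.

```python
import numpy as np

def concat_mlp_fusion(features, w1, b1, w2, b2):
    """Concatenate per-modality feature vectors and pass them through a 2-layer MLP."""
    fused = np.concatenate(features, axis=-1)   # (batch, sum of feature dims)
    hidden = np.maximum(fused @ w1 + b1, 0.0)   # ReLU hidden layer
    return hidden @ w2 + b2                     # (batch, num_outputs) logits

rng = np.random.default_rng(0)
batch = 4
text_feat = rng.normal(size=(batch, 16))   # hypothetical text embedding
image_feat = rng.normal(size=(batch, 32))  # hypothetical image embedding
dense_feat = rng.normal(size=(batch, 8))   # hypothetical dense (numeric) features

in_dim = 16 + 32 + 8
w1 = rng.normal(size=(in_dim, 24)) * 0.1
b1 = np.zeros(24)
w2 = rng.normal(size=(24, 1)) * 0.1
b2 = np.zeros(1)

logits = concat_mlp_fusion([text_feat, image_feat, dense_feat], w1, b1, w2, b2)
```

The MLP sees every modality through the same first layer, which is what makes this the simplest fusion option.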
SPARSE – SPARSE INTERACTION
Demystify CTR_MBL_FEED_MODEL and learn modeling techniques step by step
https://fb.quip.com/fwEFAoD4rDBs
DENSE-SPARSE INTERACTION
https://fb.quip.com/fwEFAoD4rDBs#YVJACAZa0SO
TEXT IMAGE SPECIFIC TASKS
A Hummingbird
MMF, a PyTorch powered MultiModal Framework
https://www.youtube.com/watch?v=igAF-48Pwnc
POPULAR DATA SETS
VISUALBERT
1. Architecture
   1. The architecture of VisualBERT: image regions and language are combined with a Transformer so that self-attention can discover implicit alignments between language and vision.
   2. Uses BERT weights for initialization and BERT word embeddings.
   3. Visual token embedding (sum of three representations):
      1. A visual feature representation of the bounding region (from Faster R-CNN)
      2. Segment embedding
      3. Position embedding
2. Dataset: The Flickr30k dataset contains 31,000 images collected from Flickr, each with 5 reference sentences provided by human annotators.
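The sum of three representations can be sketched as follows. This is an illustrative NumPy sketch, not VisualBERT's code: the region features are random stand-ins for Faster R-CNN outputs, the projection matrix `w_proj` and both embedding tables are hypothetical, and giving region i the position index i is an assumption made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
num_regions, region_dim, hidden = 5, 2048, 768  # illustrative sizes

# Stand-in for Faster R-CNN features of the detected bounding regions.
region_features = rng.normal(size=(num_regions, region_dim))
# Hypothetical projection from region-feature size to the Transformer hidden size.
w_proj = rng.normal(size=(region_dim, hidden)) * 0.01

segment_table = rng.normal(size=(2, hidden))             # row 0: text, row 1: image
position_table = rng.normal(size=(num_regions, hidden))  # one row per position index

visual_part = region_features @ w_proj                   # (num_regions, hidden)
segment_part = segment_table[1]                          # every region is an image token
position_part = position_table[np.arange(num_regions)]   # assumed: region i -> position i

# Each visual token is the sum of the three representations.
visual_tokens = visual_part + segment_part + position_part
```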
VisualBERT: A Simple and Performant Baseline for Vision and Language
Entity Grounding
Attending to the bounding regions that correspond to entities in the sentence
For each entity in the sentence and for each attention head in VisualBERT, look at the bounding region which receives the most attention weight.
For this evaluation, the head’s attention to other words was masked out.
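The per-head check described above can be sketched like this. The function name `grounding_accuracy` is hypothetical, the attention tensor is assumed to already hold only word-to-region weights (attention to other words masked out), and scoring by exact match against one gold region index is a simplification for illustration.

```python
import numpy as np

def grounding_accuracy(attn, entity_idx, gold_region):
    """attn: (heads, words, regions) attention from each word to image regions only.

    Returns, per head, the fraction of entities whose most-attended region
    equals the gold region index.
    """
    top_region = attn[:, entity_idx, :].argmax(axis=-1)   # (heads, num_entities)
    return (top_region == np.asarray(gold_region)).mean(axis=-1)

# Toy example: 2 heads, 3 words, 3 regions; words 0 and 1 are entities.
attn = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
    [[0.2, 0.5, 0.3], [0.1, 0.8, 0.1], [0.6, 0.3, 0.1]],
])
acc = grounding_accuracy(attn, entity_idx=[0, 1], gold_region=[0, 1])
```

In the toy data the first head grounds both entities correctly and the second head only one of them.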
SYNTACTIC GROUNDING
• Test whether the model learns syntactic relations between words (by analysing the weights of its attention heads)
• Parse all sentences in Flickr30k using AllenNLP’s dependency parser
• For each attention head in VisualBERT, given that two words have a particular dependency relationship, and one of them has a ground-truth grounding in Flickr30k, compute how accurately the head's attention weights predict the ground-truth grounding.
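A rough sketch of the bookkeeping for one attention head, grouped by dependency label. The record layout (`dep`, `attn`, `gold_region`) and the function name are hypothetical, and which of the two related words supplies the attention row is an assumption; the dependency labels would come from the parser.

```python
from collections import defaultdict

def syntactic_grounding_accuracy(examples):
    """Accuracy per dependency label for a single attention head.

    Each example is a dict with:
      'dep'         - dependency label linking the two words
      'attn'        - attention weights over image regions for the word pair
      'gold_region' - index of the ground-truth region
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for ex in examples:
        weights = ex["attn"]
        predicted = max(range(len(weights)), key=weights.__getitem__)  # argmax
        totals[ex["dep"]] += 1
        hits[ex["dep"]] += int(predicted == ex["gold_region"])
    return {dep: hits[dep] / totals[dep] for dep in totals}

examples = [
    {"dep": "nsubj", "attn": [0.1, 0.7, 0.2], "gold_region": 1},
    {"dep": "nsubj", "attn": [0.6, 0.3, 0.1], "gold_region": 2},
    {"dep": "amod", "attn": [0.2, 0.2, 0.6], "gold_region": 2},
]
per_dep = syntactic_grounding_accuracy(examples)
```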
ISSUES
• Issues with simply concatenating the features of the linguistic and visual modalities
  • VisualBERT treats inputs from both modalities identically
  • The two modalities need different pre-processing and sit at different levels of abstraction
• Forcing pretrained BERT weights to accommodate the large set of additional visual tokens may damage the learned BERT language model
VILBERT: PRETRAINING TASK-AGNOSTIC VISIOLINGUISTIC REPRESENTATIONS FOR VISION-AND-LANGUAGE TASKS
• Key Contribution:
  • Two parallel streams for visual and linguistic processing that interact through novel co-attentional transformer layers.
• Dataset:
  • Conceptual Captions, ~3.3 million captioned images
• Proxy Tasks:
  • Predicting masked words and image regions
  • Predicting whether an image and a text segment correspond
VILBERT
Method:
• Develop a two-stream architecture that models each modality separately and then fuses them through a small set of attention-based interactions.
• The approach allows variable network depth for each modality and enables cross-modal connections at different depths.
VILBERT: INPUT REPRESENTATIONS
• Image features
  • Generated by extracting bounding boxes and their visual features from a pre-trained object detection network (Faster R-CNN with a ResNet-101 backbone).
  • Spatial information is encoded as a 5-d vector built from the region position (normalized top-left and bottom-right coordinates, plus the fraction of image area covered).
  • This vector is then projected to match the dimension of the visual features, and the two are summed.
• Word embeddings are initialized from BERT base, pretrained on BookCorpus and Wikipedia
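As a small worked example of the 5-d spatial encoding, the sketch below turns a pixel-space box into normalized corners plus the covered area fraction. The function name `spatial_vector` and the `(x1, y1, x2, y2)` pixel box layout are assumptions for illustration; the learned projection and the sum with the visual features are left out.

```python
def spatial_vector(box, image_width, image_height):
    """Return [x1, y1, x2, y2, area_fraction] with coordinates normalized to [0, 1].

    box is (x1, y1, x2, y2) in pixels: top-left and bottom-right corners.
    """
    x1, y1, x2, y2 = box
    nx1, ny1 = x1 / image_width, y1 / image_height
    nx2, ny2 = x2 / image_width, y2 / image_height
    area_fraction = (nx2 - nx1) * (ny2 - ny1)  # share of the image covered by the box
    return [nx1, ny1, nx2, ny2, area_fraction]

# A 200 x 200 pixel box inside a 400 x 500 image.
vec = spatial_vector((100, 50, 300, 250), image_width=400, image_height=500)
```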
NOVELTY
• Co-TRM: co-attentional transformer layers enable information exchange between modalities.
  • The keys and values from each modality are passed as input to the other modality’s multi-headed attention block.
  • The exchange between the two streams is restricted to specific layers
• The text stream has significantly more processing before interacting with visual features
  • Visual features are already fairly high-level and require limited context aggregation compared to words in a sentence
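The key/value exchange can be sketched with single-head attention in NumPy. This is a simplified illustration, not ViLBERT's implementation: it omits multiple heads, residual connections, layer norm, and feed-forward blocks, and the weight lists `wt` and `wi` are hypothetical random projections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries_from, keys_values_from, wq, wk, wv):
    """Single-head attention: queries from one stream, keys and values from the other."""
    q = queries_from @ wq
    k = keys_values_from @ wk
    v = keys_values_from @ wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (len_q, len_kv)
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 8
text = rng.normal(size=(6, d))   # 6 word states
image = rng.normal(size=(4, d))  # 4 region states
wt = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]  # hypothetical text-stream weights
wi = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]  # hypothetical image-stream weights

text_out, text_to_image = cross_attention(text, image, *wt)   # words attend to regions
image_out, image_to_text = cross_attention(image, text, *wi)  # regions attend to words
```

Each stream keeps its own sequence length; only the content it attends over comes from the other modality.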
TRAINING TASKS
• Alignment Task
  • The model is presented with an image-text pair
  • Given the input {IMG, v1, …, vt, CLS, w1, …, wt, SEP}, it predicts whether the image and text are aligned.
  • The outputs for IMG and CLS are holistic representations of the image and text inputs.
  • The overall representation is computed as the element-wise product of the IMG and CLS representations, and a linear layer on top makes the binary prediction.
• Masked Modeling Task
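The alignment head can be illustrated in a few lines of plain Python. The vectors, weights, and bias below are made-up toy values and `alignment_probability` is a hypothetical name; in the real model the two inputs would be the final IMG and CLS outputs and the linear layer would be learned.

```python
import math

def alignment_probability(h_img, h_cls, weights, bias):
    """Element-wise product of the two holistic vectors, then a linear layer and a sigmoid."""
    joint = [a * b for a, b in zip(h_img, h_cls)]            # element-wise product
    logit = sum(w * j for w, j in zip(weights, joint)) + bias
    return 1.0 / (1.0 + math.exp(-logit))                    # probability of "aligned"

h_img = [0.5, -1.0, 2.0]   # toy IMG output
h_cls = [1.0, 0.5, 0.25]   # toy CLS output
weights = [1.0, 1.0, 1.0]  # toy linear-layer weights
p = alignment_probability(h_img, h_cls, weights, bias=-0.5)
```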
ISSUES WITH VILBERT
• Cannot incorporate pretrained unimodal representations
• Does not work with an arbitrary sequence of dense vectors
FB: SUPERVISED MULTIMODAL BITRANSFORMERS FOR CLASSIFYING IMAGES AND TEXT
• Jointly fine-tunes unimodally pretrained text and image encoders by projecting image embeddings into the text token space
• Makes it easier to incorporate pretrained unimodal models
MMBT: IMAGE ENCODER
• Get feature maps from ResNet-152
  • Use ResNet-152 with average pooling over a K x M grid in the image, yielding N = KM output vectors of 2048 dimensions
  • Learn weights to project each of the N image embeddings into the D-dimensional token input embedding space
• In effect, image embeddings are mapped into BERT’s token space by a set of randomly initialized mappings
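The pooling and projection steps can be sketched in NumPy as below. The sizes are deliberately tiny stand-ins (the slide uses 2048-dimensional ResNet-152 features and BERT's token dimension D), the feature map is random, `pool_to_grid` is a hypothetical helper, and using one projection matrix per grid position is an assumption; a single shared matrix is an equally plausible reading.

```python
import numpy as np

def pool_to_grid(feature_map, k, m):
    """Average-pool a (C, H, W) feature map into a K x M grid, returning (K*M, C).

    Assumes H is divisible by k and W is divisible by m.
    """
    c, h, w = feature_map.shape
    blocks = feature_map.reshape(c, k, h // k, m, w // m)
    pooled = blocks.mean(axis=(2, 4))   # (C, K, M)
    return pooled.reshape(c, k * m).T   # (N, C) with N = K*M

rng = np.random.default_rng(0)
C, D = 16, 8                               # stand-ins for 2048 and BERT's D
feature_map = rng.normal(size=(C, 6, 6))   # stand-in for the last ResNet-152 feature map

image_embeds = pool_to_grid(feature_map, k=3, m=1)   # N = 3 image embeddings
n = image_embeds.shape[0]

# Randomly initialized mappings into the token embedding space (assumed: one per position).
w_proj = rng.normal(size=(n, C, D)) * 0.1
image_tokens = np.einsum("nc,ncd->nd", image_embeds, w_proj)   # (N, D)
```

The resulting `image_tokens` rows play the role of extra input tokens placed alongside the text token embeddings.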
EVALUATION
• Surprisingly competitive with ViLBERT
• Create hard test sets
  • Construct hard test sets by taking the test-set examples where the BERT and IMG classifier predictions differ most from the ground-truth classes
• Compare with
  • Text-only BERT
  • Image-only model
  • Concat BOW + Image
  • Late fusion
  • Concat BERT + Img
    • Concatenate the outputs of the BERT and image baselines (2048 + 768) and apply a linear classifier on top
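The evaluation setup above can be sketched as follows, with random stand-ins for the encoder outputs. Several choices here are assumptions made for illustration: late fusion is shown as averaging the two unimodal class distributions, "most different from the ground truth" is scored as the probability mass both unimodal models fail to put on the true class, and `late_fusion` and `hard_subset` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_classes = 6, 3
bert_out = rng.normal(size=(n, 768))   # stand-in for the BERT baseline output
img_out = rng.normal(size=(n, 2048))   # stand-in for the image baseline output

# Concat BERT + Img: one linear classifier over the 2048 + 768 = 2816-dim vector.
w = rng.normal(size=(2048 + 768, num_classes)) * 0.01
b = np.zeros(num_classes)
concat_logits = np.concatenate([img_out, bert_out], axis=-1) @ w + b

def late_fusion(text_probs, image_probs):
    """Late fusion, sketched as averaging the two unimodal class distributions."""
    return (text_probs + image_probs) / 2.0

def hard_subset(text_probs, image_probs, labels, fraction=0.5):
    """Indices of the examples where the unimodal models are most wrong."""
    rows = np.arange(len(labels))
    error = (1.0 - text_probs[rows, labels]) + (1.0 - image_probs[rows, labels])
    keep = max(1, int(round(fraction * len(labels))))
    return np.argsort(-error)[:keep]   # largest combined error first

text_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
image_probs = np.array([[0.8, 0.2], [0.4, 0.6], [0.1, 0.9], [0.5, 0.5]])
labels = np.array([0, 0, 0, 1])

fused = late_fusion(text_probs, image_probs)
hard = hard_subset(text_probs, image_probs, labels, fraction=0.5)
```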
