This presentation covers multimodal representation learning using multimodal transformer architectures such as VisualBERT, ViLBERT, and MMBT (supervised multimodal bitransformers). It discusses the pros and cons of each modeling approach.
5. A LONG TIME AGO IN A GALAXY FAR, FAR AWAY....
Wide & Deep Learning for Recommender Systems
6. HOW TO MERGE THESE FEATURES?
Simplest approach – concatenate
https://www.internalfb.com/intern/wiki/Facebook_AI_Multimodal_(FAIM)/Model_Architectures/Non-temporal_Models/ConcatMLP_Fusion_Model/
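A minimal PyTorch sketch of this fusion pattern, assuming a 768-d text vector and a 2048-d image vector per example; all dimensions and names are illustrative, not taken from the internal ConcatMLP model:

```python
import torch
import torch.nn as nn

class ConcatMLPFusion(nn.Module):
    """Simplest fusion: embed each modality, concatenate, pass through an MLP.
    Dimensions here are illustrative assumptions."""

    def __init__(self, text_dim=768, image_dim=2048, hidden_dim=512, num_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_feat, image_feat):
        # Concatenate the per-modality feature vectors along the last dim.
        fused = torch.cat([text_feat, image_feat], dim=-1)
        return self.mlp(fused)
```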
7. SPARSE – SPARSE INTERACTION
Demystify CTR_MBL_FEED_MODEL and learn modeling techniques step by step
https://fb.quip.com/fwEFAoD4rDBs
11. VISUALBERT
1. Architecture
1. Image regions and language are combined with a Transformer, allowing self-attention to discover implicit alignments between language and vision.
2. Uses BERT weights for initialization and BERT word embeddings.
3. Visual token embedding (the sum of three representations):
1. A visual feature representation of the bounding regions (from Faster R-CNN)
2. Segment embedding
3. Position embedding
2. Dataset: the Flickr30k dataset contains 31,000 images collected from Flickr, each with 5 reference sentences provided by human annotators.
VisualBERT: A Simple and Performant Baseline for Vision and Language
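A minimal sketch of the three-way sum that forms a visual token embedding, assuming BERT-base's 768-d hidden size and 2048-d Faster R-CNN region features; module names and the region cap are illustrative, not from the paper's code:

```python
import torch
import torch.nn as nn

class VisualTokenEmbedding(nn.Module):
    """Visual token = projected region feature + segment emb + position emb."""

    def __init__(self, region_dim=2048, hidden_dim=768, max_regions=36):
        super().__init__()
        self.feature_proj = nn.Linear(region_dim, hidden_dim)  # visual feature representation
        self.segment_emb = nn.Embedding(2, hidden_dim)         # 0 = text, 1 = vision
        self.position_emb = nn.Embedding(max_regions, hidden_dim)

    def forward(self, region_feats):  # (batch, n_regions, 2048)
        n = region_feats.size(1)
        pos_ids = torch.arange(n, device=region_feats.device)
        seg_ids = torch.ones(n, dtype=torch.long, device=region_feats.device)
        # Sum of the three representations, broadcast over the batch.
        return (self.feature_proj(region_feats)
                + self.segment_emb(seg_ids)
                + self.position_emb(pos_ids))
```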
12. ENTITY GROUNDING
Attending to the corresponding bounding regions from entities in the sentence.
For each entity in the sentence and for each attention head in VisualBERT, look at the bounding region that receives the most attention weight. For this evaluation, the head's attention to other words is masked out.
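A sketch of this per-head argmax, assuming one layer's attention tensor of shape (heads, seq_len, seq_len); the function name and argument layout are hypothetical:

```python
import torch

def entity_grounding(attn, entity_positions, region_slice):
    """For each entity token and attention head, return the image region
    receiving the most attention. Restricting the argmax to the region
    positions plays the role of masking out attention to other words."""
    # attn[:, entities, regions] -> (heads, n_entities, n_regions)
    region_attn = attn[:, entity_positions, :][:, :, region_slice]
    return region_attn.argmax(dim=-1)  # best region per (head, entity)
```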
13. SYNTACTIC GROUNDING
• Find whether the model is learning syntactic relations between words (by analysing the weights of attention heads)
• Parse all sentences in Flickr30k using AllenNLP’s dependency parser
• For each attention head in VisualBERT, given that two words have a particular dependency relationship and one of them has a ground-truth grounding in Flickr30k, compute how accurately the head’s attention weights predict the ground-truth grounding (see the sketch below)
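A sketch of the per-head accuracy computation, assuming the AllenNLP parsing step has already produced word-index pairs and that `gold_region` maps grounded word indices to region indices; all names are hypothetical:

```python
import torch

def syntactic_grounding_accuracy(attn, dep_pairs, gold_region, region_slice):
    """For dependency-linked word pairs (src, tgt) where tgt has a
    ground-truth region, check whether src's most-attended region in
    each head matches that region. attn: (heads, seq_len, seq_len)."""
    correct = torch.zeros(attn.size(0))
    total = 0
    for src, tgt in dep_pairs:          # word indices joined by a dependency edge
        if tgt in gold_region:          # tgt has a ground-truth grounding
            total += 1
            best = attn[:, src, region_slice].argmax(dim=-1)  # (heads,)
            correct += (best == gold_region[tgt]).float()
    return correct / max(total, 1)      # per-head prediction accuracy
```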
14. ISSUES
• Issues with simply concatenating features from the linguistic and visual modalities:
• VisualBERT treats inputs from both modalities identically, even though they need different pre-processing and sit at different levels of abstraction
• Forcing pretrained BERT weights to accommodate the large set of additional visual tokens may damage the learned BERT language model
15. VILBERT: PRE-TRAINING TASK-AGNOSTIC VISIOLINGUISTIC REPRESENTATIONS FOR VISION-AND-LANGUAGE TASKS
• Key Contribution:
• Two parallel streams for visual and linguistic processing that interact through novel co-attentional transformer layers
• Dataset:
• Conceptual Captions (~3.3 million images)
• Proxy Tasks:
• Predicting masked words and image regions
• Predicting whether an image and a text segment correspond
16. VILBERT
Method:
• Develop a two-stream architecture that models each modality separately, then fuses them through a small set of attention-based interactions.
• This approach allows variable network depth for each modality and enables cross-modal connections at different depths.
17. VILBERT: INPUT REPRESENTATIONS
• Image features
• Generated by extracting bounding boxes and their visual features from a pre-trained object detection network (Faster R-CNN with a ResNet-101 backbone)
• Spatial info is encoded in a 5-d vector built from the region position (normalized top-left and bottom-right coordinates, plus the fraction of image area covered)
• This vector is projected to match the dimension of the visual features, and the two are summed, as sketched below
• Word embeddings are initialized from BERT-base, pretrained on BookCorpus and Wikipedia
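A sketch of the 5-d spatial encoding and the project-and-sum step, assuming pixel-space boxes and 2048-d region features; names are illustrative:

```python
import torch
import torch.nn as nn

def spatial_vector(box, img_w, img_h):
    """5-d region position: normalized top-left and bottom-right
    coordinates plus the fraction of image area covered.
    box = (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    area_frac = ((x2 - x1) * (y2 - y1)) / (img_w * img_h)
    return torch.tensor([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h, area_frac])

class RegionEmbedding(nn.Module):
    """Project the 5-d spatial vector up to the visual feature dimension
    and sum it with the region feature."""

    def __init__(self, feat_dim=2048):
        super().__init__()
        self.spatial_proj = nn.Linear(5, feat_dim)

    def forward(self, region_feats, spatial_vecs):  # (B, N, 2048), (B, N, 5)
        return region_feats + self.spatial_proj(spatial_vecs)
```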
18. NOVELTY
• Co-TRM: co-attentional transformer layers enable information exchange between modalities
• The keys and values from each modality are passed as input to the other modality’s multi-headed attention block (see the sketch at the end of this slide)
• The exchange between the two streams is restricted to specific layers
• The text stream has significantly more processing before interacting with visual features:
• visual features are already fairly high-level and require limited context aggregation compared to words in a sentence
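A simplified Co-TRM block in PyTorch: each stream queries the other stream's keys and values. Feed-forward sublayers and other details are reduced relative to the paper, and the layer names are illustrative:

```python
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    """Co-attentional transformer block: text queries attend over image
    keys/values, and image queries attend over text keys/values."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_norm = nn.LayerNorm(dim)
        self.img_norm = nn.LayerNorm(dim)

    def forward(self, txt, img):  # (B, T, dim), (B, N, dim)
        # Swap keys/values across modalities to exchange information.
        txt_out, _ = self.txt_attn(query=txt, key=img, value=img)
        img_out, _ = self.img_attn(query=img, key=txt, value=txt)
        return self.txt_norm(txt + txt_out), self.img_norm(img + img_out)
```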
19. TRAINING TASKS
• Alignment Task
• The model is presented with an image and text pair {IMG, v1, …, vT, CLS, w1, …, wT, SEP} and predicts whether the image and text are aligned
• The outputs at IMG and CLS are holistic representations of the image and text inputs
• The overall representation is computed as an element-wise product of the IMG and CLS representations, and a linear layer on top makes the binary prediction (see the sketch at the end of this slide)
• Masked Modeling Task
• Masked words and image regions are predicted from the remaining inputs, as in the proxy tasks above
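A sketch of the alignment head, assuming 768-d IMG and CLS outputs; the class name is illustrative:

```python
import torch.nn as nn

class AlignmentHead(nn.Module):
    """Combine the holistic IMG and CLS outputs by element-wise product,
    then predict aligned / not aligned with a linear layer."""

    def __init__(self, dim=768):
        super().__init__()
        self.classifier = nn.Linear(dim, 2)  # binary: aligned vs. not

    def forward(self, h_img, h_cls):          # (B, dim) each
        return self.classifier(h_img * h_cls) # element-wise product, then linear
```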
20. ISSUES WITH VILBERT
• Cannot incorporate pre-trained unimodal representations
• Cannot operate on an arbitrary sequence of dense vectors
21. FB: SUPERVISED MULTIMODAL BITRANSFORMERS FOR CLASSIFYING IMAGES AND TEXT
• Jointly finetunes unimodally pretrained text and image encoders by projecting image embeddings into the text token space
• Easier to incorporate pre-trained unimodal models in this architecture
22. MMBT: IMAGE ENCODER
• Get feature maps from ResNet-152
• Use ResNet-152 with average pooling over K x M grids in the image, yielding N = KM output vectors of 2048 dimensions
• Learn weights to project each of the N image embeddings into the D-dimensional token input embedding space
• In effect, image embeddings are mapped into BERT’s token space using a set of randomly initialized mappings
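A sketch of this encoder using torchvision's ResNet-152; here K = 3 and M = 1 give N = 3 image tokens, but the grid size is a free choice, and in practice the backbone would be loaded with pretrained weights:

```python
import torch.nn as nn
import torchvision.models as models

class ImageTokenEncoder(nn.Module):
    """ResNet-152 feature maps, average-pooled over a K x M grid to get
    N = K*M vectors of 2048 dims, each projected into the D-dimensional
    token embedding space (D = 768 for BERT-base)."""

    def __init__(self, k=3, m=1, token_dim=768):
        super().__init__()
        resnet = models.resnet152()  # pretrained weights would be used in practice
        # Drop the final average pool and fc layer, keep the conv trunk.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d((k, m))  # K x M grid of pooled cells
        self.proj = nn.Linear(2048, token_dim)    # randomly initialized mapping

    def forward(self, images):                      # (B, 3, H, W)
        feats = self.pool(self.backbone(images))    # (B, 2048, K, M)
        feats = feats.flatten(2).transpose(1, 2)    # (B, N = K*M, 2048)
        return self.proj(feats)                     # (B, N, token_dim)
```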
23. EVALUATION
• Surprisingly competitive with ViLBERT
• Create hard test sets
• Construct hard test sets by taking the examples where the BERT and image-classifier predictions are most different from the ground-truth classes in the test set (see the sketch after this list)
• Compare with
• Text-only BERT
• Image-only model
• Concat BoW + Image
• Late fusion
• Concat BERT + Img
• Concatenate the outputs of the BERT and image baselines (2048 + 768) and apply a linear classifier on top
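One way to realize the hard-set selection, assuming per-example class probabilities from the two unimodal classifiers; the exact selection rule in the paper may differ, so this only illustrates the idea:

```python
import numpy as np

def hard_test_set(bert_probs, img_probs, labels, fraction=0.1):
    """Keep the examples where the unimodal (BERT and image) classifiers
    are most wrong, i.e. assign the lowest probability to the true class.
    *_probs: (n, n_classes) predicted probabilities; labels: (n,)."""
    idx = np.arange(len(labels))
    # Combined distance of both unimodal predictions from the ground truth.
    error = (1 - bert_probs[idx, labels]) + (1 - img_probs[idx, labels])
    n_hard = int(len(labels) * fraction)
    return np.argsort(-error)[:n_hard]  # indices of the hardest examples
```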