Vision and Language: Past, Present and Future

Goergen Institute for Data Science
Professor at University of Rochester

ICME 2021 Keynote

Vision and Language
The Past, Present and Future
Jiebo Luo
University of Rochester
2021 IEEE International Conference on Multimedia and Expo
Vision-and-Language
A cat is sitting next to a
pine tree, looking up
• The intersection of computer vision and natural language processing
• Multi-modal learning
Language
Vision
• Vision-language joint representation learning
• General multi-modal learning
Why Language in Vision?
• Human-computer interaction
○ Easy to generate
• Language-based search
○ Clear specification
“person wearing black pants walking to the back chair”
● Direct applications
Why Language in Vision?
• Visual representation learning
with language supervision
● Representation learning
• Stronger vision, language, and
VL models
Why Vision in Language?
• Multimodal Machine Translation
○ Help solve data sparsity and
ambiguity
• Unsupervised Grammar Induction
○ Help induce syntactic
structures
A cat is sitting
next to a pine
tree, looking up.
Generation tasks
Understanding tasks
QA/
Reasoning
Visual
captioning
Text-to-image
synthesis
Image retrieval
Vision-and-Language Tasks
Visual grounding
What is the cat
doing
Vision-and-Language Research in the Stone Age (Pre-DL)
[1] * Iftekhar Naim, Young Song, Daniel Gildea, Qiguang Liu, Henry Kautz and Jiebo Luo, "Unsupervised Alignment of Natural Language
Instructions with Video Segments," AAAI 2014.
[2] * Iftekhar Naim, Young Chol Song, Henry Kautz, Jiebo Luo, Qiguang Liu, Daniel Gildea, and Liang Huang, "Discriminative Unsupervised
Alignment of Natural Language Instructions with Corresponding Video Segments," NAACL 2015.
[3] * Young Chol Song, Iftekhar Naim, Abdullah Al Mamun, Kaustubh Kulkarni, Parag Singla, Jiebo Luo, Daniel Gildea and Henry Kautz,
"Unsupervised Alignment of Actions in Video with Text Descriptions," IJCAI 2016.
Unsupervised
BoW, CRF, DTW, etc.
Credit: VL-CVPR Tutorial. https://rohit497.github.io/Recent-Advances-in-Vision-and-Language-Research
Machine Translation/Grammar Induction
Credit: VL-CVPR Tutorial. https://rohit497.github.io/Recent-Advances-in-Vision-and-Language-Research
Visual Captioning
Visual Captioning – Describe the content of an image or video with a
natural language sentence
Visual Captioning Taxonomy
Credit: Luowei Zhou
Image Captioning
• Motivations
– Real-world Usability
• Help visually impaired and learning-impaired people
– Improving Image Understanding
• Classification, Object detection
– Image Retrieval
1. a young girl inhales with the intent of blowing out a candle
2. girl blowing out the candle on an ice cream
1. A shot from behind home plate of
children playing baseball
2. A group of children playing
baseball in the rain
3. Group of baseball players playing
on a wet field
Image Captioning with CNN-LSTM
Vinyals et al. “Show and Tell: A Neural Image Caption Generator”, CVPR 2015
Image Captioning with CNN-LSTM
Vinyals et al. “Show and Tell: A Neural Image Caption Generator”, CVPR 2015
Image Captioning with CNN-LSTM
Vinyals et al. “Show and Tell: A Neural Image Caption Generator”, CVPR 2015
• Teacher forcing training
Decoder input: [Begin] cat sitting outside → target output: cat sitting outside [END]
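A minimal PyTorch-style sketch of this teacher-forcing setup for a CNN-LSTM captioner; the module names, vocabulary size, and feature dimensions are illustrative assumptions, not details of the original Show-and-Tell model.

```python
import torch
import torch.nn as nn

class CNNLSTMCaptioner(nn.Module):
    """Minimal CNN-LSTM captioner trained with teacher forcing (illustrative sketch)."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)    # project the CNN image feature
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # predict the next word

    def forward(self, img_feat, input_tokens):
        # Prepend the projected image feature, then feed the ground-truth
        # (shifted) caption tokens as inputs: [Begin] cat sitting outside
        x = torch.cat([self.img_proj(img_feat).unsqueeze(1),
                       self.embed(input_tokens)], dim=1)
        h, _ = self.lstm(x)
        return self.out(h[:, 1:, :])  # logits aligned with targets: cat sitting outside [END]

# Training step: cross-entropy between predicted logits and the target sequence.
model = CNNLSTMCaptioner(vocab_size=10000)
img_feat = torch.randn(2, 2048)              # CNN features for a batch of 2 images
inputs = torch.randint(0, 10000, (2, 5))     # [Begin] w1 w2 w3 w4
targets = torch.randint(0, 10000, (2, 5))    # w1 w2 w3 w4 [END]
logits = model(img_feat, inputs)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 10000), targets.reshape(-1))
```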
Image Captioning with CNN-LSTM
Vinyals et al. “Show and Tell: A Neural Image Caption Generator”, CVPR 2015
Image Captioning with Soft Attention
• Soft (self) Attention – Dynamically attend to input content based on query
Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.
Entire image ⇒ informative parts
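The soft-attention step itself is just a query-dependent weighted sum over image locations; a small sketch under assumed tensor shapes (a 7×7 feature grid with 512-d features):

```python
import torch
import torch.nn.functional as F

def soft_attention(query, grid_feats):
    """query: (B, D) decoder hidden state; grid_feats: (B, N, D) features of N image locations.
    Returns a context vector (B, D) as an attention-weighted sum over locations."""
    scores = torch.bmm(grid_feats, query.unsqueeze(2)).squeeze(2)      # (B, N) dot-product scores
    weights = F.softmax(scores, dim=1)                                 # attention over locations
    context = torch.bmm(weights.unsqueeze(1), grid_feats).squeeze(1)   # (B, D) weighted sum
    return context, weights

# Example: 7 x 7 = 49 grid cells with 512-d features.
ctx, attn = soft_attention(torch.randn(2, 512), torch.randn(2, 49, 512))
```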
Review: Previous Image Captioning
Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.
Image Captioning with Soft Attention
Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.
Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.
Image Captioning with Soft Attention
Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.
Image Captioning with Soft Attention
Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.
Image Captioning with Soft Attention
Additional textual information
• The image's own noisy titles, tags, or captions (from the Web)
• Visually similar nearest neighbor images
Exploit stronger vision models
*You, Jin, Wang, Fang, Luo. "Image captioning with semantic attention." CVPR 2016.
Image Captioning with Semantic Attention
Image Captioning with Semantic Attention
Visual Attributes
Image Captioning with “Fancier” Attention
Anderson, Peter, et al. "Bottom-up and top-down attention for image captioning and visual question answering.", CVPR 2018
● Region-based attention
Image Captioning with “Fancier” Attention
Anderson, Peter, et al. "Bottom-up and top-down attention for image captioning and visual question answering.", CVPR 2018
Review: Previous Image Captioning
Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.
Image Captioning with “Fancier” Attention
• Attention (weighted sum) over:
N×N grid patches → K object regions
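Concretely, only the set being attended over changes; a short sketch assuming 36 detector regions with 2048-d features (the numbers are illustrative):

```python
import torch
import torch.nn.functional as F

# K object regions from a detector (e.g., 36 boxes with 2048-d features) replace
# the N x N grid of CNN cells; the attention computation itself is unchanged.
region_feats = torch.randn(2, 36, 2048)   # (B, K, D) bottom-up region features
query = torch.randn(2, 2048)              # top-down query from the language LSTM
scores = torch.bmm(region_feats, query.unsqueeze(2)).squeeze(2)        # (B, K)
weights = F.softmax(scores, dim=1)                                     # attention over regions
context = torch.bmm(weights.unsqueeze(1), region_feats).squeeze(1)     # (B, D) weighted sum
```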
Image Captioning with “Fancier” Attention
Anderson, Peter, et al. "Bottom-up and top-down attention for image captioning and visual question answering.", CVPR 2018
Image Captioning with “Fancier” Attention
Image Captioning with “Fancier” Attention
Evaluation – Benchmark Datasets
Evaluation – Benchmark Datasets
Evaluation – Metrics
Most commonly used: BLEU / METEOR / CIDEr / SPICE
• BLEU: n-gram precision
• METEOR: ordering-sensitive, based on unigram matching
• CIDEr: weights important n-grams via TF-IDF
• SPICE: F1-score over caption scene-graph tuples
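For a concrete sense of n-gram-precision scoring, a sentence-level BLEU computation with NLTK (illustrative only; reported benchmark numbers come from the standard corpus-level evaluation scripts):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "group", "of", "children", "playing", "baseball", "in", "the", "rain"]]
candidate = ["children", "playing", "baseball", "in", "the", "rain"]

# Sentence-level BLEU with smoothing to avoid zero counts for missing higher-order n-grams.
score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.3f}")
```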
Image Captioning – Advanced Topics
• Un-/weakly-supervised
• Dense captioning
• Novel object captioning
• Stylized captioning
• Captioning with reading comprehension
• Grounded captioning
Image Captioning with Unpaired Data
Key: disentangle sentence quality and image-text relevance
* Yang Feng, Lin Ma, Wei Liu, Jiebo Luo. "Unsupervised image captioning." In CVPR 2019.
• Image-text relevance
– Cycle consistency
Image Captioning with Unpaired Data
Caption -G-> Image -C-> Caption' -G-> Image' (G: generator, C: captioner)
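A hedged sketch of the cycle-consistency idea (the generator G, captioner C, and the reconstruction loss below are simplified placeholders rather than the actual CVPR 2019 model):

```python
import torch
import torch.nn as nn

def cycle_consistency_loss(caption_emb, G, C):
    """caption_emb: (B, D) sentence embedding of an unpaired caption.
    G: maps a caption embedding to a (pseudo) image feature.
    C: maps an image feature back to a caption embedding.
    The cycle Caption -> G -> Image -> C -> Caption' should reconstruct the input,
    which ties image-text relevance without paired supervision."""
    image_feat = G(caption_emb)    # Caption -> Image
    caption_rec = C(image_feat)    # Image -> Caption'
    return nn.functional.mse_loss(caption_rec, caption_emb)

# Toy modules standing in for the real generator / captioner.
G = nn.Linear(300, 2048)
C = nn.Linear(2048, 300)
loss = cycle_consistency_loss(torch.randn(4, 300), G, C)
```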
• Sentence quality
Image Captioning with Unpaired Data
Image Captioning with Unpaired Data
Stylized Captioning
* Chen, Zhang, You, Fang, Wang, Jin, Luo. "'Factual' or 'Emotional': Stylized Image Captioning with Adaptive Learning and Attention." In ECCV 2018.
Style word and factual word
Stylized Captioning
Captioning with Reading Comprehension
* Yang, Lu, Yin, Florencio, Wang, Zhang, Zhang, Luo. "TAP: Text-Aware Pre-training for Text-VQA and Text-Caption." In CVPR 2021.
Captioning with Reading Comprehension
(Figure: OCR tokens such as "30" and "317" and object tokens such as "bike" and "helmet" detected in the image)
* Yang, Lu, Yin, Florencio, Wang, Zhang, Zhang, Luo. "TAP: Text-Aware Pre-training for Text-VQA and Text-Caption." In CVPR 2021.
Captioning with Reading Comprehension
* Yang, Lu, Yin, Florencio, Wang, Zhang, Zhang, Luo. "TAP: Text-Aware Pre-training for Text-VQA and Text-Caption." In CVPR 2021.
Captioning with Reading Comprehension
From Image to Video
Video Captioning
● Encoder-Decoder Network
Assisted by POS Tags
* Hou, Wu, Zhao, Luo. "Joint syntax representation learning and visual cue translation for video captioning." In ICCV 2019.
Video Captioning
• Dense video captioning
• Video paragraph description
• Other special domains
GIF, Sports, etc.
Animated GIF Captioning (TGIF dataset)
* Li, Song, Cao, Tetreault, Goldberg, Luo. "TGIF: A new dataset and benchmark on animated GIF description." In CVPR 2016.
Sports Video Captioning
* Qi, Qin, Li, Wang, Luo. "Sports video captioning by attentive motion representation based hierarchical recurrent neural networks." ACMMW 2018.
* Qi, Qin, Li, Wang, Luo, Van Gool. "stagNet: an attentive semantic RNN for group activity and individual action recognition." IEEE TCSVT.
Sports Video Captioning
Sports Video Captioning
Credit: VL-CVPR Tutorial. https://rohit497.github.io/Recent-Advances-in-Vision-and-Language-Research
Machine Translation/Grammar Induction
Visual Grounding
Bounding box/
box tubelet
• Visual grounding:
visual-text correspondence
Language query:
grass in front of the house
a skater in red is skating with
her partner in black
Image Grounding
• Image visual grounding
Language query:
grass in front of the house
Bounding box
Image Visual Grounding
• Language query => region of the image
Query: grass in front of the house
* Yang, Gong, Wang, Huang, Yu, Luo. "A fast and accurate one-stage approach to visual grounding." In ICCV 2019. (oral)
• Two-stage framework
Existing Framework
Query: center building
• Two-stage framework
– Region proposal
– Similarity ranking
Existing Framework
• Limited candidates
• Slow in speed
Existing Framework
● Propose-and-Rank
One-stage Visual Grounding
• Propose-and-rank => Grounding-by-detection
* Yang, Gong, Wang, Huang, Yu, Luo. "A fast and accurate one-stage approach to visual grounding." In ICCV 2019. (oral)
• Accurate: +7-20% absolute
• Fast: 10x
One-stage Visual Grounding
Architecture
• Encoder module
• Fusion module
• Grounding module
• Encoder
• Fusion module
• Grounding module
• Visual encoder
• Language encoder
• Spatial encoder
Architecture
• Encoder
• Fusion module
• Grounding module
• Image-level fusion
Architecture
• Encoder
• Fusion module
• Grounding module
• Output format: box + confidence
Architecture
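A minimal sketch of this one-stage design: broadcast the language embedding (plus a spatial coordinate encoding) to every cell of the visual feature map and predict a box with a confidence score per cell, YOLO-style. The shapes and layers here are illustrative assumptions, not the exact ICCV 2019 architecture.

```python
import torch
import torch.nn as nn

class OneStageGroundingHead(nn.Module):
    """Fuse text with each spatial cell and regress (x, y, w, h, confidence) per cell."""
    def __init__(self, vis_dim=512, txt_dim=512, spa_dim=8):
        super().__init__()
        fused = vis_dim + txt_dim + spa_dim
        self.head = nn.Sequential(
            nn.Conv2d(fused, 256, kernel_size=1), nn.ReLU(),
            nn.Conv2d(256, 5, kernel_size=1))  # 4 box offsets + 1 confidence

    def forward(self, vis_map, txt_emb, spa_map):
        # vis_map: (B, Cv, H, W), txt_emb: (B, Ct), spa_map: (B, Cs, H, W) coordinate encoding
        B, _, H, W = vis_map.shape
        txt_map = txt_emb[:, :, None, None].expand(B, -1, H, W)   # broadcast text to every cell
        fused = torch.cat([vis_map, txt_map, spa_map], dim=1)     # image-level fusion
        return self.head(fused)  # (B, 5, H, W): box + confidence at each location

head = OneStageGroundingHead()
out = head(torch.randn(2, 512, 16, 16), torch.randn(2, 512), torch.randn(2, 8, 16, 16))
```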
Datasets
• For phrase localization: Flickr30K Entities
• For referring expression comprehension: ReferItGame
• Metric: Acc@0.5 IoU (see the sketch below)
Example query: "the black backpack on the bottom right" (phrase localization / referring expression comprehension)
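Acc@0.5 IoU counts a prediction as correct when its box overlaps the ground-truth box with intersection-over-union of at least 0.5; a small self-contained sketch (boxes assumed to be [x1, y1, x2, y2]):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def acc_at_05(predictions, ground_truths):
    """Fraction of queries whose predicted box has IoU >= 0.5 with the ground truth."""
    hits = sum(iou(p, g) >= 0.5 for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

print(acc_at_05([[10, 10, 60, 60]], [[20, 20, 70, 70]]))  # IoU ~= 0.47 -> 0.0
```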
Comparison to other methods
Qualitative Results
(Figure legend: predicted box vs. ground truth; our one-stage model vs. the two-stage baseline)
Understanding Complex Queries
• External VL graph/ tree
• Recursive modeling
Understanding Complex Queries
● Recursive Solution
• A recursive multi-round approach is proposed
Example query: "man on the couch looking on the tv", processed in multiple rounds with constructed sub-queries
* Yang, Chen, Wang, Luo. "Improving one-stage visual grounding by recursive sub-query construction." In ECCV 2020.
• Sub-query learner (to construct the sub-queries)
• Sub-query modulation (to refine the fused feature with sub-queries)
Method
● Framework Overview
Experiments
Experiments
● Recursive Disambiguation
• Recursive disambiguation procedure
Visual Grounding beyond Images
Bounding box/
box tubelet
Language query:
grass in front of the house
a skater in red is skating with
her partner in black
Video Temporal Grounding
[1] Gao, Jiyang, et al. "Tall: Temporal activity localization via language query." In ICCV 2017.
[2] *Songyang Zhang, Jinsong Su, Jiebo Luo. "Exploiting Temporal Relationships in Video Moment Localization with Natural Language.” MM 2019.
Video Temporal Grounding
Video Temporal Grounding (2D TAN)
* Zhang, Peng, Fu, Luo. "Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language." In AAAI 2020.
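The central structure in 2D-TAN is a two-dimensional temporal map whose cell (i, j) represents the candidate moment spanning clips i through j; a hedged sketch that builds such a map by average-pooling clip features (the pooling choice and feature size are illustrative):

```python
import torch

def build_2d_temporal_map(clip_feats):
    """clip_feats: (N, D) features of N consecutive video clips.
    Returns a map of shape (N, N, D) where map[i, j] (i <= j) is the average-pooled
    feature of the candidate moment covering clips i..j; cells with i > j stay zero."""
    N, D = clip_feats.shape
    moment_map = torch.zeros(N, N, D)
    for i in range(N):
        for j in range(i, N):
            moment_map[i, j] = clip_feats[i:j + 1].mean(dim=0)
    return moment_map

# Example: 16 clips with 512-d features; a scoring head over this map would then
# rank every (start, end) candidate against the sentence embedding.
m = build_2d_temporal_map(torch.randn(16, 512))
```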
Video Object Grounding
• From image to video
* Real-time (~40 fps) with a single NVIDIA 1080 Ti GPU
** Results of Yang, Kumar, Chen, Su, Luo. "Grounding-Tracking-Integration." In IEEE T-CSVT.
Video Object Grounding
● Two-stage Approach
Tubelet
Linking
Fusion &
Ranking
Encoder
Zhang, Zhu, et al. "Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences." In CVPR 2020.
Video Object Grounding
● One-stage Approach
Detection
Fusion
Video Object Grounding
● One-stage Approach
• First real-time video object grounding framework
• Self-evaluate
3D Visual Grounding
[1] Panos, Achlioptas, et al. "ReferIt3D: Neural Listeners for Fine-Grained Object Identification in Real-World 3D Scenes." In ECCV 2020 Oral.
[2] Chen, Dave, et al. "ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language." In ECCV 2020.
Challenges
• Propose and rank
[1] Panos, Achlioptas, et al. "ReferIt3D: Neural Listeners for Fine-Grained Object Identification in Real-World 3D Scenes." In ECCV 2020 Oral.
[2] Chen, Dave, et al. "ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language." In ECCV 2020.
• Ranking is the key: point-cloud and language joint representation
learning
* Zhengyuan Yang, Songyang Zhang, Liwei Wang, Jiebo Luo. "SAT: 2D Semantics Assisted Training for 3D Visual Grounding." arXiv preprint.
Nr3D Challenge Winner (CVPR 2021)
Credit: VL-CVPR Tutorial. https://rohit497.github.io/Recent-Advances-in-Vision-and-Language-Research
Machine Translation/Grammar Induction
Visual QA/Reasoning
Datasets
• Large-scale annotated datasets have driven tremendous progress in this field
Slide credit: Zhe Gan. CVPR tutorial
• What a typical system looks like
• Image captioning
• Visual grounding
Image Feature
• From grid features to region features, and to grid features again
Multi-modal Fusion
Slide credit: Zhe Gan. CVPR tutorial
Transformers
Credit: VL-CVPR Tutorial. https://rohit497.github.io/Recent-Advances-in-Vision-and-Language-Research
Machine Translation/Grammar Induction
Text-to-image Synthesis
Generative Adversarial Networks (GAN)
Conditional Image Synthesis
Slide credit: Yu Cheng. CVPR tutorial
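The conditioning idea is simply to feed a text embedding alongside the noise vector; a toy sketch of a text-conditioned generator (layer sizes are illustrative and far simpler than real text-to-image GANs):

```python
import torch
import torch.nn as nn

class TextConditionedGenerator(nn.Module):
    """Map (noise z, text embedding t) to an image; a toy stand-in for real text-to-image GANs."""
    def __init__(self, z_dim=100, txt_dim=256, img_ch=3):
        super().__init__()
        self.fc = nn.Linear(z_dim + txt_dim, 128 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),      # 8x8 -> 16x16
            nn.ConvTranspose2d(64, img_ch, 4, stride=2, padding=1), nn.Tanh())   # 16x16 -> 32x32

    def forward(self, z, txt_emb):
        x = self.fc(torch.cat([z, txt_emb], dim=1)).view(-1, 128, 8, 8)
        return self.deconv(x)

G = TextConditionedGenerator()
fake = G(torch.randn(4, 100), torch.randn(4, 256))  # (4, 3, 32, 32) images conditioned on text
```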
Conditional Image Synthesis
Slide credit: Yu Cheng. CVPR tutorial
DALL·E
Unsupervised Text-to-Image Synthesis
* Yanlong Dong, Ying Zhang, Lin Ma, Zhi Wang, Jiebo Luo, "Unsupervised text-to-image synthesis," Pattern Recognition.
Caption -G-> Image -C-> Caption' -G-> Image' (G: generator, C: captioner)
Unsupervised Text-to-Image Synthesis
* Yanlong Dong, Ying Zhang, Lin Ma, Zhi Wang, Jiebo Luo, "Unsupervised text-to-image synthesis," Pattern Recognition.
Text to Sentiment
[1] * Jie An, Tianlang Chen, Songyang Zhang, Jiebo Luo, "Global Image Sentiment Transfer," ICPR 2020.
[2] * Tianlang Chen, Wei Xiong, Haitian Zheng, Jiebo Luo, "Image Sentiment Transfer," ACM MM 2020.
Credit: VL-CVPR Tutorial. https://rohit497.github.io/Recent-Advances-in-Vision-and-Language-Research
Machine Translation/Grammar Induction
Grammar Induction
Goal: Grammar induction aims to capture syntactic information in sentences in the form of
constituency parsing trees.
• Supervised Grammar Induction
– Annotating syntactic trees is expensive and time-consuming
– Limited to the newswire domain in several major languages
• Unsupervised Grammar Induction
– Learn from large-scale unlabeled text
– Provide evidence for statistical learning
* Songyang Zhang, Linfeng Song, Lifeng Jin, Dong Yu, Jiebo Luo, "Video-aided Unsupervised Grammar Induction," In NAACL 2021 (Best Long Paper).
Image-aided Unsupervised Grammar Induction
Images can help us induce syntactic structure. [Shi et al. ACL’19]
Intuition:
Exploiting regularities between text spans and images.
Sentence: "a squirrel jumps on stump" → Parser → constituent spans c1: "a squirrel", c2: "on stump", c3: "jumps on stump", …
Video-aided Unsupervised Grammar Induction
Motivation: Videos include not only static objects but also actions and state changes useful for
inducing verb phrases (VP).
Our Approach
Multi-Modal Compound PCFG (MMC-PCFG), where PCFG stands for probabilistic context-free grammar
[Gabeur et al. ECCV 2020]
(Architecture figure: a grammar inducer builds a parse chart over the sentence "A squirrel jumps on stump" (w1…w5) and is trained by marginal likelihood optimization; span representations are matched against video representations produced by a multi-modal transformer over object, action, scene, audio, OCR, face, and speech features, via video-span and video-sentence matching losses with a gated embedding module.)
Main Results
Sentence-level F1 scores by singular features on three benchmark datasets.
(Figure: comparison between the text-only baseline and models using a single type of video feature.)
Main Results
Sentence-level F1 scores on three benchmark datasets.
(Figure: MMC-PCFG compared against baseline parsers on the three datasets.)
Qualitative Analysis
(Qualitative figure: reference trees vs. trees induced by C-PCFG, VC-PCFG+Object(SENet), VC-PCFG+Action(I3D), and MMC-PCFG, for example captions such as "a guy playing with his cat making the cat dance", "some one is describing about racing car kept on road having different colors", and "a boy laying by a wall talking to a webcam".)
Multimodal Machine Translation (MMT)
* Huan Lin, Fandong Meng, Jinsong Su, Yongjing Yin, Zhengyuan Yang, Yubin Ge, Jie Zhou, Jiebo Luo, "Dynamic Context-guided Capsule Network for Multimodal Machine Translation," ACM Multimedia Conference, 2020.
Visual Features used in MMT
• Global visual features (Huang et al., 2016; Calixto et al., 2017; Elliott et al., 2017; Zhou et al., 2018; Calixto et al., 2019), extracted with a CNN
• Object-based visual features (Huang et al., 2016; Ive et al., 2019), extracted with a CNN
How to Effectively Utilize Visual Features in MMT?
• Exploit visual features as global visual context (Huang et al., 2016; Calixto et al., 2017a; Grönroos et al., 2018)
• Apply an attention mechanism to extract visual context (Calixto et al., 2017b; Delbrouck et al., 2017; Helcl et al., 2018; Arslan et al., 2018)
• Learn multimodal joint representations (Calixto et al., 2019; Elliott et al., 2017; Zhou et al., 2018)
(Slide annotations on these approaches: "Lack variability!", "Too many parameters!", "Lack variability!")
Dynamic Context-guided Capsule Network (DCCN)
Experiments
Table 1: Experimental results on the En-De translation task.
Experiments
Table 1: Experimental results on the En-De translation task.
Case Study
• Global visual features
Objects: [shorts, man, woman, sign, sidewalk, umbrella, dress, glasses, girl]
EN: A girl wearing a mask rides on a man’s shoulders through a crowded sidewalk.
Ref (DE): ... reitet auf den Schultern eines Mannes ...
Transformer: ... fährt auf den Schultern eines Mannes ...
Encoder-attention: ... fährt auf den Schultern eines Mannes ...
Doubly-attention : ... fährt auf den Schultern eines Mannes ...
DCCN: ... reitet auf den Schultern eines Mannes ...
Credit: VL-CVPR Tutorial. https://rohit497.github.io/Recent-Advances-in-Vision-and-Language-Research
Supervised Learning
Slide credit: Licheng Yu , Linjie Li and Yen-Chun Chen CVPR tutorial
Slide credit: Licheng Yu , Linjie Li and Yen-Chun Chen CVPR tutorial
Slide credit: Licheng Yu , Linjie Li and Yen-Chun Chen CVPR tutorial
Slide credit: Licheng Yu , Linjie Li and Yen-Chun Chen CVPR tutorial
Slide credit: Licheng Yu , Linjie Li and Yen-Chun Chen CVPR tutorial
Slide credit: Licheng Yu , Linjie Li and Yen-Chun Chen CVPR tutorial
Slide credit: Licheng Yu , Linjie Li and Yen-Chun Chen CVPR tutorial
Recent VLP Methods
CLIP, ALIGN, Wenlan, VinVL
Slide credit: Licheng Yu , Linjie Li and Yen-Chun Chen CVPR tutorial
Slide credit: Licheng Yu , Linjie Li and Yen-Chun Chen CVPR tutorial
Recent Large Scale Pre-training
CLIP: 18 days to train on 592 V100 GPUs
• CLIP (OpenAI): 300M image-text pairs
• ALIGN (Google): 1.8B image-text pairs
• Wenlan (Renmin University): 500M image-text pairs
• WIT (Google): 37.6M image-text pairs
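These models are pre-trained with a symmetric image-text contrastive (InfoNCE-style) objective over very large batches; a minimal sketch of that loss, with encoder outputs stubbed in as random tensors:

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, D) embeddings of B matching image-text pairs.
    Every other item in the batch serves as a negative for both directions."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    logits = img_emb @ txt_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))              # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +          # image -> text
            F.cross_entropy(logits.t(), targets)) / 2   # text -> image

loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```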
Model Architecture
VLP: (1) architecture + (2) pre-training tasks
Transformer
Encoder
Decoder
Vaswani, Ashish, et al. "Attention is all you need." arXiv preprint arXiv:1706.03762 (2017).
Transformer
Transformer
Transformer
VLP Pretraining Tasks
Pretraining Tasks
Masked region/language modeling
Pretraining Tasks
Image-text matching
[1] * Tianlang Chen, Jiajun Deng and Jiebo Luo. "Adaptive Offline Quintuplet Loss for Image-Text Matching." ECCV 2020.
[2] * Tianlang Chen and Jiebo Luo. "Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching." AAAI 2020.
[3] * Quanzeng You, Zhengyou Zhang, Jiebo Luo. “End-to-end Convolutional Semantic Embeddings.” CVPR 2018.
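One common form of the image-text matching objective is a hinge-based triplet ranking loss with hard negatives mined within the batch (as popularized by VSE++-style retrieval models); a hedged sketch:

```python
import torch
import torch.nn.functional as F

def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (B, D) embeddings of matched image-text pairs.
    Penalize any mismatched pair scoring within `margin` of its matched counterpart,
    keeping only the hardest in-batch negative per anchor."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    scores = img_emb @ txt_emb.t()                     # (B, B) cosine similarities
    pos = scores.diag().view(-1, 1)                    # matched-pair scores
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_txt = (margin + scores - pos).clamp(min=0).masked_fill(mask, 0)      # image anchor, caption negatives
    cost_img = (margin + scores - pos.t()).clamp(min=0).masked_fill(mask, 0)  # caption anchor, image negatives
    return cost_txt.max(dim=1).values.mean() + cost_img.max(dim=0).values.mean()

loss = triplet_ranking_loss(torch.randn(8, 512), torch.randn(8, 512))
```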
VL Pretraining with Reading Comprehension
* Yang, Lu, Yin, Florencio, Wang, Zhang, Zhang, Luo. "TAP: Text-Aware Pre-training for Text-VQA and Text-Caption." In CVPR 2021.
VL Pretraining with Reading Comprehension
Question OCR Token Object Token
[CLS] what candy bar is … [SEP] honey 1800 yellow tail ... [SEP] juice wine ... [SEP]
Masked language modeling (MLM)
Method
● VL alignment tasks
[CLS] what candy bar is … [SEP] honey 1800 yellow tail ... [SEP] juice wine ... [SEP]
Question OCR Token Object Token
Contrastive loss
Region Alignment Tasks
• Obj ⇔ OCR region relationship
Q: what number is
on the bike on the
right? A: 317
Q: what is written
on the man's shirt?
A: Life cycle
• Relative position prediction
Experiments
• TextVQA
– 28,408 images (22K training)
– Soft-voting of 10 answers
• TextCaps
– Same images, 110K captions
– Captioning metrics
[1] Hu, Ronghang, et al. "Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA." In CVPR 2020. (M4C)
[2] Singh, Amanpreet, et al. "Towards vqa models that can read." In CVPR 2019.
[3] Sidorov, Oleksii, et al. "TextCaps: a Dataset for Image Captioning with Reading Comprehension." In ECCV 2020.
Question: what is this sign warning drivers of?
Prediction Answer: road work.
GT Answer: ['road work', 'road work ahead', .... ,
'road work ahead', 'road work']. (#matched = 2)
Acc: 0.6.
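The 0.6 above follows the standard (Text)VQA soft accuracy: each of the 10 human answers is left out in turn, the prediction is scored min(#matches among the remaining 9 / 3, 1), and the 10 scores are averaged. A small sketch that reproduces the example (the answer strings are placeholders):

```python
def vqa_soft_accuracy(prediction, gt_answers):
    """Leave-one-out averaged accuracy: min(#matches / 3, 1) over the other 9 answers."""
    scores = []
    for i in range(len(gt_answers)):
        others = gt_answers[:i] + gt_answers[i + 1:]
        matches = sum(a == prediction for a in others)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / len(scores)

# 2 of the 10 ground-truth answers match "road work" -> accuracy 0.6, as in the example.
gt = ["road work", "road work"] + ["road work ahead"] * 8
print(vqa_soft_accuracy("road work", gt))  # ~0.6
```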
● Datasets and Metrics
TextVQA, TextCaps
• #1 in OCR-VL challenges (CVPR 2021)
(Improvements shown in the challenge figure: +14.74, +8.31, +9.88)
[1] Hu, Ronghang, et al. "Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA." In CVPR 2020. (M4C)
[2] Kant, Yash, et al. "Spatially Aware Multimodal Transformers for TextVQA." In ECCV 2020. (SA-M4C)
TextVQA, TextCaps
● Latent Grounding
Video-Language Pretraining
Video-Language Pretraining
• Token representation
• Pretraining tasks
Credit: VL-CVPR Tutorial. https://rohit497.github.io/Recent-Advances-in-Vision-and-Language-Research
Future Directions
• General vision-language models
– Unified pipeline
• Architecture
Wonjae Kim, et al. "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision." arXiv:2102.03334.
Future Directions
• Unified visual-text representation
Future Directions
• General vision-language models
– Unified pipeline
• Tasks: V-L => V-V, V-L, L-L
[1] Hu, Ronghang, et al. "Transformer is all you need: Multimodal multitask learning with a unified transformer." arXiv:2102.10772.
[2] Mingyang Zhou, et al. "UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training." arXiv:2104.00332.
Future Directions
• General vision-language models
– Extra modalities, multi-lingual
Future Directions
• External knowledge
• Specific domain
[1] Zellers, Rowan, et al. "From recognition to cognition: Visual commonsense reasoning." In CVPR 2019.
[2] * Yuan, Jianbo, et al. "Automatic radiology report generation based on multi-view image fusion and medical concept enrichment." In MICCAI 2019.
[3] Tam, Leo K., et al. "Weakly supervised one-stage vision and language disease detection using large scale pneumonia and pneumothorax studies." In MICCAI 2020.
[4] Luo, Weixin, et al. "SIRI: Spatial Relation Induced Network For Spatial Description Resolution." In NeurIPS 2020.
Future Directions
• New NLP tasks
– Multimodal machine translation
– Video-aided grammar induction
[1] * Huan Lin, Fandong Meng, Jinsong Su, Yongjing Yin, Zhengyuan Yang, Yubin Ge, Jie Zhou, Jiebo Luo. “Dynamic Context-guided Capsule Network for Multimodal Machine
Translation.” ACM MM 2020. (oral presentation)
[2] * Songyang Zhang, Linfeng Song, Lifeng Jin, Kun Xu, Dong Yu, Jiebo Luo. “Video-aided Unsupervised Grammar Induction.” NAACL 2021. (Best Long Paper)
Keep in mind our original aspiration. Keep marching forward.
Computer vision is an interdisciplinary scientific field that deals with how
computers can gain high-level understanding from digital images or
videos. (Wikipedia)
“Vision is the process of discovering (and describing) from images what
is present in the world, and where it is.”
-- David Marr, Vision (1982)
Thank you!
