The document is a slide deck for a lecture on language and vision. It covers topics like image captioning, visual question answering, cross-modal embeddings, and image generation from text. The slides provide outlines of the topics, descriptions and examples of different models in each area, and discussions of limitations and future directions. They cite numerous papers in the field and include visualizations to illustrate key concepts.
3. Language & Vision
Caption this picture:
“a bed decorated in black and white bedding”
“a large white bed sitting under two framed pictures”
“a bedroom scene with a bed an television on the wall”
4. Language & Vision
“Two children riding a horse in front of their home”
“a group of sheep trailing one another in a line”
13. Image Captioning
Karpathy et al. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015
The image features are only taken into account in the first time step
14. Image Captioning
Limitation: all output predictions are based on the final, static output of the encoder
[Figure: a CNN encodes the image into a single feature vector of dimension D; an LSTM decoder then generates the caption word by word ("A bird flying ... <EOS>")]
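As a rough illustration of this baseline (not the exact architecture from any particular paper), the sketch below uses toy dimensions and shows how the CNN feature vector enters the decoder only once, at initialization; names such as `to_vocab` and the `<start>` token id are assumptions.

```python
# Minimal sketch of a CNN-encoder + LSTM-decoder captioner (illustrative only).
# The image is encoded ONCE into a static feature vector; the decoder never
# looks back at the image after initialization -- the limitation noted above.
import torch
import torch.nn as nn

D, E, V = 512, 256, 10000          # feature dim, embedding dim, vocab size (assumed)

img_feat = torch.randn(1, D)       # stand-in for the CNN output (e.g. a pooled conv feature)
init_h = nn.Linear(D, E)(img_feat) # project the image feature to the initial hidden state
init_c = torch.zeros_like(init_h)

embed = nn.Embedding(V, E)
lstm = nn.LSTMCell(E, E)
to_vocab = nn.Linear(E, V)

h, c = init_h, init_c
word = torch.tensor([0])           # <start> token id (assumed to be 0)
caption = []
for _ in range(20):                # decode up to 20 words, greedily
    h, c = lstm(embed(word), (h, c))
    word = to_vocab(h).argmax(dim=-1)
    caption.append(word.item())
```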
16. Visual Attention for Image Captioning
[Figure: a CNN maps the image (H x W x 3) to spatial features f (L x D); from the initial state h0 the decoder produces attention weights a1 and the predicted word y1; the first context vector c0 is the average of the features, and y0 is the first word (the <start> token)]
17. Visual Attention for Image Captioning
[Figure: the visual features weighted with attention a1 give the next context vector c1; together with y1, the word predicted in the previous timestep, it drives the decoder to h1, which outputs attention a2 and word y2]
18. Visual Attention for Image Captioning
[Figure: the process unrolled over time: each state ht receives the context vector ct and the previous word yt, and outputs the next attention map a(t+1) and word y(t+1)]
19. Visual Attention for Image Captioning
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
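A minimal numpy sketch of one step of this soft attention, assuming a simple bilinear scoring function and toy dimensions (the paper scores locations with a small MLP instead):

```python
# One step of soft visual attention: score each spatial location against the
# decoder state, softmax the scores, and take the weighted sum of the features.
import numpy as np

L, D, H = 49, 512, 512                  # 7x7 spatial locations, feature dim, hidden dim
f = np.random.randn(L, D)               # spatial CNN features (one D-vector per location)
h = np.random.randn(H)                  # current decoder hidden state
W_f = np.random.randn(D, H) * 0.01      # toy projection used for scoring

scores = f @ W_f @ h                    # one relevance score per location: shape (L,)
a = np.exp(scores - scores.max())
a = a / a.sum()                         # attention weights over the L locations
context = a @ f                         # context vector: attention-weighted sum, shape (D,)
# `context` (together with the previous word) feeds the LSTM at the next step;
# at t = 0 the context is simply the average of f, i.e. uniform attention.
```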
20. Visual Attention for Image Captioning
Some outputs can probably be predicted without looking at the image...
22. Visual Attention for Image Captioning
Can we focus on the image only when necessary?
23. Visual Attention for Image Captioning
“Regular” spatial attention
[Figure: the same unrolled attention decoder as above, attending only over the CNN's spatial feature map]
24. Visual Attention for Image Captioning
Attention with sentinel: the LSTM is modified to output a “non-visual” feature to attend to
[Figure: the same unrolled decoder, now emitting sentinel vectors s0, s1, s2 alongside the hidden states h0, h1, h2]
Lu et al. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. CVPR 2017
25. Visual Attention for Image Captioning
The attention weights indicate when it is more important to look at the image features, and when it is better to rely on the current LSTM state
Lu et al. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. CVPR 2017
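A minimal sketch of the sentinel idea, with invented toy projections: the sentinel is treated as one extra candidate in the attention, and its weight plays the role of the "don't look at the image" gate (the paper's exact gating formulation differs slightly).

```python
# Adaptive attention with a visual sentinel: the sentinel s acts as an extra
# "non-visual" location; its attention weight says how much to ignore the image.
import numpy as np

L, D = 49, 512
f = np.random.randn(L, D)        # visual features
s = np.random.randn(D)           # sentinel vector emitted by the modified LSTM
h = np.random.randn(D)           # current hidden state
W = np.random.randn(D, D) * 0.01 # toy scoring projection

cand = np.vstack([f, s])                 # L visual candidates + 1 sentinel, (L+1, D)
scores = cand @ W @ h
a = np.exp(scores - scores.max()); a /= a.sum()
beta = a[-1]                             # weight on the sentinel ("don't look" gate)
context = a[:-1] @ f + beta * s          # adaptive context vector
# beta close to 1: rely on the language-model state; close to 0: look at the image.
```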
28. Grounded Image Captioning
Lu et al. “Neural Baby Talk.” CVPR 2018
Slot-filling approach: generating a sentence template with empty slots to be filled using the outputs of an object detection model
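A toy illustration of the slot-filling idea (the template, slot names and detections below are invented; Neural Baby Talk generates the template with a neural captioner rather than using a fixed string):

```python
# Slot filling: the captioner emits a template with placeholder slots,
# and an object detector's outputs fill them in.
detections = {"animal": "puppy", "furniture": "couch"}   # detector outputs (assumed)
template = "A <animal> is lying on a <furniture>."        # generated sentence template

caption = template
for slot, obj in detections.items():
    caption = caption.replace(f"<{slot}>", obj)
print(caption)   # -> "A puppy is lying on a couch."
```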
33. Visual Question Answering (VQA)
Kafle & Kanan. Visual Question Answering: Datasets, Algorithms, and Future Challenges. Computer Vision and Image Understanding 2017
34. Visual Question Answering (VQA)
[Figure: the image and the question “What is the mustache made of?” are each encoded (into [z1, z2, … zN] and [y1, y2, … yM]); the two encodings are combined and decoded into the answer “bananas”]
Antol et al. "VQA: Visual question answering." CVPR 2015.
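A rough sketch of this encode / encode / decode pipeline, assuming element-wise multiplication as the fusion step and toy feature sizes (one of several variants explored in the VQA literature, reduced here to a single classifier over a fixed answer set):

```python
# Encode the image, encode the question, fuse, and "decode" by classifying
# over a fixed set of candidate answers. All sizes are illustrative.
import torch
import torch.nn as nn

D_img, D_q, D_joint, n_answers = 4096, 1024, 1024, 3000   # assumed sizes

img_feat = torch.randn(1, D_img)     # stand-in for a CNN encoding of the image
q_feat = torch.randn(1, D_q)         # stand-in for an LSTM encoding of the question

proj_img = nn.Linear(D_img, D_joint)
proj_q = nn.Linear(D_q, D_joint)
classifier = nn.Linear(D_joint, n_answers)

joint = torch.tanh(proj_img(img_feat)) * torch.tanh(proj_q(q_feat))  # fuse modalities
answer_id = classifier(joint).argmax(dim=-1)   # pick the most likely answer
```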
35. Visual Question Answering (VQA)
Anderson et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. CVPR 2018
Region features from Faster R-CNN
36. Visual Question Answering (VQA)
Anderson et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. CVPR 2018
48. Cross-Modal Representations
Salvador et al. “Learning Cross-modal Embeddings for Cooking Recipes and Food Images.” CVPR 2017
Image and text retrieval with joint neural embeddings
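A minimal sketch of retrieval in a joint embedding space, with random vectors standing in for the learned recipe and image encoders; the cosine-similarity ranking is the part that carries over.

```python
# Cross-modal retrieval: both modalities are mapped into one space, and
# retrieval is nearest-neighbour search by cosine similarity.
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

d = 300
recipe_embs = normalize(np.random.randn(1000, d))   # encoded recipes (toy data)
img_emb = normalize(np.random.randn(d))             # encoded query image

scores = recipe_embs @ img_emb                       # cosine similarity (unit vectors)
top5 = np.argsort(-scores)[:5]                       # indices of best-matching recipes
# Training typically pushes matching image/text pairs together and mismatched
# pairs apart (e.g. with a pairwise ranking / contrastive loss).
```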
54. Image Generation from Text
Reed et al. "Generative adversarial text to image synthesis." ICML 2016.
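The core conditioning trick, sketched with an invented toy generator (nothing like the convolutional architectures actually used): the text embedding is concatenated to the noise vector before generation, and the discriminator also sees the text so that mismatched (image, caption) pairs can be penalized.

```python
# Text-conditioned generation: noise z and a text embedding are concatenated
# and mapped to an image. Layers and sizes below are purely illustrative.
import torch
import torch.nn as nn

z_dim, t_dim = 100, 128                       # noise and (compressed) text embedding dims
G = nn.Sequential(nn.Linear(z_dim + t_dim, 256), nn.ReLU(),
                  nn.Linear(256, 64 * 64 * 3), nn.Tanh())   # toy "generator"

text_emb = torch.randn(1, t_dim)              # embedding of the input sentence (assumed)
z = torch.randn(1, z_dim)                     # random noise
fake_img = G(torch.cat([z, text_emb], dim=1)).view(1, 3, 64, 64)
```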
56. Image Generation from Text
Zhang et al. "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks." ICCV 2017
58. Image Generation from Text
Hong et al. Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis. CVPR 2018
65. Captioning: Video
Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning. CVPR 2016. (Slides by Marc Bolaños)
[Figure: a first-layer LSTM runs over the image frames from t = 1 to t = T; its hidden state at t = T summarizes the first chunk of data and is passed to an LSTM unit in the 2nd layer]
66. Lipreading: Watch, Listen, Attend & Spell
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
67. Lipreading: Watch, Listen, Attend & Spell
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
[Figure: the model takes audio features and image features as inputs, each processed by its own encoder]
68. Lipreading: Watch, Listen, Attend & Spell
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
Attention over the output states from audio and video is computed at each timestep
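A numpy sketch of that per-timestep dual attention, with toy sizes and an assumed bilinear score; the actual model learns its attention functions jointly with the character decoder.

```python
# At every decoder step, attend separately over the audio encoder states and
# the video encoder states, then combine the two context vectors.
import numpy as np

def attend(states, query, W):
    scores = states @ W @ query
    a = np.exp(scores - scores.max()); a /= a.sum()
    return a @ states                      # context vector for this modality

Ta, Tv, d = 80, 25, 256                     # audio steps, video steps, state dim (toy)
audio_states = np.random.randn(Ta, d)
video_states = np.random.randn(Tv, d)
query = np.random.randn(d)                  # current decoder state
Wa = np.random.randn(d, d) * 0.01
Wv = np.random.randn(d, d) * 0.01

context = np.concatenate([attend(audio_states, query, Wa),
                          attend(video_states, query, Wv)])  # fed to the character decoder
```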
69. Lip Reading: LipNet
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading." 2016.
The input (video frames) and output (sentence) sequences are not aligned
70. Lip Reading: LipNet
Graves et al. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 2006
CTC loss: Connectionist Temporal Classification
● Avoids the need for an alignment between the input and output sequences by letting the network predict an additional blank token “_”
● Before computing the loss, adjacent repeated tokens are merged and then the blank tokens are removed
● “a _ a b _” == “a a _ a b _” == “a a b”
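The collapse rule from the bullets above can be written in a few lines of plain Python (the token strings are just for illustration):

```python
# CTC collapse: merge adjacent repeated tokens first, then drop the blank.
def ctc_collapse(tokens, blank="_"):
    out = []
    prev = None
    for t in tokens:
        if t != prev:              # merge adjacent repeats
            if t != blank:         # then drop blanks
                out.append(t)
        prev = t
    return out

print(ctc_collapse("a _ a b _".split()))     # ['a', 'a', 'b']
print(ctc_collapse("a a _ a b _".split()))   # ['a', 'a', 'b']
```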
72. Captioning (+ Detection): DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016
74. Captioning: Video
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. CVPR 2015.
79. Visual Question Answering (VQA)
Masuda, Issey, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Open-Ended Visual Question-Answering." ETSETB UPC TelecomBCN (2016).
[Figure: the Image and the Question are encoded and combined to produce the Answer]
80. Visual Question Answering (VQA)
Francisco Roldán, Issey Masuda, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Visual Question-Answering 2.0." ETSETB UPC TelecomBCN (2017).
82. Visual Dialog (Image Guessing Game)
Das, Abhishek, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, and Dhruv Batra. "Visual Dialog." CVPR 2017
83. Visual Question Answering: Dynamic Memory Networks
Xiong et al. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016
84. Multilingual & Multimodal Embeddings
Gella, Spandana, Rico Sennrich, Frank Keller, and Mirella Lapata. "Image Pivoting for Learning Multilingual Multimodal Representations." arXiv preprint arXiv:1707.07601 (2017).
Janarthanan Rajendran, Mitesh M. Khapra, Sarath Chandar, Balaraman Ravindran. Bridge Correlational Neural Networks for Multilingual Multimodal Representation Learning. NAACL 2016
85. Cross-Modal Embeddings
Frome et al. "DeViSE: A deep visual-semantic embedding model." NIPS 2013
86. Joint Neural Embeddings
Socher et al. Zero-shot learning through cross-modal transfer. NIPS 2013
Zero-shot learning: a class not present in the training set of images can still be predicted (e.g. no images of “cat” in the training set)
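A sketch of how zero-shot prediction can work in such a space, with random stand-ins for the word vectors and for the learned image-to-word-space mapping.

```python
# Zero-shot prediction: map the image into the word-vector space and pick the
# nearest class label by cosine similarity, even for classes with no training images.
import numpy as np

d_img, d_word = 4096, 300
class_vecs = {"dog": np.random.randn(d_word),    # seen class
              "horse": np.random.randn(d_word),  # seen class
              "cat": np.random.randn(d_word)}    # unseen class: no training images

W = np.random.randn(d_word, d_img) * 0.01        # learned image -> word-space mapping (toy)
img_feat = np.random.randn(d_img)                # CNN feature of a test image
v = W @ img_feat                                 # image projected into word space

pred = max(class_vecs, key=lambda c: class_vecs[c] @ v /
           (np.linalg.norm(class_vecs[c]) * np.linalg.norm(v)))  # nearest class by cosine
```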
87. Reasoning: MAC
Hudson et al. "Compositional attention networks for machine reasoning." arXiv preprint arXiv:1803.03067 (2018).
88. GANs for Image Captioning
Dai et al. Towards Diverse and Natural Image Descriptions via a Conditional GAN. ICCV 2017