The document is a slide deck for a lecture on language and vision. It covers topics like image captioning, visual question answering, cross-modal embeddings, and image generation from text. The slides provide outlines of the topics, descriptions and examples of different models in each area, and discussions of limitations and future directions. They cite numerous papers in the field and include visualizations to illustrate key concepts.
3. Language & Vision
Caption this picture:
“a bed decorated in black and white bedding”
“a large white bed sitting under two framed pictures”
“a bedroom scene with a bed an television on the wall”
4. Language & Vision
“Two children riding a horse in front of their home”
“a group of sheep trailing one another in a line”
13. Image Captioning
Karpathy et al. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015
The image features are only taken into account in the first time step
14. Image Captioning
Limitation: all output predictions are based on the final, static output of the encoder
[Figure: a CNN encodes the image into a single feature vector of dimension D; an LSTM decoder then generates the caption word by word ("A bird flying ... <EOS>")]
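As a rough illustration of this baseline (not the exact architecture from any particular paper), the sketch below uses toy dimensions and shows how the CNN feature vector enters the decoder only once, at initialization; names such as `to_vocab` and the `<start>` token id are assumptions.

```python
# Minimal sketch of a CNN-encoder + LSTM-decoder captioner (illustrative only).
# The image is encoded ONCE into a static feature vector; the decoder never
# looks back at the image after initialization -- the limitation noted above.
import torch
import torch.nn as nn

D, E, V = 512, 256, 10000          # feature dim, embedding dim, vocab size (assumed)

img_feat = torch.randn(1, D)       # stand-in for the CNN output (e.g. a pooled conv feature)
init_h = nn.Linear(D, E)(img_feat) # project the image feature to the initial hidden state
init_c = torch.zeros_like(init_h)

embed = nn.Embedding(V, E)
lstm = nn.LSTMCell(E, E)
to_vocab = nn.Linear(E, V)

h, c = init_h, init_c
word = torch.tensor([0])           # <start> token id (assumed to be 0)
caption = []
for _ in range(20):                # decode up to 20 words, greedily
    h, c = lstm(embed(word), (h, c))
    word = to_vocab(h).argmax(dim=-1)
    caption.append(word.item())
```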
16. Visual Attention for Image Captioning
[Figure: a CNN maps the image (H x W x 3) to spatial features f (L x D); from the initial state h0 the decoder produces attention weights a1 and the predicted word y1; the first context vector c0 is the average of the features, and y0 is the first word (the <start> token)]
17. Visual Attention for Image Captioning
[Figure: the visual features weighted with attention a1 give the next context vector c1; together with y1, the word predicted in the previous timestep, it drives the decoder to h1, which outputs attention a2 and word y2]
18. Visual Attention for Image Captioning
[Figure: the process unrolled over time: each state ht receives the context vector ct and the previous word yt, and outputs the next attention map a(t+1) and word y(t+1)]
19. Visual Attention for Image Captioning
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015
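A minimal numpy sketch of one step of this soft attention, assuming a simple bilinear scoring function and toy dimensions (the paper scores locations with a small MLP instead):

```python
# One step of soft visual attention: score each spatial location against the
# decoder state, softmax the scores, and take the weighted sum of the features.
import numpy as np

L, D, H = 49, 512, 512                  # 7x7 spatial locations, feature dim, hidden dim
f = np.random.randn(L, D)               # spatial CNN features (one D-vector per location)
h = np.random.randn(H)                  # current decoder hidden state
W_f = np.random.randn(D, H) * 0.01      # toy projection used for scoring

scores = f @ W_f @ h                    # one relevance score per location: shape (L,)
a = np.exp(scores - scores.max())
a = a / a.sum()                         # attention weights over the L locations
context = a @ f                         # context vector: attention-weighted sum, shape (D,)
# `context` (together with the previous word) feeds the LSTM at the next step;
# at t = 0 the context is simply the average of f, i.e. uniform attention.
```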
20. Visual Attention for Image Captioning
Some outputs can probably be predicted without looking at the image...
22. Visual Attention for Image Captioning
Can we focus on the image only when necessary?
23. Visual Attention for Image Captioning
“Regular” spatial attention
[Figure: the same unrolled attention decoder as above, attending only over the CNN's spatial feature map]
24. Visual Attention for Image Captioning
Attention with sentinel: the LSTM is modified to output a “non-visual” feature to attend to
[Figure: the same unrolled decoder, now emitting sentinel vectors s0, s1, s2 alongside the hidden states h0, h1, h2]
Lu et al. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. CVPR 2017
25. Visual Attention for Image Captioning
The attention weights indicate when it is more important to look at the image features, and when it is better to rely on the current LSTM state
Lu et al. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. CVPR 2017
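A minimal sketch of the sentinel idea, with invented toy projections: the sentinel is treated as one extra candidate in the attention, and its weight plays the role of the "don't look at the image" gate (the paper's exact gating formulation differs slightly).

```python
# Adaptive attention with a visual sentinel: the sentinel s acts as an extra
# "non-visual" location; its attention weight says how much to ignore the image.
import numpy as np

L, D = 49, 512
f = np.random.randn(L, D)        # visual features
s = np.random.randn(D)           # sentinel vector emitted by the modified LSTM
h = np.random.randn(D)           # current hidden state
W = np.random.randn(D, D) * 0.01 # toy scoring projection

cand = np.vstack([f, s])                 # L visual candidates + 1 sentinel, (L+1, D)
scores = cand @ W @ h
a = np.exp(scores - scores.max()); a /= a.sum()
beta = a[-1]                             # weight on the sentinel ("don't look" gate)
context = a[:-1] @ f + beta * s          # adaptive context vector
# beta close to 1: rely on the language-model state; close to 0: look at the image.
```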
28. Grounded Image Captioning
Lu et al. “Neural Baby Talk.” CVPR 2018
Slot-filling approach: generating a sentence template with empty slots to be filled using the outputs of an object detection model
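A toy illustration of the slot-filling idea (the template, slot names and detections below are invented; Neural Baby Talk generates the template with a neural captioner rather than using a fixed string):

```python
# Slot filling: the captioner emits a template with placeholder slots,
# and an object detector's outputs fill them in.
detections = {"animal": "puppy", "furniture": "couch"}   # detector outputs (assumed)
template = "A <animal> is lying on a <furniture>."        # generated sentence template

caption = template
for slot, obj in detections.items():
    caption = caption.replace(f"<{slot}>", obj)
print(caption)   # -> "A puppy is lying on a couch."
```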
33. Visual Question Answering (VQA)
Kafle & Kanan. Visual Question Answering: Datasets, Algorithms, and Future Challenges. Computer Vision and Image Understanding 2017
34. Visual Question Answering (VQA)
[Figure: the image and the question “What is the mustache made of?” are each encoded (into [z1, z2, … zN] and [y1, y2, … yM]); the two encodings are combined and decoded into the answer “bananas”]
Antol et al. "VQA: Visual question answering." CVPR 2015.
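A rough sketch of this encode / encode / decode pipeline, assuming element-wise multiplication as the fusion step and toy feature sizes (one of several variants explored in the VQA literature, reduced here to a single classifier over a fixed answer set):

```python
# Encode the image, encode the question, fuse, and "decode" by classifying
# over a fixed set of candidate answers. All sizes are illustrative.
import torch
import torch.nn as nn

D_img, D_q, D_joint, n_answers = 4096, 1024, 1024, 3000   # assumed sizes

img_feat = torch.randn(1, D_img)     # stand-in for a CNN encoding of the image
q_feat = torch.randn(1, D_q)         # stand-in for an LSTM encoding of the question

proj_img = nn.Linear(D_img, D_joint)
proj_q = nn.Linear(D_q, D_joint)
classifier = nn.Linear(D_joint, n_answers)

joint = torch.tanh(proj_img(img_feat)) * torch.tanh(proj_q(q_feat))  # fuse modalities
answer_id = classifier(joint).argmax(dim=-1)   # pick the most likely answer
```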
35. Visual Question Answering (VQA)
Anderson et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. CVPR 2018
Region features from Faster R-CNN
36. Visual Question Answering (VQA)
Anderson et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. CVPR 2018
48. Cross-Modal Representations
Salvador et al. “Learning Cross-modal Embeddings for Cooking Recipes and Food Images.” CVPR 2017
Image and text retrieval with joint neural embeddings
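A minimal sketch of retrieval in a joint embedding space, with random vectors standing in for the learned recipe and image encoders; the cosine-similarity ranking is the part that carries over.

```python
# Cross-modal retrieval: both modalities are mapped into one space, and
# retrieval is nearest-neighbour search by cosine similarity.
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

d = 300
recipe_embs = normalize(np.random.randn(1000, d))   # encoded recipes (toy data)
img_emb = normalize(np.random.randn(d))             # encoded query image

scores = recipe_embs @ img_emb                       # cosine similarity (unit vectors)
top5 = np.argsort(-scores)[:5]                       # indices of best-matching recipes
# Training typically pushes matching image/text pairs together and mismatched
# pairs apart (e.g. with a pairwise ranking / contrastive loss).
```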
54. Image Generation from Text
Reed et al. "Generative adversarial text to image synthesis." ICML 2016.
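The core conditioning trick, sketched with an invented toy generator (nothing like the convolutional architectures actually used): the text embedding is concatenated to the noise vector before generation, and the discriminator also sees the text so that mismatched (image, caption) pairs can be penalized.

```python
# Text-conditioned generation: noise z and a text embedding are concatenated
# and mapped to an image. Layers and sizes below are purely illustrative.
import torch
import torch.nn as nn

z_dim, t_dim = 100, 128                       # noise and (compressed) text embedding dims
G = nn.Sequential(nn.Linear(z_dim + t_dim, 256), nn.ReLU(),
                  nn.Linear(256, 64 * 64 * 3), nn.Tanh())   # toy "generator"

text_emb = torch.randn(1, t_dim)              # embedding of the input sentence (assumed)
z = torch.randn(1, z_dim)                     # random noise
fake_img = G(torch.cat([z, text_emb], dim=1)).view(1, 3, 64, 64)
```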
56. Image Generation from Text
Zhang et al. "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks." ICCV 2017
58. Image Generation from Text
Hong et al. Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis. CVPR 2018
65. Captioning: Video
Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning. CVPR 2016. (Slides by Marc Bolaños)
[Figure: a first-layer LSTM runs over the image frames from t = 1 to t = T; its hidden state at t = T summarizes the first chunk of data and is passed to an LSTM unit in the 2nd layer]
66. Lipreading: Watch, Listen, Attend & Spell
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
67. Lipreading: Watch, Listen, Attend & Spell
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
[Figure: the model takes audio features and image features as inputs, each processed by its own encoder]
68. Lipreading: Watch, Listen, Attend & Spell
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
Attention over the output states from audio and video is computed at each timestep
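A numpy sketch of that per-timestep dual attention, with toy sizes and an assumed bilinear score; the actual model learns its attention functions jointly with the character decoder.

```python
# At every decoder step, attend separately over the audio encoder states and
# the video encoder states, then combine the two context vectors.
import numpy as np

def attend(states, query, W):
    scores = states @ W @ query
    a = np.exp(scores - scores.max()); a /= a.sum()
    return a @ states                      # context vector for this modality

Ta, Tv, d = 80, 25, 256                     # audio steps, video steps, state dim (toy)
audio_states = np.random.randn(Ta, d)
video_states = np.random.randn(Tv, d)
query = np.random.randn(d)                  # current decoder state
Wa = np.random.randn(d, d) * 0.01
Wv = np.random.randn(d, d) * 0.01

context = np.concatenate([attend(audio_states, query, Wa),
                          attend(video_states, query, Wv)])  # fed to the character decoder
```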
69. Lip Reading: LipNet
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading." 2016.
The input (video frames) and output (sentence) sequences are not aligned
70. Lip Reading: LipNet
Graves et al. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 2006
CTC loss: Connectionist Temporal Classification
● Avoids the need for an alignment between the input and output sequences by letting the network predict an additional blank token “_”
● Before computing the loss, adjacent repeated tokens are merged and then the blank tokens are removed
● “a _ a b _” == “a a _ a b _” == “a a b”
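The collapse rule from the bullets above can be written in a few lines of plain Python (the token strings are just for illustration):

```python
# CTC collapse: merge adjacent repeated tokens first, then drop the blank.
def ctc_collapse(tokens, blank="_"):
    out = []
    prev = None
    for t in tokens:
        if t != prev:              # merge adjacent repeats
            if t != blank:         # then drop blanks
                out.append(t)
        prev = t
    return out

print(ctc_collapse("a _ a b _".split()))     # ['a', 'a', 'b']
print(ctc_collapse("a a _ a b _".split()))   # ['a', 'a', 'b']
```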
72. Captioning (+ Detection): DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." CVPR 2016
74. Captioning: Video
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Description. CVPR 2015.
79. Visual Question Answering (VQA)
Masuda, Issey, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Open-Ended Visual Question-Answering." ETSETB UPC TelecomBCN (2016).
[Figure: the Image and the Question are encoded and combined to produce the Answer]
80. Visual Question Answering (VQA)
Francisco Roldán, Issey Masuda, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Visual Question-Answering 2.0." ETSETB UPC TelecomBCN (2017).
82. Visual Dialog (Image Guessing Game)
Das, Abhishek, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, and Dhruv Batra. "Visual Dialog." CVPR 2017
83. Visual Question Answering: Dynamic Memory Networks
Xiong et al. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016
84. Multilingual & Multimodal Embeddings
Gella, Spandana, Rico Sennrich, Frank Keller, and Mirella Lapata. "Image Pivoting for Learning Multilingual Multimodal Representations." arXiv preprint arXiv:1707.07601 (2017).
Janarthanan Rajendran, Mitesh M. Khapra, Sarath Chandar, Balaraman Ravindran. Bridge Correlational Neural Networks for Multilingual Multimodal Representation Learning. NAACL 2016
85. Cross-Modal Embeddings
Frome et al. "DeViSE: A deep visual-semantic embedding model." NIPS 2013
86. Joint Neural Embeddings
Socher et al. Zero-shot learning through cross-modal transfer. NIPS 2013
Zero-shot learning: a class not present in the training set of images can still be predicted (e.g. no images of “cat” in the training set)
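A sketch of how zero-shot prediction can work in such a space, with random stand-ins for the word vectors and for the learned image-to-word-space mapping.

```python
# Zero-shot prediction: map the image into the word-vector space and pick the
# nearest class label by cosine similarity, even for classes with no training images.
import numpy as np

d_img, d_word = 4096, 300
class_vecs = {"dog": np.random.randn(d_word),    # seen class
              "horse": np.random.randn(d_word),  # seen class
              "cat": np.random.randn(d_word)}    # unseen class: no training images

W = np.random.randn(d_word, d_img) * 0.01        # learned image -> word-space mapping (toy)
img_feat = np.random.randn(d_img)                # CNN feature of a test image
v = W @ img_feat                                 # image projected into word space

pred = max(class_vecs, key=lambda c: class_vecs[c] @ v /
           (np.linalg.norm(class_vecs[c]) * np.linalg.norm(v)))  # nearest class by cosine
```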
87. Reasoning: MAC
Hudson et al. "Compositional attention networks for machine reasoning." arXiv preprint arXiv:1803.03067 (2018).
88. GANs for Image Captioning
Dai et al. Towards Diverse and Natural Image Descriptions via a Conditional GAN. ICCV 2017