Deep Language and Vision (DLSL D2L4 2018 UPC Deep Learning for Speech and Language)

[course site]
Xavier Giro-i-Nieto
xavier.giro@upc.edu
Associate Professor
Universitat Politecnica de Catalunya
Technical University of Catalonia
Language and Vision
#DLUPC

2
Acknowledgments
Antonio Bonafonte
Santiago Pascual

3
Acknowledgments
Marta R. Costa-jussà

4
Outline
1. Motivation
2. Image Captioning
3. Visual Question Answering / Reasoning
4. Joint Embeddings

5
Outline
1. Motivation
2. Image and Video Captioning
4. Joint Embeddings

8
Neural Machine Translation
Text (English)
Text (French)

9
Neural Machine Translation
Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015)
Representation or Embedding

10
Representation or Embedding

11
Representation or Embedding
Encoder, Decoder

12
Word Embeddings
Figure: Christopher Olah, Visualizing Representations

13
Outline
1. Motivation
4. Joint Embeddings

14
Representation or Embedding
Encoder, Decoder

15
Captioning: Show & Tell
Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption
generator." CVPR 2015. [video]

16
Captioning: DeepImageSent
(Slides by Marc Bolaños): Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating
image descriptions." CVPR 2015

17
Captioning: DeepImageSent
(Slides by Marc Bolaños): Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for
generating image descriptions." CVPR 2015
The model only takes image features into account in the first hidden state.

Multimodal Machine Translation
Challenge on Multimodal Image Translation: http://www.statmt.org/wmt17/multimodal-task.html#task1

19
Captioning: Show, Attend & Tell
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S.
Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual
Attention." ICML 2015

21
Captioning (+ Detection): DenseCap
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for
dense captioning." CVPR 2016

23
XAVI: “man has short hair”, “man with short hair”
AMAIA: “a woman wearing a black shirt”, “
BOTH: “two men wearing black glasses”

24
Captioning: Video
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan,
Kate Saenko, Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and
Description, CVPR 2015. [code]

25
Captioning: Video
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. "Hierarchical
Recurrent Neural Encoder for Video Representation with Application to Captioning." CVPR 2016.
[Figure labels: LSTM unit (2nd layer); time axis from t = 1 to t = T; image; hidden state at t = T; first chunk of data]

26
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild."
CVPR 2017

27
Lipreading: Watch, Listen, Attend & Spell
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the
wild." CVPR 2017
Audio features
Image features

28
Lipreading: Watch, Listen, Attend & Spell
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the
wild." CVPR 2017
Attention over the output states from audio and video is computed at each timestep.
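The per-timestep attention over two modalities can be illustrated with a minimal sketch. It assumes dot-product scoring and simple concatenation of the two context vectors; both are simplifications chosen for illustration and are not taken from the paper. All names and numeric values below are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, states):
    """Return (weights, context) for one decoder timestep.

    Assumption: scores are plain dot products between the decoder
    state (query) and each encoder output state.
    """
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, states))
        for d in range(dim)
    ]
    return weights, context


# Toy encoder outputs (hypothetical): 3 video steps and 2 audio steps.
video_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio_states = [[0.5, 0.5], [0.0, 2.0]]
decoder_state = [0.0, 1.0]  # decoder hidden state at the current timestep

# One attention distribution per modality, recomputed at every timestep.
video_weights, video_context = attend(decoder_state, video_states)
audio_weights, audio_context = attend(decoder_state, audio_states)

# Assumption: the two context vectors are concatenated before prediction.
fused_context = video_context + audio_context
```

In a full decoder this computation would be repeated for each output character, with `decoder_state` updated between steps.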

29
Lip Reading: LipNet
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End
Sentence-level Lipreading." (2016).
Input (video frames) and output (sentence) sequences are not aligned

30
Lip Reading: LipNet
Graves et al. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with
Recurrent Neural Networks. ICML 2006
CTC Loss: Connectionist temporal classification
● Avoids the need for an alignment between input and output sequences by predicting an additional blank token “_”
● Before computing the loss, adjacent repeated tokens are merged and blank tokens are then removed
● “a _ a b _” == “_ a a _ _ a b b” == “a a b”
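The collapsing rule used by CTC can be sketched in a few lines. This is a minimal illustration of the path-to-label mapping only (merge adjacent repeats, then drop blanks); it is not the CTC loss itself, which sums the probability of every frame-level path that collapses to the target. The function name `ctc_collapse` and the toy paths are illustrative.

```python
from itertools import groupby

BLANK = "_"


def ctc_collapse(path, blank=BLANK):
    """Map a frame-level path to its label sequence.

    Step 1: merge runs of identical adjacent tokens.
    Step 2: remove the blank token.
    """
    merged = [token for token, _run in groupby(path)]
    return [token for token in merged if token != blank]


# Two different frame-level paths that collapse to the same labels.
path_a = "a _ a b _".split()
path_b = "_ a a _ _ a b b".split()
labels_a = ctc_collapse(path_a)
labels_b = ctc_collapse(path_b)

# Without a blank between them, adjacent repeats merge into one token.
labels_c = ctc_collapse("_ a a _ _ b b".split())
```

The last case shows why the blank is needed: a genuinely doubled label such as "a a" can only be emitted if a blank separates the two runs.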

31
Lip Reading: LipNet
Assael et al. LipNet: Sentence-level Lipreading. arXiv Nov 2016

32
Outline
1. Neural Machine Translation (no vision here!)
4. Joint Embeddings

33
Visual Question Answering (VQA)
[z1, z2, …, zN] [y1, y2, …, yM]
“Is economic growth decreasing?”
“Yes”
Encode, Encode, Decode

34
Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and
Devi Parikh. "VQA: Visual question answering." ICCV 2015.

35
Extract visual features
Embedding
Merge
Predict answer
Question: What object is flying?
Answer: Kite
Slide credit: Issey Masuda
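The four stages on this slide (extract visual features, embed the question, merge, predict the answer) can be sketched as follows. This is a toy illustration under strong assumptions: the visual features are a hand-written vector standing in for CNN output, the question embedding is a mean of word vectors rather than a recurrent encoder, the merge is a concatenation, and the classifier is a single linear layer with made-up weights. None of these choices is taken from a specific model in the slides.

```python
def embed_question(tokens, word_vectors):
    """Average the vectors of known words (assumed stand-in for an RNN encoder)."""
    dim = len(next(iter(word_vectors.values())))
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


def merge(image_features, question_embedding):
    """Merge both modalities by concatenation (one of several possible choices)."""
    return image_features + question_embedding


def predict_answer(joint, answer_weights):
    """Score each candidate answer with a linear layer and return the best one."""
    scores = {
        answer: sum(w * x for w, x in zip(weights, joint))
        for answer, weights in answer_weights.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy values, chosen only to make the example run.
word_vectors = {"what": [0.1, 0.0], "object": [0.0, 0.2], "flying": [0.9, 0.1]}
image_features = [0.8, 0.1]  # stand-in for features from a pretrained CNN
answer_weights = {
    "kite": [1.0, 0.0, 1.0, 0.0],
    "dog": [0.0, 1.0, 0.0, 1.0],
}

question = embed_question(["what", "object", "is", "flying"], word_vectors)
joint = merge(image_features, question)
answer = predict_answer(joint, answer_weights)
```

Treating answer prediction as classification over a fixed answer vocabulary keeps the output layer simple; generating free-form answers would instead need a decoder.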

36
Masuda, Issey, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Open-Ended Visual
Question-Answering." ETSETB UPC TelecomBCN (2016).
Image
Question
Answer

37
Francisco Roldán, Issey Masuda, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto.
"Visual Question-Answering 2.0." ETSETB UPC TelecomBCN (2017).

38
Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with
dynamic parameter prediction. CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)

39
Visual Question Answering: Dynamic
(Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic
Memory Networks for Visual and Textual Question Answering." ICML 2016

40
Visual Question Answering: Grounded
(Slides and Screencast by Issey Masuda): Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei. "Visual7W: Grounded
Question Answering in Images." CVPR 2016.

41
Visual Dialog (Image Guessing Game)
Das, Abhishek, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra.
"Visual Dialog." CVPR 2017

43
Visual Reasoning
Johnson, Justin, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross
Girshick. "CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning."
CVPR 2017

44
Visual Reasoning
(Slides by Fran Roldan) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Fei-Fei Li, Larry
Zitnick, Ross Girshick, “Inferring and Executing Programs for Visual Reasoning”. ICCV 2017
Program Generator, Execution Engine

45
Outline
1. Neural Machine Translation (no vision here!)
4. Joint Embeddings

46
Joint Neural Embeddings
Representation or Embedding
Encoder
Encoder

47
Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "Devise: A
deep visual-semantic embedding model." NIPS 2013

48
Socher, R., Ganjoo, M., Manning, C. D., & Ng, A., Zero-shot learning through cross-modal transfer.
NIPS 2013 [slides] [code]
Zero-shot learning: a class not present in the training set of images can be predicted
(e.g. no images from “cat” in the training set)
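A minimal sketch of the zero-shot idea: if images are projected into the same space as class-name word vectors, an image can be labelled with the nearest class vector even when that class had no training images. The projection itself is assumed to have been learned on the seen classes; here the projected image vector and all word vectors are hand-picked toy values, and nearest-neighbour search by Euclidean distance is an illustrative choice rather than the paper's full procedure.

```python
import math


def nearest_class(image_embedding, class_vectors):
    """Return the class whose word vector is closest to the projected image."""
    return min(
        class_vectors,
        key=lambda name: math.dist(image_embedding, class_vectors[name]),
    )


# Word vectors for class names (hypothetical 2-D values).
class_vectors = {
    "dog": [1.0, 0.0],
    "truck": [0.0, 1.0],
    "cat": [0.9, 0.3],  # no "cat" images were available during training
}
seen_classes = {"dog", "truck"}

# Assumed output of an image-to-word-space projection for a cat photo.
projected_image = [0.85, 0.35]

predicted = nearest_class(projected_image, class_vectors)
is_zero_shot = predicted not in seen_classes
```

The unseen class is reachable only because its word vector was learned from text, where "cat" appears near semantically related seen classes such as "dog".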

49
Alejandro Woodward, Víctor Campos, Dèlia Fernàndez, Jordi Torres, Xavier Giró-i-Nieto,
Brendan Jou and Shih-Fu Chang (submitted)
Foggy Day

51
Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber,
Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food
Images”. CVPR 2017
Image and text retrieval with joint embeddings.
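Retrieval in a joint embedding space reduces to ranking candidates of one modality by their similarity to a query from the other. The sketch below ranks recipe embeddings against an image embedding by cosine similarity; the embeddings are toy values standing in for the outputs of trained image and text encoders, and the names `cosine` and `retrieve` are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query, candidates, k=2):
    """Return the names of the k candidates most similar to the query."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine(query, candidates[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical recipe (text) embeddings in the joint space.
recipe_embeddings = {
    "pizza": [0.9, 0.1, 0.0],
    "salad": [0.1, 0.9, 0.1],
    "soup": [0.2, 0.3, 0.9],
}

# Hypothetical embedding of a food photo in the same space.
image_embedding = [0.8, 0.2, 0.1]

top_recipes = retrieve(image_embedding, recipe_embeddings, k=2)
```

The same function works in the other direction (recipe query, image candidates), since both modalities share the space.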

52

53
joint embedding
LSTM, Bidirectional LSTM

54
Image to image and text
Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal
Scene Networks." CVPR 2016.

Multilingual & Multimodal Embeddings
Gella, Spandana, Rico Sennrich, Frank Keller, and Mirella Lapata. "Image Pivoting for Learning Multilingual Multimodal Representations." arXiv preprint
arXiv:1707.07601 (2017).
Janarthanan Rajendran, Mitesh M Khapra, Sarath Chandar, Balaraman Ravindran. "Bridge Correlational Neural Networks for Multilingual Multimodal
Representation Learning." NAACL 2016

57
Outline
1. Motivation
2. Image Captioning
4. Joint Embeddings
