One Perceptron to Rule Them All: Language and Vision

One Perceptron to Rule Them All:
Language and Vision
Xavier Giro-i-Nieto
xavier.giro@upc.edu
Associate Professor
Intelligent Data Science and Artiﬁcial
Intelligence Center (IDEAI)
Universitat Politecnica de Catalunya (UPC)
Barcelona Supercomputing Center (BSC)
Deep Learning
for Natural
Language
Processing
San Sebastian
5 July 2019
bit.ly/ixa-dlnlp-2019
xavier.giro@upc.edu
@DocXavi

2
Xavier Giro-i-Nieto
Associate Professor at Universitat Politècnica de Catalunya (UPC)
Kaixo
IDEAI Center for
Intelligent Data Science
& Artiﬁcial Intelligence

3
● 11 faculty members
● 12 Phd students
Research Group & Centers
https://imatge.upc.edu/
https://www.bsc.es/
● National computation center #1
● Supercomputer MareNostrum
● Emerging Technologies for
Artiﬁcial Intelligence Group,
directed by Prof. Jordi Torres.
https://ideai.upc.edu/
● Center funded in 2017
● 60 researchers
IDEAI (Intelligent Data Science and
Artiﬁcial Intelligence)

4
Acknowledgments
Mariona
Carós
Janna
Escur
Benet
Oriol
Amaia
Salvador
Santiago
Pascual
Marta R.
Costa-jussà
Francisco
Roldan
Issey
Masuda
Ionut
Sorodoc
Carina
Silberer
Gemma
Boleda
Antonio
Bonafonte
José A. R.
Fonollosa
IDEAI Center for
Intelligent Data Science
& Artiﬁcial Intelligence

bit.ly/mmm-docxavi
@DocXavi
7
Densely linked slides

8
Outline
1. Encoder-Decoder Architectures
2. Image and Video Encoding
3. Image Captioning & Grounding
4. Image Generation
5. Visual Question Answering / Reasoning
6. Joint Embeddings (+ recipe generation)

13
Encoder
0
1
0
Cat
A Krizhevsky, I Sutskever, GE Hinton “Imagenet classiﬁcation with deep convolutional neural networks” NIPS 2012

14Slide concept: Perronin, F., Tutorial on LSVR @ CVPR’14, Output embedding for LSVR
One-hot Representation
[1,0,0]
[0,1,0]
[0,0,1]

17
Decoder
Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative
adversarial networks." ICLR 2016. #DCGAN
0
1
0
Cat
Fig: Xudong Mao #DCGAN

18
Encoder Decoder
Representation

19
Encoder Decoder
Representation

20
Outline
4. Image Generation

22
Perceptron
Weights and bias are the parameters that deﬁne the behavior. They must be
learned during training.

23
Convolutional Layers for Vision
Fully Connected layer (FC) Convolutional layer (Conv)

24
Pooling Layer
Figure Credit: Ranzatto
Pooling is a downsample operation
along the spatial dimensions (width,
height)
● It reduces progressively the
spatial size of the
representation, so it reduces the
computation greatly.
● Provides invariance to small
local changes

25
Pooling Layer (critics)
"The pooling operation
used in CNNs is a big
mistake and the fact that it
works so well is a disaster."
Geoﬀrey Hinton,
AMA reddit (2015).
Learn more:
Richard Zhang, “Making Convolutional Networks Shift-Invariant Again” (ICML 2019)

26
Convolutional Neural Networks for Vision
LeNet-5: Several convolutional layers, combined with pooling layers, and followed by a
small number of fully connected layers
#LeNet-5 LeCun, Y., Bottou, L., Bengio, Y., & Haﬀner, P. (1998). Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86(11), 2278-2324.

27
ImageNet Challenge
● 1,000 object classes
(categories).
● Images:
○ 1.2 M train
○ 100k test.
Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. "Imagenet: A large-scale hierarchical image
database." CVPR 2019.

28
ImageNet Challenge: 2012
Slide credit:
Rob Fergus (NYU)
-9.8%
Based on SIFT + Fisher Vectors
Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang et al. "Imagenet
large scale visual recognition challenge." International Journal of Computer Vision 115, no. 3 (2015): 211-252. [web]

29
Image Encoding
A Krizhevsky, I Sutskever, GE Hinton “Imagenet classiﬁcation with deep convolutional neural networks” NIPS 2012
Cat
CNN FC

31
Video Encoding
Slide: Víctor Campos (UPC 2018)
CNN CNN CNN...
Combination method
Combination is commonly
implemented as a small NN on
top of a pooling operation
(e.g. max, sum, average).
Drawback: pooling is not
aware of the temporal order!
Ng et al., Beyond short snippets: Deep networks for video classiﬁcation, CVPR 2015

32
Video Encoding
Slide: Víctor Campos (UPC 2018)
Recurrent Neural Networks are
well suited for processing
sequences.
Drawback: RNNs are sequential
and cannot be parallelized.
Donahue et al., Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015
CNN CNN CNN...
RNN RNN RNN...

33
Learn more on visual encoding

35
Image Decoding
CNN
Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative
adversarial networks." ICLR 2016. #DCGAN

36
Encoder Decoder
Representation

37
Image Encoding and Decoding
Noh, Hyeonwoo, Seunghoon Hong, and Bohyung Han. "Learning deconvolution network for semantic segmentation."
ICCV 2015.
“Regular” VGG “Upside down” VGG

38
Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. "Image-to-image translation with conditional adversarial
networks." CVPR 2017.

39
Outline
4. Image Generation

40
Encoder Decoder
Representation

41
#ShowAndTell Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption
generator." CVPR 2015.
Image Captioning

42
Image Captioning
#DeepImageSent Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions."
CVPR 2015 (Slides by Marc Bolaños)

43
Captioning: Show, Attend & Tell
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua
Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015

44
Captioning: Show, Attend & Tell
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua
Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015

45
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense
captioning." CVPR 2016
Dense Captioning

46
XAVI: “man has
short hair”, “man
with short hair”
AMAIA:”a woman
wearing a black
shirt”, “
BOTH: “two men
wearing black
glasses”
Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense
captioning." CVPR 2016
Dense Captioning

Image Captioning for News
Ali Furkan Biten, Lluis Gomez, Marçal Rusiñol, Dimosthenis Karatzas, “Good News, Everyone! Context driven entity-aware
captioning for news images” CVPR 2019.

48
Filtering Social Bias in Neural Models
#Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming
Bias in Captioning Models." ECCV 2018.

49
Captioning: Dataset biases
#Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming
Bias in Captioning Models." ECCV 2018.

50
Jeﬀrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor
Darrel. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. code
Captioning: Video

51
(Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang,Fei Wu,Yueting Zhuang Hierarchical Recurrent Neural
Encoder for Video Representation with Application to Captioning, CVPR 2016.
LSTM unit
(2nd layer)
Time
Image
t = 1 t = T
hidden state
at t = T
first chunk
of data
Captioning: Video

52
Sign Language Translation
Camgoz, Necati Cihan, et al. Neural Sign Language Translation. CVPR 2018.

53
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading."
(2016).

54
Lip Reading
Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level
Lipreading." (2016).

55
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild."
CVPR 2017

56
Lipreading: Watch, Listen, Attend & Spell
Audio
features
Image
features
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017

57
Lipreading: Watch, Listen, Attend & Spell
Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
Attention over output
states from audio and
video is computed at
each timestep

58
Lipreading
Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "Deep Lip Reading: a comparison of models and an online
application." Interspeech 2018.

59
Grounded Captioning from Objects
Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code]

60Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code]
Grounded Captioning from Objects

61Akbari, Hassan, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. "Multi-level Multimodal
Common Semantic Space for Image-Phrase Grounding." CVPR 2019. [code]
Weak grounding w/o supervision

62Akbari, Hassan, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. "Multi-level Multimodal
Common Semantic Space for Image-Phrase Grounding." CVPR 2019. [code]
Grounding with weak supervision

63
Cornia, Marcella, Lorenzo Baraldi, and Rita Cucchiara. "Show, Control and Tell: A Framework for Generating Controllable and
Grounded Captions." CVPR 2019. [code]
Controlled Grounding

64
Outline
4. Image Generation

65
Encoder Decoder
Representation

66
Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial
text to image synthesis." ICML 2016.
Image Generation

67
text to image synthesis." ICML 2016. [code]
Image Synthesis

68
text to image synthesis." ICML 2016. [code]
Image Generation

69
#StackGAN Zhang, Han, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas.
"Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks." ICCV 2017. [code]
Image Synthesis

70
#StackGAN Zhang, Han, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas.
"Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks." ICCV 2017. [code]
Image Synthesis

71Justin Johnson, Agrim Gupta, Li Fei-Fei, “Image Generation from Scene Graphs” CVPR 2018
Image Generation via Scene Graphs

72Justin Johnson, Agrim Gupta, Li Fei-Fei, “Image Generation from Scene Graphs” CVPR 2018
Image Synthesis via Scene Graphs

73
#Text2Scene Tan, Fuwen, Song Feng, and Vicente Ordonez. "Text2Scene: Generating Compositional Scenes From Textual
Descriptions." CVPR 2019 [blog].
Image Generation by Composition

74

75

76
#CRAFT Gupta, Tanmay, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. "Imagine this! scripts to
compositions to videos." ECCV 2018

77
#CRAFT Gupta, Tanmay, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. "Imagine this! scripts to
compositions to videos." ECCV 2018
Video Generation by Composition

78
Outline
4. Image Generation

79
Encoder
Decoder
Representation
Encoder
Representation

80
#Mattnet Yu, Licheng, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. "Mattnet: Modular
attention network for referring expression comprehension." CVPR 2018. [code]
Object from Referring Expressions

81
Khoreva, Anna, Anna Rohrbach, and Bernt Schiele. "Video object segmentation with language referring expressions." ACCV
2018.
Video Object Grounding

82
Encoder
Decoder
Representation
Encoder
Representation

83
Visual Question Answering
Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA:
Visual question answering." CVPR 2015.

84
Visual Question Answering (VQA)
[z1
, z2
, … zN
] [y1
, y2
, … yM
]
“Is economic growth decreasing ?”
“Yes”
Encode
Encode
Decode

85
Extract visual
features
Embedding
Predict answerMerge
Question
What object is flying?
Answer
Kite
Masuda, Issey, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Open-Ended Visual
Question-Answering." ETSETB UPC TelecomBCN (2016).

86
Masuda, Issey, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Open-Ended Visual
Question-Answering." ETSETB UPC TelecomBCN (2016).
Image
Question
Answer

87
Francisco Roldán, Issey Masuda, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Visual
Question-Answering 2.0." ETSETB UPC TelecomBCN (2017).

88
Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with
dynamic parameter prediction. CVPR 2016
Dynamic Parameter Prediction Network (DPPnet)

89
VQA: Dynamic Memory Networks
(Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for
Visual and Textual Question Answering." ICML 2016

90
Grounded VQA
(Slides and Screencast by Issey Masuda): Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei."Visual7W: Grounded
Question Answering in Images." CVPR 2016.

91
Visual Reasoning
#Clevr Johnson, Justin, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick.
"CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning." CVPR 2017

92
Visual Reasoning
(Slides by Fran Roldan) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoﬀman, Fei-Fei Li, Larry
Zitnick, Ross Girshick , “Inferring and Executing Programs for Visual Reasoning”. ICCV 2017
Program Generator Execution Engine

93
Visual Dialog
Das, Abhishek, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. "Visual
Dialog." CVPR 2017 [Project]

95
Hate Speech Detection in Memes
Benet Oriol, Cristian Canton, Xavier Giro-i-Nieto, “Hate Speech Detection in Memes”. UPC TelecomBCN
2019.
Hate Speech Detection

96
Visual Reasoning: Relation Networks
Santoro, Adam, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy
Lillicrap. "A simple neural network module for relational reasoning." NIPS 2017.
Relation Networks concatenate all possible pairs of objects with the an encoded question to later ﬁnd the
answer with a MLP.

97
Multimodal Machine Translation
Challenge on Multimodal Image Translation:
http://www.statmt.org/wmt17/multimodal-task.html#task1

98
Outline
4. Image Generation

99
Encoder Encoder
Representation

100
Joint Representations (Embeddings)
Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeﬀ Dean, and Tomas Mikolov. "Devise: A deep
visual-semantic embedding model." NIPS 2013

101
Zero-shot learning
Socher, R., Ganjoo, M., Manning, C. D., & Ng, A., Zero-shot learning through cross-modal transfer. NIPS 2013 [slides] [code]
No images from “cat” in
the training set...
...but they can still be
recognised as “cats”
thanks to the
representations learned
from text .

102
Multimodal Retrieval
Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks."
CVPR 2016.

103
Multimodal Retrieval
Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks."
CVPR 2016.

104
Image and text retrieval with joint embeddings.
Joint Neural Embeddings
#pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Oﬂi, Ingmar Weber, Antonio
Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017

105

106
joint
embedding
LSTM Bidirectional LSTM

107
● Constrained to database recipes
● Ingredients and Instructions are retrieved as a whole
● Prohibits user manipulation (ingredient replacements)

109
Recipe Generation (not retrieval !)
Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe
Generation from Food Images." CVPR 2019.

110
Title: Edamame corn salad
Ingredients
pepper, corn, onion, edamame, salt, vinegar, cilantro, avocado, oil
Instructions
- In a large bowl, combine edamame, corn, red onion, cilantro,
avocado, and red bell pepper.
- In a small bowl, whisk together olive oil, vinegar, salt, and
pepper.
- Pour dressing over edamame mixture and toss to coat.
- Cover and refrigerate for at least 1 hour before serving.

111
According to human judgment, our proposed system is able to generate better recipes than the previous
retrieval method.

112
Recipe Generation (data as the DL ingredient!)
Title: Spaghetti with spicy tomato sauce
Ingredients:
onion, tomato, chili, salt, noodles, pepper, spaghetti, clove, cumin, water
Instructions:
-In a large pot, combine the tomatoes, onion, garlic, chili powder, cumin, salt,
pepper, water and tomato sauce.
-Bring to a boil, then reduce heat and simmer for about 20 minutes.
-Meanwhile, cook the spaghetti according to package directions.
-Drain and set aside.
-When the spaghetti is done, drain and return to pot.
-Add the sauce and stir to combine.
-Serve with the shredded cheese and a dollop of sour cream.

114
Outline
4. Image Generation

Recommended tool
Pythia for vision and language multimodal AI models by Facebook FAIR.

116
Deep Learning courses @ UPC TelecomBCN:
● MSc course [2017] [2018] [2019]
● BSc course [2018] [2019]
● 1st edition (2016)
● 2nd edition (2017)
● 3rd edition (2018)
● 4th edition (2019)
● 1st edition (2017)
● 2nd edition (2018)
● 3rd edition - NLP (2019)
Next edition: Autumn 2019 Central repo with slides & videos here

117
Deep Learning courses @ UPC TelecomBCN:
Central repo with slides & videos here

118
Multimodal DL with audio+speech
https://telecombcn-dl.github.io/2019-mmm-tutorial/

119
Deep Learning for Professionals @ UPC School
Next edition starts November 2019. Sign up here.

120
Community building
bcn.ai deeplearning.barcelona

121
Eskerrik asko
Victor
Campos
Amaia
Salvador
Amanda
Duarte
Dèlia
Fernández
Eduard
Ramon
Andreu
Girbau
Dani
Fojo
Oscar
Mañas
Santi
Pascual
Xavi
Giró
Miriam
Bellver
Janna
Escur
Carles
Ventura
Paula
Gómez
Benet
Oriol
Mariona
Carós
Jordi
Torres
Ferran
Marqués
bit.ly/ixa-dlnlp-2019
xavier.giro@upc.edu
@DocXavi

Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi. From Recognition to Cognition: Visual Commonsense
Reasoning. CVPR 2019 (oral)
https://visualcommonsense.com/

123
Ma, Chih-Yao, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong.
"Self-Monitoring Navigation Agent via Auxiliary Progress Estimation." ICLR 2019. [code]

124
Visual Question Answering
Gurari, Danna, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. "VizWiz
Grand Challenge: Answering Visual Questions from Blind People." arXiv preprint arXiv:1802.08218 (2018).

125
Reasoning: MAC
Hudson, Drew A., and Christopher D. Manning. "Compositional attention networks for machine reasoning."
ICLR 2018.

126
Navigation with Language and Vision
Fried, Daniel, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate
Saenko, Dan Klein, and Trevor Darrell. "Speaker-Follower Models for Vision-and-Language Navigation." arXiv preprint
arXiv:1806.02724 (2018).

127
Translation
Harwath, David, Galen Chuang, and James Glass. "Vision as an Interlingua: Learning Multilingual Semantic
Embeddings of Untranscribed Speech." arXiv preprint arXiv:1804.03052 (2018).

One Perceptron to Rule Them All: Language and Vision

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to One Perceptron to Rule Them All: Language and Vision

Similar to One Perceptron to Rule Them All: Language and Vision (20)

More from Universitat Politècnica de Catalunya

More from Universitat Politècnica de Catalunya (20)

Recently uploaded

Recently uploaded (20)

One Perceptron to Rule Them All: Language and Vision