
One Perceptron to Rule Them All: Language and Vision


http://ixa2.si.ehu.es/deep_learning_seminar/

Deep neural networks have boosted the convergence of multimedia data analytics into a unified framework shared by practitioners in natural language and vision. Image captioning, visual question answering and multimodal translation are some of the first applications of a new and exciting field that exploits the generalization properties of deep neural representations. This talk provides an overview of how vision and language problems are addressed with deep neural networks, and of the exciting challenges the research community is addressing nowadays.


One Perceptron to Rule Them All: Language and Vision

  1. 1. One Perceptron to Rule Them All: Language and Vision Xavier Giro-i-Nieto xavier.giro@upc.edu Associate Professor Intelligent Data Science and Artificial Intelligence Center (IDEAI) Universitat Politecnica de Catalunya (UPC) Barcelona Supercomputing Center (BSC) Deep Learning for Natural Language Processing San Sebastian 5 July 2019 bit.ly/ixa-dlnlp-2019 xavier.giro@upc.edu @DocXavi
  2. 2. 2 Xavier Giro-i-Nieto Associate Professor at Universitat Politècnica de Catalunya (UPC) Kaixo IDEAI Center for Intelligent Data Science & Artificial Intelligence
  3. 3. 3 ● 11 faculty members ● 12 PhD students Research Group & Centers https://imatge.upc.edu/ https://www.bsc.es/ ● National computation center #1 ● Supercomputer MareNostrum ● Emerging Technologies for Artificial Intelligence Group, directed by Prof. Jordi Torres. https://ideai.upc.edu/ ● Center founded in 2017 ● 60 researchers IDEAI (Intelligent Data Science and Artificial Intelligence)
  4. 4. 4 Acknowledgments Mariona Carós Janna Escur Benet Oriol Amaia Salvador Santiago Pascual Marta R. Costa-jussà Francisco Roldan Issey Masuda Ionut Sorodoc Carina Silberer Gemma Boleda Antonio Bonafonte José A. R. Fonollosa IDEAI Center for Intelligent Data Science & Artificial Intelligence
  5. 5. 5
  6. 6. 6[course site]
  7. 7. bit.ly/mmm-docxavi @DocXavi 7 Densely linked slides
  8. 8. 8 Outline 1. Encoder-Decoder Architectures 2. Image and Video Encoding 3. Image Captioning & Grounding 4. Image Generation 5. Visual Question Answering / Reasoning 6. Joint Embeddings (+ recipe generation)
  9. 9. Text Audio 9 Speech Vision
  10. 10. Text Audio 10 Speech Vision
  11. 11. Text Audio 11 Speech Vision
  12. 12. 12
  13. 13. 13 Encoder 0 1 0 Cat A Krizhevsky, I Sutskever, GE Hinton “Imagenet classification with deep convolutional neural networks” NIPS 2012
  14. 14. 14 Slide concept: Perronnin, F., Tutorial on LSVR @ CVPR'14, Output embedding for LSVR One-hot Representation [1,0,0] [0,1,0] [0,0,1]
  15. 15. 15 Encoder Representation
  16. 16. 16 Encoder Representation
  17. 17. 17 Decoder Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." ICLR 2016. #DCGAN 0 1 0 Cat Fig: Xudong Mao #DCGAN
  18. 18. 18 Encoder Decoder Representation
  19. 19. 19 Encoder Decoder Representation
  20. 20. 20 Outline 1. Encoder-Decoder Architectures 2. Image and Video Encoding 3. Image Captioning & Grounding 4. Image Generation 5. Visual Question Answering / Reasoning 6. Joint Embeddings (+ recipe generation)
  21. 21. 21 Encoder Representation
  22. 22. 22 Perceptron Weights and biases are the parameters that define its behavior; they are learned during training.
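As a minimal illustration (not from the slides; dimensions are arbitrary), a single perceptron in PyTorch: one linear unit whose weights and bias are the learnable parameters, followed by a non-linearity.

```python
import torch
import torch.nn as nn

# One linear unit (weights + bias) followed by a non-linearity.
perceptron = nn.Sequential(
    nn.Linear(in_features=4, out_features=1),  # 4 weights + 1 bias, learned during training
    nn.Sigmoid(),                              # non-linear activation
)

x = torch.randn(1, 4)                 # a single 4-dimensional input
y = perceptron(x)                     # scalar output in (0, 1)
print(list(perceptron.parameters()))  # the learnable parameters
```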
  23. 23. 23 Convolutional Layers for Vision Fully Connected layer (FC) Convolutional layer (Conv)
  24. 24. 24 Pooling Layer Figure credit: Ranzato Pooling is a downsampling operation along the spatial dimensions (width, height). ● It progressively reduces the spatial size of the representation, which greatly reduces computation. ● It provides invariance to small local changes.
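A small sketch (illustrative, not from the slides) contrasting a convolutional layer with a pooling layer in PyTorch; the pooling step halves the spatial dimensions, matching the downsampling role described above.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                 # batch of one 32x32 RGB image

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)  # halves width and height

h = conv(x)      # -> (1, 16, 32, 32): learned feature maps
h = pool(h)      # -> (1, 16, 16, 16): same channels, smaller spatial size
print(h.shape)
```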
  25. 25. 25 Pooling Layer (critics) "The pooling operation used in CNNs is a big mistake and the fact that it works so well is a disaster." Geoffrey Hinton, AMA reddit (2015). Learn more: Richard Zhang, “Making Convolutional Networks Shift-Invariant Again” (ICML 2019)
  26. 26. 26 Convolutional Neural Networks for Vision LeNet-5: Several convolutional layers, combined with pooling layers, and followed by a small number of fully connected layers #LeNet-5 LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
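For concreteness, a hedged PyTorch sketch of a LeNet-5-style network: convolution and pooling blocks followed by a small number of fully connected layers. Filter and layer sizes follow the 1998 paper for 32x32 grayscale inputs; the activation and pooling choices here are simplifications.

```python
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 28x28
    nn.Tanh(),
    nn.AvgPool2d(2),                  # 28x28 -> 14x14
    nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10
    nn.Tanh(),
    nn.AvgPool2d(2),                  # 10x10 -> 5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120),
    nn.Tanh(),
    nn.Linear(120, 84),
    nn.Tanh(),
    nn.Linear(84, 10),                # 10 output classes (e.g. digits)
)
```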
  27. 27. 27 ImageNet Challenge ● 1,000 object classes (categories). ● Images: ○ 1.2 M train ○ 100k test. Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. "Imagenet: A large-scale hierarchical image database." CVPR 2009.
  28. 28. 28 ImageNet Challenge: 2012 Slide credit: Rob Fergus (NYU) -9.8% Based on SIFT + Fisher Vectors Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang et al. "Imagenet large scale visual recognition challenge." International Journal of Computer Vision 115, no. 3 (2015): 211-252. [web]
  29. 29. 29 Image Encoding A Krizhevsky, I Sutskever, GE Hinton “Imagenet classification with deep convolutional neural networks” NIPS 2012 Cat CNN FC
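One common way to obtain such an image encoding (an illustration, not the slides' exact setup: the slide uses AlexNet, this sketch uses torchvision's ResNet-18 for brevity) is to take a pretrained CNN, drop its final classification layer, and keep the penultimate activations as the representation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

cnn = models.resnet18(pretrained=True)
encoder = nn.Sequential(*list(cnn.children())[:-1])  # remove the FC classifier
encoder.eval()

image = torch.randn(1, 3, 224, 224)      # a preprocessed input image
with torch.no_grad():
    feat = encoder(image).flatten(1)     # -> (1, 512) image representation
```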
  30. 30. 30 Encoder Representation
  31. 31. 31 Video Encoding Slide: Víctor Campos (UPC 2018) CNN CNN CNN... Combination method Combination is commonly implemented as a small NN on top of a pooling operation (e.g. max, sum, average). Drawback: pooling is not aware of the temporal order! Ng et al., Beyond short snippets: Deep networks for video classification, CVPR 2015
  32. 32. 32 Video Encoding Slide: Víctor Campos (UPC 2018) Recurrent Neural Networks are well suited for processing sequences. Drawback: RNNs are sequential and cannot be parallelized. Donahue et al., Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015 CNN CNN CNN... RNN RNN RNN...
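A sketch of both video encoders discussed in the two slides above, assuming per-frame CNN features have already been extracted (dimensions are illustrative): an order-agnostic pooling versus an order-aware recurrent encoder.

```python
import torch
import torch.nn as nn

T, D = 16, 512
frame_feats = torch.randn(1, T, D)        # (batch, frames, CNN feature dim)

# Option 1: pooling -- simple and parallel, but ignores temporal order
video_repr_pool = frame_feats.mean(dim=1)              # -> (1, 512)

# Option 2: recurrent encoder -- order-aware, but inherently sequential
rnn = nn.LSTM(input_size=D, hidden_size=256, batch_first=True)
_, (h_n, _) = rnn(frame_feats)
video_repr_rnn = h_n[-1]                                # -> (1, 256)
```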
  33. 33. 33 Learn more on visual encoding
  34. 34. 34 Decoder Representation
  35. 35. 35 Image Decoding CNN Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." ICLR 2016. #DCGAN
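A hedged sketch of a DCGAN-style decoder: transposed convolutions progressively upsample a latent vector into an image. Channel sizes and the 32x32 output resolution below are illustrative, not the exact DCGAN configuration.

```python
import torch
import torch.nn as nn

decoder = nn.Sequential(
    nn.ConvTranspose2d(100, 256, kernel_size=4, stride=1, padding=0),  # 1x1 -> 4x4
    nn.BatchNorm2d(256), nn.ReLU(),
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),  # 4x4 -> 8x8
    nn.BatchNorm2d(128), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),   # 8x8 -> 16x16
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),     # 16x16 -> 32x32
    nn.Tanh(),
)

z = torch.randn(1, 100, 1, 1)   # latent representation
image = decoder(z)              # -> (1, 3, 32, 32)
```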
  36. 36. 36 Encoder Decoder Representation
  37. 37. 37 Image Encoding and Decoding Noh, Hyeonwoo, Seunghoon Hong, and Bohyung Han. "Learning deconvolution network for semantic segmentation." ICCV 2015. “Regular” VGG “Upside down” VGG
  38. 38. 38 Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. "Image-to-image translation with conditional adversarial networks." CVPR 2017.
  39. 39. 39 Outline 1. Encoder-Decoder Architectures 2. Image and Video Encoding 3. Image Captioning & Grounding 4. Image Generation 5. Visual Question Answering / Reasoning 6. Joint Embeddings (+ recipe generation)
  40. 40. 40 Encoder Decoder Representation
  41. 41. 41 #ShowAndTell Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. "Show and tell: A neural image caption generator." CVPR 2015. Image Captioning
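A minimal sketch of the Show-and-Tell idea (dimensions and vocabulary size are illustrative assumptions): a CNN image feature initializes an LSTM decoder that emits the caption word by word, here trained with teacher forcing on ground-truth tokens.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim, img_dim = 10000, 256, 512, 2048

img_proj = nn.Linear(img_dim, embed_dim)             # project CNN feature to word space
word_embed = nn.Embedding(vocab_size, embed_dim)
decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)

img_feat = torch.randn(1, img_dim)                   # from a pretrained CNN encoder
caption = torch.randint(0, vocab_size, (1, 12))      # ground-truth word ids

# Feed the image as the first "word", then the caption tokens (teacher forcing)
inputs = torch.cat([img_proj(img_feat).unsqueeze(1), word_embed(caption)], dim=1)
outputs, _ = decoder(inputs)
logits = to_vocab(outputs)                           # next-word scores at each step
```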
  42. 42. 42 Image Captioning #DeepImageSent Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015 (Slides by Marc Bolaños)
  43. 43. 43 Captioning: Show, Attend & Tell Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015
  44. 44. 44 Captioning: Show, Attend & Tell Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." ICML 2015
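A sketch of the soft-attention step used in Show, Attend and Tell (sizes are illustrative): at each decoding step, the LSTM hidden state scores every spatial region of the convolutional feature map, and the visual context passed to the decoder is their weighted sum.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

regions, feat_dim, hidden_dim = 14 * 14, 512, 512
V = torch.randn(1, regions, feat_dim)       # conv feature map seen as a set of regions
h = torch.randn(1, hidden_dim)              # current LSTM hidden state

att_v = nn.Linear(feat_dim, 256)
att_h = nn.Linear(hidden_dim, 256)
att_score = nn.Linear(256, 1)

scores = att_score(torch.tanh(att_v(V) + att_h(h).unsqueeze(1)))  # (1, regions, 1)
alpha = F.softmax(scores, dim=1)                                   # attention weights
context = (alpha * V).sum(dim=1)            # (1, feat_dim) attended visual context
```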
  45. 45. 45 Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense captioning." CVPR 2016 Dense Captioning
  46. 46. 46 XAVI: "man has short hair", "man with short hair" AMAIA: "a woman wearing a black shirt" BOTH: "two men wearing black glasses" Johnson, Justin, Andrej Karpathy, and Li Fei-Fei. "Densecap: Fully convolutional localization networks for dense captioning." CVPR 2016 Dense Captioning
  47. 47. Image Captioning for News Ali Furkan Biten, Lluis Gomez, Marçal Rusiñol, Dimosthenis Karatzas, “Good News, Everyone! Context driven entity-aware captioning for news images” CVPR 2019.
  48. 48. 48 Filtering Social Bias in Neural Models #Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming Bias in Captioning Models." ECCV 2018.
  49. 49. 49 Captioning: Dataset biases #Equalizer Burns, Kaylee, Lisa Anne Hendricks, Trevor Darrell, and Anna Rohrbach. "Women also Snowboard: Overcoming Bias in Captioning Models." ECCV 2018.
  50. 50. 50 Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, Trevor Darrel. Long-term Recurrent Convolutional Networks for Visual Recognition and Description, CVPR 2015. code Captioning: Video
  51. 51. 51 Captioning: Video (Slides by Marc Bolaños) Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang. Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning, CVPR 2016. [Figure: a second-layer LSTM unit operates over the hidden states of frame chunks from t = 1 to t = T.]
  52. 52. 52 Sign Language Translation Camgoz, Necati Cihan, et al. Neural Sign Language Translation. CVPR 2018.
  53. 53. 53 Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading." (2016).
  54. 54. 54 Lip Reading Assael, Yannis M., Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. "LipNet: End-to-End Sentence-level Lipreading." (2016).
  55. 55. 55 Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
  56. 56. 56 Lipreading: Watch, Listen, Attend & Spell Audio features Image features Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017
  57. 57. 57 Lipreading: Watch, Listen, Attend & Spell Chung, Joon Son, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. "Lip reading sentences in the wild." CVPR 2017 Attention over output states from audio and video is computed at each timestep
  58. 58. 58 Lipreading Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "Deep Lip Reading: a comparison of models and an online application." Interspeech 2018.
  59. 59. 59 Grounded Captioning from Objects Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code]
  60. 60. 60Lu, Jiasen and Yang, Jianwei and Batra, Dhruv and Parikh, Devi “Neural Baby Talk” CVPR 2018 [code] Grounded Captioning from Objects
  61. 61. 61Akbari, Hassan, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. "Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding." CVPR 2019. [code] Weak grounding w/o supervision
  62. 62. 62Akbari, Hassan, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang. "Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding." CVPR 2019. [code] Grounding with weak supervision
  63. 63. 63 Cornia, Marcella, Lorenzo Baraldi, and Rita Cucchiara. "Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions." CVPR 2019. [code] Controlled Grounding
  64. 64. 64 Outline 1. Encoder-Decoder Architectures 2. Image and Video Encoding 3. Image Captioning & Grounding 4. Image Generation 5. Visual Question Answering / Reasoning 6. Joint Embeddings (+ recipe generation)
  65. 65. 65 Encoder Decoder Representation
  66. 66. 66 Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial text to image synthesis." ICML 2016. Image Generation
  67. 67. 67 Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial text to image synthesis." ICML 2016. [code] Image Synthesis
  68. 68. 68 Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial text to image synthesis." ICML 2016. [code] Image Generation
  69. 69. 69 #StackGAN Zhang, Han, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. "Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks." ICCV 2017. [code] Image Synthesis
  70. 70. 70 #StackGAN Zhang, Han, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris Metaxas. "Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks." ICCV 2017. [code] Image Synthesis
  71. 71. 71Justin Johnson, Agrim Gupta, Li Fei-Fei, “Image Generation from Scene Graphs” CVPR 2018 Image Generation via Scene Graphs
  72. 72. 72Justin Johnson, Agrim Gupta, Li Fei-Fei, “Image Generation from Scene Graphs” CVPR 2018 Image Synthesis via Scene Graphs
  73. 73. 73 #Text2Scene Tan, Fuwen, Song Feng, and Vicente Ordonez. "Text2Scene: Generating Compositional Scenes From Textual Descriptions." CVPR 2019 [blog]. Image Generation by Composition
  74. 74. 74 #Text2Scene Tan, Fuwen, Song Feng, and Vicente Ordonez. "Text2Scene: Generating Compositional Scenes From Textual Descriptions." CVPR 2019 [blog].
  75. 75. 75 #Text2Scene Tan, Fuwen, Song Feng, and Vicente Ordonez. "Text2Scene: Generating Compositional Scenes From Textual Descriptions." CVPR 2019 [blog].
  76. 76. 76 #CRAFT Gupta, Tanmay, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. "Imagine this! scripts to compositions to videos." ECCV 2018
  77. 77. 77 #CRAFT Gupta, Tanmay, Dustin Schwenk, Ali Farhadi, Derek Hoiem, and Aniruddha Kembhavi. "Imagine this! scripts to compositions to videos." ECCV 2018 Video Generation by Composition
  78. 78. 78 Outline 1. Encoder-Decoder Architectures 2. Image and Video Encoding 3. Image Captioning & Grounding 4. Image Generation 5. Visual Question Answering / Reasoning 6. Joint Embeddings (+ recipe generation)
  79. 79. 79 Encoder Decoder Representation Encoder Representation
  80. 80. 80 #Mattnet Yu, Licheng, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. "Mattnet: Modular attention network for referring expression comprehension." CVPR 2018. [code] Object from Referring Expressions
  81. 81. 81 Khoreva, Anna, Anna Rohrbach, and Bernt Schiele. "Video object segmentation with language referring expressions." ACCV 2018. Video Object Grounding
  82. 82. 82 Encoder Decoder Representation Encoder Representation
  83. 83. 83 Visual Question Answering Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA: Visual question answering." CVPR 2015.
  84. 84. 84 Visual Question Answering (VQA): two encoders map the image and the question ("Is economic growth decreasing?") into representations [z1, z2, … zN] and [y1, y2, … yM], and a decoder produces the answer ("Yes").
  85. 85. 85 Visual Question Answering (VQA): extract visual features from the image, embed the question ("What object is flying?"), merge both representations, and predict the answer ("Kite"). Masuda, Issey, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Open-Ended Visual Question-Answering." ETSETB UPC TelecomBCN (2016).
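A hedged sketch of this VQA pipeline; the dimensions, answer vocabulary size, and the element-wise-product merge are illustrative assumptions, not the exact model on the slide.

```python
import torch
import torch.nn as nn

img_dim, vocab_size, embed_dim, hidden_dim, n_answers = 2048, 10000, 300, 512, 1000

q_embed = nn.Embedding(vocab_size, embed_dim)
q_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
img_proj = nn.Linear(img_dim, hidden_dim)
classifier = nn.Linear(hidden_dim, n_answers)

img_feat = torch.randn(1, img_dim)                   # CNN image feature
question = torch.randint(0, vocab_size, (1, 8))      # question word ids

_, (h_n, _) = q_rnn(q_embed(question))               # encode the question
merged = img_proj(img_feat) * h_n[-1]                # merge by element-wise product
answer_scores = classifier(merged)                   # scores over candidate answers
```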
  86. 86. 86 Visual Question Answering (VQA) Masuda, Issey, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Open-Ended Visual Question-Answering." ETSETB UPC TelecomBCN (2016). Image Question Answer
  87. 87. 87 Visual Question Answering (VQA) Francisco Roldán, Issey Masuda, Santiago Pascual de la Puente, and Xavier Giro-i-Nieto. "Visual Question-Answering 2.0." ETSETB UPC TelecomBCN (2017).
  88. 88. 88 Noh, H., Seo, P. H., & Han, B. Image question answering using convolutional neural network with dynamic parameter prediction. CVPR 2016 Dynamic Parameter Prediction Network (DPPnet) Visual Question Answering (VQA)
  89. 89. 89 VQA: Dynamic Memory Networks (Slides and Slidecast by Santi Pascual): Xiong, Caiming, Stephen Merity, and Richard Socher. "Dynamic Memory Networks for Visual and Textual Question Answering." ICML 2016
  90. 90. 90 Grounded VQA (Slides and Screencast by Issey Masuda): Zhu, Yuke, Oliver Groth, Michael Bernstein, and Li Fei-Fei."Visual7W: Grounded Question Answering in Images." CVPR 2016.
  91. 91. 91 Visual Reasoning #Clevr Johnson, Justin, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. "CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning." CVPR 2017
  92. 92. 92 Visual Reasoning (Slides by Fran Roldan) Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Fei-Fei Li, Larry Zitnick, Ross Girshick , “Inferring and Executing Programs for Visual Reasoning”. ICCV 2017 Program Generator Execution Engine
  93. 93. 93 Visual Dialog Das, Abhishek, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. "Visual Dialog." CVPR 2017 [Project]
  94. 94. 94 Visual Dialog
  95. 95. 95 Hate Speech Detection in Memes Benet Oriol, Cristian Canton, Xavier Giro-i-Nieto, “Hate Speech Detection in Memes”. UPC TelecomBCN 2019. Hate Speech Detection
  96. 96. 96 Visual Reasoning: Relation Networks Santoro, Adam, David Raposo, David G. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. "A simple neural network module for relational reasoning." NIPS 2017. Relation Networks concatenate all possible pairs of objects with an encoded question and then predict the answer with an MLP.
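A minimal sketch of the Relation Network module (object count, dimensions, and MLP sizes are illustrative): every ordered pair of objects is concatenated with the encoded question, passed through a shared MLP g, summed over pairs, and fed to a final MLP f.

```python
import torch
import torch.nn as nn

n_obj, obj_dim, q_dim, n_answers = 8, 64, 128, 10
objects = torch.randn(1, n_obj, obj_dim)     # set of object representations
question = torch.randn(1, q_dim)             # encoded question

g = nn.Sequential(nn.Linear(2 * obj_dim + q_dim, 256), nn.ReLU(),
                  nn.Linear(256, 256), nn.ReLU())
f = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, n_answers))

pair_sum = 0
for i in range(n_obj):
    for j in range(n_obj):
        pair = torch.cat([objects[:, i], objects[:, j], question], dim=1)
        pair_sum = pair_sum + g(pair)        # shared MLP over every object pair
answer_scores = f(pair_sum)                  # final answer prediction
```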
  97. 97. 97 Multimodal Machine Translation Challenge on Multimodal Image Translation: http://www.statmt.org/wmt17/multimodal-task.html#task1
  98. 98. 98 Outline 1. Encoder-Decoder Architectures 2. Image and Video Encoding 3. Image Captioning & Grounding 4. Image Generation 5. Visual Question Answering / Reasoning 6. Joint Embeddings (+ recipe generation)
  99. 99. 99 Encoder Encoder Representation
  100. 100. 100 Joint Representations (Embeddings) Frome, Andrea, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, and Tomas Mikolov. "Devise: A deep visual-semantic embedding model." NIPS 2013
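A hedged sketch of a DeViSE-style joint embedding: separate image and text projections map into a shared space, trained so that matching pairs score higher than mismatched ones. The loss below is a simple contrastive illustration; DeViSE itself uses a hinge rank loss, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 300
img_proj = nn.Linear(2048, embed_dim)     # on top of CNN image features
txt_proj = nn.Linear(300, embed_dim)      # on top of word/label embeddings

img_feat = torch.randn(4, 2048)           # batch of image features
txt_feat = torch.randn(4, 300)            # matching label/text embeddings

v = F.normalize(img_proj(img_feat), dim=1)
t = F.normalize(txt_proj(txt_feat), dim=1)

sim = v @ t.t()                            # cosine similarities, matches on the diagonal
target = torch.arange(4)
loss = F.cross_entropy(sim / 0.1, target)  # pull matching pairs together in the joint space
```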
  101. 101. 101 Zero-shot learning Socher, R., Ganjoo, M., Manning, C. D., & Ng, A., Zero-shot learning through cross-modal transfer. NIPS 2013 [slides] [code] No images from “cat” in the training set... ...but they can still be recognised as “cats” thanks to the representations learned from text .
  102. 102. 102 Multimodal Retrieval Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks." CVPR 2016.
  103. 103. 103 Multimodal Retrieval Aytar, Yusuf, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. "Cross-Modal Scene Networks." CVPR 2016.
  104. 104. 104 Image and text retrieval with joint embeddings. Joint Neural Embeddings #pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017
  105. 105. 105 #pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017 Joint Neural Embeddings
  106. 106. 106 Joint Neural Embeddings joint embedding LSTM Bidirectional LSTM #pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017
  107. 107. 107 Joint Neural Embeddings ● Constrained to database recipes ● Ingredients and Instructions are retrieved as a whole ● Prohibits user manipulation (ingredient replacements) #pic2recipe Amaia Salvador, Nicholas Haynes, Yusuf Aytar, Javier Marín, Ferda Ofli, Ingmar Weber, Antonio Torralba, “Learning Cross-modal Embeddings for Cooking Recipes and Food Images”. CVPR 2017
  108. 108. 108
  109. 109. 109 Recipe Generation (not retrieval !) Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe Generation from Food Images." CVPR 2019.
  110. 110. 110 Recipe Generation (not retrieval !) Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe Generation from Food Images." CVPR 2019. Title: Edamame corn salad Ingredients pepper, corn, onion, edamame, salt, vinegar, cilantro, avocado, oil Instructions - In a large bowl, combine edamame, corn, red onion, cilantro, avocado, and red bell pepper. - In a small bowl, whisk together olive oil, vinegar, salt, and pepper. - Pour dressing over edamame mixture and toss to coat. - Cover and refrigerate for at least 1 hour before serving.
  111. 111. 111 Recipe Generation (not retrieval !) Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe Generation from Food Images." CVPR 2019. According to human judgment, our proposed system is able to generate better recipes than the previous retrieval method.
  112. 112. 112 Recipe Generation (data as the DL ingredient!) Salvador, Amaia, Michal Drozdzal, Xavier Giro-i-Nieto, and Adriana Romero. "Inverse Cooking: Recipe Generation from Food Images." CVPR 2019. Title: Spaghetti with spicy tomato sauce Ingredients: onion, tomato, chili, salt, noodles, pepper, spaghetti, clove, cumin, water Instructions: -In a large pot, combine the tomatoes, onion, garlic, chili powder, cumin, salt, pepper, water and tomato sauce. -Bring to a boil, then reduce heat and simmer for about 20 minutes. -Meanwhile, cook the spaghetti according to package directions. -Drain and set aside. -When the spaghetti is done, drain and return to pot. -Add the sauce and stir to combine. -Serve with the shredded cheese and a dollop of sour cream.
  113. 113. 113 @JonathanFly
  114. 114. 114 Outline 1. Encoder-Decoder Architectures 2. Image and Video Encoding 3. Image Captioning & Grounding 4. Image Generation 5. Visual Question Answering / Reasoning 6. Joint Embeddings (+ recipe generation)
  115. 115. Recommended tool: Pythia, Facebook FAIR's framework for vision-and-language multimodal AI models.
  116. 116. 116 Deep Learning courses @ UPC TelecomBCN: ● MSc course [2017] [2018] [2019] ● BSc course [2018] [2019] ● 1st edition (2016) ● 2nd edition (2017) ● 3rd edition (2018) ● 4th edition (2019) ● 1st edition (2017) ● 2nd edition (2018) ● 3rd edition - NLP (2019) Next edition: Autumn 2019 Central repo with slides & videos here
  117. 117. 117 Deep Learning courses @ UPC TelecomBCN: Central repo with slides & videos here
  118. 118. 118 Multimodal DL with audio+speech https://telecombcn-dl.github.io/2019-mmm-tutorial/
  119. 119. 119 Deep Learning for Professionals @ UPC School Next edition starts November 2019. Sign up here.
  120. 120. 120 Community building bcn.ai deeplearning.barcelona
  121. 121. 121 Eskerrik asko Victor Campos Amaia Salvador Amanda Duarte Dèlia Fernández Eduard Ramon Andreu Girbau Dani Fojo Oscar Mañas Santi Pascual Xavi Giró Miriam Bellver Janna Escur Carles Ventura Paula Gómez Benet Oriol Mariona Carós Jordi Torres Ferran Marqués bit.ly/ixa-dlnlp-2019 xavier.giro@upc.edu @DocXavi
  122. 122. Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi. From Recognition to Cognition: Visual Commonsense Reasoning. CVPR 2019 (oral) https://visualcommonsense.com/
  123. 123. 123 Ma, Chih-Yao, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong. "Self-Monitoring Navigation Agent via Auxiliary Progress Estimation." ICLR 2019. [code]
  124. 124. 124 Visual Question Answering Gurari, Danna, Qing Li, Abigale J. Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P. Bigham. "VizWiz Grand Challenge: Answering Visual Questions from Blind People." arXiv preprint arXiv:1802.08218 (2018).
  125. 125. 125 Reasoning: MAC Hudson, Drew A., and Christopher D. Manning. "Compositional attention networks for machine reasoning." ICLR 2018.
  126. 126. 126 Navigation with Language and Vision Fried, Daniel, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. "Speaker-Follower Models for Vision-and-Language Navigation." arXiv preprint arXiv:1806.02724 (2018).
  127. 127. 127 Translation Harwath, David, Galen Chuang, and James Glass. "Vision as an Interlingua: Learning Multilingual Semantic Embeddings of Untranscribed Speech." arXiv preprint arXiv:1804.03052 (2018).
