
Multi-modal retrieval and generation with deep distributed models



Guest Lecture at DD2476 Search engines and Information retrieval systems (KTH, Stockholm 26 April 2016)



  1. 1. @graphific Roelof Pieters Multi-modal Retrieval and Generation with Deep Distributed Models 26 April 2016 
 KTH www.csc.kth.se/~roelof/ roelof@kth.se
  2. 2. Creative AI > a “brush” > rapid experimentation human-machine collaboration
  3. 3. Multi-modal retrieval 3
  4. 4. Modalities 4
  5. 5. [Karlgren 2014, NLP Sthlm Meetup]5 Digital Media Deluge: text
  6. 6. [ http://lexicon.gavagai.se/lookup/en/lol ]6 Digital Media Deluge: text lol ? …
  7. 7. [Youtube Blog, 2010]7 Digital Media Deluge: video
  8. 8. [Reelseo, 2015]8 Digital Media Deluge: video
  9. 9. [Reelseo, 2015]9 Digital Media Deluge: audio
  10. 10. [Reelseo, 2015]10 Digital Media Deluge: audio
  11. 11. Challenges 11 • Volume • Velocity • Variety
  12. 12. Can we make it searchable? 12 Language
  13. 13. Language: Compositionality Principle of compositionality: “the meaning (vector) of a complex expression (sentence) is determined by: - the meanings of its constituent expressions (words) and - the rules (grammar) used to combine them” — Gottlob Frege (1848–1925) 13
  14. 14. Word Representation • NLP (rule-based/statistical approaches at least) treats words mainly as atomic symbols: Love Candy Store
 • or in vector space: Candy [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 …] AND Store [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 …] = 0 !
 • also known as “one-hot” representation. • Its problem? Any two distinct words are orthogonal, so their similarity is always zero. 14
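A minimal sketch of why one-hot vectors fail as a similarity measure; the toy vocabulary and indices below are made up for illustration:

```python
import numpy as np

# Toy vocabulary; index positions are arbitrary.
vocab = ["love", "candy", "store"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# "Candy" AND "Store": the dot product of two distinct one-hot
# vectors is always 0, so every pair of words looks maximally
# dissimilar, no matter how related they are.
print(np.dot(one_hot("candy"), one_hot("store")))  # 0.0
print(np.dot(one_hot("candy"), one_hot("candy")))  # 1.0
```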
  15. 15. Word Representation 15
  16. 16. Distributional semantics Distributional meaning as co-occurrence vector: 16
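The co-occurrence idea can be sketched by counting neighbours over a toy corpus; the corpus and window size below are illustrative:

```python
from collections import Counter

# Tiny toy corpus; a symmetric window of +-1 word, purely illustrative.
corpus = "I like deep learning . I like NLP . I enjoy flying .".split()
window = 1

# For each word, count which words appear within the window.
cooc = {w: Counter() for w in set(corpus)}
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            cooc[w][corpus[j]] += 1

# The row for "like" is its distributional meaning vector:
# it co-occurs with "I" twice, with "deep" and "NLP" once each.
print(cooc["like"])
```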
  17. 17. Deep Distributional representations • Taking it further: • Continuous word embeddings • Combine vector space semantics with the prediction of probabilistic models • Words are represented as a dense vector: Candy = 17
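What dense vectors buy us is graded similarity, typically measured by cosine. The 4-d vectors below are made up for illustration; real embeddings are learned and far higher-dimensional:

```python
import numpy as np

# Hypothetical dense embeddings (real ones: 100-1000 dimensions, learned).
candy = np.array([0.8, 0.1, 0.6, -0.2])
store = np.array([0.7, 0.0, 0.5, -0.1])
war   = np.array([-0.6, 0.9, -0.3, 0.4])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Unlike one-hot vectors, dense vectors give graded similarity:
print(cosine(candy, store))  # close to 1: related words
print(cosine(candy, war))    # much lower: unrelated words
```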
  18. 18. • Can theoretically (given enough units) approximate “any” function • and fit to “any” kind of data • Efficient for NLP: hidden layers can be used as word lookup tables • Dense distributed word vectors + efficient NN training algorithms: • Can scale to billions of words ! Neural Networks for NLP 18
  19. 19. Word Embeddings: Socher Vector Space Model adapted from Bengio, “Representation Learning and Deep Learning”, July, 2012, UCLA In a perfect world: 20
  20. 20. Word Embeddings: Socher Vector Space Model adapted from Bengio, “Representation Learning and Deep Learning”, July, 2012, UCLA In a perfect world: the country of my birth the place where I was born 21
  21. 21. Word Embeddings: Socher Vector Space Model Figure (edited) from Bengio, “Representation Learning and Deep Learning”, July, 2012, UCLA In a perfect world: the country of my birth the place where I was born ? … 22
  22. 22. Word Embeddings: Turian (2010) Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning code & info: http://metaoptimize.com/projects/wordreprs/23
  23. 23. Word Embeddings: Turian (2010) Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning code & info: http://metaoptimize.com/projects/wordreprs/ 24
  24. 24. Word Embeddings: Collobert & Weston (2011) Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P. (2011) . Natural Language Processing (almost) from Scratch 25
  25. 25. Multi-embeddings: Stanford (2012) Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng (2012)
 Improving Word Representations via Global Context and Multiple Word Prototypes 26
  26. 26. Linguistic Regularities: Mikolov (2013) code & info: https://code.google.com/p/word2vec/ Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations 27
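The famous king - man + woman ≈ queen regularity can be sketched with hand-crafted toy vectors; real word2vec embeddings are learned, whereas these 2-d vectors are set by hand so the regularity holds:

```python
import numpy as np

# Hand-crafted toy embeddings: dimension 0 ~ "royalty", dimension 1 ~ "male".
emb = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
}

def nearest(v, exclude):
    """Word whose embedding has highest cosine similarity to v."""
    sims = {w: float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))
            for w, u in emb.items() if w not in exclude}
    return max(sims, key=sims.get)

# king - man + woman ~= queen (input words are excluded, as is standard).
v = emb["king"] - emb["man"] + emb["woman"]
print(nearest(v, exclude={"king", "man", "woman"}))  # queen
```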
  27. 27. Word Embeddings for MT: Mikolov (2013) Mikolov, T., Le, Q. V., Sutskever, I. (2013). 
 Exploiting Similarities among Languages for Machine Translation 28
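Mikolov et al.'s translation idea, learning a linear map between the two languages' embedding spaces from a small seed dictionary, can be sketched with synthetic data; the "embeddings" and ground-truth map below are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy source-language embeddings, and a hidden ground-truth linear map
# standing in for the geometric relation between the two languages.
X = rng.normal(size=(50, 4))    # source-language word vectors (seed dict)
W_true = rng.normal(size=(4, 4))
Y = X @ W_true                  # the aligned target-language vectors

# Learn a linear map W minimizing ||XW - Y||^2 (least squares).
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# A new source word would be translated by mapping its vector with W
# and taking the nearest target-language vector; here we just check
# that the map is recovered from the seed pairs.
print(np.allclose(W, W_true))  # True
```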
  28. 28. Word Embeddings for MT: Kiros (2014) 29
  29. 29. Recursive Embeddings for Sentiment: Socher (2013) Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C., Ng, A., Potts, C. (2013) 
 Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. code & demo: http://nlp.stanford.edu/sentiment/index.html 30
  30. 30. Paragraph Vectors: Dai et al. (2014) 31
  31. 31. Paragraph Vectors: Dai et al. (2014) 32
  32. 32. Can we make it searchable? 33 Other modalities
  33. 33. • Image -> vector -> embedding ? ? • Video -> vector -> embedding ? ? • Audio -> vector -> embedding ? ? 34 Other modalities: Embeddings?
  34. 34. •A host of statistical machine learning techniques •Enables the automatic learning of feature hierarchies •Generally based on artificial neural networks Deep Learning?
  35. 35. • Manually designed features are often over-specified, incomplete and take a long time to design and validate • Learned Features are easy to adapt, fast to learn
 • Deep learning provides a very flexible, (almost?) universal, learnable framework for representing world, visual and linguistic information. • Deep learning can learn unsupervised (from raw text/ audio/images/whatever content) and supervised (with specific labels like positive/negative) (as summarised by Richard Socher 2014) Deep Learning?
  36. 36. 37 2006+ : The Deep Learning Conspirators
  37. 37. • Image -> vector -> embedding • Video -> vector -> embedding ? ? • Audio -> vector -> embedding ? ? 39 Image Embeddings
  38. 38. 40 Convolutional Neural Nets for Images classification demo
  39. 39. 41 Convolutional Neural Nets for Images http://ml4a.github.io/dev/demos/demo_convolution.html
  40. 40. 42 Convolutional Neural Nets for Images Zeiler and Fergus 2013, 
 Visualizing and Understanding Convolutional Networks
  41. 41. 43 Convolutional Neural Nets for Images
  42. 42. 44 Convolutional Neural Nets for Images
  43. 43. 45 Deep Nets
  44. 44. 46 Deep Nets
  45. 45. 47 Convolutional Neural Nets: Embeddings? [-0.34, 0.28, …] 4096-dimensional fc7 AlexNet CNN
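Once images are reduced to fc7-style vectors, retrieval is just nearest-neighbour search in embedding space. A sketch with random stand-in vectors; a real index would hold CNN activations from a forward pass:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for fc7 activations: one 4096-d unit vector per indexed image.
index = rng.normal(size=(100, 4096))
index /= np.linalg.norm(index, axis=1, keepdims=True)

# Query: a new image's embedding (here simulated as a slightly
# perturbed copy of indexed image 42).
query = index[42] + 0.01 * rng.normal(size=4096)
query /= np.linalg.norm(query)

# Retrieval = argmax of cosine similarity over the whole index.
best = int(np.argmax(index @ query))
print(best)  # 42
```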
  46. 46. 48 (Karpathy)
  47. 47. 49 Convolutional Neural Nets: Embeddings? http://ml4a.github.io/dev/demos/tsne-viewer.html
  48. 48. • Image -> vector -> embedding ?? • Video -> vector -> embedding • Audio -> vector -> embedding ? ? 50 Video Embeddings
  49. 49. 51 Convolutional Neural Nets for Video 3D Convolutional Neural Networks for Human Action Recognition, Ji et al., 2010
  50. 50. 52 Convolutional Neural Nets for Video Sequential Deep Learning for Human Action Recognition, Baccouche et al., 2011
  51. 51. 53 Convolutional Neural Nets for Video Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014
  52. 52. 54 Convolutional Neural Nets for Video Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014
  53. 53. 55 Convolutional Neural Nets for Video Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014; [Le et al. '11] vs classic 2D convnet:
  54. 54. 56 Convolutional Neural Nets for Video Large-scale Video Classification with Convolutional Neural Networks, Karpathy et al., 2014
  55. 55. 57 Convolutional Neural Nets for Video Sequential Deep Learning for Human Action Recognition, Baccouche et al., 2011
  56. 56. 58 Convolutional Neural Nets for Video Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al., 2015
  57. 57. 59 Convolutional Neural Nets for Video Beyond Short Snippets: Deep Networks for Video Classification, Ng et al., 2015
  58. 58. 60 Convolutional Neural Nets for Video Delving Deeper into Convolutional Networks for Learning Video Representations, Ballas et al., 2016
  59. 59. • Image -> vector -> embedding ?? • Video -> vector -> embedding ?? • Audio -> vector -> embedding 61 Audio Embeddings
  60. 60. 62 Zero-shot Learning [Sander Dieleman, 2014]
  61. 61. 63 Audio Embeddings [Sander Dieleman, 2014]
  62. 62. demo
  63. 63. • Can we take this further? 65 Multi Modal Embeddings?
  64. 64. • unsupervised pre-training (on many images) • in parallel train a neural network (Language) Model • train linear mapping between (image) representations and (word) embeddings, representing the different “classes” 66 Zero-shot Learning
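The recipe above can be sketched end-to-end with toy data: the word vectors, image-feature prototypes and dimensions below are all made up, and plain least squares stands in for the trained linear mapping:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical word embeddings for class labels; "truck" has NO
# training images and is known only through its word vector.
words = {"cat": np.array([1.0, 0.0, 0.0]),
         "dog": np.array([0.0, 1.0, 0.0]),
         "truck": np.array([0.0, 0.0, 1.0])}

# Fake 2-d image features clustered around per-class prototypes.
protos = {"cat": np.array([2.0, 1.0]), "dog": np.array([-1.0, 2.0])}
X, Y = [], []
for label in ("cat", "dog"):          # train WITHOUT any truck images
    for _ in range(50):
        X.append(protos[label] + 0.1 * rng.normal(size=2))
        Y.append(words[label])
X, Y = np.asarray(X), np.asarray(Y)

# Learn the linear map from image features into word-embedding space.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Classify a held-out image by the nearest class word vector.
test_img = protos["cat"] + 0.1 * rng.normal(size=2)
pred = max(words, key=lambda w: float(test_img @ W @ words[w]))
print(pred)  # cat
```

Because classification happens in word space, any label with a word vector is a candidate at test time, which is what makes zero-shot recognition possible.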
  65. 65. DeViSE model (Frome et al. 2013) • skip-gram text model on Wikipedia corpus of 5.7 million documents (5.4 billion words) - approach from (Mikolov et al. ICLR 2013) 67 Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., Ranzato, M.A. (2013) 
 DeViSE: A Deep Visual-Semantic Embedding Model
  66. 66. Encoder: A deep convolutional network (CNN) and long short- term memory recurrent network (LSTM) for learning a joint image-sentence embedding. Decoder: A new neural language model that combines structure and content vectors for generating words one at a time in sequence. Encoder-Decoder pipeline (Kiros et al 2014) 68 Kiros, R., Salakhutdinov, R., Zemel, R. S. (2014) 
 Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
  67. 67. Kiros, R., Salakhutdinov, R., Zemel, R. S. (2014) 
 Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models • matches state-of-the-art performance on Flickr8K and Flickr30K without using object detections • new best results when using the 19-layer Oxford convolutional network. • linear encoders: learned embedding space captures multimodal regularities (e.g. *image of a blue car* - "blue" + "red" is near images of red cars) Encoder-Decoder pipeline (Kiros et al 2014) 69
  68. 68. Image-Text Embeddings 70 Socher et al (2013) Zero Shot Learning Through Cross-Modal Transfer (info)
  69. 69. Image-Captioning • Andrej Karpathy, Li Fei-Fei, 2015. 
 Deep Visual-Semantic Alignments for Generating Image Descriptions (pdf) (info) (code) • Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, 2015. Show and Tell: A Neural Image Caption Generator (arxiv) • Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (arxiv) (info) (code)
  70. 70. “A person riding a motorcycle on a dirt road.”??? Image-Captioning
  71. 71. “Two hockey players are fighting over the puck.”??? Image-Captioning
  72. 72. • Let’s turn it around! • Generative Models • (we won’t cover these in depth, but common architectures): • Autoencoders (AE), variational variants: VAE • Generative Adversarial Nets (GAN) • Variational Recurrent Neural Net (VRNN) 74 Generative Models
  73. 73. Wanna Play ? Text generation (RNN) 75 Karpathy (2015), The Unreasonable Effectiveness of Recurrent Neural Networks (blog)
  74. 74. Wanna Play ? Text generation 76 Karpathy (2015), The Unreasonable Effectiveness of Recurrent Neural Networks (blog)
  75. 75. Karpathy (2015), The Unreasonable Effectiveness of Recurrent Neural Networks (blog)
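The sampling step of such a character-level RNN can be sketched in isolation. The logits below are a stand-in for one step of a trained network (the alphabet is the toy "helo" vocabulary from Karpathy's post); temperature controls how conservative the sampling is:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_char(logits, temperature=1.0):
    """Sample the next character index from RNN output logits.
    Lower temperature -> sharper distribution -> safer text."""
    logits = np.asarray(logits, dtype=float) / temperature
    p = np.exp(logits - logits.max())   # numerically stable softmax
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

# Stand-in logits for one step of a trained char-RNN over "helo":
logits = [0.1, 4.0, 0.2, 0.3]           # strongly prefers char 1 ("e")
counts = np.bincount([sample_char(logits, temperature=0.5)
                      for _ in range(200)], minlength=4)
print(counts.argmax())  # 1
```

In a full generator this step runs in a loop: the sampled character is fed back in as the next input, producing text one character at a time.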
  76. 76. “A stop sign is flying in blue skies.” “A herd of elephants flying in the blue skies.” Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, Ruslan Salakhutdinov, 2015. Generating Images from Captions with Attention (arxiv) (examples) Caption -> Image generation
  77. 77. Turn Convnet Around: “Deep Dream” Image -> NN -> What do you (think) you see 
 -> What’s the (text) label Image -> NN -> What do you (think) you see -> 
 feed back activations -> 
 optimize image to “fit” to the ConvNet’s “hallucination” (iteratively)
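The loop above is gradient ascent on the input rather than the weights. A sketch with a toy differentiable "activation" in place of a ConvNet layer; the target and step size are made up:

```python
import numpy as np

# Deep Dream in miniature: maximize an "activation" a(x) = -||x - t||^2
# by gradient ascent on the INPUT x, keeping the "network" fixed.
target = np.array([3.0, -1.0])        # the unit's toy preferred stimulus

def activation(x):
    return -np.sum((x - target) ** 2)

def grad(x):
    return -2.0 * (x - target)        # d activation / d input

x = np.zeros(2)                        # start from a blank "image"
for _ in range(200):
    x += 0.05 * grad(x)               # ascend: amplify whatever the
                                      # unit already responds to
print(np.round(x, 3))                 # converges toward the target
```

With a real ConvNet the gradient comes from backpropagating a chosen layer's activations to the pixels, which is what produces the "hallucinated" imagery.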
  78. 78. see also: www.csc.kth.se/~roelof/deepdream/ 
 Turn Convnet Around: “Deep Dream”
  79. 79. Turn Convnet Around: “Deep Dream” see also: www.csc.kth.se/~roelof/deepdream/
  80. 80. see also: www.csc.kth.se/~roelof/deepdream/ codeyoutubeRoelof Pieters 2015 Turn Convnet Around: “Deep Dream”
  81. 81. https://www.flickr.com/photos/graphific/albums/72157657250972188 Single Units
  82. 82. Inter-modal: “Style Net” Leon A. Gatys, Alexander S. Ecker, Matthias Bethge, 2015. 
 A Neural Algorithm of Artistic Style (GitXiv)
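The style side of Gatys et al.'s method rests on Gram matrices of conv feature maps: channel co-activations summarize texture while discarding spatial layout. A sketch with random stand-in features:

```python
import numpy as np

rng = np.random.default_rng(4)

def gram(features):
    """Gram matrix of a (channels, positions) feature map:
    channel co-activation statistics, used as a "style" summary."""
    return features @ features.T / features.shape[1]

# Stand-ins for conv feature maps (8 channels, 64 spatial positions).
style = rng.normal(size=(8, 64))
same  = style + 0.01 * rng.normal(size=(8, 64))   # nearly identical style
other = rng.normal(size=(8, 64))                  # a different image

def style_loss(a, b):
    return float(np.mean((gram(a) - gram(b)) ** 2))

# Similar textures -> similar Gram matrices -> small style loss.
print(style_loss(style, same) < style_loss(style, other))  # True
```

In the full algorithm this style loss (from the style image) is combined with a content loss (from the content photo) and minimized over the pixels of the output image.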
  83. 83. 88
  84. 84. 89
  85. 85. 90 + + =
  86. 86. https://github.com/alexjc/neural-doodle Neural Doodle
  87. 87. Gene Kogan, 2015. Why is a Raven Like a Writing Desk? (vimeo)
  88. 88. • Image Analogies, 2001, A. Hertzmann, C. Jacobs, N. Oliver, B. Curless, D. Salesin • A Neural Algorithm of Artistic Style, 2015. Leon A. Gatys, Alexander S. Ecker, Matthias Bethge • Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis, 2016, Chuan Li, Michael Wand • Semantic Style Transfer and Turning Two-Bit Doodles into Fine Artworks, 2016, Alex J. Champandard • Texture Networks: Feed-forward Synthesis of Textures and Stylized Images, 2016, Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, Victor Lempitsky • Perceptual Losses for Real-Time Style Transfer and Super-Resolution, 2016, Justin Johnson, Alexandre Alahi, Li Fei-Fei • Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks, 2016, Chuan Li, Michael Wand • @DeepForger 93 “Style Transfer” papers
  89. 89. • https://soundcloud.com/graphific/neural-music-walk • https://soundcloud.com/graphific/pyotr-lstm- tchaikovsky • https://soundcloud.com/graphific/neural-remix-net 94 Audio Generation A Recurrent Latent Variable Model for Sequential Data, 2016, 
 J. Chung, K. Kastner, L. Dinh, K. Goel, A. Courville, Y. Bengio
  90. 90. Wanna be Doing Deep Learning?
  91. 91. Python has a wide range of deep learning libraries available Deep Learning with Python Low level High level deeplearning.net/software/theano caffe.berkeleyvision.org tensorflow.org/ lasagne.readthedocs.org/en/latest and of course: keras.io
  92. 92. Questions? love letters? existential dilemma’s? academic questions? gifts? 
 find me at:
 www.csc.kth.se/~roelof/ roelof@kth.se Code & Papers? Collaborative Open Computer Science .com @graphific
  93. 93. Questions? love letters? existential dilemma’s? academic questions? gifts? 
 find me at:
 www.csc.kth.se/~roelof/ roelof@kth.se Generative “creative” AI “stuff”? .net @graphific
  94. 94. Creative AI > a “brush” > rapid experimentation human-machine collaboration
  95. 95. Creative AI > a “brush” > rapid experimentation (YouTube, Paper)
  96. 96. Creative AI > a “brush” > rapid experimentation (YouTube, Paper)
  97. 97. Creative AI > a “brush” > rapid experimentation (Vimeo, Paper)
  98. 98. 105 Generative Adversarial Nets Emily Denton, Soumith Chintala, Arthur Szlam, Rob Fergus, 2015. 
 Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks (GitXiv)
  99. 99. 106 Generative Adversarial Nets Alec Radford, Luke Metz, Soumith Chintala, 2015. 
 Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (GitXiv)
  100. 100. 107 Generative Adversarial Nets Alec Radford, Luke Metz, Soumith Chintala, 2015. 
 Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (GitXiv)
  101. 101. 108 Generative Adversarial Nets Alec Radford, Luke Metz, Soumith Chintala, 2015. 
 Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (GitXiv) “turn” vector created from four averaged samples of faces looking left vs looking right.
  102. 102. walking through the manifold Generative Adversarial Nets
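The latent-space walk can be sketched as linear interpolation between two latent codes fed through the generator; here a fixed random linear map stands in for a trained DCGAN generator:

```python
import numpy as np

rng = np.random.default_rng(5)

# Two latent codes and a toy linear "generator" standing in for
# a trained DCGAN generator network.
z0, z1 = rng.normal(size=100), rng.normal(size=100)
G = rng.normal(size=(100, 100)) / 10.0

# Walk the manifold: decode evenly spaced points between z0 and z1.
frames = [G @ ((1 - t) * z0 + t * z1) for t in np.linspace(0, 1, 8)]

# Endpoints match decoding z0 and z1 directly; with a real generator
# the intermediate frames form a smooth morph between the two samples.
print(np.allclose(frames[0], G @ z0), np.allclose(frames[-1], G @ z1))
```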
  103. 103. top: unmodified samples bottom: same samples dropping out “window” filters Generative Adversarial Nets
