
Attention Models (D2L12 Insight@DCU Machine Learning Workshop 2017)

https://telecombcn-dl.github.io/dlmm-2017-dcu/

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of big annotated data and affordable GPU hardware has allowed the training of neural networks for data analysis tasks that until now had been addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection and image captioning.

Attention Models (D2L12 Insight@DCU Machine Learning Workshop 2017)

  1. 1. Amaia Salvador amaia.salvador@upc.edu PhD Candidate Universitat Politècnica de Catalunya DEEP LEARNING WORKSHOP Dublin City University 27-28 April 2017 Attention Models Day 2 Lecture 12
  2. 2. Attention Models: Motivation Image: H x W x 3 bird The whole input volume is used to predict the output... 2
  3. 3. Attention Models: Motivation Image: H x W x 3 bird The whole input volume is used to predict the output... ...despite the fact that not all pixels are equally important 3
  4. 4. Attention Models: Motivation Attention models can relieve computational burden Helpful when processing big images ! 4
  5. 5. Attention Models: Motivation Attention models can relieve computational burden Helpful when processing big images ! 5 bird
  6. 6. Encoder & Decoder for Machine Translation 6 Limitation 1: All the information is encoded in a fixed-size vector, no matter the length of the input sentence.
  7. 7. Encoder & Decoder 7 Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) Limitation 2: All output predictions are based on the final and static recurrent state of the encoder (hT). No matter which output word is being predicted at each time step, all input words are considered equally.
  8. 8. Attention Mechanism 8 Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) The vector to be fed to the RNN at each timestep is a weighted sum of all the annotation vectors.
  9. 9. Attention Mechanism 9 Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) An attention weight (scalar) is predicted at each time-step for each annotation vector hj with a simple fully connected neural network. h1 zi Annotation vector Recurrent state Attention weight (a1 )
  10. 10. Attention Mechanism 10 Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) An attention weight (scalar) is predicted at each time-step for each annotation vector hj with a simple fully connected neural network. h2 zi Annotation vector Recurrent state Attention weight (a2 ) Shared for all j
  11. 11. Attention Mechanism 11 Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) Once a relevance score (weight) is estimated for each word, they are normalized with a softmax function so they sum up to 1.
  12. 12. Attention Mechanism 12 Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) Finally, a context-aware representation ci for the output word at timestep i is defined as the weighted sum of the annotation vectors: ci = Σj aij hj (a code sketch of this attention step follows the transcript).
  13. 13. Attention Mechanism 13 Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) The model automatically finds the correspondence structure between two languages (alignment). (Edge thicknesses represent the attention weights found by the attention model)
  14. 14. Attention Mechanism 14 Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) The model automatically found the correspondence structure between two languages (alignment). (Edge thicknesses represent the attention weights found by the attention model)
  15. 15. Attention Models Attend to different parts of the input to optimize a certain output 15
  16. 16. Attention Models 16 Chan et al. Listen, Attend and Spell. ICASSP 2016 Source: distill.pub Input: Audio features; Output: Text Attend to different parts of the input to optimize a certain output
  17. 17. Attention Models 17 Input: Image; Output: Text A bird flying over a body of water Attend to different parts of the input to optimize a certain output
  18. 18. LSTM Decoder for Image Captioning LSTM LSTM LSTM CNN LSTM A bird flying ... <EOS> The LSTM decoder “sees” the input only at the beginning! Features: D 18 ... Vinyals et al. Show and tell: A neural image caption generator. CVPR 2015 (a code sketch of this decoder appears after the transcript)
  19. 19. Attention for Image Captioning CNN Image: H x W x 3 Features: L x D Slide Credit: CS231n 19
  20. 20. Attention for Image Captioning CNN Image: H x W x 3 Features: L x D h0 a1 Slide Credit: CS231n Attention weights (LxD) 20
  21. 21. Attention for Image Captioning Slide Credit: CS231n CNN Image: H x W x 3 Features: L x D h0 a1 z1 Weighted combination of features y1 h1 First word Attention weights (LxD) a2 y2 Weighted features: D predicted word 21
  22. 22. Attention for Image Captioning CNN Image: H x W x 3 Features: L x D h0 a1 z1 Weighted combination of features y1 h1 First word a2 y2 h2 a3 y3 z2 y2 Weighted features: D predicted word Attention weights (LxD) Slide Credit: CS231n 22 (see the sketch of this captioning loop after the transcript)
  23. 23. Attention for Image Captioning Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015 23
  24. 24. Attention for Image Captioning Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015 24
  25. 25. Attention for Image Captioning Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015 25
  26. 26. Attention for Image Captioning Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015 26 Some outputs can probably be predicted without looking at the image...
  27. 27. Attention for Image Captioning Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015 27 Some outputs can probably be predicted without looking at the image...
  28. 28. Attention for Image Captioning Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015 28 Can we focus on the image only when necessary?
  29. 29. Attention for Image Captioning Lu et al. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. CVPR 2017 29 “Regular” attention vs. attention with a visual sentinel (a sketch of the sentinel gate appears after the transcript)
  30. 30. Attention for Image Captioning Lu et al. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. CVPR 2017 30
  31. 31. Soft Attention Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015 CNN Image: H x W x 3 Grid of features (Each D-dimensional) a b c d pa pb pc pd Distribution over grid locations pa + pb + pc + pd = 1 Soft attention: Summarize ALL locations z = pa a + pb b + pc c + pd d Derivative dz/dp is nice! Train with gradient descent Context vector z (D-dimensional) From RNN: Slide Credit: CS231n 31
  32. 32. Soft Attention Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015 CNN Image: H x W x 3 Grid of features (Each D-dimensional) a b c d pa pb pc pd Distribution over grid locations pa + pb + pc + pd = 1 Soft attention: Summarize ALL locations z = pa a + pb b + pc c + pd d Differentiable function Train with gradient descent Context vector z (D-dimensional) From RNN: Slide Credit: CS231n ● Still uses the whole input! ● Constrained to a fixed grid 32
  33. 33. Hard Attention Input image: H x W x 3 Box Coordinates: (xc, yc, w, h) Cropped and rescaled image: X x Y x 3 Not a differentiable function! Can’t train with backprop :( 33 Hard attention: Sample a subset of the input; needs other optimization strategies, e.g. reinforcement learning (a sketch contrasting soft and hard attention follows the transcript)
  34. 34. Attention for Image Captioning Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015 34
  35. 35. Hard Attention Input image: H x W x 3 Box Coordinates: (xc, yc, w, h) Cropped and rescaled image: X x Y x 3 CNN bird Not a differentiable function ! Can’t train with backprop :( 35
  36. 36. Spatial Transformer Networks Input image: H x W x 3 Box Coordinates: (xc, yc, w, h) Cropped and rescaled image: X x Y x 3 CNN bird Jaderberg et al. Spatial Transformer Networks. NIPS 2015 Not a differentiable function ! Can’t train with backprop :( Make it differentiable Train with backprop :) 36
  37. 37. Spatial Transformer Networks Jaderberg et al. Spatial Transformer Networks. NIPS 2015 Input image: H x W x 3 Box Coordinates: (xc, yc, w, h) Cropped and rescaled image: X x Y x 3 Can we make this function differentiable? Idea: a function mapping pixel coordinates (xt, yt) of the output to pixel coordinates (xs, ys) of the input, with the mapping given by the box coordinates (translation + scale). Repeat for all pixels in the output: the network attends to the input by predicting the box. Slide Credit: CS231n 37 (a code sketch of this differentiable mapping follows the transcript)
  38. 38. Spatial Transformer Networks Jaderberg et al. Spatial Transformer Networks. NIPS 2015 Easy to incorporate in any network, anywhere ! Differentiable module Insert spatial transformers into a classification network and it learns to attend and transform the input 38
  39. 39. Spatial Transformer Networks Jaderberg et al. Spatial Transformer Networks. NIPS 2015 39 Fine-grained classification Also used as an alternative to RoI pooling in proposal-based detection & segmentation pipelines
  40. 40. Semantic Attention: Image Captioning 40You et al. Image Captioning with Semantic Attention. CVPR 2016
  41. 41. Visual Attention: Saliency Detection Kuen et al. Recurrent Attentional Networks for Saliency Detection. CVPR 2016 41
  42. 42. Visual Attention: Fixation Prediction Cornia et al. Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model. 42
  43. 43. Visual Attention: Action Recognition Sharma et al. Action Recognition Using Visual Attention. arXiv 2016 43
  44. 44. Deformable Convolutions Dai et al. Deformable Convolutional Networks. arXiv, Mar 2017 44 Dynamic & learnable receptive field (a code sketch follows the transcript)
  45. 45. 45 Questions?
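The sketches below illustrate, in simplified PyTorch, some of the mechanisms described in the slides above. They are rough approximations under assumed layer sizes, variable names, and token ids, not the authors' implementations.

Slides 8-12: a small fully connected network (shared for all positions j) scores each annotation vector hj against the decoder state zi, the scores are normalized with a softmax, and the context vector ci is the weighted sum of the annotation vectors. A minimal sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Scores every annotation vector h_j against the decoder state z_i."""
    def __init__(self, annot_dim, state_dim, attn_dim=128):
        super().__init__()
        self.proj_h = nn.Linear(annot_dim, attn_dim)   # shared for all positions j
        self.proj_z = nn.Linear(state_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, h, z):
        # h: (batch, T, annot_dim) annotation vectors; z: (batch, state_dim) decoder state
        e = self.score(torch.tanh(self.proj_h(h) + self.proj_z(z).unsqueeze(1)))  # (batch, T, 1)
        a = torch.softmax(e, dim=1)        # attention weights, sum to 1 over source positions
        c = (a * h).sum(dim=1)             # context vector c_i = sum_j a_ij * h_j
        return c, a.squeeze(-1)

h = torch.randn(2, 7, 512)                 # 7 source words, 512-d annotations (illustrative)
z = torch.randn(2, 256)                    # previous decoder state (illustrative)
context, weights = AdditiveAttention(512, 256)(h, z)
print(context.shape, weights.shape)        # (2, 512) and (2, 7)
```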
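Slide 18: in a Show-and-Tell style decoder, the image feature is fed to the LSTM only at the first step; afterwards the LSTM sees only the previously generated word. A minimal sketch, assuming the image feature and the word embeddings have the same size and that token id 0 is a start token:

```python
import torch
import torch.nn as nn

vocab_size, dim = 10000, 512                 # illustrative sizes
embed = nn.Embedding(vocab_size, dim)
lstm = nn.LSTMCell(dim, 512)
to_vocab = nn.Linear(512, vocab_size)

img_feat = torch.randn(1, dim)               # CNN feature of the whole image
state = lstm(img_feat)                       # the image is seen only here, at step 0
word = torch.tensor([0])                     # assumed <START> token id
caption = []
for _ in range(20):                          # greedy decoding, max 20 words
    state = lstm(embed(word), state)
    word = to_vocab(state[0]).argmax(dim=-1)
    caption.append(word.item())
```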
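Slides 19-22: with attention, the decoder re-attends over the L x D grid of CNN features at every step; the weighted features and the previous word drive the LSTM, which then predicts the next word. A simplified sketch (names, sizes, and the scoring layer are assumptions):

```python
import torch
import torch.nn as nn

L, D, hidden, vocab_size = 196, 512, 512, 10000      # e.g. a 14x14 grid of 512-d features
feats = torch.randn(1, L, D)                          # CNN features for one image
attn = nn.Linear(D + hidden, 1)                       # scores one grid location against the state
embed = nn.Embedding(vocab_size, D)
lstm = nn.LSTMCell(2 * D, hidden)                     # input: [weighted features, word embedding]
to_vocab = nn.Linear(hidden, vocab_size)

h, c = torch.zeros(1, hidden), torch.zeros(1, hidden)
word = torch.tensor([0])                              # assumed <START> token id
for _ in range(20):
    scores = attn(torch.cat([feats, h.unsqueeze(1).expand(-1, L, -1)], dim=-1))  # (1, L, 1)
    a = torch.softmax(scores, dim=1)                  # attention weights over the grid
    z = (a * feats).sum(dim=1)                        # weighted combination of features, (1, D)
    h, c = lstm(torch.cat([z, embed(word)], dim=-1), (h, c))
    word = to_vocab(h).argmax(dim=-1)                 # predicted word for this step
```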
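Slide 29: the visual sentinel of Lu et al. adds a gate that decides how much of the context should come from a "sentinel" vector (language-only information kept by the decoder) versus the attended image features, so the model looks at the image only when necessary. A rough sketch; how beta is computed below is a simplification of the paper:

```python
import torch
import torch.nn as nn

D = 512                                      # illustrative feature size
c_t = torch.randn(1, D)                      # attended image context (soft attention output)
s_t = torch.randn(1, D)                      # visual sentinel derived from the LSTM memory
h_t = torch.randn(1, D)                      # current decoder hidden state

gate = nn.Linear(2 * D, 1)
beta = torch.sigmoid(gate(torch.cat([s_t, h_t], dim=-1)))   # gate value in [0, 1]
c_hat = beta * s_t + (1 - beta) * c_t        # blend sentinel and image information
```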
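Slides 31-33: the difference between soft and hard attention, on a toy grid of four feature vectors. Soft attention sums over all locations and is differentiable end to end; hard attention samples one location, which breaks differentiability and calls for other training strategies such as reinforcement learning. The values below are random placeholders:

```python
import torch

D = 4
feats = torch.randn(4, D, requires_grad=True)    # grid locations a, b, c, d
p = torch.softmax(torch.randn(4), dim=0)         # distribution over locations, sums to 1

# Soft attention: weighted sum of ALL locations -> differentiable, trainable with backprop
z_soft = (p.unsqueeze(1) * feats).sum(dim=0)
z_soft.sum().backward()                          # gradients reach every feature location

# Hard attention: sample ONE location -> the sampling step is not differentiable,
# so training needs e.g. REINFORCE-style reinforcement learning instead of plain backprop
idx = torch.multinomial(p, 1)
z_hard = feats[idx].squeeze(0)
```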
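Slides 36-37: the crop-and-rescale becomes differentiable by mapping each output pixel (xt, yt) to an input location (xs, ys) and reading the input there with bilinear interpolation, which is what Spatial Transformer Networks rely on. A sketch using PyTorch's affine grid and sampler; in an STN the transformation parameters would be predicted by a small localisation network rather than hard-coded as here:

```python
import torch
import torch.nn.functional as F

img = torch.randn(1, 3, 128, 128)                 # input image (batch, channels, H, W)
sx, sy, tx, ty = 0.5, 0.5, 0.2, -0.1              # box as scale + translation in normalized coords
theta = torch.tensor([[[sx, 0.0, tx],
                       [0.0, sy, ty]]])           # (1, 2, 3) affine transform

# Map output pixel coordinates (xt, yt) to input coordinates (xs, ys)...
grid = F.affine_grid(theta, size=(1, 3, 64, 64), align_corners=False)
# ...and sample the input there with bilinear interpolation: fully differentiable
crop = F.grid_sample(img, grid, align_corners=False)
print(crop.shape)                                 # torch.Size([1, 3, 64, 64])
```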
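Slide 44: a deformable convolution lets a regular convolution predict a 2D offset for every kernel sampling position, so the receptive field deforms per location and is learned from data. A sketch built on torchvision.ops.deform_conv2d; the sizes and the offset-predicting layer are assumptions:

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

in_ch, out_ch, k = 64, 128, 3
x = torch.randn(1, in_ch, 32, 32)

offset_pred = nn.Conv2d(in_ch, 2 * k * k, kernel_size=k, padding=1)  # (dx, dy) per kernel tap
weight = torch.randn(out_ch, in_ch, k, k)                            # the conv's own filters

offsets = offset_pred(x)                            # (1, 18, 32, 32), learned offsets
y = deform_conv2d(x, offsets, weight, padding=1)    # features sampled at the shifted positions
print(y.shape)                                      # torch.Size([1, 128, 32, 32])
```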
