
Attention Models (D3L6 2017 UPC Deep Learning for Computer Vision)



https://telecombcn-dl.github.io/2017-dlcv/

Deep learning technologies are at the core of the current revolution in artificial intelligence for multimedia data analysis. The convergence of large-scale annotated datasets and affordable GPU hardware has allowed the training of neural networks for data analysis tasks which were previously addressed with hand-crafted features. Architectures such as convolutional neural networks, recurrent neural networks and Q-nets for reinforcement learning have shaped a brand new scenario in signal processing. This course will cover the basic principles and applications of deep learning to computer vision problems, such as image classification, object detection or image captioning.



  1. 1. [course site] Attention Models Day 3 Lecture 6 #DLUPC Amaia Salvador amaia.salvador@upc.edu PhD Candidate Universitat Politècnica de Catalunya
  2. 2. Attention Models: Motivation Image: H x W x 3 bird The whole input volume is used to predict the output... ...despite the fact that not all pixels are equally important 2
  3. 3. Attention Models: Motivation 3 A bird flying over a body of water Attend to different parts of the input to optimize a certain output Case study: Image Captioning
  4. 4. Previously, D3L5: Image Captioning 4 Multimodal Recurrent Neural Network: the image features are only taken into account in the first hidden state. Karpathy and Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." CVPR 2015
  5. 5. LSTM Decoder for Image Captioning LSTMLSTM LSTM CNN LSTM A bird flying ... <EOS> Features: D 5 ... Vinyals et al. Show and tell: A neural image caption generator. CVPR 2015 Limitation: All output predictions are based on the final and static output of the encoder
  6. 6. Attention for Image Captioning CNN Image: H x W x 3 6
  7. 7. Attention for Image Captioning CNN Image: H x W x 3 Features f: L x D h0 7 a1 y1 c0 y0 first context vector is the average Attention weights (LxD) Predicted word First word (<start> token)
  8. 8. Attention for Image Captioning CNN Image: H x W x 3 h0 c1 Visual features weighted with attention give the next context vector y1 h1 a2 y2 8 a1 y1 c0 y0 Predicted word in previous timestep
  9. 9. Attention for Image Captioning CNN Image: H x W x 3 h0 c1 y1 h1 a2 y2 h2 a3 y3 c2 y2 9 a1 y1 c0 y0
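The three slides above walk through one decoding step: the hidden state scores the L spatial features, the softmax-normalized weights build the next context vector, and the context plus the previous word drive the LSTM. Below is a minimal PyTorch sketch of that loop, not the authors' code: the layer names (att_h, att_v, out), the dimensions, and the greedy word choice are illustrative assumptions. As a later slide notes, the weights can be computed from the previous or the current hidden state; here the current one is used.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of attention-based caption decoding (illustrative names and sizes).
L, D, H, V = 196, 512, 512, 10000                 # grid locations, feature dim, LSTM dim, vocab size

feats = torch.randn(1, L, D)                      # CNN features f: L x D (batch of 1)
embed = torch.nn.Embedding(V, D)                  # word embedding
lstm  = torch.nn.LSTMCell(D + D, H)               # input: [word embedding, context vector]
att_h = torch.nn.Linear(H, D)                     # project hidden state for scoring
att_v = torch.nn.Linear(D, 1)                     # one relevance score per location
out   = torch.nn.Linear(H, V)                     # predict the next word

h, c = torch.zeros(1, H), torch.zeros(1, H)
y_prev = torch.tensor([0])                        # <start> token id (assumed to be 0 here)
ctx = feats.mean(dim=1)                           # c0: average of the visual features

for t in range(3):                                # a few decoding steps
    h, c = lstm(torch.cat([embed(y_prev), ctx], dim=1), (h, c))
    scores = att_v(torch.tanh(feats + att_h(h).unsqueeze(1)))  # (1, L, 1)
    a = F.softmax(scores, dim=1)                  # attention weights over the L locations
    ctx = (a * feats).sum(dim=1)                  # next context vector
    y_prev = out(h).argmax(dim=1)                 # greedy choice of the next word
```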
  10. 10. Attention for Image Captioning Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015 10
  11. 11. Attention for Image Captioning Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015 11
  12. 12. Attention for Image Captioning Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015 12 Some outputs can probably be predicted without looking at the image...
  13. 13. Attention for Image Captioning Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015 13 Some outputs can probably be predicted without looking at the image...
  14. 14. Attention for Image Captioning 14 Can we focus on the image only when necessary?
  15. 15. Attention for Image Captioning CNN Image: H x W x 3 h0 c1 y1 h1 a2 y2 h2 a3 y3 c2 y2 15 a1 y1 c0 y0 “Regular” spatial attention
  16. 16. Attention for Image Captioning CNN Image: H x W x 3 c1 y1 a2 y2 a3 y3 c2 y2 16 a1 y1 c0 y0 Attention with sentinel: LSTM is modified to output a “non-visual” feature to attend to s0 h0 s1 h1 s2 h2 Lu et al. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. CVPR 2017
  17. 17. Attention for Image Captioning Lu et al. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. CVPR 2017 17 Attention weights indicate when it’s more important to look at the image features and when it’s better to rely on the current LSTM state. If sum(a[0:LxD]) > a[LxD], the image features are needed for the final decision; otherwise, the RNN state is enough to predict the next word.
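A minimal sketch of the sentinel idea (assuming PyTorch; names and sizes are illustrative, not the paper's code): the "non-visual" sentinel vector emitted by the modified LSTM is appended to the L visual features as one extra candidate, and the softmax decides how much of the context should come from the image versus from the decoder's own state.

```python
import torch
import torch.nn.functional as F

# Sketch of adaptive attention with a visual sentinel (after Lu et al. 2017).
L, D = 196, 512
feats = torch.randn(1, L, D)          # spatial CNN features
s     = torch.randn(1, 1, D)          # sentinel vector emitted by the modified LSTM
h     = torch.randn(1, D)             # current LSTM hidden state

score = torch.nn.Linear(D, 1)
cand  = torch.cat([feats, s], dim=1)                              # L visual candidates + 1 sentinel
a = F.softmax(score(torch.tanh(cand + h.unsqueeze(1))), dim=1)    # (1, L+1, 1)

ctx  = (a * cand).sum(dim=1)          # context mixes visual features and the sentinel
beta = a[:, -1]                       # weight on the sentinel: high -> rely on the RNN state
```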
  18. 18. Soft Attention Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015 CNN Image: H x W x 3 Grid of features (Each D-dimensional) a b c d pa pb pc pd Distribution over grid locations pa + pb + pc + pd = 1 Soft attention: Summarize ALL locations z = pa a + pb b + pc c + pd d Derivative dz/dp is nice! Train with gradient descent Context vector z (D-dimensional) From RNN: Slide Credit: CS231n 18
  19. 19. Soft Attention Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015 CNN Image: H x W x 3 Grid of features (Each D-dimensional) a b c d pa pb pc pd Distribution over grid locations pa + pb + pc + pd = 1 Soft attention: Summarize ALL locations z = pa a + pb b + pc c + pd d Differentiable function Train with gradient descent Context vector z (D-dimensional) From RNN: Slide Credit: CS231n ● Still uses the whole input! ● Constrained to a fixed grid 19
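The weighted sum on these two slides is fully differentiable, which is why soft attention trains with plain gradient descent. A tiny PyTorch check with illustrative values:

```python
import torch

# Soft attention over a 2x2 grid of features (the a, b, c, d of the slide), each D-dimensional.
D = 4
features = torch.randn(4, D)                    # rows: a, b, c, d
scores = torch.randn(4, requires_grad=True)     # unnormalized relevance (would come from the RNN)

p = torch.softmax(scores, dim=0)                # pa + pb + pc + pd = 1
z = (p.unsqueeze(1) * features).sum(dim=0)      # z = pa*a + pb*b + pc*c + pd*d

z.sum().backward()                              # dz/dp exists: trainable with gradient descent
print(scores.grad)                              # gradients reach the attention scores
```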
  20. 20. Hard Attention Input image: H x W x 3 Box Coordinates: (xc, yc, w, h) Cropped and rescaled image: X x Y x 3 Not a differentiable function ! Can’t train with backprop :( 20 Hard attention: Sample a subset of the input Need other optimization strategies e.g.: reinforcement learning
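Because sampling a discrete location has no gradient, hard attention is typically trained with a score-function estimator such as REINFORCE. A hedged PyTorch sketch of that idea (the reward here is a placeholder, not the actual captioning objective):

```python
import torch

# Hard attention: sample ONE grid location instead of taking a weighted sum.
features = torch.randn(4, 8)                     # 4 locations, 8-dim features
scores = torch.randn(4, requires_grad=True)

dist = torch.distributions.Categorical(logits=scores)
idx = dist.sample()                              # non-differentiable sampling step
z = features[idx]                                # hard context vector

reward = z.mean().detach()                       # stand-in for a task reward
loss = -dist.log_prob(idx) * reward              # REINFORCE surrogate loss
loss.backward()                                  # gradients reach the scores via the log-prob
```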
  21. 21. Spatial Transformer Networks Input image: H x W x 3 Box Coordinates: (xc, yc, w, h) Cropped and rescaled image: X x Y x 3 CNN bird Jaderberg et al. Spatial Transformer Networks. NIPS 2015 Not a differentiable function ! Can’t train with backprop :( Make it differentiable Train with backprop :) 21
  22. 22. Spatial Transformer Networks Jaderberg et al. Spatial Transformer Networks. NIPS 2015 Input image: H x W x 3 Cropped and rescaled image: X x Y x 3 Can we make this function differentiable? Idea: Function mapping pixel coordinates (xt, yt) of the output to pixel coordinates (xs, ys) of the input Slide Credit: CS231n Repeat for all pixels in the output Network attends to the input by predicting the mapping given by box coordinates (translation + scale) 22
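In PyTorch this coordinate-mapping-plus-bilinear-sampling is exposed as affine_grid and grid_sample. The sketch below fixes the transform parameters by hand for illustration, whereas an actual spatial transformer predicts them with a small localization network:

```python
import torch
import torch.nn.functional as F

# Differentiable crop-and-rescale: map output pixel coords to input coords with an
# affine transform (translation + scale here), then sample the input bilinearly.
img = torch.randn(1, 3, 224, 224)                # input image as (N, C, H, W)

sx, sy, tx, ty = 0.5, 0.5, 0.2, -0.1             # scale and translation in normalized coords
theta = torch.tensor([[[sx, 0.0, tx],
                       [0.0, sy, ty]]])          # (N, 2, 3) transform parameters

grid = F.affine_grid(theta, size=(1, 3, 64, 64), align_corners=False)  # output coords -> input coords
crop = F.grid_sample(img, grid, align_corners=False)                   # cropped and rescaled patch
# Both ops are differentiable w.r.t. theta, so the whole pipeline trains with backprop.
```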
  23. 23. Spatial Transformer Networks Jaderberg et al. Spatial Transformer Networks. NIPS 2015 Easy to incorporate in any network, anywhere ! Differentiable module Insert spatial transformers into a classification network and it learns to attend and transform the input 23
  24. 24. Spatial Transformer Networks Jaderberg et al. Spatial Transformer Networks. NIPS 2015 24 Fine-grained classification Also used as an alternative to RoI pooling in proposal-based detection & segmentation pipelines
  25. 25. Deformable Convolutions Dai, Qi, Xiong, Li, Zhang et al. Deformable Convolutional Networks. arXiv Mar 2017 25 Dynamic & learnable receptive field
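As a rough, more recent sketch (not one of the implementations listed on the resources slide), torchvision's DeformConv2d takes per-pixel sampling offsets predicted from the input itself, which is what gives the dynamic, learnable receptive field:

```python
import torch
from torchvision.ops import DeformConv2d   # available in recent torchvision releases

# Offsets for each of the 3x3 kernel sampling points are predicted from the input itself.
x = torch.randn(1, 64, 32, 32)

offset_pred = torch.nn.Conv2d(64, 2 * 3 * 3, kernel_size=3, padding=1)   # (dx, dy) per kernel point
deform_conv = DeformConv2d(64, 128, kernel_size=3, padding=1)

offsets = offset_pred(x)            # (1, 18, 32, 32): per-location sampling offsets
y = deform_conv(x, offsets)         # (1, 128, 32, 32): convolution at the shifted locations
```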
  26. 26. Resources 26 Seq2seq implementations with attention: ● Tensorflow ● Pytorch Spatial Transformers ● Tensorflow ● Coming soon to Pytorch (thread here) Deformable Convolutions ● MXNet (Original) ● Tensorflow / Keras (slow) ● [WIP]PyTorch
  27. 27. Questions?
  28. 28. Attention Mechanism 28 Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) The vector to be fed to the RNN at each timestep is a weighted sum of all the annotation vectors.
  29. 29. Attention Mechanism 29 Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) An attention weight (scalar) is predicted at each time-step for each annotation vector hj with a simple fully connected neural network. h1 zi Annotation vector Recurrent state Attention weight (a1 )
  30. 30. Attention Mechanism 30 Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) An attention weight (scalar) is predicted at each time-step for each annotation vector hj with a simple fully connected neural network. h2 zi Annotation vector Recurrent state Attention weight (a2 ) Shared for all j
  31. 31. Attention Mechanism 31 Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) Once a relevance score (weight) is estimated for each word, they are normalized with a softmax function so they sum up to 1.
  32. 32. Attention Mechanism 32 Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) Finally, a context-aware representation ci+1 for the output word at timestep i can be defined as:
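The last four slides describe Bahdanau-style additive attention: one shared scoring network produces a relevance score for each annotation vector hj, the scores are softmax-normalized, and the context ci+1 is their weighted sum. A minimal PyTorch sketch with illustrative dimensions and layer names:

```python
import torch
import torch.nn.functional as F

# Additive (Bahdanau-style) attention for sequence-to-sequence decoding.
T, D_h, D_z = 10, 256, 256
H = torch.randn(1, T, D_h)          # annotation vectors h_1..h_T from the encoder
z = torch.randn(1, D_z)             # decoder recurrent state z_i

W = torch.nn.Linear(D_z, 128)       # shared scoring network: same weights for all j
U = torch.nn.Linear(D_h, 128)
v = torch.nn.Linear(128, 1)

e = v(torch.tanh(W(z).unsqueeze(1) + U(H)))      # relevance score for each h_j: (1, T, 1)
a = F.softmax(e, dim=1)                          # normalized so the weights sum to 1
c = (a * H).sum(dim=1)                           # context c_{i+1}: weighted sum of annotations
```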
  33. 33. Attention Mechanism 33 Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) The model automatically finds the correspondence structure between two languages (alignment). (Edge thicknesses represent the attention weights found by the attention model)
  34. 34. Attention Models Attend to different parts of the input to optimize a certain output 34
  35. 35. Attention Models 35 Chan et al. Listen, Attend and Spell. ICASSP 2016 Source: distill.pub Input: Audio features; Output: Text Attend to different parts of the input to optimize a certain output
  36. 36. Attention for Image Captioning 36 Side-note: attention can be computed with previous or current hidden state CNN Image: H x W x 3 h1 v y1 h2 h3 v y2 a1 y1 v y0average c1 a2 y2 c2 a3 y3 c3
  37. 37. Attention for Image Captioning 37 Attention with sentinel: LSTM is modified to output a “non-visual” feature to attend to CNN Image: H x W x 3 v y1 v y2 a1 y1 v y0average c1 a2 y2 c2 a3 y3 c3 s1 h1 s2 h2 s3 h3
  38. 38. Semantic Attention: Image Captioning 38You et al. Image Captioning with Semantic Attention. CVPR 2016
  39. 39. Visual Attention: Saliency Detection Kuen et al. Recurrent Attentional Networks for Saliency Detection. CVPR 2016 39
  40. 40. Visual Attention: Fixation Prediction Cornia et al. Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model. 40
