
The Importance of Time in Visual Attention Models

https://imatge.upc.edu/web/publications/importance-time-visual-attention-models

Bachelor thesis by Marta Coll Pol, advised by Kevin McGuinness (Dublin City University) and Xavier Giró-i-Nieto (Universitat Politècnica de Catalunya).
Predicting visual attention is a very active field in the computer vision community. Visual attention is a mechanism of the visual system that selects relevant areas within a scene. Saliency prediction models aim to automatically predict which regions are likely to be attended by a human observer. Traditionally, ground-truth saliency maps are built using only the spatial position of the fixation points, the locations where an observer fixates the gaze when viewing a scene. In this work we explore encoding the temporal information as well, and assess it in the application of predicting saliency maps with deep neural networks. It has been observed that the later fixations in a scanpath are usually selected randomly during visualization, especially in images with few regions of interest, so computer vision models have difficulties learning to predict them. In this work, we explore a temporal weighting over the saliency maps to better cope with this random behaviour. The newly proposed saliency representation assigns different weights depending on the position in the sequence of gaze fixations, giving more importance to early timesteps than to later ones. We used these maps to train MLNet, a state-of-the-art model for predicting saliency maps. MLNet's predictions were evaluated and compared to the results obtained when the model was trained using traditional saliency maps. Finally, we show that the temporally weighted saliency maps brought some improvement when used to weight the visual features in an image retrieval task.


  1. The Importance of Time in Visual Attention Models. Author: Marta Coll Pol. Advisors: Kevin McGuinness and Xavier Giró-i-Nieto.
  2. INTRODUCTION
  3. Visual Attention Models ● What is visual attention? ○ A mechanism of the visual system that allows humans to selectively process visual information. ● What is a visual attention model? ○ A predictive model that aims to predict the regions of a scene that attract human attention. Source: https://algorithmia.com/algorithms/deeplearning/SalNet/docs
  4. Saliency Information Representations
  5. Motivation. Consistency in fixation selection between participants as a function of fixation number [1]. [1] Benjamin W. Tatler, Roland J. Baddeley, and Iain D. Gilchrist. Visual correlates of fixation selection: Effects of scale and time. Vision Research, 45(5):643-659, 2005.
  6. Motivation ● Hypothesis I: Visual attention models have difficulties predicting later fixation points. ● Hypothesis II: Giving less weight to later fixations in the ground-truth maps used for training could help improve visual attention models' performance.
  7. DATASETS OF TEMPORALLY SORTED FIXATIONS
  8. Datasets ● iSUN: ○ Data collection: eye tracking [2]. ● SALICON: ○ Data collection: mouse tracking [3]. Saliency ground-truth data for one observer in a given image. [2] Yinda Zhang, Fisher Yu, Shuran Song, Pingmei Xu, Ari Seff, and Jianxiong Xiao. Large-scale scene understanding challenge: Eye tracking saliency estimation. [3] Ming Jiang, Shengsheng Huang, Juanyong Duan, and Qi Zhao. SALICON: Saliency in context. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 1072-1080. IEEE, 2015.
  9. Fixation points in order of visualization. iSUN: ● Locations to fixations: mean-shift clustering. ● Timestamps are used to retrieve the fixation points in order of visualization. Mean-shift applied to the locations recorded for a given image by one observer: colored dots are locations assigned to different clusters; red crosses are the mean of each cluster (its centroid), i.e. the fixation points.
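The locations-to-fixations step for iSUN can be sketched as follows. The bandwidth value and the use of per-cluster mean timestamps for ordering are illustrative assumptions, not the thesis's exact parameters:

```python
import numpy as np
from sklearn.cluster import MeanShift

def locations_to_fixations(locations, timestamps, bandwidth=30.0):
    """Cluster raw gaze locations into fixation points with mean-shift.

    locations:  (N, 2) array of (x, y) gaze samples for one observer.
    timestamps: (N,) array of sample times.
    Returns the cluster centroids (fixation points) sorted by the mean
    timestamp of each cluster, i.e. in order of visualization.
    """
    ms = MeanShift(bandwidth=bandwidth)
    labels = ms.fit_predict(locations)
    centroids = ms.cluster_centers_
    # Order clusters by the average time of their member samples.
    mean_t = np.array([timestamps[labels == k].mean()
                       for k in range(len(centroids))])
    return centroids[np.argsort(mean_t)]
```

Sorting by mean member timestamp recovers the visualization order even when samples from different fixations interleave slightly.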
  10. Fixation points in order of visualization. SALICON: ● Locations to fixations: samples with high mouse-moving velocity are excluded, keeping the fixation points. ● We verified that the fixation points are given in order of visualization in the dataset.
  11. Fixation points in order of visualization: SALICON (figure: fixation points numbered 1-8 in order of visualization).
  12. TEMPORALLY WEIGHTED SALIENCY
  13. Temporally Weighted Saliency Maps (TWSM) ● Weighting as a function of visualization order. ● Weighting function inspired by the consistency graph.
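A minimal sketch of building a TWSM, assuming one Gaussian blob per fixation and an illustrative exponential decay w_i = exp(-(i-1)/tau). The thesis derives its actual weighting function from the inter-observer consistency curve, so both the decay shape and the parameter values here are placeholders:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def temporally_weighted_saliency_map(fixations, shape, sigma=19.0, tau=3.0):
    """Build a temporally weighted saliency map (TWSM).

    fixations: sequence of (x, y) fixation points in visualization order.
    shape:     (height, width) of the output map.
    Earlier fixations receive larger weights than later ones.
    """
    fmap = np.zeros(shape, dtype=np.float64)
    for i, (x, y) in enumerate(fixations, start=1):
        # Illustrative decreasing weight; the thesis fits its own function.
        fmap[int(y), int(x)] += np.exp(-(i - 1) / tau)
    smap = gaussian_filter(fmap, sigma=sigma)  # blur fixations into a map
    if smap.max() > 0:
        smap /= smap.max()  # normalize to [0, 1]
    return smap
```

A traditional (unweighted) saliency map is the special case where every fixation gets weight 1.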
  14. TWSM: choosing a weighting-function parameter. Consistency in fixation selection between participants as a function of fixation number.
  15. Saliency Prediction Metrics ● Area Under ROC Curve (AUC) Judd ● Kullback-Leibler Divergence (KLdiv) ● Pearson correlation coefficient (CC)
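The two distribution-based metrics can be sketched in a few lines of NumPy (AUC Judd additionally needs the discrete fixation locations, so it is omitted here). These are common formulations, not necessarily the exact implementations used in the thesis:

```python
import numpy as np

def kldiv(pred, gt, eps=1e-8):
    """Kullback-Leibler divergence between a predicted and a ground-truth
    saliency map, each normalized to a probability distribution.
    Lower is better; 0 means identical distributions."""
    p = pred / (pred.sum() + eps)
    q = gt / (gt.sum() + eps)
    return float(np.sum(q * np.log(eps + q / (p + eps))))

def cc(pred, gt, eps=1e-8):
    """Pearson correlation coefficient between two saliency maps.
    1 means perfectly correlated, -1 anti-correlated."""
    p = (pred - pred.mean()) / (pred.std() + eps)
    q = (gt - gt.mean()) / (gt.std() + eps)
    return float(np.mean(p * q))
```

Note the asymmetry of KLdiv: it penalizes predictions that assign near-zero probability where the ground truth has mass.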
  16. Model: MLNet. Architecture: CNN + encoding network (produces a temporary saliency map) + prior-learning network = final saliency map. [2] Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. A deep multi-level network for saliency prediction. ICPR 2016.
  17. Model: Different MLNet versions. MLNet: trained by Cornia et al. [2], without temporal weighting (nSM). nMLNet (baseline): trained by us, without temporal weighting (nSM). wMLNet: trained by us, with temporal weighting (wSM). [2] Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. A deep multi-level network for saliency prediction. ICPR 2016.
  18. Model: Replicating MLNet results (nMLNet). SALICON 2015 (validation set), AUC Judd: MLNet 0.886; nMLNet (baseline) 0.814.
  19. SALIENCY PREDICTION: Experiments
  20. Experiment: Do visual attention models have difficulties predicting later fixations? Hypothesis I: Visual attention models have difficulties predicting later fixations. ● Evaluate MLNet's predicted maps for the iSUN training set using two types of ground truth: nSMs and wSMs.
  21. Results: Do visual attention models have difficulties predicting later fixations? Maps that scored notably better when evaluated against wSM. ● Metric: KLdiv.
  22. Results: Do visual attention models have difficulties predicting later fixations?
  23. Results: Do visual attention models have difficulties predicting later fixations? Maps that scored notably better when evaluated against nSM. ● Metric: KLdiv. ● Scores are generally bad for both map types, so we could not draw further conclusions.
  24. Experiment: Study of the effect of Weighted Saliency Maps on a visual attention model's performance. Hypothesis II: Giving less weight to later fixations in the ground-truth maps used for training can improve the model's performance. ● We trained nMLNet and wMLNet on the SALICON dataset and evaluated their performance.
  25. Results: Study of the effect of Weighted Saliency Maps on a visual attention model's performance. AUC Judd (nSM ground truth): nMLNet 0.814, wMLNet 0.816. Kullback-Leibler Divergence (KLdiv): nSM ground truth: nMLNet 1.332, wMLNet 1.039; wSM ground truth: nMLNet 1.493, wMLNet 1.136. Pearson's Correlation Coefficient (CC): nSM ground truth: nMLNet 0.534, wMLNet 0.539; wSM ground truth: nMLNet 0.509, wMLNet 0.517.
  26. VISUAL SEARCH: Experiments
  27. SalBoW ● SalBoW is a retrieval framework that uses saliency maps to weight the contribution of local convolutional representations in the instance search task. ● The model is divided into two parts: ○ CNN + k-means = assignment map (visual vocabulary) ○ Saliency prediction model ● Model's output: after weighting, a bag of words is built to form the final image representation. SalBoW pipeline with saliency weighting [3]. [3] Eva Mohedano, Kevin McGuinness, Xavier Giro-i-Nieto, and Noel E. O'Connor. Saliency weighted convolutional features for instance search. arXiv preprint arXiv:1711.10795, 2017.
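The saliency-weighted bag-of-words aggregation can be sketched as follows, assuming the saliency map has already been resized to the assignment map's resolution. This is an illustrative reading of the SalBoW pipeline, not its reference implementation:

```python
import numpy as np

def saliency_weighted_bow(assignment_map, saliency_map, vocab_size):
    """Build a saliency-weighted bag-of-words image representation.

    assignment_map: (H, W) integer map of visual-word assignments
                    (each local CNN feature quantized with k-means).
    saliency_map:   (H, W) saliency values in [0, 1].
    Each spatial location votes for its visual word with a weight equal
    to its saliency, instead of a plain count.
    """
    bow = np.zeros(vocab_size, dtype=np.float64)
    # Accumulate saliency-weighted votes per visual word.
    np.add.at(bow, assignment_map.ravel(), saliency_map.ravel())
    norm = np.linalg.norm(bow)
    return bow / norm if norm > 0 else bow
```

With a uniform saliency map this reduces to the standard (unweighted) bag-of-words histogram, which makes the two retrieval settings directly comparable.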
  28. Results. Mean Average Precision (mAP): nSM 0.6730; wSM 0.6743.
  29. CONCLUSIONS
  30. Conclusions. Experiment: Do visual attention models have difficulties predicting later fixations?
  31. Conclusions. Experiment: Study of the effect of Weighted Saliency Maps on a visual attention model's performance. AUC Judd (nSM ground truth): nMLNet 0.814, wMLNet 0.816. Kullback-Leibler Divergence (KLdiv): nSM ground truth: nMLNet 1.332, wMLNet 1.039; wSM ground truth: nMLNet 1.493, wMLNet 1.136. Pearson's Correlation Coefficient (CC): nSM ground truth: nMLNet 0.534, wMLNet 0.539; wSM ground truth: nMLNet 0.509, wMLNet 0.517.
  32. Conclusions. Experiment: visual search task. Mean Average Precision (mAP): nSM 0.6730; wSM 0.6743.
  33. Conclusions: Future Work. New evaluation metric: ● Take into account what visual attention models can predict. Ways of improving Temporally Weighted Saliency Maps: ● Weighting as a function of both the order and the time spent on each fixation point. ● Looking for the fixation points that are common to all observers for each specific image.
  34. Project's Code ● The code for this project can be found at: https://github.com/imatge-upc/saliency-2018-timeweight ● The model and the third-party code used in this project are open source.
  35. Any Questions?
  36. Quick Introduction to Deep Learning. Neural networks: ● Forward propagation ● Loss function: a metric to evaluate the model's performance. ● Back propagation ● The goal is to minimize the loss function. Source: http://cs231n.github.io/neural-networks-1/ Three-layer neural network (two hidden layers of four neurons each and a single output) with three inputs.
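The forward pass / loss / backward pass loop named above can be illustrated on a single linear neuron y = w·x trained with gradient descent (a toy example, not part of the thesis):

```python
def train_linear_neuron(x, target, lr=0.1, steps=100):
    """Fit y = w * x to a target value by gradient descent on squared error."""
    w = 0.0
    for _ in range(steps):
        y = w * x                      # forward propagation
        loss = (y - target) ** 2       # loss function (squared error)
        grad = 2.0 * (y - target) * x  # back propagation: dL/dw
        w -= lr * grad                 # update step: minimize the loss
    return w
```

For x = 2 and target = 6 the weight converges to 3, where the loss reaches its minimum of 0; deep networks repeat exactly this loop over millions of weights via the chain rule.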
  37. Convolutional Neural Networks. Convolutional Neural Network (CNN) architectures receive a multi-channel image as input and are usually structured as a set of convolutional layers, each followed by a non-linear operation (usually ReLU) and sometimes by a pooling layer (usually max-pooling). Fully connected layers are usually used at the end of the network. Source: http://www.mdpi.com/1099-4300/19/6/242
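The three building blocks named above (convolution, ReLU, max-pooling) can be sketched in plain NumPy for a single-channel image; real CNNs use multi-channel learned kernels and heavily optimized implementations:

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2-D convolution (cross-correlation) of a single-channel image."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Dot product of the kernel with the local image patch.
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Element-wise non-linearity: keep positives, zero out negatives."""
    return np.maximum(x, 0)

def max_pool(x, size=2):
    """Non-overlapping max-pooling: keep the largest value per window."""
    h2, w2 = x.shape[0] // size, x.shape[1] // size
    return x[:h2 * size, :w2 * size].reshape(h2, size, w2, size).max(axis=(1, 3))
```

Stacking conv2d → relu → max_pool several times, then flattening into fully connected layers, gives the canonical CNN structure the slide describes.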
  38. Convolutional Neural Networks. Source: http://cs231n.github.io/convolutional-networks/
  39. BUDGET
  40. Budget ● The resources used for this thesis were one GPU with 11 GB of GDDR SDRAM (Graphics Double Data Rate Synchronous Dynamic RAM) and about 30 GB of regular RAM. The most similar resources on Amazon Web Services are found in the EC2 p2.xlarge instance, which provides 1 GPU with 12 GB of GDDR SDRAM and 1 CPU with 61 GB of RAM. The cost of this service for the duration of the project amounts to 1647€. ● Open-source software. ● No maintenance costs, since the project is a comparative study.
  41. Results: Do visual attention models have difficulties predicting later fixations? ● Evaluation performed using the KLdiv metric. ● For each image, we computed the distance between the two evaluations' scores as the absolute difference |a-b|.
