Intelligent Multimedia Recommendation

ICME2019 Tutorial: Intelligent Multimedia Recommendation


  1. 1. Towards Automatic Construction of Diverse, High-quality Image Dataset. Jian Zhang, Multimedia and Data Analytics Lab, University of Technology Sydney
  2. 2. Outline ➢ Introduction ➢ Related works ➢ Research challenges ➢ Motivations and solutions ➢ Experimental results ➢ Applications 2
  3. 3. Introduction  Image dataset construction is the process of collecting relevant images for a given/target query.  Significance: 1. Labeled image datasets are the backbone of high-level image understanding tasks; 2. They continuously drive and evaluate progress in feature design and supervised learning models. 3
  4. 4. Related Works  Manual labelling based methods  Active learning based methods  Automatic learning based methods 4
  5. 5. Manual labelling based methods Related works:  LabelMe [Sinclair et al. 1990]  Pascal VOC series [Everingham et al. 2007]  ImageNet [Deng et al. 2009]  CIFAR-10/100 [Krizhevsky et al. 2009] Advantages: High accuracy Disadvantages:  Time consuming and labor intensive  Scalability is limited  Diversity is limited 5
  6. 6. Active learning based methods Related works:  Towards scalable dataset construction: An active learning approach. [Collins et al. 2008]  Large-scale live active learning: Training object detectors with crawled data and crowds. [Grauman et al. 2014] Disadvantages:  Scalability is limited  Diversity is limited Advantages:  The cost of manual labelling is reduced.  Relatively high accuracy. 6
  7. 7. Automatic learning based methods Related works:  Optimol: automatic online picture collection via incremental model learning. [Li et al. 2010]  Harvesting: harvesting image databases from the web. [Schroff et al. 2011]  Prajna: Towards recognizing whatever you want from images without image labeling. [Hua et al. 2015] Disadvantages:  Low accuracy  Limited diversity Advantages: Scalability is no longer a problem 7
  8. 8. Research challenges Our work focuses on automatic learning based methods. The major challenges include:  Visual polysemy  Limited diversity  Low accuracy 8
  9. 9. Research challenge: Visual Polysemy Fig. 1: Visual polysemy. For example, the query “mouse” returns multiple visual senses on the first page of results. The retrieved web images suffer from low precision for any particular visual sense. 9
  10. 10. Research challenge: Limited Diversity Fig. 2: Images for “airplane” from four different datasets. Each dataset has its own preference for image selection. 10
  11. 11. Research challenge: Low Accuracy Fig. 3: Due to indexing errors of the image search engine, even when we retrieve sense-specific images, some instance-level noise may still be included. The noisy images are marked with red bounding boxes. 11
  12. 12. Motivations Fig. 4: The average accuracy of the top 500 images from Google Image Search, Web Search and Flickr Search for 10 queries. 12
  13. 13. Motivations  The first few images returned by an image search engine tend to have relatively high accuracy.  Image search engines restrict the number of returned images for each query.  The diversity of the selected images depends not only on the selection mechanism, but also on the diversity of the initial candidate images. 13
  14. 14. Solutions Traditional solution: Query → Collected candidate images → Pruning noisy images → Selected images. Our solution: Query → Discovering candidate textual metadata → Pruning noisy textual metadata → Collected candidate images → Pruning noisy images → Selected images (see the sketch below). 14
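A minimal sketch contrasting the two pipelines above. The stage functions are passed in as parameters because they stand for real components (search-engine retrieval, metadata discovery, noise pruning) that are not specified in the slides; none of these names come from a released codebase.

```python
# Sketch only: each stage function is a hypothetical placeholder, not a real API.

def traditional_pipeline(query, collect_images, prune_images):
    candidates = collect_images(query)     # retrieve candidate images from the search engine
    return prune_images(candidates)        # selected images

def our_pipeline(query, discover_metadata, prune_metadata, collect_images, prune_images):
    metadata = prune_metadata(discover_metadata(query))    # keep visually salient, relevant terms
    candidates = [img for term in metadata                 # retrieve images per metadata item,
                  for img in collect_images(term)]         # enlarging and diversifying the pool
    return prune_images(candidates)        # diverse, sense-specific selected images
```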
  15. 15. Solutions Noisy textual metadata: Fig. 5: A snapshot of the retrieved images for visually non-salient and less relevant textual metadata. 15
  16. 16. Solutions Our solution: Fig. 6: Illustration of the process for obtaining multiple textual metadata. The input is a textual query that we would like to find multiple textual metadata for. The output is a set of selected textual metadata which will be used for raw image dataset construction. 16
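One way to picture the Fig. 6 step, purely as an illustrative sketch: expand the query into candidate textual metadata, then keep only the visually salient, relevant terms. The expansion source and both filters are assumptions, not the actual method.

```python
# Illustrative sketch of the metadata-selection step; expand_query and the
# two filters are hypothetical placeholders for the components in Fig. 6.

def select_textual_metadata(query, expand_query, is_visually_salient, is_relevant):
    candidates = expand_query(query)   # e.g., "mouse" -> ["computer mouse", "mickey mouse", ...]
    return [term for term in candidates
            if is_visually_salient(term) and is_relevant(term, query)]
```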
  17. 17. Solutions Why:  Solving Visual Polysemy by Allowing Sense-specific Diversity  Improving Scalability of Image Search Engine  Increasing Diversity of the Initial Candidate Images 17
  18. 18. Solutions Fig. 7: Illustration of the process for obtaining the selected images. The input is a set of selected textual metadata. Artificial images, inter-class noisy images, and intra-class noisy images are marked with red, green and blue bounding boxes, respectively. The output is a group of selected images corresponding to the different textual metadata (a rough pruning sketch follows). 18
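As a rough illustration of one such pruning step (an illustrative stand-in, not the algorithm behind Fig. 7), intra-class outliers can be removed by their distance to the class centroid in a deep-feature space; the keep ratio and the choice of feature extractor are assumptions.

```python
# Illustrative stand-in for the noisy-image pruning stage: drop images whose
# deep features lie far from their class centroid. The 0.8 keep ratio and the
# feature extractor are assumptions, not values from the slides.
import numpy as np

def prune_intra_class_outliers(features, keep_ratio=0.8):
    """features: (num_images, dim) array of image descriptors for one metadata item."""
    centroid = features.mean(axis=0)
    dists = np.linalg.norm(features - centroid, axis=1)
    kept = np.argsort(dists)[: int(keep_ratio * len(dists))]  # indices of the closest images
    return kept
```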
  19. 19. Solutions Why:  Improving the Accuracy of the Selected Images  Ensuring the Diversity of the Selected Images 19
  20. 20. Experiments Polysemy:  Multiple Visual Senses Discovering  Classifying Sense-specific Images  Re-ranking Search Results Accuracy:  Image Dataset WSID-100 Introduction  Image Classification Ability Comparison Diversity:  Cross-dataset Generalization Ability Comparison 20
  21. 21. Polysemy: Multiple Visual Senses Discovering Fig. 8: Examples of multiple visual senses discovered by our proposed approach. For example, our approach automatically discovers and distinguishes four senses for “Note”: notes, galaxy note, note tablet and music note. For “Bass”, it discovers multiple visual senses: bass fish, bass guitar, bass amp, Mr./Mrs. Bass, etc. 21
  22. 22. Polysemy: Classifying Sense-specific Images Fig. 9: The detailed performance comparison of classification accuracy over 30 categories on the CMU-Poly-30 dataset 22
  23. 23. Polysemy: Classifying Sense-specific Images Fig. 10: The detailed performance comparison of classification accuracy over 5 categories on the MIT-ISD dataset. Tab. 1: The average performance comparison of classification accuracy on the CMU-Poly-30 and MIT-ISD dataset. 23
  24. 24. Polysemy: Re-ranking Search Results Tab. 2: Web images for polysemous terms were annotated manually. For each term, the number of annotated images, the semantic senses, the visual senses and their distributions are provided, with core semantic senses marked in boldface. 24
  25. 25. Polysemy: Re-ranking Search Results Tab. 3: Area Under Curve (AUC) of all senses for “bass” and “mouse”. 25
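Per-sense AUC (as in Tab. 3) can be computed from binary relevance labels and the re-ranking scores; the arrays below are placeholders, not numbers from the experiments.

```python
# Per-sense Area Under the ROC Curve; labels and scores are placeholder values.
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0]             # 1 = image shows this visual sense (e.g., "bass guitar")
y_score = [0.9, 0.2, 0.7, 0.6, 0.4]  # re-ranking score assigned to each image
print(roc_auc_score(y_true, y_score))  # AUC for this sense
```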
  26. 26. Accuracy: Image Dataset WSID-100 Introduction WSID-100: Web-supervised Image Dataset 100 Categories http://www.multimediauts.org/dataset/WSID-100.html 26
  27. 27. Accuracy: Image Classification Ability Comparison Fig. 11: The image classification accuracy (%) comparison over 14 categories on the PASCAL VOC 2007 dataset. Fig. 12: The image classification accuracy (%) comparison over 6 categories on the PASCAL VOC 2007 dataset. Tab. 4: The average accuracy (%) comparison over 14 and 6 common categories on the PASCAL VOC 2007 dataset. 27
  28. 28. Diversity: Cross-dataset Generalization Ability Comparison Fig. 13: The cross-dataset generalization ability of various datasets, using a varying number of training images and tested on (a) ImageNet, (b) Optimol, (c) Harvesting, (d) DRID-20, (e) Ours, (f) Average. 28
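The protocol behind Fig. 13 trains on one dataset and evaluates on another; a minimal sketch, assuming a linear classifier on pre-extracted features (the classifier choice and feature setup are assumptions, not the exact configuration used in the experiments).

```python
# Sketch of the cross-dataset protocol: fit on N images from dataset A and
# report accuracy on dataset B. The linear-SVM-on-features setup is an assumption.
from sklearn.svm import LinearSVC

def cross_dataset_accuracy(train_X, train_y, test_X, test_y, n_train):
    clf = LinearSVC().fit(train_X[:n_train], train_y[:n_train])
    return clf.score(test_X, test_y)   # accuracy on the held-out target dataset

# Typical usage: sweep the number of training images, e.g.
# for n in (100, 200, 300, 400, 500):
#     print(n, cross_dataset_accuracy(A_X, A_y, B_X, B_y, n))
```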
  29. 29. Application: Fine-grained Visual Categorization  Fine-grained visual recognition aims to distinguish between subordinate categories such as different birds, flowers, foods, cars, etc. Fig. 14: Illustration of the two characteristics of fine-grained subcategories: large variance within the same subcategory, as shown in the first row, and small variance among different subcategories, as shown in the second row. Images in (a) birds and (b) cars are from the CUB-200-2011 and Cars-196 datasets, respectively. 29
  30. 30.  Challenges: 1) Large variance in the same subcategory; 2) Small variance among different subcategories.  Solutions: 1) Strongly supervised methods; 2) Weakly supervised methods; 3) Webly supervised methods. 30
  31. 31. Strongly supervised methods: require manually labeled bounding boxes or part annotations during training. Fig. 15: Bounding boxes and part annotations for fine-grained visual categorization. 31
  32. 32. Weakly supervised methods: require image-level labels during training. Fig. 15: Image-level labeling for fine-grained visual categorization. 32
  33. 33. Webly supervised methods: require no human intervention during training. Fig. 16: A snapshot of retrieved web images for “Bobolink”. Label noise is marked with red and blue bounding boxes. 33
  34. 34. Our work focuses on webly supervised methods. The major problems include:  Label noise (low accuracy)  Domain mismatch (limited diversity) 34
  35. 35. Fig. 17: Label noise and domain mismatch problems in the web data for fine-grained categorization. Due to incorrect labeling, the web images collected with the query “Fish Crow” may contain noise (e.g., the “Purple Finch” image marked with a red bounding box). Directly leveraging the web images without noise removal may lead to classifying the test image “Purple Finch” as “Fish Crow”. In addition, the collected web images may span multiple domains (e.g., natural, sketch, and cartoon images). Due to the domain gap between the natural test image “Bobolink” and the sketch-domain web images collected for “Fish Crow”, directly leveraging the web images without domain mismatch alleviation may also lead to classifying the test image “Bobolink” as “Fish Crow”, even when no label noise is present. 35
  36. 36. Fig. 18: The architecture of our proposed DDN model. The input of our model is a set of “bags”, which consists of multiple images (“instances”). The bag of images is first fed into the backbone net (e.g., VGG16) to generate the feature map. Then the feature map goes into a fully connected layer followed by a ReLU, thus generating a feature vector N. This feature vector is subsequently sent into two branches. The upper branch is our proposed Attention Block, which consists of a fully connected layer followed by a tanh layer and another fully connected layer. After the Attention Block, we obtain the instance probability vector A. In the bottom branch, the feature vector N is multiplied by the output of the Attention Block A and then fed into a fully connected layer with a Sigmoid layer to calculate the probability of the bag being positive. Finally, the bottom branch adopts the Negative Bernoulli Log Loss, while the upper branch leverages our proposed Attentive Focal Loss. The weighted sum of the two branch losses is leveraged as the final loss function for DDN. 36
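A minimal PyTorch sketch of the attention-based bag/instance structure described in Fig. 18, not the authors' released DDN code. The layer sizes, the softmax normalization of the attention weights, and the VGG16 backbone configuration are illustrative assumptions; the Attentive Focal Loss and Negative Bernoulli Log Loss branches are omitted.

```python
# Sketch of the Fig. 18 structure under stated assumptions (torchvision >= 0.13
# API for the weights argument; older versions use pretrained=False).
import torch
import torch.nn as nn
from torchvision import models

class AttentionBagModel(nn.Module):
    def __init__(self, feat_dim=512, attn_dim=128):
        super().__init__()
        self.backbone = models.vgg16(weights=None).features   # backbone net (e.g., VGG16)
        self.fc = nn.Sequential(                               # FC + ReLU -> feature vector N
            nn.Flatten(), nn.Linear(512 * 7 * 7, feat_dim), nn.ReLU())
        self.attention = nn.Sequential(                        # Attention Block: FC -> tanh -> FC
            nn.Linear(feat_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1))
        self.bag_head = nn.Sequential(                         # bottom branch: FC + Sigmoid
            nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, bag):                           # bag: (num_instances, 3, 224, 224)
        n = self.fc(self.backbone(bag))               # instance features N, (k, feat_dim)
        a = torch.softmax(self.attention(n), dim=0)   # instance weights A, (k, 1)
        z = (a * n).sum(dim=0, keepdim=True)          # N weighted by A, pooled over the bag
        return self.bag_head(z), a.squeeze(-1)        # bag-positive probability, instance weights
```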
  37. 37. Fig. 19: The architecture of the instance-level model. The feature vector N is sent into the upper branch, which consists of a fully connected layer followed by a tanh layer and another fully connected layer. After this Attention Block, we obtain the instance probability vector. Finally, the upper branch leverages the Attentive Focal Loss. 37
  38. 38. Fig. 20: The architecture of the bag-level model. The feature vector N is multiplied by the output of the Attention Block and then fed into a fully connected layer with a Sigmoid layer to calculate the probability of the bag being positive. Finally, the bottom branch adopts the Negative Bernoulli Log Loss. 38
  39. 39. Tab. 5: Fine-grained ACA (%) results on CUB200 and Stanford Dogs. Tr-B/P means a bounding box or part annotation is required in the training stage. Data indicates whether the training data is manually labeled (anno.) or collected from the web (web). The best result is marked in red, the second best in blue, and the third best in bold. 39
  40. 40. Publications ▪ Yazhou Yao, Jian Zhang, Fumin Shen, Li Liu, Fan Zhu, Dongxiang Zhang, and Heng-Tao Shen. "Towards Automatic Construction of Diverse, High-quality Image Dataset", IEEE Transactions on Knowledge and Data Engineering (TKDE), 2019. ▪ Yazhou Yao, Zeren Sun, Fumin Shen, Li Liu, Limin Wang, Fan Zhu, Lizhong Ding, Gangshan Wu, and Ling Shao. "Dynamically Visual Disambiguation of Keyword-based Image Search", International Joint Conference on Artificial Intelligence (IJCAI), 2019. ▪ Yazhou Yao, Fumin Shen, Jian Zhang, Li Liu, Zhenmin Tang, and Ling Shao. "Extracting Privileged Information for Enhancing Classifier Learning", IEEE Transactions on Image Processing (TIP), 2018. ▪ Yazhou Yao, Fumin Shen, Jian Zhang, Li Liu, Zhenmin Tang, and Ling Shao. "Extracting Multiple Visual Senses for Web Learning", IEEE Transactions on Multimedia (TMM), 2018. ▪ Yazhou Yao, Jian Zhang, Fumin Shen, Wankou Yang, Xian-Sheng Hua, and Zhenmin Tang. "Extracting Privileged Information from Untagged Corpora for Classifier Learning", International Joint Conference on Artificial Intelligence (IJCAI), 2018. ▪ Yazhou Yao, Jian Zhang, Fumin Shen, Wankou Yang, Pu Huang, and Zhenmin Tang. "Discovering and Distinguishing Multiple Visual Senses for Polysemous Words", AAAI Conference on Artificial Intelligence (AAAI), 2018. ▪ Yazhou Yao, Jian Zhang, Fumin Shen, Xian-Sheng Hua, Jingsong Xu, and Zhenmin Tang. "Exploiting Web Images for Dataset Construction: A Domain Robust Approach", IEEE Transactions on Multimedia (TMM), 2017. ▪ Yazhou Yao, Xian-Sheng Hua, Fumin Shen, Jian Zhang, and Zhenmin Tang. "A Domain Robust Approach for Image Dataset Construction", ACM International Conference on Multimedia (ACM MM), 2016. ▪ Yazhou Yao, Jian Zhang, Fumin Shen, Xian-Sheng Hua, Jingsong Xu, and Zhenmin Tang. "Automatic Image Dataset Construction with Multiple Textual Metadata", IEEE Conference on Multimedia and Expo (ICME), 2016. 40
  41. 41. Thank you! Any questions? 41
