Visual Search and Question Answering II

ICME2019 Tutorial: Visual Search and Question Answering II

Published in: Internet

  1. 1. Visual Search and Question Answering, ICME 2019 Tutorial, July 8th 13:30--17:00. Liangliang Cao http://www.llcao.net UMass (now at Google AI*), Lu Jiang http://www.lujiang.info/ Google AI, Yannis Kalantidis http://www.skamalas.com/ Facebook AI*. * The research in this talk was done before joining Google/Facebook
  2. 2. Outline I. Overview of Visual Search and Understanding (Liangliang) II. Visual Representations and Indexing (Yannis) III. MemexQA (Lu)
  3. 3. Section II: Visual Representations and Indexing 3
  4. 4. Visual Search: We want to see more of the “same” 4
  5. 5. Color Similarity *slide credit: Clayton Mellina, Huy Nguyen
  6. 6. Compositional Similarity *slide credit: Clayton Mellina, Huy Nguyen
  7. 7. Identity Similarity *slide credit: Clayton Mellina, Huy Nguyen
  8. 8. Semantic Similarity *slide credit: Clayton Mellina, Huy Nguyen
  9. 9. Visual Search Applications Similarity search: ● Given an image as query, show me visually similar images ● Useful tool for commercial photo search & licensing ● Visually congruent native ads Clustering and deduplication: ● Cluster images of a large collection for browsing ● Personal photo album summarization ● Deduplicate or diversify image search results Batch search and recommendations: ● Use all photos from a group to recommend photos to the group admin ● Use all photos favorited by a user to get recommendations ● Visual recommendations can be combined with social metadata 9
  10. 10. Basic Ingredients for large-scale search Representation Learning: documents/images/videos are represented as vectors Quantization and Indexing ● Storing high-dimensional features could be prohibitive ○ Hashing (bad performance, reconstruction not possible) ○ Quantization (better performance, allows approx. reconstruction) ● Search is feasible only if a very small percentage of the collection is checked → Indexing
  11. 11. Visual Representations 11
  12. 12. Some Recent Visual Representations A (highly biased) set of recent CNN architectures that aim at: ● Reducing network parameters ○ Multi-Fiber Networks [ECCV 2018] ● Reducing memory for attention mechanisms ○ A²-Nets: Double Attention Networks [NeurIPS 2018] ● Reasoning with global context ○ Global Reasoning Networks [CVPR 2019] ● Reducing spatial redundancy ○ Octave Convolutions [arXiv 2019]
  14. 14. The Multi-fiber Unit Idea: slice the complex residual unit into N parallel and separated units (called fibers), each of which is isolated from the others 14
  15. 15. The Multi-fiber Unit ● On its own, one fiber cannot access or use the features learned by the others ● Transistor component: facilitates information flow across the fibers ● Set the number of first-layer output channels to be 4 times smaller (cost would be reduced by a factor of 2) [Chen, Kalantidis, et al. Multi-Fiber Networks. ECCV 2018]
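To make the saving from slicing concrete, here is a rough connection count for a two-layer unit before and after it is split into N isolated fibers. This is only a sketch: it counts channel-to-channel connections, ignores kernel size and the cost of the cross-fiber component, and the function names and channel sizes are hypothetical.

```python
def unit_connections(c_in, c_mid, c_out):
    """Channel-to-channel connections of a plain two-layer unit."""
    return c_in * c_mid + c_mid * c_out


def multi_fiber_connections(c_in, c_mid, c_out, n_fibers):
    """Connections when the unit is sliced into n_fibers isolated fibers.

    Each fiber only sees its own 1/n_fibers share of the channels.
    """
    for c in (c_in, c_mid, c_out):
        if c % n_fibers != 0:
            raise ValueError("channel counts must be divisible by n_fibers")
    per_fiber = unit_connections(
        c_in // n_fibers, c_mid // n_fibers, c_out // n_fibers
    )
    return n_fibers * per_fiber


if __name__ == "__main__":
    dense = unit_connections(256, 64, 256)
    sliced = multi_fiber_connections(256, 64, 256, n_fibers=16)
    print(dense, sliced, dense // sliced)
```

Under these assumptions the connection count drops by a factor equal to the number of fibers, which is why some mechanism for exchanging information between fibers becomes necessary.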
  16. 16. Results on ImageNet [Chen, Kalantidis, et al. Multi-Fiber Networks. ECCV 2018]
  17. 17. Results on ImageNet [Chen, Kalantidis, et al. Multi-Fiber Networks. ECCV 2018]
  19. 19. Reducing computations for attention mechanisms Incorporating global context ● e.g. the attention mechanisms [Vaswani et al. 2017, Wang et al. 2018] ● Enables interactions between locations over the full coordinate space ● Requires computing and storing a (quadratic) matrix of all input location pairs Convolutional Neural Networks model local relations ● Operate on the (spatio-temporal) coordinate space grid ● Require stacking multiple layers to capture relations between distant locations [Vaswani et al. Attention is all you need. NIPS 2017] [Wang et al. Non-local Neural Networks. CVPR, 2018] 19
  20. 20. A²-Nets: Double Attention Networks Decomposed attention mechanism: aggregate and propagate features from the entire (spatio-temporal) input space efficiently ● First attention: Gather features from the entire space into a compact set through second-order attention pooling ● Second attention: Adaptively select and distribute features to each location. [Chen, Kalantidis, et al. A²-Nets: Double Attention Networks. NeurIPS 2018]
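The gather and distribute steps can be sketched in NumPy as follows. This is a minimal illustration, not the published implementation: 1x1 convolutions are written as matrix products over flattened locations, the weight names (`w_feat`, `w_gather`, `w_distr`) and all sizes are hypothetical, and the choice of softmax axes is an assumption.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def double_attention(x, w_feat, w_gather, w_distr):
    """x: (c, L) features over L flattened locations.

    Returns (m, L) features. No L x L matrix is ever formed.
    """
    a = w_feat @ x                      # (m, L) features to be gathered
    b = softmax(w_gather @ x, axis=1)   # (n, L) attention maps over locations
    v = softmax(w_distr @ x, axis=0)    # (n, L) per-location weights over the n descriptors
    g = a @ b.T                         # (m, n) first attention: compact global descriptors
    return g @ v                        # (m, L) second attention: distribute to each location


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c, m, n, L = 8, 4, 3, 49
    x = rng.normal(size=(c, L))
    z = double_attention(
        x,
        rng.normal(size=(m, c)),
        rng.normal(size=(n, c)),
        rng.normal(size=(n, c)),
    )
    print(z.shape)
```

The largest intermediate here has size n x L or m x L, so memory grows linearly with the number of locations rather than quadratically.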
  21. 21. A²-Nets: Double Attention Networks Accuracy on ImageNet [Chen, Kalantidis, et al. A²-Nets: Double Attention Networks. NeurIPS 2018]
  23. 23. Beyond the simple attention mechanism Global context modeling is highly important ● Attention-like mechanisms are becoming standard across ML A limitation of current global context modeling approaches ● Follow the Gather → Distribute model ● Only focus on delivering information ● Rely on convolutional layers for reasoning Can we capture and reason on global region interactions efficiently?
  24. 24. Global Reasoning Networks Gather → Reason → Distribute Can we construct a (latent) space where relations over sets of features scattered over the coordinate space translate to simple feature interactions? [Figure labels: Coordinate Space, Interaction Space] [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  25. 25. Global Reasoning in Three Steps 1) From Coordinate Space to Interaction Space → Weighted projections 2) Reasoning in Interaction Space → Graph convolutions 3) From Interaction Space (back) to Coordinate Space → Weighted broadcasting [Figure labels: Coordinate Space, Interaction Space] [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  26. 26. From Coordinate Space to Interaction Space ● We want to learn a set of projections for (arbitrary) region features [Figure labels: Coordinate Space, Projection, Interaction Space] [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  27. 27. From Coordinate Space to Interaction Space Given a set of input features $X \in \mathbb{R}^{L \times C}$ (with $L = H \times W$ locations), compute the projection function $v_i = b_i X = \sum_j b_{ij} x_j$, where $B = [b_1, \dots, b_N]$ are the learnable projection weights [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  28. 28. From Coordinate Space to Interaction Space Given a set of input features, compute the projection function [Figure: features of size H x W x C are pooled with a weight map b_i of size H x W] [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  29. 29. From Coordinate Space to Interaction Space Given a set of input features, compute the projection function [Figure: N weight maps of size H x W produce an N x C output] [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  30. 30. From Coordinate Space to Interaction Space ● After projection → N feature vectors [Figure labels: Coordinate Space, Projection] [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  31. 31. From Coordinate Space to Interaction Space ● After projection → N feature vectors ● Relations between arbitrary regions → interactions between features ● What is an efficient way of reasoning over feature interactions? [Figure labels: Coordinate Space, Projection, Interaction Space] [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  32. 32. Reasoning in Interaction Space How to model interactions? ● Treat each feature as a node in a fully-connected graph ● Learn the edge weights that correspond to interactions of features ● Graph convolution formulation by [Kipf & Welling]: $Z = ((I - A_g) V) W_g$, where $A_g$ is the N x N (learnt) adjacency matrix and $W_g$ is the state update [Kipf & Welling. Semi-supervised classification with graph convolutional networks. ICLR, 2017] [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  33. 33. From Interaction Space to Coordinate Space ● Reverse projection: Distribute the updated states back ● Reuse projection weights [Figure labels: Interaction Space, Reverse Projection, Coordinate Space] [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
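The three steps (projection, reasoning, reverse projection) can be sketched in NumPy as below. This is a minimal sketch under stated assumptions, not the authors' code: the projection weights are produced by a single hypothetical linear map `w_proj`, the graph convolution uses the form ((I - A) V) W, the residual connection is an assumption, and any channel reduction or normalization the real unit may use is omitted.

```python
import numpy as np


def glore_unit(x, w_proj, adj, w_state):
    """x: (L, C) features over L = H*W locations.

    w_proj:  (C, N) hypothetical map producing N projection weight maps
    adj:     (N, N) learnt adjacency matrix
    w_state: (C, C) state update weights
    """
    b = (x @ w_proj).T                          # (N, L) projection weights
    v = b @ x                                   # (N, C) weighted global pooling into N nodes
    n = adj.shape[0]
    z = (np.eye(n) - adj) @ v @ w_state         # (N, C) graph convolution over the nodes
    y = b.T @ z                                 # (L, C) weighted broadcasting, reusing b
    return x + y                                # residual connection (assumed)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L, C, N = 12, 5, 3
    x = rng.normal(size=(L, C))
    out = glore_unit(
        x,
        rng.normal(size=(C, N)),
        rng.normal(size=(N, N)),
        rng.normal(size=(C, C)),
    )
    print(out.shape)
```

Reasoning happens on an N x C matrix, so its cost does not depend on the number of locations once the projection has been applied.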
  34. 34. Global Reasoning (GloRe) Unit ● Projection: Weighted global pooling [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  35. 35. Global Reasoning (GloRe) Unit ● Projection: Weighted global pooling ● Reasoning: Graph Convolution [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  36. 36. Global Reasoning (GloRe) Unit ● Projection: Weighted global pooling ● Reasoning: Graph Convolution ● Reverse projection: Weighted broadcasting [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  37. 37. Global Reasoning (GloRe) Unit ● Projection: Weighted global pooling ● Reasoning: Graph Convolution ● Reverse projection: Weighted broadcasting [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  38. 38. Global Reasoning (GloRe) Unit ● Projection: Weighted global pooling ● Reasoning: Graph Convolution ● Reverse projection: Weighted broadcasting What do the learnt projection weights look like? [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  39. 39. Global Reasoning (GloRe) Unit What do the learnt projections look like? [Figure: visualization of projection weights] [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  40. 40. Global Reasoning Networks The Global Reasoning (GloRe) unit ● Is highly efficient (lower computational cost than self-attention) ● Is a plug-and-play residual unit that can be inserted in CNNs for different tasks Image Classification & Action Recognition backbone CNNs ● Insert one or more units at different positions Semantic segmentation ● Insert before bottleneck [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019] Figure from [Noh et al. ICCV 2015]
  41. 41. Ablations on ImageNet How many blocks to add and where? How many graph convolutions? [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  42. 42. Experiments on ImageNet [Chen, Rohrbach, Yan, Shuicheng, Feng, Kalantidis. Graph-Based Global Reasoning Networks. CVPR 2019]
  44. 44. Reducing Spatial Redundancy Many approaches exploit multi-scale inputs • Recent examples • Multi-scale DenseNets [Huang et al.]: multi-resolution paths over a DenseNet • Big-Little Nets [Chen et al.]: multi-resolution paths, synchronizing at every block • Network architecture is altered Spatial redundancy in feature maps • ConvNet kernels are highly local • Some feature maps must contain low-frequency information (smooth and slowly varying) [Huang et al. Multi-Scale Dense Networks for Resource Efficient Image Classification, ICLR 2018] [Chen et al. Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition, ICLR 2019]
  45. 45. Octave Convolution 45
  46. 46. Octave Convolution Advantages • Multi-scale processing with effective communication between the low- and high-frequency maps • Gains in terms of FLOPs • Gains in terms of memory • Larger receptive field for low-frequency feature maps [Figure: the Octave Convolution kernel]
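A back-of-the-envelope estimate of the FLOPs gain is sketched below. It assumes that a fraction `alpha` of both input and output channels is low-frequency, that low-frequency maps are stored at half resolution along each spatial axis (so a convolution on them costs a quarter as much), and that pooling and upsampling are free. The function name is hypothetical and the result is relative to a standard convolution with the same channel counts.

```python
def octconv_relative_cost(alpha):
    """Relative multiply-add cost of an octave convolution vs. a vanilla one.

    alpha: fraction of channels assigned to the low-frequency group (0..1).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    high_to_high = (1 - alpha) ** 2          # full-resolution convolution
    high_to_low = alpha * (1 - alpha) / 4    # pool first, then convolve at low resolution
    low_to_high = alpha * (1 - alpha) / 4    # convolve at low resolution, then upsample
    low_to_low = alpha ** 2 / 4              # low-resolution convolution
    return high_to_high + high_to_low + low_to_high + low_to_low


if __name__ == "__main__":
    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(alpha, octconv_relative_cost(alpha))
```

Under these assumptions, assigning half of the channels to the low-frequency group costs 0.4375 of a vanilla convolution, which is a theoretical bound rather than a measured speedup.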
  47. 47. import OctConv as conv Ablation study on ImageNet for varying models and ratios 47
  48. 48. ImageNet Classification 48
  49. 49. Is the speedup real? •On CPU (i.e. FB production): Reaching (almost) theoretical gains! •On GPU: An optimized CUDA-level implementation is required Results for ResNet-50 49
  50. 50. Recent Visual Representations Code online: ● Multi-Fiber Networks [ECCV 2018] ○ https://github.com/cypw/PyTorch-MFNet ● Global Reasoning Networks [CVPR 2019] ○ https://github.com/facebookresearch/GloRe (coming soon) ● Octave Convolutions [arXiv 2019] ○ https://github.com/facebookresearch/OctConv 50
  51. 51. Indexing 51
  52. 52. Basic Ingredients for large-scale search Representation Learning: documents/images/videos are represented as vectors Quantization and Indexing ● Storing high-dimensional features could be prohibitive ○ Hashing (bad performance, reconstruction not possible) ○ Quantization (better performance, allows approx. reconstruction) ● Search is feasible only if a very small percentage of the collection is checked → Indexing
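A quick calculation shows why raw storage becomes prohibitive. The collection size and dimensionality below are illustrative, and the 64-bit code length is an assumption chosen only for comparison; function names are hypothetical.

```python
def raw_storage_bytes(num_vectors, dim, bytes_per_value=4):
    """Bytes needed to store uncompressed float32 vectors."""
    return num_vectors * dim * bytes_per_value


def code_storage_bytes(num_vectors, bits_per_vector):
    """Bytes needed when each vector is replaced by a compact code."""
    return num_vectors * bits_per_vector // 8


if __name__ == "__main__":
    n, d = 10**9, 128
    print(raw_storage_bytes(n, d) / 1e9, "GB raw")
    print(code_storage_bytes(n, 64) / 1e9, "GB as 64-bit codes")
```

With these numbers, one billion 128-dimensional float32 vectors take 512 GB, while 64-bit codes for the same collection take 8 GB (before any index overhead).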
  53. 53. Quantization: k-means Idea: Create a “vocabulary” in high-dimensional space through clustering. Represent each vector with the index of its closest “word” Pros: ● Very high compression Cons: ● Hard to train for large k ● Performance is good only for large k [MacQueen 1967]
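The idea can be sketched with a few lines of NumPy: cluster the data to obtain a vocabulary, then store only the index of the closest word for each vector. This is a plain Lloyd-style k-means written for illustration; the function names, iteration count, and initialization are assumptions.

```python
import numpy as np


def assign(x, centroids):
    """Index of the closest centroid ("word") for every row of x."""
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def kmeans(x, k, iters=20, seed=0):
    """Learn a vocabulary of k centroids with Lloyd iterations."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        codes = assign(x, centroids)
        for j in range(k):
            members = x[codes == j]
            if len(members) > 0:          # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.normal(size=(200, 8))
    vocab = kmeans(x, k=16)
    codes = assign(x, vocab)              # one integer per vector
    approx = vocab[codes]                 # approximate reconstruction
    print(codes[:5], ((x - approx) ** 2).mean())
```

Each vector is stored as a single integer, which explains the very high compression, and the reconstruction error only becomes small when k is large.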
  54. 54. Quantization: product quantization Idea: Split the vector into multiple sub-vectors and create a vocabulary for each sub-vector. Represent each feature with the list of indices of its closest words [Gray, ASSP 1984] [Jegou, Douze & Schmid, PAMI 2011]
  55. 55. Quantization: product quantization Pros: ● Tunable compression & better reconstruction ● Easy & fast to train: a vocabulary of size k gives you k^m effective “cells” for m sub-vectors Cons: ● Independence assumption (“fix”: PCA) ● Unbalanced partitioning (fix: OPQ) [Gray, ASSP 1984] [Jegou, Douze & Schmid, PAMI 2011]
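A minimal product quantization sketch in NumPy follows. It trains one small k-means vocabulary per sub-vector, encodes each vector as m indices, and decodes by concatenating the selected words. All names and parameter values are hypothetical, and the simple k-means helper is an illustrative stand-in for whatever trainer a real system would use.

```python
import numpy as np


def _nearest(x, centroids):
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def _kmeans(x, k, iters=15, seed=0):
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        codes = _nearest(x, centroids)
        for j in range(k):
            members = x[codes == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids


def pq_train(x, m, k):
    """One vocabulary of size k per sub-vector; dim must be divisible by m."""
    return [_kmeans(s, k, seed=i) for i, s in enumerate(np.split(x, m, axis=1))]


def pq_encode(x, codebooks):
    """Each vector becomes a list of m indices, one per sub-vector."""
    subs = np.split(x, len(codebooks), axis=1)
    return np.stack([_nearest(s, cb) for s, cb in zip(subs, codebooks)], axis=1)


def pq_decode(codes, codebooks):
    """Approximate reconstruction by concatenating the selected words."""
    return np.hstack([cb[codes[:, i]] for i, cb in enumerate(codebooks)])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(300, 16))
    books = pq_train(x, m=4, k=8)
    codes = pq_encode(x, books)
    print(codes.shape, 8 ** 4, ((x - pq_decode(codes, books)) ** 2).mean())
```

With m = 4 sub-vectors and k = 8 words each, only 32 centroids are trained, yet the codes address 8^4 = 4096 distinct cells.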
  56. 56. Optimized product quantization [Ge et al, CVPR 2013, PAMI 2014]
  57. 57. Locally Optimized Product Quantization Idea: Locally optimize residuals, balance variance across subspaces, use multi-index [Kalantidis & Avrithis, CVPR 2014]
  58. 58. Locally Optimized Product Quantization [Kalantidis & Avrithis, CVPR 2014]
  59. 59. Locally Optimized Product Quantization Idea: Locally optimize residuals, balance variance across subspaces, use multi-index ● Balance variance across subspaces ● Local optimization using OPQ ● 20% improvement in precision over state-of-the-art ● Overhead independent of database size Stats for multi-LOPQ: ● 1 Billion 128-dimensional vectors ● ~22GB memory ● less than 55ms search time [Kalantidis & Avrithis, CVPR 2014]
  60. 60. Indexing [Figure: an example feature vector (id: 123984) is added to the posting list of its quantization cell, shown once for a cell addressed by a pair of codewords (5,4) and once for a cell addressed by a single codeword (11)]
  61. 61. Indexing: multi-index Idea: Use product quantization for indexing: split into 2 sub-vectors Pros: ● 2-step quantization: in the second stage one can quantize residuals ● Finer partitioning / smaller residuals ● Many cells/posting lists need to be searched; multi-sequence is a fast algorithm for traversing neighboring cells [Babenko & Lempitsky, CVPR 2012]
  62. 62. Multi-LOPQ: Searching in a multi-index ● split the query vector ● sort PQ centroids by ascending distance for each sub-vector ● start at the cell (Q1[0], Q2[0]), the first clusters in each posting list ● for the current cell (Q1[a], Q2[b]), insert both its bottom and right neighbors into a priority queue with priority: dist(x_L, Q1[a]) + dist(x_R, Q2[b])
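The traversal described above can be sketched with a priority queue. This is a simplified illustration rather than the reference algorithm: it uses squared Euclidean distances and a visited set to avoid pushing a cell twice, and the function name, arguments, and sizes are hypothetical.

```python
import heapq

import numpy as np


def multi_sequence(query, centroids_left, centroids_right, n_cells):
    """Return the n_cells closest multi-index cells in ascending distance.

    Each result is (left_centroid_id, right_centroid_id, distance).
    """
    half = len(query) // 2
    x_l, x_r = query[:half], query[half:]            # split the query vector
    d_l = ((centroids_left - x_l) ** 2).sum(axis=1)
    d_r = ((centroids_right - x_r) ** 2).sum(axis=1)
    q1, q2 = np.argsort(d_l), np.argsort(d_r)        # centroids by ascending distance

    heap = [(d_l[q1[0]] + d_r[q2[0]], 0, 0)]         # start at cell (Q1[0], Q2[0])
    seen = {(0, 0)}
    cells = []
    while heap and len(cells) < n_cells:
        dist, a, b = heapq.heappop(heap)
        cells.append((int(q1[a]), int(q2[b]), float(dist)))
        for na, nb in ((a + 1, b), (a, b + 1)):      # bottom and right neighbors
            if na < len(q1) and nb < len(q2) and (na, nb) not in seen:
                seen.add((na, nb))
                heapq.heappush(heap, (d_l[q1[na]] + d_r[q2[nb]], na, nb))
    return cells


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    left, right = rng.normal(size=(5, 4)), rng.normal(size=(6, 4))
    for cell in multi_sequence(rng.normal(size=8), left, right, n_cells=5):
        print(cell)
```

In a full system, the posting list of each returned cell would be scanned until enough candidates have been collected.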
  63. 63. Locally Optimized Product Quantization [Kalantidis & Avrithis, CVPR 2014]
  64. 64. Thank you! Yannis Kalantidis ykalant@image.ntua.gr http://www.skamalas.com
  65. 65. Locally Optimized Product Quantization https://github.com/yahoo/lopq [Kalantidis & Avrithis, CVPR 2014] [Kalantidis et al, ECCV-W 2016]
