The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017

Words are no longer sufficient in delivering the search results users are looking for, particularly in relation to image search. Text and languages pose many challenges in describing visual details and providing the necessary context for optimal results. Machine Learning technology opens a new world of search innovation that has yet to be applied by businesses.

In this session, Mike Ranzinger of Shutterstock will share a technical presentation detailing his research on composition aware search. He will also demonstrate how the research led to the launch of AI technology allowing users to more precisely find the image they need within Shutterstock’s collection of more than 150 million images. While the company released a number of AI search enabled tools in 2016, this new technology allows users to search for items in an image and specify where they should be located within the image. The research identifies the networks that localize and describe regions of an image as well as the relationships between things. The goal of this research was to improve the future of search using visual data, contextual search functions, and AI. A combination of multiple machine learning technologies led to this breakthrough.

  1. 1. The search for a new visual search. Mike Ranzinger, Senior Research Engineer @ Shutterstock
  2. 2. Shutterstock platform: Image, Footage, Music, Editorial; Single user, Team, Companies, Agencies; Create, Edit, Share, Publish
  3. 3. Outline
     • We’re going to explore a new technology that was just released to beta, called “Composition Aware Search” (CAS).
     • This involves some key technologies:
       • Convolutional neural nets (vision and NLP)
       • Discriminative localization
       • Multimodal embeddings
       • Dimensionality reduction
       • Inverted multi-index
     • Yes, between this presentation and our publicly shared white paper, you should be able to implement this yourself (non-commercially, of course).
  4. 4. Language and visual domain mismatch (figure: a side-by-side “vs.” comparison for the query “Red Bike”)
  5. 5. Domain mismatch
     • The saying is: “A picture is worth a thousand words.”
     • Our average query length is 2 words.
     • Sometimes it’s hard to describe exactly what you’re looking for.
     • Our users are accustomed to looking through multiple pages of results to find what they are looking for.
  6. 6. Image similarity / reverse image search
  7. 7. Reverse image search
     • Common problem: I have picture X without a license, and I need to get a license for it.
       • Perhaps you saw it on social media and wanted to share it more officially.
     • My toy problem: I took this bad picture; find me a good one!
     • We don’t use words at all. We communicate through pixels.
  8. 8. How does it work?
     • Query: my bike → trained CNN → fixed-length vector.
     • Collection: our bike images → trained CNN → fixed-length vectors.
     • Maximum inner product search between our collection and the query vector.
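
A minimal sketch of the lookup on this slide, assuming the CNN embeddings have already been computed: a brute-force maximum inner product search over L2-normalized vectors. The three-dimensional vectors, the image ids, and the function names are hypothetical placeholders, not Shutterstock's implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def max_inner_product_search(query, collection, top_k=3):
    """Brute-force MIPS: score every collection vector against the query."""
    scored = [(dot(query, vec), image_id) for image_id, vec in collection.items()]
    scored.sort(reverse=True)
    return [(image_id, score) for score, image_id in scored[:top_k]]


# Hypothetical embeddings standing in for the outputs of the trained CNN.
collection = {
    "red_bike": l2_normalize([0.9, 0.1, 0.0]),
    "blue_car": l2_normalize([0.1, 0.9, 0.2]),
    "mountain_bike": l2_normalize([0.8, 0.0, 0.3]),
}
query = l2_normalize([1.0, 0.0, 0.1])  # embedding of the "my bike" photo

results = max_inner_product_search(query, collection, top_k=2)
```

Here `results` lists the two bike images ahead of the car. Scoring every image is only workable for small collections; the later slides replace the exhaustive scan with an index.
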
  9. 9. Multimodal embedding / query language models
     • We have a vision model that can produce an N-dimensional vector for a given image.
     • Train a language model that maps a query to the vector of the downloaded image.
     • Training set: query-to-download pairs.
     • Example query: “Lemur on rock”
  10. 10. Multimodal embedding
      • Kiros et al., “Unifying visual-semantic embeddings with multimodal neural language models”
      • Trained using “triplet loss”:
        • Let $f(x)$ be the L2-normalized output of the vision model on image $x$
        • Let $g(q)$ be the L2-normalized output of the language model on query $q$
        • Let $q_1$ be the query corresponding to image $x_1$
        • Let $m$ be some margin, $0 < m < 2$
      • $L = \max\left(0,\ f(x_2) \cdot g(q_1) - f(x_1) \cdot g(q_1) + m\right)$
      • In words, the dot product between a query and its corresponding image $x_1$ (green) should be greater than the dot product between the query and some unrelated image $x_2$ (red) by some margin $m$.
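
The triplet loss on this slide can be sketched in a few lines. This is an illustration only: the two-dimensional unit vectors and the margin of 0.2 are assumed values (the slide only requires a margin between 0 and 2), and `triplet_loss` is a hypothetical helper name.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def triplet_loss(f_x1, f_x2, g_q1, margin=0.2):
    """L = max(0, f(x2).g(q1) - f(x1).g(q1) + m) for L2-normalized inputs.

    f_x1: embedding of the image that matches the query
    f_x2: embedding of an unrelated image
    g_q1: embedding of the query
    """
    return max(0.0, dot(f_x2, g_q1) - dot(f_x1, g_q1) + margin)


query = [1.0, 0.0]
matching_image = [1.0, 0.0]   # points the same way as the query
unrelated_image = [0.0, 1.0]  # orthogonal to the query

# Already separated by more than the margin: no loss.
loss_good = triplet_loss(matching_image, unrelated_image, query)

# Roles swapped: the unrelated image scores higher, so the loss is positive.
loss_bad = triplet_loss(unrelated_image, matching_image, query)
```

The loss is zero once the matching image beats the unrelated one by at least the margin, so only violating triplets contribute a training signal.
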
  11. 11. Multimodal embedding
      • We train the vision model first.
      • Next, we train the language model.
        • We don’t backpropagate gradients through the vision model, because doing so degrades it.
      • Once we’ve finished training the language model, we can search for images given a query using MIPS (maximum inner product search), the same way that we did with reverse image search.
  12. 12. Multimodal embedding search
  13. 13. Spatial feature vectors
      • Here’s an example of a “fully convolutional” neural network.
      • A fully convolutional network is typically a series of convolutions and downsampling operations that ends with a global average pooling (GAP) operation.
      • The GAP reduces the final feature maps down to a single vector (one value per feature map).
      • We call the vector of activations across all final feature maps at a position (y, x) a “spatial feature vector”.
  14. 14. Spatial feature vectors
      • Like the feature vector produced by the global average pool, spatial vectors encode information in the same embedding space.
      • Importantly, these vectors tend to encode more localized information, determined by the receptive field of the given neuron.
      • We exploit these vectors to build Composition Aware Search (CAS).
  15. 15. Discriminative localization
      • Zhou et al. published a very important paper titled “Learning Deep Features for Discriminative Localization”.
      • They introduce the concept of “Class Activation Maps” (CAMs): effectively, a heatmap of the classification strength for a given class at each output position before the GAP.
  16. 16. Discriminative localization
      • Let $f_k(y, x)$ be the activation of unit $k$ at position $(y, x)$ of the last convolutional layer.
      • $F_k = \frac{1}{YX} \sum_{y,x} f_k(y, x)$
        • The result of the GAP for unit $k$: the mean over all $Y \times X$ spatial positions.
      • For a given class $c$, the input to the softmax is $S_c = \sum_k w_k^c F_k$.
        • In words, the dot product between the GAP features and the learned vector for the given class, where $w_k^c$ is the weight for class $c$ for unit $k$.
      • Let $M_c(y, x) = \sum_k w_k^c f_k(y, x)$ be the class activation map, or in words, the importance of spatial position $(y, x)$ for the classification of class $c$.
        • I’d recommend reading the paper to see the full derivation.
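
As a worked check of these definitions, the sketch below computes the GAP features, the softmax input for one class, and the class activation map on a toy input. The 2×2 feature maps and the class weights are made-up numbers. The point it illustrates is that averaging the CAM over all positions gives back the softmax input, because both are the same weighted sum taken in a different order.

```python
def global_average_pool(feature_maps):
    """F_k: mean activation of each unit k over all spatial positions."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


def class_activation_map(feature_maps, class_weights):
    """M_c(y, x) = sum over k of w_k^c * f_k(y, x)."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [
            sum(w * fmap[y][x] for w, fmap in zip(class_weights, feature_maps))
            for x in range(width)
        ]
        for y in range(height)
    ]


# Two hypothetical 2x2 feature maps (units k = 0 and k = 1).
feature_maps = [
    [[1.0, 0.0], [0.0, 0.0]],  # unit 0 fires in the top-left corner
    [[0.0, 0.0], [0.0, 2.0]],  # unit 1 fires in the bottom-right corner
]
class_weights = [3.0, -1.0]    # assumed weights w_k^c for a single class c

gap = global_average_pool(feature_maps)
logit = sum(w * f for w, f in zip(class_weights, gap))   # S_c
cam = class_activation_map(feature_maps, class_weights)  # M_c(y, x)
cam_mean = sum(sum(row) for row in cam) / 4
```

In this toy case the map is positive where unit 0 fires and negative where unit 1 fires, which is the heatmap reading of the CAM: each position shows how much it pushes the class score up or down.
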
  17. 17. What it looks like for us: CAM for the highest-probability guess, which is “meerkat” with probability 40%.
  18. 18. What it looks like for us: mountain bike, 43%.
  19. 19. Auto-saliency
      • Recall that the output of the GAP is $F_k = \frac{1}{YX} \sum_{y,x} f_k(y, x)$, the mean over all $Y \times X$ spatial positions.
      • What if, instead of needing class $c$, we use $F_k$ as the target?
      • $M_c(y, x) = \sum_k F_k f_k(y, x)$
      • Basically, this tells us how close a given spatial vector is to the average vector. One way to interpret this is, “how salient is the spatial vector to the classification?”
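
A minimal sketch of auto-saliency as defined on this slide: the GAP vector takes the place of the class weights, so each position is scored by its dot product with the average vector. The 1×3 feature maps are made-up numbers and `auto_saliency` is a hypothetical helper name.

```python
def auto_saliency(feature_maps):
    """M(y, x) = sum over k of F_k * f_k(y, x), with F_k the GAP of unit k."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    gap = [
        sum(sum(row) for row in fmap) / (height * width)
        for fmap in feature_maps
    ]
    return [
        [
            sum(f_k * fmap[y][x] for f_k, fmap in zip(gap, feature_maps))
            for x in range(width)
        ]
        for y in range(height)
    ]


# Two hypothetical feature maps over a 1x3 grid of positions.
feature_maps = [
    [[6.0, 0.0, 0.0]],  # unit 0: strong response at the first position
    [[0.0, 3.0, 0.0]],  # unit 1: weaker response at the second position
]

saliency = auto_saliency(feature_maps)
```

The first position dominates the average vector, so it also receives the highest saliency. No class label was needed to produce the map.
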
  20. 20. Auto-saliency: note that “lemur” isn’t actually a class that the network was trained on. The closest class neighbors are meerkat and koala.
  21. 21. Auto-saliency
  22. 22. Auto-saliency
  23. 23. Auto-saliency: multiple regions within the image may be deemed important.
  24. 24. Auto-saliency
      • Why is this important?
        • It allows us to visualize how the network behaves on inputs for classes that it wasn’t explicitly trained on.
      • The fact that this works also reveals an open problem for us. In order for the salient vectors to emerge, the non-salient regions of the image must do one of two things:
        • Align themselves in the same direction as the salient vector (dilation)
        • Or reduce their magnitude so as not to bias the salient vector
  25. 25. Language models as discriminators
      • We have now seen how we can use CAMs, as well as the GAP vectors themselves, to guide the heatmaps.
      • Finally, we can look back at the language model we trained earlier.
      • $L = \max\left(0,\ f(x_2) \cdot g(q_1) - \boldsymbol{f(x_1) \cdot g(q_1)} + m\right)$
        • The language model learns to match the direction of the GAP vector (the bold term).
        • In effect, we can use the language model to generate the class weights for the CAM technique on the fly.
        • I think it’s neat to interpret the language model as a low-rank approximator of the (potentially infinite) classification weight matrix.
  26. 26. Composition aware search: overview
      • Diagram components: Vision Model, Language Model, Spatial Index, Collect.
      • “Lamp” and “Chair” are called “anchors”, which have both a position and a query string.
      • The texture shown is also an anchor, with a position, size, and query image.
  27. 27. Models
      • Vision model
        • We are using a variant of Inception v3, from the paper by Szegedy et al. titled “Rethinking the Inception Architecture for Computer Vision”.
        • Notable differences:
          • We are not using batch normalization.
          • We are using ELU non-linearities instead of ReLUs.
      • Language model
        • We tried to be fancy and use cool tech such as character models and LSTMs.
        • The character LSTMs massively overfit on us.
        • So we used words, and dropped recurrence altogether in favor of a simpler convolutional language model, as described by Collobert et al. in “Natural Language Processing (Almost) from Scratch”.
  28. 28. The search problem
      • Let’s look at the query formulation for this, starting with the simple case.
      • Let $S(i)$ be the score for image $i$.
      • Let $\mathbf{Q}$ be the set of anchors, and $\mathbf{q}_j$ be the L2-normalized (column) vector of the $j$-th anchor.
      • Let $\mathbf{V}_i$ be the set of spatial vectors in image $i$, and $\mathbf{v}_{ip}$ be the $p$-th L2-normalized spatial (column) vector.
      • Let $w_{jp}$ be a positional weight applied to position $p$ based on the position of query $j$.
      • $S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j=1}^{|\mathbf{Q}|} \max_{1 \le p \le P} w_{jp}\, \mathbf{v}_{ip}^{\top} \mathbf{q}_j$
  29. 29. The search problem
      • $S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j=1}^{|\mathbf{Q}|} \max_{1 \le p \le P} w_{jp}\, \mathbf{v}_{ip}^{\top} \mathbf{q}_j$
      • $\frac{1}{|\mathbf{Q}|} \sum_j$: take the average over the anchors.
      • $\max_{1 \le p \le P}$: we only care about the largest weighted similarity score. This gives us a single score per query anchor for an image.
      • $w_{jp}$: we use this to weight the similarity score based on the relative position of the anchor to the spatial vector.
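  30. 30. Visualization of $w$ over the query canvas (axes: row, column), where $S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j=1}^{|\mathbf{Q}|} \max_{1 \le p \le P} w_{jp}\, \mathbf{v}_{ip}^{\top} \mathbf{q}_j$.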
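
The scoring function can be sketched as follows. One loud assumption: the slides visualize the positional weight but do not give its formula, so `positional_weight` below uses a Gaussian falloff with grid distance purely as a stand-in. The 2×2 grid, the one-hot "concept" vectors, and all function names are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def positional_weight(anchor_pos, cell_pos, sigma=1.5):
    """Assumed w_jp: Gaussian falloff with distance on the grid (not from the slides)."""
    dy = anchor_pos[0] - cell_pos[0]
    dx = anchor_pos[1] - cell_pos[1]
    return math.exp(-(dy * dy + dx * dx) / (2.0 * sigma * sigma))


def score_image(anchors, spatial_vectors):
    """S(i): mean over anchors of the best weighted similarity over positions.

    anchors: list of (position, query_vector) pairs
    spatial_vectors: dict mapping grid position -> spatial vector
    """
    total = 0.0
    for anchor_pos, q in anchors:
        total += max(
            positional_weight(anchor_pos, pos) * dot(v, q)
            for pos, v in spatial_vectors.items()
        )
    return total / len(anchors)


# Hypothetical unit vectors for three concepts.
LAMP, CHAIR, BACKGROUND = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]

# Query canvas: lamp in the top-left cell, chair in the bottom-right cell.
anchors = [((0, 0), LAMP), ((1, 1), CHAIR)]

image_a = {(0, 0): LAMP, (0, 1): BACKGROUND, (1, 0): BACKGROUND, (1, 1): CHAIR}
image_b = {(0, 0): CHAIR, (0, 1): BACKGROUND, (1, 0): BACKGROUND, (1, 1): LAMP}

score_a = score_image(anchors, image_a)  # objects where the query asks for them
score_b = score_image(anchors, image_b)  # same objects, positions swapped
```

Both images contain a lamp and a chair, but only `image_a` has them where the anchors were placed, so it scores higher. That difference is the "composition aware" part of the search.
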
  31. 31. The search problem
      • Using the above definition, the size of the index is determined by the following variables:
        • $C$: the size of the collection
        • $D$: the dimensionality of the spatial vectors
        • $P$: the number of spatial positions
      • For our (beta) production offering, we have:
        • $C = 10{,}000{,}000$
        • $D = 256$
        • $P = 64$ (we use 8 rows and 8 columns)
      • This requires about 611 GB of space to store the index.
      • Algorithm complexity is also $O(C \cdot |\mathbf{Q}| \cdot P \cdot D)$, which is a lot.
      • $S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j=1}^{|\mathbf{Q}|} \max_{1 \le p \le P} w_{jp}\, \mathbf{v}_{ip}^{\top} \mathbf{q}_j$
  32. 32. Visualizing concepts: using PCA, we can visualize how the concepts are arranged along the 2 principal directions. Figure labels: Cat, Background.
  33. 33. Visualizing concepts. Figure labels: Mountains, Cyclists, Sky.
  34. 34. Visualizing concepts – reduction methods. Figure label: Asphalt; panels: PCA, t-SNE.
  35. 35. Visualizing concepts – reduction methods
      • t-SNE was superior at disentangling concepts on a 2D plane.
        • Manifold learning technique
        • Popular technique for data visualization
      • PCA was still able to do a decent job.
        • Linear
      • We use PCA because embedding a new point is efficiently computed with a single GEMM (general matrix multiply).
  36. 36. The search problem (part 2)
      • Since we are performing a dimensionality reduction on the spatial vectors for each image, let’s redefine the search problem.
      • Let $\mathbf{B}_i \in \mathbb{R}^{N \times D}$ be the orthonormal basis of the PCA for image $i$, such that we preserve $N$ dimensions, with $N \le D$.
      • Let $\mathbf{d}_{ip} = \mathbf{B}_i \mathbf{v}_{ip}$ be the reduced-dimensionality spatial vector for image $i$ at position $p$.
      • $S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j=1}^{|\mathbf{Q}|} \max_{1 \le p \le P} w_{jp}\, \mathbf{d}_{ip}^{\top} \mathbf{B}_i \mathbf{q}_j$
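  37. 37. The search problem (part 2)
      • $S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j=1}^{|\mathbf{Q}|} \max_{1 \le p \le P} w_{jp}\, \mathbf{d}_{ip}^{\top} \mathbf{B}_i \mathbf{q}_j$
      • $\mathbf{B}_i \mathbf{q}_j$: project the query vector into the reduced-dimensionality subspace for the image.
      • $\mathbf{d}_{ip}^{\top} (\mathbf{B}_i \mathbf{q}_j)$: then compute the dot product between the two vectors in the subspace.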
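
The two steps on this slide can be illustrated with a toy example. To stay dependency-free, the sketch does not run PCA: the basis is hand-picked (two orthonormal rows in three dimensions) and simply stands in for a per-image basis that PCA would produce. All values and names are assumptions for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project(basis, vec):
    """Return B @ vec, where the rows of `basis` are orthonormal directions."""
    return [dot(row, vec) for row in basis]


# Hand-picked orthonormal basis: N = 2 rows, D = 3 columns.
basis = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]

v = [0.6, 0.8, 0.0]  # spatial vector that lies inside the subspace
q = [0.6, 0.0, 0.8]  # query vector with a component outside the subspace

d = project(basis, v)          # stored in the index: N values instead of D
q_reduced = project(basis, q)  # computed at query time, once per image

full_score = dot(v, q)
reduced_score = dot(d, q_reduced)
```

Because `v` lies entirely in the subspace, the reduced score equals the full one. With a real PCA that keeps only a few dimensions, the spatial vectors are only approximately in the subspace, so the reduced score is an approximation.
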
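  38. 38. The search problem (part 2)
      • Naïve definition: $S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j=1}^{|\mathbf{Q}|} \max_{1 \le p \le P} w_{jp}\, \mathbf{v}_{ip}^{\top} \mathbf{q}_j$
        • Requires $C \cdot D \cdot P$ storage space
          • 611 GB for 10 million images
        • Computation: $O(C \cdot |\mathbf{Q}| \cdot P \cdot D)$
          • In practice, $P = 64$ and $D = 256$, so $P \cdot D = 16384$
      • New definition: $S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j=1}^{|\mathbf{Q}|} \max_{1 \le p \le P} w_{jp}\, \mathbf{d}_{ip}^{\top} \mathbf{B}_i \mathbf{q}_j$
        • Requires $C \cdot N \cdot (D + P)$ storage space
          • $N \approx 4$
          • 48 GB for 10 million images
          • 12.7x reduction
        • Computation: $O(C \cdot |\mathbf{Q}| \cdot N \cdot (D + P))$
          • $N \cdot (D + P) = 1280$
          • 12.8x reduction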
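
The storage figures on this slide can be reproduced with a few lines of arithmetic. Two assumptions are needed that the slide does not state: each value is stored as a 4-byte float, and "GB" means $2^{30}$ bytes. Under those assumptions the numbers land close to the quoted 611 GB and 48 GB.

```python
C = 10_000_000  # images in the collection
D = 256         # dimensionality of each spatial vector
P = 64          # spatial positions per image (8 x 8)
N = 4           # PCA dimensions kept per image (the slide says N is about 4)

BYTES_PER_VALUE = 4  # assumption: 32-bit floats
GIB = 2 ** 30        # assumption: the slide's "GB" is 2**30 bytes

naive_values_per_image = P * D          # P spatial vectors of D values
reduced_values_per_image = N * (D + P)  # N x D basis plus P vectors of N values

naive_gib = C * naive_values_per_image * BYTES_PER_VALUE / GIB
reduced_gib = C * reduced_values_per_image * BYTES_PER_VALUE / GIB
reduction = naive_values_per_image / reduced_values_per_image
```

The exact ratio is 12.8. The 12.7x quoted for storage matches what you get by dividing the rounded figures, 611 by 48.
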
  39. 39. The search problem (part 3)
      • The current computational complexity is $O(C \cdot |\mathbf{Q}| \cdot N \cdot (D + P))$.
      • Importantly, this is still intractable because it still processes every image in the collection.
        • Users typically don’t want to wait roughly 7 seconds for a response.
      • Our best bet is to formulate the problem such that we only process a tiny fraction of $C$.
  40. 40. Inverted index
      • Construction:
        • Select the codebook size, $|\mathbf{W}|$.
        • Find the $|\mathbf{W}|$ centroids of the collection $\mathbf{C}$ using a K-means-like process.
        • Each of these centroids is called a “codeword”.
        • Assign each vector in $\mathbf{C}$ to its nearest vector in $\mathbf{W}$.
      • Inference:
        • Find the $k$ nearest codewords in $\mathbf{W}$ to $\mathbf{q}$.
        • Either return all of the vectors in the $k$ codewords, or perhaps find the $k'$ nearest vectors to $\mathbf{q}$ within the codewords.
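
The construction and inference steps above can be sketched as below. This is a toy version under stated assumptions: plain Lloyd iterations stand in for the "K-means-like process", the starting codewords are passed in by hand to keep the run deterministic, and the five two-dimensional vectors are invented.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, codewords):
    """Index of the codeword closest to vec."""
    return min(range(len(codewords)), key=lambda w: sq_dist(vec, codewords[w]))


def kmeans(vectors, initial_codewords, iterations=10):
    """Plain Lloyd iterations starting from caller-supplied codewords."""
    codewords = [list(c) for c in initial_codewords]
    for _ in range(iterations):
        buckets = [[] for _ in codewords]
        for vec in vectors:
            buckets[nearest(vec, codewords)].append(vec)
        for w, members in enumerate(buckets):
            if members:  # an empty cluster keeps its previous codeword
                codewords[w] = [sum(dim) / len(members) for dim in zip(*members)]
    return codewords


def build_inverted_index(vectors, codewords):
    """Map each codeword index to the ids of the vectors assigned to it."""
    index = {w: [] for w in range(len(codewords))}
    for vec_id, vec in enumerate(vectors):
        index[nearest(vec, codewords)].append(vec_id)
    return index


def search(query, vectors, codewords, index, k=1, k_prime=2):
    """Visit the k nearest codewords, then keep the k' nearest vectors in them."""
    order = sorted(range(len(codewords)), key=lambda w: sq_dist(query, codewords[w]))
    candidates = [vec_id for w in order[:k] for vec_id in index[w]]
    candidates.sort(key=lambda vec_id: sq_dist(query, vectors[vec_id]))
    return candidates[:k_prime]


vectors = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8], [0.1, 0.3]]
codewords = kmeans(vectors, initial_codewords=[vectors[0], vectors[2]])
index = build_inverted_index(vectors, codewords)
hits = search([0.15, 0.1], vectors, codewords, index, k=1, k_prime=2)
```

The query only ever touches the vectors stored under its nearest codeword, which is the point of the structure: most of the collection is never scored.
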
  41. 41. Inverted multi-index
      • Introduced by Babenko and Lempitsky at CVPR 2012 (paper: “The Inverted Multi-Index”).
      • This technique combines product quantization with inverted indices.
      • Construction:
        • Split the dimensions of the vectors in your collection $C$ into $N$ partitions, typically $N = 2$ (figure: dims $1 \to r$ and dims $r + 1 \to D$).
        • For each partition $M$, find $|\mathbf{W}|$ cluster centers, as with the inverted index.
        • For each vector in $C$, assign it to the nearest codeword in each partition independently.
        • This forms a Cartesian product of the codebooks, such that the full codebook is essentially of size $|\mathbf{W}|^N$.
  42. 42. Inverted multi-index
      • Inference:
        • Sort the codewords in each partition $M$ by distance to $\mathbf{q}'$, the part of the query that falls in that partition.
        • Traverse the $N$ codebooks jointly, always visiting next the combination of codewords $\mathbf{m}$ with the smallest sum of distances from each $\mathbf{q}'$ to its codeword.
      • I strongly recommend reading the paper for this one. It’s hard to explain on a slide.
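
A rough sketch of the idea for $N = 2$, to make the traversal concrete. It is not the paper's exact multi-sequence algorithm: it uses a simpler priority-queue traversal with a visited set, which still yields cells in order of increasing summed distance. The codebooks are given by hand rather than learned, and all vectors and names are invented for illustration.

```python
import heapq


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_multi_index(vectors, codebook_a, codebook_b, split):
    """Assign each vector to a cell (a, b): the nearest codeword for each half."""
    cells = {}
    for vec_id, vec in enumerate(vectors):
        a = min(range(len(codebook_a)), key=lambda w: sq_dist(vec[:split], codebook_a[w]))
        b = min(range(len(codebook_b)), key=lambda w: sq_dist(vec[split:], codebook_b[w]))
        cells.setdefault((a, b), []).append(vec_id)
    return cells


def traverse_cells(query, codebook_a, codebook_b, split):
    """Yield (cell, summed distance) in order of increasing summed distance."""
    dist_a = sorted((sq_dist(query[:split], c), w) for w, c in enumerate(codebook_a))
    dist_b = sorted((sq_dist(query[split:], c), w) for w, c in enumerate(codebook_b))
    heap = [(dist_a[0][0] + dist_b[0][0], 0, 0)]
    seen = {(0, 0)}
    while heap:
        total, i, j = heapq.heappop(heap)
        yield (dist_a[i][1], dist_b[j][1]), total
        # The only cells that can come next are one step further in either list.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(dist_a) and nj < len(dist_b) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (dist_a[ni][0] + dist_b[nj][0], ni, nj))


def search(query, cells, codebook_a, codebook_b, split, min_candidates):
    """Expand cells nearest-first until enough candidate ids are collected."""
    candidates = []
    for cell, _ in traverse_cells(query, codebook_a, codebook_b, split):
        candidates.extend(cells.get(cell, []))
        if len(candidates) >= min_candidates:
            break
    return candidates


split = 2  # dims 0-1 form the first partition, dims 2-3 the second
codebook_a = [[0.0, 0.0], [1.0, 1.0]]
codebook_b = [[0.0, 0.0], [1.0, 1.0]]
vectors = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.9, 1.0, 0.1, 0.0],
]
query = [0.9, 0.9, 0.2, 0.1]

cells = build_multi_index(vectors, codebook_a, codebook_b, split)
cell_order = [cell for cell, _ in traverse_cells(query, codebook_a, codebook_b, split)]
candidates = search(query, cells, codebook_a, codebook_b, split, min_candidates=3)
```

Two codebooks of two codewords each give four cells. With 10,000 codewords per partition, the same construction gives $10{,}000^2$ cells without ever clustering into that many centroids directly.
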
  43. 43. Inverted multi-index. Figure: visualization of a set of data points and their respective clusters. Source: “The Inverted Multi-Index” by Artem Babenko and Victor Lempitsky.
  44. 44. The inverted multi-index for CAS
      • For the most part, we use the basic formulation of the IMI.
        • We use $|\mathbf{W}| = 10000$, which results in 100 million possible codewords.
      • Except:
        • Each image has $P$ spatial vectors associated with it, so we assign each spatial vector to a cluster independently.
        • This is actually the main reason we use the IMI over the regular inverted index: we effectively have $P \cdot C$ vectors to index, and the inverted index doesn’t scale as well.
        • The paper primarily addresses $L_2$ distance, but we use cosine distance.
        • Scale the codebook vectors by $1/\sqrt{N}$ such that the magnitude of any concatenated set of codewords $m_1, m_2, \dots, m_N$ is 1.
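
One reading of the scaling step above, offered as an assumption rather than the author's exact procedure: if every codeword in each of the $N$ partitions is L2-normalized and then scaled by $1/\sqrt{N}$, any concatenation of one codeword per partition has unit length. The small check below uses made-up codebooks and a hypothetical helper, `scale_codebook`.

```python
import math


def scale_codebook(codebook, num_partitions):
    """L2-normalize each codeword, then scale it by 1 / sqrt(N)."""
    scale = 1.0 / math.sqrt(num_partitions)
    scaled = []
    for word in codebook:
        norm = math.sqrt(sum(x * x for x in word))
        scaled.append([scale * x / norm for x in word])
    return scaled


# Two hypothetical partitions (N = 2), each with two 2-dimensional codewords.
codebook_a = scale_codebook([[3.0, 4.0], [1.0, 0.0]], num_partitions=2)
codebook_b = scale_codebook([[0.0, 2.0], [1.0, 1.0]], num_partitions=2)

# Norm of every possible concatenation (one codeword from each partition).
norms = [
    math.sqrt(sum(x * x for x in a + b))
    for a in codebook_a
    for b in codebook_b
]
```

With unit-length queries and unit-length concatenated codewords, squared $L_2$ distance equals $2 - 2\cos\theta$, so ranking by $L_2$ distance and ranking by cosine similarity give the same order.
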
  45. 45. The inverted multi-index for CAS
      • We expand clusters for each query term until we reach a fixed number of images.
      • We then take the set union of the expansions for each query term, and run the previously defined scoring function.
      • We look for about 5k images per anchor, so we typically only rank between 0.05% and 0.15% of the collection.
  46. 46. Spatial-semantic image search by visual feature synthesis. Mai et al. introduced the paper of this title at CVPR 2017. Figure credit: Mai et al., “Spatial-semantic image search by visual feature synthesis”.
  47. 47. Mai et al. paper
      • Joint effort between Portland State University and Adobe Research
      • Their problem space is very similar to CAS
      • Key differences:
        • Their language model learns to map all of the non-uniformly sized anchors to a single feature vector, and then search proceeds like a standard nearest-neighbors query.
        • They train their models using a dataset with object localization information (COCO).
      • Basically, if your data has localized labels, their approach is very compelling.
  48. 48. Levels of supervision
      • Unsupervised: no labeled data (GANs, VAEs, etc.). Where I want to be. LeCun thinks so too.
      • Semi-supervised: some labeled data. What best leverages Shutterstock’s data.
      • Supervised: classification labels (ILSVRC, etc.). Where CAS is currently.
      • Very supervised: classification and localization labels, sometimes even pixel-level segmentation (COCO, etc.). Mai et al.
  49. 49. Challenges to the current system
      • The global average pool encourages a couple of different bad behaviors:
        • It “dilates” the salient regions of the image, such that neurons near the salient concept adopt the salient vectors instead of representing the primary concept of their own receptive field.
        • It creates a hierarchy of vector magnitudes, such that salient concepts have much larger magnitude than less salient concepts. This can allow the network to learn less robust representations of the non-salient image patches.
  50. 50. Thank you! Mike Ranzinger