Words are no longer sufficient for delivering the search results users are looking for, particularly in image search. Text and language pose many challenges in describing visual details and providing the necessary context for optimal results. Machine learning technology opens a new world of search innovation that most businesses have yet to apply.
In this session, Mike Ranzinger of Shutterstock will share a technical presentation detailing his research on composition aware search. He will also demonstrate how the research led to the launch of AI technology that allows users to more precisely find the image they need within Shutterstock’s collection of more than 150 million images. While the company released a number of AI search-enabled tools in 2016, this new technology allows users to search for items in an image and specify where they should be located within the image. The research identifies networks that localize and describe regions of an image as well as the relationships between them. The goal of this research was to improve the future of search using visual data, contextual search functions, and AI. A combination of multiple machine learning technologies led to this breakthrough.
Embed, Encode, Attend, Predict – applying the 4 step NLP recipe for text clas...Sujit Pal
Slides for talk at PyData Seattle 2017 about Matthew Honnibal's 4-step recipe for Deep Learning NLP pipelines. Description of the stages in pipeline as well as 3 examples of document classification, document similarity and sentence similarity. Examples include Keras custom layers for different types of attention.
With the explosive growth of online information, recommender system has been an effective tool to overcome information overload and promote sales. In recent years, deep learning's revolutionary advances in speech recognition, image analysis and natural language processing have gained significant attention. Meanwhile, recent studies also demonstrate its efficacy in coping with information retrieval and recommendation tasks. Applying deep learning techniques into recommender system has been gaining momentum due to its state-of-the-art performance. In this talk, I will present recent development of deep learning based recommender models and highlight some future challenges and open issues of this research field.
Deep Learning in Recommender Systems - RecSys Summer School 2017Balázs Hidasi
This is the presentation accompanying my tutorial about deep learning methods in the recommender systems domain. The tutorial consists of a brief general overview of deep learning and an introduction to the four most prominent research directions of DL in recsys as of 2017. Presented during RecSys Summer School 2017 in Bolzano, Italy.
Deep Learning in Robotics
- There are two major branches in applying deep learning techniques in robotics.
- One is to combine DL with Q-learning algorithms. For example, the awesome work on playing Atari games done by DeepMind is a representative study. While this approach can effectively handle several problems that can hardly be solved via traditional methods, these methods are not appropriate for real manipulators as they often require an enormous amount of training data.
- The other branch of work uses the concept of guided policy search. It combines trajectory optimization methods with supervised learning algorithms like CNNs to come up with a robust 'policy' function that can actually be used on real robots, e.g., Baxter or PR2.
Artificial Intelligence, Machine Learning and Deep LearningSujit Pal
Slides for talk Abhishek Sharma and I gave at the Gennovation tech talks (https://gennovationtalks.com/) at Genesis. The talk was part of outreach for the Deep Learning Enthusiasts meetup group at San Francisco. My part of the talk is covered from slides 19-34.
This is a slide deck from a presentation that my colleague Shirin Glander (https://www.slideshare.net/ShirinGlander/) and I did together. As we created our respective parts of the presentation on our own, it is quite easy to figure out who did which part of the presentation, as the two slide decks look quite different ... :)
For the sake of simplicity and completeness, I just copied the two slide decks together. As I did the "surrounding" part, I added Shirin's part at the place when she took over and then added my concluding slides at the end. Well, I'm sure, you will figure it out easily ... ;)
The presentation was intended to be an introduction to deep learning (DL) for people who are new to the topic. It starts with some DL success stories as motivation. Then a quick classification and a bit of history follows before the "how" part starts.
The first part of the "how" is some theory of DL, to demystify the topic and explain and connect some of the most important terms on the one hand, but also to give an idea of the broadness of the topic on the other hand.
After that the second part dives deeper into the question how to actually implement DL networks. This part starts with coding it all on your own and then moves on to less coding step by step, depending on where you want to start.
The presentation ends with some pitfalls and challenges that you should have in mind if you want to dive deeper into DL - plus the invitation to become part of it.
As always the voice track of the presentation is missing. I hope that the slides are of some use for you, though.
Part 2 of the Deep Learning Fundamentals Series, this session discusses Tuning Training (including hyperparameters, overfitting/underfitting), Training Algorithms (including different learning rates, backpropagation), Optimization (including stochastic gradient descent, momentum, Nesterov Accelerated Gradient, RMSprop, Adaptive algorithms - Adam, Adadelta, etc.), and a primer on Convolutional Neural Networks. The demos included in these slides are running on Keras with TensorFlow backend on Databricks.
Image generation. Gaussian models for human faces, limits and relations with linear neural networks. Generative adversarial networks (GANs), generators, discriminators, adversarial loss and two-player games. Convolutional GAN and image arithmetic. Super-resolution. Nearest-neighbor, bilinear and bicubic interpolation. Image sharpening. Linear inverse problems, Tikhonov and Total-Variation regularization. Super-Resolution CNN, VDSR, Fast SRCNN, SRGAN, perceptual, adversarial and content losses. Style transfer: Gatys model, content loss and style loss.
This presentation introduces Deep Learning (DL) concepts, such as neural networks, backprop, activation functions, and Convolutional Neural Networks, followed by an Angular application that uses TypeScript in order to replicate the TensorFlow playground.
In this talk we explore how to build Machine Learning Systems that can learn "continuously" from their mistakes (feedback loop) and adapt to an evolving data distribution.
The youtube link to video of the talk is here:
https://www.youtube.com/watch?v=VtBvmrmMJaI
An introduction to Machine Learning (and a little bit of Deep Learning)Thomas da Silva Paula
25-min talk about Machine Learning and a little bit of Deep Learning. Starts with some basic definitions (Supervised and Unsupervised Learning). Then, neural networks basic functionality is explained, ending up in Deep Learning and Convolutional Neural Networks.
Machine Learning Meetup that happened in Porto Alegre, Brazil.
A hand-picked list of GAN variants that provided insights to the community (GANs, Improved GANs, DCGAN, Unrolled GAN, InfoGAN, f-GAN, EBGAN, WGAN).
After a short introduction to GANs, we look through the remaining difficulties of standard GANs and their temporary solutions (Improved GANs). By following the slides, we can see the other solutions that tried to resolve the problems in various ways, e.g. careful architecture selection (DCGAN), a slight change in the update (Unrolled GAN), an additional constraint (InfoGAN), generalization of the loss function using various divergences (f-GAN), a new framework of energy-based models (EBGAN), and another step of generalization of the loss function (WGAN).
NIT Silchar ML Hackathon 2019 Session on Computer Vision with Deep Learning.
Targeted audience / prerequisite: basic knowledge of Machine Learning and Deep Learning.
Learning a Joint Embedding Representation for Image Search using Self-supervi...Sujit Pal
Image search interfaces either prompt the searcher to provide a search image (image-to-image search) or a text description of the image (text-to-image search). Image to Image search is generally implemented as a nearest neighbor search in a dense image embedding space, where the embedding is derived from Neural Networks pre-trained on a large image corpus such as ImageNet. Text to image search can be implemented via traditional (TF/IDF or BM25 based) text search against image captions or image tags.
In this presentation, we describe how we fine-tuned the OpenAI CLIP model (available from Hugging Face) to learn a joint image/text embedding representation from naturally occurring image-caption pairs in literature, using contrastive learning. We then show this model in action against a dataset of medical image-caption pairs, using the Vespa search engine to support text based (BM25), vector based (ANN) and hybrid text-to-image and image-to-image search.
DELAB - sequence generation seminar
Title
[Paper Review] Knowing when to look: Adaptive Attention via A Visual Sentinel for Image Captioning
Table of contents
1. Image Captioning
2. Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning
3. Model Architecture
1) Encoder-Decoder for Image Captioning
2) Spatial Attention Model
3) Adaptive Attention Model
4. Results
5. Adaptive Attention Analysis
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16MLconf
Multi-algorithm Ensemble Learning at Scale: Software, Hardware and Algorithmic Approaches: Multi-algorithm ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. The Super Learner algorithm, also known as stacking, combines multiple, typically diverse, base learning algorithms into a single, powerful prediction function through a secondary learning process called metalearning. Although ensemble methods offer superior performance over their singleton counterparts, there is an implicit computational cost to ensembles, as it requires training and cross-validating multiple base learning algorithms.
We will demonstrate a variety of software- and hardware-based approaches that lead to more scalable ensemble learning software, including a highly scalable implementation of stacking called “H2O Ensemble”, built on top of the open source, distributed machine learning platform, H2O. H2O Ensemble scales across multi-node clusters and allows the user to create ensembles of deep neural networks, Gradient Boosting Machines, Random Forest, and others. As for algorithm-based approaches, we will present two algorithmic modifications to the original stacking algorithm that further reduce computation time — Subsemble algorithm and the Online Super Learner algorithm. This talk will also include benchmarks of the implementations of these new stacking variants.
Automatic Attendance using Convolutional Neural Network Face Recognition, by vatsal199567
Automatic Attendance System will recognize the face of the student through the camera in the class and mark the attendance. It was built in Python with Machine Learning.
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016MLconf
Multi-algorithm Ensemble Learning at Scale: Software, Hardware and Algorithmic Approaches: Multi-algorithm ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. The Super Learner algorithm, also known as stacking, combines multiple, typically diverse, base learning algorithms into a single, powerful prediction function through a secondary learning process called metalearning. Although ensemble methods offer superior performance over their singleton counterparts, there is an implicit computational cost to ensembles, as it requires training and cross-validating multiple base learning algorithms.
We will demonstrate a variety of software- and hardware-based approaches that lead to more scalable ensemble learning software, including a highly scalable implementation of stacking called “H2O Ensemble”, built on top of the open source, distributed machine learning platform, H2O. H2O Ensemble scales across multi-node clusters and allows the user to create ensembles of deep neural networks, Gradient Boosting Machines, Random Forest, and others. As for algorithm-based approaches, we will present two algorithmic modifications to the original stacking algorithm that further reduce computation time — Subsemble algorithm and the Online Super Learner algorithm. This talk will also include benchmarks of the implementations of these new stacking variants.
Render 2D graphics in cross-platform mobile apps with HTML5, JavaScript, jQuery and PhoneGap. Learn how to draw in a way that will work across a variety of devices.
Similar to The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017 (20)
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...StampedeCon
Despite widespread adoption and success, most machine learning models remain black boxes. Many times users and practitioners are asked to implicitly trust the results. However, understanding the reasons behind predictions is critical in assessing trust, which is fundamental if one is asked to take action based on such models, or even to compare two similar models. In this talk I will (1) formulate the notion of interpretability of models, (2) provide a review of various attempts and research initiatives to solve this very important problem and (3) demonstrate real industry use cases and results, focusing primarily on Deep Neural Networks.
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017StampedeCon
In many modern applications data are collected in unusual form. Connectome or brain imaging data are graphs. Wearable devices measuring activity are functions over time. In many cases these objects are collected for each individual or transaction leaving the statistician with the challenge of analyzing populations of data not in classical numeric and categorical formats in big spreadsheets. In this talk I introduce object oriented data analysis with an application we recently developed for regression analysis. This talk will be aimed at the general data scientist and emphasis on the concepts and not mathematical detail. The take home message is how can we use covariates (i.e., meta-data) to predict what the structure of a brain image graph will be.
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...StampedeCon
This talk aims to dive into technical details in machine learning model development, implementation and values it bring to Monsanto breeding pipeline. We genotype over 100 million seeds a year in order to save field resources and product development cycle time. Automation and high throughput production from the lab becomes key to R&D success. In house predictive model development incorporated random forest ensemble based approach with additional features derived from gaussian mixture model. The results show over 95% accuracy with less than 1% false positives/negatives. Model is highly generalizable with over 10 million data points being trained and tested on. The model also offers probabilistic approach to present genotypes in a more meaningful way and help enhanced downstream genomics analyses. The talk targets audience who are in breeding, genetics, molecular biology, and data scientists who are interested in practical applications.
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017StampedeCon
While artificial intelligence for self-driving cars and virtual assistants gets a lot of attention, the notion of communicating the needs, effectiveness and measurements of AI is complicated when speaking “geek”! The work of an analyst does not just involve conducting data analysis, but also communicating, championing and speaking simply when talking to the organization, clients and management.
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017StampedeCon
This technical session provides a hands-on introduction to TensorFlow using Keras in the Python programming language. TensorFlow is Google’s scalable, distributed, GPU-powered compute graph engine that machine learning practitioners use for deep learning. Keras provides a Python-based API that makes it easy to create well-known types of neural networks in TensorFlow. Deep learning is a group of exciting new technologies for neural networks. Through a combination of advanced training techniques and neural network architectural components, it is now possible to train neural networks of much greater complexity. Deep learning allows a model to learn hierarchies of information in a way that is similar to the function of the human brain.
Foundations of Machine Learning - StampedeCon AI Summit 2017StampedeCon
This presentation will cover all aspects of modeling, from preparing data to training and evaluating the results. There will be descriptions of the mainline ML methods including neural nets, SVM, boosting, bagging, trees, forests, and deep learning. Common problems of overfitting and dimensionality will be covered, with discussion of modeling best practices. Other topics will include field standardization, encoding categorical variables, and feature creation and selection. It will be a soup-to-nuts overview of all the necessary procedures for building state-of-the-art predictive models.
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...StampedeCon
In this session, we’ll discuss approaches for applying convolutional neural networks to novel computer vision problems, even without having millions of images of your own. Pretrained models and generic image data sets from Google, Kaggle, universities, and other places can be leveraged and adapted to solve industry and business specific problems. We’ll discuss the approaches of transfer learning and fine tuning to help anyone get started on using deep learning to get cutting edge results on their computer vision problems.
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...StampedeCon
Like the story of the six blind men trying to explain the nature of an elephant, current research in cognitive computational systems attempts to identify the nature of an illness, human behavior, or socio-economical phenomenon, from their own perspective.
At present, there is no agreed upon definition for cognitive systems. One large communication corporation defines cognitive systems as a category of technology that uses artificial intelligence, machine learning and reasoning, to enable people and machines to interact more naturally. It also extends and magnifies human expertise and cognition to enable accurate decisions on time. Two of the most famous risk and financial advisory firms agree with that interpretation. A different large corporation, however, considers “cognitive systems” as merely marketing jargon.
If cognitive systems are going to help us solve challenging problems in medicine, economics, or other fields, three aspects must be considered in order to reveal the “true nature of the elephant”.
§ All facets of the problem must be addressed, like the main parts of the elephant had to be touched by the men.
§ These facets must be properly assembled, like the men needed to join hands around the elephant in order to understand what it was.
§ This assembly must be completed within sufficient time to anticipate future decisions. Just like the men needed to know what an elephant is before the next one charges them.
This talk will explain how agnostic (unsupervised, blinded) machine learning findings can be assembled by multiobjective and multimodal optimization research techniques to uncover a multifaceted view of the “elephant”, in this case the human being (e.g., genomic variants, personality traits, brain images). It will also give real-world examples of how this knowledge will “extend the human capabilities” by achieving an integrative assessment of the whole person in relation to their risk, which will allow professionals to generate accurate person-centered policies: from personalized diagnoses, to business opportunities, to the prevention of outbreaks.
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017StampedeCon
This talk will walk through the important building blocks of Automated AI. Rajiv will highlight the current gaps in the analytics organizations, how to close those gaps using automated AI. Some of the issues discussed around automated AI are the accuracy of models, tradeoffs around control when using automation, interpretability of models, and integration with other tools. These issues will be highlighted with examples of automated analytics in different industries. The talk will end with some examples of how automated AI in the hands of data scientists and business analysts is transforming analytic teams and organizations.
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017StampedeCon
Artificial Intelligence has entered a renaissance thanks to rapid progress in domains as diverse as self-driving cars, intelligent assistants, and game play. Underlying this progress is Deep Learning – driven by significant improvements in Graphic Processing Units and computational models inspired by the human brain that excel at capturing structures hidden in massive complex datasets. These techniques have been pioneered at research universities and digital giants but mainstream enterprises are starting to apply them as open source tools and improved hardware become available. Learn how AI is impacting analytics today and in the future.
Learn how AI is affecting the enterprise, including applications like fraud detection, mobile personalization, predicting failures for IoT, and text analysis to improve call center interactions. We look at practical examples of assessing the opportunity for AI, phased adoption, and lessons going from research, to prototype, to scaled production deployment.
A Different Data Science Approach - StampedeCon AI Summit 2017StampedeCon
This session will focus on how to execute Data Science caliber efforts by creating teams with the attributes of Data Science to deliver meaningful results. As Data Scientists are harder to find and keep, this session should appeal to anyone who is either seeking an alternative approach to executing Data Science delivery or augmenting their current Data Science model with additional options.
Graph in Customer 360 - StampedeCon Big Data Conference 2017StampedeCon
Enterprises typically have many data silos of partial customer data, and a common theme in big data projects is to use big data tools and pipelines to unify all siloed customer data into a single, queryable platform for improving all future customer interactions. This data often comes from billing, website traffic, logistics, and marketing; all in different formats with different properties. Graph provides a way to unify all of the data into a single place for use in tracking the flow of a user through the various silos. Graph can also be used for visualizations and analytics that are difficult in other systems.
In this talk we will explore the ways in which Graph can be leveraged in a customer 360 use case. What it can add to a more conventional system and what the approach to developing a graph based Customer 360 system should be.
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017StampedeCon
This talk will go over how to build an end-to-end data processing system in Python, from data ingest, to data analytics, to machine learning, to user presentation. Developments in old and new tools have made this particularly possible today. The talk in particular will talk about Airflow for process workflows, PySpark for data processing, Python data science libraries for machine learning and advanced analytics, and building agile microservices in Python.
System architects, software engineers, data scientists, and business leaders can all benefit from attending the talk. They should learn how to build more agile data processing systems and take away some ideas on how their data systems could be simpler and more powerful.
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017StampedeCon
Big Data doesn’t have to just mean Hadoop any more. Big Data can be done in the cloud, using tools developed by the Cloud providers. This session will cover using Amazon AWS services to implement a Big Data application. We will compare and contrast different services from Amazon with the Hadoop equivalents.
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...StampedeCon
Using big data isn’t about doing the same things we’ve always done just with different technologies. The technology advances that we’ve chosen to label as big data create the opportunity for wholly new kinds of solutions. Two of the key advances that are enabling new business capabilities are cloud-based data management platforms and streaming data processing and analytics.
In this session, Paul Boal will drill into the cloud-based streaming data architecture that has made possible EVŌ, a new breakthrough health and wellness platform. EVŌ uses a game-changing approach that leverages over 60 billion data points and a predictive analytics engine to intervene BEFORE someone becomes critically ill. All of this is possible by leveraging data from smartphones and wearable fitness devices along with advanced analytics which then help users develop and sustain positive behaviors. Attendees will learn how to create a cloud- based architecture that can receive data, apply multiple layers of dynamic business rules, and drive alerts and decisions through real-time stream processing using technologies including web services, Amazon DynamoDB and Kinesis, Drools, and Apache Spark.
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...StampedeCon
The collection and use of Big Data has become an important part of modern business practice. The Internet of Things (IoT) movement promises to provide new opportunities for businesses interested in the intersection of people and technology. It is also wrought with pitfalls for practitioners and researchers who struggle to make sense of an increasing cacophony of signals. How should they poll and collect data from millions of signals in a way that is manageable, scalable, and statistically valid? How should they analyze and predict using these data? This presentation will discuss these challenges with applied examples from monitoring and managing one of the world’s largest computers.
Innovation in the Data Warehouse - StampedeCon 2016StampedeCon
Enterprise Holdings first started with Hadoop as a POC in 2013. Today, we have clusters on premises and in the cloud. This talk will explore our experience with Big Data and outline three common big data architectures (batch, lambda, and kappa). Then, we’ll dive into the decision points necessary for your own cluster, for example: cloud vs on premises, physical vs virtual, workload, and security. These decisions will help you understand what direction to take. Finally, we’ll share some lessons learned about which pieces of our architecture worked well and rant about those which didn’t. No deep Hadoop knowledge is necessary; architect or executive level.
Creating a Data Driven Organization - StampedeCon 2016StampedeCon
Companies today are all focused on finding new consumption models to better utilize the data they produce. This presentation will provide insights and best practices for creating the organization and sponsorship necessary to set the foundation for success.
For this session, Dan will provide an overview of the process and methodologies he employs to establish and sustain a Data Driven Culture. Key topics will include:
Data Driven Culture
Executive Sponsorship
Organizational Structure – Collaboration Hubs and Bi-Modal Analytics
Role of Hadoop and Big Data as Part of Data Driven Culture
Using The Internet of Things for Population Health Management - StampedeCon 2016StampedeCon
The Internet of (Human) Things is just beginning to take shape. The human body is an inexhaustible source of data about personal health, and the healthcare industry is just beginning to scratch the surface of the potential insights and value that will come from that data. While much of healthcare traditionally focuses on the episodic delivery of services, the Affordable Care Act is pushing healthcare providers, payers, and self-funded employer groups to look at ways to proactively encourage healthy behaviors. Providing personal health devices as a way to promote individual health is one way that healthcare is beginning to take advantage of IoT technologies. This session provides insight into how IoT is being leveraged in population health management through a solution jointly delivered by Amitech Solutions and Big Cloud Analytics. Attendees will learn how Hadoop is being used to gather personal device data from various vendors, integrate and analyze that information, differentiate trends across regional and cultural diversity, and provide personal recommendations and insights into health risks. This session presents one important way the healthcare industry is leveraging IoT.
Turn Data Into Actionable Insights - StampedeCon 2016StampedeCon
At Monsanto, emerging technologies such as IoT, advanced imaging and geo-spatial platforms; molecular breeding, ancestry and genomics data sets have made us rethink how we approach developing, deploying, scaling and distributing our software to accelerate predictive and prescriptive decisions. We created a Cloud based Data Science platform for the enterprise to address this need. Our primary goals were to perform analytics@scale and integrate analytics with our core product platforms.
As part of this talk, we will be sharing our journey of transformation showing how we enabled: a collaborative discovery analytics environment for data science teams to perform model development, provisioning data through APIs, streams and deploying models to production through our auto-scaling big-data compute in the cloud to perform streaming, cognitive, predictive, prescriptive, historical and batch analytics@scale, integrating analytics with our core product platforms to turn data into actionable insights.
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source) Copilot?
How can we build one?
Architecture and evaluation
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review a history of some of the changes we have gone over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, or faster batch ingestion.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs.This meetup was formerly Milvus Meetup, and is sponsored by Zilliz maintainers of Milvus.
Adjusting OpenMP PageRank : SHORT REPORT / NOTESSubhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared memory system with multiple CPUs, each with multiple cores, to accelerate pagerank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement pagerank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The search for a new visual search
Mike Ranzinger, Senior Research Engineer @ Shutterstock
3. Outline
• We’re going to explore a new technology that was just released to beta called “Composition Aware Search”
• This involves some key technologies:
  • Convolutional Neural Nets (Vision and NLP)
  • Discriminative Localization
  • Multi-modal Embeddings
  • Dimensionality Reduction
  • Inverted Multi-Index
• Yes, between this presentation and our publicly shared white paper, you should be able to implement this yourself (non-commercially, of course).
6. Domain mismatch
• The saying is: “A picture is worth a thousand words”
• Our average query length is 2 words
• Sometimes it’s hard to describe exactly what you’re looking for
• Our users are accustomed to looking through multiple pages of results to find what they were looking for
8. Reverse image search
• Common problem: I have picture X without a license, and I need to get a license for it
  • Perhaps you saw it on social media, and you wanted to share it more officially
• My toy problem: I took this bad picture, find me a good one!
• We don’t use words at all. We communicate through pixels.
9. How does it work?
[Diagram: a trained CNN maps the query image (“my bike”) to a fixed-length vector, and the same trained CNN maps our collection of bike images to vectors; a maximum inner product search between our collection and the query vector returns the results.]
10. Multimodal embedding / query language models
• We have a vision model that can produce an N-dimensional vector for a given image.
• Train a language model that maps a query to the vector of the downloaded image.
• Training set: query-to-download pairs (e.g., the query “Lemur on rock” paired with the image the user downloaded).
11. Multimodal embedding
• Kiros et al., “Unifying visual-semantic embeddings with multimodal neural language models”
• Trained using a “Triplet Loss”:
  • Let $f(x)$ be the L2-normalized output of the vision model on image $x$
  • Let $g(q)$ be the L2-normalized output of the language model on query $q$
  • Let $q_1$ be the query corresponding to image $x_1$
  • Let $m$ be some margin, $0 < m < 2$
• $L = \max\bigl(0,\; f(x_2) \cdot g(q_1) - f(x_1) \cdot g(q_1) + m\bigr)$
• In words, the dot product between a query and its corresponding image should be greater than that between the query and some unrelated image by some margin $m$.
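To make the triplet objective above concrete, here is a minimal NumPy sketch of the hinge loss for one (query, positive image, negative image) triple. The 256-dimensional vectors and the margin value are illustrative assumptions, not the deck's actual settings.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """L2-normalize a vector."""
    return v / (np.linalg.norm(v) + eps)

def triplet_loss(img_pos, img_neg, query, margin=0.2):
    """L = max(0, f(x2).g(q1) - f(x1).g(q1) + m), where x1 is the image
    downloaded for query q1 and x2 is an unrelated image."""
    f_x1, f_x2, g_q1 = l2_normalize(img_pos), l2_normalize(img_neg), l2_normalize(query)
    return max(0.0, float(f_x2 @ g_q1 - f_x1 @ g_q1 + margin))

# toy example with random 256-d embeddings
rng = np.random.default_rng(0)
print(triplet_loss(rng.normal(size=256), rng.normal(size=256), rng.normal(size=256)))
```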
12. Multimodal embedding
• We train the vision model first.
• Next, we train the language model.
  • We don’t backprop gradients through the vision model because doing so degrades it.
• Once we’ve finished training the language model, we can search for images given a query using MIPS, the same way that we did with reverse image search.
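A brute-force maximum inner product search (MIPS) over a small collection can be sketched as below; the collection size and dimensionality are illustrative, and a production system would use an approximate index rather than scoring every image.

```python
import numpy as np

def mips_search(query_vec, collection, k=5):
    """Score every collection vector against the query by inner product and
    return the indices and scores of the top-k matches (brute force for clarity)."""
    scores = collection @ query_vec            # (num_images,)
    top_k = np.argsort(-scores)[:k]
    return top_k, scores[top_k]

# toy collection: 1,000 unit-normalized 256-d image embeddings
rng = np.random.default_rng(1)
collection = rng.normal(size=(1000, 256))
collection /= np.linalg.norm(collection, axis=1, keepdims=True)
query = rng.normal(size=256)
query /= np.linalg.norm(query)
print(mips_search(query, collection))
```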
14. Spatial feature vectors
• Here’s an example of a “fully convolutional” neural network.
• A fully convolutional network is typically a series of convolutions and downsampling operations that ends with a global average pooling (GAP) operation.
• The GAP reduces the final feature maps down to a single vector (one value per feature map).
• We call a position (y, x) in the final feature maps a “spatial feature vector”.
15. Spatial feature vectors
• Like the feature vector produced by the global average pool, spatial vectors also encode information in the same embedding space.
• Importantly, these vectors tend to encode more localized information based on the receptive field of the given neuron.
• We exploit these vectors to build out CAS.
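A small NumPy sketch of the relationship between the final feature maps, the spatial feature vectors, and the GAP vector (the 256-map, 8x8 grid sizes are illustrative):

```python
import numpy as np

def spatial_vectors_and_gap(feature_maps):
    """feature_maps: (K, H, W) activations of the last conv layer.
    Returns the H*W spatial feature vectors (each of length K) and the
    global-average-pooled vector (one value per feature map)."""
    k, h, w = feature_maps.shape
    spatial = feature_maps.reshape(k, h * w).T   # (H*W, K): one vector per (y, x)
    gap = feature_maps.mean(axis=(1, 2))         # (K,): F_k = average of f_k(y, x)
    return spatial, gap

rng = np.random.default_rng(2)
spatial, gap = spatial_vectors_and_gap(rng.normal(size=(256, 8, 8)))
print(spatial.shape, gap.shape)   # (64, 256) (256,)
```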
16. Discriminative localization
• Zhou et al. introduced a very important paper titled “Learning Deep Features for Discriminative Localization”
• They introduce the concept of “Class Activation Maps” (CAM), which is effectively a heatmap of the classification strength for each output position before the GAP, for a given class.
17. Discriminative localization
• Let $f_k(y, x)$ be the activation of unit $k$ at position $(y, x)$ of the last convolutional layer
• $F_k = \frac{1}{|\{(y, x)\}|} \sum_{y,x} f_k(y, x)$
  • The result of the GAP for unit $k$
• For a given class $c$, the input to the softmax is $S_c = \sum_k w_k^c F_k$
  • In words, the dot product between the GAP features and the learned vector for the given class, where $w_k^c$ is the weight of unit $k$ for class $c$
• Let $M_c(y, x) = \sum_k w_k^c f_k(y, x)$ be the class activation map, or in words, the importance of spatial position $(y, x)$ for the classification of class $c$
  • I’d recommend reading the paper to see the full derivation
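A minimal sketch of the CAM computation $M_c(y, x) = \sum_k w_k^c f_k(y, x)$ in NumPy; the sizes and the random weights standing in for the learned classifier row are assumptions for illustration only.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """M_c(y, x) = sum_k w_k^c * f_k(y, x).
    feature_maps: (K, H, W) last-layer conv activations.
    class_weights: (K,) softmax weights w^c for one class c."""
    return np.tensordot(class_weights, feature_maps, axes=([0], [0]))  # (H, W) heatmap

rng = np.random.default_rng(3)
fmaps = rng.normal(size=(256, 8, 8))
w_c = rng.normal(size=256)
print(class_activation_map(fmaps, w_c).shape)  # (8, 8)
```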
18. What it looks like for us
CAM for the highest-probability guess, which is “meerkat” with probability 40%.
20. Auto-saliency
• Recall that the output of the GAP is $F_k = \frac{1}{|\{(y, x)\}|} \sum_{y,x} f_k(y, x)$
• What if, instead of needing a class $c$, we instead use $F_k$ as the target?
• $M(y, x) = \sum_k F_k \, f_k(y, x)$
• Basically, this tells us how close a given spatial vector is to the average vector. One way to interpret this is, “how salient is the spatial vector to the classification”.
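Auto-saliency is the same computation as the CAM sketch above, with the GAP vector standing in for the class weights; a short illustrative version:

```python
import numpy as np

def auto_saliency(feature_maps):
    """M(y, x) = sum_k F_k * f_k(y, x): use the GAP vector itself as the 'class'
    weights, so no class label is required."""
    gap = feature_maps.mean(axis=(1, 2))                      # F_k
    return np.tensordot(gap, feature_maps, axes=([0], [0]))   # (H, W) heatmap
```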
21. Auto-saliency
Note that “lemur” isn’t actually a class that the network was trained against. The closest class neighbors are meerkat and koala.
25. Auto-saliency
• Why is this important?
  • It allows us to visualize how the network behaves on inputs for classes that it wasn’t explicitly trained on.
• The idea that this works also reveals an open problem for us:
  • In order for the salient vectors to emerge, the non-salient regions of the image must either try to align themselves in the same direction as the salient vector (“dilation”),
  • Or, the non-salient regions must reduce their magnitude so as not to bias the salient vector.
26. Language models as discriminators
• We have now seen how we can use CAMs, as well as the GAP vectors themselves, to guide the heatmaps.
• Finally, we can look back at the language model we trained earlier.
• $L = \max\bigl(0,\; f(x_2) \cdot g(q_1) - f(x_1) \cdot g(q_1) + m\bigr)$
  • The language model learns to match the direction of the GAP
  • In effect, we can use the language model to generate the class weights for the CAM technique on the fly.
  • I think it’s neat to interpret the language model as a low-rank approximator of the (potentially infinite) classification weight matrix.
27. Composition aware search - overview
[Diagram: images in the collection pass through the vision model into a spatial index, and query terms pass through the language model. “Lamp” and “Chair” are called “anchors”, which have both a position and a query string. A texture is also an anchor, with position, size, and a query image.]
28. Models
• Vision Model
  • We are using a variant of the Inception v3 architecture described by Szegedy et al. in “Rethinking the Inception Architecture for Computer Vision”
  • Notable differences:
    • We are not using batch normalization
    • We are using ELU non-linearities instead of ReLUs
• Language Model
  • We tried to be fancy and use cool tech such as character models and LSTMs
  • The character LSTMs massively overfit on us
  • So we used words, and dropped recurrence altogether in favor of a simpler convolutional language model as described by Collobert et al. in “Natural Language Processing (Almost) from Scratch”
29. The search problem
• Let’s look at the query formulation for this, starting with the simple case.
• Let $S(i)$ be the score for image $i$
• Let $\mathbf{Q}$ be the set of anchors, and $\mathbf{q}_j$ be the $j$-th anchor’s L2-normalized (column) vector
• Let $\mathbf{V}_i$ be the set of spatial vectors in image $i$, and $\mathbf{v}_{ip}$ be the $p$-th spatial L2-normalized (column) vector
• Let $w_{jp}$ be a positional weight applied to position $p$ based on the position of query $j$

$$S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j=1}^{|\mathbf{Q}|} \max_{1 \le p \le P} w_{jp} \, \mathbf{v}_{ip}^{\top} \mathbf{q}_j$$
30. The search problem

$$S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j=1}^{|\mathbf{Q}|} \max_{1 \le p \le P} w_{jp} \, \mathbf{v}_{ip}^{\top} \mathbf{q}_j$$

• The outer sum takes the average over the anchors.
• The max means we only care about the largest weighted similarity score. This gives us a single score per query anchor for an image.
• $w_{jp}$ weights the similarity score based on the relative position of the anchor to the spatial vector.
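A direct NumPy sketch of this scoring function for a single image; the array shapes follow the definitions above, while the random inputs and positional weights are purely illustrative.

```python
import numpy as np

def cas_score(spatial_vecs, anchor_vecs, pos_weights):
    """S(i) = (1/|Q|) * sum_j max_p w[j, p] * (v[i, p] . q[j]).
    spatial_vecs: (P, D) L2-normalized spatial vectors of the image.
    anchor_vecs:  (J, D) L2-normalized anchor (query) vectors.
    pos_weights:  (J, P) positional weights w_jp."""
    sims = anchor_vecs @ spatial_vecs.T        # (J, P) dot products v_ip . q_j
    weighted = pos_weights * sims              # apply positional weighting
    return weighted.max(axis=1).mean()         # max over positions, mean over anchors

# illustrative sizes: P = 64 cells (8x8 grid), D = 256, two anchors
rng = np.random.default_rng(4)
V = rng.normal(size=(64, 256)); V /= np.linalg.norm(V, axis=1, keepdims=True)
Q = rng.normal(size=(2, 256));  Q /= np.linalg.norm(Q, axis=1, keepdims=True)
W = rng.random(size=(2, 64))
print(cas_score(V, Q, W))
```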
32. The search problem
• Using the above definition, the size of the index is defined by the following variables:
  • $C$: the size of the collection
  • $D$: the dimensionality of the spatial vectors
  • $P$: the number of spatial positions
• For our (beta) production offering, we have:
  • $C = 10{,}000{,}000$
  • $D = 256$
  • $P = 64$; we use 8 rows and 8 columns
• Requires about 611 GB of space to store the index.
• Algorithm complexity is also $O(C \cdot |\mathbf{Q}| \cdot P \cdot D)$, which is a lot.

$$S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j=1}^{|\mathbf{Q}|} \max_{1 \le p \le P} w_{jp} \, \mathbf{v}_{ip}^{\top} \mathbf{q}_j$$
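As a quick sanity check on the 611 GB figure (assuming the spatial vectors are stored as 32-bit floats, which the slides do not state):

$$C \cdot P \cdot D \cdot 4 \text{ bytes} = 10{,}000{,}000 \times 64 \times 256 \times 4 \text{ B} \approx 655 \text{ GB} \approx 610 \text{ GiB}.$$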
33. Visualizing concepts
Using PCA, we can visualize how the concepts are arranged based on the 2 principal directions.
[Scatter plot of spatial vectors projected onto the two principal directions, with “Cat” and “Background” regions labeled.]
36. Visualizing concepts – reduction methods
• t-SNE was superior at disentangling concepts on a 2d plane
  • Manifold learning technique
  • Popular technique for data visualization
• PCA was still able to do a decent job
  • Linear
• We use PCA because embedding a new point is efficiently computed with a single GEMM
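A sketch of fitting a small basis from an image's spatial vectors and projecting a new vector with a single matrix multiply; the 4-component basis mirrors the N ≈ 4 used later, and the uncentered SVD fit is my own illustrative choice, not necessarily the production procedure.

```python
import numpy as np

def fit_pca_basis(spatial_vecs, n_components=4):
    """Fit an orthonormal basis B (n_components x D) for one image's spatial
    vectors via SVD (uncentered here, so projection is simply B @ v)."""
    _, _, vt = np.linalg.svd(spatial_vecs, full_matrices=False)
    return vt[:n_components]

def project(basis, vec):
    """Embedding a new point is a single matrix multiply (GEMM) with the basis."""
    return basis @ vec

rng = np.random.default_rng(5)
B = fit_pca_basis(rng.normal(size=(64, 256)))
print(project(B, rng.normal(size=256)).shape)   # (4,)
```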
37. The search problem (part 2)
• Since we are performing a dimensionality reduction on the spatial vectors for each image, let’s re-define the search problem.
• Let $\mathbf{B}_i \in \mathbb{R}^{N \times D}$ be the orthonormal basis of the PCA for image $i$ such that we preserve $N$ dimensions, and $N \le D$.
• Let $\mathbf{d}_{ip} = \mathbf{B}_i \mathbf{v}_{ip}$ be the reduced-dimensionality spatial vector for image $i$ at position $p$.

$$S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j=1}^{|\mathbf{Q}|} \max_{1 \le p \le P} w_{jp} \, \mathbf{d}_{ip}^{\top} \left(\mathbf{B}_i \mathbf{q}_j\right)$$
38. The search problem (part 2)

$$S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j=1}^{|\mathbf{Q}|} \max_{1 \le p \le P} w_{jp} \, \mathbf{d}_{ip}^{\top} \left(\mathbf{B}_i \mathbf{q}_j\right)$$

• Project the query vector into the reduced-dimensionality subspace for the image ($\mathbf{B}_i \mathbf{q}_j$).
• Then compute the dot product between the two vectors in the subspace.
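Continuing the earlier scoring sketch, the reduced-dimensionality variant projects each anchor with the image's basis before taking the dot products (shapes follow the definitions above; the inputs are again illustrative).

```python
import numpy as np

def cas_score_reduced(reduced_vecs, basis, anchor_vecs, pos_weights):
    """S(i) = (1/|Q|) * sum_j max_p w[j, p] * (d[i, p] . (B_i q[j])).
    reduced_vecs: (P, N) spatial vectors already projected into the image's subspace.
    basis:        (N, D) orthonormal PCA basis B_i for this image.
    anchor_vecs:  (J, D) L2-normalized anchor vectors.
    pos_weights:  (J, P) positional weights."""
    reduced_queries = anchor_vecs @ basis.T      # (J, N): B_i q_j for every anchor
    sims = reduced_queries @ reduced_vecs.T      # (J, P) dot products in the subspace
    return (pos_weights * sims).max(axis=1).mean()
```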
39. The search problem (part 2)
Naïve definition:

$$S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j=1}^{|\mathbf{Q}|} \max_{1 \le p \le P} w_{jp} \, \mathbf{v}_{ip}^{\top} \mathbf{q}_j$$

• Requires $C \cdot D \cdot P$ storage space
  • 611 GB for 10 million images
• Computation $O(C \cdot |\mathbf{Q}| \cdot P \cdot D)$
  • In practice, $P = 64$, $D = 256$, so $P \cdot D = 16384$

New definition:

$$S(i) = \frac{1}{|\mathbf{Q}|} \sum_{j=1}^{|\mathbf{Q}|} \max_{1 \le p \le P} w_{jp} \, \mathbf{d}_{ip}^{\top} \left(\mathbf{B}_i \mathbf{q}_j\right)$$

• Requires $C \cdot N(D + P)$ storage space
  • $N \approx 4$
  • 48 GB for 10 million images
  • 12.7x reduction
• Computation $O(C \cdot |\mathbf{Q}| \cdot N(D + P))$
  • $N(D + P) = 1280$
  • 12.8x reduction
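The new numbers check out the same way (again assuming 32-bit floats): storing the basis plus the reduced vectors costs $N \cdot D + N \cdot P = N(D+P)$ values per image, so

$$C \cdot N(D+P) \cdot 4 \text{ bytes} = 10{,}000{,}000 \times 4 \times 320 \times 4 \text{ B} \approx 51 \text{ GB} \approx 48 \text{ GiB},$$

and the per-image work ratio is $\frac{P \cdot D}{N(D+P)} = \frac{16384}{1280} = 12.8$.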
40. The search problem (part 3)
• Now the current computational complexity is $O(C \cdot |\mathbf{Q}| \cdot N(D + P))$
• Importantly, this is still intractable because it still processes every image in the collection.
  • Users typically don’t want to wait roughly 7 seconds for a response
• Our best bet is to formulate the problem such that we only process a tiny fraction of $C$
41. Inverted index
• Construction:
  • Select a codebook size $|\mathbf{W}|$
  • Find the $|\mathbf{W}|$ centroids of $\mathbf{C}$ using a K-means-like process
  • Each of these centroids is called a “codeword”
  • Assign each vector in $\mathbf{C}$ to its nearest vector in $\mathbf{W}$
• Inference:
  • Find the $k$ nearest codewords in $\mathbf{W}$ to $\mathbf{q}$
  • Either return all of the vectors in the $k$ codewords, or perhaps find the $k'$ nearest vectors to $\mathbf{q}$ within the codewords.
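A toy inverted index in NumPy: a few rounds of k-means produce the codewords, every vector is bucketed under its nearest codeword, and a query only visits the nearest buckets. The codebook size, iteration count, and brute-force distance computations are illustrative simplifications.

```python
import numpy as np

def build_inverted_index(vectors, num_codewords=16, iters=20, seed=0):
    """Run a small k-means to find codewords, then bucket vector ids by nearest codeword."""
    rng = np.random.default_rng(seed)
    codewords = vectors[rng.choice(len(vectors), num_codewords, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmin(((vectors[:, None] - codewords[None]) ** 2).sum(-1), axis=1)
        for c in range(num_codewords):
            members = vectors[assign == c]
            if len(members):
                codewords[c] = members.mean(axis=0)
    buckets = {c: np.flatnonzero(assign == c) for c in range(num_codewords)}
    return codewords, buckets

def query_inverted_index(q, codewords, buckets, k=2):
    """Visit only the k nearest codewords and return the candidate vector ids."""
    nearest = np.argsort(((codewords - q) ** 2).sum(-1))[:k]
    return np.concatenate([buckets[c] for c in nearest])
```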
42. Inverted multi-index
• Introduced by Babenko and Lempitsky at CVPR 2012
• This technique combines Product Quantization with Inverted Indices
• Construction:
  • Split the dimensions of each vector in $\mathbf{C}$ into $N$ partitions, typically $N = 2$ (dims $1 \to r$ and dims $r + 1 \to D$)
  • For each partition $M$, find $|\mathbf{W}|$ cluster centers, as with the inverted index
  • For each vector in $\mathbf{C}$, assign it to the nearest codeword in each partition independently.
  • This forms a Cartesian product of the codebooks, such that the full codebook is essentially of size $|\mathbf{W}|^N$
• Paper: “The Inverted Multi-Index”
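A sketch of the multi-index assignment step with N = 2: each vector's two halves are quantized independently, and the pair of codeword ids addresses a cell of the Cartesian-product codebook. The per-half codebooks would come from a k-means like the one sketched above; everything here is illustrative rather than the production implementation.

```python
import numpy as np

def imi_assign(vectors, codebook_a, codebook_b):
    """Assign each vector to a (cell_a, cell_b) pair: the nearest codeword of each
    half of its dimensions, i.e. a cell of the product codebook of size |W|^2."""
    half = vectors.shape[1] // 2
    first, second = vectors[:, :half], vectors[:, half:]
    cell_a = np.argmin(((first[:, None] - codebook_a[None]) ** 2).sum(-1), axis=1)
    cell_b = np.argmin(((second[:, None] - codebook_b[None]) ** 2).sum(-1), axis=1)
    return list(zip(cell_a.tolist(), cell_b.tolist()))
```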
43. Inverted multi-index
• Inference:
  • Sort the codebooks in each partition $M$ based on distance to $\mathbf{q}'$
  • Traverse the $N$ codebooks by visiting the nearest codeword $\mathbf{m}$, defined by the sum of the distances from each $\mathbf{q}'$ to each $\mathbf{m}$.
• I strongly recommend reading the paper for this one. It’s hard to explain on a slide.
44. Inverted multi-index
Source: “The Inverted Multi-Index” by Artem Babenko and Victor Lempitsky
Visualization of a set of datapoints, and their respective clusters.
45. The inverted multi-index for CAS
• For the most part, we use the basic formulation of the IMI
  • We use $|\mathbf{W}| = 10000$, which results in 100 million possible codewords
• Except:
  • Each image has $P$ spatial vectors associated with it, so we assign each spatial vector to a cluster independently
    • This is actually the main reason we use the IMI over the regular inverted index: we effectively have $P \cdot C$ vectors to index, and the inverted index doesn’t scale as well.
  • The paper primarily addresses $L_2$ distance, but we use cosine distance
    • Scale the codebook vectors by a constant such that the magnitude of any set of codewords $m_1, m_2, \cdots, m_N$ is 1
46. The inverted multi-index for CAS
• We expand clusters for each query term until we reach a fixed number of images
• We then take the set union of expansions for each query term, and run the previously defined scoring function.
• We look for about 5k images per anchor, so we typically only rank between 0.05% and 0.15% of the collection.
47. Spatial-semantic image search by visual feature synthesis
Mai et al. introduced the above-titled paper at CVPR 2017.
Credit: Mai et al.
48. Mai et al. paper
• Joint effort between Portland State University and Adobe Research
• Their problem space is very similar to CAS
• Key differences:
  • Their language model learns to map all of the non-uniformly sized anchors to a single feature vector, and then search proceeds like a standard nearest-neighbors query.
  • They train their models using a dataset with object localization information (COCO)
• Basically, if your data has localized labels, their approach is very compelling.
49. Levels of supervision
• Unsupervised: no labeled data (GANs, VAEs, etc.). Where I want to be; LeCun thinks so too.
• Semi-supervised: some labeled data. What best leverages Shutterstock’s data.
• Supervised: classification labels (ILSVRC, etc.). Where CAS is currently.
• Very supervised: classification and localization labels, sometimes even pixel-level segmentation (COCO, etc.). Mai et al.
50. Challenges to the current system
• The global average pool encourages a couple of different bad behaviors:
  • It “dilates” the salient regions of the image, such that neurons that are near the salient concept adopt the salient vectors instead of representing the primary concept of their own receptive field
  • It creates a hierarchy of vector magnitudes, such that salient concepts have much larger magnitude than less salient concepts. This can allow the network to learn less-robust representations of the non-salient image patches.
Thank you!
Mike Ranzinger
mranzinger@shutterstock.com
www.Shutterstock.com/labs/compositionsearch