As a leading fashion and lifestyle e-commerce company in the Netherlands, Wehkamp is dedicated to providing a better shopping experience for its customers. Using Spark, the data science team is able to develop various machine-learning projects that improve the shopping experience.
One of these applications is a service for retrieving visually similar products, which can then be used to show substitute products, to build visual recommenders, and to improve the overall recommendation system. In this project, Spark is used throughout the entire pipeline: retrieving and processing the image data, training models in a distributed fashion with Tensorflow, extracting image features, and computing similarity. In this talk, we demonstrate how Spark and Databricks enable a small team to unify data and AI workflows, develop a pipeline for visual similarity, and train dedicated neural network models.
2. Retrieving visually similar products for Shopping Recommendations using Spark and Tensorflow
Zhichao Zhong, Wehkamp
#UnifiedDataAnalytics #SparkAISummit
5. About wehkamp
wehkamp: the online department store for families in the Netherlands.
• > 400,000 products
• > 500,000 daily visitors
• € 661 million sales 18/19
• 11 million packages 18/19
• 67 years' history
6. About wehkamp
1952 - first advertisement
1955 - first catalog
1995 - first steps online
2010 - completely online
2018 - mobile first
2019 - a great shop experience
7. Data science at wehkamp
Use data science to improve the online shopping experience for customers:
• Search ranking
• Recommendation system
• Personalization
• Visual similarity
• And many others ...
8. Visual similarity
Visuals are important for shopping, especially for fashion (our largest category). People look at look-alike items when shopping.
Visual similarity: to retrieve similar items based on images.
9. Use cases
Use case: to show substitutes for out-of-stock items in the look.
[figure: an out-of-stock item next to a visually similar substitute]
10. Use cases
Use case: to show similar items together on the products overview page.
11. Use cases
Use case: to recommend similar items for newly onboarded items (the cold-start problem).
12. Steps for visual similarity
How to retrieve visually similar items?
Step 1: Extract image embeddings.
Step 2: Search for similar embeddings.
[figure: images → CNN → embedding vectors → similarity search]
13. Image embedding
Image embedding: a low-dimensional vector representation of an image that contains abstract information.
[figure: a 512⨉512⨉3 image mapped to a 256-dimensional embedding vector]
14. Image embedding
Use a convolutional neural network (CNN) to extract the embeddings.
[figure: CNN with convolutional/pooling/activation layers, a fully-connected layer producing the embedding, and a prediction head]
15. Transfer learning
Use a pre-trained model, or train a model from scratch?
• Adopt the VGG16 model pre-trained on the ImageNet dataset (natural images).
• Replace the fully-connected layers.
• Train the fully-connected layers on our own dataset.
[figure: 224⨉224⨉3 image → convolutional layers from VGG16 → FC layer (4096) → FC layer (512) → 512⨉1 embedding]
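The transfer-learning setup above can be sketched in tf.keras roughly as follows. This is an illustrative sketch, not the talk's actual code: the function name and the frozen-base choice are assumptions, and in practice `weights="imagenet"` would be passed to load the pre-trained convolutional layers.

```python
# Hedged sketch: VGG16 convolutional base with its fully-connected layers
# replaced by a new FC head that produces the 512-d embedding.
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

def build_embedding_model(weights=None):
    # Convolutional base from VGG16, without its original FC layers.
    # Use weights="imagenet" to load the pre-trained weights in practice.
    base = VGG16(include_top=False, weights=weights, input_shape=(224, 224, 3))
    base.trainable = False                        # train only the new FC head
    x = layers.Flatten()(base.output)
    x = layers.Dense(4096, activation="relu")(x)  # replaced FC layer (4096)
    embedding = layers.Dense(512)(x)              # replaced FC layer (512) -> embedding
    return Model(base.input, embedding)

model = build_embedding_model()
```

Only the new dense layers are trained on the fashion dataset; the ImageNet-trained convolutional filters are reused as-is.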
16. Triplet loss
Data triplet: anchor image, positive image, and negative image.
Triplet loss:
L = max( ‖f(a) − f(p)‖² − ‖f(a) − f(n)‖² + ɑ, 0 )
Similarity is defined by the Euclidean distance. f(a), f(p), f(n) are the embeddings for the anchor, positive, and negative images respectively; ɑ is the margin.
[figure: anchor, positive and negative images]
FaceNet: A Unified Embedding for Face Recognition and Clustering, F. Schroff et al. (2015)
17. Triplet loss
Minimize the triplet loss: learning pulls the positive image closer to the anchor and pushes the negative image further away, by at least the margin ɑ.
[figure: anchor, positive and negative images before and after learning]
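The triplet loss above can be written out in a few lines of NumPy. This is an illustrative sketch, not the talk's Tensorflow implementation; the margin value is an assumption.

```python
# Batched triplet loss: max(||f(a)-f(p)||^2 - ||f(a)-f(n)||^2 + alpha, 0).
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    d_pos = np.sum((f_a - f_p) ** 2, axis=1)  # squared anchor-positive distance
    d_neg = np.sum((f_a - f_n) ** 2, axis=1)  # squared anchor-negative distance
    return np.mean(np.maximum(d_pos - d_neg + alpha, 0.0))

# A triplet that already satisfies the margin contributes zero loss:
a = np.array([[0.0, 0.0]])
p = np.array([[0.1, 0.0]])   # close to the anchor
n = np.array([[3.0, 0.0]])   # far from the anchor
print(triplet_loss(a, p, n))  # → 0.0
```

Swapping the positive and negative images makes the loss positive, which is exactly the signal that drives the embeddings apart during training.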
18. Siamese network
Siamese network: identical CNNs take two or more inputs.
[figure: three identical CNNs feeding the triplet loss]
19. Data preparation
Similar product images are put in the same group.
Sample triplets:
• sample 2 images from the same group as the anchor and positive images.
• sample 1 image from other groups as the negative image.
3,500 images => 56,000 triplets
[figure: anchor, positive and negative images]
FaceNet: A Unified Embedding for Face Recognition and Clustering, F. Schroff et al. (2015)
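The sampling scheme above can be sketched in plain Python. The group names and image ids below are illustrative placeholders, not from the talk.

```python
# Hedged sketch: sample one triplet from images grouped by visual similarity.
import random

def sample_triplet(groups):
    """groups: dict mapping group id -> list of image ids."""
    usable = [g for g, imgs in groups.items() if len(imgs) >= 2]
    g_pos = random.choice(usable)
    anchor, positive = random.sample(groups[g_pos], 2)       # same group
    g_neg = random.choice([g for g in groups if g != g_pos]) # any other group
    negative = random.choice(groups[g_neg])
    return anchor, positive, negative

groups = {
    "dress_a": ["img1", "img2", "img3"],
    "dress_b": ["img4", "img5"],
    "shoe_a": ["img6"],
}
a, p, n = sample_triplet(groups)
```

Repeating this over all group/negative combinations is how a few thousand grouped images expand into tens of thousands of training triplets.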
21. Model training
• 50 epochs, 29 hours on an NVIDIA K80 GPU.
• How can we scale up the model training to
– fit more data,
– fine-tune the hyperparameters quickly?
Use distributed training to speed up the training!
22. Distributed training
• Distributed training framework: Horovod by Uber.
– Good scaling efficiency.
– Minimal code modification.
• Training API: HorovodRunner on Databricks, integrated with Spark's barrier mode.
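The HorovodRunner pattern looks roughly as follows. This is a hedged sketch, not the talk's code: the training function is self-contained (imports inside) because HorovodRunner pickles it and ships it to the workers, and `build_model` and the dataset are illustrative placeholders.

```python
# Hedged sketch of distributed Keras training with Horovod + HorovodRunner.

def train_hvd(learning_rate=0.001):
    import horovod.tensorflow.keras as hvd
    import tensorflow as tf

    hvd.init()  # one process per GPU
    # Scale the learning rate by the number of workers, as Horovod recommends.
    opt = tf.keras.optimizers.Adam(learning_rate * hvd.size())
    opt = hvd.DistributedOptimizer(opt)  # averages gradients across workers
    # model = build_model()  # placeholder: the embedding CNN from earlier
    # model.compile(optimizer=opt, loss=triplet_loss)
    # model.fit(dataset, epochs=50,
    #           callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)])

# On Databricks (not runnable locally):
# from sparkdl import HorovodRunner
# hr = HorovodRunner(np=4)   # np = number of worker processes/GPUs
# hr.run(train_hvd, learning_rate=0.001)
```

The "minimal code modification" claim is visible here: the single-machine training loop survives almost unchanged, with only the optimizer wrapping and the broadcast callback added.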
26. Steps for visual similarity
How to retrieve visually similar items?
Step 1: Extract image embeddings.
• Train a model on our own dataset.
• From single-machine to distributed training.
Step 2: Search for similar embeddings.
[figure: images → CNN → embedding vectors → similarity search]
27. Similar items retrieval
• Brute-force search can be expensive and slow for a large amount of high-dimensional data.
• We use the approximate similarity search implemented in Spark:
• Hash step: hash similar embeddings into the same buckets using locality sensitive hashing (LSH).
• Search step: only search embeddings in the same buckets, using the Euclidean distance.
Hashing for Similarity Search: A Survey, J. Wang et al. (2014)
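The two-step search can be sketched in NumPy. This is illustrative only; Spark's actual implementation is `BucketedRandomProjectionLSH` in `pyspark.ml.feature`, and the database size, dimensionality, and bucket length below are assumptions.

```python
# Hedged sketch: hash step (project onto a random unit vector, quantize into
# buckets), then search step (brute-force only within the query's bucket).
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 64))   # toy embedding database
v = rng.normal(size=64)
v /= np.linalg.norm(v)                     # random unit vector
r = 2.0                                    # bucket length

def bucket(x):
    return int(np.floor(x @ v / r))

buckets = {}
for i, e in enumerate(embeddings):
    buckets.setdefault(bucket(e), []).append(i)

def approx_nearest(query):
    # Compare only against embeddings that hashed into the same bucket.
    candidates = buckets.get(bucket(query), [])
    if not candidates:
        return None
    dists = np.linalg.norm(embeddings[candidates] - query, axis=1)
    return candidates[int(np.argmin(dists))]

idx = approx_nearest(embeddings[42])  # an item is its own nearest neighbor
```

The search step touches only a bucket's worth of vectors instead of the full database, which is where the speed-up over brute force comes from.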
28. Locality sensitive hashing
LSH hashes high-dimensional vectors with a small distance into the same buckets with a high probability.
The hash function for the Euclidean distance is:
h(x) = ⌊(x · v) / r⌋, where v is a random unit vector and r is the bucket length.
Example:
v = [0.44, 0.90], r = 2
x1 = [2.0, 2.0], h(x1) = 1
x2 = [2.0, 3.0], h(x2) = 1
x3 = [0.0, 5.0], h(x3) = 2
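The hash function can be checked directly against the slide's example values:

```python
# h(x) = floor((x · v) / r): projection onto v, quantized into buckets of
# length r. The vectors and bucket length are the slide's example values.
import numpy as np

def h(x, v, r):
    return int(np.floor(np.dot(x, v) / r))

v = np.array([0.44, 0.90])
r = 2.0
print(h([2.0, 2.0], v, r))  # → 1
print(h([2.0, 3.0], v, r))  # → 1  (close to x1, lands in the same bucket)
print(h([0.0, 5.0], v, r))  # → 2  (further away, lands in a different bucket)
```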
29. Parameters in LSH
Two parameters:
• bucketLength: the length of each hash bucket.
• numHashTables: the number of hash tables.
Both trade accuracy against query performance: a larger bucketLength or more hash tables increase accuracy at the cost of query performance.
[figure: hash tables h1 h2 ... hn-1 hn]
34. Summary
1. Visual similarity applications at Wehkamp.
2. Embedding CNN model trained on our own dataset:
– Improved accuracy.
– Reduced embedding size.
– Distributed training enabled by Horovod and Databricks.
3. Approximate similarity search with Spark LSH.
4. Future work:
– Higher accuracy enabled by a larger dataset.
– Binary embeddings to speed up the search process.
– Image embeddings as part of product features.
35. DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT