idealo.de offers a price comparison service on millions of products from a wide range of categories. Each day we receive millions of offers that we cannot map to our product catalogue. We started clustering these offers to create new product clusters to ultimately enhance our product catalogue. For this we mainly use two open-source libraries:
Sentence-Transformers to encode the offers into a vector space
Facebook Faiss to perform k-nearest-neighbour search in that vector space
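As a rough sketch of how these two pieces fit together: the `encode` function below is a toy stand-in for `SentenceTransformer.encode` (it builds deterministic character-trigram vectors instead of real sentence embeddings), and `knn` stands in for a Faiss `IndexFlatIP` search, so only the shapes and the flow match the real libraries.

```python
import hashlib
import numpy as np

def encode(titles, dim=64):
    """Toy stand-in for SentenceTransformer.encode: deterministic
    pseudo-embeddings from character trigrams, L2-normalized."""
    vecs = np.zeros((len(titles), dim))
    for i, t in enumerate(titles):
        for j in range(len(t) - 2):
            h = int(hashlib.md5(t[j:j + 3].encode()).hexdigest(), 16)
            vecs[i, h % dim] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.clip(norms, 1e-9, None)

def knn(index_vecs, query_vecs, k):
    """Toy stand-in for faiss.IndexFlatIP.search: exact inner-product top-k."""
    sims = query_vecs @ index_vecs.T            # cosine similarity (unit vectors)
    idx = np.argsort(-sims, axis=1)[:, :k]      # k nearest per query
    return np.take_along_axis(sims, idx, axis=1), idx

offers = ["adidas runfalcon 2.0 black", "adidas runfalcon 2 black",
          "nike air max 90 white"]
emb = encode(offers)
scores, neighbors = knn(emb, emb[:1], k=2)
# the near-duplicate adidas offer is the closest non-identical neighbour
```

In production, the encode step is a fine-tuned Transformer and the search step a Faiss index; the brute-force search here is only feasible at toy scale.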
We will present our results for various optimisation strategies to fine-tune Transformers for our clustering use case. The strategies include Siamese and triplet network architectures, as well as an approach with an additive angular margin loss. Results will also be compared against probabilistic record linkage and a TF-IDF approach.
Further, we will share our lessons learned, e.g. how both libraries make a machine learning engineer's life fairly easy, and how we created informative training data for our best-performing solution.
Joint Keynote at Int. Conference on Knowledge Engineering and Semantic Web and Prague Computer Science Seminar, Prague, September 22, 2016
The challenges of Big Data are frequently explained by dealing with Volume, Velocity, Variety and Veracity. The large variety of data in organizations results from accessing different information systems with heterogeneous schemata or ontologies. In this talk I will present the research efforts that target the management of such broad data.
They include: (i) an integrated development environment for programming with broad data, (ii) a query language that allows for typing of query results, (iii) a typed lambda-calculus based on description logics, and (iv) efficient access to data repositories via schema indices.
(Costless) Software Abstractions for Parallel Architectures
Joel Falcou
Performing large, intensive or non-trivial computing on array-like data structures is one of the most common tasks in scientific computing, video game development and other fields. This is backed up by the large number of tools, languages and libraries for such tasks. If we restrict ourselves to C++-based solutions, more than a dozen such libraries exist, from BLAS/LAPACK C++ bindings to the template-metaprogramming-based Blitz++ or Eigen. While all of these libraries provide good performance or good abstraction, none of them seems to fit the needs of so many different user types.
Moreover, as parallel system complexity grows, maintaining all those components quickly becomes unwieldy. This talk explores various software design techniques - like Generative Programming, Metaprogramming and Generic Programming - and their application to the implementation of a parallel computing library in such a way that:
- abstraction and expressiveness are maximized
- cost over efficiency is minimized
We'll skim over various applications and see how they can benefit from such tools. We will conclude by discussing what lessons were learnt from this kind of implementation and how those lessons can translate into new directions for the language itself.
Google APAC Machine Learning Day was a two-day machine learning workshop held by Google at its Singapore office in early March this year. For this meetup, we have invited Evan Lin, who attended the event, and his colleague Benjamin Chen to share their takeaways, including:
Tensorflow Summit RECAP
What they saw and heard at Machine Learning Expert Day
How Linker Networks uses Tensorflow
https://gdg-taipei.kktix.cc/events/google-apac-machine-learning-day
Use Case Patterns for LLM Applications (1).pdf
M Waleed Kadous
What are the "use case patterns" for deploying LLMs into production? Understanding these will allow you to spot "LLM-shaped" problems in your own industry.
KantanMT Founder and Chief Architect Tony O'Dowd and Technical Project Manager Louise Faherty show you how to improve your team's translation productivity and better manage post-editing effort and translation project schedules with powerful Machine Translation engines.
You will learn:
• How to deal with Translation challenges
• About the necessity of Machine Translation to be competitive
• How KantanMT.com can be integrated with existing Translation Management Systems
ML in the Browser: Interactive Experiences with Tensorflow.js
C4Media
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/39SddUL.
Victor Dibia provides a friendly introduction to machine learning and covers concrete steps on how front-end developers can create their own ML models and deploy them as part of web applications. He discusses his experience building Handtrack.js, a library for prototyping real-time hand-tracking interactions in the browser. Filmed at qconsf.com.
Victor Dibia is a Research Engineer with Cloudera’s Fast Forward Labs. Prior to this, he was a Research Staff Member at the IBM TJ Watson Research Center, New York. His research interests are at the intersection of human computer interaction, computational social science, and applied AI.
Natural language understanding is a fundamental task in artificial intelligence. English understanding has reached a mature state and has been successfully deployed in multiple IBM AI products and services, such as Watson Natural Language Understanding and Watson Discovery. However, scaling existing products and services to support additional languages remains an open challenge. In this talk, we will discuss the open challenges in supporting universal natural language understanding and share our work over the past few years in addressing them. We will also showcase how a universal semantic representation of natural languages can enable cross-lingual information extraction in concrete domains (e.g. compliance), and show ongoing efforts toward seamlessly scaling existing NLP capabilities across languages with minimal effort.
Applying NLP to product comparison at Visual Meta
Ross Turner
Talk given on NLP at the Elasticsearch meetup in Berlin in February 2017. Discusses word embeddings for product classification, generation of product descriptions and chat bots.
Dmitry Kan, Principal AI Scientist at Silo AI and host of the Vector Podcast, will give an overview of the landscape of vector search databases and their role in NLP, along with the latest news and his view on the future of vector search. Further, he will share how he and his team participated in the Billion-Scale Approximate Nearest Neighbor Challenge and improved recall by 12% over a FAISS baseline.
Presented at https://www.meetup.com/open-nlp-meetup/events/282678520/
YouTube: https://www.youtube.com/watch?v=RM0uuMiqO8s&t=179s
Follow Vector Podcast to stay up to date on this topic: https://www.youtube.com/@VectorPodcast
Learn how the age of language models in NLP can be used and how it applies to you in the real world.
You will learn about word embeddings, sequence modelling, advanced language models, and NLP attention mechanisms. All resources are available for you to grow your knowledge and skills in this Natural Language Processing webinar.
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
MLconf
Spark and GraphX in the Netflix Recommender System: We at Netflix strive to deliver maximum enjoyment and entertainment to our millions of members across the world. We do so by having great content and by constantly innovating on our product. A key strategy to optimize both is to follow a data-driven method. Data allows us to find optimal approaches to applications such as content buying or our renowned personalization algorithms. But, in order to learn from this data, we need to be smart about the algorithms we use, how we apply them, and how we can scale them to our volume of data (over 50 million members and 5 billion hours streamed over three months). In this talk we describe how Spark and GraphX can be leveraged to address some of our scale challenges. In particular, we share insights and lessons learned on how to run large probabilistic clustering and graph diffusion algorithms on top of GraphX, making it possible to apply them at Netflix scale.
Using Crowdsourced Images to Create Image Recognition Models with Analytics Z...
Maurice Nsabimana
Volunteers around the world increasingly act as human sensors to collect millions of data points. A team from the World Bank trained deep learning models, using Apache Spark and BigDL, to confirm that photos gathered through a crowdsourced data collection pilot matched the goods for which observations were submitted.
In this talk, Maurice Nsabimana, a statistician at the World Bank, and Jiao Wang, a software engineer on the Big Data Technology team at Intel, demonstrate a collaborative project to design and train large-scale deep learning models using crowdsourced images from around the world. BigDL is a distributed deep learning library designed from the ground up to run natively on Apache Spark. It enables data engineers and scientists to write deep learning applications in Scala or Python as standard Spark programs, without having to explicitly manage distributed computations. Attendees of this session will learn how to get started with BigDL, which runs in any Apache Spark environment, whether on-premises or in the cloud.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
M Capital Group (“MCG”) expects demand to grow and supply to keep evolving, facilitated by institutional investment rotating out of offices and into work-from-home (“WFH”) assets, while the need for data storage keeps expanding alongside global internet usage, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as maturing cloud services and edge sites, allowing the industry to expect strong annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, exemplified by the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has made key adjustments, and MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment, will drive market momentum forward. The continuous injection of capital by alternative investment firms, as well as growing infrastructure investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x by value in 2026, will likely help propel data center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Transformer_Clustering_PyData_2022.pdf
1. Transformer based clustering: Identifying product clusters for E-commerce
Dat Tran - Head of Data Science
Christopher Lennan
Sebastian Wanner
13/04/2022 PyConDE & PyData Berlin
3. idealo key facts
- More than 20 years of experience
- 900+ "idealos" from 40 nations
- Active in 6 different countries (DE, AT, ES, IT, FR, UK)
- 18 million visitors/month
- 50,000 shops
- Over 330 million offers and 2 million products
- Germany's 4th largest eCommerce website
11. Results: probabilistic record linkage (splink)
Dataset: 10k products (shoe category), ⌀ 17 offers per product
- Rule-based linkage: precision 👍, recall 👎
- Scaling the ruleset is a challenge
- No exhaustive hyper-parameter tuning performed
https://github.com/moj-analytical-services/splink
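splink implements the Fellegi-Sunter model of probabilistic record linkage. As a rough illustration of the idea (not splink's actual API), a match score can be computed by summing per-field log2 weights for agreement or disagreement; the m- and u-probabilities below are made-up values, where real ones would be estimated from data (e.g. by EM).

```python
import math

# Hypothetical per-field probabilities: m = P(field agrees | records match),
# u = P(field agrees | records do not match).
FIELDS = {
    "brand":  {"m": 0.95, "u": 0.20},
    "title":  {"m": 0.80, "u": 0.01},
    "colour": {"m": 0.90, "u": 0.25},
}

def match_weight(rec_a, rec_b):
    """Sum of per-field log2 Bayes factors (Fellegi-Sunter match weight)."""
    total = 0.0
    for field, p in FIELDS.items():
        if rec_a.get(field) == rec_b.get(field):
            total += math.log2(p["m"] / p["u"])              # agreement weight
        else:
            total += math.log2((1 - p["m"]) / (1 - p["u"]))  # disagreement weight
    return total

a = {"brand": "adidas", "title": "runfalcon 2.0", "colour": "black"}
b = {"brand": "adidas", "title": "runfalcon 2.0", "colour": "core black"}
c = {"brand": "nike",   "title": "air max 90",    "colour": "white"}
# a/b agree on two fields -> positive weight; a/c agree on none -> negative
```

A threshold on this weight then decides which offer pairs are linked, which is where the precision/recall trade-off above comes from.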
12. Embeddings based clustering
Pipeline: Offers → Transformer encoders (ML model) → offers as vectors → KNN clustering → product clusters (e.g. cluster A)
- Text attributes (e.g. EAN: 123, title: abc, colour: lmn) are used as features
- The model outputs embeddings
- KNN clustering groups similar vectors into cluster embeddings
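The final "cluster similar vectors" step can be sketched as connecting every pair of offer vectors whose cosine similarity exceeds a threshold and taking connected components. This is a simplified stand-in for the KNN-based clustering in the talk, with toy vectors and a union-find over the similarity graph.

```python
import numpy as np

def cluster_by_threshold(vecs, threshold=0.9):
    """Connect offer pairs with cosine similarity above `threshold`
    and return connected components as cluster labels (union-find)."""
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    parent = list(range(len(vecs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            if sims[i, j] >= threshold:
                parent[find(i)] = find(j)   # union the two components
    return [find(i) for i in range(len(vecs))]

# two offers of the same product plus one unrelated offer
vecs = np.array([[1.0, 2.0, 3.0], [1.1, 2.0, 3.0], [-3.0, 1.0, 0.5]])
labels = cluster_by_threshold(vecs)
# first two offers land in the same cluster, the third in its own
```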
14. Transfer Learning with Transformers
Learn one task, transfer knowledge to a new task: pretraining, then fine-tuning.
Training objective: masked language modelling on unlabeled text data
- Sentence: Where are we [MASK]
- Label: going
Unlabeled text data → pretrained model
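The masked-language-modelling objective above turns raw text into self-supervised training examples. A toy generator that masks one random token per sentence (real MLM pretraining masks ~15% of subword tokens, not whole words):

```python
import random

def make_mlm_example(sentence, rng=random):
    """Turn a raw sentence into an MLM training example by replacing
    one randomly chosen token with [MASK]; the hidden token is the label."""
    tokens = sentence.split()
    pos = rng.randrange(len(tokens))
    label = tokens[pos]
    tokens[pos] = "[MASK]"
    return " ".join(tokens), label

masked, label = make_mlm_example("Where are we going", random.Random(0))
# e.g. ("Where are we [MASK]", "going"), depending on the sampled position
```

Because the labels come for free from the text itself, this objective scales to the unlabeled corpora used for pretraining.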
15.-17. Transfer Learning with Transformers
Leverage large scale pre-trained language models:
Pre-training → microsoft/mpnet-base
- Transformer encoder with 110M parameters
- 160 GB uncompressed texts (five English-language corpora)
- Training time: 35 days on 32 GPUs
Fine-tuning → sentence-transformers/all-mpnet-base-v2
- Trained on 1.2 billion English sentence pairs
- Transferred to 100+ languages through Multi-Lingual Knowledge Distillation
Fine-tuning → idealo-offer-clustering
- Trained on >5 million idealo offer pairs
- Training time: 28 hours on an NVIDIA V100 GPU
19. Siamese Networks
Train on positive and negative training pairs (label: 1 = similar, 0 = not similar).
Clustering performance before fine-tuning: 0.58; after fine-tuning: 0.76 (+18 pp)
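The Siamese objective behind these numbers is typically a contrastive loss: pull embeddings of similar offers (label 1) together and push dissimilar ones (label 0) at least a margin apart. A numpy sketch of the loss itself (the margin value and embeddings are illustrative):

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, labels, margin=1.0):
    """Contrastive loss for a Siamese network:
    label 1 (similar)    -> penalize squared distance,
    label 0 (dissimilar) -> penalize being closer than `margin`."""
    d = np.linalg.norm(emb_a - emb_b, axis=1)               # pairwise distances
    pos = labels * d ** 2                                   # similar: push d -> 0
    neg = (1 - labels) * np.maximum(0.0, margin - d) ** 2   # dissimilar: d >= margin
    return np.mean(pos + neg)

a = np.array([[0.0, 0.0], [0.0, 0.0]])
b = np.array([[0.1, 0.0], [2.0, 0.0]])
labels = np.array([1.0, 0.0])   # first pair similar, second dissimilar
loss = contrastive_loss(a, b, labels)
# only the similar pair contributes here: the dissimilar pair is already
# farther apart than the margin
```

Sentence-Transformers ships ready-made loss implementations of this family, so in practice only the training pairs need to be supplied.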
20. Sentence Transformers
- Provides access to language models fine-tuned on 1 billion sentence pairs
- Integrated with the Hugging Face Model Hub
- Multilingual models available, support for 100+ languages
- 10+ loss functions implemented and ready to use
23. Generate Training Pairs
Choose positive pairs and negative pairs randomly. Lessons learned:
- Randomly selected negative pairs are too easy for the model.
- Random negative pairs do not contribute much to training progress.
- The model quickly converges and performance is not good enough.
24. Generate Training Pairs
Select hard-negative pairs (offline strategy), repeated every epoch:
1. Compute embeddings
2. Average the embeddings for each product cluster
3. Search for neighbors
4. Generate pairs
5. Training
Result: +6 pp
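A numpy sketch of a simplified variant of this hard-negative step: the slide averages embeddings per product cluster before searching, while here each offer directly looks up its most similar offer from a *different* cluster as its hard negative (toy data; the real pipeline repeats this each epoch at million-pair scale).

```python
import numpy as np

def mine_hard_negatives(embeddings, cluster_ids):
    """For each offer, return the index of the most similar offer that
    belongs to a different product cluster (a 'hard' negative)."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb.T
    ids = np.array(cluster_ids)
    hard = []
    for i, cid in enumerate(cluster_ids):
        sims_i = sims[i].copy()
        sims_i[ids == cid] = -np.inf        # mask the offer's own cluster
        hard.append(int(np.argmax(sims_i))) # most similar outsider
    return hard

emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 1.0]])
clusters = [0, 0, 1, 1]                     # product cluster per offer
negatives = mine_hard_negatives(emb, clusters)
# each offer is paired with its nearest offer from the other cluster
```

Pairs built this way stay hard enough to keep contributing gradient, which is what the random pairs on the previous slide failed to do.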
26. Building product clusters
Find the K nearest neighbors (K=10) and apply a similarity threshold.
Challenges:
- Scale to millions of vector searches
- Search quality is important
- Search time should be small
27. Faiss, built by Facebook Research
- Scales to billions of vectors ("Billion-Scale Similarity Search" paper)
- Native distributed GPU support
- Out-of-the-box optimization strategies:
  - Compressed representation using product quantization methods
  - Approximate nearest neighbor search
Performance: index size 25 GB; > 13 million vectors; NVIDIA V100 (multi-GPU); 4.3 hrs (⌀ 1.2 ms per vector)
Source: https://github.com/facebookresearch/faiss/wiki
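The product quantization named above is what makes 13+ million vectors fit in 25 GB: split each vector into sub-vectors, learn a small codebook per sub-space, and store only the code indices. A toy numpy version (a few Lloyd k-means iterations per sub-space, not Faiss's actual implementation), where 32 float32 dims (128 bytes) compress to 4 one-byte codes:

```python
import numpy as np

def train_pq(data, n_sub=4, n_codes=16, iters=5, seed=0):
    """Learn one k-means codebook per sub-space (toy product quantizer)."""
    rng = np.random.default_rng(seed)
    sub_dim = data.shape[1] // n_sub
    codebooks = []
    for s in range(n_sub):
        sub = data[:, s * sub_dim:(s + 1) * sub_dim]
        centers = sub[rng.choice(len(sub), n_codes, replace=False)]
        for _ in range(iters):  # Lloyd iterations
            assign = np.argmin(((sub[:, None] - centers) ** 2).sum(-1), axis=1)
            for c in range(n_codes):
                if np.any(assign == c):
                    centers[c] = sub[assign == c].mean(axis=0)
        codebooks.append(centers)
    return codebooks

def pq_encode(vecs, codebooks):
    """Compress each vector to one uint8 code index per sub-space."""
    sub_dim = vecs.shape[1] // len(codebooks)
    codes = np.empty((len(vecs), len(codebooks)), dtype=np.uint8)
    for s, centers in enumerate(codebooks):
        sub = vecs[:, s * sub_dim:(s + 1) * sub_dim]
        codes[:, s] = np.argmin(((sub[:, None] - centers) ** 2).sum(-1), axis=1)
    return codes

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 32)).astype(np.float32)
books = train_pq(data, n_sub=4, n_codes=16)
codes = pq_encode(data, books)
# 128 bytes per vector compressed to 4 bytes per vector (plus the codebooks)
```

Distances are then approximated against the codebook centers, trading a little recall for a large memory and speed win, which is the trade-off the talk's ANN setting also makes.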