Dat Tran - Head of Data Science
Transformer based clustering:
Identifying product clusters for E-commerce
Christopher Lennan
Sebastian Wanner
13/04/2022 PyConDE & PyData Berlin
Sebastian Wanner
Senior ML Engineer
Christopher Lennan
Lead ML Engineer
idealo key facts
• More than 20 years of experience
• 900+ "idealos" from 40 nations
• Active in 6 countries (DE, AT, ES, IT, FR, UK)
• 18 million visitors/month
• 50,000 shops
• Over 330 million offers and 2 million products
• Germany's 4th largest eCommerce website
idealo product catalogue
Problem: the vast majority of offers are not mapped to the product catalogue!
idealo open catalogue
Cluster A
Cluster B
Cluster C
idealo open catalogue
Offer clustering – EAN matching
Offers (EAN): 123, 123, 123, null, 321, null, 234
Clusters: Cluster A, Cluster B, Cluster C
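EAN matching can be sketched as a plain group-by on the EAN field. The offer ids and the dictionary field names below are illustrative assumptions, not idealo's data model. Offers without an EAN cannot be assigned this way, which is the gap the text-based approach on the following slides targets.

```python
from collections import defaultdict


def cluster_by_ean(offers):
    """Group offer ids by EAN; offers with a missing EAN stay unassigned."""
    clusters = defaultdict(list)
    unassigned = []
    for offer in offers:
        ean = offer.get("ean")
        if ean is None:
            unassigned.append(offer["id"])
        else:
            clusters[ean].append(offer["id"])
    return dict(clusters), unassigned


# Same EAN values as on the slide; the ids are hypothetical.
offers = [
    {"id": 1, "ean": "123"},
    {"id": 2, "ean": "123"},
    {"id": 3, "ean": "123"},
    {"id": 4, "ean": None},
    {"id": 5, "ean": "321"},
    {"id": 6, "ean": None},
    {"id": 7, "ean": "234"},
]
clusters, unassigned = cluster_by_ean(offers)
```

Here offers 4 and 6 end up in `unassigned` because their EAN is null.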
idealo open catalogue
Offer clustering – ML on text attributes
Offers:
EAN: 123 | title: abc | colour: lmn
EAN: 123 | title: abc | colour: lmn
EAN: 123 | title: cde | colour: stu
EAN: null | title: cde | colour: null
EAN: 321 | title: cd-e | colour: stu
EAN: null | title: bc d | colour: mno
EAN: 234 | title: bcd | colour: null
Clusters: Cluster A, Cluster B, Cluster C
So we tried various ML approaches ...
Results
Dataset: 10k products (shoe category), ⌀ 17 offers per product
* no exhaustive hyper-parameter tuning performed
[Figure: results chart; surviving labels: scaling, ruleset, precision 👍, recall 👎]
https://github.com/moj-analytical-services/splink
Embeddings based clustering
Offers (text attributes as features) → ML model (Transformer encoders, outputs embeddings) → Offers as vectors → Cluster similar vectors (KNN clustering, cluster embeddings) → cluster A
Example offers:
EAN: 123 | title: abc | colour: lmn
EAN: 123 | title: cde | colour: stu
EAN: 234 | title: bcd | colour: null
[Figure: example vectors with values 1 2 3, 2 3 4, 1 2 3 on axes x, y, z]
Siamese networks with Transformer models perform best …
Transfer Learning with Transformers
Learn one task, transfer knowledge to a new task
Pretraining → Fine-tuning
Training objective: masked language modelling
• Sentence: Where are we [MASK]
• Label: going
Unlabeled text data → Pretrained model
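The masked language modelling objective can be illustrated with a tiny helper that turns a sentence into a (masked input, label) pair. This is a simplified sketch: the function name is hypothetical, and real pretraining masks randomly chosen subword tokens over large corpora rather than one hand-picked word.

```python
def make_mlm_example(tokens, mask_index, mask_token="[MASK]"):
    """Return (masked sentence, label) for a single masked position."""
    label = tokens[mask_index]
    masked = list(tokens)
    masked[mask_index] = mask_token
    return " ".join(masked), label


# Reproduces the example from the slide.
sentence, label = make_mlm_example(["Where", "are", "we", "going"], mask_index=3)
```

The model sees `sentence` and is trained to predict `label`, so no manual annotation is needed.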
Transfer Learning with Transformers
Leverage large scale pre-trained language models

Pre-training: microsoft / mpnet-base
• Transformer encoder with 110M parameters
• 160 GB of uncompressed text (five English-language corpora)
• training time: 35 days on 32 GPUs

Fine-tuning: sentence-transformers / all-mpnet-base-v2
• trained on 1.2 billion English sentence pairs
• transferred to 100+ languages through Multi-Lingual Knowledge Distillation

Fine-tuning: idealo-offer-clustering
• trained on >5 million idealo offer pairs
• training time: 28 hours on an NVIDIA V100 GPU
Siamese Networks
Train on positive and negative training pairs.
Label:
1 = similar
0 = not similar
Before fine-tuning: 0.58
After fine-tuning: 0.76
+18 pp
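One common way to train on such labelled pairs is to push the cosine similarity of the two embeddings towards the 0/1 label. The slides do not state which loss was used, so the squared-error loss below is an assumption, and the toy vectors stand in for the embeddings that the shared encoder would produce for both offers of a pair.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_loss(u, v, label):
    """Squared error between the cosine similarity and the 0/1 pair label."""
    return (cosine_similarity(u, v) - label) ** 2


# Identical embeddings labelled "similar": nothing to correct.
loss_good_positive = pair_loss([1.0, 0.0], [1.0, 0.0], label=1)
# Identical embeddings labelled "not similar": maximal penalty for this pair.
loss_bad_negative = pair_loss([1.0, 0.0], [1.0, 0.0], label=0)
# Orthogonal embeddings labelled "not similar": nothing to correct.
loss_good_negative = pair_loss([1.0, 0.0], [0.0, 1.0], label=0)
```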
Sentence Transformers
• Provides access to language models fine-tuned on 1 billion sentence pairs
• Integrated with the Hugging Face Model Hub
• Multilingual models available, support for 100+ languages
• 10+ loss functions implemented and ready to use
Training pair generation makes a difference …
Generate Training Pairs
Choose positive pairs and negative pairs randomly
Lessons Learned
• Randomly selected negative pairs are too easy for the model.
• Random negative pairs do not contribute much to training progress.
• The model converges quickly, but its performance is insufficient.
Generate Training Pairs
Select hard-negative pairs: offline strategy
Steps repeated each epoch:
• Compute embeddings
• Average embedding for each product cluster
• Search for neighbors
• Generate pairs
• Training
+6 pp
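A minimal sketch of the offline hard-negative idea, assuming each offer already has an embedding and a known product id: average the embeddings per product, find each product's most similar other product, and pair their offers as negatives (label 0). Function names, the toy data and the brute-force neighbor search are illustrative assumptions, not the authors' implementation.

```python
import math


def mean_vector(vectors):
    """Component-wise average of equally sized vectors."""
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(len(vectors[0]))]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def hard_negative_pairs(embeddings, product_of, n_neighbors=1):
    """Pair each product's offers with offers of its nearest other products."""
    by_product = {}
    for offer_id, product in product_of.items():
        by_product.setdefault(product, []).append(offer_id)
    centroids = {
        p: mean_vector([embeddings[o] for o in offer_ids])
        for p, offer_ids in by_product.items()
    }
    pairs = []
    for product, centroid in centroids.items():
        others = [p for p in centroids if p != product]
        # Most similar products first: these are the "hard" negatives.
        others.sort(key=lambda p: cosine_similarity(centroid, centroids[p]), reverse=True)
        for neighbor in others[:n_neighbors]:
            for a in by_product[product]:
                for b in by_product[neighbor]:
                    pairs.append((a, b, 0))
    return pairs


# Toy data: P1 and P2 are close in embedding space, P3 is far from both.
embeddings = {"o1": [1.0, 0.0], "o2": [0.9, 0.1], "o3": [0.8, 0.3], "o4": [-1.0, 0.0]}
product_of = {"o1": "P1", "o2": "P1", "o3": "P2", "o4": "P3"}
pairs = hard_negative_pairs(embeddings, product_of)
```

Offers of P1 are paired with the look-alike product P2 rather than the easy-to-separate P3. In a training loop this would be rerun each epoch with embeddings from the current model.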
Building product clusters can be challenging …
Building product clusters
Find the K nearest neighbors (K=10) and apply a similarity threshold
Challenges
• Scale to millions of vector searches
• Search quality is important
• Search time should be small
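The neighbor search with a threshold can be sketched with a brute-force loop. K=10 comes from the slide; the threshold value of 0.9 and the toy vectors are assumptions. Comparing every vector with every other one scales quadratically, which is why a dedicated similarity search library is needed for millions of offers.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def knn_with_threshold(vectors, k=10, threshold=0.9):
    """Map each id to its up-to-k most similar ids whose similarity passes the threshold."""
    result = {}
    for query_id, query in vectors.items():
        scored = [
            (other_id, cosine_similarity(query, other))
            for other_id, other in vectors.items()
            if other_id != query_id
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        result[query_id] = [(o, s) for o, s in scored[:k] if s >= threshold]
    return result


# "a" and "b" point in almost the same direction; "c" is orthogonal to "a".
vectors = {"a": [1.0, 0.0], "b": [0.99, 0.1], "c": [0.0, 1.0]}
neighbors = knn_with_threshold(vectors, k=10, threshold=0.9)
```

Without the threshold, "c" would still receive its K nearest neighbors even though none of them is actually similar.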
Faiss built by Facebook Research
• Scales to billions of vectors ("Billion Scale Similarity Search" paper)
• Native distributed GPU support
• Out-of-the-box optimization strategies:
  • Compressed representation by using product quantization methods
  • Approximate nearest neighbor search
Source: https://github.com/facebookresearch/faiss/wiki
Performance
Index size: 25 GB
Vectors: > 13 million
Hardware: NVIDIA V100 (Multi-GPUs)
Time: 4.3 hrs (⌀ 1.2 ms per vector)
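As a quick consistency check on the performance figures, the total time follows from the vector count and the average time per vector. The count is taken as exactly 13 million here, which is a lower bound since the slide says "> 13 million".

```python
vectors = 13_000_000   # lower bound from the slide
ms_per_vector = 1.2    # average search time per vector in milliseconds

total_hours = vectors * ms_per_vector / 1000 / 3600  # ms -> s -> h
```

This gives roughly 4.3 hours, matching the reported total.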
Let's talk about challenges ...
Identify final product clusters
Approach
• create KNN graph with edge weights = cosine similarity
• use Label Propagation Algorithm (LPA) to identify clusters
• GraphFrames Spark library
[Figure panels: KNN for two offers; KNN graph; clusters after LPA algorithm]
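Label propagation on a weighted KNN graph can be sketched in plain Python as follows. This is not the GraphFrames API: the weighted voting, the fixed node order and the tie-breaking rule (smallest label wins) are assumptions chosen to keep the example deterministic, and the edge weights are made-up cosine similarities.

```python
def label_propagation(edges, max_iter=20):
    """edges: {(u, v): weight} for an undirected graph. Returns {node: label}."""
    neighbors = {}
    for (u, v), weight in edges.items():
        neighbors.setdefault(u, {})[v] = weight
        neighbors.setdefault(v, {})[u] = weight
    labels = {node: node for node in neighbors}  # every node starts as its own cluster
    for _ in range(max_iter):
        changed = False
        for node in sorted(neighbors):
            # Sum the edge weights per label among the node's neighbors.
            scores = {}
            for other, weight in neighbors[node].items():
                scores[labels[other]] = scores.get(labels[other], 0.0) + weight
            # Highest total weight wins; ties go to the smallest label.
            best = min(scores, key=lambda lab: (-scores[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


# Two tight groups of offers joined by one weak edge (c, d).
edges = {
    ("a", "b"): 0.95, ("b", "c"): 0.95, ("a", "c"): 0.95,
    ("d", "e"): 0.90, ("e", "f"): 0.90, ("d", "f"): 0.90,
    ("c", "d"): 0.20,
}
labels = label_propagation(edges)
```

Nodes that end up with the same label form one product cluster; the weak edge between "c" and "d" is outvoted by the strong edges inside each group.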
Noisy Text Attributes
Hard to identify product variants
Title:
Adidas Originals Superstar UNISEX schwarz weiß
Title:
Adidas Originals Sportschuhe FV3139_35, 5 Sneakers White, 35.5 EU
Next Steps …
Thank you!
