All Rights Reserved © 2019
EVI LAZARIDOU
IMPROVING DATA QUALITY WITH
PRODUCT SIMILARITY SEARCH
About me
• Electrical & Computer Engineer, MSc in Computer Science
• Research
• Data Scientist at commercetools GmbH
• Data-driven internal features for backend teams
• Development of Data Science-based APIs
https://www.linkedin.com/in/evi-lazaridou
evi.lazaridou@commercetools.de
https://medium.com/@Evi.lazaridou
What Product Similarity solves
• Content-based product recommendations: product similarity can be used to recommend alternative items with the same characteristics when a product is out of stock
• Duplicate entries
• Marketplaces: a product added by a new seller might already exist in the catalog
• E-commerce stores: boilerplate content in the product data
The challenges
• Incompatibilities between the different datasets of a marketplace
• Different names / types / encodings for the same variable
• Scalability: every product variant has to be compared with every other one
• Missing & noisy data (formatting tags, ids etc.)
• Automated preprocessing is hard because each business's data has its own specificities
• Multiple data types
What does product data look like?
{
  "id": "df2ecef4-fd68-4000-8740-0e5639dff471",
  "version": 17,
  "name": "Clutch DKNY grey",
  "description": "Classic clutch with multiple compartments and a sleek design.",
  "categories": [],
  "masterVariant": {
    "prices": [{"currencyCode": "EUR", "centAmount": 8750, "fractionDigits": 2}],
    "images": [{"url": "https://082034_1_large.jpg", "dimensions": {"w": 0, "h": 0}}],
    "attributes": [
      {"matrixId": "A0E2000000026I5"},
      {"designer": "DKNY"},
      {"size": "one size"},
      {"color": "grey"},
      {"style": "sporty"},
      {"gender": "women"},
      {"season": "s15"},
      {"isOnStock": true}
    ]
  },
  "variants": [],
  "createdAt": "2017-07-10T14:05:13.665Z"
}
…of multiple data types
• name, description: text
• attribute values: array of multiple different data types (text, numerical, boolean, sets)
• price, variant count: numerical
Leveraging multiple product data sources
• Different data types are handled by smaller independent components, each of which calculates the similarity for its respective data source
• Users can
• select which data sources to include in the computation
• specify which data sources should have a stronger influence
Data flow
[Diagram: product data → similarity component → weighted score]
• Name ("Clutch DKNY grey") → text similarity → name similarity (W1)
• Description ("Classic clutch with multiple compartments and a sleek design") → text similarity → description similarity (W2)
• Price (8750) → numerical similarity → price similarity (W3)
• Variant count (3) → numerical similarity → variantCount similarity (W4)
• Attributes (["color": "gray", "style": "sporty", "gender": "women", …]) → mixed data similarity → attribute similarity (W5)
• Σ: the weighted similarities are summed using user-defined weights
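The weighted sum in the data flow can be sketched as below. This is a minimal illustration, not the API's actual implementation: the function name `combined_similarity`, the dictionary layout, and the normalization by the sum of the active weights are all assumptions made for the example.

```python
def combined_similarity(similarities, weights):
    """Combine per-source similarity scores into one weighted score.

    similarities: dict mapping a data source to a score in [0, 1].
    weights: dict mapping a data source to a non-negative user-defined weight.
    Sources missing from `weights` or weighted 0 are left out, which mirrors
    letting the user select which data sources take part (hypothetical layout).
    """
    active = {s: w for s, w in weights.items() if w > 0 and s in similarities}
    total = sum(active.values())
    if total == 0:
        raise ValueError("at least one data source needs a positive weight")
    # Assumption: divide by the weight sum so the result stays in [0, 1].
    return sum(similarities[s] * w for s, w in active.items()) / total


scores = {"name": 0.8, "description": 0.5, "price": 1.0,
          "variantCount": 1.0, "attributes": 0.25}
# Only names and attributes take part, with equal influence.
result = combined_similarity(scores, {"name": 2, "attributes": 2})
```

With these inputs only the name and attribute scores contribute, so the other three sources have no effect on the result.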
Breaking it down
• Text similarity for names & descriptions
• HashingVectorizer (scikit-learn): a text vectorizer similar to CountVectorizer that does not keep a vocabulary, which gives a faster response time
• Numerical similarity for prices & variant count: absolute distance of scaled values
• Mixed data similarity for attributes
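The first two components can be sketched as follows. The slides name only the vectorizer and the "absolute distance of scaled values"; comparing the hashed vectors with cosine similarity, scaling with min-max, and turning the distance into a similarity as `1 - distance` are assumptions made for this example, as are the sample names and prices.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = ["Clutch DKNY grey", "DKNY clutch, grey", "Leather boots black"]

# HashingVectorizer is stateless: tokens are hashed straight to column
# indices, so there is no vocabulary to fit or store.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
name_vectors = vectorizer.transform(names)

# Assumption: cosine similarity between the hashed term vectors.
name_sim = cosine_similarity(name_vectors)


def numeric_similarity(values):
    """Pairwise similarity as 1 - |a - b| on min-max scaled values."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    scaled = (v - v.min()) / span if span > 0 else np.zeros_like(v)
    return 1.0 - np.abs(scaled[:, None] - scaled[None, :])


prices = [8750, 8750, 12900]  # cent amounts, illustrative only
price_sim = numeric_similarity(prices)
```

The first two names contain the same tokens in a different order, so their similarity is at the maximum, while the third name shares no tokens with them.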
Attributes: Mixed Data Similarity
• Arrays of numerical, categorical, boolean & multi-valued features
• No common similarity metric to compare all types
• Approach to handling different data types, based on the Gower distance:
• Calculate the distance between two instances differently for each variable type & combine the results in a final (weighted) distance score
• Distance between missing values?
• The distance between a missing value & any other value should be the maximum (1.0)
• Which distance metric should be used for each type?
acidity  color  contents  country   country_availability  foods                          available
5,7 g/l  white  750.0     Italy     ['DE', 'AT']          ['vegetarian', 'poultry']      TRUE
9,0 g/l  red    1500.0    France    ['DE', 'AT', 'FR']    ['seafood', 'fish']            TRUE
5,4 g/l  red    750.0     Portugal  ['DE', 'AT']          ['lamb', 'beef']               TRUE
4,4 g/l  red    750.0     Germany   ['DE', 'AT', 'IT']    ['poultry', 'pork', 'beef']    FALSE
6,6 g/l  rosé   750.0     Austria   ['DE', 'AT']          ['seafood', 'fish', 'poultry'] FALSE
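A Gower-style combination over a few of the sample columns could look like the sketch below. It is an illustration under stated assumptions: the function name, the `kinds` and `ranges` arguments, and equal default weights are hypothetical, and multi-valued attributes are left out for brevity. Note that the original Gower coefficient skips comparisons involving missing values, whereas this sketch follows the slide and assigns them the maximum distance of 1.0 (including the case where both values are missing, which the slide does not spell out).

```python
def gower_distance(a, b, kinds, ranges, weights=None):
    """Weighted mean of per-attribute distances, each in [0, 1].

    a, b: dicts mapping attribute name to value (None means missing).
    kinds: attribute -> "numerical" | "boolean" | "categorical".
    ranges: attribute -> (min, max) observed for numerical attributes.
    """
    weights = weights or {attr: 1.0 for attr in kinds}
    total = 0.0
    for attr, kind in kinds.items():
        x, y = a.get(attr), b.get(attr)
        if x is None or y is None:
            d = 1.0  # missing value: maximum distance, as on the slide
        elif kind == "numerical":
            lo, hi = ranges[attr]
            d = abs(x - y) / (hi - lo) if hi > lo else 0.0
        elif kind == "boolean":
            d = float(abs(int(x) - int(y)))  # treated as numerical 0/1
        else:
            d = 0.0 if x == y else 1.0  # categorical: identical or not
        total += weights[attr] * d
    return total / sum(weights[attr] for attr in kinds)


kinds = {"color": "categorical", "contents": "numerical",
         "country": "categorical", "available": "boolean"}
ranges = {"contents": (750.0, 1500.0)}

wine_1 = {"color": "white", "contents": 750.0, "country": "Italy", "available": True}
wine_2 = {"color": "red", "contents": 1500.0, "country": "France", "available": True}
wine_3 = {"color": "red", "contents": 750.0, "country": "Portugal", "available": True}

d_12 = gower_distance(wine_1, wine_2, kinds, ranges)  # three of four attributes differ
d_23 = gower_distance(wine_2, wine_3, kinds, ranges)  # two of four attributes differ
```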
Mixed Data Similarity: Numerical
• Numerical attributes
• Euclidean distance
Mixed Data Similarity: Boolean
• Boolean attributes
• Converted to numerical values and treated as numerical
Mixed Data Similarity: Multi-valued
• Multi-valued attributes
• Jaccard similarity (coefficient) between two sets of values: the size of their intersection divided by the size of their union
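The Jaccard coefficient is short enough to write out directly. In this sketch the function name is hypothetical, and returning 1.0 for two empty sets is an assumption, since the definition leaves that case undefined.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


# country_availability of the first two sample rows: 2 shared of 3 distinct
availability_sim = jaccard_similarity(["DE", "AT"], ["DE", "AT", "FR"])

# foods of the first and third sample rows: nothing in common
foods_sim = jaccard_similarity(["vegetarian", "poultry"], ["lamb", "beef"])
```

To use this inside a distance-based score, the matching distance is `1 - jaccard_similarity(a, b)`.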
Mixed Data Similarity: Categorical
• Typical approaches have disadvantages
• Encoding categorical values as numbers
✗ the distances between the encoded values are arbitrary
• One-hot encoding
✗ high dimensionality
• Instead, measure whether values are identical or not
• Hamming distance (SciPy's cdist)
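A minimal sketch of the Hamming approach with SciPy follows. `cdist` works on numeric arrays, so the example first maps each category to an integer code; this encoding step is an assumption about how the strings would be prepared. Because the Hamming metric only checks whether two codes are equal, the arbitrary numbering does not introduce the distance problem noted above.

```python
import numpy as np
from scipy.spatial.distance import cdist

# color and country of the five sample wines
rows = np.array([
    ["white", "Italy"],
    ["red", "France"],
    ["red", "Portugal"],
    ["red", "Germany"],
    ["rosé", "Austria"],
])

# Map each column's categories to integer codes, used only for equality checks.
codes = np.column_stack(
    [np.unique(column, return_inverse=True)[1] for column in rows.T]
)

# Hamming distance: the fraction of attributes whose values differ.
dist = cdist(codes, codes, metric="hamming")
```

Here `dist[1, 2]` compares the French and Portuguese wines, which share the color but not the country.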
Mixed Data Similarity: Categorical
• Maybe better to handle as text?
• Hard without supervision
• Not always meaningful & safe
• Computationally expensive
• Text-based comparison is only enabled for small product sets & a limited number of nominal attributes, and is based on the Levenshtein distance
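For reference, the Levenshtein distance counts the single-character insertions, deletions and substitutions needed to turn one string into another. The sketch below is a plain dynamic-programming version; dividing by the longer string's length to obtain a value in [0, 1] is an assumption, as the slides do not say how the distance is normalized.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Assumption: scale by the longer length so the result lies in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


color_distance = normalized_levenshtein("grey", "gray")  # one substitution in four characters
```

The nested loop runs once per character pair for every pair of values, which is consistent with the slide's note that this option is computationally expensive.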
Attribute selection & weight assessment
• Some attributes are irrelevant or add noise
• Strongly influenced by customer’s data & patterns
• Hard to automate “universally” without inspection of data
• Some attributes are more useful
• They should have a higher impact on the final score
• We need attribute weights and a metric of “importance” to define them
• Variation & density of values as indicator of the discriminative ability & importance of attributes
Assessing variation in different variable types
• No single variation metric is applicable to every data type
• Experimented with different variance metrics (std, variance, variation ratio, ...)
• All are tied to the data type
• Cannot be compared with one another
• One-for-all counterpart of variation: the (Shannon) entropy of the values
• Measure of randomness in data
• Not influenced by the values themselves, only by their distribution
Entropy to the rescue
• How?
• Treat all data values as distinct symbols & take the normalized entropy (so H in [0,1])
• High entropy generally indicates high variation
• Remove attributes with entropy (almost) equal to 1, because the values follow a uniform distribution
• Very low entropy when most data points share one value
• Not much discriminative ability
• Not a perfect match but a good proxy
• The final attribute importance weight: (Entropy) x (Density)
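The weight computation can be sketched as below. Several details are assumptions, since the slides do not specify them: entropy is normalized by the logarithm of the number of non-missing observations (so a value of 1 means every value is unique, as with ids), density is taken as the fraction of non-missing values, and the removal threshold `max_entropy=0.99` and all function names are hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Shannon entropy of the value distribution, scaled to [0, 1]."""
    present = [v for v in values if v is not None]
    n = len(present)
    if n <= 1:
        return 0.0
    # Values are treated as opaque symbols; only their frequencies matter.
    counts = Counter(str(v) for v in present)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return max(0.0, h / math.log(n))  # assumption: normalize by log(n)


def density(values):
    """Assumption: share of products that have a value for the attribute."""
    return sum(v is not None for v in values) / len(values) if values else 0.0


def attribute_weight(values, max_entropy=0.99):
    """Entropy x density, with (almost) uniform attributes removed."""
    h = normalized_entropy(values)
    if h >= max_entropy:
        return 0.0  # e.g. ids: every value differs, no useful signal
    return h * density(values)


color_weight = attribute_weight(["white", "red", "red", "red", "rosé"])
id_weight = attribute_weight(["a1", "b2", "c3", "d4"])        # all unique, removed
sparse_weight = attribute_weight(["red", "red", "white", None])
```

An attribute where every product has the same value gets a weight of 0 as well, because its entropy is 0.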
Outcome
• An API that
• leverages multiple data sources
• can be customized to the use case & data specificities
• Duplicate detection: include only, or weight more heavily, the data sources prone to duplicate content
• Content-based product recommendation: rely more on attributes, names, prices
• Knowledge gained on a real business case that is not widely covered & explored
Read all about it!
• Our API docs: https://bit.ly/2XjvzIc
• Our tech blog post: https://bit.ly/32L5SBC

More Related Content

More from Institute of Contemporary Sciences

Building valuable (online and offline) Data Science communities - Experience ...
Building valuable (online and offline) Data Science communities - Experience ...Building valuable (online and offline) Data Science communities - Experience ...
Building valuable (online and offline) Data Science communities - Experience ...Institute of Contemporary Sciences
 
Data Science Master 4.0 on Belgrade University - Drazen Draskovic
Data Science Master 4.0 on Belgrade University - Drazen DraskovicData Science Master 4.0 on Belgrade University - Drazen Draskovic
Data Science Master 4.0 on Belgrade University - Drazen DraskovicInstitute of Contemporary Sciences
 
Deep learning fast and slow, a responsible and explainable AI framework - Ahm...
Deep learning fast and slow, a responsible and explainable AI framework - Ahm...Deep learning fast and slow, a responsible and explainable AI framework - Ahm...
Deep learning fast and slow, a responsible and explainable AI framework - Ahm...Institute of Contemporary Sciences
 
Solving churn challenge in Big Data environment - Jelena Pekez
Solving churn challenge in Big Data environment  - Jelena PekezSolving churn challenge in Big Data environment  - Jelena Pekez
Solving churn challenge in Big Data environment - Jelena PekezInstitute of Contemporary Sciences
 
Application of Business Intelligence in bank risk management - Dimitar Dilov
Application of Business Intelligence in bank risk management - Dimitar DilovApplication of Business Intelligence in bank risk management - Dimitar Dilov
Application of Business Intelligence in bank risk management - Dimitar DilovInstitute of Contemporary Sciences
 
Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...
Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...
Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...Institute of Contemporary Sciences
 
Recommender systems for personalized financial advice from concept to product...
Recommender systems for personalized financial advice from concept to product...Recommender systems for personalized financial advice from concept to product...
Recommender systems for personalized financial advice from concept to product...Institute of Contemporary Sciences
 
Advanced tools in real time analytics and AI in customer support - Milan Sima...
Advanced tools in real time analytics and AI in customer support - Milan Sima...Advanced tools in real time analytics and AI in customer support - Milan Sima...
Advanced tools in real time analytics and AI in customer support - Milan Sima...Institute of Contemporary Sciences
 
Complex AI forecasting methods for investments portfolio optimization - Pawel...
Complex AI forecasting methods for investments portfolio optimization - Pawel...Complex AI forecasting methods for investments portfolio optimization - Pawel...
Complex AI forecasting methods for investments portfolio optimization - Pawel...Institute of Contemporary Sciences
 
Data and data scientists are not equal to money david hoyle
Data and data scientists are not equal to money   david hoyleData and data scientists are not equal to money   david hoyle
Data and data scientists are not equal to money david hoyleInstitute of Contemporary Sciences
 
When it's raining gold, bring a bucket - Andjela Culibrk
When it's raining gold, bring a bucket - Andjela CulibrkWhen it's raining gold, bring a bucket - Andjela Culibrk
When it's raining gold, bring a bucket - Andjela CulibrkInstitute of Contemporary Sciences
 
Reality and traps of real time data engineering - Milos Solujic
Reality and traps of real time data engineering - Milos SolujicReality and traps of real time data engineering - Milos Solujic
Reality and traps of real time data engineering - Milos SolujicInstitute of Contemporary Sciences
 
Sensor networks for personalized health monitoring - Vladimir Brusic
Sensor networks for personalized health monitoring - Vladimir BrusicSensor networks for personalized health monitoring - Vladimir Brusic
Sensor networks for personalized health monitoring - Vladimir BrusicInstitute of Contemporary Sciences
 
Prediction of good patterns for future sales using image recognition
Prediction of good patterns for future sales using image recognitionPrediction of good patterns for future sales using image recognition
Prediction of good patterns for future sales using image recognitionInstitute of Contemporary Sciences
 
Using data to fight corruption: full budget transparency in local government
Using data to fight corruption: full budget transparency in local governmentUsing data to fight corruption: full budget transparency in local government
Using data to fight corruption: full budget transparency in local governmentInstitute of Contemporary Sciences
 
Machine Learning-Driven Injury Prediction for a Professional Sports Team
Machine Learning-Driven Injury Prediction for a Professional Sports TeamMachine Learning-Driven Injury Prediction for a Professional Sports Team
Machine Learning-Driven Injury Prediction for a Professional Sports TeamInstitute of Contemporary Sciences
 

More from Institute of Contemporary Sciences (20)

First 5 years of PSI:ML - Filip Panjevic
First 5 years of PSI:ML - Filip PanjevicFirst 5 years of PSI:ML - Filip Panjevic
First 5 years of PSI:ML - Filip Panjevic
 
Building valuable (online and offline) Data Science communities - Experience ...
Building valuable (online and offline) Data Science communities - Experience ...Building valuable (online and offline) Data Science communities - Experience ...
Building valuable (online and offline) Data Science communities - Experience ...
 
Data Science Master 4.0 on Belgrade University - Drazen Draskovic
Data Science Master 4.0 on Belgrade University - Drazen DraskovicData Science Master 4.0 on Belgrade University - Drazen Draskovic
Data Science Master 4.0 on Belgrade University - Drazen Draskovic
 
Deep learning fast and slow, a responsible and explainable AI framework - Ahm...
Deep learning fast and slow, a responsible and explainable AI framework - Ahm...Deep learning fast and slow, a responsible and explainable AI framework - Ahm...
Deep learning fast and slow, a responsible and explainable AI framework - Ahm...
 
Solving churn challenge in Big Data environment - Jelena Pekez
Solving churn challenge in Big Data environment  - Jelena PekezSolving churn challenge in Big Data environment  - Jelena Pekez
Solving churn challenge in Big Data environment - Jelena Pekez
 
Application of Business Intelligence in bank risk management - Dimitar Dilov
Application of Business Intelligence in bank risk management - Dimitar DilovApplication of Business Intelligence in bank risk management - Dimitar Dilov
Application of Business Intelligence in bank risk management - Dimitar Dilov
 
Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...
Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...
Trends and practical applications of AI/ML in Fin Tech industry - Milos Kosan...
 
Recommender systems for personalized financial advice from concept to product...
Recommender systems for personalized financial advice from concept to product...Recommender systems for personalized financial advice from concept to product...
Recommender systems for personalized financial advice from concept to product...
 
Advanced tools in real time analytics and AI in customer support - Milan Sima...
Advanced tools in real time analytics and AI in customer support - Milan Sima...Advanced tools in real time analytics and AI in customer support - Milan Sima...
Advanced tools in real time analytics and AI in customer support - Milan Sima...
 
Complex AI forecasting methods for investments portfolio optimization - Pawel...
Complex AI forecasting methods for investments portfolio optimization - Pawel...Complex AI forecasting methods for investments portfolio optimization - Pawel...
Complex AI forecasting methods for investments portfolio optimization - Pawel...
 
From Zero to ML Hero for Underdogs - Amir Tabakovic
From Zero to ML Hero for Underdogs  - Amir TabakovicFrom Zero to ML Hero for Underdogs  - Amir Tabakovic
From Zero to ML Hero for Underdogs - Amir Tabakovic
 
Data and data scientists are not equal to money david hoyle
Data and data scientists are not equal to money   david hoyleData and data scientists are not equal to money   david hoyle
Data and data scientists are not equal to money david hoyle
 
The price is right - Tomislav Krizan
The price is right - Tomislav KrizanThe price is right - Tomislav Krizan
The price is right - Tomislav Krizan
 
When it's raining gold, bring a bucket - Andjela Culibrk
When it's raining gold, bring a bucket - Andjela CulibrkWhen it's raining gold, bring a bucket - Andjela Culibrk
When it's raining gold, bring a bucket - Andjela Culibrk
 
Reality and traps of real time data engineering - Milos Solujic
Reality and traps of real time data engineering - Milos SolujicReality and traps of real time data engineering - Milos Solujic
Reality and traps of real time data engineering - Milos Solujic
 
Sensor networks for personalized health monitoring - Vladimir Brusic
Sensor networks for personalized health monitoring - Vladimir BrusicSensor networks for personalized health monitoring - Vladimir Brusic
Sensor networks for personalized health monitoring - Vladimir Brusic
 
Prediction of good patterns for future sales using image recognition
Prediction of good patterns for future sales using image recognitionPrediction of good patterns for future sales using image recognition
Prediction of good patterns for future sales using image recognition
 
Using data to fight corruption: full budget transparency in local government
Using data to fight corruption: full budget transparency in local governmentUsing data to fight corruption: full budget transparency in local government
Using data to fight corruption: full budget transparency in local government
 
Geospatial Analysis and Open Data - Forest and Climate
Geospatial Analysis and Open Data - Forest and ClimateGeospatial Analysis and Open Data - Forest and Climate
Geospatial Analysis and Open Data - Forest and Climate
 
Machine Learning-Driven Injury Prediction for a Professional Sports Team
Machine Learning-Driven Injury Prediction for a Professional Sports TeamMachine Learning-Driven Injury Prediction for a Professional Sports Team
Machine Learning-Driven Injury Prediction for a Professional Sports Team
 

Recently uploaded

{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...shivangimorya083
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxFurkanTasci3
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 

Recently uploaded (20)

{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
Data Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptxData Science Jobs and Salaries Analysis.pptx
Data Science Jobs and Salaries Analysis.pptx
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...

Improving Data Quality with Product Similarity Search

  • 1. Improving Data Quality with Product Similarity Search. Evi Lazaridou. All Rights Reserved © 2019
  • 2. About me
      • Electrical & Computer Engineer, MSc in Computer Science
      • Research
      • Data Scientist at commercetools GmbH
      • Data-driven internal features for backend teams
      • Development of Data Science-based APIs
      • https://www.linkedin.com/in/evi-lazaridou
      • evi.lazaridou@commercetools.de
      • https://medium.com/@Evi.lazaridou
  • 3. What Product Similarity solves
      • Duplicate entries
          • Marketplaces: products added by a new seller might already exist in the catalog
          • e-commerce stores: boilerplate content in the product data
      • Content-based product recommendations could leverage product similarity to recommend alternative items with the same characteristics for out-of-stock products
  • 4. The challenges
      • Incompatibilities between different datasets for a marketplace
          • different names / types / encodings for the same variable
      • Scalability: comparing every single product variant with every other
      • Missing & noisy data (formatting tags, ids etc.)
          • automated preprocessing is hard because each business's data has its own specificities
      • Multiple data types
  • 5. What does product data look like?
      {
        "id": "df2ecef4-fd68-4000-8740-0e5639dff471",
        "version": 17,
        "name": "Clutch DKNY grey",
        "description": "Classic clutch with multiple compartments and a sleek design.",
        "categories": [],
        "masterVariant": {
          "prices": [{"currencyCode": "EUR", "centAmount": 8750, "fractionDigits": 2}],
          "images": [{"url": "https://082034_1_large.jpg", "dimensions": {"w": 0, "h": 0}}],
          "attributes": [
            {"matrixId": "A0E2000000026I5"},
            {"designer": "DKNY"},
            {"size": "one size"},
            {"color": "grey"},
            {"style": "sporty"},
            {"gender": "women"},
            {"season": "s15"},
            {"isOnStock": true}]
        },
        "variants": [],
        "createdAt": "2017-07-10T14:05:13.665Z"
      }
  • 6. Leveraging multiple product data sources … of multiple data types (the slide repeats the product JSON from slide 5)
      • name, description: text
      • attribute values: array of multiple different data types (text, numerical, boolean, sets)
      • price, variant count: numerical
  • 7. Data flow
      • Different data types
          • smaller independent components, each calculating the similarity for its respective data source
      • Users can
          • select which data sources to include in the computation
          • specify which should have a stronger influence
      [Diagram: product data (Name: Clutch DKNY grey; Description: Classic clutch with multiple compartments and a sleek design; Price: 8750; Variant Count: 3; Attributes: ["color": "gray", "style": "sporty", "gender": "women", …]) feeds three components. Text similarity yields name similarity (W1) and description similarity (W2); numerical similarity yields price similarity (W3) and variantCount similarity (W4); mixed data similarity yields attribute similarity (W5). The scores are combined in a sum (Σ) using user-defined weights.]
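
The weighted combination in the data-flow diagram can be sketched as follows. This is a minimal illustration, not the API's implementation: the function name `combined_similarity`, the example scores, and the choice to normalize by the sum of the active weights are all assumptions; the slide only shows per-source scores being summed with weights W1 to W5.

```python
def combined_similarity(scores, weights):
    """Weighted average of per-source similarity scores, each in [0, 1].

    Hypothetical helper: dividing by the weight sum is an assumption made
    here so that the result also stays in [0, 1].
    """
    # A source takes part only if the user gave it a positive weight.
    active = {name: w for name, w in weights.items() if w > 0 and name in scores}
    total = sum(active.values())
    if total == 0:
        raise ValueError("at least one data source needs a positive weight")
    return sum(scores[name] * w for name, w in active.items()) / total


# Made-up per-source scores for one product pair.
scores = {"name": 0.9, "description": 0.6, "price": 1.0,
          "variantCount": 0.5, "attributes": 0.7}
# The user excludes variantCount (weight 0) and doubles the influence of names.
weights = {"name": 2.0, "description": 1.0, "price": 1.0,
           "variantCount": 0.0, "attributes": 1.0}

overall = combined_similarity(scores, weights)
print(round(overall, 2))
```

Setting a weight to zero drops a data source entirely, which mirrors the "select which data sources to include" option, while larger weights give a source a stronger influence.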
  • 8. Breaking it down
      • Text similarity for names & descriptions
          • Hashing Vectorizer (scikit-learn): a text vectorizer similar to Count Vectorizer that keeps no vocabulary, which gives faster response times
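
To illustrate why no vocabulary is needed, here is a standard-library sketch of the hashing trick that Hashing Vectorizer is built on. It is a stand-in for illustration, not scikit-learn's implementation: the tokenizer, the MD5-based bucket choice, `n_features`, and the use of cosine similarity to compare vectors are all assumptions.

```python
import hashlib
import math
import re


def hashed_counts(text, n_features=2 ** 10):
    """Map tokens to a fixed-size count vector without storing a vocabulary."""
    vec = [0.0] * n_features
    for token in re.findall(r"\w+", text.lower()):
        # A stable hash decides the bucket, so no token-to-index table is kept.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_features
        vec[index] += 1.0
    return vec


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


a = hashed_counts("Clutch DKNY grey")
b = hashed_counts("Clutch DKNY black")
c = hashed_counts("Leather belt brown")

print(cosine_similarity(a, b), cosine_similarity(a, c))
```

Because every token is hashed straight to a bucket, new products can be vectorized without refitting anything; the price is that distinct tokens can occasionally collide in the same bucket.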
  • 9. Breaking it down
      • Text similarity for names & descriptions
          • Hashing Vectorizer (scikit-learn): a text vectorizer similar to Count Vectorizer that keeps no vocabulary, which gives faster response times
      • Numerical similarity for prices & variant count: absolute distance of scaled values
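
A small sketch of "absolute distance of scaled values" follows. The slide does not say which scaling is used, so min-max scaling over the catalog and the conversion `1 - distance` into a similarity are assumptions, as are the helper names and the example prices.

```python
def min_max_scale(values):
    """Scale values to [0, 1]; a constant column maps to 0.0 everywhere."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in values]


def numerical_similarity(scaled_a, scaled_b):
    """Similarity as 1 minus the absolute distance of two scaled values."""
    return 1.0 - abs(scaled_a - scaled_b)


# Illustrative centAmount prices for four products.
prices = [8750, 9000, 12500, 2500]
scaled = min_max_scale(prices)

sim_close = numerical_similarity(scaled[0], scaled[1])  # 8750 vs 9000
sim_far = numerical_similarity(scaled[0], scaled[3])    # 8750 vs 2500
print(sim_close, sim_far)
```

Scaling first keeps the score comparable across sources: a difference of 250 cents and a difference of one variant would otherwise live on very different scales.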
  • 10. Breaking it down
      • Text similarity for names & descriptions
          • Hashing Vectorizer (scikit-learn): a text vectorizer similar to Count Vectorizer that keeps no vocabulary, which gives faster response times
      • Numerical similarity for prices & variant count: absolute distance of scaled values
      • Mixed data similarity for attributes
  • 11. Attributes: Mixed Data Similarity
      • Arrays of numerical, categorical, boolean & multi-valued features
      • No common similarity metric to compare all types
      • Approach to handle different data types based on Gower distance:
          • Calculate distances between two instances differently for each variable type & combine them in a final (weighted) distance score
      • Distance between missing values?
          • The distance between a missing value & any other value should be the maximum (1.0)
      • Which distance metric should be used for every type?
      Example attribute data:
      acidity | color | contents | country  | country_availability | foods                          | available
      5,7 g/l | white | 750.0    | Italy    | ['DE', 'AT']         | ['vegetarian', 'poultry']      | TRUE
      9,0 g/l | red   | 1500.0   | France   | ['DE', 'AT', 'FR']   | ['seafood', 'fish']            | TRUE
      5,4 g/l | red   | 750.0    | Portugal | ['DE', 'AT']         | ['lamb', 'beef']               | TRUE
      4,4 g/l | red   | 750.0    | Germany  | ['DE', 'AT', 'IT']   | ['poultry', 'pork', 'beef']    | FALSE
      6,6 g/l | rosé  | 750.0    | Austria  | ['DE', 'AT']         | ['seafood', 'fish', 'poultry'] | FALSE
  • 12. Mixed Data Similarity: Numerical
      • Numerical attributes
          • Euclidean distance
  • 13. Mixed Data Similarity: Boolean
      • Numerical attributes
          • Euclidean distance
      • Boolean attributes
          • Converted to numerical values and treated as numerical
  • 14. Mixed Data Similarity: Multi-valued
      • Numerical attributes
          • Euclidean distance
      • Boolean attributes
          • Converted to numerical values and treated as numerical
      • Multi-valued attributes
          • Jaccard similarity (coefficient) between two sets of values: the size of their intersection divided by the size of their union
  • 15. Mixed Data Similarity: Categorical
      • Typical approaches have disadvantages
          • Encoding categorical values as numbers ✗ the distances between the values are arbitrary
          • One Hot Encoding ✗ high dimensionality
      • Measure whether values are identical or not
          • Hamming distance (SciPy's cdist)
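
The per-type distances from the last few slides can be combined in a Gower-style score along these lines. This is a sketch under stated assumptions, not the API's code: the function names, the `types`/`ranges` dictionaries, dividing numerical differences by the column range (Gower's usual convention), and equal default weights are all choices made here. For a single attribute the Euclidean distance reduces to the absolute difference, which is what the code uses. The acidity column is left out because its values carry units and would need parsing first.

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two collections of values."""
    a, b = set(a), set(b)
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def gower_distance(x, y, types, ranges, weights=None):
    """Weighted mean of per-attribute distances, each in [0, 1]."""
    weights = weights or {name: 1.0 for name in types}
    total, weight_sum = 0.0, 0.0
    for name, kind in types.items():
        a, b = x.get(name), y.get(name)
        if a is None or b is None:
            d = 1.0  # a missing value is maximally distant from anything
        elif kind in ("numerical", "boolean"):
            # Booleans become 0.0/1.0; numbers are divided by the column range.
            span = ranges.get(name, 1.0)
            d = abs(float(a) - float(b)) / span if span else 0.0
        elif kind == "multi":
            d = jaccard_distance(a, b)
        else:  # categorical: identical or not (a Hamming-style comparison)
            d = 0.0 if a == b else 1.0
        total += weights[name] * d
        weight_sum += weights[name]
    return total / weight_sum


types = {"contents": "numerical", "color": "categorical", "country": "categorical",
         "country_availability": "multi", "foods": "multi", "available": "boolean"}
ranges = {"contents": 1500.0 - 750.0}  # max - min of the example column

italy = {"contents": 750.0, "color": "white", "country": "Italy",
         "country_availability": ["DE", "AT"],
         "foods": ["vegetarian", "poultry"], "available": True}
portugal = {"contents": 750.0, "color": "red", "country": "Portugal",
            "country_availability": ["DE", "AT"],
            "foods": ["lamb", "beef"], "available": True}

d = gower_distance(italy, portugal, types, ranges)
print(d)
```

With the two example rows, three of the six attributes match exactly and three differ completely, so the unweighted distance is 0.5; passing attribute weights shifts that balance.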
  • 16. Mixed Data Similarity: Categorical
      • Maybe better to handle categorical values as text?
          • Hard without supervision
          • Not always meaningful & safe
          • Computationally expensive
      • Text handling is based on the Levenshtein distance and is only enabled for small product sets & a limited number of nominal attributes
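
For reference, the Levenshtein distance between two attribute values can be computed with the standard dynamic-programming recurrence below. The normalization by the longer string's length is an assumption added here to bring the result into [0, 1]; the slides do not say how the raw distance is scaled.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


print(levenshtein("grey", "gray"), normalized_levenshtein("grey", "gray"))
```

The cost grows with the product of the two string lengths for every pair of values compared, which is consistent with restricting this mode to small product sets.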
  • 17. Attribute selection & weight assessment
      • Some attributes are irrelevant or add noise
          • Strongly influenced by the customer's data & patterns
          • Hard to automate "universally" without inspecting the data
      • Some attributes are more useful
          • They should have a higher impact on the final score
      • We need attribute weights and a metric of "importance" to define them
          • Variation & density of values as indicators of the discriminative ability & importance of attributes
  • 18. Assessing variation in different variable types
      • No single variation metric is applicable to every data type
      • Experimented with different variation metrics (standard deviation, variance, variation ratio, ...)
          • All are tied to the data type
          • They can't be compared with one another
      • One-for-all variation counterpart: entropy (Shannon's entropy) of the values
          • Measure of randomness in data
          • Not influenced by the values themselves, only by their distribution
  • 19. Entropy to the rescue
      • How?
          • Treat all data values as distinct & take the normalized entropy (so H in [0, 1])
      • High entropy generally indicates high variation
          • Remove attributes with entropy (almost) equal to 1, because that indicates a uniform distribution
      • Very low entropy when most data points fall in one value
          • Not much discriminative ability
      • Not a perfect match but a good proxy
      • The final attribute importance weight: (Entropy) x (Density)
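
The entropy-times-density weight can be sketched as below. Several details are assumptions, since the slides do not define them: entropy is normalized by the logarithm of the number of non-missing observations (so it reaches 1 only when every value is unique, as with ids; normalizing by the number of distinct values is another possible reading), density is taken to be the fraction of non-missing values, and the cutoff `0.99` for "almost equal to 1" is arbitrary.

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Shannon entropy of the observed values, scaled to [0, 1]."""
    present = [v for v in values if v is not None]
    n = len(present)
    if n <= 1:
        return 0.0
    # str() makes list-valued attributes hashable and treats values as labels.
    counts = Counter(map(str, present))
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(n)


def density(values):
    """Fraction of non-missing values (assumed definition of density)."""
    return sum(v is not None for v in values) / len(values) if values else 0.0


def importance(values, uniform_cutoff=0.99):
    """Entropy x density, with near-uniform attributes dropped (weight 0)."""
    h = normalized_entropy(values)
    if h >= uniform_cutoff:
        return 0.0
    return h * density(values)


color = ["white", "red", "red", "red", "rosé"]
country = ["Italy", "France", "Portugal", "Germany", "Austria"]

print(importance(color), importance(country))
```

On the five example rows, `country` is unique per product and is therefore dropped, while `color` keeps a moderate weight; on a real catalog with many more rows the same column could behave quite differently.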
  • 20. Outcome
      • An API that
          • leverages multiple data sources
          • is flexible to customize based on use case & data specificities
              • Duplicate detection: include only, or weigh higher, the data sources prone to duplicate content
              • Content-based product recommendation: rely more on attributes, names, prices
      • Knowledge gained on a real business case that is not widely covered & explored
  • 21. Read all about it!
      • Our API docs: https://bit.ly/2XjvzIc
      • Our tech blog post: https://bit.ly/32L5SBC