This document discusses improving data quality through product similarity search. It describes leveraging multiple data sources like product names, descriptions, prices and attributes to calculate similarity between products. Different techniques are used depending on the data type, such as text similarity for names/descriptions, numerical similarity for prices and variant counts, and mixed similarity for attributes. Attributes require special handling due to different data types within. The document outlines challenges in comparing incompatible datasets and noisy data. It proposes a solution using an API that can customize similarity based on use cases and data specificities.
Improving Data Quality with Product Similarity Search
1. All Rights Reserved © 2019
EVI LAZARIDOU
IMPROVING DATA QUALITY WITH
PRODUCT SIMILARITY SEARCH
2. About me
• Electrical & Computer Engineer, MSc in Computer Science
• Research
• Data Scientist at commercetools GmbH
• Data-driven internal features for backend teams
• Development of Data Science-based APIs
https://www.linkedin.com/in/evi-lazaridou
evi.lazaridou@commercetools.de
https://medium.com/@Evi.lazaridou
3. What Product Similarity solves
• Content-based product recommendations can use product similarity to recommend alternative items with the same characteristics for out-of-stock products
• Duplicate entries
  • Marketplaces: products added by a new seller might already exist in the catalog
  • E-commerce stores: boilerplate content in the product data
4. The challenges
• Incompatibilities between different datasets for a marketplace
  • different names / types / encodings for the same variable
• Scalability: comparing every single product variant with every other one
• Missing & noisy data (formatting tags, ids etc.)
  • automated preprocessing is hard due to each business's data specificities
• Multiple data types
5. What does product data look like?
{
  "id": "df2ecef4-fd68-4000-8740-0e5639dff471",
  "version": 17,
  "name": "Clutch DKNY grey",
  "description": "Classic clutch with multiple compartments and a sleek design.",
  "categories": [],
  "masterVariant": {
    "prices": [{"currencyCode": "EUR", "centAmount": 8750, "fractionDigits": 2}],
    "images": [{"url": "https://082034_1_large.jpg", "dimensions": {"w": 0, "h": 0}}],
    "attributes": [
      {"matrixId": "A0E2000000026I5"},
      {"designer": "DKNY"},
      {"size": "one size"},
      {"color": "grey"},
      {"style": "sporty"},
      {"gender": "women"},
      {"season": "s15"},
      {"isOnStock": true}]
  },
  "variants": [],
  "createdAt": "2017-07-10T14:05:13.665Z"
}
6. Leveraging multiple product data sources
…of multiple data types
• name, description: text
• attribute values: array of multiple different data types (text, numerical, boolean, sets)
• price, variant count: numerical
7. ⢠Different data types
⢠smaller independent components, each calculates the similarity for the respective data source
⢠Users can
⢠select which data sources to include in the computation
⢠specify which should have a stronger influence
All Rights Reserved Š 2019
Data flow
7
[Diagram: the product data (Name: Clutch DKNY grey; Description: Classic clutch with multiple compartments and a sleek design; Price: 8750; Variant Count: 3; Attributes: ["color": "gray", "style": "sporty", "gender": "women", …]) is routed to three components. Text similarity yields the name and description similarities, numerical similarity yields the price and variantCount similarities, and mixed data similarity yields the attribute similarity. The five scores are summed (Σ) using the user-defined weights W1 to W5.]
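The data flow above can be sketched as a weighted average of the per-source similarity scores. This is only a minimal illustration, not the API's actual implementation: the function name `combine_similarities`, the dictionary keys, the example scores and the normalization by the sum of the weights are all assumptions.

```python
def combine_similarities(similarities, weights):
    """Weighted average of per-source similarity scores in [0, 1].

    Only sources with a positive weight take part, so a user can exclude
    a data source by omitting it or by setting its weight to 0.
    """
    active = {k: w for k, w in weights.items() if w > 0 and k in similarities}
    total = sum(active.values())
    if total == 0:
        raise ValueError("at least one data source needs a positive weight")
    return sum(similarities[k] * w for k, w in active.items()) / total


# Hypothetical component scores for one pair of products.
scores = {"name": 0.8, "description": 0.5, "price": 1.0,
          "variantCount": 1.0, "attributes": 0.6}

# Hypothetical user choice: names count double, descriptions are excluded.
user_weights = {"name": 2.0, "price": 1.0, "variantCount": 1.0, "attributes": 1.0}

overall = combine_similarities(scores, user_weights)
```

Dividing by the weight sum keeps the overall score in [0, 1] regardless of how many sources the user selects.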
8. ⢠Text similarity for names & descriptions
⢠Hashing Vectorizer (scikit-learn): text vectorizer similar to Count Vectorizer without keeping a
vocabulary, faster response time
All Rights Reserved Š 2019
Breaking it down
8
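A minimal sketch of text similarity with scikit-learn's `HashingVectorizer`. The slides name the vectorizer but not the comparison metric, so the use of cosine similarity here is an assumption, as are the example names and the `n_features` value.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = [
    "Clutch DKNY grey",
    "DKNY clutch grey",
    "Running shoes blue",
]

# Stateless: no vocabulary is fitted or stored, so transform() can be
# called directly on new text. n_features is an illustrative choice.
vectorizer = HashingVectorizer(n_features=2**16, alternate_sign=False, norm="l2")
vectors = vectorizer.transform(names)

# Pairwise cosine similarity between all product names (3 x 3 matrix).
name_similarity = cosine_similarity(vectors)
```

Because tokens are lowercased and word order is ignored, the first two names map to the same vector, while the third shares no tokens with them.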
9. ⢠Text similarity for names & descriptions
⢠Hashing Vectorizer (scikit-learn): text vectorizer similar to Count Vectorizer without keeping a
vocabulary, faster response time
⢠Numerical similarity for prices & variant count: using absolute distance of scaled values
All Rights Reserved Š 2019
Breaking it down
9
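The numerical component could look roughly like the sketch below. The slides only state "absolute distance of scaled values"; min-max scaling, the conversion to a similarity as 1 minus the distance, the function names and the example prices are assumptions.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1]; constant lists map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def numerical_similarity(scaled_a, scaled_b):
    """Similarity of two scaled values: 1 minus their absolute distance."""
    return 1.0 - abs(scaled_a - scaled_b)


# Prices in cents for a hypothetical catalog of four products.
cent_amounts = [8750, 9000, 2500, 15000]
scaled = min_max_scale(cent_amounts)

close = numerical_similarity(scaled[0], scaled[1])  # 8750 vs 9000
far = numerical_similarity(scaled[2], scaled[3])    # 2500 vs 15000
```

Scaling first keeps price and variant count on the same [0, 1] range, so neither dominates when the scores are combined.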
10. ⢠Text similarity for names & descriptions
⢠Hashing Vectorizer (scikit-learn): text vectorizer similar to Count Vectorizer without keeping a
vocabulary, faster response time
⢠Numerical similarity for prices & variant count: absolute distance of scaled values
⢠Mixed data similarity for attributes
All Rights Reserved Š 2019
Breaking it down
10
11. Attributes: Mixed Data Similarity
• Arrays of numerical, categorical, boolean & multi-valued features
• No common similarity metric to compare all types
• Approach to handle different data types based on Gower distance:
  • Calculate distances between two instances differently for each variable type & combine them in a final (weighted) distance score
• Distance between missing values?
  • The distance between a missing value & any other value should be the maximum (1.0)
• Which distance metric should be used for each type?
acidity | color | contents | country  | country_availability | foods                         | available
5,7 g/l | white | 750.0    | Italy    | ["DE", "AT"]         | ["vegetarian", "poultry"]     | TRUE
9,0 g/l | red   | 1500.0   | France   | ["DE", "AT", "FR"]   | ["seafood", "fish"]           | TRUE
5,4 g/l | red   | 750.0    | Portugal | ["DE", "AT"]         | ["lamb", "beef"]              | TRUE
4,4 g/l | red   | 750.0    | Germany  | ["DE", "AT", "IT"]   | ["poultry", "pork", "beef"]   | FALSE
6,6 g/l | rosé  | 750.0    | Austria  | ["DE", "AT"]         | ["seafood", "fish", "poultry"] | FALSE
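The Gower-based approach can be sketched as a weighted mean of per-attribute distances, where a missing value always contributes the maximum distance of 1.0. The function names, the equal default weights and the example rows (loosely based on the wine table) are assumptions made for illustration.

```python
def gower_distance(a, b, distance_fns, weights=None):
    """Weighted mean of per-attribute distances, each in [0, 1].

    a, b: dicts mapping attribute name -> value (None means missing).
    distance_fns: dict mapping attribute name -> function(x, y) -> [0, 1].
    """
    weights = weights or {name: 1.0 for name in distance_fns}
    total, weight_sum = 0.0, 0.0
    for name, fn in distance_fns.items():
        x, y = a.get(name), b.get(name)
        # A missing value is maximally distant from any other value.
        d = 1.0 if x is None or y is None else fn(x, y)
        total += weights[name] * d
        weight_sum += weights[name]
    return total / weight_sum


def mismatch(x, y):
    """Distance for categorical and boolean values: 0 if equal, else 1."""
    return 0.0 if x == y else 1.0


def scaled_abs(value_range):
    """Distance for a numerical attribute, scaled by its observed range."""
    return lambda x, y: abs(x - y) / value_range


wine_a = {"color": "white", "contents": 750.0, "available": True}
wine_b = {"color": "red", "contents": 1500.0, "available": True}
wine_c = {"color": "red", "contents": None, "available": True}

fns = {"color": mismatch, "contents": scaled_abs(750.0), "available": mismatch}
d_ab = gower_distance(wine_a, wine_b, fns)  # differ in color and contents
d_bc = gower_distance(wine_b, wine_c, fns)  # only the missing value differs
```

Keeping one distance function per attribute lets each variable type use its own metric while the final score stays in [0, 1].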
12. Mixed Data Similarity: Numerical
• Numerical attributes
  • Euclidean distance
13. Mixed Data Similarity: Boolean
• Numerical attributes
  • Euclidean distance
• Boolean attributes
  • Converted to numerical values and treated as numerical
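A small sketch of how numerical and boolean attributes could share one Euclidean distance. Casting booleans to 1.0 / 0.0 follows the slide; the min-max scaling step and the helper name `scale_columns` are assumptions added so that both columns lie on the same range.

```python
import math

# (contents, available) for three wines from the table.
rows = [(750.0, True), (1500.0, True), (750.0, False)]

# Booleans become 1.0 / 0.0 so they can be treated as numbers.
numeric_rows = [[float(v) for v in row] for row in rows]


def scale_columns(matrix):
    """Min-max scale each column to [0, 1]; constant columns become 0.0."""
    scaled_cols = []
    for col in zip(*matrix):
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled_cols.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(r) for r in zip(*scaled_cols)]


scaled_rows = scale_columns(numeric_rows)

# Euclidean distance between pairs of scaled rows (math.dist needs Python 3.8+).
d_01 = math.dist(scaled_rows[0], scaled_rows[1])  # differ in contents only
d_12 = math.dist(scaled_rows[1], scaled_rows[2])  # differ in both attributes
```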
14. Mixed Data Similarity: Multi-valued
• Numerical attributes
  • Euclidean distance
• Boolean attributes
  • Converted to numerical values and treated as numerical
• Multi-valued attributes
  • Jaccard similarity (coefficient) between two sets of values: size of their intersection divided by the size of their union
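The Jaccard coefficient for multi-valued attributes fits in a few lines. The function name and the handling of two empty sets are assumptions; the example values come from the `country_availability` and `foods` columns of the table.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(set_a & set_b) / len(union)


# Two of three countries are shared.
s_countries = jaccard_similarity(["DE", "AT"], ["DE", "AT", "FR"])

# No food pairing is shared.
s_foods = jaccard_similarity(["vegetarian", "poultry"], ["seafood", "fish"])
```

The corresponding distance for a Gower-style combination would be 1 minus this similarity.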
15. Mixed Data Similarity: Categorical
• Typical approaches present disadvantages
  • Encoding categorical values as numbers → distances between the values are arbitrary
  • One Hot Encoding → high dimensionality
• Measure whether values are identical or not
  • Hamming distance (SciPy's cdist)
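A sketch of the Hamming distance on the two nominal columns of the table, using SciPy's `cdist`. `cdist` expects numeric input, so the strings are first mapped to integer codes; this mapping step is an assumption about preprocessing. The codes act as labels only, since the Hamming metric checks equality and never uses their magnitude.

```python
import numpy as np
from scipy.spatial.distance import cdist

# color and country for the five wines in the table.
nominal = np.array([
    ["white", "Italy"],
    ["red", "France"],
    ["red", "Portugal"],
    ["red", "Germany"],
    ["rosé", "Austria"],
])

# Map each column's strings to integer codes (labels without order).
codes = np.empty(nominal.shape, dtype=float)
for j in range(nominal.shape[1]):
    _, inverse = np.unique(nominal[:, j], return_inverse=True)
    codes[:, j] = inverse

# Hamming distance: the fraction of attributes on which two rows differ.
distances = cdist(codes, codes, metric="hamming")
```

This sidesteps both disadvantages listed above: no arbitrary numeric distance between categories and no extra dimensions from one-hot encoding.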
16. Mixed Data Similarity: Categorical
• Maybe better to handle as text?
  • Hard without supervision
  • Not always meaningful & safe
  • Computationally expensive
Handling as text is only enabled for small product sets & a limited number of nominal attributes, and is based on the Levenshtein distance
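For reference, a plain-Python sketch of the Levenshtein distance. The normalization by the longer string's length is an assumption, added so the result fits a [0, 1] distance scale; the nested loops also show why the approach gets expensive on large product sets.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Distance scaled to [0, 1] by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Spelling variants of the same color are one substitution apart.
d_spelling = levenshtein("grey", "gray")
d_norm = normalized_levenshtein("grey", "gray")
```

Unlike the Hamming approach, this treats "grey" and "gray" as close rather than simply different, which is the benefit of handling nominal values as text.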
17. Attribute selection & weight assessment
• Some attributes are irrelevant or add noise
  • Strongly influenced by the customer's data & patterns
  • Hard to automate "universally" without inspecting the data
• Some attributes are more useful
  • They should have a higher impact on the final score
• We need attribute weights and a metric of "importance" to define them
  • Variation & density of values as indicators of the discriminative ability & importance of attributes
18. Assessing variation in different variable types
• No single variation metric is applicable to every data type
• Experimented with different variance metrics (std, variance, variation ratio, …)
  • All tied to the data type
  • Can't be compared with one another
• One-for-all variation counterpart: entropy (Shannon's entropy) of the values
  • Measure of randomness in data
  • Not influenced by the values themselves, only by their distribution
19. Entropy to the rescue
• How?
  • Treat all data values as distinct & take the normalized entropy (so H in [0, 1])
• High entropy generally indicates high variation
  • Remove attributes with entropy (almost) equal to 1, because that indicates a uniform distribution
• Very low entropy when most data points fall on one value
  • Not much discriminative ability
• Not a perfect match but a good proxy
• The final attribute importance weight: (Entropy) x (Density)
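The weighting scheme might be sketched as follows. Several details are assumptions, since the slides do not specify them: entropy is normalized by the logarithm of the number of non-missing values (so an attribute with a unique value per product, such as an id, scores 1), density is the fraction of products that have a value, and the cutoff of 0.99 for "almost 1" is hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Shannon entropy of the non-missing values, scaled to [0, 1]."""
    present = [str(v) for v in values if v is not None]
    n = len(present)
    if n <= 1:
        return 0.0
    counts = Counter(present)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(n)


def density(values):
    """Fraction of products that have a value for the attribute."""
    return sum(v is not None for v in values) / len(values)


def importance_weight(values, uniform_cutoff=0.99):
    """Entropy x density; near-uniform attributes are dropped (weight 0)."""
    h = normalized_entropy(values)
    if h >= uniform_cutoff:
        return 0.0
    return h * density(values)


ids = ["a", "b", "c", "d"]               # unique per product: dropped
season = ["s15", "s15", "s15", "s15"]    # constant: no discriminative ability
colors = ["red", "red", "white", None]   # some variation, one missing value

w_ids = importance_weight(ids)
w_season = importance_weight(season)
w_colors = importance_weight(colors)
```

Converting values to strings before counting is one way to treat every value as distinct regardless of its original type, including list-valued attributes.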
20. Outcome
• An API that
  • leverages multiple data sources
  • is flexible to customize based on use case & data specificities
    • Duplicate detection: include only, or weigh higher, the data sources prone to duplicate content
    • Content-based product recommendation: rely more on attributes, names, prices
• Knowledge gained on a real business case that is not widely covered & explored