Improving Data Quality with Product Similarity Search

All Rights Reserved © 2019
EVI LAZARIDOU

About me
• Electrical & Computer Engineer, MSc in Computer Science
• Research
• Data Scientist at commercetools GmbH
• Data-driven internal features for backend teams
• Development of Data Science-based APIs
https://www.linkedin.com/in/evi-lazaridou
evi.lazaridou@commercetools.de
https://medium.com/@Evi.lazaridou

What Product Similarity solves
• Content-based product recommendations could leverage product similarity to recommend alternative items with the same characteristics for out-of-stock products
• Duplicate entries
• Marketplaces: products added by a new seller might already exist in the catalog
• E-commerce stores: boilerplate content in the product data

The challenges
• Incompatibilities between different datasets for a marketplace
• different names / types / encoding for same variable
• Scalability: comparing every product variant with every other one
• Missing & noisy data (formatting tags, ids etc.)
• automated preprocessing hard due to individual business’s data specificities
• Multiple data types

What does product data look like?
{
  "id": "df2ecef4-fd68-4000-8740-0e5639dff471",
  "version": 17,
  "name": "Clutch DKNY grey",
  "description": "Classic clutch with multiple compartments and a sleek design.",
  "categories": [],
  "masterVariant": {
    "prices": [{"currencyCode": "EUR", "centAmount": 8750, "fractionDigits": 2}],
    "images": [{"url": "https://082034_1_large.jpg", "dimensions": {"w": 0, "h": 0}}],
    "attributes": [
      {"matrixId": "A0E2000000026I5"},
      {"designer": "DKNY"},
      {"size": "one size"},
      {"color": "grey"},
      {"style": "sporty"},
      {"gender": "women"},
      {"season": "s15"},
      {"isOnStock": true}]
  },
  "variants": [],
  "createdAt": "2017-07-10T14:05:13.665Z"
}

…of multiple data types
• name, description: text
• attribute values: array of multiple different data types (text, numerical, boolean, sets)
• price, variant count: numerical

Leveraging multiple product data sources

• Different data types
• smaller independent components, each calculating the similarity for its own data source
• Users can
• select which data sources to include in the computation
• specify which should have a stronger influence
Data flow
product data → similarity component → weighted score
• Name ("Clutch DKNY grey") → text similarity → name similarity × W1
• Description ("Classic clutch with multiple compartments and a sleek design") → text similarity → description similarity × W2
• Price (8750) → numerical similarity → price similarity × W3
• Variant count (3) → numerical similarity → variantCount similarity × W4
• Attributes (["color": "gray", "style": "sporty", "gender": "women", …]) → mixed data similarity → attribute similarity × W5
• Σ: the weighted scores are summed, with user-defined weights W1 to W5
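
The weighted combination in the data flow can be sketched as follows. This is a minimal illustration, not the actual API code: the function name `combined_similarity`, the example scores, and the choice to divide by the sum of the weights (so the result stays in [0, 1]) are all assumptions. A weight of 0 stands for a data source the user left out of the computation.

```python
def combined_similarity(scores, weights):
    """Weighted average of per-source similarity scores, each in [0, 1]."""
    # Only sources with a positive weight take part in the computation.
    active = {name: w for name, w in weights.items() if w > 0 and name in scores}
    total_weight = sum(active.values())
    if total_weight == 0:
        raise ValueError("at least one data source needs a positive weight")
    # Dividing by the weight sum is an assumption; the slide only shows a sum.
    return sum(scores[name] * w for name, w in active.items()) / total_weight


# Hypothetical component scores for one pair of products.
scores = {"name": 0.9, "description": 0.4, "price": 1.0,
          "variantCount": 0.5, "attributes": 0.7}
# The user excludes description and variant count and doubles the name weight.
weights = {"name": 2.0, "description": 0.0, "price": 1.0,
           "variantCount": 0.0, "attributes": 1.0}

result = combined_similarity(scores, weights)
```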

Breaking it down
• Text similarity for names & descriptions
• Hashing Vectorizer (scikit-learn): a text vectorizer similar to Count Vectorizer that does not keep a vocabulary, which gives a faster response time
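
A small sketch of text similarity with scikit-learn's `HashingVectorizer`. The slides name the vectorizer but not the comparison metric, so cosine similarity is an assumption here, as are the `n_features` value and the second and third product names.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = [
    "Clutch DKNY grey",
    "DKNY clutch, grey",      # hypothetical near-duplicate
    "Running shoes blue",     # hypothetical unrelated product
]

# HashingVectorizer is stateless: tokens are hashed straight to column indices,
# so there is no vocabulary to fit or store and transform() works without fit().
vectorizer = HashingVectorizer(n_features=2**12, alternate_sign=False, norm="l2")
vectors = vectorizer.transform(names)

# Pairwise cosine similarity between all product names (an assumed metric).
name_similarity = cosine_similarity(vectors)
```

The first two names contain the same tokens, so their similarity is close to 1, while the third shares none of them.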

• Numerical similarity for prices & variant count: using absolute distance of scaled values
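
A minimal sketch of this numerical similarity. The slides only say "scaled values", so min-max scaling and the conversion `similarity = 1 - distance` are assumptions, and `scaled_abs_similarity` is a hypothetical helper name.

```python
import numpy as np


def scaled_abs_similarity(values):
    """Pairwise similarity 1 - |a - b| after min-max scaling to [0, 1]."""
    x = np.asarray(values, dtype=float)
    span = x.max() - x.min()
    # A constant column carries no information: treat every pair as identical.
    scaled = np.zeros_like(x) if span == 0 else (x - x.min()) / span
    distance = np.abs(scaled[:, None] - scaled[None, :])
    return 1.0 - distance


# centAmount values; only 8750 comes from the example product.
prices = [8750, 9000, 25000]
price_similarity = scaled_abs_similarity(prices)
```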

• Mixed data similarity for attributes

Attributes: Mixed Data Similarity
• Arrays of numerical, categorical, boolean & multi-valued features
• No common similarity metric to compare all types
• Approach to handle different data types based on Gower distance:
• Calculate distances between two instances differently for each variable type & combine them in a final (weighted) distance score
• Distance between missing values?
• The distance between a missing value & any other value should be the maximum (1.0)
• Which distance metric should be used for every type?
acidity | color | contents | country  | country_availability | foods                         | available
5,7 g/l | white | 750.0    | Italy    | ['DE', 'AT']         | ['vegetarian', 'poultry']     | TRUE
9,0 g/l | red   | 1500.0   | France   | ['DE', 'AT', 'FR']   | ['seafood', 'fish']           | TRUE
5,4 g/l | red   | 750.0    | Portugal | ['DE', 'AT']         | ['lamb', 'beef']              | TRUE
4,4 g/l | red   | 750.0    | Germany  | ['DE', 'AT', 'IT']   | ['poultry', 'pork', 'beef']   | FALSE
6,6 g/l | rosé  | 750.0    | Austria  | ['DE', 'AT']         | ['seafood', 'fish', 'poultry'] | FALSE
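
The Gower-based idea can be sketched as a weighted mean of per-attribute distances. This is an illustration under assumptions: `gower_distance` and the three lambda metrics are hypothetical, and treating a pair where both values are missing as maximum distance is a guess, since the slide only covers a missing value against any other value.

```python
def gower_distance(a, b, metrics, weights=None):
    """Weighted mean of per-attribute distances, each expected in [0, 1]."""
    weights = weights or {name: 1.0 for name in metrics}
    total, weight_sum = 0.0, 0.0
    for name, metric in metrics.items():
        left, right = a.get(name), b.get(name)
        if left is None or right is None:
            d = 1.0  # missing value: maximum distance
        else:
            d = metric(left, right)
        total += weights[name] * d
        weight_sum += weights[name]
    return total / weight_sum


# Hypothetical per-type metrics for three of the wine attributes.
metrics = {
    "color": lambda x, y: 0.0 if x == y else 1.0,            # categorical
    "contents": lambda x, y: abs(x - y) / (1500.0 - 750.0),  # numerical, range-scaled
    "available": lambda x, y: float(abs(int(x) - int(y))),   # boolean as 0/1
}

wine_a = {"color": "white", "contents": 750.0, "available": True}
wine_b = {"color": "red", "contents": 1500.0, "available": True}
wine_c = {"color": "red", "contents": None, "available": True}  # contents missing

d_ab = gower_distance(wine_a, wine_b, metrics)  # color and contents differ
d_bc = gower_distance(wine_b, wine_c, metrics)  # only the missing value counts
```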

Mixed Data Similarity: Numerical
• Numerical attributes
• Euclidean distance

Mixed Data Similarity: Boolean
• Boolean attributes
• Converted to numerical values and treated as numerical
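
A short sketch of that conversion, assuming NumPy and SciPy: booleans are cast to 0.0 / 1.0 and then go through the same Euclidean distance used for numerical attributes. The `available` values are taken from the wine example.

```python
import numpy as np
from scipy.spatial.distance import cdist

available = np.array([True, True, True, False, False])

# One column with values 0.0 / 1.0; cdist expects a 2-D array.
as_numbers = available.astype(float).reshape(-1, 1)

# For a single 0/1 column the Euclidean distance is 0 (same) or 1 (different).
d_available = cdist(as_numbers, as_numbers, metric="euclidean")
```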

Mixed Data Similarity: Multi-valued
• Multi-valued attributes
• Jaccard similarity (coefficient) between two sets of values: size of their intersection divided by the size of their union
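
A minimal sketch of the Jaccard coefficient on values from the wine example. The helper name is hypothetical, and returning 1.0 for two empty sets is an assumption (the definition is undefined in that case).

```python
def jaccard_similarity(left, right):
    """Size of the intersection divided by the size of the union."""
    a, b = set(left), set(right)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


# country_availability of the first two wines: 2 shared values out of 3.
availability = jaccard_similarity(["DE", "AT"], ["DE", "AT", "FR"])

# foods of the first and third wine: no overlap.
foods = jaccard_similarity(["vegetarian", "poultry"], ["lamb", "beef"])
```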

Mixed Data Similarity: Categorical
• Typical approaches present disadvantages
• Encoding categorical values as numbers
✗ distances between the encoded values are arbitrary
• One Hot Encoding
✗ high dimensionality
• Measure whether values are identical or not
• Hamming distance (SciPy’s cdist)
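
A sketch of the Hamming approach with SciPy's `cdist`, using the color and country columns of the wine example. `cdist` needs numeric input, so the labels are first mapped to integer codes; the helper `to_codes` is hypothetical. The codes are arbitrary, which is harmless here because Hamming distance only checks whether two values are equal.

```python
import numpy as np
from scipy.spatial.distance import cdist

colors = ["white", "red", "red", "red", "rosé"]
countries = ["Italy", "France", "Portugal", "Germany", "Austria"]


def to_codes(values):
    """Map category labels to integer codes so cdist can process them."""
    _, codes = np.unique(np.asarray(values), return_inverse=True)
    return codes


# One row per product, one column per categorical attribute.
X = np.column_stack([to_codes(colors), to_codes(countries)]).astype(float)

# Entry (i, j) is the fraction of categorical attributes on which i and j differ.
d_categorical = cdist(X, X, metric="hamming")
```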

Mixed Data Similarity: Categorical
• Maybe better to handle as text?
• Hard without supervision
• Not always meaningful & safe
• Computationally expensive
Text-based comparison is only enabled for small product sets with a limited number of nominal attributes, and is based on the Levenshtein distance
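
For reference, a plain-Python sketch of the Levenshtein distance. Dividing by the longer string's length to get a value in [0, 1] is an assumption; the slides do not say how the distance is scaled. Each comparison costs time proportional to the product of the two string lengths, which is one reason the approach is expensive on large product sets.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


spelling_distance = normalized_levenshtein("grey", "gray")
```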

Attribute selection & weight assessment
• Some attributes are irrelevant or add noise
• Strongly influenced by customer’s data & patterns
• Hard to automate “universally” without inspection of data
• Some attributes are more useful
• They should have a higher impact on the final score
• We need attribute weights and a metric of “importance” to define them
• Variation & density of values as indicators of the discriminative ability & importance of attributes

Assessing variation in different variable types
• No single variation metric applicable to every data type
• Experimented with different variation metrics (standard deviation, variance, variation ratio, ...)
• All tied to the data type
• Cannot be compared with one another
• One-for-all variation counterpart: entropy (Shannon’s entropy) of the values
• Measure of randomness in data
• Not influenced by values, only by their distributions

Entropy to the rescue
• How?
• Treat all data values as distinct categories & take the normalized entropy (so H ∈ [0, 1])
• High entropy generally indicates high variation
• Remove attributes with entropy (almost) equal to 1, because their values follow a uniform distribution
• Very low entropy when most data points fall in one value
• Not much discriminative ability
• Not a perfect match but a good proxy
• The final attribute importance weight: (Entropy) x (Density)
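
A sketch of the entropy-times-density weight. Several details are assumptions because the slides do not specify them: entropy is normalized by the log of the number of distinct values, density is taken as the share of non-missing values, and the cut-off `max_entropy=0.99` for "almost 1" is a hypothetical threshold.

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Shannon entropy of the value distribution, scaled to [0, 1]."""
    # str() turns every value (including lists) into a plain label.
    counts = Counter(str(v) for v in values)
    if len(counts) <= 1:
        return 0.0  # a single value (or no value) has no variation
    n = sum(counts.values())
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(len(counts))  # assumed normalizer: log(#distinct values)


def attribute_weight(column, max_entropy=0.99):
    """Importance weight = entropy x density; 0.0 for removed attributes."""
    if not column:
        return 0.0
    present = [v for v in column if v is not None]
    density = len(present) / len(column)  # assumed: share of non-missing values
    entropy = normalized_entropy(present)
    if entropy >= max_entropy:
        return 0.0  # (almost) uniform distribution: attribute is dropped
    return entropy * density


w_color = attribute_weight(["white", "red", "red", "red", "rosé"])
w_country = attribute_weight(["Italy", "France", "Portugal", "Germany", "Austria"])
w_sparse = attribute_weight(["a", "b", "b", None])
```

In the wine example the country column has a different value in every row, so its entropy is 1 and it is removed, while color keeps a weight between 0 and 1.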

Outcome
• An API that
• leverages multiple data sources
• is flexible to customize based on use case & data specificities
• Duplicate detection: include only, or weight more heavily, the data sources prone to duplicate content
• Content-based product recommendation: rely more on attributes, names, prices
• Knowledge gained on a real business case that is not widely covered & explored

Read all about it!
Our API docs
https://bit.ly/2XjvzIc
Our tech blog post
https://bit.ly/32L5SBC