“Unnatural” Language Processing
How NLP helps us map our product catalog for recommendations, search & product development

- "The question is," said Alice, "whether you can make words mean so many different things."
- "The question is," said Humpty Dumpty, "which is to be master - that's all."
(Lewis Carroll)
Motivation
● Deliver “good” product recommendations in product
pages, search, dashboards, etc.
● Improve existing categories & identify gaps in the catalog
● Make this scalable & automatic
So… why not just recommend “best sellers”?
● Cannibalisation: Pushing merchants into saturated markets can disrupt existing businesses
● Privacy: Recommending niche
products based on sales is a
sensitive topic
● Healthy ecosystem: We want
to expose more of our catalog,
not just a few “rockstars”
Let’s focus on what each merchant does well (sell) and their interests (search, select, etc.)
Motivation (again)
● Deliver “good” similar* product recommendations in
product pages, etc.
● Improve existing categories & identify gaps in the catalog
● Scalability & automation
* Assumption: merchants that are able to sell, or are interested in, a product will also sell, or show interest in, “similar” products.
Sources of information on our products
● Manual tagging: our categories
● Suppliers: product description, image upload
● Merchants: product description customisation, choice of
images, search queries, etc.
● Performance Metrics: imports/orders/disputes/... as an
indication of quality
WHAT’S IN THIS IMAGE?
Wait, did you just say “we have images”?
Yes. We have multiple images for every product, so let’s play...
I wish someone was actually trying to tell me what’s in the pic...
● Room Aromarizer (not a mystery box)
● French fries slicer (can you even see the potato?)
● Phone cover (not woman / phone / shadow)
● Plant pot (not snake / cactus)
Good news: Our suppliers are trying to!
Category tree
● Dozens of categories on different levels
● Uneven distribution of products to categories
➔ Not specific enough for recommendations

Product title
● A short (usually 10-18 words) succinct description of the product
● No standards or structure
➔ Let’s try!

Product description
● Uploaded HTML structure
● Mixes the good (metadata on material, sizes) with the bad (free-text description, marketing phrases)
➔ Hard to extract data...

Example category tree: Catalog → Consumer Electronics (Cables, Speakers, ...), Women’s Fashion (Dresses, Swimwear, ...)
What’s in a Title?
● Product descriptions are natural language: there is context to correct spelling, tag PoS and extract topics with NLP.
  Example: “This wonderful product! It comes in a beutiful green colo(u)r. The smooth silicon(e) surface guarantees perfect grip. This is the best protection for your precious iPhone from iCovers inc. 2018 collection!”
  Tags: Stopwords / Spelling errors / Spelling variants / Marketing phrases / Brand names / Nouns / Verbs / Adj / Adv / punctuation …
● Tags, keywords & metadata are a curated closed list, indexed & organised.
  Keyword → class: Phone cover → category, iPhone → brand, silicone → material, green → color, 140x60mm → dimensions, smooth → texture
● Titles are not sentences (context-based NLP does not work well), but have no uniformity either.
  Example: ID 12345 → “Green silicon(e) iphone cover, 140x60mm smooth with good grip best quality 2018 collection excelent price”; ID 12346 → ...
Similarity of titles
● Similarity of meaning: another “hard” problem, especially without context
  (“Where’s the toilet?” vs. “Show me the way to the Ladies please!”)
● Similarity of words: if two descriptions share a lot of similar words and don’t have a lot of dissimilar words, then the two underlying products are somehow similar (“Jaccard index”, sketched below)
● Other signals: similar letter frequencies, sentence length, punctuation, register, …
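A minimal sketch of the Jaccard idea on two cleaned keyword sets (the example tokens are made up, not real catalog data):

jaccard <- function(a, b) {
  # share of keywords that appear in both titles, out of all keywords in either
  a <- unique(tolower(a)); b <- unique(tolower(b))
  length(intersect(a, b)) / length(union(a, b))
}
title_a <- c("green", "silicone", "iphone", "cover")
title_b <- c("blue", "silicone", "iphone", "cover", "slim")
jaccard(title_a, title_b)  # 3 shared keywords out of 6 distinct ones = 0.5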
Toolkit
● Data source: Product information from our DWH
(via Redshift JDBC)
● Scripting: R/tidyverse
● Spell correction & stemming: Hunspell
(http://hunspell.github.io/) via R package
(https://cran.r-project.org/package=hunspell)
● POS tagging: spaCy (https://spacy.io/) via R package (https://cran.r-project.org/package=spacyr)
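For illustration only, a rough sketch of the kind of calls this toolkit boils down to (the example words are arbitrary; spacyr also needs a spaCy language model installed):

library(hunspell)
hunspell_check(c("silicone", "excelent"))   # TRUE FALSE: flags the misspelling
hunspell_suggest("excelent")                # suggestions include "excellent"
hunspell_stem("dresses")                    # stems to "dress"

library(spacyr)
spacy_initialize()                          # loads the default English model
spacy_parse("Green silicone iphone cover", pos = TRUE, lemma = TRUE)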
How the standard NLP workflow went wrong
● Over-Tokenization: “19mm” ⇒ [19,mm] : [NUM,PROPN]
● Wrong lemmas: “sleeved” ⇒ “sleev” but “sleeve” ⇒ “sleeve”
● PoS tag inconsistency: “nine” is NOUN, NUM and stopword(?)
Our solution
● Selective tokenization: we manually strip common punctuation and then split by whitespace (better safe than sorry)
● Single-word NLP for lemmatization and tagging (consistent)
Pipeline: Tokenize → Spelling → Stop words → PoS Tagging → Lemmatization
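A rough sketch of the selective tokenization plus single-word lemmatization idea (the punctuation set and helper name are our own choices, not a library API):

library(hunspell)

tokenize_title <- function(title) {
  # strip a conservative set of punctuation, then split on whitespace
  cleaned <- gsub("[,.;:!?()\\[\\]\"]", " ", title)
  tokens  <- unlist(strsplit(trimws(cleaned), "\\s+"))
  tokens[tokens != ""]
}

tokens <- tokenize_title("Green silicone iphone cover, 140x60mm smooth grip!")
hunspell_stem(tokens)   # lemmatize each token on its own, with no sentence context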
This is not quite enough...
Matching candidate products against the keywords {iPhone, cover, silicone, green}:
● Candidate 1: iPhone ✔, cover ✔, silicone ✔, green ✗
● Candidate 2: iPhone ✔, cover ✔, silicone ✗, green ✗
● Candidate 3: iPhone ✗, cover ✔, silicone ✗, green ✔
Let’s apply some “world knowledge”
● Extract expression metadata
○ Phrases: “Free shipping” / “Wholesale”
○ Expressions that describe product features in the
description (prices, pieces, sets, etc.)
● We also manage a list of expressions we want to remove completely:
Marketing phrases, “Drop shipping”, “Best price”, ...
Phrase in title      %
Pieces               8.9%
Sale phrases         4.3%
“Drop Shipping”      1.1%
“Wholesale”          1.1%
“Free Shipping”      0.5%
Sets                 0.4%

(Charts: how many “sets” / “pieces” appear per title; “Wholesale” shows up both as a genuine wholesale domain and as a likely mistake.)
Cleanup mechanism (worked example, sketched in code below)

Input title:
ID 12345 → “BRAND-B Mans shirt, 3set dark blue casual slimfit. Great price with free shipping!”

Rule lists applied:
● Brand list (keep?): Brand A ✓, Brand B ✗, ... (curated manually)
● From → To list: “Mans” → “men”, “Men’s” → “men”, ...
● Regex list for structural metadata: [0-9]+\s*[Ss]et, ... (example matches: “3”, <NA>, “set”)
● Phrase list for removal: ([Gg]reat|[Bb]est)\s*[Pp]rice, [Ff]ree\s*[Ss]hipping\W, ...
● Stop words: for, with, I, ...
● Spell check: words that fail get suggestions (“slimfit” ⇒ slim fit / slim-fit / slimmest), and rules decide what to do with them

Output keywords:
ID 12345 → men, shirt, dark, blue, casual, slimfit | pieces: 3 | ...
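A hedged stringr sketch of how rule lists like these could be applied (regexes as on the slide; the lists here are toy examples, not our curated ones):

library(stringr)

title <- "BRAND-B Mans shirt, 3set dark blue casual slimfit. Great price with free shipping!"

# extract structural metadata before removing it from the title
pieces <- str_match(title, "([0-9]+)\\s*[Ss]et")[, 2]                      # "3"

# drop marketing phrases, then apply the From -> To normalisation list
title  <- str_remove_all(title, "([Gg]reat|[Bb]est)\\s*[Pp]rice|[Ff]ree\\s*[Ss]hipping")
title  <- str_replace_all(title, c("Mans" = "men", "Men's" = "men", "[0-9]+\\s*[Ss]et" = ""))

# tokenize and drop stop words; the brand list would be consulted separately
keywords <- setdiff(tolower(unlist(str_split(title, "[\\s[:punct:]]+"))), c("for", "with", "i", ""))
keywords   # "brand" "b" "men" "shirt" "dark" "blue" "casual" "slimfit"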
Things we discovered:
● We can safely remove stop words, except
when we can’t (e.g. “top”)
● Spelling mistakes are useful!
○ No suggestion: try to extract metadata
(model number, measurements, …)
○ With suggestion: curate brand names
● We can’t remove/fix spelling automatically
(e.g. “splitter”)
● When in doubt, avoid stemming
More “world knowledge”
● Industry specific: e.g. “top” is not a stopword
● Lexical: colors, materials, brands, device models, …
● Structural: hashtags, capacity, volume, size, …
● Synonyms / Antonyms
● … (a long list)
ID 12345 → Keywords: men: gender, shirt, dark: shade, blue: color, casual, slimfit: style | pieces: 3 | ...
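A toy version of this lexical tagging step (the lexicon here is illustrative, not our real lists):

lexicon <- c(men = "gender", dark = "shade", blue = "color", silicone = "material", slimfit = "style")

tag_keywords <- function(kw) {
  # annotate keywords that appear in the lexicon with their class, leave the rest untouched
  ifelse(kw %in% names(lexicon), paste0(kw, ": ", lexicon[kw]), kw)
}

tag_keywords(c("men", "shirt", "dark", "blue", "casual", "slimfit"))
# "men: gender" "shirt" "dark: shade" "blue: color" "casual" "slimfit: style"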
And for the rest…
● A working assumption is that nouns describe the object and adjectives are its properties.
● Not having natural language in the title means no context for tagging (PoS), but we can make some simplifying assumptions (e.g. past-tense verbs like “coloured” are adjectives, etc.)
Grammar to the rescue(!|?)
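A spacyr sketch of single-word tagging plus the simplifying assumption above (the past-participle-to-adjective remap is our own rule, not spaCy behaviour):

library(spacyr)
spacy_initialize()

# tag each keyword on its own (no sentence context), keeping the fine-grained tag
tags <- spacy_parse(c("coloured", "sleeved", "blue", "shirt"), pos = TRUE, tag = TRUE, lemma = TRUE)

# simplifying assumption: past-tense / past-participle verbs in titles behave like adjectives
tags$pos[tags$tag %in% c("VBD", "VBN")] <- "ADJ"
tags[, c("token", "lemma", "pos")]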
Why tagging is important (example)
● “Top quality Elegant Blue men’s shirt” ⇒
{elegant:adj, shirt:noun, man:gender, blue:color}
● Nouns should have higher priority for the similarity calculation than adjectives.
● Gender is used as a filter: we should exclude any product that refers to the opposite gender (“woman”/“girl”), but also exclude other age groups like “kid” or “baby”. A similar mechanism is applied for brands (Android makers vs. iPhone for phone accessories).
● Color can be used either for better matching (as a secondary criterion) or for recommending other colors (and the same for materials and other properties).
Implementation
● Browsing: Similar products (local), Sub-sub-...
categories, Catalog maintenance, etc.
● Personalized & context-based recommendations
● Feed tags to ElasticSearch
● Ideas for product collections
(Diagram: sub-category A vs. sub-category B, with overlaps flagged as possible errors)
In Conclusion
● First of all, make sure you understand your data provenance (e.g. we thought our merchants were describing their products), but always validate your assumptions (we found marketing phrases & metadata). Look at your data!
● “Out of the box” tools are awesome, but their default models are trained on very specific datasets. Tune or replace the components that don’t work for you, and use the errors / issues you find to discover more about your data (e.g. spelling mistakes led us to discover a new domain of metadata in the title).
● When presented with the right type of aggregated data product (list, node-graph, word-cloud), a human-in-the-loop can solve at a glance a lot of problems that are very hard to solve algorithmically. Use your colleagues!
So…. Why R?
● Because it’s an awesome tool to look at your data!
● Because it’s easy to pull in different data sources and frameworks super quickly (thanks, CRAN) and integrate them together (data frames as first-class citizens)
● Because you can do things quickly and in a tidy way, so it’s an awesome tool for prototyping
● Because notebooks are awesome (we <3 reproducibility)
We’re Hiring!
But words can be tricky
● Not all words have the same importance (e.g. “the”, “a” vs. “shirt”, “shoe”)
● Words can change their appearance but still mean the same thing for us:
○ Plural / singular (“dress” vs. “dresses”, “man” vs. ”men” vs. ”men’s”)
○ Synonyms (“autumn” vs. ”fall”, and maybe “Halloween” too?)
○ Spelling variants (“colour” vs. “color”).
● Homographs (word of the day!) - same spelling, different meaning
(e.g. “band”, “rock”)
● Word amalgamation: Colloquially, words can join and split (“handsfree” vs.
”hands free”) so we can’t always analyse on a single-word level (n-grams)
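A tiny sketch of the n-gram idea for joined/split words (the token vector is made up):

tokens  <- c("hands", "free", "phone", "cover")
bigrams <- paste0(head(tokens, -1), tail(tokens, -1))   # "handsfree" "freephone" "phonecover"
# checking single tokens and joined bigrams against the keyword lists lets
# "hands free" and "handsfree" end up as the same keyword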
Implementation III: Catalog maintenance
● Products with no/few recommendations: check for mis-labelling (wrong category?)
● Popular adjectives / adverbs can lead to new collections
● Product features for the supplier system:
  ○ Spell check
  ○ More/better input fields
  ○ Warning when we get marketing phrases or metadata in the title
  ○ Score “title quality” - can we penalise mis-formed titles/text?
How to decide what to match first?
● High specificity => lower priority for recommendations but higher priority for
search
○ Example: “Phone cover” vs. “Blue phone cover”
● Our approach:
○ Recommendations: Try to make a full match but relax the matching strictness in a clever way
○ Search (beta): specific tags can act as filters, but this is very sensitive to user persona (laser-
focussed searcher, search / browse, “just looking”)
Known issues
● Fixable: Words in French & Spanish - identify with spell check and translate / remove
● Probably fixable: A standard approach to splitting amalgamations (“handsfree” vs. “hands free”)
  ○ For splitting we can potentially use spell suggestions (so far we cannot guarantee 100% accuracy)
  ○ We can curate a list and use REs to standardize ([Oo]ver\s?[Ss]ize[d]? ⇒ oversized)
● Hard: Better stemming / synonym / PoS libraries
● Hard: Extend lexical tagging (create or “get” lists): brands, model names, phrases
Implementation I: Similar products
● Deliverable: A list of similar products for the product detail page
● Scope: all products (part of the re-design)
● Mechanism: for each product we calculate a similarity index between its “cleaned” title and those of all other products in the same child category (sketched below)
● API: TBD; the current implementation creates a TSV file per product (70k files in an S3 bucket)
● Extensions:
  ○ Look into parent categories or the entire catalog
  ○ Weight words by classification (e.g. noun > adj > …)
● Monitoring: imports from related products
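A rough offline sketch of this mechanism (the toy data, column names and file naming are assumptions):

library(dplyr)
library(readr)

jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

products <- tibble(
  product_id = c("12345", "12346", "12347"),
  category   = "phone-covers",
  keywords   = list(c("iphone", "cover", "silicone", "green"),
                    c("iphone", "cover", "leather", "black"),
                    c("samsung", "cover", "silicone", "blue"))
)

for (i in seq_len(nrow(products))) {
  peers <- filter(products, category == products$category[i], product_id != products$product_id[i])
  sims  <- tibble(product_id = peers$product_id,
                  similarity = vapply(peers$keywords, jaccard, numeric(1), b = products$keywords[[i]])) %>%
    arrange(desc(similarity))
  write_tsv(sims, paste0("similar_", products$product_id[i], ".tsv"))   # one TSV file per product
}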
Implementation II: Shop recommendations
● Deliverable: a personalized list per shop
● Mechanism (sketched below):
  ○ For each product sold by the shop in the last 30 days we find similar products
  ○ We sum up the similarities and rank by the sum (so a product that is very similar to one product, or slightly similar to many products, will be ranked high)
  ○ We remove products already imported by the merchant
  ○ We repeat the same process for imports
  ○ We choose the top 48 products (orders first)
● API: we inject these recommendations into the existing ETL
● Monitoring: imports via recommended products
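A hedged dplyr sketch of the ranking step (table and column names are assumptions, with toy data standing in for the DWH):

library(dplyr)

# toy inputs: similarity of candidate products to products the shop sold recently
similar_products  <- tibble(shop_id      = "shop1",
                            candidate_id = c("p1", "p2", "p2", "p3"),
                            similarity   = c(0.9, 0.4, 0.5, 0.8))
imported_products <- tibble(shop_id = "shop1", candidate_id = "p3")

recommendations <- similar_products %>%
  group_by(shop_id, candidate_id) %>%
  summarise(score = sum(similarity), .groups = "drop") %>%     # many weak matches add up like one strong match
  anti_join(imported_products, by = c("shop_id", "candidate_id")) %>%
  group_by(shop_id) %>%
  slice_max(score, n = 48, with_ties = FALSE) %>%              # top 48 per shop (orders first in the real flow)
  ungroup()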
Tentative Roadmap
● V1 - Current implementation: all words that survived the cleaning & stemming process have equal importance
● V1.1 - Better local similarity: improved calculation - match keyword groups separately and weight the similarities, and use gender & brand for exclusions
● V1.2 - Global similarity: solve scale issues and add AE products
● V1.2p - Harden the system: Python? Containers?
● V2 - Description: deep dive into HTML parsing, NLP, etc.
Similarity:
● Easy to calculate per product offline (a file per product)
● Real-time functionality requires more dev(ops)
● Still have some scale issues
● If we want to include AE products the problem grows exponentially (BIG data)
(Local similarity vs. global similarity)
Implementation IV: Search
● Upload tagged keywords to the ElasticSearch DB
● Match exclusionary / lexical / structural keywords: hard-coded lists for lexical keywords, REs for structural expressions
● Apply weights to improve relevance (toy sketch below): Query = “Blue women’s pants”
  ○ Match “pants” = +10 points
  ○ Match “woman” = +5 points
  ○ Match {“men”, ”child”, ”boy”, …} = -10e7 points (or filter)
  ○ Match “blue” = +1 point
  And we can decide whether a green pair of women’s pants should rank higher than a blue pair of pants (no gender mentioned) by using a different score.
● Research realtime PoS tagging in ElasticSearch
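A toy R sketch just to make the weighting arithmetic concrete (weights from the slide; the real scoring lives in ElasticSearch):

score_match <- function(matched) {
  weights <- c(category = 10, gender = 5, color = 1)
  score   <- sum(weights[intersect(matched, names(weights))])
  if ("opposite_gender" %in% matched) score <- score - 10e7   # effectively a filter
  score
}

score_match(c("category", "gender", "color"))    # full match on "blue women's pants": 16
score_match(c("category", "color"))              # blue pants, no gender mentioned: 11
score_match(c("category", "opposite_gender"))    # men's pants: huge negative score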


Editor's Notes

• #21 It should probably get a rank “boost” (ranked higher, but not necessarily always, e.g. if the nouns match better without a mention of gender)