“Unnatural” Language Processing
How NLP helps us map our product catalog for recommendations, search & product development

- "The question is," said Alice, "whether you can make words mean so many different things."
- "The question is," said Humpty Dumpty, "which is to be master - that's all."
(Lewis Carroll)
Motivation
● Deliver “good” product recommendations in product
pages, search, dashboards, etc.
● Improve existing categories & identify gaps in the catalog
● Make this scalable & automatic
So… why not just recommend “best sellers”?
● Cannibalisation: Pushing merchants into saturated markets can disrupt existing businesses
● Privacy: Recommending niche
products based on sales is a
sensitive topic
● Healthy ecosystem: We want
to expose more of our catalog,
not just a few “rockstars”
Let’s focus on what each merchant does well (sell) and their interests (search, select, etc.)
Motivation (again)
● Deliver “good” similar* product recommendations in
product pages, etc.
● Improve existing categories & identify gaps in the catalog
● Scalability & automation
* Assumption: merchants that are able to sell, or are interested in, a product will also sell, or show interest in, “similar” products.
Sources of information on our products
● Manual tagging: our categories
● Suppliers: product description, image upload
● Merchants: product description customisation, choice of
images, search queries, etc.
● Performance Metrics: imports/orders/disputes/... as an
indication of quality
WHAT’S IN THIS IMAGE?
Wait, did you just say “we have images”?
Yes. We have multiple images for every product, so let’s play...
I wish someone was actually trying to tell me what’s in the pic...
● Room Aromarizer (not a mystery box)
● French fries slicer (can you even see the potato?)
● Phone cover (not woman / phone / shadow)
● Plant pot (not snake / cactus)
Good news: Our suppliers are trying to!
Category tree
● Dozens of categories on different levels
● Uneven distribution of products to categories
➔ Not specific enough for recommendations

Product title
● A short (usually 10-18 words) succinct description of the product
● No standards or structure
➔ Let’s try!

Product description
● Uploaded HTML structure
● Mixes the good (metadata on material, sizes) with the bad (free-text description, marketing phrases)
➔ Hard to extract data...

Example category tree: Catalog → Consumer Electronics (Cables, Speakers, ...), Women’s Fashion (Dresses, Swimwear, ...)
What’s in a Title?
● Product descriptions are natural language: there is context to correct spelling, tag PoS and extract topics with NLP.
  Example: “This wonderful product! It comes in a beutiful green colo(u)r. The smooth silicon(e) surface guarantees perfect grip. This is the best protection for your precious iPhone from iCovers inc. 2018 collection!”
  Tags: Stopwords / Spelling errors / Spelling variants / Marketing phrases / Brand names / Nouns / Verbs / Adj / Adv / punctuation …
● Tags, keywords & metadata are a curated closed list, indexed & organised.
  Keyword → class: Phone cover → category, iPhone → brand, silicone → material, green → color, 140x60mm → dimensions, smooth → texture
● Titles are not sentences (context-based NLP does not work well), but have no uniformity either.
  Example: ID 12345 → “Green silicon(e) iphone cover, 140x60mm smooth with good grip best quality 2018 collection excelent price”; ID 12346 → ...
Similarity of titles
● Similarity of meaning: another “hard” problem, especially without context
  (“Where’s the toilet?” vs. “Show me the way to the Ladies please!”)
● Similarity of words: if two descriptions share a lot of similar words and don’t have a lot of dissimilar words, then the two underlying products are somehow similar (“Jaccard index”, sketched below)
● Other signals: similar letter frequencies, sentence length, punctuation, register, …
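A minimal sketch of the Jaccard idea on two cleaned keyword sets (the example tokens are made up, not real catalog data):

jaccard <- function(a, b) {
  # share of keywords that appear in both titles, out of all keywords in either
  a <- unique(tolower(a)); b <- unique(tolower(b))
  length(intersect(a, b)) / length(union(a, b))
}
title_a <- c("green", "silicone", "iphone", "cover")
title_b <- c("blue", "silicone", "iphone", "cover", "slim")
jaccard(title_a, title_b)  # 3 shared keywords out of 6 distinct ones = 0.5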
Toolkit
● Data source: Product information from our DWH
(via Redshift JDBC)
● Scripting: R/tidyverse
● Spell correction & stemming: Hunspell
(http://hunspell.github.io/) via R package
(https://cran.r-project.org/package=hunspell)
● POS tagging: spaCy (https://spacy.io/) via R package (https://cran.r-project.org/package=spacyr)
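For illustration only, a rough sketch of the kind of calls this toolkit boils down to (the example words are arbitrary; spacyr also needs a spaCy language model installed):

library(hunspell)
hunspell_check(c("silicone", "excelent"))   # TRUE FALSE: flags the misspelling
hunspell_suggest("excelent")                # suggestions include "excellent"
hunspell_stem("dresses")                    # stems to "dress"

library(spacyr)
spacy_initialize()                          # loads the default English model
spacy_parse("Green silicone iphone cover", pos = TRUE, lemma = TRUE)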
How the standard NLP workflow went wrong
● Over-Tokenization: “19mm” ⇒ [19,mm] : [NUM,PROPN]
● Wrong lemmas: “sleeved” ⇒ “sleev” but “sleeve” ⇒ “sleeve”
● PoS tag inconsistency: “nine” is NOUN, NUM and stopword(?)
Our solution
● Selective tokenization: we manually strip common punctuation and then split by whitespace (better safe than sorry)
● Single-word NLP for lemmatization and tagging (consistent)
Pipeline: Tokenize → Spelling → Stop words → PoS Tagging → Lemmatization
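A rough sketch of the selective tokenization plus single-word lemmatization idea (the punctuation set and helper name are our own choices, not a library API):

library(hunspell)

tokenize_title <- function(title) {
  # strip a conservative set of punctuation, then split on whitespace
  cleaned <- gsub("[,.;:!?()\\[\\]\"]", " ", title)
  tokens  <- unlist(strsplit(trimws(cleaned), "\\s+"))
  tokens[tokens != ""]
}

tokens <- tokenize_title("Green silicone iphone cover, 140x60mm smooth grip!")
hunspell_stem(tokens)   # lemmatize each token on its own, with no sentence context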
This is not quite enough...
Matching candidate products against the keywords {iPhone, cover, silicone, green}:
● Candidate 1: iPhone ✔, cover ✔, silicone ✔, green ✗
● Candidate 2: iPhone ✔, cover ✔, silicone ✗, green ✗
● Candidate 3: iPhone ✗, cover ✔, silicone ✗, green ✔
Let’s apply some “world knowledge”
● Extract expression metadata
○ Phrases: “Free shipping” / “Wholesale”
○ Expressions that describe product features in the
description (prices, pieces, sets, etc.)
● We also manage a list of expressions we want to remove completely:
Marketing phrases, “Drop shipping”, “Best price”, ...
Phrase in title      %
Pieces               8.9%
Sale phrases         4.3%
“Drop Shipping”      1.1%
“Wholesale”          1.1%
“Free Shipping”      0.5%
Sets                 0.4%

(Charts: how many “sets” / “pieces” appear per title; “Wholesale” shows up both as a genuine wholesale domain and as a likely mistake.)
Cleanup mechanism (worked example, sketched in code below)

Input title:
ID 12345 → “BRAND-B Mans shirt, 3set dark blue casual slimfit. Great price with free shipping!”

Rule lists applied:
● Brand list (keep?): Brand A ✓, Brand B ✗, ... (curated manually)
● From → To list: “Mans” → “men”, “Men’s” → “men”, ...
● Regex list for structural metadata: [0-9]+\s*[Ss]et, ... (example matches: “3”, <NA>, “set”)
● Phrase list for removal: ([Gg]reat|[Bb]est)\s*[Pp]rice, [Ff]ree\s*[Ss]hipping\W, ...
● Stop words: for, with, I, ...
● Spell check: words that fail get suggestions (“slimfit” ⇒ slim fit / slim-fit / slimmest), and rules decide what to do with them

Output keywords:
ID 12345 → men, shirt, dark, blue, casual, slimfit | pieces: 3 | ...
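A hedged stringr sketch of how rule lists like these could be applied (regexes as on the slide; the lists here are toy examples, not our curated ones):

library(stringr)

title <- "BRAND-B Mans shirt, 3set dark blue casual slimfit. Great price with free shipping!"

# extract structural metadata before removing it from the title
pieces <- str_match(title, "([0-9]+)\\s*[Ss]et")[, 2]                      # "3"

# drop marketing phrases, then apply the From -> To normalisation list
title  <- str_remove_all(title, "([Gg]reat|[Bb]est)\\s*[Pp]rice|[Ff]ree\\s*[Ss]hipping")
title  <- str_replace_all(title, c("Mans" = "men", "Men's" = "men", "[0-9]+\\s*[Ss]et" = ""))

# tokenize and drop stop words; the brand list would be consulted separately
keywords <- setdiff(tolower(unlist(str_split(title, "[\\s[:punct:]]+"))), c("for", "with", "i", ""))
keywords   # "brand" "b" "men" "shirt" "dark" "blue" "casual" "slimfit"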
Things we discovered:
● We can safely remove stop words, except
when we can’t (e.g. “top”)
● Spelling mistakes are useful!
○ No suggestion: try to extract metadata
(model number, measurements, …)
○ With suggestion: curate brand names
● We can’t remove/fix spelling automatically
(e.g. “splitter”)
● When in doubt, avoid stemming
More “world knowledge”
● Industry specific: e.g. “top” is not a stopword
● Lexical: colors, materials, brands, device models, …
● Structural: hashtags, capacity, volume, size, …
● Synonyms / Antonyms
● … (a long list)
ID 12345 → Keywords: men: gender, shirt, dark: shade, blue: color, casual, slimfit: style | pieces: 3 | ...
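A toy version of this lexical tagging step (the lexicon here is illustrative, not our real lists):

lexicon <- c(men = "gender", dark = "shade", blue = "color", silicone = "material", slimfit = "style")

tag_keywords <- function(kw) {
  # annotate keywords that appear in the lexicon with their class, leave the rest untouched
  ifelse(kw %in% names(lexicon), paste0(kw, ": ", lexicon[kw]), kw)
}

tag_keywords(c("men", "shirt", "dark", "blue", "casual", "slimfit"))
# "men: gender" "shirt" "dark: shade" "blue: color" "casual" "slimfit: style"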
And for the rest…
● A working assumption is that nouns describe the object and adjectives are its properties.
● Not having natural language in the title means no context for tagging (PoS), but we can make some simplifying assumptions (e.g. past-tense verbs like “coloured” are adjectives, etc.)
Grammar to the rescue(!|?)
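A spacyr sketch of single-word tagging plus the simplifying assumption above (the past-participle-to-adjective remap is our own rule, not spaCy behaviour):

library(spacyr)
spacy_initialize()

# tag each keyword on its own (no sentence context), keeping the fine-grained tag
tags <- spacy_parse(c("coloured", "sleeved", "blue", "shirt"), pos = TRUE, tag = TRUE, lemma = TRUE)

# simplifying assumption: past-tense / past-participle verbs in titles behave like adjectives
tags$pos[tags$tag %in% c("VBD", "VBN")] <- "ADJ"
tags[, c("token", "lemma", "pos")]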
Why tagging is important (example)
● “Top quality Elegant Blue men’s shirt” ⇒
{elegant:adj, shirt:noun, man:gender, blue:color}
● Nouns should have higher priority for the similarity calculation than adjectives.
● Gender is used as a filter: we should exclude any product that refers to the opposite gender (“woman”/“girl”), but also exclude other age groups like “kid” or “baby”. A similar mechanism is applied for brands (Android makers vs. iPhone for phone accessories).
● Color can be used either for better matching (as a secondary criterion) or for recommending other colors (and the same for materials and other properties).
Implementation
● Browsing: Similar products (local), Sub-sub-...
categories, Catalog maintenance, etc.
● Personalized & context-based recommendations
● Feed tags to ElasticSearch
● Ideas for product collections
(Diagram: sub-category A vs. sub-category B, with overlaps flagged as possible errors)
In Conclusion
● First of all, make sure you understand your data provenance (e.g. we thought our merchants were describing their products), but always validate your assumptions (we found marketing phrases & metadata). Look at your data!
● “Out of the box” tools are awesome, but their default models are trained on very specific datasets. Tune or replace the components that don’t work for you, and use the errors / issues you find to discover more about your data (e.g. spelling mistakes led us to discover a new domain of metadata in the title).
● When presented with the right type of aggregated data product (list, node-graph, word-cloud), a human-in-the-loop can solve at a glance a lot of problems that are very hard to solve algorithmically. Use your colleagues!
So…. Why R?
● Because it’s an awesome tool to look at your data!
● Because it’s easy to pull in different data sources and frameworks super quickly (thanks, CRAN) and integrate them together (data frames as first-class citizens)
● Because you can do things quickly and in a tidy way, so it’s an awesome tool for prototyping
● Because notebooks are awesome (we <3 reproducibility)
We’re Hiring!
But words can be tricky
● Not all words have the same importance (e.g. “the”, “a” vs. “shirt”, “shoe”)
● Words can change their appearance but still mean the same thing for us:
○ Plural / singular (“dress” vs. “dresses”, “man” vs. ”men” vs. ”men’s”)
○ Synonyms (“autumn” vs. ”fall”, and maybe “Halloween” too?)
○ Spelling variants (“colour” vs. “color”).
● Homographs (word of the day!) - same spelling, different meaning
(e.g. “band”, “rock”)
● Word amalgamation: Colloquially, words can join and split (“handsfree” vs.
”hands free”) so we can’t always analyse on a single-word level (n-grams)
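A tiny sketch of the n-gram idea for joined/split words (the token vector is made up):

tokens  <- c("hands", "free", "phone", "cover")
bigrams <- paste0(head(tokens, -1), tail(tokens, -1))   # "handsfree" "freephone" "phonecover"
# checking single tokens and joined bigrams against the keyword lists lets
# "hands free" and "handsfree" end up as the same keyword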
Implementation III: Catalog maintenance
● Products with no/few recommendations: check for mis-labelling (wrong category?)
● Popular adjectives / adverbs can lead to new collections
● Product features for the supplier system:
  ○ Spell check
  ○ More/better input fields
  ○ Warning when we get marketing phrases or metadata in the title
  ○ Score “title quality” - can we penalise mis-formed titles/text?
How to decide what to match first?
● High specificity => lower priority for recommendations but higher priority for
search
○ Example: “Phone cover” vs. “Blue phone cover”
● Our approach:
○ Recommendations: Try to make a full match but relax the matching strictness in a clever way
○ Search (beta): specific tags can act as filters, but this is very sensitive to user persona (laser-
focussed searcher, search / browse, “just looking”)
Known issues
● Fixable: Words in French & Spanish - identify with spell check and translate / remove
● Probably fixable: A standard approach to splitting amalgamations (“handsfree” vs. “hands free”)
  ○ For splitting we can potentially use spell suggestions (so far we cannot guarantee 100% accuracy)
  ○ We can curate a list and use REs to standardize ([Oo]ver\s?[Ss]ize[d]? ⇒ oversized)
● Hard: Better stemming / synonym / PoS libraries
● Hard: Extend lexical tagging (create or “get” lists): brands, model names, phrases
Implementation I: Similar products
● Deliverable: A list of similar products for the product detail page
● Scope: all products (part of the re-design)
● Mechanism: for each product we calculate a similarity index between its “cleaned” title and those of all other products in the same child category (sketched below)
● API: TBD; the current implementation creates a TSV file per product (70k files in an S3 bucket)
● Extensions:
  ○ Look into parent categories or the entire catalog
  ○ Weight words by classification (e.g. noun > adj > …)
● Monitoring: imports from related products
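A rough offline sketch of this mechanism (the toy data, column names and file naming are assumptions):

library(dplyr)
library(readr)

jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

products <- tibble(
  product_id = c("12345", "12346", "12347"),
  category   = "phone-covers",
  keywords   = list(c("iphone", "cover", "silicone", "green"),
                    c("iphone", "cover", "leather", "black"),
                    c("samsung", "cover", "silicone", "blue"))
)

for (i in seq_len(nrow(products))) {
  peers <- filter(products, category == products$category[i], product_id != products$product_id[i])
  sims  <- tibble(product_id = peers$product_id,
                  similarity = vapply(peers$keywords, jaccard, numeric(1), b = products$keywords[[i]])) %>%
    arrange(desc(similarity))
  write_tsv(sims, paste0("similar_", products$product_id[i], ".tsv"))   # one TSV file per product
}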
Implementation II: Shop recommendations
● Deliverable: a personalized list per shop
● Mechanism (sketched below):
  ○ For each product sold by the shop in the last 30 days we find similar products
  ○ We sum up the similarities and rank by the sum (so a product that is very similar to one product, or slightly similar to many products, will be ranked high)
  ○ We remove products already imported by the merchant
  ○ We repeat the same process for imports
  ○ We choose the top 48 products (orders first)
● API: we inject these recommendations into the existing ETL
● Monitoring: imports via recommended products
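A hedged dplyr sketch of the ranking step (table and column names are assumptions, with toy data standing in for the DWH):

library(dplyr)

# toy inputs: similarity of candidate products to products the shop sold recently
similar_products  <- tibble(shop_id      = "shop1",
                            candidate_id = c("p1", "p2", "p2", "p3"),
                            similarity   = c(0.9, 0.4, 0.5, 0.8))
imported_products <- tibble(shop_id = "shop1", candidate_id = "p3")

recommendations <- similar_products %>%
  group_by(shop_id, candidate_id) %>%
  summarise(score = sum(similarity), .groups = "drop") %>%     # many weak matches add up like one strong match
  anti_join(imported_products, by = c("shop_id", "candidate_id")) %>%
  group_by(shop_id) %>%
  slice_max(score, n = 48, with_ties = FALSE) %>%              # top 48 per shop (orders first in the real flow)
  ungroup()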
Tentative Roadmap
● V1 - Current implementation: all words that survived the cleaning & stemming process have equal importance
● V1.1 - Better local similarity: improved calculation - match keyword groups separately and weight the similarities, and use gender & brand for exclusions
● V1.2 - Global similarity: solve scale issues and add AE products
● V1.2p - Harden the system: Python? Containers?
● V2 - Description: deep dive into HTML parsing, NLP, etc.
Similarity:
● Easy to calculate per product offline (a file per product)
● Real-time functionality requires more dev(ops)
● Still have some scale issues
● If we want to include AE products the problem grows exponentially (BIG data)
(Local similarity vs. global similarity)
Implementation IV: Search
● Upload tagged keywords to the ElasticSearch DB
● Match exclusionary / lexical / structural keywords: hard-coded lists for lexical keywords, REs for structural expressions
● Apply weights to improve relevance (toy sketch below): Query = “Blue women’s pants”
  ○ Match “pants” = +10 points
  ○ Match “woman” = +5 points
  ○ Match {“men”, ”child”, ”boy”, …} = -10e7 points (or filter)
  ○ Match “blue” = +1 point
  And we can decide whether a green pair of women’s pants should rank higher than a blue pair of pants (no gender mentioned) by using a different score.
● Research realtime PoS tagging in ElasticSearch
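A toy R sketch just to make the weighting arithmetic concrete (weights from the slide; the real scoring lives in ElasticSearch):

score_match <- function(matched) {
  weights <- c(category = 10, gender = 5, color = 1)
  score   <- sum(weights[intersect(matched, names(weights))])
  if ("opposite_gender" %in% matched) score <- score - 10e7   # effectively a filter
  score
}

score_match(c("category", "gender", "color"))    # full match on "blue women's pants": 16
score_match(c("category", "color"))              # blue pants, no gender mentioned: 11
score_match(c("category", "opposite_gender"))    # men's pants: huge negative score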


Editor's Notes

• #21 It should probably get a rank “boost” (ranked higher, but not necessarily always, e.g. if the nouns match better without a mention of gender)