SlideShare a Scribd company logo
1 of 25
Download to read offline
Search Relevance Engineering:
Query Understanding & Ranking
Dipak Parmar
Principal Engineer
Feb 18, 2020
Confidential 2
Agenda
1 Introduction
2 Search Relevance Engineering
- Text analysis
- Query understanding
- Ranking
3 Overview of Search Platform @ Shipt
- Shipt search platform technologies stack
- Search indexing pipeline
- Search services
- Elasticsearch clusters
- Lessons learned and best practices
4 Q & A
Confidential
About Shipt
● Shipt launched in the summer of 2014 in the heart of the Magic City, Birmingham, AL.
● Shipt connects members to fresh groceries and everyday essentials. Saving time, fuel and
headspace, next-hour, same day grocery delivery is quickly becoming an everyday
necessity for people looking for an extra few hours and intentional food choices.
cities covered
5,000+
and counting in
shopper tips
$175M+
of households covered
nationwide
80%
shoppers
400K
Retailers
100+
NPS Score
70
Confidential
Beyond Grocery...
Confidential
Search @ Shipt
5
● 100+ retailers, 1000+ store locations
● 50% of searches come from just top
1000 unique searches
● 30% of all purchases are repeat buys
● Search keywords distribution
○ 20% single token
○ 50% two tokens
○ 30% three+ tokens
Confidential 6
Search Relevance
Engineering
V
e
s
t
i
b
u
l
u
m
c
o
n
g
u
e
Vestibulum
congue Vestibulum
congue
Vestibulum
congue
V
e
s
t
i
b
u
l
u
m
c
o
n
g
u
e
V
e
s
t
i
b
u
l
u
m
c
o
n
g
u
e
Vestibulum
congue Vestibulum
congue
Vestibulum
congue
V
e
s
t
i
b
u
l
u
m
c
o
n
g
u
e
T
e
x
t
A
n
a
l
y
s
i
s
Phase-1
Ranking
Phase-n
Ranking
Query Intent &
Classification
Q
u
e
r
y
B
u
i
l
d
e
r
Text Analysis &
Query Understanding
Confidential
Text Analysis
8
The QUICK brown foxes jumped over the dog!
[ “quick”, “fast”, “brown”, “fox”, “jump”, “over”, dog”]
Ngrams [“q”, “qu”, “qui”, “quic”, “quick”, ..]
Confidential
Text Analysis
9
Stemming: [ “starwberii”, “ic”, “cream”]
strawberry ice cream
Shingles: [“strawberry ice”, “ice cream”]
Shingle concat: [“strawberryice”, “icecream”]
Confidential
Query Understanding
10
As per Wikipedia
“Query understanding is the process of inferring the intent of a search
engine user by extracting semantic meaning from the searcher’s
keywords. Query understanding methods generally take place before the
search engine retrieves and ranks results”
Confidential
Query Understanding
11
Lorem ipsum dolor sit
amet at nec at adipiscing
05
● Donec risus dolor porta venenatis
● Pharetra luctus felis
● Proin in tellus felis volutpat
Lorem ipsum dolor sit
amet at nec at adipiscing
04
● Donec risus dolor porta venenatis
● Pharetra luctus felis
● Proin in tellus felis volutpat
Lorem ipsum dolor sit
amet at nec at adipiscing
03
● Donec risus dolor porta venenatis
● Pharetra luctus felis
● Proin in tellus felis volutpat
Lorem ipsum dolor sit
amet at nec at adipiscing
02
● Donec risus dolor porta venenatis
● Pharetra luctus felis
● Proin in tellus felis volutpat
Lorem ipsum dolor sit
amet at nec at adipiscing
01
● Donec risus dolor porta venenatis
● Pharetra luctus felis
● Proin in tellus felis volutpat
Lorem ipsum dolor sit
amet at nec at adipiscing
05
● Donec risus dolor porta venenatis
● Pharetra luctus felis
● Proin in tellus felis volutpat
Lorem ipsum dolor sit
amet at nec at adipiscing
04
● Donec risus dolor porta venenatis
● Pharetra luctus felis
● Proin in tellus felis volutpat
Lorem ipsum dolor sit
amet at nec at adipiscing
03
● Donec risus dolor porta venenatis
● Pharetra luctus felis
● Proin in tellus felis volutpat
Lorem ipsum dolor sit
amet at nec at adipiscing
02
● Donec risus dolor porta venenatis
● Pharetra luctus felis
● Proin in tellus felis volutpat
Lorem ipsum dolor sit
amet at nec at adipiscing
01
● Donec risus dolor porta venenatis
● Pharetra luctus felis
● Proin in tellus felis volutpat
fresh red onion
05
● Entity: “onion”
● Modifier: “red”
● Optional: “fresh”
half and half coffee pods
½ & ½ coffee pods
04
● Modifier: “half & half”
● Entity: “coffee pods” (compound word)
16 oz sour cream
sourcream 16oz
03
● Size: 16 oz
● Entity: “sour cream” (compound word)
1 gal 2% milk
2% gallon milk
02
● Size: 1 gallon
● Modifier: 2%
● Entity: “milk”
good & gather veggie
pizza
01
● Brand: good & gather
● Dietary preference: veggie
● Entity: “pizza”
Confidential
Query Understanding
12
● Detecting intent with classifiers
○ Size, brands, nutritional/dietary keywords
○ Identify color, pattern, modifiers, flavors, taste, shape, etc.
○ Optional keywords
○ Negative intent: “egg free pizza”
● Replacements, query expansions, synonyms, hypernyms
● Linguistic analysis
● ML, AI, NLP techniques
Confidential
Query Builder
13
● Build query based on intent
○ Partial match
○ Phrase search
○ Word drop
○ Minimum match criteria
○ Fuzzy search
○ Filtering
○ Boost
● Ranking based on intent
Ranking
Confidential
Ranking
15
● 5 phases of ranking
○ Phase-1: Base query ranking based on query intent (BM25)
○ Phase-2: Boosting (BM25)
○ Phase-3: Custom ranking with other signals (script score or custom plug-in)
○ Phase-4: Re-rank top x documents (LTR / vector similarity / other signals)
○ Phase-5: Post process ranking (outside search engine)
Confidential
Ranking
16
● Ingest external signals
○ Offline process to generate popularity scores at
product level
○ Signals at neighborhood level can be generated
○ Product to converted keywords mapping can be
maintained
● Use external signals along with BM25 score in deriving
final score
● “Painless” script in Elasticsearch can be used for
ranking
Confidential
Ranking
17
Final score = (BM25 score * weight) +
(signal-1 * multiplier * weight) +
(signal-2 * multiplier * weight) + …
Example:
Final score = (BM25 score * 0.50) + (log(popularity) * 100 * 0.30) + (geographic trend * 0.10) + ..
Overview of
Search Platform @ Shipt
Confidential
Shipt Search Platform Technologies Stack
19
● Language:
○ Golang
● Datastores:
○ Elasticsearch, Redis, PostgreSQL
● Messaging:
○ Kafka
● Infrastructure / Observability:
○ Docker, Kubernetes, Rollbar, Grafana
● Cloud hosting:
○ AWS, GCP, ElasticCloud, Confluent
Confidential
Search
Indexing
Pipeline
@ Shipt
20
Confidential
Search
Services
@ Shipt
21
Confidential
Elasticsearch clusters
22
● Write heavy indexing pipeline
● Multiple groups of clusters based on search traffic, indexing traffic and number of
documents
● High availability, multiple replicas
● Single query approach, p99 of 70ms
Confidential
Lessons learned, best practices
23
● Single-Responsibility Principle. Don’t do too many things in one micro-service
● Separation of concerns. Modularize and re-use, avoid duplicate logic
● Don’t try to solve everything with search engine!
● Don’t try to solve everything with machine learning!
● High refresh interval setting in Elasticsearch improves response time p99
● Start with optimizing head keywords followed by torso and tail keywords
● Act fast, evaluate a/b test fast, learn from a/b tests, fail fast
● Parent-child relationship in Elasticsearch could degrade performance
● Duplication of data is just fine in most cases in search engine
● Excessive use of Fuzzy search could lead to performance overhead. It’s better not to
apply fuzzy search for already correctly spelled keywords
Questions
Thank you!
Special thanks to OpenSource
Connections for organizing this
event!

More Related Content

More from OpenSource Connections

Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
OpenSource Connections
 
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
OpenSource Connections
 
Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...
OpenSource Connections
 
Haystack 2019 - Towards a Learning To Rank Ecosystem @ Snag - We've got LTR t...
Haystack 2019 - Towards a Learning To Rank Ecosystem @ Snag - We've got LTR t...Haystack 2019 - Towards a Learning To Rank Ecosystem @ Snag - We've got LTR t...
Haystack 2019 - Towards a Learning To Rank Ecosystem @ Snag - We've got LTR t...
OpenSource Connections
 

More from OpenSource Connections (20)

Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj BharadwajHaystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
 
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
 
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan KohlHaystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
 
Haystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon HughesHaystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon Hughes
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
 
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
 
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
 
Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...
 
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
 
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
 
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
 
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
 
Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...
Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...
Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...
 
Haystack 2019 - Beyond The Search Engine: Improving Relevancy through Query E...
Haystack 2019 - Beyond The Search Engine: Improving Relevancy through Query E...Haystack 2019 - Beyond The Search Engine: Improving Relevancy through Query E...
Haystack 2019 - Beyond The Search Engine: Improving Relevancy through Query E...
 
Haystack 2019 - Evolution of Yelp search to a generalized ranking platform - ...
Haystack 2019 - Evolution of Yelp search to a generalized ranking platform - ...Haystack 2019 - Evolution of Yelp search to a generalized ranking platform - ...
Haystack 2019 - Evolution of Yelp search to a generalized ranking platform - ...
 
Haystack 2019 - Query relaxation - a rewriting technique between search and r...
Haystack 2019 - Query relaxation - a rewriting technique between search and r...Haystack 2019 - Query relaxation - a rewriting technique between search and r...
Haystack 2019 - Query relaxation - a rewriting technique between search and r...
 
Haystack 2019 - Towards a Learning To Rank Ecosystem @ Snag - We've got LTR t...
Haystack 2019 - Towards a Learning To Rank Ecosystem @ Snag - We've got LTR t...Haystack 2019 - Towards a Learning To Rank Ecosystem @ Snag - We've got LTR t...
Haystack 2019 - Towards a Learning To Rank Ecosystem @ Snag - We've got LTR t...
 
Haystack 2019 - Autocomplete as Relevancy - Rimple Shah
Haystack 2019 - Autocomplete as Relevancy - Rimple ShahHaystack 2019 - Autocomplete as Relevancy - Rimple Shah
Haystack 2019 - Autocomplete as Relevancy - Rimple Shah
 
Haystack 2019 - Making the case for human judgement relevance testing - Tara ...
Haystack 2019 - Making the case for human judgement relevance testing - Tara ...Haystack 2019 - Making the case for human judgement relevance testing - Tara ...
Haystack 2019 - Making the case for human judgement relevance testing - Tara ...
 
Haystack 2019 - Ontology and Oncology: NLP for Precision Medicine - Sean Mullane
Haystack 2019 - Ontology and Oncology: NLP for Precision Medicine - Sean MullaneHaystack 2019 - Ontology and Oncology: NLP for Precision Medicine - Sean Mullane
Haystack 2019 - Ontology and Oncology: NLP for Precision Medicine - Sean Mullane
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Recently uploaded (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 

Dipak parmar haystack meetup search relevance engineering

  • 1. Search Relevance Engineering: Query Understanding & Ranking Dipak Parmar Principal Engineer Feb 18, 2020
  • 2. Confidential 2 Agenda 1 Introduction 2 Search Relevance Engineering - Text analysis - Query understanding - Ranking 3 Overview of Search Platform @ Shipt - Shipt search platform technologies stack - Search indexing pipeline - Search services - Elasticsearch clusters - Lessons learned and best practices 4 Q & A
  • 3. Confidential About Shipt ● Shipt launched in the summer of 2014 in the heart of the Magic City, Birmingham, AL. ● Shipt connects members to fresh groceries and everyday essentials. Saving time, fuel and headspace, next-hour, same day grocery delivery is quickly becoming an everyday necessity for people looking for an extra few hours and intentional food choices. cities covered 5,000+ and counting in shopper tips $175M+ of households covered nationwide 80% shoppers 400K Retailers 100+ NPS Score 70
  • 5. Confidential Search @ Shipt 5 ● 100+ retailers, 1000+ store locations ● 50% of searches come from just top 1000 unique searches ● 30% of all purchases are repeat buys ● Search keywords distribution ○ 20% single token ○ 50% two tokens ○ 30% three+ tokens
  • 6. Confidential 6 Search Relevance Engineering V e s t i b u l u m c o n g u e Vestibulum congue Vestibulum congue Vestibulum congue V e s t i b u l u m c o n g u e V e s t i b u l u m c o n g u e Vestibulum congue Vestibulum congue Vestibulum congue V e s t i b u l u m c o n g u e T e x t A n a l y s i s Phase-1 Ranking Phase-n Ranking Query Intent & Classification Q u e r y B u i l d e r
  • 7. Text Analysis & Query Understanding
  • 8. Confidential Text Analysis 8 The QUICK brown foxes jumped over the dog! [ “quick”, “fast”, “brown”, “fox”, “jump”, “over”, dog”] Ngrams [“q”, “qu”, “qui”, “quic”, “quick”, ..]
  • 9. Confidential Text Analysis 9 Stemming: [ “starwberii”, “ic”, “cream”] strawberry ice cream Shingles: [“strawberry ice”, “ice cream”] Shingle concat: [“strawberryice”, “icecream”]
  • 10. Confidential Query Understanding 10 As per Wikipedia “Query understanding is the process of inferring the intent of a search engine user by extracting semantic meaning from the searcher’s keywords. Query understanding methods generally take place before the search engine retrieves and ranks results”
  • 11. Confidential Query Understanding 11 Lorem ipsum dolor sit amet at nec at adipiscing 05 ● Donec risus dolor porta venenatis ● Pharetra luctus felis ● Proin in tellus felis volutpat Lorem ipsum dolor sit amet at nec at adipiscing 04 ● Donec risus dolor porta venenatis ● Pharetra luctus felis ● Proin in tellus felis volutpat Lorem ipsum dolor sit amet at nec at adipiscing 03 ● Donec risus dolor porta venenatis ● Pharetra luctus felis ● Proin in tellus felis volutpat Lorem ipsum dolor sit amet at nec at adipiscing 02 ● Donec risus dolor porta venenatis ● Pharetra luctus felis ● Proin in tellus felis volutpat Lorem ipsum dolor sit amet at nec at adipiscing 01 ● Donec risus dolor porta venenatis ● Pharetra luctus felis ● Proin in tellus felis volutpat Lorem ipsum dolor sit amet at nec at adipiscing 05 ● Donec risus dolor porta venenatis ● Pharetra luctus felis ● Proin in tellus felis volutpat Lorem ipsum dolor sit amet at nec at adipiscing 04 ● Donec risus dolor porta venenatis ● Pharetra luctus felis ● Proin in tellus felis volutpat Lorem ipsum dolor sit amet at nec at adipiscing 03 ● Donec risus dolor porta venenatis ● Pharetra luctus felis ● Proin in tellus felis volutpat Lorem ipsum dolor sit amet at nec at adipiscing 02 ● Donec risus dolor porta venenatis ● Pharetra luctus felis ● Proin in tellus felis volutpat Lorem ipsum dolor sit amet at nec at adipiscing 01 ● Donec risus dolor porta venenatis ● Pharetra luctus felis ● Proin in tellus felis volutpat fresh red onion 05 ● Entity: “onion” ● Modifier: “red” ● Optional: “fresh” half and half coffee pods ½ & ½ coffee pods 04 ● Modifier: “half & half” ● Entity: “coffee pods” (compound word) 16 oz sour cream sourcream 16oz 03 ● Size: 16 oz ● Entity: “sour cream” (compound word) 1 gal 2% milk 2% gallon milk 02 ● Size: 1 gallon ● Modifier: 2% ● Entity: “milk” good & gather veggie pizza 01 ● Brand: good & gather ● Dietary preference: veggie ● Entity: “pizza”
  • 12. Confidential Query Understanding 12 ● Detecting intent with classifiers ○ Size, brands, nutritional/dietary keywords ○ Identify color, pattern, modifiers, flavors, taste, shape, etc. ○ Optional keywords ○ Negative intent: “egg free pizza” ● Replacements, query expansions, synonyms, hypernyms ● Linguistic analysis ● ML, AI, NLP techniques
  • 13. Confidential Query Builder 13 ● Build query based on intent ○ Partial match ○ Phrase search ○ Word drop ○ Minimum match criteria ○ Fuzzy search ○ Filtering ○ Boost ● Ranking based on intent
  • 15. Confidential Ranking 15 ● 5 phases of ranking ○ Phase-1: Base query ranking based on query intent (BM25) ○ Phase-2: Boosting (BM25) ○ Phase-3: Custom ranking with other signals (script score or custom plug-in) ○ Phase-4: Re-rank top x documents (LTR / vector similarity / other signals) ○ Phase-5: Post process ranking (outside search engine)
  • 16. Confidential Ranking 16 ● Ingest external signals ○ Offline process to generate popularity scores at product level ○ Signals at neighborhood level can be generated ○ Product to converted keywords mapping can be maintained ● Use external signals along with BM25 score in deriving final score ● “Painless” script in Elasticsearch can be used for ranking
  • 17. Confidential Ranking 17 Final score = (BM25 score * weight) + (signal-1 * multiplier * weight) + (signal-2 * multiplier * weight) + … Example: Final score = (BM25 score * 0.50) + (log(popularity) * 100 * 0.30) + (geographic trend * 0.10) + ..
  • 19. Confidential Shipt Search Platform Technologies Stack 19 ● Language: ○ Golang ● Datastores: ○ Elasticsearch, Redis, PostgreSQL ● Messaging: ○ Kafka ● Infrastructure / Observability: ○ Docker, Kubernetes, Rollbar, Grafana ● Cloud hosting: ○ AWS, GCP, ElasticCloud, Confluent
  • 22. Confidential Elasticsearch clusters 22 ● Write heavy indexing pipeline ● Multiple groups of clusters based on search traffic, indexing traffic and number of documents ● High availability, multiple replicas ● Single query approach, p99 of 70ms
  • 23. Confidential Lessons learned, best practices 23 ● Single-Responsibility Principle. Don’t do too many things in one micro-service ● Separation of concerns. Modularize and re-use, avoid duplicate logic ● Don’t try to solve everything with search engine! ● Don’t try to solve everything with machine learning! ● High refresh interval setting in Elasticsearch improves response time p99 ● Start with optimizing head keywords followed by torso and tail keywords ● Act fast, evaluate a/b test fast, learn from a/b tests, fail fast ● Parent-child relationship in Elasticsearch could degrade performance ● Duplication of data is just fine in most cases in search engine ● Excessive use of Fuzzy search could lead to performance overhead. It’s better not to apply fuzzy search for already correctly spelled keywords
  • 25. Thank you! Special thanks to OpenSource Connections for organizing this event!