This talk is a case study of how Apache Spark and the Spark-Solr library are used at Flipp to drive search relevancy. Flipp is a Toronto-based digital flyer and e-commerce company that helps shoppers save money on their weekly shopping. Our customers can browse 5+ million products from brick-and-mortar retailers across North America, which makes Search a very challenging function in our app: how do we show the most relevant and personalized search results for a user's query?
The talk will focus on using user signals such as Click-Through Rate (CTR) and impressions to increase search relevancy. I will also talk about how PySpark is used to build the Flipp Search ETL platform for collecting user signals and reading product data from Solr. I will explain the problem scenario in which keyword search and basic relevancy algorithms become ineffective against a large product database. The solutions will cover the following implementations used at Flipp to drive relevancy:
– Utilizing user clicks and popularity data to derive and index normalized item weights that implement the Search Crowd Curation models in Apache Solr
– How 5+ million items are classified into Google Categories in real time using Keras and Apache Spark to power product category curation in Solr.
– How to create a crowd-sourced query intent categorizer in Solr using the Spark-Solr library.
– The use of offline and online metrics at Flipp for evaluating changes in search relevancy.
– Future plans for incorporating Kafka Connect and Spark Structured Streaming to perform real-time product indexing into Apache Solr with the Spark-Solr library.
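As a rough sketch of the first point, item weights can be derived by rescaling per-item CTRs to a fixed range before indexing them as a boost field. The function name, the input shape, and the min-max-style normalization below are illustrative assumptions, not Flipp's actual pipeline (which runs in PySpark):

```python
def normalized_item_weights(stats):
    """Derive per-item popularity weights from click/impression counts.

    stats: {item_id: (clicks, impressions)} -- an assumed shape, not
    Flipp's real signal schema. CTRs are rescaled to [0, 1] so they can
    be indexed into Solr as a relevancy boost field.
    """
    ctr = {item: (clicks / imps if imps else 0.0)
           for item, (clicks, imps) in stats.items()}
    top = max(ctr.values(), default=0.0)
    if top == 0.0:
        return {item: 0.0 for item in ctr}
    # Rescale so the most-clicked-through item gets weight 1.0.
    return {item: value / top for item, value in ctr.items()}
```

For example, an item with double the CTR of another ends up with double the weight, capped at 1.0 for the best performer.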
2. #ExpSAIS18
About Me
medium.com/@faizanahmed_18678
Role: Data Engineer on Flipp’s Search Team
Education: M.Sc. in Computer Science from the University of Waterloo
Responsibilities: Building pipelines to support machine learning models; developing scalable systems and working on products to enhance search relevancy. Core work includes using Spark for machine learning at scale and data warehousing for analytics.
3. Agenda
• About Flipp
• The Search Experience
• The Problem Domain
• The Spark-Solr Journey
– Solution Architecture
– Deep Dive: Content Categorization
– Deep Dive: Understanding User Intent
• Improvements
• Lessons Learned
• Key Takeaways
4. About Flipp
Digital Shopping Marketplace: The Flipp App is a digital shopping marketplace with over 40M downloads.
Storefront Visual Merchandising: The Enterprise Platform converts analog content into a mobile browsing experience that showcases retailers’ digital storefronts.
The Flipp Platform: Flipp’s platform is used by over 90% of tier-one retailers in North America.
19. What Raw Content In Solr Looks Like
• Same products, but different category structures
• TV accessories that would show up as noise
20. Categorization Algorithm
[Diagram] Product metadata (name: “Samsung TV”, desc: “LED”, category: {..}) arrives with Retailer A’s category structure (TV & Home Theatre > Televisions > 53” TVs). Creating training data: the retailer path is flattened into the label “53” TVs, Televisions, TV & Home Theatre”. Predicted categories: for a new product (name: “Sony TV”) the model emits a category key per level (_L1: Electronics, _L2: Video, _L3: Televisions).
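The two data-shaping steps in the diagram can be sketched in plain Python. The function names are mine, and the classifier itself (a Keras model run with Spark, per the abstract) is omitted; this only shows the flattening of retailer paths into training labels and the expansion of a predicted path into per-level fields:

```python
def flatten_category_path(path):
    # Turn a retailer-specific category tree path, e.g.
    # ["TV & Home Theatre", "Televisions", '53" TVs'], into the flat
    # training label from the slide: most-specific level first.
    return ", ".join(reversed(path))

def to_level_fields(predicted_path):
    # Expand a predicted category path (broadest level first) into the
    # per-level fields (_L1, _L2, _L3, ...) used for category curation.
    return {f"_L{i}": cat for i, cat in enumerate(predicted_path, start=1)}
```

With these, a predicted path like ["Electronics", "Video", "Televisions"] maps onto the _L1/_L2/_L3 keys shown on the slide.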
22. Voilà!
• TVs are now under similar categories with popularity weights
• The stands are now correctly classified as ‘Furniture’ and can be filtered
24. Why Decipher Query Intent?
A query for “nuts” ranks cereal, guitar & car oil filter on top due to noisy keyword signals.
25. Reinforced Intelligence
Users query → users see search results → users perform certain actions (click, impression, etc.) → user behaviour informs algorithmic improvements.
● Using past impression data to improve how the latest impression data is interpreted
26. Narrowing Down The Query Intent
What is the probability that a user will click on a particular category class given the query “nuts”?
● One query can be associated with multiple classes
● The category with the most clicks for a particular search query is interpreted as the query category
● The query “nuts” will thus be interpreted as having the category “Dried Fruits”
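A minimal sketch of that click-majority rule follows. The function name and the log format are assumptions, not Flipp's actual signal schema; the output pairs each query with its top category and the empirical probability P(category | query):

```python
from collections import Counter, defaultdict

def query_intents(click_log):
    # click_log: iterable of (query, clicked_category) pairs.
    counts = defaultdict(Counter)
    for query, category in click_log:
        counts[query][category] += 1
    intents = {}
    for query, by_category in counts.items():
        # Category with the most clicks wins; its click share is the
        # empirical probability of that category given the query.
        category, clicks = by_category.most_common(1)[0]
        intents[query] = (category, clicks / sum(by_category.values()))
    return intents
```

For instance, three “nuts” clicks on Dried Fruits and one on Cereal yield the intent (“Dried Fruits”, 0.75) for that query.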
31. Measuring Search Relevance
Normalized Discounted Cumulative Gain (NDCG) is a popular method for measuring the quality of a set of search results. It asserts the following:
● Cumulative Gain: 0 => not relevant, 1 => near relevant, 2 => relevant
● Discounting: positional bias
● Normalization: ideal DCG (iDCG)
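For reference, the computation the slide describes can be written out directly, using the common log2 positional discount (one of several discount conventions; the slide does not specify which Flipp uses):

```python
import math

def dcg(gains):
    # Discounted Cumulative Gain: relevance gains (0/1/2 per the slide)
    # are discounted by rank to model positional bias.
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    # Normalize by the ideal DCG: the same gains sorted best-first.
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0
```

A perfectly ordered result list scores 1.0; pushing relevant items down the page lowers the score.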
36. Lessons Learned
● Read parallelism
○ Match the number of Spark executor nodes to the number of shards in your Solr collection.
○ Prefer Solr filter params over a where clause in Spark.
● Indexing improvements
○ Multivalued fields may require extra processing.
○ Avoid using the soft_commit_secs param in production.
○ Consider using a smaller batch_size.
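To illustrate the "filter params over where clause" point, a hedged sketch of a spark-solr read configuration. The zkhost, collection, and fq values are placeholders, and the option names are my best reading of the spark-solr documentation, so verify them against the version you run:

```python
# Options for spark-solr's DataFrame reader. Passing the filter via
# solr.params lets Solr apply it at read time, instead of shipping all
# documents to Spark and filtering them with df.where(...).
opts = {
    "zkhost": "zk1:2181/solr",            # ZooKeeper connect string (placeholder)
    "collection": "products",             # Solr collection (placeholder)
    "query": "*:*",
    "solr.params": "fq=merchant_id:123",  # filter pushed down to Solr (placeholder fq)
    "fields": "id,name,category",
}
# With a live cluster this would be:
# df = spark.read.format("solr").options(**opts).load()
```

The `load()` call is commented out because it needs a running SolrCloud cluster; the dict alone shows the shape of the configuration.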
37. Key Takeaways
● The Spark-Solr framework unlocks a wide variety of machine learning techniques that can be applied to search applications.
● It enables faster data analytics in Spark SQL than traditional SQL frameworks.
● Spark-Solr can be used to optimize Solr indexing with signal data.
● Storing text classification models in Solr has significant semantic benefits over other persistence layers.