This talk is a case study of how Apache Spark and the Spark-Solr library are used at Flipp to drive search relevancy. Flipp is a Toronto-based digital flyer and e-commerce company that helps shoppers save money on their weekly shopping. Our customers can browse 5+ million products from brick-and-mortar retailers across North America, which makes Search a very challenging function in our app: how do we show the most relevant and personalized search results for a user's query?
The talk will focus on using user signals such as Click-Through Rate (CTR) and impressions to increase search relevancy. I will also talk about how PySpark is used to build the Flipp Search ETL platform for collecting user signals and reading product data from Solr. I will explain the problem scenario in which keyword search and basic relevancy algorithms become ineffective against a large product database. The solutions will cover the following implementations used at Flipp to drive relevancy:
– Utilizing user clicks and popularity data to derive and index normalized item weights that implement the Search Crowd Curation models in Apache Solr
– How 5+ million items are classified into Google Categories in real time using Keras and Apache Spark to power product category curation in Solr.
– How to create a crowd-sourced query intent categorizer in Solr using the Spark-Solr library.
– The use of offline and online metrics at Flipp for evaluating changes in search relevancy.
– Future plans for incorporating Kafka Connect and Spark Structured Streaming to perform real-time product indexing into Apache Solr with the Spark-Solr library.
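As a rough sketch of the first point, item weights can be derived by rescaling per-item CTRs to a fixed range before indexing them as a boost field. The function name, the input shape, and the min-max-style normalization below are illustrative assumptions, not Flipp's actual pipeline (which runs in PySpark):

```python
def normalized_item_weights(stats):
    """Derive per-item popularity weights from click/impression counts.

    stats: {item_id: (clicks, impressions)} -- an assumed shape, not
    Flipp's real signal schema. CTRs are rescaled to [0, 1] so they can
    be indexed into Solr as a relevancy boost field.
    """
    ctr = {item: (clicks / imps if imps else 0.0)
           for item, (clicks, imps) in stats.items()}
    top = max(ctr.values(), default=0.0)
    if top == 0.0:
        return {item: 0.0 for item in ctr}
    # Rescale so the most-clicked-through item gets weight 1.0.
    return {item: value / top for item, value in ctr.items()}
```

For example, an item with double the CTR of another ends up with double the weight, capped at 1.0 for the best performer.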
2. #ExpSAIS18
About Me
medium.com/@faizanahmed_18678
Role: Data Engineer on Flipp’s Search Team
Education: M.Sc. in Computer Science from the University of Waterloo
Responsibilities: Building pipelines to support machine learning models; developing scalable systems and working on products to enhance search relevancy. Core work includes using Spark for machine learning at scale and data warehousing for analytics.
3. Agenda
• About Flipp
• The Search Experience
• The Problem Domain
• The Spark-Solr Journey
– Solution Architecture
– Deep Dive: Content Categorization
– Deep Dive: Understanding User Intent
• Improvements
• Lessons Learned
• Key Takeaways
4. About Flipp
Digital Shopping Marketplace: The Flipp App is a digital shopping marketplace with over 40M downloads.
Storefront Visual Merchandising: The Enterprise Platform converts analog content into a mobile browsing experience that showcases retailers’ digital storefronts.
The Flipp Platform: Flipp’s platform is used by over 90% of tier-one retailers in North America.
19. What Raw Content In Solr Looks Like
• Same products, but different category structures
• TV accessories that would show up as noise
20. Categorization Algorithm
[Diagram] Product metadata (name: “Samsung TV”, desc: “LED”, category: {..}) arrives with Retailer A’s category structure (TV & Home Theatre > Televisions > 53” TVs). Creating training data: the retailer path is flattened into the label “53” TVs, Televisions, TV & Home Theatre”. Predicted categories: for a new product (name: “Sony TV”) the model emits a category key per level (_L1: Electronics, _L2: Video, _L3: Televisions).
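The two data-shaping steps in the diagram can be sketched in plain Python. The function names are mine, and the classifier itself (a Keras model run with Spark, per the abstract) is omitted; this only shows the flattening of retailer paths into training labels and the expansion of a predicted path into per-level fields:

```python
def flatten_category_path(path):
    # Turn a retailer-specific category tree path, e.g.
    # ["TV & Home Theatre", "Televisions", '53" TVs'], into the flat
    # training label from the slide: most-specific level first.
    return ", ".join(reversed(path))

def to_level_fields(predicted_path):
    # Expand a predicted category path (broadest level first) into the
    # per-level fields (_L1, _L2, _L3, ...) used for category curation.
    return {f"_L{i}": cat for i, cat in enumerate(predicted_path, start=1)}
```

With these, a predicted path like ["Electronics", "Video", "Televisions"] maps onto the _L1/_L2/_L3 keys shown on the slide.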
22. Voilà!
• TVs are now under similar categories with popularity weights
• The stands are now correctly classified as ‘Furniture’ and can be filtered
24. Why Decipher Query Intent?
A query for “nuts” ranks cereal, guitar & car oil filter on top due to noisy keyword signals.
25. Reinforced Intelligence
Users query → users see search results → users perform certain actions (click, impression, etc.) → user behaviour informs algorithmic improvements.
● Using past impression data to improve how the latest impression data is interpreted
26. Narrowing Down The Query Intent
What is the probability that a user will click on a particular category class given the query “nuts”?
● One query can be associated with multiple classes
● The category with the most clicks for a particular search query is interpreted as the query category
● The query “nuts” will thus be interpreted as having the category “Dried Fruits”
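A minimal sketch of that click-majority rule follows. The function name and the log format are assumptions, not Flipp's actual signal schema; the output pairs each query with its top category and the empirical probability P(category | query):

```python
from collections import Counter, defaultdict

def query_intents(click_log):
    # click_log: iterable of (query, clicked_category) pairs.
    counts = defaultdict(Counter)
    for query, category in click_log:
        counts[query][category] += 1
    intents = {}
    for query, by_category in counts.items():
        # Category with the most clicks wins; its click share is the
        # empirical probability of that category given the query.
        category, clicks = by_category.most_common(1)[0]
        intents[query] = (category, clicks / sum(by_category.values()))
    return intents
```

For instance, three “nuts” clicks on Dried Fruits and one on Cereal yield the intent (“Dried Fruits”, 0.75) for that query.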
31. Measuring Search Relevance
Normalized Discounted Cumulative Gain (NDCG) is a popular method for measuring the quality of a set of search results. It asserts the following:
● Cumulative Gain: 0 => not relevant, 1 => near relevant, 2 => relevant
● Discounting: positional bias
● Normalization: ideal DCG (iDCG)
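For reference, the computation the slide describes can be written out directly, using the common log2 positional discount (one of several discount conventions; the slide does not specify which Flipp uses):

```python
import math

def dcg(gains):
    # Discounted Cumulative Gain: relevance gains (0/1/2 per the slide)
    # are discounted by rank to model positional bias.
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    # Normalize by the ideal DCG: the same gains sorted best-first.
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0
```

A perfectly ordered result list scores 1.0; pushing relevant items down the page lowers the score.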
36. Lessons Learned
● Read parallelism
○ Match the number of Spark executor nodes to the number of shards in your Solr collection.
○ Prefer Solr filter params over a where clause in Spark.
● Indexing improvements
○ Multivalued fields may require extra processing.
○ Avoid using the soft_commit_secs param in production.
○ Consider using a smaller batch_size.
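To illustrate the "filter params over where clause" point, a hedged sketch of a spark-solr read configuration. The zkhost, collection, and fq values are placeholders, and the option names are my best reading of the spark-solr documentation, so verify them against the version you run:

```python
# Options for spark-solr's DataFrame reader. Passing the filter via
# solr.params lets Solr apply it at read time, instead of shipping all
# documents to Spark and filtering them with df.where(...).
opts = {
    "zkhost": "zk1:2181/solr",            # ZooKeeper connect string (placeholder)
    "collection": "products",             # Solr collection (placeholder)
    "query": "*:*",
    "solr.params": "fq=merchant_id:123",  # filter pushed down to Solr (placeholder fq)
    "fields": "id,name,category",
}
# With a live cluster this would be:
# df = spark.read.format("solr").options(**opts).load()
```

The `load()` call is commented out because it needs a running SolrCloud cluster; the dict alone shows the shape of the configuration.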
37. Key Takeaways
● The Spark-Solr framework unlocks a wide variety of machine learning techniques that can be applied to search applications.
● It enables faster data analytics in Spark SQL than traditional SQL frameworks.
● Spark-Solr can be used to optimize Solr indexing with signal data.
● Storing text classification models in Solr has significant semantic benefits over other persistence layers.