Presented at PyData Amsterdam 2016. Describes the Rewinder tool, used to compare search engine configuration performance between Microsoft FAST and Apache Solr for the ScienceDirect search backend migration.
Measuring Search Engine Quality using Spark and Python
1. Measuring Search Engine Quality using Spark and Python
Sujit Pal
March 13, 2016
2. • About Me
Work at Elsevier Labs
Interests: Search, NLP and Distributed Processing.
URL: labs.elsevier.com
Email: sujit.pal@elsevier.com
Blog: Salmon Run
Twitter: @palsujit
• About Elsevier
World’s largest publisher of STM Books and Journals
Uses data to inform and enable consumers of STM information
Introduction
3. • Problem Description
• Our Solution
• Other Uses
• Future Work
• Q&A
Agenda
8. • A/B Tests happen in Production
Expensive: requires production deployment of the search engine(s).
Risky: a bad customer experience during an A/B test can result in customers leaving.
Limited scope for iterative improvement, because of the expense and risk.
A/B tests take time to become statistically meaningful.
• We needed something that
Can be run by DEV/QA on demand.
Produces a repeatable indicator of search engine quality.
Does not require production deployment.
• This tool is the subject of our talk today.
But…
10. • We have
Query Logs – from the query strings entered by users.
Click Logs – from the download links clicked by users.
• We can generate
Search results for each query against the search engine.
• Combining these, we can produce
the Click Rank Distribution for a search engine (configuration).
Solution Overview
11. Click Rank Definition
• Click Rank is the sum of the ranks of all PIIs in the result set that match the PIIs clicked for that query, divided by the number of matches.
• Deconstructing the above:
Let the click logs for a query be the document set Q.
Let the top N search results for the query be represented by a list R.
Let P be the intersection of Q and R, and let I be the set of (one-based) indexes in R of the documents in P.
Click Rank = Σ I / |I|, i.e., the average one-based rank of the clicked documents that appear in R.
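As a concrete illustration, here is the definition in plain Python; the PIIs below are made-up values, not data from the talk.

# Worked example of the Click Rank definition (illustrative values only).
Q = {"P1", "P4", "P9"}               # PIIs clicked for the query
R = ["P7", "P1", "P4", "P3", "P8"]   # top-N results, in rank order

# One-based indexes in R of the documents that were clicked.
I = [i for i, pii in enumerate(R, start=1) if pii in Q]   # -> [2, 3]
click_rank = float(sum(I)) / len(I)                       # (2 + 3) / 2 = 2.5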
16. Search Results Example Data
• Search Results are saved one file per query.
• Top 50 results extracted (so each file is 50 lines long).
• Maintains parity with the FAST reference query results (provided one-time via a legacy process).
17. Compute Click Rank Distribution
• Use Apache Spark to compute the Click Rank Distribution.
• Use Python + Matplotlib to build reports and visualizations.
18. Spark Code for Generating Click Rank Distribution
• Skeleton of a PySpark Program
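A minimal sketch of what such a PySpark skeleton might look like, using the RDD API of that era; the S3 paths and application name are hypothetical placeholders.

from pyspark import SparkConf, SparkContext

def main():
    conf = SparkConf().setAppName("click-rank-distribution")
    sc = SparkContext(conf=conf)
    # Raw inputs (hypothetical paths): click logs as text lines,
    # search results saved one file per query.
    clicks = sc.textFile("s3://mybucket/click-logs/")
    results_files = sc.wholeTextFiles("s3://mybucket/search-results/")
    # ... Steps #1-#3 (following slides) go here ...
    sc.stop()

if __name__ == "__main__":
    main()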
19. Spark Code for Generating Click Rank Distribution
• Step #1: Convert the Clicks data to (Query_ID, List of clicked PIIs).
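A sketch of what Step #1 might look like; the click-log line layout (tab-separated query_id and PII) is an assumption.

def parse_click(line):
    # Assumed layout: query_id<TAB>clicked PII (hypothetical).
    fields = line.split("\t")
    return (fields[0], fields[1])

clicked_piis = (clicks.map(parse_click)
                      .groupByKey()
                      .mapValues(lambda piis: list(set(piis))))
# => RDD of (query_id, [clicked PII, ...])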
20. Spark Code for Generating Click Rank Distribution
• Step #2: Convert the Search Results data to (Query_ID, List of PIIs in search result).
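A sketch of Step #2, assuming each result file is named <query_id>.txt and holds one PII per line in rank order (50 lines each, per the earlier slide); the file-naming convention is an assumption.

import os

def parse_result_file(path_and_content):
    path, content = path_and_content
    query_id = os.path.splitext(os.path.basename(path))[0]
    piis = [line.strip() for line in content.split("\n") if line.strip()]
    return (query_id, piis)

result_piis = results_files.map(parse_result_file)
# => RDD of (query_id, [PII at rank 1, PII at rank 2, ...])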
21. Spark Code for Generating Click Rank Distribution
• Step #3: Join the two RDDs and compute Click Rank from the intersection of the clicked PIIs and the result PIIs for each Query_ID.
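A sketch of Step #3, implementing the Click Rank definition from earlier: join on query_id, find the one-based ranks of the clicked PIIs within the result list, and average them. The output path is a hypothetical placeholder.

def compute_click_rank(joined):
    query_id, (clicked, ranked) = joined
    clicked_set = set(clicked)
    ranks = [i for i, pii in enumerate(ranked, start=1) if pii in clicked_set]
    if not ranks:
        return (query_id, None)   # no clicked PII appears in the results
    return (query_id, float(sum(ranks)) / len(ranks))

click_ranks = (clicked_piis.join(result_piis)
                           .map(compute_click_rank)
                           .filter(lambda kv: kv[1] is not None))
click_ranks.saveAsTextFile("s3://mybucket/click-rank-distribution/")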
22. Generate Reports
• Download the Click Rank Distribution for a search engine (configuration).
• Use Python + Matplotlib to build reports and visualizations.
23. Outputs from Tool
• Step #4: Download the distribution from S3 and aggregate it into a chart and spreadsheet.
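A sketch of Step #4, assuming the Spark output has been copied down from S3 (e.g. with the AWS CLI) as the usual part-* files of (query_id, click_rank) tuples; the local layout and file names are assumptions.

import ast, csv, glob
import matplotlib.pyplot as plt

ranks = []
for part in glob.glob("click-rank-distribution/part-*"):
    with open(part) as f:
        for line in f:
            query_id, rank = ast.literal_eval(line.strip())
            ranks.append(rank)

# Chart: histogram of click ranks over the top 50 positions.
plt.hist(ranks, bins=50, range=(1, 50))
plt.xlabel("Click Rank")
plt.ylabel("Number of queries")
plt.title("Click Rank Distribution")
plt.savefig("click-rank-distribution.png")

# Spreadsheet: one click rank per row.
with open("click-rank-distribution.csv", "w") as out:
    writer = csv.writer(out)
    writer.writerow(["click_rank"])
    writer.writerows([r] for r in ranks)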
24. How did we do (in our A/B test)?
• Solr PDF downloads were 99.6% of FAST downloads.
• The difference in download rates was not statistically significant.
• The decision was made to put Solr into production.
[Chart: SOLR Downloads as % of FAST Downloads, with PDF and HTML series across four A/B tests (Jan AB #1 through Apr AB #4), y-axis 90% to 105%, and a reference line marking the level of FAST downloads.]
26. Find Search Result Overlap between Configurations
• Measure drift between two search configurations.
• Ordered and unordered comparison at different top N positions (see the sketch below).
• Result set overlap increases with N.
• There is a lot of positional overlap in the top N positions across engines.
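The exact overlap definitions are not spelled out here, so the following is a plausible reconstruction of the two measures, given two ranked PII lists.

def unordered_overlap_at_n(r1, r2, n):
    # Fraction of top-n PIIs the two result lists share, ignoring position.
    return len(set(r1[:n]) & set(r2[:n])) / float(n)

def ordered_overlap_at_n(r1, r2, n):
    # Fraction of top-n positions holding the same PII in both lists.
    return sum(a == b for a, b in zip(r1[:n], r2[:n])) / float(n)

For example, unordered_overlap_at_n(fast_results, solr_results, 10) gives the share of top-10 PIIs the two configurations have in common, regardless of position.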
27. Search Quality as Overlap between Title and Query
• Measures the overlap of title words with query words at various top N positions.
• Overlap @ N is defined as the number of query words appearing in each of the first N titles, summed over those titles and normalized by N times the number of words in the query (see the sketch below).
• Overlap @ N decreases monotonically with N.
• Solr engines seem to do better at this measure.
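A sketch of the Overlap @ N measure as defined above; tokenization by simple whitespace splitting and lowercasing is an assumption.

def overlap_at_n(query, titles, n):
    query_words = query.lower().split()
    matched = sum(len(set(title.lower().split()) & set(query_words))
                  for title in titles[:n])
    return matched / float(n * len(query_words))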
28. Click Distribution
• Measures the distribution of clicked positions across the top 50 positions for each engine and compares them.
• In this chart, FAST has a higher number of clicks at the top positions than the Solr configurations shown.
29. Distribution of Publication Dates in Results
• The engine has a temporal component in its ranking algorithm.
• Compares the distribution of publication dates across search engine configurations to visualize this behavior.
30. More Uses …
• Measuring impact of query response time on click rank.
• Comparing click rank distributions by document type.
• …
31. • Compute (Average/Median) CR per user.
• Compute CR per query and user.
• Use this as input to Learning to Rank algorithms.
• Other ideas…
Future Work