Sudarshan Gaikaiwari - Lucene @ Yelp

This talk describes how Yelp uses Lucene to provide search services. It includes:

* Statistics of Yelp search usage
* Overview of Yelp search architecture: Yelp uses different services to provide search for different types of data. Some are based on Lucene and some on Solr.
* Deeper dive into business and review search. This is the most important search service at Yelp.

Published in: Technology

  1. Lucene @ Yelp. Sudarshan Gaikaiwari
  2. Bio
     1. Over a decade of experience in information retrieval
     2. Used IR techniques at Symantec's DLP group
     3. Search Engineer at Yelp
  3. Outline
     1. Overview of search services at Yelp
     2. Federation Motivation
     3. Lucy Indexing
     4. Lucy Searching
     5. Efficiently Retrieving top k hits
  4. The services we provide
  5. Lucy: business search
  6. Lucy also powers phone search
  7. Cathy: she talks a lot
  8. Listsearch: it searches lists....
  9. Reviewsearch: it searches reviews....
  10. DYM: did you really mean that?
  11. Suggest: auto completion
  12. Federation Motivation
  13. Problem: Search is too slow
  14. Hard Disk Seek Latency
      Disk seek: 10,000,000 ns
      Source: "Software Engineering Advice from Building Large-Scale Distributed Systems", Jeffrey Dean
  15. RAM read latency
      Main memory reference: 100 ns
  16. Pinning Index in RAM
      ● vmtouch
      ● mlock
      ● http://hoytech.com/vmtouch/
  17. Problem: Index is too large to fit in memory on a single machine
  18. Geographical sharding
  19. Geographical Sharding drawbacks
      1. Cumbersome manual process to determine shard boundary
      2. No guarantee that a boundary can be found
  20. Federation
      1. Split index across multiple machines
      2. Shard on business id
      3. TF-IDF scores from different machines should be comparable
  21. Mapping businesses to shards
      1. Assigning businesses to shards: shard = shardlist[hash(business_id) % len(shardlist)]
      Problems
      1. Involves re-indexing all the businesses if we want to add a new shard
  22. Virtual Nodes
  23. Advantages
      1. Flexibility (move vbuckets from one shard to another)
      2. Split hot spot shards
  24. Lucy Master Slave Architecture
      Separate indexing (masters): a master for each shard of a service
      Searching (slaves): a slave for every replica of a service
  25. Lucy Indexing
  26. Lucy Searching
  27. Federator: Combining results across shards
      1. Once we distribute an index across shards, we need a component that searches all of these shards and combines their results.
      2. Written in Python (runs inside a Python web process).
      3. Uses the Tornado IO loop to send requests to all shards.
      4. The transfer protocol for the requests is JSON-RPC.
  28. Lucy Server
  29. Tokens to Business Attributes
  30. Executing queries
      1. Gather the top results for a query
      2. Collect attribute statistics for attributes like places, categories
  31. Lucene
      1. Efficiently executes queries over the index
      2. Scores how relevant a business is to the words in the query (word score)
      3. Upgrading Lucene to 2.9/3.1 is a work in progress
  32. Successive geobounds relaxation
  33. Successive geobounds relaxation
  34. Federation
  35. Efficiently Retrieving top k hits
      1. When a user moves through multiple pages, the number of hits to be returned increases: num hits = start + count
      2. So if we need to retrieve 500 hits, the naive way would be to retrieve 500 hits from each shard and then sort them
  36. Distribution of hits in shards
  37. Probability a hit is in a shard
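  38. Binomial Distribution
      Probability that r of the top k hits are in a particular shard: $P(X = r) = \binom{k}{r} p^r (1 - p)^{k - r}$
      Mean: $\mu = kp$
      Variance: $\sigma^2 = kp(1 - p)$
  39. Formula
      Std Deviation: $\sigma = \sqrt{kp(1 - p)}$
      Formula
  40. Simulation, k = 100, p = 0.2
      Formula | Hits selected from each shard | Results missed (%)
              | 24                            | 0.017
              | 32                            | 0.0001407
              | 44                            | 0.00000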
  41. Simulation Graph
  42. Results
      1. ~50% savings over 100 hits (44 hits requested from each shard)
      2. 77% savings over 1000 hits (228 hits requested from each shard)
  43. Future work
      1. In memory index
      2. Move towards real time search
  44. Come Join Us!
  45. Thank You. smg@yelp.com
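
Slides 21 to 23 contrast plain modulo hashing with virtual nodes (vbuckets). The sketch below is one way to read that idea, not Yelp's implementation: the vbucket count of 1024, the use of `zlib.crc32` as a stable hash, the round-robin initial layout, and the helper names are all assumptions made for illustration. It shows that adding a shard under modulo hashing remaps most businesses, while moving vbuckets only remaps the businesses inside the moved vbuckets.

```python
import zlib

NUM_VBUCKETS = 1024  # hypothetical fixed count, chosen only for illustration


def stable_hash(business_id: str) -> int:
    """Deterministic hash (assumption: the slides do not name the hash used)."""
    return zlib.crc32(business_id.encode("utf-8"))


def modulo_shard(business_id: str, shardlist: list) -> str:
    """Slide 21's scheme: the result depends directly on len(shardlist)."""
    return shardlist[stable_hash(business_id) % len(shardlist)]


def build_vbucket_table(shardlist: list) -> list:
    """Assign vbuckets to shards round-robin (one possible initial layout)."""
    return [shardlist[v % len(shardlist)] for v in range(NUM_VBUCKETS)]


def vbucket_shard(business_id: str, table: list) -> str:
    """Two-step lookup: business -> vbucket (fixed) -> shard (movable)."""
    return table[stable_hash(business_id) % NUM_VBUCKETS]


def add_shard(table: list, new_shard: str) -> list:
    """Give the new shard an equal share of vbuckets; all others stay put."""
    num_shards = len(set(table)) + 1
    share = NUM_VBUCKETS // num_shards
    new_table = list(table)
    for i in range(share):
        # Reassign every num_shards-th vbucket to the new shard.
        new_table[i * num_shards] = new_shard
    return new_table


if __name__ == "__main__":
    ids = [f"biz-{i}" for i in range(10_000)]
    old = ["shard-a", "shard-b", "shard-c"]
    new = old + ["shard-d"]

    moved_modulo = sum(modulo_shard(b, old) != modulo_shard(b, new) for b in ids)

    table = build_vbucket_table(old)
    new_table = add_shard(table, "shard-d")
    moved_vbucket = sum(
        vbucket_shard(b, table) != vbucket_shard(b, new_table) for b in ids
    )

    print("moved with modulo hashing:", moved_modulo)
    print("moved with vbuckets:", moved_vbucket)
```

With vbuckets, only the businesses that land in a reassigned vbucket need to be re-indexed on the new shard, which is the flexibility slide 23 refers to. The same table edit can also move a few vbuckets off a hot shard.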
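
Slides 27 and 35 describe a federator that queries every shard and combines the results into one ranked page. The following is a minimal sketch of the combining step only, using the standard library rather than Tornado or JSON-RPC. It assumes each shard returns its hits already sorted by descending score and that scores are comparable across shards (slide 20). The `federate` name, the `(score, business_id)` tuple shape, and the sample data are hypothetical.

```python
import heapq
from itertools import islice


def federate(shard_results, start, count):
    """Merge per-shard hit lists into one global ranking and return one page.

    Each element of shard_results is a list of (score, business_id) tuples
    sorted by descending score. The page covers ranks [start, start + count).
    """
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, start, start + count))


if __name__ == "__main__":
    # Hypothetical responses from three shards.
    shard_a = [(9.1, "biz-3"), (7.4, "biz-8"), (2.0, "biz-1")]
    shard_b = [(8.7, "biz-5"), (6.9, "biz-2")]
    shard_c = [(9.8, "biz-7"), (1.5, "biz-4")]

    print(federate([shard_a, shard_b, shard_c], start=0, count=3))
    print(federate([shard_a, shard_b, shard_c], start=3, count=2))
```

Because `heapq.merge` is lazy, the federator only consumes as many hits as the requested page needs. Each shard still has to supply enough hits to cover `start + count`, which is the cost slide 35 sets out to reduce.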
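
Slides 35 to 42 size the per-shard request with a binomial model: if each of the top k hits lands in a given shard with probability p, the count in that shard has mean kp and standard deviation sqrt(kp(1 - p)). The sketch below requests the mean plus c standard deviations from each shard. The multiplier `c`, the function names, the definition of savings as 1 - per_shard / k, and the assumption of 1/p equally likely shards are mine, not taken from the slides. For k = 100 and p = 0.2, c = 1, 3, and 6 give 24, 32, and 44, the same hit counts that appear in slide 40, though the slides do not confirm that this is how those values were chosen.

```python
import math


def per_shard_request(k: int, p: float, c: float) -> int:
    """Hits to request from each shard: binomial mean plus c std deviations."""
    mean = k * p
    sigma = math.sqrt(k * p * (1 - p))
    # Small epsilon guards against floating-point noise just above an integer.
    return math.ceil(mean + c * sigma - 1e-9)


def savings(k: int, per_shard: int) -> float:
    """Fraction of per-shard hits avoided versus asking each shard for k."""
    return 1 - per_shard / k


def expected_missed_fraction(k: int, p: float, per_shard: int) -> float:
    """Expected share of the true top k lost when each shard returns at most
    per_shard hits, assuming round(1 / p) equally likely shards."""
    shards = round(1 / p)
    overflow = sum(
        (r - per_shard) * math.comb(k, r) * p**r * (1 - p) ** (k - r)
        for r in range(per_shard + 1, k + 1)
    )
    return shards * overflow / k


if __name__ == "__main__":
    k, p = 100, 0.2
    for c in (1, 3, 6):
        n = per_shard_request(k, p, c)
        print(c, n, savings(k, n), expected_missed_fraction(k, p, n))
```

Under this definition of savings, 44 hits out of 100 is a 56% reduction and 228 out of 1000 is a 77.2% reduction, in line with the roughly 50% and 77% quoted on slide 42.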
