Sudarshan Gaikaiwari - Lucene @ Yelp


This talk describes how Yelp uses Lucene to provide search services. It includes:

* Statistics of Yelp search usage
* Overview of Yelp search architecture: Yelp uses different services to provide searches for different types of data. Some are based on Lucene and some on Solr.
* Deeper dive into business and review search, the most important search service at Yelp.

Published in: Technology

Transcript of "Sudarshan Gaikaiwari - Lucene @ Yelp"

  1. Lucene @ Yelp, Sudarshan Gaikaiwari
  2. Bio
      1. Over a decade of experience in information retrieval
      2. Used IR techniques at Symantec's DLP group
      3. Search Engineer at Yelp
  3. Outline
      1. Overview of search services at Yelp
      2. Federation Motivation
      3. Lucy Indexing
      4. Lucy Searching
      5. Efficiently Retrieving top k hits
  4. The services we provide
  5. Lucy: business search
  6. Lucy also powers phone search
  7. Cathy: she talks a lot
  8. Listsearch: it searches lists...
  9. Reviewsearch: it searches reviews...
  10. DYM: did you really mean that?
  11. Suggest: auto completion
  12. Federation Motivation
  13. Problem: Search is too slow
  14. Hard Disk Seek Latency: disk seek 10,000,000 ns. Source: "Software Engineering Advice from Building Large-Scale Distributed Systems", Jeffrey Dean
  15. RAM read latency: main memory reference 100 ns
  16. Pinning Index in RAM
      ● vmtouch
      ● mlock
      ● http://hoytech.com/vmtouch/
  17. Problem: Index is too large to fit in memory on a single machine
  18. Geographical sharding
  19. Geographical Sharding drawbacks
      1. Cumbersome manual process to determine shard boundaries
      2. No guarantee that a boundary can be found
  20. Federation
      1. Split index across multiple machines
      2. Shard on business id
      3. TF-IDF scores from different machines should be comparable
  21. Mapping businesses to shards
      1. Assigning businesses to shards: shard = shardlist[hash(business_id) % len(shardlist)]
      Problems
      1. Involves re-indexing all the businesses if we want to add a new shard
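
The re-indexing problem on slide 21 can be seen by counting how many businesses change shard when a fifth shard is added under the modulo mapping. This is a minimal sketch, not Yelp's code: the business ids, the shard names, and the use of MD5 as a stable hash (Python's built-in `hash()` is salted per process for strings) are assumptions made for illustration.

```python
import hashlib


def stable_hash(business_id: str) -> int:
    """Deterministic hash of a business id (assumed stand-in for hash())."""
    return int(hashlib.md5(business_id.encode("utf-8")).hexdigest(), 16)


def shard_for(business_id: str, shardlist: list) -> str:
    """The modulo mapping from the slide."""
    return shardlist[stable_hash(business_id) % len(shardlist)]


def fraction_moved(business_ids, old_shards, new_shards) -> float:
    """Fraction of businesses whose shard changes between two shard lists."""
    moved = sum(
        1
        for b in business_ids
        if shard_for(b, old_shards) != shard_for(b, new_shards)
    )
    return moved / len(business_ids)


# Hypothetical ids and shard names.
ids = [f"biz-{i}" for i in range(10_000)]
four = ["shard0", "shard1", "shard2", "shard3"]
five = four + ["shard4"]

# A hash value h keeps its shard only if h % 4 == h % 5, which holds for
# 4 out of every 20 values, so roughly 80% of businesses move.
print(fraction_moved(ids, four, five))
```

Because most businesses land on a different shard, adding one machine means rebuilding nearly the whole index, which is the motivation for the virtual-node scheme on the next slides.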
  22. Virtual Nodes
  23. Advantages
      1. Flexibility (move vbuckets from one shard to another)
      2. Split hot spot shards
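
One way to picture the virtual-node idea from slides 22 and 23: businesses hash to a fixed number of virtual buckets (vbuckets), and a separate table maps vbuckets to shards. The sketch below is an illustration under assumptions, not the Lucy implementation; `NUM_VBUCKETS`, the round-robin initial assignment, and all names are hypothetical.

```python
import hashlib

NUM_VBUCKETS = 1024  # hypothetical; fixed for the lifetime of the index


def vbucket_for(business_id: str) -> int:
    """Hash a business id to a vbucket; independent of the shard count."""
    digest = hashlib.md5(business_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_VBUCKETS


def build_table(shards: list) -> dict:
    """Initial vbucket -> shard table, assigned round-robin."""
    return {vb: shards[vb % len(shards)] for vb in range(NUM_VBUCKETS)}


def shard_for(business_id: str, table: dict) -> str:
    return table[vbucket_for(business_id)]


def move_vbuckets(table: dict, vbuckets, target_shard: str) -> dict:
    """Return a new table with the given vbuckets reassigned."""
    new_table = dict(table)
    for vb in vbuckets:
        new_table[vb] = target_shard
    return new_table


table = build_table(["shard0", "shard1", "shard2", "shard3"])

# Split a hot shard: move half of shard0's vbuckets to a new shard4.
hot = [vb for vb, shard in table.items() if shard == "shard0"]
new_table = move_vbuckets(table, hot[: len(hot) // 2], "shard4")
```

Only businesses in the moved vbuckets need re-indexing; every other business keeps its shard, which is the flexibility the slide refers to.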
  24. Lucy Master Slave Architecture
      Separate indexing (masters): a master for each shard of a service
      Searching (slaves): a slave for every replica of a service
  25. Lucy Indexing
  26. Lucy Searching
  27. Federator: Combining results across shards
      1. Once we distribute an index across shards we need a component which will search all these shards and combine their results.
      2. Written in Python (runs inside a Python web process).
      3. Uses the Tornado IO loop to send requests to all shards.
      4. The transfer protocol for the requests is JSON-RPC.
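
The fan-out-and-merge pattern described on slide 27 can be sketched as follows. The slides name Tornado and JSON-RPC; this sketch swaps in the standard-library `asyncio` so that it runs without dependencies, and replaces the network call with an in-memory lookup. The shard data, function names, and the `(score, business_id)` hit format are all assumptions.

```python
import asyncio
import heapq

# Hypothetical per-shard results: (score, business_id), best score first.
SHARDS = {
    "shard0": [(0.9, "a"), (0.4, "d")],
    "shard1": [(0.8, "b"), (0.1, "e")],
    "shard2": [(0.7, "c")],
}


async def search_shard(name: str, query: str, num_hits: int) -> list:
    """Stand-in for a JSON-RPC request to one shard."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return SHARDS[name][:num_hits]


async def federated_search(query: str, num_hits: int) -> list:
    """Query every shard concurrently, then merge the sorted result lists."""
    per_shard = await asyncio.gather(
        *(search_shard(name, query, num_hits) for name in SHARDS)
    )
    # Each list is already sorted by descending score, so a k-way merge works.
    merged = heapq.merge(*per_shard, key=lambda hit: hit[0], reverse=True)
    return list(merged)[:num_hits]


results = asyncio.run(federated_search("pizza", 3))
print(results)  # [(0.9, 'a'), (0.8, 'b'), (0.7, 'c')]
```

Merging by raw score is only meaningful if scores from different shards are comparable, which is the requirement stated on slide 20.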
  28. Lucy Server
  29. Tokens to Business Attributes
  30. Executing queries
      1. Gather the top results for a query
      2. Collect attribute statistics for attributes like places, categories
  31. Lucene
      1. Efficiently executes queries over the index
      2. Provides how relevant the business is to the words in the query (word score)
      3. Upgrading Lucene to 2.9/3.1 is WIP
  32. Successive geobounds relaxation
  33. Successive geobounds relaxation
  34. Federation
  35. Efficiently Retrieving top k hits
      1. When the user moves through multiple pages, the number of hits to be returned increases: num hits = start + count
      2. So if we need to retrieve 500 hits, the naive way would be to retrieve 500 hits from each shard and then sort them
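
The naive strategy from slide 35 can be written down to make its cost explicit. This is a sketch under assumptions: the hit format `(score, business_id)`, the function names, and the synthetic five-shard data are made up for illustration.

```python
def hits_needed(start: int, count: int) -> int:
    """Hits required to serve a page: num hits = start + count."""
    return start + count


def naive_page(shard_results: list, start: int, count: int):
    """Fetch start + count hits from every shard, sort, and slice the page.

    Returns the page and the number of hits transferred from the shards.
    """
    n = hits_needed(start, count)
    candidates = []
    transferred = 0
    for hits in shard_results:
        top = hits[:n]  # each shard sends its own top n
        transferred += len(top)
        candidates.extend(top)
    candidates.sort(key=lambda hit: hit[0], reverse=True)
    return candidates[start:start + count], transferred


# Synthetic data: 5 shards, 600 hits each, scores interleaved across shards.
shard_results = [
    [(3000 - (i * 5 + s), f"s{s}-{i}") for i in range(600)]
    for s in range(5)
]

page, transferred = naive_page(shard_results, start=490, count=10)
print(len(page), transferred)  # 10 2500
```

Serving ten results on a deep page moves 2,500 hits over the network in this setup, which is the overhead the following slides try to reduce.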
  36. Distribution of hits in shards
  37. Probability a hit is in a shard
  38. Binomial Distribution: probability that r of the top k hits are in a particular shard; Mean; Variance
  39. Formula; Std Deviation; Formula
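
The formulas on slides 38 and 39 did not survive in the transcript. The standard binomial results they most likely refer to are given below, assuming each of the top k hits falls in a given shard independently with probability p (for example p = 1/n for n equally loaded shards; this modelling assumption is not stated in the transcript). Here X is the number of the top k hits that land in one particular shard.

```latex
P(X = r) = \binom{k}{r}\, p^{r} (1 - p)^{k - r}

\mu = k p
\qquad
\sigma^{2} = k p (1 - p)
\qquad
\sigma = \sqrt{k p (1 - p)}
```

For k = 100 and p = 0.2, the values used on the simulation slide, this gives a mean of 20 hits per shard and a standard deviation of 4.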
  40. Simulation of the formula (k = 100, p = 0.2)
      Hits selected from each shard    Results missed (%)
      24                               0.017
      32                               0.0001407
      44                               0.00000
  41. Simulation Graph
  42. Results
      1. ~50% savings over 100 hits (44 hits requested from each shard)
      2. 77% savings over 1000 hits (228 hits requested from each shard)
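
The per-shard request sizes on the simulation slide (24, 32, 44 for k = 100, p = 0.2) coincide with the binomial mean plus 1, 3, and 6 standard deviations. The sketch below reproduces those numbers and the savings on the results slide. It is an interpretation, not the author's stated formula: the function names and the "mean plus a multiple of the standard deviation" rule are assumptions, and the transcript does not say how 228 was chosen for k = 1000.

```python
import math


def hits_per_shard(k: int, p: float, num_std: float) -> int:
    """Hits to request from each shard: mean + num_std standard deviations."""
    mean = k * p
    std = math.sqrt(k * p * (1 - p))
    # round() guards against floating-point noise pushing ceil() up by one.
    return math.ceil(round(mean + num_std * std, 9))


def savings(k: int, per_shard: int, num_shards: int) -> float:
    """Fraction of hits saved versus requesting k hits from every shard."""
    naive = k * num_shards
    return 1 - (per_shard * num_shards) / naive


for num_std in (1, 3, 6):
    print(hits_per_shard(100, 0.2, num_std))  # 24, 32, 44

# Five shards is implied by p = 0.2 if hits are spread evenly (assumption).
print(round(savings(100, 44, 5), 3))    # 0.56
print(round(savings(1000, 228, 5), 3))  # 0.772
```

With five shards, requesting 44 instead of 100 hits per shard saves 56% of the transferred hits, and 228 instead of 1000 saves about 77%, in line with the figures on slide 42.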
  43. Future work
      1. In memory index
      2. Move towards real time search
  44. Come Join Us!
  45. Thank You, smg@yelp.com