Sudarshan Gaikaiwari - Lucene @ Yelp

This talk describes how Yelp uses Lucene to provide search services. It includes:

* Statistics of Yelp search usage
* Overview of Yelp search architecture: Yelp uses different services to search different types of data. Some are based on Lucene and some on Solr.
* Deeper dive into business and review search, the most important search service at Yelp.

Published in: Technology

Transcript

  • 1. Lucene @ Yelp. Sudarshan Gaikaiwari
  • 2. Bio: 1. Over a decade of experience in information retrieval. 2. Used IR techniques at Symantec's DLP group. 3. Search Engineer at Yelp.
  • 3. Outline: 1. Overview of search services at Yelp. 2. Federation Motivation. 3. Lucy Indexing. 4. Lucy Searching. 5. Efficiently Retrieving top k hits.
  • 4. The services we provide
  • 5. Lucy: business search
  • 6. Lucy also powers phone search
  • 7. Cathy: she talks a lot
  • 8. Listsearch: it searches lists....
  • 9. Reviewsearch: it searches reviews....
  • 10. DYM: did you really mean that?
  • 11. Suggest: auto completion
  • 12. Federation Motivation
  • 13. Problem: Search is too slow
  • 14. Hard Disk Seek Latency. Disk seek: 10,000,000 ns. Source: "Software Engineering Advice from Building Large-Scale Distributed Systems", Jeffrey Dean
  • 15. RAM read latency. Main memory reference: 100 ns
  • 16. Pinning Index in RAM: vmtouch; mlock; http://hoytech.com/vmtouch/
  • 17. Problem: Index is too large to fit in memory on a single machine
  • 18. Geographical sharding
  • 19. Geographical Sharding drawbacks: 1. Cumbersome manual process to determine shard boundary. 2. No guarantee that a boundary can be found.
  • 20. Federation: 1. Split index across multiple machines. 2. Shard on business id. 3. TF-IDF scores from different machines should be comparable.
  • 21. Mapping businesses to shards. 1. Assigning businesses to shards: shard = shardlist[hash(business_id) % len(shardlist)]. Problems: 1. Involves re-indexing all the businesses if we want to add a new shard.
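The re-indexing problem with modulo placement can be seen in a small sketch. The hash function (MD5 via `hashlib`), the shard names, and the business ids are illustrative assumptions, not details from the talk.

```python
import hashlib

def stable_hash(business_id):
    """Deterministic hash; Python's built-in hash() of a str is salted per process."""
    return int(hashlib.md5(business_id.encode("utf-8")).hexdigest(), 16)

def shard_for(business_id, shardlist):
    # Same rule as on the slide: hash the id, take it modulo the shard count.
    return shardlist[stable_hash(business_id) % len(shardlist)]

def fraction_moved(business_ids, old_shards, new_shards):
    """Share of businesses whose shard changes when the shard list changes."""
    moved = sum(
        1 for b in business_ids
        if shard_for(b, old_shards) != shard_for(b, new_shards)
    )
    return moved / len(business_ids)

# Hypothetical ids and shard names.
business_ids = [f"biz-{i}" for i in range(10_000)]
four_shards = ["shard0", "shard1", "shard2", "shard3"]
five_shards = four_shards + ["shard4"]

moved = fraction_moved(business_ids, four_shards, five_shards)
print(f"{moved:.0%} of businesses change shard when a fifth shard is added")
```

With a uniform hash, a business keeps its shard only when its hash gives the same remainder modulo 4 and modulo 5, which happens for about one id in five, so most of the index has to be rebuilt.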
  • 22. Virtual Nodes
  • 23. Advantages: 1. Flexibility (move vbuckets from one shard to another). 2. Split hot spot shards.
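A minimal sketch of the virtual node idea, assuming a fixed number of vbuckets and an explicit vbucket-to-shard table. The vbucket count, the hash function, and the rebalancing rule in `add_shard` are all assumptions made for illustration; the transcript does not describe Yelp's actual table or rebalancing procedure.

```python
import hashlib

NUM_VBUCKETS = 1024  # assumed count; chosen once and never changed

def vbucket_for(business_id):
    """A business always maps to the same vbucket, whatever the shard count."""
    digest = hashlib.md5(business_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_VBUCKETS

def build_table(shards):
    """Initial vbucket -> shard table, dealt round-robin."""
    return {vb: shards[vb % len(shards)] for vb in range(NUM_VBUCKETS)}

def add_shard(table, new_shard):
    """Return a new table in which new_shard owns an even share of vbuckets."""
    table = dict(table)
    owned = {}
    for vb, owner in table.items():
        owned.setdefault(owner, []).append(vb)
    target = NUM_VBUCKETS // (len(owned) + 1)
    for _ in range(target):
        # Take one vbucket from whichever existing shard currently holds the most.
        donor = max(owned, key=lambda s: len(owned[s]))
        table[owned[donor].pop()] = new_shard
    return table

old_table = build_table(["shard0", "shard1", "shard2", "shard3"])
new_table = add_shard(old_table, "shard4")

business_ids = [f"biz-{i}" for i in range(10_000)]  # hypothetical ids
moved = [
    b for b in business_ids
    if old_table[vbucket_for(b)] != new_table[vbucket_for(b)]
]
print(f"{len(moved) / len(business_ids):.0%} of businesses move")
```

Only the businesses in reassigned vbuckets move, and they all move to the new shard. The same table edit can also split a hot shard by handing some of its vbuckets to another machine.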
  • 24. Lucy Master Slave Architecture. Separate indexing (masters): a master for each shard of a service. Searching (slaves): a slave for every replica of a service.
  • 25. Lucy Indexing
  • 26. Lucy Searching
  • 27. Federator: Combining results across shards. 1. Once we distribute an index across shards we need a component which will search all these shards and combine their results. 2. Written in Python (runs inside a Python web process). 3. Uses Tornado IO loop to send requests to all shards. 4. The transfer protocol for the requests is JSON-RPC.
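The fan-out and combine step might look roughly like the sketch below. To stay self-contained it uses the standard library's `asyncio` in place of the Tornado IO loop named on the slide, and in-process stub shards in place of remote Lucy servers. The `search` method name, the request parameters, and the data are hypothetical.

```python
import asyncio
import heapq
import json

# Hypothetical per-shard hits as (score, business_id) pairs.
SHARD_DATA = {
    "shard0": [(9.1, "biz-a"), (4.2, "biz-d")],
    "shard1": [(7.5, "biz-b"), (6.8, "biz-c")],
    "shard2": [(3.3, "biz-e")],
}

async def call_shard(shard, request_json):
    """Stand-in for a network round trip: takes and returns JSON text."""
    request = json.loads(request_json)
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    hits = sorted(SHARD_DATA[shard], reverse=True)[: request["params"]["count"]]
    return json.dumps({"jsonrpc": "2.0", "id": request["id"], "result": hits})

async def federated_search(query, count):
    """Send the same request to every shard concurrently, then merge by score."""
    calls = []
    for request_id, shard in enumerate(SHARD_DATA):
        request = {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "search",  # hypothetical method name
            "params": {"query": query, "count": count},
        }
        calls.append(call_shard(shard, json.dumps(request)))
    responses = await asyncio.gather(*calls)
    all_hits = [
        tuple(hit)
        for response in responses
        for hit in json.loads(response)["result"]
    ]
    return heapq.nlargest(count, all_hits)

top = asyncio.run(federated_search("pizza", 3))
print(top)
```

Merging by raw score only works if scores from different shards are comparable, which is the requirement listed on the Federation slide.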
  • 28. Lucy Server
  • 29. Tokens to Business Attributes
  • 30. Executing queries: 1. Gather the top results for a query. 2. Collect attribute statistics for attributes like places, categories.
  • 31. Lucene: 1. Efficiently executes queries over the index. 2. Scores how relevant the business is to the words in the query (word score). 3. Upgrading Lucene to 2.9/3.1 is WIP.
  • 32. Successive geobounds relaxation
  • 33. Successive geobounds relaxation
  • 34. Federation
  • 35. Efficiently Retrieving top k hits. 1. When user moves through multiple pages the number of hits to be returned increases: num hits = start + count. 2. So if we need to retrieve 500 hits the naive way would be to retrieve 500 hits from each shard and then sort them.
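The naive approach can be sketched as follows, assuming each shard returns its hits already sorted by descending score. The function name and the sample data are hypothetical.

```python
import heapq
from itertools import islice

def naive_page(shard_results, start, count):
    """Fetch start + count hits from every shard, merge, and cut out one page.

    shard_results: one list per shard of (score, business_id) pairs,
    each sorted by descending score.
    Returns the page and the total number of hits the shards had to send.
    """
    num_hits = start + count
    per_shard = [hits[:num_hits] for hits in shard_results]
    merged = heapq.merge(*per_shard, key=lambda hit: hit[0], reverse=True)
    page = list(islice(merged, start, num_hits))
    transferred = sum(len(hits) for hits in per_shard)
    return page, transferred

shards = [
    [(9.0, "a"), (6.0, "d"), (3.0, "g")],
    [(8.0, "b"), (5.0, "e"), (2.0, "h")],
    [(7.0, "c"), (4.0, "f"), (1.0, "i")],
]
page, transferred = naive_page(shards, start=2, count=2)
print(page, transferred)
```

The page holds only `count` hits, yet each shard sends up to `start + count` of them, so the transfer cost grows with both the page depth and the number of shards.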
  • 36. Distribution of hits in shards
  • 37. Probability a hit is in a shard
  • 38. Binomial Distribution: probability that r of the top k hits are in a particular shard; mean; variance
  • 39. Formula; Std Deviation; Formula
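The formulas on these two slides did not survive in the transcript. For reference, the standard binomial results are given here, under the assumption that each of the top $k$ hits lands in a given shard independently with probability $p$ (for example $p = 1/n$ with $n$ evenly loaded shards). The symbol $R$ for the number of top-$k$ hits in one shard is introduced here for illustration.

```latex
\begin{align*}
P(R = r) &= \binom{k}{r}\, p^{r} (1 - p)^{k - r} \\
\text{Mean:}\quad \mu &= k p \\
\text{Variance:}\quad \sigma^{2} &= k p (1 - p) \\
\text{Std deviation:}\quad \sigma &= \sqrt{k p (1 - p)}
\end{align*}
```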
  • 40. Simulation. Formula; k = 100, p = 0.2
      Hits selected from each shard | Results Missed (%)
      24 | 0.017
      32 | 0.0001407
      44 | 0.00000
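The three per-shard sizes for k = 100 and p = 0.2 coincide with the binomial mean plus 1, 3, and 6 standard deviations. That reading is an inference from the numbers, not something the transcript states. The sketch below computes those sizes. In place of the talk's simulation it computes an exact expected share of missed top-k hits, assuming hits fall into shards independently and uniformly. Both function names are hypothetical.

```python
import math

def per_shard_request(k, p, c):
    """Hits to request from each shard: mean plus c standard deviations."""
    mean = k * p
    std = math.sqrt(k * p * (1 - p))
    # The small epsilon guards against float noise pushing ceil() up by one.
    return math.ceil(mean + c * std - 1e-9)

def expected_missed_fraction(k, n_shards, m):
    """Expected share of the true top k lost when each shard returns at most m.

    A shard holding r > m of the top k hits drops r - m of them; r is
    binomial with parameters k and p = 1 / n_shards (assumed model).
    """
    p = 1 / n_shards
    excess = sum(
        (r - m) * math.comb(k, r) * p**r * (1 - p) ** (k - r)
        for r in range(m + 1, k + 1)
    )
    return n_shards * excess / k

for c in (1, 3, 6):
    m = per_shard_request(100, 0.2, c)
    missed = expected_missed_fraction(100, 5, m)
    print(f"c={c}: request {m} per shard, expected missed fraction {missed:.7f}")
```

Under this model the per-shard request grows with the square root of k rather than with k itself, which is why the savings reported in the talk are larger at 1000 hits than at 100.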
  • 41. Simulation Graph
  • 42. Results: 1. ~50% savings over 100 hits (44 hits requested from each shard). 2. 77% savings over 1000 hits (228 hits requested from each shard).
  • 43. Future work: 1. In-memory index. 2. Move towards real-time search.
  • 44. Come Join Us!
  • 45. Thank You. smg@yelp.com