Sudarshan Gaikaiwari - Lucene @ Yelp

This talk describes how Yelp uses Lucene to provide search services. It includes:

* Statistics of Yelp search usage
* Overview of Yelp search architecture: Yelp uses different services to provide search for different types of data. Some are based on Lucene and some on Solr.
* Deeper dive into business and review search. This is the most important search service at Yelp.

Published in: Technology



  • 1. Lucene @ Yelp. Sudarshan Gaikaiwari
  • 2. Bio: 1. Over a decade of experience in information retrieval. 2. Used IR techniques at Symantec's DLP group. 3. Search Engineer at Yelp.
  • 3. Outline: 1. Overview of search services at Yelp. 2. Federation Motivation. 3. Lucy Indexing. 4. Lucy Searching. 5. Efficiently Retrieving top k hits.
  • 4. The services we provide
  • 5. Lucy: business search
  • 6. Lucy also powers phone search
  • 7. Cathy: she talks a lot
  • 8. Listsearch: it searches lists....
  • 9. Reviewsearch: it searches reviews....
  • 10. DYM: did you really mean that?
  • 11. Suggest: auto completion
  • 12. Federation Motivation
  • 13. Problem: Search is too slow
  • 14. Hard Disk Seek Latency: Disk seek 10,000,000 ns. Source: Software Engineering Advice from Building Large-Scale Distributed Systems, Jeffrey Dean
  • 15. RAM read latency: Main memory reference 100 ns
  • 16. Pinning Index in RAM: vmtouch, mlock
  • 17. Problem: Index is too large to fit in memory on a single machine
  • 18. Geographical sharding
  • 19. Geographical Sharding drawbacks: 1. Cumbersome manual process to determine shard boundary. 2. No guarantee that a boundary can be found.
  • 20. Federation: 1. Split index across multiple machines. 2. Shard on business id. 3. TF-IDF scores from different machines should be comparable.
  • 21. Mapping businesses to shards: 1. Assigning businesses to shards: shard = shardlist[hash(business_id) % len(shardlist)]. Problems: 1. Involves re-indexing all the businesses if we want to add a new shard.
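The re-indexing problem with modulo-based assignment can be illustrated with a minimal sketch. It is not Yelp's code: `stable_hash`, `shard_for`, `fraction_moved`, and the shard names are hypothetical, and MD5 is assumed here only because Python's built-in `hash()` for strings changes between processes.

```python
import hashlib


def stable_hash(business_id):
    # Assumption: any process-independent hash would do; MD5 is used for illustration.
    return int(hashlib.md5(business_id.encode("utf-8")).hexdigest(), 16)


def shard_for(business_id, shardlist):
    # The assignment rule from the slide: hash modulo the number of shards.
    return shardlist[stable_hash(business_id) % len(shardlist)]


def fraction_moved(business_ids, old_shards, new_shards):
    # Fraction of businesses whose shard changes when the shard list changes.
    moved = sum(
        1 for b in business_ids
        if shard_for(b, old_shards) != shard_for(b, new_shards)
    )
    return moved / len(business_ids)


if __name__ == "__main__":
    ids = ["biz-%d" % i for i in range(10000)]
    old = ["s0", "s1", "s2", "s3"]
    new = old + ["s4"]
    print("fraction re-assigned:", fraction_moved(ids, old, new))
```

With a uniform hash, a business keeps its shard only when the hash gives the same remainder modulo 4 and modulo 5, so roughly four out of five businesses move when a fifth shard is added. That is close to re-indexing everything.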
  • 22. Virtual Nodes
  • 23. Advantages: 1. Flexibility (move vbuckets from one shard to another). 2. Split hot spot shards.
  • 24. Lucy Master Slave Architecture: Separate indexing (masters): a master for each shard of a service. Searching (slaves): a slave for every replica of a service.
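The slides name virtual nodes and vbuckets but the transcript does not show how they are laid out, so the following is only a sketch of one common arrangement. Businesses hash to a fixed number of virtual buckets, and a table maps each bucket to a physical shard. `NUM_VBUCKETS`, `build_table`, `add_shard`, and the shard names are all assumptions made for illustration.

```python
import hashlib

NUM_VBUCKETS = 1024  # hypothetical; fixed for the lifetime of the index


def stable_hash(business_id):
    # Assumption: a process-independent hash, as Python's hash() varies per process.
    return int(hashlib.md5(business_id.encode("utf-8")).hexdigest(), 16)


def vbucket_for(business_id):
    # A business always maps to the same vbucket, regardless of the shard count.
    return stable_hash(business_id) % NUM_VBUCKETS


def build_table(shardlist):
    # Initial vbucket -> shard table, spread round-robin over the shards.
    return [shardlist[v % len(shardlist)] for v in range(NUM_VBUCKETS)]


def add_shard(table, new_shard, total_shards):
    # Hand every total_shards-th vbucket to the new shard; the rest stay put.
    new_table = list(table)
    for v in range(0, NUM_VBUCKETS, total_shards):
        new_table[v] = new_shard
    return new_table


def shard_for(business_id, table):
    return table[vbucket_for(business_id)]


if __name__ == "__main__":
    old = build_table(["s0", "s1", "s2", "s3"])
    new = add_shard(old, "s4", total_shards=5)
    moved = sum(1 for a, b in zip(old, new) if a != b)
    print("vbuckets moved:", moved, "of", NUM_VBUCKETS)
```

In this sketch only the vbuckets handed to the new shard need re-indexing (about a fifth of them), and a hot shard could be split the same way by reassigning some of its vbuckets.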
  • 25. Lucy Indexing
  • 26. Lucy Searching
  • 27. Federator: Combining results across shards. 1. Once we distribute an index across shards we need a component which will search all these shards and combine their results. 2. Written in Python (runs inside a Python web process). 3. Uses Tornado IO loop to send requests to all shards. 4. The transfer protocol for the requests is JSON-RPC.
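The fan-out and merge step of such a federator might look roughly like the sketch below. It uses the standard library's `asyncio` in place of Tornado's IO loop and in-memory fake shards in place of JSON-RPC calls, so the shard data, `search_shard`, and `federated_search` are assumptions rather than the code described in the talk.

```python
import asyncio
import heapq

# Hypothetical per-shard results as (score, business_id) pairs.
FAKE_SHARDS = {
    "shard-0": [(0.91, "biz-a"), (0.40, "biz-d")],
    "shard-1": [(0.75, "biz-b"), (0.10, "biz-f")],
    "shard-2": [(0.88, "biz-c"), (0.52, "biz-e")],
}


async def search_shard(shard, query, num_hits):
    # Stand-in for a network request to one shard; yields control like real IO.
    await asyncio.sleep(0)
    return heapq.nlargest(num_hits, FAKE_SHARDS[shard])


async def federated_search(query, num_hits):
    # Send the query to every shard concurrently and wait for all replies.
    per_shard = await asyncio.gather(
        *(search_shard(shard, query, num_hits) for shard in FAKE_SHARDS)
    )
    # Combine the per-shard lists and keep the globally best num_hits.
    all_hits = (hit for hits in per_shard for hit in hits)
    return heapq.nlargest(num_hits, all_hits)


if __name__ == "__main__":
    print(asyncio.run(federated_search("pizza", 3)))
```

Merging by score only makes sense if scores from different shards are comparable, which is the requirement stated on the Federation slide.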
  • 28. Lucy Server
  • 29. Tokens to Business Attributes
  • 30. Executing queries: 1. Gather the top results for a query. 2. Collect attribute statistics for attributes like places, categories.
  • 31. Lucene: 1. Efficiently executes queries over the index. 2. Provides a score for how relevant the business is to the words in the query (word score). 3. Upgrading Lucene to 2.9/3.1 is WIP.
  • 32. Successive geobounds relaxation
  • 33. Successive geobounds relaxation
  • 34. Federation
  • 35. Efficiently Retrieving top k hits: 1. When a user moves through multiple pages the number of hits to be returned increases: num hits = start + count. 2. So if we need to retrieve 500 hits the naive way would be to retrieve 500 hits from each shard and then sort them.
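A small sketch of the naive approach makes its cost visible. The function name `naive_page` and the synthetic scores are hypothetical; each shard is assumed to return its hits already sorted by descending score.

```python
import heapq
from itertools import islice


def naive_page(shard_hits, start, count):
    # shard_hits: one list of scores per shard, each sorted in descending order.
    num_hits = start + count
    # Naive strategy: ask every shard for its full top num_hits.
    fetched = [hits[:num_hits] for hits in shard_hits]
    transferred = sum(len(hits) for hits in fetched)
    # Merge the sorted lists and slice out the requested page.
    merged = heapq.merge(*fetched, reverse=True)
    page = list(islice(merged, start, num_hits))
    return page, transferred


if __name__ == "__main__":
    # Five synthetic shards whose scores interleave evenly in the global order.
    shards = [[1000 - (j * 5 + i) for j in range(600)] for i in range(5)]
    page, transferred = naive_page(shards, start=490, count=10)
    print(page, transferred)
```

In this evenly interleaved example each shard contributes only 100 of the global top 500, yet 2,500 hits are transferred to serve a page of 10.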
  • 36. Distribution of hits in shards
  • 37. Probability a hit is in a shard
  • 38. Binomial Distribution: Probability (r of top k hits) are in a particular shard. Mean. Variance.
  • 39. Formula. Std Deviation. Formula.
  • 40. Simulation (k = 100, p = 0.2). Columns: Formula; Hits selected from each shard; Results Missed (%). Rows: 24, 0.017; 32, 0.0001407; 44, 0.00000.
  • 41. Simulation Graph
  • 42. Results: 1. ~50% savings over 100 hits (44 hits requested from each shard). 2. 77% savings over 1000 hits (228 hits requested from each shard).
  • 43. Future work: 1. In memory index. 2. Move towards real time search.
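The arithmetic behind these numbers can be sketched as follows. The slide formulas are not in the transcript, so this is one reading rather than the talk's method: if each of the top k hits lands in a given shard with probability p, the count in that shard is binomial with mean kp and standard deviation sqrt(kp(1 - p)), and a per-shard cut-off can be set at the mean plus some number of standard deviations. For k = 100 and p = 0.2, cut-offs of 1, 3, and 6 standard deviations give 24, 32, and 44, which match the simulation slide. The function names are hypothetical.

```python
import math


def shard_hit_stats(k, p):
    # Binomial mean and standard deviation of top-k hits falling in one shard.
    mean = k * p
    std = math.sqrt(k * p * (1 - p))
    return mean, std


def hits_per_shard(k, p, n_std):
    # Assumed cut-off: mean plus n_std standard deviations, capped at k.
    mean, std = shard_hit_stats(k, p)
    # The small epsilon keeps float noise from pushing ceil() up by one.
    return min(k, math.ceil(mean + n_std * std - 1e-9))


def savings(k, per_shard):
    # Fraction of hits saved versus requesting all k hits from every shard.
    return 1 - per_shard / k


if __name__ == "__main__":
    for n_std in (1, 3, 6):
        print(n_std, hits_per_shard(100, 0.2, n_std))
    print(savings(100, 44), savings(1000, 228))
```

Taking the per-shard counts from the Results slide as given, requesting 44 instead of 100 hits saves 56%, and requesting 228 instead of 1000 saves about 77%. The 228 figure does not correspond to a whole number of standard deviations under this reading, so it is used as stated rather than derived.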
  • 44. Come Join Us!
  • 45. Thank You