
Galene - LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram Sankar, LinkedIn



Presented at Lucene/Solr Revolution 2014

Published in: Software


  1. 1. Galene: LinkedIn’s search architecture Diego Buthay & Sriram Sankar
  2. 2. LinkedIn’s Vision “Create economic opportunity for every member of the global workforce” • Find work • Realize your dream job • Be great at what you do
  3. 3. LinkedIn’s Vision Search and Recommendations are core to our Vision
  4. 4. Overview • Infrastructure scaling • Developer productivity scaling • Result quality scaling
  5. 5. Comparison of different Search Engines Netflix: AirBnB: Ebay: Bing: Google: Facebook:
  6. 6. Comparison of different Search Engines • Netflix: 100K • AirBnB: 800K • Ebay: 500M • Bing: 100’s of Billions • Google: 100’s of Billions • Facebook: Trillions
  7. 7. Comparison of different Search Engines • Netflix: 100K, Lucene • AirBnB: 800K, Lucene • Ebay: 500M, Custom C++ • Bing: 100’s of Billions, Custom C++ • Google: 100’s of Billions, Custom C++ • Facebook: Trillions, Custom C++ • LinkedIn: 100’s of Millions; Lucene, Galene (Lucene based), Galene (Custom)
  8. 8. Important Galene Features • Offline index building • Live updates at a fine granularity • Static rank and early termination • Faceting • Data distribution • Relevance framework
  9. 9. Offline index building Live updates at a fine granularity
  10. 10. A little about LinkedIn data • Most datasets at LinkedIn are available in 2 ways • A real-time change notification stream • A complete dataset, ETL’d to Hadoop • We often rely on derived datasets • Many derived datasets can’t be crunched in real time
  11. 11. Anatomy of a Galene index • Base Index • Generated by Hadoop periodically • Single-segment Lucene index • On disk. Immutable. MMAPed and MLOCKed • Contains complex / rich features that we can only afford to compute offline • Live Index • Inverted index with our own format • In-memory data structure • Contains incremental updates to documents • Snapshot Index • On-disk snapshot of Live index when necessary • Initially empty • Single-segment Lucene index. Live index is folded in regularly
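The layering on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain dictionaries stand in for the Lucene and in-memory indexes, the class and method names are hypothetical, and the rule that a newer layer's field values override older ones is an assumption rather than something the slides state.

```python
from types import MappingProxyType


class LayeredIndex:
    """Toy model of base + snapshot + live layers (hypothetical names)."""

    def __init__(self, base):
        # Base is built offline and treated as read-only here.
        self.base = MappingProxyType(dict(base))
        self.snapshot = {}  # initially empty, as on the slide
        self.live = {}      # in-memory incremental updates

    def update(self, uid, fields):
        # Incremental updates only ever touch the live layer.
        self.live.setdefault(uid, {}).update(fields)

    def fold_live_into_snapshot(self):
        # Periodically merge the live layer into the snapshot and reset it.
        for uid, fields in self.live.items():
            self.snapshot.setdefault(uid, {}).update(fields)
        self.live = {}

    def get(self, uid):
        # Assumption: newer layers override older ones field by field.
        doc = {}
        for layer in (self.base, self.snapshot, self.live):
            doc.update(layer.get(uid, {}))
        return doc


index = LayeredIndex({1: {"name": "jeff", "title": "ceo"}})
index.update(1, {"title": "chair"})
before_fold = index.get(1)
index.fold_live_into_snapshot()
after_fold = index.get(1)
```

Folding changes where the update is stored, not what a reader sees: `before_fold` and `after_fold` are the same merged document.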
  12. 12. [Figure: two example documents, 1 (containing "Jeff" and "LinkedIn") and 2 (containing "Reid" and "LinkedIn"), shown alongside an Inverted Index (with Posting Lists) for the terms Jeff, Reid, and LinkedIn, and a Forward Index for documents 1 and 2]
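As a rough illustration of the two structures named on this slide, the sketch below builds an inverted index (term to posting list) and a forward index (document to terms) for two toy documents. The document texts and the function name are made up for the example.

```python
from collections import defaultdict

# Hypothetical stand-ins for the two documents on the slide.
docs = {
    1: "jeff works at linkedin",
    2: "reid founded linkedin",
}


def build_indexes(documents):
    """Return (inverted, forward) for a dict of doc ID -> text."""
    inverted = defaultdict(list)  # term -> posting list of doc IDs
    forward = {}                  # doc ID -> list of terms
    for doc_id in sorted(documents):
        terms = documents[doc_id].split()
        forward[doc_id] = terms
        for term in set(terms):
            # Doc IDs are visited in ascending order, so each
            # posting list ends up sorted.
            inverted[term].append(doc_id)
    return dict(inverted), forward


inverted, forward = build_indexes(docs)
```

The inverted index answers "which documents contain this term?", while the forward index answers "what does this document contain?", which is what scoring typically needs once a document has been retrieved.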
  13. 13. [Figure: document IDs laid out across the Base Index, the Live Update Snapshot, and the In-Memory Live Updates]
  14. 14. Inverted Index: Three Segments Three independent segments with non-overlapping UIDs: • B1S1L1 (Base/snapshot/live) segment • Base has all UIDs of the segment. • Neither Snapshot nor Live introduces new UIDs. • S2L2 (Snapshot/live) segment • None of its UIDs exist in BSL. • Snapshot has all UIDs of the segment. • Live does not introduce any new UIDs. • L3 (live) segment • None of its UIDs exist in BSL or SL.
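The segment rules above amount to a partition of UIDs by the oldest layer that contains them. A minimal sketch, assuming we only know which UIDs appear in the base, snapshot, and live layers (the function name and the sample UIDs are hypothetical):

```python
def partition_segments(base_uids, snapshot_uids, live_uids):
    """Split UIDs into the three non-overlapping segments."""
    base = set(base_uids)
    snapshot = set(snapshot_uids)
    live = set(live_uids)
    return {
        # Every UID known to the base, whatever later layers say about it.
        "B1S1L1": base,
        # UIDs first seen in the snapshot.
        "S2L2": snapshot - base,
        # UIDs that exist only in the live layer.
        "L3": live - base - snapshot,
    }


segments = partition_segments(
    base_uids={1, 2, 3},
    snapshot_uids={2, 4, 5},  # 2 updates a base document; 4 and 5 are new
    live_uids={1, 4, 6},      # 1 and 4 are updates; 6 is new
)
```

Each UID lands in exactly one segment, so results from the three segments can be combined without deduplication.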
  15. 15. [Figure: the three segments B1 S1 L1, S2 L2, and L3]
  16. 16. Static rank and early termination
  17. 17. Search: Static Rank (SR) • A global score of a document • Each document must have one and only one SR • It could be anything that can globally represent the importance of a UID, for example, the number of 1st degree connections • Different documents may have the same SR • B1S1L1 segment • Base knows SRs of all UIDs of the segment • S2L2 segment • Snapshot knows SRs of all UIDs of the segment • L3 segment • We assign artificial SRs in either of two ways: • Ascending order starting from the max SR of all UIDs in all 3 segments • Descending order starting from the min SR of all UIDs in all 3 segments
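The artificial SR assignment for the L3 segment might look like the sketch below. The slide does not give the exact scheme, so the step of 1 beyond the current extremum, the function name, and the sample values are all assumptions.

```python
def assign_artificial_sr(existing_sr, new_uids, ascending):
    """Give live-only UIDs static ranks outside the existing range.

    existing_sr: dict of UID -> static rank for already ranked documents.
    new_uids:    UIDs in arrival order that have no real static rank yet.
    """
    if ascending:
        start = max(existing_sr.values())
        # Assumed step of +1: each new UID ranks above everything so far.
        return {uid: start + i for i, uid in enumerate(new_uids, start=1)}
    start = min(existing_sr.values())
    # Assumed step of -1: each new UID ranks below everything so far.
    return {uid: start - i for i, uid in enumerate(new_uids, start=1)}


known = {1: 40, 2: 75, 3: 10}
above = assign_artificial_sr(known, [7, 8], ascending=True)
below = assign_artificial_sr(known, [7, 8], ascending=False)
```

Either choice keeps the new documents on one side of every existing SR, which is what lets the whole L segment be ordered before or after the other segments.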
  18. 18. Search: Early Termination (ET) • Segment Level ET • Depending on the ordering of static rank assignment of the L segment, which affects the ordering of all segments, we can search: • BSL -> SL -> L (if it is descending) • L -> SL -> BSL (if it is ascending) • Posting List Level ET • Since all postings are first sorted by SR, early termination on a posting list guarantees that documents with the highest SRs are always retrieved first (however, this does not guarantee that the final scores are also the highest scores).
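A minimal sketch of both levels of early termination, assuming the descending case (BSL, then SL, then L) and posting lists already sorted by descending SR. The function name, the `matches` predicate, and the sample postings are hypothetical.

```python
def early_terminated_search(segments, matches, limit):
    """Collect up to `limit` matching UIDs, then stop reading postings.

    segments: posting lists in visiting order, each a list of
              (uid, static_rank) pairs sorted by descending static_rank.
    matches:  predicate deciding whether a UID satisfies the query.
    """
    hits = []
    for postings in segments:
        for uid, _static_rank in postings:
            if matches(uid):
                hits.append(uid)
                if len(hits) == limit:
                    # Early termination: later postings and later
                    # segments are never visited.
                    return hits
    return hits


bsl = [(10, 900), (11, 700), (12, 300)]
sl = [(20, 500), (21, 100)]
live = [(30, 50)]


def matches(uid):
    # Hypothetical query: every document except UID 11 matches.
    return uid != 11


top3 = early_terminated_search([bsl, sl, live], matches, limit=3)
everything = early_terminated_search([bsl, sl, live], matches, limit=10)
```

With `limit=3` the search stops inside the SL segment and never reads the L segment. As the slide notes, the documents skipped this way could still have scored higher under the final scoring function.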
  19. 19. Going Forward • Very efficient custom index in C++ • Base index build can be run in a distributed manner • BSL supported at a more fundamental level
  20. 20. Faceting
  21. 21. Faceting • Types of facets supported: • discoverable (e.g. current company) • static values (e.g. network) • supplied values (e.g. my groups) • Legacy stack had no early termination, allowing for exact facet counting (at a cost) • Current Galene stack applies heuristics to determine counts in an approximate manner • Going forward, custom posting list format will encode facet details for more efficient facet count estimation
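The slides do not say which heuristics Galene uses for approximate counts. Purely as an illustration of why early termination forces estimation, the sketch below uses linear extrapolation: count facet values over the documents actually scanned, then scale by an estimate of the total number of matches. The function name and inputs are assumptions.

```python
from collections import Counter


def estimate_facet_counts(scanned_values, estimated_total_matches):
    """Scale facet counts seen in a scanned prefix up to the full result set.

    scanned_values:          facet value of each document scanned before
                             early termination (e.g. current company).
    estimated_total_matches: assumed estimate of how many documents match.
    """
    scanned = len(scanned_values)
    if scanned == 0:
        return {}
    scale = estimated_total_matches / scanned
    return {
        value: round(count * scale)
        for value, count in Counter(scanned_values).items()
    }


sample = ["LinkedIn"] * 6 + ["Google"] * 4
estimates = estimate_facet_counts(sample, estimated_total_matches=1000)
```

This assumes facet values are distributed in the unscanned documents as they are in the scanned ones, which need not hold when postings are ordered by static rank.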
  22. 22. Relevance framework
  23. 23. Relevance Framework • Infrastructure to support common scoring needs • Provides framework to evaluate relevance changes • Enables rapid iterations over relevance experiments • Allows relevance engineers to focus on building features
  24. 24. Life of a Query – Within A Rewriter [Diagram labels: Query, Rewriter State, three Rewriter Modules each paired with a Data Model, Rewritten Query]
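One way to read this diagram is as a chain of rewriter modules, each consulting its own data model and a shared rewriter state. The sketch below follows that reading. The module names, the list-of-terms query representation, and the shape of the state are all hypothetical.

```python
def lowercase_module(query, state, data_model):
    # Normalizes terms; this module needs no data model.
    state["applied"].append("lowercase")
    return [term.lower() for term in query]


def synonym_module(query, state, data_model):
    # Expands each term with synonyms looked up in this module's data model.
    state["applied"].append("synonyms")
    rewritten = []
    for term in query:
        rewritten.append(term)
        rewritten.extend(data_model.get(term, []))
    return rewritten


def rewrite(query, modules):
    """Run the query through each (module, data_model) pair in order."""
    state = {"applied": []}  # shared rewriter state
    for module, data_model in modules:
        query = module(query, state, data_model)
    return query, state


rewritten_query, final_state = rewrite(
    ["Software", "Engineer"],
    [
        (lowercase_module, {}),
        (synonym_module, {"engineer": ["developer"]}),
    ],
)
```

Order matters in such a chain: here the synonym lookup only succeeds because the lowercase module ran first.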
  25. 25. Life of a Query – Within A Search Shard [Diagram labels: Rewritten Query, Index, Retrieve a Document, Score the Document, Top Results, Top Results From Shard]
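The retrieve, score, and keep-top-results loop can be sketched as follows. The index layout, the scoring function, and the step that merges results from several shards are assumptions made for the example; the slide only names the stages inside one shard.

```python
import heapq


def search_shard(index, required_terms, score, k):
    """Retrieve matching documents, score each one, keep the top k.

    index: dict of UID -> set of terms (toy stand-in for a shard's index).
    score: hypothetical scoring function taking a UID.
    """
    candidates = (
        uid for uid, terms in index.items() if required_terms <= terms
    )
    scored = ((score(uid), uid) for uid in candidates)
    return heapq.nlargest(k, scored)


def merge_shards(per_shard_results, k):
    # Assumed merge step: combine each shard's top results into a global top k.
    all_hits = (hit for results in per_shard_results for hit in results)
    return heapq.nlargest(k, all_hits)


popularity = {1: 0.9, 2: 0.4, 3: 0.7, 4: 0.8, 5: 0.1}
shard_a = {1: {"diego", "buthay"}, 2: {"diego"}, 3: {"sriram"}}
shard_b = {4: {"diego"}, 5: {"diego"}}

top_a = search_shard(shard_a, {"diego"}, popularity.get, k=2)
top_b = search_shard(shard_b, {"diego"}, popularity.get, k=2)
top_overall = merge_shards([top_a, top_b], k=3)
```

Each shard returns only its own top results, so the merge step handles at most `k` hits per shard regardless of shard size.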
  26. 26. Case study – Instant Search
  27. 27. Case Study: Instant Member Search • The index contains connections as document terms (term:diego AND prefix:buth AND (connection:35176 OR connection:418001 OR connection:1520032)) • Static Rank of documents reflects popularity • Documents are augmented offline with spell correction data • “shreeram sa” : (term:shreeram OR cluster:5678) AND (prefix:sa) AND (connection:1234)
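The second example query on this slide can be reproduced with a small rewriting function. This is a sketch under assumptions inferred from the example: the last typed token is treated as a prefix, earlier tokens as complete terms, and a hypothetical `clusters` mapping supplies spell-correction cluster IDs. The function name and the non-empty input requirement are also assumptions.

```python
def rewrite_instant_query(text, connections, clusters):
    """Build a boolean query string for instant member search.

    text:        what the member has typed so far (assumed non-empty).
    connections: member IDs the results must be connected to.
    clusters:    hypothetical map of term -> spell-correction cluster ID.
    """
    *terms, prefix = text.split()
    clauses = []
    for term in terms:
        if term in clusters:
            clauses.append(f"(term:{term} OR cluster:{clusters[term]})")
        else:
            clauses.append(f"(term:{term})")
    # Assumption: the token still being typed is matched as a prefix.
    clauses.append(f"(prefix:{prefix})")
    connection_clause = " OR ".join(f"connection:{c}" for c in connections)
    clauses.append(f"({connection_clause})")
    return " AND ".join(clauses)


query = rewrite_instant_query(
    "shreeram sa", connections=[1234], clusters={"shreeram": 5678}
)
```

Because connections are indexed as document terms, restricting results to the searcher's network is just another clause in the query rather than a post-filter.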
  28. 28. Summary • Infrastructure scaling • Developer productivity scaling • Result quality scaling
