Search at LinkedIn by Sriram Sankar and Kumaresh Pattabiraman


Search is an important and integrated part of the overall LinkedIn experience, and it takes many forms - such as Instant, SERP, Recruiter Search, Job Seeker, etc. Search needs to deal with both structured and unstructured content, and be personalized.

In this talk, Sriram will describe LinkedIn's unified infrastructure for supporting these different needs and will provide some insights into our various approaches to search quality.

Published in: Technology

  • Video – not a dig on anyone, but trying to show we need to do some unique stuff

    On a journey – have made a lot of progress, but we still have a long way to go. Kumaresh will focus on our product experiences at the end.
  • Like most other companies needing to integrate search into their products
  • Conventional wisdom – CS276 notes, Facebook, etc. – LinkedIn not alone on this.
  • Other growing companies should keep all of this in mind
    Rebuilding – no index enhancements, resharding limited to adding shards at end
    Live updates require content store
  • Unifying infrastructure always pays dividends even if not the perfect fit for each use case
    Typeahead (instant) in production – so no more Cleo
  • Leaving out frontend, device side stuff
  • Scoring taken out of Lucene
  • Rewriting examples - intent recognition, stemming, synonyms, personalization
    Rationale for data models - examples are intent models, synonym tables, etc.
  • May no longer have Lucene
  • Search at LinkedIn by Sriram Sankar and Kumaresh Pattabiraman

    1. Search at LinkedIn
       Sriram Sankar, Principal Staff Engineer
       Kumaresh Pattabiraman, Senior Product Manager
    2. (no slide text)
    3. Search at LinkedIn
       • Personalized professional search
       • Part of a bigger product experience
       • But a really big part of it
    4. Some history . . .
    5. Approach to Search
       • Off-the-shelf components (Lucene)
       • Extended to address Lucene limitations (Sensei, Bobo, Zoie, Content Store)
       • Specialized verticals (Cleo, Krati)
       • Stack adopted for other purposes (recommendations, newsfeed, ads, analytics, etc.)
    6. Lucene
       An open source API that supports search functionality:
       • Add new documents to index
       • Delete documents from the index
       • Construct queries
       • Search the index using the query
       • Score the retrieved documents
    7. The Search Index
       • Inverted Index: Mapping from (search) terms to the list of documents they are present in
       • Forward Index: Mapping from documents to metadata about them
    8. [Figure: two sample documents (IDs 1 and 2), one mentioning Kumaresh and LinkedIn and the other mentioning Sriram and LinkedIn, shown with the inverted index over the terms Kumaresh, Sriram, and LinkedIn and with the forward index]
    9. The Search Index
       • The lists are called posting lists
       • Up to hundreds of millions of posting lists
       • Up to hundreds of millions of documents
       • Posting lists may contain as few as a single hit and as many as tens of millions of hits
       • Terms can be:
         – words in the document
         – inferred attributes about the document
    10. Lucene Queries
        • “Sriram Sankar”
        • Sriram Kumaresh
        • +Sriram +LinkedIn
        • +Kumaresh connection:418001
        • +Kumaresh industry:software connection:418001^4
    11. Lucene Scoring
        • As documents are added to the index, Lucene maintains some metadata on the terms (e.g., term position, tf/idf)
        • Lucene accepts scoring information via query modifications, boosts, etc.
        • Lucene assigns a score to each retrieved document using this information
    12. Sensei
        Layer over Lucene that provides:
        • Sharding
        • Cluster management
        • Enhanced query language
    13. (no slide text)
    14. Sensei BQL
        SELECT *
        FROM cars
        WHERE price > 2000.00
        USING RELEVANCE MODEL my_model (favoriteColor:"black", favoriteTag:"cool")
          DEFINED AS (String favoriteColor, String favoriteTag)
          BEGIN
            float boost = 1.0;
            if (tags.contains(favoriteTag)) boost += 0.5;
            if (color.equals(favoriteColor)) boost += 1.2;
            return _INNER_SCORE * boost;
          END
    15. Live Updates – Zoie and Content Store
        • The index reader has to be reopened before earlier live updates are visible
        • The only way to perform a live update is to replace the entire document – which requires access to the unchanged attributes also
    16. Zoie
    17. Search Content Store
        [Diagram: Search Content Store, Lucene Index, Activity Feeds, Deletes, Inserts]
    18. Faceting
    19. Bobo
    20. Typeahead (Instant Search)
        • Results as you type
        • Conventional wisdom: Inverted indices cannot support typeahead
        • Cleo, Krati
    21. Fast forward to last year – and growing pains . . .
    22. Scalability
        • Rebuilding index from scratch extremely difficult
        • Not possible to use complex algorithms during indexing
        • Live updates at document granularity
        • Inflexible scoring – both at Lucene and Sensei levels
    23. Fragmentation
        • Too many open source components glued together with primary developers spread across many companies
        • Different instantiations starting to diverge to deal with their specific growing pains – so diverging stacks and distracted engineers
    24. Our new search stack . . .
        Two verticals already in production
    25. Life of a Query
        [Diagram: User Query, Query Rewriter/Planner, two Search Shards, Results Merging, Search Results]
    26. Life of a Query – Within A Search Shard
        [Diagram: Rewritten Query, INDEX, Retrieve a Document, Score the Document, Top Results, Top Results From Shard]
    27. Life of a Query – Within A Rewriter
        [Diagram: Query, Rewriter State, three Rewriter Modules, three DATA MODELs, Rewritten Query]
    28. Life of Data - Offline
        [Diagram: Raw Data, Derived Data, INDEX, five DATA MODELs]
    29. Benefits of New Stack
        • A complete search engine
        • Frequent reindexing possible (a full reset)
        • Resharding becomes easy
        • Clear separation of infrastructure and relevance functions
        • A single stack with a single identity!
    30. Early Termination
        • We order documents in the index based on a static rank – from most important to least important
        • An offline relevance algorithm assigns a static rank to each document on which the sorting is performed
        • This allows retrieval to be early-terminated (assuming a strong correlation between static rank and importance of result for a specific query)
        • Happens to work well with personalized search also
    31. New Strategy for Live Updates
        • Lucene segments are “document-partitioned”
        • We have enhanced Lucene with “term-partitioned” segments
        • We use 3 term-partitioned segments:
          – Base index (never changed)
          – Live update buffer
          – Snapshot index
        • Fault tolerant, and performant
        • No more content store!
    32. [Diagram: Base Index, Snapshot Index, Live Update Buffer]
    33. Data Distribution
        • BitTorrent-based data distribution framework
        • More details at a later time
    34. Relevance
        • Offline analysis – resulting in a better index and data models
        • Query rewriting – for better and more accurate recall
        • Scoring – to fine tune each of the retrieved results
        • Reranking – selection of top results for overall result set quality
        • Blending – to combine results from multiple verticals
    35. Machine Learned Scorers
        • Goal: To automatically build a function whose arguments are interesting features of the query and the document
        • Input to the machine learning system is a set of training data that describes how the function should behave on various combinations of feature values
        • The function takes the form of standard templates – a linear formula is commonly used (due to simplicity)
    36. Linear Regression on a Single Feature
    37. LinkedIn Scorer: Different Linear Models for Different Intents
        • Relevance models incorporate user features: score = P(Document | Query, User)
        • Tree with linear regression leaves
        [Diagram: decision tree with split tests "X2 = ?" and "X10 < 0.1234 ?" and linear leaf models $a_0 + a_1 P(x_1) + \dots + a_n Q(x_n)$, $b_0 + b_1 T(x_1) + \dots + b_n x_n$, and $g_0 + g_1 R(x_1) + \dots + g_n Q(x_n)$]
    38. Going Forward
        • Further standardize infrastructure for relevance components
        • Scatter-gather
        • Java GC issues
        • Extend infrastructure to browser/device
        • Reintegrate diverging stacks
    39. Product Overview
    40. LinkedIn’s Vision
        “Create economic opportunity for every member of the global workforce”
    41. The Economic Graph
    42. Search is core to the economic graph vision
    43. Who uses search?
        • Casual User: LI as professional identity
        • Job Seeker: LI as a way to get the day job
        • Outbound professional (Recruiter / Sales): LI as day job
    44. Casual User
        • Name Search
        • Topic Search
    45. Instant: Name Search
        Search all members by name or approximate name
    46. Unified Search: Topic Search
        One federated search result page with all relevant entities about the topic
    47. Outbound professional
        Exploratory people search
    48. Instant: Search Suggestions
        Entity-aware suggestions for companies, skills & titles
    49. Instant: Just one keystroke
        From name search to exploratory search
    50. People Search
        Explore using facets and advanced search fields
    51. People Search
        Leverage the network through shared connections
    52. Recruiter & Sales Navigator
        Products powered by search
    53. Job Seeker
        Job Search
    54. Instant: Search Suggestions
        Entity-aware suggestions for companies, skills & titles
    55. Job Search
        Explore using facets and advanced search fields
    56. Job Search
        Leverage the network through relationship to job poster or connections in the company
    57. Other Search Users include…
        • Students – University Search
        • Information Seekers / Researchers – Content Search
        • Advertisers / Content Marketers – Company & Group Search
    58. Bringing it all together
        [Diagram with layers labeled Users, Product, and Platform]
        • 300 Million+ members
        • Search the economic graph of 300M profiles, 3B endorsements, 300K jobs, 3M companies, 2M groups, 25K schools, and 100M+ pieces of professional content
        • One index, one unified search stack
    59. (no slide text)
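
The index and early-termination slides (7, 9, and 30) can be illustrated with a small Python sketch. This is not LinkedIn's implementation. The corpus, the static rank values, and the names `DOCS`, `build_index`, and `search` are all hypothetical. The sketch assumes that documents are renumbered by descending static rank, so every posting list is already ordered from most to least important and an AND query can stop as soon as it has collected k hits.

```python
from typing import Dict, List, Tuple

# Hypothetical toy corpus: doc_id -> (static_rank, text).
# A higher static rank means a more important document.
DOCS: Dict[int, Tuple[float, str]] = {
    1: (0.9, "sriram linkedin search"),
    2: (0.4, "kumaresh linkedin product"),
    3: (0.7, "sriram kumaresh linkedin"),
    4: (0.1, "linkedin jobs"),
}


def build_index(
    docs: Dict[int, Tuple[float, str]]
) -> Tuple[List[int], Dict[str, List[int]]]:
    """Build a static-rank-ordered inverted index.

    Returns (order, postings): `order` lists doc ids from highest to
    lowest static rank, and `postings` maps each term to the ascending
    list of positions in `order` where the term occurs.
    """
    order = sorted(docs, key=lambda d: docs[d][0], reverse=True)
    postings: Dict[str, List[int]] = {}
    for pos, doc_id in enumerate(order):
        for term in set(docs[doc_id][1].split()):
            postings.setdefault(term, []).append(pos)
    return order, postings


def search(
    terms: List[str],
    order: List[int],
    postings: Dict[str, List[int]],
    k: int,
) -> List[int]:
    """AND query that stops after the first k matching documents."""
    lists = [postings.get(t, []) for t in terms]
    if not lists or any(not pl for pl in lists):
        return []
    # Walk the shortest posting list and probe the others.
    # Sets are used for brevity; a real engine would advance all
    # lists in lockstep instead of materializing them.
    lists.sort(key=len)
    others = [set(pl) for pl in lists[1:]]
    hits: List[int] = []
    for pos in lists[0]:
        if all(pos in s for s in others):
            hits.append(order[pos])
            if len(hits) == k:
                break  # early termination: remaining docs rank lower
    return hits
```

With this toy data, `build_index(DOCS)` orders the documents as `[1, 3, 2, 4]`, and `search(["linkedin"], order, postings, k=2)` returns `[1, 3]` without visiting the two lowest-ranked documents. The results are only as good as the correlation between static rank and per-query relevance, which is the assumption slide 30 states.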