
Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Search is an important and integrated part of the overall LinkedIn experience, and it takes many forms - such as Instant, SERP, Recruiter Search, Job Seeker, etc. Search needs to deal with both structured and unstructured content, and be personalized.

In this talk, Sriram will describe LinkedIn's unified infrastructure to support these different needs, and will provide some insights into our various approaches to search quality.

Published in: Technology
  • Video – not a dig on anyone, but trying to show we need to do some unique stuff

    On a journey – have made a lot of progress, but we still have a long way to go. Kumaresh will focus on our product experiences at the end.
  • Like most other companies needing to integrate search into their products
  • Conventional wisdom – CS276 notes, Facebook, etc. – LinkedIn not alone on this.
  • Other growing companies should keep all of this in mind
    Rebuilding – no index enhancements, resharding limited to adding shards at end
    Live updates require content store
  • Unifying infrastructure always pays dividends even if not the perfect fit for each use case
    Typeahead (instant) in production – so no more Cleo
  • Leaving out frontend, device side stuff
  • Scoring taken out of Lucene
  • Rewriting examples - intent recognition, stemming, synonyms, personalization
    Rationale for data models - examples are intent models, synonym tables, etc.
  • May no longer have Lucene
  • Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

    1. Recruiting Solutions
       Search at LinkedIn
       Sriram Sankar, Principal Staff Engineer
       Kumaresh Pattabiraman, Senior Product Manager
    2. https://www.youtube.com/watch?v=obCHKPYHuhA
    3. Search at LinkedIn
         • Personalized professional search
         • Part of a bigger product experience
         • But a really big part of it
    4. Some history . . .
    5. Approach to Search
         • Off the shelf components (Lucene)
         • Extended to address Lucene limitations (Sensei, Bobo, Zoie, Content Store)
         • Specialized verticals (Cleo, Krati)
         • Stack adopted for other purposes (recommendations, newsfeed, ads, analytics, etc.)
    6. Lucene
       An open source API that supports search functionality:
         • Add new documents to index
         • Delete documents from the index
         • Construct queries
         • Search the index using the query
         • Score the retrieved documents
    7. The Search Index
         • Inverted Index: Mapping from (search) terms to list of documents (they are present in)
         • Forward Index: Mapping from documents to metadata about them
    8. [Figure: two sample documents, numbered 1 and 2, made of filler text ("BLAH") with the terms Kumaresh, Sriram and LinkedIn embedded, shown next to their Inverted Index and Forward Index]
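The two index structures on slides 7 and 8 can be sketched in a few lines of Python. This is only an illustration: the function names, the metadata fields, and the assignment of terms to documents 1 and 2 are assumptions, since the figure itself did not survive extraction.

```python
from collections import defaultdict


def build_indexes(docs):
    """Build a toy inverted index and forward index.

    docs maps doc_id -> (text, metadata). The metadata fields are hypothetical.
    """
    inverted = defaultdict(list)  # term -> posting list of doc ids
    forward = {}                  # doc id -> metadata about the document
    for doc_id in sorted(docs):   # ascending ids keep posting lists sorted
        text, metadata = docs[doc_id]
        forward[doc_id] = metadata
        for term in set(text.lower().split()):
            inverted[term].append(doc_id)
    return dict(inverted), forward


def search_all(inverted, terms):
    """Return the ids of documents containing every term (an AND query)."""
    lists = [inverted.get(term.lower(), []) for term in terms]
    if not lists:
        return []
    result = set(lists[0])
    for posting_list in lists[1:]:
        result &= set(posting_list)
    return sorted(result)


# Assumed sample documents, loosely modelled on the slide's filler text.
docs = {
    1: ("blah Kumaresh blah LinkedIn blah", {"name": "Kumaresh"}),
    2: ("blah Sriram blah LinkedIn blah", {"name": "Sriram"}),
}
inverted, forward = build_indexes(docs)
matches = search_all(inverted, ["Sriram", "LinkedIn"])
```

Here `inverted["linkedin"]` holds both document ids, while the AND query for both terms narrows the result to the one document that contains them. Production posting lists are compressed and intersected without building sets; the sets are used here only for brevity.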
    9. The Search Index
         • The lists are called posting lists
         • Up to hundreds of millions of posting lists
         • Up to hundreds of millions of documents
         • Posting lists may contain as few as a single hit and as many as tens of millions of hits
         • Terms can be
             – words in the document
             – inferred attributes about the document
    10. Lucene Queries
         • “Sriram Sankar”
         • Sriram Kumaresh
         • +Sriram +LinkedIn
         • +Kumaresh connection:418001
         • +Kumaresh industry:software connection:418001^4
    11. Lucene Scoring
         • As documents are added to the index, Lucene maintains some metadata on the terms (e.g., term position, tf/idf)
         • Lucene accepts scoring information via query modifications, boosts, etc.
         • Lucene assigns a score to each retrieved document using this information
    12. Sensei
        Layer over Lucene that provides:
         • Sharding
         • Cluster management
         • Enhanced query language
    13. [no text on this slide]
    14. Sensei BQL
          SELECT * FROM cars WHERE price > 2000.00
          USING RELEVANCE MODEL my_model (favoriteColor:"black", favoriteTag:"cool")
          DEFINED AS (String favoriteColor, String favoriteTag)
          BEGIN
            float boost = 1.0;
            if (tags.contains(favoriteTag)) boost += 0.5;
            if (color.equals(favoriteColor)) boost += 1.2;
            return _INNER_SCORE * boost;
          END
    15. Live Updates – Zoie and Content Store
         • The index reader has to be reopened before earlier live updates are visible
         • The only way to perform a live update is to replace the entire document – which requires access to the unchanged attributes also
    16. Zoie
    17. Search Content Store
        [Diagram labels: Search Content Store, Lucene Index, Activity Feeds, Deletes, Inserts]
    18. Faceting
    19. Bobo
    20. Typeahead (Instant Search)
         • Results as you type
         • Conventional wisdom: Inverted indices cannot support typeahead
         • Cleo, Krati
    21. Fast forward to last year – and growing pains . . .
    22. Scalability
         • Rebuilding index from scratch extremely difficult
         • Not possible to use complex algorithms during indexing
         • Live updates at document granularity
         • Inflexible scoring – both at Lucene and Sensei levels
    23. Fragmentation
         • Too many open source components glued together with primary developers spread across many companies
         • Different instantiations starting to diverge to deal with their specific growing pains – so diverging stacks and distracted engineers
    24. Our new search stack . . .
        Two verticals already in production
    25. Life of a Query
        [Diagram labels: User Query, Query Rewriter/Planner, Search Shard, Search Shard, Results Merging, Search Results]
    26. Life of a Query – Within A Search Shard
        [Diagram labels: Rewritten Query, INDEX, Retrieve a Document, Score the Document, Top Results, Top Results From Shard]
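Slides 25 and 26 show a query fanning out to search shards, with each shard retrieving and scoring documents and returning its top results for merging. A minimal Python sketch of that scatter-gather flow follows. The shard layout, the `search_shard` and `scatter_gather` names, and the term-overlap score are assumptions made for illustration; the talk does not describe the real scoring or merging code.

```python
import heapq


def search_shard(shard, query_terms, k):
    """Retrieve and score documents in one shard, returning its top k.

    shard maps doc_id -> set of terms. The score is a stand-in: the number
    of query terms the document contains.
    """
    scored = []
    for doc_id, terms in shard.items():
        score = sum(1 for term in query_terms if term in terms)
        if score > 0:
            scored.append((score, doc_id))
    return heapq.nlargest(k, scored)


def scatter_gather(shards, query_terms, k):
    """Send the query to every shard, then merge the per-shard top-k lists."""
    partial_results = [search_shard(shard, query_terms, k) for shard in shards]
    all_hits = (hit for hits in partial_results for hit in hits)
    return heapq.nlargest(k, all_hits)


# Two hypothetical shards holding two documents each.
shards = [
    {1: {"sriram", "linkedin"}, 2: {"linkedin"}},
    {3: {"sriram"}, 4: {"kumaresh", "linkedin", "sriram"}},
]
top = scatter_gather(shards, ["sriram", "linkedin"], k=2)
```

Because each shard returns only its own top k, the merger handles at most k hits per shard regardless of how many documents matched. The shards are queried sequentially here; a real deployment would query them in parallel.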
    27. Life of a Query – Within A Rewriter
        [Diagram labels: Query, Rewriter State, Rewriter Module (×3), DATA MODEL (×3), Rewritten Query]
    28. Life of Data - Offline
        [Diagram labels: Raw Data, Derived Data, DATA MODEL (×5), INDEX]
    29. Benefits of New Stack
         • A complete search engine
         • Frequent reindexing possible (a full reset)
         • Resharding becomes easy
         • Clear separation of infrastructure and relevance functions
         • A single stack with a single identity!
    30. Early Termination
         • We order documents in the index based on a static rank – from most important to least important
         • An offline relevance algorithm assigns a static rank to each document on which the sorting is performed
         • This allows retrieval to be early-terminated (assuming a strong correlation between static rank and importance of result for a specific query)
         • Happens to work well with personalized search also
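The early-termination idea on slide 30 can be illustrated with a small Python sketch. It assumes a hypothetical `static_rank` table produced offline, lays documents out in descending rank order, and stops scanning once enough matches are found. The function names and data are illustrative only, not the production implementation.

```python
def build_ranked_index(docs, static_rank):
    """Lay documents out by descending static rank and build posting lists.

    docs maps doc_id -> set of terms; static_rank maps doc_id -> float.
    Posting lists hold positions in the rank order, so lower position
    means a more important document.
    """
    order = sorted(docs, key=lambda doc_id: static_rank[doc_id], reverse=True)
    postings = {}
    for position, doc_id in enumerate(order):
        for term in docs[doc_id]:
            postings.setdefault(term, []).append(position)
    return order, postings


def retrieve(order, postings, required_terms, limit):
    """Return up to `limit` matching doc ids and the number of postings scanned."""
    lists = [postings.get(term, []) for term in required_terms]
    if not lists or any(not posting_list for posting_list in lists):
        return [], 0
    others = [set(posting_list) for posting_list in lists[1:]]
    hits, scanned = [], 0
    for position in lists[0]:  # positions arrive best-ranked first
        scanned += 1
        if all(position in other for other in others):
            hits.append(order[position])
            if len(hits) == limit:
                break          # early termination: the rest rank lower
    return hits, scanned


# Hypothetical documents and offline static ranks.
docs = {"a": {"x", "y"}, "b": {"x"}, "c": {"x", "y"}, "d": {"x", "y"}}
static_rank = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
order, postings = build_ranked_index(docs, static_rank)
hits, scanned = retrieve(order, postings, ["x", "y"], limit=2)
```

In this example the scan stops after three of the four postings for `x`, because the two best-ranked matches have already been found. As the slide notes, this is only sound when static rank correlates strongly with how good a result is for the query.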
    31. New Strategy for Live Updates
         • Lucene segments are “document-partitioned”
         • We have enhanced Lucene with “term-partitioned” segments
         • We use 3 term-partitioned segments:
             – Base index (never changed)
             – Live update buffer
             – Snapshot index
         • Fault tolerant, and performant
         • No more content store!
    32. [Diagram labels: Base Index, Snapshot Index, Live Update Buffer]
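One plausible reading of the three-segment design on slides 31 and 32 is sketched below in Python. The slides do not say how the segments interact, so the flush step (folding the live update buffer into the snapshot) and the `LiveIndex` class are assumptions, and deletions are left out entirely.

```python
class LiveIndex:
    """Toy term-partitioned index with base, snapshot and buffer segments."""

    def __init__(self, base):
        # Base index: copied once and never modified afterwards.
        self.base = {term: list(ids) for term, ids in base.items()}
        self.snapshot = {}  # assumed: periodically persisted live updates
        self.buffer = {}    # assumed: most recent live updates, in memory

    def add(self, doc_id, terms):
        """Apply a live update by adding postings for the given terms only."""
        for term in terms:
            self.buffer.setdefault(term, []).append(doc_id)

    def flush(self):
        """Fold the buffer into the snapshot and start a fresh buffer."""
        for term, ids in self.buffer.items():
            self.snapshot.setdefault(term, []).extend(ids)
        self.buffer = {}

    def postings(self, term):
        """Answer a lookup by merging the three segments for one term."""
        merged = set(self.base.get(term, []))
        merged.update(self.snapshot.get(term, []))
        merged.update(self.buffer.get(term, []))
        return sorted(merged)


index = LiveIndex({"linkedin": [1, 2]})
index.add(3, ["linkedin", "search"])  # a new document arrives
index.flush()
index.add(1, ["search"])              # new term for existing doc 1
```

The last call shows why term-level updates remove the need for a content store in this reading: document 1 gains a term without its unchanged attributes being re-supplied, and the base index stays untouched.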
    33. Data Distribution
         • BitTorrent-based data distribution framework
         • More details at a later time
    34. Relevance
         • Offline analysis – resulting in a better index and data models
         • Query rewriting – for better and more accurate recall
         • Scoring – to fine tune each of the retrieved results
         • Reranking – selection of top results for overall result set quality
         • Blending – to combine results from multiple verticals
    35. Machine Learned Scorers
         • Goal: To automatically build a function whose arguments are interesting features of the query and the document
         • Input to the machine learning system is a set of training data that describes how the function should behave on various combinations of feature values
         • The function takes the form of standard templates – a linear formula is commonly used (due to simplicity)
    36. Linear Regression on a Single Feature
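As a worked illustration of slide 36, the following Python sketch fits a line to one feature with ordinary least squares. The training pairs are made up, and `fit_line` is a hypothetical helper; the talk does not say how LinkedIn trains its scorers.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is covariance(x, y) divided by variance(x).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up training data: one feature value and a target relevance label.
feature_values = [0.0, 1.0, 2.0, 3.0]
labels = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(feature_values, labels)
```

The training data here lies exactly on the line y = 1 + 2x, so the fit recovers an intercept of 1 and a slope of 2. With several features the same idea generalises to the linear formula mentioned on slide 35.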
    37. LinkedIn Scorer: Different Linear Models for Different Intents
         • Relevance models incorporate user features: score = P(Document | Query, User)
         • Tree with linear regression leaves
        [Diagram: a decision tree with internal tests "X2 = ?" and "X10 < 0.1234 ?" and three linear leaves:
           b_0 + b_1 T(x_1) + ... + b_n x_n
           a_0 + a_1 P(x_1) + ... + a_n Q(x_n)
           g_0 + g_1 R(x_1) + ... + g_n Q(x_n)]
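The "tree with linear regression leaves" on slide 37 can be sketched as a small Python structure in which internal nodes route a feature vector and leaves apply a linear formula. The class names, the weights, and the use of a single split are assumptions, and the per-feature transforms shown in the slide's leaf formulas (T, P, Q, R) are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Union


@dataclass
class Leaf:
    """A linear model: bias + sum of weight * feature."""
    bias: float
    weights: Sequence[float]

    def score(self, features: Sequence[float]) -> float:
        return self.bias + sum(w * x for w, x in zip(self.weights, features))


@dataclass
class Split:
    """An internal node that routes to one of two subtrees."""
    test: Callable[[Sequence[float]], bool]
    if_true: Union["Leaf", "Split"]
    if_false: Union["Leaf", "Split"]

    def score(self, features: Sequence[float]) -> float:
        branch = self.if_true if self.test(features) else self.if_false
        return branch.score(features)


# Hypothetical two-feature model: the threshold echoes the slide's
# "X10 < 0.1234 ?" test, but the weights are invented.
model = Split(
    test=lambda x: x[1] < 0.1234,
    if_true=Leaf(bias=1.0, weights=[2.0, 0.0]),
    if_false=Leaf(bias=0.0, weights=[0.5, 1.0]),
)
low = model.score([3.0, 0.05])   # routed to the first leaf
high = model.score([3.0, 0.5])   # routed to the second leaf
```

Routing by a tree lets each leaf hold a different linear model, which matches the slide's point about using different linear models for different intents.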
    38. Going Forward
         • Further standardize infrastructure for relevance components
         • Scatter-gather
         • Java GC issues
         • Extend infrastructure to browser/device
         • Reintegrate diverging stacks
    39. Product Overview
    40. LinkedIn’s Vision
        “Create economic opportunity for every member of the global workforce”
    41. The Economic Graph
    42. Search is core to the economic graph vision
    43. Who uses search?
         • Casual User: LI as professional identity
         • Job Seeker: LI as a way to get the day job
         • Outbound professional (Recruiter / Sales): LI as day job
    44. Casual User
         • Name Search
         • Topic Search
    45. Instant: Name Search
        Search all members by name or approximate name
    46. Unified Search: Topic Search
        One federated search result page with all relevant entities about the topic
    47. Outbound professional
        Exploratory people search
    48. Instant: Search Suggestions
        Entity-aware suggestions for companies, skills & titles
    49. Instant: Just one keystroke
        From name search to exploratory search
    50. People Search
        Explore using facets and advanced search fields
    51. People Search
        Leverage the network through shared connections
    52. Recruiter & Sales Navigator
        Products powered by search
    53. Job Seeker
        Job Search
    54. Instant: Search Suggestions
        Entity-aware suggestions for companies, skills & titles
    55. Job Search
        Explore using facets and advanced search fields
    56. Job Search
        Leverage the network through relationship to job poster or connections in the company
    57. Other Search Users include…
         • Students – University Search
         • Information Seekers / Researchers - Content Search
         • Advertisers / Content Marketers – Company & Group Search
    58. Bringing it all together
         • 300 Million+ members
         • Search the economic graph of: 300M profiles, 3B Endorsements, 300K jobs, 3M Companies, 2M Groups, 25K Schools, 100M+ pieces of professional content
         • One index, one unified search stack
        [Diagram labels: Users, Product, Platform]
    59. [no text on this slide]