Recruiting Solutions
Search at LinkedIn
Sriram Sankar, Principal Staff Engineer
Kumaresh Pattabiraman, Senior Product Manager
https://www.youtube.com/watch?v=obCHKPYHuhA
2
Search at LinkedIn
 Personalized professional search
 Part of a bigger product experience
 But a really big part of it
3
4
Some history . . .
Approach to Search
 Off the shelf components (Lucene)
 Extended to address Lucene limitations (Sensei, Bobo,
Zoie, Content Store)
 Specialized verticals (Cleo, Krati)
 Stack adopted for other purposes (recommendations,
newsfeed, ads, analytics, etc.)
5
Lucene
An open source API that supports search functionality:
 Add new documents to index
 Delete documents from the index
 Construct queries
 Search the index using the query
 Score the retrieved documents
6
The Search Index
 Inverted Index: Mapping from (search) terms to the list of
documents they appear in
 Forward Index: Mapping from documents to metadata
about them
7
8
[Figure: two sample documents, numbered 1 and 2, made of filler text plus the terms Kumaresh, Sriram and LinkedIn, shown next to the resulting Inverted Index (terms pointing to document numbers) and Forward Index]
The Search Index
 The lists are called posting lists
 Up to hundreds of millions of posting lists
 Up to hundreds of millions of documents
 Posting lists may contain as few as a single hit and as
many as tens of millions of hits
 Terms can be
– words in the document
– inferred attributes about the document
9
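The two index structures described above can be sketched in a few lines. This is a minimal illustration, not Lucene's on-disk format: the sample documents, the whitespace tokenizer, and the function name build_indexes are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample documents; ids and text are made up for illustration.
documents = {
    1: "Kumaresh works on search products at LinkedIn",
    2: "Sriram works on search infrastructure at LinkedIn",
}

def build_indexes(docs):
    """Return (inverted, forward) indexes for a dict of doc_id -> text."""
    inverted = defaultdict(list)   # term -> posting list of doc ids
    forward = {}                   # doc id -> metadata about the document
    for doc_id in sorted(docs):
        terms = docs[doc_id].lower().split()
        forward[doc_id] = {"length": len(terms), "text": docs[doc_id]}
        for term in sorted(set(terms)):
            inverted[term].append(doc_id)
    return dict(inverted), forward

inverted, forward = build_indexes(documents)
print(inverted["linkedin"])   # posting list for the term "linkedin"
print(forward[1]["length"])   # metadata stored for document 1
```

Inferred attributes (for example an industry label) could be added to the inverted index the same way, as extra terms that do not literally occur in the text.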
Lucene Queries
 “Sriram Sankar”
 Sriram Kumaresh
 +Sriram +LinkedIn
 +Kumaresh connection:418001
 +Kumaresh industry:software connection:418001^4
10
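To make the query examples concrete, here is a simplified sketch of how required (+) clauses, optional clauses, and boosts (^n) could be evaluated against posting lists. It is not Lucene's query parser or scorer: phrase queries are not handled, field:value clauses are treated as opaque terms, the score is just the sum of matched boosts, and the posting lists are hypothetical.

```python
def parse_query(query):
    """Split a query into (term, required, boost) clauses.

    Simplified: whitespace-separated clauses only, no phrase queries.
    """
    clauses = []
    for token in query.split():
        required = token.startswith("+")
        token = token.lstrip("+")
        term, _, boost = token.partition("^")
        clauses.append((term.lower(), required, float(boost) if boost else 1.0))
    return clauses

def search(index, query):
    """Return {doc_id: score} for documents matching the query."""
    clauses = parse_query(query)
    required = [set(index.get(t, [])) for t, req, _ in clauses if req]
    if required:
        # Every required clause must match.
        candidates = set.intersection(*required)
    else:
        # No required clause: any optional clause may match.
        candidates = set().union(*(index.get(t, []) for t, _, _ in clauses))
    scores = {}
    for doc_id in candidates:
        # Assumed scoring rule: add up the boosts of the clauses that matched.
        scores[doc_id] = sum(b for t, _, b in clauses if doc_id in index.get(t, []))
    return scores

# Hypothetical posting lists, keyed by term or by field:value attribute.
index = {
    "kumaresh": [1, 3],
    "sriram": [2],
    "linkedin": [1, 2],
    "industry:software": [1, 2, 3],
    "connection:418001": [3],
}

print(search(index, "+Kumaresh industry:software connection:418001^4"))
```

In the last query only Kumaresh is required; the industry and connection clauses merely raise the score, and the ^4 boost makes a match on the connection clause count four times as much.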
Lucene Scoring
 As documents are added to the index, Lucene maintains
some metadata on the terms (e.g., term position, tf/idf)
 Lucene accepts scoring information via query
modifications, boosts, etc.
 Lucene assigns a score to each retrieved document
using this information
11
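As a rough illustration of the tf/idf metadata mentioned above, the following sketch computes a textbook tf-idf weight. Lucene's actual similarity formula has additional factors (normalization, boosts), so this shows the idea rather than Lucene's scoring; the function name and sample documents are assumptions.

```python
import math

def tf_idf(term, doc_terms, all_docs):
    """Textbook tf-idf: term frequency times log inverse document frequency."""
    tf = doc_terms.count(term)                      # occurrences in this document
    df = sum(1 for d in all_docs if term in d)      # documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(all_docs) / df)

# Hypothetical tokenized documents.
docs = [
    ["sriram", "linkedin"],
    ["kumaresh", "linkedin"],
    ["search", "search", "lucene"],
]

print(tf_idf("linkedin", docs[0], docs))   # common term, lower weight
print(tf_idf("search", docs[2], docs))     # rare and repeated, higher weight
```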
Sensei
Layer over Lucene that provides:
 Sharding
 Cluster management
 Enhanced query language
12
13
Sensei BQL
SELECT *
FROM cars
WHERE price > 2000.00
USING RELEVANCE MODEL my_model
  (favoriteColor:"black", favoriteTag:"cool")
DEFINED AS (String favoriteColor,
            String favoriteTag)
BEGIN
  float boost = 1.0;
  if (tags.contains(favoriteTag))
    boost += 0.5;
  if (color.equals(favoriteColor))
    boost += 1.2;
  return _INNER_SCORE * boost;
END
14
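The relevance model in the BQL example is essentially a small per-document function. A Python rendering of the same logic might look like the sketch below; it assumes the color comparison is meant to use the favoriteColor parameter, and the document dictionary layout and argument names are hypothetical.

```python
def my_model(doc, inner_score, favorite_color, favorite_tag):
    """Multiply the engine's inner score by a boost based on user preferences."""
    boost = 1.0
    if favorite_tag in doc["tags"]:
        boost += 0.5
    if doc["color"] == favorite_color:
        boost += 1.2
    return inner_score * boost

# Hypothetical car document and inner score.
car = {"tags": ["cool", "hybrid"], "color": "black", "price": 2500.00}
print(my_model(car, inner_score=2.0, favorite_color="black", favorite_tag="cool"))
```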
Live Updates – Zoie and Content Store
 The index reader has to be reopened before earlier live
updates are visible
 The only way to perform a live update is to replace the
entire document – which requires access to the
unchanged attributes also
15
Zoie
16
Search Content Store
17
[Diagram labels: Search Content Store, Lucene Index, Activity Feeds, Deletes, Inserts]
Faceting
18
Bobo
19
Typeahead (Instant Search)
 Results as you type
 Conventional wisdom: Inverted indices cannot support
typeahead
 Cleo, Krati
20
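One way an inverted index can serve typeahead is to index word prefixes as terms. The sketch below illustrates that idea only; it is not a description of how Cleo or Krati work, and the names, the MAX_PREFIX cutoff, and the sample member list are assumptions.

```python
from collections import defaultdict

MAX_PREFIX = 5   # assumed cutoff: longer inputs are filtered after lookup

def build_prefix_index(names):
    """Map every word prefix (up to MAX_PREFIX chars) to the ids containing it."""
    index = defaultdict(set)
    for doc_id, name in names.items():
        for word in name.lower().split():
            for n in range(1, min(len(word), MAX_PREFIX) + 1):
                index[word[:n]].add(doc_id)
    return index

def typeahead(index, names, typed, limit=5):
    """Return up to `limit` names with a word starting with the typed text."""
    typed = typed.lower()
    candidates = index.get(typed[:MAX_PREFIX], set())
    hits = [i for i in sorted(candidates)
            if any(w.startswith(typed) for w in names[i].lower().split())]
    return [names[i] for i in hits[:limit]]

# Hypothetical member names.
names = {1: "Sriram Sankar", 2: "Kumaresh Pattabiraman", 3: "Srinivas Example"}
index = build_prefix_index(names)
print(typeahead(index, names, "sri"))
```

The cost of this approach is a much larger index (one posting list per prefix), which is one reason typeahead is often handled by a specialized system.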
21
Fast forward to last year –
and growing pains . . .
Scalability
 Rebuilding index from scratch extremely difficult
 Not possible to use complex algorithms during indexing
 Live updates at document granularity
 Inflexible scoring – both at Lucene and Sensei levels
22
Fragmentation
 Too many open source components glued together with
primary developers spread across many companies
 Different instantiations starting to diverge to deal with
their specific growing pains – so diverging stacks and
distracted engineers
23
24
Our new search stack . . .
Two verticals already in production
Life of a Query
25
[Diagram: User Query → Query Rewriter/Planner → Search Shards (several in parallel) → Results Merging → Search Results]
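The fan-out and merge in this diagram is a scatter-gather pattern. A minimal single-process sketch is shown below, assuming a toy scorer (count of matching query terms) and in-memory shards; a real deployment would call the shards over the network, in parallel.

```python
import heapq

def search_shard(shard, query_terms, k):
    """Score every document in one shard and return its top-k (score, doc_id)."""
    scored = []
    for doc_id, terms in shard.items():
        score = sum(1 for t in query_terms if t in terms)   # toy scorer
        if score:
            scored.append((score, doc_id))
    return heapq.nlargest(k, scored)

def search_all(shards, query, k=3):
    """Scatter the query to all shards, then merge the partial results."""
    query_terms = query.lower().split()          # stand-in for the rewriter/planner
    partials = [search_shard(s, query_terms, k) for s in shards]        # scatter
    return heapq.nlargest(k, (hit for part in partials for hit in part))  # gather

# Hypothetical shards: doc_id -> set of terms.
shards = [
    {1: {"search", "engineer"}, 2: {"product", "manager"}},
    {3: {"search", "product", "manager"}, 4: {"sales"}},
]
print(search_all(shards, "search product manager"))
```

Because each shard returns only its own top k, the merge step handles at most k results per shard regardless of shard size.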
Life of a Query – Within A Search Shard
26
[Diagram labels: Rewritten Query, INDEX, Retrieve a Document, Score the Document, Top Results, Top Results From Shard]
Life of a Query – Within A Rewriter
27
[Diagram: a Query passes through three Rewriter Modules, drawn with three DATA MODELs and a shared Rewriter State, and comes out as the Rewritten Query]
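A chain of rewriter modules with shared state can be sketched as a list of functions applied in order. The module names, the tiny synonym and stemming tables (standing in for data models), and the state dictionary are all assumptions for illustration; the deck does not specify the actual interfaces.

```python
# Hypothetical data models: a synonym table and a stemming table.
SYNONYMS = {"swe": ["software", "engineer"]}
STEMS = {"engineers": "engineer", "jobs": "job"}

def stem(terms, state):
    """Replace each term with its stem when the table has one."""
    return [STEMS.get(t, t) for t in terms]

def expand_synonyms(terms, state):
    """Replace abbreviations with their expansion from the synonym table."""
    out = []
    for t in terms:
        out.extend(SYNONYMS.get(t, [t]))
    return out

def detect_intent(terms, state):
    """Record a (very crude) intent guess in the shared rewriter state."""
    state["intent"] = "job_search" if "job" in terms else "people_search"
    return terms

def rewrite(query, modules):
    """Run the query through each rewriter module in order."""
    state = {}
    terms = query.lower().split()
    for module in modules:
        terms = module(terms, state)
    return terms, state

print(rewrite("SWE jobs", [stem, expand_synonyms, detect_intent]))
```

Keeping each module small and driven by its own data model means a model can be rebuilt offline without touching the module code.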
Life of Data - Offline
28
[Diagram labels: Raw Data, Derived Data, INDEX, DATA MODEL (x5)]
Benefits of New Stack
 A complete search engine
 Frequent reindexing possible (a full reset)
 Resharding becomes easy
 Clear separation of infrastructure and relevance functions
 A single stack with a single identity!
29
Early Termination
 We order documents in the index based on a static rank –
from most important to least important
 An offline relevance algorithm assigns a static rank to
each document on which the sorting is performed
 This allows retrieval to be early-terminated (assuming a
strong correlation between static rank and importance of
result for a specific query)
 Happens to work well with personalized search also
30
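Early termination over a static-rank-ordered index can be sketched as follows. The example assumes document ids are assigned in static-rank order (0 is the most important), so every posting list is already sorted from most to least important; the function name and sample posting lists are hypothetical.

```python
def retrieve(posting_lists, required_terms, k):
    """Return the first k documents matching all terms, then stop scanning."""
    lists = [posting_lists.get(t, []) for t in required_terms]
    if not lists or any(not pl for pl in lists):
        return []
    others = [set(pl) for pl in lists[1:]]
    hits = []
    for doc in lists[0]:                 # walked in static-rank order
        if all(doc in o for o in others):
            hits.append(doc)
            if len(hits) == k:
                break                    # early termination: skip the rest
    return hits

# Hypothetical posting lists; ids double as static-rank positions.
posting_lists = {
    "search": [0, 2, 3, 5, 8, 9],
    "engineer": [2, 3, 4, 8, 9],
}
print(retrieve(posting_lists, ["search", "engineer"], k=2))
```

The documents that are never examined are, by construction, the ones with the lowest static rank, which is why the approach depends on static rank correlating with query-specific importance.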
New Strategy for Live Updates
 Lucene segments are “document-partitioned”
 We have enhanced Lucene with “term-partitioned”
segments
 We use 3 term-partitioned segments:
– Base index (never changed)
– Live update buffer
– Snapshot index
 Fault tolerant, and performant
 No more content store!
31
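The slide does not say how the three segments are combined at query time. The sketch below shows one possible reading, assumed purely for illustration: a posting list is the union of the term's entries in the base index, the snapshot, and the live update buffer, and the buffer is periodically folded into the snapshot. The class and method names are hypothetical, and deletes are not modeled.

```python
class LiveIndex:
    """Toy three-segment index: immutable base, snapshot, live update buffer."""

    def __init__(self, base):
        self.base = base        # term -> list of doc ids; never modified
        self.snapshot = {}      # term -> set of doc ids, folded in from the buffer
        self.buffer = {}        # term -> set of doc ids from recent live updates

    def add(self, doc_id, terms):
        """Apply a live update: only the affected terms are touched."""
        for t in terms:
            self.buffer.setdefault(t, set()).add(doc_id)

    def take_snapshot(self):
        """Fold the buffer into the snapshot and start a fresh buffer."""
        for t, ids in self.buffer.items():
            self.snapshot.setdefault(t, set()).update(ids)
        self.buffer = {}

    def postings(self, term):
        """Combine the three segments into one posting list for a term."""
        ids = set(self.base.get(term, ()))
        ids |= self.snapshot.get(term, set())
        ids |= self.buffer.get(term, set())
        return sorted(ids)

idx = LiveIndex({"search": [1, 2]})
idx.add(3, ["search", "lucene"])
print(idx.postings("search"))   # the update is visible without rebuilding the base
```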
32
[Diagram labels: Base Index, Snapshot Index, Live Update Buffer]
Data Distribution
 BitTorrent-based data distribution framework
 More details at a later time
33
Relevance
 Offline analysis – resulting in a better index and data
models
 Query rewriting – for better and more accurate recall
 Scoring – to fine tune each of the retrieved results
 Reranking – selection of top results for overall result set
quality
 Blending – to combine results from multiple verticals
34
Machine Learned Scorers
 Goal: To automatically build a function whose arguments
are interesting features of the query and the document
 Input to the machine learning system is a set of training
data that describes how the function should behave on
various combinations of feature values
 The function takes the form of standard templates – a
linear formula is commonly used (due to simplicity)
35
Linear Regression on a Single Feature
36
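Fitting a linear scorer on a single feature reduces to ordinary least squares for a line. The sketch below uses the closed-form solution; the training points are made up, and the function name fit_line is an assumption.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical training data: feature value -> desired score.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

With several features the same idea applies, but the fit is usually done with a linear algebra or machine learning library rather than by hand.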
37
LinkedIn Scorer:
Different Linear Models for Different Intents
 Relevance models incorporate user features:
score = P(Document | Query, User)
 Tree with linear regression leaves
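[Figure: decision tree with split nodes "X2 = ?" and "X10 < 0.1234 ?" and a linear model at each leaf]
$a_0 + a_1 P(x_1) + \dots + a_n Q(x_n)$
$b_0 + b_1 T(x_1) + \dots + b_n x_n$
$g_0 + g_1 R(x_1) + \dots + g_n Q(x_n)$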
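A tree with linear regression leaves can be sketched as a few branches that pick which linear model to apply. Only the X10 < 0.1234 threshold comes from the slide; the tree shape, the categorical value tested on x2, the weights, and the omission of the per-feature transforms (P, Q, R, T) are assumptions for illustration.

```python
def linear(weights, bias):
    """Build a linear model: bias + sum of weight * feature value."""
    def model(x):
        return bias + sum(w * x[name] for name, w in weights.items())
    return model

# Hypothetical leaf models, one per intent segment.
LEAF_A = linear({"x1": 0.5, "x3": 1.0}, bias=0.1)
LEAF_B = linear({"x1": 2.0}, bias=0.2)
LEAF_G = linear({"x1": 1.0, "x3": 0.5}, bias=0.0)

def score(x):
    """Route the feature vector to a leaf, then apply that leaf's linear model."""
    if x["x2"] == "job_seeker":      # categorical split ("X2 = ?")
        return LEAF_A(x)
    if x["x10"] < 0.1234:            # numeric split from the slide
        return LEAF_B(x)
    return LEAF_G(x)

print(score({"x1": 2.0, "x2": "recruiter", "x3": 1.0, "x10": 0.05}))
```

The splits let different user intents get different weightings while each leaf stays as simple to train and inspect as a single linear model.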
Going Forward
 Further standardize infrastructure for relevance
components
 Scatter-gather
 Java GC issues
 Extend infrastructure to browser/device
 Reintegrate diverging stacks
38
Product Overview
39
LinkedIn’s Vision
40
“Create economic opportunity for every member of the
global workforce”
The Economic Graph
41
Search is core to the economic graph vision
42
Who uses search?
 Casual User: LI as professional identity
 Job Seeker: LI as a way to get the day job
 Outbound professional (Recruiter / Sales): LI as day job
43
Casual User
Name Search
Topic Search
44
Instant: Name Search
Search all members by name or approximate name
45
Unified Search: Topic Search
One federated search result page with all relevant entities
about the topic
46
Outbound professional
Exploratory people search
47
Instant: Search Suggestions
Entity-aware suggestions for companies, skills & titles
48
Instant: Just one keystroke
From name search to exploratory search
49
People Search
Explore using facets and advanced search fields
50
People Search
Leverage the network through shared connections
51
Recruiter & Sales Navigator
Products powered by search
52
Job Seeker
Job Search
53
Instant: Search Suggestions
Entity-aware suggestions for companies, skills & titles
54
Job Search
Explore using facets and advanced search fields
55
Job Search
Leverage the network through relationship to job poster or
connections in the company
56
Other Search Users include…
Students – University Search
Information Seekers / Researchers – Content Search
Advertisers / Content Marketers – Company & Group Search
57
Bringing it all together
58
Users: 300 Million+ members
Product: Search the economic graph of
 300M profiles
 3B Endorsements
 300K jobs
 3M Companies
 2M Groups
 25K Schools
 100M+ pieces of professional content
Platform: One index, one unified search stack
59

Editor's Notes

  • #4 Video: not a dig at anyone, but it shows that we need to do some unique things. We are on a journey: we have made a lot of progress, but we still have a long way to go. Kumaresh will focus on our product experiences at the end.
  • #6 Like most other companies needing to integrate search into their products
  • #21 Conventional wisdom – CS276 notes, Facebook, etc. – LinkedIn not alone on this.
  • #23 Other growing companies should keep all of this in mind. Rebuilding: no index enhancements, and resharding is limited to adding shards at the end. Live updates require a content store.
  • #24 Other growing companies should keep all of this in mind. Rebuilding: no index enhancements, and resharding is limited to adding shards at the end. Live updates require a content store.
  • #25 Unifying infrastructure always pays dividends, even if it is not the perfect fit for each use case. Typeahead (instant) is in production, so no more Cleo.
  • #26 Leaving out frontend, device side stuff
  • #27 Scoring taken out of Lucene
  • #28 Rewriting examples: intent recognition, stemming, synonyms, personalization. Rationale for data models: examples are intent models, synonym tables, etc.
  • #39 May no longer have Lucene