Search at LinkedIn by Sriram Sankar and Kumaresh Pattabiraman

Search at LinkedIn
Sriram Sankar, Principal Staff Engineer
Kumaresh Pattabiraman, Senior Product Manager

https://www.youtube.com/watch?v=obCHKPYHuhA

Search at LinkedIn
• Personalized professional search
• Part of a bigger product experience
• But a really big part of it

Approach to Search
• Off-the-shelf components (Lucene)
• Extended to address Lucene limitations (Sensei, Bobo, Zoie, Content Store)
• Specialized verticals (Cleo, Krati)
• Stack adopted for other purposes (recommendations, newsfeed, ads, analytics, etc.)

Lucene
An open source API that supports search functionality:
• Add new documents to the index
• Delete documents from the index
• Construct queries
• Search the index using the query
• Score the retrieved documents

The Search Index
• Inverted Index: Mapping from (search) terms to list of documents (they are present in)
• Forward Index: Mapping from documents to metadata about them
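A toy version of the two structures can make the distinction concrete. The sketch below is illustrative only: the sample documents and the names `build_indexes`, `inverted`, and `forward` are assumptions made for this example, and Lucene's real index format is far more compact.

```python
from collections import defaultdict


def build_indexes(docs):
    """Build toy inverted and forward indexes from {doc_id: text}."""
    inverted = defaultdict(list)  # term -> sorted list of doc ids (posting list)
    forward = {}                  # doc id -> metadata about the document
    for doc_id in sorted(docs):
        terms = docs[doc_id].lower().split()
        forward[doc_id] = {"length": len(terms), "text": docs[doc_id]}
        # Each document appears at most once in a term's posting list.
        for term in sorted(set(terms)):
            inverted[term].append(doc_id)
    return dict(inverted), forward


# Hypothetical sample documents.
docs = {1: "Kumaresh works at LinkedIn", 2: "Sriram works at LinkedIn"}
inverted, forward = build_indexes(docs)
```

Looking up `inverted["linkedin"]` answers "which documents contain this term", while `forward[1]` answers "what do we know about this document".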

[Figure: two numbered sample documents, one mentioning Kumaresh and LinkedIn and the other mentioning Sriram and LinkedIn, shown with an inverted index over the terms Kumaresh, Sriram, and LinkedIn and a forward index over documents 1 and 2]

The Search Index
• The lists are called posting lists
• Up to hundreds of millions of posting lists
• Up to hundreds of millions of documents
• Posting lists may contain as few as a single hit and as many as tens of millions of hits
• Terms can be
– words in the document
– inferred attributes about the document

Lucene Queries
• “Sriram Sankar”
• Sriram Kumaresh
• +Sriram +LinkedIn
• +Kumaresh connection:418001
• +Kumaresh industry:software connection:418001^4
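In Lucene's classic query syntax, a leading "+" marks a term as required, and terms without it are optional. As a rough sketch of how such queries map onto posting lists, required terms correspond to an intersection and optional terms to a union. The posting lists and the function names `intersect` and `union` below are made up for illustration; this is not Lucene's implementation.

```python
def intersect(a, b):
    """Intersect two sorted posting lists with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """Union of two posting lists, returned in sorted order."""
    return sorted(set(a) | set(b))


# Hypothetical posting lists (term -> sorted doc ids).
postings = {"sriram": [2, 5, 9], "kumaresh": [1, 5], "linkedin": [1, 2, 5, 7]}

required = intersect(postings["sriram"], postings["linkedin"])  # +Sriram +LinkedIn
optional = union(postings["sriram"], postings["kumaresh"])      # Sriram Kumaresh
```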

Lucene Scoring
• As documents are added to the index, Lucene maintains some metadata on the terms (e.g., term position, tf/idf)
• Lucene accepts scoring information via query modifications, boosts, etc.
• Lucene assigns a score to each retrieved document using this information
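To show how term statistics and query boosts can combine into a score, here is a minimal sketch using the textbook tf-idf formula. It is not Lucene's actual similarity formula, and the names `tf_idf` and `score` and the sample data are assumptions for this example.

```python
import math


def tf_idf(term, doc_terms, all_docs):
    """Textbook tf-idf: term frequency times log(N / document frequency)."""
    tf = doc_terms.count(term)
    df = sum(1 for d in all_docs if term in d)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(all_docs) / df)


def score(query_boosts, doc_terms, all_docs):
    """Sum boosted tf-idf over query terms; query_boosts maps term -> boost."""
    return sum(boost * tf_idf(term, doc_terms, all_docs)
               for term, boost in query_boosts.items())


# Hypothetical tokenized documents.
docs = [
    ["kumaresh", "linkedin"],
    ["sriram", "linkedin"],
    ["sriram", "search", "search"],
]
boosted = score({"sriram": 1.0, "search": 4.0}, docs[2], docs)
```

The boost of 4.0 on "search" plays the same role as the `^4` suffix in the query examples: it scales that term's contribution to the final score.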

Sensei
Layer over Lucene that provides:
• Sharding
• Cluster management
• Enhanced query language

Sensei BQL
SELECT *
FROM cars
WHERE price > 2000.00
USING RELEVANCE MODEL my_model
  (favoriteColor:"black", favoriteTag:"cool")
  DEFINED AS (String favoriteColor, String favoriteTag)
  BEGIN
    float boost = 1.0;
    if (tags.contains(favoriteTag))
      boost += 0.5;
    if (color.equals(favoriteColor))
      boost += 1.2;
    return _INNER_SCORE * boost;
  END

Live Updates – Zoie and Content Store
• The index reader has to be reopened before earlier live updates are visible
• The only way to perform a live update is to replace the entire document – which requires access to the unchanged attributes also
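The second limitation is why a content store is needed: a partial change has to be merged with the stored copy of the document before the whole document is replaced in the index. The sketch below illustrates that flow under simplifying assumptions; `ContentStore` and `ToyIndex` are hypothetical stand-ins, not the actual LinkedIn components.

```python
class ContentStore:
    """Keeps the full current version of each document (hypothetical helper)."""

    def __init__(self):
        self._docs = {}

    def apply_update(self, doc_id, changed_fields):
        """Merge changed fields into the stored copy and return the full document."""
        full = {**self._docs.get(doc_id, {}), **changed_fields}
        self._docs[doc_id] = full
        return full


class ToyIndex:
    """Stand-in for an index that only supports whole-document replacement."""

    def __init__(self):
        self.docs = {}

    def replace(self, doc_id, full_doc):
        self.docs.pop(doc_id, None)         # delete the old version, if any
        self.docs[doc_id] = dict(full_doc)  # insert the new version


store, index = ContentStore(), ToyIndex()
index.replace(7, store.apply_update(7, {"name": "Sriram", "title": "Engineer"}))
# Only the title changes, but the index still receives the complete document.
index.replace(7, store.apply_update(7, {"title": "Principal Staff Engineer"}))
```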

Search Content Store
[Diagram: Activity Feeds, Search Content Store, and Lucene Index, with flows labeled Inserts and Deletes]

Typeahead (Instant Search)
• Results as you type
• Conventional wisdom: Inverted indices cannot support typeahead
• Cleo, Krati
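For readers unfamiliar with typeahead, the core operation is a prefix lookup that runs on every keystroke. The sketch below shows a generic prefix lookup over a sorted list using binary search. It is not how Cleo or Krati work internally; the function name `prefix_matches` and the sample names are assumptions for this example.

```python
import bisect


def prefix_matches(sorted_names, prefix, limit=5):
    """Return up to `limit` names that start with `prefix`, in sorted order."""
    # Binary search finds the first entry that is >= the prefix.
    start = bisect.bisect_left(sorted_names, prefix)
    matches = []
    for name in sorted_names[start:]:
        if not name.startswith(prefix) or len(matches) == limit:
            break
        matches.append(name)
    return matches


# Hypothetical name list; it must be sorted for the binary search to be valid.
names = sorted(["kumar", "kumaresh", "kumiko", "sriram", "srinivas"])
suggestions = prefix_matches(names, "kum")
```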

Fast forward to last year – and growing pains . . .

Scalability
• Rebuilding index from scratch extremely difficult
• Not possible to use complex algorithms during indexing
• Live updates at document granularity
• Inflexible scoring – both at Lucene and Sensei levels

Fragmentation
• Too many open source components glued together with primary developers spread across many companies
• Different instantiations starting to diverge to deal with their specific growing pains – so diverging stacks and distracted engineers

Our new search stack . . .
Two verticals already in production

Life of a Query
[Diagram components: User Query, Query Rewriter/Planner, multiple Search Shards, Results Merging, Search Results]
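The shard and merging boxes in this diagram follow a scatter-gather pattern: the query is sent to every shard, each shard returns its own top results, and a merging step combines them. The sketch below is a minimal single-process illustration. The scoring (a raw term count), the shard contents, and the names `search_shard` and `scatter_gather` are assumptions, not the production design.

```python
import heapq
from itertools import islice


def search_shard(shard, query_terms, k):
    """Return the shard's top-k (score, doc_id) pairs, best first."""
    hits = [(sum(doc["terms"].count(t) for t in query_terms), doc_id)
            for doc_id, doc in shard.items()]
    hits = [h for h in hits if h[0] > 0]  # drop documents that match nothing
    return heapq.nlargest(k, hits)


def scatter_gather(shards, query_terms, k):
    """Query every shard, then merge the sorted per-shard lists into a global top-k."""
    per_shard = [search_shard(s, query_terms, k) for s in shards]
    # Each per-shard list is sorted best first, so a k-way merge is enough.
    merged = heapq.merge(*per_shard, reverse=True)
    return list(islice(merged, k))


# Hypothetical shards: doc id -> tokenized document.
shards = [
    {1: {"terms": ["sriram", "linkedin"]}, 2: {"terms": ["kumaresh"]}},
    {3: {"terms": ["linkedin", "linkedin", "search"]}, 4: {"terms": ["linkedin"]}},
]
top = scatter_gather(shards, ["linkedin"], k=2)
```

Because each shard only needs to return k results, the merging step handles at most k results per shard regardless of index size.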

Life of a Query – Within A Search Shard
[Diagram components: Rewritten Query, INDEX, Retrieve a Document, Score the Document, Top Results, Top Results From Shard]

Life of a Query – Within A Rewriter
[Diagram components: Query, Rewriter State, three Rewriter Modules, three DATA MODELs, Rewritten Query]

Life of Data - Offline
[Diagram components: Raw Data, Derived Data, five DATA MODELs, INDEX]

Benefits of New Stack
• A complete search engine
• Frequent reindexing possible (a full reset)
• Resharding becomes easy
• Clear separation of infrastructure and relevance functions
• A single stack with a single identity!

Early Termination
• We order documents in the index based on a static rank – from most important to least important
• An offline relevance algorithm assigns a static rank to each document on which the sorting is performed
• This allows retrieval to be early-terminated (assuming a strong correlation between static rank and importance of result for a specific query)
• Happens to work well with personalized search also
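The idea can be sketched as follows: if documents are numbered in static-rank order, every posting list is already sorted from most to least important, so retrieval can stop as soon as it has collected enough matches. The static rank values, document names, and function names below are hypothetical and serve only to illustrate the technique.

```python
def assign_positions(static_rank):
    """Order documents from highest to lowest static rank and number them."""
    order = sorted(static_rank, key=static_rank.get, reverse=True)
    return order, {doc: pos for pos, doc in enumerate(order)}


def retrieve_early(a, b, max_hits):
    """Intersect two rank-ordered posting lists, stopping after max_hits matches."""
    i = j = 0
    hits = []
    while i < len(a) and j < len(b) and len(hits) < max_hits:
        if a[i] == b[j]:
            hits.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return hits


# Hypothetical static ranks produced by an offline algorithm.
static_rank = {"doc_a": 0.2, "doc_b": 0.9, "doc_c": 0.5, "doc_d": 0.7}
order, position = assign_positions(static_rank)

# Posting lists hold index positions, so they are sorted by importance.
sriram = sorted(position[d] for d in ["doc_a", "doc_b", "doc_c"])
linkedin = sorted(position[d] for d in ["doc_a", "doc_b", "doc_c", "doc_d"])
top = [order[p] for p in retrieve_early(sriram, linkedin, max_hits=2)]
```

Here the scan stops before reaching `doc_a`, the lowest-ranked match, which is the saving early termination is after.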

New Strategy for Live Updates
• Lucene segments are “document-partitioned”
• We have enhanced Lucene with “term-partitioned” segments
• We use 3 term-partitioned segments:
– Base index (never changed)
– Live update buffer
– Snapshot index
• Fault tolerant, and performant
• No more content store!
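The slides do not describe how the three segments are combined at query time. Purely as an assumption for illustration, the sketch below treats a term lookup as the union of that term's postings across an immutable base, a snapshot, and an in-memory buffer, with the buffer periodically folded into the snapshot. The class and method names are hypothetical.

```python
class TermPartitionedIndex:
    """Toy three-segment index: base (immutable), snapshot, live update buffer."""

    def __init__(self, base):
        # Tuples make the base postings immutable after construction.
        self.base = {term: tuple(ids) for term, ids in base.items()}
        self.snapshot = {}  # accumulates postings folded in from the buffer
        self.buffer = {}    # receives live updates

    def add(self, term, doc_id):
        """Record a live update in the buffer only; the base is untouched."""
        self.buffer.setdefault(term, []).append(doc_id)

    def take_snapshot(self):
        """Fold buffered postings into the snapshot and start a fresh buffer."""
        for term, ids in self.buffer.items():
            self.snapshot.setdefault(term, []).extend(ids)
        self.buffer = {}

    def postings(self, term):
        """Assumed lookup rule: union of the term's postings in all segments."""
        merged = set(self.base.get(term, ()))
        merged.update(self.snapshot.get(term, ()))
        merged.update(self.buffer.get(term, ()))
        return sorted(merged)


idx = TermPartitionedIndex({"linkedin": [1, 2]})
idx.add("linkedin", 5)
before_snapshot = idx.postings("linkedin")
idx.take_snapshot()
idx.add("linkedin", 9)
after_snapshot = idx.postings("linkedin")
```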

[Diagram: Base Index, Snapshot Index, Live Update Buffer]

Data Distribution
• BitTorrent-based data distribution framework
• More details at a later time

Relevance
• Offline analysis – resulting in a better index and data models
• Query rewriting – for better and more accurate recall
• Scoring – to fine tune each of the retrieved results
• Reranking – selection of top results for overall result set quality
• Blending – to combine results from multiple verticals

Machine Learned Scorers
• Goal: To automatically build a function whose arguments are interesting features of the query and the document
• Input to the machine learning system is a set of training data that describes how the function should behave on various combinations of feature values
• The function takes the form of standard templates – a linear formula is commonly used (due to simplicity)

Linear Regression on a Single Feature
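As a worked example of the simplest case, the sketch below fits a line to a single feature with the closed-form least-squares solution. The training points and the name `fit_line` are made up for illustration; a real scorer would be trained on many features and far more data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept on one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: feature value -> desired score.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
```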

LinkedIn Scorer: Different Linear Models for Different Intents
• Relevance models incorporate user features: score = P(Document | Query, User)
• Tree with linear regression leaves
[Figure: decision tree with split nodes "x_2 = ?" and "x_10 < 0.1234 ?" and linear-model leaves:]
  b_0 + b_1 T(x_1) + ... + b_n x_n
  a_0 + a_1 P(x_1) + ... + a_n Q(x_n)
  g_0 + g_1 R(x_1) + ... + g_n Q(x_n)
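A tree with linear regression leaves can be read as: route the feature vector through a few splits, then apply the linear model found at the leaf. The sketch below mirrors the shape of the figure (one categorical split, one threshold split at 0.1234, three leaves), but the coefficients, feature transforms, and split values other than the threshold are hypothetical.

```python
import math


def make_leaf(bias, weights, transforms=None):
    """Build a linear model: bias + sum of weight * transform(feature)."""
    transforms = transforms or {}

    def leaf(x):
        return bias + sum(
            w * transforms.get(name, lambda v: v)(x[name])
            for name, w in weights.items()
        )

    return leaf


# Hypothetical leaves; untransformed features pass through unchanged.
leaf_a = make_leaf(0.1, {"x1": 2.0}, {"x1": math.log1p})
leaf_b = make_leaf(0.5, {"x1": 1.0, "x3": -0.5})
leaf_g = make_leaf(0.0, {"x1": 0.25})


def tree_score(x):
    """Route by the split features, then apply that leaf's linear model."""
    if x["x2"] == "name":   # categorical split, e.g. on query intent
        return leaf_a(x)
    if x["x10"] < 0.1234:   # threshold split from the figure
        return leaf_b(x)
    return leaf_g(x)


score = tree_score({"x1": 2.0, "x2": "topic", "x3": 1.0, "x10": 0.05})
```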

Going Forward
• Further standardize infrastructure for relevance components
• Scatter-gather
• Java GC issues
• Extend infrastructure to browser/device
• Reintegrate diverging stacks

LinkedIn’s Vision
“Create economic opportunity for every member of the global workforce”

Search is core to the economic graph vision

Who uses search?
Casual User – LI as professional identity
Job Seeker – LI as a way to get the day job
Outbound professional (Recruiter / Sales) – LI as day job

Casual User
Name Search
Topic Search

Instant: Name Search
Search all members by name or approximate name

Unified Search: Topic Search
One federated search result page with all relevant entities about the topic

Outbound professional
Exploratory people search

Instant: Search Suggestions
Entity-aware suggestions for companies, skills & titles

Instant: Just one keystroke
From name search to exploratory search

People Search
Explore using facets and advanced search fields

People Search
Leverage the network through shared connections

Recruiter & Sales Navigator
Products powered by search

Instant: Search Suggestions
Entity-aware suggestions for companies, skills & titles

Job Search
Explore using facets and advanced search fields

Job Search
Leverage the network through relationship to job poster or connections in the company

Other Search Users include…
Students – University Search
Information Seekers / Researchers - Content Search
Advertisers / Content Marketers – Company & Group Search

Bringing it all together
Users: 300 Million+ members
Product: Search the economic graph of
• 300M profiles
• 3B Endorsements
• 300K jobs
• 3M Companies
• 2M Groups
• 25K Schools
• 100M+ pieces of professional content
Platform: One index, one unified search stack
