The document discusses LinkedIn's search capabilities and infrastructure. It describes LinkedIn's transition from open source search components such as Lucene to its own proprietary search stack. The new stack allows for more flexible indexing, live updates, and relevance capabilities powered by machine learning. Search is a core part of LinkedIn's vision of creating economic opportunity by connecting professionals to jobs, talent, and information through its economic graph.
5. Approach to Search
Off-the-shelf components (Lucene)
Extended to address Lucene limitations (Sensei, Bobo, Zoie, Content Store)
Specialized verticals (Cleo, Krati)
Stack adopted for other purposes (recommendations, newsfeed, ads, analytics, etc.)
6. Lucene
An open source API that supports search functionality:
Add new documents to index
Delete documents from the index
Construct queries
Search the index using the query
Score the retrieved documents
7. The Search Index
Inverted Index: mapping from (search) terms to the list of documents they appear in
Forward Index: mapping from documents to metadata about them
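A minimal sketch of the two structures side by side (toy code with invented names, not LinkedIn's or Lucene's implementation):

```python
from collections import defaultdict

class SearchIndex:
    """Toy inverted + forward index, for illustration only."""

    def __init__(self):
        self.inverted = defaultdict(list)  # term -> posting list of doc ids
        self.forward = {}                  # doc id -> metadata

    def add(self, doc_id, text, metadata):
        self.forward[doc_id] = metadata
        for term in set(text.lower().split()):
            self.inverted[term].append(doc_id)

    def lookup(self, term):
        # Return the posting list for a term (empty if the term is unknown).
        return self.inverted.get(term, [])

index = SearchIndex()
index.add(1, "software engineer at LinkedIn", {"industry": "tech"})
index.add(2, "data engineer", {"industry": "tech"})
print(index.lookup("engineer"))  # -> [1, 2]
print(index.forward[1])          # -> {'industry': 'tech'}
```

The inverted index answers "which documents contain this term?"; the forward index answers "what do we know about this document?".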
9. The Search Index
The lists are called posting lists
Up to hundreds of millions of posting lists
Up to hundreds of millions of documents
Posting lists may contain as few as a single hit and as many as tens of millions of hits
Terms can be
– words in the document
– inferred attributes about the document
11. Lucene Scoring
As documents are added to the index, Lucene maintains some metadata on the terms (e.g., term position, tf/idf)
Lucene accepts scoring information via query modifications, boosts, etc.
Lucene assigns a score to each retrieved document using this information
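The kind of per-term statistics involved can be sketched with a toy tf-idf scorer (the formula here is the textbook version, chosen for illustration; Lucene's actual similarity is more involved):

```python
import math

def tf_idf_score(query_terms, doc_terms, num_docs, doc_freq):
    """Score one document: sum of tf * idf over the query terms."""
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)          # term frequency in this document
        if tf == 0:
            continue
        # idf rewards rare terms; doc_freq maps term -> number of docs containing it
        idf = math.log(num_docs / (1 + doc_freq.get(term, 0)))
        score += tf * idf
    return score

doc = "search engine search index".split()
print(tf_idf_score(["search"], doc, num_docs=100, doc_freq={"search": 9}))
```

Query-time boosts would simply multiply individual term contributions before summing.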
12. Sensei
Layer over Lucene that provides:
Sharding
Cluster management
Enhanced query language
14. Sensei BQL
SELECT *
FROM cars
WHERE price > 2000.00
USING RELEVANCE MODEL my_model
  (favoriteColor:"black", favoriteTag:"cool")
DEFINED AS (String favoriteColor,
            String favoriteTag)
BEGIN
  float boost = 1.0;
  if (tags.contains(favoriteTag))
    boost += 0.5;
  if (color.equals(favoriteColor))
    boost += 1.2;
  return _INNER_SCORE * boost;
END
15. Live Updates – Zoie and Content Store
The index reader has to be reopened before earlier live updates become visible
The only way to perform a live update is to replace the entire document – which requires access to the unchanged attributes as well
22. Scalability
Rebuilding the index from scratch is extremely difficult
Not possible to use complex algorithms during indexing
Live updates only at document granularity
Inflexible scoring – at both the Lucene and Sensei levels
23. Fragmentation
Too many open source components glued together, with primary developers spread across many companies
Different instantiations starting to diverge to deal with their specific growing pains – so diverging stacks and distracted engineers
25. Life of a Query
[Diagram: User Query → Query Rewriter/Planner → Search Shards → Results Merging → Search Results]
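The scatter-gather flow above can be sketched as follows (all function names and shard data are hypothetical):

```python
import heapq

def rewrite(query):
    # Placeholder for the rewriter/planner stage.
    return query.lower()

def search_shard(shard, query, k):
    # Each shard returns its local top-k as (score, doc_id) pairs.
    hits = [(score, doc_id) for doc_id, text, score in shard if query in text]
    return heapq.nlargest(k, hits)

def search(shards, query, k=2):
    rewritten = rewrite(query)
    shard_results = [search_shard(s, rewritten, k) for s in shards]   # scatter
    merged = heapq.merge(*[sorted(r, reverse=True) for r in shard_results],
                         reverse=True)                                # gather
    return list(merged)[:k]

shard_a = [(1, "java engineer", 0.9), (2, "sales lead", 0.4)]
shard_b = [(3, "java developer", 0.7)]
print(search([shard_a, shard_b], "Java"))  # -> [(0.9, 1), (0.7, 3)]
```

In production the scatter step is a network fan-out and each shard runs the retrieve-and-score loop shown on the next slide.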
26. Life of a Query – Within A Search Shard
[Diagram: the rewritten query retrieves documents from the INDEX, each retrieved document is scored, and the top results from the shard are returned]
27. Life of a Query – Within A Rewriter
[Diagram: the query passes through a chain of Rewriter Modules, each maintaining rewriter state and consulting DATA MODELs, to produce the rewritten query]
28. Life of Data - Offline
[Diagram: raw data and derived data flow through several DATA MODELs offline to build the INDEX]
29. Benefits of New Stack
A complete search engine
Frequent reindexing possible (a full reset)
Resharding becomes easy
Clear separation of infrastructure and relevance functions
A single stack with a single identity!
30. Early Termination
We order documents in the index based on a static rank – from most important to least important
An offline relevance algorithm assigns a static rank to each document, on which the sorting is performed
This allows retrieval to be early-terminated (assuming a strong correlation between static rank and the importance of a result for a specific query)
Happens to work well with personalized search as well
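The idea can be sketched in a few lines: scan documents in static-rank order and stop as soon as enough matches are collected (toy code, invented data):

```python
def retrieve(docs_by_static_rank, matches, k):
    """Scan docs in static-rank order; stop once k matches are found."""
    results, scanned = [], 0
    for doc in docs_by_static_rank:
        scanned += 1
        if matches(doc):
            results.append(doc)
            if len(results) == k:
                break  # early termination: rest of the index is never touched
    return results, scanned

# Documents already sorted from most to least important.
docs = ["ceo", "vp eng", "engineer", "engineer", "intern", "engineer"]
top, scanned = retrieve(docs, lambda d: d == "engineer", k=2)
print(top, scanned)  # -> ['engineer', 'engineer'] 4
```

Only 4 of the 6 documents are scanned; the savings grow with index size when static rank correlates with query-specific importance.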
31. New Strategy for Live Updates
Lucene segments are “document-partitioned”
We have enhanced Lucene with “term-partitioned” segments
We use 3 term-partitioned segments:
– Base index (never changed)
– Live update buffer
– Snapshot index
Fault tolerant, and performant
No more content store!
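One way to picture the three-segment lookup (a sketch under assumed semantics: the live buffer overrides the snapshot, which overrides the immutable base; the slides do not spell out the actual resolution rules):

```python
class TermPartitionedIndex:
    """Toy model: per-term postings resolved across three segments."""

    def __init__(self):
        self.base = {}      # immutable base index: term -> postings
        self.snapshot = {}  # periodic snapshot of applied updates
        self.buffer = {}    # in-memory live-update buffer

    def update(self, term, postings):
        # Live update touches only this term; no full-document rewrite,
        # hence no content store is needed.
        self.buffer[term] = postings

    def checkpoint(self):
        # Fold the buffer into the snapshot (e.g., for fault tolerance).
        self.snapshot.update(self.buffer)
        self.buffer = {}

    def postings(self, term):
        for segment in (self.buffer, self.snapshot, self.base):
            if term in segment:
                return segment[term]
        return []

idx = TermPartitionedIndex()
idx.base = {"java": [1, 2, 3]}
idx.update("java", [1, 2, 3, 9])
print(idx.postings("java"))  # -> [1, 2, 3, 9]
idx.checkpoint()
print(idx.postings("java"))  # -> [1, 2, 3, 9]
```

The contrast with document-partitioned segments: here an update rewrites one posting list rather than one whole document.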
33. Data Distribution
BitTorrent-based data distribution framework
More details at a later time
34. Relevance
Offline analysis – resulting in a better index and data models
Query rewriting – for better and more accurate recall
Scoring – to fine-tune each of the retrieved results
Reranking – selection of top results for overall result-set quality
Blending – to combine results from multiple verticals
35. Machine Learned Scorers
Goal: to automatically build a function whose arguments are interesting features of the query and the document
Input to the machine learning system is a set of training data that describes how the function should behave on various combinations of feature values
The function takes the form of standard templates – a linear formula is commonly used (due to its simplicity)
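A linear scoring template of the kind described might look like this (feature names and weights are invented for illustration; in practice the weights come out of the training process):

```python
def linear_score(weights, features):
    """score = bias + sum(w_i * x_i) over named features."""
    return weights.get("bias", 0.0) + sum(
        w * features.get(name, 0.0)
        for name, w in weights.items() if name != "bias"
    )

# Hypothetical learned weights over query/document features.
weights = {"bias": 0.1, "title_match": 2.0, "connection_degree": -0.3}
features = {"title_match": 1.0, "connection_degree": 2.0}
print(linear_score(weights, features))
```

The training data fixes the weights; at query time only this cheap dot product runs per retrieved document.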
37. LinkedIn Scorer: Different Linear Models for Different Intents
Relevance models incorporate user features:
score = P(Document | Query, User)
Tree with linear regression leaves
[Diagram: decision tree with splits such as x2 = ? and x10 < 0.1234 ?; each leaf holds a linear model, e.g., b0 + b1·T(x1) + ... + bn·xn, a0 + a1·P(x1) + ... + an·Q(xn), g0 + g1·R(x1) + ... + gn·Q(xn)]
38. Going Forward
Further standardize infrastructure for relevance
components
Scatter-gather
Java GC issues
Extend infrastructure to browser/device
Reintegrate diverging stacks
43. Who uses search?
Casual User – LI as professional identity
Job Seeker – LI as a way to get the day job
Outbound professional (Recruiter / Sales) – LI as day job
56. Job Search
Leverage the network through a relationship to the job poster or connections at the company
57. Other Search Users include…
Students – University Search
Information Seekers / Researchers – Content Search
Advertisers / Content Marketers – Company & Group Search
58. Bringing it all together
300 Million+ members
Search the economic graph of:
300M profiles
3B Endorsements
300K jobs
3M Companies
2M Groups
25K Schools
100M+ pieces of professional content
One index
One unified search stack
[Diagram: Users → Product → Platform]
Video – not a dig at anyone, but trying to show we need to do some unique stuff
On a journey – have made a lot of progress, but we still have a long way to go. Kumaresh will focus on our product experiences at the end.
Like most other companies needing to integrate search into their products
Conventional wisdom – CS276 notes, Facebook, etc. – LinkedIn not alone on this.
Other growing companies should keep all of this in mind
Rebuilding – no index enhancements, resharding limited to adding shards at end
Live updates require content store
Unifying infrastructure always pays dividends even if not the perfect fit for each use case
Typeahead (instant) in production – so no more Cleo
Leaving out frontend, device side stuff
Scoring taken out of Lucene
Rewriting examples - intent recognition, stemming, synonyms, personalization
Rationale for data models - examples are intent models, synonym tables, etc.