Socializing Search. Professionally.
Sriram Sankar and Daniel Tunkelang
Presented at the O'Reilly Strata 2014 Conference
LinkedIn has a unique data collection: the 277M+ members who use LinkedIn are also the most valuable entities in our corpus, which consists of people, companies, jobs, and a rich content ecosystem. Our members use LinkedIn to satisfy a diverse set of navigational and exploratory information needs, which we address by leveraging semi-structured and social content to understand their query intent and deliver a personalized search experience.
As a result, we’ve built a system quite different from those used for web or enterprise search. In this talk, we will discuss how we have addressed the unique scalability, performance, and search quality challenges in order to deliver billions of deeply personalized searches to our members. Although many of the challenges we face are unique to LinkedIn, we hope that the ideas we share will prove useful to other folks thinking about entity-oriented search or working with large-scale social network data.
10. Query understanding can act as a relevance filter.
for i in [1..n]
    B[i] ← {}
    s ← w1 w2 … wi
    if Pc(s) > 0
        a ← new Segment()
        a.segs ← {s}
        a.prob ← Pc(s)
        B[i] ← {a}
    for j in [1..i-1]
        for b in B[j]
            s ← wj+1 wj+2 … wi
            if Pc(s) > 0
                a ← new Segment()
                a.segs ← b.segs ∪ {s}
                a.prob ← b.prob * Pc(s)
                B[i] ← B[i] ∪ {a}
    sort B[i] by prob
    truncate B[i] to size k
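The segmentation procedure on this slide can be sketched in Python as follows. This is a minimal illustration, not the production implementation: the names `Segmentation` and `segment` and the toy probability table are assumptions, and Pc is modeled as a plain dictionary from candidate segments to probabilities. It seeds the table with an empty segmentation at position 0 instead of handling the whole-prefix segment as a special case, which produces the same candidates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segmentation:
    segs: tuple   # segments chosen so far, each a tuple of words
    prob: float   # product of the probabilities of those segments


def segment(words, pc, k=10):
    """Return up to k segmentations of `words`, most probable first.

    `pc` maps a candidate segment (a tuple of words) to its probability.
    A segment missing from `pc` is treated as having probability 0.
    """
    n = len(words)
    # beams[i] holds the best segmentations of the first i words.
    beams = [[] for _ in range(n + 1)]
    beams[0] = [Segmentation(segs=(), prob=1.0)]
    for i in range(1, n + 1):
        candidates = []
        for j in range(i):
            s = tuple(words[j:i])          # candidate last segment
            p = pc.get(s, 0.0)
            if p <= 0:
                continue
            for b in beams[j]:             # extend each segmentation of words[:j]
                candidates.append(Segmentation(b.segs + (s,), b.prob * p))
        candidates.sort(key=lambda a: a.prob, reverse=True)
        beams[i] = candidates[:k]          # keep only the k best
    return beams[n]


# Hypothetical probabilities, for illustration only.
toy_pc = {
    ("software",): 0.2,
    ("engineer",): 0.2,
    ("software", "engineer"): 0.5,
    ("linkedin",): 0.6,
}
best = segment(["software", "engineer", "linkedin"], toy_pc, k=10)
```

With these toy values, the top segmentation groups "software engineer" as one segment and "linkedin" as another. A query with no segmentation of nonzero probability yields an empty list, which is how segmentation can act as a relevance filter.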
12. Coming soon: entity-driven search assist.
[Screenshot: search box containing the text "link", with these suggested searches:]
Jobs at LinkedIn
People currently working at LinkedIn
People who used to work at LinkedIn
13. Infrastructure
Lucene
Map of terms to documents – the index
Provides an API to add documents to and remove documents from the index
Provides an API to query the index
14. Inverted Index and Forward Index
[Figure: two sample documents, numbered 1 and 2, containing the terms Daniel, Sriram, and LinkedIn among filler text, alongside the inverted index and forward index built from them]
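The two index structures, together with the add, remove, and query operations listed on the Infrastructure slide, can be illustrated with a toy in-memory index. This is only a sketch under simple assumptions (whitespace tokenization, lowercase terms, conjunctive queries). The class `TinyIndex` and its methods are hypothetical and are not Lucene's API.

```python
from collections import defaultdict


class TinyIndex:
    """Toy index: an inverted index plus a forward index, kept in memory."""

    def __init__(self):
        self.inverted = defaultdict(set)   # term -> ids of documents containing it
        self.forward = {}                  # document id -> list of its terms

    def add(self, doc_id, text):
        self.remove(doc_id)                # re-adding a document replaces it
        terms = text.lower().split()
        self.forward[doc_id] = terms
        for term in terms:
            self.inverted[term].add(doc_id)

    def remove(self, doc_id):
        # The forward index tells us which posting sets to clean up.
        for term in self.forward.pop(doc_id, []):
            postings = self.inverted.get(term)
            if postings is not None:
                postings.discard(doc_id)
                if not postings:
                    del self.inverted[term]

    def query(self, *terms):
        """Return the sorted ids of documents that contain every term."""
        if not terms:
            return []
        sets = [self.inverted.get(t.lower(), set()) for t in terms]
        return sorted(set.intersection(*sets))


index = TinyIndex()
index.add(1, "Daniel works at LinkedIn")
index.add(2, "Sriram works at LinkedIn")
both = index.query("linkedin")             # documents 1 and 2
only_daniel = index.query("daniel", "linkedin")
```

The inverted index answers "which documents contain this term?", while the forward index answers "what does this document contain?", which is what later stages such as scoring need.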
16. Extremely easy to build a search engine
But difficult to get sophisticated
17. The LinkedIn Search Stack
[Diagram: a Request passes through the Query Rewriter, Index Retrieval, Scorer, and Sorter/Blender to produce a Response. Offline Data Building supplies data and Live Updates supply updates to the serving components.]
18. Search Index Served by Lucene
Inverted index
Forward index
Static rank based document ordering
19. Offline Data Builds on Hadoop
Multi-stage map-reduce pipeline allows complex data processing
Produces sharded single segment Lucene index with documents sorted by static rank
Produces data models for use in query rewriting
20. Live Data Updates
Feed-based framework to support updates to offline data builds
Lucene enhanced with a partial index update capability
21. Query Rewriting (and Planning)
Accepts raw query and user metadata
Produces Lucene retrieval query and metadata for scoring
May use data models built offline
22. Index Retrieval
Lucene query built by query rewriter is used to retrieve documents from the Lucene index
Documents are retrieved in static rank order (best document first)
Retrieval may be early-terminated – given that retrieval is in static rank order
No scoring is performed during retrieval
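Early termination over a statically ranked index can be sketched as follows. The sketch assumes that document ids were assigned in static rank order at build time (id 0 is the best document), so each posting list is already sorted best first. The function `retrieve` and the sample postings are hypothetical; a real engine would intersect posting lists with skip pointers and not with in-memory sets.

```python
def retrieve(postings, query_terms, limit):
    """Return up to `limit` ids of documents that contain every query term.

    `postings` maps a term to a list of document ids in ascending order.
    Ids are assumed to follow static rank, so ascending id means best first.
    No scoring happens here; the result order is the static rank order.
    """
    if limit <= 0:
        return []
    lists = [postings.get(term, []) for term in query_terms]
    if not lists or any(not lst for lst in lists):
        return []
    lists.sort(key=len)                    # drive the scan from the rarest term
    others = [set(lst) for lst in lists[1:]]
    hits = []
    for doc_id in lists[0]:
        if all(doc_id in other for other in others):
            hits.append(doc_id)
            if len(hits) == limit:
                break                      # early termination: the rest rank lower
    return hits


# Hypothetical posting lists, for illustration only.
sample_postings = {
    "linkedin": [0, 1, 2, 3, 5],
    "engineer": [1, 3, 4, 5],
}
top_two = retrieve(sample_postings, ["linkedin", "engineer"], limit=2)
```

Because every unvisited document has a worse static rank than the ones already collected, stopping after `limit` matches cannot skip a statically better document, although a later scoring stage may still reorder the retrieved set.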
23. Scoring
Scoring is performed after retrieval
Its input is the retrieved document (i.e., includes the forward index), a description of how the retrieval query matched the document, and the scoring metadata produced by the rewriter
Costly features can be computed offline during the index building process in Hadoop – e.g., tf/idf calculations
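A scorer that takes the three inputs described above might look like the sketch below. Everything in it is an assumption made for illustration: the field names, the `field_weights`, `connections`, and `connection_boost` entries in the scoring metadata, and the linear combination itself. The slides do not describe the actual scoring function.

```python
def score(doc, match_info, scoring_meta):
    """Score one retrieved document.

    doc          -- forward-index entry, including features precomputed offline
    match_info   -- list of (term, field) pairs saying how the query matched
    scoring_meta -- metadata produced by the query rewriter for this searcher
    """
    weights = scoring_meta["field_weights"]
    total = 0.0
    for term, field in match_info:
        tfidf = doc["tfidf"].get(term, 0.0)        # costly feature, built offline
        total += weights.get(field, 1.0) * tfidf
    if doc["id"] in scoring_meta["connections"]:   # personalization signal
        total += scoring_meta["connection_boost"]
    return total


def rank(retrieved, scoring_meta):
    """Order retrieved (doc, match_info) pairs by descending score."""
    scored = [(score(doc, info, scoring_meta), doc["id"]) for doc, info in retrieved]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]


# Hypothetical inputs, for illustration only.
meta = {
    "field_weights": {"company": 2.0},
    "connections": {2},
    "connection_boost": 0.5,
}
doc_a = {"id": 1, "tfidf": {"linkedin": 0.5}}
doc_b = {"id": 2, "tfidf": {"linkedin": 0.75}}
retrieved = [
    (doc_a, [("linkedin", "company")]),
    (doc_b, [("linkedin", "summary")]),
]
order = rank(retrieved, meta)
```

Here document 2 outranks document 1 only because the searcher is connected to it, which shows how the same retrieved set can be ordered differently for different members.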
24. Summary
Quality
LinkedIn Search leverages the economic graph.
Social means that relevance is highly personalized.
Less is more: query understanding is a relevance filter.
Moving in the direction of suggesting structured queries.
System
Powered by Lucene, but with additional components.
Offline data builds on Hadoop, partial index updates.
Index uses static ranking and early termination.
Scoring performed outside of Lucene.