Socializing Search. Professionally.
Sriram Sankar
Principal Staff Engineer
Recruiting Solutions

Daniel Tunkelang
Head, Query Understanding
Whether you’ve tried to find an Apache committer…
…or an Apache commander,

you’ve probably used LinkedIn Search.

Let’s talk about…

• Infrastructure

• Quality
LinkedIn Search leverages the economic graph.

Social means that relevance is highly personalized.

Machine-learned ranking, socially.
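• Relevance models incorporate user features:
  score = P(Document | Query, User)
• Our model: tree with logistic regression leaves.
  [Figure: decision tree with internal node tests $X_2 = ?$ and $X_{10} < 0.1234\ ?$ and a logistic regression model at each leaf:]
  $\beta_0 + \beta_1 T(x_1) + \dots + \beta_n x_n$
  $\alpha_0 + \alpha_1 P(x_1) + \dots + \alpha_n Q(x_n)$
  $\gamma_0 + \gamma_1 R(x_1) + \dots + \gamma_n Q(x_n)$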
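A minimal sketch of what "tree with logistic regression leaves" could look like in code. Everything here is illustrative rather than LinkedIn's implementation: the class names `Leaf` and `Split`, the feature layout, and the weights are assumptions. The feature vector x is assumed to mix query, document, and user features, so the leaf output plays the role of P(Document | Query, User).

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence, Union


@dataclass
class Leaf:
    """Logistic regression over (optionally transformed) features."""
    bias: float
    weights: Sequence[float]
    transforms: Sequence[Callable[[float], float]]

    def score(self, x: Sequence[float]) -> float:
        z = self.bias + sum(
            w * t(v) for w, t, v in zip(self.weights, self.transforms, x)
        )
        return 1.0 / (1.0 + math.exp(-z))  # sigmoid


@dataclass
class Split:
    """Internal node: routes x to one of two subtrees."""
    test: Callable[[Sequence[float]], bool]
    if_true: "Node"
    if_false: "Node"


Node = Union[Leaf, Split]


def score(node: Node, x: Sequence[float]) -> float:
    # Walk down the tree until a leaf is reached, then apply its model.
    while isinstance(node, Split):
        node = node.if_true if node.test(x) else node.if_false
    return node.score(x)


# Hypothetical two-feature example.
identity = lambda v: v
leaf_a = Leaf(0.0, [1.0, 0.0], [identity, identity])
leaf_b = Leaf(0.0, [0.0, 0.0], [identity, identity])
tree = Split(lambda x: x[1] < 0.1234, leaf_a, leaf_b)

low = score(tree, [0.0, 0.5])   # routed to leaf_b
high = score(tree, [2.0, 0.0])  # routed to leaf_a
```

The tree picks which regression applies, so different segments of queries or users can get different feature weights.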
LinkedIn’s focus: entity-oriented search.

[Figure labels: Company, Name Search, Employees, Jobs]
Query understanding can act as a relevance filter.

for i in [1..n]
    s ← w1 w2 … wi
    if Pc(s) > 0
        a ← new Segment()
        a.segs ← {s}
        a.prob ← Pc(s)
        B[i] ← {a}
    for j in [1..i-1]
        for b in B[j]
            s ← wj wj+1 … wi
            if Pc(s) > 0
                a ← new Segment()
                a.segs ← b.segs U {s}
                a.prob ← b.prob * Pc(s)
                B[i] ← B[i] U {a}
    sort B[i] by prob
    truncate B[i] to size k
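The segmentation pseudocode can be sketched in Python as a beam search over word positions. This is a sketch under assumptions, not the authors' code: `Segmentation`, `segment`, and the toy probability table are hypothetical names and data. It assumes B[j] holds segmentations of the first j words, so each new segment starts right after them. The special case for the whole prefix is folded into the loop by seeding position 0 with an empty segmentation.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Segmentation:
    segs: Tuple[str, ...]
    prob: float


def segment(
    words: Sequence[str], pc: Callable[[str], float], k: int = 5
) -> List[Segmentation]:
    n = len(words)
    # beams[i] holds the top-k segmentations of the first i words.
    beams: List[List[Segmentation]] = [[] for _ in range(n + 1)]
    beams[0] = [Segmentation((), 1.0)]
    for i in range(1, n + 1):
        candidates = []
        for j in range(i):
            s = " ".join(words[j:i])  # candidate segment after the first j words
            p = pc(s)
            if p <= 0:
                continue  # unknown segment: prune
            for b in beams[j]:
                candidates.append(Segmentation(b.segs + (s,), b.prob * p))
        candidates.sort(key=lambda a: a.prob, reverse=True)
        beams[i] = candidates[:k]  # truncate to beam size k
    return beams[n]


# Toy segment probabilities (made up for illustration).
toy = {"warren buffett": 0.6, "warren": 0.3, "buffett": 0.2, "jobs": 0.5}
best = segment(["warren", "buffett", "jobs"], lambda s: toy.get(s, 0.0), k=2)
```

With this toy table, the top segmentation treats "warren buffett" as one entity rather than two independent words, which is the sense in which query understanding filters out irrelevant matches.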
Less is more.
warren buffett

Coming soon: entity-driven search assist.
[Figure: search box containing "link" with suggested structured queries:]
• Jobs at LinkedIn
• People currently working at LinkedIn
• People who used to work at LinkedIn

Search
Infrastructure

Lucene
• Map of terms to documents – the index
• Provides an API to add documents to the index and remove them from it
• Provides an API to query the index
[Figure: two documents (1 and 2) of filler text containing the terms Daniel, Sriram, and LinkedIn, illustrating the Inverted Index and the Forward Index.]
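To make the two structures concrete, here is a toy in-memory index. It is not Lucene's API: the class `TinyIndex`, its methods, and the document texts are hypothetical. The inverted index maps each term to the documents containing it, and the forward index maps each document back to its terms.

```python
from collections import defaultdict


class TinyIndex:
    def __init__(self):
        self.forward = {}                 # doc id -> list of terms
        self.inverted = defaultdict(set)  # term -> set of doc ids

    def add(self, doc_id, text):
        if doc_id in self.forward:
            self.remove(doc_id)  # re-adding a document replaces it
        terms = text.lower().split()
        self.forward[doc_id] = terms
        for t in terms:
            self.inverted[t].add(doc_id)

    def remove(self, doc_id):
        # The forward index tells us which postings to clean up.
        for t in set(self.forward.pop(doc_id, [])):
            postings = self.inverted[t]
            postings.discard(doc_id)
            if not postings:
                del self.inverted[t]

    def query(self, *terms):
        """Return sorted ids of documents containing all terms (AND query)."""
        sets = [self.inverted.get(t.lower(), set()) for t in terms]
        return sorted(set.intersection(*sets)) if sets else []


idx = TinyIndex()
idx.add(1, "Daniel works at LinkedIn")
idx.add(2, "Sriram works at LinkedIn")
both = idx.query("linkedin")
only_daniel = idx.query("daniel", "linkedin")
idx.remove(1)
after_remove = idx.query("daniel")
```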
A standard scoring capability is built in

• Extremely easy to build a search engine
• But difficult to get sophisticated
The LinkedIn Search Stack
[Diagram: a Request flows through the Query Rewriter, Index Retrieval, Scorer, and Sorter/Blender to produce the Response. Offline Data Building supplies Data, and Live Updates supply Updates.]
Search Index Served by Lucene
• Inverted index
• Forward index
• Static-rank-based document ordering
Offline Data Builds on Hadoop
• Multi-stage map-reduce pipeline allows complex data processing
• Produces a sharded, single-segment Lucene index with documents sorted by static rank
• Produces data models for use in query rewriting
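A rough sketch of the last step of such a build: partition documents into shards and order each shard by static rank, best first. The function name `build_shards`, the tuple layout, and sharding by doc id modulo the shard count are all assumptions for illustration. The slides do not say how documents are assigned to shards, and a real build would run as map-reduce jobs and write Lucene segments.

```python
def build_shards(docs, num_shards):
    """docs: iterable of (doc_id, static_rank, text) tuples."""
    shards = [[] for _ in range(num_shards)]
    for doc in docs:
        # Assumed partitioning scheme: doc id modulo number of shards.
        shards[doc[0] % num_shards].append(doc)
    for shard in shards:
        # Highest static rank first, so retrieval sees the best documents first.
        shard.sort(key=lambda d: d[1], reverse=True)
    return shards


docs = [(1, 0.2, "a"), (2, 0.9, "b"), (3, 0.7, "c"), (4, 0.1, "d")]
shards = build_shards(docs, 2)
```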
Live Data Updates
• Feed-based framework to support updates to offline data builds
• Lucene enhanced with a partial index update capability
Query Rewriting (and Planning)
• Accepts raw query and user metadata
• Produces Lucene retrieval query and metadata for scoring
• May use data models built offline
Index Retrieval
• The Lucene query built by the query rewriter is used to retrieve documents from the Lucene index
• Documents are retrieved in static rank order (best document first)
• Retrieval may be terminated early, because retrieval is in static rank order
• No scoring is performed during retrieval
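Why static rank order enables early termination can be shown with a small sketch. It assumes postings lists hold document positions in ascending order, with position 0 being the document with the best static rank. The function `retrieve` and the data are hypothetical, not Lucene calls.

```python
def retrieve(postings, terms, limit):
    """AND query over postings sorted by position (0 = best static rank)."""
    if limit <= 0:
        return []
    lists = [postings.get(t, []) for t in terms]
    if not lists or any(not lst for lst in lists):
        return []
    others = [set(lst) for lst in lists[1:]]
    hits = []
    for doc in lists[0]:  # walks documents from best static rank downward
        if all(doc in s for s in others):
            hits.append(doc)
            if len(hits) >= limit:
                break  # early termination: remaining docs rank lower
    return hits


postings = {"linkedin": [0, 1, 2, 3, 5], "engineer": [1, 3, 4, 5]}
top_two = retrieve(postings, ["linkedin", "engineer"], 2)
```

Here document 5 also matches but is never visited, because the first two matches already have better static rank.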
Scoring
• Scoring is performed after retrieval
• Its input is the retrieved document (including its forward index data), a description of how the retrieval query matched the document, and the scoring metadata produced by the rewriter
• Costly features can be computed offline during the index building process in Hadoop – e.g., tf/idf calculations
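A sketch of a scorer taking the three inputs listed above. All names are hypothetical (`Candidate`, `precomputed_tfidf`, `field_weights`, `same_company_boost`), and a simple additive score stands in for the actual relevance model. The point is only that scoring runs on already-retrieved candidates and can read features that were precomputed offline and stored in the forward index.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: int
    forward: dict        # forward-index fields, incl. features precomputed offline
    matched_fields: set  # how the retrieval query matched this document


def score(c, meta):
    s = c.forward.get("precomputed_tfidf", 0.0)  # computed at index build time
    s += sum(meta["field_weights"].get(f, 0.0) for f in c.matched_fields)
    # Personalization via metadata the rewriter derived from the user.
    if c.forward.get("company") in meta.get("user_companies", ()):
        s += meta.get("same_company_boost", 0.0)
    return s


def rank(cands, meta):
    return sorted(cands, key=lambda c: score(c, meta), reverse=True)


meta = {
    "field_weights": {"name": 2.0, "title": 1.0},
    "user_companies": {"LinkedIn"},
    "same_company_boost": 0.5,
}
cands = [
    Candidate(1, {"precomputed_tfidf": 1.0, "company": "Other"}, {"title"}),
    Candidate(2, {"precomputed_tfidf": 0.5, "company": "LinkedIn"}, {"name"}),
]
ranked_ids = [c.doc_id for c in rank(cands, meta)]
```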
Summary
Quality
• LinkedIn Search leverages the economic graph.
• Social means that relevance is highly personalized.
• Less is more: query understanding is a relevance filter.
• Moving in the direction of suggesting structured queries.
System
• Powered by Lucene, but with additional components.
• Offline data builds on Hadoop, partial index updates.
• Index uses static ranking and early termination.
• Scoring performed outside of Lucene.
Sriram Sankar
ssankar@linkedin.com
https://linkedin.com/in/sriramxsankar

Daniel Tunkelang
dtunkelang@linkedin.com
https://linkedin.com/in/dtunkelang
