Socializing Search. Professionally.


Sriram Sankar and Daniel Tunkelang

Presented at the O'Reilly Strata 2014 Conference

LinkedIn has a unique data collection: the 277M+ members who use LinkedIn are also the most valuable entities in our corpus, which consists of people, companies, jobs, and a rich content ecosystem. Our members use LinkedIn to satisfy a diverse set of navigational and exploratory information needs, which we address by leveraging semi-structured and social content to understand their query intent and deliver a personalized search experience.

As a result, we’ve built a system quite different from those used for web or enterprise search. In this talk, we will discuss how we have addressed the unique scalability, performance, and search quality challenges in order to deliver billions of deeply personalized searches to our members. Although many of the challenges we face are unique to LinkedIn, we hope that the ideas we share will prove useful to other folks thinking about entity-oriented search or working with large-scale social network data.

Published in: Technology

  1. Socializing Search. Professionally. Sriram Sankar Principal Staff Engineer Recruiting Solutions Daniel Tunkelang Head, Query Understanding
  2. Whether you’ve tried to find an Apache committer…
  3. …or an Apache commander,
  4. you’ve probably used LinkedIn Search.
  5. Let’s talk about… • Infrastructure • Quality
  6. LinkedIn Search leverages the economic graph.
  7. Social means that relevance is highly personalized.
  8. Machine-learned ranking, socially. • Relevance models incorporate user features: score = P(Document | Query, User) • Our model: tree with logistic regression leaves. [Diagram: internal nodes test features (X2 = ?, X10 < 0.1234 ?); each leaf is a linear model, e.g. b0 + b1 T(x1) + ... + bn xn, a0 + a1 P(x1) + ... + an Q(xn), g0 + g1 R(x1) + ... + gn Q(xn)]
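The slide names the model family but not its features or weights. As a minimal sketch, assuming made-up feature names (`x1`, `x10`), a made-up tree shape, and made-up coefficients, a tree with logistic regression leaves could be evaluated as follows: internal nodes route on a single feature, and the leaf that is reached applies a logistic model to (optionally transformed) features. None of the names below come from the talk.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Union

Features = Dict[str, float]


@dataclass
class Leaf:
    """Logistic regression over optionally transformed features."""
    bias: float
    weights: Dict[str, float]
    transforms: Dict[str, Callable[[float], float]]

    def probability(self, x: Features) -> float:
        z = self.bias
        for name, weight in self.weights.items():
            transform = self.transforms.get(name, lambda v: v)
            z += weight * transform(x[name])
        return 1.0 / (1.0 + math.exp(-z))


@dataclass
class Split:
    """Internal node: route on one feature compared to a threshold."""
    feature: str
    threshold: float
    below: Union["Split", Leaf]
    at_or_above: Union["Split", Leaf]


def predict(node: Union[Split, Leaf], x: Features) -> float:
    # Walk down the tree until a leaf is reached, then apply its model.
    while isinstance(node, Split):
        node = node.below if x[node.feature] < node.threshold else node.at_or_above
    return node.probability(x)


# Hypothetical two-leaf model; the threshold echoes the slide, the rest is invented.
model = Split(
    feature="x10",
    threshold=0.1234,
    below=Leaf(bias=0.0, weights={"x1": 1.0}, transforms={"x1": math.log1p}),
    at_or_above=Leaf(bias=-1.0, weights={"x1": 0.5}, transforms={}),
)

low = predict(model, {"x10": 0.05, "x1": 0.0})   # routed to the first leaf
high = predict(model, {"x10": 0.50, "x1": 4.0})  # routed to the second leaf
```

In a personalized setting, the feature dictionary would mix document, query, and user signals, which is what lets the same document score differently for different searchers.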
  9. LinkedIn’s focus: entity-oriented search. Company Name Search Employees Jobs
  10. Query understanding can act as a relevance filter.
        for i in [1..n]
          s ← w1 w2 … wi
          if Pc(s) > 0
            a ← new Segment()
            a.segs ← {s}
            a.prob ← Pc(s)
            B[i] ← {a}
          for j in [1..i-1]
            for b in B[j]
              s ← wj wj+1 … wi
              if Pc(s) > 0
                a ← new Segment()
                a.segs ← b.segs ∪ {s}
                a.prob ← b.prob * Pc(s)
                B[i] ← B[i] ∪ {a}
          sort B[i] by prob
          truncate B[i] to size k
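The segmentation pseudocode on this slide can be sketched in runnable form. This is an illustration under stated assumptions, not the production implementation: `pc` stands in for Pc (a lookup that returns a positive probability only for known segments), the toy probabilities are invented, and the new segment is taken to start at the word after position j so that it does not overlap the prefix already covered by B[j] (the slide writes the segment as starting at wj).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Segmentation:
    segs: Tuple[str, ...]  # the segments, in query order
    prob: float            # product of the segment probabilities


def segment(words: List[str], pc: Callable[[str], float], k: int = 5) -> List[Segmentation]:
    """Return up to k highest-probability segmentations of the whole query."""
    n = len(words)
    # beams[i] holds the best segmentations of the first i words.
    beams: List[List[Segmentation]] = [[] for _ in range(n + 1)]
    for i in range(1, n + 1):
        candidates: List[Segmentation] = []
        # Case 1: the first i words form a single segment.
        whole = " ".join(words[:i])
        if pc(whole) > 0:
            candidates.append(Segmentation((whole,), pc(whole)))
        # Case 2: extend a segmentation of the first j words with one more segment.
        for j in range(1, i):
            s = " ".join(words[j:i])
            p = pc(s)
            if p > 0:
                for b in beams[j]:
                    candidates.append(Segmentation(b.segs + (s,), b.prob * p))
        candidates.sort(key=lambda a: a.prob, reverse=True)
        beams[i] = candidates[:k]  # truncate to beam size k
    return beams[n]


# Hypothetical segment probabilities, for illustration only.
TOY_PC = {"warren": 0.1, "buffett": 0.1, "warren buffett": 0.5}
result = segment(["warren", "buffett"], lambda s: TOY_PC.get(s, 0.0))
```

With these toy numbers the single-segment reading "warren buffett" outranks the two-word reading, which is the sense in which segmentation can filter out weaker interpretations of a query.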
  11. Less is more. warren buffett
  12. Coming soon: entity-driven search assist. Typing "link" suggests: Jobs at LinkedIn • People currently working at LinkedIn • People who used to work at LinkedIn
  13. Infrastructure: Lucene • Map of terms to documents – the index • Provides an API to add and remove documents to the index • Provides an API to query the index
  14. [Figure: two numbered example documents of filler text containing the terms Daniel, LinkedIn, and Sriram, shown alongside an Inverted Index and a Forward Index]
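To make the two structures concrete, here is a toy sketch in plain Python rather than Lucene's own API. The class name `TinyIndex`, its methods, and the two sample documents are illustrative assumptions: the inverted index maps each term to the documents containing it, and the forward index maps each document id back to its stored content.

```python
from collections import defaultdict
from typing import Dict, List, Set


class TinyIndex:
    """Toy index with add, remove, and single-term query operations."""

    def __init__(self) -> None:
        self.inverted: Dict[str, Set[int]] = defaultdict(set)  # term -> doc ids
        self.forward: Dict[int, str] = {}                      # doc id -> content

    @staticmethod
    def _terms(text: str) -> Set[str]:
        return {token.lower() for token in text.split()}

    def add(self, doc_id: int, text: str) -> None:
        self.forward[doc_id] = text
        for term in self._terms(text):
            self.inverted[term].add(doc_id)

    def remove(self, doc_id: int) -> None:
        text = self.forward.pop(doc_id)
        for term in self._terms(text):
            self.inverted[term].discard(doc_id)
            if not self.inverted[term]:
                del self.inverted[term]  # drop empty posting lists

    def search(self, term: str) -> List[int]:
        return sorted(self.inverted.get(term.lower(), set()))


index = TinyIndex()
index.add(1, "Daniel Daniel LinkedIn")
index.add(2, "Sriram LinkedIn Sriram")
```

Here `index.search("LinkedIn")` finds both documents through the inverted index, while `index.forward[2]` returns the stored content of document 2 without scanning any posting list.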
  15. A standard scoring capability is built in
  16. • Extremely easy to build a search engine • But difficult to get sophisticated
  17. The LinkedIn Search Stack [Diagram components: Request, Query Rewriter, Index, Retrieval, Scorer, Sorter/Blender, Response, Live Updates, Offline Data Building; edge labels: Updates, Data]
  18. Search Index Served by Lucene • Inverted index • Forward index • Static rank based document ordering
  19. Offline Data Builds on Hadoop • Multi-stage map-reduce pipeline allows complex data processing • Produces sharded single segment Lucene index with documents sorted by static rank • Produces data models for use in query rewriting
  20. Live Data Updates • Feed based framework to support updates to offline data builds • Lucene enhanced with a partial index update capability
  21. Query Rewriting (and Planning) • Accepts raw query and user metadata • Produces Lucene retrieval query and metadata for scoring • May use data models built offline
  22. Index Retrieval • Lucene query built by query rewriter is used to retrieve documents from the Lucene index • Documents are retrieved in static rank order (best document first) • Retrieval may be early-terminated – given that retrieval is in static rank order • No scoring is performed during retrieval
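The retrieval behaviour described on this slide can be illustrated without Lucene. The sketch below assumes a list of documents already sorted by a precomputed static rank, a hypothetical `matches` predicate standing in for the rewritten query, and a hypothetical `limit` on how many candidates to collect. Because the best documents come first, the scan can stop as soon as enough matches are found.

```python
from typing import Callable, Dict, List, Tuple

Doc = Dict[str, object]


def retrieve(
    docs_in_static_rank_order: List[Doc],
    matches: Callable[[Doc], bool],
    limit: int,
) -> Tuple[List[Doc], int]:
    """Collect up to `limit` matching docs; also report how many were scanned."""
    hits: List[Doc] = []
    scanned = 0
    for doc in docs_in_static_rank_order:
        scanned += 1
        if matches(doc):
            hits.append(doc)  # no scoring here, only matching
            if len(hits) >= limit:
                break         # early termination
    return hits, scanned


# Hypothetical documents; static ranks would be computed offline.
docs = [
    {"id": "a", "static_rank": 0.2, "terms": {"engineer"}},
    {"id": "b", "static_rank": 0.9, "terms": {"engineer", "search"}},
    {"id": "c", "static_rank": 0.7, "terms": {"designer"}},
    {"id": "d", "static_rank": 0.5, "terms": {"engineer"}},
]
index_order = sorted(docs, key=lambda d: d["static_rank"], reverse=True)

hits, scanned = retrieve(index_order, lambda d: "engineer" in d["terms"], limit=2)
```

In this toy run the scan stops after three of the four documents, returning `b` and `d` and never visiting the lowest-ranked document `a`.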
  23. Scoring • Scoring is performed after retrieval • Its input is the retrieved document (i.e., includes the forward index), a description of how the retrieval query matched the document, and the scoring metadata produced by the rewriter • Costly features can be computed offline during the index building process in Hadoop – e.g., tf/idf calculations
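A post-retrieval scorer with the three inputs listed on this slide might be shaped roughly as follows. Everything specific here is an assumption made for illustration: the `Hit` and `ScoringMetadata` types, the feature names, and the hand-written additive formula, which is only a stand-in for the machine-learned model described earlier in the talk.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Hit:
    doc_id: int
    fields: Dict[str, str]    # forward-index data for the document
    matched_fields: Set[str]  # how the retrieval query matched
    static_rank: float        # precomputed offline


@dataclass
class ScoringMetadata:
    """Per-request data produced by the query rewriter (hypothetical shape)."""
    user_industry: str
    connection_ids: Set[int]
    field_weights: Dict[str, float]


def score(hit: Hit, meta: ScoringMetadata) -> float:
    s = hit.static_rank
    # Reward matches according to the field in which they occurred.
    s += sum(meta.field_weights.get(f, 0.0) for f in hit.matched_fields)
    # Personalization: boost the searcher's own connections and industry.
    if hit.doc_id in meta.connection_ids:
        s += 1.0
    if hit.fields.get("industry") == meta.user_industry:
        s += 0.5
    return s


def rank(hits: List[Hit], meta: ScoringMetadata) -> List[Hit]:
    return sorted(hits, key=lambda h: score(h, meta), reverse=True)


meta = ScoringMetadata(
    user_industry="Internet",
    connection_ids={2},
    field_weights={"name": 2.0, "headline": 0.5},
)
hits = [
    Hit(1, {"industry": "Internet"}, {"headline"}, static_rank=0.9),
    Hit(2, {"industry": "Finance"}, {"name"}, static_rank=0.5),
]
ranked = rank(hits, meta)
```

Document 2 arrives second in static rank order but finishes first after scoring, which shows why scoring is kept separate from retrieval: retrieval order is the same for everyone, while the final order depends on the searcher.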
  24. Summary
      Quality
      • LinkedIn Search leverages the economic graph.
      • Social means that relevance is highly personalized.
      • Less is more: query understanding is a relevance filter.
      • Moving in the direction of suggesting structured queries.
      System
      • Powered by Lucene, but with additional components.
      • Offline data builds on Hadoop, partial index updates.
      • Index uses static ranking and early termination.
      • Scoring performed outside of Lucene.
  25. Sriram Sankar Daniel Tunkelang