Find and be Found: Information Retrieval at LinkedIn



SIGIR 2013 Industry Track Presentation

LinkedIn has a unique data collection: the 200M+ members who use LinkedIn are also the most valuable entities in our corpus, which consists of people, companies, jobs, and a rich content ecosystem. Our members use LinkedIn to satisfy a diverse set of navigational and exploratory information needs, which we address by leveraging semi-structured and social content to understand their query intent and deliver a personalized search experience. In this talk, we will discuss some of the unique challenges we face in building the LinkedIn search platform, the solutions we've developed so far, and the open problems we see ahead of us.

Shakti Sinha heads LinkedIn's search relevance team and has been making key contributions to LinkedIn's search products since 2010. He previously worked at Google as both a research intern and a software engineer. He has an MS in Computer Science from Stanford and a BS from the College of Engineering, Pune.

Daniel Tunkelang leads LinkedIn's efforts around query understanding. Before that, he led LinkedIn's product data science team. He previously led a local search quality team at Google and was a founding employee of Endeca (acquired by Oracle in 2011). He has written a textbook on faceted search, and is a recognized advocate of human-computer interaction and information retrieval (HCIR). He has a PhD in Computer Science from CMU, as well as BS and MS degrees from MIT.


Find and be Found: Information Retrieval at LinkedIn

  1. Find and be Found: Information Retrieval at LinkedIn. Shakti Sinha, Head, Search Relevance; Daniel Tunkelang, Head, Query Understanding.
  2. Why do 225M+ people use LinkedIn?
  3. Profile: the professional identity of record.
  4. Job recommendations.
  5. Publishing platform for professional content.
  6. Search helps members find and be found.
  7. Search for people,
  8. Search for people, jobs,
  9. Search for people, jobs, groups, and more.
  10. Every search is personalized.
  11. Let’s talk a bit about how it all works: § Query Understanding § Ranking.
  12. Query Understanding. Daniel Tunkelang, Head, Query Understanding.
  13. Pre-retrieval: segment and tag queries, e.g., lucene software engineer → lucene “software engineer”. (See the tagging sketch after the slide list.)
  14. LinkedIn’s focus: entity-oriented search. (Diagram: Company linked to Employees, Jobs, and Name Search.)
  15. Query tagging: key to query understanding. § Using human judgments to evaluate tag precision: extremely accurate (>99%) for identifying person names, but harder to distinguish company vs. title vs. skill (e.g., oracle dba). § Comparing CTR for tag matches vs. non-matches: the difference can be large enough to suggest filtering vs. ranking.
  16. Detecting navigational vs. exploratory queries. § Pre-retrieval: the sequence of query tags. § Post-retrieval: the distribution of scores / features. § Click behavior: title searches are >50x more likely than name searches to get 2+ clicks. (See the heuristic sketch after the slide list.)
  17. Query expansion for exploratory queries (e.g., software patent lawyer): expansions derived from reformulations, e.g., lawyer -> attorney. (See the reformulation-mining sketch after the slide list.)
  18. Understanding misspelled queries. daniel tankalong → Did you mean daniel tunkelang? infomation retrieval → Did you mean information retrieval? marisa meyer → Did you mean marissa mayer? ingenero eletrico → Did you mean ingeniero electrico? jonathan podemsky → Did you mean johnathan podemsky? desenista industrail → Did you mean desenhista industrial?
  19. Spelling out the details: the correction index is built from entity data (people, companies), successful queries (tunkelang), reformulations (marisa => marissa), n-grams (dublin => du ub bl li in), metaphones (mark/marc => MRK), and word pairs (johnathan podemsky), e.g., marisa meyer yoohoo → marissa mayer yahoo. (See the candidate-generation sketch after the slide list.)
  20. Ranking. Shakti Sinha, Head, Search Relevance.
  21. LinkedIn search is personalized. (Example query: kevin scott.)
  22. But global factors matter.
  23. Relevant results can be in or out of network. § The searcher’s network matters for relevance: within-network results have higher CTR. § But the network is not enough: about two thirds of search clicks come from out-of-network results.
  24. Personalized machine-learned ranking. § Each data point is a triple (searcher, query, document); searcher features are important! § Labels: is this document relevant to the query and the user? That depends on the user’s network, location, etc., which is too much to ask a random person to judge. § Training data therefore has to be collected from search logs. (See the feature-assembly sketch after the slide list.)
  25. Search log data has biases. § Presentation bias: results shown higher tend to get clicked more often; use FairPairs [Radlinski and Joachims, AAAI’06]. (Diagram: flipped vs. not-flipped result pairs, clicks, and the resulting training data. See the FairPairs sketch after the slide list.)
  26. Search log data has biases. § Sample bias: the user clicks or skips only what is shown, so what about low-scoring results from the existing model? Add low-scoring results as ‘easy negatives’ (label 0) so the model learns about bad results that were never presented to the user. (Diagram: label-0 examples drawn from deep result pages. See the sampling sketch after the slide list.)
  27. How to train your model.
  28. How to train your model. § Train simple models to resemble complex ones: build an Additive Groves model [Sorokina et al, ECML ’07], which is good at detecting interactions. § Build a tree with logistic-regression leaves. § By restricting the tree to user and query features, only one regression model is evaluated for each document. (Diagram: a tree splitting on features such as x2 and x10 < 0.1234, with a linear model like α0 + α1 P(x1) + ... + αn Q(xn) at each leaf. See the tree-with-regression-leaves sketch after the slide list.)
  29. Take-aways. § LinkedIn’s search problem is unique because of the deep role of personalization: users are an integral part of the corpus. § Query understanding allows us to optimize for entity-oriented search against semi-structured content. § Ranking requires us to contextually apply global and personalized user, query, and document features.
  30. Thank you!
  31. Want to learn more? § Check out § Contact us: Shakti, Daniel, Asif. § Did we mention that we’re hiring?
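The sketches below illustrate a few of the techniques mentioned in the slides. Each is a simplified, assumption-laden sketch rather than a description of LinkedIn's actual implementation. First, the pre-retrieval query segmentation and tagging from slides 13 and 15: the phrase dictionary, the tag names, and the greedy longest-match strategy are all invented for illustration.

```python
# Minimal dictionary-based query segmenter/tagger. The phrase dictionary and
# tag names are hypothetical stand-ins; the slides do not describe the tagger.
KNOWN_PHRASES = {
    ("software", "engineer"): "TITLE",
    ("oracle",): "COMPANY",   # ambiguous in practice: company vs. skill (slide 15)
    ("lucene",): "SKILL",
}

def tag_query(query: str):
    """Greedily match the longest known phrase starting at each position."""
    tokens = query.lower().split()
    tagged, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):      # try the longest span first
            phrase = tuple(tokens[i:j])
            if phrase in KNOWN_PHRASES:
                tagged.append((" ".join(phrase), KNOWN_PHRASES[phrase]))
                i = j
                break
        else:                                    # no dictionary hit: leave untagged
            tagged.append((tokens[i], "UNKNOWN"))
            i += 1
    return tagged

print(tag_query("lucene software engineer"))
# [('lucene', 'SKILL'), ('software engineer', 'TITLE')]
```

A lookup like this reproduces the segmentation shown on slide 13 but not the disambiguation problem slide 15 points out (oracle as company vs. skill), which is where the human judgments and CTR comparisons come in.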
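Slide 16 lists three kinds of signals for separating navigational from exploratory queries: the sequence of query tags, the post-retrieval distribution of scores, and click behavior. The sketch below encodes only the first and simplest of these as a pre-retrieval rule; the tag names and the rule itself are assumptions.

```python
# Hypothetical pre-retrieval rule: treat pure name queries as navigational and
# everything else as exploratory. A real system would combine this with
# post-retrieval score distributions and click behavior (slide 16).
def looks_navigational(tagged_query) -> bool:
    tags = {tag for _, tag in tagged_query}
    return tags <= {"FIRST_NAME", "LAST_NAME", "FULL_NAME"}

print(looks_navigational([("daniel", "FIRST_NAME"), ("tunkelang", "LAST_NAME")]))  # True
print(looks_navigational([("software engineer", "TITLE")]))                        # False
```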
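Slide 17 derives query expansions (lawyer -> attorney) from reformulations. One simple way to mine such candidates is to count single-token substitutions between consecutive queries in a session; the session format, the counting rule, and the min_count threshold below are assumptions made for illustration.

```python
# Mine expansion candidates as single-token substitutions between consecutive
# queries in a session. Sessions and the min_count threshold are invented.
from collections import Counter

def mine_rewrites(sessions, min_count=2):
    subs = Counter()
    for queries in sessions:
        for q1, q2 in zip(queries, queries[1:]):
            t1, t2 = q1.split(), q2.split()
            if len(t1) == len(t2):
                diffs = [(a, b) for a, b in zip(t1, t2) if a != b]
                if len(diffs) == 1:                # exactly one word changed
                    subs[diffs[0]] += 1
    return {pair: n for pair, n in subs.items() if n >= min_count}

sessions = [
    ["software patent lawyer", "software patent attorney"],
    ["ip lawyer", "ip attorney"],
]
print(mine_rewrites(sessions))   # {('lawyer', 'attorney'): 2}
```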
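Slide 19 builds spelling correction from several signals: entity data, successful queries, reformulations, character n-grams, metaphones, and word pairs. The sketch below implements only the character-n-gram signal, generating candidates by bigram Jaccard similarity against a toy vocabulary; the threshold is arbitrary and the remaining signals are omitted.

```python
# Candidate generation via character-bigram Jaccard similarity. The vocabulary
# and threshold are toy stand-ins; other signals from slide 19 are omitted.
def bigrams(word: str) -> set:
    return {word[i:i + 2] for i in range(len(word) - 1)}

def suggest(word: str, vocabulary, threshold=0.5):
    """Return vocabulary terms ranked by bigram overlap with the input."""
    wb = bigrams(word.lower())
    scored = []
    for term in vocabulary:
        tb = bigrams(term)
        jaccard = len(wb & tb) / len(wb | tb)
        if jaccard >= threshold:
            scored.append((jaccard, term))
    return [term for _, term in sorted(scored, reverse=True)]

vocab = {"information", "retrieval", "informal", "formation"}
print(suggest("infomation", vocab))   # ['information', 'formation']
```

In a full index the n-gram, metaphone, and word-pair signals would be combined, and candidates would be re-ranked with query-log evidence such as the marisa => marissa reformulations on slide 19.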
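Slide 24 defines a training data point as a (searcher, query, document) triple with labels derived from search logs. The sketch below shows what assembling features for such a triple might look like; every feature name and value here is invented, not taken from the talk.

```python
# Assembling features for one (searcher, query, document) training triple.
# All feature names and values are invented.
def features(searcher: dict, query: dict, document: dict) -> dict:
    return {
        # searcher-document features (personalization)
        "network_distance": document.get("degree", 3),                      # 1st/2nd/3rd+
        "same_location": int(searcher["location"] == document["location"]),
        # query-document features
        "title_tag_match": int(query.get("title") == document.get("title")),
        # global document features
        "doc_popularity": document.get("popularity", 0.0),
    }

searcher = {"location": "SF Bay Area"}
query = {"title": "software engineer"}                 # e.g., from the query tagger
document = {"location": "SF Bay Area", "title": "software engineer",
            "degree": 2, "popularity": 0.8}
print(features(searcher, query, document))
```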
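Slide 25 counters presentation bias with FairPairs [Radlinski and Joachims, AAAI’06]. The sketch below captures the core idea: pair adjacent results, flip each pair with probability 0.5 before display, and read a click on the lower-shown member of a pair as a preference for it over the upper member. The random pairing offset from the original paper is omitted, and the preference extraction is a simplified reading.

```python
# Simplified FairPairs presentation and preference extraction.
import random

def fair_pairs(ranking):
    """Pair adjacent results and flip each pair with probability 0.5 before display."""
    presented, pairs = [], []
    for i in range(0, len(ranking) - 1, 2):
        a, b = ranking[i], ranking[i + 1]
        flipped = random.random() < 0.5
        shown = (b, a) if flipped else (a, b)
        presented.extend(shown)
        pairs.append(shown)
    if len(ranking) % 2:                      # an odd trailing result is shown unpaired
        presented.append(ranking[-1])
    return presented, pairs

def preferences(pairs, clicked):
    """A click on the lower-shown member of a pair counts as a preference over the upper one."""
    return [(bottom, top) for top, bottom in pairs if bottom in clicked]

random.seed(0)                                        # for a reproducible example
presented, pairs = fair_pairs(["d1", "d2", "d3", "d4"])
print(presented)                                      # ['d1', 'd2', 'd3', 'd4'] with this seed
print(preferences(pairs, clicked={"d2"}))             # [('d2', 'd1')]
```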
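Slide 26 addresses sample bias by adding low-scoring results that were never shown to the user as ‘easy negatives’ with label 0. A sketch of that sampling step follows; the sample_rate and the data shapes are assumptions.

```python
# Append 'easy negatives': unshown, low-scoring documents added with label 0.
# The sample rate and data shapes are invented.
import random

def add_easy_negatives(shown_examples, unshown_low_scoring_docs, sample_rate=0.05):
    easy_negatives = [(doc, 0) for doc in unshown_low_scoring_docs
                      if random.random() < sample_rate]
    return shown_examples + easy_negatives

shown = [("doc_a", 1), ("doc_b", 0)]                    # clicked / skipped on page 1
deep_results = [f"doc_{i}" for i in range(100, 200)]    # never shown to the user
training_data = add_easy_negatives(shown, deep_results)
```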
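Slide 28 describes a tree restricted to user and query features, with a logistic-regression model at each leaf, so that only one regression has to be evaluated per document. The sketch below mirrors that structure with a single invented split and invented weights; it is a shape illustration, not the production model or the Additive Groves teacher it is distilled from.

```python
# Decision tree over user/query features with logistic-regression leaves.
# The split, leaf names, and weights are invented for illustration.
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Each leaf holds a bias plus weights over document features.
LEAF_MODELS = {
    "name_search":  {"bias": -1.0, "network_distance": -0.8, "name_match": 2.5},
    "title_search": {"bias": -0.5, "network_distance": -0.3, "title_match": 1.7},
}

def route(user_query_features: dict) -> str:
    """The tree looks only at user/query features (a single split in this sketch)."""
    return "name_search" if user_query_features.get("is_name_query") else "title_search"

def score(user_query_features: dict, doc_features: dict) -> float:
    """Evaluate only the one regression model selected by the tree."""
    model = LEAF_MODELS[route(user_query_features)]
    z = model["bias"] + sum(w * doc_features.get(f, 0.0)
                            for f, w in model.items() if f != "bias")
    return sigmoid(z)

print(score({"is_name_query": False},
            {"network_distance": 2, "title_match": 1}))   # ≈ 0.65
```

Because the tree never inspects document features, the leaf is chosen once per (searcher, query), and each candidate document then costs only a single dot product and a sigmoid, which is the efficiency point the slide makes.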