Structure, Personalization, Scale: A Deep Dive Into LinkedIn Search

Presented at QCon New York 2014 in the Applied Science and Machine Learning track
https://qconnewyork.com/presentation/structure-personalization-scale-deep-dive-linkedin-search

Also presented at the NYC Search, Discovery, and Analytics Meetup
http://www.meetup.com/NYC-Search-and-Discovery/events/168265892/

Video: http://new.livestream.com/hugeinc/search-at-linkedin

Abstract

All of us are familiar with search as users. And as software engineers, many of us have worked on search problems in the context of web search, site search, or enterprise search. But search at LinkedIn is different. Our corpus is a richly structured professional graph comprised of 300M+ people, 3M+ companies, 2M+ groups, and 1.5M+ publishers. Our members perform billions of searches (over 5.7B in 2012), and each of those searches is highly personalized based on the searcher's identity and relationships with other professional entities in LinkedIn's economic graph. And all this data is in constant flux as LinkedIn adds more than 2 members every second in over 200 countries (2/3 of our members are outside the United States). As a result, we’ve built a system quite different from those used for other search applications. In this talk, we will discuss some of the unique challenges we've faced as we deliver highly personalized search over semi-structured data at massive scale.

Speaker Bios

Asif Makhani heads Search at LinkedIn. Prior to that, he was a founding member of A9 and led the development and launch of Amazon CloudSearch, a fully managed and elastic search service in the AWS Cloud. Asif has a Masters in Computer Science from Stanford University and a BMath from the University of Waterloo. @asifm

Daniel Tunkelang leads LinkedIn's efforts around query understanding. Before that, he led LinkedIn's product data science team. He previously led a local search quality team at Google and was a founding employee of Endeca (acquired by Oracle in 2011). He has written a textbook on faceted search and is a recognized advocate of human-computer interaction and information retrieval (HCIR). He has a PhD in Computer Science from CMU, as well as BS and MS degrees from MIT. @dtunkelang

Presentation Transcript

  • Structure, Personalization, Scale: A Deep Dive into LinkedIn Search (Recruiting Solutions)
  • Overview: What is LinkedIn search and why should you care? What are our systems challenges? What are our relevance challenges?
  • Search helps members find and be found.
  • Search for people, jobs, groups, and more.
  • A separate product for recruiters.
  • Search is the core of key LinkedIn use cases.
  • What's unique: personalized; part of a larger product experience (many products, and a big part of the experience); task-centric (find a job, hire top talent, find a person, …).
  • Systems Challenges
  • Evolution of LinkedIn's Search Architecture. 2004: no search engine; iterate through your network and filter.
  • 2007: Introducing Lucene (single shard, multiple replicas).
  • 2008: Zoie for real-time search (search without commits or shutdown).
  • 2008: Content Store, aggregating multiple input sources.
  • 2008: Sharded search, with a broker fanning queries out to the shards.
  • 2009: Bobo for faceted search.
  • 2010: SenseiDB (cluster management, a new query language, wrapping the existing pieces).
  • 2011: Cleo (instant typeahead results).
  • 2013: Too many stacks: group search, article/post search, and more.
  • Challenges: index rebuilding is very difficult; live updates are at an entity granularity; scoring is inflexible; Lucene limitations; fragmentation (too many components, too many stacks). Opportunity: the Economic Graph.
  • 2014: Introducing Galene.
  • Life of a Query: the user query goes through a query rewriter/planner, fans out to the search shards, and the per-shard results are merged into the final search results.
  • Life of a Query, within a search shard: the rewritten query runs against the index; matching documents are retrieved and scored to produce the top results from that shard.
  • Life of a Query, within the rewriter: the query passes through a chain of rewriter modules, each with its own rewriter state and backing data models, to produce the rewritten query.
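To make the rewriter chain concrete, here is a minimal Java sketch of a query flowing through a sequence of rewriter modules. This is not Galene's actual API; the interface, class, and example rewrites below are hypothetical and only illustrate the chaining pattern.

```java
import java.util.List;

// Hypothetical sketch of a rewriter chain; each module may consult its own data model.
interface RewriterModule {
    // Receives the query produced by the previous module, returns a (possibly) rewritten query.
    String rewrite(String query);
}

class RewriterChain {
    private final List<RewriterModule> modules;

    RewriterChain(List<RewriterModule> modules) {
        this.modules = modules;
    }

    String rewrite(String userQuery) {
        String q = userQuery;
        for (RewriterModule m : modules) {
            q = m.rewrite(q);   // e.g., spellcheck, tagging, expansion
        }
        return q;
    }

    public static void main(String[] args) {
        List<RewriterModule> modules = List.of(
            q -> q.toLowerCase(),                                   // toy normalization
            q -> q.replace("sw engineer", "software engineer")      // toy expansion
        );
        RewriterChain chain = new RewriterChain(modules);
        System.out.println(chain.rewrite("SW Engineer Brooklyn"));  // -> software engineer brooklyn
    }
}
```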
  • Life of Data, offline: raw data is combined with derived data and data models to build the index.
  • Improvements: regular full index builds using Hadoop (easier to reshard and add fields); improved relevance (offline relevance and query-rewriting frameworks); partial live-update support, allowing efficient updates of high-frequency fields without a full sync (goodbye Content Store, goodbye Zoie); early termination for ultra-low-latency instant results (goodbye Cleo); indexing and searching across graph entities and attributes; a single engine and a single stack.
  • Galene Deep Dive
  • Primer on Search
  • Lucene: an open-source API that supports search functionality: add new documents to the index, delete documents from the index, construct queries, search the index using a query, and score the retrieved documents.
  • The Search Index  Inverted Index: Mapping from (search) terms to list of documents (they are present in)  Forward Index: Mapping from documents to metadata about them 31
  • 32 BLAH BLAH BLAH DanielBLAH BLAH LinkedIn BLAH BLAH BLAH BLAH BLAH BLAH AsifBLAH LinkedIn BLAH BLAH BLAH BLAH BLAH BLAH BLAH2. 1. Daniel Asif LinkedIn 2 1 Inverted Index Forward Index
  • The Search Index  The lists are called posting lists  Upto hundreds of millions of posting lists  Upto hundreds of millions of documents  Posting lists may contain as few as a single hit and as many as tens of millions of hits  Terms can be – words in the document – inferred attributes about the document 33
  • Lucene Queries: term:"asif makhani"; term:asif term:daniel; +term:daniel +prefix:tunk; +asif +linkedIn; +term:daniel connection:50510; +term:daniel industry:software connection:50510^4.
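The same query shapes can be built programmatically with Lucene's query classes. A sketch, assuming a recent Lucene release; the field names follow the slide's syntax, and reading "prefix:" as a prefix match on the term field is an assumption.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryExamples {
    public static void main(String[] args) {
        // +term:daniel +prefix:tunk  (required term plus required prefix match)
        Query nameQuery = new BooleanQuery.Builder()
            .add(new TermQuery(new Term("term", "daniel")), Occur.MUST)
            .add(new PrefixQuery(new Term("term", "tunk")), Occur.MUST)
            .build();

        // +term:daniel industry:software connection:50510^4  (boosted optional connection clause)
        Query personalized = new BooleanQuery.Builder()
            .add(new TermQuery(new Term("term", "daniel")), Occur.MUST)
            .add(new TermQuery(new Term("industry", "software")), Occur.SHOULD)
            .add(new BoostQuery(new TermQuery(new Term("connection", "50510")), 4.0f), Occur.SHOULD)
            .build();

        System.out.println(nameQuery);
        System.out.println(personalized);
    }
}
```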
  • Early termination: we order documents in the index based on a static rank, from most important to least important. An offline relevance algorithm assigns the static rank on which the sorting is performed. This allows retrieval to be early-terminated (assuming a strong correlation between static rank and the importance of a result for a specific query). It also works well with personalized search, e.g., +term:asif +prefix:makh +(connection:35176 connection:418001 connection:1520032).
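A minimal sketch of the idea (toy code, not Galene): because posting lists are stored in descending static-rank order, retrieval can stop as soon as enough candidates have been collected.

```java
import java.util.ArrayList;
import java.util.List;

// Toy early termination: posting lists are ordered by descending static rank,
// so scanning can stop once K candidates have been collected.
public class EarlyTermination {
    static List<Integer> retrieve(List<Integer> postingListInRankOrder, int k) {
        List<Integer> results = new ArrayList<>();
        for (int docId : postingListInRankOrder) {
            results.add(docId);
            if (results.size() == k) {
                break; // early termination: all remaining docs have lower static rank
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Doc ids already sorted from most to least important (static rank).
        List<Integer> postings = List.of(42, 7, 99, 13, 55, 61);
        System.out.println(retrieve(postings, 3)); // [42, 7, 99]
    }
}
```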
  • Partial Updates: Lucene segments are "document-partitioned"; we have enhanced Lucene with "term-partitioned" segments. We use three term-partitioned segments: a base index (never changed), a live update buffer, and a snapshot index.
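A hedged sketch of how a lookup might resolve across the three segments, with the live update buffer taking precedence over the snapshot, which in turn overrides the immutable base index. This is a hypothetical illustration of the layering, not Galene's actual data structures; the key format is made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical layered lookup across the three term-partitioned segments described above.
public class SegmentLookup {
    private final Map<String, String> baseIndex = new HashMap<>();        // never changed
    private final Map<String, String> snapshotIndex = new HashMap<>();    // periodic snapshot
    private final Map<String, String> liveUpdateBuffer = new HashMap<>(); // most recent updates

    String lookup(String key) {
        if (liveUpdateBuffer.containsKey(key)) return liveUpdateBuffer.get(key);
        if (snapshotIndex.containsKey(key)) return snapshotIndex.get(key);
        return baseIndex.get(key);
    }

    public static void main(String[] args) {
        SegmentLookup index = new SegmentLookup();
        index.baseIndex.put("member:42:headline", "Engineer at Acme");
        index.liveUpdateBuffer.put("member:42:headline", "Engineer at LinkedIn"); // live update wins
        System.out.println(index.lookup("member:42:headline")); // Engineer at LinkedIn
    }
}
```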
  • Going Forward: consolidation across verticals; improved relevance support (machine-learned models, query rewriting, relevant snippets, …); improved performance; Search as a Service (SeaS); exploring the Economic Graph.
  • Quality Challenges
  • The Search Quality Pipeline: query understanding (spellcheck, query tagging, vertical intent, query expansion), followed by ranking, in which each document's features are scored by a machine-learned model to produce an ordered result list.
  • Spellcheck: built from people names, companies, titles, and past queries. Signals include character n-grams (marissa => ma ar ri is ss sa), metaphone codes (mark/marc => MRK), and co-occurrence counts (marissa:mayer = 1000), e.g., correcting "marisa meyer yahoo" to "marissa mayer yahoo".
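An illustrative sketch of two of the signals named on the slide: character bigrams for fuzzy matching and co-occurrence counts for choosing among candidate corrections. This is toy code, not LinkedIn's spellchecker; the metaphone signal is omitted and the counts are made up.

```java
import java.util.*;

// Toy spellcheck signals: character bigrams and query co-occurrence counts.
public class SpellcheckSignals {
    // "marissa" -> [ma, ar, ri, is, ss, sa]
    static Set<String> bigrams(String word) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + 2 <= word.length(); i++) {
            grams.add(word.substring(i, i + 2));
        }
        return grams;
    }

    // Jaccard similarity of bigram sets: a simple fuzzy-match score between a typo and a candidate.
    static double similarity(String a, String b) {
        Set<String> inter = new HashSet<>(bigrams(a));
        inter.retainAll(bigrams(b));
        Set<String> union = new HashSet<>(bigrams(a));
        union.addAll(bigrams(b));
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(bigrams("marissa"));               // [ma, ar, ri, is, ss, sa]
        System.out.println(similarity("marisa", "marissa"));  // high overlap -> good candidate

        // Co-occurrence counts from past queries help rank candidate corrections.
        Map<String, Integer> cooccurrence = Map.of("marissa:mayer", 1000, "marisa:meyer", 3);
        System.out.println(cooccurrence.get("marissa:mayer")); // 1000
    }
}
```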
  • Query Tagging: e.g., segmenting and tagging the query "machine learning data scientist brooklyn".
  • Vertical Intent, results blending: blending results across intents such as [company], [employees], [jobs], and [name search].
  • Vertical Intent, typeahead: P(mongodb | mon) = 5%, P(monsanto | mons) = 50%, P(mongodb | mong) = 80%.
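A minimal sketch of how prefix-to-completion probabilities like these might be estimated from query-log counts. The counts below are hypothetical, so the resulting values will not match the slide; the point is only the mechanism of conditioning on the typed prefix.

```java
import java.util.Map;

// Toy estimate of P(completion | prefix) from query-log counts.
public class TypeaheadIntent {
    // P(completion | prefix) = count(completion) / sum of counts of all logged queries with that prefix
    static double probability(String prefix, String completion, Map<String, Integer> counts) {
        int total = counts.entrySet().stream()
            .filter(e -> e.getKey().startsWith(prefix))
            .mapToInt(Map.Entry::getValue)
            .sum();
        return total == 0 ? 0.0 : (double) counts.getOrDefault(completion, 0) / total;
    }

    public static void main(String[] args) {
        // Hypothetical counts of full queries observed in the logs.
        Map<String, Integer> queryCounts = Map.of(
            "mongodb", 800,
            "monsanto", 500,
            "money", 300
        );

        System.out.printf("P(mongodb | mon)   = %.2f%n", probability("mon", "mongodb", queryCounts));
        System.out.printf("P(monsanto | mons) = %.2f%n", probability("mons", "monsanto", queryCounts));
        System.out.printf("P(mongodb | mong)  = %.2f%n", probability("mong", "mongodb", queryCounts));
    }
}
```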
  • Query Expansion
  • Ranking: each document's features are scored by a machine-learned model, producing an ordered list of results.
  • Ranking is highly personalized (example: the query "kevin scott").
  • Not just for name search.
  • Relevance Model: features computed from the query keywords and the document are fed to a machine learning model.
  • Examples of Features: number of search keywords matching the title = 3; searcher location = result location; searcher's network distance to the result = 2; …
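To make the feature-plus-model split concrete, a toy sketch that computes the three example features above and turns them into a score with a plain logistic regression. The weights and the use of logistic regression here are assumptions for illustration, not LinkedIn's actual model.

```java
// Toy relevance scoring: a few hand-picked features fed to a logistic regression model.
public class ToyRanker {
    // The three example features from the slide, as a feature vector.
    static double[] features(int keywordsMatchingTitle, boolean sameLocation, int networkDistance) {
        return new double[] {
            keywordsMatchingTitle,
            sameLocation ? 1.0 : 0.0,
            networkDistance
        };
    }

    // score = sigmoid(bias + w1*x1 + ... + wn*xn)
    static double score(double[] x, double[] weights, double bias) {
        double z = bias;
        for (int i = 0; i < x.length; i++) {
            z += weights[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        double[] x = features(3, true, 2);      // the slide's example feature values
        double[] w = {0.8, 1.2, -0.5};          // hypothetical learned weights
        System.out.println(score(x, w, -1.0));  // probability-like relevance score
    }
}
```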
  • Model Training, traditional approach: features are computed for the training documents, labels come from human evaluation, and both feed the machine learning model.
  • Model Training, LinkedIn's approach: labels come from search logs as well as human evaluation.
  • Fair Pairs and Easy Negatives: flip adjacent result pairs at random to collect unbiased preferences from clicks [Radlinski and Joachims, 2006]. Sample easy negatives from the bottom results, but watch out for variable-length result sets; a compromise is to sample from, e.g., page 10.
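A hedged sketch of the two training-data ideas on this slide: a FairPairs-style random flip of adjacent results, and easy-negative sampling from deep in the result list rather than the very bottom. All parameters (flip probability, rank cutoff) are illustrative assumptions, not LinkedIn's settings.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Toy sketch of FairPairs-style flipping and easy-negative sampling for training data.
public class TrainingDataSketch {
    static final Random RNG = new Random(42);

    // Randomly swap each adjacent pair (i, i+1) before showing results, so a click
    // within a pair yields a preference that is not purely position-biased.
    static List<String> fairPairs(List<String> ranked) {
        List<String> shown = new ArrayList<>(ranked);
        for (int i = 0; i + 1 < shown.size(); i += 2) {
            if (RNG.nextBoolean()) {
                Collections.swap(shown, i, i + 1);
            }
        }
        return shown;
    }

    // Sample "easy" negatives from deep in the result list (e.g., around rank 100)
    // instead of the very bottom, to cope with variable-length result sets.
    static List<String> easyNegatives(List<String> ranked, int fromRank, int count) {
        int start = Math.min(fromRank, Math.max(0, ranked.size() - count));
        return ranked.subList(start, Math.min(start + count, ranked.size()));
    }

    public static void main(String[] args) {
        List<String> ranked = new ArrayList<>();
        for (int i = 1; i <= 200; i++) ranked.add("doc" + i);
        System.out.println(fairPairs(ranked).subList(0, 4));   // top results, some pairs flipped
        System.out.println(easyNegatives(ranked, 100, 3));     // [doc101, doc102, doc103]
    }
}
```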
  • Model Selection: select the model based on user and query features (e.g., person-name queries, recruiters making skills queries). The resulting model is a tree with logistic regression leaves; internal nodes test features (e.g., x2 = ?, x10 < 0.1234?) and each leaf is a regression of the form a0 + a1·P(x1) + … + an·Q(xn). Only one regression model is evaluated for each document.
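A minimal sketch of such a model tree: internal nodes route on query/user features, and each leaf is a logistic regression, so exactly one regression is evaluated per document. The splits, feature meanings, and weights below are made up for illustration.

```java
// Toy model tree: internal nodes test query/user features, leaves are logistic regressions.
public class ModelTree {
    interface Node {
        double score(double[] x);
    }

    // Leaf: sigmoid(b0 + b1*x1 + ... + bn*xn)
    static Node leaf(double bias, double... weights) {
        return x -> {
            double z = bias;
            for (int i = 0; i < weights.length; i++) z += weights[i] * x[i];
            return 1.0 / (1.0 + Math.exp(-z));
        };
    }

    // Internal node: route to the left or right subtree based on one feature test.
    static Node split(int featureIndex, double threshold, Node left, Node right) {
        return x -> (x[featureIndex] < threshold ? left : right).score(x);
    }

    public static void main(String[] args) {
        // Hypothetical features: x[0] = "is person-name query", x[1] = "searcher is a recruiter",
        // x[2..] = document features; all values and weights are illustrative.
        Node model = split(0, 0.5,
            split(1, 0.5,
                leaf(-1.0, 0.0, 0.0, 0.9, 0.4),    // generic queries
                leaf(-0.5, 0.0, 0.0, 1.5, 0.2)),   // recruiter skills queries
            leaf(0.0, 0.0, 0.0, 0.3, 2.0));        // person-name queries

        double[] doc = {1.0, 0.0, 0.7, 0.9};       // a person-name query example
        System.out.println(model.score(doc));      // only one leaf regression is evaluated
    }
}
```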
  • Summary. What is LinkedIn search and why should you care? LinkedIn search enables the participants in the economic graph to find and be found. What are our systems challenges? Indexing rich, structured content; retrieving using global and social factors; real-time updates. What are our relevance challenges? Query understanding and personalized machine-learned ranking models.
  • Contact: Asif Makhani (amakhani@linkedin.com, https://linkedin.com/in/asifmakhani) and Daniel Tunkelang (dtunkelang@linkedin.com, https://linkedin.com/in/dtunkelang).