Presented to the Bay Area Search Meetup on February 26, 2014
http://www.meetup.com/Bay-Area-Search/events/136150622/
At LinkedIn, we face a number of challenges in delivering high-quality search results to 277M+ members. Our results are highly personalized, requiring us to build machine-learned relevance models that combine document, query, and user features. And our emphasis on entities (names, companies, job titles, etc.) affects how we process and understand queries. In this talk, we'll discuss these challenges in detail and describe some of the solutions we are building to address them.
Speakers:
Satya Kanduri has worked on LinkedIn search relevance since 2011. Most recently he led the development of LinkedIn's machine-learned ranking platform. He previously worked at Microsoft, improving relevance for Bing Product Search. He has an MS in Computer Science from the University of Nebraska - Lincoln, and a BE in Computer Science from the Osmania University College of Engineering.
Abhimanyu Lad has worked at LinkedIn as a software engineer and data scientist since 2011. He has worked on a variety of relevance and query understanding problems, including query intent prediction, query suggestion, and spelling correction. He has a PhD in Computer Science from CMU, where he worked on developing machine learning techniques for diversifying search results.
15. SPELLING OUT THE DETAILS
Signals for spelling correction, built over people names, companies, and titles:
N-grams: marissa => ma ar ri is ss sa
Metaphone: mark/marc => MRK
Co-occurrence counts from past queries: marissa:mayer = 1000
(e.g., the query [marisa meyer yahoo] ties the misspellings marisa and meyer to marissa and mayer)
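The n-gram signal above can be sketched as follows. This is an illustrative toy: the `suggest` helper, the tiny vocabulary, and the 0.4 threshold are assumptions for the sketch, not LinkedIn's actual pipeline.

```python
def char_ngrams(word, n=2):
    """Character n-grams as on the slide: marissa -> ma ar ri is ss sa."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def ngram_similarity(a, b, n=2):
    """Jaccard overlap of character n-gram sets: a simple fuzzy-match score."""
    sa, sb = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def suggest(query_term, vocabulary, min_sim=0.4):
    """Rank vocabulary terms (e.g., names from profiles) by n-gram overlap."""
    scored = [(ngram_similarity(query_term, v), v) for v in vocabulary]
    return [v for s, v in sorted(scored, reverse=True) if s >= min_sim]

print(char_ngrams("marissa"))                        # ['ma', 'ar', 'ri', 'is', 'ss', 'sa']
print(suggest("marisa", ["marissa", "mark", "mayer"]))  # ['marissa']
```

In a real system the same idea would be combined with the Metaphone keys and co-occurrence counts above to rank candidate corrections.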
16. SPELLING OUT THE DETAILS
PROBLEM: The corpus as well as the query logs contain many spelling errors
Certain spelling errors are quite frequent,
while genuine words (especially names) may be infrequent
17. SPELLING OUT THE DETAILS
PROBLEM: Corpus as well as query logs contain many spelling errors
SOLUTION: Use query chains to infer correct spelling
Example chains from the query logs:
[product manger] => [product manager] => CLICK
[marissa mayer] => CLICK
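The query-chain idea can be sketched as counting reformulation-then-click pairs in session logs. The `sessions` data and its shape are invented for illustration; a production system would aggregate over far more sessions and normalize the counts.

```python
from collections import Counter

# Hypothetical session logs: each session is the sequence of queries a user
# issued, with clicked=True on the query that finally produced a click.
sessions = [
    [("product manger", False), ("product manager", True)],
    [("product manger", False), ("product manager", True)],
    [("marisa meyer", False), ("marissa mayer", True)],
]

corrections = Counter()
for session in sessions:
    for (q1, clicked1), (q2, clicked2) in zip(session, session[1:]):
        # A reformulation that ends in a click is evidence that q2
        # is the intended spelling of q1.
        if not clicked1 and clicked2:
            corrections[(q1, q2)] += 1

print(corrections.most_common(1))
# [(('product manger', 'product manager'), 2)]
```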
19. QUERY TAGGING
IDENTIFYING ENTITIES IN THE QUERY
Each query segment is resolved to a typed entity ID:
TITLE-237: software engineer (software developer, programmer, …)
CO-1441: Google Inc. (Industry: Internet)
GEO-7583: Country: US, Lat: 42.3482 N, Long: 75.1890 W
(RECOGNIZED TAGS: NAME, TITLE, COMPANY, SCHOOL, GEO, SKILL)
27. QUERY TAGGING : SEQUENTIAL MODEL
TRAINING
EMISSION PROBABILITIES
(Learned from user profiles)
TRANSITION PROBABILITIES
(Learned from query logs)
28. QUERY TAGGING : SEQUENTIAL MODEL
INFERENCE
Given a query, find the most likely sequence of tags
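The sequential model above can be sketched with a toy Viterbi decoder. The tag set, emission table, and transition table below are invented placeholders; as the slides say, real emissions are learned from user profiles and real transitions from query logs.

```python
TAGS = ["TITLE", "CO", "GEO"]

# P(word | tag): toy values standing in for statistics from user profiles.
emission = {
    "TITLE": {"software": 0.5, "engineer": 0.5},
    "CO":    {"google": 0.9, "engineer": 0.1},
    "GEO":   {"us": 1.0},
}
# P(tag_i | tag_{i-1}): toy values standing in for query-log statistics.
transition = {
    None:    {"TITLE": 0.5, "CO": 0.3, "GEO": 0.2},
    "TITLE": {"TITLE": 0.6, "CO": 0.3, "GEO": 0.1},
    "CO":    {"TITLE": 0.1, "CO": 0.5, "GEO": 0.4},
    "GEO":   {"TITLE": 0.1, "CO": 0.2, "GEO": 0.7},
}

def viterbi(words):
    """Given a query, find the most likely sequence of tags (slide 28)."""
    # best[tag] = (probability, tag sequence ending in tag)
    best = {t: (transition[None][t] * emission[t].get(words[0], 1e-6), [t])
            for t in TAGS}
    for w in words[1:]:
        nxt = {}
        for t in TAGS:
            p, prev = max((best[s][0] * transition[s][t], s) for s in TAGS)
            nxt[t] = (p * emission[t].get(w, 1e-6), best[prev][1] + [t])
        best = nxt
    return max(best.values())[1]

print(viterbi(["software", "engineer", "google"]))
# ['TITLE', 'TITLE', 'CO']
```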
38. QUERY UNDERSTANDING: SUMMARY
High degree of structure in queries as well as corpus (user profiles, job postings, companies, …)
Query understanding allows us to optimally balance recall and precision by supporting entity-oriented search
Query tagging and query log analysis play a big role in query understanding
44. RANKING IS COMPLICATED
Seemingly similar queries require dissimilar scoring functions
Personalization matters
– Multiple dimensions to personalize on
– Dimensions vary with query class
54. CLICKS AS TRAINING DATA
Approach: Clicked = Relevant, Not-Clicked = Not Relevant
Users scan results from top to bottom, so good results
the user never saw are marked Not Relevant.
Unfairly penalized?
58. CLICKS AS TRAINING DATA
Approach: Clicked = Relevant, Skipped = Not Relevant
• Only penalize results that the user has seen but ignored
• Risks inverting the model by overweighting low-ranked results
60. FAIR PAIRS
• Fair Pairs: randomize the order of adjacent result pairs;
  Clicked = Relevant, Skipped = Not Relevant
  [Radlinski and Joachims, AAAI’06]
• Great at dealing with position bias
• Does not invert models
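A minimal sketch of the Fair Pairs scheme cited above. The function names and the detail of pairing adjacent positions are assumptions for illustration; see Radlinski and Joachims (AAAI'06) for the actual algorithm.

```python
import random

def fair_pairs_presentation(results, rng=random):
    """Group adjacent results into pairs and randomly flip each pair."""
    presented, flips = [], []
    for i in range(0, len(results) - 1, 2):
        pair = results[i:i + 2]
        flipped = rng.random() < 0.5
        presented.extend(reversed(pair) if flipped else pair)
        flips.append(flipped)
    if len(results) % 2:                 # odd list: last result stays put
        presented.append(results[-1])
    return presented, flips

def harvest_labels(presented, clicked):
    """Clicked = Relevant; the skipped partner above it in a pair = Not Relevant."""
    labels = []
    for i in range(0, len(presented) - 1, 2):
        top, bottom = presented[i], presented[i + 1]
        # A click on the lower result while the upper was skipped is a
        # position-bias-free preference, because the within-pair order
        # was randomized before presentation.
        if bottom in clicked and top not in clicked:
            labels.append((bottom, "R"))
            labels.append((top, "NR"))
    return labels

print(harvest_labels(["a", "b", "c", "d"], clicked={"b"}))
# [('b', 'R'), ('a', 'NR')]
```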
63. EASY NEGATIVES
• Assumption: A decent current model would push out bad results to the very end.
• Easy Negatives: Some of the results at the end are picked up as negative examples
64. EASY NEGATIVES
Result sets vary widely in size (2 pages vs. 90+ pages),
so use strategies that sample across the feature space:
• Searches with fewer results are preferred
• Always sample from a given page, say page 10
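The easy-negatives sampling can be sketched as follows; the page size of 10, the function name, and its parameters are illustrative assumptions.

```python
PAGE_SIZE = 10   # assumed results per page for this sketch

def easy_negatives(ranked_ids, page=10, k=3):
    """Sample up to k negatives from a fixed deep page of the ranking."""
    start = (page - 1) * PAGE_SIZE
    # If the current model is decent, results this deep are very likely bad.
    return ranked_ids[start:start + PAGE_SIZE][:k]

ranked = [f"doc{i}" for i in range(200)]   # a 20-page result list
print(easy_negatives(ranked))              # ['doc90', 'doc91', 'doc92']
```

Short result lists never reach page 10, which is consistent with the slide's preference for sampling from searches with fewer results; callers can lower `page` for those.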
65. PUTTING IT ALL TOGETHER
Human evaluation is not practical for personalized searches
Learn from user behavior
– Multiple heuristics depending on the need
– Different pros and cons
66. EFFICIENCY VS EXPRESSIVENESS
Build a tree with logistic regression leaves.
By restricting decision nodes to (Query, User) segments,
only one regression model needs to be evaluated for each document.
Decision nodes test segment features (X2 = ?, X4 = ?), and each leaf
holds a linear scoring model, e.g.:
  α0 + α1·P(x1) + … + αn·Q(xn)
  β0 + β1·T(x1) + … + βn·xn
  γ0 + γ1·R(x1) + … + γn·Q(xn)
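The segmented-model idea can be sketched as follows. The segment names, features, and weights are invented for illustration; the point is only that each (query, user) pair is routed to a single logistic regression leaf.

```python
import math

# Illustrative per-segment models: bias plus per-feature weights.
# These segments and numbers are made up, not LinkedIn's actual models.
SEGMENT_MODELS = {
    "name_query":  {"bias": -1.0, "name_match": 3.0, "popularity": 1.5},
    "skill_query": {"bias": -0.5, "skill_match": 2.5, "network_dist": -1.0},
}

def segment_of(query, user):
    """Decision nodes restricted to (Query, User) segments."""
    return "name_query" if query.get("is_name") else "skill_query"

def score(query, user, doc_features):
    """Route to one segment, then evaluate only that segment's logistic leaf."""
    model = SEGMENT_MODELS[segment_of(query, user)]
    z = model["bias"] + sum(w * doc_features.get(f, 0.0)
                            for f, w in model.items() if f != "bias")
    return 1.0 / (1.0 + math.exp(-z))

s = score({"is_name": True}, {}, {"name_match": 1.0, "popularity": 0.5})
```

Note that features irrelevant to a segment (e.g., industry overlap for skill queries) simply never appear in that segment's leaf, so they are never computed.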
70. SUMMARY
Query understanding leverages the rich structure of LinkedIn’s content and information needs.
Query tagging and rewriting allow us to deliver precision and recall.
For ranking, personalization is both the biggest challenge and the core of our solution.
Segmenting relevance models by query type helps us efficiently address the diversity of search needs.
There’s a high degree of structure in our users’ queries as well as our corpus (i.e., user profiles, job postings, companies, etc.). Query understanding allows us to take advantage of this structure to do entity-oriented search and optimally balance recall and precision. Finally, our ability to understand and intelligently rewrite queries depends heavily on two things: query tagging (the ability to identify entities in the query) and query log analysis (analyzing how users reformulate their queries).
Thanks, Abhi. Today I’ll be talking about some of the ranking challenges we face here at LinkedIn. Throughout this talk I’ll focus on People Search, but the challenges apply to all the search problems we strive to solve. To get a sense of the ranking problem, let’s look at some examples.
One of the more frequent types of queries we see in people search is name queries. In this example, the query happens to be [richard branson]. While there are other Richard Bransons on LinkedIn, most likely the searcher was looking for the founder of the Virgin Group. To get this search right, we only need two things: the name has to match the query terms, and the rank should be based on global popularity. That is pretty straightforward. Now let’s look at another example.
This is also a name query, [kevin scott], but there are multiple Kevin Scotts on LinkedIn. Looking at the result sets, the left one is mostly clustered around the San Francisco Bay Area and the right one around the Atlanta area. Which of the two is relevant? It’s hard to say. If I were issuing this search, I’d say the left result set is better, since I work at LinkedIn and live in the SF Bay Area. On the other hand, for someone who works at Home Depot in the Atlanta area, the right result set is probably better. (We could be looking for any Kevin Scott local to us, but given the global prior we put the respective SVPs on top.) This example shows two dimensions of personalization: company and location. Are there more factors we could personalize on?
There is no query here; I chose to use facets in this case to select the results precisely. We saw in the previous slide that personalization involves more than one dimension, so let’s look at an example to see whether there are more dimensions that influence it. Say I am looking for someone working at NetApp. Apart from the location personalization visible in the second result, there are two other important dimensions: network distance in the first result and industry overlap in the third. So the question now is: by personalizing on all these dimensions (company, location, network distance, industry, etc.) for every query, can we obtain the best set of results? Let’s find out with another example.
One of the unique value propositions of LinkedIn is the ability to search for people possessing a skill, e.g. [ballet]. Most of these results are from the performing arts industry. As you can probably guess, I do not work in performing arts and have no skills related to it, so personalization based on industry is not applicable here; if you look carefully, though, the results are still personalized by my network distance. Not every feature is useful in every query class: for skill searches, industry overlap did not turn out to be a significant feature, whereas for name searches it is one of the most significant.
To recap what we saw in the examples:
Considering all these factors we need to take into account, how do we train a personalized machine learning model?
Of course I have severely simplified the process, but this is just to give those of you who aren’t familiar with machine learning an idea of how machine-learned models are trained for ranking.
Most of the work typically revolves around sampling documents, engineering features, and obtaining truth data. In the next few slides, we’ll explore different ways in which we can get labels. The more important part is the data (the unreasonable effectiveness of data); personalization does not mean training a separate model for each of our 270 million members.
Let’s say a recruiter is looking for someone with the skill [oracle database]. Is this still the right result?
A conventional, non-personalized model is a function of the document and the query. In LinkedIn’s case, we have an additional “user” dimension, because we are always personalized: our score is a function of the document, the query, and the user.
We cannot use human labels.
We don’t simply label lower-ranked results as not relevant; that would throw our own ranking function under the bus. There might be a good reason they are ranked lower, but there might also be good results among them.
All the results the user didn’t evaluate look better than the skipped results before the one that was clicked. If the original model was pretty good, that gives a lot of credit to the unseen ones.
Sampling bias: the data is concentrated in the top results, so the model does not learn how to differentiate really poor results.
Why is this okay, given such an unrepresentative sample?
The best models for learning to rank are generally complex, like ensembles of trees. These models are expensive to evaluate, especially in first-pass rankers, which often need to score hundreds of thousands of results for every query. The approach we use is to first train complex models, then use insights from those to train simple models. The trade-off is between expressiveness/complexity and efficiency.
We potentially score hundreds of thousands of documents per query; there is a trade-off between expressiveness and evaluation cost.
The decision nodes can also be based on user segments, such as whether the user is a recruiter or a regular member. Industry overlap is not required in skill queries; as can be seen, we can avoid computing IndustryOverlap for a skill query.
We test offline as much as possible, but online evaluation is the litmus test. Conventional ways to measure are CTR, MRR, P/R, etc. Another is interleaving, where two result sets are compared side by side…
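The interleaving idea mentioned above can be sketched as follows. This is a hedged illustration of one standard variant, team-draft interleaving; the talk does not say which variant is used, and all names here are invented.

```python
import random

def team_draft(a, b, rng=random):
    """Interleave rankings a and b, remembering which ranker drafted each doc."""
    interleaved, team = [], {}
    while True:
        progressed = False
        order = [("A", a), ("B", b)]
        if rng.random() < 0.5:          # coin flip: who drafts first this round
            order.reverse()
        for name, ranking in order:
            for doc in ranking:
                if doc not in team:     # draft the best not-yet-taken result
                    interleaved.append(doc)
                    team[doc] = name
                    progressed = True
                    break
        if not progressed:              # both rankings exhausted
            return interleaved, team

def winner(team, clicked):
    """The ranker whose drafted results attracted more clicks wins the query."""
    counts = {"A": 0, "B": 0}
    for doc in clicked:
        if doc in team:
            counts[team[doc]] += 1
    if counts["A"] == counts["B"]:
        return "tie"
    return "A" if counts["A"] > counts["B"] else "B"
```

Aggregating the per-query winners over many queries gives an online comparison of the two rankers that is far more sensitive than raw CTR.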