This document provides an agenda for a tutorial on candidate selection techniques for large-scale personalized search and recommender systems. The tutorial covers the lifetime of a query, index building, query understanding, and candidate selection and retrieval, and includes a case study on LinkedIn job search and recommendations. Attendees will learn about the building blocks of large-scale search systems, query processing, and candidate selection techniques, and will build a prototype search system. The result is a full-stack search system on a news dataset built with open-source tools.
SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Recommender Systems
1. Candidate Selection for Large Scale
Personalized Search and Recommender Systems
Dhruv Arya, Aman Grover, Yiqun Liu, Ganesh Venkataraman, Krishnaram Kenthapadi
2. Where to find information
• Code – https://github.com/candidate-selection-tutorial-sigir2017/candidate-selection-tutorial
• Slack – https://goo.gl/WyNY5g
• Slides – Will be posted on Slideshare
3. What you will learn today
• At the end of the tutorial attendees should:
• Understand the need for candidate selection, techniques and
challenges/tradeoffs faced when dealing with large scale personalized
systems
• Get a broad overview of:
• Building Blocks of a Large Scale Search System
• Query Processing and Understanding
• Candidate Selection Techniques
• Build a prototype implementation of a search system with different candidate
selection queries and ranking
4. What will graduation look like?
• Full stack search system on a
news dataset
• Built with open-source tools: Apache Solr, Python, and Stanford NER
• Ability to extend/modify query
construction and ranking
6. Agenda
• Lifetime of a Query
• Index Building
• Query Understanding
• Candidate Selection & Retrieval
• Case Study - LinkedIn Job Search & Recommendations
• Hands on tutorial with News Aggregator Dataset
o Build index with NER fields
o Explore Query Rewriting for Fast Retrieval
7. Preliminary Terminologies
• The Search Index
• Forward Index:
  • Mapping from documents to their content
  • Used in scoring
• Inverted Index:
  • Mapping from (search) terms to the list of documents they occur in, called a postings list
  • Used in retrieval and scoring
Document → Words
Document 1: the, cow, says, moo
Document 2: the, cat, and, the, hat
Document 3: the, dish, ran, away, with, the, spoon
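The forward index above maps documents to terms; inverting it yields the term → documents mapping. A minimal sketch in Python (a toy illustration of the data structure, not how Solr actually stores it):

```python
# Toy forward index over the three example documents.
forward_index = {
    "doc1": ["the", "cow", "says", "moo"],
    "doc2": ["the", "cat", "and", "the", "hat"],
    "doc3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}

def build_inverted_index(forward_index):
    """Map each term to the sorted list of documents it occurs in (its postings list)."""
    inverted = {}
    for doc_id, terms in forward_index.items():
        for term in set(terms):          # set(): one posting per (term, doc) pair
            inverted.setdefault(term, set()).add(doc_id)
    return {term: sorted(docs) for term, docs in inverted.items()}

inverted_index = build_inverted_index(forward_index)
# "the" occurs in every document; "cow" only in doc1.
```

Retrieval for a single term is then just a postings-list lookup, which is why the inverted index is the workhorse of retrieval.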
8. Preliminary Terminologies
• Postings List:
  • Simple case: a list of the documents that contain the individual term
  • More flexible: a list of docID: <position1, position2, …> entries, plus the term frequency
• Shard: one partition of a document-partitioned index (e.g., shard0, shard1)
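The "more flexible" positional postings format can be sketched the same way; the term frequency falls out as the length of the position list (again a toy structure, not a real on-disk format):

```python
def build_positional_postings(docs):
    """For each term, map docID -> list of positions; tf is len(positions)."""
    postings = {}
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            postings.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return postings

docs = {"doc2": ["the", "cat", "and", "the", "hat"]}
postings = build_positional_postings(docs)
tf = len(postings["the"]["doc2"])   # term frequency of "the" in doc2
```

Positions make phrase and proximity queries possible, at the cost of a larger index.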
9. Lifetime of a Query
[Diagram: the user's query ("Quantitative Software Engineer") flows from the browser/device to the search frontend, then as query + metadata to the backend federator, which sends a structured query + metadata to a vertical broker (…more verticals in parallel). The query is rewritten as:
(TITLE = quantitative software engineer) OR (SKILL = quantitative AND TITLE = software engineer)]
11. Lifetime of a Query
[Diagram: the same flow, continued. The vertical broker fans the structured query + metadata out to searchers, one per shard (…more shards, …more verticals). The rewritten query against the job index:
((jobTitle OR jobDescription) = quantitative software engineer) OR
((jobSkills OR jobDescription OR jobTitle) = quantitative AND
(jobTitle OR jobDescription) = software engineer || developer || programmer || software developer || software engineering || software programmer)]
12. Searcher
• Operates on a single shard of the inverted index.
• Receives the rewritten query + metadata from the broker.
• Retrieves matching documents from the inverted index.
• The documents are scored, and the top-scoring documents are returned to the broker.
13. Scoring
• Boolean Model (BM)
  • Scores are 1 or 0
  • Fast and easy to implement
  • But hard to rank with, since terms are not weighted
• Vector Space Model (VSM) (cosine similarity of query q and document d)
  • TF-IDF
  • BM25
  • Weighted terms
  • Assumes independence between terms
• Probabilistic Model (P(Relevant = 1 | Q = q, D = d))
  • Binary Independence Model
  • Logistic Regression
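As a concrete illustration of weighted-term scoring, here is a minimal BM25 scorer over a toy corpus (k1 = 1.2 and b = 0.75 are the common defaults, an assumption here, not something the tutorial prescribes):

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document (a list of terms) against a query with BM25."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)   # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        tf = doc_terms.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [["quality", "assurance", "engineer"],
          ["software", "engineer"],
          ["landscape", "architect"]]
q = ["quality", "engineer"]
# The QA-engineer document should outscore the architect document for this query.
```

Rarer terms ("quality") get a higher idf and therefore contribute more than common ones, which is exactly what the boolean model cannot express.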
14. Lifetime of a Query
[Diagram: the return path. Each searcher scores its shard and sends the results from the shard back to the broker; the broker merges them and returns results to the federator, and ultimately to the search frontend and the user's browser/device.]
15. Federator/Broker: Reranker
• Diversity-based reranking
  • Maximize a weighted sum of relevance scores, penalized by the similarity between documents
• Business-rule-based reranking
  • Example: maximize revenue
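The diversity objective above (relevance minus a similarity penalty) is essentially Maximal Marginal Relevance. A greedy sketch with a pluggable similarity function (the lambda weight, toy scores, and similarity values are illustrative assumptions):

```python
def mmr_rerank(candidates, relevance, similarity, lam=0.7, k=2):
    """Greedily pick the doc maximizing lam*relevance - (1-lam)*max-similarity to picks so far."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def mmr(doc):
            penalty = max((similarity(doc, s) for s in selected), default=0.0)
            return lam * relevance[doc] - (1 - lam) * penalty
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example: d1 and d2 are near-duplicates; d3 is different but slightly less relevant.
rel = {"d1": 1.0, "d2": 0.95, "d3": 0.8}
sim = lambda a, b: 0.9 if {a, b} == {"d1", "d2"} else 0.1
```

With these numbers the reranker skips the near-duplicate d2 in favor of the more diverse d3.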
17. Agenda
• Lifetime of a Query
• Index Building
• Query Understanding
• Candidate Selection & Retrieval
• Case Study - LinkedIn Job Search & Recommendations
• Hands on tutorial with News Aggregator Dataset
o Build index with NER fields
o Explore Query Rewriting for Fast Retrieval
18. Index Building
[Pipeline, illustrated on a sample job posting: Collect the documents to be indexed → Tokenization and named entity recognition → Linguistic preprocessing of tokens → Index the documents that each term occurs in.]
20. Index Building
[The same pipeline; at the NER step, spans of the sample job posting are tagged with entity types such as title and skill.]
21. Index Building: Named Entity Recognition (NER)
• Hidden Markov Models
  • Assume the Markov property
  • Model the joint P(X, Y), where X is the observed data and Y the label of X
• Maximum Entropy Markov Models
  • Model the conditional P(Y|X)
  • Suffer from the label-bias problem
• CRF Models
  • Model the conditional P(Y|X)
  • Avoid the label-bias problem and HMM-style independence assumptions on the observations
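To make the HMM option concrete, here is a tiny Viterbi decoder over hand-set start, transition, and emission probabilities. The SKILL/TITLE tags and all probabilities are invented for illustration; a real NER model (e.g., Stanford NER's CRF) learns its parameters from labeled data:

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Most likely tag sequence for `words` under an HMM."""
    # Each cell holds (probability of best path ending in state s, that path).
    V = [{s: (start_p[s] * emit_p[s].get(words[0], 1e-6), [s]) for s in states}]
    for word in words[1:]:
        layer = {}
        for s in states:
            prob, path = max(
                ((V[-1][prev][0] * trans_p[prev][s] * emit_p[s].get(word, 1e-6),
                  V[-1][prev][1] + [s])
                 for prev in states),
                key=lambda t: t[0])
            layer[s] = (prob, path)
        V.append(layer)
    return max(V[-1].values(), key=lambda t: t[0])[1]

states = ["SKILL", "TITLE"]
start_p = {"SKILL": 0.5, "TITLE": 0.5}
trans_p = {"SKILL": {"SKILL": 0.4, "TITLE": 0.6},
           "TITLE": {"SKILL": 0.2, "TITLE": 0.8}}
emit_p = {"SKILL": {"java": 0.8, "developer": 0.05},
          "TITLE": {"java": 0.1, "developer": 0.7}}
tags = viterbi(["java", "developer"], states, start_p, trans_p, emit_p)
```

With these toy parameters, "java developer" decodes to SKILL TITLE, mirroring the title/skill tagging in the pipeline above.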
22. Index Building
[The same pipeline; linguistic preprocessing normalizes tokens in the sample job posting, e.g. expanding abbreviations such as BSc to Bachelor of Science (likewise NLP).]
23. Index Building: Index Compression
• Motivation: keep as much of the dictionary and postings lists in memory as possible
• Dictionary compression
  • Store the dictionary as one long string; the pointer to the next word marks the end of the current word
  • Blocked storage: store pointers to every k-th term string
[Example: the dictionary string "…systile syzygetic syzygial syzygy szaibelyite szczecin szomo…" alongside a table of term frequencies (33, 29, 44, 126), postings pointers, and term pointers.]
24. Index Building: Index Compression
• Postings list compression
  • Goal: store the postings lists of more frequent terms with fewer bits
  • It suffices to store the gaps between consecutive docIDs
  • Variable-byte encoding and gamma encoding
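The gap + variable-byte scheme can be sketched as follows: sorted docIDs are delta-encoded, and each gap is split into 7-bit chunks with the high bit marking the final byte of a gap (the standard VB layout). Frequent terms have small gaps, which fit in a single byte instead of four:

```python
def vb_encode(doc_ids):
    """Delta-encode sorted docIDs, then variable-byte-encode each gap."""
    out, prev = bytearray(), 0
    for doc_id in doc_ids:
        gap, prev = doc_id - prev, doc_id
        chunks = []
        while True:
            chunks.insert(0, gap % 128)
            if gap < 128:
                break
            gap //= 128
        chunks[-1] += 128          # high bit set marks the last byte of this gap
        out.extend(chunks)
    return bytes(out)

def vb_decode(data):
    doc_ids, n, prev = [], 0, 0
    for byte in data:
        if byte < 128:
            n = n * 128 + byte     # continuation byte
        else:
            n = n * 128 + (byte - 128)
            prev += n              # undo the delta encoding
            doc_ids.append(prev)
            n = 0
    return doc_ids
```

A run of consecutive docIDs like [1, 2, 3] compresses to one byte per posting.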
25. Agenda
• Lifetime of a Query
• Index Building
• Query Understanding
• Candidate Selection & Retrieval
• Case Study - LinkedIn Job Search & Recommendations
• Hands on tutorial with News Aggregator Dataset
o Build index with NER fields
o Explore Query Rewriting for Fast Retrieval
26. Query Segmentation
• The task of dividing the search query into segments (tokens/phrases) by identifying the semantic entities present in the search query.
• Helps improve the precision of the candidate set by utilizing query segments in query rewriting.
[Example segmentations of "Oracle Java Application Developer":
(Oracle) (Java Application Developer)
(Oracle) (Java) (Application Developer)
(Oracle) (Java Application) (Developer)]
27. Query Tagging & Annotation
• Tag query segments based on recognized entity tags.
• Annotate the tagged segments with:
  • Standardized identifiers
  • Related entities
  • Entity-specific metadata
[Example taggings:
COMPANY = Oracle, TITLE = Java Application Developer
COMPANY = Oracle, SKILL = Java, TITLE = Application Developer]
28. Query Segmentation Approaches
• Dictionary Based
  • A simple approach that utilizes corpus data to segment the phrases/tokens in the query.
  • A co-occurrence-based score is assigned to disambiguate between multiple segmentations.
• Machine Learning Based
  • Model the problem as a sequence-to-sequence learning problem or a classification problem.
  • Utilize correctly labeled segments from corpus / human evaluation data to learn the model.
  • Examples: HMM, CRF
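A minimal dictionary-based segmenter can be written as a dynamic program that prefers splits covered by known phrases. The phrase scores below stand in for the co-occurrence scores described above and are invented for illustration:

```python
def segment(tokens, phrase_scores, max_len=3):
    """Best-scoring split of tokens into segments; known phrases outscore fallbacks."""
    n = len(tokens)
    best = [(0.0, [])] + [None] * n      # best[i]: (score, segments) for tokens[:i]
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            if best[j] is None:
                continue
            phrase = " ".join(tokens[j:i])
            # Unknown single tokens get a small score; unknown multi-word spans are penalized.
            score = phrase_scores.get(phrase, 0.1 if i - j == 1 else -1.0)
            cand_score = best[j][0] + score
            if best[i] is None or cand_score > best[i][0]:
                best[i] = (cand_score, best[j][1] + [phrase])
    return best[n][1]

# Hypothetical co-occurrence-derived scores:
scores = {"oracle": 0.9, "java": 0.6, "application developer": 1.2,
          "java application developer": 1.0}
seg = segment(["oracle", "java", "application", "developer"], scores)
```

With these scores the segmenter picks (oracle)(java)(application developer), the middle segmentation from the earlier example.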
29. Use in Candidate Selection
• Query segments, instead of individual tokens, are used as semantic units for matching document fields and retrieving the correct documents.
• Precise candidates are retrieved from a reduced search space, resulting in improved latency.
• Multiple segmentations can be utilized to diversify the retrieved result set.
[Example: COMPANY = Oracle, TITLE = Java Application Developer matches the Company and Title fields; SKILL = Oracle, SKILL = Java, TITLE = Application Developer matches the Skills and Title fields.]
30. Query Expansion
• The task of adding tokens/phrases to the query to improve recall.
• Includes:
  • Synonyms
  • Abbreviations
  • Related terms
• Critical to understand when to expand.
[Example expansion (synonyms and related entities were color-coded green and blue on the slide):
COMPANY = Oracle OR NetSuite OR Taleo OR Sun Microsystems OR …
SKILL = Java OR Java EE OR J2EE OR JVM OR JRE OR JDK …
TITLE = Application Developer OR Software Engineer OR Software Developer OR Programmer …]
31. Synonyms and Abbreviations
• Dictionary-based approach
  • Use a pruned dictionary of common abbreviations and synonyms
  • Ex. CEO – Chief Executive Officer, VP – Vice President
  • Simple but limiting, and requires post-processing for disambiguation, e.g. CA – California vs. Chartered Accountant?
• Utilize query reformulations and search sessions to extract synonyms
  • Ex. Software Engineer OR Software Developer
  • Ex. Application Developer -> Software Engineer -> Software Developer (same session)
32. Word Embeddings for Synonym Expansion
• Word embeddings are projections of words into a dense, low-dimensional space.
• They allow capturing synonyms, lemmatizations, cross-language representations, and relatedness.
• Different unsupervised/supervised approaches to experiment with – Word2Vec, GloVe, DSSM, etc.
• Examples
  • softwareentwickler – software engineer, engineer software, software engineering
  • marketing – advertising, sales, digital marketing
  • senior – sr, snr, lead
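Synonym lookup in an embedding space is just nearest-neighbor search under cosine similarity. A sketch with tiny hand-made 3-d vectors (real systems would use learned Word2Vec/GloVe vectors with hundreds of dimensions; these directions are assumptions for illustration):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def nearest(word, vectors, k=2):
    """k most similar words to `word` by cosine similarity."""
    return sorted((w for w in vectors if w != word),
                  key=lambda w: cosine(vectors[word], vectors[w]),
                  reverse=True)[:k]

# Toy vectors: "senior", "sr", "lead" point in similar directions, "intern" does not.
vectors = {
    "senior": [0.9, 0.1, 0.0],
    "sr":     [0.85, 0.15, 0.05],
    "lead":   [0.7, 0.3, 0.1],
    "intern": [0.0, 0.2, 0.9],
}
```

This mirrors the "senior – sr, snr, lead" example above: the abbreviation and the related term surface as the nearest neighbors.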
33. Query Relaxation
• The task of removing tokens/phrases from the query to make it less restrictive and increase recall.
• Useful for searches with few results.
• Critical to understand when to relax, to balance precision and recall.
34. Which terms to relax?
• Term Importance
  • Utilize query reformulations to learn token importance by focusing on the addition and removal of tokens
  • Assign scores based on query co-occurrence, historical result counts, idf, etc.
  • Build a dictionary of terms and their importance
• Word Embeddings
  • Utilize word embeddings to find the closest similar queries
  • Can be done at the query level (qa ≈ qb) or at the entity level within the queries (ea ≈ eb)
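Given a dictionary of term-importance scores, relaxation can simply drop the least important term first. The scores below are hypothetical stand-ins for the signals named above (reformulations, co-occurrence, result counts, idf):

```python
def relax(query_terms, importance):
    """Drop the single least important term to make the query less restrictive."""
    drop = min(query_terms, key=lambda t: importance.get(t, 0.0))
    return [t for t in query_terms if t != drop]

# Hypothetical importance scores (in practice learned from reformulations,
# query co-occurrence, historical result counts, idf, ...):
importance = {"engineering": 0.2, "manager": 0.9, "machine": 0.8, "learning": 0.85}
relaxed = relax(["engineering", "manager", "machine", "learning"], importance)
```

With these scores this reproduces the later slide's example: "Engineering Manager Machine Learning" relaxes to "Manager Machine Learning".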
35. Using Query Understanding Techniques in
Personalized Recommendations
• The techniques described for the query are also applicable to the textual fields of the user profile
• Ex. Extracting the company and title from the user headline – "Software Engineer at LinkedIn"
• Ex. Relaxing the user title in the personalization query – "Engineering Manager Machine Learning" -> "Manager Machine Learning"
36. How to apply these techniques
• A function of relevance and the count of the final result set
• Query plans
• Early / late decision
• Federated search and blending
[Diagram: a query plan expands the user query q into variants – the original qo, expansions qe, relaxations qr – which are issued against the search index; result federation then blends the per-variant results r1, r2, … into a final result set rf.]
37. Agenda
• Lifetime of a Query
• Index Building
• Query Understanding
• Candidate Selection & Retrieval
• Case Study - LinkedIn Job Search & Recommendations
• Hands on tutorial with News Aggregator Dataset
o Build index with NER fields
o Explore Query Rewriting for Fast Retrieval
40. Need for Candidate Selection?
• Ranking is an expensive operation
• A lot of context, such as the user's past behavior and profile, can be used to select candidates
• Recommendations are highly personalized
• Users expect a fast, up-to-date, and contextually aware system
41. Generic Problem Formulation
• Build a first-pass filter that mimics the final ranking function
• Focus on recall, but don't forget precision
• CQ = f(EQ, EP, LP, U, C)
  • CQ: constructed query, EQ: explicit query, EP: explicit preference, LP: latent preference, U: user, C: context
• RF = f(U, CQ, D)
  • RF: ranking function, D: document
• Optimization goal: retrieve a candidate set M such that M is a superset of the top K documents ranked by the ranking function
42. Where can we do candidate selection
[Diagram: the user query passes through query rewriting (term match, decision trees, linear model) before top-K retrieval against the index; the indexer builds the index, and an offline training pipeline produces the model used for result ranking.]
43. Where can we do candidate selection
[Diagram: the same pipeline with candidate selection after retrieval: term-match retrieval against the index feeds a first-pass ranker (trained offline), followed by result ranking.]
44. Approaches to Candidate Selection
• Explicit query term matching [explicit query matching]
• A generalized linear model for learning weights for constructed query clauses [query rewriting]
45. Approaches to Candidate Selection
• Tree models to learn associations between entities in the query and in documents [query rewriting]
• Deep learning models for transforming queries and documents into a lower-dimensional space, then using a distance metric to filter out irrelevant documents [ranking]
46. Naive Candidate Selection
• Explicit Query Matching
• Retrieval using term matching
  • Works on a small index
  • Generates a lot of false positives
  • 'QA Architect' → "Landscape Architect", "Java Architect", "Software Test Architect"
• Retrieval using rules
  • Requires domain knowledge
  • (user_title ^ job_title) & (user_skill ^ job_skill)
  • Difficult to update
48. Understanding WAND Query
Query: "Quality Assurance Engineer"
WAND: "(Quality[5] AND Assurance[5] AND Engineer[1]) [10]"
A document matches only if the weights of its matching terms sum to at least the threshold 10: Quality + Assurance (5 + 5 = 10) ✅, all three terms (5 + 5 + 1 = 11) ✅, but Quality + Engineer (5 + 1 = 6) does not.
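The WAND threshold test can be sketched in a few lines. Note this only shows the matching condition; a real WAND implementation also uses the per-term upper bounds to skip ahead in the postings lists, which is where the speedup comes from:

```python
def wand_matches(doc_terms, weighted_terms, threshold):
    """True if the weights of the query terms present in the document sum to >= threshold."""
    score = sum(weight for term, weight in weighted_terms if term in doc_terms)
    return score >= threshold

# The slide's example: WAND("(Quality[5] AND Assurance[5] AND Engineer[1]) [10]")
query = [("quality", 5), ("assurance", 5), ("engineer", 1)]
# "quality assurance" alone reaches 10; "quality engineer" only reaches 6.
```

Because "engineer" carries weight 1, it can never push a document over the threshold on its own, so documents matching only the generic term are cheaply excluded.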
49. Why Generalized Linear Model
• Learns weights for query clauses
• Works for larger indexes and structured data with many entities
• Can be applied to:
  • Retrieval
  • Ranking
• Easier to explain, with good debugging capabilities
• Cannot capture interactions between clauses
50. How to apply?
Query: "Bryer Ice Cream" → COMPANY = bryer, CATEGORY = ice cream
Rewritten query: WAND(company:bryer[5] AND category:ice cream[5] AND description:ice cream[2] AND description:bryer[2]) [11]
[Pipeline: implement a WAND operator for your index → collect labeled data → generate query clauses → train the GLM → construct the query from the learnt weights → use the constructed query to retrieve results.]
51. Motivation for using Decision Trees
• Interactions are important
  • "Architect" → "QA Architect", "Landscape Architect"
  • Title match AND function match
• Used for query rewriting
• Complex interactions can be learnt
• An example constructed query would be:
  • (Title Match AND Function Match) OR (Function Match) OR (Seniority Match)
[Diagram: a decision tree splitting on Title Match, then on Seniority Match and Function Match, with Positive and Negative leaves.]
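Turning a trained tree into a rewritten query amounts to collecting the conditions along every path to a positive leaf and OR-ing the resulting AND-clauses. A sketch over a toy tree encoded as nested tuples (the tree shape is invented, and negated conditions are dropped for simplicity, matching the all-positive example query above):

```python
def positive_clauses(node, path=()):
    """Collect the AND-clause of satisfied conditions for every 'Positive' leaf."""
    if isinstance(node, str):                           # leaf: "Positive" / "Negative"
        return [path] if node == "Positive" else []
    feature, no_branch, yes_branch = node
    return (positive_clauses(no_branch, path) +        # condition not satisfied: not recorded
            positive_clauses(yes_branch, path + (feature,)))

# Toy tree: split on TitleMatch, then on FunctionMatch / SeniorityMatch.
tree = ("TitleMatch",
        ("FunctionMatch", "Negative", "Positive"),      # no title match
        ("SeniorityMatch", "Positive", "Positive"))     # title match

clauses = positive_clauses(tree)
rewritten = " OR ".join("(" + " AND ".join(c) + ")" for c in clauses if c)
```

For this toy tree the rewritten query is "(FunctionMatch) OR (TitleMatch) OR (TitleMatch AND SeniorityMatch)", the same shape as the slide's example.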
52. What ifs?
• What if we don't have an index?
• What if we don't have a way to determine entities?
• What if we don't have any knowledge of the data set?
53. Motivation for Deep Learning
• No structured information available, or only semi-structured information available
• Generate embeddings using raw content such as text, images, videos, etc.
• Tools like Word2Vec, GloVe, and DSSM can be used to generate the embeddings
• Use techniques like clustering and KNN to find candidates
54. Agenda
• Lifetime of a Query
• Index Building
• Query Understanding
• Candidate Selection & Retrieval
• Case Study - LinkedIn Job Search & Recommendations
• Hands on tutorial with News Aggregator Dataset
o Build index with NER fields
o Explore Query Rewriting for Fast Retrieval
55. Case Studies for Jobs at LinkedIn
• Jobs ecosystem at LinkedIn
  • Job Search
  • Job Recommendations
  • Similar Jobs
• Candidate selection techniques used at LinkedIn
  • Logistic regression model
  • Tree-based model
56. Jobs Ecosystem
• Semi-structured job listings
  • Posted by recruiters or scraped from the web
  • A lifetime of 30 days, with options to renew or repost the job thereafter
• 10+ million active job postings
• Daily churn of 150–300K job postings
• Core products – Search and Recommendations, Browse Maps
57. Job Recommendations
• Recommend jobs to the user based on apply likelihood.
• User features
  • Profile, network, activity, etc.
• Job features
  • Poster, company, description, title, location, skills, etc.
• Past interactions
  • Apply, click, search, etc.
58. Job Search
• Surface relevant jobs when the user specifies an explicit query
• User features
  • Profile, network, activity, etc.
• Query features
  • Skills, title, company, etc.
• Job features
  • Poster, company, description, title, location, skills, etc.
• Past interactions
  • Apply, click, search, etc.
60. Linear Model for Candidate Selection
• Use the WAND operator to rewrite the query
• Train a logistic regression model with feature shrinking to learn the weights on the clauses
• Evaluate the model using AUC
61. Algorithm for constructing query clauses
• Input:
  • Member profile data (past titles, current title, skills, …)
  • The explicit query, in the case of search
• Target: job data (title, skills, description, seniority, …)
• Steps:
  • Construct a set F of possible pairs of fields <member field, job field>
    • Example: (member-Title ∧ job-Title)
  • Generate labelled data using click data
  • Train a logistic regression model with feature shrinking
  • Evaluate the model offline using AUC
62. Training Pipeline
[Pipeline: user interaction log data → training data generation (create positive and negative training examples, generate feature vectors based on the predefined clauses) → split into training and test data → train the machine learning algorithm (logistic regression) → tune the threshold using the ROC or PR curve.]
63. Feature Shrinking
• Remove features with small or negative weights
• Retrain the model (without the features that had negative coefficients)
• Repeat until all negative feature weights are gone
• Output the model with all-positive coefficients
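The shrinking loop above can be sketched with a tiny gradient-descent logistic regression. The toy data is constructed so that one clause feature ends up with a negative weight and is pruned; real training would use the click-derived data and AUC evaluation described earlier:

```python
import math

def train_lr(X, y, n_features, lr=0.5, steps=200):
    """Plain gradient-descent logistic regression without an intercept."""
    w = [0.0] * n_features
    for _ in range(steps):
        grad = [0.0] * n_features
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j in range(n_features):
                grad[j] += (p - yi) * xi[j]
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def shrink(X, y, feature_names):
    """Retrain, dropping negative-weight features, until all weights are positive."""
    names = list(feature_names)
    while True:
        cols = [feature_names.index(n) for n in names]
        w = train_lr([[row[c] for c in cols] for row in X], y, len(names))
        keep = [n for n, wj in zip(names, w) if wj > 0]
        if keep == names:
            return dict(zip(names, w))
        names = keep

# Toy clauses: a title-match clause predicts an apply; a noisy clause anti-correlates.
X = [[1, 0], [1, 1], [0, 1], [0, 1], [1, 0], [0, 0]]
y = [1, 1, 0, 0, 1, 0]
model = shrink(X, y, ["title_match", "company_match"])
```

The anti-correlated clause gets a negative weight in the first round, is removed, and the final model keeps only positively weighted clauses, which is what allows it to be turned back into a WAND query.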
65. Decision Tree Based Approach
• LinkedIn has a lot of structured entities
  • Standardized entities like title, function, and seniority are available on both the member and the job
• Interactions between entities are learnt using a decision tree algorithm such as ID3 or boosted trees
• Use the positive clauses to rewrite the query, as explained earlier
67. Agenda
• Lifetime of a Query
• Index Building
• Query Understanding
• Candidate Selection & Retrieval
• Case Study - LinkedIn Job Search & Recommendations
• Hands on tutorial with News Aggregator Dataset
o Build index with NER fields
o Explore Query Rewriting for Fast Retrieval
68. Dataset
• Content
  • This dataset contains headlines, URLs, and categories for 422,937 news stories collected by a web aggregator between March 10th, 2014 and August 10th, 2014.
  • News categories included in this dataset are business, science and technology, entertainment, and health.
• Columns
  • ID, TITLE, URL, PUBLISHER, CATEGORY, STORY, HOSTNAME, TIMESTAMP
• Acknowledgments
  • Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
  • This specific dataset can be found in the UCI ML Repository at this URL
72. Assignments
• Assignments are available on GitHub
• Each assignment builds a component of the end product
• You should be able to test your code with the pre-built UI
• Finished files are available for reference (if needed)
• Raise your hand if you need help or have a question
75. Take Aways
• Building an index is not instantaneous; hence, have replicas in production
• Real-world indexes can seldom be stored in a single shard
• Data sources should be kept separate and maintained as the source of truth
• A catch-all field increases the number of candidates and limits precision
76. Assignment 2
Building an entity-aware index with the help of Stanford NER, and searching with NER tags at query time
77. Take Aways
• Understanding documents and structuring the index is critical for
candidate selection
• Query understanding plays a crucial role in balancing precision and
recall
• Serving top-k partial results is better than serving no results.
79. Take Aways
• Utilize personalization in retrieval to reduce the number of documents to be retrieved and ranked
• A search system provides facets on top of recommendations
• Utilize custom operators for applying business rules to retrieval and ranking
80. Summary
• An understanding of large-scale search systems and the need for query understanding and candidate selection
• Insights and learnings from LinkedIn case studies
• A working end-to-end personalized search system implementation with open-source tools and a dataset
81. References
• Building The LinkedIn Knowledge Graph, by Qi He [https://engineering.linkedin.com/blog/2016/10/building-the-linkedin-knowledge-graph]
• Li, J., Arya, D., Ha-Thuc, V., and Sinha, S. "How to Get Them a Dream Job?: Entity-Aware Features for Personalized Job Search Ranking." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
• Borisyuk, F., et al. "CaSMoS: A Framework for Learning Candidate Selection Models over Structured Queries and Documents." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
• Zhang, X., et al. "GLMix: Generalized Linear Mixed Models for Large-Scale Response Prediction." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
• Grover, A., Arya, D., and Venkataraman, G. "Latency Reduction via Decision Tree Based Query Construction." Proceedings of the 26th ACM International Conference on Information and Knowledge Management. ACM, 2017.
• Venkataraman, G., et al. "Conventional Tutorial: Deep Learning for Personalized Search and Recommender Systems." ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017.
• Covington, P., Adams, J., and Sargin, E. "Deep Neural Networks for YouTube Recommendations." Proceedings of the 10th ACM Conference on Recommender Systems, pages 191–198. ACM, 2016.
• Did you mean "Galene"? https://engineering.linkedin.com/search/did-you-mean-galene