Candidate Selection for Large Scale
Personalized Search and Recommender Systems
Dhruv Arya, Aman Grover, Yiqun Liu, Ganesh Venkataraman, Krishnaraman Kenthapadi
Where to find information
• Code – https://github.com/candidate-selection-tutorial-sigir2017/candidate-selection-tutorial
• Slack – https://goo.gl/WyNY5g
• Slides – Will be posted on Slideshare
What you will learn today
• At the end of the tutorial attendees should:
• Understand the need for candidate selection, techniques and
challenges/tradeoffs faced when dealing with large scale personalized
systems
• Get a broad overview of:
• Building Blocks of a Large Scale Search System
• Query Processing and Understanding
• Candidate Selection Techniques
• Build a prototype implementation of a search system with different candidate
selection queries and ranking
What will graduation look like?
• Full stack search system on a news dataset
• Built with open source tools: Apache Solr, Python and Stanford NER
• Ability to extend/modify query construction and ranking
Result at end of tutorial
Agenda
• Lifetime of a Query
• Index Building
• Query Understanding
• Candidate Selection & Retrieval
• Case Study - LinkedIn Job Search & Recommendations
• Hands on tutorial with News Aggregator Dataset
o Build index with NER fields
o Explore Query Rewriting for Fast Retrieval
Preliminary Terminologies
• The Search Index
• Forward Index:
Mapping from documents to content
Used in scoring
• Inverted Index:
Mapping from (search) terms to list of documents they occur in, called postings list
Used in retrieving and scoring
Document Words
Document 1 the,cow,says,moo
Document 2 the,cat,and,the,hat
Document 3 the,dish,ran,away,with,the,spoon
Preliminary Terminologies
• Postings List:
Simple case: a list of documents that contain the individual term
More flexible: list of docID: <position1, position2, …> and term frequency
• Shard: one partition of document-partitioned index
shard0 shard1
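As a minimal sketch of these terms, the Python below builds a forward index and a positional inverted index over the three example documents. The function names and the modulo sharding rule are illustrative assumptions, not part of the tutorial code.

```python
from collections import defaultdict

# The three toy documents from the table above.
docs = {
    1: "the cow says moo",
    2: "the cat and the hat",
    3: "the dish ran away with the spoon",
}

def build_forward_index(docs):
    """docID -> list of tokens (the content used at scoring time)."""
    return {doc_id: text.split() for doc_id, text in docs.items()}

def build_inverted_index(forward):
    """term -> {docID: [positions]}; the term frequency is len(positions)."""
    inverted = defaultdict(dict)
    for doc_id, tokens in forward.items():
        for pos, term in enumerate(tokens):
            inverted[term].setdefault(doc_id, []).append(pos)
    return dict(inverted)

def shard_of(doc_id, num_shards=2):
    """Document partitioning (assumed rule): each document lives in exactly one shard."""
    return doc_id % num_shards

forward_index = build_forward_index(docs)
inverted_index = build_inverted_index(forward_index)
```

Here `inverted_index["the"]` is the postings list for "the": every document contains it, and documents 2 and 3 contain it twice.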
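Lifetime of a Query
[Diagram: User -> Browser/Device -> Search Frontend -> Federator -> Broker of each backend vertical (…more verticals)]
• User -> Browser/Device: Query (“Quantitative Software Engineer”)
• Browser/Device -> Search Frontend: Query + Metadata
• Search Frontend -> Federator: Query + Metadata
• Federator -> Broker: Structured Query + Metadata
TITLE = quantitative software engineer
OR
SKILL = quantitative
TITLE = software engineer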
Federator/Broker: Query Rewriter
Query: Quantitative Software Engineer (Federator -> Job Search Broker)
Structured query:
TITLE = quantitative software engineer
OR
SKILL = quantitative
TITLE = software engineer
Rewritten query:
jobTitle | jobDescription : quantitative software engineer
OR
(jobSkills | jobDescription | jobTitle : quantitative)
jobTitle | jobDescription : software engineer || developer || programmer || software developer || software engineering || software programmer
OR
…
Lifetime of a Query
[Diagram, continued: the Broker of each backend vertical forwards the rewritten Structured Query + Metadata to its Searchers, one Searcher per shard (…more shards, …more verticals)]
Searcher
• Operates on a single shard of the inverted index.
• Receives the rewritten query + metadata from the Broker
• Retrieves matching documents from the inverted index.
• The documents are scored and the top scoring documents are
returned to the Broker.
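The Searcher/Broker split can be sketched as per-shard top-k retrieval followed by a merge. This is a sketch under stated assumptions: the term-overlap score, the shard contents and the function names are invented for illustration.

```python
import heapq

def search_shard(shard_docs, query_terms, k):
    """Searcher: retrieve matching documents of one shard and return its top k as (score, doc_id)."""
    hits = []
    for doc_id, text in shard_docs.items():
        # Toy score: number of document tokens that are query terms.
        score = sum(1 for token in text.split() if token in query_terms)
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)

def broker_merge(per_shard_hits, k):
    """Broker: merge the per-shard top-k lists into one global top k."""
    return heapq.nlargest(k, (hit for hits in per_shard_hits for hit in hits))

# Two hypothetical shards of a document-partitioned index.
shards = [
    {0: "software engineer", 2: "quantitative software engineer"},
    {1: "landscape architect", 3: "software engineer software"},
]
query_terms = {"software", "engineer"}
top = broker_merge([search_shard(s, query_terms, k=2) for s in shards], k=2)
```

Each Searcher only needs to return k hits, because no document outside a shard's own top k can appear in the global top k.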
Scoring
• Boolean Model (BM)
• 1 or 0
• Fast, easy to implement
• but hard to rank, since terms are not weighted
• Vector Space Model (VSM) (cosine similarity of query q and document d)
• TF-IDF
• BM25
• Weighted term
• Assume independence between terms
• Probabilistic Model (P(Relevant = 1| Q = q, D = d))
• Binary Independence Model
• Logistic Regression
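To make the Vector Space Model concrete, here is a small TF-IDF and cosine similarity sketch over the toy documents from the terminology slide. The smoothed IDF formula is one of several common variants and is an assumption, as are the function names.

```python
import math
from collections import Counter

def build_idf(tokenized_docs):
    """Smoothed inverse document frequency: log(N / df) + 1 (assumed variant)."""
    n = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}

def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector; terms unseen in the corpus are dropped."""
    return {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

corpus = [
    "the cow says moo".split(),
    "the cat and the hat".split(),
    "the dish ran away with the spoon".split(),
]
idf = build_idf(corpus)
query_vec = tfidf_vector("cat hat".split(), idf)
scores = [cosine(query_vec, tfidf_vector(doc, idf)) for doc in corpus]
```

Only the second document shares terms with the query "cat hat", so it is the only one with a non-zero score.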
Lifetime of a Query
[Diagram, continued: each Searcher (one shard) returns Results from Shard to the Broker of its vertical]
Federator/Broker: Reranker
• Diversity based reranking
• Maximize a weighted sum of relevance scores, penalized by the similarity between the selected documents
• Business rule based reranking
• Maximize revenue
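One common way to realize diversity based reranking is a greedy selection in the style of Maximal Marginal Relevance. The slides do not name a specific method, so the sketch below is an assumed instance; the trade-off parameter `lam`, the relevance scores and the similarity function are hypothetical.

```python
def diversity_rerank(candidates, relevance, similarity, k, lam=0.7):
    """Greedily pick the document with the best relevance minus similarity-to-selected penalty."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def gain(doc):
            # Penalty is the highest similarity to anything already selected.
            penalty = max((similarity(doc, s) for s in selected), default=0.0)
            return lam * relevance[doc] - (1 - lam) * penalty
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical scores: "a" and "b" are near-duplicates, "c" is different but less relevant.
relevance = {"a": 0.9, "b": 0.85, "c": 0.5}

def sim(x, y):
    return 1.0 if {x, y} == {"a", "b"} else 0.1

reranked = diversity_rerank(["a", "b", "c"], relevance, sim, k=2, lam=0.5)
```

With `lam=0.5` the near-duplicate "b" is pushed out in favour of "c", even though "b" has the higher relevance score.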
Lifetime of a Query
[Diagram, completed: results flow back along the same path. Searchers return Results from Shard to the Broker, the Broker returns Vertical Results to the Federator, the Federator returns Blended Results to the Search Frontend, and Results are returned to the Browser/Device and the User]
Index Building
• Collect the documents to be indexed
• Tokenization and Named Entity Recognition
• Linguistic preprocessing of tokens
• Index the documents that each term occurs in
[Figure: Sample Job Posting]
Index Building
[Figure: Sample Job Posting annotated with title and skill entities (Tokenization and Named Entity Recognition step)]
Index Building: Named Entity Recognition (NER)
• Hidden Markov Models
• Assumes Markov Property
• Models joint P(X, Y) where X are observed data, and Y the label of X
• Maximum Entropy Markov Models
• Suffers from the label bias problem
• Models conditional P(Y|X)
• CRF Models
• Models conditional P(Y|X)
• Does not assume Markov property
Index Building
[Figure: Sample Job Posting with the terms BSc, NLP and Bachelor of Science highlighted (Linguistic Preprocessing of Tokens step)]
Index Building: Index Compression
• Motivation: keep as much of the dictionary and postings lists in memory as possible
• Dictionary compression
• Dictionary as a long string
• Pointer to next word shows end of current word
• Blocked storage: store pointers to every kth term string
[Figure: dictionary stored as one long string, ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo…., next to a table with columns Freq. (33, 29, 44, 126), Postings ptr. and Term ptr.]
Index Building: Index Compression
• Postings list compression
• Goal: store the postings lists of more frequent terms with fewer bits per posting
• It suffices to store the gaps between consecutive docIDs
• Variable byte encoding and gamma encoding
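Gap storage combined with variable byte encoding can be sketched as follows. The byte layout (7 payload bits per byte, high bit set on the last byte of each number) is the textbook variable byte scheme; the function names and the sample postings list are illustrative assumptions.

```python
def to_gaps(doc_ids):
    """Sorted docIDs -> first docID followed by the differences between neighbours."""
    if not doc_ids:
        return []
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Inverse of to_gaps: running sum of the gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids

def vb_encode_number(n):
    """Encode one non-negative integer as variable bytes."""
    chunks = []
    while True:
        chunks.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128  # high bit marks the final byte of this number
    return bytes(chunks)

def vb_encode(numbers):
    return b"".join(vb_encode_number(n) for n in numbers)

def vb_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers

postings = [824, 829, 215406]
encoded = vb_encode(to_gaps(postings))
decoded = from_gaps(vb_decode(encoded))
```

The three docIDs fit in 6 bytes instead of 12 bytes as fixed 32-bit integers, because the small gap 5 needs only one byte.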
Query Segmentation
• Task of dividing the search query into segments (tokens/phrases) by identifying semantic entities present in the search query.
• Helps improve the precision of the candidate set by utilizing query segments in query rewriting
Query Segmentations:
• Oracle | Java Application Developer
• Oracle | Java | Application Developer
• Oracle | Java Application | Developer
Query Tagging & Annotation
• Tag query segments based on recognized entity tags.
• Annotate the tagged segments with:
• Standardized identifiers
• Related Entities
• Entity Specific Metadata
Example taggings:
• COMPANY = Oracle, TITLE = Java Application Developer
• COMPANY = Oracle, SKILL = Java, TITLE = Application Developer
Query Segmentation Approaches
• Dictionary Based
• Simple approach to utilize corpus data to segment phrases/tokens in the
query.
• Score based on co-occurrence is assigned to disambiguate multiple segments.
• Machine Learning Based
• Model the problem as a sequence to sequence learning problem or a
classification problem.
• Utilize correctly labeled segments from corpus / human evaluation data to
learn the model
• Ex. HMM, CRF
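The dictionary based approach can be sketched as a small dynamic program that picks the highest scoring segmentation. The phrase scores below stand in for corpus co-occurrence scores and are invented for illustration, as is the rule that multi-token segments must appear in the dictionary.

```python
def segment(tokens, phrase_scores, unknown_score=0.0):
    """Return the segmentation of tokens with the highest total phrase score."""
    n = len(tokens)
    # best[i] holds (score, segments) for the first i tokens.
    best = [(0.0, [])] + [None] * n
    for end in range(1, n + 1):
        options = []
        for start in range(end):
            phrase = " ".join(tokens[start:end])
            if end - start > 1 and phrase not in phrase_scores:
                continue  # multi-token segments must be known phrases
            score = best[start][0] + phrase_scores.get(phrase, unknown_score)
            options.append((score, best[start][1] + [phrase]))
        best[end] = max(options, key=lambda option: option[0])
    return best[n][1]

# Hypothetical dictionary scores.
PHRASE_SCORES = {
    "oracle": 3.0,
    "java": 2.0,
    "application": 1.0,
    "developer": 1.0,
    "java application": 1.0,
    "application developer": 5.0,
    "java application developer": 8.0,
}

segments = segment("oracle java application developer".split(), PHRASE_SCORES)
```

With these scores the winner is Oracle | Java Application Developer; lowering the score of the three-word phrase would make Oracle | Java | Application Developer win instead.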
Use in Candidate Selection
• Query segments, instead of individual tokens, are used as semantic units for matching document fields and retrieving correct documents.
• Precise candidates are retrieved with a reduced search space, resulting in improved latency.
• Utilize multiple segments to diversify the retrieved result set.
Example:
• COMPANY = Oracle, TITLE = Java Application Developer (matched against the Company and Title fields)
• SKILL = Oracle, SKILL = Java, TITLE = Application Developer (matched against the Skills and Title fields)
Query Expansion
• Task of adding additional
tokens/phrases to the query to
improve recall.
• Includes -
• Synonyms
• Abbreviations
• Related Terms
• Critical to understand when to
expand
COMPANY = Oracle OR NetSuite OR Taleo OR Sun Microsystems OR …
SKILL = Java OR Java EE OR J2EE OR JVM OR JRE OR JDK …
TITLE = Application Developer OR Software Engineer OR Software Developer OR Programmer …
Green – Synonyms
Blue – Related Entities
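A tagged query can be expanded with a simple lookup, as in the sketch below. The `EXPANSIONS` dictionary, the `max_extra` cap and the function name are hypothetical; the entries are copied from the example above.

```python
# Hypothetical expansion dictionary keyed by entity type, with lower-cased lookup keys.
EXPANSIONS = {
    "COMPANY": {"oracle": ["NetSuite", "Taleo", "Sun Microsystems"]},
    "SKILL": {"java": ["Java EE", "J2EE", "JVM", "JRE", "JDK"]},
    "TITLE": {"application developer": ["Software Engineer", "Software Developer", "Programmer"]},
}

def expand_query(tagged_segments, expansions=EXPANSIONS, max_extra=3):
    """Turn [(entity_type, value), ...] into one OR-clause per tagged segment."""
    clauses = []
    for entity_type, value in tagged_segments:
        extra = expansions.get(entity_type, {}).get(value.lower(), [])
        terms = [value] + extra[:max_extra]  # cap expansions to limit the loss of precision
        clauses.append(f"{entity_type} = " + " OR ".join(terms))
    return clauses

clauses = expand_query(
    [("COMPANY", "Oracle"), ("SKILL", "Java"), ("TITLE", "Application Developer")],
    max_extra=2,
)
```

The cap is one simple way to act on "critical to understand when to expand": every added term raises recall but admits more loosely related documents.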
Synonyms and Abbreviations
• Dictionary based approach
• Use a pruned dictionary of common abbreviations and synonyms
• Ex. CEO – Chief Executive Officer, VP – Vice President
• Simple but limiting, and requires post-processing for disambiguation, ex. CA – California vs. Chartered Accountant?
• Utilize query reformulations and search sessions to extract synonyms
• Ex. Software Engineer OR Software Developer
• Ex. Application Developer -> Software Engineer -> Software Developer (Same
Session)
Word Embeddings for Synonym Expansion
• Word embeddings are projections of words into a dense, low dimensional space.
• They allow capturing synonyms, lemmatizations, cross language representations and relatedness
• Different unsupervised/supervised approaches to experiment with – Word2Vec, GloVe, DSSM etc.
• Examples
• softwareentwickler - software engineer, engineer software, software engineering
• marketing – advertising, sales, digital marketing
• senior - sr, snr, lead
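Synonym candidates can be read off an embedding space by nearest neighbour search under cosine similarity. The 3-dimensional vectors below are invented for illustration; real vectors would come from a trained model such as Word2Vec, GloVe or DSSM.

```python
import math

# Toy vectors (assumed values, not learned from data).
VECTORS = {
    "senior": [0.90, 0.10, 0.00],
    "sr": [0.88, 0.12, 0.02],
    "lead": [0.70, 0.30, 0.10],
    "marketing": [0.00, 0.20, 0.95],
    "advertising": [0.05, 0.25, 0.90],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(word, vectors, k=2):
    """Return the k words whose vectors are closest to the vector of word."""
    target = vectors[word]
    scored = [(cosine(target, vec), other) for other, vec in vectors.items() if other != word]
    return [other for _, other in sorted(scored, reverse=True)[:k]]

senior_neighbours = nearest("senior", VECTORS)
```

In practice a similarity threshold is usually applied on top of `k`, since the nearest neighbour of a rare word may still be unrelated.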
Query Relaxation
• Task of removing tokens/phrases
from the query to make it less
restrictive and increase recall.
• Useful for searches with low
results
• Critical to understand when to
relax to balance precision and
recall
Which terms to relax?
• Term Importance
• Utilize query reformulations to learn token importance by focusing on addition and removal of tokens
• Assign scores based on query co-occurrence, historical result count, IDF etc.
• Build a dictionary of terms and their importance
• Word Embeddings
• Utilize word embeddings to find the closest similar queries
• Can be done at a query level (qa -> qb) or at an entity level within the queries (ea -> eb)
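Relaxation driven by a term importance dictionary can be sketched as dropping the least important term until enough results come back. The importance values, the `min_results` threshold and the `fake_count` stand-in for the search backend are all hypothetical.

```python
def relax_query(tokens, importance, count_results, min_results=10):
    """Drop the least important token until the query returns at least min_results."""
    tokens = list(tokens)
    while len(tokens) > 1 and count_results(tokens) < min_results:
        weakest = min(tokens, key=lambda t: importance.get(t, 0.0))
        tokens.remove(weakest)
    return tokens

# Hypothetical importance dictionary built offline from query logs.
IMPORTANCE = {"engineering": 0.2, "manager": 0.9, "machine learning": 0.8}

def fake_count(tokens):
    """Stand-in for a result count lookup against the index."""
    return 3 if "engineering" in tokens else 40

relaxed = relax_query(["engineering", "manager", "machine learning"], IMPORTANCE, fake_count)
```

The loop only relaxes when the result count is low, which keeps precision intact for queries that already retrieve enough candidates.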
Using Query Understanding Techniques in
Personalized Recommendations
• The techniques described for the query are also applicable to textual
fields on the user profile
• Ex. Extracting Company and Title from User Headline - “Software
Engineer at LinkedIn”
• Ex. Relaxing User Title in the Personalization Query – “Engineering
Manager Machine Learning -> Manager Machine Learning”
How to apply these techniques
• Function of relevance and the
count of final result set
• Query Plans
• Early / Late Decision
• Federated Search and Blending
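[Figure: Query Plans and Result Federation. The query q is turned into the queries qo, qe and qr, which are run against the Search Index; the results r1 and r2 are federated into rf and returned as r]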
Why Candidate Selection?
Need for Candidate Selection?
• Ranking is an expensive operation
• A lot of context around the search, such as the user’s past behavior and profile, can be used to select candidates
• Recommendations are highly personalized
• Users expect a fast, updated and contextually aware system
Generic Problem Formulation
• Build a first pass filter that mimics final ranking function
• Focus on Recall, Don’t forget Precision
• CQ = f(EQ, EP, LP, U, C)
• CQ -> Constructed Query, EQ -> Explicit Query, EP-> Explicit preference, LP ->
Latent Preference, U -> User, C -> Context
• RF= f(U, CQ, D)
• RF -> Ranking Function, D -> Document
• Optimization goal: retrieve a candidate set M such that M is a superset of the top K documents ranked by the ranking function
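The superset goal can be checked offline by measuring how much of the ranker's top K survives the first pass filter. In the sketch below the documents, the filter and the ranking score are invented for illustration.

```python
def top_k(doc_ids, rank_score, k):
    """The K documents the full ranking function would return."""
    return sorted(doc_ids, key=rank_score, reverse=True)[:k]

def evaluate_candidate_selection(docs, candidate_filter, rank_score, k):
    """Return (recall of the ranker's top K inside M, fraction of the corpus kept in M)."""
    ideal = set(top_k(list(docs), lambda d: rank_score(docs[d]), k))
    candidates = {d for d, doc in docs.items() if candidate_filter(doc)}
    return len(ideal & candidates) / len(ideal), len(candidates) / len(docs)

# Hypothetical corpus: doc_id -> features, with a precomputed ranking score.
DOCS = {
    1: {"title_match": True, "score": 0.9},
    2: {"title_match": True, "score": 0.8},
    3: {"title_match": True, "score": 0.7},
    4: {"title_match": False, "score": 0.3},
    5: {"title_match": False, "score": 0.2},
    6: {"title_match": False, "score": 0.1},
}

recall, kept = evaluate_candidate_selection(
    DOCS,
    candidate_filter=lambda doc: doc["title_match"],
    rank_score=lambda doc: doc["score"],
    k=2,
)
```

A recall of 1.0 means M contains the full top K, while the kept fraction shows how much ranking work the filter saves.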
Where can we do candidate selection
[Figure: search pipeline. Offline: Indexer -> Index, Training / Model. Online: User Query -> Query Rewriting -> Top-K retrieval -> Result Ranking -> Results]
• Candidate selection in Query Rewriting:
• Term Match
• Decision Trees
• Linear model
Where can we do candidate selection
[Figure: the same pipeline with candidate selection in ranking. Offline: Indexer -> Index, Training / Model. Online: User Query -> Term Match -> First Pass Ranker -> Result Ranking -> Results]
Approaches to Candidate Selection
• Explicit query term matching [Explicit Query Matching]
• Generalized Linear Model for learning weights for constructed query clauses [Query Rewriting]
• Tree models to learn associations between entities in query and documents [Query Rewriting]
• Deep Learning models for transforming query and documents into a lower dimensional space, then using a distance metric to filter out irrelevant documents [Ranking]
Naive Candidate Selection
• Explicit Query Matching
• Retrieval using term matching
• Works on small index
• Generates a lot of false positives
• “QA Architect” -> “Landscape Architect”, “Java Architect”, “Software Test Architect”
• Retrieval using rules
• Requires Domain knowledge
• (user_title ^ job_title) & (user_skill ^ job_skill)
• Difficult to update
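Understanding WAND Query
Query: “Quality Assurance Engineer”
AND Query: “Quality AND Assurance AND Engineer” (a document must contain every term)
[Figure: two sample results, ✅ matched and ❌ not matched]
WAND: “(Quality[5] AND Assurance[5] AND Engineer[1]) [10]” (a document matches when the weights of the terms it contains sum to at least the threshold 10)
[Figure: two sample results, ✅ matched and ✅ matched]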
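The matching semantics of the two queries can be compared in a few lines. This sketch covers only the weighted threshold test of WAND ("weak AND"), not the efficient top-k traversal of postings lists that the operator is usually implemented with; the sample documents are hypothetical.

```python
def and_match(doc_terms, query_terms):
    """Strict AND: every query term must be present."""
    return all(term in doc_terms for term in query_terms)

def wand_match(doc_terms, weighted_terms, threshold):
    """WAND: the weights of the matched terms must add up to at least the threshold."""
    matched_weight = sum(w for term, w in weighted_terms.items() if term in doc_terms)
    return matched_weight >= threshold

WEIGHTS = {"quality": 5, "assurance": 5, "engineer": 1}  # weights from the slide
THRESHOLD = 10

doc_a = {"quality", "assurance", "engineer"}
doc_b = {"quality", "assurance", "analyst"}  # hypothetical posting without "engineer"
doc_c = {"software", "engineer"}
```

`doc_b` fails the strict AND query but passes WAND, because "quality" and "assurance" alone reach the threshold; `doc_c` fails both.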
Why Generalized Linear Model
• Learn weights of query clauses
• For larger indexes and structured data with many entities.
• Can be applied to:
• Retrieval
• Ranking
• Easier to explain, with good debugging capabilities.
• Cannot capture interaction between clauses.
How to apply?
Query: “Bryer Ice Cream”
company = bryer
category = ice cream
Rewritten Query: WAND(company:bryer[5] AND category:ice cream[5] AND description:ice cream[2] AND description:bryer[2]) [11]
Steps:
• WAND operator for your index
• Label data collection
• Generate query clauses
• Train GLM
• Construct query from learnt weights
• Use constructed query to retrieve results
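The "construct query from learnt weights" step can be sketched as plain string assembly in the notation of the slide. The clause weights are hard-coded here as if they had been learnt and rounded to integers; the function name and the (query entity, document field) keying are assumptions.

```python
def build_wand_query(entities, clause_weights, threshold):
    """Assemble a WAND query string from tagged query entities and per-clause weights."""
    clauses = []
    for (query_entity, doc_field), weight in clause_weights.items():
        value = entities.get(query_entity)
        if value is None or weight <= 0:
            continue  # skip clauses for missing entities or non-positive weights
        clauses.append(f"{doc_field}:{value}[{weight}]")
    return "WAND(" + " AND ".join(clauses) + f") [{threshold}]"

# Entities tagged in the query "Bryer Ice Cream".
entities = {"company": "bryer", "category": "ice cream"}

# Hypothetical learnt weights per (query entity, document field) clause.
clause_weights = {
    ("company", "company"): 5,
    ("category", "category"): 5,
    ("category", "description"): 2,
    ("company", "description"): 2,
}

query = build_wand_query(entities, clause_weights, threshold=11)
```

How the string is then sent to the index depends on the search engine; the WAND operator is not assumed to be available in any particular engine out of the box.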
Motivation for using Decision Trees
• Interactions are important
• “Architect” -> “QA Architect”, “Landscape Architect”
• Title Match AND Function Match
• Used for query rewriting
• Complex interactions can be learnt
• An example query would be:
• (Title Match AND Function Match) OR (Function Match) OR (Seniority Match)
[Figure: decision tree that splits on Title Match, Seniority Match and Function Match (Yes/No branches) and ends in Positive and Negative leaves]
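Rewriting a query from a decision tree amounts to collecting every root-to-leaf path that ends in a Positive leaf and OR-ing the paths together. The tree below is a hypothetical example in a made-up tuple format, not the tree from the figure.

```python
def positive_paths(node, path=()):
    """Collect the feature tests along every path that ends in a 'Positive' leaf."""
    if isinstance(node, str):  # leaf label
        return [path] if node == "Positive" else []
    feature, no_branch, yes_branch = node
    return (
        positive_paths(yes_branch, path + (feature,))
        + positive_paths(no_branch, path + (f"NOT {feature}",))
    )

def tree_to_query(tree):
    """OR together the AND-clauses of all positive paths."""
    clauses = [" AND ".join(path) for path in positive_paths(tree)]
    return " OR ".join(f"({clause})" for clause in clauses)

# Hypothetical tree: (feature, subtree if No, subtree if Yes).
TREE = (
    "Title Match",
    ("Seniority Match", "Negative", "Positive"),
    ("Function Match", "Negative", "Positive"),
)

query = tree_to_query(TREE)
```

Each positive path becomes one conjunctive clause, which is how the tree captures interactions such as Title Match AND Function Match that a linear model cannot express.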
What ifs?
• What if we don’t have an index
• What if we don’t have a way to determine entities
• What if we don’t have any knowledge of the data set
Motivation for Deep Learning
• No structured information available, or only semi-structured information available
• Generate embeddings from raw content such as text, images, videos etc.
• Tools like Word2Vec, GloVe and DSSM can be used to generate embeddings
• Use techniques like clustering and KNN to find candidates
Case Studies for Jobs at LinkedIn
• Jobs Ecosystem at LinkedIn
• Job Search
• Job Recommendation
• Similar Jobs
• Candidate Selection Techniques Used at LinkedIn
• Logistic Regression model
• Tree based model
Jobs Ecosystem
• Semi-structured job listings
• Posted by recruiters or scraped from the web
• A lifetime of 30 days, with options to renew or repost a job thereafter
• 10+ million active job postings
• Daily churn of 150–300 K job postings
• Core products – Search and Recommendations, Browse Maps
Job Recommendations
• Recommend jobs to users based on apply likelihood.
• User Features
• Profile, Network, Activity etc.
• Job Features
• Poster, Company, Description, Title,
Location, Skills etc.
• Past Interactions
• Apply, Click, Search etc.
Job Search
• Surface relevant jobs when user
specifies an explicit query
• User Features
• Profile, Network, Activity etc.
• Query Features
• Skills, title, company etc.
• Job Features
• Poster, Company, Description, Title,
Location, Skills etc.
• Past Interactions
• Apply, Click, Search etc.
Candidate Selection Models
• Generalized Linear models
• Tree based models
Linear Model for Candidate Selection
• Use WAND operator to rewrite the query
• Training logistic regression model with feature shrinking to learn the
weights on the clauses
• Evaluating the model using AUC
Algorithm for constructing query clauses
• Input:
• Member profile data (Past titles, Current Title, Skills, …)
• Explicit Query in case of search
• Target: Job data (Title, Skills, Description, Seniority, …)
• Steps:
• Construct a set F of possible pairs of fields <Member field, Job field>.
• Example: (member-Title ∧ job-Title)
• Generate labelled data using click data
• Train a logistic regression model with feature shrinking
• Evaluate the model offline using AUC.
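The first step, building the set F of field pairs and turning a (member, job) pair into a feature vector, might look like the sketch below. The field names, the token-overlap test and the sample member and job are assumptions for illustration.

```python
from itertools import product

# Hypothetical field lists; the real system uses many more fields.
MEMBER_FIELDS = ["title", "skills"]
JOB_FIELDS = ["title", "skills", "description"]

def clause_features(member, job):
    """One binary feature per <member field, job field> pair: 1 if the fields share a term."""
    features = {}
    for m_field, j_field in product(MEMBER_FIELDS, JOB_FIELDS):
        m_terms = set(member.get(m_field, []))
        j_terms = set(job.get(j_field, []))
        features[(m_field, j_field)] = int(bool(m_terms & j_terms))
    return features

member = {"title": ["engineer"], "skills": ["java"]}
job = {
    "title": ["software", "engineer"],
    "skills": ["python"],
    "description": ["java", "engineer"],
}

features = clause_features(member, job)
```

Each key corresponds to one candidate clause such as (member-Title ∧ job-Title); click data then provides the label for the feature vector.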
Training Pipeline
• User interaction log data
• Training data generation: create positive and negative training data
• Generate feature vector based on predefined clauses
• Split into training and test data
• Train machine learning algorithm (logistic regression)
• Tune the threshold using ROC or PR curve
Feature Shrinking
• Train the model and remove features with small or negative weights
• Retrain the model (without the features that had negative coefficients)
• Repeat until all negative feature weights are gone
• Output the model with all positive coefficients
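The feature shrinking loop can be sketched with a tiny pure-Python logistic regression. The gradient descent settings, the feature names and the toy click data are all assumptions; a real pipeline would use a proper training library and far more data.

```python
import math

def train_logreg(rows, labels, features, epochs=300, lr=0.5):
    """Batch gradient descent on log-loss; returns ({feature: weight}, bias)."""
    w = {f: 0.0 for f in features}
    b = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = {f: 0.0 for f in features}
        grad_b = 0.0
        for x, y in zip(rows, labels):
            z = b + sum(w[f] * x[f] for f in features)
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad_b += err
            for f in features:
                grad_w[f] += err * x[f]
        b -= lr * grad_b / n
        for f in features:
            w[f] -= lr * grad_w[f] / n
    return w, b

def shrink_features(rows, labels, features, min_weight=0.0):
    """Retrain, dropping features with weight <= min_weight, until all weights are positive."""
    features = list(features)
    while features:
        w, b = train_logreg(rows, labels, features)
        keep = [f for f in features if w[f] > min_weight]
        if len(keep) == len(features):
            return w, b
        features = keep
    return {}, 0.0

FEATURES = ["title_match", "skill_match", "location_mismatch"]
DATA = [  # (title_match, skill_match, location_mismatch, clicked), invented toy data
    (1, 1, 0, 1), (1, 0, 0, 1), (0, 1, 0, 1), (1, 1, 1, 1),
    (0, 0, 1, 0), (0, 1, 1, 0), (1, 0, 1, 0), (0, 0, 0, 0),
]
rows = [dict(zip(FEATURES, r[:3])) for r in DATA]
labels = [r[3] for r in DATA]
weights, bias = shrink_features(rows, labels, FEATURES)
```

Only positive weights survive because each remaining clause is meant to add evidence toward a retrieval threshold, for example inside a weighted query such as WAND.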
Candidate Selection Models at LinkedIn
• Generalized Linear models
• Tree based models
Decision Tree Based Approach
• LinkedIn has a lot of structured entities
• Standardized entities like title, function, seniority etc. are available on both member and job
• Interactions between entities are learnt using a decision tree algorithm such as ID3 or boosted trees
• Use the positive clauses to rewrite a query as explained earlier
Architecture
Dataset
• Content
• This dataset contains headlines, URLs, and categories for 422,937 news stories collected by a
web aggregator between March 10th, 2014 and August 10th, 2014.
• News categories included in this dataset include business; science and technology;
entertainment; and health.
• Columns
• ID, TITLE, URL, PUBLISHER, CATEGORY, STORY, HOSTNAME, TIMESTAMP
• Acknowledgments
• Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine,
CA: University of California, School of Information and Computer Science.
• This specific dataset can be found in the UCI ML Repository at this URL
Tools
Architecture
[Figure: components of the tutorial system: Users, Frontend, Mid Tier (Query Builder, Ranker), News Dataset Index (Apache Solr), Stanford NER Server, Document Understanding & Creation]
End Product
Assignments
• Assignments available on GitHub
• Each assignment builds on a component of the end product
• You should be able to test your code with the pre-built UI
• Finished files available for reference (if needed)
• Raise your hand if you need help or have a question
Assignment 0
Setting up the development environment
Assignment 1
Building a simple index and searching on it with a catch-all Lucene field
Take Aways
• Building an index is not instantaneous, so keep replicas in production
• Real world indexes can seldom be stored in a single shard
• Data sources should be kept separate and maintained as the source of truth
• A catch-all field increases the number of candidates and limits precision
Assignment 2
Building an entity aware index with help of Stanford NER and search with NER tags
at query time
Take Aways
• Understanding documents and structuring the index is critical for
candidate selection
• Query understanding plays a crucial role in balancing precision and
recall
• Serving top-k partial results is better than serving no results.
Assignment 3
Personalized query rewriting for empty query use case (recommendations)
Take Aways
• Utilize personalization in retrieval for reducing documents to be
retrieved and ranked
• A search system provides facets on top of recommendations
• Utilize custom operators for applying business rules to retrieval and
ranking
Summary
• Understanding of a large scale search system, need for query
understanding and candidate selection
• Insights and learning from LinkedIn case studies
• Working end-to-end personalized search system
implementation with open source tools and dataset.
References
• Building The LinkedIn Knowledge Graph by Qi He [https://engineering.linkedin.com/blog/2016/10/building-the-linkedin-knowledge-graph]
• Li, J., Arya, D., Ha-Thuc, V., & Sinha, S. (2016, August). How to Get Them a Dream Job?: Entity-Aware Features for
Personalized Job Search Ranking. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining. ACM, 2016.
• Borisyuk, Fedor, et al. "CaSMoS: A Framework for Learning Candidate Selection Models over Structured Queries and
Documents." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining. ACM, 2016.
• Zhang, XianXing, et al. "Glmix: Generalized linear mixed models for large-scale response prediction." Proceedings of
the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
• Grover, A., Arya, D., Venkatraman, G. “Latency reduction via Decision Tree Based Query Construction” Proceedings of
the 26th ACM International on Conference on Information and Knowledge Management. ACM, 2017.
• Venkatraman, Ganesh et al. “Conventional Tutorial - Deep Learning for Personalized Search and Recommender
Systems” ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017.
• P. Covington, J. Adams, and E. Sargin. Deep neural networks for youtube recommendations. In Proceedings of the
10th ACM Conference on Recommender Systems, pages 191–198. ACM, 2016.
• Did you mean "Galene"? https://engineering.linkedin.com/search/did-you-mean-galene

 
fitting shop and tools used in fitting shop .ppt
fitting shop and tools used in fitting shop .pptfitting shop and tools used in fitting shop .ppt
fitting shop and tools used in fitting shop .pptAfnanAhmad53
 
UNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptxUNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptxkalpana413121
 
Introduction to Geographic Information Systems
Introduction to Geographic Information SystemsIntroduction to Geographic Information Systems
Introduction to Geographic Information SystemsAnge Felix NSANZIYERA
 
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...drmkjayanthikannan
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdfKamal Acharya
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdfAldoGarca30
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdfKamal Acharya
 
Basic Electronics for diploma students as per technical education Kerala Syll...
Basic Electronics for diploma students as per technical education Kerala Syll...Basic Electronics for diploma students as per technical education Kerala Syll...
Basic Electronics for diploma students as per technical education Kerala Syll...ppkakm
 
PE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and propertiesPE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and propertiessarkmank1
 
Introduction to Data Visualization,Matplotlib.pdf
Introduction to Data Visualization,Matplotlib.pdfIntroduction to Data Visualization,Matplotlib.pdf
Introduction to Data Visualization,Matplotlib.pdfsumitt6_25730773
 

Recently uploaded (20)

Memory Interfacing of 8086 with DMA 8257
Memory Interfacing of 8086 with DMA 8257Memory Interfacing of 8086 with DMA 8257
Memory Interfacing of 8086 with DMA 8257
 
Path loss model, OKUMURA Model, Hata Model
Path loss model, OKUMURA Model, Hata ModelPath loss model, OKUMURA Model, Hata Model
Path loss model, OKUMURA Model, Hata Model
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...Max. shear stress theory-Maximum Shear Stress Theory ​  Maximum Distortional ...
Max. shear stress theory-Maximum Shear Stress Theory ​ Maximum Distortional ...
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptxS1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
S1S2 B.Arch MGU - HOA1&2 Module 3 -Temple Architecture of Kerala.pptx
 
Introduction to Robotics in Mechanical Engineering.pptx
Introduction to Robotics in Mechanical Engineering.pptxIntroduction to Robotics in Mechanical Engineering.pptx
Introduction to Robotics in Mechanical Engineering.pptx
 
fitting shop and tools used in fitting shop .ppt
fitting shop and tools used in fitting shop .pptfitting shop and tools used in fitting shop .ppt
fitting shop and tools used in fitting shop .ppt
 
UNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptxUNIT 4 PTRP final Convergence in probability.pptx
UNIT 4 PTRP final Convergence in probability.pptx
 
Introduction to Geographic Information Systems
Introduction to Geographic Information SystemsIntroduction to Geographic Information Systems
Introduction to Geographic Information Systems
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
Unit 4_Part 1 CSE2001 Exception Handling and Function Template and Class Temp...
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
1_Introduction + EAM Vocabulary + how to navigate in EAM.pdf
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdf
 
Basic Electronics for diploma students as per technical education Kerala Syll...
Basic Electronics for diploma students as per technical education Kerala Syll...Basic Electronics for diploma students as per technical education Kerala Syll...
Basic Electronics for diploma students as per technical education Kerala Syll...
 
PE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and propertiesPE 459 LECTURE 2- natural gas basic concepts and properties
PE 459 LECTURE 2- natural gas basic concepts and properties
 
Introduction to Data Visualization,Matplotlib.pdf
Introduction to Data Visualization,Matplotlib.pdfIntroduction to Data Visualization,Matplotlib.pdf
Introduction to Data Visualization,Matplotlib.pdf
 

SIGIR 2017 - Candidate Selection for Large Scale Personalized Search and Recommender Systems

  • 1. Candidate Selection for Large Scale Personalized Search and Recommender Systems Dhruv Arya Aman Grover Yiqun Liu Ganesh Venkataraman Krishnaraman Kenthapadi
  • 2. Where to find information • Code – https://github.com/candidate-selection-tutorial-sigir2017/candidate-selection-tutorial • Slack – https://goo.gl/WyNY5g • Slides – Will be posted on Slideshare
  • 3. What you will learn today • At the end of the tutorial attendees should: • Understand the need for candidate selection, techniques and challenges/tradeoffs faced when dealing with large scale personalized systems • Get a broad overview of: • Building Blocks of a Large Scale Search System • Query Processing and Understanding • Candidate Selection Techniques • Build a prototype implementation of a search system with different candidate selection queries and ranking
  • 4. What will graduation look like? • Full stack search system on a news dataset • Built with open source tools: Apache Solr, Python and Stanford NER • Ability to extend/modify query construction and ranking
  • 5. Result at end of tutorial
  • 6. Agenda • Lifetime of a Query • Index Building • Query Understanding • Candidate Selection & Retrieval • Case Study - LinkedIn Job Search & Recommendations • Hands on tutorial with News Aggregator Dataset o Build index with NER fields o Explore Query Rewriting for Fast Retrieval
  • 7. Preliminary Terminologies • The Search Index • Forward Index: Mapping from documents to content Used in scoring • Inverted Index: Mapping from (search) terms to list of documents they occur in, called postings list Used in retrieving and scoring Document Words Document 1 the,cow,says,moo Document 2 the,cat,and,the,hat Document 3 the,dish,ran,away,with,the,spoon
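The relationship between the forward index and the inverted index can be sketched with the three example documents from this slide. This is a minimal illustration only; the names `forward_index` and `build_inverted_index` are hypothetical and not taken from the tutorial code.

```python
from collections import defaultdict

# Forward index: document ID -> words, using the three example documents.
forward_index = {
    1: ["the", "cow", "says", "moo"],
    2: ["the", "cat", "and", "the", "hat"],
    3: ["the", "dish", "ran", "away", "with", "the", "spoon"],
}


def build_inverted_index(forward):
    """Map each term to a sorted postings list of the docIDs containing it."""
    postings = defaultdict(set)
    for doc_id, words in forward.items():
        for word in words:
            postings[word].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in postings.items()}


inverted_index = build_inverted_index(forward_index)
print(inverted_index["the"])  # [1, 2, 3]
print(inverted_index["cat"])  # [2]
```

Retrieval for a single term is then a dictionary lookup, while the forward index remains available for scoring the retrieved documents.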
  • 8. Preliminary Terminologies • Postings List: Simple case: a list of documents that contain the individual term More flexible: list of docID: <position1, position2, …> and term frequency • Shard: one partition of document-partitioned index shard0 shard1
  • 9. Lifetime of a Query Browser/Device Search Frontend User Query Query + Metadata Query + Metadata Structured Query + Metadata Quantitative Software Engineer Backend Vertical Broker Federator TITLE = quantitative software engineer SKILL = quantitative TITLE = software engineer OR …More Verticals
  • 10. Federator/Broker: Query Rewriter TITLE = quantitative software engineer OR SKILL = quantitative TITLE = software engineer jobTitle | jobDescription : quantitative software engineer jobSkills | jobDescription | jobTitle : quantitative) jobTitle | jobDescription : software engineer || developer || programmer || software developer ||software engineering ||software programmer OR … Quantitative Software Engineer Federator Job Search Broker
  • 11. Lifetime of a Query Browser/Device Search Frontend User Query Query + Metadata Query + Metadata Structured Query + Metadata Quantitative Software Engineer Backend Vertical Broker Searcher (One shard) Searcher (One shard) Searcher (One shard) Structured Query + Metadata Federator …More shards …More Verticals … jobTitle OR jobDescription = quantitative software engineer jobSkills OR jobDescription OR jobTitle = quantitative) jobTitle OR jobDescription = software engineer || developer || programmer || software developer ||software engineering ||software programmer OR
  • 12. Searcher • Operates on a single shard of the inverted index. • Receives the rewritten query + metadata from the Broker • Retrieves matching documents from the inverted index. • The documents are scored and the top scoring documents are returned to the Broker.
  • 13. Scoring • Boolean Model (BM) • 1 or 0 • Fast, easy to implement • but hard to rank, since terms are not weighted • Vector Space Model (VSM) (cosine similarity of query q and document d) • TF-IDF • BM25 • Weighted terms • Assumes independence between terms • Probabilistic Model (P(Relevant = 1 | Q = q, D = d)) • Binary Independence Model • Logistic Regression
  • 14. Lifetime of a Query Browser/Device Search Frontend User Query Query + Metadata Query + Metadata Structured Query + Metadata Quantitative Software Engineer Backend Vertical Broker Searcher (One shard) Searcher (One shard) Searcher (One shard) Results from ShardStructured Query + Metadata Federator …More shards …More Verticals …… jobTitle OR jobDescription = quantitative software engineer jobSkills OR jobDescription OR jobTitle = quantitative) jobTitle OR jobDescription = software engineer || developer || programmer || software developer ||software engineering ||software programmer OR
  • 15. Federator/Broker: Reranker • Diversity Based reranking • maximize weighted sum of relevance score but penalized by similarity of the documents • Business rule based reranking • Maximize revenue
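One common way to realize "relevance penalized by similarity" is a greedy selection in the style of maximal marginal relevance. The sketch below assumes that formulation; the slide does not say which objective the tutorial uses, and the function name, the `lam` trade-off parameter and the toy scores are all hypothetical.

```python
def rerank_diverse(candidates, relevance, similarity, lam=0.5):
    """Greedily pick the document with the best trade-off between its own
    relevance and its similarity to documents that were already selected."""
    selected = []
    remaining = list(candidates)
    while remaining:
        def marginal_score(doc):
            # Penalty is the highest similarity to anything already chosen.
            penalty = max((similarity(doc, s) for s in selected), default=0.0)
            return lam * relevance[doc] - (1 - lam) * penalty

        best = max(remaining, key=marginal_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "a" and "b" are near-duplicates, "c" is different from both.
relevance = {"a": 0.9, "b": 0.85, "c": 0.6}
pair_similarity = {frozenset(("a", "b")): 0.95}


def similarity(x, y):
    return pair_similarity.get(frozenset((x, y)), 0.1)


reranked = rerank_diverse(["a", "b", "c"], relevance, similarity)
print(reranked)  # ['a', 'c', 'b']
```

With a pure relevance sort the near-duplicate "b" would be second; the similarity penalty moves the less relevant but different "c" ahead of it.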
  • 16. Lifetime of a Query Browser/Device Search Frontend User Query Results Query + Metadata Results Query + Metadata Blended Results Structured Query + Metadata Quantitative Software Engineer Vertical Results Backend Vertical Broker Searcher (One shard) Searcher (One shard) Searcher (One shard) Results from ShardStructured Query + Metadata Federator …More shards …More Verticals …
  • 17. Agenda • Lifetime of a Query • Index Building • Query Understanding • Candidate Selection & Retrieval • Case Study - LinkedIn Job Search & Recommendations • Hands on tutorial with News Aggregator Dataset o Build index with NER fields o Explore Query Rewriting for Fast Retrieval
  • 18. Index Building Collect the Documents to be Indexed Tokenization and Named Entity Recognition Linguistic Preprocessing of Tokens Index the Documents that each term occurs in Sample Job Posting
  • 19. Index Building Collect the Documents to be Indexed Tokenization and Named Entity Recognition Linguistic Preprocessing of Tokens Index the Documents that each term occurs in Sample Job Posting
  • 20. Index Building Collect the Documents to be Indexed Tokenization and Named Entity Recognition Linguistic Preprocessing of Tokens Index the Documents that each term occurs in title skill Sample Job Posting
  • 21. Index Building: Named Entity Recognition (NER) • Hidden Markov Models • Assumes Markov Property • Models joint P(X, Y) where X are observed data, and Y the label of X • Maximum Entropy Markov Models • Suffers from label bias • Models conditional P(Y|X) • CRF Models • Models conditional P(Y|X) • Does not assume Markov property
  • 22. Index Building Collect the Documents to be Indexed Tokenization and Named Entity Recognition Linguistic Preprocessing of Tokens Index the Documents that each term occurs in BSc NLP Bachelor of Science Sample Job Posting
  • 23. Index Building: Index Compression • Motivation: Keep as much of the dictionary and postings lists in memory as possible • Dictionary compression • Dictionary as a long string • Pointer to next word shows end of current word • Blocked Storage: Store pointers to every kth term string ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo…. Freq. Postings ptr. Term ptr. 33 29 44 126
  • 24. Index Building: Index Compression • Postings list compression • Goal: store the postings lists of more frequent terms with fewer bits • Suffices to store the gaps between docIDs • Variable-byte encoding and Gamma encoding
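Gap encoding followed by variable-byte encoding can be sketched as follows. This is the standard textbook scheme (7 payload bits per byte, with the high bit set on the last byte of each number), not code from the tutorial repository; the function names are hypothetical.

```python
def to_gaps(doc_ids):
    """Replace each docID in a sorted postings list by its gap to the previous one."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Invert to_gaps by accumulating the gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vb_encode(numbers):
    """Variable-byte encode non-negative integers, 7 payload bits per byte."""
    out = bytearray()
    for n in numbers:
        chunk = [n % 128]
        n //= 128
        while n > 0:
            chunk.insert(0, n % 128)
            n //= 128
        chunk[-1] += 128  # the high bit marks the final byte of a number
        out.extend(chunk)
    return bytes(out)


def vb_decode(data):
    """Decode a byte string produced by vb_encode."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers


postings = [824, 829, 215406]
encoded = vb_encode(to_gaps(postings))
print(len(encoded))                   # 6
print(from_gaps(vb_decode(encoded)))  # [824, 829, 215406]
```

Small gaps, which are typical for frequent terms, fit in a single byte, which is why frequent terms end up with fewer bits per posting.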
  • 25. Agenda • Lifetime of a Query • Index Building • Query Understanding • Candidate Selection & Retrieval • Case Study - LinkedIn Job Search & Recommendations • Hands on tutorial with News Aggregator Dataset o Build index with NER fields o Explore Query Rewriting for Fast Retrieval
  • 26. Query Segmentation • Task of dividing the search query into segments (tokens/phrases) by identifying semantic entities present in the search query. • Helps improve the precision of the candidate set by utilizing query segments in query rewriting • Query Segmentations: Oracle Java Application Developer
  • 27. Query Tagging & Annotation • Tag query segments based on recognized entity tags. • Annotate the tagged segments with: • Standardized identifiers • Related Entities • Entity Specific Metadata • COMPANY = Oracle TITLE = Java Application Developer • COMPANY = Oracle SKILL = Java TITLE = Application Developer
  • 28. Query Segmentation Approaches • Dictionary Based • Simple approach to utilize corpus data to segment phrases/tokens in the query. • Score based on co-occurrence is assigned to disambiguate multiple segments. • Machine Learning Based • Model the problem as a sequence to sequence learning problem or a classification problem. • Utilize correctly labeled segments from corpus / human evaluation data to learn the model • Ex. HMM, CRF
  • 29. Use in Candidate Selection • Query Segments instead of individual tokens are used as semantic units for matching document fields and retrieving correct documents. • Precise candidates retrieved with reduced search space resulting in improved latency. • Utilize multiple segments to diversify the retrieved result set. • COMPANY = Oracle TITLE = Java Application Developer • SKILL = Oracle SKILL = Java TITLE = Application Developer • Title Company Title Skills
  • 30. Query Expansion • Task of adding additional tokens/phrases to the query to improve recall. • Includes: • Synonyms • Abbreviations • Related Terms • Critical to understand when to expand • COMPANY = Oracle OR NetSuite OR Taleo OR Sun Microsystems OR … SKILL = Java OR Java EE OR J2EE OR JVM OR JRE OR JDK … TITLE = Application Developer OR Software Engineer OR Software Developer OR Programmer … • Green – Synonyms, Blue – Related Entities
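Dictionary-driven expansion of a tagged query can be sketched as below. The expansion dictionary is a hypothetical stand-in built from the slide's own examples; a real system would mine it from query logs, sessions or embeddings, and would decide per query whether expanding is appropriate.

```python
# Hypothetical expansion dictionary keyed by (entity tag, lowercased value).
EXPANSIONS = {
    ("SKILL", "java"): ["Java EE", "J2EE"],
    ("TITLE", "application developer"): ["Software Engineer", "Software Developer"],
}


def expand_query(tagged_segments, expansions):
    """Turn (tag, value) segments into one OR clause per segment."""
    clauses = []
    for tag, value in tagged_segments:
        terms = [value] + expansions.get((tag, value.lower()), [])
        clauses.append(f"{tag} = " + " OR ".join(terms))
    return clauses


tagged = [
    ("COMPANY", "Oracle"),
    ("SKILL", "Java"),
    ("TITLE", "Application Developer"),
]
expanded = expand_query(tagged, EXPANSIONS)
for clause in expanded:
    print(clause)
```

Segments without a dictionary entry, such as COMPANY = Oracle here, pass through unchanged.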
  • 31. Synonyms and Abbreviations • Dictionary based approach • Use a pruned dictionary of common abbreviations and synonyms • Ex. CEO – Chief Executive Officer, VP – Vice President • Simple but limiting, and requires post-processing for disambiguation, e.g. CA – California vs. Chartered Accountant? • Utilize query reformulations and search sessions to extract synonyms • Ex. Software Engineer OR Software Developer • Ex. Application Developer -> Software Engineer -> Software Developer (Same Session)
  • 32. Word Embeddings for Synonym Expansion • Word embeddings are projections of words in a dense low dimensional space. • Allows capturing synonyms, lemmatizations, cross language representations and relatedness • Different unsupervised/supervised approaches to experiment with – Word2Vec, GloVe, DSSM etc. • Examples • softwareentwickler - software engineer, engineer software, software engineering • marketing – advertising, sales, digital marketing • senior - sr, snr, lead
  • 33. Query Relaxation • Task of removing tokens/phrases from the query to make it less restrictive and increase recall. • Useful for searches with few results • Critical to understand when to relax to balance precision and recall
  • 34. Which terms to relax? • Term Importance • Utilize query reformulations to learn token importance by focusing on addition and removal of tokens • Assign scores based on query co-occurrence, historical result count, idf etc. • Build dictionary of terms and their importance • Word Embeddings • Utilize word embeddings to find the closest similar queries • Can be done at a query level (qa → qb) or at an entity level within the queries (ea → eb)
  • 35. Using Query Understanding Techniques in Personalized Recommendations • The techniques described for the query are also applicable to textual fields on the user profile • Ex. Extracting Company and Title from User Headline - “Software Engineer at LinkedIn” • Ex. Relaxing User Title in the Personalization Query – “Engineering Manager Machine Learning -> Manager Machine Learning”
  • 36. How to apply these techniques • Function of relevance and the count of final result set • Query Plans • Early / Late Decision • Federated Search and Blending • [Figure: Query Plans and Result Federation over the Search Index; labels q, qo, qe, qr, r1, r2, rf, r]
  • 37. Agenda • Lifetime of a Query • Query Understanding • Index Building • Candidate Selection & Retrieval • Case Study - LinkedIn Job Search & Recommendations • Hands on tutorial with News Aggregator Dataset o Build index with NER fields o Explore Query Rewriting for Fast Retrieval
  • 40. Need for Candidate Selection? • Ranking is an expensive operation • A lot of context, such as the user’s past behavior and the user’s profile, can be used around search to select candidates • Recommendations are highly personalized • Users expect a fast, updated and contextually aware system
  • 41. Generic Problem Formulation • Build a first pass filter that mimics final ranking function • Focus on Recall, Don’t forget Precision • CQ = f(EQ, EP, LP, U, C) • CQ -> Constructed Query, EQ -> Explicit Query, EP -> Explicit Preference, LP -> Latent Preference, U -> User, C -> Context • RF = f(U, CQ, D) • RF -> Ranking Function, D -> Document • Optimization function to retrieve candidate set M such that M is a superset of Top K ranked by ranking function
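The requirement that the candidate set M be a superset of the top K documents under the ranking function can be checked offline as a recall measure. A minimal sketch, assuming a full ranking of the corpus is available for evaluation; the function name and the toy document IDs are hypothetical.

```python
def top_k_recall(candidates, ranked_docs, k):
    """Fraction of the ranker's top-k documents that the candidate set kept."""
    top_k = ranked_docs[:k]
    if not top_k:
        return 1.0
    candidate_set = set(candidates)
    hits = sum(1 for doc in top_k if doc in candidate_set)
    return hits / len(top_k)


# Full ranking produced by the (expensive) ranking function over all documents.
full_ranking = ["d7", "d2", "d9", "d4", "d1"]
# Documents returned by the first-pass candidate selection query.
candidate_set_m = {"d2", "d4", "d7", "d8"}

recall = top_k_recall(candidate_set_m, full_ranking, k=3)
print(round(recall, 3))  # 0.667
```

A recall of 1.0 at K means M is a superset of the top K; the size of M relative to the corpus then indicates how much ranking work was saved.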
  • 42. Where can we do candidate selection • Index Indexer Top-K retrieval Results Offline Training / Model Result Ranking Query Rewriting • Term Match • Decision Trees • Linear model • User Query
  • 43. Where can we do candidate selection • Index Indexer Term Match Results Offline Training / Model Result Ranking First Pass Ranker User Query
  • 44. Approaches to Candidate Selection • Explicit query term matching [Explicit Query Matching] • Generalized Linear Model for learning weights for constructed query clauses [Query Rewriting]
  • 45. Approaches to Candidate Selection • Tree models to learn associations between entities in query and documents [Query Rewriting] • Deep Learning models for transforming query and documents into a lower dimensional space, then using a distance metric to filter out irrelevant documents [Ranking]
  • 46. Naive Candidate Selection • Explicit Query Matching • Retrieval using term matching • Works on small index • Generates a lot of false positives • “QA Architect” → “Landscape Architect”, “Java Architect”, “Software Test Architect” • Retrieval using rules • Requires domain knowledge • (user_title ^ job_title) & (user_skill ^ job_skill) • Difficult to update
  • 47. Understanding WAND Query Query : “Quality Assurance Engineer” AND Query: “Quality AND Assurance AND Engineer” ✅ ❌
  • 48. Understanding WAND Query Query : “Quality Assurance Engineer” WAND : “(Quality[5] AND Assurance[5] AND Engineer[1]) [10]” ✅ ✅
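The matching rule behind the WAND example can be sketched as a weighted threshold test: a document matches when the weights of the query terms it contains sum to at least the threshold. This shows only the matching predicate, using the weights and threshold from the slide; production WAND implementations also use score upper bounds to skip postings, which is not shown, and the function name is hypothetical.

```python
def wand_match(doc_terms, weighted_terms, threshold):
    """Return True when the matched term weights reach the threshold."""
    score = sum(weight for term, weight in weighted_terms.items()
                if term in doc_terms)
    return score >= threshold


# Weights and threshold from the slide: (Quality[5] AND Assurance[5] AND Engineer[1]) [10]
clauses = {"quality": 5, "assurance": 5, "engineer": 1}

print(wand_match({"quality", "assurance", "engineer"}, clauses, 10))  # True
print(wand_match({"quality", "assurance", "analyst"}, clauses, 10))   # True
print(wand_match({"quality", "engineer"}, clauses, 10))               # False
```

Setting the threshold to the sum of all weights (11 here) reduces the query to a plain AND, and setting it to the smallest weight reduces it to a plain OR.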
  • 49. Why Generalized Linear Model • Learn weights of query clauses • For larger indexes and structured data with many entities. • Can be applied to: • Retrieval • Ranking • Easier to explain using good debugging capabilities. • Cannot capture interaction between clauses.
  • 50. How to apply? • Query: “Bryer Ice Cream” company = bryer category = ice cream • Rewritten Query: WAND(company:bryer[5] AND category:ice cream[5] AND description:ice cream[2] AND description:bryer[2]) [11] • WAND operator for your index • Label data collection • Generate Query Clauses • Train GLM • Construct Query From Learnt weights • Use constructed query to retrieve results
  • 51. Motivation for using Decision Trees • Interactions are important • “Architect” -> “QA Architect”, “Landscape Architect” • Title Match AND Function Match • Used for query rewriting • Complex interactions can be learnt • An example query would be • (Title Match AND Function Match) OR (Function Match) OR (Seniority Match) • [Figure: decision tree with nodes Title Match, Seniority Match and Function Match, Yes/No branches, and Positive/Negative leaves]
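Turning a learnt tree into a retrieval query amounts to collecting every root-to-leaf path that ends in a positive leaf and OR-ing the paths together. The sketch below uses a small hypothetical tree; it is not the exact tree from the slide's figure, whose branch layout is not recoverable from the transcript.

```python
# A node is either a leaf label ("Positive" / "Negative") or a tuple
# (feature, subtree_if_no, subtree_if_yes). This tree is illustrative only.
tree = (
    "title_match",
    ("seniority_match", "Negative", "Positive"),
    ("function_match", "Negative", "Positive"),
)


def positive_paths(node, path=()):
    """Collect the condition lists of all paths ending in a Positive leaf."""
    if isinstance(node, str):
        return [path] if node == "Positive" else []
    feature, no_branch, yes_branch = node
    return (positive_paths(no_branch, path + ((feature, False),))
            + positive_paths(yes_branch, path + ((feature, True),)))


def to_query(paths):
    """Render each path as an AND clause and join the clauses with OR."""
    clauses = []
    for path in paths:
        terms = [name if matched else f"NOT {name}" for name, matched in path]
        clauses.append("(" + " AND ".join(terms) + ")")
    return " OR ".join(clauses)


query = to_query(positive_paths(tree))
print(query)
# (NOT title_match AND seniority_match) OR (title_match AND function_match)
```

In practice the negated conditions are often dropped so that only positive clauses reach the index, which trades some precision for a simpler and faster query.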
  • 52. What ifs? • What if we don’t have an index • What if we don’t have a way to determine entities • What if we don’t have any knowledge of the data set
  • 53. Motivation for Deep Learning • No structured information available or only semi structured information available • Generate embeddings using raw content such as text, images, videos etc. • Tools like Word2vec, GloVe and DSSM can be used to generate embeddings • Use techniques like clustering, KNN to find candidates
  • 54. Agenda • Lifetime of a Query • Index Building • Query Understanding • Candidate Selection & Retrieval • Case Study - LinkedIn Job Search & Recommendations • Hands on tutorial with News Aggregator Dataset o Build index with NER fields o Explore Query Rewriting for Fast Retrieval
  • 55. Case Studies for Jobs at LinkedIn • Jobs Ecosystem at LinkedIn • Job Search • Job Recommendation • Similar Jobs • Candidate Selection Techniques Used at LinkedIn • Logistic Regression model • Tree based model
  • 56. Jobs Ecosystem • Semi-Structured Job Listings • Posted by Recruiters or scraped from the web • A lifetime of 30 days, with options to renew or repost a job thereafter • 10+ Million active job postings • Daily churn of 150 – 300 K job postings • Core Products – Search and Recommendations, Browse Maps
  • 57. Job Recommendations • Recommend jobs to the user based on Apply likelihood. • User Features • Profile, Network, Activity etc. • Job Features • Poster, Company, Description, Title, Location, Skills etc. • Past Interactions • Apply, Click, Search etc.
  • 58. Job Search • Surface relevant jobs when user specifies an explicit query • User Features • Profile, Network, Activity etc. • Query Features • Skills, title, company etc. • Job Features • Poster, Company, Description, Title, Location, Skills etc. • Past Interactions • Apply, Click, Search etc.
  • 59. Candidate Selection Models • Generalized Linear models • Tree based models
  • 60. Linear Model for Candidate Selection • Use WAND operator to rewrite the query • Train a logistic regression model with feature shrinking to learn the weights on the clauses • Evaluate the model using AUC
  • 61. Algorithm for constructing query clauses • Input: • Member profile data (Past titles, Current Title, Skills, …) • Explicit Query in case of search • Target: Job data (Title, Skills, Description, Seniority, …) • Steps: • Construct a set F of possible pairs of fields <Member field, Job field>. • Example: (member-Title ∧ job-Title) • Generate labelled data using click data • Train a logistic regression model with feature shrinking • Evaluate the model offline using AUC.
  • 62. Training Pipeline • User interaction log data • Training data generation • Generate feature vector based on predefined clauses • Create negative training data • Create positive training data • Split into training and test data • Train machine learning algorithm (logistic regression) • Tune the threshold using ROC or PR-curve
  • 63. Feature Shrinking • Remove features with small and negative weights • Retrain the model (without features with negative coefficients) • Repeat until all negative feature weights are gone • Output the model with all positive coefficients
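The feature shrinking loop can be sketched independently of the learner: fit, drop features whose weight is not above a cut-off, and refit until every remaining weight is positive. The `fit` argument stands for any training routine that returns a weight per feature (for example a logistic regression fit); `toy_fit`, the feature names and the `min_weight` parameter are hypothetical and only make the sketch runnable.

```python
def shrink_features(features, fit, min_weight=0.0, max_rounds=20):
    """Refit repeatedly, dropping features whose weight is <= min_weight,
    until all remaining features have weights above the cut-off."""
    active = list(features)
    for _ in range(max_rounds):
        if not active:
            return {}
        weights = fit(active)
        keep = [f for f in active if weights[f] > min_weight]
        if len(keep) == len(active):
            return {f: weights[f] for f in active}
        active = keep
    raise RuntimeError("feature shrinking did not converge")


def toy_fit(active):
    """Stand-in for model training; weights shift once 'industry' is removed."""
    weights = {"title": 1.2, "skill": 0.8, "industry": -0.3, "seniority": 0.1}
    if "industry" not in active:
        weights["seniority"] = -0.05
    return {f: weights[f] for f in active}


final_weights = shrink_features(["title", "skill", "industry", "seniority"], toy_fit)
print(final_weights)  # {'title': 1.2, 'skill': 0.8}
```

Positive-only weights matter here because the learnt weights become clause weights in a WAND-style query, where a clause can only add to a document's score.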
  • 64. Candidate Selection Models at LinkedIn • Generalized Linear models • Tree based models
  • 65. Decision Tree Based Approach • LinkedIn has a lot of structured entities • Standardized entities like title, function, seniority etc. are available on both member and job • Interactions between entities are learnt using any decision tree algorithm like ID3 or boosted trees • Use the positive clauses to rewrite a query as explained earlier
  • 67. Agenda • Lifetime of a Query • Query Understanding • Candidate Selection & Retrieval • Case Study - LinkedIn Job Search & Recommendations • Hands on tutorial with News Aggregator Dataset o Build index with NER fields o Explore Query Rewriting for Fast Retrieval
  • 68. Dataset • Content • This dataset contains headlines, URLs, and categories for 422,937 news stories collected by a web aggregator between March 10th, 2014 and August 10th, 2014. • News categories included in this dataset include business; science and technology; entertainment; and health. • Columns • ID, TITLE, URL, PUBLISHER, CATEGORY, STORY, HOSTNAME, TIMESTAMP • Acknowledgments • Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. • This specific dataset can be found in the UCI ML Repository at this URL
  • 70. Architecture • Users Frontend Mid Tier News Dataset Index (Apache Solr) Query Builder Ranker Stanford NER Server Document Understanding & Creation
  • 72. Assignments • Assignments available on GitHub • Each assignment builds on a component of the end product • You should be able to test your code with the pre-built UI • Finished files available for reference (if needed) • Raise hand if you need help or have a question
  • 73. Assignment 0 Setting up the development environment
  • 74. Assignment 1 Building a simple index and searching on it with a catch-all Lucene field
  • 75. Take Aways • Building an index is not instantaneous, hence have replicas in production • Real world indexes can seldom be stored in a single shard • Data sources should be kept separate and be maintained as source of truth • A catch-all field increases number of candidates and limits precision
  • 76. Assignment 2 Building an entity aware index with help of Stanford NER and searching with NER tags at query time
  • 77. Take Aways • Understanding documents and structuring the index is critical for candidate selection • Query understanding plays a crucial role in balancing precision and recall • Serving top-k partial results is better than serving no results.
  • 78. Assignment 3 Personalized query rewriting for empty query use case (recommendations)
  • 79. Take Aways • Utilize personalization in retrieval for reducing documents to be retrieved and ranked • A search system provides facets on top of recommendations • Utilize custom operators for applying business rules to retrieval and ranking
  • 80. Summary • Understanding of a large scale search system, need for query understanding and candidate selection • Insights and learning from LinkedIn case studies • Working end-to-end personalized search system implementation with open source tools and dataset.
  • 81. References • Building The LinkedIn Knowledge Graph by Qi He [https://engineering.linkedin.com/blog/2016/10/building-the-linkedin-knowledge-graph] • Li, J., Arya, D., Ha-Thuc, V., & Sinha, S. (2016, August). How to Get Them a Dream Job?: Entity-Aware Features for Personalized Job Search Ranking. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016. • Borisyuk, Fedor, et al. "CaSMoS: A Framework for Learning Candidate Selection Models over Structured Queries and Documents." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016. • Zhang, XianXing, et al. "Glmix: Generalized linear mixed models for large-scale response prediction." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016. • Grover, A., Arya, D., Venkatraman, G. “Latency reduction via Decision Tree Based Query Construction” Proceedings of the 26th ACM International on Conference on Information and Knowledge Management. ACM, 2017. • Venkatraman, Ganesh et al. “Conventional Tutorial - Deep Learning for Personalized Search and Recommender Systems” ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017. • P. Covington, J. Adams, and E. Sargin. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems, pages 191–198. ACM, 2016. • Did you mean "Galene"? https://engineering.linkedin.com/search/did-you-mean-galene