This document summarizes Simon Hughes' presentation on personalized search and job recommendations. Hughes is the Chief Data Scientist at Dice.com, where he works on recommender engines, skills pages, and other projects. The presentation discusses relevancy feedback algorithms like Rocchio that can be used to improve search results based on user interactions. It also describes how content-based and collaborative filtering recommendations can be provided in real-time using Solr plugins. Finally, it shows how personalized search can be achieved by boosting results matching a user's profile or search history.
2. Who Am I?
• Chief Data Scientist at Dice.com and DHI under Yuri Bykov
• Dice.com – leading US job board for IT professionals
• Twitter handle: https://twitter.com/hughes_meister
• Key projects
- Dice recommender engines
- Dice market value tool (salary predictor)
- Dice career path tool
- Dice skills pages
• PhD
- PhD candidate at DePaul – ML and NLP
- Thesis topic – detecting causality in scientific explanatory essays
6. Outline
• Relevancy Feedback
• The Rocchio Algorithm
• Implementation Details
- Open source Solr plugins
- Naïve entity extraction
• Use Cases
- Conceptual / Semantic Search
- Real-time recommendations
- Personalized Search
- Query and Filter Suggestions
7. Motivation
2 Main Problems In Achieving Relevancy
• Polysemy – words/phrases can have multiple meanings
- Causes problems with Precision
- Need to disambiguate the query – determine query intent
• Synonymy – multiple words/phrases with the same meaning
- Causes problems with Recall
2 types of solution
• Global Methods – adjust query based on analysis of entire index
- Improve Precision – LTR, Probabilistic Query Parsing, Reinforcement learning, etc.
- Improve Recall – Synonyms, Conceptual Search, Thesaurus/Ontology Learning
• Local Methods – adjust a query relative to the documents that match
- Improve Precision – Relevancy feedback
- Improve Recall – Blind feedback
9. Relevancy Feedback
‘Supervised’ relevancy feedback uses information from the user’s
profile or search behavior to adjust their search results to improve
relevancy.
‘Unsupervised Relevancy Feedback’ or ‘Blind Feedback’ instead uses
co-occurrence information in the search index to improve relevancy.
Both of these mechanisms can be implemented as simple Solr plugins
using the Rocchio Algorithm
10. Relevancy Feedback Process
1. User issues a short query
2. System returns an initial set of results
3. The results are annotated as relevant / non-relevant
4. The system computes a better query using this feedback
5. Second (improved) result set is returned to the user
12. The Rocchio Algorithm
The Rocchio Algorithm is typically used for both forms of
relevancy feedback
• A set of relevant documents is chosen for a given query
• From these documents, a query vector is computed that represents the relevant documents
• This vector is used to formulate a new query, which is then executed to produce a more relevant result set
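The steps above are the standard Rocchio formula, q_m = α·q + β·centroid(relevant) − γ·centroid(non-relevant). A minimal sketch using sparse term→weight dicts (the α/β/γ defaults below are the commonly cited textbook values, not necessarily what the Dice plugin uses):

```python
def rocchio(query_vec, relevant, non_relevant,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Classic Rocchio update: alpha*q + beta*centroid(relevant)
    - gamma*centroid(non_relevant). Vectors are sparse term->weight
    dicts; negative weights are dropped from the final query."""
    new_q = {t: alpha * w for t, w in query_vec.items()}
    for docs, sign, coeff in ((relevant, 1.0, beta),
                              (non_relevant, -1.0, gamma)):
        if not docs:
            continue
        for doc in docs:
            for term, w in doc.items():
                new_q[term] = new_q.get(term, 0.0) + sign * coeff * w / len(docs)
    return {t: w for t, w in new_q.items() if w > 0}

query = {"java": 1.0}
relevant = [{"java": 1.0, "spring": 0.8},
            {"java": 0.5, "hibernate": 0.6}]
modified = rocchio(query, relevant, [])
# terms from the relevant documents ("spring", "hibernate") now appear
# in the expanded query, weighted below the original query term
```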
14. Supervised Feedback – Implicit vs Explicit
Explicit Feedback
• User explicitly tells you what they want through some action
- E.g. buys a product, applies for a job, rates a movie.
Implicit Feedback
• User preferences can be inferred from user behavior
- E.g. User views a web page, clicks on a search result, hovers their
mouse over an item, etc
• Weak signal - additional data can be gathered to strengthen signal
- E.g. Time spent on page, depth of navigation before clicking, etc.
15. Implementation Details
Two Solr plugins, similar to the MoreLikeThis (MLT) handler
• Configure a number of fields in Solr to do naïve entity extraction (see later slides)
• For a given query, extract all entities from these fields for each document, scored by tf.idf
- TF – number of occurrences of each term in the relevant documents
• Pick the top k terms per field, weighted by tf.idf score
• Normalize each field to have unit length
• Weight each term by its normalized weighting, multiplied by the field boost
• In a number of internal tests, this improved the Mean Average
Precision over the standard MLT for job recommendations
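The per-field extraction and normalization steps above can be sketched as follows; the function signature, default constants, and simplified tf.idf formula are illustrative, not the plugin's actual code:

```python
import math
from collections import Counter

def top_terms_per_field(docs, field_boosts, k=10,
                        num_docs_in_index=100_000, doc_freq=None):
    """Pick the top-k terms per field by tf.idf, normalize each field's
    weight vector to unit length, then scale by that field's boost.

    docs        : relevant documents as {field: [tokens]} dicts
    field_boosts: {field: boost} – which fields to use and their boosts
    doc_freq    : {term: df} – docs in the whole index containing the term
    """
    doc_freq = doc_freq or {}
    weighted = {}
    for field, boost in field_boosts.items():
        # TF – number of occurrences of each term in the relevant docs
        tf = Counter(tok for d in docs for tok in d.get(field, []))
        tfidf = {t: c * math.log(num_docs_in_index / (1 + doc_freq.get(t, 0)))
                 for t, c in tf.items()}
        top = sorted(tfidf.items(), key=lambda kv: kv[1], reverse=True)[:k]
        # unit-length normalization so smaller fields get equal weighting
        norm = math.sqrt(sum(w * w for _, w in top)) or 1.0
        weighted[field] = {t: boost * w / norm for t, w in top}
    return weighted
```

Normalizing per field before applying the boost is what stops a long field (e.g. job description) from drowning out a short one (e.g. job title).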
16. Naïve Entity Extraction
Most documents are long, yet contain few useful content words
• Rocchio algorithm works much better if it uses fields containing only
the most important entities or keywords in your domain
- E.g. for Dice, we extract job titles and skills
• Naïve entity extraction – configure a set of keywords and phrases to extract
• Solutions
1. SolrTextTagger – a good 3rd party tool for entity extraction
2. Use a sequence of synonym filters, followed by a type filter or keep-word filter. Very fast – the synonym filter uses an FST
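Option 2 can be simulated in a few lines of Python; the skill list and synonym map below are made-up examples (in Solr these would live in synonym and keep-word filter files, not code):

```python
# Illustrative skill list and synonym map – the real plugin configures
# these as Solr synonym and keep-word filter files, not Python dicts.
SYNONYMS = {"js": "javascript", "ecmascript": "javascript",
            "apache hadoop": "hadoop"}
KEEP = {"java", "javascript", "hadoop", "spring", "big data"}

def extract_entities(text, keep=KEEP, synonyms=SYNONYMS):
    """Simulate a synonym filter (phrase -> canonical form) followed by
    a keep-word filter that discards everything not in the keep list.
    Tries two-word phrases before single tokens, longest match first."""
    tokens = text.lower().split()
    out, i = [], 0
    while i < len(tokens):
        for n in (2, 1):
            phrase = " ".join(tokens[i:i + n])
            canon = synonyms.get(phrase, phrase)
            if canon in keep:
                out.append(canon)
                i += n
                break
        else:
            i += 1
    return out
```

The payoff is that a long job posting collapses to a handful of canonical domain terms, which is exactly what Rocchio needs to work well.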
20. Supervised Feedback – Use Cases
• Allow users to search from examples instead of just from a query
- E.g. image search. Show them images matching initial query,
allow them to select which are more relevant
- Users often find it easier to show you examples of what they want than to formulate a complex query
• Personal recommendations based on browse history
- E.g. Recommend jobs based on the last 5 jobs viewed
21. Blind Feedback
Uses the top-ranked results from the search to expand the query and improve recall
1. Execute the query and take the results from the top n documents
(10-50)
2. Extract the top k terms by tf.idf score (10-30)
3. Use these terms to do query expansion, and re-execute the query
• Also called “Pseudo-Relevancy Feedback”
• Has been shown to be highly effective in some situations (see notes)
• Has a performance penalty – query is executed twice
- Can be partly mitigated by intelligent caching
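The three steps can be sketched as a small loop. `search` below is a toy stand-in for the Solr request, and term selection uses raw counts rather than tf.idf for brevity:

```python
from collections import Counter

def blind_feedback(search, query, n=10, k=20):
    """Pseudo-relevance feedback sketch: execute the query, treat the
    top-n hits as relevant, extract the k most frequent terms from
    them, and re-execute an expanded query. Real term selection would
    use tf.idf, as on the earlier slides."""
    initial = search(query)[:n]                       # first execution
    counts = Counter(tok for doc in initial for tok in doc.split())
    expansion = [t for t, _ in counts.most_common(k) if t not in query.split()]
    return search(query + " " + " ".join(expansion))  # second execution

# Toy corpus and search function: rank documents by query-term overlap
corpus = ["hadoop big data engineer", "java spring developer hadoop",
          "big data hadoop spark", "accountant finance gaap"]

def search(q):
    terms = set(q.split())
    return sorted(corpus, key=lambda d: -len(terms & set(d.split())))
```

Note the performance penalty from the slide is visible here: `search` runs twice per user query.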
24. Recommendations
Three Main Types of Recommender
• Content Based
- Uses information from the user's profile to generate recommendations
- E.g. use a resume to find matching jobs
• Collaborative Filtering
- Find documents similar to those a user has liked previously
- E.g. Find jobs similar to jobs they have applied to
• Hybrid Recommender
- Combines both approaches
All of these can be achieved in real-time using our plugin
25. Content Based Recommendations
Plugin is sent a content stream via a POST call
• Entity extraction is performed by Solr in real-time
- Extracts job titles
- Extracts skills
• Query is formulated using the top k terms, as before
• A location-based boost is applied, using a boost query to boost documents close to the user's location
Dice has a batch recommender algorithm that powers most of our
recommendations.
This plugin powers our real-time recommendations (new documents)
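A content-based request can be sketched as a set of Solr parameters. The `rf.fl` parameter name and the `location_geo` field are illustrative assumptions, not the plugin's documented API; the `recip(geodist(...))` boost is standard Solr function-query syntax:

```python
# Hypothetical parameters for a real-time content-based recommendation
# request. The resume text is sent as a content stream in the POST body;
# "rf.fl" (fields to extract entities from) and "location_geo" are
# illustrative names, not the plugin's documented API.
resume_text = "Java developer, 5 years of Spring, Hibernate and SQL"
params = {
    "stream.body": resume_text,       # content stream via POST
    "rf.fl": "title,skills",          # fields used for entity extraction
    "rows": 10,
    # boost jobs close to the user's location (Chicago, as an example)
    "boost": "recip(geodist(location_geo,41.88,-87.63),1,1000,1000)",
}
```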
27. Collaborative Filtering Recommendations
The plugin is sent a query listing the IDs of the documents to match on
• Top k terms are extracted across all documents
• Recommendations are generated using the Rocchio algorithm
Use Cases
• Recommendations from browse history (implicit)
- Can work off cookies if the user is not logged in
• Recommendations from past purchases or applied jobs (explicit)
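A multi-document ("More Like These") request can be sketched the same way; the parameter names are again illustrative assumptions, not the plugin's documented API:

```python
# Hypothetical "More Like These" request: recommendations generated
# from several source documents at once, e.g. the last jobs viewed.
viewed_job_ids = ["123", "456", "789"]   # e.g. read from a cookie
id_clause = " OR ".join(viewed_job_ids)
params = {
    "q": f"id:({id_clause})",            # the source documents
    "rf.fl": "title,skills",             # fields to extract top terms from
    "fq": f"-id:({id_clause})",          # exclude jobs already viewed
    "rows": 10,
}
```

The negative filter query is the usual guard against recommending the very documents the user just looked at.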
29. From Relevancy Feedback to Personalized Search
• We can use the query generated by the Relevancy Feedback
handler to personalize the search results using a boost query
• Problem - user may be searching for documents that differ from
their apply history or their profile (e.g. looking for a career change)
• We want to personalize results only if the user’s query is related to
the personalization data we have for them
30. From Relevancy Feedback to Personalized Search
[Diagram: the user's query "Java Developer" expands to terms such as Java, JVM, Spring, Eclipse, IntelliJ, Hibernate, SQL and Oracle. A related query, "Hadoop Developer" (Big Data, Hadoop, HBase, MapReduce, HDFS), shares many of these terms; an unrelated query, "Accountant" (Auditor, Finance, Accounts Payable, GAAP, Taxes), shares none – only related queries should trigger personalization]
31. Boost Query + High MM Threshold
• Use the relevancy feedback query as a boost query
• Set a high mm threshold on the boost query – it will only boost documents that match most of the top k terms from the plugin
q=+(Java Developer)^10 OR ((title:"Hadoop Developer" skills:"Cassandra" skills:"Big Data" skills:"Hadoop")~3)

q="Java Developer"^10&bq={!edismax mm=-25% v='title:"Hadoop Developer" skills:"Cassandra" skills:"Big Data" skills:"Hadoop"'}
32. Demo – 2 Unrelated Queries = No Personalization
34. Personalized Search - Use Cases
• Content Based
- Use the user's profile to generate the boost query
• Collaborative Filtering (behavior based)
- Use previously viewed documents
• Hybrid
- Do both
• Based on previous search(es)
- Use the blind feedback handler to generate boost query
35. Other Use Cases
• Relevancy feedback
- Use query expansion terms to produce filter suggestions
• Blind feedback
- Faceting terms are often dominated by common terms from the least relevant documents (especially with an OR/should query)
- Use query expansion terms from most relevant matches to
produce better terms to facet on
Enhancements
• Relevancy feedback – use negative terms from negative examples
• Blind feedback – only extract terms close to the query terms in the document
- Has been shown to improve accuracy in some domains
- Called the “positional relevance model” – see this paper
36. GitHub Repo
https://github.com/DiceTechJobs/RelevancyFeedback
• Supports content streams and URLs
• More Like ‘These’
- Can generate recommendations from multiple documents
• Algorithm improvements from core MLT handler
- Top terms by field – prevents one field from dominating top terms
- Normalizes terms within a field – smaller fields (e.g. job title) have equal weighting
• Supports boost functions to boost recommendations
- E.g. boost recommended jobs by distance from the user
• Can add filter queries to both the resulting MLT query as well as the source query
• Supports the mm parameter for MLT query
- Ensures that all recommendations match at least x% of the top terms
• Supports boosting individual terms using payloads
37. Useful References
• “Modern Information Retrieval”, Chapter 10 – Baeza-Yates and Ribeiro-Neto
- From Berkeley
- Free online version
• "Introduction to Information Retrieval”, Chapter 9 – Manning, Raghavan and Schütze
- From Stanford NLP group
- Free online version
- Amazon (hardcover)
Other Related Ideas
Attribute pivots
• Uses decision trees and rule based learning to suggest query
refinements to users
• University of Texas has done some good work on attribute pivots
- Has been shown to improve accuracy in some domains