Enhancing Enterprise Search
with Machine Learning
By Simon Hughes, Dice.com
Who Am I?
• Chief Data Scientist at DHI (Dice.com) under Yuri Bykov
• Dice.com – leading US job board for IT professionals
• PhD Candidate DePaul University (NLP and Machine Learning)
• Twitter handle: https://twitter.com/hughes_meister
• Email: simon.hughes@dice.com
• Main Data Science Projects
• Dice Job and Talent Search
• Dice Recommender Engines (e.g. Similar Positions)
• Dice Salary Predictor - https://www.dice.com/salary-calculator
• Dice Career Paths Page - https://www.dice.com/career-paths
• Dice Skills Pages - https://www.dice.com/skills
Salary Predictor
Measuring Search Relevancy
• Recall - What fraction of the relevant documents were returned?
• Precision - What fraction of the returned results were relevant?
[Figure: retrieved documents vs. relevant documents, with precision and recall labeled]
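As a minimal sketch of the two definitions above, precision and recall for a single query can be computed from sets of document ids. The function name and the ids are hypothetical and exist only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given iterables of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: 4 documents returned, 5 documents judged relevant, 2 in common.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d7", "d8", "d9"]
p, r = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.40
```

Note that recall needs the full set of relevant documents, including those the engine never returned, which is why it is the harder quantity to measure.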
Relevancy Optimization
• Improving Recall – Conceptual Search*, Blind Feedback
• Improving Precision – Query Optimization*, Query Classification, LTR
• Optimizing for precision is easier – correct mistakes in the current
search results
• Optimizing for recall is harder – need to know which relevant
documents in the index don’t get retrieved
Conceptual Search
• A.K.A. Semantic Search
• Two key challenges with keyword matching:
• Polysemy: Words have multiple meanings
• E.g. engineer – mechanical engineer? Programmer? automation engineer?
• Synonymy: Many different words have the same or similar meaning
• E.g. QA, quality assurance, tester; VB, Visual Basic, VB.Net
• Other related challenges –
• Typos, Spelling Errors, Idioms
• Conceptual search attempts to solve these problems by learning
concepts from words
• Attempts to improve recall
Conceptual Search
Senior Hadoop* Developer
At least eight years of database/application
development experience in a complex enterprise
environment. Experience writing in SQL, stored
procedures, query performance tuning preferably
on SQL Server. Strong familiarity with working in a
Linux and Windows environment which includes
shell and PowerShell scripting. At least two years of
hands-on experience designing and implementing
data pipelines in production using tools from the
Hadoop* ecosystem such as MapReduce, Hive,
HBase, Spark*, Sqoop, Oozie, and Pig. Broad
knowledge of software development including
software architecture, functional and
non-functional aspects, CI/CD, principles and tools
Concepts: Java Technologies*, Big Data, Databases, Software Architecture, System Admin
*items are also Java technologies
Conceptual Search
• Conceptual search allows us to retrieve documents by how similar the
concepts in the query are to the concepts in a document
• Concepts are automatically learned from documents using machine
learning
• Traditional techniques (LSA, LDA) are based on factorizing large
matrices and don’t scale well
• Word2vec – learns vector representations of words based on context;
it is an iterative algorithm and scales much better
Word2vec
• Learns vector representations of words by predicting surrounding words
• Similar words get similar vector representations
• Finds interesting relationships between words - e.g. ‘word math’
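The "word math" idea can be sketched with cosine similarity over word vectors. The 2-d vectors below are made up for illustration (real word2vec vectors are learned from a corpus and typically have hundreds of dimensions), and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(vectors, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Made-up vectors: axis 0 loosely encodes "language", axis 1 "seniority".
toy_vectors = {
    "java_developer": [1.0, 0.0],
    "java_architect": [1.0, 1.0],
    "python_developer": [-1.0, 0.0],
    "python_architect": [-1.0, 1.0],
    "recruiter": [0.0, -1.0],
}

# java_developer is to java_architect as python_developer is to ...?
result = analogy(toy_vectors, "java_developer", "java_architect", "python_developer")
print(result)  # python_architect
```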
Word2vec Pros and Cons
• Works much better if common phrases are treated as single tokens
• e.g. java developer=>java_developer, sql server=>sql_server
• Advantages
• Effective at learning related terms /phrases
• e.g. java developer, j2ee developer, java engineer, java architect, hadoop engineer
• Disadvantages
• Doesn’t handle word sense disambiguation well
• Sees antonyms as similar because they appear in similar contexts:
• Black and white, up and down, hot and cold, Trump and Clinton, Democrat and Republican
• If the keywords in your domain are noun phrases, typically less of an issue
• Aggregating concepts over an entire document can often solve a lot of these
issues, provided the query is disambiguated
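Treating common phrases as single tokens can be sketched as a greedy longest-match pass over the token stream before training. This is an illustrative sketch only: the phrase list is assumed to be given, and `merge_phrases` is a hypothetical helper, not part of any library.

```python
def merge_phrases(tokens, phrases, max_len=3):
    """Greedily join the longest known phrase starting at each position."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to bigrams.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in phrases:
                out.append("_".join(candidate))
                i += n
                break
        else:
            # No phrase starts here, so keep the single token.
            out.append(tokens[i])
            i += 1
    return out


known_phrases = {("java", "developer"), ("sql", "server")}
tokens = "senior java developer with sql server experience".split()
merged = merge_phrases(tokens, known_phrases)
print(merged)
# ['senior', 'java_developer', 'with', 'sql_server', 'experience']
```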
Using Word2vec In Search
Search engines use inverted indexes - work with terms and not vectors. Approaches:
• Query Expansion
• Expand user’s query with most similar word2vec terms/phrases
• Doesn’t require modifying the search index
• Can boost expansion terms using word2vec similarity score
• Clustering
• Cluster word2vec terms and create separate fields mapping terms into their clusters
• Easy to implement using standard synonym files
• Create different sized clusters to get broader / finer grain matching
• Re-Ranker
• Re-rank the top n documents of a query using the word2vec vector similarity
• More complicated to implement
• Can be used as features for a LTR model
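The query expansion approach above can be sketched as building a Lucene-style query string in which each expansion phrase is boosted by its similarity score. The similarity values, the thresholds, and the `expand_query` helper are assumptions made for illustration; in practice the neighbours would come from a trained word2vec model.

```python
def expand_query(term, similar_terms, top_n=3, min_sim=0.5):
    """Return the original term OR'ed with its most similar phrases, boosted by similarity."""
    ranked = sorted(similar_terms.get(term, {}).items(), key=lambda kv: kv[1], reverse=True)
    clauses = [f'"{term}"']
    for phrase, sim in ranked[:top_n]:
        if sim >= min_sim:  # drop weak neighbours to protect precision
            clauses.append(f'"{phrase}"^{sim:.2f}')
    return " OR ".join(clauses)


# Hypothetical similarity scores for one query phrase.
neighbours = {
    "java developer": {
        "j2ee developer": 0.82,
        "java engineer": 0.79,
        "hadoop engineer": 0.41,
    }
}
query = expand_query("java developer", neighbours)
print(query)
# "java developer" OR "j2ee developer"^0.82 OR "java engineer"^0.79
```

Because the expansion happens entirely at query time, the index itself does not need to change.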
Learned Clusters
Pre-processing - Collocation (phrase) detection using PMI, word2vec over
phrases and top keywords, then k-means clustering
• Natural Languages: bi lingual, bilingual, chinese, fluent, french, german,
japanese, korean, lingual, localized, portuguese, russian, spanish, speak,
speaker
• Apple Programming Languages: cocoa, swift
• Search Engine Technologies: apache solr, elasticsearch, lucene, lucene solr,
search, search engines, search technologies, solr, solr lucene
• Microsoft .Net Technologies: c# wcf, microsoft c#, microsoft.net, mvc web,
wcf web services, web forms, webforms, windows forms, winforms, wpf wcf
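The collocation detection step named in the pre-processing line can be sketched by scoring adjacent word pairs with pointwise mutual information, PMI(a, b) = log2(p(a, b) / (p(a) p(b))). The corpus and the `min_count` threshold below are invented for illustration, and this is one common way to estimate PMI rather than the exact procedure used for the clusters above.

```python
import math
from collections import Counter


def pmi_bigrams(sentences, min_count=2):
    """Score adjacent word pairs by PMI, skipping pairs seen fewer than min_count times."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs get unreliable, inflated PMI scores
        p_ab = count / n_bi
        p_a, p_b = unigrams[a] / n_uni, unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


# Tiny invented corpus of tokenized job-title fragments.
corpus = [
    "senior java developer".split(),
    "java developer with sql server".split(),
    "sql server administrator".split(),
    "senior developer".split(),
]
scores = pmi_bigrams(corpus)
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(score, 2))
```

Pairs whose PMI exceeds a chosen threshold would then be joined into single tokens before training word2vec and clustering the resulting vectors.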
Learned Clusters – Soft Skills
Attention / Attitude:
• attention, attentive, close attention, compromising, conscientious,
conscious, customer oriented, customer service focus, customer service
oriented, deliver results, delivering results, demonstrated commitment,
dependability, dependable, detailed oriented, diligence, diligent, do
attitude, ethic, excellent follow, extremely detail oriented, good
attention, meticulous, meticulous attention, organized, orientated,
outgoing, outstanding customer service, pay attention, personality,
pleasant, positive attitude, professional appearance, professional
attitude, professional demeanor, punctual, punctuality, self motivated,
self motivation, superb, superior, thoroughness
Conceptual Search In Action
• Only conceptual search matches shown
– all keyword matches are excluded
• These are documents that would not be
returned by regular keyword search
Relevancy Tuning
• Search engines provide a lot of different knobs that can be used to
improve relevancy
• These include the weight (or ‘boost’) given to each field in a search
query, the minimum number of terms required for a match, what type of
queries are executed (disjunction max, best fields, etc.), and document
quality scores (e.g. Google’s PageRank)
• Often these knobs are tuned manually by the search engineer to
optimize their view of the optimal search experience
• Focus is primarily on precision, as it is easier to judge
• Can we do better?
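To make the knobs concrete, here is a sketch of an Elasticsearch `multi_match` query body built from tunable parameters. The field names, boost values, and defaults are hypothetical and are not Dice's actual configuration; `build_query` is an illustrative helper.

```python
import json


def build_query(text, boosts, min_should_match="75%", query_type="best_fields", tie_breaker=0.3):
    """Assemble a multi_match query body; every argument is a tunable relevancy knob."""
    fields = [f"{field}^{boost}" for field, boost in boosts.items()]
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": fields,                          # per-field boosts
                "type": query_type,                        # e.g. best_fields, most_fields
                "minimum_should_match": min_should_match,  # minimum share of terms that must match
                "tie_breaker": tie_breaker,                # weight given to non-best fields
            }
        }
    }


# Hypothetical field names and boosts.
body = build_query("java developer", {"title": 3, "skills": 2, "description": 1})
print(json.dumps(body, indent=2))
```

Each argument is a value an engineer might adjust by hand, and equally one that an optimizer could search over.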
Golden Test Collection
• We really need a set of high quality relevancy judgements
• Two Main Sources:
1. Manual Annotations
• Expert users rate results for common queries
• Costly to collect
• May not reflect judgements of your users
• Active learning can be used to improve annotation efficiency if used in LTR
2. Search Logs / Click Stream Data
• Collect data from search logs that indicate which documents seem to be relevant
• Reflects how your users view relevancy
• Relies on implicit signals which can be noisy – documents clicked, viewed
• Hard to get explicit feedback from users
Manual Annotations
• Users rate each document
based on how relevant it is to
the query
• Important that the ratings
differ for a query, otherwise
no useful information is
provided to the algorithm
Machine Learning Approaches
• Often we can’t optimize search engine relevancy directly as the scoring
functions are not differentiable
• Evaluating relevancy can be very costly – running thousands of queries
against the search engine to evaluate each parameter configuration
• Instead we can use black-box optimization algorithms to optimize the
parameters; typically this is more efficient than random search
• Most companies also use machine learning to train a re-ranking model
to re-rank the top N results
• However it is better to first optimize the search engine’s settings so that
the top N results are more likely to contain the most relevant documents
Information Retrieval Metrics
• Precision alone is not a great metric as it is insensitive to the ordering of the
documents returned
• Objective – maximize preferred information retrieval metric:
1. Normalized Discounted Cumulative Gain (NDCG)
• Discounts relevancy scores by their ranking in the results
2. Mean Average Precision (MAP)
• Average of the precision at the location of each relevant document returned
3. Precision at k
• Precision at the top k documents (usually 10)
• Insensitive to the ordering of documents within top k
• NDCG is used when you have ratings, MAP and ‘Precision at k’ are used for
binary relevant/irrelevant judgements or click data
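The three metrics can be sketched in a few lines. Two assumptions to note: the DCG here uses the linear-gain form (rating divided by log2(rank + 1)), while another common variant uses 2^rating - 1 as the gain; and average precision is averaged over the relevant documents that were returned, as described above, whereas some definitions divide by the total number of relevant documents. The ideal ranking for NDCG is also taken from the returned ratings only.

```python
import math


def dcg(ratings):
    """Discounted cumulative gain: each rating is discounted by log2(rank + 1)."""
    return sum(r / math.log2(rank + 1) for rank, r in enumerate(ratings, start=1))


def ndcg(ratings, k=10):
    """DCG of the top k, normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(ratings, reverse=True)[:k])
    return dcg(ratings[:k]) / ideal if ideal else 0.0


def precision_at_k(relevant_flags, k=10):
    """Share of the top k results that are relevant (flags are 1 or 0)."""
    return sum(relevant_flags[:k]) / k


def average_precision(relevant_flags):
    """Mean of the precision values at the rank of each relevant result returned."""
    hits, total = 0, 0.0
    for rank, flag in enumerate(relevant_flags, start=1):
        if flag:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(queries):
    """MAP: average precision averaged over a list of per-query flag lists."""
    return sum(average_precision(q) for q in queries) / len(queries) if queries else 0.0


print(ndcg([3, 2, 1]))                     # 1.0, already in ideal order
print(ndcg([1, 2, 3]) < 1.0)               # True, best document is ranked last
print(precision_at_k([1, 0, 1, 0], k=4))   # 0.5
print(average_precision([1, 0, 1, 0]))     # (1/1 + 2/3) / 2
```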
Black Box Optimization Algorithms
1. Genetic Algorithms
• Standard GA
• Evolutionary Strategies
• Genetic Programming – for evolving new scoring equations
• E.g. Python DEAP package
2. Bayesian Optimization
• As it searches the parameter space, focuses more on areas of uncertainty (using LCB and
similar variants from reinforcement learning)
• E.g. Python scikit-optimize package
3. Coordinate Ascent/Descent
• Very simple algorithm – use a line search to find the optimal value for each parameter
while keeping all others fixed
• Can get stuck in local maxima/minima
• Searches more efficiently than more random approaches
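Coordinate ascent is simple enough to sketch directly. In the version below the line search is a plain scan over a grid of candidate values, and `toy_objective` is a stand-in: in a real setup the objective would run the test queries against the search engine with the given parameters and return a metric such as NDCG. The parameter names and grids are hypothetical.

```python
def coordinate_ascent(objective, start, grids, max_sweeps=10):
    """Maximize objective by searching one parameter at a time, holding the others fixed."""
    best = dict(start)
    best_score = objective(best)
    for _ in range(max_sweeps):
        improved = False
        for name, grid in grids.items():
            for value in grid:
                trial = {**best, name: value}
                score = objective(trial)
                if score > best_score:
                    best, best_score, improved = trial, score, True
        if not improved:
            break  # a full sweep changed nothing: a (possibly local) maximum
    return best, best_score


def toy_objective(params):
    """Stand-in for a relevancy metric; peaks at title_boost=3.0, skills_boost=1.5."""
    return -(params["title_boost"] - 3.0) ** 2 - (params["skills_boost"] - 1.5) ** 2


grids = {
    "title_boost": [0.5 * i for i in range(1, 11)],   # 0.5, 1.0, ..., 5.0
    "skills_boost": [0.5 * i for i in range(1, 11)],
}
best_params, best_score = coordinate_ascent(
    toy_objective, {"title_boost": 1.0, "skills_boost": 1.0}, grids
)
print(best_params)  # {'title_boost': 3.0, 'skills_boost': 1.5}
```

The toy objective has a single peak, so the search finds it; with interacting parameters the same procedure can stop at a local maximum, as noted above.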
Test NDCG Improvements on MLT Task
• Tried different algorithms for
optimizing Elasticsearch
MoreLikeThis queries
• Parameters – relative boosts on title
and skills, number of terms
extracted, min doc freq per term
• Coordinate ascent produced the
largest improvement in the training
and test data
• 8.2% Improvement on test data set
Test NDCG Improvements on Talent Search
• Tried different algorithms for
optimizing Talent search queries
• Parameters – relative boosts on
different fields, phrase vs term
matching
• GA produced the highest test score
at the end, but GBT had the highest test
score overall – early stopping?
• 0.64% Improvement on test data set
– much smaller but ratings quality
much lower
Summary
• There are many ways you can apply machine learning to improve your users’
search experience
• I have gone over two ways in which you can improve the recall and relevancy
of your search engine
• Using conceptual search to learn synonyms and improve recall
• Using black box optimization algorithms to automate relevancy tuning
• Many other approaches for applying machine learning to improve search:
• Learning to Rank (LTR)
• Query Classification
• Query Parsing
• Personalization

Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com

  • 1. Enhancing Enterprise Search with Machine Learning By Simon Hughes, Dice.com
  • 2. Who Am I? • Chief Data Scientist at DHI (Dice.com) under Yuri Bykov • Dice.com – leading US job board for IT professionals • PhD Candidate DePaul University (NLP and Machine Learning) • Twitter handle: https://twitter.com/hughes_meister • Email: simon.hughes@dice.com • Main Data Science Projects • Dice Job and Talent Search • Dice Recommender Engines (e.g. Similar Positions) • Dice Salary Predictor - https://www.dice.com/salary-calculator • Dice Career Paths Page - https://www.dice.com/career-paths • Dice Skills Pages - https://www.dice.com/skills
  • 5.
  • 6.
  • 7. Measuring Search Relevancy • Recall - How many of the relevant documents were returned? • Precision - How relevant were the results returned? Retrieved DocumentsRelevant Documents PrecisionRecall Retrieved Relevant Documents
  • 8. Relevancy Optimization • Improving Recall – Conceptual Search*, Blind Feedback • Improving Precision – Query Optimization*, Query Classification, LTR • Optimizing for precision is easier – correct mistakes in the current search results • Optimizing for recall is harder – need to know which relevant documents in the index don’t get retrieved
  • 9. Conceptual Search • A.K.A. Semantic Search • Two key challenges with keyword matching: • Polysemy: Words have multiple meanings • E.g. engineer – mechanical engineer? Programmer? automation engineer? • Synonymy: Many different words have the same or similar meaning • E.g. QA, quality assurance, tester; VB, Visual Basic, VB.Net • Other related challenges – • Typos, Spelling Errors, Idioms • Conceptual search attempts to solve these problems by learning concepts from words • Attempts to improve recall
  • 10. Conceptual Search Senior Hadoop* Developer At least eight years of database/application development experience in an complex enterprise environment. Experience writing in SQL, stored procedures, query performance tuning preferably on SQL Server. Strong familiarity with working in a Linux and Windows environment which includes shell and power shell scripting. At least two years of hands on experience designing and implementing data pipelines in production using tools from the Hadoop* ecosystem such as MapReduce, Hive, HBase, Spark*, Sqoop, Oozie, and Pig. Broad knowledge of software development including software architecture, functional and non- functional aspects, CI/CD, principles and tools Java Technologies* Big Data Databases Software Architecture System Admin *items are also java technologies
  • 11. Conceptual Search • Conceptual search allows us to retrieve documents by how similar the concepts in the query are to the concepts in a document • Concepts are automatically learned from documents using machine learning • Traditional techniques (LSA, LDA) are based on factorizing large matrices and don’t scale well • Word2vec – learns vector representations of words based on context - an iterative algorithm, scales much better
  • 12. Word2vec • Learns vector representations of words by predicting surrounding words • Similar words get similar vector representations • Finds interesting relationships between words - e.g. ‘word math’
  • 13. Word2vec Pros and Cons • Works much better if common phrases are treated as single tokens • e.g. java developer=>java_developer, sql server=>sql_server • Advantages • Effective at learning related terms /phrases • e.g. java developer, j2ee developer, java engineer, java architect, hadoop engineer • Disadvantages • Doesn’t handle word sense disambiguation well • Sees antonyms as similar as appear in similar contexts: • Black and white, up and down, hot and cold, Trump and Clinton, Democrat and Republican • If the keywords in your domain are noun phrases, typically less of an issue • Often aggregating concepts over an entire document can solve a lot of these issues provided query is disambiguated
  • 14. Using Word2vec In Search Search engines use inverted indexes - work with terms and not vectors. Approaches: • Query Expansion • Expand user’s query with most similar word2vec terms/phrases • Doesn’t require modifying the search index • Can boost expansion terms using word2vec similarity score • Clustering • Cluster word2vec terms and create separate fields mapping terms into their clusters • Easy to implement using standard synonym files • Create different sized clusters to get broader / finer grain matching • Re-Ranker • Re-rank the top n documents of a query using the word2vec vector similarity • More complicated to implement • Can be used as features for a LTR model
  • 15. Learned Clusters Pre-processing - Collocation (phrase) detection using PMI, word2vec over phrases and top keywords, then k-means clustering • Natural Languages: bi lingual, bilingual, chinese, fluent, french, german, japanese, korean, lingual, localized, portuguese, russian, spanish, speak, speaker • Apple Programming Languages: cocoa, swift • Search Engine Technologies: apache solr, elasticsearch, lucene, lucene solr, search, search engines, search technologies, solr, solr lucene • Microsoft .Net Technologies: c# wcf, microsoft c#, microsoft.net, mvc web, wcf web services, web forms, webforms, windows forms, winforms, wpf wcf
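
The PMI-based collocation step can be sketched with plain counts: a word pair scores highly when it occurs together more often than its individual word frequencies would predict. The three-sentence corpus, the `min_count` threshold and the function name `bigram_pmi` are assumptions made for this illustration.

```python
import math
from collections import Counter


def bigram_pmi(sentences, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI):
    log2( P(w1, w2) / (P(w1) * P(w2)) )."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / total_bigrams
        p_w1 = unigrams[w1] / total_unigrams
        p_w2 = unigrams[w2] / total_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Tiny invented corpus; a real run would use the full set of job descriptions.
corpus = [
    "apache solr search engineer".split(),
    "apache solr and lucene experience".split(),
    "search experience with apache solr".split(),
]

scores = bigram_pmi(corpus)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

Pairs above a chosen PMI threshold would then be joined into single tokens before training word2vec, and the resulting phrase vectors clustered with k-means.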
  • 16. Learned Clusters – Soft Skills Attention / Attitude: • attention, attentive, close attention, compromising, conscientious, conscious, customer oriented, customer service focus, customer service oriented, deliver results, delivering results, demonstrated commitment, dependability, dependable, detailed oriented, diligence, diligent, do attitude, ethic, excellent follow, extremely detail oriented, good attention, meticulous, meticulous attention, organized, orientated, outgoing, outstanding customer service, pay attention, personality, pleasant, positive attitude, professional appearance, professional attitude, professional demeanor, punctual, punctuality, self motivated, self motivation, superb, superior, thoroughness
  • 17. Conceptual Search In Action • Only conceptual search matches shown – all keyword matches are excluded • These are documents that would not be returned by regular keyword search
  • 18. Conceptual Search In Action • Only conceptual search matches shown – all keyword matches are excluded • These are documents that would not be returned by regular keyword search
  • 19. Relevancy Tuning • Search engines provide a lot of different knobs that can be used to improve relevancy • These include the weight (or ‘boost’) given to each field in a search query, the minimum number of terms required for a match, what type of queries are executed (disjunction max, best fields, etc.), and document quality scores (e.g. Google’s PageRank) • Often these knobs are tuned manually by the search engineer to match their own view of the optimal search experience • Focus is primarily on precision, as it is easier to judge • Can we do better?
  • 20. Golden Test Collection • We really need a set of high quality relevancy judgements • Two Main Sources: 1. Manual Annotations • Expert users rate results for common queries • Costly to collect • May not reflect judgements of your users • Active learning can be used to improve annotation efficiency if used in LTR 2. Search Logs / Click Stream Data • Collect data from search logs that indicate which documents seem to be relevant • Reflects how your users view relevancy • Relies on implicit signals which can be noisy – documents clicked, viewed • Hard to get explicit feedback from users
  • 21. Manual Annotations • Users rate each document based on how relevant it is to the query • Important that the ratings differ for a query, otherwise no useful information is provided to the algorithm
  • 22. Machine Learning Approaches • Often we can’t optimize search engine relevancy directly as the scoring functions are not differentiable • Evaluating relevancy can be very costly – each parameter configuration requires running thousands of queries against the search engine • Instead we can use black-box optimization algorithms to optimize the parameters; this is typically more efficient than random search • Most companies also use machine learning to train a model that re-ranks the top N results • However, it is better to first optimize the search engine’s settings so that the top N results are more likely to contain the most relevant documents
  • 23. Information Retrieval Metrics • Precision alone is not a great metric as it is insensitive to the ordering of the documents returned • Objective – maximize preferred information retrieval metric: 1. Normalized Discounted Cumulative Gain (NDCG) • Discounts relevancy scores by their ranking in the results 2. Mean Average Precision (MAP) • Average of the precision at the location of each relevant document returned 3. Precision at k • Precision at the top k documents (usually 10) • Insensitive to the ordering of documents within top k • NDCG is used when you have ratings, MAP and ‘Precision at k’ are used for binary relevant/irrelevant judgements or click data
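
The three metrics can be written out in a few lines. This is a minimal sketch under stated assumptions: NDCG here uses the linear-gain form (the rating itself divided by log2(rank + 1)), which is one common variant, and the slides do not say which variant was used. Average precision divides by the number of relevant documents returned, following the definition on the slide.

```python
import math


def dcg(ratings):
    """Discounted cumulative gain: each rating is discounted by log2(rank + 1)."""
    return sum(r / math.log2(rank + 1) for rank, r in enumerate(ratings, start=1))


def ndcg(ratings, k=None):
    """DCG of the returned order divided by the DCG of the ideal order."""
    ranked = ratings[:k] if k else ratings
    ideal = sorted(ratings, reverse=True)[:len(ranked)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg else 0.0


def precision_at_k(relevant_flags, k=10):
    """Fraction of the top k results that are relevant (order within k is ignored)."""
    return sum(relevant_flags[:k]) / k


def average_precision(relevant_flags):
    """Mean of the precision measured at the rank of each relevant result returned."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(queries):
    """MAP: average precision averaged over a list of per-query result lists."""
    return sum(average_precision(q) for q in queries) / len(queries)


graded = [3, 2, 3, 0, 1, 2]   # hypothetical ratings in returned order
binary = [1, 0, 1, 0]         # hypothetical relevant / irrelevant flags

print(round(ndcg(graded), 3))
print(precision_at_k(binary, k=2))          # → 0.5
print(round(average_precision(binary), 3))  # → 0.833
```

Swapping the first two entries of `binary` changes average precision but not precision at 2 or beyond, which illustrates why precision at k is described as insensitive to ordering within the top k.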
  • 24. Black Box Optimization Algorithms 1. Genetic Algorithms • Standard GA • Evolutionary Strategies • Genetic Programming – for evolving new scoring equations • E.g. Python DEAP package 2. Bayesian Optimization • As it searches the parameter space, focuses more on areas of uncertainty (using the lower confidence bound (LCB) and similar variants from reinforcement learning) • E.g. Python scikit-optimize package 3. Coordinate Ascent/Descent • Very simple algorithm – use a line search to find the optimal value for each parameter while keeping all others fixed • Can get stuck in local maxima/minima • Searches more efficiently than more randomized approaches
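
Coordinate ascent is simple enough to sketch in full. The version below replaces a true line search with a search over a fixed grid of candidate values per parameter, which is a simplification. The objective `toy_score` is an invented stand-in: in a real setup it would run the judged queries against the search engine with the given settings and return the mean NDCG. The parameter names are hypothetical.

```python
def coordinate_ascent(score_fn, start, grids, max_rounds=5):
    """Maximize score_fn by tuning one parameter at a time over a grid of
    candidate values, holding all other parameters fixed."""
    best = dict(start)
    best_score = score_fn(best)
    for _ in range(max_rounds):
        improved = False
        for name, candidates in grids.items():
            for value in candidates:
                trial = dict(best, **{name: value})
                trial_score = score_fn(trial)
                if trial_score > best_score:
                    best, best_score, improved = trial, trial_score, True
        if not improved:
            break  # a full pass changed nothing: local optimum reached
    return best, best_score


def toy_score(params):
    """Invented objective standing in for 'mean NDCG over the judged queries'.
    It peaks at title_boost=3, skills_boost=2, min_doc_freq=5."""
    return (
        -(params["title_boost"] - 3) ** 2
        - (params["skills_boost"] - 2) ** 2
        - (params["min_doc_freq"] - 5) ** 2 / 10
    )


grids = {
    "title_boost": [1, 2, 3, 4, 5],
    "skills_boost": [1, 2, 3, 4, 5],
    "min_doc_freq": [1, 2, 5, 10],
}
start = {"title_boost": 1, "skills_boost": 1, "min_doc_freq": 1}

best, score = coordinate_ascent(toy_score, start, grids)
print(best, score)
```

The toy objective has a single peak and no interaction between parameters, so the search finds it in one pass. With a real relevancy score, where boosts interact, the same loop can stop at a local maximum, which is the weakness noted on the slide.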
  • 25. Test NDCG Improvements on MLT Task • Tried different algorithms for optimizing Elasticsearch MoreLikeThis (MLT) queries • Parameters – relative boosts on title and skills, number of terms extracted, minimum document frequency per term • Coordinate ascent produced the largest improvement on both the training and test data • 8.2% improvement on the test data set
  • 26. Test NDCG Improvements on Talent Search • Tried different algorithms for optimizing Talent Search queries • Parameters – relative boosts on different fields, phrase vs term matching • GA produced the highest test score at the end, but GBT had the highest test score overall – early stopping? • 0.64% improvement on the test data set – much smaller, but the quality of the ratings was much lower
  • 27. Summary • There are many ways you can apply machine learning to improve your users’ search experience • I have gone over two ways in which you can improve the recall and relevancy of your search engine • Using conceptual search to learn synonyms and improve recall • Using black box optimization algorithms to automate relevancy tuning • Many other approaches for applying machine learning to improve search: • Learning to Rank (LTR) • Query Classification • Query Parsing • Personalization

Editor's Notes

  1. Salary Predictor - https://www.dice.com/salary-calculator
  2. Salary Predictor - https://www.dice.com/salary-calculator
  3. Career Paths - https://www.dice.com/career-paths
  4. Skills pages - https://www.dice.com/skills
  5. This talk will cover conceptual search and query optimization techniques. Query classification and LTR (Learning To Rank) are other common approaches to improving precision which won’t be covered in this talk.
  6. Map words to concepts. Words can map to multiple concepts, and a number of terms can map to the same concept, e.g. the Java technologies above.
  7. Labels in bold are manually assigned for interpretability
  8. DEAP - https://github.com/DEAP/deap; Scikit-optimize - https://scikit-optimize.github.io/