Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disambiguation
Search engines frequently miss the mark when it comes to understanding user intent. This talk will walk through some of the key building blocks necessary to turn a search engine into a dynamically-learning "intent engine", able to interpret and search on meaning, not just keywords. We will walk through CareerBuilder's semantic search architecture, including semantic autocomplete, query and document interpretation, probabilistic query parsing, automatic taxonomy discovery, keyword disambiguation, and personalization based upon user context/behavior. We will also see how to leverage an inverted index (Lucene/Solr) as a knowledge graph that can be used as a dynamic ontology to extract phrases, understand and weight the semantic relationships between those phrases and known entities, and expand the query to include those additional conceptual relationships.

As an example, most search engines completely miss the mark at parsing a query like (Senior Java Developer Portland, OR Hadoop). We will show how to dynamically understand that "senior" designates an experience level, that "java developer" is a job title related to "software engineering", that "portland, or" is a city with a specific geographical boundary (as opposed to a keyword followed by a boolean operator), and that "hadoop" is the skill "Apache Hadoop", which is also related to other terms like "hbase", "hive", and "map/reduce". We will discuss how to train the search engine to parse the query into this intended understanding and how to reflect this understanding to the end user to provide an insightful, augmented search experience.

Topics: Semantic Search, Apache Solr, Finite State Transducers, Probabilistic Query Parsing, Bayes Theorem, Augmented Search, Recommendations, Query Disambiguation, NLP, Knowledge Graphs

Published in: Technology

  1. 1. Bay Area Search Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disambiguation 2015.11.10 Bay Area Search Trey Grainger Director of Engineering, Search & Recommendations
  2. 2. Bay Area Search Trey Grainger Director of Engineering, Search & Recommendations • Joined CareerBuilder in 2007 as a Software Engineer • MBA, Management of Technology – Georgia Tech • BA, Computer Science, Business, & Philosophy – Furman University • Mining Massive Datasets (in progress) - Stanford University Fun outside of CB: • Co-author of Solr in Action, plus a handful of research papers • Frequent conference speaker • Founder of Celiaccess.com, the gluten-free search engine • Lucene/Solr contributor About Me
  3. 3. Bay Area Search Agenda • Introduction • Traditional Keyword Search vs. Personalization vs. Semantic Search • Searching on Intent - Type-ahead prediction - Spelling Correction - Entity / Entity-type Resolution - Contextual Disambiguation - Semantic Query Parsing - Query Augmentation - The Knowledge Graph • Conclusion Knowledge Graph
  4. 4. Bay Area Search At CareerBuilder, Solr Powers...
  5. 5. Search by the Numbers Powering 50+ Search Experiences Including: 100 million+ Searches per day 30+ Software Developers, Data Scientists + Analysts 500+ Search Servers 1.5 billion+ Documents indexed and searchable 1 Global Search Technology platform ...and many more
  6. 6. Bay Area Search Conceptual Framework for Information Retrieval: Traditional Keyword Search Recommendations Semantic Search User Intent Personalized Search Augmented Search Domain-aware Matching
  7. 7. Bay Area Search Traditional Search
  8. 8. Bay Area Search Classic Lucene Relevancy Algorithm (though BM25 to be default soon): *Source: Solr in Action, chapter 3

    Score(q, d) = [ Σ_(t in q) tf(t in d) · idf(t)² · t.getBoost() · norm(t, d) ] · coord(q, d) · queryNorm(q)

    Where: t = term; d = document; q = query; f = field
    tf(t in d) = sqrt(numTermOccurrencesInDocument)
    idf(t) = 1 + log(numDocs / (docFreq + 1))
    coord(q, d) = numTermsInDocumentFromQuery / numTermsInQuery
    queryNorm(q) = 1 / sqrt(sumOfSquaredWeights)
    sumOfSquaredWeights = q.getBoost()² · Σ_(t in q) (idf(t) · t.getBoost())²
    norm(t, d) = d.getBoost() · lengthNorm(f) · f.getBoost()
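As a sanity check, the formula above can be re-implemented in a few lines. The sketch below is a toy version, assuming all boosts and field norms are 1:

```python
import math

def classic_lucene_score(query_terms, doc_terms, num_docs, doc_freq):
    """Toy re-implementation of the classic Lucene TF-IDF formula above.
    doc_terms maps term -> occurrence count in the document; boosts and
    field norms are assumed to be 1 for simplicity."""
    def idf(t):
        return 1 + math.log(num_docs / (doc_freq.get(t, 0) + 1))

    matched = [t for t in query_terms if t in doc_terms]
    # tf(t in d) = sqrt(term frequency in document)
    raw = sum(math.sqrt(doc_terms[t]) * idf(t) ** 2 for t in matched)
    coord = len(matched) / len(query_terms)          # coord(q, d)
    sum_sq = sum(idf(t) ** 2 for t in query_terms)   # sumOfSquaredWeights
    query_norm = 1 / math.sqrt(sum_sq)               # queryNorm(q)
    return raw * coord * query_norm
```

A document matching more of the query's terms (and rarer terms) scores higher, exactly as the coord and idf factors dictate.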
  9. 9. Bay Area Search News Search: popularity and freshness drive relevance Restaurant Search: geographical proximity and price range are critical Ecommerce: likelihood of a purchase is key Movie search: More popular titles are generally more relevant Job search: category of job, salary range, and geographical proximity matter TF * IDF of keywords can’t hold its own against good domain-specific relevance factors! That’s great, but what about domain-specific knowledge?
  10. 10. Bay Area Search Example of domain-specific relevancy calculation News website: /select? fq=$myQuery& q=_query_:"{!func}scale(query($myQuery),0,100)" AND _query_:"{!func}div(100,map(geodist(),0,1,1))" AND _query_:"{!func}recip(rord(publicationDate),0,100,100)" AND _query_:"{!func}scale(popularity,0,100)"& myQuery="street festival"& sfield=location& pt=33.748,-84.391 25% 25% 25% 25% *Example from chapter 16 of Solr in Action
  11. 11. Bay Area Search Fancy boosting functions Separating “relevancy” and “filtering” from the query: q=_val_:"$keywords"&fq={!cache=false v=$keywords}&keywords=solr Keywords (50%) + distance (25%) + category (25%) q=_val_:"scale(mul(query($keywords),1),0,50)" AND _val_:"scale(sum($radiusInKm,mul(query($distance),-1)),0,25)” AND _val_:"scale(mul(query($category),1),0,25)" &keywords=solr &radiusInKm=48.28 &distance=_val_:"geodist(latitudelongitude.latlon_is,33.77402,-84.29659)” &category=jobtitle:"java developer" &fq={!cache=false v=$keywords}
  12. 12. Bay Area Search Personalization / Recommendations
  13. 13. Bay Area Search John lives in Boston but wants to move to New York or possibly another big city. He is currently a sales manager but wants to move towards business development. Irene is a bartender in Dublin and is only interested in jobs within 10KM of her location in the food service industry. Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a Big Data company. He is happy to move across the U.S. for the right job. Jane is a nurse educator in Boston seeking between $40K and $60K Beyond domain knowledge… consider per-user knowledge
  14. 14. Bay Area Search http://localhost:8983/solr/jobs/select/? fl=jobtitle,city,state,salary& q=( jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10 ) AND ( (city:"Boston" AND state:"MA")^15 OR state:"MA") AND _val_:"map(salary, 40000, 60000,10, 0)” *Example from chapter 16 of Solr in Action Query for Jane Jane is a nurse educator in Boston seeking between $40K and $60K
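Per-user queries like the one above lend themselves to templating. A hypothetical sketch that builds Jane's query from a profile dict (field names follow the example schema; the weights are the slide's illustrative values):

```python
def personalized_query(profile):
    """Build a Solr q string like the 'Query for Jane' slide from a user
    profile dict. Field names (jobtitle, city, state, salary) follow the
    example schema; boost values are illustrative."""
    title = profile["job_title"]
    parts = [
        '(jobtitle:"{0}"^25 OR jobtitle:({0})^10)'.format(title),
        '((city:"{city}" AND state:"{state}")^15 OR state:"{state}")'.format(**profile),
        '_val_:"map(salary, {min_salary}, {max_salary}, 10, 0)"'.format(**profile),
    ]
    return " AND ".join(parts)

jane = {"job_title": "nurse educator", "city": "Boston", "state": "MA",
        "min_salary": 40000, "max_salary": 60000}
```

Calling `personalized_query(jane)` reproduces the query shown above from nothing but her stored profile.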
  15. 15. Bay Area Search { ... "response":{"numFound":22,"start":0,"docs":[ {"jobtitle":" Clinical Educator (New England/ Boston)", "city":"Boston", "state":"MA", "salary":41503}, …]}} *Example documents available @ http://github.com/treygrainger/solr-in-action/ Search Results for Jane {"jobtitle":"Nurse Educator", "city":"Braintree", "state":"MA", "salary":56183}, {"jobtitle":"Nurse Educator", "city":"Brighton", "state":"MA", "salary":71359}
  16. 16. Bay Area Search We built a recommendation engine! What is a recommendation engine? “A system that uses known information (or derived information from that known information) to automatically suggest relevant content” Our example was just an attribute based recommendation… but we can also use any behavioral-based features, as well (i.e. collaborative filtering). What did we just do?
  17. 17. Bay Area Search For full coverage of building a recommendation engine in Solr… See my talk from Lucene Revolution 2012 (Boston):
  18. 18. Bay Area Search Personalized Search Why limit yourself to JUST explicit search or JUST automated recommendations? By augmenting your user’s explicit queries with information you know about them, you can personalize their search results. Examples: A known software engineer runs a blank job search in New York… Why not show software engineering higher in the results? A new user runs a keyword-only search for nurse Why not use the user’s IP address to boost documents geographically closer?
  19. 19. Bay Area Search Semantic Search
  20. 20. Bay Area Search What’s the problem we’re trying to solve today? User’s Query: machine learning research and development Portland, OR software engineer AND hadoop, java Traditional Query Parsing: (machine AND learning AND research AND development AND portland) OR (software AND engineer AND hadoop AND java) Semantic Query Parsing: "machine learning" AND "research and development" AND "Portland, OR" AND "software engineer" AND hadoop AND java Semantically Expanded Query: ("machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence") AND ("research and development"^10 OR "r&d") AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo}) AND ("software engineer"^10 OR "software developer") AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
  21. 21. Bay Area Search ...we also really want to search on “things”, not “strings”… Job Level Job title Company Job Title Company School + Degree
  22. 22. Bay Area Search Type-ahead Prediction Building an Intent Engine Search Box Semantic Query Parsing Intent Engine Spelling Correction Entity / Entity Type Resolution Machine-learned Ranking Relevancy Engine (“re-expressing intent”) User Feedback (Clarifying Intent) Query Re-writing Search Results Query Augmentation Knowledge Graph Contextual Disambiguation
  23. 23. Bay Area Search Type-ahead Predictions
  24. 24. Semantic Autocomplete • Shows top terms for any search • Breaks out job titles, skills, companies, related keywords, and other categories • Understands abbreviations, alternate forms, misspellings • Supports full Boolean syntax and multi-term autocomplete • Enables fielded search on entities, not just keywords
  25. 25. Spelling Correction
  26. 26. Entity / Entity-type Resolution
  27. 27. Bay Area Search Differentiating related terms Synonyms: cpa => certified public accountant rn => registered nurse r.n. => registered nurse Ambiguous Terms*: driver => driver (trucking) ~80% likelihood driver => driver (software) ~20% likelihood Related Terms: r.n. => nursing, bsn hadoop => mapreduce, hive, pig *differentiated based upon user and query context
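A minimal sketch of how these three relationship types might be stored and applied — the terms and weights come from the slide, but the data structure itself is an assumption:

```python
# Synonyms are safe substitutions: replace with the canonical form.
synonyms = {"cpa": "certified public accountant",
            "rn": "registered nurse",
            "r.n.": "registered nurse"}

# Ambiguous terms map to weighted senses; the weights come from
# user and query context (here, the slide's example likelihoods).
ambiguous = {"driver": [("driver (trucking)", 0.80),
                        ("driver (software)", 0.20)]}

# Related terms are expansions, not replacements.
related = {"r.n.": ["nursing", "bsn"],
           "hadoop": ["mapreduce", "hive", "pig"]}

def normalize(term):
    """Replace a known synonym with its canonical form; otherwise keep it."""
    return synonyms.get(term.lower(), term)
```

The distinction matters downstream: synonyms rewrite the query, related terms only boost it, and ambiguous terms need disambiguation first.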
  28. 28. Bay Area Search Building a Taxonomy of Entities Many ways to generate this: • Topic Modelling • Clustering of documents • Statistical Analysis of interesting phrases • Buy a dictionary (often doesn’t work for domain-specific search problems) • … Our strategy: Generate a model of domain-specific phrases by mining query logs for commonly searched phrases within the domain [1] [1] K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
  29. 29. Bay Area Search Entity-type Recognition Build classifiers trained on External data sources (Wikipedia, DBPedia, WordNet, etc.), as well as from our own domain. The subject for a future talk / research paper… java developer registered nurse emergency room director job title skill job level location work type Portland, OR part-time
  30. 30. Bay Area Search Contextual Disambiguation
  31. 31. Bay Area Search How do we handle phrases with ambiguous meanings? Example Related Keywords (representing multiple meanings) driver truck driver, linux, windows, courier, embedded, cdl, delivery architect autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer … …
  32. 32. Bay Area Search Discovering ambiguous phrases 1) Classify users who ran each search in the search logs (i.e. by the job title classifications of the jobs to which they applied) 2) Create a probabilistic graphical model of those classifications mapped to each keyword phrase 3) Segment the search term => related search terms list by classification, to return a separate related terms list per classification
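The steps above can be sketched as a simple aggregation over search-log rows (the row format here is an assumption):

```python
from collections import defaultdict

def related_terms_by_classification(log_rows):
    """log_rows: (user_classification, search_term, related_term) tuples,
    e.g. classifications derived from the job titles users applied to.
    Returns, per search term, a separate related-terms list (ranked by
    frequency) for each user classification."""
    counts = defaultdict(lambda: defaultdict(int))
    for classification, term, rel in log_rows:
        counts[(term, classification)][rel] += 1
    result = defaultdict(dict)
    for (term, classification), rel_counts in counts.items():
        ranked = sorted(rel_counts, key=rel_counts.get, reverse=True)
        result[term][classification] = ranked
    return result
```

An ambiguous phrase is then simply one whose related-terms lists diverge sharply across classifications, as "driver" does below.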
  33. 33. Bay Area Search Disambiguated meanings (represented as term vectors) Example Related Keywords (Disambiguated Meanings) architect 1: enterprise architect, java architect, data architect, oracle, java, .net 2: architectural designer, architectural drafter, autocad, autocad drafter, designer, drafter, cad, engineer driver 1: linux, windows, embedded 2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier designer 1: design, print, animation, artist, illustrator, creative, graphic artist, graphic, photoshop, video 2: graphic, web designer, design, web design, graphic design, graphic designer 3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe, structural designer, revit … …
  34. 34. Bay Area Search Using the disambiguated meanings In a situation where a user searches for an ambiguous phrase, what information can we use to pick the correct underlying meaning? 1. Any pre-existing knowledge about the user: • User is a software engineer • User has previously run searches for “c++” and “linux” 2. Context within the query: • User searched for windows AND driver vs. courier OR driver 3. If all else fails (and there is no context), use the most commonly occurring meaning. driver 1: linux, windows, embedded 2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
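The fallback logic above can be sketched as an overlap score between each sense's term vector and whatever context is available (user history plus the rest of the query), defaulting to the most common meaning when there is no overlap:

```python
def pick_sense(senses, context_terms):
    """senses: list of related-term lists, ordered by overall frequency
    (most common meaning first). context_terms: terms from the user's
    profile/history plus the other terms in the query. Returns the index
    of the best-matching sense."""
    context = {t.lower() for t in context_terms}
    overlaps = [len(context & set(sense)) for sense in senses]
    if max(overlaps, default=0) == 0:
        return 0  # no usable context: fall back to the most common meaning
    return overlaps.index(max(overlaps))

# The slide's two senses of "driver", most common meaning first.
driver_senses = [
    ["truck driver", "cdl", "delivery driver", "courier"],  # trucking
    ["linux", "windows", "embedded"],                       # software
]
```

So a query for `windows AND driver` resolves to the software sense, while a bare `driver` falls back to trucking.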
  35. 35. Bay Area Search Semantic Query Parsing
  36. 36. Bay Area Search Query Parsing: The whole is greater than the sum of the parts project manager vs. "project" AND "manager" building architect vs. "building" AND "architect" software architect vs. "software" AND "architect" Consider: a "software architect" designs and builds software a "building architect" uses software to design architecture User’s Query: machine learning research and development Portland, OR software engineer AND hadoop java Traditional Query Parsing: (machine AND learning AND research AND development AND portland) OR (software AND engineer AND hadoop AND java) ≠ Identifying the correct phrase (not just the parts) is crucial here!
  37. 37. Bay Area Search
  38. 38. Bay Area Search Probabilistic Query Parser Goal: given a query, predict which combinations of keywords should be combined together as phrases Example: senior java developer hadoop Possible Parsings: senior, java, developer, hadoop "senior java", developer, hadoop "senior java developer", hadoop "senior java developer hadoop” "senior java", "developer hadoop” senior, "java developer", hadoop senior, java, "developer hadoop"
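Enumerating the candidate parsings is just enumerating every segmentation of the token list — there are 2^(n-1) of them for n tokens — after which a real parser scores each candidate against a phrase model. A minimal sketch of the enumeration step:

```python
def segmentations(tokens):
    """Yield every way to group consecutive tokens into phrases."""
    if not tokens:
        yield []
        return
    for i in range(1, len(tokens) + 1):
        for rest in segmentations(tokens[i:]):
            yield [tokens[:i]] + rest
```

For `senior java developer hadoop` this yields all 8 groupings, including the slide's `"senior java developer", hadoop`.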
  39. 39. Bay Area Search Input: senior hadoop developer java ruby on rails perl
  40. 40. Bay Area Search Semantic Search Architecture – Query Parsing 1) Generate the previously discussed taxonomy of Domain-specific phrases • You can mine query logs or actual text of documents for significant phrases within your domain [1] 2) Feed these phrases to SolrTextTagger (uses Lucene FST for high-throughput term lookups) 3) Use SolrTextTagger to perform entity extraction on incoming queries (tagging documents is also possible) 4) Also invoke probabilistic parser to dynamically identify unknown phrases from a corpus of data (language model) 5) Shown on next slides: Pass extracted entities to a Query Augmentation phase to rewrite the query with enhanced semantic understanding [1] K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014. [2] https://github.com/OpenSextant/SolrTextTagger
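Step 3 above — tagging known phrases in an incoming query — can be approximated with a greedy longest-match lookup. This toy version stands in for SolrTextTagger's FST-backed lookup, using a plain set of known phrases:

```python
def tag_phrases(tokens, known_phrases, max_len=4):
    """Greedy longest-match phrase extraction over a tokenized query.
    A stand-in sketch for SolrTextTagger's FST lookup: at each position,
    take the longest known phrase starting there, else the single token."""
    out, i = [], 0
    while i < len(tokens):
        for j in range(min(len(tokens), i + max_len), i, -1):
            candidate = " ".join(tokens[i:j])
            if j - i == 1 or candidate in known_phrases:
                out.append(candidate)
                i = j
                break
    return out
```

The real FST gives the same longest-match behavior at high throughput over millions of phrases; unknown multi-word phrases are what the probabilistic parser (step 4) is for.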
  41. 41. Bay Area Search Query Augmentation
  42. 42. Bay Area Search machine learning Keywords: Search Behavior, Application Behavior, etc. Job Title Classifier, Skills Extractor, Job Level Classifier, etc. Semantic Query Augmentation keywords:((machine learning)^10 OR { AT_LEAST_2: ("data mining"^0.9, matlab^0.8, "data scientist"^0.75, "artificial intelligence"^0.7, "neural networks"^0.55)) } { BOOST_TO_TOP: ( job_title:( "software engineer" OR "data manager" OR "data scientist" OR "hadoop engineer")) } Modified Query: Related Occupations machine learning: {15-1031.00 .58 Computer Software Engineers, Applications 15-1011.00 .55 Computer and Information Scientists, Research 15-1032.00 .52 Computer Software Engineers, Systems Software } machine learning: { software engineer .65, data manager .3, data scientist .25, hadoop engineer .2, } Common Job Titles Semantic Search Architecture – Query Augmentation Related Phrases machine learning: { data mining .9, matlab .8, data scientist .75, artificial intelligence .7, neural networks .55 } Known keyword phrases java developer machine learning registered nurse FST Knowledge Graph in +
  43. 43. Bay Area Search Query Enrichment
  44. 44. Bay Area Search Document Enrichment
  45. 45. Bay Area Search Document Enrichment
  46. 46. Bay Area Search Knowledge Graph
  47. 47. Bay Area Search Serves as a “data science toolkit” API that allows dynamically navigating and pivoting through multiple levels of relationships between items in our domain. Compare the relationships of skills to keywords, job titles to skills to keywords, skills to government occupation codes, skills to experience level, etc. Knowledge Graph API Core similarity engine, exposed via API Any product can leverage our core relationship scoring engine to score any list of entities against any other list Full domain support Keywords, job titles, skills, companies, job levels, locations, and all other taxonomies. Intersections, overlaps, & relationship scoring, many levels deep Users can either provide a list of items to score, or else have the system dynamically discover the most related items (or both). Knowledge Graph
  48. 48. Bay Area Search So how does it work? Foreground vs. Background Analysis Every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context.

    z = (countFG(x) - totalDocsFG * probBG(x)) / sqrt(totalDocsFG * probBG(x) * (1 - probBG(x)))

    Foreground Query: "Hadoop"
    { "type":"keywords", "values":[
        { "value":"hive", "relatedness":0.9773, "popularity":369 },
        { "value":"java", "relatedness":0.9236, "popularity":15653 },
        { "value":".net", "relatedness":0.5294, "popularity":17683 },
        { "value":"bee", "relatedness":0.0, "popularity":0 },
        { "value":"teacher", "relatedness":-0.2380, "popularity":9923 },
        { "value":"registered nurse", "relatedness":-0.3802, "popularity":27089 } ] }

    We are essentially boosting terms which are more related to some known feature (and ignoring terms which are equally likely to appear in the background corpus). Knowledge Graph
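The relatedness score above is a z-test of a term's foreground count against its background probability; in Python, using the slide's variable names:

```python
import math

def relatedness(count_fg, total_docs_fg, count_bg, total_docs_bg):
    """z-score of a term's foreground frequency vs. its background
    probability, per the slide's formula. Positive means the term is
    over-represented in the foreground (e.g. "hive" for "Hadoop");
    negative means under-represented (e.g. "registered nurse")."""
    prob_bg = count_bg / total_docs_bg
    expected = total_docs_fg * prob_bg
    return (count_fg - expected) / math.sqrt(
        total_docs_fg * prob_bg * (1 - prob_bg))
```

A term that never appears in the foreground but is common in the background scores strongly negative, which is exactly the filtering behavior the slide describes.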
  49. 49. Bay Area Search Knowledge Graph – Potential Use Cases Cross-walk between Types • Have an ID field, but want to enable free text search on the most associated entity with that ID? • Have a “state” (geo) search box, but want to accept any free-text location and map it to the right state? • Have an old classification taxonomy and want to know how the values from the old system now map into the new values? Build User Profiles from Search Logs • If someone searches for “Java”, and then “JQuery”, and then “CSS”, and then “JSP”, what do those have in common? • What if they search for “Java”, and then “C++”, and then “Assembly”? Discover Relationships Between Anything • If I want to become a data scientist and know Python, what libraries should I learn? • If my last job was mid-level software engineer and my current job is Engineering Lead, what are my most likely next roles? Traverse arbitrarily deep, Sort on anything • Build an instant co-occurrence matrix, sort the top values by their relatedness, and then add in any number of additional dimensions (RAM permitting). Data Cleansing • Have dirty taxonomies and need to figure out which items don’t belong? • Need to understand the conceptual cohesion of a document (vs spammy or off-topic content)? Knowledge Graph
  50. 50. Bay Area Search 2014-2015 Publications & Presentations
    Books: Solr in Action - A comprehensive guide to implementing scalable search using Apache Solr
    Research papers:
    ● Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon - 2014
    ● Towards a Job Title Classification System - 2014
    ● Augmenting Recommendation Systems Using a Model of Semantically-related Terms Extracted from User Behavior - 2014
    ● sCooL: A system for academic institution name normalization - 2014
    ● PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems - 2014
    ● SKILL: A System for Skill Identification and Normalization - 2015
    ● Carotene: A Job Title Classification System for the Online Recruitment Domain - 2015
    ● WebScalding: A Framework for Big Data Web Services - 2015
    ● A Pipeline for Extracting and Deduplicating Domain-Specific Knowledge Bases - 2015
    ● Macau: Large-Scale Skill Sense Disambiguation in the Online Recruitment Domain - 2015
    ● Improving the Quality of Semantic Relationships Extracted from Massive User Behavioral Data - 2015
    ● Query Sense Disambiguation Leveraging Large Scale User Behavioral Data - 2015
    Speaking Engagements: Over a dozen in the last year: Lucene/Solr Revolution 2014, WSDM 2014, Atlanta Solr Meetup, Atlanta Big Data Meetup, Second International Symposium on Big Data and Data Analytics, RecSys 2014, IEEE Big Data Conference 2014 (x2), AAAI/IAAI 2015, IEEE Big Data 2015 (x6), Lucene/Solr Revolution 2015, and Bay Area Search Meetup
  51. 51. Bay Area Search So What’s Next?
  52. 52. Bay Area Search machine learning Keywords: Search Behavior, Application Behavior, etc. Job Title Classifier, Skills Extractor, Job Level Classifier, etc. Semantic Query Augmentation keywords:((machine learning)^10 OR { AT_LEAST_2: ("data mining"^0.9, matlab^0.8, "data scientist"^0.75, "artificial intelligence"^0.7, "neural networks"^0.55)) } { BOOST_TO_TOP: ( job_title:( "software engineer" OR "data manager" OR "data scientist" OR "hadoop engineer")) } Modified Query: Related Occupations machine learning: {15-1031.00 .58 Computer Software Engineers, Applications 15-1011.00 .55 Computer and Information Scientists, Research 15-1032.00 .52 Computer Software Engineers, Systems Software } machine learning: { software engineer .65, data manager .3, data scientist .25, hadoop engineer .2, } Common Job Titles Semantic Search Architecture – Query Augmentation Related Phrases machine learning: { data mining .9, matlab .8, data scientist .75, artificial intelligence .7, neural networks .55 } Known keyword phrases java developer machine learning registered nurse FST Knowledge Graph in + This Piece: How do you construct the best possible queries? The answer… Learning to Rank (Machine-learned Ranking) That can be a topic for next time…
  53. 53. Bay Area Search Type-ahead Prediction Building an Intent Engine Search Box Semantic Query Parsing Intent Engine Spelling Correction Entity / Entity Type Resolution Machine-learned Ranking Relevancy Engine (“re-expressing intent”) User Feedback (Clarifying Intent) Query Re-writing Search Results Query Augmentation Knowledge Graph Contextual Disambiguation
  54. 54. Bay Area Search Conceptual Framework for Information Retrieval: Traditional Keyword Search Recommendations Semantic Search User Intent Personalized Search Augmented Search Domain-aware Matching
  55. 55. Bay Area Search Additional References:
  56. 56. Bay Area Search Bonus Slides Audience question: how can you discover terms / related terms without having query logs to mine?
  57. 57. Bay Area Search One Option: Clustering on documents to find semantic links
  58. 58. Bay Area Search Setting up Clustering in solrconfig.xml <searchComponent name="clustering" enable="true" class="solr.clustering.ClusteringComponent"> <lst name="engine"> <str name="name">default</str> <str name="carrot.algorithm"> org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str> <str name="MultilingualClustering.defaultLanguage">ENGLISH</str> </lst> </searchComponent> <requestHandler name="/clustering" enable="true" class="solr.SearchHandler"> <lst name="defaults"> <str name="clustering.engine">default</str> <bool name="clustering.results">true</bool> <str name="fl">*,score</str> </lst> <arr name="last-components"> <str>clustering</str> </arr> </requestHandler>
  59. 59. Bay Area Search Clustering Query /solr/clustering/?q=solr &rows=100 &carrot.title=titlefield &carrot.snippet=titlefield &LingoClusteringAlgorithm.desiredClusterCountBase=25 //clustering & grouping don’t currently play nicely Allows you to dynamically identify “concepts” and their prevalence within a user’s top search results
  60. 60. Bay Area Search Original Query: q=solr Clustering Results Clusters Identified: Developer (22) Java Developer (13) Software (10) Senior Java Developer (9) Architect (6) Software Engineer (6) Web Developer (5) Search (3) Software Developer (3) Systems (3) Administrator (2) Hadoop Engineer (2) Java J2EE (2) Search Development (2) Software Architect (2) Solutions Architect (2) Identify Relationships:
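Turning cluster counts like these into the boosted expansion on the next slide is mechanical, assuming boost = count / 100 (the scheme the first few clusters appear to follow):

```python
def expansion_query(base, clusters, scale=100.0):
    """Build a boosted OR-expansion from (label, doc_count) cluster
    results, assuming boost = count / scale -- the scheme the deck
    appears to use for its cluster weights."""
    boosted = " OR ".join('"{0}"^{1:g}'.format(label, count / scale)
                          for label, count in clusters)
    return 'q="{0}" OR ({1})'.format(base, boosted)

# First few clusters from the slide above.
clusters = [("Developer", 22), ("Java Developer", 13), ("Software", 10)]
```

Running `expansion_query("solr", clusters)` yields the start of the expanded query shown on the following slide.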
  61. 61. Bay Area Search q="solr" OR ("Developer"^0.22 OR "Java Developer"^0.13 OR "Software"^0.10 OR "Senior Java Developer"^0.09 OR "Architect"^0.06 OR "Software Engineer"^0.06 OR "Web Developer"^0.05 OR "Search"^0.03 OR "Software Developer"^0.03 OR "Systems"^0.03 OR "Administrator"^0.02 OR "Hadoop Engineer"^0.02 OR "Java J2EE"^0.02 OR "Search Development"^0.02 OR "Software Architect"^0.02 OR "Solutions Architect"^0.02) Just plug in those semantic relationships as before…
  62. 62. Bay Area Search Contact Info Yes, WE ARE HIRING @ . Come talk with me if you are interested… Trey Grainger trey.grainger@careerbuilder.com @treygrainger http://solrinaction.com Conference discount (39% off): 39solrmu Other presentations: http://www.treygrainger.com
