
Crowdsourced query augmentation through the semantic discovery of domain specific jargon



Talk Abstract: Most work in semantic search has thus far focused either upon manually building language-specific taxonomies/ontologies or upon automatic techniques such as clustering or dimensionality reduction to discover latent semantic links within the content being searched. The former is very labor intensive and hard to maintain, while the latter is prone to noise and may be hard for a human to understand or interact with directly. We believe that the links between similar users’ queries represent a largely untapped source for discovering latent semantic relationships between search terms. The proposed system is capable of mining user search logs to discover semantic relationships between key phrases in a manner that is language agnostic, human understandable, and virtually noise-free.

Published in: Technology


  1. Crowdsourced Query Augmentation through the Semantic Discovery of Domain-specific Jargon. Khalifeh AlJadda, Mohammed Korayem, Trey Grainger, Chris Russell. 2014.10.28 - 2014 IEEE International Conference on Big Data - Washington, D.C.
  2. Authors
     • Khalifeh AlJadda – Ph.D. Candidate, University of Georgia
     • Mohammed Korayem – Ph.D. Candidate, Indiana University
     • Trey Grainger – Director of Engineering, Search, CareerBuilder
     • Chris Russell – Engineering Lead, Relevancy & Recommendations, CareerBuilder
  3. The problem
     • Traditional search engines (e.g. Lucene, Solr, Elasticsearch) tokenize text and find documents containing those tokens and their linguistic variations:
       – User’s search: machine learning
         Tokenization: ["machine", "learning"] => Stemming: ["machin", "learn"]
         Final query: machin AND learn
         This could match a document for a “machinist” who has “learned” something.
       – software architect => … => software AND architect
         • Might match a building architect who needs knowledge of specialized architecture software
       – account manager => … => account AND manag
         • Will match text such as “need to manage the process and account for any variances”
     • We need a way to identify and search for the meaning of keyword phrases, not just the individual text tokens
       – e.g. machine learning = "machine learning" OR "data scientist" OR "mahout" OR "svm" OR "neural networks" …
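The false match described on slide 3 can be reproduced in a few lines of Python. This is a minimal sketch under stated assumptions: `toy_stem` is a hypothetical suffix stripper written only for this illustration (it is not the Porter stemmer or any Lucene analyzer), and `matches_all_terms` stands in for a conjunctive (AND) query against a single document.

```python
import re

# Toy suffix list for illustration only; real engines use proper stemmers.
SUFFIXES = ("ing", "ist", "ed", "e")


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, tokenize on non-alphanumeric characters, then stem each token."""
    return [toy_stem(t) for t in re.findall(r"[a-z0-9]+", text.lower())]


def matches_all_terms(query, document):
    """Conjunctive match: every stemmed query term must appear in the document."""
    doc_terms = set(analyze(document))
    return all(term in doc_terms for term in analyze(query))


query_terms = analyze("machine learning")
false_positive = matches_all_terms(
    "machine learning", "A machinist who has learned something"
)
```

Here `query_terms` holds the stems `machin` and `learn`, and `false_positive` is true even though the document has nothing to do with machine learning, which is the token-level failure the slide describes.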
  4. Goals for the proposed system
     • The system should be language-agnostic. We don’t want custom NLP rules to be required for each language (we support dozens of languages).
     • The output of the system should be human-readable. We want to show users how we enhance their queries, in language they will understand, so they can modify our enhancements.
     • The system should be very high-precision (since end users will be seeing and critiquing the output) and should be automatically updated based upon new data.
     • The system must be fast and scalable, handling billions of search log entries (offline) and processing millions of queries an hour in real time.
  5. Alternate Techniques
     • Latent Semantic Indexing
       – Approach involves dimensionality reduction of text across your corpus to derive underlying relationships between terms:
         • e.g. java => programming, c# => programming, therefore they are related.
       – Pros:
         • Can be run automatically against your corpus of data to discover underlying (latent) relationships between terms, which requires very little human work
       – Cons:
         • The latent relationships often aren’t represented as a human would express them, so it would confuse users if they saw this information.
  6. Alternate Techniques
     • Manual building of taxonomies
       – Approach requires hiring human data analysts to manually build, correct, and improve taxonomies
       – Pros:
         • High-precision relationships can be mapped, depending upon the quality of your hired data analysts
       – Cons:
         • Requires human analysts to comb through hundreds of thousands of data points and generate lists of important phrases and relationships, which go stale
         • Requires expertise in every supported spoken language to rebuild taxonomies per language
  7. Example use case
     • User’s Query: machine learning research and development Portland, OR software engineer AND hadoop java
     • Traditional Search Engine Parsing: (machine AND learning AND research AND development AND portland) OR (software AND engineer AND hadoop AND java)
     • Ideal Parsing: "machine learning" AND "research and development" AND "Portland, OR" AND "software engineer" AND hadoop AND java
     • Semantically Enhanced Query: ("machine learning" OR "computer vision" OR "data mining" OR matlab) AND ("research and development" OR "r&d") AND ("Portland, OR" OR "Portland, Oregon") AND ("software engineer" OR "software developer") AND (hadoop OR "big data" OR hbase OR hive) AND (java OR j2ee)
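The step from the raw query to the semantically enhanced query on slide 7 can be sketched as phrase segmentation followed by expansion. The sketch below makes several assumptions that are not in the talk: the `RELATED` dictionary is copied by hand from the slide (the real system derives it from mined search logs), segmentation is a simple greedy longest-match, and the names `segment`, `expand`, and `quote` are hypothetical.

```python
# Hand-copied from the slide for illustration; the talk derives this from logs.
RELATED = {
    "machine learning": ["computer vision", "data mining", "matlab"],
    "research and development": ["r&d"],
    "portland, or": ["Portland, Oregon"],
    "software engineer": ["software developer"],
    "hadoop": ["big data", "hbase", "hive"],
    "java": ["j2ee"],
}
MAX_PHRASE_TOKENS = max(len(key.split()) for key in RELATED)


def quote(term):
    """Quote anything that is not a single alphanumeric token."""
    return term if term.isalnum() else f'"{term}"'


def segment(query):
    """Greedy longest-match segmentation against the known phrase list."""
    tokens = query.split()
    units, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_PHRASE_TOKENS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate.lower() in RELATED:
                units.append(candidate)
                i += n
                break
        else:
            # No known phrase starts here: drop bare operators, keep other terms.
            if tokens[i] not in ("AND", "OR"):
                units.append(tokens[i])
            i += 1
    return units


def expand(query):
    """AND together one OR-group per recognized phrase or term."""
    clauses = []
    for unit in segment(query):
        variants = [unit] + RELATED.get(unit.lower(), [])
        clause = " OR ".join(quote(v) for v in variants)
        clauses.append(f"({clause})" if len(variants) > 1 else clause)
    return " AND ".join(clauses)


enhanced = expand(
    "machine learning research and development Portland, OR "
    "software engineer AND hadoop java"
)
```

Trying the longest phrase first is what lets "Portland, OR" be recognized as a location before its "OR" can be misread as a boolean operator.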
  8. Proposed strategy
     1. Mine user search logs for a list of common phrases (“jargon”) within our domain.
     2. Perform collaborative filtering on the common jargon (“users who searched for that phrase also searched for this phrase”).
     3. Remove noise through several methodologies:
        – Segment search phrases based upon the classification of users
        – Consider shared jargon used by multiple sides of our two-sided market (e.g. both Job Seekers and Recruiters utilize the same phrase)
        – Validate that the two “related” phrases actually co-occur in real content (e.g. within the same job or resume) with some frequency
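Two of the noise-removal checks in step 3 can be illustrated with set operations: keep only candidate pairs proposed by both sides of the market, then require that both phrases appear together in at least a few real documents. This is a sketch under assumptions, not the talk's implementation. All function names, the `min_docs` threshold, and the sample data are hypothetical, and segmentation by user classification is not covered.

```python
import re


def pair(a, b):
    """Canonical unordered pair so (a, b) and (b, a) compare equal."""
    return tuple(sorted((a.lower(), b.lower())))


def normalize(text):
    """Lowercase and tokenize, padded with spaces for whole-phrase matching."""
    tokens = re.findall(r"[a-z0-9#+&.]+", text.lower())
    return " " + " ".join(tokens) + " "


def cooccurs_in_content(candidate, documents, min_docs=1):
    """True if both phrases appear in the same document at least min_docs times."""
    a, b = candidate
    hits = sum(
        1
        for doc in map(normalize, documents)
        if f" {a} " in doc and f" {b} " in doc
    )
    return hits >= min_docs


def filter_pairs(seeker_pairs, recruiter_pairs, documents, min_docs=1):
    """Intersect both sides of the market, then validate against content."""
    shared = seeker_pairs & recruiter_pairs
    return {p for p in shared if cooccurs_in_content(p, documents, min_docs)}


# Made-up sample data for illustration.
seeker_pairs = {
    pair("java", "j2ee"), pair("nurse", "rn"),
    pair("java", "coffee"), pair("java", "javascript"),
}
recruiter_pairs = {
    pair("java", "j2ee"), pair("nurse", "rn"), pair("java", "javascript"),
}
documents = [
    "Senior Java developer with J2EE and Spring experience",
    "Registered nurse (RN) needed for night shift",
]
validated = filter_pairs(seeker_pairs, recruiter_pairs, documents)
```

In this sample, "java / coffee" is dropped because only job seekers produced it, and "java / javascript" is dropped because the two phrases never share a document.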
  9. Finding and Scoring Related Jargon
     ● Implementation: Map/Reduce job which finds and scores similar searches run by the same users
       ○ Jane searched for “registered nurse” and “r.n.” and “nurse”.
       ○ Zeke searched for “java developer” and “scala” and “jvm” and “j2ee”.
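The pairing step of such a job can be sketched in plain Python, with one function playing the map role and one the reduce role. This is only an in-memory illustration of the idea, not the Map/Reduce job from the talk. The function names are hypothetical, and the user "amy" is invented so that one pair is seen more than once.

```python
from collections import Counter
from itertools import combinations


def map_user(searches):
    """Map step: emit every unordered pair of distinct phrases from one user."""
    phrases = sorted({phrase.lower() for phrase in searches})
    return list(combinations(phrases, 2))


def reduce_pairs(emitted_pairs):
    """Reduce step: count how many users produced each pair."""
    return Counter(emitted_pairs)


logs = {
    "jane": ["registered nurse", "r.n.", "nurse"],
    "zeke": ["java developer", "scala", "jvm", "j2ee"],
    "amy": ["nurse", "r.n."],  # hypothetical extra user
}

pair_counts = reduce_pairs(
    emitted for searches in logs.values() for emitted in map_user(searches)
)
```

Sorting each user's phrases before pairing gives every pair a canonical order, so the same two phrases always land on the same key regardless of the order in which a user searched for them.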
  10. Finding and Scoring Related Jargon
      Similarity Score: To do the collaborative filtering, we look at two similarity measures:
      1. Search Co-occurrences - provides raw, real-world correlation
      2. Point-wise Mutual Information - examines probability of terms being related by contrasting individual vs. joint distributions: [formula image not preserved in transcript]
      Final Score: [formula image not preserved in transcript]
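For reference, the textbook definition of pointwise mutual information is PMI(x, y) = log( P(x, y) / (P(x) · P(y)) ). The sketch below estimates it from user counts. It assumes probabilities are estimated as fractions of users, uses a base-2 logarithm, and uses the hypothetical name `pmi`. It does not reproduce the talk's final score, whose formula is not preserved in this transcript.

```python
import math


def pmi(users_with_x, users_with_y, users_with_both, total_users):
    """Pointwise mutual information of two phrases, estimated from user counts.

    P(x, y) / (P(x) * P(y)) simplifies to (both * total) / (x * y).
    """
    if users_with_both == 0:
        return float("-inf")  # the phrases were never searched by the same user
    ratio = (users_with_both * total_users) / (users_with_x * users_with_y)
    return math.log2(ratio)


# Made-up counts: 1000 users, 100 searched x, 50 searched y, 40 searched both.
related_score = pmi(100, 50, 40, 1000)

# If x and y were independent we would expect 100 * 50 / 1000 = 5 shared users.
independent_score = pmi(100, 50, 5, 1000)
```

A positive value means the two phrases are searched together more often than chance would predict, a value near zero means they look independent, and raw co-occurrence counts remain useful alongside PMI because PMI alone tends to favor rare pairs.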
  11. Example output
  12. Example output
      Cashier => retail, retail cashier, customer service, cashiers
      CDL => cdl driver, cdl a, driver
      Data Scientist => machine learning, big data
  13. Final System Architecture
  14. Follow-on work: Differentiating related Jargon
      Synonyms:
        cpa => Certified Public Accountant
        rn => Registered Nurse
        r.n. => Registered Nurse
      Ambiguous Terms*:
        driver => driver (trucking) ~80%
        driver => driver (software) ~20%
      Related Terms:
        r.n. => nursing, bsn
        hadoop => mapreduce, hive, pig
      *disambiguated based upon user and query context
  15. Applicability of Methodology
      • Can be used to discover domain-specific jargon across most domains (not just employment search)
      • Can be used to discover related jargon in any language, since the jargon and relationships are crowd-sourced at the phrase level
      • Intersecting input from both sides of a two-sided market, which produces the high-precision results, is optional. If you only have a single source of user queries, you will just get lower-precision mappings.
      • The only absolute requirement is sufficient search log history mapping users to multiple search phrases
  16. Q&A
  17. Semantic Search “under the hood”
  18. Contact Info
      ▪ Trey Grainger @treygrainger
      Other presentations:
      Yes, WE ARE HIRING @CareerBuilder. Come talk with me if you are interested…