
How to Build a Semantic Search System

Building a semantic search system - one that can correctly parse and interpret end-user intent and return the ideal results for users’ queries - is not an easy task. It requires semantically parsing the terms, phrases, and structure within queries, disambiguating polysemous terms, correcting misspellings, expanding to conceptually synonymous or related concepts, and rewriting queries in a way that maps the correct interpretation of each end user’s query into the ideal representation of features and weights that will return the best results for that user. Not only that, but the above must often be done within the confines of a very specific domain - rife with its own jargon and linguistic and conceptual nuances.

This talk will walk through the anatomy of a semantic search system and how each of the pieces described above fits together to deliver a final solution. We'll leverage several recently released capabilities in Apache Solr (the Semantic Knowledge Graph, Solr Text Tagger, Statistical Phrase Identifier) and Lucidworks Fusion (query log mining, misspelling job, word2vec job, query pipelines, relevancy experiment backtesting) to show you an end-to-end working semantic search system that can automatically learn the nuances of any domain and deliver a substantially more relevant search experience.

Published in: Software

How to Build a Semantic Search System

  1. 1. How to Build a Semantic Search System Trey Grainger SVP of Engineering, Lucidworks @treygrainger #Activate18 #ActivateSearch
  2. 2. Trey Grainger SVP of Engineering • Previously Director of Engineering @ CareerBuilder • Georgia Tech – MBA, Management of Technology • Furman University – BA, Computer Science, Business, & Philosophy • Stanford University – Information Retrieval & Web Search Other fun projects: • Co-author of Solr in Action, plus numerous research papers • Advisor to Presearch, the decentralized search engine • Lucene / Solr contributor About Me
  3. 3. Agenda •Philosophy of Semantic Search •Technology for Semantic Search •Q&A / Demo (time permitting)
  4. 4. Lucidworks Fusion powers search for the brightest companies in the world.
  5. 5. Philosophy of Semantic Search
  6. 6. most often used in reference to “free text”
  7. 7. My Three Philosophical Assertions 1) Unstructured data is actually “hyper-structured” data. It is a graph that contains much more structure than typical “structured data.” 2) That graph is very rich, but is a compression of meaning into a lossy format. Much of data science is essentially the decompression from this lossy format into a reconstituted form. 3) Most Important: Every instance of a word or phrase you ever encounter has a unique meaning.
  8. 8. Assertion 1: Unstructured data is actually “hyper-structured” data. It is a graph that contains much more structure than typical “structured data.”
  9. 9. Structured Data

      Employees Table
      id     name           company  start_date
      lw100  Trey Grainger  1234     2016-02-01
      dis2   Mickey Mouse   9123     1928-11-28
      tsla1  Elon Musk      5678     2003-07-01

      Companies Table
      id    name        start_date
      1234  Lucidworks  2016-02-01
      5678  Tesla       1928-11-28
      9123  Disney      2003-07-01

      Callouts: Discrete Values, Continuous Values, Foreign Key
  10. 10. Unstructured Data Trey Grainger works at Lucidworks. He is speaking at Activate 2018. #ActivateSearch (Activate) is being held in Montreal October 15-18, 2018. Trey got his masters from Georgia Tech.
  11. 11. Trey Grainger works for Lucidworks. He is speaking at the Activate 2018. #ActivateSearch (Activate) is being held in Montreal April 12-14, 2018. Trey got his masters degree from Georgia Tech. Trey’s Voicemail Unstructured Data
  12. 12. Trey Grainger works for Lucidworks. He is speaking at the Activate 2018. #ActivateSearch (Activate) is being held in Montreal April 12-14, 2018. Trey got his masters degree from Georgia Tech. Trey’s Voicemail Foreign Key?
  13. 13. Trey Grainger works for Lucidworks. He is speaking at the Activate 2018. #ActivateSearch (Activate) is being held in Montreal April 12-14, 2018. Trey got his masters degree from Georgia Tech. Trey’s Voicemail Fuzzy Foreign Key? (Entity Resolution)
  14. 14. Trey Grainger works for Lucidworks. He is speaking at the Activate 2018. #ActivateSearch (Activate) is being held in Montreal April 12-14, 2018. Trey got his masters degree from Georgia Tech. Trey’s Voicemail Fuzzier Foreign Key? (metadata, latent features)
  15. 15. Fuzzier Foreign Key? (metadata, latent features) Trey Grainger works for Lucidworks. He is speaking at the Activate 2018. #ActivateSearch (Activate) is being held in Montreal April 12-14, 2018. Trey got his masters degree from Georgia Tech. Trey’s Voicemail Not so fast!
  16. 16. Giant Graph of Relationships... Trey Grainger works for Lucidworks. He is speaking at the Activate 2018. #ActivateSearch (Activate) is being held in Montreal April 12-14, 2018. Trey got his masters degree from Georgia Tech. Trey’s Voicemail
  17. 17. Assertion 1 (Summary): Unstructured data is actually “hyper-structured” data. It is a graph that contains much more structure than typical “structured data.”
  18. 18. Assertion 2: That graph is very rich, but is a compression of meaning into a lossy format. Much of data science is essentially the decompression from this lossy format into a reconstituted form.
  19. 19. 01 Semantic Data Encoded into Free Text Content
  20. 20. How do we easily harness this “semantic graph” of relationships within unstructured information?
  21. 21. Search Engines are really good at querying across character sequences, term sequences, and documents. Example Queries:

      c?o matches: CTO, CEO, CFO, …
      "VP Engineering"~2 matches: “VP of Engineering”, “VP Engineering”, “Engineering VP”, “VP of Infrastructure Engineering”
      (Microsoft OR MS) AND Word matches: “MS Word”, “Microsoft Word”
  22. 22. Matching queries to documents

      /solr/collection/select/?q=apache solr

      Term    Documents
      …       …
      apache  doc1, doc3, doc4, doc5
      …       …
      hadoop  doc2, doc4, doc6
      …       …
      solr    doc1, doc3, doc4, doc7, doc8
      …       …

      Venn diagram: apache only: doc5; solr only: doc7, doc8; apache and solr: doc1, doc3, doc4
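
The term-to-documents lookup on this slide can be sketched in a few lines of Python. This is only an illustration of the inverted-index idea, not how Solr or Lucene implement it; the function names `build_inverted_index` and `search` are hypothetical, and the document texts are invented so that the postings match the table on the slide.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query, operator="AND"):
    """Combine the postings of every query term with AND or OR semantics."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    if not postings:
        return set()
    if operator == "AND":
        return set.intersection(*postings)
    return set.union(*postings)


# Hypothetical document texts chosen to reproduce the slide's postings lists.
docs = {
    "doc1": "apache solr", "doc2": "hadoop", "doc3": "apache solr",
    "doc4": "apache hadoop solr", "doc5": "apache", "doc6": "hadoop",
    "doc7": "solr", "doc8": "solr",
}
index = build_inverted_index(docs)
both = search(index, "apache solr", "AND")   # overlap of the two circles
either = search(index, "apache solr", "OR")  # union of the two circles
```

With AND semantics only the overlap of the two postings lists is returned; with OR semantics every document in either list matches.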
  23. 23. Knowledge Graph

      Documents
      id: 1  job_title: Software Engineer  desc: software engineer at a great company  skills: .Net, C#, java
      id: 2  job_title: Registered Nurse  desc: a registered nurse at hospital doing hard work  skills: oncology, phlebotomy
      id: 3  job_title: Java Developer  desc: a software engineer or a java engineer doing work  skills: java, scala, hibernate

      Docs-Terms Forward Index
      field      doc  term
      desc       1    a, at, company, engineer, great, software
      desc       2    a, at, doing, hard, hospital, nurse, registered, work
      desc       3    a, doing, engineer, java, or, software, work
      job_title  1    Software Engineer
      …          …    …

      Terms-Docs Inverted Index
      field      term            postings list (doc: pos)
      desc       a               1: 4; 2: 1; 3: 1, 5
      desc       at              1: 3; 2: 4
      desc       company         1: 6
      desc       doing           2: 6; 3: 8
      desc       engineer        1: 2; 3: 3, 7
      desc       great           1: 5
      desc       hard            2: 7
      desc       hospital        2: 5
      desc       java            3: 6
      desc       nurse           2: 3
      desc       or              3: 4
      desc       registered      2: 2
      desc       software        1: 1; 3: 2
      desc       work            2: 10; 3: 9
      job_title  java developer  3: 1
      …          …               …

      Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
  24. 24. Semantic Knowledge Graph
  25. 25. DOI: 10.1109/DSAA.2016.51 Conference: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA) Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016. Knowledge Graph Graph Traversal Data Structure View Graph View doc 1 doc 2 doc 3 doc 4 doc 5 doc 6 skill: Java skill: Java skill: Scala skill: Hibernate skill: Oncology doc 1 doc 2 doc 3 doc 4 doc 5 doc 6 job_title: Software Engineer job_title: Data Scientist job_title: Java Developer …… Inverted Index Lookup Forward Index Lookup Forward Index Lookup Inverted Index Lookup Java Java Developer Hibernate Scala Software Engineer Data Scientist has_related_skill has_related_skill has_related_skill has_related_job_title has_related_job_title has_related_job_title has_related_job_title has_related_job_title has_related_job_title
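
The traversal on this slide (an inverted index lookup from a node to its documents, then a forward index lookup from those documents to related nodes) can be sketched as follows. This is a toy illustration under the assumption that documents are small dicts of field values, reusing the three example documents from the earlier slide; `traverse` is a hypothetical helper, not part of Solr.

```python
from collections import Counter, defaultdict

# Forward index: document id -> field -> values (the example job documents).
docs = {
    1: {"job_title": ["software engineer"], "skills": [".net", "c#", "java"]},
    2: {"job_title": ["registered nurse"], "skills": ["oncology", "phlebotomy"]},
    3: {"job_title": ["java developer"], "skills": ["java", "scala", "hibernate"]},
}

# Inverted index: (field, value) -> set of document ids.
inverted = defaultdict(set)
for doc_id, fields in docs.items():
    for field, values in fields.items():
        for value in values:
            inverted[(field, value)].add(doc_id)


def traverse(field, value, target_field):
    """Count values of target_field that co-occur with (field, value)."""
    counts = Counter()
    for doc_id in inverted.get((field, value), set()):   # inverted index lookup
        for related in docs[doc_id][target_field]:       # forward index lookup
            if (target_field, related) != (field, value):  # skip the start node
                counts[related] += 1
    return counts


related_titles = traverse("skills", "java", "job_title")
related_skills = traverse("skills", "java", "skills")
```

Here the edges are weighted by raw co-occurrence counts only; the scoring slide that follows describes a statistical weighting instead.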
  26. 26. Search engines also do relevancy ranking

      $$\mathrm{Score}(q, d) = \sum_{t \text{ in } q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t \text{ in } d) \cdot (k + 1)}{\mathrm{tf}(t \text{ in } d) + k \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$

      Where: t = term; d = document; q = query; i = index

      $$\mathrm{tf}(t \text{ in } d) = \mathrm{numTermOccurrencesInDocument}^{1/2}$$

      $$\mathrm{idf}(t) = 1 + \log\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq} + 1}\right)$$

      $$|d| = \sum_{t \text{ in } d} 1 \qquad \mathrm{avgdl} = \frac{\sum_{d \text{ in } i} |d|}{\sum_{d \text{ in } i} 1}$$

      k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency saturation point.
      b = Free parameter. Usually ~0.75. Increases impact of document normalization.
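
As a rough sketch, the scoring formula on this slide can be written out directly in Python. It follows the definitions exactly as the slide states them (square-root term frequency, `1 + log(numDocs / (docFreq + 1))` for idf), which may differ from the BM25 variant a given Lucene or Solr version actually ships; `bm25_score` and the whitespace tokenization are assumptions made for illustration.

```python
import math


def bm25_score(query_terms, doc_terms, corpus, k=1.2, b=0.75):
    """Score one tokenized document against a query, per the slide's formula."""
    num_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / num_docs  # average document length
    score = 0.0
    for term in query_terms:
        tf = math.sqrt(doc_terms.count(term))  # tf as defined on the slide
        if tf == 0:
            continue  # term absent from this document contributes nothing
        doc_freq = sum(1 for d in corpus if term in d)
        idf = 1 + math.log(num_docs / (doc_freq + 1))
        length_norm = 1 - b + b * len(doc_terms) / avgdl
        score += idf * (tf * (k + 1)) / (tf + k * length_norm)
    return score


# The "desc" fields of the three example job documents, whitespace-tokenized.
corpus = [
    "software engineer at a great company".split(),
    "a registered nurse at hospital doing hard work".split(),
    "a software engineer or a java engineer doing work".split(),
]
scores = [bm25_score(["java", "engineer"], doc, corpus) for doc in corpus]
```

For the query `java engineer`, the third document scores highest because it contains both terms, the first matches only `engineer`, and the second matches neither.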
  27. 27. Scoring of Node Relationships (Edge Weights): Foreground vs. Background Analysis. Every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context.

      $$z = \frac{\mathrm{countFG}(x) - \mathrm{totalDocsFG} \cdot \mathrm{probBG}(x)}{\sqrt{\mathrm{totalDocsFG} \cdot \mathrm{probBG}(x) \cdot (1 - \mathrm{probBG}(x))}}$$

      Foreground Query: "Hadoop"

      { "type": "keywords",
        "values": [
          { "value": "hive", "relatedness": 0.9773, "popularity": 369 },
          { "value": "java", "relatedness": 0.9236, "popularity": 15653 },
          { "value": ".net", "relatedness": 0.5294, "popularity": 17683 },
          { "value": "bee", "relatedness": 0.0, "popularity": 0 },
          { "value": "teacher", "relatedness": -0.2380, "popularity": 9923 },
          { "value": "registered nurse", "relatedness": -0.3802, "popularity": 27089 }
        ] }

      We are essentially boosting terms which are more related to some known feature (and ignoring terms which are equally likely to appear in the background corpus).
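
The z-score on this slide translates into a short function. The sketch below returns the raw z value; the relatedness values in the slide's JSON appear to be normalized into roughly the -1 to 1 range, and that normalization step is not shown on the slide, so it is not reproduced here. The counts in the usage lines are invented for illustration and are not taken from any real corpus.

```python
import math


def z_score(count_fg, total_docs_fg, count_bg, total_docs_bg):
    """Compare how often a term occurs in the foreground vs. the background."""
    prob_bg = count_bg / total_docs_bg           # probBG(x)
    expected = total_docs_fg * prob_bg           # expected foreground count
    std_dev = math.sqrt(total_docs_fg * prob_bg * (1 - prob_bg))
    if std_dev == 0:
        return 0.0  # term never (or always) occurs in the background
    return (count_fg - expected) / std_dev


# Hypothetical counts: a foreground of 1,000 "Hadoop" documents drawn from a
# background of 1,000,000 documents.
z_hive = z_score(count_fg=300, total_docs_fg=1000, count_bg=369, total_docs_bg=1_000_000)
z_teacher = z_score(count_fg=1, total_docs_fg=1000, count_bg=9923, total_docs_bg=1_000_000)
```

A term that is over-represented in the foreground gets a positive score, one that is under-represented gets a negative score, and a term that appears about as often as the background predicts lands near zero.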
  28. 28. Assertion 2 (Summary): That graph is very rich, but is a compression of meaning into a lossy format. Much of data science is essentially the decompression from this lossy format into a reconstituted form.
  29. 29. Assertion 3: Every instance of a word or phrase you ever encounter has a unique meaning.
  30. 30. Thought Exercise What do you think of when I say the word “driver”? What about “architect”?
  31. 31. Ambiguity Example Related Keywords (representing multiple meanings) driver truck driver, linux, windows, courier, embedded, cdl, delivery architect autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer … … Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
  32. 32. Use Case: Query Disambiguation Example Related Keywords (representing multiple meanings) driver truck driver, linux, windows, courier, embedded, cdl, delivery architect autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer … … Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
  33. 33. Disambiguated meanings (represented as term vectors) Example Related Keywords (Disambiguated Meanings) architect 1: enterprise architect, java architect, data architect, oracle, java, .net 2: architectural designer, architectural drafter, autocad, autocad drafter, designer, drafter, cad, engineer driver 1: linux, windows, embedded 2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier designer 1: design, print, animation, artist, illustrator, creative, graphic artist, graphic, photoshop, video 2: graphic, web designer, design, web design, graphic design, graphic designer 3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe, structural designer, revit … … Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
  34. 34. Using the disambiguated meanings In a situation where a user searches for an ambiguous phrase, what information can we use to pick the correct underlying meaning? 1. Any pre-existing knowledge about the user: • User is a software engineer • User has previously run searches for “c++” and “linux” 2. Context within the query: User searched for windows AND driver vs. courier OR driver 3. If all else fails (and there is no context), use the most commonly occurring meaning. driver 1: linux, windows, embedded 2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
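
The three-step selection described on this slide could look something like the sketch below. It assumes each sense is a set of related keywords and that senses are stored most-common-first, so that a tie (including the no-context case) falls back to the first one; `SENSES` and `pick_sense` are hypothetical names, and simple keyword overlap stands in for whatever scoring the cited paper actually uses.

```python
# Disambiguated senses for "driver", copied from the slide. The order is an
# assumption: the first entry is treated as the most commonly occurring sense.
SENSES = {
    "driver": [
        {"linux", "windows", "embedded"},
        {"truck driver", "cdl driver", "delivery driver",
         "class b driver", "cdl", "courier"},
    ],
}


def pick_sense(term, context_terms, senses=SENSES):
    """Return the index of the sense sharing the most keywords with the context.

    context_terms may mix other query terms with what is known about the user
    (for example, previous searches). Returns None for unknown terms.
    """
    candidates = senses.get(term, [])
    if not candidates:
        return None
    context = {c.lower() for c in context_terms}
    overlaps = [len(sense & context) for sense in candidates]
    # max() returns the first maximum, so ties fall back to the first sense.
    return max(range(len(candidates)), key=lambda i: overlaps[i])


from_query = pick_sense("driver", ["windows"])         # context within the query
from_history = pick_sense("driver", ["c++", "linux"])  # prior user searches
from_courier = pick_sense("driver", ["courier"])
no_context = pick_sense("driver", [])                  # fallback
```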
  35. 35. Thought Exercise What do you think of when I say the word “Facebook”?
  36. 36. Every term or phrase is a Context-dependent cluster of meaning with an ambiguous label
  37. 37. What does “love” mean? http://localhost:8983/solr/thesaurus/skg
  38. 38. What does “love” mean in the context of “hug”? http://localhost:8983/solr/thesaurus/skg
  39. 39. What does “love” mean in the context of “child”? http://localhost:8983/solr/thesaurus/skg
  40. 40. My Three Assertions (Recap) 1) Unstructured data is actually “hyper-structured” data. It is a graph that contains much more structure than typical “structured data.” 2) That graph is very rich, but is a compression of meaning into a lossy format. Much of data science is essentially the decompression from this lossy format into a reconstituted form. 3) Most Important: Every instance of a word or phrase you ever encounter has a unique meaning.
  41. 41. Technology for Semantic Search
  42. 42. So why all the philosophy? Because it’s much more important to intuitively understand the kinds of problems we’re trying to solve with Semantic Search than to jump head-first into the solution. Because otherwise we may build the wrong thing, which can sometimes be worse than not doing anything. And once you have an intuitive sense of the problems you need to solve, you can confidently use the tools I’m about to describe to build the right solution for your specific domain.
  43. 43. So what’s the end goal here?

      User’s Query: machine learning research and development Portland, OR software engineer AND hadoop, java

      Traditional Query Parsing: (machine AND learning AND research AND development AND portland) OR (software AND engineer AND hadoop AND java)

      Semantic Query Parsing: "machine learning" AND "research and development" AND "Portland, OR" AND "software engineer" AND hadoop AND java

      Semantically Expanded Query: ("machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence") AND ("research and development"^10 OR "r&d") AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo}) AND ("software engineer"^10 OR "software developer") AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
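
To make the last step concrete, here is a minimal sketch of how an already-parsed list of phrases might be turned into a boosted, expanded query string of the shape shown on this slide. It assumes the phrases and their expansions are already known (in the talk these come from the tagger, knowledge graph, and synonym jobs); `expand_query` and `quote` are hypothetical helpers, and special cases such as the `{!geofilt ...}` clause are left out.

```python
def quote(term):
    """Wrap multi-word phrases in double quotes; leave single terms bare."""
    return f'"{term}"' if " " in term else term


def expand_query(phrases, expansions, boost=10):
    """Build a (phrase^boost OR alt1 OR alt2) AND (...) query string."""
    clauses = []
    for phrase in phrases:
        alternatives = [f"{quote(phrase)}^{boost}"]  # original phrase, boosted
        alternatives += [quote(alt) for alt in expansions.get(phrase, [])]
        clauses.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(clauses)


# Expansions copied from the slide's example.
expansions = {
    "machine learning": ["data scientist", "data mining", "artificial intelligence"],
    "java": ["j2ee"],
}
expanded = expand_query(["machine learning", "java"], expansions)
```

Boosting the user's original phrase keeps exact matches ranked above documents that match only the related terms.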
  44. 44. Semantic Search Components: • Apache Solr • Solr Text Tagger • Semantic Knowledge Graph • Statistical Phrase Identifier • Fusion Semantic Query Pipelines • Fusion AI Synonyms Job • Fusion AI Token & Phrase Spell Correction Job • Fusion AI Head/Tail Analysis Job • Fusion AI Phrase Identification Job • Fusion Query Rules Engine
  45. 45. In 2018, Lucidworks added the following capabilities to Solr: • Solr Text Tagger • Semantic Knowledge Graph • Statistical Phrase Identifier
  46. 46. which all integrate seamlessly with the following in Fusion: • Fusion Semantic Query Pipelines • Fusion AI Synonyms Job • Fusion AI Token & Phrase Spell Correction Job • Fusion AI Head/Tail Analysis Job • Fusion AI Phrase Identification Job • Fusion Query Rules Engine
  47. 47. Through these tools, Fusion self-learns domain-specific semantic relationships
  48. 48. … and enables domain experts to easily accept or adjust the built-in AI: completely deferring to Fusion’s AI, or trusting it above a certain confidence level, or even manually approving every suggestion.
  49. 49. Fusion AI Semantic Search Jobs
  50. 50. Semantic Query Pipeline
  51. 51. Released in Solr 7.5
  52. 52. Semantic Query Parsing: Identification of phrases in queries using two approaches, followed by a merge step: 1) Check a dictionary of known terms that is continuously built, cleaned, and refined based upon common inputs from interactions with real users of the system. We use the Solr Text Tagger (already covered) for this at query time.* 2) Also invoke a probabilistic query parser to dynamically identify unknown phrases using statistics from a corpus of data (language model). 3) Apply a final algorithm to choose the best merge when the two approaches disagree. *K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
  53. 53. Probabilistic Query Parsing

      Goal: given a query, predict which combinations of keywords should be combined together as phrases

      Example: senior java developer hadoop

      Possible Parsings:
      senior, java, developer, hadoop
      "senior java", developer, hadoop
      "senior java developer", hadoop
      "senior java developer hadoop"
      "senior java", "developer hadoop"
      senior, "java developer", hadoop
      senior, java, "developer hadoop"

      Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disambiguation”, Bay Area Search Meetup, November 2015.
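
The candidate parsings on this slide are the ways of splitting the query into contiguous phrases. A small sketch of that enumeration step is below; it only generates candidates and does not score them, since the slide does not specify the probabilistic model. `possible_parsings` is a hypothetical name.

```python
from itertools import product


def possible_parsings(query):
    """Enumerate every split of the query into contiguous phrases."""
    tokens = query.split()
    if not tokens:
        return []
    parsings = []
    # Each gap between adjacent tokens is either a phrase break or not.
    for breaks in product([True, False], repeat=len(tokens) - 1):
        segments, current = [], [tokens[0]]
        for token, is_break in zip(tokens[1:], breaks):
            if is_break:
                segments.append(" ".join(current))
                current = [token]
            else:
                current.append(token)
        segments.append(" ".join(current))
        parsings.append(segments)
    return parsings


parsings = possible_parsings("senior java developer hadoop")
```

A query of n tokens has 2^(n-1) such splits, so this four-word example yields eight candidates: the seven listed on the slide plus `senior, "java developer hadoop"`. The parser's job is then to pick the most probable candidate.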
  54. 54. Released in Solr 7.5
  55. 55. Released in Solr 7.4
  56. 56. A few last thoughts to leave you with…
  57. 57. Words of Advice: You can’t improve what you can’t measure.
  58. 58. Importance of Feedback Loops: User Searches, User Sees Results, User takes an action, Users’ actions inform system improvements
  59. 59. Going beyond semantic search… (diagram labels: Traditional Keyword Search, Recommendations, Semantic Search, User Intent, Personalized Search, Augmented Search, Domain-aware Matching)
  60. 60. Trey Grainger trey.grainger@lucidworks.com @treygrainger Thank you! http://solrinaction.com #Activate18 #ActivateSearch Other presentations: http://www.treygrainger.com Discount code:ctwactivate18
