Searching for Meaning

"Searching for Meaning: The Hidden Structure in Unstructured Data". Presentation by Trey Grainger at the Southern Data Science Conference (SDSC) 2018. Covers linguistic theory, application in search and information retrieval, and knowledge graph and ontology learning methods for automatically deriving contextualized meaning from unstructured (free text) content.

Published in: Technology

  1. 1. Searching for Meaning: The hidden structure in unstructured data Trey Grainger SVP of Engineering, Lucidworks Southern Data Science Conference 2018.04.13
  2. 2. About Me: Trey Grainger, SVP of Engineering • Previously Director of Engineering @ CareerBuilder • MBA, Management of Technology – Georgia Tech • BA, Computer Science, Business, & Philosophy – Furman University • Information Retrieval & Web Search – Stanford University. Other fun projects: • Co-author of Solr in Action, plus numerous research papers • Advisor to Presearch, the decentralized search engine • Founder of Celiaccess.com, the gluten-free search engine • Lucene / Solr contributor
  3. 3. • Based in San Francisco, with offices and employees worldwide • Fusion, the platform for building data-driven, smart apps • Over 400 customers running our commercial software • Consulting and support for organizations using Solr • Produces the world’s largest open source user conference dedicated to Lucene/Solr • Lucidworks is the primary commercial contributor to the Apache Solr project • Employs over 40% of the active committers on the Solr project • Contributes over 70% of Solr's open source codebase
  4. 4. Fusion powers search for the brightest companies in the world.
  5. 5. most often used in reference to
  6. 6. My Three Assertions 1) Unstructured data is actually “hyper-structured” data. It is a graph that contains much more structure than typical “structured data.” 2) That graph is very rich, but is a compression of meaning into a lossy format. Much of data science is essentially the decompression from this lossy format into a reconstituted form. 3) Most Important: Every instance of a word or phrase you ever encounter has a unique meaning.
  7. 7. Assertion 1: Unstructured data is actually “hyper-structured” data. It is a graph that contains much more structure than typical “structured data.”
  8. 8. Structured Data
      Employees Table
        id    | name          | company | start_date
        lw100 | Trey Grainger | 1234    | 2016-02-01
        dis2  | Mickey Mouse  | 9123    | 1928-11-28
        tsla1 | Elon Musk     | 5678    | 2003-07-01
      Companies Table
        id   | name       | start_date
        1234 | Lucidworks | 2016-02-01
        5678 | Tesla      | 1928-11-28
        9123 | Disney     | 2003-07-01
      Annotations: Discrete Values, Continuous Values, Foreign Key (the Employees company column references the Companies id column)
  9. 9. Unstructured Data: Trey Grainger works at Lucidworks. He is speaking at the SDSC 2018. Southern Data Science Conference (SDSC) is being held in Atlanta April 12-14, 2018. Trey got his master's from Georgia Tech.
  10. 10. Unstructured Data: Trey Grainger works at Lucidworks. He is speaking at the SDSC 2018. Southern Data Science Conference (SDSC) is being held in Atlanta April 12-14, 2018. Trey got his master's degree from Georgia Tech. Trey’s Voicemail.
  11. 11. Foreign Key? Trey Grainger works at Lucidworks. He is speaking at the SDSC 2018. Southern Data Science Conference (SDSC) is being held in Atlanta April 12-14, 2018. Trey got his master's degree from Georgia Tech. Trey’s Voicemail.
  12. 12. Fuzzy Foreign Key? (Entity Resolution) Trey Grainger works at Lucidworks. He is speaking at the SDSC 2018. Southern Data Science Conference (SDSC) is being held in Atlanta April 12-14, 2018. Trey got his master's degree from Georgia Tech. Trey’s Voicemail.
  13. 13. Fuzzier Foreign Key? (metadata, latent features) Trey Grainger works at Lucidworks. He is speaking at the SDSC 2018. Southern Data Science Conference (SDSC) is being held in Atlanta April 12-14, 2018. Trey got his master's degree from Georgia Tech. Trey’s Voicemail.
  14. 14. Fuzzier Foreign Key? (metadata, latent features) Not so Fast! Trey Grainger works at Lucidworks. He is speaking at the SDSC 2018. Southern Data Science Conference (SDSC) is being held in Atlanta April 12-14, 2018. Trey got his master's degree from Georgia Tech. Trey’s Voicemail.
  15. 15. Giant Graph of Relationships... Trey Grainger works for Lucidworks. He is speaking at the SDSC 2018. Southern Data Science Conference (SDSC) is being held in Atlanta April 12-14, 2018. Trey got his master's degree from Georgia Tech. Trey’s Voicemail.
  16. 16. Assertion 1 (Summary): Unstructured data is actually “hyper-structured” data. It is a graph that contains much more structure than typical “structured data.”
  17. 17. Assertion 2: That graph is very rich, but is a compression of meaning into a lossy format. Much of data science is essentially the decompression from this lossy format into a reconstituted form.
  18. 18. Semantic Data Encoded into Free Text Content
      Node Type: Character Sequence: e, en, eng, engi, …
      Node Type: Term: engineer, engineers, engineering, software, …
      Node Type: Term Sequence: software engineer, software engineers, electrical engineering, …
      Node Type: Document:
        id: 1, text: looking for a software engineer with degree in computer science or electrical engineering
        id: 2, text: apply to be a software engineer and work with other great software engineers
        id: 3, text: start a great career in electrical engineering
        …
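The node types on this slide can be sketched in a few lines of Python. This is only an illustration of how character sequences and term sequences might be derived from a document's text; the function names `character_prefixes` and `term_sequences` are hypothetical and do not come from the talk.

```python
def character_prefixes(term):
    """Character-sequence nodes: every leading prefix of a term."""
    return [term[:i] for i in range(1, len(term) + 1)]


def term_sequences(text, n=2):
    """Term-sequence nodes: every run of n adjacent terms in the text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Document 2 from the slide.
doc2 = "apply to be a software engineer and work with other great software engineers"

prefixes = character_prefixes("engineer")   # starts with e, en, eng, engi
bigrams = term_sequences(doc2)              # includes "software engineer"
```

Each document then links to its terms, each term to its character sequences, and each term sequence to the terms it contains, which is what makes the free text a graph.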
  19. 19. How do we easily harness this “semantic graph” of relationships within unstructured information?
  20. 20. Search engines are really good at querying across character sequences, term sequences, and documents. Example queries:
      c?o → CTO, CEO, CFO, …
      "VP Engineering"~2 → “VP of Engineering”, “VP Engineering”, “Engineering VP”, “VP of Infrastructure Engineering”
      (Microsoft OR MS) AND Word → “MS Word”, “Microsoft Word”
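To make the first two example queries concrete, here is a rough Python sketch. It is not how Lucene/Solr evaluates them: the wildcard uses Python's `fnmatch` module, and `within_slop` is a simplified, hypothetical stand-in for a sloppy phrase match that only handles two-term phrases and counts how many positions the second term must move.

```python
from fnmatch import fnmatchcase


def wildcard_terms(pattern, vocabulary):
    """Return vocabulary terms matching a ?/* wildcard pattern."""
    return [t for t in vocabulary if fnmatchcase(t, pattern)]


def within_slop(first, second, text, slop):
    """Simplified two-term proximity check (assumption, not Lucene's algorithm).

    In-order matches cost one move per intervening term; reversed
    matches cost one extra move for the swap.
    """
    tokens = text.lower().split()
    pos_first = [i for i, t in enumerate(tokens) if t == first]
    pos_second = [i for i, t in enumerate(tokens) if t == second]
    for a in pos_first:
        for b in pos_second:
            moves = b - a - 1 if b > a else a - b + 1
            if moves <= slop:
                return True
    return False


vocabulary = ["cto", "ceo", "cfo", "cat", "chro"]
executives = wildcard_terms("c?o", vocabulary)
```

With a slop of 2, `within_slop("vp", "engineering", ...)` accepts all four phrases listed on the slide and rejects a phrase such as "VP of Sales and Engineering".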
  21. 21. An inverted index (“how a search engine works”)
      What you SEND to Lucene/Solr:
        Document | Content Field
        doc1 | once upon a time, in a land far, far away
        doc2 | the cow jumped over the moon.
        doc3 | the quick brown fox jumped over the lazy dog.
        doc4 | the cat in the hat
        doc5 | The brown cow said “moo” once.
        … | …
      How the content is INDEXED into Lucene/Solr (conceptually):
        Term | Documents
        a | doc1 [2x]
        brown | doc3 [1x], doc5 [1x]
        cat | doc4 [1x]
        cow | doc2 [1x], doc5 [1x]
        … | …
        once | doc1 [1x], doc5 [1x]
        over | doc2 [1x], doc3 [1x]
        the | doc2 [2x], doc3 [2x], doc4 [2x], doc5 [1x]
        … | …
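The conceptual index on this slide can be reproduced with a minimal Python sketch. The tokenizer (lowercase, alphanumeric runs only) is an assumption made for illustration; a real Lucene/Solr analysis chain is configurable and does considerably more.

```python
import re
from collections import Counter, defaultdict

# The five example documents from the slide.
docs = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": 'The brown cow said "moo" once.',
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to {doc_id: number of occurrences in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return dict(index)


index = build_inverted_index(docs)
```

Looking up `index["the"]` yields the same postings the slide shows: twice in doc2, doc3, and doc4, once in doc5.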
  22. 22. Matching queries to documents. Query: /solr/collection/select/?q=apache solr
      Term | Documents
      … | …
      apache | doc1, doc3, doc4, doc5, …
      hadoop | doc2, doc4, doc6, …
      solr | doc1, doc3, doc4, doc7, doc8, …
      … | …
      [Venn diagram: “apache” only: doc5; “solr” only: doc7, doc8; both “apache” and “solr”: doc1, doc3, doc4]
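Matching a query against the inverted index comes down to set operations on postings lists. The sketch below assumes the postings shown on the slide are complete (the slide elides further entries with "…"), and the `match` function and its `operator` parameter are illustrative names rather than anything from Solr.

```python
# Postings lists as shown on the slide (elided entries omitted).
postings = {
    "apache": {"doc1", "doc3", "doc4", "doc5"},
    "hadoop": {"doc2", "doc4", "doc6"},
    "solr": {"doc1", "doc3", "doc4", "doc7", "doc8"},
}


def match(terms, operator="OR"):
    """Return the documents matching all terms (AND) or any term (OR)."""
    sets = [postings.get(term, set()) for term in terms]
    if not sets:
        return set()
    if operator == "AND":
        return set.intersection(*sets)
    return set.union(*sets)


both = match(["apache", "solr"], operator="AND")
either = match(["apache", "solr"], operator="OR")
```

The AND result is the overlap of the two circles in the slide's Venn diagram, and the OR result is everything inside either circle.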
  23. 23. Search engines also do relevancy ranking (query to doc):

      $$\mathrm{Score}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t \in d) \cdot (k + 1)}{\mathrm{tf}(t \in d) + k \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$

      Where:
      t = term; d = document; q = query; i = index
      $\mathrm{tf}(t \in d) = \mathrm{numTermOccurrencesInDocument}^{1/2}$
      $\mathrm{idf}(t) = 1 + \log\left(\mathrm{numDocs} / (\mathrm{docFreq} + 1)\right)$
      $|d| = \sum_{t \in d} 1$
      $\mathrm{avgdl} = \left(\sum_{d \in i} |d|\right) / \left(\sum_{d \in i} 1\right)$
      k = Free parameter. Usually ~1.2 to 2.0. Increases term frequency saturation point.
      b = Free parameter. Usually ~0.75. Increases impact of document normalization.
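The scoring formula can be transcribed directly into Python. This sketch follows the tf and idf definitions exactly as they appear on the slide, which may differ from what a given Lucene/Solr version computes; the whitespace-free tokenizer and the small corpus (the five documents from the inverted index slide) are assumptions for illustration.

```python
import math
import re

texts = [
    "once upon a time, in a land far, far away",
    "the cow jumped over the moon.",
    "the quick brown fox jumped over the lazy dog.",
    "the cat in the hat",
    'The brown cow said "moo" once.',
]
corpus = [re.findall(r"[a-z0-9]+", t.lower()) for t in texts]


def score(query_terms, doc_terms, corpus, k=1.2, b=0.75):
    """Score one document against a query using the slide's definitions."""
    num_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / num_docs
    # Length normalization: longer-than-average documents are penalized.
    norm = k * (1 - b + b * len(doc_terms) / avgdl)
    total = 0.0
    for term in query_terms:
        tf = math.sqrt(doc_terms.count(term))
        doc_freq = sum(1 for d in corpus if term in d)
        idf = 1 + math.log(num_docs / (doc_freq + 1))
        total += idf * tf * (k + 1) / (tf + norm)
    return total


cow_in_doc2 = score(["cow"], corpus[1], corpus)
cow_in_doc4 = score(["cow"], corpus[3], corpus)
```

A document that does not contain the query term contributes a term frequency of zero, so its score for that term is zero, while raising `k` lets repeated occurrences keep adding to the score for longer before saturating.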
  24. 24. DOI: 10.1109/DSAA.2016.51 Conference: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)
  25. 25. What is the Semantic Knowledge Graph? • “A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain” • A multi-dimensional term-to-term (vs. term-to-document) search engine • A tool which enables knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendation systems • It’s kind of like Word2Vec, but vectors (or matrices) are generated on the fly and are better suited for interpreting the nuanced intent of typical search queries
  26. 26. Open Sourced!
  27. 27. Knowledge Graph
  28. 28. Knowledge Graph
  29. 29. Knowledge Graph: Documents, Docs-Terms Forward Index, and Terms-Docs Inverted Index
      Documents:
        id: 1, job_title: Software Engineer, desc: software engineer at a great company, skills: .Net, C#, java
        id: 2, job_title: Registered Nurse, desc: a registered nurse at hospital doing hard work, skills: oncology, phlebotomy
        id: 3, job_title: Java Developer, desc: a software engineer or a java engineer doing work, skills: java, scala, hibernate
      Docs-Terms Forward Index:
        field | doc | terms
        desc | 1 | a, at, company, engineer, great, software
        desc | 2 | a, at, doing, hard, hospital, nurse, registered, work
        desc | 3 | a, doing, engineer, java, or, software, work
        job_title | 1 | Software Engineer
        … | … | …
      Terms-Docs Inverted Index:
        field | term | postings list (doc: pos)
        desc | a | 1: 4; 2: 1; 3: 1, 5
        desc | at | 1: 3; 2: 4
        desc | company | 1: 6
        desc | doing | 2: 6; 3: 8
        desc | engineer | 1: 2; 3: 3, 7
        desc | great | 1: 5
        desc | hard | 2: 7
        desc | hospital | 2: 5
        desc | java | 3: 6
        desc | nurse | 2: 3
        desc | or | 3: 4
        desc | registered | 2: 2
        desc | software | 1: 1; 3: 2
        desc | work | 2: 10; 3: 9
        job_title | java developer | 3: 1
        … | … | …
      Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
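As a rough illustration of the two structures on this slide, the following Python sketch builds a forward index (document to terms) and a positional inverted index (term to documents and positions) for the `desc` field. Whitespace tokenization and 1-based positions are assumptions chosen to mirror the slide; `build_indexes` is a hypothetical helper name.

```python
from collections import defaultdict

documents = {
    1: {"desc": "software engineer at a great company"},
    2: {"desc": "a registered nurse at hospital doing hard work"},
    3: {"desc": "a software engineer or a java engineer doing work"},
}


def build_indexes(documents, field):
    """Return (forward, inverted) indexes for one field.

    forward:  doc_id -> sorted unique terms
    inverted: term -> {doc_id: [1-based positions]}
    """
    forward = {}
    inverted = defaultdict(lambda: defaultdict(list))
    for doc_id, doc in documents.items():
        tokens = doc[field].lower().split()
        forward[doc_id] = sorted(set(tokens))
        for position, term in enumerate(tokens, start=1):
            inverted[term][doc_id].append(position)
    return forward, {term: dict(posts) for term, posts in inverted.items()}


forward, inverted = build_indexes(documents, "desc")
```

The inverted index answers "which documents contain this term?", and the forward index answers "which terms does this document contain?"; having both is what allows traversal in either direction.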
  30. 30. Knowledge Graph: How the Graph Traversal Works. [Figure with three panels (Set-theory View, Graph View, Data Structure View) relating the nodes skill: Java, skill: Scala, skill: Hibernate, and skill: Oncology to doc 1 through doc 6. Data Structure View labels: Java, Scala, Hibernate, docs 1, 2, 6, docs 3, 4, Oncology, doc 5.] Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
  31. 31. Knowledge Graph: Multi-level Traversal. [Figure with two panels (Data Structure View, Graph View). Skill nodes (skill: Java, skill: Scala, skill: Hibernate, skill: Oncology) and job_title nodes (Software Engineer, Data Scientist, Java Developer, …) are connected through doc 1 through doc 6 by alternating Inverted Index Lookup and Forward Index Lookup steps. In the Graph View, Java, Hibernate, and Scala connect to Java Developer, Software Engineer, and Data Scientist through has_related_job_title edges.] Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
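One hop of this traversal can be sketched as an inverted index lookup followed by a forward index lookup. The code below is a toy illustration using the three job documents from the earlier slide, not the open-sourced implementation; `build_inverted` and `traverse` are hypothetical names, and the raw document dictionary stands in for the forward index.

```python
from collections import Counter, defaultdict

docs = {
    1: {"job_title": ["software engineer"], "skills": [".net", "c#", "java"]},
    2: {"job_title": ["registered nurse"], "skills": ["oncology", "phlebotomy"]},
    3: {"job_title": ["java developer"], "skills": ["java", "scala", "hibernate"]},
}


def build_inverted(docs):
    """Map (field, value) -> set of doc ids containing that value."""
    inverted = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, values in fields.items():
            for value in values:
                inverted[(field, value)].add(doc_id)
    return inverted


def traverse(docs, inverted, start_field, start_value, target_field):
    """Hop from one node to related nodes in another field.

    Step 1 (inverted index lookup): node -> matching documents.
    Step 2 (forward index lookup): documents -> values of target_field.
    """
    doc_ids = inverted.get((start_field, start_value), set())
    related = Counter()
    for doc_id in doc_ids:
        related.update(docs[doc_id][target_field])
    return related


inverted = build_inverted(docs)
titles = traverse(docs, inverted, "skills", "java", "job_title")
skills = traverse(docs, inverted, "job_title", "java developer", "skills")
```

Chaining calls to `traverse` gives a multi-level traversal: the output nodes of one hop become the starting nodes of the next.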
  32. 32. Knowledge Graph: Scoring of Node Relationships (Edge Weights). Foreground vs. Background Analysis: every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context.

      $$z = \frac{\mathrm{countFG}(x) - \mathrm{totalDocsFG} \cdot \mathrm{probBG}(x)}{\sqrt{\mathrm{totalDocsFG} \cdot \mathrm{probBG}(x) \cdot \left(1 - \mathrm{probBG}(x)\right)}}$$

      Foreground Query: "Hadoop"

      { "type": "keywords", "values": [
          { "value": "hive", "relatedness": 0.9773, "popularity": 369 },
          { "value": "java", "relatedness": 0.9236, "popularity": 15653 },
          { "value": ".net", "relatedness": 0.5294, "popularity": 17683 },
          { "value": "bee", "relatedness": 0.0, "popularity": 0 },
          { "value": "teacher", "relatedness": -0.2380, "popularity": 9923 },
          { "value": "registered nurse", "relatedness": -0.3802, "popularity": 27089 }
      ] }

      We are essentially boosting terms which are more related to some known feature (and ignoring terms which are equally likely to appear in the background corpus).
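The z-score on this slide is straightforward to compute once the four counts are known. The sketch below returns the raw z value; the relatedness values in the slide's JSON fall between -1 and 1, which suggests an additional normalization step that the slide does not specify, so none is attempted here. The parameter names are illustrative.

```python
import math


def z_score(count_fg, total_docs_fg, count_bg, total_docs_bg):
    """Score how over- or under-represented a term is in the foreground.

    count_fg:      foreground docs containing the term
    total_docs_fg: docs matching the foreground query
    count_bg:      background docs containing the term
    total_docs_bg: docs in the background corpus
    """
    prob_bg = count_bg / total_docs_bg
    expected = total_docs_fg * prob_bg
    std_dev = math.sqrt(total_docs_fg * prob_bg * (1 - prob_bg))
    if std_dev == 0:
        # Term appears in none or all of the background docs.
        return 0.0
    return (count_fg - expected) / std_dev


# Hypothetical counts: a term in 10% of the background corpus.
over = z_score(50, 100, 1000, 10000)    # appears far more often than expected
neutral = z_score(10, 100, 1000, 10000) # appears exactly as often as expected
under = z_score(1, 100, 1000, 10000)    # appears less often than expected
```

A positive score means the term is more common in the foreground than chance would predict, a score near zero means it is no more informative than the background, and a negative score means it is less likely to appear with the foreground query.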
  33. 33. Knowledge Graph: Multi-level Graph Traversal with Scores. [Figure: a scored traversal across four levels: Starting Node (software engineer*, a materialized node), Skill Nodes (has_related_skill), Skill Nodes (has_related_skill), and Job Title Nodes (has_related_job_title). Nodes shown: Java, C#, .NET, Hibernate, Scala, VB.NET, .NET Developer, Java Developer, Software Engineer, Data Scientist. Edge weights range from 0.34 to 0.93.] Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith. “The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016.
  34. 34. Related term vector (for query concept expansion): http://localhost:8983/solr/stack-exchange-health/skg
  35. 35. Who’s in Love with Jean Grey?
  36. 36. Assertion 2 (Summary): That graph is very rich, but is a compression of meaning into a lossy format. Much of data science is essentially the decompression from this lossy format into a reconstituted form.
  37. 37. Assertion 3: Every instance of a word or phrase you ever encounter has a unique meaning.
  38. 38. Thought Exercise: What do you think of when I say the word “driver”?
  39. 39. Ambiguity. Example related keywords (representing multiple meanings):
      driver: truck driver, linux, windows, courier, embedded, cdl, delivery
      architect: autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer
      …
      Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
  40. 40. Use Case: Query Disambiguation. Example related keywords (representing multiple meanings):
      driver: truck driver, linux, windows, courier, embedded, cdl, delivery
      architect: autocad drafter, designer, enterprise architect, java architect, designer, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer
      …
      Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
  41. 41. Disambiguated meanings (represented as term vectors). Example related keywords (disambiguated meanings):
      architect
        1: enterprise architect, java architect, data architect, oracle, java, .net
        2: architectural designer, architectural drafter, autocad, autocad drafter, designer, drafter, cad, engineer
      driver
        1: linux, windows, embedded
        2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
      designer
        1: design, print, animation, artist, illustrator, creative, graphic artist, graphic, photoshop, video
        2: graphic, web designer, design, web design, graphic design, graphic designer
        3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe, structural designer, revit
      …
      Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
  42. 42. Using the disambiguated meanings. In a situation where a user searches for an ambiguous phrase, what information can we use to pick the correct underlying meaning?
      1. Any pre-existing knowledge about the user: • User is a software engineer • User has previously run searches for “c++” and “linux”
      2. Context within the query: user searched for windows AND driver vs. courier OR driver
      3. If all else fails (and there is no context), use the most commonly occurring meaning.
      driver
        1: linux, windows, embedded
        2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
      Source: M. Korayem, C. Ortiz, K. AlJadda, T. Grainger. "Query Sense Disambiguation Leveraging Large Scale User Behavioral Data". IEEE Big Data 2015.
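The three-step selection described on this slide can be sketched as a simple overlap test between each sense's term vector and whatever context is available. This is an illustrative simplification, not the method from the cited paper: `pick_sense` is a hypothetical function, overlap counting is an assumed scoring rule, and the slide does not say which sense of "driver" is most common, so the fallback is left as a parameter.

```python
# Sense vectors for "driver", as listed on the slide.
SENSES = {
    "driver": [
        {"linux", "windows", "embedded"},
        {"truck driver", "cdl driver", "delivery driver",
         "class b driver", "cdl", "courier"},
    ],
}


def pick_sense(term, context_terms, default_index=0):
    """Return the index of the sense sharing the most terms with the context.

    context_terms may combine user history and other query terms.
    Falls back to default_index (assumed most common sense) when
    nothing in the context overlaps with any sense.
    """
    context = {c.lower() for c in context_terms}
    overlaps = [len(sense & context) for sense in SENSES[term]]
    best = max(overlaps)
    if best == 0:
        return default_index
    return overlaps.index(best)


from_history = pick_sense("driver", ["c++", "linux"])   # prior searches
from_query = pick_sense("driver", ["courier"])          # other query terms
fallback = pick_sense("driver", [], default_index=1)    # no context at all
```

Sense indexes here are zero-based, so index 0 corresponds to meaning 1 on the slide (linux, windows, embedded) and index 1 to meaning 2 (truck driver, cdl, courier).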
  43. 43. Thought Exercise: What do you think of when I say the word “Apple”?
  44. 44. Every term or phrase is a context-dependent cluster of meaning with an ambiguous label.
  45. 45. Every term or phrase is a context-dependent cluster of meaning with an ambiguous label.
  46. 46. What does “love” mean? http://localhost:8983/solr/thesaurus/skg
  47. 47. What does “love” mean in the context of “hug”? http://localhost:8983/solr/thesaurus/skg
  48. 48. What does “love” mean in the context of “child”? http://localhost:8983/solr/thesaurus/skg
  49. 49. My Three Assertions (Recap) 1) Unstructured data is actually “hyper-structured” data. It is a graph that contains much more structure than typical “structured data.” 2) That graph is very rich, but is a compression of meaning into a lossy format. Much of data science is essentially the decompression from this lossy format into a reconstituted form. 3) Most Important: Every instance of a word or phrase you ever encounter has a unique meaning.
  50. 50. Why do we care?
      User’s Query: machine learning research and development Portland, OR software engineer AND hadoop, java
      Traditional Query Parsing: (machine AND learning AND research AND development AND portland) OR (software AND engineer AND hadoop AND java)
      Semantic Query Parsing: "machine learning" AND "research and development" AND "Portland, OR" AND "software engineer" AND hadoop AND java
      Semantically Expanded Query: ("machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence") AND ("research and development"^10 OR "r&d") AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo}) AND ("software engineer"^10 OR "software developer") AND (hadoop^10 OR "big data" OR hbase OR hive) AND (java^10 OR j2ee)
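The final step, turning parsed concepts into a semantically expanded query, can be sketched as string assembly over a table of related terms. The expansions below are copied from the slide; in practice they would come from a knowledge graph lookup. The `expand` and `quote` helpers are hypothetical, and the location clause with its geo filter is left out for brevity.

```python
# Related terms per concept, as shown in the slide's expanded query.
EXPANSIONS = {
    "machine learning": ["data scientist", "data mining", "artificial intelligence"],
    "research and development": ["r&d"],
    "software engineer": ["software developer"],
    "hadoop": ["big data", "hbase", "hive"],
    "java": ["j2ee"],
}


def quote(term):
    """Quote anything that is not a single alphanumeric token."""
    return term if term.isalnum() else f'"{term}"'


def expand(concepts, boost=10):
    """Build a boolean query: boosted original concept OR related terms."""
    clauses = []
    for concept in concepts:
        alternatives = [f"{quote(concept)}^{boost}"]
        alternatives += [quote(alt) for alt in EXPANSIONS.get(concept, [])]
        clauses.append("(" + " OR ".join(alternatives) + ")")
    return " AND ".join(clauses)


query = expand(["software engineer", "hadoop", "java"])
```

Boosting the original concept keeps exact matches ranked above documents that only match a related term, while the OR alternatives widen recall.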
  51. 51. Contact Info: Trey Grainger, trey.grainger@lucidworks.com, @treygrainger. http://solrinaction.com (Discount code: ctwsdsc18). Other presentations: http://www.treygrainger.com