Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs

  1. Combining Explicit and Latent Web Semantics for Maintaining Knowledge Graphs. Paul Groth (@pgroth), Elsevier Labs. BigNet, WWW 2018. Thanks to Ron Daniel, Brad Allen & the Labs Team.
  2. Outline. Goal: to tell you our current thinking and to get your feedback • Why we’re interested • What we’ve tried • What we’re missing • Webby Data • 2 Sources of Semantics • State of the art • What’s missing. Warning: the back half is a probably incomplete literature review, so think of it as a set of pointers.
  3. Knowledge Graphs
  4. [Image-only slide]
  5. EMMeT (Elsevier Merged Medical Taxonomy). EMMeT is a multilingual, concept-based clinical ontology • Multilingual: English, French, Spanish • Concept-based: all terms, synonyms, translations, and mappings are related to a unique identifier (“IMUI”) • Ontology: provides semantic relationships between concepts (symptoms of a disease, treatment procedures of a disease, complications of a disease or a procedure, etc.). EMMeT is a controlled reference terminology • Based on the Unified Medical Language System (UMLS), standard clinical terminologies, as well as Elsevier proprietary vocabularies and lists of acronyms • Explicitly mapped to international medical standards (SNOMED-CT, ICD-9-CM, ICD-10-CM, LOINC, RxNorm, CVX, etc.) and Elsevier’s vocabularies (Gold Standard, EMTREE, etc.). EMMeT is current • Continuously updated, and released every 12 weeks for automatic indexing • Updated daily and available via an API for manual tagging access • Maintained by a team of medical terminology experts,
  6. Automated Tagging Manual Tagging/ Data Structuring Products and platforms using EMMeT Clinical Solutions ClinicalKey Global ClinicalKey ANZ ClinicalKey France ClinicalKey Espanol ClinicalKey Nursing ClinicalKey German ClinicalKey Nursing ANZ ClinicalKey Brazil Amirsys Decision Point RP/STMJ Health Advance The Lancet Cell LexisNexis MedMal Nav LN Insight Legend In production In Pilot In Pipeline Nursing Education Mosby’s Dictionary Clinical Solutions PoC - Clinical Overviews ClinicalKey HL7 API Health Analytics IDS FHIR API/Apps Dorland’s Dictionary Patient Engagement Gold Standard CP ERC Content 2.0 Nursing Education Sherpath EMEALAAP MedEnact RP/STMJ (SCT) Health Advance The Lancet Cell
  7. EMMeT Clinical Knowledge Graph
  8. Rankings of EMMeT’s ontological relationships. Relationships are ranked according to a 5-tiered ranking model, for simplicity and accessibility:
     • 10: best option
     • 9: second option, used when the rank of 10 is not applicable
     • 8: given two concepts that are too general to be directly related to a specific disease
     • 7: used as an outlier
     • 6: default / non-validated

     Relationship ranking criteria:

     | Relationship | 10 | 9 | 8 | 7 |
     |---|---|---|---|---|
     | has cause | most common | common | sometimes | rare |
     | has clinical finding | most common | common | sometimes | rare |
     | has_complication severity (disease) | severe/death | high | moderate | low morbidity |
     | has_complication prevalence (disease) | strong occurrence/high prevalence | likely occurrence/commonly prevalent | sometimes occurs | rare occurrence |
     | has_complication severity (procedure) | critical/death | major | moderate | minor |
     | has comorbidity | strongly associated | commonly associated | sometimes associated | rarely associated |
     | has screening procedure | best choice | is done | sometimes done | rarely done |
     | has risk factor | strongly associated | commonly associated | sometimes associated | rarely associated |
     | has diagnostic procedure | best choice | commonly done | sometimes done | rarely done |
     | has differential diagnosis | strong occurrence/high prevalence | likely occurrence/commonly prevalent | sometimes occurs/low prevalence | rare occurrence |
     | has drug | best choice | 2nd line | 3rd line | rarely given |
     | has contraindication drug | strongly avoid/black box | commonly avoid | sometimes avoid | rarely avoid |
     | has treatment procedure | best choice | commonly done | sometimes done | rarely done |
     | has prevention | best option | common option | sometimes advised | rarely advised |
     | has physician specialty | specific specialty | general/specialty | broad | rare |
     | has device | standard device | acceptable device | sometimes used | rarely used |
  9. From EMMeT to H-Graph • Based on EMMeT • Support more complex relations including patient context (Clinical Overview content + more); age, sex, weight, … are very simple context • Flexible and extensible model to support links to content, model treatment strategies, numeric values, temporal data, etc. • Continue to support existing indexing pipelines (e.g. ClinicalKey) and tagging use cases (e.g. Clinical Overviews). Example of patient context (NICE Guideline, Atrial Fibrillation: Management): “In people with atrial fibrillation presenting acutely without life-threatening haemodynamic instability, offer rate or rhythm control if the onset of the arrhythmia is less than 48 hours, and start rate control if it is more than 48 hours or is uncertain.”
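To make "relations including patient context" concrete, here is a minimal sketch of how the quoted NICE sentence could be held as a context-dependent recommendation instead of a flat triple. The class and function names (`PatientContext`, `af_recommendation`) are hypothetical and are not part of EMMeT or H-Graph; the sketch only encodes the conditions stated in the quoted sentence and is a data-modelling illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PatientContext:
    """Hypothetical container for the context the quoted guideline sentence depends on."""
    acute_presentation: bool
    life_threatening_instability: bool
    onset_hours: Optional[float]  # None means the onset time is uncertain


def af_recommendation(ctx: PatientContext) -> Optional[str]:
    """Return the option named in the quoted sentence, or None when it does not apply."""
    if not ctx.acute_presentation or ctx.life_threatening_instability:
        return None  # outside the scope of the quoted sentence
    if ctx.onset_hours is not None and ctx.onset_hours < 48:
        return "rate or rhythm control"
    # More than 48 hours or uncertain. The sentence does not cover exactly 48 hours;
    # this sketch assumes it falls into this branch.
    return "rate control"


recent = PatientContext(acute_presentation=True, life_threatening_instability=False, onset_hours=12)
unknown = PatientContext(acute_presentation=True, life_threatening_instability=False, onset_hours=None)
print(af_recommendation(recent))
print(af_recommendation(unknown))
```

The point of the sketch is that the relation between atrial fibrillation and a treatment strategy carries conditions (timing, haemodynamic state) that a plain (subject, relation, object) triple cannot express.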
  10. Universal Schemas
  11. Universal schemas • … are a specific technique from the Information Extraction and the Automatic Knowledge Base Completion literature • … are an unsupervised method to ‘learn’ by combining text extracts with existing knowledge base assertions • Applications: (a) Extend a medical knowledge base: scan incoming literature to suggest new additions to EMMeT and show the underlying evidence to the taxonomy editor; scan the literature backlog to find evidence for data already in EMMeT. (b) Literature surveillance: scan incoming literature to find existing facts even if expressed in very different ways; find new concepts in the literature related to an existing EMMeT concept, and let the taxonomy editor decide whether to add the new concept and relation to EMMeT.
  12. Open Information Extraction • Knowledge bases are populated by scanning text and doing Information Extraction • Most information extraction systems are looking for very specific things, like drug-drug interactions • Best accuracy for that one kind of data, but misses out on all the other concepts and relations in the text • For a broad knowledge base, use Open Information Extraction, which only uses some knowledge of grammar • One weird trick for open information extraction … • ReVerb (Fader et al., “Identifying Relations for Open Information Extraction”): 1. Find “relation phrases” starting with a verb and ending with a verb or preposition 2. Find noun phrases before and after the relation phrase 3. Discard relation phrases not used with multiple combinations of arguments. Example sentence: “In addition, brain scans were performed to exclude other causes of dementia.”
  13. ReVerb output • After ReVerb pulls out noun phrases, match them up to EMMeT concepts • Discard rare concepts, rare relations, and relations that are not used with many different concepts • SD documents scanned: 14,000,000 • Extracted ReVerb triples: 473,350,566
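The three ReVerb steps can be sketched over part-of-speech tags. This is a simplified approximation and not the ReVerb implementation: the tag classes, the regular expression, and the noun-phrase rule are assumptions made for illustration, the sentences are hand-tagged so that no tagger is needed, and the two "is used to treat" sentences are invented to show step 3.

```python
import re
from collections import defaultdict

# Hand-tagged (token, Penn Treebank POS tag) sentences; a real pipeline would run a tagger.
CORPUS = [
    [("brain", "NN"), ("scans", "NNS"), ("were", "VBD"), ("performed", "VBN"), ("to", "TO"),
     ("exclude", "VB"), ("other", "JJ"), ("causes", "NNS"), ("of", "IN"), ("dementia", "NN")],
    [("aspirin", "NN"), ("is", "VBZ"), ("used", "VBN"), ("to", "TO"), ("treat", "VB"),
     ("fever", "NN")],
    [("metformin", "NN"), ("is", "VBZ"), ("used", "VBN"), ("to", "TO"), ("treat", "VB"),
     ("diabetes", "NN")],
]

# Step 1 pattern: starts with a verb, ends with a verb or preposition (assumed simplification).
RELATION = re.compile(r"V[VWP]*[VP]|V")


def tag_class(tag):
    """Collapse a POS tag to one letter: V verb, P preposition/particle, W filler, O other."""
    if tag.startswith("VB"):
        return "V"
    if tag in ("IN", "TO", "RP"):
        return "P"
    if tag.startswith(("NN", "JJ", "RB", "DT")):
        return "W"
    return "O"


def is_np_tag(tag):
    """Very rough noun-phrase membership test: determiners, adjectives, nouns."""
    return tag.startswith(("NN", "JJ", "DT"))


def words(tagged, start, end):
    return " ".join(token for token, _ in tagged[start:end])


def extract(tagged):
    """Steps 1 and 2: find relation phrases and the noun phrases on either side."""
    classes = "".join(tag_class(tag) for _, tag in tagged)
    triples = []
    for match in RELATION.finditer(classes):
        start, end = match.span()
        left, right = start, end
        while left > 0 and is_np_tag(tagged[left - 1][1]):
            left -= 1
        while right < len(tagged) and is_np_tag(tagged[right][1]):
            right += 1
        if left < start and right > end:  # both arguments must be present
            triples.append((words(tagged, left, start), words(tagged, start, end),
                            words(tagged, end, right)))
    return triples


def filter_relations(triples, min_pairs=2):
    """Step 3: keep relation phrases seen with at least `min_pairs` distinct argument pairs."""
    pairs = defaultdict(set)
    for subj, rel, obj in triples:
        pairs[rel].add((subj, obj))
    return [t for t in triples if len(pairs[t[1]]) >= min_pairs]


all_triples = [t for sentence in CORPUS for t in extract(sentence)]
print(all_triples)
print(filter_relations(all_triples))
```

On this toy corpus the slide's example sentence yields one long, very specific relation phrase, which step 3 then discards because it occurs with only one argument pair; the shorter "is used to treat" survives because it occurs with two.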
  14. Universal schemas - Initialization • Method to combine ‘facts’ found by machine reading with stronger assertions from the ontology. • Build an E×R matrix with entity pairs as rows and relations as columns. • Relation columns can come from EMMeT or from ReVerb extractions. • Cells contain 1.0 if that pair of entities is connected by that relation.
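The initialization step can be sketched in a few lines. The triples below are illustrative only: the `has_treatment` assertion echoes the example in the editor's notes, while the surface-form relation and the second entity pair are made up, and `build_matrix` is a hypothetical helper name.

```python
def build_matrix(triples):
    """Build the E x R matrix: rows are entity pairs, columns are relations."""
    pairs = sorted({(subj, obj) for subj, _, obj in triples})
    relations = sorted({rel for _, rel, _ in triples})
    row_of = {pair: i for i, pair in enumerate(pairs)}
    col_of = {rel: j for j, rel in enumerate(relations)}
    matrix = [[0.0] * len(relations) for _ in pairs]
    for subj, rel, obj in triples:
        matrix[row_of[(subj, obj)]][col_of[rel]] = 1.0  # pair is connected by this relation
    return pairs, relations, matrix


# Ontology-style assertions and surface-form (text) relations share one matrix.
triples = [
    ("breast cancer", "has_treatment", "radical mastectomy"),    # structured relation
    ("breast cancer", "is treated with", "radical mastectomy"),  # surface form (made up)
    ("breast cancer", "is diagnosed by", "breast biopsy"),       # surface form (made up)
]
pairs, relations, matrix = build_matrix(triples)
for pair, row in zip(pairs, matrix):
    print(pair, row)
```

Keeping ontology relations and text relations as columns of the same matrix is what lets the later factorization learn correlations between them.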
  15. Universal schemas - Prediction • Factorize the matrix into E×K and K×R, then recombine. • “Learns” the correlations between text relations and EMMeT relations, in the context of pairs of objects. • Find new triples to go into EMMeT, e.g. (glaucoma, has_alternativeProcedure, biofeedback)
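A minimal sketch of "factorize, then recombine", assuming K = 1 and using a plain rank-1 SVD computed by power iteration. This is a stand-in chosen to keep the example dependency-free; it is not the training objective used in the universal schema literature or in the system described here. The entity pairs and relations are placeholders.

```python
import math

RELATIONS = ["is treated with (text)", "has_treatment (KB)", "is caused by (text)"]
PAIRS = ["pair 0", "pair 1", "pair 2", "pair 3"]
# Pair 2 has the text relation but no KB assertion yet; pair 3 only has an unrelated relation.
MATRIX = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]


def rank1_factorize(matrix, iterations=100):
    """Return (a, v) with matrix ~ a v^T: a is E x 1, v is 1 x R (top singular direction)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0] * n_cols
    for _ in range(iterations):
        a = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        v = [sum(matrix[i][j] * a[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    a = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    return a, v


def recombine(a, v):
    """Multiply the factors back together to score every cell, observed or not."""
    return [[a_i * v_j for v_j in v] for a_i in a]


a, v = rank1_factorize(MATRIX)
scores = recombine(a, v)
for pair, row in zip(PAIRS, scores):
    print(pair, [round(x, 3) for x in row])
```

Because the text relation and the KB relation co-occur on pairs 0 and 1, the recombined matrix gives pair 2 a clearly positive score in the `has_treatment` column even though that cell was 0 in the input, while pair 3 stays at essentially zero. Cells like the one for pair 2 are the candidate triples a taxonomy editor would review.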
  16. ONTOLOGY MAINTENANCE • Pretty good: F-measure around 0.7 • Good enough with a human in the loop • But we want more!
     Pipeline diagram labels: Content; Open Information Extraction; Triple Extraction; Entity Resolution; Concept Resolution; Taxonomy; Surface form relations; Structured relations; Matrix Construction; Universal schema; Matrix Factorization; Factorization model; Matrix Completion; Predicted relations; Curation; Knowledge graph.
     Figures shown on the diagram: 14M SD articles; 475M triples; 3.3 million relations; 49M relations; ~15k -> 1M entries.
     Paul Groth, Sujit Pal, Darin McBeath, Brad Allen, Ron Daniel. “Applying Universal Schemas for Domain Specific Ontology Expansion.” 5th Workshop on Automated Knowledge Base Construction (AKBC), 2016.
     Michael Lauruhn and Paul Groth. “Sources of Change for Modern Knowledge Organization Systems.” Knowledge Organization 43, no. 8 (2016).
  17. WHERE TO GO? Paulheim, Heiko. "Knowledge graph refinement: A survey of approaches and evaluation methods." Semantic Web 8.3 (2017): 489-508.
  18. MORE THAN LINK PREDICTION • Data has deep hierarchy – link prediction flattens this • Data has hooks into specific content • Schemas are increasingly richly defined – not just a single type • N-ary relations
  19. OUR KGS SHARE PROPERTIES WITH WEB KGS. Ringler, Daniel, and Heiko Paulheim. "One knowledge graph to rule them all? Analyzing the differences between DBpedia, YAGO, Wikidata & co." Joint German/Austrian Conference on Artificial Intelligence (Künstliche Intelligenz). Springer, Cham, 2017.
  20. The Web of Data 2017-12/stats/stats.html
  21. Two sources of semantics: 1. Dereferenceability 2. Rules
  22. Dereferenceability: looking definitions up – natural language and programmatic
  24. Pay attention to the underlying data. Paul Groth, Michael Lauruhn, Antony Scerri: “Open Information Extraction on Scientific Text: An Evaluation”, 2018; arXiv:1802.05574.
  25. Embed more. Gupta, N., Singh, S., & Roth, D. (2017). Entity linking via joint encoding of types, descriptions, and context. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 2681-2690).
  26. Embed more. Both, Fabian, Steffen Thoma, and Achim Rettinger. "Cross-modal Knowledge Transfer: Improving the Word Embedding of Apple by Looking at Oranges." Proceedings of the Knowledge Capture Conference. ACM, 2017.
  27. Social Semantics? • Distributional semantics for identifiers (NTN) • But uses the global network • Could we use the discussion space as well? de Rooij, S., Beek, W., Bloem, P., van Harmelen, F., & Schlobach, S. (2016, October). Are Names Meaningful? Quantifying Social Meaning on the Semantic Web. In International Semantic Web Conference (pp. 184-199). Springer, Cham. NTN: Socher, R., Chen, D., Manning, C. D., & Ng, A. (2013). Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems (pp. 926-934).
  28. Rules

      schema:dateModified a rdf:Property ;
          rdfs:label "dateModified" ;
          schema:domainIncludes schema:CreativeWork, schema:DataFeedItem ;
          schema:rangeIncludes schema:Date, schema:DateTime ;
          rdfs:comment "The date on which the CreativeWork was most recently modified or when the item's entry was modified within a DataFeed." .

      schema:datePublished a rdf:Property ;
          rdfs:label "datePublished" ;
          schema:domainIncludes schema:CreativeWork ;
          schema:rangeIncludes schema:Date ;
          rdfs:comment "Date of first broadcast/publication." .

      schema:disambiguatingDescription a rdf:Property ;
          rdfs:label "disambiguatingDescription" ;
          schema:domainIncludes schema:Thing ;
          schema:rangeIncludes schema:Text ;
          rdfs:comment "A sub property of description. A short description of the item used to disambiguate from other, similar items. Information from other properties (in particular, name) may be necessary for the description to be useful for disambiguation." ;
          rdfs:subPropertyOf schema:description .
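One way such definitions act as rules is as soft constraints on candidate triples. The sketch below hand-copies the `domainIncludes` and `rangeIncludes` values from the three properties on the slide into a dictionary and checks a statement against them. The tiny type hierarchy (`PARENT`) and the function names are assumptions added for illustration; a real check would load the full vocabulary, and schema.org treats these declarations as guidance rather than strict RDFS domain and range.

```python
# Property rules copied by hand from the slide's Turtle snippet.
RULES = {
    "dateModified": {"domain": {"CreativeWork", "DataFeedItem"}, "range": {"Date", "DateTime"}},
    "datePublished": {"domain": {"CreativeWork"}, "range": {"Date"}},
    "disambiguatingDescription": {"domain": {"Thing"}, "range": {"Text"}},
}

# Illustrative fragment of the type hierarchy (child -> parent); an assumption of this sketch.
PARENT = {
    "Article": "CreativeWork",
    "CreativeWork": "Thing",
    "DataFeedItem": "Intangible",
    "Intangible": "Thing",
}


def with_ancestors(type_name):
    """Return the type together with all of its ancestors in PARENT."""
    chain = [type_name]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return set(chain)


def violations(subject_type, prop, value_type):
    """List the ways a (subject type, property, value type) statement departs from the rules."""
    rule = RULES.get(prop)
    if rule is None:
        return [f"unknown property: {prop}"]
    problems = []
    if not rule["domain"] & with_ancestors(subject_type):
        problems.append(f"{subject_type} is not in the domain of {prop}")
    if value_type not in rule["range"]:
        problems.append(f"{value_type} is not in the range of {prop}")
    return problems


print(violations("Article", "datePublished", "Date"))
print(violations("DataFeedItem", "datePublished", "DateTime"))
```

A predicted triple that produces a non-empty list could be down-weighted or routed to a curator rather than rejected outright.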
  29. Injecting Background Knowledge as Constraints. Rocktäschel, T., Singh, S., & Riedel, S. (2015). Injecting logical background knowledge into embeddings for relation extraction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1119-1129).
  30. Learning Rules. Yang, Fan, Zhilin Yang, and William W. Cohen. "Differentiable learning of logical rules for knowledge base reasoning." Advances in Neural Information Processing Systems. 2017.
  31. Combining Both – supporting complex reasoning with subsymbolic representations. Rocktäschel, T., & Riedel, S. (2017). End-to-end differentiable proving. In Advances in Neural Information Processing Systems (pp. 3791-3803).
  32. Future • Scale • The knowledge base == text? • Multi-hop reasoning • Is everything end-to-end differentiable? Welbl, J., Stenetorp, P., & Riedel, S. (2017). Constructing Datasets for Multi-hop Reading Comprehension Across Documents. arXiv preprint arXiv:1710.06481.
  33. Conclusion • In practice: data is webby data • Messy • Interconnected • Constraints and rules associated • Semantic Web: semantics can come from multiple different sources • Explicit & implicit • Take advantage of those sources • Knowledge graphs benefit from inference • Your thoughts? • Thanks & We’re hiring!
  34. Backup
  35. INTEGRATION OF LARGE NUMBERS OF DATA SOURCES • 10 different extractors • E.g. mapping-based infobox extractor • Infobox uses a hand-built ontology based on the 350 most commonly used English language infoboxes • Integrates with Yago • Yago relies on Wikipedia + Wordnet • Upper ontology from Wordnet and then a mapping to Wikipedia categories based on frequencies • Wordnet is built by psycholinguists. Groth, Paul, "The Knowledge-Remixing Bottleneck," IEEE Intelligent Systems, vol. 28, no. 5, pp. 44-48, Sept.-Oct. 2013. doi: 10.1109/MIS.2013.138
  36. Units & Measurement Annotations. Find numbers followed by a unit name or abbreviation (perhaps with a scale factor like k, m, G, …). Provide the value normalized to SI units. Also provide the type of measurement (time, temperature, length, mass, dosage, etc.) based on the unit. Handling tolerances, ranges, probabilities, and counts adds complexity. Conjunctions are not yet handled but are very important. Current work: identify the property being measured (e.g. dosages of AA, indomethacin, HtE, leptin, etc.). Annotation legend: Time • Dosage • Probability • Percent • Count • Not handled yet. Example passages: Additionally at 120 min following glucose administration, the 100 mg/kg 5g and 5e groups had significantly (P ⩽ 0.005) a greater drop in blood glucose than the 10 and 50 mg/kg groups. In the mouse xenograft model of LLC cells in C57BL/6J mice, once daily administration of AA (50 and 100 mg/kg) inhibited tumor growth in a dose-dependent manner (Fig. 6A and C). Groups of Swiss mice (n = 6) were treated (p.o.) with vehicle, indomethacin (10 mg/kg-Roche®) or HtE (50, 100 or 200 mg/kg) 1 h before administration of carrageenan at 2.5% (Sigma-Aldrich®) injected subcutaneously into the plantar region of the left hind paw and phosphate buffer saline (PBS) in right hind paw. In the experiments designed to study the antidepressant-like effect of the repeated treatment (for 14 days) of EET, the immobility time in the TST and the locomotor activity in the open-field were assessed in independent groups of mice 24 h after the last daily administration of EET (10–100 mg/kg, p.o.). Hoppers containing chow were removed from the cages 1 h before the administration of leptin [depending on studies, 5 mg/kg or 2.5 mg/kg, ip; mouse recombinant leptin obtained from Dr. A.F.

Editor's Notes

  1. 100+ years of expert knowledge
  2. On the left side we see one concept, breast cancer, and a number of pieces of information about it such as synonyms, parent and child concepts, etc. On the right we see some ontological relations from breast cancer to other concepts, such as (breast cancer, has diagnostic procedure, breast biopsy). One of the major differences between EMMeT and what is in UMLS is that we not only provide the basic 3-part relationship, such as (breast cancer, has_treatment, radical mastectomy), we also provide information about the ‘strength’ of that relation according to current medical evidence.
  3. Excerpt from National Institute for Health and Care Excellence (NICE): In people with atrial fibrillation presenting acutely without life-threatening haemodynamic instability, offer rate or rhythm control if the onset of the arrhythmia is less than 48 hours, and start rate control if it is more than 48 hours or is uncertain.
  4. Using EMMeT, and some code and data we already had, he built a quick prototype and tested it. Performance (in terms of accuracy of predictions) was surprisingly high. Being unsupervised is very important because it means the construction of the rough underlying knowledge base is scalable and not limited by the availability of experts. Raw predictions are not good enough for fully automatic operation, but are plenty good enough to help taxonomy editors and other people do their jobs much faster.
  5. Complex axioms. Messy. Integrates lots of information.
  6. Predict entity types
  7. Concept similarity. Conc, SVD and PCA are combinations.
  8. verify the null hypothesis that names are statistically independent from the two meaning proxies
  9. SRL performance TRL 2 And inductive logic programming
  10. Translate to natural language (sli)
  11. One type of NLP annotation Labs is implementing is to mark up measurements – find the quantity, the unit, any tolerances, etc. We also normalize them to SI standards so measurements can be compared and searched. This is not novel research. However, we have not found prior work that attempts to detect the specific object and property being measured. We are using several domain-specific scenarios (mouse cancer, concrete additives, NLP algorithm accuracy, neuronal properties) to find ways that information is expressed. For mouse cancer, it is relatively easy to detect that a measurement is a dosage of a particular drug. But those patterns are of little use in the other scenarios. This work has application to the h-graph – dosages, ages, weights, etc. are all important properties for the patient context. Cohort size and probability are important for the quality of evidence measures.
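The measurement markup described in this note can be sketched with a regular expression and a unit table. This is a minimal illustration, not the Labs implementation: the unit table, the field names, and the choice of `kg/kg` as the normalized form of `mg/kg` are assumptions, and ranges, tolerances, and conjunctions such as "50 and 100 mg/kg" are deliberately not handled. The test sentence is condensed from the slide's carrageenan example.

```python
import re

# unit -> (measurement type, factor to the SI unit, SI unit); an illustrative subset.
UNITS = {
    "mg/kg": ("dosage", 1e-6, "kg/kg"),
    "min": ("time", 60.0, "s"),
    "h": ("time", 3600.0, "s"),
    "mg": ("mass", 1e-6, "kg"),
    "g": ("mass", 1e-3, "kg"),
    "%": ("percent", 0.01, "1"),
}

# Longest units first so that "mg/kg" is tried before "mg".
_UNIT_ALTERNATION = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))
MEASUREMENT = re.compile(
    r"(?<![\w.])(\d+(?:\.\d+)?)\s*(" + _UNIT_ALTERNATION + r")(?![A-Za-z/])"
)


def find_measurements(text):
    """Return one record per 'number unit' occurrence, with the value normalized to SI."""
    records = []
    for match in MEASUREMENT.finditer(text):
        value, unit = float(match.group(1)), match.group(2)
        kind, factor, si_unit = UNITS[unit]
        records.append({
            "text": match.group(0),
            "value": value,
            "unit": unit,
            "kind": kind,
            "si_value": value * factor,
            "si_unit": si_unit,
        })
    return records


sentence = ("Mice were treated with indomethacin (10 mg/kg) 1 h before "
            "administration of carrageenan at 2.5%.")
for record in find_measurements(sentence):
    print(record)
```

The sketch stops at quantity, unit, and measurement type. Linking each record to the object and property being measured (for example, that 10 mg/kg is a dosage of indomethacin) is the part the note describes as current work.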