
Building a Biomedical Knowledge Garden

Describes the tribulations of building a large biomedical knowledge graph. Provides a comparison between the UMLS and Wikidata in terms of content and structure. Concludes with the idea of anchoring the knowledge graph in Wikidata items and properties.

Published in: Science
  • Someone needs to add a statement showing the relationship between Moron https://www.wikidata.org/wiki/Q2620524 and intellectual disability https://www.wikidata.org/wiki/Q183560

  1. Building a Biomedical Knowledge Garden
     Benjamin Good, Su Laboratory Group Meeting, Dec. 2, 2016
  2. The Knowledge Garden Idea. Circa Jan. 2015. (http://tinyurl.com/jbmn8mz)
     • Unstructured data: PubMed, Clinical Trials, etc.
     • NLP tools: SemRep, DeepDive, Implicitome, etc.
     • Knowledge Graph: SemmedDB, Literome, etc.
     • Applications: Semantic MEDLINE, BioGraph, etc.
     • Microtasks: Mark2Cure, AMT
     • Structured data: Gene Ontology, etc.
  3. The devil is in the details…
     • Unstructured data: PubMed, Clinical Trials, etc.
     • NLP tools: SemRep, DeepDive, Implicitome, etc.
     • Knowledge Graph: SemmedDB, Literome, etc.
     • Application: Semantic MEDLINE, BioGraph, etc.
     • Microtasks: Mark2Cure, AMT
     • Structured data: Gene Ontology, etc.
  4. Reality, November 2016
     • Knowledge Graph: SemmedDB
     • Application: knowledge.bio
     • Microtasks: Mark2Cure, AMT
  5. knowledge.bio
     Explore all biomedical knowledge as a graph with edges connected back to supporting references (v2.5 demo)
  6. knowledge.bio – Data challenges
     • V1 – V2.5
         • All content from SemmedDB or Implicitome
         • Custom schema to support these
     • V3 key requirement: allow import of content from many other sources (Gene Ontology, DeepDive output, user-generated…)
  7. This part is important… Not nailing it down makes everything else harder
     Knowledge Garden content was managed as:
     • CSV files
     • JSON documents
     • MySQL databases
     • Postgres databases
     • Neo4j databases
     None of which had any coherent plan or structure
  8. Requirements for a knowledge graph
     • Syntax: how to refer to nodes and edges
         • Identifiers
         • Schema (structure of the graph)
     • Semantics: what things mean
         • How you decide on the ‘?’ in: node1 ‘?’ node2
         • Are they the same (to you)?
         • If not, what is the edge?
     Mind the Gap… (one node in the “Amino Acid” namespace, the other in the “Biologically Active Substance” namespace)
  9. Options at kb3 scale (millions of concepts and relations)
     • The Unified Medical Language System (UMLS)
     • The Semantic Web
     • Wikidata?
  10. The UMLS (CUIs, Atoms, Types) (https://uts.nlm.nih.gov)
     • CUI C0026106, with Atoms (each “equivalent to” the CUI):
         • HP:0001256 Mild mental retardation, Mild and nonprogressive mental retardation
         • SNOMEDCT_US:86765009 Moron (mental age 8-12 years)
         • MEDCIN:35101 Mild intellectual disabilities
         • OMIM:MTHU035844 Intellectual disability, mild
     • CUI C0233630, with Atom:
         • SNOMEDCT_US:32386009 Logical Thinking
     • Types shown in the diagram: Mental or Behavioral Dysfunction, Disease or Syndrome, Behavior, Activity, Event (edge labels: isa, affects, and an undecided ‘?’)
     • Types are organized into a “Semantic Network”: ~133 types, 54 predicates, 13 high-level ‘groups’
  11. The UMLS in 2016
     • 3,200,922 CUIs
     • 211 source vocabularies (e.g. MeSH, SNOMED, RxNorm, etc.)
     • 12,287,973 total terms (“Atoms”)
     • Every edge in the system is a manual product of NLM
         • every Atom->CUI
         • every CUI->Type
         • every Type->Type
  12. The Semantic Web
     • Concepts uniquely identified by resolvable URIs
     • Meaning (e.g. equivalency) encoded in OWL axioms
     • Concepts and mappings created and maintained by anyone who can host them
     • No other structure
     • No governance
  13. UMLS versus Semantic Web
     • UMLS
         • PROs: covers a large portion of the biomedical concept space, manually curated, we are already using it by default, the semantic types are handy
         • CONs: does not exist on the semantic web (no stable URI to associate with a CUI), license is obscure and apparently limiting, weak representation of the molecular biology domain, no control over its extension (e.g. no Human Disease Ontology)
     • Semantic Web
         • PROs: universal, open, infrastructure is the Web itself
         • CONs: need for organization, curation, mapping
  14. Not thrilled with my options
     (Image: https://commons.wikimedia.org/wiki/File:A_frustrated_and_depressed_man_holds_his_head_in_his_hand.jpg)
  15. Meanwhile...
     • Human, mouse, rat, yeast, macaque, and 120+ microbes: genes and proteins
     • Gene Ontology terms
     • Human Disease Ontology terms
     • 120,000+ chemicals
     • Cancer genome variants
     • Other people adding and using data!!!
  16. Maybe?
  17. Wikidata (QIDs, ids, Types)
     • QID Q183560, with external ids:
         • HP:0001256 Mild mental retardation, Mild and nonprogressive mental retardation
         • SNOMEDCT_US:86765009 Moron (mental age 8-12 years)
         • MEDCIN:35101 Mild intellectual disabilities
         • OMIM:MTHU035844 Intellectual disability, mild
     • QID Q412194 (https://www.wikidata.org/wiki/Q412194), with external id:
         • PubChem: 2477 buspirone
     • Items shown in the diagram: Specific Developmental Disorder, developmental disorder of mental health, mental disorder, disorder, Drug, Chemical (edge labels: subclass of, treated by, isa)
     • Poly-Ontology: the ‘subclass of’ hierarchy for disorders is labeled (DO) ids
  18. ACTIVE! Knowledge Flow for Wikidata
     • Unstructured data: The Internet
     • NLP tools: StrepHit
     • Knowledge Graph: Wikidata.org
     • Applications: Wikipedia, Wikigenomes
     • Microtasks: Wikidata game, MixnMatch
     • Structured data: Gene Ontology, etc.
  19. Wikidata is a Functioning and Flourishing Knowledge Garden
  20. Wikidata
     • ~27,000,000 concepts identified by Qids like ‘Q183560’
     • ~1350 source vocabularies (e.g. MeSH, RxNorm, IMDb, etc.), based on properties tagged with type ‘ExternalId’
     • ? total terms integrated = labels + aliases (a lot)
     • Mappings to Qids are the product of the unwashed masses
     • Constantly updated
  21. What concept scheme do we use?
     • Wikidata
         • PROs: universal, open, infrastructure, active community, largely curated content
         • CONs: limited biomedical content so far?
  22. Challenge: Relevant Scientific Applications
     • NLP tools: SemRep, Literome, Implicitome, PubTator, DeepDive, Snorkel, ContentMine, TEES, …
     • Knowledge Graph
     • Applications: Wikigenomes, HetioNet, Knowledge.Bio, …
     • Structured data: Gene Expression, etc., …
     A. Advancing science is the goal and this is how we can help
     B. We need experts to help refine and build the knowledge graph, and apps are the bait
  23. On the plane, Oct. 11, 2016… “Screw it, let’s go all in”
     I got really excited...
     (Images: https://www.flickr.com/photos/alexnormand/5992512756 and https://www.flickr.com/photos/k6lcs/15374887957)
  24. knowledge.bio 3.0
     • All nodes to be concepts from Wikidata
     • All predicates to be properties from Wikidata
     • All edges to be linked to references that could be ‘stated in’ Wikidata
     • Edges (‘claims’) can come from any source
     • Now
         • We have one consistent format for data import
         • We have a consistent pattern for gathering more data about a concept
         • We have access to 27 million concepts and growing (and we can add more)
         • We have the beginnings of a new tool for expert-sourcing curation of Wikidata content
         • Our code is getting simpler and cleaner
  25. KB3.0 – next step: seeding content
     • You are now basically up to date…
     • Rest of talk is about mapping content from SemmedDB to the new structure
     • 3.0 release will allow users to add new nodes and edges
     • If you want data in there:
         1. map it to Wikidata items and properties
         2. make a tab-delimited file (Qid Pid Qid referenceUrl sentence)
         3. load it (or ask me to)
     • Users needed!
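The import format on slide 25 (Qid, Pid, Qid, referenceUrl, sentence, tab-delimited) could be read with something like the minimal sketch below. The Claim class, the parse_claims function, and the validation rules are assumptions made for illustration and are not taken from the knowledge.bio code base. The sample row only reuses syntactically valid identifiers and is not a real claim.

```python
import csv
import io
import re
from dataclasses import dataclass

# Hypothetical validators: a Qid is "Q" + digits, a Pid is "P" + digits.
QID = re.compile(r"^Q\d+$")
PID = re.compile(r"^P\d+$")


@dataclass(frozen=True)
class Claim:
    """One edge: subject item, property, object item, plus its reference."""
    subject_qid: str
    property_pid: str
    object_qid: str
    reference_url: str
    sentence: str


def parse_claims(tsv_text):
    """Parse tab-delimited rows of the form: Qid, Pid, Qid, referenceUrl, sentence."""
    claims = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 5:
            raise ValueError(f"line {line_no}: expected 5 columns, got {len(row)}")
        subj, prop, obj, url, sentence = (cell.strip() for cell in row)
        if not (QID.match(subj) and PID.match(prop) and QID.match(obj)):
            raise ValueError(f"line {line_no}: expected Qid, Pid, Qid in the first three columns")
        claims.append(Claim(subj, prop, obj, url, sentence))
    return claims


# Illustrative row only; the URL and sentence are placeholders.
sample = "Q183560\tP2176\tQ412194\thttps://example.org/ref\tIllustrative sentence only.\n"
for claim in parse_claims(sample):
    print(claim.subject_qid, claim.property_pid, claim.object_qid)
```

Rejecting malformed rows at load time keeps every edge tied to a Wikidata item, a Wikidata property, and a reference, which is the constraint slide 24 sets for version 3.0.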
  26. How many concepts in the UMLS are now items in Wikidata?
     (Venn diagram: Wikidata 27,000,000; UMLS 3,000,000; overlap ‘?’)
  27. Direct identifier mapping
  28. Direct identifier mapping (15 shared ontologies), CUI -> Qid
     UMLS_vocab  Concepts      Wikidata_property                                Prop id  Usage
     NCBI        1014837       NCBI Taxonomy ID                                 P685     379589
     MSH         359116        MeSH ID                                          P486     5979
     ICD10PCS    178278        ICD-10-PCS                                       P1690    5
     NCI         119620        NCI Thesaurus ID                                 P1748    5562
     ICD10CM     98899         ICD-10                                           P494     8826
     OMIM        86181         OMIM ID                                          P492     5835
     FMA         82042         Foundational Model of Anatomy ID                 P1402    3378
     GO          60412         Gene Ontology ID                                 P686     43693
     MDR         51961         Medical Dictionary for Regulatory Activities ID  P3201    1
     HGNC        39261         HGNC gene symbol                                 P353     63691
     HGNC        Sometimes...  HGNC-ID                                          P354     39758
     NDFRT       38206         NDF-RT ID                                        P2115    1509
     ICD9CM      20993         ICD-9-CM                                         P1692    88
     ICD10       11552         ICD-10                                           P494     8826
     RXNORM      205998        RxNorm CUI                                       P3345    5671
     Worked example:
     • C0001629 (Adrenal Medulla) -> local MySQL query -> FMA: 15633
     • Build SPARQL: ?qid wdt:P1402 “15633”
     • query.wikidata.org -> Q934888
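The "build SPARQL" step in slide 28's worked example could look roughly like the sketch below. The function name and the VOCAB_TO_PROPERTY dictionary are hypothetical; the property ids are copied from the slide's table, and only a subset is listed. The sketch builds the query text only and does not send it anywhere. It relies on the wdt: prefix being predefined by the service at query.wikidata.org.

```python
# Subset of the UMLS vocabulary -> Wikidata property table from slide 28.
VOCAB_TO_PROPERTY = {
    "NCBI": "P685",
    "MSH": "P486",
    "OMIM": "P492",
    "FMA": "P1402",
    "GO": "P686",
}


def build_external_id_query(vocab, external_id):
    """Return a SPARQL query that finds items carrying the given external id."""
    property_pid = VOCAB_TO_PROPERTY[vocab]  # KeyError if the vocabulary is not mapped
    # Escape backslashes and double quotes so the id is a valid SPARQL string literal.
    escaped = external_id.replace("\\", "\\\\").replace('"', '\\"')
    return f'SELECT ?qid WHERE {{ ?qid wdt:{property_pid} "{escaped}" . }}'


# Slide example: FMA id 15633 (Adrenal Medulla, CUI C0001629).
print(build_external_id_query("FMA", "15633"))
```

According to the slide, running the resulting pattern against query.wikidata.org returned Q934888 for this identifier.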
  29. Strict identifier mapping, CUI -> Qid
     UMLS_vocab  Concepts      Wikidata_property                                Prop id  Usage
     NCBI        1014837       NCBI Taxonomy ID                                 P685     379589
     MSH         359116        MeSH ID                                          P486     5979
     ICD10PCS    178278        ICD-10-PCS                                       P1690    5
     NCI         119620        NCI Thesaurus ID                                 P1748    5562
     ICD10CM     98899         ICD-10                                           P494     8826
     OMIM        86181         OMIM ID                                          P492     5835
     FMA         82042         Foundational Model of Anatomy ID                 P1402    3378
     GO          60412         Gene Ontology ID                                 P686     43693
     MDR         51961         Medical Dictionary for Regulatory Activities ID  P3201    1
     HGNC        39261         HGNC gene symbol                                 P353     63691
     HGNC        Sometimes...  HGNC-ID                                          P354     39758
     NDFRT       38206         NDF-RT ID                                        P2115    1509
     ICD9CM      20993         ICD-9-CM                                         P1692    88
     ICD10       11552         ICD-10                                           P494     8826->8292
     RXNORM      205998        RxNorm CUI                                       P3345    0->5671
     -> Thanks to Sebastian’s recent work...
  30. How many concepts in the UMLS are now items in Wikidata? (according to identifiers)
     (Venn diagram: Wikidata 27,000,000; UMLS 3,000,000; overlap 463,059, i.e. 15%)
  31. 463,059 Wikidata items by UMLS source id
  32. Coverage of shared identifiers by item (cut off, NCBI taxonomy has > 1 million)
     (Bar chart; series: UMLS CUIs, Wikidata items. Annotation: good targets for Wikidata bots)
  33. 463,059 mapped concepts, by semantic group
     (Bar chart, log-scale axis from 1 to 1,000,000: N 1-to-1 mappings. Labeled groups: NCBI Taxons, Gene Ontology, Genes, Diseases, Drugs)
  34. Where are the Gaps?
     (Bar chart, axis from 0 to 800,000: N with no map. Annotations: 600,000 missing drugs, 550,000 missing disorders)
  35. Where are(n’t) the Gaps?
     (Bar chart, axis from 0 to 0.6: percent_mapped)
  36. Label matching…
  37. Adding label matching actually doesn’t help that much…
     • Checked only 460,080 (including all 288,552 from SemmedDB)
     • 21% (96,843) had an identifier match
     • 6.9% (31,645) had a match on the UMLS Preferred Label
     • 3.1% (14,319) matched one of the UMLS synonyms
     • Removing anything that matched more than 1 Wikidata item, we get 129,726 concepts
     • Limiting to concepts used in SemmedDB, we get 113,623
     • (43% coverage with most matches coming from identifiers)
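The matching procedure summarized on slide 37 might be sketched as a tiered lookup, as below. The tier order (identifier, then preferred label, then synonyms), the case-insensitive label comparison, and all function and variable names are assumptions; the slide only reports the counts per match type and the rule of dropping anything that matched more than one Wikidata item. Apart from the FMA example from slide 28, the index contents are made up.

```python
def match_concept(concept, id_index, label_index):
    """Return (qid, method) for an unambiguous match, else None.

    concept:     {"ids": [(vocab, id), ...], "preferred": str, "synonyms": [str, ...]}
    id_index:    {(vocab, id): set of Qids}
    label_index: {lowercased label or alias: set of Qids}
    """
    tiers = [
        ("identifier", [id_index.get(key, set()) for key in concept["ids"]]),
        ("preferred_label", [label_index.get(concept["preferred"].lower(), set())]),
        ("synonym", [label_index.get(s.lower(), set()) for s in concept["synonyms"]]),
    ]
    for method, hits in tiers:
        candidates = set().union(*hits)
        if candidates:
            # Keep the match only if exactly one Wikidata item was found at this tier.
            return (candidates.pop(), method) if len(candidates) == 1 else None
    return None


# Toy indexes: the FMA entry is from slide 28, the labels are hypothetical.
id_index = {("FMA", "15633"): {"Q934888"}}
label_index = {"ambiguous label": {"Q1", "Q2"}, "unique alias": {"Q3"}}

adrenal = {"ids": [("FMA", "15633")], "preferred": "Adrenal Medulla", "synonyms": []}
print(match_concept(adrenal, id_index, label_index))
```

In this sketch an ambiguous hit at one tier discards the concept instead of falling through to the next tier, which is one possible reading of "removing anything that matched more than 1 Wikidata item".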
  38. SemmedDB as Wikidata, version 1
     • 15,957,582 predications with 13 relation types
     • All concepts are Wikidata items
     • All relation types are Wikidata properties
     • (Data available at http://tinyurl.com/cui2qid-1 )
     • Will be accessible in kb3.0 next week or the following
  39. Next steps / project opportunities
     • More Wikidata bots!
     • Establish a more consistent typing strategy in Wikidata (e.g. make each item an instance of some semantic group)
     • Finish the mapping of the UMLS predicates to Wikidata properties
         • Add missing properties (e.g. ‘Activates’, ‘Inhibits’)
         • Use the existing subproperty property to build a property ontology inside Wikidata
     • Populate kb3.0 with knowledge pertinent to your disease area
     • Extend the user interface
     • Use the underlying Neo4j database to extend HetioNet and related (or add HetioNet to it)
  40. Pick an edge or node and create or improve it
     • Unstructured data: PubMed, Clinical Trials, etc.
     • NLP tools: SemRep, DeepDive, Implicitome, etc.
     • Knowledge Graph: SemmedDB, Literome, etc.
     • Applications: Semantic MEDLINE, BioGraph, etc.
     • Microtasks: Mark2Cure, AMT
     • Structured data: Gene Ontology, etc.
  41. Thanks!
     • Richard Bruskiewich! and Star Informatics team for persevering… (v1, v2.1...5, v3.0)
     • Gene Wiki team! Especially bot developers: Sebastian B., Andra W., Tim P., Greg S., who planted the seeds that are making this possible.
     • Su laboratory!
     • I hope you can find something useful here and help grow the garden…
     • Especially you HetNetters!
     (Image: https://www.flickr.com/photos/alexnormand/5992512756)
