
Investigating Term Reuse and Overlap in Biomedical Ontologies

Our conference presentation at the 6th International Conference on Biomedical Ontology (ICBO), held in Lisbon, Portugal, 27th-30th July 2015. Conference Proceedings: http://icbo2015.fc.ul.pt/ICBO2015Proceedings.pdf

Published in: Data & Analytics

  1. 1. Investigating Term Reuse and Overlap in Biomedical Ontologies. International Conference on Biomedical Ontology, Lisbon, 27th-30th July 2015. Maulik R. Kamdar, Tania Tudorache and Mark A. Musen. Are we there yet?
  2. 2. Unified Medical Language System (UMLS): CUI C0011849 (Diabetes Mellitus) covers the Diabetes Mellitus terms in SNOMEDCT and ICD9CM.
  3. 3. Unified Medical Language System (UMLS): CUI C0011849 (Diabetes Mellitus) covers the Diabetes Mellitus terms in SNOMEDCT and ICD9CM. Open Biomedical Ontologies (OBO) Foundry: RNA Binding (GO:0003723) from the Gene Ontology (GO) is reused by IRI (GO:0003723) in the Gene Expression Ontology (GEXO) and linked by xref from Binding to RNA (GRO#BindingToRNA) in the Gene Regulation Ontology (GRO).
  4. 4. OBO Reuse vs Overlap in 2010. Ghazvinian, Amir, et al. "How orthogonal are the OBO Foundry ontologies?" J. Biomedical Semantics 2.S-2 (2011): S2.
  5. 5. OBO Reuse vs Overlap in 2010: Same IRI
  6. 6. OBO Reuse vs Overlap in 2010: Same IRI, Intent for Reuse
  7. 7. OBO Reuse vs Overlap in 2010: Xref mapping, Same IRI, Intent for Reuse
  8. 8. OBO Reuse vs Overlap in 2010: September 2009
  9. 9. OBO Reuse vs Overlap in 2010: September 2010
  10. 10. Key Findings
  11. 11. Key Findings • ~3% Term Reuse • Only popular or upper-level ontologies reused • 14.4% Term Overlap
  12. 12. Key Findings • ~3% Term Reuse • Only popular or upper-level ontologies reused • 14.4% Term Overlap • Semantically-similar terms reused together • Similarity metric for a Recommender system
  13. 13. BioPortal Import Plugin
  14. 14. DOG4DAG
  15. 15. Ontofox Web tool
  16. 16. Neurological Disease Ontology
  17. 17. Neurological Disease Ontology: Reuse of an Ontology (OBI)
  18. 18. Neurological Disease Ontology: Reuse of Terms (OGMS)
  19. 19. Neurological Disease Ontology (NDO)
  20. 20. Key Findings • ~3% Term Reuse • Only popular or upper-level ontologies reused • 14.4% Term Overlap • Semantically-similar terms reused together • Similarity metric for a Recommender system
  21. 21. Workflow: BioPortal N-triples dump of biomedical ontologies (509 ontologies); remove ontology views (377 ontologies, 5,718,276 class terms); extract terms, labels, xrefs and CUIs; analyze IRI reuse, xref reuse and CUI reuse (determine the source ontology; source-target ontology pairs; >35% reuse counts as ontology reuse); label normalization for term overlap analysis; clustering.
  22. 22. 14.4% Naïve Term Overlap! • Normalized String Matching on Term Labels: 14.4% (823,621)
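A minimal sketch of naïve overlap detection by normalized string matching on term labels. The slides do not state the normalization rules, so the ones used here (lowercasing, replacing non-alphanumeric characters with spaces, collapsing whitespace) are assumptions, as are the function names and the toy terms with placeholder IRIs.

```python
import re
from collections import defaultdict


def normalize_label(label: str) -> str:
    """Assumed normalization: lowercase, ASCII alphanumerics only, single spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", label.lower()).split())


def overlapping_labels(terms):
    """Map each normalized label to the ontologies using it, keeping labels seen in 2+ ontologies.

    `terms` is an iterable of (ontology, iri, label) tuples.
    """
    seen = defaultdict(set)
    for ontology, _iri, label in terms:
        seen[normalize_label(label)].add(ontology)
    return {label: onts for label, onts in seen.items() if len(onts) > 1}


# Toy data with placeholder IRIs, not taken from BioPortal.
toy_terms = [
    ("GO", "http://example.org/go#GO_0003723", "RNA binding"),
    ("GRO", "http://example.org/gro#BindingToRNA", "Binding to RNA"),
    ("DEMO", "http://example.org/demo#rna_binding", "RNA-Binding"),
]
overlap = overlapping_labels(toy_terms)
```

In this toy run only "RNA binding" and "RNA-Binding" collapse to the same normalized label; "Binding to RNA" stays separate, which illustrates why lexically different labels for the same concept escape this kind of matching.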
  23. 23. IRI Reuse: 156/377 ontologies reuse no terms from other ontologies! <5% of terms reused from other ontologies!
  26. 26. Xref Reuse: 315/377 ontologies xref link to no terms from other ontologies! <5% of terms reused from other ontologies!
  27. 27. IRI Reuse: 263/377 ontologies have no terms reused by other ontologies! Reuse from a small set of ontologies only!
  28. 28. Xref Reuse: 286/377 ontologies have no terms xref linked by other ontologies! Reuse from a small set of ontologies only!
  29. 29. • 0-5% of total terms reused explicitly or using xref, with >150 ontologies showing 0% reuse. Average term reuse ~3%. • Reuse from a small set of ontologies only, with terms from >250 ontologies never reused. • >100% term reuse from some ontologies! Why?
  30. 30. >100% terms reused from some ontologies! [Bar chart: number of ontologies reusing terms (0-100), by xref reuse and by IRI reuse, for the source ontologies BFO, GO, IAO, OBI, PATO, CHEBI, CL, NCBITAXON, UO, SO, UBERON, CARO, NCIT, FMA, MP and SNOMEDCT.]
  31. 31. >100% terms reused from some ontologies! [Bar chart: axis labelled "Number of Ontologies Reusing Terms (#)" (0-100); series "% of Terms reused IRIs" and "% of Terms reused xref"; ontologies numbered 1-16; callout "BFO:10 1/39".]
  32. 32. … Reuse from a small set of popular or upper-level ontologies only, with terms from >250 ontologies never reused. >100% terms reused w.r.t. the current version of the BFO, PATO, CARO, UO and SO ontologies! Needs rigorous analysis through term overlap …
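One way a reuse figure above 100% can arise, sketched under an assumption: the slides do not give the formula, so the percentage is taken here as the number of distinct identifiers attributed to a source ontology across all reusing ontologies, divided by the number of classes in the current version of that source. The function name and all identifiers are hypothetical.

```python
def percent_reused(reused_ids, current_source_ids):
    """Distinct reused identifiers as a percentage of the current source classes (assumed formula)."""
    return 100.0 * len(set(reused_ids)) / len(set(current_source_ids))


# Hypothetical identifiers: the current source version has two classes.
current_version = {"ex:SRC_0000001", "ex:SRC_0000002"}

# Reusing ontologies cite both current classes plus an identifier from an
# older release that is no longer present in the current version.
reused = {"ex:SRC_0000001", "ex:SRC_0000002", "ex:src/1.1#Entity"}

pct = percent_reused(reused, current_version)
```

Here three distinct identifiers are attributed to a source that currently defines two classes, so the ratio is 150%.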
  33. 33. CUI Reuse: Procedural terminologies do not share CUIs! [Bar chart: number of terms (log scale, 1 to 1,000,000) per UMLS terminology: ICD10PCS, HCPCS, NCBITAXON, LOINC, MESH, HL7, ICD10CM, OMIM, RXNORM, CPT, PDQ, MEDDRA, ICD9CM, NDDF, ICPC, ICPC2P, MDDB, NDFRT, SNOMEDCT, VANDF, CRISP, RCD, MEDLINEPLUS, SNMI, COSTART, WHO-ART. Series shown: CUIs shared, 0 terminologies.]
  34. 34. CUI Reuse (same chart). Series shown: CUIs shared, 1-5 terminologies.
  35. 35. CUI Reuse (same chart). Series shown: CUIs shared, 6-10 terminologies.
  36. 36. CUI Reuse (same chart). Series shown: CUIs shared, 11-15 terminologies.
  37. 37. CUI Reuse (same chart). Series shown: CUIs shared, 16-20 terminologies.
  38. 38. CUI Reuse (same chart). Series shown: CUIs shared.
  39. 39. CUI Reuse (same chart). Series shown: CUIs shared.
  40. 40. Minimal sharing of CUIs, especially across the UMLS procedural terminologies ICD10PCS, HCPCS and CPT. Several unique terms are introduced in the migration from ICD9CM to ICD10CM, leading to a decrease in term reuse. Should there actually be term reuse?
  41. 41. Overlap decreases using correct representations! • 14.4% (823,621): Normalized String Matching on Term Labels • 13.2% (752,176): Removing Explicitly Reused Terms • 10.8% (617,509): Removing Terms Mapped to the same UMLS CUI • 1.6% (93,650): Removing almost-similar terms (same identifier and source ontology but different representation)
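The four percentages on this slide can be checked against the counts, assuming the denominator is the 5,718,276 class terms reported on slide 21; the slide itself does not name the denominator, and the dictionary keys are shortened descriptions.

```python
# Assumed denominator: class terms left after removing ontology views (slide 21).
TOTAL_CLASS_TERMS = 5_718_276

# Overlapping-term counts after each successive filtering step (slide 41).
overlap_counts = {
    "normalized string matching on labels": 823_621,
    "minus explicitly reused terms": 752_176,
    "minus terms mapped to the same UMLS CUI": 617_509,
    "minus almost-similar terms": 93_650,
}

# Percentage of all class terms, rounded to one decimal as on the slide.
overlap_percent = {
    step: round(100 * count / TOTAL_CLASS_TERMS, 1)
    for step, count in overlap_counts.items()
}
```

Under that assumption the rounded values match the slide: 14.4, 13.2, 10.8 and 1.6.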
  42. 42. Average 3% term reuse across ontologies using any method, yet a 14.4% naïve term overlap! Term overlap decreases substantially on removing almost-similar terms … Examples of almost-similar terms?
  43. 43. Ontology engineers show an intent for reuse! Intent: Different Versions (BFO, NCIT), Different Notations (FMA), Different Namespaces (MESH, SNOMEDCT). Example: Version 1.0/Version 1.1, Subcellular Anatomy Ontology (SAO), Suggested Ontology for Pharmacogenomics (SOPHARM).
  44. 44. Ontology engineers show an intent for reuse! Example: NCIT:C53037/NCIT:Cerebral_Vein, Cigarette Smoke Exposure (CSEO), Sage Bionetworks Synapse (SYN).
  45. 45. Ontology engineers show an intent for reuse! Example: OBO:FMA_31396, OBO:owlapi/fma#FMA_31396, OBO:owl/FMA#FMA_31396, OBO:fma#Cartilage_of_inferior_surface …
  46. 46. Ontology engineers show an intent for reuse! Example: http://purl.bioontology.org/ontology/MESH, http://phenomebrowser.net/ontologies/mesh/mesh.owl
  47. 47. Ontology engineers show an intent for reuse! Example: http://ihtsdo.org/snomedct/, http://purl.bioontology.org/ontology/SNOMEDCT
  48. 48. Different versions, notations, namespaces: • >100% reuse of a few source ontologies • Increase in term overlap. Incorrect representations without mappings do not provide the advantages of term reuse!
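A small sketch of how notation variants of one identifier might be collapsed, using the FMA strings from slide 45. The heuristic (take whatever follows the last `#`, `/` or `:`) and the function name are assumptions for illustration, not the method used in the study.

```python
import re


def local_id(iri: str) -> str:
    """Assumed heuristic: the identifier is the text after the last '#', '/' or ':'."""
    return re.split(r"[#/:]", iri)[-1]


# Notation variants quoted on slide 45.
fma_variants = [
    "OBO:FMA_31396",
    "OBO:owlapi/fma#FMA_31396",
    "OBO:owl/FMA#FMA_31396",
]
fma_ids = {local_id(v) for v in fma_variants}

# A label-style notation does not collapse to the numeric identifier.
label_style = local_id("OBO:fma#Cartilage_of_inferior_surface")
```

The three numeric variants reduce to a single identifier, while the label-style form would still need an explicit mapping, which is the situation the slide describes.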
  49. 49. Key Findings • ~3% Term Reuse • Only popular or upper-level ontologies reused • 14.4% Term Overlap • Semantically-similar terms reused together • Similarity metric for a Recommender system
  50. 50. Understanding how Term Reuse Occurs. Term-Ontology Matrix (K-modes Clustering); Term-Term Affinity Matrix (Spectral Clustering). Example term-ontology matrix:

              Onto 1  Onto 2  Onto 3  Onto 4  Onto 5  Onto 6  Onto 7
      Term 1    1       1       1       0       0       0       0
      Term 2    0       0       0       1       1       0       0
      Term 3    0       0       0       0       0       1       1
      Term 4    1       1       0       0       1       0       0
      Term 5    1       1       1       0       0       0       1
      Term 6    0       0       0       1       1       1       0
      Term 7    0       0       1       0       1       0       0

  51. 51. Understanding how Term Reuse Occurs. Term-Ontology Matrix (K-modes Clustering); Term-Term Affinity Matrix (Spectral Clustering).
  52. 52. Understanding how Term Reuse Occurs. Term-Ontology Matrix (K-modes Clustering); Term-Term Affinity Matrix (Spectral Clustering). • Weighted Similarity Score between Term pairs – Shared Ontologies – Jaccard Semantic Similarity Score – CUI Hierarchy from UMLS Metathesaurus
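To make the step from a term-ontology matrix to a term-term affinity matrix concrete, here is a sketch that uses the toy matrix from slide 50 and a plain Jaccard index over the sets of ontologies sharing each term. This covers only the "shared ontologies" idea; the weighted score on the slide also involves the UMLS CUI hierarchy, whose weighting is not given, so this is a simplified stand-in and the function names are hypothetical.

```python
# Toy term-ontology matrix from slide 50 (rows: terms, columns: Onto 1..7).
MATRIX = {
    "Term 1": [1, 1, 1, 0, 0, 0, 0],
    "Term 2": [0, 0, 0, 1, 1, 0, 0],
    "Term 3": [0, 0, 0, 0, 0, 1, 1],
    "Term 4": [1, 1, 0, 0, 1, 0, 0],
    "Term 5": [1, 1, 1, 0, 0, 0, 1],
    "Term 6": [0, 0, 0, 1, 1, 1, 0],
    "Term 7": [0, 0, 1, 0, 1, 0, 0],
}


def ontology_set(row):
    """Column indices of the ontologies that contain the term."""
    return {i for i, value in enumerate(row) if value}


def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Term-term affinity: Jaccard index of the ontologies sharing each pair of terms.
affinity = {
    (s, t): jaccard(ontology_set(MATRIX[s]), ontology_set(MATRIX[t]))
    for s in MATRIX
    for t in MATRIX
}
```

An affinity matrix of this shape is the kind of input a spectral clustering routine accepts; terms that always appear in the same ontologies, such as Term 1 and Term 5 here, receive high affinity.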
  53. 53. Semantically-similar terms are reused together! [Chart: cluster size, for clusters with semantic similarity < 0.9 and > 0.9.]
  54. 54. Semantically-similar terms are reused together! Semantic Similarity > 0.9
  56. 56. Semantically-similar terms (parent-child or siblings) are reused together … The similarity metric and BioPortal can be used to provide recommendations to ontology developers through a WebProtégé plugin!
  57. 57. Challenges to Term Reuse • Substantial term overlap but less than 5% reuse. • Lexically-similar terms may represent different concepts (e.g., anatomical concepts between ZFA and XAO). • Lexically-different terms may represent the same concept (e.g., myocardium and cardiac muscle). • The same terms use different IRI representations, without explicit CUI or xref mappings. • Lack of guidelines and semi-automated tools.
  58. 58. Future Work: WebProtégé Plugin. Term reuse recommendations using an item-based collaborative filtering method. Two-fold (a posteriori and user-centered) evaluation. [Diagram of GO terms: GO:0033036 GO:0008104 GO:1902432 GO:1903260 GO:0061472 GO:0090174 GO:0071850 GO:0044770 GO:0044839 GO:0045786 GO:0007050 GO:0044843 GO:1902969 GO:0036226]
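Item-based collaborative filtering for term recommendation could look roughly like the sketch below, where terms play the role of items and ontologies the role of users. The slide only names the method, so everything here is an assumption: the cosine similarity over ontology sets, the scoring by summed similarity, the function names, and the toy data (reused from the slide 50 matrix).

```python
from math import sqrt

# Hypothetical reuse data: term -> set of ontologies (columns of the slide 50 toy matrix).
REUSE = {
    "Term 1": {1, 2, 3},
    "Term 2": {4, 5},
    "Term 3": {6, 7},
    "Term 4": {1, 2, 5},
    "Term 5": {1, 2, 3, 7},
    "Term 6": {4, 5, 6},
    "Term 7": {3, 5},
}


def cosine(a, b):
    """Cosine similarity between two binary vectors represented as sets."""
    return len(a & b) / sqrt(len(a) * len(b)) if a and b else 0.0


def recommend(current_terms, reuse, top_n=3):
    """Rank terms not yet in the ontology by summed similarity to its current terms."""
    scores = {
        cand: sum(cosine(onts, reuse[t]) for t in current_terms)
        for cand, onts in reuse.items()
        if cand not in current_terms
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


suggestions = recommend({"Term 1"}, REUSE)
```

For an ontology that already contains Term 1, the sketch suggests the terms that most often co-occur with it in other ontologies, which is the intuition behind recommending terms that tend to be reused together.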
  59. 59. The Road Ahead … - Still far from achieving ideal term reuse beyond upper-level and popular ontologies - Newer ontologies are being added to BioPortal - Without strict guidelines and semi-automated tools, we will deviate further …
  60. 60. Acknowledgments: Musen Lab, Stanford; BMI PhD Program, Stanford; US NIH Grants GM086587 and GM103316. maulikrk@stanford.edu http://stanford.edu/~maulikrk/data/OntologyReuse
