Reference Data Integration: A Strategy for the Future


2012 FIMA talk


    1. Reference Data Integration: A Strategy for the Future. Barry Smith, National Center for Ontological Research, University at Buffalo. Presented at FIMA, March 21, 2012
    2. Who am I? National Center for Biomedical Ontology, based in Stanford Medical School, the Mayo Clinic and Buffalo Department of Philosophy • Cleveland Clinic Semantic Database • Duke University Health System • University of Pittsburgh Medical Center • German Federal Ministry of Health • European Union eHealth Directorate • Plant Genome Research Resource • Protein Information Resource
    3. Who am I? National Center for Ontological Research • Joint Warfighting Center, US Joint Forces Command • Intelligence and Information Warfare Directorate (I2WD) • US Department of the Army Net-Centric Data Strategy Center of Excellence • NextGen (Next Generation Air Transportation System) Ontology Team • National Nuclear Security Administration (NNSA), Department of Energy
    4. Some questions • How to find data? • How to understand data when you find it? • How to use data when you find it? • How to compare and integrate with other data? • How to avoid data silos?
    5. The Web (net-centricity) as part of the solution • You build a site • Others discover the site and they link to it • The more they link, the more well known the page becomes (Google …) • Your data becomes discoverable
    6. The roots of Semantic Technology: (1) Make your data available in a standard way on the Web. (2) Use controlled vocabularies (‘ontologies’) to capture common meanings, in ways understandable to both humans and computers – Web Ontology Language (OWL). (3) Build links among the datasets to create a ‘web of data’.
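The three steps on this slide can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the term IDs (`EX:0001`, `EX:0002`), the two datasets, and the `link_by_term` helper are hypothetical and come from neither the talk nor any real vocabulary. It shows only the core idea that records tagged with the same controlled term become linkable even when their local labels differ.

```python
from collections import defaultdict

# Hypothetical controlled vocabulary: term ID -> shared, human-readable meaning.
VOCAB = {
    "EX:0001": "corporate bond",
    "EX:0002": "common stock",
}

# Two hypothetical datasets. Local labels differ, but the tags come from VOCAB.
dataset_a = [
    {"id": "a1", "local_label": "Corp. Bond", "term": "EX:0001"},
    {"id": "a2", "local_label": "Equity", "term": "EX:0002"},
]
dataset_b = [
    {"id": "b7", "local_label": "bond (corporate)", "term": "EX:0001"},
]


def link_by_term(*datasets):
    """Group record IDs from all datasets under the vocabulary term they share."""
    links = defaultdict(list)
    for records in datasets:
        for record in records:
            if record["term"] not in VOCAB:
                # A tag outside the shared vocabulary cannot be integrated.
                raise KeyError(f"unknown term: {record['term']}")
            links[record["term"]].append(record["id"])
    return dict(links)


links = link_by_term(dataset_a, dataset_b)
```

Here `links["EX:0001"]` collects one record from each dataset, which is the 'web of data' effect in miniature.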
    7. Controlled vocabularies for tagging (‘annotating’) data • Hardware changes rapidly • Organizations rapidly forming and disbanding • Data is exploding • Meanings of common words change slowly • Use web architecture to annotate exploding data stores using ontologies to capture these common meanings in a stable way
    8. Where we stand today • increasing availability of semantically enhanced data and semantic software • increasing use of XML, RDF, OWL in attempts to create useful integration of on-line data and information • “Linked Open Data” the New Big Thing
    9. Ontology success stories, and some reasons for failure
    10. as of September 2010
    11. The problem: the more Semantic Technology is successful, the more it fails. The original idea was to break down silos via common controlled vocabularies for the tagging of data. The very success of the approach leads to the creation of ever new controlled vocabularies – semantic silos – as ever more ontologies are created in ad hoc ways. The Semantic Web framework as currently conceived and governed by the W3C yields minimal standardization. Multiplying (meta)data registries are creating data cemeteries.
    12. NCBO Bioportal (Ontology Registry)
    15. Reasons for this effect • Low incentives for reuse of existing ontologies • Each organization wants its own ontology • Poor licensing regime, poor standards, poor training • People think: Information technology (hardware) is changing constantly, so it’s not worth the effort of getting things right • People have egos: “We have done it this way for 30 years, we are not going to change now”
    16. Why should you care? • when there are many ad hoc systems, average quality will be low • constant need for ad hoc repair through manual effort • DoD alone spends $6 billion per annum on this problem • regulatory agencies are recognizing the need for common controlled vocabularies
    17. So now people are scrambling • to learn how to create ontologies • serious lag in creating trained expertise • poor quality coding leads to poor quality ontologies • poor quality ontology management
    18. How to do it right? • how to create an incremental, evolutionary process, where what is good survives? • how to bring about ontology death? A success story from biology
    19. Old biology data
    21. Ontology in PubMed [chart: yearly values for 2000 to 2010; vertical axis 0 to 1200]
    22. By far the most successful: GO (Gene Ontology)
    23. The Gene Ontology is not an ontology of genes. What cellular component? What molecular function? What biological process?
    24. Microarray data shows changed expression of thousands of genes. How will you spot the patterns? [Figure: gene expression over time, attacked vs. control, with clusters labeled Defense response, Immune response, Response to stimulus, Toll regulated genes, JAK-STAT regulated genes, Puparial adhesion, Molting cycle, hemocyanin, Amino acid catabolism, Lipid metabolism, Peptidase activity, Protein catabolism]
    25. Why is GO successful? • built by bench biologists • multi-species, multi-disciplinary, open source • compare use of kilograms, meters, seconds in formulating experimental results • natural language and logical definitions for all terms • initially low-tech to ensure aggressive use and testing
    26. Now used not just in biology but also in hospital research
    27. Lab / pathology data • EHR data • Clinical trial data • Family history data • Medical imaging • Microarray data • Model organism data • Flow cytometry • Mass spec • Genotype / SNP data. How will you spot the patterns? How will you find the data you need?
    28. Over 11 million annotations relating UniProt, Ensembl and other databases to terms in the GO
    29. Hierarchical view representing relations between represented types
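A hierarchical view of this kind is what lets annotations made to specific terms be found through more general ones. The following Python sketch illustrates that under loud assumptions: the `IS_A` table is a toy hierarchy that borrows a few labels from slide 24 and is not the actual GO graph, and the gene names and annotations are hypothetical.

```python
# Toy is_a hierarchy (child -> list of parents). NOT the actual GO graph.
IS_A = {
    "immune response": ["response to stimulus"],
    "defense response": ["response to stimulus"],
    "response to stimulus": ["biological process"],
    "biological process": [],
}

# Hypothetical annotations: gene -> the single term it is tagged with.
ANNOTATIONS = {
    "geneA": "immune response",
    "geneB": "defense response",
    "geneC": "biological process",
}


def ancestors(term):
    """Return every term reachable from `term` by following is_a links upward."""
    seen = set()
    stack = list(IS_A[term])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(IS_A[parent])
    return seen


def genes_under(term):
    """Genes annotated to `term` itself or to any more specific term below it."""
    return sorted(
        gene
        for gene, tagged in ANNOTATIONS.items()
        if tagged == term or term in ancestors(tagged)
    )
```

Querying `genes_under("response to stimulus")` returns the genes tagged with either of the two more specific responses, which is one simple way to spot a pattern across many individually annotated genes.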
    30. ~$200 million invested in the GO so far. A new kind of biomedical research. Over 11 million GO annotations to biomedical research literature freely available on the web. Powerful software tool support for navigating this data means that what used to take researchers months of data comparison effort can now be performed in milliseconds.
    31. If controlled vocabularies are to serve to remove silos • they have to be respected by many owners of data as resources that ensure accurate description of their data – GO maintained not by computer scientists but by biologists • they have to be willingly used in annotations by many owners of data • they have to be maintained by persons who are trained in common principles of ontology maintenance
    32. The new profession of biocurator
    33. GO has been amazingly successful • Has created a community consensus • Has created a web of feedback loops where users of the GO can easily report errors and gaps • Has identified principles for successful ontology management • Indispensable to every drug company and every biology lab
    34. But GO is limited in its scope: it covers only generic biological entities of three sorts: – cellular components – molecular functions – biological processes. No diseases, symptoms, disease biomarkers, protein interactions, experimental processes …
    35. Extending the GO methodology to other domains of biology and medicine
    36. OBO (Open Biomedical Ontology) Foundry proposal (Gene Ontology in yellow). Ontologies arranged by granularity (rows) and relation to time (columns: independent continuant, dependent continuant, occurrent):
        • Organ and organism. Independent continuant: Organism (NCBI Taxonomy), Anatomical Entity (FMA, CARO). Dependent continuant: Organ Function (FMP, CPRO), Phenotypic Quality (PaTO). Occurrent: Biological Process (GO)
        • Cell and cellular component. Independent continuant: Cell (CL), Cellular Component (FMA, GO). Dependent continuant: Cellular Function (GO)
        • Molecule. Independent continuant: Molecule (ChEBI, SO, RnaO, PrO). Dependent continuant: Molecular Function (GO). Occurrent: Molecular Process (GO)
    37. The strategy of orthogonal modules (same ontology grid as on slide 36)
    38. Ontology: scope; URL; custodians
        • Cell Ontology (CL): cell types from prokaryotes to mammals; bin/detail.cgi?cell; Jonathan Bard, Michael Ashburner, Oliver Hofman
        • Chemical Entities of Biological Interest (ChEBI): molecular entities; Paula Dematos, Rafael Alcantara
        • Common Anatomy Reference Ontology (CARO): anatomical structures in human and model organisms; (under development); Melissa Haendel, Terry Hayamizu, Cornelius Rosse, David Sutherland
        • Foundational Model of Anatomy (FMA): structure of the human body; fma.biostr.washington.edu; JLV Mejino Jr., Cornelius Rosse
        • Functional Genomics Investigation Ontology (FuGO): design, protocol, data instrumentation, and analysis; FuGO Working Group
        • Gene Ontology (GO): cellular components, molecular functions, biological processes; Gene Ontology Consortium
        • Phenotypic Quality Ontology (PaTO): qualities of anatomical structures; -bin/detail.cgi?attribute_and_value; Michael Ashburner, Suzanna Lewis, Georgios Gkoutos
        • Protein Ontology (PrO): protein types and modifications; (under development); Protein Ontology Consortium
        • Relation Ontology (RO): relations; Barry Smith, Chris Mungall
        • RNA Ontology (RnaO): three-dimensional RNA structures; (under development); RNA Ontology Consortium
        • Sequence Ontology (SO): properties and features of nucleic sequences; Karen Eilbeck
    39. How to recreate the success of the GO in other areas: (1) create a portal for sharing of information about existing controlled vocabularies, needs and institutions operating in a given area; (2) create a library of ontologies in this area; (3) create a consortium of developers of these ontologies who agree to pool their efforts to create a single set of non-overlapping ontology modules – one ontology for each sub-area
    40. NextGen Ontology Portal • Two-Tiered Registry – NextGen Ontology: consists of vetted ontologies – Ontology Library: open to the wider community • Ontology Metadata – Ontology owner, domain, and location • Ontology Search – Support ontology discovery [Diagram labels: Portal, Communities, NextGen Enterprise Ontology, Ontology Library, Search]
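The portal features listed on this slide (two tiers, owner/domain/location metadata, keyword search) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of the actual NextGen portal: the `OntologyRecord` and `Registry` names, the `vetted` flag, the sample entries, and the `example.org` locations are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OntologyRecord:
    """Metadata for one registered ontology (fields follow the slide)."""
    name: str
    owner: str
    domain: str
    location: str
    vetted: bool = False  # True: vetted tier; False: open library tier


class Registry:
    """Hypothetical two-tiered registry with keyword search."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        # Refuse duplicates so that each ontology has exactly one entry.
        if record.name in self._records:
            raise ValueError(f"already registered: {record.name}")
        self._records[record.name] = record

    def search(self, keyword, vetted_only=False):
        """Names of ontologies whose name or domain contains the keyword."""
        keyword = keyword.lower()
        return [
            record.name
            for record in self._records.values()
            if (keyword in record.name.lower() or keyword in record.domain.lower())
            and (record.vetted or not vetted_only)
        ]


registry = Registry()
registry.register(OntologyRecord(
    "Weather Ontology", "Team A", "aviation weather",
    "https://example.org/weather", vetted=True))
registry.register(OntologyRecord(
    "Airport Ontology", "Team B", "aviation facilities",
    "https://example.org/airport"))
```

Searching for `"aviation"` finds both entries, while passing `vetted_only=True` restricts the result to the vetted tier.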
    41. The OBO Foundry: a step-by-step, principles-based approach. Developers commit in advance to collaborating with developers of ontologies in adjacent domains and to working to ensure that, for each domain, there is community convergence on a single ontology
    42. OBO Foundry Principles • Common governance • Common training • Robust versioning • Common architecture
    43. OBO Foundry Modular Organization
        • Top level: Basic Formal Ontology (BFO)
        • Mid-level: Information Artifact Ontology (IAO), Ontology for Biomedical Investigations (OBI), Ontology of General Medical Science (OGMS)
        • Domain level: Anatomy Ontology (FMA*, CARO), Environment Ontology (EnvO), Infectious Disease Ontology (IDO*), Cell Ontology (CL), Cellular Component Ontology (FMA*, GO*), Phenotypic Quality Ontology (PaTO), Biological Process Ontology (GO*), Subcellular Anatomy Ontology (SAO), Sequence Ontology (SO*), Molecular Function (GO*), Protein Ontology (PRO*)
    44. Extension Strategy. Top level: UCore 2.0 / UCore SL; mid-level; domain level. Military domain ontologies as extensions of the Universal Core Semantic Layer
    45. Existing efforts to create modular ontology suites • NASA SWEET Ontologies • Military Intelligence Ontology Foundry. Planned OMG efforts: • OMG (CIA) Financial Event Ontology • Semantic Layer for ISO 20022 (Financial Industry Message Scheme)
    46. Example: Financial Securities Ontology, Mike Bennett (2007)
    47. Basic principles of ontology development – for formulating definitions – of modularity – of user feedback for error correction and gap identification – for ensuring compatibility between modules – for using ontologies to annotate legacy data – for using ontologies to create new data – for developing user-specific views
    48. Modularity designed to ensure • non-redundancy • annotations can be additive • division of labor among SMEs • lessons learned in one module can benefit work on other modules • transferrable training • motivation of SME users
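Non-redundancy across modules is something that can be checked mechanically. The Python sketch below is one possible check, assuming each module is represented simply as a set of term labels; the module names, the terms, and the `find_redundant_terms` helper are hypothetical. Real ontologies would need comparison of definitions, not just case-insensitive labels.

```python
from collections import defaultdict


def find_redundant_terms(modules):
    """Map each term label defined in more than one module to those modules.

    Labels are compared case-insensitively; an empty result means the
    modules do not overlap.
    """
    owners = defaultdict(list)
    for module_name, terms in modules.items():
        for term in terms:
            owners[term.lower()].append(module_name)
    return {
        term: sorted(names)
        for term, names in owners.items()
        if len(names) > 1
    }


# Hypothetical modules; "Issuer" is defined twice and should be flagged.
modules = {
    "instruments": {"bond", "equity", "option"},
    "parties": {"issuer", "counterparty"},
    "legal_entities": {"Issuer", "regulator"},
}
redundant = find_redundant_terms(modules)
```

In this example `redundant` flags `issuer` as owned by both `legal_entities` and `parties`, the kind of overlap a consortium would resolve by assigning the term to a single module.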
    49. How should the FIMA Reference Data community solve this problem? Major financial institutions, major software vendors, major data management companies, EDMC and government principals should: – pool information about the controlled vocabularies which already exist – create a common library of these controlled vocabularies – create a subset of thought leaders who agree to pool their efforts in the creation of a suite of ontology modules for common use – create a strategy to disseminate and evolve the selected modules – create a governance strategy to manage the modules over time – allow bad ontologies to die
    50. Urgent need for trained ontologists. Severe shortage of persons with the needed expertise. University at Buffalo Online Training and Certification Program for Ontologists. For details: