Biomedical ontology tutorial_atlanta_june2011_part1


  • microarray/datamining/
  • with thanks to Bill Hogan
  • Problem example: ‘chromosome’ means different things in the Sequence Ontology and in the Cell Component Ontology. Current solution: two distinct terms, each qualified by its respective namespace.
  • There is no species called ‘non-rabbit’
  • There is no biological species: unknown rabbit. See discussion below.

    1. 1. How to Build a Biomedical Ontology: Success Stories; The Gene Ontology (GO); SNOMED, ICD and other controlled vocabularies; Ontology Design Principles; Ontology Applications. Barry Smith
    2. 2. Uses of ‘ontology’ in PubMed abstracts 2
    3. 3. 3
    4. 4. By far the most successful: GO (Gene Ontology) 4
    5. 5. 5
    6. 6. Hierarchical view of GO representing relations between represented types 6
    7. 7. Gene Ontology: $100 million invested in literature and database curation using the Gene Ontology (GO), based on the idea of annotation; over 11 million annotations relating gene products (proteins) described in the UniProt, Ensembl and other databases to terms in the GO; multiple secondary uses, because the ontology was not built to meet one specific set of requirements 7
    8. 8. GO provides a controlled system of terms for use in annotating (describing, tagging) data • multi-species, multi-disciplinary, open source • contributing to the cumulativity of scientific results obtained by distinct research communities • compare use of kilograms, meters, seconds in formulating experimental results 8
    9. 9. Sample Gene Array Data 9
    10. 10. semantic annotation of data: where in the cell? what kind of molecular function? what kind of biological process? 10
    11. 11. natural language labels to make the data cognitively accessible to human beings 11
    12. 12. compare: legends for maps 12
    13. 13. compare: legends for diagrams 13
    14. 14. ontologies are legends for data 14
    15. 15. compare: legends for maps 15
    16. 16. ontologies are legends for images 16
    17. 17. what lesion? what brain function? 17
    18. 18. ontologies are legends for databases [Figure labels: MouseEcotope, GlyProt, sphingolipid transporter activity, DiabetInGene, GluChem] 18
    19. 19. annotation using common ontologies yields integration of databases [Figure labels: MouseEcotope, GlyProt, Holliday junction helicase complex, DiabetInGene, GluChem] 19
    20. 20. annotation using common ontologies can support comparison of data 20
    21. 21. annotation with Gene Ontology: supports reusability of data; supports search of data by humans; supports comparison of data; supports aggregation of data; supports reasoning with data by humans and machines 21
    22. 22. 22
    23. 23. The goal: virtual science• consistent (non-redundant) annotation• cumulative (additive) annotation yielding, by incremental steps, a virtual map of the entirety of reality that is accessible to computational reasoning 23
    24. 24. This goal is realizable if we have a common ontology framework: data is retrievable, comparable and integratable only to the degree that it is annotated using a common controlled vocabulary (compare the role of seconds, meters, kilograms … in unifying science) 24
    25. 25. To achieve this end we have to engage in something like philosophy (?): is this the right way to organize the top level of this portion of the GO? how does the top level of this ontology relate to the top levels of other, neighboring ontologies? 25
    26. 26. Strategy for doing this: see the world as organized via types/universals/categories which are hierarchically organized and in relation to which statements can be formulated which are universally true of all instances: cell membrane part_of cell 26
    27. 27. [Figure: Foundational Model of Anatomy Ontology. is_a and part_of hierarchy over Anatomical Structure, Anatomical Space, Organ, Organ Part, Organ Cavity, Organ Cavity Subdivision, Organ Component, Organ Subdivision, Serous Sac, Serous Sac Cavity, Serous Sac Cavity Subdivision, Tissue, Pleural Sac, Pleural Cavity, Pleura (Wall of Sac), Parietal Pleura, Visceral Pleura, Mediastinal Pleura, Interlobar recess, Mesothelium of Pleura] 27
    28. 28. [Figure: species and genera hierarchy: substance, organism, animal, mammal, cat, frog, siamese; instances at the bottom] 28
    29. 29. 29
    30. 30. the problem of continuity of care: patients move around; with thanks to 30
    31. 31. synchronic and diachronic problems of semantic interoperability (across space and across time) 31
    32. 32. [Figure: EHR 1 and EHR 2] how can we link EHR 1 to EHR 2 in a reliable, trustworthy, useful way, which both systems can understand? 32
    33. 33. [Figure: EHR 1 and EHR 2 linked via ICD] the ideal solution: WHO International Classification of Diseases 33
    34. 34. ICD. PRO: de facto US billing standard; multilanguage. CON: de facto US billing standard (corrupts data); no definitions of terms, and so difficult to judge accuracy of hierarchy and of coding; inconsistent hierarchies; hard to reason with results; hence few secondary uses, e.g. for research 34
    35. 35. ICD 11: the (ontology-based) plan: multiple views including ◦ billing ◦ public health statistics ◦ research ◦ SNOMED compatibility 35
    36. 36. [Figure: EHR 1 and EHR 2 linked via SNOMED-CT] the ideal solution: a single universal clinical vocabulary 36
    37. 37. SNOMED CT: Systematized Nomenclature of Medicine-Clinical Terms. PRO: international standard (sort of); huge resource; free for member countries; multi-language (including Spanish) 37
    38. 38. SNOMED CT. CON: huge (but redundant ... and gappy); contains many examples of false synonymy; still in need of work ◦ No consistent interpretation of relations ◦ Many erroneous relation assertions ◦ Many idiosyncratic relations ◦ Mixes ontology with epistemology ◦ Contains numerous compound terms (e.g., test for X) without the constituent terms (here: X), even where the latter are of obvious salience 38
    39. 39. SNOMED CT: coding with SNOMED-CT is unreliable and inconsistent; multi-stage, multi-committee process for adding terms that follows intuitive rules and not formal principles; does there exist a strategy for evolutionary improvement? 39
    40. 40. [Figure: EHR 1 and EHR 2 linked via SNOMED-CT] and above all: SNOMED CT cannot solve the problem of continuity of care because it has too much redundancy 40
    41. 41. [Figure: EHR 1 and EHR 2 linked via SNOMED-CT] AND because it is used only in certain countries 41
    42. 42. [Figure: EHR 1 and EHR 2 linked via the Unified Medical Language System (UMLS)] link EHR 1 to EHR 2 through a snapshot of the patient’s condition which both systems can understand 42
    43. 43. Unified Medical Language System (UMLS): UMLS is not unified, not a language, not a system (and not only medical); it is an aggregation. If we use something like UMLS as reference terminology, we will not solve the translation problem [Figure labels: EN, DE]
    44. 44. R T U New York State Center of Excellence in Bioinformatics & Life Sciences. UMLS approach to countering silo formation: by ‘linking between different clinical or biomedical vocabularies’. However: ‘… the Metathesaurus does not represent a comprehensive NLM-authored ontology of biomedicine or a single consistent view of the world. The Metathesaurus preserves the many views of the world present in its source vocabularies because these different views may be useful for different tasks.’
    45. 45. R T U New York State Center of Excellence in Bioinformatics & Life Sciences
    46. 46. Prospective standardization is a good thing. Prospective standardization is the only thing which will work in mission critical domains. Prospective standardization means that certain limits to tolerance must be imposed. Need for top-down governance to ensure common architecture and resolution of border disputes in areas of overlap between domains 46
    47. 47. Principles of Best Practice in Ontology Development 47
    48. 48. Problem of ensuring sensible cooperation in a massively interdisciplinary community. Consider multiple uses of technical terms such as: type, concept, instance, model, representation, data 48
    49. 49. Three Levels. L3. Words, models (published representations, ontologies, databases ...) L2. Ideas (concepts, thoughts, memories, ...) L1. Things (cells, planets, processes of cell division ...) 49
    50. 50. Entity =def. anything which exists, including things and processes, functions and qualities, beliefs and actions, documents and software (entities on levels 1, 2 and 3) 50
    51. 51. First basic distinction among entities: type vs. instance (science text vs. diary) (human being vs. Tom Cruise) 51
    52. 52. For ontologies it is generalizations that are important = types, universals, kinds, species 52
    53. 53. Catalog vs. inventory: A 515287 DC3300 Dust Collector Fan; B 521683 Gilmer Belt; C 521682 Motor Drive Belt 53
    54. 54. An ontology is a representation of types. We learn about types in reality from looking at the results of scientific experiments in the form of scientific theories. Experiments relate to what is particular; science describes what is general 54
    55. 55. Ontology =def. a representational artifact whose representational units (which may be drawn from a natural or from some formalized language) are intended to represent 1. types in reality 2. those relations between these types which obtain universally (= for all instances), e.g. lung is_a anatomical structure; lobe of lung part_of lung; in accordance with our best current established science 55
    56. 56. [Figure: hierarchy of types: object, organism, animal, mammal, cat, frog, siamese; instances at the bottom] 56
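The hierarchy of types and instances on slides 55 and 56 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the placement of frog directly under animal is a guess at the flattened figure, and the instance names Tibbles and Kermit are invented.

```python
# Toy is_a hierarchy read off slide 56 (child -> single parent).
# Assumption: frog sits directly under animal; the figure does not make this certain.
IS_A = {
    "organism": "object",
    "animal": "organism",
    "mammal": "animal",
    "cat": "mammal",
    "siamese": "cat",
    "frog": "animal",
}

# Hypothetical particulars (instances); the names are invented for illustration.
INSTANCE_OF = {"Tibbles": "siamese", "Kermit": "frog"}


def ancestors(type_name):
    """Return every type reachable from type_name by following is_a upward."""
    found = []
    current = type_name
    while current in IS_A:
        current = IS_A[current]
        found.append(current)
    return found


def is_instance_of(particular, type_name):
    """True if the particular instantiates type_name directly or via is_a."""
    direct = INSTANCE_OF[particular]
    return type_name == direct or type_name in ancestors(direct)
```

Because is_a holds universally, anything asserted of mammal is inherited by every instance of siamese, which is what `is_instance_of("Tibbles", "mammal")` checks.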
    57. 57. Domain =def. a portion of reality that forms the subject-matter of a single science or technology or mode of study or administrative practice: proteomics, epidemiology, C2, M&S 57
    58. 58. Representation =def. an image, idea, map, picture, name or description ... of some entity or entities. 58
    59. 59. Ontologies are representational artifacts comparable to science texts and subject to the same sorts of constraints (including need for update) 59
    60. 60. Representational units =def. terms, icons, alphanumeric identifiers ... which refer, or are intended to refer, to entities and which are minimal (atoms) 60
    61. 61. Composite representation =def. representation (1) built out of representational units which (2) form a structure that mirrors, or is intended to mirror, the entities in some domain 61
    62. 62. The Periodic Table 62
    63. 63. Ontologies are here 63
    64. 64. or here 64
    65. 65. Ontologies represent general structures in reality (leg) 65
    66. 66. Ontologies do not represent concepts in people’s heads 66
    67. 67. They represent types in reality 67
    68. 68. How do we know which general terms designate types? Types are repeatables: cell, electron, weapon, F16 ... Instances are one-off: Bill Clinton, this laptop, this handwave 68
    69. 69. Problem: the same general term can be used to refer both to types and to collections of particulars. Consider: HIV is an infectious retrovirus; HIV is spreading very rapidly through Asia 69
    70. 70. Class =def. a maximal collection of particulars determined by a general term (‘cell’, ‘electron’ but also: ‘restaurant in Palo Alto’, ‘Italian’); the class A = the collection of all particulars x for which ‘x is A’ is true 70
    71. 71. types vs. their extensions [Figure: types above; their extensions {...} are collections of particulars] 71
    72. 72. Extension =def. the extension of a type is the class of its instances 72
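The definitions of class (slide 70) and extension (slide 72) translate fairly directly into set comprehensions. The sketch below is a toy model only: the `Particular` record, its fields and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Particular:
    name: str      # invented identifier for one individual thing
    kind: str      # the type this particular instantiates (assumed label)
    location: str  # where it is found (assumed attribute)


# Hypothetical particulars, invented for illustration.
PARTICULARS = [
    Particular("restaurant_1", "restaurant", "Palo Alto"),
    Particular("restaurant_2", "restaurant", "Leipzig"),
    Particular("cell_1", "cell", "Palo Alto"),
]


def class_of(predicate):
    """The class A: all particulars x for which 'x is A' is true."""
    return {p.name for p in PARTICULARS if predicate(p)}


def extension(type_name):
    """The extension of a type: the class of its instances."""
    return class_of(lambda p: p.kind == type_name)
```

Every extension is a class, but not every class is the extension of a type: `class_of(lambda p: p.kind == "restaurant" and p.location == "Palo Alto")` is a perfectly good class determined by the general term ‘restaurant in Palo Alto’, without that term designating a type.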
    73. 73. types vs. classes [Figure: types above; classes {c, d, e, ...} below] 73
    74. 74. types vs. classes: types; extensions; other sorts of classes; compare: ‘natural kinds’ 74
    75. 75. types vs. classes: types; populations, ...; the class of all diabetic patients in Leipzig on 4 June 1952 75
    76. 76. OWL is a good representation of classes• F16s• sibling of Finnish spy• member of Abba aged > 50 years 76
    77. 77. types, classes, concepts: types; classes; ‘concepts’? 77
    78. 78. types < classes < ‘concepts’? Cases of ‘concepts’ which, some people say, do not correspond to classes: ‘Cancelled oophorectomy’, ‘Absent nipple’, ‘Unlocalized ligand’. A cancelled oophorectomy is not a special kind of conceptual oophorectomy. Use: Information Artifact Ontology (IAO) 78
    79. 79. Principle of Low Hanging Fruit: include even absolutely trivial assertions (assertions you know to be universally true), e.g. pneumococcal virus is_a virus. Computers need to be led by the hand 79
    80. 80. Example: MeSH. MeSH Descriptors > Index Medicus Descriptor > Anthropology, Education, Sociology and Social Phenomena (MeSH Category) > Social Sciences > Political Systems > National Socialism. National Socialism is_a Political Systems; National Socialism is_a Anthropology ... 80
    81. 81. Principle of Singular Nouns: terms in ontologies represent types. Goal: each term in an ontology should represent exactly one type. Thus every term should be a singular noun 81
    82. 82. Principle: do not commit the use-mention confusion. mouse =def. common name for the species Mus musculus; swimming is healthy and has eight letters 82
    83. 83. Principle: do not commit the use-mention confusion. Avoid confusing words with things. Avoid confusing concepts in our minds with entities in reality. Recommendation: avoid the word ‘concept’ entirely 83
    84. 84. Trialbank: ‘information’ =def. ‘a written or spoken designation of a concept’ 84
    85. 85. Trialbank: ‘Heparin therapy’ is an instance of ‘written or spoken designation of a concept’. What are the problems here? 1. misuse of quotation marks 2. confusion of instances and types 3. confusion of concept and reality 85
    86. 86. Principle: beware of terminological baggage. For the sake of interoperability with other ontologies, do not give special meanings to terms with established general meanings (Don’t use ‘cell’ when you mean ‘plant cell’) 86
    87. 87. ICNP: International Classification for Nursing Practice (old version) water =def. a type of Nursing Phenomenon of Physical Environment with the specific characteristics: clear liquid compound of hydrogen and oxygen that is essential for most plant and animal life influencing life and development of human beings. 87
    88. 88. Principle of definitions: supply definitions for every term: 1. human-understandable natural language 2. equivalent formal definition 88
    89. 89. Principle: definitions must be unique. Each term should have exactly one definition; it may have both natural-language and formal versions (issue with ontologies which exist with different levels of expressivity) 89
    90. 90. The Problem of Circularity: A Person =def. A person with an identity document; Hemolysis =def. The causes of hemolysis 90
    91. 91. Principle of non-circularity: the term defined should not appear in its own definition 91
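The non-circularity principle lends itself to a simple automated check. The function below is a minimal sketch, not an established tool: it only detects the defined term appearing verbatim (case-insensitively) in its own definition, and would miss inflected forms or circles that run through several definitions.

```python
import re


def is_circular(term, definition):
    """Return True if the defined term occurs as a whole word or phrase in its own definition."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return re.search(pattern, definition.lower()) is not None
```

Applied to the two examples on slide 90, `is_circular("Person", "A person with an identity document")` and `is_circular("Hemolysis", "The causes of hemolysis")` both flag the definition, while `is_circular("human being", "an animal which is rational")` does not.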
    92. 92. Example: HL7: ‘stopping a medication’ =def. change of state in the record of a Substance Administration Act from Active to Aborted 92
    93. 93. Principle of Increase in Understandability: a definition should use only terms which are easier to understand than the term defined. Definitions should not make simple things more difficult than they are 93
    94. 94. Generalized Tarski principle (a good, general constraint on a theory of meaning): for each linguistic expression ‘E’, ‘E’ means E. ‘snow’ means: snow; ‘pneumonia’ means: pneumonia 94
    95. 95. HL7 Reference Information Model: ‘medication’ does not mean: medication; rather it means: the record of medication in an information system. ‘disease’ does not mean: disease; rather it means: the observation of a disease 95
    96. 96. Principle of Acknowledging Primitives: in every ontology some terms and some relations are primitive = they cannot be defined (on pain of infinite regress). Examples of primitive relations: identity, instance_of 96
    97. 97. Principle of Aristotelian Definitions: use Aristotelian definitions. An A is a B which C’s. A human being is an animal which is rational 97
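The ‘An A is a B which C’s’ pattern can be captured as a small record with a genus and a differentia. The class and helper below are hypothetical names chosen for this sketch; the check that the genus equals the asserted is_a parent is one plausible way to tie definitions to the hierarchy, not a rule stated on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AristotelianDefinition:
    term: str         # the A being defined
    genus: str        # the B: the immediate is_a parent
    differentia: str  # the C: what marks out the As among the Bs

    def render(self):
        """Format the definition in the '=def.' style used on the slides."""
        return f"{self.term} =def. {self.genus} which {self.differentia}"


def genus_matches_parent(definition, is_a):
    """Check that the genus of the definition is the asserted is_a parent of the term."""
    return is_a.get(definition.term) == definition.genus
```

For the slide’s example, `AristotelianDefinition("human being", "animal", "is rational").render()` yields `human being =def. animal which is rational`.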
    98. 98. Rules for Formulating Terms: avoid abbreviations even when it is clear in context what they mean (‘breast’ for ‘breast tumor’); avoid acronyms; avoid mass terms (‘tissue’, ‘brain mapping’, ‘clinical research’ ...); treat each term ‘A’ in an ontology as shorthand for a term of the form ‘the type A’ 98
    99. 99. Univocity: terms should have the same meanings on every occasion of use (= they should refer to the same types). Basic ontological relations such as is_a and part_of should be used in the same way by all ontologies 99
    100. 100. Universality: ontologies are made of relational assertions. They should include only those which hold universally 100
    101. 101. Universality: often, order will matter. We can assert adult transformation_of child, but not child transforms_into adult 101
    102. 102. Universality: viral pneumonia caused by virus, but not virus causes pneumonia; pneumococcal virus causes pneumonia 102
    103. 103. Principle of Universality: results analysis later_than protocol-design, but not protocol-design earlier_than results analysis 103
    104. 104. Principle of Positivity: complements of types are not themselves types. Terms such as ‘non-mammal’, ‘non-membrane’, ‘other metalworker in New Zealand’ do not designate types in reality 104
    105. 105. Generalized Anti-Boolean Principle: there are no conjunctive and disjunctive types: ‘anatomic structure, system, or substance’; ‘musculoskeletal and connective tissue disorder’ 105
    106. 106. Objectivity: which types exist in reality is not a function of our knowledge. Terms such as ‘unknown’, ‘unclassified’, ‘unlocalized’, ‘arthropathies not otherwise specified’ do not designate types in reality. 106
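The positivity and objectivity principles (slides 104 and 106) suggest a simple term linter. The sketch below assumes that the offending wording can be spotted by pattern matching; the two patterns are built only from the examples on the slides and are certainly not exhaustive.

```python
import re

# Assumed patterns, taken from the slide examples; a real linter would need more.
NEGATIVE = re.compile(r"^(non-|other\b)", re.IGNORECASE)
EPISTEMIC = re.compile(
    r"\b(unknown|unclassified|unlocalized|not otherwise specified)\b",
    re.IGNORECASE,
)


def lint_term(term):
    """Return the names of the principles that the term appears to violate."""
    problems = []
    if NEGATIVE.search(term):
        problems.append("positivity")   # complement of a type, e.g. 'non-mammal'
    if EPISTEMIC.search(term):
        problems.append("objectivity")  # encodes what we know, not what exists
    return problems
```

A term such as ‘cell membrane’ passes with an empty list, while ‘non-mammal’ and ‘arthropathies not otherwise specified’ are flagged under positivity and objectivity respectively.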
    107. 107. Keep Epistemology Separate from Ontology: if you want to say that ‘We do not know where A’s are located’, do not invent a new class of A’s with unknown locations. (A well-constructed ontology should grow linearly; it should not need to delete classes or relations because of increases in knowledge) 107
    108. 108. Keep Sentences Separate from Terms: if you want to say ‘I surmise that this is a case of pneumonia’, do not invent a new class of surmised pneumonias. Confusion of ‘findings’ in medical terminologies 108
    109. 109. Single Inheritance: no kind in a classificatory hierarchy should be asserted to have more than one is_a parent on the immediate higher level 109
    110. 110. Multiple Inheritance [Figure: thing at the top; blue thing and car beneath it; blue car is_a blue thing and blue car is_a car] 110
    111. 111. Multiple Inheritance: is a source of errors; encourages laziness; serves as obstacle to integration with neighboring ontologies; hampers use of Aristotelian methodology for defining terms; hampers use of statistical search tools 111
    112. 112. Multiple Inheritance [Figure: thing at the top; blue thing and car beneath it; blue car linked to them by two different relations, is_a1 and is_a2] 112
    113. 113. Principle of asserted single inheritance: each reference ontology module should be built as an asserted monohierarchy (a hierarchy in which each term has at most one parent). Asserted hierarchy vs. inferred hierarchy 113
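Whether an asserted hierarchy is a monohierarchy can be checked mechanically. The following sketch assumes the asserted is_a links are available as (child, parent) pairs; the function name is hypothetical.

```python
from collections import defaultdict


def multiple_parents(is_a_assertions):
    """Map each term with more than one asserted is_a parent to its sorted parents."""
    parents = defaultdict(set)
    for child, parent in is_a_assertions:
        parents[child].add(parent)
    return {term: sorted(ps) for term, ps in parents.items() if len(ps) > 1}
```

Run on the blue car example from slide 110, with pairs `("blue car", "car")`, `("blue car", "blue thing")`, `("car", "thing")` and `("blue thing", "thing")`, the function reports `blue car` as the only term that breaks the principle.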
    114. 114. Principle of normalization: polyhierarchies should be decomposable into homogeneous disjoint monohierarchies 114
    115. 115. Principle of instantiability: a term should be included in an ontology only if there is evidence that instances to which that term refers exist or have existed or can exist in reality. Fist; Crowd 115
    116. 116. Avoid mass nouns. Count nouns = an organism, a planet, a handshake. Mass nouns = tissue, information, discourse. Mass nouns almost always go hand in hand with ontological confusion 116
    117. 117. is_a Overloading: the success of ontology alignment demands that ontological relations (is_a, part_of, ...) have the same meanings in the different ontologies to be aligned. 117
    118. 118. Multiple Inheritance [Figure: thing at the top; blue thing and car beneath it; blue car linked to them by two different relations, is_a1 and is_a2] 118
    119. 119. How to solve this problem: create two ontologies: of cars, of colors. Link the two together via cross-products (= factoring, normalization, modularization) 119
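The cross-product approach from slide 119 can be illustrated with two small monohierarchies and one compound term defined across them. Everything in this sketch is hypothetical: the sample terms, the relation name `has_color`, and the helper functions.

```python
# Two separate monohierarchies (child -> single parent; None marks a root).
CAR_IS_A = {"car": None, "sports car": "car"}
COLOR_IS_A = {"color": None, "blue": "color", "navy blue": "blue"}

# A cross-product term: genus from the car ontology, differentia from the
# color ontology. The relation name 'has_color' is an assumption.
CROSS_PRODUCTS = {"blue car": ("car", "has_color", "blue")}


def subsumed_by(term, ancestor, is_a):
    """True if term equals ancestor or reaches it by following is_a upward."""
    while term is not None:
        if term == ancestor:
            return True
        term = is_a.get(term)
    return False


def satisfies(car_type, color, cross_product_term):
    """True if a car of car_type with the given color falls under the cross-product term."""
    genus, _relation, filler = CROSS_PRODUCTS[cross_product_term]
    return subsumed_by(car_type, genus, CAR_IS_A) and subsumed_by(color, filler, COLOR_IS_A)
```

Here `blue car` has a single asserted parent (its genus, `car`); its relationship to blue things is inferred from the color hierarchy rather than asserted as a second is_a link.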
    120. 120. Compositionality: the meanings of compound terms should be determined 1. by the meanings of component terms together with 2. the rules governing syntax 120
    121. 121. User feedback principle: an ontology should evolve on the basis of feedback derived from those who are using the ontology, for example for purposes of annotation. 121