The National Cancer Institute Thesaurus is described by its authors as "a biomedical vocabulary that provides consistent, unambiguous codes and definitions for concepts used in cancer research" and as one that "exhibits ontology-like properties in its construction and use". We performed a qualitative analysis of the Thesaurus in order to assess its conformity with principles of good practice in terminology and ontology design.
We used both the online browsable version of the Thesaurus and its OWL representation (version 04.08b, released on August 2, 2004), measuring each against the requirements put forward in relevant ISO terminology standards and against ontological principles advanced in the recent literature.
We found many mistakes and inconsistencies in the term-formation principles used and in the underlying knowledge representation system, as well as verbal and formal definitions that are missing or inappropriately assigned.
Version 04.08b of the NCI Thesaurus suffers from the same broad range of problems that have been observed in other biomedical terminologies. For its further development, we recommend the use of a more principled approach that allows the Thesaurus to be tested not just for internal consistency but also for its degree of correspondence to that part of reality which it is designed to represent.

  • Problem example: ‘chromosome’ in Sequence Ontology and in Cell Component Ontology means different things. Current solution: two distinct terms are involved (each qualified by its respective namespace).
  • There is no species called ‘non-rabbit’
  • There is no biological species ‘unknown rabbit’. See discussion below.
    1. Ontology and the NCI Thesaurus. Barry Smith, with thanks to Werner Ceusters and Louis Goldberg
    2. Ontology developments in Buffalo. Department of Philosophy: 8 full-time ontologists; National Center for Ontological Research (http://ncor.us); NYS Center of Excellence in Bioinformatics & Life Sciences; Werner Ceusters; Referent Tracking; Pilot EHR
    3. GO + OBO. National Center for Biomedical Ontology; Berkeley Drosophila Genome Project; Cambridge University Department of Genetics; Mayo Clinic; University of Oregon Institute of Neuroscience; University of California San Francisco Medical Center; University at Buffalo Department of Philosophy; http://ncbo.us
    4. A methodology for quality assurance of ontologies: rules for ontology building based on two millennia of philosophical research on classification and categorization. Targets thus far in the biomedical domain: FMA; SNOMED; GALEN; Gene Ontology; UMLS Semantic Network; ICF (International Classification of Functioning, Disability and Health); ISO Terminology Standards; HL7-RIM
    6. Ontologies of Reality vs. Information Models. Data: sequence, expression, genotype, structure. Data structures: patterns, clusters, alignments, ... UMLS-SN: amino acid sequence is_a idea or concept. ‘Swimming is healthy and has 8 letters’
    7. New criteria for admission to OBO (Open Biomedical Ontologies) Library: satisfaction of basic principles of ontology design. Goal: to move beyond information retrieval and statistical clustering to automatic reasoning
    8. First Rule: Univocity. Terms should have the same meanings on every occasion of use. They should refer to the same kinds of entities in reality.
    9. Second Rule: Positivity. Complements of kinds are not themselves kinds. Terms such as ‘non-mammal’ or ‘non-membrane’ or ‘other metalworker in New Zealand’ do not designate genuine kinds in reality.
    10. Third Rule: Objectivity. Which kinds exist is not a function of our knowledge. Terms such as ‘unknown’ or ‘unclassified’ or ‘unlocalized’ do not designate biological natural kinds.
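The Positivity and Objectivity rules lend themselves to a simple lexical screen. The following is a minimal Python sketch and not part of the original analysis; the marker patterns and the function name `flag_term` are assumptions made for illustration, seeded with the examples quoted on the slides. A match is only a hint that a term deserves review, not proof of a violation.

```python
import re

# Hypothetical marker patterns, taken from the examples on the slides.
# Positivity: labels that begin with "non-" or "other" name complements or leftovers.
NEGATIVE = re.compile(r"^(non-|other\b)", re.IGNORECASE)
# Objectivity: labels that encode the state of our knowledge.
EPISTEMIC = re.compile(r"\b(unknown|unclassified|unlocalized)\b", re.IGNORECASE)


def flag_term(term):
    """Return the names of the rules a term label appears to violate."""
    violations = []
    if NEGATIVE.search(term):
        violations.append("positivity")
    if EPISTEMIC.search(term):
        violations.append("objectivity")
    return violations


if __name__ == "__main__":
    for label in ["Vascular Plant", "Non-vascular Plant", "Other Plant",
                  "Unclassified Drugs and Chemicals"]:
        print(label, flag_term(label))
```

A purely lexical screen will miss violations that are not signalled in the label, so it complements rather than replaces manual review.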
    11. Fourth Rule: Single Inheritance. No kind in a classificatory hierarchy should have more than one is_a parent on the immediate higher level.
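A single-inheritance check can be sketched as a count of is_a parents per term. The edge list below is a hypothetical toy hierarchy (it is not taken from the NCIT), and `multiple_parents` is an illustrative name rather than an existing tool.

```python
from collections import defaultdict


def multiple_parents(is_a_edges):
    """Return {child: sorted parents} for every child with more than one is_a parent."""
    parents = defaultdict(set)
    for child, parent in is_a_edges:
        parents[child].add(parent)
    return {child: sorted(ps) for child, ps in parents.items() if len(ps) > 1}


# Hypothetical (child, parent) pairs used only to exercise the check.
edges = [
    ("blue car", "car"),
    ("blue car", "blue thing"),
    ("car", "thing"),
    ("blue thing", "thing"),
]
violations = multiple_parents(edges)

if __name__ == "__main__":
    print(violations)
```

In this toy hierarchy only ‘blue car’ is reported, because it is the only term with two asserted parents.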
    12. Basic ontological relations such as is_a and part_of should be shared by all ontologies. [Diagram: nodes ‘thing’, ‘car’, ‘blue thing’, ‘blue car’, with edges labeled is_a1 and is_a2]
    13. Fifth Rule: Use common upper-level categories and relations (is_a, part_of, ...), with precise formal definitions for machine purposes and with equivalent natural language definitions for human beings.
    14. Sixth Rule: Intelligibility of Definitions. The terms used in a definition should be simpler (more intelligible) than the term to be defined; otherwise the definition provides no assistance to human understanding or to machine processing. Definitions should be intuitively meaningful (should not contradict common sense).
    15. The National Cancer Institute Thesaurus (NCIT): part of OBO, but does not (yet) satisfy these principles
    16. NCIT: “a biomedical vocabulary that provides consistent, unambiguous codes and definitions for concepts used in cancer research”; “exhibits ontology-like properties in its construction and use”.
    17. Goals: to make use of current terminology “best practices” to relate relevant concepts to one another in a formal structure, so that computers as well as humans can use the Thesaurus for a variety of purposes, including the support of automatic reasoning; to speed the introduction of new concepts and new relationships in response to the emerging needs of basic researchers, clinical trials, information services and other users.
    18. Formal Definitions. Of 37,261 nodes, 33,720 were stipulated to be primitive in the DL sense. Thus only a small portion of the NCIT ontology can be used for purposes of automatic classification and error-checking.
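To make "a small portion" concrete, the two node counts above can be turned into a share of non-primitive nodes. This is plain arithmetic on the figures from the slide; the variable names are illustrative only.

```python
# Node counts as reported on the slide for NCIT version 04.08b.
total_nodes = 37_261
primitive_nodes = 33_720

# Nodes that are not primitive, i.e. that carry a full formal definition.
defined_nodes = total_nodes - primitive_nodes
defined_share = defined_nodes / total_nodes

if __name__ == "__main__":
    print(defined_nodes, f"{defined_share:.1%}")
```

On these counts, 3,541 nodes (roughly 9.5% of the total) are available to a DL classifier as fully defined classes.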
    19. Verbal Definitions. About half the NCIT terms are assigned verbal definitions. Unfortunately some are assigned more than one.
    20. Disease Progression. Definition 1: Cancer that continues to grow or spread. Definition 2: Increase in the size of a tumor or spread of cancer in the body. Definition 3: The worsening of a disease over time. This concept is most often used for chronic and incurable diseases where the stage of the disease is an important determinant of therapy and prognosis.
    21. To make matters worse, Disease Progression has subclass Cancer Progression. Definition: The worsening of a cancer over time. This concept is most often used for incurable cancers where the stage of the cancer is an important determinant of therapy and prognosis.
    22. Cancer: an object (which can grow and spread); a process (of getting better or worse)
    23. Confuses definitions with descriptions. Tuberculosis. Definition: A chronic, recurrent infection caused by the bacterium Mycobacterium tuberculosis. Tuberculosis (TB) may affect almost any tissue or organ of the body with the lungs being the most common site of infection. The clinical stages of TB are primary or initial infection, latent or dormant infection, and recrudescent or adult-type TB. Ninety to 95% of primary TB infections may go unrecognized. Histopathologically, tissue lesions consist of granulomas which usually undergo central caseation necrosis. Local symptoms of TB vary according to the part affected; acute symptoms include hectic fever, sweats, and emaciation; serious complications include granulomatous erosion of pulmonary bronchi associated with hemoptysis. If untreated, progressive TB may be associated with a high degree of mortality. This infection is frequently observed in immunocompromised individuals with AIDS or a history of illicit IV drug use.
    24. A better solution. Tuberculosis. Definition: A chronic, recurrent infection caused by the bacterium Mycobacterium tuberculosis.
    25. Inherits ontological and terminological incoherence from source vocabularies such as UMLS-SN. Conceptual Entities. Definition: An organizational header for concepts representing mostly abstract entities. Confuses use and mention (swimming is healthy and has eight letters). Includes as subtypes: action, change, color, death, event, fluid, injection, temperature
    26. ... and imprecision: Duratec, Lactobutyrin, Stilbene Aldehyde classified as Unclassified Drugs and Chemicals
    27. ... and problematic synonyms. Anatomic Structure, System, or Substance ~ Anatomic Structures and Systems: does ‘anatomic’ apply only to structure or also to system and substance? Biological Function ~ Biological Process: some biological processes are the exercises of biological functions; others (e.g. pathological processes) are not. Genetic Abnormality ~ Molecular Abnormality (with subtype: Molecular Genetic Abnormality) (definitions not supplied)
    28. More problematic synonyms. Diseases and Disorders ~ Disease ~ Disorder. Definition 1 for Disease: A disease is any abnormal condition of the body or mind that causes discomfort, dysfunction, or distress to the person affected or those in contact with the person. ... Definition 2 for Disease: A definite pathologic process with a characteristic set of signs and symptoms. ... Condition ≠ Process. Definition 2 contradicts NCIT’s own classification hierarchy.
    29. Ontological problems. Three disjoint classes of plants: Vascular Plant; Non-vascular Plant; Other Plant
    30. Ontological problems. Abnormal Cell is a top-level class (thus not subsumed by Cell). Cell is a subclass of Other Anatomic Concept (so that cells themselves are concepts). Normal Cell is a subclass of Microanatomy.
    31. Next step: alignment of OBO ontologies through a common system of top-level categories in the OBO-UBO (Upper Biomedical Ontology) and through a common system of formally defined relations in the OBO-RO (Relation Ontology). See “Relations in Biomedical Ontologies”, Genome Biology, Apr. 2005; Donnelly, M., Bittner, T. and Rosse, C. 2005. ‘A Formal Theory for Spatial Representation and Reasoning in Biomedical Ontologies’. Artificial Intelligence in Medicine
    32. is_a. A is_a B. Definition: for all x, t: if x instance_of A at t, then x instance_of B at t. Allows reliable cross-ontology inferences from ‘abnormal cell’ to ‘cell’
    33. part_of. A part_of B. Definition: for all x, t: if x instance_of A at t, then there is some y such that y instance_of B at t and x part_of y. ‘part_of’ is the instance-level part relation, e.g. between this nucleus and this cell. The all-some structure of such definitions allows cascading of inferences (i) within ontologies, (ii) between ontologies, (iii) between ontologies and EHR repositories of instance-data
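The two definitions above can be checked against instance data. The following Python sketch is an illustration under stated assumptions: it drops the time index t (a single snapshot is assumed), and the function names, instance identifiers and data are all hypothetical.

```python
def holds_is_a(a, b, instance_of):
    """A is_a B: every instance of A is also an instance of B.

    Holds vacuously when A has no recorded instances.
    """
    return instance_of.get(a, set()) <= instance_of.get(b, set())


def holds_part_of(a, b, instance_of, part_of_pairs):
    """A part_of B: every instance of A is part of some instance of B (all-some)."""
    b_instances = instance_of.get(b, set())
    return all(
        any((x, y) in part_of_pairs for y in b_instances)
        for x in instance_of.get(a, set())
    )


# Hypothetical snapshot: two nuclei, two cells, one of which is abnormal.
instance_of = {
    "nucleus": {"n1", "n2"},
    "cell": {"c1", "c2"},
    "abnormal cell": {"c2"},
}
# Instance-level parthood: this nucleus is part of this cell.
part_of_pairs = {("n1", "c1"), ("n2", "c2")}

if __name__ == "__main__":
    print(holds_is_a("abnormal cell", "cell", instance_of))
    print(holds_part_of("nucleus", "cell", instance_of, part_of_pairs))
```

Because the class-level relation quantifies over all instances of A and some instance of B, the check is asymmetric: in this snapshot nucleus part_of cell holds, while cell part_of nucleus does not.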
    34. Cascading inferences. Whichever A you choose, its including B will be included in some C, which will include as part also the A with which you begin. The same principle applies to the other relations (located_at, transformation_of, derived_from etc.) in the OBO-RO (UML treatment here very poor)
    35. NCIT as now constituted will block such automatic reasoning: neither Normal Cell nor Abnormal Cell is a Cell within the context of the NCIT
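The blocked inference can be reproduced with a small subsumption check. The sketch below encodes only the three parent assignments reported on the slides for version 04.08b; anything above those parents is left out, and the names `ancestors` and `is_subsumed_by` are illustrative assumptions, not part of any NCIT tooling.

```python
def ancestors(term, parents_of):
    """Collect every class reachable from term by following is_a links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_subsumed_by(term, candidate, parents_of):
    """True if candidate is the term itself or one of its is_a ancestors."""
    return term == candidate or candidate in ancestors(term, parents_of)


# Fragment of the hierarchy as described on the slides (assumed incomplete).
ncit_fragment = {
    "Cell": ["Other Anatomic Concept"],
    "Normal Cell": ["Microanatomy"],
    "Abnormal Cell": [],  # reported as a top-level class
}

if __name__ == "__main__":
    for term in ("Normal Cell", "Abnormal Cell"):
        print(term, is_subsumed_by(term, "Cell", ncit_fragment))
```

On this fragment neither class reaches Cell, so any inference that relies on ‘abnormal cell is_a cell’ has no asserted path to follow.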
    36. Some consolations: NCIT is open source; NCIT has broad coverage; NCIT has some formal structure (DL); NCIT has realized the errors of its ways; NCIT is much, much better than (for example) the HL7-RIM