Ontologies: What Librarians Need to Know
Usage Rights

CC Attribution License
  • http://www.w3.org/People/Ivan/CorePresentations/HighLevelIntro/
  • Ivan Herman
  • http://www.scribd.com/doc/26643569/INCOSE-MBSE-Putting-Ontologies-to-Work-MESA-AZ-Feb-2010
  • http://dbpedia.org/fct/images/lod-datasets_2009-03-27_colored.png
  • http://bioportal.bioontology.org/visualize/42182/?id=HP:0000001#mappingsaccessed 1/25/2010
  • http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/
  • http://www.franz.com/agraph/support/documentation/current/agraph-introduction.html
  • http://www.alatechsource.org/blog/2010/01/rda-vocabularies-for-a-twenty-first-century-data-environment.html

Ontologies: What Librarians Need to Know Presentation Transcript

  • 1. Ontologies: What Librarians Need to Know
    Barry Smith, Department of Philosophy, University at Buffalo
    Presented at the conference on Research Data: Management, Access and Control, University at Buffalo, November 14, 2011
    http://libweb1.lib.buffalo.edu/blog/scholarly/?p=85
  • 4. a short movement of one lower leg crossing the other leg with the foot pointing outward
  • 5. same movement, different terms
    • part of a mannequin’s step on the catwalk
    • an epileptic jerk
    • the kicking of a ball by a soccer player
    • a signal (“Get out!”) issued in heated conversation
    • a “half cut” in Irish Sean-nós dancing
  • 7. Some questions
    • How to find data?
    • How to understand data when you find it?
    • How to use data when you find it, for example in hypothesis-checking and reasoning?
    • How to integrate with other data?
    • How to label the data you are collecting?
    • How to build a set of labels for a new domain that will integrate well with labels used in neighboring domains?
  • 8. Network effects of the Web
    • You build a site.
    • Others discover the site and they link to it.
    • The more they link to it, the more important and well known the page becomes (this is what Google exploits).
    • Your page becomes important, and others begin to rely on it.
    • The same network effect works on the raw data:
      – Many people link to the data and use it
      – Many more (and more diverse) applications will be created than the authors would even dream of
    • New ‘secondary uses’ are discovered.
    (Ivan Herman)
  • 9. The problem: doing it this way, we end up with data in many, many silos because links are formed in overlapping and redundant ways
    Photo credit: “nepatterson”, Flickr
  • 10. To avoid silos:
    1. The raw data must be available in a standard way on the Web.
    2. There must be links among the datasets to create a ‘web of data’.
    Use ontologies to capture common meanings with definitions that are understandable to both humans and computers
    The roots of Semantic Technology
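The two requirements on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two datasets, the `example.org` identifiers and the `annotatedWith` predicate are hypothetical, and the term URI follows the OBO PURL pattern for GO:0019233 (sensory perception of pain), a term listed later in this deck. The point it shows is that records from separate sources become linked as soon as they are tagged with the same ontology term, while a record tagged with a local code stays in its silo.

```python
from collections import defaultdict

# Assumed identifier pattern for a shared ontology term (GO:0019233).
GO_PAIN = "http://purl.obolibrary.org/obo/GO_0019233"
# Hypothetical predicate and local code, used only for this sketch.
ANNOTATED_WITH = "http://example.org/vocab/annotatedWith"
LOCAL_CODE = "http://example.org/vocab/localCode17"

# Two hypothetical datasets, each a list of (subject, predicate, object) triples.
lab_triples = [
    ("http://example.org/lab/sample42", ANNOTATED_WITH, GO_PAIN),
]
clinic_triples = [
    ("http://example.org/clinic/record7", ANNOTATED_WITH, GO_PAIN),
    ("http://example.org/clinic/record8", ANNOTATED_WITH, LOCAL_CODE),
]


def index_by_term(*datasets):
    """Group subjects from all datasets by the term they are annotated with."""
    index = defaultdict(set)
    for triples in datasets:
        for subject, _predicate, term in triples:
            index[term].add(subject)
    return index


linked = index_by_term(lab_triples, clinic_triples)
shared = sorted(linked[GO_PAIN])        # records from both sources
isolated = sorted(linked[LOCAL_CODE])   # only the clinic's own record
```

Here `shared` contains one record from each dataset, whereas `isolated` contains a single clinic record that no other dataset can reach.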
  • 11. Ontologies as controlled vocabularies for the tagging of data
    • Hardware changes rapidly
    • Organizations rapidly form and disband collaborations
    • Data is exploding
    • Meanings of common words change slowly
    • Use web architecture to annotate exploding data stores using ontologies that exploit these common meanings
  • 12. Mandates for Data Reuse
    • Organizations such as the NIH now require use of common standards in a way that will ensure that the results obtained through funded research are more easily accessible to external groups
    • http://grants.nih.gov/grants/policy/data_sharing/
    • http://www.nsf.gov/bfa/dias/policy/dmp.jsp
    • Data Ontologies for Biomedical Research (R01): http://grants.nih.gov/grants/guide/pa-files/par-07-425.html
  • 13. NCBO: National Center for Biomedical Ontology
    • Stanford Biomedical Research Informatics
    • Mayo Clinic Department of Bioinformatics
    • University at Buffalo, Department of Philosophy
    http://bioportal.bioontology.org/
  • 14. NCBO BioPortal
  • 15. Goals of Semantic Technology
    To support data reuse
    To enable data registries
    Metadata management
    Support for Natural Language Understanding
    Semantic Wikis
    via ontologies formulated, for example, in the Web Ontology Language (OWL)
  • 16. Where we stand today
    • HTML demonstrated the power of the Web to allow sharing of information
    • increasing availability of semantically enhanced data
    • increasing power of semantic software to allow automatic reasoning with online information
    • increasing use of OWL in attempts to break down silos and create useful integration of online data and information
  • 17. Ontology success stories, and some reasons for failure
    • A fragment of the Linked Open Data in the biomedical domain
  • 18. as of September 2010
  • 19. The result: the more Semantic Technology is successful, the more it fails to achieve its goals
    As we break down silos via controlled vocabularies for the tagging of data, the very success of the approach leads to the creation of ever new controlled vocabularies (semantic silos), because multiple ontologies are being created in ad hoc ways
    The Semantic Web framework as currently conceived and governed by the W3C yields minimal standardization
    Creates data cemeteries
  • 22. Reasons for this effect
    • Shrink-wrapped software mentality: you will not get paid for reusing old and good ontologies (let a million ‘lite’ ontologies bloom)
    • Belief that there are no ‘good’ ontologies (just arbitrary choices of terms and relations …)
    • No licensing regime (database inspection tax …)
    • Information technology (hardware) changes constantly, so it seems not worth the effort of getting things right
    • We have done it this way for 30 years, we are not going to change now
  • 23. Ontology success stories, and some reasons for failure
    • Can we solve the problem by means of mappings?
  • 24. Unified Medical Language System of the National Library of Medicine
    • let a million ontologies bloom, each one close to the terminological habits of its authors
    • in concordance with the “not invented here” syndrome
    • then map these ontologies, and use these mappings to integrate your different pots of data
  • 25. What you get with ‘mappings’
    all phenotypes (excess hair loss, duck feet)
    all organisms
    allose (a form of sugar)
    Acute Lymphoblastic Leukemia (A.L.L.)
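The slide does not say how these four unrelated entries came to be mapped together. One plausible mechanism, offered here purely as an assumption, is lexical matching on normalized labels. The sketch below uses a deliberately crude, hypothetical matcher (`naive_key`, `naive_matches`) that lowercases labels, strips everything but letters, and matches on prefixes; it is not the method of any actual mapping tool.

```python
import re


def naive_key(label):
    """Lowercase a label and drop every character that is not a letter."""
    return re.sub(r"[^a-z]", "", label.lower())


def naive_matches(query, labels):
    """Return the labels whose normalized form starts with the normalized query."""
    wanted = naive_key(query)
    return [label for label in labels if naive_key(label).startswith(wanted)]


# Labels taken from the slide.
labels = ["all phenotypes", "all organisms", "allose", "A.L.L."]

# A search for the leukemia abbreviation hits every entry.
hits = naive_matches("ALL", labels)
```

With this matcher a query for the leukemia abbreviation returns all four labels, which is the kind of result the slide warns about: string similarity stands in for sameness of meaning.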
  • 26. Mappings are hard
    They are fragile, and expensive to maintain
    They need a new authority to maintain them, yielding a new risk of forking
    The goal should be to minimize the need for mappings
    Invest resources in disjoint ontology modules which work well together
  • 27. Why should you care?
    • you need to create systems for data mining and text processing which will yield useful output for library users
    • if the codes you use are constantly in need of ad hoc repair, huge resources will be wasted and manual effort will be needed on each occasion of use
    • DoD alone spends $6 billion per annum on this problem
  • 28. And there are other problems
    • Weak expressivity of OWL
    • Poor quality coding, poor quality ontologies, poor quality ontology management
    • Strategy often serves only retrieval, not reasoning
    • Confusion as to the meaning of ‘linked’
  • 29. Uncontrolled proliferation of links
  • 31. How to do it right?
    • how to create an incremental, evolutionary process, where what is good survives and what is bad fails
    • create a scenario in which people will find it profitable to reuse ontologies, terminologies and coding systems which have been tried and tested
    • silo effects will be avoided and results of investment in Semantic Technology will cumulate effectively
  • 32. [Chart: ‘Ontology’ in PubMed, by year, 2000 to 2010; vertical axis from 0 to 1200]
  • 33. Uses of ‘ontology’ in PubMed abstracts
  • 34. By far the most successful: GO (Gene Ontology)
  • 35. GO provides a controlled vocabulary of terms for use in annotating (describing, tagging) data
    • multi-species, multi-disciplinary, open source
    • contributing to the cumulativity of scientific results obtained by distinct research communities
    • compare use of kilograms, meters, seconds in formulating experimental results
    • natural language and logical definitions for all terms to support consistent human application and computational exploitation
  • 36. You’re interested in which genes control heart muscle development
    17,536 results
  • 37. Microarray data shows changed expression of thousands of genes. How will you spot the patterns?
    [Figure: clustered gene tree over all genes (14010), conditions attacked / time / control, with gene groups labeled: puparial adhesion, molting cycle, hemocyanin, defense response, immune response, response to stimulus, Toll regulated genes, JAK-STAT regulated genes, amino acid catabolism, lipid metabolism, peptidase activity, protein catabolism]
  • 38. You’re interested in which of your hospital’s patient data is relevant to understanding how genes control heart muscle development
  • 39. Lab / pathology data; EHR data; clinical trial data; family history data; medical imaging; microarray data; model organism data; flow cytometry; mass spec; genotype / SNP data
    How will you spot the patterns?
    How will you find the data you need?
  • 40. One strategy for bringing order into this huge conglomeration of data is through the use of Common Data Elements
    • Discipline-specific (cancer, NIAID, …)
    • Do not solve the problems of balkanization (data silos)
    • Do not evolve gracefully as knowledge advances
    • Support data cumulation, but do not readily support data integration and computation
  • 41. How does the Gene Ontology work?
    with thanks to Jane Lomax, Gene Ontology Consortium
  • 42. GO provides a controlled system of representations for use in annotating data
    • multi-species, multi-disciplinary, open source
    • contributing to the cumulativity of scientific results obtained by distinct research communities
    • compare use of kilograms, meters, seconds … in formulating experimental results
  • 44. Definitions
  • 45. Gene products involved in cardiac muscle development in humans
  • 46. GO provides answers to three types of questions
    for each gene product
    • in what parts of the cell has it been identified?
    • exercising what types of molecular functions?
    • with what types of biological processes?
    when is a particular gene product involved
    • in the course of normal development?
    • in the process leading to abnormality?
    with what functions is the gene product associated in other biological processes?
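The first group of questions corresponds to GO's three branches (cellular components, molecular functions, biological processes). A minimal sketch of how annotations might be stored and queried along those lines follows. The `Annotation` record, the gene product name `productA` and the `EX:` identifiers are hypothetical placeholders; only GO:0019233 (sensory perception of pain) is taken from this deck.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One statement linking a gene product to an ontology term."""
    gene_product: str
    term_id: str
    aspect: str  # "cellular_component", "molecular_function" or "biological_process"


# Hypothetical annotations; the EX: identifiers are placeholders, not GO terms.
annotations = [
    Annotation("productA", "EX:CC1", "cellular_component"),
    Annotation("productA", "EX:MF1", "molecular_function"),
    Annotation("productA", "GO:0019233", "biological_process"),
    Annotation("productB", "EX:MF1", "molecular_function"),
]


def by_aspect(records, gene_product):
    """Answer 'where, doing what, as part of which process' for one gene product."""
    grouped = defaultdict(list)
    for record in records:
        if record.gene_product == gene_product:
            grouped[record.aspect].append(record.term_id)
    return dict(grouped)


summary = by_aspect(annotations, "productA")
```

The grouping key is the branch of the ontology, so each of the three questions is answered by reading one entry of `summary`.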
  • 47. Some pain-related terms in GO
    GO:0048265 response to pain
    GO:0019233 sensory perception of pain
    GO:0048266 behavioral response to pain
    GO:0019234 sensory perception of fast pain
    GO:0019235 sensory perception of slow pain
    GO:0051930 regulation of sensory perception of pain
    GO:0050967 detection of electrical stimulus during sensory perception of pain
    GO:0050968 detection of chemical stimulus involved in sensory perception of pain
    GO:0050966 detection of mechanical stimulus involved in sensory perception of pain
  • 48. Hierarchical view representing relations between represented types
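A hierarchical view of this kind is what lets a query on a general term also retrieve data annotated with its more specific subtypes. The sketch below uses the pain-related identifiers from the preceding slide. The `is_a` edges are an assumption read off the term names for illustration and should be checked against the GO itself; the dataset names are hypothetical.

```python
# Assumed is_a edges (child -> parents), inferred from the term names only.
IS_A = {
    "GO:0019234": ["GO:0019233"],  # fast pain perception is_a pain perception
    "GO:0019235": ["GO:0019233"],  # slow pain perception is_a pain perception
    "GO:0048266": ["GO:0048265"],  # behavioral response to pain is_a response to pain
}


def ancestors(term, is_a=IS_A):
    """Collect every term reachable from `term` by following is_a edges upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in is_a.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def annotated_to(term, annotations):
    """Items annotated with `term` itself or with any of its subtypes."""
    return sorted(
        item for item, tagged in annotations
        if tagged == term or term in ancestors(tagged)
    )


# Hypothetical annotated datasets.
annotations = [
    ("datasetA", "GO:0019234"),
    ("datasetB", "GO:0019235"),
    ("datasetC", "GO:0048266"),
]

pain_perception = annotated_to("GO:0019233", annotations)
```

A query for "sensory perception of pain" returns `datasetA` and `datasetB` even though neither was tagged with that exact term, and leaves out `datasetC`, which sits in a different branch.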
  • 49. A new kind of biological research
    based on analysis and comparison of the massive quantities of annotations linking ontology terms to raw data, including genomic data, clinical data, public health data
    What 10 years ago took multiple groups of researchers months of data comparison effort can now be performed in milliseconds
  • 50. One standard method
    Sjöblom T, et al. analyzed 13,023 genes in 11 breast and 11 colorectal cancers
    using functional information captured by GO for given gene product types
    identified 189 as being mutated at significant frequency and thus as providing targets for diagnostic and therapeutic intervention.
    Science. 2006 Oct 13;314(5797):268-74.
  • 51. What is the key to GO’s success?
    • GO is developed and maintained by experts who adhere to ontology best practices
    • over 11 million annotations relating gene products described in the UniProt, Ensembl and other databases to terms in the GO
    • experimental results reported in 52,000 scientific journal articles manually annotated by expert biologists using GO
    • ontology building and ontology QA are two sides of the same coin
  • 52. If controlled vocabularies are to serve to remove silos
    they have to be updated by respected experts who are trained in best practices of ontology maintenance
    they have to be respected by many owners of data as resources that ensure accurate description of their data
      – GO maintained not by computer scientists but by biologists
    they have to be willingly used in annotations by many owners of data
  • 53. The new profession of biocurator
  • 55. How to do it right?
    • how to create an incremental, evolutionary process, where what is good survives and what is bad fails
    • where the number of ontologies needing to be linked is small
    • where links are stable
    • create a scenario in which people will find it profitable to reuse ontologies, terminologies and coding systems which have been tried and tested
    • and in which ontologies will evolve on the basis of feedback from users
  • 56. Reasons why GO has been successful
    It is a system for prospective standardization built with a coherent top level but with content contributed and monitored by domain specialists
    Based on community consensus
    Clear versioning principles ensure backwards compatibility; prior annotations do not lose their value
    Initially low-tech to encourage adoption by new communities of users
    Tracker for user input with rapid turnaround and help desk
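The slide does not spell out the versioning principles. One way such backward compatibility could be realized, shown here strictly as an assumption, is to never delete a retired term but to record which term replaces it, so that old annotations can always be resolved to a current term. The `EX:` identifiers and the `REPLACED_BY` table below are hypothetical.

```python
# Hypothetical record of retired terms and their replacements.
REPLACED_BY = {
    "EX:0001": "EX:0002",  # EX:0001 was retired in favor of EX:0002
    "EX:0002": "EX:0005",  # which was itself later retired
}


def current_term(term_id, replaced_by=REPLACED_BY):
    """Follow replacement links until a term that is still current is reached."""
    visited = set()
    while term_id in replaced_by:
        if term_id in visited:
            raise ValueError(f"replacement cycle at {term_id}")
        visited.add(term_id)
        term_id = replaced_by[term_id]
    return term_id


# An annotation made years ago with EX:0001 still resolves to a live term.
resolved = current_term("EX:0001")
unchanged = current_term("EX:0009")
```

Under this scheme an annotation made against an early release keeps its value: it is reinterpreted through the replacement chain rather than invalidated.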
  • 57. But GO is limited in its scope
    it covers only generic biological entities of three sorts:
      – cellular components
      – molecular functions
      – biological processes
    no diseases, symptoms, disease biomarkers, protein interactions, experimental processes …
  • 58. Extending the GO methodology to other domains of biology and medicine
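  • 59. OBO (Open Biomedical Ontology) Foundry proposal (Gene Ontology in yellow)
    Columns give the relation to time: continuant (independent or dependent) and occurrent. Rows give the granularity.
    Granularity: organ and organism
      Independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
      Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
      Occurrent: Biological Process (GO)
    Granularity: cell and cellular component
      Independent continuant: Cell (CL); Cellular Component (FMA, GO)
      Dependent continuant: Cellular Function (GO)
    Granularity: molecule
      Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
      Dependent continuant: Molecular Function (GO)
      Occurrent: Molecular Process (GO)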
  • 60. The strategy of orthogonal modules
    [Same OBO Foundry matrix as on slide 59: granularity (organ and organism; cell and cellular component; molecule) against relation to time (independent continuant, dependent continuant, occurrent)]
  • 61. Ontology | Scope | URL | Custodians
    Cell Ontology (CL) | cell types from prokaryotes to mammals | obo.sourceforge.net/cgi-bin/detail.cgi?cell | Jonathan Bard, Michael Ashburner, Oliver Hofman
    Chemical Entities of Biological Interest (ChEBI) | molecular entities | ebi.ac.uk/chebi | Paula Dematos, Rafael Alcantara
    Common Anatomy Reference Ontology (CARO) | anatomical structures in human and model organisms | (under development) | Melissa Haendel, Terry Hayamizu, Cornelius Rosse, David Sutherland
    Foundational Model of Anatomy (FMA) | structure of the human body | fma.biostr.washington.edu | JLV Mejino Jr., Cornelius Rosse
    Functional Genomics Investigation Ontology (FuGO) | design, protocol, data instrumentation, and analysis | fugo.sf.net | FuGO Working Group
    Gene Ontology (GO) | cellular components, molecular functions, biological processes | www.geneontology.org | Gene Ontology Consortium
    Phenotypic Quality Ontology (PaTO) | qualities of anatomical structures | obo.sourceforge.net/cgi-bin/detail.cgi?attribute_and_value | Michael Ashburner, Suzanna Lewis, Georgios Gkoutos
    Protein Ontology (PrO) | protein types and modifications | (under development) | Protein Ontology Consortium
    Relation Ontology (RO) | relations | obo.sf.net/relationship | Barry Smith, Chris Mungall
    RNA Ontology (RnaO) | three-dimensional RNA structures | (under development) | RNA Ontology Consortium
    Sequence Ontology (SO) | properties and features of nucleic sequences | song.sf.net | Karen Eilbeck
  • 62. OBO Foundry
    recognized by NIH as a framework to address mandates for re-usability of data collected through Federally funded research
    see NIH PAR-07-425: Data Ontologies for Biomedical Research (R01)
  • 63. OBO Foundry provides
    • tested guidelines enabling new groups to develop the ontologies they need in ways which counteract forking and dispersion of effort
    • an incremental bottom-up approach to evidence-based terminology practices in medicine that is rooted in basic biology
    • automatic web-based linkage between biological knowledge resources (massive integration of databases across species and biological systems)
  • 64. The Open Biomedical Ontologies (OBO) Foundry
    [Same OBO Foundry matrix as on slide 59: granularity (organ and organism; cell and cellular component; molecule) against relation to time (independent continuant, dependent continuant, occurrent)]
  • 65. OBO Foundry Modular Organization: Governance
    Basic Formal Ontology (BFO)
    Information Artifact Ontology (IAO); Ontology for Biomedical Investigations (OBI); Ontology of General Medical Science (OGMS)
    Anatomy Ontology (FMA*, CARO); Environment Ontology (EnvO); Infectious Disease Ontology (IDO*); Biological Process Ontology (GO*); Cell Ontology (CL); Cellular Component Ontology (FMA*, GO*); Phenotypic Quality Ontology (PaTO); Subcellular Anatomy Ontology (SAO); Sequence Ontology (SO*); Molecular Function (GO*); Protein Ontology (PRO*)
  • 66. OBO Foundry Modular Organization: Training
    [Same diagram of ontologies as on slide 65]
  • 67. Extension Strategy + Modular Organization
    Levels shown: top level, mid-level, domain level
    Basic Formal Ontology (BFO)
    Information Artifact Ontology (IAO); Ontology for Biomedical Investigations (OBI); Spatial Ontology (BSPO)
    Anatomy Ontology (FMA*, CARO); Environment Ontology (EnvO); Infectious Disease Ontology (IDO*); Biological Process Ontology (GO*); Cell Ontology (CL); Cellular Component Ontology (FMA*, GO*); Phenotypic Quality Ontology (PaTO); Subcellular Anatomy Ontology (SAO); Sequence Ontology (SO*); Molecular Function (GO*); Protein Ontology (PRO*)
  • 68. How to build an ontology
    1. due diligence: identify the existing ontology content that is most relevant to your needs
    2. work with domain experts to identify parts of the domain not covered by this ontology
    3. find the ~50 most commonly used terms corresponding to types of entities in this domain
    4. arrange these terms into a taxonomical hierarchy using the strategy of downward population
    5. work with domain experts to populate the lower levels of the hierarchy
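Steps 4 and 5 can be sketched as a small data structure. The reading of "downward population" used here is an assumption: a term may only be added beneath a parent that is already in the hierarchy, so the tree grows from the top down. The `Taxonomy` class is hypothetical, and the cell-type terms are illustrative examples, not quotations from the Cell Ontology.

```python
class Taxonomy:
    """A single-inheritance hierarchy that can only be extended downward."""

    def __init__(self, root):
        self.parent = {root: None}

    def add(self, term, parent):
        """Attach `term` beneath `parent`, which must already be present."""
        if parent not in self.parent:
            raise KeyError(f"parent term not yet in hierarchy: {parent}")
        if term in self.parent:
            raise ValueError(f"term already present: {term}")
        self.parent[term] = parent

    def path_to_root(self, term):
        """List the term followed by each of its ancestors up to the root."""
        path = [term]
        while self.parent[path[-1]] is not None:
            path.append(self.parent[path[-1]])
        return path


# Illustrative terms only.
cells = Taxonomy("cell")
cells.add("blood cell", "cell")
cells.add("erythrocyte", "blood cell")

lineage = cells.path_to_root("erythrocyte")
```

Rejecting a term whose parent is missing is the design choice that matters: it forces the general categories to be agreed with domain experts before the lower levels are filled in.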
  • 69. Example: The Cell Ontology
  • 70. Ontology and Library Science
    • Nanopublishing
    • FaBiO
    • Semantically enhanced publishing
    • eagle-i and VIVO resource registry initiatives
  • 71. Nanopublishing
    • Definition: an online publishing model that uses a scaled-down, inexpensive operation to reach a targeted audience, especially by using blogging techniques
    • Applied to ontologies: gives credit to authors of fragments of ontologies, including single ontology terms and definitions
    • Applied to annotations: gives credit to curators for use of ontology terms in literature tagging
  • 72. Functional Requirements for Bibliographic Records (FRBR)
    • Group 1 entities: user interests in intellectual or artistic products
      – Work: a distinct intellectual or artistic creation
      – Expression: its intellectual or artistic realization
      – Manifestation: the physical embodiment of an expression of a work
      – Item: a single exemplar of a manifestation
    • Group 2 entities: are responsible for content, production, ..., of group 1 entities
      – Person: an individual
      – Corporate body: an organization or group of individuals or organizations
    • Group 3 entities: serve as the subjects of works
      – Concept: an abstract notion or idea
      – Object: a material thing
      – Event: an action or occurrence
      – Place: a location
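The Group 1 entities form a chain from the abstract work down to the copy on the shelf. A minimal sketch of that chain as Python data classes follows; the class and field names are hypothetical choices made for this illustration, and the catalogue entries are invented examples rather than real records.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """A single exemplar of a manifestation, e.g. one copy on a shelf."""
    identifier: str


@dataclass
class Manifestation:
    """The physical embodiment of an expression of a work."""
    description: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    """The intellectual or artistic realization of a work."""
    description: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    """A distinct intellectual or artistic creation."""
    title: str
    expressions: List[Expression] = field(default_factory=list)


def all_items(work):
    """Every exemplar held for a work, across all expressions and manifestations."""
    return [
        item
        for expression in work.expressions
        for manifestation in expression.manifestations
        for item in manifestation.items
    ]


# Invented catalogue entries for illustration.
paperback = Manifestation("paperback edition", [Item("copy-1"), Item("copy-2")])
original = Expression("original English text", [paperback])
translation = Expression("French translation", [Manifestation("hardcover edition")])
work = Work("Example Novel", [original, translation])

held = [item.identifier for item in all_items(work)]
```

Asking for a work then means walking the chain downward: a request for "Example Novel" finds both paperback copies, while the French hardcover is known to the catalogue but has no exemplar.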
  • 73. FaBiO
    • FRBR (Functional Requirements for Bibliographic Records) model rendered in OWL format: FaBiO (FRBR-aligned Bibliographic Ontology)
    • Paolo Ciccarese
      http://www.paolociccarese.info/
      http://www.hcklab.org/
  • 74. http://code.google.com/p/information-artifact-ontology/
  • 75. Semantically enhanced publishing
  • 76. With highlighting on
  • 78. eagle-i and VIVO resource registry initiatives
    eagle-i: Ontology for indexing and querying biomedical research resources
      http://code.google.com/p/eagle-i/
    VIVO: An interdisciplinary national network enabling collaboration and discovery among scientists across all disciplines
      http://vivoweb.org/
    Shared ontology resources in OBO Foundry
  • 79. BFO: Basic Formal Ontology
    Biometrics: Biometrics Ontology
    CL: Cell Ontology
    CUMBO: Common Upper Mammalian Brain Ontology
    CTO: Counterterrorism Ontology
    ENVO: Environment Ontology
    FMA: Foundational Model of Anatomy
    GO: Gene Ontology
    IAO: Information Artifact Ontology
    IDO: Infectious Disease Ontology
    MFO: Mental Functioning Ontology
    MFO-MD: Mental Disease Ontology
    MFO-EM: Emotion Ontology
    ND: Neurological Disease Ontology
    OBI: Ontology for Biomedical Investigations
    OGMS: Ontology for General Medical Science
    PO: Plant Ontology
    PRO: Protein Ontology
    RNAO: RNA Ontology
    VSO: Vital Sign Ontology
  • 80. New role for librarians as stewards of ‘local’ digital data repositories
  • 81. Librarians can take over the world
    shared VIVO and eagle-i ontologies inventorying:
    laboratories; services; instruments; reagents; organisms; images and videos; persons; protocols; patents; human studies; tissue samples; DNA samples; sample repositories; training opportunities; databases; papers; journals; genomes (plants, cars …)