Knowledge Organization System (KOS) for biodiversity information resources, GBIF KOS work program


Presentation of the Global Biodiversity Information Facility (GBIF) knowledge organization system (KOS) work program for the National Center for Biomedical Ontology (NCBO) Web seminar series. Available at http://www.bioontology.org/GBIF-vocabulary-management-for-biodiversity-informatics



  • Recommendations for the GBIF Knowledge Organization System (KOS) work programme. NCBO webinar 17 October 2012. http://www.bioontology.org/GBIF-vocabulary-management-for-biodiversity-informatics
  • Suggested areas in which GBIF’s global mandate gives it a unique responsibility and leadership role. More on some of these in later slides.
  • Through the concepts included in Darwin Core (and through equivalent data representations) GBIF has demonstrated the significant value arising from a focus on simple, widely-used data elements to support fundamental discovery, access and filtering of biodiversity data records.
  • Wieczorek, John; D. Bloom, R. Guralnick, S. Blum, M. Döring, R. De Giovanni, T. Robertson, and D. Vieglais (2012) Darwin Core: An Evolving Community-Developed Biodiversity Data Standard. PLoS ONE 7(1): e29715. doi:10.1371/journal.pone.0029715
  • The TDWG ontology development was coordinated and maintained by Roger Hyam on behalf of the TDWG community.
  • GBIF (2011). Recommendations for the Use of Knowledge Organisation Systems by GBIF. Released on 04 Feb 2011. Authors: Terry Catapano, Donald Hobern, Hilmar Lapp, Robert A. Morris, Norman Morrison, Natasha Noy, Mark Schildhauer, David Thau. Copenhagen: Global Biodiversity Information Facility, 49 pp., accessible online at http://links.gbif.org/gbif_kos_whitepaper_v1.pdf.
  • “Task 4.1 Ontology platform (GBIF, JKI). ViBRANT needs a flexible, user-friendly ontology management environment, enabling users to create, define, extend and share their own terms and concepts where needed, providing options for discussions and annotation, while supporting re-use of terms from standardized ontologies wherever possible (via task 4.2). For this purpose ViBRANT will extend the functionalities of both the ontology managers of existing vocabulary services (like GBIF) and will develop a collaborative community interface (JKI) for users and user-networks to facilitate the (bottom-up) definition and sharing of their ontologies in a user-friendly (non-technical) way” (ViBRANT project summary, page 13).
  • The cartoon is from XKCD: http://xkcd.com/927/. This work is licensed under a Creative Commons Attribution-NonCommercial 2.5 License.
  • Need to separate terms from the fundamental concepts behind the terminology.
  • Recommendation to use SKOS for concept vocabularies.
  • Recommendation to explore OWL for ontologies based on/extending the (SKOS) concept vocabularies. Start with SKOS concept vocabularies and next declare richer semantic relationships between concepts using OWL ontologies.
  • The Vocabulary Management Task Group (VoMaG) is convened at the GBIF Community Site at http://community.gbif.org/pg/groups/21382/
  • A collection of prototype KOS tools for GBIF available at http://kos.gbif.org/
  • Recommended work-flow for vocabulary management.
  • The GBIF Vocabulary Server (http://vocabularies.gbif.org) provides a tool for the development of controlled terminologies and controlled value vocabularies.
  • http://bioportal.bioontology.org/projects/168 and http://bioportal.bioontology.org/virtual/3058
  • See also: Endresen, D.T.F., and H. Knüpffer (2012). The Darwin Core extension for genebanks opens up new opportunities for sharing germplasm data sets. Biodiversity Informatics 8:12-29. https://journals.ku.edu/index.php/jbi/article/view/4095
  • http://code.google.com/p/gbif-ecat/wiki/DwCArchive
  • http://vocabularies.gbif.org/node/163947, http://vocabularies.gbif.org/vocabularies/geo_chronostrat and http://vocabularies.gbif.org/geo_chronostrat/Cambrian
  • http://rs.gbif.org/terms/geotime/geotimeConcept.rdf and http://terms.gbif.org/wiki/GeoTime (wiki forum)
  • Example: Global Names Architecture (GNA)
  • Darwin Core Archive (DwC-A) extensions under development. Controlled terminologies for the DwC-As.
  • Darwin Core Archive (DwC-A) controlled value vocabularies under development. Controlled values for terms included in the DwC-As.
  • A model for the translation of terms to other languages.
  • The Semantic MediaWiki provides a user-friendly and simple interface for managing biodiversity vocabulary resources such as the terms and concepts for data exchange schema and controlled value vocabularies. Each term is described by a separate Wiki page. The Semantic Wiki format provides an easy to use syntax for making semantic markup to describe these resources. The aim is to lower the technical threshold for domain experts to contribute to the description and maintenance of vocabulary resources that can be automatically extracted as RDF.
  • Recommendations for the next GBIF KOS work programme.
  • Cato the Elder ended all his speeches in the senate of Rome with: "Ceterum autem censeo Carthaginem esse delendam" (English: "Furthermore, I think Carthage must be destroyed"). One proposed model for persistent and stable identifiers across biodiversity information resources could be: DOIs for datasets and collections, and UUIDs for species observations, collection specimens and database records.
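The identifier model proposed in the last note (DOIs for datasets and collections, UUIDs for individual records) can be sketched for the record side as follows. This is a minimal illustration, not a GBIF procedure: the namespace URL is a placeholder, and deriving a repeatable UUID from `institutionCode` and `catalogNumber` is an assumption made here so that re-running an export yields the same identifier.

```python
import uuid

# Hypothetical namespace for illustration only; not a value assigned by GBIF.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "http://example.org/occurrences/")


def mint_record_uuid() -> str:
    """Return a new random (version 4) UUID for a record that has no identifier yet."""
    return str(uuid.uuid4())


def stable_record_uuid(institution_code: str, catalog_number: str) -> str:
    """Derive a repeatable (version 5) UUID from a local record key.

    The same inputs always give the same UUID, so an identifier can be
    regenerated rather than looked up (an assumption of this sketch).
    """
    local_key = f"{institution_code}:{catalog_number}"
    return str(uuid.uuid5(EXAMPLE_NAMESPACE, local_key))


def is_uuid(value: str) -> bool:
    """Check that a string parses as a UUID."""
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True
```

Whichever variant is used, the identifier has to be stored with the record and never reassigned; the code only covers minting, not persistence.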

Knowledge Organization System (KOS) for biodiversity information resources, GBIF KOS work program Presentation Transcript

  • NCBO Webinar series. Knowledge Organization System (KOS) for biodiversity information resources - GBIF KOS work program. Dag Endresen, Knowledge Systems Engineer, Node manager for GBIF Norway, Natural History Museum, University of Oslo. Éamonn Ó Tuama, Senior Programme Officer, Inventory, Discovery, Access (IDA), Global Biodiversity Information Facility (GBIF). 17 October 2012.
  • GBIF enables free and open access to biodiversity data online. We’re an international government-initiated and funded initiative focused on making biodiversity data available to all and anyone, for scientific research, conservation and sustainable development. Status data portal October 2012. Presented by Éamonn.
  • The OECD origin… OECD Global Science Forum recommendation (1999): “Establish and support a distributed system of interlinked and interoperable modules (databases, software and networking tools, search engines, analytical algorithms, etc.) that together will form a Global Biodiversity Information Facility (GBIF)”. “This facility will enable users to navigate and put to use vast quantities of biodiversity information, thereby: advancing scientific research…; serving the economic…; providing a basis from which our knowledge of the natural world can grow rapidly…” Presented by Éamonn.
  • 1. Information infrastructure – an Internet-based index of a globally distributed network of interoperable databases that contain primary biodiversity data. 2. Community-developed tools, standards and protocols – the tools data providers need to format and share their data. 3. Capacity-building and training – and access to a global expert community. Presented by Éamonn.
  • http://data.gbif.org/ Presented by Éamonn
  • Web services (REST). Advanced search for occurrence records. Open and free use of data! Presented by Éamonn.
      • Scientific names and classification: http://data.gbif.org/ws/rest/taxon
      • Species occurrence data: http://data.gbif.org/ws/rest/occurrence
      • Species occurrence data aggregated, 1 degree cell: http://data.gbif.org/ws/rest/density
      • Metadata on data providers: http://data.gbif.org/ws/rest/provider
      • Metadata on datasets: http://data.gbif.org/ws/rest/resource
      • Metadata on data networks: http://data.gbif.org/ws/rest/network
  • GBIF’s unique role (slide developed by Donald Hobern). Presented by Éamonn.
      • Registry of biodiversity data resources
      • Tools and support for biodiversity data publication
      • Network development at national, regional and global levels
      • Global virtual natural history collection
      • Cross-domain linkage between data from collections, ecology and genomics
      • Access to biodiversity data for GIS analysis and environmental monitoring
          – Aggregated presence data
          – Site-based survey data (samples, presence/absence)
  • Unifying species data (slide developed by Donald Hobern). Darwin Core: integrated access for records of the occurrence of any species (presence only): What? When? Where? What evidence? Data owner? Link to full record. Diagram labels: Collections, Ecological Monitoring, Genomics.
  • Darwin Core – a glossary of terms. Wieczorek J, Bloom D, Guralnick R, Blum S, Döring M, De Giovanni R, Robertson T, and Vieglais D (2012) Darwin Core: An Evolving Community-Developed Biodiversity Data Standard. PLoS ONE 7(1): e29715. doi:10.1371/journal.pone.0029715
  • TDWG Ontology. The TDWG Ontology was developed and maintained between 2007 and 2009. http://rs.tdwg.org/ontology
  • GBIF KOS task group report. http://links.gbif.org/gbif_kos_whitepaper_v1.pdf Presented by Éamonn.
  • GBIF KOS work program, 2011 and 2012. Description of work:
      • “a flexible, user-friendly ontology management environment, enabling users to create, define, extend and share their own terms and concepts where needed, providing options for discussions and annotation, while supporting re-use of terms from standardized ontologies wherever possible”.
      • Extend the functionalities of existing vocabulary services (like GBIF).
      • Collaborative community interface for users and user-networks, bottom-up, user-friendly and non-technical.
      • Flexibility for biologists to express their knowledge regardless of whether the terminology has been standardized yet or not.
  • Data standards. Important principle: Re-use of terms from standardized terminologies wherever possible. The cartoon is from XKCD: http://xkcd.com/927/ (CC-BY-NC).
  • Term versus Concept. “The SKOS (simple knowledge organization system) format is designed to present KOS data in a format that is suitable for machine inferencing and particularly for use in the Semantic Web (….) concepts – units of thought – and distinguishes these from the terms that are used to label these concepts.” Will, L. (2012). The ISO 25964 Data Model for the Structure of an Information Retrieval Thesaurus. Bulletin of the American Society for Information Science and Technology 38(4): 48-51. Dextre Clarke, S.G. and L. Zeng (2012). From ISO 2788 to ISO 25964: the evolution of thesaurus Standards towards Interoperability and data modeling. ISQ Information Standards Quarterly 24(1): 20-26.
  • Why use a flat vocabulary?
      • Maximize the reuse of terms, focus on the definition and labels for basic terms.
      • Low threshold for non-technical biologists and biodiversity domain experts to access terms and contribute (compared to richer ontologies).
      • Preferred technology: RDF (resource description framework) and SKOS (simple knowledge organization system).
      • Construction and maintenance of OWL ontologies are demanding with respect to expertise, effort and costs.
      • Maintaining SKOS vocabularies is less demanding.
      • RDF resources are designed to be easily extended.
      • Ontologies (OWL) can be based on (extend) terms declared by an RDF/SKOS vocabulary.
      • SKOS became a W3C recommendation in 2009.
  • Why use OWL (web ontology language)?
      • OWL DL supports machine reasoning through machine accessible formal semantics.
      • OWL provides by default a URI as identifier for classes, properties, relations and instances.
      • E.g. OBO targets practical solutions in the biomedical / biology domain, while OWL is more generic and provides cross-domain interoperability.
      • OWL 1.0 became a W3C recommendation in 2004, OWL 2.0 in 2009.
      • http://www.w3.org/2007/OWL/
      • Recommendation:
          • REUSE terms declared by concept vocabularies…
          • Start with SKOS - then explore OWL…
  • Governance structure (TDWG VoMaG). http://community.gbif.org/pg/groups/21382/
  • http://kos.gbif.org
  • Vocabulary management (work-flow). http://kos.gbif.org/
      • 1. Mint and maintain concepts and terms, in domain-expert working groups.
      • 2. Release final version as a Concept Vocabulary.
      • 3. REUSE terms from published concept vocabularies and ontologies when designing new DwC-A controlled term and value vocabularies.
      • 4. Publish at the GBIF Resources Repository.
      • 5. Browse at the GBIF Resources Browser.
      • Evaluation of collaborative management tools. GBIF Vocabularies as a collaborative management tool for Darwin Core Archive controlled term and value vocabularies.
      • Diagram labels: Wiki Vocabulary Management; ISOcat Vocabulary Management; Template for Vocabularies (Excel, text, etc…) with proposed template processor; Concept Vocabulary (rdf, skos); DwC-A controlled vocabularies; GBIF Vocabularies; Resources Repository; GBIF Resources Browser.
  • GBIF Vocabulary Server (Drupal). GBIF Vocabularies as a collaborative management tool for Darwin Core Archive controlled terms and controlled value vocabularies. The GBIF Vocabulary Server is based on Drupal 6 / Scratchpads v1. Evaluation of various tools for collaborative management of concept vocabularies. Diagram labels: Wiki Vocabulary Management; ISOcat Vocabulary Management; MS Excel Template for Vocabularies; Concept Vocabulary (rdf, skos); Resources Repository; GBIF IPT; DwC-A term and value vocabularies; GBIF Vocabularies.
  • Semantic wiki forum for terms. Wiki forum for terms as an open community platform for description and maintenance of existing terms. Replacement tool also for the GBIF Vocabulary Server? Evaluation of various tools for collaborative management of concept vocabularies. Diagram labels: Wiki Vocabulary Management; ISOcat Vocabulary Management; MS Excel Template for Vocabularies; Concept Vocabulary (rdf, skos); Resources Repository; GBIF IPT; DwC-A term and value vocabularies; Wiki Forum for Terms.
  • GBIF Term browser. The GBIF Term Browser allows a user to browse for terms defined in widely used concept vocabularies such as Darwin Core, Dublin Core, FOAF, etc., including, where available, translations. http://kos.gbif.org/termbrowser/ Concept vocabularies are stored/deposited at http://rs.gbif.org/terms/. Evaluation of various tools for collaborative management of concept vocabularies. Diagram labels: Wiki Vocabulary Management; ISOcat Vocabulary Management; MS Excel Template for Vocabularies; Concept Vocabulary (rdf, skos); Resources Repository.
  • Biodiversity ontology management. REUSE terms from concept vocabularies: Concept Vocabulary (rdf, skos) as the basis for Ontologies (rdf, owl). 1. Evaluation of tools for the development of biodiversity ontologies: Wiki tool (incl. ontology development?). 2. Evaluation of biodiversity ontology repository solutions: Resources Repository (incl. ontologies?).
  • BioPortal ontology repository. Proposal: establish a biodiversity “slice” at the NCBO BioPortal. http://bioportal.bioontology.org/projects/168
      • Loading biodiversity ontologies into the NCBO BioPortal promotes mapping (and reuse of terms) between bio-medical and biodiversity ontologies.
      • An instance of the BioPortal software for biodiversity requires long-term obligations to host and maintain the resource – does e.g. GBIF have the resources to offer to host a BioPortal instance?
  • GBIF KOS resources. Concept vocabularies (skos:ConceptScheme, RDF) as a basis and foundation for software application schema (XML, XML schema).
      • Concept vocabularies: Darwin Core, Darwin Core “extensions”, NCD, GNA, Audubon Core (and other concept vocabularies).
      • Software application schema: Darwin Core Archive (DwC-A) controlled terminology and controlled value vocabularies.
      • Resources such as the DwC-A controlled term and value vocabularies REUSE terms (by URI) from a concept vocabulary.
  • Biodiversity KOS (based on Darwin Core)
      • Darwin Core (DwC) provides a flat list of concepts and terms, expressed using RDF.
      • DwC “extensions” (vocabularies for the declaration of complementary and additional concepts).
      • Reuse concepts from other vocabularies whenever possible.
      • Darwin Core Archive (DwC-A) has a star schema model.
          • DwC-A core(s), extensions and controlled value vocabularies are declared as XML lists of terms.
          • DwC-A resources should always REUSE terms from Darwin Core and other concept vocabularies.
          • New DwC-A core types (data types), e.g. sample? Formalize class entities (ontology). [Current types: Taxon & Occurrence]
      • Formalize a governance structure for maintaining KOS resources based on the principles established for Darwin Core (towards TDWG VoMaG).
  • Darwin Core Archive (DwC-A). A DwC-A publishes DwC records, including terms from DwC-A extensions. Simple text-based format. Zipped single-file archive. Example file shown: Germplasm.txt
  • Darwin Core Archive extension (XML term list). http://rs.gbif.org/sandbox/extension/audubon.xml
  • GBIF Vocabulary Server. The GBIF Vocabulary Server can assist a user to create and manage DwC-A extensions or controlled value vocabularies. However, it is not designed to create RDF/SKOS concept vocabulary resources with reusable concepts. It can export XML, but not RDF. It is based on Scratchpads (v1), aka Drupal v6. Screenshot labels: edit interface; XML export.
  • Concept vocabulary (RDF/SKOS). In progress: XSLT -> HTML for human readable version. http://rs.gbif.org/terms/geotime/geotimeConcept.rdf
  • Global Names Architecture (GNA). Many of the GNA term URI identifiers do not resolve (404 not found). The rowType identifiers simply resolve to the software application schema (to the DwC-A extension). We propose to formalize the GNA concept declarations using RDF/SKOS for improved re-usability of the GNA terms and concepts.
  • Global Names Architecture (GNA): XML and RDF/SKOS. The Global Names Architecture (GNA) terms were originally simply declared by the DwC-A extension. We propose to formalize the GNA concept declarations using RDF/SKOS for improved re-usability of the GNA terms.
  • Global Names Architecture (GNA): RDF/SKOS. We propose to formalize the GNA concept declarations using RDF/SKOS for improved re-usability of the GNA terms.
  • Darwin Core Archive extensions
      • Global Names Architecture (GNA)
      • Audubon Core (multimedia)
      • Invasive species (GISIN)
      • Genetic Resources (Germplasm)
      • EOL species profile
      • Taxonomic Concept Schema (TCS)
      • Genomics Standards Consortium (GSC)
      • Meta-genomics (?)
      • ABCD (?)
      • …
  • Controlled value vocabularies
      • Country codes
      • Language
      • Basis of record
      • Taxonomic rank
      • Nomenclatural status
      • Life form
      • Life stage
      • Geological time periods
          • chronostratigraphy
          • magnetostratigraphy
      • Species interactions
          • saproxylic interactions
          • pollinators
      • …
  • Example: master SKOS/RDF resource with translations (en, es, zh, ja). http://rs.gbif.org/terms/dwc/dwc_translations.rdf Presented by Éamonn.
  • Semantic MediaWiki single term view. Presented by Éamonn.
  • Recommendations for the GBIF KOS work programme
      • GBIF Resources Repository (http://rs.gbif.org/)
          • Further development of new DwC-A extensions and controlled value vocabularies.
          • Workflow for the translation of term descriptions.
          • Versioning of terms/vocabularies.
      • Continue the evaluation of collaborative tools for management of flat vocabularies of terms (RDF/SKOS).
          • Semantic MediaWiki, ISOcat, Protégé (web-protégé), …
      • New semantic Wiki for the description of terms / glossary of terms / community-driven discussion forum (with JKI, Gregor Hagedorn).
          • Discussion, discovery and REUSE of existing terms.
      • NCBO BioPortal as a repository for biodiversity ontologies.
          • Explore BFO based OWL version of Darwin Core…?
      • KOS governance structure developed and formalized by the (TDWG) Vocabulary Management Task Group (VoMaG).
      • Roadmap for KOS into the GBIF infrastructure, portal, tools…!
  • Furthermore, I think that we need persistent identifiers! Cato the Elder ended all his speeches in the senate of Rome with: "Ceterum autem censeo Carthaginem esse delendam" (English: "Furthermore, I think Carthage must be destroyed"). Available at http://www.bioontology.org/GBIF-vocabulary-management-for-biodiversity-informatics
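The transcript describes the Darwin Core Archive as a simple text-based format with a star schema: one core file, plus extension files whose rows point back to a core record. A minimal sketch of that join is given here. The sample rows, the extension column names (`traitName`, `traitValue`) and the key column names (`id`, `coreid`) are made up for illustration. A real archive also carries a descriptor file (meta.xml) that maps columns to term URIs, which this sketch skips.

```python
import csv
import io
from collections import defaultdict

# Made-up sample data standing in for a core file and one extension file.
CORE_TXT = (
    "id\tscientificName\tcountryCode\n"
    "1\tHordeum vulgare\tNO\n"
    "2\tAvena sativa\tSE\n"
)
EXTENSION_TXT = (
    "coreid\ttraitName\ttraitValue\n"
    "1\tplantHeight\t85\n"
    "1\tdaysToHeading\t60\n"
)


def read_rows(text: str) -> list:
    """Parse tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_star(core_rows: list, ext_rows: list,
              core_key: str = "id", ext_key: str = "coreid") -> list:
    """Attach to each core record the extension rows that reference it."""
    by_core = defaultdict(list)
    for row in ext_rows:
        by_core[row[ext_key]].append(row)
    return [
        {**core, "extensions": by_core.get(core[core_key], [])}
        for core in core_rows
    ]


records = join_star(read_rows(CORE_TXT), read_rows(EXTENSION_TXT))
```

Each entry in `records` is a core record with zero or more extension rows attached. This one-to-many shape is what lets an extension such as the germplasm one add detail without changing the core.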