Knowledge Organization System (KOS) for biodiversity information resources, GBIF KOS work program (Dag and Eamonn, 2012).


Slides from a presentation on the Knowledge Organization System (KOS) work program for GBIF. They cover KOS developments for biodiversity information resources and input to the emerging Vocabulary Management Task Group (VoMaG).

GBIF KOS prototype tools:
  • Tool: Semantic Wiki prototype
  • Tool: ISOcat prototype demo
  • GBIF concept vocabulary term browser

  • GBIF Resources Repository
  • GBIF Vocabulary Server
  • GBIF Resources Browser

Published in: Technology, Education

  1. 1. Knowledge Organization System for GBIF. Virtual Biodiversity Research and Access Network for Taxonomy (ViBRANT). Dag Endresen, Knowledge Systems Engineer; Éamonn Ó Tuama, Senior Programme Officer, Inventory, Discovery, Access (IDA). Global Biodiversity Information Facility (GBIF), 31 August 2012.
  2. 2. Enabling interoperability for the GBIF network and beyond (GEO BON, IPBES)
     “The ability of two or more systems or components to exchange information and to use the information that has been exchanged” (ref: IEEE Standard Computer Dictionary: Compilation of IEEE Standard Computer Glossaries, ISBN:155937079)
     Key requirement: Common exchange standards and protocols for biodiversity data … necessitate agreement on use of common vocabularies for the classes of objects and their properties.
     Knowledge Organisation Systems (KOS) can help us manage our vocabularies.
  3. 3. Knowledge Organisation Systems ... to manage the vocabularies used for sharing biodiversity information.
     - Term lists: glossaries, dictionaries, gazetteers
     - Classifications / categorisations: taxonomies, e.g., Dewey Decimal Classification
     - Relationships: thesauri (simple relationships), ontologies (a model of a domain)
  4. 4. Darwin Core – a glossary of terms
     Example terms: higherClassification, coordinatePosition, specificEpithet, geodeticDatum, collectionCode, taxonConceptID, taxonRank.
     collectionCode: The name, acronym, coden, or initialism identifying the collection or data set from which the record was derived. Examples: "Mammals", "Hildebrandt", "eBird".
  5. 5. AgroVoc vocabulary – a thesaurus
     bt Resources
     nt Natural resources
     nt Biological resources
     nt Genetic resources
     nt Germplasm
     uf Genetic material
     uf Germplasm resources
     rt Protoplasm
     rt Genes
     rt Gene pools
     rt Biodiversity
     rt Germplasm collections
     rt Gametes
     Legend: bt = broader term, nt = narrower term, uf = used for, rt = related term.
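The thesaurus relations on this slide can be sketched as a small lookup structure. This is only an illustration in Python: it assumes the bt/nt entries form a single chain from "Resources" down to "Germplasm" and that both uf entries belong to "Germplasm", which the flattened slide text suggests but does not confirm. The function names are hypothetical.

```python
# Broader-term (bt) links, read off the slide's bt/nt chain.
# Assumption: each nt entry is nested under the entry listed before it.
BROADER = {
    "Germplasm": "Genetic resources",
    "Genetic resources": "Biological resources",
    "Biological resources": "Natural resources",
    "Natural resources": "Resources",
}

# "Used for" (uf) links point a non-preferred term at its preferred term.
# Assumption: both uf entries on the slide belong to "Germplasm".
USED_FOR = {
    "Genetic material": "Germplasm",
    "Germplasm resources": "Germplasm",
}


def preferred(term):
    """Return the preferred term for a possibly non-preferred term."""
    return USED_FOR.get(term, term)


def broader_chain(term):
    """List all broader terms of a term, nearest first."""
    chain = []
    current = preferred(term)
    while current in BROADER:
        current = BROADER[current]
        chain.append(current)
    return chain


if __name__ == "__main__":
    print(broader_chain("Genetic material"))
```

Looking up a non-preferred term first resolves it to its preferred term, then walks the broader-term links upward.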
  6. 6. Ontology – a model of a domain
     inverseOf: "collectors take samples" is the inverse of "samples are taken by collectors".
     sameAs: William Jefferson Clinton sameAs Bill Clinton.
     differentFrom: NHM, Los Angeles County differentFrom NHM, London.
     transitiveProperty: A hasAncestor B and B hasAncestor C imply A hasAncestor C.
     ontologies = computable dictionaries
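What "computable" means for the inverse and transitive properties on this slide can be sketched in a few lines of Python. This is a toy illustration over sets of (subject, object) pairs, not how an OWL reasoner is implemented; the function names are hypothetical.

```python
def inverse(pairs):
    """Swap subject and object, as owl:inverseOf would for a property."""
    return {(obj, subj) for subj, obj in pairs}


def transitive_closure(pairs):
    """Add every pair implied by chaining, as for a transitive property."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


if __name__ == "__main__":
    has_ancestor = {("A", "B"), ("B", "C")}
    # The closure contains the inferred pair ("A", "C").
    print(sorted(transitive_closure(has_ancestor)))

    takes = {("collector", "sample")}
    print(inverse(takes))
```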
  7. 7. Term versus Concept
     “The SKOS (simple knowledge organization system) format is designed to present KOS data in a format that is suitable for machine inferencing and particularly for use in the Semantic Web (….) The model [ISO 25964] is based on the understanding that thesauri show the relationships between concepts – units of thought – and distinguishes these from the terms that are used to label these concepts. These terms may be in one or more languages, and one term per language is chosen as a preferred term for each concept. One or more additional terms for the same concept may be recorded in the thesaurus as non-preferred terms.”
     Will, L. (2012). The ISO 25964 Data Model for the Structure of an Information Retrieval Thesaurus. Bulletin of the American Society for Information Science and Technology 38(4): 48-51.
     Dextre Clarke, S.G. and L. Zeng (2012). From ISO 2788 to ISO 25964: the evolution of thesaurus Standards towards Interoperability and data modeling. ISQ Information Standards Quarterly 24(1): 20-26.
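The distinction quoted above (one concept, one preferred term per language, any number of non-preferred terms) can be sketched as a small Python data structure. The class, its method names, the example URI and the Spanish label are illustrative assumptions, not part of SKOS or ISO 25964.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A unit of thought, labelled by terms in one or more languages."""
    uri: str
    pref_labels: dict = field(default_factory=dict)  # language -> preferred term
    alt_labels: dict = field(default_factory=dict)   # language -> non-preferred terms

    def set_preferred(self, lang, term):
        """Choose the single preferred term for a language.

        A previously preferred term is kept as a non-preferred term.
        """
        previous = self.pref_labels.get(lang)
        alternatives = self.alt_labels.setdefault(lang, [])
        if term in alternatives:
            alternatives.remove(term)
        if previous is not None and previous != term:
            alternatives.append(previous)
        self.pref_labels[lang] = term

    def add_non_preferred(self, lang, term):
        """Record an additional term for the same concept."""
        if self.pref_labels.get(lang) == term:
            raise ValueError("term is already the preferred term")
        alternatives = self.alt_labels.setdefault(lang, [])
        if term not in alternatives:
            alternatives.append(term)


if __name__ == "__main__":
    # Hypothetical URI and labels, for illustration only.
    concept = Concept("http://example.org/concept/germplasm")
    concept.set_preferred("en", "Germplasm")
    concept.add_non_preferred("en", "Genetic material")
    concept.set_preferred("es", "germoplasma")
    print(concept)
```

Keeping the preferred label in a dictionary keyed by language enforces the "one preferred term per language" rule by construction.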
  8. 8. Knowledge Organisation Systems: KOS activities in GBIF work programme
     Key requirement: a platform to support the development, maintenance and governance of vocabularies for the biodiversity community
     - New dedicated position at GBIF funded through external projects (ViBRANT, i4Life)
     - Review recommendations in KOS task group report and develop implementation roadmap
     - Review GBIF Vocabularies Service and develop vocabulary management system
     - Engage with wider community:
       - participation in Dublin Core workshop, Sept 2011
       - KOS symposium at TDWG 2011 Conf, Oct 2011
       - TDWG Vocabulary Management Task Group, 2012
  9. 9. ViBRANT: Task 4.1 Ontology platform (GBIF, JKI)
     Description of work:
     •  “[F]lexible, user-friendly ontology management environment, enabling users to create, define, extend and share their own terms and concepts where needed, providing options for discussions and annotation, while supporting re-use of terms from standardized ontologies wherever possible”.
     •  Extend the functionalities of existing vocabulary services (like GBIF).
     •  Collaborative community interface for users and user-networks, bottom-up, user-friendly and non-technical.
     •  Flexibility for biologists to express their knowledge regardless of whether the terminology has been standardized yet or not.
     Text from the ViBRANT project summary, page 13 (my highlighting).
  10. 10. ViBRANT WP4: GBIF tasks and deliverables
     Deliverable 4.2: Ontology tools:
     •  “Develop the GBIF ontology tool and produce an equivalent tool based on a semantic wiki. Deliver a single user interface for ontology creation and editing based on user-acceptance of the alternative technologies.”
     Text from the ViBRANT project summary, page 14 (my highlighting).
  11. 11. Governance structure (TDWG VoMaG)
  12. 12. Why use a flat vocabulary?
     •  Maximize the reuse of terms, focus on the definition and labels for basic terms.
     •  Low threshold for non-technical biologists and biodiversity domain experts to access terms and contribute (compared to richer ontologies).
     •  Preferred technology: RDF (resource description framework) and SKOS (simple knowledge organization system).
     •  Construction and maintenance of OWL ontologies are demanding with respect to expertise, effort and costs.
     •  Maintaining SKOS vocabularies is less demanding.
     •  RDF resources are designed to be easily extended.
     •  Ontologies (OWL) can be based on (extend) terms declared by an RDF/SKOS vocabulary.
     •  SKOS became a W3C recommendation in 2009.
  13. 13. Why use OWL (web ontology language)?
     •  OWL DL supports machine reasoning through machine-accessible formal semantics.
     •  OWL provides by default a URI as identifier for classes, properties, relations and instances.
     •  OBO, for example, targets practical solutions in the biomedical / biology domain, while OWL is more generic and provides cross-domain interoperability.
     •  OWL 1.0 became a W3C recommendation in 2004, OWL 2.0 in 2009.
     •  Recommendation:
        •  REUSE terms declared by flat vocabularies…
        •  Start with SKOS, then explore OWL…
  14. 14. Vocabulary management
     1.  Mint and maintain concepts and terms, in domain-expert working groups.
     2.  Release final version as a Concept Vocabulary.
     3.  REUSE terms from published concept vocabularies and ontologies when designing new DwC-A extensions & controlled value vocabularies.
     4.  Publish at the GBIF Resources Repository.
     5.  Browse at the GBIF Resources Browser.
     GBIF Vocabularies as a collaborative management tool for Darwin Core Archive extensions and controlled vocabularies.
     Evaluation of collaborative management tools.
     Diagram labels: Wiki Vocabulary Management; ISOcat Vocabulary Management; Excel, text, etc. (Template for Vocabularies, proposed template processor); Concept Vocabulary (rdf, skos); DwC-A extensions & controlled vocabularies; Resources Repository; GBIF Resources Browser.
  15. 15. GBIF Vocabulary Server (Drupal)
     GBIF Vocabularies as a collaborative management tool for Darwin Core Archive extensions and controlled value vocabularies.
     GBIF Vocab Server is based on Drupal 6 / Scratchpads (v1) --> Drupal 7 / Scratchpads2 --> Drupal 8? Integration with Scratchpads2? Integration with the NPT?
     Evaluation of various tools for collaborative management of concept vocabularies (RDF).
     Diagram labels: Wiki Vocabulary Management; ISOcat Vocabulary Management; MS Excel Template for Vocabularies; Concept Vocabulary (rdf, skos); Resources Repository; GBIF IPT; Scratchpads; DwC-A extensions & controlled vocabularies.
  16. 16. Semantic wiki forum for terms
     Wiki forum for terms as an open community platform for description and maintenance of existing terms. Replacement tool also for the GBIF Vocabulary Server?
     Evaluation of various tools for collaborative management of concept vocabularies (RDF).
     Diagram labels: Wiki Vocabulary Management; ISOcat Vocabulary Management; MS Excel Template for Vocabularies; Concept Vocabulary (rdf, skos); Resources Repository; GBIF IPT; Scratchpads; DwC-A extensions & controlled vocabularies; Wiki Forum for Terms.
  17. 17. GBIF Term browser
     The GBIF Term Browser allows a user to browse for terms defined in widely used concept vocabularies such as Darwin Core, Dublin Core, FOAF, etc., including, where available, translations.
     Concept vocabularies (rdf, skos) are stored/deposited at the Resources Repository.
     Evaluation of various tools for collaborative management of concept vocabularies (RDF).
     Diagram labels: Wiki Vocabulary Management; ISOcat Vocabulary Management; MS Excel Template for Vocabularies; Concept Vocabulary (rdf, skos); Resources Repository.
  18. 18. Biodiversity ontology management
     REUSE terms from RDF vocabularies.
     •  Evaluation of tools for the development of biodiversity ontologies.
     •  Evaluation of biodiversity ontology repository solutions.
     Diagram labels: Concept Vocabulary (rdf, skos); Ontologies (rdf, owl); Wiki tool inc. Ontology Management ??; Resources Repository (incl. ontologies?).
  19. 19. BioPortal ontology repository
     Proposal: establish a biodiversity “slice” at the NCBO BioPortal.
     •  Loading biodiversity ontologies into the NCBO BioPortal promotes mapping (and reuse of terms) between bio-medical and biodiversity ontologies.
     •  An instance of the BioPortal software for biodiversity requires long-term obligations to host and maintain the resource. Does GBIF, for example, have the resources to offer to host a BioPortal instance?
  20. 20. GBIF KOS resources
     Concept vocabularies (skos:ConceptScheme, RDF)
     •  Darwin Core, Darwin Core “extensions”, NCD, GNA, Audubon Core (and other vocabularies of concepts).
     as a basis and foundation for
     Software application schema (XML, XML schema)
     •  Darwin Core Archive (DwC-A) extensions and controlled value vocabularies.
     •  Resources such as the DwC-A extensions and controlled value vocabularies REUSE terms (URI) from a vocabulary of terms.
  21. 21. Biodiversity KOS (based on Darwin Core)
     Darwin Core (DwC) is a flat list of terms, expressed using RDF.
     → DwC “extensions” (flat vocabularies for declaration of concepts).
     → Reuse concepts from other vocabularies whenever possible.
     Darwin Core Archive (DwC-A) has a star schema model.
     •  DwC-A core(s), extensions and controlled value vocabularies are declared as XML lists of terms.
     •  DwC-A resources should REUSE terms from Darwin Core and other flat concept vocabularies.
     •  New DwC-A core types (data types), e.g. sample? Formalize class entities (ontology). [Current types: Taxon & Occurrence]
     → Formalize a governance structure for maintaining KOS resources based on the principles established for Darwin Core (towards TDWG VoMaG).
  22. 22. Darwin Core Archive (DwC-A)
     •  A DwC-A publishes DwC records, including terms from DwC-A extensions.
     •  Simple text-based format.
     •  Zipped single-file archive.
     Example data file: Germplasm.txt
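Because the archive is a zipped set of plain text files, it can be read with standard tools. The Python sketch below builds a tiny stand-in archive in memory and reads one data file back. It assumes a tab-delimited file with a header row; the column names and values are hypothetical and do not reflect the real Germplasm extension, and a real reader would consult the archive's descriptor for delimiters and column mappings rather than guessing.

```python
import csv
import io
import zipfile


def read_archive_table(archive_bytes, member):
    """Read one delimited text file from a zipped archive into dict rows.

    Assumes a tab-delimited file with a header row.
    """
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text, delimiter="\t"))


def build_demo_archive():
    """Create a tiny in-memory archive so the example is self-contained."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Hypothetical columns and values, for illustration only.
        archive.writestr("Germplasm.txt", "id\tcollectionCode\n1\tMammals\n")
    return buffer.getvalue()


if __name__ == "__main__":
    rows = read_archive_table(build_demo_archive(), "Germplasm.txt")
    print(rows)
```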
  23. 23. Darwin Core Archive extension (XML term list)
  24. 24. Concept vocabulary (RDF/SKOS). In progress: XSLT -> HTML for human-readable version.
  25. 25. GBIF Vocabulary Server
     The GBIF Vocabulary Server can assist a user to create and manage DwC-A extensions or controlled value vocabularies. However, it is not designed to create RDF/SKOS concept vocabulary resources with reusable concepts. It can export XML, but not RDF. It is based on Scratchpads (v1), a.k.a. Drupal 6.
     Screenshot labels: edit interface; XML export.
  26. 26. Global Names Architecture (GNA)
     Many of the GNA term URI identifiers do not resolve (404 not found).
     The rowType identifiers simply resolve to the software application schema (to the DwC-A extension).
     We propose to formalize the GNA concept declarations using RDF/SKOS for improved re-usability of the GNA terms and concepts.
  27. 27. Global Names Architecture (GNA)
     The Global Names Architecture (GNA) terms were originally simply declared by the DwC-A extension. We propose to formalize the GNA concept declarations using RDF/SKOS for improved re-usability of the GNA terms.
     Figure panels: RDF/SKOS; XML.
  28. 28. Global Names Architecture (GNA)
     We propose to formalize the GNA concept declarations using RDF/SKOS for improved re-usability of the GNA terms.
     Figure panel: RDF/SKOS.
  29. 29. Darwin Core Archive extensions
     •  Global Names Architecture (GNA)
     •  Audubon Core (multimedia)
     •  Invasive species (GISIN)
     •  Genetic Resources (Germplasm)
     •  Natural Collections Description (NCD)
     •  Metadata profile (EML)
     •  EOL species profile
     •  Taxonomic Concept Schema (TCS)
     •  Genomics Standards Consortium (GSC)
     •  Meta-genomics (?)
     •  ABCD (?)
     •  …
  30. 30. Controlled value vocabularies
     •  Geological time periods
        •  chronostratigraphy
        •  magnetostratigraphy
     •  Species interactions
        •  saproxylic interactions
        •  pollinators
     •  Country codes
     •  Language
     •  Basis of record
     •  Taxonomic rank
     •  Nomenclatural status
     •  Life form
     •  Life stage
     •  …
  31. 31. A proposed workflow / brainstorming
  32. 32. Versioning resources
     Move outdated vocabularies to a separate folder named “deprecated”? No versions? Will IPT be aware of this folder? Note that previous DwC-A datasets could be mapped to deprecated vocabulary resources…!
  33. 33. Versioning resources
     Version the DwC-A vocabularies and extensions using a [_DATE] postfix. Could IPT be made aware of this postfix? Note that previous DwC-A datasets could be mapped to outdated vocabulary resources…!
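A consumer of date-postfixed resources would need to pick the newest file for each resource. The Python sketch below shows one way this could work. The slide does not specify the date format, so an ISO `YYYY-MM-DD` postfix is assumed here, and the file names are hypothetical.

```python
import re
from datetime import date

# Assumed naming convention: <name>_<YYYY-MM-DD>.<extension>
VERSIONED = re.compile(r"^(?P<name>.+)_(?P<date>\d{4}-\d{2}-\d{2})\.(?P<ext>\w+)$")


def latest_versions(filenames):
    """Map each unversioned resource name to its most recent dated file."""
    latest = {}
    for filename in filenames:
        match = VERSIONED.match(filename)
        if not match:
            continue  # no date postfix, so not a versioned resource
        key = (match["name"], match["ext"])
        version = date.fromisoformat(match["date"])
        if key not in latest or version > latest[key][0]:
            latest[key] = (version, filename)
    return {
        f"{name}.{ext}": filename
        for (name, ext), (_, filename) in latest.items()
    }


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    files = [
        "germplasm_2011-10-01.xml",
        "germplasm_2012-08-31.xml",
        "readme.txt",
    ]
    print(latest_versions(files))
```

Older files stay in place under this scheme, so datasets mapped to an earlier version can still resolve it.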
  34. 34. Versioning RDF vocabularies
     Move outdated vocabularies to a subfolder named “archive/[DATE]”? Same versioning model for extensions and vocabularies…?
  35. 35. Versioning RDF vocabularies
     Deprecated and outdated vocabularies and DwC-A resources could declare their status, e.g. using dcterms:isReplacedBy…?
     Drawback: the XML document must be accessed and parsed to read the resource status.
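The drawback noted on this slide is visible in even a minimal reader: the document has to be fetched and parsed before its status is known. The Python sketch below parses an RDF/XML snippet and looks for `dcterms:isReplacedBy`. The example document and its `example.org` URIs are hypothetical, and the sketch only handles the `rdf:resource` attribute form of the property.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
}
RDF_RESOURCE = "{%s}resource" % NAMESPACES["rdf"]

# Hypothetical resource description, for illustration only.
EXAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://example.org/vocabulary/2011">
    <dcterms:isReplacedBy rdf:resource="http://example.org/vocabulary/2012"/>
  </rdf:Description>
</rdf:RDF>"""


def replaced_by(rdf_xml):
    """Return the URI that replaces this resource, or None if still current."""
    root = ET.fromstring(rdf_xml)
    element = root.find(".//dcterms:isReplacedBy", NAMESPACES)
    if element is None:
        return None
    return element.get(RDF_RESOURCE)


if __name__ == "__main__":
    print(replaced_by(EXAMPLE))
```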
  36. 36. Versioning vocabulary resources
     •  Separate folder named “deprecated”?
     •  Postfix using [_DATE]?
     •  Subfolder named “archive/[DATE]”?
     •  dcterms:isReplacedBy
     •  Other ideas, solutions?
  37. 37. A proposed workflow
  38. 38. Translation of vocabulary term descriptions
     •  Expert working groups or a collaborative expert community develop new translations or refine previous translations.
     •  Export working file format from the SKOS file (RDF/SKOS → CSV).
     •  The expert group provides their output as a CSV file, XML data or as a SKOS/RDF resource.
     •  Translations for a given vocabulary of terms are maintained and published as a SKOS/RDF file at the GBIF Resources Repository.
     •  Archive the translations each time the “active” SKOS file is updated.
     Diagram labels: Term translations (SKOS/RDF), dwc_translations.rdf; Archive (SKOS/RDF), [DATE]/dwc_translations.rdf.
  39. 39. Example: master SKOS/RDF resource (languages shown: en, es, zh, ja)
  40. 40. Workflow for term translation
     XSLT split and merge cycle: Term translations (SKOS/RDF), dwc_translations.rdf → XSLT → per-language CSV files → expert group → updated CSV files → XSLT → dwc_translations.rdf.
     Per-language CSV files: dwc_translations_de.csv, dwc_translations_es.csv, dwc_translations_fr.csv, dwc_translations_jp.csv, dwc_translations_ru.csv, dwc_translations_zh_Hans.csv, …
     Updated by the expert group: dwc_translations_fr.csv (*), dwc_translations_pt.csv (**)
     Adding new term translations or updating previous term translations always starts and ends with the “active” SKOS/RDF resource for translations.
     (*) Updated CSV files with translations simply replace extracted previous translations in the XSLT split and merge cycle.
     (**) Translations to a new language are added simply by adding the CSV resource into the XSLT cycle.
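The split and merge cycle can be sketched without XSLT to show the data flow. The Python below swaps XSLT for plain functions and represents the master resource as a nested dictionary instead of SKOS/RDF, so it illustrates the cycle only, not the proposed implementation. The CSV column names (`term`, `translation`) and the example translations are assumptions.

```python
import csv
import io


def split_by_language(master):
    """Split the master translations into one CSV text per language.

    `master` maps term name -> {language code -> translated description}.
    """
    rows_by_lang = {}
    for term, labels in master.items():
        for lang, text in labels.items():
            rows_by_lang.setdefault(lang, []).append((term, text))
    files = {}
    for lang, rows in rows_by_lang.items():
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(["term", "translation"])
        writer.writerows(sorted(rows))
        files[lang] = buffer.getvalue()
    return files


def merge_language(master, lang, csv_text):
    """Return a new master in which `lang` is replaced by the CSV content.

    A language not yet in the master is simply added.
    """
    merged = {
        term: {code: text for code, text in labels.items() if code != lang}
        for term, labels in master.items()
    }
    for row in csv.DictReader(io.StringIO(csv_text)):
        merged.setdefault(row["term"], {})[lang] = row["translation"]
    return merged


if __name__ == "__main__":
    # Hypothetical translations, for illustration only.
    master = {"collectionCode": {"en": "Collection code", "fr": "Code de collection"}}
    files = split_by_language(master)
    print(files["fr"])
    updated = merge_language(
        master, "pt", "term,translation\ncollectionCode,Código da coleção\n"
    )
    print(updated)
```

Merging returns a new master rather than editing in place, which matches the idea of archiving the previous "active" resource before each update.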
  41. 41. New data types?
     Genomic level observations
     Ecological measurements associated with observations
     - complement, not duplicate work
     - GBIF as premier gateway to discovery, access
     A roadmap developed by Q1 2013:
     - genomic data
     - ecological data
  42. 42. Metadata
     Essential for discovery and access to new data types.
     The GBIF metadata catalogue system allows interoperability across distributed metadata repositories: http://metadata.gbif.org
     The challenge ahead ... populating the catalogue with high quality, complete metadata.
  43. 43. GBIF KOS work-program: some suggested next steps
     •  GBIF Resources Repository
        •  Further development of new DwC-A extensions and controlled value vocabularies.
        •  Workflow for the translation of term descriptions.
     •  Continue the evaluation of collaborative tools for management of flat vocabularies of terms (RDF/SKOS).
        •  Semantic Wiki, ISOcat, Protégé (web-protégé), …
     •  New semantic Wiki for description of terms / glossary of terms / community-driven discussion forum (with JKI, Gregor Hagedorn).
        •  Discussion, discovery and REUSE of existing terms.
     •  NCBO BioPortal as a repository for biodiversity ontologies.
     •  Will GBIF contribute to mint new biodiversity ontologies?
        •  BFO based OWL version of Darwin Core…?
     •  KOS governance structure developed and formalized by the (TDWG) Vocabulary Management Task Group (VoMaG).
     •  Roadmap for KOS into the GBIF infrastructure, portal, …!
  44. 44. Furthermore, I think that we need persistent identifiers! Cato the Elder ended all his speeches in the senate of Rome with: "Ceterum autem censeo Carthaginem esse delendam" (English: "Furthermore, I think Carthage must be destroyed").