Global RDF Descriptors for Germplasm Data


Presentation delivered at the Agricultural Data Interoperability WG meeting, during the RDA 3rd Plenary Meeting in Dublin, Ireland, on 26 March 2014.

The presentation focuses on the work of the agINFRA project towards a methodology for defining germplasm descriptors as RDF. The methodology builds on the existing work of experts in the field and reuses existing efforts in this direction.

  • Heterogeneous data types and formats
  • OAI-PMH harvesting is not an option in the case of germplasm data
  • https://code.google.com/p/darwincore-germplasm/wiki/ToC: http://purl.org/germplasm/vocabulary
  • Global RDF Descriptors for Germplasm Data

    1. Global RDF Descriptors for Germplasm Data
       Vassilis Protonotarios
       Agricultural Biotechnologist, PhD
       Agro-Know, Greece
       RDA 3rd Plenary Meeting, Dublin, Ireland
       Agricultural Data Interoperability Group Meeting
    2. Background
    3. Connecting the pieces
       • agINFRA Germplasm Working Group
       • Agricultural Data Interoperability IG
       • Germplasm Data Analysis
       • Agricultural linked data layer
    4. The agINFRA project
       • A project funded under the FP7 programme of the EC
       • Consortium with expertise on
         – Technology / infrastructures
         – Data / data management
       • Combined to facilitate agricultural data sharing
       • More info at: www.aginfra.eu
    5. The agINFRA project
       • Aims to enhance interoperability between agricultural data sources
         – Data sharing by
           • Metadata aggregation & linking data
       • Design and deploy the linked ag-data framework
         – Methodology for linking data
         – Provide the infrastructure needed
           • Both cloud- and grid-based services
           • Tools, APIs etc.
    6. agINFRA major data types
       • Bibliographic
       • Agri Statistics & Economics
       • Educational
       • Germplasm
       • Soil data
       • Profiles
       • Raw data
       • Other?
    7. Focusing on germplasm
       [Diagram: data flow from local databases to national databases to aggregators]
       • Local databases: Italian university, Italian research center, Chinese research center
       • National databases: Italian, Chinese
       • Aggregators: GENESYS, EURISCO, GBIF
    8. Focusing on germplasm
       [Diagram: the same layout as slide 7, with GENESYS and EURISCO as the aggregators]
    9. The issue?
       • Heterogeneity!
         – Data types
         – Data formats
         – Data management workflows
         – Standards used
         – Metadata exposure options
         – …
       • Lack of connectivity with other data sources
    10. The agINFRA Germplasm Working Group
    11. The Germplasm Working Group
       • Created in the context of the agINFRA project
       • Initially included agINFRA stakeholders
         – now expanded to host all stakeholders
       • The group is NOT a group of experts on germplasm data!
    12. The scope of the agINFRA Germplasm WG
       • Enable/enhance interoperability between germplasm databases
         – By developing the services for
           • exchanging their data and
           • delivering their data to other partners
       • Focusing on three actions:
         1. Identify
         2. Organize & Reuse
         3. Propose
    13. agINFRA Germplasm WG objectives
       • IDENTIFY: collect all information related to germplasm data
         – People/groups
         – Namespaces (metadata, KOS)
         – Standards
         – Workflows
         – Events
       • ORGANIZE & REUSE: engage all stakeholders & available resources, analyze existing standards, facilitate collaboration
       • PROPOSE: linked data framework to connect data sources
         – facilitate data sharing between germplasm data sources
    14. Germplasm related information
       • Data management workflows
       • Metadata schemas
       • Working groups in germplasm
       • Events (for connecting stakeholders)
       • KOS (ontologies, thesauri, vocabularies etc.)
       • Data exposure capabilities
    15. Germplasm related information (same items as slide 14)
    16. The Germplasm WG wiki
       • Central point of reference
       • Freely accessible (no login required)
       • http://wiki.aginfra.eu/index.php/Germplasm_Working_Group
    17. Key outcomes of the group (1)
       • Dossier on Germplasm Information:
         – Major programs
         – Major information systems and services
         – agINFRA germplasm data sources (CGRIS & CRA)
         – Core standards for germplasm information
         – Plant nomenclature, taxonomies and ontologies
         – Plant genomic resources
         – Related references and links
       • Freely available from the Germplasm Group wiki
    18. Key outcomes of the group (2)
    19. Key outcomes of the group (3)
       • Speakers from key players in the biodiversity data field
         – GBIF, EURISCO, GENESYS, CGIAR, EGFAR, CRA etc.
       • Aimed to provide the basis for the linked germplasm data framework
    20. Existing work
    21. DwC-G KOSs
       • Germplasm Term Vocabulary
         – A vocabulary of terms for describing and annotating germplasm information resources
         – http://purl.org/germplasm/germplasmTerm#TERM
       • Germplasm Type vocabulary
         – A list of controlled values for some of the germplasm terms
         – http://purl.org/germplasm/germplasmType#TYPE
       • Germplasm ontology
         – Digitizes the terms contained in the PGR Descriptors publications and gives them persistent identifiers
         – http://purl.org/germplasm/ontology
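The URI patterns on slide 21 can be illustrated with a minimal Python sketch that builds term and type URIs from a local name. The two namespace bases are taken from the slide; the function names, the local-name check, and the sample name `exampleTerm` are assumptions made for illustration and are not actual vocabulary entries.

```python
import re

# Namespace bases as given on the slide.
GERMPLASM_TERM_NS = "http://purl.org/germplasm/germplasmTerm#"
GERMPLASM_TYPE_NS = "http://purl.org/germplasm/germplasmType#"

# Conservative pattern for a fragment identifier (an assumption, not a DwC-G rule).
_LOCAL_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_\-]*$")


def _build(namespace, local_name):
    """Append a checked local name to a namespace base."""
    if not _LOCAL_NAME.match(local_name):
        raise ValueError(f"not a safe local name: {local_name!r}")
    return namespace + local_name


def term_uri(local_name):
    """URI for an entry in the Germplasm Term Vocabulary."""
    return _build(GERMPLASM_TERM_NS, local_name)


def type_uri(local_name):
    """URI for an entry in the Germplasm Type vocabulary."""
    return _build(GERMPLASM_TYPE_NS, local_name)


if __name__ == "__main__":
    # "exampleTerm" is a placeholder, not a real vocabulary term.
    print(term_uri("exampleTerm"))
```

Building URIs through one checked helper keeps free-text values from leaking into identifiers.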
    22. DwC-G linked data
    23. DwC-SW
       • An ontology using Darwin Core terms to make it possible to describe biodiversity resources in the Semantic Web.
       • https://code.google.com/p/darwin-sw
    24. Bioversity Crop Descriptors
       • Crop Descriptors
         – Provide an international format and a universally understood language for plant genetic resources data.
         – They are targeted at farmers, curators, breeders, scientists and users and facilitate the exchange and use of resources.
         – Information includes such details as the plant's height, flowering patterns and ancestral history.
       • FAO/Bioversity Multi-crop Passport Descriptors (MCPD)
         – Originally published in 2001; widely used as the international standard to facilitate germplasm passport information exchange.
         – Now expanded to include emerging documentation needs; this new version resulted from consultation with more than 300 scientists from 187 institutions in 87 countries.
    25. Wheat descriptors
       • Descriptors for wheat and Aegilops (1978)
       • Descriptors for wheat (Revised) (1985)
         – Not available as Linked Data
    26. Methodology: towards the RDF germplasm descriptors
    27. Linked Data vocabularies
       • Metadata vocabularies: metadata sets, metadata element sets
         – They provide metadata elements to describe individual pieces of information in the data sets.
         – Example: Dublin Core is a vocabulary that prescribes the property dc:date for the publishing date of a document.
       • Value vocabularies (KOS): controlled vocabularies, authority data
         – They provide sets of values for (some of) the metadata elements.
         – Example: AGROVOC provides a set of values for agricultural topics that can be used as values for the dc:subject property.
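The difference between the two kinds of vocabulary on slide 27 can be sketched in Python as two N-Triples statements: dc:date takes a literal, while dc:subject points at a concept URI from a value vocabulary. The Dublin Core and AGROVOC namespace bases are real; the record URI under example.org and the concept id `c_0000` are placeholders assumed for illustration.

```python
# Real namespace bases.
DC = "http://purl.org/dc/elements/1.1/"
AGROVOC = "http://aims.fao.org/aos/agrovoc/"


def describe(record_uri, date, subject_concept_id):
    """Return two N-Triples lines for one record.

    The first uses a metadata element with a literal value (dc:date).
    The second uses a metadata element whose value comes from a value
    vocabulary (dc:subject pointing at an AGROVOC concept URI).
    """
    return [
        f'<{record_uri}> <{DC}date> "{date}" .',
        f"<{record_uri}> <{DC}subject> <{AGROVOC}{subject_concept_id}> .",
    ]


if __name__ == "__main__":
    # Both the record URI and the concept id "c_0000" are placeholders.
    for line in describe("http://example.org/doc/1", "2014-03-26", "c_0000"):
        print(line)
```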
    28. LOD guidelines (Berners-Lee, 2006)
       1. "Use URIs as names for things"
         – Concepts / values in value vocabularies, classes and properties in description vocabularies, and the vocabularies themselves have to be identified by URIs.
       2. "Use HTTP URIs so that people can look up those names"
         – The URIs for concepts / values, classes and properties, as well as vocabularies, have to resolve as HTTP URLs.
       3. "When someone looks up a URI, provide useful information"
         – The URLs for concepts, classes and properties, as well as vocabularies, have to return an HTML page with useful information when requested by browsers, or RDF when requested by RDF software. In addition, vocabularies should be available for querying behind a SPARQL endpoint.
       4. "Include links to other URIs, so that more things can be discovered"
         – The URIs of concepts, classes and properties should, whenever possible, be linked to URIs in other vocabularies, for instance as a close match of another concept or a sub-class of another class.
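Guideline 3 (HTML for browsers, RDF for RDF software) is usually implemented with HTTP content negotiation. The following Python sketch shows the decision only, under simplifying assumptions: the function name and the list of supported media types are hypothetical, and q-values in the Accept header are ignored.

```python
# Media types this hypothetical vocabulary server can return.
SUPPORTED = ("text/turtle", "application/rdf+xml", "text/html")


def choose_representation(accept_header):
    """Pick a media type for a vocabulary URI from an HTTP Accept header.

    Simplified: q-values are ignored and the first supported type listed
    by the client wins. Anything else falls back to the HTML page.
    """
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in SUPPORTED:
            return media_type
    return "text/html"


if __name__ == "__main__":
    print(choose_representation("text/turtle, */*;q=0.1"))
    print(choose_representation("text/html,application/xhtml+xml"))
```

A production server would honour q-values and typically answer with a 303 redirect to the chosen document; the sketch covers only the selection step.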
    29. Proposed methodology
       1. Analyze metadata schemas & KOSs used to describe germplasm resources
       2. Define attributes & vocabularies that can be used to expose germplasm resources in linked data format
       3. Provide a set of recommendations for the exposure of germplasm resources as linked data
       4. Embed the recommendations in the data infrastructure of agINFRA
         – to allow the exposure of germplasm resources as LOD
    30. The next steps
    31. Application of the linked agricultural data framework in germplasm
       1. Definition of base schema
         – Darwin Core for Germplasm to be used as base schema
           • Already available in SKOS
           • Vocabularies published as linked data
         – Germplasm Vocabularies
           • Germplasm Term Vocabulary
           • Germplasm Type Vocabulary
         – Germplasm ontology
       2. Publication of local classifications / lists for germplasm as LOD KOSs
         – if possible use DwC Types directly
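Step 2 above, publishing a local classification as a LOD KOS, can be sketched as a conversion of a local code list into SKOS concepts serialized as N-Triples. The SKOS and RDF namespaces are real; the scheme URI, the codes, the labels and the close-match target in the usage lines are placeholders assumed for illustration.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def escape(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def local_list_to_skos(scheme_uri, entries):
    """Turn (code, label, close_match_uri_or_None) tuples into N-Triples.

    Each entry becomes a skos:Concept in the given scheme. When a matching
    term in an existing KOS is known, it is linked with skos:closeMatch.
    """
    triples = [f"<{scheme_uri}> <{RDF_TYPE}> <{SKOS}ConceptScheme> ."]
    for code, label, close_match in entries:
        concept = f"{scheme_uri}/{code}"
        triples.append(f"<{concept}> <{RDF_TYPE}> <{SKOS}Concept> .")
        triples.append(f"<{concept}> <{SKOS}inScheme> <{scheme_uri}> .")
        triples.append(f'<{concept}> <{SKOS}prefLabel> "{escape(label)}"@en .')
        if close_match:
            triples.append(f"<{concept}> <{SKOS}closeMatch> <{close_match}> .")
    return triples


if __name__ == "__main__":
    # Placeholder scheme, codes, labels and match target.
    entries = [
        ("1", "Wild", "http://purl.org/germplasm/germplasmType#exampleType"),
        ("2", "Landrace", None),
    ]
    for line in local_list_to_skos("http://example.org/kos/status", entries):
        print(line)
```

Where a local value is identical to an existing DwC Type, reusing that URI directly, as the slide suggests, avoids creating a new concept at all.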
    32. Application of the linked agricultural data framework in germplasm
       3. Linking of terms in new KOSs to terms in existing KOSs
         – e.g. DwC Types, AGROVOC
       4. Link CAAS and CRA germplasm records using scientific name > AGROVOC
       5. Collaboration with technical partners
         – technical specifications on how to write procedures that extract the relevant data from the database and "triplify" them (i.e. both serialize them as RDF and use URIs instead of just strings whenever possible, also linking to AGROVOC URIs when possible)
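The "triplify" step described in item 5 can be sketched in Python as a function that takes one database row and emits N-Triples, using a URI for the record and an AGROVOC link when the scientific name is found in a lookup table. The Darwin Core and Dublin Core Terms namespaces are real; the row keys, the base URI, the lookup table and the concept id `c_0000` are assumptions made for illustration.

```python
DWC = "http://rs.tdwg.org/dwc/terms/"
DCT = "http://purl.org/dc/terms/"


def escape(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def triplify(row, base_uri, agrovoc_by_name):
    """Serialize one accession row as N-Triples.

    row             -- dict with hypothetical keys 'accession_id' and 'scientific_name'
    base_uri        -- prefix used to mint the record URI
    agrovoc_by_name -- dict mapping scientific names to AGROVOC concept URIs
    """
    subject = base_uri + str(row["accession_id"])
    name = row["scientific_name"]
    # The name itself stays a string literal.
    triples = [f'<{subject}> <{DWC}scientificName> "{escape(name)}" .']
    # Use a URI instead of a string whenever a concept is known.
    concept = agrovoc_by_name.get(name)
    if concept is not None:
        triples.append(f"<{subject}> <{DCT}subject> <{concept}> .")
    return triples


if __name__ == "__main__":
    lookup = {"Triticum aestivum": "http://aims.fao.org/aos/agrovoc/c_0000"}  # placeholder id
    row = {"accession_id": 42, "scientific_name": "Triticum aestivum"}
    for line in triplify(row, "http://example.org/accession/", lookup):
        print(line)
```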
    33. …and more next steps (optional)
       • Update the existing analysis with new data
       • Collect new user requirements
       • (Re)define the mappings between metadata schemas and KOSs (if needed)
       • Fine-tune the linked data approach
    34. Time plan
    35. Time plan
       • June 2014: Germplasm vocabularies
         – Metadata model: Darwin Core SW + DwC-G as the reference
         – Publish local classifications / lists for germplasm as LOD KOSs (if possible use DwC Types directly)
         – Link terms in new KOSs to terms in existing KOSs (e.g. DwC Types, AGROVOC)
         – Germplasm phenotypic values / classifications linked to Phenotypic Ontology terms?
    36. Time plan
       • August 2014: Germplasm RDF
         – Expose some RDF output and API access for germplasm datasets (basic DwC RDF, essentially basic passport descriptors)
         – Mandatory data for interlinking: scientific name OR AGROVOC term
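The "scientific name OR AGROVOC term" requirement can be expressed as a small check run before a record is exposed. This Python sketch assumes records are dicts and that the two fields are called `scientific_name` and `agrovoc_term`; both key names are hypothetical.

```python
def can_interlink(record):
    """Return True if a record carries the data needed for interlinking.

    A record qualifies when it has a non-empty scientific name OR a
    non-empty AGROVOC term.
    """
    def present(key):
        value = record.get(key)
        return isinstance(value, str) and value.strip() != ""

    return present("scientific_name") or present("agrovoc_term")


if __name__ == "__main__":
    records = [
        {"id": "a1", "scientific_name": "Triticum aestivum"},
        {"id": "a2", "agrovoc_term": "wheat"},
        {"id": "a3", "scientific_name": "  "},
    ]
    for rec in records:
        print(rec["id"], can_interlink(rec))
```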
    37. Time plan
       • October 2014: Consuming data from agINFRA services and components
         – Link CGRIS and CRA germplasm records using scientific name > AGROVOC
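One way to read "scientific name > AGROVOC" is to resolve each record's scientific name to an AGROVOC concept and then pair records from the two sources that share a concept. The Python sketch below follows that reading. The record layout, the lookup table and the concept id are assumptions; the slides do not specify how the linking is implemented.

```python
from collections import defaultdict


def link_by_concept(cgris_records, cra_records, concept_by_name):
    """Group record ids from two sources under a shared AGROVOC concept.

    concept_by_name maps lower-cased scientific names to concept URIs.
    Only concepts with at least one record on each side are returned.
    """
    grouped = defaultdict(lambda: {"cgris": [], "cra": []})
    for source, records in (("cgris", cgris_records), ("cra", cra_records)):
        for rec in records:
            key = rec["scientific_name"].strip().lower()
            concept = concept_by_name.get(key)
            if concept:
                grouped[concept][source].append(rec["id"])
    return {c: ids for c, ids in grouped.items() if ids["cgris"] and ids["cra"]}


if __name__ == "__main__":
    lookup = {"triticum aestivum": "http://aims.fao.org/aos/agrovoc/c_0000"}  # placeholder id
    cgris = [{"id": "cn-1", "scientific_name": "Triticum aestivum"}]
    cra = [{"id": "it-9", "scientific_name": "triticum aestivum "}]
    print(link_by_concept(cgris, cra, lookup))
```

Normalizing names before the lookup matters in practice, since the same taxon is often written with different case or spacing across databases.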
    38. Contact: vprot@agroknow.gr
       Image source: http://verastic.com/social/why-do-people-not-say-thank-you.html
