This document summarizes a presentation on metadata analysis of germplasm collections in the agINFRA project. It describes two main germplasm data sources: the Chinese Crop Germplasm Information System and the Italian National Germplasm Database. It discusses the schemas and descriptors used, the mappings between schemas, and a linked data approach to connecting the different data sources. The overall goal is to facilitate interoperability between global germplasm databases.
1. Metadata analysis of germplasm collections
The case of agINFRA
Dr. Vassilis Protonotarios
Agricultural Biotechnologist, PhD
Agro-Know Technologies, Greece
e-Conference on Germplasm Data Interoperability
Session 2: “Status of data and metadata for germplasm”
2. Structure of the presentation
1. The agINFRA germplasm data sources
– Chinese Crop Germplasm Information System
– Italian National Germplasm Database
2. Current status
– Mappings
– Linked Data approach
3. Conclusions
4. agINFRA germplasm data sources
• Italian Germplasm Database (CRA)
– Data available through EURISCO -> GENESYS
– Uses EURISCO set of descriptors
– Data also available through GBIF
• Chinese Crop Germplasm Information System
(CGRIS/CAAS)
– Data unavailable through aggregators
– Own schema used for description of germplasm
accessions
– Metadata exposure in CSV
5. agINFRA germplasm data analysis
1. Analysis of agINFRA germplasm data sources
2. Analysis of metadata schemas used
3. Identification of external schemas
– Review of existing work
4. Definition of a base schema (descriptors)
5. Mappings of various schemas to the base
one
6. Development of a linked data approach for
linking germplasm data sources
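Steps 4 and 5 above (a base schema plus mappings to it) can be sketched as a simple field crosswalk. This is only an illustrative sketch, not the agINFRA mapping itself: the source field names (`unit_code`, `accession_id`, and so on) and the sample record are hypothetical, and only the target names are taken from the MCPD descriptor list.

```python
# Illustrative crosswalk from a source schema to base-schema descriptors.
# ASSUMPTION: the source field names on the left are hypothetical; the
# targets on the right are MCPD passport descriptor names.
SOURCE_TO_BASE = {
    "unit_code": "INSTCODE",
    "accession_id": "ACCENUMB",
    "genus_name": "GENUS",
    "species_name": "SPECIES",
    "origin_country": "ORIGCTY",
}


def map_record(record, mapping):
    """Split a source record into mapped base-schema fields and leftovers.

    Returns (mapped, unmapped) so that fields without a counterpart in the
    base schema stay visible instead of being silently dropped.
    """
    mapped = {}
    unmapped = {}
    for field, value in record.items():
        target = mapping.get(field)
        if target is None:
            unmapped[field] = value
        else:
            mapped[target] = value
    return mapped, unmapped


# Hypothetical sample record, for illustration only.
sample = {
    "unit_code": "XXX001",
    "accession_id": "ZM000001",
    "genus_name": "Triticum",
    "species_name": "aestivum",
    "growth_habit": "winter",
}
mapped, unmapped = map_record(sample, SOURCE_TO_BASE)
```

Returning the unmapped fields separately makes it easy to see which source descriptors still lack a counterpart in the base schema.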
7. Chinese Crop Germplasm
Information System (CGRIS)
• Provided by: Chinese Academy of Agricultural Sciences
• A central repository for all types of plant genetic resources
information. It consists of six subsystems:
1. The management system of the National Crop Gene Bank (NCGB),
2. The management system of the long-term storage in Qinghai,
3. The management system of National germplasm Resources
Nursery,
4. The crop characterization and evaluation database system,
5. The database system for germplasm exchange at home and
abroad and
6. The management system of the medium-term storage in Beijing.
URL: http://icgr.caas.net.cn/cgrisintroduction.html
8. CGRIS: Data
At present, CGRIS holds
• > 2000 MB of data on 180 kinds of crops
– including food crops, fibre plants, oil crops, vegetables, fruit trees, tea, mulberry, tobacco, sugar crops, green manure crops, tropical crops etc.
• 390,000 accessions of germplasm
15. CGRIS Metadata
• CGRIS germplasm descriptors are based on its own schema
– can be seen as the de facto standard for germplasm accession information in China.
– Based on metadata schema standards such as those developed by IPGRI (Bioversity) and GRIN
18. CGRIS Metadata: Next steps
• A mapping to the Multi-crop Passport Descriptors (MCPD) standard is planned
– According to CAAS subject experts, such a mapping should be fairly easy to produce.
19. CGRIS: Exposing data
• Data stored in relational DBs
• Hosted in an SQL server
• Exposure of data as CSV files (partially in
Chinese)
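Reading such a CSV export with partly Chinese headers might look like the sketch below. It rests on several assumptions: the three header names, their English equivalents and the sample row are invented for illustration, and the real exports may use a different encoding (for example GB18030 rather than UTF-8).

```python
import csv
import io

# ASSUMPTION: hypothetical Chinese column headers and English equivalents;
# the actual CGRIS export columns are not given in the presentation.
HEADER_TRANSLATIONS = {
    "统一编号": "accession_number",
    "作物名称": "crop_name",
    "原产地": "origin",
}


def read_accessions(csv_file, translations):
    """Read CSV rows and rename known headers; unknown headers are kept."""
    rows = []
    for row in csv.DictReader(csv_file):
        rows.append(
            {translations.get(key, key): (value or "").strip()
             for key, value in row.items()}
        )
    return rows


# Invented sample data standing in for a real export file.
SAMPLE_CSV = "统一编号,作物名称,原产地\nZM000001,小麦,北京\n"
rows = read_accessions(io.StringIO(SAMPLE_CSV), HEADER_TRANSLATIONS)

# For a real file, the encoding has to match the export, e.g.:
# with open("export.csv", encoding="gb18030", newline="") as handle:
#     rows = read_accessions(handle, HEADER_TRANSLATIONS)
```

Only the headers are renamed here; cell values that are in Chinese are passed through unchanged and would need a separate controlled-vocabulary mapping.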
20. CGRIS: IPR information
• The CGRIS website is public and accessible to everybody. The information is provided free of charge but is subject to copyright.
• With regard to data exchange, there is no explicit policy to follow.
• CGRIS does not have an Open Access mandate, and the members of the CGRIS network apply their own institutional policies.
22. Italian Germplasm Database
• Provided by: Italian Council for Research and
Experimentation in Agriculture
• Developed in the context of the “Plant Genetic
Resources/FAO” project in 2004
– Research Centres and Units of the CRA
– The Institute of Plant Genetics of the CNR in Bari,
– NGO “Rete Semi Rurali”
– University collections (Perugia, Potenza etc.)
URL: http://fru.entecra.it
24. CRA Germplasm: Data
Current status of germplasm data (CRA)
• 20,954 records from Italy are included in EURISCO, of which 17,212 are from CRA
• 28,509 records for 275 plant species are in the National Inventory (in general)
– the National Inventory does not allow the number of CRA germplasm records to be identified
29. CRA Metadata
• Most CRA institutional databases use the
MCPD
– however, in the records provided to the National
Inventory several fields are often left empty.
• Some CRA collections also use descriptors
defined by
– the Union for the Protection of New Varieties of
Plants (UPOV) and
– the National Register of New Varieties.
• Ensure mapping to the Multi-crop Passport
Descriptors (MCPD)/EURISCO
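The remark above that several fields are often not filled could be quantified with a per-descriptor fill rate before attempting a mapping. The sketch below is an assumption-laden illustration: the descriptor subset is taken from MCPD, but the two sample records and their values are invented.

```python
from collections import Counter

# A small subset of MCPD passport descriptors, used here for illustration.
DESCRIPTORS = ["INSTCODE", "ACCENUMB", "GENUS", "SPECIES", "ORIGCTY"]


def fill_rates(records, descriptors):
    """Return the share of records with a non-empty value per descriptor."""
    filled = Counter()
    for record in records:
        for name in descriptors:
            value = record.get(name)
            # Missing keys, None and whitespace-only strings count as empty.
            if value is not None and str(value).strip():
                filled[name] += 1
    total = len(records)
    return {
        name: (filled[name] / total if total else 0.0)
        for name in descriptors
    }


# Invented sample records, not real inventory data.
sample_records = [
    {"INSTCODE": "XXX001", "ACCENUMB": "A1", "GENUS": "Vitis", "SPECIES": ""},
    {"INSTCODE": "XXX001", "ACCENUMB": "A2", "GENUS": "Vitis",
     "SPECIES": "vinifera", "ORIGCTY": "ITA"},
]
rates = fill_rates(sample_records, DESCRIPTORS)
```

Descriptors with a low fill rate are the ones where a mapping to MCPD/EURISCO would carry little information in practice.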
30. CRA: IPR information
• The CRA website is public and accessible to everybody. The information is provided free of charge but is subject to copyright.
• The Multilateral System (MLS) of the Treaty requires free availability of information on the PGRFA that are under the management and control of the Contracting Parties and in the public domain (Treaty, Art. 11.2).
• This excludes
– germplasm accessions that are subject to IPR and
– accessions under other legally binding protection that restricts the Contracting Party’s control over the material.
• Accessions that are not covered by IPR include old and autochthonous varieties, crop wild relatives and other material found in in-situ conditions, new cultivars not protected by IPR and cultivars whose IPR have expired.
32. Current status
• First version of mappings is available
• EURISCO descriptors used as base schema, with mappings for:
– MCPD
– Darwin Core for Genebanks
– ABCD
– CGRIS
– CRA
37. Linked Data
• A linked data approach will be used by
agINFRA for linking germplasm data sources
• OpenAGRIS already aggregates germplasm
data using AGROVOC
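As a rough sketch of what publishing an accession as linked data could involve, the code below serializes one passport record as N-Triples using only the standard library. The `example.org` namespaces, the URI pattern and the sample record are assumptions made for illustration; the presentation does not specify the vocabulary or URIs agINFRA uses. The `dcterms:subject` property is a real Dublin Core term, used here as one possible way to point at a thesaurus concept such as an AGROVOC URI.

```python
from urllib.parse import quote

# ASSUMPTION: hypothetical namespaces, not the ones used by agINFRA.
ACCESSION_NS = "http://example.org/accession/"
DESCRIPTOR_NS = "http://example.org/descriptor/"
DCT_SUBJECT = "http://purl.org/dc/terms/subject"


def escape_literal(text):
    """Escape the characters that N-Triples string literals require."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def accession_to_ntriples(record, concept_uri=None):
    """Serialize one passport record as a list of N-Triples lines.

    The subject URI is built from INSTCODE and ACCENUMB, which together
    identify an accession; each descriptor becomes a plain literal.
    """
    subject = "<{}{}/{}>".format(
        ACCESSION_NS,
        quote(str(record["INSTCODE"]), safe=""),
        quote(str(record["ACCENUMB"]), safe=""),
    )
    lines = []
    for name, value in sorted(record.items()):
        lines.append(
            '{} <{}{}> "{}" .'.format(
                subject, DESCRIPTOR_NS, name, escape_literal(str(value))
            )
        )
    if concept_uri:
        # Optional link to a shared concept (e.g. a thesaurus entry).
        lines.append("{} <{}> <{}> .".format(subject, DCT_SUBJECT, concept_uri))
    return lines


# Invented record and placeholder concept URI.
triples = accession_to_ntriples(
    {"INSTCODE": "XXX001", "ACCENUMB": "A 1"},
    concept_uri="http://example.org/concept/wheat",
)
```

Because two data sources that describe the same crop can point at the same concept URI, records from different genebanks become linkable without merging their schemas.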
38. Conclusions
• Both schemas / sets of descriptors can be
mapped to the EURISCO ones
• Linked Data approach will facilitate linking of
germplasm data from CRA/CGRIS
• EURISCO descriptors to be published as linked
data
– To be used as the base of passport data
• Linking to other germplasm standards
– e.g. Darwin Core for Genebanks*
*https://code.google.com/p/darwincore-germplasm/wiki/DarwinCoreGermplasmMapping
39. Take home message
• The identification of common properties
between different metadata schemas will
facilitate the linked data framework
40. (Indicative) List of References
• agINFRA Deliverable D2.3 “Review of Content
Requirements”
• agINFRA Deliverable D5.3 “Conceptual
specification of linked agricultural data
framework”
• agINFRA Germplasm Working Group Wiki
http://wiki.aginfra.eu/index.php/Germplasm_Working_Group
• EURISCO passport descriptors
http://www.ecpgr.cgiar.org/germplasm_databases.html
• Draft Mapping of EURISCO Descriptors to ABCD
2.06 http://www.bgbm.org/TDWG/CODATA/Schema/Mappings/EURISCO-2-ABCD.pdf