1. Accessing the original observation data captured during plant exploration missions for collecting crop diversity
Bioversity International, Via dei Tre Denari 472/a, Maccarese, Rome, Italy
Hannes Gaisberger, Massimo Buonaiuto, Federico Mattei, Andrea De Pirro,
Valentina Barbiero, Simone Mori, Imke Thormann, Tom Hazekamp, Elizabeth
Arnaud
2. Agenda
• Part 1: Safeguarding the original paper
documents by scanning and digitizing the data
– Hannes Gaisberger
• Part 2: Creation of a public repository of full
scanned documents enabling access to the
full text – Massimo Buonaiuto
3. Bioversity supported germplasm
collecting missions
• Since 1974, Bioversity International has
supported more than 550 germplasm collecting
missions yielding 225,875 samples and
covering 4,300 species from 137 countries
• Samples were sent to several genebanks
worldwide for safety duplication, conservation
and potential distribution
• Other CGIAR centers organized various
collecting missions for their mandate crops
4. Original observation data is essential for:
• Identifying duplicates between collections and gaps in diversity – value for genebank curators and collecting actions
• Tracking the original sample and country of origin in pedigrees – value for breeders and benefit sharing
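As a rough illustration of the first point, likely duplicates can be spotted by grouping genebank records on the collector's original identifiers. This is only a sketch under stated assumptions: the field names (`mission`, `collecting_number`, `institute`) and the sample records are invented for illustration and are not taken from the Bioversity database.

```python
from collections import defaultdict

def find_duplicates(accessions):
    """Group accessions by original collecting identity across institutes."""
    groups = defaultdict(set)
    for acc in accessions:
        # Normalise the key so that "ab-012 " and "AB-012" compare equal.
        key = (acc["mission"].strip().upper(),
               acc["collecting_number"].strip().upper())
        groups[key].add(acc["institute"])
    # A key held by more than one institute points to a likely duplicate.
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}

# Hypothetical records, for illustration only.
accessions = [
    {"mission": "M001", "collecting_number": "AB-012", "institute": "INST-A"},
    {"mission": "m001", "collecting_number": "ab-012 ", "institute": "INST-B"},
    {"mission": "M001", "collecting_number": "AB-013", "institute": "INST-A"},
]
duplicates = find_duplicates(accessions)
```

Keys that never appear in any institute would, by the same logic, indicate samples that were collected but cannot be traced to an accession.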
5. Scanning of field notebooks and related documents
• Collectors recorded key sample information (passport data) and other observation data in field books
6. Original observation: a treasure for
genebanks and breeders
• Genus and species
• Collecting number
• Site information: administrative boundaries, latitude, longitude and elevation
• Collecting source and sample status
The collecting form contains the botanical classification along with location details, environment, cultural practices, presence and symptoms of diseases and pests, and traditional uses
7. Identification and quality-checking
in databases
• Different publicly available genebank inventories are
checked in order to track corresponding samples and
complete missing passport data
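The checking step can be pictured as a lookup of each collected sample in a genebank inventory, followed by filling only the fields that are still empty. The sketch below assumes that records can be matched on a shared `collecting_number`; that key, the other field names and the sample values are hypothetical.

```python
def complete_passport(collected, inventory):
    """Fill empty passport fields of collected samples from matching inventory records."""
    by_number = {rec["collecting_number"]: rec for rec in inventory}
    completed = []
    for sample in collected:
        match = by_number.get(sample["collecting_number"], {})
        merged = dict(sample)  # never modify the original record
        for field, value in match.items():
            # Only fill gaps; values recorded by the collector take precedence.
            if merged.get(field) in (None, ""):
                merged[field] = value
        completed.append(merged)
    return completed

# Hypothetical data, for illustration only.
collected = [
    {"collecting_number": "AB-012", "genus": "Oryza", "latitude": None},
    {"collecting_number": "AB-099", "genus": "Zea", "latitude": None},
]
inventory = [
    {"collecting_number": "AB-012", "genus": "Oryza",
     "latitude": 12.5, "accession_id": "ACC-1"},
]
result = complete_passport(collected, inventory)
```

In this sketch the first sample gains a latitude and an accession link, while the second has no match in the inventory and is left unchanged.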
8. Integration of quality passport
data
• Data extracted from field books and databases is integrated into a sample-level database of collecting missions
9. Results in figures
• To date, the quality of 101,171
passport data records from 375
collecting missions has been
improved through data extracted from
scanned documentation
• 56,454 of these collected samples are
linked to genebank accessions in 51
institutes worldwide
Priority crop / use group    Number of collected samples
Forages                      44,056
Rice                         25,022
Maize                        16,484
Beans                        10,976
Wheat                        7,507
Cowpea                       7,473
Potato                       7,146
Pearl millet                 6,662
Barley                       4,429
Groundnut                    2,928
Finger millet                2,850
Chickpea                     1,467
Banana                       1,326
Pigeon pea                   999
Others                       86,550
Total                        225,875
• A total of 43,637 scanned pages are saved as 1,063 PDF files and stored in an online repository alongside the 26,000 other files scanned by CGIAR centers and partners
10. Publishing the data and attached
information
• End of 2010: work must be finished for Bioversity-supported missions
• Make the full text available in the online repository and publish the collecting mission database
• Visualization: map the sites where diversity was collected (after georeferencing with BioGeomancer)
• Support projects that address gap analysis and diversity analysis, such as Genesys; encourage partners to do the same work and share the full text and data, with links to CWR information, museum herbaria information and literature
11. Public access to the scanned collecting
missions documents
A Repository that presently contains 27,000 Collecting
Missions Files from CGIAR Centers and partners:
• Agricultural Research Centre (ARC) of Lao People’s
Democratic Republic
• AfricaRice
• International Institute of Tropical Agriculture (IITA)
• Bioversity International
• International Rice Research Institute (IRRI)
12. Typology of the documents produced by
Collectors
1) Mission Reports
2) Summary Forms
3) Sample lists
4) Collecting Forms
5) Accession Vouchers
6) Newsletters
7) Factsheets
8) Distribution lists
9) Field Books
18. Analysis of Metadata (5/5)
Darwin Core Germplasm metadata
+
Collecting Missions metadata
=
Metadata for Collecting Missions Documents
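The combination on this slide can be read as a merge of two vocabularies into one record per document. The sketch below shows one possible way to do that while keeping the two sources distinguishable; the prefixes and field names are illustrative assumptions, not the schema actually used in the repository.

```python
def build_document_metadata(germplasm_md, mission_md):
    """Combine germplasm and collecting-mission metadata into one document record."""
    # Prefix the keys so that the two vocabularies cannot collide.
    combined = {f"germplasm:{k}": v for k, v in germplasm_md.items()}
    combined.update({f"mission:{k}": v for k, v in mission_md.items()})
    return combined

# Hypothetical values, for illustration only.
germplasm_md = {"genus": "Oryza", "country": "LAO"}
mission_md = {"code": "M001", "documentType": "Field Book", "country": "LAO"}
document_md = build_document_metadata(germplasm_md, mission_md)
```

Prefixing is only one design option; it avoids silently overwriting a field such as `country` when both sources define it.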
19. How users will access the Repository
Alfresco DMS
Typo3 CMS
20. Import of 27,000 PDF Files
The PDF files are imported in 3 phases:
1. Conversion of institutional metadata into Darwin Core Germplasm metadata
2. Association of metadata with all PDF files, using heterogeneous sources (databases, Excel files, filenames, etc.)
3. Batch upload of all PDF files, each together with its associated metadata file in the DC-Germplasm standard.
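The three phases can be sketched as a small pipeline. Everything specific here is an assumption made for illustration: the institutional column names in `FIELD_MAP`, the target field names, and the filename pattern `<missionCode>_<documentType>.pdf`. The sketch stops at preparing the batch, since the actual upload interface is not described in the slides.

```python
import re

# Phase 1: hypothetical mapping from institutional columns to target terms.
FIELD_MAP = {"GENUS_NAME": "genus", "CTY": "country"}

def convert(institutional):
    """Rename known institutional fields; unknown fields are dropped."""
    return {FIELD_MAP[k]: v for k, v in institutional.items() if k in FIELD_MAP}

# Phase 2: assumed filename pattern, e.g. "M001_report.pdf".
FILENAME_RE = re.compile(r"^(?P<missionCode>[A-Za-z0-9]+)_(?P<documentType>[a-z]+)\.pdf$")

def associate(filename, db_rows):
    """Build the metadata of one PDF from its filename and a database lookup."""
    m = FILENAME_RE.match(filename)
    if m is None:
        return None  # the filename carries no usable information
    metadata = m.groupdict()
    metadata.update(convert(db_rows.get(metadata["missionCode"], {})))
    return metadata

# Phase 3: pair every file with its metadata, ready for a batch upload.
def build_batch(filenames, db_rows):
    batch, rejected = [], []
    for name in filenames:
        md = associate(name, db_rows)
        if md is None:
            rejected.append(name)
        else:
            batch.append((name, md))
    return batch, rejected

# Hypothetical inputs, for illustration only.
db_rows = {"M001": {"GENUS_NAME": "Oryza", "CTY": "LAO", "INTERNAL_ID": 17}}
batch, rejected = build_batch(["M001_report.pdf", "scan 17.pdf"], db_rows)
```

Files that cannot be matched end up in `rejected`, so they can be described by hand rather than uploaded without metadata.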
24. How users will manage and publish
documents
• Simple workflow to publish into the repository:
1. Upload the file to the private user Home Space
2. Edit the metadata
3. Approve the document for the public repository with a click
... and the file will be public
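The workflow above can be modelled as a three-state document. This is a conceptual sketch only, not the Alfresco API; the class, the state names and the rule that metadata must be edited before approval are assumptions added for illustration.

```python
from enum import Enum

class State(Enum):
    PRIVATE = "private"      # uploaded to the user's Home Space
    DESCRIBED = "described"  # metadata has been edited
    PUBLIC = "public"        # approved for the public repository

class Document:
    def __init__(self, filename):
        self.filename = filename
        self.metadata = {}
        self.state = State.PRIVATE  # step 1: upload

    def edit_metadata(self, **fields):
        """Step 2: add or change metadata while the file is still private."""
        if self.state is State.PUBLIC:
            raise ValueError("document is already published")
        self.metadata.update(fields)
        self.state = State.DESCRIBED

    def approve(self):
        """Step 3: publish with a single action."""
        if self.state is not State.DESCRIBED:
            raise ValueError("metadata must be edited before approval")
        self.state = State.PUBLIC

# Hypothetical document, for illustration only.
doc = Document("M001_report.pdf")
doc.edit_metadata(genus="Oryza")
doc.approve()
```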
25. Summary
• Improved quality of passport data for about 100,000
collected samples from 137 countries
• 56,454 of these collected samples are linked to
genebank accessions in 51 institutes worldwide
• 27,000 documents collected and classified into 9 document types, with metadata
• Metadata extracted and parsed using the Darwin Core Germplasm standard
26. Open questions and challenges
- Interaction with Open Archives Initiative standards and the Protocol for Metadata Harvesting (OAI-PMH)
- Integration with Crop Terminizer, University of
Manchester
- Web analytics for detailed monitoring of downloads (referrers, visits, etc.) and web marketing
- CMIS protocol used to interact with content
management systems
- Metadata validation with crop scientists, collectors
http://www.central-repository.cgiar.org/
27. Guidelines for collecting samples
- Being revised; they will be published in a new section of the Crop Genebank Knowledge Base
- Adding guidelines on taking photos that support the tentative taxonomy, the captured data and the GPS readings