Darwin Core extension for germplasm (11th December 2013)


Presentation on the Darwin Core germplasm extension for the "1st International e-Conference on Germplasm Data Interoperability: Session 2", 11th December 2013 (https://sites.google.com/site/germplasminteroperability/). Publishing germplasm information on plant genetic resources and their traits using the Darwin Core standard and the germplasm extension for genebanks.

  1. e-Conference on Germplasm Data Interoperability, December 11th 2013. Dag Endresen, GBIF-Norway (UiO).
  2. Why did we make a germplasm extension for Darwin Core? → Upgrade germplasm data pathways to use web services: the objective was to enable sharing of germplasm information using the standard web-service based biodiversity data publishing toolkits maintained by the Global Biodiversity Information Facility (GBIF) and the Biodiversity Information Standards (TDWG). → Upgrade data types to include trait data: the objective was to expand the germplasm data types published to germplasm data portals from basic passport data to include, in particular, crop trait information.
  3. Potential of the GBIF technology. 2,106,765 records of germplasm data (status 2013), http://data.gbif.org/datasets/network/2 and http://www.gbif.org/network/ae3a42e4-5829-4210-8d8a-84b0cbda47bc. Using GBIF/TDWG technology (and contributing to its development), the PGR community can more easily establish specific PGR networks without duplicating GBIF's work. The compatibility of data standards between PGR and biodiversity collections made it possible to integrate the worldwide germplasm collections into the biodiversity community (TDWG, GBIF).
  4. Multiple-purpose data export services: European Crop Databases, European EURISCO Catalog, Genebank dataset, GBIF, Global Crop Registries.
  5. genesys-pgr.org: 2,348,549 records of germplasm accessions. The GENESYS gateway to genetic resources provides access to information on more than 2.3 million genebank accessions, http://www.genesys-pgr.org/
  6. 1,074,136 records of germplasm accessions. The European Genetic Resources Search Catalogue (EURISCO) receives data from the National Inventories (NI) and provides access to all ex situ PGR accessions in Europe, http://eurisco.ecpgr.org
  7. (10 databases) (8 databases) (10 databases) (6 databases) (8 databases) (22 databases). A total of 64 ECPGR Central Crop Databases have been established by individual institutes and the ECPGR Working Groups. The databases hold passport data and, to varying degrees, characterization and primary evaluation data of the major collections of the respective crops in Europe, http://www.ecpgr.cgiar.org/germplasm_databases/central_crop_databases.html
  8. Possible Upgraded PGR Network Model. The National Inventory (NI) endorses all national gene banks for EURISCO. ECPGR Crop databases can access passport data from EURISCO and additional crop-specific data from the gene bank IPT interface. Each dataset is shared from the holding gene bank. Standard data sharing tools ensure that the genebank dataset is available to other relevant decentralized thematic, regional or global networks. Illustration from the GBIF annual report 2009, page 47.
  9. Background and context
  10. MCPD revisions: 1997, 2001, 2012
  11. May 2009
  12. Some of the data publishing toolkits: ICIS (Java, 1996 →), BioMOBY (Perl, 2001 →), EURISCO (tab-delimited, 2003 →), DiGIR (PHP, 2001-2006), TapirLink (PHP, 2007 →), BioCASE (Python, 2001 →), TAPIR PyWrapper3 (Python, 2006-2008), GBIF IPT (Java, 2009 →)
  13. Demo project in 2005 using BioCASE
  14. Mapping of MCPD → ABCD v2.06 was required before using BioCASE. Descriptors: National Inventory Code, Institute Code, Accession Number, Collecting Number, Collecting Institute Code, Genus, Species, Species Authority, "Subtaxa", "Subtaxa" Authority, Common Crop Name, Accession Name, Acquisition Date, Country of Origin, Location of Collection Site, Latitude of CS, Longitude of CS, Elevation of CS, Collecting Date of Sample, Breeding Institute Code, Biological Status of Accession, Ancestral Data, Collecting/Acquisition Source, Donor Institute Code, Donor Accession Number, Other Identification (Number) associated with the accession, Location of Safety Duplicates, Type of Germplasm Storage, Remarks, Decoded Collecting Institute, Decoded Breeding Institute, Decoded Donor Institute, Decoded Safety Duplication Location, Accession URL. Colour key: green = good match, orange = acceptable match, red = no match (was included as PGR extension in ABCD v2.06). Helmut Knüpffer (IPK Gatersleben); Walter Berendsohn (BGBM, Berlin). Berendsohn, W. and H. Knüpffer (2004-2006). Draft mapping of Eurisco descriptors to ABCD 2.06. Available at http://www.bgbm.org/tdwg/codata/Schema/Mappings/EURISCO-2-ABCD.pdf
  15. 2005: BioCASE demo. Genebank/germplasm extension to the ABCD v2.06
  16. Demo project in 2010 using the GBIF IPT
  17. Mapping of MCPD → Darwin Core was required before using the GBIF IPT. The Darwin Core germplasm extension was required for meaningful description of germplasm data sets using Darwin Core and the GBIF IPT. A mapping of MCPD terms to Darwin Core, plus some additional terms to describe germplasm: • breeding/cultivation event (source: MCPD), • crop trait experiments (source: EPGRIS3/ECPGR), • and international crop treaty regulations. The first DRAFT version was released in August 2009.
  18. 2010: IPT installations for EURISCO: EURISCO, NordGen (Nordic countries), Bioversity-Montpellier (France), IPK Gatersleben (Germany), BLE (Germany), WUR CGN (The Netherlands), CRI (Czech Republic), VIR (Russian Federation), SeedNET (Balkan), Baltic (Estonia, Latvia, Lithuania)
  19. Darwin Core. "The Darwin Core is primarily based on taxa, their occurrence in nature as documented by observations, specimens, and samples, and related information." • a well-defined standard core vocabulary • a flexible framework to maximize re-usability • approved as TDWG standard 2009. http://rs.tdwg.org/dwc/ Wieczorek J., D. Bloom, R. Guralnick, S. Blum, M. Döring, R. Giovanni, T. Robertson, D. Vieglais (2012). Darwin Core: An Evolving Community-Developed Biodiversity Data Standard. PLoS ONE 7(1): e29715. doi:10.1371/journal.pone.0029715
  20. Darwin Core star schema. Can relate elements one-to-one or one-to-many. Diagram labels: Germplasm, Breeder, Trait, Audubon core; relations shown: 1:many, 1:many, 1:many, 1:many, 1:1.
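The one-to-many relation in the star schema can be illustrated with a small in-memory join. This is only a sketch: the field names `id` and `coreid`, the record values, and the helper `attach_extension` are hypothetical and not taken from the slides.

```python
from collections import defaultdict

# Hypothetical core records, one row per accession.
core = [
    {"id": "acc-1", "catalogNumber": "DEMO 1", "genus": "Hordeum"},
    {"id": "acc-2", "catalogNumber": "DEMO 2", "genus": "Triticum"},
]

# Hypothetical extension rows; each row points back to one core record.
trait_rows = [
    {"coreid": "acc-1", "measurementType": "plant height", "measurementValue": "85"},
    {"coreid": "acc-1", "measurementType": "days to heading", "measurementValue": "62"},
]


def attach_extension(core_records, extension_rows, name):
    """Group extension rows by core id and attach them to each core record."""
    by_core = defaultdict(list)
    for row in extension_rows:
        by_core[row["coreid"]].append(row)
    # A core record with no extension rows gets an empty list (zero-to-many).
    return [{**record, name: by_core.get(record["id"], [])} for record in core_records]


joined = attach_extension(core, trait_rows, "traits")
```

Each core record keeps its own fields and gains a list of linked rows, which is the shape a consumer typically works with after reading a core file and its extensions.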
  21. Darwin Core Archive (DwC-A). DwC-A publishes Darwin Core records including extensions. Simple text-based format. Zipped single-file archive. Germplasm.txt
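Because a Darwin Core Archive is a zip file holding plain-text data files, it can be read with standard-library tools. The sketch below builds a tiny demo archive in memory and reads it back; the column names and values are made up for illustration, tab-delimited UTF-8 text is an assumption, and a real archive normally also carries a `meta.xml` descriptor that this sketch ignores.

```python
import csv
import io
import zipfile


def build_demo_archive():
    """Create a small in-memory zip with one tab-delimited file (illustrative data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(
            "Germplasm.txt",
            "id\tcatalogNumber\tbiologicalStatus\n"
            "acc-1\tDEMO 1\tlandrace\n"
            "acc-2\tDEMO 2\twild\n",
        )
    buffer.seek(0)
    return buffer


def read_rows(zip_file, member):
    """Return the rows of one tab-delimited member as a list of dicts."""
    with zipfile.ZipFile(zip_file) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text, delimiter="\t"))


rows = read_rows(build_demo_archive(), "Germplasm.txt")
```

`read_rows` also accepts a path to an archive on disk, since `zipfile.ZipFile` takes either a file name or a file-like object.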
  22. Darwin Core extension for genebanks. The Darwin Core extension for genebanks is an extension to the Darwin Core standard. It provides a mapping between MCPD terms and Darwin Core terms, and it includes additional terms required for describing germplasm resources that were missing in Darwin Core. • Endresen, D., S. Gaiji, and T. Robertson (2009). Darwin Core Germplasm extension and deployment in the GBIF infrastructure. Proceedings of TDWG 2009, Montpellier, France. Biodiversity Information Standards (TDWG). • Endresen, D.T.F. and H. Knüpffer (2012). The Darwin Core extension for genebanks opens up new opportunities for sharing genebank data sets. Biodiversity Informatics 8:11-29.
  23. Darwin Core extension for genebanks. Namespace (SKOS/RDF) (stable version): http://purl.org/germplasm/germplasmTerm# Code repository (stable version): http://code.google.com/p/darwincore-germplasm Community discussion (development version): http://terms.tdwg.org/wiki/Germplasm
  24. Mapping of DwC to MCPD
      MCPD (2012)              Darwin Core
      (missing)                dwc.datasetID
      (missing)                dwc.occurrenceID
      1      INSTCODE          dwc.institutionCode
      2      ACCENUMB          dwc.catalogNumber
      3      COLLNUMB          dwc.recordNumber
      4      COLLCODE          g.collectingInstituteCode
      4.1    COLLNAME          dwc.recordedBy
      4.1.1  COLLINSTADDRESS   (dwc.recordedBy)
      4.2    COLLMISSID        dwc.collectionCode
      5      GENUS             dwc.genus
      6      SPECIES           dwc.specificEpithet
      7      SPAUTHOR          dwc.scientificNameAuthorship
      8      SUBTAXA           dwc.infraspecificEpithet
      9      SUBTAUTHOR        (dwc.scientificNameAuthorship)
      10     CROPNAME          dwc.vernacularName
      11     ACCENAME          g.breedingIdentifier
      12     ACQDATE           g.acquisitionDate
      13     ORIGCTY           dwc.countryCode
      14     COLLSITE          dwc.locality
      15.1   DECLATITUDE       dwc.decimalLatitude
      15.2   LATITUDE          dwc.verbatimLatitude
      15.3   DECLONGITUDE      dwc.decimalLongitude
      15.4   LONGITUDE         dwc.verbatimLongitude
      15.5   COORDUNCERT       dwc.coordinateUncertaintyInMeters
      15.6   COORDDATUM        dwc.geodeticDatum
      15.7   GEOREFMETH        dwc.georeferenceSources
      16     ELEVATION         dwc.minimumElevationInMeters
      17     COLLDATE          dwc.eventDate
      18     BREDCODE          g.breederInstituteID
      18.1   BREDNAME          g.breedingInstitute
      19     SAMPSTAT          g.biologicalStatus
      20     ANCEST            g.ancestralData, g.purdyPedigree
      21     COLLSRC           g.acquisitionSource
      22     DONORCODE         g.donorInstituteID
      22.1   DONORNAME         g.donorInstitute
      23     DONORNUMB         g.donorsIdentifier
      24     OTHERNUMB         dwc.otherCatalogNumbers
      25     DUPLSITE          g.safetyDuplicationInstituteID
      25.1   DUPLINSTNAME      g.safetyDuplicationInstitute
      26     STORAGE           g.storageCondition
      27     MLSSTAT           g.mlsStatus
      28     REMARKS           dwc.occurrenceRemarks
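The mapping on this slide can be applied mechanically to an MCPD passport record. The following is a minimal sketch covering only a subset of the descriptors listed above; the function name `mcpd_to_dwc`, the `prefix:term` key style, and the sample record (including the unknown descriptor `FOO`) are hypothetical.

```python
# Subset of the slide's mapping: MCPD descriptor -> Darwin Core / germplasm term.
MCPD_TO_DWC = {
    "INSTCODE": "dwc:institutionCode",
    "ACCENUMB": "dwc:catalogNumber",
    "GENUS": "dwc:genus",
    "SPECIES": "dwc:specificEpithet",
    "ORIGCTY": "dwc:countryCode",
    "DECLATITUDE": "dwc:decimalLatitude",
    "DECLONGITUDE": "dwc:decimalLongitude",
    "SAMPSTAT": "g:biologicalStatus",
    "STORAGE": "g:storageCondition",
}


def mcpd_to_dwc(record):
    """Rename MCPD descriptors to mapped terms; collect anything without a mapping."""
    converted, unmapped = {}, {}
    for descriptor, value in record.items():
        if value in (None, ""):
            continue  # skip empty passport fields
        target = MCPD_TO_DWC.get(descriptor)
        if target is None:
            unmapped[descriptor] = value
        else:
            converted[target] = value
    return converted, unmapped


# Hypothetical passport record for illustration only.
sample = {
    "INSTCODE": "XXX000",
    "ACCENUMB": "DEMO 1",
    "GENUS": "Hordeum",
    "SPECIES": "vulgare",
    "SAMPSTAT": "300",
    "ORIGCTY": "",
    "FOO": "bar",
}
converted, unmapped = mcpd_to_dwc(sample)
```

Returning the unmapped descriptors separately makes gaps in the mapping visible instead of silently dropping data.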
  25. Data set. dcmitype:Dataset (Darwin Core: Record-level terms): dwc:datasetID (missing in MCPD); dwc:datasetName (eurisco: NICODE); dwc:collectionID; dwc:collectionCode = mcpd: COLLMISSID; dwc:institutionID; dwc:institutionCode = mcpd: INSTCODE. Namespaces: dwc = http://rs.tdwg.org/dwc/terms/ ; g = http://purl.org/germplasm/germplasmTerm# ; dcmitype = http://purl.org/dc/dcmitype/ ; dsw = http://purl.org/dsw/ ; mcpd = http://www.bioversityinternational.org/index.php?id=244&tx_news_pi1%5Bnews%5D=1350&cHash=d953e45ada3ab285d635593b5068a38f ; epgris3 = http://www.epgris3.eu/docs/activities/2-05/Inclusion%20of%20C&E%20data.pdf
  26. Nomenclature. dwctype:Taxon: dwc:taxonID (missing in MCPD); dwc:scientificNameID; dwc:scientificName; dwc:genus = mcpd: GENUS; dwc:specificEpithet = mcpd: SPECIES; dwc:scientificNameAuthorship = mcpd: SPAUTHOR, SUBTAUTHOR; dwc:vernacularName = mcpd: CROPNAME
  27. Germplasm accession. g:GermplasmAccession (see also: dsw:Specimen): dwc:occurrenceID (missing in MCPD); g:germplasmID (epgris3: GENOTYPE_NUMBER); dwc:catalogNumber = mcpd: ACCENUMB; g:germplasmIdentifier = mcpd: ACCENAME; g:biologicalStatus = mcpd: SAMPSTAT; g:storageCondition = mcpd: STORAGE; dwc:otherCatalogNumbers = mcpd: OTHERNUMB; dwc:occurrenceDetails (eurisco: ACCEURL); dwc:occurrenceRemarks = mcpd: REMARKS
  28. Collecting event. g:CollectingEvent (dwc:Event, dcmitype:Event): dwc:eventID; dwc:recordNumber = mcpd: COLLNUMB; dwc:decimalLatitude = mcpd: DECLATITUDE [geo:lat]; dwc:decimalLongitude = mcpd: DECLONGITUDE [geo:long]; dwc:geodeticDatum = mcpd: COORDDATUM; dwc:minimumElevationInMeters = mcpd: ELEVATION [geo:alt]; dwc:eventDate = mcpd: COLLDATE; dwc:locality = mcpd: COLLSITE [geo:location]; dwc:countryCode = mcpd: ORIGCTY [mcpd: ISO 3166-1 alpha-3]; dwc:verbatimLatitude = mcpd: LATITUDE; dwc:verbatimLongitude = mcpd: LONGITUDE; dwc:georeferenceSources = mcpd: GEOREFMETH; g:collectingInstituteID = mcpd: COLLCODE; dwc:recordedBy = mcpd: COLLNAME; dwc:eventRemarks
  29. Breeding event. g:BreedingEvent (see also dcmitype:Event): g:breedingID; g:breedingIdentifier = mcpd: ACCENAME; g:breedingYear; g:breedingCountry; g:breedingCountryCode; g:breedingInstituteID = mcpd: BREDCODE; g:breedingInstitute = mcpd: BREDNAME; g:breedingPerson; g:ancestralData = mcpd: ANCEST; g:purdyPedigree (mcpd: ANCEST); g:breedingRemarks
  30. Acquisition event. g:AcquisitionEvent (see also dcmitype:Event): g:acquisitionID; g:donorsID; g:donorsIdentifier = mcpd: DONORNUMB; g:donorInstituteID = mcpd: DONORCODE; g:donorInstitute = mcpd: DONORNAME; g:acquisitionDate = mcpd: ACQDATE; g:acquisitionSource = mcpd: COLLSRC; g:acquisitionRemarks
  31. Safety duplication. g:SafetyDuplication (see also dcmitype:Event): g:safetyDuplicationID; g:safetyDuplicationDate; g:safetyDuplicationInstituteID = mcpd: DUPLSITE; g:safetyDuplicationInstitute = mcpd: DUPLINSTNAME; g:safetyDuplicationRemarks
  32. Treaty or legislation. g:TreatyOrRegulation (see also dcmitype:Text, foaf:Document): g:treatyOrRegulationID; g:treatyOrRegulationName; g:treatyOrRegulationGoverningBody; g:mlsStatus = mcpd: MLSSTAT
  33. Measurement method (trait). g:MeasurementMethod (see also dwc:MeasurementOrFact): g:measurementMethodID = epgris3: TRAIT_NUMBER; g:measurementMethodName = epgris3: TRAIT_NAME; g:measurementMethodCategory; g:measurementMethodScale; g:measurementMethodSource; g:measurementMethodRemarks = epgris3: TRAIT_REMARK; dwc:measurementType; dwc:measurementMethod
  34. Measurement experiment. g:MeasurementExperiment (see also dcmitype:Event): g:measurementExperimentID; g:measurementExperimentIdentifier; g:measurementExperimentYear; g:measurementExperimentReport; g:measurementExperimentRemarks
  35. Measurement or fact. dwc:MeasurementOrFact: dwc:measurementID; dwc:measurementValue; dwc:measurementUnit; dwc:measurementAccuracy; dwc:measurementDeterminedDate; dwc:measurementDeterminedBy; g:measurementByInstituteID; g:measurementGrowthStage
  36. Controlled value vocabulary
      Biological status type: wild (100) | natural (110) | semiNaturalWild (120) | semiNaturalSown (130) | weedy (200) | landrace (300) | breedingResearchMaterial (400) | breedersLine (410) | syntheticPopulation (411) | hybrid (412) | founderStock (413) | inbredLine (414) | segregatingPopulation (415) | clonalSelection (416) | geneticStock (420) | mutant (421) | cytogeneticStock (422) | otherGeneticStock (423) | advancedCultivar (500) | GMO (600) | otherBiologicalStatus (999)
      Acquisition type: wildHabitat (10) | forest (11) | shrubland (12) | grassland (13) | desertOrTundra (14) | aquaticHabitat (15) | cultivatedHabitat (20) | field (21) | orchard (22) | backyard (23) | fallowLand (24) | pasture (25) | farmStore (26) | threshingFloor (27) | park (28) | marketOrShop (30) | instituteOrGenebank (40) | seedCompany (50) | ruderalHabitat (60) | roadside (61) | fieldMargin (62) | otherAcquisition (99) [Most of these could perhaps be replaced by their respective term from the Environmental Ontology (EnvO).]
      Storage type: seedCollection (10) | shortTerm (11) | mediumTerm (12) | longTerm (13) | fieldCollection (20) | inVitro (30) | cryopreserved (40) | DNA (50) | otherStorage (99)
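The numeric codes and term names on this slide lend themselves to a simple lookup. The sketch below decodes biological status and storage codes into the slide's term names; the helper names `decode` and `decode_many` are hypothetical, and the semicolon separator for multiple storage values is an assumption about how the source data is delimited.

```python
# Code -> term name, copied from the slide's controlled value lists.
BIOLOGICAL_STATUS = {
    100: "wild", 110: "natural", 120: "semiNaturalWild", 130: "semiNaturalSown",
    200: "weedy", 300: "landrace", 400: "breedingResearchMaterial",
    410: "breedersLine", 411: "syntheticPopulation", 412: "hybrid",
    413: "founderStock", 414: "inbredLine", 415: "segregatingPopulation",
    416: "clonalSelection", 420: "geneticStock", 421: "mutant",
    422: "cytogeneticStock", 423: "otherGeneticStock", 500: "advancedCultivar",
    600: "GMO", 999: "otherBiologicalStatus",
}

STORAGE_TYPE = {
    10: "seedCollection", 11: "shortTerm", 12: "mediumTerm", 13: "longTerm",
    20: "fieldCollection", 30: "inVitro", 40: "cryopreserved", 50: "DNA",
    99: "otherStorage",
}


def decode(vocabulary, code):
    """Return the term name for one numeric code, or None if the code is unknown."""
    return vocabulary.get(int(code))


def decode_many(vocabulary, codes, separator=";"):
    """Decode a delimited string of codes (separator is an assumption)."""
    return [decode(vocabulary, part) for part in codes.split(separator) if part.strip()]


status = decode(BIOLOGICAL_STATUS, "300")
storage = decode_many(STORAGE_TYPE, "13;40")
```

Unknown codes come back as `None` rather than raising, so a caller can decide whether to reject the record or keep the raw value.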
  37. Some proposed additions. In situ conservation (proposed): IUCNCategory, numberOfSeeds, bioRegion, inSituCountry, inSituRecoveryDateStarted, inSituRecoveryInstitute, inSituRecoveryRemarks. Germplasm distribution: perhaps add new terms to facilitate the reporting of germplasm distribution and standard material transfer agreements (SMTA) for the International Treaty on Plant Genetic Resources for Food and Agriculture (ITPGRFA). Germplasm management: the Millennium Seed Bank (Kew) contributed feedback to the DwC-G modeling and proposed to include terminology for seed management: • Seed processing terms • Seed cleaning • Seed germination testing
  38. Germplasm vocabulary of terms (RDF/SKOS) … … http://purl.org/germplasm/germplasmTerm#
  39. Darwin Core Archive extension for IPT … http://rs.gbif.org/extension/germplasm/20120911/GermplasmAccession.xml
  40. Term Wiki: http://terms.tdwg.org/wiki/
  41. Work-flow for Vocabulary management. 1. Mint and maintain concepts and terms, in domain-expert working groups (Term Wiki, for vocabulary development, http://terms.tdwg.org/wiki/). 2. Release final version as a Concept Vocabulary (rdf, skos). 3. Publish at the GBIF Resources Repository (http://rs.gbif.org/terms/). REUSE terms from published concept vocabularies and ontologies when designing new application schemas such as DwC-A controlled term and value vocabularies.
  42. Biodiversity ontology development. Concept Vocabulary (rdf, skos): REUSE terms from concept vocabularies whenever possible. Ontologies (rdf, owl). Biodiversity ontology repository: http://bis.bioportal.bioontology.org/ontologies?filter=BIS
  43. Example: master SKOS/RDF resource. Languages: en, es, zh, ja. http://rs.gbif.org/terms/dwc/dwc_translations.rdf
  44. Vocabularies/ontologies • Provide a shared understanding of what we mean when describing biodiversity entities. • What kind of thing or property. • A list of things we as a community can agree upon the meaning of. • "Concept repository" with terms identified by URIs. TDWG Technical Roadmap 2008 (convened by Roger Hyam). Photo CC-by-3.0 by Hannes Grobe/AWI. Palaeoclimate archives.
  45. Vocabulary management • Vocabularies/ontologies are one of the three core components in the TDWG technical architecture. Hyam, R. (2006). A technical architecture for TDWG standards.
  46. "Things can happen in a band, or any type of collaboration, that would not otherwise happen" (Jim Coleman, Jazz-musician). GBIF, Global Biodiversity Information Facility, http://www.gbif.org; TDWG, Biodiversity Information Standards, http://www.tdwg.org; BioCASE, The Biological Collection Access Service for Europe, http://www.biocase.org; Bioversity International, http://www.bioversityinternational.org; NordGen, The Nordic Genetic Resources Center, http://www.nordgen.org