
2016 12-14 GBIF and reuse of research data. GBIF seminar in Bergen.


Biodiversity informatics seminar at the Department of Biology, University of Bergen on data publication and reuse of GBIF-mediated biodiversity data on 14th December 2016. Organized by the Norwegian GBIF Node and the Norwegian Biodiversity Information Center (NBIC, Artsdatabanken).

See also: http://www.gbif.no/events/2016/data-publishing-seminar-in-bergen.html
See also: http://doi.org/10.13140/RG.2.2.24290.32969

Published in: Science


  1. 1. CC-BY Dag Endresen
  2. 2. •  BIG DATA – a new research paradigm •  Data curation plan (data life cycle) •  Publish and archive your research data •  Use shared universal data standards •  Write metadata, good data documentation •  "Data paper" and data citation •  Academic credits for data publishing •  Use digital, stable and universal identity numbers (DOI)
  3. 3. DATA EXPLOSION •  More and more data is produced. •  The challenge ahead is not to produce more data, but to build the knowledge, understanding and capacity to navigate and use very large volumes of data. •  90% of the data that currently exists was created in just the last two years. •  Data curation is critical to ensure that data is appropriately structured, available and reusable.
  4. 4. EXPONENTIAL GROWTH FOR DIGITAL DATA The digital universe will double every two years between now and 2020. The growth is mostly unstructured data (including sensor data from camera traps and weather stations, images, video, sound clips). A major factor behind the expansion is the growth of machine-generated data (from 11% in 2005 to over 40% in 2020). Image source: EMC/IDC Digital Universe Study, 2012
  5. 5. UNSTRUCTURED DATA "Data! Data! Data!" he cried impatiently. "I can’t make bricks without clay." (Sherlock Holmes in “The Adventure of the Copper Beeches” by Sir Arthur Conan Doyle). Unstructured data accounts for an estimated 80% of all data in organizations and a whopping 95% of all new data generated daily (Grimes 2008).
  6. 6. Why create a data management plan? Graphics by Jørgen Stamp CC-BY
  7. 7. DATA LOSS Digital data are fragile and susceptible to loss for a wide variety of reasons: •  Natural disaster •  Facilities infrastructure failure •  Storage failure •  Server hardware/software failure •  Application software failure •  Format obsolescence •  Legal encumbrance •  Human error •  Malicious attack •  Loss of staffing competencies •  Loss of institutional commitment •  Loss of financial stability •  Changes in user expectations Source: OpenAIRE & EUDAT, CC-BY-4.0, 2013 Image CC BY-NC-SA 2.0 by Dave Hill https://www.flickr.com/photos/dmh650/4031607067
  8. 8. DATA MANAGEMENT PLAN •  Making your data available to others ensures that your research is truly reproducible. •  Managing your research data saves time because it ensures that you and others in your collaboration will be able to find, understand, and use the data. •  Sharing your research data enables wider dissemination of your work. •  Enabling others to use your data reinforces open scientific inquiry and can lead to new and unanticipated discoveries. Graphics by Jørgen Stamp CC-BY
  9. 9. "FAIR" DATA Findable –  assign persistent IDs, provide rich metadata, register in a searchable resource... (such as GBIF) Accessible –  Retrievable by their ID using a standard protocol, metadata remain accessible even if data aren’t... Interoperable –  Use formal, broadly applicable languages, use standard vocabularies, qualified references... (e.g. Darwin Core, …) Reusable –  Rich, accurate metadata, clear licences, provenance, use of community standards... (e.g. Dublin Core, EML, …) www.force11.org/group/fairgroup/fairprinciples Slide source: OpenAIRE & EUDAT, CC-BY-4.0, 2013
  10. 10. DATA CITATION PRINCIPLES 1.  Data are legitimate, citable products of research. 2.  Data citations give scholarly credit and attribution. 3.  In scholarly literature, whenever claims are based on data, the data should always be cited. 4.  A persistent method of identifying data that is machine-actionable, globally unique and universal. 5.  Data citations facilitate access to the data, or at least to the metadata. 6.  Unique identifiers persist even beyond the lifespan of the data. 7.  Data citations identify and give access to the specific data that support verification of the claim (provenance, time-slice, version). 8.  Citation methods are flexible, but attentive to interoperability of practices across communities. Data Citation Synthesis Group: Joint Declaration of Data Citation Principles. Martone M. (ed.) San Diego CA: FORCE11; 2014
  11. 11. Long-term archiving for your research data Graphics by Jørgen Stamp CC-BY
  12. 12. BACKUP AND ARCHIVING – NOT THE SAME THING! Backup –  Periodic snapshots of data in case the current version is destroyed or lost. –  Backups are copies of files stored for short-term or near-long-term. –  Often performed on a somewhat frequent schedule. Archiving –  Preserve data for historical reference. –  Usually the final version, stored for long-term, and generally not copied over. –  Often performed at the end of a project or during major milestones. Source: OpenAIRE & EUDAT, CC-BY-4.0, 2013
  13. 13. ONLINE DATA ARCHIVING CENTER Rather than leaving your research data on a local server or in cloud storage, archive your data with a trusted digital repository. Many repositories create metadata and documentation to ensure that the data will be discoverable in the future.
  14. 14. DATA ONE Source: GBIF News story, September 2014, DataONE: http://www.gbif.org/page/8199
  15. 15. NATIONAL DATA CENTER Sigma2 AS Photo: CC-BY Intel Free Press (WikiMedia Commons) A Peek Inside Facebook's Oregon Data Center UNINETT Sigma2 AS and the Norwegian Center for Research Data (NSD) provide a national infrastructure service for archiving Norwegian research data. An infrastructure data repository provides many benefits compared to local institutional data archiving. •  Standardized protocols. •  Improved access for users of data from outside their own institution.
  16. 16. Metadata
  17. 17. WHAT IS METADATA? Photo: CC-BY ‘Metadata is a love note to the future’ by Cea+ www.flickr.com/photos/centralasian/8071729256
  18. 18. Commonly defined as ‘data about data’, metadata helps to make data findable and understandable. Metadata can be: Descriptive: information about the content and context of the data. Structural: information about the structure of the data. Administrative: information about the file type, rights management and preservation processes. WHAT IS METADATA? Source: CC-BY EUDAT, 2015
  19. 19. METADATA CATALOG Image CC-BY ‘University of Michigan Library Card Catalog’ by David Fulmer www.flickr.com/photos/annarbor/4350629792
  20. 20. Comprehensive metadata will: •  Facilitate data discovery •  Help users determine the applicability of the data •  Enable interpretation and reuse •  Allow any limitations to be understood •  Clarify ownership and restrictions on reuse •  Offer permanence as it transcends people and time •  Provide interoperability WHY USE METADATA? Source: CC-BY EUDAT, 2015
  21. 21. INFORMATION ENTROPY The Loss of Information about Data (Metadata) Over Time, Michener et al., 1997
  22. 22. Create metadata at the time of data creation. Information will be forgotten and there won’t be time or effort left to capture it later. Metadata benefits from quality control at an early stage too. TIME MATTERS! Photo CC-BY-SA ‘egg timer – hour glass running out’ by Open Democracy www.flickr.com/photos/opendemocracy/523438942 Source: CC-BY EUDAT, 2015
  23. 23. DATASET TITLE Titles are critical in helping readers find your data. –  People searching for the most appropriate datasets will most likely use the title as the first criterion to determine whether a dataset meets their needs. –  Treat the title as the opportunity to sell your dataset. A complete title includes: What, Where, When, Who, and Scale. An informative title includes: topic, timeliness of the data, specific information about place and geography. Source: CC-BY EUDAT, 2015
  24. 24. WHAT IS THE BETTER DATASET TITLE? Rivers or Rivers in Rondane national park from 1:126,700 Forest Service visitor maps (1961-1983) Rivers (what) in Rondane national park (where) from 1:126,700 (scale) Forest Service (who) visitor maps (1961-1983) (when) Source: CC-BY EUDAT, 2015
  25. 25. WRITE FOR MACHINES, NOT JUST HUMANS Remember: a computer will read your metadata. Do not use symbols that could be misinterpreted: Examples: ! @ # % { } | / < > ~ Don’t use tabs, indents, or line feeds/carriage returns. When copying and pasting from other sources, use a text editor (e.g., Notepad) to eliminate hidden characters. Source: CC-BY EUDAT, 2015
  26. 26. Peer review before data-publishing "Data paper"
  27. 27. Authors get scientific credit for data publication. Meeting concerns over data quality. Meeting concerns over data citation mechanism. http://www.gbif.org/publishingdata/datapapers PEER REVIEW OPTION FOR BIODIVERSITY DATASETS
  28. 28. METADATA TOPICS / HEADLINES Dataset description Project description People and Organizations (including roles) Coverage •  Taxonomic coverage •  Geographic coverage •  Temporal coverage Methods Intellectual property rights, licensing Keywords
  29. 29. RATIONALE FOR DATA PAPER •  A scholarly publication of a searchable metadata document describing a dataset, or a group of datasets. •  Promote and publicize the existence of the data. •  Provide scholarly credit to data publishers through citable journal publications. •  Describe the data in a structured human- and machine-readable form.
  30. 30. Persistent and universal identity-number
  31. 31. The purpose of identifiers …is to name things, making it possible to refer to them. “Each identifier refers to one and only one thing” (Coyle 2006). “An association between a string and a thing” (Kunze 2003). “A stated association between a symbol and a thing; that the symbol may be used to unambiguously refer to the thing within a given context” (Campbell 2007).
  32. 32. Many things (in GBIF) are named 123 Catalog number: 123 GBIF ID: 543392241 urn:catalog:CAS:BOT:123 Bigelowia juncea Catalog number: 123 GBIF ID: 1030591721 UAMb:Herb:123 Sphagnum girgensohnii Catalog number: 123 GBIF ID: 893477175 Parides erithalion Catalog number: 123 GBIF ID: 1050327334 Cinchona ledgeriana Catalog number: 123 GBIF ID: 931031820 Bromus kalmii Catalog number: 123 GBIF ID: 283363 urn:occurrence:Arctos:MVZ:Egg:123:164 Mercurialis ovata Catalog number: 123 GBIF ID: 231564351 Umbrina canariensis Catalog number: 123 GBIF ID: 896547722 urn:occurrence:Arctos:MVZ:Egg:123:164 Contopus sordidulus veliei NAME AMBIGUITY:
  33. 33. HTTP – PURL – UUID http://purl.org/gbifnorway/id/41d9cbb4-4590-4265-8079-ca44d46d27c3
  34. 34. Including machine-readable formats urn:uuid:41d9cbb4-4590-4265-8079-ca44d46d27c3 dc:identifier "urn:uuid:41d9cbb4-4590-4265-8079-ca44d46d27c3"
  35. 35. Data License (machine-readable license)
  36. 36. LICENSING FOR DATA PUBLISHED THROUGH GBIF http://www.gbif.org/terms/licences In 2014 the GBIF Governing Board established support in GBIF for three licenses. GBIF Portal (status December 2016): CC0 57 %, CC-BY 4.0 31 %, CC-BY-NC 4.0 13 %
  37. 37. DATA LICENSE REGULATES THE POSSIBILITY FOR REUSE OF DATA •  CC0 data are made available for any use without restriction or particular requirements on the part of users. •  CC BY data are made available for any use provided that attribution is appropriately given for the sources of data used. •  CC NC data are made available for non-commercial use only; however, where is the limit of what counts as "commercial use"? •  CC SA data are made available on the condition that derived products are also shared alike as CC SA; notice that this could block desired commercial products. •  CC ND data are made available read-only for verification; no modifications or derived products are allowed (blocking reuse)!
  38. 38. NORWEGIAN LICENSE FOR PUBLIC DATA (NLOD) •  NLOD Norwegian license for public data is compatible with CC BY 4.0. •  http://data.norge.no/nlod/no/1.0 •  Recommended to use CC BY 4.0 for broader compatibility and understanding also outside of Norway (alternatively declare both).
  39. 39. H2020 – OPEN DATA BY DEFAULT FROM 2017 Source: OpenAIRE & EUDAT, CC-BY-4.0, 2013
  40. 40. Conclusion
  41. 41. WHY MANAGE AND PUBLISH YOUR OWN RESEARCH DATA? •  Make your own research easier! •  Stop yourself drowning in irrelevant data. •  Save your own data for later use. •  Avoid accusations of fraud or bad science (e.g. p-hacking). •  Share your research data for re-use. •  Get credit for your data. •  Meet funder/institution requirements. Because well-managed data opens up opportunities for re-use, sharing and makes for better science! Source: OpenAIRE & EUDAT, CC-BY-4.0, 2013
  42. 42. Node team at NHM, University of Oslo Dag Endresen, Node manager Christian Svindseth, Database manager Fridtjof Mehlum, Research director Einar Timdal, Associate professor Geir Søli, Associate professor Vidar Bakken, Consultant Artsdatabanken, Trondheim Wouter Koch Nils Valland NTNU University Museum Anders Finstad, GBIF Science committee Research Council of Norway Per Backe-Hansen, Head of delegation Contact us at: gbif-drift@nhm.uio.no
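
Slides 32 to 34 contrast an ambiguous local name (catalog number 123) with globally unique identifiers. The short Python sketch below shows one way the three identifier forms on those slides might be derived from a single UUID. The function names `identifier_forms` and `catalog_urn` are hypothetical and not part of any GBIF tool; the PURL prefix and the `urn:catalog:` pattern are copied from the slides.

```python
import uuid

# PURL prefix as shown on slide 33.
PURL_BASE = "http://purl.org/gbifnorway/id/"


def identifier_forms(value: str) -> dict:
    """Return the plain, URN and PURL forms of one UUID (hypothetical helper)."""
    u = uuid.UUID(value)  # raises ValueError if the string is not a valid UUID
    return {
        "uuid": str(u),
        "urn": u.urn,                 # machine-readable form, as on slide 34
        "purl": PURL_BASE + str(u),   # resolvable HTTP form, as on slide 33
    }


def catalog_urn(institution: str, collection: str, catalog_number: str) -> str:
    """Build a urn:catalog: string in the pattern seen on slide 32 (hypothetical helper)."""
    return f"urn:catalog:{institution}:{collection}:{catalog_number}"


forms = identifier_forms("41d9cbb4-4590-4265-8079-ca44d46d27c3")
print(forms["urn"])
print(forms["purl"])
print(catalog_urn("CAS", "BOT", "123"))

# A fresh random identifier for a new record; the value differs on every run.
print(uuid.uuid4().urn)
```

The catalog number alone ("123") is shared by many records, while the UUID-based forms stay unique regardless of which institution or collection issued them.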
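
Slide 25 advises writing metadata for machines: no tabs or line breaks, no hidden characters from copy and paste, and no symbols that could be misinterpreted. A minimal Python sketch of such a check follows. The helper names are hypothetical, and the decision to flag risky symbols instead of deleting them is an assumption made here, since the slide only says not to use them.

```python
import re
import unicodedata

# Symbols listed on slide 25 as open to misinterpretation.
RISKY_SYMBOLS = set("!@#%{}|/<>~")


def clean_metadata_text(text: str) -> str:
    """Collapse tabs and line breaks to single spaces and drop hidden characters."""
    text = re.sub(r"\s+", " ", text)  # tabs, line feeds, carriage returns
    # Remove control (Cc) and format (Cf) characters such as zero-width spaces.
    text = "".join(
        ch for ch in text if unicodedata.category(ch) not in ("Cc", "Cf")
    )
    return text.strip()


def risky_symbols(text: str) -> list:
    """Return the slide-25 symbols present in the text, sorted, for manual review."""
    return sorted(set(text) & RISKY_SYMBOLS)


raw = "Rivers\tin\u200b Rondane\r\nnational park "
print(clean_metadata_text(raw))
print(risky_symbols("Rivers <1961/1983>"))
```

Flagging leaves the final wording to the author, which matters for titles where a symbol such as "/" may need to be rephrased and not simply dropped.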
