This presentation describes the evolving landscape of biodiversity data publishing as of 2015. It was first presented at the training event for GBIF Participant Nodes held as part of the 22nd meeting of the GBIF Governing Board.
Slide deck developed and presented by L. Russell (VertNet)
Session 02, Introduction to the 2015 Data Publishing Landscape at the GB22 Nodes training event
1. GB22 TRAINING EVENT FOR NODES – 4 OCTOBER 2015
Session 02: 2015 Data Publishing Landscape
Laura Russell
2. INDEX
Data publishing landscape
Biodiversity data publishing
Data types
Data standards
Data normalization and data quality
Data publishing methods
Promotion of data publishing
Use cases
4. DATA PUBLISHING LANDSCAPE
• DiGIR/TAPIR in high use to publish biodiversity data
• Idea for simple, compressed text-based file for publishing introduced at TDWG
• GBIF introduces IPT 1.0
• GBIF redevelops IPT
• GBIF introduces IPT 2.0
• Data Publishing taught at Nodes training
• Nodes and aggregators begin to install and use IPTs
• Occurrence and checklist type datasets along with IPT installations show continued growth
Timeline years: 2008 2008 2009 2010 2011 2011 2012 2011
7. DATA PUBLISHING LANDSCAPE 2015
• The continued GBIF commitment to improving access to biodiversity data
• Refinement and expansion of standards and publishing software
• Evolving social norms
• Most data still published with simple occurrence core
• Portals do not contain the features to support richer data
• Many institutions still need convincing to publish biodiversity data
http://www.gbif.org/page/82104
9. WHAT IS BIODIVERSITY DATA?
Digital text or multimedia data record detailing facts
about the instance of occurrence of an organism, i.e.
on the what, where, when, how and by whom of the
occurrence and the recording.
10. WHAT IS DATA PUBLISHING?
“Publishing” refers to making biodiversity datasets
publicly accessible and discoverable, in a
standardized form, via an access point, typically a web
address (a URL).
14. DARWIN CORE
http://rs.tdwg.org/dwc
recordedBy: A list (concatenated and separated) of names of people, groups, or
organizations responsible for recording the original Occurrence. The primary collector or
observer, especially one who applies a personal identifier (recordNumber), should be
listed first. Examples: "José E. Crespo", "Oliver P. Pearson | Anita K. Pearson"
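A minimal sketch of how a concatenated recordedBy value could be handled in practice. It assumes the pipe character shown in the example above is the separator; the function names are hypothetical and not part of Darwin Core or any GBIF tool.

```python
def split_recorded_by(value, separator="|"):
    """Split a concatenated recordedBy string into a list of names.

    Assumption: names are separated by a pipe, as in the slide's example.
    """
    return [name.strip() for name in value.split(separator) if name.strip()]


def join_recorded_by(names, separator=" | "):
    """Concatenate names back into one recordedBy value, primary collector first."""
    return separator.join(names)


collectors = split_recorded_by("Oliver P. Pearson | Anita K. Pearson")
print(collectors)
print(join_recorded_by(collectors))
```

Keeping the primary collector first in the list preserves the ordering recommended in the term definition.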
15. SIMPLE DARWIN CORE
SIMPLEDWC is a specification for
one particular way to use the
Darwin Core terms - to share data
about taxa and their occurrences in
a simply structured way - and is
probably what is meant if someone
suggests to "format your data
according to the Darwin Core".
http://rs.tdwg.org/dwc/terms/simple/index.htm
16. DARWIN CORE ARCHIVE
A Darwin Core Archive (DwCA) is the text
representation of data formatted to Darwin Core.
A DwCA is a compressed file containing a minimum
of three files.
http://rs.tdwg.org/dwc/terms/guides/text/index.htm
18. MAPPING CORES
Taxon Core
The category of information pertaining to taxonomic names, taxon name
usages, or taxon concepts. Released April 2015, this version removes terms
dcterms:source and dcterms:rights, and adds dcterms:license. 43 terms.
Occurrence Core
The category of information pertaining to evidence of an occurrence in nature,
in a collection, or in a dataset (specimen, observation, etc.). Released July
2015, this version removes terms dcterms:source, dcterms:rights,
dwc:individualID, dwc:occurrenceDetails, and adds dcterms:license,
dwc:organismQuantity, dwc:organismQuantityType, dwc:organismID,
dwc:organismName, dwc:organismScope, dwc:associatedOrganisms,
dwc:organismRemarks, dwc:parentEventID, dwc:sampleSizeValue,
dwc:sampleSizeUnit. 169 terms.
Event Core
The category of information pertaining to a sampling event. Issued 29 May
2015. 95 terms.
19. EXTENSIONS
Darwin Core does not provide terms for every
possible type of data.
• 22 registered
• 25 under development
Examples
• Audubon Media Description (aka Audubon Core)
• Darwin Core Identification History
• Darwin Core Measurement or Facts
http://tools.gbif.org/dwca-validator/extensions.do
20. STAR SCHEMA EXAMPLE - OCCURRENCE
Core: Occurrence Core
Extensions: Media, Geographical, Determination, Germplasm
Plus: meta.xml, EML.xml
Result: DwC Archive (Occurrence)
21. STAR SCHEMA EXAMPLE - CHECKLIST
Core: Taxon Core
Extensions: Literature, Description, Occurrences, Vernacular, Distribution, Types
Plus: meta.xml, EML.xml
Result: DwC Archive (Checklist)
22. STAR SCHEMA EXAMPLE - SAMPLE
Core: Event Core
Extensions: Occurrences, Measurement/Fact, Relevé
Plus: meta.xml, EML.xml
Result: DwC Archive (Samples)
23. DATA NORMALIZATION
What is data normalization?
Reasons to normalize a database
Normal forms
http://www.essentialsql.com/get-ready-to-learn-sql-database-normalization-explained-in-simple-english/,
http://databases.about.com/od/specificproducts/a/normalization.htm, http://www.dotnet-tricks.com/Tutorial/sqlserver/756N210512-Database-Normalization-Basics.html
24. DATA QUALITY
• Tools
• Should you work on improving the data?
• Importance of feedback
http://community.gbif.org/pg/pages/view/48546/precourse-activities
29. PROMOTION OF DATA PUBLISHING
Topic of discussion at the Nodes Training in Berlin in
2013.
Core element in the day-to-day work of Node
Managers.
30. PROMOTION OF DATA PUBLISHING - BARRIERS
Barrier categories: Psychological & cultural barriers; Institutional barriers; Capacity barriers; Practical barriers
1. Lack of knowledge
2. Lack of understanding
3. Lack of will
4. Perceived data value
5. Privacy concerns
6. Lack of authorization
7. Lack of time / planning
8. Lack of capacity
9. Lack of funding
10. Lack of infrastructure
http://www.gbif.org/publishing-data/benefits, http://www.gbif.org/resource/81196
31. PROMOTION OF DATA PUBLISHING - RESTRICTIONS
1. Refuse to share.
2. Refuse to share until they have exhausted the
planned use of the data.
3. Will only share their data for a fee.
4. Will only share data under specific restrictions.
5. Agree to share data openly.
32. PROMOTION OF DATA PUBLISHING - STRATEGIES
1. Facilitate access to financial support.
2. Call upon commitments or legal mandates.
3. Call upon open access / moral principles.
4. Show the benefits of better data management.
5. Show the benefit for their scientific careers.
6. Peer pressure.
7. Start / support big digitization programmes.
8. Start / support data repatriation efforts.
33. PROMOTION OF DATA PUBLISHING – DISCUSSION
Challenges
• Not wanting to publish and/or not wanting to publish all the data
• Technical threshold of an IPT
• Restrictive licensing of data
Strategies
• Start smaller – metadata only
• Promote one-off publishing with multiple exposures
• Provide hosted IPTs to eliminate technical threshold
• Illustrate licensing with telling examples.
• Promote and organize trainings to bring reluctant publishers in with an easier “sell” like data papers.
http://community.gbif.org/pg/forum/topic/48616/precourse-activity-promoting-data-publishing/
35. USE CASES - INTRODUCTION
Explore four use cases based on current publishing
practices
• Literature
• Observation data
• Natural history collections
• Checklists
Complete two exercises
• Definition of publishing strategies
• Publish datasets
Editor's Notes
Image from Piotr Lewandowski, shared via http://www.freeimages.com/photo/learning-with-pencil-1415671
Data/chart provided by Kyle Braak, GBIF.
Good and needs improvement
The data publishing area is in continuous evolution and expansion. The standards are refined and expanded, the software is improved and debugged, and the social norms evolve. This requires that we all refresh our knowledge periodically.
Although publishing biodiversity data in a standard way has been possible for a long time, most data is still published in a very simple way: just the occurrence core, single identifications, few or no connections among objects, simple metadata... Much of the richness of the original data is still inaccessible because of the way the data is published. This is one of the main reasons for organizing this course.
The data already published determines (although only to a certain extent) the technical developments in the GBIF network, namely in GBIF.org and its API. Only when a certain amount of data of a certain type is published (e.g. through an extension) does enabling discovery and retrieval of that information rise in priority. Examples of this are the indexing of occurrences published using the occurrence extension of the taxon core, and the possibility to search and retrieve images from the simple multimedia extension.
Most data is still published with the simple occurrence core and is missing the known richness of the original data.
Without the rich data, portal developers have no reason to prioritize features that support rich data.
Reused slide from 1B Publishing Primary Biodiversity Data by Alberto González-Talaván ~ Data Sharing, Data Standards, and Demystifying the IPT ~ Gainesville, FL, USA. 13 January 2015
Modified from 1B Publishing Primary Biodiversity Data by Alberto González-Talaván ~ Data Sharing, Data Standards, and Demystifying the IPT ~ Gainesville, FL, USA. 13 January 2015
Review of the data types for publishing (http://www.gbif.org/publishing-data/summary#datatypes). This will be the first attempt to cover the instructional objectives 1a, 1b & 1c.
GBIF now deals with four types of biodiversity data:
Occurrences (observations, specimens etc)
Checklists (names)
Metadata (data about data) - http://www.gbif.org/dataset/search?type=METADATA
Occurrences are records that document a 'collection event'—evidence that a particular, named organism was found at a particular time and place. Also known as primary biodiversity data, occurrences document the 'what, where, when, how and by whom' of our exploration of the planet's species. An occurrence record can be based on an observation in the field, vouchered (labeled) specimen in a museum or herbarium, or other evidence.
Checklists are lists of scientific names of organisms grouped into taxonomic hierarchies. They serve two main functions: first, they provide data that help to enrich information about particular species, for example by including them on national checklists and on lists of invasive or threatened species; second, they provide taxonomic 'backbones' around which species information can be organized.
Metadata are structured descriptions of datasets giving essential details such as the geographic and taxonomic scope of the data, methods of collection or observation, contact details and citation requirements. They help to give context to datasets and enable users to assess whether data are fit for use in a particular research project or application.
introduce the need/push for sample-based datasets (introduction of the event core) (http://www.gbif.org/page/82105) - released March 24, 2015
beyond “presence only” data -- more quantitative information used in other areas of scientific discovery and research, particularly ecological monitoring and assessment.
Sample-based data (ecological monitoring and assessment data)
Sample-based data are records from thousands of different kinds of environmental, ecological, and natural resource monitoring and assessment investigations. These events range from one-off surveys to ongoing monitoring and include activities like freshwater and marine sampling, plant cover and vegetation plots, and citizen science bird counts, among others.
This section will cover the instructional objective 2a.
Biodiversity Information Standards (TDWG), also known as the Taxonomic Databases Working Group, is a not for profit scientific and educational association that is affiliated with the International Union of Biological Sciences.
TDWG was formed to establish international collaboration among biological database projects. TDWG promoted the wider and more effective dissemination of information about the World's heritage of biological organisms for the benefit of the world at large. Biodiversity Information Standards (TDWG) now focuses on the development of standards for the exchange of biological/biodiversity data.
Our Mission
Develop, adopt and promote standards and guidelines for the recording and exchange of data about organisms
Promote the use of standards through the most appropriate and effective means and
Act as a forum for discussion through holding meetings and through publications
It includes a glossary of terms intended to facilitate the sharing of information about biological diversity by providing reference definitions, examples, and commentaries.
It is primarily based on taxa, their occurrence in nature as documented by observations, specimens, and samples, and related information.
Flat table
Few restrictions
A data file (occurrence.txt) conforming to SIMPLEDWC in CSV format. The first row contains Darwin Core standard term names.
A meta file (meta.xml) in XML format. It contains technical details that instruct a computer on how to use the data file.
A metadata file (eml.xml) in XML format. It contains explanatory details about the records in the data file so that a user can judge whether the data are fit for their use.
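The three files described above can be sketched as a small script that assembles an archive in memory. This is an illustrative sketch only, not how the IPT builds archives: the sample record, the column selection, and the function names are made up, and the eml.xml here is a bare placeholder rather than a complete metadata document.

```python
import csv
import io
import zipfile
from xml.sax.saxutils import escape

DWC_TERMS = "http://rs.tdwg.org/dwc/terms/"
COLUMNS = ["occurrenceID", "basisOfRecord", "scientificName", "eventDate", "recordedBy"]

# Made-up sample record, for illustration only.
SAMPLE_ROWS = [
    {
        "occurrenceID": "example:occ:1",
        "basisOfRecord": "HumanObservation",
        "scientificName": "Puma concolor",
        "eventDate": "2015-10-04",
        "recordedBy": "Oliver P. Pearson | Anita K. Pearson",
    },
]


def occurrence_txt(rows):
    """Data file: CSV whose first row holds Darwin Core term names."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def meta_xml():
    """Descriptor file: tells software how to parse occurrence.txt."""
    fields = "\n".join(
        f'    <field index="{i}" term="{DWC_TERMS}{name}"/>'
        for i, name in enumerate(COLUMNS)
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">\n'
        '  <core encoding="UTF-8" fieldsTerminatedBy="," linesTerminatedBy="\\n"\n'
        '        fieldsEnclosedBy="&quot;" ignoreHeaderLines="1"\n'
        f'        rowType="{DWC_TERMS}Occurrence">\n'
        '    <files><location>occurrence.txt</location></files>\n'
        '    <id index="0"/>\n'
        f'{fields}\n'
        '  </core>\n'
        '</archive>\n'
    )


def eml_xml(title):
    """Placeholder metadata; a real dataset needs a full EML document."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"\n'
        '         packageId="example-dataset" system="example">\n'
        f'  <dataset><title>{escape(title)}</title></dataset>\n'
        '</eml:eml>\n'
    )


def build_archive(rows, title="Example occurrence dataset"):
    """Return the bytes of a zip file holding the three archive files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("occurrence.txt", occurrence_txt(rows))
        archive.writestr("meta.xml", meta_xml())
        archive.writestr("eml.xml", eml_xml(title))
    return buffer.getvalue()


archive_bytes = build_archive(SAMPLE_ROWS)
print(sorted(zipfile.ZipFile(io.BytesIO(archive_bytes)).namelist()))
```

In practice the IPT generates these files for the publisher; the sketch is only meant to make the role of each file concrete.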
Cores updated based on updated
Modified from Standards and sharing complex primary biodiversity data; and what is an extension anyway? ~ Deb Paul ~ Data Sharing, Data Standards, and Demystifying the IPT Workshop – Day 1, Jan. 13, 2015 ~ Gainesville, FL
Database normalization is the process used to organize a database into tables and columns. The idea is that a table should be about a specific topic and that only those columns which support that topic are included.
There are three main reasons to normalize a database. The first is to minimize duplicate data, the second is to minimize or avoid data modification issues, and the third is to simplify queries.
To assist in achieving these objectives, some rules for database table organization have been developed. The stages of organization are called normal forms; most databases adhere to the first three.
First Normal Form – The information is stored in a relational table, each column contains atomic values, and there are no repeating groups of columns.
Second Normal Form – The table is in first normal form and every non-key column depends on the table’s whole primary key.
Third Normal Form – The table is in second normal form and none of its non-key columns is transitively dependent on the primary key.
There are further normal forms if there is interest in learning more.
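As a rough illustration of the idea, the sketch below splits a flat table, in which occurrence details are repeated for every identification, into one core table and one extension table linked by an identifier, in the spirit of the star schema. The records, column choices, and function name are invented for the example.

```python
# Flat table: occurrence details are duplicated for every identification.
FLAT_ROWS = [
    {"occurrenceID": "occ-1", "eventDate": "2015-10-04", "recordedBy": "A. Collector",
     "identifiedBy": "B. Expert", "dateIdentified": "2015-10-05",
     "scientificName": "Puma concolor"},
    {"occurrenceID": "occ-1", "eventDate": "2015-10-04", "recordedBy": "A. Collector",
     "identifiedBy": "C. Reviewer", "dateIdentified": "2015-11-01",
     "scientificName": "Puma concolor concolor"},
    {"occurrenceID": "occ-2", "eventDate": "2015-10-06", "recordedBy": "A. Collector",
     "identifiedBy": "B. Expert", "dateIdentified": "2015-10-07",
     "scientificName": "Lynx rufus"},
]

CORE_COLUMNS = ("occurrenceID", "eventDate", "recordedBy")
EXTENSION_COLUMNS = ("identifiedBy", "dateIdentified", "scientificName")


def split_flat_table(rows):
    """Return (core_rows, identification_rows) linked by occurrenceID/coreid."""
    core = {}
    identifications = []
    for row in rows:
        occurrence_id = row["occurrenceID"]
        # Each occurrence is stored once, no matter how often it was identified.
        core.setdefault(occurrence_id, {name: row[name] for name in CORE_COLUMNS})
        extension_row = {"coreid": occurrence_id}
        extension_row.update({name: row[name] for name in EXTENSION_COLUMNS})
        identifications.append(extension_row)
    return list(core.values()), identifications


core_rows, identification_rows = split_flat_table(FLAT_ROWS)
print(len(core_rows), "core rows;", len(identification_rows), "identification rows")
```

After the split, correcting an event date means editing one core row instead of every row that repeats it, which is the kind of modification issue normalization is meant to avoid.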
For the purposes of the Star Schema, you’ll find your data adhering to the…
Tweet image - https://twitter.com/Iteration23/status/646085874963337216
GBIF community group in conjunction with TDWG group on Data Quality
See pre-course activities for some recommendations/tutorials
OpenRefine – videos
Relational databases for dummies
BioVel Tutorial
Excel
Excel is a wonderful tool, but you must understand how it works or it can change your data in unexpected ways! Don’t introduce unintended data issues into data you are helping to publish, and help end users understand how to avoid introducing such issues into the data they use. http://vertnet.org/resources/downloadsinexcelguide.html
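One way to sidestep automatic reinterpretation of values (for example identifiers with leading zeros, or strings that look like dates) is to process delimited files with a tool that treats every value as text. The sketch below uses Python's standard csv module, which reads all fields as strings; the sample values and function names are made up for illustration.

```python
import csv
import io

# Made-up values that a spreadsheet might silently reinterpret.
RAW = (
    "catalogNumber,eventDate,decimalLatitude\n"
    "00123,2015-10-04,-0.50\n"
    "1-2,1883-03,12.3400\n"
)


def read_as_text(text):
    """Parse CSV text; every value stays a string exactly as written."""
    return list(csv.DictReader(io.StringIO(text)))


def write_as_text(rows):
    """Write the rows back out without changing any value."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


rows = read_as_text(RAW)
print(rows[0]["catalogNumber"])    # leading zeros are kept
print(write_as_text(rows) == RAW)  # the round trip leaves the data untouched
```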
Slide from 1B Publishing Primary Biodiversity Data by Alberto González-Talaván ~ Data Sharing, Data Standards, and Demystifying the IPT ~ Gainesville, FL, USA. 13 January 2015
Ways to publish (strengths and weaknesses of each; include stats for numbers of datasets published via each way; how to identify what method was used when viewing datasets on gbif.org). This will cover the instructional objective 2b.
simple spreadsheets
IPT
custom-created DwCA
IPT currently under development with future planned updates
Web tools and Excel templates were contracted for development in ???? and have not been updated since then.
DiGIR protocol development ceased in 2006
TAPIR protocol last updated in 2010
BioCASE protocol last updated 2015
Online poll
Which of the following methods do you use REGULARLY to publish data online (i.e. in the last year)
o DiGIR provider
o TAPIR provider
o BioCASE provider
o IPT
o DwC-A via “DwC-A spreadsheet processor”
o Customized DwC-A via “DwC-A Assistant”
o Other custom created DwC-A
o None
Which of the following methods do you use regularly to publish data online (or to help others to do so) (i.e. used at least once in the last year)
There are simple online poll tools that show the progress of the voting as you speak and can be displayed on the screen as people vote. This communicates very well and makes the exercise very dynamic.
This section will aim to start covering the instructional objective 3.
As standards and norms evolve, we must ensure that Node Managers’ skills keep pace as well.
Provides an excellent opportunity to continue the discussions from 2013 by sharing some recent promotions by your peers and to participate in an exercise that allows Node Managers to build on promoting richer and more complex data.
Slide from Module 3 – Knowledge exchange I Supporting data digitization and publishing ~ Alberto González-Talaván ~ 4 October 2013, GBIF Nodes Training ~ Berlin, Germany
Barriers to publishing
On these points:
Lack of knowledge: the holder may not be aware of how sharing on the internet works, or of the existence of initiatives such as GBIF.
Lack of understanding: the holder may have heard about GBIF and data publishing, but thinks it must be complicated, bureaucratic, very technical…
Lack of will: the holder understands the process but does not want to go through it because of cultural issues or the perceived sensitivity of the data.
Perceived data value: the holder thinks that the data has economic or intrinsic value that (s)he wants to exploit.
Privacy concerns:
Lack of authorization: the holder would like to share the data, but institutional policies prevent it.
Lack of time / planning: the holder never finds an appropriate moment to start the digitization, data transformation or publishing, or became discouraged after poorly planned attempts.
Lack of capacity: the holder would like to digitize and share the data, but (s)he does not know the best (or any) way to do it.
Lack of resources/funding: the holder would like to digitize and share the data, but there is no spare capacity in the institution to carry out such tasks.
Lack of infrastructure: the holder would like to digitize and share the data, but (s)he does not have the technical infrastructure to do it.
----- Meeting Notes (10/3/15 07:09) -----
Least to most open
The objective is to get to 5, but any advancement on the scale is positive.
Strategies and arguments to overcome barriers/Incentives for publishing
On these points:
Facilitate access to financial support: provide digitization grants or help the data holders to obtain funding that directly or indirectly supports the digitization.
Call upon commitments or legal mandates: try to use commitments or legal mandates that apply to the institution or the country as a way to convince the data holder.
Call upon open access / moral principles: the results of publicly funded research should be made public, access to science should not be restricted, etc.
Show the benefits of better data management: management of digital information can facilitate the data holder’s daily work.
Show the benefit for their scientific careers: publishing data can provide scientific credit through data papers, citations and data usage indexes.
Peer pressure: competing/fellow institutions are already sharing data and the holder’s institution is being left behind.
Start / support big digitization programmes: promote the start of big digitization programmes that will benefit many holders at the same time.
Start / support data repatriation efforts: start programmes that will allow the return of digital data describing your country’s biodiversity.
Summarize community discussion on this topic
Examples of publishing networks/nodes and how they have been successful or had difficulties in publishing data.
Cees provided some great examples and strategies.
Nico introduced the topic of licensing, mentioning Peter Desmet’s blog post “Why we should publish our data under CC0” as an illustrative example of what more restrictive licenses prevent users from doing with data.
http://www.canadensys.net/2012/why-we-should-publish-our-data-under-cc0
Faustin, Hanna, and Cees provided some additional discussion on licensing.
Anne-Sophie introduced the idea of organizing trainings on topics like data papers as an easier sell to data publishers, who can observe the direct impact of their published data papers on the visibility and number of downloads of their datasets.
4 use cases based on current publishing practices: literature, observational data, natural history collections and checklists.
The FIRST EXERCISE will last up to 20 minutes and will focus on the definition of data publishing strategies. Based on the description included in their use case, each group will work on identifying suitable technical solutions, challenges and strategies. Each group will summarize the outcome of their discussions in a single page.
The SECOND EXERCISE will use all the remaining time and will consist of publishing a dataset using the test IPT installation made available for the course. There are two datasets available, depending on the level of challenge that the participant is seeking. Links to the datasets will be provided as part of the use case description document. Those seeking certification will need to fill in a template describing the process and send it to the group facilitator ONLY.
Birds occurrence records from “Birds at the Danish Lighthouses 1883-1939”
Camera trap database of Tiger sightings from India
French and English
Prairie Habitat Restoration Study
VASSY, the database of vascular plants of Syldavia and Eskeastein