Session 02, Introduction to the 2015 Data Publishing Landscape at the GB22 Nodes training event

GB22 TRAINING EVENT FOR NODES – 4 OCTOBER 2015
Session 02: 2015 Data Publishing Landscape
Laura Russell

INDEX
Data publishing landscape
Biodiversity data publishing
Data types
Data standards
Data normalization and data quality
Data publishing methods
Promotion of data publishing
Use cases

DATA PUBLISHING LANDSCAPE
Timeline (year labels on the slide, in extracted order: 2008, 2008, 2009, 2010, 2011, 2011, 2012, 2011)
• DiGIR/TAPIR in high use to publish biodiversity data
• Idea for simple, compressed text-based file for publishing introduced at TDWG
• GBIF introduces IPT 1.0
• GBIF redevelops IPT
• GBIF introduces IPT 2.0
• Data Publishing taught at Nodes training
• Nodes and aggregators begin to install and use IPTs
• Occurrence and checklist type datasets along with IPT installations show continued growth

DATA PUBLISHING LANDSCAPE - STATISTICS
http://www.gbif.org/ipt/stats

DATA PUBLISHING LANDSCAPE 2015
• The continued GBIF commitment to improving access to biodiversity data
• Refinement and expansion of standards and publishing software
• Evolving social norms
• Most data still published with simple occurrence core
• Portals do not contain the features to support richer data
• Many institutions still need convincing to publish biodiversity data
http://www.gbif.org/page/82104

WHAT IS BIODIVERSITY DATA?
A digital text or multimedia data record detailing facts
about an instance of occurrence of an organism, i.e.
the what, where, when, how and by whom of the
occurrence and the recording.

WHAT IS DATA PUBLISHING?
“Publishing” refers to making biodiversity datasets
publicly accessible and discoverable, in a
standardized form, via an access point, typically a web
address (a URL).

BIODIVERSITY DATA TYPES
http://www.gbif.org/publishing-data/summary#datatypes
Checklists
Occurrences
Metadata

BIODIVERSITY DATA TYPES – SAMPLE DATA
http://www.gbif.org/newsroom/news/sample-based-data
Samples

DATA STANDARDS
http://www.tdwg.org/standards/
ABCD: Access to Biological Collection Data (2005)
DwC: Darwin Core (2009)
AC: Audubon Core Multimedia Resources Metadata Schema (2013)
NCD: Natural Collections Descriptions (Draft)

DARWIN CORE
http://rs.tdwg.org/dwc
recordedBy: A list (concatenated and separated) of names of people, groups, or
organizations responsible for recording the original Occurrence. The primary collector or
observer, especially one who applies a personal identifier (recordNumber), should be
listed first. Examples: "José E. Crespo", "Oliver P. Pearson | Anita K. Pearson"
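
A minimal sketch of reading such a value, assuming the pipe character from the example is the separator. The function name split_recorded_by is hypothetical and is not part of any Darwin Core tooling.

```python
def split_recorded_by(value: str, separator: str = "|") -> list[str]:
    """Split a concatenated recordedBy string into individual names.

    The default separator is an assumption taken from the slide's example.
    """
    # Trim whitespace around each name and drop empty fragments.
    return [name.strip() for name in value.split(separator) if name.strip()]


names = split_recorded_by("Oliver P. Pearson | Anita K. Pearson")
# By convention the primary collector or observer is listed first.
primary_collector = names[0]
print(names)
print(primary_collector)
```

Keeping the names in their original order matters here, because the definition gives the first position a meaning.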

SIMPLE DARWIN CORE
Simple Darwin Core is a specification for
one particular way to use the
Darwin Core terms - to share data
about taxa and their occurrences in
a simply structured way - and is
probably what is meant if someone
suggests to "format your data
according to the Darwin Core".
http://rs.tdwg.org/dwc/terms/simple/index.htm
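
As an illustration of the "simply structured" idea, the sketch below writes flat records (one row per occurrence, one column per Darwin Core term) as CSV text. The column names are Darwin Core terms; the two records and the helper name to_simple_dwc_csv are invented for this example.

```python
import csv
import io

# Invented sample data; each key is a Darwin Core term used as a column name.
records = [
    {
        "occurrenceID": "urn:example:occ:1",
        "basisOfRecord": "HumanObservation",
        "scientificName": "Puma concolor",
        "eventDate": "2015-10-04",
        "recordedBy": "Oliver P. Pearson | Anita K. Pearson",
    },
    {
        "occurrenceID": "urn:example:occ:2",
        "basisOfRecord": "PreservedSpecimen",
        "scientificName": "Tapirus terrestris",
        "eventDate": "2015-10-05",
        "recordedBy": "José E. Crespo",
    },
]


def to_simple_dwc_csv(rows):
    """Return the rows as CSV text with a header row of term names."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_simple_dwc_csv(records)
print(csv_text)
```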

DARWIN CORE ARCHIVE
A Darwin Core Archive (DwCA) is the text
representation of data formatted to Darwin Core.
A DwCA is a compressed file containing a minimum
of three files.
http://rs.tdwg.org/dwc/terms/guides/text/index.htm
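
The three files can be sketched as follows, assuming they are a core data file, a meta.xml descriptor and an EML metadata file. The file name occurrence.txt, the single record and the EML stub are placeholders; a real archive needs a complete EML document.

```python
import io
import zipfile

# Tab-separated core data file: a header row plus one invented record.
OCCURRENCE_TXT = (
    "occurrenceID\tscientificName\teventDate\n"
    "urn:example:occ:1\tPuma concolor\t2015-10-04\n"
)

# Descriptor mapping each column of the data file to a Darwin Core term.
META_XML = """<?xml version="1.0" encoding="UTF-8"?>
<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core encoding="UTF-8" fieldsTerminatedBy="\\t" linesTerminatedBy="\\n"
        fieldsEnclosedBy="" ignoreHeaderLines="1"
        rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="0" term="http://rs.tdwg.org/dwc/terms/occurrenceID"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/eventDate"/>
  </core>
</archive>
"""

# Placeholder only: not a valid EML document.
EML_XML = """<?xml version="1.0" encoding="UTF-8"?>
<!-- Placeholder: replace with a complete EML metadata document. -->
<eml/>
"""


def build_archive() -> bytes:
    """Zip the three files in memory and return the archive bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("occurrence.txt", OCCURRENCE_TXT)
        archive.writestr("meta.xml", META_XML)
        archive.writestr("eml.xml", EML_XML)
    return buffer.getvalue()


archive_bytes = build_archive()
with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
    file_names = sorted(archive.namelist())
print(file_names)
```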

STAR SCHEMA
DwC Archive: Core with extensions (Ext 1, Ext 2, Ext 3, Ext 4, Ext 5) + meta.xml + EML.xml
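
A rough sketch of the star schema idea: each extension row points back to exactly one core row through the core record's identifier, while a core row may have zero or more extension rows. The field names id and coreid, the sample rows and the helper attach_extension are assumptions made for this illustration.

```python
from collections import defaultdict

# Invented core records (the centre of the star).
core_rows = [
    {"id": "occ-1", "scientificName": "Puma concolor"},
    {"id": "occ-2", "scientificName": "Tapirus terrestris"},
]

# Invented extension records; each one repeats the identifier of its core row.
media_rows = [
    {"coreid": "occ-1", "identifier": "https://example.org/img/1.jpg"},
    {"coreid": "occ-1", "identifier": "https://example.org/img/2.jpg"},
]


def attach_extension(core, extension, name):
    """Return core records with their matching extension rows nested under `name`."""
    by_core = defaultdict(list)
    for row in extension:
        by_core[row["coreid"]].append(row)
    # Core rows without extension rows get an empty list.
    return [{**record, name: by_core.get(record["id"], [])} for record in core]


joined = attach_extension(core_rows, media_rows, "media")
for record in joined:
    print(record["id"], len(record["media"]))
```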

MAPPING CORES
Taxon Core
The category of information pertaining to taxonomic names, taxon name usages, or taxon concepts. Released April 2015, this version removes terms dcterms:source and dcterms:rights, and adds dcterms:license. 43 terms.
Occurrence Core
The category of information pertaining to evidence of an occurrence in nature, in a collection, or in a dataset (specimen, observation, etc.). Released July 2015, this version removes terms dcterms:source, dcterms:rights, dwc:individualID, dwc:occurrenceDetails, and adds dcterms:license, dwc:organismQuantity, dwc:organismQuantityType, dwc:organismID, dwc:organismName, dwc:organismScope, dwc:associatedOrganisms, dwc:organismRemarks, dwc:parentEventID, dwc:sampleSizeValue, dwc:sampleSizeUnit. 169 terms.
Event Core
The category of information pertaining to a sampling event. Issued 29 May 2015. 95 terms.

EXTENSIONS
Darwin Core does not provide terms for every
possible type of data.
• 22 registered
• 25 under development
Examples
• Audubon Media Description (aka Audubon Core)
• Darwin Core Identification History
• Darwin Core Measurement or Facts
http://tools.gbif.org/dwca-validator/extensions.do

STAR SCHEMA EXAMPLE - OCCURRENCE
Core: Occurrence Core
Extensions: Media, Geographical, Determination, Germplasm
+ meta.xml + EML.xml
DwC Archive type: Occurrence

STAR SCHEMA EXAMPLE - CHECKLIST
Core: Taxon Core
Extensions: Literature, Description, Occurrences, Vernacular, Distribution, Types
+ meta.xml + EML.xml
DwC Archive type: Checklist

STAR SCHEMA EXAMPLE - SAMPLE
Core: Event Core
Extensions: Occurrences, Measurement/Fact, Relevé
+ meta.xml + EML.xml
DwC Archive type: Samples

DATA NORMALIZATION
What is data normalization?
Reasons to normalize a database
Normal forms
http://www.essentialsql.com/get-ready-to-learn-sql-database-normalization-explained-in-simple-english/
http://databases.about.com/od/specificproducts/a/normalization.htm
http://www.dotnet-tricks.com/Tutorial/sqlserver/756N210512-Database-Normalization-Basics.html
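
One way to picture normalization, as a sketch rather than a recipe: a flat table that repeats taxon details on every row is split into a taxon table and an occurrence table linked by a key. The sample rows, the generated taxonID values and the function name normalize are invented for this example.

```python
# Invented flat rows: the family is repeated for every occurrence of a taxon.
flat_rows = [
    {"occurrenceID": "occ-1", "scientificName": "Puma concolor", "family": "Felidae", "eventDate": "2015-10-04"},
    {"occurrenceID": "occ-2", "scientificName": "Puma concolor", "family": "Felidae", "eventDate": "2015-10-05"},
    {"occurrenceID": "occ-3", "scientificName": "Tapirus terrestris", "family": "Tapiridae", "eventDate": "2015-10-05"},
]


def normalize(rows):
    """Split flat rows into a taxon table and an occurrence table."""
    taxa = {}  # scientificName -> taxon row, so each taxon is stored once
    occurrences = []
    for row in rows:
        name = row["scientificName"]
        if name not in taxa:
            taxa[name] = {
                "taxonID": f"taxon-{len(taxa) + 1}",
                "scientificName": name,
                "family": row["family"],
            }
        # The occurrence keeps only a reference to the taxon row.
        occurrences.append({
            "occurrenceID": row["occurrenceID"],
            "taxonID": taxa[name]["taxonID"],
            "eventDate": row["eventDate"],
        })
    return list(taxa.values()), occurrences


taxon_table, occurrence_table = normalize(flat_rows)
print(taxon_table)
print(occurrence_table)
```

With this split, correcting a family name means changing one taxon row instead of every occurrence row that mentions it.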

DATA QUALITY
• Tools
• Should you work on improving the data?
• Importance of feedback
http://community.gbif.org/pg/pages/view/48546/precourse-activities

DATA PUBLISHING METHODS – POLLS
To be explained in the live session…
 

PROMOTION OF DATA PUBLISHING
Topic of discussion at the Nodes Training in Berlin in 2013.
Core element in the day-to-day work of Node Managers.

PROMOTION OF DATA PUBLISHING - BARRIERS
Barrier types: psychological & cultural, institutional, capacity, practical
1. Lack of knowledge
2. Lack of understanding
3. Lack of will
4. Perceived data value
5. Privacy concerns
6. Lack of authorization
7. Lack of time / planning
8. Lack of capacity
9. Lack of funding
10. Lack of infrastructure
http://www.gbif.org/publishing-data/benefits
http://www.gbif.org/resource/81196

PROMOTION OF DATA PUBLISHING - RESTRICTIONS
1. Refuse to share.
2. Refuse to share until they have exhausted the planned use of the data.
3. Will only share their data for a fee.
4. Will only share data under specific restrictions.
5. Agree to share data openly.

PROMOTION OF DATA PUBLISHING - STRATEGIES
1. Facilitate access to financial support.
2. Call upon commitments or legal mandates.
3. Call upon open access / moral principles.
4. Show the benefits of better data management.
5. Show the benefit for their scientific careers.
6. Peer pressure.
7. Start / support big digitization programmes.
8. Start / support data repatriation efforts.

PROMOTION OF DATA PUBLISHING – DISCUSSION
Challenges
• Not wanting to publish and/or not wanting to publish all the data
• Technical threshold of an IPT
• Restrictive licensing of data
Strategies
• Start smaller: metadata only
• Promote one-off publishing with multiple exposures
• Provide hosted IPTs to eliminate the technical threshold
• Illustrate licensing with telling examples
• Promote and organize training to bring reluctant publishers in with an easier “sell” like data papers
http://community.gbif.org/pg/forum/topic/48616/precourse-activity-promoting-data-publishing/

USE CASES - INTRODUCTION
Explore four use cases based on current publishing
practices
• Literature
• Observation data
• Natural history collections
• Checklists
Complete two exercises
• Definition of publishing strategies
• Publish datasets

USE CASE 1: DATA FROM LITERATURE
Blue Group

USE CASE 2: OBSERVATIONAL DATA
Green Group
Red Group

USE CASE 3: NATURAL HISTORY COLLECTION DATA
Yellow Group

USE CASE 4: TAXONOMIC CHECKLISTS
Purple Group
