Data Standards & Best Practices for the Stratigraphic Record

Data Standards & Best Practices
Kerstin Lehnert
Lamont-Doherty Earth Observatory
iedadata.org

2
Vouchering the Stratigraphic Record
 A synthesis database?
 Aggregates data that are published in articles or in data
repositories
 Requirements: Integration, Quality (Trusted data!)
 Needs standardized metadata, semantics, and persistent
unique identifiers
 A trusted repository?
 Publishes and ensures persistent access to data
 Requirements: Compliance with international data
curation and repository standards
 Long-term preservation, data identification (DOI), editorial
procedures, etc.

3
Data Standards
“documented agreements on representation, format,
definition, structuring, tagging, transmission,
manipulation, use, and management of data.”
 Discipline specific
 Data type specific
 Application specific

4
Data Standards: Why?
 Re-usability of data
 Reproducibility of science
 Integration/interoperability of data

6
Reproducibility in the Field Sciences
 Workshop in May 2015, organized by AAAS (M. McNutt), AGU,
and ESA, funded by the Arnold Foundation
 Report in preparation
Technical Requirements for Transparent, Reproducible Data
1. The data themselves must be publicly available in machine-readable, non-
proprietary formats with accurate and precise descriptive metadata;
2. Data provenance—process(es) by which usable datasets were generated or
derived from raw, often streaming or machine-readable-only data—must be
accurately and precisely specified;
3. Computer code (“scripts”) and software with which datasets were analyzed
must be available and adequately described to ensure their repeated use and
be publicly available in non-proprietary formats, and;
4. Version control should be used to ensure that the original data and code are
maintained.
(from draft workshop report)
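Requirements 2 and 4 above can be illustrated with a small provenance record. This is a minimal sketch, not a format from the workshop report: the field names, the script name `drop_missing.py`, the version tag, and the sample values are all hypothetical.

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest, used here as a fixity check."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(raw: bytes, derived: bytes, script: str, code_version: str) -> dict:
    """Describe how a derived dataset was produced from raw data (hypothetical fields)."""
    return {
        "raw_sha256": sha256_of(raw),
        "derived_sha256": sha256_of(derived),
        "script": script,              # code that performed the derivation
        "code_version": code_version,  # e.g. a version-control tag or commit
        "format": "text/csv",          # machine-readable, non-proprietary
    }


# Illustrative raw data with one incomplete row.
raw = b"depth_m,d13C\n0.5,1.2\n1.0,\n1.5,1.4\n"

# Derivation step: drop rows whose last value is missing.
derived = b"\n".join(
    line for line in raw.splitlines() if line and not line.endswith(b",")
) + b"\n"

record = provenance_record(raw, derived, "drop_missing.py", "v1.0.0")
print(json.dumps(record, indent=2))
```

Storing the digests next to the script name and code version lets a reader check that a given derived file really came from the stated raw file and code.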

7
Coalition for Publishing Data in the Earth & Space Sciences (COPDESS)
 Joint initiative of Earth Science publishers and Data
Facilities to better help translate the aspirations of
open, available, and useful data from policy into
practice.
 Reaffirm and ensure adherence to existing journal and
publishing policies and society position statements
regarding open data sharing and archiving of data, tools,
and models.
 Ensure that Earth science data will, to the greatest extent
possible, be stored in community approved repositories
that can provide additional data services.
 Statement of Commitment signed by all major
Earth & Space Science publishers
www.copdess.org

9
Repository Standards
 Open access
 Data quality assurance (editorial process)
 Persistence (long-term preservation)
 Persistent & unique identification of data (DOI
registration)
 Standards-based metadata (ISO) & APIs (OAI-PMH)
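A harvesting API of the OAI-PMH kind is driven by plain HTTP requests. The following is a minimal sketch of building a `ListRecords` request; the endpoint `https://repository.example.org/oai` is a hypothetical placeholder, and the helper name is an assumption for illustration.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the base URL of a real OAI-PMH server.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # selective harvesting by datestamp
    return f"{base_url}?{urlencode(params)}"


print(list_records_url(BASE_URL, from_date="2015-01-01"))
```

`oai_dc` (unqualified Dublin Core) is the one metadata format every OAI-PMH repository must support; a repository offering ISO metadata would advertise its own prefix via the `ListMetadataFormats` verb.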

[Figure: spectrum from small data to BIG DATA across Generic Repositories, Community Data Collections, and Domain Repositories. Labels: findable, accessible, interoperable, re-usable; identification, persistence; protection, protocols; context, provenance; harmonized, machine-readable]

11
Distributed Data Curation
 Alert: Stratigraphy is multi-disciplinary
 There are many data types that already have homes
 Paleobio Database
 Macrostrat/Digital Crust
 Geochron (@IEDA)
 MagIC
 Open Core Data (@IEDA – under development)
 EarthChem (@IEDA)
 System for Earth Sample Registration (@IEDA)
 Don’t reinvent, but leverage, link, & integrate!

EarthCube: A Process
Get all the info at: http://earthcube.org
COMPUTER SCIENCES
SOFTWARE ENGINEERS
SCIENTIFIC VISION
TECHNICAL ARCHITECTURE
ENGAGEMENT
FUNDED PROJECTS

14
Back to Data Standards
 Metadata
 Content
 Structure (data model)
 Vocabularies & Taxonomies
 Identifiers
 (API = Application Programming Interface)

15
Metadata Standards
 Geospatial
 Scientific Context
 Object classifications
 Methods (instrumentation, computation, etc.)
 Actions
 dates
 actors
 Data provenance (references, authors, etc.)

16
Open Geospatial Consortium (OGC):
Observations & Measurements
Sampling Observation
“Observations commonly involve sampling of an ultimate feature of
interest. This International Standard defines a common set of sampling
feature types classified primarily by topological dimension, as well as
samples for ex-situ observations.”
(OGC O&M 2.0.0 / ISO19156; editor: Simon Cox)
e.g. Station,
Transect, Section
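Classification "by topological dimension" can be sketched as a small lookup. The class names follow the spatial sampling feature types in O&M; the assignment of the slide's three examples to those classes is an assumption made here for illustration, and a measured stratigraphic section could arguably be modeled as a curve instead.

```python
from enum import Enum


class SamplingFeature(Enum):
    """Spatial sampling feature classes from O&M, valued by topological dimension."""

    SF_SamplingPoint = 0
    SF_SamplingCurve = 1
    SF_SamplingSurface = 2
    SF_SamplingSolid = 3


# Assumed assignment of the slide's examples; not taken from the talk itself.
EXAMPLES = {
    "station": SamplingFeature.SF_SamplingPoint,
    "transect": SamplingFeature.SF_SamplingCurve,
    "section": SamplingFeature.SF_SamplingSurface,
}


def classify(kind: str) -> SamplingFeature:
    """Look up the sampling feature class for a named example (case-insensitive)."""
    return EXAMPLES[kind.strip().lower()]


for kind in ("Station", "Transect", "Section"):
    feature = classify(kind)
    print(f"{kind}: {feature.name} (dimension {feature.value})")
```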

Observations Data Model v2 (ODM2)
Kerstin Lehnert: "Making small data BIG: Insights from a Long-tail Geoscience Domain"
ODM2 Team:
J S Horsburgh
A K Aufdenkampe
L Hsu
A Jones
K Lehnert
E Mayorga
L Song
D Tarboton
I Zaslavsky

18
Data Templates
LPSC 2015 Workshop: Restoration and Synthesis of Planetary Geochemical Data

Persistent Unique Identifiers
 Samples: IGSN
 Datasets: DOI
 Article publications: DOI
 Awards & grants: FundRef
 Researchers: ORCID
 Field programs: Cruise ID
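Several of these identifiers can be checked or resolved mechanically. As a sketch, the code below validates the check character of an ORCID iD (ISO 7064 MOD 11-2) and builds a DOI resolver URL; the function names are chosen here for illustration. The ORCID iD shown is the fictitious sample iD used in ORCID's own documentation, and the DOI is only an example string.

```python
def orcid_check_char(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 ORCID digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check length, digits, and check character of a hyphenated ORCID iD."""
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_char(compact[:15]) == compact[15].upper()


def doi_url(doi: str) -> str:
    """Return the resolver URL for a DOI via the doi.org proxy."""
    return f"https://doi.org/{doi}"


print(is_valid_orcid("0000-0002-1825-0097"))
print(doi_url("10.1000/182"))
```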

22
Internet of Samples in the Earth Sciences
 Physical samples need to be linked to the digital data
generated by their study.
 Reproducibility! Access to the physical samples is required to
verify & reproduce observations.
 Re-usability! Access to information about samples is required
for proper evaluation & interpretation of sample-based data.
 Physical samples need to be shared broadly for use &
re-use.
 Samples are often expensive to collect (drilling, remote locations).
 Many samples are unique and irreplaceable.
 Re-analysis augments utility of existing data.
 Samples often serve in ways that the collectors and repositories could
not have imagined.

23
Unique Sample Identification
 Imagine the possibilities …
 Easily find a specific sample and contact its owner
 Find all publications that mention a specific sample
 Find all data for that sample across the literature
and distributed databases
 Find other samples with similar properties
 geospatial
 temporal
 compositional

24
Sample Identification Until Now
 Samples have ambiguous and non-persistent
names and cannot be properly cited.
The EarthChem Portal shows 75 publications with geochemical data referenced to a sample with the name M1 (or M-1). (www.earthchem.org)
Names of dredge sample 3 of the Amphitrite cruise (PetDB database, www.petdb.org)
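The ambiguity problem can be made concrete with a toy example. In this sketch the records, paper labels, and identifiers are all invented placeholders (not registered IGSNs): grouping by sample name merges distinct samples, while grouping by a unique identifier keeps them apart.

```python
from collections import defaultdict

# Hypothetical records: (sample name as published, unique identifier, source).
RECORDS = [
    ("M1", "XXX000001", "Paper A"),
    ("M-1", "XXX000001", "Paper B"),  # same sample, different spelling
    ("M1", "XXX000002", "Paper C"),   # different sample, same name
]


def normalize(name: str) -> str:
    """Collapse common spelling variants of a sample name."""
    return name.replace("-", "").replace(" ", "").upper()


def group_sources(records, key):
    """Group source labels by the value that key() returns for each record."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record[2])
    return dict(groups)


by_name = group_sources(RECORDS, lambda r: normalize(r[0]))
by_id = group_sources(RECORDS, lambda r: r[1])

print(by_name)  # one group: the two physical samples cannot be told apart
print(by_id)    # two groups: one per physical sample
```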

25
Sample Identification From Now:
IGSN: International Geo Sample Number
 Persistent unique identifier for physical objects in
the Earth Sciences
 Global uniqueness guaranteed via governance by the
IGSN e.V.
 Persistent access and preservation of sample
metadata
 Cataloguing services of IGSN e.V. members
 Enables building a central search engine
 Resolving service of the IGSN central registry
 Does not replace personal or institutional naming
protocols

IGSN: Examples
 Oriented Core
 Drill Hole (ODP)
 Soil Section
 Rock Specimen

27
IGSN Status
 International governance established in 2011
 14 members (organizations) in the IGSN e.V. (www.igsn.org)
 ca. 4 million samples registered (registration tripled in 2014)
 >350 active users, including
 increasing number of individual scientists
 sample repositories & museums (Smithsonian, marine cores)
 geological surveys (USGS, Geoscience Australia, BGR)
 large-scale observatories and sampling campaigns (ICDP, IODP, CZO, DCO, GeoPRISMS, etc.)

IGSN Adoption
COPDESS Statement of Commitment

IGSN in Action: Publications

32
Metadata
 Identification
 Sample name(s), registrant
 Description
 Material, classification, age, size, comments
 Geospatial information
 Geographical names, coordinates
 Collection
 Expedition/cruise, platform, date, collector,
technique
 Archiving/access
 Physical location of sample (repository), contact
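The metadata groups listed above could be captured in a simple record type with basic checks. This is a sketch only: the field names, the validation rules, and the sample values are hypothetical and do not reproduce the IGSN metadata schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SampleMetadata:
    """Core sample metadata; all field names here are hypothetical."""

    name: str                          # Identification
    registrant: str
    material: str                      # Description
    latitude: Optional[float] = None   # Geospatial information
    longitude: Optional[float] = None
    expedition: Optional[str] = None   # Collection
    collector: Optional[str] = None
    repository: Optional[str] = None   # Archiving/access

    def problems(self) -> List[str]:
        """Return a list of human-readable issues; empty if none are found."""
        issues = []
        for attr in ("name", "registrant", "material"):
            if not getattr(self, attr).strip():
                issues.append(f"missing {attr}")
        if self.latitude is not None and not -90 <= self.latitude <= 90:
            issues.append("latitude out of range")
        if self.longitude is not None and not -180 <= self.longitude <= 180:
            issues.append("longitude out of range")
        if self.repository is None:
            issues.append("no repository given")
        return issues


sample = SampleMetadata(name="DR-03", registrant="A. Curator", material="rock",
                        latitude=95.0, longitude=-30.0)
print(sample.problems())
```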

IGSN Sample “Genealogy”
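A sample genealogy is a parent-child hierarchy, for example a drill hole, a core taken from it, and a section cut from that core. A minimal sketch of tracing a sample back to its origin follows; the identifiers are invented placeholders, not registered IGSNs, and the flat child-to-parent mapping is an assumption made for illustration.

```python
# Hypothetical child -> parent links between sample identifiers.
PARENT_OF = {
    "CORE-1": "HOLE-A",
    "SECTION-1": "CORE-1",
    "SPLIT-1": "SECTION-1",
}


def lineage(sample_id: str, parent_of: dict) -> list:
    """Return the chain from a sample up to its top-level ancestor."""
    chain = [sample_id]
    seen = {sample_id}
    while chain[-1] in parent_of:
        parent = parent_of[chain[-1]]
        if parent in seen:  # guard against circular links in bad data
            raise ValueError(f"circular parent link at {parent}")
        chain.append(parent)
        seen.add(parent)
    return chain


print(" <- ".join(lineage("SPLIT-1", PARENT_OF)))
```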

34
Extended IGSN Metadata
 Images
 Documents (.pdf, .xls, .doc)
 References
 URLs for related data resources
 User defined metadata

 Advance use of innovative CI to connect physical samples
across the Earth Sciences with digital data infrastructure
 Goals:
 Improve discovery, access, and re-usability of physical samples
 Improve re-usability and reproducibility of the data generated by their
study
Registries &
Catalogs
Metadata
Identifiers
Citation
Repositories
Software Tools
Taxonomies

C4P: Collaboration & Cyberinfrastructure for Paleoscience
An EarthCube Research Coordination Network
Unravel the large-scale, long-term evolution of the Earth-Life System
through the study of the geological record
Major challenges C4P addresses:
• Heterogeneous & dispersed data
• Modeling of age & time
• Legacy & ‘dark’ data
• Limited interoperability among resources
• Variable semantics & ontologies
A diverse community:
paleobiology, paleoclimate, paleoceanography, geochemistry,
dendrochronology, stratigraphy, geochronology, sample
curation, data management, bioinformatics, semantics,
software architecture, and more ...
C4P achievements:
• New resources
• data & software catalogs
• Educational materials (webinars)
• New collaborations
• Convergence on best practices (samples,
age, taxonomy)

37
Take Away Messages
 develop leading practices for data
 get community buy-in
 align & coordinate with existing leading
practices
 leverage existing infrastructure
 get started and don’t let the challenges stop
you

The Cultural Challenges
“The Hitchhiker’s Guide to Geoinformatics” (Lee Allison, LISTMG Workshop 2004)
“Building an International Collaboration for Geoinformatics” (Walter Snyder, AGU 2005)
“Cyberinfrastructure for Solid Earth Geochemistry” (Kerstin Lehnert, GSA 2003)

39
Thank You!
“The wonderful thing about standards is that there are so many of them to choose from.”
(Grace Hopper)
