A look at the history of research data infrastructure development, drawing on general patterns of infrastructure development and on experiences from the evolution of the IEDA data facility, to inform future pathways and developments. A major focus of the lecture is the FAIR principles and the issues surrounding the reusability of data.
Big Data, Beyond the Data Center
Increasingly, the next scientific discoveries and the next industrial breakthroughs will depend on the capacity to extract knowledge and sense from gigantic amounts of information. Examples range from processing data produced by scientific instruments such as CERN’s LHC; collecting data from large-scale sensor networks; grabbing, indexing and nearly instantaneously mining and searching the Web; building and traversing billion-edge social network graphs; to anticipating market and customer trends through multiple channels of information. Collecting information from various sources, recognizing patterns and distilling insights constitutes what is called the Big Data challenge. However, as the volume of data grows exponentially, the management of these data becomes more complex in proportion. A key challenge is to handle the complexity of data management on hybrid distributed infrastructures, i.e. assemblages of Clouds, Grids and Desktop Grids. In this talk, I will give an overview of our work in this research area, starting with BitDew, a middleware for large-scale data management on Clouds and Desktop Grids. Then I will present our approach to enabling MapReduce on Desktop Grids. Finally, I will present our latest results around Active Data, a programming model for managing the data life cycle on heterogeneous systems and infrastructures.
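The MapReduce model mentioned in the abstract can be illustrated with a minimal, single-machine word-count sketch. This shows only the programming model (map, shuffle, reduce), not the BitDew middleware or the Desktop Grid implementation the talk describes:

```python
from collections import defaultdict

def map_phase(docs):
    # map: emit a (word, 1) pair for every word in every document
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: aggregate the grouped values per key
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["grid cloud grid", "cloud desktop"])))
# counts == {"grid": 2, "cloud": 2, "desktop": 1}
```

In a distributed setting, map and reduce tasks run in parallel on different nodes and the shuffle moves intermediate pairs across the network; the structure of the computation stays the same.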
A description of BRISSKit, an open source tool that may be used to combine datasets held in different locations and analyse them for the purpose of research. Talk given by Jonathan Tedds of the University of Leicester for the Data Management in Practice workshop, which took place on 14 November 2013 at the London School of Hygiene and Tropical Medicine.
ODI Node Vienna: Best-Practice Examples of Open Innovation through Open Data - Martin Kaltenböck
Presentation given as part of the Data Pioneers Workshop on 10 October 2016 at the BMVIT on the topic of open innovation and open data (open innovation through open data), by Elmar Kiesling (TU Wien) and Martin Kaltenböck (SWC) for the ODI (Open Data Institute) Node Vienna.
The following brief details the use of linked data to connect various high quality data sets produced by the U.S. Environmental Protection Agency. Linked data is an open standards way to publish and consume data. Using a linked data approach and the REST API, developers, scientists, and the public can more easily find, access and re-use authoritative data published by the EPA.
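The linked data idea described above can be sketched with a toy in-memory triple store: every statement is a (subject, predicate, object) triple, and an object can itself be the subject of further triples, so datasets connect into a graph that can be traversed. All identifiers and values below are hypothetical illustrations, not real EPA URIs or data:

```python
# Hypothetical triples linking a facility record to a dataset and a place.
triples = [
    ("facility:42", "rdf:type", "epa:Facility"),
    ("facility:42", "epa:locatedIn", "place:OH"),
    ("facility:42", "epa:reportsTo", "dataset:TRI"),
    ("place:OH", "rdfs:label", "Ohio"),
    ("dataset:TRI", "rdfs:label", "Toxics Release Inventory"),
]

def objects(subject, predicate):
    """Return all objects for a given subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Follow a link from the facility to a connected dataset and read its label.
dataset = objects("facility:42", "epa:reportsTo")[0]
label = objects(dataset, "rdfs:label")[0]
# label == "Toxics Release Inventory"
```

In practice the triples would be published as RDF and retrieved over HTTP via a REST API, with each identifier being a dereferenceable URI; the traversal logic is the same.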
US EPA Resource Conservation and Recovery Act published as Linked Open Data - 3 Round Stones
A presentation by 3 Round Stones to the US EPA on the new Linked Open Data Management System, including Linked Open Data on 4M facilities (from FRS), 25 years of Toxic Release Inventory (TRI), chemical substances (SRS), and Resource Conservation and Recovery Act (RCRA) content. This represents one of the largest Open Data projects published by a federal government agency using Open Source Software (OSS), Open Web Standards and government Open Data.
Present: Our lives, as well as every field of business and society, are continuously transformed by our ability to collect meaningful data in a systematic fashion and turn it into value. We are increasingly connected to data sources, have unprecedented distributed infrastructure capabilities, and continuously improve our scientific and analytical capabilities. Renewed interest in the evolving field of data science has emerged in response to these advances.
Potential: The state of the art and present challenges come with many opportunities. They not only push for new and innovative capabilities in composable data management and analytical methods that can run anytime, anywhere but also require methods to bridge the gap between applications and such capabilities. However, we often lack collaborative culture and effective methodologies to translate these newest advances into impactful solution architectures that can transform science, society, and education.
Future: A Collaborative Networked World as a Part of the Data Science Process: Any solution architecture for data science today depends on the effectiveness of a multi-disciplinary data science team, comprising not only humans but also analytical systems and infrastructure as inter-related parts of the solution. Focusing from the beginning of any activity on collaboration and communication between people, and on dynamic, predictable and programmable interfaces to systems and scalable infrastructure, is critical. This talk will provide an overview of some of our recent work on networked application architectures for dynamic data-driven wildfire modeling and smart cities. It will also explain how focusing on (1) some P’s in the planning phases of a data science activity and (2) creating a measurable process that spans multiple perspectives and success metrics was effective in making these solutions scalable. Lastly, it will introduce the PPODS methodology and a family of composable tools for team-based data science process management and training.
By Sander Janssen, Research Team Leader of Earth Observation and Environmental Informatics at Alterra, Wageningen UR.
12 April 2017, 14:00 CET
--The webinar was held as part of ASIRA (Access to Scientific Information Resources in Agriculture) Online Course for Low-Income Countries--
This presentation focuses on the political context of open data publishing and on methodological frameworks for estimating the impacts of open data, and highlights the Open Data Journal for Agricultural Research as a publication channel for open data sets. It also builds on personal reflections on publishing open data from Dr. Janssen’s own research career.
For more on the topic: http://aims.fao.org/activity/blog/join-free-webinar-publishing-open-data-agricultural-research
2013 DataCite Summer Meeting - California Digital Library (Joan Starr - Calif...) - datacite
2013 DataCite Summer Meeting - Making Research better
DataCite. Co-sponsored by CODATA.
Thursday, 19 September 2013 at 13:00 - Friday, 20 September 2013 at 12:30
Washington, DC. National Academy of Sciences
http://datacite.eventbrite.co.uk/
End-to-End Research Data Management for the Responsible Conduct of Research - ARDC
Louise Wheeler presented at the University of Technology Sydney's RIA Data Management Workshop on 21 June 2018. In partnership with the Australian Research Council, the National Health and Medical Research Council, the Australian Research Data Commons, and RMIT University, this is part of a national workshop series in data management for research integrity advisors.
A modified k-means algorithm for big data clustering - SK Ahammad Fahad
The amount of data grows every moment, and it comes from everywhere: social media, sensors, search engines, GPS signals, transaction records, satellites, financial markets, e-commerce sites, etc. This large volume of data may be structured, semi-structured or unstructured, so it is important to derive meaningful information from such huge data sets. Clustering is the process of categorizing data such that items are grouped in the same cluster when they are similar according to specific metrics. In this paper, we work on the k-means clustering technique to cluster big data. Several methods have been proposed for improving the performance of the k-means clustering algorithm; we propose a method that makes the algorithm less time-consuming and more effective and efficient, yielding better clustering with reduced complexity. According to our observation, the quality of the resulting clusters depends heavily on the selection of the initial centroids and on how data points change clusters in subsequent iterations. After a certain number of iterations, only a small fraction of the data points change their clusters. Therefore, our proposed method first finds the initial centroids and then separates the data elements that will not change their cluster from those that may change their cluster in subsequent iterations, which reduces the workload significantly for very large data sets. We evaluate our method on different data sets and compare it with other methods.
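As context for the modification described above, here is a minimal sketch of the standard k-means (Lloyd's) loop the paper builds on, instrumented to count how many points change cluster per iteration, which is the observation motivating the proposed optimization. This is a generic baseline, not the authors' algorithm, and the initialization is plain random sampling:

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Standard k-means; returns cluster assignments and the number of
    points that changed cluster in each iteration."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # naive random initialization
    assign = [-1] * len(points)
    changes_per_iter = []
    for _ in range(max_iters):
        changed = 0
        for i, p in enumerate(points):
            # assign each point to its nearest centroid (squared distance)
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[j])))
            if nearest != assign[i]:
                assign[i] = nearest
                changed += 1
        changes_per_iter.append(changed)
        if changed == 0:                       # converged: no point moved
            break
        for j in range(k):                     # recompute centroids as means
            members = [points[i] for i in range(len(points)) if assign[i] == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return assign, changes_per_iter

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assign, changes = kmeans(pts, k=2)
```

Typically `changes` drops quickly toward zero; the paper's idea is to exploit this by excluding points that provably will not change cluster from later iterations.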
Birgit Schmidt: RDA for Libraries from an International Perspective - dri_ireland
From "A National Approach to Open Research Data in Ireland", a workshop held on 8 September 2017 in National Library of Ireland, organised by The National Library of Ireland, the Digital Repository of Ireland, the Research Data Alliance and Open Research Ireland.
Understanding the Big Picture of e-Science - Andrew Sallans
A. Sallans. "Understanding the Big Picture of e-Science." Presented at the 2011 eScience Bootcamp at the University of Virginia's Claude Moore Health Sciences Library. 4 March 2011
The Department of Energy's Integrated Research Infrastructure (IRI) - Globus
We will provide an overview of DOE’s IRI initiative as it moves into early implementation, what drives the IRI vision, and the role of DOE in the larger national research ecosystem.
Data repositories are the core components of an Open Data Ecosystem. To arrive at a comprehensive model of the data ecosystem, supporting tools and services, the FAIR principles, the joint storage of open data and clinical data, and the integration of analysis tools should all be considered. The aim was to create a data ecosystem model suitable for sharing open data together with sensitive data. For this purpose, several tools and services were included in our data ecosystem model: Research Data Marts, i2b2/tranSMART, CKAN, Dataverse, figshare, OSF (Open Science Framework), ... This multitude of services supports research data repositories. Different types of repositories are connected and supplement each other in the storage, release and sharing of data with different degrees of protection and data ownership. Tools to analyze, browse and visualize data are integrated into the data flow between repositories. Results of our ecosystem analysis:
It doesn't matter where one stores data, because everything is connected for data sharing: institutional repositories with dataverses, data marts, general repositories, domain-specific repositories, figshare, etc. Data governance and privacy protection are integrated at the early stage of data generation.
Research Data Management Initiatives at the University of Edinburgh - Robin Rice
This paper will discuss the issues involved in exploring university obligations in the area of research data management, while conveying the current state of progress at one institution, Edinburgh. The issues are fairly static – from data ownership and rights to retention and sustainability – but the solutions are a moving target as the research environment and its technologies continue to change, subtly altering what is perceived as possible, feasible, and desirable. The planned University of Edinburgh approach to research data storage and management will be outlined.
Digital Representation of Physical Samples in Scientific Publications - Kerstin Lehnert
Presentation about the digital representation of physical samples in scientific publications, given at the European Geoscience Union meeting 2015 in the Splinter Meeting 1.36 "Digital Representation of Physical Samples in Scientific Publications".
This slide deck provides an update on the development of the Astromaterials Data System, a project funded by NASA to ensure the long-term accessibility and utility of lab analytical data acquired on astromaterials samples curated at the Johnson Space Center, including samples collected on the moon during the Apollo missions and meteorites collected in Antarctica.
Presentation about geochemical research data access and publication provided to the Australian Geochemistry Network by Kerstin Lehnert of EarthChem and the Astromaterials Data System
Boosting Data Science in Geochemistry: We Need Global Geochemical Data Standa... - Kerstin Lehnert
Presentation at AGU Fall Meeting 2018: Large-scale, global geochemical data syntheses like EarthChem and GEOROC have, for nearly two decades, inspired and made possible a vast range of scientific studies and new discoveries, facilitating the analysis and mining of geochemical data and creating new paradigms in geochemical data analysis such as statistical geochemistry. These syntheses provide easy access to fully integrated compilations of thousands of datasets (‘data fusion’) with millions of geochemical measurements that are accompanied by comprehensive and harmonized metadata for context and provenance to search, filter, sort, and evaluate the data.
The syntheses have been assembled and maintained through manual labor by data managers, who extract data and metadata from text, tables, and supplements of publications for inclusion in the databases, a time-consuming task due to the multitude of data formats, units, normalizations, vocabularies, etc., i.e. lack of best practices for geochemical data reporting. In order to support and advance future science endeavors that rely on access to and analysis of large volumes of geochemical data, we need to develop and implement global standards for geochemical data that not only make geochemical data FAIR (Findable, Accessible, Interoperable, Re-usable), but ready for data fusion. As more geochemical data systems are emerging at national, programmatic, and subdomain levels in response to Open Access policies and science needs, standard protocols for exchanging geochemical data among these systems will need to be developed, implemented, and governed.
Critical is the alignment with existing standards such as the Semantic Sensor Network (SSN) ontology, a recent joint W3C and OGC standard that standardizes description of sensors, observation, sampling, and actuation, with sufficient flexibility to allow details of these elements to be defined in different domains. New initiatives within the International Council for Science and CODATA are working towards coordinating the International Science Unions to identify and endorse the more authoritative standards (including vocabularies and ontologies). These initiatives present a timely opportunity for geochemical data to ensure that they are born ‘connected’ within and across disciplines.
Presentation that describes the experiences and insights gained by the IEDA data facility during more than 10 years of building cyberinfrastructure for a long-tail science community, geochemistry.
Advancing Reproducible Science from Physical Samples: The IGSN and the iSampl... - Kerstin Lehnert
Presentation at the Geological Society of America (GSA) meeting 2016 in the session on FOSSIL SPECIMENS 0'S AND 1'S: DATABASES, STANDARDS, & MOBILIZATION
Making Small Data BIG (UT Austin, March 2016) - Kerstin Lehnert
Presentation given at the Texas Advanced Computing Center. It describes the potential of re-using small data for new science, achievements and the challenges to make small data re-usable.
IGSN: The International Geo Sample Number (DFG Roundtable) - Kerstin Lehnert
This presentation provides an overview of the rationale for the IGSN, of the organizational structure and architecture of the IGSN e.V., and of the System for Earth Sample Registration.
Research Data Infrastructure for Geochemistry (DFG Roundtable) - Kerstin Lehnert
This presentation provides an overview of different aspects of data management for geochemistry and resources available at the EarthChem@IEDA data facility.
Interdisciplinary Data Resources for Volcanology at the IEDA (Interdisciplina... - Kerstin Lehnert
Presentation given at the EGU 2015 General Assembly in session "Methods for Understanding Volcanic Hazards and Risks" (NH2.2), describing EarthChem data systems that make accessible and synthesize geochemical data of volcanic rocks and gases, and the System for Earth Sample Registration that catalogs sample metadata and provides persistent unique sample identifiers (International Geo Sample Number IGSN). It also mentions EarthChem's plans and ongoing work to link geochemical data with other volcanological databases, and the IEDA data rescue initiative.
Presentation about the IGSN and ongoing initiatives for the Internet of Samples at the EGU 2015 short course "Open Science Goes Geo: Beyond Data and Software".
Lehnert: Making Small Data Big, IACS, April 2015 - Kerstin Lehnert
Seminar presentation at the Institute for Advanced Computational Science at Stony Brook University, April 9, 2015, describing achievements and challenges of data infrastructure in a long-tail science domain with the example of geochemistry.
iSamples Research Coordination Network (C4P Webinar) - Kerstin Lehnert
The iSamples (Internet of Samples in the Earth Sciences) Research Coordination Network is part of EarthCube and focuses on the integration of physical samples and collections into digital data infrastructure in the Earth sciences. This presentation summarizes the activities of the iSamples RCN and presents results from a major community survey about sharing and management of physical samples that was conducted as part of the RCN.
MoonDB: Restoration & Synthesis of Planetary Geochemical Data - Kerstin Lehnert
This presentation explains the MoonDB project that will restore and synthesize geochemical and petrological data acquired on lunar samples over more than 4 decades. The project is a collaboration between the IEDA data facility (http://www.iedadata.org) at the Lamont-Doherty Earth Observatory of Columbia University and the Astromaterials Acquisition and Curation Office (AACO) at Johnson Space Center (JSC).
This presentation was part of a workshop of IEDA (http://www.iedadata.org) at the AGU (American Geophysical Union) Fall Meeting 2013 in San Francisco that was intended as an introduction to the topic of data publication.
Adjusting primitives for graphs: SHORT REPORT / NOTES - Subhajit Sahu
Notes on primitives used by graph algorithms such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential vs OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential vs OpenMP-based vector element sum.
2. Performance of memcpy vs in-place CUDA-based vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA-based vector element sum (in-place).
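The CSR representation mentioned in the notes above can be sketched as a generic construction from an edge list. This is an illustration of the data structure, not the report's actual (OpenMP/CUDA) code:

```python
def to_csr(num_vertices, edges):
    """Build a Compressed Sparse Row adjacency structure from (src, dst)
    edge pairs: targets[offsets[u]:offsets[u+1]] are u's out-neighbors."""
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:               # count the out-degree of each vertex
        offsets[src + 1] += 1
    for u in range(num_vertices):      # prefix-sum the counts into offsets
        offsets[u + 1] += offsets[u]
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()       # next free slot per source vertex
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets

def neighbors(offsets, targets, u):
    return targets[offsets[u]:offsets[u + 1]]

offsets, targets = to_csr(4, [(0, 1), (0, 2), (2, 3), (1, 3)])
# offsets == [0, 2, 3, 4, 4]; neighbors(offsets, targets, 0) == [1, 2]
```

Storing all adjacency lists contiguously in two flat arrays is what makes CSR cache-friendly and easy to map onto OpenMP threads or CUDA thread blocks.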
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... - Subhajit Sahu
Abstract: Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with the precondition that the input graph contain no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large number of small workload submissions, and is expected to be a non-issue when the computation is performed on massive graphs.
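The core idea of the abstract can be sketched in a few lines: iterate one strongly connected component at a time, in topological order, reusing the already-final ranks of upstream components. This is a toy illustration with hand-specified components and a dead-end-free graph (the stated precondition), not the report's CPU/GPU implementation:

```python
def pagerank(edges, n, d=0.85, iters=200):
    """Standard ('monolithic') power iteration; assumes no dead ends."""
    out = [0] * n
    for u, v in edges:
        out[u] += 1
    ranks = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1 - d) / n] * n
        for u, v in edges:
            nxt[v] += d * ranks[u] / out[u]
        ranks = nxt
    return ranks

def pagerank_levelwise(edges, n, components, d=0.85, iters=200):
    """Process strongly connected components in topological order; ranks
    of upstream components are already final when a component is processed."""
    out = [0] * n
    for u, v in edges:
        out[u] += 1
    ranks = [1.0 / n] * n
    for comp in components:
        members = set(comp)
        for _ in range(iters):          # iterate only this component
            nxt = {v: (1 - d) / n for v in comp}
            for u, v in edges:
                if v in members:
                    nxt[v] += d * ranks[u] / out[u]
            for v in comp:
                ranks[v] = nxt[v]
    return ranks

# Two SCCs, {0, 1} upstream of {2, 3}; every vertex has an out-edge.
edges = [(0, 1), (1, 0), (1, 2), (2, 3), (3, 2)]
mono = pagerank(edges, 4)
level = pagerank_levelwise(edges, 4, [[0, 1], [2, 3]])
# both converge to the same fixed point
```

Because ranks flow only forward along the component DAG, each level can be converged independently, which is what removes the per-iteration communication in a distributed setting.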
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group (“MCG”) expects demand and the evolution of supply to shift, driven by institutional investment rotating out of offices and into work from home (“WFH”) and by the ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to expect strong annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment, will drive market momentum forward. The continuous injection of capital by alternative investment firms, as well as growing infrastructure investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel data center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
1. Data Infrastructure for the Earth & Space Sciences: How Far Have We Come, Where Are We Heading?
Kerstin Lehnert
Lamont-Doherty Earth Observatory, Columbia University
April 10, 2018
Ian McHarg Lecture 2018
2. Before I start, a short detour ...
The Kaiserstuhl, Germany
4. My goal
“Study the past if you would define the future.” (Confucius)
5. Learning from the past: (1) The Big Picture
2007 / 2018
https://www.rd-alliance.org/sites/default/files/Common_Patterns_in_Revolutionising_Infrastructures-final.pdf
6. Learning from the past: (2) The Real World
The story of IEDA (Interdisciplinary Earth Data Alliance)
www.iedadata.org
... there was a database named PetDB
7. A biased perspective
I am a geoscientist who directs a US data facility for primarily investigator-based data (“long tail”), funded by the National Science Foundation.
www.iedadata.org
8. Defining the Topic
Data infrastructure is a digital infrastructure promoting data sharing and consumption. Its goal is to enable researchers to make the best use of the world’s growing wealth of data for the advancement of science and the benefit of society.
9. Data drive Earth science: A new way of understanding the world
Data: The 4th Paradigm / The 5th Dimension
10. We have been talking about it for a while ...
2006
12. Growth of Earth & Space Science Informatics
AGU Fall Meeting 2017:
63 ESSI session proposals – an increase of 40%
729 ESSI abstracts – an increase of ~18.7%
35 ESSI oral sessions – an increase of ~40%
4 Data Fair Town Halls
Machine Learning/Deep Learning: biggest increase in any theme
Big increases also in FAIR, Repositories & Data Storage, and Adoption & Adaption
Credit: Lesley Wyborn, AGU FM Program Committee Member
Carnegie Institution: Unleash the Power of Data
14. Learning from the past: The Big Picture
Insights into the development of infrastructures
15. Revolutionary!
Roman water supply system
Railroad systems
Global electrification
Internet
16. Patterns of Infrastructure Development
Edwards et al. 2007
1. Deliberate and successful design of
‘local’ systems.
2. Technology transfer across domains
and locations
3. Infrastructure form via gateways
that allow dissimilar systems to be
linked into networks
Wittenburg & Strawn 2018
1. Inventions and development of
start-up systems
2. Technology transfer between
regions and also society
(creolization)
3. Planning for system growth where
"reverse salients" need to be
tackled
4. Substantial momentum (mass,
velocity, direction)
System Building
Growth
Consolidation
18. Creolization
New components are continuously introduced to solve specific challenges
Capabilities grow unevenly (e.g. big vs. small data)
Fragmentation
Leads to:
Inefficiencies in use and costs
Winners & losers: some solutions are more promising and gain more traction
Better understanding of the underlying rules, principles, and limitations
(After Wittenburg & Strawn, 2018)
19. Attraction via “Universals”
“Simple” principles, broadly supported
Directly influence only a specific part of the overall infrastructure; enable efficiency at the top layers
Form a stable basis for new developments
(After Wittenburg & Strawn, 2018)
“Universals are ... essential to create a momentum by overcoming fragmentation and achieving economies of scale.”
20. Attraction is happening!
Relevance of community organizations that
define principles, procedures, and component
specifications
RDA: global & cross-disciplinary
ESIP: Earth Science & US (others coming?)
New: RDA Interest Group “ESIP/RDA Earth,
Space, and Environmental Sciences”
21. Universal: FAIR principles
Represent a guideline for data providers to
enhance the reusability of their data holdings:
Data can be found on the Internet.
Data are accessible in a usable format with clear rights
and licenses.
Data access is reliable & persistent.
Data are identified in a unique and persistent way so
that they can be referred to and cited.
Data are documented with rich metadata.
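To make the FAIR attributes above concrete, here is a minimal sketch of a dataset-level metadata record and a completeness check in Python. The field names loosely follow common conventions (DataCite, schema.org) and the identifier and URL are placeholders, not an actual IEDA schema or DOI:

```python
import json

# Hypothetical dataset metadata record; field names and values are
# illustrative placeholders, not a specific repository schema.
record = {
    "identifier": "doi:10.0000/EXAMPLE.1",              # unique, persistent, citable
    "landing_page": "https://example.org/data/1",       # findable on the Internet
    "title": "Basalt glass geochemistry (example)",
    "creators": ["A. Researcher"],
    "license": "CC-BY-4.0",                             # clear rights and licenses
    "format": "text/csv",                               # accessible, usable format
    "keywords": ["geochemistry", "basalt"],             # rich metadata
}

def is_fair_ready(rec):
    """Check that the minimum FAIR-supporting fields are present."""
    required = {"identifier", "landing_page", "license", "format"}
    return required <= rec.keys()

print(is_fair_ready(record))
print(json.dumps(record, indent=2))
```

The check is deliberately shallow: FAIR compliance in practice also depends on the quality of the metadata values, not just the presence of fields.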
22. Universal:
Standards for data repositories
Cooperative effort between Data Seal of Approval (DSA) and the World Data
System (WDS) under the umbrella of the Research Data Alliance (RDA)
Harmonized requirements & procedures for certification of repositories
Confidence for publishers and funders which repositories to trust
Basis for development of new repositories
23. “Enabling FAIR Data” project @ AGU
Develop & implement standards that will connect researchers, publishers, and
data repositories in the Earth and space sciences to enable FAIR data
Grant from the Laura and John Arnold Foundation (LJAF) to the AGU
FAIR-compliant data repositories (CoreTrustSeal certified, preferred domain
specific)
FAIR-compliant Earth and space science publishers
Align their policies for data to be deposited in certified repositories
Gives similar experience for researchers.
Carnegie Institution: Unleash the Power of Data 23
Slide after S. Stall et al., presentation at RDA P11
Berlin, March 2018
24. All publishers who are part of the
Coalition on Publishing Data in the Earth
and Space Sciences (COPDESS) support
the efforts of trusted repositories that
aggregate research data, software, and
physical samples for the use of the
scientific community.
Carnegie Institution: Unleash the Power of Data 24
“These Data Guidelines align the
Author’s instructions for the submission
of data sets in the Earth and Space
Sciences, for all affiliated publishers.”
25. Universal:
Persistent Identifiers
Founded 2009
Founded 2011
Founded 2012
“The intention of this cross-
disciplinary report is to overcome still
existing confusions about PIDs and the
lack of detail knowledge in many
disciplines. ...to identify agreements
across documents that have been
suggested to be included by experts.”
From: “Common Patterns in Revolutionary Infrastructures and Data”, P. Wittenburg & G. Strawn, February 2018
26. Learning from the past:
(2) The Real World
The story of IEDA
(Interdisciplinary Earth Data Alliance)
...there was a database named PetDB
27. Once upon a time ...
PetDB web site in 1999
28.
Note:
PetDB is a database that provides access to data at the level of individual data points, not files!
29. Success: New data-driven science
in geochemistry
Meyzen et al. (2007): “Isotopic portrayal of the Earth's upper mantle flow field.”
Putirka et al. (2007)
Stracke & Hofmann (2005)
Class & Goldstein (2007)
2018: 740 citations
30. An analysis in 2007
T. Plank, 1999: “Within about 5 minutes of logging on for the first
time, I was staring at an EXCEL file that had all the REE on
basalt glasses from the EPR from 10°N to 20°S. And the answer
to my La/Sm question. I am very impressed, we are looking at
the future of geochemistry.”
GSA 2007 talk: “My Data, Your Data, Our Data!”
32. Another failed network attempt
PaleoStrat not funded
Development of interoperability
with CoreWall not funded
Too many political obstacles
“Promises, Achievements, and Challenges of
Networking Global Geoinformatics Resources”
EGU General Assembly 2008
33. Growth of data systems at Lamont
34. Consolidation
“This Cooperative Agreement converts a series of proposal/award-driven
activities into a community-based facility that serves to support, sustain,
and advance the geosciences by providing a centralized location for the
registry of and access to data essential for research in the solid-earth and
polar sciences.”
- Continue operating & maintaining existing systems
- Develop tools for investigators to comply with NSF data policies (IEDA Data
Management Plan Tool & Data Compliance Reporting Tool)
- Develop tools and modify architecture to provide integrated access to holdings
36. IEDA Today: Data Holdings & Growth
> 70 TeraBytes of marine geophysical sensor data in the MGDS
> 20 million analytical measurements for >1 million samples in
EarthChem
> 4.2 million samples registered and searchable in SESAR (System for Earth Sample Registration)
11/15/17, Presentation at NSF-EAR
37. IEDA Today
Thousands of download requests per
month
>2,000 citations in the literature
~ 10,000 start-ups of GeoMapApp per
month
>2,700 GeoPass users*
Demonstrated impact on science
*GeoPass accounts are required to submit data to EarthChem/
Geochron, SESAR, & USAP-DC, and to use the DMP Tool
[Chart: Citations of IEDA Systems in the Scientific Literature: number of citations per year (0–250) for EarthChem/PetDB/SedDB and MGDS/GMRT/GMA]
38. IEDA is “attracting”
👍
Certification: Member of World Data System since 2011 (CoreTrustSeal
certification underway)
Use of Persistent Identifiers
Publication agent of DataCite since 2011
DOI registration of datasets since 2009 via TIB Hannover
The International Geo Sample Number: A PID for physical samples
FAIR data
Findable/accessible: DOIs, landing pages, GUIs, APIs
Interoperable: CSW, DataONE member node, schema.org (EarthCube project P418)
Reusable: disciplinary expertise for data curation, rich provenance metadata
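The “findable/accessible” role of DOIs listed above can be exercised programmatically through DOI content negotiation: asking the resolver at doi.org for machine-readable citation metadata rather than the human landing page. A minimal sketch (the DOI shown is a placeholder, and no network request is actually made here):

```python
from urllib.request import Request

def doi_metadata_request(doi: str) -> Request:
    """Build a request that asks the DOI resolver for citation metadata
    (CSL JSON) instead of redirecting to the landing page."""
    return Request(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )

# Placeholder DOI for illustration only.
req = doi_metadata_request("10.0000/EXAMPLE.1")
print(req.full_url)
print(req.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would return the metadata as JSON for any registered DOI; the same mechanism underlies automated citation tooling.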
40. Merger of EarthChem & MGDS created
tensions
Partner system needs versus overarching IEDA level needs
Budget
Staff expertise
Staff allocations
Distribution among different funding sources (3 different NSF programs)
Scientific utility versus trustworthiness of operations
Operation & maintenance versus innovation
41. Merger did not lead to the expected
‘economies of scale’
Disciplinary data curation continues as the most relevant component.
Additional resources/effort needed for coordination and alignment of
activities and practices across partners.
More project management required due to budget level and status as facility.
Building useful data search and discovery across multi-disciplinary systems is a
challenging problem.
[Chart: cost per system]
43. Access to all IEDA repositories in one place
Free text, map, and facet-based search
options
ISO metadata available for other catalogs to
harvest
Major work to align concepts and
vocabularies in the different repositories
Challenge to agree on facets
Relevance to different data types
Availability of metadata
Granularity of datasets
Achievements:
IEDA Integrated Catalog
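Facet-based search of the kind described above reduces, at its core, to counting matching records per controlled value, which is exactly why agreeing on facets and vocabularies across repositories matters. A toy sketch (the facet names and records are invented for illustration):

```python
from collections import Counter

# Toy catalog records; "data_type" stands in for one of the facets that,
# as noted above, must be aligned across the different repositories.
records = [
    {"repo": "EarthChem", "data_type": "geochemistry"},
    {"repo": "MGDS", "data_type": "geophysics"},
    {"repo": "EarthChem", "data_type": "geochemistry"},
    {"repo": "SESAR", "data_type": "sample"},
]

def facet_counts(records, facet):
    """Count records per facet value -- the numbers shown beside each
    facet entry in a typical search interface."""
    return Counter(r.get(facet, "unknown") for r in records)

print(facet_counts(records, "data_type"))
```

If one repository calls a value "geochemistry" and another "chemistry of rocks", the counts fragment, which is the vocabulary-alignment problem in miniature.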
44. A changing ecosystem
“IEDA’s cross-disciplinary services for data discovery (IEDA Data Browser)
and data access (IEDA Integrated Catalog) across all IEDA systems are
increasingly superseded by tools developed with substantially larger
resources as part of EarthCube, Google (Google’s new Research Data
Search based on schema.org), or perhaps DataONE. These recent
developments aim to provide researchers with the tools to find and use
data in a highly distributed and fragmented data infrastructure based on
new approaches for interoperability, metadata registries, and hubs such
as SCHOLIX to link data and literature.”
IEDA: Future Scope and Structure
(IEDA internal report, K. Lehnert & S. Carbotte, January 2018)
45. We need to adapt
- Reduce complexity of operations
- Adjust to and better leverage external CI developments (e.g. EarthCube)
- Enhance opportunities to grow partnerships relevant to the disciplinary systems to target needs of the disciplinary communities
- Systems and/or services that serve broader audiences should be funded independently (SESAR, GeoMapApp, GMRT)
- Create a new management/governance structure
  - more independence for IEDA partners and funders to allow growth
  - rely on external developments for cross-disciplinary services
46. Where are we heading from here?
47. Oh no, that diagram again ...
According to Wittenburg & Strawn (2018), the implementation of data infrastructure can be guided by 4 statements:
A Digital Object has a structured bit sequence stored in a trustworthy repository.
A Digital Object has a PID and metadata.
The PID is associated with all relevant kernel information that allows humans and machines to enable FAIR.
Kernel information and Digital Object have types allowing humans and machines to associate operations with them.
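The four statements above can be sketched as a data structure. This is my own minimal illustration, not a standard: the attribute names, the example PID, and the operation registry are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class DigitalObject:
    """Sketch of the Digital Object described by the four statements;
    attribute names are illustrative, not taken from any standard."""
    pid: str                       # persistent identifier
    bit_sequence: bytes            # structured bit sequence in a repository
    do_type: str                   # type lets machines choose operations
    kernel_info: dict = field(default_factory=dict)  # PID kernel metadata
    metadata: dict = field(default_factory=dict)

    def operations(self):
        """Associate operations with the object's type, as statement 4
        describes; the registry below is a toy example."""
        registry = {"dataset": ["download", "cite"], "image": ["render"]}
        return registry.get(self.do_type, [])

obj = DigitalObject(
    pid="hdl:0.EXAMPLE/1",          # placeholder handle, not a real PID
    bit_sequence=b"...",
    do_type="dataset",
    kernel_info={"checksum": "sha256:...", "created": "2018-04-10"},
)
print(obj.operations())
```

The point of typing both the object and its kernel information is that a machine can decide what to do with an object (download, cite, render) without inspecting its bits.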
48. My take on priorities
Reusability, Impact on Science, Sustainability
Data type specific best practices
Metadata quality
Granularity of access, data fusion
Metrics
Data Science Education
Business models
Consolidation
The impact of data
infrastructure on science
& society depends on the
reusability of data and
will ultimately justify its
continued funding.
49. Reusability problem: Metadata quality
Discipline-specific and data type
specific metadata not well defined
and enforced
Lack of consistent vocabularies
Automated metadata enrichment
(e.g. CINERGI) has not yet had
convincing results
Manual data curation still best,
but too costly
“The Geochemical Data(base) Factory: From Heterogeneous Input to Homogeneous Output.” AGU FM 2009
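Part of the metadata-quality problem described above, missing required fields and uncontrolled vocabulary terms, can at least be detected automatically even where full curation cannot be afforded. A minimal sketch; the required fields and the vocabulary are invented for illustration:

```python
# Illustrative discipline-specific requirements; a real repository would
# define these per data type, which is exactly the gap noted above.
REQUIRED = {"sample_id", "material", "method"}
MATERIAL_VOCAB = {"basalt", "granite", "andesite"}

def check_record(rec):
    """Return a list of human-readable problems; an empty list means the
    record passes this (deliberately shallow) quality check."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED - rec.keys())]
    if "material" in rec and rec["material"].lower() not in MATERIAL_VOCAB:
        problems.append(f"uncontrolled material term: {rec['material']!r}")
    return problems

print(check_record({"sample_id": "S1", "material": "Basalt", "method": "XRF"}))
print(check_record({"sample_id": "S2", "material": "bslt"}))
```

Checks like this catch the mechanical errors; judging whether the metadata values are scientifically adequate still requires the disciplinary expertise that manual curation provides.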
50. Reusability problem: data wrangling
Surveys in recent years show that data scientists still spend 75-80% of their time
‘data wrangling’.
RDA EU survey 2013 (75%)
Brodie 2015 (80%)
CrowdFlower 2017 (80%)
Source:
Crowdflower
51. Reusability solution: Data Fusion
Harmonize & integrate data so that
disparate pieces of information form a
picture that can be explored to reveal
patterns in space, time, and properties.
52. Structure data so they can be accessed and
understood at a more granular level
Approaches are available and improving
ISO/OGC Observations & Measurements
Observation Data Model ODM2 (Horsburgh et al. 2017)
Schema.org
Open Core Data
Reusability solution:
Data Fusion
S. Cox et al. “Mainstream web standards now
support science data too”; AGU FM 2017
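The granular-access idea, modeling individual observations rather than opaque files, in the spirit of O&M and ODM2, can be sketched as follows. The class and field names are simplified for illustration and are not the actual ODM2 schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Observation:
    """Each value carries its own sample, variable, and units, so queries
    can address individual data points instead of whole files.
    Simplified illustration, not the real ODM2 model."""
    sample_id: str
    variable: str     # e.g. "SiO2"
    value: float
    units: str        # e.g. "wt%"

observations = [
    Observation("S1", "SiO2", 49.7, "wt%"),
    Observation("S1", "MgO", 8.1, "wt%"),
    Observation("S2", "SiO2", 51.2, "wt%"),
]

# Granular query: all SiO2 values across samples, no file parsing needed.
sio2 = [(o.sample_id, o.value) for o in observations if o.variable == "SiO2"]
print(sio2)
```

This is the same design choice PetDB made early on: once data are stored at observation granularity, fusion across datasets becomes a query rather than a file-wrangling exercise.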
53. Reusability problem: The Long Tail
Small data volumes, but big potential
Culture is not open to sharing
Data fragmented and highly heterogeneous
Lots of .xls files
Many data never see the light of day
ESIP Winter Meeting, January 2016
54. Reusability hope: Generation change
“A new scientific truth does not triumph by
convincing its opponents and making them see
the light, but rather because its opponents
eventually die, and a new generation grows up
that is familiar with it.”
Max Planck
55.
Credit: Jon Stelling, LeHigh University
56. Steps in the data life cycle are siloed in many
communities and disciplines
Recommendation: focus on the full data life
cycle
Final Report from the NSF Computer and Information Science and
Engineering Advisory Committee, Data Science Working Group
Communications of the ACM, Vol. 61 No. 4,
Pages 67-72, April 2018
57. A trend toward large facilities
58. Education in Data Science or
Data Science in Education
Data Science as a new field in academia
Different organizational models emerging at academic
institutions to integrate with domain sciences
59. I’ll leave the funding question to the
experts.
Trust of the science community
60. Funding
“Funding research data management and related infrastructures”, May 2016
Authors: Knowledge Exchange Research Data Expert Group and Science Europe Working Group
on Research Data.
61. Did we move at all?
2007
62. Success!
The International Geo Sample Number
Grew from a local, centralized system started in 2004 to
an international organization founded in 2011
Now has 24 members on 5 continents
currently 5 active Allocating Agents
Adoption by researchers, collection curators, publishers,
and funding agencies growing
Adoption spreading to other disciplines
Biology, archeology, material sciences
# of IGSNs issued by active IGSN Allocating Agents:
- IEDA: 4,261,436
- Geoscience Australia: 2,100,273
- MARUM: 100,342
- CSIRO: 30,925
- GFZ: 4,809
Organic Biomarker Data Workshop
Newest members since 2017:
USGS (USA)
BGS (UK)
CNRS (France)
IFREMER (France)
ANDS (Australia)
63. The final message: Let’s work together!
It is essential that we leverage existing capabilities and expertise.
We do not have the luxury of duplicating
effort.
We need to break down barriers between
communities and stakeholders that compete
for their piece of the pie.
NSF Workshop Cyberinfrastructure for Large Facilities, Nov 2015
64. Back to the beginning:
“Do what excites you. Follow your passion.
Don't necessarily worry about what obstacles
might be there, because there are always ways
to overcome them. But the most exciting thing
is to be able to do what you love, and just don't
let anything stand in the way of that.”
Carol Greider 2009 Nobel Prize winner
I am incredibly honored and humbled by this medal, and I really would like you to know how much this means to me. So before I start getting into the topic of RDI, I would like to take a brief detour and talk a little bit about how I got here and what the significance of this honor is in my life.
In 1982 I was about ready to finish my dissertation in petrology when I got pregnant, married, and became a housewife. The scientific work that I was doing came to an end and my career seemed to be over before it had even started. Two years after my son was born, I took a half-time position as lab technician at the Max-Planck-Institute for Chemistry in town, and even though it did not pay any real money, it brought me back into the research environment. I had amazing colleagues, who encouraged me to finish my PhD, and supported me through a rough couple of years, when I tried to be a mom during the day and catch up with science at night. But it was the best thing I have done, and I am so grateful to all those colleagues. Without that PhD, I would not have been able to get the position as Staff Associate at the Lamont-Doherty Earth Observatory, when I moved to the US in 1996. In that position I had two main duties: to run a geochemistry lab and to build a database for volcanic rock geochemistry. And that was the beginning
A lecture like this is a great opportunity to reflect on the past, where we started off and where we got to, and use the experiences that we collected ourselves in our work and the insights gained through broader developments – be they good or bad – to inform decisions regarding the future.
I will take two different looks at the past:
one is using the work of historians, economists, social scientists, and information scientists to understand the development of infrastructures and how insights can inform the development of data and cyberinfrastructure. In 2007, while preparing a presentation for an NSF workshop that was convened to envision the future of Geoinformatics in the US and globally, I found a report written by Paul Edwards and colleagues that was a real eye-opener and helped me, and I think many others, to put ongoing activities aimed at building cyberinfrastructure into context. Just last month, while preparing for this lecture, I ran into a paper by Peter Wittenburg and George Strawn that builds on the same classic book by Thomas Hughes to define the path of data infrastructure for the future.
The other one is based on my own experiences along the path of building data infrastructure for the solid earth sciences, especially the experiences gained in the creation and operation of the Interdisciplinary Earth data Alliance that I am directing.
A word of caution first: The data universe is highly complex and diverse. I cannot possibly aspire to cover all topics and address every aspect. I am a geoscientist ...
Vision:
Enable an open, extensible, and evolvable digital science ecosystem.
Facilitate research data, information, knowledge, and data tools discovery.
Enhance problem-solving processes.
Move and connect scientific data across scientific disciplines
Manage scientific workflows
Interoperation between scientific data and literature
Integrated science policy framework
Networked digital data systems & libraries that interoperate
There are a number of drivers behind building data infrastructure:
There is an ever-growing, perhaps exponentially growing, volume of data acquired in the sciences in general, and specifically in the Earth sciences, where new data acquisition technologies and computing capabilities are used to gather observations from space, in the oceans, and on land, to simulate Earth processes, and to generate models that predict future paths.
And data, together with the technologies to mine, analyze, and visualize them, are giving us new insights into the way the Earth works.
Lots of reports have come out.
There is no doubt that infrastructures have a profound effect on the nature of modern human societies.
The Roman water supply system opened the way to building the largest capital of ancient times.
Railroad systems allowed people & goods to be exchanged at previously unknown speeds and facilitated the first industrial revolution.
Global electrification changed the availability of power and facilitated the second industrial revolution.
The Internet with its web applications changed the availability of information and facilitated new kinds of businesses.
Infrastructures start with test installations, followed by small-size installations, and are then extended stepwise into interconnected systems.
“Attraction and convergence are driven mainly by efficiency and economic concerns.
The benefit of convergence is the belief of stakeholders that a stable fundament has been built, on top of which new investments and developments can be made to fully exploit the new technologies and infrastructures.”
FAIR principles are a major milestone that represents an ‘attractor’ in the solution space. But FAIR principles express policy goals; they need to be translated into actions.
When businesses merge, it is often to achieve economies of scale. Larger organizations are typically able to produce goods and services more efficiently and at a lower per-unit cost than smaller businesses because fixed costs are spread out over a larger number of units. This is not always the case, however. Sometimes when two firms merge, being larger will actually create dis-economies of scale, where per unit production costs increase because of increased coordination costs.
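The economies-of-scale argument above is simple arithmetic: per-unit cost is fixed cost spread over n units plus variable cost, so it falls as n grows, until coordination overhead, which itself can grow with n, reverses the trend. A toy illustration with invented numbers:

```python
def per_unit_cost(n, fixed=1000.0, variable=2.0, coordination=0.01):
    """Toy per-unit cost model: fixed costs spread over n units, plus a
    coordination overhead that grows with scale. All numbers invented."""
    return fixed / n + variable + coordination * n

small = per_unit_cost(100)     # 1000/100 + 2 + 1  = 13.0
merged = per_unit_cost(500)    # 1000/500 + 2 + 5  = 9.0
too_big = per_unit_cost(5000)  # 0.2 + 2 + 50      = 52.2

print(small, merged, too_big)
```

Merging helps while spreading fixed costs dominates (13.0 down to 9.0 here), but once coordination costs outgrow the savings, per-unit cost rises again (52.2), which is the dis-economy of scale the IEDA merger experienced in miniature.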
Re-usability
Domain standards
Business models
Workforce
Quality
Communities need to define disciplinary and data type specific best practices (documentation of provenance, uncertainties, etc.)
Readiness for data mining & analysis
Improve granularity of access
Data fusion (the ‘data lake’)
There are more lessons to be learned from the IGSN development, but that is for another talk.