Seminar presentation at the Institute for Advanced Computational Science at Stony Brook University, April 9, 2015, describing achievements and challenges of data infrastructure in a long-tail science domain with the example of geochemistry.
Lehnert: Making Small Data Big, IACS, April 2015
1. Making small data BIG
Insights from a Long-tail Geoscience Domain
Kerstin Lehnert
lehnert@ldeo.columbia.edu
Lamont-Doherty Earth Observatory of Columbia University
Palisades, NY, 10964
www.iedadata.org
2. Outline
• The (super-fast) Introduction to Geochemistry
• Achievements & Challenges in Geochemical Data Management
• Sustainable data infrastructure in the Long Tail
• EarthCube
Kerstin Lehnert: "Making small data BIG: Insights from a Long-tail Geoscience Domain" 2
3. Geochemistry
• Puts real numbers on geologic times.
• Fingerprints sources of material involved in geological processes.
• Reveals the history of climate and the circulations of the atmosphere and ocean.
• Constrains theories of the Earth’s deep interior.
4. Geochemical Observations
• Hundreds of chemical properties of
different Earth materials
• elemental or oxide concentrations
• isotopes and isotopic ratios
• Thermodynamic properties
• Kinetics
5. Geochemical Data Types
• Analytical (observational)
• Sample-based measurements
• Sensor data
• Experimental data
• Derived data (models)
• (Samples)
8. How a Geochemist Generates Data:
“Did New Zealand Dust Influence the Last Ice Age?”
Bess Koffman, Michael Kaplan, Steven Goldstein, Gisela Winckler (LDEO), Natalie Mahowald (Cornell)
http://blogs.ei.columbia.edu/2014/03/13/did-new-zealand-dust-influence-the-last-ice-age/
9. Get Samples in the Field
10. Get Samples in the Lab/Repository
11. Analyze Samples in the Lab
12. The Data!
Note the number of data points generated in this study (the yellow dots) in light of the effort involved, which ranged from collecting samples in NZ to operating expensive equipment in the lab.
14. Long-tail Research Data
• heterogeneous
• customized & optimized for research questions
• lack of data standards
• data sharing limited
• lack of data infrastructure (facilities)
15. The Value of Long-tail Data
“While the data volumes are small when viewed individually, in total they represent a very significant portion of the country’s scientific output.”
“The long tail is a breeding ground for new ideas and never before attempted science.”
(Heidorn, B. 2008: “Shedding Light on the Dark Data in the Long Tail of Science”)
BUT:
Long-tail data have no value if they are not re-usable!
16. What Makes Data BIG?
The sixth ‘V’: Value
R "Ray" Wang: “Monday’s Musings: Beyond The Three V’s of Big Data – Viscosity and Virality”, published on February 27, 2012
http://blog.softwareinsider.org/2012/02/27/mondays-musings-beyond-the-three-vs-of-big-data-viscosity-and-virality/
17. Adding VALUE
small data → BIG DATA
• findable: identification, persistence
• accessible: authorization, protocols
• interoperable: harmonized, machine-readable
• re-usable: context, provenance
“… data have no value or
meaning in isolation; they exist
within a knowledge
infrastructure — an ecology of
people, practices,
technologies, institutions,
material objects, and
relationships.”
C.L. Borgman
https://www.force11.org/group/fairgroup/fairprinciples
Generic Repositories
Domain Repositories
18. Domain-specific Data Facilities
Science Community
Domain-specific data facility
Libraries
Archives
CI, Computer Science
Publishers, editors
Metadata registration
Software (tool) development
Interoperability
Data policies
Persistent access
Bibliometrics
Data Curation
Data access & discovery (optimized for domain)
Data products (synthesis)
Data harmonization (standards)
User Support
Funding Agencies
Data Facilities
Registries
AGU FM 2014: IN14B-01
19. Small Data Gone BIG
IEDA Repositories
>500,000 files
47 TB
4 x 10^6 samples
IEDA Syntheses
19 x 10^6 analytical values in EarthChem
2.63 x 10^6 miles of data from 808 cruises in the Global Multi-Resolution Topography (GMRT)
20. EarthChem: Big Data for Geochemistry
• EarthChem Library
• DOI registration
• Long-term archiving
• CC license
• Data templates & guidelines for data documentation
• QC by data managers
• Synthesis Databases (PetDB, EarthChem Portal)
• QA/QC by data managers
• Data & metadata harmonization
• Standards-compliant data model
• Service Oriented Architecture (ECP)
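The harmonization step listed above can be illustrated with a minimal Python sketch. It assumes that investigator spreadsheets arrive with inconsistent column headers and that data managers map them onto a controlled vocabulary. The vocabulary entries and the function name `harmonize_row` are hypothetical and are not taken from EarthChem.

```python
# Hypothetical controlled vocabulary: normalized header -> preferred variable name.
CONTROLLED_VOCABULARY = {
    "sio2": "SiO2",
    "sio2 (wt%)": "SiO2",
    "silica": "SiO2",
    "87sr/86sr": "87Sr/86Sr",
    "sr87/sr86": "87Sr/86Sr",
}


def harmonize_row(row):
    """Map submitted column headers onto controlled variable names.

    Returns the harmonized values plus the headers that could not be
    mapped, so that a data manager can review them by hand.
    """
    harmonized = {}
    unmapped = []
    for header, value in row.items():
        key = header.strip().lower()
        if key in CONTROLLED_VOCABULARY:
            harmonized[CONTROLLED_VOCABULARY[key]] = value
        else:
            unmapped.append(header)
    return harmonized, unmapped


# Invented example row, as it might appear in a submitted spreadsheet.
submitted = {" SiO2 (wt%)": 49.8, "Sr87/Sr86": 0.7028, "Notes": "glass rim"}
values, needs_review = harmonize_row(submitted)
print(values)
print(needs_review)
```

Returning the unmapped headers instead of dropping them reflects the QA/QC role of data managers named on the slide: anything the vocabulary does not cover is flagged for a person to resolve.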
21. EarthChem Data Systems
[Diagram. Data Repository: Investigators, EarthChem Library (data, metadata, search).]
22. Kerstin Lehnert: "Making small data BIG: Insights from a Long-tail Geoscience Domain" 22
DOI to allow proper citation
Link to publications
Link to funding source
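As a rough illustration of the three links named on this slide, the sketch below models one archived data set as a record that carries a DOI, related publication DOIs, and funding award identifiers. The class name, field names, and all identifier values are hypothetical placeholders and do not reflect the EarthChem Library schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical citation metadata for one archived data set."""
    doi: str
    title: str
    creators: List[str]
    year: int
    publication_dois: List[str] = field(default_factory=list)  # related papers
    award_ids: List[str] = field(default_factory=list)         # funding sources

    def citation(self) -> str:
        """Format a simple data citation that ends in a DOI link."""
        authors = "; ".join(self.creators)
        return f"{authors} ({self.year}): {self.title}. https://doi.org/{self.doi}"


# All values below are placeholders, not registered identifiers.
record = DatasetRecord(
    doi="10.0000/example-0001",
    title="Example trace element data set",
    creators=["Doe, J.", "Roe, R."],
    year=2015,
    publication_dois=["10.0000/example-paper"],
    award_ids=["EXAMPLE-AWARD-1"],
)
print(record.citation())
```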
24. ECL Challenges
• Metadata guidelines/templates for an increasing diversity of data
• Need extended metadata for meaningful searches
• Geospatial
• Variables
• Sample name
• Integration with publication workflow
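To make the need for extended metadata concrete, here is a minimal Python sketch of a search that filters data sets by the three criteria listed above: geospatial extent, variables, and sample name. The record structure, the `search` function, and the example records are assumptions made for illustration. The bounding-box test is deliberately simple and does not handle boxes that cross the antimeridian.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetMetadata:
    """Hypothetical extended metadata for one data set."""
    title: str
    lat: float
    lon: float
    variables: List[str]
    sample_names: List[str]


def search(records, bbox=None, variable=None, sample_name=None):
    """Return records matching every criterion that is given.

    bbox is (min_lat, min_lon, max_lat, max_lon) in decimal degrees.
    """
    hits = []
    for r in records:
        if bbox is not None:
            min_lat, min_lon, max_lat, max_lon = bbox
            if not (min_lat <= r.lat <= max_lat and min_lon <= r.lon <= max_lon):
                continue
        if variable is not None and variable not in r.variables:
            continue
        if sample_name is not None and sample_name not in r.sample_names:
            continue
        hits.append(r)
    return hits


# Invented example records.
RECORDS = [
    DatasetMetadata("Data set 1", -43.5, 171.0, ["SiO2", "Nd"], ["NZ-01", "NZ-02"]),
    DatasetMetadata("Data set 2", 10.0, -104.0, ["SiO2", "87Sr/86Sr"], ["EPR-07"]),
]
print([r.title for r in search(RECORDS, bbox=(-50.0, 160.0, -30.0, 180.0))])
```

Without the variable and sample-name fields, only the title and free text would be searchable, which is the limitation the slide points to.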
25. Coalition for Publishing Data in the Earth & Space Sciences (COPDESS)
• Joint initiative of Earth Science publishers and Data Facilities to help translate the aspirations of open, available, and useful data from policy into practice.
• Reaffirm and ensure adherence to existing journal and publishing policies and society position statements regarding open data sharing and archiving of data, tools, and models.
• Ensure that Earth science data will, to the greatest extent possible, be stored in community approved repositories that can provide additional data services.
• Statement of Commitment signed by all major Earth & Space Science publishers
• Build an online community directory of appropriate Earth science community repositories for data, tools, and models that meet leading standards on curation, quality, and access
www.copdess.org
26. Kerstin Lehnert: "Making small data BIG: Insights from a Long-tail Geoscience Domain" 26
Presentation at EarthCube workshop “Scope & Vision”, March 2015
27. EarthChem Data Systems
[Diagram. Data Repository: Investigators, EarthChem Library (data, metadata, search). Data Synthesis: EarthChem Data Managers, [.xls], PetDB and SedDB (data & metadata, search), EarthChem Portal (databases, data & metadata as [XML], search).]
28. Example of success:
This study showed new relationships between noble gases and the elemental and isotope geochemistry of the deep mantle, with implications for mantle structure and evolution.
It was possible only through a synthesis of the global data set, which was feasible because the scattered data were made available by the online databases PetDB and GEOROC.
This entire community now depends on this cyberinfrastructure.
29. The PetDB Database
Map shows locations of mafic volcanic rock samples. Color of symbols is scaled to the 87Sr/86Sr isotope ratio in the rocks, illustrating the difference in the composition of the Earth’s mantle under the Indian and the Pacific Ocean.
Data are from >300 publications, retrieved from the PetDB database in ca. 2 minutes.
30. PetDB Concept: BIG Data
• Data Mining
• Fine-grained data access: Database structure ‘disintegrates’ data sets into individual values
• Context & provenance metadata to search and filter
• Harmonized data: controlled vocabularies, data compilation & QC by data managers
• Data Integration
• User-defined across data sets
• By sample (use of unique sample ID)
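The concept above can be sketched in a few lines of Python: each published table is broken into individual analytical values that keep their context (sample, method, reference), which then allows filtering and re-integration by sample ID. This is an illustrative toy model only. The class, the function names, and every data value are invented and do not describe the actual PetDB schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalyticalValue:
    """One measured value with its context and provenance (hypothetical model)."""
    sample_id: str
    variable: str
    value: float
    unit: str
    method: str
    reference: str


# Invented values standing in for data compiled from several publications.
VALUES = [
    AnalyticalValue("SAMPLE-A", "SiO2", 49.8, "wt%", "XRF", "Ref 1"),
    AnalyticalValue("SAMPLE-A", "87Sr/86Sr", 0.7028, "ratio", "TIMS", "Ref 2"),
    AnalyticalValue("SAMPLE-B", "SiO2", 51.2, "wt%", "EMP", "Ref 1"),
    AnalyticalValue("SAMPLE-B", "87Sr/86Sr", 0.7035, "ratio", "TIMS", "Ref 3"),
]


def filter_values(values, variable=None, method=None, min_value=None, max_value=None):
    """Data mining: keep values that match all criteria that are given."""
    selected = []
    for v in values:
        if variable is not None and v.variable != variable:
            continue
        if method is not None and v.method != method:
            continue
        if min_value is not None and v.value < min_value:
            continue
        if max_value is not None and v.value > max_value:
            continue
        selected.append(v)
    return selected


def integrate_by_sample(values):
    """Data integration: one row per sample ID, one column per variable."""
    table = defaultdict(dict)
    for v in values:
        table[v.sample_id][v.variable] = v.value
    return dict(table)


print(integrate_by_sample(VALUES))
print([v.sample_id for v in filter_values(VALUES, variable="SiO2", min_value=50.0)])
```

In this toy data, the values for SAMPLE-A come from two different references and end up in one row. That only works when both publications use the same unique sample ID.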
31. Data Mining: Search & Filter
Filter by method or concentration
33. PetDB Impact
• 500 - 800 downloads per quarter
• >550 citations in the literature
• many fundamental new discoveries & insights
• new scientific approaches
Meyzen et al., 2007, Isotopic portrayal of the Earth's upper mantle flow field. Nature 447, 1069
A. W. Hofmann: “Mantle Myths, Reservoirs, and Databases”, Goldschmidt Conf. 2008
34. Technical Challenges
• scalability/flexibility of database schema
• accommodate new sample and data types (time series, non-numeric data, etc.)
• track relationships among samples
• diverse context for new sample and data types
• track provenance of metadata
• performance of search application
• usability & functionality of search application
• interoperability interfaces
• data ingestion & quality control
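One of the challenges above, tracking relationships among samples, can be pictured as a parent-child hierarchy, for example a core, a section cut from it, and a powder prepared from that section. The Python sketch below is an assumption-laden illustration: the sample identifiers and the `lineage` helper are invented and are not part of PetDB or ODM2.

```python
# Hypothetical child -> parent links between physical samples.
PARENT_OF = {
    "CORE-1-SEC-2": "CORE-1",
    "CORE-1-SEC-2-CHIP": "CORE-1-SEC-2",
    "POWDER-17": "CORE-1-SEC-2-CHIP",
}


def lineage(sample_id, parent_of):
    """Return the chain from a sample up to its top-level parent."""
    chain = [sample_id]
    seen = {sample_id}
    while chain[-1] in parent_of:
        parent = parent_of[chain[-1]]
        if parent in seen:
            # Guard against inconsistent metadata that would loop forever.
            raise ValueError(f"cyclic sample relationship at {parent}")
        chain.append(parent)
        seen.add(parent)
    return chain


print(lineage("POWDER-17", PARENT_OF))
```

With such links in place, an analysis made on a powder can be traced back to the core it came from, so its context metadata (location, depth, collector) does not need to be duplicated on every subsample.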
35. ODM2
ODM2 Team:
J S Horsburgh
A K Aufdenkampe
L Hsu
A Jones
K Lehnert
E Mayorga
L Song
D Tarboton
I Zaslavsky
Challenges:
• migration of db content
• new user interface
• new data entry & QA/QC tools
• resources
36. ODM2 Problem
“In general, the more reusable we choose to make a software module, the more difficult that same software module is to use.”
from: http://techdistrict.kirkk.com/2009/10/07/the-usereuse-paradox/
37. New User Interface (under development)
38. Challenge: User Expectations
C.H. Langmuir (Harvard): “Geochemical Databases: What is needed now?” Presentation at EarthCube Domain End-user workshop for Petrology & Geochemistry, March 2013
39. Access to Samples is a Community Concern
• Poor and uneven access and management of sample collections
• Incomplete sample tracking and linking of samples to analyses in the literature and databases
• Poor discoverability of existing samples
• Insufficient or uneven sample density through space and time for most geological terrains of interest
From Executive Summary of EarthCube Domain End-user Workshop Petrology & Geochemistry 2013
EarthCube Domain End-user Workshop for Petrology & Geochemistry at the National Museum of Natural History, Smithsonian Institution, March 2013
40. The Internet of Samples
• Central or federated online catalogs for discovery & access of samples.
• Best practices for sample identification, documentation, and citation.
• Software tools that support personal or institutional sample management & curation.
(And facilities to provide access to curated samples!)
41. IGSN: International GeoSample Number
• persistent unique identifier for physical objects in the Earth Sciences; centralized control mechanism via IGSN e.V.
• resolves to virtual sample representations (sample metadata profiles) managed at federated IGSN Allocating Agents.
42. Use of the IGSN
IGSNs in data table resolve to
sample metadata in IGSN registry
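A minimal Python sketch of how IGSNs in a data table could be turned into resolvable links. The resolver base URL, the simplified syntax check, and the example identifiers are assumptions for illustration only and should be checked against current IGSN documentation. The code only builds URLs and makes no network request.

```python
import re

# Assumption: identifiers are resolved by appending them to a resolver base URL.
RESOLVER_BASE = "http://igsn.org/"

# Hypothetical, simplified syntax check (not the official IGSN specification).
IGSN_PATTERN = re.compile(r"[A-Z0-9.-]+")


def igsn_url(igsn):
    """Build a resolver URL for one IGSN after basic normalization."""
    code = igsn.strip().upper()
    if not IGSN_PATTERN.fullmatch(code):
        raise ValueError(f"not a plausible IGSN: {igsn!r}")
    return RESOLVER_BASE + code


# Invented data table rows: (IGSN, variable, value).
rows = [
    ("abc000001", "SiO2", 49.8),
    ("ABC000002", "SiO2", 51.2),
]
linked = [(igsn_url(igsn), variable, value) for igsn, variable, value in rows]
for row in linked:
    print(row)
```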
43. SESAR (www.geosamples.org)
System for Earth Sample Registration
• Allocating Agent for individual investigators, sample repositories, and science programs
• tools and services for users to catalog and manage sample metadata (MySESAR)
• personal (authenticated) workspace
• metadata template creator
• label creation & printing (including QR code)
• transfer of sample ownership
• web services for client systems
• register sample metadata & obtain IGSNs
• access to IGSN metadata
• preservation & persistent access of sample metadata
• Global Sample Catalog (harvest metadata from other Allocating Agents)
44. Challenges:
• scalability of architecture for a rapidly growing number of registrations
• service-oriented architecture
• handle registrations
• software tools that support investigators with metadata capture in the field & lab
• flexibility for user-specific metadata & new sample types
• inclusion of sample images (storage!)
45. Kerstin Lehnert: "Making small data BIG: Insights from a Long-tail Geoscience Domain" 45
Institutions
Collection Mgmt
Public
‘Virtual Museum’
Investigators
Sample Mgmt
(storage, software solutions, & services)
Visualization
Publications
Data Systems
Sample Registries
APIs, GUIs
46. Internet of Samples Initiatives
• CODATA Task Group “Physical Samples in the Digital Era”
• SciColl: Scientific Collections International (Consortium)
• iSamples (Internet of Samples in the Earth Sciences)
• Funded EarthCube Research Coordination Network (RCN)
• advance access and re-use of physical samples through use of innovative cyberinfrastructure
• DESC: Digital Environment for Sample Curation
• IGSN e.V.
• National Data Services test-bed
47. DATA FACILITIES FOR THE LONG TAIL
Scalability, Sustainability
48. Many Earth Science Data Communities
• Atmospheric Chemistry
• Climate & Large Scale Dynamics
• Paleoclimate
• Meteorology
• Aeronomy
• Space Weather
• Magnetospheric Physics
• Solar Terrestrial
• Igneous Petrology & Volcanology
• Geo Ed & Workforce Training
• NCAR
• Geophysics & Geodynamics
• Geobiology & Paleontology
• Cryosphere & Ice Dynamics
• Critical Zone & Soil Science
• Chemical Oceanography
• Geomorphology
• Hydrology
• Sedimentology & Stratigraphy
• Marine Geophysics
• Physical Oceanography
• Marine Geology
• Biological Oceanography
• Ocean Education
• Ocean Drilling & Engineering
• Software & Modeling
• Bioinformatics
• Ecosystems Biology
• High Perf Computing
• Semantics & Ontologies
• Algorithms & Data Mining
• EarthCube CI
• Solid and Aqueous Geochemistry
49. IEDA: A “Long-Tail” Data Facility
www.iedadata.org
• Multiple core disciplines (focus: solid earth)
• High-T Geochemistry
• Low-T Geochemistry
• Petrology
• Marine Geophysics & Geology
• Geochronology
• Cross-disciplinary tools & services
• Sample registry SESAR
• IEDA Data Browser
• Portals (GeoPRISMS, USAP-DCC, etc.)
• GeoMapApp
• Data management support
50. From Research Data Collections to Data Facility
Formal Governance
Robust Infrastructure
Stable Expert Team
Accreditation
Adherence to Community Standards
52. Alliance Development
Proposal “Interdisciplinary Earth Data Alliance as a Model for Integrating EC Technology Resources and Engaging the Broad Community” submitted March 2015
MetPetDB
Mineral Physics
Deep Submergence
IcePod
Challenges:
• Social & organizational engineering
• Diversity of data needs
• Diversity of systems
• Business models
54. Conclusions
• Long-tail data can grow BIG through domain-specific data curation.
• Partnerships among data efforts can provide a solution for sustainability of data infrastructure in long-tail communities.
• Partnerships with the computer and information sciences are necessary to build the cyberinfrastructure.
55. EarthCube Motivations
To transform geosciences research by supporting community-driven cyberinfrastructure to integrate data and information.
Tech. Drivers
• Supports science and other User Needs
• Create a dynamic, community-driven cyberinfrastructure
• Open, evolvable, sustainable
• Easy interface with existing capabilities
Challenges
• Diversity of the geosciences
• Interdisciplinary Science Questions
• Big, Heterogeneous Data issues
• Communities that are poorly served/have no community resources
56. Towards an Architecture for EarthCube
• Under purview of the EarthCube Technology and Architecture Committee (TAC)
– Coordinating with Council of Data Facilities, Science Committee, and Liaison Team
• Ongoing Working Groups (since Fall 2014):
– Architecture WG
– Standards WG
– Use Cases WG
– Funded Projects and Gap Analysis WG
– Testbed WG
[Diagram: EarthCube: Building Blocks, Architecture, Governance, Research Coordination Networks, Funded Projects (2013 and 2014 Awards)]
57. TAC Workshop (ongoing now)
Learn more at:
http://earthcube.org/group/technology-architecture-committee
http://earthcube.org/document/2014/earthcube-past-present-future