A look at the history of research data infrastructure development, drawing on general patterns of infrastructure development and on experiences from the evolution of the IEDA data facility, to inform future pathways and developments. A major focus of the lecture is the FAIR principles and the issues surrounding the reusability of data.
Big Data, Beyond the Data Center
Increasingly, the next scientific discoveries and the next industrial breakthroughs will depend on the capacity to extract knowledge and sense from gigantic amounts of information. Examples range from processing data produced by scientific instruments such as CERN’s LHC; collecting data from large-scale sensor networks; grabbing, indexing and nearly instantaneously mining and searching the Web; building and traversing billion-edge social network graphs; to anticipating market and customer trends through multiple channels of information. Collecting information from various sources, recognizing patterns and distilling insights constitutes what is called the Big Data challenge. However, as the volume of data grows exponentially, the management of these data becomes more complex in proportion. A key challenge is to handle the complexity of data management on hybrid distributed infrastructures, i.e. assemblages of Clouds, Grids and Desktop Grids. In this talk, I will give an overview of our work in this research area, starting with BitDew, a middleware for large-scale data management on Clouds and Desktop Grids. Then I will present our approach to enabling MapReduce on Desktop Grids. Finally, I will present our latest results around Active Data, a programming model for managing the data life cycle on heterogeneous systems and infrastructures.
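The MapReduce model mentioned in the abstract can be illustrated with a minimal, single-machine word-count sketch. This shows only the programming model (map, shuffle, reduce), not the BitDew middleware or the Desktop Grid implementation the talk describes:

```python
from collections import defaultdict

def map_phase(docs):
    # map: emit a (word, 1) pair for every word in every document
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: aggregate the grouped values per key
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["grid cloud grid", "cloud desktop"])))
# counts == {"grid": 2, "cloud": 2, "desktop": 1}
```

In a distributed setting, map and reduce tasks run in parallel on different nodes and the shuffle moves intermediate pairs across the network; the structure of the computation stays the same.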
A description of BRISSKit, an open source tool that may be used to combine datasets held in different locations and analyse them for the purpose of research. Talk given by Jonathan Tedds of the University of Leicester for the Data Management in Practice workshop, which took place on 14 November 2013 at the London School of Hygiene and Tropical Medicine.
ODI Node Vienna: Best-Practice Examples of Open Innovation through Open Data - Martin Kaltenböck
Presentation given as part of the Data Pioneers Workshop on 10 October 2016 at the BMVIT on the topic of open innovation and open data (open innovation through open data), by Elmar Kiesling (TU Wien) and Martin Kaltenböck (SWC) for the ODI (Open Data Institute) Node Vienna.
The following brief details the use of linked data to connect various high quality data sets produced by the U.S. Environmental Protection Agency. Linked data is an open standards way to publish and consume data. Using a linked data approach and the REST API, developers, scientists, and the public can more easily find, access and re-use authoritative data published by the EPA.
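The linked data idea described above can be sketched with a toy in-memory triple store: every statement is a (subject, predicate, object) triple, and an object can itself be the subject of further triples, so datasets connect into a graph that can be traversed. All identifiers and values below are hypothetical illustrations, not real EPA URIs or data:

```python
# Hypothetical triples linking a facility record to a dataset and a place.
triples = [
    ("facility:42", "rdf:type", "epa:Facility"),
    ("facility:42", "epa:locatedIn", "place:OH"),
    ("facility:42", "epa:reportsTo", "dataset:TRI"),
    ("place:OH", "rdfs:label", "Ohio"),
    ("dataset:TRI", "rdfs:label", "Toxics Release Inventory"),
]

def objects(subject, predicate):
    """Return all objects for a given subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]

# Follow a link from the facility to a connected dataset and read its label.
dataset = objects("facility:42", "epa:reportsTo")[0]
label = objects(dataset, "rdfs:label")[0]
# label == "Toxics Release Inventory"
```

In practice the triples would be published as RDF and retrieved over HTTP via a REST API, with each identifier being a dereferenceable URI; the traversal logic is the same.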
US EPA Resource Conservation and Recovery Act published as Linked Open Data - 3 Round Stones
A presentation by 3 Round Stones to the US EPA on the new Linked Open Data Management System, including Linked Open Data on 4M facilities (from FRS), 25 years of Toxic Release Inventory (TRI), chemical substances (SRS), and Resource Conservation and Recovery Act (RCRA) content. This represents one of the largest Open Data projects published by a federal government agency using Open Source Software (OSS), Open Web Standards and government Open Data.
Present: Our lives, as well as every field of business and society, are continuously transformed by our ability to collect meaningful data in a systematic fashion and turn it into value. We are increasingly connected to data sources, have unprecedented distributed infrastructure capabilities, and continuously improve our scientific and analytical capabilities. Renewed interest in the evolving field of data science has emerged in response to these advances.
Potential: The state of the art and present challenges come with many opportunities. They not only push for new and innovative capabilities in composable data management and analytical methods that can run anytime, anywhere but also require methods to bridge the gap between applications and such capabilities. However, we often lack collaborative culture and effective methodologies to translate these newest advances into impactful solution architectures that can transform science, society, and education.
Future: A Collaborative Networked World as a Part of the Data Science Process: Any solution architecture for data science today depends on the effectiveness of a multi-disciplinary data science team, comprising not only humans but also analytical systems and infrastructure as inter-related parts of the solution. Focusing from the beginning of any activity on collaboration and communication between people, and on dynamic, predictable and programmable interfaces to systems and scalable infrastructure, is critical. This talk will provide an overview of some of our recent work on networked application architectures for dynamic data-driven wildfire modeling and smart cities. It will also explain how focusing on (1) some P’s in the planning phases of a data science activity and (2) creating a measurable process that spans multiple perspectives and success metrics was effective in making these solutions scalable. Lastly, it will introduce the PPODS methodology and a family of composable tools for team-based data science process management and training.
By Sander Janssen, Research Team Leader of Earth Observation and Environmental Informatics at Alterra, Wageningen UR.
12 April 2017, 14:00 CET
--The webinar was held as part of ASIRA (Access to Scientific Information Resources in Agriculture) Online Course for Low-Income Countries--
This presentation focuses on the political context of open data publishing and on methodological frameworks for estimating the impacts of open data, and highlights the Open Data Journal for Agricultural Research as a publication channel for open data sets. It also builds on personal reflections on publishing open data from Dr. Janssen’s own research career.
For more on the topic: http://aims.fao.org/activity/blog/join-free-webinar-publishing-open-data-agricultural-research
2013 DataCite Summer Meeting - California Digital Library (Joan Starr - Calif...) - datacite
2013 DataCite Summer Meeting - Making Research better
DataCite. Co-sponsored by CODATA.
Thursday, 19 September 2013 at 13:00 - Friday, 20 September 2013 at 12:30
Washington, DC. National Academy of Sciences
http://datacite.eventbrite.co.uk/
End-to-End Research Data Management for the Responsible Conduct of Research - ARDC
Louise Wheeler presented at the University of Technology Sydney's RIA Data Management Workshop on 21 June 2018. In partnership with the Australian Research Council, the National Health and Medical Research Council, the Australian Research Data Commons, and RMIT University, this is part of a national workshop series in data management for research integrity advisors.
A modified k-means algorithm for big data clustering - SK Ahammad Fahad
The amount of data grows every moment, and it comes from everywhere: social media, sensors, search engines, GPS signals, transaction records, satellites, financial markets, e-commerce sites, etc. This large volume of data may be structured, semi-structured or unstructured, so it is important to derive meaningful information from such huge data sets. Clustering is the process of categorizing data such that items are grouped in the same cluster when they are similar according to specific metrics. In this paper, we work on the k-means clustering technique to cluster big data. Several methods have been proposed for improving the performance of the k-means clustering algorithm; we propose a method that makes the algorithm less time-consuming and more effective and efficient, yielding better clustering with reduced complexity. According to our observation, the quality of the resulting clusters depends heavily on the selection of the initial centroids and on how data points change clusters in subsequent iterations. After a certain number of iterations, only a small fraction of the data points change their clusters. Therefore, our proposed method first finds the initial centroids and then separates the data elements that will not change their cluster from those that may change their cluster in subsequent iterations, which reduces the workload significantly for very large data sets. We evaluate our method on different data sets and compare it with other methods.
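As context for the modification described above, here is a minimal sketch of the standard k-means (Lloyd's) loop the paper builds on, instrumented to count how many points change cluster per iteration, which is the observation motivating the proposed optimization. This is a generic baseline, not the authors' algorithm, and the initialization is plain random sampling:

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Standard k-means; returns cluster assignments and the number of
    points that changed cluster in each iteration."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # naive random initialization
    assign = [-1] * len(points)
    changes_per_iter = []
    for _ in range(max_iters):
        changed = 0
        for i, p in enumerate(points):
            # assign each point to its nearest centroid (squared distance)
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[j])))
            if nearest != assign[i]:
                assign[i] = nearest
                changed += 1
        changes_per_iter.append(changed)
        if changed == 0:                       # converged: no point moved
            break
        for j in range(k):                     # recompute centroids as means
            members = [points[i] for i in range(len(points)) if assign[i] == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return assign, changes_per_iter

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assign, changes = kmeans(pts, k=2)
```

Typically `changes` drops quickly toward zero; the paper's idea is to exploit this by excluding points that provably will not change cluster from later iterations.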
Birgit Schmidt: RDA for Libraries from an International Perspective - dri_ireland
From "A National Approach to Open Research Data in Ireland", a workshop held on 8 September 2017 in National Library of Ireland, organised by The National Library of Ireland, the Digital Repository of Ireland, the Research Data Alliance and Open Research Ireland.
Understanding the Big Picture of e-Science - Andrew Sallans
A. Sallans. "Understanding the Big Picture of e-Science." Presented at the 2011 eScience Bootcamp at the University of Virginia's Claude Moore Health Sciences Library. 4 March 2011
The Department of Energy's Integrated Research Infrastructure (IRI) - Globus
We will provide an overview of DOE’s IRI initiative as it moves into early implementation, what drives the IRI vision, and the role of DOE in the larger national research ecosystem.
Data repositories are the core components of an Open Data Ecosystem. To arrive at a comprehensive model of the data ecosystem, supporting tools and services, the FAIR principles, the joint storage of open data and clinical data, and the integration of analysis tools should all be considered. The aim was to create a data ecosystem model suitable for sharing open data together with sensitive data. For this purpose, several tools and services were included in our data ecosystem model: Research Data Marts, i2b2/tranSMART, CKAN, Dataverse, figshare, OSF (Open Science Framework), ... This multitude of services supports research data repositories. Different types of repositories are connected and supplement each other in the storage, release and sharing of data with different degrees of protection and data ownership. Tools to analyze, browse and visualize data are integrated into the data flow between repositories. Results of our ecosystem analysis:
It doesn't matter where one stores data, because everything is connected for data sharing: institutional repositories with dataverses, data marts, general repositories, domain-specific repositories, figshare, etc. Data governance and privacy protection are integrated at the early stage of data generation.
Research Data Management Initiatives at the University of Edinburgh - Robin Rice
This paper will discuss the issues involved in exploring university obligations in the area of research data management, while conveying the current state of progress at one institution, Edinburgh. The issues are fairly static – from data ownership and rights to retention and sustainability – but the solutions are a moving target as the research environment and its technologies continue to change, subtly altering what is perceived as possible, feasible, and desirable. The planned University of Edinburgh approach to research data storage and management will be outlined.
Digital Representation of Physical Samples in Scientific Publications - Kerstin Lehnert
Presentation about the digital representation of physical samples in scientific publications, given at the European Geoscience Union meeting 2015 in the Splinter Meeting 1.36 "Digital Representation of Physical Samples in Scientific Publications".
This slide deck provides an update on the development of the Astromaterials Data System, a project funded by NASA to ensure the long-term accessibility and utility of lab analytical data acquired on astromaterials samples curated at the Johnson Space Center, including samples collected on the moon during the Apollo missions and meteorites collected in Antarctica.
Presentation about geochemical research data access and publication provided to the Australian Geochemistry Network by Kerstin Lehnert of EarthChem and the Astromaterials Data System
Boosting Data Science in Geochemistry: We Need Global Geochemical Data Standa... - Kerstin Lehnert
Presentation at AGU Fall Meeting 2018: Large-scale, global geochemical data syntheses like EarthChem and GEOROC have, for nearly two decades, inspired and made possible a vast range of scientific studies and new discoveries, facilitating the analysis and mining of geochemical data and creating new paradigms in geochemical data analysis such as statistical geochemistry. These syntheses provide easy access to fully integrated compilations of thousands of datasets (‘data fusion’) with millions of geochemical measurements that are accompanied by comprehensive and harmonized metadata for context and provenance to search, filter, sort, and evaluate the data.
The syntheses have been assembled and maintained through manual labor by data managers, who extract data and metadata from text, tables, and supplements of publications for inclusion in the databases, a time-consuming task due to the multitude of data formats, units, normalizations, vocabularies, etc., i.e. lack of best practices for geochemical data reporting. In order to support and advance future science endeavors that rely on access to and analysis of large volumes of geochemical data, we need to develop and implement global standards for geochemical data that not only make geochemical data FAIR (Findable, Accessible, Interoperable, Re-usable), but ready for data fusion. As more geochemical data systems are emerging at national, programmatic, and subdomain levels in response to Open Access policies and science needs, standard protocols for exchanging geochemical data among these systems will need to be developed, implemented, and governed.
Critical is the alignment with existing standards such as the Semantic Sensor Network (SSN) ontology, a recent joint W3C and OGC standard that standardizes description of sensors, observation, sampling, and actuation, with sufficient flexibility to allow details of these elements to be defined in different domains. New initiatives within the International Council for Science and CODATA are working towards coordinating the International Science Unions to identify and endorse the more authoritative standards (including vocabularies and ontologies). These initiatives present a timely opportunity for geochemical data to ensure that they are born ‘connected’ within and across disciplines.
Presentation that describes the experiences and insights gained by the IEDA data facility during more than 10 years of building cyberinfrastructure for a long-tail science community, geochemistry.
Advancing Reproducible Science from Physical Samples: The IGSN and the iSampl... - Kerstin Lehnert
Presentation at the Geological Society of America (GSA) meeting 2016 in the session on FOSSIL SPECIMENS 0'S AND 1'S: DATABASES, STANDARDS, & MOBILIZATION
Making Small Data BIG (UT Austin, March 2016) - Kerstin Lehnert
Presentation given at the Texas Advanced Computing Center. It describes the potential of re-using small data for new science, achievements and the challenges to make small data re-usable.
IGSN: The International Geo Sample Number (DFG Roundtable) - Kerstin Lehnert
This presentation provides an overview of the rationale for the IGSN, of the organizational structure and architecture of the IGSN e.V., and of the System for Earth Sample Registration.
Research Data Infrastructure for Geochemistry (DFG Roundtable) - Kerstin Lehnert
This presentation provides an overview of different aspects of data management for geochemistry and resources available at the EarthChem@IEDA data facility.
Interdisciplinary Data Resources for Volcanology at the IEDA (Interdisciplina... - Kerstin Lehnert
Presentation given at the EGU 2015 General Assembly in session "Methods for Understanding Volcanic Hazards and Risks" (NH2.2), describing EarthChem data systems that make accessible and synthesize geochemical data of volcanic rocks and gases, and the System for Earth Sample Registration that catalogs sample metadata and provides persistent unique sample identifiers (International Geo Sample Number IGSN). It also mentions EarthChem's plans and ongoing work to link geochemical data with other volcanological databases, and the IEDA data rescue initiative.
Presentation about the IGSN and ongoing initiatives for the Internet of Samples at the EGU 2015 short course "Open Science Goes Geo: Beyond Data and Software".
Lehnert: Making Small Data Big, IACS, April 2015 - Kerstin Lehnert
Seminar presentation at the Institute for Advanced Computational Science at Stony Brook University, April 9, 2015, describing achievements and challenges of data infrastructure in a long-tail science domain with the example of geochemistry.
iSamples Research Coordination Network (C4P Webinar) - Kerstin Lehnert
The iSamples (Internet of Samples in the Earth Sciences) Research Coordination Network is part of EarthCube and focuses on the integration of physical samples and collections into digital data infrastructure in the Earth sciences. This presentation summarizes the activities of the iSamples RCN and presents results from a major community survey about sharing and management of physical samples that was conducted as part of the RCN.
MoonDB: Restoration & Synthesis of Planetary Geochemical Data - Kerstin Lehnert
This presentation explains the MoonDB project that will restore and synthesize geochemical and petrological data acquired on lunar samples over more than 4 decades. The project is a collaboration between the IEDA data facility (http://www.iedadata.org) at the Lamont-Doherty Earth Observatory of Columbia University and the Astromaterials Acquisition and Curation Office (AACO) at Johnson Space Center (JSC).
This presentation was part of a workshop of IEDA (http://www.iedadata.org) at the AGU (American Geophysical Union) Fall Meeting 2013 in San Francisco that was intended as an introduction to the topic of data publication.
Adjusting primitives for graphs: SHORT REPORT / NOTES - Subhajit Sahu
Notes on primitives used by graph algorithms such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential vs OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential vs OpenMP-based vector element sum.
2. Performance of memcpy vs in-place CUDA-based vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA-based vector element sum (in-place).
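The CSR representation mentioned in the notes above can be sketched as a generic construction from an edge list. This is an illustration of the data structure, not the report's actual (OpenMP/CUDA) code:

```python
def to_csr(num_vertices, edges):
    """Build a Compressed Sparse Row adjacency structure from (src, dst)
    edge pairs: targets[offsets[u]:offsets[u+1]] are u's out-neighbors."""
    offsets = [0] * (num_vertices + 1)
    for src, _ in edges:               # count the out-degree of each vertex
        offsets[src + 1] += 1
    for u in range(num_vertices):      # prefix-sum the counts into offsets
        offsets[u + 1] += offsets[u]
    targets = [0] * len(edges)
    cursor = offsets[:-1].copy()       # next free slot per source vertex
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets

def neighbors(offsets, targets, u):
    return targets[offsets[u]:offsets[u + 1]]

offsets, targets = to_csr(4, [(0, 1), (0, 2), (2, 3), (1, 3)])
# offsets == [0, 2, 3, 4, 4]; neighbors(offsets, targets, 0) == [1, 2]
```

Storing all adjacency lists contiguously in two flat arrays is what makes CSR cache-friendly and easy to map onto OpenMP threads or CUDA thread blocks.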
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... - Subhajit Sahu
Abstract: Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with the precondition that the input graph contain no dead ends. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large number of small workload submissions, and is expected to be a non-issue when the computation is performed on massive graphs.
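The core idea of the abstract can be sketched in a few lines: iterate one strongly connected component at a time, in topological order, reusing the already-final ranks of upstream components. This is a toy illustration with hand-specified components and a dead-end-free graph (the stated precondition), not the report's CPU/GPU implementation:

```python
def pagerank(edges, n, d=0.85, iters=200):
    """Standard ('monolithic') power iteration; assumes no dead ends."""
    out = [0] * n
    for u, v in edges:
        out[u] += 1
    ranks = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1 - d) / n] * n
        for u, v in edges:
            nxt[v] += d * ranks[u] / out[u]
        ranks = nxt
    return ranks

def pagerank_levelwise(edges, n, components, d=0.85, iters=200):
    """Process strongly connected components in topological order; ranks
    of upstream components are already final when a component is processed."""
    out = [0] * n
    for u, v in edges:
        out[u] += 1
    ranks = [1.0 / n] * n
    for comp in components:
        members = set(comp)
        for _ in range(iters):          # iterate only this component
            nxt = {v: (1 - d) / n for v in comp}
            for u, v in edges:
                if v in members:
                    nxt[v] += d * ranks[u] / out[u]
            for v in comp:
                ranks[v] = nxt[v]
    return ranks

# Two SCCs, {0, 1} upstream of {2, 3}; every vertex has an out-edge.
edges = [(0, 1), (1, 0), (1, 2), (2, 3), (3, 2)]
mono = pagerank(edges, 4)
level = pagerank_levelwise(edges, 4, [[0, 1], [2, 3]])
# both converge to the same fixed point
```

Because ranks flow only forward along the component DAG, each level can be converged independently, which is what removes the per-iteration communication in a distributed setting.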
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... - pchutichetpong
M Capital Group (“MCG”) expects demand and the evolution of supply to shift, driven by institutional investment rotating out of offices and into work from home (“WFH”) and by the ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to expect strong annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment, will drive market momentum forward. The continuous injection of capital by alternative investment firms, as well as growing infrastructure investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel data center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
1. Data Infrastructure for the Earth & Space Sciences: How Far Have We Come, Where Are We Heading?
Kerstin Lehnert
Lamont-Doherty Earth Observatory, Columbia University
April 10, 2018
Ian McHarg Lecture 2018
2. Before I start, a short detour ...
The Kaiserstuhl, Germany
4. My goal
“Study the past if you would define the future.” (Confucius)
5. Learning from the past: (1) The Big Picture
2007 / 2018
https://www.rd-alliance.org/sites/default/files/Common_Patterns_in_Revolutionising_Infrastructures-final.pdf
6. Learning from the past: (2) The Real World
The story of IEDA (Interdisciplinary Earth Data Alliance)
www.iedadata.org
... there was a database named PetDB
7. A biased perspective
I am a geoscientist who directs a US data facility for primarily investigator-based data (“long tail”), funded by the National Science Foundation.
www.iedadata.org
8. Defining the Topic
Data infrastructure is a digital infrastructure promoting data sharing and consumption. Its goal is to enable researchers to make the best use of the world’s growing wealth of data for the advancement of science and the benefit of society.
9. Data drive Earth science: A new way of understanding the world
Data: The 4th Paradigm / The 5th Dimension
10. We have been talking about it for a while ...
2006
12. Growth of Earth & Space Science Informatics
AGU Fall Meeting 2017:
63 ESSI session proposals – an increase of 40%
729 ESSI abstracts – an increase of ~18.7%
35 ESSI oral sessions – an increase of ~40%
4 Data Fair Town Halls
Machine Learning/Deep Learning: biggest increase in any theme
Big increases also in FAIR, Repositories & Data Storage, and Adoption & Adaption
Credit: Lesley Wyborn, AGU FM Program Committee Member
Carnegie Institution: Unleash the Power of Data
14. Learning from the past: The Big Picture
Insights into the development of infrastructures
15. Revolutionary!
Roman water supply system
Railroad systems
Global electrification
Internet
16. Patterns of Infrastructure Development
Edwards et al. 2007
1. Deliberate and successful design of
‘local’ systems.
2. Technology transfer across domains
and locations
3. Infrastructure form via gateways
that allow dissimilar systems to be
linked into networks
Wittenburg & Strawn 2018
1. Inventions and development of
start-up systems
2. Technology transfer between
regions and also society
(creolization)
3. Planning for system growth where
"reverse salients" need to be
tackled
4. Substantial momentum (mass,
velocity, direction)
System Building
Growth
Consolidation
18. Creolization
New components are continuously introduced to solve specific challenges
Capabilities grow unevenly (e.g. big vs. small data)
Fragmentation
Leads to:
Inefficiencies in use and costs
Winners & losers: some solutions are more promising and gain more traction
Better understanding of the underlying rules, principles, and limitations
(After Wittenburg & Strawn, 2018)
19. Attraction via “Universals”
“Simple” principles, broadly supported
Directly influence only a specific part of the overall infrastructure; enable efficiency at the top layers
Form a stable basis for new developments
(After Wittenburg & Strawn, 2018)
“Universals are ... essential to create a momentum by overcoming fragmentation and achieving economies of scale.”
20. Attraction is happening!
Relevance of community organizations that
define principles, procedures, and component
specifications
RDA: global & cross-disciplinary
ESIP: Earth Science & US (others coming?)
New: RDA Interest Group “ESIP/RDA Earth,
Space, and Environmental Sciences”
21. Universal: FAIR principles
Represent a guideline for data providers to
enhance the reusability of their data holdings:
Data can be found on the Internet.
Data are accessible in a usable format with clear rights
and licenses.
Data access is reliable & persistent.
Data are identified in a unique and persistent way so
that they can be referred to and cited.
Data are documented with rich metadata.
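To make the FAIR attributes above concrete, here is a minimal sketch of a dataset-level metadata record and a completeness check in Python. The field names loosely follow common conventions (DataCite, schema.org) and the identifier and URL are placeholders, not an actual IEDA schema or DOI:

```python
import json

# Hypothetical dataset metadata record; field names and values are
# illustrative placeholders, not a specific repository schema.
record = {
    "identifier": "doi:10.0000/EXAMPLE.1",              # unique, persistent, citable
    "landing_page": "https://example.org/data/1",       # findable on the Internet
    "title": "Basalt glass geochemistry (example)",
    "creators": ["A. Researcher"],
    "license": "CC-BY-4.0",                             # clear rights and licenses
    "format": "text/csv",                               # accessible, usable format
    "keywords": ["geochemistry", "basalt"],             # rich metadata
}

def is_fair_ready(rec):
    """Check that the minimum FAIR-supporting fields are present."""
    required = {"identifier", "landing_page", "license", "format"}
    return required <= rec.keys()

print(is_fair_ready(record))
print(json.dumps(record, indent=2))
```

The check is deliberately shallow: FAIR compliance in practice also depends on the quality of the metadata values, not just the presence of fields.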
22. Universal:
Standards for data repositories
Cooperative effort between Data Seal of Approval (DSA) and the World Data
System (WDS) under the umbrella of the Research Data Alliance (RDA)
Harmonized requirements & procedures for certification of repositories
Confidence for publishers and funders which repositories to trust
Basis for development of new repositories
23. “Enabling FAIR Data” project @ AGU
Develop & implement standards that will connect researchers, publishers, and
data repositories in the Earth and space sciences to enable FAIR data
Grant from the Laura and John Arnold Foundation (LJAF) to the AGU
FAIR-compliant data repositories (CoreTrustSeal certified, preferred domain
specific)
FAIR-compliant Earth and space science publishers
Align their policies for data to be deposited in certified repositories
Gives similar experience for researchers.
Carnegie Institution: Unleash the Power of Data 23
Slide after S. Stall et al., presentation at RDA P11
Berlin, March 2018
24. All publishers who are part of the
Coalition on Publishing Data in the Earth
and Space Sciences (COPDESS) support
the efforts of trusted repositories that
aggregate research data, software, and
physical samples for the use of the
scientific community.
Carnegie Institution: Unleash the Power of Data 24
“These Data Guidelines align the
Author’s instructions for the submission
of data sets in the Earth and Space
Sciences, for all affiliated publishers.”
25. Universal:
Persistent Identifiers
Founded 2009
Founded 2011
Founded 2012
“The intention of this cross-
disciplinary report is to overcome still
existing confusions about PIDs and the
lack of detail knowledge in many
disciplines. ...to identify agreements
across documents that have been
suggested to be included by experts.”
From: “Common Patterns in Revolutionary Infrastructures and Data”, P. Wittenburg & G. Strawn, February 2018
26. Learning from the past:
(2) The Real World
The story of IEDA
(Interdisciplinary Earth Data Alliance)
...there was a database named PetDB
27. Once upon a time ...
PetDB web site in 1999
28.
Note:
PetDB is a database that provides access to data at the level of individual data points, not files!
29. Success: New data-driven science
in geochemistry
Meyzen et al. (2007): “Isotopic portrayal of the Earth's upper mantle flow field.”
Putirka et al. (2007)
Stracke & Hofmann (2005)
Class & Goldstein (2007)
2018: 740 citations
30. An analysis in 2007
T. Plank, 1999: “Within about 5 minutes of logging on for the first
time, I was staring at an EXCEL file that had all the REE on
basalt glasses from the EPR from 10°N to 20°S. And the answer
to my La/Sm question. I am very impressed, we are looking at
the future of geochemistry.”
GSA 2007 talk: “My Data, Your Data, Our Data!”
32. Another failed network attempt
PaleoStrat not funded
Development of interoperability
with CoreWall not funded
Too many political obstacles
“Promises, Achievements, and Challenges of
Networking Global Geoinformatics Resources”
EGU General Assembly 2008
33. Growth of data systems at Lamont
34. Consolidation
“This Cooperative Agreement converts a series of proposal/award-driven
activities into a community-based facility that serves to support, sustain,
and advance the geosciences by providing a centralized location for the
registry of and access to data essential for research in the solid-earth and
polar sciences.”
- Continue operating & maintaining existing systems
- Develop tools for investigators to comply with NSF data policies (IEDA Data
Management Plan Tool & Data Compliance Reporting Tool)
- Develop tools and modify architecture to provide integrated access to holdings
36. IEDA Today: Data Holdings & Growth
> 70 TeraBytes of marine geophysical sensor data in the MGDS
> 20 million analytical measurements for >1 million samples in
EarthChem
> 4.2 million samples registered and searchable in SESAR (System for Earth Sample Registration)
11/15/17, Presentation at NSF-EAR
37. IEDA Today
Thousands of download requests per
month
>2,000 citations in the literature
~ 10,000 start-ups of GeoMapApp per
month
>2,700 GeoPass users*
Demonstrated impact on science
*GeoPass accounts are required to submit data to EarthChem/
Geochron, SESAR, & USAP-DC, and to use the DMP Tool
[Chart: Citations of IEDA Systems in the Scientific Literature: number of citations per year (0–250) for EarthChem/PetDB/SedDB and MGDS/GMRT/GMA]
38. IEDA is “attracting”
👍
Certification: Member of World Data System since 2011 (CoreTrustSeal
certification underway)
Use of Persistent Identifiers
Publication agent of DataCite since 2011
DOI registration of datasets since 2009 via TIB Hannover
The International Geo Sample Number: A PID for physical samples
FAIR data
Findable/accessible: DOIs, landing pages, GUIs, APIs
Interoperable: CSW, DataONE member node, schema.org (EarthCube project P418)
Reusable: disciplinary expertise for data curation, rich provenance metadata
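The “findable/accessible” role of DOIs listed above can be exercised programmatically through DOI content negotiation: asking the resolver at doi.org for machine-readable citation metadata rather than the human landing page. A minimal sketch (the DOI shown is a placeholder, and no network request is actually made here):

```python
from urllib.request import Request

def doi_metadata_request(doi: str) -> Request:
    """Build a request that asks the DOI resolver for citation metadata
    (CSL JSON) instead of redirecting to the landing page."""
    return Request(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )

# Placeholder DOI for illustration only.
req = doi_metadata_request("10.0000/EXAMPLE.1")
print(req.full_url)
print(req.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would return the metadata as JSON for any registered DOI; the same mechanism underlies automated citation tooling.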
40. Merger of EarthChem & MGDS created
tensions
Partner system needs versus overarching IEDA level needs
Budget
Staff expertise
Staff allocations
Distribution among different funding sources (3 different NSF programs)
Scientific utility versus trustworthiness of operations
Operation & maintenance versus innovation
41. Merger did not lead to the expected
‘economies of scale’
Disciplinary data curation continues as the most relevant component.
Additional resources/effort needed for coordination and alignment of
activities and practices across partners.
More project management required due to budget level and status as facility.
Building useful data search and discovery across multi-disciplinary systems is a
challenging problem.
[Chart: cost per system]
43. Access to all IEDA repositories in one place
Free text, map, and facet-based search
options
ISO metadata available for other catalogs to
harvest
Major work to align concepts and
vocabularies in the different repositories
Challenge to agree on facets
Relevance to different data types
Availability of metadata
Granularity of datasets
Achievements:
IEDA Integrated Catalog
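Facet-based search of the kind described above reduces, at its core, to counting matching records per controlled value, which is exactly why agreeing on facets and vocabularies across repositories matters. A toy sketch (the facet names and records are invented for illustration):

```python
from collections import Counter

# Toy catalog records; "data_type" stands in for one of the facets that,
# as noted above, must be aligned across the different repositories.
records = [
    {"repo": "EarthChem", "data_type": "geochemistry"},
    {"repo": "MGDS", "data_type": "geophysics"},
    {"repo": "EarthChem", "data_type": "geochemistry"},
    {"repo": "SESAR", "data_type": "sample"},
]

def facet_counts(records, facet):
    """Count records per facet value -- the numbers shown beside each
    facet entry in a typical search interface."""
    return Counter(r.get(facet, "unknown") for r in records)

print(facet_counts(records, "data_type"))
```

If one repository calls a value "geochemistry" and another "chemistry of rocks", the counts fragment, which is the vocabulary-alignment problem in miniature.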
44. A changing ecosystem
“IEDA’s cross-disciplinary services for data discovery (IEDA Data Browser)
and data access (IEDA Integrated Catalog) across all IEDA systems are
increasingly superseded by tools developed with substantially larger
resources as part of EarthCube, Google (Google’s new Research Data
Search based on schema.org), or perhaps DataONE. These recent
developments aim to provide researchers with the tools to find and use
data in a highly distributed and fragmented data infrastructure based on
new approaches for interoperability, metadata registries, and hubs such
as SCHOLIX to link data and literature.”
IEDA: Future Scope and Structure
(IEDA internal report, K. Lehnert & S. Carbotte, January 2018)
45. We need to adapt
- Reduce complexity of operations
- Adjust to and better leverage external CI developments (e.g. EarthCube)
- Enhance opportunities to grow partnerships relevant to the disciplinary systems to target needs of the disciplinary communities
- Systems and/or services that serve broader audiences should be funded independently (SESAR, GeoMapApp, GMRT)
- Create a new management/governance structure
  - more independence for IEDA partners and funders to allow growth
  - rely on external developments for cross-disciplinary services
46. Where are we heading from here?
47. Oh no, that diagram again ...
According to Wittenburg & Strawn (2018), the implementation of data infrastructure can be guided by 4 statements:
A Digital Object has a structured bit sequence stored in a trustworthy repository.
A Digital Object has a PID and metadata.
The PID is associated with all relevant kernel information that allows humans and machines to enable FAIR.
Kernel information and Digital Object have types allowing humans and machines to associate operations with them.
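The four statements above can be sketched as a data structure. This is my own minimal illustration, not a standard: the attribute names, the example PID, and the operation registry are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class DigitalObject:
    """Sketch of the Digital Object described by the four statements;
    attribute names are illustrative, not taken from any standard."""
    pid: str                       # persistent identifier
    bit_sequence: bytes            # structured bit sequence in a repository
    do_type: str                   # type lets machines choose operations
    kernel_info: dict = field(default_factory=dict)  # PID kernel metadata
    metadata: dict = field(default_factory=dict)

    def operations(self):
        """Associate operations with the object's type, as statement 4
        describes; the registry below is a toy example."""
        registry = {"dataset": ["download", "cite"], "image": ["render"]}
        return registry.get(self.do_type, [])

obj = DigitalObject(
    pid="hdl:0.EXAMPLE/1",          # placeholder handle, not a real PID
    bit_sequence=b"...",
    do_type="dataset",
    kernel_info={"checksum": "sha256:...", "created": "2018-04-10"},
)
print(obj.operations())
```

The point of typing both the object and its kernel information is that a machine can decide what to do with an object (download, cite, render) without inspecting its bits.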
48. My take on priorities
Reusability, Impact on Science, Sustainability
Data type specific best practices
Metadata quality
Granularity of access, data fusion
Metrics
Data Science Education
Business models
Consolidation
The impact of data
infrastructure on science
& society depends on the
reusability of data and
will ultimately justify its
continued funding.
49. Reusability problem: Metadata quality
Discipline-specific and data type
specific metadata not well defined
and enforced
Lack of consistent vocabularies
Automated metadata enrichment
(e.g. CINERGI) has not yet had
convincing results
Manual data curation still best,
but too costly
“The Geochemical Data(base) Factory: From Heterogeneous Input to Homogeneous Output.” AGU FM 2009
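Part of the metadata-quality problem described above, missing required fields and uncontrolled vocabulary terms, can at least be detected automatically even where full curation cannot be afforded. A minimal sketch; the required fields and the vocabulary are invented for illustration:

```python
# Illustrative discipline-specific requirements; a real repository would
# define these per data type, which is exactly the gap noted above.
REQUIRED = {"sample_id", "material", "method"}
MATERIAL_VOCAB = {"basalt", "granite", "andesite"}

def check_record(rec):
    """Return a list of human-readable problems; an empty list means the
    record passes this (deliberately shallow) quality check."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED - rec.keys())]
    if "material" in rec and rec["material"].lower() not in MATERIAL_VOCAB:
        problems.append(f"uncontrolled material term: {rec['material']!r}")
    return problems

print(check_record({"sample_id": "S1", "material": "Basalt", "method": "XRF"}))
print(check_record({"sample_id": "S2", "material": "bslt"}))
```

Checks like this catch the mechanical errors; judging whether the metadata values are scientifically adequate still requires the disciplinary expertise that manual curation provides.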
50. Reusability problem: data wrangling
Surveys in recent years show that data scientists still spend 75-80% of their time
‘data wrangling’.
RDA EU survey 2013 (75%)
Brodie 2015 (80%)
CrowdFlower 2017 (80%)
Source:
Crowdflower
51. Reusability solution: Data Fusion
Harmonize & integrate data so that
disparate pieces of information form a
picture that can be explored to reveal
patterns in space, time, and properties.
52. Structure data so they can be accessed and
understood at a more granular level
Approaches are available and improving
ISO/OGC Observations & Measurements
Observation Data Model ODM2 (Horsburgh et al. 2017)
Schema.org
Open Core Data
Reusability solution:
Data Fusion
S. Cox et al. “Mainstream web standards now
support science data too”; AGU FM 2017
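The granular-access idea, modeling individual observations rather than opaque files, in the spirit of O&M and ODM2, can be sketched as follows. The class and field names are simplified for illustration and are not the actual ODM2 schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Observation:
    """Each value carries its own sample, variable, and units, so queries
    can address individual data points instead of whole files.
    Simplified illustration, not the real ODM2 model."""
    sample_id: str
    variable: str     # e.g. "SiO2"
    value: float
    units: str        # e.g. "wt%"

observations = [
    Observation("S1", "SiO2", 49.7, "wt%"),
    Observation("S1", "MgO", 8.1, "wt%"),
    Observation("S2", "SiO2", 51.2, "wt%"),
]

# Granular query: all SiO2 values across samples, no file parsing needed.
sio2 = [(o.sample_id, o.value) for o in observations if o.variable == "SiO2"]
print(sio2)
```

This is the same design choice PetDB made early on: once data are stored at observation granularity, fusion across datasets becomes a query rather than a file-wrangling exercise.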
53. Reusability problem: The Long Tail
Small data volumes, but big potential
Culture is not open to sharing
Data fragmented and highly heterogeneous
Lots of .xls files
Many data never see the light of day
ESIP Winter Meeting, January 2016
54. Reusability hope: Generation change
“A new scientific truth does not triumph by
convincing its opponents and making them see
the light, but rather because its opponents
eventually die, and a new generation grows up
that is familiar with it.”
Max Planck
55.
Credit: Jon Stelling, LeHigh University
56. Steps in the data life cycle are siloed in many
communities and disciplines
Recommendation: focus on the full data life
cycle
Final Report from the NSF Computer and Information Science and
Engineering Advisory Committee, Data Science Working Group
Communications of the ACM, Vol. 61 No. 4,
Pages 67-72, April 2018
57. A trend toward large facilities
58. Education in Data Science or
Data Science in Education
Data Science as a new field in academia
Different organizational models emerging at academic
institutions to integrate with domain sciences
59. I’ll leave the funding question to the
experts.
Trust of the science community
60. Funding
“Funding research data management and related infrastructures”, May 2016
Authors: Knowledge Exchange Research Data Expert Group and Science Europe Working Group
on Research Data.
61. Did we move at all?
2007
62. Success!
The International Geo Sample Number
Grew from a local, centralized system started in 2004 to
an international organization founded in 2011
Now has 24 members on 5 continents
currently 5 active Allocating Agents
Adoption by researchers, collection curators, publishers,
and funding agencies growing
Adoption spreading to other disciplines
Biology, archeology, material sciences
# of IGSNs issued by active IGSN Allocating Agents:
- IEDA: 4,261,436
- Geoscience Australia: 2,100,273
- MARUM: 100,342
- CSIRO: 30,925
- GFZ: 4,809
Organic Biomarker Data Workshop
Newest members since 2017:
USGS (USA)
BGS (UK)
CNRS (France)
IFREMER (France)
ANDS (Australia)
63. The final message: Let’s work together!
It is essential that we leverage existing capabilities and expertise.
We do not have the luxury of duplicating
effort.
We need to break down barriers between
communities and stakeholders that compete
for their piece of the pie.
NSF Workshop Cyberinfrastructure for Large Facilities, Nov 2015
64. Back to the beginning:
“Do what excites you. Follow your passion.
Don't necessarily worry about what obstacles
might be there, because there are always ways
to overcome them. But the most exciting thing
is to be able to do what you love, and just don't
let anything stand in the way of that.”
Carol Greider 2009 Nobel Prize winner
I am incredibly honored and humbled by this medal, and I really would like you to know how much this means to me. So before I start getting into the topic of RDI, I would like to take a brief detour and talk a little bit about how I got here and what the significance of this honor is in my life.
In 1982 I was about ready to finish my dissertation in petrology when I got pregnant, married, and became a housewife. The scientific work that I was doing came to an end and my career seemed to be over before it had even started. Two years after my son was born, I took a half-time position as lab technician at the Max-Planck-Institute for Chemistry in town, and even though it did not pay any real money, it brought me back into the research environment. I had amazing colleagues, who encouraged me to finish my PhD, and supported me through a rough couple of years, when I tried to be a mom during the day and catch up with science at night. But it was the best thing I have done, and I am so grateful to all those colleagues. Without that PhD, I would not have been able to get the position as Staff Associate at the Lamont-Doherty Earth Observatory, when I moved to the US in 1996. In that position I had two main duties: to run a geochemistry lab and to build a database for volcanic rock geochemistry. And that was the beginning
A lecture like this is a great opportunity to reflect on the past, where we started off and where we got to, and use the experiences that we collected ourselves in our work and the insights gained through broader developments – be they good or bad – to inform decisions regarding the future.
I will take two different looks at the past:
one is using the work of historians, economists, social scientists, and information scientists to understand the development of infrastructures and how insights can inform the development of data and cyberinfrastructure. In 2007, while preparing a presentation for an NSF workshop that was convened to envision the future of Geoinformatics in the US and globally, I found a report written by Paul Edwards and colleagues that was a real eye-opener and helped me, and I think many others, to put ongoing activities aimed at building cyberinfrastructure into context. Just last month, while preparing for this lecture, I ran into a paper by Peter Wittenburg and George Strawn that builds on the same classic book by Thomas Hughes to define the path of data infrastructure for the future.
The other one is based on my own experiences along the path of building data infrastructure for the solid earth sciences, especially the experiences gained in the creation and operation of the Interdisciplinary Earth data Alliance that I am directing.
A word of caution first: The data universe is highly complex and diverse. I cannot possibly aspire to cover all topics and address every aspect. I am a geoscientist ...
Vision:
Enable an open, extensible, and evolvable digital science ecosystem.
Facilitate research data, information, knowledge, and data tools discovery.
Enhance problem-solving processes.
Move and connect scientific data across scientific disciplines
Manage scientific workflows
Interoperation between scientific data and literature
Integrated science policy framework
Networked digital data systems & libraries that interoperate
There are a number of drivers behind building data infrastructure:
There is an ever-growing, perhaps exponentially growing, volume of data acquired in the sciences in general, and specifically in the Earth sciences, where new data acquisition technologies and computing capabilities are used to gather observations from space, in the oceans, and on land, to simulate Earth processes, and to generate models that predict future paths.
And data, together with the technologies to mine, analyze, and visualize them, are giving us new insights into the way the Earth works.
Lots of reports have come out.
There is no doubt that infrastructures have a profound effect on the nature of modern human societies.
The Roman water supply system opened the way to building the largest capital of ancient times.
Railroad systems allowed people & goods to be exchanged at previously unknown speeds and facilitated the first industrial revolution.
Global electrification changed the availability of power and facilitated the second industrial revolution.
The Internet with its web applications changed the availability of information and facilitated new kinds of businesses.
Infrastructures start with test installations, followed by small-size installations, and are then extended stepwise into interconnected systems.
“Attraction and convergence are driven mainly by efficiency and economic concerns.
The benefit of convergence is the belief of stakeholders that a stable fundament has been built, on top of which new investments and developments can be made to fully exploit the new technologies and infrastructures.”
FAIR principles are a major milestone that represents an ‘attractor’ in the solution space. But FAIR principles express policy goals; they need to be translated into actions.
When businesses merge, it is often to achieve economies of scale. Larger organizations are typically able to produce goods and services more efficiently and at a lower per-unit cost than smaller businesses because fixed costs are spread out over a larger number of units. This is not always the case, however. Sometimes when two firms merge, being larger will actually create dis-economies of scale, where per unit production costs increase because of increased coordination costs.
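The economies-of-scale argument above is simple arithmetic: per-unit cost is fixed cost spread over n units plus variable cost, so it falls as n grows, until coordination overhead, which itself can grow with n, reverses the trend. A toy illustration with invented numbers:

```python
def per_unit_cost(n, fixed=1000.0, variable=2.0, coordination=0.01):
    """Toy per-unit cost model: fixed costs spread over n units, plus a
    coordination overhead that grows with scale. All numbers invented."""
    return fixed / n + variable + coordination * n

small = per_unit_cost(100)     # 1000/100 + 2 + 1  = 13.0
merged = per_unit_cost(500)    # 1000/500 + 2 + 5  = 9.0
too_big = per_unit_cost(5000)  # 0.2 + 2 + 50      = 52.2

print(small, merged, too_big)
```

Merging helps while spreading fixed costs dominates (13.0 down to 9.0 here), but once coordination costs outgrow the savings, per-unit cost rises again (52.2), which is the dis-economy of scale the IEDA merger experienced in miniature.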
Re-usability
Domain standards
Business models
Workforce
Quality
Communities need to define disciplinary and data type specific best practices (documentation of provenance, uncertainties, etc.)
Readiness for data mining & analysis
Improve granularity of access
Data fusion (the ‘data lake’)
There are more lessons to be learned from the IGSN development, but that is for another talk.