Scientific discovery and innovation in an era of data-intensive science
William (Bill) Michener, Professor and Director of e-Science Initiatives for University Libraries, University of New Mexico; DataONE Principal Investigator
The scope and nature of biological, environmental and earth sciences research are evolving rapidly in response to environmental challenges such as global climate change, invasive species and emergent diseases. Scientific studies are increasingly focusing on long-term, broad-scale, and complex questions that require massive amounts of diverse data collected by remote sensing platforms and embedded environmental sensor networks; collaborative, interdisciplinary science teams; and new tools that promote scientific data preservation, discovery, and innovation. This talk describes the challenges facing scientists as they transition into this new era of data intensive science, presents current solutions, and lays out a roadmap to the future where new information technologies significantly increase the pace of scientific discovery and innovation.
Data Equivalence
Mark Parsons, Lead Project Manager, Senior Associate Scientist, National Snow and Ice Data Center
Data citation, especially using persistent identifiers like Digital Object Identifiers (DOIs), is an increasingly accepted scientific practice. Recently, several respected organizations have developed guidelines for data citation. The different guidelines are largely congruent in that they agree on the basic practice and elements of data citation, especially for relatively static, whole data collections. There is less agreement on the more subtle nuances of data citation that are sometimes necessary to ensure precise reference and scientific reproducibility -- the core purpose of data citation. We need to be sure that if you follow a data reference you get to the precise data that were used, or at least their scientific equivalent. Identifiers such as DOIs are necessary but not sufficient for the precise, detailed references required. This talk discusses issues around data set versioning, micro-citation, and scientific equivalence. I propose some interim solutions and suggest research strategies for the future.
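One practical safeguard in this spirit (my illustration, not a proposal from the talk) is to publish a content fingerprint alongside the identifier, so that whoever resolves a data reference can verify they received the exact bytes that were cited. A minimal sketch in Python, assuming the dataset can be presented as named parts:

```python
import hashlib

def dataset_fingerprint(named_parts):
    """Deterministically hash a dataset given as {name: bytes}.
    Sorting by part name makes the fingerprint independent of the
    order in which parts were collected, so the same data always
    yields the same fingerprint."""
    h = hashlib.sha256()
    for name in sorted(named_parts):
        h.update(name.encode("utf-8"))
        h.update(named_parts[name])
    return h.hexdigest()
```

A citation could then carry both the DOI and the fingerprint, e.g. "doi:10.xxxx/yyyy (sha256:ab12...)", letting a reader detect silent version drift behind a stable identifier.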
EZID: Easy dataset identification & management
Joan Starr, Manager, Strategic and Project Planning and EZID Service Manager, California Digital Library
Data and data curation are assuming a growing role in today’s research library. New approaches are needed both to address the resulting challenges and to take advantage of the emerging opportunities. Long-term identifiers represent one such tool. In this presentation, Joan Starr will introduce identifiers and an application designed to make them easy to create and manage: EZID. She will provide a closer look at two identifier types, DOIs and ARKs, and discuss what bringing an identifier service to your institution might mean.
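EZID exposes identifier creation over a plain HTTP API whose metadata payload uses the simple ANVL "name: value" format. The sketch below shows only the ANVL serialization step; the field names are illustrative, and a real request additionally needs an EZID account and an HTTP PUT/POST to the service:

```python
def to_anvl(metadata):
    """Serialize a metadata dict into ANVL, the one-line-per-element
    'name: value' format accepted by EZID's HTTP API. Percent signs,
    newlines, and carriage returns inside names or values are
    percent-encoded so each element stays on a single line."""
    def esc(s):
        return s.replace("%", "%25").replace("\n", "%0A").replace("\r", "%0D")
    return "\n".join(f"{esc(k)}: {esc(v)}" for k, v in metadata.items())
```

The resulting string would be sent as the request body when minting or updating an identifier (for example, a PUT to https://ezid.cdlib.org/id/<identifier> with appropriate credentials).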
Opening Keynote: The Many and the One: BCE themes in 21st century data curation
Allen Renear, Professor and Interim Dean, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Two scientists can be using "the same data" even though the computer files involved appear to be quite different. This is familiar enough, and for the most part, in small communities with shared practices and familiar datasets, raises few problems. But these informal understandings do not scale to 21st century data curation. To get full value from cyberinfrastructure we must support huge quantities of heterogeneous data developed by diverse communities and used by diverse communities -- often with widely varying methods, tools, and purposes. To accomplish this our informal practices and understandings must be replaced, or at least supplemented, by a shared framework of standard terminology for describing complex cascades of representational levels and relationships. Fundamental problems in data curation -- and in particular problems involving provenance, identifiers, and data citation -- cannot be fully resolved without such a framework. Although the deepest problems here have ancient origins, useful practical measures are now within reach. Some recent work toward this end that is being carried out at the Center for Informatics Research in Science and Scholarship (CIRSS) at the Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign will be described.
This presentation sets out some of the challenges around citing and identifying datasets and introduces DataCite, the international data citation initiative. DataCite was founded on 1 December 2009 to support researchers by providing methods for them to locate, identify, and cite research datasets with confidence.
This presentation was given by Adam Farquhar at the STM Publishers Association Innovation Conference on 4 December 2009.
University of Bath Research Data Management training for researchers
Jez Cope
Slides from a workshop on Research Data Management for research staff and students at the University of Bath.
Part of the Research360 project (http://blogs.bath.ac.uk/research360).
Authors: Cathy Pink and Jez Cope, University of Bath
Whitepaper: CHI: Hadoop's Rise in Life Sciences
EMC
Genomics' large, semi-structured, file-based data is ideally suited for the Hadoop Distributed File System (HDFS). The EMC Isilon OneFS file system features connectivity to HDFS that makes the Hadoop storage scale-out and truly distributed. An example from the "CrossBow" project is explored.
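The fit between line-oriented genomics formats and MapReduce can be seen in miniature below: a FASTA file splits cleanly into independent records, so a map step and a reduce step parallelize naturally. The example (counting reads by their first three bases) is purely illustrative and is not CrossBow's actual alignment pipeline:

```python
from collections import defaultdict

def map_reads(fasta_lines):
    """Map step: for each sequence line, emit (first_3_bases, 1).
    FASTA is line-oriented, which is what lets HDFS split it across
    blocks and run many mappers in parallel."""
    for line in fasta_lines:
        line = line.strip()
        if line and not line.startswith(">"):
            yield line[:3].upper(), 1

def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per key."""
    totals = defaultdict(int)
    for key, n in pairs:
        totals[key] += n
    return dict(totals)
```

With Hadoop Streaming, the same two functions would run as separate mapper and reducer processes over HDFS splits; here they compose directly for a local run.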
White Paper: Life Sciences at RENCI, Big Data IT to Manage, Decipher and Info...
EMC
This white paper explains how the Renaissance Computing Institute (RENCI) of the University of North Carolina uses EMC Isilon scale-out NAS storage, Intel processor and system technology, and iRODS-based data management to tackle Big Data processing, Hadoop-based analytics, security and privacy challenges in research and clinical genomics.
A basic course on Research data management: part 1 - part 4
Leon Osinski
Slides belonging to a basic course on research data management. The course consists of 4 parts:
Part 1: what and why
1.1 data management plans
Part 2: protecting and organizing your data
2.1 data safety and data security
2.2 file naming, organizing data (TIER documentation protocol)
Part 3: sharing your data
3.1 via collaboration platforms (during research)
3.2 via data archives (after your research)
Part 4: caring for your data, or making data usable
4.1 tidy data
4.2 documentation/metadata
4.3 licenses
4.4 open data formats
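To make the "tidy data" idea in Part 4 concrete, here is a small sketch (my illustration, not from the course materials) that reshapes wide records, one column per measurement, into the tidy long form of one row per observation:

```python
def tidy(wide_rows, id_col):
    """Reshape wide records (one column per measured variable)
    into tidy long form: one row per (id, variable, value).
    `wide_rows` is a list of dicts sharing an identifier column."""
    long_rows = []
    for row in wide_rows:
        for key, value in row.items():
            if key != id_col:
                long_rows.append({id_col: row[id_col],
                                  "variable": key,
                                  "value": value})
    return long_rows
```

Long form like this is what generic tools (plotting, filtering, joining) handle uniformly, which is the practical payoff of tidying before archiving.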
The Brain Imaging Data Structure and its use for fNIRS
Robert Oostenveld
These slides were prepared for the NIRS toolkit course at the Donders, which was postponed due to the Corona crisis. The slides present BIDS, explain how fNIRS often involves multiple signals, and relate the two to synchronization and data management.
Where is the opportunity for libraries in the collaborative data infrastructure?
LIBER Europe
Presentation by Susan Reilly at Bibsys2013 on the opportunities for libraries and their role in the collaborative data infrastructure. Looks at data sharing, authentication, preservation and advocacy.
Keynote presented to a KE workshop held in conjunction with the release of the report "A Surfboard for Riding the Wave: Towards a four country action programme on research data": http://www.knowledge-exchange.info/Default.aspx?ID=469
This is a presentation for the Erwin Hahn Institute in Essen, explaining the background, functional design and technical architecture of the Donders Repository. Furthermore, it explains how the repository aligns with DCCN project management and with the researchers' workflow.
This presentation was delivered at the Elsevier Library Connect Seminar on 6 October 2014 in Johannesburg, 7 October 2014 in Durban and 9 October 2014 in Cape Town and gives an overview of the potential role that librarians can play in research data management
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...)
Amazon Web Services
Professors Wall and Tonellato of Harvard Medical School in collaboration with Beth Israel Deaconess Medical Center discuss the emerging area of clinical whole genome sequencing analysis and tools. They report on the use of Amazon EC2 and Spot Instances to achieve a robust clinical time processing solution and examine the barriers to and resolution of producing clinical-grade whole genome results in the cloud. They benchmark an AWS solution, called COSMOS, against local computing solutions and demonstrate the time and capacity gains conferred through the use of AWS.
http://www.escience2009.org/ Web Semantics in Action: Web 3.0 in e-Science. 11:50–12:15, Annamaria Carusi & Anita de Waard: Changing Modes of Scientific Discourse Analysis, Changing Perceptions of Science
Overview of scientific discourse annotation
Anita de Waard
Presentation held at the second Amicus workshop, http://amicus.uvt.nl/amicus_ws2011.htm: "Storytelling in Fairytales and Science: Narrative structure models of scientific communication and folktales"
Have you ever asked yourself how long it takes to drive from home to the office? You might be able to give an approximate time for your trip, but more likely you'll first ask yourself: which way, what day of the week, and what modes of transportation do I take? The answer depends on the conditions.
You might not realize that you have just carried out an assessment activity, simply by asking a question or contemplating the duration of your trip from home to your office.
Designing Sideways: integrating emergence with authorship
Adam Russell
This talk examines the tension between bottom-up or systemic game design and more traditional top-down scripting of unique narrative experiences. Market and design trends have been pushing triple-A games towards a combination of these approaches for some years now. However, many designers still see bottom-up emergence as a magic bullet, and vainly hope to integrate this with heavily scripted sequences without considering the deep implications of trying to do so. In the second half of the talk we will explore game design approaches that are neither bottom-up nor top-down, but both at the same time, which we call 'sideways design'.
Slides describing Force11 work and the background of several of the speakers, used for talks at the University of Lethbridge, Carnegie Mellon, and internally at Elsevier.
Enabling your Human Resource Information System to support HR Strategic Roles
OPUS Management
Human Resource Management (HRM) has shifted its function within organisations over the last few years. Its function has grown considerably and has shifted into a more strategic role rather than providing support for administrative paperwork. There has been a shift in terminology too, with the term Strategic Human Resource Management (SHRM) becoming more common.
Scott Edmunds' talk on GigaScience, big data, data citation and future data handling at the International Conference of Genomics, 15 November 2011.
Data-knowledge transition zones within the biomedical research ecosystem
Maryann Martone
Overview of the Neuroscience Information Framework and how it brings together data, in the form of distributed databases, and knowledge, in the form of ontologies to show the mapping of the dataspace and places where there are mismatches between data and knowledge.
No Free Lunch: Metadata in the life sciences
Chris Dwan
This presentation covers some challenges and makes suggestions to support the work of creating flexible, interoperable data systems for the life sciences.
How best to manage your data to make the most of it for your research, with the ODAM framework (Open Data for Access and Mining): give open access to your data and make them ready to be mined.
Talk at the World Science Festival at Columbia, June 2, 2017: session on Big Data and Physics: http://www.worldsciencefestival.com/programs/big-data-future-physics/
Data Repositories: Recommendation, Certification and Models for Cost Recovery
Anita de Waard
Talk at NITRD Workshop "Measuring the Impact of Digital Repositories" February 28 – March 1, 2017 https://www.nitrd.gov/nitrdgroups/index.php?title=DigitalRepositories
The Narrative Structure of Research Articles, or, Why Science is Like a Fairy...
Whither Small Data?
1. Whither Small Data?
Some Thoughts on Managing
Research Data
February 26, 2013
Anita de Waard
VP Research Data Collaborations, Elsevier RDS
a.dewaard@elsevier.com
2. Why should data be saved?
A. Hold scientists accountable: Data Preservation
– Preserve record of scientific process, provenance
– Enable reproducible research
B. Do better science: Data Use
– Use results obtained by others!
– Improve interdisciplinary work
C. Enable long-term access: Sustainable Models
– Use for technology transfer; societal/industrial development
– Reward scientists for data creation (credit/attribution)
– Allow public/others insight/use of results
3. [image-only slide]
4. Where The Data Goes Now:
[Figure] Context: >50 My papers; 2 M scientists; 2 My papers/year.
– The majority of data (90%?) is stored on local hard drives
– Some data (8%?) is stored in large, generic data repositories (Dryad: 7,631 files; Dataverse: 0.6 My; Datacite: 1.5 My)
– A small portion of data (1-2%?) is stored in small, topic-focused data repositories (PDB: 88.3 k; PetDB: 1.5 k; SedDB: 0.6 k; MiRB: 25 k; TAIR: 72.1 k)
5. Key Needs:
[Same figure as slide 4, annotated with two goals: INCREASE DATA PRESERVATION and DEVELOP SUSTAINABLE MODELS]
6. A. Data Preservation:
• Issues:
– Currently data is often used by single researchers or small groups: many different, idiosyncratic formats
– Often not in electronic form (maps, images)
– No metadata: when, where, by whom, WHY was this data collected?
• Needs:
– Tools to make data export/storage simple and unavoidable
– Policies that make data sharing mandatory and simple
– Systems that reward data sharing/digitisation
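A tool of the kind the "Needs" bullets describe can start very small: for example, a script that writes a when/where/by-whom/why record next to every raw data file. A hypothetical sketch (names and fields are my own, not from the slides):

```python
import json
from datetime import datetime, timezone

def sidecar(data_file, collector, location, purpose):
    """Build a minimal metadata sidecar answering the slide's
    missing questions (when, where, by whom, why) as a JSON
    string to store alongside the raw data file."""
    return json.dumps({
        "file": data_file,
        "collected_by": collector,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "location": location,
        "purpose": purpose,
    }, indent=2)
```

Making such a record a mandatory step of data export (rather than an afterthought) is one concrete reading of "simple and unavoidable".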
7. B. Data Use:
• Issues:
– In generic data repositories, data cannot be used because of inadequate metadata, lack of quality review, lack of provenance
– It’s expensive to make data usable!
– Domain-specific data stores are not cross-searchable across discipline/national borders
• Needs:
– Standardised metadata systems across systems/repositories, and tools to apply them easily
– Integration layers to enable cross-repository queries
– A funding model to enable long-term preservation
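An integration layer of the kind listed under "Needs" essentially does two things: map each repository's local field names onto a shared schema, then query across the normalized records. A toy sketch with mock repositories (all names and mappings hypothetical):

```python
def normalize(record, mapping):
    """Map one repository's local field names onto a shared schema.
    `mapping` is {standard_field: local_field}."""
    return {std: record.get(local) for std, local in mapping.items()}

def federated_search(repos, keyword):
    """Query several repositories through one integration layer.
    Each repo is (records, field_mapping); matches come back in
    the shared schema, so callers never see local field names."""
    hits = []
    for records, mapping in repos:
        for rec in records:
            std = normalize(rec, mapping)
            if keyword.lower() in (std.get("title") or "").lower():
                hits.append(std)
    return hits
```

Real integration layers add protocol adapters and pagination, but the schema-mapping core is the part that standardised metadata would make trivial.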
8. C. Sustainable Models:
• Issues:
– Many successful domain-specific data repositories are running out of funding
– Is adding metadata something you want to keep paying PhD+ scientists to do?
– Unclear who foots the bill: the researcher? The institute? The grant agency? For how long?
• Needs:
– Attribution models for rewarding scientists
– Policies to improve cross-domain and cross-national collaborations
– Funding models to sustain databases long-term
9. Linking papers to research data:
Database               | Object linked   | Displayed
Pangaea                | Location        | Google Maps map with location
Protein Databank (PDB) | Protein         | 3D protein visualisation
Genbank                | Gene name       | NCBI Gene Viewer
Exoplanets +           | Exoplanet name  | Rich information on extrasolar planets
Species +              | Species name    | Rich information on species
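The enrichment pattern behind this slide, recognizing an identifier in the text and linking it to a domain viewer, can be sketched as a simple lookup. The URL patterns below are illustrative placeholders, not the production resolvers:

```python
# Illustrative viewer URL patterns; real resolvers differ per database.
VIEWERS = {
    "pdb":  "https://www.rcsb.org/structure/{id}",
    "gene": "https://www.ncbi.nlm.nih.gov/gene/?term={id}",
}

def link_entity(kind, identifier):
    """Turn an in-text identifier into a link to the matching
    domain viewer; return None when no viewer is registered."""
    pattern = VIEWERS.get(kind)
    return pattern.format(id=identifier) if pattern else None
```

In a production pipeline the `kind` would come from an entity recognizer run over the article text rather than being passed in by hand.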
10. Towards ‘wrapping papers around data’
1. Store metadata on all materials
2. Track the methods while doing them
3. Write papers that ‘wrap around’ this metadata
4. Don’t ‘send’ your papers – just expose them to the outside world
5. Invite reviews; open data to trusted parties, at trusted time
6. Allow apps/tools to integrate
[Figure: a mock paper fragment (“Rats were subjected to two grueling tests (click on fig 2 to see underlying data). These results suggest that the neurological pain pro-…”) surrounded by metadata blocks and workflow labels: Calculate, coordinate…; Review; Revise; Compile, comment, compare…; Edit]
11. Research Data Services:
A. Increase Data Preservation:
Help increase the amount and quality of data preserved and shared
B. Improve Data Use:
Help increase the value and usability of the data shared by increasing annotation, normalization and provenance, enabling enhanced interoperability
C. Develop Sustainable Models:
Help measure and deliver credit for shared data to the researchers, the institute, and the funding body, enabling more sustainable platforms.
12. Guiding Principles of RDS:
• In principle, all open data stays open, and URLs, front end etc. stay where they are (i.e. with the repository)
• Collaboration is tailored to data repositories’ unique needs/interests (‘service-model’ type):
– Aspects where collaboration is needed are discussed
– A collaboration plan is drawn up using a Service-Level Agreement: agree on time, conditions, etc.
• Transparent business model
• Very small (2-3 people) department; immediate communication; instant deployment of ideas
13. Three pilots:
1. Carnegie Mellon Electrophysiology Lab:
A. Data Input: Develop a suite of tools to enable simple data capturing on a handheld device, add metadata during the experiment, store it with raw traces, and create a dashboard for viewing
B. Data Use: Integrate with NIF and eagle-I ontologies, enable access through NIF; combine with other sources
2. ImageVault, with Duke CIVM:
A. Data Input: Get 3D image data into a common format and resolution, annotated to allow comparison
B. Data Use: View other image data sets & do image analytics
C. Sustainable Models: Create funding for 3D image sets: free layer for raw data/subscription analytics.
14. 3. IEDA Data Rescue Process Study
Data Rescue:
– Identify 3-5 data sets that need to be ‘rescued’
– Work with investigators to identify data sources, formats
– Work with IEDA to define metadata standards, quality checks etc.
Data Rescue Process:
– A group of data wranglers performs ‘electrification’ and annotation
– (Open source) software is developed where needed, to help this process
– We help develop common standards, if needed
15. 3. IEDA Data Rescue Process Study
Data Rescue Process Study:
Jointly publish a report on a ‘gap analysis’ comparing where we are now with where we need to be, including:
– What we did (data imported; processes/standards created/described; software built; user tests, outcomes)
– Effort involved (time, software, equipment, skills, etc.)
– How easy it would be to scale up; what part of the data out there could be handled this way
– Recommendations for the tools and skills that are needed if we want to scale up this process
16. Summary:
• Three key issues:
A. Data Preservation
B. Data Use
C. Sustainable Models
• Elsevier’s approach:
– Linking data to papers
– Wrap papers around data
– Explore role in the research data space
• Elsevier RDS:
– Three pilots (CMU, Duke, IEDA) to investigate issues
– We’ll report back in about a year!
17. Questions?
Anita de Waard
VP Research Data Collaborations, Elsevier
a.dewaard@elsevier.com
Editor's Notes
Are current modes of publication and excessive reliance on essentially only one medium (articles and books) serving scholarship or limiting it?