Open Science as Roadmap to Better
Data Science Research
Beth Plale
Professor, Indiana University Bloomington
On loan to National Science Foundation
ORNL August 16, 2018
Abstract
• In this talk I discuss open science in data science. I take the perspectives of both funder and researcher, with the objective of showing that open science equates to good science, and that in the end it benefits both science and society.
“Open science:
today’s data,
tomorrow’s
discoveries”
Open Science draws attention to the inherent value in all primary products produced as an outcome or result of research:
• More attentive research processes;
• With more thought to other uses;
• With more thought to the reproducibility / replicability of the work
Data science is international, collaborative, and cross-disciplinary.
Data that are created with openness in mind are more likely to be used, spreading the impact of the researcher who created them.
Funders Encourage Open Science
“Investigators are expected to share
with other researchers, at no more than
incremental cost and within a
reasonable time, the primary data,
samples, physical collections and other
supporting materials created or
gathered in the course of work under
NSF grants.” National Science Foundation
Data Management Plans (DMP)
• The researcher writes a Data Management Plan for the important data they expect to create during the course of their research (a sketch of the elements such a plan typically covers follows below)
• From the National Science Foundation: “What constitutes reasonable data management and access will be determined by the community of interest through the process of peer review and program management.” [Data Management & Sharing Frequently Asked Questions, National Science Foundation]
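For orientation, here is a minimal sketch, in Python, of the elements an NSF-style DMP typically addresses (types of data, standards, access and sharing, re-use, and preservation). The dictionary keys and wording are this sketch's own shorthand, not an NSF schema.

# Illustrative sketch only: elements an NSF-style Data Management Plan
# typically addresses. Keys and wording are shorthand, not an NSF schema.
dmp_outline = {
    "data_types": "Types of data, samples, software, and materials to be produced",
    "standards": "Data and metadata formats and standards to be used",
    "access_and_sharing": "Policies for access and sharing, incl. privacy/confidentiality",
    "reuse": "Policies for re-use, re-distribution, and derivative products",
    "preservation": "Plans for archiving data and preserving access to them",
}

# Print the outline as a checklist a researcher could fill in.
for section, description in dmp_outline.items():
    print(f"- {section}: {description}")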
Getting to Open Science in the US
• Data valuation
• Infrastructure and FAIR
• Open as possible; closed as necessary
• Reuse and reproducibility
More data is generated in the course of science than can be kept.
Science communities must come to terms with a realistic valuation of data by answering:
Data valuation
• What data can be thrown away?
• How long should a dataset be kept?
• Who decides?
Data value: the value of data (an object, product, or collection) to science and society, either as part of the larger scholarly record (inherent value) or through enabling new discoveries.
Not all data created in the context of science has the same value.
Suppose the data products of research, the scientific data created as a product (or byproduct) of research, can be binned based on some combination of value and longevity:
• Tier 1: data of highest value / longest longevity
• Tier 2 data
• Tier 3 data
• Tier 4: data of lowest value / shortest longevity
[Figure: a pyramid of the four tiers, from highest value and longevity at the top to lowest at the bottom.]
Examples of bins into which scientific data might be placed (one illustrative assignment to tiers is sketched below):
• Data needed to reproduce the images in a paper
• Derived data that can be quickly recreated
• Derived data that is costly to recreate
• Ephemeral observations
• Raw observational data, uncleaned
• Raw observational data, minimally cleaned/filtered
• Hand-annotated datasets
• Slide decks of talks
• All model or experimental evaluation runs
[Slide arranges these examples into the Tier 1 through Tier 4 bins.]
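As a toy illustration in Python, the example products above could be assigned to tiers and screened against a curation bar like this. The particular placements are assumptions made for this sketch, not a community's decision.

# Hypothetical tier assignment for the example data products above.
# Tier 1 = highest value / longest retention; Tier 4 = lowest / shortest.
# The placements are illustrative assumptions, not community policy.
tier_assignment = {
    "hand-annotated datasets": 1,                        # costly human effort to recreate
    "raw observational data, uncleaned": 1,              # cannot be re-observed
    "derived data that is costly to recreate": 2,
    "raw observational data, minimally cleaned/filtered": 2,
    "data needed to reproduce the images in a paper": 3,
    "ephemeral observations": 3,
    "derived data that can be quickly recreated": 4,
    "all model or experimental evaluation runs": 4,
    "slide decks of talks": 4,
}

CURATION_BAR = 2  # community-chosen cutoff: curate tiers 1..CURATION_BAR

to_curate = [name for name, tier in tier_assignment.items() if tier <= CURATION_BAR]
print("Curate and deposit:", to_curate)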
Cost model for taking care of data
produced through science
“Investigators are expected to share
with other researchers, at no more
than incremental cost and within a
reasonable time, the primary data,
samples, physical collections and
other supporting materials created or
gathered in the course of work under
NSF grants.” PAPPG: XI D4
“This should be 5% of the grant budget.” (Barend Mons, EU Open Science Cloud)
→ This covers curation, retention, findability, accessibility, and reusability of the data.
The open science community has suggested that a realistic budget to cover new data creation and its contribution to science is 5% of the grant award budget; for a $400,000 award, for example, that would mean roughly $20,000 set aside for curation, retention, and access.
Curation Bar Placement and Implication for Community
A community undertakes to determine the value/longevity of the data it produces. It does so by positioning a “curation bar” within the data bin stack:
• Data above the curation bar is curated and deposited; this is the reuse data.
• Data below the bar is not “primary data” in NSF terms and is not subject to curation.
[Figure: the data bin stack (Tier 1 through Tier 4) with the curation bar drawn across it.]
Curation Bar Placement Implication
Community positions the curation bar too high:
• No high-value data products available for reuse
• Every new grant absorbs the cost of creating the data it needs (and of curating and depositing it)
• Violates the agency's commitment to public access
[Figure: the tier stack (Tier 1 through Tier 4) with the curation bar placed too high.]
Curation Bar Placement Implication
Community positions the curation bar too low:
• Many grant dollars needlessly spent curating low-value data products
[Figure: the data pyramid (Tier 1 through Tier 4) with the curation bar placed too low.]
Curation Bar Optimum Placement
Optimum positioning:
• Only those products deemed most valuable to the community are curated
• Opportunity to use existing data (saves curation dollars on a grant)
• Increased use of existing data overall means more dollars for pure research (under fixed federal research funding)
[Figure: the data pyramid (Tier 1 through Tier 4) with the curation bar in its optimum position.]
Takeaways, part 1
• The cost of curation resides with the PI who proposes to create new data (5% written into the grant budget)
• Curation cost is not incurred by PIs who use existing data
Takeaways, part 2
• Communities must be at the table in decisions about the data the community creates
FAIR principles: role in open science
• A concise and measurable set of principles for scientific data management
• Developed in 2015 under the umbrella of Force11
• Data objects are:
– Findable
– Accessible
– Interoperable
– Reusable
Infrastructure and FAIR
FAIR Guiding Principles
To be Findable:
F1. (Meta)data are assigned a globally unique and eternally persistent identifier.
F2. Data are described with rich metadata.
To be Accessible:
A1. (Meta)data are retrievable by their identifier using a standardized communications protocol (see the sketch below).
A2. Metadata are accessible, even when the data are no longer available.
FAIR Guiding Principles
To be Interoperable:
I.1. (Meta)data are machine-actionable.
I.2. (Meta)data formats utilize shared vocabularies and/or ontologies.
To be Reusable:
R.2. (Meta)data should be sufficiently well-described and rich that they can be automatically (or with minimal human effort) linked or integrated, like-with-like, with other data sources.
GO FAIR: https://www.go-fair.org/
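As a small illustration of F1 and A1, the Python sketch below resolves a dataset's persistent identifier (a DOI) over HTTPS and uses standard DOI content negotiation to request machine-readable metadata. The DOI shown is a placeholder, not a real dataset.

# Minimal sketch: retrieve metadata for a dataset by its persistent
# identifier (DOI) over HTTPS, using DOI content negotiation to ask
# for machine-readable JSON. The DOI below is a placeholder.
import urllib.request

doi = "10.1234/example-dataset"  # placeholder; substitute a real dataset DOI
url = f"https://doi.org/{doi}"

request = urllib.request.Request(
    url,
    headers={"Accept": "application/vnd.citationstyles.csl+json"},  # standard metadata content type
)
with urllib.request.urlopen(request) as response:
    metadata = response.read().decode("utf-8")

print(metadata)  # rich metadata (F2), retrieved by identifier over a standard protocol (A1)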
Options are needed for restricted data reuse on a spectrum between completely open (Open Access) and completely hidden.
Open as possible; closed as necessary
A possible way forward is suggested by the principle: open as possible, closed as necessary.*
* Principle articulated in “Guidelines on FAIR Data Management in Horizon 2020”, EU Horizon 2020 programme
Forms of data availability lie on a spectrum between pure open access and fully hidden.
Capsule framework
A controlled compute environment, the capsule framework, is a viable approach to accessing and sharing restricted data: it satisfies sharing while protecting the data from unintended use or from use prohibited by law (a conceptual sketch follows below).
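A minimal conceptual sketch of the idea in Python (this is not the capsule framework's actual API): analysis runs inside the controlled environment alongside the restricted data, and only derived results that pass an export review leave the capsule.

# Conceptual sketch only; not the capsule framework's real API.
# Analysis runs "inside" the capsule with access to restricted data;
# only derived results approved by an export review may leave.

def run_in_capsule(analysis, restricted_data, export_review):
    """Run analysis on restricted data; release the result only if approved."""
    derived = analysis(restricted_data)   # computation stays inside the capsule
    if export_review(derived):            # policy or human check for disclosure risk
        return derived                    # only vetted, derived products are released
    raise PermissionError("Export denied: result did not pass review")

# Toy usage: release an aggregate count without releasing the records themselves.
if __name__ == "__main__":
    records = ["restricted record 1", "restricted record 2"]
    result = run_in_capsule(
        analysis=lambda data: {"record_count": len(data)},
        restricted_data=records,
        export_review=lambda out: set(out) == {"record_count"},  # aggregates only
    )
    print(result)  # {'record_count': 2}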
Standing on the
shoulders of
giants
Reuse and reproducibility
Image credit Library of Congress
• A core feature of the process of scientific
inquiry that occurs after reproducibility
failures is the integration of conflicting
observations and ideas into a coherent theory.
• Openness and transparency are critical to
moving research forward.
• Preregistration is critical to the replication
process.
• Not all science is built upon.
Let’s embrace reproducibility but agree
that the reproducibility bundle for a
publication can be discarded 3 years
after publication.
Open Science: 5 years out
• Scientists regularly use PIDs (persistent identifiers)
• Data Management Plans are regularly shared early with the people who will eventually care for the data
• Scientists will embrace the difference between access metadata and reuse metadata, and acknowledge that they may exist in different places (services)
• Early career scholars will know what data in their field are important enough to retain, where to put them, how best to curate them, and how long a piece of data should live
Beth Plale
plale@indiana.edu
