The document discusses 10 habits for effective research data: 1) Preserve, 2) Archive, 3) Access, 4) Comprehend, 5) Discover, 6) Reproduce, 7) Trust, 8) Cite, 9) Use, and 10) Putting it all together. It provides examples for each habit, including data rescue challenges, the Olive Project to preserve executable content, metadata tools to improve data comprehension and sharing, initiatives for data indexing and identifiers to improve discovery and reproducibility, and proposals for making data more usable and integrated. The overall message is that adopting standards and collaborating across organizations can help research data achieve its full potential.
1. Ten habits of highly effective data: How to help your dataset achieve its full potential
University of Illinois, Urbana-Champaign
May 7, 2014
Anita de Waard
VP Research Data Collaborations
a.dewaard@elsevier.com
http://researchdata.elsevier.com/
2. Who cares about Research Data?
Funding bodies:
• Demonstrate impact
• Guarantee permanence, discoverability
• Avoid fraud
• Avoid double funding
• Serve the general public
Research Management/Library:
• Generate, track outputs
• Comply with mandates
• Ensure availability
Phil Bourne, (then) Associate Vice Chancellor, UCSD, 4/13: “We need to think about the university as a digital enterprise.”
Mike Huerta, Associate Director, NLM: “Today, the major public product of science are concepts, written down in papers. But tomorrow, data will be the main product of science…. We will require scientists to track and share their data at least as well, if not better, than they are sharing their ideas today.”
Researchers:
• Derive credit
• Comply with mandates
• Discover and use
• Cite/acknowledge
Nathan Urban, PI Urban Lab, CMU, 3/13: “If we can share our data, we can write a paper that will knock everybody’s socks off!”
Barbara Ransom, NSF Program Director Earth Sciences: “We’re not going to spend any more money for you to go out and get more data! We want you first to show us how you’re going to use all the data we paid y’all to collect in the past!”
3. What’s the problem? One example: using antibodies and squishy bits
Grad students experiment and enter details into their lab notebook. The PI then tries to make sense of their slides, and writes a paper. End of story.
4. Maslow’s Hierarchy of Needs for Research Data
1. Preserved (existing in some form)
2. Archived (long-term & format-independent)
3. Accessible (can be accessed by others)
4. Comprehensible (others can understand data & processes)
5. Discoverable (can be indexed by a system)
6. Reproducible (others can redo experiments)
7. Trusted (validated/checked by reviewers)
8. Citable (able to point & track citations)
9. Usable (allow tools to run on it)
5. 1. Preserve: Data Rescue Challenge
• With IEDA/Lamont: award successful data rescue attempts
• Awarded at AGU 2013
• 23 submissions of data that was digitized, preserved, made available
• Winner: NIMBUS Data Rescue:
– Recovery, reprocessing and digitization of the infrared and visible observations along with their navigation and formatting.
– Over 4,000 7-track tapes of global infrared satellite data were read and reprocessed.
– Nearly 200,000 visible light images were scanned, rectified and navigated.
– All the resultant data was converted to HDF-5 (NetCDF) format and freely distributed to users from NASA and NSIDC servers.
– This data was then used to calculate monthly sea ice extents for both the Arctic and the Antarctic.
• Conclusion: we (collectively) need to do more of this! How can we fund it?
6. 2. Archive: Olive Project
• CMU CS & Library: funded by a grant from the IMLS; Elsevier is a partner
• Goal: preservation of executable content – nowadays a large part of intellectual output, and very fragile
• Identified a series of software packages and prepared VMs to preserve them
• Does it work? Yes – see video (1:24)
7. 3. Access: Urban Legend
• Part 1: Metadata acquisition
• Step through the experimental process in a series of dropdown menus in a simple web UI
• Can be tailored to the workflow of an individual researcher
• Connected to shared ontologies through a lookup table, managed centrally in the lab
• Connect to data input console (Igor Pro)
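A lab-managed lookup table of the kind described can be as simple as a mapping from local free-text terms to shared ontology identifiers; a minimal sketch (the cell-type and method IDs are hypothetical placeholders, not a real ontology):

```python
# Sketch: a lab-managed lookup table mapping free-text terms from a
# metadata-entry UI to shared ontology identifiers. The cell-type and
# method IDs below are hypothetical placeholders.

LOOKUP = {
    "mouse": "NCBITaxon:10090",       # species -> taxonomy ID
    "pyramidal neuron": "CELL:0001",  # hypothetical cell-type ID
    "patch clamp": "METHOD:0042",     # hypothetical method ID
}

def normalize_term(free_text):
    """Resolve a free-text term to its ontology ID, or flag it for curation."""
    key = free_text.strip().lower()
    return LOOKUP.get(key, f"UNRESOLVED:{key}")
```

Unresolved terms surface to the lab's central curator, who extends the table, so the vocabulary grows with the lab's actual usage.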
8. 4. Comprehend: Urban Legend
• Part 2: Data Dashboard
• Access, select and manipulate data (calculate properties, sort and plot)
• Final goal: interactive figures linked to data
• Plan to expand to more neuroscience labs
• Plan to build for a geochemistry use case
9. 5. Discover: Data Indexing proposals
• Collaborated on a Data Discovery Index proposal with UCSD/Carnegie Mellon
• Also worked with UIUC!
• Interested in developing distributed infrastructures for making data easier to search: what is the ‘Goldilocks index’ where search is scalable, yet useful?
• Looking for academic/industry partners, use cases and platforms to address the next stage
• Discoverability is a key driver for metadata/data format structure!
10. 6. Reproduce: Resource Identifier Initiative
• Force11 Working Group to add data identifiers to articles that are:
– 1) Machine readable;
– 2) Free to generate and access;
– 3) Consistent across publishers and journals.
• Authors publishing in participating journals will be asked to provide RRIDs for their resources; these are added to the keyword field
• RRIDs will be drawn from:
– The Antibody Registry
– Model Organism Databases
– NIF Resource Registry
• So far, Springer, Wiley, Biomednet and Elsevier journals have signed up, with 11 journals and more to come
• Wide community adoption!
11. 7. Trust: Moonrocks
How can we scale up data curation?
Pilot project with IEDA:
• A database for lunar geochemistry: leapfrog & improve curation time
• 1-year pilot, funded by Elsevier
• Main conclusion: if spreadsheet columns/headers map to an RDB schema, we can scale curation cost!
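The scaling observation above can be sketched concretely: when a spreadsheet's column headers map one-to-one onto a relational schema, ingestion becomes mechanical. A minimal sqlite3 sketch (the `samples` table, the header-to-column mapping, and the CSV values are hypothetical illustrations):

```python
import csv, io, sqlite3

# Sketch: bulk-loading a spreadsheet into a relational table when its
# headers map 1:1 onto the schema. The 'samples' schema, the mapping,
# and the CSV content are hypothetical.
HEADER_TO_COLUMN = {"Sample ID": "sample_id", "SiO2 (%)": "sio2_pct", "TiO2 (%)": "tio2_pct"}

def load_spreadsheet(csv_text, conn):
    """Insert every spreadsheet row into the samples table via the header mapping."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples (sample_id TEXT, sio2_pct REAL, tio2_pct REAL)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        mapped = {HEADER_TO_COLUMN[h]: v for h, v in row.items()}
        conn.execute(
            "INSERT INTO samples (sample_id, sio2_pct, tio2_pct) "
            "VALUES (:sample_id, :sio2_pct, :tio2_pct)",
            mapped,
        )

conn = sqlite3.connect(":memory:")
load_spreadsheet("Sample ID,SiO2 (%),TiO2 (%)\n10017,40.7,11.7\n", conn)
```

The curation cost then concentrates in defining the mapping once per spreadsheet layout, rather than in hand-entering each row, which is what makes the approach scale.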
12. 8. Cite: Force11 Data Citation Principles
• Another Force11 Working Group
• Defined 8 principles:
• Now seeking endorsement/working on implementation
1. Importance: Data should be considered legitimate, citable products of research. Data citations should be accorded the same importance in the scholarly record as citations of other research objects, such as publications.
2. Credit and attribution: Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data, recognizing that a single style or mechanism of attribution may not be applicable to all data.
3. Evidence: Where a specific claim rests upon data, the corresponding data citation should be provided.
4. Unique identification: A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.
5. Access: Data citations should facilitate access to the data themselves and to such associated metadata, documentation, and other materials, as are necessary for both humans and machines to make informed use of the referenced data.
6. Persistence: Metadata describing the data, and unique identifiers, should persist, even beyond the lifespan of the data they describe.
7. Versioning and granularity: Data citations should facilitate identification of, and access to, different versions and/or subsets of data. Citations should include sufficient detail to verifiably link the citing work to the portion and version of data cited.
8. Interoperability and flexibility: Data citation methods should be sufficiently flexible to accommodate the variant practices among communities, but should not differ so much that they compromise interoperability of data citation practices across communities.
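Principles 4 and 7 in particular translate directly into what a rendered data citation must contain: a persistent, machine-actionable identifier plus version and subset detail. A sketch of assembling one (the dataset, authors, and DOI below are hypothetical):

```python
# Sketch: assembling a data citation that carries a persistent identifier
# (principle 4) plus version and subset information (principle 7).
# All dataset fields below are hypothetical.

def format_data_citation(authors, year, title, repository, doi, version=None, subset=None):
    """Build a citation string; version and subset are included only when given."""
    parts = [f"{authors} ({year}). {title}"]
    if version:
        parts.append(f"(Version {version})")
    if subset:
        parts.append(f"[{subset}]")
    parts.append(f"{repository}. https://doi.org/{doi}")
    return " ".join(parts)

citation = format_data_citation(
    authors="Smith, J.; Lee, K.",
    year=2014,
    title="Lunar geochemistry compilation",
    repository="Example Repository",
    doi="10.0000/example.1234",
    version="2.1",
    subset="Apollo 11 samples",
)
```

Because the DOI resolves and the version/subset are spelled out, both a human reader and an indexing machine can verifiably reach exactly the data the citing work used.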
13. 9. Use: Executable Papers
• Result of a challenge to come up with cyberinfrastructure components to enable executable papers
• Pilot in Computer Science journals:
– See all code in the paper
– Save it, export it
– Change it and rerun on the data set
14. 10: Putting it all together:
Experimental Metadata: workflows, samples, settings, reagents, organisms, etc.
Record Metadata: DOI, date, author, institute, etc.
Processed Data: mathematically/computationally processed data: correlations, plots, etc.
Raw Data: direct outputs from equipment: images, traces, spectra, etc.
Methods and Equipment: reagents, settings, manufacturer’s details, etc.
Validation: approval, reproduction, selection, quality stamp
More curation → more usable
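The layered model above can be captured as one structured record per dataset, with "more curation → more usable" measured as how many layers are actually filled in; a sketch with hypothetical field values:

```python
# Sketch: one dataset described with the layered model from the slide.
# Every value here is a hypothetical placeholder.

dataset_record = {
    "record_metadata": {"doi": "10.0000/example.5678", "date": "2014-05-07",
                        "author": "J. Smith", "institute": "Example University"},
    "experimental_metadata": {"workflow": "patch-clamp recording",
                              "organism": "mouse", "sample": "slice-042"},
    "methods_and_equipment": {"reagent": "anti-GFAP", "amplifier": "ExampleAmp 700"},
    "raw_data": ["trace_001.dat", "image_001.tif"],
    "processed_data": ["correlations.csv", "summary_plot.png"],
    "validation": {"approved_by": "reviewer-1", "quality_stamp": True},
}

def curation_level(record):
    """Count how many layers are filled in: more curation -> more usable."""
    return sum(1 for layer in record.values() if layer)
```

A record with only raw data scores low and is hard to reuse; one with every layer populated, through validation, is the fully curated end of the slide's axis.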
15. So how can we help research data be happier and more productive?
• Group therapy: Force11, W3C, other fora – shared standards help everyone (we play well with others!)
• Financial therapy: we have a lot of content & IT skills to support data-driven processes for grant proposals; funders like us.
• Creative therapy: innovative collaboration projects that expand everyone’s mind – let’s put your data through its paces
• Relationship therapy: happy to address any issues or concerns!
16. Collaborations and discussions gratefully acknowledged:
– CMU: Nathan Urban, Shreejoy Tripathy, Shawn Burton, Ed Hovy
– UCSD: Brian Schottlaender, David Minor, Declan Fleming, Ilya Zaslavsky
– NIF: Maryann Martone, Anita Bandrowski
– Force11: Ed Hovy, Tim Clark, Ivan Herman, Paul Groth, Maryann Martone, Cameron Neylon, Stephanie Hagstrom
– OHSU: Melissa Haendel, Nicole Vasilevsky
– Columbia/IEDA: Kerstin Lehnert, Leslie Hsu
– MIT: Micah Altman
Thank you!
http://researchdata.elsevier.com/
Anita de Waard
a.dewaard@elsevier.com