1. Ten Habits of Highly Effective Data
Anita de Waard
VP Research Data Collaborations
a.dewaard@elsevier.com
http://researchdata.elsevier.com/
2. The Maslow Hierarchy for humans:
3. A Maslow Hierarchy for Data:
9. Usable (allow tools to run on it)
8. Citable (able to point & track citations)
7. Trusted (validated/checked by reviewers)
6. Reproducible (others can redo experiments)
5. Discoverable (can be indexed by a system)
4. Comprehensible (others can understand data & processes)
3. Accessible (can be accessed by others)
2. Archived (long-term & format-independent)
1. Preserved (existing in some form)
4. 1. Preserve: Data Rescue Challenge
• With IEDA/Lamont: award successful data
rescue attempts
• Awarded at AGU 2013
• 23 submissions of data that was digitized,
preserved, made available
• Winner: NIMBUS Data Rescue:
– Recovery, reprocessing and digitization of the
infrared and visible observations along with their
navigation and formatting.
– Over 4000 7-track tapes of global infrared
satellite data were read and reprocessed.
– Nearly 200,000 visible light images were
scanned, rectified and navigated.
– All the resultant data was converted to HDF-5
(NetCDF) format and freely distributed to users
from NASA and NSIDC servers.
– This data was then used to calculate monthly sea
ice extents for both the Arctic and the Antarctic.
• Conclusion: we (collectively) need to do more
of this! How can we fund it?
5. 2. Archive: Olive Project
• CMU CS & Library: funded by a grant
from the IMLS; Elsevier is a partner
• Goal: Preservation of executable content
- nowadays a large part of intellectual
output, and very fragile
• Identified a series of software packages
and prepared virtual machines (VMs) to preserve them
• Does it work? Yes – see video (1:24)
6. 3. Access: Urban Legend
• Part 1: Metadata acquisition
• Step through experimental process in series of dropdown
menus in simple web UI
• Can be tailored to workflow of individual researcher
• Connected to shared ontologies through lookup table,
managed centrally in lab
• Connect to data input console (Igor Pro)
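The lookup-table step above can be sketched as follows. This is a minimal illustration of mapping a lab's local dropdown values to shared ontology identifiers; the terms, the "DEMO" ontology prefix, and all IDs are invented placeholders, not the pilot lab's actual vocabulary.

```python
# Centrally managed lookup table: local lab term -> (ontology, identifier).
# All names and IDs below are hypothetical placeholders.
ONTOLOGY_LOOKUP = {
    "demo reagent": ("DEMO", "DEMO:0001"),
    "demo instrument": ("DEMO", "DEMO:0002"),
}

def annotate(term: str) -> dict:
    """Resolve a free-text dropdown value to a shared ontology ID, if known."""
    entry = ONTOLOGY_LOOKUP.get(term.strip().lower())
    if entry is None:
        # Unknown terms are flagged for the lab's central curator.
        return {"term": term, "ontology": None, "id": None}
    ontology, ident = entry
    return {"term": term, "ontology": ontology, "id": ident}
```

The point of routing every value through one centrally managed table is that the lab can change its ontology bindings in one place without touching each researcher's tailored workflow UI.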
7. 4. Comprehend: Urban Legend
• Part 2: Data Dashboard
• Access, select and manipulate data (calculate
properties, sort and plot)
• Final goal: interactive figures linked to data
• Plan to expand to more labs, other data
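The dashboard operations named above (access, select, calculate properties, sort) can be sketched in a few lines; the records, field names, and the derived "density" property are invented for illustration, not the pilot's actual data model.

```python
# Toy tabular data standing in for measurements pulled from the lab store.
records = [
    {"sample": "A", "mass_g": 2.0, "volume_ml": 4.0},
    {"sample": "B", "mass_g": 9.0, "volume_ml": 3.0},
    {"sample": "C", "mass_g": 1.0, "volume_ml": 2.0},
]

# Calculate a derived property for each record.
for r in records:
    r["density"] = r["mass_g"] / r["volume_ml"]

# Select and sort: samples above a threshold, densest first.
selected = sorted(
    (r for r in records if r["density"] > 0.4),
    key=lambda r: r["density"],
    reverse=True,
)
print([r["sample"] for r in selected])  # -> ['B', 'A', 'C']
```

An interactive figure then becomes a plot of `selected` whose points link back to the underlying records.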
8. 5. Discover: Data Discovery Index
• NIH is interested in creating a DDI consortium
• Three places where data is deposited:
1. Curated sources for a single data type (e.g. Protein
Data Bank, VentDB, Hubble Space Data)
2. Non- or semicurated sources for different data types
(e.g. DataDryad, Dataverse, Figshare)
3. Tables in papers:
• Ways to find this:
– Cross-domain query tools, e.g. NIF, DataONE, etc.
– Search for papers -> link to data
– How to find data in papers??
• Propose to build prototypes across all of these
data sources:
– Needs NLP, models of data patterns? What else?
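One query fanned out over the three source types above could look like this toy sketch. The three "indexes" are stand-in dicts, and the identifiers (the `ventdb:` key, the Figshare handle, the `10.5555` DOI prefix) are hypothetical placeholders.

```python
# Stand-in indexes for the three places data is deposited.
CURATED = {"lunar basalt geochemistry": "ventdb:1234"}        # hypothetical IDs
SEMI_CURATED = {"sea ice extent 1964-1976": "figshare:abc"}
PAPER_TABLES = {"lunar basalt geochemistry": "doi:10.5555/t1"}

def discover(query: str) -> list:
    """Return (source, identifier) pairs whose description matches the query."""
    q = query.lower()
    hits = []
    for source, index in [("curated", CURATED),
                          ("semi-curated", SEMI_CURATED),
                          ("papers", PAPER_TABLES)]:
        for description, ident in index.items():
            if q in description:
                hits.append((source, ident))
    return hits
```

The hard part the slide flags is populating `PAPER_TABLES` at all: the curated and semi-curated indexes have deposit-time metadata, while tables in papers need NLP and data-pattern models to surface.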
9. 6. Reproduce: Resource Identifier Initiative
Force11 Working Group to add data identifiers
to articles that are:
– 1) Machine readable;
– 2) Free to generate and access;
– 3) Consistent across publishers and journals.
• Authors publishing in participating journals
will be asked to provide RRIDs for their
resources; these are added to the keyword
field
• RRIDs will be drawn from:
– The Antibody Registry
– Model Organism Databases
– NIF Resource Registry
• So far, Springer, Wiley, Biomednet, Elsevier
journals have signed up with 11 journals,
more to come
• Wide community adoption!
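"Machine readable" is the whole point: RRIDs are cited in running text as `RRID:<prefix>_<number>`, so a scan of a manuscript can be a simple pattern match. The sketch below assumes that textual convention; the sample sentence is invented (`SCR_002285` is the registry entry for Fiji, and `AB_2298772` is an antibody-registry-style identifier used here for illustration).

```python
import re

# RRIDs appear inline as e.g. "RRID:AB_2298772" or "RRID:SCR_002285".
RRID_PATTERN = re.compile(r"RRID:\s*([A-Za-z]+[_:][A-Za-z0-9_-]+)")

def extract_rrids(text: str) -> list:
    """Return all resource identifiers cited in a passage of article text."""
    return RRID_PATTERN.findall(text)

sample = ("The primary antibody (RRID:AB_2298772) was imaged and "
          "analyzed in Fiji (RRID:SCR_002285).")
print(extract_rrids(sample))  # -> ['AB_2298772', 'SCR_002285']
```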
10. 7. Trust: Moonrocks
How can we scale up data curation?
Pilot project with IEDA:
• Lunar geochemistry database:
leapfrog & improve curation time
• 1-year pilot, funded by Elsevier
• If spreadsheet columns/headers
map to the RDB schema, we can scale up
the curation process and move from
tables to curated databases!
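The header-to-schema step above can be sketched as follows. The column labels, the schema field names, and the mapping table are invented for illustration, not the actual lunar geochemistry schema.

```python
# Lookup maintained by curators: spreadsheet header -> RDB schema field.
# All names below are hypothetical.
HEADER_TO_SCHEMA = {
    "sample id": "sample_id",
    "sio2 (wt%)": "sio2_wt_pct",
    "tio2 (wt%)": "tio2_wt_pct",
}

def map_row(header: list, row: list):
    """Map one spreadsheet row onto schema fields; columns with no mapping
    are returned separately for manual curation."""
    mapped, unknown = {}, {}
    for col, value in zip(header, row):
        field = HEADER_TO_SCHEMA.get(col.strip().lower())
        if field:
            mapped[field] = value
        else:
            unknown[col] = value
    return mapped, unknown
```

The scale-up comes from the split: rows whose headers all resolve flow straight into the database, and curator time is spent only on the `unknown` remainder.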
11. 8. Cite: Force11 Data Citation Principles
• Another Force11 Working group
• Defined 8 principles:
• Now seeking endorsement/working on
implementation
1. Importance: Data should be considered legitimate, citable products of
research. Data citations should be accorded the same importance in
the scholarly record as citations of other research objects, such as
publications.
2. Credit and attribution: Data citations should facilitate giving scholarly
credit and normative and legal attribution to all contributors to the
data, recognizing that a single style or mechanism of attribution may
not be applicable to all data.
3. Evidence: Where a specific claim rests upon data, the corresponding
data citation should be provided.
4. Unique Identification: A data citation should include a persistent
method for identification that is machine actionable, globally unique,
and widely used by a community.
5. Access: Data citations should facilitate access to the data themselves
and to such associated metadata, documentation, and other materials,
as are necessary for both humans and machines to make informed use
of the referenced data.
6. Persistence: Metadata describing the data, and unique identifiers
should persist, even beyond the lifespan of the data they describe.
7. Versioning and granularity: Data citations should facilitate
identification and access to different versions and/or subsets of data.
Citations should include sufficient detail to verifiably link the citing
work to the portion and version of data cited.
8. Interoperability and flexibility: Data citation methods should be
sufficiently flexible to accommodate the variant practices among
communities but should not differ so much that they compromise
interoperability of data citation practices across communities.
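A toy citation formatter illustrating principles 4-7 above: a persistent identifier, plus version and subset granularity. The citation style, field names, and the `10.5555` example DOI are illustrative assumptions, not a Force11-mandated format.

```python
def cite_dataset(creator, year, title, repository, doi,
                 version=None, subset=None):
    """Build a data citation string carrying a persistent, machine-actionable
    identifier (principle 4) plus version/subset detail (principle 7)."""
    parts = [f"{creator} ({year}). {title}.", f"{repository}."]
    if version:
        parts.append(f"Version {version}.")
    if subset:
        parts.append(f"Subset: {subset}.")
    parts.append(f"https://doi.org/{doi}")  # resolvable, supports access (5)
    return " ".join(parts)

print(cite_dataset("Smith, J.", 2014, "Lunar sample geochemistry",
                   "ExampleRepo", "10.5555/demo.1",
                   version="2.1", subset="Apollo 17 samples"))
```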
12. 9. Use: Executable Papers
• Result of a challenge to come up with
cyberinfrastructure components to
enable executable papers
• Pilot in Computer Science journals
– See all code in the paper
– Save it, export it
– Change it and rerun it on the data set
13. 10: Let’s allow our data to be happy!
[Diagram: the experimental lifecycle and the metadata attached at each stage]
• Prepare (reagents, species/specimen/cell type, preparation details) → Entity IDs
• Execute (direct settings on equipment, circumstances of measurement) → Raw Data
• Analyze (mathematical/computational processes and analytics) → Processed Data
• Experimental Metadata: objects, procedures, properties
• Record Metadata: DOI, date, author, institute, etc.
• Validation Metadata: reproduction, curation; selection, citation, usage, metrics
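The metadata groups in the lifecycle diagram can be sketched as one record type. The field groupings follow the slide; the class name and all concrete values are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    # Record metadata: DOI, date, author, institute, etc.
    record: dict = field(default_factory=dict)
    # Experimental metadata, split by lifecycle stage:
    prepare: dict = field(default_factory=dict)  # reagents, specimen, details
    execute: dict = field(default_factory=dict)  # instrument settings -> raw data
    analyze: dict = field(default_factory=dict)  # computations -> processed data
    # Validation metadata: reproduction, curation, citation, usage, metrics
    validation: dict = field(default_factory=dict)

meta = DatasetMetadata(record={"doi": "10.5555/demo", "author": "A. Researcher"})
meta.execute["laser_wavelength_nm"] = 532  # hypothetical instrument setting
```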
14. Minimize your metadata footprint!
Reuse:
• ‘The good thing about standards is that there are
so many to choose from’
• Haendel et al. looked at 54 (!) data standards:
many have been used only once or by a single group
• Employ a common element set + modular
additions over whole new schema
Recycle:
• Make sure you design upstream metadata
with downstream processes in mind
• Useful exercise: ‘buy a tag’ where
users/systems that will store/query/cite data
say what they need to do their job
• Learn from genetics: one datum can play
several different roles!
Reduce:
• Every tag needs to be added and read by
someone/thing: this adds cost and waste
• Consider ‘return on investment’ per metadata item
• Tim Berners-Lee (TBL): what if “http://” had been “h/”?
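"A common element set + modular additions" can be sketched as follows: start from a small core and layer on a domain module instead of minting a whole new schema. The element names are illustrative (the core is loosely Dublin-Core-like), and the geoscience module is a hypothetical example.

```python
# Small shared core every record carries (illustrative element set).
CORE_ELEMENTS = {"title", "creator", "date", "identifier"}

def extend_schema(core: set, module: set) -> set:
    """Add a domain module's elements to the core, rejecting name clashes
    so the shared core keeps one meaning everywhere."""
    clashes = core & module
    if clashes:
        raise ValueError(f"module redefines core elements: {sorted(clashes)}")
    return core | module

geo_module = {"latitude", "longitude", "instrument"}  # hypothetical module
schema = extend_schema(CORE_ELEMENTS, geo_module)
```

This keeps the metadata footprint small: downstream systems only ever need to understand the core plus whichever modules they actually query.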