Scientific discovery and innovation in an era of data-intensive science
William (Bill) Michener, Professor and Director of e-Science Initiatives for University Libraries, University of New Mexico; DataONE Principal Investigator
The scope and nature of biological, environmental and earth sciences research are evolving rapidly in response to environmental challenges such as global climate change, invasive species and emergent diseases. Scientific studies are increasingly focusing on long-term, broad-scale, and complex questions that require massive amounts of diverse data collected by remote sensing platforms and embedded environmental sensor networks; collaborative, interdisciplinary science teams; and new tools that promote scientific data preservation, discovery, and innovation. This talk describes the challenges facing scientists as they transition into this new era of data intensive science, presents current solutions, and lays out a roadmap to the future where new information technologies significantly increase the pace of scientific discovery and innovation.
Data Equivalence
Mark Parsons, Lead Project Manager, Senior Associate Scientist, National Snow and Ice Data Center
Data citation, especially using persistent identifiers like Digital Object Identifiers (DOIs), is an increasingly accepted scientific practice. Recently, several respected organizations have developed guidelines for data citation. The different guidelines are largely congruent in that they agree on the basic practice and elements of data citation, especially for relatively static, whole data collections. There is less agreement on the more subtle nuances of data citation that are sometimes necessary to ensure precise reference and scientific reproducibility -- the core purpose of data citation. We need to be sure that if you follow a data reference you get to the precise data that were used, or at least their scientific equivalent. Identifiers such as DOIs are necessary but not sufficient for the precise, detailed references required. This talk discusses issues around data set versioning, micro-citation, and scientific equivalence. I propose some interim solutions and suggest research strategies for the future.
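One practical safeguard in this spirit (my illustration, not a proposal from the talk) is to publish a content fingerprint alongside the identifier, so that whoever resolves a data reference can verify they received the exact bytes that were cited. A minimal sketch in Python, assuming the dataset can be presented as named parts:

```python
import hashlib

def dataset_fingerprint(named_parts):
    """Deterministically hash a dataset given as {name: bytes}.
    Sorting by part name makes the fingerprint independent of the
    order in which parts were collected, so the same data always
    yields the same fingerprint."""
    h = hashlib.sha256()
    for name in sorted(named_parts):
        h.update(name.encode("utf-8"))
        h.update(named_parts[name])
    return h.hexdigest()
```

A citation could then carry both the DOI and the fingerprint, e.g. "doi:10.xxxx/yyyy (sha256:ab12...)", letting a reader detect silent version drift behind a stable identifier.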
EZID: Easy dataset identification & management
Joan Starr, Manager, Strategic and Project Planning and EZID Service Manager, California Digital Library
Data and data curation are assuming a growing role in today’s research library. New approaches are needed both to address the resulting challenges and to take advantage of the emerging opportunities. Long-term identifiers represent one such tool. In this presentation, Joan Starr will introduce identifiers and an application designed to make them easy to create and manage: EZID. She will provide a closer look at two identifier types, DOIs and ARKs, and discuss what bringing an identifier service to your institution might mean.
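EZID exposes identifier creation over a plain HTTP API whose metadata payload uses the simple ANVL "name: value" format. The sketch below shows only the ANVL serialization step; the field names are illustrative, and a real request additionally needs an EZID account and an HTTP PUT/POST to the service:

```python
def to_anvl(metadata):
    """Serialize a metadata dict into ANVL, the one-line-per-element
    'name: value' format accepted by EZID's HTTP API. Percent signs,
    newlines, and carriage returns inside names or values are
    percent-encoded so each element stays on a single line."""
    def esc(s):
        return s.replace("%", "%25").replace("\n", "%0A").replace("\r", "%0D")
    return "\n".join(f"{esc(k)}: {esc(v)}" for k, v in metadata.items())
```

The resulting string would be sent as the request body when minting or updating an identifier (for example, a PUT to https://ezid.cdlib.org/id/<identifier> with appropriate credentials).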
Opening Keynote: The Many and the One: BCE themes in 21st century data curation
Allen Renear, Professor and Interim Dean, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Two scientists can be using "the same data" even though the computer files involved appear to be quite different. This is familiar enough, and for the most part, in small communities with shared practices and familiar datasets, raises few problems. But these informal understandings do not scale to 21st century data curation. To get full value from cyberinfrastructure we must support huge quantities of heterogeneous data developed by diverse communities and used by diverse communities -- often with widely varying methods, tools, and purposes. To accomplish this our informal practices and understandings must be replaced, or at least supplemented, by a shared framework of standard terminology for describing complex cascades of representational levels and relationships. Fundamental problems in data curation -- and in particular problems involving provenance, identifiers, and data citation -- cannot be fully resolved without such a framework. Although the deepest problems here have ancient origins, useful practical measures are now within reach. Some recent work toward this end that is being carried out at the Center for Informatics Research in Science and Scholarship (CIRSS) at the Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign will be described.
This presentation sets out some of the challenges around citing and identifying datasets and introduces DataCite, the international data citation initiative. DataCite was founded on 1 December 2009 to support researchers by providing methods for them to locate, identify, and cite research datasets with confidence.
This presentation was given by Adam Farquhar at the STM Publishers Association Innovation Conference on 4 December 2009.
University of Bath Research Data Management training for researchers
Jez Cope
Slides from a workshop on Research Data Management for research staff and students at the University of Bath.
Part of the Research360 project (http://blogs.bath.ac.uk/research360).
Authors: Cathy Pink and Jez Cope, University of Bath
Whitepaper: CHI: Hadoop's Rise in Life Sciences
EMC
Genomics' large, semi-structured, file-based data is ideally suited for the Hadoop Distributed File System (HDFS). The EMC Isilon OneFS file system features connectivity to HDFS that makes the Hadoop storage scale-out and truly distributed. An example from the "CrossBow" project is explored.
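The fit between line-oriented genomics formats and MapReduce can be seen in miniature below: a FASTA file splits cleanly into independent records, so a map step and a reduce step parallelize naturally. The example (counting reads by their first three bases) is purely illustrative and is not CrossBow's actual alignment pipeline:

```python
from collections import defaultdict

def map_reads(fasta_lines):
    """Map step: for each sequence line, emit (first_3_bases, 1).
    FASTA is line-oriented, which is what lets HDFS split it across
    blocks and run many mappers in parallel."""
    for line in fasta_lines:
        line = line.strip()
        if line and not line.startswith(">"):
            yield line[:3].upper(), 1

def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per key."""
    totals = defaultdict(int)
    for key, n in pairs:
        totals[key] += n
    return dict(totals)
```

With Hadoop Streaming, the same two functions would run as separate mapper and reducer processes over HDFS splits; here they compose directly for a local run.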
White Paper: Life Sciences at RENCI, Big Data IT to Manage, Decipher and Info...
EMC
This white paper explains how the Renaissance Computing Institute (RENCI) of the University of North Carolina uses EMC Isilon scale-out NAS storage, Intel processor and system technology, and iRODS-based data management to tackle Big Data processing, Hadoop-based analytics, security and privacy challenges in research and clinical genomics.
A basic course on Research data management: part 1 - part 4
Leon Osinski
Slides belonging to a basic course on research data management. The course consists of 4 parts:
Part 1: what and why
1.1 data management plans
Part 2: protecting and organizing your data
2.1 data safety and data security
2.2 file naming, organizing data (TIER documentation protocol)
Part 3: sharing your data
3.1 via collaboration platforms (during research)
3.2 via data archives (after your research)
Part 4: caring for your data, or making data usable
4.1 tidy data
4.2 documentation/metadata
4.3 licenses
4.4 open data formats
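To make the "tidy data" idea in Part 4 concrete, here is a small sketch (my illustration, not from the course materials) that reshapes wide records, one column per measurement, into the tidy long form of one row per observation:

```python
def tidy(wide_rows, id_col):
    """Reshape wide records (one column per measured variable)
    into tidy long form: one row per (id, variable, value).
    `wide_rows` is a list of dicts sharing an identifier column."""
    long_rows = []
    for row in wide_rows:
        for key, value in row.items():
            if key != id_col:
                long_rows.append({id_col: row[id_col],
                                  "variable": key,
                                  "value": value})
    return long_rows
```

Long form like this is what generic tools (plotting, filtering, joining) handle uniformly, which is the practical payoff of tidying before archiving.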
The Brain Imaging Data Structure and its use for fNIRS
Robert Oostenveld
These slides were prepared for the NIRS toolkit course at the Donders, which was postponed due to the Corona crisis. The slides present BIDS, explain how fNIRS often involves multiple signals, and relate the two to synchronization and data management.
Where is the opportunity for libraries in the collaborative data infrastructure?
LIBER Europe
Presentation by Susan Reilly at Bibsys2013 on the opportunities for libraries and their role in the collaborative data infrastructure. Looks at data sharing, authentication, preservation and advocacy.
Keynote presented to a KE workshop held in conjunction with the release of the report "A Surfboard for Riding the Wave: Towards a four country action programme on research data": http://www.knowledge-exchange.info/Default.aspx?ID=469
This is a presentation for the Erwin Hahn Institute in Essen, explaining the background, functional design and technical architecture of the Donders Repository. Furthermore, it explains how the repository aligns with DCCN project management and with the researchers' workflow.
This presentation was delivered at the Elsevier Library Connect Seminar on 6 October 2014 in Johannesburg, 7 October 2014 in Durban and 9 October 2014 in Cape Town and gives an overview of the potential role that librarians can play in research data management
A Step to the Clouded Solution of Scalable Clinical Genome Sequencing (BDT308...)
Amazon Web Services
Professors Wall and Tonellato of Harvard Medical School in collaboration with Beth Israel Deaconess Medical Center discuss the emerging area of clinical whole genome sequencing analysis and tools. They report on the use of Amazon EC2 and Spot Instances to achieve a robust clinical time processing solution and examine the barriers to and resolution of producing clinical-grade whole genome results in the cloud. They benchmark an AWS solution, called COSMOS, against local computing solutions and demonstrate the time and capacity gains conferred through the use of AWS.
http://www.escience2009.org/ Web Semantics in Action: Web 3.0 in e-Science. 11:50–12:15, Annamaria Carusi & Anita de Waard: Changing Modes of Scientific Discourse Analysis, Changing Perceptions of Science
Overview of scientific discourse annotation
Anita de Waard
Presentation held at the second Amicus workshop, http://amicus.uvt.nl/amicus_ws2011.htm: "Storytelling in Fairytales and Science: Narrative structure models of scientific communication and folktales"
Have you ever asked yourself how long it takes to drive from home to the office? You might be able to give an approximate time for your trip, but more likely you'll first ask yourself: which way, what day of the week, and what modes of transportation do I take? The answer depends on the conditions.
You might not realize that you have just carried out an assessment activity, simply by asking a question or contemplating the duration of your trip from home to your office.
Designing Sideways: integrating emergence with authorship
Adam Russell
This talk examines the tension between bottom-up or systemic game design and more traditional top-down scripting of unique narrative experiences. Market and design trends have been pushing triple-A games towards a combination of these approaches for some years now. However, many designers still see bottom-up emergence as a magic bullet, and vainly hope to integrate this with heavily scripted sequences without considering the deep implications of trying to do so. In the second half of the talk we will explore game design approaches that are neither bottom-up nor top-down, but both at the same time, which we call 'sideways design'.
Slides describing Force11 work and the background of several of the speakers, used for talks at the University of Lethbridge, Carnegie Mellon, and internally at Elsevier.
Enabling your Human Resource Information System to support HR Strategic Roles
OPUS Management
Human Resource Management (HRM) has shifted its function within organisations over the last few years. Its function has grown considerably and has shifted into a more strategic role rather than providing support for administrative paperwork. There has been a shift in terminology too, with the term Strategic Human Resource Management (SHRM) becoming more common.
Scott Edmunds' talk on GigaScience, big data, data citation and future data handling at the International Conference of Genomics, 15 November 2011.
Data-knowledge transition zones within the biomedical research ecosystem
Maryann Martone
Overview of the Neuroscience Information Framework and how it brings together data, in the form of distributed databases, and knowledge, in the form of ontologies to show the mapping of the dataspace and places where there are mismatches between data and knowledge.
No Free Lunch: Metadata in the life sciences
Chris Dwan
This presentation covers some challenges and makes suggestions to support the work of creating flexible, interoperable data systems for the life sciences.
How best to manage your data to make the most of it for your research, with the ODAM framework (Open Data for Access and Mining): give open access to your data and make them ready to be mined.
Talk at the World Science Festival at Columbia, June 2, 2017: session on Big Data and Physics: http://www.worldsciencefestival.com/programs/big-data-future-physics/
Data Repositories: Recommendation, Certification and Models for Cost Recovery
Anita de Waard
Talk at NITRD Workshop "Measuring the Impact of Digital Repositories" February 28 – March 1, 2017 https://www.nitrd.gov/nitrdgroups/index.php?title=DigitalRepositories
The Narrative Structure of Research Articles, or, Why Science is Like a Fairy...
Whither Small Data?
1. Whither Small Data?
Some Thoughts on Managing
Research Data
February 26, 2013
Anita de Waard
VP Research Data Collaborations, Elsevier RDS
a.dewaard@elsevier.com
2. Why should data be saved?
A. Hold scientists accountable: Data Preservation
– Preserve record of scientific process, provenance
– Enable reproducible research
B. Do better science: Data Use
– Use results obtained by others!
– Improve interdisciplinary work
C. Enable long-term access: Sustainable Models
– Use for technology transfer; societal/industrial development
– Reward scientists for data creation (credit/attribution)
– Allow public/others insight/use of results
3. [image-only slide]
4. Where The Data Goes Now:
[Figure] Context: >50 My papers; 2 M scientists; 2 My papers/year.
– The majority of data (90%?) is stored on local hard drives
– Some data (8%?) is stored in large, generic data repositories (Dryad: 7,631 files; Dataverse: 0.6 My; Datacite: 1.5 My)
– A small portion of data (1-2%?) is stored in small, topic-focused data repositories (PDB: 88.3 k; PetDB: 1.5 k; SedDB: 0.6 k; MiRB: 25 k; TAIR: 72.1 k)
5. Key Needs:
[Same figure as slide 4, annotated with two goals: INCREASE DATA PRESERVATION and DEVELOP SUSTAINABLE MODELS]
6. A. Data Preservation:
• Issues:
– Currently data is often used by single researchers or small groups: many different, idiosyncratic formats
– Often not in electronic form (maps, images)
– No metadata: when, where, by whom, WHY was this data collected?
• Needs:
– Tools to make data export/storage simple and unavoidable
– Policies that make data sharing mandatory and simple
– Systems that reward data sharing/digitisation
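A tool of the kind the "Needs" bullets describe can start very small: for example, a script that writes a when/where/by-whom/why record next to every raw data file. A hypothetical sketch (names and fields are my own, not from the slides):

```python
import json
from datetime import datetime, timezone

def sidecar(data_file, collector, location, purpose):
    """Build a minimal metadata sidecar answering the slide's
    missing questions (when, where, by whom, why) as a JSON
    string to store alongside the raw data file."""
    return json.dumps({
        "file": data_file,
        "collected_by": collector,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "location": location,
        "purpose": purpose,
    }, indent=2)
```

Making such a record a mandatory step of data export (rather than an afterthought) is one concrete reading of "simple and unavoidable".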
7. B. Data Use:
• Issues:
– In generic data repositories, data cannot be used because of inadequate metadata, lack of quality review, lack of provenance
– It’s expensive to make data usable!
– Domain-specific data stores are not cross-searchable across discipline/national borders
• Needs:
– Standardised metadata systems across systems/repositories, and tools to apply them easily
– Integration layers to enable cross-repository queries
– A funding model to enable long-term preservation
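An integration layer of the kind listed under "Needs" essentially does two things: map each repository's local field names onto a shared schema, then query across the normalized records. A toy sketch with mock repositories (all names and mappings hypothetical):

```python
def normalize(record, mapping):
    """Map one repository's local field names onto a shared schema.
    `mapping` is {standard_field: local_field}."""
    return {std: record.get(local) for std, local in mapping.items()}

def federated_search(repos, keyword):
    """Query several repositories through one integration layer.
    Each repo is (records, field_mapping); matches come back in
    the shared schema, so callers never see local field names."""
    hits = []
    for records, mapping in repos:
        for rec in records:
            std = normalize(rec, mapping)
            if keyword.lower() in (std.get("title") or "").lower():
                hits.append(std)
    return hits
```

Real integration layers add protocol adapters and pagination, but the schema-mapping core is the part that standardised metadata would make trivial.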
8. C. Sustainable Models:
• Issues:
– Many successful domain-specific data repositories are running out of funding
– Is adding metadata something you want to keep paying PhD+ scientists to do?
– Unclear who foots the bill: the researcher? The institute? The grant agency? For how long?
• Needs:
– Attribution models for rewarding scientists
– Policies to improve cross-domain and cross-national collaborations
– Funding models to sustain databases long-term
9. Linking papers to research data:
Database               | Object linked   | Displayed
Pangaea                | Location        | Google Maps map with location
Protein Databank (PDB) | Protein         | 3D protein visualisation
Genbank                | Gene name       | NCBI Gene Viewer
Exoplanets +           | Exoplanet name  | Rich information on extrasolar planets
Species +              | Species name    | Rich information on species
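The enrichment pattern behind this slide, recognizing an identifier in the text and linking it to a domain viewer, can be sketched as a simple lookup. The URL patterns below are illustrative placeholders, not the production resolvers:

```python
# Illustrative viewer URL patterns; real resolvers differ per database.
VIEWERS = {
    "pdb":  "https://www.rcsb.org/structure/{id}",
    "gene": "https://www.ncbi.nlm.nih.gov/gene/?term={id}",
}

def link_entity(kind, identifier):
    """Turn an in-text identifier into a link to the matching
    domain viewer; return None when no viewer is registered."""
    pattern = VIEWERS.get(kind)
    return pattern.format(id=identifier) if pattern else None
```

In a production pipeline the `kind` would come from an entity recognizer run over the article text rather than being passed in by hand.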
10. Towards ‘wrapping papers around data’
1. Store metadata on all materials
2. Track the methods while doing them
3. Write papers that ‘wrap around’ this metadata
4. Don’t ‘send’ your papers – just expose them to the outside world
5. Invite reviews; open data to trusted parties, at trusted time
6. Allow apps/tools to integrate
[Figure: a mock paper fragment (“Rats were subjected to two grueling tests (click on fig 2 to see underlying data). These results suggest that the neurological pain pro-…”) surrounded by metadata blocks and workflow labels: Calculate, coordinate…; Review; Revise; Compile, comment, compare…; Edit]
11. Research Data Services:
A. Increase Data Preservation:
Help increase the amount and quality of data preserved and shared
B. Improve Data Use:
Help increase the value and usability of the data shared by increasing annotation, normalization and provenance, enabling enhanced interoperability
C. Develop Sustainable Models:
Help measure and deliver credit for shared data to the researchers, the institute, and the funding body, enabling more sustainable platforms.
12. Guiding Principles of RDS:
• In principle, all open data stays open, and URLs, front end etc. stay where they are (i.e. with the repository)
• Collaboration is tailored to data repositories’ unique needs/interests (‘service-model’ type):
– Aspects where collaboration is needed are discussed
– A collaboration plan is drawn up using a Service-Level Agreement: agree on time, conditions, etc.
• Transparent business model
• Very small (2-3 people) department; immediate communication; instant deployment of ideas
13. Three pilots:
1. Carnegie Mellon Electrophysiology Lab:
A. Data Input: Develop a suite of tools to enable simple data capturing on a handheld device, add metadata during the experiment, store it with raw traces, and create a dashboard for viewing
B. Data Use: Integrate with NIF and eagle-I ontologies, enable access through NIF; combine with other sources
2. ImageVault, with Duke CIVM:
A. Data Input: Get 3D image data into a common format and resolution, annotated to allow comparison
B. Data Use: View other image data sets & do image analytics
C. Sustainable Models: Create funding for 3D image sets: free layer for raw data/subscription analytics.
14. 3. IEDA Data Rescue Process Study
Data Rescue:
– Identify 3-5 data sets that need to be ‘rescued’
– Work with investigators to identify data sources, formats
– Work with IEDA to define metadata standards, quality checks etc.
Data Rescue Process:
– A group of data wranglers performs ‘electrification’ and annotation
– (Open source) software is developed where needed, to help this process
– We help develop common standards, if needed
15. 3. IEDA Data Rescue Process Study
Data Rescue Process Study:
Jointly publish a report on a ‘gap analysis’ comparing where we are now with where we need to be, including:
– What we did (data imported; processes/standards created/described; software built; user tests, outcomes)
– Effort involved (time, software, equipment, skills, etc.)
– How easy it would be to scale up; what part of the data out there could be handled this way
– Recommendations for the tools and skills that are needed if we want to scale up this process
16. Summary:
• Three key issues:
A. Data Preservation
B. Data Use
C. Sustainable Models
• Elsevier’s approach:
– Linking data to papers
– Wrap papers around data
– Explore role in the research data space
• Elsevier RDS:
– Three pilots (CMU, Duke, IEDA) to investigate issues
– We’ll report back in about a year!
17. Questions?
Anita de Waard
VP Research Data Collaborations, Elsevier
a.dewaard@elsevier.com
Editor's Notes
Are current modes of publication and excessive reliance on essentially only one medium (articles and books) serving scholarship or limiting it?