Data Management for Scientists: Reduce, Reuse, Recycle Your Data
1. Data Management for Scientists
Reduce your workload | Reuse your ideas | Recycle your data
Carly Strasser, PhD
California Digital Library, UC Office of the President
carly.strasser@ucop.edu
www.carlystrasser.net
www.oddee.com
2. Roadmap: (1) Who are you? (2) Chaos (3) Control (4) Toolbox
4. NSF-funded DataNet Project | Office of Cyberinfrastructure
Cyberinfrastructure | Community Engagement & Outreach
From Flickr by ThomasThomas
From Flickr by Langwitches
5. What role can libraries play in data education?
Why don’t people share data?
What barriers to sharing can we eliminate?
Is data management being taught?
Do attitudes about sharing differ among disciplines?
How can we promote storing data in repositories?
6. Roadmap: (1) Who are you? (2) Chaos (3) Control (4) Toolbox
8. Data Models | Maximum Likelihood estimation | Matrix Models | Images | Tables | Paper
10. UGLY TRUTH
Many Earth | Environmental | Ecological scientists…
• are not taught data management
• don’t know what metadata are
• can’t name data centers or repositories
• don’t share data publicly or store it in an archive
• aren’t convinced they should share data
5shortessays.blogspot.com
18. Roadmap: (1) Who are you? (2) Chaos (3) Control (4) Toolbox
20. • Unrestricted access to articles* via internet: digital, online, free of charge, free of most copyright/licensing restrictions
• Compatible with conventional scholarly literature
• Bills not paid by readers: no barriers to access
*Open access easily extends to data
21. Roadmap: (1) Who are you? (2) Chaos (3) Control (4) Toolbox
22. Best Practices for Data Management
   1. Planning
   2. Data collection & organization
   3. Quality control & assurance
   4. Metadata
   5. Workflows
   6. Data stewardship & reuse
23. 1. Planning
What is a data management plan?
A document that describes what you will do with your data during and after you complete your research
From Flickr by Ikelee
24. 1. Planning
Why should I prepare a DMP?
• Saves time
• Increases efficiency
• Easier to use data
• Others can understand & use data
• Credit for data products
• Funders protect their investment
25. 1. Planning
Components of a DMP
   1. Information about data & data format
   2. Metadata content and format
   3. Policies for access, sharing and re-use
   4. Long-term storage and data management
   5. Budget
26. 1. Planning
dmp.cdlib.org
• Step-by-step wizard for generating DMP
• Create | edit | re-use | share | save | generate
• Open to community
• Links to institutional resources
• Directorate information & updates
27. 2. Data collection & organization
Personal data management problems build up over time, & in collaboration
plumbinghelptoday.com
28. 2. Data collection & organization
Standardize
• Consistent within columns
– only numbers, dates, or text
• Consistent names, codes, formats
Modified from K. Vanderbilt
From Pink Floyd, The Wall (themurkyfringe.com)
29. 2. Data collection & organization
Standardize
• Reduce possibility of manual error by constraining entry choices
Excel lists | Data validation | Google Docs Forms
Modified from K. Vanderbilt
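The same idea of constraining entry choices can be applied when data are entered or checked with a script. The following is a minimal sketch in Python, not something shown in the slides: the controlled vocabularies, the column names (`site`, `species`, `count`), and the `validate_row` helper are all assumptions made for illustration.

```python
# Hypothetical controlled vocabularies; replace with the codes your project uses.
ALLOWED_SITES = {"nanaimo"}
ALLOWED_SPECIES = {"Eaffinis"}


def validate_row(row):
    """Return a list of problems found in one data-entry row (a dict of strings)."""
    problems = []
    if row.get("site") not in ALLOWED_SITES:
        problems.append(f"unknown site: {row.get('site')!r}")
    if row.get("species") not in ALLOWED_SPECIES:
        problems.append(f"unknown species code: {row.get('species')!r}")
    # The count column must hold only non-negative whole numbers.
    count = str(row.get("count", ""))
    if not count.isdigit():
        problems.append(f"count is not a non-negative integer: {count!r}")
    return problems


# Made-up example rows: the first is consistent, the second is not.
rows = [
    {"site": "nanaimo", "species": "Eaffinis", "count": "12"},
    {"site": "Nanaimo ", "species": "E. affinis", "count": "twelve"},
]
for number, row in enumerate(rows, start=1):
    for problem in validate_row(row):
        print(f"row {number}: {problem}")
```

Rejecting anything outside a fixed list is stricter than free-text entry, which is the point: inconsistent spellings are caught at entry time rather than during analysis.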
30. 2. Data collection & organization
Create parameter table (from doi:10.3334/ORNLDAAC/777)
Create a site table (from doi:10.3334/ORNLDAAC/777)
From R Cook, ESA Best Practices Workshop 2010
31. 2. Data collection & organization
Use descriptive file names
PhDcomics.com
32. 2. Data collection & organization
Use descriptive file names
• Unique
• Reflect contents
Bad: Mydata.xls | 2001_data.csv | best version.txt
Better: Eaffinis_nanaimo_2010_counts.xls
– Eaffinis: study organism
– nanaimo: site name
– 2010: year
– counts: what was measured
From R Cook, ESA Best Practices Workshop 2010
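A naming pattern like the "Better" example (organism, site, year, what was measured) is easy to apply consistently if a small helper builds the name. This is a sketch only; the function name `build_file_name` and its parameters are hypothetical and not part of the original slides.

```python
import re


def build_file_name(organism, site, year, measured, extension="csv"):
    """Join the descriptive parts with underscores: organism_site_year_measured.ext"""
    cleaned = []
    for part in (organism, site, str(year), measured):
        # Drop anything other than letters and digits so names stay portable.
        token = re.sub(r"[^A-Za-z0-9]+", "", part)
        if not token:
            raise ValueError(f"empty file name component: {part!r}")
        cleaned.append(token)
    return "_".join(cleaned) + "." + extension


print(build_file_name("Eaffinis", "nanaimo", 2010, "counts", "xls"))
# prints: Eaffinis_nanaimo_2010_counts.xls
```

Stripping spaces and punctuation avoids names like "best version.txt", which are awkward to handle in scripts and on some file systems.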
33. 2. Data collection & organization
Organize files logically
Biodiversity
    Lake
        Experiments
            Biodiv_H20_heatExp_2005to2008.csv
            Biodiv_H20_predatorExp_2001to2003.csv
            …
        Field work
            Biodiv_H20_PlanktonCount_2001toActive.csv
            Biodiv_H20_ChlAprofiles_2003.csv
            …
    Grassland
From S. Hampton
34. 2. Data collection & organization
Preserve information
• Keep raw data raw
• Use scripts to process data & save them with data
Raw data as .csv
R script for processing & analysis
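The slide refers to an R script; the sketch below shows the same principle in Python under stated assumptions. The raw file is only ever read, and processed output goes to a separate file. The function name `process_raw_file` and the whitespace-trimming step are illustrative choices, and the sketch assumes every row has the same number of fields as the header.

```python
import csv
from pathlib import Path


def process_raw_file(raw_path, processed_path):
    """Read the raw CSV, clean it, and write the result to a separate file.

    Returns the number of data rows written. The raw file is never modified.
    """
    raw_path, processed_path = Path(raw_path), Path(processed_path)
    if raw_path.resolve() == processed_path.resolve():
        raise ValueError("refusing to overwrite the raw data file")
    with raw_path.open(newline="") as source:
        rows = list(csv.DictReader(source))
    if not rows:
        raise ValueError("raw file contains no data rows")
    for row in rows:
        # Example processing step (an assumption): trim stray whitespace.
        for key, value in row.items():
            row[key] = (value or "").strip()
    with processed_path.open("w", newline="") as target:
        writer = csv.DictWriter(target, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Saving a script like this alongside the data records exactly how the processed file was derived from the raw one.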
35. 3. Quality control and quality assurance
• Define & enforce standards
• Double data entry
• Document changes
• No missing, impossible, or anomalous values
– Perform statistical summaries
– Use illegal data filter
– Look for outliers
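The three checks listed above (statistical summaries, an illegal data filter, and a search for outliers) could be sketched as follows. The legal range, the example values, and the 1.5 × IQR outlier rule are assumptions chosen for illustration; the slides do not specify a particular method.

```python
import statistics


def illegal_values(values, lowest, highest):
    """Return (index, value) pairs that are missing or outside the legal range."""
    return [(i, v) for i, v in enumerate(values)
            if v is None or not (lowest <= v <= highest)]


def summarize(values):
    """Basic statistical summary of a list of numbers (no missing values)."""
    return {"n": len(values), "min": min(values), "max": max(values),
            "mean": statistics.mean(values), "sd": statistics.stdev(values)}


def outliers(values, k=1.5):
    """Return (index, value) pairs lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]


# Made-up temperature readings with a missing value, a -999 flag, and a spike.
temps = [12.1, 12.4, None, 11.9, 12.3, -999, 12.0, 55.0, 12.2]
print("illegal:", illegal_values(temps, -2.0, 60.0))
legal = [v for v in temps if v is not None and -2.0 <= v <= 60.0]
print("summary:", summarize(legal))
print("outliers:", outliers(legal))
```

Flagged values should be investigated and any change documented, not silently deleted.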
36. 4. Metadata basics
What is metadata? Data reporting
• WHO created the data?
• WHAT is the content of the data set?
• WHEN was it created?
• WHERE was it collected?
• HOW was it developed?
• WHY was it developed?
37. 4. Metadata basics
• Scientific context
– Scientific reason why the data were collected
– What data were collected
– What instruments (including model & serial number) were used
– Environmental conditions during collection
– Where collected & spatial resolution
– When collected & temporal resolution
– Standards or calibrations used
• Information about parameters
– How each was measured or produced
– Units of measure
– Format used in the data set
– Precision & accuracy if known
• Information about data
– Definitions of codes used
– Quality assurance & control measures
– Known problems that limit data use (e.g. uncertainty, sampling problems)
• Digital context
– Name of the data set
– The name(s) of the data file(s) in the data set
– Date the data set was last modified
– Example data file records for each data type file
– Pertinent companion files
– List of related or ancillary data sets
– Software (including version number) used to prepare/read the data set
– Data processing that was performed
• Personnel & stakeholders
– Who collected
– Who to contact with questions
– Funders
• How to cite the data set
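As a starting point before adopting a formal standard, the who/what/when/where/how/why questions can be captured in a small machine-readable record saved next to the data. This sketch is an assumption for illustration and does not follow any particular metadata standard; the field names and placeholder values are hypothetical.

```python
import json

# Field names follow the WHO / WHAT / WHEN / WHERE / HOW / WHY questions.
REQUIRED_FIELDS = ("who", "what", "when", "where", "how", "why")


def missing_fields(record):
    """Return the required fields that are absent or blank in a metadata record."""
    return [field for field in REQUIRED_FIELDS
            if not str(record.get(field, "")).strip()]


# Placeholder values only; fill these in for a real data set.
record = {
    "who": "<name and contact details of the data creator>",
    "what": "<content of the data set, parameters, units>",
    "when": "<collection dates and temporal resolution>",
    "where": "<collection location and spatial resolution>",
    "how": "<instruments, methods, processing steps>",
    "why": "<scientific reason the data were collected>",
}

print("missing:", missing_fields(record))
print(json.dumps(record, indent=2))
```

A check like `missing_fields` can be run before archiving so that no data set leaves the lab without the basic descriptive fields.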
38. 4. Metadata basics
What is a metadata standard?
• Provides structure to describe data
– Common terms | definitions | language | structure
• Lots of different standards
– EML, FGDC, ISO 19115, Darwin Core,…
• Tools for creating metadata files
– Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSDGM)
40. 5. Workflows
Simplest workflows: commented scripts, flow charts
Temperature data, Salinity data → Data import into R → Data in R format → Quality control & data cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics → Graph production
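The flow chart uses R; as a hedged illustration of a "commented script" workflow, the same steps (import, quality control, summary statistics) might look like this in Python. The column names, plausible-range limits, and example values are all made up, and the graph production step is omitted.

```python
import csv
import io
import statistics


def import_data(handle):
    """Step 1: import. Parse each cell as a float; blank cells become None."""
    return [{name: float(cell) if cell.strip() else None
             for name, cell in row.items()}
            for row in csv.DictReader(handle)]


def clean(rows, limits):
    """Step 2: quality control. Keep rows whose values all fall inside the limits."""
    return [row for row in rows
            if all(row[name] is not None and low <= row[name] <= high
                   for name, (low, high) in limits.items())]


def summary_statistics(rows, name):
    """Step 3: analysis. Mean and sample standard deviation of one column."""
    values = [row[name] for row in rows]
    return statistics.mean(values), statistics.stdev(values)


# Made-up values standing in for real temperature and salinity files.
EXAMPLE = "temperature,salinity\n12.0,30.0\n14.0,32.0\n-999,31.0\n13.0,\n"
# Assumed plausible ranges, used only for this illustration.
LIMITS = {"temperature": (-2.0, 40.0), "salinity": (0.0, 42.0)}

rows = clean(import_data(io.StringIO(EXAMPLE)), LIMITS)
mean_t, sd_t = summary_statistics(rows, "temperature")
mean_s, sd_s = summary_statistics(rows, "salinity")
print(f"temperature: mean={mean_t:.2f} sd={sd_t:.2f}")
print(f"salinity:    mean={mean_s:.2f} sd={sd_s:.2f}")
```

Keeping each step in its own commented function mirrors the boxes of the flow chart, so someone else can follow or re-run the analysis.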
42. 5. Workflows
Workflows enable
• Reproducibility: can someone independently validate findings?
• Transparency: others can understand how you arrived at your results
• Executability: others can re-run or re-use your analysis
From Flickr by merlinprincesse
44. 6. Data stewardship & reuse
The 20-Year Rule
The metadata accompanying a data set should be written for a user 20 years into the future (National Research Council 1991)
From Flickr by greensambaman
45. 6. Data stewardship & reuse
• Use stable formats: csv, txt, tiff
• Create back-up copies: original, near, far
• Periodically test ability to restore information
Modified from R. Cook
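One simple way to test that a back-up copy can still be restored intact is to compare checksums of the original and the copy. This is a minimal sketch using SHA-256; the helper names `sha256_of` and `backup_matches` are hypothetical, and a real restore test would also confirm that the file opens in its intended software.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original, backup):
    """True if the back-up copy is byte-for-byte identical to the original."""
    return sha256_of(original) == sha256_of(backup)
```

Running `backup_matches` on the near and far copies at regular intervals gives early warning of silent corruption or incomplete transfers.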
46. 6. Data stewardship & reuse
Where do I put it?
• Institutional archive
• Discipline/specialty archive
• DataCite list of repositories: www.datacite.org/repolist
From Flickr by torkildr
47. 6. Data stewardship & reuse
Data Citation: Why everyone should do it
• Allow readers to find data products
• Get credit for data and publications
• Promote reproducibility
• Better measure of research impact
Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of morphological diversification in the absence of a detailed phylogeny: a case study from characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20
Modified from R. Cook
48. Roadmap: (1) Who are you? (2) Bad scientists (3) How to be good (4) Toolbox
49. NSF-funded DataNet Project | Office of Cyberinfrastructure
Enabling universal access to data about life on earth and the environment that sustains it
54. www.dataone.org
• Data Education Tutorials
• Primer on data management
55. www.dataone.org
• Data Education Tutorials
• Primer on data management
• Database of best practices & software tools
• List of repositories & metadata standards
• Links to DMP Tool
Investigator Toolkit
• ONE-R
• ONE-Mercury
• ONE-Drive
57. CDL Services for UC Community
• Precise identification of a dataset
• Credit to data producers and data publishers
• A link from the traditional literature to the data
• Research metrics for datasets
• Deposit content (i.e. data)
• Manage (metadata, versions etc.)
• Share
• Access
• Preserve
www.cdlib.org/services/uc3
58. • Open source add-in
• Facilitate data management, sharing, archiving for scientists
• Part of DataONE investigator toolkit
• Collecting requirements for add-in from scientists, data centers, libraries
dcxl.cdlib.org
Funders: Gordon and Betty Moore Foundation, Microsoft Research