1. Data
Management
for
Scientists
Reduce
your
workload
Reuse
your
ideas
Recycle
your
data
www.oddee.com
Carly
Strasser,
PhD
UC
Riverside
California
Digital
Library,
UC
Office
of
the
President
February
2012
carly.strasser@ucop.edu
www.carlystrasser.net
2. Roadmap
4. Toolbox
3. How
to
improve
2. Mistakes
we
make
1. Background
3. What role can libraries play in data education?
What barriers to sharing can we eliminate?
Why don’t people share data?
Is data management being taught?
Do attitudes about sharing differ among disciplines?
How can we promote storing data in repositories?
5. Roadmap
4. Toolbox
3. How
to
improve
2. Mistakes
we
make
1. Background
6. From
Flickr
by
DW0825
From
Flickr
by
Flickmor
From
Flickr
by
deltaMike
Digital
data
www.woodrow.org
C.
Strasser
Courtesy
of
WHOI
From
Flickr
by
US
Army
Environmental
Command
8. Data
Models
Maximum
Likelihood
estimation
Matrix
Models
Images
Tables
Paper
9. UGLY TRUTH
Many Earth | Environmental | Ecological scientists…
• are not taught data management
• don’t know what metadata are
• can’t name data centers or repositories
• don’t share data publicly or store it in an archive
• aren’t convinced they should share data
5shortessays.blogspot.com
14. Data
Hangover
What
happened?
From
Flickr
by
SteveMcN
15. Where
data
end
up
From
Flickr
by
diylibrarian
www
blog.order2disorder.com
From
Flickr
by
csessums
Data
Metadata
From
Flickr
by
csessums
Recreated
from
Klump
et
al.
2006
16. Who
cares?
From
Flickr
by
Redden-McAllister
From
Flickr
by
AJC1
www.rba.gov.au
17. Where
data
end
up
From
Flickr
by
diylibrarian
www
Data
www
Metadata
From
Flickr
by
torkildr
Recreated
from
Klump
et
al.
2006
20. Trends
in
Data
Archiving
Journal
publishers
Joint
Data
Archiving
Agreement
Data
Papers
etc.
Ecological
Archives,
Beyond
the
PDF
Funders
Data
management
requirements
21. Roadmap
4. Toolbox
3. How
to
improve
2. Mistakes
we
make
1. Background
22. Best
Practices
for
Data
Management
1. Planning
2. Data
collection
&
organization
3. Quality
control
&
assurance
4. Metadata
5. Workflows
6. Data
stewardship
&
reuse
23. 2.
Data
collection
&
organization
Create
unique
identifiers
• Decide
on
naming
scheme
early
• Create
a
key
• Different
for
each
sample
From
Flickr
by
zebbie
From
Flickr
by
sjbresnahan
24. 2.
Data
collection
&
organization
Standardize
• Consistent
within
columns
– only
numbers,
dates,
or
text
• Consistent
names,
codes,
formats
Modified
from
K.
Vanderbilt
From
Pink
Floyd,
The
Wall
themurkyfringe.com
25. 2.
Data
collection
&
organization
Standardize
• Reduce
possibility
of
manual
error
by
constraining
entry
choices
Excel lists
Data validation
Google Docs Forms
Modified
from
K.
Vanderbilt
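The idea of constraining entry choices and keeping each column consistent can also be checked after the fact with a short script. The following is a minimal sketch, not a tool named in the slides; the column names `site` and `count` and the list of allowed site codes are assumptions made up for illustration.

```python
import csv
import io

# Hypothetical controlled vocabulary for a "site" column (illustrative only).
ALLOWED_SITES = {"nanaimo", "ladysmith"}

def check_rows(rows):
    """Return (row_number, message) pairs for values that break the column rules."""
    problems = []
    for n, row in enumerate(rows, start=2):  # row 1 is the header
        if row["site"] not in ALLOWED_SITES:
            problems.append((n, f"unknown site code: {row['site']!r}"))
        try:
            float(row["count"])  # the count column should hold only numbers
        except ValueError:
            problems.append((n, f"count is not a number: {row['count']!r}"))
    return problems

# Small in-memory example so the sketch runs without any external file.
sample = io.StringIO("site,count\nnanaimo,12\nNanaimo ,7\nladysmith,n/a\n")
problems = check_rows(csv.DictReader(sample))
for line_no, message in problems:
    print(line_no, message)
```

A check like this catches inconsistent capitalization, stray spaces, and text typed into a numeric column, which are the kinds of manual errors that drop-down lists are meant to prevent at entry time.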
26. 2.
Data
collection
&
organization
Create
parameter
table
Create
a
site
table
From
doi:10.3334/ORNLDAAC/777
From
doi:10.3334/ORNLDAAC/777
From
R
Cook,
ESA
Best
Practices
Workshop
2010
27. 2.
Data
collection
&
organization
Use
descriptive
file
names
PhDcomics.com
28. 2.
Data
collection
&
organization
Use descriptive file names*
• Unique
• Reflect contents
Bad: Mydata.xls, 2001_data.csv, best version.txt
Better: Eaffinis_nanaimo_2010_counts.xls (study organism, site name, year, what was measured)
*Not for everyone
From
R
Cook,
ESA
Best
Practices
Workshop
2010
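One way to keep file names unique and descriptive is to build them from the same components every time. This is a small sketch under that assumption; the function name `build_file_name` and its parameters are hypothetical, and the component order simply mirrors the "Better" example on the slide.

```python
import re

def build_file_name(organism, site, year, measured, extension="csv"):
    """Join the descriptive components with underscores, dropping spaces and punctuation."""
    parts = [organism, site, str(year), measured]
    cleaned = [re.sub(r"[^A-Za-z0-9]+", "", part) for part in parts]
    if not all(cleaned):
        raise ValueError("every name component must contain letters or digits")
    return "_".join(cleaned) + "." + extension

name = build_file_name("Eaffinis", "nanaimo", 2010, "counts", extension="xls")
print(name)
```

Generating names from a function rather than typing them makes it harder for a "best version.txt" to slip into the project folder.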
29. 2.
Data
collection
&
organization
Organize
files
logically
Biodiversity
Lake
Experiments
Biodiv_H20_heatExp_2005to2008.csv
Biodiv_H20_predatorExp_2001to2003.csv
…
Field
work
Biodiv_H20_PlanktonCount_2001toActive.csv
Biodiv_H20_ChlAprofiles_2003.csv
…
Grassland
From
S.
Hampton
30. 2.
Data
collection
&
organization
Preserve
information
R
script
for
processing
&
analysis
• Keep
raw
data
raw
• Use
scripts
to
process
data
&
save
them
with
data
Raw
data
as
.csv
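The slide refers to an R script; the sketch below shows the same "keep raw data raw" pattern in Python instead, as a swapped-in illustration. The file names, the `count` column, and the cleaning rule (drop rows with an empty count) are all assumptions for the example. The point is only that the script reads the raw file and writes a separate processed file, so the raw file is never edited.

```python
import csv
import tempfile
from pathlib import Path

def process(raw_path, out_path):
    """Read the raw file, drop rows with an empty 'count', and write a new file."""
    kept = 0
    with open(raw_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row["count"].strip():
                writer.writerow(row)
                kept += 1
    return kept

# Self-contained demo in a temporary folder (hypothetical file names).
workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw_counts.csv"
raw.write_text("sample_id,count\nA1,12\nA2,\nA3,7\n")
raw_before = raw.read_text()

kept = process(raw, workdir / "clean_counts.csv")
print(kept, "rows kept; raw file unchanged:", raw.read_text() == raw_before)
```

Saving a script like this next to the data records exactly how the processed file was derived from the raw one.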
31. 2.
Data
collection
&
oAll
of
the
things
that
rganization
make
Excel
great
for
data
organization
are
bad
for
archiving!
What
to
do?
1. Create
archive-‐ready
raw
data
2. Put
it
somewhere
special
3. Have
your
fun
with
fancy
Excel
techniques
4. Keep
archiving
in
mind
32. 3.
Quality
control
and
quality
assurance
Define
&
enforce
standards
Double
data
entry
Document
changes
Minimize
manual
data
entry
No
missing,
impossible,
or
anomalous
values
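A simple range check can flag missing and impossible values before analysis. The following is a sketch only; the variable, the example readings, and the plausible range are assumptions chosen for illustration, and real limits depend on the instrument and the study system.

```python
import math

def flag_values(values, lower, upper):
    """Return (index, reason) pairs for missing or out-of-range values."""
    flags = []
    for i, v in enumerate(values):
        if v is None or (isinstance(v, float) and math.isnan(v)):
            flags.append((i, "missing"))
        elif not lower <= v <= upper:
            flags.append((i, "out of range"))
    return flags

# Hypothetical water temperatures in degrees Celsius.
temps_c = [12.1, None, 13.4, 99.0, -50.0, 12.8]
flags = flag_values(temps_c, lower=-2.0, upper=40.0)
for index, reason in flags:
    print(index, reason)
```

Flagged values should be documented rather than silently deleted, in keeping with the "document changes" point on the slide.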
34. 4.
Metadata
basics
Metadata
=
Data
reporting
WHO
created
the
data?
WHAT
is
the
content
of
the
data
set?
WHEN
was
it
created?
WHERE
was
it
collected?
HOW
was
it
developed?
WHY
was
it
developed?
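The six questions above can be treated as a minimal checklist for any metadata record. This sketch stores them in a plain dictionary and reports which ones are still unanswered; the field names follow the slide, while the values are placeholders and the JSON layout is an assumption, not a metadata standard.

```python
import json

REQUIRED = ("who", "what", "when", "where", "how", "why")

def missing_fields(record):
    """List the checklist questions that have no answer yet."""
    return [key for key in REQUIRED if not str(record.get(key, "")).strip()]

# Placeholder values only; a real record would describe an actual data set.
record = {
    "who": "A. Researcher (placeholder)",
    "what": "Plankton counts (placeholder)",
    "when": "2010-06-01/2010-08-31",
    "where": "Placeholder site name",
    "how": "",
    "why": "Placeholder study purpose",
}

missing = missing_fields(record)
print(json.dumps(record, indent=2))
print("still missing:", missing)
```

For data destined for a repository, a formal standard and its tools would replace this ad hoc record.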
35. 4. Metadata basics
• Scientific context
– Scientific reason why the data were collected
– What data were collected
– What instruments (including model & serial number) were used
– Environmental conditions during collection
– Where collected & spatial resolution
– When collected & temporal resolution
– Standards or calibrations used
• Information about parameters
– How each was measured or produced
– Units of measure
– Format used in the data set
– Precision & accuracy if known
• Information about data
– Definitions of codes used
– Quality assurance & control measures
– Known problems that limit data use (e.g. uncertainty, sampling problems)
• How to cite the data set
• Digital context
– Name of the data set
– The name(s) of the data file(s) in the data set
– Date the data set was last modified
– Example data file records for each data type file
– Pertinent companion files
– List of related or ancillary data sets
– Software (including version number) used to prepare/read the data set
– Data processing that was performed
• Personnel & stakeholders
– Who collected
– Who to contact with questions
– Funders
36. 4. Metadata basics
What is metadata?
What is a metadata standard?
Select the appropriate metadata standard
• Provides structure to describe data
Common terms | definitions | language | structure
• Lots of different standards
EML, FGDC, ISO19115, DarwinCore,…
• Tools for creating metadata files
Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSDGM)
38. 5.
Workflows
Simplest workflows: commented scripts, flow charts
Temperature data
Salinity data
Data import into R
Data in R format
Quality control & data cleaning
“Clean” T & S data
Analysis: mean, SD
Summary statistics
Graph production
40. 5.
Workflows
Workflows
enable
From
Flickr
by
merlinprincesse
Reproducibility
can
someone
independently
validate
findings?
Transparency
others
can
understand
how
you
arrived
at
your
results
Executability
others
can
re-run
or
re-use
your
analysis
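A commented script is the simplest kind of workflow: each step is a named function, so someone else can re-run it and see how the results were produced. The sketch below follows the import, quality control, and summary-statistics steps of a temperature and salinity flow chart, written in Python rather than R. The column names, the plausible ranges, and the example readings are assumptions, and graph production is left out to keep it short.

```python
import csv
import io
import statistics

def import_data(text):
    """Step 1: read CSV text into a list of numeric records."""
    return [
        {"temp_c": float(r["temp_c"]), "salinity": float(r["salinity"])}
        for r in csv.DictReader(io.StringIO(text))
    ]

def quality_control(rows):
    """Step 2: keep only rows inside (assumed) plausible ranges."""
    return [
        r for r in rows
        if -2.0 <= r["temp_c"] <= 40.0 and 0.0 <= r["salinity"] <= 42.0
    ]

def summarize(rows, column):
    """Step 3: mean and sample standard deviation for one column."""
    values = [r[column] for r in rows]
    return statistics.mean(values), statistics.stdev(values)

# Hypothetical readings; the last row has an impossible temperature.
raw_text = "temp_c,salinity\n10.0,30.0\n12.0,32.0\n14.0,34.0\n99.0,31.0\n"
clean = quality_control(import_data(raw_text))
mean_t, sd_t = summarize(clean, "temp_c")
print(len(clean), mean_t, sd_t)
```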
41. 6.
Data
stewardship
&
reuse
From
Flickr
by
greensambaman
The 20-Year Rule
The metadata accompanying a data set should be written for a user 20 years into the future (National Research Council 1991)
42. 6.
Data
stewardship
&
reuse
Use stable formats: csv, txt, tiff
Create back-up copies: original, near, far
Periodically test ability to restore information
Modified from R. Cook
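Testing the ability to restore can be as simple as comparing a checksum of the original file with a checksum of its backup copy. This is a minimal sketch using only the Python standard library; the folder name `near_backup` and the file name are hypothetical, and a real "far" copy would sit on a different machine or service.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    """Checksum of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def backup_and_verify(source, backup_dir):
    """Copy the file into backup_dir and confirm the copy matches the original."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    copy_path = backup_dir / Path(source).name
    shutil.copy2(source, copy_path)
    return sha256_of(source) == sha256_of(copy_path)

# Self-contained demo in a temporary folder.
workdir = Path(tempfile.mkdtemp())
original = workdir / "counts.csv"
original.write_text("sample_id,count\nA1,12\n")
ok = backup_and_verify(original, workdir / "near_backup")
print("backup verified:", ok)
```

Re-running the checksum comparison on a schedule covers the "periodically" part of the advice.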
43. 6.
Data
stewardship
&
reuse
Where
do
I
put
my
data?
Institutional archive
Discipline/specialty archive
DataCite list of repositories: www.datacite.org/repolist
From
Flickr
by
torkildr
44. 6.
Data
stewardship
&
reuse
Data
Citation:
Why
everyone
should
do
it
Allow
readers
to
find
data
products
Get
credit
for
data
and
publications
Promote
reproducibility
Better
measure
of
research
impact
Example:
Sidlauskas,
B.
2007.
Data
from:
Testing
for
unequal
rates
of
morphological
diversification
in
the
absence
of
a
detailed
phylogeny:
a
case
study
from
characiform
fishes.
Dryad
Digital
Repository.
doi:10.5061/dryad.20
Learn
more
at
www.datacite.org
Modified from R. Cook
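The example citation above has a regular shape: author, year, title, repository, and DOI. A small helper can assemble that shape consistently. This sketch is an assumption about layout based only on the example on the slide; it is not an official DataCite or Dryad format, and the function name is hypothetical.

```python
def format_data_citation(author, year, title, repository, doi):
    """Assemble a data citation in the same order as the slide's example."""
    return f"{author} {year}. {title}. {repository}. doi:{doi}"

citation = format_data_citation(
    author="Sidlauskas, B.",
    year=2007,
    title=(
        "Data from: Testing for unequal rates of morphological diversification "
        "in the absence of a detailed phylogeny: a case study from characiform fishes"
    ),
    repository="Dryad Digital Repository",
    doi="10.5061/dryad.20",
)
print(citation)
```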
45. Best
Practices
for
Data
Management
1. Planning
2. Data
collection
&
organization
3. Quality
control
&
assurance
4. Metadata
5. Workflows
6. Data
stewardship
&
reuse
7. Planning
46. 1.
Planning
What
is
a
data
management
plan?
A
document
that
describes
what
you
will
do
with
your
data
during
your
research
and
after
you
complete
your
research
Data
Hangover
47. 1.
Planning
Why
should
I
prepare
a
DMP?
Saves
time
Increases
efficiency
Easier
to
use
data
Others
can
understand
&
use
data
Credit
for
data
products
Funders
require
it
48. NSF DMP Requirements
From Grant Proposal Guidelines:
DMP supplement may include:
1. the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project
2. the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies)
3. policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements
4. policies and provisions for re-use, re-distribution, and the production of derivatives
5. plans for archiving data, samples, and other research products, and for preservation of access to them
49. 1. Types
of
data
&
other
information
• Types
of
data
produced
• Relationship
to
existing
data
• How/when/where
will
the
data
be
captured
or
created?
C.
Strasser
• How
will
the
data
be
processed?
• Quality
assurance
&
quality
control
measures
• Security:
version
control,
backing
up
biology.kenyon.edu
• Who
will
be
responsible
for
data
management
during/after
project?
From
Flickr
by
Lazurite
50. 2. Data
&
metadata
standards
• What
metadata
are
needed
to
make
the
data
meaningful?
• How
will
you
create
or
capture
these
metadata?
Wired.com
• Why
have
you
chosen
particular
standards
and
approaches
for
metadata?
51. 3. Policies
for
access
&
sharing
4. Policies for re-use & re-distribution
• Are
you
under
any
obligation
to
share
data?
• How,
when,
&
where
will
you
make
the
data
available?
• What
is
the
process
for
gaining
access
to
the
data?
• Who
owns
the
copyright
and/or
intellectual
property?
• Will
you
retain
rights
before
opening
data
to
wider
use?
How
long?
• Are
permission
restrictions
necessary?
• Embargo
periods
for
political/commercial/patent
reasons?
• Ethical
and
privacy
issues?
• Who
are
the
foreseeable
data
users?
• How
should
your
data
be
cited?
52. 5. Plans
for
archiving
&
preservation
• What
data
will
be
preserved
for
the
long
term?
For
how
long?
• Where
will
data
be
preserved?
• What
data
transformations
need
to
occur
before
preservation?
• What
metadata
will
be
submitted
alongside
the
datasets?
• Who
will
be
responsible
for
preparing
data
for
preservation?
Who
will
be
the
main
contact
person
for
the
archived
data?
From
Flickr
by
theManWhoSurfedTooMuch
53. Don’t
forget:
Budget
• Costs
of
data
preparation
&
documentation
Hardware,
software
Personnel
Archive
fees
• How
costs
will
be
paid
Request
funding!
dorrvs.com
54. NSF’s
Vision*
DMPs
and
their
evaluation
will
grow
&
change
over
time
(similar
to
broader
impacts)
Peer
review
will
determine
next
steps
Community-driven
guidelines
– Different
disciplines
have
different
definitions
of
acceptable
data
sharing
– Flexibility
at
the
directorate
and
division
levels
– Tailor
implementation
of
DMP
requirement
Evaluation
will
vary
with
directorate,
division,
&
program
officer
*Unofficially
Help
from
Jennifer
Schopf,
NSF
55. Roadmap
4. Toolbox
3. How
to
improve
2. Mistakes
we
make
1. Background
56. DMPTool:
dmp.cdlib.org
Step-by-step
wizard
for
generating
DMP
Create
|
edit
|
re-use
|
share
|
save
|
generate
Open
to
community
Links
to
institutional
resources
Directorate
information
&
updates
58. CDL
Services
for
UC
Community
Where should I put my data?
Data Repository
Deposit
|
Manage
|
Share
|
Preserve
www.cdlib.org/services/uc3
59. CDL
Services
for
UC
Community
Create
&
manage
persistent
identifiers
• Precise
identification
of
a
dataset
• Credit
to
data
producers
and
data
publishers
• A
link
from
the
traditional
literature
to
the
data
• Research
metrics
for
datasets
Example:
Sidlauskas,
B.
2007.
Data
from:
Testing
for
unequal
rates
of
morphological
diversification
in
the
absence
of
a
detailed
phylogeny:
a
case
study
from
characiform
fishes.
Dryad
Digital
Repository.
doi:10.5061/dryad.20
www.cdlib.org/services/uc3
60. Why
are
you
promoting
Excel?
• Open
source
add-in
• Facilitate
data
management,
sharing,
archiving
for
scientists
• Focus
on
atmospheric,
ecological,
hydrological,
and
oceanographic
data
• Collecting
requirements
for
add-in
from
scientists,
data
centers,
libraries
Funders:
Gordon
and
Betty
Moore
Foundation,
Microsoft
Research
61. Why
are
you
promoting
Excel?
Everyone
uses
it
Stopgap
measure
63. www.dataone.org
• Data
Education
Tutorials
• Database
of
best
practices
&
software
tools
• Links
to
DMPTool
• Primer
on
data
management
From
Flickr
by
Robert
Hruzek
66. Process
1. Assess
needs
2. Gather
requirements
3. Build
requirements
document
4. Build
community
67. Requirements
1. Must
work
for
Excel
users
without
the
add-in
2. No
additional
software
(other
than
add-in
and
Excel)
necessary
3. Can
be
used
offline
4. Perform
CSV
compatibility
checks,
reporting,
and
automated
fixes
5. Add
Metadata
to
data
file
a. Can
use
existing
metadata
as
a
template
b. Add-in
can
automatically
generate
some
of
the
metadata
where
the
info
is
available
from
the
file
6. Generate
a
citation
for
the
data
file
7. Deposit
data
and
metadata
in
a
repository
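Requirement 4 asks for CSV compatibility checks and reporting. The sketch below illustrates what such a report could look like; it is not the add-in's implementation, and the three checks chosen here (empty column names, duplicate column names, rows with the wrong number of fields) are assumptions about what "compatibility" might cover.

```python
import csv
import io

def csv_compatibility_report(text):
    """Return a list of human-readable problems found in CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    report = []
    header = rows[0]
    if any(not name.strip() for name in header):
        report.append("header has an empty column name")
    if len(set(header)) != len(header):
        report.append("header has duplicate column names")
    for n, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            report.append(f"row {n} has {len(row)} fields, expected {len(header)}")
    return report

# Hypothetical input with a repeated column name and a short row.
report = csv_compatibility_report("site,count,count\nnanaimo,12,3\nnanaimo,7\n")
for problem in report:
    print(problem)
```

Reporting problems without changing the file leaves the decision about each fix to the user, which matches the user-directed correction idea in the requirements.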
68. The Great Debate
Add-in
• Little pieces of software
• Download to extend the capabilities of Excel
• Appear as “ribbon”
Web-based application
• Require the web: www + wba
• Do not require that you download a program
• Websites that do something with info/files provided by user
• Examples: Facebook, YouTube
69. Add-in
Download add-in | DCXL add-in | New & improved spreadsheet
Check Compatibility
1. Parse for compatibility
2. Report potential errors
3. Allow user-directed error correction
Create Metadata
1. Make template
2. Auto-fill
3. Parameter list selection
4. Citation generation
5. DOI connection
Connect to repository
1. Version control
2. Backing up
3. Retrieve info: Authentication, Keyword list, Metadata standard, Citation format, Acceptable file formats
70. Summary: Add-in
The Good
• Integrated in workflow
• Familiar UI, functionality
• Smaller shift
• Available offline
The Bad
• Windows only
• Install & updates required
• Not as generalizable/extensible
• Not as easy for community to get involved
71. Web application
Upload spreadsheet | Web-based application | New & improved spreadsheet
Check Compatibility
1. Parse for compatibility
2. Report potential errors
3. Allow user-directed error correction
Create Metadata
1. Make template
2. Auto-fill
3. Parameter list selection
4. Citation generation
5. DOI connection
Connect to repository
1. Version control
2. Backing up
3. Retrieve info: Authentication, Keyword list, Metadata standard, Citation format, Acceptable file formats
72. Summary: Web based
The Good
• Easier to maintain, update
• Can use with Mac
• Generalizable/extensible
• Community involvement possible
The Bad
• Not familiar
• Requires new UI
• Not integrated in Excel
• Offline use not guaranteed
73. Moving
forward…
• Simple,
clean
user
interface
• Connect
to
web
application
from
within
Excel
• Offline
use
of
web
application,
especially
ability
to
create
metadata
offline
74. Send
me
feedback!
From
Flickr
by
hashmil
Comment
on
the
blog
dcxl.cdlib.org
Email
me
carlystrasser@gmail.com
Tweet
me
@carlystrasser
FB
message
me
DCXLatCDL
75. Diane
Bisom
Ann
Frenkel
Dr.
Ruth
Jackson
dcxl.cdlib.org
@dcxlCDL
www.facebook.com/DCXLatCDL
www.carlystrasser.net
carlystrasser@gmail.com
@carlystrasser