This document contains data from a study on stable isotopes in algal samples from Wash Cresc Lake. It includes a table with sample identifiers, weights, carbon and nitrogen percentages and delta values, and spectrometer numbers. There are also notes that this is old pilot data from Peter's lab that should not be used, and comments identifying average concentrations and locations for some samples. Additional files with these data are stored on Hampton's computer.
Data Management: Scientist Perspective - UC3 Data Curation Workshop (Carly Strasser)
Presentation on data management: the current landscape, barriers to management, and data types. For UC3-CDL data curation for practitioners workshop, 8 Nov 2012 in Oakland CA.
Data Herding for Scientists - UC Davis OA Week (Carly Strasser)
Presentation for UC Davis Open Access Week. Covers the current status of data management in the sciences, best practices for data management, data management planning, and tools for researchers.
Data Management: Scientist Perspective - DLF 2012 (Carly Strasser)
Presentation at the 2012 Digital Libraries Federation Fall Forum in Denver, CO. Workshop on Data Management Services, held 5 Nov 2012. http://www.diglib.org/forums/2012forum/data-management-services-at-the-library-the-3-hour-tour/
Cal Poly - Data Management for Researchers (Carly Strasser)
October 17, 2013 @ 1 Robert E. Kennedy Library, Data Studio, California Polytechnic State University.
Researchers rarely learn about good data management practices. Instead we develop our own systems that are often unintelligible to others. In this talk, Strasser, PhD, will focus on the common mistakes that scientists make and how to avoid them. She will provide best practices for data management, which will facilitate data sharing and reuse, and introduce tools you can use.
CDL has recently launched a new project dubbed Digital Curation for Excel (DCXL), funded by the Gordon and Betty Moore Foundation and Microsoft Research. The goal of the DCXL project is to facilitate data management, sharing, and archiving for earth, environmental, and ecological scientists. The main result from the project will be an open source add-in for Microsoft Excel that will assist scientists in preparing their Excel data for sharing.
RDAP 15: You’re in good company: Unifying campus research data services (ASIS&T)
Research Data Access and Preservation Summit, 2015
Minneapolis, MN
April 22-23
Cynthia Hudson-Vitale, Digital Data Outreach Librarian, Washington University
Brianna Marshall, Digital Curation Coordinator, University of Wisconsin-Madison
Amy Nurnberger, Research Data Manager, Columbia University
Data Equivalence
Mark Parsons, Lead Project Manager, Senior Associate Scientist, National Snow and Ice Data Center
Data citation, especially using persistent identifiers like Digital Object Identifiers (DOIs), is an increasingly accepted scientific practice. Recently, several, respected organizations have developed guidelines for data citation. The different guidelines are largely congruent in that they agree on the basic practice and elements of data citation, especially for relatively static, whole data collections. There is less agreement on the more subtle nuances of data citation that are sometimes necessary to ensure precise reference and scientific reproducibility--the core purpose of data citation. We need to be sure that if you follow a data reference you get to the precise data that were used or at least their scientific equivalent. Identifiers such as DOIs are necessary but not sufficient for the precise, detailed, references necessary. This talk discusses issues around data set versioning, micro-citation, and scientific equivalence. I propose some interim solutions and suggest research strategies for the future.
Funders and publishers have something in common: for better or worse, we have the ability to influence the behavior of researchers. This talk will focus on what both groups can do to improve research now and in the future.
ESA Ignite talk on UC3 Dash platform for data sharing (Carly Strasser)
Ignite talk (20 slides / 15 seconds per slide) for ESA 2014 meeting in Sacramento, CA 12 August 2014. On the Dash platform for helping researchers manage and share their data via institutional repositories
Data Management for Mountain Observatories Workshop (Carly Strasser)
Keynote presentation for 2014 Mountain Observatories Workshop, 16 July 2014.
Abstract:
While methods for collecting data are well taught, there is less emphasis on managing the resulting data effectively. New mandates, announcements, memos, and requirements from agencies and publishers are emerging that encourage better data management, data sharing, and data preservation. Scientists with good management skills will be able to maximize the productivity of their own research, effectively and efficiently share their data with the community, and benefit from the re-use of their data by others. I will offer an overview of the data management landscape, discussing recent events, resources, and new directions for data stewardship. I will also cover best practices for data management, which will facilitate data sharing and reuse, and introduce tools researchers can use to help in their data stewardship endeavours.
Libraries & Research Data Management for CO Alliance of Research Libraries (Carly Strasser)
Keynote presentation for the Colorado Alliance of Research Libraries 2014 Research Data Management Conference, 11 July 2014. Focuses on why data management and sharing is important, and the role of libraries.
Open Science for Australian Institute of Marine Science Workshop (Carly Strasser)
*Please excuse the typos :)
Presentation on open science and open data for the Australian Institute of Marine Science (AIMS) workshop on "Raising your research profile using research data". 18 June 2014.
Data management overview and UC3 tools for IASSIST 2014 (Carly Strasser)
Presentation to introduce current landscape of data management and UC3 tools and services that support data sharing. For IASSIST in Toronto, 5 June 2014.
Data Publication for UC Davis Publish or Perish (Carly Strasser)
Intro presentation for a panel on going beyond publishing journal articles. UC Davis "Publish or Perish?" event, 13 Feb 2014. Sorry about the missing gradient on some of the slides!
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis at the DASA Connect conference, 30 May 2024. We discuss what testing is, then what agile testing is, and finally Testing in DevOps. We closed with a lovely workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
UiPath Test Automation using UiPath Test Suite series, part 4 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf (Peter Spielvogel)
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Elevating Tactical DDD Patterns Through Object Calisthenics (Dorra BARTAGUIZ)
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf (Paige Cruz)
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on.
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
PHP Frameworks: I want to break free (IPC Berlin 2024) (Ralf Eggert)
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
UC Riverside: Data Management for Scientists
1. Data Management for Scientists
Reduce your workload | Reuse your ideas | Recycle your data
Carly Strasser, PhD
California Digital Library, UC Office of the President
UC Riverside, February 2012
carly.strasser@ucop.edu | www.carlystrasser.net
(Image: www.oddee.com)
2. Roadmap
1. Background
2. Mistakes we make
3. How to improve
4. Toolbox
3. What role can libraries play in data education?
What barriers to sharing can we eliminate?
Why don’t people share data?
Is data management being taught?
Do attitudes about sharing differ among disciplines?
How can we promote storing data in repositories?
5. Roadmap
1. Background
2. Mistakes we make
3. How to improve
4. Toolbox
6. Digital data
Image credits: From Flickr by DW0825, Flickmor, deltaMike, and US Army Environmental Command; www.woodrow.org; C. Strasser; Courtesy of WHOI
8. Data
Models | Maximum Likelihood estimation | Matrix Models | Images | Tables | Paper
9. UGLY TRUTH
Many Earth | Environmental | Ecological scientists…
• are not taught data management
• don’t know what metadata are
• can’t name data centers or repositories
• don’t share data publicly or store it in an archive
• aren’t convinced they should share data
(5shortessays.blogspot.com)
14. Data Hangover
What happened?
From Flickr by SteveMcN
15. Where data end up
[Diagram of where data and metadata end up, recreated from Klump et al. 2006]
Image credits: From Flickr by diylibrarian and csessums; blog.order2disorder.com
16. Who cares?
Image credits: From Flickr by Redden-McAllister and AJC1; www.rba.gov.au
17. Where data end up
[Diagram of data and metadata on the www, recreated from Klump et al. 2006]
Image credits: From Flickr by diylibrarian and torkildr
19. Trends in Data Archiving
Journal publishers: Joint Data Archiving Agreement; Data Papers (Ecological Archives, etc.); Beyond the PDF
20. Trends in Data Archiving
Journal publishers: Joint Data Archiving Agreement; Data Papers (Ecological Archives, etc.); Beyond the PDF
Funders: Data management requirements
21. Roadmap
1. Background
2. Mistakes we make
3. How to improve
4. Toolbox
22. Best Practices for Data Management
1. Planning
2. Data collection & organization
3. Quality control & assurance
4. Metadata
5. Workflows
6. Data stewardship & reuse
23. 2. Data collection & organization
Create unique identifiers
• Decide on naming scheme early
• Create a key
• Different for each sample
From Flickr by zebbie and sjbresnahan
24. 2. Data collection & organization
Standardize
• Consistent within columns – only numbers, dates, or text
• Consistent names, codes, formats
Modified from K. Vanderbilt. From Pink Floyd, The Wall (themurkyfringe.com)
25. 2. Data collection & organization
Standardize
• Reduce possibility of manual error by constraining entry choices: Excel lists, data validation, Google Docs Forms
Modified from K. Vanderbilt
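The constrained-entry idea above can also be applied after the fact: a few lines of code can report every cell that falls outside an agreed vocabulary. A minimal sketch in Python (the slide's actual examples are Excel lists and Google Docs Forms; the column name and allowed site codes here are invented for illustration):

```python
import csv
import io

# Hypothetical controlled vocabulary for a "site" column; in practice this
# would come from the project's key (see "Create a key" above).
ALLOWED_SITES = {"nanaimo", "wash_cresc", "grassland"}

def invalid_entries(rows, column, allowed):
    """Return (row_number, value) pairs whose value is not in the allowed set.
    Row numbers start at 2 because row 1 is the header."""
    return [(i, row[column]) for i, row in enumerate(rows, start=2)
            if row[column] not in allowed]

# Tiny inline CSV standing in for a real data file.
raw = "site,count\nnanaimo,12\nNanaimo,9\ngrassland,3\n"
rows = list(csv.DictReader(io.StringIO(raw)))
print(invalid_entries(rows, "site", ALLOWED_SITES))  # [(3, 'Nanaimo')]
```

Note that the inconsistent capitalization ("Nanaimo" vs "nanaimo") is exactly the kind of within-column inconsistency the slide warns about.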
26. 2. Data collection & organization
Create a parameter table. Create a site table.
From doi:10.3334/ORNLDAAC/777. From R. Cook, ESA Best Practices Workshop 2010
27. 2. Data collection & organization
Use descriptive file names
PhDcomics.com
28. 2. Data collection & organization
Use descriptive file names*
• Unique
• Reflect contents
Bad: Mydata.xls, 2001_data.csv, best version.txt
Better: Eaffinis_nanaimo_2010_counts.xls (study organism, site name, year, what was measured)
*Not for everyone. From R. Cook, ESA Best Practices Workshop 2010
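The Bad/Better naming rule above is easy to automate so that every file in a project follows the same scheme. A minimal sketch, assuming the four name parts shown on the slide (study organism, site name, year, what was measured); the helper name is mine, not from the deck:

```python
import re

def data_filename(organism, site, year, measured, ext="csv"):
    """Build a descriptive, unique file name: organism_site_year_measured.ext.
    Characters outside A-Z, a-z, 0-9 (e.g. the space in "best version.txt")
    are stripped, since they cause trouble across operating systems."""
    parts = [organism, site, str(year), measured]
    clean = [re.sub(r"[^A-Za-z0-9]+", "", p) for p in parts]
    return "_".join(clean) + "." + ext

print(data_filename("Eaffinis", "nanaimo", 2010, "counts", ext="xls"))
# Eaffinis_nanaimo_2010_counts.xls
```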
29. 2. Data collection & organization
Organize files logically:
Biodiversity
  Lake
    Experiments: Biodiv_H20_heatExp_2005to2008.csv, Biodiv_H20_predatorExp_2001to2003.csv, …
    Field work: Biodiv_H20_PlanktonCount_2001toActive.csv, Biodiv_H20_ChlAprofiles_2003.csv, …
  Grassland
From S. Hampton
30. 2. Data collection & organization
Preserve information
• Keep raw data raw
• Use scripts to process data & save them with data
Raw data as .csv → R script for processing & analysis
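The "keep raw data raw" pattern can be sketched in a few lines: the raw file is only ever read, and every derivation lives in a script that can be re-run from scratch. The slide shows an R script; this Python version, with a made-up "drop records with a missing count" step, is just an illustration:

```python
import csv
import io

def clean_rows(rows):
    """Derive the 'clean' data: here, drop records with a missing count.
    (A stand-in for whatever processing your project actually needs.)"""
    return [r for r in rows if r["count"] != ""]

def process(raw_csv_text):
    """Parse raw CSV text and return cleaned rows. The raw input is never
    rewritten, so re-running the script always starts from the same data."""
    rows = list(csv.DictReader(io.StringIO(raw_csv_text)))
    return clean_rows(rows)

raw = "site,count\nlake_a,12\nlake_b,\n"
print(process(raw))  # [{'site': 'lake_a', 'count': '12'}]
```

Saving this script next to the raw .csv documents the processing and keeps it repeatable, which is the point of the slide.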
31. 2. Data collection & organization
All of the things that make Excel great for data organization are bad for archiving! What to do?
1. Create archive-ready raw data
2. Put it somewhere special
3. Have your fun with fancy Excel techniques
4. Keep archiving in mind
32. 3. Quality control and quality assurance
• Define & enforce standards
• Double data entry
• Document changes
• Minimize manual data entry
• No missing, impossible, or anomalous values
[Example scatter plot]
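The "no missing, impossible, or anomalous values" check lends itself to a small script that flags suspect records for a human to review rather than silently fixing them, which also supports "document changes". A sketch, where the plausible temperature range is an assumption made up for the example:

```python
def qc_flags(values, low, high):
    """Flag missing and impossible (out-of-range) values.
    Returns (index, value, reason) tuples; nothing is deleted or altered,
    so every subsequent change can be documented."""
    flags = []
    for i, v in enumerate(values):
        if v is None:
            flags.append((i, v, "missing"))
        elif not (low <= v <= high):
            flags.append((i, v, "out of range"))
    return flags

# Water temperatures in deg C; -5..40 is an assumed plausible range.
temps = [12.1, None, 13.0, 131.0]
print(qc_flags(temps, low=-5, high=40))
# [(1, None, 'missing'), (3, 131.0, 'out of range')]
```

The 131.0 here is the classic transcription error (a misplaced digit) that double data entry is meant to catch.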
34. 4. Metadata basics
Metadata = Data reporting
WHO created the data? WHAT is the content of the data set? WHEN was it created? WHERE was it collected? HOW was it developed? WHY was it developed?
35. 4. Metadata basics
• Scientific context
  • Scientific reason why the data were collected
  • What data were collected
  • What instruments (including model & serial number) were used
  • Environmental conditions during collection
  • Where collected & spatial resolution
  • When collected & temporal resolution
  • Standards or calibrations used
• Information about parameters
  • How each was measured or produced
  • Units of measure
  • Format used in the data set
  • Precision & accuracy if known
• Information about data
  • Definitions of codes used
  • Quality assurance & control measures
  • Known problems that limit data use (e.g. uncertainty, sampling problems)
• Digital context
  • Name of the data set
  • The name(s) of the data file(s) in the data set
  • Date the data set was last modified
  • Example data file records for each data type file
  • Pertinent companion files
  • List of related or ancillary data sets
  • Software (including version number) used to prepare/read the data set
  • Data processing that was performed
• Personnel & stakeholders
  • Who collected the data
  • Who to contact with questions
  • Funders
  • How to cite the data set
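One low-tech way to make sure the who/what/when/where/how/why questions above get answered is to keep a machine-readable stub alongside each data set. The field names and sample values below are illustrative only, not drawn from the deck or from a formal standard such as EML or FGDC:

```python
# A minimal metadata stub covering the six questions from the slide.
# Every value here is a made-up placeholder to show the shape.
metadata = {
    "who": {"creator": "A. Researcher", "contact": "researcher@example.edu"},
    "what": {"dataset_name": "Plankton counts", "files": ["counts.csv"],
             "units": {"count": "individuals per liter"}},
    "when": {"collected": "2010", "last_modified": "2012-02-01"},
    "where": {"site": "example lake site", "spatial_resolution": "single station"},
    "how": {"instrument": "dissecting microscope", "processing": "counts_script"},
    "why": {"purpose": "zooplankton population study"},
}
```

Saved as a sidecar file (e.g. JSON) next to the data, a stub like this is a starting point that a tool such as Morpho can later turn into standard-compliant metadata.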
36. 4. Metadata basics
What is metadata? What is a metadata standard? Select the appropriate metadata standard.
• Provides structure to describe data: common terms | definitions | language | structure
• Lots of different standards: EML, FGDC, ISO 19115, DarwinCore, …
• Tools for creating metadata files: Morpho (EML), Metavist (FGDC), NOAA MERMaid (CSDGM)
38. 5. Workflows
Simplest workflows: commented scripts, flow charts
Temperature data + Salinity data → Data import into R → Data in R format → Quality control & cleaning → “Clean” T & S data → Analysis: mean, SD → Summary statistics → Graph production
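The flow chart above maps naturally onto a commented script in which each box is a named step. The slide's workflow is written in R; here is a Python sketch of the "quality control & cleaning → analysis: mean, SD" stages, with invented sample numbers:

```python
from statistics import mean, stdev

def summarize(series):
    """One pass of the commented-script workflow: quality control &
    cleaning, then summary statistics. Each stage is explicit, so the
    flow chart's boxes map one-to-one onto lines of code."""
    clean = [x for x in series if x is not None]  # quality control & cleaning
    return {"n": len(clean), "mean": mean(clean), "sd": stdev(clean)}

temperature = [11.8, 12.4, None, 12.1]  # temperature data (made-up values)
salinity = [31.2, 31.0, 31.5, 31.1]     # salinity data (made-up values)
print(summarize(temperature))
print(summarize(salinity))
```

Because each stage is a named function in a saved script, the workflow is re-runnable, which is what makes it preferable to undocumented point-and-click edits.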
40. 5. Workflows
Workflows enable:
• Reproducibility: can someone independently validate findings?
• Transparency: others can understand how you arrived at your results
• Executability: others can re-run or re-use your analysis
From Flickr by merlinprincesse
41. 6. Data stewardship & reuse
The 20-Year Rule: The metadata accompanying a data set should be written for a user 20 years into the future. (National Research Council 1991)
From Flickr by greensambaman
42. 6. Data stewardship & reuse
• Use stable formats: csv, txt, tiff
• Create back-up copies: original, near, far
• Periodically test ability to restore information
Modified from R. Cook
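"Periodically test ability to restore information" can be made concrete by storing a checksum with each back-up copy (original, near, far) and recomparing it after a test restore. A minimal sketch:

```python
import hashlib

def checksum(data: bytes) -> str:
    """SHA-256 fingerprint of a file's bytes. Record it when a back-up
    copy is made; recompute it on restore to confirm the bytes survived
    intact."""
    return hashlib.sha256(data).hexdigest()

original = b"site,count\nlake_a,12\n"
restored = b"site,count\nlake_a,12\n"  # bytes read back from a back-up
print(checksum(original) == checksum(restored))  # True
```

A mismatch means the stored copy (or the restore process) silently corrupted the data, which is exactly what a periodic restore test is meant to catch.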
43. 6. Data stewardship & reuse
Where do I put my data?
• Institutional archive
• Discipline/specialty archive
DataCite list of repositories: www.datacite.org/repolist
From Flickr by torkildr
44. 6. Data stewardship & reuse
Data Citation: Why everyone should do it
• Allow readers to find data products
• Get credit for data and publications
• Promote reproducibility
• Better measure of research impact
Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of morphological diversification in the absence of a detailed phylogeny: a case study from characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20
Learn more at www.datacite.org. Modified from R. Cook
45. Best Practices for Data Management
1. Planning
2. Data collection & organization
3. Quality control & assurance
4. Metadata
5. Workflows
6. Data stewardship & reuse
7. Planning
46. 1. Planning
What is a data management plan? A document that describes what you will do with your data during your research and after you complete your research.
Data Hangover
47. 1. Planning
Why should I prepare a DMP?
• Saves time
• Increases efficiency
• Easier to use data
• Others can understand & use data
• Credit for data products
• Funders require it
48. NSF DMP Requirements
From Grant Proposal Guidelines, the DMP supplement may include:
1. the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project
2. the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies)
3. policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements
4. policies and provisions for re-use, re-distribution, and the production of derivatives
5. plans for archiving data, samples, and other research products, and for preservation of access to them
49. 1. Types of data & other information
• Types of data produced
• Relationship to existing data
• How/when/where will the data be captured or created?
• How will the data be processed?
• Quality assurance & quality control measures
• Security: version control, backing up
• Who will be responsible for data management during/after project?
Image credits: C. Strasser; biology.kenyon.edu; From Flickr by Lazurite
50. 2. Data & metadata standards
• What metadata are needed to make the data meaningful?
• How will you create or capture these metadata?
• Why have you chosen particular standards and approaches for metadata?
Wired.com
51. 3. Policies for access & sharing; 4. Policies for re-use & re-distribution
• Are you under any obligation to share data?
• How, when, & where will you make the data available?
• What is the process for gaining access to the data?
• Who owns the copyright and/or intellectual property?
• Will you retain rights before opening data to wider use? How long?
• Are permission restrictions necessary?
• Embargo periods for political/commercial/patent reasons?
• Ethical and privacy issues?
• Who are the foreseeable data users?
• How should your data be cited?
52. 5. Plans for archiving & preservation
• What data will be preserved for the long term? For how long?
• Where will data be preserved?
• What data transformations need to occur before preservation?
• What metadata will be submitted alongside the datasets?
• Who will be responsible for preparing data for preservation? Who will be the main contact person for the archived data?
From Flickr by theManWhoSurfedTooMuch
53. Don’t forget: Budget
• Costs of data preparation & documentation: hardware, software; personnel; archive fees
• How costs will be paid: request funding!
dorrvs.com
54. NSF’s Vision*
• DMPs and their evaluation will grow & change over time (similar to broader impacts)
• Peer review will determine next steps
• Community-driven guidelines
  – Different disciplines have different definitions of acceptable data sharing
  – Flexibility at the directorate and division levels
  – Tailor implementation of DMP requirement
• Evaluation will vary with directorate, division, & program officer
*Unofficially. Help from Jennifer Schopf, NSF
55. Roadmap
1. Background
2. Mistakes we make
3. How to improve
4. Toolbox
56. DMPTool: dmp.cdlib.org
• Step-by-step wizard for generating a DMP
• Create | edit | re-use | share | save | generate
• Open to community
• Links to institutional resources
• Directorate information & updates
58. CDL Services for UC Community
Where should I put my data? Data Repository: Deposit | Manage | Share | Preserve
www.cdlib.org/services/uc3
59. CDL Services for UC Community
Create & manage persistent identifiers
• Precise identification of a dataset
• Credit to data producers and data publishers
• A link from the traditional literature to the data
• Research metrics for datasets
Example: Sidlauskas, B. 2007. Data from: Testing for unequal rates of morphological diversification in the absence of a detailed phylogeny: a case study from characiform fishes. Dryad Digital Repository. doi:10.5061/dryad.20
www.cdlib.org/services/uc3
60. Why are you promoting Excel?
• Open source add-in
• Facilitate data management, sharing, archiving for scientists
• Focus on atmospheric, ecological, hydrological, and oceanographic data
• Collecting requirements for add-in from scientists, data centers, libraries
Funders: Gordon and Betty Moore Foundation, Microsoft Research
61. Why are you promoting Excel?
Everyone uses it. Stopgap measure.
63. www.dataone.org
• Data Education Tutorials
• Database of best practices & software tools
• Links to DMPTool
• Primer on data management
From Flickr by Robert Hruzek
66. Process
1. Assess needs
2. Gather requirements
3. Build requirements document
4. Build community
67. Requirements
1. Must work for Excel users without the add-in
2. No additional software (other than add-in and Excel) necessary
3. Can be used offline
4. Perform CSV compatibility checks, reporting, and automated fixes
5. Add metadata to data file
  a. Can use existing metadata as a template
  b. Add-in can automatically generate some of the metadata where the info is available from the file
6. Generate a citation for the data file
7. Deposit data and metadata in a repository
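Requirement 4 (CSV compatibility checks and reporting) can be illustrated with a few lines of code. This is my sketch, not DCXL's actual implementation; the checks shown (ragged rows, embedded newlines) are assumptions about what "CSV compatibility" might cover, and a real add-in would presumably also flag Excel-specific features such as formulas and merged cells:

```python
def csv_compatibility_report(rows):
    """Report features of a spreadsheet (given as a list of rows of cell
    strings) that will not survive export to plain CSV: rows whose width
    differs from the header, and cells containing embedded newlines."""
    problems = []
    width = len(rows[0])  # header row defines the expected width
    for i, row in enumerate(rows):
        if len(row) != width:
            problems.append((i, "ragged row"))
        if any("\n" in cell for cell in row):
            problems.append((i, "embedded newline"))
    return problems

rows = [["site", "count"], ["lake_a", "12"], ["lake_b"]]
print(csv_compatibility_report(rows))  # [(2, 'ragged row')]
```

Reporting row numbers rather than auto-fixing matches the requirement's split between "checks, reporting, and automated fixes": the user sees what will change before anything does.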
68. The Great Debate
Add-in
• Little pieces of software
• Download to extend the capabilities of Excel
• Appear as a “ribbon”
Web-based application
• Requires the web: www + wba
• Does not require that you download a program
• Websites that do something with info/files provided by the user
• Examples: Facebook, YouTube
69. Add-in
Download add-in → DCXL spreadsheet add-in → New & improved add-in
Check Compatibility
1. Parse for compatibility
2. Report potential errors
3. Allow user-directed error correction
Create Metadata
1. Make template
2. Auto-fill
3. Parameter list selection
4. Citation generation
5. DOI connection
Connect to repository
1. Version control
2. Backing up
3. Retrieve info: authentication, keyword list, metadata standard, citation format, acceptable file formats
70. Summary: Add-in
The Good
• Integrated in workflow
• Familiar UI, functionality
• Smaller shift
• Available offline
The Bad
• Windows only
• Install & updates required
• Not as generalizable/extensible
• Not as easy for community to get involved
71. Web application
Upload spreadsheet → Web-based spreadsheet application → New & improved spreadsheet
Check Compatibility
1. Parse for compatibility
2. Report potential errors
3. Allow user-directed error correction
Create Metadata
1. Make template
2. Auto-fill
3. Parameter list selection
4. Citation generation
5. DOI connection
Connect to repository
1. Version control
2. Backing up
3. Retrieve info: authentication, keyword list, metadata standard, citation format, acceptable file formats
72. Summary: Web based
The Good
• Easier to maintain, update
• Can use with Mac
• Generalizable/extensible
• Community involvement possible
The Bad
• Not familiar
• Requires new UI
• Not integrated in Excel
• Offline use not guaranteed
73. Moving forward…
• Simple, clean user interface
• Connect to web application from within Excel
• Offline use of web application, especially ability to create metadata offline
74. Send me feedback!
Comment on the blog: dcxl.cdlib.org
Email me: carlystrasser@gmail.com
Tweet me: @carlystrasser
FB message me: DCXLatCDL
From Flickr by hashmil
75. Diane Bisom | Ann Frenkel | Dr. Ruth Jackson
dcxl.cdlib.org | @dcxlCDL | www.facebook.com/DCXLatCDL
www.carlystrasser.net | carlystrasser@gmail.com | @carlystrasser