Small Data: Bridging the Gap Between Generic and Specific Repositories

Small Data, or: Bridging the Gap Between Specific and Generic Research Repositories

April 11, 2013

Anita de Waard
VP Research Data Collaborations
a.dewaard@elsevier.com
http://researchdata.elsevier.com/

There are many efforts to enhance data storing and sharing...

•  Many different research databases, both generic (Dryad, Dataverse, …) and specific (NIF, IEDA, PDB, …)
•  Many systems for creating/sharing workflows (Taverna, MyExperiment, Vistrails, Workflow4Ever etc)
•  Many e-lab notebooks (LabGuru, LabArchives, LaBlog, etc)
•  Scores of projects, committees, standards, bodies, grants, initiatives, conferences for discussing and connecting all of this (KEfED, Pegasus, PROV, RDA, Science Gateways, Codata, BRDI, Earthcube, etc. etc)
•  You can make a living out of this ;-)! (and many of us do…)

…but this is what scientists do:

Using antibodies and squishy bits
Grad Students experiment and enter details into their lab notebook.
The PI then tries to make sense of this, and writes a paper.
End of story.

Why save research data?

A.  Data Preservation:
–  Preserve record of scientific process, provenance
–  Enable reproducible research
B.  Data Use:
–  Use results obtained by others
–  Do better science!
–  Improve interdisciplinary work

Where the data goes now:

> 50 My Papers, 2 M scientists, 2 M papers/year

A small portion of data (1-2%?) stored in small, topic-focused data repositories:
PDB: 88,3 k
PetDB: 1,5 k
SedDB: 0.6 k
MiRB: 25k
TAIR: 72,1 k

Some data (8%?) stored in large, generic data repositories:
Dryad: 7,631 files
Dataverse: 0.6 M
Datacite: 1.5 M

Majority of data (90%?) is stored on local hard drives

So this needs to happen:

INCREASE DATA PRESERVATION

[Figure: the same breakdown of where the data goes (small, topic-focused data repositories 1-2%?; large, generic data repositories 8%?; local hard drives 90%?), overlaid with the call to increase data preservation]

Data Preservation Issues:

Objection: “Our lab notebooks are all on paper – it’s how we do things”
Response: Graft tools closely on scientists’ daily practice
Example: create tailored metadata collection tools on mini-tablets in labs to replace paper notebooks

Data Preservation Issues:

Objection: “I need to see a direct benefit of any effort I put in.”
Response: Create tools to allow better insight into one’s own and others’ results.
Example: ‘PI-Dashboard’: allow immediate access/analysis of shared data: new science!

Data Use Issues:

Objection: “I don’t really trust anyone else’s data – and don’t think they’ll trust mine”
Response: Create social networking context; allow data owner to provide granular access control.
Example:
•  In Urban Lab app, data stored by researcher name.
•  PI decides who gets to see which data
•  Match up with NIF and Eagle-I ontologies on back end so export of (part of) data is possible at any time.
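
The access model in this example (data filed by researcher name, with the PI deciding who sees which data) can be sketched roughly as follows. This is only an illustration, not the Urban Lab app's actual design: the `Lab` and `Dataset` classes, their method names, and the sample user names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """One dataset, filed under the researcher who produced it."""
    name: str
    researcher: str
    # People the PI has explicitly allowed to view this dataset.
    allowed_viewers: set[str] = field(default_factory=set)


@dataclass
class Lab:
    """Hypothetical lab container in which only the PI grants access."""
    pi: str
    datasets: dict[str, Dataset] = field(default_factory=dict)

    def add_dataset(self, name: str, researcher: str) -> None:
        self.datasets[name] = Dataset(name, researcher)

    def grant(self, granted_by: str, dataset: str, viewer: str) -> None:
        # Assumption: nobody except the PI may share a dataset.
        if granted_by != self.pi:
            raise PermissionError("only the PI can grant access")
        self.datasets[dataset].allowed_viewers.add(viewer)

    def can_view(self, user: str, dataset: str) -> bool:
        d = self.datasets[dataset]
        return user in (self.pi, d.researcher) or user in d.allowed_viewers


lab = Lab(pi="pi")
lab.add_dataset("blot-01", researcher="student_a")
lab.grant("pi", "blot-01", viewer="collaborator")
print(lab.can_view("collaborator", "blot-01"))
```

Keeping the grant decision per dataset, rather than per lab, is what makes the control granular: a collaborator can be shown one experiment without seeing the rest.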


Data Use Issues:

•  Objection: “I am afraid other people might scoop my discoveries”
•  Response: Reward system needs to move from direct competition to a ‘shared mission’ approach (cf. Mars)
•  Example: Data Rescue Challenge in the geosciences: collect and reward stories/practices of data preservation, enable cross-disciplinary access and use of all data.

The 2013 International Data Rescue Award in the Geosciences

Organised by IEDA and Elsevier Research Data Services
http://researchdata.elsevier.com/datachallenge

Data Preservation and Annotation:

Fine, I’ll do it, but where the hell do I put it?

Domain of study: Domain-Specific Data Repository
Collaborators: Local Data Repository
Funding Agency: Generic Data Repository
University: Institutional Data Repository

AND THEY ALL WANT DIFFERENT METADATA!!!!

Comparing Repository Types:

Repository | Advantages | Disadvantages
Local data repository | Easy! No one steals your data. | No one sees it. Not compliant with requirements
Institutional Repository | Not very difficult. Administrators are happy. | Data can’t easily be reused. Credit?
Generic data repository | Not very hard to do. Have complied! | Data can’t be easily reused. Credit…
Domain-specific data repository | Data can be reused. Credit! | Lot of work – for curators

Overlay labels on the slide: Habit, Ease, Privacy, Control; Effort, Reuse, Credit, Compliance; MORE ANNOTATION

Conclusions for data annotation:

“Instead of building newer and larger weapons of mass destruction, I think mankind should try to get more use out of the ones we have”
Deep Thoughts by Jack Handey

•  Let’s use the data standards we already have – and agree on using the same ones
•  Work with existing data repositories in a field to come to a lowest common denominator of metadata
•  Tailor the systems to be optimally easy to use for scientists in terms of metadata: add as little as you have to, as few times as you can.
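
One way to read “lowest common denominator of metadata” is as the set of fields that every participating repository already asks for. A minimal sketch of that reading follows; the repository names and field lists are invented for illustration and do not describe any real repository's schema.

```python
def common_fields(schemas: dict[str, set[str]]) -> set[str]:
    """Return the metadata fields shared by every repository schema."""
    if not schemas:
        return set()
    return set.intersection(*schemas.values())


# Hypothetical field lists, made up for this example only.
EXAMPLE_SCHEMAS = {
    "domain_specific": {"title", "creator", "date", "sample_id", "method"},
    "generic": {"title", "creator", "date", "license"},
    "institutional": {"title", "creator", "date", "department"},
}

shared = common_fields(EXAMPLE_SCHEMAS)
print(sorted(shared))  # ['creator', 'date', 'title']
```

Fields outside the shared set would then be requested only by the repository that needs them, so a scientist enters the common ones once.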

Summary:

•  Data Preservation:
–  Tailor tools to fit scientists’ workflow – follow the experiment!
–  We are creating repositories of shared experiments: Enable demonstrably better science!
•  Data Use:
–  Allow owner full control over who sees which data - create social networking context
–  Collectively pioneer long-term funding options; support/develop ‘shared mission’ funding challenges
•  How annotation can help reuse:
–  Collaborate between (generic/specific, institutional, cross-national) data facilities to integrate repositories, enable cross-repository usage and reuse existing metadata.

Questions?

Anita de Waard
VP Research Data Collaborations
a.dewaard@elsevier.com
http://researchdata.elsevier.com/

Elsevier Research Data Services Goals:

1.  Increase Data Preservation: Help increase the amount and quality of data preserved and shared
2.  Improve Data Use: Help increase the value and usability of the data shared by increasing annotation, normalization, and provenance, enabling enhanced interoperability
3.  Develop Sustainable Models: Help measure and deliver credit for shared data to the researchers, the institute, and the funding body, enabling more sustainable platforms.

Guiding Principles of RDS:

•  In principle, all open data stays open and URLs, front end etc. stay where they are (i.e. with repository)
•  Collaboration is tailored to data repositories’ unique needs/interests (‘service-model’ type):
–  Aspects where collaboration is needed are discussed
–  A collaboration plan is drawn up using a Service-Level Agreement: agree on time, conditions, etc.
•  Transparent business model
•  Very small (2/3 people) department; immediate communication; instant deployment of ideas.

“But aren’t you guys in it for the money?”

•  Yes, we are, like most businesses…
•  Is your real question perhaps: ‘Does no one want to work with you anymore because of the Open Access debate?’
•  The OA debate focuses on three issues:
–  IPR and Access issues: E.g. BY-NC-SA? Github? ..?
–  Opaque business models: E.g. Gold Open Access? Shared funding model? Commercial analytics with shared royalties?
–  Lack of perceived added value: We offer a service: only use it if it’s any good!
