Supporting Data-Rich Research on Many Fronts

21 May 2012

University of California Curation Center

California Digital Library

California Digital Library (CDL)

Serving the University of California

•  10 campuses
•  360K students, faculty, and staff
•  100s of museums, art galleries, observatories, marine centers, botanical gardens
•  5 medical centers
•  5 law schools
•  3 National Laboratories

CDL supports the research lifecycle

•  Collections
•  Digital Special Collections
•  Discovery & Delivery
•  Publishing Group
•  UC Curation Center (UC3)

Our environment circa 2002-2008

Focus on preservation
For memory organizations
Infrastructure: static
Services: hosted
Content: museum & library
Sustainability: ?

Our environment since 2008

Focus on preservation, now curation (lifecycle)
For memory organizations and now data producers
Infrastructure: static + cloud, VM, bitbucket
Services: hosted + partnered, self-serve
Content: museum & library + research, web crawls
Sustainability: ? (now cost recovery, pay once)

Today’s journey

Data service basics at CDL

• Stable storage (Merritt)
• Stable identifiers (EZID)
• Data citation (DataCite)
• Management (DMPTool)
• Preservation cost modeling

... that enable

• Federation (DataONE)
• Data papers
• Capture (WAS web archiving)
• Excel add-in (DCXL)

The scientific record is at risk

Data dissemination is rare, risky, expensive, labor-intensive, domain-specific, and receives little credit as research output

Global Change
Galactic Change

The changing landscape

•  Ever increasing number, size, and diversity of content
•  Ever increasing diversity of partners and stakeholders
•  Decreasing resources
•  Inevitability of disruptive change
– Technology
– Institutional mission

[Figure: axes labeled RESOURCES and TIME]

Stable storage: Merritt repository

•  Curation repository open to the UC community and beyond
•  Discipline / content agnostic
•  Micro-services architecture
•  Easy-to-use UI or API
•  Hosted or locally deployed

Primary Functions

1. Deposit
2. Manage (metadata, versions, etc.)
3. Access (expose)
4. Share (with other researchers)
5. Preserve

EZID: Long term identifiers made easy

Take control of the management and distribution of your research, share and get credit for it, and build your reputation through its collection and documentation

•  Precise identification of a dataset (DOI or ARK)
•  Credit to data producers and data publishers
•  A link from the traditional literature to the data (DataCite)
•  Exposure and research metrics for datasets (Web of Knowledge, Google)

Primary Functions

1. Create persistent identifiers
2. Manage identifiers (and associated metadata) over time
3. Resolve identifiers
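
As a rough illustration of the third primary function, resolving an identifier, the sketch below maps a DOI or ARK string to a resolver URL. It assumes the public resolvers at doi.org and n2t.net, which are not named on the slide; the function name `resolver_url` and the sample identifiers are hypothetical, and this is not EZID's own code.

```python
def resolver_url(identifier: str) -> str:
    """Map a 'doi:' or 'ark:' identifier string to a resolver URL.

    Assumption: DOIs resolve through https://doi.org/ and ARKs through
    https://n2t.net/; neither resolver is named in the slides.
    """
    ident = identifier.strip()
    lowered = ident.lower()
    if lowered.startswith("doi:"):
        # Drop the 'doi:' label; the DOI resolver expects the bare name.
        return "https://doi.org/" + ident[4:]
    if lowered.startswith("ark:"):
        # ARKs keep their 'ark:' label when handed to the resolver.
        return "https://n2t.net/" + ident
    raise ValueError(f"unsupported identifier scheme: {identifier!r}")


# Made-up identifiers, for illustration only.
print(resolver_url("doi:10.1234/EXAMPLE"))
print(resolver_url("ark:/12345/x9example"))
```

Keeping the scheme check in one place means a caller can store identifiers in their scheme-prefixed form and only build a URL when a link is needed.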

Discovery: DataCite consortium

•  Technische Informationsbibliothek (TIB), Germany
•  Canada Institute for Scientific and Technical Information (CISTI)
•  L’Institut de l’Information Scientifique et Technique (INIST), France
•  Australian National Data Service (ANDS)
•  The British Library
•  Library of the ETH Zürich
•  California Digital Library, USA
•  Library of TU Delft, The Netherlands
•  Office of Scientific and Technical Information, US Department of Energy
•  Purdue University, USA
•  Technical Information Center of Denmark

DMPTool

Meeting funding agencies’ data management plan requirements

•  Connect researchers to resources to create a data management plan
•  NSF and directorates, NIH, NEH, IMLS, foundations plus
•  Customizable

Primary Functions

1. Step-by-step “wizard”
2. Templates and examples
3. Links to institutional resources and agency information
4. Plan publication and sharing

[Figure: Number of Plans Created, Oct 2011 – Feb 2012]

Cost Model 1: Pay as you go

•  Billed/paid annually
[Equation, partially recovered: the price includes a term equal to P if year = 0, and 0 if year > 0]
–  Costs for archival System (A), Workflows (W), Content Types (C), Monitoring (M), and Interventions (V) are considered common goods, and are apportioned equally across all n Producers (P)
•  Model components are represented by two terms: the number of units and the per-unit cost, e.g., k·S
–  Storage cost (S) accounted on a per-Producer basis
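
The slide's full formula did not survive, so the following is only a simplified reading of the bullets above, not the authors' model: common-good costs are summed and split across n Producers, storage is k units at per-unit cost S, and the Producer cost P applies in year 0 only. The function name and all figures are made up for illustration, and the unit counts the slide mentions for other components are omitted.

```python
def annual_price(year, n, A, W, C, M, V, k, S, P):
    """Simplified pay-as-you-go price for one Producer in a given year.

    Assumptions (not given as a formula in the slides):
      - common-good costs A, W, C, M, V are summed and divided by n;
      - storage is k units at per-unit cost S, charged per Producer;
      - the Producer cost P is charged in year 0 only.
    """
    if n <= 0:
        raise ValueError("n must be a positive number of Producers")
    common_share = (A + W + C + M + V) / n
    storage = k * S
    producer = P if year == 0 else 0
    return common_share + storage + producer


# Made-up figures: ten Producers sharing 10,000 of common costs.
costs = dict(n=10, A=5000, W=2000, C=1000, M=1500, V=500, k=4, S=250, P=300)
first_year = annual_price(year=0, **costs)
later_year = annual_price(year=3, **costs)
print(first_year, later_year)
```

Under these assumptions the only difference between the first bill and later bills is the one-time Producer cost.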

Model 2: Pay once, preserve for “T” years

•  Paid-up price for fixed term T
–  A function of r, the annual investment return, and d, the annual decrease in unit cost of preservation
–  G is the cost of providing a year’s preservation service; G0 includes the added first year expense of Producer engagement and registration
–  Setting T = ∞ calculates the price for “forever”
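
The slide's formula is not in the text, so the sketch below shows one plausible discounted-sum reading of these bullets rather than the authors' exact model. It assumes year 0 costs G0, each later year t costs G reduced by a factor (1 - d) per year, and future costs are discounted by (1 + r) per year; the function name `paid_up_price` and the sample numbers are hypothetical.

```python
def paid_up_price(G0, G, r, d, T=None):
    """Price paid once, up front, for T years of preservation.

    Assumption: year 0 costs G0; year t >= 1 costs G * (1 - d)**t and is
    discounted by (1 + r)**t. T=None stands for T = infinity.
    """
    if not 0 <= d < 1 or r <= -1:
        raise ValueError("need 0 <= d < 1 and r > -1")
    q = (1 - d) / (1 + r)  # combined per-year ratio of cost decline and return
    if T is None:
        if q >= 1:
            raise ValueError("the 'forever' price is finite only when r + d > 0")
        # Geometric series q + q**2 + ... = q / (1 - q) = (1 - d) / (r + d)
        return G0 + G * q / (1 - q)
    if T < 1:
        raise ValueError("T must be at least 1 year")
    return G0 + G * sum(q ** t for t in range(1, T))


# Made-up numbers: 5% return, 10% yearly decrease in unit cost.
print(paid_up_price(G0=150, G=100, r=0.05, d=0.10, T=10))
print(paid_up_price(G0=150, G=100, r=0.05, d=0.10))  # T = infinity
```

In this reading the "forever" price stays finite because each future year costs less and is funded by investment return, so the yearly terms shrink geometrically.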

New distributed framework

Flexible, scalable, sustainable network

Coordinating Nodes

•  retain complete metadata catalog
•  subset of all data
•  perform basic indexing
•  provide network-wide services
•  ensure data availability (preservation)
•  provide replication services

Member Nodes

•  diverse institutions
•  serve local community
•  provide resources for managing their data

Traditional articles vs data papers

The collective data product

Need to save data + processing

Algorithms + Data Structures = Programs

Vision for a “data paper”

•  Wrap the unfamiliar in a familiar façade
•  A “data paper” is minimally a cover sheet and a set of links to archived artifacts
•  Cover sheet contains familiar elements: title, date, authors, abstract, and persistent identifier (DOI, ARK, etc.)
•  Just enough to permit basic exposure and discovery
–  Building a basic data citation
–  Indexing by services such as Web of Science, Google Scholar
–  Instilling confidence in the identifier’s stability
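
To make the cover sheet idea concrete, here is a minimal sketch of a data paper as a record holding the familiar elements listed above plus links to archived artifacts, with a helper that builds a basic citation string. The class name, field names, citation layout, and sample values are all assumptions for illustration; the slides do not prescribe a format.

```python
from dataclasses import dataclass, field


@dataclass
class CoverSheet:
    """Hypothetical cover sheet for a data paper."""
    title: str
    date: str            # ISO 8601 date, e.g. "2012-05-21"
    authors: list
    abstract: str
    identifier: str      # persistent identifier (DOI, ARK, etc.)
    artifact_links: list = field(default_factory=list)

    def citation(self) -> str:
        # Assumed layout: Authors (Year): Title. Identifier
        year = self.date[:4]
        names = "; ".join(self.authors)
        return f"{names} ({year}): {self.title}. {self.identifier}"


# Made-up example values.
sheet = CoverSheet(
    title="Example Stream Temperature Readings",
    date="2012-05-21",
    authors=["Doe, J.", "Roe, R."],
    abstract="Hourly temperature readings from a hypothetical field site.",
    identifier="doi:10.1234/EXAMPLE",
    artifact_links=["https://example.org/archive/readings.csv"],
)
print(sheet.citation())
```

The record carries only the minimum the slide asks for, so the same structure could front a spreadsheet, an image set, or a code archive.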

43 public archives
120+ archives total
58K crawls
7,500 + sites
600 million + URLs
40+ TB
24 institutions

Developed with LoC support by CDL, UNT, and others

What are people using WAS for?

Archiving at-risk government websites and publications
Archiving their own university domains
Building web archives to complement library collections
Documenting web coverage of significant events

Data curation for Excel

•  Excel is the database of choice for many researchers
•  Make it easy to share, archive, and publish data
•  Keep up to date at dcxl.cdlib.org

Primary Functions

1. An Excel add-in and web application
2. Metadata description (through extraction and augmentation)
3. Check for good data practices
4. Transfer to repository

Surveyed users and found:

•  Most researchers are unaware of preservation options
•  Documentation practices are poor
•  Excel is just one tool in workflows

A data curation approach at CDL

•  New “data paper” publishing model [GBMF]
•  DataCite consortium and citation standards
•  Other fronts:
•  DataONE global data network [NSF]
•  Merritt: general-purpose data repository
•  EZID: scheme-agnostic & de-coupled creation, resolution, and management of persistent ids
•  Data management plan generator
•  Web archiving service [Library of Congress]
•  Open-source Excel add-in [MS Research & GBMF]

Questions?

John.Kunze@ucop.edu
California Digital Library
http://www.cdlib.org/
