The document discusses DataONE, a project aimed at improving data repository interoperability and advancing best practices in data lifecycle management. It focuses on enabling access to multiple external data repositories from within a HUB environment. This would allow users to aggregate and integrate disparate datasets for new analyses, and enable reproducible workflows. The goal is to address issues around scattered and dispersed data by improving discovery, integration and long-term preservation of datasets.
Presentation Title: Grand Challenges and Big Data: Implications for Public Participation in Scientific Research
Presenter: William Michener, Professor and PI/Director of DataONE, University Libraries, University of New Mexico
Presentation given at a topic meeting of the Main Library of the University of Zurich on "New Open Access topics of relevance to academic libraries", 23 July 2012
This document discusses using MapReduce and HDFS to efficiently process large remote sensing images in parallel. It provides context on the large volume of data from remote sensing (e.g. 1.2GB for a 1km resolution image with 0.5 billion pixels) and challenges of storage, transport and processing. It reviews literature on related projects processing large datasets and key concepts of HDFS for robust distributed storage and MapReduce for parallel processing. Finally, it outlines a planned approach involving initial simple algorithms and expanding to more complex spatial and temporal processing.
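The storage figure quoted above is easy to sanity-check: 0.5 billion pixels at roughly 2.4 bytes per pixel gives 1.2 GB. A minimal sketch (plain Python, my own illustration rather than anything from the slides):

```python
# Back-of-envelope check of the remote sensing image size mentioned above.
# Assumption (not stated in the slides): bytes-per-pixel is a free parameter.

def image_size_gb(pixels: float, bytes_per_pixel: float) -> float:
    """Return raw image size in gigabytes (10**9 bytes)."""
    return pixels * bytes_per_pixel / 1e9

# 0.5 billion pixels at ~2.4 bytes/pixel reproduces the quoted 1.2 GB.
size = image_size_gb(0.5e9, 2.4)
```

At higher resolutions or with multiple spectral bands the same formula quickly reaches terabyte scale, which is what motivates splitting such images across HDFS blocks for parallel processing.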
The document outlines an agenda for a computing facilities event, including a morning session on high performance computing and NeSI with demonstrations, and an afternoon session featuring a local researcher case study and group discussions on how NeSI can benefit projects. There will also be representatives available to discuss the Science DMZ and research identity federation.
Are cloud based virtual labs cost effective? (CSEDU 2012), Nane Kratzke
Cost efficiency is an often-cited strength of cloud computing. In times of shrinking educational budgets, virtual labs provided via cloud computing might be an interesting alternative for higher-education organizations or IT training facilities. This contribution analyzes the cost advantage of virtual educational labs provided by cloud computing means and compares these costs with those of classical educational labs provided in a dedicated manner. It develops a four-step decision-making model that may interest colleges, universities, and other IT training facilities planning to implement cloud-based training facilities, and it presents findings on when cloud computing has economic advantages in education and when it does not. The four-step decision-making model, being of general IaaS applicability, can be used to determine whether an IaaS cloud-based virtual IT lab approach is more cost efficient than a dedicated approach.
The document discusses the Wf4Ever project, which aims to create a technological infrastructure for preserving and enabling efficient retrieval and reuse of scientific workflows across disciplines. The project will develop complex research objects that account for the static and dynamic nature of workflows. It will also semantically archive workflows and associated materials to allow for advanced search and recommendation. The project aims to support scientific communities in collaboratively sharing, reusing, and evolving workflows. Key challenges include ensuring quality, preservation, sharing/reuse, classification of workflows and associated resources.
This document summarizes key aspects of computational research methods and the myExperiment platform. It discusses how myExperiment allows researchers to automate, share, and reuse workflows and other methods. It also addresses challenges around reproducibility, provenance, collaboration, and incentives for sharing methods. MyExperiment provides social features and aims to build a community around openly exchanging and improving computational research techniques.
Tim Malthus, Towards standards for the exchange of field spectral datasets (TERN Australia)
This document discusses the development of standards for the exchange of field spectral datasets. It notes the importance of metadata for determining the quality and representativeness of spectral data obtained in the field. A workshop was held in 2012 to discuss best practices for data collection and exchange and key conclusions included the need for standards to facilitate accurate comparison across studies and the role of thorough metadata. Work is ongoing to enhance the SPECCHIO system for hosting spectral libraries and metadata and establishing it as the international tool for storage and exchange of spectral datasets.
The explosion of data creation across all scholarly disciplines necessitates corresponding efforts to create new solutions for its management and use. Ever-growing repositories and datasets within require organization, identification, description, publication, discovery, citation, preservation, and curation to allow these materials to realize their potential in support of data-driven, often interdisciplinary research. What infrastructures and technical environments are required for this work? Can new approaches, specifications, standards and best practices be created? Are there partnerships and collaborations that exist or can be pursued? This webinar, Part 2 of a two-part NISO series on data, will explore these and other questions
Publication and long term archival of observational data in the field of environmental sciences is a challenging topic of today's eScience research. The amount of effort that goes into technical and scientific quality assurance prior to publication is considerable and might well turn out to be a barrier to data publication. Our project's goal is to lower the amount of manual effort and, at the same time, increase data quality in the process of submitting observational data for publication – in this case meteorological observational data. This goal is divided into the following subgoals:
Establish a standard procedure for the publication of observational data in the area of meteorology including quality information.
Develop a workflow system for the automation of the publication process.
Make the procedure usable for environmental sciences in general.
Integrate the procedure into an existing central data repository for meteorology (the CERA database at the World Data Center for Climate).
This talk is about the current state of the project from an eResearch and technical point of view.
Big Data and Advanced Data Intensive Computing, Jongwook Woo
MapReduce does not work well for real-time processing or iterative algorithms, which are common in machine learning and graph analytics. This slide deck presents Spark, Giraph, and Hadoop use cases in science rather than in business.
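The iteration problem can be made concrete with a toy example. The sketch below (plain Python, not actual Spark or Hadoop code) runs a PageRank-style loop; in MapReduce each pass would be a separate job that rereads its input from disk, whereas Spark keeps the working set (`ranks`) cached in memory between passes.

```python
# Toy illustration of an iterative graph algorithm (classic PageRank update).
# The same dataset is refined 10 times; this repeated reuse is what Spark's
# in-memory caching accelerates and what per-job disk I/O makes slow in
# MapReduce.  The three-page graph is illustrative only.

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1.0 for page in links}

for _ in range(10):                      # 10 passes over the SAME data
    contribs = {page: 0.0 for page in links}
    for page, outs in links.items():
        share = ranks[page] / len(outs)  # each page splits its rank evenly
        for out in outs:
            contribs[out] += share
    # damping factor 0.85, as in the standard PageRank formulation
    ranks = {p: 0.15 + 0.85 * c for p, c in contribs.items()}
```

Giraph avoids the same overhead differently, by keeping vertex state resident across supersteps in its bulk-synchronous-parallel model.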
Paper ref:
Missier, Paolo, Bertram Ludascher, Saumen Dey, M. Wang, Timothy McPhillips, Shawn Bowers, and Michael Agun. "GoldenTrail: Retrieving the Data History that Matters from a Comprehensive Provenance Repository." In Procs. 7th International Digital Curation Conference (IDCC), 2011.
Introduction to Research Data Management for postgraduate students, Marieke Guy
The document provides an introduction to research data management for postgraduate students, outlining what research data is, the research process, what research data management involves and why it is important, and how students can start thinking about good research data management practices. It discusses defining and organizing data, storage and security, and maintaining findable and understandable data throughout the research lifecycle. The goal is to explain the importance of research data management and the roles students play in effective data management.
Cloud Economics in Training and Simulation, Nane Kratzke
This document discusses a presentation on cloud economics in training and simulation. It begins with defining cloud computing and outlining its essential characteristics like on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service. Some postulated use cases for cloud computing are then discussed, including training and education. Real-world data is then presented from a course that utilized Amazon Web Services, analyzing costs, cost drivers, and server usage. The findings provide insights into the economics of educational cloud usage.
1. The document discusses the challenges of widespread adoption of e-research technologies by everyday researchers. While early adopters found success, most researchers are not using the infrastructure services that have been created.
2. It argues that repositories and other e-research tools need to focus on the needs and perspectives of researchers. Researchers work with data, so tools should emphasize data sharing and metadata. They should also support collaboration and open participation in the scientific process.
3. For technologies to truly enable new forms of research, their use needs to become integrated into the everyday work of all researchers, not just a specialized few. Systems must be easy to use, empower researchers' autonomy, and intersect seamlessly with digital and physical environments.
White Paper: Hadoop in Life Sciences — An Introduction (EMC)
This White Paper reviews the Apache Hadoop technology, its components — MapReduce and Hadoop Distributed File System — and its adoption in the life sciences with an example in Genomics data analysis.
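A genomics workload that maps naturally onto this model is k-mer counting. The sketch below (plain Python standing in for Hadoop streaming mapper and reducer scripts; the reads and the value of k are illustrative, not taken from the white paper) shows the two phases:

```python
from collections import defaultdict

# Illustrative MapReduce pattern for k-mer counting, a common Hadoop
# workload in genomics.  In a real cluster, mapper and reducer would be
# separate processes and Hadoop would shuffle/group keys between them.

def mapper(read: str, k: int = 3):
    """Emit (kmer, 1) for every k-length substring of a sequencing read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1

def reducer(pairs):
    """Sum the counts per k-mer (the grouping Hadoop performs by key)."""
    counts = defaultdict(int)
    for kmer, n in pairs:
        counts[kmer] += n
    return dict(counts)

reads = ["GATTACA", "TACAGAT"]   # toy reads, not real sequencing output
counts = reducer(pair for read in reads for pair in mapper(read))
```

Because each read is processed independently in the map phase, the work parallelizes across however many HDFS blocks hold the input.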
This document contains 100 frequently asked questions about financial concepts. Most of the questions have clear answers, though some are open to nuance. The aim is to help the reader recall, clarify, and discuss useful concepts in finance such as cash flows, book value, value creation, discount rates, valuation of companies and divisions, and capital structure. Each question has a brief answer at the end.
This document proposes a unified framework for language teaching in schools (L1, L2, and foreign languages) based on four pillars: 1) competency- and task-based teaching, 2) an integrated curriculum, 3) cooperative and interactive learning, and 4) the use of ICT. The aim is to overcome the isolation of language subjects and enable the inclusion of all students.
Côte d'Ivoire ranks 167 out of 183 economies on the ease of doing business. Starting a business and dealing with construction permits are particularly difficult, taking over a month and over 500 days respectively. While getting credit has relatively strong legal rights, the depth of credit information and coverage of public registries are low compared to the best performers globally.
This document describes the evolution of administrative thought from its origins to the classical school. It began with practical administration in organizations and later developed as a field of study influenced by the Industrial Revolution and commercialism. The scientific management and classical theory schools dominated the twentieth century, the former focusing on worker efficiency and the latter on organizational structure. Both sought to apply scientific methods to improve administration.
ESI Supplemental Webinar 2 - DataONE presentation slides (DuraSpace)
This document provides an overview of a webinar on DataONE, a project that aims to provide tools and approaches for supporting the data life cycle. The webinar covered three key challenges in data management: preservation and planning, discovery, and innovation. It discussed how DataONE is working to address these challenges through its coordinated network of member nodes that allow for data preservation, sharing and discovery. The webinar also demonstrated some of DataONE's tools like the DMPTool for data management planning and the Investigator Toolkit for data analysis and visualization.
Scientific discovery and innovation in an era of data-intensive science
William (Bill) Michener, Professor and Director of e-Science Initiatives for University Libraries, University of New Mexico; DataONE Principal Investigator
The scope and nature of biological, environmental and earth sciences research are evolving rapidly in response to environmental challenges such as global climate change, invasive species and emergent diseases. Scientific studies are increasingly focusing on long-term, broad-scale, and complex questions that require massive amounts of diverse data collected by remote sensing platforms and embedded environmental sensor networks; collaborative, interdisciplinary science teams; and new tools that promote scientific data preservation, discovery, and innovation. This talk describes the challenges facing scientists as they transition into this new era of data intensive science, presents current solutions, and lays out a roadmap to the future where new information technologies significantly increase the pace of scientific discovery and innovation.
This document discusses research objects as a framework for facilitating the exchange and reuse of digital knowledge. Research objects are defined as semantically rich aggregations of resources that support a research objective. They allow for workflows, data, documents and other resources to be bundled together and shared. The document outlines several motivating projects, challenges in developing research object models and vocabularies, and a vision for how research objects could allow research to be more efficient, effective and ethical through increased reuse of digital knowledge.
This document summarizes Rob Grim's presentation on e-Science, research data, and the role of libraries. It discusses the Open Data Foundation's work in promoting metadata standards like DDI and SDMX. It also outlines the research data lifecycle and how metadata management can help libraries support research through services like data registration, archiving, discovery and access. Finally, it provides examples of how Tilburg University library supports research data through services aligned with data availability, discovery, access and delivery.
This document discusses Juan de Dios Santander Vela's work on the Wf4Ever project to preserve scientific workflows. The Wf4Ever project aims to develop technological infrastructure for preserving, retrieving, and reusing scientific workflows across disciplines. Mr. Santander Vela has worked on making radio astronomy archives and tools interoperable with the Virtual Observatory and is now applying his expertise to the Wf4Ever project goals of archiving, classifying, indexing, and providing access to scientific workflows and materials in semantic repositories. Preserving workflows is important for astronomy research as it allows experiments to be reproduced, repeated, reused, re-purposed, and collaborated on.
Stuart Phinn, Many kinds of infrastructure: resolving and advancing ecosystem ... (TERN Australia)
This document discusses infrastructure for ecosystem science in Australia. It begins by outlining the multi-disciplinary nature of ecosystem science and challenges in funding infrastructure to support data collection, storage, analysis and sharing across disciplines. It promotes a collaborative approach through the TERN network to establish shared infrastructure and standards. Examples are given of coordinated data collection, processing, storage and analysis projects enabled by TERN. The document argues that infrastructure like TERN improves the efficiency and effectiveness of ecosystem science in Australia.
Libby Bishop, Ethics of Data Sharing, NCeSS, June 2009 (final), a.carusi
This document summarizes an overview of ethical frameworks for sharing and reusing qualitative research data presented at a workshop. It discusses the role of archives in facilitating ethical data sharing and building trust. Formal procedures for sharing confidential research data, such as obtaining informed consent and restricting access, are described. The need to consider duties to others beyond direct research participants in the ethical debate is also highlighted.
The Digital Curation Centre was created to help build skills and capabilities around research data management in UK higher education by providing support and guidance to address challenges that individual institutions cannot tackle alone. The document discusses why managing research data has become important due to factors like large datasets, funder requirements, and the need for open science. It also examines some of the challenges around issues like scale, infrastructure needs, policies, and developing skills and incentives around data management.
This document discusses the need to make research data more discoverable and usable by connecting disparate data through metadata. Currently, the majority of research data is stored in isolated locations like personal hard drives, resulting in lost opportunities for analysis across experiments. The document advocates for culture change where researchers curate and share their data in centralized repositories to enable new insights from aggregating and comparing data in connected ways. This would help address challenges like variability between specimens and complexity in living systems that reductionist approaches cannot capture alone. Ensuring long-term sustainability of data repositories and defining roles for libraries and institutions are also discussed.
On demand access to Big Data through Semantic Technologies, Peter Haase
The document discusses enabling on-demand access to big data through semantic technologies. It describes how semantic technologies like Linked Data and ontologies can be used to virtually integrate and provide access to large, heterogeneous datasets across different data silos. The key points are that semantic technologies allow for big data to be accessed and analyzed on-demand in a self-service manner through a "Linked Data as a Service" approach, providing scalable end user access to big data.
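In practice, "Linked Data as a Service" access of this kind means end users pose SPARQL-style triple-pattern queries against a virtual graph. The sketch below (plain Python over an in-memory list; all identifiers are hypothetical) mimics the basic pattern-matching step:

```python
# Toy triple store (illustrative only; a real deployment would expose a
# SPARQL endpoint over virtualized data sources, not an in-memory list).
triples = [
    ("ex:sensor1", "ex:locatedIn", "ex:Zurich"),
    ("ex:sensor1", "ex:measures", "ex:Temperature"),
    ("ex:sensor2", "ex:locatedIn", "ex:Berlin"),
]

def match(pattern):
    """Return triples matching a pattern; None acts as a SPARQL variable."""
    s, p, o = pattern
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "Which subjects are located in Zurich?"
# (roughly: SELECT ?s WHERE { ?s ex:locatedIn ex:Zurich })
hits = match((None, "ex:locatedIn", "ex:Zurich"))
```

An ontology layer adds value on top of this matching by rewriting user-level terms into the source schemas, so the same query can span heterogeneous silos.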
This document discusses the intersection of machine learning and search-based software engineering (ML & SBSE). It provides examples of how data miners can find signals in software engineering artifacts using machine learning techniques. It then argues that better algorithms do not yet necessarily lead to better mining, and emphasizes the importance of sharing data, models, and analysis methods. Finally, it outlines a vision for "discussion mining" to guide teams in walking across the space of local models, with the goal of building a science of localism in ML and SBSE.
The document discusses the Neuroscience Information Framework (NIF), which aims to provide a portal for finding and utilizing web-based neuroscience resources. NIF provides a consistent framework for describing various resources like databases, literature, and images. It allows simultaneous searches across these different data types and is supported by neuroscience ontologies. NIF currently catalogs over 5,000 resources and is working to integrate these diverse data sources to help answer questions and discover gaps in our knowledge about the brain.
Supporting Libraries in Leading the Way in Research Data Management, Marieke Guy
Marieke Guy, Institutional Support Officer, Digital Curation Centre, UKOLN, University of Bath, UK presents on Supporting Libraries in Leading the Way in Research Data Management at Online Information, London 20th -21st November 2012
The document discusses the ISA (Investigation/Study/Assay) framework for enabling data reuse and reproducibility in bioscience research. The ISA framework provides a generic format for rich experimental descriptions and an infrastructure of open source software tools. It aims to minimize the burden of reporting, curating, sharing data and metadata from bioscience experiments to enable comprehension, reuse of data, and reproducibility. The framework promotes community engagement to develop community standards and document use cases.
Stuart Phinn and Andy Lowe: TERN's national ecosystem data infrastructure is d... (TERN Australia)
This presentation outlines how Australia's ecosystem research network TERN can improve ecosystem science and management through long-term data collection and sharing. It discusses the need for sustained ecosystem data infrastructure to address challenges like how ecosystems are changing over time. TERN aims to build a collaborative network where data publication and reuse is standard practice. This will allow large-scale, coordinated data collection and analysis across disciplines. Sustaining long-term essential data collection, modeling, and synthesis through TERN can better inform decision-making and implement evidence-based environmental policy.
If Big Data is data that exceeds the processing capacity of conventional systems, thereby necessitating alternative processing measures, we are looking at an essentially technological challenge that IT managers are best equipped to address.
The DCC is currently working with 18 HEIs to support and develop their capabilities in managing research data. While the aforementioned challenge is not usually core to their expressed concerns, are there particular issues of curation inherent to Big Data that might force a different perspective?
We have some understanding of Big Data from our contacts in the Astronomy and High Energy Physics domains, and the scale and speed of development in Genomics data generation is well known, but the inability to provide sufficient processing capacity is not one of their more frequent complaints.
That’s not to say that Big Science and its Big Data are free of challenges in data curation; only that they are shared with their lesser cousins, where one might say that the real challenge is less one of size than diversity and complexity.
This brief presentation explores those aspects of data curation that go beyond the challenges of processing power but which may lend a broader perspective to the technology selection process.
Curation of scientific data: Challenges for repositories (Chris Rusbridge)
This document discusses challenges related to curating scientific data in repositories. It notes that data is increasingly important as evidence and for verifying scientific results. However, data loses meaning without proper context and curation beginning in the research workflow. The document examines issues like data formats, metadata, access and reuse, citation, and technological challenges for repositories in dealing with diverse data. It also explores who performs data curation roles like individuals, institutions, communities, publishers and national services.
DataONE_cobb_hubbub2012_20120924_v05
1. DataONE: An interoperable data repositories case study
John W. Cobb, R&D Staff and DataONE Leadership Team Member, Oak Ridge National Laboratory
HUBbub 2012, the HUBzero conference, Indianapolis, IN, 24 September 2012
2. Acknowledgment:
• Authorship: This talk represents work of the entire DataONE extended team.
• It especially draws upon slide material from:
• Bill Michener, UNM (esp. recent DataONE AHM, Sept. 18, 2012)
• Amber Budden, DataONE Assoc. Dir. for CE
• DataONE is an NSF supported project (OCI-0830944)
3. Hubs and data repositories
• A personal view (apologies for a possibly mis-informed speaker)
• HUB roots (history and pre-history):
• PUNCH: web portal for running tools (DOI: 10.1109/40.846308)
• -> NanoHUB: application orchestration environment
• + RAPPTURE: rapid application porting and development
• + Frameless VNC windows: seamless hosted environment on clients!
• + Rich collaborative environment and rich user experience!! ("wishlist")
• Repurpose: HUBzero -> hubs explode (ex. NEEShub, a critical advantage for the largest research award in Purdue history)
• Now (and in the recent past), the turn to Hub+Data integration; some successes already
• Opportunity: richer interactions between HUBs and multiple data repositories
• Perhaps, for example: enable multi-project collaboration within PURR?
• Or: integrate NEES DBs with SCEC simulations and IRIS waveforms?
4. Multiple data repository access?
• HUB + database exists
• HUB + external data repository access is a known use case
• But ... what if?
• Access multiple (possibly external) repositories from within a HUB environment?
• Access multiple external repositories with similar data? Say, aggregate all data from state hydrologists? Cf. driNET, http://drinet.hubzero.org
• Integrate disparate data sets for new and novel analysis. Recall Noshir Contractor's comments this morning: teaming and interdisciplinary work has increased impact (Wuchty, Jones, Uzzi)
• Enable reproducible analysis and synthesis via an automated workflow to create synthetic data products
• Programmatic access
• More integration (more than just raw search terms a la Google)
• ...
• What do you want to discover today? (to paraphrase Microsoft)
5. DataONE motivation
• DataONE is a project to address these issues
• Build (assemble/aggregate) data repository interoperability
• Advance the state of the practice in data lifecycle management:
• Planning
• Deposition
• Metadata generation
• Semantic integration
• Workflow and provenance
• Analysis
• Synthesis
• Focus on a broad science area
• Deploy a working CI and grow it
• DataONE: Data Observation Network for Earth
(Diagram: the data lifecycle - Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze)
7. Pressing issues for the digital data lifecycle
(Diagram: the data lifecycle - Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze)
8. Multiple data sources: mutually reinforcing
(Diagram: ordered by increasing process knowledge and decreasing spatial coverage - intensive science sites and experiments; extensive science sites; volunteer & education networks; remote sensing. Adapted from CENR-OSTP.)
9. Scattered data sources: "finding the needle in the haystack"
Data are massively dispersed:
• Ecological field stations and research centers (100s)
• Natural history museums and biocollection facilities (100s)
• Agency data collections (100s to 1000s)
• Individual scientists (1000s to 10,000s to 100,000s)
11. Preservation: poor data practice, "data entropy"
(Figure: information content declining over time - specific details fade after the time of publication, then general details, with further losses at retirement or career change, accident, and death. Michener et al. 1997.)
12. Preservation: data longevity
Study / resource type / half-life:
• Rumsey (2002): legal citations, 1.4 years
• Harter and Kim (1996): scholarly article citations, 1.5 years
• Koehler (1999 and 2002): random web pages, 2.0 years
• Spinellis (2003): computer science citations, 4.0 years
• Markwell and Brooks (2002): biological science education resources, 4.6 years
• Nelson and Allen (2002): digital library objects, 24.5 years
Source: Koehler, W. (2004) Information Research 9(2): 174.
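Since a half-life implies exponential decay, the figures above can be turned into rough survival estimates. A minimal sketch, assuming a simple exponential model; the function name and the ten-year horizon are illustrative, not from the slide:

```python
def surviving_fraction(years: float, half_life_years: float) -> float:
    """Fraction of resources still resolvable after `years`,
    assuming simple exponential decay with the given half-life."""
    return 0.5 ** (years / half_life_years)

# Two half-lives from the table above (years).
half_lives = {
    "Random web pages (Koehler)": 2.0,
    "Digital library objects (Nelson and Allen)": 24.5,
}

for name, hl in half_lives.items():
    print(f"{name}: {surviving_fraction(10, hl):.1%} survive a decade")
```

Under this model a random web page has about a 3% chance of surviving ten years, while a curated digital library object survives with about 75% probability, which is the slide's point about curation in one number.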
13. The Long Tail of Orphan Data
"Most of the bytes are at the high end, but most of the datasets are at the low end" (Jim Gray)
(Figure: volume vs. rank frequency of datatype - specialized repositories (e.g. GenBank, PDB) at the high end ("the ultra-violet divergence"), orphan data in the long tail ("the infrared catastrophe", B. Heidorn).)
14. Data deluge and interoperability: "the flood of increasingly heterogeneous data"
Data are heterogeneous in:
• Syntax (format)
• Schema (model)
• Semantics (meaning)
(Jones et al. 2007)
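The three levels of heterogeneity can be made concrete with a toy harmonization step. Everything below is hypothetical, invented only to illustrate syntax (CSV vs. JSON), schema (different field names), and semantics (Fahrenheit vs. Celsius) in one place:

```python
import csv
import io
import json

# Two hypothetical repository records for the same observation.
site_a_csv = "species,temp_f\nPasserina cyanea,68.0\n"          # Fahrenheit
site_b_json = '{"taxon": "Passerina cyanea", "temp_c": 20.0}'   # Celsius

def to_common(record: dict) -> dict:
    """Map either schema onto one common model with Celsius temperatures."""
    if "temp_f" in record:  # site A's schema and units
        return {"species": record["species"],
                "temp_c": (float(record["temp_f"]) - 32) * 5 / 9}
    return {"species": record["taxon"], "temp_c": record["temp_c"]}

rows = list(csv.DictReader(io.StringIO(site_a_csv)))   # syntax: parse CSV
records = [to_common(rows[0]),                          # schema + semantics
           to_common(json.loads(site_b_json))]          # syntax: parse JSON
print(records)
```

Real interoperability work pushes the `to_common` step out of ad hoc code and into shared metadata and semantics, which is exactly the gap DataONE targets.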
15. Metadata universe (multi-verse)
• There are a multitude of metadata standards
• Discipline and sub-discipline specific
• Each with different terms and context
Source: Jenn Riley, Indiana U. Digital Librarian, http://www.dlib.indiana.edu/~jenlrile/metadatamap/ via John Kunze, Cal. Dig. Lib.
16. Each dot is its own standard! "...billions and billions of worlds..." (Carl Sagan)
17. DataONE CI architectural elements
• Hard-core cyberinfrastructure (CI):
• CI Member Node (MN) data repositories
• Coordinating Node (CN) global metadata repositories
• Simple but powerful REST API/SPI for universal access
• Investigator Toolkit (ITK): software tools to allow access to the data repository collective via familiar access idioms
• Cultural and wetware issues:
• Best practices
• Educational materials
• Workshops and tutorials
• Surveys and assessments
• Scientist, policymaker, citizen engagement
• Collaboration, governance, and sustainability
(Diagram: the Investigator Toolkit (web interface; analysis and visualization; data management; Java, Python, and command-line client libraries) sits atop Member Nodes (service interfaces for resolution and replication, a bridge to non-DataONE member node services, and a data repository object store) and Coordinating Nodes (service interfaces for discovery and registration, plus a coordination layer of identifiers, catalog, preservation, monitoring, and indexing).)
http://mule1.dataone.org/ArchitectureDocs-current/
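A client-side sketch of how that REST API might be addressed. The base URL and the resolve/query endpoint paths are assumptions modeled on the DataONE architecture documentation linked above, not something stated on the slide; the sketch only constructs request URLs, it does not contact the network:

```python
from urllib.parse import quote, urlencode

# Assumed Coordinating Node base URL (illustrative, check the arch docs).
CN_BASE = "https://cn.dataone.org/cn/v2"

def resolve_url(pid: str) -> str:
    """URL asking a Coordinating Node which Member Nodes hold object `pid`."""
    return f"{CN_BASE}/resolve/{quote(pid, safe='')}"

def search_url(query: str, rows: int = 10) -> str:
    """URL for a federated metadata search across all Member Nodes."""
    return f"{CN_BASE}/query/solr/?{urlencode({'q': query, 'rows': rows})}"

print(resolve_url("doi:10.5063/EXAMPLE"))   # hypothetical identifier
print(search_url("abstract:migration"))
```

The point of the unified API/SPI is that the same two calls work regardless of which Member Node actually stores the bytes; resolution and search are Coordinating Node services.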
19. Key cyberinfrastructure elements
• Unique identifiers
• Search and deliver
• Replication
• Federated identity
Usable by people and their agents.
20. Supporting the data lifecycle
(Diagram: ORC, UCSB, and UNM nodes.)
The data lifecycle:
1. Deposition/acquisition/ingest
2. Curation and metadata management
3. Protection, including privacy
4. Discovery, access, use, and dissemination
5. Interoperability, standards, and integration
6. Evaluation, analysis, and visualization
21. DataONE supports data preservation
Three major components for a flexible, scalable, sustainable network:
• Member Nodes: diverse institutions; serve the local community; provide resources for managing their data; retain copies of data
• Coordinating Nodes: retain the complete metadata catalog; indexing for search; network-wide services; ensure content availability (preservation); replication services
• Investigator Toolkit
22. DataONE satisfies architecture requirements
• Enables integration of multiple geographically diverse and metadata-diverse repositories
• Presents collective search results across multiple repositories
• Provides a unified API/SPI for search and programmatic interface: http://mule1.dataone.org/ArchitectureDocs-current/
• DataONE content has unique identifiers (DOIs) for referencable/citable data objects
• Supports both large datasets and the long tails
23. DataONE spurs innovation
• Enables new analysis and synthesis efforts by integrating tasks across repositories
• Provides means for data replication and a basis for repositories to build "data wills" or "data trust" plans
• Provides a platform to develop advanced interoperable workflow tools and semantic integration tools
25. DataONE: supporting scientific data preservation, discovery, and innovation
(Logo wall: current member nodes, member nodes coming soon, current tools, and tools coming soon; includes Queensland University of Technology.)
29. Plans per template (as of June 2012)
Approximate number of plans per template; templates of greatest interest to the DataONE community in red; 2,302 unique users to date.
(Bar chart: per-template plan counts of 339, 287, 197, 159, 133, 133, 124, 101, 71, 65, 60, 46, 37, 36, 34, 17, 15, and 6.)
http://dmptool.org
30. ✔ Check for best practices ✔ Create metadata ✔ Connect to ONEShare
Data & metadata (EML)
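A hedged sketch of the "create metadata" step above: generating a minimal EML-style record. The element names follow the EML schema only loosely; a real record needs the full schema, namespaces, and persistent identifiers:

```python
import xml.etree.ElementTree as ET

def minimal_eml(title: str, creator_surname: str) -> str:
    """Build a skeletal EML-flavored metadata document as a string.

    Illustrative only: real EML requires the eml namespace, a packageId,
    and many more required elements than are shown here.
    """
    eml = ET.Element("eml")
    dataset = ET.SubElement(eml, "dataset")
    ET.SubElement(dataset, "title").text = title
    creator = ET.SubElement(dataset, "creator")
    individual = ET.SubElement(creator, "individualName")
    ET.SubElement(individual, "surName").text = creator_surname
    return ET.tostring(eml, encoding="unicode")

print(minimal_eml("Indigo Bunting occurrences, 2008", "Cobb"))
```

Tooling like the checklist above exists precisely so that researchers do not hand-assemble records like this; the sketch just shows what the artifact being created looks like structurally.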
37. Investigator Toolkit support
(Diagram: the data lifecycle - Plan, Collect, Assure, Describe, Preserve, Discover, Integrate, Analyze - with DMP-Tool supporting the Plan step and Kepler supporting Analyze.)
38. Exploration, visualization, and analysis
• Diverse bird observations and environmental data from 300,000 locations in the US, integrated and analyzed using high performance computing resources
• Inputs: land cover, meteorology, and MODIS remote sensing data
• Model results: occurrence of Indigo Bunting (2008), shown for Jan, Apr, Jun, Sep, and Dec
• The Spatio-Temporal Exploratory Model identifies factors affecting patterns of migration
• Examine patterns of migration
• Infer how climate change may affect bird migration
39. Public Participation in Scientific Research Conference: 4-5 August 2012 in Portland, Oregon, USA, prior to the Ecological Society of America meeting (6-10 Aug.): http://www.birds.cornell.edu/citscitoolkit/conference/2012
40. User assessments
(Chart: baseline (BL) and follow-up (FU) assessments over Years 1-5 for scientists, library policies, librarians, policy makers, and educators.)
41. What standard do you currently use?
(Chart: responses by metadata language across DIF, DwC, DC, EML, FGDC, OpenGIS, ISO, My Lab, and none; reported counts include 676, 266, 97, 96, 95, 95, 26, 21, and 12.)
42. Many are interested in sharing data (percent agree):
• Willing to share data across a broad group of researchers: 81%
• Willing to place at least some of my data into a central data repository with no restrictions: 78%
• Appropriate to create new datasets from shared data: 76%
• Willing to place all of my data into a central data repository with no restrictions: 41%
43. Data Service User Matrix
(Diagram: users - modeler, scientist, resource manager, ecological data librarians - mapped against services: Investigator ToolKit, data management planning, best practices, tools database, training curricula.)
47. DataONE: next steps
• Member node growth:
• Number of member nodes
• Increase the number and size of data sets
• Sustainably, both in terms of resource needs from MNs and in terms of resource demands on DataONE
• New Investigator Toolkit tools (strategically)
• An increasing number of science use cases, with more breakthrough science
• Also, re-purposing DataONE CI outside of Bio/Eco/Env areas in strategic collaborative partnerships
48. Ack: DataONE team and sponsors
Amber Budden, Roger Dahl, Rebecca Koskela, Bill Michener, Robert Nahf, Skye Roseboom, Mark Servilla, Dave Vieglais; Suzie Allard, Nick Dexter, Kimberly Douglass, Carol Tenopir, Robert Waltz, Bruce Wilson; John Cobb, Bob Cook, Ranjeet Devarakonda, Giri Palanismy, Line Pouchard; Patricia Cruse, John Kunze; Sky Bristol, Mike Frame, Richard Huffine, Viv Hutchison, Jeff Morisette, Jake Weltzin, Lisa Zolly; Stephanie Hampton, Chris Jones, Matt Jones, Ben Leinfelder, Andrew Pippin; Paul Allen, Rick Bonney, Steve Kelling; Ryan Scherle, Todd Vision; Ewa Deelman; Deborah McGuinness; Jeff Horsburgh; Robert Sandusky; Bertram Ludaescher; Peter Honeyman; Cliff Duke; Carole Goble; Donald Hobern; Randy Butler; David DeRoure
LEON LEVY FOUNDATION
49. Questions? Contact points:
John W. Cobb, Ph.D., Oak Ridge National Lab
cobbjw@ornl.gov, 865.576.5439
http://www.dataone.org/
http://docs.dataone.org