Towards FAIR Open Science with PID Kernel Information: RPID Testbed
1. Towards FAIR Open Science with PID Kernel Information: the RPID Testbed
Beth Plale
School of Informatics, Computing and Engineering
Data To Insight Center, Indiana University
Basarim 2017, Istanbul, Turkey, 15 Sep 2017
2. The ideas expressed here have been shaped through conversations in the Research Data Alliance (RDA). Special thanks to Peter Wittenburg, Tobias Weigel, and Larry Lannom. Ideas are being put into action through a US NSF funded project called the Robust PID (RPID) Testbed. Project partners include Beth Plale, Robert Quick, and Robert McDonald (Indiana University); Bridget Almas (Tufts University); and Larry Lannom (CNRI). The opinions expressed here are those of the author alone and do not represent the views of the US National Science Foundation.
3. Scientific data today is baskets of apples across random orchards. Discovery is a blindman's bluff game. Commitment to data as it ages, a mere hope. Cartoon credit: Auke Herrema.
4. The Internet is a worldwide network of connected computers. Computers have an IP address that uniquely identifies a device on the network. Imagine a worldwide network of data objects. Data objects persist (until they don't). Objects are findable, accessible, interoperable, and usable (especially reusable).
5. Guiding abstraction for Data Sharing: identifies entities and stakeholders. Of interest to technologists and policy makers alike.
6. Fecher B, Friesike S, Hebing M (2015) What Drives Academic Data Sharing? PLOS ONE 10(2): e0118053. https://doi.org/10.1371/journal.pone.0118053
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118053
7. Fecher B, Friesike S, Hebing M (2015) What Drives Academic Data Sharing? PLOS ONE 10(2): e0118053. https://doi.org/10.1371/journal.pone.0118053
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118053
This piece is actually a network.
8. Data Object Layer (diagram): a network of independent, globally unique and persistent Data Objects (nodes labeled A through G in the figure) that have relationships between them, such as "is part of", that we should exploit.
9. Repositories and Data Objects: in reality Data Objects reside in repositories. Data objects reside in repositories but should not be completely controlled by repositories.
10. Open science. Open science is an umbrella term for transparent science with ease of access to all products from beginning to end. Image credit: Gema Bueno de la Fuente, CC-BY.
11. Open science. Risk in defining open science too broadly. Open science must respect boundaries set by law or decency: licenses, copyright, human subjects privacy. Open Science increasingly connected to FAIR principles: Findable, Accessible, Interoperable, Reusable.
12. FAIR Guiding Principles
1. To be Findable any Data Object should be uniquely and persistently identifiable
1.1. Same Data Object should be re-findable at any point in time, thus Data Objects should be persistent, with emphasis on their metadata
1.2. Data Object should minimally contain basic machine actionable metadata that allows it to be distinguished from other Data Objects
1.3. Identifiers for any concept used in Data Objects should therefore be Unique and Persistent
13. FAIR Guiding Principles
2. Data is Accessible in that it can be always obtained by machines and humans
2.1 Upon appropriate authorization
2.2 Through a well-defined protocol
2.3 Thus, machines and humans alike will be able to judge the actual accessibility of each Data Object
14. FAIR Guiding Principles, cont.
3. Data Objects can be Interoperable only if:
3.1. (Meta)data is machine-actionable
3.2. (Meta)data formats utilize shared vocabularies and/or ontologies
3.3 (Meta)data within a Data Object should thus be both syntactically parseable and semantically machine-accessible
15. FAIR Guiding Principles, cont.
4. For Data Objects to be Re-usable additional criteria are:
4.1 Data Objects should be compliant with principles 1-3
4.2 (Meta)data should be sufficiently well-described and rich that it can be automatically (or with minimal human effort) linked or integrated, like-with-like, with other data sources
4.3 Published Data Objects should refer to their sources with rich enough metadata and provenance to enable proper citation
16. Our vision
• Starts with a data network based on the Digital Object Architecture (DOA), a distributed architecture of services spread worldwide that together identify and resolve digital objects
• DOA first espoused by Internet founder Robert Kahn in the mid-1980s
• DOA is a network of Handle servers at its core
17. The Digital Object Architecture serves as base infrastructure only. DOA is silent on issues of modeling data objects themselves: their content, their relationship to their own metadata, and the relationship between data objects. For object modeling we turn to FAIR principles and PID Kernel Information.
18. Data Object Model based on FAIR principles
Data modeling questions address these issues:
1) What goes into a data object?
2) Should a data object include its metadata, or should the metadata be a new object, or both?
3) What kind of metadata should be considered?
4) What is the granularity of a data object?
5) Where does kernel information come in?
19. Persistent IDs are the backbone of data sharing [primary and secondary use].
20. Persistent IDs (PID)
-- names a data object with a name that is globally unique
-- data object can be metadata, data or a digital proxy to a physical object
-- is persistent over time
21. PID makeup
• Handles have a prefix assigned to a Local Handle Server
• Suffix is under control of the Local Handle Server
• e.g., the RPID testbed assigns only test temporary handles:
  – 11723.1.test, 11723.2.test, ... 11723.8.test: assigned for internal use
  – 11723.9.test.<proj name>: assigned to projects; avoids collisions within the LHS namespace
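Not from the slides: as a small illustration of the handle makeup described above, the Python sketch below splits a handle name into its prefix (the naming authority, which routes resolution to a Local Handle Server) and its locally chosen suffix. The example handle, 11723.9.test.myproj/dataset-0001, is hypothetical and only mimics the testbed naming convention.

```python
# Illustrative sketch only: split a handle name into prefix and suffix.
# The example handle is hypothetical, echoing the 11723.9.test.<proj name> convention.
def split_handle(handle):
    prefix, sep, suffix = handle.partition("/")
    if not sep:
        raise ValueError("not a handle name (missing '/' separator): " + handle)
    return prefix, suffix

prefix, suffix = split_handle("11723.9.test.myproj/dataset-0001")
print(prefix)   # 11723.9.test.myproj  (naming authority, routes to a Local Handle Server)
print(suffix)   # dataset-0001         (local name chosen by the project)
```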
22. The Handle system allows key-value information to be stored at a Local Handle Server
-- names a Data Object with a name that is globally unique
-- Data Object can be metadata, data or a digital proxy to a physical object
-- is persistent over time
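Not from the slides: to see that key-value record structure concretely, the sketch below reads a handle record through the public Handle HTTP proxy's REST interface at hdl.handle.net, which returns the record as a list of typed index/type/data entries. The handle queried is the DOI-backed handle from the Fecher et al. citation on slides 6-7, used only because it is publicly resolvable; the requests package is an assumed dependency.

```python
# Illustrative sketch only: list the key-value entries of a handle record via the
# public Handle HTTP proxy REST API. Assumes the 'requests' package is installed.
import requests

def handle_record(handle):
    url = "https://hdl.handle.net/api/handles/" + handle
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json().get("values", [])   # list of {index, type, data, ...} entries

# DOI names are handles too; this one comes from the citation on slides 6-7.
for entry in handle_record("10.1371/journal.pone.0118053"):
    print(entry["index"], entry["type"], entry["data"].get("value"))
```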
23. Handle resolution in a Digital Object Architecture (diagram). A client uses the PIT API SDK to query the Handle System: the Global Handle Servers answer the prefix-authority query and return the Local Handle Service IP; the Local Handle Service resolves the local handle and returns the handle information, including the stored PID kernel information; the Data Type Registry Service, which stores type definitions for kernel information, is queried with the profile PID and returns the DTR profile definition (e.g., PID to Profile, URL to target). The client ends up with filtered PIDs and trusted PIDs. Scale annotations in the figure: [80…100] GHS, [1000…5000] LHS, [1..10].
24. What should go into the PID Kernel Information?
PID Kernel Information is a small amount of information stored at the resolver (Local Handle Server) in the PID record of a PID.
Inspiration: take the FAIR principles as a guide. How far can PID Kernel Information aid in implementing FAIR?
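Not from the slides, and with field names that are purely hypothetical rather than any agreed profile: one way to picture a PID Kernel Information record cached in the PID record is as a small typed dictionary that points back to its profile in a Data Type Registry and carries just enough metadata (type, location, checksum, a pinch of provenance, any access restriction) for a client to act without dereferencing the object.

```python
# Hypothetical PID Kernel Information record; every field name here is illustrative.
# In practice the attribute set would be defined by a profile registered in a DTR.
kernel_info = {
    "kernelInformationProfile": "11723.9.test.dtr/profile-research-data",  # made-up profile PID
    "digitalObjectType":        "11723.9.test.dtr/type-tabular-dataset",   # made-up type PID
    "digitalObjectLocation":    "https://repo.example.org/objects/dataset-0001",
    "checksum":                 "sha256:9f2c...",                 # integrity check, truncated here
    "dateCreated":              "2017-09-15T00:00:00Z",
    "wasDerivedFrom":           "11723.9.test.myproj/raw-0001",   # minimal provenance link
    "accessRestriction":        "metadata-only",                  # e.g., privacy or legal limits
}
```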
25. Kernel Information is Cached
• By FAIR principle 1.1, a Local Handle Server is not a metadata repository so cannot serve as the authoritative source for any form of metadata for a data object
• Thus Kernel Information is a cached copy of metadata that is stored and stewarded elsewhere
• FAIR principle 1.1: Same Data Object should be re-findable at any point in time, thus Data Objects should be persistent, with emphasis on their metadata
26. A promising candidate for Kernel Information is Provenance
Imagine a world where PIDs identify just about everything:
-> Internet of Things
-> Movie clips
-> Smart city sensor data
-> Pages from digitized books
-> Baby food containers
27. Further imagine an Internet-scale data client that is handed a list of 100,000,000 PIDs. How does the client quickly sift through the list to find research data objects? Further suppose the client is able to winnow the list down to just research data objects; how does it then quickly discard fakes?
28. Use case (diagram): a client filters a list of millions of PIDs to identify research data and makes a simple determination of trust. The client queries the Handle System (the Global Handle Registry for the prefix authority, the Local Handle Service for the local handle) and receives the handle information; the Local Handle Services (scale [1000…5000]) store the PID Kernel Information; the Data Type Registry Service stores type definitions for kernel information and returns the DTR profile definition for a profile PID. Outputs: filtered PIDs and trusted research PIDs.
29. A client working with PID Kernel Information looks at each PID in the list and accepts those that have:
-- a Kernel Information profile stored in a Data Type Registry (DTR),
-- that profile is associated with RDA (in some unspecified manner),
-- PID Kernel Information holds a tiny amount of data provenance from which a basic sense of trust is derived
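Not from the slides: a minimal sketch of that client-side filter, assuming two helper callables that are hypothetical stand-ins, resolve_kernel_info(pid) returning the cached kernel information as a dict and profile_is_rda_associated(profile_pid) answering the DTR/RDA question; the field names reuse the hypothetical record sketched after slide 24.

```python
# Illustrative sketch only: accept a PID when its cached kernel information carries a
# DTR-registered, RDA-associated profile and at least a minimal provenance link.
def accept(pid, resolve_kernel_info, profile_is_rda_associated):
    ki = resolve_kernel_info(pid) or {}
    profile = ki.get("kernelInformationProfile")
    if not profile:
        return False                              # no Kernel Information profile in a DTR
    if not profile_is_rda_associated(profile):
        return False                              # profile not associated with RDA
    return bool(ki.get("wasDerivedFrom"))         # basic sense of trust from provenance

def filter_research_pids(pids, resolve_kernel_info, profile_is_rda_associated):
    return [p for p in pids
            if accept(p, resolve_kernel_info, profile_is_rda_associated)]
```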
30. Kernel Information for FAIR Accessibility
• By FAIR principle 2, Kernel Information conveys accessibility information, thus making it easier to navigate direct data object access
• Includes privacy or legal restrictions on a data object that may limit access to, say, the object's metadata alone
FAIR Principle 2: Data is Accessible in that it can be always obtained by machines and humans. 2.1 Upon appropriate authorization. 2.2 Through a well-defined protocol. 2.3 Thus, machines and humans alike will be able to judge the actual accessibility of each Data Object.
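Continuing the same hypothetical sketch (none of these field names or helpers come from the deck): the accessibility hints cached in kernel information let a client decide, before contacting the repository, whether to request the object itself, only its metadata, or to stop and obtain authorization.

```python
# Illustrative sketch only: route a request based on the accessRestriction hint cached
# in the hypothetical kernel information record, before touching the repository.
def fetch(pid, kernel_info, get_object, get_metadata):
    restriction = kernel_info.get("accessRestriction", "open")
    if restriction == "open":
        return get_object(pid)          # dereference digitalObjectLocation directly
    if restriction == "metadata-only":
        return get_metadata(pid)        # privacy/legal limits: metadata alone
    raise PermissionError(pid + ": authorization required (" + restriction + ")")
```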
31. Use case (diagram): filter a list of a million PIDs to identify research data, make a simple determination of trust, then retrieve the data. The client queries the Handle System (the Global Handle Registry for the prefix authority, the Local Handle Service for the local handle) and receives the handle information plus the PID Kernel Information; the Data Type Registry Service (scale [1..10]) returns the profile definition for PID Kernel Information. Repository Access: retrieve the data object as per the access and rights restrictions in the PID KI.
32. PID Kernel Information Summary
• Exploration driven by identifying and evaluating the minimal information that can go into Kernel Information to help make Data Objects FAIR and less dependent on the repository system to enforce FAIRness
• Long term goal: smart data objects
• Kernel information has the potential to spawn a new ecosystem of data services for smart data objects
33. RPID testbed
• Suite of software services for use by the community
  – Data type registry (RDA)
  – PIT API (RDA)
  – Handle service
• Exploratory services
  – PID Kernel Information
  – Mapping CTS URNs to handles
  – Packaging for use by others
• Help and advice
• User advisory group
34. RPID Testbed (diagram): Data Type Registry; Handle Service (prefix: 11723); Service Installation; Testing for Reproducibility; 36-Month Testbed.
35. Who can use the Testbed
The RPID testbed is open for research, education, non-profit, or pre-competitive use.
36. Fecher B, Friesike S, Hebing M (2015) What Drives Academic Data Sharing? PLOS ONE 10(2): e0118053. https://doi.org/10.1371/journal.pone.0118053
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118053
Summary: foundational infrastructure for data sharing is a FAIR-inspired Digital Object Architecture with PID Kernel Information.
37. In conclusion, this work proposes:
– Level 1a, data resolution: Digital Object Architecture [Kahn]
– Level 1b, high-level data filtering: PID Kernel Information
– Level 2: FAIR principles as the data object layer
• Thus it contributes to Open Science with foundational infrastructure enabling a new ecosystem of data services
• Follow the work at:
  – https://github.com/rpidproject
  – RDA PID Kernel Information Working Group
  – Reach us at rpid-l@iu.edu
Acknowledgements: this work was funded in part by the National Science Foundation under grants 1659310 and 1349002.