Introduction to Persistent Identifiers| www.eudat.eu |

www.eudat.eu
EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065
Introduction to Persistent
Identifiers
PIDs in EUDAT
Version 2
July 2017
This work is licensed under the Creative
Commons CC-BY 4.0 licence

Content
What are persistent identifiers?
Why use persistent identifiers?
Different persistent identifier systems
The HANDLE system
EPIC PID system
B2HANDLE
Policies
Use cases

PERSISTENT IDENTIFIERS
PID Training

Science Data
Data generation is getting easier/cheaper
Complexity-shift from data generation to data processing &
analysis
The amount of data output is increasing, quality is getting
better
 How to stimulate reuse and enable reproducibility?
Data needs to be
Findable
Accessible
Interoperable
Reusable

Briefly, what are PIDs?
Pointers to data resources
Data files, metadata files, documents …
Globally unique
With infinite lifespan
Can be used to identify and retrieve resources
Can be resolved to the resource
Examples: ISBN, DOIs, PURLs, Handles…

Data Creation Cycle
temporary data
citable data
referable data
raw data
registration & preservation
analysis & enrichment
Citable publication
Persistent and robust
identification

What is the Problem? Why not use simple
URLs?
The URL specifies the
location, on a
particular server, from
which the resource
could be retrieved.
Strictly network
locations for digital
resources.
domain may change
resource may be
relocated
link may change
BUT in the long term, a year or two later, URLs often no longer work
“link rot”

Persistent over time
[Figure: from today to 2030 the PID 11839/abc123 stays the same while the data it identifies moves from http://www.example.com/ to http://www.moved.com/]
Supports access to a resource as it moves from one location to another, by design.

Why can Persistent Identifiers help?
A Persistent Identifier is
distinct from a URL
not strictly bound to a specific server or filename
“A persistent identifier (PID) is a long-
lasting reference to a digital object—a
single file or set of files.”
https://en.wikipedia.org/wiki/Persistent_identifier
[Figure: the PID 11839/abc123 (prefix 11839, suffix abc123) resolves to the resource]
Identifier points to a resource with no actual
knowledge of the resource
Responsibility of the PID owner to keep it up-to-date
when the resource changes

Structure of a Persistent Identifier
Points to a resource
Is globally unique
[Figure: the PID 11839/abc123 (prefix 11839, suffix abc123) resolves to the resource]
Once the PID is created, the
resource is globally addressable.
Data
Metadata
Document
Code
Prefix: designates administrative
domain, comes from an issuing
instance
Suffix: unique in the realm of the
prefix

Persistent over time
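[Figure: from today to 2030 the PID 11839/abc123 stays the same while the data it identifies moves from http://www.example.com/ to http://www.moved.com/]
Stable by design
Update information: the PID record is updated when the resource moves
Redirection: resolution leads to the new location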
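
The indirection described above can be sketched in a few lines. This is a toy illustration only: the in-memory table and the function names resolve and update_location are hypothetical and do not correspond to any real PID service.

```python
# Hypothetical in-memory resolver table: PID -> current location.
locations = {"11839/abc123": "http://www.example.com/"}


def resolve(pid):
    """Return the location currently registered for the PID."""
    return locations[pid]


def update_location(pid, new_url):
    """Point an existing PID at a new location; the PID itself never changes."""
    if pid not in locations:
        raise KeyError("unknown PID: " + pid)
    locations[pid] = new_url


before = resolve("11839/abc123")
update_location("11839/abc123", "http://www.moved.com/")
after = resolve("11839/abc123")
print(before, "->", after)
```

Anyone who cited 11839/abc123 keeps a working reference; only the entry behind the PID was updated.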

PID Benefits
Persistent Identity via Indirection
Static references into fluid systems over time
Data on networks moves
Ownership/responsibility change
Formats change
Embedded IDs
For data object in hand – current state data
Updates
New related entities
Networks of Persistent Links
Data / metadata links
Provenance chains

PID Costs
Extra level of effort / cost on creation
Analysis – what to identify (granularity)
Folders, files
Single measurements in a time series experiment
Coordination across organisations
Maintain resolution system
Persistence requires sustained effort
Organisational discipline
Technology necessary but not sufficient
Analyse cost/benefit ratio
Don’t start unless it is worthwhile
Is your data worth it?

Persistent Identifier structure
Every persistent identifier consists of two parts: its prefix and
a unique local name under the prefix known as its suffix
Prefix - designates administrative domain, is generated by
an issuer, which makes sure that all prefixes are unique
Suffix - local name must be unique under its prefix.
The uniqueness of a prefix and the local name under that
prefix ensure that any identifier is globally unique within the
context of the System.
< PREFIX > / < SUFFIX >
(e.g. 11111/123456745)
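
As a minimal sketch of this structure, a PID string can be split at the first slash into its prefix and suffix. The helper name split_pid is hypothetical, and the sketch assumes that the prefix itself never contains a slash.

```python
def split_pid(pid):
    """Split '<PREFIX>/<SUFFIX>' at the first slash (illustrative helper)."""
    prefix, sep, suffix = pid.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError("expected <PREFIX>/<SUFFIX>, got %r" % pid)
    return prefix, suffix


prefix, suffix = split_pid("11111/123456745")
print(prefix, suffix)
```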

PID Systems
Persistent URLs (PURLs)
Cost: none
Metadata: no additional metadata
Example: purl: GPO/gpo46189
EPIC System
Cost: $50 annual fee per prefix
Metadata: associate any metadata
Example: hdl:11210/123
Digital Object Identifier (DOI)
Cost: fee per DOI + annual fee
Metadata: the INDECS schema, stored in a separate database
Example: DOI: 10.1000/182
Archival Resource Key (ARK)
Cost: none
Metadata: ERC (Electronic Resource Citation) metadata
Example: ark: /12025/654xz321
Based on the Handle System: EPIC and DOI

PID system Requirements
Attach multiple URLs to a PID
Allow part identifiers for complex objects (granularity issue)
Allow attaching of extra metadata to the PID (MD5 check, etc.)
Actionable PIDs (i.e. can be converted to a URL)
HTTP proxy for resolving (use port 80 only)
Controlled by community
Programmable interface for administration of PIDs from applications
Delegation of PID administration to other organisations
Distributed, robust, highly available, scalable
No single point of failure, distributed system with mirroring
Acceptable non-commercial business model

Identifier String Requirements that contribute to persistence
Not based on any changeable attributes of the entity, e.g.:
Location
Ownership
Any other attribute that may change without changing identity
Unique
Avoid conflicts and referential uncertainty
A good PID system should not allow you to use the same suffix twice
Opaque, preferably a “dumb number”
A well-known pattern invites assumptions that may be misleading
Meaningful semantics invite IP wars, language problems
Nice to have
Human-readable
Cut-able, paste-able
Fits common systems, e.g. URI specification
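
Two of these requirements, opaque suffixes and no suffix used twice, can be illustrated with a toy registry. The class SuffixRegistry is hypothetical, and using a random UUID as the "dumb number" is just one possible choice; a real PID system enforces uniqueness on the server side.

```python
import uuid


class SuffixRegistry:
    """Toy registry for one prefix; refuses to register a suffix twice."""

    def __init__(self, prefix):
        self.prefix = prefix
        self._used = set()

    def register(self, suffix):
        """Register a caller-chosen suffix, rejecting duplicates."""
        if suffix in self._used:
            raise ValueError("suffix already in use: " + suffix)
        self._used.add(suffix)
        return self.prefix + "/" + suffix

    def mint(self):
        """Register a random, opaque suffix that carries no meaning."""
        return self.register(str(uuid.uuid4()))


registry = SuffixRegistry("11111")
pid = registry.mint()
print(pid)
```

Because the suffix is random, it says nothing about location or ownership, so neither can change in a way that invalidates the identifier.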

PIDs in EUDAT
EUDAT has adopted
Handle-based persistent
identifiers
A combined solution of
Handle system and EPIC
service
Employing the latest Handle version (v8)
EUDAT developed a library to interact with Handle v8: B2HANDLE

The Handle System
The Handle System is a technology specification for
assigning, managing, and resolving persistent
identifiers for digital objects and other resources.
The protocols specified enable a distributed computer
system to store identifiers (names, known as Handles)
of digital resources and resolve those Handles to the
information necessary to locate, access, and otherwise
make use of the resources.
That information can be changed as needed to reflect
the current state or location of the identified
resource without changing the Handle.

The Handle System
The main goal of the Handle system is to contribute to
persistence.
The Handle system is:
reliable
scalable
flexible
trusted
built on open architecture
transparent

A Handle Record
Handle      Data Type/Key  Index  Handle data               Timestamp
10232/1234  URL            1      https://www.eudat.eu/ex1  2014-04-09 12:46:53Z
            DOMAIN         2      EUDAT                     2014-04-09 12:46:53Z
            HS_ADMIN       100    eudat/user1               2014-04-09 12:46:53Z
PID – handle: 10232/1234
Actionable PID (URL/resolving): http://hdl.handle.net/10232/1234
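
The example record shown above can be written down as plain Python data to make its shape concrete. This is only an illustrative sketch: the dictionary layout and the helpers get_value and actionable_url are hypothetical, not the storage format of a Handle server.

```python
# The example Handle record as plain data (timestamps omitted for brevity).
record = {
    "handle": "10232/1234",
    "values": [
        {"type": "URL", "index": 1, "data": "https://www.eudat.eu/ex1"},
        {"type": "DOMAIN", "index": 2, "data": "EUDAT"},
        {"type": "HS_ADMIN", "index": 100, "data": "eudat/user1"},
    ],
}


def get_value(record, value_type):
    """Return the data of the first entry with the given type, or None."""
    for entry in record["values"]:
        if entry["type"] == value_type:
            return entry["data"]
    return None


def actionable_url(handle, proxy="http://hdl.handle.net"):
    """Prepend the resolver address to turn a handle into a clickable URL."""
    return proxy.rstrip("/") + "/" + handle


print(get_value(record, "URL"))
print(actionable_url(record["handle"]))
```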

Resolving Handle Record
1. Client sends request to Global to resolve 0.NA/10232 (prefix handle for 10232/1234)
2. Global responds with Service Information for 10232
3. Client sends request to the Local Handle Service to resolve hdl:10232/1234
4. Server responds with handle data
[Figure: the Global Registry (e.g. Handle system) returns Service Information describing the Local Handle Service, which consists of a Primary Site and Secondary Sites A and B, each with numbered servers]
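
The two-stage lookup on this slide can be simulated with two dictionaries. This is a toy model under stated assumptions: GLOBAL_REGISTRY, LOCAL_SERVICES and the service name "local-service-10232" are hypothetical stand-ins, and real clients speak the Handle protocol over the network rather than reading dictionaries.

```python
# Stage 1 data: the global registry knows which service is responsible
# for each prefix, stored under the prefix handle 0.NA/<prefix>.
GLOBAL_REGISTRY = {"0.NA/10232": {"service": "local-service-10232"}}

# Stage 2 data: each local service stores the actual handle records.
LOCAL_SERVICES = {
    "local-service-10232": {
        "10232/1234": [
            {"type": "URL", "index": 1, "data": "https://www.eudat.eu/ex1"},
        ],
    },
}


def resolve(handle):
    """Resolve a handle in two stages: global registry, then local service."""
    prefix = handle.split("/", 1)[0]
    # Steps 1 and 2: ask the global registry for the service information.
    service_info = GLOBAL_REGISTRY["0.NA/" + prefix]
    # Step 3: send the request to the responsible local service.
    local_service = LOCAL_SERVICES[service_info["service"]]
    # Step 4: the local service responds with the handle data.
    return local_service[handle]


values = resolve("10232/1234")
print(values[0]["data"])
```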

HANDLE Record Types
Common types
URL: the location the
HANDLE should resolve to
HS_ADMIN: special record
encoding the permissions
configured for this HANDLE
10320/LOC: supports
multiple locations based on
intelligent decision.
Custom EUDAT types
EUDAT/CHECKSUM: Useful for
integrity verification
EUDAT/ROR: Repository of
Records, the ID of the community
repository.
EUDAT/FIO: PID to first ingested
object in the EUDAT domain.
EUDAT/PARENT: PID associated
with the source object in a
replication chain.
EUDAT/REPLICA: List of PIDs
pointing to replicas
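
As a sketch of how a checksum entry supports integrity verification, the code below compares a freshly computed digest with the value stored in a record. The slides do not specify the algorithm or encoding of EUDAT/CHECKSUM, so the MD5 hex digest used here is an assumption, and checksum_matches is a hypothetical helper.

```python
import hashlib


def checksum_matches(data, record_values, algorithm="md5"):
    """Compare the digest of data with the record's EUDAT/CHECKSUM value.

    Assumes the checksum is stored as a hex digest (not confirmed by the
    slides) and returns False when the record has no checksum entry.
    """
    expected = record_values.get("EUDAT/CHECKSUM")
    if expected is None:
        return False
    actual = hashlib.new(algorithm, data).hexdigest()
    return actual == expected.lower()


payload = b"example data object"
record_values = {
    "URL": "https://www.eudat.eu/ex1",
    "EUDAT/CHECKSUM": hashlib.md5(payload).hexdigest(),
}
print(checksum_matches(payload, record_values))
```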

PID SYSTEM IN EUDAT

PID System: How does it work?
PID Service
generate and manage
PIDs for digital objects
according to policies
Example: B2HANDLE
python library (next
section)
PID Replication (as one
of the EPIC policies)
replicate the database of
Handles to partners in
EPIC to guarantee robust
and highly available PID
resolution
Resolution Service
EPIC uses the distributed
network provided by
Handle and extends it
with own local Handle
servers.
Global Handle Mirror
A mirror of the Global
Handle in Europe

Resolution Service
The web address for the Handle resolution service that
EUDAT uses is http://hdl.handle.net.
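
Besides redirecting browsers, the resolver can also be queried programmatically. The sketch below assumes the JSON interface of the Handle proxy under the path /api/handles/ and a response layout with a "values" list; both should be checked against the Handle documentation. The sample response is hand-written for illustration and is not real output, and no network request is made.

```python
import json
from urllib.parse import quote

PROXY = "http://hdl.handle.net"


def rest_url(handle):
    """Build the (assumed) JSON lookup URL for a handle."""
    return PROXY + "/api/handles/" + quote(handle, safe="/")


def first_url(response_text):
    """Return the first URL value found in a JSON response, or None."""
    body = json.loads(response_text)
    for value in body.get("values", []):
        if value.get("type") == "URL":
            return value["data"]["value"]
    return None


# Hand-written sample in the assumed response layout.
sample_response = json.dumps({
    "responseCode": 1,
    "handle": "10232/1234",
    "values": [
        {"index": 1, "type": "URL",
         "data": {"format": "string", "value": "https://www.eudat.eu/ex1"}},
    ],
})

print(rest_url("10232/1234"))
print(first_url(sample_response))
```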

EUDAT options for PIDs
In order to access a data object stored in EUDAT, an
associated persistent identifier is needed.
EUDAT requires integration of Handle in your
infrastructure. Before your community or data centre can
create PIDs you need a prefix. There are two options:
you can run your own Handle system; or
you can pass the details to EUDAT partners to
manage it on your behalf.
An additional benefit of using the EUDAT systems is
access to a Python library to manage your PIDs
(Handles)

What is B2HANDLE?
B2HANDLE is EUDAT’s PID service based on
Handle as technology
EPIC as federation
B2HANDLE offers:
Assignment of prefix via one of the EUDAT partners
Hosting of PIDs, i.e. operation and maintenance of Handle
servers and technical services
Replication and safe-keeping of PIDs via the EPIC
federation
Resolution mechanism based on Handle
Easy maintenance and programmatic resolving of PIDs by
the B2HANDLE Python library for general interaction with
Handle servers

B2HANDLE in other EUDAT services
In the EUDAT ecosystem, EUDAT services make use
of B2HANDLE to:
guarantee data access
provide long lasting references to data and
facilitate data publishing.
B2SAFE and B2SHARE use the service to create and
manage PIDs for their hosted data objects.
B2FIND and B2STAGE use the resolving mechanism of
B2HANDLE to retrieve and refer to objects.

The B2HANDLE Python library
b2handle: A Python library for interaction with EUDAT
Handle services (Handle version 8)
Setuptools-enabled Python package for easy
installation
Can be employed by end-users to programmatically
resolve handles
Credentials to one of the EUDAT Handle servers are
required for creation and maintenance of PIDs
Stable state; official release of v1.0 also for use by
EUDAT user communities

B2HANDLE: Available at GitHub
Code repository: https://github.com/EUDAT-B2SAFE/B2HANDLE

B2HANDLE documentation
Technical documentation: http://eudat-b2safe.github.io/B2HANDLE

B2HANDLE library features
Methods to read, create, modify Handles and their
records
Queries against native Handle REST interface
Support for multiple locations per object (10320/loc
entries)
Automatic management of Handle value indexes
Support for Handle reverse-lookup via additional Java
servlet
Support for resolving any Handle from any issuing
instance

How may I use a PID
When you have a PID use it:
To cite the data behind the PID:
In publications
On web-pages
Include actionable PIDs in linked data
Retrieve the data:
By using the corresponding resolver
Via the actionable PID
E.g. http://hdl.handle.net/11239/GRNET

Policy Document
When to use Persistent Identifiers?
What should the PID resolve to?
There is no “one-size fits all” strategy for
implementing PIDs!
Create a Policy Document of What & When
Analyze the use of PIDs, create a policy for the
management
What to register
When it enters the data management life cycle
analysis and thought

Policy Document
Simple Questions
Which data objects need a PID (collections, files, metadata
records)?
What kinds of data are likely to stay online long enough?
What kinds of data are likely to be linked to your PIDs?
What kinds of data are likely to be analysed/processed with
tools?
What will happen after data goes off-line?
etc..
analysis and thought

PID Policies for EUDAT services
Each Service follows its own Policy for managing PIDs.
One of the main policies they all follow is the non-
deletion policy:
Once a PID has been generated, it may not be
deleted.
E.g. B2SAFE and B2SHARE use the service to
create and manage PIDs for their hosted data
objects. They both create their own PID types (Keys)
in the PID record.

Example 1: B2SHARE
The persistent identifier for
files, download single files
Cite the whole data publication

Example 2: B2SAFE
B2SAFE employs PIDs to keep track and link replicas of
data in the EUDAT network

Example 3: Enable data flows
Link directly to the data (?locatt=id:n)
Optionally include a (mime) type in the Handle record –
can be used to select appropriate tooling
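
The ?locatt=id:n link works together with a 10320/LOC entry that lists several locations. The sketch below builds such an entry as XML and the matching direct link. The XML layout (a locations element with location children carrying id and href attributes) reflects a common reading of the 10320/LOC type and should be verified against the Handle documentation; the example.org URLs and both helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_locations(urls):
    """Build an (assumed) 10320/LOC value listing one location per URL."""
    root = ET.Element("locations")
    for number, url in enumerate(urls):
        ET.SubElement(root, "location", {"id": str(number), "href": url})
    return ET.tostring(root, encoding="unicode")


def link_to_location(handle, location_id, proxy="http://hdl.handle.net"):
    """Build an actionable PID that selects one location by its id."""
    return "%s/%s?locatt=id:%s" % (proxy, handle, location_id)


locations_xml = build_locations([
    "https://a.example.org/object",
    "https://b.example.org/object",
])
print(locations_xml)
print(link_to_location("10232/1234", 1))
```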

Summary
Persistent Identifiers provide a solution to the “link rot”
problem by providing an extra layer of indirection
Several systems are available with different conditions
PIDs do it yourself: Use a Policy Document
The HANDLE system - via EPIC policies - is the foundation
for EUDAT’s B2HANDLE service:
Low cost, only a flat annual fee
Robust, scalable and performant
Flexible, allows addition of any metadata
Provides a global resolver

www.eudat.eu
Authors Contributors
Themis Zamani, GRNET
Willem Elbers, CLARIN
Christine Staiger, SURFsara
Ellen Leenarts, DANS
Kostas Kavoussanakis, EPCC
Thank you
