www.eudat.eu
EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No. 654065
Introduction to Persistent
Identifiers
PIDs in EUDAT
Version 2
July 2017
This work is licensed under the Creative
Commons CC-BY 4.0 licence
Content
What are persistent identifiers?
Why use persistent identifiers?
Different persistent identifier systems
The HANDLE system
EPIC PID system
B2HANDLE
Policies
Use cases
PERSISTENT IDENTIFIERS
PID Training
Science Data
Data generation is getting easier/cheaper
Complexity-shift from data generation to data processing &
analysis
The amount of data output is increasing, quality is getting
better
 How to stimulate reuse and enable reproducibility?
Data needs to be
Findable
Accessible
Interoperable
Reusable
Briefly, what are PIDs?
Pointers to data resources
Data files, metadata files, documents …
Globally unique
With infinite lifespan
Can be used to identify and retrieve resources
Can be resolved to the resource
Examples: ISBN, DOIs, PURLs, Handles…
Data Creation Cycle
temporary data
citable data
referable data
raw data
registration & preservation
analysis & enrichment
Citable publication
Persistent and robust
identification
What is the Problem? Why not use simple
URLs?
The URL specifies the
location, on a
particular server, from
which the resource
could be retrieved.
Strictly network
locations for digital
resources.
domain may change
resource may be
relocated
link may change
BUT
In the long term, often after only a year or two, URLs no longer work:
“link rot”
Persistent over time
today … 2030
[Figure: the PID 11839/abc123 is the same today and in 2030, while the resource it points to moves from http://www.example.com/ to http://www.moved.com/]
Supports access to a resource as it moves from one location to another, by design.
Why can Persistent Identifiers help?
A Persistent Identifier is
distinct from a URL
not strictly bound to a specific server or filename
“A persistent identifier (PID) is a long-
lasting reference to a digital object—a
single file or set of files.”
https://en.wikipedia.org/wiki/Persistent_identifier
[Figure: the PID 11839 / abc123 (prefix 11839, suffix abc123) points to the resource via resolution]
Identifier points to a resource with no actual
knowledge of the resource
Responsibility of the PID owner to keep it up-to-date
when the resource changes
Structure of a Persistent Identifier
Points to a resource
Is globally unique
[Figure: the PID 11839 / abc123 (prefix / suffix) points to the resource via resolution]
Once the PID is created, the
resource is globally addressable.
Data
Metadata
Document
Code
Prefix: designates administrative
domain, comes from an issuing
instance
Suffix: unique in the realm of the
prefix
Persistent over time
[Figure: between today and 2030 the PID 11839/abc123 stays the same while the URL changes from http://www.example.com/ to http://www.moved.com/, by design]
Update information
Redirection
Stable
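The indirection idea on this slide can be illustrated with a toy resolver. This is only a minimal in-memory sketch, not how any real PID system is implemented; the class name `PidRegistry` and its methods are hypothetical, and the PID and URLs are the ones shown in the figure.

```python
class PidRegistry:
    """Toy in-memory resolver: maps a stable PID string to its current URL."""

    def __init__(self):
        self._locations = {}

    def register(self, pid, url):
        # A PID is registered once; re-registering would break uniqueness.
        if pid in self._locations:
            raise ValueError(f"PID already registered: {pid}")
        self._locations[pid] = url

    def update_location(self, pid, new_url):
        # The PID owner updates the target; the PID string itself never changes.
        if pid not in self._locations:
            raise KeyError(pid)
        self._locations[pid] = new_url

    def resolve(self, pid):
        # Redirection step: look up the current location behind the PID.
        return self._locations[pid]


registry = PidRegistry()
registry.register("11839/abc123", "http://www.example.com/")
before = registry.resolve("11839/abc123")

# The resource moves; only the record behind the PID is updated.
registry.update_location("11839/abc123", "http://www.moved.com/")
after = registry.resolve("11839/abc123")
```

Anyone who cited `11839/abc123` before the move still reaches the resource afterwards, because only the record behind the PID changed.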
PID Benefits
Persistent Identity via Indirection
Static references into fluid systems over time
Data on networks moves
Ownership/responsibility change
Formats change
Embedded IDs
For data object in hand – current state data
Updates
New related entities
Networks of Persistent Links
Data / metadata links
Provenance chains
PID Costs
Extra level of effort / cost on creation
Analysis – what to identify (granularity)
Folders, files
Single measurements in a time series experiment
Coordination across organisations
Maintain resolution system
Persistence requires sustained effort
Organisational discipline
Technology necessary but not sufficient
Analyse cost/benefit ratio
Don’t start unless it is worthwhile
Is your data worth it?
PID SYSTEMS
Persistent Identifier structure
Every persistent identifier consists of two parts: its prefix and
a unique local name under the prefix known as its suffix
Prefix - designates administrative domain, is generated by
an issuer, which makes sure that all prefixes are unique
Suffix - local name must be unique under its prefix.
The uniqueness of a prefix and the local name under that
prefix ensure that any identifier is globally unique within the
context of the System.
< PREFIX > / < SUFFIX >
(e.g. 11111/123456745)
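As a small illustration of the `<PREFIX>/<SUFFIX>` structure, the sketch below splits an identifier string into its two parts. The names `Pid` and `parse_pid` are hypothetical, and splitting at the first slash is an assumption that matches Handle-style identifiers (where the prefix contains no slash); other PID systems may use different syntax.

```python
from typing import NamedTuple


class Pid(NamedTuple):
    prefix: str  # administrative domain, assigned by the issuer
    suffix: str  # local name, unique under the prefix


def parse_pid(pid: str) -> Pid:
    """Split '<PREFIX>/<SUFFIX>' at the first slash (assumed convention)."""
    prefix, sep, suffix = pid.strip().partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not of the form <PREFIX>/<SUFFIX>: {pid!r}")
    return Pid(prefix, suffix)


example = parse_pid("11111/123456745")
```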
PID Systems
Persistent URLs (PURLs)
Cost: none
Metadata: no additional metadata
Example: purl: GPO/gpo46189

EPIC System
Cost: $50 annual fee per prefix
Metadata: associate any metadata
Example: hdl:11210/123

Digital Object Identifier (DOI)
Cost: fee per DOI + annual fee
Metadata: the INDECS schema, stored in a separate database
Example: DOI: 10.1000/182

Archival Resource Key (ARK)
Cost: none
Metadata: ERC (Electronic Resource Citation) metadata
Example: ark: /12025/654xz321

EPIC and DOI are based on the Handle System.
PID system Requirements
Attach multiple URLs to a PID
Allow part identifiers for complex
objects. Granularity issue
Allow attaching of extra metadata
to the PID (MD5 check, etc)
Actionable (i.e. converted to URL)
PIDs
HTTP proxy for resolving (use port
80 only)
Controlled by community
Programmable interface for
administration of PIDs from
applications
Delegation of PID administration
to other organisations
Distributed, robust, highly-
available, scalable
No single-point of failure,
distributed system with mirroring
Acceptable non-commercial
business model
Identifier String Requirements
Not based on any
changeable attributes of
the entity, e.g.:
Location
Ownership
Any other attribute that
may change without
changing identity
Unique
Avoid conflicts and
referential uncertainty
A good PID system should
not allow you to use the
same suffix twice
Opaque, preferably a
“dumb number”
A well known pattern
invites assumptions that
may be misleading
Meaningful semantics
invite IP wars, language
problems
Nice to have
Human-readable
Cut-able, paste-able
Fits common systems, e.g.
URI specification
These string requirements contribute to persistence.
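The "opaque" and "unique" requirements can be sketched as follows. This is an illustrative assumption, not EUDAT's or Handle's actual minting procedure: the class `SuffixIssuer` is hypothetical, and a random UUID is just one possible choice of "dumb number".

```python
import uuid


class SuffixIssuer:
    """Toy issuer that hands out opaque suffixes and refuses duplicates."""

    def __init__(self, prefix):
        self.prefix = prefix
        self._issued = set()

    def mint(self):
        # Opaque: a random UUID carries no location, owner or other semantics.
        suffix = str(uuid.uuid4())
        while suffix in self._issued:  # practically never true, kept for clarity
            suffix = str(uuid.uuid4())
        self._issued.add(suffix)
        return f"{self.prefix}/{suffix}"

    def claim(self, suffix):
        # Unique: the same suffix may not be used twice under one prefix.
        if suffix in self._issued:
            raise ValueError(f"suffix already in use: {suffix}")
        self._issued.add(suffix)
        return f"{self.prefix}/{suffix}"


issuer = SuffixIssuer("11111")
first = issuer.mint()
second = issuer.mint()
```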
PIDs in EUDAT
EUDAT has adopted
Handle-based persistent
identifiers
A combined solution of
Handle system and EPIC
service
Employing the latest Handle
v.8
EUDAT developed a library
to interact with Handle v.8:
B2HANDLE
HANDLE SYSTEM
The Handle System
The Handle System is a technology specification for
assigning, managing, and resolving persistent
identifiers for digital objects and other resources.
The protocols specified enable a distributed computer
system to store identifiers (names, known as Handles)
of digital resources and resolve those Handles to the
information necessary to locate, access, and otherwise
make use of the resources.
That information can be changed as needed to reflect
the current state or location of the identified
resource without changing the Handle.
The Handle System
The main goal of the Handle system is to contribute to
persistence.
The Handle system is:
reliable
scalable
flexible
trusted
built on open architecture
transparent
A Handle Record
Handle     | Data Type/Key | Index | Handle data              | Timestamp
10232/1234 | URL           | 1     | https://www.eudat.eu/ex1 | 2014-04-09 12:46:53Z
           | DOMAIN        | 2     | EUDAT                    | 2014-04-09 12:46:53Z
           | HS_ADMIN      | 100   | eudat/user1              | 2014-04-09 12:46:53Z
PID – handle: 10232/1234
Actionable PID (URL/resolving): http://hdl.handle.net/10232/1234
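A minimal sketch of how such a record might be handled in code. The helper names `actionable_url` and `values_by_type` are hypothetical, and the JSON sample is an assumption modelled on the response shape of the Handle REST interface (a `values` list with `index`, `type` and `data` fields); check it against your own Handle server before relying on it. The sample mirrors the URL and DOMAIN rows of the record above and is parsed offline, without any network call.

```python
import json

RESOLVER = "http://hdl.handle.net"  # resolver address used in this deck


def actionable_url(handle, resolver=RESOLVER):
    """Turn a bare handle such as '10232/1234' into a resolvable URL."""
    return f"{resolver.rstrip('/')}/{handle}"


def values_by_type(record):
    """Map each value type to its data; a later entry of the same type wins."""
    return {value["type"]: value["data"]["value"] for value in record["values"]}


# Assumed response shape, with the example values from the slide.
SAMPLE_RECORD = json.loads("""
{
  "responseCode": 1,
  "handle": "10232/1234",
  "values": [
    {"index": 1, "type": "URL",
     "data": {"format": "string", "value": "https://www.eudat.eu/ex1"},
     "timestamp": "2014-04-09T12:46:53Z"},
    {"index": 2, "type": "DOMAIN",
     "data": {"format": "string", "value": "EUDAT"},
     "timestamp": "2014-04-09T12:46:53Z"}
  ]
}
""")

url = actionable_url(SAMPLE_RECORD["handle"])
fields = values_by_type(SAMPLE_RECORD)
```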
Resolving Handle Record
1. Client sends request to Global to resolve 0.NA/10232 (prefix handle for 10232/1234)
2. Global responds with Service Information for 10232
3. Client sends request to resolve hdl:10232/1234 to the Local Handle Service named in the Service Information
4. Server responds with handle data
[Figure: Global Registry (e.g. Handle system) holding Service Information; Local Handle Service with a Primary Site and Secondary Sites A and B, each running several servers]
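The four resolution steps on this slide can be simulated with two dictionaries. This is a toy model only: the site names, the dictionaries and the functions `prefix_handle` and `resolve` are hypothetical, and a real client speaks the Handle protocol to real servers instead.

```python
# Global registry: prefix handle -> service information (sites to ask).
GLOBAL_REGISTRY = {
    "0.NA/10232": ["primary.example.org", "secondary-a.example.org"],
}

# Local handle services: site -> handle -> handle data.
LOCAL_SERVICES = {
    "primary.example.org": {
        "10232/1234": {"URL": "https://www.eudat.eu/ex1"},
    },
    "secondary-a.example.org": {
        "10232/1234": {"URL": "https://www.eudat.eu/ex1"},
    },
}


def prefix_handle(handle):
    """Return the prefix handle, e.g. '0.NA/10232' for '10232/1234'."""
    prefix = handle.split("/", 1)[0]
    return f"0.NA/{prefix}"


def resolve(handle):
    # Steps 1-2: ask the global registry which service is responsible.
    sites = GLOBAL_REGISTRY[prefix_handle(handle)]
    # Step 3: ask the sites of that local service for the handle.
    for site in sites:
        record = LOCAL_SERVICES.get(site, {}).get(handle)
        if record is not None:
            # Step 4: the server responds with the handle data.
            return record
    raise KeyError(handle)


record = resolve("10232/1234")
```

Because the secondary site holds the same record, resolution would still succeed in this model if the primary site were removed, which is the point of mirroring.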
HANDLE Record Types
Common types
URL: the location the
HANDLE should resolve to
HS_ADMIN: special record
encoding the permissions
configured for this HANDLE
10320/LOC: supports
multiple locations based on
intelligent decision.
Custom EUDAT types
EUDAT/CHECKSUM: Useful for
integrity verification
EUDAT/ROR: Repository of
Records, the ID of the community
repository.
EUDAT/FIO: PID to first ingested
object in the EUDAT domain.
EUDAT/PARENT: PID associated
with the source object in a
replication chain.
EUDAT/REPLICA: List of PIDs
pointing to replicas
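As an illustration of how a checksum entry supports integrity verification, the sketch below compares the digest of some bytes with the value stored under `EUDAT/CHECKSUM`. The record is represented as a plain dictionary, the function names are hypothetical, and the use of MD5 is an assumption (the deck only mentions an "MD5 check" as an example of extra metadata).

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex digest of the data; MD5 is assumed here for illustration."""
    return hashlib.md5(data).hexdigest()


def matches_checksum(data: bytes, record: dict) -> bool:
    """Compare the data against the EUDAT/CHECKSUM value of a record."""
    expected = record.get("EUDAT/CHECKSUM")
    if expected is None:
        raise KeyError("record has no EUDAT/CHECKSUM entry")
    return md5_hex(data) == expected.lower()


payload = b"example data object"
handle_record = {
    "URL": "https://www.eudat.eu/ex1",
    "EUDAT/CHECKSUM": md5_hex(payload),  # stored when the PID is created
}

intact = matches_checksum(payload, handle_record)
corrupted = matches_checksum(payload + b"!", handle_record)
```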
PID SYSTEM IN EUDAT
Handle system and EPIC
PID System: How does it work?
PID Service
generate and manage
PIDs for digital objects
according to policies
Example: B2HANDLE
Python library (next
section)
PID Replication (as one
of the EPIC policies)
replicate the database of
Handles to partners in
EPIC to guarantee robust
and highly available PID
resolution
Resolution Service
EPIC uses the distributed
network provided by
Handle and extends it
with own local Handle
servers.
Global Handle Mirror
A mirror of the Global
Handle in Europe
Resolution Service
The web address for the Handle resolution service that
EUDAT uses is http://hdl.handle.net.
EUDAT options for PIDs
In order to access a data object stored in EUDAT, an
associated persistent identifier is needed.
EUDAT requires integration of Handle in your
infrastructure. Before your community or data centre can
create PIDs you need a prefix. There are two options:
you can run your own Handle system; or
you can pass the details to EUDAT partners to
manage it on your behalf.
An additional benefit of using the EUDAT systems is
access to a Python library to manage your PID
Handles
B2HANDLE
What is B2HANDLE?
B2HANDLE is EUDAT’s PID service based on
Handle as technology
EPIC as federation
B2HANDLE offers:
Assignment of prefix via one of the EUDAT partners
Hosting of PIDs, i.e. operation and maintenance of Handle
servers and technical services
Replication and safe-keeping of PIDs via the EPIC
federation
Resolution mechanism based on Handle
Easy maintenance and programmatic resolving of PIDs by
the B2HANDLE Python library for general interaction with
Handle servers
B2HANDLE in other EUDAT services
In the EUDAT ecosystem, EUDAT services make use
of B2HANDLE to:
guarantee data access
provide long lasting references to data and
facilitate data publishing.
B2SAFE and B2SHARE use the service to create and
manage PIDs for their hosted data objects.
B2FIND and B2STAGE use the resolving mechanism of
B2HANDLE to retrieve and refer to objects.
The B2HANDLE Python library
b2handle: A Python library for interaction with EUDAT
Handle services (Handle version 8)
Setuptools-enabled Python package for easy
installation
Can be employed by end-users to programmatically
resolve handles
Credentials to one of the EUDAT Handle servers are
required for creation and maintenance of PIDs
Stable state; official release of v1.0 also for use by
EUDAT user communities
B2HANDLE: Available at GitHub
Code repository: https://github.com/EUDAT-B2SAFE/B2HANDLE
B2HANDLE documentation
Technical documentation: http://eudat-b2safe.github.io/B2HANDLE
B2HANDLE library features
Methods to read, create, modify Handles and their
records
Queries against native Handle REST interface
Support for multiple locations per object (10320/loc
entries)
Automatic management of Handle value indexes
Support for Handle reverse-lookup via additional Java
servlet
Support for resolving any Handle from any issuing
instance
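To make "automatic management of Handle value indexes" concrete, here is one way such bookkeeping could work. This is a sketch of the idea only, not the B2HANDLE library's code: the function names are hypothetical, and reserving index 100 for HS_ADMIN follows the example record shown earlier in this deck.

```python
def next_free_index(used, start=1, reserved=(100,)):
    """Return the lowest index >= start that is neither used nor reserved."""
    taken = set(used) | set(reserved)
    index = start
    while index in taken:
        index += 1
    return index


def add_value(record, value_type, data):
    """Append a value to a record (a list of dicts) at the next free index."""
    index = next_free_index(entry["index"] for entry in record)
    record.append({"index": index, "type": value_type, "data": data})
    return index


# Record modelled on the example Handle record in this deck.
record = [
    {"index": 1, "type": "URL", "data": "https://www.eudat.eu/ex1"},
    {"index": 2, "type": "DOMAIN", "data": "EUDAT"},
    {"index": 100, "type": "HS_ADMIN", "data": "eudat/user1"},
]

checksum_index = add_value(record, "EUDAT/CHECKSUM", "placeholder-digest")
replica_index = add_value(record, "EUDAT/REPLICA", "placeholder-pid")
```

The caller only names the type and the data; the index is chosen for them, so two values never collide on the same slot.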
POLICIES
How may I use a PID?
When you have a PID use it:
To cite the data behind the PID:
In publications
On web-pages
Include actionable PIDs in linked data
Retrieve the data:
By using the corresponding resolver
Via the actionable PID
E.g. http://hdl.handle.net/11239/GRNET
Policy Document
When to use Persistent Identifiers?
What should the PID resolve to?
There is no “one-size fits all” strategy for
implementing PIDs!
Create a Policy Document of What & When
Analyze the use of PIDs, create a policy for the
management
What to register
When it enters the data management life cycle
analysis and thought
Policy Document
Simple Questions
Which data objects need a PID (collections, files, metadata
records)?
What kinds of data are likely to stay online long enough?
What kinds of data are likely to be linked to your PIDs?
What kinds of data are likely to be analysed/processed with
tools?
What will happen after data goes off-line?
etc..
analysis and thought
PID Policies for EUDAT services
Each Service follows its own Policy for managing PIDs.
One of the main policies they all follow is the non-
deletion policy:
Once a PID has been generated, it may not be
deleted.
E.g. B2SAFE and B2SHARE use the service to
create and manage PIDs for their hosted data
objects. They both create their own PID types (Keys)
in the PID record.
USE CASES
Example 1: B2SHARE
The persistent identifier for
files, download single files
Cite the whole data publication
Example 2: B2SAFE
B2SAFE employs PIDs to keep track and link replicas of
data in the EUDAT network
Example 3: Enable data flows
Link directly to the data (?locatt=id:n)
Optionally include a (mime) type in the Handle record –
can be used to select appropriate tooling
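A sketch of how a client might link directly to one location. The XML sample follows the `<locations>`/`<location>` layout used for 10320/loc values, but the concrete ids, hrefs and the helper names `locatt_url` and `select_location` are assumptions made for illustration; the handle is the example one used earlier in this deck.

```python
import xml.etree.ElementTree as ET

# Assumed 10320/loc value: one location per copy of the object.
SAMPLE_LOC = """
<locations>
  <location id="0" href="http://www.example.com/" />
  <location id="1" href="http://www.moved.com/" />
</locations>
"""


def locatt_url(handle, location_id, resolver="http://hdl.handle.net"):
    """Build an actionable PID that asks for one specific location id."""
    return f"{resolver}/{handle}?locatt=id:{location_id}"


def select_location(loc_xml, location_id):
    """Return the href of the location with the given id, or None."""
    root = ET.fromstring(loc_xml)
    for location in root.findall("location"):
        if location.get("id") == str(location_id):
            return location.get("href")
    return None


direct_link = locatt_url("10232/1234", 1)
target = select_location(SAMPLE_LOC, 1)
```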
Summary
Persistent Identifiers provide a solution to the “link rot”
problem by providing an extra layer of indirection
Several systems are available with different conditions
PIDs do it yourself: Use a Policy Document
The HANDLE system - via EPIC policies - is the foundation
for EUDAT’s B2HANDLE service:
Low cost, only a flat annual fee
Robust, scalable and performing
Flexible, allows addition of any metadata
Provides a global resolver
Thanks
www.eudat.eu
Authors Contributors
Themis Zamani, GRNET
Willem Elbers, CLARIN
Christine Staiger, SURFsara
Ellen Leenarts, DANS
Kostas Kavoussanakis, EPCC
Thank you


Editor's Notes

  • #5 Data generation is getting easier and cheaper. At the same time there is a shift in complexity from data generation to data processing and analysis: a new way to do science. As a result, the amount of data output is increasing. One of the grand challenges of science data is to facilitate knowledge discovery by assisting humans and machines in data access, integration and analysis. To make the data world a better place for science, we must keep some data principles in mind. Data should be Findable: easy to find by both humans and computer systems, which requires metadata. Data should be Accessible: stored for the long term, and accessed and/or downloaded with a well-defined license and access conditions. Data should be Interoperable: ready to be combined with other datasets by humans as well as computer systems. Data should be Reusable: ready to be used for future research and to be processed further using computational methods. PIDs can help to identify and locate data. In some cases PID systems can also be used for keeping metadata that is vital for making data interoperable on the technical side.
  • #7 What is actually the problem we are trying to solve? To answer this we have to look at the data creation process. This life cycle applies to any online digital object. Analysis & enrichment: data analysis is the process of applying data analysis techniques to a large amount of data, typically in big data repositories. It uses specialized algorithms, systems and processes to review, analyze and present information in a form that is more meaningful for organizations or end users. Data enrichment is a general term for processes used to enhance, refine or otherwise improve raw data. Both contribute to making data a valuable asset. Some examples are: interpreting data, deriving data, producing research outputs, authoring publications and preparing data for preservation. Registration & preservation: data preservation, or more specifically digital data preservation, refers to the series of managed activities necessary to ensure continued access to digital objects for as long as necessary. Long-term preservation can be defined as the ability to provide continued access to digital objects, or at least to the information contained in them, indefinitely. All these steps produce temporary data, referable data and citable data. Identifiers of the kind we are discussing are themselves digital objects, so they are subject to the same life cycle. PIDs can help to keep track of generated data and its relations.
  • #8 Let’s look at the use of a URL as a means to find and access data. The URL specifies the location, on a particular server, from which the resource could be retrieved. URLs are strictly network locations for digital resources. Are URLs persistent? Suppose you want to publish your research outputs online. The traditional way to store data is to upload it to a site, a repository or a directory. To access it, you bookmark or share its URL. So you publish it online at some address, e.g. http://www.test.com/test.html. Other users may cite, access and re-use this URL. As long as nothing changes about the way the data is accessed, this works fine. But one day you decide to move the resource to another location, http://www.example.com/. Other users are not informed about this relocation, and when they try to access the resource at the first location they get a “Page Not Found” response. Apart from this administrative change, you may have experienced other causes. The domain may change (as in the example). The resource may be relocated: the directory structure is rearranged and subdirectories are created for each collection. The link may change: the researcher decides to use a different platform, with different URL queries to retrieve the resource. In each case you get a “Page Not Found” response. In the long term, often after only a year, URLs no longer work, so this arrangement has proven to be fragile. This is where the current need for persistent identifiers comes from.
  • #9 This is what we want: a string that always resolves to the data. Persistent over time. By design. Even if the real location of the data changes.
  • #10 A persistent identifier is distinct from a URL: it is not strictly bound to a specific server or filename. An identifier is a unique name applied to a digital object so that the object can be easily referenced. A persistent identifier (PI) is a long-lasting reference to a digital object, that is, a single file or a set of files. The identifier points to a resource with no actual knowledge of the resource. It is the responsibility of the owner to keep the PID up to date when the resource changes. We will come back to this in our examples.
  • #11 Points to a resource: the identifier points to a resource with no actual knowledge of it; the resource is a black box. The type of resource doesn't matter: it may be a file, a metadata record or a code collection. Is globally unique: you won't find an identifier with the same name that points to another resource; the system ensures this by design. Once the identifier is created, the resource is globally addressable.
  • #12 PIDs solve these problems by introducing a redirection layer. The user holds an opaque string, which is resolved to a URL: the PID points to a URL, which in turn points to the digital object. If the digital object is moved, its ownership changes, or the organisation of the objects changes, the URL often changes too. With PIDs you can update this information, and the user can still employ the same PID to refer to and retrieve the data. In short: the URL is unstable, the PID is stable, and the redirection layer bridges the stable and unstable worlds at the cost of some administrative responsibilities. Update procedures need to be defined and thought through to enable stable referencing.
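The redirection layer described above can be illustrated with a minimal sketch. The `PidRegistry` class and its method names are hypothetical and stand in for a real resolver; the identifier and URLs are the placeholder values used on the slides.

```python
class PidRegistry:
    """Toy in-memory stand-in for a PID resolver (hypothetical, not a real service)."""

    def __init__(self):
        self._records = {}  # maps PID string -> current URL

    def register(self, pid, url):
        # A PID is globally unique: registering it twice is an error.
        if pid in self._records:
            raise ValueError(f"PID already registered: {pid}")
        self._records[pid] = url

    def update(self, pid, new_url):
        # The owner's administrative duty: keep the PID pointing at the data.
        if pid not in self._records:
            raise KeyError(pid)
        self._records[pid] = new_url

    def resolve(self, pid):
        # Users only ever hold the PID; the URL behind it may change.
        return self._records[pid]


registry = PidRegistry()
registry.register("11839/abc123", "http://www.example.com/")
before = registry.resolve("11839/abc123")

# The resource moves; only the record is updated, the PID stays the same.
registry.update("11839/abc123", "http://www.moved.com/")
after = registry.resolve("11839/abc123")
```

The point of the sketch is that citations carry only the PID, so a relocation costs one `update` call rather than a broken link for every reader.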
  • #13 Static references into fluid systems over time: data here means a digital object. A digital object may be moved, removed or renamed for many reasons. It can move to other servers or even to other organisations, and it may even be converted to another format (e.g. from xls to zip). Persistent identifiers are there to keep providing access to the resource, so the digital object gets a static reference into fluid systems over time. Embedded IDs: apart from the main information (the ID and the location of the object), related data can be stored in the PID record, for example a) the version of the item or b) new related items. This means that the user can always find the current and latest state of the data. Networks of persistent links: persistent identifiers may contain information about other digital objects (data/metadata links), which can create a chain of links between digital objects.
  • #14 Using PIDs comes at some cost. Extra effort/cost on creation: once you decide to use PIDs, data and PIDs are tightly connected, so the management life cycle must include a new task, "managing the persistent identifier for the data". Analysis of what to identify and at what granularity: analyse the need for PIDs in your system. Which digital objects must have a PID? Do you need to assign a PID to all your files? Coordination across organisations: an institution often already cooperates with other institutions working in a similar environment, so coordination across organisations is needed. Maintaining the resolution system: there is a cost for maintaining a resolution system, and persistence requires sustained effort. Organisational discipline: persistence can only be achieved if organisational discipline is followed; technology is necessary but not sufficient. Analyse the cost/benefit ratio (checklists and questions are given in this presentation) and don't start unless it is worthwhile: is your data worth it?
  • #16 Let's start by looking at the persistent identifier string. Every identifier consists of two parts: a prefix, and a unique local name under that prefix, known as the suffix. Any suffix (local name) must be unique within its local namespace. The uniqueness of the prefix, together with the uniqueness of the local name under that prefix, ensures that any identifier is globally unique within the context of the system.
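As a small illustration of the prefix/suffix structure, the sketch below splits an identifier at its first slash, which is how Handle-style identifiers separate the two parts. The function name `split_pid` is hypothetical, and the sample identifiers are placeholders.

```python
def split_pid(pid):
    """Split a Handle-style identifier into (prefix, suffix).

    Only the first "/" separates the two parts; the suffix itself
    may contain further slashes.
    """
    prefix, sep, suffix = pid.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a prefix/suffix identifier: {pid!r}")
    return prefix, suffix


simple = split_pid("11839/abc123")
nested = split_pid("10232/collection/item-1")
```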
  • #17 Let's look at some popular persistent identifier systems. PURL: PURLs (Persistent Uniform Resource Locators) are Web addresses that act as permanent identifiers in the face of a dynamic and changing Web infrastructure. They are URLs which redirect to the location of the requested web resource using standard HTTP status codes; a PURL is thus a permanent web address whose redirect target can change over time. Instead of resolving directly to Web resources, PURLs provide a level of indirection that allows the underlying Web addresses of resources to change over time without negatively affecting systems that depend on them. This provides continuity of references to network resources that may migrate from machine to machine for business, social or technical reasons. Cost: none. Metadata: no support for additional metadata. Handle System: the Handle System is a technology specification for assigning, managing and resolving persistent identifiers for digital objects and other resources on the Internet. Cost: $50 annual fee per prefix. Metadata: any metadata can be associated. ARK: the Archival Resource Key (ARK) is an identifier scheme conceived by the California Digital Library (CDL), aiming to identify objects in a persistent way. The scheme was designed on the basis that persistence "is purely a matter of service and is neither inherent in an object nor conferred on it by a particular naming syntax". An ARK is a Uniform Resource Locator (URL) that is a multi-purpose identifier for information objects of any type; it contains the label ark: after the URL's hostname. Cost: none. Metadata: ERC (Electronic Resource Citation) metadata. DOI: the Digital Object Identifier (DOI) was conceived as a generic framework for managing identification of content over digital networks, recognising the trend towards digital convergence and multimedia availability. A DOI name is an identifier (not a location) of an entity on digital networks. It provides a system for persistent and actionable identification and interoperable exchange of managed information on digital networks. Resolution is based on the Handle System. Cost: fee per DOI plus an annual fee. Metadata: the INDECS schema.
  • #18 (as per slide)
  • #19 (as per slide)
  • #20 As already mentioned, a persistent identifier is a long-lasting reference to a digital object. The EUDAT data domain handles registered data, and each digital object should have a persistent identifier. This persistent identifier is used for: replica identification; identification of the repository of record (in the case of replication); querying of additional information; and the (time-stamped) checksum. A persistent identifier helps you access, use and re-use, and verify your data. EUDAT has adopted Handle-based persistent identifiers: a combined solution of the Handle System and the EPIC service. So let's discuss the Handle System.
  • #21 EUDAT has adopted the Handle system and EPIC. We will now dive into the technical details on Handle.
  • #22 (as per slide)
  • #23 The Handle System is: reliable (using redundancy, no single points of failure, and fast enough to not appear broken); scalable (higher loads are simply managed with more computers); flexible (can adapt to changing computing environments; useful to new applications); trusted (both resolution and administration have technical trust methods; an operating organization is committed to the long term); built on an open architecture (benefits from the effort of a community in building applications on the infrastructure); transparent (users need not know the infrastructure details).
  • #24 PIDs such as Handles are actually records in a database. One can convert the PID to a URL by appending it to the address of the resolver. When creating a Handle, the fields URL and HS_ADMIN are mandatory. The URL field holds the actual location of the data and, when it is an HTTP URL, clicking (resolving) the actionable PID redirects to whatever is written in the URL field. Each Handle may have a set of other values assigned. In this example the field "DOMAIN" denotes the domain where the data lives and who is responsible for the data; this is defined by the community or the domain the Handle belongs to. (It is just an example, not a real PID key used by EUDAT!) Handle values share a common data structure. Each Handle value has a unique index number that distinguishes it from the other values in the value set, and a specific data type (or key) that defines the syntax and semantics of the data in its data field. Besides these, each Handle value contains a set of administrative information such as TTL and permissions.
  • #25 How does the resolving of a Handle work when you click on an actionable PID? For any HTTP request, for example http://hdl.handle.net/10232/1234, one of the proxy servers will query for the Handle, take the URL in the Handle record (if there are multiple URLs in the Handle record it will select one, in no particular order) and send an HTTP redirect to that URL to the user's web browser. If there is no URL value, the proxy will display the Handle record. Now let us inspect some data types or keys that are 1) predefined or 2) service or user specific.
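A minimal sketch of the record structure and of the "append the PID to the resolver address" step described above. The `HandleValue` class, the `actionable_url` helper, the index numbers and the TTL default are all illustrative assumptions rather than the Handle software's own data model; the resolver address is the public proxy mentioned in these notes.

```python
from dataclasses import dataclass


@dataclass
class HandleValue:
    """One value in a Handle record (simplified, hypothetical layout)."""
    index: int          # unique within the record
    type: str           # the key, e.g. "URL" or "HS_ADMIN"
    data: str           # payload whose meaning is defined by the type
    ttl: int = 86400    # assumed default, in seconds


def actionable_url(pid, resolver="http://hdl.handle.net/"):
    """Turn a PID into a clickable URL by appending it to the resolver address."""
    return resolver.rstrip("/") + "/" + pid


# The two mandatory fields, with placeholder data.
record = [
    HandleValue(index=1, type="URL", data="http://www.example.com/"),
    HandleValue(index=100, type="HS_ADMIN", data="(admin permissions, elided)"),
]

link = actionable_url("10232/1234")
```

The sketch does no percent-encoding, so it is only suitable for identifiers made of URL-safe characters.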
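The proxy's decision described above can be sketched as follows. This is an offline illustration, not the proxy's actual code: the function name `proxy_response`, the dict layout of the values and the returned labels are assumptions, and "select one" is modelled as taking the first URL in whatever order the values arrive.

```python
def proxy_response(handle_values):
    """Mimic the proxy's choice for one Handle record.

    handle_values: list of dicts with "type" and "data" keys (assumed layout).
    Returns ("redirect", url) if a URL value exists, else ("display_record", values).
    """
    urls = [v["data"] for v in handle_values if v["type"] == "URL"]
    if urls:
        # With several URL values the order is not guaranteed,
        # so callers must not rely on which one is chosen.
        return ("redirect", urls[0])
    return ("display_record", handle_values)


with_url = [
    {"type": "HS_ADMIN", "data": "(elided)"},
    {"type": "URL", "data": "http://www.example.com/"},
]
without_url = [{"type": "HS_ADMIN", "data": "(elided)"}]

redirected = proxy_response(with_url)
displayed = proxy_response(without_url)
```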
  • #26 The Handle record supports a number of value types. Every Handle value must have a data type specified in its <type> field, which defines the syntax and semantics of the data in the following <data> field. The data type may be registered with the Handle System to avoid potential conflicts. Each field in a Handle record is timestamped. Some common types are: URL, one location referenced by this Handle; HS_ADMIN, a special record encoding the permissions configured for this Handle; 10320/LOC, which supports multiple locations and a more intelligent choice between them. Each Handle has one or more administrators, defined in terms of HS_ADMIN values, and any administrative operation (e.g. adding, deleting or modifying Handle values) can only be performed by a Handle administrator with adequate privileges. Some custom types used by EUDAT are: Checksum, useful for integrity verification; EUDAT/ROR (Repository of Record), EUDAT specific for B2SAFE, the repository where the data was stored first; EUDAT/PPID, EUDAT specific for B2SAFE, the PID associated with the source object in a replication chain. If the chain has only two elements, the master copy and the first replica, then PPID = ROR. Beyond these, you can define any other type you like.
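To show how a checksum stored in a Handle record supports integrity verification, here is a minimal sketch. The notes do not say which hash algorithm is used, so SHA-256 is an assumption, and `verify_checksum` is a hypothetical helper name.

```python
import hashlib


def verify_checksum(data, stored_checksum, algorithm="sha256"):
    """Recompute the digest of `data` and compare it with the stored value.

    `algorithm` is an assumption; use whatever the PID record actually declares.
    """
    digest = hashlib.new(algorithm, data).hexdigest()
    return digest == stored_checksum.lower()


payload = b"example data object"
# In practice this value would be read from the Checksum field of the PID record.
stored = hashlib.sha256(payload).hexdigest()

intact = verify_checksum(payload, stored)
tampered = verify_checksum(b"modified data object", stored)
```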
  • #27 What is it that EUDAT uses exactly and how?
  • #28 As already mentioned, EUDAT supports a combined solution of the Handle System and the ePIC service: not all PIDs in EUDAT are maintained in the ePIC federation, but all PIDs in EUDAT are Handles and can be supported by the same hardware and software. We have had a closer look at the Handle System and what it provides technically; now we will see what ePIC adds to the pure technology. ePIC provides reliable, redundant and performant services to the scientific research community. PID Service: the main interface to register and manage persistent identifiers in ePIC. It is implemented as a RESTful web service and is continuously being developed by ePIC. We will see how to use the PID service in the B2HANDLE section. Resolution: responsible for forwarding the user to the current location of the object. ePIC utilizes the Handle System to achieve a redundant and load-balanced setup between the data centers. Replication: currently, five European data centers work together to replicate each other's persistent identifiers. When a data center is temporarily unavailable, the other ePIC centers still resolve the PIDs. Mirror: the ePIC system is based on a worldwide hierarchy with the Global Handle Systems at the top. These systems are registries where the most important information about the prefixes is stored. One of the ePIC founders (GWDG) runs a mirror to assure the resolution of prefixes in Europe even if other parts of the global network are temporarily unavailable. Most of these services are hidden; only the PID service, which is responsible for PID registration, is publicly visible.
  • #29 Resolution builds on the Handle resolution mechanism we saw earlier. The PID resolution system of ePIC is responsible for forwarding users to the current location of an identified object. In addition to the current location, other information about the object (such as author or expiration date) can also be provided. ePIC utilizes the Handle System to achieve a redundant and load-balanced setup between the data centers, and replicates the PID databases to guarantee high availability of the PID resolution. The resolution services of ePIC are also included in the worldwide Handle infrastructure to guarantee a highly reliable and performant resolution of PIDs issued by ePIC.
  • #30 (as per slide)
  • #38 Now we will look at what you need to arrange on the organisational side when you want to use PIDs in your project or at your institute.
  • #39 As a user you can use any PID: for publishing and referencing data in your papers and on web pages; as an online resource in linked data; and for fetching the data, by resolving the PID via the resolver or via an actionable PID.
  • #40 Now that I know how to use a PID, when should I create (mint) one? When should persistent identifiers be used? Among all the concepts which have been introduced there is no 'one size fits all' strategy for implementing persistent identifiers. Although the basic problems to be solved are the same, each of the systems addresses them in its own way on different administrative and technical levels, so it is not possible to formulate one single recommendation for all. Create a policy document of what and when: analyse the use of PIDs and create a policy for their management, covering what to register and when in the data management life cycle to register it.
  • #41 First of all, an organisation must carefully analyse its current use of identifiers in general. In most cases where data is collected, it is identified in some way. If it is data about other data or objects (metadata), it will often contain an identifier for the referred item. Answer these simple questions: Which data objects need a PID (collections, files, metadata records)? What kinds of data are likely to stay online long enough? What kinds of data are likely to be linked to your PIDs? What kinds of data are likely to be analysed or processed with tools? What will happen after the data goes off-line?
  • #43 Let's look at a few use cases.
  • #44 B2SHARE is a user-friendly, reliable and trustworthy way for researchers, scientific communities and citizen scientists to store and share small-scale research data from diverse contexts. All B2SHARE artifacts are associated with a PID. B2SHARE creates several PIDs: 1) for the deposit, a Handle that resolves to the specific landing page in B2SHARE, and a DOI that also resolves to the landing page but is additionally indexed in DataCite; 2) a DOI for each uploaded file, which resolves directly to the file and thus enables programmatic and specific downloads of certain data.
  • #45 B2SAFE creates a PID, using its own Handle prefix, that refers to the original in the community's repository. B2SAFE uses B2HANDLE, i.e. it interacts with the Handle server via the B2HANDLE Python library. Upon creation of the PID, B2SAFE also stores some extra information in the Handle entry: the URL pointing to the real location of the data object; a checksum for integrity checks; and an optional identifier that is used internally by the community. The data object is then replicated to an EUDAT site. There B2SAFE creates a new PID for the replica, with the Handle prefix of that site; the interaction is again done with the B2HANDLE Python library. The Handle entry of the replica uses the PID from the community centre to refer to the original data: here PID1x is used as the value for the First Ingested Object (FIO), as the direct parent of the replica (PARENT) and also as the link to the original community repository (ROR). At the same time the PID of the data at the community centre is updated: it now holds a link to the replica (REPLICA).
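The bookkeeping for a first replica can be sketched with plain dictionaries. This does not use the B2HANDLE library or a Handle server; `register_replica`, the replica PID "PID2x" and the URLs are hypothetical, while the field names FIO, PARENT, ROR and REPLICA follow the notes above. The sketch only covers the first replica of an original, where all three back-links carry the same PID.

```python
def register_replica(records, original_pid, replica_pid, replica_url, checksum):
    """Create the replica's record and link original and replica both ways."""
    if original_pid not in records:
        raise KeyError(original_pid)
    records[replica_pid] = {
        "URL": replica_url,
        "CHECKSUM": checksum,
        "FIO": original_pid,      # first ingested object
        "PARENT": original_pid,   # direct parent in the replication chain
        "ROR": original_pid,      # repository of record
    }
    # The original's record is updated to point at its new replica.
    records[original_pid].setdefault("REPLICA", []).append(replica_pid)
    return records[replica_pid]


records = {
    "PID1x": {"URL": "http://www.example.com/", "CHECKSUM": "(elided)"},
}
register_replica(
    records,
    original_pid="PID1x",
    replica_pid="PID2x",                      # hypothetical replica PID
    replica_url="http://replica.example.org/",  # hypothetical location
    checksum="(elided)",
)
```

Following the links in either direction then reconstructs the replication chain from any single PID.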
  • #46 For Handles with multiple URL values, the proxy server (or web browser plug-in) simply selects the first URL value in the list of values returned by the Handle resolution. Because the order of that list is non-deterministic, there is no intelligent selection of the URL to which the client is redirected. The 10320/loc Handle attribute was developed to improve the selection of specific resource URLs. The type "10320/loc" specifies an XML-formatted Handle value that contains a list of locations. The locatt attribute: if someone constructs a link as hdl:123/456?locatt=id:0, the resolver will return the locations that have an "id" attribute of 0 (i.e. the first location). If there is only one location element, it is returned as a redirect. If there is more than one, you can select the one you want by adding the ?locatt=id:n attribute.
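A minimal sketch of selecting a location from a 10320/loc value by a locatt-style filter. The XML layout (a `locations` element containing `location` elements with `id` and `href` attributes) is assumed for illustration, and `select_locations` is a hypothetical helper that only mimics the attribute matching, not the resolver's full behaviour.

```python
import xml.etree.ElementTree as ET

# Assumed shape of a 10320/loc value, with placeholder URLs.
LOC_VALUE = """
<locations>
  <location id="0" href="http://www.example.com/" />
  <location id="1" href="http://www.moved.com/" />
</locations>
"""


def select_locations(xml_value, locatt=None):
    """Return the href of every location matching a filter such as "id:0".

    Without a filter, all locations are returned in document order.
    """
    root = ET.fromstring(xml_value.strip())
    locations = root.findall("location")
    if locatt is None:
        return [loc.get("href") for loc in locations]
    name, _, wanted = locatt.partition(":")
    return [loc.get("href") for loc in locations if loc.get(name) == wanted]


first = select_locations(LOC_VALUE, "id:0")
second = select_locations(LOC_VALUE, "id:1")
everything = select_locations(LOC_VALUE)
```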
  • #47 Persistent identifiers provide a solution to the "link rot" problem by providing a layer of indirection. PIDs act as individual names for objects, with some extra information attached, much like an ID card for a person that contains the address and a few other details. The indirection accounts for change: the address may change, but users can still access the data with the same reference. PIDs also make it easier to keep digital objects accessible in the long term despite changes of technology, organizations and people, because they are independent of the data and of changes in publishing systems, transfer and technology. Requirements: an identifier must be globally unique, and should also be actionable, that is, a persistent identifier should provide a persistent link to the resource identified. Several systems are available; some offer additional functionality such as support for storing additional metadata or a global resolver. Policy document: using persistent identifiers in your repository requires some analysis and thought. Persistence needs preservation: huge amounts of useful material have been lost due to a lack of resources or of explicitly assigned responsibility, and there must be a policy in place to prevent this from happening. There is no 'one size fits all' strategy for implementing persistent identifiers; a policy has to be adapted to suit the needs of the individual organization. EUDAT uses Handle as its technology: low costs, a robust setup, and flexible creation of PIDs. Via the EPIC federation, EUDAT makes sure that PIDs are mirrored and kept safe.