Persistent Identification: Easier Said than Done

Herbert Van de Sompel @hvdsomp
20200910

Disclaimer: I Love PIDs
• Interoperability in a distributed ecology (centrally-managed)
• Allows emergence of value-added services across the distributed ecology
• Carves out a recognizable niche on the web
https://pidservices.org/?#/

Disclaimer: I Love PIDs
Beit-Arie, O., et al. (2001) Linking to the Appropriate Copy: Report of a DOI-Based Prototype
https://doi.org/10.1045/september2001-caplan

But PIDs Aren't Magic
Geoff Bilder (2013) DOIs unambiguously and persistently identify published, trustworthy, citable online scholarly literature. Right?
https://www.crossref.org/blog/dois-unambiguously-and-persistently-identify-published-trustworthy-citable-online-scholarly-literature-right/

Why Use PIDs?
1. Persistent identification of an object
doi:10.1045/september2001-caplan
2. Persistent linking to an object on the web
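
The step from identification (1) to linking (2) can be sketched in a few lines. This is a minimal illustration, assuming https://doi.org/ as the resolver base; the function name `doi_to_link` and the list of accepted prefixes are choices made for this sketch, not part of the talk.

```python
from urllib.parse import quote

# Assumption for this sketch: the public DOI resolver base.
RESOLVER = "https://doi.org/"

# Prefixes commonly seen in front of a bare DOI (illustrative list).
_PREFIXES = (
    "doi:",
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def doi_to_link(identifier: str) -> str:
    """Turn a DOI in one of several written forms into a resolver link."""
    doi = identifier.strip()
    lowered = doi.lower()
    for prefix in _PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not doi.startswith("10."):
        raise ValueError(f"does not look like a DOI: {identifier!r}")
    # Percent-encode anything unsafe in a URL path, but keep the slash
    # that separates the DOI prefix from the suffix.
    return RESOLVER + quote(doi, safe="/")


print(doi_to_link("doi:10.1045/september2001-caplan"))
```

The identifier stays the same whichever form it arrives in; only the link depends on a resolver being available.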

Link Rot
https://www.canva.com/learn/404-page-design/

Link Rot
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
[Figure panels: Link Rot in arXiv corpus; Link Rot in Elsevier corpus]

Not Impossible to Keep URLs “Cool” but Hard
• Need to hold on to domain
  • Not trivial when organizational changes occur
• Need to keep URIs alive
  • Not trivial, e.g. when migrating to a different CMS
• Awareness of the link rot problem has grown
  • Permalinks
  • Web archiving of linked resources

Not Impossible to Keep URLs “Cool” but Hard
Michael L. Nelson, Herbert Van de Sompel (2020) A 25 year retrospective on D-Lib Magazine
https://arxiv.org/abs/2008.11680 ; https://ws-dl.blogspot.com/2020/08/2020-08-27-25-year-retrospective-on-d.html
… thanks to an ongoing commitment from CNRI all of D-Lib Magazine’s
issues are still available on the live web, with no changes in their URIs
since the fourth issue (October, 1995). Although we have long known
“Cool URIs Don’t Change” [13], the reality is that most do, and
persisting over 5,000 URIs for up to 25 years is an accomplishment in
itself.
• D-Lib Magazine used handles and later DOIs for article identification
• But nevertheless kept its ~5000 URLs stable over time
• Comparable contemporary journals (Ariadne, CLIC, First Monday, Jodi) were not able to achieve that
  • Changed domain
  • Migration to different CMS

Why Use PIDs?
3. Social convention/pressure

Social Convention/Pressure
• DOIs have become the de facto delineators of the scholarly record
• The FAIR recommendations are very concrete regarding the use of PIDs
• Must have a DOI to make your object “cite-able”, to receive credit for it

Social Convention/Pressure
Carl Boettiger (2013) DOI != citable
https://www.carlboettiger.info/2013/06/03/DOI-citable.html

Wikipedia Citations
https://en.wikipedia.org/wiki/Memento_Project
• Relies on a single web archive to create a snapshot of the linked resource
• Not possible with paywalled content

Robust Links
https://robustlinks.mementoweb.org
• Creates snapshots of the linked resource in one or more web archives
• Discovers snapshots across all public web archives
• Not possible with paywalled content
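
A rough sketch of what a robust link looks like once a snapshot exists. It assumes the `data-versionurl` and `data-versiondate` anchor attributes from the Robust Links specification; the function name and the example.org URLs are placeholders invented for this sketch, and no archive is contacted.

```python
from datetime import date
from html import escape


def robust_link(original_url: str, snapshot_url: str,
                snapshot_date: date, text: str) -> str:
    """Render an HTML anchor that carries both the original URL and a snapshot.

    Assumption: attribute names follow the Robust Links specification
    (data-versionurl, data-versiondate); verify against the spec before use.
    """
    return '<a href="{}" data-versionurl="{}" data-versiondate="{}">{}</a>'.format(
        escape(original_url, quote=True),
        escape(snapshot_url, quote=True),
        snapshot_date.isoformat(),
        escape(text),
    )


# Placeholder URLs; a real snapshot URL would come from a web archive.
print(robust_link(
    "http://example.org/page",
    "https://archive.example.org/20200910000000/http://example.org/page",
    date(2020, 9, 10),
    "Example page",
))
```

The link keeps working as an ordinary link, and the extra attributes give a reader or a script a way back to the archived state if the original URL rots.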

Why Use PIDs?
3. Social convention/pressure
4. To benefit from the value-added services enabled by PIDs
https://pidservices.org/?#/

Responsibilities of Parties Involved in the PID Approach

Linker to B
• Uses PID B instead of B for linking
• Not trivial:
  • The URL in the browser address bar is so tempting …
  • Publishers don’t seem to do quality control on links in References

Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used to Be Persistent. In: WWW2016. http://arxiv.org/abs/1602.09102
Core assumption in the PID solution:
PIDs will be used to establish links.
But are they?

A Disconcerting Observation during Hiberlink Research
• When classifying links extracted from PMC as linking to articles, we assumed that filtering on http://dx.doi.org/* would do the trick to distinguish between links to articles and links to web-at-large
• But we found a lot of e.g. http://link.springer.com/article/*
• For example:
  • http://link.springer.com/article/10.1007%2Fs00799-014-0108-0
• Instead of:
  • http://dx.doi.org/10.1007/s00799-014-0108-0
• We used CrossRef’s Reverse Domain Lookup to classify these extracted links as linking to articles
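
The distinction on this slide can be illustrated with a small classifier. Note that this sketch does not reproduce the Reverse Domain Lookup used in the study; it swaps in a simpler heuristic (percent-decode the path and look for a DOI-shaped string), which only works when the publisher URL happens to embed the DOI. The host list, regular expression, and function name are assumptions of the sketch.

```python
import re
from typing import Optional, Tuple
from urllib.parse import unquote, urlsplit

# Assumption: hosts treated as DOI resolvers in this sketch.
DOI_RESOLVER_HOSTS = {"doi.org", "dx.doi.org"}

# Heuristic DOI shape: "10.", a 4-9 digit registrant code, "/", a suffix.
_DOI = re.compile(r"10\.\d{4,9}/[^\s?#]+")


def classify_link(url: str) -> Tuple[str, Optional[str]]:
    """Return (kind, doi) where kind is 'pid', 'publisher-url' or 'other'."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Publisher URLs often carry the DOI percent-encoded (e.g. %2F for "/").
    match = _DOI.search(unquote(parts.path))
    doi = match.group(0) if match else None
    if doi and host in DOI_RESOLVER_HOSTS:
        return "pid", doi
    if doi:
        return "publisher-url", doi
    return "other", None


for link in (
    "http://dx.doi.org/10.1007/s00799-014-0108-0",
    "http://link.springer.com/article/10.1007%2Fs00799-014-0108-0",
):
    print(classify_link(link))
```

Both links name the same article, but only the first one survives a change of publisher platform.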

URI References - PMC
Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used to Be Persistent. In: WWW2016. http://arxiv.org/abs/1602.09102

Custodian of B
• Cares about persistence
• Sensibly resolves B to the Final Web Destination

From PID B to B to … the Final Web Destination
Martin Klein, Lyudmila Balakireva (2020) On the Persistence of Persistent Identifiers of the Scholarly Web
https://arxiv.org/abs/2004.03011
[Diagram: PID B -> B -> … -> Final Web Destination]

Experiment
• Comparative study investigating resolution of 10k DOIs at the scholarly publishers' end, i.e. from B to the Final Web Destination
• HTTP HEAD
  • cURL
• HTTP GET
  • cURL
• HTTP GET+
  • cURL + various common parameters, e.g. user agent, cookies
• HTTP GET
  • Chrome
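
The kind of measurement listed above can be sketched as a redirect-chain follower. This is not the study's tooling (which used cURL and Chrome); it is a minimal stand-in in which the request itself is injected as a `fetch` function, so the same loop can be driven by HEAD, GET, or GET with extra headers. All names, and the 20-hop limit, are assumptions of the sketch.

```python
import http.client
from typing import Callable, Dict, List, Optional, Tuple
from urllib.parse import urljoin, urlsplit

# A fetch function maps a URL to (status code, Location header or None).
Fetch = Callable[[str], Tuple[int, Optional[str]]]


def follow_chain(start_url: str, fetch: Fetch,
                 max_hops: int = 20) -> List[Tuple[str, int]]:
    """Follow redirects one hop at a time and record (url, status) per hop."""
    chain: List[Tuple[str, int]] = []
    url = start_url
    for _ in range(max_hops):
        status, location = fetch(url)
        chain.append((url, status))
        if not (300 <= status < 400 and location):
            break  # final step: no further redirect to follow
        url = urljoin(url, location)  # Location may be relative
    return chain


def make_http_fetch(method: str = "HEAD",
                    headers: Optional[Dict[str, str]] = None,
                    timeout: float = 10.0) -> Fetch:
    """Build a live fetch function, e.g. method='GET' plus a User-Agent header."""
    def fetch(url: str) -> Tuple[int, Optional[str]]:
        parts = urlsplit(url)
        conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                    else http.client.HTTPConnection)
        conn = conn_cls(parts.netloc, timeout=timeout)
        try:
            path = parts.path or "/"
            if parts.query:
                path += "?" + parts.query
            conn.request(method, path, headers=headers or {})
            response = conn.getresponse()
            return response.status, response.getheader("Location")
        finally:
            conn.close()
    return fetch


def make_table_fetch(routes: Dict[str, Tuple[int, Optional[str]]]) -> Fetch:
    """Offline fetch backed by a lookup table, for trying the loop without a network."""
    return lambda url: routes[url]
```

Calling `follow_chain(url, make_http_fetch("HEAD"))` and `follow_chain(url, make_http_fetch("GET", {"User-Agent": "..."}))` for the same DOI link gives two chains whose last steps can then be compared.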

Response Codes of Last Step in DOI Redirection Chain
• 10,000 DOIs
• < 50% successful requests across all methods
• > 40% 300-level responses with GET
[Chart: response classes 2xx, 3xx, 4xx, 5xx, Err for each method HEAD, GET, GET+, Chrome; one value labeled 48.3%]
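
How last-step status codes roll up into the classes shown in the chart can be sketched as follows. The bucketing rule (anything outside 200-599, or a failed request recorded as `None`, counts as Err) is an assumption of this sketch, and the sample values in the usage line are made up for illustration; they are not results from the study.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

CLASSES = ("2xx", "3xx", "4xx", "5xx", "Err")


def response_class(status: Optional[int]) -> str:
    """Map a final status code (or None for a failed request) to a chart class."""
    if status is not None and 200 <= status < 600:
        return f"{status // 100}xx"
    return "Err"


def class_shares(last_statuses: Iterable[Optional[int]]) -> Dict[str, float]:
    """Percentage of requests per response class, over all requests."""
    counts = Counter(response_class(s) for s in last_statuses)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no requests to summarize")
    return {name: 100 * counts[name] / total for name in CLASSES}


# Made-up sample of five final status codes, one request having failed.
print(class_shares([200, 200, 302, 404, None]))
```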

Custodian of B
• Sensibly resolves B to the final web destination
• Facilitates the use of PIDs for linking

Facilitate the use of PIDs for Linking
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly Context not Found: One in Five Articles Suffers from Reference Rot. https://doi.org/10.1371/journal.pone.0115253
URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253
PID: https://doi.org/10.1371/journal.pone.0115253

Robert Sanderson and Herbert Van de Sompel (2010) Making Web Annotations Persistent over Time. In: JCDL2010. https://doi.org/10.1145/1816123.1816125
URL: https://dl.acm.org/citation.cfm?doid=1816123.1816125
PID: 10.1145/1816123.1816125

@inproceedings{Sanderson:2010:MWA:1816123.1816125,
author = {Sanderson, Robert and Van de Sompel, Herbert},
title = {Making Web Annotations Persistent over Time},
year = {2010},
isbn = {978-1-4503-0085-8},
pages = {1--10},
numpages = {10},
url = {http://doi.acm.org/10.1145/1816123.1816125},
doi = {10.1145/1816123.1816125},
acmid = {1816125},
publisher = {ACM},
}

cite-as Relation Type
Herbert Van de Sompel et al. (2019) cite-as: A Link Relation to Convey a Preferred URI for
Referencing. https://tools.ietf.org/html/rfc8574
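
A sketch of how a client might pick up the preferred URI for referencing from an HTTP `Link` header carrying `rel="cite-as"`. The parser below is a deliberately small approximation of the Web Linking header syntax, not a complete implementation, and the header in the usage line is a hypothetical example rather than a response observed from any publisher.

```python
import re
from typing import List

# One link-value: <target> followed by zero or more ;name=value parameters.
_LINK_VALUE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w*-]+(?:\s*=\s*(?:"[^"]*"|[^\s;,"]+))?)*)'
)
# One parameter: ;name, optionally ="quoted value" or =token.
_PARAM = re.compile(r';\s*([\w*-]+)(?:\s*=\s*(?:"([^"]*)"|([^\s;,"]+)))?')


def cite_as_targets(link_header: str) -> List[str]:
    """Return the targets of all links whose rel includes 'cite-as'."""
    targets = []
    for target, params in _LINK_VALUE.findall(link_header):
        for name, quoted, bare in _PARAM.findall(params):
            # rel may hold several space-separated relation types.
            if name.lower() == "rel" and "cite-as" in (quoted or bare).lower().split():
                targets.append(target)
                break
    return targets


# Hypothetical header of a landing page that advertises its DOI link.
header = (
    '<https://doi.org/10.1371/journal.pone.0115253>; rel="cite-as", '
    '<https://example.org/article.pdf>; rel="item"; type="application/pdf"'
)
print(cite_as_targets(header))
```

With such a header in place, a bookmarking or reference-management tool no longer has to guess: it can copy the advertised PID link instead of the URL in the address bar.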

Custodian of B
• Sensibly resolves B to the final web destination
• Facilitates the use of PIDs for linking
• Keeps resolver correspondence table (PID B ; B) up to date as B changes location
• Acts responsibly under technology changes, e.g. sticks with same PIDs when migrating
• Does not reuse PID for another object [*]
• Acts responsibly when B ceases to exist, e.g. tombstone
• Acts responsibly when ceasing existence, e.g. tombstone, archive
  • Concern re startups
[*] BL Digital Research Team (2020) When is a persistent identifier not persistent? Or an identifier?
https://blogs.bl.uk/digital-scholarship/2020/09/when-is-a-persistent-identifier-not-persistent-or-an-identifier.html
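
Three of the custodian duties above (keep the correspondence table current, never reuse a PID, leave a tombstone) can be expressed as a toy in-memory table. This is only a sketch of the bookkeeping, not how any actual resolver is implemented; the class and method names are invented for the illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Record:
    location: Optional[str]          # current URL of B, or None once retired
    tombstone: Optional[str] = None  # note shown when B no longer exists


class PidTable:
    """Toy correspondence table (PID B ; B)."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def mint(self, pid: str, location: str) -> None:
        # A PID is never handed out twice, even after its object is retired.
        if pid in self._records:
            raise ValueError(f"PID already assigned: {pid}")
        self._records[pid] = Record(location)

    def move(self, pid: str, new_location: str) -> None:
        # B changed location (new domain, new CMS): the PID stays the same.
        record = self._records[pid]
        if record.tombstone is not None:
            raise ValueError(f"PID is retired: {pid}")
        record.location = new_location

    def retire(self, pid: str, note: str) -> None:
        # B ceased to exist: keep the PID and answer with a tombstone.
        record = self._records[pid]
        record.location = None
        record.tombstone = note

    def resolve(self, pid: str) -> Tuple[str, Optional[str]]:
        record = self._records.get(pid)
        if record is None:
            return "unknown", None
        if record.tombstone is not None:
            return "tombstone", record.tombstone
        return "redirect", record.location
```

For example, after `mint("pid-b", "https://old.example.org/b")` and `move("pid-b", "https://new.example.org/b")`, `resolve("pid-b")` redirects to the new location, and a later `mint("pid-b", ...)` raises instead of silently pointing the PID at another object.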

Resolver
• Operates 24/7, forever
• Has an operation that remains sustainable in the long term. That’s not trivial:
  • Publisher $ for Crossref
  • Membership $ for DataCite
  • Tax payers $ for NBN
  • PURL service abandoned by OCLC, taken over by Internet Archive
• Provides tombstone in case custodian of B ceases to exist and provides no tombstone itself

Reminder: Link Persistence is not Object Persistence

Partial Archiving of Journal Articles
David Rosenthal (2013) Patio Perspectives at ANADP II: Preserving the Other Half
http://blog.dshr.org/2013/11/patio-perspectives-at-anadp-ii.html
• Too few
• Too healthy
• Too easy

Lack of Infrastructure to Determine Archival Status of DOIs
http://thekeepers.org -> https://keepers.issn.org/
• Journal centric
• Info via ISSN, volume, issue
• No info via PID

Infrastructure to Determine Archival Status of URIs
http://timetravel.mementoweb.org/list/20190215101916/http://thekeepers.org
• Global audit by HTTP URI, across 20+ web archives
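
The lookup URL on this slide follows a visible pattern: a 14-digit datetime followed by the URI being audited. A minimal sketch that builds such a URL, assuming the pattern holds as shown; the function name is invented here, and the sketch does not contact the service.

```python
from datetime import datetime

# Base taken from the URL shown on the slide.
TIMETRAVEL_LIST = "http://timetravel.mementoweb.org/list/"


def timetravel_list_url(uri: str, when: datetime) -> str:
    """Build a Time Travel list URL for a URI around a given datetime."""
    return f"{TIMETRAVEL_LIST}{when.strftime('%Y%m%d%H%M%S')}/{uri}"


print(timetravel_list_url("http://thekeepers.org", datetime(2019, 2, 15, 10, 19, 16)))
```

The same call works for any HTTP URI, including a DOI link, which is what makes the audit independent of journal-level metadata such as ISSN, volume, and issue.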
