Presentation for a workshop about persistent identifiers organized by the Royal Library of The Netherlands and DANS. Highlights the non-trivial commitments required of all parties involved in persistent identifier systems to actually keep links based on persistent identifiers ... err ... persistent.
Persistent Identification: Easier Said than Done
1. Herbert Van de Sompel @hvdsomp
20200910
Herbert Van de Sompel
@hvdsomp
Persistent Identification:
Easier Said than Done
2. Herbert Van de Sompel @hvdsomp
20200910
Disclaimer: I Love PIDs
• Interoperability in a distributed ecology (centrally-managed)
• Allows emergence of value-added services across the distributed ecology
• Carves out a recognizable niche on the web
https://pidservices.org/?#/
3. Herbert Van de Sompel @hvdsomp
20200910
Disclaimer: I Love PIDs
Beit-Arie, O., et al. (2001) Linking to the Appropriate Copy: Report of a DOI-Based Prototype
https://doi.org/10.1045/september2001-caplan
4. Herbert Van de Sompel @hvdsomp
20200910
But PIDs Aren't Magic
Geoff Bilder (2013) DOIs unambiguously and persistently identify published, trustworthy, citable online scholarly literature. Right?
https://www.crossref.org/blog/dois-unambiguously-and-persistently-identify-published-trustworthy-citable-online-scholarly-literature-right/
6. Herbert Van de Sompel @hvdsomp
20200910
Why Use PIDs?
1. Persistent identification of an object
doi:10.1045/september2001-caplan
2. Persistent linking to an object on the web
https://doi.org/10.1045/september2001-caplan
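The two forms above are the same DOI written as an identifier and as a link. A minimal sketch of the conversion, assuming the https://doi.org/ resolver prefix shown on the slide; the function names are illustrative and not part of any DOI tooling:

```python
from urllib.parse import quote, unquote, urlsplit

DOI_RESOLVER = "https://doi.org/"  # resolver prefix as shown on the slide
DOI_HOSTS = {"doi.org", "dx.doi.org"}


def doi_to_link(identifier: str) -> str:
    """Turn 'doi:10.1045/...' (or a bare DOI) into a resolvable HTTPS link."""
    doi = identifier.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    # Keep '/' and common DOI punctuation readable; percent-encode the rest.
    return DOI_RESOLVER + quote(doi, safe="/:;()")


def link_to_doi(link: str) -> str:
    """Recover the 'doi:' form from a DOI-based link."""
    parts = urlsplit(link)
    if (parts.hostname or "").lower() not in DOI_HOSTS:
        raise ValueError(f"not a DOI link: {link}")
    return "doi:" + unquote(parts.path.lstrip("/"))
```

The identifier is what a reference should record; the link is what a reader clicks. Both are derived from the same string, so neither needs to be stored separately.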
7. Herbert Van de Sompel @hvdsomp
20200910
Link Rot
https://www.canva.com/learn/404-page-design/
10. Herbert Van de Sompel @hvdsomp
20200910
Link Rot
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
[Figure panels: Link Rot in arXiv corpus; Link Rot in Elsevier corpus]
11. Herbert Van de Sompel @hvdsomp
20200910
Not Impossible to Keep URLs “Cool” but Hard
• Need to hold on to domain
• Not trivial when organizational changes occur
• Need to keep URIs alive
• Not trivial e.g. when migrating to a different CMS
• Awareness of the link rot problem has grown
• Permalinks
• Web archiving of linked resources
12. Herbert Van de Sompel @hvdsomp
20200910
Not Impossible to Keep URLs “Cool” but Hard
Michael L. Nelson, Herbert Van de Sompel (2020) A 25 year retrospective on D-Lib Magazine
https://arxiv.org/abs/2008.11680 ; https://ws-dl.blogspot.com/2020/08/2020-08-27-25-year-retrospective-on-d.html
… thanks to an ongoing commitment from CNRI all of D-Lib Magazine’s
issues are still available on the live web, with no changes in their URIs
since the fourth issue (October, 1995). Although we have long known
“Cool URIs Don’t Change” [13], the reality is that most do, and
persisting over 5,000 URIs for up to 25 years is an accomplishment in
itself.
• D-Lib Magazine used handles and later DOIs for article identification
• But nevertheless kept its ~5000 URLs stable over time
• Comparable contemporary journals (Ariadne, CLIC, First Monday,
Jodi) were not able to achieve that
• Changed domain
• Migration to different CMS
13. Herbert Van de Sompel @hvdsomp
20200910
Why Use PIDs?
1. Persistent identification of an object
doi:10.1045/september2001-caplan
2. Persistent linking to an object on the web
https://doi.org/10.1045/september2001-caplan
3. Social convention/pressure
14. Herbert Van de Sompel @hvdsomp
20200910
Social Convention/Pressure
• DOIs have become the de facto delineators of the scholarly record
• The FAIR recommendations are very concrete regarding the use of
PIDs
• Must have a DOI to make your object “cite-able”, to receive credit for
it
15. Herbert Van de Sompel @hvdsomp
20200910
Social Convention/Pressure
Carl Boettiger (2013) DOI != citable
https://www.carlboettiger.info/2013/06/03/DOI-citable.html
16. Herbert Van de Sompel @hvdsomp
20200910
Wikipedia Citations
https://en.wikipedia.org/wiki/Memento_Project
• Relies on a single web archive
to create a snapshot of the
linked resource
• Not possible with paywalled
content
17. Herbert Van de Sompel @hvdsomp
20200910
Robust Links
https://robustlinks.mementoweb.org
18. Herbert Van de Sompel @hvdsomp
20200910
Robust Links
https://robustlinks.mementoweb.org
• Creates snapshots of the linked
resource in one or more web
archives
• Discovers snapshots across all
public web archives
• Not possible with paywalled
content
19. Herbert Van de Sompel @hvdsomp
20200910
Why Use PIDs?
1. Persistent identification of an object
doi:10.1045/september2001-caplan
2. Persistent linking to an object on the web
https://doi.org/10.1045/september2001-caplan
3. Social convention/pressure
4. To benefit from the value-added services enabled by PIDs
https://pidservices.org/?#/
20. Herbert Van de Sompel @hvdsomp
20200910
Responsibilities of Parties Involved in the PID Approach
22. Herbert Van de Sompel @hvdsomp
20200910
Linker to B
• Uses PID B instead of B for linking
• Not trivial:
  • The URL in the browser address bar is so tempting …
  • Publishers don’t seem to do quality control on links in References
23. Herbert Van de Sompel @hvdsomp
20200910
Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used
to Be Persistent. In: WWW2016. http://arxiv.org/abs/1602.09102
Core assumption in the PID solution:
PIDs will be used to establish links.
But are they?
24. Herbert Van de Sompel @hvdsomp
20200910
• When classifying links extracted from PMC as linking to articles, we
assumed that filtering on http://dx.doi.org/* would do the trick to
distinguish between links to articles and links to web-at-large
• But we found a lot of e.g. http://link.springer.com/article/*
• For example:
• http://link.springer.com/article/10.1007%2Fs00799-014-0108-0
• Instead of:
• http://dx.doi.org/10.1007/s00799-014-0108-0
• We used CrossRef’s Reverse Domain Lookup to classify these
extracted links as linking to articles
A Disconcerting Observation during Hiberlink Research
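The classification problem on this slide can be sketched as follows. This is only an illustration: the `PUBLISHER_HOSTS` set is a hypothetical stand-in for a lookup service such as the reverse domain lookup mentioned above, and the category labels are made up for this example.

```python
from urllib.parse import urlsplit

# Hosts of the DOI resolver, as used in the links on this slide.
DOI_HOSTS = {"doi.org", "dx.doi.org"}
# Hypothetical stand-in for a reverse domain lookup: hosts known to serve articles.
PUBLISHER_HOSTS = {"link.springer.com"}


def classify_link(url: str) -> str:
    """Classify an extracted link by its host name only."""
    host = (urlsplit(url).hostname or "").lower()
    if host in DOI_HOSTS:
        return "article (PID link)"
    if host in PUBLISHER_HOSTS:
        return "article (publisher URL)"
    return "web-at-large"
```

Filtering on the DOI hosts alone would put the Springer link in the "web-at-large" bucket, which is the misclassification the slide describes.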
25. Herbert Van de Sompel @hvdsomp
20200910
URI References - PMC
Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used to Be Persistent.
In: WWW2016. http://arxiv.org/abs/1602.09102
26. Herbert Van de Sompel @hvdsomp
20200910
Custodian of B
• Cares about persistence
• Sensibly resolves B to the Final Web
Destination
27. Herbert Van de Sompel @hvdsomp
20200910
Martin Klein, Lyudmila Balakireva (2020) On the Persistence of Persistent Identifiers of the Scholarly Web
https://arxiv.org/abs/2004.03011
From PID B to B to … the Final Web Destination
[Diagram: PID B → B → Final Web Destination]
28. Herbert Van de Sompel @hvdsomp
20200910
Experiment
• Comparative study investigating resolution of 10k DOIs at the
end of scholarly publishers, i.e. from B to the Final Web
Destination
Martin Klein, Lyudmila Balakireva (2020) On the Persistence of Persistent Identifiers of the Scholarly Web
https://arxiv.org/abs/2004.03011
• HTTP HEAD: cURL
• HTTP GET: cURL
• HTTP GET+: cURL + various common parameters, e.g. user agent, cookies
• HTTP GET: Chrome
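A rough sketch of how one might record the redirect chain for a single DOI link, one hop at a time. It uses Python's standard library rather than cURL or Chrome, so it is not the study's tooling; the function name, the hop limit, and the example header value are assumptions made for illustration.

```python
import urllib.error
import urllib.parse
import urllib.request

# Assumed example of "GET+" style headers; not the values used in the study.
GET_PLUS_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; pid-check-sketch)"}


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so each hop can be recorded."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # the 3xx response then surfaces as an HTTPError


def redirect_chain(url, method="GET", headers=None, max_hops=20, opener=None):
    """Return [(url, status), ...] per hop; status is None on a network error."""
    opener = opener or urllib.request.build_opener(NoRedirect)
    chain = []
    for _ in range(max_hops):
        request = urllib.request.Request(url, method=method, headers=headers or {})
        try:
            with opener.open(request, timeout=30) as response:
                chain.append((url, response.status))
                return chain
        except urllib.error.HTTPError as err:
            chain.append((url, err.code))
            location = err.headers.get("Location") if err.headers else None
            if not (300 <= err.code < 400 and location):
                return chain  # 4xx/5xx, or a 3xx that leads nowhere
            url = urllib.parse.urljoin(url, location)
        except (urllib.error.URLError, OSError):
            chain.append((url, None))
            return chain
    return chain
```

Calling `redirect_chain(link, method="HEAD")`, `redirect_chain(link)`, and `redirect_chain(link, headers=GET_PLUS_HEADERS)` would loosely correspond to the first three request methods; the browser-based method is outside the scope of this sketch. The status of the last tuple is the "last step" response code.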
29. Herbert Van de Sompel @hvdsomp
20200910
• < 50% successful requests across all methods
• > 40% 300-level responses w/ GET
[Chart: Response Codes of Last Step in DOI Redirection Chain, for 10,000 DOIs; methods: HEAD, GET, GET+, Chrome; response classes: 2xx, 3xx, 4xx, 5xx, Err; labeled value: 48.3%]
Martin Klein, Lyudmila Balakireva (2020) On the Persistence of Persistent Identifiers of the Scholarly Web
https://arxiv.org/abs/2004.03011
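The chart bins the last response code of each redirect chain at the hundreds level. A small sketch of that binning, assuming the final status codes have already been collected and that `None` stands for a request that failed without any HTTP response; the function names are illustrative.

```python
from collections import Counter


def bin_status(status):
    """Map a final HTTP status code to '2xx', '3xx', ... or 'Err' when there is none."""
    if status is None:
        return "Err"
    return f"{status // 100}xx"


def share_by_bin(final_statuses):
    """Return the percentage of requests that ended in each response class."""
    counts = Counter(bin_status(status) for status in final_statuses)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100 * count / total for label, count in counts.items()}
```

Running `share_by_bin` once per request method gives one stacked bar per method, which is the shape of the chart on this slide.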
30. Herbert Van de Sompel @hvdsomp
20200910
Custodian of B
• Cares about persistence
• Sensibly resolves B to the final web
destination
• Facilitates the use of PIDs for linking
31. Herbert Van de Sompel @hvdsomp
20200910
Facilitate the use of PIDs for Linking
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly Context not Found: One in Five Articles Suffers from
Reference Rot. https://doi.org/10.1371/journal.pone.0115253
URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253
PID: https://doi.org/10.1371/journal.pone.0115253
32. Herbert Van de Sompel @hvdsomp
20200910
Facilitate the use of PIDs for Linking
Robert Sanderson and Herbert Van de Sompel (2010) Making Web Annotations Persistent over Time.
In: JCDL2010. https://doi.org/10.1145/1816123.1816125
URL: https://dl.acm.org/citation.cfm?doid=1816123.1816125
PID: 10.1145/1816123.1816125
33. Herbert Van de Sompel @hvdsomp
20200910
@inproceedings{Sanderson:2010:MWA:1816123.1816125,
author = {Sanderson, Robert and Van de Sompel, Herbert},
title = {Making Web Annotations Persistent over Time},
year = {2010},
isbn = {978-1-4503-0085-8},
pages = {1--10},
numpages = {10},
url = {http://doi.acm.org/10.1145/1816123.1816125},
doi = {10.1145/1816123.1816125},
acmid = {1816125},
publisher = {ACM},
}
Facilitate the use of PIDs for Linking
34. Herbert Van de Sompel @hvdsomp
20200910
cite-as Relation Type
Herbert Van de Sompel et al. (2019) cite-as: A Link Relation to Convey a Preferred URI for
Referencing. https://tools.ietf.org/html/rfc8574
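One way a custodian can convey the preferred URI is an HTTP Link header on the landing page that carries the cite-as relation type. A minimal sketch of how a client might pick it out, assuming the header value is already in hand; the parsing is deliberately simplified (it does not handle "<" inside parameter values) and the function name is illustrative.

```python
import re


def cite_as_targets(link_header: str) -> list[str]:
    """Return the target URIs of all links whose rel includes 'cite-as'."""
    targets = []
    # Each link looks like: <target-uri>; param="value"; ...
    for uri, params in re.findall(r'<([^>]*)>([^<]*)', link_header):
        match = re.search(r';\s*rel\s*=\s*"?([^";,]+)"?', params, re.IGNORECASE)
        # rel may hold several space-separated relation types.
        if match and "cite-as" in match.group(1).lower().split():
            targets.append(uri)
    return targets
```

A reference manager or browser extension could use the returned URI instead of the address-bar URL when the user bookmarks or cites the page.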
35. Herbert Van de Sompel @hvdsomp
20200910
Custodian of B
• Cares about persistence
• Sensibly resolves B to the final web
destination
• Facilitates the use of PIDs for linking
• Keeps resolver correspondence table
(PID B ; B) up to date as B changes
location
• Acts responsibly under technology
changes, e.g. sticks with same PIDs
when migrating
• Does not reuse PID for another
object [*]
• Acts responsibly when B ceases to
exist, e.g. tombstone
• Acts responsibly when ceasing
existence, e.g. tombstone, archive
• Concern re startups
[*] BL Digital Research Team (2020) When is a persistent identifier not persistent? Or an identifier?
https://blogs.bl.uk/digital-scholarship/2020/09/when-is-a-persistent-identifier-not-persistent-or-an-identifier.html
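The custodian's duties listed above can be pictured as operations on the resolver correspondence table. The toy model below is only a sketch: the class, the method names, and the 302/410/404 status codes are assumptions for illustration and do not describe how any actual resolver is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Resolver:
    table: dict = field(default_factory=dict)       # PID -> current location of B
    tombstones: dict = field(default_factory=dict)  # PID -> note about the retired object

    def register(self, pid, url):
        # A PID is never reused, not even after its object has been retired.
        if pid in self.table or pid in self.tombstones:
            raise ValueError(f"{pid} is already assigned")
        self.table[pid] = url

    def relocate(self, pid, new_url):
        # B moved (new domain, new CMS): same PID, updated location.
        if pid not in self.table:
            raise KeyError(pid)
        self.table[pid] = new_url

    def retire(self, pid, note):
        # B ceased to exist: keep the PID resolvable to a tombstone.
        self.table.pop(pid)
        self.tombstones[pid] = note

    def resolve(self, pid):
        if pid in self.table:
            return (302, self.table[pid])
        if pid in self.tombstones:
            return (410, self.tombstones[pid])
        return (404, None)
```

The point of the sketch is that every bullet on the slide is an ongoing maintenance action; nothing in the table stays correct on its own.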
36. Herbert Van de Sompel @hvdsomp
20200910
Resolver
• Operates 24/7, forever
• Has an operation that remains
sustainable in the long term. That’s
not trivial:
• Publisher $ for Crossref
• Membership $ for DataCite
• Tax payers $ for NBN
• PURL service abandoned by
OCLC, taken over by Internet
Archive
• Provides tombstone in case
custodian of B ceases to exist and
provides no tombstone itself
37. Herbert Van de Sompel @hvdsomp
20200910
Reminder: Link Persistence is not Object Persistence
38. Herbert Van de Sompel @hvdsomp
20200910
Partial Archiving of Journal Articles
David Rosenthal (2013) Patio Perspectives at ANADP II: Preserving the Other Half
http://blog.dshr.org/2013/11/patio-perspectives-at-anadp-ii.html
• Too few
• Too healthy
• Too easy
39. Herbert Van de Sompel @hvdsomp
20200910
Lack of Infrastructure to Determine Archival Status of DOIs
http://thekeepers.org -> https://keepers.issn.org/
• Journal centric
• Info via ISSN,
volume, issue
• No info via PID
40. Herbert Van de Sompel @hvdsomp
20200910
Infrastructure to Determine Archival Status of URIs
http://timetravel.mementoweb.org/list/20190215101916/http://thekeepers.org
• Global audit by
HTTP URI,
across 20+ web
archives
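The audit URL above follows a simple pattern: a service prefix, a 14-digit datetime, and the URI to look up. A minimal sketch that builds such a URL, assuming the pattern generalizes to other URIs and that the datetime is expressed in UTC; the function name is illustrative.

```python
from datetime import datetime

# Prefix copied from the URL shown on this slide.
TIMETRAVEL_LIST = "http://timetravel.mementoweb.org/list/"


def timetravel_list_url(uri: str, when: datetime) -> str:
    """Build a lookup URL for snapshots of `uri` around the datetime `when`."""
    stamp = when.strftime("%Y%m%d%H%M%S")  # YYYYMMDDhhmmss, as in the slide's URL
    return f"{TIMETRAVEL_LIST}{stamp}/{uri}"
```

Because the lookup is keyed by HTTP URI, it works for a DOI link as well as for a landing-page URL, which is what the journal-centric registry on the previous slide does not offer.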
41. Herbert Van de Sompel @hvdsomp
20200910
Herbert Van de Sompel
@hvdsomp
Persistent Identification:
Easier Said than Done
Editor's Notes
Also Digital Rights Management
PURL, NBN
From scratch DOI registration with migration, startups that vanish
500K articles, 1.2M URLs
We designed a study to investigate scholarly publishers and their responses to requests against DOIs.
We use common HTTP clients and methods that resemble both machine and human browsing behavior.
We send our requests from 2 different network environments with different subscription levels to commercial publishers.
We send the same requests against web servers providing popular web content to compare our results.
Our 4 methods to dereference DOIs are shown on the x-axis.
The 10k DOIs are displayed on the y-axis.
Response codes are binned at the hundreds level: green indicates 200-level responses (success), gray 300-level responses (redirect), red 400-level responses (client error), and blue 500-level responses (server error).
This graph shows results of requests sent from a VM in the Amazon Cloud, so a network presumably w/o subscriptions to commercial publishers.
A number of observations can immediately be made:
Less than 50% of DOIs consistently return a 200-level response, meaning success, across all 4 request methods.
Next, we recognize that the simple GET method seems not well-suited for resolving DOIs, with more than 40% of DOI chains ending in a 300-level response. This is noteworthy because, by definition, a 300-level response should not be the *final* response code of a redirect chain on the web.