On the Persistence of Persistent Identifiers of the Scholarly Web
@mart1nkle1n
TPDL, August 2020
Martin Klein & Lyudmila Balakireva
Los Alamos National Laboratory
{mklein, ludab}@lanl.gov
https://arxiv.org/abs/2004.03011
DOIs are very common
How does this work via HTTP?
https://doi.org/10.1007/978-3-540-87599-4_38
Arrived at landing page
https://doi.org/10.1007/978-3-540-87599-4_38
https://link.springer.com/chapter/10.1007%2F978-3-540-87599-4_38
HTTP redirects
https://doi.org/10.1007/978-3-540-87599-4_38
(HTTP 302 redirect)
http://link.springer.com/10.1007/978-3-540-87599-4_38
(HTTP 301 redirect)
https://link.springer.com/10.1007/978-3-540-87599-4_38
(HTTP 302 redirect)
https://link.springer.com/chapter/10.1007%2F978-3-540-87599-4_38
(HTTP 200)
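In code, following such a chain amounts to repeatedly requesting the `Location` target until a non-redirect status is returned. A minimal Python sketch, where the `EXAMPLE_CHAIN` lookup table is a hypothetical stand-in for live HTTP requests (real code would issue one request per hop, e.g. with cURL or urllib):

```python
# Simulated responses for the example DOI's redirect chain:
# URL -> (status code, Location header or None).
EXAMPLE_CHAIN = {
    "https://doi.org/10.1007/978-3-540-87599-4_38":
        (302, "http://link.springer.com/10.1007/978-3-540-87599-4_38"),
    "http://link.springer.com/10.1007/978-3-540-87599-4_38":
        (301, "https://link.springer.com/10.1007/978-3-540-87599-4_38"),
    "https://link.springer.com/10.1007/978-3-540-87599-4_38":
        (302, "https://link.springer.com/chapter/10.1007%2F978-3-540-87599-4_38"),
    "https://link.springer.com/chapter/10.1007%2F978-3-540-87599-4_38":
        (200, None),
}

def follow_chain(url, resolve, max_hops=20):
    """Return the list of (status, url) pairs from DOI to landing page."""
    chain = []
    for _ in range(max_hops):
        status, location = resolve[url]
        chain.append((status, url))
        if location is None:  # no Location header: chain ends here
            return chain
        url = location
    raise RuntimeError("too many redirects")
```

For the example DOI this yields a four-link chain ending in the Springer landing page with a 200 response.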
Questions…
• How persistent is this DOI resolution?
• Given different clients and network environments:
• Can we consistently arrive at the same location at the end of the redirect chain?
• Is the path there (redirect chain) the same?
• Are there differences between Open Access and non-OA?
• Subscription vs non-Subscription level content?
• Do scholarly content providers differ from the popular web?
Idea…
• Comparative study investigating scholarly publishers’ responses
• To common HTTP requests
• Against DOIs
• Using different web clients and request methods, resembling
• Machines "browsing", crawling
• Humans browsing
• From network environments with different subscriptions/licenses
• Amazon Web Service EC2 instance
• LANL internal
• Compare against web servers providing popular web content
HTTP clients, request methods, dataset, networks
• HTTP HEAD
• cURL
• HTTP GET
• cURL
• HTTP GET+
• cURL + various common parameters e.g., user agent, cookies
• HTTP GET
• Chrome
• 10,000 DOIs: 100 randomly picked from each of the 100 most frequent publisher domains
• HTTP requests sent from AWS VM and LANL network
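As a rough Python illustration of the four request variants (the study itself used cURL and Chrome; the user-agent string below is a placeholder, and real GET+ requests would also carry cookies, e.g. via http.cookiejar):

```python
import urllib.request

# Placeholder browser-like user-agent string (not the one used in the study).
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"

def build_request(doi_url, variant):
    """Build (but do not send) a request for one of the four variants."""
    if variant == "HEAD":   # bare HEAD, as with `curl -I`
        return urllib.request.Request(doi_url, method="HEAD")
    if variant == "GET":    # bare GET, as with plain `curl`
        return urllib.request.Request(doi_url, method="GET")
    if variant == "GET+":   # GET with common extras: user agent, accept header
        headers = {"User-Agent": BROWSER_UA, "Accept": "text/html"}
        return urllib.request.Request(doi_url, method="GET", headers=headers)
    raise ValueError(f"unknown variant: {variant}")
```

The fourth variant, GET via Chrome, is a real browser session and is not reproduced by this sketch.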
Response codes of last link in redirect chain by DOI
[Figure: distribution of final response codes (2xx, 3xx, 4xx, 5xx, Err) across the 10,000 DOIs for each method: HEAD, GET, GET+, Chrome]
• < 50% (48.3%) successful requests across all methods
• > 40% 300-level responses w/ GET
• 25% return 200-level w/ HEAD/Chrome
• 13% 400-level responses w/ HEAD
• 25% of those return 200-level w/ any other method
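The 2xx/3xx/4xx/5xx/Err bins used in the figure can be computed with a small helper; a sketch, assuming `None` marks requests that failed without any HTTP response:

```python
from collections import Counter

def bin_status(code):
    """Bin a final status code into the figure's categories."""
    if code is None:  # connection failure, timeout, etc.
        return "Err"
    return f"{code // 100}xx"

def tally(final_codes):
    """Aggregate the final response codes observed for one request method."""
    return Counter(bin_status(c) for c in final_codes)
```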
https://arxiv.org/abs/2004.03011
For more background, details, results
Thank you
&
stay safe!
Martin Klein & Lyudmila Balakireva
Los Alamos National Laboratory
{mklein, ludab}@lanl.gov
Editor's Notes
Hello and welcome to this session!
My name is Martin Klein and I work in the RL @ LANL. I’d like to give a brief overview of the work done with my colleague Luda Balakireva on the persistence of persistent identifiers of the scholarly web.
More specifically, we are testing Digital Object Identifiers (DOIs) and how consistently or inconsistently scholarly publishers respond when DOIs are requested.
It is worth noting that several different persistent identifiers are used on the scholarly web but for the purpose of this study, we only investigate DOIs.
Why do we do that? Well, the answer is pretty simple: because DOIs are very common.
For example, traditional journal or conference proceeding papers are often assigned DOIs as shown in this example from the IEEE.
The same holds true for datasets that are often assigned DOIs as shown here in Zenodo.
Or more generally speaking, scholarly projects that can include multiple resources and types of resources, as shown here in the example of the Open Science Framework, are assigned DOIs.
So this all is to say that DOIs are very frequently used to identify scholarly resources on the web.
So how does this work, how are DOIs resolved on the web?
If we take this DOI, which is actionable via HTTP,
And use a web browser to dereference it, the browser will eventually display the resource, in this case the landing page of a scholarly article, identified by the DOI.
Note that the URI of the landing page, shown on the bottom of this slide, is different from the DOI, as it is hosted by Springer.
The reason for this is that in the background, somewhat opaque to the user, the browser follows a number of HTTP redirects from the DOI to the landing page URI.
The redirect chain for our example DOI is shown here:
We first see a HTTP 302 redirect to Springer
Followed by a 301 redirect to the HTTPS protocol
And another 302 to the landing page URI.
The landing page, as the last link of the redirect chain, returns an HTTP 200 response code, indicating success of the request and the server’s response.
So the main question we are investigating with our work is: how persistent is this DOI resolution?
Given that DOIs can be requested by different HTTP clients and from different network environments, several subsequent questions arise. For example:
Can we consistently arrive at the same last link of a redirect chain?
Does the chain itself change?
Is there a difference between the resolution of DOIs that identify OA resources vs those that identify non-OA resources?
Does it matter if the request against a DOI comes from within an institutional network with certain subscription levels to commercial publishers?
If we observe such differences, is this typical only for the scholarly web or are these behaviors reflected in the popular web as well?
In short, our intention is to test the consistency of DOI responses.
After all, without consistency, how can we trust the persistence of such identifiers and their underlying infrastructure?
We designed a study to investigate scholarly publishers and their responses to requests against DOIs.
We use common HTTP clients and methods that resemble both machine and human browsing behavior.
We send our request from 2 different network environments with different subscription levels to commercial publishers.
We send the same requests against web servers providing popular web content to compare our results.
We use the here summarized 4 different HTTP methods and clients for our experiment.
We send HTTP HEAD requests with the popular command line tool cURL.
We send simple HTTP GET requests, also with cURL.
We send more complex HTTP GET requests with cURL, where we for example specify a user agent and accept cookies.
Lastly, we use the popular web browser Chrome to send HTTP GET requests.
We send these 4 requests against a corpus of 10k randomly sampled DOIs and repeat the experiment from 2 different network environments:
a VM in the Amazon Cloud and
from within the LANL network.
We make the case that the first 3 methods resemble a machine browsing or crawling the web, mostly because humans typically use cURL only for testing, whereas it is frequently utilized in scripts that access web resources at scale.
In contrast, the Chrome method, somewhat naturally, most closely resembles a human browsing.
Due to time constraints I will only show one set of results.
What we see here in this graph is the response code of the last link of all redirect chains, distinguished by request method.
Our 4 methods to dereference DOIs are shown on the x-axis
10k DOIs are displayed on the y-axis
Response codes are binned at the hundreds level, where green indicates 200-level responses (success), gray 300-level (redirect), red 400-level (client error), and blue 500-level (server error)
This graph shows results of requests sent from a VM in the Amazon Cloud, so a network presumably w/o subscriptions to commercial publishers.
A number of observations can immediately be made:
1)
- Less than 50% of DOIs consistently return a 200-level response, meaning success, across all 4 request methods.
- In other words, more than 5k of our DOIs did not respond consistently across all 4 methods! A rather astonishing ratio!
- Looking at the individual methods, we can note that Chrome, the method most closely resembling a human browsing the web, performs best
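This consistency check boils down to comparing response-code classes across the four methods; a minimal sketch:

```python
def is_consistent(final_codes):
    """True if all methods ended in the same response-code class.

    final_codes: dict mapping method name -> final status code,
    with None standing for a failed request (no HTTP response).
    """
    classes = {c // 100 if c is not None else None
               for c in final_codes.values()}
    return len(classes) == 1
```

A DOI that, for instance, ends in 404 via HEAD, 302 via plain GET, and 200 via GET+ and Chrome would be flagged as inconsistent.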
2)
Next, we recognize that the simple GET method seems not well-suited for resolving DOIs
With more than 40% of DOI chains ending in a 300-level response.
This is noteworthy as, by definition, 300-level should not be a *final* response code of a redirect chain on the web
- No obvious reason why….
…especially given that a large fraction of those DOIs, 25% in total, result in a successful response when the HEAD or Chrome method is used.
3)
Our next observation is that a significant portion – 13% - of DOI requests with the simple HEAD method result in a 400-level response.
One could think these are mostly 403s (Forbidden) or 405s (Method Not Allowed, i.e., HEAD is not permitted for the resource),
but that is not the case; this portion is in fact dominated by 404s (Not Found).
Oddly, 25% of these DOIs result in a 200-level response when any other request method is used.
So, do they exist or not?
While such scenarios of changing response codes are not well-aligned with HTTP standards and best practice on the web,
our observations strongly indicate that scholarly publishers do respond differently to requests against the same DOI, depending on what method is used.
In addition, we can clearly see patterns where responses are different for methods that resemble machine vs human behavior.
This is represented by the success of the Chrome method and the lack of success in particular by the simple GET and HEAD method.
In aggregate, from our point of view, these observed inconsistencies raise more questions and do not increase trust in the persistence of persistent identifiers.
For more results, details on the methodology and dataset used, we refer to the paper.
The corresponding pre-print is available at the displayed URI on the bottom of this slide.
This concludes my short presentation. Thanks a lot for watching!
I am happy to hear your feedback and discuss our work.
Thank you!