On the Persistence of Persistent Identifiers of the Scholarly Web
@mart1nkle1n
TPDL, August 2020
Martin Klein & Lyudmila Balakireva
Los Alamos National Laboratory
{mklein, ludab}@lanl.gov
https://arxiv.org/abs/2004.03011
DOIs are very common
How does this work via HTTP?
https://doi.org/10.1007/978-3-540-87599-4_38
Arrived at landing page
https://doi.org/10.1007/978-3-540-87599-4_38
https://link.springer.com/chapter/10.1007%2F978-3-540-87599-4_38
HTTP redirects
https://doi.org/10.1007/978-3-540-87599-4_38
(HTTP 302 redirect)
http://link.springer.com/10.1007/978-3-540-87599-4_38
(HTTP 301 redirect)
https://link.springer.com/10.1007/978-3-540-87599-4_38
(HTTP 302 redirect)
https://link.springer.com/chapter/10.1007%2F978-3-540-87599-4_38
(HTTP 200)
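In code, following such a chain amounts to repeatedly requesting the `Location` target until a non-redirect status is returned. A minimal Python sketch, where the `EXAMPLE_CHAIN` lookup table is a hypothetical stand-in for live HTTP requests (real code would issue one request per hop, e.g. with cURL or urllib):

```python
# Simulated responses for the example DOI's redirect chain:
# URL -> (status code, Location header or None).
EXAMPLE_CHAIN = {
    "https://doi.org/10.1007/978-3-540-87599-4_38":
        (302, "http://link.springer.com/10.1007/978-3-540-87599-4_38"),
    "http://link.springer.com/10.1007/978-3-540-87599-4_38":
        (301, "https://link.springer.com/10.1007/978-3-540-87599-4_38"),
    "https://link.springer.com/10.1007/978-3-540-87599-4_38":
        (302, "https://link.springer.com/chapter/10.1007%2F978-3-540-87599-4_38"),
    "https://link.springer.com/chapter/10.1007%2F978-3-540-87599-4_38":
        (200, None),
}

def follow_chain(url, resolve, max_hops=20):
    """Return the list of (status, url) pairs from DOI to landing page."""
    chain = []
    for _ in range(max_hops):
        status, location = resolve[url]
        chain.append((status, url))
        if location is None:  # no Location header: chain ends here
            return chain
        url = location
    raise RuntimeError("too many redirects")
```

For the example DOI this yields a four-link chain ending in the Springer landing page with a 200 response.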
Questions…
• How persistent is this DOI resolution?
• Given different clients and network environments:
• Can we consistently arrive at the same location at the end of the redirect chain?
• Is the path there (redirect chain) the same?
• Are there differences between Open Access and non-OA?
• Subscription vs non-Subscription level content?
• Do scholarly content providers differ from the popular web?
Idea…
• Comparative study investigating scholarly publishers’ responses
• To common HTTP requests
• Against DOIs
• Using different web clients and request methods, resembling
• Machines "browsing", crawling
• Humans browsing
• From network environments with different subscriptions/licenses
• Amazon Web Service EC2 instance
• LANL internal
• Compare against web servers providing popular web content
HTTP clients, request methods, dataset, networks
• HTTP HEAD
• cURL
• HTTP GET
• cURL
• HTTP GET+
• cURL + various common parameters e.g., user agent, cookies
• HTTP GET
• Chrome
• 10,000 DOIs: 100 randomly picked from each of the 100 most frequent publisher domains
• HTTP requests sent from AWS VM and LANL network
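As a rough Python illustration of the four request variants (the study itself used cURL and Chrome; the user-agent string below is a placeholder, and real GET+ requests would also carry cookies, e.g. via http.cookiejar):

```python
import urllib.request

# Placeholder browser-like user-agent string (not the one used in the study).
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"

def build_request(doi_url, variant):
    """Build (but do not send) a request for one of the four variants."""
    if variant == "HEAD":   # bare HEAD, as with `curl -I`
        return urllib.request.Request(doi_url, method="HEAD")
    if variant == "GET":    # bare GET, as with plain `curl`
        return urllib.request.Request(doi_url, method="GET")
    if variant == "GET+":   # GET with common extras: user agent, accept header
        headers = {"User-Agent": BROWSER_UA, "Accept": "text/html"}
        return urllib.request.Request(doi_url, method="GET", headers=headers)
    raise ValueError(f"unknown variant: {variant}")
```

The fourth variant, GET via Chrome, is a real browser session and is not reproduced by this sketch.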
Response codes of last link in redirect chain by DOI
[Figure: distribution of final response codes (2xx, 3xx, 4xx, 5xx, Err) across the 10,000 DOIs for each method: HEAD, GET, GET+, Chrome]
• < 50% (48.3%) successful requests across all methods
• > 40% 300-level responses w/ GET
• 25% return 200-level w/ HEAD/Chrome
• 13% 400-level responses w/ HEAD
• 25% of those return 200-level w/ any other method
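The 2xx/3xx/4xx/5xx/Err bins used in the figure can be computed with a small helper; a sketch, assuming `None` marks requests that failed without any HTTP response:

```python
from collections import Counter

def bin_status(code):
    """Bin a final status code into the figure's categories."""
    if code is None:  # connection failure, timeout, etc.
        return "Err"
    return f"{code // 100}xx"

def tally(final_codes):
    """Aggregate the final response codes observed for one request method."""
    return Counter(bin_status(c) for c in final_codes)
```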
https://arxiv.org/abs/2004.03011
For more background, details, results
Thank you
&
stay safe!
Martin Klein & Lyudmila Balakireva
Los Alamos National Laboratory
{mklein, ludab}@lanl.gov
Editor's Notes
Hello and welcome to this session!
My name is Martin Klein and I work in the RL @ LANL. I’d like to give a brief overview of the work done with my colleague Luda Balakireva on the persistence of persistent identifiers of the scholarly web.
More specifically, we are testing Digital Object Identifiers (DOIs) and how consistently or inconsistently scholarly publishers respond when DOIs are requested.
It is worth noting that several different persistent identifiers are used on the scholarly web but for the purpose of this study, we only investigate DOIs.
Why do we do that? Well, the answer is pretty simple: because DOIs are very common.
For example, traditional journal or conference proceeding papers are often assigned DOIs as shown in this example from the IEEE.
The same holds true for datasets that are often assigned DOIs as shown here in Zenodo.
Or more generally speaking, scholarly projects that can include multiple resources and types of resources, as shown here in the example of the Open Science Framework, are assigned DOIs.
So this all is to say that DOIs are very frequently used to identify scholarly resources on the web.
So how does this work, how are DOIs resolved on the web?
If we take this DOI, which is actionable via HTTP,
And use a web browser to dereference it, the browser will eventually display the resource, in this case the landing page of a scholarly article, identified by the DOI.
Note that the URI of the landing page, shown on the bottom of this slide, is different from the DOI, as it is hosted by Springer.
The reason for this is that in the background, somewhat opaque to the user, the browser follows a number of HTTP redirects from the DOI to the landing page URI.
The redirect chain for our example DOI is shown here:
We first see a HTTP 302 redirect to Springer
Followed by a 301 redirect to the HTTPS protocol
And another 302 to the landing page URI.
The landing page, as the last link of the redirect chain, returns an HTTP 200 response code, indicating success of the request and the server’s response.
So the main question we are investigating with our work is: how persistent is this DOI resolution?
Given that DOIs can be requested by different HTTP clients and from different network environments, several subsequent questions arise. For example:
Can we consistently arrive at the same last link of a redirect chain?
Does the chain itself change?
Is there a difference between the resolution of DOIs that identify OA resources vs those that identify non-OA resources?
Does it matter if the request against a DOI comes from within an institutional network with certain subscription levels to commercial publishers?
If we observe such differences, is this typical only for the scholarly web or are these behaviors reflected in the popular web as well?
In short, our intention is to test the consistency of DOI responses.
After all, without consistency, how can we trust the persistence of such identifiers and their underlying infrastructure?
We designed a study to investigate scholarly publishers and their responses to requests against DOIs.
We use common HTTP clients and methods that resemble both machine and human browsing behavior.
We send our request from 2 different network environments with different subscription levels to commercial publishers.
We send the same requests against web servers providing popular web content to compare our results.
We use the here summarized 4 different HTTP methods and clients for our experiment.
We send HTTP HEAD requests with the popular command line tool cURL.
We send simple HTTP GET requests, also with cURL.
We send more complex HTTP GET requests with cURL, where we for example specify a user agent and accept cookies.
Lastly, we use the popular web browser Chrome to send HTTP GET requests.
We send these 4 requests against a corpus of 10k randomly sampled DOIs and repeat the experiment from 2 different network environments:
a VM in the Amazon Cloud and
from within the LANL network.
We make the case that the first 3 methods resemble a machine browsing or crawling the web, mostly because humans typically use cURL only for testing, whereas it is frequently utilized in scripts that access web resources at scale.
In contrast, the Chrome method, somewhat naturally, most closely resembles a human browsing.
Due to time constraints I will only show one set of results.
What we see here in this graph is the response code of the last link of all redirect chains, distinguished by request method.
Our 4 methods to dereference DOIs are shown on the x-axis
10k DOIs are displayed on the y-axis
Response codes are binned at the hundreds level, where green indicates 200-level responses (success), gray 300-level (redirect), red 400-level (client error), and blue 500-level (server error)
This graph shows results of requests sent from a VM in the Amazon Cloud, so a network presumably w/o subscriptions to commercial publishers.
A number of observations can immediately be made:
1)
- Less than 50% of DOIs consistently return a 200-level response, meaning success, across all 4 request methods.
- In other words, more than 5k of our DOIs did not respond consistently across all 4 methods! A rather astonishing ratio!
- Looking at the individual methods, we can note that Chrome, the method most closely resembling a human browsing the web, performs best
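This consistency check boils down to comparing response-code classes across the four methods; a minimal sketch:

```python
def is_consistent(final_codes):
    """True if all methods ended in the same response-code class.

    final_codes: dict mapping method name -> final status code,
    with None standing for a failed request (no HTTP response).
    """
    classes = {c // 100 if c is not None else None
               for c in final_codes.values()}
    return len(classes) == 1
```

A DOI that, for instance, ends in 404 via HEAD, 302 via plain GET, and 200 via GET+ and Chrome would be flagged as inconsistent.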
2)
Next, we recognize that the simple GET method seems not well-suited for resolving DOIs
With more than 40% of DOI chains ending in a 300-level response.
This is noteworthy as, by definition, 300-level should not be a *final* response code of a redirect chain on the web
- No obvious reason why….
…especially given that a large fraction of those DOIs, 25% in total, result in a successful response when the HEAD or Chrome method is used.
3)
Our next observation is that a significant portion – 13% - of DOI requests with the simple HEAD method result in a 400-level response.
One could think these are mostly 403s (Forbidden) or 405s (Method Not Allowed, i.e., HEAD is not permitted for the resource),
but that is not the case; this portion is in fact dominated by 404s (Not Found).
Oddly, 25% of these DOIs result in a 200-level response when any other request method is used.
So, do they exist or not?
While such scenarios of changing response codes are not well-aligned with HTTP standards and best practice on the web,
our observations strongly indicate that scholarly publishers do respond differently to requests against the same DOI, depending on what method is used.
In addition, we can clearly see patterns where responses are different for methods that resemble machine vs human behavior.
This is represented by the success of the Chrome method and the lack of success in particular by the simple GET and HEAD method.
In aggregate, from our point of view, these observed inconsistencies raise more questions and do not increase trust in the persistence of persistent identifiers.
For more results, details on the methodology and dataset used, we refer to the paper.
The corresponding pre-print is available at the displayed URI on the bottom of this slide.
This concludes my short presentation. Thanks a lot for watching!
I am happy to hear your feedback and discuss our work.
Thank you!