
Reference Rot

A presentation, given to the New Mexico chapter of the Special Libraries Association, on the work I did with the Research Library Prototyping Team at Los Alamos National Laboratory.

Published in: Science

  1. 1. Reference Rot Los Alamos National Laboratory Research Library Prototyping Team Presented by Shawn M. Jones
  2. 2. Citations are the building blocks of scholarly communications Citations Provide Support and Evidence + Experiment and Results Argument
  3. 3. DOIs Identify Scholarly Publications • Almost all scholarly publications (papers, articles, etc.) have an associated Digital Object Identifier (DOI) maintained by CrossRef • DOIs are persistent • If a publisher changes ownership or sells part of its catalog, the DOI remains with the publication so that scholars can continue to find the paper into the future "ISO 26324:2012(en), Information and documentation — Digital object identifier system". ISO.
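A DOI becomes followable on the web when it is written as an HTTP URI through the doi.org resolver. The following is a minimal sketch of that conversion; the helper name `doi_to_uri` and the list of accepted prefixes are assumptions made for illustration, not part of the presentation.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"

# Prefixes that people commonly put in front of a DOI (illustrative list).
_KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_uri(doi: str) -> str:
    """Turn a DOI such as '10.1371/journal.pone.0115253' into a resolver URI."""
    doi = doi.strip()
    for prefix in _KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    # Every DOI starts with the '10.' directory indicator and has a '/' separator.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode unusual characters but keep the '/' separator readable.
    return DOI_RESOLVER + quote(doi, safe="/")
```

A reference manager that stores the resolver URI rather than the publisher's landing page URI keeps working if the publication moves.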
  4. 4. URIs Identify Web Resources The World Wide Web consists of resources, such as pages or applications. Each web resource is identified by a Uniform Resource Identifier (URI). Examples of web resources: • Web pages • Google Search • Software Web Sites Each resource may have one or more representations that vary by dimensions such as language or document format. Uniform Resource Locators (URLs) are a subset of URIs that require a web location (a server with an application or directory structure). Architecture of the World Wide Web, Volume One (15 December 2004) edited by Ian Jacobs, Norman Walsh. https://www.w3.org/TR/webarch/
  5. 5. Scholars use URIs in References to Web Resources • The web resources behind URIs have no guarantee of persistence. They can disappear because: • Their website is gone due to lack of funding • An organization changes its website and doesn’t provide redirects from the old resources • And more…
  6. 6. Why use URIs? • Existing publications are not the only supporting evidence in scholarly work • URIs are invaluable to researchers because they allow them to cite: • Software Projects • Datasets • Affiliation Web Sites • Funding • Scholar Web Sites • Blog Posts • Technical Reports • Evidence such as news stories or Tweets • And more…
  7. 7. Consider The Publication of the Paper and the Reader In the Future Following One of Its References The paper is published at some point, and its citations using URIs were good at that time. Will they be good for a reader in the future?
  8. 8. Reference Rot Problem #1: Link Rot The reader follows a reference and it is gone Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLOS ONE 9(12): e115253. DOI: 10.1371/journal.pone.0115253 This web-at-large resource is linked from the scholarly article Generalizing the OpenURL Framework beyond References to Scholarly Works but it is now gone!
  9. 9. Reference Rot Problem #2: Content Drift The reader follows a reference and it is not the same Jones SM, Van de Sompel H, Shankar H, Klein M, Tobin R, et al. (2016) Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content. PLOS ONE 11(12): e0167475. DOI: 10.1371/journal.pone.0167475 This web-at-large resource is linked from the scholarly article Searching for Quantum Gravity with High Energy Atmospheric Neutrinos and AMANDA-II but it has changed since publication.
  10. 10. A Potential Solution: Web Archives! Web Archives make snapshots of web resources so users can go back and look at a web page as it was in the past. There are many web archives, such as: • Internet Archive • Perma.cc • Archive.is • Icelandic Web Archive • UK Web Archive • Library of Congress These snapshots are called mementos.
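Mementos can be requested by date using the Memento protocol (RFC 7089), in which a client sends an `Accept-Datetime` header to a TimeGate. The sketch below only builds such a request and does not send it. The TimeGate base URL is an assumption about the Time Travel aggregator at timetravel.mementoweb.org, and `timegate_request` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumed TimeGate endpoint of the Memento Time Travel aggregator.
TIMEGATE = "http://timetravel.mementoweb.org/timegate/"


def timegate_request(uri: str, when: datetime):
    """Return (url, headers) asking for the memento of `uri` closest to `when`.

    `when` must be timezone-aware; it is converted to UTC because the
    Accept-Datetime header is expressed in GMT.
    """
    if when.tzinfo is None:
        raise ValueError("`when` must be timezone-aware")
    when_utc = when.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}
    return TIMEGATE + uri, headers
```

Passing the returned URL and headers to any HTTP client should lead, through redirects, to a memento near the requested date if one exists.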
  11. 11. Questions Addressed By Our Research 1. Is the use of URI references on the rise? 2. To what extent does link rot exist in scholarly URI references? 3. To what extent does content drift exist in scholarly URI references? 4. What can we do about reference rot? Can Web Archives help? 5. When are people using URIs when they should be using DOIs? 6. What can we do to ensure people use DOIs when they exist?
  12. 12. Dataset • 1.8 million articles from arXiv, Elsevier, and PubMed Central from 1997 to 2012 • For content drift comparison, Mementos are taken from 18 web archives • The data was processed by the University of Edinburgh and Los Alamos National Laboratory • From these articles we extracted 1.06 million URI references
  13. 13. Is the use of URI references on the rise?
  14. 14. The Number of URI References Goes Up Each Publication Year Articles and URI references per publication year - arXiv corpus. Articles and URI references per publication year - Elsevier corpus. Articles and URI references per publication year - PMC corpus. Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLOS ONE 9(12): e115253. DOI: 10.1371/journal.pone.0115253
  15. 15. To what extent does link rot exist in scholarly URI references?
  16. 16. Link Rot for References Gets Worse As We Look At Older Publications arXiv corpus Elsevier corpus PMC corpus Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLOS ONE 9(12): e115253. DOI: 10.1371/journal.pone.0115253 If a URI reference no longer responds, then we have link rot.
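A link rot check of this kind can be sketched as follows. This is not the study's actual pipeline: treating any 4xx or 5xx status as rot, in addition to no response at all, is an assumption, and the function names are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Optional


def fetch_status(uri: str, timeout: float = 10.0) -> Optional[int]:
    """Return the final HTTP status for `uri`, or None if nothing responded."""
    try:
        with urllib.request.urlopen(uri, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (OSError, ValueError):
        # DNS failure, refused connection, timeout, or a malformed URI.
        return None


def is_rotten(status: Optional[int]) -> bool:
    """Assumed rule: no response, or any 4xx/5xx status, counts as link rot."""
    return status is None or status >= 400
```

Calling `is_rotten(fetch_status(uri))` for each URI reference gives a rough per-reference verdict. A page that responds with 200 but now shows different content is not caught here; that is content drift.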
  17. 17. Fewer Publications Are Immune to Reference Rot Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLOS ONE 9(12): e115253. DOI: 10.1371/journal.pone.0115253 Immune publications have no URI references. Healthy publications have no link rot and have mementos within 14 days of publication for all of their references. Infected publications have link rot or do not have mementos for all of their references. As noted before, more and more publications use URI references.
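The three categories can be expressed as a small classifier. This is a sketch under stated assumptions: the `UriReference` record and `classify` function are hypothetical, and a reference whose nearest memento falls outside the 14-day window is treated here the same as a reference with no memento.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class UriReference:
    uri: str
    rotten: bool                     # True if the live URI no longer responds
    nearest_memento: Optional[date]  # date of the closest memento, if any


def classify(published: date, refs: List[UriReference], window_days: int = 14) -> str:
    """Label a publication 'immune', 'healthy', or 'infected'."""
    if not refs:
        return "immune"  # no URI references, so nothing can rot
    for ref in refs:
        if ref.rotten or ref.nearest_memento is None:
            return "infected"
        # Assumption: a memento too far from the publication date does not count.
        if abs((ref.nearest_memento - published).days) > window_days:
            return "infected"
    return "healthy"
```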
  18. 18. To what extent does content drift exist in scholarly URI references?
  19. 19. Because of Web Archives, We Can Study Content Drift This Page Changed Much over 3 Months This Page Hasn’t Changed in 19 Years Jones SM, Van de Sompel H, Shankar H, Klein M, Tobin R, et al. (2016) Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content. PLOS ONE 11(12): e0167475. DOI: 10.1371/journal.pone.0167475
  20. 20. The Frequency Of Memento Creation Is Not the Same for All Resources Archived Regularly Archived Occasionally Archived Once Archived Never
  21. 21. Step 1: Find a memento of a reference from the publication date of the paper If a memento before the publication date and after the publication date match according to 4 similarity measures, we consider the two to be the same and either is representative of the reference as it existed at the time of publication. Representative mementos get compared with the current live version of the same reference in step 2. Jones SM, Van de Sompel H, Shankar H, Klein M, Tobin R, et al. (2016) Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content. PLOS ONE 11(12): e0167475. DOI: 10.1371/journal.pone.0167475
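The comparison in this step can be sketched with a few text similarity measures. The slides do not name the four measures, so the three below (Jaccard, Sørensen-Dice, and cosine over word tokens) are illustrative stand-ins, and the 0.95 threshold is an assumption rather than the study's setting.

```python
import math
import re
from collections import Counter
from typing import List


def tokens(text: str) -> List[str]:
    """Lowercase word tokens of a page's extracted text."""
    return re.findall(r"\w+", text.lower())


def jaccard(a: List[str], b: List[str]) -> float:
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def dice(a: List[str], b: List[str]) -> float:
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def cosine(a: List[str], b: List[str]) -> float:
    ca, cb = Counter(a), Counter(b)
    if not ca and not cb:
        return 1.0
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def same_content(text_a: str, text_b: str, threshold: float = 0.95) -> bool:
    """True only if every measure agrees that the two texts match."""
    ta, tb = tokens(text_a), tokens(text_b)
    return all(measure(ta, tb) >= threshold for measure in (jaccard, dice, cosine))
```

If `same_content` holds for the mementos on either side of the publication date, either one can stand in as the representative memento. Requiring all measures to agree keeps a single lenient measure from declaring a match on its own.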
  22. 22. Many References Do Not Have Representative Mementos arXiv Corpus Elsevier Corpus PMC Corpus Jones SM, Van de Sompel H, Shankar H, Klein M, Tobin R, et al. (2016) Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content. PLOS ONE 11(12): e0167475. DOI: 10.1371/journal.pone.0167475
  23. 23. Step 2: Compare the memento of the reference with the web resource from now Jones SM, Van de Sompel H, Shankar H, Klein M, Tobin R, et al. (2016) Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content. PLOS ONE 11(12): e0167475. DOI: 10.1371/journal.pone.0167475 Using the same 4 similarity measures, we compare the content of the current resource with the content of the representative memento.
  24. 24. Content Drift Is Worse For Older Publications Jones SM, Van de Sompel H, Shankar H, Klein M, Tobin R, et al. (2016) Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content. PLOS ONE 11(12): e0167475. DOI: 10.1371/journal.pone.0167475 arXiv corpus PMC corpus Elsevier corpus
  25. 25. What can we do about reference rot? Can Web Archives help?
  26. 26. What can we do about reference rot? Can Web Archives help? 1. Scholars can proactively create mementos in web archives for URI references • The Internet Archive’s “Save Page Now” • Perma.cc, Archive.is, and WebCite exist for this purpose • Mink, Webrecorder.io 2. Other scholars/editors can reference these snapshots in scholarly literature • Robust Links • Memento Jones SM, Van de Sompel H, Shankar H, Klein M, Tobin R, et al. (2016) Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content. PLOS ONE 11(12): e0167475. DOI: 10.1371/journal.pone.0167475
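A Robust Link decorates an ordinary HTML anchor with the snapshot's URI and the date of linking, so a reader can fall back to the memento if the original rots or drifts. The sketch below generates such an anchor. The attribute names `data-versionurl` and `data-versiondate` reflect my understanding of the Robust Links convention and should be checked against robustlinks.mementoweb.org; the helper name `robust_link` is hypothetical.

```python
from html import escape


def robust_link(original_uri: str, memento_uri: str, version_date: str, text: str) -> str:
    """Build an HTML anchor that points at the original URI and carries
    the memento URI and linking date (YYYY-MM-DD) as extra attributes."""
    return (
        f'<a href="{escape(original_uri)}" '
        f'data-versionurl="{escape(memento_uri)}" '
        f'data-versiondate="{escape(version_date)}">'
        f"{escape(text)}</a>"
    )
```

Keeping the original URI in `href` means the link behaves normally for readers and crawlers that know nothing about the extra attributes.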
  27. 27. Are people using URIs when they should be using DOIs?
  28. 28. Are people using URIs in references when they should be using DOIs? Van de Sompel H, Klein M, and Jones SM. 2016. Persistent URIs Must Be Used To Be Persistent. In Proceedings of WWW 2016, pp. 119-120. DOI: 10.1145/2872518.2889352 arXiv corpus PMC corpus We hypothesize that this is caused by citation software using the URI instead of the DOI because it does not know the DOI.
  29. 29. Problem: Machines Just See Links, Where Is The DOI? [Figure: a landing page as a machine sees it, reduced to its URI and many undifferentiated links]
  30. 30. Humans Can Get Meaning from Links on a Web Page Authors DOI Bibliographic Metadata PDF Document
  31. 31. Problem: Machines Cannot Find the DOI Browsers and citation software can easily access the URI; it indicates how to retrieve the resource. The DOI is buried in the text of the landing page. Citation software must be programmed with many publishers’ templates in order to find the DOI across all resources. Publishers also change their templates, causing software to break. Some publishers do not use the DOI in their EndNote/BibTeX citations.
  32. 32. What can we do to ensure people use DOIs when they exist?
  33. 33. HTTP Already Has A Solution, We Just Need to Use It • HTTP is the protocol of the web • Before HTTP sends content, it sends headers • Inside these headers, publishers can use the Link header to reference other content • Because the metadata is stored in the transfer protocol: • This solution requires no change to the content, meaning it works with any document format. • This solution can be applied to existing content with no change to the content. HTTP/1.1 200 OK Date: Mon, 17 Jul 2017 17:53:54 GMT Server: Apache/2.2.3 (Red Hat) Connection: close Link: <http://doi.org/10.101010/99999999>; rel="identifier" Content-Type: text/html; charset=UTF-8 Van de Sompel H and Nelson ML. (2015) Reminiscing About 15 Years of Interoperability Efforts. D-Lib 21: 11/12. DOI: 10.1045/november2015-vandesompel
  34. 34. Using the HTTP Link Header, the machine can find the DOI Using the HTTP link header, publishers can provide metadata linking to the DOI from their resources. This way, a browser or citation manager can find the DOI if they are currently on the landing page or the PDF page. This effort is named “Signposting the Scholarly Web”.
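On the client side, finding the DOI then amounts to parsing the `Link` header and picking the target whose relation is `identifier`. The following is a minimal sketch, not a full implementation of the Web Linking grammar: it assumes parameter values contain no commas, semicolons, or angle brackets, and the function names are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple


def parse_link_header(value: str) -> List[Tuple[str, Dict[str, str]]]:
    """Split a Link header value into (target URI, parameters) pairs."""
    links = []
    # Each link is '<target>' followed by ';'-separated parameters.
    for match in re.finditer(r"<([^>]*)>([^<]*)", value):
        target, raw_params = match.group(1), match.group(2)
        params = {}
        for part in raw_params.split(";"):
            if "=" not in part:
                continue
            key, _, val = part.partition("=")
            # Drop the comma that separates this link from the next, then quotes.
            params[key.strip().lower()] = val.strip().rstrip(",").strip().strip('"')
        links.append((target, params))
    return links


def find_identifier(link_header: str, rel: str = "identifier") -> Optional[str]:
    """Return the first target whose rel parameter includes `rel`, if any."""
    for target, params in parse_link_header(link_header):
        if rel in params.get("rel", "").split():
            return target
    return None
```

Applied to the header value from the example response, `find_identifier('<http://doi.org/10.101010/99999999>; rel="identifier"')` returns the DOI's URI without the software ever reading the page's HTML or knowing the publisher's template.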
  35. 35. Signposting is not just for DOIs • Why not link from the document’s landing page to the author’s ORCID?
  36. 36. Signposting is not just for DOIs • Why not link from the document to the metadata?
  37. 37. Signposting is not just for DOIs • Why not link from the landing page to supplemental items or other publication formats?
  38. 38. Find out more at signposting.org
  39. 39. Recap
  40. 40. Scholarly URI References In Jeopardy • URIs identify web resources and are not persistent • Link rot and content drift are problems for URI references and get worse the older the publication is • Scholars sometimes use URIs instead of DOIs when creating references, even if DOIs exist
  41. 41. New Hope for Scholarly References • Web Archives play a role in preserving references • We can use a variety of tools to create mementos of references at the time of publication • We can access them with Memento and Robust Links • We can use signposting to help reference managers and other tools find DOIs and other information
  42. 42. Thanks for listening Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLOS ONE 9(12): e115253. DOI: 10.1371/journal.pone.0115253 Jones SM, Van de Sompel H, Shankar H, Klein M, Tobin R, et al. (2016) Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content. PLOS ONE 11(12): e0167475. DOI: 10.1371/journal.pone.0167475 Van de Sompel H, Klein M, and Jones SM. 2016. Persistent URIs Must Be Used To Be Persistent. In Proceedings of WWW 2016, pp. 119-120. DOI: 10.1145/2872518.2889352 Van de Sompel H and Nelson ML. (2015) Reminiscing About 15 Years of Interoperability Efforts. D-Lib 21: 11/12. DOI: 10.1045/november2015-vandesompel http://robustlinks.mementoweb.org http://signposting.org http://timetravel.mementoweb.org
  43. 43. Backup Slides
  44. 44. Demonstrations • Memento - http://timetravel.mementoweb.org • Robust Links - http://www.dlib.org/dlib/november15/vandesompel/11vandesompel.html
