Reference Rot: Threat and Remedy

Presented in Glasgow at UKSG, 31 March - 1 April, by Peter Burnhill and Richard Wincewicz.

This presentation looks at reference rot, link rot, and the work of Hiberlink to ensure web citations persist through time.

  1. 1. Reference Rot: Threat and Remedy UKSG15 30 March - 1 April 2015 Funded by the Andrew W. Mellon Foundation Peter Burnhill & Richard Wincewicz EDINA, University of Edinburgh for the Hiberlink Team at University of Edinburgh & LANL Research Library
  2. 2. The Project Team 2013 – 2015, funded by the Andrew W. Mellon Foundation • Los Alamos National Laboratory: Research Library: Herbert Van de Sompel, Harihar Shankar, [Martin Klein, Rob Sanderson] • University of Edinburgh: Language Technology Group: Claire Grover, Beatrice Alex, Colin Matheson, Richard Tobin, [Ke “Adam” Zhou] EDINA * : Peter Burnhill, Muriel Mewissen (Project Manager), Neil Mayo, Tim Stickland, Richard Wincewicz, Centre for Service Delivery & Digital Expertise
  3. 3. … acts as part of the Jisc Family edina.ac.uk
  4. 4. hiberlink.org Overview 1. Introduction / Threat 2. Analysis 3. Large-scale Evidence 4. Devising Remedy 5. Summary Tweet to #UKSG15
  5. 5. When what was referenced & cited ceases to say the same thing, or ‘has ceased to be’ http://www.snorgtees.com/this-parrot-has-ceased-to-be 1. The Threat of Reference Rot “when links to web resources no longer point to what they once did” Reference Rot = Link Rot + Content Drift
  6. 6. Link Rot ‘Link Rot’
  7. 7. + Content Drift: What is at end of URI has changed, or gone! http://dl00.org 2000 http://dl00.org 2004 http://dl00.org 2005 http://dl00.org 2008 (a) Dynamic content as values on webpage changes over time (b) Static content but very different (often unrelated) web pages
  8. 8. 2. Analysis Tweets on #hiberlink to #UKSG15
  9. 9. Take a landmark publication, 10+ years ago
  10. 10. Few of those references to the Web now work as intended. A re-direct [from RLG to OCLC] but ‘content drift’
  11. 11. Few of those references to the Web now work as intended. A re-direct [from RLG to OCLC] but ‘content drift’ Fail !!
  12. 12. Reference no longer works: ‘link rot’ Fail !!
  13. 13. Reference no longer works: ‘link rot’ Fail !!
  14. 14. A re-direct but content not found
  15. 15. A re-direct but content not found Fail !!
  16. 16. Successful link: URI worked as expected in December 2014
  17. 17. But sadly, it now does not. Fail !!
  18. 18. Successful link: URI works as expected 
  19. 19. Classic link rot: ‘Page Not Found’ Fail !!
  20. 20. reference to the Web is to an e-journal that is still current
  21. 21. Classic link rot: ‘Page Not Found’ Fail !!
  22. 22. URI works but content drift: reference is not as intended Fail !!
  23. 23. => Content of Citations Rot over Time!!
  24. 24. … meaning rotten references for the reader
  25. 25. … in what is then a rotten article! … & sale of rotten goods & undermining the integrity of the scholarly record
  26. 26. 3. Large-scale Evidence Tweets on #hiberlink to #UKSG15
  27. 27. Hiberlink Project Methodology to answer a 2-part question: Do references to web-based content (URIs) work? • Focus on content on ‘the wild Web’ • not that which is in e-journals etc. i. Impact of Time: Is the URI still on the ‘Live Web’? • Allowed up to a maximum of 50 redirects ii. Is a ‘Memento’ of that content in the ‘Archived Web’? Memento: a prior version, what the Original Resource was like at some time in the past.
  28. 28. 3. Large-scale Empirical Evidence c. 400,000 articles across the three corpora (Row #5 in Table 2) contained over a million ‘web at large’ references (Row #4 in Table 3) Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLoS ONE 9(12): e115253. doi:10.1371/journal.pone.0115253 http://127.0.0.1:8081/plosone/article?id=info:doi/10.1371/journal.pone.0115253
  29. 29. A Key Aspect of Hiberlink Project Methodology 1. Convert Scholarly Statement from PDF into XML 2. Locate the references & extract each and every URL • Many technical challenges • URL broken/newline; underscore as image • Use up to 15 regular expressions for matching; regard as URI University of Edinburgh Language Technology Group: Beatrice Alex, Claire Grover, Colin Matheson, Richard Tobin, Ke Zhou
  30. 30. Scholarly Articles [in PMC] increasingly link to Web Resources, not just back to other Articles
  31. 31. Scholarly Articles [in Elsevier] increasingly link to Web Resources, not just back to other Articles
  32. 32. Mementos for URIs archived within 14 days of being referenced PMC corpus Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLoS ONE 9(12): e115253. doi:10.1371/journal.pone.0115253 http://127.0.0.1:8081/plosone/article?id=info:doi/10.1371/journal.pone.0115253 6 publicly accessible web archives for lookup: Internet Archive, archive.is (archive.today), Archive-It, BL Web Archive, UK National Archives Web Archive & Icelandic National Archive
  33. 33. Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLoS ONE 9(12): e115253. doi:10.1371/journal.pone.0115253 http://127.0.0.1:8081/plosone/article?id=info:doi/10.1371/journal.pone.0115253 Mementos for URIs archived within 14 days of being referenced Elsevier corpus 6 publicly accessible web archives for lookup: Internet Archive, archive.is (archive.today), Archive-It, BL Web Archive, UK National Archives Web Archive & Icelandic National Archive
  34. 34. 4. Devising Remedy for Reference Rot Tweets on #hiberlink to #UKSG15
  35. 35. The Remedy Is Quick Freeze & Archive
  36. 36. 3 workflows in scholarly statement ① Preparation -> Study -> Compose -> Submission ② Publication -> Editing -> (Revision) -> Acceptance -> Issue ③ Post-Publication -> Deposit/Ingest -> Reader Access -> Use To identify the best opportunities for Intervention to make Remedy, to ‘flash-freeze’, either to avoid reference rot or to ‘stop the rot’ Identify the Actors & how to help them do the right thing!
  37. 37. Ideally at the earliest moment of capture
  38. 38. … when the Authors are trawling for content
  39. 39. … for what an Author regards as significant
  40. 40. … or needs to provide as evidence
  41. 41. … re-factoring the HTML link that is returned. Hiberlink Remedy: Components in a Robust Link a) Take simple URI - to article in New Yorker magazine (say) • http://www.newyorker.com/magazine/2015/01/26/cobweb b) Augment Link with Datetime and Archive URI • Archive timestamp: 2015-02-19T09:46:36 • http://web.archive.org/web/20150219094636/http://www.newyorker.com/magazine/2015/01/26/cobweb Hiberlink.org
  42. 42. What Robust Hiberlinks look like • Hiberlinks are modified <a> HTML elements • Include archive URL and timestamp as additional attributes <a href="http://www.newyorker.com/magazine/2015/01/26/cobweb" data-versionurl="http://web.archive.org/web/20150219094636/http://www.newyorker.com/magazine/2015/01/26/cobweb" data-versiondate="2015-02-19T09:46:36">Cobweb Article</a>
  43. 43. Remedy To Avoid Reference Rot. Help authors do the right thing: ① Triggering archiving of referenced web content when it is noted, using a reference manager, e.g. EndNote, Reference Manager, Zotero – Hiberlink Plug-in developed for Zotero ② Returning a Datetime URI for archived content that can be used in the citation https://www.zotero.org/
  44. 44. Zotero workflow Create reference Add URL Update URL Duplicate reference Pass URL to archive service Receive archive URL Store data in database Add data to reference
  45. 45. Using the Plugin in Zotero Opportunity / Time for a Demo ?
  46. 46. So what should we expect of the Publisher? Beyond the assurance that the fish / references / articles sold are not rotten
  47. 47. Help Publishers do the right thing The next best opportunity for Quick Freeze • to avoid reference rot & to ‘stop the rot’ ① Study: Preparation -> (Review) -> Submission ② Publication: Editorial -> (Revision) -> Issue ③ Post-Publication: Deposit/Ingest -> Provide/Access -> Use Actors: ①The Author ② The Editor / Publisher ③The Access Platform / Librarian /Archival Organisations
  48. 48. OJS plugin 1. Parses the document • Converts .pdf to .html • Extracts URIs 2. Archives the content for each reference • The Author and Editor can choose which version is used as the archival copy 3. Creates an HTML version of the document • including a link to the archived version of each of the references
  49. 49. Well Published References & Augmented Links
  50. 50. Post-Publication (& other bulk processing) The last ‘best’ opportunity for Quick Freeze • not to avoid reference rot but to ‘stop the rot’ ① Study: Preparation -> (Review) -> Submission • Should note & act for each URI, one by one ② Publication: Editorial -> (Revision) -> Issue • (Probably) should examine each one by one ③Post-Publication: Deposit/Ingest • Cannot hope to process one by one
  51. 51. Post-Publication (& other bulk processing) The last ‘best’ opportunity for Quick Freeze • not to avoid reference rot but to ‘stop the rot’ ① Study: Preparation -> (Review) -> Submission • Should note & act for each URI, one by one ② Publication: Editorial -> (Revision) -> Issue • (Probably) should examine each one by one ③Post-Publication: Deposit/Ingest • Cannot hope to process one by one Actors: ①The Author ② The Editor / Publisher ③Access Platforms / Archival Organisations / Librarians
  52. 52. & each article contains many references
  53. 53. Recall Key Aspect of Hiberlink Methodology 1. Convert Scholarly Statement from PDF into XML 2. Locate the references & extract each and every URL • Many technical challenges • URL broken/newline; underscore as image • Use up to 15 regular expressions for matching; regard as URI => Edinburgh Parser [github.com/hiberlink] University of Edinburgh Language Technology Group: Beatrice Alex, Claire Grover, Colin Matheson, Richard Tobin, Ke Zhou
  54. 54. Time to Build Infrastructure: HiberActive Publishing platform HiberActive External archival service (e.g. Internet Archive) • Asynchronous (returns Robust Link) • Distributed (archived with different organisations) • Lightweight (leveraging HTTP & what already exists)
  55. 55. 5. Summary Tweets on #hiberlink to #UKSG15
  56. 56. Hiberlink Outcomes 1. Defined the Threat of Reference Rot 2. Quantified the extent and way in which it exists & undermines the Scholarly Record 3. Pointed to potential & practical Remedy
  57. 57. Hiberlink Outcomes 1. Defined the Threat of Reference Rot 2. Quantified the extent and way in which it exists & undermines the Scholarly Record 3. Pointed to potential & practical Remedy As project comes to an end (June 2015) so we wish to: • Tell the world about these achievements • Engage with others – to build infrastructure – To prompt adoption (copying) of prototypes by 3rd parties • such as reference managers, editorial systems, publication systems, archival systems
  58. 58. Thank you, Questions welcome http://hiberlink.org #hiberlink Email: edina@ed.ac.uk Still Time to Tweet to #UKSG15 Funded by the Andrew W. Mellon Foundation
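
Slides 41 and 42 describe a Robust Link as the original URI augmented with an archive URI and a datetime. A minimal Python sketch of that assembly follows. The function names `wayback_uri` and `robust_link` are hypothetical, and the archive URI pattern (prefix, 14-digit timestamp, original URI) is inferred from the example on slide 41 rather than taken from the Hiberlink code.

```python
from datetime import datetime
from html import escape

# Prefix as it appears in the slide 41 example; other archives use other patterns.
WAYBACK_PREFIX = "http://web.archive.org/web/"


def wayback_uri(original_uri: str, archived_at: datetime) -> str:
    """Build a snapshot URI: prefix + YYYYMMDDhhmmss + '/' + original URI."""
    stamp = archived_at.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}{stamp}/{original_uri}"


def robust_link(original_uri: str, archived_at: datetime, text: str) -> str:
    """Return an <a> element carrying the archive URI and datetime as data-* attributes."""
    version_url = wayback_uri(original_uri, archived_at)
    return (
        f'<a href="{escape(original_uri, quote=True)}" '
        f'data-versionurl="{escape(version_url, quote=True)}" '
        f'data-versiondate="{archived_at.isoformat()}">'
        f"{escape(text)}</a>"
    )


# Values taken from the example on slides 41 and 42.
link = robust_link(
    "http://www.newyorker.com/magazine/2015/01/26/cobweb",
    datetime(2015, 2, 19, 9, 46, 36),
    "Cobweb Article",
)
print(link)
```

The original `href` is left untouched, so the link still works for readers and tools that know nothing about the extra attributes; the archive URI and datetime are there as a fallback if the live page rots or drifts.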

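Slides 32 and 33 count Mementos archived within 14 days of a URI being referenced. The check can be sketched as below, under two assumptions: that the window applies on both sides of the reference date (the slides do not say whether it is one-sided), and that capture dates have already been gathered from the web archives. `has_close_memento` is a hypothetical name, and the reference date in the usage lines is illustrative only.

```python
from datetime import date
from typing import Iterable

WINDOW_DAYS = 14  # window quoted on slides 32 and 33


def within_window(referenced_on: date, captured_on: date, days: int = WINDOW_DAYS) -> bool:
    """True if the capture lies no more than `days` before or after the reference date."""
    return abs((captured_on - referenced_on).days) <= days


def has_close_memento(
    referenced_on: date, capture_dates: Iterable[date], days: int = WINDOW_DAYS
) -> bool:
    """True if at least one archived capture falls inside the window."""
    return any(within_window(referenced_on, c, days) for c in capture_dates)


# Illustrative reference date; the second capture date matches the slide 41 snapshot.
referenced = date(2015, 2, 10)
captures = [date(2014, 6, 1), date(2015, 2, 19)]
print(has_close_memento(referenced, captures))
```

A capture far from the reference date may show drifted content, which is why a count of close Mementos is more telling than a count of any Memento at all.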