Reference Rot: Threat and Remedy
UKSG15
30 March - 1 April 2015
Funded by the Andrew W. Mellon Foundation
Peter Burnhill & Richard Wincewicz
EDINA, University of Edinburgh
for the Hiberlink Team at University of Edinburgh & LANL Research Library
The Project Team
2013 – 2015, funded by the
Andrew W. Mellon Foundation
• Los Alamos National Laboratory:
Research Library: Herbert Van de Sompel,
Harihar Shankar, [Martin Klein, Rob Sanderson]
• University of Edinburgh:
Language Technology Group: Claire Grover,
Beatrice Alex, Colin Matheson, Richard Tobin, [Ke “Adam” Zhou]
EDINA * : Peter Burnhill, Muriel Mewissen (Project Manager),
Neil Mayo, Tim Stickland, Richard Wincewicz
Centre for Service Delivery & Digital Expertise
… acts as part of the Jisc Family
edina.ac.uk
hiberlink.org
Overview
1. Introduction / Threat
2. Analysis
3. Large-scale Evidence
4. Devising Remedy
5. Summary
Tweet to #UKSG15
When what was referenced & cited
ceases to say the same thing, or ‘has ceased to be’
http://www.snorgtees.com/this-parrot-has-ceased-to-be
1. The Threat of Reference Rot
“when links to web resources
no longer point to what they once did”
Reference Rot = Link Rot + Content Drift
Link Rot
+ Content Drift: What is at the end of the URI has changed, or gone!
http://dl00.org as seen in 2000, 2004, 2005 and 2008
(a) Dynamic content: values on the web page change over time
(b) Static content, but very different (often unrelated) web pages
2. Analysis
Tweets on #hiberlink to
#UKSG15
Take a landmark publication, 10+ years ago
Few of those references to the Web now work as intended
A re-direct [from RLG to OCLC] but ‘content drift’
Fail !!
Reference no longer works: ‘link rot’
Fail !!
Reference no longer works: ‘link rot’
Fail !!
A re-direct but content not found
Fail !!
Successful link: URI worked as expected in December 2014
But sadly, now does not
Fail !!
Successful link: URI works as expected
Classic link rot: ‘Page Not Found’
Fail !!
reference to the Web is to an e-journal that is still current
Classic link rot: ‘Page Not Found’
Fail !!
URI works but content drift: reference is not as intended
Fail !!
=> Content of Citations Rot over Time!!
… meaning rotten references for the reader
… in what is then a rotten article!
… & sale of rotten goods & undermining the
integrity of the scholarly record
3. Large-scale Evidence
Hiberlink Project Methodology
to discover the answer to a 2-part question
Do references to web-based content (URIs) work?
• Focus on content on ‘the wild Web’
• not that which is in e-journals etc
i. Impact of Time: Is the URI still on the ‘Live Web’?
• Allowed up to a maximum of 50 redirects
ii. Is a ‘Memento’ of that content in the ‘Archived Web’?
Memento: a prior version, what the Original Resource was like at some time in the past.
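A minimal sketch of part (i), the Live Web check with the 50-redirect ceiling. It assumes a caller-supplied `fetch` function (hypothetical, not part of the Hiberlink code) that returns an HTTP status and an optional `Location` header; the verdict labels are illustrative. A 2xx answer only shows that the URI responds, it cannot detect content drift.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

MAX_REDIRECTS = 50  # ceiling stated on the slide
REDIRECT_CODES = {301, 302, 303, 307, 308}

# Hypothetical fetch signature: URI -> (HTTP status, Location header or None)
Response = Tuple[int, Optional[str]]


def resolve(uri: str,
            fetch: Callable[[str], Response],
            max_redirects: int = MAX_REDIRECTS) -> Tuple[str, str, int]:
    """Follow redirects from `uri`; return (verdict, final URI, redirects followed)."""
    current = uri
    hops = 0
    while True:
        status, location = fetch(current)
        if not (status in REDIRECT_CODES and location):
            # Terminal response: 2xx counts as live, anything else as link rot.
            verdict = "live" if 200 <= status < 300 else "link rot"
            return verdict, current, hops
        if hops == max_redirects:
            # One more redirect would exceed the allowed maximum.
            return "too many redirects", current, hops
        # Location may be relative, so resolve it against the current URI.
        current = urljoin(current, location)
        hops += 1
```

Passing `fetch` in keeps the sketch independent of any particular HTTP library and makes it easy to exercise against a table of canned responses.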
3. Large-scale Empirical Evidence
c. 400,000 articles across the three corpora (Row #5 in Table 2)
contained over a million references to the web at large (Row #4 in Table 3)
Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One
in Five Articles Suffers from Reference Rot. PLoS ONE 9(12): e115253. doi:10.1371/journal.pone.0115253
https://doi.org/10.1371/journal.pone.0115253
A Key Aspect of Hiberlink Project Methodology
1. Convert Scholarly Statement from PDF into XML
2. Locate the references & extract each and every URL
• Many technical challenges
• URLs broken across line breaks; underscores rendered as images
• Use up to 15 regular expressions for matching; regard each match as a URI
University of Edinburgh Language Technology Group:
Beatrice Alex, Claire Grover, Colin Matheson, Richard Tobin, Ke Zhou
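A minimal sketch of step 2, assuming plain text has already been obtained from the PDF. The two patterns below are illustrative only and are not the project's own expressions (the slide mentions up to 15). The line-break repair is a heuristic: it rejoins a URL only when the line ends in a character that commonly precedes a wrap, so a URL split mid-word would still be missed.

```python
import re

# Illustrative pattern for http(s) URLs; stops at whitespace, quotes and angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

# Heuristic: a URL fragment ending in typical "break" punctuation, then a newline,
# then more non-space text, is treated as one URL split across two lines.
BROKEN_URL = re.compile(r"(https?://\S*[/.\-_=?&])\n(?=\S)")

# Trailing characters that usually belong to the sentence, not the URL.
TRAILING_PUNCTUATION = ".,;:)]}"


def extract_urls(text):
    """Return the distinct URLs found in `text`, in order of first appearance."""
    joined = BROKEN_URL.sub(r"\1", text)
    urls = []
    for match in URL_PATTERN.finditer(joined):
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        if url not in urls:
            urls.append(url)
    return urls
```

Stripping a trailing `)` is a trade-off: it cleans up URLs quoted in parentheses but would truncate a URL that legitimately ends with one.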
Scholarly Articles [in PMC] increasingly link to
Web Resources, not just back to other Articles
Scholarly Articles [in Elsevier] increasingly link to
Web Resources, not just back to other Articles
Mementos for URIs archived within 14 days of being referenced
PMC corpus
6 publicly accessible web archives for lookup: Internet Archive, archive.is (archive.today),
Archive-It, BL Web Archive, UK National Archives Web Archive & Icelandic National Archive
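The 14-day criterion above can be sketched as a simple date comparison. This assumes the reference date and the Memento datetimes are already known, and it assumes the window is symmetric (14 days before or after the reference date); the slide does not say which, so treat that as an assumption. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def has_timely_memento(referenced_on, memento_datetimes, window_days=14):
    """True if any Memento falls within `window_days` of the reference date.

    Assumption: the window is symmetric around `referenced_on`.
    """
    window = timedelta(days=window_days)
    return any(abs(moment - referenced_on) <= window
               for moment in memento_datetimes)
```

For a URI referenced on 10 February 2015, a Memento captured on 19 February 2015 would count; one captured a month later would not.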
Mementos for URIs archived within 14 days of being referenced
Elsevier corpus
4. Devising Remedy for Reference Rot
The Remedy Is Quick Freeze & Archive
3 workflows in scholarly statement
① Preparation -> Study -> Compose -> Submission
② Publication -> Editing -> (Revision) -> Acceptance -> Issue
③ Post-Publication -> Deposit/Ingest -> Reader Access -> Use
To identify the best opportunities for Intervention to make Remedy,
to ‘flash-freeze’, either to avoid reference rot or to ‘stop the rot’
Identify the Actors & how to assist them to do the right thing!
Ideally at the earliest moment of capture
… when the Authors are trawling for content
… for what an Author regards as significant
… or needs to provide as evidence
… re-factoring the HTML link that is returned
Hiberlink Remedy: Components in a Robust Link
a) Take simple URI - to article in New Yorker magazine (say)
• http://www.newyorker.com/magazine/2015/01/26/cobweb
b) Augment Link with Datetime and Archive URI
• Archive timestamp: 2015-02-19T09:46:36
• http://web.archive.org/web/20150219094636/http://www.newyorker.com/magazine/2015/01/26/cobweb
Hiberlink.org
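The archive URI in this example follows the Internet Archive pattern of a 14-digit timestamp followed by the original URI. A minimal sketch of that derivation, assuming the timestamp is given in the ISO-style form shown on the slide; the function name is hypothetical.

```python
from datetime import datetime


def wayback_uri(original_uri: str, archived_at: str) -> str:
    """Build a web.archive.org URI from an original URI and a capture datetime."""
    # Parse the slide's timestamp format, e.g. "2015-02-19T09:46:36".
    moment = datetime.strptime(archived_at, "%Y-%m-%dT%H:%M:%S")
    # Re-serialise as YYYYMMDDhhmmss and prepend it to the original URI.
    return f"http://web.archive.org/web/{moment:%Y%m%d%H%M%S}/{original_uri}"
```

Called with the New Yorker URI and `2015-02-19T09:46:36`, this reproduces the archive URI shown above.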
What Robust Hiberlinks look like
• Hiberlinks are modified <a> HTML elements
• Include archive URL and timestamp as
additional attributes
<a href="http://www.newyorker.com/magazine/2015/01/26/cobweb"
   data-versionurl="http://web.archive.org/web/20150219094636/http://www.newyorker.com/magazine/2015/01/26/cobweb"
   data-versiondate="2015-02-19T09:46:36">Cobweb Article</a>
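A minimal sketch of producing and reading such a link, using only the Python standard library. The attribute names `data-versionurl` and `data-versiondate` come from the slide; the function and class names are hypothetical and this is not the Hiberlink implementation.

```python
from html import escape
from html.parser import HTMLParser


def robust_link(href, version_url, version_date, text):
    """Return an <a> element carrying the archive URL and datetime as data attributes."""
    return '<a href="{}" data-versionurl="{}" data-versiondate="{}">{}</a>'.format(
        escape(href, quote=True),
        escape(version_url, quote=True),
        escape(version_date, quote=True),
        escape(text),
    )


class RobustLinkParser(HTMLParser):
    """Collect every <a> element that has a data-versionurl attribute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        if "data-versionurl" in attributes:
            self.links.append({
                "href": attributes.get("href"),
                "versionurl": attributes["data-versionurl"],
                "versiondate": attributes.get("data-versiondate"),
            })


def find_robust_links(html_text):
    """Return the robust links found in an HTML fragment."""
    parser = RobustLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.links
```

A reader-side tool could use `find_robust_links` to offer the archived version when the original `href` no longer resolves.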
Help authors do the right thing:
① Trigger archiving of referenced web content
when it is noted, using a reference manager
e.g. EndNote, Reference Manager, Zotero
– Hiberlink Plug-in developed for Zotero
② Return a Datetime URI for archived content that
can be used in the citation
Remedy To Avoid Reference Rot
https://www.zotero.org/
Zotero workflow
• Create reference
• Add URL
• Update URL
• Duplicate reference
• Pass URL to archive service
• Receive archive URL
• Store data in database
• Add data to reference
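The later steps of this workflow (pass URL to archive service, receive archive URL, store, add to reference) can be sketched as below. This is not the Zotero plug-in's code: `Reference`, `archive_reference` and the `archive` callable are hypothetical names, the archive service is injected rather than called over the network, and reusing a stored result when a reference is duplicated is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

# Hypothetical archive service signature: URL -> (archive URL, archive datetime)
ArchiveFn = Callable[[str], Tuple[str, str]]


@dataclass
class Reference:
    title: str
    url: Optional[str] = None
    archive_url: Optional[str] = None
    archive_date: Optional[str] = None


def archive_reference(ref: Reference,
                      archive: ArchiveFn,
                      database: Dict[str, Tuple[str, str]]) -> Reference:
    """Archive the reference's URL (if any) and attach the result to the reference."""
    if not ref.url:
        return ref  # nothing to archive
    if ref.url not in database:
        # Pass URL to archive service, receive archive URL, store data.
        database[ref.url] = archive(ref.url)
    # Add data to reference.
    ref.archive_url, ref.archive_date = database[ref.url]
    return ref
```

Under this design an updated URL is a new key and so triggers a fresh capture, while a duplicated reference picks up the stored capture.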
Using the Plugin in Zotero
Opportunity / Time for a Demo ?
So what should we expect of the Publisher?
Beyond the assurance that
the fish / references / articles
sold are not rotten
Help Publishers do the right thing
The next best opportunity for Quick Freeze
• to avoid reference rot & to ‘stop the rot’
① Study: Preparation -> (Review) -> Submission
② Publication: Editorial -> (Revision) -> Issue
③ Post-Publication: Deposit/Ingest -> Provide/Access -> Use
Actors:
① The Author
② The Editor / Publisher
③ The Access Platform / Librarian / Archival Organisations
OJS plugin
1. Parses the document
• Converts .pdf to .html
• Extracts URIs
2. Archives the content for each reference
• The Author and Editor can choose which version is
used as the archival copy
3. Creates an HTML version of the document
• including a link to the archived version of each of the
references
Well Published References & Augmented Links
Post-Publication (& other bulk processing)
The last ‘best’ opportunity for Quick Freeze
• not to avoid reference rot but to ‘stop the rot’
① Study: Preparation -> (Review) -> Submission
• Should note & act for each URI, one by one
② Publication: Editorial -> (Revision) -> Issue
• (Probably) should examine each one by one
③ Post-Publication: Deposit/Ingest
• Cannot hope to process one by one
Actors:
① The Author
② The Editor / Publisher
③ Access Platforms / Archival Organisations / Librarians
& each article contains many references
Recall Key Aspect of Hiberlink Methodology
1. Convert Scholarly Statement from PDF into XML
2. Locate the references & extract each and every URL
• Many technical challenges
• URLs broken across line breaks; underscores rendered as images
• Use up to 15 regular expressions for matching; regard each match as a URI
=> Edinburgh Parser [github.com/hiberlink]
University of Edinburgh Language Technology Group:
Beatrice Alex, Claire Grover, Colin Matheson, Richard Tobin, Ke Zhou
Time to Build Infrastructure: HiberActive
Components: Publishing platform, HiberActive, External archival service (e.g. Internet Archive)
• Asynchronous (returns Robust Link)
• Distributed (archived with different organisations)
• Lightweight (leveraging HTTP & what already exists)
5. Summary
Hiberlink Outcomes
1. Defined the Threat of Reference Rot
2. Quantified the extent and way in which it
exists & undermines the Scholarly Record
3. Pointed to potential & practical Remedy
As the project comes to an end (June 2015), we wish to:
• Tell the world about these achievements
• Engage with others
– to build infrastructure
– to prompt adoption (copying) of prototypes by 3rd parties
• such as reference managers, editorial systems, publication
systems, archival systems
Thank you,
Questions welcome
http://hiberlink.org #hiberlink
Email: edina@ed.ac.uk
Still Time to Tweet to #UKSG15
