Presented in Glasgow at UKSG, 30 March - 1 April 2015, by Peter Burnhill and Richard Wincewicz.
This presentation looks at reference rot, link rot, and the work of Hiberlink to ensure web citations persist through time.
1. Reference Rot: Threat and Remedy
UKSG15
30 March - 1 April 2015
Funded by the Andrew W. Mellon Foundation
Peter Burnhill & Richard Wincewicz
EDINA, University of Edinburgh
for the Hiberlink Team at University of Edinburgh & LANL Research Library
2. The Project Team
2013 – 2015, funded by the
Andrew W. Mellon Foundation
• Los Alamos National Laboratory:
Research Library: Herbert Van de Sompel
Harihar Shankar, [Martin Klein, Rob Sanderson],
• University of Edinburgh:
Language Technology Group: Claire Grover,
Beatrice Alex, Colin Matheson, Richard Tobin, [Ke “Adam” Zhou]
EDINA * : Peter Burnhill, Muriel Mewissen (Project Manager),
Neil Mayo, Tim Stickland, Richard Wincewicz,
Centre for Service Delivery & Digital Expertise
5. When what was referenced & cited
ceases to say the same thing, or ‘has ceased to be’
http://www.snorgtees.com/this-parrot-has-ceased-to-be
1. The Threat of Reference Rot
“when links to web resources
no longer point to what they once did”
Reference Rot = Link Rot + Content Drift
7. + Content Drift: What is at the end of the URI has changed, or gone!
Example: http://dl00.org as it appeared in 2000, 2004, 2005 and 2008
(a) Dynamic content: values on the web page change over time
(b) Static content: replaced by very different (often unrelated) web pages
27. Hiberlink Project Methodology
to discover the answer to a two-part question
Do references to web-based content (URIs) work?
• Focus on content on ‘the wild Web’
• not that which is in e-journals etc
i. Impact of Time: Is the URI still on the ‘Live Web’?
• Allowed up to a maximum of 50 redirects
ii. Is a ‘Memento’ of that content in the ‘Archived Web’?
Memento: a prior version, what the Original Resource was like at some time in the past.
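A minimal sketch of the ‘Live Web’ check in part (i), assuming that "live" means the URI ends in a 2xx response after following at most 50 redirects. The slide states only the redirect cap; the 2xx rule, the `is_live` name and the injected `fetch` callable are assumptions made for illustration, not the project's code.

```python
from typing import Callable, Optional, Tuple

MAX_REDIRECTS = 50  # cap stated on the slide

# Hypothetical fetcher: takes a URI, returns (status_code, redirect_target_or_None).
Fetcher = Callable[[str], Tuple[int, Optional[str]]]


def is_live(uri: str, fetch: Fetcher, max_redirects: int = MAX_REDIRECTS) -> bool:
    """Return True if the URI reaches a 2xx response within the redirect cap."""
    current = uri
    # One initial request plus up to max_redirects followed redirects.
    for _ in range(max_redirects + 1):
        status, location = fetch(current)
        if 200 <= status < 300:
            return True
        if 300 <= status < 400 and location:
            current = location  # follow the redirect and try again
            continue
        return False  # 4xx, 5xx, or a redirect without a target
    return False  # redirect cap exhausted (e.g. a redirect loop)
```

Passing the fetcher in as a parameter keeps the sketch independent of any particular HTTP library, so the redirect logic can be exercised with a dictionary standing in for the web.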
28. 3. Large-scale Empirical Evidence
c. 400,000 articles across the three corpora (Row #5 in Table 2)
contained over a million ‘web at large’ references (Row #4 in Table 3)
Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One
in Five Articles Suffers from Reference Rot. PLoS ONE 9(12): e115253. doi:10.1371/journal.pone.0115253
29. A Key Aspect of Hiberlink Project Methodology
1. Convert Scholarly Statement from PDF into XML
2. Locate the references & extract each and every URL
• Many technical challenges
• URLs broken across newlines; underscores rendered as images
• Use up to 15 regular expressions for matching; regard each match as a URI
University of Edinburgh Language Technology Group:
Beatrice Alex, Claire Grover, Colin Matheson, Richard Tobin, Ke Zhou
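The extraction step can be illustrated with a small sketch. It uses a single illustrative pattern rather than the project's 15 regular expressions, and the rule for re-joining URLs split across lines (join only when the line ends in `/`, `_` or `-`) is an assumed heuristic that will miss other kinds of break. The names `rejoin_broken_urls` and `extract_urls` are hypothetical.

```python
import re
from typing import List

# Illustrative pattern only; the project reports using up to 15 expressions.
URL_RE = re.compile(r"https?://[^\s<>\"']+")

# Assumed heuristic: a URL that runs to the end of a line and ends in a
# separator character is taken to continue on the next line.
BROKEN_RE = re.compile(r"(https?://\S*[/_-])\n(?=\S)")

# Punctuation that commonly trails a URL in running text.
TRAILING = ".,;:!?)]}'\""


def rejoin_broken_urls(text: str) -> str:
    """Remove the newline inside URLs that were split after a separator."""
    return BROKEN_RE.sub(r"\1", text)


def extract_urls(text: str) -> List[str]:
    """Return the distinct URLs in text, in order of first appearance."""
    urls: List[str] = []
    for match in URL_RE.finditer(rejoin_broken_urls(text)):
        url = match.group(0).rstrip(TRAILING)
        if url not in urls:
            urls.append(url)
    return urls
```

Stripping trailing punctuation is a trade-off: it removes the comma or full stop that follows a URL in a sentence, but it would also truncate a URL that legitimately ends in a closing bracket.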
30. Scholarly Articles [in PMC] increasingly link to
Web Resources, not just back to other Articles
31. Scholarly Articles [in Elsevier] increasingly link to
Web Resources, not just back to other Articles
32. Mementos for URIs archived within 14 days of being referenced
PMC corpus
Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One
in Five Articles Suffers from Reference Rot. PLoS ONE 9(12): e115253. doi:10.1371/journal.pone.0115253
6 publicly accessible web archives for lookup: Internet Archive, archive.is (archive.today),
Archive-It, BL Web Archive, UK National Archives Web Archive & Icelandic National Archive
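The 14-day criterion can be written as a short check. This sketch assumes the window applies on either side of the date of reference and that the capture datetimes have already been gathered from the archives; both points, and the function name, are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable

WINDOW = timedelta(days=14)  # window stated on the slide


def has_memento_within_window(
    referenced: datetime,
    memento_datetimes: Iterable[datetime],
    window: timedelta = WINDOW,
) -> bool:
    """True if any capture lies within the window of the reference date.

    Assumption: the window is applied symmetrically (before or after).
    """
    return any(abs(captured - referenced) <= window for captured in memento_datetimes)
```

For a reference dated 1 May 2012, a capture on 10 May 2012 would count and a capture on 1 June 2012 would not.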
33. Mementos for URIs archived within 14 days of being referenced
Elsevier corpus
Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One
in Five Articles Suffers from Reference Rot. PLoS ONE 9(12): e115253. doi:10.1371/journal.pone.0115253
6 publicly accessible web archives for lookup: Internet Archive, archive.is (archive.today),
Archive-It, BL Web Archive, UK National Archives Web Archive & Icelandic National Archive
36. 3 workflows in scholarly statement
① Preparation -> Study -> Compose -> Submission
② Publication -> Editing -> (Revision) -> Acceptance -> Issue
③ Post-Publication -> Deposit/Ingest -> Reader Access -> Use
To identify the best opportunities for Intervention to make Remedy,
to ‘flash-freeze’, either to avoid reference rot or to ‘stop the rot’
Identify the Actors & how to help them do the right thing!
41. … re-factoring the HTML link that is returned
• http://www.newyorker.com/magazine/2015/01/26/cobweb
• Archive timestamp: 2015-02-19T09:46:36
• http://web.archive.org/web/20150219094636/http://www.newyorker.com/magazine/2015/01/26/cobweb
Hiberlink Remedy: Components in a Robust Link
a) Take simple URI - to article in New Yorker magazine (say)
b) Augment Link with Datetime and Archive URI
Hiberlink.org
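The relationship between the three components on this slide can be sketched in a few lines: the archive URI is the original URI prefixed with the archive's base address and a 14-digit form of the timestamp. The URI layout is taken from the slide's Internet Archive example; the function name is hypothetical, and other archives may lay out their URIs differently.

```python
from datetime import datetime


def wayback_uri(original_uri: str, archived_at: str) -> str:
    """Build an archive URI in the layout shown on the slide.

    archived_at is an ISO-style timestamp such as 2015-02-19T09:46:36.
    """
    stamp = datetime.strptime(archived_at, "%Y-%m-%dT%H:%M:%S")
    # The archive URI embeds the timestamp as YYYYMMDDhhmmss.
    return f"http://web.archive.org/web/{stamp:%Y%m%d%H%M%S}/{original_uri}"
```

Called with the New Yorker URI and the timestamp 2015-02-19T09:46:36, it reproduces the archive URI listed on the slide.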
42. What Robust Hiberlinks look like
• Hiberlinks are modified <a> HTML elements
• Include archive URL and timestamp as
additional attributes
<a href="http://www.newyorker.com/magazine/2015/01/26/cobweb"
   data-versionurl="http://web.archive.org/web/20150219094636/http://www.newyorker.com/magazine/2015/01/26/cobweb"
   data-versiondate="2015-02-19T09:46:36">Cobweb Article</a>
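A minimal sketch of how such an element might be generated. The attribute names `data-versionurl` and `data-versiondate` come from the slide; the `robust_link` function and its parameters are hypothetical.

```python
from html import escape


def robust_link(href: str, version_url: str, version_date: str, text: str) -> str:
    """Render an <a> element carrying the archive URL and timestamp."""
    return '<a href="{}" data-versionurl="{}" data-versiondate="{}">{}</a>'.format(
        escape(href, quote=True),          # original URI
        escape(version_url, quote=True),   # URI of the archived copy
        escape(version_date, quote=True),  # datetime of the archived copy
        escape(text),                      # visible link text
    )
```

Escaping the attribute values matters in practice because many URIs contain `&` in their query strings, which would otherwise produce invalid HTML.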
43. Help authors do the right thing:
① Triggering archiving of referenced web content
when it is noted, using a reference manager
e.g. EndNote, Reference Manager, Zotero
– Hiberlink Plug-in developed for Zotero
② Returns Datetime URI for archived content that
can be used in the citation
Remedy To Avoid Reference Rot
https://www.zotero.org/
46. So what should we expect of the Publisher?
Beyond the assurance that
the fish / references / articles
sold are not rotten
47. Help Publishers do the right thing
The next best opportunity for Quick Freeze
• to avoid reference rot & to ‘stop the rot’
① Study: Preparation -> (Review) -> Submission
② Publication: Editorial -> (Revision) -> Issue
③ Post-Publication: Deposit/Ingest -> Provide/Access -> Use
Actors:
① The Author
② The Editor / Publisher
③ The Access Platform / Librarian / Archival Organisations
48. OJS plugin
1. Parses the document
• Converts .pdf to .html
• Extracts URIs
2. Archives the content for each reference
• The Author and Editor can choose which version is
used as the archival copy
3. Creates an HTML version of the document
• including a link to the archived version of each of the
references
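Step 3 can be sketched as a rewrite of the links in an HTML document. This is not the OJS plugin's code: the `archive` callable stands in for whatever archiving service is used, the regular expression handles only simple `<a href="...">` tags, and all names are hypothetical.

```python
import re
from html import escape, unescape
from typing import Callable, Tuple

# Hypothetical archiver: takes a URI, returns (archive_uri, iso_datetime).
Archiver = Callable[[str], Tuple[str, str]]

# Matches only the simplest form of link: <a href="..."> with no other attributes.
A_TAG = re.compile(r'<a\s+href="([^"]+)"\s*>')


def add_archive_links(html: str, archive: Archiver) -> str:
    """Add archive URL and datetime attributes to each simple link."""

    def rewrite(match: "re.Match[str]") -> str:
        href = match.group(1)  # kept exactly as written in the source HTML
        version_url, version_date = archive(unescape(href))
        return '<a href="{}" data-versionurl="{}" data-versiondate="{}">'.format(
            href, escape(version_url, quote=True), escape(version_date, quote=True)
        )

    return A_TAG.sub(rewrite, html)
```

A production version would more likely use an HTML parser than a regular expression, and would give the author and editor the chance to choose between candidate archived versions, as the slide describes.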
50. Post-Publication (& other bulk processing)
The last ‘best’ opportunity for Quick Freeze
• not to avoid reference rot but to ‘stop the rot’
① Study: Preparation -> (Review) -> Submission
• Should note & act for each URI, one by one
② Publication: Editorial -> (Revision) -> Issue
• (Probably) should examine each one by one
③Post-Publication: Deposit/Ingest
• Cannot hope to process one by one
Actors:
① The Author
② The Editor / Publisher
③ Access Platforms / Archival Organisations / Librarians
53. Recall Key Aspect of Hiberlink Methodology
1. Convert Scholarly Statement from PDF into XML
2. Locate the references & extract each and every URL
• Many technical challenges
• URLs broken across newlines; underscores rendered as images
• Use up to 15 regular expressions for matching; regard each match as a URI
=> Edinburgh Parser [github.com/hiberlink]
University of Edinburgh Language Technology Group:
Beatrice Alex, Claire Grover, Colin Matheson, Richard Tobin, Ke Zhou
54. Time to Build Infrastructure:
Components: Publishing platform, HiberActive, External archival service (e.g. Internet Archive)
• Asynchronous (returns Robust Link)
• Distributed (archived with different organisations)
• Lightweight (leveraging HTTP & what already exists)
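The "asynchronous" and "distributed" properties can be illustrated with a sketch that sends one URI to several archives concurrently and collects whatever comes back. This is an assumption about how such a service could behave, not a description of HiberActive itself; the archiver coroutines and all names are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional

# Hypothetical archiver: a coroutine that takes a URI and returns the archive URI.
Archiver = Callable[[str], Awaitable[str]]


async def archive_everywhere(
    uri: str, archivers: Dict[str, Archiver]
) -> Dict[str, Optional[str]]:
    """Submit the URI to every archive at once; map archive name to result.

    An archive that fails is recorded as None instead of aborting the rest.
    """
    names = list(archivers)
    results = await asyncio.gather(
        *(archivers[name](uri) for name in names),
        return_exceptions=True,  # keep going when one archive raises
    )
    return {
        name: (None if isinstance(result, BaseException) else result)
        for name, result in zip(names, results)
    }
```

Because failures are isolated per archive, a publishing platform could still build a Robust Link from any archive that succeeded.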
56. Hiberlink Outcomes
1. Defined the Threat of Reference Rot
2. Quantified the extent and way in which it
exists & undermines the Scholarly Record
3. Pointed to potential & practical Remedy
As the project comes to an end (June 2015), we wish to:
• Tell the world about these achievements
• Engage with others
– to build infrastructure
– to prompt adoption (copying) of prototypes by 3rd parties
• such as reference managers, editorial systems, publication systems, archival systems