Reference Rot and Linked Data: Threat and Remedy

Delivered by Peter Burnhill, Director of EDINA, at the PRELIDA Consolidation and Dissemination workshop on 17/18 October 2014 (http://prelida.eu/consolidation-workshop).

Summary: The web changes over time, and significant reference rot inevitably occurs. Web archiving delivers only a 50% chance of success. So in addition to the original URI, the link should be augmented with temporal context to increase robustness.


Reference Rot and Linked Data: Threat and Remedy

  1. 1. Reference Rot and Linked Data: Threat and Remedy Peter Burnhill EDINA, University of Edinburgh for the Hiberlink Team at University of Edinburgh & LANL Research Library PRELIDA 18/19th October 2014 Funded by the Andrew W. Mellon Foundation
  2. 2. The Project Team 2013 – 2015, funded by the Andrew W. Mellon Foundation • Los Alamos National Laboratory: Research Library: Martin Klein, [Rob Sanderson], Harihar Shankar, Herbert Van de Sompel • University of Edinburgh: Language Technology Group: Beatrice Alex, Claire Grover, Colin Matheson, Richard Tobin, [Ke “Adam” Zhou] EDINA * : Neil Mayo, Muriel Mewissen (Project Manager), Tim Stickland, Richard Wincewicz, Peter Burnhill Centre for Service Delivery & Digital Expertise Funded by the Andrew W. Mellon Foundation PRELIDA 18/19th October 2014
  3. 3. 1. Social Science Research Council [now ESRC, UK] – ‘Scientific Officer’ funding Data & use of Data 2. Scottish Education Data Archive, 1979 – 1984/1987 – Survey statistician: school leavers, YTS, 16-19 cohort surveys; demand for HE 3. Edinburgh University Data Library, 1984 to present – President of IASSIST, 1997 – 2001: social science data professionals 4. ESRC Regional Research Laboratory for Scotland, 1986 -1990 – Co-director, early days of Geographical Information Systems (GIS) – member of Data Task Force, UK Inter-Agency Global Env. Change 5. Graduate School, Faculty of Social Science, UofEd 1987 – 1997 – Senior Lecturer (p/t), teaching quantitative/survey methods – Director of RAPID: ESRC Research Activity & Publications Information Database 6. EDINA national data centre, 1995/6 to present – Director: set-up and continuous development; Jisc-funded UK national services 7. UK Digital Curation Centre (DCC), 2003/04 - 2004/05 – Director for set-up & definition of ‘data curation + digital preservation’ 8. CLOCKSS Founder & Board Member / LOCKSS deployment 3 Data Manufacturing Data Brokering Spatial Data & MetaData
  4. 4. licence to use Ensuring researchers, students and their teachers have easy and continuing access to online resources used for scholarship P.Burnhill, Edinburgh 2009 access to content & services
  5. 5. Buckland: thinking about Digital Libraries mix of the document tradition (signifying objects & their use) & the computation tradition (applying algorithmic, logical, mathematical, and mechanical techniques to information management) “Both traditions are needed. Information Science is rooted in part in humanities and qualitative social sciences. The landscape of Information Science is complex. An ecumenical view is needed.” – M.Buckland, Journal of American Society for Information Science, 50, 1999 2 (non-convergent) mentalities, Document-ness & Computation + a third dimension, the domain of application: • Academic discipline – if we do this for ourselves • Business area – if we do this for use beyond …
  6. 6. Related Activity by Partners • Los Alamos National Laboratory Research Library: • Memento • ResourceSync • http://www.niso.org/workrooms/resourcesync/ • University of Edinburgh / Informatics / Language Technology Group: • Text mining / Edinburgh Parser • University of Edinburgh/ Jisc / EDINA : • CLOCKSS / LOCKSS • Keepers Registry • https://www.era.lib.ed.ac.uk/handle/1842/6682
  7. 7. Top level Problem: We would like to assume that our libraries are ensuring that online e-journal content is being kept safe But online articles in the Scholarly Record are not in the custody of Libraries, nor on their digital shelves. Picture credit: http://somanybooksblog.com/2009/03/27/library-tour/
  8. 8. Evidence from <thekeepers.org> is worrying! The Keepers Registry aggregates what is being kept by the (10) leading archiving agencies (CLOCKSS, Portico, national libraries etc) with all issued with ISSN ① ‘Ingest Ratio’ = titles being ingested by one or more Keeper / ‘online serials’ in ISSN Register = 23,268 / 136,965 [in March 2014] => 17% * We do not know about 83% of e-serials having ISSN * ‘KeepSafe Ratio’ = ingest by 3+ Keepers = 9,652 / 136,965 => 7% ② Title Lists of 3 US research libraries (Columbia, Cornell & Duke), checked in 2011/12: ‘Ingest Ratio’ = 22% to 28%; c.75% unknown fate ③ User-centric Evidence, UK usage in 2012, UK OpenURL Router logs => over two thirds, 68% (36,326 titles), held by none!
  9. 9. Memento The Memento "Time Travel for the Web" protocol http://mementoweb.org/ • an interoperable approach to access web archives (IETF RFC 7089) • adopted by all major public archives worldwide, including the Internet Archive. • Memento for Chrome http://bit.ly/memento-for-chrome • This protocol underpins the work being done in Hiberlink
  10. 10. Now, about Reference Rot & Linked Data … 1. Some definitions • What is Reference Rot? • What may be special about Linked Data? 2. Evoking metaphor • The moment / snapshot / memento • Flash-freezing to avoid or to stop the rot (of fruit on vine) 3. Evidence of Threat of Reference Rot 4. Devising Remedy for Reference Rot • Proposals for intervention: plug-ins & infrastructural solutions 5. Next Steps: how to take this work forward?
  11. 11. Investigating Reference Rot in Web-Based Scholarly Communication Reference Rot = Link Rot + Content Drift “when links to web resources no longer point to what they once did”
  12. 12. Link Rot ‘Link Rot’
  13. 13. + Content Drift: What is at end of URI has changed, or gone! http://dl00.org 2000 http://dl00.org 2004 (a) Dynamic content as values on webpage changes over time http://dl00.org 2005 http://dl00.org 2008 (b) Static content but very different (often unrelated) web pages
  14. 14. What of Linked Data? One or more sets of 3 linked URIs: conversation or statements for the long term? As time passes, so the content at the end of each of those URIs will suffer: Reference Rot = Link Rot + Content Drift “when links to web resources no longer point to what they once did” “Adding eScience Assets to the Data Web”, Herbert Van de Sompel, Carl Lagoze, Michael L. Nelson, Simeon Warner, Robert Sanderson, Pete Johnston. Proceedings of Linked Data on the Web (LDOW2009) Workshop, [v1] Thu, 11 Jun 2009 15:33:37 GMT http://arxiv.org/abs/0906.2135v1
  15. 15. Example: ‘mark up’ archaeological site record (metadata)
  16. 16. RDF graph: Article & Supplementary Data http://www.emeraldinsight.com/fig/0350570303002.png 1. Build and publish as metadata in XML format to be found on the web 2. Publishing text and data/multimedia content in XML will delight researchers • Researchers want to access ‘article as data’, via computational algorithm
  17. 17. What we are doing in Hiberlink 1. Creating evidence on extent of ‘Reference Rot’ – Main focus has been on references (& URIs) made in Journal Articles • Inc. reference rot in Supreme Court judgments with Harvard Law Library & permaCC – ETD2014 was opportunity to look at Reference Rot & the e-Thesis – PRELIDA is opportunity to look at impact on Linked Data 2. Understanding the preparation/publication/ingest workflow(s) – Identifying opportunity for productive intervention 3. Prototypes for pro-active archiving to enable remedy – Embedding such ‘solutions’ in existing tools & infrastructure 4. Raising awareness & seeking collaborative actions … through events like this
  18. 18. Empirical evidence on the Threat of Reference Rot Large-scale analyses: Journal Articles & E-Theses
  19. 19. Methodology: to discover answer to 2 questions i. Do those links (URIs) still work? Is the URI on the ‘Live Web’? • Allowing up to a maximum of 50 redirects, recording the HTTP transaction chain and regarding a 2XX status code as ‘live’
  20. 20. Methodology: to discover answer to 2 questions i. Do those links (URIs) still work? Is the URI on the ‘Live Web’? • Allowing up to a maximum of 50 redirects, recording the HTTP transaction chain and regarding a 2XX status code as ‘live’ ii. Is there a ‘Memento’ of that reference in the ‘Archived Web’? Memento: a prior version, what the Original Resource was like at some time in the past.
  21. 21. A Measure of Reference Rot: Are those references available? [in 6,400 e-Theses defended in 2003-2010 at 5 US universities] After up to 50 redirects: Live on Web: 29,122 (63.3%); Not Found on ‘Live Web’: 16,860 (36.7%); All: 45,982 (100%). Less than two-thirds of those links lead to live content. 1st Order Indicator of ‘Reference Rot’: more than one third of references to the Web subject to ‘rot’
  22. 22. References in Citations Rot over Time: URIs cease to exist on the live Web [excluding 0s&1s: a few theses are unaffected; a few are ruined] We can’t stop that process of rot: Web content changes over time, Reference Rot is an inevitable function of time. Number of months elapsed from Date Thesis Defended until date archives checked (June 2014)
  23. 23. Searching for ‘Datetime’ Mementos of content in ‘Archived Web’ [in 6,400 e-Theses defended in 2003-2010 at 5 US universities] Found to be Archived: 47.6%; Not Found: 52.4%; All: 100%. There seems a 50:50 chance that referenced content is in the ‘Archived Web’. => half of those references are at ‘risk of loss’. Some content is being ‘co-incidentally harvested’ by routine web archiving.
  24. 24. ‘Incidental Archiving’ is constant over time (This is an ‘upper bound estimate’, independent of age of e-thesis) We can improve upon this ‘50:50 chance’ by pro-actively archiving what we cite
  25. 25. We already have ‘Lost Content’ for References to Web [in 6,400 e-Theses defended in 2003-2010 at 5 US universities] Percentages (Live on Web / Not found on ‘Live Web’ / All): Found to be Archived: 29.3 / 18.3 / 47.6; Not Found: 34.0 / 18.4 / 52.4; All: 63.3 / 36.7 / 100%. 18.4% ‘not live & not found in archive’ judged to be lost forever. 34% ‘live’ & ‘not in archive’ is at risk of loss. NB: The 34% ‘at risk’ could be saved by pro-active archiving
  26. 26. Hiberlink Next Phase: in-depth study of Content Drift But demonstrated that problem exists & is severe • The Web changes over time: significant reference rot occurs • Routine Web Archiving delivers no better than 50:50 chance of success of having co-incidentally archived what you referenced - and probably much less chance when we check extent of content drift - Not (yet) studied impact on Linked Data but expect similar
  27. 27. “Researchers need to know when information on a viewed page has changed.” “Authors of long-shelf-life material want to be sure that their links will still work far into the future.” Jonathan Zittrain, Larry Lessig and Kendra Albert report that • Harvard Law Review 75% of links are dead • top 1% Impact Factor Journals 10% of links dead just 15 months after publication • US Supreme Court decisions 29% of links dead 49% of links do not point to the original target http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2329161
  28. 28. Devising Remedy for Reference Rot for Linked Data?
  29. 29. Strategy for Making Remedy Seek pro-active ‘transactional archiving’ solutions – focus on what is regarded by authors as important a) Understand the preparation/publication workflow – identifying where there can be productive intervention b) Devise prototypes for pro-active archiving – writing & implementing code! c) Propose/test infrastructure for temporal referencing – supporting & using the Memento protocol Where possible, we wish to embed ‘solutions’ in existing tools & infrastructure
  30. 30. 3 workflows in scholarly statement ① Preparation -> Study -> Compose -> (Review) -> Submission ② Publication -> (Editorial) Examination -> (Revision) -> Acceptance -> Issue ③ Post-Publication -> Deposit/Ingest -> Provide/Access -> Use Extended length of stages in workflows magnifies reference rot & its effect, as referenced content on the web rots over time Identify the best opportunities for Intervention to make Remedy, to ‘flash-freeze’, either to avoid reference rot or to ‘stop the rot’ What are the key workflows for the manufacture, release and use of Linked Data?
  31. 31. 3 workflows in Linked Data What are the key workflows for the manufacture, release and use of Linked Data? ① Manufacture-> Create- > (Review) -> Prepare to publish/release/commit ② Authority: Release-> (Editorial)Examination -> (Revision) -> Acceptance ③ Use: Curate -> Deposit/Ingest -> Provide/Access -> Use What is it that changes over time: concepts, assigned attributes; why and on what timescale? Identify the best opportunities for Intervention to make Remedy, to ‘flash-freeze’, either to avoid reference rot or to ‘stop the rot’
  32. 32. ‘Work in progress’ to effect Remedy 1. Hiberlink Plug-in - for pro-active ‘transactional’ archiving – At the time of authoring (ie manufacture) 2. Missing Link - re-factoring the HTML link – By which one annotates with {DateTime; location of archived copy/ies} 3. HiberActive - a system for actively archiving references – Designed to ‘stop the rot’, a lossy 2nd best to ‘transactional archiving’ LANL: Martin Klein, Harihar Shankar, Herbert Van de Sompel UoEd EDINA: Neil Mayo, Tim Stickland, Richard Wincewicz Hiberlink ETD2014, Leicester UK July 25th 2014 Funded by the Andrew W. Mellon Foundation
  33. 33. Hiberlink Plug-in [for Zotero] ① Triggers archiving of referenced web content ② Returns DateTime URI for archived content For use during authoring [manufacture] of information object & before final issue but also before ingest by ‘library’ (& maybe for repair by ‘library’ …)
  34. 34. ‘Work in progress’ to effect Remedy (2) 1. Hiberlink Plug-in - to enable pro-active archiving 2. Missing Link - re-factor the HTML link that is returned a) Take simple URI - to French National Library (say) b) Augment Link with a set of Datetime & location pairs Prepared by: Herbert Van de Sompel, Martin Klein, Robert Sanderson - Los Alamos National Laboratory Michael Nelson - Old Dominion University http://mementoweb.org/missing-link/
  35. 35. ‘Work in progress’ to effect Remedy (3) 1. Hiberlink Plug-in - to enable pro-active archiving 2. Missing Link - re-factoring the HTML link First two approaches support ‘perfect scenario’: • All authors archive all their cited URIs • e.g. (but not exclusively) with Hiberlink / Zotero 3. HiberActive – Enables repositories to ‘stop the rot’ by actively archiving those references in e-theses – A notification hub, a component for the infrastructure • testing workflow with ResourceSync, CORE & external archive programme
  36. 36. Summary • The Web changes over time: significant reference rot inevitably occurs (as a function of time) • Web Archiving delivers only c.50:50 chance of success of co-incidentally archiving what you referenced • Link by means of the original URI, at time of manufacture • But then …. Augment the link with temporal context, to increase robustness of link to referenced content o Date of linking o URI of archived snapshot(s) • Then again, maybe this is all about archiving to support citation and not really about ‘preservation’, but it does assist continuity of access
  37. 37. Multi-level Problem: Digital Shelving for The Research Object; First Order References; Second Order References; …. Simple Statements [with URIs] 1st Order References [with URIs] Complex Research Objects {URIs} Picture credit: http://somanybooksblog.com/2009/03/27/library-tour/ 1st Order References {URI} 2nd Order References [with URIs] 2nd Order References {URI} “Digital information is best preserved by replicating it [on digital shelving] at multiple archives run by autonomous organizations” B. Cooper and H. Garcia-Molina (2002)
  38. 38. Next Steps: how to take this work forward? to ensure URI/references don’t rot • Need to move from the ‘incidental Web archiving’ of cited URIs to pro-active archiving, by makers of Linked Data & by repositories? • Engage with these Hiberlink remedies • The Hiberlink Plug-in for Zotero / HiberActive Email: edina@ed.ac.uk Subject: Hiberlink ETD
  39. 39. http://hiberlink.org #hiberlink Thank you, Questions welcome & check: http://hiberlink.org/news.html Email: edina@ed.ac.uk Funded by the Andrew W. Mellon Foundation
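
The two methodology questions on slides 19-20 and the cross-tabulation on slide 25 can be illustrated with a small sketch. It is not the Hiberlink team's code: the function names and category labels below are hypothetical, and it assumes each reference has already been reduced to a recorded chain of HTTP status codes plus a yes/no answer on whether a Memento was found.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

MAX_REDIRECTS = 50  # redirect limit stated on slides 19-20


def is_live(status_chain: Sequence[int], max_redirects: int = MAX_REDIRECTS) -> bool:
    """Question (i): treat a URI as 'live' if its chain ends in a 2XX code.

    status_chain is the recorded HTTP transaction chain for one URI, i.e.
    zero or more redirect responses followed by a final response
    (assumed input format, not taken from the slides).
    """
    if not status_chain:
        return False
    redirects_followed = len(status_chain) - 1
    return redirects_followed <= max_redirects and 200 <= status_chain[-1] < 300


def classify(live: bool, archived: bool) -> str:
    """Combine questions (i) and (ii) into the four cells of slide 25."""
    if live and archived:
        return "live, archived"
    if live:
        return "live, not archived (at risk)"
    if archived:
        return "not live, archived"
    return "not live, not archived (lost)"


def cross_tabulate(refs: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Return the percentage of references falling in each cell."""
    counts = Counter(classify(live, archived) for live, archived in refs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


if __name__ == "__main__":
    # Made-up sample of four references, one per cell; not project data.
    sample = [(True, True), (True, False), (False, True), (False, False)]
    for label, pct in cross_tabulate(sample).items():
        print(f"{label}: {pct}%")
```

Read this way, the ‘at risk’ figure on slide 25 is the share of references in the "live, not archived" cell, and the ‘lost’ figure is the share in the "not live, not archived" cell.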

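
The link augmentation described on slides 34 and 36 (keep the original URI, then add the date of linking and the URI of one or more archived snapshots) might look roughly like the sketch below. The helper name `augmented_link` is hypothetical, and the attribute names `data-versiondate` and `data-versionurl` are assumptions modelled on the Missing Link proposal at http://mementoweb.org/missing-link/; check them against that document before relying on them.

```python
from datetime import date
from html import escape
from typing import Sequence


def augmented_link(
    original_uri: str,
    text: str,
    linked_on: date,
    snapshot_uris: Sequence[str] = (),
) -> str:
    """Build an HTML anchor that keeps the original URI and adds temporal context.

    Attribute names are assumptions (see the lead-in), not a confirmed standard.
    """
    attrs = [
        f'href="{escape(original_uri, quote=True)}"',
        # Date of linking, in ISO 8601 form (YYYY-MM-DD).
        f'data-versiondate="{linked_on.isoformat()}"',
    ]
    if snapshot_uris:
        # Space-separated list of archived snapshot URIs.
        joined = " ".join(snapshot_uris)
        attrs.append(f'data-versionurl="{escape(joined, quote=True)}"')
    return f'<a {" ".join(attrs)}>{escape(text)}</a>'


if __name__ == "__main__":
    # Placeholder URIs for illustration only.
    print(
        augmented_link(
            "http://example.org/page",
            "Example",
            date(2014, 10, 18),
            ["https://archive.example/20141018/http://example.org/page"],
        )
    )
```

The original `href` is left untouched, so the link still works for clients that know nothing about the extra attributes; a Memento-aware tool could use the date and snapshot URIs when the live page has rotted or drifted.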