Reference Rot and E-Theses: Threat and Remedy

An overview of how the Hiberlink project relates to the persistence on the web of digital versions of theses. Given by Peter Burnhill, Director of EDINA, at the 17th International Symposium on Electronic Theses & Dissertations - which took place from 23 July to 25 July 2014 at the University of Leicester in the UK.

Transcript of "Reference Rot and E-Theses: Threat and Remedy"

  1. Reference Rot and E-Theses: Threat and Remedy
     Hiberlink, ETD2014, Leicester UK, July 25th 2014
     Funded by the Andrew W. Mellon Foundation
     Peter Burnhill, EDINA, University of Edinburgh
     & for the Hiberlink Team at University of Edinburgh & LANL Research Library
     Centre for Service Delivery & Digital Expertise
  2. Overview
     1. The Hiberlink Project & Reference Rot
     2. Evidence of Threat of Reference Rot for the E-Thesis
        • Our methods, data source & findings
     3. Devising Remedy for Reference Rot in E-Thesis
        • Proposals for intervention: plug-ins & infrastructural solutions
     4. Next Steps: who (else) wants to take this work forward?
  3. Reference Rot = Link Rot + Content Drift
     “when links to web resources no longer point to what they once did”
     Investigating Reference Rot in Web-Based Scholarly Communication
  4. Link Rot
  5. + Content Drift: What is at end of URI has changed, or gone!
     http://dl00.org in 2000, 2004, 2005 and 2008
     (a) Dynamic content, as values on webpage change over time
     (b) Static content, but very different (often unrelated) web pages
  6. An International Team at Work, funded by the Andrew W. Mellon Foundation
     • Los Alamos National Laboratory:
       Research Library: Martin Klein, (Rob Sanderson), Harihar Shankar, Herbert Van de Sompel
     • University of Edinburgh:
       Language Technology Group: Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou
       EDINA *: Neil Mayo, Muriel Mewissen (Project Manager), Christine Rees, Tim Stickland, Richard Wincewicz, Peter Burnhill
     Centre for Service Delivery & Digital Expertise
  7. What we are doing in Hiberlink, 2013 - 2015
     1. Creating evidence on extent of ‘Reference Rot’
        – Main focus has been on references (& URIs) made in Journal Articles
          • Includes work on reference rot in Supreme Court judgments with Harvard Law Library & Perma.cc
        – ETD2014 is opportunity to look at Reference Rot & the e-Thesis
     2. Understanding the preparation/publication workflow
        – Identifying opportunity for productive intervention
     3. Prototypes for pro-active archiving to enable remedy
        – Embedding such ‘solutions’ in existing tools & infrastructure
     4. Raising awareness & seeking collaborative actions … through events like this
  8. Evidence on the Threat of Reference Rot for E-Theses
  9. Retrieving thinking about the emerging e-Thesis in 1998
     University Theses Online Group, 1994/99
     Initiated by U of Edinburgh & UC London, as referenced by Susan Copeland in ‘E-Theses Developments in the UK’, 2003
  10. Retrieving thinking about the emerging e-Thesis in 1998
      University Theses Online Group, 1994/99
      Initiated by U of Edinburgh & UC London, as referenced by Susan Copeland in ‘E-Theses Developments in the UK’, 2003
  11. Measuring the Extent of ‘Reference Rot’ in e-Theses
      Data Source
      • Looked for corpus of e-Theses for our study period of 1997 – 2012
      • Interested only in Doctoral Theses/Dissertations
      • NDLTD Union Catalogue
      Basic Method
      a) Define selection and use information in the metadata record
         • Degree awarded (PhD etc); Department
         • Date thesis was successfully defended
         • Link to the full text of the Doctoral Thesis
      b) Download selected e-Thesis from each Institution’s Repository
  12. 7,500 E-Theses Downloaded from 5 US Institutions
      In passing: note decline in numbers indicates ‘lag’ in ingest/availability of e-Theses
  13. Key Aspects of Methodology (Stage 1)
      1. Convert those e-Theses from PDF into XML
         • pdftohtml -xml
      2. Locate the references & extract each and every URL
         • Technical challenges: URL broken/newline; underscore as image
         • Use up to 15 regular expressions for matching; regard as URI
      UoEdin Language Technology Group: Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou
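
The URL extraction step on slide 13 can be illustrated with a minimal Python sketch. The slides do not give the (up to 15) regular expressions actually used, so `URL_PATTERN`, `rejoin_broken_urls` and `extract_urls` below are hypothetical names and a single simplified pattern, assuming the text has already been pulled out of the converted XML.

```python
import re

# Hypothetical single pattern; the project reportedly used up to 15
# expressions, none of which are given in the slides.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

# Assumption: a line break straight after '/' or '-' inside a URL is a
# layout break introduced by the PDF, not the end of the URL.
BROKEN_URL = re.compile(r"(https?://\S*[/\-])\n(?=\S)")

# Punctuation that often follows a URL in running text. Stripping ')' is a
# simplification and would damage URLs that legitimately end with it.
TRAILING_PUNCTUATION = ".,;:)]}'\""


def rejoin_broken_urls(text: str) -> str:
    """Repeatedly glue together URLs that were split across lines."""
    previous = None
    while previous != text:
        previous = text
        text = BROKEN_URL.sub(r"\1", text)
    return text


def extract_urls(text: str) -> list:
    """Return the distinct URLs found in text, in order of first appearance."""
    found = []
    for match in URL_PATTERN.findall(rejoin_broken_urls(text)):
        url = match.rstrip(TRAILING_PUNCTUATION)
        if url not in found:
            found.append(url)
    return found


sample = "See http://www.example.org/path/\nto/page.html, and (https://example.com/a_b)."
urls = extract_urls(sample)
```

The rejoining rule is a heuristic: a reference whose URL genuinely ends with `/` at the end of a line would be glued to the following text, which is one reason a real pipeline would need several patterns and manual checks.
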
  14. Key Aspects of Methodology (Stage 2)
      47,067 URIs were extracted
      These were partitioned into two types:
      i. 1,086 publisher sites, representing very many references to online articles: ‘the scholarly record’
         • BTW, who does keep those articles in the Scholarly Record safe?
         • Ask me for evidence on that!
      ii. 45,981 URIs that linked ‘the Web at large’
         • to Web content required for scholarship
         • inc. websites, software, blogs, videos, online debate etc
         • to that which lacks ‘fixity’ and changes over time
      Those c.46,000 are the focus for the Hiberlink Project
  15. Increase in Linking to ‘Web-at-large’ Resources, 1997-2010
      beyond the e-journal, to that which lacks ‘fixity’ and changes over time
      [Chart: URIs, by Year Thesis Defended (%), 1997 - 2010; x-axis: Year; y-axis: %; annotated at 50%]
  16. But Wide ‘Between-Thesis’ Variation in Number of Web Links
      [Chart: box plots of medians (averages) & quartiles, with ‘outliers’; x-axis: Year Defended, 1997 to 2010; y-axis: Count (Log10)]
      • 10% of Theses have 25 or more URIs
      Median (average) increases from 4 to 5.5
      • 75% have 2+ URIs per Thesis
      Focus on e-Theses defended from 2003
  17. Methodology (Stage 3): to discover answer to 2 questions
      i. Do those links (URIs) still work? Is the URI on the ‘Live Web’?
         • Allowing up to a maximum of 50 redirects, recording the HTTP transaction chain and regarding a 2XX status code as ‘live’
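
The ‘live’ test on slide 17 can be sketched as follows. This is an illustration under stated assumptions, not the project's code: `check_live` and the `Fetcher` type are hypothetical names, and the HTTP request itself is passed in as a function so that the redirect logic can be shown (and tried) without network access. A real fetcher would issue one request per call without following redirects itself.

```python
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin

MAX_REDIRECTS = 50  # limit stated on the slide

# A fetcher takes a URI and returns (status code, Location header or None).
Fetcher = Callable[[str], Tuple[int, Optional[str]]]


def check_live(uri: str, fetch: Fetcher) -> Tuple[bool, List[Tuple[str, int]]]:
    """Follow up to MAX_REDIRECTS redirects and record the transaction chain.

    Returns (is_live, chain), where chain lists every (URI, status) visited.
    A URI counts as live only if the chain ends in a 2XX status code.
    """
    chain = []
    current = uri
    # One initial request plus at most MAX_REDIRECTS follow-up requests.
    for _ in range(MAX_REDIRECTS + 1):
        status, location = fetch(current)
        chain.append((current, status))
        if 200 <= status < 300:
            return True, chain
        if 300 <= status < 400 and location:
            # Location may be relative, so resolve it against the current URI.
            current = urljoin(current, location)
            continue
        return False, chain  # 4XX, 5XX, or a redirect without a target
    return False, chain  # redirect limit exhausted


# Canned responses standing in for the network (made-up example hosts).
canned = {
    "http://a.example/": (301, "/moved"),
    "http://a.example/moved": (200, None),
    "http://gone.example/": (404, None),
}
live, chain = check_live("http://a.example/", lambda u: canned[u])
```

Keeping the whole chain, rather than only the final status, matches the slide's point about recording the HTTP transaction chain: it lets a later analysis distinguish a page that moved from one that vanished.
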
  18. Methodology (Stage 3): to discover answer to 2 questions
      i. Do those links (URIs) still work? Is the URI on the ‘Live Web’?
         • Allowing up to a maximum of 50 redirects, recording the HTTP transaction chain and regarding a 2XX status code as ‘live’
      ii. Is there a ‘Memento’ of that reference in the ‘Archived Web’?
         Memento: a prior version, what the Original Resource was like at some time in the past.
  19. Methodology (Stage 3): to discover answer to 2 questions
      i. Do those links (URIs) still work? Is the URI on the ‘Live Web’?
      ii. Is there a ‘Memento’ of that reference in the ‘Archived Web’?
         • Archival check carried out in June 2014, using installed version of Memento tool developed by LANL http://www.mementoweb.org/guide/quick-intro/
         • A ‘Datetime’ version at or near the date the Thesis was defended
         • Searching across several archives (not just Internet Archive)
      Approach first used in pilot work at LANL; UoEdin Language Technology Group: Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou
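
The ‘Datetime’ lookup on slide 19 follows the Memento protocol, in which a client asks a TimeGate for the version of a URI closest to a given date via the `Accept-Datetime` request header, and the archived copy reports its capture time in a `Memento-Datetime` response header. The sketch below only prepares such a request and measures how far a returned Memento is from the defence date. The TimeGate address is an assumption (a public aggregator); the study used its own installed Memento tool, whose endpoint is not given in the slides, and both function names are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

# Assumption: public Memento aggregator TimeGate, not the endpoint used in the study.
TIMEGATE_PREFIX = "http://timetravel.mementoweb.org/timegate/"


def build_timegate_request(uri, defended, timegate_prefix=TIMEGATE_PREFIX):
    """Return (request URL, headers) asking for the Memento nearest to `defended`.

    `defended` must be a timezone-aware datetime; it is converted to GMT
    because Accept-Datetime uses the HTTP date format.
    """
    accept_datetime = format_datetime(defended.astimezone(timezone.utc), usegmt=True)
    return timegate_prefix + uri, {"Accept-Datetime": accept_datetime}


def memento_offset_days(memento_datetime_header, defended):
    """Days between the defence date and the capture time in a Memento-Datetime header."""
    captured = parsedate_to_datetime(memento_datetime_header)
    return (captured - defended).total_seconds() / 86400


defended = datetime(2007, 5, 31, 20, 35, tzinfo=timezone.utc)
url, headers = build_timegate_request("http://dl00.org", defended)
offset = memento_offset_days("Sun, 03 Jun 2007 20:35:00 GMT", defended)
```

How close a Memento must be to count as ‘at or near’ the defence date is a choice the slides leave open, so the sketch reports the offset in days rather than applying a threshold.
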
  20. A Measure of Reference Rot: Are those references available?
      [in 6,400 e-Theses defended in 2003-2010 at 5 US universities]
      Less than two-thirds of those links lead to live content

                Live on Web   Not Found on ‘Live Web’   All
      Count     29,122        16,860                    45,982
      %         63.3          36.7                      100

      After up to 50 redirects
      1st Order Indicator of ‘Reference Rot’: more than one third of references to the Web subject to ‘rot’
  21. Confirm?: 2/3rds ‘Live’ URI Ratio same across ‘Big 3’, 2003-2010
      [Chart: Live Ratio (0.0 to 1.0) by Institution: fsu, psu, vt, wpi, zund]
      => ‘On average’ 1/3rd of the links in an e-Thesis are ‘rotten’
  22. The older the citation, the less likely to be still on the live Web
      [excluding 0s&1s: a few theses are unaffected; a few are ruined]
      [Chart: Live Ratio (0.0 to 0.8) by Months Elapsed (50 to 125); number of months elapsed from Date Thesis Defended until date archives checked (June 2014)]
      We can’t stop that process of rot: Web content changes over time, Reference Rot is inevitable function of time
  23. Searching for ‘Datetime’ Mementos of content in ‘Archived Web’
      [in 6,400 e-Theses defended in 2003-2010 at 5 US universities]

      %                      Live on Web   Not found on ‘Live Web’   All
      Found to be Archived                                           47.6
      Not Found                                                      52.4
      All                                                            100

      There seems a 50:50 chance that referenced content is in the ‘Archived Web’.
      Some content is being ‘co-incidentally harvested’ by routine web archiving.
      => half of those references are at ‘risk of loss’
  24. 50:50 chance that ‘DateTime Reference’ is ‘Incidentally Archived’
      [Chart: Archive Ratio (0.0 to 1.0) by Institution: fsu, psu, vt, wpi, zund]
  25. ‘Incidental Archiving’ is constant over time
      (This is an ‘upper bound estimate’, independent of age of e-thesis)
      [Chart: Archive Ratio (0.0 to 0.8) by Months Elapsed (50 to 125)]
      We can improve upon this ‘50:50 chance’ by pro-actively archiving what we cite
  26. We already have ‘Lost Content’ for References to Web
      [in 6,400 e-Theses defended in 2003-2010 at 5 US universities]

      %                      Live on Web   Not found on ‘Live Web’   All
      Found to be Archived   29.3          18.3                      47.6
      Not Found              34.0          18.4                      52.4
      All                    63.3          36.7                      100

      18.4% ‘not live & not found in archive’ judged to be lost forever
      34% ‘live’ & ‘not in archive’ is at risk of loss
      NB: The 34% ‘at risk’ could be saved by pro-active archiving
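
The two-way table on slide 26 can be checked with a few lines of Python. The cell percentages are the ones stated on the slide; the function names and the label ‘archived’ for the two remaining cells are illustrative choices, not terms from the talk.

```python
# Cell percentages from slide 26, as a share of all Web-at-large URIs.
# Keys are (live status, archive status).
cells = {
    ("live", "archived"): 29.3,
    ("live", "not_archived"): 34.0,
    ("not_live", "archived"): 18.3,
    ("not_live", "not_archived"): 18.4,
}


def marginal(table, axis, value):
    """Sum the cells whose key matches `value` on the given axis (0 = live, 1 = archive)."""
    return round(sum(p for key, p in table.items() if key[axis] == value), 1)


def classify(live, archived):
    """Label a reference using the slide's terms; 'archived' is an illustrative label."""
    if archived:
        return "archived"
    return "at risk" if live else "lost"


live_total = marginal(cells, 0, "live")              # row 'All', column 'Live on Web'
rotten_total = marginal(cells, 0, "not_live")        # the 'reference rot' figure
archived_total = marginal(cells, 1, "archived")      # 'Found to be Archived'
unarchived_total = marginal(cells, 1, "not_archived")
```

The marginals reproduce the totals on slides 20 and 23, which confirms that the four cells are shares of the same population of URIs.
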
  27. Devising Remedy for Reference Rot in e-Theses
  28. Our General Approach
      Having demonstrated problem exists & is severe
      • The Web changes over time: reference rot occurs (36.7%)
      • Incidental archiving via routine of web archiving initiatives delivers no more than 50:50 chance of success
      • Seek pro-active ‘transactional archiving’ solutions
        – focus on what is regarded by authors as important
      • Thereby to remedy the integrity of the scholarly record
      We aim to embed ‘solutions’ in existing tools & infrastructure
  29. Strategy for Making Remedy
      a) Understand the preparation/publication workflow
         – identifying where there can be productive intervention
      b) Devise prototypes for pro-active archiving
         – writing & implementing code!
      c) Propose/test infrastructure for temporal referencing
         – supporting & using the Memento protocol
      We are embedding ‘solutions’ in existing tools & infrastructure
  30. Understanding 3 workflows: Rot or Remedy?
      Identify the Actors
      Extended length of stages in workflows magnify reference rot & affect
      ① Study -> Preparation -> (Review) -> Submission
      ② Post-Submission -> Examination -> (Revision) -> Award
      ③ Post-Award -> Deposit/Ingest -> Provide/Access -> Use
      Doctoral Student (& Supervisor)
      Faculty, Examiners & Supervisor
      University & Library
      Identify the best opportunities for Intervention to make Remedy
  31. ‘Work in progress’ to effect Remedy
      1. Hiberlink Plug-in - to enable pro-active archiving
      2. Missing Link - re-factoring the HTML link
      3. HiberActive - enables repositories to ‘stop the rot’ via actively archiving those references in e-theses
      LANL: Martin Klein, Harihar Shankar, Herbert Van de Sompel
      UoEd EDINA: Neil Mayo, Tim Stickland, Richard Wincewicz
  32. ‘Work in progress’ to effect Remedy (1)
      1. Hiberlink Plug-in - to help authors and middle-folk (publishers/librarians) do the right thing:
         – Zotero - used by authors to manage references https://www.zotero.org/
         – Open Journal System (OJS) - used by OA publishers https://pkp.sfu.ca/ojs/
  33. Hiberlink Plug-in for Zotero
      For use during preparation of thesis & before final submission, but also before deposit with Library (& maybe for repair by Library …)
      ① Triggers archiving of referenced web content
      ② Returns Datetime URI for archived content
  34. ‘Work in progress’ to effect Remedy (2)
      1. Hiberlink Plug-in - to enable pro-active archiving
      2. Missing Link - re-factor the HTML link that is returned
         a) Take simple URI - to French National Library (say)
            <a href="http://www.bnf.fr"> Link to the BNF </a>
         b) Augment Link with a set of Datetime & location pairs
            <a href="http://www.bnf.fr" mset="2014-05-19, http://archive.today/zdpAn 2014-05-15 memento"> Link to the BNF </a>
  35. ‘Work in progress’ to effect Remedy (3)
      1. Hiberlink Plug-in - to enable pro-active archiving
      2. Missing Link - re-factoring the HTML link
      First two approaches support ‘perfect scenario’:
         • All authors archive all their cited URIs
         • e.g. (but not exclusively) with Hiberlink / Zotero
      3. HiberActive
         – Enables repositories to ‘stop the rot’ by actively archiving those references in e-theses
         – A notification hub, a component for the infrastructure
           • testing workflow with ResourceSync, CORE & external archive programme
  36. Next Steps: who wants to take this work forward?
      to ensure references in e-Theses don’t rot
      • Need to move from the ‘incidental Web archiving’ of cited URIs to pro-active archiving, by student/authors & by libraries
      a) Offer to be an early adopter for these Hiberlink remedies
         • The Hiberlink Plug-in for Zotero / HiberActive
         Email: edina@ed.ac.uk Subject: Hiberlink ETD
      b) Amend ‘Guidance for ETD Lifecycle Management’
  37. Thank you, Questions welcome
      http://hiberlink.org #hiberlink
      Email: edina@ed.ac.uk
  38. Aside: We would all like to assume that our libraries are ensuring that online e-journal content is being kept safe
      But online articles in the Scholarly Record are not in the custody of Libraries, nor on their digital shelves.
      Picture credit: http://somanybooksblog.com/2009/03/27/library-tour/
  39. Evidence from The Keepers Registry is worrying!
      ① Compare what is being kept by the (10) leading archiving agencies (CLOCKSS, Portico, national libraries etc) with all issued with ISSN
         ‘Ingest Ratio’ = titles being ingested by one or more Keeper / ‘online serials’ in ISSN Register = 23,268 / 136,965 [in March 2014] => 17%
         * We do not know about 83% of e-serials having ISSN *
         ‘KeepSafe Ratio’ = ingest by 3+ Keepers = 9,652 / 136,965 => 7%
      ② Title Lists of 3 US research libraries (Columbia, Cornell & Duke), checked in 2011/12
         ‘Ingest Ratio’ = 22% to 28%; c.75% unknown fate
      ③ User-centric Evidence, usage logs for the UK OpenURL Router*
         => over two thirds 68% (36,326 titles) held by none!
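
The two Keepers Registry ratios on slide 39 are simple quotients of the counts quoted there. A minimal sketch, with `ratio_percent` as a hypothetical helper name:

```python
def ratio_percent(numerator: int, denominator: int) -> int:
    """Express numerator / denominator as a whole-number percentage."""
    return round(100 * numerator / denominator)


ONLINE_SERIALS_WITH_ISSN = 136_965  # 'online serials' in the ISSN Register, March 2014

# Titles ingested by one or more Keeper.
ingest_ratio = ratio_percent(23_268, ONLINE_SERIALS_WITH_ISSN)
# Titles ingested by three or more Keepers.
keepsafe_ratio = ratio_percent(9_652, ONLINE_SERIALS_WITH_ISSN)
# Share of e-serials with an ISSN whose archival status is unknown.
unknown_share = 100 - ingest_ratio
```
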