Reference Rot and E-Theses: Threat and Remedy
Hiberlink
ETD2014, Leicester UK July 25th 2014
Funded by the Andrew W. Mellon Foundation
Peter Burnhill
EDINA, University of Edinburgh &
for the Hiberlink Team at University of Edinburgh & LANL Research Library
Centre for Service Delivery & Digital Expertise
Overview
1.  The Hiberlink Project & Reference Rot
2.  Evidence of Threat of Reference Rot for the E-Thesis
•  Our methods, data source & findings
3.  Devising Remedy for Reference Rot in E-Thesis
•  Proposals for intervention: plug-ins & infrastructural solutions
4.  Next Steps: who (else) wants to take this work forward?
Reference Rot = Link Rot + Content Drift
“when links to web resources
no longer point to what they once did”
Investigating Reference Rot in Web-Based Scholarly Communication
Link Rot
+ Content Drift: What is at end of URI has changed, or gone!
http://dl00.org as seen in 2000, 2004, 2005 and 2008
(a) Dynamic content, as values on webpage change over time
(b) Static content, but very different (often unrelated) web pages
An International Team at Work
funded by the
Andrew W. Mellon Foundation
•  Los Alamos National Laboratory:
Research Library: Martin Klein, (Rob Sanderson),
Harihar Shankar, Herbert Van de Sompel
•  University of Edinburgh:
Language Technology Group: Beatrice Alex, Claire Grover,
Richard Tobin, Ke “Adam” Zhou
EDINA * : Neil Mayo, Muriel Mewissen (Project Manager),
Christine Rees, Tim Stickland, Richard Wincewicz, Peter Burnhill
Centre for Service Delivery & Digital Expertise
What we are doing in Hiberlink, 2013 - 2015
1.  Creating evidence on extent of ‘Reference Rot’
–  Main focus has been on references (& URIs) made in Journal Articles
•  Includes work on reference rot in Supreme Court judgments with Harvard Law
Library & Perma.cc
–  ETD2014 is opportunity to look at Reference Rot & the e-Thesis
2.  Understanding the preparation/publication workflow
–  Identifying opportunity for productive intervention
3.  Prototypes for pro-active archiving to enable remedy
–  Embedding such ‘solutions’ in existing tools & infrastructure
4.  Raising awareness & seeking collaborative actions
…. through events like this
Evidence on the Threat of Reference Rot for
E-Theses
Retrieving thinking about the emerging e-Thesis in 1998
University Theses Online Group, 1994/99
Initiated by U of Edinburgh & UC London, as referenced by Susan Copeland in
‘E-Theses Developments in the UK’, 2003
Measuring the Extent of ‘Reference Rot’ in e-Theses	
  
Data Source
•  Looked for corpus of e-Theses for our study period of 1997 – 2012
•  Interested only in Doctoral Theses/Dissertations
•  NDLTD Union Catalogue
Basic Method
a)  Define selection and use information in the metadata record
•  Degree awarded (PhD etc); Department
•  Date thesis was successfully defended
•  Link to the full text of the Doctoral Thesis
b)  Download selected e-Thesis from each Institution’s Repository
7,500 E-Theses Downloaded from 5 US Institutions
In passing: note decline in numbers indicates ‘lag’ in ingest/availability of e-Theses
Key Aspects of Methodology (Stage 1)	
  
1.  Convert those e-Theses from PDF into XML
•  pdftohtml -xml
2.  Locate the references & extract each and every URL
•  Technical challenges: URL broken/newline; underscore as image
•  Use up to 15 regular expressions for matching; regard as URI
UoEdin Language Technology Group:
Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou
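The extraction step can be illustrated with a minimal sketch. It assumes the thesis text has already been converted and is available as a plain string; it uses a single illustrative pattern rather than the team's (up to 15) expressions, which are not given in these slides. The names `extract_urls` and `rejoin_broken_urls`, and the rule for rejoining a URL split by a newline, are assumptions made for illustration. The 'underscore as image' problem is not handled here.

```python
import re

# One illustrative pattern; the study used up to 15 expressions, not reproduced here.
URL_PATTERN = re.compile(r"""(?:https?|ftp)://[^\s<>"']+""", re.IGNORECASE)

# Heuristic (an assumption): a URL whose line ends in one of / - _ = & ?
# is taken to continue on the next line.
BROKEN_URL = re.compile(r"((?:https?|ftp)://\S*[/\-_=&?])\n\s*(?=\S)", re.IGNORECASE)

# Characters that often follow a URL in running text but are not part of it.
# Note: stripping ')' will damage the rare URL that really ends in a parenthesis.
TRAILING_PUNCTUATION = ".,;:)]}'\""


def rejoin_broken_urls(text: str) -> str:
    """Remove the line break (and indentation) inside a URL split across lines."""
    return BROKEN_URL.sub(r"\1", text)


def extract_urls(text: str) -> list[str]:
    """Return each distinct URL found in the text, in order of first appearance."""
    urls: list[str] = []
    for match in URL_PATTERN.finditer(rejoin_broken_urls(text)):
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        if url not in urls:
            urls.append(url)
    return urls


sample = "See http://example.org/a/\nb.html, and (https://example.com/x)."
print(extract_urls(sample))
```

A single pattern like this will miss cases that a bank of specialised expressions would catch, which is presumably why several were needed in practice.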
Key Aspects of Methodology (Stage 2)	
  
47,067 URIs were extracted
These were partitioned into two types:
i.  1,086 publisher sites, representing very many references to online articles:
‘the scholarly record’
•  BTW, who does keep those articles in the Scholarly Record safe?
•  Ask me for evidence on that!
ii.  45,981 URIs that linked to ‘the Web at large’
•  to Web content required for scholarship
•  inc. websites, software, blogs, videos, online debate etc
•  to that which lacks ‘fixity’ and changes over time
Those c.46,000 are the focus for the Hiberlink Project
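One way such a partition could be made is by comparing each URI's host against a list of known publisher platforms. The sketch below assumes that approach; the slides do not say how the split was actually done, and the host names and the function `partition_uris` are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical hosts standing in for a curated list of publisher platforms.
PUBLISHER_HOSTS = {"journals.example.org", "publisher.example.com"}


def partition_uris(uris, publisher_hosts=PUBLISHER_HOSTS):
    """Split URIs into (scholarly_record, web_at_large) by host name."""
    scholarly_record, web_at_large = [], []
    for uri in uris:
        host = (urlsplit(uri).hostname or "").lower()
        if host in publisher_hosts:
            scholarly_record.append(uri)
        else:
            web_at_large.append(uri)
    return scholarly_record, web_at_large


scholarly, at_large = partition_uris([
    "http://journals.example.org/article/1",
    "http://blog.example.net/post",
])
print(len(scholarly), len(at_large))
```

The reported figures are consistent with a simple two-way split: 47,067 extracted URIs less 1,086 gives the 45,981 'Web at large' URIs.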
  
Increase in Linking to ‘Web-at-large’ Resources, 1997-2010
beyond the e-journal, to that which lacks ‘fixity’ and changes over time
[Chart: URIs, by Year Thesis Defended (%), 1997 - 2010; x-axis: Year; y-axis: %; with a ‘50%’ annotation]
But Wide ‘Between-Thesis’ Variation in Number of Web Links
[Chart: box plots of medians (averages) & quartiles, with ‘outliers’; x-axis: YearDefended, 1997 - 2010; y-axis: Count (Log10); one value labelled 1373]
•  10% of Theses have 25 or more URIs
Median (average) increases from 4 to 5.5
•  75% have 2+ URIs per Thesis
Focus on e-Theses defended from 2003
Methodology (Stage 3): to discover answer to 2 questions
i.  Do those links (URIs) still work? Is the URI on the ‘Live Web’?
•  Allowing up to a maximum of 50 redirects, recording the HTTP transaction chain and regarding a 2XX status code as ‘live’
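The live-Web check in question (i) can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the `fetch` argument is a hypothetical caller-supplied function that makes one HTTP request without following redirects and returns the status code and any `Location` header. Only the 50-redirect limit and the 2XX rule come from the slide.

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 50  # limit stated on the slide


def check_live(uri, fetch, max_redirects=MAX_REDIRECTS):
    """Follow redirects from `uri`; return (is_live, chain).

    `fetch(uri)` must return (status_code, location_or_None) for a single
    request. `chain` records every (uri, status) pair visited; a network
    failure is recorded with a status of None.
    """
    chain = []
    current = uri
    # One initial request plus at most `max_redirects` redirected requests.
    for _ in range(max_redirects + 1):
        try:
            status, location = fetch(current)
        except OSError:
            chain.append((current, None))
            return False, chain
        chain.append((current, status))
        if 200 <= status < 300:
            return True, chain
        if 300 <= status < 400 and location:
            # Location may be relative, so resolve it against the current URI.
            current = urljoin(current, location)
            continue
        return False, chain
    return False, chain  # redirect limit exceeded


# Canned responses standing in for the network.
responses = {
    "http://a.example/": (301, "/b"),
    "http://a.example/b": (200, None),
    "http://gone.example/": (404, None),
}
print(check_live("http://a.example/", responses.__getitem__))
```

Passing `fetch` in keeps the redirect logic testable offline; a real run would supply a function built on an HTTP client of choice.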
	
  
ii.  Is there a ‘Memento’ of that reference in the ‘Archived Web’?
	
  
Memento: a prior version, what the Original Resource was like at some time in the past.
•  Archival check carried out in June 2014, using installed version of
Memento tool developed by LANL
http://www.mementoweb.org/guide/quick-intro/
•  A ‘Datetime’ version at or near the date the Thesis was defended
•  Searching across several archives (not just Internet Archive)
Approach first used in pilot work at LANL;
UoEdin Language Technology Group:
Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou
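The archival check for question (ii) relies on the Memento protocol, in which a client asks a TimeGate for a version of a URI near a given date by sending an `Accept-Datetime` header. The sketch below only builds such a request and picks the closest of a list of candidate Mementos; it does not contact any archive. The TimeGate base URL is a placeholder, and the function names are hypothetical, not those of the LANL tool.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def timegate_request(timegate_base: str, uri: str, when: datetime) -> Request:
    """Build (but do not send) a TimeGate request for `uri` near `when`.

    `when` must be timezone-aware; it is converted to UTC and written in
    the HTTP date format that Accept-Datetime expects.
    """
    when_utc = when.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}
    return Request(timegate_base + uri, headers=headers)


def closest_memento(mementos, when):
    """From (datetime, uri) pairs, return the pair nearest to `when`, or None."""
    return min(mementos, key=lambda m: abs(m[0] - when), default=None)


defended = datetime(2008, 5, 15, tzinfo=timezone.utc)
request = timegate_request("http://timegate.example/timegate/", "http://www.bnf.fr", defended)
print(request.full_url)
print(request.header_items())
```

Using the defence date as the target datetime mirrors the slide's aim of finding a version 'at or near the date the Thesis was defended'.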
A Measure of Reference Rot: Are those references available?
[in 6,400 e-Theses defended in 2003-2010 at 5 US universities]
Less than two-thirds of those links lead to live content

         Live on Web   Not Found on ‘Live Web’   All
Count    29,122        16,860                    45,982
%        63.3          36.7                      100%
(After up to 50 redirects)

1st Order Indicator of ‘Reference Rot’: more than one third of references
to the Web subject to ‘rot’
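As a quick check, the percentages in this table follow directly from the counts. The snippet below uses only the figures reported on the slide; the helper name `percent` is illustrative.

```python
live = 29_122      # URIs returning live content
not_live = 16_860  # URIs not found on the live Web
total = live + not_live


def percent(part: int, whole: int) -> float:
    """Share of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


print(total, percent(live, total), percent(not_live, total))
```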
  
Confirm?: 2/3rds ‘Live’ URI Ratio same across ‘Big 3’, 2003-2010
[Chart: Live Ratio (0.0 to 1.0) by Institution: fsu, psu, vt, wpi, zund]
=> ‘On average’ one third of the links in an e-Thesis are ‘rotten’
The older the citation, the less likely to be still on the live Web
[excluding 0s&1s: a few theses are unaffected; a few are ruined]
[Chart: Live Ratio (0.0 to 0.8) against MonthsElapsed (50 to 125)]
MonthsElapsed: number of months elapsed from Date Thesis Defended until date archives checked (June 2014)
We can’t stop that process of rot:
Web content changes over time,
Reference Rot is an inevitable function of time
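The elapsed-months measure on the x-axis can be computed as below. This is a sketch: the exact day of the June 2014 check is not given, so the first of the month is assumed, and days within a month are ignored.

```python
from datetime import date

ARCHIVES_CHECKED = date(2014, 6, 1)  # June 2014; the day is an assumption


def months_elapsed(defended: date, checked: date = ARCHIVES_CHECKED) -> int:
    """Whole calendar months between the defence date and the check date."""
    return (checked.year - defended.year) * 12 + (checked.month - defended.month)


print(months_elapsed(date(2010, 1, 15)), months_elapsed(date(2003, 6, 1)))
```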
  	
  
	
  
Searching for ‘Datetime’ Mementos of content in ‘Archived Web’
[in 6,400 e-Theses defended in 2003-2010 at 5 US universities]
%                      Live on Web   Not found on ‘Live Web’   All
Found to be Archived                                           47.6
Not Found                                                      52.4
All                                                            100%

There seems a 50:50 chance that referenced content is in the ‘Archived Web’.
Some content is being ‘co-incidentally harvested’ by routine web archiving.
=> half of those references are at ‘risk of loss’
[Chart: Archive Ratio (0.0 to 1.0) by Institution: fsu, psu, vt, wpi, zund]
50:50 chance that ‘DateTime Reference’ is ‘Incidentally Archived’
‘Incidental Archiving’ is constant over time
(This is an ‘upper bound estimate’, independent of age of e-thesis)
[Chart: Archive Ratio (0.0 to 0.8) against MonthsElapsed (50 to 125)]
We can improve upon this ‘50:50 chance’
by pro-actively archiving what we cite
We already have ‘Lost Content’ for References to Web
[in 6,400 e-Theses defended in 2003-2010 at 5 US universities]
%                      Live on Web   Not found on ‘Live Web’   All
Found to be Archived   29.3          18.3                      47.6
Not Found              34.0          18.4                      52.4
All                    63.3          36.7                      100%

18.4% ‘not live & not found in archive’: judged to be lost forever
34% ‘live’ & ‘not in archive’: is at risk of loss
NB: The 34% ‘at risk’ could be saved by pro-active archiving
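The two-way table combines the outcomes of the two checks for each URI. A minimal sketch of that cross-tabulation follows, assuming each URI has already been given a pair of booleans (live, archived); the function name is hypothetical, and the `published` figures are copied from the slide so that the margins can be checked.

```python
from collections import Counter


def cross_tabulate(results):
    """Percentage of URIs in each (is_live, is_archived) cell.

    `results` is an iterable of (is_live, is_archived) boolean pairs,
    one pair per URI.
    """
    counts = Counter(results)
    total = sum(counts.values())
    return {cell: 100 * n / total for cell, n in counts.items()}


# Percentages reported on the slide, keyed by (is_live, is_archived).
published = {
    (True, True): 29.3,    # live and archived
    (False, True): 18.3,   # not live, but found in an archive
    (True, False): 34.0,   # live, not archived: at risk of loss
    (False, False): 18.4,  # not live, not archived: judged lost
}

archived = round(published[(True, True)] + published[(False, True)], 1)
live = round(published[(True, True)] + published[(True, False)], 1)
print(archived, live)
```

The cell values sum to the margins shown earlier: 47.6% archived and 63.3% live.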
Devising Remedy for Reference Rot
in e-Theses
Having demonstrated problem exists & is severe
•  The Web changes over time: reference rot occurs (36.7%)
•  Incidental archiving via routine of web archiving initiatives
delivers no more than 50:50 chance of success
•  Seek pro-active ‘transactional archiving’ solutions
–  focus on what is regarded by authors as important
•  Thereby to remedy the integrity of the scholarly record
We aim to embed ‘solutions’ in existing tools & infrastructure
Our General Approach
a)  Understand the preparation/publication workflow
–  identifying where there can be productive intervention
b)  Devise prototypes for pro-active archiving
–  writing & implementing code!
c)  Propose/test infrastructure for temporal referencing
–  supporting & using the Memento protocol
We are embedding ‘solutions’ in existing tools & infrastructure
Strategy for Making Remedy
Understanding 3 workflows: Rot or Remedy?
Identify the Actors
Extended length of stages in workflows magnify reference rot & affect
①  Study -> Preparation -> (Review) -> Submission: Doctoral Student (& Supervisor)
②  Post-Submission -> Examination -> (Revision) -> Award: Faculty, Examiners & Supervisor
③  Post-Award -> Deposit/Ingest -> Provide/Access -> Use: University & Library
Identify the best opportunities for Intervention to make Remedy
1.  Hiberlink Plug-in - to enable pro-active archiving
2.  Missing Link - re-factoring the HTML link
3.  HiberActive - enables repositories to ‘stop the rot’ via
actively archiving those references in e-theses
LANL: Martin Klein, Harihar Shankar, Herbert Van de Sompel
UoEd EDINA: Neil Mayo, Tim Stickland, Richard Wincewicz
‘Work in progress’ to effect Remedy
1.  Hiberlink Plug-in - to help authors and middle-folk
(publishers/librarians) do the right thing:
– Zotero - used by authors to manage references
https://www.zotero.org/
–  Open Journal Systems (OJS) - used by OA publishers
https://pkp.sfu.ca/ojs/
‘Work in progress’ to effect Remedy (1)
For use during preparation of thesis & before final submission
but also
before deposit with Library (& maybe for repair by Library …)
Hiberlink Plug-in for Zotero
①  Triggers archiving of referenced web content
②  Returns Datetime URI for archived content
1.  Hiberlink Plug-in - to enable pro-active archiving
2.  Missing Link - re-factor the HTML link that is returned
‘Work in progress’ to effect Remedy (2)
a)  Take simple URI - to French National Library (say)
<a href="http://www.bnf.fr">
Link to the BNF
</a>
b)  Augment Link with a set of Datetime & location pairs
<a href="http://www.bnf.fr"
mset="2014-05-19,
http://archive.today/zdpAn 2014-05-15 memento">
Link to the BNF
</a>
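The augmentation step could be sketched as follows. This is only an illustration that reproduces the slide's single example: the `mset` syntax is work in progress, the meaning of the leading date is not stated (it is treated here as the date the link was made), and joining several snapshots with commas is an assumption. The function name `augment_link` is hypothetical.

```python
from html import escape


def augment_link(href, text, link_date, snapshots):
    """Return an <a> element carrying an `mset` attribute.

    `snapshots` is a list of (archive_uri, snapshot_date) pairs; dates are
    passed as ISO strings such as "2014-05-15".
    """
    parts = [link_date] + [f"{uri} {when} memento" for uri, when in snapshots]
    mset = ", ".join(parts)
    return (
        f'<a href="{escape(href, quote=True)}" '
        f'mset="{escape(mset, quote=True)}">{escape(text)}</a>'
    )


print(augment_link(
    "http://www.bnf.fr",
    "Link to the BNF",
    "2014-05-19",
    [("http://archive.today/zdpAn", "2014-05-15")],
))
```

The original `href` is left untouched, so a browser that ignores `mset` still follows the link to the live Web.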
  
1.  Hiberlink Plug-in - to enable pro-active archiving
2.  Missing Link - re-factoring the HTML link
First two approaches support ‘perfect scenario’:
•  All authors archive all their cited URIs
•  e.g. (but not exclusively) with Hiberlink / Zotero
3.  HiberActive
–  Enables repositories to ‘stop the rot’
by actively archiving those references in e-theses
–  A notification hub, a component for the infrastructure
•  testing workflow with ResourceSync, CORE
& external archive programme
‘Work in progress’ to effect Remedy (3)
Next Steps: who wants to take this work forward?
to ensure references in e-Theses don’t rot
•  Need to move from the ‘incidental Web archiving’ of cited URIs
to pro-active archiving, by student/authors & by libraries
a)  Offer to be an early adopter for these Hiberlink remedies
•  The Hiberlink Plug-in for Zotero / HiberActive
Email: edina@ed.ac.uk
Subject: Hiberlink ETD
b)  Amend ‘Guidance for ETD Lifecycle Management’
Thank you,
Questions welcome
http://hiberlink.org #hiberlink
Email: edina@ed.ac.uk
Picture credit: http://somanybooksblog.com/2009/03/27/library-tour/
Aside: We would all like to assume that our libraries are
ensuring that online e-journal content is being kept safe
But online articles in the Scholarly Record are not in
the custody of Libraries, nor on their digital shelves.
Evidence from The Keepers Registry is worrying!
①  Compare what is being kept by the (10) leading archiving agencies
(CLOCKSS, Portico, national libraries etc) with all issued with ISSN
‘Ingest Ratio’ = titles being ingested by one or more Keeper
/ ‘online serials’ in ISSN Register
= 23,268 / 136,965 [in March 2014] => 17%
* We do not know about 83% of e-serials having ISSN *
‘KeepSafe Ratio’ = ingest by 3+ Keepers = 9,652 / 136,965 => 7%
②  Title Lists of 3 US research libraries (Columbia, Cornell & Duke),
checked in 2011/12: ‘Ingest Ratio’ = 22% to 28%; c.75% unknown fate
③  User-centric Evidence, usage logs for the UK OpenURL Router*
=> over two thirds 68% (36,326 titles) held by none!
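The two Keepers Registry ratios can be reproduced from the counts quoted above. The snippet uses only those figures; the variable and function names are illustrative.

```python
ONLINE_SERIALS = 136_965    # 'online serials' in the ISSN Register, March 2014
INGESTED_BY_ONE = 23_268    # titles ingested by one or more Keeper
INGESTED_BY_THREE = 9_652   # titles ingested by three or more Keepers


def ratio_percent(numerator: int, denominator: int) -> int:
    """Ratio expressed as a whole-number percentage."""
    return round(100 * numerator / denominator)


ingest_ratio = ratio_percent(INGESTED_BY_ONE, ONLINE_SERIALS)
keepsafe_ratio = ratio_percent(INGESTED_BY_THREE, ONLINE_SERIALS)
print(ingest_ratio, keepsafe_ratio, 100 - ingest_ratio)
```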
