An overview of how the Hiberlink project relates to the persistence on the web of digital versions of theses. Given by Peter Burnhill, Director of EDINA, at the 17th International Symposium on Electronic Theses & Dissertations - which took place from 23 July to 25 July 2014 at the University of Leicester in the UK.
1. Reference Rot and E-Theses: Threat and Remedy
Hiberlink
ETD2014, Leicester UK July 25th 2014
Funded by the Andrew W. Mellon Foundation
Peter Burnhill
EDINA, University of Edinburgh &
for the Hiberlink Team at University of Edinburgh & LANL Research Library
Centre for Service Delivery & Digital Expertise
2. Overview
1. The Hiberlink Project & Reference Rot
2. Evidence of Threat of Reference Rot for the E-Thesis
• Our methods, data source & findings
3. Devising Remedy for Reference Rot in E-Thesis
• Proposals for intervention: plug-ins & infrastructural solutions
4. Next Steps: who (else) wants to take this work forward?
3. Reference Rot = Link Rot + Content Drift
“when links to web resources
no longer point to what they once did”
Investigating Reference Rot in Web-Based Scholarly Communication
5. + Content Drift: What is at end of URI has changed, or gone!
[Screenshots of http://dl00.org as it appeared in 2000, 2004, 2005 and 2008]
(a) Dynamic content as values on webpage changes over time
(b) Static content but very different (often unrelated) web pages
6. An International Team at Work
funded by the
Andrew W. Mellon Foundation
• Los Alamos National Laboratory:
Research Library: Martin Klein, (Rob Sanderson), Harihar Shankar, Herbert Van de Sompel
• University of Edinburgh:
Language Technology Group: Beatrice Alex, Claire Grover,
Richard Tobin, Ke “Adam” Zhou
EDINA * : Neil Mayo, Muriel Mewissen (Project Manager),
Christine Rees, Tim Stickland, Richard Wincewicz, Peter Burnhill
7. What we are doing in Hiberlink, 2013 - 2015
1. Creating evidence on extent of ‘Reference Rot’
– Main focus has been on references (& URIs) made in Journal Articles
• Includes work on reference rot in Supreme Court judgments with Harvard Law Library & Perma.cc
– ETD2014 is opportunity to look at Reference Rot & the e-Thesis
2. Understanding the preparation/publication workflow
– Identifying opportunity for productive intervention
3. Prototypes for pro-active archiving to enable remedy
– Embedding such ‘solutions’ in existing tools & infrastructure
4. Raising awareness & seeking collaborative actions
…. through events like this
9. Retrieving thinking about the emerging e-Thesis in 1998
University Theses Online Group, 1994/99
Initiated by U of Edinburgh & UC London, as referenced by Susan Copeland in ‘E-Theses Developments in the UK’ 2003
11. Measuring the Extent of ‘Reference Rot’ in e-Theses
Data Source
• Looked for corpus of e-Theses for our study period of 1997 – 2012
• Interested only in Doctoral Theses/Dissertations
• NDLTD Union Catalogue
Basic Method
a) Define selection and use information in the metadata record
• Degree awarded (PhD etc); Department
• Date thesis was successfully defended
• Link to the full text of the Doctoral Thesis
b) Download selected e-Thesis from each Institution’s Repository
12. 7,500 E-Theses Downloaded from 5 US Institutions
In passing: note decline in numbers indicates ‘lag’ in ingest/availability of e-Theses
13. Key Aspects of Methodology (Stage 1)
1. Convert those e-Theses from PDF into XML
• pdftohtml -xml
2. Locate the references & extract each and every URL
• Technical challenges: URL broken/newline; underscore as image
• Use up to 15 regular expressions for matching; regard each match as a URI
UoEdin Language Technology Group:
Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou
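A minimal sketch of the extraction step, assuming the thesis text has already been converted and flattened to plain text. The single pattern and the line-break repair rule are illustrative assumptions, not the team's set of up to 15 expressions, and the "underscore as image" problem is not handled here.

```python
import re

# Illustrative pattern only; the project used up to 15 expressions.
URL_PATTERN = re.compile(r"""(?:https?://|www\.)[^\s<>"']+""", re.IGNORECASE)

# Assumed repair rule: a URL is treated as broken across a newline when its
# first line ends in '/', '-', '_', '=', '&' or '?' and the next line starts
# with a lower-case letter or a digit.
BROKEN_LINE = re.compile(r"((?:https?://|www\.)\S*[/\-_=&?])\n([a-z0-9]\S*)")

def extract_urls(text):
    """Return the URLs found in text, in order of appearance, without duplicates."""
    repaired = BROKEN_LINE.sub(r"\1\2", text)
    seen, urls = set(), []
    for match in URL_PATTERN.finditer(repaired):
        url = match.group(0).rstrip(".,;:)]")  # drop trailing sentence punctuation
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls

sample = (
    "See http://www.example.org/path/\npage.html, and www.example.com.\n"
    "Also http://www.example.org/path/page.html again."
)
print(extract_urls(sample))
# ['http://www.example.org/path/page.html', 'www.example.com']
```

The repair rule is deliberately conservative: it will miss URLs broken at other characters, which is one reason a single expression is unlikely to be enough in practice.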
14. Key Aspects of Methodology (Stage 2)
47,067 URIs were extracted
These were partitioned into two types:
i. 1,086 publisher sites, representing very many references to online articles: ‘the scholarly record’
• BTW, who does keep those articles in the Scholarly Record safe?
• Ask me for evidence on that!
ii. 45,981 URIs that linked ‘the Web at large’
• to Web content required for scholarship
• inc. websites, software, blogs, videos, online debate etc
• to that which lacks ‘fixity’ and changes over time
Those c.46,000 are the focus for the Hiberlink Project
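One way such a partition could be done is by host name. This is only a sketch: the slides do not say how publisher sites were identified, and the hosts listed below are hypothetical placeholders.

```python
from urllib.parse import urlsplit

# Hypothetical placeholder hosts; the real list of publisher sites is not given.
PUBLISHER_HOSTS = {"publisher.example.org", "journals.example.com"}

def host_of(uri):
    """Host name of a URI, tolerating a missing scheme (e.g. 'www.example.com/x')."""
    if "://" not in uri:
        uri = "http://" + uri
    return urlsplit(uri).hostname or ""

def partition(uris, publisher_hosts=PUBLISHER_HOSTS):
    """Split URIs into (scholarly_record, web_at_large) lists."""
    scholarly, at_large = [], []
    for uri in uris:
        host = host_of(uri)
        is_publisher = any(host == p or host.endswith("." + p) for p in publisher_hosts)
        (scholarly if is_publisher else at_large).append(uri)
    return scholarly, at_large

scholarly, at_large = partition([
    "http://journals.example.com/article/1",
    "www.someblog.example.net/post",
    "https://cdn.publisher.example.org/x",
])
```

The two counts reported on the slide are consistent with the total: 1,086 + 45,981 = 47,067.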
15. Increase in Linking to ‘Web-at-large’ Resources, 1997-2010
beyond the e-journal, to that which lacks ‘fixity’ and changes over time
[Chart: URIs, by Year Thesis Defended (%), 1997 - 2010; x-axis: Year; y-axis: %; the 50% level is marked]
16. But Wide ‘Between-Thesis’ Variation in Number of Web Links
[Figure: box plots of medians (averages) & quartiles, with ‘outliers’; y-axis: Count (Log10); x-axis: YearDefended, 1997-2010]
• 10% of Theses have 25 or more URIs
Median (average) increases from 4 to 5.5
• 75% have 2+ URIs per Thesis
Focus on e-Theses defended from 2003
17. Methodology (Stage 3): to discover answer to 2 questions
i. Do those links (URIs) still work? Is the URI on the ‘Live Web’?
• Allowing up to a maximum of 50 redirects, recording the HTTP transaction chain and regarding a 2XX status code as ‘live’
ii. Is there a ‘Memento’ of that reference in the ‘Archived Web’?
Memento: a prior version, what the Original Resource was like at some time in the past.
• Archival check carried out in June 2014, using installed version of Memento tool developed by LANL
http://www.mementoweb.org/guide/quick-intro/
• A ‘Datetime’ version at or near the date the Thesis was defended
• Searching across several archives (not just Internet Archive)
Approach first used in pilot work at LANL;
UoEdin Language Technology Group:
Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou
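A hedged sketch of how a ‘Datetime’ lookup can be expressed with the Memento protocol (RFC 7089): the client asks a TimeGate for the version of a URI closest to a given date via the Accept-Datetime request header. The TimeGate address below is an assumption (the public aggregator rather than the locally installed tool mentioned above), and the function only builds the request; it does not send it.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumed endpoint: the public Memento aggregator TimeGate.
# A local installation would use its own base address.
TIMEGATE = "http://timetravel.mementoweb.org/timegate/"

def timegate_request(uri, defended):
    """Return (url, headers) asking for the Memento closest to `defended`.

    `defended` must be a timezone-aware datetime; it is converted to UTC
    because Accept-Datetime is expressed in GMT.
    """
    when = defended.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(when, usegmt=True)}
    return TIMEGATE + uri, headers

url, headers = timegate_request(
    "http://www.bnf.fr", datetime(2014, 5, 15, tzinfo=timezone.utc)
)
print(headers["Accept-Datetime"])  # Thu, 15 May 2014 00:00:00 GMT
```

A TimeGate normally answers with a redirect to the selected Memento, whose own response carries a Memento-Datetime header; comparing that value with the defence date shows how near the archived copy is.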
20. A Measure of Reference Rot: Are those references available?
[in 6,400 e-Theses defended in 2003-2010 at 5 US universities]
Less than two-thirds of those links lead to live content
         Live on Web   Not Found on ‘Live Web’   All
Count    29,122        16,860                    45,982
%        63.3          36.7                      100%
1st Order Indicator of ‘Reference Rot’: more than one third of references to the Web subject to ‘rot’
After up to 50 redirects
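The ‘live’ test behind these counts (a 2XX response after at most 50 redirects, with the transaction chain recorded) can be sketched as follows. The `fetch` callable is a hypothetical stand-in for an HTTP client that does not follow redirects itself; it returns a status code and the Location header, if any.

```python
from urllib.parse import urljoin

def check_live(uri, fetch, max_redirects=50):
    """Follow redirects by hand and report (is_live, chain).

    fetch(uri) -> (status_code, location_or_None) is supplied by the caller.
    chain records every (uri, status_code) pair visited.
    """
    chain = []
    current = uri
    for _ in range(max_redirects + 1):  # the first request plus up to 50 redirects
        status, location = fetch(current)
        chain.append((current, status))
        if 300 <= status < 400 and location:
            current = urljoin(current, location)  # resolve relative Location values
        else:
            return 200 <= status < 300, chain
    return False, chain  # still redirecting after the limit: not 'live'

# Usage with a stand-in fetcher (no network access):
fake = {
    "http://a.example/": (301, "/moved"),
    "http://a.example/moved": (200, None),
    "http://b.example/": (404, None),
}
live, chain = check_live("http://a.example/", lambda u: fake[u])
```

The percentages in the table follow from the counts: 29,122 / 45,982 rounds to 63.3% and 16,860 / 45,982 to 36.7%.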
21. Confirm?: 2/3rds ‘Live’ URI Ratio same across ‘Big 3’, 2003-2010
[Figure: Live Ratio by Institution (fsu, psu, vt, wpi, zund)]
=> ‘On average’ 1/3rd of the links in an e-Thesis are ‘rotten’
22. The older the citation, the less likely to be still on the live Web
[excluding 0s&1s: a few theses are unaffected; a few are ruined]
[Figure: Live Ratio against MonthsElapsed, the number of months elapsed from Date Thesis Defended until date archives checked (June 2014)]
We can’t stop that process of rot: Web content changes over time, Reference Rot is an inevitable function of time
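The x-axis quantity is simple date arithmetic. A small sketch, assuming whole calendar months are counted and taking 1 June 2014 as the check date (the slides give only the month):

```python
from datetime import date

# The slides say the archives were checked in June 2014; the day is an assumption.
ARCHIVES_CHECKED = date(2014, 6, 1)

def months_elapsed(defended, checked=ARCHIVES_CHECKED):
    """Whole calendar months between the defence date and the check date."""
    return (checked.year - defended.year) * 12 + (checked.month - defended.month)

print(months_elapsed(date(2003, 6, 15)))   # 132
print(months_elapsed(date(2010, 12, 1)))   # 42
```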
23. Searching for ‘Datetime’ Mementos of content in ‘Archived Web’
[in 6,400 e-Theses defended in 2003-2010 at 5 US universities]
%                      Live on Web   Not found on ‘Live Web’   All
Found to be Archived                                           47.6
Not Found                                                      52.4
All                                                            100%
There seems a 50:50 chance that referenced content is in the ‘Archived Web’.
Some content is being ‘co-incidentally harvested’ by routine web archiving.
=> half of those references are at ‘risk of loss’
24. 50:50 chance that ‘DateTime Reference’ is ‘Incidentally Archived’
[Figure: Archive Ratio by Institution (fsu, psu, vt, wpi, zund)]
25. ‘Incidental Archiving’ is constant over time
(This is an ‘upper bound estimate’, independent of age of e-thesis)
[Figure: Archive Ratio against MonthsElapsed]
We can improve upon this ‘50:50 chance’
by pro-actively archiving what we cite
26. We already have ‘Lost Content’ for References to Web
[in 6,400 e-Theses defended in 2003-2010 at 5 US universities]
%                      Live on Web   Not found on ‘Live Web’   All
Found to be Archived   29.3          18.3                      47.6
Not Found              34.0          18.4                      52.4
All                    63.3          36.7                      100%
18.4%: ‘not live & not found in archive’, judged to be lost forever
34%: ‘live’ & ‘not in archive’, is at risk of loss
NB: The 34% ‘at risk’ could be saved by pro-active archiving
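The four cells of this table come from crossing the two Stage 3 answers for each URI. A minimal sketch of that tabulation, using made-up per-URI results scaled to reproduce the reported percentages (the labels are paraphrases of the slide's wording):

```python
from collections import Counter

LABELS = {
    (True, True): "live & archived",
    (True, False): "live, not archived: at risk of loss",
    (False, True): "not live, but archived",
    (False, False): "not live, not archived: judged lost",
}

def cross_tab(results):
    """results: iterable of (is_live, is_archived) pairs, one per URI.

    Returns {cell: percentage of all URIs}, rounded to one decimal place.
    """
    counts = Counter(results)
    total = sum(counts.values())
    return {cell: round(100 * n / total, 1) for cell, n in counts.items()}

# Illustrative input: 1,000 synthetic results in the reported proportions.
synthetic = (
    [(True, True)] * 293 + [(True, False)] * 340
    + [(False, True)] * 183 + [(False, False)] * 184
)
for cell, pct in cross_tab(synthetic).items():
    print(f"{pct:5.1f}%  {LABELS[cell]}")
```

Summing the cells recovers the marginals quoted earlier: 29.3 + 34.0 = 63.3 live, and 29.3 + 18.3 = 47.6 archived.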
28. Having demonstrated problem exists & is severe
• The Web changes over time: reference rot occurs (36.7%)
• Incidental archiving via routine of web archiving initiatives
delivers no more than 50:50 chance of success
• Seek pro-active ‘transactional archiving’ solutions
– focus on what is regarded by authors as important
• Thereby to remedy the integrity of the scholarly record
We aim to embed ‘solutions’ in existing tools & infrastructure
Our General Approach
29. Strategy for Making Remedy
a) Understand the preparation/publication workflow
– identifying where there can be productive intervention
b) Devise prototypes for pro-active archiving
– writing & implementing code!
c) Propose/test infrastructure for temporal referencing
– supporting & using the Memento protocol
We are embedding ‘solutions’ in existing tools & infrastructure
30. Understanding 3 workflows: Rot or Remedy?
Identify the Actors
Extended length of stages in workflows magnify reference rot & affect
① Study -> Preparation -> (Review) -> Submission
② Post-Submission -> Examination -> (Revision) -> Award
③ Post-Award -> Deposit/Ingest -> Provide/Access -> Use
Doctoral Student (& Supervisor)
Faculty, Examiners & Supervisor
University & Library
Identify the best opportunities for Intervention to make Remedy
31. ‘Work in progress’ to effect Remedy
1. Hiberlink Plug-in - to enable pro-active archiving
2. Missing Link - re-factoring the HTML link
3. HiberActive - enables repositories to ‘stop the rot’ via actively archiving those references in e-theses
LANL: Martin Klein, Harihar Shankar, Herbert Van de Sompel
UoEd EDINA: Neil Mayo, Tim Stickland, Richard Wincewicz
32. ‘Work in progress’ to effect Remedy (1)
1. Hiberlink Plug-in - to help authors and middle-folk (publishers/librarians) do the right thing:
– Zotero - used by authors to manage references
https://www.zotero.org/
– Open Journal Systems (OJS) - used by OA publishers
https://pkp.sfu.ca/ojs/
33. Hiberlink Plug-in for Zotero
For use during preparation of thesis & before final submission, but also before deposit with Library (& maybe for repair by Library …)
① Triggers archiving of referenced web content
② Returns Datetime URI for archived content
34. ‘Work in progress’ to effect Remedy (2)
1. Hiberlink Plug-in - to enable pro-active archiving
2. Missing Link - re-factor the HTML link that is returned
a) Take simple URI - to French National Library (say)
<a href="http://www.bnf.fr">
Link to the BNF
</a>
b) Augment Link with a set of Datetime & location pairs
<a href="http://www.bnf.fr"
mset="2014-05-19,
http://archive.today/zdpAn 2014-05-15 memento">
Link to the BNF
</a>
35. ‘Work in progress’ to effect Remedy (3)
1. Hiberlink Plug-in - to enable pro-active archiving
2. Missing Link - re-factoring the HTML link
First two approaches support ‘perfect scenario’:
• All authors archive all their cited URIs
• e.g. (but not exclusively) with Hiberlink / Zotero
3. HiberActive
– Enables repositories to ‘stop the rot’ by actively archiving those references in e-theses
– A notification hub, a component for the infrastructure
• testing workflow with ResourceSync, CORE & external archive programme
36. Next Steps: who wants to take this work forward?
to ensure references in e-Theses don’t rot
• Need to move from the ‘incidental Web archiving’ of cited URIs to pro-active archiving, by student/authors & by libraries
a) Offer to be an early adopter for these Hiberlink remedies
• The Hiberlink Plug-in for Zotero / HiberActive
Email: edina@ed.ac.uk
Subject: Hiberlink ETD
b) Amend ‘Guidance for ETD Lifecycle Management’
38. Aside: We would all like to assume that our libraries are ensuring that online e-journal content is being kept safe
But online articles in the Scholarly Record are not in the custody of Libraries, nor on their digital shelves.
Picture credit: http://somanybooksblog.com/2009/03/27/library-tour/
39. Evidence from The Keepers Registry is worrying!
① Compare what is being kept by the (10) leading archiving agencies
(CLOCKSS, Portico, national libraries etc) with all issued with ISSN
‘Ingest Ratio’ = titles being ingested by one or more Keeper
/ ‘online serials’ in ISSN Register
= 23,268 / 136,965 [in March 2014] => 17%
* We do not know about 83% of e-serials having ISSN *
‘KeepSafe Ratio’ = ingest by 3+ Keepers = 9,652 / 136,965 => 7%
② Title Lists of 3 US research libraries (Columbia, Cornell & Duke), checked in 2011/12: ‘Ingest Ratio’ = 22% to 28%; c.75% unknown fate
③ User-centric Evidence, usage logs for the UK OpenURL Router*
=> over two thirds, 68% (36,326 titles), held by none!
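The two ratios in ① follow directly from the counts quoted for March 2014. A short worked check (the variable and function names are ours, chosen for illustration):

```python
ONLINE_SERIALS = 136_965      # 'online serials' in the ISSN Register, March 2014
INGESTED_BY_ANY = 23_268      # titles ingested by one or more Keeper
INGESTED_BY_3_PLUS = 9_652    # titles ingested by three or more Keepers

def ratio_percent(kept, issued):
    """Share of issued titles that are kept, as a whole-number percentage."""
    return round(100 * kept / issued)

ingest_ratio = ratio_percent(INGESTED_BY_ANY, ONLINE_SERIALS)       # 17
keepsafe_ratio = ratio_percent(INGESTED_BY_3_PLUS, ONLINE_SERIALS)  # 7
unknown_fate = 100 - ingest_ratio                                   # 83
print(ingest_ratio, keepsafe_ratio, unknown_fate)
```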