Investigating Reference Rot in Web-Based Scholarly Communication

Herbert Van de Sompel
Los Alamos National Laboratory
@hvdsomp

Martin Klein
Los Alamos National Laboratory
@mart1nkle1n

http://hiberlink.org #hiberlink
http://mementoweb.org #memento

Hiberlink is funded by the Andrew W. Mellon Foundation
Hiberlink Project Partners
• Los Alamos National Laboratory:
• Research Library: Martin Klein, Robert Sanderson, Herbert Van
de Sompel
• University of Edinburgh:
• Edina: Peter Burnhill, Neil Mayo, Muriel Mewissen, Christine
Rees, Tim Stickland, Richard Wincewicz
• Language Technology Group: Beatrice Alex, Claire Grover,
Richard Tobin, Ke “Adam” Zhou
• Funding: Andrew W. Mellon Foundation

Herbert Van de Sompel, Martin Klein – Hiberlink
CNI Fall 2013, Washington, DC, December 9 2013
Acknowledgments
• Primary datasets: arXiv, Chesapeake Project, Elsevier, PubMed
Central, PLoS, … (many more to come)

• Secondary datasets: Ex Libris, MS Academic, SerialsSolutions
• Technology support: CrossRef Labs, CrossRef Prospect, Elsevier

• Liaisons: archive.is, CrossRef, Internet Archive, Old Dominion
University Web Science & Digital Library Research Group, perma.cc

Reference Rot
Problem Domain
• Web-based scholarly communication links to, references, Web
resources:
• Formal citing of scholarly resources
• Referencing “Web at Large” resources needed or created in
research activities e.g. project websites, software, ontologies,
workflows, online debate, slides, blogs, videos, etc.

Problem Domain
• Links to web resources are subject to Reference Rot:
• Link Rot: Link stops working, e.g. HTTP 404
• Content Decay: Linked content changes over time

References in Web-Based Scholarly Communication

                 To Scholarly Resources        To Web at Large Resources
Link Rot
Content Decay

an increasingly blurry boundary

References in Web-Based Scholarly Communication

                 To Scholarly Resources        To Web at Large Resources
Link Rot         DOI, HTTP version of DOI
Content Decay

References in Web-Based Scholarly Communication

                 To Scholarly Resources        To Web at Large Resources
Link Rot         DOI, HTTP version of DOI
Content Decay    Fixity of content

References in Web-Based Scholarly Communication

                 To Scholarly Resources        To Web at Large Resources
Link Rot         DOI, HTTP version of DOI
Content Decay    Fixity of content
                 Archiving: CLOCKSS, LOCKSS,
                 Portico, Keepers Registry, …

References in Web-Based Scholarly Communication

                 To Scholarly Resources        To Web at Large Resources
Link Rot         DOI, HTTP version of DOI
Content Decay    Fixity of content
                 Archiving: CLOCKSS, LOCKSS,
                 Portico, Keepers Registry, …

There are issues here too, see David Rosenthal blog post http://blog.dshr.org/2013/11/patio-perspectives-at-anadp-ii.html
References to Scholarly Resources
• We hope/assume that peer-reviewed scholarly literature has fixity
and is adequately archived

• This, BTW, might not be a correct assumption:
• Dynamic, content rich, landing pages
• No public audit regarding archival status of electronic journal
literature archived in special-purpose infrastructure
• Poor archiving in public web archives, related to protected
content
• Initial information in Keepers Registry indicates spotty archiving
of electronic journal literature
• … Still, this is NOT what Hiberlink investigates
See David Rosenthal blog post http://blog.dshr.org/2013/11/patio-perspectives-at-anadp-ii.html
References in Web-Based Scholarly Communication

                 To Scholarly Resources        To Web at Large Resources
Link Rot         DOI, HTTP version of DOI
Content Decay    Fixity of content
                 Archiving: CLOCKSS, LOCKSS,
                 Portico, Keepers Registry, …

Hiberlink focus: references to Web at Large Resources

References to “Web at Large” Resources
• Hiberlink focuses on the wide variety of web resources needed or
created in research activities

• These resources:
• Are not necessarily under the custodianship of a party that cares
about long term integrity, access
• Do not necessarily have the same sense of fixity that e.g.
journal articles have
• Reference Rot makes it impossible to adequately recreate the
temporal context for scholarly discourse

Herbert Van de Sompel, et al. (2004) http://dx.doi.org/10.1045/september2004-vandesompel
[Figure: referenced resources, each labeled by status: !Exist / Archived; Exist / Archived; !Exist / Archived; !Exist / !Archived; Exist / Archived]
Hiberlink: Investigating Reference Rot

• Hiberlink explores references to Web at Large resources:
• Quantifies Reference Rot
• Explores potential solutions to Reference Rot
• Focuses on links in electronic journal articles
• But has the big picture in mind: dynamic, interdependent,
web-based scholarly assets
• See Herbert Van de Sompel, From the Version of
Record to a Version of the Record, CNI Spring 2013
plenary talk - http://www.youtube.com/watch?v=fhrGSQbNVA

References in Web-Based Scholarly Communication

                 To Scholarly Resources        To Web at Large Resources
Link Rot         DOI, HTTP version of DOI
Content Decay    Fixity of content
                 Archiving: CLOCKSS, LOCKSS,
                 Portico, Keepers Registry, …

Is it worth our time to study this?

Articles Increasingly Link to Web Resources

URIs extracted from PubMed papers – links to Web at Large resources
The New York Times Cares

http://www.nytimes.com/2013/09/24/us/politics/in-supreme-court-opinions-clicks-that-lead-nowhere.html
Reference Rot in Law Journals
Zittrain, J., Albert, K., Lessig, L. (2013) Perma: Scoping and
Addressing the Problem of Link and Reference Rot in Legal
Citations
• Link rot in Law Journals: ~27%
• Reference rot in law journals: ~70%

http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2329161
Not Just in Scholarly Communication
Zittrain, J., Albert, K., Lessig, L. (2013) Perma: Scoping and
Addressing the Problem of Link and Reference Rot in Legal
Citations
Liebler, R., Liebert, J. (2012) Something rotten in the State of Legal
Citation
• Link rot: 29% of links in Supreme Court decisions (study of 1996-2010)
• Reference rot, including link rot: 49.9% of links in Supreme Court
decisions

http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2329161
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2188070
Not Just in Scholarly Communication

http://en.wikipedia.org/wiki/Wikipedia_talk:Link_rot
Quantifying Reference Rot
Quantifying Reference Rot
• Reference Rot has been studied before:
• For the web at large
• For scholarly communication
• For government documents
• What is different with Hiberlink?
• Investigates Reference Rot not just link rot, i.e. includes the
aspect of changing content not just rotting links
• Investigates coverage of referenced resources in web archives
• Operates at a massive scale regarding number of journal
articles, referenced URIs, web archive lookups

STUDY                Year of Publication   # URIs     # URIs looked up
Author (Date)        of Citations                     in web archives
Lawrence (2001)      1993-1999             67,577
Casserly (2003)      1999-2000             500        500
Casserly (2007)      1999-2000             500        500
Rumsey (2002)        1997-2001             3,406
Davis (2002)         1999-2001             688
Wren (2004)          1994-2002             1,630
Sellitto (2005)      1995-2003             1,043
Goh (2005)           1997-2003             2,516
Dimitrova (2007)     2000-2003             1,126
McCown (2005)        1995-2004             4,387
Wagner (2009)        2002-2004             2,011      2,011
Parker (2007)        2002-2005             1,229
Duda (2008)          1997-2005             2,100
Falagas (2007)       2003-2006             1,417
Russell (2008)       1999-2006             510
Wren (2008)          1994-2007             6,154
Moghaddam (2010)     1995-2008             1,761      1,761
Sanderson (2011)     1993-2010             162,052    162,052

Sanderson, R., Phillips, M., and Van de Sompel, H. (2011) http://arxiv.org/abs/1105.3459
Quantifying Reference Rot - Methodology

• Various full text corpora
• Articles 01/1997-12/2012
• URI extraction from XML and PDF
• Improvement on URI extraction
techniques used in prior research
• Validation study planned
• Referencing article
• Referencing journal
• Article dates: submission,
acceptance, publication
• URI position: abstract, body,
footnote, references
• Filter DOIs, HTTP version of DOIs
• Filter URIs that should have been
referenced by means of a DOI
• Supported by secondary
datasets
• Filter obvious noise, e.g. localhost,
example.org, foo.bar, licenses, etc.
• HTTP HEAD on referenced URI-R
• Follow redirects up to a maximum
of 50
• Record HTTP transaction chain
• If HTTP transaction chain ends with
2XX status code: Exists
• If HTTP transaction chain does not
end with 2XX: !Exist
• Lookup in web archives via a
Memento Aggregator that covers
among others Internet Archive,
Archive-It, archive.is, British
Library web archive, UK National
Archives web archive, Icelandic
web archive
• Obtain TimeMap per URI
• If TimeMap does not exist:
!Archived
• If TimeMap exists, select
Memento URI-M closest to
article publication date
• HTTP HEAD on URI-M
• Follow archived redirects
up to a maximum of 50
• Record HTTP transaction
chain
• If HTTP transaction chain
ends 2XX: Archived
• If HTTP transaction chain
does not end with 2XX:
!Archived
Data used for analysis
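The existence check in the methodology above can be sketched in Python. This is a minimal sketch under stated assumptions, not the Hiberlink implementation: the names follow_chain, exists, and the injected fetch callable are hypothetical, and resolving relative Location headers with urljoin is an assumption. Only the HEAD-style probing, the cap of 50 redirects, and the 2XX rule come from the slide. Injecting the fetcher keeps the logic testable without network access.

```python
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin

MAX_REDIRECTS = 50  # redirect cap stated on the slide

# Hypothetical fetcher type: performs one HTTP HEAD request and returns
# (status code, value of the Location header or None).
Fetcher = Callable[[str], Tuple[int, Optional[str]]]


def follow_chain(uri: str, fetch: Fetcher,
                 max_redirects: int = MAX_REDIRECTS) -> List[Tuple[str, int]]:
    """Record the HTTP transaction chain as a list of (URI, status) pairs."""
    chain: List[Tuple[str, int]] = []
    current = uri
    # One initial request plus at most max_redirects followed redirects.
    for _ in range(max_redirects + 1):
        status, location = fetch(current)
        chain.append((current, status))
        if 300 <= status < 400 and location:
            # Assumption: relative Location values resolve against the current URI.
            current = urljoin(current, location)
        else:
            break
    return chain


def exists(chain: List[Tuple[str, int]]) -> bool:
    """Exists if the chain ends with a 2XX status code, otherwise !Exist."""
    return bool(chain) and 200 <= chain[-1][1] < 300


# Illustrative, made-up responses standing in for real HEAD requests.
responses = {
    "http://a.example/": (301, "http://b.example/"),
    "http://b.example/": (200, None),
    "http://gone.example/": (404, None),
    "http://loop.example/": (302, "http://loop.example/"),
}


def fake_fetch(uri: str) -> Tuple[int, Optional[str]]:
    return responses[uri]
```

The same two functions could be applied to a Memento URI-M to decide Archived versus !Archived, since the slide applies the same 2XX rule to the archived transaction chain.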
Quantifying Reference Rot – Early Results
[Figure: early results. Recoverable elements: amount of citations per publication year (1997-2011, logarithmic axis from 1 to 200k); percentage (0-100) of !Exist and Archived URIs, with series for Archived within 30, 14, 7, and 1 days; a time axis in weeks (1 to 1000); shares of 31.2%, 16.8%, 11.3%, and 40.7%.]
Study: PubMed Central Corpus 01/1997 – 12/2012
• Articles processed: 494,785
• Articles that contain Web at Large URIs: 176,527
• References to Web at Large URIs: 557,432
• Unique referenced Web at Large URIs: 327,782

Percentage Exists & Archived Referenced URIs
[Pie chart. Categories: Exists & Archived, !Exists & Archived, Exists & !Archived, !Exists & !Archived. Slice values: 31.2%, 16.8%, 11.3%, 40.7%.]
URIs extracted from PubMed papers – links to Web at Large resources
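The four categories in this breakdown, and the day-window variants, could be computed along the following lines. This is a hedged sketch: the function names are hypothetical, and treating the window as symmetric around the article publication date is an assumption, since the slides do not say whether "within N days" counts Mementos before, after, or on either side of publication.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Optional, Tuple


def closest_memento(memento_datetimes: List[datetime],
                    publication_date: datetime) -> Optional[datetime]:
    """Pick the Memento datetime closest to the article publication date."""
    if not memento_datetimes:
        return None  # no TimeMap entries: treated as !Archived
    return min(memento_datetimes, key=lambda dt: abs(dt - publication_date))


def archived_within(memento_dt: Optional[datetime],
                    publication_date: datetime, days: int) -> bool:
    """Assumption: the window extends `days` on either side of publication."""
    if memento_dt is None:
        return False
    return abs(memento_dt - publication_date) <= timedelta(days=days)


def category(exists: bool, archived: bool) -> str:
    """Label one referenced URI with one of the four chart categories."""
    left = "Exists" if exists else "!Exists"
    right = "Archived" if archived else "!Archived"
    return f"{left} & {right}"


def breakdown(records: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Percentage of URIs per category from (exists, archived) pairs."""
    counts = Counter(category(e, a) for e, a in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}
```

Narrowing the window moves URIs from the Archived categories into the !Archived ones, which is the shift the 30, 15, 7, and 1 day charts show.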
Percentage Exists & Archived in 30 Day Window
[Pie chart. Categories: Exists & Archived, !Exists & Archived, Exists & !Archived, !Exists & !Archived. Slice values: 23%, 16.7%, 5.1%, 55.2%.]
URIs extracted from PubMed papers – links to Web at Large resources
Percentage Exists & Archived in 15 Day Window
[Pie chart. Categories: Exists & Archived, !Exists & Archived, Exists & !Archived, !Exists & !Archived. Slice values: 24.6%, 12.4%, 3.5%, 59.5%.]
URIs extracted from PubMed papers – links to Web at Large resources
Percentage Exists & Archived in 7 Day Window
[Pie chart. Categories: Exists & Archived, !Exists & Archived, Exists & !Archived, !Exists & !Archived. Slice values: 25.8%, 8.8%, 2.3%, 63.1%.]
URIs extracted from PubMed papers – links to Web at Large resources
Percentage Exists & Archived in 1 Day Window
[Pie chart. Categories: Exists & Archived, !Exists & Archived, Exists & !Archived, !Exists & !Archived. Slice values: 27.9%, 0.9%, 0.2%, 71%.]
URIs extracted from PubMed papers – links to Web at Large resources
Percentage of !Exists per Year
[Figure: percentage (0-100) of !Exists URIs per publication year, 1997-2011.]
URIs extracted from PubMed papers – links to Web at Large resources
Percentage of !Exists, Archived per Year
[Figure: percentage (0-100) per publication year, 1997-2011. Series: !Exist, Archived, Archived within 30 days, Archived within 14 days, Archived within 7 days, Archived within 1 day.]
URIs extracted from PubMed papers – links to Web at Large resources
Percentage of !Exists and of Those Archived per Year
[Figure: two percentage axes (0-100) per publication year, 1997-2011: "Percentage !Exists URIs" and "Percentage Archived URIs for !Exists URIs". Series: !Exist, Archived, Archived within 30 days, Archived within 14 days, Archived within 7 days, Archived within 1 day.]
URIs extracted from PubMed papers – links to Web at Large resources
Absolute Number of Archived per Year
[Figure: absolute number of archived URIs (logarithmic axis, 1 to 30000) per publication year, 1997-2011. Series: Archived, Archived within 30 days, Archived within 14 days, Archived within 7 days, Archived within 1 day.]
URIs extracted from PubMed papers – links to Web at Large resources
Solving Reference Rot
References in Web-Based Scholarly Communication

                 To Scholarly Resources        To Web at Large Resources
Link Rot         DOI, HTTP version of DOI
Content Decay    Fixity of content             -
                 Archiving: CLOCKSS, LOCKSS,
                 Portico, Keepers Registry, …

Addressing Content Decay
• Aim for a more pro-active approach to collect snapshots of web
resources (likely to be) referenced in scholarly communication
• A system that hosts resources that are likely to be referenced in
scholarly communication can create snapshots of itself by:
o Using CMS, wikis, datawikis with solid versioning
mechanisms
o Subscribing to on-demand self web archiving service
o Using transactional web archives, cf. SiteStory
• Referenced resources can be web archived on-demand:
o By authors during note taking, authoring
o By platforms involved in the publication process, e.g.
archiving linked resources at the time of manuscript
submission
References in Web-Based Scholarly Communication

                 To Scholarly Resources        To Web at Large Resources
Link Rot         DOI, HTTP version of DOI
Content Decay    Fixity of content             -
                 Archiving: CLOCKSS, LOCKSS,   Web archiving
                 Portico, Keepers Registry, …  Content Versioning Systems
                                               Self archiving

Click link to blog post
http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/

Receive page
http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/

Search and find Mementos in Internet Archive for
http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/

Search and find a Memento in archive.is for
http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/

Click perma.cc link to Memento of blog post
http://perma.cc/0Hg62eLdZ3T

Receive Memento from perma.cc
http://perma.cc/0Hg62eLdZ3T

Search and do not find Mementos in Internet Archive for
http://perma.cc/0Hg62eLdZ3T

Search and do not find Mementos in archive.is for
http://perma.cc/0Hg62eLdZ3T

What Happened?
• Good news: The number of archived copies of the blog post was
increased by pro-actively creating a Memento in perma.cc
• Bad news: The possibility of finding Mementos for the blog post
in other web archives was undermined by replacing the Original
URI-R with the Memento URI-M
• The Memento URI-M is a key in only one archive
• The Original URI-R is a key in all web archives
• Using the Memento URI-M in a link requires the permanent
existence/uptime of the archive that issued it
• One link rot problem was replaced by another …

Web Archives Less Permanent than Permanent?

http://webcitation.org
Web Archives Less Permanent than Permanent?

http://ws-dl.blogspot.com/2013/11/2013-11-21-conservative-party-speeches.html
Web Archives Less Permanent than Permanent?

http://richmondsfblog.com/2013/11/06/part-of-internet-archive-building-badly-burned-in-earlymorning-fire/
What To Do?
• Need an approach for referencing archived resources that
supports lookups in many web archives, not just one
• Since the Original URI-R is a key in all web archives, the linking
approach needs to necessarily include it
• Hence, two URIs are required:
• The Original URI-R
• The Memento URI-M, e.g. the perma.cc URI
• But a link in HTML only carries one URI!
• It is understandable that the Memento URI-M is used for the
link: the approach works with existing web infrastructure
• Yet, an approach to address link rot that itself is subject to
link rot is … err… problematic
The Missing Link Proposal

• Extend the link to the Original URI-R with temporal context:
• Memento URI-M in a specific archive
• Dates:
• date of page that contains the link
• date of the link, cf. “accessed at” in citations of web
resources
• Provide the Original URI-R and the temporal context in a
machine-actionable manner so it can be used by user and
machine agents to retrieve Mementos from various web archives

http://mementoweb.org/missing-link/
How to Make Missing Link Happen?
• The existing approach works out of the box but is problematic
• Missing Link requires infrastructure changes but generally
contributes to increased web persistence:
• HTML
• META for page date: no problem, already in use
• Attributes for <a> to convey URI-M and link date:
• data- extensibility mechanism in HTML5 can be
used but is not intended for cross-site applications
• In 1995, HTML had the URN attribute for <a> as a
means to address web persistence concerns
• Browser, tool support

References in Web-Based Scholarly Communication

                 To Scholarly Resources        To Web at Large Resources
Link Rot         DOI, HTTP version of DOI      Missing Link proposal
Content Decay    Fixity of content             -
                 Archiving: CLOCKSS, LOCKSS,   Web archiving
                 Portico, Keepers Registry, …  Content Versioning Systems
                                               Self archiving

Demo: Application Using Temporal Context for Links

Application Using Temporal Context for Links
• Memento for Chrome is an application that uses Original URI-R
and dates to access Mementos in various web archives
• Memento around the date selected in user interface
calendar
• Most recently archived Memento

Memento Time Travel for Chrome

http://bit.ly/memento-for-chrome
Memento Time Travel for Chrome

http://www.youtube.com/watch?v=0_70lQPOOIg
http://www.youtube.com/watch?v=WtZHKeFwjzk
Application Using Temporal Context for Links
• An experimental version of Memento for Chrome also uses
Missing Link information (Original URI-R, URI-M, and dates) to
access Mementos in various web archives:
• Memento around the date selected in user interface calendar
• Most recently archived Memento
• Memento around the date of the page that contains the link
• Memento around the date of the link
• Memento URI-M in a specific archive
• A Memento client is just one example of an application that can
use temporal context provided for links. Other applications,
including search engines, can use it too
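How a client might read this temporal context from a page can be sketched with Python's standard HTML parser. The class and function names here are hypothetical, and preferring the link date over the page date is an assumption, not something the slides prescribe. The Accept-Datetime request header is part of the Memento protocol; the sketch only formats its value and makes no network request.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from html.parser import HTMLParser
from typing import Dict, List, Optional


class LinkContextParser(HTMLParser):
    """Collect the page date and per-link Missing Link attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.page_date: Optional[str] = None
        self.links: List[Dict[str, Optional[str]]] = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)  # HTMLParser lowercases tag and attribute names
        if tag == "meta" and a.get("itemprop") == "datePublished":
            self.page_date = a.get("content")
        elif tag == "a" and "href" in a:
            self.links.append({
                "uri_r": a["href"],
                "uri_m": a.get("data-versionurl"),
                "link_date": a.get("data-versiondate"),
            })


def preferred_date(link: Dict[str, Optional[str]],
                   page_date: Optional[str]) -> Optional[str]:
    """Assumption: the link date, when present, wins over the page date."""
    return link["link_date"] or page_date


def accept_datetime(date_string: str) -> str:
    """Format a YYYY-MM-DD date as an Accept-Datetime header value (UTC midnight)."""
    dt = datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return format_datetime(dt, usegmt=True)


page = (
    '<META itemprop="datePublished" content="2013-09-22">'
    '<a href="http://librarylab.law.harvard.edu" '
    'data-versiondate="2010-09-22">Library Lab</a>'
)
parser = LinkContextParser()
parser.feed(page)
```

A client would send the resulting header value to a Memento TimeGate or aggregator for the link's Original URI-R, or go straight to the URI-M when data-versionurl is present.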

NYT has <META itemprop="datePublished" content="2013-09-23">

Link in NYT was:
<a href="http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/">
Changed to:
<a href="http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/"
data-versionurl="http://perma.cc/0Hg62eLdZ3T">
Right-click link, choose "Get near current time" (done on Nov 25 2013)
http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/
enabler: <a href="URI-R">

Receive Memento from archive.is, Nov 24 2013
http://archive.is/20131124221749/http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/

Right-click link, choose "Get at page date"
http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/
enabler: <a href="URI-R"> & <META itemprop="datePublished" content="2013-09-23">

Receive Memento from Internet Archive, Sep 24 2013
http://web.archive.org/web/20130924053315/http://futureoftheinternet/2013/09/22/perma

Right-click link, choose "Get from perma.cc"
http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/
enabler: <a href="URI-R" data-versionurl="URI-M">

Receive Memento from perma.cc, Oct 2 2013
http://perma.cc/0Hg62eLdZ3T

Link in NYT was:
<a href="http://perma.cc/0Hg62eLdZ3T">
Changed to:
<a href="http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/"
data-versionurl="http://perma.cc/0Hg62eLdZ3T">
All previous options available

Added:
<META itemprop="datePublished" content="2013-09-22">

Click Link (done on November 25 2013)
http://en.wikipedia.org/wiki/Link_rot
enabler: <a href="URI-R">

Receive Page
http://en.wikipedia.org/wiki/Link_rot

Scroll down in page
Shows Perma.cc link, added October 22 2013, a month after the blog post

Right-click link, choose "Get at page date"
http://en.wikipedia.org/Link_rot
enabler: <a href="URI-R"> & <META itemprop="datePublished" content="2013-09-22">

Receive Page
http://en.wikipedia.org/w/index.php?title=Link_rot&oldid=571327764

Scroll down in page
Does not show Perma.cc link, added October 22 2013, a month after the blog post

Link in blog was:
<a href="http://librarylab.law.harvard.edu">
Changed (for fun) to:
<a href="http://librarylab.law.harvard.edu" data-versiondate="2010-09-22">

Click Link (done on November 25 2013)
http://librarylab.law.harvard.edu
enabler: <a href="URI-R">

Receive Page
http://librarylab.law.harvard.edu

Right-click link, choose "Get at page date"
http://librarylab.law.harvard.edu
enabler: <a href="URI-R"> & <META itemprop="datePublished" content="2013-09-22">

Receive Memento from archive.is, Jun 21 2013
http://archive.is/20130621162538/http://librarylab.law.harvard.edu

Right-click link, choose "Get at link date"
http://librarylab.law.harvard.edu
enabler: <a href="URI-R" data-versiondate="2010-09-22">

Receive Memento from Internet Archive, Sep 18 2010
http://web.archive.org/web/20100918025331/http://librarylab.law.harvard.edu

Bottom Line: A Link Leads to Many Times and Archives

http://mementoweb.org/missing-link/
Herbert Van de Sompel, Martin Klein – Hiberlink
CNI Fall 2013, Washington, DC, December 9 2013
Investigating Reference Rot in Web-Based Scholarly Communication

Herbert Van de Sompel
Los Alamos National Laboratory
@hvdsomp

Martin Klein
Los Alamos National Laboratory
@mart1nkle1n

http://hiberlink.org #hiberlink
http://mementoweb.org #memento

Hiberlink is funded by the Andrew W. Mellon Foundation

Hiberlink: Investigating Reference Rot, December 2013

  • 1.
    Investigating Reference Rotin Web-Based Scholarly Communication Herbert Van de Sompel Los Alamos National Laboratory @hvdsomp Martin Klein Los Alamos National Laboratory @mart1nkle1n http://hiberlink.org #hiberlink http://mementoweb.org #memento Hiberlink is funded by the Andrew W. Mellon Foundation
  • 2.
    Hiberlink Project Partners •Los Alamos National Laboratory: • Research Library: Martin Klein, Robert Sanderson, Herbert Van de Sompel • University of Edinburgh: • Edina: Peter Burnhill, Neil Mayo, Muriel Mewissen, Christine Rees, Tim Stickland, Riachard Wincewicz • Language Technology Group: Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou • Funding: Andrew W. Mellon Foundation Herbert Van de Sompel, Martin Klein – Hiberlink CNI Fall 2013, Washington, DC, December 9 2013
  • 3.
    Acknowledgments • Primary datasets:arXiv, Chesapeake Project, Elsevier, PubMed Central, PLoS, … (many more to come) • Secondary datasets: Ex Libris, MS Academic, SerialsSolutions • Technology support: CrossRef Labs, CrossRef Prospect, Elsevier • Liaisons: archive.is, CrossRef, Internet Archive, Old Dominion University Web Science & Digital Library Research Group, perma.cc Herbert Van de Sompel, Martin Klein – Hiberlink CNI Fall 2013, Washington, DC, December 9 2013
  • 4.
    Reference Rot Herbert Vande Sompel, Martin Klein – Hiberlink CNI Fall 2013, Washington, DC, December 9 2013
  • 5.
    Problem Domain • Web-basedscholarly communication links to, references, Web resources: • Formal citing of scholarly resources • Referencing “Web at Large” resources needed or created in research activities e.g. project websites, software, ontologies, workflows, online debate, slides, blogs, videos, etc. Herbert Van de Sompel, Martin Klein – Hiberlink CNI Fall 2013, Washington, DC, December 9 2013
  • 6.
    Problem Domain • Linksto web resources are subject to Reference Rot: • Link Rot: Link stops working, e.g. HTTP 404 • Content Decay: Linked content changes over time Herbert Van de Sompel, Martin Klein – Hiberlink CNI Fall 2013, Washington, DC, December 9 2013
  • 7.
    References in Web-Based Scholarly Communication • To Scholarly Resources • To Web at Large Resources • Link Rot • Content Decay • an increasingly blurry boundary
  • 8.
    References in Web-Based Scholarly Communication • To Scholarly Resources • Link Rot: DOI, HTTP version of DOI • Content Decay • To Web at Large Resources
  • 9.
    References in Web-Based Scholarly Communication • To Scholarly Resources • Link Rot: DOI, HTTP version of DOI • Content Decay: Fixity of content • To Web at Large Resources
  • 10.
    References in Web-Based Scholarly Communication • To Scholarly Resources • Link Rot: DOI, HTTP version of DOI • Content Decay: Fixity of content; Archiving: CLoCKSS, LoCKSS, Portico, Keepers Registry, … • To Web at Large Resources
  • 11.
    References in Web-Based Scholarly Communication • To Scholarly Resources • Link Rot: DOI, HTTP version of DOI • Content Decay: Fixity of content; Archiving: CLoCKSS, LoCKSS, Portico, Keepers Registry, … • There are issues here too, see David Rosenthal blog post http://blog.dshr.org/2013/11/patio-perspectives-at-anadp-ii.html • To Web at Large Resources
  • 12.
    References to Scholarly Resources • We hope/assume that peer-reviewed scholarly literature has fixity and is adequately archived • This, BTW, might not be a correct assumption: • Dynamic, content rich, landing pages • No public audit regarding archival status of electronic journal literature archived in special-purpose infrastructure • Poor archiving in public web archives, related to protected content • Initial information in Keepers Registry indicates spotty archiving of electronic journal literature • … • Still, this is NOT what Hiberlink investigates • See David Rosenthal blog post http://blog.dshr.org/2013/11/patio-perspectives-at-anadp-ii.html
  • 13.
    References in Web-Based Scholarly Communication • To Scholarly Resources • Link Rot: DOI, HTTP version of DOI • Content Decay: Fixity of content; Archiving: CLoCKSS, LoCKSS, Portico, Keepers Registry, … • To Web at Large Resources: Hiberlink focus
  • 14.
    References to “Web at Large” Resources • Hiberlink focuses on the wide variety of web resources needed or created in research activities • These resources: • Are not necessarily under the custodianship of a party that cares about long-term integrity and access • Do not necessarily have the same sense of fixity that e.g. journal articles have • Reference Rot makes it impossible to adequately recreate the temporal context for scholarly discourse
  • 15.
    Herbert Van de Sompel, et al. (2004) http://dx.doi.org/10.1045/september2004-vandesompel
  • 16.
    [Figure labels: !Exist, Archived; Exist, Archived; !Exist, Archived; !Exist, !Archived; Exist, Archived]
  • 17.
    Hiberlink: Investigating Reference Rot • Hiberlink explores references to Web at Large resources: • Quantifies Reference Rot • Explores potential solutions to Reference Rot • Focuses on links in electronic journal articles • But has the big picture in mind: dynamic, interdependent, web-based scholarly assets • See Herbert Van de Sompel, From the Version of Record to a Version of the Record, CNI Spring 2013 plenary talk - http://www.youtube.com/watch?v=fhrGSQbNVA
  • 18.
    References in Web-Based Scholarly Communication • To Scholarly Resources • Link Rot: DOI, HTTP version of DOI • Content Decay: Fixity of content; Archiving: CLoCKSS, LoCKSS, Portico, Keepers Registry, … • To Web at Large Resources: Is it worth our time to study this?
  • 19.
    Articles Increasingly Link to Web Resources • URIs extracted from PubMed papers – links to Web at Large resources
  • 20.
    The New York Times Cares http://www.nytimes.com/2013/09/24/us/politics/in-supreme-court-opinions-clicks-that-lead-nowhere.html
  • 21.
    Reference Rot in Law Journals • Zittrain, J., Albert, K., Lessig, L. (2013) Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations • Link rot in law journals: ~27% • Reference rot in law journals: ~70% • http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2329161
  • 22.
    Not Just in Scholarly Communication • Zittrain, J., Albert, K., Lessig, L. (2013) Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations • Liebler, R., Liebert, J. (2012) Something rotten in the State of Legal Citation • Link rot: 29% of links in Supreme Court decisions (study of 1996-2010) • Reference rot, including link rot: 49.9% of links in Supreme Court decisions • http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2329161 • http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2188070
  • 23.
    Not Just in Scholarly Communication http://en.wikipedia.org/wiki/Wikipedia_talk:Link_rot
  • 24.
    Quantifying Reference Rot
  • 25.
    Quantifying Reference Rot • Reference Rot has been studied before: • For the web at large • For scholarly communication • For government documents • What is different with Hiberlink? • Investigates Reference Rot, not just link rot, i.e. includes the aspect of changing content, not just rotting links • Investigates coverage of referenced resources in web archives • Operates at a massive scale regarding number of journal articles, referenced URIs, web archive lookups
  • 26.
    Study Author (Date)    Year of Publication of Citations    # URIs     # URIs looked up in web archives
    Lawrence (2001)        1993-1999                           67,577     -
    Casserly (2003)        1999-2000                           500        500
    Casserly (2007)        1999-2000                           500        500
    Rumsey (2002)          1997-2001                           3,406      -
    Davis (2002)           1999-2001                           688        -
    Wren (2004)            1994-2002                           1,630      -
    Sellitto (2005)        1995-2003                           1,043      -
    Goh (2005)             1997-2003                           2,516      -
    Dimitrova (2007)       2000-2003                           1,126      -
    McCown (2005)          1995-2004                           4,387      -
    Wagner (2009)          2002-2004                           2,011      2,011
    Parker (2007)          2002-2005                           1,229      -
    Duda (2008)            1997-2005                           2,100      -
    Falagas (2007)         2003-2006                           1,417      -
    Russell (2008)         1999-2006                           510        -
    Wren (2008)            1994-2007                           6,154      -
    Moghaddam (2010)       1995-2008                           1,761      1,761
    Sanderson (2011)       1993-2010                           162,052    162,052
    Sanderson, R., Phillips, M., and Van de Sompel, H. (2011) http://arxiv.org/abs/1105.3459
  • 27.
    Quantifying Reference Rot - Methodology
  • 29.
    • Various fulltext corpora • Articles 01/1997-12/2012
  • 30.
    • URI extraction from XML and PDF • Improvement on URI extraction techniques used in prior research • Validation study planned
  • 31.
    • Referencing article • Referencing journal • Article dates: submission, acceptance, publication • URI position: abstract, body, footnote, references
  • 32.
    • Filter DOIs, HTTP version of DOIs • Filter URIs that should have been referenced by means of a DOI • Supported by secondary datasets • Filter obvious noise, e.g. localhost, example.org, foo.bar, licenses, etc.
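The filtering step above could be sketched roughly as follows. This is only an illustration under stated assumptions: the host sets and the function names `is_web_at_large` and `filter_uris` are hypothetical, not the lists or code used in the study, and the step that catches URIs which should have been referenced by a DOI (which relies on the secondary datasets) is not modeled.

```python
from urllib.parse import urlsplit

# Hosts treated as DOI resolvers or as obvious noise. These sets are
# illustrative assumptions, not the lists used in the study.
DOI_HOSTS = {"dx.doi.org", "doi.org"}
NOISE_HOSTS = {"localhost", "example.org", "foo.bar"}


def is_web_at_large(uri: str) -> bool:
    """Return True if the URI survives the DOI and noise filters."""
    if uri.lower().startswith("doi:"):
        return False  # bare DOI, not a Web at Large reference
    host = (urlsplit(uri).hostname or "").lower()
    if not host:
        return False  # no usable host: treat as noise
    if host in DOI_HOSTS or host in NOISE_HOSTS:
        return False
    return True


def filter_uris(uris):
    """Keep only the URIs that look like Web at Large references."""
    return [u for u in uris if is_web_at_large(u)]
```

A real filter would presumably also need rules for license URIs and publisher-hosted article URIs, which this sketch leaves out.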
  • 34.
    • HTTP HEAD on referenced URI-R • Follow redirects up to a maximum of 50 • Record HTTP transaction chain • If HTTP transaction chain ends with 2XX status code: Exists • If HTTP transaction chain does not end with 2XX: !Exist
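The decision rule above can be expressed over an already recorded transaction chain. The sketch below is a minimal illustration, not the project's code: the function name `classify_chain` is hypothetical, and treating an empty chain or an exceeded redirect limit as !Exist is an assumption the slide does not spell out.

```python
MAX_REDIRECTS = 50  # limit stated on the slide


def classify_chain(status_codes, max_redirects=MAX_REDIRECTS):
    """Classify a recorded HTTP transaction chain as 'Exists' or '!Exist'.

    status_codes lists the status codes observed while following redirects,
    in order; the last entry is the final response.
    """
    if not status_codes:
        return "!Exist"  # assumption: no response at all counts as !Exist
    # Count the redirects that were followed before the final response.
    redirects = sum(1 for code in status_codes[:-1] if 300 <= code < 400)
    if redirects > max_redirects:
        return "!Exist"  # assumption: exceeding the limit counts as !Exist
    final = status_codes[-1]
    return "Exists" if 200 <= final < 300 else "!Exist"
```

Keeping the classification separate from the network code means the recorded chains can be re-classified later without repeating the HTTP requests.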
  • 35.
    • Lookup in web archives via a Memento Aggregator that covers, among others, Internet Archive, Archive-It, archive.is, British Library web archive, UK National Archives web archive, Icelandic web archive
  • 36.
    • Obtain TimeMap per URI • If TimeMap does not exist: !Archived • If TimeMap exists, select Memento URI-M closest to article publication date • HTTP HEAD on URI-M • Follow archived redirects up to a maximum of 50 • Record HTTP transaction chain • If HTTP transaction chain ends with 2XX: Archived • If HTTP transaction chain does not end with 2XX: !Archived
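Selecting the Memento closest to the article publication date might look like the sketch below. It assumes the TimeMap has already been fetched and parsed into (datetime, URI-M) pairs; the function name `closest_memento` and that input shape are illustrative choices, not part of the Memento protocol or of the study's code.

```python
from datetime import datetime


def closest_memento(mementos, publication_date):
    """Return the (datetime, URI-M) pair closest in time to publication_date.

    mementos is an iterable of (datetime, uri_m) tuples taken from a TimeMap.
    Returns None when the TimeMap is empty, i.e. the resource is !Archived.
    """
    mementos = list(mementos)
    if not mementos:
        return None
    # Distance is measured in either direction from the publication date.
    return min(mementos, key=lambda m: abs(m[0] - publication_date))
```

The selected URI-M would then go through the same HEAD request and status check that is applied to the URI-R.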
  • 37.
    Data used for analysis
  • 38.
    Quantifying Reference Rot – Early Results [Figure thumbnails: chart values 31.2%, 16.8%, 11.3%, 40.7%; per-year plots, 1997-2011, with series !Exist, Archived, Archived within 30 days, Archived within 14 days, Archived within 7 days, Archived within 1 day; axis labels “Amount of citations” and “Weeks”]
  • 39.
    Study: PubMed Central Corpus 01/1997 – 12/2012 • Articles processed: 494,785 • Articles that contain Web at Large URIs: 176,527 • References to Web at Large URIs: 557,432 • Unique referenced Web at Large URIs: 327,782
  • 40.
    Percentage Exists & Archived Referenced URIs • Categories: Exists & Archived, !Exists & Archived, Exists & !Archived, !Exists & !Archived • Chart values: 31.2%, 16.8%, 11.3%, 40.7% • URIs extracted from PubMed papers – links to Web at Large resources
  • 41.
    Percentage Exists & Archived in 30 Day Window • Categories: Exists & Archived, !Exists & Archived, Exists & !Archived, !Exists & !Archived • Chart values: 23%, 16.7%, 5.1%, 55.2% • URIs extracted from PubMed papers – links to Web at Large resources
  • 42.
    Percentage Exists & Archived in 15 Day Window • Categories: Exists & Archived, !Exists & Archived, Exists & !Archived, !Exists & !Archived • Chart values: 24.6%, 12.4%, 3.5%, 59.5% • URIs extracted from PubMed papers – links to Web at Large resources
  • 43.
    Percentage Exists & Archived in 07 Day Window • Categories: Exists & Archived, !Exists & Archived, Exists & !Archived, !Exists & !Archived • Chart values: 25.8%, 8.8%, 2.3%, 63.1% • URIs extracted from PubMed papers – links to Web at Large resources
  • 44.
    Percentage Exists & Archived in 01 Day Window • Categories: Exists & Archived, !Exists & Archived, Exists & !Archived, !Exists & !Archived • Chart values: 27.9%, 0.9%, 0.2%, 71% • URIs extracted from PubMed papers – links to Web at Large resources
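The four categories and the day windows in these charts could be tallied along the following lines. This is a sketch under a loud assumption: the window is taken to be measured in either direction around the article publication date, which the slides do not state explicitly. The names `category` and `percentages` and the record layout are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def category(exists, memento_dt, publication_dt, window_days=None):
    """Return one of the four category labels used on the slides.

    memento_dt is None when no Memento was found. With window_days set, a
    Memento only counts if it lies within that many days of publication_dt
    (assumption: the window is symmetric around the publication date).
    """
    archived = memento_dt is not None
    if archived and window_days is not None:
        archived = abs(memento_dt - publication_dt) <= timedelta(days=window_days)
    left = "Exists" if exists else "!Exists"
    right = "Archived" if archived else "!Archived"
    return f"{left} & {right}"


def percentages(records, window_days=None):
    """Tally (exists, memento_dt, publication_dt) records into percentages."""
    counts = Counter(category(e, m, p, window_days) for e, m, p in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}
```

Under this reading, narrowing the window can only move a URI from an Archived category to the matching !Archived one, so the Exists and !Exists totals stay the same across windows.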
  • 45.
    Percentage of !Exists per Year [Plot: Percent, 0-100, by year, 1997-2011] URIs extracted from PubMed papers – links to Web at Large resources
  • 46.
    Percentage of !Exists, Archived per Year [Plot: 0-100 by year, 1997-2011; series: !Exist, Archived, Archived within 30 days, Archived within 14 days, Archived within 7 days, Archived within 1 day] URIs extracted from PubMed papers – links to Web at Large resources
  • 47.
    Percentage of !Exists and of Those Archived per Year [Plot: axes “Percentage !Exists URIs” and “Percentage Archived URIs for !Exists URIs”, 0-100, by year, 1997-2011; series: !Exist, Archived, Archived within 30 days, Archived within 14 days, Archived within 7 days, Archived within 1 day] URIs extracted from PubMed papers – links to Web at Large resources
  • 48.
    Absolute Number of Archived per Year [Plot: axis ticks from 1 to 30000, by year, 1997-2011; series: Archived, Archived within 30 days, Archived within 14 days, Archived within 7 days, Archived within 1 day] URIs extracted from PubMed papers – links to Web at Large resources
  • 49.
    Solving Reference Rot
  • 50.
    References in Web-Based Scholarly Communication • To Scholarly Resources • Link Rot: DOI, HTTP version of DOI • Content Decay: Fixity of content; Archiving: CLoCKSS, LoCKSS, Portico, Keepers Registry, … • To Web at Large Resources
  • 51.
    Addressing Content Decay • Aim for a more pro-active approach to collect snapshots of web resources (likely to be) referenced in scholarly communication • A system that hosts resources that are likely to be referenced in scholarly communication can create snapshots of itself by: o Using CMS, wikis, datawikis with solid versioning mechanisms o Subscribing to on-demand self web archiving service o Using transactional web archives, cf. SiteStory • Referenced resources can be web archived on-demand: o By authors during note taking, authoring o By platforms involved in the publication process, e.g. archiving linked resources at the time of manuscript submission
  • 52.
    References in Web-Based Scholarly Communication • To Scholarly Resources • Link Rot: DOI, HTTP version of DOI • Content Decay: Fixity of content; Archiving: CLoCKSS, LoCKSS, Portico, Keepers Registry, … • To Web at Large Resources • Content Decay: Web archiving; Content Versioning Systems; Self archiving
  • 53.
    Click link to blog post http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/
  • 54.
    Receive page http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/
  • 55.
    Search and find Mementos in Internet Archive for http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/
  • 56.
    Search and find a Memento in archive.is for http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/
  • 57.
    Click perma.cc link to Memento of blog post http://perma.cc/0Hg62eLdZ3T
  • 58.
    Receive Memento from perma.cc http://perma.cc/0Hg62eLdZ3T
  • 59.
    Search and do not find Mementos in Internet Archive for http://perma.cc/0Hg62eLdZ3T
  • 60.
    Search and do not find Mementos in archive.is for http://perma.cc/0Hg62eLdZ3T
  • 61.
    What Happened? • Good news: The number of archived copies of the blog post was increased by pro-actively creating a Memento in perma.cc • Bad news: The possibility of finding Mementos for the blog post in other web archives was undermined by replacing the Original URI-R with the Memento URI-M • The Memento URI-M is a key in only one archive • The Original URI-R is a key in all web archives • Using the Memento URI-M in a link requires the permanent existence/uptime of the archive that issued it • One link rot problem was replaced by another …
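The point that a Memento URI-M is a key in only one archive can be illustrated with the URI-Ms that appear in this talk. The sketch below assumes only the two URI-M patterns visible in the examples (web.archive.org and archive.is with a 14-digit timestamp); the function name `original_from_memento` is hypothetical, and other archives may use different patterns.

```python
import re

# URI-M patterns as they appear in the examples in this talk: a 14-digit
# timestamp followed by the Original URI-R. Other archives may differ.
_PATTERNS = [
    re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$"),
    re.compile(r"^https?://archive\.is/(\d{14})/(.+)$"),
]


def original_from_memento(uri_m):
    """Return (timestamp, URI-R) if the URI-M embeds them, else None."""
    for pattern in _PATTERNS:
        match = pattern.match(uri_m)
        if match:
            return match.group(1), match.group(2)
    return None
```

For a URI-M such as http://perma.cc/0Hg62eLdZ3T the function returns None: nothing in that URI leads back to the Original URI-R, so a client holding only that link has no key for looking up other archives.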
  • 62.
    Web Archives Less Permanent than Permanent? http://webcitation.org
  • 63.
    Web Archives Less Permanent than Permanent? http://ws-dl.blogspot.com/2013/11/2013-11-21-conservative-party-speeches.html
  • 64.
    Web Archives Less Permanent than Permanent? http://richmondsfblog.com/2013/11/06/part-of-internet-archive-building-badly-burned-in-earlymorning-fire/
  • 65.
    What To Do? • Need an approach for referencing archived resources that supports lookups in many web archives, not just one • Since the Original URI-R is a key in all web archives, the linking approach must include it • Hence, two URIs are required: • The Original URI-R • The Memento URI-M, e.g. the perma.cc URI • But a link in HTML only carries one URI! • It is understandable that the Memento URI-M is used for the link: the approach works with existing web infrastructure • Yet, an approach to address link rot that itself is subject to link rot is … err… problematic
  • 66.
    The Missing Link Proposal • Extend the link to the Original URI-R with temporal context: • Memento URI-M in a specific archive • Dates: • date of page that contains the link • date of the link, cf. “accessed at” in citations of web resources • Provide the Original URI-R and the temporal context in a machine-actionable manner so it can be used by user and machine agents to retrieve Mementos from various web archives http://mementoweb.org/missing-link/
  • 67.
    The Missing Link Proposal http://mementoweb.org/missing-link/
  • 68.
    How to Make Missing Link Happen? • The existing approach works out of the box but is problematic • Missing Link requires infrastructure changes but generally contributes to increased web persistence: • HTML • META for page date: no problem, already in use • Attributes for <a> to convey URI-M and link date: • data- extensibility mechanism in HTML5 can be used but is not intended for cross-site applications • In 1995, HTML had the URN attribute for <a> as a means to address web persistence concerns • Browser, tool support
  • 69.
    References in Web-Based Scholarly Communication • To Scholarly Resources • Link Rot: DOI, HTTP version of DOI • Content Decay: Fixity of content; Archiving: CLoCKSS, LoCKSS, Portico, Keepers Registry, … • To Web at Large Resources • Link Rot: Missing Link proposal • Content Decay: Web archiving; Content Versioning Systems; Self archiving
  • 70.
    Demo: Application Using Temporal Context for Links
  • 71.
    Application Using Temporal Context for Links • Memento for Chrome is an application that uses Original URI-R and dates to access Mementos in various web archives • Memento around the date selected in user interface calendar • Most recently archived Memento
  • 72.
    Memento Time Travel for Chrome http://bit.ly/memento-for-chrome
  • 73.
    Memento Time Travel for Chrome http://www.youtube.com/watch?v=0_70lQPOOIg http://www.youtube.com/watch?v=WtZHKeFwjzk
  • 74.
    Application Using Temporal Context for Links • An experimental version of Memento for Chrome also uses Missing Link information (Original URI-R, URI-M, and dates) to access Mementos in various web archives: • Memento around the date selected in user interface calendar • Most recently archived Memento • Memento around the date of the page that contains the link • Memento around the date of the link • Memento URI-M in a specific archive • A Memento client is just one example of an application that can use temporal context provided for links. Other applications, including search engines, can use it too
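How a client might turn this temporal context into lookups can be sketched as below. In the Memento protocol (RFC 7089) the desired time travels in an Accept-Datetime request header formatted as an HTTP date; everything else here is an assumption made for illustration: the function names, the option labels, and the shape of the `link` dictionary are hypothetical, and the actual requests to a TimeGate or aggregator are not shown.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime(date_str):
    """Format a YYYY-MM-DD date as an HTTP-date for Accept-Datetime."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return format_datetime(dt, usegmt=True)


def lookup_options(link, page_date=None):
    """List the Memento lookups a client could offer for one link.

    link is a dict with 'href' (the Original URI-R) and optionally
    'versionurl' (URI-M) and 'versiondate' (link date).
    Each option is (label, URI to look up, Accept-Datetime value or None).
    """
    options = [("most recent", link["href"], None)]
    if page_date:
        options.append(("page date", link["href"], accept_datetime(page_date)))
    if link.get("versiondate"):
        options.append(
            ("link date", link["href"], accept_datetime(link["versiondate"]))
        )
    if link.get("versionurl"):
        options.append(("specific archive", link["versionurl"], None))
    return options
```

Only the last option depends on a single archive; all the others start from the Original URI-R and can therefore be answered by any archive that holds a Memento.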
  • 75.
    NYT has <META itemprop="datePublished" content="2013-09-23"> • Link in NYT was: <a href="http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/"> • Changed to: <a href="http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/" data-versionurl="http://perma.cc/0Hg62eLdZ3T">
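Reading the attributes used in this demo out of a page could be done with Python's standard `html.parser` module, as in the sketch below. The attribute names `data-versionurl` and `data-versiondate` are the ones shown on the slides; the class name `MissingLinkParser`, the helper `extract_links`, and the output dictionary keys are hypothetical.

```python
from html.parser import HTMLParser


class MissingLinkParser(HTMLParser):
    """Collect href, data-versionurl and data-versiondate from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag != "a":
            return
        attributes = dict(attrs)
        if "href" not in attributes:
            return
        self.links.append({
            "href": attributes["href"],                          # Original URI-R
            "versionurl": attributes.get("data-versionurl"),    # Memento URI-M
            "versiondate": attributes.get("data-versiondate"),  # link date
        })


def extract_links(html):
    """Return one dict per <a href> element found in the HTML string."""
    parser = MissingLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Because the Original URI-R stays in `href`, agents that ignore the extra attributes keep working unchanged.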
  • 76.
    Right Click Link: Get near current time (done on Nov 25 2013) http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/ enabler: <a href="URI-R">
  • 77.
    Receive Memento from archive.is, Nov 24 2013 http://archive.is/20131124221749/http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/
  • 78.
    Right Click Link: Get at page date http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/ enabler: <a href="URI-R"> & <META itemprop="datePublished" content="2013-09-23">
  • 79.
    Receive Memento from Internet Archive, Sep 24 2013 http://web.archive.org/web/20130924053315/http://futureoftheinternet/2013/09/22/perma
  • 80.
    Right Click Link: Get from perma.cc http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/ enabler: <a href="URI-R" data-versionurl="URI-M">
  • 81.
    Receive Memento from perma.cc, Oct 2 2013 http://perma.cc/0Hg62eLdZ3T
  • 82.
    Link in NYT was: <a href="http://perma.cc/0Hg62eLdZ3T"> Changed to: <a href="http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/perma/" data-versionurl="http://perma.cc/0Hg62eLdZ3T">
  • 83.
    All previous options available
  • 84.
    Added: <META itemprop="datePublished" content="2013-09-22">
  • 85.
    Click Link (done on November 25 2013) http://en.wikipedia.org/wiki/Link_rot enabler: <a href="URI-R">
  • 86.
    Receive Page http://en.wikipedia.org/wiki/Link_rot
  • 87.
    Scroll down in page • Shows Perma.cc link, added October 22 2013, a month after the blog post
  • 88.
    Right Click Link: Get at page date http://en.wikipedia.org/Link_rot enabler: <a href="URI-R"> & <META itemprop="datePublished" content="2013-09-22">
  • 89.
    Receive Page http://en.wikipedia.org/w/index.php?title=Link_rot&oldid=571327764
  • 90.
    Scroll down in page • Does not show Perma.cc link, added October 22 2013, a month after the blog post
  • 91.
    Link in blog was: <a href="http://librarylab.law.harvard.edu"> Changed (for fun) to: <a href="http://librarylab.law.harvard.edu" data-versiondate="2010-09-22">
  • 92.
    Click Link (done on November 25 2013) http://librarylab.law.harvard.edu enabler: <a href="URI-R">
  • 93.
    Receive Page http://librarylab.law.harvard.edu
  • 94.
    Right Click Link: Get at page date http://librarylab.law.harvard.edu enabler: <a href="URI-R"> & <META itemprop="datePublished" content="2013-09-22">
  • 95.
    Receive Memento from archive.is, Jun 21 2013 http://archive.is/20130621162538/http://librarylab.law.harvard.edu
  • 96.
    Right Click Link: Get at link date http://librarylab.law.harvard.edu enabler: <a href="URI-R" data-versiondate="2010-09-22">
  • 97.
    Receive Memento from Internet Archive, Sep 18 2010 http://web.archive.org/web/20100918025331/http://librarylab.law.harvard.edu
  • 98.
    Bottom Line: A Link Leads to Many Times and Archives http://mementoweb.org/missing-link/
  • 99.
    Investigating Reference Rot in Web-Based Scholarly Communication Herbert Van de Sompel Los Alamos National Laboratory @hvdsomp Martin Klein Los Alamos National Laboratory @mart1nkle1n http://hiberlink.org #hiberlink http://mementoweb.org #memento Hiberlink is funded by the Andrew W. Mellon Foundation

Editor's Notes

  • #13 The basic consideration in the talk is that life used to be simple when scholarly assets were PDFs: single frozen assets
  • #21 Problem in scholarly communication, legal journals, Supreme Court opinions, Wikipedia, … Since the problem is so broad, we need a solution that works for the web at large, not just for scholarly communication
  • #64 Quote from Wagner et al.: Because sites such as Internet Archive and WebCite will remove archived web pages at the owners’ request, authors should not depend on these utilities as the sole archives for web-based information.