Reference Rot and Link Decoration
Martin Klein
UCLA
martinklein0815@gmail.com
@mart1nkle1n
OAI9, Geneva, June 17th 2015
Hiberlink Team
• Los Alamos National Laboratory
  • Research Library: (Martin Klein), (Robert Sanderson), Harihar Shankar, Herbert Van de Sompel
• University of Edinburgh
  • Edina: Peter Burnhill, Neil Mayo, Muriel Mewissen, Christine Rees, Tim Strickland, Richard Wincewicz
  • Language Technology Group: Beatrix Alex, Claire Grover, Colin Matheson, Richard Tobin, (Ke “Adam” Zhou)
• Funding: Andrew W. Mellon Foundation
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115253
Reference Rot
Link Rot
“Entertaining” Link Rot
Ubiquitous Link Rot
Content Drift
http://dl00.org: 2000, 2004, 2005, 2008
NYT Coverage
Links in Supreme Court decisions:
• Link rot: 29%
• Reference rot: 49%
Scholarly Communication
[Figure labels, in order: !Exist, !Exist, !Exist, Exist, Exist; Archived, Archived, !Archived, Archived, Archived]
Entrance Hiberlink
• These resources:
  • Are not necessarily under the custodianship of parties that care about long-term integrity and access
  • Do not necessarily have the same sense of fixity as, e.g., journal articles
• Links to these resources are subject to Reference Rot:
  • Link Rot: the link stops working, e.g., HTTP 404
  • Content Drift: the linked content changes over time
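The two failure modes above can be told apart with a small check. This is only a sketch under stated assumptions: the function names are hypothetical, a byte-for-byte SHA-256 comparison stands in for "content changed" (far stricter than any notion of meaningful drift), and only a 200 response counts as a working link.

```python
import hashlib
from typing import Optional


def digest(body: bytes) -> str:
    """Return a SHA-256 hex digest of a response body."""
    return hashlib.sha256(body).hexdigest()


def classify_link(status_code: int, body: Optional[bytes], reference_digest: str) -> str:
    """Classify one observation of a link (hypothetical helper).

    reference_digest is the digest of the content as it was when the link
    was created; how that digest is obtained is outside this sketch.
    """
    # Assumption: anything other than 200 OK counts as a broken link.
    if status_code != 200 or body is None:
        return "link rot"
    # Assumption: any byte-level difference counts as drift.
    if digest(body) != reference_digest:
        return "content drift"
    return "intact"
```

A real comparison would need to tolerate cosmetic changes such as rotating ads or timestamps, which is one reason a byte-level digest is only a placeholder here.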
Quantifying Reference Rot
Our Study
• Time frame of publications: Jan 1997 - Dec 2012
• Articles from arXiv, Elsevier, and PMC in XML and PDF format
• Convert PDF to XML
• Extract URIs to web-at-large resources
• Store article’s publication date
• URI live web test (a 200 OK response is trusted)
• URI archive lookup via Memento infrastructure
                                arXiv     Elsevier    PMC
total articles                  707,667   2,285,000   595,889
articles with HTTP references   142,134   94,645      156,160
number of HTTP references       346,177   232,712     480,853
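The live web test and the archive lookup from the study steps could be sketched roughly as follows. This is not the study's code. The TimeGate prefix is assumed to be the Memento Time Travel aggregator endpoint, the function names are hypothetical, and the sketch only builds the request (URL plus an `Accept-Datetime` header as defined by the Memento protocol, RFC 7089) without sending it.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Tuple

# Assumption: the Memento aggregator TimeGate lives under this prefix.
TIMEGATE_PREFIX = "http://timetravel.mementoweb.org/timegate/"


def is_live(status_code: int) -> bool:
    """Live web test: only a 200 OK response is trusted."""
    return status_code == 200


def timegate_request(uri: str, when: datetime) -> Tuple[str, Dict[str, str]]:
    """Build the URL and headers for a Memento TimeGate lookup.

    `when` must be a timezone-aware datetime, e.g. the article's
    publication date; it is sent as an HTTP date in GMT.
    """
    accept_datetime = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE_PREFIX + uri, {"Accept-Datetime": accept_datetime}
```

The returned pair can be handed to any HTTP client; the TimeGate is expected to redirect to the archived copy closest to the requested datetime.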
Our Corpora
[Figure: articles and URI references per year; panels: PMC, Elsevier, arXiv]
Link Rot in arXiv
[Figure: number of HTTP references and link rot percentage per year]
[Figure: number of HTTP references and link rot percentage per year; panels: PMC, Elsevier, arXiv]
Content Drift / Archival Status
arXiv, All Links:
• Active: 74.0%, Rotten: 26.0%
• Archived: 24.7%, Not Archived: 75.3%
• Archival status used as proxy
• Availability of archived copy created within N days of article’s publication
• N = 14
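The archival proxy can be illustrated with a short check. This is a sketch, not the study's implementation: the function name is hypothetical, and the window is assumed to extend N days on either side of the publication date, which the slide does not specify.

```python
from datetime import date, timedelta
from typing import Iterable


def archived_within(publication: date, snapshots: Iterable[date], n_days: int = 14) -> bool:
    """Return True if any archived snapshot falls within n_days of publication.

    Assumption: snapshots taken shortly before or shortly after the
    publication date both count.
    """
    window = timedelta(days=n_days)
    return any(abs(snapshot - publication) <= window for snapshot in snapshots)
```

For example, `archived_within(date(2010, 5, 1), [date(2010, 5, 10)])` is true, while a single snapshot from `date(2011, 1, 1)` would not qualify.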
All Links
           Active   Rotten   Archived   Not Archived
arXiv      74.0%    26.0%    24.7%      75.3%
Elsevier   67.3%    32.7%    24.8%      75.2%
PMC        80.0%    20.0%    25.5%      74.5%
Loss of Context
[Figure: all links, active links, links archived (14 days)]
STM Article Extrapolation
• Immune: article contains no URIs to web-at-large resources
• Healthy: none of the URIs to web-at-large resources suffers from link rot or content drift
• Infected: at least one URI to web-at-large resources suffers from link rot or content drift
Immune vs not Immune STM Articles
[Figure: percentage of Immune and not Immune articles per publication year, 1997 to 2012]
STM Article Extrapolation
• Immune: article contains no URIs to web-at-large resources
• Healthy: none of the URIs to web-at-large resources suffers from reference rot
• Infected: at least one URI to web-at-large resources suffers from reference rot
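The three categories translate directly into a small classifier. A minimal sketch, assuming each URI in an article has already been judged for reference rot; the function name and the boolean-per-link input are hypothetical.

```python
from typing import Iterable


def classify_article(link_has_reference_rot: Iterable[bool]) -> str:
    """Classify an article from per-link reference rot flags.

    Each flag is True if that web-at-large URI suffers from reference rot.
    """
    flags = list(link_has_reference_rot)
    if not flags:
        return "immune"      # no URIs to web-at-large resources at all
    if any(flags):
        return "infected"    # at least one rotten reference
    return "healthy"         # has URIs, none of them rotten
```

So an article with no such links is immune, `[False, False]` yields healthy, and a single `True` anywhere makes the article infected.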
Immune, Healthy, Infected STM Articles
[Figure: percentage of Immune, Healthy, and Infected articles per publication year, 1997 to 2012]
1 in 5 articles suffers from Reference Rot
An approach to solve Reference Rot
Robust Links
1. Create snapshot of linked resources in a web archive when:
• drafting work
• submitting article
• publishing article
• aggregating article
Robust Links
1. Create snapshot of linked resources in a web archive
2. Convey creation date of your web page in machine-actionable manner
Page Creation Date
<!DOCTYPE html>
<html>
<head>
<title> … </title>
<meta itemprop="datePublished" content="2015-02-18" />
…
</head>
…
</html>
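A consumer could read that date back out of the page as sketched below. This uses only Python's standard `html.parser`; the class and function names are hypothetical, and the sketch assumes the date is carried in a `meta` element with `itemprop="datePublished"` exactly as in the markup above.

```python
from html.parser import HTMLParser
from typing import Optional


class DatePublishedParser(HTMLParser):
    """Collect the content of the first <meta itemprop="datePublished">."""

    def __init__(self) -> None:
        super().__init__()
        self.date_published: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are routed here as well.
        attributes = dict(attrs)
        if (
            tag == "meta"
            and attributes.get("itemprop") == "datePublished"
            and self.date_published is None
        ):
            self.date_published = attributes.get("content")


def page_creation_date(html_text: str) -> Optional[str]:
    """Return the datePublished value, or None if the page does not declare one."""
    parser = DatePublishedParser()
    parser.feed(html_text)
    return parser.date_published
```

The value is returned as the raw string from the page; parsing it into a date object is left to the caller.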
Robust Links
1. Create snapshot of linked resources in a web archive
2. Convey creation date of your web page in machine-actionable manner
3. Decorate links with datetime of linking and URI of archived snapshot, in addition to resource’s original URI
http://robustlinks.mementoweb.org/spec/
Link Decoration
<a href="http://hiberlink.org/">http://hiberlink.org/</a>
<a href="http://hiberlink.org/"
   data-versionurl="http://archive.is/Bvq2v"
   data-versiondate="2014-11-01">http://hiberlink.org/</a>
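Decorated links of this shape can be generated mechanically once the snapshot URI and linking date are known. A minimal sketch: the function name and parameters are hypothetical, and only the two `data-` attributes shown above are emitted.

```python
from html import escape
from typing import Optional


def decorate_link(original_uri: str, snapshot_uri: str, link_date: str,
                  text: Optional[str] = None) -> str:
    """Render an anchor carrying the original URI, snapshot URI, and link date.

    link_date is expected as YYYY-MM-DD; it is not validated here.
    """
    label = text if text is not None else original_uri
    # escape() also escapes double quotes, so values are safe inside attributes.
    return '<a href="{}" data-versionurl="{}" data-versiondate="{}">{}</a>'.format(
        escape(original_uri), escape(snapshot_uri), escape(link_date), escape(label)
    )
```

Calling `decorate_link("http://hiberlink.org/", "http://archive.is/Bvq2v", "2014-11-01")` reproduces the anchor above on a single line.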
http://robustlinks.mementoweb.org/demo/uri_references_js.html