The document describes a study that used the Memento framework to analyze more than 1 million URI references extracted from 3.5 million scholarly articles published between 1997 and 2012. For references that had a representative archived version (Memento snapshots from before and after the article's publication date with identical content) and were still available on the live web, about three out of four showed content drift when the live page was compared with the representative version. This level of drift threatens the integrity of the scholarly record. The talk aims to raise awareness of the issue and to encourage authors, publishers, web archives, and other parties to help address it through archiving and the use of robust links.
Using the Memento Framework to Assess Content Drift in Scholarly Communication
1. Memento to Assess Content Drift in Scholarly Communication
@mart1nkle1n
IIPC WAC, 06/16/2017, London, UK
Using the Memento Framework
to Assess Content Drift
in Scholarly Communication
Acknowledgements:
Shawn Jones, Harihar Shankar (LANL)
Richard Tobin, Claire Grover (University of Edinburgh)
Andy Jackson (British Library)
Martin Klein
@mart1nkle1n
Herbert Van de Sompel
@hvdsomp
Research Library
Los Alamos National Laboratory
2. Memento to Assess Content Drift in Scholarly Communication
Link Rot
4. Memento to Assess Content Drift in Scholarly Communication
Content Drift
5. Memento to Assess Content Drift in Scholarly Communication
http://dl00.org
2000
6. Memento to Assess Content Drift in Scholarly Communication
http://dl00.org
2004
7. Memento to Assess Content Drift in Scholarly Communication
http://dl00.org
2005
8. Memento to Assess Content Drift in Scholarly Communication
http://dl00.org
2008
9. Memento to Assess Content Drift in Scholarly Communication
Content Drift
(in legal documents)
11. Memento to Assess Content Drift in Scholarly Communication
Content Drift
(in scholarly articles)
12. Memento to Assess Content Drift in Scholarly Communication
Referenced in
http://dx.doi.org/10.1016/j.nuclphysa.2009.05.110
published on August 15th 2009
Snapshots: May 8th 2009 | August 27th 2009
13. Memento to Assess Content Drift in Scholarly Communication
Referenced in
http://arxiv.org/abs/astro-ph/9707064
published on July 4th 1997
Snapshots: June 7th 1997 | today
14. Memento to Assess Content Drift in Scholarly Communication
arXiv Corpus
[Chart: number of articles and URI references per publication year, 1997 to 2011]
15. Memento to Assess Content Drift in Scholarly Communication
http://hiberlink.org/
Definition:
• Link Rot + Content Drift = Reference Rot
Observation:
• Web at large resources are referenced in scholarly articles
• Links to these resources are subject to Reference Rot
Problem:
• Threat to integrity of the web-based scholarly record
• Resources do not have the same sense of fixity as, e.g., journal articles
• Resources’ custodianship is different in terms of long-term archiving, integrity, and access
16. Memento to Assess Content Drift in Scholarly Communication
http://dx.doi.org/10.1371/journal.pone.0115253
17. Memento to Assess Content Drift in Scholarly Communication
Focus: Content Drift
18. Memento to Assess Content Drift in Scholarly Communication
http://dx.doi.org/10.1371/journal.pone.0167475
19. Memento to Assess Content Drift in Scholarly Communication
Study Dataset
• 3.5 million articles from arXiv, Elsevier, PMC
• Published between Jan 1997 and Dec 2012
• Converted from PDF to XML
• Extraction of URIs to web at large resources (>1 million)
• Keep track of articles’ publication date
20. Memento to Assess Content Drift in Scholarly Communication
Novel Approach to Assess Content Drift
21. Memento to Assess Content Drift in Scholarly Communication
Step 1: Find Mementos
• ~ 1 million URI references
• ~ 650k Memento Pre/Post pairs discovered via Memento
https://mementoweb.org
https://tools.ietf.org/html/rfc7089
[Timeline: Memento Pre at t-1, reference at t, Memento Post at t+1]
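The Pre/Post selection in Step 1 can be sketched as follows. This is a minimal illustration, not the study's code: the function name `select_pre_post` is hypothetical, the list of Memento datetimes is assumed to have been obtained already (for example from a Memento TimeMap), and it is an assumption here that "Pre" means the closest Memento strictly before the reference date and "Post" the closest strictly after it.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime
from typing import Optional, Sequence, Tuple


def select_pre_post(
    memento_datetimes: Sequence[datetime],
    reference_date: datetime,
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Return the closest Memento datetimes before and after reference_date.

    Hypothetical helper: either element is None when no Memento exists
    on that side of the reference date.
    """
    ordered = sorted(memento_datetimes)

    # Everything left of this index is strictly earlier than reference_date.
    first_not_before = bisect_left(ordered, reference_date)
    pre = ordered[first_not_before - 1] if first_not_before > 0 else None

    # Everything from this index on is strictly later than reference_date.
    first_after = bisect_right(ordered, reference_date)
    post = ordered[first_after] if first_after < len(ordered) else None

    return pre, post


if __name__ == "__main__":
    # Dates taken from the snapshot example earlier in the slides.
    snapshots = [datetime(2009, 8, 27), datetime(2009, 5, 8)]
    print(select_pre_post(snapshots, datetime(2009, 8, 15)))
```

A URI reference only yields a Pre/Post pair when both values are present; references with a Memento on one side only would drop out at this step.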
22. Memento to Assess Content Drift in Scholarly Communication
Step 2: Select Representative Mementos
23. Memento to Assess Content Drift in Scholarly Communication
Step 2: Select Representative Mementos
• Apply content similarity measures
• How similar is representative?
24. Memento to Assess Content Drift in Scholarly Communication
Content Similarity Measures
• Compute normalized scores (values between 0 and 100) for:
• Simhash
• Jaccard
• Sørensen-Dice
• Cosine
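The four measures listed above can be sketched in a few lines. The slides do not say how the text was tokenized or how each score was normalized, so everything below is an assumption for illustration: lowercase word tokens, set-based Jaccard and Sørensen-Dice, term-frequency cosine, and a 64-bit Simhash built from MD5 token hashes and scored by Hamming distance. All function names are hypothetical.

```python
import hashlib
import math
import re
from collections import Counter

SIMHASH_BITS = 64  # assumed fingerprint width


def tokens(text):
    """Lowercase word tokens (an assumed, deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def jaccard(a, b):
    """Jaccard index on token sets, scaled to 0..100."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * len(sa & sb) / len(sa | sb)


def sorensen_dice(a, b):
    """Sørensen-Dice coefficient on token sets, scaled to 0..100."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * 2 * len(sa & sb) / (len(sa) + len(sb))


def cosine(a, b):
    """Cosine similarity of term-frequency vectors, scaled to 0..100."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    if not ca and not cb:
        return 100.0
    if not ca or not cb:
        return 0.0
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    # One square root over the product keeps identical inputs at exactly 100.
    norm = math.sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    return min(100.0, 100.0 * dot / norm)


def simhash(text):
    """64-bit Simhash fingerprint from per-token MD5 hashes."""
    weights = [0] * SIMHASH_BITS
    for token, count in Counter(tokens(text)).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        for bit in range(SIMHASH_BITS):
            weights[bit] += count if (value >> bit) & 1 else -count
    return sum(1 << bit for bit, weight in enumerate(weights) if weight > 0)


def simhash_similarity(a, b):
    """Share of matching fingerprint bits, scaled to 0..100."""
    distance = bin(simhash(a) ^ simhash(b)).count("1")
    return 100.0 * (1 - distance / SIMHASH_BITS)


def all_scores(a, b):
    """Compute the four scores for one pair of texts."""
    return {
        "simhash": simhash_similarity(a, b),
        "jaccard": jaccard(a, b),
        "sorensen_dice": sorensen_dice(a, b),
        "cosine": cosine(a, b),
    }
```

Calling `all_scores(pre_text, post_text)` on the extracted text of two Mementos gives one score per measure. Under these assumptions, identical texts score 100 on all four.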
25. Memento to Assess Content Drift in Scholarly Communication
Representative Mementos
• Idea
• If perfect score in all 4 similarity measures, Memento Pre and Post are the same: Representative Mementos
• Sanity check needed
• Via HTTP headers: ETag and Last-Modified
• If same for Pre and Post Memento: HTTP-same
• Sanity check passed!
• 98.88% of Memento pairs that are HTTP-same have a perfect score in all 4 similarity measures
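The representativeness test and the header-based sanity check can be sketched as two small predicates. This is an illustration under stated assumptions, not the study's implementation: the names `is_representative` and `http_same` are hypothetical, scores are assumed to be on the 0 to 100 scale, and it is an assumption that "HTTP-same" requires both ETag and Last-Modified to be present and equal for the Pre and Post Memento.

```python
from typing import Mapping

PERFECT_SCORE = 100.0
MEASURES = ("simhash", "jaccard", "sorensen_dice", "cosine")


def is_representative(scores: Mapping[str, float]) -> bool:
    """True when the Pre/Post pair has a perfect score in all four measures."""
    return all(scores.get(measure) == PERFECT_SCORE for measure in MEASURES)


def http_same(pre_headers: Mapping[str, str], post_headers: Mapping[str, str]) -> bool:
    """True when ETag and Last-Modified both exist and match (assumed rule)."""
    # HTTP header names are case-insensitive, so compare on lowercase keys.
    pre = {name.lower(): value for name, value in pre_headers.items()}
    post = {name.lower(): value for name, value in post_headers.items()}
    for name in ("etag", "last-modified"):
        if name not in pre or name not in post or pre[name] != post[name]:
            return False
    return True
```

The sanity check then amounts to counting, among pairs where `http_same` holds, how many also satisfy `is_representative`.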
26. Memento to Assess Content Drift in Scholarly Communication
Step 2: Select Representative Mementos
• ~ 313k referenced URIs have representative Mementos
27. Memento to Assess Content Drift in Scholarly Communication
Representative Mementos in arXiv
28. Memento to Assess Content Drift in Scholarly Communication
[Figure panels: arXiv, Elsevier, PMC]
29. Memento to Assess Content Drift in Scholarly Communication
Step 3: Dereference Live Web Version of URI
• 241k out of 313k URIs have a live web version
30. Memento to Assess Content Drift in Scholarly Communication
Step 4: Representative Memento vs. Live Version
• Apply content similarity measures
• Bin results into 6 clusters
32. Memento to Assess Content Drift in Scholarly Communication
Aggregate Similarity Score
Good: 23.7% of URIs have *not* drifted!
Bad: 3/4 URIs *have* drifted!
33. Memento to Assess Content Drift in Scholarly Communication
Content Drift & Link Rot Over Time - arXiv
34. Memento to Assess Content Drift in Scholarly Communication
[Figure panels: arXiv, Elsevier, PMC]
35. Memento to Assess Content Drift in Scholarly Communication
Take-Aways
1. Scholarly articles increasingly contain URI references to web at large resources.
2. Such resources are subject to reference rot (link rot + content drift).
3. Custodians of these resources are typically not overly concerned with archiving of their content and longevity of the scholarly record.
4. Spoiler: Authors, publishers, web archives, and other parties can help tackle this problem (see my lightning talk + poster on Robust Links).
Editor's Notes
IceCube Neutrino Observatory at the University of Wisconsin
http://icecube.wisc.edu
Institute for Astronomy at the University of Hawaii
http://www.ifa.hawaii.edu/~cowie/k_table.html
Previously, archival status (14-day window) as proxy