@hvdsomp
Thor Conference, Rome, Italy, November 15 2017
Herbert Van de Sompel
Los Alamos National Laboratory
@hvdsomp
Achieving Link Integrity for Managed Collections
Photo by Eric Sieverts
Hyperlinks in Theory
Hyperlinks in Reality
Hyperlinks in Reality
Link Rot
Link Rot
Hyperlinks in Reality
Content Drift
Content Drift
Content Drift
http://dl00.org in 2000, 2004, 2005, 2008
Content Drift
http://icecube.wisc.edu/ on May 8 2009 (left) and August 27 2009 (right)
No Content Drift
http://www.ifa.hawaii.edu/~cowie/k_table.html on June 9 1997 (left) and March 2016 (right)
The Web, All Hyperlinks Subject to Link Rot, Content Drift
The Web, All Hyperlinks Subject to Reference Rot
• Reference Rot hinders our ability to follow links as they were
intended when they were put in place:
• Link rot: A link stops working altogether
• Content drift: The linked content changes over time and may
eventually no longer be representative of the content that was
originally linked
Creating Pockets of Persistence
• How to maintain the integrity of links?
• This challenge exists for the entire web. Some communities with well
managed collections care about addressing it because they consider
it a Quality of Service issue:
• Scholarly communication
• Cultural heritage
• Legal publications
• Government communication
• Journalism
• Wikipedia
• …
• What can these communities do to create Pockets of Persistence?
A Managed Collection Desires Reliable Outlinks
Links to another Managed Collection
Links to Web at Large Resources
Exploring Link Rot & Content Drift
<Intermezzo - Hiberlink Study re Reference Rot in STM Articles>
PubMed Central Corpus
PMC articles published 1997-2012        PMC
Total                                   479,194
With links to articles                  240,857
With links to web-at-large resources    156,160

Links                                   PMC
To articles                             744,678
To web-at-large resources               480,853
Links to Articles & to Web At Large Resources - PMC
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
</Intermezzo - Hiberlink Study re Reference Rot in STM Articles>
Exploring Link Rot & Content Drift
Link Rot Occurs when B Moves to C
Introduce PID(B)
Link to PID(B) ; HTTP Redirect from PID(B) to B
When B moves to C: HTTP Redirect from PID(B) to C
Core Assumption: PID(B) Will Be Used for Linking
Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used
to Be Persistent. In: WWW2016. http://arxiv.org/abs/1602.09102
A Disconcerting Observation
• When classifying links extracted from PMC as linking to articles, we
assumed that filtering on http://dx.doi.org/* would do the trick
• But we also found many publisher URIs, e.g. http://link.springer.com/article/*
• For example:
• http://link.springer.com/article/10.1007%2Fs00799-014-0108-0
• Instead of:
• http://dx.doi.org/10.1007/s00799-014-0108-0
• We used CrossRef’s Reverse Domain Lookup to classify these
extracted links as linking to articles
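To illustrate the classification problem, here is a minimal sketch that tells a DOI-resolver link from a publisher link that merely embeds a DOI. It is not the CrossRef Reverse Domain Lookup used in the study; the function names, the host list, and the DOI pattern are assumptions made for illustration.

```python
import re
from urllib.parse import urlparse, unquote

# Hosts treated as DOI resolvers (assumption: just these two).
DOI_RESOLVER_HOSTS = {"dx.doi.org", "doi.org"}
# Loose DOI pattern; real DOIs allow more variation than this.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def extract_doi(url):
    """Return a DOI found in the percent-decoded URL path, or None."""
    path = unquote(urlparse(url).path)
    match = DOI_PATTERN.search(path)
    return match.group(0) if match else None


def classify_link(url):
    """Classify a link as a PID link, a landing page that embeds a DOI,
    or a web-at-large link (no DOI recognisable in the path)."""
    host = urlparse(url).netloc.lower()
    doi = extract_doi(url)
    if doi and host in DOI_RESOLVER_HOSTS:
        return "pid", doi
    if doi:
        return "landing-page-with-doi", doi
    return "web-at-large", None


springer = "http://link.springer.com/article/10.1007%2Fs00799-014-0108-0"
print(classify_link(springer))
print(classify_link("http://dx.doi.org/10.1007/s00799-014-0108-0"))
```

A path-based heuristic like this only works when the publisher URL happens to contain the DOI; many landing-page URLs do not, which is why a domain-based lookup was needed.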
URI References - PMC
Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used to Be Persistent.
In: WWW2016. http://arxiv.org/abs/1602.09102
Cartoon by Patrick Hochstenbach
http://signposting.org
<Intermezzo – Signposting the Scholarly Web>
Signposting the Scholarly Web
• Proposal:
Use typed links to address some long-standing problems regarding
scholarly resources on the web, by interlinking them using
appropriate relation types
• Focus on a limited set of patterns to support uniformly:
• Conveying a Persistent Identifier
• Expressing the web boundary of a scholarly resource
• Making bibliographic metadata discoverable
• Conveying an Author Identifier
• Conveying a license that applies to a resource
• Conveying a resource type
HTTP Links
Mark Nottingham (2017) RFC8288: Web Linking
http://tools.ietf.org/rfc/rfc8288.txt
HTTP Links
HTTP Links
HTTP Links Are Used
curl -I http://dbpedia.org/data/Reykjavik
HTTP/1.1 200 OK
Date: Thu, 27 Oct 2016 04:43:28 GMT
Content-Type: application/rdf+xml; charset=UTF-8
Content-Length: 1210
Link: <http://creativecommons.org/licenses/by-sa/3.0>; rel="license",
  <http://dbpedia.org/data/Reykjavik>; rel="alternate"; type="text/n3",
  <http://dbpedia.org/resource/Reykjavik>; rel="describes",
  <http://mementoarchive.lanl.gov/dbpedia/timegate/http://dbpedia.org/data/Reykjavik>; rel="timegate"
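A client can read such a Link header mechanically. The sketch below is a simplified parser, not a full RFC 8288 implementation (it ignores edge cases such as commas or angle brackets inside quoted parameter values); the function names are made up for illustration, and the header value is taken from the response above.

```python
import re

# One link-value: <target> followed by zero or more ;name=value parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^;,\s]+))*)')
PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def parse_link_header(value):
    """Return a list of (target_uri, {param: value}) tuples."""
    links = []
    for target, params in LINK_RE.findall(value):
        attrs = {}
        for name, quoted, bare in PARAM_RE.findall(params):
            attrs[name.lower()] = quoted if quoted else bare
        links.append((target, attrs))
    return links


def targets_for_rel(links, rel):
    """Return the targets whose rel attribute includes the given relation type."""
    return [t for t, attrs in links if rel in attrs.get("rel", "").split()]


header = (
    '<http://creativecommons.org/licenses/by-sa/3.0>; rel="license", '
    '<http://dbpedia.org/data/Reykjavik>; rel="alternate"; type="text/n3", '
    '<http://dbpedia.org/resource/Reykjavik>; rel="describes", '
    '<http://mementoarchive.lanl.gov/dbpedia/timegate/'
    'http://dbpedia.org/data/Reykjavik>; rel="timegate"'
)
links = parse_link_header(header)
print(targets_for_rel(links, "license"))
print(targets_for_rel(links, "timegate"))
```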
For PIDs: Use cite-as Relation Type
Van de Sompel, H., Nelson, M., Bilder, G., Kunze, J., and Warner, S. (2017) “cite-as”: A Link Relation
to Convey a Preferred URI for Referencing. https://datatracker.ietf.org/doc/draft-vandesompel-citeas/
For PIDs: Use cite-as Relation Type
• The target URI (PID) of the cite-as link can be picked up by
applications, e.g.:
• reference managers can pick up the PID of an object when the
user saves it while on the landing page or on one of the constituent
resources
• publication pipelines can pick up the PID by looking up (HTTP
HEAD) URIs referenced in a paper to determine whether a PID
exists for them
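The publication-pipeline case could look roughly like the sketch below: issue an HTTP HEAD for a referenced URI, look for a cite-as link in the Link header, and prefer its target when present. The function names are hypothetical, the header parsing is deliberately simplified, and the example DOI and URLs are made up for the demonstration.

```python
import re
import urllib.request

ENTRY_RE = re.compile(r'<([^>]+)>([^<]*)')            # <target> plus its parameters
REL_RE = re.compile(r'\brel\s*=\s*"?([^";,]+)"?', re.I)


def cite_as_target(link_header):
    """Return the target of the first link with rel cite-as, or None."""
    for target, params in ENTRY_RE.findall(link_header):
        match = REL_RE.search(params)
        if match and "cite-as" in match.group(1).lower().split():
            return target
    return None


def fetch_link_header(uri, timeout=10):
    """HTTP HEAD the URI and return its Link header(s) as one string."""
    request = urllib.request.Request(uri, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return ", ".join(response.headers.get_all("Link") or [])


def preferred_uri(uri, fetch=fetch_link_header):
    """Return the PID advertised via cite-as, falling back to the URI itself."""
    try:
        return cite_as_target(fetch(uri)) or uri
    except OSError:  # network failures: keep the URI as referenced
        return uri


# Offline demonstration with a stand-in for the HEAD request.
def fake_fetch(uri):
    return ('<https://doi.org/10.5555/12345678>; rel="cite-as", '
            '<https://example.org/meta.json>; rel="describedby"')


print(preferred_uri("https://example.org/article/1", fetch=fake_fetch))
```

Passing the fetch function as a parameter keeps the lookup testable without network access; a pipeline would simply call `preferred_uri(uri)`.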
Cartoon by Patrick Hochstenbach
http://signposting.org
</Intermezzo – Signposting the Scholarly Web>
PID Alternative - When B Moves to C: HTTP Redirect from B to C
PID Alternative - When B Moves to C: HTTP Redirect from B to C
• Custodian of C needs to hold on to domain of B
• Custodian of C needs to establish redirection patterns; often those
are rather simple rules
• No problem of having to get PID(B) used for linking; the URI in the browser
address bar (initially B, later C) is just fine
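The "rather simple rules" can be pictured as prefix rewrites from the old location to the new one. The sketch below computes the Location value a server on the old domain might send with a permanent redirect; the domains, paths, and function name are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# (old host, old path prefix, new host, new path prefix); all values are made up.
RULES = [
    ("old.example.org", "/papers/", "new.example.org", "/articles/"),
]


def redirect_location(url, rules=RULES):
    """Return the new URL for a moved resource, or None if no rule applies."""
    parts = urlsplit(url)
    for old_host, old_prefix, new_host, new_prefix in rules:
        if parts.netloc == old_host and parts.path.startswith(old_prefix):
            new_path = new_prefix + parts.path[len(old_prefix):]
            return urlunsplit(
                (parts.scheme, new_host, new_path, parts.query, parts.fragment)
            )
    return None


print(redirect_location("http://old.example.org/papers/42?x=1"))
```

The rules only help for as long as the custodian keeps control of the old domain, which is the first requirement listed above.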
Exploring Link Rot & Content Drift
Content Drift Occurs when B Changes over Time
Content Drift Occurs when B Changes over Time
• Is not really considered an issue because:
• the objects that receive PIDs were typically static, e.g. scientific
papers
• when a (substantially) new version of an object is published,
typically a new PID is assigned
• But:
• how to verify that the retrieved version of an object is indeed the
referenced version of the object?
• Requires:
• archiving objects in trusted archive(s)
• ability to retrieve objects from the archive(s)
Archived Articles
David Rosenthal (2013) Patio Perspectives at ANADP II: Preserving the Other Half
http://blog.dshr.org/2013/11/patio-perspectives-at-anadp-ii.html
Too few
Too low risk
How to Audit Whether a PID-identified Object is Archived
http://thekeepers.org
Journal, Volume, Issue centric
Global audit by DOI?
Contrast: All Web-Archived Versions of David’s Blog Post
Global audit by HTTP URI
Uses Memento infrastructure
http://timetravel.mementoweb.org
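Auditing by HTTP URI works because the Memento protocol lets a client ask a TimeGate for the archived version closest to a given datetime, using the Accept-Datetime request header. The sketch below only builds such a request; the function name is hypothetical, and the TimeGate base URL is the one that appears in the DBpedia Link header shown earlier in the deck, used here purely as an example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import urllib.request


def timegate_request(timegate_base, original_uri, when):
    """Build (but do not send) a TimeGate request for original_uri.

    `when` must be a timezone-aware datetime; it is sent as an
    HTTP-date in GMT in the Accept-Datetime header.
    """
    accept_datetime = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return urllib.request.Request(
        timegate_base + original_uri,
        headers={"Accept-Datetime": accept_datetime},
    )


request = timegate_request(
    "http://mementoarchive.lanl.gov/dbpedia/timegate/",
    "http://dbpedia.org/data/Reykjavik",
    datetime(2016, 10, 27, 4, 43, 28, tzinfo=timezone.utc),
)
print(request.full_url)
# urllib stores header names capitalised, hence "Accept-datetime" here.
print(request.get_header("Accept-datetime"))
# Sending it with urllib.request.urlopen(request) would follow the
# TimeGate's redirect to a Memento, if the archive is reachable.
```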
Exploring Link Rot & Content Drift
Scholarly Context Adrift
Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0167475
How to Assess Content Drift?
Step 1: Find Pre/Post Mementos
Step 2: Select Representative Mementos
Text Similarity Measures
• Compute aggregate text similarity scores (values between 0...100)
for:
• Simhash
• Jaccard
• Sørensen-Dice
• Cosine
• If the aggregate score is 100, we decide that the Pre/Post
Mementos are representative
• We find 137K URI references out of 480K that have representative
Mementos
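Three of the four measures are easy to sketch on word tokens, scaled to the 0...100 range used above. This is an illustration only: the tokenisation is an assumption, Simhash is left out for brevity, and how the study combined the individual scores into an aggregate is not shown here.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lower-cased word tokens (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())


def jaccard(a, b):
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * len(sa & sb) / len(sa | sb)


def sorensen_dice(a, b):
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * 2 * len(sa & sb) / (len(sa) + len(sb))


def cosine(a, b):
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    if not ca and not cb:
        return 100.0
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    return 100.0 * dot / norm if norm else 0.0


pre = "link rot and content drift"
post = "link rot and content drift"
drifted = "link rot and reference rot"
for measure in (jaccard, sorensen_dice, cosine):
    print(measure.__name__, measure(pre, post), round(measure(pre, drifted), 1))
```

Requiring a score of 100 between the Pre and Post Mementos is a strict criterion: any textual difference between the two snapshots disqualifies them as representative.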
Step 3: Dereference Live Web Version of URI
Step 4: Representative Memento vs. Live Version
Content Drift - PMC
Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0167475
Reference Rot for Links to Web at Large is Severe
• Link Rot and Content Drift are severe
• Cannot retrieve originally linked content from the live web
• Can potentially retrieve originally linked content from web archives
• But the archival coverage is too poor, a result of incidental
archiving
URI References without Representative Mementos - PMC
Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0167475
Impact of Archival Gap on Links from Managed Collections
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
Links from Managed Collections to Domains
Grey: Linked Content not Archived
Uncertainty Regarding the Future of B when A Links to It
Custodian of A Takes a Snapshot of B when Linking to It
Taking a Snapshot of B: Automation is Key
• Web archive APIs for on-demand archiving
• perma.cc, Internet Archive, archive.is, webcitation
• Amber for WordPress & Drupal archives resources linked in a page
• http://amberlink.org/
• Hiberlink’s experimental Zotero extension archives bookmarked
URLs
• http://hiberlink.org/zotero.html
• Hiberlink’s experimental HiberActive archives all URLs referenced in
a newly submitted paper
• https://www.slideshare.net/martinklein0815/hiberactive
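An automated pipeline along these lines could be sketched as follows: collect the URLs referenced in a document and hand each one to an archiving function. Everything here is an assumption made for illustration, in particular the request pattern (the target URL appended to https://web.archive.org/save/) and what the archive returns; each service has its own API and terms, so the archiving function is passed in as a parameter.

```python
import urllib.request

# Assumed on-demand capture pattern; check the archive's own
# documentation before relying on it.
SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_request_url(url):
    """URL to request in order to ask the archive for a capture of `url`."""
    return SAVE_ENDPOINT + url


def archive_via_http(url, timeout=30):
    """Request a capture and return the final response URL, which may or
    may not be the snapshot URL depending on the service."""
    with urllib.request.urlopen(save_request_url(url), timeout=timeout) as response:
        return response.geturl()


def snapshot_all(urls, archive=archive_via_http):
    """Archive each distinct URL once; map URL -> result, None on failure."""
    results = {}
    for url in dict.fromkeys(urls):  # de-duplicate, keep order
        try:
            results[url] = archive(url)
        except OSError:  # urllib's network errors derive from OSError
            results[url] = None
    return results


# Offline demonstration with a stand-in archiving function.
referenced = ["http://a.example/", "http://a.example/", "http://b.example/"]
print(snapshot_all(referenced, archive=lambda u: "snapshot-of:" + u))
```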
site2cite
http://site2cite
Custodian of A Links to Snapshot of B
• Typical practice for linking to snapshots:
<a href="URL of snapshot of B">
• Problems with this practice:
o Impossible to visit the original URI, if desired
o Requires the permanent existence/uptime of the archive that
holds the snapshot
- One link rot problem replaced by another
http://robustlinks.mementoweb.org/about/
Permanent Existence/Uptime of Archives?
Capture of http://webcitation.org dated July 17 2013
https://archive.today/eAETp
Permanent Existence/Uptime of Archives?
Remnant of discontinued web archive http://mummify.it captured on February 14 2014
https://web.archive.org/web/20140214233752/https://www.mummify.it/
Permanent Existence/Uptime of Archives?
http://www.themoscowtimes.com/news/article/russia-bans-wayback-machine-internet-archive-over-
islamic-state-video/510074.html
Permanent Existence/Uptime of Archives?
http://web.archive.org/web/20121101043952/http://vogin.nl on March 6 2017 at 15:59 CET
Custodian of A Links to Snapshot of B, Decorates the Link
• Desired practice for linking to captures is to decorate the link so it
provides a variety of options:
<a href="URL of snapshot of B"
data-originalurl="B"
data-versiondate="datetime of snapshot of B">
• Supports:
o Revisiting the original URL
o Finding snapshots in any web archive (via original URL)
o Finding a temporally appropriate snapshot in any web archive
(via original URL & snapshot datetime)
o Automatically accessing a temporally appropriate snapshot in
any web archive (Memento protocol using original URL &
snapshot datetime)
http://robustlinks.mementoweb.org/spec/
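Generating the decorated anchor is straightforward to automate. The sketch below uses the two attributes named above; the function name and the YYYY-MM-DD date format are assumptions (consult the Robust Links specification for the accepted datetime formats), and the snapshot URL is the one shown on an earlier slide.

```python
from html import escape


def robust_link(snapshot_url, original_url, version_date, text):
    """Return an HTML anchor to a snapshot, decorated with the original
    URL and the snapshot datetime."""
    return '<a href="{}" data-originalurl="{}" data-versiondate="{}">{}</a>'.format(
        escape(snapshot_url, quote=True),
        escape(original_url, quote=True),
        escape(version_date, quote=True),
        escape(text),
    )


anchor = robust_link(
    "https://web.archive.org/web/20121101043952/http://vogin.nl",
    "http://vogin.nl",
    "2012-11-01",
    "VOGIN",
)
print(anchor)
```

Because the original URL and datetime travel with the link, a reader or script can still look for another snapshot if the archive in the href disappears.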
Robust Links: Link Decoration in Action
See Robust Links at work in: Van de Sompel H. & Nelson, M.L. (2015)
Reminiscing about 15 years of interoperability efforts. D-Lib Magazine.
https://doi.org/10.1045/november2015-vandesompel
JavaScript makes the link decorations actionable
Robust Links JavaScript
https://github.com/mementoweb/robustlinks
Recap - A Managed Collection Desires Reliable Outlinks
Takeaways
• When it comes to links to
managed collections, the
custodian of the linking collection
relies on the custodians of the
linked collections to preserve link
integrity.
• PIDs and HTTP redirects are
managed by the custodians of the
linked collections.
Takeaways
• When it comes to links to web at
large resources, the custodian of a
linking collection cannot rely on the
custodians of those linked
resources to maintain link integrity.
• Creation of Mementos and Robust
Links is managed by the custodian
of the collection that links to web at
large resources.


Editor's Notes

  • #65 Previously, archival status (14-day window) as proxy