Achieving Link Integrity for Managed Collections

Looks at hyperlinks from the perspective of a managed collection of resources for which link persistence/integrity is considered a quality of service concern. Distinguishes between links into other managed collections and to the web at large. Considers link rot and content drift.

Published in: Internet

  1. @hvdsomp Thor Conference, Rome, Italy, November 15 2017. Herbert Van de Sompel, Los Alamos National Laboratory. Achieving Link Integrity for Managed Collections. Photo by Eric Sieverts
  2. Hyperlinks in Theory
  3. Hyperlinks in Reality
  4. Hyperlinks in Reality
  5. Link Rot
  6. Link Rot
  7. Hyperlinks in Reality
  8. Content Drift
  9. Content Drift
  10. Content Drift: http://dl00.org in 2000, 2004, 2005, 2008
  11. Content Drift: http://icecube.wisc.edu/ on May 8 2009 (left) and August 27 2009 (right)
  12. No Content Drift: http://www.ifa.hawaii.edu/~cowie/k_table.html on June 9 1997 (left) and March 2016 (right)
  13. The Web, All Hyperlinks Subject to Link Rot, Content Drift
  14. The Web, All Hyperlinks Subject to Reference Rot • Reference rot hinders our ability to follow links as they were intended when they were put in place: • Link rot: a link stops working altogether • Content drift: the linked content changes over time and may eventually no longer be representative of the content that was originally linked
  15. Creating Pockets of Persistence • How to maintain the integrity of links? • This challenge exists for the entire web. Some communities with well-managed collections care about addressing it because they consider it a Quality of Service issue: • Scholarly communication • Cultural heritage • Legal publications • Government communication • Journalism • Wikipedia • … • What can these communities do to create Pockets of Persistence?
  16. A Managed Collection Desires Reliable Outlinks
  17. Links to another Managed Collection
  18. Links to Web at Large Resources
  19. Exploring Link Rot & Content Drift
  20. <Intermezzo - Hiberlink Study re Reference Rot in STM Articles>
  21. PubMed Central Corpus: PMC articles published 1997-2012 • Articles (A): total 479,194; with links to articles 240,857; with links to web-at-large resources 156,160 • Links (A to B): to articles 744,678; to web-at-large resources 480,853
  22. Links to Articles & to Web At Large Resources - PMC. Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE https://doi.org/10.1371/journal.pone.0115253
  23. </Intermezzo - Hiberlink Study re Reference Rot in STM Articles>
  24. Exploring Link Rot & Content Drift
  25. Link Rot Occurs when B Moves to C
  26. Introduce PID(B)
  27. Link to PID(B); HTTP Redirect from PID(B) to B
  28. When B moves to C: HTTP Redirect from PID(B) to C
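
The redirect mechanism on slides 26 to 28 can be sketched as a lookup table kept by a resolver. This is a minimal illustration, not any particular resolver's implementation: the `PidResolver` class, the `pid:B` identifier, the example.org URLs, and the choice of a 302 status are all assumptions made for the sketch.

```python
class PidResolver:
    """Toy resolver: maps a persistent identifier to its current location."""

    def __init__(self):
        self._locations = {}

    def register(self, pid, location):
        """Mint a PID that initially redirects to `location`."""
        self._locations[pid] = location

    def move(self, pid, new_location):
        """Update the redirect target when the resource moves."""
        if pid not in self._locations:
            raise KeyError(f"unknown PID: {pid}")
        self._locations[pid] = new_location

    def resolve(self, pid):
        """Return (status, Location) the way a redirecting server would."""
        return 302, self._locations[pid]


resolver = PidResolver()
resolver.register("pid:B", "http://b.example.org/paper")      # hypothetical B
before = resolver.resolve("pid:B")
resolver.move("pid:B", "http://c.example.org/paper")          # B moves to C
after = resolver.resolve("pid:B")
```

Links that target `pid:B` keep working after the move because only the resolver's table changes; links that target B directly do not benefit.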
  29. Core Assumption: PID(B) Will Be Used for Linking
  30. Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used to Be Persistent. In: WWW2016. http://arxiv.org/abs/1602.09102
  31. A Disconcerting Observation • When classifying links extracted from PMC as linking to articles, we assumed that filtering on http://dx.doi.org/* would do the trick • But we found a lot of publisher URIs, e.g. http://link.springer.com/article/* • For example: • http://link.springer.com/article/10.1007%2Fs00799-014-0108-0 • Instead of: • http://dx.doi.org/10.1007/s00799-014-0108-0 • We used CrossRef’s Reverse Domain Lookup to classify these extracted links as linking to articles
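
For publisher URIs that happen to embed the DOI in their path, as in the Springer example above, the DOI-based URI can be recovered by percent-decoding the path. The sketch below illustrates only that special case; it is not the CrossRef Reverse Domain Lookup the study used, and the `to_doi_uri` helper and the DOI pattern are assumptions.

```python
import re
from urllib.parse import unquote, urlparse

# Loose pattern for a DOI: "10.", a registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def to_doi_uri(url):
    """Return the dx.doi.org URI for a DOI embedded in `url`, or None."""
    path = unquote(urlparse(url).path)   # "%2F" becomes "/"
    match = DOI_PATTERN.search(path)
    if match is None:
        return None
    return "http://dx.doi.org/" + match.group(0)


springer = "http://link.springer.com/article/10.1007%2Fs00799-014-0108-0"
doi_uri = to_doi_uri(springer)
```

Many landing-page URIs carry no DOI at all, which is why a lookup service is needed in the general case.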
  32. URI References - PMC. Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used to Be Persistent. In: WWW2016. http://arxiv.org/abs/1602.09102
  33. <Intermezzo - Signposting the Scholarly Web> Cartoon by Patrick Hochstenbach http://signposting.org
  34. Signposting the Scholarly Web • Proposal: Use typed links to address some long-standing problems regarding scholarly resources on the web, by interlinking them using appropriate relation types • Focus on a limited set of patterns to support uniformly: • Conveying a Persistent Identifier • Expressing the web boundary of a scholarly resource • Making bibliographic metadata discoverable • Conveying an Author Identifier • Conveying a license that applies to a resource • Conveying a resource type
  35. HTTP Links. Mark Nottingham (2017) RFC8288: Web Linking http://tools.ietf.org/rfc/rfc8288.txt
  36. HTTP Links
  37. HTTP Links
  38. HTTP Links Are Used
      curl -I http://dbpedia.org/data/Reykjavik
      HTTP/1.1 200 OK
      Date: Thu, 27 Oct 2016 04:43:28 GMT
      Content-Type: application/rdf+xml; charset=UTF-8
      Content-Length: 1210
      Link: <http://creativecommons.org/licenses/by-sa/3.0> ; rel="license",
        <http://dbpedia.org/data/Reykjavik> ; rel="alternate"; type="text/n3",
        <http://dbpedia.org/resource/Reykjavik>; rel="describes",
        <http://mementoarchive.lanl.gov/dbpedia/timegate/http://dbpedia.org/data/Reykjavik> ; rel="timegate"
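
A client can turn a Link header like the DBpedia one into (target, parameters) pairs. The following is a simplified sketch rather than a complete RFC 8288 parser: `parse_link_header` and its two regular expressions are assumptions, and edge cases such as escaped quotes inside parameter values are not handled.

```python
import re

# One link-value: <target> followed by zero or more ;name=value parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^\s;,"]+))*)')
PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^\s;,"]+))')


def parse_link_header(value):
    """Return a list of (target URI, {parameter: value}) tuples."""
    links = []
    for target, raw_params in LINK_RE.findall(value):
        params = {}
        for name, quoted, bare in PARAM_RE.findall(raw_params):
            params[name.lower()] = quoted if quoted else bare
        links.append((target, params))
    return links


example = (
    '<http://creativecommons.org/licenses/by-sa/3.0> ; rel="license", '
    '<http://dbpedia.org/data/Reykjavik> ; rel="alternate"; type="text/n3", '
    '<http://dbpedia.org/resource/Reykjavik>; rel="describes", '
    '<http://mementoarchive.lanl.gov/dbpedia/timegate/'
    'http://dbpedia.org/data/Reykjavik> ; rel="timegate"'
)
links = parse_link_header(example)
by_rel = {params["rel"]: target for target, params in links}
```

Indexing by relation type, as `by_rel` does, lets an application ask directly for the license or the TimeGate of the resource.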
  39. For PIDs: Use cite-as Relation Type. Van de Sompel, H., Nelson, M., Bilder, G., Kunze, J., and Warner, S. (2017) "cite-as": A Link Relation to Convey a Preferred URI for Referencing https://datatracker.ietf.org/doc/draft-vandesompel-citeas/
  40. For PIDs: Use cite-as Relation Type. Van de Sompel, H., Nelson, M., Bilder, G., Kunze, J., and Warner, S. (2017) "cite-as": A Link Relation to Convey a Preferred URI for Referencing https://datatracker.ietf.org/doc/draft-vandesompel-citeas/
  41. For PIDs: Use cite-as Relation Type • The target URI (PID) of the cite-as link can be picked up by applications, e.g.: • reference managers can pick up the PID of an object when the user saves it while on the landing page or on one of the constituent resources • publication pipelines can pick up the PID by looking up (HTTP HEAD) URIs referenced in a paper to determine whether a PID exists for them
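
The publication-pipeline case can be sketched as an HTTP HEAD request followed by a search for a link whose relation type is cite-as. This is an illustrative sketch under assumptions: `find_cite_as` and `head_cite_as` are hypothetical helpers, the parsing is deliberately simplified (it would misread a parameter value containing "<"), and the DOI in the usage lines is a placeholder.

```python
import re
import urllib.request

LINK_VALUE_RE = re.compile(r'<([^>]*)>([^<]*)')
REL_RE = re.compile(r';\s*rel\s*=\s*"?([^";,]*)"?', re.IGNORECASE)


def find_cite_as(link_header):
    """Return the target of the first cite-as link in a Link header, or None."""
    for target, params in LINK_VALUE_RE.findall(link_header):
        rel = REL_RE.search(params)
        # rel may hold several space-separated relation types.
        if rel and "cite-as" in rel.group(1).lower().split():
            return target
    return None


def head_cite_as(url, timeout=10):
    """Issue an HTTP HEAD request and look for a cite-as link (network access)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        values = response.headers.get_all("Link") or []
    return find_cite_as(", ".join(values))


header = ('<https://doi.org/10.1234/example>; rel="cite-as", '
          '<https://example.org/meta.json>; rel="describedby"')
pid = find_cite_as(header)
```

A pipeline would call `head_cite_as` for each URI referenced in a paper and, when a PID comes back, use it in place of the URI the author supplied.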
  42. </Intermezzo - Signposting the Scholarly Web> Cartoon by Patrick Hochstenbach http://signposting.org
  43. PID Alternative - When B Moves to C: HTTP Redirect from B to C
  44. PID Alternative - When B Moves to C: HTTP Redirect from B to C • Custodian of C needs to hold on to domain of B • Custodian of C needs to establish redirection patterns; often those are rather simple rules • The problem of getting links established to PID(B) does not arise; the URI in the browser address bar (initially B, later C) is just fine
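
The "rather simple rules" for redirecting from B to C can often be expressed as a pattern rewrite. The sketch below assumes a hypothetical migration from old.example.org to new.example.org; the rule, the host names, and the `redirect_for` helper are invented for illustration, and a 301 status is an assumption about how a permanent move would be signalled.

```python
import re

# Hypothetical rule: /articles/<id>.html on the old host moved to
# /content/<id> on the new host.
REDIRECT_RULES = [
    (re.compile(r"^http://old\.example\.org/articles/(\d+)\.html$"),
     r"http://new.example.org/content/\1"),
]


def redirect_for(url):
    """Return (status, Location) for a request to the old domain."""
    for pattern, replacement in REDIRECT_RULES:
        if pattern.match(url):
            return 301, pattern.sub(replacement, url)
    return 404, None   # no rule covers this URL: the link has rotted


moved = redirect_for("http://old.example.org/articles/42.html")
missing = redirect_for("http://old.example.org/staff.html")
```

The rules only help for as long as the custodian keeps control of the old domain, which is the first condition on the slide.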
  45. Exploring Link Rot & Content Drift
  46. Content Drift Occurs when B Changes over Time
  47. Content Drift Occurs when B Changes over Time • Is not really considered an issue because: • the objects that receive PIDs were typically static, e.g. scientific papers • when a (substantially) new version of an object is published, typically a new PID is assigned • But: • how to verify that the retrieved version of an object is indeed the referenced version of the object? • Requires: • archiving objects in trusted archive(s) • ability to retrieve objects from the archive(s)
  48. Archived Articles: too few; too low risk. David Rosenthal (2013) Patio Perspectives at ANADP II: Preserving the Other Half http://blog.dshr.org/2013/11/patio-perspectives-at-anadp-ii.html
  49. How to Audit Whether a PID-identified Object is Archived • http://thekeepers.org • Journal, Volume, Issue centric • Global audit by DOI?
  50. Contrast: All Web-Archived Versions of David’s Blog Post • Global audit by HTTP URI • Uses Memento infrastructure • http://timetravel.mementoweb.org
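
In the Memento protocol (RFC 7089), a client asks a TimeGate for the version of a URI that was current at a given moment by sending an Accept-Datetime request header. The sketch below only builds such a request and sends nothing; the `timegate_request` helper is hypothetical, and the TimeGate base URL under timetravel.mementoweb.org is an assumption that should be checked against the service's documentation.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumed TimeGate base URL; the original URI is appended to it.
TIMEGATE = "http://timetravel.mementoweb.org/timegate/"


def accept_datetime(moment):
    """Format a timezone-aware datetime as an HTTP-date in GMT."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


def timegate_request(original_uri, moment, timegate=TIMEGATE):
    """Return (request URL, headers) for a datetime-negotiated lookup."""
    return timegate + original_uri, {"Accept-Datetime": accept_datetime(moment)}


url, headers = timegate_request(
    "http://blog.dshr.org/2013/11/patio-perspectives-at-anadp-ii.html",
    datetime(2013, 11, 20, 12, 0, 0, tzinfo=timezone.utc),
)
```

Because the lookup key is the original HTTP URI, the same request works for any resource on the web, which is what makes a global audit by URI possible.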
  51. Exploring Link Rot & Content Drift
  52. Scholarly Context Adrift. Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift. In: PLOS ONE https://doi.org/10.1371/journal.pone.0167475
  53. How to Assess Content Drift?
  54. Step 1: Find Pre/Post Mementos
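
Step 1 can be sketched as picking, from the capture dates of a URI, the closest capture before and the closest capture after the date on which the URI was referenced. This is an assumption-laden illustration: the `pre_post_mementos` helper is hypothetical, and treating a capture made on the reference date itself as the Pre Memento is a choice made for the sketch, not something the slide states.

```python
from bisect import bisect_right
from datetime import date


def pre_post_mementos(capture_dates, reference_date):
    """Return (Pre, Post): nearest captures around `reference_date`.

    Either element is None when no capture exists on that side.
    """
    ordered = sorted(capture_dates)
    idx = bisect_right(ordered, reference_date)
    pre = ordered[idx - 1] if idx > 0 else None
    post = ordered[idx] if idx < len(ordered) else None
    return pre, post


captures = [date(2009, 8, 27), date(2009, 5, 8), date(2010, 1, 15)]
pair = pre_post_mementos(captures, date(2009, 7, 1))
```

A URI reference for which either side is missing cannot be bracketed in this way.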
  55. Step 2: Select Representative Mementos
  56. Text Similarity Measures • Compare the text of the Pre and Post Mementos and compute an aggregate similarity score (a value from 0 to 100) based on four measures: • Simhash • Jaccard • Sørensen-Dice • Cosine • If the aggregate score is 100, we decide that the Pre/Post Mementos are representative • We find 137K URI references out of 480K that have representative Mementos
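
Three of the four measures are easy to sketch on whitespace-separated tokens, scaled to the 0 to 100 range used on the slide. Simhash is omitted, the tokenization is an assumption, and because the slide does not say how the scores are aggregated, the hypothetical `all_identical` helper simply checks that every measure reports 100.

```python
import math
from collections import Counter


def tokens(text):
    return text.lower().split()


def jaccard(a, b):
    """Shared distinct tokens over all distinct tokens."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * len(sa & sb) / len(sa | sb)


def sorensen_dice(a, b):
    """Twice the shared distinct tokens over the sum of both set sizes."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * 2 * len(sa & sb) / (len(sa) + len(sb))


def cosine(a, b):
    """Cosine of the angle between the two token-count vectors."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_sq = sum(v * v for v in ca.values()) * sum(v * v for v in cb.values())
    if norm_sq == 0:
        return 100.0 if not ca and not cb else 0.0
    return 100.0 * dot / math.sqrt(norm_sq)


def all_identical(a, b):
    """True when every measure scores the two texts at 100."""
    return all(m(a, b) == 100.0 for m in (jaccard, sorensen_dice, cosine))
```

Token-based measures ignore markup and, for the two set-based ones, word order and repetition, so a score of 100 here means the texts use the same vocabulary rather than being byte-identical.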
  57. Step 3: Dereference Live Web Version of URI
  58. Step 4: Representative Memento vs. Live Version
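
Steps 3 and 4 combine into a per-reference verdict that mirrors the definitions on slide 14: the link either no longer dereferences (link rot), dereferences to changed content (content drift), or still yields the referenced content. The sketch is a simplification; the `classify_reference` helper, the treatment of any status of 400 or above as rot, and the strict threshold of 100 are assumptions rather than the study's exact criteria.

```python
def classify_reference(live_status, similarity):
    """Classify one URI reference.

    live_status: HTTP status from dereferencing the live URI,
                 or None if the request failed outright.
    similarity:  0..100 score between the representative Memento
                 and the live content (ignored when the link is rotten).
    """
    if live_status is None or live_status >= 400:
        return "link rot"
    if similarity < 100:
        return "content drift"
    return "intact"


verdicts = [
    classify_reference(None, 0),     # host no longer resolves
    classify_reference(404, 0),      # page gone
    classify_reference(200, 63.5),   # page changed
    classify_reference(200, 100),    # page unchanged
]
```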
  59. Content Drift - PMC. Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift. In: PLOS ONE https://doi.org/10.1371/journal.pone.0167475
  60. Reference Rot for Links to Web at Large is Severe • Link Rot and Content Drift are severe • Cannot retrieve originally linked content from the live web • Can potentially retrieve originally linked content from web archives • But the archival coverage is too poor, a result of incidental archiving
  61. URI References without Representative Mementos - PMC. Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift. In: PLOS ONE https://doi.org/10.1371/journal.pone.0167475
  62. Impact of Archival Gap on Links from Managed Collections • Links from Managed Collections to Domains • Grey: Linked Content not Archived. Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE https://doi.org/10.1371/journal.pone.0115253
  63. Uncertainty Regarding the Future of B when A Links to It
  64. Custodian of A Takes a Snapshot of B when Linking to It
  65. Taking Snapshots of B: Automation is Key • Web archive APIs for on-demand archiving • perma.cc, Internet Archive, archive.is, webcitation • Amber for WordPress & Drupal archives resources linked in a page • http://amberlink.org/ • Hiberlink’s experimental Zotero extension archives bookmarked URLs • http://hiberlink.org/zotero.html • Hiberlink’s experimental HiberActive archives all URLs referenced in a newly submitted paper • https://www.slideshare.net/martinklein0815/hiberactive
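
The automation idea can be sketched independently of any specific archive: walk the URLs referenced in a document, hand each to an on-demand archiving function, and record the snapshot URI that comes back. Everything here is hypothetical, including the `snapshot_references` helper and the stand-in `fake_archive`; a real `archive` callable would wrap one archive's on-demand API, which is not shown.

```python
def snapshot_references(urls, archive):
    """Archive each distinct URL once; map URL -> snapshot URI or None.

    `archive` is any callable that takes a URL and returns the URI of
    the snapshot it created, raising an exception on failure.
    """
    results = {}
    for url in dict.fromkeys(urls):      # de-duplicate, keep first-seen order
        try:
            results[url] = archive(url)
        except Exception:                # sketch only: record the failure, move on
            results[url] = None
    return results


def fake_archive(url):
    """Stand-in for a real archive client (hypothetical)."""
    if "broken" in url:
        raise IOError("capture failed")
    return "https://archive.example.org/20171115000000/" + url


snapshots = snapshot_references(
    ["http://example.org/a", "http://example.org/a", "http://broken.example.org/"],
    fake_archive,
)
```

Recording failures instead of aborting lets a submission pipeline retry or flag the references that could not be captured.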
  66. site2cite http://site2cite
  67. Custodian of A Links to Snapshot of B • Typical practice for linking to snapshots: <a href="URL of snapshot of B"> • Problems with this practice: o Impossible to visit the original URI, if desired o Requires the permanent existence/uptime of the archive that holds the snapshot: one link rot problem replaced by another • http://robustlinks.mementoweb.org/about/
  68. Permanent Existence/Uptime of Archives? Capture of http://webcitation.org dated July 17 2013 https://archive.today/eAETp
  69. Permanent Existence/Uptime of Archives? Remnant of discontinued web archive http://mummify.it captured on February 14 2014 https://web.archive.org/web/20140214233752/https://www.mummify.it/
  70. Permanent Existence/Uptime of Archives? http://www.themoscowtimes.com/news/article/russia-bans-wayback-machine-internet-archive-over-islamic-state-video/510074.html
  71. Permanent Existence/Uptime of Archives? http://web.archive.org/web/20121101043952/http://vogin.nl on March 6 2017 at 15:59 CET
  72. Custodian of A Links to Snapshot of B, Decorates the Link • Desired practice for linking to captures is to decorate the link so it provides a variety of options: <a href="URL of snapshot of B" data-originalurl="B" data-versiondate="datetime of snapshot of B"> • Supports: o Revisiting the original URL o Finding snapshots in any web archive (via original URL) o Finding a temporally appropriate snapshot in any web archive (via original URL & snapshot datetime) o Automatically accessing a temporally appropriate snapshot in any web archive (Memento protocol using original URL & snapshot datetime) • http://robustlinks.mementoweb.org/spec/
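
Generating a decorated link is mechanical once the snapshot URL, the original URL, and the snapshot date are known. The sketch below emits the two data- attributes named on the slide; the `robust_link` helper is hypothetical, and writing the date as an ISO 8601 `YYYY-MM-DD` value is an assumption to be checked against the Robust Links specification.

```python
from datetime import date
from html import escape


def robust_link(snapshot_url, original_url, version_date, text):
    """Return an <a> element decorated with original URL and version date."""
    return '<a href="{}" data-originalurl="{}" data-versiondate="{}">{}</a>'.format(
        escape(snapshot_url, quote=True),
        escape(original_url, quote=True),
        version_date.isoformat(),
        escape(text),
    )


anchor = robust_link(
    "https://web.archive.org/web/20140214233752/https://www.mummify.it/",
    "https://www.mummify.it/",
    date(2014, 2, 14),
    "mummify.it",
)
```

If the archive holding the snapshot disappears, a reader or a script can still fall back to the original URL or look for a capture near the version date in another archive.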
  73. Robust Links: Link Decoration in Action • See Robust Links at work in: Van de Sompel, H. & Nelson, M.L. (2015) Reminiscing about 15 years of interoperability efforts. D-Lib Magazine. https://doi.org/10.1045/november2015-vandesompel • JavaScript makes the link decorations actionable • Robust Links JavaScript: https://github.com/mementoweb/robustlinks
  74. Recap - A Managed Collection Desires Reliable Outlinks
  75. Takeaways • When it comes to links to managed collections, the custodian of the linking collection relies on the custodians of the linked collections to preserve link integrity. • PIDs and HTTP redirects are managed by the custodians of the linked collections.
  76. Takeaways • When it comes to links to web at large resources, the custodian of a linking collection cannot rely on the custodians of those linked resources to maintain link integrity. • The creation of Mementos and Robust Links is managed by the custodian of the collection that links to web at large resources.
  77. Herbert Van de Sompel, Los Alamos National Laboratory, @hvdsomp. Achieving Link Integrity for Managed Collections. Photo by Eric Sieverts
