Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

To the Rescue of the Orphans of Scholarly Communication

1,227 views

Published on

To the Rescue of the Orphans of Scholarly Communication
presentation at CNI Spring 2017 meeting
Herbert Van de Sompel
http://orcid.org/0000-0002-0715-6126
Michael L. Nelson
http://orcid.org/0000-0003-3749-8116
Martin Klein
http://orcid.org/0000-0003-0130-2097

Published in: Internet
  • Be the first to comment

To the Rescue of the Orphans of Scholarly Communication

  1. 1. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Herbert Van de Sompel Los Alamos National Laboratory @hvdsomp http://orcid.org/0000-0002-0715-6126 Michael L. Nelson Old Dominion University @phonedude_mln http://orcid.org/0000-0003-3749-8116 Martin Klein Los Alamos National Laboratory @mart1nkle1n http://orcid.org/0000-0003-0130-2097 To the Rescue of the Orphans of Scholarly Communication The project is funded by the Andrew W. Mellon Foundation
  2. 2. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 • Problem statement Scholarly objects are everywhere on the web, and are not systematically archived • Project perspective Capturing objects using an institutional & web archiving paradigm • Object capture flow: • Step 1: Discovering a researcher’s web identities • Step 2: Discovering artifacts per web identity • Step 3: Determining the web boundary per artifact • Step 4: Capturing resources in the artifacts’ web boundary Outline
  3. 3. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Scholarship is Evolving • The research process, not just its outcome, is becoming visible … on the web • Massive extension of the scholarly record with an enormous variety of novel objects • The objects are heterogeneous, dynamic, compound, inter-related and distributed across the web • The objects are often hosted on common web platforms that are not dedicated to scholarship
  4. 4. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 101 Innovations in Scholarly Communication Bianca Kramer & Jeroen Bosman. 101 Innovations in Scholarly Communication https://innoscholcomm.silk.co/
  5. 5. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 The Evolving Scholarly Record Brian Lavoie et al. (2014) The Evolving Scholarly Record http://www.oclc.org/content/dam/research/publications/library/2014/oclcresearch-evolving-scholarly-
  6. 6. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Web Platforms Record Scholarship • Increasingly, common web platforms are used for scholarship • GitHub, Wikis, Wordpress, etc. • Many of these platforms have desirable characteristics • Versioning • Time stamping • Social embedding • But, these platforms record rather than archive Herbert Van de Sompel & Andrew Treloar (2014) A Perspective on Archiving the Scholarly Web http://public.lanl.gov/herbertv/papers/Papers/2014/iPres2014_Sompel_Treloar.pdf
  7. 7. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Recording is not Archiving “GitHub reserves the right at any time and from time to time to modify or discontinue, temporarily or permanently, the Service (or any part thereof) with or without notice.” GitHub Terms of Service http://help.github.com/articles/github-terms-of-service https://help.github.com/articles/github-terms-of-service/
  8. 8. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Recording is not Archiving GitHub Terms of Service http://help.github.com/articles/github-terms-of-service https://opensource.googleblog.com/2015/03/farewell-to-google-code.html
  9. 9. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Recording versus Archiving Recording Archiving Short-term Longer-term No guarantees provided Attempt to provide guarantees Write many/read many Write once/Read many Scholarly process Scholarly record Herbert Van de Sompel & Andrew Treloar (2014) A Perspective on Archiving the Scholarly Web http://public.lanl.gov/herbertv/papers/Papers/2014/iPres2014_Sompel_Treloar.pdf
  10. 10. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Meet Some New School Researchers Ian Milligan Mark Matienzo
  11. 11. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Meet Some New School Researchers Ian Milligan https://ianmilligan.ca/ https://www.slideshare.net/IanMilligan1 https://github.com/ianmilligan1
  12. 12. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Meet Some New School Researchers Mark Matienzo http://matienzo.org/ https://www.slideshare.net/anarchivist/presentations https://github.com/anarchivist https://osf.io/tgr4k/ https://www.drupal.org/user/380762
  13. 13. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 SlideShare Artifact: 0 Mementos https://www.slideshare.net/IanMilligan1/resaw-geo-cities http://timetravel.mementoweb.org/list/20140513211653/https://www.slideshare.net/IanMilligan1/resa
  14. 14. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 GitHub Artifact: 1 Memento https://github.com/ianmilligan1/Historian-WARC-1 http://web.archive.org/web/*/https://github.com/ianmilligan1/Historian-WARC-1
  15. 15. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 The Scholarly Orphans Project • Funded by the Andrew W, Mellon Foundation • Los Alamos National Laboratory & New Mexico Consortium • Old Dominion University • 04/2016 - 03/2019 • How to capture scholarly orphans for long-term archiving? • Project explores a paradigm inspired by web archiving • Scale of the problem • Bilateral agreements with most web portals unlikely • Project explores an institution driven paradigm • Institution should be interested in capturing the artifacts its scholars deposit on the web
  16. 16. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 An Institutional & Web Archiving Perspective
  17. 17. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Related Work • LOCKSS • Web crawling approach • Focused on journal literature • Archive-It • On-demand, subscription-based web archiving • Not focused on scholarly orphans • Institutional repository • Capture an institution’s output • Focused on manual upload (of journal literature) • The Locker Project • Capture an individual’s web presence • Not focused on scholarly orphans
  18. 18. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Capture Flow
  19. 19. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Capture Flow – Step 1
  20. 20. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Algorithmic Discovery of Web Identities James Powell et al. (2014) EgoSystem: Where are our alumni? In: code4lib http://journal.code4lib.org/articles/9519
  21. 21. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Discovery of Web Identities via a Registry: ORCID Martin Klein and Herbert Van de Sompel (2017) Discovering scholarly orphans using ORCID In: JCDL2017 https://arxiv.org/abs/1703.09343 Ian Milligan http://orcid.org/0000-0002-1470-7723 Mark Matienzo http://orcid.org/0000-0003-3270-1306
  22. 22. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Ian Milligan’s ORCID • Web Identities: 0 http://orcid.org/0000-0002-1470-7723
  23. 23. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Mark Matienzo’s ORCID • Web Identities: 3 (homepage, ScopusID, ResearcherID) http://orcid.org/0000-0003-3270-1306
  24. 24. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Mark Matienzo’s Home Page • URI to GitHub repository, Twitter • Could be included in ORCID profile http://matienzo.org/
  25. 25. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 • Evaluation of ORCID for automatic discovery of Web Identities • How well does ORCID represent the global community of active researchers? • Adoption rate • Subject coverage • Geo-location coverage • How well does ORCID score when it comes to listing Web Identities? Discovery of Web Identities via a Registry: ORCID
  26. 26. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 ORCID - Adoption Rate 2013 2014 2015 2016 05000001000000150000020000002500000 ORCIDs total ORCIDs with given names ORCIDs with first names ORCIDs with works ORCIDs with affiliations ORCIDs with web identities
  27. 27. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 ORCID - Subject Coverage 0 10 20 30 40 50 60 Other Life Sciences Physical Sciences Mathematics and Computer Sciences Education Psychology and Social Sciences Engineering Humanities and Arts ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ORCID Subjects Ph.D. Researchers Publications
  28. 28. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 ORCID - Geo-Location Coverage
  29. 29. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 ORCID - Geo-Location Coverage
  30. 30. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 ORCID - Geo-Location Coverage
  31. 31. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 ORCID - Web Identities
  32. 32. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 ORCID - Web Identities
  33. 33. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Discovery of Web Identities via a Registry: ORCID • Adoption rate is increasing • Subject coverage is focused, does not cover all disciplines equally • Geo-Location coverage is good but not quite representative • Web Identity coverage is poor; not usable for our purpose in its current state
  34. 34. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Capture Flow – Step 2
  35. 35. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Discovery of Artifacts per Web Identity • Algorithmic approach • Scrape artifacts from pages http://matienzo.org/publications/
  36. 36. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Discovery of Artifacts per Web Identity • Notifications • Subscribe to portal notifications about a researcher’s new artifacts https://www.slideshare.net/anarchivist/presentations
  37. 37. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Discovery of Artifacts per Web Identity • Artifact Registry • 5 artifacts of interest (standards document, reports, book reviews) http://orcid.org/0000-0003-3270-1306
  38. 38. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Capture Flow – Step 3
  39. 39. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Determination of Web Boundary per Artifact http://signposting.org
  40. 40. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 HTTP Links Mark Nottingham (2010) RFC5988: Web Linking. http://tools.iets.org/rfc/rfc5988.txt
  41. 41. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 HTTP Links
  42. 42. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 HTTP Links
  43. 43. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Signposting - Publication Boundary Pattern http://signposting.org/publication_boundary/oxford/
  44. 44. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Signposting - Bibliographic Metadata Pattern http://signposting.org/bibliographic_metadata/springer/
  45. 45. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Capture Flow – Step 4
  46. 46. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 • Legal • robots.txt • Licenses • Technical • Capture tools • Capture quality • Capture authenticity Challenges Regarding Capturing Web Artifacts
  47. 47. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Legal Challenges re Capturing Artifacts – A Wake-Up Call SlideShare • robots.txt unclear, some pages disallowed • License seems to prohibit archiving GitHub • robots.txt unclear, some pages disallowed • License seems to allow archiving Drupal • robots.txt allows relevant URIs • License seems to prohibit archiving Open Science Framework • robots.txt does not disallow crawlers • License does not mention archiving, individual content may have specific license
  48. 48. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Capture Tools Challenges: Mark’s SlideShare Live Internet Archive Webrecorder.io https://www.slideshare.net/anarchivist/to-hell-with-good-intentions-linked-data-and-the-power-to-name http://web.archive.org/web/20161229053246/http://www.slideshare.net/anarchivist/to-hell-with-good-intentions-linked-data-and-the-power-to-name https://webrecorder.io/martinklein/cni_test/20170330014029/https://www.slideshare.net/anarchivist/to-hell-with-good-intentions-linked-data-and- the-power-to-name
  49. 49. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Capture Tools Challenges: Mark’s GitHub Live Internet Archive Webrecorder.io https://github.com/rightsstatements/rightsstatements.github.io https://web.archive.org/web/20170328040646/https://github.com/rightsstatements/rightsstatements.github.io https://webrecorder.io/martinklein/cni_test/20170330014135/https://github.com/rightsstatements/rightsstatements.github.io
  50. 50. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Capture Tools Challenges: Mark’s OSF Live Internet Archive Webrecorder.io https://osf.io/h4ru8/wiki/home/ http://web.archive.org/web/20170328042647/https://osf.io/h4ru8/wiki/home/ https://webrecorder.io/martinklein/cni_test/20170330014219/https://osf.io/h4ru8/wiki/home/
  51. 51. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Capture Quality - How well was this page archived? • Continuing research on memento damage, first published at JCDL 2014 • Premise: simply reporting “9/10 embedded images were archived” is insufficient to describe how well the archive / replay system performed • Use heuristics from Mechanical Turk testing to approximate human conception of damage, e.g.: o increase weight of missing images that are large, or centered in the viewport o stylesheets can be important! check for “ugly” results J.F. Brunelle, M. Kelly, H. SalahEldeen M. C. Weigle, and M. L. Nelson (2014) Not all mementos are created equal: Measuring the impact of missing resources. In: JCDL 2014 http://dx.doi.org/10.1109/JCDL.2014.6970187 http://dx.doi.org/10.1007/s00799-015-0150-6
  52. 52. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Triptych CSS “regular” web pages have nearly equal distribution of content over each third of a page if a CSS is missing AND > 75% of the non-background color is in the left 2/3s of the page, then users consider this damaged
  53. 53. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 A Memento Damage Service, Python Library, and Docker Image Erika Siregar http://memento-damage.cs.odu.edu/ https://github.com/erikaris/web-memento-damage
  54. 54. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Just a Little Bit of Damage…
  55. 55. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Moderate Damage…
  56. 56. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Significant Damage…
  57. 57. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Ian’s GitHub Memento… https://github.com/ianmilligan1/Historian-WARC-1 http://web.archive.org/web/20130922192416/https://github.com/ianmilligan1/Historian-WARC-1
  58. 58. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 … Has Slight Damage does not appear to violate the “75% / left- 2/3s” rule
  59. 59. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Capture Authenticity - Has this page been tampered with? • The days of implicitly trusting Brewster & IA are over o the people who brought you fake news will eventually bring you fake archives o “mo’ archives mo’ problems” • Premise: use multiple, independent archives to record fixity information from dated observations of mementos • Plans: o blockchain o provenance (i.e., a memento of memento != 2 independent mementos) https://climate.nasa.gov/vital-signs/carbon-dioxide/ http://web.archive.org/web/20170312201332/https://climate.nasa.gov/vital-signs/carbon-dioxide/ http://localhost:8282/michael/wayback/20170313023607/https://climate.nasa.gov/vital-signs/carbon-dioxide/
  60. 60. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Push a Web Page into Multiple Archives Mohamed Aturban (2017) Archive Now (archivenow): A Python Library to Integrate On-Demand Archives http://ws-dl.blogspot.com/2017/02/2017-02-22-archive-now-archivenow.html
  61. 61. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Record Fixity in a Manifest File Shawn Jones (2016) Mementos In the Raw, Take Two http://ws-dl.blogspot.com/2016/08/2016-08-15-mementos-in-raw-take-two.html
  62. 62. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Publish Manifest to the Web
  63. 63. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Archive the Manifest
  64. 64. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 “You can’t tell the players without a scorecard” – Harry M. Stevens
  65. 65. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Verifying the Authenticity of a Memento • Given a Memento, URI-M, that we wish to verify • Lookup the URI-M at a manifest server o e.g, captureproject.org/{URI-M} • Discover all the mementos of the manifest, and verify their integrity with “trusty URIs” • For each URI-M listed in the manifest, repeat the fixity calculation as described in the manifest • Vote if fixity matches (not tampered with) or if fixity doesn’t match (tampered with) o Majority vote wins (assuming independent archives) Mohamed Aturban (2017) Summary of "Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data” http://ws-dl.blogspot.com/2017/01/2017-01-15-summary-of-trusty-uris.html Video at https://www.youtube.com/watch?v=EY15lj-7_lc
  66. 66. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Discussion
  67. 67. @hvdsomp, @phonedude_mln, @mart1nkle1n CNI Spring 2017, Albuquerque, NM, 3 Apr 2017 Herbert Van de Sompel Los Alamos National Laboratory @hvdsomp http://orcid.org/0000-0002-0715-6126 Michael L. Nelson Old Dominion University @phonedude_mln http://orcid.org/0000-0003-3749-8116 Martin Klein Los Alamos National Laboratory @mart1nkle1n http://orcid.org/0000-0003-0130-2097 To the Rescue of the Orphans of Scholarly Communication The project is funded by the Andrew W. Mellon Foundation

×