
An Institutional Perspective to Rescue Scholarly Orphans

Presentation at the CNI 2019 Spring Meeting


  1. @mart1nkle1n @hvdsomp CNI Spring 2019, April 8 2019, St. Louis, MO • An Institutional Perspective to Rescue Scholarly Orphans • Martin Klein, LANL, @mart1nkle1n, https://orcid.org/0000-0003-0130-2097 • Herbert Van de Sompel, DANS, @hvdsomp, https://orcid.org/0000-0002-0715-6126 • The Scholarly Orphans project is funded by the Andrew W. Mellon Foundation
  2. Scholarly Orphans Team • Los Alamos National Laboratory: Lyudmila Balakireva, Martin Klein, James Powell, Harihar Shankar, Herbert Van de Sompel • Old Dominion University: Sawood Alam, Grant Atkins, Shawn Jones, Mat Kelly, Michael L. Nelson
  3. Scholarly Orphans – Project Motivation
  4. Research and Research Communication on the Web • Consideration • Researchers are increasingly using a variety of web platforms for collaboration and communication • Why? • Many of these platforms have desirable characteristics: versioning, time stamping, social embedding • Their institutions do not provide platforms that have global reach • Collaboration, cf. GitHub ~ productivity • Communication, cf. SlideShare ~ visibility
  5. Emma Schymanski • https://orcid.org/0000-0001-6868-8145 • https://github.com/schymane • https://www.slideshare.net/EmmaSchymanski • https://figshare.com/authors/Emma_Schymanski/5087039 • https://publons.com/author/1538491/emma-schymanski#profile • https://www.eawag.ch/en/aboutus/portrait/organisation/staff/profile/emma-schymanski/
  6. Shawn Jones • https://orcid.org/0000-0002-4372-870X • http://www.shawnmjones.org/ • https://github.com/shawnmjones • https://www.slideshare.net/shawnmjones • https://en.wikipedia.org/wiki/User:Shawnmjones • https://www.blogger.com/profile/17827543974149663194
  7. Research and Research Communication on the Web • Consideration • Researchers deposit artifacts in web platforms • Web platforms dedicated to scholarship: • Commercial: e.g., FigShare, Publons • Not for profit: e.g., OSF, Zenodo • General-purpose web platforms: • Commercial: e.g., GitHub, SlideShare • Not for profit: e.g., Wikipedia, Wikidata
  8. Research and Research Communication on the Web • Consideration • Researchers deposit artifacts in web platforms • Status quo - The researchers’ institutions are in the dark • They do not know about the existence of these artifacts • They do not have a copy of these artifacts
  9. Research and Research Communication on the Web • Consideration • Researchers deposit artifacts in web platforms • Status quo - Uncertainty regarding long-term access • Commercial: changing business model, no preservation commitment • Not for profit: unpredictable funding stream
  10. Research and Research Communication on the Web • Consideration • Researchers deposit artifacts in web platforms • Status quo - Not systematically archived • No frameworks like LOCKSS/Portico exist for these artifacts • Researchers only selectively deposit artifacts in portals that provide archival guarantees, in order to obtain a citable DOI • Can’t expect researchers to (also) upload all artifacts in IRs • Web archives only incidentally archive these artifacts, cf. anecdotal & Hiberlink project evidence • Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE. https://doi.org/10.1371/journal.pone.0115253
  11. Emma’s SlideShare Artifact: 0 Mementos • https://www.slideshare.net/EmmaSchymanski/dmcm2018-community-resources-connecting-chemistry-and-toxicity-knowledge • http://timetravel.mementoweb.org/
  12. Shawn’s GitHub Artifact: 1 Memento • https://github.com/shawnmjones/mediawiki • https://web.archive.org/web/*/https://github.com/shawnmjones/mediawiki
  13. Scholarly Orphans – Project Overview • How to capture Scholarly Orphans for long-term archiving?
  14. The Scholarly Orphans Project • Explores an institution-driven paradigm • Academic institutions typically have a long shelf life • A basic premise underlying e.g., LOCKSS, perma.cc • An academic institution should be interested in capturing the artifacts (intellectual property) its scholars deposit on the web • Collecting and archiving such artifacts aligns with the mission of academic libraries
  15. An Institutional Perspective
  16. The Scholarly Orphans Project • Explores a paradigm inspired by web archiving • Scale of the problem • Can’t expect researchers to upload all artifacts in an institutional repository • Bilateral agreements for archival purposes with most web portals unlikely
  17. A Web Archiving Perspective
  18. Scholarly Orphans – Prototype Pipeline Overview
  19. Prototype Pipeline
  20. Tracking Artifacts
  21. Tracking Artifacts - Description • In order to track artifacts that were recently deposited by an institutional researcher in a portal, one reasonably needs: • The web identity of the researcher in the portal, obtained by: • Algorithmic discovery • Discovery via a registry • Manual collection • A portal API that supports: • Access by web identity • Access to contributions “since …” for the web identity • Result of tracking: • URI(s) of new artifact(s) discovered in the portal
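
The tracking step described on slide 21 can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `WebIdentity`, `track_new_artifacts`, and the per-portal fetcher functions are hypothetical names, and each fetcher stands in for whatever "contributions since …" call a real portal API might offer.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class WebIdentity:
    """A researcher's account in one portal (hypothetical structure)."""
    orcid: str    # identifies the researcher across portals
    portal: str   # portal name, e.g. "github"
    handle: str   # account name of the researcher in that portal


# Hypothetical fetcher signature: (handle, since) -> (artifact URI, creation time) pairs.
Fetcher = Callable[[str, datetime], Iterable[Tuple[str, datetime]]]


def track_new_artifacts(
    identities: Iterable[WebIdentity],
    fetchers: Dict[str, Fetcher],
    since: datetime,
) -> List[str]:
    """Return URIs of artifacts created after `since` for all known identities."""
    new_uris: List[str] = []
    for identity in identities:
        fetch = fetchers.get(identity.portal)
        if fetch is None:
            # No API access by web identity for this portal: nothing to track.
            continue
        for uri, created in fetch(identity.handle, since):
            # Filter again locally in case the portal ignores the "since" argument.
            if created > since and uri not in new_uris:
                new_uris.append(uri)
    return new_uris
```

The result is the list of artifact URIs that is handed to the capture step; portals without an identity-based API simply contribute nothing, which mirrors the challenge named on the next slide.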
  22. Tracking Artifacts - Challenges • Portal API access by web identity • Broadly supported by general purpose portals • Typically not supported by scholarly portals • Some lack an API altogether • Should add ORCID access to APIs • OAI-PMH and ResourceSync need sets per web identity • Professional versus personal contributions • Tracking frequency/scale
  23. Capturing Artifacts
  24. Capturing Artifacts - Description • The capture process takes as input the URI of a new artifact discovered in a portal • Its task is to create a representative institutional capture of the artifact • Result of capture: • WARC file for new artifact in an institutional archive
  25. Capturing Artifacts - Challenges • Delineate the web boundary of the artifact • More than the input artifact URI • The boundary is in the eye of the beholder • Create a high-fidelity capture using an approach that scales for a steady stream of new artifacts • Determine the web boundary of the artifact • Handle dynamic content & interactive features of web pages • We made a significant breakthrough with the Memento Tracer framework • Memento Tracer: http://tracer.mementoweb.org
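
To make the "web boundary" challenge on slide 25 concrete, here is a deliberately naive sketch that treats a link as part of an artifact when it shares the artifact's scheme and host and sits at or below its path. This is an assumed heuristic for illustration only; it is not how Memento Tracer delineates boundaries, and the function names are hypothetical.

```python
from typing import Iterable, List
from urllib.parse import urljoin, urlsplit


def within_boundary(artifact_uri: str, candidate_uri: str) -> bool:
    """Naive rule: same scheme and host, and path at or below the artifact path."""
    base = urlsplit(artifact_uri)
    cand = urlsplit(candidate_uri)
    if (cand.scheme, cand.netloc.lower()) != (base.scheme, base.netloc.lower()):
        return False
    base_path = base.path.rstrip("/")
    # Require a path-segment boundary so "/repo-old" does not match "/repo".
    return cand.path == base_path or cand.path.startswith(base_path + "/")


def select_boundary(artifact_uri: str, links: Iterable[str]) -> List[str]:
    """Resolve links found on the artifact page and keep those inside the boundary."""
    selected: List[str] = []
    for href in links:
        absolute = urljoin(artifact_uri, href)
        if within_boundary(artifact_uri, absolute) and absolute not in selected:
            selected.append(absolute)
    return selected
```

A rule like this captures too little for some portals and too much for others, which is one way to read "the boundary is in the eye of the beholder".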
  26. Archiving Artifacts
  27. Archiving Artifacts - Description • The archiving process takes as input the URI of a WARC file generated by the capture process • Its task is to ingest the WARC file in a cross-institutional web archive • This can be achieved using off-the-shelf web archiving software, e.g., pywb, OpenWayback • Result of archiving: • Mementos pertaining to newly discovered artifact in a cross-institutional, Memento-compliant web archive • Possibility to link to artifacts using Robust Links: <a href="URI-A" data-versionurl="URI-M" data-versiondate="date-of-capture"> • Robust Links: http://robustlinks.mementoweb.org/about/
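
As an illustration of the Robust Links markup mentioned on slide 27, the following sketch assembles an anchor element from an artifact URI (URI-A), a memento URI (URI-M), and a capture date. The helper name `robust_link` is hypothetical, and the archive URL in the usage note is a placeholder rather than a real memento.

```python
from datetime import date
from html import escape


def robust_link(uri_a: str, uri_m: str, captured: date, text: str) -> str:
    """Build an <a> element carrying the Robust Links data attributes."""
    return '<a href="{}" data-versionurl="{}" data-versiondate="{}">{}</a>'.format(
        escape(uri_a, quote=True),   # original artifact URI (URI-A)
        escape(uri_m, quote=True),   # URI of the archived snapshot (URI-M)
        captured.isoformat(),        # capture date as YYYY-MM-DD
        escape(text),                # visible link text
    )
```

For example, `robust_link("https://github.com/shawnmjones/mediawiki", "https://archive.example.org/20190408/https://github.com/shawnmjones/mediawiki", date(2019, 4, 8), "mediawiki repository")` yields a link whose `href` still points at the live artifact while the data attributes point at the archived copy.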
  28. Archiving Artifacts - Challenges • Attempted to use ipwb, a pywb version that uses IPFS • Cross-institutional distributed file system with redundancy • Ran out of time to get it operationally stable • Sawood Alam, Mat Kelly, and Michael L. Nelson (2016) InterPlanetary Wayback: The Permanent Web Archive. https://doi.org/10.1145/2910896.2925467
  29. Pipeline Demo • https://myresearch.institute
  30. myresearch.institute - Researchers • Uniquely identified by ORCIDs • Web identities in multiple portals • Create various types of artifacts
  31. myresearch.institute - Portals • Tracking started August 27 2018 • Tracking artifacts created starting August 1 2018
  32. myresearch.institute – Statistics
  33. Productivity Portal Distribution
  34. Researcher Contributions
  35. Researcher Contributions
  36. Researcher Contributions
  37. Artifact Frequency
  38. Artifact Frequency per Portal
  39. Scholarly Orphans – Pipeline • 10,187 unique artifacts tracked, captured, and archived since August 1 2018 • 41 MB event database • 61 GB of WARC files • 2.3 GB of web archive index
  40. Scholarly Orphans – Pipeline • Capture process, post tracking • 50% of artifacts captured within 9 minutes • 75% of artifacts captured within 1 hour 21 minutes • Archiver process, post capture • 50% of artifacts archived within 10 minutes • 75% of artifacts archived within 57 minutes
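
Figures of the form "50% of artifacts captured within N minutes" can be derived from per-artifact delays along the following lines. This is a sketch under the assumption that each artifact has a recorded delay between two pipeline steps; the function name is hypothetical and the slides do not state how the reported numbers were computed.

```python
import math
from datetime import timedelta
from typing import Sequence


def delay_at_fraction(delays: Sequence[timedelta], fraction: float) -> timedelta:
    """Smallest observed delay within which at least `fraction` of artifacts were processed."""
    if not delays:
        raise ValueError("no delays recorded")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(delays)
    # Index of the first delay that covers the requested share of artifacts.
    index = math.ceil(fraction * len(ordered)) - 1
    return ordered[index]
```

Called with made-up delays of 3, 12, 45, and 200 minutes, the function returns 12 minutes for a fraction of 0.5 and 45 minutes for 0.75.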
  41. Summary • The Scholarly Orphans project explores an institution-driven approach to capture scholarly artifacts deposited in web portals • Artifacts out of scope of existing archival approaches such as LOCKSS, Portico, web archives • Institutions have a long shelf life, should be interested in collecting these artifacts, and have feasible scale for identity/artifact discovery • Prototype at myresearch.institute illustrates feasibility, opportunities, and challenges of this institutional perspective
  42. What Our Researchers Say… • “Ha, this is awesome! Thanks for letting me know - carry on as usual, and feel free to monitor away. I'll try not to change my behaviour or anything now with this new knowledge :)” • “This is fine, since everything you are capturing is public to start with. I also wonder if you know about Software Heritage?” • “I’m very comfortable with being part of this (very important) research project” • “I'm cool with it :-)” • “Interesting project! I’m happy to participate.” • “One more thing, is it possible to get a copy of the URI-Rs that you guys detected so that I can feed them into an archive of my choice?...”
  43. An Institutional Perspective to Rescue Scholarly Orphans • Martin Klein, LANL, @mart1nkle1n, https://orcid.org/0000-0003-0130-2097 • Herbert Van de Sompel, DANS, @hvdsomp, https://orcid.org/0000-0002-0715-6126 • The Scholarly Orphans project is funded by the Andrew W. Mellon Foundation
