@hvdsomp
VIVO Conference 2019, September 5 2019, Podgorica, Montenegro
Herbert Van de Sompel
DANS
@hvdsomp
https://orcid.org/0000-0002-0715-6126
Collecting the Organizational Scholarly Record
2013 - EgoSystem
James Powell, Harihar Shankar, Marko Rodriguez, and Herbert Van de Sompel (2014) EgoSystem: Where are our
Alumni? code{4}lib journal, issue 24. https://journal.code4lib.org/articles/9519
EgoSystem Team
• Los Alamos National Laboratory:
o James Powell
o Harihar Shankar
o Herbert Van de Sompel
• Aurelius:
o Marko Rodriguez
Motivation
• When postdocs leave LANL, the local information systems
maintain very little information about them
• But senior management is interested in engaging them after they
leave LANL as Ambassadors and Advocates
• They need answers to questions like:
• Who is currently working where?
• Who is involved in what areas of research?
• Who might serve as advocates for the Lab?
• Who knows someone who knows someone we need to
connect with?
2012 - Initial Approach: Set Up a VIVO Instance
• 2700+ records were
ingested from LANL
Postdoc Office data to
create initial user profiles
• 8 postdoc alumni were
contacted to complete
their profile
Doubts about the VIVO Instance
• Up-to-date information at all times is essential to meet the needs of
senior LANL management
• Some existing VIVO instances seemed to have been pre-populated
but then remained static after launch
• Would current and former postdocs be interested in
maintaining a professional profile on a VIVO instance
intended to help out LANL?
2013 - New Approach: Leverage Network-Level Information
• Leverage public, network-level information pertaining to LANL
Alumni
• Find their network presences - social portals, scientific
portals, homepages, etc.
• Recurrently collect information from those presences: current
employer, social network neighborhood, geo location, etc.
• Create applications based on that information
• Rationale: People have incentives to keep network-layer
information up-to-date
• Goal: Devise a sustainable approach to gather and use
up-to-date information pertaining to LANL Alumni
Available information elements for PostDocs:
• Z#
• Name
• Institutions:
o PhD University; LANL; Institution after
LANL
• Field of Study
• Discipline
Find network identities:
• Various queries based on information
elements in:
o Yahoo BOSS API; MS Academic
Search API
• Search for candidate identities:
o LinkedIn; MS Academic; Twitter;
Homepage; Blogger; SlideShare;
Wikipedia
• Rank and select candidate identities
o Reward when: same identities from
various searches; content matches
information elements
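The rank-and-select step can be sketched as follows. This is a minimal illustration, not the EgoSystem implementation: the slides only state the two reward conditions (the same identity returned by several searches, and content matching the known information elements). The `Candidate` type, the weights, the threshold, and the substring matching are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One search hit (hypothetical structure for this sketch)."""
    url: str    # candidate identity, e.g. a profile URL
    query: str  # label of the search that returned the hit
    text: str   # snippet or page text that came with the hit


def score_candidates(candidates, known_elements,
                     agreement_weight=1.0, match_weight=1.0):
    """Rank candidate identities, highest score first.

    Reward 1: the same URL was returned by several distinct searches.
    Reward 2: the hit's text mentions known information elements
              (name, institution, discipline, ...).
    """
    queries_by_url = {}
    text_by_url = {}
    for c in candidates:
        queries_by_url.setdefault(c.url, set()).add(c.query)
        text_by_url[c.url] = text_by_url.get(c.url, "") + " " + c.text.lower()

    scores = {}
    for url, queries in queries_by_url.items():
        agreement = len(queries) - 1  # extra searches that agree on this URL
        matches = sum(1 for e in known_elements
                      if e.lower() in text_by_url[url])
        scores[url] = agreement_weight * agreement + match_weight * matches
    # Sort by descending score, then by URL for a deterministic order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def select_identity(candidates, known_elements, threshold=2.0):
    """Return the best URL, or None when nothing scores high enough.

    A high threshold trades recall for precision: doubtful candidates
    are dropped rather than attached to the wrong person.
    """
    ranked = score_candidates(candidates, known_elements)
    if ranked and ranked[0][1] >= threshold:
        return ranked[0][0]
    return None


hits = [
    Candidate("https://example.org/in/jdoe", "name+lanl", "J. Doe, Los Alamos"),
    Candidate("https://example.org/in/jdoe", "name+phd", "J. Doe, physics"),
    Candidate("https://example.org/in/other", "name+lanl", "unrelated profile"),
]
best = select_identity(hits, ["Los Alamos", "Physics"])
```

Returning `None` below the threshold is one simple way to express a preference for precision over recall.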
LinkedIn Identity
Twitter Identity
Network-derived information:
• Identities:
o LinkedIn; MS Academic; Twitter;
Homepage; Blogger; SlideShare;
Wikipedia
• Additional information elements:
o Current institution; geo location;
updated discipline
Web Identities Discovered Per Postdoc
[Bar chart: number of web identities discovered per postdoc (none, one, two, three, four, five); vertical axis from 0 to 1800]
Resulting Identity Types per Postdoc
[Bar chart: identity types (LANL, MS Academic, LinkedIn, Twitter); vertical axis from 0 to 3500]
Evaluation of the Discovery Algorithm
• Random set of 100 postdocs
• MS Academic
o 86 correct
- 71 correctly discovered identities
- 15 correctly labeled as not having identity
o 14 incorrect
- 2 discovered identities did not match the postdoc
- 12 existing identities were not discovered
• Algorithms favored precision over recall
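The "precision over recall" remark can be checked against the counts above. The sketch below assumes one reading of the slide: the 2 mismatched identities count as false positives and the 12 undiscovered identities as false negatives. The slide does not itself use those terms.

```python
# Counts from the MS Academic evaluation of 100 random postdocs.
true_positives = 71   # correctly discovered identities
true_negatives = 15   # correctly labeled as not having an identity
false_positives = 2   # discovered identity did not match the postdoc
false_negatives = 12  # existing identity was not discovered

total = true_positives + true_negatives + false_positives + false_negatives

# Share of all 100 decisions that were correct.
accuracy = (true_positives + true_negatives) / total
# Of the identities the algorithm reported, how many were right.
precision = true_positives / (true_positives + false_positives)
# Of the identities assumed to exist, how many the algorithm found.
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Under this reading, precision (71 of 73 reported identities) is noticeably higher than recall (71 of 83 existing identities), which is consistent with the slide's closing bullet.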
Network-derived information:
• Network neighborhood:
o Social network ~ Twitter: followers,
followed
o Academic network ~ co-authors MS
Academic
o Affiliations ~ LinkedIn, homepage
• Artifacts: papers, slide decks
• Concepts
Property Graph Representation of Resulting Information
• Platonic vertices
o Persons
o Institutions
o Artifacts
o Concepts
• Affiliation vertices
o Different types
o Different time periods
• Graph extent, started with 3,005 postdocs:
o Vertices: 9,015,844
o Edges: 19,399,683
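A minimal in-memory sketch of this vertex layout is given below. It is not the Titan schema used by EgoSystem: the class, the edge labels (`hasAffiliation`, `withInstitution`), and the property names are hypothetical. The point it illustrates is that modeling an affiliation as its own vertex lets it carry a type and a time period.

```python
from itertools import count


class PropertyGraph:
    """Tiny property graph: vertices and edges both carry key/value properties."""

    def __init__(self):
        self._ids = count(1)
        self.vertices = {}  # vertex id -> {"label": ..., other properties}
        self.edges = []     # (source id, edge label, target id, properties)

    def add_vertex(self, label, **props):
        vid = next(self._ids)
        self.vertices[vid] = {"label": label, **props}
        return vid

    def add_edge(self, src, label, dst, **props):
        self.edges.append((src, label, dst, props))

    def out(self, src, label):
        """Ids of vertices reached from src over edges with the given label."""
        return [d for s, l, d, _ in self.edges if s == src and l == label]


def add_affiliation(graph, person, institution, kind, start, end=None):
    """Link a person to an institution through an affiliation vertex.

    The affiliation vertex holds the type and time period, so one person
    can have several affiliations with the same institution.
    """
    aff = graph.add_vertex("affiliation", kind=kind, start=start, end=end)
    graph.add_edge(person, "hasAffiliation", aff)
    graph.add_edge(aff, "withInstitution", institution)
    return aff


g = PropertyGraph()
person = g.add_vertex("person", name="J. Doe")
lanl = g.add_vertex("institution", name="Los Alamos National Laboratory")
univ = g.add_vertex("institution", name="Example University")
add_affiliation(g, person, lanl, kind="postdoc", start=2009, end=2011)
add_affiliation(g, person, univ, kind="faculty", start=2011)

# Institutions the person is or was affiliated with, via affiliation vertices.
institutions = [
    g.vertices[inst]["name"]
    for aff in g.out(person, "hasAffiliation")
    for inst in g.out(aff, "withInstitution")
]
```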
Graph Database for Storage/Retrieval/Analysis
Titan Distributed Graph Database
http://titan.thinkaurelius.com/
EgoSystem Application
• Simple web query interface
• Shareable profile page for individuals
• Graph analytics (aggregate social networks, path analysis) and
graph visualization
• “Who’s where” search (for when the LANL Director travels)
• Capability to add a non-LANL person to the graph
o To find the closest path to that person via a LANL postdoc
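The "closest path via a LANL postdoc" query can be illustrated with a multi-source breadth-first search. This is a sketch over a plain adjacency dictionary, not the graph traversal EgoSystem ran against Titan. The function name and the toy network are made up for the example.

```python
from collections import deque


def closest_path_via_postdoc(adjacency, postdocs, target):
    """Shortest path from any LANL postdoc to the target person.

    adjacency: dict mapping a person to an iterable of connected persons
    postdocs:  the set of persons known to be LANL postdocs
    Returns the path as a list starting at a postdoc, or None.
    """
    # Start the search from all postdocs at once; the first time the
    # target is dequeued, the path found is a shortest one.
    parents = {p: None for p in sorted(postdocs)}
    queue = deque(parents)
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adjacency.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


network = {
    "alice": ["bob"],
    "bob": ["alice", "carol"],
    "carol": ["bob", "dana"],
    "dana": ["carol", "erin"],
    "erin": ["dana"],
}
path = closest_path_via_postdoc(network, {"alice", "erin"}, "dana")
```

Here `alice` and `erin` stand in for LANL postdocs and `dana` for the newly added non-LANL person. The search returns the shorter route through `erin`.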
Success?
• At the end of the demo meeting, the director said (paraphrasing)
o “I didn’t know what I wanted when we first met but this looks
like what I want, what I need.”
• Project discontinued because of the inability to access LinkedIn
data in a legitimate manner
• As a result of heuristic-based processes, the database and query
results are not necessarily correct or complete. This made
EgoSystem an approximating application.
• Fantastic 2-month (~ 6 MM) project that did not yield a production
system but in which we learned an awful lot
@hvdsomp
VIVO Conference 2019, September 5 2019, Podgorica, Montenegro
James Powell, Martin Klein, and Herbert Van de Sompel (2017) Autoload: a pipeline for expanding the holdings of
an Institutional Repository enabled by ResourceSync code{4}lib journal, issue 36.
https://journal.code4lib.org/articles/12427
2016 - Autoload
2018 – myresearch.institute
The Scholarly Orphans project
is funded by the Andrew W. Mellon Foundation
myresearch.institute Team
• Los Alamos National Laboratory:
o Lyudmila Balakireva
o Martin Klein
o James Powell
o Harihar Shankar
o Herbert Van de Sompel
• Old Dominion University:
o Sawood Alam
o Grant Atkins
o Shawn Jones
o Mat Kelly
o Michael L. Nelson
Research and Research Communication on the Web
• Consideration
o Researchers are increasingly using a variety of web platforms for
collaboration and communication
• Why?
o Many of these platforms have desirable characteristics
- Versioning
- Time stamping
- Social embedding
o Their institutions do not provide platforms that have global reach
- Collaboration, cf. GitHub ~ productivity
- Communication, cf. SlideShare ~ visibility
Research and Research Communication on the Web
• Consideration
o Researchers are increasingly using a variety of web platforms for
collaboration and communication
• Web Platforms:
o Dedicated to scholarship:
- Commercial: e.g., FigShare, Publons
- Not for profit: e.g., OSF, Zenodo
o General purpose:
- Commercial: e.g., GitHub, SlideShare
- Not for profit: e.g., Wikipedia, Wikidata
Emma Schymanski
https://orcid.org/0000-0001-6868-8145
https://github.com/schymane
https://www.slideshare.net/EmmaSchymanski
https://figshare.com/authors/Emma_Schymanski/5087039
https://publons.com/author/1538491/emma-schymanski#profile
https://www.eawag.ch/en/aboutus/portrait/organisation/staff/profile/emma-schymanski/
Shawn Jones
https://orcid.org/0000-0002-4372-870X
http://www.shawnmjones.org/
https://github.com/shawnmjones
https://www.slideshare.net/shawnmjones
https://en.wikipedia.org/wiki/User:Shawnmjones
https://www.blogger.com/profile/17827543974149663194
Research and Research Communication on the Web
• Consideration
o Researchers deposit artifacts in web platforms
• Status quo - The researchers’ institutions are in the dark
o Do not know about the existence of these artifacts
o Do not have a copy of these artifacts
Research and Research Communication on the Web
• Consideration
o Researchers deposit artifacts in web platforms
• Status quo - Uncertainty regarding long-term access
o Commercial: changing business model, no preservation commitment
o Not for profit: unpredictable funding stream
Research and Research Communication on the Web
• Consideration
o Researchers deposit artifacts in web platforms
• Status quo - Not systematically archived
o No frameworks like LOCKSS/Portico exist for these artifacts
o Researchers only selectively deposit artifacts in portals that
provide archival guarantees, to obtain a citable DOI
o Can’t expect researchers to (also) upload all artifacts in IRs
o Web archives only incidentally archive these artifacts, cf.
anecdotal & Hiberlink project evidence
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
Emma’s SlideShare Artifact: 0 Mementos
https://www.slideshare.net/EmmaSchymanski/dmcm2018-community-resources-connecting-chemistry-and-toxicity-knowledge
http://timetravel.mementoweb.org/
Shawn’s GitHub Artifact: 1 Memento
https://github.com/shawnmjones/mediawiki
https://web.archive.org/web/*/https://github.com/shawnmjones/mediawiki
Evidence from the Hiberlink Project
Web resources referenced in Elsevier corpus (1996-2012)
without representative Memento in public web archives
The Scholarly Orphans Project: How to Archive these Artifacts?
• Explores an institution-driven paradigm
• Academic institutions typically have a long shelf life
• A basic premise underlying e.g., LOCKSS, perma.cc
• An academic institution should be interested in capturing the
artifacts (intellectual property) its scholars deposit on the web
• Collecting and archiving such artifacts aligns with the
mission of academic libraries
An Institutional Perspective
The Scholarly Orphans Project: How to Archive these Artifacts?
• Explores a paradigm inspired by web archiving
• Scale of the problem
• Can’t expect researchers to upload all artifacts in an institutional
repository
• Bilateral agreements for archival purposes with most web
portals unlikely
A Web Archiving Perspective
myresearch.institute Prototype Pipeline
Tracking Artifacts
Tracking Artifacts - Description
• In order to track artifacts that were recently deposited by an
institutional researcher in a portal, one reasonably needs:
o The web identity of the researcher in the portal
- Algorithmic discovery, cf. EgoSystem
- Discovery via a registry, cf. ORCID paper
- Manual collection
o A portal API that supports:
- Access by web identity
- Access to contributions “since …” for the web identity
• Result of tracking:
o URI(s) of new artifact(s) discovered in the portal
Klein, M., and Van de Sompel, H. (2017) Discovering Scholarly Orphans Using ORCID. Proceedings of the 2017
ACM/IEEE Joint Conference on Digital Libraries https://arxiv.org/abs/1703.09343
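The two API requirements (access by web identity, and access to contributions "since …") can be sketched as an interface plus a polling step. Everything named here is hypothetical: `PortalAPI`, `contributions_since`, and the in-memory stand-in do not correspond to any real portal's API, and the URIs are placeholders.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Protocol, Set, Tuple


class PortalAPI(Protocol):
    """What a tracker needs from a portal (hypothetical interface)."""

    def contributions_since(
        self, web_identity: str, since: datetime
    ) -> Iterable[Tuple[str, datetime]]:
        """Yield (artifact URI, creation time) for one web identity."""
        ...


class InMemoryPortal:
    """Stand-in portal so the sketch runs without network access."""

    def __init__(self, items: Dict[str, List[Tuple[str, datetime]]]):
        self._items = items

    def contributions_since(self, web_identity, since):
        return [(uri, created)
                for uri, created in self._items.get(web_identity, [])
                if created > since]


def track(portal: PortalAPI, web_identity: str,
          last_checked: datetime, seen: Set[str]) -> List[str]:
    """Return URIs of artifacts not reported before, oldest first."""
    new_uris = []
    hits = sorted(portal.contributions_since(web_identity, last_checked),
                  key=lambda hit: hit[1])
    for uri, _created in hits:
        if uri not in seen:  # guard against reporting the same URI twice
            seen.add(uri)
            new_uris.append(uri)
    return new_uris


portal = InMemoryPortal({
    "jdoe": [
        ("https://portal.example.org/jdoe/old-deck", datetime(2018, 7, 1)),
        ("https://portal.example.org/jdoe/new-deck", datetime(2018, 8, 15)),
        ("https://portal.example.org/jdoe/new-repo", datetime(2018, 8, 3)),
    ],
})
seen_uris: Set[str] = set()
first_run = track(portal, "jdoe", datetime(2018, 8, 1), seen_uris)
second_run = track(portal, "jdoe", datetime(2018, 8, 1), seen_uris)
```

The returned URIs are what the capture step takes as input. Keeping a `seen` set is one simple way to avoid capturing the same artifact twice when polling windows overlap.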
Tracking Artifacts - Challenges
• Portal API access by web identity
• Broadly supported by general purpose portals
• Typically not supported by scholarly portals
• Some lack an API altogether
• Should add ORCID access to APIs
• OAI-PMH and ResourceSync need sets per web identity
• Professional versus personal contributions
Capturing Artifacts
Capturing Artifacts - Description
• The capture process takes as input the URI of a new artifact
discovered in a portal
• Its task is to create a representative institutional capture of the
artifact
• Result of capture:
• WARC file for new artifact in an institutional archive
Capturing Artifacts - Challenges
• Create a high-fidelity capture using an approach that scales for a
steady stream of new artifacts
• Handle dynamic content & interactive features of web pages
• Determine the web boundary of the artifact
• More than the input artifact URI
• The boundary is in the eye of the beholder
• We made a significant breakthrough with the Memento Tracer
framework
• Others (cf. webrecorder.io Autopilot, IA Brozzler) are working on
the same problem
Memento Tracer: http://tracer.mementoweb.org
Autopilot: https://blog.webrecorder.io/2019/08/14/autopilot
Brozzler: https://github.com/internetarchive/brozzler
Memento Tracer - Framework
http://tracer.mementoweb.org
Archiving Artifacts
Archiving Artifacts - Description
• The archiving process takes as input the URI of a WARC file
generated by the capture process
• Its task is to ingest the WARC file into a cross-institutional web archive
• This can be achieved using off-the-shelf web archiving software,
e.g., pywb, OpenWayback
• Result of archiving:
o Mementos pertaining to the newly discovered artifact in a
cross-institutional, Memento-compliant web archive
Archiving Artifacts - Challenges
• Attempted to use ipwb, a pywb version that uses IPFS
• Cross-institutional distributed file system with redundancy
• Ran out of time to get it operationally stable
Sawood Alam, Mat Kelly, and Michael L. Nelson (2016) InterPlanetary Wayback: The Permanent Web Archive
https://doi.org/10.1145/2910896.2925467
myresearch.institute - Researchers
• Uniquely identified by ORCIDs
• Web identities in multiple portals
• Create various types of artifacts
myresearch.institute - Portals
• Tracking started August 27 2018
• Tracking artifacts created starting
August 1 2018
Scholarly Orphans – Pipeline
• 16,005 unique artifacts tracked, captured, and archived between
2018-08-01 and 2019-08-28
• 60MB event database
• 83GB of WARC files
• 3GB of web archive index
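A back-of-the-envelope reading of these figures is sketched below. It assumes both dates are inclusive and that GB means 10^9 bytes; neither is stated on the slide. The averages are rough indicators only.

```python
from datetime import date

artifacts = 16_005
warc_gb = 83   # total size of WARC files
index_gb = 3   # size of the web archive index

start, end = date(2018, 8, 1), date(2019, 8, 28)
days = (end - start).days + 1  # counting both end dates

artifacts_per_day = artifacts / days
mb_per_artifact = warc_gb * 1000 / artifacts  # decimal megabytes
index_share = index_gb / warc_gb              # index size relative to WARCs

print(f"{days} days, {artifacts_per_day:.1f} artifacts/day, "
      f"{mb_per_artifact:.2f} MB of WARC per artifact, "
      f"index is {index_share:.1%} of WARC volume")
```

This works out to roughly 40 artifacts per day and about 5 MB of WARC data per artifact for the prototype's set of researchers.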
Showtime: myresearch.institute Portal
https://myresearchinstitute.org
Success?
• “Interesting project! I’m happy to participate.”
“One more thing, is it possible to get a copy of the URI-Rs that
you guys detected so that I can feed them into an archive of my
choice?...”
• Prototype pipeline developed over 8 months (24 MM)
• Metrics of the prototype demonstrate that researchers generate
a lot of artifacts (that their institutions are typically not aware of)
• Metrics of the prototype suggest it should be possible to run a
production pipeline at the scale of an academic institution
• But would they …?
Some Final Thoughts
• For a number of reasons, applications that leverage network-level
information at scale (e.g. EgoSystem, myresearch.institute,
Autoload) tend not to be perfect. But they are automatic.
• Do institutions reserve sufficient resources for innovation and
failure? The alternative seems to be outsourcing and loss of
expertise.
• Ideas/visions are rarely fully realized when working on them. But
many times, the work does improve on the status quo. So keep
dreaming and working!

Collecting the organizational scholarly record

  • 1.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro Herbert Van de Sompel DANS @hvdsomp https://orcid.org/0000-0002-0715-6126 Collecting the Organizational Scholarly Record
  • 2.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro James Powell, Harihar Shankar, Marko Rodriguez, and Herbert Van de Sompel (2014) EgoSystem: Where are our Alumni? code{4}lib journal, issue 24. https://journal.code4lib.org/articles/9519 2013 - EgoSystem
  • 3.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro EgoSystem Team • Los Alamos National Laboratory: • James Powell • Harihar Shankar • Herbert Van de Sompel • Aurellius: • Marko Rodriguez
  • 4.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro Motivation • When postdocs leave LANL, the local information systems maintain very little information about them • But senior management is interested in engaging them after they leave LANL as Ambassadors and Advocates • They needs answers to questions like: • Who is currently working where? • Who is involved in what areas of research? • Who might serve as advocates for the Lab? • Who knows someone who knows someone we need to connect with?
  • 5.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro 2012 - Initial Approach: Set Up a VIVO Instance • 2700+ records were ingested from LANL Postdoc Office data to create initial user profiles • 8 postdoc alumni were contacted to complete their profile
  • 6.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro • Up-to-date information at all times is essential to meet the need of senior LANL management • Some existing VIVO instances seemed to have been pre- populated but then remained static after launch • Would current and former postdocs be interested in maintaining a professional profile on a VIVO instance intended to help out LANL? Doubts about the VIVO Instance
  • 7.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro • Leverage public, network-level information pertaining to LANL Alumni • Find their network presences - social portals, scientific portals, homepages, etc. • Recurrently collect information from those presences: current employer, social network neighborhood, geo location, etc. • Create applications based on that information • Rationale: People have incentives to keep network-layer information up-to-date • Goal: Devise a sustainable approach to gather and use up- to-date information pertaining to LANL Alumni 2013 - New Approach: Leverage Network-Level Information
  • 8.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro
  • 9.
    Available information elementsfor PostDocs: • Z# • Name • Institutions: o PhD University; LANL; Institution after LANL • Field of Study • Discipline
  • 10.
    Find network identities: •Various queries based on information elements in: o Yahoo Boss API; MS Academic Search API • Search for candidate identities: o LinkedIn; MS Academic; Twitter; Homepage; Blogger; SlideShare; WikiPedia • Rank and select candidate identities o Reward when: same identities from various searches; content matches information elements
  • 11.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro LinkedIn Identity
  • 12.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro LinkedIn Identity
  • 13.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro LinkedIn Identity
  • 14.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro Twitter Identity
  • 15.
    Network-derived information: • Identities: oLinkedIn; MS Academic; Twitter; Homepage; Blogger; SlideShare; WikiPedia • Additional information elements: o Current institution; geo location; updated discipline
  • 16.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro 0 200 400 600 800 1000 1200 1400 1600 1800 none one two three four five Web Identities Discovered Per Postdoc
  • 17.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro Resulting Identity Types per Postdoc 0 500 1000 1500 2000 2500 3000 3500 LANL MS Academic LinkedIn Twitter
  • 18.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro • Random set of 100 postdocs • MS Academic o 86 correct - 71 correctly discovered identities - 15 correctly labeled as not having identity o 14 incorrect - 2 discovered identities did not match the postdoc - 12 existing identities were not discovered • Algorithms favored precision over recall Evaluation of the Discovery Algorithm
  • 19.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro Network-derived information: • Network neighborhood: o Social network ~ Twitter: followers, followed o Academic network ~ co-authors MS Academic o Affiliations ~ LinkedIn, homepage • Artifacts: papers, slide decks • Concepts
  • 20.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro • Platonic vertices o Persons o Institutions o Artifacts o Concepts • Affiliation vertices o Different types o Different time periods • Graph extent, started with 3,005 postdocs: o Vertices: 9,015,844 o Edges: 19,399,683 Property Graph Representation of Resulting Information
  • 21.
    Property Graph Representationof Resulting Information
  • 22.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro Graph Database for Storage/Retrieval/Analysis Titan Distributed Graph Database http://titan.thinkaurelius.com/
  • 23.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro • Simple web query interface • Shareable profile page for individuals • Graph analytics (aggregate social networks, path analysis) and graph visualization • Who’s where (the LANL Director travels) search • Capability to add non-LANL person to the graph o To find closest path to the person via a LANL postdoc EgoSystem Application
  • 33.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro Success? • At the end of the demo meeting, the director said (paraphrasing) o “I didn’t know what I wanted when we first met but this looks like what I want, what I need.” • Project discontinued because of the inability to access LinkedIn data in legitimate manner • As a result of heuristic-based processes, the database, query results are not necessarily correct/complete. This made EgoSystem an approximating application. • Fantastic 2 month (~ 6 MM) project that did not yield a production system but in which we learned an awful lot
  • 34.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro James Powell, Martin Klein, and Herbert Van de Sompel (2017) Autoload: a pipeline for expanding the holdings of an Institutional Repository enabled by ResourceSync code{4}lib journal, issue 36. https://journal.code4lib.org/articles/12427 2016 - Autoload
  • 35.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro 2018 – myresearch.institute The Scholarly Orphans project is funded by the Andrew W. Mellon Foundation
  • 36.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro myresearch.institute Team • Los Alamos National Laboratory: • Lyudmila Balakireva • Martin Klein • James Powell • Harihar Shankar • Herbert Van de Sompel • Old Dominion University: • Sawood Alam • Grant Atkins • Shawn Jones • Mat Kelly • Michael L. Nelson
  • 37.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro • Consideration • Researchers are increasingly using a variety of web platforms for collaboration and communication • Why? • Many of these platforms have desirable characteristics • Versioning • Time stamping • Social embedding • Their institutions do not provide platforms that have global reach • Collaboration, cf. Github ~ productivity • Communication, cf. SlideShare ~ visibility Research and Research Communication on the Web
  • 38.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro • Consideration • Researchers are increasingly using a variety of web platforms for collaboration and communication • Web Platforms: • Dedicated to scholarship: • Commercial: e.g., FigShare, Publons • Not for profit: e.g., OSF, Zenodo • General purpose: • Commercial: e.g., GitHub, SlideShare • Not for profit: e.g., Wikipedia, Wikidata Research and Research Communication on the Web
  • 39.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro Emma Schymanski https://orcid.org/0000-0001-6868-8145 https://github.com/schymane https://www.slideshare.net/EmmaSchymanski https://figshare.com/authors/Emma_Schymanski/5087039 https://publons.com/author/1538491/emma-schymanski#profile https://www.eawag.ch/en/aboutus/portrait/organisation/staff/profile/emma-schymanski/
  • 40.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro Shawn Jones https://orcid.org/0000-0002-4372-870X http://www.shawnmjones.org/ https://github.com/shawnmjones https://www.slideshare.net/shawnmjones https://en.wikipedia.org/wiki/User:Shawnmjones https://www.blogger.com/profile/17827543974149663194
  • 41.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro • Consideration • Researchers deposit artifacts in web platforms • Status quo - The researchers’ institutions are in the dark • Do not know about the existence of these artifact • Do not have a copy of these artifacts Research and Research Communication on the Web
  • 42.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro • Consideration • Researchers deposit artifacts in web platforms • Status quo – Uncertainty regarding long-term access • Commercial: changing business model, no preservation commitment • Not for profit: unpredictable funding stream Research and Research Communication on the Web
  • 43.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro • Consideration • Researchers deposit artifacts in web platforms • Status quo - Not systematically archived • No frameworks like LOCKSS/Portico exist for these artifacts • Researchers only selectively deposit artifacts in portals that provide archival guarantees; to obtain a cite-able DOI • Can’t expect researchers to (also) upload all artifacts in IRs • Web archives only incidentally archive these artifacts, cf. anecdotal & Hiberlink project evidence Research and Research Communication on the Web Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE https://doi.org/10.1371/journal.pone.0115253
  • 44.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro Emma’s SlideShare Artifact: 0 Mementos https://www.slideshare.net/EmmaSchymanski/dmcm2018-community-resources-connecting-chemistry-and-toxicity-knowledge http://timetravel.mementoweb.org/
  • 45.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro Shawn’s GitHub Artifact: 1 Memento https://github.com/shawnmjones/mediawiki https://web.archive.org/web/*/https://github.com/shawnmjones/mediawiki
  • 46.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro Evidence from the Hiberlink Project Web resources referenced in Elsevier corpus (1996-2012) without representative Memento in public web archives
  • 47.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro The Scholarly Orphans Project: How to Archive these Artifacts? • Explores an institution-driven paradigm • Academic institutions typically have a long shelf life • A basic premise underlying e.g., LOCKSS, perma.cc • An academic institution should be interested in capturing the artifacts (intellectual property) its scholars deposit on the web • Collecting and archiving such artifacts aligns with the mission of academic libraries
  • 48.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro An Institutional Perspective
  • 49.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro The Scholarly Orphans Project: How to Archive these Artifacts? • Explores a paradigm inspired by web archiving • Scale of the problem • Can’t expect researchers to upload all artifacts in an institutional repository • Bilateral agreements for archival purposes with most web portals unlikely
  • 50.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro A Web Archiving Perspective
  • 51.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro myresearch.institute Prototype Pipeline
  • 52.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro Tracking Artifacts
  • 53.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro Tracking Artifacts - Description • In order to track artifacts that were recently deposited by an institutional researcher in a portal, one reasonably needs: • The web identity of the researcher in the portal • Algorithmic discovery, cf. EgoSystem • Discovery via a registry, cf. ORCID paper • Manual collection • A portal API that supports: • Access by web identity • Access to contributions “since …” for the web identity • Result of tracking: • URI(s) of new artifact(s) discovered in the portal Klein, M., and Van de Sompel, H. (2017) Discovering Scholarly Orphans Using ORCID. Proceedings of the 2017 ACM/IEEE Joint Conference on Digital Libraries https://arxiv.org/abs/1703.09343
  • 54.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro Tracking Artifacts - Challenges • Portal API access by web identity • Broadly supported by general purpose portals • Typically not supported by scholarly portals • Some lack an API altogether • Should add ORCID access to APIs • OAI-PMH and ResourceSync need sets per web identity • Professional versus personal contributions
  • 55.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro Capturing Artifacts
  • 56.
    @hvdsomp VIVO Conference 2019,September 5 2019, Podgorica, Montenegro Capturing Artifacts - Description • The capture process takes as input the URI of a new artifact discovered in a portal • Its task is to create a representative institutional capture of the artifact • Result of capture: • WARC file for new artifact in an institutional archive
  • 57.
    Capturing Artifacts - Challenges
    • Create a high-fidelity capture using an approach that scales for a steady stream of new artifacts
    • Handle dynamic content & interactive features of web pages
    • Determine the web boundary of the artifact
      • More than the input artifact URI
      • The boundary is in the eye of the beholder
    • We made a significant breakthrough with the Memento Tracer framework
    • Others (cf. webrecorder.io Autopilot, IA Brozzler) are working on the same problem
    Memento Tracer: http://tracer.mementoweb.org
    Autopilot: https://blog.webrecorder.io/2019/08/14/autopilot
    Brozzler: https://github.com/internetarchive/brozzler
  • 58.
    Capturing Artifacts
  • 59.
    Memento Tracer - Framework http://tracer.mementoweb.org
  • 60.
    Archiving Artifacts
  • 61.
    Archiving Artifacts - Description
    • The archiving process takes as input the URI of a WARC file generated by the capture process
    • Its task is to ingest the WARC file into a cross-institutional web archive
    • This can be achieved using off-the-shelf web archiving software, e.g., pywb, OpenWayback
    • Result of archiving:
      • Mementos pertaining to the newly discovered artifact in a cross-institutional, Memento-compliant web archive
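"Memento-compliant" means the archive supports datetime negotiation as defined in RFC 7089: a client asks a TimeGate for the archived version of an original URI closest to a given moment by sending an `Accept-Datetime` header. The sketch below only builds such a request and does not send it. The TimeGate prefix and the artifact URI are hypothetical, and real archives differ in how their TimeGate URLs are laid out.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Tuple


def timegate_request(
    timegate_prefix: str, original_uri: str, when: datetime
) -> Tuple[str, Dict[str, str]]:
    """Return the TimeGate URL and headers for Memento datetime negotiation."""
    # Accept-Datetime uses the HTTP date format, always expressed in GMT.
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return timegate_prefix + original_uri, {"Accept-Datetime": http_date}


url, headers = timegate_request(
    "https://archive.example/timegate/",  # hypothetical TimeGate prefix
    "https://portal.example/a/2",         # hypothetical artifact URI
    datetime(2019, 8, 28, 12, 0, tzinfo=timezone.utc),
)
```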
  • 62.
    Archiving Artifacts - Challenges
    • Attempted to use ipwb, a pywb version that uses IPFS
      • Cross-institutional distributed file system with redundancy
      • Ran out of time to get it operationally stable
    Sawood Alam, Mat Kelly, and Michael L. Nelson (2016) InterPlanetary Wayback: The Permanent Web Archive. https://doi.org/10.1145/2910896.2925467
  • 63.
    myresearch.institute - Researchers
    • Uniquely identified by ORCIDs
    • Web identities in multiple portals
    • Create various types of artifacts
  • 64.
    myresearch.institute - Portals
    • Tracking started August 27, 2018
    • Tracking artifacts created starting August 1, 2018
  • 65.
    Scholarly Orphans - Pipeline
    • 16,005 unique artifacts tracked, captured, and archived between 2018-08-01 and 2019-08-28
    • 60 MB event database
    • 83 GB of WARC files
    • 3 GB of web archive index
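A quick back-of-the-envelope calculation puts these figures in perspective: they work out to roughly 41 artifacts per day and about 5 MB of WARC data per artifact. The sketch below assumes 1 GB = 10^9 bytes and treats the reporting window as the span between the two dates given above.

```python
from datetime import date

artifacts = 16_005
warc_bytes = 83 * 10**9  # assumption: decimal gigabytes

# Length of the reporting window in days.
days = (date(2019, 8, 28) - date(2018, 8, 1)).days

artifacts_per_day = artifacts / days
avg_warc_mb = warc_bytes / artifacts / 10**6

print(f"{days} days, {artifacts_per_day:.1f} artifacts/day, {avg_warc_mb:.1f} MB/artifact")
```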
  • 66.
    Showtime: myresearch.institute Portal https://myresearchinstitute.org
  • 67.
    Success?
    • “Interesting project! I’m happy to participate.” “One more thing, is it possible to get a copy of the URI-Rs that you guys detected so that I can feed them into an archive of my choice?...”
    • Prototype pipeline developed over 8 months (24 MM)
    • Metrics of the prototype demonstrate that researchers generate a lot of artifacts (that their institutions are typically not aware of)
    • Metrics of the prototype suggest it should be possible to run a production pipeline at the scale of an academic institution
    • But would they …?
  • 68.
    Some Final Thoughts
    • For a number of reasons, applications that leverage network-level information at scale (e.g. EgoSystem, myresearch.institute, Autoload) tend not to be perfect. But they are automatic.
    • Do institutions reserve sufficient resources for innovation and failure? The alternative seems to be outsourcing and loss of expertise.
    • Ideas/visions are rarely fully realized when working on them. But many times, the work does improve on the status quo. So keep dreaming and working!
  • 69.
    Herbert Van de Sompel
    DANS
    @hvdsomp
    https://orcid.org/0000-0002-0715-6126
    Collecting the Organizational Scholarly Record

Editor's Notes

  • #47 ~100k articles with links > 230k links total
  • #60 New paradigm for web archiving, found as part of this problem. Unexpected, yet the most important result/contribution of this effort.
    Let's imagine you need to frequently archive slide decks from SlideShare (we do). Understand that there are boundary and quality problems.
    Bring a human (curator) into the loop: navigate to *one* SlideShare presentation and interact with it in an attempt to show what the boundary is, making explicit what needs to be archived.
    A browser extension listens to browser events, intercepts them, and records them in an abstract way (not in terms of URLs, addresses in the DOM, XPath, CSS selectors).
    Result: the trace expresses in an abstract way the interactions the curator had with the slide deck. Abstract because the same info on how to interact with *this* presentation will apply to *all* presentations.
    Record one, share, re-use with a headless browser. Share in a repo, collectively create and curate traces, update with layout of pages.