Slides used for a keynote presentation at the VIVO 2019 Conference in Podgorica, Montenegro.
Abstract: The invitation to present a keynote at the VIVO Conference and the goal of the VIVO platform, as stated on the DuraSpace site, to create an integrated record of the scholarly work of an organisation reminded me of various efforts that I have been involved in over the past years that had similar goals. EgoSystem (2014) attempted to gather information about postdocs that had left the organisation, leaving little or no contact details behind. Autoload (2017), an operational service, discovers papers by organisational researchers in order to upload them to the institutional repository. myresearch.institute (2018), an experiment that is still in progress, discovers artefacts that researchers deposit in web productivity portals and subsequently archives them. More recently, I have been involved in thinking about the future of NARCIS, a portal that provides an overview of research productivity in The Netherlands. The approaches taken in all these efforts share a characteristic motivated by a desire to devise scalable and sustainable solutions: let machines rather than humans do the work. In this talk, I will provide an overview of these efforts, their motivations, the challenges involved, and the nature of success (if any).
Collecting the organizational scholarly record
1. @hvdsomp
VIVO Conference 2019, September 5 2019, Podgorica, Montenegro
Herbert Van de Sompel
DANS
@hvdsomp
https://orcid.org/0000-0002-0715-6126
Collecting the Organizational Scholarly Record
2. @hvdsomp
James Powell, Harihar Shankar, Marko Rodriguez, and Herbert Van de Sompel (2014) EgoSystem: Where are our
Alumni? code{4}lib journal, issue 24. https://journal.code4lib.org/articles/9519
2013 - EgoSystem
3. @hvdsomp
EgoSystem Team
• Los Alamos National Laboratory:
• James Powell
• Harihar Shankar
• Herbert Van de Sompel
• Aurelius:
• Marko Rodriguez
4. @hvdsomp
Motivation
• When postdocs leave LANL, the local information systems
maintain very little information about them
• But senior management is interested in engaging them after they
leave LANL as Ambassadors and Advocates
• They need answers to questions like:
• Who is currently working where?
• Who is involved in what areas of research?
• Who might serve as advocates for the Lab?
• Who knows someone who knows someone we need to
connect with?
5. @hvdsomp
2012 - Initial Approach: Set Up a VIVO Instance
• 2700+ records were
ingested from LANL
Postdoc Office data to
create initial user profiles
• 8 postdoc alumni were
contacted to complete
their profile
6. @hvdsomp
• Up-to-date information at all times is essential to meet the needs of
senior LANL management
• Some existing VIVO instances seemed to have been pre-
populated but then remained static after launch
• Would current and former postdocs be interested in
maintaining a professional profile on a VIVO instance
intended to help out LANL?
Doubts about the VIVO Instance
7. @hvdsomp
• Leverage public, network-level information pertaining to LANL
Alumni
• Find their network presences - social portals, scientific
portals, homepages, etc.
• Recurrently collect information from those presences: current
employer, social network neighborhood, geo location, etc.
• Create applications based on that information
• Rationale: People have incentives to keep network-layer
information up-to-date
• Goal: Devise a sustainable approach to gather and use up-
to-date information pertaining to LANL Alumni
2013 - New Approach: Leverage Network-Level Information
9. Available information elements for PostDocs:
• Z#
• Name
• Institutions:
o PhD University; LANL; Institution after
LANL
• Field of Study
• Discipline
10. Find network identities:
• Various queries based on information
elements in:
o Yahoo BOSS API; MS Academic
Search API
• Search for candidate identities:
o LinkedIn; MS Academic; Twitter;
Homepage; Blogger; SlideShare;
Wikipedia
• Rank and select candidate identities
o Reward when: same identities from
various searches; content matches
information elements
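The ranking step can be sketched as follows. This is a minimal illustration, not the EgoSystem code: the `Candidate` type, the additive score, and the `min_score` threshold are assumptions made for the example. It rewards a candidate identity once for every distinct search that returned it and once for every known information element found in its content.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    url: str      # candidate identity, e.g. a profile URL
    source: str   # label of the search that returned it (hypothetical)
    text: str     # text found at the candidate identity


def rank_candidates(candidates, elements, min_score=2.0):
    """Return (url, score) pairs that reach min_score, best first."""
    sources = defaultdict(set)
    texts = defaultdict(list)
    for c in candidates:
        sources[c.url].add(c.source)
        texts[c.url].append(c.text.lower())

    scored = []
    for url in sources:
        # Reward: the same identity was returned by several searches.
        agreement = len(sources[url])
        # Reward: the content matches known information elements.
        page = " ".join(texts[url])
        matches = sum(1 for e in elements if e.lower() in page)
        score = float(agreement + matches)
        if score >= min_score:
            scored.append((url, score))
    # Highest score first; ties broken by URL for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

A high `min_score` drops weakly supported candidates, which trades recall for precision.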
15. Network-derived information:
• Identities:
o LinkedIn; MS Academic; Twitter;
Homepage; Blogger; SlideShare;
Wikipedia
• Additional information elements:
o Current institution; geo location;
updated discipline
16. @hvdsomp
[Bar chart: x-axis categories none, one, two, three, four, five; y-axis scale 0 to 1800]
Web Identities Discovered Per Postdoc
17. @hvdsomp
Resulting Identity Types per Postdoc
[Bar chart: x-axis categories LANL, MS Academic, LinkedIn, Twitter; y-axis scale 0 to 3500]
18. @hvdsomp
• Random set of 100 postdocs
• MS Academic
o 86 correct
- 71 correctly discovered identities
- 15 correctly labeled as not having identity
o 14 incorrect
- 2 discovered identities did not match the postdoc
- 12 existing identities were not discovered
• Algorithms favored precision over recall
Evaluation of the Discovery Algorithm
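The counts above can be turned into precision and recall. The mapping is an assumption made for this sketch: the 2 mismatched identities are treated as false positives and the 12 undiscovered identities as false negatives.

```python
# Counts from the MS Academic evaluation of 100 randomly chosen postdocs.
tp = 71  # correctly discovered identities
tn = 15  # correctly labeled as not having an identity
fp = 2   # discovered identities that did not match the postdoc (assumed false positives)
fn = 12  # existing identities that were not discovered (assumed false negatives)

precision = tp / (tp + fp)                  # 71 / 73
recall = tp / (tp + fn)                     # 71 / 83
accuracy = (tp + tn) / (tp + tn + fp + fn)  # 86 / 100

print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.2f}")
```

Under this reading, precision is about 0.97 and recall about 0.86, consistent with algorithms that favor precision over recall.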
19. @hvdsomp
Network-derived information:
• Network neighborhood:
o Social network ~ Twitter: followers,
followed
o Academic network ~ co-authors MS
Academic
o Affiliations ~ LinkedIn, homepage
• Artifacts: papers, slide decks
• Concepts
20. @hvdsomp
• Platonic vertices
o Persons
o Institutions
o Artifacts
o Concepts
• Affiliation vertices
o Different types
o Different time periods
• Graph extent, started with 3,005 postdocs:
o Vertices: 9,015,844
o Edges: 19,399,683
Property Graph Representation of Resulting Information
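A minimal in-memory sketch of such a property graph follows. The class, the edge labels, and the property names are hypothetical and chosen for illustration; EgoSystem itself stored its graph in a graph database. The point shown is that an affiliation is a vertex of its own, carrying a type and a time period, and sitting between a person and an institution.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    vertices: dict = field(default_factory=dict)  # vertex id -> properties
    edges: list = field(default_factory=list)     # (source id, label, target id)

    def add_vertex(self, vid, kind, **props):
        # "kind" separates persons, institutions, artifacts, concepts, affiliations.
        self.vertices[vid] = {"kind": kind, **props}

    def add_edge(self, src, label, dst):
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append((src, label, dst))

    def out(self, src, label):
        """Ids of vertices reachable from src over edges with the given label."""
        return [d for s, l, d in self.edges if s == src and l == label]
```

For example, a person vertex can point to an affiliation vertex with `type="postdoc"`, `start=2008`, `end=2010`, which in turn points to the institution vertex. A second affiliation vertex for the same person then records a later employer without overwriting the first.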
22. @hvdsomp
Graph Database for Storage/Retrieval/Analysis
Titan Distributed Graph Database
http://titan.thinkaurelius.com/
23. @hvdsomp
• Simple web query interface
• Shareable profile page for individuals
• Graph analytics (aggregate social networks, path analysis) and
graph visualization
• “Who’s where” search (for when the LANL Director travels)
• Capability to add non-LANL person to the graph
o To find closest path to the person via a LANL postdoc
EgoSystem Application
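Finding the closest path from a newly added person to any LANL postdoc can be sketched as a breadth-first search. This is an illustration only: the adjacency-dictionary input and the function name are assumptions, and the actual application ran its queries against the graph database.

```python
from collections import deque


def closest_path(adjacency, start, targets):
    """Shortest path (fewest hops) from start to the nearest vertex in targets.

    adjacency maps a vertex id to an iterable of neighbor ids.
    Returns the path as a list of vertex ids, or None if no target is reachable.
    """
    if start in targets:
        return [start]
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor in parent:
                continue  # already visited
            parent[neighbor] = node
            if neighbor in targets:
                # Walk back through the parents to rebuild the path.
                path = [neighbor]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None
```

Because breadth-first search explores vertices in order of distance, the first postdoc reached is one of the closest.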
33. @hvdsomp
Success?
• At the end of the demo meeting, the director said (paraphrasing)
o “I didn’t know what I wanted when we first met but this looks
like what I want, what I need.”
• Project discontinued because of the inability to access LinkedIn
data in a legitimate manner
• As a result of heuristic-based processes, the database and query
results are not necessarily correct or complete. This made
EgoSystem an approximating application.
• Fantastic 2-month (~ 6 MM) project that did not yield a production
system but in which we learned an awful lot
34. @hvdsomp
James Powell, Martin Klein, and Herbert Van de Sompel (2017) Autoload: a pipeline for expanding the holdings of
an Institutional Repository enabled by ResourceSync. code{4}lib journal, issue 36.
https://journal.code4lib.org/articles/12427
2016 - Autoload
35. @hvdsomp
2018 – myresearch.institute
The Scholarly Orphans project
is funded by the Andrew W. Mellon Foundation
36. @hvdsomp
myresearch.institute Team
• Los Alamos National Laboratory:
• Lyudmila Balakireva
• Martin Klein
• James Powell
• Harihar Shankar
• Herbert Van de Sompel
• Old Dominion University:
• Sawood Alam
• Grant Atkins
• Shawn Jones
• Mat Kelly
• Michael L. Nelson
37. @hvdsomp
• Consideration
• Researchers are increasingly using a variety of web platforms for
collaboration and communication
• Why?
• Many of these platforms have desirable characteristics
• Versioning
• Time stamping
• Social embedding
• Their institutions do not provide platforms that have global reach
• Collaboration, cf. Github ~ productivity
• Communication, cf. SlideShare ~ visibility
Research and Research Communication on the Web
38. @hvdsomp
• Consideration
• Researchers are increasingly using a variety of web platforms for
collaboration and communication
• Web Platforms:
• Dedicated to scholarship:
• Commercial: e.g., FigShare, Publons
• Not for profit: e.g., OSF, Zenodo
• General purpose:
• Commercial: e.g., GitHub, SlideShare
• Not for profit: e.g., Wikipedia, Wikidata
Research and Research Communication on the Web
39. @hvdsomp
Emma Schymanski
https://orcid.org/0000-0001-6868-8145
https://github.com/schymane
https://www.slideshare.net/EmmaSchymanski
https://figshare.com/authors/Emma_Schymanski/5087039
https://publons.com/author/1538491/emma-schymanski#profile
https://www.eawag.ch/en/aboutus/portrait/organisation/staff/profile/emma-schymanski/
40. @hvdsomp
Shawn Jones
https://orcid.org/0000-0002-4372-870X
http://www.shawnmjones.org/
https://github.com/shawnmjones
https://www.slideshare.net/shawnmjones
https://en.wikipedia.org/wiki/User:Shawnmjones
https://www.blogger.com/profile/17827543974149663194
41. @hvdsomp
• Consideration
• Researchers deposit artifacts in web platforms
• Status quo - The researchers’ institutions are in the dark
• Do not know about the existence of these artifacts
• Do not have a copy of these artifacts
Research and Research Communication on the Web
42. @hvdsomp
• Consideration
• Researchers deposit artifacts in web platforms
• Status quo – Uncertainty regarding long-term access
• Commercial: changing business model, no preservation commitment
• Not for profit: unpredictable funding stream
Research and Research Communication on the Web
43. @hvdsomp
• Consideration
• Researchers deposit artifacts in web platforms
• Status quo - Not systematically archived
• No frameworks like LOCKSS/Portico exist for these artifacts
• Researchers only selectively deposit artifacts in portals that
provide archival guarantees, e.g., to obtain a citable DOI
• Can’t expect researchers to (also) upload all artifacts in IRs
• Web archives only incidentally archive these artifacts, cf.
anecdotal & Hiberlink project evidence
Research and Research Communication on the Web
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
44. @hvdsomp
Emma’s SlideShare Artifact: 0 Mementos
https://www.slideshare.net/EmmaSchymanski/dmcm2018-community-resources-connecting-chemistry-and-toxicity-knowledge
http://timetravel.mementoweb.org/
45. @hvdsomp
Shawn’s GitHub Artifact: 1 Memento
https://github.com/shawnmjones/mediawiki
https://web.archive.org/web/*/https://github.com/shawnmjones/mediawiki
46. @hvdsomp
Evidence from the Hiberlink Project
Web resources referenced in Elsevier corpus (1996-2012)
without representative Memento in public web archives
47. @hvdsomp
The Scholarly Orphans Project: How to Archive these Artifacts?
• Explores an institution-driven paradigm
• Academic institutions typically have a long shelf life
• A basic premise underlying e.g., LOCKSS, perma.cc
• An academic institution should be interested in capturing the
artifacts (intellectual property) its scholars deposit on the web
• Collecting and archiving such artifacts aligns with the
mission of academic libraries
49. @hvdsomp
The Scholarly Orphans Project: How to Archive these Artifacts?
• Explores a paradigm inspired by web archiving
• Scale of the problem
• Can’t expect researchers to upload all artifacts in an institutional
repository
• Bilateral agreements for archival purposes with most web
portals unlikely
53. @hvdsomp
Tracking Artifacts - Description
• In order to track artifacts that were recently deposited by an
institutional researcher in a portal, one reasonably needs:
• The web identity of the researcher in the portal
• Algorithmic discovery, cf. EgoSystem
• Discovery via a registry, cf. ORCID paper
• Manual collection
• A portal API that supports:
• Access by web identity
• Access to contributions “since …” for the web identity
• Result of tracking:
• URI(s) of new artifact(s) discovered in the portal
Klein, M., and Van de Sompel, H. (2017) Discovering Scholarly Orphans Using ORCID. Proceedings of the 2017
ACM/IEEE Joint Conference on Digital Libraries https://arxiv.org/abs/1703.09343
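The tracking step described on this slide can be sketched as a loop over a researcher's web identities. Everything named here is hypothetical: `Contribution`, the `fetchers` mapping, and the `track` function stand in for per-portal API clients that support access by web identity and access to contributions "since" a given time.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable


@dataclass(frozen=True)
class Contribution:
    uri: str           # URI of the artifact in the portal
    created: datetime  # when it was deposited


# Hypothetical per-portal client: (web identity, since) -> contributions.
FetchSince = Callable[[str, datetime], Iterable[Contribution]]


def track(identities: dict[str, str],
          fetchers: dict[str, FetchSince],
          since: datetime,
          seen: set[str]) -> list[str]:
    """Return URIs of artifacts deposited since `since` and not reported before.

    identities maps a portal name to the researcher's web identity there.
    Portals without a usable fetcher (no API, or no access by identity) are skipped.
    """
    new_uris = []
    for portal, identity in identities.items():
        fetch = fetchers.get(portal)
        if fetch is None:
            continue
        for item in fetch(identity, since):
            # Guard against APIs that ignore the "since" parameter.
            if item.created >= since and item.uri not in seen:
                seen.add(item.uri)
                new_uris.append(item.uri)
    return new_uris
```

The `seen` set makes repeated runs idempotent, so the tracker can be scheduled recurrently and only hands new URIs to the capture step.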
54. @hvdsomp
Tracking Artifacts - Challenges
• Portal API access by web identity
• Broadly supported by general purpose portals
• Typically not supported by scholarly portals
• Some lack an API altogether
• Should add ORCID access to APIs
• OAI-PMH and ResourceSync need sets per web identity
• Professional versus personal contributions
56. @hvdsomp
Capturing Artifacts - Description
• The capture process takes as input the URI of a new artifact
discovered in a portal
• Its task is to create a representative institutional capture of the
artifact
• Result of capture:
• WARC file for new artifact in an institutional archive
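To make the output of the capture step concrete, the sketch below serializes one WARC 1.0 "resource" record by hand. It only illustrates the container format and is an assumption-laden simplification: a real capture would record complete HTTP exchanges for every resource within the artifact's web boundary, typically with a browser-based tool and a WARC library rather than this hypothetical helper.

```python
import uuid
from datetime import datetime, timezone


def warc_resource_record(target_uri: str, payload: bytes,
                         content_type: str = "text/html") -> bytes:
    """Serialize a single WARC/1.0 'resource' record for the given payload."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF; a blank line separates header and block;
    # the record is terminated by two CRLFs.
    header = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    return header + payload + b"\r\n\r\n"
```

Appending such records to a file yields a WARC file that the archiving step can ingest.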
57. @hvdsomp
Capturing Artifacts - Challenges
• Create a high-fidelity capture using an approach that scales for a
steady stream of new artifacts
• Handle dynamic content & interactive features of web pages
• Determine the web boundary of the artifact
• More than the input artifact URI
• The boundary is in the eye of the beholder
• We made a significant breakthrough with the Memento Tracer
framework
• Others (cf. webrecorder.io Autopilot, IA Brozzler) are working on
the same problem
Memento Tracer: http://tracer.mementoweb.org
Autopilot: https://blog.webrecorder.io/2019/08/14/autopilot
Brozzler: https://github.com/internetarchive/brozzler
61. @hvdsomp
Archiving Artifacts - Description
• The archiving process takes as input the URI of a WARC file
generated by the capture process
• Its task is to ingest the WARC file in a cross-institutional web archive
• This can be achieved using off-the-shelf web archiving software,
e.g., pywb, OpenWayback
• Result of archiving:
• Mementos pertaining to newly discovered artifact in a cross-
institutional, Memento-compliant web archive
62. @hvdsomp
Archiving Artifacts - Challenges
• Attempted to use ipwb, a pywb version that uses IPFS
• Cross-institutional distributed file system with redundancy
• Ran out of time to get it operationally stable
Sawood Alam, Mat Kelly, and Michael L. Nelson (2016) InterPlanetary Wayback: The Permanent Web Archive
https://doi.org/10.1145/2910896.2925467
63. @hvdsomp
myresearch.institute - Researchers
• Uniquely identified by ORCIDs
• Web identities in multiple portals
• Create various types of artifacts
64. @hvdsomp
myresearch.institute - Portals
• Tracking started August 27 2018
• Tracking artifacts created starting
August 1 2018
65. @hvdsomp
Scholarly Orphans – Pipeline
• 16,005 unique artifacts tracked, captured, and archived between
2018-08-01 and 2019-08-28
• 60MB event database
• 83GB of WARC files
• 3GB of web archive index
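A back-of-the-envelope calculation on these figures gives a feel for the scale. The assumptions are that GB means 10^9 bytes and that the period is counted inclusively from 1 August 2018 to 28 August 2019.

```python
from datetime import date

artifacts = 16_005
warc_bytes = 83 * 10**9  # assumption: decimal gigabytes

start, end = date(2018, 8, 1), date(2019, 8, 28)
days = (end - start).days + 1  # inclusive of both end dates

artifacts_per_day = artifacts / days
avg_capture_mb = warc_bytes / artifacts / 10**6

print(f"{days} days, {artifacts_per_day:.1f} artifacts/day, "
      f"{avg_capture_mb:.1f} MB per captured artifact")
```

This works out to roughly 40 artifacts per day and about 5 MB of WARC data per artifact for the researchers tracked by the prototype.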
66. @hvdsomp
Showtime: myresearch.institute Portal
https://myresearchinstitute.org
67. @hvdsomp
Success?
• “Interesting project! I’m happy to participate.”
“One more thing, is it possible to get a copy of the URI-Rs that
you guys detected so that I can feed them into an archive of my
choice?...”
• Prototype pipeline developed over 8 months (24 MM)
• Metrics of the prototype demonstrate that researchers generate
a lot of artifacts (that their institutions are typically not aware of)
• Metrics of the prototype suggest it should be possible to run a
production pipeline at the scale of an academic institution
• But would they …?
68. @hvdsomp
Some Final Thoughts
• For a number of reasons, applications that leverage network-level
information at scale (e.g. EgoSystem, myresearch.institute,
Autoload) tend not to be perfect. But they are automatic.
• Do institutions reserve sufficient resources for innovation and
failure? The alternative seems to be outsourcing and loss of
expertise.
• Ideas/visions are rarely fully realized when working on them. But
many times, the work does improve on the status quo. So keep
dreaming and working!
Editor's Notes
~100k articles with links
> 230k links total
New paradigm for web archiving, found as part of this problem
Unexpected, yet most important result/contribution of this effort
Let's imagine you need to frequently archive slide decks from SlideShare (we do)
Understand that there are boundary and quality problems
Bring human (curator) in the loop
Navigate to *one* SS presentation
Interact with that presentation in an attempt to show what the boundary is, make explicit what needs to be archived
Browser extension: listens to browser events, intercepts them, and records them in an abstract way (not in terms of URLs, addresses in the DOM, XPath, CSS selectors)
Result: the trace expresses in an abstract way the interactions the curator had with the slide deck
Abstract because the same information on how to interact with *this* presentation applies to *all* presentations
Record once, share, re-use with a headless browser
Share in repo, collectively create, curate traces, update with layout of pages