Research software is a key asset for understanding, reusing and reproducing results in computational sciences. An increasing amount of software is stored in code repositories, which usually contain human-readable instructions indicating how to set it up and use it. However, developers and researchers often need to spend a significant amount of time to understand how to invoke a software component, prepare data in the required format, and use it in combination with other software. In addition, this time investment makes it challenging to discover and compare software with similar functionality. In this talk I will describe our efforts to address these issues by creating and using Open Knowledge Graphs that describe research software in a machine-readable manner. Our work includes: 1) an ontology that extends schema.org and codemeta, designed to describe software and the specific data formats it uses; 2) an approach to publish software metadata as an open knowledge graph, linked to other Web of Data objects; 3) a framework for automatically extracting metadata from software repositories; and 4) a framework to curate, query, explore and compare research software metadata in a collaborative manner. The talk will illustrate our approach with real-world examples, including a domain application for inspecting and discovering hydrology, agriculture, and economic software models; and the results of our framework when enriching the research software entries in Zenodo.org.
Towards Knowledge Graphs of Reusable Research Software Metadata
1. Information Sciences Institute
TOWARDS KNOWLEDGE GRAPHS OF
REUSABLE RESEARCH SOFTWARE
METADATA
Daniel Garijo, Yolanda Gil, Maximiliano Osorio, Varun Ratnakar,
Deborah Khider, Hernan Vargas
Information Sciences Institute, University of Southern
California
@dgarijov
dgarijo@isi.edu
2. Information Sciences Institute
Is there a reproducibility crisis? [Nature, 2016]
Source: https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970
3. Information Sciences Institute
Reproducibility in Computational Sciences:
Open Research Data, Software and Methods
Scientific publication
Research Data Research Software Research Methods
4. Information Sciences Institute
Challenges for Finding, Understanding,
(Re)Using and Sharing Research Software
Software user
• What does the software component do? Which of its methods should I use?
• How to transform my data to use the software component?
• How to interpret the results produced by the software component?
• How to invoke the software component?
• How to configure the software component with the right parameters?
• How to compare software components with similar functionality?
Software designer
• How to ease capturing the dependencies and installation instructions of my software?
• How to encapsulate my software so it can be used with other data?
• How to describe my software so it can be used by others?
• How to test if my software is ready to be used by others?
• How can my component be found by others?
5. Information Sciences Institute
How are we addressing these challenges?
1. Describe Research Software in a machine-readable manner
2. Link and connect Research Software in Knowledge Graphs
3. Build applications that help find, understand and reuse Research
Software using those Knowledge Graphs
7. Information Sciences Institute
Representing Software Metadata: OntoSoft
Crowdsourced Software Metadata Registry
• Complements code repositories to
make them understandable
• Software metadata designed for
scientists
• Metadata is curated by decentralized
communities of users
• Training scientists on best practices
http://ontosoft.org
Finding Software
OntoSoft: Capturing scientific software metadata. Gil, Y.; Ratnakar, V.; and Garijo, D. In Proceedings of
the 8th International Conference on Knowledge Capture, pages 32, 2015. ACM
8. Information Sciences Institute
Adding Structure to Software Metadata: OKG-Soft
Explore input/output variables
Explore Software I/O files
Knowledge Graph with machine-readable
Software Metadata:
• (From OntoSoft) Attribution, license, funding,
usage examples...
• Executable software components
• Software invocation
• Input & output files, variables and units
• Containers used to encapsulate and run
software components
[Garijo et al 2019]: OKG-Soft: An Open Knowledge Graph with Machine Readable Scientific Software Metadata. International
Conference on eScience, San Diego, USA. 2019
9. Information Sciences Institute
Evolving OntoSoft: Software Description Ontology
https://w3id.org/okn/o/sd#
Extensions:
• Schema.org/Codemeta (software metadata)
• W3C Data Cubes (Contents of inputs and outputs)
• NASA QUDT (Units)
• DockerPedia (Software images)
• Scientific Variables Ontology (Standard Variables)
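As a rough illustration of what a machine-readable software description can look like, the sketch below builds a JSON-LD record using only schema.org terms (`SoftwareSourceCode`, `codeRepository`, `programmingLanguage`, `citation`). It does not use the Software Description Ontology's own classes or properties, which are not listed on this slide. The `describe_software` helper and the repository URL are assumptions made for the example.

```python
import json

SCHEMA_CONTEXT = "https://schema.org"


def describe_software(name, description, repository, language, citation=None):
    """Return a JSON-LD dict describing a code repository with schema.org terms.

    Hypothetical helper for illustration; not part of any library in the talk.
    """
    record = {
        "@context": SCHEMA_CONTEXT,
        "@type": "SoftwareSourceCode",
        "name": name,
        "description": description,
        "codeRepository": repository,
        "programmingLanguage": language,
    }
    # Only emit the citation when one is known, to avoid null values in the graph.
    if citation is not None:
        record["citation"] = citation
    return record


record = describe_software(
    name="pyGeoPressure",
    description="A Python package for pore pressure prediction",
    repository="https://github.com/whimian/pyGeoPressure",  # assumed URL
    language="Python",
    citation="https://doi.org/10.21105/joss.00992",
)
print(json.dumps(record, indent=2))
```

A record of this shape could later be extended with terms from the other vocabularies listed above (units, variables, software images) once their exact property names are looked up.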
10. Information Sciences Institute
1. Describing Research Software Metadata
2. Creating Knowledge Graphs with
Research Software Metadata
• Automatically
11. Information Sciences Institute
Automated Software Metadata Annotation
[Mao et al 2019]: SoMEF: A Framework for Capturing Software Metadata from its Documentation. 2019 IEEE BigData REU Symposium. Los
Angeles, 2019
Software repository: whimian/pyGeoPressure
SoMEF (Software Metadata Extraction Framework)
Extracted metadata:
• Description: A Python package for pore pressure prediction...
• Installation: pip install pygeopressure
• Invocation: import pygeopressure as ppp
• Citation: Yu, (2018). PyGeoPressure: Geopressure Prediction in Python. Journal of Open Source Software, 3(30), 992, https://doi.org/10.21105/joss.00992
Metadata fields (17 metadata categories): description, installation instructions, invocation, citation, usage notes, requirements, contact, contributors, FAQ, support, license, keywords...
https://somef.readthedocs.io/en/latest/
https://github.com/KnowledgeCaptureAndDiscovery/somef
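The idea of pulling metadata categories out of a repository's documentation can be sketched with a toy header matcher. This is not SoMEF's implementation or API: the keyword map, the `extract_sections` function, and the sample README (assembled from the values shown on this slide) are all assumptions made for illustration.

```python
import re

# Hypothetical keyword-to-category map. The category names come from the slide;
# the matching rule itself is an assumption made for this sketch.
HEADER_KEYWORDS = {
    "install": "installation",
    "usage": "invocation",
    "cite": "citation",
    "citation": "citation",
    "license": "license",
    "requirement": "requirements",
}

# Matches Markdown headers such as "## Installation".
HEADER_RE = re.compile(r"^#{1,6}\s+(.*)$")


def extract_sections(readme_text):
    """Map metadata categories to the text found under matching Markdown headers."""
    found = {}
    current = None  # category of the section currently being read, if any
    for line in readme_text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            title = match.group(1).lower()
            current = next(
                (cat for key, cat in HEADER_KEYWORDS.items() if key in title),
                None,
            )
            continue
        if current is not None and line.strip():
            found.setdefault(current, []).append(line.strip())
    return {cat: "\n".join(lines) for cat, lines in found.items()}


# Hypothetical README content for the example.
readme = """# pyGeoPressure
A Python package for pore pressure prediction

## Installation
pip install pygeopressure

## Usage
import pygeopressure as ppp
"""

metadata = extract_sections(readme)
print(metadata)
```

Header matching alone misses metadata that appears in free text or under unusual headings, which is one reason a real extraction framework would need more than this rule.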
12. Information Sciences Institute
SOSEN-KG: integrating Zenodo and GitHub
https://github.com/KnowledgeCaptureAndDiscovery/sosen
Prototype with > 13K entries of research software metadata
• Integrating metadata from Zenodo and GitHub (versions, authors, etc.)
• Expanding it with Wikidata (future work)
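One way to picture the integration step is a merge of records from two sources keyed by a normalized repository URL. The sketch below is an assumption about how such a merge could work, not a description of the SOSEN code; the field names (`repository`, `version`, `authors`) and the sample records are hypothetical.

```python
def normalize_repo_url(url):
    """Lowercase a repository URL and drop a trailing slash or '.git' suffix."""
    url = url.strip().lower().rstrip("/")
    if url.endswith(".git"):
        url = url[:-4]
    return url


def merge_records(zenodo_records, github_records):
    """Merge records from both sources into one entry per repository.

    When both sources provide the same field, the first value seen
    (Zenodo in this sketch) is kept.
    """
    merged = {}
    for source, records in (("zenodo", zenodo_records), ("github", github_records)):
        for record in records:
            key = normalize_repo_url(record["repository"])
            entry = merged.setdefault(key, {"repository": key, "sources": []})
            entry["sources"].append(source)
            for field, value in record.items():
                if field != "repository":
                    entry.setdefault(field, value)
    return merged


# Hypothetical sample records, invented for the example.
zenodo = [{"repository": "https://github.com/example/tool/", "version": "1.0"}]
github = [{"repository": "https://github.com/Example/tool.git", "authors": ["A. Author"]}]

merged = merge_records(zenodo, github)
print(merged)
```

Keeping the list of contributing sources on each entry makes it possible to trace where a given field came from when the two sources disagree.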
13. Information Sciences Institute
1. Describing Research Software Metadata
2. Creating Knowledge Graphs with
Research Software Metadata
• Automatically
• Crowdsourcing
14. Information Sciences Institute
OKG-SOFT
Software Model Catalog contains:
• Models from hydrology, agriculture and economy, their versions and model
configurations.
• More than 200 variables mapped to SVO.
• All models are executable through scientific workflows
• Most contents are added manually and collaboratively by expert users
• Automated unit transformations
• Automated software image description
• Semi-automated Wikidata linking
OKG-Soft: An Open Knowledge Graph with Machine Readable Scientific Software Metadata. Garijo, D.; Osorio, M.; Khider, D.; Ratnakar, V.;
and Gil, Y. In 2019 15th International Conference on eScience (eScience), pages 349–358, San Diego, CA, USA, September 2019. IEEE
15. Information Sciences Institute
1. Describing Research Software Metadata
2. Creating Knowledge Graphs with
Research Software Metadata
• Automatically
• Crowdsourcing
3. Using KGs to Find, Understand and Reuse
Research Software
17. Information Sciences Institute
OKG-SOFT Framework: Exploring Research
Software Model Metadata
Explore variables of inputs and outputs
Explore software I/O
Find, compare and configure
software models
http://models.mint.isi.edu
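To make the "find and compare" use case concrete, the sketch below stores software metadata as subject-predicate-object triples in memory and groups models by the output variable they produce. The model identifiers, predicate names and variables are invented for the example and do not reflect the actual contents or vocabulary of the model catalog.

```python
from collections import defaultdict

# Hypothetical triples: identifiers and predicate names are invented.
triples = [
    ("model:A", "hasOutputVariable", "var:streamflow"),
    ("model:B", "hasOutputVariable", "var:streamflow"),
    ("model:B", "hasInputVariable", "var:precipitation"),
    ("model:C", "hasOutputVariable", "var:crop_yield"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def models_by_output(triples):
    """Group model identifiers by the output variable they produce."""
    groups = defaultdict(set)
    for subject, _, variable in match(triples, p="hasOutputVariable"):
        groups[variable].add(subject)
    return groups


# Models that share an output variable are candidates for comparison.
comparable = {
    variable: sorted(models)
    for variable, models in models_by_output(triples).items()
    if len(models) > 1
}
print(comparable)
```

In a published knowledge graph the same question would typically be asked through a query language over the graph rather than a Python list, but the pattern-matching idea is the same.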
18. Information Sciences Institute
Research Software Reuse:
Encapsulating & Testing
[Diagram: Assistants + Guidelines; Machine-readable component specification; Tests; Portable Component; Software Metadata Registry (OKG-SOFT)]
https://mic-cli.readthedocs.io/en/latest/
https://dame-cli.readthedocs.io/en/latest/
20. Information Sciences Institute
Overcoming the reproducibility crisis (partly)
• Research software is a critical asset for reproducible
computational experiments
• We need to improve the findability, (re)usability and
understanding of research software:
– Wider adoption
– Better comparison of similar computational methods
– Better understanding of data products
• In this presentation we covered:
– How to describe research software and its metadata
• OntoSoft, Software Description Ontology
– How to build Knowledge Graphs with research software metadata
• OntoSoft, OKG-Soft, SOSEN-KG
– How we are using KGs to help find, compare, understand and reuse
research software
21. Information Sciences Institute
Knowledge Capture and Discovery Group
Yolanda Gil
Varun Ratnakar
Daniel Garijo
Deborah Khider
Maximiliano Osorio
Hernan Vargas
https://knowledgecaptureanddiscovery.github.io/
Editor's Notes
The survey identifies three measures that should be taken: better statistics (cherry-picking), mentoring, and more robust design.
In our field (computational sciences), the problem narrows down to reusing data, software and methods.
There are other aspects, such as hypotheses and experimental design, but these are the core of reproducibility.