SlideShare a Scribd company logo
SOMEF:
A Metadata Extraction Framework
from Software Documentation
Daniel Garijo
@dgarijov
Ontology Engineering Group,
Universidad Politécnica de Madrid
The Importance of Software Metadata
- An increasing amount of software is available online
- Metadata is critical for:
- Findability
- Reusability
- Understanding
> 100 M repositories
> 28 M repositories
Current Pipeline for Software Metadata Curation
Icons from https://icons8.com/
Develop software &
documentation
Create entry
in registry
Curation and
Validation
SOMEF: Automating Software Metadata Extraction
SOMEF
● Readme Analysis
○ Supervised classification
○ Regular expressions
○ Header analysis
● File exploration
○ Notebooks
○ Dockerfiles
○ Documentation
● GitHub API
Repository Extraction Results (Metadata)
https://github.com/KnowledgeCaptureAndDiscovery/somef/
Recognizing Metadata Categories
● Name (GA)
● Full title (RE)
● Description (SC, HA)
● Citation (SC, RE, HA)
● Installation instructions (SC, HA)
● Invocation (SC)
● Usage examples (HA)
● Documentation (HA, FE)
● Requirements (HA)
● Contributors (HA)
● FAQ (HA)
● Support (HA)
● License (GA, HA)
● Stars (GA)
● Contact (HA)
● Download URL (HA, GA)
● DOI (RE)
● DockerFile (FE)
● Notebooks (FE)
● Executable notebooks (Binder) (RE)
● Owner: (RE)
● Keywords (GA)
● Source code (GA)
● Releases (GA)
● Changelog (GA)
● Issue tracker (GA)
● Programming languages (GA)
● Acknowledgements (HA)
Method used:
● Supervised Classification (SC)
● Header Analysis and Synset comparison (HA)
● File Exploration (FE)
● Regular Expressions (RE)
● GitHub API (GA)
Example Output
Beginning to capture software in context
SOMEF helps automatically extract software metadata. Benefits:
● Let authors know what metadata is missing
● Help making software findable
● Capture context of software (setup, configuration, examples)
But there is much more to explore:
● Better comparison
● Input/output preparation
● Using different software together
Problems? Suggestions? New features?
https://github.com/KnowledgeCaptureAndDiscovery/somef/issues

More Related Content

What's hot

OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
dgarijo
 
Coming to terms to FAIR semantics
Coming to terms to FAIR semanticsComing to terms to FAIR semantics
Coming to terms to FAIR semantics
María Poveda Villalón
 
FAIRer Research
FAIRer ResearchFAIRer Research
FAIRer Research
Carole Goble
 
OKE2018 Challenge @ ESWC2018
OKE2018 Challenge @ ESWC2018OKE2018 Challenge @ ESWC2018
OKE2018 Challenge @ ESWC2018
Holistic Benchmarking of Big Linked Data
 
It Takes a Village to Grow ORCIDs on Campus: Establishing and Integrating Uni...
It Takes a Village to Grow ORCIDs on Campus: Establishing and Integrating Uni...It Takes a Village to Grow ORCIDs on Campus: Establishing and Integrating Uni...
It Takes a Village to Grow ORCIDs on Campus: Establishing and Integrating Uni...
Violeta Ilik
 
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
Carole Goble
 
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Anita de Waard
 
Research Shared: researchobject.org
Research Shared: researchobject.orgResearch Shared: researchobject.org
Research Shared: researchobject.org
Norman Morrison
 
FAIRness through a novel combination of Web technologies
FAIRness through a novel combination of Web technologiesFAIRness through a novel combination of Web technologies
FAIRness through a novel combination of Web technologies
Research Data Alliance
 
Make your data great again - Ver 2
Make your data great again - Ver 2Make your data great again - Ver 2
Make your data great again - Ver 2
Daniel JACOB
 
Mining Sociotechnical Information From Software Repositories
Mining Sociotechnical Information From Software RepositoriesMining Sociotechnical Information From Software Repositories
Mining Sociotechnical Information From Software Repositories
Marco Aurelio Gerosa
 
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
Stuart Chalk
 
Capturing the context: one small(ish step for modellers, one giant leap for m...
Capturing the context: one small(ish step for modellers, one giant leap for m...Capturing the context: one small(ish step for modellers, one giant leap for m...
Capturing the context: one small(ish step for modellers, one giant leap for m...
FAIRDOM
 
A Generic Scientific Data Model and Ontology for Representation of Chemical Data
A Generic Scientific Data Model and Ontology for Representation of Chemical DataA Generic Scientific Data Model and Ontology for Representation of Chemical Data
A Generic Scientific Data Model and Ontology for Representation of Chemical Data
Stuart Chalk
 
Scientific Units in the Electronic Age
Scientific Units in the Electronic AgeScientific Units in the Electronic Age
Scientific Units in the Electronic Age
Stuart Chalk
 
Working with RDF in Jupyter Notebooks: some lessons in getting rid of Excel f...
Working with RDF in Jupyter Notebooks: some lessons in getting rid of Excel f...Working with RDF in Jupyter Notebooks: some lessons in getting rid of Excel f...
Working with RDF in Jupyter Notebooks: some lessons in getting rid of Excel f...
EugeneMorozov
 
The Case for Stable VIVO URIs
The Case for Stable VIVO URIsThe Case for Stable VIVO URIs
The Case for Stable VIVO URIs
Violeta Ilik
 
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Stuart Chalk
 
A guided tour of Araport
A guided tour of AraportA guided tour of Araport
A guided tour of Araport
Araport
 
2016 Summer - Araport Project Overview Leaflet
2016 Summer - Araport Project Overview Leaflet2016 Summer - Araport Project Overview Leaflet
2016 Summer - Araport Project Overview Leaflet
Araport
 

What's hot (20)

OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
 
Coming to terms to FAIR semantics
Coming to terms to FAIR semanticsComing to terms to FAIR semantics
Coming to terms to FAIR semantics
 
FAIRer Research
FAIRer ResearchFAIRer Research
FAIRer Research
 
OKE2018 Challenge @ ESWC2018
OKE2018 Challenge @ ESWC2018OKE2018 Challenge @ ESWC2018
OKE2018 Challenge @ ESWC2018
 
It Takes a Village to Grow ORCIDs on Campus: Establishing and Integrating Uni...
It Takes a Village to Grow ORCIDs on Campus: Establishing and Integrating Uni...It Takes a Village to Grow ORCIDs on Campus: Establishing and Integrating Uni...
It Takes a Village to Grow ORCIDs on Campus: Establishing and Integrating Uni...
 
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks ...
 
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
Research Object Composer: A Tool for Publishing Complex Data Objects in the C...
 
Research Shared: researchobject.org
Research Shared: researchobject.orgResearch Shared: researchobject.org
Research Shared: researchobject.org
 
FAIRness through a novel combination of Web technologies
FAIRness through a novel combination of Web technologiesFAIRness through a novel combination of Web technologies
FAIRness through a novel combination of Web technologies
 
Make your data great again - Ver 2
Make your data great again - Ver 2Make your data great again - Ver 2
Make your data great again - Ver 2
 
Mining Sociotechnical Information From Software Repositories
Mining Sociotechnical Information From Software RepositoriesMining Sociotechnical Information From Software Repositories
Mining Sociotechnical Information From Software Repositories
 
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
 
Capturing the context: one small(ish step for modellers, one giant leap for m...
Capturing the context: one small(ish step for modellers, one giant leap for m...Capturing the context: one small(ish step for modellers, one giant leap for m...
Capturing the context: one small(ish step for modellers, one giant leap for m...
 
A Generic Scientific Data Model and Ontology for Representation of Chemical Data
A Generic Scientific Data Model and Ontology for Representation of Chemical DataA Generic Scientific Data Model and Ontology for Representation of Chemical Data
A Generic Scientific Data Model and Ontology for Representation of Chemical Data
 
Scientific Units in the Electronic Age
Scientific Units in the Electronic AgeScientific Units in the Electronic Age
Scientific Units in the Electronic Age
 
Working with RDF in Jupyter Notebooks: some lessons in getting rid of Excel f...
Working with RDF in Jupyter Notebooks: some lessons in getting rid of Excel f...Working with RDF in Jupyter Notebooks: some lessons in getting rid of Excel f...
Working with RDF in Jupyter Notebooks: some lessons in getting rid of Excel f...
 
The Case for Stable VIVO URIs
The Case for Stable VIVO URIsThe Case for Stable VIVO URIs
The Case for Stable VIVO URIs
 
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
Toward Semantic Representation of Science in Electronic Laboratory Notebooks ...
 
A guided tour of Araport
A guided tour of AraportA guided tour of Araport
A guided tour of Araport
 
2016 Summer - Araport Project Overview Leaflet
2016 Summer - Araport Project Overview Leaflet2016 Summer - Araport Project Overview Leaflet
2016 Summer - Araport Project Overview Leaflet
 

Similar to SOMEF: a metadata extraction framework from software documentation

Organizing the Data Chaos of Scientists
Organizing the Data Chaos of ScientistsOrganizing the Data Chaos of Scientists
Organizing the Data Chaos of Scientists
Andreas Schreiber
 
Primers or Reminders? The Effects of Existing Review Comments on Code Review
Primers or Reminders? The Effects of Existing Review Comments on Code ReviewPrimers or Reminders? The Effects of Existing Review Comments on Code Review
Primers or Reminders? The Effects of Existing Review Comments on Code Review
Delft University of Technology
 
DataFinder: A Python Application for Scientific Data Management
DataFinder: A Python Application for Scientific Data ManagementDataFinder: A Python Application for Scientific Data Management
DataFinder: A Python Application for Scientific Data Management
Andreas Schreiber
 
Let's build Developer Portal with Backstage
Let's build Developer Portal with BackstageLet's build Developer Portal with Backstage
Let's build Developer Portal with Backstage
Opsta
 
Diagnose Your Microservices
Diagnose Your MicroservicesDiagnose Your Microservices
Diagnose Your Microservices
Marcus Hirt
 
ExSchema - ICSM'13
ExSchema - ICSM'13ExSchema - ICSM'13
ExSchema - ICSM'13
jccastrejon
 
DICE & Cloudify – Quality Big Data Made Easy
DICE & Cloudify – Quality Big Data Made EasyDICE & Cloudify – Quality Big Data Made Easy
DICE & Cloudify – Quality Big Data Made Easy
Cloudify Community
 
Industrializing Machine learning pipelines
Industrializing Machine learning pipelinesIndustrializing Machine learning pipelines
Industrializing Machine learning pipelines
Germain Tanguy
 
Tutorial Workgroup - Model versioning and collaboration
Tutorial Workgroup - Model versioning and collaborationTutorial Workgroup - Model versioning and collaboration
Tutorial Workgroup - Model versioning and collaboration
PascalDesmarets1
 
Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1
Henry S
 
stackconf 2021 | Embracing change: Policy-as-code for Kubernetes with OPA and...
stackconf 2021 | Embracing change: Policy-as-code for Kubernetes with OPA and...stackconf 2021 | Embracing change: Policy-as-code for Kubernetes with OPA and...
stackconf 2021 | Embracing change: Policy-as-code for Kubernetes with OPA and...
NETWAYS
 
Django, What is it, Why is it cool?
Django, What is it, Why is it cool?Django, What is it, Why is it cool?
Django, What is it, Why is it cool?
Tom Brander
 
Enterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshEnterprise guide to building a Data Mesh
Enterprise guide to building a Data Mesh
Sion Smith
 
Whats new in_mlflow
Whats new in_mlflowWhats new in_mlflow
Whats new in_mlflow
Databricks
 
Managing and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonManaging and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in Python
Simon Frid
 
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
NETWAYS
 
Automatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPCAutomatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPC
Facultad de Informática UCM
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
Makoto Yui
 
Codeigniter
CodeigniterCodeigniter
Codeigniter
minhrau111
 
Content Strategy and Developer Engagement for DevPortals
Content Strategy and Developer Engagement for DevPortalsContent Strategy and Developer Engagement for DevPortals
Content Strategy and Developer Engagement for DevPortals
Axway
 

Similar to SOMEF: a metadata extraction framework from software documentation (20)

Organizing the Data Chaos of Scientists
Organizing the Data Chaos of ScientistsOrganizing the Data Chaos of Scientists
Organizing the Data Chaos of Scientists
 
Primers or Reminders? The Effects of Existing Review Comments on Code Review
Primers or Reminders? The Effects of Existing Review Comments on Code ReviewPrimers or Reminders? The Effects of Existing Review Comments on Code Review
Primers or Reminders? The Effects of Existing Review Comments on Code Review
 
DataFinder: A Python Application for Scientific Data Management
DataFinder: A Python Application for Scientific Data ManagementDataFinder: A Python Application for Scientific Data Management
DataFinder: A Python Application for Scientific Data Management
 
Let's build Developer Portal with Backstage
Let's build Developer Portal with BackstageLet's build Developer Portal with Backstage
Let's build Developer Portal with Backstage
 
Diagnose Your Microservices
Diagnose Your MicroservicesDiagnose Your Microservices
Diagnose Your Microservices
 
ExSchema - ICSM'13
ExSchema - ICSM'13ExSchema - ICSM'13
ExSchema - ICSM'13
 
DICE & Cloudify – Quality Big Data Made Easy
DICE & Cloudify – Quality Big Data Made EasyDICE & Cloudify – Quality Big Data Made Easy
DICE & Cloudify – Quality Big Data Made Easy
 
Industrializing Machine learning pipelines
Industrializing Machine learning pipelinesIndustrializing Machine learning pipelines
Industrializing Machine learning pipelines
 
Tutorial Workgroup - Model versioning and collaboration
Tutorial Workgroup - Model versioning and collaborationTutorial Workgroup - Model versioning and collaboration
Tutorial Workgroup - Model versioning and collaboration
 
Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1
 
stackconf 2021 | Embracing change: Policy-as-code for Kubernetes with OPA and...
stackconf 2021 | Embracing change: Policy-as-code for Kubernetes with OPA and...stackconf 2021 | Embracing change: Policy-as-code for Kubernetes with OPA and...
stackconf 2021 | Embracing change: Policy-as-code for Kubernetes with OPA and...
 
Django, What is it, Why is it cool?
Django, What is it, Why is it cool?Django, What is it, Why is it cool?
Django, What is it, Why is it cool?
 
Enterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshEnterprise guide to building a Data Mesh
Enterprise guide to building a Data Mesh
 
Whats new in_mlflow
Whats new in_mlflowWhats new in_mlflow
Whats new in_mlflow
 
Managing and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonManaging and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in Python
 
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
OSMC 2023 | What’s new with Grafana Labs’s Open Source Observability stack by...
 
Automatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPCAutomatic generation of hardware memory architectures for HPC
Automatic generation of hardware memory architectures for HPC
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
 
Codeigniter
CodeigniterCodeigniter
Codeigniter
 
Content Strategy and Developer Engagement for DevPortals
Content Strategy and Developer Engagement for DevPortalsContent Strategy and Developer Engagement for DevPortals
Content Strategy and Developer Engagement for DevPortals
 

More from dgarijo

WDPlus: Leveraging Wikidata to Link and Extend Tabular Data
WDPlus: Leveraging Wikidata to Link and Extend Tabular DataWDPlus: Leveraging Wikidata to Link and Extend Tabular Data
WDPlus: Leveraging Wikidata to Link and Extend Tabular Data
dgarijo
 
Towards Human-Guided Machine Learning - IUI 2019
Towards Human-Guided Machine Learning - IUI 2019Towards Human-Guided Machine Learning - IUI 2019
Towards Human-Guided Machine Learning - IUI 2019
dgarijo
 
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven ScienceCapturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
dgarijo
 
WIDOCO: A Wizard for Documenting Ontologies
WIDOCO: A Wizard for Documenting OntologiesWIDOCO: A Wizard for Documenting Ontologies
WIDOCO: A Wizard for Documenting Ontologies
dgarijo
 
Towards Automating Data Narratives
Towards Automating Data NarrativesTowards Automating Data Narratives
Towards Automating Data Narratives
dgarijo
 
Automated Hypothesis Testing with Large Scale Scientific Workflows
Automated Hypothesis Testing with Large Scale Scientific WorkflowsAutomated Hypothesis Testing with Large Scale Scientific Workflows
Automated Hypothesis Testing with Large Scale Scientific Workflows
dgarijo
 
OntoSoft: A Distributed Semantic Registry for Scientific Software
OntoSoft: A Distributed Semantic Registry for Scientific SoftwareOntoSoft: A Distributed Semantic Registry for Scientific Software
OntoSoft: A Distributed Semantic Registry for Scientific Software
dgarijo
 
OEG tools for supporting Ontology Engineering
OEG tools for supporting Ontology EngineeringOEG tools for supporting Ontology Engineering
OEG tools for supporting Ontology Engineering
dgarijo
 
Software Metadata: Describing "dark software" in GeoSciences
Software Metadata: Describing "dark software" in GeoSciencesSoftware Metadata: Describing "dark software" in GeoSciences
Software Metadata: Describing "dark software" in GeoSciences
dgarijo
 
Reproducibility Using Semantics: An Overview
Reproducibility Using Semantics: An OverviewReproducibility Using Semantics: An Overview
Reproducibility Using Semantics: An Overview
dgarijo
 
PhD Thesis: Mining abstractions in scientific workflows
PhD Thesis: Mining abstractions in scientific workflowsPhD Thesis: Mining abstractions in scientific workflows
PhD Thesis: Mining abstractions in scientific workflows
dgarijo
 
Publicación de datos y métodos científicos en investigación
Publicación de datos y métodos científicos en investigaciónPublicación de datos y métodos científicos en investigación
Publicación de datos y métodos científicos en investigación
dgarijo
 
EDBT 2015: Summer School Overview
EDBT 2015: Summer School OverviewEDBT 2015: Summer School Overview
EDBT 2015: Summer School Overview
dgarijo
 
Similarity in Wikipedia Articles (EDBT Summer School)
Similarity in Wikipedia Articles (EDBT Summer School)Similarity in Wikipedia Articles (EDBT Summer School)
Similarity in Wikipedia Articles (EDBT Summer School)
dgarijo
 
Semantic web 101: Benefits for geologists
Semantic web 101: Benefits for geologistsSemantic web 101: Benefits for geologists
Semantic web 101: Benefits for geologists
dgarijo
 
Is preserving data enough? Towards the preservation of scientific methods
Is preserving data enough? Towards the preservation of scientific methods Is preserving data enough? Towards the preservation of scientific methods
Is preserving data enough? Towards the preservation of scientific methods
dgarijo
 
Creating abstractions from scientific workflows: PhD symposium 2015
Creating abstractions from scientific workflows: PhD symposium 2015Creating abstractions from scientific workflows: PhD symposium 2015
Creating abstractions from scientific workflows: PhD symposium 2015
dgarijo
 
Towards Workflow Ecosystems Through Semantic and Standard Representations
Towards Workflow Ecosystems Through Semantic and Standard RepresentationsTowards Workflow Ecosystems Through Semantic and Standard Representations
Towards Workflow Ecosystems Through Semantic and Standard Representations
dgarijo
 
Workflow Reuse in Practice: A Study of Neuroimaging Pipeline Users
Workflow Reuse in Practice: A Study of Neuroimaging Pipeline UsersWorkflow Reuse in Practice: A Study of Neuroimaging Pipeline Users
Workflow Reuse in Practice: A Study of Neuroimaging Pipeline Users
dgarijo
 
Frag Flow: Automated Fragment Detection in Scientific Workflows
Frag Flow: Automated Fragment Detection in Scientific WorkflowsFrag Flow: Automated Fragment Detection in Scientific Workflows
Frag Flow: Automated Fragment Detection in Scientific Workflows
dgarijo
 

More from dgarijo (20)

WDPlus: Leveraging Wikidata to Link and Extend Tabular Data
WDPlus: Leveraging Wikidata to Link and Extend Tabular DataWDPlus: Leveraging Wikidata to Link and Extend Tabular Data
WDPlus: Leveraging Wikidata to Link and Extend Tabular Data
 
Towards Human-Guided Machine Learning - IUI 2019
Towards Human-Guided Machine Learning - IUI 2019Towards Human-Guided Machine Learning - IUI 2019
Towards Human-Guided Machine Learning - IUI 2019
 
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven ScienceCapturing Context in Scientific Experiments: Towards Computer-Driven Science
Capturing Context in Scientific Experiments: Towards Computer-Driven Science
 
WIDOCO: A Wizard for Documenting Ontologies
WIDOCO: A Wizard for Documenting OntologiesWIDOCO: A Wizard for Documenting Ontologies
WIDOCO: A Wizard for Documenting Ontologies
 
Towards Automating Data Narratives
Towards Automating Data NarrativesTowards Automating Data Narratives
Towards Automating Data Narratives
 
Automated Hypothesis Testing with Large Scale Scientific Workflows
Automated Hypothesis Testing with Large Scale Scientific WorkflowsAutomated Hypothesis Testing with Large Scale Scientific Workflows
Automated Hypothesis Testing with Large Scale Scientific Workflows
 
OntoSoft: A Distributed Semantic Registry for Scientific Software
OntoSoft: A Distributed Semantic Registry for Scientific SoftwareOntoSoft: A Distributed Semantic Registry for Scientific Software
OntoSoft: A Distributed Semantic Registry for Scientific Software
 
OEG tools for supporting Ontology Engineering
OEG tools for supporting Ontology EngineeringOEG tools for supporting Ontology Engineering
OEG tools for supporting Ontology Engineering
 
Software Metadata: Describing "dark software" in GeoSciences
Software Metadata: Describing "dark software" in GeoSciencesSoftware Metadata: Describing "dark software" in GeoSciences
Software Metadata: Describing "dark software" in GeoSciences
 
Reproducibility Using Semantics: An Overview
Reproducibility Using Semantics: An OverviewReproducibility Using Semantics: An Overview
Reproducibility Using Semantics: An Overview
 
PhD Thesis: Mining abstractions in scientific workflows
PhD Thesis: Mining abstractions in scientific workflowsPhD Thesis: Mining abstractions in scientific workflows
PhD Thesis: Mining abstractions in scientific workflows
 
Publicación de datos y métodos científicos en investigación
Publicación de datos y métodos científicos en investigaciónPublicación de datos y métodos científicos en investigación
Publicación de datos y métodos científicos en investigación
 
EDBT 2015: Summer School Overview
EDBT 2015: Summer School OverviewEDBT 2015: Summer School Overview
EDBT 2015: Summer School Overview
 
Similarity in Wikipedia Articles (EDBT Summer School)
Similarity in Wikipedia Articles (EDBT Summer School)Similarity in Wikipedia Articles (EDBT Summer School)
Similarity in Wikipedia Articles (EDBT Summer School)
 
Semantic web 101: Benefits for geologists
Semantic web 101: Benefits for geologistsSemantic web 101: Benefits for geologists
Semantic web 101: Benefits for geologists
 
Is preserving data enough? Towards the preservation of scientific methods
Is preserving data enough? Towards the preservation of scientific methods Is preserving data enough? Towards the preservation of scientific methods
Is preserving data enough? Towards the preservation of scientific methods
 
Creating abstractions from scientific workflows: PhD symposium 2015
Creating abstractions from scientific workflows: PhD symposium 2015Creating abstractions from scientific workflows: PhD symposium 2015
Creating abstractions from scientific workflows: PhD symposium 2015
 
Towards Workflow Ecosystems Through Semantic and Standard Representations
Towards Workflow Ecosystems Through Semantic and Standard RepresentationsTowards Workflow Ecosystems Through Semantic and Standard Representations
Towards Workflow Ecosystems Through Semantic and Standard Representations
 
Workflow Reuse in Practice: A Study of Neuroimaging Pipeline Users
Workflow Reuse in Practice: A Study of Neuroimaging Pipeline UsersWorkflow Reuse in Practice: A Study of Neuroimaging Pipeline Users
Workflow Reuse in Practice: A Study of Neuroimaging Pipeline Users
 
Frag Flow: Automated Fragment Detection in Scientific Workflows
Frag Flow: Automated Fragment Detection in Scientific WorkflowsFrag Flow: Automated Fragment Detection in Scientific Workflows
Frag Flow: Automated Fragment Detection in Scientific Workflows
 

Recently uploaded

CEC 352 - SATELLITE COMMUNICATION UNIT 1
CEC 352 - SATELLITE COMMUNICATION UNIT 1CEC 352 - SATELLITE COMMUNICATION UNIT 1
CEC 352 - SATELLITE COMMUNICATION UNIT 1
PKavitha10
 
Engineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdfEngineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdf
abbyasa1014
 
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
Gino153088
 
artificial intelligence and data science contents.pptx
artificial intelligence and data science contents.pptxartificial intelligence and data science contents.pptx
artificial intelligence and data science contents.pptx
GauravCar
 
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
Yasser Mahgoub
 
132/33KV substation case study Presentation
132/33KV substation case study Presentation132/33KV substation case study Presentation
132/33KV substation case study Presentation
kandramariana6
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
bijceesjournal
 
ITSM Integration with MuleSoft.pptx
ITSM  Integration with MuleSoft.pptxITSM  Integration with MuleSoft.pptx
ITSM Integration with MuleSoft.pptx
VANDANAMOHANGOUDA
 
People as resource Grade IX.pdf minimala
People as resource Grade IX.pdf minimalaPeople as resource Grade IX.pdf minimala
People as resource Grade IX.pdf minimala
riddhimaagrawal986
 
Curve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods RegressionCurve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods Regression
Nada Hikmah
 
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
ydzowc
 
Applications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdfApplications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdf
Atif Razi
 
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by AnantLLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
Anant Corporation
 
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
171ticu
 
Computational Engineering IITH Presentation
Computational Engineering IITH PresentationComputational Engineering IITH Presentation
Computational Engineering IITH Presentation
co23btech11018
 
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
ecqow
 
Material for memory and display system h
Material for memory and display system hMaterial for memory and display system h
Material for memory and display system h
gowrishankartb2005
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
insn4465
 
Design and optimization of ion propulsion drone
Design and optimization of ion propulsion droneDesign and optimization of ion propulsion drone
Design and optimization of ion propulsion drone
bjmsejournal
 
22CYT12-Unit-V-E Waste and its Management.ppt
22CYT12-Unit-V-E Waste and its Management.ppt22CYT12-Unit-V-E Waste and its Management.ppt
22CYT12-Unit-V-E Waste and its Management.ppt
KrishnaveniKrishnara1
 

Recently uploaded (20)

CEC 352 - SATELLITE COMMUNICATION UNIT 1
CEC 352 - SATELLITE COMMUNICATION UNIT 1CEC 352 - SATELLITE COMMUNICATION UNIT 1
CEC 352 - SATELLITE COMMUNICATION UNIT 1
 
Engineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdfEngineering Drawings Lecture Detail Drawings 2014.pdf
Engineering Drawings Lecture Detail Drawings 2014.pdf
 
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
 
artificial intelligence and data science contents.pptx
artificial intelligence and data science contents.pptxartificial intelligence and data science contents.pptx
artificial intelligence and data science contents.pptx
 
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
2008 BUILDING CONSTRUCTION Illustrated - Ching Chapter 02 The Building.pdf
 
132/33KV substation case study Presentation
132/33KV substation case study Presentation132/33KV substation case study Presentation
132/33KV substation case study Presentation
 
Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...Comparative analysis between traditional aquaponics and reconstructed aquapon...
Comparative analysis between traditional aquaponics and reconstructed aquapon...
 
ITSM Integration with MuleSoft.pptx
ITSM  Integration with MuleSoft.pptxITSM  Integration with MuleSoft.pptx
ITSM Integration with MuleSoft.pptx
 
People as resource Grade IX.pdf minimala
People as resource Grade IX.pdf minimalaPeople as resource Grade IX.pdf minimala
People as resource Grade IX.pdf minimala
 
Curve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods RegressionCurve Fitting in Numerical Methods Regression
Curve Fitting in Numerical Methods Regression
 
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
原版制作(Humboldt毕业证书)柏林大学毕业证学位证一模一样
 
Applications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdfApplications of artificial Intelligence in Mechanical Engineering.pdf
Applications of artificial Intelligence in Mechanical Engineering.pdf
 
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by AnantLLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
LLM Fine Tuning with QLoRA Cassandra Lunch 4, presented by Anant
 
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样官方认证美国密歇根州立大学毕业证学位证书原版一模一样
官方认证美国密歇根州立大学毕业证学位证书原版一模一样
 
Computational Engineering IITH Presentation
Computational Engineering IITH PresentationComputational Engineering IITH Presentation
Computational Engineering IITH Presentation
 
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
一比一原版(CalArts毕业证)加利福尼亚艺术学院毕业证如何办理
 
Material for memory and display system h
Material for memory and display system hMaterial for memory and display system h
Material for memory and display system h
 
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
哪里办理(csu毕业证书)查尔斯特大学毕业证硕士学历原版一模一样
 
Design and optimization of ion propulsion drone
Design and optimization of ion propulsion droneDesign and optimization of ion propulsion drone
Design and optimization of ion propulsion drone
 
22CYT12-Unit-V-E Waste and its Management.ppt
22CYT12-Unit-V-E Waste and its Management.ppt22CYT12-Unit-V-E Waste and its Management.ppt
22CYT12-Unit-V-E Waste and its Management.ppt
 

SOMEF: a metadata extraction framework from software documentation

  • 1. SOMEF: A Metadata Extraction Framework from Software Documentation Daniel Garijo @dgarijov Ontology Engineering Group, Universidad Politécnica de Madrid
  • 2. The Importance of Software Metadata - An increasing amount of software is available online - Metadata is critical for: - Findability - Reusability - Understanding > 100 M repositories > 28 M repositories
  • 3. Current Pipeline for Software Metadata Curation Icons from https://icons8.com/ Develop software & documentation Create entry in registry Curation and Validation
  • 4. SOMEF: Automating Software Metadata Extraction SOMEF ● Readme Analysis ○ Supervised classification ○ Regular expressions ○ Header analysis ● File exploration ○ Notebooks ○ Dockerfiles ○ Documentation ● GitHub API Repository Extraction Results (Metadata) https://github.com/KnowledgeCaptureAndDiscovery/somef/
  • 5. Recognizing Metadata Categories ● Name (GA) ● Full title (RE) ● Description (SC, HA) ● Citation (SC, RE, HA) ● Installation instructions (SC, HA) ● Invocation (SC) ● Usage examples (HA) ● Documentation (HA, FE) ● Requirements (HA) ● Contributors (HA) ● FAQ (HA) ● Support (HA) ● License (GA, HA) ● Stars (GA) ● Contact (HA) ● Download URL (HA, GA) ● DOI (RE) ● DockerFile (FE) ● Notebooks (FE) ● Executable notebooks (Binder) (RE) ● Owner: (RE) ● Keywords (GA) ● Source code (GA) ● Releases (GA) ● Changelog (GA) ● Issue tracker (GA) ● Programming languages (GA) ● Acknowledgements (HA) Method used: ● Supervised Classification (SC) ● Header Analysis and Synset comparison (HA) ● File Exploration (FE) ● Regular Expressions (RE) ● GitHub API (GA)
  • 7. Beginning to capture software in context SOMEF helps automatically extract software metadata. Benefits: ● Let authors know what metadata is missing ● Help making software findable ● Capture context of software (setup, configuration, examples) But there is much more to explore: ● Better comparison ● Input/output preparation ● Using different software together Problems? Suggestions? New features? https://github.com/KnowledgeCaptureAndDiscovery/somef/issues