Dipartimento di Ingegneria e Scienze dell’Informazione e Matematica
Università degli Studi dell’Aquila
http://www.di.univaq.it/diruscio/
davide.diruscio@univaq.it
@ddiruscio
CrossSim:
exploiting mutual relationships
to detect similar OSS projects
Davide Di Ruscio
Joint work with Phuong T. Nguyen, Juri Di Rocco, Riccardo Rubei
Euromicro SEAA 2018 - August 30, 2018 - Prague | Czech Republic
Context
Related activities
- Searching for candidate components
- Evaluating a set of retrieved candidate components to find the most suitable one
- Understanding how to use the selected components
- Monitoring the selected components
Development of new software systems
by reusing existing open source components
Selecting and Using OSS components
Challenging tasks
▪ assessing quality, maturity,
activity of development
and user support is not a
straightforward process
Different and heterogeneous sources of information
▪ e.g., code repositories,
communication channels, bug
tracking systems
Source code, Q&A systems, Bug Reports, API Documentation, Tutorials, Configuration Management Systems
www.crossminer.org
CROSSMINER: high-level view
Stages: Data Preprocessing, Capturing Context, Producing Recommendations, Presenting Recommendations
Mining and Analysis Tools
Components: Knowledge Base; Source Code Miner, NLP Miner, Configuration Miner; Cross-project Analysis
Inputs from OSS forges: source code, natural language channels, configuration scripts
Interactions: mine (sources), lookup/store (Knowledge Base)
Components: Developer IDE, Knowledge Base, Data Storage; the IDE sends a query and receives recommendations
Real-time recommendations aimed at increasing productivity and quality
The main intuition is to bring to software development the notion of
recommendation systems, which are typically used in popular
e-commerce systems to present users with interesting items
previously unknown to them
Recommendation examples
Depending on the set of selected third-party libraries, the system is able
to recommend additional libraries that should be included in the
project being developed
Given a selected library, the system is able to suggest alternative ones
that share some similarities with the selected one
Depending on the set of selected libraries, the system shows API
documentation and Q&A posts that can help developers to understand
how to use the selected libraries
During the development, developers get recommendations about API
function calls that might be used
…
Logical implications
Recommendations
▪ Automated classification of artifacts
▪ Clustering
▪ Definition of an extensible and configurable similarity calculation approach
Understanding the similarities between open source software projects
makes it possible to reuse source code, build prototypes,
or choose alternative implementations
Software Similarity: Overview
Low-level Software Similarity: Using source code
(variable/function names, API references, etc.)
High-level Software Similarity: Using metadata such as
readme files, description, GitHub star events
Existing approaches
Algorithm | Technical constraints/characteristics | Category
MUDABlue | C programs with source code | Low-level Similarity
CLAN | Java programs with API calls | Low-level Similarity
CLANdroid | Mobile apps from Android Package, GitHub, Google Code; the algorithm extracts identifiers and intents from source code, APIs and sensors from JAR files, and permissions from AndroidManifest.xml | High- and Low-level Similarity
RepoPal | GitHub Java repositories that contain a readme file and have at least 20 stars | High-level Similarity
GPLAG | Programs with source code | Low-level Similarity
LibRec | Maven projects (pom.xml) that contain more than 10,000 lines of code, are not forks of another GitHub project, and use at least 10 libraries | High-level Similarity
SimApp | Mobile applications with a set of 10 features: application name, category, developer, description, update, permission, screenshot, content rating, size, and user reviews | High-level Similarity
DroidVisor | Android apps with security features | High-level Similarity
AnDarwin | App code information: app's market, signature, description | Low-level Similarity
TagSim | SourceForge projects with proper tags | High-level Similarity
SANER 2017 - http://ieeexplore.ieee.org/document/7884605/
Overview of CrossSim
Graphs for representing different kinds of relationships in the
OSS ecosystem
• e.g., developers commit to repositories, users star repositories,
projects contain source code files, etc.
Cross Project Relationships for Computing Open Source Software Similarity
CrossSim: OSS Ecosystem Representation
The main hypothesis is that
projects using common
libraries are aiming at
creating common
functionalities
Based on the graph
structure, one can exploit
nodes, links, and the mutual
relationships to compute
similarity using existing
graph similarity algorithms
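A minimal sketch of such a graph, assuming plain Python dictionaries. The node names, the relation labels, and the choice to orient every edge towards the repository are illustrative assumptions, not details taken from the CrossSim implementation.

```python
# Hypothetical (source, relation, target) triples; every name is illustrative.
# Assumption: each edge points at the repository, so a repository's
# in-neighbors are the developers, users, and libraries related to it.
EDGES = [
    ("dev:alice", "commits_to", "repo:p1"),
    ("user:bob", "stars", "repo:p1"),
    ("user:bob", "stars", "repo:p2"),
    ("lib:jackson", "is_used_by", "repo:p1"),
    ("lib:jackson", "is_used_by", "repo:p2"),
    ("lib:itext", "is_used_by", "repo:p3"),
]


def build_in_neighbors(edges):
    """Map every node to the set of nodes that point at it."""
    in_neighbors = {}
    for source, _relation, target in edges:
        # Register the source too, so nodes without incoming edges still appear.
        in_neighbors.setdefault(source, set())
        in_neighbors.setdefault(target, set()).add(source)
    return in_neighbors


graph = build_in_neighbors(EDGES)
# Nodes that reference both p1 and p2: the shared evidence for their similarity.
shared = graph["repo:p1"] & graph["repo:p2"]
```

In this toy graph, p1 and p2 share a stargazer and a library, while p3 shares nothing with either of them.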
CrossSim: Graph Similarity
The similarity between two nodes is dependent on their neighbors
Two nodes are considered to be similar if they are referenced by similar nodes
In the example, A and B are highly similar since they are referenced by many
of the same nodes
Based on SimRank
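The idea can be sketched with a basic, unoptimized SimRank iteration. The decay factor, the number of iterations, and the toy nodes are assumptions for illustration; the slides do not state the parameters used by CrossSim.

```python
from itertools import product


def simrank(in_neighbors, decay=0.8, iterations=10):
    """Basic SimRank: two nodes are similar if similar nodes point at them.

    in_neighbors maps every node to the collection of nodes referencing it.
    The decay and iterations values are assumed defaults.
    """
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        updated = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                updated[(a, b)] = 1.0
                continue
            refs_a, refs_b = in_neighbors[a], in_neighbors[b]
            if not refs_a or not refs_b:
                # A node nobody references has similarity 0 to every other node.
                updated[(a, b)] = 0.0
                continue
            total = sum(sim[(i, j)] for i in refs_a for j in refs_b)
            updated[(a, b)] = decay * total / (len(refs_a) * len(refs_b))
        sim = updated
    return sim


# Toy graph: A and B are referenced by the same three nodes, C by another one.
toy_graph = {
    "A": ["n1", "n2", "n3"],
    "B": ["n1", "n2", "n3"],
    "C": ["n4"],
    "n1": [], "n2": [], "n3": [], "n4": [],
}
scores = simrank(toy_graph)
```

Here `scores[("A", "B")]` is greater than `scores[("A", "C")]`, because A and B share all of their referencing nodes while A and C share none.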
CrossSim Evaluation Process
Success rate: if at least one of the top-5 retrieved projects is labelled Similar or
Highly similar, the query is considered to be successful.
• Success rate is the ratio of successful queries to the total number of queries
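The definition above can be sketched as follows. The labels "Similar" and "Highly similar" come from the slide; the other labels and the sample data are assumptions for illustration.

```python
RELEVANT = {"Similar", "Highly similar"}


def is_successful(top5_labels):
    """A query succeeds if any of its top-5 results has a relevant label."""
    return any(label in RELEVANT for label in top5_labels[:5])


def success_rate(labels_by_query):
    """Ratio of successful queries to the total number of queries."""
    if not labels_by_query:
        return 0.0
    hits = sum(is_successful(labels) for labels in labels_by_query.values())
    return hits / len(labels_by_query)


# Made-up human labels for two queries ("Dissimilar" and "Neutral" are assumed).
sample_labels = {
    "query-1": ["Dissimilar", "Similar", "Neutral", "Dissimilar", "Neutral"],
    "query-2": ["Dissimilar", "Neutral", "Dissimilar", "Neutral", "Dissimilar"],
}
rate = success_rate(sample_labels)
```

With this sample only the first query has a relevant result, so the rate is one out of two.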
Confidence: Given a pair of <query, retrieved project> the confidence of an
evaluator is the score she assigns to the similarity between the projects
Precision: The precision for each query is the proportion of projects in
the top-5 list that are labelled as Similar or Highly similar by humans
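A small sketch of this per-query precision, under the same labelling assumption: only "Similar" and "Highly similar" come from the slide, and the remaining labels and data are made up.

```python
RELEVANT = {"Similar", "Highly similar"}


def precision_at_5(top5_labels):
    """Share of the top-5 retrieved projects that humans labelled as relevant."""
    top = top5_labels[:5]
    if not top:
        return 0.0
    return sum(label in RELEVANT for label in top) / len(top)


# Made-up labels for one query: three of the five results are relevant.
example_labels = ["Highly similar", "Similar", "Neutral", "Similar", "Dissimilar"]
example_precision = precision_at_5(example_labels)
```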
Ranking: correlations among the ranking calculated by the similarity tools
and the scores given by the human evaluation
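The slide does not name the correlation coefficient. As one possibility, the sketch below computes Spearman's rank correlation in plain Python between the position a tool assigns to each retrieved project and the human score. The data is made up. If position 1 is the top of the list and higher human scores mean more similar, a good ranking yields a negative coefficient.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: positions in a tool's top-5 list and the human scores.
positions = [1, 2, 3, 4, 5]
human_scores = [4, 4, 3, 2, 2]
rho = spearman(positions, human_scores)
```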
Execution time: the time needed to run RepoPal and CROSSSIM on
the dataset to obtain the corresponding similarity matrices
Data Collection
580 projects from GitHub satisfying the following
requirements:
– pom.xml or .gradle files available
– Having at least 9 dependencies
– Having the README.md file available
– Having at least 20 stars (as needed by RepoPal)
Filtered from an initial set retrieved from specific categories
(e.g., PDF processors, JSON parsers, ORM projects, Spring
MVC related tools)
Definition of query
50 projects among the 580 in the data set have been selected
as queries
Queries have been chosen to equally cover all the considered
categories of the projects in the dataset
Mix and shuffle of the results
For each query the top-5 most similar projects calculated by
RepoPal and CrossSim were retrieved
Results from RepoPal and CrossSim were mixed and shuffled
The obtained list was labelled by human evaluators
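The mixing step can be sketched as below, so that evaluators cannot tell which tool produced a result. The function name, the seed parameter, and the project identifiers are hypothetical; the slides do not say whether projects returned by both tools were merged, so this sketch keeps them as separate entries.

```python
import random


def build_evaluation_list(top5_by_tool, seed=None):
    """Merge each tool's top-5 results and shuffle them to hide their origin."""
    merged = [
        project
        for projects in top5_by_tool.values()
        for project in projects[:5]
    ]
    # A dedicated Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(merged)
    return merged


# Hypothetical results for one query.
top5_by_tool = {
    "RepoPal": ["r1", "r2", "r3", "r4", "r5"],
    "CrossSim": ["c1", "c2", "c3", "c4", "c5"],
}
to_label = build_evaluation_list(top5_by_tool, seed=42)
```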
Fragment of the collected human scores
Project 1 Project 2 Score
neo4j-contrib/sparql-plugin castagna/jena-examples 3
neo4j-contrib/sparql-plugin claudiomartella/dbpedia4neo 3
neo4j-contrib/sparql-plugin claudiomartella/dbpedia4neo 3
neo4j-contrib/sparql-plugin dbpedia/links 3
neo4j-contrib/sparql-plugin eclipse/rdf4j 3
neo4j-contrib/sparql-plugin jbarrasa/neosemantics 3
neo4j-contrib/sparql-plugin jbarrasa/neosemantics 3
neo4j-contrib/sparql-plugin niclashoyer/neo4j-sparql-extension 4
neo4j-contrib/sparql-plugin niclashoyer/neo4j-sparql-extension 4
neo4j-contrib/sparql-plugin streamreasoning/CSPARQL-engine 3
AskNowQA/AutoSPARQL AKSW/RDFUnit 3
AskNowQA/AutoSPARQL AKSW/SPARQL2NL 4
AskNowQA/AutoSPARQL AKSW/SPARQL2NL 4
AskNowQA/AutoSPARQL AKSW/Sparqlify 3
AskNowQA/AutoSPARQL castagna/jena-examples 3
AskNowQA/AutoSPARQL pyvandenbussche/sparqles 3
AskNowQA/AutoSPARQL rdfhdt/hdt-java 3
AskNowQA/AutoSPARQL rdfhdt/hdt-java 3
AskNowQA/AutoSPARQL socialsignin/spring-social-security-demo 2
AskNowQA/AutoSPARQL yhegde/facebook-page-scraper 2
Considered CrossSim configurations
CrossSim1: star events and dependencies
CrossSim2: CrossSim1 + committers
CrossSim3: CrossSim1 – most frequent dependencies
CrossSim4: CrossSim2 – most frequent dependencies
E.g., since testing is a common
functionality of many software projects,
JUnit does not help characterize
a project and thus needs to be
removed from the graph
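One way to prune such dependencies is to drop every library used by more than a given share of the projects. The threshold, the function name, and the sample data are assumptions; the slides do not state how "most frequent" was decided.

```python
from collections import Counter


def drop_frequent_dependencies(deps_by_project, max_share=0.7):
    """Remove libraries used by more than max_share of all projects.

    max_share is an assumed cut-off, not a value taken from the slides.
    """
    usage = Counter(
        lib for libs in deps_by_project.values() for lib in set(libs)
    )
    limit = max_share * len(deps_by_project)
    frequent = {lib for lib, count in usage.items() if count > limit}
    pruned = {
        project: [lib for lib in libs if lib not in frequent]
        for project, libs in deps_by_project.items()
    }
    return pruned, frequent


# Made-up dependency lists: every project uses junit.
deps_by_project = {
    "p1": ["junit", "itext"],
    "p2": ["junit", "jackson"],
    "p3": ["junit", "jackson", "hibernate"],
}
pruned, frequent = drop_frequent_dependencies(deps_by_project)
```

In this sample only junit exceeds the cut-off, so it is the only library removed before the graph would be built.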
Outcomes of the applied metrics: Precision
Both RepoPal and CROSSSIM3 achieve a success rate
of 100%; however, CROSSSIM3
has better precision.
Including all developers
who have committed at least once
to a project in the graph is
counterproductive, as it
causes a decline in precision
Concerning ranking correlations
CROSSSIM3 performs slightly
better than RepoPal
• −0.214 for CROSSSIM3
• −0.163 for RepoPal
Outcomes of the applied metrics: Confidence
Outcomes of the applied metrics: Execution time
Test machine: Intel Core i5-7200U CPU @ 2.50GHz × 4, 8GB RAM, Ubuntu 16.04
Next Steps (in the short term)
Use of CrossSim for recommending libraries
– Exploiting the techniques proposed in CrossSim
– Representing OSS projects as a graph
– Computing similarity between projects
– Using collaborative-filtering techniques to recommend libraries to OSS projects
Conclusions www.crossminer.org
@crossminer eclipse.org/scava

More Related Content

Similar to CrossSim: exploiting mutual relationships to detect similar OSS projects

Developing recommendation systems to support open source software developers ...
Developing recommendation systems to support open source software developers ...Developing recommendation systems to support open source software developers ...
Developing recommendation systems to support open source software developers ...
Davide Ruscio
 
Top-N Recommendations from Implicit Feedback leveraging Linked Open Data
Top-N Recommendations from Implicit Feedback leveraging Linked Open DataTop-N Recommendations from Implicit Feedback leveraging Linked Open Data
Top-N Recommendations from Implicit Feedback leveraging Linked Open Data
Vito Ostuni
 

Similar to CrossSim: exploiting mutual relationships to detect similar OSS projects (20)

Flink for Everyone: Self-Service Data Analytics with StreamPipes
Flink for Everyone: Self-Service Data Analytics with StreamPipesFlink for Everyone: Self-Service Data Analytics with StreamPipes
Flink for Everyone: Self-Service Data Analytics with StreamPipes
 
Dynamic IoT data, protocol, and middleware interoperability with resource sli...
Dynamic IoT data, protocol, and middleware interoperability with resource sli...Dynamic IoT data, protocol, and middleware interoperability with resource sli...
Dynamic IoT data, protocol, and middleware interoperability with resource sli...
 
Developing recommendation systems to support open source software developers ...
Developing recommendation systems to support open source software developers ...Developing recommendation systems to support open source software developers ...
Developing recommendation systems to support open source software developers ...
 
OpenAIRE: Implementing Open Science
OpenAIRE: Implementing Open ScienceOpenAIRE: Implementing Open Science
OpenAIRE: Implementing Open Science
 
Introduction to OpenAIRE services and the OpenAIRE Research Graph
Introduction to OpenAIRE services and the OpenAIRE Research GraphIntroduction to OpenAIRE services and the OpenAIRE Research Graph
Introduction to OpenAIRE services and the OpenAIRE Research Graph
 
OSS Projects Knowledge Mining with CROSSMINER, OW2con'18, June 7-8, 2018
OSS Projects Knowledge Mining with CROSSMINER, OW2con'18, June 7-8, 2018OSS Projects Knowledge Mining with CROSSMINER, OW2con'18, June 7-8, 2018
OSS Projects Knowledge Mining with CROSSMINER, OW2con'18, June 7-8, 2018
 
Ramp up your testing solution, ExpoQA 2023
Ramp up your testing solution, ExpoQA 2023Ramp up your testing solution, ExpoQA 2023
Ramp up your testing solution, ExpoQA 2023
 
EUBrasilCloudFORUM Research Roadmap on Cloud Computing, including security
EUBrasilCloudFORUM Research Roadmap on Cloud Computing, including securityEUBrasilCloudFORUM Research Roadmap on Cloud Computing, including security
EUBrasilCloudFORUM Research Roadmap on Cloud Computing, including security
 
Nieuwerburgh - Open science e-infrastructure for research analysis and impact...
Nieuwerburgh - Open science e-infrastructure for research analysis and impact...Nieuwerburgh - Open science e-infrastructure for research analysis and impact...
Nieuwerburgh - Open science e-infrastructure for research analysis and impact...
 
Software Architecture Evaluation: A Systematic Mapping Study
Software Architecture Evaluation: A Systematic Mapping StudySoftware Architecture Evaluation: A Systematic Mapping Study
Software Architecture Evaluation: A Systematic Mapping Study
 
Self-Service IoT Data Analytics with StreamPipes
Self-Service IoT Data Analytics with StreamPipesSelf-Service IoT Data Analytics with StreamPipes
Self-Service IoT Data Analytics with StreamPipes
 
Use of MDE to Analyse Open Source Software
Use of MDE to Analyse Open Source SoftwareUse of MDE to Analyse Open Source Software
Use of MDE to Analyse Open Source Software
 
Top-N Recommendations from Implicit Feedback leveraging Linked Open Data
Top-N Recommendations from Implicit Feedback leveraging Linked Open DataTop-N Recommendations from Implicit Feedback leveraging Linked Open Data
Top-N Recommendations from Implicit Feedback leveraging Linked Open Data
 
Towards a Resource Slice Interoperability Hub for IoT
Towards a Resource Slice Interoperability Hub for IoTTowards a Resource Slice Interoperability Hub for IoT
Towards a Resource Slice Interoperability Hub for IoT
 
Product Engineer Certified Lean Six Sigma Black Belt by IASSC
Product Engineer Certified Lean Six Sigma Black Belt by IASSCProduct Engineer Certified Lean Six Sigma Black Belt by IASSC
Product Engineer Certified Lean Six Sigma Black Belt by IASSC
 
4th International Conference on Artificial Intelligence and Machine Learning ...
4th International Conference on Artificial Intelligence and Machine Learning ...4th International Conference on Artificial Intelligence and Machine Learning ...
4th International Conference on Artificial Intelligence and Machine Learning ...
 
Call for Research Papers - 4th International Conference on Artificial Intelli...
Call for Research Papers - 4th International Conference on Artificial Intelli...Call for Research Papers - 4th International Conference on Artificial Intelli...
Call for Research Papers - 4th International Conference on Artificial Intelli...
 
4 th International Conference on Artificial Intelligence and Machine Learning...
4 th International Conference on Artificial Intelligence and Machine Learning...4 th International Conference on Artificial Intelligence and Machine Learning...
4 th International Conference on Artificial Intelligence and Machine Learning...
 
TechEvent Customer Project "Trend-Analytics"
TechEvent Customer Project "Trend-Analytics"TechEvent Customer Project "Trend-Analytics"
TechEvent Customer Project "Trend-Analytics"
 
Modeling the Impact of R & Python Packages: Dependency and Contributor Networks
Modeling the Impact of R & Python Packages: Dependency and Contributor NetworksModeling the Impact of R & Python Packages: Dependency and Contributor Networks
Modeling the Impact of R & Python Packages: Dependency and Contributor Networks
 

More from Davide Ruscio

Collaborative model driven software engineering: a Systematic Mapping Study
Collaborative model driven software engineering: a Systematic Mapping StudyCollaborative model driven software engineering: a Systematic Mapping Study
Collaborative model driven software engineering: a Systematic Mapping Study
Davide Ruscio
 

More from Davide Ruscio (10)

Detecting java software similarities by using different clustering
Detecting java software similarities by using different clusteringDetecting java software similarities by using different clustering
Detecting java software similarities by using different clustering
 
On the way of listening to the crowd for supporting modeling activities
On the way of listening to the crowd for supporting modeling activitiesOn the way of listening to the crowd for supporting modeling activities
On the way of listening to the crowd for supporting modeling activities
 
FOCUS: A Recommender System for Mining API Function Calls and Usage Patterns
FOCUS:  A Recommender System for Mining API Function Calls and  Usage PatternsFOCUS:  A Recommender System for Mining API Function Calls and  Usage Patterns
FOCUS: A Recommender System for Mining API Function Calls and Usage Patterns
 
Consistency Recovery in Interactive Modeling
Consistency Recovery in Interactive ModelingConsistency Recovery in Interactive Modeling
Consistency Recovery in Interactive Modeling
 
Edelta: an approach for defining and applying reusable metamodel refactorings
Edelta: an approach for defining and applying reusable metamodel refactoringsEdelta: an approach for defining and applying reusable metamodel refactorings
Edelta: an approach for defining and applying reusable metamodel refactorings
 
Semantic based model matching with emf compare
Semantic based model matching with emf compareSemantic based model matching with emf compare
Semantic based model matching with emf compare
 
Collaborative model driven software engineering: a Systematic Mapping Study
Collaborative model driven software engineering: a Systematic Mapping StudyCollaborative model driven software engineering: a Systematic Mapping Study
Collaborative model driven software engineering: a Systematic Mapping Study
 
Model repositories: will they become reality?
Model repositories: will they become reality?Model repositories: will they become reality?
Model repositories: will they become reality?
 
Mining Correlations of ATL Transformation and Metamodel Metrics
Mining Correlations of ATL Transformation and Metamodel MetricsMining Correlations of ATL Transformation and Metamodel Metrics
Mining Correlations of ATL Transformation and Metamodel Metrics
 
MDEForge: an extensible Web-based modeling platform
MDEForge: an extensible Web-based modeling platformMDEForge: an extensible Web-based modeling platform
MDEForge: an extensible Web-based modeling platform
 

Recently uploaded

Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
chiefasafspells
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
masabamasaba
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
masabamasaba
 

Recently uploaded (20)

WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptxBUS PASS MANGEMENT SYSTEM USING PHP.pptx
BUS PASS MANGEMENT SYSTEM USING PHP.pptx
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
WSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - KeynoteWSO2Con204 - Hard Rock Presentation - Keynote
WSO2Con204 - Hard Rock Presentation - Keynote
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
WSO2Con2024 - Hello Choreo Presentation - Kanchana
WSO2Con2024 - Hello Choreo Presentation - KanchanaWSO2Con2024 - Hello Choreo Presentation - Kanchana
WSO2Con2024 - Hello Choreo Presentation - Kanchana
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Toronto Psychic Readings, Attraction spells,Brin...
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 

CrossSim: exploiting mutual relationships to detect similar OSS projects

  • 1. Dipartimento di Ingegneria e Scienze Università degli Studi dell’Aquila dell’Informazione e Matematica http://www.di.univaq.it/diruscio/ davide.diruscio@univaq.it @ddiruscio CrossSim: exploiting mutual relationships to detect similar OSS projects Davide Di Ruscio Joint work with Phuong T. Nguyen, Juri Di Rocco, Riccardo Rubei
  • 2. 2 Euromicro SEAA 2018 - August 30, 2018 - Prague | Czech Republic Context Related activities - Searching for candidate components - Evaluating a set of retrieved candidate components to find the most suitable one - Understanding how to use the selected components - Monitoring the selected components Development of new software systems by reusing existing open source components
  • 3. 3 Euromicro SEAA 2018 - August 30, 2018 - Prague | Czech Republic Selecting and Using OSS components Challenging tasks ▪ assessing quality, maturity, activity of development and user support is not a straightforward process Different and heterogeneous source of information ▪ e.g., code repositories, communication channels, bug tracking systems Source code Q&A systems Bug Reports API Documentation Tutorials Configuration Management Systems
  • 4. 4 Euromicro SEAA 2018 - August 30, 2018 - Prague | Czech Republic Source code Q&A systems Bug Reports API Documentation Tutorials Configuration Management Systems www.crossminer.org
  • 5. 5 Euromicro SEAA 2018 - August 30, 2018 - Prague | Czech Republic CROSSMINER: high-level view Data Preprocessing Capturing Context Producing Recommendations Presenting Recommendations
  • 6. 6 Euromicro SEAA 2018 - August 30, 2018 - Prague | Czech Republic Mining and Analysis Tools CROSSMINER: high-level view Data Preprocessing Capturing Context Producing Recommendations Presenting Recommendations Knowledge Base Source Code Miner NLP Miner Configuration Miner Cross project Analysis OSS forges Source Code Natural language channels Configuration Scripts lookup/store mine
  • 7. 7 Euromicro SEAA 2018 - August 30, 2018 - Prague | Czech Republic CROSSMINER: high-level view Data Preprocessing Capturing Context Producing Recommendations Presenting Recommendations Developer IDE Knowledge Base query recommendations Data Storage Real-time recommendations that serve productivity and quality increase
  • 8. CROSSMINER: high-level view. The main intuition is to bring to software development the notion of recommendation systems, which popular e-commerce systems typically use to present users with interesting items previously unknown to them
  • 12. Recommendation examples:
      – Depending on the set of selected third-party libraries, the system is able to recommend additional libraries that should be included in the project being developed
      – Given a selected library, the system is able to suggest alternative ones that share some similarities with the selected one
      – Depending on the set of selected libraries, the system shows API documentation and Q&A posts that can help developers understand how to use the selected libraries
      – During development, developers get recommendations about API function calls that might be used
      – …
  • 14. Logical implications: Recommendations → Automated classification of artifacts → Clustering → Definition of an extensible and configurable similarity calculation approach. Understanding the similarities between open source software projects allows for reusing source code and prototyping, or choosing alternative implementations
  • 15. Software Similarity: Overview. Low-level software similarity uses source code (variable/function names, API references, etc.). High-level software similarity uses metadata such as readme files, descriptions, and GitHub star events
  • 16. Existing approaches
      Algorithm | Tec. constraints/characteristics | Category
      – MUDABlue | C programs with source code | Low-level similarity
      – CLAN | Java programs with API calls | Low-level similarity
      – CLANdroid | Mobile apps from Android Package, GitHub, Google Code; the algorithm extracts identifiers and intents from source code, APIs and sensors from JAR files, and permissions from AndroidManifest.xml | High- and low-level similarity
      – RepoPal | GitHub Java repositories that contain a readme file and have at least 20 stars | High-level similarity
      – GPLAG | Programs with source code | Low-level similarity
      – LibRec | Maven projects (pom.xml) that contain more than 10,000 lines of code, are not a fork of another GitHub project, and use at least 10 libraries | High-level similarity
      – SimApp | Mobile applications with a set of 10 features: application name, category, developer, description, update, permission, screenshot, content rating, size, and user reviews | High-level similarity
      – DroidVisor | Android apps with security features | High-level similarity
      – AnDarwin | App code information: app's market, signature, description | Low-level similarity
      – TagSim | SourceForge projects with proper tags | High-level similarity
  • 17. SANER 2017 - http://ieeexplore.ieee.org/document/7884605/
  • 18. Overview of CrossSim: Cross Project Relationships for Computing Open Source Software Similarity. Graphs represent different kinds of relationships in the OSS ecosystem, e.g., developers commit to repositories, users star repositories, projects contain source code files, etc.
  • 21. CrossSim: OSS Ecosystem Representation. The main hypothesis is that projects create common functionalities by using common libraries. Based on the graph structure, one can exploit nodes, links, and their mutual relationships to compute similarity using existing graph similarity algorithms
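The graph representation described on this slide can be sketched as follows. This is a minimal illustration, not the CrossSim implementation: the node names, the relation labels (`stars`, `dependsOn`), and the `build_graph` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: node names and relation labels are illustrative only.
edges = [
    ("user:alice", "stars", "project:p1"),
    ("user:alice", "stars", "project:p2"),
    ("project:p1", "dependsOn", "lib:junit"),
    ("project:p2", "dependsOn", "lib:junit"),
    ("project:p1", "dependsOn", "lib:jena"),
    ("project:p2", "dependsOn", "lib:jena"),
    ("project:p3", "dependsOn", "lib:gson"),
]


def build_graph(triples):
    """Return (out_links, in_links) adjacency maps for a directed graph."""
    out_links, in_links = defaultdict(set), defaultdict(set)
    for source, _relation, target in triples:
        out_links[source].add(target)
        in_links[target].add(source)
    return out_links, in_links


out_links, in_links = build_graph(edges)

# Libraries that p1 and p2 have in common: the kind of mutual relationship
# a graph similarity algorithm can exploit.
shared = out_links["project:p1"] & out_links["project:p2"]
```

Keeping both adjacency directions makes it cheap to ask either "what does this project use?" or "who references this node?", which is the question neighbour-based similarity measures rely on.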
  • 22. CrossSim: Graph Similarity (based on SimRank). The similarity between two nodes depends on their neighbors: two nodes are considered similar if they are referenced by similar nodes. In the example, A and B are highly similar since they are referenced by many of the same nodes
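As a rough illustration of the SimRank idea named on this slide, the sketch below implements the plain iterative SimRank recurrence over in-neighbours. The decay factor of 0.8, the fixed iteration count, and the toy graph are assumptions; the slides do not state which parameters or variant CrossSim uses.

```python
from itertools import product


def simrank(in_links, nodes, decay=0.8, iterations=10):
    """Plain iterative SimRank: nodes are similar if similar nodes reference them."""
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        updated = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                updated[(a, b)] = 1.0
                continue
            in_a, in_b = in_links.get(a, ()), in_links.get(b, ())
            if not in_a or not in_b:
                # A node that nobody references is similar only to itself.
                updated[(a, b)] = 0.0
                continue
            total = sum(sim[(i, j)] for i in in_a for j in in_b)
            updated[(a, b)] = decay * total / (len(in_a) * len(in_b))
        sim = updated
    return sim


# Hypothetical toy graph: A and B are referenced by the same nodes, C is not.
in_links = {"A": {"x", "y", "z"}, "B": {"x", "y", "z"}, "C": {"w"}}
nodes = ["A", "B", "C", "w", "x", "y", "z"]
scores = simrank(in_links, nodes)
```

In this toy graph A and B receive a positive score because they share all of their referencing nodes, while A and C share none and score zero.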
  • 24. CrossSim Evaluation Process. Success rate: a query is considered successful if at least one of the top-5 retrieved projects is labelled Similar or Highly similar. The success rate is the ratio of successful queries to the total number of queries
  • 25. CrossSim Evaluation Process. Confidence: given a <query, retrieved project> pair, the confidence of an evaluator is the score she assigns to the similarity between the two projects
  • 26. CrossSim Evaluation Process. Precision: the precision for each query is the proportion of projects in the top-5 list that are labelled as Similar or Highly similar by humans
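The success rate and precision definitions from these slides translate directly into code. The sketch below is an illustration under assumptions: the function names and the example labels are hypothetical, and "Neutral" and "Dissimilar" are placeholder names for the non-relevant labels, which the slides do not list.

```python
# Labels counted as relevant, as named on the slides.
RELEVANT = {"Similar", "Highly similar"}


def precision_at_k(top_labels, k=5):
    """Share of the top-k retrieved projects labelled as relevant."""
    top = top_labels[:k]
    if not top:
        raise ValueError("the result list is empty")
    return sum(label in RELEVANT for label in top) / len(top)


def success_rate(labels_by_query, k=5):
    """Share of queries with at least one relevant project in the top-k."""
    if not labels_by_query:
        raise ValueError("at least one query is required")
    hits = sum(
        any(label in RELEVANT for label in labels[:k])
        for labels in labels_by_query.values()
    )
    return hits / len(labels_by_query)


# Hypothetical labels: "Neutral" and "Dissimilar" are placeholder names.
labels_by_query = {
    "query-1": ["Similar", "Highly similar", "Neutral", "Neutral", "Dissimilar"],
    "query-2": ["Neutral", "Dissimilar", "Neutral", "Dissimilar", "Neutral"],
}
rate = success_rate(labels_by_query)
precisions = {q: precision_at_k(labels) for q, labels in labels_by_query.items()}
```

Success rate only asks whether a query has any relevant hit, so two tools can both reach 100% while differing in precision, which counts how many of the five hits are relevant.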
  • 27. CrossSim Evaluation Process. Ranking: the correlation between the ranking calculated by the similarity tools and the scores given by the human evaluators
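One way to compute such a ranking correlation is a rank correlation coefficient. The sketch below assumes Spearman's coefficient with average ranks for ties; the slide does not name the coefficient, and the data is hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: position 1 is the tool's best match, 4 the top human score.
tool_positions = [1, 2, 3, 4, 5]
human_scores = [4, 4, 3, 3, 2]
rho = spearman(tool_positions, human_scores)
```

Under this convention a good tool places highly scored projects at small positions, so the coefficient is negative and a more negative value indicates better agreement with the human evaluators.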
  • 28. CrossSim Evaluation Process. Execution time: the time needed to apply RepoPal and CrossSim to the dataset to obtain the corresponding similarity matrices
  • 29. Data Collection. 580 projects from GitHub satisfying the following requirements:
      – pom.xml or .gradle files available
      – Having at least 9 dependencies
      – Having the README.md file available
      – Having at least 20 stars (as needed by RepoPal)
      Filtered from an initial set retrieved from specific categories (e.g., PDF processors, JSON parsers, ORM projects, Spring MVC related tools)
  • 30. Definition of queries. 50 of the 580 projects in the dataset were selected as queries, chosen to cover all the considered project categories equally
  • 31. Mix and shuffle of the results. For each query, the top-5 most similar projects calculated by RepoPal and by CrossSim were retrieved. The results from the two tools were mixed and shuffled, and the obtained list was labelled by human evaluators
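The mix-and-shuffle step can be sketched as below, so that evaluators cannot tell which tool produced which result. The function name, the repository names, and the decision to keep projects returned by both tools as separate entries are assumptions made for illustration.

```python
import random


def mix_and_shuffle(query, repopal_top, crosssim_top, seed=None):
    """Merge both top-5 lists into <query, project> pairs and shuffle them."""
    pairs = [(query, project) for project in repopal_top]
    pairs += [(query, project) for project in crosssim_top]
    rng = random.Random(seed)  # a fixed seed makes the shuffle reproducible
    rng.shuffle(pairs)
    return pairs


# Hypothetical repository names.
repopal_top = ["org/a", "org/b", "org/c", "org/d", "org/e"]
crosssim_top = ["org/b", "org/c", "org/f", "org/g", "org/h"]
to_label = mix_and_shuffle("org/query", repopal_top, crosssim_top, seed=42)
```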
  • 32. Fragment of the collected human scores
      Project 1 | Project 2 | Score
      neo4j-contrib/sparql-plugin | castagna/jena-examples | 3
      neo4j-contrib/sparql-plugin | claudiomartella/dbpedia4neo | 3
      neo4j-contrib/sparql-plugin | claudiomartella/dbpedia4neo | 3
      neo4j-contrib/sparql-plugin | dbpedia/links | 3
      neo4j-contrib/sparql-plugin | eclipse/rdf4j | 3
      neo4j-contrib/sparql-plugin | jbarrasa/neosemantics | 3
      neo4j-contrib/sparql-plugin | jbarrasa/neosemantics | 3
      neo4j-contrib/sparql-plugin | niclashoyer/neo4j-sparql-extension | 4
      neo4j-contrib/sparql-plugin | niclashoyer/neo4j-sparql-extension | 4
      neo4j-contrib/sparql-plugin | streamreasoning/CSPARQL-engine | 3
      AskNowQA/AutoSPARQL | AKSW/RDFUnit | 3
      AskNowQA/AutoSPARQL | AKSW/SPARQL2NL | 4
      AskNowQA/AutoSPARQL | AKSW/SPARQL2NL | 4
      AskNowQA/AutoSPARQL | AKSW/Sparqlify | 3
      AskNowQA/AutoSPARQL | castagna/jena-examples | 3
      AskNowQA/AutoSPARQL | pyvandenbussche/sparqles | 3
      AskNowQA/AutoSPARQL | rdfhdt/hdt-java | 3
      AskNowQA/AutoSPARQL | rdfhdt/hdt-java | 3
      AskNowQA/AutoSPARQL | socialsignin/spring-social-security-demo | 2
      AskNowQA/AutoSPARQL | yhegde/facebook-page-scraper | 2
  • 33. Considered CrossSim configurations:
      – CrossSim1: star events and dependencies
      – CrossSim2: CrossSim1 plus committers
      – CrossSim3: CrossSim1 without the most frequent dependencies
      – CrossSim4: CrossSim2 without the most frequent dependencies
      E.g., since testing is a common functionality of many software projects, JUnit does not help characterize a project and thus needs to be removed from the graph
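Removing the most frequent dependencies, as in CrossSim3 and CrossSim4, can be sketched as a simple frequency filter. The `max_share` threshold and the toy dependency sets are assumptions; the slides do not say how "most frequent" is decided.

```python
from collections import Counter


def drop_frequent_dependencies(deps_by_project, max_share=0.5):
    """Remove libraries used by more than max_share of all projects."""
    counts = Counter(
        lib for deps in deps_by_project.values() for lib in set(deps)
    )
    limit = max_share * len(deps_by_project)
    frequent = {lib for lib, used_by in counts.items() if used_by > limit}
    pruned = {
        project: set(deps) - frequent
        for project, deps in deps_by_project.items()
    }
    return pruned, frequent


# Hypothetical dependency sets: every project uses junit for testing.
deps_by_project = {
    "p1": {"junit", "jena"},
    "p2": {"junit", "gson"},
    "p3": {"junit", "hibernate"},
    "p4": {"junit", "jena"},
}
pruned, frequent = drop_frequent_dependencies(deps_by_project)
```

Here only junit is dropped, since it is used by all four projects, while jena is kept because it is used by exactly half of them and still helps to tell projects apart.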
  • 37. Outcomes of the applied metrics: Precision. RepoPal and CrossSim3 both gain a success rate of 100%; however, CrossSim3 has a better precision. Including in the graph all developers who have committed updates at least once to a project is counterproductive, as it lowers precision. Concerning ranking correlations, CrossSim3 performs slightly better than RepoPal: −0.214 for CrossSim3 versus −0.163 for RepoPal
  • 38. Outcomes of the applied metrics: Confidence
  • 39. Outcomes of the applied metrics: Execution time (Intel Core i5-7200U CPU @ 2.50GHz × 4, 8GB RAM, Ubuntu 16.04)
  • 40. Next Steps (in the short term). Use of CrossSim for recommending libraries:
      – Exploiting the techniques proposed in CrossSim
      – Representing OSS projects in a graph
      – Computing similarity between projects
      – Using collaborative-filtering techniques to recommend libraries to OSS projects
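The planned library recommendation can be illustrated with a small neighbourhood-based collaborative-filtering sketch. It is not the authors' design: the function, the similarity values (which would come from a similarity tool such as CrossSim), and the neighbourhood size are hypothetical.

```python
def recommend_libraries(target, deps_by_project, similarity, top_n=3, neighbours=2):
    """Score unseen libraries by the similarity of the projects that use them."""
    others = [project for project in deps_by_project if project != target]
    nearest = sorted(
        others,
        key=lambda project: similarity.get((target, project), 0.0),
        reverse=True,
    )[:neighbours]
    scores = {}
    for project in nearest:
        weight = similarity.get((target, project), 0.0)
        for lib in deps_by_project[project] - deps_by_project[target]:
            scores[lib] = scores.get(lib, 0.0) + weight
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda lib: (-scores[lib], lib))
    return ranked[:top_n]


# Hypothetical projects and similarity scores.
deps_by_project = {
    "new": {"jena"},
    "p1": {"jena", "rdf4j", "slf4j"},
    "p2": {"jena", "rdf4j"},
    "p3": {"gson", "slf4j"},
}
similarity = {("new", "p1"): 0.6, ("new", "p2"): 0.5, ("new", "p3"): 0.1}
suggested = recommend_libraries("new", deps_by_project, similarity)
```

Libraries used by several highly similar projects accumulate the largest scores, so they are suggested first; libraries the target already uses are never suggested.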
  • 41. Conclusions. www.crossminer.org, @crossminer, eclipse.org/scava