Slides presented at SEAA 2018 http://dsd-seaa2018.fit.cvut.cz/seaa/ related to the paper http://reposto.di.univaq.it/aigon2/index.php/attachments/single/211
Software development is a knowledge-intensive activity, which requires mastering several languages, frameworks, and technology trends (among other aspects) under the pressure of ever-increasing arrays of external libraries and resources.
Recommender systems are gaining relevance in software engineering since they aim to provide developers with real-time recommendations, which can reduce the time spent discovering and understanding reusable artifacts from software repositories and thus yield productivity and quality gains.
In this presentation, we focus on the problem of mining open source software repositories to identify similar projects, which can be evaluated and eventually reused by developers. To this end, CROSSSIM is proposed as a novel approach to model open source software projects and related artifacts and to compute similarities among them. An evaluation on a dataset of 580 GitHub projects shows that CROSSSIM outperforms an existing technique that has been shown to perform well in detecting similar GitHub repositories.
CrossSim: exploiting mutual relationships to detect similar OSS projects
1. Dipartimento di Ingegneria e Scienze dell’Informazione e Matematica
Università degli Studi dell’Aquila
http://www.di.univaq.it/diruscio/
davide.diruscio@univaq.it
@ddiruscio
CrossSim:
exploiting mutual relationships
to detect similar OSS projects
Davide Di Ruscio
Joint work with Phuong T. Nguyen, Juri Di Rocco, Riccardo Rubei
2. 2
Euromicro SEAA 2018 - August 30, 2018 - Prague | Czech Republic
Context
Development of new software systems by reusing existing open source components
Related activities
- Searching for candidate components
- Evaluating a set of retrieved candidate components to find the most suitable one
- Understanding how to use the selected components
- Monitoring the selected components
3. 3
Selecting and Using OSS components
Challenging tasks
▪ assessing quality, maturity, development activity, and user support is not a straightforward process
Different and heterogeneous sources of information
▪ e.g., code repositories, communication channels, bug tracking systems
[Figure labels: Source code; Q&A systems; Bug Reports; API Documentation; Tutorials; Configuration Management Systems]
4. 4
www.crossminer.org
7. 7
CROSSMINER: high-level view
[Figure: Data Preprocessing, Capturing Context, Producing Recommendations, and Presenting Recommendations, linking the Developer's IDE (query, recommendations) with the Knowledge Base and Data Storage]
Real-time recommendations that serve to increase productivity and quality
8. 8
CROSSMINER: high-level view
The main intuition is to bring to the domain of software development the notion of recommendation systems, which popular e-commerce systems typically use to present users with interesting items previously unknown to them
12. 12
Recommendation examples
Depending on the set of selected third-party libraries, the system is able
to recommend additional libraries that should be included in the
project being developed
Given a selected library, the system is able to suggest alternative ones
that share some similarities with the selected one
Depending on the set of selected libraries, the system shows API
documentation and Q&A posts that can help developers to understand
how to use the selected libraries
During the development, developers get recommendations about API
function calls that might be used
…
14. 14
Logical implications
Recommendations
Automated classification of artifacts
Clustering
Definition of an extensible and configurable similarity
calculation approach
Understanding the similarities between open source software projects allows for reusing source code, prototyping, or choosing alternative implementations
15. 15
Software Similarity: Overview
Low-level Software Similarity: Using source code
(variable/function names, API references, etc.)
High-level Software Similarity: Using metadata such as
readme files, description, GitHub star events
16. 16
Existing approaches
Algorithm Tec. | Constraints/Characteristics | Category
MUDABlue | C programs with source code | Low-level Similarity
CLAN | Java programs with API calls | Low-level Similarity
CLANdroid | Mobile apps from Android Package, GitHub, Google Code; the algorithm extracts identifiers and intents from source code, APIs and sensors from JAR files, and permissions from AndroidManifest.xml | High and Low-level Similarity
RepoPal | GitHub Java repositories that contain a readme file and have at least 20 stars | High-level Similarity
GPLAG | Programs with source code | Low-level Similarity
LibRec | Maven projects (pom.xml) that contain more than 10,000 lines of code, are not forks of another GitHub project, and use at least 10 libraries | High-level Similarity
SimApp | Mobile applications with a set of 10 features: application name, category, developer, description, update, permission, screenshot, content rating, size, and user reviews | High-level Similarity
DroidVisor | Android apps with security features | High-level Similarity
AnDarwin | App code information: app's market, signature, description | Low-level Similarity
TagSim | SourceForge projects with proper tags | High-level Similarity
18. 18
Overview of CrossSim
Graphs for representing different kinds of relationships in the
OSS ecosystem
• e.g., developers commit to repositories, users star repositories,
projects contain source code files, etc.
Cross Project Relationships for Computing Open Source Software Similarity
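As a rough illustration of the graph idea, the relationships listed on this slide could be stored as labelled, directed edges. This is only a sketch: the class `OssGraph`, the node names (`dev:alice`, `repo:json-parser-a`, ...) and the relation labels are hypothetical and are not taken from CrossSim's actual encoding, which the slides do not show.

```python
from collections import defaultdict


class OssGraph:
    """Directed, edge-labelled graph of OSS artifacts (illustrative only)."""

    def __init__(self):
        # target node -> set of (source node, relation) pairs pointing at it
        self.incoming = defaultdict(set)
        self.nodes = set()

    def add_edge(self, source, relation, target):
        self.nodes.update((source, target))
        self.incoming[target].add((source, relation))

    def in_neighbors(self, node):
        """Nodes that reference `node`, regardless of the relation type."""
        return {source for source, _ in self.incoming[node]}


# Hypothetical ecosystem fragment mirroring the examples on the slide.
graph = OssGraph()
graph.add_edge("dev:alice", "commits", "repo:json-parser-a")
graph.add_edge("user:bob", "stars", "repo:json-parser-a")
graph.add_edge("user:bob", "stars", "repo:json-parser-b")
graph.add_edge("repo:json-parser-a", "contains", "file:Parser.java")

print(sorted(graph.in_neighbors("repo:json-parser-a")))
```

Keeping the incoming edges per node makes it cheap to ask "who references this project?", which is the question a neighbor-based similarity measure needs answered.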
21. 21
CrossSim: OSS Ecosystem Representation
The main hypothesis is that projects aim to create common functionalities by using common libraries
Based on the graph structure, one can exploit nodes, links, and their mutual relationships to compute similarity using existing graph similarity algorithms
22. 22
CrossSim: Graph Similarity
The similarity between two nodes is dependent on their neighbors
Two nodes are considered to be similar if they are referenced by similar nodes
In the example, A and B are highly similar since they are referenced by many of the same nodes
Based on SimRank
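The neighbor-based intuition above can be sketched with a plain iterative SimRank. This is the textbook formulation, not necessarily the exact variant used in CrossSim; the decay factor 0.8, the iteration count, and the toy graph are assumptions made for illustration.

```python
from itertools import product


def simrank(in_neighbors, decay=0.8, iterations=10):
    """Iterative SimRank.

    `in_neighbors` maps every node to the set of nodes that reference it;
    every referenced node must also appear as a key.
    """
    nodes = list(in_neighbors)
    # Start from the identity: a node is only similar to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new_sim[(a, b)] = 1.0
                continue
            refs_a, refs_b = in_neighbors[a], in_neighbors[b]
            if not refs_a or not refs_b:
                # Nodes nobody references cannot be compared via neighbors.
                new_sim[(a, b)] = 0.0
                continue
            total = sum(sim[(i, j)] for i, j in product(refs_a, refs_b))
            new_sim[(a, b)] = decay * total / (len(refs_a) * len(refs_b))
        sim = new_sim
    return sim


# Hypothetical toy graph: A and B are referenced by the same two nodes, C by another one.
in_neighbors = {
    "x": set(), "y": set(), "z": set(),
    "A": {"x", "y"},
    "B": {"x", "y"},
    "C": {"z"},
}
scores = simrank(in_neighbors)
print(scores[("A", "B")], scores[("A", "C")])
```

In this toy graph A and B share all of their referencing nodes and therefore score higher than A and C, which share none.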
24. 24
CrossSim Evaluation Process
Success rate: if at least one of the top-5 retrieved projects is labelled Similar or
Highly similar, the query is considered to be successful.
• Success rate is the ratio of successful queries to the total number of queries
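A minimal sketch of the success-rate computation as defined above. The function name, the data layout, and the "Dissimilar" label are hypothetical; only "Similar" and "Highly similar" come from the slide.

```python
SIMILAR_LABELS = {"Similar", "Highly similar"}


def success_rate(labels_per_query, k=5):
    """Fraction of queries with at least one relevant project in the top-k results."""
    successful = sum(
        1
        for labels in labels_per_query.values()
        if any(label in SIMILAR_LABELS for label in labels[:k])
    )
    return successful / len(labels_per_query)


# Hypothetical evaluator labels for the top-5 results of three queries.
labels_per_query = {
    "query-1": ["Dissimilar", "Similar", "Dissimilar", "Dissimilar", "Dissimilar"],
    "query-2": ["Dissimilar"] * 5,
    "query-3": ["Highly similar", "Similar", "Similar", "Dissimilar", "Similar"],
}
print(success_rate(labels_per_query))
```

Here two of the three queries have at least one relevant result, so the rate is two thirds.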
25. 25
CrossSim Evaluation Process
Confidence: Given a pair <query, retrieved project>, the confidence of an
evaluator is the score she assigns to the similarity between the two projects
26. 26
CrossSim Evaluation Process
Precision: The precision for each query is the proportion of projects in
the top-5 list that are labelled as Similar or Highly similar by humans
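The per-query precision defined above can be sketched as follows. The function name and the "Dissimilar" label are hypothetical; the two relevant labels are the ones named on the slide.

```python
SIMILAR_LABELS = {"Similar", "Highly similar"}


def precision_at_k(labels, k=5):
    """Share of the top-k retrieved projects labelled Similar or Highly similar."""
    top = labels[:k]
    if not top:
        return 0.0  # no retrieved projects, nothing can be relevant
    relevant = sum(1 for label in top if label in SIMILAR_LABELS)
    return relevant / len(top)


# Hypothetical labels for one query's top-5 list.
labels = ["Highly similar", "Similar", "Dissimilar", "Similar", "Dissimilar"]
print(precision_at_k(labels))
```

Three of the five retrieved projects are relevant in this example, giving a precision of 3/5.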
27. 27
CrossSim Evaluation Process
Ranking: correlation between the ranking calculated by the similarity tools
and the scores given by the human evaluators
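The slide does not name the correlation coefficient. The sketch below assumes Spearman's rank correlation, computed as the Pearson correlation of average ranks; the function names and the sample data are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho; undefined (division by zero) if either input is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: position in a tool's top-5 list vs. the human score of that project.
tool_positions = [1, 2, 3, 4, 5]
human_scores = [4, 4, 3, 2, 1]
rho = spearman(tool_positions, human_scores)
print(round(rho, 3))
```

In this toy data the projects ranked first by the tool also receive the highest human scores, so position and score move in opposite directions and the coefficient is negative.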
28. 28
CrossSim Evaluation Process
Execution time: the time needed to run RepoPal and CROSSSIM on
the dataset to obtain the corresponding similarity matrices
29. 29
Data Collection
580 projects from GitHub satisfying the following
requirements:
– pom.xml or .gradle files available
– Having at least 9 dependencies
– Having the README.md file available
– Having at least 20 stars (as needed by RepoPal)
Filtered from an initial set retrieved from specific categories
(e.g., PDF processors, JSON parsers, ORM projects, Spring
MVC related tools)
30. 30
Definition of query
50 projects among the 580 in the dataset have been selected
as queries
Queries have been chosen to equally cover all the considered
categories of the projects in the dataset
31. 31
Mix and shuffle of the results
For each query the top-5 most similar projects calculated by
RepoPal and CrossSim were retrieved
Results from RepoPal and CrossSim were mixed and shuffled
The obtained list was labelled by human evaluators
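The mixing step could look roughly like the sketch below, which hides from evaluators which tool retrieved a project. Dropping projects returned by both tools is an assumption, as the slides do not say how overlaps were handled; the function name and the project names are hypothetical.

```python
import random


def mix_and_shuffle(repopal_top, crosssim_top, seed=None):
    """Merge two top-k lists, drop duplicates, and shuffle the result."""
    # dict.fromkeys keeps the first occurrence of each project and removes repeats.
    merged = list(dict.fromkeys(repopal_top + crosssim_top))
    random.Random(seed).shuffle(merged)
    return merged


# Hypothetical top-5 lists for one query.
repopal_top = ["p1", "p2", "p3", "p4", "p5"]
crosssim_top = ["p3", "p6", "p1", "p7", "p8"]
to_label = mix_and_shuffle(repopal_top, crosssim_top, seed=42)
print(len(to_label))
```

The two lists overlap in two projects, so the evaluators would see eight distinct projects for this query.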
33. 33
Considered CrossSim configurations
CrossSim1: star events and dependencies
CrossSim2: CrossSim1 + committers
CrossSim3: CrossSim1 without the most frequent dependencies
CrossSim4: CrossSim2 without the most frequent dependencies
E.g., since testing is a common functionality of many software projects, JUnit contributes little to the characterization of a project and thus needs to be removed from the graph
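One way to prune very common dependencies before building the graph is sketched below. The slides do not state how "most frequent" was determined, so the `max_share` threshold, the function name, and the sample projects are all assumptions.

```python
from collections import Counter


def drop_frequent_dependencies(project_deps, max_share=0.5):
    """Remove dependencies used by more than `max_share` of all projects."""
    usage = Counter(dep for deps in project_deps.values() for dep in set(deps))
    limit = max_share * len(project_deps)
    too_common = {dep for dep, count in usage.items() if count > limit}
    return {project: set(deps) - too_common for project, deps in project_deps.items()}


# Hypothetical projects: JUnit appears everywhere and so carries no signal.
project_deps = {
    "project-a": {"junit", "gson"},
    "project-b": {"junit", "itext"},
    "project-c": {"junit", "hibernate-core"},
}
pruned = drop_frequent_dependencies(project_deps)
print(pruned["project-a"])
```

After pruning, each project is described only by the libraries that distinguish it from the others.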
37. 37
Outcomes of the applied metrics: Precision
Including in the graph all developers who have committed at least once to a project is counterproductive, as it lowers precision
Both RepoPal and CROSSSIM3 achieve a success rate of 100%; however, CROSSSIM3 has better precision.
Concerning ranking correlations, CROSSSIM3 performs slightly better than RepoPal
• −0.214 for CROSSSIM3
• −0.163 for RepoPal
38. 38
Outcomes of the applied metrics: Confidence
39. 39
Outcomes of the applied metrics: Execution time
Intel Core i5-7200U CPU @ 2.50GHz × 4, 8GB RAM, Ubuntu 16.04
40. 40
Next Steps (in the short term)
Use of CrossSim for recommending libraries
– Exploiting the techniques proposed in CrossSim
– Representing OSS projects as a graph
– Computing similarity between projects
– Using collaborative-filtering techniques to recommend libraries to OSS projects