CrossSim: exploiting mutual relationships to detect similar OSS projects

Davide Di Ruscio
Dipartimento di Ingegneria e Scienze dell’Informazione e Matematica
Università degli Studi dell’Aquila
http://www.di.univaq.it/diruscio/
davide.diruscio@univaq.it
@ddiruscio
Joint work with Phuong T. Nguyen, Juri Di Rocco, Riccardo Rubei

2
Euromicro SEAA 2018 - August 30, 2018 - Prague | Czech Republic
Context
Development of new software systems by reusing existing open source components
Related activities
- Searching for candidate components
- Evaluating a set of retrieved candidate components to find the most suitable one
- Understanding how to use the selected components
- Monitoring the selected components

3
Selecting and Using OSS components
Challenging tasks
▪ assessing quality, maturity, development activity and user support is not a straightforward process
Different and heterogeneous sources of information
▪ e.g., code repositories, communication channels, bug tracking systems
Information sources: source code, Q&A systems, bug reports, API documentation, tutorials, configuration management systems

4
www.crossminer.org

5
CROSSMINER: high-level view
Data Preprocessing → Capturing Context → Producing Recommendations → Presenting Recommendations

6
Mining and Analysis Tools
▪ Miners: Source Code Miner, NLP Miner, Configuration Miner, Cross project Analysis
▪ Mined from OSS forges: source code, natural language channels, configuration scripts
▪ Mined data is looked up and stored in the Knowledge Base
▪ Later stages: Producing Recommendations, Presenting Recommendations

7
Producing Recommendations and Presenting Recommendations
▪ The developer's IDE queries the Knowledge Base (data storage) and receives recommendations
▪ Real-time recommendations that increase productivity and quality

8
The main intuition is to bring to the domain of software
development the notion of recommendation systems that are
typically used for popular e-commerce systems to present users
with interesting items previously unknown to them

9
Recommendation examples
Depending on the set of selected third-party libraries, the system is able
to recommend additional libraries that should be included in the
project being developed

10
Given a selected library, the system is able to suggest alternative ones
that share some similarities with the selected one

11
Depending on the set of selected libraries, the system shows API
documentation and Q&A posts that can help developers to understand
how to use the selected libraries

12
During development, developers get recommendations about API
function calls that might be used
…

13
Logical implications
Recommendations
▪ Automated classification of artifacts
▪ Clustering
▪ Definition of an extensible and configurable similarity calculation approach

14
Understanding the similarities between open source software projects
allows developers to reuse source code, build prototypes,
or choose alternative implementations

15
Software Similarity: Overview
Low-level Software Similarity: Using source code
(variable/function names, API references, etc.)
High-level Software Similarity: Using metadata such as
readme files, description, GitHub star events

16
Existing approaches
| Algorithm | Tec. Constraints/Characteristics | Category |
|---|---|---|
| MUDABlue | C programs with source code | Low-level Similarity |
| CLAN | Java programs with API calls | Low-level Similarity |
| CLANdroid | Mobile apps from Android Package, GitHub, Google Code; the algorithm extracts identifiers and intents from source code, APIs and sensors from JAR files, permissions from AndroidManifest.xml | High and Low-level Similarity |
| RepoPal | GitHub Java repositories that contain a readme file and possess at least 20 stars | High-level Similarity |
| GPLAG | Programs with source code | Low-level Similarity |
| LibRec | Maven projects (pom.xml) that contain more than 10,000 lines of code, are not a fork of another project in GitHub and use at least 10 libraries | High-level Similarity |
| SimApp | Mobile applications with a set of 10 features: application name, category, developer, description, update, permission, screenshot, content rating, size and user reviews | High-level Similarity |
| DroidVisor | Android apps with security features | High-level Similarity |
| AnDarwin | App code information: app's market, signature, description | Low-level Similarity |
| TagSim | Sourceforge projects with proper tags | High-level Similarity |

17
SANER 2017 - http://ieeexplore.ieee.org/document/7884605/

18
Overview of CrossSim
Graphs for representing different kinds of relationships in the
OSS ecosystem
• e.g., developers commit to repositories, users star repositories,
projects contain source code files, etc.
Cross Project Relationships for Computing Open Source Software Similarity
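
As a minimal sketch of such a representation, the Python snippet below stores labelled, directed edges in plain dictionaries. The `EcosystemGraph` class, the node names and the edge labels (`commits`, `stars`, `dependsOn`) are illustrative assumptions, not CrossSim's actual data model.

```python
from collections import defaultdict


class EcosystemGraph:
    """Directed graph with labelled edges: source --label--> target."""

    def __init__(self):
        self.out_edges = defaultdict(set)  # source -> {(label, target)}
        self.in_edges = defaultdict(set)   # target -> {(label, source)}

    def add_edge(self, source, label, target):
        self.out_edges[source].add((label, target))
        self.in_edges[target].add((label, source))

    def targets(self, node, label):
        """Nodes reached from `node` through edges carrying `label`."""
        return {t for (l, t) in self.out_edges.get(node, ()) if l == label}


# Hypothetical ecosystem fragment (all names are made up for illustration).
graph = EcosystemGraph()
graph.add_edge("dev:alice", "commits", "repo:org/json-tool")
graph.add_edge("user:bob", "stars", "repo:org/json-tool")
graph.add_edge("user:bob", "stars", "repo:org/yaml-tool")
graph.add_edge("repo:org/json-tool", "dependsOn", "lib:jackson-core")
graph.add_edge("repo:org/yaml-tool", "dependsOn", "lib:jackson-core")

# Libraries shared by the two repositories.
shared = graph.targets("repo:org/json-tool", "dependsOn") & graph.targets(
    "repo:org/yaml-tool", "dependsOn"
)
```

Keeping both outgoing and incoming edges makes it cheap to ask which nodes reference a given project, which is the information a neighbour-based similarity needs.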

19
CrossSim: OSS Ecosystem Representation

20
The main hypothesis is that
projects aim to create
common functionality by
using common libraries

21
Based on the graph
structure, one can exploit
nodes, links, and the mutual
relationships to compute
similarity using existing
graph similarity algorithms

22
CrossSim: Graph Similarity
The similarity between two nodes is dependent on their neighbors
Two nodes are considered to be similar if they are referenced by similar nodes
In the example, A and B are highly similar since they are referenced by many of
the same nodes
Based on SimRank
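
The idea can be sketched with a naive iterative SimRank in Python. This is an illustration under stated assumptions, not CrossSim's implementation: the decay factor of 0.8, the number of iterations and the toy graph are all chosen for the example.

```python
from itertools import product


def simrank(in_neighbours, decay=0.8, iterations=10):
    """Naive SimRank. `in_neighbours` maps each node to the nodes referencing it."""
    nodes = list(in_neighbours)
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new_sim[(a, b)] = 1.0  # a node is maximally similar to itself
                continue
            in_a, in_b = in_neighbours[a], in_neighbours[b]
            if not in_a or not in_b:
                new_sim[(a, b)] = 0.0  # no referencing nodes, no evidence
                continue
            # Average similarity over all pairs of referencing nodes.
            total = sum(sim[(i, j)] for i in in_a for j in in_b)
            new_sim[(a, b)] = decay * total / (len(in_a) * len(in_b))
        sim = new_sim
    return sim


# Toy graph: A and B are referenced by the same three nodes,
# C shares only one referencing node with A.
in_neighbours = {
    "u1": [], "u2": [], "u3": [], "u4": [],
    "A": ["u1", "u2", "u3"],
    "B": ["u1", "u2", "u3"],
    "C": ["u3", "u4"],
}
scores = simrank(in_neighbours)
```

In this toy graph the score for (A, B) is higher than the score for (A, C), matching the intuition that nodes referenced by the same nodes are similar.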

23
CrossSim Evaluation Process

24
Success rate: if at least one of the top-5 retrieved projects is labelled Similar or
Highly similar, the query is considered to be successful.
• Success rate is the ratio of successful queries to the total number of queries
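
A minimal sketch of this metric in Python. The function name, the `Dissimilar` label and the sample data are hypothetical; only the labels Similar and Highly similar come from the definition above.

```python
RELEVANT = {"Similar", "Highly similar"}


def success_rate(labels_per_query, k=5):
    """Share of queries with at least one relevant project among the top k.

    `labels_per_query` maps a query to the human labels of its ranked results.
    """
    successes = sum(
        1
        for labels in labels_per_query.values()
        if any(label in RELEVANT for label in labels[:k])
    )
    return successes / len(labels_per_query)


# Hypothetical labels for two queries.
labels_per_query = {
    "q1": ["Dissimilar", "Similar", "Dissimilar", "Dissimilar", "Dissimilar"],
    "q2": ["Dissimilar"] * 5,
}
rate = success_rate(labels_per_query)
```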

25
Confidence: Given a pair of <query, retrieved project> the confidence of an
evaluator is the score she assigns to the similarity between the projects

26
Precision: The precision for each query is the proportion of projects in
the top-5 list that are labelled as Similar or Highly similar by humans
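
A minimal sketch of the per-query precision in Python. The function name, the `Dissimilar` label and the sample labels are hypothetical.

```python
RELEVANT = {"Similar", "Highly similar"}


def precision_at_k(labels, k=5):
    """Proportion of the top-k results labelled Similar or Highly similar."""
    top = labels[:k]
    if not top:
        return 0.0
    return sum(1 for label in top if label in RELEVANT) / len(top)


# Hypothetical labels for the top-5 list of one query.
labels = ["Highly similar", "Similar", "Dissimilar", "Similar", "Dissimilar"]
precision = precision_at_k(labels)
```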

27
Ranking: correlation between the ranking calculated by the similarity tools
and the scores given by the human evaluators
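
The slide does not name the correlation coefficient. As a sketch, assuming Spearman's rank correlation, the Python code below compares a tool's ranking positions with human scores; the sample values are made up.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: position in the tool's list vs. human score.
tool_positions = [1, 2, 3, 4, 5]
human_scores = [4, 4, 3, 2, 1]
rho = spearman(tool_positions, human_scores)
```

Because position 1 is the best rank while a higher human score means greater similarity, a good ranking yields a negative coefficient in this setup.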

28
Execution time: the time needed to apply RepoPal and CrossSim to
the dataset to obtain the corresponding similarity matrices

29
Data Collection
580 projects from GitHub satisfying the following
requirements:
– pom.xml or .gradle files available
– Having at least 9 dependencies
– Having the README.md file available
– Having at least 20 stars (as needed by RepoPal)
Filtered from an initial set retrieved from specific categories
(e.g., PDF processors, JSON parsers, ORM projects, Spring
MVC related tools)
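
The selection criteria can be sketched as a simple filter in Python. The `Project` record, its field names and the sample projects are hypothetical; the thresholds are the ones listed above.

```python
from dataclasses import dataclass


@dataclass
class Project:
    name: str
    build_files: list   # file names found in the repository
    dependencies: int   # number of declared dependencies
    has_readme: bool    # README.md present
    stars: int


def is_eligible(project):
    """Apply the four requirements from the data collection step."""
    has_build_file = any(
        name == "pom.xml" or name.endswith(".gradle")
        for name in project.build_files
    )
    return (
        has_build_file
        and project.dependencies >= 9
        and project.has_readme
        and project.stars >= 20
    )


# Hypothetical candidates.
candidates = [
    Project("a/ok", ["pom.xml"], 12, True, 35),
    Project("b/few-deps", ["build.gradle"], 4, True, 120),
    Project("c/no-readme", ["pom.xml"], 15, False, 50),
    Project("d/gradle-ok", ["build.gradle"], 9, True, 20),
]
eligible = [p.name for p in candidates if is_eligible(p)]
```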

30
Definition of query
50 projects among the 580 in the data set have been selected
as queries
Queries have been chosen to equally cover all the considered
categories of the projects in the dataset
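
One way to cover categories equally is to draw the same number of queries from each category. The Python sketch below assumes random selection with a fixed seed; the slide does not say how the queries were actually picked, and the function name and sample data are hypothetical.

```python
import random
from collections import defaultdict


def pick_queries(category_of, per_category, seed=0):
    """Draw up to `per_category` projects from every category."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for project, category in sorted(category_of.items()):
        by_category[category].append(project)
    queries = []
    for category in sorted(by_category):
        pool = by_category[category]
        queries.extend(rng.sample(pool, min(per_category, len(pool))))
    return queries


# Hypothetical project-to-category mapping.
category_of = {
    "p1": "PDF processors", "p2": "PDF processors",
    "p3": "JSON parsers", "p4": "JSON parsers",
    "p5": "ORM projects", "p6": "ORM projects",
}
queries = pick_queries(category_of, per_category=1)
```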

31
Mix and shuffle of the results
For each query the top-5 most similar projects calculated by
RepoPal and CrossSim were retrieved
Results from RepoPal and CrossSim were mixed and shuffled
The obtained list was labelled by human evaluators
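
A minimal sketch of the mix-and-shuffle step in Python, so that evaluators cannot tell which tool produced a result. The function name and the project names are hypothetical, and whether duplicates were merged is not stated on the slide, so this sketch keeps them.

```python
import random


def mix_and_shuffle(query, repopal_top5, crosssim_top5, seed=None):
    """Merge both top-5 lists into <query, retrieved project> pairs, then shuffle."""
    pairs = [(query, project) for project in repopal_top5]
    pairs += [(query, project) for project in crosssim_top5]
    random.Random(seed).shuffle(pairs)
    return pairs


# Hypothetical results for one query.
repopal = ["r1", "r2", "r3", "r4", "r5"]
crosssim = ["c1", "c2", "r3", "c4", "c5"]
to_label = mix_and_shuffle("query/project", repopal, crosssim, seed=42)
```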

32
Fragment of the collected human scores
Project 1 Project 2 Score
neo4j-contrib/sparql-plugin castagna/jena-examples 3
neo4j-contrib/sparql-plugin claudiomartella/dbpedia4neo 3
neo4j-contrib/sparql-plugin claudiomartella/dbpedia4neo 3
neo4j-contrib/sparql-plugin dbpedia/links 3
neo4j-contrib/sparql-plugin eclipse/rdf4j 3
neo4j-contrib/sparql-plugin jbarrasa/neosemantics 3
neo4j-contrib/sparql-plugin jbarrasa/neosemantics 3
neo4j-contrib/sparql-plugin niclashoyer/neo4j-sparql-extension 4
neo4j-contrib/sparql-plugin niclashoyer/neo4j-sparql-extension 4
neo4j-contrib/sparql-plugin streamreasoning/CSPARQL-engine 3
AskNowQA/AutoSPARQL AKSW/RDFUnit 3
AskNowQA/AutoSPARQL AKSW/SPARQL2NL 4
AskNowQA/AutoSPARQL AKSW/SPARQL2NL 4
AskNowQA/AutoSPARQL AKSW/Sparqlify 3
AskNowQA/AutoSPARQL castagna/jena-examples 3
AskNowQA/AutoSPARQL pyvandenbussche/sparqles 3
AskNowQA/AutoSPARQL rdfhdt/hdt-java 3
AskNowQA/AutoSPARQL rdfhdt/hdt-java 3
AskNowQA/AutoSPARQL socialsignin/spring-social-security-demo 2
AskNowQA/AutoSPARQL yhegde/facebook-page-scraper 2

33
Considered CrossSim configurations
CrossSim1: star events and dependencies
CrossSim2: CrossSim1 + committers
CrossSim3: CrossSim1 – most frequent dependencies
CrossSim4: CrossSim2 – most frequent dependencies
E.g., since testing is a common
functionality of many software projects,
JUnit does not help characterize
a project and thus needs to be
removed from the graph
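
Pruning very common dependencies can be sketched as follows in Python. The cut-off (`max_share`, the fraction of projects that may use a library before it is dropped) is an assumption made for this example; the slide does not state how "most frequent" was defined. The project and library names are made up.

```python
from collections import Counter


def drop_frequent_dependencies(deps_by_project, max_share=0.7):
    """Remove libraries used by more than `max_share` of all projects."""
    usage = Counter(
        lib for libs in deps_by_project.values() for lib in set(libs)
    )
    limit = max_share * len(deps_by_project)
    frequent = {lib for lib, count in usage.items() if count > limit}
    pruned = {
        project: [lib for lib in libs if lib not in frequent]
        for project, libs in deps_by_project.items()
    }
    return pruned, frequent


# Hypothetical dependency lists.
deps_by_project = {
    "p1": ["junit", "jackson"],
    "p2": ["junit", "guava"],
    "p3": ["junit", "jackson", "hibernate"],
}
pruned, frequent = drop_frequent_dependencies(deps_by_project)
```

Here only `junit` is used by every project, so it is the only library removed before the graph is built.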

34
Outcomes of the applied metrics: Precision

35
RepoPal and CrossSim3 both achieve a
success rate of 100%; however, CrossSim3
has better precision.

36
Including in the graph every developer
who has committed at least once
to a project is counterproductive,
as it lowers precision

37
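Concerning ranking correlations,
CrossSim3 performs slightly
better than RepoPal
(rank 1 is the most similar project and higher human scores mean
greater similarity, so a more negative correlation is better)
• −0.214 for CrossSim3
• −0.163 for RepoPal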

38
Outcomes of the applied metrics: Confidence

39
Outcomes of the applied metrics: Execution time
Intel Core i5-7200U
CPU @ 2.50GHz × 4, 8GB RAM,
Ubuntu 16.04

40
Next Steps (short term)
Use of CrossSim for recommending libraries
– Exploiting the techniques proposed in CrossSim
– Representing OSS projects as a graph
– Computing similarity between projects
– Using collaborative-filtering techniques to recommend libraries to OSS
projects
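
A rough sketch of the collaborative-filtering idea in Python: the libraries used by the most similar projects are suggested to the target project. Jaccard similarity over library sets stands in for a CrossSim similarity score here, and all names and data are hypothetical.

```python
def jaccard(a, b):
    """Stand-in project similarity (a CrossSim score could be used instead)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_libraries(target, libs_by_project, top_n=2, k=2):
    """Score unseen libraries by the similarity of the k nearest projects using them."""
    own = libs_by_project[target]
    neighbours = sorted(
        (
            (jaccard(own, libs), other)
            for other, libs in libs_by_project.items()
            if other != target
        ),
        reverse=True,
    )[:k]
    scores = {}
    for similarity, other in neighbours:
        for lib in libs_by_project[other] - own:
            scores[lib] = scores.get(lib, 0.0) + similarity
    return sorted(scores, key=lambda lib: (-scores[lib], lib))[:top_n]


# Hypothetical projects and their libraries.
libs_by_project = {
    "p1": {"jackson", "guava"},
    "p2": {"jackson", "guava", "slf4j"},
    "p3": {"jackson", "slf4j", "commons-io"},
    "p4": {"hibernate", "spring-orm"},
}
suggestions = recommend_libraries("p1", libs_by_project)
```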

41
Conclusions www.crossminer.org
@crossminer eclipse.org/scava
