CrossSim: exploiting mutual relationships to detect similar OSS projects

Slides presented at SEAA 2018 (http://dsd-seaa2018.fit.cvut.cz/seaa/), related to the paper at http://reposto.di.univaq.it/aigon2/index.php/attachments/single/211

Software development is a knowledge-intensive activity that requires mastering several languages, frameworks, and technology trends (among other aspects) under the pressure of an ever-increasing array of external libraries and resources.
Recommender systems are gaining relevance in software engineering because they aim to provide developers with real-time recommendations, which can reduce the time spent discovering and understanding reusable artifacts from software repositories and thus yield productivity and quality gains.
In this presentation, we focus on the problem of mining open source software repositories to identify similar projects, which developers can evaluate and eventually reuse. To this end, CROSSSIM is proposed as a novel approach to model open source software projects and related artifacts and to compute similarities among them. An evaluation on a dataset containing 580 GitHub projects shows that CROSSSIM outperforms an existing technique that has been shown to perform well in detecting similar GitHub repositories.

  1. CrossSim: exploiting mutual relationships to detect similar OSS projects
     Davide Di Ruscio (davide.diruscio@univaq.it, @ddiruscio, http://www.di.univaq.it/diruscio/)
     Dipartimento di Ingegneria e Scienze dell’Informazione e Matematica, Università degli Studi dell’Aquila
     Joint work with Phuong T. Nguyen, Juri Di Rocco, Riccardo Rubei
  2. Euromicro SEAA 2018 - August 30, 2018 - Prague | Czech Republic
     Context: development of new software systems by reusing existing open source components
     Related activities
     - Searching for candidate components
     - Evaluating a set of retrieved candidate components to find the most suitable one
     - Understanding how to use the selected components
     - Monitoring the selected components
  3. Selecting and Using OSS components
     Challenging tasks
     ▪ Assessing quality, maturity, development activity, and user support is not a straightforward process
     Different and heterogeneous sources of information
     ▪ e.g., code repositories, communication channels, bug tracking systems
     Source code, Q&A systems, bug reports, API documentation, tutorials, configuration management systems
  4. Source code, Q&A systems, bug reports, API documentation, tutorials, configuration management systems
     www.crossminer.org
  6. CROSSMINER: high-level view
     Data Preprocessing, Capturing Context, Producing Recommendations, Presenting Recommendations
     Mining and Analysis Tools: Source Code Miner, NLP Miner, Configuration Miner, Cross project Analysis
     OSS forges (mine): Source Code, Natural language channels, Configuration Scripts
     Knowledge Base (lookup/store)
  7. CROSSMINER: high-level view
     Data Preprocessing, Capturing Context, Producing Recommendations, Presenting Recommendations
     Developer IDE, Knowledge Base (query, recommendations), Data Storage
     Real-time recommendations that increase productivity and quality
  8. CROSSMINER: high-level view
     The main intuition is to bring to software development the notion of recommendation systems, which popular e-commerce systems typically use to present users with interesting items previously unknown to them
  12. Recommendation examples
      - Depending on the set of selected third-party libraries, the system can recommend additional libraries that should be included in the project being developed
      - Given a selected library, the system can suggest alternatives that share some similarities with it
      - Depending on the set of selected libraries, the system shows API documentation and Q&A posts that can help developers understand how to use the selected libraries
      - During development, developers get recommendations about API function calls that might be used
      - …
  14. Logical implications
      Recommendations → Automated classification of artifacts → Clustering → Definition of an extensible and configurable similarity calculation approach
      Understanding the similarities between open source software projects allows for reusing source code and prototyping, or choosing alternative implementations
  15. Software Similarity: Overview
      Low-level software similarity: uses source code (variable/function names, API references, etc.)
      High-level software similarity: uses metadata such as readme files, descriptions, GitHub star events
  16. Existing approaches
      Algorithm | Tec. Constraints/Characteristics | Category
      MUDABlue | C programs with source code | Low-level similarity
      CLAN | Java programs with API calls | Low-level similarity
      CLANdroid | Mobile apps from Android Package, GitHub, Google Code; the algorithm extracts identifiers and intents from source code, APIs and sensors from JAR files, permissions from AndroidManifest.xml | High- and low-level similarity
      RepoPal | GitHub Java repositories that contain a readme file and have at least 20 stars | High-level similarity
      GPLAG | Programs with source code | Low-level similarity
      LibRec | Maven projects (pom.xml) that contain more than 10,000 lines of code, are not a fork of another GitHub project, and use at least 10 libraries | High-level similarity
      SimApp | Mobile applications with a set of 10 features: application name, category, developer, description, update, permission, screenshot, content rating, size, and user reviews | High-level similarity
      DroidVisor | Android apps with security features | High-level similarity
      AnDarwin | App code information: app's market, signature, description | Low-level similarity
      TagSim | Sourceforge projects with proper tags | High-level similarity
  17. SANER 2017 - http://ieeexplore.ieee.org/document/7884605/
  18. Overview of CrossSim: Cross Project Relationships for Computing Open Source Software Similarity
      Graphs for representing different kinds of relationships in the OSS ecosystem
      • e.g., developers commit to repositories, users star repositories, projects contain source code files, etc.
  21. CrossSim: OSS Ecosystem Representation
      The main hypothesis is that projects aim to provide common functionalities by using common libraries
      Based on the graph structure, one can exploit nodes, links, and their mutual relationships to compute similarity using existing graph similarity algorithms
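The graph representation described in this slide can be illustrated with a small sketch. It assumes the ecosystem is available as a list of (source, relation, target) triples; the node names and the relation labels `stars` and `dependsOn` are hypothetical and are not taken from CrossSim's actual schema.

```python
from collections import defaultdict

# Hypothetical edges: (source node, relation, target node).
EDGES = [
    ("user:alice", "stars", "project:p1"),
    ("user:alice", "stars", "project:p2"),
    ("project:p1", "dependsOn", "lib:junit"),
    ("project:p2", "dependsOn", "lib:junit"),
    ("project:p2", "dependsOn", "lib:gson"),
]

def build_graph(edges):
    """Return (out_neighbors, in_neighbors) adjacency maps of a directed graph."""
    out_nb, in_nb = defaultdict(set), defaultdict(set)
    for src, _relation, dst in edges:
        out_nb[src].add(dst)
        in_nb[dst].add(src)
    return out_nb, in_nb

def shared_libraries(out_nb, a, b):
    """Libraries that both projects point to."""
    return {n for n in out_nb[a] & out_nb[b] if n.startswith("lib:")}

out_nb, in_nb = build_graph(EDGES)
print(shared_libraries(out_nb, "project:p1", "project:p2"))
```

Keeping both adjacency directions is a design choice of this sketch: it lets a similarity algorithm look at who references a node as well as what the node references.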
  22. CrossSim: Graph Similarity (based on SimRank)
      The similarity between two nodes depends on their neighbors
      Two nodes are considered similar if they are referenced by similar nodes
      In the example, A and B are highly similar since they are referenced by many of the same nodes
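The idea that two nodes are similar when similar nodes reference them can be sketched with a plain iterative SimRank. This is a minimal illustration, not the CrossSim implementation: the decay factor of 0.8, the 10 iterations, and the toy graph are assumptions that do not come from the slides.

```python
from itertools import product

def simrank(in_neighbors, nodes, decay=0.8, iterations=10):
    """Iterative SimRank: s(a, a) = 1, and for a != b the score is the decayed
    average similarity over all pairs of in-neighbors of a and b."""
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new[(a, b)] = 1.0
                continue
            ia, ib = in_neighbors.get(a, ()), in_neighbors.get(b, ())
            if not ia or not ib:
                # A node that nobody references is similar only to itself.
                new[(a, b)] = 0.0
                continue
            total = sum(sim[(i, j)] for i in ia for j in ib)
            new[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new
    return sim

# Toy graph (assumed): x and y both reference A and B; only z references C.
in_neighbors = {"A": ["x", "y"], "B": ["x", "y"], "C": ["z"]}
nodes = ["A", "B", "C", "x", "y", "z"]
scores = simrank(in_neighbors, nodes)
print(scores[("A", "B")], scores[("A", "C")])
```

In the toy graph, A and B share both of their referencing nodes and get a positive score, while A and C share none and stay at zero.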
  23. CrossSim Evaluation Process
  24. CrossSim Evaluation Process
      Success rate: a query is considered successful if at least one of the top-5 retrieved projects is labelled Similar or Highly similar
      • Success rate is the ratio of successful queries to the total number of queries
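The success rate definition can be written down directly. The sketch below assumes that human labels are stored as integers and that 3 and 4 stand for Similar and Highly similar; that encoding, the function name, and the sample data are illustrative assumptions.

```python
RELEVANT = {3, 4}  # assumed encoding of "Similar" and "Highly similar"

def success_rate(scores_per_query, top_k=5):
    """Fraction of queries with at least one relevant project among the top-k."""
    hits = sum(
        1
        for scores in scores_per_query.values()
        if any(score in RELEVANT for score in scores[:top_k])
    )
    return hits / len(scores_per_query)

# Hypothetical human scores for the top-5 results of two queries.
sample = {"q1": [3, 2, 1, 1, 4], "q2": [1, 2, 2, 1, 1]}
print(success_rate(sample))
```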
  25. CrossSim Evaluation Process
      Confidence: given a <query, retrieved project> pair, the confidence of an evaluator is the score she assigns to the similarity between the two projects
  26. CrossSim Evaluation Process
      Precision: the precision for each query is the proportion of projects in the top-5 list that are labelled Similar or Highly similar by humans
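Precision per query follows the same pattern. As a hedged sketch, it again assumes integer labels in which 3 and 4 mean Similar and Highly similar; the function name and the sample scores are hypothetical.

```python
def precision_at_k(scores, k=5, relevant=frozenset({3, 4})):
    """Share of the top-k retrieved projects whose human label is relevant.

    `relevant` holds the label values assumed to mean Similar / Highly similar.
    """
    top = scores[:k]
    if not top:
        return 0.0
    return sum(1 for score in top if score in relevant) / len(top)

# Hypothetical human scores for one query's top-5 list.
print(precision_at_k([3, 4, 2, 3, 1]))
```

Averaging this value over all queries would give a single precision figure per tool.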
  27. CrossSim Evaluation Process
      Ranking: correlation between the ranking calculated by the similarity tools and the scores given by the human evaluators
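The slide does not name the correlation coefficient. As one possible reading, the sketch below computes Spearman's rank correlation between tool rank positions and human scores, using average ranks for ties; the choice of Spearman and the sample data are assumptions.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Pearson correlation of the rank vectors (Spearman's coefficient)."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (sx * sy)

# Hypothetical data: tool positions 1..5 and the human scores given to them.
positions = [1, 2, 3, 4, 5]
human_scores = [4, 4, 3, 2, 1]
print(spearman(positions, human_scores))
```

With this encoding, position 1 is the best match, so a tool whose top results receive the highest human scores produces a negative coefficient.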
  28. CrossSim Evaluation Process
      Execution time: the time needed to apply RepoPal and CROSSSIM to the dataset to obtain the corresponding similarity matrices
  29. Data Collection
      580 projects from GitHub satisfying the following requirements:
      – pom.xml or .gradle files available
      – At least 9 dependencies
      – README.md file available
      – At least 20 stars (as needed by RepoPal)
      Filtered from an initial set retrieved from specific categories (e.g., PDF processors, JSON parsers, ORM projects, Spring MVC related tools)
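The four selection requirements amount to a simple predicate over repository metadata. The sketch below assumes that metadata has already been collected into a record; the `Repo` class, its field names, and the sample repositories are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Repo:
    """Hypothetical, pre-collected metadata about one GitHub repository."""
    name: str
    files: tuple        # file names found in the repository
    dependencies: int   # number of declared dependencies
    stars: int

def is_eligible(repo):
    """Apply the four requirements listed on the slide."""
    has_build_file = any(f == "pom.xml" or f.endswith(".gradle") for f in repo.files)
    has_readme = "README.md" in repo.files
    return has_build_file and has_readme and repo.dependencies >= 9 and repo.stars >= 20

candidates = [
    Repo("example/a", ("pom.xml", "README.md"), dependencies=12, stars=40),
    Repo("example/b", ("build.gradle", "README.md"), dependencies=5, stars=300),
    Repo("example/c", ("pom.xml",), dependencies=15, stars=25),
]
selected = [repo.name for repo in candidates if is_eligible(repo)]
print(selected)
```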
  30. Definition of query
      50 of the 580 projects in the dataset were selected as queries
      Queries were chosen to cover all the considered project categories equally
  31. Mix and shuffle of the results
      For each query, the top-5 most similar projects calculated by RepoPal and CrossSim were retrieved
      Results from RepoPal and CrossSim were mixed and shuffled
      The obtained list was labelled by human evaluators
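Mixing and shuffling hides from the evaluators which tool produced which result. A minimal sketch of that step follows, assuming each tool returns a list of project names per query; the function name, the fixed seed, and the placeholder project names are illustrative. Whether the original study removed projects returned by both tools is not stated in the slides, so duplicates are kept here.

```python
import random

def mix_and_shuffle(query, top_a, top_b, seed=0):
    """Merge two top-k lists into <query, project> pairs in random order."""
    pairs = [(query, project) for project in top_a]
    pairs += [(query, project) for project in top_b]
    rng = random.Random(seed)  # seeded so the labelling sheet is reproducible
    rng.shuffle(pairs)
    return pairs

# Placeholder result lists for one query.
tool_a = ["p1", "p2", "p3", "p4", "p5"]
tool_b = ["p3", "p6", "p7", "p8", "p9"]
sheet = mix_and_shuffle("q1", tool_a, tool_b)
print(len(sheet))
```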
  32. Fragment of the collected human scores
      Project 1 | Project 2 | Score
      neo4j-contrib/sparql-plugin | castagna/jena-examples | 3
      neo4j-contrib/sparql-plugin | claudiomartella/dbpedia4neo | 3
      neo4j-contrib/sparql-plugin | claudiomartella/dbpedia4neo | 3
      neo4j-contrib/sparql-plugin | dbpedia/links | 3
      neo4j-contrib/sparql-plugin | eclipse/rdf4j | 3
      neo4j-contrib/sparql-plugin | jbarrasa/neosemantics | 3
      neo4j-contrib/sparql-plugin | jbarrasa/neosemantics | 3
      neo4j-contrib/sparql-plugin | niclashoyer/neo4j-sparql-extension | 4
      neo4j-contrib/sparql-plugin | niclashoyer/neo4j-sparql-extension | 4
      neo4j-contrib/sparql-plugin | streamreasoning/CSPARQL-engine | 3
      AskNowQA/AutoSPARQL | AKSW/RDFUnit | 3
      AskNowQA/AutoSPARQL | AKSW/SPARQL2NL | 4
      AskNowQA/AutoSPARQL | AKSW/SPARQL2NL | 4
      AskNowQA/AutoSPARQL | AKSW/Sparqlify | 3
      AskNowQA/AutoSPARQL | castagna/jena-examples | 3
      AskNowQA/AutoSPARQL | pyvandenbussche/sparqles | 3
      AskNowQA/AutoSPARQL | rdfhdt/hdt-java | 3
      AskNowQA/AutoSPARQL | rdfhdt/hdt-java | 3
      AskNowQA/AutoSPARQL | socialsignin/spring-social-security-demo | 2
      AskNowQA/AutoSPARQL | yhegde/facebook-page-scraper | 2
  33. Considered CrossSim configurations
      CrossSim1: star events and dependencies
      CrossSim2: CrossSim1 + committers
      CrossSim3: CrossSim1 – most frequent dependencies
      CrossSim4: CrossSim2 – most frequent dependencies
      E.g., since testing is a common functionality of many software projects, JUnit does not help characterize a project and thus needs to be removed from the graph
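Removing the most frequent dependencies, as in CrossSim3 and CrossSim4, can be sketched as a preprocessing step applied before the graph is built. The slides do not say how "most frequent" is decided, so the `top_n` cut-off, the function name, and the sample data below are assumptions.

```python
from collections import Counter

def drop_frequent_dependencies(deps_by_project, top_n=1):
    """Remove the top_n libraries used by the largest number of projects."""
    usage = Counter(
        lib for libs in deps_by_project.values() for lib in set(libs)
    )
    frequent = {lib for lib, _count in usage.most_common(top_n)}
    return {
        project: [lib for lib in libs if lib not in frequent]
        for project, libs in deps_by_project.items()
    }

# Hypothetical dependency lists: JUnit appears everywhere and says little.
deps = {
    "p1": ["junit", "gson"],
    "p2": ["junit", "jena"],
    "p3": ["junit", "jena"],
}
filtered = drop_frequent_dependencies(deps)
print(filtered)
```

After filtering, p2 and p3 still share a library while p1 shares none, which is the kind of distinction that a ubiquitous dependency would otherwise blur.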
  37. Outcomes of the applied metrics: Precision
      Including in the graph all developers who have committed at least once to a project is counterproductive, as it lowers precision
      Both gain a success rate of 100%; however, CROSSSIM3 has better precision
      Concerning ranking correlations, CROSSSIM3 performs slightly better than RepoPal
      • −0.214 for CROSSSIM3
      • −0.163 for RepoPal
  38. Outcomes of the applied metrics: Confidence
  39. Outcomes of the applied metrics: Execution time
      Intel Core i5-7200U CPU @ 2.50GHz × 4, 8GB RAM, Ubuntu 16.04
  40. Next Steps (in the short term)
      Use of CrossSim for recommending libraries
      – Exploiting the techniques proposed in CrossSim
      – Representing OSS projects in a graph
      – Computing similarity between projects
      – Using collaborative-filtering techniques to recommend libraries to OSS projects
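The planned library recommendation can be pictured as neighbourhood-based collaborative filtering: find the projects most similar to the target and suggest libraries they use that the target lacks. The sketch below is only an illustration of that general idea. It uses Jaccard overlap of dependency sets as a stand-in similarity, whereas the slides propose using CrossSim similarities; the function names, parameters, and data are hypothetical.

```python
def jaccard(a, b):
    """Stand-in similarity: overlap of two dependency sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(target, libs_by_project, k=2, n=3):
    """Suggest up to n libraries used by the k projects most similar to target."""
    own = set(libs_by_project[target])
    neighbours = sorted(
        (
            (jaccard(own, libs), project)
            for project, libs in libs_by_project.items()
            if project != target
        ),
        reverse=True,
    )[:k]
    votes = {}
    for similarity, project in neighbours:
        for lib in libs_by_project[project]:
            if lib not in own:
                # Each neighbour votes with a weight equal to its similarity.
                votes[lib] = votes.get(lib, 0.0) + similarity
    return sorted(votes, key=lambda lib: (-votes[lib], lib))[:n]

# Hypothetical dependency data.
data = {
    "p1": ["junit", "gson", "jena"],
    "p2": ["junit", "gson", "guava"],
    "p3": ["junit", "jena", "rdf4j", "guava"],
    "p4": ["spring"],
}
print(recommend("p1", data))
```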
  41. Conclusions
      www.crossminer.org, @crossminer, eclipse.org/scava
