
Detecting Java software similarities by using different clustering techniques

These are the slides of the talk I delivered at the Journal First session of ICSME 2020.


  1. Detecting Java software similarities by using different clustering techniques
     Andrea Capiluppi*, Davide Di Ruscio**, Juri Di Rocco**, Phuong T. Nguyen**, Nemitari Ajienka***
     ICSME 2020
     * Department of Computer Science, University of Groningen, The Netherlands
     ** Department of Information Engineering, Computer Science and Mathematics, University of L’Aquila, Italy
     *** Department of Computer Science, University of Nottingham, UK
     https://doi.org/10.1016/j.infsof.2020.106279
  2. On the need for ever larger samples of systems
     Research in empirical software engineering has increasingly used data made available in online repositories or through collective efforts. The usual aim is to gather “as much data as possible”:
     - to prevent the bias that comes with a small sample
     - to work with a sample as close as possible to the population itself
     - to showcase the performance of existing or new tools on vast amounts of data
  3. On the need for ever larger samples of systems (continued)
     [Figure: cumulative number of FOSS projects per year; average number of FOSS projects per year]
  4. Similarity of systems, and empirical research insensitive to it
     Very few works have clearly addressed the similarity (or differences) between systems when interpreting their results, either:
     - by explicitly proposing explanations based on application domains, or
     - by sampling the projects to be analysed from a specific, restricted topic
  5. Assumptions of this paper
     - A specific software system might be similar to others to some degree, and there are different approaches to defining that similarity.
     - A sample of software systems might be divided into subsets (or clusters), each containing similar systems and showing differences from the other clusters.
  6. Reasons for Clustering
     Clustering is among the fundamental techniques in knowledge mining and information retrieval. A clustering algorithm attempts to distribute objects into groups of similar objects, so that the similarity between any pair of objects in a cluster is higher than that between one of those objects and any object in a different cluster.
     “the degree to which two distinct programs are similar is related to how precisely they are alike”
  7. Reasons for Clustering (continued)
     [Figure: systems s1 to s8 shown in clusters labelled Log management, JSON Parsing, and DB Management]
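
The clustering property described on the slides (similarity inside a cluster should exceed similarity across clusters) can be checked with a minimal sketch like the one below. It is only an illustration: the system names, cluster labels, and similarity values are hypothetical, and comparing mean similarities is a simplification of the pairwise condition stated above.

```python
from itertools import combinations


def mean_intra_inter(similarity, clusters):
    """Return (mean intra-cluster, mean inter-cluster) similarity.

    similarity: dict mapping frozenset({a, b}) -> similarity value in [0, 1]
    clusters:   dict mapping system name -> cluster label
    """
    intra, inter = [], []
    for a, b in combinations(sorted(clusters), 2):
        value = similarity[frozenset((a, b))]
        # Pairs in the same cluster count as "intra", all others as "inter".
        (intra if clusters[a] == clusters[b] else inter).append(value)
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Hypothetical assignment and similarity scores, for illustration only.
clusters = {"s1": "logging", "s2": "logging", "s3": "json", "s4": "json"}
similarity = {
    frozenset(("s1", "s2")): 0.9,
    frozenset(("s3", "s4")): 0.8,
    frozenset(("s1", "s3")): 0.2,
    frozenset(("s1", "s4")): 0.1,
    frozenset(("s2", "s3")): 0.3,
    frozenset(("s2", "s4")): 0.2,
}

intra, inter = mean_intra_inter(similarity, clusters)
print(f"intra={intra:.2f} inter={inter:.2f}")
```

A grouping is plausible under this simplified check when the first value is clearly larger than the second.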
  8. Research question
     Are OO metrics sensitive to the context of their clusters?
     The main goal is to investigate whether experiments in software engineering can generalize results based on populations from different contexts, and how sensitive clustering techniques are in providing such a classification.
  9. Types of clustering techniques used in the paper
     - CrossSim (graph-based similarity)
     - Clustering based on project descriptions (manually classified)
     - LDA-informed clustering
     Procedure:
     1. We group software systems using each of the three clustering techniques.
     2. We collect the values of the OO metrics suite in each cluster.
     3. We then test whether clusters are statistically different from each other, using the Kolmogorov-Smirnov (KS) hypothesis test.
     The aim is to reject, for every OO metric m, the null hypothesis H0,m: the samples are drawn from the same population.
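
Step 3 of the procedure can be sketched as follows. This is a minimal, dependency-free illustration of a two-sample KS test, not the authors' implementation: the metric values are hypothetical, the function names are made up for this example, and the decision rule uses the standard large-sample critical value, which is only approximate for small samples.

```python
import math


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both samples past x so ties are handled together.
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def rejects_h0(sample_a, sample_b, alpha=0.05):
    """Reject H0 (same population) when D exceeds the asymptotic critical value."""
    n, m = len(sample_a), len(sample_b)
    c_alpha = math.sqrt(-math.log(alpha / 2) / 2)  # about 1.358 for alpha = 0.05
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(sample_a, sample_b) > critical


# Hypothetical values of one OO metric in two clusters.
cluster_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
cluster_b = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

print(ks_statistic(cluster_a, cluster_b), rejects_h0(cluster_a, cluster_b))
```

In practice the check would be repeated for every metric m and every pair of clusters; a statistics library would also report an exact or asymptotic p-value rather than a yes/no decision.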
  10. CrossSim
     Based on the graph structure, one can exploit nodes, links, and their mutual relationships to compute similarity using existing graph similarity algorithms.
     - P.T. Nguyen, J. Di Rocco, R. Rubei, D. Di Ruscio, An automated approach to assess the similarity of GitHub repositories, Software Quality Journal, 2020.
     - P.T. Nguyen, J. Di Rocco, D. Di Ruscio, M. Di Penta, CrossRec: Supporting software developers by recommending third-party libraries, J. Syst. Softw. 161 (2020).
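
To make the idea of graph-based similarity concrete, here is a minimal sketch that scores two projects by the overlap of their neighbouring nodes. It deliberately uses plain Jaccard similarity as a stand-in and is not the algorithm CrossSim applies; the project names, node labels, and the `jaccard` helper are all hypothetical.

```python
def jaccard(neighbours_a, neighbours_b):
    """Jaccard similarity of two neighbour sets: |intersection| / |union|."""
    union = neighbours_a | neighbours_b
    if not union:
        return 0.0
    return len(neighbours_a & neighbours_b) / len(union)


# Hypothetical graph: each project maps to the nodes it is linked to.
graph = {
    "project-a": {"lib:junit", "lib:gson", "dev:alice"},
    "project-b": {"lib:junit", "lib:gson", "dev:bob"},
    "project-c": {"lib:log4j", "dev:carol"},
}

sim_ab = jaccard(graph["project-a"], graph["project-b"])  # 2 shared of 4 nodes
sim_ac = jaccard(graph["project-a"], graph["project-c"])  # no shared nodes
print(sim_ab, sim_ac)
```

Unlike this one-hop overlap, graph similarity algorithms can also propagate similarity through indirect relationships between nodes.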
  11. Results for CrossSim
     - Clustering of 12 projects (6 pairs), taken from a larger population of 5,000 projects extracted as part of the CROSSMINER project (https://www.crossminer.org)
     - The similarity by CrossSim is computed according to libraries, stargazers, and committers
     Result: we cannot conclude that CrossSim clusters are structurally different from each other.
  12. Results from manual classification
     - Java subset of 520 projects collected out of the 5,000 projects [1,2]
     - Manually assigned to 12 categories, e.g. Communications, Database, Software Development, Text Editors, …
     Result: the obtained clusters yield pools of attributes that are structurally different from each other. Each cluster is a standalone category, with specific (and unique) characteristics.
     [1] H. Borges, A. Hora, M.T. Valente, Understanding the factors that impact the popularity of GitHub repositories, in: 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, 2016, pp. 334–344.
     [2] H. Borges, M.T. Valente, What’s in a GitHub star? Understanding repository starring practices in a social coding platform, J. Syst. Softw. 146 (2018) 112–129.
  13. Results from LDA-informed Clustering
     Latent Dirichlet Allocation (LDA) is an information retrieval method.
  14. Results from LDA-informed Clustering (continued)
     - 20 categories/domains from SourceForge
     - 100 most starred Java projects from GitHub
     Result: strong evidence to reject the null hypothesis based on the KS test. OO attributes show differences among the different clusters.
  15. Take-away messages
     1. When you cluster software systems into categories, you can obtain strongly different results.
     2. The interpretation of software metrics might be more sensitive to context than reported so far in the literature. The correlation among OO metrics can be extremely sensitive to application domains.
  16. Take-away messages (continued)
     3. We should pay more attention to the application domain of the studied systems.
        - e.g. the metrics one should consider when analysing gaming software should differ from those used to assess the quality of security software
        - LOCs are less appropriate for assessing the quality of security software or, in general, of mission-critical software systems
     Empirical findings might need readjustment depending on the cluster of projects they evaluate.
