Detecting Java software similarities by using different clustering techniques

These are the slides of the talk I delivered at the Journal First session of ICSME 2020.

Detecting Java software similarities
by using different clustering techniques
Andrea Capiluppi*, Davide Di Ruscio**, Juri Di Rocco**, Phuong T. Nguyen**,
Nemitari Ajienka***
ICSME 2020
* Department of Computer Science, University of Groningen, The Netherlands
** Department of Information Engineering, Computer Science and Mathematics, University of L’Aquila, Italy
*** Department of Computer Science, University of Nottingham, UK
https://doi.org/10.1016/j.infsof.2020.106279
On the need for ever larger samples of systems
Research on empirical software engineering has increasingly used data
made available in online repositories or through collective efforts
Gather “as much data as possible”
- to prevent the bias that comes with a small, unrepresentative sample
- to work with a sample as close as possible to the population itself
- to showcase the performance of existing or new tools in handling vast amounts of data
[Figure, two panels: cumulative number of FOSS projects per year; average number of FOSS projects per year]
Similarity of Systems and Empirical Research
insensitive to that
Very few works have clearly stated the similarities (or differences) between systems when interpreting their results
- by explicitly proposing explanations based on application domains
- by sampling the projects to be analysed from a specific, restricted topic
Assumptions of this paper
A specific software system might be similar to others to some degree,
and there are different approaches to defining that similarity
A sample of software systems might be divided into subsets (or
clusters), each containing similar systems and differing from the
other clusters
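To make the first assumption concrete, here is a minimal sketch of two ways one might score the similarity of the same pair of systems. Both measures (Jaccard over dependency sets, cosine over identifier counts) and all project data are illustrative assumptions, not necessarily the measures used in the paper.

```python
import math
from collections import Counter


def jaccard(a: set, b: set) -> float:
    """Share of items the two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical projects: third-party dependencies and identifier counts.
deps_x = {"junit", "guava", "slf4j"}
deps_y = {"junit", "guava", "log4j"}
terms_x = Counter({"parse": 4, "token": 3, "file": 1})
terms_y = Counter({"render": 5, "pixel": 2, "file": 1})

dep_similarity = jaccard(deps_x, deps_y)
term_similarity = cosine(terms_x, terms_y)
print(f"dependency view: {dep_similarity:.2f}")  # 0.50
print(f"vocabulary view: {term_similarity:.2f}")  # 0.04
```

The two made-up projects look fairly similar by their dependencies and almost unrelated by their vocabulary, which is why the choice of similarity definition matters before any clustering is attempted.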
Reasons for Clustering
Clustering is among the fundamental techniques in knowledge mining and
information retrieval
A clustering algorithm attempts to distribute objects into groups of similar
objects so that the similarity between any pair of objects in a cluster is higher
than that between either of those objects and any object in a different cluster
“the degree to which two distinct programs are similar is related to
how precisely they are alike”
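The clustering goal stated above can be sketched in a few lines. This is a minimal illustration that uses single-linkage grouping with a similarity threshold, chosen only for brevity; it is not necessarily one of the techniques compared in the talk, and the systems, similarity values and threshold are hypothetical.

```python
from itertools import combinations

# Hypothetical pairwise similarities between five systems (symmetric, 0..1).
SIM = {
    ("A", "B"): 0.9, ("A", "C"): 0.8, ("B", "C"): 0.85,
    ("D", "E"): 0.7,
    ("A", "D"): 0.1, ("A", "E"): 0.2, ("B", "D"): 0.15,
    ("B", "E"): 0.1, ("C", "D"): 0.05, ("C", "E"): 0.1,
}


def sim(x: str, y: str) -> float:
    """Look up the similarity of an unordered pair."""
    return SIM.get((x, y), SIM.get((y, x), 0.0))


def cluster(items: list, threshold: float) -> list:
    """Single-linkage grouping: merge clusters linked by any pair >= threshold."""
    clusters = [{item} for item in items]
    merged = True
    while merged:
        merged = False
        for c1, c2 in combinations(clusters, 2):
            if any(sim(x, y) >= threshold for x in c1 for y in c2):
                clusters.remove(c1)
                clusters.remove(c2)
                clusters.append(c1 | c2)
                merged = True
                break  # restart the scan on the updated cluster list
    return sorted(sorted(c) for c in clusters)


def within_beats_between(clusters: list) -> bool:
    """Check that every in-cluster pair is more similar than any cross-cluster pair."""
    within = [sim(x, y) for c in clusters for x, y in combinations(c, 2)]
    between = [sim(x, y) for c1, c2 in combinations(clusters, 2)
               for x in c1 for y in c2]
    return not within or not between or min(within) > max(between)


groups = cluster(["A", "B", "C", "D", "E"], threshold=0.5)
print(groups)                        # [['A', 'B', 'C'], ['D', 'E']]
print(within_beats_between(groups))  # True
```

On this toy data the weakest in-cluster pair (0.7) is still more similar than the strongest cross-cluster pair (0.2). Single linkage does not guarantee that property in general, so the explicit check is kept separate from the grouping step.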
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
 
killingcamp longest common subsequence.pdf
killingcamp longest common subsequence.pdfkillingcamp longest common subsequence.pdf
killingcamp longest common subsequence.pdf
 
"Discovery and Delivery through Product IntelliGenAI framework" by Ramkumar A...
"Discovery and Delivery through Product IntelliGenAI framework" by Ramkumar A..."Discovery and Delivery through Product IntelliGenAI framework" by Ramkumar A...
"Discovery and Delivery through Product IntelliGenAI framework" by Ramkumar A...
 
AUTOKEYUNLOCKER-BRANDS-SUPPORT-STANDARD-VERSION.pdf
AUTOKEYUNLOCKER-BRANDS-SUPPORT-STANDARD-VERSION.pdfAUTOKEYUNLOCKER-BRANDS-SUPPORT-STANDARD-VERSION.pdf
AUTOKEYUNLOCKER-BRANDS-SUPPORT-STANDARD-VERSION.pdf
 
OpenChain AI Study Group - North America and Europe - 2024-02-20
OpenChain AI Study Group - North America and Europe - 2024-02-20OpenChain AI Study Group - North America and Europe - 2024-02-20
OpenChain AI Study Group - North America and Europe - 2024-02-20
 
The Game-Changer_ How Software Development Outsource Can Catapult Your Growth...
The Game-Changer_ How Software Development Outsource Can Catapult Your Growth...The Game-Changer_ How Software Development Outsource Can Catapult Your Growth...
The Game-Changer_ How Software Development Outsource Can Catapult Your Growth...
 
Automation for Bonterra Impact Management (fka Apricot)
Automation for Bonterra Impact Management (fka Apricot)Automation for Bonterra Impact Management (fka Apricot)
Automation for Bonterra Impact Management (fka Apricot)
 

Detecting java software similarities by using different clustering

  • 1. Detecting Java software similarities by using different clustering techniques Andrea Capiluppi*, Davide Di Ruscio**, Juri Di Rocco**, Phuong T. Nguyen**, Nemitari Ajienka*** ICSME 2020 * Department of Computer Science, University of Groningen, The Netherlands ** Department of Information Engineering, Computer Science and Mathematics, University of L’Aquila, Italy *** Department of Computer Science, University of Nottingham, UK https://doi.org/10.1016/j.infsof.2020.106279
  • 2. On the need of always larger samples of systems. Research on empirical software engineering has increasingly used data made available in online repositories or collective efforts. Gather “as much data as possible”: to prevent bias in the representation of a small sample; to work with a sample as close as possible to the population itself; and to showcase the performance of existing or new tools in treating vast amounts of data.
  • 3. On the need of always larger samples of systems (continued). Figures: cumulative number of FOSS projects per year; average number of FOSS projects per year.
  • 4. Similarity of Systems and Empirical Research: insensitive to that. Very few works have clearly stated the similarity (or differences) between systems in the interpretation of the results, either by explicitly proposing explanations based on application domains or by sampling the projects to be analysed from a specific, restricted topic.
  • 5. Assumptions of this paper. A specific software system might be similar to others to some degree, and there are different approaches to defining their similarity. A sample of software systems might be divided into subsets (or clusters), each containing similar systems and showing differences from other clusters.
  • 6. Reasons for Clustering. Clustering is among the fundamental techniques in knowledge mining and information retrieval. A clustering algorithm attempts to distribute objects into groups of similar objects, so that the similarity between a pair of objects in a cluster is higher than that between either of those objects and any object in a different cluster. “the degree to which two distinct programs are similar is related to how precisely they are alike”
  • 7. Reasons for Clustering (continued). Figure: systems s1 to s8 grouped into clusters labelled Log management, JSON Parsing, and DB Management.
  • 8. Research question: Are OO metrics sensitive to the context of their clusters? The main goal is to investigate whether experiments in software engineering can generalize results based on populations under different contexts, and how sensitive clustering techniques are in providing such a classification.
  • 9. Types of clustering techniques used in the paper: CrossSim (graph-based similarity); clustering based on project descriptions (manually classified); LDA-informed clustering. 1. We group software systems based on the three different clustering techniques. 2. We collect the values of the OO metrics suite in each cluster. 3. We then test whether clusters are statistically different from each other, using the Kolmogorov-Smirnov (KS) hypothesis test. The aim is to reject, for every OO metric m, the null hypothesis H0,m: the samples are drawn from the same population.
  • 10. CrossSim. Based on the graph structure, one can exploit nodes, links, and the mutual relationships to compute similarity using existing graph similarity algorithms. References: P.T. Nguyen, J. Di Rocco, R. Rubei, D. Di Ruscio, An automated approach to assess the similarity of GitHub repositories, Software Quality Journal (2020). P.T. Nguyen, J. Di Rocco, D. Di Ruscio, M. Di Penta, CrossRec: Supporting software developers by recommending third-party libraries, J. Syst. Softw. 161 (2020).
  • 11. Results for CrossSim Clustering. 12 projects (6 pairs), from a larger population of 5,000 projects extracted as part of the CROSSMINER project (https://www.crossminer.org). The similarity by CrossSim is computed according to libraries, stargazers, and committers. Result: we cannot conclude that CrossSim clusters are structurally different from each other.
  • 12. Results from manual classification. Java subset of 520 projects collected out of the 5,000 projects [1,2], manually assigned to 12 categories, e.g., Communications, Database, Software Development, Text Editors, … Result: the obtained clusters result in pools of attributes that are structurally different from each other; each cluster is a standalone category, with specific (and unique) characteristics. [1] H. Borges, A. Hora, M.T. Valente, Understanding the factors that impact the popularity of GitHub repositories, in: 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, 2016, pp. 334–344. [2] H. Borges, M.T. Valente, What’s in a GitHub star? Understanding repository starring practices in a social coding platform, J. Syst. Softw. 146 (2018) 112–129.
  • 13. Results from LDA-informed Clustering. Latent Dirichlet Allocation (LDA) information retrieval method.
  • 14. Results from LDA-informed Clustering. 20 categories/domains from SourceForge; 100 most starred Java projects from GitHub. Result: strong evidence to reject the null hypothesis based on the KS test; OO attributes show differences among the different clusters.
  • 15. Take-away messages. 1. When you cluster software systems into categories, you can obtain strongly different results. 2. The interpretation of software metrics might be more sensitive to context than reported so far in the literature: the correlation among OO metrics can be extremely sensitive to application domains.
  • 16. Take-away messages. 3. We should pay more attention to the application domain of the studied systems: e.g., the metrics one should consider to analyse gaming software should be different from those used to assess the quality of security software; LOCs are less appropriate for assessing the quality of security software or, in general, of mission-critical software systems. The empirical findings might need readjustment depending on the cluster of projects they evaluate.
  • 17. Detecting Java software similarities by using different clustering techniques Andrea Capiluppi*, Davide Di Ruscio**, Juri Di Rocco**, Phuong T. Nguyen**, Nemitari Ajienka*** * Department of Computer Science, University of Groningen, The Netherlands ** Department of Information Engineering, Computer Science and Mathematics, University of L’Aquila, Italy *** Department of Computer Science, University of Nottingham, UK https://doi.org/10.1016/j.infsof.2020.106279
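
The clustering property stated on slide 6 (a pair of objects in the same cluster should be more similar to each other than either is to any object in another cluster) can be checked mechanically. The following is a minimal sketch, not the authors' implementation: the function names, the cluster memberships, and the similarity values are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def pair_similarity(sim, x, y):
    """Look up the symmetric similarity of two systems (hypothetical data layout)."""
    return sim[frozenset((x, y))]


def satisfies_cluster_property(sim, clusters):
    """Return True if every within-cluster pair is strictly more similar than
    either of its members is to any system in a different cluster."""
    label = {s: name for name, members in clusters.items() for s in members}
    for members in clusters.values():
        for a, b in combinations(members, 2):
            inside = pair_similarity(sim, a, b)
            for c in label:
                if label[c] == label[a]:
                    continue  # only compare against systems in other clusters
                if pair_similarity(sim, a, c) >= inside:
                    return False
                if pair_similarity(sim, b, c) >= inside:
                    return False
    return True


# Hypothetical similarity values for four systems (not taken from the study)
sim = {
    frozenset(("s1", "s2")): 0.9,
    frozenset(("s3", "s4")): 0.8,
    frozenset(("s1", "s3")): 0.2,
    frozenset(("s1", "s4")): 0.1,
    frozenset(("s2", "s3")): 0.3,
    frozenset(("s2", "s4")): 0.2,
}

# Hypothetical cluster memberships, using two of the labels shown on slide 7
good = {"Log management": ["s1", "s2"], "JSON Parsing": ["s3", "s4"]}
bad = {"Log management": ["s1", "s3"], "JSON Parsing": ["s2", "s4"]}

print(satisfies_cluster_property(sim, good))
print(satisfies_cluster_property(sim, bad))
```

Keying the similarities by `frozenset` keeps the lookup symmetric, so each pair only needs to be stored once.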
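
The testing step on slide 9 (comparing the values of one OO metric in two clusters with the two-sample Kolmogorov-Smirnov test) can be sketched in plain Python. This is a hedged illustration, not the code used in the paper: the function names and the metric values are hypothetical, the significance level of 0.05 is an assumption, and the p-value uses the standard asymptotic Kolmogorov series, which is only an approximation for small samples.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic D: the largest gap between the two empirical CDFs."""
    xs, ys = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in xs + ys:
        # Empirical CDF at v = fraction of the sample that is <= v
        cdf_a = bisect_right(xs, v) / len(xs)
        cdf_b = bisect_right(ys, v) / len(ys)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic p-value: 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lam^2)."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The series converges slowly here and the p-value is effectively 1
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def rejects_h0(sample_a, sample_b, alpha=0.05):
    """True if H0 (both samples come from the same population) is rejected."""
    d = ks_statistic(sample_a, sample_b)
    return ks_pvalue(d, len(sample_a), len(sample_b)) < alpha


# Hypothetical values of one OO metric in two clusters (not data from the study)
cluster_a = [3, 4, 4, 5, 6, 6, 7, 8, 9, 10]
cluster_b = [12, 14, 15, 15, 17, 18, 20, 21, 23, 25]
print(ks_statistic(cluster_a, cluster_b), rejects_h0(cluster_a, cluster_b))
```

In the procedure described on slide 9 this check would be repeated for every OO metric m and every pair of clusters; in practice a library routine such as SciPy's `ks_2samp` would normally replace the hand-written functions.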