WekaNose presentation

Machine Learning based Code Smell detection through WekaNose
A research accomplished by:
ESSeRE Lab - University of Milan Bicocca

What is a Code Smell?
A code smell is a surface indication that usually corresponds to a deeper problem in the system.
More specifically:
● A code smell has to be sniffable (something that's quick to spot);
● A code smell doesn't always indicate a problem.
Machine Learning based Code Smell detection through WekaNose, September 2018

What is the problem with state-of-the-art approaches?
● Code smells can be subjectively interpreted;
● The results provided by detectors are usually different;
● The agreement in the results is scarce;
The metrics and thresholds selection problems

Why should we bother?

Why should we use Machine Learning algorithms?
● This approach makes it possible to exploit the full interpretability of Code Smells;
● This approach only requires describing the Code Smell, rather than formalizing a definition;
● The Machine Learning algorithm learns the concept from data.

What is WekaNose?
WekaNose is a tool that supports a workflow for training Machine Learning algorithms specifically for Code Smell detection.
The whole process is divided into three main parts:
● the creation of the dataset;
● the training and testing of the Machine Learning algorithms;
● the evaluation of the Machine Learning algorithms' performance, in terms of Code Smell detection.

Main problems
● Creation of a balanced dataset, in terms of:
○ Labels;
○ Statistical properties;
● Machine Learning algorithm selection:
○ No free lunch theorem;
● Does high performance in the Machine Learning context imply high performance in actual Code Smell detection?

WekaNose’s workflow
● Describe the Code Smell;
● Select a collection of heterogeneous systems;
● Extract code metrics from all the systems;
● Use Code Smell advisors to sample candidates;
● Label the instances;
● Choose the Machine Learning algorithms;
● Perform the Machine Learning parameter optimization;
● Compare the Machine Learning algorithms with each other;
● Use the SonarQube plug-in to actually detect the Code Smell.
Umberto Azadi, Francesca Arcelli Fontana, and Marco Zanoni. 2018. Machine learning based code smell detection through WekaNose. In Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings (ICSE '18). ACM, New York, NY, USA, 288-289. DOI: https://doi.org/10.1145/3183440.3194974

Describe the Code Smell
Data Class:
The Data Class Code Smell refers to classes that store data without providing complex functionality, and that other classes strongly rely on. A Data Class exposes many attributes, is not complex, and exposes its data fields through accessor methods.
Switch Statements:
The Switch Statements Code Smell refers to methods that contain a complex switch operator or a sequence of if statements that compromises the readability and/or the clarity of the code.

Select a collection of heterogeneous systems
An example is the Qualitas Corpus, a curated collection of 112 software systems intended to be used for empirical studies of code artefacts.

Extract code metrics from all the systems
● A large set of object-oriented metrics at method, class and package levels has been taken into account; these metrics are the independent variables in our machine learning approach;
● All metrics have been computed through “Design Features and Metrics for Java” (DFMC4J) [1].
[1] http://essere.disco.unimib.it/wiki/jcodeodor_doc

Use Code Smell advisors to sample candidates
An Advisor is a deterministic rule, implemented locally or in an external tool, that classifies a code element (method or class), indicating whether or not it is affected by a code smell.
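As an illustration only, an Advisor can be sketched as a plain function from a metrics record to a yes/no answer. The metric names (`accessor_count`, `wmc`, `public_attribute_count`) and every threshold below are hypothetical and are not taken from WekaNose or DFMC4J; the sketch only shows the idea of several deterministic rules voting on the same code element.

```python
from typing import Callable, Dict, List

Metrics = Dict[str, float]
Advisor = Callable[[Metrics], bool]

# Hypothetical Data Class advisors: names and thresholds are assumptions.
def many_accessors(m: Metrics) -> bool:
    return m["accessor_count"] >= 4

def low_complexity(m: Metrics) -> bool:
    return m["wmc"] <= 10

def mostly_exposed_data(m: Metrics) -> bool:
    return m["public_attribute_count"] + m["accessor_count"] >= 6

DATA_CLASS_ADVISORS: List[Advisor] = [
    many_accessors,
    low_complexity,
    mostly_exposed_data,
]

def positive_votes(metrics: Metrics, advisors: List[Advisor]) -> int:
    """Count how many advisors flag the code element as smelly."""
    return sum(1 for advisor in advisors if advisor(metrics))

# A made-up class with many accessors and little logic.
example_class = {"accessor_count": 8, "wmc": 6, "public_attribute_count": 1}
votes = positive_votes(example_class, DATA_CLASS_ADVISORS)
```

The number of positive votes, rather than any single rule, is what a sampling step could use to pick candidates for manual labelling.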

Label the instances

Choose the Machine Learning algorithms
Five Machine Learning algorithms have been considered so far:
● J48 (x3)
● Random Forest
● Naïve Bayes
● JRip
● SVM (x10)
Counting the variants of J48 and SVM, this gives 16 configurations. Each has been trained with and without the boosting technique AdaBoostM1. Therefore 32 algorithms have been trained, tested and compared.
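The count of 32 can be checked with a short enumeration. This is only a sketch of the arithmetic: the slides give the number of variants per algorithm (x3, x10) but not what the variants are, so they are represented here by plain indices.

```python
# Number of variants per base algorithm, as listed on the slide.
base_variants = {
    "J48": 3,
    "Random Forest": 1,
    "Naive Bayes": 1,
    "JRip": 1,
    "SVM": 10,
}

# One entry per (algorithm, variant index, boosted with AdaBoostM1 or not).
configurations = [
    (name, variant, boosted)
    for name, count in base_variants.items()
    for variant in range(1, count + 1)
    for boosted in (False, True)
]

unboosted_total = sum(base_variants.values())
trained_total = len(configurations)
```

Here `unboosted_total` is the number of plain configurations and `trained_total` doubles it to account for the boosted counterparts.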

Perform the Machine Learning parameter optimization
A set of parameters is considered the best if it maximises the performance of the machine learning algorithm.
WEKA provides a set of algorithms that perform a grid search for this purpose, and they can be used in WekaNose.
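The idea of such a search can be sketched in a few lines: try every combination of candidate parameter values and keep the one with the highest score. This is not WEKA's implementation. The scoring function, the toy data and the parameter names (`min_loc`, `min_complexity`) are assumptions made for illustration; in practice the score would be a cross-validated performance measure of the classifier being tuned.

```python
from itertools import product

def grid_search(param_grid, score):
    """Exhaustively score every parameter combination; keep the best one."""
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current > best_score:  # ties keep the first combination found
            best_params, best_score = params, current
    return best_params, best_score

# Toy labelled data: (lines of code, cyclomatic complexity, is smelly).
TOY_METHODS = [
    (10, 2, False), (25, 4, False), (40, 5, False),
    (80, 12, True), (120, 9, True), (150, 20, True),
]

def toy_accuracy(params):
    """Accuracy of a two-threshold rule on the toy data (stand-in score)."""
    hits = 0
    for loc, complexity, label in TOY_METHODS:
        predicted = (loc >= params["min_loc"]
                     and complexity >= params["min_complexity"])
        hits += predicted == label
    return hits / len(TOY_METHODS)

best_params, best_score = grid_search(
    {"min_loc": [20, 50, 100], "min_complexity": [3, 8, 15]},
    toy_accuracy,
)
```

The cost grows with the product of the candidate list lengths, which is why the grids are usually kept coarse.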
Eibe Frank, Mark A. Hall, and Ian H. Witten (2016). The WEKA Workbench. Online Appendix for "Data Mining:
Practical Machine Learning Tools and Techniques", Morgan Kaufmann, Fourth Edition, 2016.

Compare the Machine Learning algorithms with each other
The WEKA Experimenter [1] has been used to compare the Machine Learning algorithms, specifically:
● 10-fold cross-validation with 10 repetitions was performed for each classifier, using the best parameters found;
● The performances were compared using the corrected paired t-test.
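A minimal sketch of the statistic, assuming the correction meant here is the corrected resampled t-test of Nadeau and Bengio, which inflates the variance term to account for the overlap between training sets across runs. The function name and arguments are illustrative, not part of WEKA's API.

```python
import math
from statistics import mean, variance

def corrected_paired_t(scores_a, scores_b, test_train_ratio):
    """Corrected resampled t statistic for paired per-run scores.

    scores_a, scores_b: one score per run for each classifier, where both
    classifiers were evaluated on the same train/test splits.
    test_train_ratio: test set size divided by training set size
    (1/9 for 10-fold cross-validation).
    The result is compared against a t distribution with k - 1 degrees
    of freedom, where k is the number of runs.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    var = variance(diffs)  # sample variance of the paired differences
    if var == 0:
        raise ValueError("all paired differences are identical")
    return mean(diffs) / math.sqrt((1.0 / k + test_train_ratio) * var)

def uncorrected_paired_t(scores_a, scores_b):
    """Standard paired t statistic, shown only for comparison."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / math.sqrt(variance(diffs) / len(diffs))
```

With 10 repetitions of 10-fold cross-validation there are k = 100 paired scores per classifier. The corrected statistic is always smaller in magnitude than the uncorrected one, so it declares a difference significant less readily.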
[1] Eibe Frank, Mark A. Hall, and Ian H. Witten (2016). The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques", Morgan Kaufmann, Fourth Edition, 2016.

Use the SonarQube plug-in to actually detect the Code Smell
Sonar-WekaNose-Plugin is a SonarQube plugin that analyzes Java code and reports the presence of Code Smells by using the Machine Learning algorithms trained through WekaNose.
Alessandro Polastri (2018), Bachelor Thesis: “SonarQube plugin for Code Smell
detection through machine learning techniques”.

Results obtained so far: Algorithm Comparison
[1] Arcelli Fontana, Francesca & Mäntylä, Mika & Zanoni, Marco & Marino, Alessandro. (2015). Comparing and experimenting machine learning techniques for code smell detection. Empirical
Software Engineering. 21. DOI: https://doi.org/10.1007/s10664-015-9378-4;
[2] Umberto Azadi (2017), Bachelor Thesis: “Code smell detection through machine learning techniques”.

Best Algorithm per Code Smell
Code Smell             Machine Learning Algorithm
Data Class             Boosted J48 Pruned
God Class              Naïve Bayes
Feature Envy           Boosted JRip
Long Method            Boosted J48 Unpruned
Long Parameter List    Boosted J48 Unpruned
Switch Statements      JRip

Threats to validity
● Threats to internal validity:
○ The manual evaluation of code smells is subject to a certain degree of error, which depends on the developer’s experience, knowledge, ability to understand design issues, and other factors.
● Threats to external validity:
○ Code Smell candidates were selected by random sampling, stratified according to the number of positive results from the code smell Advisors. This criterion increases the probability of selecting instances affected by a code smell, but the sampling method could distort the training set, because the selection criterion is only partly random.
● Experimental limitations:
○ The selected systems are all open source;
○ The metrics used are only those computed by DFMC4J.
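The sampling criterion discussed under external validity can be sketched as follows. The slides do not say how many candidates are drawn from each stratum, so drawing an equal number per stratum is an assumption, as are the function and parameter names.

```python
import random
from collections import defaultdict

def stratified_sample(candidates, per_stratum, seed=0):
    """Randomly sample candidates, stratified by positive Advisor count.

    candidates: list of (element_id, positive_advisor_count) pairs.
    per_stratum: how many elements to draw from each stratum (assumed
    equal for every stratum; smaller strata are taken whole).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for element_id, votes in candidates:
        strata[votes].append(element_id)
    sample = []
    for votes in sorted(strata):
        members = strata[votes]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

# Made-up population: most elements are flagged by no Advisor at all.
population = [(f"clean_{i}", 0) for i in range(100)]
population += [(f"suspect_{i}", 3) for i in range(5)]
picked = stratified_sample(population, per_stratum=5)
```

In this made-up population under 5% of the elements are flagged, yet half of the sample is. That is the intended effect (more likely-smelly instances to label) and also the source of the distortion noted above.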

Future Development
● Evaluate the severity (not just the presence) of the Code Smell;
● Evaluate Active Learning techniques in this context;
● Use social platforms to collect data;
● Train new machine learning algorithms in order to increase the number of identified code smells;
● Understand the correlation between the performance measures used to evaluate the machine
learning algorithms and the performance measures used to estimate the goodness of a code
smell detector;
● Design and evaluate hybrid rules that use both the machine learning algorithms and the rules set
by the user to perform a detection;
● Improve the comparison between machine learning-based and rule-based code smell detection:
○ Evaluate whether some algorithms tend to guarantee high performance in actual detection, by running comparisons on large software systems and on software systems belonging to a specific application domain;
○ Extend the comparison by considering other code smell detection tools, based on different techniques.

Thank you for your attention
http://essere.disco.unimib.it/wiki/wekanose
