User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
Electrical and Computer Engineering Dept., Aristotle University of Thessaloniki
Intelligent Systems & Software Engineering Labgroup, Information Processing Laboratory
Thessaloniki, Greece
Email: {mpapamic, thdiaman}@issel.ee.auth.gr, asymeon@eng.auth.gr
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
2. Outline
The concept of user-perceived quality.
Research objectives.
Key implementation points.
The designed system.
Evaluation.
Conclusion and Future work.
3. Why evaluate code quality?
Various open source software projects.
Numerous online software repositories.
Source Code Quality Evaluation
Code Reuse: Is a software component suitable for reuse?
4. User-Perceived source code quality
Idea:
Use the popularity of software components as a quality indicator.
But:
Popularity cannot be used as the sole quality criterion.
- It is based on current trends.
- It depends on the programming language.
Popularity + Static Analysis Metrics + Recommended Coding Practices → Measure of quality
5. The idea
User-Perceived Quality Estimation
Idea: Use the popularity of software components (GitHub number of stars) as ground truth.
Tools Used: Static analysis metrics and violations of “good” coding practices.
Proposed System: Apply machine learning techniques for estimating user-perceived source code quality.
Static Analysis → Quality Evaluation Models → Quality Score
6. Key implementation points
Qualitative evaluation of the selected repositories.
Training set formation.
Target set formation.
Quality estimation models.
7. Training dataset
Top 100 GitHub repositories (24930 files) → Qualitative Evaluation → Training Dataset
8. Selected repositories qualitative evaluation
Percentage (%) of files containing severe violations, per PMD ruleset:
PMD Ruleset         Priority 1    Priority 2
Unused Code         0.0%          0.0%
Basic               0.015%        0.337%
Braces              0.0%          0.0%
Comments            0.0%          0.0%
Naming              14.11%        0.45%
Clone               0.0%          0.0%
CodeSize            0.0%          0.0%
Controversial       1.75%         1.58%
Coupling            0.0%          0.0%
Design              3.37%         3.9%
Empty               0.0%          0.0%
Finalizers          0.0%          0.0%
Optimizations       0.0%          0.0%
Strict Exception    4.99%         0.0%
Strings             0.0%          0.06%
Unnecessary         0.0%          0.0%
A very small percentage of files contain severe violations.
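The percentages above count files that contain at least one severe (Priority 1 or Priority 2) violation in a given ruleset. The following is a minimal sketch of that counting step, not the authors' tooling: the `(file, ruleset, priority)` tuple format, the function name `severe_violation_percentages` and the sample data are assumptions made for illustration, and parsing of actual PMD reports is left out.

```python
from collections import defaultdict


def severe_violation_percentages(violations, total_files, priorities=(1, 2)):
    """Return {ruleset: {priority: % of files with at least one such violation}}.

    `violations` is an iterable of (file_path, ruleset, priority) tuples.
    This input format is hypothetical, chosen only for the example.
    """
    if total_files <= 0:
        raise ValueError("total_files must be positive")

    # Collect the distinct files hit per (ruleset, priority) pair, so that
    # several violations in the same file are counted only once.
    files_hit = defaultdict(set)
    for path, ruleset, priority in violations:
        if priority in priorities:
            files_hit[(ruleset, priority)].add(path)

    result = defaultdict(dict)
    for (ruleset, priority), paths in files_hit.items():
        result[ruleset][priority] = 100.0 * len(paths) / total_files
    return dict(result)


# Made-up sample: 4 files in total, priority-3 findings are ignored.
sample = [
    ("A.java", "Naming", 1),
    ("A.java", "Naming", 1),
    ("B.java", "Naming", 1),
    ("C.java", "Design", 2),
    ("C.java", "Design", 3),
]
report = severe_violation_percentages(sample, total_files=4)
```

Rulesets with no severe violations do not appear in the returned dictionary; they correspond to the 0.0% rows of the table.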
9. Target set formation
Use of GitHub stars as ground truth.
But:
- GitHub stars are given per repository (NOT per file).
- Every source code file is of different importance.
- There are big differences in the number of files between repositories.
Dependency Analysis (figure: repositories with x, y and z stars)
10. Target set formation
For the i-th file of the j-th repository, the target is formulated as follows:
$$F_{score}(i, j) = \log\left( \frac{R_{stars}(j)}{n_{files}(j)} + \frac{dep(i)}{n_{files}(j)} \cdot R_{stars}(j) \right)$$
- log: smoothing factor.
- R_stars(j) / n_files(j): a base score given to all files in the same repository.
- dep(i) / n_files(j) * R_stars(j): added value according to the significance of the source code file.
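The target formula can be written as a short function. This is a sketch under stated assumptions: the slides do not give the base of the logarithm, so the natural logarithm is assumed here, and `dep` is taken to be a non-negative number produced by the dependency analysis. The function and parameter names are illustrative, not the authors'.

```python
import math


def file_target_score(repo_stars, n_files, dep):
    """Target score of one file, following the slide's formula.

    repo_stars: stars of the repository the file belongs to (R_stars(j))
    n_files:    number of files in that repository (n_files(j))
    dep:        significance value of the file (dep(i)); assumed >= 0
    """
    if repo_stars <= 0 or n_files <= 0:
        raise ValueError("repo_stars and n_files must be positive")
    if dep < 0:
        raise ValueError("dep must be non-negative")

    base_score = repo_stars / n_files         # shared by all files of the repository
    added_value = dep / n_files * repo_stars  # grows with the file's significance
    # Natural logarithm is an assumption; it acts as the smoothing factor.
    return math.log(base_score + added_value)


# Illustrative numbers only: a repository with 1000 stars and 100 files.
score_plain = file_target_score(1000, 100, dep=0)
score_central = file_target_score(1000, 100, dep=5)
```

With these illustrative numbers, the file with `dep=0` receives only the base score, while the file with `dep=5` receives a higher target.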
11. Quality Evaluation Models
ANNs Model
Input: The values of 73 static analysis metrics.
Output: User-Perceived source code quality estimation
Applicable only for source code files that exceed a minimum quality threshold.
SVMs - One Class Classifier
Used to rule out low quality code.
Static Analysis → One Class Classifier → (Accepted) → ANNs Model → Quality Estimation
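The two-stage flow (the one-class classifier rules out low quality files, and the ANNs model scores only the accepted ones) can be sketched as a small gate function. Everything named here is hypothetical: `estimate_quality`, `toy_gate` and `toy_scorer` are stand-ins for illustration and do not reproduce the trained SVM or ANN models.

```python
def estimate_quality(metrics, is_accepted, score_model):
    """Return a quality estimate, or None if the file is ruled out.

    is_accepted: callable playing the role of the one-class classifier
    score_model: callable playing the role of the ANNs model
    """
    if not is_accepted(metrics):
        return None  # rejected files never reach the scoring model
    return score_model(metrics)


# Toy stand-ins (assumptions for the example, not the trained models).
def toy_gate(metrics):
    return metrics["lines_of_code"] <= 1000


def toy_scorer(metrics):
    return 10.0 - metrics["avg_cyclomatic_complexity"]


accepted = estimate_quality(
    {"lines_of_code": 120, "avg_cyclomatic_complexity": 2.0}, toy_gate, toy_scorer
)
rejected = estimate_quality(
    {"lines_of_code": 5000, "avg_cyclomatic_complexity": 2.0}, toy_gate, toy_scorer
)
```

Passing the two models in as callables keeps the gate logic independent of how each model is trained.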
12. ANNs Model
Two-layer feedforward network.
Levenberg-Marquardt algorithm (LMA) for adjusting the weights and the biases.
(Training, Validation, Test) = (70%, 15%, 15%).
Applicable only for source code files that exceed minimum quality threshold.
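The 70% / 15% / 15% partition can be illustrated with a small helper. This is a sketch only: the slides do not say how samples were assigned to each subset, so the random shuffle, the seed and the name `split_indices` are assumptions.

```python
import random


def split_indices(n_samples, seed=0):
    """Split sample indices into (training, validation, test) = (70%, 15%, 15%).

    Integer arithmetic is used for the subset sizes; any remainder left by
    rounding down goes to the test subset.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random assignment is an assumption

    n_train = n_samples * 70 // 100
    n_val = n_samples * 15 // 100

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# Using the training dataset size from the slides (24930 files).
train_idx, val_idx, test_idx = split_indices(24930)
```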
13. SVMs One Class Classifier
Used to rule out low quality code.
Gaussian radial basis kernel function.
Training involved the use of 7 metrics:
- Average Block Depth
- Average Cyclomatic Complexity
- Average Depth of Inheritance Hierarchy
- Average Lines of Code Per Method
- Comments Ratio
- Distance
- Lines Of Code
(nu, gamma, tolerance) = (0.1, 0.01, 0.01)
1124 false-positives
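For reference, the Gaussian radial basis kernel is commonly written as k(x, y) = exp(-gamma * ||x - y||^2). The sketch below evaluates it for two metric vectors; it assumes that the gamma value on the slide (0.01) follows this common parameterization, which the slides do not state explicitly, and `rbf_kernel` is an illustrative name.

```python
import math


def rbf_kernel(x, y, gamma=0.01):
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical vectors give similarity 1; similarity decays with distance.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0])
apart = rbf_kernel([0.0], [10.0])
```

Because the kernel depends on raw distances, metrics with large ranges (for example Lines Of Code) would dominate unless the inputs are scaled; whether scaling was applied is not stated in the slides.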
14. System Evaluation
Results validation:
Quantitative: Using PMD
Qualitative: Examination of a representative sample of files and their metrics
Evaluation on three main axes:
1. The system's ability to distinguish high quality source code files.
2. The effectiveness of the model for estimating the quality of files exceeding a quality threshold.
3. The accuracy of predicting the popularity of Java repositories given their source code files.
15. System Evaluation
Repositories selected:
8 random, typical GitHub projects, chosen independently.
Lines-of-code-per-file ratio around 100, with several extreme cases also included.
Both human-written and auto-generated code.
The auto-generated projects are expected to be of high quality:
- They follow all coding conventions.
- They are architecturally and functionally complete.
16. Evaluation: One Class Classifier
The percentage of the rejected files that contained severe violations is very high.
17. Evaluation: ANNs Model
The quality score reflects the characteristics of the repositories.
19. Conclusions and future work
Conclusions:
Reliable determination of the area of high quality source code based on static analysis metrics.
Effective user-perceived source code quality estimation.
Future Work:
Further investigation of the response of our model in different scenarios.
Expansion of the ground truth coverage by using more metrics.
Application of feature selection techniques in order to drop overlapping metrics.
20. Thank you!