User-Perceived Source Code Quality Estimation based on
Static Analysis Metrics
1
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
Electrical and Computer Engineering Dept., Aristotle University of Thessaloniki
Intelligent Systems & Software Engineering Labgroup, Information Processing Laboratory
Thessaloniki, Greece
Email: {mpapamic, thdiaman}@issel.ee.auth.gr, asymeon@eng.auth.gr
User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
2 Outline
 The concept of user-perceived quality.
 Research objectives.
 Key implementation points.
 The designed system.
 Evaluation.
 Conclusion and Future work.
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
3 Why to evaluate code quality?
 Various open source software projects.
 Numerous online software repositories.
Source Code Quality Evaluation
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
Code Reuse
Is a software component
suitable for reuse?
User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
4 User-Perceived source code quality
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
Idea:
 Use of software components popularity as a quality indicator.
But:
 Popularity cannot be used as a sole quality criterion.
- Is based on current trends.
- Depends on the programming language.
Popularity
Static Analysis
Metrics
Recommended
Coding Practices
+ +
Measure of
quality
User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
5 The idea
User-Perceived Quality Estimation
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
Idea Tools Used Proposed System
 Use of software components
popularity as ground truth –
GitHub number of stars
 Use of static analysis metrics
and violations of “good”
coding practices
 Apply machine learning
techniques for estimating
user-perceived source code
quality
Static
Analysis
Quality
Evaluation
Models
Quality
Score
User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
6 Key implementation points
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
 Qualitative evaluation of the selected repositories.
 Training set formation.
 Target set formation.
 Quality estimation models.
User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
7 Training dataset
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
Top 100
Repositories
GitHub
24930
files
Training Dataset
Qualitative Evaluation
User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
8 Selected repositories qualitative evaluation
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
PMD Ruleset Percentage (%) of files
containing severe violations
PMD Ruleset Percentage (%) of files
containing severe violations
Priority 1 Priority2 Priority 1 Priority2
Unused Code 0.0% 0.0% Coupling 0.0% 0.0%
Basic 0.015% 0.337% Design 3.37% 3.9%
Braces 0.0% 0.0% Empty 0.0% 0.0%
Comments 0.0% 0.0% Finalizers 0.0% 0.0%
Naming 14.11% 0.45% Optimizations 0.0% 0.0%
Clone 0.0% 0.0% Strict
Exception
4.99% 0.0%
CodeSize 0.0% 0.0% Strings 0.0% 0.06%
Controversial 1.75% 1.58% Unnecessary 0.0% 0.0%
Very small
percentage of
files contain
severe
violations
9
Target set formation
 Use of GitHub stars as ground truth.
But:
 GitHub stars per repository (NOT per file)
 Every source code file is of different importance
 Big differences in the number of files between
repositories
10000
x stars y stars z stars
Dependency
Analysis
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
10
Target set formation
For the i-th file of the j-th repository, the target if formulated as follows:
𝐹𝑠𝑐𝑜𝑟𝑒 𝑖, 𝑗 = log
𝑅 𝑠𝑡𝑎𝑟𝑠 𝑗
𝑛 𝑓𝑖𝑙𝑒𝑠 𝑗
+
𝑑𝑒𝑝 𝑖
𝑛 𝑓𝑖𝑙𝑒𝑠 𝑗
∗ 𝑅 𝑠𝑡𝑎𝑟𝑠 𝑗
Smoothing
factor A base score to all
files in the same
repository
Added value
according to the
significance of the
source code file
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
11 Quality Evaluation Models
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
ANNs Model
 Input: The values of 73 static analysis metrics.
 Output: User-Perceived source code quality estimation
 Applicable only for source code files that exceed minimum
quality threshold
SVMs - One Class Classifier
 Used to rule out low quality code.
One Class
Classifier
ANNs Model
Accepted
Static
Analysis Quality
Estimation
User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
12 ANNs Model
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
 Two-layer feedforward network.
 Levenberg-Marquardt algorithm (LMA) for
adjusting the weights and the biases.
 (Training, Validation, Test) = (70%, 15%,
15%).
 Applicable only for source code files that
exceed minimum quality threshold.
User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
13 SVMs One Class Classifier
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
 Used to rule out low quality code.
 Gaussian radial basis kernel function.
 Training involved the use of 7 metrics:
 Average Block Depth,
 Average Cyclomatic Complexity
 Average Depth of Inheritance Hierarchy
 Average Line of Codes Per Method
 Comments Ratio
 Distance
 Lines Of Code
 (nu, gamma, tolerance) = (0.1, 0.01, 0.01)
User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
1124 false-
positives
14 System Evaluation
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
Results validation:
 Quantitative: Using PMD
 Qualitative: Examination of a representative sample of files and their
metrics
Evaluation on three main axes:
1. The system's ability to distinguish high quality source code files.
2. The effectiveness of the model for estimating the quality of files
exceeding a quality threshold.
3. The accuracy of predicting the popularity of Java repositories given
their source code files.
User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
15 System Evaluation
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
Repositories selected:
 8 random typical GitHub projects chosen independently.
 lines-of-code-per-file ratio around 100, including also several extreme
cases.
 Both human and auto-generated code.
 The auto-generated projects are expected to be of high quality.
 Follow all coding conventions.
 Are architecturally and functionally complete.
User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
16 Evaluation – One Class Classifier
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
The percentage of the
rejected files that
contained severe
violations is very high
17 Evaluation – ANNs Model
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
The quality score reflects
the characteristics of the
repositories
18 Evaluation – Popularity Prediction
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
𝑅 𝑠𝑡𝑎𝑟𝑠 𝑖, 𝑗 =
𝑒 𝐹𝑠𝑐𝑜𝑟𝑒(𝑖,𝑗)
∙ 𝑛 𝑓𝑖𝑙𝑒𝑠(𝑗)
1 + 𝑑𝑒𝑝(𝑖)
𝑅 𝑠𝑡𝑎𝑟𝑠 𝑗 =
𝑖=1
𝑛 𝑓𝑖𝑙𝑒𝑠(𝑗)
𝑑𝑒𝑝(𝑖)
𝑑𝑒𝑝𝑡𝑜𝑡𝑎𝑙(𝑗)
∙ 𝑅 𝑠𝑡𝑎𝑟𝑠 𝑖, 𝑗
User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
19 Conclusions and future work
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
Conclusions:
 Reliable determination of the area of high quality source code based
on static analysis metrics.
 Effective user-perceived source code quality estimation.
Future Work:
 Further investigation of the response of our model in different
scenarios.
 Expansion of the ground truth coverage by using more metrics.
 Application of feature selection techniques in order to drop
overlapping metrics.
User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
20
Thank you!
IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
User-Perceived Source Code Quality Estimation based on Static Analysis Metrics
Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis

User-Perceived Source Code Quality Estimation based on Static Analysis Metrics

  • 1.
    User-Perceived Source CodeQuality Estimation based on Static Analysis Metrics 1 Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis Electrical and Computer Engineering Dept., Aristotle University of Thessaloniki Intelligent Systems & Software Engineering Labgroup, Information Processing Laboratory Thessaloniki, Greece Email: {mpapamic, thdiaman}@issel.ee.auth.gr, asymeon@eng.auth.gr User-Perceived Source Code Quality Estimation based on Static Analysis Metrics Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis IEEE International Conference on Software Quality, Reliability & Security – QRS 2016
  • 2.
    2 Outline  Theconcept of user-perceived quality.  Research objectives.  Key implementation points.  The designed system.  Evaluation.  Conclusion and Future work. IEEE International Conference on Software Quality, Reliability & Security – QRS 2016 User-Perceived Source Code Quality Estimation based on Static Analysis Metrics Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
  • 3.
    3 Why toevaluate code quality?  Various open source software projects.  Numerous online software repositories. Source Code Quality Evaluation IEEE International Conference on Software Quality, Reliability & Security – QRS 2016 Code Reuse Is a software component suitable for reuse? User-Perceived Source Code Quality Estimation based on Static Analysis Metrics Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
  • 4.
    4 User-Perceived sourcecode quality IEEE International Conference on Software Quality, Reliability & Security – QRS 2016 Idea:  Use of software components popularity as a quality indicator. But:  Popularity cannot be used as a sole quality criterion. - Is based on current trends. - Depends on the programming language. Popularity Static Analysis Metrics Recommended Coding Practices + + Measure of quality User-Perceived Source Code Quality Estimation based on Static Analysis Metrics Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
  • 5.
    5 The idea User-PerceivedQuality Estimation IEEE International Conference on Software Quality, Reliability & Security – QRS 2016 Idea Tools Used Proposed System  Use of software components popularity as ground truth – GitHub number of stars  Use of static analysis metrics and violations of “good” coding practices  Apply machine learning techniques for estimating user-perceived source code quality Static Analysis Quality Evaluation Models Quality Score User-Perceived Source Code Quality Estimation based on Static Analysis Metrics Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
  • 6.
    6 Key implementationpoints IEEE International Conference on Software Quality, Reliability & Security – QRS 2016  Qualitative evaluation of the selected repositories.  Training set formation.  Target set formation.  Quality estimation models. User-Perceived Source Code Quality Estimation based on Static Analysis Metrics Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
  • 7.
    7 Training dataset IEEEInternational Conference on Software Quality, Reliability & Security – QRS 2016 Top 100 Repositories GitHub 24930 files Training Dataset Qualitative Evaluation User-Perceived Source Code Quality Estimation based on Static Analysis Metrics Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
  • 8.
    8 Selected repositoriesqualitative evaluation IEEE International Conference on Software Quality, Reliability & Security – QRS 2016 User-Perceived Source Code Quality Estimation based on Static Analysis Metrics Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis PMD Ruleset Percentage (%) of files containing severe violations PMD Ruleset Percentage (%) of files containing severe violations Priority 1 Priority2 Priority 1 Priority2 Unused Code 0.0% 0.0% Coupling 0.0% 0.0% Basic 0.015% 0.337% Design 3.37% 3.9% Braces 0.0% 0.0% Empty 0.0% 0.0% Comments 0.0% 0.0% Finalizers 0.0% 0.0% Naming 14.11% 0.45% Optimizations 0.0% 0.0% Clone 0.0% 0.0% Strict Exception 4.99% 0.0% CodeSize 0.0% 0.0% Strings 0.0% 0.06% Controversial 1.75% 1.58% Unnecessary 0.0% 0.0% Very small percentage of files contain severe violations
  • 9.
    9 Target set formation Use of GitHub stars as ground truth. But:  GitHub stars per repository (NOT per file)  Every source code file is of different importance  Big differences in the number of files between repositories 10000 x stars y stars z stars Dependency Analysis IEEE International Conference on Software Quality, Reliability & Security – QRS 2016 User-Perceived Source Code Quality Estimation based on Static Analysis Metrics Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
  • 10.
    10 Target set formation Forthe i-th file of the j-th repository, the target if formulated as follows: 𝐹𝑠𝑐𝑜𝑟𝑒 𝑖, 𝑗 = log 𝑅 𝑠𝑡𝑎𝑟𝑠 𝑗 𝑛 𝑓𝑖𝑙𝑒𝑠 𝑗 + 𝑑𝑒𝑝 𝑖 𝑛 𝑓𝑖𝑙𝑒𝑠 𝑗 ∗ 𝑅 𝑠𝑡𝑎𝑟𝑠 𝑗 Smoothing factor A base score to all files in the same repository Added value according to the significance of the source code file IEEE International Conference on Software Quality, Reliability & Security – QRS 2016 User-Perceived Source Code Quality Estimation based on Static Analysis Metrics Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
  • 11.
    11 Quality EvaluationModels IEEE International Conference on Software Quality, Reliability & Security – QRS 2016 ANNs Model  Input: The values of 73 static analysis metrics.  Output: User-Perceived source code quality estimation  Applicable only for source code files that exceed minimum quality threshold SVMs - One Class Classifier  Used to rule out low quality code. One Class Classifier ANNs Model Accepted Static Analysis Quality Estimation User-Perceived Source Code Quality Estimation based on Static Analysis Metrics Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
  • 12.
    12 ANNs Model IEEEInternational Conference on Software Quality, Reliability & Security – QRS 2016  Two-layer feedforward network.  Levenberg-Marquardt algorithm (LMA) for adjusting the weights and the biases.  (Training, Validation, Test) = (70%, 15%, 15%).  Applicable only for source code files that exceed minimum quality threshold. User-Perceived Source Code Quality Estimation based on Static Analysis Metrics Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
  • 13.
    13 SVMs OneClass Classifier IEEE International Conference on Software Quality, Reliability & Security – QRS 2016  Used to rule out low quality code.  Gaussian radial basis kernel function.  Training involved the use of 7 metrics:  Average Block Depth,  Average Cyclomatic Complexity  Average Depth of Inheritance Hierarchy  Average Line of Codes Per Method  Comments Ratio  Distance  Lines Of Code  (nu, gamma, tolerance) = (0.1, 0.01, 0.01) User-Perceived Source Code Quality Estimation based on Static Analysis Metrics Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis 1124 false- positives
  • 14.
    14 System Evaluation IEEEInternational Conference on Software Quality, Reliability & Security – QRS 2016 Results validation:  Quantitative: Using PMD  Qualitative: Examination of a representative sample of files and their metrics Evaluation on three main axes: 1. The system's ability to distinguish high quality source code files. 2. The effectiveness of the model for estimating the quality of files exceeding a quality threshold. 3. The accuracy of predicting the popularity of Java repositories given their source code files. User-Perceived Source Code Quality Estimation based on Static Analysis Metrics Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
  • 15.
    15 System Evaluation IEEEInternational Conference on Software Quality, Reliability & Security – QRS 2016 Repositories selected:  8 random typical GitHub projects chosen independently.  lines-of-code-per-file ratio around 100, including also several extreme cases.  Both human and auto-generated code.  The auto-generated projects are expected to be of high quality.  Follow all coding conventions.  Are architecturally and functionally complete. User-Perceived Source Code Quality Estimation based on Static Analysis Metrics Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
  • 16.
    16 Evaluation –One Class Classifier IEEE International Conference on Software Quality, Reliability & Security – QRS 2016 User-Perceived Source Code Quality Estimation based on Static Analysis Metrics Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis The percentage of the rejected files that contained severe violations is very high
  • 17.
    17 Evaluation –ANNs Model IEEE International Conference on Software Quality, Reliability & Security – QRS 2016 User-Perceived Source Code Quality Estimation based on Static Analysis Metrics Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis The quality score reflects the characteristics of the repositories
  • 18.
    18 Evaluation –Popularity Prediction IEEE International Conference on Software Quality, Reliability & Security – QRS 2016 𝑅 𝑠𝑡𝑎𝑟𝑠 𝑖, 𝑗 = 𝑒 𝐹𝑠𝑐𝑜𝑟𝑒(𝑖,𝑗) ∙ 𝑛 𝑓𝑖𝑙𝑒𝑠(𝑗) 1 + 𝑑𝑒𝑝(𝑖) 𝑅 𝑠𝑡𝑎𝑟𝑠 𝑗 = 𝑖=1 𝑛 𝑓𝑖𝑙𝑒𝑠(𝑗) 𝑑𝑒𝑝(𝑖) 𝑑𝑒𝑝𝑡𝑜𝑡𝑎𝑙(𝑗) ∙ 𝑅 𝑠𝑡𝑎𝑟𝑠 𝑖, 𝑗 User-Perceived Source Code Quality Estimation based on Static Analysis Metrics Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
  • 19.
    19 Conclusions andfuture work IEEE International Conference on Software Quality, Reliability & Security – QRS 2016 Conclusions:  Reliable determination of the area of high quality source code based on static analysis metrics.  Effective user-perceived source code quality estimation. Future Work:  Further investigation of the response of our model in different scenarios.  Expansion of the ground truth coverage by using more metrics.  Application of feature selection techniques in order to drop overlapping metrics. User-Perceived Source Code Quality Estimation based on Static Analysis Metrics Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis
  • 20.
    20 Thank you! IEEE InternationalConference on Software Quality, Reliability & Security – QRS 2016 User-Perceived Source Code Quality Estimation based on Static Analysis Metrics Michail Papamichail, Themistoklis Diamantopoulos and Andreas Symeonidis