1. Comparing Design and Code Metrics for Software Quality Prediction
   Y. Jiang, B. Cukic, T. Menzies
   Lane Department of CSEE, West Virginia University
   PROMISE 2008
2. Predicting Faults Earlier Matters
   - Boehm observed that fault removal is 50 to 200 times less costly when performed in the design phase rather than after deployment.
   - NASA research shows that a fault introduced in the requirements, which leaks into the design, code, test, integration, and operational phases, incurs correction cost factors of 5, 10, 50, 130, and 368, respectively.
   - Therefore, the earlier we can identify fault-prone artifacts, the better.
3. How Early?
   - Do requirements metrics correlate with fault-proneness? [Jiang et al., ISSRE 2007]
4. Predicting From Design Metrics?
   - It has been successfully demonstrated.
   - Ohlsson and Alberg ('96) showed that design metrics predict fault-prone modules effectively.
     - "Design metrics are better predictors than code size (LOC)"
     - Telephone switching domain
   - Basili validated the so-called CK object-oriented (design) metrics using eight student-developed systems.
   - Nagappan, Ball & Zeller confirmed Ohlsson's findings using OO design metrics on five Microsoft systems.
5. Goal of This Study
   - A thorough comparison of fault prediction models which utilize:
     - Design metrics
     - Static code metrics
     - A combination of both
   - A statistically significant number of projects, and of modules within projects.
6. Metrics Description (1)
   - Code metrics
7. Metrics Description (2)
   - Design metrics
8. Experimental Design
   For each metrics group (design, code, all):
   - Classification with 10x10 cross-validation
   - Illustrate results using ROC curves
   - Evaluate results using AUC (trapezoid rule)
   - Visualize using boxplot diagrams
   - Compare using nonparametric statistical tests
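A minimal sketch of the "evaluate results using AUC (trapezoid rule)" step above, assuming scikit-learn and NumPy; the labels and scores are illustrative placeholders, not MDP data:

```python
# Sketch: AUC of an ROC curve computed with the trapezoid rule.
import numpy as np
from sklearn.metrics import roc_curve

# Illustrative fault labels (1 = fault-prone) and predicted probabilities.
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.50, 0.70])

fpr, tpr, _ = roc_curve(y_true, y_score)                 # ROC points, ordered by threshold
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)  # trapezoid rule over the curve
print(f"AUC = {auc:.3f}")
```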
9. Datasets: NASA MDP
   - We used every dataset which offered both design-level and code-level metrics.
10. Experimental Design (2)
   - 5 classification algorithms
     - Random forest, bagging, boosting, logistic regression, naive Bayes
   - 10-by-10-way cross-validation
     - One 10-way experiment generates an ROC curve => 10 ROCs => 10 AUCs
   - We analyzed 1,950 experiments:
     13 [data sets] x 3 [metrics sets] x 5 [classifiers] x 10 [CV]
   - We only show the best model from each metrics set in each data set (project).
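A minimal sketch of such a 10x10 cross-validation run, assuming scikit-learn; `X` (module metrics) and `y` (fault labels) are placeholders, and the classifiers use default settings rather than the configurations from the study:

```python
# Sketch: 10 repetitions of 10-fold cross-validation, scored by AUC.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

classifiers = {
    "random forest": RandomForestClassifier(),
    "bagging": BaggingClassifier(),
    "boosting": AdaBoostClassifier(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
}

def ten_by_ten_auc(clf, X, y):
    """Return 10 AUC values: the mean AUC of each repetition of 10-fold CV."""
    aucs = []
    for seed in range(10):                                 # 10 repetitions
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        fold_aucs = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
        aucs.append(fold_aucs.mean())                      # one AUC per repetition
    return np.array(aucs)

# Usage (with placeholder data): aucs = ten_by_ten_auc(classifiers["random forest"], X, y)
```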
11. Analysis Example: the PC5 Data Set
   - Mean AUC:
     - All: 0.979
     - Code: 0.967
     - Design: 0.956
   (ROC curves and boxplots shown on the slide)
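A minimal sketch of the boxplot view, assuming matplotlib and NumPy; the AUC samples are synthetic values drawn around the means listed above, for illustration only:

```python
# Sketch: boxplots of AUC per metrics group for one project.
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for the 10 AUC values per metrics group (centred on the
# PC5 means reported above); real values would come from the 10x10 CV runs.
rng = np.random.default_rng(0)
auc_all    = rng.normal(0.979, 0.005, 10)
auc_code   = rng.normal(0.967, 0.005, 10)
auc_design = rng.normal(0.956, 0.005, 10)

plt.boxplot([auc_all, auc_code, auc_design], labels=["all", "code", "design"])
plt.ylabel("AUC")
plt.title("PC5: AUC by metrics group (illustrative)")
plt.show()
```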
12. Typical Results
13. Not So Typical Results
14. Atypical Results
15. Testing Statistical Significance
   - Use the procedure recommended by Demšar for each of the 13 data sets:
     - The Friedman test checks whether performance differs amongst the design, code, and all experiments.
       - If not, no further test is necessary.
       - If so, use pairwise nonparametric tests (typically the Wilcoxon test or the Mann-Whitney test) to determine which group of metrics is best.
   - A 95% confidence level is used in all experiments.
16. Pairwise Comparison
   - Test the following hypotheses for a pairwise comparison of two experiments A and B:
     - H0: There is no difference in the performance of the models built from group A and group B metrics;
     - H1: The performance of the group A metrics is better than that of the group B metrics;
     - H2: The performance of the group A metrics is worse than that of the group B metrics.
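A minimal sketch of the two-step procedure (Friedman test, then pairwise one-sided tests) for a single data set, assuming SciPy; the three AUC arrays stand in for the 10 AUC values per metrics group:

```python
# Sketch: Friedman test followed by pairwise one-sided tests for one data set.
from scipy.stats import friedmanchisquare, mannwhitneyu, wilcoxon

def compare_metric_groups(auc_all, auc_code, auc_design, alpha=0.05):
    """auc_* hold the 10 AUC values per metrics group for one data set."""
    # Step 1: Friedman test -- is there any difference amongst all/code/design?
    _, p_friedman = friedmanchisquare(auc_all, auc_code, auc_design)
    if p_friedman >= alpha:
        return "no significant difference amongst all, code, and design"

    # Step 2: pairwise tests of H1 ("A performs better than B") for each pair.
    results = {}
    for name, a, b in [("all > code", auc_all, auc_code),
                       ("all > design", auc_all, auc_design),
                       ("code > design", auc_code, auc_design)]:
        _, p_w = wilcoxon(a, b, alternative="greater")      # paired (signed-rank) test
        _, p_u = mannwhitneyu(a, b, alternative="greater")  # unpaired (rank-sum) test
        results[name] = {"Wilcoxon": p_w, "Mann-Whitney": p_u}
    return results
```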
17. Results of the Hypothesis Tests (1)
   - Friedman's test
     - Average p-value = 0.00003604 (< 0.05)
     - Strongly suggests a statistically significant difference amongst the models from all, code, and design over all 13 datasets.
   - The two pairwise nonparametric tests (the Wilcoxon test and the Mann-Whitney test) agree in all cases but one:
     - PC2: the Mann-Whitney test finds all > code, but the Wilcoxon test finds all = code.
     - This discrepancy does not affect the overall trend.
19. Findings
   - Statistical significance tests used AUC for model comparison.
   - In 7 datasets: all = code; in the other 6 datasets: all > code.
   - In all 13 datasets: all > design.
   - In 12 datasets: code > design.
     - The only exception is the KC4 project, where design > code.
20. Summary of Observations
   - The performance of models is influenced MORE by the metrics THAN by the classification algorithm.
   - The combination of design AND code metrics provides better models than code or design metrics alone.
   - Models built from code metrics generally perform better than those built from design metrics only.
   - Design metrics are useful for predicting fault-prone modules earlier.
   - A clear indication that integrating metrics from different phases of development is useful.
21. Threats to Validity
   - Noise in the metrics data sets.
     - Would feature selection change some outcomes?
   - Generality of the NASA datasets.
   - Design metrics were reengineered from code.
     - They more accurately reflect the code base than metrics computed from design documentation would.
   - The "all" metrics data contain a few independent variables which are not in the code or design groups.
     - This needs correction, but the results are unlikely to change.
22. Ensuing Research
   - Software fault prediction can be improved:
     - Improvement is unlikely to come from the application of more off-the-shelf data mining algorithms.
     - Accounting for a project's "business context" may contribute to improvement:
       - Metrics from different development stages add information not available from the code.
       - Evaluation of effectiveness should be tailored to project-specific (subsystem/module-specific) risks.
     - Reliable metrics collection.
