Adequate and Precise Evaluation of Predictive Models in Software Engineering Studies
Yan Ma and Bojan Cukic
1. Adequate Evaluation of Quality Models in Software Engineering Studies
   Yan Ma, Bojan Cukic
   Lane Department of Computer Science and Electrical Engineering, West Virginia University
   May 2007
   CITeR, The Center for Identification Technology Research (www.citer.wvu.edu), an NSF I/UCR Center advancing integrative research
2. Evaluating Defect Models
   - Hundreds of research papers.
     - Most offer very little that one can generalize and reapply.
     - The initial hurdle was the lack of data, but not any longer:
       - open source repositories, NASA MDP, PROMISE datasets.
   - How should defect prediction models be evaluated?
3. Software Defect Data: Class Imbalance
   - Only a few modules are fault-prone.
     - A problem for supervised learning algorithms, which typically try to maximize overall accuracy (see the sketch below).
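A minimal sketch of the accuracy paradox this creates, assuming a PC1-like class balance (roughly 7% of modules faulty, as noted on the Random Forests slide); the counts are illustrative:

```python
# A trivial classifier that labels every module "defect-free" on an
# imbalanced, PC1-like dataset (about 7% of modules faulty).
n_modules = 1000
n_faulty = 70

true_negatives = n_modules - n_faulty   # every defect-free module is "correct"
true_positives = 0                      # every faulty module is missed

accuracy = (true_positives + true_negatives) / n_modules
pd = true_positives / n_faulty          # probability of detection

print(f"accuracy = {accuracy:.2f}, PD = {pd:.2f}")   # accuracy = 0.93, PD = 0.00
```

High overall accuracy, zero detection: exactly the failure mode a learner that optimizes accuracy can fall into on such data.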
4. Software Defect Data: Correlation
   MDP-PC1: Pearson correlation coefficients

            N      LCC    B      V      TOpnd  LOC
   N        1.000  0.473  0.982  0.987  0.996  0.924
   LCC      0.473  1.000  0.468  0.468  0.464  0.545
   B        0.982  0.468  1.000  0.995  0.971  0.931
   V        0.987  0.468  0.995  1.000  0.976  0.937
   TOpnd    0.996  0.464  0.971  0.976  1.000  0.908
   LOC      0.924  0.545  0.931  0.937  0.908  1.000
5. Software Defect Data: Correlation (2)
   MDP-KC2: Pearson correlation coefficients

            LOB    TOp    IV.G   V      UOp    LOC
   LOB      1.000  0.912  0.836  0.887  0.636  0.909
   TOp      0.912  1.000  0.972  0.990  0.615  0.991
   IV.G     0.836  0.972  1.000  0.970  0.577  0.968
   V        0.887  0.990  0.970  1.000  0.536  0.986
   UOp      0.636  0.615  0.577  0.536  1.000  0.632
   LOC      0.909  0.991  0.968  0.986  0.632  1.000
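A sketch of how such a matrix can be reproduced with pandas; the file name and metric columns are placeholders for whatever an MDP metrics export actually contains:

```python
import pandas as pd

# Hypothetical per-module metrics file for one MDP project.
metrics = pd.read_csv("pc1_metrics.csv")

# Pairwise Pearson correlations among selected size/complexity metrics.
subset = metrics[["LOC", "TOpnd", "V", "B", "LCC", "N"]]
print(subset.corr(method="pearson").round(3))
```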
6. Software Defect Data: Correlation (3)
   - Five "most informative attributes".
7. Software Defect Data: Module Size
   - Defect-free modules are smaller.
     - In MDP, modules are very small.

   The 90th percentile of LOC for defective and defect-free modules:

                 CM1   JM1   PC1   KC2   KC1
   Defect        131   165   114   167    99
   Defect-free    55    72    47    55    42
8. Software Defect Data: Close Neighbors
   - The "nearest neighbor" of most defective modules is a fault-free module.
     - Measured by the Euclidean distance between module metrics (a computation sketch follows this slide).

   Project   Nearest neighbor is in the     ≥ 2 of the 3 nearest neighbors
             majority (defect-free) class   are in the majority class
   CM1       73.47%                         97.96%
   PC1       75.32%                         85.71%
   JM1       67.90%                         75.46%
   KC2       58.33%                         58.33%
   KC1       66.26%                         73.62%

   (Percentages are over the defective modules in each project.)
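A sketch of how the two neighborhood statistics might be computed, assuming a NumPy matrix X of module metrics (one row per module) and 0/1 labels y with 1 = defective; the function name and inputs are illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def close_neighbor_stats(X, y):
    """Neighborhood statistics for the defective (minority-class) modules."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)

    # Query 4 neighbors: the closest match to each module is the module itself.
    nn = NearestNeighbors(n_neighbors=4, metric="euclidean").fit(X)
    _, idx = nn.kneighbors(X[y == 1])
    neighbor_labels = y[idx[:, 1:]]          # labels of the 3 nearest other modules

    nearest_is_majority = np.mean(neighbor_labels[:, 0] == 0)
    two_of_three_majority = np.mean((neighbor_labels == 0).sum(axis=1) >= 2)
    return nearest_is_majority, two_of_three_majority
```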
9. Implications on Evaluation
   - Many machine learning algorithms are ineffective on such data.
     - But one would never know it by reading the literature.
   - Experimental results are rarely reported adequately.
10. Classification Success Measures
    - Probability of detection (PD): proportion of faulty modules classified correctly (also called sensitivity).
    - Specificity: proportion of fault-free modules classified correctly.
    - Probability of false alarm (PF): proportion of fault-free modules misclassified as faulty.
      - PF = 1 - Specificity
    - Precision: proportion of faulty modules among those predicted as faulty (a computation sketch follows this slide).

    Figure: Random Forests on PC1 (only 7% of the modules are faulty).
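A minimal sketch of these indices computed from confusion-matrix counts (TP: faulty predicted faulty, FN: faulty predicted fault-free, TN: fault-free predicted fault-free, FP: fault-free predicted faulty); the function is illustrative:

```python
def basic_indices(tp, fn, tn, fp):
    pd_ = tp / (tp + fn)            # probability of detection (sensitivity)
    specificity = tn / (tn + fp)    # correctly classified fault-free modules
    pf = fp / (fp + tn)             # false alarm rate = 1 - specificity
    precision = tp / (tp + fp)      # faulty modules among those predicted faulty
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return pd_, specificity, pf, precision, accuracy
```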
11. Success Measures (2)
    - Accuracy, PD, specificity, and precision each tell a one-sided story.
    - Combined indices merge the measures of interest:
      - G-mean2: the geometric mean of the two class accuracies (PD and specificity).
      - G-mean1: combines PD with precision; higher precision leads to "cheaper" V&V.
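The formula images on this slide are not preserved in the transcript; the definitions below follow the captions and reproduce the G-mean columns in the later comparison tables (e.g., for IB1 on PC1: sqrt(0.442 × 0.415) ≈ 0.428 and sqrt(0.442 × 0.954) ≈ 0.649):

```python
from math import sqrt

def g_mean_1(pd_, precision):
    # Combines detection with precision; rewards models whose alarms are cheap to act on.
    return sqrt(pd_ * precision)

def g_mean_2(pd_, specificity):
    # Geometric mean of the two class accuracies (faulty and fault-free).
    return sqrt(pd_ * specificity)
```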
12. Success Measures (3)
    - The F-measure, like G-mean1, combines PD and precision.
      - More flexibility: a weight β sets the relative importance of the two (see the sketch below).
      - The weight should reflect the project's "cost vs. risk aversion".
        - Choosing it may be difficult.
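The weighted F-measure, written here with the conventional weight β (the symbol itself is dropped throughout the transcript); β = 1 weighs PD and precision equally, while larger β favors PD:

```python
def f_measure(pd_, precision, beta=1.0):
    # Weighted harmonic combination of PD (recall) and precision.
    return (1 + beta**2) * precision * pd_ / (beta**2 * precision + pd_)
```

With PD = 0.442 and precision = 0.415 (IB1 on PC1) this gives 0.428 for β = 1 and 0.436 for β = 2, matching the MDP-PC1 table that follows.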
13. Comparing Models
    - G-mean2 and F-measure (β = 2) reflect the difference between the two models.
      - Interpretation is still in the domain of human understanding.

    PC1: random forests at different voting cutoffs

    Index               Figure 1(a)   Figure 1(b)
    G-mean1             0.399         0.426
    G-mean2             0.519         0.783
    F-measure (β = 1)   0.372         0.368
    F-measure (β = 2)   0.305         0.527
14. Comparing Performance: MDP-PC1

    Indices             Bagging   J48     IB1     Logistic   Naïve Bayes
    PD                  0.169     0.234   0.442   0.065      0.299
    1 - PF              0.995     0.985   0.954   0.988      0.936
    Precision           0.732     0.540   0.415   0.289      0.259
    Overall Accuracy    0.938     0.933   0.918   0.924      0.892
    G-mean1             0.352     0.356   0.428   0.137      0.278
    G-mean2             0.410     0.480   0.649   0.253      0.529
    F-measure (β = 1)   0.275     0.327   0.428   0.106      0.278
    F-measure (β = 2)   0.200     0.264   0.436   0.077      0.290
    ED (θ = 0.5)        0.588     0.542   0.396   0.661      0.498
15. Comparing Performance: KC-2

    Indices             Bagging   J48     IB1     Logistic   Naïve Bayes
    PD                  0.472     0.546   0.509   0.389      0.398
    1 - PF              0.931     0.896   0.858   0.932      0.950
    Precision           0.639     0.578   0.483   0.599      0.674
    Overall Accuracy    0.836     0.824   0.786   0.820      0.836
    G-mean1             0.549     0.562   0.496   0.483      0.518
    G-mean2             0.663     0.700   0.661   0.602      0.615
    F-measure (β = 1)   0.543     0.562   0.496   0.472      0.501
    F-measure (β = 2)   0.498     0.552   0.504   0.418      0.434
    ED (θ = 0.5)        0.377     0.329   0.361   0.435      0.427
    ED (θ = 0.67)       0.433     0.375   0.409   0.500      0.492
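The ED rows in the two tables are a distance-from-ideal index; its formula is not shown in the transcript, but the definition below (a common convention, with θ weighting missed detections against false alarms) reproduces the tabulated values:

```python
from math import sqrt

def ed(pd_, pf, theta=0.5):
    # Euclidean distance from the ideal operating point (PD = 1, PF = 0).
    return sqrt(theta * (1 - pd_)**2 + (1 - theta) * pf**2)
```

For IB1 on PC1 (PD = 0.442, PF = 0.046), ed(0.442, 0.046) ≈ 0.396, as tabulated; smaller ED means performance closer to the ideal.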
16. Visual Tools: Margin Plots
17. Visual Tools: ROC
    - Classifiers that allow multiple operating points are more flexible.
    - Area Under the Curve (AUC) can be compared when two curves are available.
    - Distance from the ideal performance.
      - The factor θ depends on the misclassification cost.
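A sketch of drawing a ROC curve and computing its AUC with scikit-learn; the labels and scores here are small illustrative stand-ins for cross-validated predictions on an MDP project:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative 0/1 defect labels and classifier scores (probability of "defective").
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.10, 0.20, 0.15, 0.30, 0.05, 0.40, 0.70, 0.45, 0.35, 0.80]

fpr, tpr, _ = roc_curve(y_true, scores)     # PF and PD at every score cutoff
auc = roc_auc_score(y_true, scores)

plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")    # random-guess reference
plt.xlabel("PF (probability of false alarm)")
plt.ylabel("PD (probability of detection)")
plt.legend()
plt.show()
```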
18. Summary
    - Emerging data collection must be accompanied by mature evaluation.
    - Research reports must reveal all aspects of the achieved predictive performance fairly.
    - The goodness of a model depends on the cost of misclassification.
19. Current Work
    - Statistical significance calculation between ROC curves.
    - ROC vs. PR (Precision-Recall) curves; cost curves.