Promise 2011: "An Iterative Semi-supervised Approach to Software Fault Prediction"

Speaker notes:
  • The assumption of supervised learning is that the distribution of the training data is identical to the distribution of the testing data; that is, the training data should be representative of the data space.
  • Horizontal lines represent the performance of random forest: supervised learning does not change with iterations, and its AUC does not change. Improvements in PD are limited at low thresholds because few fault-prone modules remain to be detected; PD may slightly deteriorate (by a few percent) at low thresholds.
  • A semi-supervised approach with a small number of labeled modules (2% or 5%) may not lead to improvement.
  • The larger the labeled data set, the better the performance. Overall, the semi-supervised algorithm performs better than the corresponding supervised algorithm (RF).
    1. 1. An Iterative Semi-supervised Approach to Software Fault Prediction Huihua Lu, Bojan Cukic, Mark Culp Lane Department of Computer Science and Electrical Engineering Department of Statistics West Virginia University Morgantown, WV September 2011
    2. 2. Presentation Outline <ul><li>Introduction </li></ul><ul><li>Semi-supervised Learning </li></ul><ul><li>Methodology </li></ul><ul><li>Experiments </li></ul><ul><li>Results and Discussion </li></ul><ul><li>Conclusion and Future Work </li></ul>
    3. 3. Introduction <ul><li>Software Quality Assurance </li></ul><ul><ul><li>Identify where faults hide, subject to V&V </li></ul></ul><ul><ul><li>Without automation, costly and time-consuming </li></ul></ul><ul><li>Software Fault Prediction </li></ul><ul><ul><li>Software metrics: code metrics, complexity metrics, etc. </li></ul></ul><ul><ul><li>Software fault prediction models identify faulty modules </li></ul></ul><ul><ul><ul><li>Supervised learning algorithms are the norm </li></ul></ul></ul><ul><li>Practical Problem </li></ul><ul><ul><li>For one-of-a-kind systems or new systems, ground truth data may be sparse </li></ul></ul><ul><li>The Goal of Our Study </li></ul><ul><ul><li>Evaluate the performance of semi-supervised learning approaches </li></ul></ul>
    4. 4. Goal of the Study <ul><li>Can we match the performance of supervised learning fault prediction models from a smaller set of labeled modules? </li></ul><ul><li>Consequence: if very few modules are labeled (a very real scenario), include unlabeled modules in training. Most published studies use 50% or more of the software modules for model training, which is not practical for new projects. </li></ul>
    5. 5. Presentation Outline <ul><li>Introduction </li></ul><ul><li>Semi-supervised Learning </li></ul><ul><li>Methodology </li></ul><ul><li>Experiments </li></ul><ul><li>Results and Discussion </li></ul><ul><li>Conclusion and Future Work </li></ul>
    6. 6. Semi-Supervised Learning-1 <ul><li>Supervised Learning </li></ul><ul><ul><li>Train a model from labeled (training) data only </li></ul></ul><ul><ul><li>Labeled data could be expensive to create </li></ul></ul><ul><ul><ul><li>Modules receive labels through detailed V&V </li></ul></ul></ul><ul><li>Semi-Supervised Learning </li></ul><ul><ul><li>Train a model from both the labeled data and the unlabeled data </li></ul></ul><ul><ul><ul><li>Include new modules as they become available in a version control system. </li></ul></ul></ul><ul><ul><li>Unlabeled data are the modules with unknown fault content </li></ul></ul>
    7. 7. Semi-Supervised Learning-2 <ul><li>Traditional Semi-supervised Learning algorithms </li></ul><ul><ul><li>Co-training </li></ul></ul><ul><ul><ul><li>Assumption: features can be separated into two sets </li></ul></ul></ul><ul><ul><li>Generative Learning (EM algorithm) </li></ul></ul><ul><ul><ul><li>Assumption: the distribution of the data must be known </li></ul></ul></ul><ul><ul><li>Self-training </li></ul></ul><ul><ul><ul><li>Assumption: none </li></ul></ul></ul>
    8. 8. Related Work <ul><li>In software fault prediction </li></ul><ul><ul><li>Khoshgoftaar: Inductive semi-supervised learning </li></ul></ul><ul><ul><ul><li>Data from one project separated into labeled and unlabeled sets; performance is evaluated on a different project </li></ul></ul></ul><ul><ul><ul><li>Achieved better performance than a tree-based supervised algorithm (C4.5) </li></ul></ul></ul><ul><ul><li>Khoshgoftaar: Clustering-based semi-supervised learning </li></ul></ul><ul><ul><ul><li>Extends unsupervised learning into semi-supervised learning </li></ul></ul></ul><ul><ul><ul><li>Better partitioning than unsupervised learning </li></ul></ul></ul><ul><ul><ul><li>Assumes that human domain experts participate in classifying modules into fault-prone and not-fault-prone. </li></ul></ul></ul><ul><ul><li>Many supervised learning modeling approaches. </li></ul></ul>
    9. 9. Presentation Outline <ul><li>Introduction </li></ul><ul><li>Semi-supervised Learning </li></ul><ul><li>Methodology </li></ul><ul><li>Experiments </li></ul><ul><li>Results and Discussion </li></ul><ul><li>Conclusion and Future Work </li></ul>
    10. 10. Methodology-1 <ul><li>Fitting the Fits (FTF) semi-supervised algorithm </li></ul><ul><ul><li>A variant of Self-training [3] </li></ul></ul><ul><ul><li>Idea: Reduce the semi-supervised problem to some form of a supervised problem </li></ul></ul><ul><ul><li>The Algorithm: </li></ul></ul>[Algorithm diagram: initialize the labels for U; reset the labels for L; fit the labels for L+U]
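A rough sketch of the FTF self-training loop described on this slide, assuming a scikit-learn-style random forest as the base learner. The function name, parameters, and convergence test are illustrative, not taken from the paper:

```python
# Hypothetical sketch of the "Fitting the Fits" (FTF) loop: initialize
# labels for the unlabeled set U, then repeatedly fit on L + U while
# resetting L to its ground-truth labels each round.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def ftf(X_l, y_l, X_u, n_iter=50, seed=0):
    """Iteratively refit pseudo-labels for the unlabeled modules X_u."""
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    # Initialize labels for U from a model trained on L alone.
    rf.fit(X_l, y_l)
    y_u = rf.predict(X_u)
    X_all = np.vstack([X_l, X_u])
    for _ in range(n_iter):
        # Fit on L + U; labels for L are always reset to ground truth.
        rf.fit(X_all, np.concatenate([y_l, y_u]))
        y_u_new = rf.predict(X_u)
        if np.array_equal(y_u_new, y_u):
            break  # pseudo-labels stopped changing
        y_u = y_u_new
    return rf, y_u
```

The early exit mirrors the next slide's observation that the iteration may (but need not) converge; with random forests the pseudo-labels can keep oscillating, which is why the paper caps the run at 50 iterations.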
    11. 11. Methodology-2 <ul><li>The Base Learner: </li></ul><ul><ul><li>Initializes the labels for unlabeled data </li></ul></ul><ul><ul><li>“ Improves” the labels of unlabeled data in iterations </li></ul></ul><ul><ul><li>May lead to global convergence </li></ul></ul><ul><li>Random Forests </li></ul><ul><ul><li>A good choice in the domain based on previous work </li></ul></ul><ul><ul><li>Robust to noise </li></ul></ul>
    12. 12. Software Data Sets <ul><li>These are large NASA MDP projects (> 1,000 modules) </li></ul>
    13. 13. Performance Measures <ul><li>Labels in the binary classification problem: </li></ul><ul><ul><li>1 - fault prone module </li></ul></ul><ul><ul><li>0 - not fault prone module </li></ul></ul><ul><ul><li>For each module, estimate the probability of being fault prone </li></ul></ul><ul><li>Area under the ROC curve (AUC) and Probability of Detection (PD) used for performance comparison </li></ul><ul><ul><li>PD = TP / (TP + FN) </li></ul></ul>
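To make the two measures concrete, here is a small worked example with made-up probabilities (the data and the use of scikit-learn are my assumptions; the slide does not specify tooling):

```python
# AUC over predicted fault probabilities, and PD = TP / (TP + FN)
# at a fixed classification threshold.
from sklearn.metrics import roc_auc_score

y_true = [1, 0, 1, 1, 0, 0]              # 1 = fault prone, 0 = not
p_fault = [0.9, 0.2, 0.6, 0.4, 0.1, 0.55]  # estimated probabilities

auc = roc_auc_score(y_true, p_fault)     # threshold-free ranking quality

threshold = 0.5
pred = [1 if p >= threshold else 0 for p in p_fault]
tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
pd_rate = tp / (tp + fn)                 # probability of detection
```

Lowering the threshold (e.g. to 0.1, as on the results slides) flags more modules as fault prone, which raises PD at the cost of more false alarms; AUC is unaffected by the threshold.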
    14. 14. Presentation Outline <ul><li>Introduction </li></ul><ul><li>Semi-supervised Learning </li></ul><ul><li>Methodology </li></ul><ul><li>Experiments </li></ul><ul><li>Results and Discussion </li></ul><ul><li>Conclusion and Future Work </li></ul>
    15. 15. Experiments <ul><li>FTF with Random Forests vs. Random Forest </li></ul><ul><li>Does FTF outperform supervised learning with the same size of labeled modules? </li></ul><ul><ul><li>Size of labeled data: 2%, 5%, 10%, 25%, 50% </li></ul></ul><ul><ul><li>Stop the FTF algorithm after 50 iterations </li></ul></ul><ul><li>Is the behavior and performance of FTF consistent over different software projects? </li></ul>
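The labeled/unlabeled split behind these experiments can be sketched as follows; the fractions come from the slide, while the function and variable names are my own:

```python
# Randomly designate a fraction of a project's modules as labeled;
# the rest are treated as unlabeled for the semi-supervised learner.
import numpy as np

def split_labeled(n_modules, frac, seed=0):
    """Return (labeled, unlabeled) index arrays for one experiment run."""
    rng = np.random.default_rng(seed)
    labeled = rng.choice(n_modules, size=int(frac * n_modules), replace=False)
    unlabeled = np.setdiff1d(np.arange(n_modules), labeled)
    return labeled, unlabeled

sizes = {}
for frac in (0.02, 0.05, 0.10, 0.25, 0.50):
    labeled, unlabeled = split_labeled(1000, frac)
    sizes[frac] = len(labeled)
    # FTF would train on labeled + unlabeled modules (up to 50 iterations);
    # the supervised random forest trains on the labeled modules alone,
    # and both are then compared on the remaining modules.
```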
    16. 16. Presentation Outline <ul><li>Introduction </li></ul><ul><li>Semi-supervised Learning </li></ul><ul><li>Methodology </li></ul><ul><li>Experiments </li></ul><ul><li>Results and Discussion </li></ul><ul><li>Conclusion and Future Work </li></ul>
    17. 17. Results: PC 3
    18. 18. Results at threshold 0.5
    19. 19. At threshold 0.1
    20. 20. Overall Comparison
    21. 21. Presentation Outline <ul><li>Introduction </li></ul><ul><li>Semi-supervised Learning </li></ul><ul><li>Methodology </li></ul><ul><li>Experiments </li></ul><ul><li>Results and Discussion </li></ul><ul><li>Conclusion and Future Work </li></ul>
    22. 22. Summary <ul><li>Does FTF with Random Forests as the base learner outperform supervised learning with Random Forest? </li></ul><ul><ul><li>Yes, in most cases. </li></ul></ul><ul><ul><li>The improvement is modest and not statistically significant </li></ul></ul><ul><li>How small can the labeled data set be for FTF to start outperforming supervised learning? </li></ul><ul><ul><li>When 5% or more of the modules are labeled, the semi-supervised approach seems a promising direction. </li></ul></ul><ul><ul><li>Performance improves in comparison to supervised learning with the same number of labeled modules </li></ul></ul><ul><li>Are the behavior and performance of FTF consistent over different data sets? </li></ul><ul><ul><li>Yes </li></ul></ul>
    23. 23. Future Work <ul><li>Try out different base learners with FTF </li></ul><ul><ul><li>The base learner in FTF has a dramatic effect. RF was used because it performs well in software fault modeling </li></ul></ul><ul><ul><li>RF does not converge; other base learners might </li></ul></ul><ul><ul><li>Analyze robustness to noise </li></ul></ul><ul><li>Expand to projects of different sizes or from different domains </li></ul><ul><li>Introduce more sophisticated semi-supervised algorithms </li></ul>
    24. 24. Questions <ul><li>Please direct questions to </li></ul><ul><li>Bojan Cukic: [email_address] </li></ul><ul><li>Huihua Lu: [email_address] </li></ul>
