Implications of Ceiling Effects in Defect Predictors

PROMISE 2008

Slide 2: Outline
  - Approach
  - Use More Data
  - Use Less Data
  - Use Even Less Data
  - Discussions
  - Examples
  - Conclusions
Slide 3: Approach
  - Other research: try changing the data miners
    - Various data miners tried: no ground-breaking improvements
  - This research: try changing the training data
    - Sub-sampling: over-sampling, under-sampling, micro-sampling
    - Hypothesis: static code attributes have limited information content
    - Predictions:
      - Simple learners can extract that limited information content
      - No need for more complex learners
      - Further progress requires increasing the information content of the data
Slide 4: State-of-the-art defect predictor
  - Naive Bayes with simple log-filtering
  - Probability of detection (pd): 75%
  - Probability of false alarm (pf): 21%
  - Other data miners failed to achieve this performance:
    - Logistic regression
    - J48
    - OneR
    - Complex variants of Bayes
    - Various others available in WEKA...
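For concreteness, here is a minimal sketch of such a predictor, assuming a NASA-MDP-style table of numeric static code attributes with a 0/1 defect label. GaussianNB from scikit-learn stands in for the Naive Bayes learner; the log_filter and pd_pf helper names are illustrative, not code from the talk.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

def log_filter(X, floor=1e-6):
    """The simple log-filtering step: replace each numeric value x with ln(max(x, floor))."""
    return np.log(np.maximum(X, floor))

def pd_pf(y_true, y_pred):
    """Probability of detection (pd) and probability of false alarm (pf)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    pd = tp / (tp + fn) if (tp + fn) else 0.0
    pf = fp / (fp + tn) if (fp + tn) else 0.0
    return pd, pf

def train_and_eval(X_train, y_train, X_test, y_test):
    """Naive Bayes on log-filtered static code attributes; returns (pd, pf)."""
    model = GaussianNB().fit(log_filter(X_train), y_train)
    return pd_pf(y_test, model.predict(log_filter(X_test)))
```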
Slide 5: How much data? Use more...
  - Experimental rig:
    - Stratify the data
    - |Test| = 100 samples
    - N = {100, 200, 300, ...}
    - |Training| = N * 90% samples
    - Randomize and repeat 20 times
  - Plot N vs. balance
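A sketch of this rig under stated assumptions: X and y are numpy arrays, evaluate is any routine returning (pd, pf) for a train/test split (for instance the Naive Bayes sketch above), and balance is taken to be 1 - sqrt(pf^2 + (1 - pd)^2) / sqrt(2), the distance-to-the-ideal-point measure commonly paired with pd/pf in this line of work; the slide only names the measure, so that exact formula is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def balance(pd, pf):
    """Distance from the ideal point (pd=1, pf=0), rescaled so 1.0 is best."""
    return 1.0 - np.sqrt(pf ** 2 + (1.0 - pd) ** 2) / np.sqrt(2.0)

def rig(X, y, evaluate, test_size=100, repeats=20, step=100):
    """evaluate(X_tr, y_tr, X_te, y_te) -> (pd, pf), e.g. the Naive Bayes sketch above.
    Returns the mean balance for each training-pool size N."""
    results = {}
    for rep in range(repeats):
        # Stratified split: hold out a fixed 100-sample test set.
        X_pool, X_test, y_pool, y_test = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=rep)
        rng = np.random.default_rng(rep)
        for n in range(step, len(X_pool) + 1, step):
            m = int(n * 0.9)                      # |Training| = N * 90%
            idx = rng.permutation(len(X_pool))[:m]
            pd_, pf_ = evaluate(X_pool[idx], y_pool[idx], X_test, y_test)
            results.setdefault(n, []).append(balance(pd_, pf_))
    return {n: float(np.mean(v)) for n, v in results.items()}
```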
Slide 6: Over/under sampling: use less...
  - Software defect datasets are not balanced
    - ~10% of modules are defective
  - Target class: defective modules
  - Under-sampling:
    - Use all target-class instances, say N
    - Pick N instances from the other class
    - Learn theories on 2N instances
  - Over-sampling:
    - Use all instances from the other class, say M (M > N)
    - From the N target-class instances, populate M instances
    - Learn theories on 2M instances
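A minimal sketch of the two schemes, assuming y encodes the defective (target) class as 1 and the defect-free class as 0; the function names are illustrative.

```python
import numpy as np

def under_sample(X, y, rng=None):
    """All N defective modules plus N randomly chosen defect-free ones -> 2N rows."""
    rng = rng if rng is not None else np.random.default_rng()
    defective = np.flatnonzero(y == 1)
    clean = np.flatnonzero(y == 0)
    keep = np.concatenate(
        [defective, rng.choice(clean, size=len(defective), replace=False)])
    return X[keep], y[keep]

def over_sample(X, y, rng=None):
    """All M defect-free modules plus M defective rows drawn with replacement -> 2M rows."""
    rng = rng if rng is not None else np.random.default_rng()
    defective = np.flatnonzero(y == 1)
    clean = np.flatnonzero(y == 0)
    keep = np.concatenate(
        [clean, rng.choice(defective, size=len(clean), replace=True)])
    return X[keep], y[keep]
```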
Slide 7: Over/under sampling: use less...
  - NB/none is still among the best
  - Sampling with J48 does not out-perform NB
  - NB/none is equivalent to NB/under
  - Under-sampling does not harm classifier performance
  - Theories can be learned from a very small sample of the available data
Slide 8: Micro-sampling: use even less...
  - Given N defective modules:
    - M = {25, 50, 75, ...} <= N
    - Select M defective and M defect-free modules
    - Learn theories on 2M instances
  - Under-sampling is the special case M = N
  - 8/12 datasets: M = 25 was enough
  - 1/12 datasets: M = 75 was enough
  - 3/12 datasets: M = {200, 575, 1025}
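Micro-sampling is the same idea pushed further. A sketch, again assuming a 0/1 defect label; the sweep over M is shown as a usage comment.

```python
import numpy as np

def micro_sample(X, y, m, rng=None):
    """Return 2*M rows: M defective and M defect-free modules picked at random."""
    rng = rng if rng is not None else np.random.default_rng()
    defective = np.flatnonzero(y == 1)
    clean = np.flatnonzero(y == 0)
    keep = np.concatenate([rng.choice(defective, size=m, replace=False),
                           rng.choice(clean, size=m, replace=False)])
    return X[keep], y[keep]

# Sweep M = 25, 50, 75, ... up to the number of defective modules, training and
# testing at each M as in the rig above, and note where performance stops improving:
# for m in range(25, int((y == 1).sum()) + 1, 25):
#     X_small, y_small = micro_sample(X, y, m)
```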
Slide 9: Discussions
  - Incremental case-based reasoning (CBR) vs. automatic data miners (ADM)
  - When is CBR preferable to ADM?
    - Manual case review is impractical when the number of cases is large
  - Our results suggest 50 samples are adequate
  - CBR can therefore perform as well as ADM
  - One step further: CBR can perform better than ADM
Slide 10: Example 1: Requirement metrics
  - This does not mean "use requirement docs" all the time!
  - Combine features from whatever sources are available
  - Explore whatever is not a black-box approach
  - Consistent with prior research
  - SE should make use of domain-specific knowledge!
  (From: Text Mining / To: NLP / Subject: Semantics)
Slide 11: Example 2: Simple weighting
  - Combine features wisely!
  - Black-box feature selection is NP-hard
  - Information provided by a black-box approach is not necessarily meaningful to humans
  - Information provided by humans is meaningful to black-boxes
  - Check the validity of the NB assumptions!
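One simple way to act on the "check the validity of the NB assumptions" note is to look for strongly correlated attributes, since Naive Bayes assumes the attributes are conditionally independent. This particular check is only an illustration, not a method taken from the talk.

```python
import numpy as np

def correlated_pairs(X, names, threshold=0.9):
    """Return attribute pairs whose absolute Pearson correlation exceeds threshold;
    such pairs violate the NB independence assumption and are candidates for
    merging, re-weighting, or dropping."""
    corr = np.corrcoef(X, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs
```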
Slide 12: Example 3: WHICH rule learner
  - Current practice:
    - Learn predictors with criterion P
    - Assess predictors with criterion Q
    - In general: P ≠ Q
  - WHICH supports defining P ≈ Q
    - Learn what you will assess later
  - "micro20" means only 20+20 samples
Slide 13: Example 3: WHICH rule learner (continued)
  - WHICH initially creates a sorted stack of all attribute ranges in isolation.
  - Then, biased by score, it randomly selects two rules from the stack, combines them, and places the new rule back in the stack in sorted order.
  - It continues until a stopping criterion is met.
  - WHICH supports both conjunctions and disjunctions:
    - If the two selected rules contain different ranges of the same attribute, those ranges are OR'd together instead of AND'd.
    - Example: "outlook=sunny AND rain=true" combined with "outlook=overcast" yields "outlook = [sunny OR overcast] AND rain=true".
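A compact sketch of that search loop. The rule representation, the pick-from-near-the-top selection (a crude stand-in for score-biased selection), and the fixed round count used as the stopping criterion are all simplifying assumptions; score is whatever assessment criterion Q the user plugs in.

```python
import random

def combine(rule_a, rule_b):
    """AND across different attributes; OR (set union) the ranges of any
    attribute that appears in both rules."""
    merged = {attr: set(vals) for attr, vals in rule_a.items()}
    for attr, vals in rule_b.items():
        merged[attr] = merged.get(attr, set()) | set(vals)
    return merged

def which(rows, labels, score, rounds=100, top=10, seed=0):
    """rows: list of attribute tuples; labels: 0/1 defect labels;
    score(rule, rows, labels): the user's assessment criterion Q."""
    rng = random.Random(seed)
    # 1. Seed the stack with every attribute range in isolation.
    stack = []
    for a in range(len(rows[0])):
        for v in set(row[a] for row in rows):
            rule = {a: {v}}
            stack.append((score(rule, rows, labels), rule))
    stack.sort(key=lambda item: item[0], reverse=True)
    # 2. Repeatedly combine two rules picked from near the top of the stack
    #    and re-insert the result in sorted order.
    for _ in range(rounds):                       # fixed rounds = stopping criterion
        i, j = rng.sample(range(min(top, len(stack))), 2)
        new_rule = combine(stack[i][1], stack[j][1])
        stack.append((score(new_rule, rows, labels), new_rule))
        stack.sort(key=lambda item: item[0], reverse=True)
    return stack[0]                               # best (score, rule) pair found
```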
Slide 14: Example 4: NN-sampling
  - Within-company (WC) vs. cross-company (CC) data:
    - Substantial increase in pd...
    - ...at the cost of a substantial increase in pf
    - CC data should only be used for mission-critical projects
    - Companies should strive to collect local (WC) data
  - Why?
    - CC data covers a larger space of samples...
    - ...but it also includes irrelevancies
  - How to decrease pf?
    - Remove the irrelevancies by sampling from the CC data
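A sketch of nearest-neighbour filtering of cross-company data in the spirit of this slide: keep only the CC rows that lie close to the local (WC) modules. The choice of k = 10 and Euclidean distance here are illustrative assumptions, not values taken from the talk.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_filter(X_cc, y_cc, X_wc, k=10):
    """Keep only the cross-company rows that appear among the k nearest
    neighbours (Euclidean) of at least one within-company row."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_cc)
    _, idx = nn.kneighbors(X_wc)          # idx: |WC| x k indices into X_cc
    keep = np.unique(idx.ravel())
    return X_cc[keep], y_cc[keep]
```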
Slide 15: Example 4: NN-sampling (continued)
  - The same patterns appear in:
    - NASA MDP data and
    - software from a Turkish washing-machine manufacturer
Slide 16: Conclusions
  - Defect predictors are practical tools
  - Limited-information-content hypothesis:
    - Simple learners can extract that limited information content
    - No need for more complex learners
    - Further progress requires increasing the information content of the data
  - The current research paradigm has reached its limits
  - Black-box methods lack business knowledge
  - Human-in-the-loop CBR tools should take their place
    - Practical: small samples to examine
    - Instantaneous: ADM will run fast
    - Direction: increase the information content
  - Promise data: OK. What about Promise tools?
  - How to increase information content? Build predictors aligned with business goals.
Slide 17: Future work
  - Benchmark human-in-the-loop CBR against ADM
  - Instead of asking "which learner?", ask "which data?"
  - Better sampling strategies?
Slide 18: Thanks...
  - Questions?
