Transfer Defect Learning

  1. 1. Transfer Defect Learning Jaechang Nam The Hong Kong University of Science and Technology, China Sinno Jialin Pan Institute for Infocomm Research, Singapore Sunghun Kim The Hong Kong University of Science and Technology, China
  2. 2. Defect Prediction • Hassan@ICSE`09, Predicting Faults Using the Complexity of Code Changes • D’Ambros et al.@MSR`10, An Extensive Comparison of Bug Prediction Approaches • Rahman et al.@FSE`12, Recalling the “Imprecision” of Cross-Project Defect Prediction • Hata et al.@ICSE`12, Bug Prediction Based on Fine-grained Module Histories • … [Diagram: Program → Prediction model (machine learning) → Future defects]
  3. 3. Training prediction model [Diagram: Training set and Test set]
  4. 4. Training prediction model [Diagram: Training set and Test set] Training set: instances with metric values M1…M20 and a known Class label, e.g., (11, 5, …, 53, 78) → Buggy and (1, 1, …, 3, 9) → Clean. Test set: instances with metric values but an unknown Class label, e.g., (2, 1, …, 2, 8) → ? and (13, 6, …, 45, 69) → ?
  5. 5. Cross prediction model [Diagram: Source project (Training set) → Target project (Test set)]
  6. 6. Cross-project Defect Prediction “Training data is often not available, either because a company is too small or it is the first release of a product” Zimmermann et al.@FSE`09, Cross-project Defect Prediction
  7. 7. Cross-project Defect Prediction “Training data is often not available, either because a company is too small or it is the first release of a product” Zimmermann et al.@FSE`09, Cross-project Defect Prediction “For many new projects we may not have enough historical data to train prediction models.” Rahman, Posnett, and Devanbu@FSE`12, Recalling the “Imprecision” of Cross-project Defect Prediction
  8. 8. Cross-project defect prediction • Zimmermann et al.@FSE`09 – “We ran 622 cross-project predictions and found only 3.4% actually worked.” [Pie chart: Worked 3.4%, Did not work 96.6%]
  9. 9. Cross-company defect prediction • Turhan et al.@ESEJ`09 – “Within-company data models are still the best” [Bar chart: Avg. F-measure of Cross, Cross with a NN filter, and Within-company prediction]
  10. 10. Cross-project defect prediction • Rahman, Posnett, and Devanbu@FSE`12 [Bar chart: Avg. F-measure of Cross vs. Within prediction]
  11. 11. Cross prediction results [Bar chart: F-measure of Cross vs. Within prediction for Equinox, JDT, and Lucene]
  12. 12. Approaches of Transfer Defect Learning 10 Normalization TCA TCA+
  13. 13. Approaches of Transfer Defect Learning • Normalization: data preprocessing for the training and test data • TCA (Transfer Component Analysis): a state-of-the-art transfer learning algorithm • TCA+: TCA adapted for cross-project defect prediction, with decision rules to select a suitable data normalization option
  14. 14. Data Normalization • Adjust all feature values to the same scale – E.g., make mean = 0 and std = 1 • Known to help classification algorithms improve prediction performance [Han et al. 2012]
  15. 15. Normalization Options • N1: Min-max Normalization (max=1, min=0) [Han et al., 2012] • N2: Z-score Normalization (mean=0, std=1) [Han et al., 2012] • N3: Z-score Normalization only using source mean and standard deviation • N4: Z-score Normalization only using target mean and standard deviation 13
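A minimal NumPy sketch of the four normalization options above; the function name and the `eps` guard are ours, not from the slides.

```python
import numpy as np

def normalize(src, tgt, option="N1"):
    """Apply one of the normalization options N1-N4 to source/target feature
    matrices (2-D arrays, instances x features); returns normalized copies."""
    eps = 1e-12  # guard against division by zero for constant features
    if option == "N1":   # min-max normalization: max=1, min=0, per dataset
        f = lambda X: (X - X.min(0)) / (X.max(0) - X.min(0) + eps)
        return f(src), f(tgt)
    if option == "N2":   # z-score: each dataset uses its own mean/std
        f = lambda X: (X - X.mean(0)) / (X.std(0) + eps)
        return f(src), f(tgt)
    if option == "N3":   # z-score using only the source mean/std for both datasets
        m, s = src.mean(0), src.std(0) + eps
        return (src - m) / s, (tgt - m) / s
    if option == "N4":   # z-score using only the target mean/std for both datasets
        m, s = tgt.mean(0), tgt.std(0) + eps
        return (src - m) / s, (tgt - m) / s
    return src, tgt      # NoN: no normalization
```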
  16. 16. Approaches of Transfer Defect Learning • Normalization: data preprocessing for the training and test data • TCA (Transfer Component Analysis): a state-of-the-art transfer learning algorithm • TCA+: TCA adapted for cross-project defect prediction, with decision rules to select a suitable data normalization option
  17. 17. Transfer Learning 15
  18. 18. Transfer Learning [Diagram: traditional machine learning (ML) trains a separate learning system for each task]
  19. 19. Transfer Learning [Diagram: traditional ML trains a separate learning system per task; transfer learning transfers knowledge from a source learning system to a target learning system]
  20. 20. Transfer Learning [Diagram: traditional ML vs. transfer learning with knowledge transfer, repeated as a build slide]
  21. 21. A Common Assumption in Traditional ML • Same distribution for training and test data. Pan and Yang@TKDE`10, A Survey on Transfer Learning
  22. 22. A Common Assumption in Traditional ML • Same distribution, which does not hold for cross prediction. Pan and Yang@TKDE`10, A Survey on Transfer Learning
  23. 23. A Common Assumption in Traditional ML • Same distribution; transfer learning addresses the case where it does not hold. Pan and Yang@TKDE`10, A Survey on Transfer Learning
  24. 24. Transfer Component Analysis • Unsupervised transfer learning – Target project labels are not known • Source and target must have the same feature space • Makes the distributions of the training and test datasets similar Pan et al.@TNN`10, Domain Adaptation via Transfer Component Analysis
  25. 25. Transfer Component Analysis (cont.) • Feature extraction approach – Dimensionality reduction – Projection • Map original data in a lower-dimensional feature space 18
  26. 26. Transfer Component Analysis (cont.) • Feature extraction approach – Dimensionality reduction – Projection • Map original data in a lower-dimensional feature space 18 2-dimensional feature space
  27. 27. Transfer Component Analysis (cont.) • Feature extraction approach – Dimensionality reduction – Projection • Map original data in a lower-dimensional feature space 18 1-dimensional feature space
  28. 28. Transfer Component Analysis (cont.) • Feature extraction approach – Dimensionality reduction – Projection • Map original data in a lower-dimensional feature space 18 1-dimensional feature space
  29. 29. Transfer Component Analysis (cont.) • Feature extraction approach – Dimensionality reduction – Projection • Map original data in a lower-dimensional feature space 18 1-dimensional feature space 2-dimensional feature space
  30. 30. Transfer Component Analysis (cont.) • Feature extraction approach – Dimensionality reduction – Projection • Map original data in a lower-dimensional feature space – Cf. Principal Component Analysis (PCA) 1-dimensional feature space
  31. 31. Transfer Component Analysis (cont.) [Figure: target domain data and source domain data] Pan et al.@TNN`10, Domain Adaptation via Transfer Component Analysis
  32. 32. Transfer Component Analysis (cont.) [Figure: projections by PCA vs. TCA] Pan et al.@TNN`10, Domain Adaptation via Transfer Component Analysis
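The following is a simplified NumPy sketch of the TCA projection described on the slides above, following the closed-form solution in Pan et al.@TNN`10; the linear kernel, the value of mu, and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def tca(Xs, Xt, n_components=2, mu=1.0):
    """Map source (Xs) and target (Xt) instances into a shared low-dimensional
    space where their distributions are closer (simplified, linear-kernel TCA)."""
    ns, nt = len(Xs), len(Xt)
    n = ns + nt
    X = np.vstack([Xs, Xt])
    K = X @ X.T                                    # linear kernel matrix

    # MMD coefficient matrix L: measures the distance between the source and
    # target empirical means in the kernel-induced feature space.
    e = np.vstack([np.full((ns, 1), 1.0 / ns), np.full((nt, 1), -1.0 / nt)])
    L = e @ e.T

    H = np.eye(n) - np.ones((n, n)) / n            # centering matrix

    # Leading eigenvectors of (K L K + mu I)^-1 K H K give the transfer components.
    M = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    eigvals, eigvecs = np.linalg.eig(M)
    top = np.argsort(-eigvals.real)[:n_components]
    W = eigvecs[:, top].real

    Z = K @ W                                      # projected instances
    return Z[:ns], Z[ns:]                          # new source / target features
```

A classifier is then trained on the projected source instances and applied to the projected target instances.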
  33. 33. Preliminary Results using TCA [Bar chart: F-measure of Baseline, NoN, N1, N2, N3, and N4 for Safe → Apache and Apache → Safe] *Baseline: cross-project defect prediction without TCA and normalization
  34. 34. Preliminary Results using TCA [Bar chart: same data as the previous slide] Prediction performance of TCA varies according to different normalization options! *Baseline: cross-project defect prediction without TCA and normalization
  35. 35. Approaches of Transfer Defect Learning • Normalization: data preprocessing for the training and test data • TCA (Transfer Component Analysis): a state-of-the-art transfer learning algorithm • TCA+: TCA adapted for cross-project defect prediction, with decision rules to select a suitable data normalization option
  36. 36. TCA+: Decision rules • Find a suitable normalization for TCA • Steps – #1: Characterize a dataset – #2: Measure similarity between source and target datasets – #3: Decision rules 23
  37. 37. #1: Characterize a dataset [Diagram: instances of Dataset A and Dataset B with pairwise distances d1,2, d1,3, d1,5, d2,6, d3,11, …] DIST_A = {d_ij : 1 ≤ i < j ≤ n}, the set of pairwise distances between all n instances of dataset A
  38. 38. #2: Measure Similarity between source and target • Minimum (min) and maximum (max) values of DIST • Mean and standard deviation (std) of DIST • The number of instances 25
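A small sketch of steps #1 and #2 (function names are ours): compute the pairwise-distance set DIST for a dataset and summarize it with the characteristics listed above.

```python
import numpy as np
from scipy.spatial.distance import pdist

def characterize(X):
    """Step #1: DIST = {d_ij : 1 <= i < j <= n}, all pairwise Euclidean distances.
    Step #2: summarize DIST by min, max, mean, and std, plus the number of instances."""
    dist = pdist(np.asarray(X, dtype=float), metric="euclidean")
    return {
        "min": dist.min(),
        "max": dist.max(),
        "mean": dist.mean(),
        "std": dist.std(),
        "n_instances": len(X),
    }
```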
  39. 39. #3: Decision Rules • Rule #1 – Mean and std are the same → NoN • Rule #2 – Max and min are different → N1 (max=1, min=0) • Rule #3, #4 – Std and # of instances are different → N3 or N4 (src/tgt mean=0, std=1) • Rule #5 – Default → N2 (mean=0, std=1)
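A hedged sketch of how the rules above could be applied to the characteristics returned by `characterize()`: the paper compares source and target characteristics by degree of similarity, so the single `tol` threshold and the N3/N4 tie-break below are illustrative assumptions, not the authors' exact conditions.

```python
def choose_normalization(src_stats, tgt_stats, tol=0.1):
    """Pick a normalization option from the DIST summaries of source and target."""
    def same(a, b):
        # Illustrative notion of "same": within a relative tolerance.
        return abs(a - b) <= tol * max(abs(a), abs(b), 1e-12)

    s, t = src_stats, tgt_stats
    if same(s["mean"], t["mean"]) and same(s["std"], t["std"]):
        return "NoN"   # Rule #1: distributions already look alike
    if not same(s["max"], t["max"]) and not same(s["min"], t["min"]):
        return "N1"    # Rule #2: min-max normalization
    if not same(s["std"], t["std"]) and not same(s["n_instances"], t["n_instances"]):
        # Rules #3/#4: z-score with source (N3) or target (N4) statistics;
        # which of the two is picked here is an assumption for illustration.
        return "N3" if s["n_instances"] > t["n_instances"] else "N4"
    return "N2"        # Rule #5: default z-score normalization
```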
  40. 40. EVALUATION 27
  41. 41. Experimental Setup • 8 software subjects • Machine learning algorithm – Logistic regression • ReLink (Wu et al.@FSE`11): Apache, Safe, ZXing – 26 metrics (source code) • AEEEM (D’Ambros et al.@MSR`10): Apache Lucene (LC), Equinox (EQ), Eclipse JDT, Eclipse PDE UI, Mylyn (ML) – 61 metrics (source code, churn, entropy, …)
  42. 42. Experimental Design • Within-project defect prediction [Diagram: one project split into a training set (50%) and a test set (50%)]
  43. 43. Experimental Design • Cross-project defect prediction [Diagram: Source project (training set) → Target project (test set)]
  44. 44. Experimental Design • Cross-project defect prediction with TCA/TCA+ [Diagram: Source project (training set) → TCA/TCA+ → Target project (test set)]
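A minimal scikit-learn sketch of the three designs above, using logistic regression and F-measure as in the evaluation; `tca()` refers to the earlier sketch, and all data loading is left out.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def f_measure(X_train, y_train, X_test, y_test):
    """Train logistic regression and report the F-measure on the test set."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    return f1_score(y_test, clf.predict(X_test))

def within_project(X, y, seed=0):
    """Within-project design: random 50%/50% split of a single project."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=seed)
    return f_measure(X_tr, y_tr, X_te, y_te)

def cross_project(Xs, ys, Xt, yt):
    """Cross-project baseline: train on the source project, test on the target."""
    return f_measure(Xs, ys, Xt, yt)

def cross_project_tca(Xs, ys, Xt, yt, n_components=5):
    """Cross-project prediction after projecting both datasets with the tca() sketch."""
    Zs, Zt = tca(Xs, Xt, n_components=n_components)
    return f_measure(Zs, ys, Zt, yt)
```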
  45. 45. RESULTS 32
  46. 46. ReLink Result [Bar chart: F-measure of Baseline, TCA, TCA+, and Within for Safe → Apache, Apache → Safe, and Safe → ZXing] *Baseline: cross-project defect prediction without TCA/TCA+
  47. 47. ReLink Result (F-measure) • Cross (Source → Target): Safe → Apache, ZXing → Apache, Apache → Safe, ZXing → Safe, Apache → ZXing, Safe → ZXing – Baseline: 0.52, 0.69, 0.49, 0.59, 0.46, 0.10 (average 0.49) – TCA: 0.64, 0.64, 0.72, 0.70, 0.45, 0.42 (average 0.59) – TCA+: 0.64, 0.72, 0.72, 0.64, 0.49, 0.53 (average 0.61) • Within (Target → Target): Apache 0.64, Safe 0.62, ZXing 0.33 (average 0.53) *Baseline: cross-project defect prediction without TCA/TCA+
  48. 48. AEEEM Result [Bar chart: F-measure of Baseline, TCA, TCA+, and Within for JDT → EQ, PDE → LC, and PDE → ML] *Baseline: cross-project defect prediction without TCA/TCA+
  49. 49. AEEEM Result (F-measure) • Cross (Source → Target): JDT → EQ, LC → EQ, ML → EQ, …, PDE → LC, EQ → ML, JDT → ML, LC → ML, PDE → ML, … – Baseline: 0.31, 0.50, 0.24, …, 0.33, 0.19, 0.27, 0.20, 0.27, … (average 0.32) – TCA: 0.59, 0.62, 0.56, …, 0.27, 0.62, 0.56, 0.58, 0.48, … (average 0.41) – TCA+: 0.60, 0.62, 0.56, …, 0.33, 0.62, 0.56, 0.60, 0.54, … (average 0.41) • Within: 0.58, …, 0.37, 0.30, … (average 0.42)
  50. 50. Threats to Validity • All subject systems are open-source projects • Experimental results may not generalize to other projects • The decision rules in TCA+ may not generalize to other datasets
  51. 51. Future Work • Transfer defect learning across different feature spaces – e.g., ReLink → AEEEM, AEEEM → ReLink • Local models using transfer learning • Apply transfer learning to other Software Engineering (SE) problems – e.g., knowledge from mailing lists → the bug triage problem
  52. 52. Conclusion • TCA+ – TCA: makes the distributions of the source and the target similar – Decision rules to improve TCA – Significantly improved cross-project defect prediction performance • Transfer Learning in SE – Transfer learning may benefit other prediction and recommendation systems in SE domains.
