Transfer Defect Learning
JC's ICSE 2013 presentation.

1. Transfer Defect Learning
Jaechang Nam, The Hong Kong University of Science and Technology, China
Sinno Jialin Pan, Institute for Infocomm Research, Singapore
Sunghun Kim, The Hong Kong University of Science and Technology, China

2. Defect Prediction
• Hassan@ICSE`09, Predicting Faults Using the Complexity of Code Changes
• D'Ambros et al.@MSR`10, An Extensive Comparison of Bug Prediction Approaches
• Rahman et al.@FSE`12, Recalling the "Imprecision" of Cross-Project Defect Prediction
• Hata et al.@ICSE`12, Bug Prediction Based on Fine-Grained Module Histories
• …
[Diagram: Program → Prediction Model (machine learning) → Future defects]

3. Training prediction model
[Diagram: a labeled training set is used to build the model, which then predicts labels for the test set]

Training set:
M1  M2  …  M19  M20  Class
11   5  …   53   78  Buggy
 …   …  …    …    …  …
 1   1  …    3    9  Clean

Test set:
M1  M2  …  M19  M20  Class
 2   1  …    2    8  ?
 …   …  …    …    …  …
13   6  …   45   69  ?

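A minimal sketch of this setup in Python (scikit-learn assumed; the metric values are the illustrative numbers from the slide, reduced to four of the twenty metrics):

```python
# Within-project defect prediction: train on labeled modules, predict the rest.
# Illustrative sketch; metric values are the example numbers from the slide.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows = modules, columns = metrics (M1, M2, M19, M20 from the slide).
X_train = np.array([[11, 5, 53, 78],
                    [ 1, 1,  3,  9]])
y_train = np.array([1, 0])                    # 1 = Buggy, 0 = Clean

X_test = np.array([[ 2, 1,  2,  8],
                   [13, 6, 45, 69]])          # Class unknown ("?")

model = LogisticRegression().fit(X_train, y_train)
print(model.predict(X_test))                  # predicted Buggy/Clean labels
```
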
4. Cross prediction model
[Diagram: the source project supplies the training set; a different target project supplies the test set]

5. Cross-project Defect Prediction
"Training data is often not available, either because a company is too small or it is the first release of a product."
(Zimmermann et al.@FSE`09, Cross-project Defect Prediction)
"For many new projects we may not have enough historical data to train prediction models."
(Rahman, Posnett, and Devanbu@FSE`12, Recalling the "Imprecision" of Cross-project Defect Prediction)

6. Cross-project defect prediction
• Zimmermann et al.@FSE`09: "We ran 622 cross-project predictions and found only 3.4% actually worked."
[Pie chart: Worked 3.4%, Not worked 96.6%]

7. Cross-company defect prediction
• Turhan et al.@ESEJ`09: "Within-company data models are still the best."
[Bar chart of average F-measure (0 to 0.4): Cross, Cross with a NN filter, Within]

8. Cross-project defect prediction
• Rahman, Posnett, and Devanbu@FSE`12
[Bar chart of average F-measure (0 to 0.6): Cross vs. Within]

9. Cross prediction results
[Bar chart of F-measure (0 to 0.7): Cross vs. Within for Equinox, JDT, and Lucene]

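All of these comparisons are reported in F-measure, the harmonic mean of precision and recall:

```latex
F = \frac{2 \cdot P \cdot R}{P + R}, \qquad
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}
```
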
10. Approaches of Transfer Defect Learning
• Normalization: data preprocessing for the training and test data
• TCA: Transfer Component Analysis, a state-of-the-art transfer learning algorithm
• TCA+: TCA adapted for cross-project defect prediction, with decision rules to select a suitable data normalization option

11. Data Normalization
• Adjust all feature values to the same scale
  – E.g., make mean = 0 and std = 1
• Known to help classification algorithms achieve better prediction performance [Han et al., 2012]

12. Normalization Options
• N1: min-max normalization (max = 1, min = 0) [Han et al., 2012]
• N2: z-score normalization (mean = 0, std = 1) [Han et al., 2012]
• N3: z-score normalization using only the source mean and standard deviation
• N4: z-score normalization using only the target mean and standard deviation

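A sketch of the four options with NumPy; the pairing of which dataset's statistics are applied where follows the slide text:

```python
import numpy as np

def n1_minmax(src, tgt):
    """N1: scale each dataset to [0, 1] using its own per-feature min/max."""
    scale = lambda X: (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    return scale(src), scale(tgt)

def n2_zscore(src, tgt):
    """N2: standardize each dataset with its own mean and std."""
    z = lambda X: (X - X.mean(axis=0)) / X.std(axis=0)
    return z(src), z(tgt)

def n3_zscore_source(src, tgt):
    """N3: standardize both datasets using only the source mean and std."""
    mu, sd = src.mean(axis=0), src.std(axis=0)
    return (src - mu) / sd, (tgt - mu) / sd

def n4_zscore_target(src, tgt):
    """N4: standardize both datasets using only the target mean and std."""
    mu, sd = tgt.mean(axis=0), tgt.std(axis=0)
    return (src - mu) / sd, (tgt - mu) / sd
```
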
13. Transfer Learning
[Diagram: in traditional machine learning (ML), a separate learning system is trained from scratch for each task; in transfer learning, knowledge extracted by one learning system is transferred to another]

14. A Common Assumption in Traditional ML
• Training and test data are drawn from the same distribution
• Cross prediction violates this assumption; transfer learning is designed for the case where the distributions differ
Pan and Yang@TKDE`10, A Survey on Transfer Learning

15. Transfer Component Analysis
• Unsupervised transfer learning
  – Target project labels are not known
• Source and target must share the same feature space
• Goal: make the distributions of the training and test datasets similar
Pan et al.@TNN`10, Domain Adaptation via Transfer Component Analysis

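The distribution difference TCA minimizes is the Maximum Mean Discrepancy (MMD) between the mapped source and target samples; in standard notation (the symbols here are conventional, not from the slides):

```latex
\mathrm{MMD}(X_S, X_T) =
\left\| \frac{1}{n_S} \sum_{i=1}^{n_S} \phi(x_{S_i})
      - \frac{1}{n_T} \sum_{j=1}^{n_T} \phi(x_{T_j}) \right\|_{\mathcal{H}}^{2}
```
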
16. Transfer Component Analysis (cont.)
• Feature extraction approach
  – Dimensionality reduction
  – Projection: map the original data into a lower-dimensional feature space (e.g., from a 2-dimensional into a 1-dimensional feature space)
  – C.f. Principal Component Analysis (PCA)

17. Transfer Component Analysis (cont.)
[Scatter plot: source domain data vs. target domain data]
Pan et al.@TNN`10, Domain Adaptation via Transfer Component Analysis

18. Transfer Component Analysis (cont.)
[Side-by-side scatter plots: the projection found by PCA vs. the projection found by TCA on the same source/target data]
Pan et al.@TNN`10, Domain Adaptation via Transfer Component Analysis

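A compact sketch of TCA with a linear kernel, following the eigendecomposition form of the TNN`10 paper; the latent dimension m and trade-off mu are illustrative defaults, and this is a sketch rather than the authors' implementation:

```python
import numpy as np

def tca(Xs, Xt, m=2, mu=1.0):
    """Project source Xs and target Xt into a shared m-dimensional space."""
    ns, nt = len(Xs), len(Xt)
    n = ns + nt
    X = np.vstack([Xs, Xt])
    K = X @ X.T                              # linear kernel matrix (n x n)

    # L encodes the MMD between the source and target blocks of K.
    e = np.vstack([np.full((ns, 1),  1.0 / ns),
                   np.full((nt, 1), -1.0 / nt)])
    L = e @ e.T

    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix (variance term)

    # Transfer components: leading eigenvectors of (K L K + mu*I)^(-1) K H K.
    A = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    vals, vecs = np.linalg.eig(A)
    W = vecs[:, np.argsort(-vals.real)[:m]].real

    Z = K @ W                                # all instances in the shared space
    return Z[:ns], Z[ns:]                    # projected source, projected target
```

As in the cross-prediction setup, a classifier is then trained on the projected source rows and evaluated on the projected target rows.
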
19. Preliminary Results using TCA
[Bar charts of F-measure (0 to 0.8) for Safe → Apache and Apache → Safe, comparing Baseline, NoN, N1, N2, N3, and N4]
*Baseline: cross-project defect prediction without TCA and normalization
The prediction performance of TCA varies according to the normalization option!

20. TCA+: Decision rules
• Find a suitable normalization for TCA
• Steps
  – #1: Characterize a dataset
  – #2: Measure the similarity between the source and target datasets
  – #3: Apply the decision rules

21. #1: Characterize a dataset
[Diagram: instances of Dataset A and Dataset B, with pairwise distances d_{1,2}, d_{1,3}, d_{1,5}, d_{3,11}, … drawn between instances]
DIST_A = { d_{i,j} : 1 ≤ i < j ≤ n }, the set of distances between every pair of the n instances in dataset A (and likewise DIST_B for dataset B)

22. #2: Measure similarity between source and target
• Minimum (min) and maximum (max) values of DIST
• Mean and standard deviation (std) of DIST
• The number of instances

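A sketch of the characterization step; Euclidean distance is an assumption here, since the slides only define DIST as the set of inter-instance distances:

```python
import numpy as np
from scipy.spatial.distance import pdist

def characterize(X):
    """Summarize a dataset by the statistics of its pairwise distance set DIST."""
    dist = pdist(X)                # all d_ij with i < j (Euclidean by default)
    return {"min": dist.min(), "max": dist.max(),
            "mean": dist.mean(), "std": dist.std(),
            "n": len(X)}           # number of instances
```
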
23. #3: Decision Rules
• Rule #1: mean and std are the same → NoN (no normalization)
• Rule #2: max and min are different → N1 (max = 1, min = 0)
• Rules #3, #4: std and number of instances are different → N3 or N4 (z-score with source/target mean = 0, std = 1)
• Rule #5: default → N2 (mean = 0, std = 1)

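A hedged sketch of the rule cascade; the slides say only "same" and "different" without giving thresholds, so the tolerance-based comparisons and the tie-break between N3 and N4 below are assumptions, not the paper's exact criteria:

```python
import math

def select_normalization(src, tgt, rel_tol=0.05):
    """Pick a normalization option from the characterize() stats of two datasets."""
    same = lambda a, b: math.isclose(a, b, rel_tol=rel_tol)

    # Rule #1: similar mean and std -> no normalization.
    if same(src["mean"], tgt["mean"]) and same(src["std"], tgt["std"]):
        return "NoN"
    # Rule #2: different max and min -> min-max normalization.
    if not same(src["max"], tgt["max"]) and not same(src["min"], tgt["min"]):
        return "N1"
    # Rules #3/#4: different std and instance counts -> z-score with one
    # side's statistics (source for N3, target for N4; assumed tie-break).
    if not same(src["std"], tgt["std"]) and src["n"] != tgt["n"]:
        return "N3" if src["n"] > tgt["n"] else "N4"
    # Rule #5: default -> plain z-score normalization.
    return "N2"
```
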
24. EVALUATION

25. Experimental Setup
• 8 software subjects
• Machine learning algorithm: logistic regression

ReLink (Wu et al.@FSE`11)
  Projects: Apache, Safe, ZXing
  # of metrics (features): 26 (source code)

AEEEM (D'Ambros et al.@MSR`10)
  Projects: Apache Lucene (LC), Equinox (EQ), Eclipse JDT, Eclipse PDE UI, Mylyn (ML)
  # of metrics (features): 61 (source code, churn, entropy, …)

26. Experimental Design: within-project defect prediction
[Diagram: each project is split into a training set (50%) and a test set (50%)]

27. Experimental Design: cross-project defect prediction
[Diagram: the source project is the training set; the target project is the test set]

28. Experimental Design: cross-project defect prediction with TCA/TCA+
[Diagram: both the source (training) and target (test) data pass through TCA/TCA+ before the model is trained and applied]

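Putting the pieces together, a sketch of the full TCA+ pipeline for one source/target pair, reusing the illustrative helpers from the sketches above (characterize, select_normalization, the N1-N4 functions, and tca; the latent dimension m=5 is arbitrary):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

NORMALIZERS = {"NoN": lambda s, t: (s, t),
               "N1": n1_minmax, "N2": n2_zscore,
               "N3": n3_zscore_source, "N4": n4_zscore_target}

def tca_plus(Xs, ys, Xt, yt):
    """Cross-project prediction with decision-rule normalization + TCA."""
    option = select_normalization(characterize(Xs), characterize(Xt))
    Xs_n, Xt_n = NORMALIZERS[option](Xs, Xt)   # rule-selected normalization
    Zs, Zt = tca(Xs_n, Xt_n, m=5)              # map both into a shared space

    model = LogisticRegression().fit(Zs, ys)   # train on the source project only
    return f1_score(yt, model.predict(Zt))     # F-measure on the target project
```
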
29. RESULTS

30. ReLink Result
[Bar charts of F-measure (0 to 0.8) for Safe → Apache, Apache → Safe, and Safe → ZXing, comparing Baseline, TCA, TCA+, and Within]
*Baseline: cross-project defect prediction without TCA/TCA+

31. ReLink Result (F-measure)

Source → Target   Baseline  TCA   TCA+  Within (Target → Target)
Safe → Apache     0.52      0.64  0.64  0.64
ZXing → Apache    0.69      0.64  0.72  0.64
Apache → Safe     0.49      0.72  0.72  0.62
ZXing → Safe      0.59      0.70  0.64  0.62
Apache → ZXing    0.46      0.45  0.49  0.33
Safe → ZXing      0.10      0.42  0.53  0.33
Average           0.49      0.59  0.61  0.53

*Baseline: cross-project defect prediction without TCA/TCA+

32. AEEEM Result
[Bar charts of F-measure (0 to 0.7) for JDT → EQ, PDE → LC, and PDE → ML, comparing Baseline, TCA, TCA+, and Within]
*Baseline: cross-project defect prediction without TCA/TCA+

33. AEEEM Result (F-measure)

Source → Target   Baseline  TCA   TCA+  Within (Target → Target)
JDT → EQ          0.31      0.59  0.60  0.58
LC → EQ           0.50      0.62  0.62  0.58
ML → EQ           0.24      0.56  0.56  0.58
…                 …         …     …     …
PDE → LC          0.33      0.27  0.33  0.37
EQ → ML           0.19      0.62  0.62  0.30
JDT → ML          0.27      0.56  0.56  0.30
LC → ML           0.20      0.58  0.60  0.30
PDE → ML          0.27      0.48  0.54  0.30
…                 …         …     …     …
Average           0.32      0.41  0.41  0.42

34. Threats to Validity
• The subject systems are open-source projects.
• Experimental results may not be generalizable.
• The decision rules in TCA+ may not be generalizable.

35. Future Work
• Transfer defect learning across different feature spaces
  – e.g., ReLink → AEEEM, AEEEM → ReLink
• Local models using transfer learning
• Adapting transfer learning to other software engineering (SE) problems
  – e.g., knowledge from mailing lists → the bug triage problem

36. Conclusion
• TCA+
  – TCA: makes the distributions of source and target similar
  – Decision rules to improve TCA
  – Significantly improves cross-project defect prediction performance
• Transfer learning in SE
  – Transfer learning may benefit other prediction and recommendation systems in SE domains

