Transfer defect learning

Sung Kim, Associate Prof.
Transfer Defect Learning
Jaechang Nam
The Hong Kong University of Science and Technology, China
Sinno Jialian Pan
Institute for Infocomm Research, Singapore
Sunghun Kim
The Hong Kong University of Science and Technology, China
Defect Prediction
• Hassan et al.@ICSE`09, Predicting Faults Using the Complexity of Code Changes
• D’Ambros et al.@MSR`10, An Extensive Comparison of Bug Prediction Approaches
• Rahman et al.@FSE`12, Recalling the “Imprecision” of Cross-Project Defect Prediction
• Hata et al.@ICSE`12, Bug Prediction Based on Fine-grained Module Histories
• …
Program → Prediction Model (Machine learning) → Future defects
Training prediction model
Training set
M1    M2    …    M19   M20   Class
11    5     …    53    78    Buggy
…     …     …    …     …     …
1     1     …    3     9     Clean
Test set
M1    M2    …    M19   M20   Class
2     1     …    2     8     ?
…     …     …    …     …     …
13    6     …    45    69    ?
Cross prediction model
Target project (Test set)
Source project (Training set)
Cross-project Defect Prediction
“Training data is often not available, either because a company is too small or it is the first release of a product”
Zimmermann et al.@FSE`09, Cross-project Defect Prediction
“For many new projects we may not have enough historical data to train prediction models.”
Rahman, Posnett, and Devanbu@FSE`12, Recalling the “Imprecision” of Cross-project Defect Prediction
Cross-project defect prediction
• Zimmermann et al.@FSE`09
– “We ran 622 cross-project predictions and found only 3.4% actually worked.”
[Pie chart: Worked, 3.4%; Not worked, 96.6%]
Cross-company defect prediction
• Turhan and Menzies et al.@ESEJ`09
– “Within-company data models are still the best”
[Bar chart: Avg. F-measure for Cross, Cross with a NN filter, and Within]
Cross-project defect prediction
• Rahman, Posnett, and Devanbu@FSE`12
[Bar chart: Avg. F-measure for Cross and Within]
Cross prediction results
[Bar chart: F-measure for Cross and Within on Equinox, JDT, and Lucene]
Approaches of Transfer Defect Learning
• Normalization
– Data preprocessing for training and test data
• TCA
– A state-of-the-art transfer learning algorithm
– Transfer Component Analysis
• TCA+
– Adapted TCA for cross-project defect prediction
– Decision rules to select a suitable data normalization option
Data Normalization
• Rescale all feature values to the same scale
– E.g., make mean = 0 and std = 1
• Known to help classification algorithms improve prediction performance [Han et al. 2012].
Normalization Options
• N1: Min-max normalization (max=1, min=0) [Han et al., 2012]
• N2: Z-score normalization (mean=0, std=1) [Han et al., 2012]
• N3: Z-score normalization using only the source mean and standard deviation
• N4: Z-score normalization using only the target mean and standard deviation
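The four options can be sketched for a single metric (one feature column) as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the population standard deviation is an assumption, and applying N1 and N2 to source and target separately with each dataset's own statistics is also an assumption.

```python
from statistics import mean, pstdev


def min_max(values, lo, hi):
    """Scale values so that lo maps to 0 and hi maps to 1."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def z_score(values, mu, sigma):
    """Shift and scale values using the given mean and standard deviation."""
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def normalize(source, target, option):
    """Normalize one metric of the source and target datasets.

    option is one of "NoN", "N1", "N2", "N3", "N4" (names from the slides).
    """
    if option == "NoN":  # no normalization
        return list(source), list(target)
    if option == "N1":   # min-max, each dataset with its own min and max (assumption)
        return (min_max(source, min(source), max(source)),
                min_max(target, min(target), max(target)))
    if option == "N2":   # z-score, each dataset with its own mean and std (assumption)
        return (z_score(source, mean(source), pstdev(source)),
                z_score(target, mean(target), pstdev(target)))
    if option == "N3":   # z-score, both datasets use the source mean and std
        mu, sigma = mean(source), pstdev(source)
        return z_score(source, mu, sigma), z_score(target, mu, sigma)
    if option == "N4":   # z-score, both datasets use the target mean and std
        mu, sigma = mean(target), pstdev(target)
        return z_score(source, mu, sigma), z_score(target, mu, sigma)
    raise ValueError(f"unknown option: {option}")
```

For example, `normalize([1, 2, 3], [10, 20, 30], "N3")` scales both columns with the source statistics, so the target values stay far from zero, while `"N4"` does the reverse.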
Transfer Learning
• Traditional Machine Learning (ML)
– [Diagram: Learning System, Learning System]
• Transfer Learning
– [Diagram: Learning System, Learning System, with Knowledge Transfer between them]
A Common Assumption in Traditional ML
• Same distribution (training and test data are assumed to follow the same distribution)
[Figure labels: Cross Prediction; Transfer Learning]
Pan and Yang@TKDE`10, Survey on Transfer Learning
Transfer Component Analysis
• Unsupervised transfer learning
– Target project labels are not known.
• Source and target must have the same feature space
• Reduce the distribution difference between training and test datasets
Pan et al.@TNN`10, Domain Adaptation via Transfer Component Analysis
Transfer Component Analysis (cont.)
• Feature extraction approach
– Dimensionality reduction
– Projection
• Map original data into a lower-dimensional feature space
– Cf. Principal Component Analysis (PCA)
[Figure: data in a 2-dimensional feature space projected onto a 1-dimensional feature space]
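The 2-dimensional to 1-dimensional projection can be illustrated with plain PCA. This is a minimal sketch of PCA only, not of TCA; the function name is hypothetical. It uses the closed-form direction of maximum variance for a 2×2 covariance matrix.

```python
import math


def project_to_1d(points):
    """Project 2-D points onto their first principal component."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the 2x2 covariance matrix.
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / n
    var_y = sum((y - mean_y) ** 2 for _, y in points) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Angle of the direction with maximum variance.
    theta = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    ux, uy = math.cos(theta), math.sin(theta)
    # 1-D coordinate of each centered point along that direction.
    return [(x - mean_x) * ux + (y - mean_y) * uy for x, y in points]
```

Points that lie on the line y = x, such as (0, 0), (1, 1), (2, 2), keep all of their spread in the single projected coordinate.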
Transfer Component Analysis (cont.)
[Figure: source domain data and target domain data]
[Figure: PCA vs. TCA]
Pan et al.@TNN`10, Domain Adaptation via Transfer Component Analysis
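A minimal sketch of TCA with a linear kernel, following the formulation in the cited Pan et al. paper: the transfer components are the leading eigenvectors of (KLK + μI)⁻¹KHK, where K is the kernel matrix over source and target data, L encodes the distance between the source and target means, and H is the centering matrix. The function name, the linear kernel, and the default `mu` are assumptions made for illustration, and NumPy is assumed to be available.

```python
import numpy as np


def tca_linear(source, target, n_components=1, mu=1.0):
    """Map source and target data into a shared low-dimensional space."""
    xs = np.asarray(source, dtype=float)
    xt = np.asarray(target, dtype=float)
    ns, nt = len(xs), len(xt)
    n = ns + nt

    x = np.vstack([xs, xt])
    k = x @ x.T  # linear kernel over all instances (assumption)

    # L = e e^T with e = [1/ns, ..., 1/ns, -1/nt, ..., -1/nt]
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    l = np.outer(e, e)

    # Centering matrix H = I - (1/n) 1 1^T
    h = np.eye(n) - np.ones((n, n)) / n

    # Leading eigenvectors of (K L K + mu I)^-1 K H K
    a = np.linalg.solve(k @ l @ k + mu * np.eye(n), k @ h @ k)
    eigvals, eigvecs = np.linalg.eig(a)
    order = np.argsort(-eigvals.real)[:n_components]
    w = eigvecs[:, order].real

    z = k @ w  # embedded data, one row per instance
    return z[:ns], z[ns:]
```

A classifier would then be trained on the embedded source rows and applied to the embedded target rows.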
Preliminary Results using TCA
[Bar chart: F-measure of Baseline, NoN, N1, N2, N3, and N4 for Safe → Apache and Apache → Safe]
*Baseline: Cross-project defect prediction without TCA and normalization
• Prediction performance of TCA varies according to different normalization options!
TCA+: Decision rules
• Find a suitable normalization for TCA
• Steps
– #1: Characterize a dataset
– #2: Measure similarity between source and target datasets
– #3: Decision rules
#1: Characterize a dataset
[Figure: instances of Dataset A and Dataset B, with pairwise distances such as d1,2, d1,3, d1,5, d2,6, and d3,11 drawn between instances]
$DIST_A = \{ d_{ij} : \forall i, j,\ 1 \le i < n,\ 1 < j \le n,\ i < j \}$
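The DIST set can be computed as every distance between a pair of distinct instances. In this minimal sketch the function name is hypothetical and Euclidean distance is an assumption, since the slide does not name the distance measure.

```python
from itertools import combinations
from math import dist


def pairwise_distances(instances):
    """Return d_ij for all pairs i < j (Euclidean distance, an assumption)."""
    return [dist(a, b) for a, b in combinations(instances, 2)]
```

A dataset with n instances yields n(n-1)/2 distances, so three instances give three values.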
#2: Measure Similarity between source and target
• Minimum (min) and maximum (max) values of DIST
• Mean and standard deviation (std) of DIST
• The number of instances
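These five characteristics can be collected per dataset and then compared. The sketch below is illustrative only: the function names, the use of the population standard deviation, and the relative-difference comparison are assumptions, since the slides do not specify how similarity is quantified.

```python
from statistics import mean, pstdev


def characterize(dist_values, n_instances):
    """Summarize a dataset by statistics of its DIST set and its size."""
    return {
        "min": min(dist_values),
        "max": max(dist_values),
        "mean": mean(dist_values),
        "std": pstdev(dist_values),
        "n": n_instances,
    }


def relative_difference(source_stats, target_stats):
    """Relative change of each characteristic from source to target.

    Hypothetical similarity measure: 0.0 means identical, 1.0 means the
    target value is twice the source value.
    """
    diff = {}
    for key, s in source_stats.items():
        t = target_stats[key]
        if s == 0:
            diff[key] = 0.0 if t == 0 else float("inf")
        else:
            diff[key] = (t - s) / s
    return diff
```

For example, a target whose distances are all twice the source distances has a relative difference of 1.0 for min, max, mean, and std.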
#3: Decision Rules
• Rule #1
– Mean and Std are the same → NoN
• Rule #2
– Max and Min are different → N1 (max=1, min=0)
• Rule #3, #4
– Std and # of instances are different → N3 or N4 (src/tgt mean=0, std=1)
• Rule #5
– Default → N2 (mean=0, std=1)
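The rules can be read as an ordered chain of checks. The sketch below is a loose illustration rather than the rules as published: the thresholds `same_tol` and `large_tol`, the input format (relative differences of target against source), and the way N3 is chosen over N4 are all assumptions, because the slides do not give those details.

```python
def choose_normalization(diff, same_tol=0.05, large_tol=0.5):
    """Pick a normalization option from source/target differences.

    diff maps "min", "max", "mean", "std", "n" to the relative difference
    of the target value from the source value (hypothetical input format).
    """
    def same(key):
        return abs(diff[key]) <= same_tol

    def different(key):
        return abs(diff[key]) >= large_tol

    # Rule #1: mean and std are the same
    if same("mean") and same("std"):
        return "NoN"
    # Rule #2: max and min are different
    if different("max") and different("min"):
        return "N1"
    # Rules #3 and #4: std and number of instances are different.
    # Assumption: use source statistics (N3) when the target is the
    # smaller dataset, target statistics (N4) otherwise.
    if different("std") and different("n"):
        return "N3" if diff["n"] < 0 else "N4"
    # Rule #5: default
    return "N2"
```

The order matters: a pair of datasets that satisfies Rule #1 never reaches the later rules.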
EVALUATION
Experimental Setup
• 8 software subjects
• Machine learning algorithm
– Logistic regression
ReLink (Wu et al.@FSE`11)
• Projects: Apache, Safe, ZXing
• # of metrics (features): 26 (Source code)
AEEEM (D’Ambros et al.@MSR`10)
• Projects: Apache Lucene (LC), Equinox (EQ), Eclipse JDT, Eclipse PDE UI, Mylyn (ML)
• # of metrics (features): 61 (Source code, Churn, Entropy, …)
Experimental Design
• Within-project defect prediction
– Training set (50%), Test set (50%)
• Cross-project defect prediction
– Source project (Training set), Target project (Test set)
• Cross-project defect prediction with TCA/TCA+
– Source project (Training set) and Target project (Test set), both passed through TCA/TCA+
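The within-project split and the F-measure reported in the results can be sketched as follows. The function names, the random shuffling with a fixed seed, and the "Buggy" positive label (taken from the example table earlier in the deck) are assumptions. F-measure is computed in the standard way, as the harmonic mean of precision and recall.

```python
import random


def split_half(instances, seed=0):
    """Randomly split instances into a 50% training and 50% test set."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def f_measure(actual, predicted, positive="Buggy"):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In the cross-project designs no split is needed: the whole source project is the training set and the whole target project is the test set.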
RESULTS
ReLink Result
[Bar chart: F-measure of Baseline, TCA, TCA+, and Within for Safe → Apache, Apache → Safe, and Safe → ZXing]
*Baseline: Cross-project defect prediction without TCA/TCA+
ReLink Result (F-measure)
Source → Target    Baseline   TCA    TCA+   Within (Target → Target)
Safe → Apache      0.52       0.64   0.64   0.64
ZXing → Apache     0.69       0.64   0.72   0.64
Apache → Safe      0.49       0.72   0.72   0.62
ZXing → Safe       0.59       0.70   0.64   0.62
Apache → ZXing     0.46       0.45   0.49   0.33
Safe → ZXing       0.10       0.42   0.53   0.33
Average            0.49       0.59   0.61   0.53
*Baseline: Cross-project defect prediction without TCA/TCA+
AEEEM Result
[Bar chart: F-measure of Baseline, TCA, TCA+, and Within for JDT → EQ, PDE → LC, and PDE → ML]
*Baseline: Cross-project defect prediction without TCA/TCA+
AEEEM Result (F-measure)
Source → Target    Baseline   TCA    TCA+   Within (Target → Target)
JDT → EQ           0.31       0.59   0.60   0.58
LC → EQ            0.50       0.62   0.62   0.58
ML → EQ            0.24       0.56   0.56   0.58
…                  …          …      …      …
PDE → LC           0.33       0.27   0.33   0.37
EQ → ML            0.19       0.62   0.62   0.30
JDT → ML           0.27       0.56   0.56   0.30
LC → ML            0.20       0.58   0.60   0.30
PDE → ML           0.27       0.48   0.54   0.30
…                  …          …      …      …
Average            0.32       0.41   0.41   0.42
Threats to Validity
• Systems are open-source projects.
• Experimental results may not be generalizable.
• Decision rules in TCA+ may not be generalizable.
Future Work
• Transfer defect learning across different feature spaces
– e.g., ReLink → AEEEM, AEEEM → ReLink
• Local models using transfer learning
• Apply transfer learning to other Software Engineering (SE) problems
– e.g., Knowledge from mailing lists → Bug triage problem
Conclusion
• TCA+
– TCA
  • Make distributions of source and target similar
– Decision rules to improve TCA
– Significantly improved cross-project defect prediction performance
• Transfer Learning in SE
– Transfer learning may benefit other prediction and recommendation systems in SE domains.
1 of 52

Recommended

Machine Learning and Real-World Applications by
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachinePulse
23.8K views19 slides
Python for Data Science by
Python for Data SciencePython for Data Science
Python for Data ScienceHarri Hämäläinen
17.1K views39 slides
Machine Learning by
Machine LearningMachine Learning
Machine LearningDarshan Ambhaikar
59.4K views24 slides
Machine Learning Real Life Applications By Examples by
Machine Learning Real Life Applications By ExamplesMachine Learning Real Life Applications By Examples
Machine Learning Real Life Applications By ExamplesMario Cartia
3.2K views56 slides
An introduction to Machine Learning by
An introduction to Machine LearningAn introduction to Machine Learning
An introduction to Machine Learningbutest
38.8K views136 slides
Software Defect Prediction on Unlabeled Datasets by
Software Defect Prediction on Unlabeled DatasetsSoftware Defect Prediction on Unlabeled Datasets
Software Defect Prediction on Unlabeled DatasetsSung Kim
16.7K views86 slides

More Related Content

What's hot

Regularization by
RegularizationRegularization
RegularizationDarren Yow-Bang Wang
2.8K views36 slides
Machine learning for materials design: opportunities, challenges, and methods by
Machine learning for materials design: opportunities, challenges, and methodsMachine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methodsAnubhav Jain
1.6K views22 slides
A brief introduction of Artificial neural network by example by
A brief introduction of Artificial neural network by exampleA brief introduction of Artificial neural network by example
A brief introduction of Artificial neural network by exampleMrinmoy Majumder
1.1K views14 slides
Performance analysis(Time & Space Complexity) by
Performance analysis(Time & Space Complexity)Performance analysis(Time & Space Complexity)
Performance analysis(Time & Space Complexity)swapnac12
3.8K views14 slides
HML: Historical View and Trends of Deep Learning by
HML: Historical View and Trends of Deep LearningHML: Historical View and Trends of Deep Learning
HML: Historical View and Trends of Deep LearningYan Xu
2.3K views32 slides
Machine Learning: Introduction to Neural Networks by
Machine Learning: Introduction to Neural NetworksMachine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural NetworksFrancesco Collova'
16.1K views42 slides

What's hot(20)

Machine learning for materials design: opportunities, challenges, and methods by Anubhav Jain
Machine learning for materials design: opportunities, challenges, and methodsMachine learning for materials design: opportunities, challenges, and methods
Machine learning for materials design: opportunities, challenges, and methods
Anubhav Jain1.6K views
A brief introduction of Artificial neural network by example by Mrinmoy Majumder
A brief introduction of Artificial neural network by exampleA brief introduction of Artificial neural network by example
A brief introduction of Artificial neural network by example
Mrinmoy Majumder1.1K views
Performance analysis(Time & Space Complexity) by swapnac12
Performance analysis(Time & Space Complexity)Performance analysis(Time & Space Complexity)
Performance analysis(Time & Space Complexity)
swapnac123.8K views
HML: Historical View and Trends of Deep Learning by Yan Xu
HML: Historical View and Trends of Deep LearningHML: Historical View and Trends of Deep Learning
HML: Historical View and Trends of Deep Learning
Yan Xu2.3K views
Machine Learning: Introduction to Neural Networks by Francesco Collova'
Machine Learning: Introduction to Neural NetworksMachine Learning: Introduction to Neural Networks
Machine Learning: Introduction to Neural Networks
Francesco Collova'16.1K views
Machine Learning Tutorial | Machine Learning Basics | Machine Learning Algori... by Simplilearn
Machine Learning Tutorial | Machine Learning Basics | Machine Learning Algori...Machine Learning Tutorial | Machine Learning Basics | Machine Learning Algori...
Machine Learning Tutorial | Machine Learning Basics | Machine Learning Algori...
Simplilearn6.9K views
Development and quality plan by nethisip13
Development and quality planDevelopment and quality plan
Development and quality plan
nethisip133K views
Artificial neural network by DEEPASHRI HK
Artificial neural networkArtificial neural network
Artificial neural network
DEEPASHRI HK186.8K views
5.3 mining sequential patterns by Krish_ver2
5.3 mining sequential patterns5.3 mining sequential patterns
5.3 mining sequential patterns
Krish_ver28.8K views
PhD thesis defence slides by Sean Moran
PhD thesis defence slidesPhD thesis defence slides
PhD thesis defence slides
Sean Moran1.4K views

Viewers also liked

Video concept detection by learning from web images by
Video concept detection by learning from web imagesVideo concept detection by learning from web images
Video concept detection by learning from web imagesMediaMixerCommunity
1.2K views20 slides
A survey on transfer learning by
A survey on transfer learningA survey on transfer learning
A survey on transfer learningazuring
2.8K views42 slides
How Do Software Engineers Understand Code Changes? FSE 2012 by
How Do Software Engineers Understand Code Changes? FSE 2012How Do Software Engineers Understand Code Changes? FSE 2012
How Do Software Engineers Understand Code Changes? FSE 2012Sung Kim
1.8K views51 slides
The Anatomy of Developer Social Networks by
The Anatomy of Developer Social NetworksThe Anatomy of Developer Social Networks
The Anatomy of Developer Social NetworksSung Kim
835 views46 slides
Automatically Generated Patches as Debugging Aids: A Human Study (FSE 2014) by
Automatically Generated Patches as Debugging Aids: A Human Study (FSE 2014)Automatically Generated Patches as Debugging Aids: A Human Study (FSE 2014)
Automatically Generated Patches as Debugging Aids: A Human Study (FSE 2014)Sung Kim
1.9K views65 slides
REMI: Defect Prediction for Efficient API Testing (

ESEC/FSE 2015, Industria... by
REMI: Defect Prediction for Efficient API Testing (

ESEC/FSE 2015, Industria...REMI: Defect Prediction for Efficient API Testing (

ESEC/FSE 2015, Industria...
REMI: Defect Prediction for Efficient API Testing (

ESEC/FSE 2015, Industria...Sung Kim
2.5K views16 slides

Viewers also liked(20)

Video concept detection by learning from web images by MediaMixerCommunity
Video concept detection by learning from web imagesVideo concept detection by learning from web images
Video concept detection by learning from web images
MediaMixerCommunity1.2K views
A survey on transfer learning by azuring
A survey on transfer learningA survey on transfer learning
A survey on transfer learning
azuring2.8K views
How Do Software Engineers Understand Code Changes? FSE 2012 by Sung Kim
How Do Software Engineers Understand Code Changes? FSE 2012How Do Software Engineers Understand Code Changes? FSE 2012
How Do Software Engineers Understand Code Changes? FSE 2012
Sung Kim1.8K views
The Anatomy of Developer Social Networks by Sung Kim
The Anatomy of Developer Social NetworksThe Anatomy of Developer Social Networks
The Anatomy of Developer Social Networks
Sung Kim835 views
Automatically Generated Patches as Debugging Aids: A Human Study (FSE 2014) by Sung Kim
Automatically Generated Patches as Debugging Aids: A Human Study (FSE 2014)Automatically Generated Patches as Debugging Aids: A Human Study (FSE 2014)
Automatically Generated Patches as Debugging Aids: A Human Study (FSE 2014)
Sung Kim1.9K views
REMI: Defect Prediction for Efficient API Testing (

ESEC/FSE 2015, Industria... by Sung Kim
REMI: Defect Prediction for Efficient API Testing (

ESEC/FSE 2015, Industria...REMI: Defect Prediction for Efficient API Testing (

ESEC/FSE 2015, Industria...
REMI: Defect Prediction for Efficient API Testing (

ESEC/FSE 2015, Industria...
Sung Kim2.5K views
Source code comprehension on evolving software by Sung Kim
Source code comprehension on evolving softwareSource code comprehension on evolving software
Source code comprehension on evolving software
Sung Kim1.6K views
Crowd debugging (FSE 2015) by Sung Kim
Crowd debugging (FSE 2015)Crowd debugging (FSE 2015)
Crowd debugging (FSE 2015)
Sung Kim1.9K views
How We Get There: A Context-Guided Search Strategy in Concolic Testing (FSE 2... by Sung Kim
How We Get There: A Context-Guided Search Strategy in Concolic Testing (FSE 2...How We Get There: A Context-Guided Search Strategy in Concolic Testing (FSE 2...
How We Get There: A Context-Guided Search Strategy in Concolic Testing (FSE 2...
Sung Kim2.2K views
A Survey on Dynamic Symbolic Execution for Automatic Test Generation by Sung Kim
A Survey on  Dynamic Symbolic Execution  for Automatic Test GenerationA Survey on  Dynamic Symbolic Execution  for Automatic Test Generation
A Survey on Dynamic Symbolic Execution for Automatic Test Generation
Sung Kim3.1K views
Partitioning Composite Code Changes to Facilitate Code Review (MSR2015) by Sung Kim
Partitioning Composite Code Changes to Facilitate Code Review (MSR2015)Partitioning Composite Code Changes to Facilitate Code Review (MSR2015)
Partitioning Composite Code Changes to Facilitate Code Review (MSR2015)
Sung Kim1.6K views
Automatic patch generation learned from human written patches by Sung Kim
Automatic patch generation learned from human written patchesAutomatic patch generation learned from human written patches
Automatic patch generation learned from human written patches
Sung Kim9.2K views
CrashLocator: Locating Crashing Faults Based on Crash Stacks (ISSTA 2014) by Sung Kim
CrashLocator: Locating Crashing Faults Based on Crash Stacks (ISSTA 2014)CrashLocator: Locating Crashing Faults Based on Crash Stacks (ISSTA 2014)
CrashLocator: Locating Crashing Faults Based on Crash Stacks (ISSTA 2014)
Sung Kim6.4K views
Personalized Defect Prediction by Sung Kim
Personalized Defect PredictionPersonalized Defect Prediction
Personalized Defect Prediction
Sung Kim3.7K views
A Survey on Automatic Test Generation and Crash Reproduction by Sung Kim
A Survey on Automatic Test Generation and Crash ReproductionA Survey on Automatic Test Generation and Crash Reproduction
A Survey on Automatic Test Generation and Crash Reproduction
Sung Kim2.1K views
STAR: Stack Trace based Automatic Crash Reproduction by Sung Kim
STAR: Stack Trace based Automatic Crash ReproductionSTAR: Stack Trace based Automatic Crash Reproduction
STAR: Stack Trace based Automatic Crash Reproduction
Sung Kim7K views
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak by PyData
Learn to Build an App to Find Similar Images using Deep Learning- Piotr TeterwakLearn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
Learn to Build an App to Find Similar Images using Deep Learning- Piotr Teterwak
PyData5.9K views
Transfer Learning: An overview by jins0618
Transfer Learning: An overviewTransfer Learning: An overview
Transfer Learning: An overview
jins06188.2K views
Transfer of adult learning theories into practice by Issa Al Balushi
Transfer of adult learning theories into practiceTransfer of adult learning theories into practice
Transfer of adult learning theories into practice
Issa Al Balushi2K views
Tensor board by Sung Kim
Tensor boardTensor board
Tensor board
Sung Kim8.4K views

Similar to Transfer defect learning

The Status of ML Algorithms for Structure-property Relationships Using Matb... by
The Status of ML Algorithms for Structure-property Relationships Using Matb...The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...Anubhav Jain
88 views29 slides
Synthesis of analytical methods data driven decision-making by
Synthesis of analytical methods data driven decision-makingSynthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingAdam Doyle
121 views36 slides
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo... by
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...MLconf
739 views35 slides
[SEKE 2014] Practical Human Resource Allocation in Software Projects Using Ge... by
[SEKE 2014] Practical Human Resource Allocation in Software Projects Using Ge...[SEKE 2014] Practical Human Resource Allocation in Software Projects Using Ge...
[SEKE 2014] Practical Human Resource Allocation in Software Projects Using Ge...Jihun Park
1.5K views25 slides
Automated Testing of Autonomous Driving Assistance Systems by
Automated Testing of Autonomous Driving Assistance SystemsAutomated Testing of Autonomous Driving Assistance Systems
Automated Testing of Autonomous Driving Assistance SystemsLionel Briand
1.5K views63 slides
Dagstuhl 2013 - Montali - On the Relationship between OBDA and Relational Map... by
Dagstuhl 2013 - Montali - On the Relationship between OBDA and Relational Map...Dagstuhl 2013 - Montali - On the Relationship between OBDA and Relational Map...
Dagstuhl 2013 - Montali - On the Relationship between OBDA and Relational Map...Faculty of Computer Science - Free University of Bozen-Bolzano
332 views30 slides

Similar to Transfer defect learning(20)

The Status of ML Algorithms for Structure-property Relationships Using Matb... by Anubhav Jain
The Status of ML Algorithms for Structure-property Relationships Using Matb...The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...
Anubhav Jain88 views
Synthesis of analytical methods data driven decision-making by Adam Doyle
Synthesis of analytical methods data driven decision-makingSynthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-making
Adam Doyle121 views
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo... by MLconf
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
Hanjun Dai, PhD Student, School of Computational Science and Engineering, Geo...
MLconf739 views
[SEKE 2014] Practical Human Resource Allocation in Software Projects Using Ge... by Jihun Park
[SEKE 2014] Practical Human Resource Allocation in Software Projects Using Ge...[SEKE 2014] Practical Human Resource Allocation in Software Projects Using Ge...
[SEKE 2014] Practical Human Resource Allocation in Software Projects Using Ge...
Jihun Park1.5K views
Automated Testing of Autonomous Driving Assistance Systems by Lionel Briand
Automated Testing of Autonomous Driving Assistance SystemsAutomated Testing of Autonomous Driving Assistance Systems
Automated Testing of Autonomous Driving Assistance Systems
Lionel Briand1.5K views
RAMSES: Robust Analytic Models for Science at Extreme Scales by Ian Foster
RAMSES: Robust Analytic Models for Science at Extreme ScalesRAMSES: Robust Analytic Models for Science at Extreme Scales
RAMSES: Robust Analytic Models for Science at Extreme Scales
Ian Foster747 views
End-to-End Object Detection with Transformers by Seunghyun Hwang
End-to-End Object Detection with TransformersEnd-to-End Object Detection with Transformers
End-to-End Object Detection with Transformers
Seunghyun Hwang1.1K views
lecture1.ppt by SagarDR5
lecture1.pptlecture1.ppt
lecture1.ppt
SagarDR52 views
Handling Missing Attributes using Matrix Factorization  by CS, NcState
Handling Missing Attributes using Matrix Factorization Handling Missing Attributes using Matrix Factorization 
Handling Missing Attributes using Matrix Factorization 
CS, NcState4.3K views
Heuristic design of experiments w meta gradient search by Greg Makowski
Heuristic design of experiments w meta gradient searchHeuristic design of experiments w meta gradient search
Heuristic design of experiments w meta gradient search
Greg Makowski1.1K views
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ... by SAIL_QU
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
SAIL_QU126 views
Testing Machine Learning-enabled Systems: A Personal Perspective by Lionel Briand
Testing Machine Learning-enabled Systems: A Personal PerspectiveTesting Machine Learning-enabled Systems: A Personal Perspective
Testing Machine Learning-enabled Systems: A Personal Perspective
Lionel Briand1.2K views
Evaluating Machine Learning Algorithms for Materials Science using the Matben... by Anubhav Jain
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Anubhav Jain147 views
Speeding up information extraction programs: a holistic optimizer and a learn... by INRIA-OAK
Speeding up information extraction programs: a holistic optimizer and a learn...Speeding up information extraction programs: a holistic optimizer and a learn...
Speeding up information extraction programs: a holistic optimizer and a learn...
INRIA-OAK844 views
Enabling Automated Software Testing with Artificial Intelligence by Lionel Briand
Enabling Automated Software Testing with Artificial IntelligenceEnabling Automated Software Testing with Artificial Intelligence
Enabling Automated Software Testing with Artificial Intelligence
Lionel Briand380 views
Scalable Software Testing and Verification of Non-Functional Properties throu... by Lionel Briand
Scalable Software Testing and Verification of Non-Functional Properties throu...Scalable Software Testing and Verification of Non-Functional Properties throu...
Scalable Software Testing and Verification of Non-Functional Properties throu...
Lionel Briand478 views

More from Sung Kim

DeepAM: Migrate APIs with Multi-modal Sequence to Sequence Learning by
DeepAM: Migrate APIs with Multi-modal Sequence to Sequence LearningDeepAM: Migrate APIs with Multi-modal Sequence to Sequence Learning
DeepAM: Migrate APIs with Multi-modal Sequence to Sequence LearningSung Kim
1.3K views23 slides
Deep API Learning (FSE 2016) by
Deep API Learning (FSE 2016)Deep API Learning (FSE 2016)
Deep API Learning (FSE 2016)Sung Kim
1.4K views25 slides
Time series classification by
Time series classificationTime series classification
Time series classificationSung Kim
5.7K views29 slides
Heterogeneous Defect Prediction (

ESEC/FSE 2015) by
Heterogeneous Defect Prediction (

ESEC/FSE 2015)Heterogeneous Defect Prediction (

ESEC/FSE 2015)
Heterogeneous Defect Prediction (

ESEC/FSE 2015)Sung Kim
2.2K views28 slides
A Survey on Automatic Software Evolution Techniques by
A Survey on Automatic Software Evolution TechniquesA Survey on Automatic Software Evolution Techniques
A Survey on Automatic Software Evolution TechniquesSung Kim
1.1K views51 slides
Survey on Software Defect Prediction by
Survey on Software Defect PredictionSurvey on Software Defect Prediction
Survey on Software Defect PredictionSung Kim
14.1K views97 slides

More from Sung Kim(14)

DeepAM: Migrate APIs with Multi-modal Sequence to Sequence Learning by Sung Kim
DeepAM: Migrate APIs with Multi-modal Sequence to Sequence LearningDeepAM: Migrate APIs with Multi-modal Sequence to Sequence Learning
DeepAM: Migrate APIs with Multi-modal Sequence to Sequence Learning
Sung Kim1.3K views

Transfer defect learning

  • 1. Transfer Defect Learning
      Jaechang Nam, The Hong Kong University of Science and Technology, China
      Sinno Jialian Pan, Institute for Infocomm Research, Singapore
      Sunghun Kim, The Hong Kong University of Science and Technology, China
  • 2. Defect Prediction
      • Hassan et al.@ICSE`09, Predicting Faults Using the Complexity of Code Changes
      • D’Ambros et al.@MSR`10, An Extensive Comparison of Bug Prediction Approaches
      • Rahman et al.@ICSE`12, Recalling the “Imprecision” of Cross-Project Defect Prediction
      • Hata et al.@ICSE`12, Bug Prediction based on Fine-grained Module histories
      • …
      [Figure: Program → Prediction model (machine learning) → Future defects]
  • 4. Training prediction model
      Training set:
        M1   M2   …   M19   M20   Class
        11   5    …   53    78    Buggy
        …    …    …   …     …     …
        1    1    …   3     9     Clean
      Test set:
        M1   M2   …   M19   M20   Class
        2    1    …   2     8     ?
        …    …    …   …     …     …
        13   6    …   45    69    ?
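The training and test tables above can be read as a simple fit-then-predict workflow. The sketch below is only an illustration, assuming a tiny hand-written logistic regression (the learner named in the experimental setup) and made-up, pre-scaled metric values. The function names, learning rate, and toy rows are hypothetical and not taken from the slides.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def train_logistic_regression(rows, labels, lr=0.1, epochs=2000):
    """Fit weights by batch gradient descent; labels are 1 (buggy) or 0 (clean)."""
    n_features = len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            grad_w = [g + err * v for g, v in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def predict(model, rows):
    """Label each unlabeled row as "Buggy" or "Clean" with a 0.5 threshold."""
    weights, bias = model
    scores = (sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) for x in rows)
    return ["Buggy" if s >= 0.5 else "Clean" for s in scores]

# Hypothetical toy data: two metrics per instance, already scaled to [0, 1].
training_rows = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
training_labels = [1, 1, 0, 0]          # 1 = Buggy, 0 = Clean
test_rows = [[0.85, 0.9], [0.1, 0.1]]   # the "?" rows of the test set

model = train_logistic_regression(training_rows, training_labels)
print(predict(model, test_rows))
```

In the cross-project setting described on the following slides, the same two calls apply, except that the training rows come from the source project and the test rows from the target project.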
  • 5. Cross prediction model
      – Source project (training set)
      – Target project (test set)
  • 6. Cross-project Defect Prediction
      – “Training data is often not available, either because a company is too small or it is the first release of a product” (Zimmermann et al.@FSE`09, Cross-project Defect Prediction)
      – “For many new projects we may not have enough historical data to train prediction models.” (Rahman, Posnett, and Devanbu@ICSE`12, Recalling the “Imprecision” of Cross-project Defect Prediction)
  • 8. Cross-project defect prediction
      • Zimmermann et al.@FSE`09
        – “We ran 622 cross-project predictions and found only 3.4% actually worked.”
      [Pie chart: worked 3.4%, not worked 96.6%]
  • 9. Cross-company defect prediction
      • Turhan, Menzies, et al.@ESEJ`09
        – “Within-company data models are still the best”
      [Bar chart: average F-measure for Cross, Cross with a NN filter, and Within]
  • 10. Cross-project defect prediction
      • Rahman, Posnett, and Devanbu@FSE`12
      [Bar chart: average F-measure for Cross and Within]
  • 11. Cross prediction results
      [Bar chart: F-measure for Cross and Within on Equinox, JDT, and Lucene]
  • 12. Approaches of Transfer Defect Learning: Normalization, TCA, TCA+
  • 13. Approaches of Transfer Defect Learning
      • Normalization: data preprocessing for training and test data
      • TCA: Transfer Component Analysis, a state-of-the-art transfer learning algorithm
      • TCA+: TCA adapted for cross-project defect prediction, with decision rules to select a suitable data normalization option
  • 14. Data Normalization
      • Adjust all feature values to the same scale
        – E.g., make mean = 0 and std = 1
      • Known to help classification algorithms improve prediction performance [Han et al. 2012]
  • 15. Normalization Options
      • N1: Min-max normalization (max=1, min=0) [Han et al., 2012]
      • N2: Z-score normalization (mean=0, std=1) [Han et al., 2012]
      • N3: Z-score normalization using only the source mean and standard deviation
      • N4: Z-score normalization using only the target mean and standard deviation
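The four options differ only in which statistics are used to rescale the source and target data. A minimal sketch follows, assuming that N1 and N2 normalize each project with its own statistics, that the population standard deviation is used, and that constant features map to 0. The function names are hypothetical; "NoN" (no normalization) is the label used on the later result slides.

```python
from statistics import mean, pstdev

def column_stats(rows):
    """Per-feature (min, max, mean, std) for a list of feature rows."""
    return [(min(c), max(c), mean(c), pstdev(c)) for c in zip(*rows)]

def min_max(rows, stats):
    # Rescale every feature to [0, 1]; constant features map to 0 (assumption).
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, (lo, hi, _, _) in zip(row, stats)] for row in rows]

def z_score(rows, stats):
    # Subtract the given mean and divide by the given std, feature by feature.
    return [[(v - mu) / sd if sd > 0 else 0.0
             for v, (_, _, mu, sd) in zip(row, stats)] for row in rows]

def normalize(source, target, option):
    """Return (normalized_source, normalized_target) for NoN, N1, N2, N3, or N4."""
    src_stats, tgt_stats = column_stats(source), column_stats(target)
    if option == "NoN":
        return source, target
    if option == "N1":   # min-max, each project with its own min and max
        return min_max(source, src_stats), min_max(target, tgt_stats)
    if option == "N2":   # z-score, each project with its own mean and std
        return z_score(source, src_stats), z_score(target, tgt_stats)
    if option == "N3":   # z-score, both projects with the source mean and std
        return z_score(source, src_stats), z_score(target, src_stats)
    if option == "N4":   # z-score, both projects with the target mean and std
        return z_score(source, tgt_stats), z_score(target, tgt_stats)
    raise ValueError(f"unknown normalization option: {option}")
```

For example, `normalize(source_rows, target_rows, "N3")` rescales both projects with the source project's statistics, so the target data is generally not centered at 0 afterwards.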
  • 18. Transfer Learning
      [Figure: traditional machine learning (ML) with two separate learning systems, compared with transfer learning, where knowledge is transferred between learning systems]
  • 21. A Common Assumption in Traditional ML (Pan and Yang@TKDE`10, Survey on Transfer Learning)
      • Same distribution
      [Figure labels shown in successive builds: Cross Prediction, Transfer Learning]
  • 24. Transfer Component Analysis (Pan et al.@TNN`10, Domain Adaptation via Transfer Component Analysis)
      • Unsupervised transfer learning
        – Target project labels are not known.
      • Source and target must have the same feature space
      • Makes the distributions of the training and test datasets similar
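The "distribution difference" that TCA reduces is measured in Pan et al.'s formulation by the maximum mean discrepancy (MMD). The sketch below is a simplified illustration, assuming a linear kernel, in which case the empirical squared MMD reduces to the squared distance between the per-feature means of the two datasets. The function names are hypothetical, and no labels are needed, which matches the unsupervised setting.

```python
def feature_means(rows):
    """Mean of every feature (column) over the given instances."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def linear_mmd_squared(source, target):
    """Squared MMD with a linear kernel: ||mean(source) - mean(target)||^2."""
    mean_s, mean_t = feature_means(source), feature_means(target)
    return sum((a - b) ** 2 for a, b in zip(mean_s, mean_t))
```

A value near 0 means the two projects have similar feature means. A richer kernel would also compare higher-order differences, which this sketch does not attempt.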
  • 25. Transfer Component Analysis (cont.)
      • Feature extraction approach
        – Dimensionality reduction
        – Projection
          • Map original data into a lower-dimensional feature space
        – Cf. Principal Component Analysis (PCA)
      [Figure: data in a 2-dimensional feature space projected onto a 1-dimensional feature space]
  • 31. Transfer Component Analysis (cont.)
      [Figure from Pan et al.@TNN`10, Domain Adaptation via Transfer Component Analysis: source domain data and target domain data]
  • 32. Transfer Component Analysis (cont.)
      [Figure from Pan et al.@TNN`10, Domain Adaptation via Transfer Component Analysis: PCA compared with TCA]
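For readers who want the optimization behind the PCA/TCA comparison, the block below restates the TCA objective as it is usually written following Pan et al. It is a summary for orientation rather than part of the slides: the symbols K, L, H, W, μ, m, n_S, and n_T are notation introduced here.

```latex
% K: kernel matrix over all n_S source and n_T target instances
% L: MMD coefficient matrix, H: centering matrix
% W: projection onto m transfer components, \mu: trade-off parameter
\min_{W}\; \operatorname{tr}\!\left(W^{\top} K L K W\right)
         + \mu \operatorname{tr}\!\left(W^{\top} W\right)
\quad \text{s.t.} \quad W^{\top} K H K W = I_m,
\qquad
L_{ij} =
\begin{cases}
  1/n_S^{2}      & \text{if } x_i, x_j \text{ are both source instances} \\
  1/n_T^{2}      & \text{if } x_i, x_j \text{ are both target instances} \\
  -1/(n_S n_T)   & \text{otherwise}
\end{cases}
```

The first term is the distance between the projected source and target distributions, and the constraint preserves the variance of the projected data. PCA, by comparison, only maximizes variance and ignores the distance between the two domains.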
  • 33. Preliminary Results using TCA
      [Bar chart: F-measure of Baseline, NoN, N1, N2, N3, and N4 for Safe → Apache and Apache → Safe]
      *Baseline: Cross-project defect prediction without TCA and normalization
      • Prediction performance of TCA varies according to different normalization options!
  • 36. TCA+: Decision rules
      • Find a suitable normalization for TCA
      • Steps
        – #1: Characterize a dataset
        – #2: Measure similarity between source and target datasets
        – #3: Decision rules
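  • 37. #1: Characterize a dataset
      [Figure: instances of Dataset A and Dataset B drawn as points, with pairwise distances such as d1,2, d1,3, d1,5, d2,6, and d3,11 marked]
      DIST_A = {d_ij : 1 ≤ i < j ≤ n}, where d_ij is the distance between instances i and j of dataset A and n is its number of instances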
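The DIST set can be computed directly from the metric rows of a dataset. The sketch below assumes Euclidean distance, which the slide does not state explicitly, and uses a hypothetical function name. It needs Python 3.8 or later for `math.dist`.

```python
from itertools import combinations
from math import dist

def pairwise_distances(rows):
    """Return DIST: the distance d_ij for every pair of instances with i < j."""
    return [dist(a, b) for a, b in combinations(rows, 2)]
```

A dataset with n instances yields n(n-1)/2 distances, so the cost grows quadratically with project size.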
  • 38. #2: Measure Similarity between source and target
      • Minimum (min) and maximum (max) values of DIST
      • Mean and standard deviation (std) of DIST
      • The number of instances
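These five characteristics can be collected per dataset and then compared between source and target. The sketch below is one possible reading, assuming Euclidean distances, the population standard deviation, and a plain target-to-source ratio as the comparison. The slide does not define how similarity is scored, so `compare` and both function names are hypothetical.

```python
from itertools import combinations
from math import dist
from statistics import mean, pstdev

def characterize(rows):
    """Summarize a dataset by statistics of its pairwise distances (DIST)."""
    d = [dist(a, b) for a, b in combinations(rows, 2)]
    return {"min": min(d), "max": max(d), "mean": mean(d),
            "std": pstdev(d), "n": len(rows)}

def compare(src, tgt):
    """Target-to-source ratio per characteristic; NaN if the source value is 0."""
    return {k: (tgt[k] / src[k] if src[k] else float("nan")) for k in src}
```

A ratio close to 1.0 suggests the two projects are similar in that characteristic. How close counts as "same" is a threshold choice that this sketch leaves open.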
  • 39. #3: Decision Rules
      • Rule #1
        – Mean and Std are the same → NoN
      • Rule #2
        – Max and Min are different → N1 (max=1, min=0)
      • Rule #3, #4
        – Std and # of instances are different → N3 or N4 (src/tgt mean=0, std=1)
      • Rule #5
        – Default → N2 (mean=0, std=1)
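The rules above can be read as an ordered cascade. The sketch below follows the slide's wording only: "same" is modeled with a hypothetical 10% relative tolerance (`tol`), and because the slide does not say when Rule #3 applies rather than Rule #4, the function returns the combined label "N3 or N4" instead of guessing. The input dictionaries are assumed to hold the min, max, mean, and std of DIST plus the number of instances.

```python
def is_same(a, b, tol=0.1):
    """Hypothetical test for 'same': b is within a relative tolerance of a."""
    return abs(b - a) <= tol * abs(a)

def choose_normalization(src, tgt, tol=0.1):
    """Pick a normalization option from source/target dataset characteristics."""
    same = {k: is_same(src[k], tgt[k], tol)
            for k in ("min", "max", "mean", "std", "n")}
    if same["mean"] and same["std"]:
        return "NoN"        # Rule #1: mean and std are the same
    if not same["max"] and not same["min"]:
        return "N1"         # Rule #2: max and min are different
    if not same["std"] and not same["n"]:
        return "N3 or N4"   # Rules #3, #4: the slide does not say which one
    return "N2"             # Rule #5: default
```

The rule order matters: a pair of projects that satisfies Rule #1 never reaches the later checks.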
  • 41. Experimental Setup
      • 8 software subjects
      • Machine learning algorithm
        – Logistic regression
      ReLink (Wu et al.@FSE`11): 26 metrics (source code)
        – Apache
        – Safe
        – ZXing
      AEEEM (D’Ambros et al.@MSR`10): 61 metrics (source code, churn, entropy, …)
        – Apache Lucene (LC)
        – Equinox (EQ)
        – Eclipse JDT
        – Eclipse PDE UI
        – Mylyn (ML)
  • 42. Experimental Design: within-project defect prediction
      – Training set (50%)
      – Test set (50%)
  • 43. Experimental Design: cross-project defect prediction
      – Source project (training set)
      – Target project (test set)
  • 44. Experimental Design: cross-project defect prediction with TCA/TCA+
      – Source project (training set)
      – Target project (test set)
      – TCA/TCA+ applied to the source and target data
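All three designs are scored with the F-measure reported on the result slides. A minimal sketch of that score follows, assuming the usual definition as the harmonic mean of precision and recall with "Buggy" as the positive class. The function name and label strings are hypothetical.

```python
def f_measure(actual, predicted, positive="Buggy"):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    if tp == 0:
        return 0.0  # no correctly predicted buggy instance
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because the score ignores correctly predicted clean instances, a model that labels everything "Clean" scores 0 regardless of how many instances are clean.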
  • 46. ReLink Result
      [Bar chart: F-measure of Baseline, TCA, TCA+, and Within for Safe → Apache, Apache → Safe, and Safe → ZXing]
      *Baseline: Cross-project defect prediction without TCA/TCA+
  • 47. ReLink Result (F-measure)
      Source → Target    Baseline   TCA    TCA+   Within (Target → Target)
      Safe → Apache      0.52       0.64   0.64   0.64
      ZXing → Apache     0.69       0.64   0.72   0.64
      Apache → Safe      0.49       0.72   0.72   0.62
      ZXing → Safe       0.59       0.70   0.64   0.62
      Apache → ZXing     0.46       0.45   0.49   0.33
      Safe → ZXing       0.10       0.42   0.53   0.33
      Average            0.49       0.59   0.61   0.53
      *Baseline: Cross-project defect prediction without TCA/TCA+
  • 48. AEEEM Result
      [Bar chart: F-measure of Baseline, TCA, TCA+, and Within for JDT → EQ, PDE → LC, and PDE → ML]
      *Baseline: Cross-project defect prediction without TCA/TCA+
  • 49. AEEEM Result (F-measure)
      Source → Target   Baseline   TCA    TCA+   Within
      JDT → EQ          0.31       0.59   0.60   0.58
      LC → EQ           0.50       0.62   0.62   0.58
      ML → EQ           0.24       0.56   0.56   0.58
      …                 …          …      …      …
      PDE → LC          0.33       0.27   0.33   0.37
      EQ → ML           0.19       0.62   0.62   0.30
      JDT → ML          0.27       0.56   0.56   0.30
      LC → ML           0.20       0.58   0.60   0.30
      PDE → ML          0.27       0.48   0.54   0.30
      …                 …          …      …      …
      Average           0.32       0.41   0.41   0.42
  • 50. Threats to Validity
      • Systems are open-source projects.
      • Experimental results may not be generalizable.
      • Decision rules in TCA+ may not be generalizable.
  • 51. Future Work
      • Transfer defect learning across different feature spaces
        – e.g., ReLink → AEEEM, AEEEM → ReLink
      • Local models using transfer learning
      • Adapt transfer learning to other Software Engineering (SE) problems
        – e.g., Knowledge from mailing lists → Bug triage problem
  • 52. Conclusion
      • TCA+
        – TCA
          • Make distributions of source and target similar
        – Decision rules to improve TCA
        – Significantly improved cross-project defect prediction performance
      • Transfer Learning in SE
        – Transfer learning may benefit other prediction and recommendation systems in SE domains.