Software Defect Prediction
on Unlabeled Datasets
- PhD Thesis Defence -
July 23, 2015
Jaechang Nam
Department of Computer Science and Engineering
HKUST
Software Defect Prediction
• General question of software defect
prediction
– Can we identify defect-prone entities (source
code file, binary, module, change,...) in advance?
• # of defects
• buggy or clean
• Why? (applications)
– Quality assurance for large software
(Akiyama@IFIP’71)
– Effective resource allocation
• Testing (Menzies@TSE`07, Kim@FSE`15)
• Code review (Rahman@FSE’11)
2
3
[Figure: within-project defect prediction. A model is trained on labeled instances from Project A and then predicts the unlabeled instances (?). Legend: metric value, buggy-labeled instance, clean-labeled instance, ?: unlabeled instance]
Software Defect Prediction
Related Work
Munson@TSE`92, Basili@TSE`95, Menzies@TSE`07,
Hassan@ICSE`09, Bird@FSE`11, D'Ambros@EMSE`12,
Lee@FSE`11, ...
What if labeled instances do not exist?
4
[Figure: Project X is an unlabeled dataset; every instance is unlabeled (?), so no prediction model can be trained. Legend: metric value, ?: unlabeled instance]
– New projects
– Projects lacking historical data
Solution 1
Cross-Project Defect Prediction
(CPDP)
7
[Figure: cross-project defect prediction. A model is trained on labeled instances from Project A (source) and predicts the unlabeled instances of Project X (target). Legend: metric value, buggy-labeled instance, clean-labeled instance, ?: unlabeled instance]
Related Work
Watanabe@PROMISE`08, Turhan@EMSE`09,
Zimmermann@FSE`09, Ma@IST`12,
Zhang@MSR`14
Challenges
• Often worse than WPDP: only 2% out of 622 CPDP combinations worked (Zimmermann@FSE`09)
• Requires the same metric set (same feature space); heterogeneous metrics between source and target are not supported
Solution 2
Using Only Unlabeled Datasets
8
[Figure: a model is trained directly on Project X's unlabeled dataset and used to predict its own instances. Legend: metric value, ?: unlabeled instance]
Related Work
Zhong@HASE`04,
Catal@ITNG`09
Challenge
• Human intervention (manual effort) is required
9
Software Defect Prediction
on Unlabeled Datasets
Sub-problems → Proposed Techniques
• CPDP comparable to WPDP? → Transfer Defect Learning (TCA+)
• CPDP across projects with heterogeneous metric sets? → Heterogeneous Defect Prediction (HDP)
• DP using only unlabeled datasets without human effort? → CLAMI
10
Software Defect Prediction
on Unlabeled Datasets
Sub-problems → Proposed Techniques
• CPDP comparable to WPDP? → Transfer Defect Learning (TCA+)
• CPDP across projects with heterogeneous metric sets? → Heterogeneous Defect Prediction (HDP)
• DP using only unlabeled datasets without human effort? → CLAMI
CPDP
• Reason for poor prediction
performance of CPDP
– Different distributions of source and target
datasets (Pan et al@TKDE`09)
11
TCA+
12
[Figure: TCA+ overview. Step 1, Normalization: normalize source and target together ("Normalize us together!"). Step 2, Transfer Component Analysis (TCA): project both datasets into a latent feature space ("Oops, we are different! Let's meet in another world!") so that their different distributions become similar, producing a new source and a new target.]
Data Normalization
• Adjust all metric values to the same scale
– e.g., make mean = 0 and std = 1
• Known to help classification algorithms improve
prediction performance (Han et al., 2012).
13
Normalization Options
• N1: Min-max Normalization (max=1, min=0) [Han et
al., 2012]
• N2: Z-score Normalization (mean=0, std=1) [Han et
al., 2012]
• N3: Z-score Normalization only using source mean
and standard deviation
• N4: Z-score Normalization only using target mean
and standard deviation
• NoN: No normalization
14
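To make these options concrete, here is a minimal NumPy sketch of N1 and N2, with the N3/N4 variants expressed by borrowing another dataset's statistics; the function names and the handling of constant metrics are illustrative, not the thesis implementation.

```python
import numpy as np

def min_max_normalize(X):
    """N1: rescale each metric (column) so its minimum is 0 and maximum is 1."""
    X = np.asarray(X, dtype=float)
    mn, mx = X.min(axis=0), X.max(axis=0)
    rng = np.where(mx > mn, mx - mn, 1.0)   # avoid division by zero for constant metrics
    return (X - mn) / rng

def z_score_normalize(X, stats_from=None):
    """N2: mean 0, std 1 per metric.
    N3/N4: pass another dataset as `stats_from` to reuse its mean/std
    (e.g., normalize both source and target with the source statistics)."""
    X = np.asarray(X, dtype=float)
    ref = X if stats_from is None else np.asarray(stats_from, dtype=float)
    mean, std = ref.mean(axis=0), ref.std(axis=0)
    std = np.where(std > 0, std, 1.0)
    return (X - mean) / std

# Example: N3 normalizes source and target using the source statistics.
source = np.random.rand(100, 5) * 10
target = np.random.rand(80, 5) * 3
new_source = z_score_normalize(source)
new_target = z_score_normalize(target, stats_from=source)
```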
Decision Rules for Normalization
• Find a suitable normalization
• Steps
– #1: Characterize a dataset
– #2: Measure similarity
between source and target datasets
– #3: Decision rules
15
Decision Rules for Normalization
#1: Characterize a dataset
[Figure: two example datasets, A and B, with Euclidean distances d_{i,j} drawn between pairs of instances; dataset A is more spread out than dataset B.]
DIST_A = { d_{i,j} : 1 ≤ i < j ≤ n }, the set of Euclidean distances between all pairs of instances in dataset A (DIST_B is defined likewise for dataset B).
16
Decision Rules for Normalization
#2: Measure similarity between source and target
[Figure: the same pairwise-distance view of datasets A and B; DIST = { d_{i,j} : 1 ≤ i < j ≤ n } is computed for each dataset.]
17
• Minimum (min) and maximum (max) values of
DIST
• Mean and standard deviation (std) of DIST
• The number of instances
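A minimal NumPy/SciPy sketch of how DIST and the indicators above could be computed; the function name is illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist

def dataset_characteristics(X):
    """Return the DIST-based indicators: min, max, mean, std of all
    pairwise Euclidean distances, plus the number of instances."""
    X = np.asarray(X, dtype=float)
    dist = pdist(X, metric='euclidean')   # DIST = {d_ij : 1 <= i < j <= n}
    return {
        'min': dist.min(),
        'max': dist.max(),
        'mean': dist.mean(),
        'std': dist.std(),
        'num_instances': X.shape[0],
    }
```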
Decision Rules for Normalization
#3: Decision Rules
• Rule #1: mean and std of DIST are the same → NoN
• Rule #2: max and min of DIST are different → N1 (max=1, min=0)
• Rules #3, #4: std and # of instances are different → N3 or N4 (mean=0, std=1, using source/target statistics)
• Rule #5: default → N2 (mean=0, std=1)
18
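The rules could be encoded roughly as below, reusing the dataset_characteristics sketch above. The "same"/"different" checks and the choice between N3 and N4 are simplified assumptions here; the thesis defines these comparisons more precisely.

```python
def choose_normalization(src, tgt, tol=0.1):
    """Hedged sketch of the TCA+ decision rules.
    `src` and `tgt` are dicts like those returned by dataset_characteristics().
    The relative-tolerance check stands in for the comparisons used in the thesis."""
    def same(a, b):
        return abs(a - b) <= tol * max(abs(a), abs(b), 1e-12)

    # Rule 1: mean and std of DIST are the same -> no normalization
    if same(src['mean'], tgt['mean']) and same(src['std'], tgt['std']):
        return 'NoN'
    # Rule 2: max and min of DIST are different -> min-max normalization
    if not same(src['max'], tgt['max']) and not same(src['min'], tgt['min']):
        return 'N1'
    # Rules 3/4: std and number of instances differ -> z-score with the
    # statistics of the dataset assumed to carry more information
    if not same(src['std'], tgt['std']) and src['num_instances'] != tgt['num_instances']:
        return 'N3' if src['num_instances'] > tgt['num_instances'] else 'N4'
    # Rule 5: default -> z-score normalization
    return 'N2'
```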
TCA
• Key idea
[Figure: TCA projects the source and target datasets ("Oops, we are different! Let's meet in another world!") into a latent feature space where their distributions become similar, yielding a new source and a new target.]
19
TCA (cont.)
20
Pan et al.@TNN`10, Domain Adaptation via Transfer Component Analysis
[Figure: source and target domain data before and after TCA, with buggy and clean source/target instances shown in the projected space.]
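Since the slides only illustrate TCA visually, here is a minimal, self-contained sketch of the standard (unsupervised, linear-kernel) TCA projection from Pan et al.; it is an illustration of the published formulation, not the thesis code.

```python
import numpy as np

def tca(Xs, Xt, n_components=5, mu=1.0):
    """Minimal Transfer Component Analysis sketch (linear kernel).
    Xs: (ns, d) source data, Xt: (nt, d) target data.
    Returns the projected source and target in the shared latent space."""
    ns, nt = len(Xs), len(Xt)
    X = np.vstack([Xs, Xt])
    n = ns + nt

    K = X @ X.T                                  # linear kernel matrix
    # MMD matrix L: measures the mean discrepancy between source and target
    e = np.vstack([np.full((ns, 1), 1.0 / ns), np.full((nt, 1), -1.0 / nt)])
    L = e @ e.T
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix

    # Solve (K L K + mu I)^-1 K H K and take the leading eigenvectors
    A = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    eigvals, eigvecs = np.linalg.eig(A)
    order = np.argsort(-eigvals.real)
    W = eigvecs[:, order[:n_components]].real    # transformation matrix

    Z = K @ W                                    # projected data
    return Z[:ns], Z[ns:]
```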
TCA+
22
[Figure: TCA+ overview (revisited). Normalization: "Normalize us together with a suitable option!", followed by TCA: project source and target into a latent feature space ("Oops, we are different! Let's meet in another world!") to make their different distributions similar, producing a new source and a new target.]
Research Questions
• RQ1
– What is the cross-project prediction performance
of TCA/TCA+ compared to WPDP?
• RQ2
– What is the cross-project prediction performance
of TCA/TCA+ compared to that of CPDP without
TCA/TCA+?
24
Related Work: transfer-learning approaches for CPDP

| | Metric Compensation | NN Filter | TNB | TCA+ |
| Preprocessing | N/A | Feature selection, log-filter | Log-filter | Normalization |
| Machine learner | C4.5 | Naive Bayes | TNB | Logistic Regression |
| # of subjects | 2 | 10 | 10 | 8 |
| # of predictions | 2 | 10 | 10 | 26 |
| Avg. f-measure | 0.67 (W: 0.79, C: 0.58) | 0.35 (W: 0.37, C: 0.26) | 0.39 (NN: 0.35, C: 0.33) | 0.46 (W: 0.46, C: 0.36) |
| Citation | Watanabe@PROMISE`08 | Turhan@ESEJ`09 | Ma@IST`12 | Nam@ICSE`13 |

* NN = Nearest neighbor, W = Within, C = Cross
34
35
Software Defect Prediction
on Unlabeled Datasets
Sub-problems → Proposed Techniques
• CPDP comparable to WPDP? → Transfer Defect Learning (TCA+)
• CPDP across projects with heterogeneous metric sets? → Heterogeneous Defect Prediction (HDP)
• DP using only unlabeled datasets without human effort? → CLAMI
Key Idea
• Most defect prediction metrics
– Measure complexity of software and its
development process.
• e.g.
– The number of developers touching a source code file
(Bird@FSE`11)
– The number of methods in a class (D’Ambros@ESEJ`12)
– The number of operands (Menzies@TSE`08)
More complexity implies more defect-proneness
(Rahman@ICSE`13)
38
Key Idea
• Most defect prediction metrics
– Measure complexity of software and its
development process.
• e.g.
– The number of developers touching a source code file
(Bird@FSE`11)
– The number of methods in a class (D’Ambros@ESEJ`12)
– The number of operands (Menzies@TSE`08)
More complexity implies more defect-proneness
(Rahman@ICSE`13)
39
Match source and target metrics that have similar
distributions
Metric Selection
• Why? (Guyon@JMLR`03)
– Select informative metrics
• Remove redundant and irrelevant metrics
– Decrease complexity of metric matching
combinations
• Feature Selection Approaches
(Gao@SPE`11,Shivaji@TSE`13)
– Gain Ratio
– Chi-square
– Relief-F
– Significance attribute evaluation
41
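As an illustration, metric selection with one of these rankers (chi-square, which is available in scikit-learn; Gain Ratio, Relief-F, and significance attribute evaluation would need other libraries) might look like the sketch below. The 15% ratio is an example value, not necessarily the thesis setting.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

def select_metrics(X, y, top_ratio=0.15):
    """Keep the top-ranked metrics by chi-square score.
    X must be non-negative for chi2 (e.g., after min-max normalization)."""
    k = max(1, int(X.shape[1] * top_ratio))
    selector = SelectKBest(score_func=chi2, k=k).fit(X, y)
    return selector.get_support(indices=True)   # indices of the selected metrics

# Example usage with random non-negative data and binary labels.
X = np.random.rand(200, 30)
y = np.random.randint(0, 2, size=200)
print(select_metrics(X, y))
```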
Metric Matching
42
[Figure: compute a matching score between each source metric (X1, X2) and each target metric (Y1, Y2); e.g., X1–Y1 scores 0.8 and X2–Y2 scores 0.5.]
* Different cutoff values of the matching score can be applied.
* It is possible that no metrics match at all.
KSAnalyzer
• Use p-value of Kolmogorov-Smirnov Test
(Massey@JASA`51)
43
Matching score of the i-th source metric and the j-th target metric:
M_ij = p_ij
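A minimal sketch of KSAnalyzer-style matching: the score of a metric pair is the p-value of a two-sample KS test, pairs at or below the cutoff are dropped, and the greedy one-to-one assignment below stands in for the maximum weighted bipartite matching used in the thesis.

```python
from itertools import product
from scipy.stats import ks_2samp

def ks_matching_scores(source_metrics, target_metrics):
    """Matching score M_ij = p-value of the KS test between the i-th source
    metric and the j-th target metric.
    Inputs are dicts: {metric_name: 1-D array of metric values}."""
    return {(s, t): ks_2samp(sv, tv).pvalue
            for (s, sv), (t, tv) in product(source_metrics.items(),
                                            target_metrics.items())}

def greedy_match(scores, cutoff=0.05):
    """Greedy one-to-one matching of metric pairs whose score exceeds the cutoff
    (a simplification of the weighted bipartite matching used in HDP)."""
    matched, used_s, used_t = [], set(), set()
    for (s, t), p in sorted(scores.items(), key=lambda kv: -kv[1]):
        if p > cutoff and s not in used_s and t not in used_t:
            matched.append((s, t, p))
            used_s.add(s)
            used_t.add(t)
    return matched   # may be empty: no matching at all is possible
```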
Baselines
• WPDP
• CPDP-CM (Turhan@EMSE`09,Ma@IST`12,He@IST`14)
– Cross-project defect prediction using only
common metrics between source and target
datasets
• CPDP-IFS (He@CoRR`14)
– Cross-project defect prediction on
Imbalanced Feature Set (i.e. heterogeneous
metric set)
– Uses 16 distributional characteristics of an
instance's metric values as features (e.g., mean, std,
maximum, ...)
46
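For CPDP-IFS, each instance is re-represented by distributional characteristics of its own metric values; the sketch below computes a few such statistics (only an illustrative subset of the 16 characteristics used by He et al.).

```python
import numpy as np

def distributional_features(X):
    """Re-represent each instance by summary statistics of its own metric values
    (a subset of the 16 characteristics used by CPDP-IFS)."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([
        X.mean(axis=1), X.std(axis=1), np.median(X, axis=1),
        X.min(axis=1), X.max(axis=1),
        np.percentile(X, 25, axis=1), np.percentile(X, 75, axis=1),
    ])
```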
Research Questions (RQs)
• RQ1
– Is heterogeneous defect prediction comparable
to WPDP?
• RQ2
– Is heterogeneous defect prediction comparable
to CPDP-CM?
• RQ3
– Is heterogeneous defect prediction comparable
to CPDP-IFS?
47
Benchmark Datasets

| Group | Dataset | # of instances (All) | # buggy (%) | # of metrics | Granularity |
| AEEEM | EQ | 325 | 129 (39.7%) | 61 | Class |
| | JDT | 997 | 206 (20.7%) | | |
| | LC | 399 | 64 (9.36%) | | |
| | ML | 1862 | 245 (13.2%) | | |
| | PDE | 1492 | 209 (14.0%) | | |
| MORPH | ant-1.3 | 125 | 20 (16.0%) | 20 | Class |
| | arc | 234 | 27 (11.5%) | | |
| | camel-1.0 | 339 | 13 (3.8%) | | |
| | poi-1.5 | 237 | 141 (59.5%) | | |
| | redaktor | 176 | 27 (15.3%) | | |
| | skarbonka | 45 | 9 (20.0%) | | |
| | tomcat | 858 | 77 (9.0%) | | |
| | velocity-1.4 | 196 | 147 (75.0%) | | |
| | xalan-2.4 | 723 | 110 (15.2%) | | |
| | xerces-1.2 | 440 | 71 (16.1%) | | |
48
| Group | Dataset | # of instances (All) | # buggy (%) | # of metrics | Granularity |
| ReLink | Apache | 194 | 98 (50.5%) | 26 | File |
| | Safe | 56 | 22 (39.3%) | | |
| | ZXing | 399 | 118 (29.6%) | | |
| NASA | cm1 | 327 | 42 (12.8%) | 37 | Function |
| | mw1 | 253 | 27 (10.7%) | | |
| | pc1 | 705 | 61 (8.7%) | | |
| | pc3 | 1077 | 134 (12.4%) | | |
| | pc4 | 1458 | 178 (12.2%) | | |
| SOFTLAB | ar1 | 121 | 9 (7.4%) | 29 | Function |
| | ar3 | 63 | 8 (12.7%) | | |
| | ar4 | 107 | 20 (18.7%) | | |
| | ar5 | 36 | 8 (22.2%) | | |
| | ar6 | 101 | 15 (14.9%) | | |
600 prediction combinations in total!
Experimental Settings
• Logistic Regression
• HDP vs. WPDP, CPDP-CM, and CPDP-IFS
49
[Figure: experimental design. Each target project (Project A) is randomly split into a 50% training set and a 50% test set, repeated 1000 times. WPDP trains on Project A's training half; CPDP-CM, CPDP-IFS, and HDP train on the other projects (Project 1 ... Project n). All models predict Project A's test half.]
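A sketch of one repetition of this setup for a single target project, using logistic regression and AUC; loading the datasets and the CPDP-CM/CPDP-IFS/HDP preprocessing (which must put source and target into a common feature space) are left abstract.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def one_repetition(X_target, y_target, X_source, y_source, seed):
    """One 50:50 split of the target project: WPDP trains on the target's own
    training half, the cross-project model trains on a source project;
    both are evaluated (AUC) on the same target test half."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_target, y_target, test_size=0.5, random_state=seed, stratify=y_target)

    wpdp = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    cpdp = LogisticRegression(max_iter=1000).fit(X_source, y_source)

    return (roc_auc_score(y_te, wpdp.predict_proba(X_te)[:, 1]),
            roc_auc_score(y_te, cpdp.predict_proba(X_te)[:, 1]))

# Repeated 1000 times with different random seeds; median AUCs are then compared.
```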
Different Feature Selections
(median AUCs, Win/Tie/Loss)
57
| Approach | Against WPDP (AUC / Win%) | Against CPDP-CM (AUC / Win%) | Against CPDP-IFS (AUC / Win%) | HDP AUC |
| Gain Ratio | 0.657 / 63.7% | 0.645 / 63.2% | 0.536 / 80.2% | 0.720 |
| Chi-Square | 0.657 / 64.7% | 0.651 / 66.4% | 0.556 / 82.3% | 0.727 |
| Significance | 0.657 / 66.2% | 0.636 / 66.2% | 0.553 / 82.0% | 0.724 |
| Relief-F | 0.670 / 57.0% | 0.657 / 63.1% | 0.543 / 80.5% | 0.709 |
| None | 0.657 / 47.3% | 0.624 / 50.3% | 0.536 / 66.3% | 0.663 |
Results in Different Cutoffs
58
| Cutoff | Against WPDP (AUC / Win%) | Against CPDP-CM (AUC / Win%) | Against CPDP-IFS (AUC / Win%) | HDP AUC | Target coverage |
| 0.05 | 0.657 / 66.2% | 0.636 / 66.2% | 0.553 / 82.4% | 0.724* | 100% |
| 0.90 | 0.657 / 100% | 0.761 / 71.4% | 0.624 / 100% | 0.852* | 21% |
59
Software Defect Prediction
on Unlabeled Datasets
Sub-problems → Proposed Techniques
• CPDP comparable to WPDP? → Transfer Defect Learning (TCA+)
• CPDP across projects with heterogeneous metric sets? → Heterogeneous Defect Prediction (HDP)
• DP using only unlabeled datasets without human effort? → CLAMI
How?
• Recall the trend of defect prediction metrics
– They measure complexity of software and its
development process.
• e.g.
– The number of developers touching a source code file
(Bird@FSE`11)
– The number of methods in a class (D’Ambros@ESEJ`12)
– The number of operands (Menzies@TSE`08)
Higher metric values imply more defect-proneness
(Rahman@ICSE`13)
63
How?
• Recall this trend of defect prediction metrics
– They measure complexity of software and its
development process.
• e.g.
– The number of developers touching a source code file
(Bird@FSE`11)
– The number of methods in a class (D’Ambros@ESEJ`12)
– The number of operands (Menzies@TSE`08)
Higher metric values imply more defect-proneness
(Rahman@ICSE`13)
64
(1) Label instances that have higher metric values as
buggy!
(2) Generate a training set by removing metrics and
instances that violate (1).
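A minimal sketch of the labeling half of this idea (the "CLA" part of CLAMI): count how many of an instance's metric values exceed the per-metric median and label instances with higher counts as buggy. The grouping cutoff and the omitted metric/instance selection step are simplifications of the actual CLAMI procedure.

```python
import numpy as np

def cla_label(X, cutoff=0.5):
    """CLA-style labeling sketch: an instance whose metric values exceed the
    per-metric median for more than `cutoff` of the metrics is labeled buggy (1).
    The metric/instance selection step (the 'MI' part of CLAMI) is omitted here."""
    X = np.asarray(X, dtype=float)
    medians = np.median(X, axis=0)
    higher_counts = (X > medians).sum(axis=1)          # K: # of metrics above median
    return (higher_counts > cutoff * X.shape[1]).astype(int)

# Usage: labels = cla_label(unlabeled_metrics); train any classifier on (X, labels).
```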
Baselines
• Supervised learning model (i.e., WPDP)
• Defect prediction using only unlabeled
datasets
– Expert-based (Zhong@HASE`04)
• Cluster instances into 20 clusters with K-means
• A human expert labels each cluster
– Threshold-based (Catal@ITNG`09)
• [LoC, CC, UOP, UOpnd, TOp, TOpnd]
= [65, 10, 25, 40, 125, 70]
– Label an instance as buggy if any of its metric values
is greater than the corresponding threshold
• Manual effort is required to decide threshold values
in advance.
71
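For reference, the threshold-based baseline reduces to a one-line rule; a sketch using the thresholds quoted above (columns are assumed to be ordered as [LoC, CC, UOP, UOpnd, TOp, TOpnd]).

```python
import numpy as np

# Thresholds from Catal@ITNG`09 for [LoC, CC, UOP, UOpnd, TOp, TOpnd].
THRESHOLDS = np.array([65, 10, 25, 40, 125, 70])

def threshold_label(X):
    """Label an instance as buggy (1) if any of its metric values
    exceeds the corresponding threshold; otherwise clean (0)."""
    X = np.asarray(X, dtype=float)
    return (X > THRESHOLDS).any(axis=1).astype(int)
```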
Research Questions (RQs)
• RQ1
– CLAMI vs. Supervised learning model?
• RQ2
– CLAMI vs. Expert-/threshold-based approaches?
(Zhong@HASE`04, Catal@ITNG`09)
72
Experimental Settings (RQ1)
- Supervised learning model -
74
[Figure: each dataset is randomly split into a 50% training set and a 50% test set, repeated 1000 times. The supervised model (baseline) is trained on the labeled training set; the CLAMI model is trained without labels. Both predict the test set.]
Experimental Settings (RQ2)
-Comparison to existing approaches -
75
[Figure: on an unlabeled dataset, the CLAMI model is compared with the threshold-based approach (Baseline 1, Catal@ITNG`09) and the expert-based approach (Baseline 2, Zhong@HASE`04); each labels/predicts the same instances.]
Distributions of metrics (Safe)
80
[Box plots: metrics most frequently selected by CLAMI vs. metrics with less discriminative power]
Distributions of metrics (Lucene)
81
[Box plots: metrics most frequently selected by CLAMI vs. metrics with less discriminative power]
82
Software Defect Prediction
on Unlabeled Datasets
Sub-problems → Proposed Techniques
• CPDP comparable to WPDP? → Transfer Defect Learning (TCA+)
• CPDP across projects with heterogeneous metric sets? → Heterogeneous Defect Prediction (HDP)
• DP using only unlabeled datasets without human effort? → CLAMI
Publications at HKUST
• Defect Prediction
– Micro Interaction Metrics for Defect Prediction@FSE`11, Taek Lee,
Jaechang Nam, Donggyun Han, Sunghun Kim and Hoh Peter In
– Transfer Defect Learning@ICSE`13, Jaechang Nam, Sinno Jialin Pan and
Sunghun Kim, Nominee, ACM SIGSOFT Distinguished Paper Award
– Heterogeneous Defect Prediction@FSE`15, Jaechang Nam and Sunghun Kim
– REMI: Defect Prediction for Efficient API Testing@FSE`15, Mijung Kim,
Jaechang Nam, Jaehyuk Yeon, Soonhwang Choi, and Sunghun Kim, Industrial
Track
– CLAMI: Defect Prediction on Unlabeled Datasets@ASE`15, Jaechang Nam
and Sunghun Kim
• Testing
– Calibrated Mutation Testing@MUTATION`12, Jaechang Nam, David Schuler,
and Andreas Zeller
• Automated bug-fixing
– Automatic Patch Generation Learned from Human-written
Patches@ICSE`13, Dongsun Kim, Jaechang Nam, Jaewoo Song and Sunghun
Kim, ACM SIGSOFT Distinguished Paper Award Winner
84
Good afternoon, everyone!
I’m JC. Thanks for coming to my PhD defence.
The title of my thesis is Software Defect Prediction on Unlabeled Datasets.
The general question of software defect prediction is:
Can we identify defect-prone software entities in advance?
For example, by using a defect prediction technique, we can predict whether a source code file is buggy or clean.
After predicting defect-prone software entities, software quality assurance teams can effectively allocate limited resources for software testing and code review to develop reliable software products.
Here are Project A and some software entities. Let's say these entities are source code files.
I want to predict whether these files are buggy or clean.
To do this, we need a prediction model.
Since defect prediction models are trained by machine learning algorithms, we need labeled instances collected from previous releases.
This is a labeled instance. An instance consists of features and a label.
Various software metrics, such as LoC, # of functions in a file, and # of authors touching a source file, are used as features for machine learning.
Software metrics measure complexity of software and its development process
Each instance can be labeled by past bug information.
Software metrics and past bug information can be collected from software archives such as version control systems and bug report systems.
With these labeled instances, we can build a prediction model and predict the unlabeled instances.
This prediction is conducted within the same project. So, we call this Within-project defect prediction (WPDP).
There are many studies on WPDP, and they showed good prediction performance (e.g., prediction accuracy around 0.7).
What if there are no labeled instances? This can happen in new projects and in projects lacking historical data.
New projects do not have past bug information to label instances.
Some projects also do not have bug information because they lack historical data in their software archives.
When I participated in an industrial project for Samsung Electronics, it was really difficult to generate labeled instances because their software archives were not well managed by developers.
So, in some real industrial projects, we may not generate labeled instances to build a prediction model.
Without labeled instances, we cannot build a prediction model.
After experiencing this limitation from industry, I decided to address this problem.
We define this problem as Software Defect Prediction on Unlabeled Datasets.
There are existing solutions to build a prediction model for unlabeled datasets.
The first solution is cross-project defect prediction. We can reuse labeled instances from other projects.
Normalization puts all data values on the same scale. For example, we can make the mean of a dataset 0 and its standard deviation 1.
Normalization is also known to help classification algorithms. Since many defect prediction models classify source code as buggy or clean, defect prediction is a classification problem, so we applied normalization to all training and test datasets.
Based on these normalization techniques, we defined several normalization options for defect prediction data sets.
N1 is min-max normalization, which scales the maximum and minimum values to 1 and 0, respectively.
N2 is z-score normalization, which makes the mean and standard deviation 0 and 1, respectively.
We assume that some datasets may not have enough statistical information, so we defined variations of z-score normalization.
N3 normalizes both the source and target datasets using only the mean and standard deviation of the source data (for when the target data does not have enough statistical information, for example, too few instances in the dataset).
N4 uses only the target information to normalize both the source and target datasets.
TCA+ provides decision rules to select a suitable normalization option.
For the decision rules, we first characterize both source and target data sets to identify their difference.
In the second step, we measure similarity between source and target data sets.
Based on the degree of similarity, we created decision rules.
Then, how can we characterize a dataset?
Here are two datasets.
Intuitively, dataset A's distribution is sparser than dataset B's.
To quantify this difference, we compute the Euclidean distance between all pairs of instances in each dataset.
We define the set DIST of the distances of all pairs.
Likewise, we can get the DIST set for dataset B.
These are the decision rules.
If the mean and std are the same, we assume that the distributions of the source and target are the same, so we applied no normalization.
For Rule 2, if the max and min values are different, we used N1 (min-max normalization).
For Rules 3 and 4, we considered the std and the number of instances. If the target information is not enough, then we used the source mean and std to normalize both datasets.
In the case of Rule 5, if no other rules are applicable, we applied the N2 option, which makes the mean and std 0 and 1, respectively.
Here is an example showing how PCA and TCA work.
In two-dimensional space, there are source and target data sets and we can see distributions are clearly different.
If we apply PCA and TCA, we get the following results in one-dimensional space.
Probability density function
Probability mass function
With PCA, instances are projected into a one-dimensional space; however, the distributions of the source and target are still different.
With TCA, all instances are also projected into a one-dimensional space, where the distributions of the source and target are similar.
Positive and negative instances of both the training and test domains have discriminative power, as shown in this figure.
You can find the detailed equations of this algorithm in the paper.
[add labels]
We report within-project prediction results.
In the within-project prediction setting, we used 50:50 random splits, which are widely used in the literature.
We repeated the 50:50 random splits 100 times.
Wilcoxon signed-rank test (matched pairs)
Wilcoxon signed-rank test (matched pairs)
Wilcoxon signed-rank test (matched pairs)
Various feature selection approaches can be applied
AEEEM: object-oriented (OO) metrics, previous-defect metrics, entropy metrics of change and code, and churn-of-source-code metrics [4].
MORPH: McCabe’s cyclomatic metrics, CK metrics, and other OO metrics [36].
ReLink: code complexity metrics
NASA: Halstead metrics and McCabe’s cyclomatic metrics, additional complexity metrics such as parameter count and percentage of comments
SOFTLAB: Halstead metrics and McCabe’s cyclomatic metrics
Clustering: group instances that have higher metric values
Labeling: label groups that have higher metric values as buggy
Metric and Instance selection: select more informative metrics and instances
Clustering: group instances that have higher metric values
Labeling: label groups that have higher metric values as buggy
Metric and Instance selection: select more informative metrics and instances
Manual effort to decide threshold
Literature
Tuning machine: using known bugs, decide threshold values that minimize prediction error.
Analysis of multiple releases
In the case of Lucene, all clusters were labeled as clean by the expert.
Better results are bold-faced (this is not a statistical test; the experiment was conducted once).