An Empirical Comparison of Model Validation Techniques for Defect Prediction Models

Chakkrit (Kla) Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, Kenichi Matsumoto
http://chakkrit.com kla@chakkrit.com @klainfo

Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, Kenichi Matsumoto: An Empirical Comparison
of Model Validation Techniques for Defect Prediction Models. IEEE Trans. Software Eng. 43(1): 1-18 (2017)

Software Quality Assurance (SQA) teams play a critical role in ensuring the
absence of software defects: the SQA team delivers a defect-free software
product to the customer.

SQA tasks are expensive and time-consuming: SQA tasks require 50% of the
development resources, and Facebook allocates about 3 months to test a new
product.

Defect prediction models are used to predict software modules that are
likely to be defective in the future: models built in the pre-release
period classify modules (e.g., Module A-D) as clean or defect-prone in the
post-release period.
Lewis et al., ICSE'13; Mockus et al., BLTJ'00; Ostrand et al., TSE'05;
Kim et al., FSE'15; Nagappan et al., ICSE'06; Zimmermann et al., FSE'09;
Caglayan et al., ICSE'15; Tan et al., ICSE'15; Shimagaki et al., ICSE'16

Model performance is used for various purposes in defect prediction
research:
• Estimate how well a model performs on unseen data
  (Zimmermann et al., FSE'09; D'Ambros et al., MSR'10, EMSE'12; Ma et al., IST'12)
• Select a top-performing defect prediction model
  (Lessmann et al., TSE'08; Ghotra et al., ICSE'15; Tantithamthavorn et al., ICSE'16)

Using Model Validation Techniques (MVTs) to estimate model performance: an
MVT splits a defect dataset into a training corpus and a testing corpus;
defect models are fit on the training corpus, and their performance is
computed on the testing corpus to produce performance estimates.

We perform a literature analysis to find the most
commonly used model validation techniques
Holdout validation randomly splits a dataset into training and testing
corpora according to a given proportion (e.g., 70% training, 30% testing).

Holdout Validation
• 50% Holdout
• 70% Holdout
• Repeated 50% Holdout
• Repeated 70% Holdout

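A minimal sketch of a single 70% holdout split in R, assuming a data.frame
`data` with a `bug` outcome column (as in the script at the end of these
slides) and the pROC package for auc():

library(pROC)
set.seed(1)
train_idx <- sample(nrow(data), size = floor(0.7 * nrow(data)))  # 70% for training
training  <- data[train_idx, ]
testing   <- data[-train_idx, ]                                  # remaining 30% for testing
m    <- glm(bug ~ ., data = training, family = "binomial")
prob <- predict(m, testing, type = "response")
auc(testing$bug, prob)  # performance estimate from one holdout split
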
k-fold cross-validation (CV) randomly partitions a dataset into k folds of
roughly equal size and defective ratio; each fold is used once as the
testing corpus while the remaining k-1 folds form the training corpus,
repeated k times.

k-Fold Cross-Validation
• Leave-one-out CV
• 2-Fold CV
• 10-Fold CV
• Repeated 10-Fold CV

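A minimal sketch of stratified 10-fold CV in R under the same assumptions
(a data.frame `data` with a `bug` column; pROC for auc()); the fold
assignment below is a simple illustration, not necessarily the exact
procedure used in the paper:

library(pROC)
set.seed(1)
k <- 10
# assign modules to folds per class so each fold keeps a similar defective ratio
folds <- rep(NA_integer_, nrow(data))
for (cls in unique(data$bug)) {
  idx <- which(data$bug == cls)
  folds[idx] <- sample(rep(1:k, length.out = length(idx)))
}
performance <- sapply(1:k, function(i) {
  training <- data[folds != i, ]
  testing  <- data[folds == i, ]
  m    <- glm(bug ~ ., data = training, family = "binomial")
  prob <- predict(m, testing, type = "response")
  auc(testing$bug, prob)
})
mean(performance)  # 10-fold CV performance estimate
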
Bootstrap validation resamples the dataset with replacement to form a
training corpus of the same size as the original sample; modules that do
not appear in the bootstrap sample form the out-of-sample testing corpus.
The process is repeated N times.

Bootstrap Validation
• Ordinary bootstrap
• Optimism-reduced bootstrap
• Out-of-sample bootstrap
• .632 Bootstrap

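A minimal sketch of one bootstrap iteration in R (the same building block
as the full out-of-sample bootstrap script near the end of these slides),
assuming a data.frame `data`:

indices  <- sample(nrow(data), replace = TRUE)  # bootstrap sample, same size as the original
training <- data[indices, ]                     # training corpus (drawn with replacement)
testing  <- data[-unique(indices), ]            # modules not drawn form the out-of-sample testing corpus
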
Our literature analysis shows that there are 3 families of the 12 most
commonly-used model validation techniques:

Holdout Validation: 50% Holdout, 70% Holdout, Repeated 50% Holdout,
Repeated 70% Holdout
k-Fold Cross-Validation: Leave-one-out CV, 2-Fold CV, 10-Fold CV,
Repeated 10-Fold CV
Bootstrap Validation: Ordinary bootstrap, Optimism-reduced bootstrap,
Out-of-sample bootstrap, .632 Bootstrap

Model validation techniques may produce different performance estimates.
For example, constructing and evaluating a model on the same defect dataset
yields AUC = 0.73 with the ordinary bootstrap, but AUC = 0.58 with 50%
holdout validation. It is not clear which model validation techniques
provide the most accurate performance estimates.

Model validation techniques may produce unstable performance estimates when
using a small dataset. For example, constructing and evaluating a model
using 10-fold cross-validation yields AUC = 0.70±0.01 on the original
sample, but AUC = 0.63±0.08 on a small sample drawn from the same defect
dataset.

Examining the bias and variance of performance estimates that are produced
by model validation techniques (MVTs):
Bias measures the difference between performance estimates and the
ground-truth.
Variance measures the variation of performance estimates when an experiment
is repeated.

Experimental design: each defect dataset is split into a sample dataset and
an unseen dataset. An MVT is applied to the sample dataset (training and
testing within the sample) to produce performance estimates, while defect
models trained on the sample dataset and tested on the unseen dataset
provide the ground-truth performance on unseen data. The two are compared
to calculate the bias and variance.

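A minimal sketch of the bias and variance calculation in R, assuming a
numeric vector `estimates` of repeated performance estimates from one MVT
and a matching vector `ground_truth` of performance on the unseen data (the
exact formulation in the paper may differ):

bias     <- mean(abs(estimates - ground_truth))  # how far the estimates are from the ground-truth
variance <- var(estimates)                       # how much the estimates vary across repetitions
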
Identifying statistically distinct ranks of model validation techniques
(https://github.com/klainfo/ScottKnottESD)

For each dataset (Dataset 1, ..., Dataset 18), the experiment is repeated
1000 times to produce a bias (and variance) distribution for each of the 12
techniques (Technique 1, Technique 2, ..., Technique 12). A Scott-Knott ESD
test is then applied to these distributions to produce a ranking of the
techniques for that dataset.

The per-dataset rankings are then pooled and a second Scott-Knott ESD test
is applied to produce the final ranking of the techniques. For example:

Pool of rankings for each dataset        Final ranking
Dataset   T1   T2   T3                   T1   T2   T3
   1       2    1    3                    1    1    2
   2       1    2    3
   3       1    1    2

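A minimal sketch of running the Scott-Knott ESD test with the ScottKnottESD
R package from the repository above; here `bias` is a hypothetical
data.frame with one column of bias values per technique, and the `groups`
field is assumed to hold the resulting rank of each technique:

# install.packages("ScottKnottESD")
library(ScottKnottESD)
sk <- sk_esd(bias)   # Scott-Knott ESD test over the techniques' distributions
sk$groups            # statistically distinct rank of each technique
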
A threat of bias exists if researchers fixate on studying the same datasets
with the same metrics [Tantithamthavorn et al., TSE'16]. We study a
collection of 18 datasets from 5 open corpora:
• 1-7K modules, 21-28% defective rate, 21-38 metrics [Shepperd et al., TSE'13]
• 1-10K modules, 11-44% defective rate, 15-32 metrics [Zimmermann et al.,
  PROMISE'07; D'Ambros et al., MSR'10; Kim et al., ICSE'11]
• 600-800 modules, 36-48% defective rate, 20 metrics [Jureczko et al.,
  PROMISE'10]

(RQ1) Which model validation techniques are the least biased for defect
prediction models?
The out-of-sample bootstrap validation produces the least biased
performance estimates.

(RQ2) Which model validation techniques are the most stable for defect
prediction models?
The ordinary bootstrap validation produces the most stable performance
estimates.

Considering both the bias and variance of the model validation techniques
(MVTs): [scatter plot of the mean Scott-Knott ESD ranks of bias (x-axis)
versus the mean ranks of variance (y-axis) for each technique (Holdout 0.5,
Holdout 0.7, Rep. Holdout 0.5/0.7, 2-Fold CV, 10-Fold CV, Rep. 10-Fold CV,
Ordinary, Optimism-reduced, Out-of-sample, and .632 bootstrap), grouped by
family: Bootstrap, Cross-Validation, Holdout.]
A technique that appears at rank 1 is the top-performing technique.

The bias and variance of performance estimates that are produced by MVTs
are statistically different (see the plot above).

The single-repetition holdout family produces the least accurate and least
stable performance estimates, whereas out-of-sample bootstrap validation
produces the most accurate and most stable performance estimates.
Out-of-sample bootstrap should be used in future defect prediction studies.

10-fold cross-validation vs. out-of-sample bootstrap validation

10-fold cross-validation: with 100 modules and a 5% defective ratio, there
is a high chance that a testing fold (Fold 1, ..., Fold 10) does not
contain any defective modules.

Out-of-sample bootstrap: the training corpus is a sample drawn with
replacement that has the same size as the original sample, so a bootstrap
sample is nearly representative of the original dataset; the ~36.8% of
modules that do not appear in the bootstrap sample form the testing corpus.

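A quick check of the ~36.8% figure in R: the chance that a given module is
never drawn in n draws with replacement is (1 - 1/n)^n, which approaches
exp(-1) ≈ 0.368 as n grows:

n <- 1000
(1 - 1/n)^n                                   # ≈ 0.3677
exp(-1)                                       # ≈ 0.3679
mean(!(1:n %in% sample(n, replace = TRUE)))   # empirical out-of-sample proportion, ≈ 0.37
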
Reasons why out-of-sample bootstrap validation should be used in future
studies:
(1) it fits models using a dataset that is of equal length to the original
dataset
(2) it handles the scarcity of defective modules in small testing corpora
(3) it handles small and high-dimensional data better than 10-fold CV
(4) it produces the least biased and most stable performance estimates
(5) it requires the same computational cost as repeated 10-fold
cross-validation

An example R script of the use of out-of-sample bootstrap (assumes a defect
dataset `data` with a `bug` outcome column and the pROC package):

library(pROC)  # provides auc()

performance <- NULL
for (i in seq(1, 100)) {
  # generate a bootstrap sample for training
  indices <- sample(nrow(data), replace=TRUE)
  training <- data[indices, ]
  # modules that do not appear in the bootstrap sample form the testing corpus
  testing <- data[-unique(indices), ]
  # construct a logistic regression model
  m <- glm(bug ~ ., data=training, family="binomial")
  # extract the predicted probabilities for the testing corpus
  prob <- predict(m, testing, type="response")
  # compute the AUC performance
  performance <- c(performance, auc(testing$bug, prob))
}
mean(performance)  # report the average AUC performance

Chakkrit (Kla) Tantithamthavorn
http://chakkrit.com kla@chakkrit.com @klainfo