An Empirical Comparison of Model Validation Techniques for Defect Prediction Models

Chakkrit (Kla) Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, Kenichi Matsumoto
http://chakkrit.com kla@chakkrit.com @klainfo

Chakkrit Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, Kenichi Matsumoto: An Empirical Comparison
of Model Validation Techniques for Defect Prediction Models. IEEE Trans. Software Eng. 43(1): 1-18 (2017)

Software Quality Assurance (SQA) teams play a critical role in ensuring the
absence of software defects: the SQA team delivers a defect-free software
product to the customer.

SQA tasks are expensive and time-consuming: SQA tasks require 50% of the
development resources, and Facebook allocates about 3 months to test a new
product.

Defect prediction models are used to predict software modules that are
likely to be defective in the future: models built in the pre-release
period classify modules (e.g., Module A-D) as clean or defect-prone in the
post-release period.
Lewis et al., ICSE'13; Mockus et al., BLTJ'00; Ostrand et al., TSE'05;
Kim et al., FSE'15; Nagappan et al., ICSE'06; Zimmermann et al., FSE'09;
Caglayan et al., ICSE'15; Tan et al., ICSE'15; Shimagaki et al., ICSE'16

Model performance is used for various purposes in defect prediction
research:
• Estimate how well a model performs on unseen data
  (Zimmermann et al., FSE'09; D'Ambros et al., MSR'10, EMSE'12; Ma et al., IST'12)
• Select a top-performing defect prediction model
  (Lessmann et al., TSE'08; Ghotra et al., ICSE'15; Tantithamthavorn et al., ICSE'16)

Using Model Validation Techniques (MVTs) to estimate model performance: an
MVT splits a defect dataset into a training corpus and a testing corpus;
defect models are fit on the training corpus, and their performance is
computed on the testing corpus to produce performance estimates.

We perform a literature analysis to find the most
commonly used model validation techniques
Holdout validation randomly splits a dataset into training and testing
corpora according to a given proportion (e.g., 70% training, 30% testing).

Holdout Validation
• 50% Holdout
• 70% Holdout
• Repeated 50% Holdout
• Repeated 70% Holdout

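A minimal sketch of a single 70% holdout split in R, assuming a data.frame
`data` with a `bug` outcome column (as in the script at the end of these
slides) and the pROC package for auc():

library(pROC)
set.seed(1)
train_idx <- sample(nrow(data), size = floor(0.7 * nrow(data)))  # 70% for training
training  <- data[train_idx, ]
testing   <- data[-train_idx, ]                                  # remaining 30% for testing
m    <- glm(bug ~ ., data = training, family = "binomial")
prob <- predict(m, testing, type = "response")
auc(testing$bug, prob)  # performance estimate from one holdout split
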
k-fold cross-validation (CV) randomly partitions a dataset into k folds of
roughly equal size and defective ratio; each fold is used once as the
testing corpus while the remaining k-1 folds form the training corpus,
repeated k times.

k-Fold Cross-Validation
• Leave-one-out CV
• 2-Fold CV
• 10-Fold CV
• Repeated 10-Fold CV

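A minimal sketch of stratified 10-fold CV in R under the same assumptions
(a data.frame `data` with a `bug` column; pROC for auc()); the fold
assignment below is a simple illustration, not necessarily the exact
procedure used in the paper:

library(pROC)
set.seed(1)
k <- 10
# assign modules to folds per class so each fold keeps a similar defective ratio
folds <- rep(NA_integer_, nrow(data))
for (cls in unique(data$bug)) {
  idx <- which(data$bug == cls)
  folds[idx] <- sample(rep(1:k, length.out = length(idx)))
}
performance <- sapply(1:k, function(i) {
  training <- data[folds != i, ]
  testing  <- data[folds == i, ]
  m    <- glm(bug ~ ., data = training, family = "binomial")
  prob <- predict(m, testing, type = "response")
  auc(testing$bug, prob)
})
mean(performance)  # 10-fold CV performance estimate
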
Bootstrap validation resamples the dataset with replacement to form a
training corpus of the same size as the original sample; modules that do
not appear in the bootstrap sample form the out-of-sample testing corpus.
The process is repeated N times.

Bootstrap Validation
• Ordinary bootstrap
• Optimism-reduced bootstrap
• Out-of-sample bootstrap
• .632 Bootstrap

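A minimal sketch of one bootstrap iteration in R (the same building block
as the full out-of-sample bootstrap script near the end of these slides),
assuming a data.frame `data`:

indices  <- sample(nrow(data), replace = TRUE)  # bootstrap sample, same size as the original
training <- data[indices, ]                     # training corpus (drawn with replacement)
testing  <- data[-unique(indices), ]            # modules not drawn form the out-of-sample testing corpus
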
Our literature analysis shows that there are 3 families of the 12 most
commonly-used model validation techniques:

Holdout Validation: 50% Holdout, 70% Holdout, Repeated 50% Holdout,
Repeated 70% Holdout
k-Fold Cross-Validation: Leave-one-out CV, 2-Fold CV, 10-Fold CV,
Repeated 10-Fold CV
Bootstrap Validation: Ordinary bootstrap, Optimism-reduced bootstrap,
Out-of-sample bootstrap, .632 Bootstrap

Model validation techniques may produce different performance estimates.
For example, constructing and evaluating a model on the same defect dataset
yields AUC = 0.73 with the ordinary bootstrap, but AUC = 0.58 with 50%
holdout validation. It is not clear which model validation techniques
provide the most accurate performance estimates.

Model validation techniques may produce unstable performance estimates when
using a small dataset. For example, constructing and evaluating a model
using 10-fold cross-validation yields AUC = 0.70±0.01 on the original
sample, but AUC = 0.63±0.08 on a small sample drawn from the same defect
dataset.

Examining the bias and variance of performance estimates that are produced
by model validation techniques (MVTs):
Bias measures the difference between performance estimates and the
ground-truth.
Variance measures the variation of performance estimates when an experiment
is repeated.

Experimental design: each defect dataset is split into a sample dataset and
an unseen dataset. An MVT is applied to the sample dataset (training and
testing within the sample) to produce performance estimates, while defect
models trained on the sample dataset and tested on the unseen dataset
provide the ground-truth performance on unseen data. The two are compared
to calculate the bias and variance.

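A minimal sketch of the bias and variance calculation in R, assuming a
numeric vector `estimates` of repeated performance estimates from one MVT
and a matching vector `ground_truth` of performance on the unseen data (the
exact formulation in the paper may differ):

bias     <- mean(abs(estimates - ground_truth))  # how far the estimates are from the ground-truth
variance <- var(estimates)                       # how much the estimates vary across repetitions
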
Identifying statistically distinct ranks of model validation techniques
(https://github.com/klainfo/ScottKnottESD)

For each dataset (Dataset 1, ..., Dataset 18), the experiment is repeated
1000 times to produce a bias (and variance) distribution for each of the 12
techniques (Technique 1, Technique 2, ..., Technique 12). A Scott-Knott ESD
test is then applied to these distributions to produce a ranking of the
techniques for that dataset.

The per-dataset rankings are then pooled and a second Scott-Knott ESD test
is applied to produce the final ranking of the techniques. For example:

Pool of rankings for each dataset        Final ranking
Dataset   T1   T2   T3                   T1   T2   T3
   1       2    1    3                    1    1    2
   2       1    2    3
   3       1    1    2

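A minimal sketch of running the Scott-Knott ESD test with the ScottKnottESD
R package from the repository above; here `bias` is a hypothetical
data.frame with one column of bias values per technique, and the `groups`
field is assumed to hold the resulting rank of each technique:

# install.packages("ScottKnottESD")
library(ScottKnottESD)
sk <- sk_esd(bias)   # Scott-Knott ESD test over the techniques' distributions
sk$groups            # statistically distinct rank of each technique
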
A threat of bias exists if researchers fixate on studying the same datasets
with the same metrics [Tantithamthavorn et al., TSE'16]. We study a
collection of 18 datasets from 5 open corpora:
• 1-7K modules, 21-28% defective rate, 21-38 metrics [Shepperd et al., TSE'13]
• 1-10K modules, 11-44% defective rate, 15-32 metrics [Zimmermann et al.,
  PROMISE'07; D'Ambros et al., MSR'10; Kim et al., ICSE'11]
• 600-800 modules, 36-48% defective rate, 20 metrics [Jureczko et al.,
  PROMISE'10]

(RQ1) Which model validation techniques are the least biased for defect
prediction models?
The out-of-sample bootstrap validation produces the least biased
performance estimates.

(RQ2) Which model validation techniques are the most stable for defect
prediction models?
The ordinary bootstrap validation produces the most stable performance
estimates.

Considering both the bias and variance of the model validation techniques
(MVTs): [scatter plot of the mean Scott-Knott ESD ranks of bias (x-axis)
versus the mean ranks of variance (y-axis) for each technique (Holdout 0.5,
Holdout 0.7, Rep. Holdout 0.5/0.7, 2-Fold CV, 10-Fold CV, Rep. 10-Fold CV,
Ordinary, Optimism-reduced, Out-of-sample, and .632 bootstrap), grouped by
family: Bootstrap, Cross-Validation, Holdout.]
A technique that appears at rank 1 is the top-performing technique.

The bias and variance of performance estimates that are produced by MVTs
are statistically different (see the plot above).

The single-repetition holdout family produces the least accurate and least
stable performance estimates, whereas out-of-sample bootstrap validation
produces the most accurate and most stable performance estimates.
Out-of-sample bootstrap should be used in future defect prediction studies.

10-fold cross-validation vs. out-of-sample bootstrap validation

10-fold cross-validation: with 100 modules and a 5% defective ratio, there
is a high chance that a testing fold (Fold 1, ..., Fold 10) does not
contain any defective modules.

Out-of-sample bootstrap: the training corpus is a sample drawn with
replacement that has the same size as the original sample, so a bootstrap
sample is nearly representative of the original dataset; the ~36.8% of
modules that do not appear in the bootstrap sample form the testing corpus.

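A quick check of the ~36.8% figure in R: the chance that a given module is
never drawn in n draws with replacement is (1 - 1/n)^n, which approaches
exp(-1) ≈ 0.368 as n grows:

n <- 1000
(1 - 1/n)^n                                   # ≈ 0.3677
exp(-1)                                       # ≈ 0.3679
mean(!(1:n %in% sample(n, replace = TRUE)))   # empirical out-of-sample proportion, ≈ 0.37
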
Reasons why out-of-sample bootstrap validation should be used in future
studies:
(1) it fits models using a dataset that is of equal length to the original
dataset
(2) it handles the scarcity of defective modules in small testing corpora
(3) it handles small and high-dimensional data better than 10-fold CV
(4) it produces the least biased and most stable performance estimates
(5) it requires the same computational cost as repeated 10-fold
cross-validation

An example R script of the use of out-of-sample bootstrap (assumes a defect
dataset `data` with a `bug` outcome column and the pROC package):

library(pROC)  # provides auc()

performance <- NULL
for (i in seq(1, 100)) {
  # generate a bootstrap sample for training
  indices <- sample(nrow(data), replace=TRUE)
  training <- data[indices, ]
  # modules that do not appear in the bootstrap sample form the testing corpus
  testing <- data[-unique(indices), ]
  # construct a logistic regression model
  m <- glm(bug ~ ., data=training, family="binomial")
  # extract the predicted probabilities for the testing corpus
  prob <- predict(m, testing, type="response")
  # compute the AUC performance
  performance <- c(performance, auc(testing$bug, prob))
}
mean(performance)  # report the average AUC performance

Chakkrit (Kla) Tantithamthavorn
http://chakkrit.com kla@chakkrit.com @klainfo