Msr2012 bettenburg presentation

Think Locally, Act Globally
Improving Defect and Effort Prediction Models
Nicolas Bettenburg • Meiyappan Nagappan • Ahmed E. Hassan
Queen’s University • Kingston, ON, Canada
SOFTWARE ANALYSIS
& INTELLIGENCE LAB
Saturday, 2 June, 12

Data Modelling in Empirical SE
Observations: measured from project data
Model: describes observations mathematically
Understanding and Prediction: guide decision making, guide process optimizations and future research

Model Building Today
1. Split the Whole Dataset into Training Data and Testing Data
2. Learn a model M (Learned Model) from the Training Data
3. Use M to compute Predictions Y for the Testing Data
4. Compare the Predictions Y against the Testing Data
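The global workflow on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the data is synthetic, the 80/20 split is arbitrary, and the "model" is a one-variable least-squares line rather than the regression models used in the study. The names `fit_line`, `predict` and `mean_abs_error` are hypothetical helpers.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def predict(model, x):
    intercept, slope = model
    return intercept + slope * x


def mean_abs_error(model, rows):
    return sum(abs(predict(model, x) - y) for x, y in rows) / len(rows)


# Hypothetical whole dataset: (metric value, response) pairs, made up for illustration.
rng = random.Random(0)
whole_dataset = [(x, 0.5 * x + rng.uniform(-1, 1)) for x in range(40)]

# Split the whole dataset into training data and testing data (80/20 is an assumption).
rng.shuffle(whole_dataset)
cut = int(0.8 * len(whole_dataset))
training, testing = whole_dataset[:cut], whole_dataset[cut:]

# Learn a single model M on the training data.
M = fit_line([x for x, _ in training], [y for _, y in training])

# Predict Y for the testing data and compare against the actual values.
error = mean_abs_error(M, testing)
print(f"mean absolute error on testing data: {error:.3f}")
```

One model is learned for all of the training data, which is the baseline that the local approaches later in the deck are compared against.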

Much Research Effort on
new metrics and new models!

Maybe we need to look more at the data part

In the Field
Tom Zimmermann: "We ran 622 cross-project predictions and found that only 3.4% actually worked."
Tim Menzies: "Rather than focus on generalities, empirical SE should focus more on context-specific principles."
Taking local properties of data into consideration leads to better models!

Using Locality in Statistical Models

1. Does this principle work for statistical models?
2. Does it work for Prediction?
3. Can we do better?

Building Local Models
1. Split the Whole Dataset into Training Data and Testing Data
2. Cluster Data: partition the Training Data into clusters
3. Learn Multiple Models: one Learned Model per cluster (M1, M2, M3)
4. Predict Individually: M1, M2, M3 each produce Predictions Y for the Testing Data
5. Compare the Predictions against the Testing Data
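The clustered workflow can be sketched as follows. Everything here is an assumption made for illustration: the data is synthetic with two regimes, the clustering is a tiny one-dimensional k-means standing in for the clustering step (a figure caption later in the deck names MCLUST), each local model is a least-squares line, and a testing point is predicted by the model of its nearest cluster centre. All function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def kmeans_1d(values, k, iterations=20):
    """Tiny 1-D k-means, used here only as a stand-in clustering step."""
    ordered = sorted(values)
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[nearest(v, centroids)].append(v)
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def nearest(x, centroids):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))


def mean_abs_error(predict, rows):
    return sum(abs(predict(x) - y) for x, y in rows) / len(rows)


# Hypothetical data with two regimes: y rises up to x = 10, then falls.
rows = [(i * 0.5, i * 0.5 if i * 0.5 < 10 else 20 - i * 0.5) for i in range(40)]
training, testing = rows[::2], rows[1::2]

# Global model: one line for all training data.
g_intercept, g_slope = fit_line([x for x, _ in training], [y for _, y in training])
global_error = mean_abs_error(lambda x: g_intercept + g_slope * x, testing)

# Local models: cluster the training data, then learn one line per cluster.
centroids = kmeans_1d([x for x, _ in training], k=2)
models = []
for c in range(len(centroids)):
    members = [(x, y) for x, y in training if nearest(x, centroids) == c]
    models.append(fit_line([x for x, _ in members], [y for _, y in members]))


def local_predict(x):
    """Predict with the model of the cluster whose centre is closest to x."""
    intercept, slope = models[nearest(x, centroids)]
    return intercept + slope * x


local_error = mean_abs_error(local_predict, testing)
print(f"global error: {global_error:.3f}, local error: {local_error:.3f}")
```

On this deliberately two-regime data the single global line misses both halves, while the two local lines follow them closely. Real project data will rarely separate this cleanly.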

Global Statistical Model
[Figure: f(X) plotted against X, with X from 0 to 6]
Figure 2.1: A linear spline function with knots at a = 1, b = 3, c = 5.
Model fit leaves much room for improvement!

Local Statistical Model
[Figure: f(X) plotted against X, with X from 0 to 6, fitted by Model 1 and Model 2]
Improved Fit!

How can we use this approach to get an
even better fit?

Be Even More Local!
[Figure: f(X) plotted against X, with X from 0 to 6]
Great Fit!
BUT: Risk of Overfitting the Data!!

Clustering independent of Fit

[Figure: linear spline f(X) plotted against X, with X from 0 to 6]
$C(Y \mid X) = f(X) = X\beta,$
where $X\beta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4,$
and
$X_1 = X, \quad X_2 = (X - a)_+, \quad X_3 = (X - b)_+, \quad X_4 = (X - c)_+.$
Optimize Local Fit wrt. Minimizing Global Overfit
Multivariate Adaptive Regression Splines (MARS)
create local knowledge that optimizes process globally
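The linear spline shown on this slide is built from hinge terms of the form (X - a)+, where (u)+ equals u when u is positive and 0 otherwise. A minimal Python sketch of evaluating such a function is below. The knots 1, 3 and 5 come from the figure caption earlier in the deck, while the coefficient values are made up for illustration and `hinge` and `linear_spline` are hypothetical names. This only evaluates a fixed spline; it does not perform the knot search and pruning that MARS does.

```python
def hinge(u):
    """Positive part (u)_+ : u when u > 0, otherwise 0."""
    return u if u > 0 else 0.0


def linear_spline(x, betas, knots=(1.0, 3.0, 5.0)):
    """f(x) = b0 + b1*x + b2*(x-a)_+ + b3*(x-b)_+ + b4*(x-c)_+."""
    b0, b1, *rest = betas
    return b0 + b1 * x + sum(b * hinge(x - k) for b, k in zip(rest, knots))


# Hypothetical coefficients (beta_0 .. beta_4), chosen only for illustration.
betas = (0.0, 1.0, -2.0, 3.0, -1.0)

for x in (0, 1, 2, 3, 4, 5, 6):
    print(x, linear_spline(x, betas))
```

Each hinge term is zero to the left of its knot, so every coefficient only changes the slope from its knot onwards. That is how one curve can carry several local trends.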

Case Study
Xalan 2.6, Lucene 2.4: Post-Release Defects per Class, 20 CK Metrics
CHINA: Total Development Effort in Hours, 14 FP Metrics
NasaCoc: Development Length in Months, 24 COCOMO-II Metrics

Results: Goodness of Fit
Rank-Correlation (0 = worst fit, 1 = optimal fit)
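The slide does not say which rank correlation coefficient is used. Assuming Spearman's rho, a minimal Python sketch is the Pearson correlation of the ranks of actual and predicted values. The sample values and the helper names `average_ranks`, `pearson` and `spearman` are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(actual, predicted):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(actual), average_ranks(predicted))


# Hypothetical actual and predicted values, made up for illustration.
actual = [0, 2, 1, 5, 3]
predicted = [0.1, 1.8, 1.2, 4.0, 3.5]
print(spearman(actual, predicted))
```

Because only the ordering matters, a model scores well when it ranks entities in the right order, even if the predicted magnitudes are off.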

             Global   Local (Clustered)   MARS
Xalan 2.6    0.33     0.52                0.69
Lucene 2.4   0.32     0.60                0.83
CHINA        0.83     0.89                0.89
NasaCoc      0.93     0.97                0.99

[Figure: Number of Clusters (0 to 8) for Fold01 to Fold10, one series per Dataset: CHINA, Lucene 2.4, NasaCoc, Xalan 2.6]
Figure 3: Number of clusters generated by MCLUST in each run of the 10-fold cross validation.
Paper excerpt visible on the slide (both columns are cut off):
"... term for each additional prediction variable entering the regression model [23]. For practical purposes, we use a publicly available implementation of BIC-based model selection, contained in the R package: BMA. The input to the BMA implementation is the dataset itself, as well as a list of all dependent and independent variables that should be considered. In our case study, we always supply a list of all independent variables that were left after VIF analysis. The output of the BMA ..."
"... is too small to continue or until a maximum number of terms is reached. In our case study, the maximum number of terms is automatically determined by the implementation, and is based on the amount of independent variables we give as input. For MARS models, we use all independent variables in a dataset after VIF analysis. The first phase often builds a model that suffers from overfitting. As a result, the second phase, called the backward phase, prunes the model, to increase the model's gen..."

UP TO 2.5x BETTER FIT WHEN USING DATA LOCALITY!
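One way to read the "up to 2.5x" headline is as the largest ratio between a locality-aware model's rank correlation and the Global model's rank correlation on the same dataset. That reading is an assumption. The short Python sketch below computes those ratios from the values in the goodness-of-fit table on this slide; the variable names are hypothetical.

```python
# Rank correlations as shown in the goodness-of-fit table on this slide.
fit = {
    "Xalan 2.6": {"Global": 0.33, "Local": 0.52, "MARS": 0.69},
    "Lucene 2.4": {"Global": 0.32, "Local": 0.60, "MARS": 0.83},
    "CHINA": {"Global": 0.83, "Local": 0.89, "MARS": 0.89},
    "NasaCoc": {"Global": 0.93, "Local": 0.97, "MARS": 0.99},
}

# Ratio of each locality-aware model's fit to the Global model's fit.
improvement = {
    (dataset, model): scores[model] / scores["Global"]
    for dataset, scores in fit.items()
    for model in ("Local", "MARS")
}

best = max(improvement, key=improvement.get)
print(best, round(improvement[best], 2))
```

Under this reading the largest ratio comes from MARS on Lucene 2.4 (0.83 against 0.32), while the gains on CHINA and NasaCoc are much smaller because the Global fit there is already high.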

Results: Prediction Error
[Bar charts, one per dataset; legend: Global, Local, MARS; bar values listed as extracted]
Xalan 2.6 (axis 0 to 0.7): 0.4, 0.52, 0.64
Lucene 2.4 (axis 0 to 1.2): 0.94, 1.15, 1.15
CHINA (axis 0 to 800): 234.43, 552.85, 765
NasaCoc (axis 0 to 4): 1.63, 2.14, 3.26
Up to 4x lower prediction error with Local Models!

Model Interpretation?

Model Interpretation
[Figure 6 (a): Part of a global Model learned on the Xalan 2.6 dataset. One panel per metric: avg_cc, ca, cam, cbm, ce, dam, dit, ic, lcom, lcom3, loc, max_cc, mfa, moa, noc, npm]
Traditional Global Model: General Trends
One Curve per metric, run corp on that curve

Local (Clustered) Model: Many, many, many Trends!
[Figure: panels for ic, npm and mfa, shown for Fold 9, Cluster 1 and for Fold 9, Cluster 6]
Figure 7: Example of contradicting trends in local models (Xalan 2.6, Cluster 1 and Cluster 6 in Fold 9).
Paper excerpt visible on the slide (cut off at both ends):
"... model already partition the data into regions with individual properties. For example, we observe that an increase of ic (measuring the inheritance coupling through parent classes) is predicted to only have a negative effect on bug-proneness ..."
Cluster 1, Cluster 6, ...
Sometimes the trends even contradict each other

Regression Splines: Local Trends in a Single Curve
[Figure 6 (b): Part of a Global model with local considerations learned on the Xalan 2.6 dataset. Plot label: bug earth(formula=f,data=training1). One panel per metric: cam, cbm, ce, dam, dit, ic, lcom, lcom3, loc, mfa, moa, noc, npm]
Figure 6: Global models report general trends, while global models with local considerations give insights into different regions of the data. The Y-Axis describes the response (in this case bugs) while keeping all other prediction variables at their median values.
Paper excerpt visible on the slide (cut off at both ends):
"... prediction models leads to an improved fit of these models. Our findings thus confirm the results of Menzies et al., ..."
Combines the best of both worlds!

Using Locality in Data
to build better Statistical Models.

[Global model] vs [Clustered local models] = Two Extremes
Build Local Model, globally Optimized
• combines best of both worlds
• outperforms global and clustered local
• summarizes local trends in single curve
