Think Locally, Act Globally
Improving Defect and Effort Prediction Models
Nicolas Bettenburg • Meiyappan Nagappan • Ahmed E. Hassan
Queen’s University • Kingston, ON, Canada
SOFTWARE ANALYSIS
& INTELLIGENCE LAB
Saturday, 2 June, 12
Data Modelling in Empirical SE
Observations: measured from project data
Model: describes observations mathematically
Understanding, Prediction: guide decision making; guide process optimizations and future research
Model Building Today
1. Split the Whole Dataset into Training Data and Testing Data.
2. Learn a model M from the Training Data (Learned Model).
3. Use M to produce Predictions Y for the Testing Data.
4. Compare the Predictions with the Testing Data.
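The model-building steps above can be sketched in a few lines. This is a minimal illustration only: the dataset, the every-fifth-row split, the single-predictor least-squares model, and the mean absolute error used for the comparison are all assumptions made for the example, not choices stated on the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict(model, xs):
    """Apply a fitted (intercept, slope) model to new inputs."""
    intercept, slope = model
    return [intercept + slope * x for x in xs]


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical "whole dataset": (metric value, observed response) pairs.
whole_dataset = [(x, 2.0 * x + 1.0) for x in range(10)]

# Split: every fifth observation is held out for testing (an assumption).
testing_data = whole_dataset[::5]
training_data = [row for row in whole_dataset if row not in testing_data]

# Learn M on the training data, predict Y for the testing data, then compare.
model = fit_line([x for x, _ in training_data], [y for _, y in training_data])
predictions = predict(model, [x for x, _ in testing_data])
error = mean_absolute_error([y for _, y in testing_data], predictions)
```

Because the toy data are exactly linear, the held-out error is essentially zero; real project data would not behave this way.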
Much research effort goes into new metrics and new models!
Maybe we need to look more at the data part.
In the Field
Tom Zimmermann: "We ran 622 cross-project predictions and found that only 3.4% actually worked."
Tim Menzies: "Rather than focus on generalities, empirical SE should focus more on context-specific principles."
Taking local properties of data into consideration leads to better models!
Using Locality in Statistical Models
1. Does this principle work for statistical models?
2. Does it work for prediction?
3. Can we do better?
Building Local Models
1. Split the Whole Dataset into Training Data and Testing Data.
2. Cluster Data: cluster the Training Data.
3. Learn Multiple Models: one Learned Model per cluster (M1, M2, M3).
4. Predict Individually: each model produces its own Predictions Y.
5. Compare the Predictions with the Testing Data.
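A rough sketch of the local-model steps follows. It assumes a plain 1-D k-means as the clustering step and a one-predictor least-squares line per cluster; the deck's own material mentions MCLUST for clustering, so k-means is a stand-in here, and the data and the nearest-centroid rule for routing a test point are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def nearest(centroids, x):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))


def kmeans_1d(xs, centroids, iterations=10):
    """Plain 1-D k-means; a stand-in for the clustering step."""
    centroids = list(centroids)
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for x in xs:
            groups[nearest(centroids, x)].append(x)
        centroids = [sum(g) / len(g) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids


# Hypothetical training data with two regimes that follow opposite trends.
training = ([(x, float(x)) for x in range(5)]
            + [(x, 40.0 - 2.0 * x) for x in range(10, 15)])
xs = [x for x, _ in training]

# Cluster the training data (initial centroids: smallest and largest x).
centroids = kmeans_1d(xs, [min(xs), max(xs)])

# Learn one model per cluster.
models = []
for index in range(len(centroids)):
    members = [(x, y) for x, y in training if nearest(centroids, x) == index]
    models.append(fit_line([x for x, _ in members], [y for _, y in members]))


def predict_local(x):
    """Predict with the model of the cluster whose centroid is nearest to x."""
    intercept, slope = models[nearest(centroids, x)]
    return intercept + slope * x


# A single global line over the same data, for comparison.
global_intercept, global_slope = fit_line(xs, [y for _, y in training])
global_prediction = global_intercept + global_slope * 2.5
```

On this toy data the two local models recover each regime exactly, while the single global line misses the point at x = 2.5 by a wide margin.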
Global Statistical Model
[Figure: f(X) plotted against X for X from 0 to 6. Background plot is Figure 2.1: A linear spline function with knots at a = 1, b = 3, c = 5.]
Model fit leaves much room for improvement!
Local Statistical Model
[Figure: the same plot of f(X) against X, now fitted with two separate models, Model 1 and Model 2.]
Improved Fit!
How can we use this approach to get an even better fit?
Be Even More Local!
[Figure: the same plot of f(X) against X, fitted with more local models.]
Great Fit!
BUT: Risk of Overfitting the Data!!
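The overfitting risk can be illustrated with a deliberately extreme case: one "model" per observation, which is just a lookup table. The sketch below is an assumption-laden toy, not the deck's experiment. The data follow y = x with an artificial alternating noise of +1 and -1, and the fitted models are scored both against the noisy observations and against the noise-free trend.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


xs = list(range(8))
# Hypothetical observations: true trend y = x plus alternating noise.
ys = [x + (1 if x % 2 == 0 else -1) for x in xs]
truth = [float(x) for x in xs]

# Global model: one straight line for all observations.
intercept, slope = fit_line(xs, ys)
global_pred = [intercept + slope * x for x in xs]

# Extreme locality: one "model" per observation, i.e. a lookup table.
lookup = dict(zip(xs, ys))
local_pred = [lookup[x] for x in xs]

# Fit to the observed (noisy) data.
fit_error_global = mean_absolute_error(ys, global_pred)
fit_error_local = mean_absolute_error(ys, local_pred)      # perfect fit

# Distance from the underlying trend.
trend_error_global = mean_absolute_error(truth, global_pred)
trend_error_local = mean_absolute_error(truth, local_pred)  # memorised noise
```

The lookup table fits the observations perfectly yet reproduces every bit of noise, so it ends up further from the underlying trend than the single global line.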
Clustering independent of Fit
[Figure 2.1: A linear spline function with knots at a = 1, b = 3, c = 5; f(X) plotted against X from 0 to 6.]

$$C(Y \mid X) = f(X) = X\beta,$$
where
$$X\beta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4,$$
and
$$X_1 = X, \quad X_2 = (X - a)_+, \quad X_3 = (X - b)_+, \quad X_4 = (X - c)_+.$$

Optimize local fit with respect to minimizing global overfit
Multivariate Adaptive Regression Splines (MARS)
create local knowledge that optimizes process globally
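The linear spline above can be evaluated directly once the positive-part terms are written as hinge functions, where (u)+ equals u when u is positive and 0 otherwise. The sketch below uses the knots from Figure 2.1 (a = 1, b = 3, c = 5); the coefficient values are hypothetical and chosen only so that the slope visibly changes at each knot.

```python
def hinge(u):
    """Positive part (u)+ : u when u > 0, otherwise 0."""
    return u if u > 0 else 0.0


def linear_spline(x, betas, knots):
    """Evaluate b0 + b1*x + sum_k b_k * (x - knot_k)+ ."""
    b0, b1, *rest = betas
    return b0 + b1 * x + sum(b * hinge(x - k) for b, k in zip(rest, knots))


KNOTS = (1.0, 3.0, 5.0)               # a, b, c from Figure 2.1
BETAS = (0.0, 1.0, -2.0, 3.0, -4.0)   # hypothetical beta_0 .. beta_4

# One curve, four local slopes: 1 before a, then -1, then 2, then -2.
values = [linear_spline(float(x), BETAS, KNOTS) for x in range(7)]
```

Each coefficient after the second changes the slope only to the right of its knot, which is how a single global curve can carry several local trends.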
Case Study

Dataset                 Response                            Metrics
Xalan 2.6, Lucene 2.4   Post-release defects per class      20 CK metrics
CHINA                   Total development effort in hours   14 FP metrics
NasaCoc                 Development length in months        24 COCOMO-II metrics
Results: Goodness of Fit
Rank correlation (0 = worst fit, 1 = optimal fit)

Dataset      Global   Local (Clustered)   MARS
Xalan 2.6    0.33     0.52                0.69
Lucene 2.4   0.32     0.60                0.83
CHINA        0.83     0.89                0.89
NasaCoc      0.93     0.97                0.99

Up to 2.5x better fit when using data locality!

[Figure 3: Number of clusters generated by MCLUST in each run of the 10-fold cross validation. Y-axis: number of clusters (0 to 8); X-axis: Fold01 to Fold10; one series per dataset (CHINA, Lucene 2.4, NasaCoc, Xalan 2.6).]

Excerpts from the paper shown on the slide:
"... term for each additional prediction variable entering the regression model [23]. For practical purposes, we use a publicly available implementation of BIC-based model selection, contained in the R package: BMA. The input to the BMA implementation is the dataset itself, as well as a list of all dependent and independent variables that should be considered. In our case study, we always supply a list of all independent variables that were left after VIF analysis. The output of the BMA ..."
"... is too small to continue or until a maximum number of terms is reached. In our case study, the maximum number of terms is automatically determined by the implementation, and is based on the amount of independent variables we give as input. For MARS models, we use all independent variables in a dataset after VIF analysis. The first phase often builds a model that suffers from overfitting. As a result, the second phase, called the backward phase, prunes the model, to increase the model's gen..."
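For readers who want to see how a rank correlation between actual and predicted values can be computed, here is a small sketch. It assumes Spearman's coefficient (Pearson correlation of average ranks); the slides say only "rank correlation", so the exact variant is an assumption, and the four actual/predicted values are made up. The second part recomputes the MARS-to-global ratio from the table above.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(actual, predicted):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(actual), average_ranks(predicted))


# Hypothetical actual and predicted defect counts for four classes.
rho = spearman([0, 1, 2, 5], [0.1, 0.4, 0.3, 2.0])

# Goodness-of-fit values from the table: (Global, Local, MARS).
fit = {
    "Xalan 2.6": (0.33, 0.52, 0.69),
    "Lucene 2.4": (0.32, 0.60, 0.83),
    "CHINA": (0.83, 0.89, 0.89),
    "NasaCoc": (0.93, 0.97, 0.99),
}
ratios = {name: mars / global_fit
          for name, (global_fit, _local, mars) in fit.items()}
best_dataset = max(ratios, key=ratios.get)
best_ratio = ratios[best_dataset]
```

The largest ratio in the table is for Lucene 2.4 (0.83 / 0.32, roughly 2.6), which matches the "up to 2.5x" headline.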
Results: Prediction Error
[Figure: bar charts of prediction error for the Global, Local, and MARS models on each dataset. Bar values: Xalan 2.6: 0.4, 0.52, 0.64; Lucene 2.4: 0.94, 1.15, 1.15; CHINA: 234.43, 552.85, 765; NasaCoc: 1.63, 2.14, 3.26.]
Up to 4x lower prediction error with Local Models!
Model Interpretation?
Model Interpretation
[Figure 6(a): Part of a global model learned on the Xalan 2.6 dataset. One panel per metric: 1 avg_cc, 2 ca, 3 cam, 4 cbm, 5 ce, 6 dam, 7 dit, 8 ic, 9 lcom, 10 lcom3, 11 loc, 12 max_cc, 13 mfa, 14 moa, 15 noc, 16 npm. Figure 6 caption: Global models report general trends, while global models with local considerations give insights into different regions of the data. The Y-axis describes the response (in this case bugs) while keeping all other prediction variables at their median values.]
Traditional Global Model: General Trends
One Curve per metric, run corp on that curve
Model Interpretation
Local (Clustered) Model: Many, many, many Trends!
[Figure 7: Example of contradicting trends in local models (Xalan 2.6, Cluster 1 and Cluster 6 in Fold 9). Panels: ic, mfa, npm for Fold 9, Cluster 1 and for Fold 9, Cluster 6.]
Excerpt from the paper shown on the slide: "... model already partition the data into regions with individual properties. For example, we observe that an increase of ic (measuring the inheritance coupling through parent classes) is predicted to only have a negative effect on bug-proneness ..."
Sometimes the trends even contradict each other.
Model Interpretation
Regression Splines: Local Trends in a Single Curve
[Figure 6(b): Part of a global model with local considerations learned on the Xalan 2.6 dataset. Plot title: bug earth(formula=f,data=training1). One panel per metric: 1 cam, 2 cbm, 3 ce, 4 dam, 5 dit, 6 ic, 7 lcom, 8 lcom3, 9 loc, 10 mfa, 11 moa, 12 noc, 13 npm. Figure 6 caption: Global models report general trends, while global models with local considerations give insights into different regions of the data. The Y-axis describes the response (in this case bugs) while keeping all other prediction variables at their median values.]
Excerpt from the paper shown on the slide: "... prediction models leads to an improved fit of these models. Our findings thus confirm the results of Menzies et al., ..."
Combines the best of both worlds!
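The per-metric curves in Figure 6 are described as showing the response while all other prediction variables are held at their median values. A minimal sketch of that construction follows. The metric names ic and npm come from the figure, but the rows, the toy model, and its coefficients are hypothetical and are not the fitted Xalan 2.6 model.

```python
from statistics import median


def partial_dependence(model, rows, feature, grid):
    """Vary one feature over grid; hold every other feature at its median."""
    baseline = {name: median(row[name] for row in rows) for name in rows[0]}
    curve = []
    for value in grid:
        point = dict(baseline)
        point[feature] = value
        curve.append((value, model(point)))
    return curve


def toy_model(row):
    """Hypothetical spline-style model: flat in ic up to 2, rising after."""
    return 0.5 + 0.1 * row["npm"] + 0.3 * max(0.0, row["ic"] - 2.0)


# Hypothetical per-class metric values.
rows = [
    {"ic": 0, "npm": 10},
    {"ic": 1, "npm": 20},
    {"ic": 4, "npm": 30},
]

# Curve for ic, with npm fixed at its median of 20.
ic_curve = partial_dependence(toy_model, rows, "ic", [0, 2, 4])
```

The resulting curve is flat up to ic = 2 and rises afterwards, which is the kind of local trend a single spline curve can expose.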
Using Locality in Data to build better Statistical Models.
Global vs. Local (Clustered) = Two Extremes
Build Local Model, globally Optimized
• combines best of both worlds
• outperforms global and clustered local
• summarizes local trends in single curve

More Related Content

What's hot

Paper id 71201914
Paper id 71201914Paper id 71201914
Paper id 71201914IJRAT
 
Intermediate Algebra 7th Edition Tobey Solutions Manual
Intermediate Algebra 7th Edition Tobey Solutions ManualIntermediate Algebra 7th Edition Tobey Solutions Manual
Intermediate Algebra 7th Edition Tobey Solutions Manualryqakul
 
FURTHER RESULTS ON ODD HARMONIOUS GRAPHS
FURTHER RESULTS ON ODD HARMONIOUS GRAPHSFURTHER RESULTS ON ODD HARMONIOUS GRAPHS
FURTHER RESULTS ON ODD HARMONIOUS GRAPHSgraphhoc
 
bayesImageS: an R package for Bayesian image analysis
bayesImageS: an R package for Bayesian image analysisbayesImageS: an R package for Bayesian image analysis
bayesImageS: an R package for Bayesian image analysisMatt Moores
 
Potencias resueltas 1eso (1)
Potencias resueltas 1eso (1)Potencias resueltas 1eso (1)
Potencias resueltas 1eso (1)Lina Manriquez
 
Sol mat haeussler_by_priale
Sol mat haeussler_by_prialeSol mat haeussler_by_priale
Sol mat haeussler_by_prialeJeff Chasi
 
Radix-3 Algorithm for Realization of Type-II Discrete Sine Transform
Radix-3 Algorithm for Realization of Type-II Discrete Sine TransformRadix-3 Algorithm for Realization of Type-II Discrete Sine Transform
Radix-3 Algorithm for Realization of Type-II Discrete Sine TransformIJERA Editor
 
THREE-ASSOCIATE CLASS PARTIALLY BALANCED INCOMPLETE BLOCK DESIGNS IN TWO REP...
THREE-ASSOCIATE CLASS PARTIALLY BALANCED  INCOMPLETE BLOCK DESIGNS IN TWO REP...THREE-ASSOCIATE CLASS PARTIALLY BALANCED  INCOMPLETE BLOCK DESIGNS IN TWO REP...
THREE-ASSOCIATE CLASS PARTIALLY BALANCED INCOMPLETE BLOCK DESIGNS IN TWO REP...Sumeet Saurav
 
Chapter6 anova-bibd (1)
Chapter6 anova-bibd (1)Chapter6 anova-bibd (1)
Chapter6 anova-bibd (1)sabbir11
 
Histogram Equalization
Histogram EqualizationHistogram Equalization
Histogram EqualizationEr. Nancy
 
solucionario de purcell 1
solucionario de purcell 1solucionario de purcell 1
solucionario de purcell 1José Encalada
 

Msr2012 bettenburg presentation

  • 1. Think Locally, Act Globally: Improving Defect and Effort Prediction Models. Nicolas Bettenburg • Meiyappan Nagappan • Ahmed E. Hassan. Queen’s University • Kingston, ON, Canada. Software Analysis & Intelligence Lab. Saturday, 2 June, 12
  • 4. Data Modelling in Empirical SE. Observations: measured from project data. Model: describe observations mathematically. Understanding and Prediction: guide decision making; guide process optimizations and future research.
  • 9. Model Building Today. Whole dataset, split into training data and testing data. Learned model M (from the training data). Predictions Y (for the testing data). Compare.
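The pipeline on this slide can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' code: the toy data, the 20/10 split, the single-predictor least-squares model, and the helper names `split`, `fit_line`, `predict`, and `mean_abs_error` are all hypothetical and chosen only for illustration.

```python
import random

def split(rows, n_train, seed=0):
    """Shuffle the whole dataset and split it into (training, testing)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return rows[:n_train], rows[n_train:]

def fit_line(rows):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope

def predict(model, x):
    intercept, slope = model
    return intercept + slope * x

def mean_abs_error(model, rows):
    """Compare predictions with the observed outcomes of the testing data."""
    return sum(abs(predict(model, x) - y) for x, y in rows) / len(rows)

# Hypothetical (metric value, outcome) pairs standing in for project data.
data = [(float(x), 2.0 * x + 1.0) for x in range(30)]
training, testing = split(data, n_train=20)
model = fit_line(training)              # learned model M
error = mean_abs_error(model, testing)  # compare predictions Y with actuals
```

The single model here is what the later slides call a global model: one fit over all of the training data.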
  • 10. Much research effort on new metrics and new models!
  • 11. Maybe we need to look more at the data part.
  • 17. In the Field. Tom Zimmermann: "We ran 622 cross-project predictions and found that only 3.4% actually worked." Tim Menzies: "Rather than focus on generalities, empirical SE should focus more on context-specific principles." Taking local properties of data into consideration leads to better models!
  • 21. Using Locality in Statistical Models. (1) Does this principle work for statistical models? (2) Does it work for prediction? (3) Can we do better?
  • 26. Building Local Models. Whole dataset, split into training data and testing data. Cluster data. Learn multiple models (learned models M1, M2, M3). Predict individually (predictions Y, one set per model). Compare.
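The cluster, learn, predict steps on this slide can be sketched as follows. This is a sketch under loud assumptions: a plain one-dimensional k-means stands in for the clustering step (the paper excerpt further down names MCLUST, which is model-based clustering and not what is implemented here), each local model is a one-predictor least-squares line, and the data and all function names are hypothetical.

```python
def fit_line(rows):
    """Ordinary least squares for y = intercept + slope * x."""
    if not rows:
        raise ValueError("cannot fit a model to an empty cluster")
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope

def nearest(centres, value):
    """Index of the cluster centre closest to value."""
    return min(range(len(centres)), key=lambda i: abs(centres[i] - value))

def kmeans_1d(values, k, iterations=20):
    """Plain k-means on scalars, started from evenly spaced quantiles."""
    ordered = sorted(values)
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[nearest(centres, v)].append(v)
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres

def fit_local_models(rows, k):
    """Cluster the training data, then learn one model per cluster."""
    centres = kmeans_1d([x for x, _ in rows], k)
    groups = [[] for _ in range(k)]
    for row in rows:
        groups[nearest(centres, row[0])].append(row)
    return centres, [fit_line(g) for g in groups]

def predict_local(centres, models, x):
    """Predict with the model of the cluster that x falls into."""
    intercept, slope = models[nearest(centres, x)]
    return intercept + slope * x

def mean_abs_error(predict_fn, rows):
    return sum(abs(predict_fn(x) - y) for x, y in rows) / len(rows)

# Hypothetical data with two regimes that a single line cannot capture.
training = [(float(x), float(x)) for x in range(10)]
training += [(float(x), 100.0 - 2.0 * x) for x in range(20, 30)]
testing = [(4.5, 4.5), (25.5, 49.0)]

centres, models = fit_local_models(training, k=2)
local_error = mean_abs_error(lambda x: predict_local(centres, models, x), testing)

g_intercept, g_slope = fit_line(training)  # one global model for comparison
global_error = mean_abs_error(lambda x: g_intercept + g_slope * x, testing)
```

On this toy data the two local models recover each regime, while the global line averages across both, so `local_error` is smaller than `global_error`.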
  • 30. Global Statistical Model. [Figure: f(X) plotted against X for X from 0 to 6; source caption "Figure 2.1: A linear spline function with knots at a = 1, b = 3, c = 5."] Model fit leaves much room for improvement!
  • 34. Local Statistical Model. [Figure: f(X) plotted against X for X from 0 to 6; source caption "Figure 2.1: A linear spline function with knots at a = 1, b = 3, c = 5."] Model 1, Model 2. Improved fit!
  • 35. How can we use this approach to get an even better fit?
  • 40. Be Even More Local! [Figure: f(X) plotted against X for X from 0 to 6; source caption "Figure 2.1: A linear spline function with knots at a = 1, b = 3, c = 5."] Great fit! BUT: risk of overfitting the data!!
  • 42. Clustering independent of fit.
  • 48. [Figure 2.1: A linear spline function with knots at a = 1, b = 3, c = 5; f(X) plotted against X for X from 0 to 6.] $C(Y \mid X) = f(X) = X\beta$, where $X\beta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$ and $X_1 = X$, $X_2 = (X - a)_+$, $X_3 = (X - b)_+$, $X_4 = (X - c)_+$. Optimize local fit wrt. minimizing global overfit. Multivariate Adaptive Regression Splines (MARS): create local knowledge that optimizes process globally.
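The linear spline on this slide is a sum of hinge terms of the form (X - knot)+, which is also the kind of basis function MARS builds on. The sketch below only evaluates such a spline for fixed knots; it does not perform the MARS search that chooses knots and terms. The knots a = 1, b = 3, c = 5 come from the figure caption, while the coefficient values and the function names are hypothetical.

```python
def hinge(x, knot):
    """Positive part (x - knot)_+ : zero left of the knot, linear right of it."""
    return max(0.0, x - knot)

def linear_spline(x, betas, knots):
    """Evaluate beta_0 + beta_1 * x + sum_j beta_(j+1) * (x - knot_j)_+ ."""
    value = betas[0] + betas[1] * x
    for beta, knot in zip(betas[2:], knots):
        value += beta * hinge(x, knot)
    return value

knots = (1.0, 3.0, 5.0)               # a, b, c from the figure caption
betas = (0.0, 1.0, -2.0, 3.0, -1.5)   # hypothetical beta_0 .. beta_4

# The slope changes by beta_(j+1) at each knot: 1, -1, 2, 0.5 on the four segments.
values = [linear_spline(x, betas, knots) for x in (0.0, 1.0, 3.0, 5.0, 6.0)]
```

Each hinge term is inactive to the left of its knot, so every coefficient describes a local change in slope while the function as a whole stays continuous.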
  • 52. Case Study. Xalan 2.6 and Lucene 2.4: post-release defects per class, 20 CK metrics. CHINA: total development effort in hours, 14 FP metrics. NasaCoc: development length in months, 24 COCOMO-II metrics.
  • 58. Results: Goodness of Fit (rank-correlation table as on slide 59, shown with an excerpt from the paper)
      [Figure 3: Number of clusters generated by MCLUST in each run of the 10-fold cross validation. Y-axis: number of clusters (0 to 8); x-axis: Fold01 to Fold10; datasets: CHINA, Lucene 2.4, NasaCoc, Xalan 2.6.]
      Excerpt: "... term for each additional prediction variable entering the regression model [23]. For practical purposes, we use a publicly available implementation of BIC-based model selection, contained in the R package: BMA. The input to the BMA implementation is the dataset itself, as well as a list of all dependent and independent variables that should be considered. In our case study, we always supply a list of all independent variables that were left after VIF analysis. The output of the BMA is ..."
      Excerpt: "... too small to continue or until a maximum number of terms is reached. In our case study, the maximum number of terms is automatically determined by the implementation, and is based on the amount of independent variables we give as input. For MARS models, we use all independent variables in a dataset after VIF analysis. The first phase often builds a model that suffers from overfitting. As a result, the second phase, called the backward phase, prunes the model, to increase the model's gen..."
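The 10-fold cross validation mentioned in the Figure 3 caption can be sketched as follows. This is a minimal illustration of the splitting step only, under the assumption that rows are shuffled once and dealt into ten folds; the function name `k_fold_indices` and the seed are hypothetical, and the clustering and model fitting done inside each fold in the study are not reproduced here.

```python
import random


def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train, test) index lists; every row is tested exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled rows into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [row for f, fold in enumerate(folds) if f != held_out for row in fold]
        yield train, test


if __name__ == "__main__":
    for fold_no, (train, test) in enumerate(k_fold_indices(25), start=1):
        # In the study, clusters and models would be learned on `train` only.
        print(f"Fold{fold_no:02d}: {len(train)} training rows, {len(test)} testing rows")
```

Learning clusters separately on each training split is consistent with the figure showing a different number of clusters per fold.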
  • 59. Results: Goodness of Fit
      Rank-Correlation (0 = worst fit, 1 = optimal fit)
      Dataset      Global   Local (Clustered)   MARS
      Xalan 2.6    0.33     0.52                0.69
      Lucene 2.4   0.32     0.60                0.83
      CHINA        0.83     0.89                0.89
      NasaCOC      0.93     0.97                0.99
      UP TO 2.5x BETTER FIT WHEN USING DATA LOCALITY!
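A rank correlation between actual and predicted values can be computed as sketched below. The slide does not say which rank correlation was used; the sketch assumes Spearman's coefficient (Pearson correlation of the ranks, with tied values sharing their average rank). The `FIT` table reads the slide's twelve numbers row by row as (Global, Local, MARS) per dataset, which is also an assumption about the flattened layout.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(actual, predicted):
    """Spearman rank correlation; undefined (division by zero) for constant input."""
    ra, rp = average_ranks(actual), average_ranks(predicted)
    n = len(ra)
    mean_a, mean_p = sum(ra) / n, sum(rp) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(ra, rp))
    var_a = sum((a - mean_a) ** 2 for a in ra)
    var_p = sum((p - mean_p) ** 2 for p in rp)
    return cov / (var_a * var_p) ** 0.5


# Values from the slide, assumed to be ordered (Global, Local (Clustered), MARS).
FIT = {
    "Xalan 2.6": (0.33, 0.52, 0.69),
    "Lucene 2.4": (0.32, 0.60, 0.83),
    "CHINA": (0.83, 0.89, 0.89),
    "NasaCOC": (0.93, 0.97, 0.99),
}


def mars_gain(dataset):
    """Ratio of the MARS fit to the global fit for one dataset."""
    global_fit, _local_fit, mars_fit = FIT[dataset]
    return mars_fit / global_fit


best = max(FIT, key=mars_gain)

if __name__ == "__main__":
    print(best, round(mars_gain(best), 2))
```

Under that reading of the table, the largest gain is on Lucene 2.4 (0.83 / 0.32, roughly 2.6), which is in line with the "up to 2.5x" headline.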
  • 65. Model Interpretation
      Traditional Global Model: General Trends
      One Curve per metric, run corp on that curve
      [Figure 6(a): Part of a global model learned on the Xalan 2.6 dataset. Panels: avg_cc, ca, cam, cbm, ce, dam, dit, ic, lcom, lcom3, loc, max_cc, mfa, moa, noc, npm.]
      Figure 6: Global models report general trends, while global models with local considerations give insights into different regions of the data. The Y-Axis describes the response (in this case bugs) while keeping all other prediction variables at their median values.
  • 68. Model Interpretation
      Local (Clustered) Model: Many, many, many Trends!
      Sometimes even contradict
      [Figure 7: Example of contradicting trends in local models (Xalan 2.6, Cluster 1 and Cluster 6 in Fold 9). Panels: ic, npm, mfa for each of the two clusters.]
      Excerpt: "... model already partition the data into regions with individual properties. For example, we observe that an increase of ic (measuring the inheritance coupling through parent classes) is predicted to only have a negative effect on bug-proneness ..."
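How two local models can contradict each other, while a single global line hides both trends, can be shown with a toy calculation. The data below is synthetic and hypothetical; the names `CLUSTER_1` and `CLUSTER_6` only echo the slide's labels and do not contain values from the Xalan 2.6 dataset.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic clusters: the metric raises the response in one and lowers it in the other.
CLUSTER_1 = ([0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 2.0, 3.0, 4.0])
CLUSTER_6 = ([0.0, 1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0, 0.0])


def pooled_slope(*clusters):
    """Slope of one global line fitted to all clusters together."""
    xs = [x for cluster_xs, _ in clusters for x in cluster_xs]
    ys = [y for _, cluster_ys in clusters for y in cluster_ys]
    return ols_slope(xs, ys)


if __name__ == "__main__":
    print("cluster 1 slope:", ols_slope(*CLUSTER_1))
    print("cluster 6 slope:", ols_slope(*CLUSTER_6))
    print("global slope:", pooled_slope(CLUSTER_1, CLUSTER_6))
```

In this constructed case the two local slopes have opposite signs and the pooled slope cancels out, which is the kind of situation the slide summarizes as trends that "sometimes even contradict".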
  • 71. Model Interpretation
      Regression Splines: Local Trends in a Single Curve
      Combines the best of both worlds!
      [Figure 6(b): Part of a global model with local considerations learned on the Xalan 2.6 dataset. Panels: cam, cbm, ce, dam, dit, ic, lcom, lcom3, loc, mfa, moa, noc, npm. Plot label: bug, earth(formula=f,data=training1).]
      Excerpt: "... prediction models leads to an improved fit of these models. Our findings thus confirm the results of Menzies et al., ..."
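The idea of one curve carrying several local trends can be sketched with a heavily simplified knot search. This is not the `earth` implementation named on the slide: it assumes a single predictor, a single knot, a basis of intercept, x and (x - knot)+, and no backward pruning. The data `XS`, `YS` is synthetic, and all function names are hypothetical.

```python
def solve(a, b):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_with_knot(xs, ys, knot):
    """Least-squares fit of y = b0 + b1*x + b2*(x - knot)+; returns (betas, RSS)."""
    rows = [[1.0, x, max(0.0, x - knot)] for x in xs]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    betas = solve(xtx, xty)
    rss = sum((y - sum(b * v for b, v in zip(betas, r))) ** 2 for r, y in zip(rows, ys))
    return betas, rss


def best_knot(xs, ys, candidates):
    """Pick the candidate knot whose fit has the smallest residual sum of squares."""
    return min(candidates, key=lambda k: fit_with_knot(xs, ys, k)[1])


# Synthetic data: rising with slope 1 up to x = 5, then falling with slope -2.
XS = [float(i) for i in range(11)]
YS = [x - 3.0 * max(0.0, x - 5.0) for x in XS]

if __name__ == "__main__":
    knot = best_knot(XS, YS, XS[1:-1])  # interior points only
    betas, rss = fit_with_knot(XS, YS, knot)
    print("knot:", knot)
    print("slope left of knot:", betas[1])
    print("slope right of knot:", betas[1] + betas[2])
```

The fitted curve is still one global model, yet the slope on each side of the knot can be read off as a separate local trend.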
  • 78. Using Locality in Data to build better Statistical Models.
      Global vs. Local (Clustered) = Two Extremes
      Build Local Model, globally Optimized
        • combines best of both worlds
        • outperforms global and clustered local
        • summarizes local trends in single curve