Msr2012 bettenburg presentation
1. Think Locally, Act Globally
Improving Defect and Effort Prediction Models
Nicolas Bettenburg • Meiyappan Nagappan • Ahmed E. Hassan
Queen’s University • Kingston, ON, Canada
Software Analysis & Intelligence Lab
Saturday, 2 June, 12
4. Data Modelling in Empirical SE
Observations: measured from project data
Model: describe observations mathematically
Understanding, Prediction: guide decision making; guide process optimizations and future research
17. In the Field
Tom Zimmermann: "We ran 622 cross-project predictions and found that only 3.4% actually worked."
Tim Menzies: "Rather than focus on generalities, empirical SE should focus more on context-specific principles."
Taking local properties of data into consideration leads to better models!
21. Using Locality in Statistical Models
(1) Does this principle work for statistical models?
(2) Does it work for prediction?
(3) Can we do better?
26. Building Local Models
Whole Dataset: Training Data and Testing Data
Cluster Data: group the Training Data
Learn Multiple Models: Learned Models M1, M2, M3
Predict Individually: Predictions Y, Y, Y
Compare: Predictions against the Testing Data
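The cluster, learn, predict pipeline on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's implementation: a tiny 1-D k-means stands in for MCLUST, each local model is a simple least-squares line, and the data and all function names (`fit_line`, `cluster_1d`, `learn_local_models`, `predict_local`) are hypothetical.

```python
from statistics import mean

def fit_line(points):
    """Least-squares fit of y = intercept + slope * x to (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:  # all x equal: fall back to a constant model
        return y_bar, 0.0
    slope = sum((x - x_bar) * (y - y_bar) for x, y in points) / sxx
    return y_bar - slope * x_bar, slope

def nearest(centroids, x):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))

def cluster_1d(points, k, rounds=20):
    """Tiny 1-D k-means on the predictor (an assumed stand-in for MCLUST)."""
    xs = sorted(x for x, _ in points)
    # Deterministic start: spread the initial centroids over the sorted values.
    centroids = [xs[(2 * i + 1) * len(xs) // (2 * k)] for i in range(k)]
    for _ in range(rounds):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[nearest(centroids, x)].append(x)
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

def learn_local_models(training, k):
    """Cluster the training data, then learn one model per cluster."""
    centroids = cluster_1d(training, k)
    fallback = fit_line(training)  # used only if a cluster ends up empty
    models = []
    for i in range(k):
        members = [p for p in training if nearest(centroids, p[0]) == i]
        models.append(fit_line(members) if members else fallback)
    return centroids, models

def predict_local(centroids, models, x):
    """Predict with the model of the cluster whose centroid is closest to x."""
    intercept, slope = models[nearest(centroids, x)]
    return intercept + slope * x

# Hypothetical data: two regions with opposite trends.
training = [(0, 0), (1, 1), (2, 2), (10, 10), (11, 8), (12, 6)]
testing = [(1.5, 1.5), (11.5, 7.0)]

centroids, models = learn_local_models(training, k=2)
g_intercept, g_slope = fit_line(training)
for x, actual in testing:
    local = predict_local(centroids, models, x)
    global_ = g_intercept + g_slope * x
    print(f"x={x}: actual={actual}, local={local:.2f}, global={global_:.2f}")
```

On this toy data the single global line averages the two opposing trends, while each local model follows the trend of its own cluster.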
30. Global Statistical Model
[Figure 2.1: A linear spline function with knots at a = 1, b = 3, c = 5. Axes: X (0 to 6) and f(X).]
Model fit leaves much room for improvement!
34. Local Statistical Model
[Figure 2.1: A linear spline function with knots at a = 1, b = 3, c = 5. Axes: X (0 to 6) and f(X). Overlaid fits: Model 1 and Model 2.]
Improved Fit!
35. How can we use this approach to get an even better fit?
40. Be Even More Local!
[Figure 2.1: A linear spline function with knots at a = 1, b = 3, c = 5. Axes: X (0 to 6) and f(X).]
Great Fit!
BUT: Risk of Overfitting the Data!!
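The overfitting risk can be made concrete with a small experiment. The sketch below is an illustration under assumptions, not part of the talk: it takes "even more local" to its extreme with a 1-nearest-neighbour model (one model per training point) and compares it to a single global line on hypothetical data whose noise alternates between +1 and -1.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

def one_nn(train_x, train_y, x):
    """Most local model possible: copy the response of the closest training point."""
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

def mae(predicted, actual):
    """Mean absolute error."""
    return mean(abs(p - a) for p, a in zip(predicted, actual))

# Hypothetical data: true signal y = 2x, training noise alternates +1 / -1.
train_x = list(range(10))
train_y = [2 * x + (1 if x % 2 == 0 else -1) for x in train_x]
test_x = [x + 0.25 for x in train_x]   # unseen points between the training points
test_y = [2 * x for x in test_x]       # noise-free signal

intercept, slope = fit_line(train_x, train_y)
global_train = mae([intercept + slope * x for x in train_x], train_y)
global_test = mae([intercept + slope * x for x in test_x], test_y)
local_train = mae([one_nn(train_x, train_y, x) for x in train_x], train_y)
local_test = mae([one_nn(train_x, train_y, x) for x in test_x], test_y)

print(f"global line: train MAE={global_train:.2f}, test MAE={global_test:.2f}")
print(f"1-NN (local): train MAE={local_train:.2f}, test MAE={local_test:.2f}")
```

The fully local model reproduces the training data exactly, noise included, and pays for it on unseen points; the global line fits the training data worse but generalizes better here.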
48. Optimize Local Fit wrt. Minimizing Global Overfit
[Figure 2.1: A linear spline function with knots at a = 1, b = 3, c = 5. Axes: X (0 to 6) and f(X).]
$$C(Y \mid X) = f(X) = X\beta,$$
where
$$X\beta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4,$$
and
$$X_1 = X, \quad X_2 = (X - a)_+, \quad X_3 = (X - b)_+, \quad X_4 = (X - c)_+.$$
Multivariate Adaptive Regression Splines (MARS)
create local knowledge that optimizes process globally
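The linear spline on this slide is a sum of hinge terms, where (u)+ equals u when u is positive and 0 otherwise. The sketch below evaluates such a spline with the knots from the figure (a = 1, b = 3, c = 5). The coefficient values are hypothetical, chosen only for illustration, and the sketch does not perform the knot search that MARS does.

```python
def hinge(u):
    """Positive part (u)+ : u if u > 0, else 0."""
    return u if u > 0 else 0.0

def linear_spline(x, betas, knots=(1.0, 3.0, 5.0)):
    """Evaluate b0 + b1*x + b2*(x-a)+ + b3*(x-b)+ + b4*(x-c)+."""
    b0, b1, *rest = betas
    return b0 + b1 * x + sum(b * hinge(x - k) for b, k in zip(rest, knots))

# Hypothetical coefficients: the slope is 1 before a, -1 on [a, b),
# 2 on [b, c), and 1 after c, because each hinge term adds its
# coefficient to the slope once x passes its knot.
betas = (0.0, 1.0, -2.0, 3.0, -1.0)

for x in (0.0, 1.0, 3.0, 5.0, 6.0):
    print(f"f({x}) = {linear_spline(x, betas)}")
```

Each coefficient after the second changes the slope only to the right of its knot, which is how one continuous curve can carry several local trends.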
52. Case Study
Xalan 2.6, Lucene 2.4: Post-Release Defects per Class; 20 CK Metrics
CHINA: Total Development Effort in Hours; 14 FP Metrics
NasaCoc: Development Length in Months; 24 COCOMO-II Metrics
58. Results: Goodness of Fit
Rank-Correlation (0 = worst fit, 1 = optimal fit)

Dataset      Global   Local (Clustered)   MARS
Xalan 2.6    0.33     0.52                0.69
Lucene 2.4   0.32     0.60                0.83
CHINA        0.83     0.89                0.89
NasaCOC      0.93     0.97                0.99

[Figure 3: Number of clusters generated by MCLUST in each run of the 10-fold cross validation. Y-axis: Number of Clusters (0 to 8). X-axis: Fold01 to Fold10. Datasets: CHINA, Lucene 2.4, NasaCoc, Xalan 2.6.]

[Paper excerpt, left column] ... term for each additional prediction variable entering the regression model [23]. For practical purposes, we use a publicly available implementation of BIC-based model selection, contained in the R package BMA. The input to the BMA implementation is the dataset itself, as well as a list of all dependent and independent variables that should be considered. In our case study, we always supply a list of all independent variables that were left after VIF analysis. The output of the BMA ...

[Paper excerpt, right column] ... is too small to continue or until a maximum number of terms is reached. In our case study, the maximum number of terms is automatically determined by the implementation, and is based on the number of independent variables we give as input. For MARS models, we use all independent variables in a dataset after VIF analysis. The first phase often builds a model that suffers from overfitting. As a result, the second phase, called the backward phase, prunes the model to increase the model’s gen-
59. Results: Goodness of Fit
UP TO 2.5x BETTER FIT WHEN USING DATA LOCALITY!
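Two pieces of arithmetic sit behind this result: the rank correlation itself and the size of the improvement. The sketch below assumes the rank correlation is Spearman's (the slides say only "Rank-Correlation") and implements it as a Pearson correlation on average ranks; the fit values are the ones reported in the results table (Global, Local (Clustered), MARS). Function names are hypothetical.

```python
from statistics import mean

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5

# Rank correlations from the results table: (Global, Local (Clustered), MARS).
fit = {
    "Xalan 2.6": (0.33, 0.52, 0.69),
    "Lucene 2.4": (0.32, 0.60, 0.83),
    "CHINA": (0.83, 0.89, 0.89),
    "NasaCOC": (0.93, 0.97, 0.99),
}

# Improvement of MARS over the global model, per dataset.
gain = {name: mars / glob for name, (glob, _local, mars) in fit.items()}
for name, ratio in gain.items():
    print(f"{name}: MARS / Global = {ratio:.2f}")
```

The largest ratio comes from Lucene 2.4 (0.83 / 0.32, about 2.6), which is in line with the "up to 2.5x" headline. Note that Spearman's coefficient can be negative in general; the 0-to-1 scale on the slide covers the range observed for these models.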
65. Model Interpretation
[Figure 6(a): Part of a global model learned on the Xalan 2.6 dataset. One panel per metric: avg_cc, ca, cam, cbm, ce, dam, dit, ic, lcom, lcom3, loc, max_cc, mfa, moa, noc, npm.]
[Figure 6 caption: Global models report general trends, while global models with local considerations give insights into different regions of the data. The Y-axis describes the response (in this case bugs) while keeping all other prediction variables at their median values.]
Traditional Global Model: General Trends
One Curve per metric, run corp on that curve
68. Model Interpretation
Local (Clustered) Model: Many, many, many Trends!
[Figure 7: Example of contradicting trends in local models (Xalan 2.6, Cluster 1 and Cluster 6 in Fold 9). Panels: ic, npm, and mfa for Fold 9, Cluster 1 and for Fold 9, Cluster 6.]
[Paper excerpt] ... model already partition the data into regions with individual properties. For example, we observe that an increase of ic (measuring the inheritance coupling through parent classes) is predicted to only have a negative effect on bug-proneness ...
Cluster 1 ... Cluster 6
Sometimes even contradict
71. Model Interpretation
Regression Splines: Local Trends in a Single Curve
[Figure 6(b): Part of a global model with local considerations learned on the Xalan 2.6 dataset. Plot label: bug earth(formula=f,data=training1). One panel per metric: cam, cbm, ce, dam, dit, ic, lcom, lcom3, loc, mfa, moa, noc, npm.]
[Figure 6 caption, continued: ... local considerations give insights into different regions of the data. The Y-axis describes the response (in this case bugs) while keeping all other prediction variables at their median values.]
[Paper excerpt] ... prediction models leads to an improved fit of these models. Our findings thus confirm the results of Menzies et al., ...
Combines the best of both worlds!
78. Using Locality in Data to build better Statistical Models.
[Global model] vs [local (clustered) models] = Two Extremes
Build Local Model, globally Optimized
• combines best of both worlds
• outperforms global and clustered local
• summarizes local trends in single curve