Cross-Project Build Co-change Prediction
Shane McIntosh
Ahmed E. Hassan
shanemcintosh@acm.org
@shane_mcintosh
shanemcintosh.org
Emad Shihab
David Lo
Xin Xia
What is a build system?
Source code → Deliverable
[Figure: file types from source to deliverable: .tex, .c, .cc, .o, .o, .dvi, .a, .exe, .pdf, .deb]
Build systems describe how sources are translated into deliverables
The build system is at the heart of techniques like Continuous Integration (CI)
Commit → Build → Test → Report
[Figure: commit 9719cf0 (.c, .mk); report: "Commit 9719cf0 was successfully integrated"]
The Build “Tax”
“...nothing can be said to be certain, except death and taxes” - Benjamin Franklin
An Empirical Study of Build Maintenance Effort. S. McIntosh, B. Adams, T. H. D. Nguyen, Y. Kamei, A. E. Hassan [ICSE 2011]
Up to 27% of source changes require build changes, too!
Neglected build maintenance is a frequent cause of build breakage
Commit → Build → Test → Report
[Figure: commit aedd38 (.c, .mk); report: "Commit aedd38 broke the build!"]
Neglected build maintenance can even impact end users
Not working due to linking of incorrect SQLite library version
When are build changes necessary?
Overview of the studied systems
29 years of historical data
Proprietary and open-source systems
Grouping related changes according to the work items that they address
Changes (.c, .c, .c, .mk) → Transactions → Work items
Transactions "Add feature #2121" and "Missed code in #2121" belong to work item 2121
Transaction "Fix for bug #1234" belongs to work item 1234
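The grouping step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the commit records, the file names, the `#<number>` message convention, and the rule that a `.mk` file marks a build change are all hypothetical stand-ins, not the tooling used in the study.

```python
import re
from collections import defaultdict

# Hypothetical commit records: (commit id, message, changed files).
commits = [
    ("c1", "Add feature #2121", ["ui.c", "core.c"]),
    ("c2", "Missed code in #2121", ["core.c", "rules.mk"]),
    ("c3", "Fix for bug #1234", ["parser.c"]),
]

# Assumed convention: work item IDs appear in messages as "#<digits>".
WORK_ITEM_ID = re.compile(r"#(\d+)")


def group_by_work_item(commits):
    """Merge the files of all commits that mention the same work item ID."""
    work_items = defaultdict(set)
    for _commit_id, message, files in commits:
        for item_id in WORK_ITEM_ID.findall(message):
            work_items[item_id].update(files)
    return dict(work_items)


def needs_build_change(files, build_suffixes=(".mk",)):
    """Label a work item as build co-changing if it touches a build file."""
    return any(name.endswith(build_suffixes) for name in files)


work_items = group_by_work_item(commits)
labels = {item: needs_build_change(files) for item, files in work_items.items()}
```

Grouping at the work-item level means that a build change committed separately from the code change it supports (as in the two #2121 transactions) still counts as a co-change.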
We train classifiers to identify code changes that require build co-changes
Work items (1, 2) → Classification model → "Build change necessary" or "No build change necessary"
Prior work shows that within-project build co-change prediction can be accurate
Mining Co-Change Information to Understand when Build Changes are Necessary. S. McIntosh, B. Adams, M. Nagappan, A. E. Hassan [ICSME 2014]
Build co-change classifiers can achieve an AUC of 0.60-0.88
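For readers unfamiliar with AUC, the sketch below computes it from its ranking interpretation: the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The scores and labels are made-up illustration data, not results from the cited study.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores; True means "build change necessary".
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [True, False, True, False, False]
example_auc = auc(scores, labels)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is the scale on which the 0.60-0.88 range should be read.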
However, a large amount of historical data was used to train the classifiers
What about new projects?
…or projects with poorly-recorded historical data?
Can we leverage these large corpora for the small ones?
How well do build co-change prediction models perform on sparse data?
[Figure: Precision, Recall, F1-score, and AUC (axis 0 to 1); series labeled 5%, 50%, and 90%]
Challenge 1: Very small datasets tend to yield models that under-perform

How well do build co-change prediction models perform on other datasets?
[Figure: Precision, Recall, F1-score, and AUC (axis 0 to 1); series: Eclipse => Mozilla, Jazz => Mozilla, Lucene => Mozilla]
Challenge 2: Cross-project build co-change models tend to under-perform
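The precision, recall, and F1-score reported in these charts follow the standard definitions, sketched below. The prediction and ground-truth vectors are hypothetical and only serve to exercise the function.

```python
def precision_recall_f1(predicted, actual):
    """Standard precision, recall, and F1 for the positive class."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels; True means "build change necessary".
predicted = [True, True, False, False, True]
actual = [True, False, True, False, True]
precision, recall, f1 = precision_recall_f1(predicted, actual)
```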
Domain-specific project characteristics may limit the applicability of cross-project models
Training corpus → Classification model → Testing corpus (?)
Using transfer learning to provide some domain knowledge to the training corpus
Move some training data from the target system to the training corpus
Training corpus → Classification model → Testing corpus (?)
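The data movement described on this slide can be sketched as follows. This shows only the corpus split, not the learning algorithm; the function name, the `target_fraction` parameter, the random sampling, and the toy instances are assumptions made for illustration.

```python
import random


def build_training_corpus(source, target, target_fraction, seed=0):
    """Move a random fraction of target instances into the training corpus.

    The remaining target instances form the testing corpus.
    """
    if not 0.0 <= target_fraction <= 1.0:
        raise ValueError("target_fraction must be between 0 and 1")
    rng = random.Random(seed)
    shuffled = list(target)
    rng.shuffle(shuffled)
    n_moved = round(len(shuffled) * target_fraction)
    training = list(source) + shuffled[:n_moved]
    testing = shuffled[n_moved:]
    return training, testing


# Hypothetical instances, tagged with the system they come from.
source = [("source", i) for i in range(100)]
target = [("target", i) for i in range(20)]
training, testing = build_training_corpus(source, target, target_fraction=0.1)
```

Keeping the moved instances out of the testing corpus matters: evaluating on data the model was trained on would overstate its performance.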
Challenge 3: Build co-changes are the minority
Only 8%-17% of changes are build co-changing
Use training corpus to find an appropriate threshold
Set aside the testing corpus
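One way to pick such a threshold is sketched below. The slides do not state which criterion makes a threshold "appropriate"; maximizing F1 on the training corpus is an assumption here, as are the scores and labels, which are made up. The example is imbalanced on purpose, since a default cut-off of 0.5 tends to miss minority-class instances.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 of the positive class when predicting positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def choose_threshold(scores, labels):
    """Try every observed score as a cut-off and keep the best one (assumed criterion: F1)."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Hypothetical model scores on the training corpus only; the testing corpus is untouched.
train_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.60]
train_labels = [False, False, False, False, True, False, True, True]
threshold = choose_threshold(train_scores, train_labels)
```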
[Figure: the classification model is applied to the training corpus; some instances are marked "Incorrectly classified!"]
[Figure: classification models 1, 2, …, N]
Ensemble of models used on the testing corpus
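A minimal sketch of applying an ensemble to the testing corpus follows. The slides do not say how the N models are combined; averaging their scores and comparing against a threshold is an assumption, and the three stand-in models and their features (`new_files`, `touched_dirs`) are hypothetical.

```python
def ensemble_score(models, instance):
    """Average the scores of all models for one instance (assumed combination rule)."""
    return sum(model(instance) for model in models) / len(models)


def ensemble_predict(models, instances, threshold):
    """Predict "build change necessary" when the averaged score reaches the threshold."""
    return [ensemble_score(models, x) >= threshold for x in instances]


# Hypothetical stand-in models: each maps a feature dict to a score in [0, 1].
models = [
    lambda x: 0.8 if x["new_files"] > 0 else 0.1,
    lambda x: 0.6 if x["touched_dirs"] > 2 else 0.2,
    lambda x: 0.7 if x["new_files"] > 1 else 0.3,
]

testing_corpus = [
    {"new_files": 2, "touched_dirs": 3},
    {"new_files": 0, "touched_dirs": 1},
    {"new_files": 1, "touched_dirs": 1},
]

predictions = ensemble_predict(models, testing_corpus, threshold=0.4)
```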
Evaluating our approach
Relative performance
Training configuration sensitivity (Source, Target)
Our approach outperforms baseline cross-project approaches
[Figure: F-score (axis 0 to 1) for Eclipse, Jazz, Lucene, Mozilla, and Average; series: Our approach, Ordinary cross-project, AdaBoost, TrAdaBoost; annotation: "Worst measured F-score"]
37%-42% improvement
Our approach achieves similar results to within-project models
[Figure: F-score (axis 0 to 1) for Eclipse, Jazz, Lucene, Mozilla, and Average; series: Our approach, Within-project; annotation: "Worst measured F-score"]
Only a 7% drop in performance
Additional data from the target system slowly improves classifier performance
[Figure: F-score; labels: Source, Target]
Evaluating our approach
Relative performance: 37%-42% improvement over baseline; only 7% drop of within-project F-measure
Training configuration sensitivity: F-score tends to improve as more target system data becomes available