This document summarizes the fifth round of a Java unit testing tool competition. It describes the competition infrastructure, including the modifications needed to work with libraries not in Defects4J and a new tool to detect flaky tests. The benchmark comprised 69 classes drawn from 8 projects. The results showed that EvoSuite performed best overall in terms of the coverage of its generated tests, its effectiveness on real and mutated code, and test quality metrics. Statistical analysis confirmed that EvoSuite outperformed the other three participating tools. Lessons learned include the benefits of statistical analysis and of selecting non-trivial benchmark classes.
8. The Infrastructure
• The previous edition used Defects4J to detect flaky tests and to measure effectiveness
• In the new edition, we modified the infrastructure to work with libraries not in Defects4J
• We developed our own tool to detect flaky tests
• Effectiveness is based on mutation analysis: PITest + JaCoCo
9. Test Management
Flaky tests:
• Pass during generation but fail when re-executed
• Detection mechanism: we run each test suite five times (a sketch follows this slide)
• Flaky tests are ignored when computing the coverage scores
Non-compiling tests:
• Generated test suites were re-compiled in our own execution environment
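As an illustration of the detection mechanism above, here is a minimal Java sketch of the five-run flakiness check. The `FlakyTestDetector` class and its structure are hypothetical stand-ins; the competition's actual tool is not shown in these slides.

```java
import org.junit.runner.JUnitCore;
import org.junit.runner.Result;
import org.junit.runner.notification.Failure;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: re-run a generated suite several times and flag
// tests that fail in any run. Since every generated test passed at
// generation time, a failure on re-execution marks that test as flaky.
public class FlakyTestDetector {

    public static Set<String> detectFlaky(Class<?> suiteClass, int runs) {
        Set<String> flaky = new HashSet<>();
        for (int i = 0; i < runs; i++) {          // the competition used 5 runs
            Result result = JUnitCore.runClasses(suiteClass);
            for (Failure failure : result.getFailures()) {
                flaky.add(failure.getDescription().getDisplayName());
            }
        }
        return flaky;   // these tests are ignored when computing coverage
    }
}
```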
10. Metric Computation
Code coverage:
• Statement coverage
• Condition coverage
Mutation score:
• We did not use PITest's execution engine, since it produced errors for test cases with ad-hoc/non-standard JUnit runners (e.g., those generated by EvoSuite)
• We used the PITest engine only to generate mutants
• Combining PITest with JaCoCo: we executed only the mutants injected on covered lines (a sketch follows this slide)
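A hedged sketch of the PITest + JaCoCo combination described above: PITest is used only to generate mutants, and only mutants injected on lines that JaCoCo reports as covered are executed. The `Mutant` type and the coverage map are hypothetical stand-ins for the real PITest/JaCoCo data structures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a PITest mutant descriptor.
class Mutant {
    final String className;
    final int line;   // line where the mutation was injected

    Mutant(String className, int line) {
        this.className = className;
        this.line = line;
    }
}

public class MutantFilter {

    // coveredLines maps each class name to the set of line numbers that
    // JaCoCo reported as covered by the generated test suite.
    public static List<Mutant> selectExecutable(List<Mutant> allMutants,
                                                Map<String, Set<Integer>> coveredLines) {
        List<Mutant> toExecute = new ArrayList<>();
        for (Mutant m : allMutants) {
            Set<Integer> lines = coveredLines.get(m.className);
            // A mutant on an uncovered line can never be killed, so there
            // is no need to execute the test suite against it.
            if (lines != null && lines.contains(m.line)) {
                toExecute.add(m);
            }
        }
        return toExecute;
    }
}
```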
13. Scoring Formula
We apply the same formula used in the last competition, since it combines coverage metrics, effectiveness, execution time, and the number of flaky/non-compiling tests.

$covScore_{\langle T,B,C,r \rangle} = 1 \times Cov_i + 2 \times Cov_b + 4 \times Cov_m$

$tScore_{\langle T,B,C,r \rangle} = covScore_{\langle T,B,C,r \rangle} \times \min\left(1, \frac{2 \times B}{genTime}\right)$

$Score_{\langle T,B,C,r \rangle} = tScore_{\langle T,B,C,r \rangle} + penalty_{\langle T,B,C,r \rangle}$

where:
• T = generated test suite
• B = search budget
• C = class under test
• r = independent run
• Cov_i = statement coverage
• Cov_b = branch coverage
• Cov_m = strong mutation score
• genTime = generation time
• penalty = penalty term for the percentage of flaky and non-compiling tests

(A code sketch of this computation follows.)
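The formula translates directly into code. Below is a minimal Java sketch following the definitions above; the class and method names are illustrative, not the competition's actual implementation, and the assumption that the penalty term is a deduction follows from its role in the formula.

```java
// Minimal sketch of the scoring formula; variable names follow the
// slide's definitions. Not the competition's actual implementation.
public class CompetitionScore {

    // covScore = 1 * Cov_i + 2 * Cov_b + 4 * Cov_m
    static double covScore(double covI, double covB, double covM) {
        return 1 * covI + 2 * covB + 4 * covM;
    }

    // tScore scales covScore by min(1, (2 * B) / genTime): no bonus for
    // finishing early, and a proportional reduction once generation time
    // exceeds twice the search budget.
    static double tScore(double covScore, double budgetSeconds, double genTimeSeconds) {
        return covScore * Math.min(1.0, (2 * budgetSeconds) / genTimeSeconds);
    }

    // The final score adds the penalty term (a deduction) computed from
    // the percentage of flaky and non-compiling tests.
    static double finalScore(double tScore, double penalty) {
        return tScore + penalty;
    }
}
```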
18. Selection of the Benchmark Classes

Project  | Source               | Application Domain                                               | # Classes | # Selected Classes
BCEL     | Apache Commons       | Bytecode manipulation                                            | 431       | 10
JXPath   | Apache Commons       | JavaBeans manipulation with XPath syntax                         | 180       | 10
Imaging  | Apache Commons       | Framework to write/read images in various formats                | 427       | 4
Gson     | Google               | Conversion of Java objects into JSON and vice versa              | 174       | 9
Re2j     | Google               | Regular expression engine with linear-time matching              | 47        | 8
FreeHEP  | Java Analysis Studio | Open-source Java utilities for high energy physics applications  | 180       | 10
LA4j     | GitHub               | Linear algebra primitives (matrices and vectors) and algorithms  | 208       | 10
OkHttp   | GitHub               | HTTP and HTTP/2 client for Android and Java applications         | 193       | 8
19. Selection Procedure
HOW:
• Compute McCabe's cyclomatic complexity (MCC) for all methods in each Java library
• Filter out all trivial classes, i.e., classes that contain only methods with an MCC below 3
• Randomly sample from the pruned projects (a sketch follows this slide)
WHAT/WHY:
• Remove (likely) trivial classes that are not challenging for the tools
• Developers may use automated tools for complex classes
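A minimal sketch of the selection procedure just described, assuming per-class method complexities are already available; `ClassInfo` and `BenchmarkSelector` are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical container: a class name plus the MCC of each of its methods.
class ClassInfo {
    final String name;
    final List<Integer> methodComplexities;

    ClassInfo(String name, List<Integer> methodComplexities) {
        this.name = name;
        this.methodComplexities = methodComplexities;
    }
}

public class BenchmarkSelector {

    // Keep only classes with at least one method of MCC >= 3, then
    // randomly sample the requested number of classes.
    public static List<ClassInfo> select(List<ClassInfo> project, int sampleSize, long seed) {
        List<ClassInfo> nonTrivial = new ArrayList<>();
        for (ClassInfo c : project) {
            boolean trivial = true;
            for (int mcc : c.methodComplexities) {
                if (mcc >= 3) { trivial = false; break; }
            }
            if (!trivial) nonTrivial.add(c);   // filter out (likely) trivial classes
        }
        Collections.shuffle(nonTrivial, new Random(seed));  // random sampling
        return nonTrivial.subList(0, Math.min(sampleSize, nonTrivial.size()));
    }
}
```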
20. Benchmark Statistics
Largest class: XPathParserTokenManager (JXPath), 1029 statements, 872 branches
Smallest class: ForwardBackSubstitutionSolver (LA4j), 26 statements, 20 branches
[Histograms: frequency of the selected classes by number of statements and by number of branches]
21. The Methodology
• Search budgets: 10s, 30s, 60s, 120s, 240s, 300s, 480s
• Number of CUTs: 69
• Number of repetitions: 3
• All tools were executed in parallel (multi-threaded) on the same machine
• Statistical analysis (the test statistic is sketched after this slide):
  – Friedman's test: non-parametric test for multiple-problem analysis
  – Post-hoc Conover's procedure for pairwise multiple comparisons
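For reference, a standard form of the Friedman statistic from the statistics literature (not shown on the slides), where $k$ is the number of tools, $N$ the number of classes under test, and $R_j$ the mean rank of tool $j$ across the $N$ problems:

$$\chi^2_F = \frac{12N}{k(k+1)}\left[\sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4}\right]$$

When $\chi^2_F$ exceeds the critical value, the null hypothesis that all tools perform equally is rejected, and a post-hoc procedure such as Conover's identifies which pairwise differences are significant.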
25. Coverage Results
For 43 of the 69 classes (≈ 60%), at least one of the two tools in this comparison could not generate any test case. What happens if we consider only the classes for which both EvoSuite and JTExpert could generate tests?
[Filtered results with search budget = 480s]
26. Scalability
[Two plots: % branch coverage (0–100%) and % strong mutation coverage (0–50%) as a function of the search budget (10s–480s) for EvoSuite, JTExpert, T3, and Randoop]
Comparison for the class Parser.java, extracted from the Re2j library. N. statements = 760, N. branches = 565, N. mutants = 203.
28. Generated vs. Manually-Written Tests
Comparison of the scores achieved by EvoSuite, JTExpert, T3, and Randoop after 480s, by manually-written tests, and by the optimal score.
N.B.: We only considered the 63 subjects for which we found developer-written tests.
[Bar chart of scores (y-axis 0–500) for Optimal, EvoSuite, JTExpert, T3, Randoop, and Manual; extracted bar values include 268, 61, 78, 125, and 251]
31. Statistical Analysis

Tool     | Total Score | St. Dev. | Friedman Rank | Friedman Score | Statistically better than (Conover's procedure)
EvoSuite | 1457        | 193      | 1             | 1.55           | JTExpert, T3, Randoop
JTExpert | 849         | 102      | 2             | 2.71           | T3, Randoop
T3       | 526         | 82       | 3             | 2.81           | Randoop
Randoop  | 448         | 34       | 4             | 2.92           | —
32. Lessons Learnt
• Using multi-problem statistical tests
• Selection procedure to filter out (likely) trivial classes
• Subject categories: string manipulation, computationally intensive, object manipulation, etc.
• What next:
  • Publishing the benchmark infrastructure
  • Performing a more in-depth analysis for each subject category
  • More tools, new languages? (e.g., C, C#)