This study revisits the analysis of test smells in automatically generated tests conducted by Grano et al. The authors find:
1) A manual analysis of 100 generated test suites shows that the prevalence of test smells is lower than what detection tools reported in the previous study.
2) Detection tools have high false positive and false negative rates when identifying test smells in generated tests; they incorrectly flagged tests as containing smells such as Mystery Guest and Resource Optimism.
3) Detection tools failed to identify instances of smells such as Sensitive Equality and Indirect Testing that were found during manual analysis.
4) Relying solely on detection tools can lead to incorrect conclusions about test quality; human validation of tool warnings is essential.
Revisiting Test Smells in Automatically Generated Tests
1. Revisiting Test Smells in Automatically Generated Tests: Limitations, Pitfalls, and Opportunities
A. Panichella, S. Panichella, G. Fraser, A. A. Sawant, and V. J. Hellendoorn
2. Related Work [Grano et al., JSS 2019]
• Test case generation tools: EvoSuite, JTExpert
• Test smell detection tool from previous work [EMSE 2015]: GPD (Grano, Palomba, Di Nucci)
3. Related Work [Grano et al., JSS 2019]
Main results:
• 81%: GPD precision in detecting test smells (100% recall)
• 88% of the JUnit test suites by EvoSuite contain test smells
• "The tests [by EvoSuite] are scented since the beginning as crossover and mutation operations […] do not change the structure of the tests"
4. Threats to Validity
• Warnings raised by GPD were not manually validated
• EvoSuite was misconfigured:
  - Old search algorithm
  - Tests and assertions were not minimized
• Mutation and crossover do alter the test structure by adding/removing statements [Arcuri and Fraser, TSE 2012]
5. Our Study
• RQ1: How widespread are test smells in
automatically generated tests?
• RQ2: How accurate are automated tools in
detecting code smells in automatically generated
tests?
• RQ3: How well do test smells reflects real
problem in test suites?
5
• Manually analysing generated tests rather than relying on detection tools
• Assessing smell detection accuracy based on the manual oracle
6. Manual Analysis
• 100 Java classes from SF110 (the same classes used by Grano et al.)
• 100 generated test suites
• Four validators, two validation rounds per test suite, producing a cross-validated oracle
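The cross-validated oracle rests on agreement among the validators. As an illustration of how such agreement can be quantified (not the authors' actual procedure), here is a minimal sketch of Cohen's kappa for two validators; the class name and labels are hypothetical.

```java
// Minimal sketch: Cohen's kappa for two validators labelling test suites
// as smelly (true) or clean (false). Labels below are hypothetical, not
// the study's actual data.
public class KappaSketch {
    static double cohensKappa(boolean[] a, boolean[] b) {
        int n = a.length;
        int agree = 0, aYes = 0, bYes = 0;
        for (int i = 0; i < n; i++) {
            if (a[i] == b[i]) agree++;
            if (a[i]) aYes++;
            if (b[i]) bYes++;
        }
        double po = (double) agree / n;                      // observed agreement
        double pYes = ((double) aYes / n) * ((double) bYes / n);
        double pNo = (1.0 - (double) aYes / n) * (1.0 - (double) bYes / n);
        double pe = pYes + pNo;                              // chance agreement
        return (po - pe) / (1.0 - pe);
    }

    public static void main(String[] args) {
        boolean[] validator1 = {true, true, false, false, true, false};
        boolean[] validator2 = {true, false, false, false, true, false};
        System.out.printf("kappa = %.2f%n", cohensKappa(validator1, validator2));
    }
}
```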
7. RQ1: Distribution of Test Smells
[Bar chart: % of smelly test suites for Eager Test, Assertion Roulette, Indirect Testing, Sensitive Equality, Mystery Guest, and Resource Optimism; our results based on a manually validated dataset vs. the results by Grano et al. (based on automated tool warnings)]
8. RQ2: Accuracy of Smell Detection Tools
• GPD: large false positive rate for Assertion Roulette and Eager Test

TABLE IV: Detection performance of different automated test smell detection tools for test cases generated by EVOSUITE. FPR denotes the False Positive Rate and FNR the False Negative Rate.
Test smell          | Tool used by Grano et al. [6]   | TSDETECT calibrated by Spadini et al. [2]
                    | FPR  FNR  Prec. Recall F-meas.  | FPR  FNR  Prec. Recall F-meas.
Assertion Roulette  | 0.72 0.00 0.22  1.00   0.36     | 0.05 0.50 0.67  0.50   0.57
Eager Test          | 0.53 0.05 0.33  0.95   0.49     | 0.05 0.45 0.73  0.55   0.63
Mystery Guest       | 0.12 —    —     —      —        | 0.03 —    —     —      —
Sensitive Equality  | 0.00 0.67 1.00  0.33   0.50     | 0.00 0.67 1.00  0.33   0.50
Resource Optimism   | 0.02 —    —     —      —        | 0.02 —    —     —      —
Indirect Testing    | 0.00 1.00 —     0.00   —        | —    —    —     —      —
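The columns in Table IV follow the standard confusion-matrix definitions. A minimal sketch computing them from a hypothetical confusion matrix (the counts below are illustrative, not the study's):

```java
// Minimal sketch of the metrics in Table IV, computed from a
// hypothetical confusion matrix (tp, fp, fn, tn are illustrative).
public class SmellMetrics {
    static double fpr(int fp, int tn)         { return (double) fp / (fp + tn); }
    static double fnr(int fn, int tp)         { return (double) fn / (fn + tp); }
    static double precision(int tp, int fp)   { return (double) tp / (tp + fp); }
    static double recall(int tp, int fn)      { return (double) tp / (tp + fn); }
    static double fMeasure(double p, double r){ return 2 * p * r / (p + r); }

    public static void main(String[] args) {
        int tp = 9, fp = 3, fn = 1, tn = 87; // hypothetical counts
        double p = precision(tp, fp), r = recall(tp, fn);
        System.out.printf("FPR=%.2f FNR=%.2f P=%.2f R=%.2f F=%.2f%n",
                fpr(fp, tn), fnr(fn, tp), p, r, fMeasure(p, r));
    }
}
```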
@Test(timeout = 4000)
public void test07() throws Throwable {
ScriptOrFnScope s0 = new ScriptOrFnScope((-806),
(ScriptOrFnScope) null);
ScriptOrFnScope s1 = new ScriptOrFnScope((-330), s0);
s1.preventMunging();
s1.munge();
assertNotSame(s0, s1);
}
Fig. 2: Example of false positive for the tool used by Grano et al. for Eager Test
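A plausible reason for such false positives is a rule that counts distinct calls to the object under test and flags any test exceeding a threshold, regardless of whether the extra calls merely set up a single behaviour. The following is a sketch of that heuristic under this assumption; it is not GPD's or TsDetector's actual implementation.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive Eager Test heuristic: flag a test whose source invokes more than
// one distinct method on the receiver under test. Sketch only; the real
// detectors work on parsed code, not raw strings.
public class EagerTestSketch {
    static Set<String> calledMethods(String testSource, String receiver) {
        Set<String> methods = new LinkedHashSet<>();
        Matcher m = Pattern.compile(receiver + "\\.(\\w+)\\(").matcher(testSource);
        while (m.find()) methods.add(m.group(1));
        return methods;
    }

    static boolean isEager(String testSource, String receiver) {
        return calledMethods(testSource, receiver).size() > 1;
    }

    public static void main(String[] args) {
        // Fig. 2's test07: preventMunging() only sets up the state that
        // munge() exercises, yet two distinct calls trip the heuristic.
        String test07 = "s1.preventMunging(); s1.munge(); assertNotSame(s0, s1);";
        System.out.println(isEager(test07, "s1")); // flagged as Eager Test
    }
}
```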
@Test(timeout = 4000)
public void test00() throws Throwable {
    Show show0 = new Show();
    File file0 = MockFile.createTempFile("...");
    // […]

Mystery Guest and Resource Optimism. For these two types of smells, both detection tools raise several warnings. However, they are all false positives by definition, as our gold standard does not contain any instances of such smells. The detection tools both annotate test methods that contain specific strings or objects, such as "File", "FileOutputStream", "DB", and "HttpClient", as smelly; however, EVOSUITE separates the test code from environmental dependencies (e.g., external files) in a fully automated fashion through bytecode instrumentation [43]. In particular, it uses two mechanisms: (1) mocking, and (2) customized test runners. For one, classes that access the filesystem (e.g., java.io.File) […]
• GPD: large false negative rate for Sensitive Equality and Indirect Testing
• TsDetector: low false positive rate, but large false negative rate for most of the test smells
9. Limitations of Test Smell Detection Tools
FALSE POSITIVES: according to GPD warnings, 12% of the JUnit test suites by EvoSuite contain Mystery Guest and 2% contain Resource Optimism. In reality, EvoSuite does not use external resources or files, thanks to:
• Sandbox and scaffolding
• Automated mock generation
• The use of a customized JUnit runner
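These false positives are consistent with a detector that simply matches resource-related type names in the test source; EvoSuite's mocked types (e.g., MockFile) contain the substring "File" and would match anyway. Below is a sketch of such a string-based check, written as an assumption about the tools' behaviour, not their actual code.

```java
import java.util.List;

// Naive Mystery Guest heuristic: flag any test whose source mentions a
// resource-related type name. Sketch of a string-matching detector, not
// the real GPD/TsDetector logic.
public class MysteryGuestSketch {
    static final List<String> RESOURCE_HINTS =
            List.of("File", "FileOutputStream", "DB", "HttpClient");

    static boolean looksLikeMysteryGuest(String testSource) {
        return RESOURCE_HINTS.stream().anyMatch(testSource::contains);
    }

    public static void main(String[] args) {
        // EvoSuite replaces java.io.File with a mock, yet the substring
        // "File" still triggers the heuristic -> false positive.
        String evoTest = "File file0 = MockFile.createTempFile(\"tmp\");";
        System.out.println(looksLikeMysteryGuest(evoTest)); // flagged
    }
}
```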
10. Limitations of Test Smell Detection Tools
GPD and TsDetector fail to detect instances of Sensitive Equality.

Test generated by EvoSuite, not detected by either tool:
@Test(timeout = 4000)
public void test62() throws Throwable {
    SubstringLabeler.Match substringLabeler_Match0 = new SubstringLabeler.Match();
    String string0 = substringLabeler_Match0.toString();
    assertEquals("Substring: [Atts: ]", string0);
}

Semantically identical variant that would be detected:
public void test62() throws Throwable {
    SubstringLabeler.Match substringLabeler_Match0 = new SubstringLabeler.Match();
    assertEquals("Substring: [Atts: ]", substringLabeler_Match0.toString());
}
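One explanation for the miss is a detector that only looks for toString() directly inside an assertion's argument list; once the toString() result passes through a local variable, the pattern no longer matches. The following sketch illustrates that brittleness under this assumption; it is not the tools' actual detection logic.

```java
import java.util.regex.Pattern;

// Naive Sensitive Equality heuristic: flag only assertions that call
// toString() directly in their argument list. Sketch of why an
// intermediate variable hides the smell from pattern-based tools.
public class SensitiveEqualitySketch {
    static final Pattern DIRECT_TOSTRING_IN_ASSERT =
            Pattern.compile("assert\\w+\\([^;]*\\.toString\\(\\)");

    static boolean detected(String testSource) {
        return DIRECT_TOSTRING_IN_ASSERT.matcher(testSource).find();
    }

    public static void main(String[] args) {
        // EvoSuite's style: toString() result stored in a variable first.
        String viaVariable =
                "String s = m.toString(); assertEquals(\"Substring: [Atts: ]\", s);";
        // Same check inlined into the assertion.
        String direct =
                "assertEquals(\"Substring: [Atts: ]\", m.toString());";
        System.out.println(detected(viaVariable) + " " + detected(direct));
    }
}
```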
11. Discussion
• In the paper we further discuss the limitations of the test smell detection tools (GPD and TsDetector) with more examples
• Our results disagree with the conclusions by Grano et al.: only 32% of generated tests contain test smells, not 80%
• Researchers should avoid self-assessing their test smell detection tools
• The involvement of human participants (preferably in industrial contexts) is critical for improving the accuracy of detection tools