On The Relation of Test Smells to
Software Code Quality
Seneca
Davide Spadini, Fabio Palomba,
Andy Zaidman, Magiel Bruntink, Alberto Bacchelli
@DavideSpadini ishepard
On The Relation of Test Smells to
Software Code Quality
Seneca
Davide Spadini, Fabio Palomba,
Andy Zaidman, Magiel Bruntink, Alberto Bacchelli
Refactoring Test Code
Arie van Deursen Leon Moonen Alex van den Bergh Gerard Kok
CWI Software Improvement Group
The Netherlands The Netherlands
http://www.cwi.nl/~{arie,leon}/ http://www.software-improvers.com/
{arie,leon}@cwi.nl {alex,gerard}@software-improvers.com
ABSTRACT
Two key aspects of extreme programming (XP) are unit testing and merciless refactoring. Given the fact that the ideal test code / production code ratio approaches 1:1, it is not surprising that unit tests are being refactored. We found that refactoring test code is different from refactoring production code in two ways: (1) there is a distinct set of bad smells involved, and (2) improving test code involves additional test-specific refactorings. To share our experiences with other XP practitioners, we describe a set of bad smells that indicate trouble in test code, and a collection of test refactorings to remove these smells.
Keywords
Refactoring, unit testing, extreme programming.
1 INTRODUCTION
“If there is a technique at the heart of extreme programming (XP), it is unit testing” [1]. As part of their programming activity, XP developers write and maintain (white box) unit tests continually. These tests are automated, written in the same programming language as the production code, considered an explicit part of the code, and put under revision control.
The XP process encourages writing a test class for every class in the system. Methods in these test classes are used to verify complicated functionality and unusual circumstances. Moreover, they are used to document code by explicitly indicating what the expected results of a method should be for typical cases. Last but not least, tests are added upon receiving a bug report to check for the bug and to check the bug fix [2]. A typical test for a particular method includes: (1) code to set up the fixture (the data used for testing), (2) the call of the method, (3) a comparison of the actual results with the expected values, and (4) code to tear down the fixture. Writing tests is usually supported by frameworks such as JUnit [3].
The test code / production code ratio may vary from project to project, but is ideally considered to approach a ratio of 1:1. In our project we currently have a 2:3 ratio, although others have reported a lower ratio.¹ One of the cornerstones of XP is that having many tests available helps the developers to overcome their fear of change: the tests will provide immediate feedback if the system gets broken at a critical place. The downside of having many tests, however, is that changes in functionality will typically involve changes in the test code as well. The more test code we get, the more important it becomes that this test code is as easily modifiable as the production code.
The key XP practice to keep code flexible is “refactor mercilessly”: transforming the code in order to bring it into the simplest possible state. To support this, a catalog of “code smells” and a wide range of refactorings is available, varying from simple modifications up to ways to introduce design patterns systematically in existing code [5].
When trying to apply refactorings to the test code of our project we discovered that refactoring test code is different from refactoring production code. Test code has a distinct set of smells, dealing with the ways in which test cases are organized, how they are implemented, and how they interact with each other. Moreover, improving test code involves a mixture of refactorings from [5] specialized to test code improvements, as well as a set of additional refactorings, involving the modification of test classes, ways of grouping test cases, and so on.
The goal of this paper is to share our experience in improving our test code with other XP practitioners. To that end, we describe a set of test smells indicating trouble in test code, and a collection of test refactorings explaining how to overcome some of these problems through a simple program modification.
This paper assumes some familiarity with the xUnit framework [3] and refactorings as described by Fowler [5]. We will refer to refactorings described in this book using Name
¹ This project started a year ago and involves the development of a product called DocGen [4]. Development is done by a small team of five people using XP techniques. Code is written in Java and we use the JUnit
Test smells
Does Refactoring of Test Smells
Induce Fixing Flaky Tests?
Fabio Palomba and Andy Zaidman
Delft University of Technology, The Netherlands
f.palomba@tudelft.nl, a.e.zaidman@tudelft.nl
Abstract—Regression testing is a core activity that allows developers to ensure that source code changes do not introduce bugs. An important prerequisite then is that test cases are deterministic. However, this is not always the case as some tests suffer from so-called flakiness. Flaky tests have serious consequences, as they can hide real bugs and increase software inspection costs. Existing research has focused on understanding the root causes of test flakiness and devising techniques to automatically fix flaky tests; a key area of investigation being concurrency. In this paper, we investigate the relationship between flaky tests and three previously defined test smells, namely Resource Optimism, Indirect Testing and Test Run War. We have set up a study involving 19,532 JUnit test methods belonging to 18 software systems. A key result of our investigation is that 54% of tests that are flaky contain a test code smell that can cause the flakiness. Moreover, we found that refactoring the test smells not only removed the design flaws, but also fixed all 54% of flaky tests causally co-occurring with test smells.
Index Terms—Test Smells; Flaky Tests; Refactoring;
I. INTRODUCTION
Test cases form the first line of defense against the introduction of software faults, especially when testing for regression faults [1], [2]. As such, with the help of testing frameworks but just flaky [19]. Perhaps most importantly, from a psychological point of view flaky tests can reduce a developer’s confidence in the tests, possibly leading to ignoring actual test failures [17]. Because of this, the research community has spent considerable effort on trying to understand the causes behind test flakiness [18], [20], [21], [22] and on devising automated techniques able to fix flaky tests [23], [24], [25].
However, most of this research mainly focused on some specific causes possibly leading to the introduction of flaky tests, such as concurrency [26], [25], [27] or test order dependency [22] issues, thus proposing ad-hoc solutions that cannot be used to fix flaky tests characterized by other root causes. Indeed, according to the findings by Luo et al. [18], who conducted an empirical study on the motivations behind test code flakiness, the problems faced by previous research only represent a part of the whole story: a deeper analysis of possible fixing strategies for other root causes (e.g., flakiness due to wrong usage of external resources) is still missing.
In this paper, we aim at making a further step ahead toward the comprehension of test flakiness, by investigating the role of so-called test smells [28], [29], [30], i.e., poor design or implementation choices applied by programmers during the
Empir Software Eng (2015) 20:1052–1094
DOI 10.1007/s10664-014-9313-0
Are test smells really harmful? An empirical study
Gabriele Bavota · Abdallah Qusef · Rocco Oliveto ·
Andrea De Lucia · Dave Binkley
Published online: 31 May 2014
© Springer Science+Business Media New York 2014
Abstract Bad code smells have been defined as indicators of potential problems in source code. Techniques to identify and mitigate bad code smells have been proposed and studied. Recently bad test code smells (test smells for short) have been put forward as a kind of bad code smell specific to tests such as unit tests. What has been missing is empirical investigation into the prevalence and impact of bad test code smells. Two studies aimed at providing this missing empirical data are presented. The first study finds that there is a high diffusion of test smells in both open source and industrial software systems, with 86% of JUnit tests exhibiting at least one test smell and six tests having six distinct test smells. The second study provides evidence that test smells have a strong negative impact on program comprehension and maintenance. Highlights from this second study include the finding that comprehension is 30% better in the absence of test smells.
On The Relation of Test Smells to
Software Code Quality
Davide Spadini,*‡ Fabio Palomba,§ Andy Zaidman,* Magiel Bruntink,‡ Alberto Bacchelli§
‡Software Improvement Group, *Delft University of Technology, §University of Zurich
*{d.spadini, a.e.zaidman}@tudelft.nl, ‡m.bruntink@sig.eu, §{palomba, bacchelli}@ifi.uzh.ch
Abstract—Test smells are sub-optimal design choices in the implementation of test code. As reported by recent studies, their presence might not only negatively affect the comprehension of test suites but can also lead to test cases being less effective in finding bugs in production code. Although significant steps have been taken toward understanding test smells, there is still a notable absence of studies assessing their association with software quality.
In this paper, we investigate the relationship between the presence of test smells and the change- and defect-proneness of test code, as well as the defect-proneness of the tested production code. To this aim, we collect data on 221 releases of ten software systems and we analyze more than a million test cases to investigate the association of six test smells and their co-occurrence with software quality. Key results of our study include: (i) tests with smells are more change- and defect-prone, (ii) ‘Indirect Testing’, ‘Eager Test’, and ‘Assertion Roulette’ are the most significant smells for change-proneness and, (iii) production code is more defect-prone when tested by smelly tests.
I. INTRODUCTION
Automated testing (hereafter referred to as just testing) has become an essential process for improving the quality of software systems [12], [47]. In fact, testing can help to point out defects and to ensure that production code is robust under many usage conditions [12], [16]. Writing tests, however, is as challenging as writing production code and developers should maintain test code with the same care they use for production found evidence of a negative impact of test smells on both comprehensibility and maintainability of test code [7].
Although the study by Bavota et al. [7] made a first, necessary step toward the understanding of maintainability aspects of test smells, our empirical knowledge on whether and how test smells are associated with software quality aspects is still limited. Indeed, van Deursen et al. [74] based their definition of test smells on their anecdotal experience, without extensive evidence on whether and how such smells are negatively associated with the overall system quality.
To fill this gap, in this paper we quantitatively investigate the relationship between the presence of smells in test methods and the change- and defect-proneness of both these test methods and the production code they intend to test. Similar to several previous studies on software quality [24], [62], we employ the proxy metrics change-proneness (i.e., the number of times a method changes between two releases) and defect-proneness (i.e., the number of defects the method had between two releases). We conduct an extensive observational study [15], collecting data from 221 releases of ten open source software systems, analyze more than a million test cases, and investigate the association between six test smell types and the aforementioned proxy metrics.
Based on the experience and reasoning reported by van
Research questions
RQ1: Are test smells associated with change/
defect proneness of test code?
RQ2: Are test smells associated with defect
proneness of production code?
Methodology — subject systems
10 OSS
221 Major releases
          # Releases   # Classes   # Methods     KLOC
  Total   221          9 - 2,072   68 - 19,445   1 - 334
• All the metrics are calculated at the method level!
Methodology — test smells
• We calculate which test methods are affected by test smells in every release, using the detector by Bavota et al. (a simplified, illustrative detection heuristic is sketched after the list of smell types below).

  method          is_smelly   type
  file1.java:m1   FALSE
  file1.java:m2   TRUE        Mystery Guest
  file2.java:m1   TRUE        Eager Test, Indirect Testing
  file2.java:m2   FALSE

• Types of smells:
  1. Mystery Guest
  2. Resource Optimism
  3. Eager Test
  4. Assertion Roulette
  5. Indirect Testing
  6. Sensitive Equality
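The detection itself is done with the tool by Bavota et al.; purely to make the idea concrete, here is a rough, hypothetical Python heuristic for one of these smells (not the actual detector; the regexes, threshold, and example body are assumptions). It flags a JUnit method body as a potential Assertion Roulette when it contains more than one assertion without an explanation message:

    import re

    # Illustrative heuristic for 'Assertion Roulette' (hypothetical, not the study's detector):
    # a test method with more than one assertion that carries no explanation message.
    ASSERT_CALL = re.compile(r'\bassert\w*\s*\(')               # e.g. assertEquals(, assertTrue(
    ASSERT_WITH_MESSAGE = re.compile(r'\bassert\w*\s*\(\s*"')   # JUnit 4 style: message as first argument

    def is_assertion_roulette(method_body: str, max_unexplained: int = 1) -> bool:
        """Return True if the method has more unexplained assertions than the threshold."""
        total = len(ASSERT_CALL.findall(method_body))
        explained = len(ASSERT_WITH_MESSAGE.findall(method_body))
        return (total - explained) > max_unexplained

    body = '''
        assertEquals(4, calc.add(2, 2));
        assertTrue(calc.isPositive(4));
        assertEquals("addition should be commutative", 4, calc.add(2, 2));
    '''
    print(is_assertion_roulette(body))  # True: two assertions carry no message

A real detector would work on a parsed AST rather than regular expressions; this sketch only illustrates the kind of rule involved.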
Methodology — change proneness of test code
• We define the change proneness of a test method Ti in release Ri as the number of times Ti changed between Ri-1 and Ri.
• For every commit between Ri-1 and Ri that touches a test file (in the slides, ATest.java), the old and new versions of the file are compared at the method level (a minimal sketch of this matching step is shown after the list):
  - Methods with the same name and an identical body (e.g., “sum = a + b; return sum” on both sides) are unchanged.
  - Methods with the same name but a different body (e.g., “diff = a - b; return diff” becomes “diff = b - a; return diff”) count as a change for that method (method2 changes ++).
  - The remaining, unmatched methods are paired by the cosine similarity of their bodies: a pair with similarity > 0.9 is treated as the same method that changed (method5 changes ++), while methods below the threshold are classified as added or removed (e.g., method6 is a Method Added).
[The slides animate this procedure on ATest.java, one commit at a time.]
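A minimal sketch of the matching step above, assuming a plain bag-of-tokens representation of method bodies (the tokenizer, the data structures, and the example methods are illustrative assumptions; the study's actual implementation may differ):

    import math
    import re
    from collections import Counter

    def tokens(body: str) -> Counter:
        # Bag-of-tokens representation of a method body (illustrative tokenizer).
        return Counter(re.findall(r"\w+", body))

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def count_changes(old: dict, new: dict, threshold: float = 0.9) -> Counter:
        """old/new map method name -> body for one test file in two consecutive snapshots."""
        changes = Counter()
        # 1. Same name in both snapshots: a change if the body differs.
        for name in old.keys() & new.keys():
            if old[name] != new[name]:
                changes[name] += 1
        # 2. Unmatched methods: pair by cosine similarity; above the threshold counts as a
        #    change (rename/move), anything left over is treated as added or removed.
        removed = {n: b for n, b in old.items() if n not in new}
        added = {n: b for n, b in new.items() if n not in old}
        for n_old, b_old in removed.items():
            best = max(added, key=lambda n: cosine(tokens(b_old), tokens(added[n])), default=None)
            if best is not None and cosine(tokens(b_old), tokens(added[best])) > threshold:
                changes[best] += 1
                added.pop(best)
        return changes

    old = {"method1": "sum = a + b; return sum",
           "method2": "diff = a - b; return diff",
           "method5": "x = a * b; return x"}
    new = {"method1": "sum = a + b; return sum",
           "method2": "diff = b - a; return diff",
           "method4": "x = a * b; return x + 0"}
    print(count_changes(old, new))  # method2 changed (name match); method4 matched to method5 by similarity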
Methodology — defect proneness
• We define the defect proneness of a (test or production) method Ti in release Ri as the number of defects Ti contained in Ri.
• We first identify the bug-fixing commits and then apply SZZ to trace each fix back to the bug-inducing commits.
[Figure: timeline of release Ri containing bug#1 and bug#2.]
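SZZ works by blaming, in the fix's parent commit, the lines that the fix removed, so that the commits which last touched those lines are recovered as bug-inducing. A minimal sketch against the plain git command line follows; the repository path and the keyword heuristic for spotting fixes are placeholder assumptions (real studies typically link commits to issue-tracker reports instead):

    import re
    import subprocess

    REPO = "/path/to/repo"  # placeholder

    def git(*args: str) -> str:
        return subprocess.run(["git", "-C", REPO, *args],
                              capture_output=True, text=True, check=True).stdout

    def is_bug_fix(message: str) -> bool:
        # Naive keyword heuristic; an assumption, not the study's linking procedure.
        return bool(re.search(r"\b(fix(es|ed)?|bug|defect)\b", message, re.IGNORECASE))

    def bug_inducing_commits(fix_sha: str, path: str) -> set:
        """SZZ core step: blame, in the fix's parent, the lines that the fix removed."""
        inducing = set()
        diff = git("diff", "-U0", f"{fix_sha}^", fix_sha, "--", path)
        for hunk in re.finditer(r"@@ -(\d+)(?:,(\d+))? ", diff):
            start, count = int(hunk.group(1)), int(hunk.group(2) or 1)
            if count == 0:
                continue  # pure addition: the fix removed nothing here
            blame = git("blame", "-l", "-L", f"{start},{start + count - 1}",
                        f"{fix_sha}^", "--", path)
            inducing.update(line.split()[0] for line in blame.splitlines() if line)
        return inducing

Each inducing commit can then be mapped to the methods it touches in release Ri to obtain the per-method defect counts used above.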
Research questions
RQ1: Are test smells associated with change/
defect proneness of test code?
Research questions
RQ1: Are test smells associated with change/
defect proneness of test code?
RQ1.1: To what extent are test smells
associated with the change- and defect-
proneness of test code?
RQ1.1: To what extent are test smells associated with the change- and defect-proneness of test code?

Change Proneness (relative to non-smelly test methods = 1):

  size      Change Proneness   Conf. Int.
  overall   1.47               1.46-1.50
  small     1.31               1.29-1.32
  average   1.95               1.86-2.04
  large     2.02               1.84-2.19
RQ1.1: To what extent are test smells associated with the change- and defect-proneness of test code?

Defect Proneness (relative to non-smelly test methods = 1):

  group           Defect Proneness   Conf. Int.
  overall         1.81               1.74-1.89
  size: small     1.56               1.50-1.63
  size: average   2.37               2.05-2.75
  size: large     3.55               2.74-4.61
  C.P.: no        1.63               1.54-1.71
  C.P.: yes       1.45               1.37-1.53
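The tables above show a baseline of 1 for non-smelly methods plus an estimate and confidence interval for smelly methods in each stratum; the slides do not spell out the statistic, so the sketch below simply assumes a risk ratio with a log-normal (Wald) confidence interval, using made-up counts:

    import math

    def risk_ratio(events_smelly, n_smelly, events_clean, n_clean, z=1.96):
        """Risk ratio of smelly vs. non-smelly methods with a 95% log-normal (Wald) CI."""
        p_smelly, p_clean = events_smelly / n_smelly, events_clean / n_clean
        rr = p_smelly / p_clean
        se = math.sqrt(1 / events_smelly - 1 / n_smelly + 1 / events_clean - 1 / n_clean)
        return rr, rr * math.exp(-z * se), rr * math.exp(z * se)

    # Hypothetical counts: defect-prone test methods out of all test methods per group.
    print(risk_ratio(events_smelly=600, n_smelly=2000, events_clean=2000, n_clean=10000))
    # -> roughly (1.5, 1.39, 1.62)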
Research questions
RQ1: Are test smells associated with change/
defect proneness of test code?
RQ1.1: To what extent are test smells
associated with the change- and defect-
proneness of test code?
RQ1.2: Is the co-occurrence of test smells
associated with the change- and defect-
proneness of test code?
RQ1.2: Is the co-occurrence of test smells associated with the
change- and defect-proneness of test code?
[Plot: distribution of the number of changes (0 to 12.5) per number of test smells in a test method (0 to 6).]
RQ1.2: Is the co-occurrence of test smells associated with the
change- and defect-proneness of test code?
[Plot: distribution of the number of bugs (0 to 40) per number of test smells in a test method (0 to 6).]
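The co-occurrence analysis above groups test methods by the number of smells they contain and then compares the distributions of changes and bugs across the groups. A small sketch with a made-up per-method table (the column names are assumptions):

    import pandas as pd

    # Hypothetical per-method data; in the study this would come from the mined releases.
    df = pd.DataFrame({
        "n_smells":  [0, 0, 1, 1, 2, 3, 4],
        "n_changes": [0, 1, 2, 3, 5, 6, 9],
        "n_bugs":    [0, 0, 0, 1, 2, 3, 5],
    })
    print(df.groupby("n_smells")[["n_changes", "n_bugs"]].median())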
Research questions
RQ1: Are test smells associated with change/
defect proneness of test code?
RQ1.1: To what extent are test smells
associated with the change- and defect-
proneness of test code?
RQ1.2: Is the co-occurrence of test smells
associated with the change- and defect-
proneness of test code?
RQ1.3: Are certain test smell types more
associated with the change- and defect-
proneness of test code?
RQ1.3: Are certain test smell types more associated with the change-
and defect-proneness of test code?
[Plot: number of changes and number of bugs (scale 0 to 60) for test methods affected by each smell type: Assertion Roulette, Eager Test, Indirect Testing, Mystery Guest, Sensitive Equality.]
Research questions
RQ2: Are test smells associated with defect
proneness of production code?
Research questions
RQ2: Are test smells associated with defect
proneness of production code?
RQ2.1: To what extent are test smells
associated with the defect-proneness of
production code?
RQ2.1: To what extent are test smells associated with the defect-proneness of production code?

Defect Proneness (production methods exercised by smelly tests, relative to non-smelly = 1):

  size      Defect Proneness   Conf. Int.
  overall   1.71               1.67-1.75
  small     1.56               1.52-1.60
  average   2.23               2.03-2.46
  large     2.17               1.84-2.54
RQ2.1: To what extent are test smells associated with the defect-proneness of production code?

Defect Proneness
[Plot: number of bugs (0 to 10) in production methods tested by non-smelly vs. smelly tests.]
Research questions
RQ2: Are test smells associated with defect
proneness of production code?
RQ2.1: To what extent are test smells
associated with the defect-proneness of
production code?
RQ2.2: Is the co-occurrence of test smells
associated with the defect-proneness of
production code?
RQ2.2: Is the co-occurrence of test smells associated with the defect-
proneness of production code?
[Plot: number of bugs in the production methods (0 to 10) per number of test smells in the associated test method (0 to 6).]
Research questions
RQ2: Are test smells associated with defect
proneness of production code?
RQ2.1: To what extent are test smells
associated with the defect-proneness of
production code?
RQ2.2: Is the co-occurrence of test smells
associated with the defect-proneness of
production code?
RQ2.3: Are certain test smell types more
associated with the defect-proneness of
production code?
RQ2.3: Are certain test smell types more associated with the defect-
proneness of production code?
[Plot: number of bugs in the production methods (0 to 10) for each smell type: Assertion Roulette, Eager Test, Indirect Testing, Mystery Guest, Sensitive Equality.]
Summary

Test code:
• More change- and defect-prone if affected by smells.
• Slightly more change-prone if affected by more smells.

Production code:
• More defect-prone if exercised by test code affected by test smells.
• Production code exercised by tests with the ‘Indirect Testing’ and ‘Eager Test’ smells is the most defect-prone.