Does Refactoring of Test Smells Induce
Fixing Flaky Tests?
Fabio Palomba, Andy Zaidman
Delft University of Technology

The Netherlands
Regression Testing
Test cases form the first line of defense against the introduction of faults
Pinto et al.

“Understanding Myths and Realities of Test-Suite Evolution”
FSE 2012
Regression Testing
The entire team relies on them to decide
whether to merge a pull request
Gousios et al.

“Work practices and challenges in pull-based
development: The integrator’s perspective”
ICSE 2015
Regression Testing
Or even whether to proceed with the deployment of the system
Beller et al.

“Oops, my tests broke the build: An
explorative analysis of Travis CI with
GitHub” - MSR 2017
Developers’ productivity is dependent on both the
ability to find real problems and the cost of
diagnosing faults in a timely fashion
Perez et al.

“A Test-Suite Diagnosability Metric for
Spectrum-based fault localization
approaches” - ICSE 2017
Regression Testing
Vahabzadeh et al.

“An Empirical Study of Bugs in Test Code”

ICSME 2015
Even angels eat beans!
Test code can be affected by
bugs as well
One of the typical bugs affecting
test suites is called “flakiness”
Flaky Tests
Test cases that exhibit both
a passing and failing result
with the same code
Luo et al.

“An Empirical Analysis of Flaky Tests”

FSE 2014
They might hide real bugs,
increase maintenance costs, and
reduce developers’ confidence
Flaky Tests
Flaky Tests are relevant for practitioners and
highly discussed on the web
Empirical Studies: likely causes behind test code flakiness
Flaky Test Identification: proposing solutions to fix flaky tests
They focus only on specific causes behind test code flakiness
Thus, only ad-hoc solutions are available
Luo et al.

“An Empirical Analysis of Flaky Tests”

FSE 2014
Problems faced by previous research
only represent a part of the whole story
A Deeper Analysis
A deeper analysis of possible fixing
strategies of other root causes is still
missing
Studying Test Smells
Van Deursen et al.

“Refactoring Test Code”

XP 2001
Symptoms of poor design or
implementation choices in test code
Resource Optimism: a test that makes optimistic assumptions about the state or the
existence of an external resource
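A minimal Python sketch of Resource Optimism and one possible refactoring. It is illustrative only and not taken from the study, which analyzed JUnit tests; `count_lines`, the file names, and both test functions are hypothetical. The smelly test assumes a file is already on disk, while the refactored test creates the resource it depends on.

```python
import os
import tempfile


def count_lines(path):
    """Hypothetical production function: count the lines of a text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def smelly_test(path="shared_input.txt"):
    # Resource Optimism: assumes the file exists and holds exactly 3 lines.
    # The outcome depends on whatever happens to be on disk.
    return count_lines(path) == 3


def refactored_test():
    # The test creates and owns the resource, so its state is known.
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "input.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("a\nb\nc\n")
        return count_lines(path) == 3
```

The refactored version no longer depends on the environment in which the suite runs, which is the property the smell definition asks for.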
Indirect Testing: a test that exercises classes other than the
production class it corresponds to
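A small Python sketch of Indirect Testing, assuming two hypothetical classes, `Tokenizer` and `WordCounter`. The smelly test belongs to `WordCounter` but checks `Tokenizer` behavior through it. The refactored test uses a stub so that only `WordCounter` is exercised. This is one possible fix, not necessarily the one applied in the study.

```python
class Tokenizer:
    """Hypothetical production class that has its own test class."""

    def split(self, text):
        return text.split()


class WordCounter:
    """Hypothetical production class under test; it delegates to a tokenizer."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def count(self, text):
        return len(self.tokenizer.split(text))


def smelly_word_counter_test():
    # Indirect Testing: a WordCounter test that actually verifies Tokenizer
    # behavior through the object WordCounter holds.
    counter = WordCounter(Tokenizer())
    return counter.tokenizer.split("a b  c") == ["a", "b", "c"]


def refactored_word_counter_test():
    # Only WordCounter is exercised; the tokenizer is replaced by a stub.
    class StubTokenizer:
        def split(self, text):
            return ["x", "y"]

    return WordCounter(StubTokenizer()).count("ignored") == 2
```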
Test Run War: a test that allocates
resources that are also used by other
methods, possibly causing interference
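A hedged Python sketch of Test Run War, with hypothetical names throughout. The smelly test writes to a file path shared by every run, so parallel runs can overwrite each other's data. The refactored test allocates a private temporary directory per run. `run_concurrently` is a helper added here only to show the refactored test surviving parallel execution.

```python
import os
import tempfile
import threading


def write_then_read(path, value):
    """Write a value to a file and read it back, as a test fixture might."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(value)
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def smelly_test(value, shared_path):
    # Test Run War: every run uses the same file, so two runs executing at
    # the same time can overwrite each other's data.
    return write_then_read(shared_path, value) == value


def refactored_test(value):
    # Each run allocates its own temporary directory, so runs cannot interfere.
    with tempfile.TemporaryDirectory() as workdir:
        return write_then_read(os.path.join(workdir, "data.txt"), value) == value


def run_concurrently(test, values):
    """Run one test invocation per value in parallel threads; collect results."""
    results = [None] * len(values)

    def worker(index, value):
        results[index] = test(value)

    threads = [
        threading.Thread(target=worker, args=(i, v)) for i, v in enumerate(values)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Calling `run_concurrently(refactored_test, ["a", "b", "c"])` should yield only passing results, whereas the smelly variant gives no such guarantee once runs overlap.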
Research Questions
What are the causes of test flakiness?
To what extent can flaky tests be
explained by the presence of test smells?
To what extent does refactoring of test
smells help in removing flaky tests?
Software Projects
18 open-source systems randomly
selected from GitHub
Detecting Test Smells
Bavota et al.

“Are Test Smells Really Harmful? An Empirical Study”

EMSE
Palomba et al.

“On the Diffuseness of Test Smells in
Automatically Generated Test Code”

SBST 2016
The detector exploits the definition of
the smells to detect them
The detector has a precision of 88%
and a recall of 100%
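For readers unfamiliar with the two metrics, a short worked example in Python. The counts below are hypothetical and chosen only so that the formulas reproduce the reported percentages; they are not the study's data.

```python
def precision(true_positives, false_positives):
    """Fraction of reported smells that are real smells."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Fraction of real smells that the detector reports."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical counts, chosen only to reproduce the reported percentages.
tp, fp, fn = 88, 12, 0
print(f"precision = {precision(tp, fp):.0%}")  # precision = 88%
print(f"recall = {recall(tp, fn):.0%}")        # recall = 100%
```

A recall of 100% means no smelly test is missed, while a precision of 88% means some reported instances are false alarms.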
Detecting Flaky Tests
We ran each JUnit class ten times
A test method was considered flaky if its output
differed at least once
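The rerun strategy can be sketched in Python as follows. This is an illustrative stand-in, not the study's tooling, which ran JUnit classes. `detect_flaky` and the two sample tests are hypothetical, and the unstable test uses a counter to imitate nondeterminism in a reproducible way.

```python
import itertools


def detect_flaky(tests, runs=10):
    """Run each test `runs` times; flag those whose outcome is not constant.

    `tests` maps a test name to a zero-argument callable returning True (pass)
    or False (fail). An exception is counted as a failure.
    """
    flaky = []
    for name, test in tests.items():
        outcomes = set()
        for _ in range(runs):
            try:
                outcomes.add(bool(test()))
            except Exception:
                outcomes.add(False)
        if len(outcomes) > 1:
            flaky.append(name)
    return flaky


# Hypothetical tests: one stable, one that fails on every third call.
counter = itertools.count(1)
tests = {
    "stable_test": lambda: True,
    "unstable_test": lambda: next(counter) % 3 != 0,
}
print(detect_flaky(tests))  # ['unstable_test']
```

A fixed number of reruns can only show that a test is flaky; a test that passes ten times in a row may still fail on a later run.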
Causes of Test Flakiness
We manually mapped each flaky test to one of the
10 root causes defined in the taxonomy by Luo et al.
Luo et al.

“An Empirical Analysis of Flaky Tests”

FSE 2014
Inputs to the manual analysis: the source code of the test (Java) and the exceptions thrown (log)
Causes of Test Flakiness
Async Wait (27%): a test method makes an asynchronous call and does not wait for the result of the call
IO issue (22%): the test method does not properly acquire or release one or more of its resources
Concurrency (17%): different threads interact in an undesirable manner
Test Ordering (11%): ordering of test execution
Network (10%): network performance
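As an illustration of the most frequent cause, Async Wait, here is a minimal Python sketch. The `Job` class and both test functions are hypothetical. The smelly test reads the result without waiting for the asynchronous call, so its outcome depends on thread timing. The fixed test blocks on a completion signal before checking.

```python
import threading
import time


class Job:
    """Hypothetical asynchronous job that publishes its result from a thread."""

    def __init__(self, delay=0.05):
        self.result = None
        self.done = threading.Event()
        self._thread = threading.Thread(target=self._work, args=(delay,))

    def _work(self, delay):
        time.sleep(delay)  # stands in for real asynchronous work
        self.result = 42
        self.done.set()

    def start(self):
        self._thread.start()


def smelly_test():
    # Async Wait: reads the result without waiting for the call to finish,
    # so the outcome depends on thread timing.
    job = Job()
    job.start()
    outcome = job.result == 42
    job.done.wait()  # only lets the thread finish before returning
    return outcome


def fixed_test():
    # Waits (with a timeout) until the job signals completion, then checks.
    job = Job()
    job.start()
    finished = job.done.wait(timeout=5)
    return finished and job.result == 42
```

Waiting on an explicit completion signal is generally preferable to a fixed sleep, since a sleep only makes the race less likely rather than removing it.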
Test Smells vs Flaky Tests
[Diagram: Test Smells and Flaky Tests]
61%
How many of them are coincidental co-occurrences?
We manually identified the
flakiness-inducing test smells
A Resource Optimism smell was considered
causally related to a flaky test
if the flakiness was due
to issues in the management
of external resources
Test Smells vs Flaky Tests
54%
Resource Optimism: IO issue, Network
Indirect Testing: Test Ordering
Test Run War: Concurrency
The Role of Refactoring
We manually refactored according to the guidelines
defined by Van Deursen et al.
Van Deursen et al.

“Refactoring Test Code”

XP 2001
The Role of Refactoring
We re-ran the test smell detection
We re-ran the flaky test identification
100% of the refactored test code no longer presented a test smell
100% of the refactored test code no longer presented a flaky test
54% of the total flaky tests were removed by means of refactoring
Flaky Tests & Test Smells
Future Research Agenda
Extending the Study
Understanding whether refactoring of test
smells is actually adopted by developers
as a flaky test fixing strategy
Tools for Automated Refactoring of
Test Smells
