VST2022.pdf

Do Tests Generated by AI Help Developers?

Open Challenges, Applications and
Opportunities
Annibale Panichella, Ph.D.

a.panichella@tudelft.nl

@AnniPanic
1

About Me
Assistant Professor in Software Engineering at TU Delft

The CISE Lab
https://www.ciselab.nl
Dr. Annibale Panichella (Lab leader)
Dr. Pouria Derakhshanfar (Post-doc)
Imara van Dinten (Ph.D. Student)
Mitchell Olsthoorn (Ph.D. Student)
Leonhard Applis (Ph.D. Student)
Team of 10 M.Sc. students

My Research Interests
[Word cloud from my research papers]

Research Topics:

• Automated Test Generation

• Crash Replication

• Security Attack Generation

• SE for Cyber-Physical Systems

• Empirical Software Engineering

• Testing for AI-based systems

• …

Spot the Bug
int balance = 1000;

void decrease(int amount) {
    if (balance <= amount) {
        balance = balance - amount;
    } else {
        printf("Insufficient funds\n");
    }
}

void increase(int amount) {
    balance = balance + amount;
}

Spot the Bug
It should be balance >= amount

Spot the Bug
What if the amount is negative?

Spot the Bug
What if the sum is too large for an int?
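
The three issues from the "Spot the Bug" slides can be put together in one place. What follows is a minimal sketch, assuming a Java port of the snippet; the class name Account, the getter, and the choice to throw exceptions are illustrative assumptions, not part of the original slides.

```java
// Hypothetical Java port of the slide's snippet with the three issues addressed.
class Account {
    private int balance = 1000;

    int getBalance() {
        return balance;
    }

    void decrease(int amount) {
        if (amount < 0) {
            // A negative amount would silently increase the balance.
            throw new IllegalArgumentException("amount must not be negative");
        }
        if (balance >= amount) { // fixed: the slide's version used <=
            balance = balance - amount;
        } else {
            System.out.println("Insufficient funds");
        }
    }

    void increase(int amount) {
        if (amount < 0) {
            throw new IllegalArgumentException("amount must not be negative");
        }
        // Math.addExact throws ArithmeticException instead of wrapping around.
        balance = Math.addExact(balance, amount);
    }

    public static void main(String[] args) {
        Account account = new Account();
        account.decrease(400);
        System.out.println(account.getBalance());
    }
}
```

Rejecting negative amounts is only one possible policy; the point is that each question on the slides corresponds to an input that a test should exercise.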

Is That Easy?
Class = Pass2Verifier.java
Project = Apache Commons BCEL

Software Testing Is…
Slow, Painful, Boring (Zzz…), Necessary

AI for SE
Artificial Intelligence: Optimization, Search, Genetic Algorithms, Ant Colony, Machine Learning
Software Testing: Test Case, Coverage, Assertions, Failures, Bugs
AI-Based Software Engineering

The Master Algorithm
[P. Domingos 2015, "The Master Algorithm"]

Tribe            Origin              Master Algorithm
Symbolists       Logic, philosophy   Inverse deduction
Connectionists   Neuroscience        Back-Propagation
Evolutionary     Biology             Evolutionary Algorithms
Bayesian         Statistics          Probabilistic inference
Analogizers      Psychology          Kernel machines

The most used AI tribes in Software Testing

A (Very) Brief Historical Overview

Historical Overview
1968-69 First SE Conference (NATO 1968, NATO 1969)
1976 Test Data Generation (Symbolic execution), Ramamoorthy et al., IEEE TSE 1976

Symbolic Execution
Code Under Test:

int foo(int v) {
    return 2*v;
}

void method(int x, int y) {
    int z = foo(y);
    if (z == x)
        if (x > y+10)
            printf("Error");
}

Concrete State: x = 2, y = 1, z = 2
Symbolic State: x = x0, y = y0, z = 2*y0
Path Condition: 2*y0 = x0, x0 <= y0+10

Find y0 and x0 that solve these equations/paths
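
To reach the printf("Error") statement, the last condition of the path has to be negated: 2*y0 = x0 and x0 > y0+10. A real symbolic execution engine would hand this to a constraint solver. The sketch below is only a stand-in that enumerates a small integer range; the class name, method names, and the bound are assumptions made for illustration.

```java
import java.util.Optional;

// Brute-force stand-in for a constraint solver (illustration only).
class PathConditionSolver {

    // Path condition that reaches printf("Error"): z == x and x > y + 10, with z = 2 * y.
    static boolean reachesError(int x0, int y0) {
        int z = 2 * y0;
        return z == x0 && x0 > y0 + 10;
    }

    // Enumerates all pairs in [-bound, bound] and returns the first {x0, y0} that satisfies the path.
    static Optional<int[]> solve(int bound) {
        for (int y0 = -bound; y0 <= bound; y0++) {
            for (int x0 = -bound; x0 <= bound; x0++) {
                if (reachesError(x0, y0)) {
                    return Optional.of(new int[] {x0, y0});
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        solve(50).ifPresent(s -> System.out.println("x0 = " + s[0] + ", y0 = " + s[1]));
    }
}
```

Substituting x0 = 2*y0 into x0 > y0+10 gives y0 > 10, so the smallest solution is y0 = 11, x0 = 22. Enumeration does not scale, which is one reason exact solvers and path explosion matter in practice.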

Symbolic AI
• Symbolic AI is often called GOFAI (Good Old-Fashioned Artificial Intelligence) [J. Haugeland, 1985]

• The overall idea is that many aspects of intelligence can be
achieved by manipulation of “symbols” and symbolic solvers

• Pros: powerful and exact

• Cons:

• Not all problems can be modelled as symbolic equations

• Not all formulas can be solved with exact methods

• Path explosion problem in testing

Historical Overview
1982 Random Testing (Data generation)
D. L. Bird, C. U. Munoz, "Automatic generation of random self-checking test cases", IBM Systems Journal

Random Testing
Program Under Test:

class Triangle {
    int a, b, c; // sides
    String type = "NOT_TRIANGLE";

    void computeTriangleType(int a, int b, int c) {
        if (a == b) {
            if (b == c)
                type = "EQUILATERAL";
            else
                type = "ISOSCELES";
        } else {
            if (a == c) {
                type = "ISOSCELES";
            } else {
                if (b == c)
                    type = "ISOSCELES";
                else
                    type = "SCALENE";
            }
        }
    }
}

The simplest fuzzer:

public class TestDataGenerator {
    static int lowerBound = -100;
    static int upperBound = 100;

    // nData is the number of inputs to generate
    static int[] generate(int nData) {
        int[] data = new int[nData];
        for (int i = 0; i < nData; i++) {
            double value = lowerBound + Math.random()
                    * (upperBound - lowerBound);
            data[i] = (int) Math.round(value);
        }
        return data;
    }
}

Parameters: Number of inputs, Upper and Lower Bounds

Output:
[-66, 59, -8] [91, 43, 36]
[51, -76, -62]
[74, 66, -40] …

It is fast and useful, but it does not generate complete test cases (only the test input)!
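
To see what random inputs actually exercise, the fuzzer can be connected to the triangle program. The following is a sketch under assumptions: classify is a compact restatement of the branching logic above, the class and method names are hypothetical, and a seeded java.util.Random replaces Math.random() so that runs are repeatable.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Feeds random triples to the triangle classification and counts the outcomes.
class RandomTriangleTesting {

    // Same branching logic as computeTriangleType, returning the type directly.
    static String classify(int a, int b, int c) {
        if (a == b) {
            return (b == c) ? "EQUILATERAL" : "ISOSCELES";
        }
        if (a == c || b == c) {
            return "ISOSCELES";
        }
        return "SCALENE";
    }

    // Runs the given number of random trials with sides drawn uniformly from [-100, 100].
    static Map<String, Integer> run(int trials, long seed) {
        Random rnd = new Random(seed);
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < trials; i++) {
            int a = rnd.nextInt(201) - 100;
            int b = rnd.nextInt(201) - 100;
            int c = rnd.nextInt(201) - 100;
            counts.merge(classify(a, b, c), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run(10_000, 42L));
    }
}
```

Three independent values rarely coincide, so almost every triple lands in the SCALENE branch and the EQUILATERAL branch is hit with probability 1/201² per trial. This is the usual motivation for guided search over purely random data generation.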

Historical Overview
1982 Random Testing (Data generation)
1990 Numerical Optimization (Data generation), IEEE TSE
1999 Genetic Algorithms (Data generation), Pargas et al.

The Test Case Generation Era
2007 Randoop - Random Testing (Test Case Generation)
2011 EvoSuite - Genetic Algorithm
P. Tonella, ISSTA 2004

Test Case Generation
Program Under Test: the Triangle class from the Random Testing slide

Generated Test Suite (produced by the AI):

@Test
public void testTriangle_invalid1() {
    assertEquals(Triangle2.Type.INVALID,
                 Triangle2.triangle(0,0,0));
}

@Test
public void testTriangle_equilateral() {
    assertEquals(Triangle2.Type.EQUILATERAL, ...);
}

@Test
public void testTriangle_isoscele() {
    assertEquals(Triangle2.Type.ISOSCELES, ...);
}

@Test
public void testTriangle_scalene() {
    assertEquals(Triangle2.Type.SCALENE, ...);
}
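
The step from test data to a complete test case is the assertion. One common way to obtain it automatically is to run the program and record what it returns. The sketch below illustrates that idea only; it is not how Randoop or EvoSuite are implemented, and the names RegressionTestWriter, writeTest, and classify are hypothetical.

```java
// Turns an input triple into the source text of a JUnit-style test with a recorded assertion.
class RegressionTestWriter {

    // Compact restatement of the triangle classification used as program under test.
    static String classify(int a, int b, int c) {
        if (a == b) {
            return (b == c) ? "EQUILATERAL" : "ISOSCELES";
        }
        if (a == c || b == c) {
            return "ISOSCELES";
        }
        return "SCALENE";
    }

    // Executes the program on the inputs and emits a test asserting the observed output.
    static String writeTest(int id, int a, int b, int c) {
        String observed = classify(a, b, c);
        return "@Test\n"
                + "public void test" + id + "() {\n"
                + "    assertEquals(\"" + observed + "\", classify(" + a + ", " + b + ", " + c + "));\n"
                + "}\n";
    }

    public static void main(String[] args) {
        System.out.print(writeTest(0, 3, 3, 3));
        System.out.print(writeTest(1, 0, 0, 0));
    }
}
```

The recorded assertion captures what the program does, not what it should do. For the input (0, 0, 0) the emitted test asserts "EQUILATERAL", so a developer still has to decide whether that expected value is correct.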

The Test Case Generation Era
2007 Randoop - Random Testing
2011 EvoSuite - Genetic Algorithm
2013 SBST tool competition
2015 Many-objective GAs (A. Panichella et al., ICST 2015; A. Panichella et al., TSE 2018)

Many-objective evolutionary algorithms outperform state-of-the-art test case generation algorithms
Nowadays, many-objective algorithms are the core engine of many existing state-of-the-art tools (see next slide)

Some Existing Tools…
26
BOTSING

The Good…
[Diagram: the test suite generated for the Program Under Test is the Output handed to the Developer]

The Good…
EMSE 2015: EvoSuite finds 1600 unknown bugs in 100 projects
ICSE-SEIP 2017: EvoSuite detects 56.40% of the bugs on an industrial project
SBST - Tool Competition 2017: Generated tests achieve better coverage than manually-written tests

The Good…
[Bar chart: frequency (0.00 to 0.90) of test smells (Eager Tests, Assertion Roulette, Indirect Testing, Sensitive Equality) in manually-written tests vs. generated tests]
Automatically generated test cases are shorter and contain fewer test smells than their manually-written counterparts
A. Panichella et al., "Test Smells 20 Years Later: Detectability, Validity, and Reliability", under review in EMSE
A. Panichella et al., "Revisiting test smells in automatically generated tests: limitations, pitfalls, and opportunities", ICSME 2020

Are Generated Tests Readable?
G. Fraser et al., "Does Automated Unit Test Generation Really Help Software Testers? A Controlled Empirical Study", TOSEM 2015.
[Chart: Testing, Comprehension, and Testing time, on a 0% to 100% scale]
Generated tests achieve higher structural coverage than manually created test suites.
They do not help developers find faults more quickly if developers have to manually validate the tests.

G. Fraser et al., "Does Automated Unit Test Generation Really Help Software Testers? A Controlled Empirical Study", TOSEM 2015.
There is no difference in the efficiency of debugging when it is supported either by manual or EvoSuite test cases

Live Demo

How to make the best of AI-based testing?

My Personal View
Test cases generated by AI methods do find many crashes and runtime exceptions
Generating effective (bug-detecting) tests is not the end of the story
Focus on application domains that are too hard and complex to test by hand
Easy-to-validate oracle
Generate documentation
Successful results

Let me show some examples…
37

Generating Test Documentation
TestDescriber

Generated test + Documentation
S. Panichella, A. Panichella, M. Beller, A. Zaidman, H.C. Gall. "The Impact of Test Case Summaries on Bug Fixing Performance: An Empirical Investigation". ICSE 2016

Main Steps in TestDescriber:

Production Code:

public class Option {
    public Option(String opt, String longOpt,
                  boolean hasArg, String descr)
            throws IllegalArgumentException {
        OptionValidator.validateOption(opt);
        this.opt = opt;
        this.longOpt = longOpt;
        if (hasArg) {
            this.numberOfArgs = 1;
        }
        this.description = descr;
    }
    ...
}

1. Select the covered statements

Covered Code:

        this.opt = opt;
        this.longOpt = longOpt;
        if (hasArg) {
            this.numberOfArgs = 1;
        }
        this.description = descr;
    }
    ...
}

2. Filter out Java keywords, etc.

Covered Code:

        this opt = opt;
        this longOpt = longOpt;
        if (hasArg) {false
        }
        this description = descr;
    }
    ...
}

3. Identifier Splitting (Camel case)

Covered Code:

    public Option(String opt, String long Opt,
                  boolean has Arg, String descr)
        Option Validator.validate Option(opt);
        this opt = opt;
        this long Opt = long Opt;
        if (has Arg) {false
        }
        this description = descr;
    }
    ...
}

4. Abbreviation Expansion (using external vocabularies)

Covered Code:

    public Option(String option, String long Option,
                  boolean has Argument, String description)
        Option Validator.validate Option(option);
        this option = option;
        this long Option = long Option;
        if (has Argument) {false
        }
        this description = description;
    }
    ...
}
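
Steps 3 and 4 can be sketched in a few lines. This is a minimal illustration, not TestDescriber's implementation: the class name, the regular expression, and the three-entry vocabulary are assumptions, whereas the tool relies on external vocabularies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Splits camel-case identifiers and expands known abbreviations.
class IdentifierNormalizer {

    // Hypothetical mini vocabulary standing in for an external one.
    static final Map<String, String> VOCABULARY =
            Map.of("opt", "option", "arg", "argument", "descr", "description");

    // "longOpt" -> ["long", "Opt"]: split where a lower-case letter or digit meets an upper-case letter.
    static List<String> splitCamelCase(String identifier) {
        return Arrays.asList(identifier.split("(?<=[a-z0-9])(?=[A-Z])"));
    }

    // Replaces a word with its expansion when the vocabulary knows it, keeping the initial's case.
    static String expand(String word) {
        if (word.isEmpty()) {
            return word;
        }
        String full = VOCABULARY.get(word.toLowerCase());
        if (full == null) {
            return word;
        }
        return Character.isUpperCase(word.charAt(0))
                ? Character.toUpperCase(full.charAt(0)) + full.substring(1)
                : full;
    }

    // Applies both steps and joins the words with spaces.
    static String normalize(String identifier) {
        List<String> words = new ArrayList<>();
        for (String word : splitCamelCase(identifier)) {
            words.add(expand(word));
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        System.out.println(normalize("longOpt"));
        System.out.println(normalize("hasArg"));
    }
}
```

With this vocabulary, longOpt becomes "long Option" and hasArg becomes "has Argument", matching the covered code shown for step 4.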

5. Part-of-Speech tagger

Covered Code (each token is tagged as NOUN, VERB, ADJ, or CON):

    Option(String option, String long Option,
        throws IllegalArgumentException
        this option = option;
        if (has Argument false
        }
    }

Natural Language Sentences:

The test case instantiates an "Option" with:
- option equal to "..."
- long option equal to "..."
- it has no argument
- description equal to "…"
An option-validator validates the instantiated object
The test asserts the following condition:
- "Option" has no argument

How Do Test Case Summaries Impact the Number of Bugs Fixed by Developers?
Participants WITHOUT TestDescriber summaries fixed 40% of bugs
Participants WITH TestDescriber summaries fixed 60%-80% of bugs
Both groups had 45 minutes to fix each class

Test Documentation and Comprehension
[Chart: Perceived test comprehensibility WITH and WITHOUT TestDescriber summaries]
Without Summaries:

• Only 15% of participants considered the test cases "easy to understand".

• 40% of participants considered the test cases incomprehensible.

With Summaries:

• 46% of participants considered the test cases "easy to understand".

• Only 18% of participants considered the test cases incomprehensible.

Test Documentation and Comprehension
Roy et al., ASE 2020
Follow-up work that uses Deep Learning to post-process generated tests
DeepTC-Enhancer generates:

• Test Documentation

• Test Method Names

• Variable Names

Easy to validate test oracle
49

An Example
https://issues.apache.org/jira/browse/COLLECTIONS-70
Created in June 2005
Solved in January 2006
Major Bug for Apache Commons Collections
A test case is always needed to help debugging

The Botsing Project
Target Crash:
Bug Name: ACC-70
Library: Apache Commons Collections

Exception in thread "main" java.lang.NullPointerException
    at org.apache.commons.collections.list.TreeList$TreeListIterator.previous(TreeList.java:841)
    at java.util.Collections.get(Unknown Source)
    at java.util.Collections.iteratorBinarySearch(Unknown Source)
    at java.util.Collections.binarySearch(Unknown Source)
    at utils.queue.QueueSorted.put(QueueSorted.java:51)
    at framework.search.GraphSearch.solve(GraphSearch.java:53)
    at search.informed.BestFirstSearch.solve(BestFirstSearch.java:20)
    at Hlavni.main(Hlavni.java:66)

Buggy Code:

public Object previous() {
    ...
    if (next == null) {
        next = parent.root.get(nextIndex - 1);
    } else {
        next = next.previous();
    }
    Object value = next.getValue();
    ...
}

If "parent" is null, this code triggers an exception

Test generated by BOTSING:

public void test0() throws Throwable {
    TreeList treeList0 = new TreeList();
    treeList0.add((Object) null);
    TreeList.TreeListIterator treeList_TreeListIterator0 =
        new TreeList.TreeListIterator(treeList0, 732);
    // Undeclared exception!
    treeList_TreeListIterator0.previous();
}

The Botsing Project

[Diagram: an evolutionary algorithm cycling through Initial Tests, Test Execution, Test Case Selection, and Variants Generation]

Test quality is measured using the "distance" between the target and the generated stack traces

Target Stack Trace:
java.lang.IllegalArgumentException:
    org.apache.commons.collections.map.AbstractHashedMap.<init>(AbstractHashedMap.java:142)
    org.apache.commons.collections.map.AbstractLinkedMap.<init>(AbstractLinkedMap.java:95)
    org.apache.commons.collections.map.LinkedMap.<init>(LinkedMap.java:78)
    org.apache.commons.collections.map.TransformedMap.transformMap(TransformedMap.java:153)
    org.apache.commons.collections.map.TransformedMap.putAll(TransformedMap.java:190)

Produced Stack Trace:
java.lang.IllegalArgumentException:
    org.apache.commons.collections.map.AbstractLinkedMap.<init>(AbstractLinkedMap.java:31)
    org.apache.commons.collections.map.LinkedMap.<init>(LinkedMap.java:72)
    org.apache.commons.collections.map.TransformedMap.transformMap(TransformedMap.java:148)
    org.apache.commons.collections.map.TransformedMap.putAll(TransformedMap.java:190)
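
One simple way to turn two stack traces into a number is to count how many frames agree. The sketch below is a deliberately simplified distance and not Botsing's actual fitness function; the class name, the frame encoding as "Class.method:line" strings, and the normalization are assumptions.

```java
import java.util.Arrays;
import java.util.List;

// Simplified distance between a target and a produced stack trace (0.0 means identical).
class StackTraceDistance {

    // Frames are compared position by position, starting from the outermost caller (last element).
    static double distance(List<String> target, List<String> produced) {
        int longest = Math.max(target.size(), produced.size());
        if (longest == 0) {
            return 0.0;
        }
        int shortest = Math.min(target.size(), produced.size());
        int matched = 0;
        for (int i = 1; i <= shortest; i++) {
            String t = target.get(target.size() - i);
            String p = produced.get(produced.size() - i);
            if (t.equals(p)) {
                matched++;
            }
        }
        return 1.0 - (double) matched / longest;
    }

    public static void main(String[] args) {
        List<String> target = Arrays.asList(
                "AbstractHashedMap.<init>:142", "AbstractLinkedMap.<init>:95",
                "LinkedMap.<init>:78", "TransformedMap.transformMap:153",
                "TransformedMap.putAll:190");
        List<String> produced = Arrays.asList(
                "AbstractLinkedMap.<init>:31", "LinkedMap.<init>:72",
                "TransformedMap.transformMap:148", "TransformedMap.putAll:190");
        System.out.println(distance(target, produced));
    }
}
```

For the two traces above only the putAll frame agrees in both method and line, so one of five frames matches. A search algorithm would prefer tests whose traces reduce this value towards zero.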

Do Generated Tests Help Developers?
[Chart: time to fix a bug (in s), With Botsing vs. Without Botsing]
Generated tests help developers fix bugs significantly faster
Generated tests help developers locate bugs significantly faster
"Search-based crash reproduction and its impact on debugging", TSE 2018
https://github.com/STAMP-project/botsing

Testing for Complex Systems (Testing the Untestable)

Advanced Driver Assistance Systems (ADAS)
Traffic Sign Recognition (TSR)
Pedestrian Protection (PP)
Lane Departure Warning (LDW)
Automated Emergency Braking (AEB)

Feature Interactions
[Diagram: several pipelines, each going from Sensors / Camera through an Autonomous Feature to the Actuator]
Braking (over time): 30% 20% … 80%
Acceleration (over time): 60% 10% … 20%
Steering (over time): 30% 20% … 80%

Feature Interactions
[Diagram: the same pipelines (Sensors / Camera, Autonomous Feature, Actuator), all issuing commands]
Priority?

Integration Components
Features: Pedestrian Protection (PP), Autom. Emerg. Braking (AEB), Lane Dep. Warning (LDW)
The integration is a rule set: each condition checks a specific feature interaction situation and resolves potential conflicts that may arise under that condition

Integration Components
[Paper excerpt, ICSE '20, May 23-29, 2020, Seoul, South Korea. Figure 4: A decision tree diagram representing integration rules 1, 2 and 4 in Figure 3.]

Table 1: Safety requirements for AutoDrive.
Feature | Requirement
PP | The PP system shall avoid collision with pedestrians by initiating emergency braking in case of impending collision with pedestrians.
TSR | The TSR system shall stop the vehicle at the stop sign by initiating a full braking when a stop sign is detected.
AEB | The AEB system shall avoid collision with vehicles by initiating emergency braking in case of impending collision with vehicles.
ACC | The ACC system shall respect the safety distance by keeping the vehicle …

Condition Template: ⟨if op1 operator threshold⟩
Simplified Example: speed(t) < speedLeadingCar(t) (t is the time stamp)
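
A rule set built from the condition template can be sketched as an ordered list of conditions, where the first condition that holds decides which feature's command reaches the actuator. Everything below is hypothetical: the class name, the supported operators, and which feature a rule prioritizes are assumptions for illustration, not the rules of the case study systems.

```java
import java.util.Arrays;
import java.util.List;

// One integration rule following the template <if op1 operator threshold>.
class IntegrationRule {
    final String operator;
    final double threshold;
    final String prioritizedFeature; // feature whose command wins when the condition holds

    IntegrationRule(String operator, double threshold, String prioritizedFeature) {
        this.operator = operator;
        this.threshold = threshold;
        this.prioritizedFeature = prioritizedFeature;
    }

    // Evaluates "op1 operator threshold" for the current value of op1.
    boolean holds(double op1) {
        switch (operator) {
            case "<":  return op1 < threshold;
            case "<=": return op1 <= threshold;
            case ">":  return op1 > threshold;
            case ">=": return op1 >= threshold;
            default:   throw new IllegalArgumentException("unknown operator: " + operator);
        }
    }

    // The first rule whose condition holds decides; otherwise the fallback feature is used.
    static String resolve(List<IntegrationRule> rules, double op1, String fallback) {
        for (IntegrationRule rule : rules) {
            if (rule.holds(op1)) {
                return rule.prioritizedFeature;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        double speedLeadingCar = 50.0;
        List<IntegrationRule> rules = Arrays.asList(new IntegrationRule("<", speedLeadingCar, "ACC"));
        System.out.println(resolve(rules, 40.0, "AEB"));
    }
}
```

Here speed(t) < speedLeadingCar(t) is encoded with op1 = speed(t) and the threshold set to the leading car's speed at the same time stamp. Faults in such rules are exactly what feature interaction testing tries to expose.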

Testing Automated Driving Systems
Testing on-the-road
Simulation-based Testing

Test Inputs
Environment:
- Position and speed
- Road Shape
- Traffic lights position and status
- Weather
Ego Car:
- Initial Position
- Initial Speed
Car Under Test:
- Initial Position
- Initial Speed

Feature Interactions Failures
[Illustration labels: Stop, Min Distance, 50Km]

AI-Based Testing
[Diagram: an evolutionary algorithm cycling through Initial Tests, Test Execution, Test Case Selection, and Variants Generation]

Test Execution: each test (Test 1, Test 2, …) is run, and we record the minimum distance within the simulation time window
• Results of Test 1: 2m
• Results of Test 2: 1m
• Results of Test 3: 1.5m

Test Case Selection: the best test case is the one closest to violating the safe distance (fitness)

Variants Generation: Test 2 is changed through Mutation and/or Crossover
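
The cycle of execution, selection, and variant generation can be written down compactly. The sketch below makes strong simplifying assumptions: a test is a single number (an initial speed between 0 and 100), simulateMinDistance is a made-up formula standing in for a driving simulator, and only mutation is used. None of these names or values come from the original work.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Minimal evolutionary search that minimizes the minimum distance observed in a (fake) simulation.
class DistanceSearch {

    // Hypothetical stand-in for the simulator: returns the minimum distance (in m) for a test input.
    static double simulateMinDistance(double initialSpeed) {
        return Math.abs(initialSpeed - 80.0) / 10.0;
    }

    // Returns the best test input found after the given number of generations.
    static double search(int populationSize, int generations, long seed) {
        Random rnd = new Random(seed);
        Comparator<Double> byFitness =
                Comparator.comparingDouble((Double speed) -> simulateMinDistance(speed));

        // Initial tests: random speeds in [0, 100).
        List<Double> population = new ArrayList<>();
        for (int i = 0; i < populationSize; i++) {
            population.add(rnd.nextDouble() * 100.0);
        }

        for (int g = 0; g < generations; g++) {
            // Test execution and selection: keep the half closest to a violation.
            population.sort(byFitness);
            List<Double> parents =
                    new ArrayList<>(population.subList(0, Math.max(1, populationSize / 2)));

            // Variants generation: mutate each selected test with Gaussian noise.
            List<Double> next = new ArrayList<>(parents);
            for (Double parent : parents) {
                double child = parent + rnd.nextGaussian() * 5.0;
                next.add(Math.min(100.0, Math.max(0.0, child)));
            }
            population = next;
        }
        population.sort(byFitness);
        return population.get(0);
    }

    public static void main(String[] args) {
        double best = search(10, 50, 42L);
        System.out.println(best + " -> " + simulateMinDistance(best));
    }
}
```

Because the selected parents are carried over unchanged, the best minimum distance never gets worse from one generation to the next. In a real setting every fitness evaluation is a full simulation run, which is why the number of executions is the main cost.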

Search Objectives
TABLE I: Safety requirements and failure distance functions for SafeDrive.
Feature | Requirement | Failure distance functions ($FD_1, \ldots, FD_5$)
PP | No collision with pedestrians | $FD_1(i)$ is the distance between the ego car and the pedestrian at step $i$.
AEB | No collision with cars | $FD_2(i)$ is the distance between the ego car and the leading car at step $i$.
TSR | Stop at a stop sign | Let $u(i)$ be the speed of the ego car at time step $i$ if a stop sign is detected, and let $u(i) = 0$ if there is no stop sign. We define $FD_3(i) = 0$ if $u(i) \geq 5$ km/h; $FD_3(i) = \frac{1}{u(i)}$ if $u(i) \neq 0$; and otherwise, $FD_3(i) = 1$.
TSR | Respect the speed limit | Let $u'(i)$ be the difference between the speed of the ego car and the speed limit at step $i$ if a speed-limit sign is detected, and let $u'(i) = 0$ if there is no speed-limit sign. We define $FD_4(i) = 0$ if $u'(i) \geq 10$ km/h; $FD_4(i) = \frac{1}{u'(i)}$ if $u'(i) \neq 0$; and otherwise, $FD_4(i) = 1$.
ACC | Respect the safety distance | $FD_5(i)$ is the absolute difference between the safety distance $sd$ and $FD_2(i)$.

Hybrid Test Objectives: Our test objectives aim to guide the test generation process towards test inputs that reveal undesired feature interactions.

• For each safety requirement, we measure the distance to fail that requirement during the simulation

• The problem is inherently many-objective
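
Two of the failure distance functions from the table translate directly into code. This is a sketch: the class and method names are hypothetical, the comparison in the first case of FD3 is assumed to be "greater than or equal" (the symbol did not survive on the slide), and taking the minimum over simulation steps is one plausible aggregation rather than a confirmed detail.

```java
// Failure distance functions for the stop-sign and safety-distance requirements.
class FailureDistances {

    // FD3: u is the ego car's speed (km/h) while a stop sign is detected, or 0 without a stop sign.
    static double fd3(double u) {
        if (u >= 5.0) {
            return 0.0;      // requirement violated: the car did not stop
        }
        if (u != 0.0) {
            return 1.0 / u;
        }
        return 1.0;
    }

    // FD5: absolute difference between the safety distance sd and the distance to the leading car.
    static double fd5(double safetyDistance, double distanceToLeadingCar) {
        return Math.abs(safetyDistance - distanceToLeadingCar);
    }

    // Smallest value observed over all simulation steps (requires at least one step).
    static double minOverSteps(double[] perStep) {
        double min = perStep[0];
        for (double value : perStep) {
            min = Math.min(min, value);
        }
        return min;
    }

    public static void main(String[] args) {
        System.out.println(fd3(6.0));
        System.out.println(minOverSteps(new double[] {2.0, 1.0, 1.5}));
    }
}
```

Each requirement contributes its own distance, so a single test run yields a vector of values rather than one score.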

Case Study
• Two case study systems from IEE (industrial partner)

• Designed by experts

• Manually tested for more than six months

• Different rules to integrate feature actuator commands

• Both systems consist of four self-driving features

• Adaptive Cruise Control (ACC)

• Automated Emergency Braking (AEB)

• Traffic Sign Recognition (TSR)

• Pedestrian Protection (PP)

Many-Objective Search in Action
[Chart at 0 min: distance to a failure (0 to 4) for F1 to F10, the different types of (potential) failures with feature interactions]
A zero distance means we found a test that exposes the failure

Some Results
[Chart: number of discovered failures (0 to 10) over 12 h, Coverage-based Fuzzing vs. Many-objective search]

Feedback From Domain Experts
• The failures we found were due to undesired feature interactions

• The failures were not previously known to the experts (new tests for regression testing)

• We identified ways to improve the feature interaction logic to avoid such failures
