Software Verification & Validation Lab (SVV), uni.lu
Achieving Scalability in Software Testing
with Machine Learning and
Metaheuristic Search
Lionel Briand
RAISE @ ICSE 2018
Scope
• The main challenge in testing software systems is
scalability
• Addressing scalability entails effective automation
• Lessons learned from industrial research collaborations:
satellite, automotive, finance, energy …
• Experiences from combining metaheuristic search,
machine learning, and other AI techniques to address
testing scalability
2
Scalability
• The extent to which a technique can be applied to large or
complex artifacts (e.g., input spaces, code, models) and still
provide useful, automated support with acceptable effort, CPU
time, and memory
3
Collaborative Research @ SnT
4
• Research in context
• Addresses actual needs
• Well-defined problem
• Long-term collaborations
• Our lab is the industry
SVV Dept.
5
• Established in 2012, part of the SnT centre
• Requirements Engineering, Security Analysis, Design Verification,
Automated Testing, Runtime Monitoring
• ~ 25 lab members
• Partnerships
• ERC Advanced grant
Outline
• Overview, problem definition
• Example research projects with industry partners:
• Testing advanced driver assistance systems
• Testing controllers (automotive)
• Stress testing critical task deadlines (Energy)
• Vulnerability testing (Banking)
• Reflections and lessons learned
6
Introduction
7
Software Testing
8
[Flowchart: derive test cases from the SW Representation (e.g., specifications);
execute them against the SW Code to get test results; a Test Oracle compares the
results with the expected results or properties, branching on
[Test Result == Oracle] vs. [Test Result != Oracle].]
Automation!
Search-Based Software Testing
• Express test generation problem
as a search problem
• Search for test input data with
certain properties, i.e.,
constraints
• Non-linearity of software (if,
loops, …): complex,
discontinuous, non-linear
search spaces (Baresel)
• Many search algorithms
(metaheuristics), from local
search to global search, e.g.,
Hill Climbing, Simulated
Annealing and Genetic
Algorithms
[Excerpt from McMinn's paper, clipped in extraction: Hill Climbing evaluates the
neighbourhood of the current candidate solution and moves to better candidates until
the neighbourhood offers no better ones, i.e., a local optimum; the search may then
benefit from a restart at a new position (Figure 3: climbing to a local optimum; the
final position may not be the global optimum (a), and restarts may be required (b)).
Simulated Annealing may also move to points of lower fitness, with a probability
governed by a decreasing 'temperature', to escape local optima; at 'freezing point'
it behaves like Hill Climbing (Figure 4: Simulated Annealing may temporarily move to
points of poorer fitness in the search space). Genetic Algorithms are global
searches, sampling many points in the fitness landscape at once (Figure 5).]
“Search-Based Software Testing: Past, Present and Future”
Phil McMinn
Genetic Algorithm
9
[Further clipped excerpt: the simplest form of optimization algorithm is random
search, in which inputs are generated at random until the goal (e.g., coverage of a
particular program branch) is fulfilled. Random search performs very poorly when the
required solutions occupy a very small portion of the overall search space
(Figure 2: random search may fail to fulfil low-probability test goals).
Metaheuristic searches instead guide the search with a problem-specific fitness
function that scores points in the search space by their suitability for solving
the problem.]
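To make the search formulation concrete, here is a minimal sketch of hill climbing with random restarts over a one-dimensional input domain; the quadratic fitness function is a hypothetical stand-in for a real test objective such as branch distance:

```python
import random

def fitness(x):
    # Hypothetical objective with a peak at x = 3. In SBST this would score
    # how close an input comes to the test goal (e.g., branch distance).
    return -(x - 3.0) ** 2

def hill_climb(lo, hi, restarts=10, steps=1000, step_size=0.1):
    best_x, best_f = None, float("-inf")
    for _ in range(restarts):                    # restarts help escape local optima
        x = random.uniform(lo, hi)
        for _ in range(steps):
            neighbour = min(hi, max(lo, x + random.uniform(-step_size, step_size)))
            if fitness(neighbour) > fitness(x):  # greedy move to a better neighbour
                x = neighbour
        if fitness(x) > best_f:
            best_x, best_f = x, fitness(x)
    return best_x, best_f

print(hill_climb(-10.0, 10.0))
```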
Testing Advanced Driver
Assistance Systems
10
Cyber-Physical Systems
• A system of collaborating computational elements controlling
physical entities
11
Advanced Driver Assistance
Systems (ADAS)
12
Automated Emergency Braking (AEB)
Pedestrian Protection (PP)
Lane Departure Warning (LDW)
Traffic Sign Recognition (TSR)
Automotive Environment
• Highly varied environments, e.g., road topology, weather, building
and pedestrians …
• Huge number of possible scenarios, e.g., determined by
trajectories of pedestrians and cars
• ADAS play an increasingly critical role
• A challenge for testing
13
Advanced Driver Assistance
Systems (ADAS)
Decisions are made over time based on sensor data
14
[Diagram: the ADAS senses the Environment through Sensors/Camera, the Controller
makes a Decision, and Actuators act back on the environment, closing the loop.]
A General and Fundamental Shift
• Increasingly so, it is easier to learn behavior from data using
machine learning, rather than specify and code
• Deep learning, reinforcement learning …
• Example: Neural networks (deep learning)
• Millions of weights learned
• No explicit code, no specifications
• Verification, testing?
15
CPS Development Process
16
Model-in-the-Loop Stage:
• System engineering modeling (SysML): architecture modelling (structure,
behavior, traceability)
• Functional modeling (continuous and discrete Simulink models): controllers,
plant, decision; model simulation and testing
• Analysis: model execution and testing, model-based testing, traceability and
change impact analysis, ...
Software-in-the-Loop Stage:
• (Partial) code generation; deployed executables on target platform
Hardware-in-the-Loop Stage:
• Hardware (sensors ...), analog simulators; testing (expensive)
Our Goal
• Developing an automated testing technique
for ADAS
18
• To help engineers efficiently and
effectively explore the complex test input
space of ADAS
• To identify critical (failure-revealing) test
scenarios
• To characterize the input conditions that
lead to the most critical situations, e.g.,
safety violations
Automated Emergency Braking
System (AEB)
19
[Diagram: the Vision (Camera) and Sensor components provide objects'
position/speed to the Decision making component, which issues a "brake-request"
to the Brake Controller when braking is needed to avoid collisions.]
Example Critical Situation
• “AEB properly detects a pedestrian in front of the car with a
high degree of certainty and applies braking, but an accident
still happens where the car hits the pedestrian with a
relatively high speed”
20
Testing ADAS
21
[On-road testing vs. simulation-based (model) testing, using a simulator based on
physical/mathematical models]
Testing via Physics-based
Simulation
22
ADAS
(SUT)
Simulator (Matlab/Simulink)
Model
(Matlab/Simulink)
▪ Physical plant (vehicle / sensors / actuators)
▪ Other cars
▪ Pedestrians
▪ Environment (weather / roads / traffic signs)
Test input
Test output
time-stamped output
AEB Domain Model
[UML class diagram, flattened in extraction; reconstructed elements:
- Test Scenario: simulationTime: Real, timeStep: Real
- Static inputs: Weather (visibility: VisibilityRange, fog: Boolean, fogColor:
FogColor; subclasses Normal, Rain (rainType: RainType), Snow (snowType:
SnowType); OCL constraint: self.fog = false implies self.visibility = "300" and
self.fogColor = None) and Road (frictionCoeff: Real; subclasses Straight, Ramped
(height: RampHeight), Curved (radius: CurvedRadius))
- Dynamic inputs: Vehicle (initial speed v0c: Real) and Pedestrian (initial
position x0p, y0p, speed v0p, orientation θ0p, all Real), both mobile objects
with a Position (x: Real, y: Real)
- Output: AEB Output (TTC: Real, certaintyOfDetection: Real, braking: Boolean)
and output functions F1, F2
- Enumerations: RainType {ModerateRain, HeavyRain, VeryHeavyRain, ExtremeRain};
SnowType {ModerateSnow, HeavySnow, VeryHeavySnow, ExtremeSnow}; FogColor
{DimGray, Gray, DarkGray, Silver, LightGray, None}; CurvedRadius (CR)
{5, 10, 15, 20, 25, 30, 35, 40}; RampHeight (RH) {4, 6, 8, 10, 12};
VisibilityRange {10, 20, ..., 300}]
ADAS Testing Challenges
• Test input space is large, complex and multidimensional
• Explaining failures and fault localization are difficult
• Execution of physics-based simulation models is computationally
expensive
24
Black-Box Search-based Testing
25
[Pipeline: Test input generation (NSGA-II): select best tests, generate new tests
→ (candidate) test inputs → Evaluating test inputs: simulate every (candidate)
test, compute fitness functions → fitness values fed back to the generator.
Inputs: input data ranges/dependencies + simulator + fitness functions defined
based on oracles.
Output: test cases revealing worst-case system behaviors.]
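A minimal sketch of this loop (a simplified evolutionary search with Pareto selection standing in for NSGA-II; the scenario encoding and the `simulate` stub are invented for illustration, with the real fitness values coming from the physics-based simulator):

```python
import random

def simulate(scenario):
    # Hypothetical simulator stub: returns objectives to MINIMIZE,
    # e.g., (time-to-collision, distance between car and pedestrian).
    v_ped, theta_ped, v_car, fog = scenario
    ttc = abs(v_car - v_ped) * 0.1 + fog            # toy formulas, illustration only
    dist = abs(theta_ped - 180) * 0.01 + fog
    return (ttc, dist)

def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def mutate(s):
    v_ped, theta_ped, v_car, fog = s
    return (min(max(v_ped + random.gauss(0, 1), 0), 18),
            (theta_ped + random.gauss(0, 10)) % 360,
            min(max(v_car + random.gauss(0, 2), 0), 90),
            min(max(fog + random.gauss(0, 0.1), 0), 1))

pop = [(random.uniform(0, 18), random.uniform(0, 360),
        random.uniform(0, 90), random.random()) for _ in range(20)]
for _ in range(50):
    fits = {s: simulate(s) for s in pop}
    # Pareto selection: keep the non-dominated scenarios (current worst cases)
    front = [s for s in pop if not any(dominates(fits[t], fits[s]) for t in pop)]
    pop = front + [mutate(random.choice(front)) for _ in range(20 - len(front))]
print(sorted(simulate(s) for s in pop)[:5])
```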
Search: Genetic Evolution
26
[Loop: Initial input → Fitness computation → Selection → Breeding → back to
Fitness computation]
Better Guidance
• Fitness computations rely on simulations and are very
expensive
• Search needs better guidance
27
Decision Trees
28
Partition the input space into homogeneous regions
[Decision-tree figure, flattened in extraction. Root: all 1200 points, 79%
"non-critical" / 21% "critical". Successive splits on pedestrian speed
(v0p >= 7.2 km/h vs. v0p < 7.2 km/h), pedestrian orientation (θ0p < 218.6 vs.
θ0p >= 218.6), and road topology (RoadTopology(CR = 5, Straight, RH = [4 12](m))
vs. RoadTopology(CR = [10 40](m))) partition the points (counts 564/636, then
412/152, then 230/182) into increasingly homogeneous regions, from a leaf with
98% "non-critical" / 2% "critical" to one with 31% "non-critical" / 69%
"critical".]
Genetic Evolution Guided by
Classification
29
[Loop: Initial input → Fitness computation → Classification → Selection →
Breeding → back to Fitness computation]
Search Guided by Classification
30
[Pipeline: Test input generation (NSGA-II): build a classification tree,
select/generate tests in the fittest regions, apply genetic operators →
(candidate) test inputs → Evaluating test inputs: simulate every (candidate)
test, compute fitness functions → fitness values fed back to the generator.
Inputs: input data ranges/dependencies + simulator + fitness functions defined
based on oracles.
Output: test cases revealing worst-case system behaviors + a characterization of
critical input regions.]
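The resulting NSGAII-DT loop can be sketched as follows; here a nearest-to-best test stands in for the decision tree and a one-dimensional `simulate` stub replaces the simulator, so this only illustrates how classification focuses the genetic operators:

```python
import random

def simulate(x):                 # hypothetical simulator stub: smaller = more critical
    return abs(x - 7.0) + random.random() * 0.1

def in_fit_region(x, pop_with_fit):
    # Stand-in for the decision tree: treat the neighbourhood of the current
    # best individuals as the "critical" region to focus on.
    best = sorted(pop_with_fit, key=lambda p: p[1])[:5]
    return any(abs(x - b[0]) < 1.0 for b in best)

pop = [random.uniform(0, 20) for _ in range(20)]
for generation in range(10):
    pop_with_fit = [(x, simulate(x)) for x in pop]        # the expensive step
    offspring = []
    while len(offspring) < 20:
        parent = min(random.sample(pop_with_fit, 3), key=lambda p: p[1])[0]
        child = parent + random.gauss(0, 1.0)
        # Keep offspring mostly inside the fit regions, so fewer simulations
        # are wasted on uninteresting parts of the space (small escape chance).
        if in_fit_region(child, pop_with_fit) or random.random() < 0.1:
            offspring.append(child)
    pop = offspring
print(min((simulate(x), x) for x in pop))
```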
NSGAII-DT vs. NSGAII
31
NSGAII-DT outperforms NSGAII
[Plots: the HV, GD, and SP quality indicators over search time (6-24 h),
comparing NSGAII-DT and NSGAII.]
Testing Controllers
32
Dynamic Continuous Controllers
33
MiL Test Cases
34
Model Simulation: Input Signals → Output Signal(s)
[Figure: two test cases (Test Case 1, Test Case 2), each defined by three input
signals S1, S2, S3 over time t, simulated on the model to produce the output
signal(s).]
• Supercharger bypass flap controller
✓ Flap position is bounded within [0..1]
✓ Implemented in MATLAB/Simulink
✓ 34 (sub-)blocks decomposed into 6 abstraction levels
[Photos: the supercharger bypass flap at Flap position = 0 (open) and
Flap position = 1 (closed)]
Simple Example
35
[Figure: the test input is a step in the Desired Value from an Initial Desired
Value to a Final Desired Value at time T/2; the test output plots the Desired
Value against the Actual Value over [0, T]. Block diagram: the desired value
minus the actual value forms the error fed to the Controller (SUT), which drives
the Plant Model; the plant's system output is fed back as the actual value.]
MiL Testing of Controllers
36
Configurable Controllers at MIL
[Block diagram: a configurable PID controller in closed loop with the Plant
Model. The error $e(t) = desired(t) - actual(t)$ feeds the P, I, and D terms,
whose sum is the control signal:
$$output(t) = K_P\,e(t) + K_I \int e(t)\,dt + K_D \frac{de(t)}{dt}$$
Time-dependent variables: $desired(t)$, $actual(t)$, $e(t)$, $output(t)$.
Configuration parameters: the gains $K_P$, $K_I$, $K_D$.]
37
Requirements and Test Objectives
[Figure: the Desired Value (input) steps from InitialDesired (ID) to
FinalDesired (FD) at T/2; the Actual Value (output) tracks it over [0, T]. The
three test objectives are annotated: Smoothness, Responsiveness, Stability.]
38
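To make these objectives concrete, here is a minimal sketch that simulates a step response on a toy first-order plant (standing in for the real Simulink model) and computes one plausible formalization of the three objectives; the metric definitions are illustrative assumptions, not necessarily the exact ones used in the work:

```python
import numpy as np

def simulate_pid(kp, ki, kd, ID=0.0, FD=1.0, T=2.0, dt=0.001):
    # Toy closed loop: PID controller driving a first-order plant (illustration).
    n = int(T / dt)
    desired = np.where(np.arange(n) * dt < T / 2, ID, FD)
    actual = np.zeros(n)
    integ, prev_e = 0.0, 0.0
    for i in range(1, n):
        e = desired[i] - actual[i - 1]
        integ += e * dt
        u = kp * e + ki * integ + kd * (e - prev_e) / dt
        prev_e = e
        actual[i] = actual[i - 1] + dt * (u - actual[i - 1])   # first-order plant
    return desired, actual, dt

desired, actual, dt = simulate_pid(2.0, 1.0, 0.05)
step = len(actual) // 2                       # index where FD is applied
FD = desired[-1]
smoothness = max(0.0, actual[step:].max() - FD)           # worst overshoot
resp_idx = np.argmax(np.abs(actual[step:] - FD) < 0.05)   # first time within 5%
responsiveness = resp_idx * dt
stability = actual[-int(0.2 * len(actual)):].std()        # late-phase oscillation
print(smoothness, responsiveness, stability)
```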
A Search-Based Test Approach
[Heatmap over the Initial Desired (ID) × Final Desired (FD) input space: where
are the worst case(s)?]
• Search directed by model
execution feedback
• Controller’s dynamic behavior
can be complex
• Meta-heuristic search in (large)
input space: Finding worst case
inputs
• Possible because of automated
oracle (feedback loop)
• Different worst cases for
different requirements
39
Initial Solution
[Workflow: (1) Exploration: the controller-plant model and Objective Functions
based on Requirements produce a HeatMap Diagram and a List of Critical Regions,
reviewed by a Domain Expert; (2) Single-State Search within those regions yields
Worst-Case Scenarios. Inset plot: Desired Value vs. Actual Value over time, with
the Initial Desired and Final Desired levels marked.]
40
Results
• We found much worse scenarios during MiL testing than our
partner had found so far
• These scenarios are also run at the HiL level, where testing is
much more expensive: MiL results => test selection for HiL
• But further research was needed:
• Simulations are expensive
• Configuration parameters
41
Final Solution
[Workflow: (1) Exploration with Dimensionality Reduction: Elementary Effect
Analysis identifies the significant variables, and the 8-dimensional space is
visualized using regression trees, yielding a List of Critical Partitions
reviewed by a Domain Expert; (2) Search with Surrogate Modeling on the
Controller Model (Simulink), where a neural network predicts the fitness
function to speed up the search, guided by the Objective Functions, yielding
Worst-Case Scenarios.]
42
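For the dimensionality-reduction step, a minimal sketch of Elementary Effect (Morris-style) screening in a simplified one-at-a-time variant; the 8-input objective function is invented for illustration:

```python
import numpy as np

def objective(x):
    # Hypothetical fitness over 8 normalized inputs: only x[0] and x[3] matter here.
    return 3.0 * x[0] ** 2 + 2.0 * x[3] + 0.01 * x[1:].sum()

rng = np.random.default_rng(1)
dim, trajectories, delta = 8, 50, 0.1
effects = np.zeros((trajectories, dim))
for t in range(trajectories):
    x = rng.uniform(0, 1 - delta, size=dim)
    base = objective(x)
    for i in range(dim):                        # one-at-a-time perturbation
        xp = x.copy()
        xp[i] += delta
        effects[t, i] = (objective(xp) - base) / delta
mu_star = np.abs(effects).mean(axis=0)          # mean |elementary effect| per input
print(np.argsort(mu_star)[::-1])                # inputs ranked by influence
```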
Regression Tree
[Regression-tree figure, flattened in extraction; structure reconstructed from
the listed counts, means, and standard deviations (the weighted means are
consistent with this grouping):
- All Points: count 1000, mean 0.007822, std dev 0.0049497
  - FD < 0.43306: count 574, mean 0.0059513, std dev 0.0040003
    - ID < 0.64679: count 373, mean 0.0047594, std dev 0.0034346
    - ID >= 0.64679: count 201, mean 0.0081631, std dev 0.0040422
      - Cal5 >= 0.014827: count 70, mean 0.0106795, std dev 0.0052045
      - Cal5 < 0.014827: count 131, mean 0.0068185, std dev 0.0023515
  - FD >= 0.43306: count 426, mean 0.0103425, std dev 0.0049919
    - Cal5 >= 0.020847: count 182, mean 0.0134555, std dev 0.0052883
    - Cal5 < 0.020847: count 244, mean 0.0080206, std dev 0.0031751]
43
Surrogate Modeling
Any supervised learning or statistical
technique providing fitness predictions
with confidence intervals
1. Predict higher fitness with high
confidence: Move to new position,
no simulation
2. Predict lower fitness with high
confidence: Do not move to new
position, no simulation
3. Low confidence in prediction:
Simulation
[Plot: the surrogate model approximating the real fitness function over input x.]
44
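A minimal sketch of the three cases above, using a Gaussian process surrogate (scikit-learn) inside a simple hill climber; the `true_fitness` stub stands in for an expensive simulation, and the 2-sigma confidence rule is an illustrative choice:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def true_fitness(x):               # expensive "simulation" stand-in
    return float(np.sin(3 * x) + 0.5 * x)

rng = np.random.default_rng(2)
X = rng.uniform(0, 3, size=(8, 1))             # initial simulated points
y = np.array([true_fitness(v) for v in X[:, 0]])
gp = GaussianProcessRegressor().fit(X, y)

x, fx = 1.0, true_fitness(1.0)
for _ in range(30):                             # hill climbing with surrogate filter
    cand = x + rng.normal(0, 0.3)
    mean, std = gp.predict(np.array([[cand]]), return_std=True)
    if mean[0] - 2 * std[0] > fx:               # 1. confidently better: move, no simulation
        x, fx = cand, mean[0]
    elif mean[0] + 2 * std[0] < fx:             # 2. confidently worse: skip, no simulation
        continue
    else:                                       # 3. low confidence: pay for a simulation
        f_real = true_fitness(cand)
        X = np.vstack([X, [[cand]]]); y = np.append(y, f_real)
        gp.fit(X, y)                            # refine the surrogate as we go
        if f_real > fx:
            x, fx = cand, f_real
print(x, fx)
```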
✓ Our approach identified more critical violations of the controller
requirements, which had been found neither with default/fixed configurations nor
by manual testing.

Results (worst cases found):
                  MiL testing,              MiL testing,          Manual
                  different configurations  fixed configurations  MiL testing
Stability:        2.2% deviation            -                     -
Smoothness:       24% over/undershoot       20% over/undershoot   5% over/undershoot
Responsiveness:   170 ms response time      80 ms response time   50 ms response time
Results
45
Schedulability Analysis and
Testing
46
Problem and Context
• Schedulability analysis encompasses techniques that try to
predict whether (critical) tasks are schedulable, i.e., meet
their deadlines
• Stress testing runs carefully selected test cases that have
a high probability of leading to deadline misses
• Stress testing is complementary to schedulability analysis
• Testing is typically expensive, e.g., hardware in the loop
• Finding stress test cases is difficult
47
Finding Stress Test Cases is Hard
48
[Figure: two example timelines (t = 0 to 9) for the same jobs j0, j1, j2. The
jobs arrive at times at0, at1, at2 and must finish before deadlines dl0, dl1,
dl2; j1 can miss its deadline dl1 depending on when at2 occurs!]
Challenges and Solutions
• Ranges for arrival times form a very large input space
• Task interdependencies and properties constrain what
parts of the space are feasible
• Solution: We re-expressed the problem as a constraint
optimization problem and used a combination of constraint
programming (IBM CPLEX) and meta-heuristic search (GA)
49
Constraint Optimization
50
Constraint Optimization Problem:
• Static properties of tasks (constants)
• Dynamic properties of tasks (variables)
• Performance requirement (objective function)
• OS scheduler behaviour (constraints)
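To illustrate the formulation, a brute-force sketch of a tiny instance: arrival times are the variables, a fixed-priority preemptive scheduler provides the constraints, and maximal lateness is the objective. All constants are invented, and a real solution would use a CP solver (e.g., IBM CPLEX) plus GA rather than enumeration:

```python
from itertools import product

# Static properties (constants): duration, relative deadline, priority
# (a lower priority number means higher priority).
tasks = {"j0": (2, 4, 0), "j1": (3, 6, 1), "j2": (2, 5, 2)}

def preemptive_schedule(arrivals):
    # Constraints: a fixed-priority preemptive scheduler on one core decides
    # which ready job runs at each time unit.
    remaining = {t: tasks[t][0] for t in tasks}
    finish = {}
    for time in range(30):
        ready = [t for t in tasks if arrivals[t] <= time and remaining[t] > 0]
        if ready:
            run = min(ready, key=lambda t: tasks[t][2])   # highest priority runs
            remaining[run] -= 1
            if remaining[run] == 0:
                finish[run] = time + 1
    return finish

worst, worst_arr = float("-inf"), None
for a0, a1, a2 in product(range(6), repeat=3):            # variables: arrival times
    arr = {"j0": a0, "j1": a1, "j2": a2}
    fin = preemptive_schedule(arr)
    # Objective: maximize lateness (a positive value means a deadline miss)
    lateness = max(fin[t] - (arr[t] + tasks[t][1]) for t in tasks)
    if lateness > worst:
        worst, worst_arr = lateness, arr
print(worst_arr, worst)
```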
Combining CP and GA
51
[From S. Di Alesio et al., Fig. 3: "Overview of GA+CP: the solutions x, y and z
in the initial population of GA evolve into ..." (caption truncated in
extraction)]
Case Study
52
[Architecture: Control Modules and Drivers (software-hardware interface) running
on a Real-Time Operating System over a Multicore Architecture, connected to
Alarm Devices (hardware).]
The system monitors gas leaks and fire in oil extraction platforms.
Summary
• We provided a solution for generating stress test cases by combining
meta-heuristic search and constraint programming
• Meta-heuristic search (GA) identifies high risk regions in the
input space
• Constraint programming (CP) finds provably worst-case
schedules within these (limited) regions
• Achieve (nearly) GA efficiency and CP effectiveness
• Our approach can be used both for stress testing and
schedulability analysis (assumption free)
53
Vulnerability Testing
54
[Chart: attack categories by share of attacks:
- Code injection: 42%
- Manipulated data structures: 32%
- Collect and analyze information: 9%
- Indicator: 4%
- Employ probabilistic techniques: 3%
- Manipulate system resources: 3%
- Subvert access control: 3%
- Abuse existing functionality: 2%
- Engage in deceptive…: 2%]
X-Force Threat Intelligence Index 2017
55
https://www.ibm.com/security/xforce/
More than 40% of all
attacks were injection
attacks (e.g., SQLi)
Web Applications
56
Client → Server → SQL Database
Web Applications
57
Web form
str1
str2
Username
Password
OK
SQL query
SELECT *
FROM Users WHERE
(usr = ‘str1’ AND psw = ‘str2’)
Name Surname …
John Smith …
Result
Client → Server → SQL Database
Injection Attacks
58
[Example: the attacker types ') OR 1=1 -- into the web form. The resulting SQL
query is

SELECT * FROM Users WHERE (usr = '' AND psw = '') OR 1=1 --

whose WHERE clause is always true, so the query result is the entire Users table
(Aria Stark, John Snow, …). Client → Server → SQL Database]
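The root cause is building queries by string concatenation; a minimal sketch contrasting the vulnerable construction with a parameterized query, using Python's sqlite3 as a stand-in database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (usr TEXT, psw TEXT, name TEXT)")
conn.execute("INSERT INTO Users VALUES ('john', 'secret', 'John Smith')")

str1, str2 = "", "') OR 1=1 --"     # attacker-controlled form fields

# Vulnerable: attacker input is pasted into the query text, turning it into
#   SELECT * FROM Users WHERE (usr = '' AND psw = '') OR 1=1 --')
query = f"SELECT * FROM Users WHERE (usr = '{str1}' AND psw = '{str2}')"
print(conn.execute(query).fetchall())               # returns every row

# Safe: parameterized query; the input is treated as data, never as SQL
safe = "SELECT * FROM Users WHERE (usr = ? AND psw = ?)"
print(conn.execute(safe, (str1, str2)).fetchall())  # returns nothing
```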
Protection Layers
[Diagram: protection layers between Client and SQL Database: data input
Validation and Sanitization in the application, a Web Application Firewall in
front of the Server, and a Database Firewall in front of the SQL Database.]
59
Web Application Firewalls (WAFs)
60
[Figure: the WAF blocks malicious requests; legitimate requests reach the
Server.]
WAF Rule Set
61
Rule set of Apache ModSecurity
https://github.com/SpiderLabs/ModSecurity
Misconfigured WAFs
62
[A legitimate request BLOCKED: false positive. A malicious request ALLOWED:
false negative.]
Grammar-based Attack Generation
• BNF grammar for SQLi attacks
• Random strategy: randomly selected production rules are
applied recursively until only terminals are left
• The random strategy is not efficient at finding bypassing attacks,
which are rare and hard to hit by chance
• Machine learning? Search?
• How to guide the search? How can ML help?
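A minimal sketch of the random derivation strategy over a toy SQLi grammar (a few invented productions, not the full BNF used in the work):

```python
import random

# Toy grammar: each nonterminal maps to a list of alternative productions.
grammar = {
    "<attack>": [["<sq>", "<wsp>", "<boolAttack>", "<cmt>"]],
    "<boolAttack>": [["OR", "<boolTrueExpr>"]],
    "<boolTrueExpr>": [["1=1"], ['"a"="a"'], ["2>1"]],
    "<sq>": [["'"]],
    "<wsp>": [[" "]],
    "<cmt>": [["--"], ["#"]],
}

def derive(symbol):
    # Recursively expand: apply randomly selected production rules
    # until only terminals are left.
    if symbol not in grammar:
        return symbol
    production = random.choice(grammar[symbol])
    return "".join(derive(s) for s in production)

for _ in range(3):
    print(derive("<attack>"))      # e.g.  ' OR"a"="a"#
```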
Anatomy of SQLi attacks
64
Bypassing attack: ‘ OR“a”=“a”#

[Derivation tree: <START> → <sQuoteContext> → <sq> <wsp> <sqliAttack> <cmt> →
<boolAttack> → <opOR> <boolTrueExpr> → OR <bynaryTrue> →
<dq> <ch> <dq> <opEq> <dq> <ch> <dq> → “ a ” = “ a ”]

Attack slices: S = { ‘ , OR”a”=“a” , # }
Learning Attack Patterns
65
Training Set:
       S1  S2  S3  S4  …  Sn   Outcome
  A1:   1   1   0   0  …   0   Passed
  A2:   0   1   0   0  …   0   Blocked
  …
  Am:   1   1   1   1  …   1   Blocked

[Decision Tree: Yes/No tests on the slices S1, S2, S3, S4, …, Sn, with Blocked
and Passed leaves]
• Random trees
• Random forest
Learning Attack Patterns
66
Training Set (as above) → Decision Tree → Attack Pattern

Attack Pattern: S2 ∧ ¬Sn ∧ S1 (the conjunction of slice conditions along a path
to a Passed leaf)
Machine Learning
[Decision tree from the previous step, classifying attacks as Passed or Blocked]
Generating Attacks via ML and EAs
67
Evolutionary Algorithm (EA)
Iteratively refine successful attack
conditions
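A minimal sketch of the ML-plus-EA combination: a random forest learns which slice combinations tend to bypass the WAF, and an evolutionary loop mutates attacks toward high predicted bypass probability. The slice encoding, the WAF stub, and the operators are all illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
n_slices = 10

def waf_blocks(attack):
    # Hypothetical WAF stub: only attacks with slices 0 and 1 but not 9 bypass it.
    return not (attack[0] and attack[1] and not attack[9])

# Bootstrap training set from random attacks (1 = slice present in the attack)
attacks = rng.integers(0, 2, size=(300, n_slices))
passed = np.array([not waf_blocks(a) for a in attacks])   # True = bypassed the WAF
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(attacks, passed)

pop = rng.integers(0, 2, size=(30, n_slices))
for _ in range(20):
    fitness = clf.predict_proba(pop)[:, 1]                # predicted P(bypass)
    parents = pop[np.argsort(fitness)[-10:]]              # keep the fittest attacks
    children = parents[rng.integers(0, 10, size=30)].copy()
    flips = rng.random(children.shape) < 0.1              # mutate: flip some slices
    children[flips] = 1 - children[flips]
    pop = children

# Run the evolved attacks against the (stub) WAF; count distinct bypasses
bypassing = {tuple(a) for a in pop if not waf_blocks(a)}
print(len(bypassing), "distinct bypassing attacks found")
```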
Some Results
Apache ModSecurity
68
[Plot: distinct attacks found over time]
Industrial WAFs
[Plot: distinct attacks found over time]
Machine Learning-driven attack generation led to more
distinct, successful attacks being discovered faster
Related Work
• Automated repair of WAFs
• Automated testing targeting XML and SQL injections in web
applications
69
Reflecting
70
Search-Based Solutions
• Versatile
• Helps relax assumptions compared to exact approaches
• Helps decrease modeling requirements
• Scalability, e.g., easy to parallelize
• Requires massive empirical studies
• Search is rarely sufficient by itself
71
Multidisciplinary Approach
• Single-technology approaches rarely work in practice
• Combined search with:
• Machine learning
• Solvers, e.g., CP, SMT
• Statistical approaches, e.g., sensitivity analysis
• System and environment modeling and simulation
72
Objectives
• Reduce search space
• Better guide and focus search
• Compute fitness and provide guidance
• Avoid expensive and useless fitness computations
• Explain failures (e.g., decision trees)
• Get more guarantees (e.g., constraint programming)
73
Acknowledgements
• Shiva Nejati
• Reza Matinnejad
• Raja Ben Abdessalem
• Stefano Di Alesio
• Dennis Appelt
• Annibale Panichella
74
Selected References
• L. Briand et al., “Testing the untestable: Model testing of complex
software-intensive systems”, IEEE/ACM ICSE 2016 (Visions of 2025 and Beyond
track)
• R. Matinnejad et al., “MiL Testing of Highly Configurable Continuous
Controllers: Scalable Search Using Surrogate Models”, IEEE/ACM ASE 2014
(Distinguished Paper Award)
• S. Di Alesio et al., “Combining genetic algorithms and constraint programming
to support stress testing of task deadlines”, ACM Transactions on Software
Engineering and Methodology (TOSEM), 25(1):4, 2015
• R. Ben Abdessalem et al., “Testing Vision-Based Control Systems Using
Learnable Evolutionary Algorithms”, IEEE/ACM ICSE 2018
• D. Appelt et al., “A Machine Learning-Driven Evolutionary Approach for Testing
Web Application Firewalls”, to appear in IEEE Transactions on Reliability
• More on: https://wwwen.uni.lu/snt/people/lionel_briand?page=Publications
75
Software Verification & Validation Lab (SVV), uni.lu
Achieving Scalability in Software Testing
with Machine Learning and
Metaheuristic Search
Lionel Briand
RAISE @ ICSE 2018