Software Verification & Validation (SVV) Lab
Achieving Scalability in Software Testing
with Machine Learning and
Metaheuristic Search
Lionel Briand
Definition of Software Testing
• ISTQB: “Software testing is a process of executing a program
or application with the intent of finding the software bugs. It
can also be stated as the process of validating and verifying
that a software program or application or product meets the
business and technical requirements that guided its design
and development.”
2
Scope
• The main challenge in testing software systems is
scalability
• Addressing scalability entails effective automation
• Lessons learned from industrial research collaborations:
satellite, automotive, finance, energy …
• Experiences from combining metaheuristic search, machine
learning, and other AI techniques to address testing scalability
3
Scalability
• The extent to which a technique can be applied to large or
complex artifacts (e.g., input spaces, code, models) and still
provide useful, automated support with acceptable effort, CPU
time, and memory
4
Collaborative Research @ SnT
5
• Research in context
• Addresses actual needs
• Well-defined problem
• Long-term collaborations
• Our lab is the industry
SVV Dept.
6
• Established in 2012, part of the SnT centre
• Requirements Engineering, Security Analysis, Design Verification,
Automated Testing, Runtime Monitoring
• ~ 25 lab members
• Partnerships with industry
• ERC Advanced grant
Outline
• Overview, problem definition
• Example research projects with industry partners:
• Vulnerability testing (Banking)
• Testing advanced driver assistance systems
• Testing controllers (automotive)
• Stress testing critical task deadlines (Energy)
• Reflections and lessons learned
7
Introduction
8
Software Testing
9
[Process diagram: test cases are derived from a SW representation (e.g., specifications) and executed against the SW code; the test results are compared with expected results or properties (the test oracle) to decide whether each test passes (Test Result == Oracle) or fails (Test Result != Oracle). Automation is needed at every step!]
Search-Based Software Testing
• Express test generation problem
as a search or optimization
problem
• Search for test input data with
certain properties, i.e., constraints
• Non-linearity of software (if, loops, …): complex,
discontinuous, non-linear search spaces (Baresel)
• Many search algorithms (metaheuristics), from local
search to global search, e.g., Hill Climbing, Simulated
Annealing and Genetic Algorithms
[Background excerpts from Phil McMinn, “Search-Based Software Testing: Past, Present and Future”: random search is poor at finding solutions that occupy only a small part of the search space (Figure 2: random search may fail to fulfil low-probability test goals), so metaheuristic searches are guided by a problem-specific fitness function that scores points in the search space by their suitability for solving the problem. Hill Climbing evaluates the neighbourhood of the current candidate and climbs until a local optimum is found; since this may not be the global optimum, restarts may be required (Figure 3). Simulated Annealing may temporarily move to points of poorer fitness, with a probability governed by a decreasing ‘temperature’, to escape local optima (Figure 4). Genetic Algorithms are global searches, sampling many points in the fitness landscape at once (Figure 5).]
10
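To make the search concrete, below is a minimal hill-climbing sketch for test data generation; the function under test, its target branch, and the input ranges are invented for illustration. The branch-distance fitness gives the guidance that random search lacks.

```python
# Hypothetical target: cover the branch x == 2*y, a "needle" that
# random sampling is unlikely to hit in a large input domain.
import random

def fitness(x: int, y: int) -> int:
    # Branch distance: 0 iff the target branch is covered; smaller
    # values mean "closer" to covering it, which guides the search.
    return abs(x - 2 * y)

def hill_climb(max_steps: int = 10_000):
    x, y = random.randint(-1000, 1000), random.randint(-1000, 1000)
    for _ in range(max_steps):
        if fitness(x, y) == 0:
            break                      # target branch covered
        # Evaluate the neighbourhood and move to the best neighbour;
        # restart from a random point when stuck at a local optimum.
        neighbours = [(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
        best = min(neighbours, key=lambda p: fitness(*p))
        if fitness(*best) < fitness(x, y):
            x, y = best
        else:
            x, y = random.randint(-1000, 1000), random.randint(-1000, 1000)
    return x, y

print(hill_climb())   # e.g. (42, 21): an input covering the branch
```

Simulated Annealing and Genetic Algorithms replace the move rule and the restart rule, respectively, with probabilistic acceptance and population-based breeding.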
Vulnerability Testing
11
X-Force Threat Intelligence Index 2017 (https://www.ibm.com/security/xforce/), attack types:
• Code injection: 42%
• Manipulated data structures: 32%
• Collect and analyze information: 9%
• Indicator: 4%
• Employ probabilistic techniques: 3%
• Manipulate system resources: 3%
• Subvert access control: 3%
• Abuse existing functionality: 2%
• Engage in deceptive…: 2%
More than 40% of all attacks were injection attacks (e.g., SQLi)
12
Web Applications
13
Client → Server → SQL Database
Web Applications
14
Web form: the user enters a username (str1) and a password (str2). The server turns them into the SQL query
  SELECT * FROM Users WHERE (usr = ‘str1’ AND psw = ‘str2’)
and returns the query result (e.g., the record “John Smith …”) to the client.
Injection Attacks
15
Entering  ‘) OR 1=1 --  in the password field turns the query into
  SELECT * FROM Users WHERE (usr = ‘’ AND psw = ‘’) OR 1=1 --
Because 1=1 is always true and -- comments out the rest, the query returns every row of the Users table (Aria Stark, John Snow, …): authentication is bypassed, as reproduced in the sketch below.
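The sketch uses Python’s built-in sqlite3 as a stand-in for the server-side code (the table and column names follow the slides, the rest is hypothetical) and contrasts the vulnerable query with a parameterized one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (usr TEXT, psw TEXT, name TEXT)")
conn.execute("INSERT INTO Users VALUES ('john', 'secret', 'John Smith')")

usr, psw = "", "') OR 1=1 --"

# Vulnerable: string concatenation lets the input rewrite the query into
#   SELECT * FROM Users WHERE (usr = '' AND psw = '') OR 1=1 --
query = f"SELECT * FROM Users WHERE (usr = '{usr}' AND psw = '{psw}')"
print(conn.execute(query).fetchall())   # every row: authentication bypassed

# Safe: a parameterized query treats the input as data, not as SQL.
safe = "SELECT * FROM Users WHERE (usr = ? AND psw = ?)"
print(conn.execute(safe, (usr, psw)).fetchall())   # []
```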
Protection Layers
Several protection layers sit between the client and the SQL database: data input validation and sanitization in the application, a Web Application Firewall in front of the server, and a database firewall in front of the database.
16
Web Application Firewalls (WAFs)
17
A WAF sits in front of the server and filters incoming traffic: malicious requests are blocked while legitimate ones reach the server.
WAF Rule Set
18
Rule set of Apache ModSecurity
https://github.com/SpiderLabs/ModSecurity
Misconfigured WAFs
19
Legitimate requests that get BLOCKED are false positives; malicious requests that get ALLOWED are false negatives.
Grammar-based Attack
Generation
• BNF grammar for SQLi attacks
• Random strategy: randomly selected production rules are
applied recursively until only terminals are left
• The random strategy is not efficient at finding bypassing
attacks that are rare and hard to hit (see the sketch below)
• Machine learning? Search?
• How to guide the search? How can ML help?
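A minimal sketch of the random strategy on a toy fragment of an SQLi grammar (the production rules below are illustrative, not the full BNF grammar used in the work):

```python
import random

# Toy grammar fragment: nonterminals map to lists of alternative
# right-hand sides; anything not in the map is a terminal.
GRAMMAR = {
    "<attack>":     [["<sq>", "<wsp>", "<boolAttack>", "<cmt>"]],
    "<boolAttack>": [["OR", "<wsp>", "<trueExpr>"]],
    "<trueExpr>":   [["1=1"], ['"a"="a"'], ["'x'='x'"]],
    "<sq>":         [["'"]],
    "<wsp>":        [[" "]],
    "<cmt>":        [["--"], ["#"]],
}

def derive(symbol: str) -> str:
    """Apply randomly selected production rules recursively until
    only terminals are left."""
    if symbol not in GRAMMAR:
        return symbol                       # terminal
    rule = random.choice(GRAMMAR[symbol])
    return "".join(derive(s) for s in rule)

print(derive("<attack>"))   # e.g.:  ' OR "a"="a"#
```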
Anatomy of SQLi attacks
21
The bypassing attack  ‘ OR“a”=“a”#  has a derivation tree under the SQLi grammar:
  <START> ⇒ <sQuoteContext> ⇒ <sq> <wsp> <sqliAttack> <cmt>
  <sqliAttack> ⇒ <boolAttack> ⇒ <opOR> <boolTrueExpr> ⇒ OR <bynaryTrue>
  <bynaryTrue> ⇒ <dq> <ch> <dq> <opEq> <dq> <ch> <dq> ⇒ “a”=“a”
Decomposing the tree yields the attack slices S = { ‘ , _ , OR”a”=“a” , # }
Learning Attack Patterns
22
Training set (one row per attack, one column per slice, 1 = slice present):

  Attack | S1 S2 S3 S4 … Sn | Outcome
  A1     |  1  1  0  0 …  0 | Passed
  A2     |  0  1  0  0 …  0 | Blocked
  …      |  …  …  …  … …  … | …
  Am     |  1  1  1  1 …  1 | Blocked

A decision tree learned from this set classifies attacks as Passed or Blocked by testing the presence (Yes/No) of slices such as S4, Sn, S3, S2, S1, …
• Random trees
• Random forest
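The learning step itself is standard; a hedged sketch with an off-the-shelf decision tree (the slice matrix below is made up for illustration):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Rows = attacks, columns = slices S1..S5 (1 = slice present),
# labels = whether the attack passed the WAF or was blocked.
X = [
    [1, 1, 0, 0, 0],   # A1
    [0, 1, 0, 0, 0],   # A2
    [1, 1, 1, 1, 1],   # A3
    [1, 1, 0, 1, 0],   # A4
]
y = ["Passed", "Blocked", "Blocked", "Passed"]

tree = DecisionTreeClassifier().fit(X, y)
# Each root-to-leaf path is a candidate attack pattern.
print(export_text(tree, feature_names=["S1", "S2", "S3", "S4", "S5"]))
```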
Learning Attack Patterns
23
Following a path from the root of the decision tree to a “Passed” leaf yields an attack pattern, e.g., S2 ∧ ¬Sn ∧ S1: a combination of slices predicted to bypass the WAF.
Generating Attacks via ML and EAs
24
Machine learning (decision trees) identifies promising combinations of attack slices; an evolutionary algorithm (EA) then iteratively refines successful attack conditions.
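A simplified, self-contained sketch of this combination; the WAF below is a one-rule stub, and the slices, encoding, and mutation are toy stand-ins for the real machinery:

```python
import random
from sklearn.tree import DecisionTreeClassifier

SLICES = ["'", " OR ", '"a"="a"', "1=1", "--", "#"]

def waf_blocks(attack: str) -> bool:
    return "1=1" in attack          # hypothetical one-rule WAF

def encode(attack: str):
    return [int(s in attack) for s in SLICES]

def mutate(attack: str) -> str:
    return attack + random.choice(SLICES)

population = ["".join(random.sample(SLICES, 3)) for _ in range(30)]

for generation in range(20):
    X = [encode(a) for a in population]
    y = [waf_blocks(a) for a in population]
    if len(set(y)) < 2:
        break                        # nothing left to learn from
    tree = DecisionTreeClassifier().fit(X, y)
    # Fitness = tree's predicted probability that an attack passes.
    passed = list(tree.classes_).index(False)
    population.sort(key=lambda a: tree.predict_proba([encode(a)])[0][passed],
                    reverse=True)
    parents = population[:15]
    population = parents + [mutate(random.choice(parents)) for _ in range(15)]

bypassing = [a for a in population if not waf_blocks(a)]
print(f"{len(bypassing)} bypassing attacks, e.g. {bypassing[:3]}")
```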
Some Results
Apache ModSecurity and industrial WAFs
25
[Plots: number of distinct attacks over time, for Apache ModSecurity and for two industrial WAFs.]
Machine learning-driven attack generation led to more distinct, successful attacks being discovered faster.
Related Work
• Automated repair of WAFs
• Automated testing targeting XML and SQL injections in web
applications
26
Testing Advanced Driver
Assistance Systems
27
Cyber-Physical Systems
• A system of collaborating computational elements controlling
physical entities
28
Advanced Driver Assistance
Systems (ADAS)
29
Automated Emergency Braking (AEB)
Pedestrian Protection (PP)
Lane Departure Warning (LDW)
Traffic Sign Recognition (TSR)
Automotive Environment
• Highly varied environments, e.g., road topology, weather, buildings
and pedestrians …
• Huge number of possible scenarios, e.g., determined by
trajectories of pedestrians and cars
• ADAS play an increasingly critical role
• A challenge for testing
30
Advanced Driver Assistance
Systems (ADAS)
Decisions are made over time based on sensor data
31
[Closed loop: the environment is observed through sensors/camera; the ADAS controller makes decisions; actuators act back on the environment.]
A General and Fundamental Shift
• Increasingly, it is easier to learn behavior from data using
machine learning than to specify and code it
• Deep learning, reinforcement learning …
• Example: Neural networks (deep learning)
• Millions of weights learned
• No explicit code, no specifications
• Verification, testing?
32
CPS Development Process
33
Model-in-the-Loop (MiL) stage:
• Functional modeling: controllers, plant, decision
• Continuous and discrete Simulink models; model simulation and testing
• System engineering modeling (SysML): architecture modeling (structure, behavior, traceability)
• Analysis: model execution and testing, model-based testing, traceability and change impact analysis, ...
Software-in-the-Loop (SiL) stage:
• (partial) Code generation; deployed executables on target platform
Hardware-in-the-Loop (HiL) stage:
• Hardware (sensors ...), analog simulators; testing (expensive)
Our Goal
• Developing an automated testing technique
for ADAS
35
• To help engineers efficiently and
effectively explore the complex test input
space of ADAS
• To identify critical (failure-revealing) test
scenarios
• Characterization of input conditions that
lead to most critical situations, e.g.,
safety violations
Automated Emergency Braking
System (AEB)
36
The AEB uses a vision (camera) sensor to estimate objects’ position/speed; a decision-making component issues a “brake-request” to the brake controller when braking is needed to avoid collisions.
Example Critical Situation
• “AEB properly detects a pedestrian in front of the car with a
high degree of certainty and applies braking, but an accident
still happens where the car hits the pedestrian with a
relatively high speed”
37
Testing ADAS
38
Two options: on-road testing, or simulation-based (model) testing using a simulator based on physical/mathematical models.
Testing via Physics-based
Simulation
39
Test inputs are fed to a simulator (Matlab/Simulink) whose model covers the physical plant (vehicle / sensors / actuators), other cars, pedestrians, and the environment (weather / roads / traffic signs); the ADAS is the system under test (SUT), and the test output is a time-stamped simulation trace.
AEB Domain Model
[UML class diagram, summarized:]
• A Test Scenario (simulationTime: Real, timeStep: Real) aggregates static inputs, dynamic inputs, and outputs.
• Static inputs: Weather (visibility: VisibilityRange, fog: Boolean, fogColor: FogColor), specialized into Normal, Rain (rainType: RainType) and Snow (snowType: SnowType); Road (frictionCoeff: Real), specialized into Straight, Ramped (height: RampHeight) and Curved (radius: CurvedRadius).
• OCL constraint on Weather: {self.fog=false implies self.visibility = “300” and self.fogColor=None}.
• Enumerations: RainType (ModerateRain, HeavyRain, VeryHeavyRain, ExtremeRain), SnowType (ModerateSnow, HeavySnow, VeryHeavySnow, ExtremeSnow), FogColor (DimGray, Gray, DarkGray, Silver, LightGray, None), CurvedRadius (CR: 5..40), RampHeight (RH: 4..12), VisibilityRange (10..300).
• Dynamic inputs: Vehicle (initial speed vc0: Real) and Pedestrian (initial position xp0, yp0, speed vp0, orientation θp0), both mobile objects with a position vector (x: Real, y: Real).
• Outputs: AEB Output (TTC: Real, certaintyOfDetection: Real, braking: Boolean) plus output functions (F1, F2).
ADAS Testing Challenges
• Test input space is large, complex and multidimensional
• Explaining failures and fault localization are difficult
• Execution of physics-based simulation models is computationally
expensive
41
Black-Box Search-based Testing
42
Inputs: input data ranges/dependencies + simulator + fitness functions defined based on oracles.
Loop (test input generation, NSGA-II): candidate test inputs are evaluated by simulating every candidate test and computing the fitness functions; the fitness values drive selection of the best tests and generation of new ones.
Output: test cases revealing worst-case system behaviors. (A simplified sketch follows.)
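A minimal, single-objective stand-in for this loop (the real pipeline is multi-objective NSGA-II over simulator runs; the scenario encoding, ranges, and "simulator" below are toy assumptions):

```python
import random

def simulate(scenario):
    """Stub for the physics-based simulator; returns a fitness such as
    the minimal car-pedestrian distance (lower = more critical)."""
    vp0, theta_p0, vc0 = scenario
    return abs(vp0 * 3 - theta_p0 / 30 + vc0 / 10)   # toy formula

def random_scenario():
    # Illustrative ranges: pedestrian speed/orientation, car speed.
    return (random.uniform(0, 20), random.uniform(0, 360), random.uniform(0, 90))

def crossover(a, b):
    return tuple(random.choice(pair) for pair in zip(a, b))

def mutate(s):
    return tuple(v * random.uniform(0.9, 1.1) for v in s)

population = [random_scenario() for _ in range(40)]
for generation in range(50):
    population.sort(key=simulate)      # evaluate: one simulation per test
    parents = population[:20]          # select the best (most critical) tests
    population = parents + [mutate(crossover(random.choice(parents),
                                             random.choice(parents)))
                            for _ in range(20)]   # generate new tests

print("most critical scenario found:", min(population, key=simulate))
```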
Search: Genetic Evolution
43
Loop: initial input → fitness computation → selection → breeding
Better Guidance
• Fitness computations rely on simulations and are very
expensive
• Search needs better guidance
44
Decision Trees
45
Partition the input space into homogeneous regions
[Example tree: the root holds all 1200 points (79% “non-critical”, 21% “critical”). Splitting on pedestrian speed (vp0 < 7.2 km/h vs. vp0 >= 7.2 km/h) separates a node of 636 points (98% “non-critical”, 2% “critical”) from a node of 564 points (59% “non-critical”, 41% “critical”); further splits on pedestrian orientation (θp0 < 218.6 vs. >= 218.6) and road topology (CR = 5, Straight, RH = [4 12](m) vs. CR = [10 40](m)) isolate regions of up to 69% “critical” points.]
Genetic Evolution Guided by
Classification
46
Loop: initial input → fitness computation → classification → selection → breeding
Search Guided by Classification
47
Inputs: input data ranges/dependencies + simulator + fitness functions defined based on oracles.
Loop (test input generation, NSGA-II): simulate every candidate test and compute the fitness functions; build a classification tree; select/generate tests in the fittest regions; apply genetic operators.
Output: test cases revealing worst-case system behaviors + a characterization of critical input regions. (A sketch of the tree-guided step follows.)
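A hedged sketch of the classification-guided step, reusing simulate(), random_scenario(), and mutate() from the sketch above (the criticality threshold is arbitrary):

```python
from sklearn.tree import DecisionTreeClassifier

def critical(scenario) -> bool:
    return simulate(scenario) < 1.0          # toy criticality threshold

def guided_candidates(evaluated, n=20):
    """Fit a tree on already-simulated tests, then rejection-sample
    new candidates from the tree's predicted-critical regions."""
    y = [critical(s) for s in evaluated]
    if len(set(y)) < 2:                      # nothing to learn from yet
        return [mutate(random_scenario()) for _ in range(n)]
    tree = DecisionTreeClassifier(min_samples_leaf=5).fit(
        [list(s) for s in evaluated], y)
    out = []
    while len(out) < n:
        c = mutate(random_scenario())
        if tree.predict([list(c)])[0]:       # predicted critical: keep it
            out.append(c)
    return out

# Usage: in the GA loop above, replace the children-generation step with
#   population = parents + guided_candidates(population, n=20)
```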
NSGAII-DT vs. NSGAII
48
NSGAII-DT outperforms NSGAII
[Plots of the quality indicators HV, GD, and SP over 24 hours of search time show NSGAII-DT dominating NSGAII.]
Testing Controllers
49
Dynamic Continuous Controllers
50
MiL Test Cases
51
[Each test case consists of input signals (S1, S2, S3) over time t; model simulation produces the corresponding output signal(s). Two example test cases are shown.]
Simple Example
52
• Supercharger bypass flap controller
✓ Flap position is bounded within [0..1]
✓ Implemented in MATLAB/Simulink
✓ 34 (sub-)blocks decomposed into 6 abstraction levels
Flap position = 0 (open); flap position = 1 (closed)
MiL Testing of Controllers
53
Test input: the desired value steps from an initial desired value to a final desired value at T/2. Test output: the actual value over [0, T], which should track the desired value.
Setup: the controller (SUT) computes its command from the error (desired value − actual value) and drives the plant model; the system output feeds back as the actual value.
Configurable Controllers at MiL
The PID controller, connected to the plant model, computes from the error e(t) = desired(t) − actual(t) the command

  output(t) = KP·e(t) + KI·∫e(t)dt + KD·de(t)/dt

desired(t), actual(t), e(t) and output(t) are time-dependent variables; the gains KP, KI, KD are configuration parameters.
54
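A runnable toy version of this controller with a first-order plant; the gains and the plant constant are illustrative, not the actual supercharger flap values:

```python
KP, KI, KD = 2.0, 2.0, 0.005      # configuration parameters
dt, T, tau = 0.01, 2.0, 0.1       # step size, horizon, plant time constant

actual = integral = prev_error = peak = 0.0
for step in range(int(T / dt)):
    t = step * dt
    desired = 0.0 if t < T / 2 else 0.8        # step input at T/2
    error = desired - actual                    # e(t) = desired(t) - actual(t)
    integral += error * dt
    derivative = (error - prev_error) / dt
    # output(t) = KP*e(t) + KI*integral(e) + KD*de(t)/dt
    output = KP * error + KI * integral + KD * derivative
    prev_error = error
    actual += (output - actual) * dt / tau      # first-order plant response
    peak = max(peak, actual)

# Smoothness and responsiveness would be scored from this trace:
print(f"final value {actual:.2f}, peak {peak:.2f} (desired 0.8)")
```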
Requirements and Test Objectives
The input (desired value) steps from InitialDesired (ID) to FinalDesired (FD) at T/2; the output (actual value) is evaluated over [0, T] against three requirements: stability, smoothness, and responsiveness.
55
A Search-Based Test Approach
[Search over the (Initial Desired (ID), Final Desired (FD)) input space for worst case(s).]
• Search directed by model
execution feedback
• Controller’s dynamic behavior
can be complex
• Meta-heuristic search in (large)
input space: Finding worst case
inputs
• Possible because of automated
oracle (feedback loop)
• Different worst cases for
different requirements
56
Initial Solution
Inputs: controller-plant model + objective functions based on requirements.
Step 1 (Exploration): produces a HeatMap diagram over the (ID, FD) space; a domain expert selects a list of critical regions.
Step 2 (Single-state search) within those regions: produces worst-case scenarios (plots of desired vs. actual value over time).
57
Results
• We found much worse scenarios during MiL testing than our
partner had found so far
• These scenarios are also run at the HiL level, where testing is
much more expensive: MiL results => test selection for HiL
• But further research was needed:
• Simulations are expensive
• Configuration parameters
58
Final Solution
Inputs: controller model (Simulink) + objective functions.
Step 1 (Exploration with dimensionality reduction): Elementary Effect Analysis identifies the significant variables, and regression trees visualize the 8-dimension space; a domain expert selects a list of critical partitions.
Step 2 (Search with surrogate modeling): a surrogate model (neural network) predicts the fitness function to speed up the search, producing worst-case scenarios.
59
Regression Tree
[Example tree: all 1000 points have mean fitness 0.007822 (std dev 0.0049497). The root splits on FD at 0.43306 into a node of 574 points (mean 0.0059513) and a node of 426 points (mean 0.0103425); further splits on ID at 0.64679 and on Cal5 at 0.020847 and 0.014827 isolate partitions such as one of 182 points with mean 0.0134555, well above the overall mean. Leaf statistics (count, mean, std dev) point the search to the most critical partitions.]
60
Surrogate Modeling
Any supervised learning or statistical
technique providing fitness predictions
with confidence intervals
1. Predict higher fitness with high
confidence: Move to new position,
no simulation
2. Predict lower fitness with high
confidence: Do not move to new
position, no simulation
3. Low confidence in prediction:
Simulation
[Plot: the surrogate model approximates the real fitness function over x, with a confidence band.]
61
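A hedged sketch of the three cases, with a Gaussian process standing in for “any supervised learning or statistical technique providing fitness predictions with confidence intervals”, and a cheap function standing in for the Simulink simulation:

```python
import math, random
from sklearn.gaussian_process import GaussianProcessRegressor

def expensive_fitness(x: float) -> float:
    return math.sin(3 * x) * x               # stand-in for a simulation

X, y = [[0.0]], [expensive_fitness(0.0)]
gp = GaussianProcessRegressor().fit(X, y)

current, current_fit, simulations = 0.0, y[0], 1
for _ in range(200):
    candidate = current + random.uniform(-0.3, 0.3)
    mean, std = gp.predict([[candidate]], return_std=True)
    if mean[0] - 2 * std[0] > current_fit:
        current, current_fit = candidate, mean[0]   # 1. confidently better: move, no simulation
    elif mean[0] + 2 * std[0] < current_fit:
        pass                                        # 2. confidently worse: skip, no simulation
    else:
        fit = expensive_fitness(candidate)          # 3. low confidence: simulate
        simulations += 1
        X.append([candidate]); y.append(fit)
        gp = GaussianProcessRegressor().fit(X, y)   # retrain the surrogate
        if fit > current_fit:
            current, current_fit = candidate, fit

print(f"best x ≈ {current:.2f} using only {simulations} real simulations")
```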
Results
62
✓ Our approach identified critical violations of the controller requirements that had been found neither with default/fixed configurations nor by manual testing.

  Requirement    | MiL testing (different configurations) | MiL testing (fixed configuration) | Manual MiL testing
  Stability      | 2.2% deviation                         | -                                 | -
  Smoothness     | 24% over/undershoot                    | 20% over/undershoot               | 5% over/undershoot
  Responsiveness | 170 ms response time                   | 80 ms response time               | 50 ms response time
Schedulability Analysis and
Testing
63
Problem and Context
• Schedulability analysis encompasses techniques that try to
predict whether (critical) tasks are schedulable, i.e., meet
their deadlines
• Stress testing runs carefully selected test cases that have
a high probability of leading to deadline misses
• Stress testing is complementary to schedulability analysis
• Testing is typically expensive, e.g., hardware in the loop
• Finding stress test cases is difficult
64
Finding Stress Test Cases is Hard
65
[Timeline example: jobs j0, j1, j2 arrive at times at0, at1, at2 and must finish before deadlines dl0, dl1, dl2. Two schedules over the same horizon show that j1 can miss its deadline dl1 depending on when at2 occurs.]
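A toy, hypothetical reconstruction of this example (one core, preemptive fixed-priority scheduling; all arrival times, durations, priorities, and the deadline are made up):

```python
def completion_times(arrivals, durations, priorities, horizon=20):
    """Simulate a preemptive fixed-priority scheduler in unit time
    steps and return each job's completion time."""
    remaining = list(durations)
    done = [None] * len(arrivals)
    for t in range(horizon):
        ready = [j for j in range(len(arrivals))
                 if arrivals[j] <= t and remaining[j] > 0]
        if not ready:
            continue
        j = min(ready, key=lambda k: priorities[k])   # lower value = higher priority
        remaining[j] -= 1
        if remaining[j] == 0:
            done[j] = t + 1
    return done

# j0, j1, j2 with durations 2, 3, 2; j2 preempts j1.
for at2 in (1, 5):
    done = completion_times([0, 1, at2], [2, 3, 2], [0, 2, 1])
    print(f"at2={at2}: j1 finishes at t={done[1]} (deadline dl1=6)")
# at2=1 -> j1 finishes at t=7 and misses dl1; at2=5 -> t=5, deadline met.
```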
Challenges and Solutions
• Ranges for arrival times form a very large input space
• Task interdependencies and properties constrain what
parts of the space are feasible
• Solution: We re-expressed the problem as a constraint
optimization problem and used a combination of constraint
programming (IBM CPLEX) and meta-heuristic search (GA)
66
Constraint Optimization
67
The stress-test generation problem maps to a constraint optimization problem:
• Static properties of tasks → constants
• Dynamic properties of tasks → variables
• Performance requirement → objective function
• OS scheduler behaviour → constraints
Combining CP and GA
68
[Fig. 3 from S. Di Alesio et al.: overview of GA+CP: solutions x, y and z in the initial GA population evolve toward high-risk regions of the input space, which CP then searches exhaustively.]
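A simplified sketch of the GA+CP division of labor, reusing completion_times() from the sketch above; the exhaustive scan over a small region stands in for the CP phase (IBM CPLEX in the actual work), which provably finds the worst schedule within that region:

```python
import random

DL1 = 6
def lateness(at2: int) -> int:
    """Objective: how late j1 finishes w.r.t. dl1 (positive = miss)."""
    done = completion_times([0, 1, at2], [2, 3, 2], [0, 2, 1])
    return done[1] - DL1

# GA phase: evolve candidate arrival times toward high-risk regions.
population = [random.randint(0, 15) for _ in range(10)]
for _ in range(20):
    population.sort(key=lateness, reverse=True)
    parents = population[:5]
    population = parents + [max(0, p + random.randint(-2, 2)) for p in parents]

# "CP" phase: certify the worst case within the best region found.
best = population[0]
worst = max(range(max(0, best - 2), best + 3), key=lateness)
print(f"worst case: at2={worst}, j1 misses dl1 by {lateness(worst)} time unit(s)")
```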
Case Study
69
The system monitors gas leaks and fire on oil extraction platforms. Architecture: drivers (software-hardware interface), control modules, and alarm devices (hardware), running on a multicore architecture with a real-time operating system.
Summary
• We provided a solution for generating stress test cases by combining
meta-heuristic search and constraint programming
• Meta-heuristic search (GA) identifies high risk regions in the
input space
• Constraint programming (CP) finds provably worst-case
schedules within these (limited) regions
• Achieve (nearly) GA efficiency and CP effectiveness
• Our approach can be used both for stress testing and
schedulability analysis (assumption free)
70
Reflecting
71
Search-Based Solutions
• Versatile
• Helps relax assumptions compared to exact approaches
• Helps decrease modeling requirements
• Scalability, e.g., easy to parallelize
• Requires massive empirical studies
• Search is rarely sufficient by itself
72
Multidisciplinary Approach
• Single-technology approaches rarely work in practice
• Combined search with:
• Machine learning
• Solvers, e.g., CP, SMT
• Statistical approaches, e.g., sensitivity analysis
• System and environment modeling and simulation
73
Objectives
• Reduce search space
• Better guide and focus search
• Compute fitness and provide guidance
• Avoid expensive and useless fitness computations
• Explain failures (e.g., decision trees)
• Get more guarantees (e.g., constraint programming)
74
Acknowledgements
• Shiva Nejati
• Reza Matinnejad
• Raja Ben Abdessalem
• Stefano Di Alesio
• Dennis Appelt
• Annibale Panichella
75
Selected References
• L. Briand et al., “Testing the untestable: Model testing of complex software-intensive systems”,
IEEE/ACM ICSE 2016 (Visions of 2025 and Beyond track)
• R. Matinnejad et al., “MiL Testing of Highly Configurable Continuous Controllers: Scalable Search
Using Surrogate Models”, IEEE/ACM ASE 2014 (Distinguished paper award)
• S. Di Alesio et al. “Combining genetic algorithms and constraint programming to support stress
testing of task deadlines”, ACM Transactions on Software Engineering and Methodology (TOSEM),
25(1):4, 2015
• R. Ben Abdessalem et al., "Testing Vision-Based Control Systems Using Learnable Evolutionary
Algorithms”, IEEE/ACM ICSE 2018
• D. Appelt et al., “A Machine Learning-Driven Evolutionary Approach for Testing Web Application
Firewalls”, to appear in IEEE Transactions on Reliability
• More on: https://wwwen.uni.lu/snt/people/lionel_briand?page=Publications
76
Software Verification & Validation (SVV) Lab
Achieving Scalability in Software Testing
with Machine Learning and
Metaheuristic Search
Lionel Briand
