Applications of Search-based
Software Testing to Trustworthy
Artificial Intelligence
Lionel Briand
SSBSE 2022 Keynote
http://www.lbriand.info
ML-Enabled Systems (MLS)
(Diagram: sensors/camera observe the environment; a machine-learning-based ADAS controller makes decisions; actuators act on the environment.)
Example Automotive Applications
• Object detection, identification, classification,
localization and prediction of movement
• Sensor fusion and scene comprehension, e.g., lane
detection
• Driver monitoring
• Driver replacement
• Functional safety, security
• Powertrains, e.g., improve motor control and battery
management
Tian et al. 2018
Importance
• ML components are increasingly part of safety- or mission-
critical systems (ML-enabled systems - MLS)
• Many domains, including aerospace, automotive, health care,
…
• Many ML algorithms, supervised vs. unsupervised,
classification vs regression, etc.
• But increasing use of deep learning and reinforcement
learning
Testing Levels
• Testing is still the main mechanism through which to gain trust
• Levels: model (e.g., Deep Neural Networks, or DNNs), integration,
system
• Research largely focused on model testing
• Integration: Issues that arise when multiple models and
components are integrated
• System: Test the MLS in its target environment, in-field or
simulated
Challenges: Overview
• Behavior driven by training data and learning process
• Neither specifications nor code
• Huge model input space, especially for autonomous
systems
• Automated test oracles / verdicts
• Test suite adequacy, i.e., when is it good enough?
• Models are never perfect, but how do we decide
whether they are good enough?
Model-Level Testing and
Analysis
Test Inputs: Adversarial or
Natural?
• Adversarial inputs: Focus on robustness, e.g., noise or attacks
• Natural inputs: Focus on functional aspects, e.g., functional
safety
Adversarial example due to noise (Goodfellow et al., 2014)
Key-points Detection Testing with
Simulation in the Loop:
• DNNs used for key-points detection
in images
• Many applications, e.g., face
recognition
• Testing: Find a test suite that causes the DNN to poorly predict as many key-points as possible within the time budget
• Images generated by a simulator
(Figure: ground truth vs. predicted key-points on a sample image.)
Example Application
• Drowsiness or gaze detection based on an interior camera monitoring the driver
• In the drowsiness or gaze detection problem, each Key-Point (KP) may be highly
important for safety
• Each KP leads to one test objective
• For our subject DNN, we have 27 test objectives
• Goal: Cause the DNN to mispredict as many key-points as possible
• Solution: Many-objective search algorithms (based on genetic algorithms)
combined with simulator
Ul Haq et al., 2021
Overview
(Diagram: an input generator based on search feeds input vectors to the simulator, which renders test images; the DNN predicts key-point positions; a fitness calculator compares the predicted and actual key-point positions to produce a fitness score (error value) and retain the most critical test inputs.)
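The fitness computation in this loop can be sketched as follows; this is a minimal illustration, not the authors' implementation, and the function name is hypothetical. Each key-point yields one error value, i.e., one search objective to maximize:

```python
import math

def keypoint_errors(actual, predicted):
    # One error value per key-point: the Euclidean distance between the
    # actual (simulator ground-truth) and DNN-predicted positions.
    # Larger distances indicate more severe mispredictions, so each
    # entry serves as one objective to maximize during search.
    return [math.dist(a, p) for a, p in zip(actual, predicted)]
```

A many-objective search would then try to drive each of the 27 objectives above its failure threshold within the time budget.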
Results
• Our approach is effective in generating test suites that cause the DNN to severely
mispredict more than 93% of all key-points on average
• Not all mispredictions can be considered failures … (e.g., shadows)
• We must know when the DNN cannot be expected to be accurate and have contingency
measures
• Some key-points are more severely mispredicted than others; detailed analysis revealed two reasons:
• Under-representation of some key-points (hidden) in the training data
• Large variation in the shape and size of the mouth across different 3D models (more
training needed)
Interpretation
• Regression trees to predict accuracy based on simulation parameters
• Enable detailed analysis to find the root causes of high Normalized Error (NE) values, e.g., shadow
on the location of KP26 is the cause of high NE values
• Regression trees show excellent accuracy and reasonable size
• Amenable to risk analysis, yielding useful safety insights and enabling contingency plans at run-time
Representative rules derived from the decision tree for KP26
(M: Model-ID, P: Pitch, R: Roll, Y: Yaw, NE: Normalized Error)

Image Characteristics Condition | NE
M = 9 ∧ P < 18.41 | 0.04
M = 9 ∧ P ≥ 18.41 ∧ R < −22.31 ∧ Y < 17.06 | 0.26
M = 9 ∧ P ≥ 18.41 ∧ R < −22.31 ∧ 17.06 ≤ Y < 19 | 0.71
M = 9 ∧ P ≥ 18.41 ∧ R < −22.31 ∧ Y ≥ 19 | 0.36

(Figures: (A) a test image satisfying the first condition, NE = 0.013; (B) a test image satisfying the third condition, NE = 0.89)
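The rules in this table read directly as code; the sketch below simply encodes them (returning None when no listed rule applies):

```python
def predict_ne_kp26(M, P, R, Y):
    # Encodes the representative KP26 rules from the decision tree:
    # M = Model-ID, P = Pitch, R = Roll, Y = Yaw.
    # Returns the predicted Normalized Error (NE).
    if M == 9 and P < 18.41:
        return 0.04
    if M == 9 and P >= 18.41 and R < -22.31:
        if Y < 17.06:
            return 0.26
        if Y < 19:
            return 0.71
        return 0.36
    return None  # no listed rule covers this configuration
```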
Real Images
• Test selection requires a different approach than with a
simulator
• Labeling costs are significant
DNN Structural Coverage Criteria
Chen et al. 2020
• Neuron coverage
• How neurons are
activated
• Many variants
Limitations
• Require access to the DNN internals and sometimes the
training set. Not realistic in many practical settings.
• There is a weak correlation between coverage and
misclassification or poor predictions for natural inputs
• Also many questions regarding the studies focused on
adversarial inputs …
Diversity-driven Test Selection
• Test inputs are real images
• Black-box approach based on measuring the diversity of
test inputs.
• Scalable selection
• Assumption: The more diverse, the more likely test inputs
are to reveal faults
Aghababaeyan et al., 2022
Extracting Image Features
• VGG16 is a convolutional neural network trained on a
subset of the ImageNet dataset, a collection of over 14
million images belonging to 22,000 categories.
Features:
• Activation values after the last convolutional layer
• Characterize semantic elements such as shapes and colors
Geometric Diversity (GD)
• Given a dataset X and its corresponding feature vectors V, the geometric diversity of a subset S ⊆ X is defined as the hyper-volume of the parallelepiped spanned by the rows of Vs (the feature vectors of the items in S): the larger the volume, the more diverse the feature space of S
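Concretely, the hyper-volume of the parallelepiped spanned by the rows of Vs is the square root of the determinant of the Gram matrix of those rows. A minimal sketch with NumPy (not the authors' implementation):

```python
import numpy as np

def geometric_diversity(Vs):
    # Vs: one row per selected input (its feature vector).
    # The squared volume of the parallelepiped spanned by the rows
    # equals det(Vs Vs^T); near-parallel (redundant) feature vectors
    # drive the determinant, and hence the diversity score, to zero.
    gram = Vs @ Vs.T
    return float(np.sqrt(max(np.linalg.det(gram), 0.0)))
```

Orthogonal (diverse) feature vectors maximize the volume, while near-duplicate inputs collapse it.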
Correlation with Faults
(Figure: correlation between geometric diversity and faults.)
Estimating Faults in DNNs
#Clusters ~ #Faults
Pareto Front Optimization
• Black-box test selection to detect as many diverse faults
as possible for a given test budget
• Label selected images
• Search-based Approach: Multi-Objective Genetic Search
(NSGA-II)
• Two objectives (Max):
• diversity
• uncertainty (e.g., Gini)
(Plot: selected test inputs on the Pareto front over the GD score and the Gini score.)
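The uncertainty objective can be illustrated with a Gini score over the model's output probabilities; the sketch below assumes the DNN exposes a softmax vector per input:

```python
def gini_score(probs):
    # Gini impurity of the predicted class distribution:
    # 0 for a confident one-hot prediction, approaching 1 - 1/k for a
    # uniform distribution over k classes (maximal uncertainty).
    return 1.0 - sum(p * p for p in probs)
```

NSGA-II would then search for test subsets that jointly maximize the diversity and Gini objectives.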
Explanation and Analysis with
Heatmaps
• Heatmaps capture the extent to which the pixels of an image impacted a specific result
• Limitations:
• Heatmaps must be manually inspected to determine the reason for a misclassification
• Underrepresented (but dangerous) failure causes might go unnoticed
• Model debugging (i.e., improvement) not automated
A heatmap can show that long hair is what
caused a female doctor to be classified as a
nurse
HUDD
• How to help perform risk analysis with real images?
• Rely on heatmaps to generate clusters of images whose failures share the same root cause (common characteristics of the images)
• Enable engineers to ensure safety by introducing countermeasures
(Diagram: Step 1, heatmap-based clustering of error-inducing test set images into root-cause clusters C1, C2, C3; Step 2, inspection of a subset of each cluster's elements.)
Fahmy et al. 2021
Example Application
• Classification
• Gaze detection
(Figure: gaze-angle classes at 22.5° intervals around the circle: Top Left, Top Center, Top Right, Middle Left, Middle Right, Bottom Left, Bottom Center, Bottom Right.)
Clusters identify different problems:
• Cluster 1 (angle ~157.5°): borderline cases
• Cluster 2 (eye middle center): incomplete set of classes
• Cluster 3 (near closed eyes): incomplete training set
Combining Simulation and Actual
Images in Testing
(Diagram: a DNN is trained on a simulator-image training set and tested on a simulator-image test set; the trained DNN is then fine-tuned on a real-world training set and tested on a real-world test set, with real-world error-inducing images feeding back into the analysis.)
RQ1: Is it possible to automatically characterize unsafe scenarios?
RQ2: Is it possible to improve the accuracy of the DNN by leveraging the identified
unsafe scenarios?
Simulator-based Explanations for
DNN Errors (SEDE)
(Diagram: HUDD groups real-world error-inducing images into root-cause clusters (RCCs); for each RCC, an evolutionary algorithm drives the simulator, via its configuration parameters, to generate unsafe images belonging to the RCC and similar safe images; a rule-extraction algorithm (PART) then derives IF-THEN rules, which form the explanation expression.)
Fahmy et al., 2022
Example PART Rule
• Example simulation
parameters: pitch,
yaw, and roll
(orientation of the
head)
• Rules expressed in
terms of simulation
parameters
Reinforcement Learning Testing
and Safety
Cartpole Example
• Environment (CartPole from OpenAI GYM): a pole is attached to a cart, and
the goal is to move the cart right and left to keep the pole from falling.
• Actions: Left and right
• Reward: For every timestep that the pole remains upright (+1 reward)
• Termination
• Pole angle is greater than 12 degrees
• Cart distance from the center point is greater than 2.4
• Episode length is more than 200
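The termination conditions above are easy to express directly; a minimal sketch, using the thresholds from the slide:

```python
def episode_terminated(pole_angle_deg, cart_x, t):
    # An episode ends when the pole leans more than 12 degrees, the
    # cart drifts more than 2.4 units from the center point, or the
    # episode length exceeds 200 timesteps.
    return abs(pole_angle_deg) > 12.0 or abs(cart_x) > 2.4 or t > 200
```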
Reward and Functional Faults
• Episodes: State-action sequences
• Testing goal: Find episodes to detect and explain reward and functional faults
• Reward fault: the accumulated reward is less than a threshold, e.g., the pole
falls down in the first 70 timesteps
• Functional fault: the pole is stable but, regardless of the accumulated reward, the cart moves beyond the 2.4 distance limit and the episode terminates
(Figures: example episodes illustrating a functional fault and a reward fault.)
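Both fault types can be checked mechanically on an executed episode. A sketch using the threshold and limit from the slides; the episode representation (a list of cart positions plus the accumulated reward) is an assumption for illustration:

```python
def classify_episode(cart_positions, total_reward,
                     reward_threshold=70, x_limit=2.4):
    # Reward fault: the accumulated reward stays below the threshold
    # (e.g., the pole falls within the first 70 timesteps).
    if total_reward < reward_threshold:
        return "reward fault"
    # Functional fault: the pole stayed up, but the cart left the
    # allowed [-2.4, 2.4] range, terminating the episode.
    if any(abs(x) > x_limit for x in cart_positions):
        return "functional fault"
    return "no fault"
```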
STARLA: Objectives
• Detect diverse faulty episodes as quickly as possible
• Accurately
characterize faulty
episodes
• STARLA: Search
and ML based
Zolfagharian et al. 2022
System-Level Testing and
Analysis
Testing via Physics-based
Simulation
(Diagram: the ADAS (SUT) interacts with a Matlab/Simulink simulator whose model covers the physical plant (vehicle / sensors / actuators), other cars, pedestrians, and the environment (weather / roads / traffic signs); test inputs yield time-stamped test outputs.)
Example Violation
• Violation: Ego Vehicle collides with other vehicle
• Vehicle-in-front slows down suddenly and then moves to the right
• Possible reason: Agent was not trained with such episodes
(Figures: car view and top view of the collision scenario.)
Reinforcement Learning for ADAS
Testing (MORLOT)
• Goal: make sequential changes to the environment, in a realistic
manner, with the objective of triggering safety violations
• Train an RL agent to do so
• Challenge: Test many requirements simultaneously
Ul Haq et al. 2022
Example
• Consider the vehicle-in-front as an RL
agent that aims to cause the ego vehicle
to violate safety requirements by
performing a sequence of actions
• State: collision
• Action: turn left/right,
increase/decrease speed of vehicle in
front
• Reward: distance between vehicles
(Figure: the ego vehicle and the vehicle-in-front with its possible actions.)
Reinforcement Learning for ADAS
Testing (MORLOT)
(Diagram: the RL agent selects actions (behavior of environment actors, e.g., the vehicle in front) and receives rewards (e.g., based on the distance from the vehicle in front) and states (locations and conditions of the environment, e.g., collision) from the RL environment, which wraps the CARLA simulator controlling the ADAS.)
Ul Haq et al. 2022
MORLOT: Overview
(Diagram: Many-Objective Reinforcement Learning for Online Testing combines tabular reinforcement learning (Q-learning) with many-objective search. The agent observes the state; chooses an action randomly or via many-objective search, guided by the objectives (i.e., requirements) and Q-tables; takes the action and receives rewards; updates the Q-tables and the archive; and resets the environment. The many-objective search selects actions based on their fitness values.)
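The tabular Q-learning step at the heart of this loop follows the standard update rule; a minimal sketch for a single dict-based Q-table (MORLOT's per-objective bookkeeping is omitted):

```python
def q_update(Q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    # Standard tabular Q-learning update:
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return Q[(state, action)]
```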
System Testing Challenges
Challenges: large input space, many safety requirements, and computationally intensive simulation.
To address these challenges, we propose SAMOTA (Surrogate-Assisted Many-Objective Testing Approach), leveraging many-objective search and Surrogate Models (SMs).
Ul Haq et al. 2022
Surrogate Models
• Surrogate model: Model that mimics the simulator, to a
certain extent, while being much less computationally
expensive
• Research: Combine search with surrogate modeling to
decrease the computational cost of testing (Ul Haq et al. 2022)
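As a toy illustration of the idea (not the SAMOTA surrogates themselves), a polynomial surrogate can be fitted to a handful of expensive simulator runs and then queried cheaply; `expensive_sim` below is a hypothetical stand-in for the simulator:

```python
import numpy as np

def expensive_sim(x):
    # Stand-in for a costly simulation run.
    return x**3 - 2.0 * x

# Fit a cheap polynomial surrogate to a few sampled runs.
xs = np.linspace(-2.0, 2.0, 9)
ys = expensive_sim(xs)
surrogate = np.poly1d(np.polyfit(xs, ys, deg=3))
```

Search can then evaluate the surrogate thousands of times, reserving real simulator executions for the most critical or most uncertain candidates.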
Example surrogate model types: Polynomial Regression (PR), Radial Basis Function (RBF), Kriging (KR)
SAMOTA
Steps:
1. Initialization
2. Global search
3. Local search
4. Global search
5. Local search
6. …
(Diagram: initialization executes the simulator to populate a database; global search applies a many-objective search algorithm to global surrogate models, executing the simulator on the most critical and most uncertain test cases; local search clusters the top points, builds a local SM per cluster, and applies a single-objective search algorithm to find the most critical test cases; the result is a minimal test suite.)
Conclusions
Automated Testing
• Testing remains the key verification mechanism in practice to
demonstrate trustworthiness and understand limitations (e.g., safety)
• Different levels of testing: model, integration, system
• Automation is key at all levels and requires different strategies, e.g.,
sequences of actions and states versus single inputs (e.g., images)
• Key recurring strategy: combine evolutionary computing and machine
learning
• Concrete industrial problems could not have been addressed otherwise
• Scalability remains the main challenge
Why Search?
• Models are hard to analyze and therefore not amenable to automated reasoning (though there is work)
• Models exhibit complex, non-linear behaviors
• AI-enabled systems often operate in open contexts described
by many dimensions (AI for autonomy)
• Testing AI-enabled systems usually involves interactions with
complex simulators
• Search is therefore a natural approach in such contexts
References
Selected References
• Ul Haq et al. "Automatic Test Suite Generation for Key-points Detection DNNs Using Many-Objective Search", ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2021
• Ul Haq et al., “Efficient Online Testing for DNN-Enabled Systems using Surrogate-Assisted and Many-Objective Optimization”
IEEE/ACM ICSE 2022
• Ul Haq et al., “Many-Objective Reinforcement Learning for Online Testing of DNN-Enabled Systems”, ArXiv report,
https://arxiv.org/abs/2210.15432
• Fahmy et al. "Supporting DNN Safety Analysis and Retraining through Heatmap-based Unsupervised Learning" IEEE Transactions
on Reliability, Special section on Quality Assurance of Machine Learning Systems, 2021
• Fahmy et al. "Simulator-based explanation and debugging of hazard-triggering events in DNN-based safety-critical systems”,
ArXiv report, https://arxiv.org/pdf/2204.00480.pdf, ACM TOSEM, 2022
• Attaoui et al., “Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction and Clustering”, ArXiv report,
https://arxiv.org/pdf/2201.05077v1.pdf, ACM TOSEM, 2022
• Aghababaeyan et al., “Black-Box Testing of Deep Neural Networks through Test Case Diversity”,
https://arxiv.org/pdf/2112.12591.pdf
• Zolfagharian et al., “Search-Based Testing Approach for Deep Reinforcement Learning Agents”,
https://arxiv.org/pdf/2206.07813.pdf
Selected ML Testing References
• Goodfellow et al. "Explaining and harnessing adversarial examples." arXiv preprint arXiv:1412.6572 (2014).
• Zhang et al. "DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems." In 33rd
IEEE/ACM International Conference on Automated Software Engineering (ASE), 2018.
• Tian et al. "DeepTest: Automated testing of deep-neural-network-driven autonomous cars." In Proceedings of the 40th
international conference on software engineering, 2018.
• Li et al. “Structural Coverage Criteria for Neural Networks Could Be Misleading”, IEEE/ACM 41st International Conference on
Software Engineering: New Ideas and Emerging Results (NIER)
• Kim et al. "Guiding deep learning system testing using surprise adequacy." In IEEE/ACM 41st International Conference on Software
Engineering (ICSE), 2019.
• Ma et al. "DeepMutation: Mutation testing of deep learning systems." In 2018 IEEE 29th International Symposium on Software
Reliability Engineering (ISSRE), 2018.
• Zhang et al. "Machine learning testing: Survey, landscapes and horizons." IEEE Transactions on Software Engineering (2020).
• Riccio et al. "Testing machine learning based systems: a systematic mapping." Empirical Software Engineering 25, no. 6 (2020)
• Gerasimou et al., “Importance-Driven Deep Learning System Testing”, IEEE/ACM 42nd International Conference on Software
Engineering, 2020
