ICSE 2023 Keynote
Abstract: With the advent of deep learning, AI components have achieved unprecedented performance on complex, human-competitive tasks, such as image, video, text and audio processing. Hence, they are increasingly integrated into sophisticated software systems, some of which (e.g., autonomous vehicles) are required to deliver certified dependability warranties. In this talk, I will consider the unique features of AI-based systems and of the faults possibly affecting them, in order to revise the testing fundamentals and redefine the overall goal of testing, taking a statistical view on the dependability warranties that can actually be delivered. Then, I will consider the key elements of a revised testing process for AI-based systems, including the test oracle and the test input generation problems. I will also introduce the notion of runtime supervision, to deal with unexpected error conditions that may occur in the field. Finally, I will identify the future steps that are essential to close the loop from testing to operation, proposing an empirical framework that reconnects the output of testing to its original goals.
The Road Toward Dependable AI Based Systems
1. Paolo Tonella
Software Institute, Università della Svizzera italiana, Lugano, Switzerland
The Road Toward Dependable
AI Based Systems
https://www.pre-crime.eu
2. Precrime
ERC project: https://pre-crime.eu
Self-assessment Oracles for Anticipatory Testing
“Testing the unexpected”
• Anticipate failures due to unexpected conditions
• Identify unexpected execution contexts at runtime
• Generate valid, but unexpected execution scenarios
5. Deep neural networks
Programming model
[Figure: a deep neural network with weights W, learned from a train/validation/test data split]
• Data-driven programming / no control logic
• Numeric input / numeric output
• Non-deterministic training process
[Figure: the trained network outputs is_red = 0.9 for one input and is_red = 0.1 for another, similar one]
Is this a bug?
6. Deep neural networks
Learned behaviour vs programmed behaviour
[Figure: learned behaviour (training set + network architecture + hyperparameters + training process yield a model with acc = 0.99) vs programmed behaviour (code written against a spec, e.g. for x ≥ 0: y・y = x). The same misprediction, is_red = 0.1 instead of 0.9: is this a bug?]
7. Testing fundamentals
Goal of testing
[Figure: Inputs → Program (Fault) → Execution → Failure, detected by an Oracle, leading to Repair]
Repair faults, exposed by choosing inputs whose execution violates oracles
DL faults and failures
Q1: What is a DL fault?
Q2: What is a unique DL failure?
Q3: What does it mean to repair a DL fault?
DL inputs
Q4: How to generate DL test inputs?
Q5: How to prioritize/select DL test inputs?
DL oracles
Q6: When is a DL program correct?
Q7: How to prevent runtime oracle violations?
9. Taxonomy of Real Faults in
Deep Learning Systems
[Figure: taxonomy tree of real faults; counts are (repository artifacts + interviews/survey) occurrences]
• MODEL (29+45)
  • Model Type & Properties (6+20): wrong selection of model, suboptimal network structure, wrong network architecture, wrong model/weights initialisation, multiple initialisations of CNN
  • Layers (23+25)
    • Missing/Redundant/Wrong Layer (7+5): missing dropout / normalisation / softmax / dense / flatten / average pooling layer, redundant softmax layer, wrong layer type, wrong type of pooling layer
    • Layer Properties (13+18): wrongly defined input/output shape, wrong input sample size for linear layer, layers' dimensions mismatch, suboptimal number of neurons in a layer, wrong filter size for a convolutional layer, wrong amount & type of pooling in convolutional layer, bias needed in a layer
    • Activation Function (3+2): wrong type of activation function, missing softmax/relu activation function
• GPU USAGE (10+1): missing transfer of data to GPU, missing destination GPU device, wrong reference to GPU device, wrong data parallelism on GPUs, calling unsupported operations on CUDA tensors, conversion to CUDA tensor inside the training/test loop, wrongly implemented data transfer function (CPU-GPU), wrong tensor transfer to GPU, GPU tensor used instead of CPU tensor, incorrect state sharing
• API (20+0): wrong API usage, missing API call, deprecated API, wrong usage of image decoding / placeholder restoration API, missing argument scoping, missing global variables initialisation, wrong reference to operational graph, wrong position of data shuffle operation
• TENSORS & INPUTS (53+20)
  • Wrong Tensor Shape (21+5): missing squeeze, wrong indexing, wrong output padding, tensor shape mismatch, other
  • Wrong Input (32+15)
    • Wrong Shape of Input Data (22+7): for a method, for a layer
    • Wrong Type of Input Data (5+3): for a method, for a layer; incompatible tensor type
    • Wrong Input Format (5+5): wrong input format (e.g. for RNN), wrong format of passed weights
• TRAINING (37+160)
  • Hyperparameters (10+26): suboptimal learning rate / batch size / number of epochs, data batching required, wrongly implemented data batching, missing regularisation (loss and weight), suboptimal hyperparameter tuning
  • Loss Function (7+16): wrong selection / calculation of loss function, missing loss function, missing masking of invalid values to zero
  • Validation/Testing (2+4): missing validation set, incorrect train/test data split, wrong performance metric
  • Preprocessing of Training Data (13+37)
    • Missing Preprocessing (11+22): e.g. subsampling, normalisation, input scaling, resize of the images, oversampling, encoding of categorical data, padding, data shuffling, interpolation
    • Wrong Preprocessing (2+15): e.g. pixel encoding, padding, text segmentation, normalisation, positional encoding, character encoding
  • Optimiser (3+3): wrong optimisation function, epsilon for Adam optimiser too low
  • Training Data Quality (2+60): not enough / low quality / unbalanced training data, wrong labels for training data, overlapping output classes, too many output categories, small range of values for a feature, wrong selection of features, discarding important features
  • Training Process (0+14): model too big to fit into available memory, reference to non-existing checkpoint, missing / redundant data augmentation, wrong management of memory resources
Taxonomy of real faults
Repository mining + interviews with developers
Stack Overflow: 477 discussions
GitHub: 271 issues and pull requests
GitHub: 311 commits
Interviews: 20 researchers and practitioners
Validation survey: 21 researchers and practitioners
13. DeepCrime
Mutation testing of DL systems
[Figure: DeepCrime mutates the training set, network architecture or hyperparameters, re-trains the model, and compares accuracies (e.g. 95% vs 85%)]
https://github.com/deepcrime-tool/DeepCrime
14. DeepCrime
Mutation operators
Training Data
Change labels of training data TCL
Remove portion of training data TRD
Unbalance training data TUD
Add noise to training data TAN
Make output classes overlap TCO
Hyperparameters
Change batch size HBS
Decrease learning rate HLR
Change number of epochs HNE
Disable data batching HDB
Activation Function
Change activation function ACH
Remove activation function ARM
Add activation function to layer AAL
Regularisation
Add weights regularisation RAW
Change weights regularisation RCW
Remove weights regularisation RRW
Change dropout rate RCD
Change patience parameter RCP
Weights
Change weights initialisation WCI
Add bias to a layer WAB
Remove bias from a layer WRB
Loss Function
Change loss function LCH
Optimisation Function
Change optimisation function OCH
Change gradient clipping OCG
Validation
Remove validation set VRM
Uses of DL mutation:
• Evaluate adequacy of a test set
• Compare test input generators
• Benchmark repair tools
• Assess test ordering produced by prioritization techniques
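As a concrete illustration, a training-data operator in the spirit of TCL (change labels of training data) might look as follows; the function name, the 5% mutation rate and the toy labels are illustrative assumptions, not DeepCrime's actual implementation.

```python
import random

def change_labels(labels, classes, rate=0.05, seed=42):
    """TCL-style mutation sketch: flip a fraction of training labels
    to a different class, so the mutant is trained on corrupted data."""
    rng = random.Random(seed)
    mutated = list(labels)
    n_mutants = max(1, int(rate * len(mutated)))
    for i in rng.sample(range(len(mutated)), n_mutants):
        wrong = [c for c in classes if c != mutated[i]]
        mutated[i] = rng.choice(wrong)
    return mutated

labels = [0, 1, 2, 1, 0, 2, 1, 0] * 25   # 200 toy labels
mutated = change_labels(labels, classes=[0, 1, 2], rate=0.05)
changed = sum(a != b for a, b in zip(labels, mutated))
print(changed)  # 10 labels flipped (5% of 200)
```

A mutant is killed when the accuracy of the re-trained model drops significantly below that of the original model on the same test set.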
15. DL failures
Failure counting is not trivial
[Figure: one failing input misclassified as 6 vs four near-duplicate failing inputs all misclassified as 6: raw failure counts can be misleading]
Diversity matters!
DL faults and failures
Q1: What is a DL fault?
Q2: What is a unique DL failure?
Q3: What does it mean to repair a DL fault?
16. Feature maps
Failures grouped by features
[Figure: feature maps for MNIST (luminosity, minimum radius) and BeamNG (orientation, mean lateral position); effectiveness is measured as the number of feature-map cells filled with misbehaviours, not as the raw number of failures]
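Counting filled feature-map cells instead of raw failures can be sketched as follows; the feature names, value ranges and cell granularity are illustrative assumptions.

```python
# Group failures by feature-map cells instead of counting raw failures.
# Feature names and cell size are illustrative (cf. MNIST features such
# as luminosity and minimum radius of the digit stroke).
def cell(features, step=0.25):
    """Map a feature vector to a discrete feature-map cell."""
    return tuple(int(v // step) for v in features)

# Hypothetical failing inputs described by (luminosity, min radius) in [0, 1]
failures = [(0.10, 0.80), (0.12, 0.82), (0.11, 0.79),  # near-duplicates
            (0.60, 0.30), (0.90, 0.55)]

filled = {cell(f) for f in failures}
print(len(failures), len(filled))  # 5 failures, but only 3 distinct cells
```

Three of the five failures fall into the same cell, so they count once: diversity in the feature space, not sheer failure count, drives the metric.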
17. DL faults and failures
Fault repair
DL faults and failures
Q1: What is a DL fault?
Q2: What is a unique DL failure?
Q3: What does it mean to repair a DL fault?
Mutation operators turned into repair operators (struck-through mutation verb → repair verb):
Training Data
Change → Fix labels of training data TCL
Remove → Augment training data TRD
Unbalance training data TUD
Add noise to → Remove noise from training data TAN
Make output classes overlap → Fix output classes that overlap TCO
Hyperparameters
Change → Fine-tune batch size HBS
Decrease → Fine-tune learning rate HLR
Change → Fine-tune number of epochs HNE
Disable → Fine-tune data batching HDB
Activation Function
Change → Change activation function ACH
Remove → Add activation function ARM
Add → Remove activation function to layer AAL
Regularisation
Add → Remove weights regularisation RAW
Change → Change weights regularisation RCW
Remove → Add weights regularisation RRW
Change → Fine-tune dropout rate RCD
Change → Fine-tune patience parameter RCP
Weights
Change → Fine-tune weights initialisation WCI
Add → Remove bias to a layer WAB
Remove → Add bias from a layer WRB
Loss Function
Change → Change loss function LCH
Optimisation Function
Change → Change optimisation function OCH
Change → Change gradient clipping OCG
Validation
Remove → Add validation set VRM
18. DL architecture repair
SE vs DL community
[AutoTrainer] Xiaoyu Zhang, Juan Zhai, Shiqing Ma, Chao Shen: AUTOTRAINER: An Automatic DNN Training Problem Detection and Repair System. ICSE, pp. 359-371, 2021.
[Hebo] A. I. Cowen-Rivers, W. Lyu, R. Tutunov, Z. Wang, A. Grosnit, R. R. Griffiths, A. M. Maraval, H. Jianye, J. Wang, J. Peters, Haitham Bou Ammar: HEBO: Pushing the limits of sample-efficient hyper-parameter optimisation. JAIR, vol. 74, pp. 1269-1349, 2022.
[Bohb] S. Falkner, A. Klein, and F. Hutter: BOHB: Robust and efficient hyperparameter optimization at scale. ICML, pp. 1437-1446, 2018.
Bayesian hyperparameter optimization [Hebo, Bohb]: starting from initial hyperparameters, repeat (fit surrogate → sample hyperparameters → train & evaluate model) until the training budget is over.
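The fit-sample-evaluate loop can be sketched as follows; the toy accuracy surface and the 1-nearest-neighbour surrogate are illustrative stand-ins for a real validation run and for the Gaussian-process/TPE surrogates used by tools like HEBO and BOHB.

```python
import math
import random

rng = random.Random(0)

def evaluate(lr, batch_log2):
    """Stand-in for 'train & evaluate model': a toy accuracy surface
    peaking near learning rate 1e-3 and batch size 2**5."""
    return math.exp(-((math.log10(lr) + 3) ** 2 + (batch_log2 - 5) ** 2 / 8))

def surrogate(history, lr, batch_log2):
    """Trivial 1-NN surrogate: predicted score of a candidate is the
    score of the closest configuration evaluated so far."""
    nearest = max(history, key=lambda h: -((math.log10(h[0]) - math.log10(lr)) ** 2
                                           + (h[1] - batch_log2) ** 2))
    return nearest[2]

def sample():
    """Sample hyperparameters: log-uniform learning rate, log2 batch size."""
    return 10 ** rng.uniform(-6, 0), rng.uniform(0, 10)

history = [(lr, b, evaluate(lr, b)) for lr, b in [sample() for _ in range(3)]]
for _ in range(20):                              # until training budget is over
    candidates = [sample() for _ in range(50)]   # sample hyperparameters
    lr, b = max(candidates, key=lambda c: surrogate(history, *c))  # query surrogate
    history.append((lr, b, evaluate(lr, b)))     # train & evaluate model
best = max(history, key=lambda h: h[2])
print(round(best[2], 2))
```

The structure (surrogate-guided selection of the next configuration to evaluate) is the point; a real surrogate also models uncertainty to balance exploration and exploitation.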
Rule-based repair [AutoTrainer]
Symptom → Solution
S1: Add batch normalisation layer
S2: Substitute activation function
S3: Add gradient clip
S4: Substitute initialiser
S5: Adjust batch size
S6: Adjust learning rate
S7: Substitute optimizer
Hyperparameters
H1: activation function
H2: optimizer
H3: batch size
H4: number of epochs
H5: loss function
H6: learning rate
H7: weight initialization
H8: batching (enable/disable)
H9: number of neurons in a layer
19. DL architecture repair
Are we there yet?
IR = Improvement Rate
“The random baseline produces comparable or better patches than other repair techniques, but the effectiveness of tools varies depending on the fault, which justifies the need for future work to find more efficient ways of exploring the hyperparameter space.”
IR = (M_patch − M_fault) / (M_fix − M_fault)
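IR normalizes the metric gain of the patched model (M_patch) over the faulty one (M_fault) by the gain of the developer-fixed model (M_fix). A quick sketch with hypothetical accuracies:

```python
def improvement_rate(m_patch, m_fault, m_fix):
    """IR = (M_patch - M_fault) / (M_fix - M_fault):
    1.0 means the patch matches the developer fix, 0.0 means no improvement."""
    return (m_patch - m_fault) / (m_fix - m_fault)

# Hypothetical accuracies: faulty model 0.70, patched model 0.85, developer fix 0.90
ir = improvement_rate(0.85, 0.70, 0.90)
print(round(ir, 2))  # 0.75: the patch recovers 75% of the developer fix's gain
```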
20. DL faults and failures
Take-away messages
DL faults and failures
Q1: What is a DL fault?
Q2: What is a unique DL failure?
Q3: What does it mean to repair a DL fault?
DL faults affect data, model architecture, hyper-parameters, training process, not just the code.
When exposing/counting failures, diversity in the feature space matters. Exposing more failures does not necessarily mean higher effectiveness.
The landscape of repair operations is not well understood and rule-based/Bayesian techniques are not substantially superior to random exploration of such a space.
21. DL inputs
Test input generation
DL inputs
Q4: How to generate DL test inputs?
Q5: How to prioritize/select DL test inputs?
[Figure: DeepJanus explores the validity domain to find frontier input pairs: similar inputs on opposite sides of the frontier of behaviours (e.g. a digit still classified as 5 next to one misclassified), shown for low-quality (LQ) and high-quality (HQ) models]
23. DL inputs
Auto-encoder validity
[Figure: an autoencoder (encoder → latent space → decoder) trained on the training set; an input is flagged as anomalous/invalid when its reconstruction error e exceeds a threshold: (e > θ)?]
“All three testing techniques [DeepExplore, DLFuzz, DeepConcolic] studied produced significant numbers of [autoencoder] invalid tests”
Input validity approximated as Auto-encoder validity
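A minimal sketch of auto-encoder validity on synthetic data, using a linear (PCA) autoencoder as a stand-in for a trained deep one; the 1-D latent space and the 0.99 quantile threshold are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "training set": points near a 1-D manifold inside 5-D space
latent = rng.normal(size=(500, 1))
basis = rng.normal(size=(1, 5))
train = latent @ basis + 0.05 * rng.normal(size=(500, 5))

# Linear autoencoder via PCA: encoder/decoder share the top-k components
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:1]                     # k = 1 latent dimension

def reconstruction_error(x):
    code = (x - mean) @ components.T    # encode to latent space
    recon = code @ components + mean    # decode back
    return float(np.sum((x - recon) ** 2))

errors = [reconstruction_error(x) for x in train]
theta = np.quantile(errors, 0.99)       # threshold set on nominal data

nominal = latent[0] * basis[0] + 0.05 * rng.normal(size=5)  # on-manifold input
anomaly = 3 * rng.normal(size=5)                            # off-manifold input
print(reconstruction_error(anomaly) > theta)  # True
```

Inputs the autoencoder cannot reconstruct well are deemed invalid; this only approximates human validity judgments, as the following slide discusses.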
24. DL inputs
Validity of automatically generated inputs
Test Generator Comparison
Auto-encoder validity
Human validity
Label preservation
Automated vs Human
Test Generators
Root causes of invalidity:
• Raw input generators: corruption of a large number of pixels
• Latent space generators: lack of continuity in the latent space
• Model based generators: low quality model
• All generators: label preservation not ensured
25. DL inputs
Test input selection and prioritization
DL inputs
Q4: How to generate DL test inputs?
Q5: How to prioritize/select DL test inputs?
[Figure: active-learning loop: train the model on the train set → prioritize and select inputs from the active set → label the selected data → add it to the train set]
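The prioritize-and-select step can be driven by model uncertainty, sketched below; the softmax outputs, input names and entropy metric are illustrative (other uncertainty measures, e.g. MC-dropout variance, fit the same loop).

```python
import math

def entropy(probs):
    """Predictive entropy of a softmax output: high when the model is unsure."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical softmax outputs of the current model on the unlabeled active set
active_set = {
    "img_a": [0.98, 0.01, 0.01],   # confident prediction
    "img_b": [0.40, 0.35, 0.25],   # uncertain
    "img_c": [0.70, 0.20, 0.10],
    "img_d": [0.34, 0.33, 0.33],   # most uncertain
}

# Prioritize by descending uncertainty, then select a labeling budget of 2
ranked = sorted(active_set, key=lambda k: entropy(active_set[k]), reverse=True)
selected = ranked[:2]
print(selected)  # ['img_d', 'img_b']
```

Only the selected inputs are sent for (expensive) manual labeling before being added to the train set.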
27. DL inputs
Evaluation metrics for test input prioritization
[Figure: two orderings of six failing inputs (digits 6, 9, 8, 9, 6, 5) exposing three distinct faults; the improved ordering raises APFD from 9/18 to 15/18 (+50%), yet any permutation within the groups gives the same APFD: APFD is actually measuring just top-3 accuracy]
Diversity matters!
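For reference, the standard APFD formula can be computed as below; the fault-to-test mapping is hypothetical, and the numbers are not meant to reproduce the slide's simplified 9/18 and 15/18 values. Note that APFD rewards finding distinct faults early but says nothing about how diverse the exposed failures are.

```python
def apfd(ordering, failing):
    """Standard APFD: 1 - (sum of first-detection positions)/(n*m) + 1/(2n),
    where positions are 1-based indices of the first test exposing each fault."""
    n, m = len(ordering), len(failing)
    first = [min(ordering.index(t) + 1 for t in tests)
             for tests in failing.values()]
    return 1 - sum(first) / (n * m) + 1 / (2 * n)

# Hypothetical mapping from 3 faults to the tests that expose them
failing = {"f1": ["t3"], "f2": ["t5"], "f3": ["t6"]}
late = ["t1", "t2", "t3", "t4", "t5", "t6"]   # faults found at positions 3, 5, 6
early = ["t3", "t5", "t6", "t1", "t2", "t4"]  # faults found at positions 1, 2, 3
print(round(apfd(late, failing), 3), round(apfd(early, failing), 3))  # 0.306 0.75
```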
28. DL inputs
Take-away messages
Generating DL inputs that trigger failures
is easy, but ensuring their validity and
assigning them a valid label is not.
Model uncertainty can be used to select and
prioritize test cases.
Traditional test case prioritization evaluation metrics, like
APFD, do not account for DL failure diversity.
DL inputs
Q4: How to generate DL test inputs?
Q5: How to prioritize/select DL test inputs?
29. DL oracles
System level quality
DL oracles
Q6: When is a DL program correct?
Q7: How to prevent runtime oracle violations?
lateral position
headway position
steering
braking
traffic signs
O(x) = m1(x) ≤ t1 ∧ … ∧ mn(x) ≤ tn
Oracles depend on thresholds
∀x ∈ X: m^D_1(x) ≤ t*_1 ∧ … ∧ m^D_n(x) ≤ t*_n
No FP in (manually) verified executions X
t*_1, …, t*_n = arg max_{t1,…,tn} |{ j ∈ M | ∃x: m^{Mj}_1(x) > t1 ∨ … ∨ m^{Mj}_n(x) > tn }|
No/few FN when executing on faulty models (max mutants killed)
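A minimal sketch of this threshold-selection scheme, with hypothetical metrics and values: the tightest thresholds that admit every verified execution guarantee no false positives on X, and a mutant is killed when any of its metrics exceeds its threshold.

```python
# Threshold selection sketch for a system-level oracle
# O(x) = m1(x) <= t1 and ... and mn(x) <= tn.
# Setting each ti to the max of mi over the verified executions X yields
# no false positives on X with the tightest possible thresholds,
# maximizing the mutants whose executions violate the oracle.

verified_X = [  # metric vectors (e.g. lateral position, headway) on correct runs
    (0.8, 1.2), (1.1, 0.9), (0.9, 1.4),
]
mutants = {     # worst-case metric vectors observed on faulty models
    "m1": (2.5, 1.0),
    "m2": (1.0, 1.3),   # survives: stays within both thresholds
    "m3": (1.2, 3.0),
}

thresholds = tuple(max(run[i] for run in verified_X) for i in range(2))
killed = [name for name, metrics in mutants.items()
          if any(v > t for v, t in zip(metrics, thresholds))]
print(thresholds, killed)  # (1.1, 1.4) ['m1', 'm3']
```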
30. DL oracles
Runtime oracle
DL oracles
Q6: When is a DL program correct?
Q7: How to prevent runtime oracle violations?
[Figure: auto-encoder based anomaly detection for autonomous driving: an autoencoder (encoder → latent space → decoder) trained on the train set yields reconstruction errors; a gamma distribution fitted to their histogram (checked via CDF and Q-Q plot) gives the threshold θ that the supervisor uses to trigger a safe disengagement]
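The gamma-based thresholding can be sketched as follows; the reconstruction errors are synthetic, and the 0.999 quantile (i.e. a nominal false-alarm probability of 0.001) is an illustrative choice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Reconstruction errors of the supervisor's autoencoder on nominal,
# training-like data (synthetic values for illustration)
errors = rng.gamma(shape=4.0, scale=0.01, size=2000)

# Fit a gamma distribution to the nominal errors and place the alarm
# threshold theta at a small false-alarm probability
a, loc, scale = stats.gamma.fit(errors, floc=0)
theta = stats.gamma.ppf(0.999, a, loc=loc, scale=scale)

print(errors.mean() < theta)  # True: typical errors fall well below theta
```

At runtime, an input with reconstruction error above θ is treated as anomalous and the supervisor initiates a safe disengagement.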
31. DL oracles
Runtime oracle: autoencoders
DL oracles
Q6: When is a DL program correct?
Q7: How to prevent runtime oracle violations?
[Figure: original vs reconstructed images for nominal and anomalous inputs; the alarm fires when the reconstruction error e > θ]
32. DL oracles
Runtime oracle: uncertainty estimation
DL oracles
Q6: When is a DL program correct?
Q7: How to prevent runtime oracle violations?
github.com/testingautomated-usi/uncertainty-wizard
33. DL oracles
Take-away messages
DL oracles should capture system level quality,
not just extreme cases of misbehaviours.
Runtime supervisors are needed to deal with rare or
unexpected conditions in which misbehaviours are
possible.
DL oracles
Q6: When is a DL program correct?
Q7: How to prevent runtime oracle violations?
Thresholds for system level oracles should ensure
no FP and few FN on mutants.
34. A framework for DL testing
Dev-to-production data shift
Goal
Repair faults, exposed by choosing inputs
whose execution violates oracles, to increase
reliability measured in production
[Figure: the testing loop (Inputs → Program/Fault → Execution → Failure ← Oracle → Repair) extended with a production test set, the production environment, and a reliability measure]
Boolean Correctness ⟹ System Reliability
35. A framework for DL testing
Implications for researchers
Effectiveness of test input generation, test prioritization and DL model repair should be measured as reliability increase, on production data.
We need ways to collect/simulate production data.
We need techniques to estimate reliability in a sample-efficient way.
Correctness, i.e. no error on the test set
or 100% accuracy, is not a realistic goal.
Production time reliability should be
estimated instead.
36. A framework for DL testing
Putting all together
[Figure: the extended testing loop (inputs, program, fault, execution, failure, oracle, repair; production test set, production environment, reliability), annotated with the take-away messages]
DL faults affect data, model architecture, hyper-parameters, training process, not just the code.
When exposing/counting failures, diversity in the feature space matters. Exposing more failures does not necessarily mean higher effectiveness.
Generating DL inputs that trigger failures
is easy, but ensuring their validity and
assigning them a valid label is not.
DL oracles should capture system
level quality, not just extreme
cases of misbehaviours.
Runtime supervisors are needed to deal with rare or
unexpected conditions in which misbehaviours are
possible.
Effectiveness of test input generation, test prioritization and DL model repair should be measured as reliability increase, on production data.
37. Paolo Tonella
Software Institute, Università della Svizzera italiana, Lugano, Switzerland
The Road Toward Dependable
AI Based Systems