Paolo Tonella
Software Institute, Università della Svizzera italiana, Lugano, Switzerland
The Road Toward Dependable
AI Based Systems
1
https://www.pre-crime.eu
Precrime
ERC project: https://pre-crime.eu
2
Self-assessment Oracles for Anticipatory Testing
“Testing the unexpected”
• Anticipate failures due to unexpected conditions
• Identify unexpected execution contexts at runtime
• Generate valid, but unexpected execution scenarios
Deep neural networks
High accuracy on human competitive tasks
3
Deep neural networks
Integrated into complex applications
4
Deep neural networks
Programming model
5
0.9
0.1
Train set Validation set Test set
W
• Data driven programming/no control logic
• Numeric input/numeric output
• Non deterministic training process
is_red
0.1
0.9
is_red
Is this a bug?
Deep neural networks
Learned behaviour vs programmed behaviour
6
Training set
Network architecture
Hyperparameters
Training process
Code
Code acc = 0.99
x≥0 y・y = x
0.1
0.9
is_red
Is this a bug?
Testing fundamentals
Goal of testing
Inputs
Program
Fault
Execution
Failure
Oracle
7
Repair faults, exposed by choosing inputs
whose execution violates oracles
Repair
DL faults and failures
Q1: What is a DL fault?
Q2: What is a unique DL failure?
Q3: What does it mean to repair a DL fault?
DL inputs
Q4: How to generate DL test inputs?
Q5: How to prioritize/select DL test inputs?
DL oracles
Q6: When is a DL program correct?
Q7: How to prevent runtime oracle violations?
DL faults
8
Train/validation set
Hyperparameters
Code
Data
Architecture
IEEE std glossary: fault = “an incorrect step,
process or data definition in a computer program”
DL faults and failures
Q1: What is a DL fault?
Q2: What is a unique DL failure?
Q3: What does it mean to repair a DL fault?
Taxonomy of Real Faults in
Deep Learning Systems
MODEL
(29+45)
GPU USAGE
(10+1)
API (20+0)
TENSORS &
INPUTS (53+20)
TRAINING
(37+160)
Model Type &
Properties (6+20)
Layers
(23+25)
Activation
Function (3+2) wrong model
initialisation
(1+2)
wrong
weights
initialisation
(1+0)
wrong
selection of
model
(2+1)
multiple
initialisations
of CNN
(1+0)
suboptimal
network
structure
(1+15)
wrong
network
architecture
(0+2)
wrong type
of activation
function
(1+2)
missing
softmax
activation
function
(1+0)
missing relu
activation
function
(1+0)
wrong input
sample size
for linear layer
(5+0)
wrong
defined input
shape
(2+0)
wrong amount &
type of pooling
in convolutional
layer
(0+1)
wrong
defined
output shape
(3+1)
wrong defined
input & output
shape
(1+0)
wrong filter size
for a
convolutional
layer
(1+1)
layers'
dimensions
mismatch
(0+9)
suboptimal
number of
neurons in the
layer
(0+6)
bias needed
in a layer
(1+0)
missing
destination
GPU device
(1+0)
incorrect
state
sharing
(1+0)
wrong
reference to
GPU device
(2+0)
missing
transfer of
data to GPU
(1+0)
wrong data
parallelism on
GPUs
(1+0)
calling
unsupported
operations on
CUDA tensors
(1+0)
conversion to
CUDA tensor
inside the
training/test loop
(1+0)
wrongly
implemented data
transfer function
(CPU-GPU)
(0+1)
wrong
position of
data shuffle
operation
(1+0)
deprecated
API
(1+0)
wrong usage
of image
decoding API
(1+0)
wrong usage
of placeholder
restoration API
(1+0)
missing
argument
scoping
(1+0)
wrong tensor
transfer to
GPU
(1+0)
missing global
variables
initialisation
(3+0)
wrong API
usage
(10+0)
missing API
call
(1+0)
wrong
reference to
operational
graph
(1+0)
Wrong Tensor
Shape (21+5)
Wrong Input
(32+15)
wrong tensor
shape (missing
squeeze)
(5+0)
wrong tensor
shape (wrong
indexing)
(2+0)
wrong tensor
shape (wrong
output
padding)
(1+0)
wrong tensor
shape (other)
(13+3)
tensor shape
mismatch
(0+2)
Wrong Shape of
Input Data (22+7)
Wrong Type of
Input Data (5+3)
Wrong Input
Format (5+5)
wrong shape
of input data
for a method
(6+0)
wrong shape
of input data
for a layer
(16+2)
wrong shape
of input data
(0+5)
wrong type
of input data
for a method
(4+0)
wrong type
of input data
for a layer
(1+0)
wrong type
of input data
(0+3)
wrong input
format
(1+5)
wrong input
format for
RNN
(2+0)
wrong format
of passed
weights
(1+0)
incompatible
tensor type
(1+0)
Hyperparameters
(10+26)
suboptimal
number of
epochs
(2+4)
data batching
required
(2+7)
suboptimal
batch size
(2+1)
wrongly
implemented
data batching
(1+0)
missing
regularisation
(loss and
weight)
(0+1)
suboptimal
learning rate
(1+5)
suboptimal
hyper-
parameters
tuning
(2+8)
Loss Function
(7+16)
missing
masking of
invalid values
to zero
(1+0)
wrong loss
function
calculation
(5+4)
missing loss
function
(0+1)
wrong
selection of
loss function
(1+11)
Validation/Testing
(2+4)
missing
validation set
(1+0)
incorrect
train/test
data split
(0+3)
wrong
performance
metric
(1+1)
Preprocessing of Training
Data (13+37)
Missing
Preprocessing (11+22)
missing preprocessing
step (subsampling,
normalisation,
input scaling,
resize of the images,
oversampling,
encoding of categorical
data, padding,
... skip ...,
data shuffling,
interpolation)
Wrong Preprocessing
(2+15)
wrong preprocessing
step (pixel encoding,
padding,
text segmentation,
normalisation,
... skip ...,
positional encoding,
character encoding)
Optimiser (3+3)
wrong
optimisation
function
(1+3)
epsilon for
Adam
optimiser too
low
(2+0)
Training Data
Quality (2+60)
low quality
training data
(0+11)
not enough
training data
(0+14)
overlapping
output
classes in
training data
(0+1)
too many
output
categories
(0+1)
small range of
values for a
feature
(0+2)
unbalanced
training data
(0+11)
wrong
selection of
features
(1+6)
wrong labels
for training
data
(1+12)
discarding
important
features
(0+2)
Training Process
(0+14)
model too big to
fit into available
memory
(0+5)
reference for
non-existing
checkpoint
(0+1)
missing data
augmentation
(0+3)
redundant
data
augmentation
(0+1)
wrong
management of
memory
resources
(0+4)
missing
dropout layer
(1+1)
missing
normalisation
layer
(1+0)
missing
average
pooling layer
(0+1)
missing
softmax
layer
(1+0)
redundant
softmax
layer
(1+0)
wrong layer
type
(1+2)
wrong type
of pooling
layer
(0+1)
missing
dense layer
(1+0)
missing
flatten layer
(1+0)
Missing/Redundant/
Wrong Layer (7+5)
Layer Properties
(13+18)
GPU tensor is
used instead of
CPU tensor
(1+0)
Taxonomy of real faults
Repository mining + interviews with developers
9
Stack Overflow: 477 discussions
Github: 271 issues and pull requests
Github: 311 commits
Interviews: 20 researchers and practitioners
Validation Survey: 21
researchers and
practitioners
Model faults
Model (29+45)
10
Training faults
Training (37+160)
11
Training faults
Training (37+160)
12
DeepCrime
Mutation testing of DL systems
13
Training set
Network architecture
Hyperparameters
DeepCrime
Re-train
Accuracy = 85%
Accuracy = 95%
https://github.com/deepcrime-tool/DeepCrime
DeepCrime
Mutation operators
14
Training Data
Change labels of training data TCL
Remove portion of training data TRD
Unbalance training data TUD
Add noise to training data TAN
Make output classes overlap TCO
Hyperparameters
Change batch size HBS
Decrease learning rate HLR
Change number of epochs HNE
Disable data batching HDB
Activation
Function
Change activation function ACH
Remove activation function ARM
Add activation function to layer AAL
Regularisation
Add weights regularisation RAW
Change weights regularisation RCW
Remove weights regularisation RRW
Change dropout rate RCD
Change patience parameter RCP
Weights
Change weights initialisation WCI
Add bias to a layer WAB
Remove bias from a layer WRB
Loss function Change loss function LCH
Optimisation
Function
Change optimisation function OCH
Change gradient clipping OCG
Validation Remove validation set VRM
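As an illustration of what one operator from the list above could look like, here is a minimal sketch of a TCL-style mutation (change labels of training data). It is not DeepCrime's implementation: the function name, the `fraction` parameter and the choice of replacement labels are assumptions made for this example.

```python
import numpy as np

def change_labels(y, source_label, fraction, num_classes, seed=0):
    """Return a copy of y in which `fraction` of the samples labelled
    `source_label` are reassigned to a different, randomly chosen class.
    Hypothetical helper, not part of any tool's API."""
    rng = np.random.default_rng(seed)
    y_mut = np.array(y, copy=True)
    # Indices of all samples that currently carry the label to corrupt.
    candidates = np.flatnonzero(y_mut == source_label)
    n_mutate = int(round(fraction * len(candidates)))
    chosen = rng.choice(candidates, size=n_mutate, replace=False)
    # Replacement labels are drawn from every class except the original one.
    other = np.array([c for c in range(num_classes) if c != source_label])
    y_mut[chosen] = rng.choice(other, size=n_mutate)
    return y_mut
```

The mutant is then obtained by re-training the same architecture on the mutated labels and comparing its test accuracy with that of the original model.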
Uses of DL mutation
Evaluate adequacy of a test set
Compare test input generators
Benchmark repair tools
Assess test ordering produced by
prioritization techniques
DL failures
Failure counting is not trivial
15
❌
6
❌ ❌ ❌ ❌
6 6 6 6
Diversity matters!
DL faults and failures
Q1: What is a DL fault?
Q2: What is a unique DL failure?
Q3: What does it mean to repair a DL fault?
Feature maps
Failures grouped by features
16
Luminosity
Min
Radius
Orientation Mean Lateral Position
MNIST BeamNG
16
Number of feature map cells filled with misbehaviours
Number of failures
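The difference between counting failures and counting filled feature-map cells can be sketched as follows. The grid bounds, the number of bins and the two-feature layout are assumptions for this example; the feature names on the slide (luminosity, orientation, and so on) would take the place of the generic columns.

```python
import numpy as np

def filled_cells(features, bounds, bins):
    """Map each failing input to a cell of the feature map and return
    the set of distinct cells that contain at least one failure.

    features: one row per failing input, one column per feature
    bounds:   (low, high) range per feature
    bins:     number of cells per feature
    """
    cells = set()
    for row in np.asarray(features, dtype=float):
        cell = []
        for value, (low, high), n in zip(row, bounds, bins):
            idx = int((value - low) / (high - low) * n)
            cell.append(min(max(idx, 0), n - 1))  # clamp to the grid
        cells.add(tuple(cell))
    return cells
```

Three failures, two of which share almost the same feature values, fill only two cells: `len(filled_cells([[0.15, 0.15], [0.17, 0.17], [0.95, 0.95]], [(0, 1), (0, 1)], [10, 10]))` evaluates to 2 even though 3 failures were exposed.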
DL faults and failures
Fault repair
17
DL faults and failures
Q1: What is a DL fault?
Q2: What is a unique DL failure?
Q3: What does it mean to repair a DL fault?
Repair operations derived from the mutation operators (operator code on the right)
Training Data
Fix labels of training data TCL
Augment training data TRD
Unbalance training data TUD
Remove noise from training data TAN
Fix output classes that overlap TCO
Hyperparameters
Fine-tune batch size HBS
Fine-tune learning rate HLR
Fine-tune number of epochs HNE
Fine-tune data batching HDB
Activation
Function
Change activation function ACH
Add activation function ARM
Remove activation function from layer AAL
Regularisation
Remove weights regularisation RAW
Change weights regularisation RCW
Add weights regularisation RRW
Fine-tune dropout rate RCD
Fine-tune patience parameter RCP
Weights
Fine-tune weights initialisation WCI
Remove bias from a layer WAB
Add bias to a layer WRB
Loss function Change loss function LCH
Optimisation
Function
Change optimisation function OCH
Change gradient clipping OCG
Validation Add validation set VRM
DL architecture repair
SE vs DL community
18
[AutoTrainer] Xiaoyu Zhang, Juan Zhai, Shiqing Ma, Chao Shen: AUTOTRAINER: An Automatic DNN Training Problem Detection and Repair System. ICSE, pp. 359-371, 2021.
[Hebo] A. I. Cowen-Rivers, W. Lyu, R. Tutunov, Z. Wang, A. Grosnit, R. R. Griffiths, A. M. Maraval, H. Jianye, J. Wang, J. Peters, Haitham Bou Ammar: Hebo: Pushing the limits of sample-efficient hyper-parameter optimisation. JAIR, vol. 74, pp. 1269-1349, 2022.
[Bohb] S. Falkner, A. Klein, and F. Hutter: Bohb: Robust and efficient hyperparameter optimization at scale. ICML, pp. 1437-1446, 2018.
Fit surrogate
Sample hyperparameters
Train & evaluate model
Initial hyperparameters
Bayesian hyperparameter optimization [Hebo, Bohb]
Until training
budget is
over
Rule based repair [AutoTrainer]
Symptom → Solution
S1: Add batch normalisation layer
S2: Substitute activation function
S3: Add gradient clip
S4: Substitute initialiser
S5: Adjust batch size
S6: Adjust learning rate
S7: Substitute optimizer
Hyperparameters
H1: activation function
H2: optimizer
H3: batch size
H4: number of epochs
H5: loss function
H6: learning rate
H7: weight initialization
H8: batching (enable/disable)
H9: number of neurons in a layer
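For comparison with the Bayesian loop and the rule-based approach above, a random exploration of the same hyperparameter space can be sketched in a few lines. The search space values and the `train_and_evaluate` callback are hypothetical placeholders; in practice the callback would re-train the model with the sampled configuration and return a validation metric.

```python
import random

# Illustrative values only; a subset of the hyperparameters H1-H9.
SEARCH_SPACE = {
    "activation": ["relu", "tanh", "sigmoid"],
    "optimizer": ["sgd", "adam", "rmsprop"],
    "batch_size": [16, 32, 64, 128],
    "epochs": [5, 10, 20],
    "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1],
}

def random_search(train_and_evaluate, space, budget, seed=0):
    """Sample `budget` configurations uniformly at random and keep the best.

    train_and_evaluate: callable(config) -> score, higher is better
    """
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(budget):  # "until training budget is over"
        config = {name: rng.choice(values) for name, values in space.items()}
        score = train_and_evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

A Bayesian optimiser differs only in how the next configuration is chosen: it samples from a surrogate model fitted to the scores seen so far instead of sampling uniformly.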
DL architecture repair
Are we there yet?
19
IR = Improvement Rate
The random baseline produces comparable or better patches than
other repair techniques, but the effectiveness of tools varies
depending on the fault, which justifies the need for future work
to find more efficient ways of exploring the hyperparameter space.
$IR = \frac{M_{patch} - M_{fault}}{M_{fix} - M_{fault}}$
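The improvement rate can be computed directly from three measurements of the same metric M (for example test accuracy): on the faulty model, on the model patched by a repair tool, and on the model with the ground-truth fix. A minimal sketch, with a function name chosen for this example:

```python
def improvement_rate(m_patch, m_fault, m_fix):
    """IR = (M_patch - M_fault) / (M_fix - M_fault).

    IR = 0 means the patch brings no improvement over the faulty model,
    IR = 1 means it matches the ground-truth fix, IR > 1 means it beats it.
    """
    if m_fix == m_fault:
        raise ValueError("IR is undefined when the fix does not change the metric")
    return (m_patch - m_fault) / (m_fix - m_fault)
```

With a faulty accuracy of 0.60, a fixed accuracy of 0.90 and a patched accuracy of 0.75, the patch recovers about half of the gap (IR of roughly 0.5).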
DL faults and failures
Take-away messages
20
DL faults and failures
Q1: What is a DL fault?
Q2: What is a unique DL failure?
Q3: What does it mean to repair a DL fault?
DL faults affect data, model architecture,
hyper-parameters, training process, not
just the code.
When exposing/counting failures, diversity in the
feature space matters. Exposing more failures
does not necessarily mean higher effectiveness.
The landscape of repair operations is not well understood
and rule-based/bayesian techniques are not substantially
superior to random exploration of such a space.
DL inputs
Test input generation
21
DL inputs
Q4: How to generate DL test inputs?
Q5: How to prioritize/select DL test inputs?
Inputs classified as 5
Frontier of
behaviours
Validity
Domain
Frontier
Input Pair
Validity
Domain
Validity Domain
LQ HQ
LQ HQ
DeepJanus
DL inputs
Input validity
22
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
DL inputs
Auto-encoder validity
23
latent
space
Encoder Decoder
Training set
Anomaly set
[ ]
(e > 𝜃)?
“All three testing techniques [DeepXplore, DLFuzz,
DeepConcolic] studied produced significant numbers of
[autoencoder] invalid tests”
Input validity approximated as Auto-encoder validity
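The check (e > 𝜃)? can be sketched with a deliberately simple stand-in for the auto-encoder. The example below uses a linear encoder/decoder obtained from an SVD of the training set (equivalent to PCA), which is an assumption made to keep the code self-contained; a real setup would train a neural auto-encoder. The class and function names are hypothetical.

```python
import numpy as np

class LinearAutoencoder:
    """Linear stand-in for an auto-encoder: project onto the top
    `latent_dim` principal directions of the training set and back."""

    def __init__(self, latent_dim):
        self.latent_dim = latent_dim

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.latent_dim]
        return self

    def reconstruction_error(self, X):
        centered = np.asarray(X, dtype=float) - self.mean_
        latent = centered @ self.components_.T   # encoder
        decoded = latent @ self.components_      # decoder
        return np.mean((centered - decoded) ** 2, axis=1)

def is_valid(autoencoder, X, theta):
    """An input counts as valid when its reconstruction error e is at most theta."""
    return autoencoder.reconstruction_error(X) <= theta
```

One simple way to pick 𝜃 is from the reconstruction errors observed on the training set, for example their maximum or a high percentile; inputs generated by a test generator are then flagged as invalid when they reconstruct worse than that.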
DL inputs
Validity of automatically generated inputs
24
Test Generator Comparison
Auto-encoder validity
Human validity
Label preservation
Automated vs Human
Test Generators
Root causes of invalidity:
Raw input generators: corruption of large number of pixels
Latent space generators: lack of continuity in latent space
Model based generators: low quality model
All generators: label preservation not ensured
DL inputs
Test input selection and prioritization
25
DL inputs
Q4: How to generate DL test inputs?
Q5: How to prioritize/select DL test inputs?
Train
set
Train model
Active
set
Prioritize and
select
Selected
data
Label data
DL inputs
Uncertainty estimation
26
pip install uncertainty-wizard
github.com/testingautomated-usi/uncertainty-wizard
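To show the idea behind uncertainty-based prioritization without relying on the uncertainty-wizard API, here is a plain NumPy sketch. It assumes that several softmax outputs per input are already available, for example from repeated stochastic forward passes or from an ensemble; how they are obtained is outside the example, and the function names are hypothetical.

```python
import numpy as np

def predictive_entropy(sampled_probs):
    """Entropy of the mean softmax output.

    sampled_probs: array of shape (num_samples, num_inputs, num_classes)
    Returns one uncertainty score per input; higher means less certain.
    """
    mean_probs = np.mean(np.asarray(sampled_probs, dtype=float), axis=0)
    # The small constant avoids log(0) for classes with zero probability.
    return -np.sum(mean_probs * np.log(mean_probs + 1e-12), axis=1)

def prioritize(sampled_probs):
    """Indices of the inputs, most uncertain first."""
    return np.argsort(-predictive_entropy(sampled_probs), kind="stable")
```

Inputs at the front of the returned order are the ones to label and test first.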
DL inputs
Evaluation metrics for test input prioritization
27
6 9 8 9 6 5
1
2
3
APFD = 9/18
6 6 5 8 9 9
1
2
3 3
APFD = 15/18, +50%
3
3
1 1 1
{
Any permutation gives
the same APFD
APFD is actually
measuring just top-3
accuracy
Diversity matters!
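The insensitivity of APFD to failure diversity can be checked with the standard definition, APFD = 1 − (TF_1 + … + TF_m) / (n·m) + 1 / (2n), where TF_i is the position of the first test exposing failure i, n the number of tests and m the number of failures. The sketch below implements that textbook formula; the test and failure identifiers are made up for the example.

```python
def apfd(order, failures_detected_by):
    """Average Percentage of Faults Detected for a test ordering.

    order:                list of test ids in execution order
    failures_detected_by: dict failure id -> set of test ids exposing it
    """
    n = len(order)
    m = len(failures_detected_by)
    position = {test: i + 1 for i, test in enumerate(order)}  # 1-based
    first_positions = [
        min(position[t] for t in tests)
        for tests in failures_detected_by.values()
    ]
    return 1 - sum(first_positions) / (n * m) + 1 / (2 * n)
```

The score depends only on where the failing tests land in the order. Swapping the failing tests among themselves leaves it unchanged, and three near-identical failures score the same as three diverse ones.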
DL inputs
Take-away messages
28
Generating DL inputs that trigger failures
is easy, but ensuring their validity and
assigning them a valid label is not.
Model uncertainty can be used to select and
prioritize test cases.
Traditional test case prioritization evaluation metrics, like
APFD, do not account for DL failure diversity.
DL inputs
Q4: How to generate DL test inputs?
Q5: How to prioritize/select DL test inputs?
DL oracles
System level quality
29
DL oracles
Q6: When is a DL program correct?
Q7: How to prevent runtime oracle violations?
lateral position
headway position
steering
braking
traffic signs
O(x) = m1(x) ≤ t1 ∧ … ∧ mn(x) ≤ tn
Oracles depend on thresholds
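$\forall x \in X:\; m_1^{D}(x) \le t_1^{*} \wedge \dots \wedge m_n^{D}(x) \le t_n^{*}$
No FP in (manually) verified executions X
$t_1^{*}, \dots, t_n^{*} = \arg\max_{t_1,\dots,t_n} \left|\{\, j \in M \mid \exists x:\; m_1^{M_j}(x) > t_1 \vee \dots \vee m_n^{M_j}(x) > t_n \,\}\right|$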
No/few FN when executing on faulty models (max mutants killed)
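A possible reading of the two conditions above: loosening any threshold can only shrink the set of killed mutants, so among all thresholds that raise no false positive on the verified executions X, the per-metric maximum observed on X kills the most mutants. The sketch below follows that reading; the array layout and function names are assumptions for this example.

```python
import numpy as np

def calibrate_thresholds(nominal_metrics):
    """Tightest thresholds with no false positive on verified executions.

    nominal_metrics: array (num_executions, n) holding the metric values
                     of the original model D on the verified executions X
    """
    return np.max(np.asarray(nominal_metrics, dtype=float), axis=0)

def oracle(metrics, thresholds):
    """O(x): True when every metric stays within its threshold."""
    return bool(np.all(np.asarray(metrics, dtype=float) <= thresholds))

def killed_mutants(mutant_metrics, thresholds):
    """Mutants for which at least one execution violates the oracle.

    mutant_metrics: dict mutant id -> array (num_executions, n)
    """
    return {
        j for j, runs in mutant_metrics.items()
        if np.any(np.asarray(runs, dtype=float) > thresholds)
    }
```

Mutants that are not killed by these thresholds are the false negatives of the oracle; if too many survive, the metrics themselves are not sensitive enough.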
0.1
0.9
is_red
DL oracles
Runtime oracle
30
DL oracles
Q6: When is a DL program correct?
Q7: How to prevent runtime oracle violations?
Autonomous
driving
Supervisor
Safe disengagement
Train set
latent
space
Encoder Decoder
[Figure: histogram and theoretical gamma density (x-axis: data, y-axis: Density), CDF and Q−Q plot, with threshold 𝜃]
Train set
Auto-encoder based anomaly detection
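One way to turn the fitted gamma distribution into a supervisor threshold 𝜃 is sketched below. Several choices are assumptions made to keep the example NumPy-only: the gamma parameters are estimated by the method of moments rather than maximum likelihood, the quantile is approximated by sampling rather than by an inverse CDF, and `false_alarm_rate` is a hypothetical tuning parameter.

```python
import numpy as np

def fit_gamma_moments(errors):
    """Method-of-moments estimate of gamma shape and scale from the
    reconstruction errors observed on nominal (train set) inputs."""
    errors = np.asarray(errors, dtype=float)
    mean, var = errors.mean(), errors.var()
    return mean ** 2 / var, var / mean  # shape, scale

def gamma_threshold(shape, scale, false_alarm_rate, n_samples=200_000, seed=0):
    """Approximate the (1 - false_alarm_rate) quantile of the fitted
    gamma distribution by Monte Carlo sampling."""
    rng = np.random.default_rng(seed)
    samples = rng.gamma(shape, scale, size=n_samples)
    return float(np.quantile(samples, 1.0 - false_alarm_rate))

def supervise(error, theta):
    """Runtime decision: hand over control when the input looks anomalous."""
    return "disengage" if error > theta else "continue"
```

At runtime, each incoming frame is reconstructed, its error is compared with 𝜃, and the supervisor requests a safe disengagement when the error exceeds it.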
DL oracles
Runtime oracle: autoencoders
31
DL oracles
Q6: When is a DL program correct?
Q7: How to prevent runtime oracle violations?
(e > 𝜃)?
Nominal
Original Reconstructed
Anomalous
l
l
DL oracles
Runtime oracle: uncertainty estimation
32
DL oracles
Q6: When is a DL program correct?
Q7: How to prevent runtime oracle violations?
github.com/testingautomated-usi/uncertainty-wizard
DL oracles
Take-away messages
33
DL oracles should capture system level quality,
not just extreme cases of misbehaviours.
Runtime supervisors are needed to deal with rare or
unexpected conditions in which misbehaviours are
possible.
DL oracles
Q6: When is a DL program correct?
Q7: How to prevent runtime oracle violations?
Thresholds for system level oracles should ensure
no FP and few FN on mutants.
A framework for DL testing
Dev-to-production data shift
34
Goal
Repair faults, exposed by choosing inputs
whose execution violates oracles, to increase
reliability measured in production
Inputs
Program
Fault
Execution
Failure
Oracle
Repair
Production test set
Production environment
Reliability
Boolean Correctness ⟹ System Reliability
A framework for DL testing
Implications for researchers
35
Effectiveness of test input generation, test
prioritization and DL model repair should be
measured as reliability increase, on production data
We need ways to collect/simulate
production data
We need techniques to estimate
reliability in a sample efficient way
Correctness, i.e. no error on the test set
or 100% accuracy, is not a realistic goal.
Production time reliability should be
estimated instead.
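As a baseline for estimating production reliability from a sample, one option is a confidence interval on the fraction of sampled production inputs that the system handles without an oracle violation. The sketch below uses the Wilson score interval; treating reliability as a success proportion over independent samples is an assumption of this example, not a definition from the slides.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a success proportion.

    successes: sampled production inputs with no oracle violation
    trials:    total number of sampled production inputs
    z:         normal quantile (1.96 for roughly 95% confidence)
    """
    if trials <= 0:
        raise ValueError("at least one trial is required")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

The width of the interval shows how many labelled production samples are needed before two repair or test generation techniques can be told apart, which is where sample efficient estimators would help.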
A framework for DL testing
Putting all together
36
Inputs
Program
Fault
Execution
Failure
Oracle
Repair
Production test set
Production environment
Reliability
DL faults affect data, model architecture,
hyper-parameters, training process, not
just the code.
When exposing/counting failures, diversity in the
feature space matters. Exposing more failures
does not necessarily mean higher effectiveness.
Generating DL inputs that trigger failures
is easy, but ensuring their validity and
assigning them a valid label is not.
DL oracles should capture system
level quality, not just extreme
cases of misbehaviours.
Runtime supervisors are needed to deal with rare or
unexpected conditions in which misbehaviours are
possible.
Effectiveness of test input generation, test
prioritization and DL model repair should be
measured as reliability increase, on production data
Paolo Tonella
Software Institute, Università della Svizzera italiana, Lugano, Switzerland
The Road Toward Dependable
AI Based Systems
37

More Related Content

What's hot

Trees In The Database - Advanced data structures
Trees In The Database - Advanced data structuresTrees In The Database - Advanced data structures
Trees In The Database - Advanced data structuresLorenzo Alberton
 
Execute Automation Testing in 3 Steps
Execute Automation Testing in 3 StepsExecute Automation Testing in 3 Steps
Execute Automation Testing in 3 StepsExecuteAutomation
 
Test and Behaviour Driven Development (TDD/BDD)
Test and Behaviour Driven Development (TDD/BDD)Test and Behaviour Driven Development (TDD/BDD)
Test and Behaviour Driven Development (TDD/BDD)Lars Thorup
 
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteOpen Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteJulian Hyde
 
My first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdfMy first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdfAlkin Tezuysal
 
[Webinar] Qt Test-Driven Development Using Google Test and Google Mock
[Webinar] Qt Test-Driven Development Using Google Test and Google Mock[Webinar] Qt Test-Driven Development Using Google Test and Google Mock
[Webinar] Qt Test-Driven Development Using Google Test and Google MockICS
 
Php internal architecture
Php internal architecturePhp internal architecture
Php internal architectureElizabeth Smith
 
5 things you didn't know nginx could do
5 things you didn't know nginx could do5 things you didn't know nginx could do
5 things you didn't know nginx could dosarahnovotny
 
Elasticsearch: Introducing the wildcard field
Elasticsearch: Introducing the wildcard fieldElasticsearch: Introducing the wildcard field
Elasticsearch: Introducing the wildcard fieldElasticsearch
 
Qt test framework
Qt test frameworkQt test framework
Qt test frameworkICS
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scaladatamantra
 
JUnit- A Unit Testing Framework
JUnit- A Unit Testing FrameworkJUnit- A Unit Testing Framework
JUnit- A Unit Testing FrameworkOnkar Deshpande
 
Selenium Deep Dive
Selenium Deep DiveSelenium Deep Dive
Selenium Deep DiveAnand Bagmar
 
Rundeck Overview
Rundeck OverviewRundeck Overview
Rundeck OverviewRundeck
 
Mocking in Java with Mockito
Mocking in Java with MockitoMocking in Java with Mockito
Mocking in Java with MockitoRichard Paul
 

What's hot (20)

Trees In The Database - Advanced data structures
Trees In The Database - Advanced data structuresTrees In The Database - Advanced data structures
Trees In The Database - Advanced data structures
 
Execute Automation Testing in 3 Steps
Execute Automation Testing in 3 StepsExecute Automation Testing in 3 Steps
Execute Automation Testing in 3 Steps
 
Test and Behaviour Driven Development (TDD/BDD)
Test and Behaviour Driven Development (TDD/BDD)Test and Behaviour Driven Development (TDD/BDD)
Test and Behaviour Driven Development (TDD/BDD)
 
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache CalciteOpen Source SQL - beyond parsers: ZetaSQL and Apache Calcite
Open Source SQL - beyond parsers: ZetaSQL and Apache Calcite
 
Practical Object Oriented Models In Sql
Practical Object Oriented Models In SqlPractical Object Oriented Models In Sql
Practical Object Oriented Models In Sql
 
My first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdfMy first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdf
 
[Webinar] Qt Test-Driven Development Using Google Test and Google Mock
[Webinar] Qt Test-Driven Development Using Google Test and Google Mock[Webinar] Qt Test-Driven Development Using Google Test and Google Mock
[Webinar] Qt Test-Driven Development Using Google Test and Google Mock
 
Efficient Android Threading
Efficient Android ThreadingEfficient Android Threading
Efficient Android Threading
 
Models for hierarchical data
Models for hierarchical dataModels for hierarchical data
Models for hierarchical data
 
Php internal architecture
Php internal architecturePhp internal architecture
Php internal architecture
 
Java On CRaC
Java On CRaCJava On CRaC
Java On CRaC
 
5 things you didn't know nginx could do
5 things you didn't know nginx could do5 things you didn't know nginx could do
5 things you didn't know nginx could do
 
Elasticsearch: Introducing the wildcard field
Elasticsearch: Introducing the wildcard fieldElasticsearch: Introducing the wildcard field
Elasticsearch: Introducing the wildcard field
 
Qt test framework
Qt test frameworkQt test framework
Qt test framework
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
 
The basics of fluentd
The basics of fluentdThe basics of fluentd
The basics of fluentd
 
JUnit- A Unit Testing Framework
JUnit- A Unit Testing FrameworkJUnit- A Unit Testing Framework
JUnit- A Unit Testing Framework
 
Selenium Deep Dive
Selenium Deep DiveSelenium Deep Dive
Selenium Deep Dive
 
Rundeck Overview
Rundeck OverviewRundeck Overview
Rundeck Overview
 
Mocking in Java with Mockito
Mocking in Java with MockitoMocking in Java with Mockito
Mocking in Java with Mockito
 

Similar to The Road Toward Dependable AI Based Systems

Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016
Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016
Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016MLconf
 
MLConf 2016 SigOpt Talk by Scott Clark
MLConf 2016 SigOpt Talk by Scott ClarkMLConf 2016 SigOpt Talk by Scott Clark
MLConf 2016 SigOpt Talk by Scott ClarkSigOpt
 
Exploration of Supervised Machine Learning Techniques for Runtime Selection o...
Exploration of Supervised Machine Learning Techniques for Runtime Selection o...Exploration of Supervised Machine Learning Techniques for Runtime Selection o...
Exploration of Supervised Machine Learning Techniques for Runtime Selection o...Akihiro Hayashi
 
Asynchronous Methods for Deep Reinforcement Learning
Asynchronous Methods for Deep Reinforcement LearningAsynchronous Methods for Deep Reinforcement Learning
Asynchronous Methods for Deep Reinforcement LearningSsu-Rui Lee
 
From Black Box to Black Magic, Pycon Ireland 2014
From Black Box to Black Magic, Pycon Ireland 2014From Black Box to Black Magic, Pycon Ireland 2014
From Black Box to Black Magic, Pycon Ireland 2014Gloria Lovera
 
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Akihiro Hayashi
 
Performance and how to measure it - ProgSCon London 2016
Performance and how to measure it - ProgSCon London 2016Performance and how to measure it - ProgSCon London 2016
Performance and how to measure it - ProgSCon London 2016Matt Warren
 
Rajat Monga at AI Frontiers: Deep Learning with TensorFlow
Rajat Monga at AI Frontiers: Deep Learning with TensorFlowRajat Monga at AI Frontiers: Deep Learning with TensorFlow
Rajat Monga at AI Frontiers: Deep Learning with TensorFlowAI Frontiers
 
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionMachine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionAkihiro Hayashi
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupTill Rohrmann
 
Efficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceEfficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceSpiros Economakis
 
Efficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceEfficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceSpiros Oikonomakis
 
Performance is a feature! - London .NET User Group
Performance is a feature! - London .NET User GroupPerformance is a feature! - London .NET User Group
Performance is a feature! - London .NET User GroupMatt Warren
 
A Formal Framework for Prototyping Executable Semantics in ATL
A Formal Framework for Prototyping Executable Semantics in ATLA Formal Framework for Prototyping Executable Semantics in ATL
A Formal Framework for Prototyping Executable Semantics in ATLArtur Boronat
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowDatabricks
 

Similar to The Road Toward Dependable AI Based Systems (20)

Java Micro-Benchmarking
Java Micro-BenchmarkingJava Micro-Benchmarking
Java Micro-Benchmarking
 
Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016
Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016
Scott Clark, Co-Founder and CEO, SigOpt at MLconf SF 2016
 
MLConf 2016 SigOpt Talk by Scott Clark
MLConf 2016 SigOpt Talk by Scott ClarkMLConf 2016 SigOpt Talk by Scott Clark
MLConf 2016 SigOpt Talk by Scott Clark
 
Exploration of Supervised Machine Learning Techniques for Runtime Selection o...
Exploration of Supervised Machine Learning Techniques for Runtime Selection o...Exploration of Supervised Machine Learning Techniques for Runtime Selection o...
Exploration of Supervised Machine Learning Techniques for Runtime Selection o...
 
Asynchronous Methods for Deep Reinforcement Learning
Asynchronous Methods for Deep Reinforcement LearningAsynchronous Methods for Deep Reinforcement Learning
Asynchronous Methods for Deep Reinforcement Learning
 
From Black Box to Black Magic, Pycon Ireland 2014
From Black Box to Black Magic, Pycon Ireland 2014From Black Box to Black Magic, Pycon Ireland 2014
From Black Box to Black Magic, Pycon Ireland 2014
 
Py conie 2014
Py conie 2014Py conie 2014
Py conie 2014
 
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
Machine-learning based performance heuristics for Runtime CPU/GPU Selection i...
 
Performance and how to measure it - ProgSCon London 2016
Performance and how to measure it - ProgSCon London 2016Performance and how to measure it - ProgSCon London 2016
Performance and how to measure it - ProgSCon London 2016
 
Rajat Monga at AI Frontiers: Deep Learning with TensorFlow
Rajat Monga at AI Frontiers: Deep Learning with TensorFlowRajat Monga at AI Frontiers: Deep Learning with TensorFlow
Rajat Monga at AI Frontiers: Deep Learning with TensorFlow
 
Mutant Tests Too: The SQL
Mutant Tests Too: The SQLMutant Tests Too: The SQL
Mutant Tests Too: The SQL
 
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU SelectionMachine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
Machine-Learning-based Performance Heuristics for Runtime CPU/GPU Selection
 
Machine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning GroupMachine Learning with Apache Flink at Stockholm Machine Learning Group
Machine Learning with Apache Flink at Stockholm Machine Learning Group
 
Efficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceEfficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/Reduce
 
Efficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/ReduceEfficient processing of Rank-aware queries in Map/Reduce
Efficient processing of Rank-aware queries in Map/Reduce
 
Performance is a feature! - London .NET User Group
Performance is a feature! - London .NET User GroupPerformance is a feature! - London .NET User Group
Performance is a feature! - London .NET User Group
 
eam2
eam2eam2
eam2
 
Database programming
Database programmingDatabase programming
Database programming
 
A Formal Framework for Prototyping Executable Semantics in ATL
A Formal Framework for Prototyping Executable Semantics in ATLA Formal Framework for Prototyping Executable Semantics in ATL
A Formal Framework for Prototyping Executable Semantics in ATL
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflow
 


The Road Toward Dependable AI Based Systems

  • 1. Paolo Tonella Software Institute, Università della Svizzera italiana, Lugano, Switzerland The Road Toward Dependable AI Based Systems 1 https://www.pre-crime.eu
  • 2. Precrime ERC project: https://pre-crime.eu 2 Self-assessment Oracles for Anticipatory Testing “Testing the unexpected” • Anticipate failures due to unexpected conditions • Identify unexpected execution contexts at runtime • Generate valid, but unexpected execution scenarios
  • 3. Deep neural networks High accuracy on human competitive tasks 3
  • 4. Deep neural networks Integrated into complex applications 4
  • 5. Deep neural networks Programming model 5 0.9 0.1 Train set Validation set Test set W • Data driven programming/no control logic • Numeric input/numeric output • Non deterministic training process is_red 0.1 0.9 is_red Is this a bug?
  • 6. Deep neural networks Learned behaviour vs programmed behaviour 6 Training set Network architecture Hyperparameters Training process Code Code acc = 0.99 x≥0 y・y = x 0.1 0.9 is_red Is this a bug?
  • 7. Testing fundamentals Goal of testing Inputs Program Fault Execution Failure Oracle 7 Repair faults, exposed by choosing inputs whose execution violates oracles Repair DL faults and failures Q1: What is a DL fault? Q2: What is a unique DL failure? Q3: What does it mean to repair a DL fault? DL inputs Q4: How to generate DL test inputs? Q5: How to prioritize/select DL test inputs? DL oracles Q6: When is a DL program correct? Q7: How to prevent runtime oracle violations?
  • 8. DL faults: Train/validation set, Hyperparameters, Code, Data, Architecture. IEEE std glossary: fault = “an incorrect step, process or data definition in a computer program”. DL faults and failures Q1: What is a DL fault? Q2: What is a unique DL failure? Q3: What does it mean to repair a DL fault?
  • 9. Taxonomy of Real Faults in Deep Learning Systems. Taxonomy of real faults: repository mining + interviews with developers. Stack Overflow: 477 discussions. Github: 271 issues and pull requests. Github: 311 commits. Interviews: 20 researchers and practitioners. Validation Survey: 21 researchers and practitioners.
      • MODEL (29+45)
          • Model Type & Properties (6+20): wrong model initialisation (1+2); wrong weights initialisation (1+0); wrong selection of model (2+1); multiple initialisations of CNN (1+0); suboptimal network structure (1+15); wrong network architecture (0+2)
          • Layers (23+25)
              • Activation Function (3+2): wrong type of activation function (1+2); missing softmax activation function (1+0); missing relu activation function (1+0)
              • Missing/Redundant/Wrong Layer (7+5): missing dropout layer (1+1); missing normalisation layer (1+0); missing average pooling layer (0+1); missing softmax layer (1+0); redundant softmax layer (1+0); wrong layer type (1+2); wrong type of pooling layer (0+1); missing dense layer (1+0); missing flatten layer (1+0)
              • Layer Properties (13+18): wrong input sample size for linear layer (5+0); wrong defined input shape (2+0); wrong amount & type of pooling in convolutional layer (0+1); wrong defined output shape (3+1); wrong defined input & output shape (1+0); wrong filter size for a convolutional layer (1+1); layers' dimensions mismatch (0+9); suboptimal number of neurons in the layer (0+6); bias needed in a layer (1+0)
      • GPU USAGE (10+1): missing destination GPU device (1+0); incorrect state sharing (1+0); wrong reference to GPU device (2+0); missing transfer of data to GPU (1+0); wrong data parallelism on GPUs (1+0); calling unsupported operations on CUDA tensors (1+0); conversion to CUDA tensor inside the training/test loop (1+0); wrongly implemented data transfer function (CPU-GPU) (0+1); wrong tensor transfer to GPU (1+0); GPU tensor is used instead of CPU tensor (1+0)
      • API (20+0): wrong position of data shuffle operation (1+0); deprecated API (1+0); wrong usage of image decoding API (1+0); wrong usage of placeholder restoration API (1+0); missing argument scoping (1+0); missing global variables initialisation (3+0); wrong API usage (10+0); missing API call (1+0); wrong reference to operational graph (1+0)
      • TENSORS & INPUTS (53+20)
          • Wrong Tensor Shape (21+5): wrong tensor shape (missing squeeze) (5+0); wrong tensor shape (wrong indexing) (2+0); wrong tensor shape (wrong output padding) (1+0); wrong tensor shape (other) (13+3); tensor shape mismatch (0+2)
          • Wrong Input (32+15)
              • Wrong Shape of Input Data (22+7): wrong shape of input data for a method (6+0); wrong shape of input data for a layer (16+2); wrong shape of input data (0+5)
              • Wrong Type of Input Data (5+3): wrong type of input data for a method (4+0); wrong type of input data for a layer (1+0); wrong type of input data (0+3)
              • Wrong Input Format (5+5): wrong input format (1+5); wrong input format for RNN (2+0); wrong format of passed weights (1+0); incompatible tensor type (1+0)
      • TRAINING (37+160)
          • Hyperparameters (10+26): suboptimal number of epochs (2+4); data batching required (2+7); suboptimal batch size (2+1); wrongly implemented data batching (1+0); missing regularisation (loss and weight) (0+1); suboptimal learning rate (1+5); suboptimal hyper-parameters tuning (2+8)
          • Loss Function (7+16): missing masking of invalid values to zero (1+0); wrong loss function calculation (5+4); missing loss function (0+1); wrong selection of loss function (1+11)
          • Validation/Testing (2+4): missing validation set (1+0); incorrect train/test data split (0+3); wrong performance metric (1+1)
          • Preprocessing of Training Data (13+37)
              • Missing Preprocessing (11+22): missing preprocessing step (subsampling, normalisation, input scaling, resize of the images, oversampling, encoding of categorical data, padding, ... skip ..., data shuffling, interpolation)
              • Wrong Preprocessing (2+15): wrong preprocessing step (pixel encoding, padding, text segmentation, normalisation, ... skip ..., positional encoding, character encoding)
          • Optimiser (3+3): wrong optimisation function (1+3); epsilon for Adam optimiser too low (2+0)
          • Training Data Quality (2+60): low quality training data (0+11); not enough training data (0+14); overlapping output classes in training data (0+1); too many output categories (0+1); small range of values for a feature (0+2); unbalanced training data (0+11); wrong selection of features (1+6); wrong labels for training data (1+12); discarding important features (0+2)
          • Training Process (0+14): model too big to fit into available memory (0+5); reference for non-existing checkpoint (0+1); missing data augmentation (0+3); redundant data augmentation (0+1); wrong management of memory resources (0+4)
  • 13. DeepCrime Mutation testing of DL systems 13 Training set Network architecture Hyperparameters DeepCrime Re-train Accuracy = 85% Accuracy = 95% https://github.com/deepcrime-tool/DeepCrime
  • 14. DeepCrime: Mutation operators
      • Training Data: Change labels of training data (TCL); Remove portion of training data (TRD); Unbalance training data (TUD); Add noise to training data (TAN); Make output classes overlap (TCO)
      • Hyperparameters: Change batch size (HBS); Decrease learning rate (HLR); Change number of epochs (HNE); Disable data batching (HDB)
      • Activation Function: Change activation function (ACH); Remove activation function (ARM); Add activation function to layer (AAL)
      • Regularisation: Add weights regularisation (RAW); Change weights regularisation (RCW); Remove weights regularisation (RRW); Change dropout rate (RCD); Change patience parameter (RCP)
      • Weights: Change weights initialisation (WCI); Add bias to a layer (WAB); Remove bias from a layer (WRB)
      • Loss function: Change loss function (LCH)
      • Optimisation Function: Change optimisation function (OCH); Change gradient clipping (OCG)
      • Validation: Remove validation set (VRM)
      • Uses of DL mutation: Evaluate adequacy of a test set; Compare test input generators; Benchmark repair tools; Assess test ordering produced by prioritization techniques
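To make the idea of a training-data mutation operator concrete, here is a minimal sketch in the spirit of "Change labels of training data (TCL)". It is not DeepCrime's implementation: the function name `mutate_labels`, the `fraction` parameter and the random-offset strategy are assumptions made for illustration only.

```python
import numpy as np

def mutate_labels(labels, num_classes, fraction, seed=0):
    """Return a copy of `labels` in which a `fraction` of entries get a different class.

    Hypothetical helper, not part of DeepCrime.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    mutated = labels.copy()
    n_mutate = int(round(fraction * len(labels)))
    idx = rng.choice(len(labels), size=n_mutate, replace=False)
    # A non-zero offset modulo num_classes guarantees that every chosen label changes.
    offsets = rng.integers(1, num_classes, size=n_mutate)
    mutated[idx] = (labels[idx] + offsets) % num_classes
    return mutated

labels = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9] * 10)
mutated = mutate_labels(labels, num_classes=10, fraction=0.2)
changed = int((mutated != labels).sum())
print(changed, "of", len(labels), "labels changed")
```

In a mutation testing workflow the model would then be re-trained on the mutated labels, and a test set would be judged by whether it can tell the mutant from the original model.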
  • 15. DL failures: Failure counting is not trivial. [Figure: several failing inputs (❌), all of the digit 6] Diversity matters! DL faults and failures Q1: What is a DL fault? Q2: What is a unique DL failure? Q3: What does it mean to repair a DL fault?
  • 16. Feature maps: Failures grouped by features. Features: Luminosity, Min Radius, Orientation, Mean Lateral Position. Subjects: MNIST, BeamNG. Number of feature map cells filled with misbehaviours vs. number of failures.
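The difference between "number of failures" and "number of feature map cells filled with misbehaviours" can be sketched as follows. The two-dimensional feature values, the 4×4 grid and the helper name `filled_cells` are hypothetical; the sketch only assumes that each input has already been mapped to numeric feature values.

```python
import numpy as np

def filled_cells(features, failed, bins, ranges):
    """Count feature-map cells that contain at least one failing input."""
    features = np.asarray(features, dtype=float)
    failed = np.asarray(failed, dtype=bool)
    # Histogram only the failing inputs over the feature grid.
    hist, _ = np.histogramdd(features[failed], bins=bins, range=ranges)
    return int((hist > 0).sum())

# Hypothetical inputs described by two normalised features.
features = [[0.10, 0.10], [0.15, 0.12], [0.90, 0.90], [0.50, 0.50], [0.92, 0.88]]
failed = [True, True, True, False, True]

n_failures = sum(failed)
n_cells = filled_cells(features, failed, bins=(4, 4), ranges=[(0, 1), (0, 1)])
print(n_failures, "failures fall into", n_cells, "distinct cells")
```

Four failures collapse into two cells here, which is the sense in which raw failure counts can overstate how many distinct misbehaviours were exposed.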
  • 17. DL faults and failures: Fault repair. DL faults and failures Q1: What is a DL fault? Q2: What is a unique DL failure? Q3: What does it mean to repair a DL fault? The slide rewrites the mutation operators of slide 14 as repair actions (operator codes in parentheses):
      • Training Data: Fix labels of training data (TCL); Augment training data (TRD); Unbalance training data (TUD); Remove noise from training data (TAN); Fix output classes that overlap (TCO)
      • Hyperparameters: Fine-tune batch size (HBS); Fine-tune learning rate (HLR); Fine-tune number of epochs (HNE); Fine-tune data batching (HDB)
      • Activation Function: Change activation function (ACH); Add activation function (ARM); Remove activation function (AAL)
      • Regularisation: Remove weights regularisation (RAW); Change weights regularisation (RCW); Add weights regularisation (RRW); Fine-tune dropout rate (RCD); Fine-tune patience parameter (RCP)
      • Weights: Fine-tune weights initialisation (WCI); Remove bias (WAB); Add bias (WRB)
      • Loss function: Change loss function (LCH)
      • Optimisation Function: Change optimisation function (OCH); Change gradient clipping (OCG)
      • Validation: Add validation set (VRM)
  • 18. DL architecture repair: SE vs DL community
      • Rule based repair [AutoTrainer]: Symptom → Solution. S1: Add batch normalisation layer; S2: Substitute activation function; S3: Add gradient clip; S4: Substitute initialiser; S5: Adjust batch size; S6: Adjust learning rate; S7: Substitute optimizer
      • Bayesian hyperparameter optimization [Hebo, Bohb]: initial hyperparameters; train & evaluate model; fit surrogate; sample hyperparameters; repeat until training budget is over
      • Hyperparameters: H1: activation function; H2: optimizer; H3: batch size; H4: number of epochs; H5: loss function; H6: learning rate; H7: weight initialization; H8: batching (enable/disable); H9: number of neurons in a layer
      • [AutoTrainer] Xiaoyu Zhang, Juan Zhai, Shiqing Ma, Chao Shen: AUTOTRAINER: An Automatic DNN Training Problem Detection and Repair System. ICSE, pp. 359-371, 2021.
      • [Hebo] A. I. Cowen-Rivers, W. Lyu, R. Tutunov, Z. Wang, A. Grosnit, R. R. Griffiths, A. M. Maraval, H. Jianye, J. Wang, J. Peters, Haitham Bou Ammar: Hebo: Pushing the limits of sample-efficient hyper-parameter optimisation. JAIR, vol. 74, pp. 1269-1349, 2022.
      • [Bohb] S. Falkner, A. Klein, and F. Hutter: Bohb: Robust and efficient hyperparameter optimization at scale. ICML, pp. 1437-1446, 2018.
  • 19. DL architecture repair: Are we there yet? IR = Improvement Rate: $\mathrm{IR} = \dfrac{M_{patch} - M_{fault}}{M_{fix} - M_{fault}}$. The random baseline produces comparable or better patches than other repair techniques, but the effectiveness of tools varies depending on the fault, which justifies the need for future work to find more efficient ways of exploring the hyperparameter space.
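The Improvement Rate formula on this slide can be computed directly. The sketch below assumes that M is a model quality metric such as accuracy, measured on the patched, faulty and ground-truth fixed models; that reading, the function name and the numbers are illustrative assumptions, since the slide does not define M.

```python
def improvement_rate(m_patch, m_fault, m_fix):
    """IR = (M_patch - M_fault) / (M_fix - M_fault)."""
    if m_fix == m_fault:
        raise ValueError("IR is undefined when the fix does not change the metric")
    return (m_patch - m_fault) / (m_fix - m_fault)

# Hypothetical accuracies: faulty model 0.85, known fix 0.95, tool patch 0.90.
ir = improvement_rate(m_patch=0.90, m_fault=0.85, m_fix=0.95)
print(f"IR = {ir:.2f}")
```

Under this reading, IR = 1 means the patch recovers the full improvement of the known fix, IR = 0 means no improvement over the faulty model, and values above 1 mean the patch outperforms the known fix.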
  • 20. DL faults and failures: Take-away messages. DL faults and failures Q1: What is a DL fault? Q2: What is a unique DL failure? Q3: What does it mean to repair a DL fault? DL faults affect data, model architecture, hyper-parameters, training process, not just the code. When exposing/counting failures, diversity in the feature space matters. Exposing more failures does not necessarily mean higher effectiveness. The landscape of repair operations is not well understood and rule-based/Bayesian techniques are not substantially superior to random exploration of such a space.
  • 21. DL inputs: Test input generation (DeepJanus). DL inputs Q4: How to generate DL test inputs? Q5: How to prioritize/select DL test inputs? [Figure: inputs classified as 5, frontier of behaviours, validity domain, frontier input pair; LQ vs HQ]
  • 22. DL inputs: Input validity. [Figure: four rows of digits 0 to 9]
  • 23. DL inputs: Auto-encoder validity. Training set → Encoder → latent space → Decoder; an input belongs to the anomaly set when its reconstruction error exceeds a threshold: (e > 𝜃)? “All three testing techniques [DeepExplore, DLFuzz, DeepConcolic] studied produced significant numbers of [autoencoder] invalid tests”. Input validity approximated as auto-encoder validity.
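The check "(e > 𝜃)?" can be sketched with a stand-in model. Everything below is an assumption for illustration: a PCA-style linear autoencoder replaces a trained neural autoencoder, the data is synthetic, and 𝜃 is taken as the 99th percentile of the training reconstruction errors.

```python
import numpy as np

class LinearAutoencoder:
    """PCA-based stand-in for a trained autoencoder (illustrative assumption)."""

    def __init__(self, latent_dim):
        self.latent_dim = latent_dim

    def fit(self, x):
        self.mean_ = x.mean(axis=0)
        _, _, vt = np.linalg.svd(x - self.mean_, full_matrices=False)
        self.components_ = vt[: self.latent_dim]
        return self

    def reconstruct(self, x):
        z = (x - self.mean_) @ self.components_.T    # encoder: project to latent space
        return z @ self.components_ + self.mean_     # decoder: map back to input space

def reconstruction_error(model, x):
    """Mean squared reconstruction error e, one value per input."""
    return np.mean((x - model.reconstruct(x)) ** 2, axis=1)

rng = np.random.default_rng(0)
# Synthetic "training set": points near a 2-D subspace of a 10-D input space.
basis = rng.normal(size=(2, 10))
train = rng.normal(size=(500, 2)) @ basis + 0.01 * rng.normal(size=(500, 10))
model = LinearAutoencoder(latent_dim=2).fit(train)

theta = np.quantile(reconstruction_error(model, train), 0.99)
in_dist = rng.normal(size=(1, 2)) @ basis      # resembles the training data
off_dist = 3.0 * rng.normal(size=(1, 10))      # does not lie near the subspace
is_invalid = reconstruction_error(model, np.vstack([in_dist, off_dist])) > theta
print(is_invalid)
```

The input drawn from the training distribution reconstructs well and is accepted, while the unrelated input exceeds the threshold and is flagged.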
  • 24. DL inputs Validity of automatically generated inputs 24 Test Generator Comparison Auto-encoder validity Human validity Label preservation Automated vs Human Test Generators Root causes of invalidity: Raw input generators: corruption of large number of pixels Latent space generators: lack of continuity in latent space Model based generators: low quality model All generators: label preservation not ensured
  • 25. DL inputs Test input selection and prioritization 25 DL inputs Q4: How to generate DL test inputs? Q5: How to prioritize/select DL test inputs? Train set Train model Active set Prioritize and select Selected data Label data
  • 26. DL inputs Uncertainty estimation 26 pip install uncertainty-wizard github.com/testingautomated-usi/uncertainty-wizard
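As a rough illustration of uncertainty-driven prioritization, the sketch below ranks inputs by the entropy of their softmax outputs. It is plain NumPy and does not use the uncertainty-wizard API; the probability values and function names are hypothetical.

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of each row of class probabilities (higher means more uncertain)."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    return -np.sum(probs * np.log(probs), axis=1)

def prioritize(probs):
    """Return input indices sorted from most to least uncertain."""
    return np.argsort(-predictive_entropy(probs), kind="stable")

# Hypothetical softmax outputs for three unlabeled inputs.
probs = np.array([
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.70, 0.20, 0.10],
])
order = prioritize(probs)
print(order)
```

Inputs at the front of the ordering are the ones a tester would label and execute first, on the assumption that uncertain predictions are more likely to be wrong.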
  • 27. DL inputs: Evaluation metrics for test input prioritization. [Figure: two test orderings, one with APFD = 9/18 and one with APFD = 15/18 (+50%)] Any permutation gives the same APFD. APFD is actually measuring just top-3 accuracy. Diversity matters!
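For reference, the textbook definition of APFD is 1 − (TF₁ + … + TFₘ)/(n·m) + 1/(2n), where n is the number of tests, m the number of faults and TFᵢ the position of the first test that reveals fault i. The sketch below implements that standard formula; it is not an attempt to reproduce the slide's figures, and the example positions are hypothetical.

```python
def apfd(first_detection_positions, num_tests):
    """Average Percentage of Faults Detected (standard definition).

    first_detection_positions: 1-based position, in the prioritized order,
    of the first test that reveals each fault.
    """
    m = len(first_detection_positions)
    if m == 0 or num_tests <= 0:
        raise ValueError("need at least one fault and one test")
    return 1.0 - sum(first_detection_positions) / (num_tests * m) + 1.0 / (2 * num_tests)

# Six tests, three faults: revealed early vs. revealed late.
early = apfd([1, 2, 3], num_tests=6)
late = apfd([4, 5, 6], num_tests=6)
print(early, late)
```

The metric only depends on the first position at which each fault is revealed, so how failures are grouped into distinct faults decides what APFD actually rewards.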
  • 28. DL inputs Take-away messages 28 Generating DL inputs that trigger failures is easy, but ensuring their validity and assigning them a valid label is not. Model uncertainty can be used to select and prioritize test cases. Traditional test case prioritization evaluation metrics, like APFD, do not account for DL failure diversity. DL inputs Q4: How to generate DL test inputs? Q5: How to prioritize/select DL test inputs?
  • 29. DL oracles: System level quality. DL oracles Q6: When is a DL program correct? Q7: How to prevent runtime oracle violations? [Figure labels: lateral position, headway position, steering, braking, traffic signs] Oracles depend on thresholds: $O(x) = m_1(x) \le t_1 \wedge \dots \wedge m_n(x) \le t_n$. No FP in (manually) verified executions $X$: $\forall x \in X,\ m_1^{D}(x) \le t_1^{*} \wedge \dots \wedge m_n^{D}(x) \le t_n^{*}$. No/few FN when executing on faulty models (max mutants killed): $t_1^{*}, \dots, t_n^{*} = \arg\max_{t_1, \dots, t_n} \left| \{ j \in M \mid \exists x,\ m_1^{M_j}(x) > t_1 \vee \dots \vee m_n^{M_j}(x) > t_n \} \right|$.
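The threshold-based oracle and the two conditions on the thresholds can be sketched as below. The metric values are hypothetical. The sketch relies on one observation that is not stated on the slide: lowering a threshold can only kill more mutants, so under a strict no-FP constraint the tightest admissible thresholds (the per-metric maximum over the verified executions) also maximise the number of mutants killed.

```python
import numpy as np

def oracle(metrics, thresholds):
    """O(x): true when every metric m_i(x) stays within its threshold t_i."""
    return bool(np.all(np.asarray(metrics) <= np.asarray(thresholds)))

def tightest_thresholds(verified_metrics):
    """Smallest thresholds that raise no false positive on verified executions X."""
    return np.max(np.asarray(verified_metrics, dtype=float), axis=0)

def mutants_killed(mutant_metrics, thresholds):
    """Count mutants with at least one execution that violates the oracle."""
    return sum(
        any(not oracle(x, thresholds) for x in executions)
        for executions in mutant_metrics
    )

# Hypothetical metric vectors (two metrics) from verified executions of the original model.
verified = [[0.2, 0.10], [0.3, 0.05]]
# Hypothetical executions of three mutants (faulty models).
mutants = [
    [[0.25, 0.08], [0.5, 0.02]],   # second execution exceeds the first threshold
    [[0.1, 0.10]],                 # stays within both thresholds
    [[0.3, 0.20]],                 # exceeds the second threshold
]

thresholds = tightest_thresholds(verified)
killed = mutants_killed(mutants, thresholds)
print(thresholds, killed)
```

If a few false negatives are acceptable in exchange for robustness to noise, the thresholds could be relaxed from this tightest setting, at the cost of killing fewer mutants.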
  • 30. DL oracles: Runtime oracle. DL oracles Q6: When is a DL program correct? Q7: How to prevent runtime oracle violations? Autonomous driving, Supervisor, Safe disengagement. Auto-encoder based anomaly detection: Train set → Encoder → latent space → Decoder, with threshold 𝜃. [Figure: histogram and theoretical densities, gamma CDF, Q-Q plot]
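The plots on this slide mention a gamma distribution and a threshold 𝜃. One plausible reading, stated here as an assumption, is that a gamma distribution is fitted to the reconstruction errors observed on nominal data and 𝜃 is set to a high quantile of the fit. The sketch below uses a method-of-moments fit and a Monte Carlo quantile so that it needs only NumPy; the data is synthetic and the function name is hypothetical.

```python
import numpy as np

def gamma_threshold(errors, quantile=0.95, n_samples=200_000, seed=0):
    """Fit a gamma distribution by the method of moments and return its quantile."""
    errors = np.asarray(errors, dtype=float)
    mean, var = errors.mean(), errors.var()
    shape, scale = mean**2 / var, var / mean     # method-of-moments estimates
    rng = np.random.default_rng(seed)
    # Approximate the quantile of the fitted gamma by sampling from it.
    samples = rng.gamma(shape, scale, size=n_samples)
    return float(np.quantile(samples, quantile))

# Synthetic stand-in for reconstruction errors measured on nominal driving data.
rng = np.random.default_rng(1)
nominal_errors = rng.gamma(2.0, 0.05, size=5000)

theta = gamma_threshold(nominal_errors, quantile=0.95)
false_alarm_rate = float(np.mean(nominal_errors > theta))
print(f"theta = {theta:.3f}, false alarm rate on nominal data = {false_alarm_rate:.3f}")
```

Choosing the quantile trades false alarms on nominal data against missed anomalies; with SciPy available, the quantile could be read from the fitted distribution directly instead of by sampling.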
  • 31. DL oracles: Runtime oracle: autoencoders. DL oracles Q6: When is a DL program correct? Q7: How to prevent runtime oracle violations? (e > 𝜃)? [Figure: original and reconstructed images for a nominal and an anomalous input]
  • 32. DL oracles Runtime oracle: uncertainty estimation 32 DL oracles Q6: When is a DL program correct? Q7: How to prevent runtime oracle violations? github.com/testingautomated-usi/uncertainty-wizard
  • 33. DL oracles Take-away messages 33 DL oracles should capture system level quality, not just extreme cases of misbehaviours. Runtime supervisors are needed to deal with rare or unexpected conditions in which misbehaviours are possible. DL oracles Q6: When is a DL program correct? Q7: How to prevent runtime oracle violations? Thresholds for system level oracles should ensure no FP and few FN on mutants.
  • 34. A framework for DL testing Dev-to-production data shift 34 Goal Repair faults, exposed by choosing inputs whose execution violates oracles, to increase reliability measured in production Inputs Program Fault Execution Failure Oracle Repair Production test set Production environment Reliability Boolean Correctness ⟹ System Reliability
  • 35. A framework for DL testing: Implications for researchers. Effectiveness of test input generation, test prioritization and DL model repair should be measured as reliability increase, on production data. We need ways to collect/simulate production data. We need techniques to estimate reliability in a sample efficient way. Correctness, i.e. no error on the test set or 100% accuracy, is not a realistic goal. Production time reliability should be estimated instead.
  • 36. A framework for DL testing: Putting all together. Inputs, Program, Fault, Execution, Failure, Oracle, Repair, Production test set, Production environment, Reliability. DL faults affect data, model architecture, hyper-parameters, training process, not just the code. When exposing/counting failures, diversity in the feature space matters. Exposing more failures does not necessarily mean higher effectiveness. Generating DL inputs that trigger failures is easy, but ensuring their validity and assigning them a valid label is not. DL oracles should capture system level quality, not just extreme cases of misbehaviours. Runtime supervisors are needed to deal with rare or unexpected conditions in which misbehaviours are possible. Effectiveness of test input generation, test prioritization and DL model repair should be measured as reliability increase, on production data.
  • 37. Paolo Tonella Software Institute, Università della Svizzera italiana, Lugano, Switzerland The Road Toward Dependable AI Based Systems 37