Deep Learning (DL) systems are rapidly being adopted in safety- and security-critical domains, urgently calling for ways to test their correctness and robustness. Testing of DL systems has traditionally relied on manual collection and labeling of data. Recently, a number of coverage criteria based on neuron activation values have been proposed. These criteria essentially count the number of neurons whose activations during the execution of a DL system satisfy certain properties, such as being above predefined thresholds. However, existing coverage criteria are not sufficiently fine-grained to capture subtle behaviors exhibited by DL systems. Moreover, evaluations have focused on showing a correlation between adversarial examples and the proposed criteria rather than on evaluating and guiding their use for actual testing of DL systems. In this paper, the authors propose a novel test adequacy criterion for testing of DL systems, called Surprise Adequacy for Deep Learning Systems (SADL), which is based on the behavior of DL systems with respect to their training data.
1. Guiding Deep Learning System Testing Using Surprise Adequacy
Authors: Jinhan Kim, Robert Feldt, Shin Yoo
Presented by Fatemeh Ghorbani
2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE)
2. Outline
• Introduction
▫ Statement of the problem
▫ Related works
▫ Surprise Adequacy for Deep Learning Systems (SADL)
• SADL measurement
▫ Surprise Adequacy (SA)
▫ Likelihood-based SA (LSA)
▫ Distance-based SA (DSA)
▫ Surprise Coverage (SC)
• Experimental setup
• Research questions and results
• Conclusion
3. Introduction
• Statement of the problem
• Related works
• Surprise Adequacy for Deep Learning Systems (SADL)
4. Statement of the problem
• Unexpected behaviors of deep learning (DL) systems
▫ Adversarial examples
• Essential need to verify DL system behavior
• Testing the correctness of DL systems
5. Related works and their limitations
• DeepTest
• DeepXplore
▫ Neuron Coverage (NC)
• Major limitation
▫ Conveys little information
▫ Coarse discretization
6. SADL
• Based on the behavior of DL systems
▫ With respect to the training data
▫ Data-flow
• The actual measure of surprise:
▫ The likelihood of the new input given the training data
▫ The distance between activation trace vectors of the new input and the training data
• Quantitative measurement
7. Surprise Adequacy (SA)
• Activation traces of inputs and training data over the neurons in N
• Compare them
▫ Activation traces fully capture the behavior of the DL system
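The activation trace idea can be sketched with a toy stand-in for a hidden layer. The linear map and ReLU below are illustrative assumptions, not the paper's models; in SADL, traces come from a real DL system's layers.

```python
import numpy as np

# Toy stand-in for one hidden layer of a trained DL system:
# a fixed linear map followed by ReLU (illustrative only).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))  # 4 input features -> 8 neurons

def activation_trace(x):
    """Return the vector of all 8 neurons' activation values for input x."""
    return np.maximum(x @ W, 0.0)

# Activation traces of the training data form the reference behavior.
train_x = rng.normal(size=(100, 4))
train_ats = np.stack([activation_trace(x) for x in train_x])  # shape (100, 8)

# A new input's trace is compared against these to measure its surprise.
new_at = activation_trace(rng.normal(size=4))
```

Comparing `new_at` against `train_ats` is what the likelihood-based and distance-based variants on the next slides make concrete.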
8. Likelihood-based SA (LSA)
• Applies kernel density estimation (KDE) to estimate the probability density of each activation value
• Obtains the surprise of a new input from its estimated density
• To reduce computational cost: consider the activation trace of a chosen layer L only
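A minimal sketch of the LSA computation, assuming SciPy's Gaussian KDE as the density estimator and toy activation traces in place of a real layer's:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Toy activation traces from one chosen layer (3 neurons, 200 training inputs).
rng = np.random.default_rng(1)
train_ats = rng.normal(size=(200, 3))

# KDE estimates the probability density of the training activation traces.
kde = gaussian_kde(train_ats.T)  # gaussian_kde expects shape (dims, samples)

def lsa(at):
    """Likelihood-based SA: negative log of the estimated density at `at`."""
    return float(-np.log(kde(at)[0]))

typical = train_ats.mean(axis=0)  # close to the training distribution
outlier = typical + 4.0           # far from anything seen in training
```

An input far from the training distribution gets a low estimated density and hence a high LSA, so `lsa(outlier) > lsa(typical)`.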
9. Distance-based SA (DSA)
• Uses the distances between activation traces as the measure of surprise
• DSA applies only to classification tasks
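The DSA ratio can be sketched as follows, with toy 2-D activation traces and Euclidean distance (the data and variable names are illustrative assumptions):

```python
import numpy as np

def dsa(at, train_ats, train_labels, predicted_class):
    """Distance-based SA for a classification task:
    ratio of (a) the distance from the input's activation trace to the
    nearest training trace of the predicted class, over (b) the distance
    from that nearest trace to the nearest trace of any other class."""
    same = train_ats[train_labels == predicted_class]
    other = train_ats[train_labels != predicted_class]
    d_same = np.linalg.norm(same - at, axis=1)
    nearest_same = same[np.argmin(d_same)]
    dist_a = d_same.min()
    dist_b = np.linalg.norm(other - nearest_same, axis=1).min()
    return dist_a / dist_b

# Two toy classes: class 0 traces near the origin, class 1 traces near (4, 0).
train_ats = np.array([[0., 0.], [0., 1.], [1., 0.],
                      [4., 0.], [4., 1.], [5., 0.]])
train_labels = np.array([0, 0, 0, 1, 1, 1])

center = dsa(np.array([0.1, 0.1]), train_ats, train_labels, 0)
boundary = dsa(np.array([2.0, 0.0]), train_ats, train_labels, 0)
```

An input near the class boundary is more surprising than one deep inside its class cluster, i.e. `boundary > center`.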
10. Surprise Coverage (SC)
• LSA and DSA are continuous; bucketing discretises them into LSC and DSC
• SC can only be measured with respect to a predefined upper bound
• The sense of redundancy is weaker
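The bucketing step can be sketched as below; the upper bound and bucket count are parameters chosen per subject, and the helper name is illustrative:

```python
import numpy as np

def surprise_coverage(sa_values, upper_bound, n_buckets=1000):
    """Fraction of equal-width SA buckets in (0, upper_bound] covered
    by at least one input's SA value."""
    edges = np.linspace(0.0, upper_bound, n_buckets + 1)
    idx = np.digitize(sa_values, edges) - 1
    idx = idx[(idx >= 0) & (idx < n_buckets)]  # ignore values beyond the bound
    return len(set(idx)) / n_buckets

# Three inputs whose SA values fall into 3 of 10 buckets -> coverage 0.3.
cov = surprise_coverage(np.array([0.5, 1.5, 2.5]), upper_bound=10.0, n_buckets=10)
```

Because coverage is counted over distinct buckets, adding an input whose SA lands in an already-covered bucket does not raise SC, which is what makes SC useful for selecting diverse inputs.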
11. Experimental setup
• Data sets and DL systems:
▫ MNIST, CIFAR-10
▫ Self-driving car challenge
▫ Pre-trained Dave-2 and Chauffeur models
▫ Evaluation of SADL: classification accuracy for the CNNs, MSE for the self-driving models
• Adversarial examples and synthetic inputs
▫ Five attack strategies
▫ DeepXplore, DeepTest
12. Research questions and results
1) SADL capability of capturing the relative surprise
▫ Trained an adversarial example classifier using logistic regression
▫ 10,000 adversarial examples: 1,000 for training, 9,000 for evaluation
▫ Quantitatively and visually: SADL can measure how surprising an input is
▫ DSA from a specific layer: produces higher accuracy
▫ Inputs with higher SA: harder to classify
▫ Adversarial examples: higher SA values
13. Research questions and results (cont.)
2) Impact of layer selection on accuracy
▫ With LSA: no strong evidence of layer sensitivity
▫ With DSA: the deepest layer produces the most accurate classifier
▫ Layer sensitivity varies across attack strategies
14. Research questions and results (cont.)
3) Correlation between SC and other criteria
• Most of the criteria increase as additional inputs are added at each step (exception: NC)
• Setup: MNIST and CIFAR-10 add 1,000 adversarial examples; Dave-2 adds 700 synthetic images (by DeepXplore); Chauffeur adds 1,000 synthetic images (by DeepTest)
15. Research questions and results (cont.)
4) Retraining guide
• Setup: choose four sets of 100 images, train the existing models for five additional epochs, and measure performance
• Sampling from wider SA ranges improves accuracy
• Best retraining performance: the full SA range
• SA can provide guidance for retraining
16. Conclusion
• SA and SC are good indicators of DL system behavior
• SA is correlated with how difficult a DL system finds an input
• SC can be used to guide selection of inputs for effective retraining
• SA can classify adversarial examples accurately