1. Comparing Offline and Online Testing of Deep Neural Networks: An Autonomous Car Case Study
Fitash Ul Haq, Donghwan Shin, Shiva Nejati, and Lionel Briand
2020-10-25
2. Introduction
• Deep Neural Networks (DNNs) can accurately automate real-world tasks such as speech recognition and image classification
• DNNs are increasingly used in safety-critical autonomous systems, such as Automated Driving Systems (ADS)
• Ensuring the safety and reliability of DNN-based systems has therefore emerged as a fundamental problem
3. Existing Testing Approaches
• Many DNN testing approaches have been proposed recently
• Distinct modes of testing:
• Offline testing
• Online testing
4. Offline Testing
• Testing DNNs as stand-alone components
• DNNs are tested using (historical) data in an open-loop mode (see the sketch below)
[Diagram: each test-data image is fed to the DNN; the prediction is compared against the image's label to compute the prediction error]
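A minimal sketch of such an open-loop test harness (the model.predict interface and the dataset layout are assumptions for illustration, not the paper's actual test driver):

```python
import numpy as np

def offline_test(model, images, labels):
    """Open-loop testing: feed recorded images to the DNN one at a time
    and compare each prediction against its historical label."""
    predictions = np.array([model.predict(image) for image in images])
    errors = np.abs(predictions - np.asarray(labels))  # per-image prediction error
    return errors  # aggregated later, e.g., as MAE or RMSE
```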
5. Online Testing
• Testing DNNs embedded into a specific application
• DNNs are tested when embedded into an application environment in a closed-loop mode (see the sketch below)
[Diagram: the DNN is embedded into a (virtual) ego car; images from the application environment feed the DNN, its predictions drive the ego car among mobile objects over time, and safety violations are checked]
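A minimal sketch of the closed loop, assuming a hypothetical simulator interface (render_camera, apply_steering, step, and off_lane_distance are illustrative names, not a real simulator API):

```python
def online_test(simulator, model, max_steps=1000, max_departure=1.0):
    """Closed-loop testing: the DNN's own predictions drive the ego car,
    so small errors can accumulate over time into a safety violation."""
    simulator.reset()
    for _ in range(max_steps):
        image = simulator.render_camera()   # environment -> DNN
        steering = model.predict(image)     # DNN -> environment
        simulator.apply_steering(steering)
        simulator.step()
        if simulator.off_lane_distance() > max_departure:
            return "safety violation"       # e.g., lane departure > 1 m
    return "no violation"
```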
6. Offline Testing vs. Online Testing?
• Comparatively, offline testing has been far more studied to date
• There is limited insight as to how these two DNN testing approaches compare with one another
• Do large prediction errors identified by offline testing always lead to
safety violations detectable by online testing?
• Do the safety violations identified by online testing translate into large
prediction errors in offline testing?
RQ1: How do offline and online testing results differ and complement each other?
7. Real-world vs. Simulated Data?
• Testing DNNs embedded into real and operational environments is
often very expensive, dangerous, and time-consuming
• To answer RQ1, we can rely on high-fidelity simulators that allow us
to specify and execute scenarios capturing various situations
• However, we do not know whether simulator-generated data are a reliable substitute for real-world data for the purpose of DNN testing
RQ0: Can we use simulator-generated data as a reliable substitute for real-world data for the purpose of DNN testing?
8. DNNs in ADS
• Although the investigated questions are relevant to all autonomous systems, this study focuses on DNNs in the context of ADS
[Diagram: within the ADS, the DNN maps sensor feedback from the environment (camera, LiDAR, …) to actions such as the steering angle and brake & accelerate commands]
9. Offline Testing for ADS-DNN
[Diagram: test data is produced either by a human driver in a real car or by a simulator configured by a domain model; the DNN computes predictions for this data]
10. Online Testing for ADS-DNN
[Diagram: a domain model configures the simulator; the simulator feeds images to the DNN, the DNN returns steering angles, and the behaviors of the ego car and mobile objects evolve over time]
11. Domain Model for Simulator
• Capturing the test input space
• Based on the features observed in
real-world datasets
• Each entity has multiple variables
• Additional constraints describing
valid value assignments to the
variables
• A (test) simulation scenario is determined by a vector of values assigned to the variables (see the sampling sketch below)
Scenario
• Weather: type {sunny, fog, rainy, snowy}, visibility {low, medium, high}
• Road: type {straight, curve, spiral}, direction {left, right}, length {25, 50, 75, 100}, curveRadius {20, 30, …, 60}, numLanes {1, 2, 3}, …
• Car: speed {10, 20, …, 100}, oppositeLane (Boolean), headlight (Boolean), highBeam (Boolean), foglight (Boolean), infrontEgoCar (Boolean)
• Environment: trees (Boolean), …
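A minimal sketch of how such a domain model could be encoded and sampled; the variable names follow the figure, and the single validity constraint shown (curveRadius applies only to curved roads) is one plausible example, not the paper's full constraint set:

```python
import random

# Domain model: each variable with its set of allowed values.
DOMAIN = {
    "weather_type": ["sunny", "fog", "rainy", "snowy"],
    "visibility": ["low", "medium", "high"],
    "road_type": ["straight", "curve", "spiral"],
    "road_direction": ["left", "right"],
    "road_length": [25, 50, 75, 100],
    "curve_radius": [20, 30, 40, 50, 60],
    "num_lanes": [1, 2, 3],
    "car_speed": list(range(10, 101, 10)),
    "trees": [True, False],
}

def sample_scenario():
    """A (test) scenario is one vector of values for the domain variables."""
    scenario = {var: random.choice(values) for var, values in DOMAIN.items()}
    # Example validity constraint: a curve radius is only meaningful for curved roads.
    if scenario["road_type"] == "straight":
        scenario["curve_radius"] = None
    return scenario
```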
12. Research Questions
• RQ0: Can we use simulator-generated data as a reliable alternative to real-world data?
• We configure the simulator to generate a dataset that resembles the
characteristics of a real-life dataset, and then compare the offline
testing results for these datasets
• RQ1: How do offline and online testing results differ and complement
each other?
• For the same simulator-generated datasets, we compare the offline
and online testing results
13. Subject DNN Models
• Two publicly available, widely used pre-trained DNN-based steering angle prediction models, namely Autumn and Chauffeur
• Autumn consists of an image preprocessing module that computes the
optical flow and a Convolutional Neural Network (CNN) that predicts
steering angles
• Chauffeur consists of one CNN that extracts the image features and a
Recurrent Neural Network (RNN) that predicts steering angles from the
previous 100 consecutive images
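As a purely illustrative sketch (not the actual Autumn or Chauffeur code), a CNN-plus-RNN steering predictor over a window of frames could be wired as follows; all layer sizes here are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn_rnn_steering_model(window=100, height=66, width=200):
    """Illustrative CNN+RNN pipeline: a CNN extracts per-frame features,
    and an RNN predicts the steering angle from the last `window` frames."""
    cnn = tf.keras.Sequential([
        layers.Conv2D(24, 5, strides=2, activation="relu"),
        layers.Conv2D(36, 5, strides=2, activation="relu"),
        layers.Conv2D(48, 5, strides=2, activation="relu"),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
    ])
    return tf.keras.Sequential([
        tf.keras.Input(shape=(window, height, width, 3)),
        layers.TimeDistributed(cnn),  # per-frame feature extraction
        layers.LSTM(64),              # temporal aggregation over the window
        layers.Dense(1),              # predicted steering angle (deg/25)
    ])
```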
14. Real-world Dataset
• Sequences of [image, steering angle] pairs from the Udacity Challenge
[Plots: steering angle (deg/25) vs. image ID for (a fragment of) the training data, and the actual steering angles of the testing data (i.e., 5,614 labeled images for testing)]
15. Prediction Errors
• Prediction errors of the DNN models for the real-world testing dataset
• The prediction error is computed by two well-known metrics, Mean
Absolute Error (MAE) and Root Mean Square Error (RMSE)
• The models are reasonably accurate for the real-world test dataset
Model      Reported RMSE   Our RMSE   Our MAE
Autumn     Not reported    0.049      0.034
Chauffeur  0.058           0.092      0.055

Meaning: an MAE of 0.055 corresponds to 1.375° on average, since steering angles are normalized by 25°
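For reference, over n test images with true (normalized) angles θ_i and predictions θ̂_i, the two metrics are:

$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\bigl|\hat{\theta}_i-\theta_i\bigr|,\qquad \mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\hat{\theta}_i-\theta_i\bigr)^{2}}$$

Multiplying a normalized error by 25 converts it back to degrees (e.g., 0.055 × 25 = 1.375°).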
17. RQ0: Replicate Real-world Dataset
• It is infeasible to generate simulator-generated datasets (SD) with exactly the same environmental properties and vehicle dynamics as the real-world dataset (RD)
• Instead, we say SD is comparable with a subsequence of RD if:
• the images have the same features (e.g., sunny weather)
• the steering angle difference per image is small enough on average
• We propose a two-step heuristic to generate SDs that are
comparable with the subsequences of RD
18. RQ0: Two-Step Heuristic (1/2)
• Step 1: Randomly generate SDs based on a domain model restricted
to the features observed in RD
• For example, the restricted domain model includes only sunny weather
since the test dataset has only sunny images
• This enables us to steer the simulator to resemble the characteristics
of the images in the test dataset, to the extent possible
19. RQ0: Two-Step Heuristic (2/2)
• Step 2: For each SD, identify a comparable subsequence of RD considering steering angles (see the search sketch below)
• We obtain comparable dataset pairs with small-enough steering angle
differences
[Diagram: the simulator-generated steering angles are searched against the human-generated steering angles of the real-world dataset; the subsequence with minimal difference (less than a small threshold) becomes the comparable subsequence]
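A minimal sketch of this search over 1-D arrays of normalized steering angles; the threshold value is illustrative:

```python
import numpy as np

def find_comparable_subsequence(sim_angles, real_angles, threshold=0.1):
    """Slide a window of len(sim_angles) over the real-world angle sequence
    and return the start index of the subsequence whose mean absolute
    steering-angle difference from the simulated angles is smallest."""
    sim = np.asarray(sim_angles)
    real = np.asarray(real_angles)
    best_start, best_diff = None, float("inf")
    for start in range(len(real) - len(sim) + 1):
        diff = np.mean(np.abs(sim - real[start:start + len(sim)]))
        if diff < best_diff:
            best_start, best_diff = start, diff
    # The pair is comparable only if the minimal difference is small enough.
    return (best_start, best_diff) if best_diff < threshold else (None, best_diff)
```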
20. RQ0: Results (1/2)
• We identified 92 simulator-generated datasets that could match
subsequences of the Udacity real-life test dataset
• One of the comparable pairs is shown as follows:
[Plots: actual steering angle (deg/25) vs. image ID for one comparable pair, overlaying the real (human) and simulated steering angles, alongside sample real-world and simulator-generated images]
21. RQ0: Results (2/2)
• Distributions of MAE differences, i.e., |MAE(r) − MAE(s)|, where r and s are comparable real-world and simulator-generated datasets
[Box plots: distributions of the MAE differences for Autumn and Chauffeur; a difference of 0.1 corresponds to 2.5° on average]
• For Autumn, 96.7% of the comparable pairs
have an MAE difference below 0.1
• For Chauffeur, 68.5% of the comparable
pairs have an MAE difference below 0.1
• Even when the MAE difference is larger than 0.1, MAE(s) is always greater than MAE(r)
22. RQ0: Implications
• The prediction error differences between simulator-generated
datasets and real-life datasets are less than 0.1, on average, for both
Autumn and Chauffeur
• We can use simulator-generated datasets as a reliable alternative to
real-world datasets for testing these DNNs
23. RQ1: Setup (1/2)
• We randomly generate 50 scenarios and compare the offline and
online testing results for each of the simulator-generated datasets
• For offline testing, we use the MAE metric (i.e., prediction error)
• For online testing, we use the Maximum Distance from Center of Lane
(MDCL) metric to measure the lane departure degree (i.e., safety
violation)
• However, we cannot directly compare MAE and MDCL values, as the two metrics measure different quantities on different scales
24. RQ1: Setup (2/2)
• To determine whether the offline and online testing results are
consistent or not, we set threshold values for MAE and MDCL
• We interpret the offline testing result as acceptable if MAE < 0.1
(meaning the average prediction error < 2.5°)
• We interpret the online testing result as acceptable if MDCL < 1
(meaning the maximum departure < one meter)
• If both offline and online testing results are consistently
(un)acceptable, we say offline and online testing are in agreement
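A minimal sketch of the agreement check with these thresholds (the variable names are illustrative):

```python
def in_agreement(mae, mdcl):
    """Offline testing is acceptable iff MAE < 0.1 (average prediction error
    below 2.5 degrees); online testing is acceptable iff MDCL < 1 (maximum
    lane departure below one meter). The two agree iff both are (un)acceptable."""
    offline_acceptable = mae < 0.1
    online_acceptable = mdcl < 1.0
    return offline_acceptable == online_acceptable
```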
26. RQ1: Results (2/2)
• One of the scenarios on which offline and online testing disagreed
[Plots: the offline testing result (prediction error in degrees vs. image ID) alongside the online testing result for this scenario]
27. RQ1: Implications
• Offline and online testing results differ in many cases
• Offline testing is more optimistic than online testing because the accumulation of errors over time is not observed in offline testing
• Online testing is preferable to offline testing for ADS-DNNs
28. Conclusion
• We showed that simulator-generated datasets yield DNN prediction errors similar to those obtained with real-world datasets
• We also found that many safety violations identified by online testing
were not detected by offline testing
• As part of future work, we plan to investigate how to improve the
performance of DNN-based ADS using offline and online testing
results